FOSDEM 2024 Transcribed / Subtitled by Whisper

Welcome to FOSDEM 2024
Welcome to FOSSTEM 2024. I'll start with a reminder to be kinder. We lost quite a few volunteers over the pandemic. We lost muscle memory. We lost stuff. We lost everything. Last year was pretty rough in the background. From what I've heard, it was not super noticeable for the attendees, which was nice. But still there were a few words. We hope that we are basically almost back to pre-pandemic levels of efficiency and effectiveness. We'll see. But if something breaks, ideally tell Infodesk about it and just move on. Can't do anything else anyway. Speaking of losing a lot of hands, please volunteer. As of right now, we still have plenty of opportunities for volunteering open. You'll find a bunch of QR codes so you can just take photos of the thing if you want to and you'll have everything. We still need more hands to help. As per usual, if you help with cleanup, A, any of the network cable, which is not that much anymore, because we have been pretty good at using UB infrastructure these days, any network cable which you take out under all supervision you can keep. I'm going to repeat this in the closing talk. Also, there is food for everyone who stays until the end. Historically, sometimes this went until midnight. But the last few years, we were always done-ish at 10, and more hands, more better. So if you are around and if you have time, please help. If you need help yourself, well, yellow shirts or jackets mean Steph. Blue shirts mean deaf room managers. Those are the people who basically run the conference in a conference in the various rooms. Green shirts are video team. There are a few back there. You also see orange shirts, which are general purpose volunteers for all the things. Speaking of not everything being super smoothly just now, the supplier didn't send all the t-shirts which we ordered, and we noticed too late. So a long story short, we don't have enough of the shirts. As you can see, I'm wearing a black one, not a yellow one. So not every volunteer might have a shirt. If in doubt, given the benefit of the doubt, they are probably a volunteer if they claim so. This is ULB. This is not according to True North. You see True North down below, but also we have the navigation. If you need to go around, this is way, way better than just five years ago when we didn't have this, and everyone got lost all the time. So those QR codes all now go into this precise system if you need anything. From a t-shirt or a hoodie, if you want to have a printed map, if you want to give us a donation, if you just want to say hi and thanks, whatever Infodesk is your main landing spot, as per usual, extra large, extra small might be gone sooner than later, so going there sooner than later might be good. We are also on Matrix, and you can also email us at all times, and all of those are monitored, not 24-7, but almost. So this is the difference between last year and this year, and there is more than a person a week of work between those two photos, and more than 10K in investment. So as a reminder for those who don't know, we have on the FOSTA network, it is IPv6 only on the Wi-Fi, and it is net6.4 and DNS6.4, so all the translation to IPv4 happens for you automatically, except for, I believe it was ripe, we are the first conference to actually do this, and we found a bunch of bugs. Learning in the public, full transparency, we just found a bug in internal Grafana tooling with IPv6. Anyway, so primarily use this network please, and if you maintain any software and something breaks, this is great, of course now you can fix it. At the same time, if you want the more legacy taste of internet in your devices, you can use the dual stack one, and you will get full IPv4. But we have been doing this for years, it is pretty stable, just try it, and you will just see, hey, this IPv6 stuff actually works. We also have a lightning talk, or longer than a lightning talk in the infrastructure view, as per usual, the QR code goes there, and as a reminder, because we have funny people who think they are funny when they try and attack stuff, A, we are going to find you and shut you down, but B, please encrypt all your traffic, because some people are just maybe not as nice as they should be. Social interactions, well, hallway track, that is the main thing while we are here at FOSTA, because we can't get into the rooms anyway. There is also lots of new joiners, because as you would have loved. Anyway, we have Matrix as the primary thing, we also have IRC with reasonable effort, but there is no bridge this year, if you don't know what that means, doesn't matter, just go to Matrix, long story short, yeah. Also, go to Mustadon and look at stuff. If you need to register for, well, Brain's Matrix on, chat FOSTA or you click on the upper left, and then on the create account and you are done. If you want to toot about those things, which you see, ideally with consent of the people who are in the picture, use the hashtag FOSTA, there is our main account, we also have other social media, but we try to ignore this, so that's the main one. You will find this with the QR code. Anyone who doesn't want to be in the picture now is a good time to hide your face, and everyone else just raise your hands for a second and say hi. Those are going to be posted on Mustadon in a few. So, schedule. You probably know this, but this is the schedule, it builds dynamically, but we are pretty much settled, but if any deaf room manager needs to still change something, someone fell sick, whatever, it will update within minutes, so this is the absolute source of truth for everything regarding schedule. And I still don't think anyone told us about an Apple-compatible thing for the... There is one? Okay. Can you send us to info at FOSTA, so next time I can put this on the slides? Thank you. Just say Ritchie H or 2Ritchie H and it will find its way. Thank you. So, the stats, we have even more events crammed into these old buildings. We have even more speakers. We do have more tracks and deaf rooms, but I forgot the old number, so you don't see the diff. We managed to cram even more stands into the building and maybe we will be able to do even more next year. We don't yet know, because we are always out of space, as you know, and fewer lighting talks. These are all the deaf rooms. You won't be reading all of those right now, but just to give you a rough idea of scale, this is only page one, this is page two. And for those who don't know, yes, this is the largest open source conference on Earth. We have a bunch of stands. We rejuggled them quite a bit. So, if you are walking to that place there for the last three or five years, you always saw that one thing. Look online. Some of them didn't make it, but most we just rejuggled them and tried to organize them with a new system. So, there might just be somewhere else. Yeah. We don't have a hacker room this year. We needed this for basically for staff and for boxes of stuff. So, please go to the cafeteria in the main, or next to the food area in the main campus. Please don't go into the old room. Even if you know where the old room is, please just go. Don't go in there, because there's going to be people in there who are working and trying to make FOSTA happen. So, yeah, please don't. Speaking of being nice about things, when, not if, you need to queue here, because you will need to queue. Please form an orderly queue. Don't skip, blah, blah, blah, the usual, but also don't block any of the pathways and don't block any of the fire exits. We are literally filling this campus to capacity and maybe a little bit to breaking capacity. So, please try and keep the pathways free. If staff and volunteers come with requests or instructions or, hey, this is a fire hazard or whatever, please don't discuss. Just do it, because they probably have a good reason why they do it, why they tell you. If staff or volunteers are running, make way. We are not running often, but when we are running, we mean it, so please just make way. In the food queue, please consider letting people skip ahead if they have one of the color-coded t-shirts, but also if you have one of those, don't abuse it and don't use it when you actually would have time. Also happens sometimes. It goes both ways. The one exception, we are doing food runs for the actual teams which are stuck in rooms all day, like myself, and those, we will just skip the queue and go to the front. If you see this sign, you saw a green one on the sides or on the doors as you came in, which was nice. All of them on the back have this one. If there is a sign that the room is full, that means the room is full. It does not mean that because you have a really good reason and really wanted to see that one talk and your friend is speaking, you can sneak in. It means the room is full. This sucks, but it means the room is full. Maybe it's going to work this time, but please, that sign means do not enter. Also if you try to open the door, sometimes they are a little bit broken, but if you try to open the door and someone pulls back, that means the volunteer has a reason why the door is currently closed. Usually of course it's super loud inside. Maybe also I should have put this there. When you leave and enter a building, please don't talk and try to not bump the chairs and everything. It's incredibly loud up front and also quite a few defferums try to optimize for Q&A, but this means the audience needs to be quiet while entering and leaving. I know it's unusual, but those buildings are super, super, super loud, in particular for anyone up front. This is literally built to echo everything from me up to you, but also everything from you down to me. It's super loud and this is true for quite a few buildings. So please be quiet when entering and leaving. I need to put this on the slides for next year. If you have any feedback, send it there. We don't get enough feedback. If you say nice things, that's also appreciated. That happens almost never. If you want to say crazy things, maybe don't, but anything you want to send us, send us, we like reading your stuff and we read everything. If you need first aid, we have one first aid station in K Building on the second level and for most of the entryways to the second level, it would be on the right end of the thing or if you're the one end and then it's at the other end, if you don't see it. There's also the net link here. I'm just going to wait a little bit longer for people to take a picture of this slide, because if you need it, you will need it, so now is a good time to take a picture of this. We also have both professional security and the rec course walking around campus randomly, like they have their own schedule and everything. They are the professionals, we are not. If you need anything, go there. A bit about health and safety. You hear the coughing all around of you. Some of you are sick right now, some with COVID. Some of you will fall sick while here and some of you will fall sick at home. If you are currently infected, ideally go home. At the same time, I recognize that this might be harsh if you're only coughing, but in this case, please just wear a mask. We have free masks at the infodesk. I also see a few people with masks here. You will also see me with a mask when I'm not talking about stuff. Also for the first-timers and for those who might not have been here so often, Fostem Flue, Chaos Grippe, Oktoberfest, those are mass spreader events. A statistically significant amount of people will come home sick. Done. Plain statistics in science. You can get sick if you want to, but maybe don't. This is the Mozilla death room. I saw this on Masteron, not on any other social network, which is nice. The left one is 20 years ago, 2004, I think they said, and they are almost in the same place by pure accident this year. The wall changed color, but beyond this it's the same. Why am I bringing this up? Well, next year we will have the 25. So, if you have old shirts, which you have from as old as possible, ideally, please do bring them. Even if they don't fit, still bring them. Someone might be able to wear them, or you can just pull them over your belly, or maybe you slim down so much so you can just get a second person into the shirt, depending which way it went. But please bring the old stuff. If you have an old hoodie, if you have an old sweater, if you have this one project shirt from like ages ago, whatever. Bring the old stuff. We are going to figure something out for next year. Maybe take a picture here or outside, or we'll see. Most likely Sunday next year, because then we can gorge on Saturday how many people come. Anyway, if you have anything old, which you think might be nice on a photo, bring it. We also have a code of conduct, and we followed a code of conduct both ways. Anything which happens, I need to put a QR code here, just realized. Yeah, this is the code of conduct. Please read it, but the short version is nice. Anything happens, conducted for StemOrc. If anything is really like currently ongoing, there is a cell phone, and the cell phone is going to wake people up. So anything really bad happens, call this number, and wake whoever, poor soul. Probably Michael is currently carrying this thing. Or anyone who has a yellow shirt, well, with stuff on it, or a jacket or something, just go to them. By proxy, if you can't find someone else in green, blue, or orange shirt, but primarily ideally go to the yellow shirt, or InfoDesk and K also works. To the thank yous, to all the volunteers, all the deaf room speakers, all the deaf room managers, all the staff, all everyone. Thank you. APPLAUSE Also, thank you to all the sponsors. APPLAUSE If you want to give thanks the other way, donations can be made at InfoDesk in HOK or online. So for those who don't know, of course, I don't know how many are in... Oh, for who is this the first foster? Wow, good! APPLAUSE So for those who don't know this, we limit the amount of sponsors, basically. The reason is we don't want to depend on anyone, any one company, or on any long-standing support or anything. We deliberately finance ourselves substantially through donations and t-shirt sales and things like these. Well, that's basically it. So yeah, this is why we call for donations, because we don't want to be depending on the corporate overlords, and like they pull funding or something. We want this to be a grass-rootsy thing, not only on the organizing side, but also on the financial side. And thank you! APPLAUSE To be clear, also thank you, even if you don't donate. Also, for those who don't know, it's really loud up here, so even if you just whisper, like two, five, twenty people... Thank you. Whispering is super loud. You cannot imagine how good or bad, depending on your vantage point, these rooms are. It's insane how loud is up front for the speakers if you even mumble. Person who's over there on the right. Yeah, so, have fun, and as always, be excellent to each other. APPLAUSE .
Where have the women of tech history gone?
Good morning everyone. Hope everyone is settling down. We can get started with our first talk. Our first talk is where have the women of tech history gone. Our speaker is Laura Dury. She has been a developer for six years and awarded at World Scales Belgium in Web Technologies category. She has been doing monthly YouTube live discussions on latest tech developments in tech industry. Additionally, she has also started a career of Fourier in France. The talk is mostly about where have women of tech history gone. Addaa Lovelace, Hedji Lamar, the Enoch Girls, Grace Hopper, John Clark. Stemming from the role of calculator, the profession of developer has initially considered a woman's job while hardware design was seen as a man's job. However, who are these women who have shaped the world in tech? Why don't we hear more about them with Laura? You'll attempt to see the record straight bit by bit and provide role models in the tech you've always needed. Thank you. Hi, can you hear me right? Hi everyone, thank you so much for coming today. I just wanted to say that at first I try to do my talk on Sunday because this is way too much for me to handle. Please be kind to me, thank you so much. We're going to talk today about the women in the tech history. First of all, I wanted to talk to you about a little anecdote that happened to me when I was in college. During my first year I had a North History class and I was kind of sad to see there were maximum two women represented. I decided to send an email to my teacher and to ask him why he presented so few women. He answered kindly, honestly, that he didn't have enough time to add more artists to his syllabus. Because of that, some students may not have the required basis for their future career. We think about students in illustration, in painting, art, etc. At first I didn't really pay attention to it, I didn't really see the huge problem behind this. Then when I started to realize that this is kind of weird, that this is not normal, this is not fair, I had two questions in my mind. The first one is why are women not considered as the required basis? Why do they have less than men? The second one is who is the person or the group of persons who decide what someone deserves more than another to be in a syllabus? Spoiler alert, I don't have the answer to this question, I have ideas, I have theories. This is not the aim of this talk, but I hope that this question you can yourself think about it and maybe try to think about it. What I can do is to pay tribute and give a place to women who did a fantastic word to revolutionize the computer science field. This is something I have had in my mind for many years, in fact it's only natural that I'm here today in front of you to speak about that. The problem is present in the majority of fields, but today we're going to concentrate and talk only about computer science, the reason why we are all here today. Of course we're going to do that. Personally, if you go home and you remember two names of women you learned about today, it's a huge win for me. What about you? Do you know some names of women in the tech history? Adelovelle. Kathleen Booth. Margaret Hamilton. Belinda Pearson. Oh sorry I can't hear. Belinda Pearson. Belinda Pearson, yeah that's true. I don't think you know anything about that. Okay I have a lot of names, that's really nice. Okay thank you so much for that. So let's go discover together the stories through the computer science history. And for that we need to go back in time and we're going to begin at the age of enlightenment. So the ancestors of computing machines were human computers and especially in the astronomy fields. So basically computer was a job. And it was about mathematical calculations and very often the job was divided, the computers were divided into groups to compute long and difficult calculations. And the job was done in a way that the calculations were executed at the same time in parallel. And I wanted to talk to you about that because this is really funny, but because still today this is something that we are looking at in our computer. How many operations my computer can execute at the same time. And it was already something that people created, a way of working that was created a long time ago already. So like every profession was dominated by the men. However the first woman to be quoted in articles about the computer science history is Nicole Ren LePote, co-corrigor pour les Français. So she is one of the most famous astronomers of the age of enlightenment. And she is famous because with two other men she calculated the return date of the Hallease comet for April 13 in 1756, 59. Almost exactly as it returned on March 13 in the same year. So I don't know if you understand, we are in the 18th century and they calculated by hands the return date of the comet with only one month of error. So it's really amazing. Maria Mitchell also made a splash for discovering the first telescopic comet, which means it's invisible for the eyes. It will be named after her and she will receive a gold medal for this achievement. So during the 19th century there were a few barriers and contradictions regarding women in the scientific fields. So despite the fact that they had access to degrees, they were forced to resign as soon as they get married. A kindly reminder that a woman that is not married at that time doesn't exist in the eyes of society. So yeah. The history of computer science starts in 1840 with a woman that you obviously know and if you don't know her you should ask yourself some serious questions. Who is that Pokemon? Well of course it's Ada Lovelace. So I think that everyone in this room know who is Ada Lovelace. But for me she is not only the first programmer and this is my thought and I wanted to speak to you with you about that. So for that I need to explain you something. So Charles Babbage is the person who built the difference engine and the analytical engine. However he was messy and he couldn't stand back from his machine he was building. So he had ideas but he didn't have a concept that embraced his machine. Hence the arrival of our sweet and dear Ada. She invented the concepts behind the analytical engine by providing the first algorithms. And I have something to say more and I forgot about it. So she invented the concept of the analytical engine by providing the first algorithm. And ladies and gentlemen computer science was born. So this is why I think for me that Ada Lovelace isn't just the first programmer but she is the mother of computer science by giving these first algorithms. And by the way you can find the first notions of loops and functions in these algorithms. So despite this extraordinary invention it was way too innovative for that time. I remind you to tell you again where we are in 1840. So it was way too innovative and the analytical engine was forgotten for lack of funding. Before being rediscovered in 1937 to inspire the Mach 1 the first general purpose electromechanical computer. But let's take it easy. Alright we are in the end of 19th century and Edward Charles Pickering is the founder of a group of women called the Harvard computers. These women listed over 10,000 stars and developed a system to describe them. But a woman, a particular woman stood out, Aynie Jump Cannon. So she pioneered, this is hard to remember this one, she pioneered a new spectral type classification system and she developed the Harvard classification scheme which is still in use today. No sorry, which is the basis of the system used today. Between 1911 and 1915 she classified over 5,000 stars a month at a rate of one star per 3 seconds. I don't know what you can do in 3 seconds. I mean I can chug a beer in 3 seconds but that's all I can do, right? Okay girl, you have my respect. And in the 19th century the growth of industries opened up opportunities for women to join the field of technology. One notable woman, Great Hermann, made significant contributions with her advanced work in mathematics and physics. She played a key role in her early philosophical work on the foundation of quantum mechanics. But in 1920s her doctoral thesis led the groundwork for computer algebra and it first established the existence of algorithms for many of the basic problems of abstract algebra. So we are going to see a little more of computing here. I promise it's coming. Between computer algebra, I don't know if you know this app or a definition over there if you want to look at that after. So between the 1940s and the 1970s women were widely hired as coders and there are numbers of reasons. The first one is that coding programming was an emerging field so you didn't need a diploma to be hired. As new hires only had to pass a straightforward test, logic test, sorry, to work in a computer science job. Another factor was that despite the fact that women had diplomas degrees in scientific field, they faced a lot of challenges like finding a job or even advancing in their career. So they decided to turn to opportunities in the IT field. The last one is the shortage of manpower during this time and the fact that women cost very little. Grace Hopper. So during the World War II Grace Hopper, a 36 years old mathematician, decided to serve her country. This is very American. I'm sorry for the Americans over there. She decided to let her job, her teaching position at Vassar College to enroll in the US Navy expecting to decode enemy messages and serve her country. Surprisingly, the US Navy sent her to Harvard where she became the third programmer of the Mark I. If you remember, earlier I mentioned the analytical engine and how it was rediscovered to inspire Howard Aiken to create the Mark I in 1937. Well, the Mark I is a versatile, punchcard, programmable calculator and it was Grace who has the honor or rather the heavy burden of taming this machine. She wrote her 521 page user manual from scratch with any help of nobody. Like they said, okay, this is the machine. Go yourself and yeah. See you next time. Okay. So this is really impressive to know that and with her work she was engaged in top secret calculation crucial to the war efforts. Involving past like determining rocket trajectories, generating range tables for new anti aircraft guns and calibrating minesweepers. Now look at your computer. Look how easy it is to code. Now imagine doing this with a big, big computer like doing this for day long, for day long and for night long also. This is not the right page. Yeah, this is. We continue in the history and we are in 1940 and this year marks a milestone in the history of computing. The first fully electronic computer, the ENIAC. It was developed to automate and speed up the work of calculators and computers who was first humans. Right. But even if it was faster, it still needed a human intervention called the operator. And this job was largely performed by women. So the operator is the person who will enter manually questions into the machine through switches and cables. So you have a little, I don't know, overview. Can you see it? Well, it's kind of dark. I'm sorry about that. Yeah, you have a lot of cables over there. And six astounding women, Kathleen, Marlene, Betty, Francis, Betty and Ruth were the first six ENIAC programmer and the first programmer by extension also. So they had to install and assembling this machine. You have to know that the operator was the programmer of today. And even if this is the case, even if this is the programmer of today, at that time it was, it didn't receive a lot of credits. And it was very often belittled because it was performed by women. And hardware was the main job. Yet the line between these two jobs wasn't really clear cuts because women, so operators, needed to have little or in depth hardware knowledge to do it. To control this machine, to program these machines. Because this is still hardware. We didn't have in graphic interface or things like that. You needed to touch the hardware to use the cables, the switches. So this is where we see there is a big difference between a job description and what these women really had to do. Hello. I have a little anecdote. So first of all ENIAC for those who didn't know, means Electronical, Numerical, Integrator and Computer. So all these six women had a mathematics degree in common. They were responsible for installing and assembling the ENIAC. And the most important thing, they were the ancestors of the debugger. So look again to this machine and imagine you have a bug but you don't know where it is. So there were six. There were a group, so they had to work together to try to understand where a bug come from. And why is it a bug? So they created a system to work together as a debugger when there is a bug. And this is quite impressive. I don't know if there is people in this room already saw a machine like that or not. Yeah, okay. That's so nice. I'm jealous. So now we are in 1942 and a significant innovation emerged unintentionally driven by Hedila Marm, a renowned movie star. So to understand what happened, we need to rewind a little bit and delve into her background. So Hedila Marm is really famous for her role in the first non-pornographic film featuring a nude orgasm scene, which is really like, people were, oh my God, oh my God, this is so, yeah. And she also recognized, she recognized as the face of Disney's animated film Snow White. I don't know if it made sense. So, yeah. But she was facing a troubled marriage and Lamar decided to fled from Austria. But she had a really interesting alter ego. Like she was super duper into war and technologies, advancements. Well, it was influenced by her former husband who was a prominent Austrian art manufacturer. And during that time, she crossed path with a pianist named George Entail. And together they created, they invented top secret communication for radio-controlled torpedoes called, if I remember, Frequency Hoping Spread Spectrum. Is it right? Yes, it is right. Okay, Arda FHSS. Thank you, thank you. Okay, let me correct this. And so they patented this idea in 1942 and what is surprising, singly, what is really awesome is that to see that this technology is still in use today. And for all those who are on social network right now on the web, you can think, Adi Lamar, because of her that we have Wi-Fi and Bluetooth today. And a little thing that I have to say is that when it comes to unusual career changes, I think that we are reaching new heights. At the same time, a new way of thinking could emerge in the 50s. So the programming was involving way faster than hardware, which is still the case today. And so they begin to think because they had to begin to optimize their algorithm. And this lead to an image of the singular creative genius who wielding a form of black magic. And with that, the first stereotypical of the programmer emerged. So the white, hairy, antisocial men. And even if this is more to the realm of fantasy, studies in the 60s showed that it was a profile sewed after and it was more easily hired by companies. So you thought you were done with Grace Hopper? Now she's back. And you have to know that after the war, she worked on the Univac. So it was the more powerful computer at that time. And when she was put in charge of the automatic department, sorry, when she was put in charge of the automatic programming department, she had the idea of the compiler. So this person, this one there, she saved our life because now our computer can understand languages that we can read. We don't have zeroes, one or very low, low, low level languages. So thank you, thank you Grace Hopper for that. And as the idea was revolutionary, she started to observe that every manufacturer, every brand of computer was started to develop their own compilers. So in 1969, sorry, this is not the right date. In 1959, almost, in 1959, she faced a potential chaos that it could be. She decided to call on her old Navy connections to organize a meeting with every manufacturer of the country. And when they came out of the meeting, they all agreed on a simple universal language. The common oriented business language or COBOL was invented, which is still in use today in banks. Who do COBOL? Who can code in COBOL? Here, some people not a lot, okay. Are you happy with that? Okay, that's nice, the love man is in there, thank you. So I have two little anecdotes about Grace Hopper. I mean like this woman, like who didn't know Grace Hopper before coming today? You're gonna love her, okay? I mean, I already, we can love her, but the first anecdote I have about her is that she was also the person who think about the software portability. So before, we had to rewrite every program on every computer. And she then had the idea of why? Why we couldn't compile the code to just put a software in between computers without having to rewrite them? Thank you Grace, thank you so much, oh my god. And the second thing is like a little bit funny is that she is the one who decided to call the process of writing instruction, coding, coding, coding, coding, coding. And it's funny to know that this term replaced by programming because you know this is a woman, so no coding, we're gonna say programming. Today is coming back to our vocabulary and today is way more cool, cooler to say coding than programming. Okay, now look at this graph. So this is the percentage of women majors by field. So we have the medical school, low school, physical sciences and computer science. And what we can see, I was going to speak in French, what we can see is that there is a kind of rupture between women and computer science between 1980 and 1995. So this is a big question and I think that if you are interested in, by women in computer science, I think that you already heard about that, about this thing and what happened and why. This is not the aim of the speak but I think it's still important to speak about that. And this is, so there are a lot of reasons, there are a lot of theories about that. And I really invite you to discuss about that with people, older people, younger people and to see what can be done to try to make this curve up again, really higher. But today one of the reasons I saw when I did my research is the arrival of the personal computer in 1981. Woohoo, PC. Before the PC, the thing is that university students had little to no exposure to computer because they were rare, expensive and oh my god, it was like the size of a house. So they were relatively on equal foot. However, with the introduction of the PC, a new stereotype emerged and I love this one. This is a joke. The perception arose that to be a proficient programmer, you have to spend countless hours obsessively on a computer, which is still the case today. So leading to the notion of the real programmer who sported a computer screen tan from constant screen time. This is my case. I don't know if I'm good, but this is my case though, sadly. Funny thing is many men in the business didn't even fit the stereotypes and it could be a little bit controversial. And however for the women it was different. You couldn't have this kind of stereotype on women because either they were not tough enough or they were too tough and then annoying. So many women begin to doubt about their ability to code and dropped out school. And the last thing I have to say about that is the fact that when households acquired a PC, a personal computer, it was mostly put in the boys room with the father taking a coach role and trying to push his son to explore programming. Does people here live that? Or not? Yeah? Okay. Okay. And this is one of the multiple reasons why there is the wear, sorry, a gap gender who began. It's not the only one. I'm not saying that because people after my conference were, no, this is not the only reason. No, I know I didn't say that. I'm sorry. And so before I said, like they were relatively on the equal foot and with that they weren't because the women, the girls, wasn't pushed to, not a majority, so there are exceptions, all right? A majority of girls wasn't pushed to try the computer or programming. And so at the end, before university, the boys were more experienced than women. So today we hear every day, we hear every day about Chagipiti and AI. That's so cool. I'm sick of it. Thank you. Thank you. That's cute. And during my research, I discovered several women who have advanced the field of artificial intelligence, including Alice Chocock and Karen Spark Jones. And today we're going to speak about Karen Spark Jones because I had to do a choice. Scientists and researchers in computer science, Karen Spark Jones' work focuses on natural language processing or NLP and information retrieval. So this is a good anecdote to say when you are with your friends in a party with your friends from programming and everything. She developed the, you know, to seem intelligent, smart. She developed the TF IDF. I don't know if people know that. Perfect. Yes, some of you. Okay, nice. Okay. So this is the term frequency, inverse document frequency. And if you may let me read this because this is impossible to read by heart because this is not my field. This is a weighted relevance measure that is still used today by most search engines. And it's an important tool for SEO. So if you are doing web, if you are web developers, it's kind of important to know it. And this is this woman who developed it. This method combined the physical presence of a word in a text with the weight of its importance in general. It does make it possible to define the relevance of a specific keyword in a text. So finally, this is kind of charge PD due to understand what you're saying when you are writing a prompt in big. I don't know. I'm bigger. Oh my God. What did I say? And then after she decided to work with Margaret mastermind and they wanted to do, to have a little challenge to challenge themselves. So she decided to program a computer to understand words with multiple meaning. And the result of that was a dictionary of synonyms. Karen published an article in 1964 that is considered as a fundamental document and the foundation in the field of natural language processing. I think that if you are interesting in that, if you are working, if you are coding in this field or just interested, I think this is really, could be really nice to read more about her and to let people know about her work. So her ideas was little appreciated at that time, but they are implemented today and continue to inspire. Okay, I'm going to say something now. Please don't stay here. Okay. People go out because I'm saying that is going to be a little bit. She also mentored a generation of researchers, both men and women, and she coined the slogan, computer is too important to be left to men. Thank you. Thank you to her. Nobody is living? Perfect. Okay. I also discovered something really interesting is that there are no sexism in hacking. Why? Because the philosophy of the hacker is that only the work of the hacker is judged in the hacker itself and not the hacker itself. So it means that we don't care about where you come from, your age, your gender, what you look like, or anything, or your orientation. It's hard to say this one. This is, you are only judged by your work. However, I had the luck to type, to do a research on Google in French and trying to search the top 10 female hackers of the world. So, yeah. The funny thing is that if they are French speakers here, it's written, Le dit plus belle accuse du monde qui te font chaud, which is a literal translation from another language. So it makes sense, but a half is not really making sense. So this is the 10 hottest female hacker in the world. So I watched the article and they were quite impressive for their work. Well, this is true they were impressive for their work, but sad to be to finish inside that. And what I wanted to say is that, yeah. So we see a will of progress, of progressments in, is it English? No, we see a will of doing better about all these ethics things. But, however, we see that in the society, the female hacker still is a fantasy, like this, or we have a lot of stereotypes of women female hackers. So the woman I would like to highlight here is Joanna Ruckowski, sorry for my pronunciation, a Polish computer scientist and security expert. She's best known for her research on low level security and still malware. This is a conclusion. So I can go hours and hours about women. To be honest with you, my first version of this conference, I think I had like 20 women. And they said to me, come down, okay, okay, okay. So today, many actions and associations are being set up to give a place and a voice to women in IT. And this conference is one of them. The reason I'm glad, no, this is not, I'm glad, but no, okay. I have some questions, like have you ever had a role model in your life? And this, sorry, I don't remember. And did this role model help you to dream and give you the motivation to project yourself and believe in your dreams? Yes, no, okay. Yes, okay. Did it allow you to say to yourself, I can do it? Well, role model, like I would like now to speak a little bit about my own experience. Sorry, this is my conference, okay, so you're here to hear me now. I would like to talk a little bit about my experience of discovering my own role model. So this is really weird to say like that. The role model have a lot of consequences and all of them are positive. Not only they can make us think that we can have that kind of dream, dream of reaching great heights, just as they do, but above all, we allow ourselves to think that we have the right to do so. It may sound weird and simplistic, you know. Often when I suggest to my friends, female friends, you know, because I'm passionate of what I'm doing and I don't have a lot of co-dra friends. I don't know, I have Twitch, okay, it's good. So I'm like, oh, do you want to learn a little bit? You know, HTML, CSS, it's really funny. You don't have to, you know, to do a trigger warning is going to be a lot of flash colors, okay, trigger warning. You know, the little rotations and colors, CSS animation, this is so funny. I love to do that. And this is really funny. Okay, it's going away, trigger warning is done. And they always said to me like, oh, no, no, I don't want to because I'm not good at math. So even if computer science have a basis of mathematics, depending on the field, it doesn't like require a lot of mathematics, depending on the field. And I love this sentence and you'd be surprised by how many of my buddies who were not brilliant at math at all have gone on to study computer science or engineering without ever asking themselves whether they're good or not at math. I love that, I love that. And this is kind of a sad situation, all right. So now we all agree and I think we all agree in the room here today that the fact that women and men are smart to do mathematics. I don't know. What did I write? Okay, stereotypes linked to women in mathematics no longer put people in agreement. And I think that we are all agreed today to say that. But the fact is that they persist unconsciously in society. A woman will often feel inferior to her male peers in math because of conditionings and stereotypes that persist. I know that this is not the case of everyone. So I had this case, I felt that until maybe I was 15 and then after I met people who let me learn math and say, okay, no, I'm good at math and I love it. So personally, when I discovered my role model, it was maybe two years ago and her name is Aureligeant. I don't know if you know her here in the room. Okay, so yeah, she's from France and she, I never know how to describe what she's doing, all right. She's a numerical physicist. I don't know how to explain. She's doing AI. She's a physicist. She's doing a lot of things and she's really impressive. She wrote a lot of books. She's like trying to help people to understand the AI. And I just fell in love with what she's done, her background, her career. When I read her book, I don't have the translation. If you want to read the book, you should really read her book, her first book. Where is the mic over there? Okay, and if you want to know a little bit more, like for the book, don't hesitate to come after and to ask me. I can show you the book. So like that you can see if you want to buy it or not. And discovering this woman let me think that, okay, no, even if I was already a programmer, you know, I was already working. I was already having, did my studies and everything. But it made me think that I can do more because I wanted to do more, but I was afraid. I was like, what do I have to say? What can I say? I'm like, I mean, I'm a woman. I'm afraid. It's sad, but I think that this is what I thought unconsciously before. And meeting this woman, like being in the highlights, being in front of people, writing books and being known, and give me the courage, give me the, it opened the door for me to go in to say, okay, I can do it too, and I have the right to do it. So the aim of this conference is to highlight women who have changed the course of IT history and who can inspire young girls today or women or all the people like. But I ask you to those who have patiently listened to these stories, when you get home to write down at least two names you discovered today and spread the word, the word, the word. To share the stories of these women with your daughters, with your students, with your friends, with your cousins, your niece, with the people in the street. I don't know, your bar mate, well, I don't know. And create them to show these women. These girls don't have to become, they don't have to become programmers, but you can open their horizon and show that being a girl, being a girl doesn't have to limit the choices and their dream. So please narrate and create and propagate. Thank you. It's literal translation of French, so if you have better translation, don't hesitate to tell me. So to finish my talk, my, why, I didn't, oh no, this is internet. Oh no, oh no internet. Go buddy. Okay, try again. So I know you have talk to see, I hope I'm gonna do it faster. Oh. Okay, we're gonna do it like that. So, nice to meet you, my name is Laura Durieux, a.k.a. Deaf Girl. So I'm a full stack web developer, WorldSkills Belgium Gold Medal in 2020 and 2021. I am a streamer on Twitch and we are doing code on Twitch, so don't hesitate to come and say hi. I'm also the show presenter of On est pas des Yankees on RTBS X-Pay, which is the national media of Belgium. So here you can take a picture and see, and come to see me on my social media. So the slide gonna be available for after. Thank you, if you have questions, don't hesitate. Thank you so much. Thank you, thank you.
Outreachy: 1000 interns
Hello, folks. Good morning, evening, afternoon, wherever you are. Welcome to the Outreachy Talk and Celebration of 1000 Interns. So before we start, I just want to see a show of hands has anyone participated as an Outreachy mentor, a coordinator, an intern before? Woohoo! Thank you for coming. And for folks who haven't heard about Outreachy before, Outreachy is an internship program that provides internships in free and open source and open science internships. And our internships are open to people who are subject to systemic bias, discrimination, and impacted by underrepresentation in the technology industry of their country. Outreachy is truly remote all around the world. We have mentors are remote, interns are remote, we have interns on all the different livable continents, not Antarctica yet, but maybe soon. And the interns are paid $7,000 total for the internship stipend. And that's a three-month internship. We run internships twice a year, May to August, and December to March. And as of our most current cohort, December 2023, we have had 1097 internships. And to celebrate that 1000 interns, we had a bunch of celebrations. Awesome. Okay, so we celebrated milestone in six countries. We had the celebration in Cameroon, in India, Nigeria, Kenya, and of course in USDSE. And this celebration is awesome because we had past interns. I mean, folks who have gone through this program, they were able to like organize, they led the celebration, and they made everybody to feel included across the celebration. Aside the six countries that we celebrated, we also had the celebration virtually. We had three sessions, and it was really awesome. I also want to talk a little bit about our accomplishments. Not only do we have 1000 interns, we have a 96% internship completion rate. And that's part of because our internships, we consider more of a fellowship. We want to make sure that the interns complete the internship. If they get sick, if they, you know, have family issues, we extend the internship. And so we want to make sure that this is more about them learning about free software and open science than trying to get a particular project done. And we not only have this great completion rate, we also retain people in free software as well. So 80% of past interns continue to contribute to free software, and 44% of those interns are employed to contribute to free software as part of their job. So we want to talk about a little bit about how did we get here? How did we get to 1000 outreach interns? As we talk about how did we get here, you're probably wondering who we are. So let us introduce ourselves. My name is Karen Sandler. I am a co-founder of Outreachy. I'm the executive director of Software Freedom Conservancy, which is the home of Outreachy. I'm from Brazil, came here from a trip of 11,000 kilometers. It took me a while to get here. I was a past intern when we came here, and I'm the current information and process architect of Outreachy. Awesome. And I'm Omotala Eunice Omotayo. I'm from the giants of Africa, Nigeria, and I'm the community manager at Outreachy. Hi, I'm Sage Sharp. I use Dave M pronouns and I have one of the Outreachy organizers from USA. So we're going to go back to Outreachy history. Oh, right. Before I can tell you, I'm going to just quickly introduce why I wanted to help co-found Outreachy. I have a heart condition. I literally have a big heart. I used to think it was very rare, but it's actually quite common. I'm at a high risk of suddenly dying, and so I have a pacemaker defibrillator implanted in my body. I can't see the source code in my own body, and I was shocked unnecessarily once while I was pregnant, actually more than once, while pregnant because my heart was doing what a normal pregnant woman's heart does, but my defibrillator thought I was in distress. The only way to stop it was to take drugs to slow my heart rate down. And this made me realize that our technology may not be made for us despite the best intentions, and what are we going to do when that happens? And so I became really passionate about software freedom and learning about, like, as I've lived with this heart condition and I've participated in the free and open source software communities, it's become very clear that our software can never be made for everyone, unless it's made by everyone, unless everybody has a chance to contribute. And so this is where I sort of entered the role as I found out about my heart condition and started speaking about it. I became the executive director of the GNOME Foundation, where I met a woman named Marina Zurahinskaya. So this is a picture of Marina, this is me, ages ago, presenting the award to Marina. So Marina was a GNOME shell developer, and she was very involved in the GNOME community. And when the GNOME board evaluated their applications to Google Summer of Code, they noticed that out of 181 applicants, none appeared to be women, and they realized that there was a problem. And so the GNOME board eventually brought Marina in and said, what should we do about this? And Marina wanted to start a program to help address this issue. And so she looked back, and in 2006, the GNOME board had decided to do a summer outreach program, which they did a few internships, and it was a one-off thing. It was successful, the interns finished their internships, but none of those interns continued with the GNOME project, and it was just kind of left behind. And so Marina decided to reinvigorate that program. She is not on stage now, you're probably wondering. She's not on stage because she died of breast cancer last year, which is really tragic, but she leaves this amazing legacy that she created of outreach, and I'm so excited to be able to tell her story to you. And so in the 2009 guatech, there were so few women attendees that the GNOME board and Marina decided that this was the moment that we were going to pick this up and we were going to create this internship program. Raise your hand if you were at that desktop summit in 2009. Nobody! That's great! I'm so excited to tell you about it. No, it was a really interesting experience, and so the GNOME board went back with Marina and we decided to launch a new internship program, and Marina very thoughtfully tried to say, what are all of the ways that women are not participating in free and open-source software? Why don't they get started? And she systematically tried to address those issues, connecting interns with mentors and helping them make their first contribution. And so in 2010, the first outreach round, so this is the beginning of what we considered to be outreachy, and for a while we did the first round, the second round, and then we started using the months and years, because saying that you were part of the 13th round or the 15th round didn't make a lot of sense. So we started with that. If we could just go back to that previous slide. So if you notice, this program at the time was for women, and so you see we have this logo of this karate lady sticking her foot out, kicking forward. I love this picture, but it's very much of how the program started, very, very gendered. It was open to anyone who identified as a woman, and the program had interns, and it was a really amazing cohort for the next one. So in 2010, we had eight interns, and then you can see all these pictures that were of the interns at the different guatechs in the coming years. And so a community was starting to be formed, and one of the things that Marina did was she created meetups so that people could meet each other before a conference so that you could walk in there and know that you would have the confidence of knowing you had met someone before you entered it. So as RIT progressed, the internships again continued to be all with GNOME, and I was executive director of the GNOME Foundation, and the internships were so successful. The interns that came through the program were core contributors to GNOME. We had the GNOME planet, and so the interns would be blogging on the planet, and we would see their avatars, and people would come to Guantanamo and they would become so connected, and we realized that this was a program that really needed to expand beyond the GNOME project. And so I started talking with my friend Bradley Kuhn, who was the executive director of Software Freedom Conservancy. Now he works with me at Software Freedom Conservancy still, and Marina connected to Jessica McHeller of the Twisted Project, and Twisted was a Software Freedom Conservancy member project, and so we decided to do experiment and see if we could expand the internship beyond GNOME, and so we did, and it was hugely successful, and so we went from there and offered it to connect it to a lot of other member projects. So now today we tend to have 35 to 40 different free software communities and open science communities participating in each cohort. Yeah, we used to have a slide where we put all of the communities on it, but it just became too difficult to read. Yeah, so as Karen mentioned, originally in 2010, our criteria for who could participate in the internships was anyone who identified as a woman, and then in 2013 we decided to expand that to make it more trans and queer inclusive, and we said the internships are open to women, both cis and trans, trans men, and gender queer people as well. I think in 2014 or around around that time, we also started expanding tech companies published a lot of their data about their employees, and so we realized that in the United States we were able to expand to people of color who were underrepresented in the US tech industry, and I launched this effort to kind of try to expand outreach to country by country. I was talking to lawyers in France and lawyers in Australia, and we were starting to like figure out a way to expand place by place, and it was a lot of work and very difficult, and you know free software is global, and outreach participants were always global, the mentors and the interns, and it really didn't make a lot of sense to try to do that. Yeah, so instead of country by country, the internship criteria we have now is anyone who faces underrepresentation, systemic bias, or discrimination in the tech industry of their country. Now, how do we determine that? We've come up with a series of essay questions that we ask applicants, which is, you know, tell us which country are you going to live in during the internship? How are you underrepresented in that country? How has your learning environment been? Did you see people, you know, the last slide, the last talk, talked about role models. Did you see few role models who looked like you, who represented your identity and background, and then we talked about, you know, what systemic bias or discrimination have you faced both in while you're building your skills and if you were to apply for a job in the industry of your country, and so these essays over time we found ways to evaluate them in a global scale while still being, having, allowing people to talk about their experiences at a local level. I love this because we don't decide what it means to be discriminated. We don't decide what counts as discrimination. We don't, like, have a list of anyone who is subject to systemic bias. We don't have classes of people. We let people tell us about their own experiences and because we don't presume to understand every single experience of systemic bias, discrimination, and underrepresentation. So then we get into sort of middle history. Well, can I do one more ancient history? Because it's so exciting here at Bosnium. I was on this very stage in 2014 when I announced that Outreachy was coming to, well, it was rebranding, Outreach Program for Women was rebranding to Outreachy because it was no longer just for women, and we also announced that it was coming to Software Freedom Conservancy. The project outgrew the Gnome Foundation. You know, there were still only a handful of Gnome interns and the rest of the internships were with the Linux kernel and Wikimedia and Mozilla and a ton of other communities. And so the Gnome board and Software Freedom Conservancy and the Outreachy team all got together and we moved the program over to Software Freedom Conservancy where it remains today. So I got involved as part of Outreachy and I think it was was it 2014 or 2015? One of the two. I think 2014. I think 2014. Yeah. As the Linux kernel coordinator. So I originally helped find mentors in the Linux kernel. I connected them to Outreachy, got them prepared to help applicants during the contribution period. And then in 2016, I stepped up to become part of the actual Outreachy organizer team and passed off the Linux kernel coordinator position to someone else. So in 2015, we have opened up our program and said, hey, let's write these essays about the discrimination and bias that we face. We started having issues with reviewing those because we started to get thousands and thousands of initial applications and also a lot more communities involved too. So in 2017, I sat down with my spouse, Jamie, and he helped me understand a little bit of Django and we built Outreachy, a Django based website where mentors could sign up, where applicants could sign up and it really fits the the customness and fit what our our program was. And so big shout out to Django and Python and that wonderful community. And I want to say, like, this is a reflection of, you know, I talked a little bit about Marina and how she founded the program. One of the things that is the most impressive part of her legacy is that she built up this program, but then Sage came on board and she worked with them and she was able to transfer that knowledge and create a program that was robust and that could could exist without her. And so we're here on stage with this project that Marina really started with her personal passion, but that she thought about how it would continue without her. And so Sage coming on was this absolutely essential and then bringing all of this maturing the program. Yes, and I would say my role has been how do we scale. This is how do we scale. And so the next part was we really need to it to just be more than me and Karen at this point. And so we brought on Anna. My story about Alt Ritchie starts in early 2017. I heard about Alt Ritchie from an Alt Ritchie intern working with Mozilla, she gave a talk, a lighting talk at a women technology conference in my city. And at the time I had the so crushing realization as these mechanical engineering major that as a partially sided person, I wouldn't be able to find a job in my state and when country, I had too many obstacles to face and to overcome. So I applied to the December 2017 cohort was accepted in my first try. And I had a really good experience in my internship. I had mentors who believed in me. And if you're seeing these, Beno and Johan, thank you. And the community was happy to have me as a member. It was a really transformative experience as one who faced ableism all my life. I had people who believed in me in my potential and didn't question whether I was capable or not of doing my job. And when you were switching careers through a program like this, you will experience something that's called a liminal moment. You are not the person who you were before it started and you are still not the person you were about to become. You are in between states. It's disorienting and scary. And you have to find yourself again at the end of the program. And that can be a really difficult task. Interestingly, when I joined Outreachy, Outreachy itself was facing a liminal moment. Things were changing. And we asked ourselves, what is Outreachy exactly? I remember when we created a Zulip server and we started connecting with interns by running bi-weekly intern chats about career in free and open source software conferences, et cetera. Interns were no longer experiencing their internship in isolation. And they were connecting to each other without depending on proprietary software or proprietary social media. That was when something clicked. What was once something more of a liminal online space where people would just go through with an adjacent community, it became a communal space. And with a communal space comes coexistence, the need of permanence and a sense of belonging. With a thriving community comes management's challenges that were beyond our capacity. At that time, we were just too few. And we published a call for a community manager. And I will say that before we posted the call for community manager, we tried to scale by improving our documentation. We said, okay, if we can't answer everyone's questions, if we can't answer all the applicants' questions, especially with so many, could we scale our documentation? And that worked for a while. But eventually, we said, no, we really need an actual person that can help us. Present day, yeah. So we can... So we're going back. What would you like to do? We can continue. All right. So present day, one of the things too is as we expanded, we really need to make sure that we could find additional funding. Right. So I want to... I do want to start by saying, Outreach was originally funded by corporate sponsorship, which was great. I definitely want to give a shout out to Google, which is the company that sponsored the first... Like all the first rounds and every round since then is the only company that has sponsored every single round of Outreach. Plus, they gave us a lot of help. The program is modeled in part after Google Summer of Code. And the Google staff has always been very supportive and helpful and has given us the information and assistance throughout. And I really also want to give a huge shout out to Red Hat because Marina worked at Red Hat and Red Hat contributed her time. It's safe to say that there would be no Outreach-y without Red Hat's contributions early and then continuing in those years after. But nonetheless, the program is not... We deeply appreciate our corporate sponsorships, but it is very tough on the program to have to continue to get corporate sponsorships and then to be responsive to the interests that a lot of companies have and want to put on internships that they're funding. And so in this period, we shifted a lot more to grant funding to supplement the corporate sponsorship. And that was really transformative to the program because we were able to plan a little bit more long term and Ford Foundation, ARDC and Chan Zuckerberg Initiative were the foundations that came in. I would like to say if any of you work at a company that want to sponsor Outreach-y, definitely get in touch. We really can use the support. We also have some individual funding support. And having that mix of funding is really important to be able to have the internships that we want to have. And honestly, being able to say no without having to think twice to a company that wants us to have an internship that's too tightly tied to one company, we're not going to do it. Having an internship that is not going to be a good experience for an intern, we're not going to do it. And having all that... Having this independent funding, we would have said no before, but it's even easy. It's very possible and easy to do it. And one of the interesting things that comes from grant funding is that we can decide, hey, there might be some initiatives that really need our support. And so one of the things that we did in 2020 was we started funding humanitarian free software. And so this is things like public lab that did citizen science and... Mobile lab. Mobile lab as well, which is a open science hacker space and biomedical research. All peer. Yes. All kinds of interesting things. And so these are projects that don't necessarily have enough funding on their own to support an intern. But because we were able to get grant funding, we could offer both funding for humanitarian open science... open source, and then eventually we moved to funding open science as well. So again, citizen science, scientific research, we had outreach projects that were actually looking at COVID at trying to estimate what was the hospital capacity with COVID. And so it was really a proud moment to be able to fund that kind of research. And then in 2022, we had our lovely community manager come in. Okay. So a little bit about where I was coming from. I have past experience working with marginalized population, supporting them, especially when it comes to their rights, when it comes to them receiving the rights supports that they need. And I also have past experience empowering people into tech through Sheikot Africa, coming up with programs, supporting them and standing in gap as an intermediary between them and also the organization. Then coming into R3C as a community manager, I now stand as an intermediary between the R3C applicants and the R3C community and the program itself. So I was veered the R3C social media platforms, supporting and also responding to R3C applicants, putting out contents that made the applicants, people who were interested in what R3C is doing, understand what R3C stands for. And also I was able to come up with coffee charts. So via R3C platforms as well, we were able to have real conversation, real life conversation, helping R3C applicants to understand the R3C program better and also bringing past interns, mentors, community coordinators to answer questions that the applicants have and also to share their experience through the R3C program. And I've also been able to create more awareness about the R3C program through attending and speaking at various conferences. This has really been awesome. Especially at different conferences, I was able to empower people, tell them about the R3C program and that has created a very good awareness about the program. And also this, I would say, has created a very good and resounding application. We have a big growth about R3C applicants from especially the African perspective, right? People coming not just to participate and also to give back to R3C. As you can see, we have zero interns from the African perspective at 2010. And as at the December 23 course, we have over 44 African interns. So which means, so this way, folks from the African perspective now understand better that there's a space for them in the open source ecosystem. They are coming into this program to contribute and to improve open source and open science projects and also to give back to the R3C and the open source ecosystem in general. I want to say that before we had a solid program, an amazing program, but you gave it a voice, you gave it faces, the recognition it deserves. Thank you. And I'm grateful for that. Thank you so much. I would also like to add that since I joined the R3C program, folks have been, especially the applicants, they now understand the different parts of open source that they can contribute to, especially the fact that it's not just about the code part. They don't have to come into open source to maybe be a programmer. They can come into it to contribute and to give back in various perspective documentation, even event planning, community management, and so on and so forth. And also that because to the new R3C organizer. Yeah, we talked about a sense of belonging that comes with finding a community. Another thing that comes up often is this desire to give back. You offer a great opportunity. You want others to have access to similar opportunities that has happened to me. This is why I joined the outreach team back in 2018. We found that many interns come back as mentors, some as new mentors, some as experienced mentors. Either way, challenging situations require extensive support. And we decided that we needed someone dedicated to supporting and advocating for our mentors. Yes. After Omotolo's outstanding year of supporting applicants and interns, we hired Hilda Udufu. She is someone who has extensive experience with the program. She was an intern. She was a mentor. She was a coordinator all of it for public lab. And I'm proud to say that in turn I've become her mentor when she joined team. She's been facilitating conversations with mentors in office hours, having interviews with them so we can highlight their work, working hard and facilitating relationships between mentors and interns. And I think all of it is an indication of a phase of maturity within the program. We are not only looking for always growing. We are looking for growing sustainably and keeping our community flourishing. And I also love to add that Sage and Karen has mentioned how Astrochi has grown from, I mean the background of Astrochi and the growth so far. And with this we can also point out how Astrochi has grown in the aspect of not just why should we have Astrochi, but now to better support the applicants I come in as a community manager, right? And also we have Tilda. So Astrochi is not just supporting applicants, we are also supporting mentors. So because we understand that the program is not just about interns coming to contribute to open source, the program is also about people staying back in open source and also working together to give back to open source ecosystem. Yes, this is about open source sustainability as a whole, like the ability of us continually to exist as a community, supporting contributions and making sure that software still exists and still maintained. And I would say that you know you can look at the numbers of the people who find jobs that are contributing to free and open source software and the number of people who continue to contribute, but no matter where our interns go after that they always take the values of software freedom with them and they're exposed to software freedom and they take those values and they there's a follow-on effect from these internships. And I would say our interns have won awards, they've joined boards of directors, they've been mentors, grand mentors, great grand mentors and we see graduates of Outreachy everywhere. All right, so then the question becomes what's next for Outreachy? What is the future Outreachy? And the future of Outreachy maybe it's you. Maybe you would like to mentor, maybe you would like to coordinate. If you'd like to know more about Outreachy you can come and ask questions, but also there's a bough in AW121 at 1300 or 1pm and if you'd like to come talk with us, figure out how to get involved, we would love to hear you, we would love to hear what you're doing in free software and come connect with us. If you're interested in being signing up as a free software community, the deadline to sign up as a community is February 15th, so please do check out our call for mentors and communities. This is a celebration, you know, we're celebrating the fact that we got to this point and we can only do it with you really. We are actually gated generally by the number of mentors that we can find, so we we shield for funding already, but realistically most of the time it's finding enough mentors to provide those internships and so you know really that's that's all of you who are here, you're you're you know enough to be here. Actually raise your hand if you're here and it's your first FOSDEM. Wow so it's like it's like a third of the room, that's great. So yeah you know I think one of the things that I'm most proud about Outreachy is that it's a real grassroots program, like it's something that we started by offering something really pragmatic, like just offer internships, have that work pair interns with mentors and have them learn and then we've just been growing it slowly. I remember when when we started back in the day and I was a new executive director there were a lot of diversity initiatives coming up at the time, it was like very fashionable to start diversity initiatives and there were like programs getting millions of dollars based on glossy work you know glossy brochures that they had made having not done anything in the past, but we found it Outreachy with the different mentality with the with the bottom up free and open source software mentality of we're going to do the work and then if people find it valuable then the resources, the time and the money will come after and so we can't you know it's Outreachy is our thousands of volunteers and I'm proud that it is itself a free software project. And also we want to like tell folks that are listening to us that you can support Outreachy in several ways. You can go back to your local communities to tell the story of Outreachy to become an advocate of Outreachy. Tell folks who can be part of Outreachy as an intent to apply to the Outreachy program. You can also contribute to Outreachy right through your various communities, your various projects by bringing your projects as I mean your community, you can be a mentor, you can come in as a community coordinator right and you can also support Outreachy by going back to create more awareness about the Outreachy program. So tell folks about Outreachy, you tell your communities about Outreachy, bring your projects to Outreachy and you can also partner with Outreachy in various ways. So you can reach out to us if you want and connect with us right. We have you can connect with us later today to ask us questions, discuss ways, several ways that you can contribute to Outreachy. Additionally, we may not have the capacity to work as a mentor, but you may have the capacity of reviewing pull requests, of reviewing contributions made by the applicants. Communities need it so much, they get so overwhelmed with our applicants and they will be great help if we could help them. Yeah, even if you have experience with any particular community that's involved with Outreachy going to help out and answer questions in the community chat, that is a great way to help those communities. Questions? No, we're going to the thank yous because there's a lot here. Yeah, so I don't know if we... I don't think we want to read all the thank yous. No, we're not going to. I want to highlight a few people though. I definitely want to... We've already talked about the organizers and the reviewers and all of our volunteers. I definitely need to... We always joke that Outreachy is like a python swallowing a goat. There is so much logistical work to be done to manage Outreachy. It is huge and so we want to thank the Conservancy Accounting and Logistical and Financial staff including Bradley and Roseanne. And also... They're amazing. And I also really want to thank Roseanna who is on the... It says Gnome Board. Oh, and the Gnome Board. Right, Roseanna who did that logistical work at Gnome and helped launch the program. We want to thank the Gnome Board because there were times when running a program like this is difficult. It's a lot. Yeah, and we've had our times where there's been misinformation and people attacking the work that we do beyond calling us names and threatening us. And it's been really stressful. And the Gnome Board spent a lot of time making sure that they were defending the program and supporting it. And then I also want to applaud them for realizing that it had outgrown the Gnome community and that it made sense to move it to another organization. The Outreachy leadership in the past, Cindy Polaris and also Tony Sebro who is now on SFC's board and was our general council and is still involved with Outreachy. Justin Colonino who has given us pro bono legal help actually from Outreachy's inception has been supporting us with legal work. Ropes and Gray who gives us pro bono legal work, Otter Tech and also Owen Taylor and Jamie Sharp. I did read most of them, I'm sorry. But they deserve it. All right, so. So we can take some questions if anyone has any questions. All the microphones so we have to share one here. It is so hard to hear in this room, so you have to speak really loud. Okay, first of all, a huge big thank you. It's hard to overstate the value of what you do. And because it is so valuable, my question is, so in the end you kind of dodged the topic a bit about the future. So my question would be, since it's so valuable, how can you transcend from an organization that depends like so many others on the efforts of some individuals for survival into something that is actually hard to stop, that has a life on its own that you couldn't shut down if you wanted to. Is that for me? Yeah, that's you. I mean, they're pointing at me to answer that question, which you know, I'm executive director, so I have to like be the visionary of this program, you know, like, and give that voice. But I do have an answer after you. But Sage will have the answer. No, I mean, I think that the whole point of it being a grassroots, like free and open source software project is that we grow sustainably, we grow slowly, and we grow carefully. We bring stability. We've been working for the last five years on redundancy, so making sure that we have a team that isn't going to completely burn out. It's so much work to do this program. I don't know how Marina did all of those logistics. She basically did them herself for a really long time, and she maintained all of these wiki pages where she wrote down the names of, she just stayed in touch with every outreach intern and like wrote down where they went to work because she ran into them in a, you know, in the hallway. So like what we've done is with Anna's help and Sage's help, and now what Matola's told us is to make that a lot more systematic. So we've got a robustness so that if any one of us is no longer part of the program, it has a life of its own. I think too, to bring in some of the values of free software, what we have done in Outreachy is we have talked to different communities and learned what are the best practices for being inclusive, for onboarding new members, for designing projects for interns, and we've documented that. So if you look at the Outreachy documentation for mentors and communities, there's a lot there of knowledge that we have learned that was siloed across different communities. And so even if Outreachy goes, I think we still have impact on those communities. Our documentation, our knowledge sharing, the lessons we've learned will move on. And so I think in the future, we'd like to be a little more vocal about why we design the program specific ways, how we be more inclusive and coach our communities on that. And I think too that the grant funding companies that want to fund the Outreachy General Fund rather than specific internships is going to be the way forward. So we keep pushing our sponsors towards that and hoping that they'll allow us to make sure that our team continues and also that we can decide which communities have the strongest interns and allocate funding that way. We have a dream of having an open mentorship alliance with other mentorship programs. We know we are not the only ones and there are many, many more that do things differently, but they are as important and as fundamental to the open source system. I would also say that like historically, we have improved something about Outreachy every single round. Like the idea is that the whole point of like free and open source software is that no one and nothing is perfect, right? And so we've been changing something every round. If you have feedback for us about things that could be made better, we would love to hear it because we're looking at it ourselves and so we expect to change and improve. I was also going to add to that, but it's also really nice to see all four of you on stage and also the diversity of the organizer group as well as really I think a special part of this, but my question is kind of actually building on what you Anna said about the mentoring side. So I'm definitely seeing a challenge in a lot of open source communities and projects around that mentoring side. General on how do you do mentoring and how do you scale mentoring in a community? So my question is like from your perspective and doing all of this kind of working so closely with projects and working with mentors, what are the greatest challenges that you see around mentoring and mentorship and open source? And do you have all the answers? No, do you have any ideas or tips about what you think the open source movement needs today to grow and scale mentoring? I can think of some like cultural differences, the way you talk to someone in Brazil may be different from the way someone talks to another in the United States or in Nigeria. So conciliating those differences when you are doing asynchronous communication, for example, you can create a lot of conflict with some. Another one is safeguarding. This is something that some mentors have told me, especially when we work with more marginalized communities. It's difficult to ensure, it can be challenging to ensure that everyone is in a safe environment. We had some folks that had some really challenging lives at home and needed safeguarding. And that can be a difficult situation, both for the mentor and the intern. So having the psychological support for both of them is important and also challenging. And I think as well for mentors, I think having a pathway to mentorship is important. I think a lot of people assume, hey, I have to be a maintainer for five years to be a mentor. And so finding ways to define a path towards mentorship that doesn't feel like you have to be an absolute expert. So one of the things we've been trying to do in our outreach chats is talk about what does mentoring mean? How do you get to be a mentor and emphasize you don't have to know everything? And so with Outreachy, what we've done is we've encouraged people to co-mentor. So you've got someone who is more experienced in the project and someone who has just been in intern, either Outreachy, Google Summer of Code, and they shadow the mentor. They're starting helping out. We're training mentors. And so figuring out how to create those pathways is going to help. It's difficult to aspire to be something if you don't even know that is a possibility to be that something. And also to add to what Sej just mentioned, the Outreachy organizers, especially the mentor support, to that we have been able to come up with different initiatives to support mentors from the contribution stage. We understand it's not easy from the contribution stage to getting feedback at every point during the internship to our mentors office accession. Because we want to understand the various challenges that the mentors face, we want to be able to support them. Sometimes we also want mentors to come together through our coffee chats with mentors. And even of his accession, we want mentors to be able to discuss with one another from different communities to state the challenges they are facing, to learn from another mentor from another community, how they've been able to address these challenges. That way, happy mentors learn from one another. I would say also you commented on our diversity as organizers, but one of the strengths of the program is recognizing that the burden of bringing diversity to free and open source software shouldn't rely on the people who are underrepresented. And so, and so, mentorship is a great way to be an ally, right? It's a great way to shift that burden. And so, like, you know, I think that's when one of the strengths of the program over time is that it's a great way to get folks that are not subject to systemic bias who are not underrepresented to help bring people up. I think, do we, do we have time for one more question? Or are we? We're done. We're out of time. Thank you so much for joining us. Thanks for supporting our reach. Have a great bottom. Thank you.
FOSDEM 2024 Highlights
All right, please take a seat. We get in towards the end of the conference, so this is a presentation on the highlights. Hope you enjoy. Hi, everyone. My name is Diego. This is very impressive. You all look very beautiful from down there. I'll be very quick, so these are the highlights. I am, oh, this vibrates. I am one of the managers of the Open Research Dev Room. We feature research tools, open tools and technologies in research context. It means academia, but it also means journalism, it can mean citizen science and also beyond that. We, I'm too far away. Yeah, forget it. I hate those. So this works. No, it's fine. It's fine. Yeah, look, it works. We had a great day yesterday. We had a whole day pack full of great talks, but obviously I'm not here to talk about that. You might have gotten that. So here I am. Some people might already know me as the annoying French guy with the camera running around Fosdemme asking people if I can take their picture, saying everybody looks so great. So yeah, here I am. My name is Diego and I'm here to show you pictures. These are portraits I've been taking for years now at the Fosdemme, and it's a project that's a bit pixelated. No, it's cool. It's a project that's called, it's not called, that's called, jeez, that was a spoiler. It's a project that's called Faces of Fosdemme. It's my long-term photography project. It's been going for a while now, 2015, I've started, and until today. It's hosted on a French co-op that's called Ouvaton, and this is the URL. About me, I am a research engineer in social sciences. I work at Sciences Po in Paris, and this is my email. Do reach out. About the project itself, well, it got really funny really quickly because it's actually fun trying to catch up with people and catch them year after year and put those people, those pictures next to each other. So here I want to thank a very special person. You might know him. It's Alasdair, so Alasdair, thank you for everything. Thank you. But actually there are really dozens of people that I want to thank with this project, probably many of you. I have a lot of these pictures. So here, I saw Holger this morning. He's not here today, so I can show his face in really big on the screen. He won't mind, but he'll be watching the stream later. So hi Holger. You really helped start it all at the beginning in 2015. About the project, it's still ongoing, and I'm actually open to collaboration just to keep the project going because I think it's important and it would be funny. Do reach out to me. This is my email. If you have a camera and IDs about what we should or shouldn't do, taking pictures here, I'd be open. Do visit the website, but do not DDOS the Ouvaton service. They're cool. I really need to keep that under five minutes, but I just wanted to say that I think we have a, the first name is Great Community of People. It's a beautiful community and a diverse one, and we really need to take care of that. I think taking pictures is taking care because it helps. It shows how cool we are, and that's actually important. I don't know what to say. I should really keep it simple. Just the website, next slide please. The website is a really simple shelf of yearly project. Do visit it. There are good years and bad years, years with IDs, and years with beautiful people. That's all the years. You're always all beautiful. That's basically it. I just wanted to say that I met a lot of cool people here, and I am really glad to be finally able to share those with you. So thank you very much, and see you next year, I guess. You should have another slide. I'm missing two slides. You missed the dedicated one. Just show it one. Yeah. We are going to have a lot of switchovers. This is all very last minute. Next we have Alberto P. Marti. Also, if all the speakers could come into this area, it would make it easier logistically. If you're a speaker for this segment, please just come here. I don't know what IPC or ICIS is, but I'm looking forward to finding out. That's the one. Okay. So hello, everyone. My name is Alberto Marti. I'm the VP of open source innovation at OpenMimola Systems. Thank you. Well, thank you all for being here, obviously, and many thanks for the organizers to keep organizing this, and also giving me this couple of minutes to explain to you this project. Louder. Louder. Louder. Much louder. Much louder. Okay, so very quickly, there was some mislines, but don't worry. Stay there. The podium. That's true. So, very quickly, two minutes. I'm going to present today, I'm going to present today what's going to become the largest open source project in the history of the European Union. And I know what you're thinking. This is bullshit. There might be something, but so the EPSI, say hello to the EPSI, as we call it, this is, means in EU jargon, means the an important project of common European interest on next generation cloud infrastructure and services. So imagine if you had 3,000 million euros to do open source in Europe, what would you do? Found foster. Found foster. Okay, we might do that. The thing is, we have actually got 3,000 million euros to build a European cloud and its computing platform in Europe, and it is going to be open source. So it's not my merit alone. I'm just one of the, a representative of one of the 19 European companies that are behind this, supported by 12 member states, 12 countries in Europe with EU funding, coming with help from the European Commission, to invest initially 1,200 million euros from next generation EU funding. We are starting actually in January, so last month, plus another almost 2,000 million euros from companies in a co-investment program to develop together. Finally, European industry developing open source. When they say industry, it's telcos, cloud providers, technology providers, end users, railway companies, a lot of different people. So look at the website, EPSI CIS, it's starting this month, it's going to be running for 3, 4, 5 years, and we're finally hopefully changing the mindset in Europe that we can actually develop the critical strategic technology that we need for cloud and edge, and avoid vendor locking and other things that you know perfectly well. So I leave it to you to check this out, stay in contact, and we'll see you around next year. Thank you. Thank you. Next up we have Boris Dolly talking about energy. I don't have more details. Yes, all yours. Hey. Okay, thank you. Hello everyone. Is there any energy at FOSDEM? Oh yes, I have to stay there. Is there any energy at FOSDEM? Thank you guys. So I'm very, very proud to be here with you guys and trying to make highlights on what was the energy dev room. Of course it was about open source, but it was focused on energy. Please come back. Thank you. The first occasion we had to be here with you at FOSDEM was last year, 2023. We had half a day online. It was very interesting, and we had half a day on site for eight presentation. It was a small room, but it was full, full, full, full. I managed the entrance. It was impossible to say, I'm sorry, you have to get out. And this year, thank you so much FOSDEM staff, because we had a lengthy day yesterday. There was a lot of people in it, 100 places, and some guys were standing up on the corridor. So thank you so much. It was a very unbelievable day. So, what was the topics? Of course it was energy, but when we speak about energy, we of course speak about our future, prosperity for people on the planet. We had a panel session just here. Some of you guys maybe were there, but there was a lot of questions, so a lot of interest, of course, for this convergence between energy and software through open source. So the topics were there, our future of course, but smart metering, smart changing, solar, winds, and grids. Not so much of you should be comfortable with the concept of grids, because it's a really industrial thing, but it becomes a public thing, because you are more and more producers as consumers, so we say prosumers. So we spoke of course about open source software, but also about open source hardware. And there was a lot of questions on it, and thank you so much for guys working on that. Open data of course, about knowledge, science, and research, and of course communities. And I think it's about this, and if I can send you a message, please join. We need everyone to work on this project for the future. So a few pictures. First, thank you so much for the staff. Thank you, and maybe you can applause them. Thank you. And now concerning energy, we have a real ecosystem. It exists, and feel free to join, even if it's only for making documentation, translation, or bug fix, or maybe deeper contribution to the core code. So join us, here you have a few stickers. We have a lot of projects in this LFMG community. An incredible turnout. I spoke about that before, and open hardware also. And last but not least, Future of Energy is really willing to see you next year. Thank you so much. Just, if I have the time, may I take a picture for my daughter, because I miss her, and she's asking me, why are you away, bad? So please. Thank you so much. See you next year, I hope. Thank you. Next we have Shirley Bales with an update of the community deaf room. I hope so, at least. Just a slide, I think, on that one. Oh, okay. I think that I just need a slide. Okay. Apparently we only have a slide from this deaf room, but I can do this. Where? There it is. There it is. Next we have Chris Simmons with the embedded deaf room. Can you hear me now? Yeah, that's better. Yeah, so I'm here to give you a quick flavor of what we were doing in the embedded deaf room all day yesterday. Skip that. So embedded software is transforming the world, has been for a long time, and FOS embedded software is leading the way. So everything is, that's better. Everything is, everything you touch that has electronics inside, it has open source software inside, and we, the people in this room and others are producing that software. The embedded deaf room has been part of FOSDEM pretty much since the beginning. Since 2003 was the first embedded deaf room. There's been one every year. Apart from one year where for administrative reasons, nobody quite got around to applying for a deaf room, so we didn't have one. These things happen. So this year we had the deaf room for the whole of Saturday. We had 18 really fantastic talks. I don't know if any of the speakers are in the room right now, but if you are, thank you all very much. It went really well. And we were lucky enough to have one of the bigger deaf rooms, so we had 446 places. So guess which talk filled the deaf room? Anybody want to advance on beer? Well the answer is of course is beer. So we had this really great talk from John, John Britton. Brewing free beer with ESP Home and Home Assistant. It's a very good talk by the way. Highly recommend you check it out and you learn not only about beer, but also about Home Assistant, Home Automation and embedded systems. One of the great things about working with embedded staff is you get cool things to play with, including a nose. So you see that is a robot nose and the talk there is how to build an Arduino which can sample, use that nose to sample smells and detect for example the difference between whiskey and coffee which it was able to do with 99% accuracy, which is good. You wouldn't want to get them confused. And we also had an actual robot. Unfortunately this picture I've chosen isn't that great, but the robot is on the little white table underneath. Yeah, okay. That's my photography skills for you. And we also had a lot of fun. But all good things come to an end. So at the end of the day, there should be another, okay, my slides are mixed up. So this is the end of the deaf room and I just want to say thank you for all the people who turned up and we hope to do the same thing again next year with the provision of the of the foster developers. So once again, thank you all very much and we've had a great time. Thank you. Next up we have the monitoring of the deaf room and it's not the same name. It's Anna. That's the one. Hi, everybody. I'm Anna from monitoring. Hi, I'm Anna. I have a complicated surname. I'm from monitoring and the observability of the room and yeah, it's, should I? Okay, cool. It's one of the oldest deaf rooms here at FOSM. We actually started in 2017 and we are still here. And yeah, this year we had 12 talks but in total we had 97, which is pretty cool. And yeah, the talks increased, the number of room managers decreased. So now we are down to three but we are still very happy and still things are going smoothly. In this year's deaf room the team was kind of getting more about current ideas and how to implement them correctly than new shiny ideas which we see as a maturity sign and that's great. And we thank all the speakers who participated. We had talks from nine different organizations, another healthy signal for OSS as open source community. And yeah, and big thanks to audience for not getting up during Q&As and sitting and getting squeezed when we yelled at them even it was hot. And yeah, and that's all. And our sites are very short in the spirit of FOSM I guess. And we made on the go because the deaf room was today. A couple of pictures, a big Q and yeah, sorry, mic problem. That's it for us. Thank you very much. Hi. Hi. Hi. Hi. Hi. Hi. Hi. Hi. Hi. Hi. Hi. Hi. Hi. Hi. Hi. Hi. Hi. GCC is next. Which is, which is just the slide. David Malcolm with GCC. Yeah, just as you can see this was a very last minute idea. GCC. One more, one more. That's okay. That's fine. Junior now. FOSM junior with Peter Madison. Another round of applause, please. Hi, first I'm very uncomfortable with doing this, so let's see what happens. I was a deaf room manager for the Fosdam Junior track, together with my son Bart, who is a software developer. So that's great. Yeah. I don't know if you noticed there were a lot more children at Fosdam than all the years before. Anybody notice it? No? Oh. So for a couple of years I'm a visitor. I had a couple of stands at Fosdam and I always noticed that there were only a few children attending, mostly with their parents, mostly walking past the stands, collecting the stickers. And that was it. So before COVID, I already thought about the idea we should do something for them. So I talked about it with the Fosdam organizers and we organized Fosdam Junior together with Kododojo and I hope you know what Kododojo is. Ever heard of Kododojo? Yeah? Kododojo is a worldwide computer club where children learn how to program for free. So together with a lot of collaboration and with a lot of developers that were also giving presentations in the educational deaf room, we made the first workshops for children here at Fosdam. So I don't know if you know all the developers, all the software packages. I guess you know MIT Appaventure and MicroBlocks and Snap and Haiti and Zim. And we also had Raspberry Pi workshop. And then the following, it's only one picture. This is the picture of the first workshop given at Fosdam for children. And it was the turnout. So this is wonderful. Yeah, you should let them dance. Yeah? Oh, Haiti, thank you. This is the MicroBlocks workshop where they can play with micro-bits. Not every workshop was visited by so many children, but that was not a problem because when we had less children, we could give more attention to them. So that was also good. So what have we learned from this? It worked. My dear, for many years ago, worked. Developers liked it, giving presentations. The children liked it. And what we also learned, stickers for everyone. Not the stickers from Fosdam, but we should give stickers to the children, at least what their name is, what language they speak, because we had children from French, from Belgium, from Germany. So we had to talk English, German, Dutch, French, and that's sometimes a problem. So it's easier to know what they can speak, what they understand. And the same goes for all the Code of Dojo volunteers that came to Brussels to help. We also need to give them stickers. Then we had a problem. Also, if 30-stickets is too much or was the room too small, that is something we have to discuss with the organizers at Fosdam. So that's it. So I hope, personally, I hope we have another Fosdam junior next year with a lot more children, a lot more workshops, but let's see what happens. Thank you. Thank you. This one? Next one, do you have the first content, man? We're going back to Mike. Just copying slides. We can play some holding music. This is already for next time. When we do the cutover, please stay seated. Of course, after this one, we still have the closing talk, and please stay seated until the final thing of the closing talk. We had issues with this last year. So please keep seated until it's obvious that things are over. Hi, folks. So confidential computing, Dev Room, we're pretty sure that about 98% of you have no idea what any of these words mean. So they asked me to spend five minutes just to explain very briefly what confidential computing is, why you should care, and why open source. So first of all, why should you care? Well, we all know how to do data in transit, right? We all know how to do data in storage. What about data when it's in use? That's the difficult bit, protecting that, and that's what confidential computing can do. So we assume that you folks care about privacy and about security, and that means that you should be thinking about protecting data. And there's all sorts of lots of different types of data you might want to be protecting or your employers might want to be protecting or your governments, whoever it might be. And one thing I want to dispel is it's not about DRM. So confidential computing, when it first came along, people were very worried about it in the open source community because concerns about DRM, that's not what we're talking about now, we can move on. Honest. So it's really widely available, but not many people know about it. Pretty much all the major clouds have it. You can buy server-grade hardware right now with it so it's there, it's coming, and we open source folks need to know about it. So this is a definition from the confidential computing consortium, of which I'm the exact director. So it's protecting data in use using hardware-based, attested trust execution environment. Lots of words, and I'm going to try briefly to explain what some of them mean. So normal protection of data, right? You can stop workloads messing with each other. Containers messing with each other, VMs messing with each other. That's fine, we know how to do that. That's not so difficult. Type one. Type two. That's stopping your workload from messing with your host. That's a bad thing. We know how to do that. But it's much, much more difficult to do this last type. Stopping your host messing with your workload, right? Because of how memory pages work and virtualization and all those sorts of things in containers. And so confidential computing is about using hardware-based, chip-based CPU, GPU capabilities to enforce that. It's one of a bunch of things called PETs, Privacy Enhancing Technologies. You've probably heard of things like fully homomorphic encryption. That's another example. So why does this need to be open source? Well, think of a stack in the cloud, right? You don't want to trust any of these. If you don't fully trust your cloud provider, and you should not fully trust your cloud provider, right? So you want to be able to protect the stuff on the left. And that's what TEs do. A trusted execution environment is the capabilities that a CPU, a GPU, whatever provides to protect the confidentiality and the integrity of the stuff in there. So meeting very, very quickly on, what is attestation? Well, it's all very well if I give my workload, my application, my data to the cloud provider and say, make it safe. They say, oh, I've made it safe. Yeah, right. OK. So we need to make sure that's actually the case. So what we do is we use cryptographic measurements to make that happen. So we create a T instance, which is basically a set of memory pages, which are protected by the CPU. And we say to the CPU, measure me. Do a cryptographic hash of the stuff in there, right? And it does that. And it signs that cryptographically with a key embedded in the chip, which was put there by Intel or AMD or NVIDIA or whoever it may be. And you have an attestation services. You know what? That looks good to me. That's what I expect to be in there. I've done this measurement before. I've got a list of known good examples. It's good. So you know what? You can put your application now into the TE. And now it's a tested. So you can have much better trust. Right. Oh, and you want that to be open source too, right? Because otherwise how can you be trusting that? So if you want to know more about all of this beautiful stuff, you should go to any of these resources. It used to be called Hardware Aided Trust Computing Dev Room. It isn't anymore. It's now called Confederation Computing Dev Room. There's a whole bunch of white papers, a book. And that's me. And we're done. Thank you. This is the one which I got. So I don't know. Do we have Matthias Bolt Lendzirk here? If yes. No. Okay. Then over to Justin V. Florey for the Distribution Staff Room. Or not. Then Michael at nlnet.nl. Good. No, no, you need to stay up front. This works if you stay close enough. You have to stay with me when you're done. Okay. So, okay. Is this loud enough? So as has been mentioned before, okay, I have to stay here. So as has been mentioned before, Falsam is a bit of a sticker heaven. So we decided to do this Hexmas because, well, we make Hex stickers. So we bought a shitload of them. We bought about 140,000 hexagon stickers. And with each sticker taking roughly 1673 square millimeters, that means that we have through 34 square meters of stickers, which should be enough to cover 3750 laptops in full. And this is the classic thing, Pat, right? So if you put all of these behind you, you can make a cover, a trailer, about six pounds for one kilometers. And that is shit. I missed out on one of the... I think we measured something like 3,700 tractors protesting against... So we bought 160 different designs and you can do these nice things with it. And so people did actually, you can make these really cool color tiling because, well, this becomes a thing, right? So getting more aesthetics into computing. Obviously our main goal is not to grant lots of cool projects. I heard some ambitious things. So it's a challenge to the people that have the billions. We don't have the billions. We have a nice amount of money as a grand maker. But I challenge you, we had 49 talks. So that means that they have to do 80 times as much. So I expect 4,000 talks at the next false dem from the cloud computing group that was announced in the first talk. So, like with Hexmas, you can have this father Hex to appear. And in this case, is the man who wrote the Hex spec, visited... There's an official spec for this. And the man visited our booth. He was called on by some of the gods on the internet and he was summoned in our place. And so he mastodoned. Mastodon's a project that we funded. So we were proud to see that in place. He mastodoned the source code that we printed out for everybody to follow on our footsteps. So hopefully next time there will be lots more Hex stickers and the kids will be happy. They will have the name tags, but they will also each of them have hundreds and hundreds of stickers to take home. And that's basically it. By the way, if you haven't applied to Inelna, you're doing a cool project. Inelna.nl. Ask us for money and we work on cool stuff. Next up, we have Steven Goodwin on Magic. Okay, so thank you everybody. As you can maybe be able to tell, my name is James Merlin. And as you can tell from the name, I am a magician. Now, when you become a magician, you get given four envelopes. Now, there's one of me, so I need three people from maybe the first or two. Can I have three people from maybe this side? Because these are speaker people. So obviously, can't trust them. Even you. So can I get one more person, please? Yep, I'll come down. Okay, fantastic. So now we've got three people. So we've got three people. And as a magician, you get given four envelopes. There is. And as a magician, you get given four envelopes. There is a number one, which is in nice, bright yellow. We have a number two, which is nice and brown. There's a number three, which is pink, and a number four. So we'll start at the left. Would you like number one, two, three, or four? Number one. Number one. Everyone goes for number one. That's yours. Don't open it just yet. No surprise. Next up, you have a number two, number three, or number four left. What would you like? You would like number four. No. No. I can't get the stuff. Can't get the stuff. And finally, you have two or four. You would like number four. Okay. Would you now all please open your envelopes? You may open them at the top, be very polite, or just tear them open. If you tear them and you look like a ruffian, I'll assume you use windows. I'm seriously not going to work at Microsoft after that. So inside the envelope, you should find a piece of paper. Throw the rest of the envelope away. We have people for that. Probably Alistair. So for the start off, would you tell me what's in your envelope on your piece of paper that you chose? It says yours. What's in your envelope on your piece of paper that you chose? What's in your piece of paper in your envelope that you chose? Well, in the envelope that you left me is a piece of paper that happens to be a 50-pound note that says mine. Thank you all. I can do that from here. Thank you. Now, that's quite a nice little thing. But what on earth is a magician doing at Fozdem? Well, spoiler alert, cover your ears if you don't want to know, but my real name is not James Merlin. Oh, some people were actually shocked by that. It's not. My real name is Stephen Goodwin. I'm an open source developer. I've been for the last 30 years. I'm an incredible geek. As in, I do a lot of really weird stuff. I was speaking about generating music from algorithms earlier today. So why was I talking about magic yesterday? We closed out this room yesterday evening with a talk about magic. Magic is almost the antithesis of open source. Magicians deal in secrets. Things you're not meant to know. How do magicians share secrets? How do you reserve them? Basically, what we found is you can copyright the script of a magic show, but you can't protect it. You can't put a license on it because then it becomes an end user license agreement, and we know how those don't work. We talked about how patents can protect the secret of a magic trick if it's inventive enough. But how do patents work? You have to file it with the patent body, which it means to protect the secret of a magic trick. You have to tell everybody exactly how it works. Oops. And that was the essence of what I was here to do. By the way, you realize there's a whole group of other people who have these same issues, and maybe there's a completely different problem to be solved. And it was a nice talk. I enjoyed giving it. People laughed in the right places most of the time. And we came away not knowing much more than we came in with, which was protecting stuff, making sure the people have the information they need. It's really difficult when your whole infrastructure relies on secrets. We are just so used to openness that when you suddenly look at it from the other side, you realize, actually, we might be in the right here. So with that, I say thank you. My name is either James Merlin or Stephen Goodwin. Thank you and good night. Next up, we have open source and European legislative landscape with Simon Phipps. Thank you. So my apologies. I have no slides. We had to cut the dev room closed early so I could make it here. And so we haven't made them yet. So I've been coming to Fosden since 2006 and done a bunch of stuff. But last year, I made friends with a guy called Martin Ertson. He works at NLNet Labs and he had accidentally fallen in with a bad lot, a load of policymakers at the Dutch government and the European parliament. And he decided that he would bring them to Fosden. And so last year, you will remember there was a keynote with two European policymakers explaining the Cyber Resilience Act and the Product Liability Directive. And many of us found that presentation very alarming. And so a group of people, me from OSI, some folks from Free Software Foundation Europe, from Eclipse and from Open Forum Europe, got together and started pestering policymakers at the European Commission into changing the CRA so that we could still have open source in Europe. So I'm pleased to say that towards the end of last year, it seemed that they had listened to us. That actually you can do open source in Europe. Thank goodness. So here at Fosden this year, we asked them if they would come back and tell us how it had gone. So Benjamin, who was the guy that wrote the CRA, asked me whether the Fosden audience normally eggs people. And I assured him that there had been very little fruit or eggs thrown in previous years, although this year could be an exception. But he did wear old clothes anyway, just in case. And he came along and gave us a main track talk yesterday along with Omar Energy, who wrote the Product Liability Directive. And they told us how it was now, very carefully. We then got them and quite a number of their closest friends into a dev room today. And we invited you to come and tell them how you felt about the Cyber Resilience Act and the Product Liability Directive. And you did. The dev room has been completely full for the entire day from 9 until just before 5 when we closed it. We have had a queue outside and we're going to write four reports which are going to be very pretty, which we're going to give to the European Commission so that they know that we exist. Because, you know, they didn't know we did until last year. They went and asked SMEs about the Cyber Resilience Act, but they didn't bother to ask any open source developers. And that's why the CRA last year just had one mention of open source in it. And that's why the CRA at the end of last year had about 20 to 25 mentions of open source in it. And additionally creates an extra new category in law called a software steward to whom different rules apply to a software manufacturer. Now Alistair asked me to say roughly what the effect of the CRA and the PLD is. The CRA as it stands now pretty much exempts open source developers from any mandatory action. If you are all feeling fellow human kindness towards your downstream, you might want to provide them with the documentation they need to comply with the CRA. But there's nothing that you actually have to do. Additionally, we found during the process that some open source foundations accidentally turned out to be treated as manufacturers. And that's why Mike Malinkovich from Eclipse wrote some corresponding blogs last year about how the world was ending. Now Mike then in December wrote a blog saying how the world was no longer ending. And the reason for that was because there is a new lightweight set of responsibilities for open source foundations and charities, which additionally do not attract any legal penalties if they accidentally forget to do them. So to give you the TLDR, it's inaccurate as Benjamin keeps on telling me. Now it is true that open source development is exempt from the CRA. It is people who are your monetizing downstreams to whom the CRA applies. And you could ask them for lots of money to help them comply with the CRA. So this was all a crazy experiment. We applied for our dev room two days before the deadline. I wrote the application in about an hour just before the deadline. And unexpectedly we rewarded it and so we had to make it happen. Thank you very much for that trust. Please can we have a room double the size next year? Next up we have Blaine Garst. Thank you. I'm not used to this. So I put the free in FreeBSD and I forgot the staging instructions. I'm the wizard of many things. I'm a wizard in the sense that I gave you POSIX through work at Bell Labs because they ran out of gas. So I started hooking up with OpenBSD, Bill Joy. I crossed paths with him several times. They bought our sources for OpenSTEP and derived Java out of it. So Java is a derivative work of mine because I put interfaces into Objective C and retooled everything. I put closures into our Objective C language and it retooled all the APIs that Apple sells. You cannot do that. I hired Chris Lautner and took LVM out of my alma mater and put it into Apple. I just had kind of epic level impact. So that was the POSIX. That's the first one, Act 2. That was kind of the closures. The closures were called that up arrow blocks. They came from small talk. I took over that language. I didn't even have to look at this. I put an underbar atomic into the CLO and committee. It was the wrong way to do concurrency. I knew that at the time. The correct way to do concurrency is actors. And that's what I talked about this week. Spacebar. What do I do here? There we go. This is what I talked about yesterday. I have a system version of rewriting everything. Because security is an architecture. You have to start with architecture to accomplish it. I went to a few of your security talks and stuff. I asked Chris Stoffer in here. A fundamental question. If you don't track the hardware, you have no chance of actual security. I'm consulting to some space guys. I've consulted to some grid energy operators. They need unhackable. And the way you do it is with open source protocols, peer to peer people to people networks. I'm trying to get a modem that's super secure into everybody's home. You sign up your friends. They are the ones who know your identity. So we solve identity first. With your friends. You can keep track of your thousand friends, you know, public keys. And they're generated on super secure hardware that you guys are just talking about. Well, anyway, a few years ago I got set up with this guy. He builds a hardware that solves all these problems. You can read about it in my FOSSTEM talk from 222. I signed an LOI to buy chips from him on Friday before I came here. My life has been a little difficult. I say I'm a wizard. I do illusions as the Great Merlin did. I have some with me, but I don't know. I can only tell one story. So I transform myself in my other talk. What else did I do? All right, so here's the basic idea. Code is actually math when you look at it from a... It's math. Math is algorithms. They can't be patented. Math is not source code. It's algorithms. So look at this. I'm going to get rid of every... I may be the last programmer is the way I think about it. I am building... I have a plan to end the patriarchy. To free you all, free everybody from wage slavery. I'm buying the chips. I've got a plan. I invite you to follow it. The next update will likely be at OpenFest in November. And I brought pen and paper to select anybody who wants to follow along. I'll add you to my sub-stack. You can track along, participate. I'm going to... I'm the wizard. And I will offer one-on-one, lead one-on-one code reviews and stuff as I elevate my software. I got 30,000 lines of code. It compiles and runs on ESP32s. I ran the Batman mesh network so we could build dynamic networks in my neighborhood in case everything went out. I know I'm going to try to get a polar fire chip on the Linux laptops with a SESTIR radio. So they're emergency responders, so you are not subject to your ISP going down. Okay? I am badass. I'm sorry. I'll show you one token for my career. This is a good one. Okay? I got a little box here, you know? I still got the sticker on it, you know? And you know what's in it? It's a metal. They gave me a metal because I was at Bell Labs, pre-divacature. And by God, they were splitting us off because they were going to build the Internet of Internet. They were trying to build the Internet. It's called the Bell Data Network. And I got this little metal here. They gave me a metal. It says, Blaine Gilles, first to be chosen. And do you know what? It was the pioneer of the AT&T logo. It's pure silver. I don't know, a couple hundred bucks. They didn't have stock options. The spring guys, sorry. The Java guys, they were the first person. They took my kernel team because we got out of the hardware business. And I put Objective C into the kernel, you know, a little object-oriented language that had interfaces in it because that's where they started out. I got Gosling on stage admitting it. Am I out of time? All right, what? Okay. I'm not shouting loud enough. Okay. Thank you. You guys, people love working for me because I just, I inspire them and get out of the way. And that's what I intend to do. I know how to leverage and leverage and leverage as Monomoto Masashi did. That's what I wrote and read aloud at the beginning of my FOSTA talk last time. I'm inspired. You gotta take me away. They're coming to take me away? No, we're not. This is good. I'm a wizard of chocolate. You gotta keep up with me, though. That's hard for most people. But I'm nice about it. You can imagine I'm nice about it. This is the chip I'm gonna put into Modems. I'm gonna put into Radios because to prove the correctness, we need everything in the open. It needs to be algorithms. Okay, next slide. I have a sheet of paper. You want to help me out? Join me. Sign up. Here, I'll put you on my bog list and we'll, I can teach you how to code in algorithms. Christoph is gonna teach me how to turn those algorithms into fun drag and drop software to make this happen. Maybe, if I could talk a little bit. Not Christoph. Yes, I did. Christopher. I don't know. I brought, the magic I do is I make chocolate, ethical chocolate. If we could have supply, if we got unhackable supply chain, then we could make choices about where our stuff comes from because we're over consuming the resources that are planned by Factor 1.8 every year. This ain't gonna last. We've got to track the resources. Money isn't the answer. By first degrees in economics. I came over to high school of a thousand. We have to go to the next slide. I'm sorry, I gotta go. Thank you. Thank you. Thank you. So, please bear with us. We need to switch out the laptop, but we are not yet done. You are going to see something nice. We are trying to break something live to, or test it to breaking point. Okay. Hi everybody. I think it forced me to have the solemn responsibility to encourage you to break the network. Now, if we were doing this in the horrible pandemic times, it would probably be over matrix. We would probably be using Element call these days as a way to do communication. Element call looks like this. It is a selective forwarding unit powered by LiveKit, which is a wonderful open source project. And then you have matrix clients that connect to it to do video conferencing, which is now end to end encrypted and scalable connecting to a matrix home server. So, let me actually go to one I made earlier, and that was the QR code that I just checked up. And I am going to join that. Like so. I am going to take that link and check it into the Janssen room or the Janssen room. Let me just find it. Key notes. Blunk in the logo, the URL and that. And if you look carefully, you will see the URL has a password in it, which is the end to end encryption URL for that room. And I am hoping that somebody is going to join. Yes. So, shall we get everybody in the room onto this and see what horrors happen even to the network? Or the SFU itself? So, the way this thing works... Well, what? OK, OK. Sorry. Yep, yep, yep. There we go. Go boring. So, the way this thing works is to only subscribe to the streams that you need at the resolution that you are requesting for the streams which are on the screen. So, in theory, it should go quite well. We are only using one instance here. We are not actually federating through to others. But, oh, hello, we are up to 17 users. Come on, you can do more than that. I do all hands of work. It's like 100 people. So, we know it can go that far. But the question is, will the network crap out first? Now, by the way, you can double click on people and like zoom them in. There's Andy at home and Amandine. And there's Timo who makes a conduit matrix server up to 21. How high can we go? Let me put the QR code back up to passively, aggressively. We are in fact actively, aggressively ask you to follow the QR code and see whether we can melt something. Okay, we are up to 25. By the way, if you have AluminX installed, it will open in AluminX. Okay, I'll just keep the QR code up. You and your scientific logic. Let me go and put it on the side. And we can do both. If I knew how to use a windowing manager, then we could literally do both. Yeah, actually, look, it's a black hatch. All right. Oh, you can see all the amazing relay outs on Element Cool here. By the way, one of the fun things here is that if I open a different tab, then you can see my network usage is going to drop right down, where it should drop right down, because it's no longer subscribing to the feeds. Likewise, if people scroll off the bottom, then you just stop subscribing to them. Listen, come on, how do you got 28 people? It seems to already start breaking for people and the audience, they can't join anymore. So we broke it through. Oh, you're asking that we have actually melted it? We have to... Well, clearly, it's not the... It must be the network or something. Hopefully, it's not on our side. But anyway, there we go. This was the demo that I was asked to redo again, because we did it earlier and the recording didn't work very well. But thank you, everybody, for following along. There's one other very quick thing, which is what we're talking about European Union stuff. Outside of CRA and PLD, there's also the DMA, the Digital Markets Act, and the big announcement we had on the main stage talk this morning is that we've been working Matrix and Aliment with WhatsApp as one of the gatekeepers defined by the EU in order to try to figure out how we can interoperate in an open fashion with them once DMA hits on March 7th, where they are obligated to open their APIs and their network to anybody who asks. So watch the space for March 7th to see what that looks like. I think that's it from me. APPLAUSE
Closing FOSDEM 2024
Finally, so if you're leaving, please do so quietly. It's horrendously loud up front. Also, if you're still talking, please stop. Thank you. So this is over, finally. It is, as they say, a wrap. And yes, I love this picture. You're going to see this every single year. OK, is this better for you? Oh, OK. Thank you. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. OK, so yeah, it is another year done. It is a wrap. And yes, I'm going to show you this picture every single year. To talk about stuff, again, you'll find a lot of really good content, good conversations, link to all the slides. If you're a speaker, also talk about your stuff. Put summaries off your deaf room, the small, the big stories. Put stuff onto Macedon. As several people know, there is only one social media network, which is relevant during FOSSTEM. And it's the open source one. And it's super awesome. Yeah, also, oh, yeah, I need to take a picture. Good thing that I took this note. You'll find this picture if you want to wave or anything. Now is a good chance. I'll also not make this one blurry this time. And this one. And now we're done. Thank you. So you'll find those later. And also use it. Yes. Slide correction about conference size. I have been approached about, I stated we are the largest open source conference on Earth. And this is not precisely correct. By the amount of talks and everything, and by the amount of tracks which we have in parallel, we are the largest conference on Earth, not just open source. Like the largest conference on Earth. Not just open source, like the largest, period. Which is pretty nice. Largest conference on Earth, period. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. If you have old shirts from Fostom or Ostom or other projects or whatever, bring them next year. If they are too tight or too loose, still ideally bring them the older the better. We are going to figure something out. We don't know what, but we'll probably do something on Sunday next year. So bring your old stuff and share it with your friends and like get as much old stuff as you possibly have. If you have anything or if you do not have anything anymore, we have as always a lost and found in K at the Infodesk. I haven't been able to get the information until when we have it, but we probably have it until like 10-ish I guess. But the sooner the better. So if you lost anything, I saw a helmet, I saw some notebook and other stuff. Just go there and see if we have your stuff because we would like to return it to you. If you have feedback, please send it to feedback at FostomOrc. We do read everything and by we, I mean every one of us reads pretty much everything. Keep it light on the crazy stuff, but other than this we like reading your feedback. Good, negative, anything in between. Some video stats. Our peak outgoing video was 5 gigabits. Still hosted off-site. Next year we are planning to pull this on-site because now we have the bandwidth and the redundancy. And yes, it's super loud if you talk in the audience, so please stop. Thank you. We had a total of 27 concurrent streams going out of the conference at all times and we peaked at 2,200 concurrent viewers. And the unique viewers are too low, but they're too low. They're like 5,000 or something. Sorry, I was there too low. I messed up the slide. Sorry. The stats in small is what you already had from last time. As of 1300 UTC, roughly those are the stats of the review system. As you can see that those numbers went earlier, I got tricked into also harrowing to some extent the things. So I had three talks in a row. This is why the numbers are earlier. The point is if you are a speaker or a deaf room manager, please, please, please, please, please with sugar on top. You get those emails, do not ignore them. We rely on you to actually cut your own video because we cannot possibly cut all of this video ourselves. The last time we tried this, it took us into the summer to actually cut all the video. Please crowdsource this. If you get those emails, do not ignore them. Work through them and help us cut your stuff. We can upload your content much earlier. You did all the work for the presentation for deaf room. Please do not ignore this work. Some are really awesome about this and some completely ignored. Please do not ignore it. Matrix stats. Again, as of this afternoon sometime, we had 2300 users this time in the matrix space. As you can see, the numbers go low, which is kind of stents to reason because this phosem is definitely larger than last year's. Obviously, the remote component is going down with more people on site again. We almost sold out as of this afternoon, I believe. We completely sold out, but I am not quite certain. You ate a lot of food. The free cookies which were given out this year did not have a GDPR by now. Maybe we need to improve this next year. Network stats. Again, those numbers are relatively useless. They are only a rough indication because of privacy extensions. We used to be able to deduce amount of visitors from this because we did photos and we calculated how many people did have their laptops out. Every person won one's phone. We could deduce roughly the amount of visitors, but with privacy extension, this has gone away. As the different operating systems have different timings for the privacy extensions, and we do not know what operating systems you are all using, we can't even roughly deduce this based on statistics. The main thing is, it is significantly more than last year. The main thing is, I think everyone who was here last year sees the same. We are actually coming back to a real in-person, which is super nice to see. We had some fun things, like for example the very smelly room, which you don't want to spend a lot of time in, which we called the operating center or the knock. We left the door open and we literally had water running down. Actually, this container was half full by the time and this is my foot. Just to show you that not everything is going right with large conferences, I will spare you pictures of all the overflowing toilets, but you probably saw how we closed things off on it again. Oh, I forgot to include this picture. Also, we had the free cat people and they designed a key, which the Prusa people printed, which we could use to open all the towel holders. We actually get more towels into the toilets. This is why you could try your hands. Some of you will have seen or will have noticed that we had a power cut in part of the K building. Again, please stop talking. Some of you will have noticed that there was a power cut in K, which impacted quite a few of the stands which we had. We narrowed it down over time. Initially, we had a ladder to come up and then I just stood there with a broom and just popped the breaker back in. We just started isolating bits and pieces of the daisy chain of power strips until we found the short circuit. It's just part of what makes a conference run, in which thankfully, none of the people here actually see happen. It just magically works. Also, if you are a speaker, and if anyone wires you up with one of those lapel mix, they do cost money. Also, if you go away, not on purpose, and you still have them, yes, we noticed, but this basically means we scramble and we send someone running after you. Please, for the future, try and remember to return your mic. It would be nice. It makes less panic, because those are expensive and we have to pay for them if they go missing. If you want to make donations, you can still do this, because for those who don't know, we try to limit the amount of corporate sponsorship which we are taking, but I heard we are going to take some money from you, I think. Anyway, we try to have a broad base of financing. We could easily go with corporate sponsors for all the costs which we have for running FOSDEM, but we do not want this. We want to have a broad base of financing, so at any point in time, we are not dependent on a single large sponsor or anything. There is no hard dependency from our end to keep someone happy, which is what allows us to completely separate all of the donations and everything and all of the content. You will find very few conferences which do this as aggressively and as absolutely as FOSDEM does it, and the continued donations by everyone are what make this work. Now we come to the thank yous, I might start crying as per usual. These are the first ones. These are our sponsors. We have a total of 160 deaf room managers, which is pages and pages of names. Next, that is the most important one. We are almost at a thousand speakers and maybe we are going to break this next time. It would be nice for 2025 for 25 years, so those are all the names. Thank you. Then we have a total of 173 names volunteers. Just to be clear, there are a lot of people who just have around the edges or take some trash with them or whatever. There is more than this. Those are ourselves. I forgot to thank you, because you also all make FOSDEM happen, so thank you. Again, three talks in a row, this was rushed. Please help us. Take a look around you if you see any trash. Congratulations, it's yours now. Bring it to the next trash bin or anything, because if you don't do it, one of the volunteers or the staff are going to do it. If you have more time, it would be great if you could just come down to here after the whole thing. We will collect you, give you brooms, gloves, everything and tell you what to do. As per usual, any of the network cabling which you remove under our guidance, you can keep. We are going to have a common feeding session when we are done, roughly between 10 and 12 o'clock. Anyone who stays to the end, we have plenty of food to thank you in a more direct way for helping us clean up. I have a kidney thing, and also my knee is kind of broken, so I can't do the FOSDEM dance. Who volunteers to do the FOSDEM dance, if you know what the FOSDEM dance is? We have probably enough people, I need someone to volunteer. No? Come on, come on. Someone? Anyone? No? I can explain it to you, but you have to do it. No? Yeah? No? Too late, too late. Okay, we have one, perfect. So, we also need more volunteers. So anyone who is a volunteer, anyone who is a speaker, anyone who is a deaf room manager, anyone who is deaf, please come up front A for a round of applause and B for the FOSDEM dance. You don't have microphone beauties anymore, you can just come. I see a staffer who is trying to hide. Come on, come on. So, if you are using this chance to leave, then please do so quietly for the few who do. Yeah, you can explain it. Yeah, just from what I remember from all those years, you just start on one feet. You go on one leg, you just go like so until you go crazy. Thank you all. Yeah, thank you and see you next year. Thank you.
How to Chart your own Career Path in Open Source - Panel Discussion
Okay. Can people hear me okay? We're good. All right. Thanks. So I guess we can officially say good afternoon. Thanks for coming. My name is Ray Paik. I'll let the panelists introduce himself in a few minutes. It's a little weird because we have to be here for the camera, but we'll make it work. So I'm a community manager at Pincap. If you're not familiar with Pincap, we're a company behind the open source database, TIDB. And if you're part of CNCF, you may have heard about a couple of other projects that we donated to the foundation. First one is ChaosMesh, and the other one is TikeAB, which is our key value database. I've been at Pincap since April of last year, and I started my career in open source community management about 10 years ago when I joined the Linux Foundation. I was there for about four years. Then I had community manager roles at GitLab and CubeDev before I ended up at Pincap. I don't know how you all felt about 2023. 2023 felt somewhat difficult, especially on the job front for people in open source. I myself was laid off, but I was fortunate. I was wrapping up my interview process at Pincap, so I think I accepted my offer a week or two after I was given a notice. So I was fine, but then you had this constant drumbeat of negative music. It seemed like companies that we thought were at the forefront of open source were making significant cuts to the community teams. Open source program office is just completely being obliterated. I couldn't tell for a long time last year that whether this is just another boom and bust cycle in the high tech industry or there's something more fundamental going on. I did a lot of thinking about open source careers, and then so I decided to propose this panel. Glad to have a wonderful panel this year. I'll let the audience introduce themselves. I'll look at you want to get you want to start? Yes, I hope you can hear me too. So thanks. My name is Ildeco Vanja. I work for the Open Infrastructure Foundation as Director of Community. The Open Infra Foundation is an open source foundation that hosts and helps support open source software development communities into software infrastructure space like OpenStack, Cata Containers, Starling Access, all our examples of the projects that we have. I joined the foundation seven and a half years ago, so I'm already on the record in terms of longest employment of my life. So you can tell that I like working working here. Before the foundation, I used to work for Ericsson, which is a large telecom vendor company, so very different environment. However, that's where I got in touch with open source. I started to contribute to the OpenStack project, and my first experience was so wonderful that I just couldn't stop afterwards. I became a really big open source advocate. And open source became a fundamental part of my life to the level that now my full time job is is all about open source and working with communities and the ecosystem and anyone who would like to get involved or maybe they don't like yet that they would like to get involved, but that's where I come in and convince them that that's the best idea that they will have in their lives. So yeah, that's me in a nutshell. Okay, I had technical difficulties. So I'm Don Foster. I am the Director of Data Science and Governing Board Member for the Chaos Project. I'm also on the board of an organization called Open UK. I live just outside of London. And I'm also a co chair for the CNCF contributor strategy technical advisory group. So I tend to wear a few a few different hats. I got my start, well, I came out of university with a computer science degree in the mid 90s, and I somehow managed to luck my way into a Unix system administration job. So my very first job out of university. And back then I worked for a manufacturing company and manufacturing companies do not like to spend money on software. So I used a lot of open source software just in the nature of being a system administrator. And then fast forward a couple of years, I was at Intel in around 2000 2001. And they needed someone to look at which open source projects were going to be strategic for for them over the next, you know, number of years. So which ones should we be engaged in which ones should we be working with and I was working mostly at the time in the kind of the Linux developer tools space, I think like compilers and IDEs. So so that was sort of my first first role that was more focused on open source. And then over the years, I managed to somehow turn that into a full time thing where I was community manager and a few different companies. I've done I've done lots of different things in my open source career over the years. Most recently, before chaos, I was at VMware and I was their director of open source community strategy. So I've done little little bits of things in open source over the years. Alison Randall. So I also started my career in open source in the 90s or through software since we didn't have the name yet then. I was working at a startup, as an online bookseller that just happened to use Pearl as their development language. I'd used Pearl a little bit before for linguistic research, but that was when I really got into it. And within a year, I was teaching Pearl at the local Linux user group. And then I got sucked into Pearl design work by the development team. And then I got sucked into being project manager and the president of the foundation. And it kind of went from there. So I've been involved in a lot of different projects, but you'll know some of them. Debbie and Ubuntu open stack. I'm currently chair of the board of software freedom conservancy on the board of open infrastructure foundation and also on the board of open usage comments. Cool. So yeah, I mean, I was really excited about the panel because we bring a lot of different backgrounds and different way you've got introduced the open source. So I guess I'll ask this question to you out, Alison is like, so 20 plus years, like what motivates you to keep saying in open source? Like what are the things that you enjoy the most about the open source communities? I mean, for me, hindsight is clear. It really has always come down to the people and the things that we built together. And that's partly the software that we build together. We've built some really amazing tech. But it's also the communities we've built and the styles of collaboration we've invented. And, you know, the legal structures that supported those very different ways of working together. And that's what really stands the test of time. You know, you can get distracted by the politics and all of that. That's not really what matters. What matters is the people and what you're building. Anything else you want to add or? Yeah, plus plus one to the people. It's been it's been an amazing career, right? Like I I've met people and I know people all over the world and I can I can go almost anywhere and find someone that I've worked with on a project somewhere to sit down and have a coffee with no matter no matter where I am. And so it's yeah, I've just met so many amazing, wonderful people. I can I can also plus one that that notion. And the other thing is that when it comes to open source, I mean, the majority of the people are there because they are interested in the project and the technology they had they share the goals. They they work on something that's in their common interest. So you find people who are enthusiastic about what they do. And it is a great environment to be in and to be part of. And like knowing people all around the globe, you learn a lot about cultures. And you just have access to so much knowledge that we share with each other on a daily basis. And you get so many different points of view that it's just it's very hard to match in any corporate environment in my experience. So the flip side of that question is, because we talked about all the positives and what we enjoy the most. Are there examples of time when you wonder to yourself, like, what am I doing with my life? And maybe this isn't for me. Like, I mean, maybe it doesn't have to be that dramatic. But anything you want to share? I mean, I do like what I what I do today. And that's that's why I keep doing it. There's there's ups and downs, no matter what you do. When it comes to open source, like back in my, let's say, corporate days. I think that it would have been better if I spent a little bit more time understanding corporate politics and navigating how open source can fit into a product development environment, and figuring out how to work with with our managers to also help them understand. Because there there are a lot of examples where where you're a developer, you're working on the code, you know what you're doing, you know why you're doing it, you're enthusiastic about it. But there are so many other people in the company who are trying to make sure that there's a product schedule that the customer is happy that the company makes revenue because otherwise, we are all in big trouble. So there are a lot of moving pieces. And you who are actively participating in an open source community to you, it's crystal crystal crystal crystal clear. There you go. What's happening. But but someone like a manager who's a program manager trying to again make sure that the product is on track. They don't know they don't have the experience. They just see something from the outside. So helping them understand how these communities work, what you need to do to be effective in that community and also be effective in the company where you're working. That that can be an interesting balance and an interesting challenge. And when I was very new to it, I think I stumbled on a few mistakes that I would do differently today. Cool. Anything else you want to add or we can move on. But I mean, you mentioned balance. And I think one of the, like challenges that you hear a lot of different events like including your FOSM is that people talk a lot about work life balance or trying to maintain balance in general. And then maybe Don, I'll ask this question to you because you work. I mean, before you joined, you came on board a chaos full time. You're at VMware. So I mean, you're actively involved in Kubernetes and other communities, but that's not 100% of your job. You have responsibility as a VMware employee. And how difficult is that balancing at like trying to be a good open source citizen, but also trying to be a good employee. Yeah, that can be Okay, so so that can be a real, a real challenge. I, you know, I think on the one hand, I was I was just super lucky, right? Because my managers at VMware were really supportive of the work that I did. At the time I was contributing to Kubernetes and to the chaos project and a few other things. And so they were, they were very supportive of me spending that that time. But I also take the approach where I, and I didn't always do this, I've burnt out a couple of times in tech, like, like many people have, where I tried to do all the things. And now I'm super protective of my personal time. And I, you know, I kind of work as a set number of hours. And then when I'm done, I'm done. And the only way I can do that is by being really brutal about prioritization and just saying no to the things that aren't that important so that I can focus on the things that are, whether they're the things I'm working on in open source communities, or the parts of my, you know, at the time, real, real day job. And then I'm, I'm sort of lucky now that I, I do, I will admit right now I have, I have my dream job. So the data piece, the open source metrics with chaos has always been my passion project. So being able to do that full time is, is been pretty great. I would like to applaud Don for being able to do the prioritization that that you made the decision and you're sticking to it because I suck. I am also in my dream job. But to me, that did not help with spending not too much time on it. And I think when it comes to open source and also what we are doing right now, especially after COVID, like so many of us are working from home. And to me, just the working from home setup, whether it's open source or not open source work, I like to be enthusiastic about whatever I do. So that setup, to me, makes it already really hard to find a balance because like the left corner of the table is the work corner and the right is where I have my personal time when I eat lunch. Like that, that just doesn't really work well for me. And I think that Don also mentioned burnout, that that is something that probably most of us who are enthusiastic to an extreme level will experience at least once. So I share all the challenges and I can, I can only recommend that once you experience burnout once, you do have a choice from that point because you do have the full end to end experience. You know what the signs are that lead to burnout. So you do have the choice when you are seeing the signs next time to stop to know that, okay, I'm not going forward like this anymore, because I know where it leads. So you do have the tools with the experience that you're gaining, even if you don't find the right balance right at the beginning. Cool. There's just so many interesting things to do in open source. It's hard to choose one or two. No, I'm like Don said, I'm all for setting boundaries. I mean, I work for work with a lot of colleagues in China, and I'm in Pacific time zone. And between five and seven, it's really difficult to say no to a quick call. But I think most of my colleagues now like between six to 8pm, that's family time, I need to have dinner with them. And they understand like if you have the right corporate culture that works, but it's really hard to do some time. So, so go ahead. Just a quick note, like I just started to work with a new community. And they they are very active in Europe and Asia Pacific. I also am in the US on West Coast time. And I work with two communities in total and the other communities very North America centric with a few people in Asia Pacific. So I have all the three major time zone regions to cover. So I'm currently in the process of trying to find a new balance because I can work 24 seven so easily because there's always someone awake. Who's very active in the community that I'm working with, who I could talk to, why could solve a challenge for and it can be very hard when it comes to the time zone challenges, especially if you're like really working with global communities. So when when I first like opened our talk, I mean, we talked about like the job market like last year. But for people that are looking for jobs in open source, like I mean, are there any advice like any of you like to share like in terms of, you know, first of all, finding the interesting openings that you might you might want to pursue interviewing tips, etc, etc, finding the right culture. Okay, I can start. I would say that my my biggest piece of advice when you're looking for work is is to use your use your network. So I think in my entire career, I have only I've only ever had one job that I got from applying through the traditional channels. Every other job I've ever gotten has been because of someone I knew. And in a lot of cases, these were people that I knew through my work in open source through these open source communities. So when you're when you're looking for work, just, you know, spend some time talking to some of the people that work in the communities that you're interested in, and who work at companies that you might want to work at or organizations that you might want to work at and talk to them, you know, ask them what it's like at that company and see if it might be a good fit for you, ask them what kind of job openings they have, and just just talk to people and get other people's suggestions. Because once you talk to enough people, they will generally know of other people that you can talk to that maybe you weren't already connected to. So so don't be shy about talking to the people that you know and asking them asking them their advice and what it's like where they're where they're working. Yeah, I think that when it comes to open source, like, you're operating in a public environment, whatever that you do is public. So you can also point to things that you've done. It's it's much easier to build a resume as well if you're active in open source. So it's the connections and also the work that you you've already done. And another thing that kind of connects back to sort of early mistakes. That for I think that was the first question or along those lines. Like, building connections really is truly important. Like when you're attending an event, you can you can prioritize to listen to talks. But I would challenge you and say that if you're not interested in talking to the speaker after the talk or talking to people in the room who are interested in the same topic, then is that really the best session you could choose in that particular time slot because you can always have access to the content later. Many conferences are recording presentations and even if they don't, the the information is out there floating on the internet one way or the other. But the person doesn't. And the in person connection is invaluable. Like I have a lot of experience, you know, jumping in new communities and you do that on the online channels first. But whenever you get the opportunity to actually talk to a few people in person, the online interaction just becomes so completely different, way more efficient and usually a much more pleasant experience. And then those connections are also could be the ones that are lending you a new job because those people know you, they trust you and they can give a recommendation at the company where they work that, hey, there's this person, we've been working together in this community and they are so amazing and they're looking for a job or maybe they are not looking for a job, but we should get them anyway. So that's that's a great way to go. I would I would add keep in mind that there's not just one way to have a job in open source, you know, pretty much any job these days that's related to software is going to be related to open source. So in my career, I've often switched between like doing all my volunteer, all my open source development as a volunteer and doing paid work that's like running an open source conference or managing an open source foundation. And then also I've done it the other way around where my paid work was open source development and then I was as a volunteer serving as a board member or you know, a community manager or something like that in an open source project. So like, don't be afraid to mix things up and yeah, find a way to get paid but also find a way to like live your passions. Cool. So somewhat related to that I guess I mean you've done lots of hiring over the years for open source rules. When you interview candidates like what do you typically look for? I can I can start. So obviously the skills that you need for a particular job depends on the job. But when it comes to open source interacting with people and being a team player is kind of a requirement. It doesn't mean that you have to be an extra word. I'm an introvert. I know so many people in open source who are totally introverts. But since we are all so passionate about what we are doing, that is not a barrier for us to participate. So the willingness to interact with people and to even if you're not comfortable with the public environment fully yet, but the willingness to be and to do so, that is very important because you will need to interact with people from all over the place. And if you're quiet and shy and you don't want to be out there, then it is very hard to get successful in open source in my experience. So that's definitely up on the list of being able to do that. Do you want to take the mic? I would say that I generally look for someone who has enough of the skills that we're looking for that they can probably do the job knowing that there will be pieces of the work that they'll need help with later. So one of the things I will caution you about is that job descriptions on the website are wish lists. They are not requirements. I have never in all of my years had every single thing listed on that job description as a skill. And they still give me the job. And I still, I guess, I seem to be successful. And so don't look at those. As a list of requirements, look at those as a list of things that they would like that person to have because they're not going to get that unicorn. They're not going to get that person with every single one of those skills. They're going to get somebody who has enough of those skills to do the job, and then they're going to train them on the rest of it. So make sure that you go ahead and apply for stuff, even if you don't think that you have everything, because in a lot of cases they'll be willing to take a chance on you and train you up on some of the other bits. And also, like, if they see that you're passionate about that particular job and you have an idea of why you would be the best person to do that job, that usually gets you through the interviews as well. And that, like, if you don't have a skill that's listed, then they will more likely overlook that because you're someone who's already in that mindset that you're ready for that job. So I totally agree with that observation. I don't think I ever checked all the boxes either. I think that's impossible. Most of the time the job description is also written in a way to just be a little bit scary. I assume they are trying to limit how many people are submitting applications just because the job description looked like you need a 200-year work experience before you apply. But really, most of the things you will be able to learn and don't be afraid to learn. And if you're also open about that, at least to me that was always appealing when a person is honest about, okay, this I don't know yet, but I can learn it. And for most of the tech jobs, you will never stop learning. So if you have that ambition that will take you from A to B and then from B to C and you're able to grow, that is always very appealing. Like, again, the ability to grow, that is another thing that at least I personally look for, to see that the person will be able to grow into the job that they are applying for. But then they will also be able to grow further out of their job and do something else in the company. Really, a job where you already know everything before you start it is super boring. So look for the jobs that you will learn something really interesting and that will lead you on to other jobs where you learn even more interesting things. Yeah, I mean, we don't mean to harp on a job description too much, but the other thing I want to add is, by the time you accept the position and start, it could have been three, four, five months since that job description was originally written. So think about that, after a quarter, things change, the market's changed, there was a reorganization of the company. So, I mean, that's why I try not to take job description as a gospel, although it's very tempting. Because you want to check all the boxes as many as you can, but it's just a guideline. Like, it's an educated guess as to what the new person might be doing, but it's still a guess. So, sorry, I'm going to the list here. So, I think what I've seen some people do, like, you start in open source, but then you step back, you take a different role, you do a non-open source role. I think some of you have done that in your career. Like, can you talk about that experience and why you, I don't know if you were forced to do that, or why did you just step away and what was it like coming back into open source? I've done it multiple times, I mean, three decades is a long time. I mean, some of it was layoffs, you know, it happens, but more of it was often, I mean, there's reasons like family health, there's reasons like kids, there's reasons like I took a break to do a PhD, you know, there's all kinds of good reasons to take a break, but another one is to avoid burnout. So, if you think you have to stay in the one project forever, you will work yourself and work yourself and work yourself. But if you recognize it's totally okay to just go away for a couple years and, you know, either come back to that project or a totally different project that excites you two years from now. Like, it feels less devastating to step aside from a project and you can do a well planned orderly handoff instead of a flame out burnout. If you push yourself all the way to flame out burnout, chances are you will never work in open source again because you burned yourself too far down. You can come back, but it's much, much harder than if you just recognize the signs and say, oh, you know, I should really take a couple years off and do something else. And then you come back revitalized. So, yeah, I actually, I highly recommend taking a break from time to time. It's a really good idea. Cool. Yeah, I've also had the occasional detour. I had one that was the company that I worked for, the politics around open source internally just got to be too much for me. And so I spent six months working in a, like, a market research department or something, something kind of random that I thought was a little bit interesting. But, you know, in the other way, I think that can help with, you know, burnout and just kind of, you know, doing something new is I've worked in loads of different open source communities. So, chaos is probably the one that I've worked in the longest because I've been working with these tools since before the, before the organization existed. But in a lot of cases, I've worked in, you know, kind of a series of open source projects based, frankly, on what the company that I was working for was particularly interested in. But the thing that I found was that in every time I switched from an open source community to another, I found that there was at least one person that I knew from a previous open source community. So even when you kind of bounce from community to community, there are usually other people that you know from previous lives in other communities. I do not have the experience yet. I'm still burning myself to learn the lesson. But to Don's point, I did see people, like, moving from one employer to the other, but still working in the same open source project. And also the popping up at another community and like, oh, hi. Like, you're here too. Cool. And it's just, it kind of shows you that the world gets a little bit smaller if you keep being involved in open source and you just know people. And the connections that you make will more likely to stay with you longer than in a corporate environment where you're just jumping, jumping companies. And that's a really nice experience. Even if I assume that even when you're taking a break and you come back and you see some familiar faces, but the project is new, that's kind of a nice mixture of I'm doing something new, but I don't have to make all new buddies to go from A to B. So yeah. Cool. So yeah. Yeah. So, I mean, I think earlier we were talking about like regrets or mistakes. I mean, the one I made personally was, I mean, I was working at Intel. We got reorg and then I had to stop working on open source, which was devastating. And then I think the mistake I made was I just spent time like sulking and being depressed. And I mean, that's fine. But what I should have done is do more productive and try to get engaged in the open source community somehow, find a different project, show up to meet ups, et cetera, et cetera, rather than feeling sorry for myself. So I think getting reorg then maybe getting laid off are two good examples. Like if you're a force away from like open source, like, you know, what advice do you have for, you know, maybe just stay engaged in the community somehow. I think nowadays it's easier. But, you know, what kind of approaches do you take to find a new community to join or like how do you keep up to date on what's happening out there? I have an example that's kind of slightly connecting to the questions. I will share it. I used to do trainings like how to contribute to OpenStack trainings one or two days before, prior to big events where a lot of people traveled to. And I met a lot of different people with very different motivations of why they were at that training. Some of them were just like it was free and then they were already there and it seemed interesting. And a lot of people kept asking like they do the training, they learn about the tools, the processes in the community. So what should they do next, what they should work on? And I always kept asking back like what are you interested in? Because I can point you to the low hanging fruit bugs. That's easy. But once you fix the bug and then you fix another one, then the third one is like why am I doing this in the first place? If you're not interested in the particular technology or you don't have a motivation to be at that particular place. So I would say that don't ever let anyone else tell you what you should do. Go where you feel passionate. Like learning, when you learn something new, where you're interested in the technology. And if you get involved then you will have the connections and then the job will come around as well. If you want a paid job that also works with that particular technology. So I would say that here make sure that you prioritize your interest and invest in yourself through that. I think it also partly goes back to the people. Like multiple times, we talked about other open source contributors moving from project to project. Like I was working at Canonical and when I left Canonical, a lot of other people that had also been at Canonical working on the OpenStack project. And I thought, that's interesting. What's that all about? And that is how I got involved in OpenStack. It was just by talking to other people and seeing what they were interested in now and kind of keeping those connections. So your network in open source can be really, really valuable in staying connected and finding out where the new things are and where you might want to keep working. Cool. Okay. I think we have about 13 minutes left. So I think I'll ask one more question and then leave the last 10 minutes for the audience. And I'm not going to hold any of you to this. I think earlier when I started, I felt pretty depressed for large parts of last year. Because I wasn't sure if this shift is unique or we're just dealing with another pendulum swing. So open source careers, in general, what's your outlook? I mean, to be honest, I think I'm a little bit more optimistic now than I was like middle of last year. Middle of last year just seemed daunting. And I just devastating to see a lot of my friends get laid off. But what are your thoughts on where things are headed? Or are we dealing with more of the same? Okay. I can go first. So as I work for an open source foundation that is a nonprofit organization, and I work with a lot of communities, I do see the effect still of where the economy is right now. However, at the same time, even just in the past two days in the co-located events, people are throwing out numbers. Like if we didn't have open source, then it would be like 4. something billion dollars to rebuild what we lost. And the trillions of dollars of demand that is driven by open source software. So those kind of numbers show that open source will not go away. So even if the economy is restructuring itself, companies will restructure themselves too. And I don't think that anyone really has a choice of not using open source software anymore. The software also needs to be maintained because otherwise you're not able to use that. Security is a high priority item in any single conversation that I've been participating in in the past few months. And maybe it's getting up to years now. So there's a lot to do in open source. It is also a model that is very sustainable if it's done right in terms of investment. So I think we will bounce back overall. And I think that the job market will have a lot of opportunities that are more directly focused on open source. And even, I think Alison mentioned that there isn't really a job that has nothing to do with open source anymore. It's just maybe isn't called out directly. But I'm optimistic. Yeah, I'm also optimistic. I do think that the pendulum has swung too far in the cutting of jobs. And in particular, I think some of the open source groups have been particularly bad hit. But I think it's not going to take companies long to realize that somebody has to do that work on the projects that they depend on. And so, you know, I work a lot with CNCF projects. Most of them are understaffed and they don't have enough resources to maintain the software over the long term. And I think that if they don't get more contributors coming back to those projects, because companies have pulled people off of some of those projects. And so many companies, their whole product line relies on a lot of these products. So I think they're going to quickly realize that their new features, their bug fixes, their things they're going to need in the software, that they're going to have to resource some of that. But the other trend that I find particularly promising as well is some of the alternative funding sources. So you look at groups like the sovereign tech fund out of Germany who are funding core infrastructure projects. You look at things like GitHub sponsors. You look at a lot of these other groups that have started funding individual projects, individual developers. And so I think that's also an interesting trend from a career and a job standpoint for open source. I don't know. From personal experience, I was laid off last year and I haven't looked too hard because I was having fun working full time on my volunteer open source projects. But I did see in the new year, towards the end of the year, it was a lot of, oh, this year it's totally blocked off. And in the beginning of the new year, it was like, we're hiring, we're hiring, we're hiring, we have a lot of spaces to fill. So if you think it's been a difficult year last year, look again because things are changing now. Cool. So I think cautiously optimistic is the phrase I like to borrow from the economists. So I think we can open up to audience questions. I don't know if we have a microphone for the audience or I can just bring one. Thanks. So I have a question for Ildiko. You mentioned earlier that when you are at events like this, you have to take advantage of getting to know people and to interact. So I'm also introvert. How do you get past the barrier of talking to strangers at an event like this? Not an easy question I know, but. Excellent question. I can only share my personal experience. To me, if I'm passionate about something that will push me through the first few seconds of awful experience, I can only share my personal experience. I can only share my personal experience. If I'm passionate about something that will push me through the first few seconds of awful experience. The other thing is that what I found is I have days when I just wake up and I'm feeling more social. And there are days when I can do whatever I want. I could write a script for myself before I walk up to a person who I don't know yet. And I would still be totally awkward. And I learned to say that it's okay. I have days like this and it is okay. I also started to kind of be a bit more open about this sometimes just telling the person, you know, I'm socially awkward sometimes. I'm not ashamed about it. So like I'm not afraid of putting it out on the table. And many times the other person can relate that, yeah, well, it's not easy for me either. And it's just so many examples I have where I said something like this and all of the sudden, that was like the icebreaker and the other person is also like, oh yeah, it is hard for me too. And then we have something to talk about. So it is hard. I know that it will drain me like after a conference like this. I need a few days to recover. My mother also knows that she should not call me like two, three days for two, three days because I will not be a pleasant experience on the phone necessarily. But yeah, you learn how the social interaction affects you and then you will also learn how to navigate yourself. So I can only encourage people to get through the first few awkward experiences and then build on what you learned about yourself. Yeah, I mean, just to build on that, like talking to strangers is hard, right? And I share some of Ildigo's like I tend to be a little socially awkward. But what helps me is to talk to people in more social situations. So you know, you're in line for a coffee or you're at one of the after parties or something where it's a little bit more social. And I just sort of have ways of coping with it. Like my question for people is always, you know, are you enjoying the conference? Or you know, and building on that like, oh, it was the favorite thing you saw today or what are you looking forward to tomorrow? So you're talking about the conference and you're also, you know, even if this person isn't all that interesting to you or working on the kind of the same things, maybe you learn something about what they found at the conference and what looks interesting to them. And it can be a good icebreaker. And you know, and then sometimes, you know, if you're standing in a group of people and you know, somebody else will chime in and then pretty soon you've got a conversation. But that's how I start. That's my coping strategy for awkward conversations with strangers. I'm also an extreme introvert. For me, it's about understanding that it's difficult for them too. So if I'm focused on trying to make them comfortable, I'm not thinking about how uncomfortable I am. And also just being super curious. Like, ooh, what do you do? What are you interested in? And like, I get so focused on whatever project they're involved in that, again, I just completely forget about my own awkwardness. But planning time off, like even in the middle of the conference, planning like half a day off, like, oh, I don't have a whole lot of talks I want to see right now. I'm just going to go back to the hotel. And it really helps because you recharge that introvert battery and then you're ready to deal with people again. I just want to say like plus thousand to that one because I think it took me years to be comfortable with saying I don't really need to talk to anyone in the next two hours. Like on the sessions, I don't have a target topic where I need to network with people. So I just, I leave, I find a nice coffee shop somewhere outside of the convention center getting some fresh air and just saying that that's okay. Also like once you get into the environment and you start to know more people, you don't have to go to every single social event after the conference because there's usually happy hours every evening. You don't have to go to all of them. Once you have a base network, then you can pick which one you want to go and the rest don't sweat on it. Because at the very beginning, I was like, my company's, company's sending me like overseas. It's a very expensive trip. I'm missing a week of work. Like the day job kind of work. And I felt obligated to go to every session, talk to people, go to every social event. And like if I stepped outside of the convention center during the conference day, then I felt guilty. So letting that go, yeah, let it go. It's very important for you to take care of yourself first. I mean, in addition to social awkwardness that I also deal with, I mean, it's just like crowded conferences like this. It's challenging to find time to talk to people because they're all busy, especially with speakers. And I've been on both sides of this. It's completely okay to say, could I message you on LinkedIn and have a call with them like a week later. And I actually did that with one of the speakers last year. This is one of the, like it was in the K building, one of the larger sessions. And he was just inundated with a lot of people. And I just said, hey, can I, can I connect you with you on LinkedIn? And then you're going to be in a more relaxed environment on Zoom. You just have a conversation about his talk or his background. So there are, you know, you don't, don't force yourself to just have all the conversations in two days. It's just very difficult logistically. So any other questions? Oh, go ahead. Yeah. So my question is, have you experienced any sort of a significant difference in terms of revenue working on a heavily open source type of project or job versus a, suppose a normal one, if there's such a thing. Thank you. The only thing I can say about like salaries and things, in my experience, that's more tied to geographic locations rather than, and well, the, the job role itself. Like if you're at a hyperscaler and I don't know VP position, then I assume you will not have money problems for the rest of your life. But, but at the same time, I, I really, my experience is I moved geographic locations and that affected my salary more than anything else. I have not noticed a difference. And to the, I have not to the degree that when I was putting my son through college and I very much focused on salary, which I don't anymore. But I was like 1% working on fully open source. Like I was in the 1%. Like fully open source, like nothing, like no proprietary software. So it's not, it does have a lot to do with the company. Different companies have different salary bands. So you're more likely to get more if you work at a big company and then a small startup startups tend to be a bit more weighted towards like stock options. So it just happens. But yeah, there isn't, there isn't really a difference whether it's open source or not. Yeah. And then I mean also comparing like, because I just asked this question because I worked at a foundation working at nonprofit versus like a for profit organizations. When they need to hire people, they need to be competitive. Like, I mean, if you're a nonprofit, you can offer like stock options. That's not viable, but they have to find other ways to make it appealing to attract good people. Right. So you can't be at a complete disadvantage like salary wise as an example. If that's, if, you know, that's, you know, that's my experience. Other questions. Any nine. All right. Cool. Well, just final thing I want to say. So if you want to connect with us, I mentioned LinkedIn. All of us are on LinkedIn and also on Twitter. If you want to continue the conversation, feel free and enjoy the rest of the weekend. Thank you. Thank you.
The Regulators Are Coming: One Year On
Okay. Testing, testing. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Yeah, there we go. If I call your attention. In the next session, we have a one hour and on the regulators are coming. Your chair for this session is going to be Simon Phipps, and he will tell you all about it. Welcome. Thanks for coming. There we go. Yay! Hi. So I'm Simon Phipps from OSI, and I'm part of a group of people from Open Source foundations that have been engaging with the European legislators this year to fix the issues that you all told Benjamin about after his talk at FOSDEM last year. And the TLDR for when you leave early is that thankfully Benjamin and Omar down here listened very carefully and have, I believe, addressed all of our concerns with the impact of the CRA on open source developers and open source charities. There are some remaining issues that are a little more complex to deal with, and they will be dealt with in some guidance that comes from the European Commission. So to speak to you today, first of all I've got Benjamin Burgle, who is a head of unit now, head of sector at DG Connect, and he was one of the authors of the Cyber Resilience Act and has been intimately involved in fixing it with us all year. And he is going to tell us all about the CRA. After that we're going to hear from Gail Blondel from Eclipse Foundation, who was also part of our group that was interacting with the Commission, and he's going to tell you whether Benjamin is telling you the truth or not. And then Omar is going to tell us the same things about the Public Liability Directive, and then Doug Villum from Apache is going to tell you whether Omar told you the truth, and then Enzo here is going to run an audience Q&A so you can ask these people all the questions that you want to. We've only got 50 minutes, so if your question doesn't get answered, come to our dev room, which is all day tomorrow, in AW1120. It's an open source in the European Legislative Landscape, and we're running four 2-hour workshops to give written feedback to the Commission on their digital agenda legislative program. So with all that said, Benjamin, thank you so much for coming back, and they've promised not to throw anything. So go for it. Thank you. Thank you so much, Simon. Thanks for having me again. It's been an exciting year. I was here exactly one year ago. Last year when I was here, I was presenting the Commission proposal, which is the first step of the legislative process. We as the Commission, we make the proposal, and then the co-legislators, the European Parliament, as well as the Council, which represents the Member States, they negotiate on the basis of our proposal, and now I'm here to report back after one year of negotiations. The text is almost done. It's quite stable. We still need the final vote by the European Parliament, so it's not entirely finished, but we are quite confident that what I'm going to present to you today is a rather stable version of the Cyber Resilience Act, the newest kid on the block when it comes to cybersecurity legislation. Last year I presented the proposal. I will repeat some of that this year, but I will focus much more on the open source elements, because there are much more open source elements in the final version compared to the original version. For those that weren't there, what is the CRA about? It essentially requires developers, hardware and software manufacturers to introduce security by design in their development processes. The cheese on the left represents a product with digital elements, as we call them, filled with holes and security vulnerabilities. On the right-hand side, once you've complied with the CRA, there will be way fewer holes, although we do acknowledge, of course, that it will be impossible to get rid of all the holes. That's just the nature of cybersecurity. Here is a brief introduction into the main elements of the law. As I said, it's about cybersecurity rules for the placing on the union market, the entire European Union of hardware and software products. We have three main actors in this legislation, the manufacturers. They will bear the brunt of those rules. They have to make sure that their products are secure, but then there are also obligations on other types of actors, mostly the distributors, so these are essentially either brick-and-mortar stores or online shops. They have to make sure that the products that they sell are secure, as well as importers that import from outside the union onto our market. The rules come in the shape of essential requirements. So essential requirements are high-level, objective-oriented, technologically neutral requirements for the placing on the market of the products. They are things like ensure access control, ensure the confidentiality, integrity of stored and transmitted data, and so forth. So you know all these are high-level. This is the cybersecurity 101 that we're essentially putting in the law. To make it more useful and more easy for manufacturers to comply with those requirements, the European Standardization Organizations, they will develop harmonized standards, and then you can use those standards to comply with those requirements. The European Standardization Organizations, essentially, they gather the manufacturers, so it will be the manufacturers themselves who will develop those standards. Depending on the level of risk that is associated with a product, there will also be different types of conformity assessment. I will explain that in a moment. I also want to mention separately that there are going to be reporting obligations, so if you discover vulnerabilities in your products that are being actively exploited, or you have an incident on your network that affects the security of your product, then you would need to report that. And finally, another important element, of course, is the market surveillance and enforcement. So all 27 member states, they will be required to set up their own national market surveillance authorities to check products and ensure that the products that are on the market are actually secure or at least compliant with the CRA. So these are the main elements. We are tapping into an existing framework. You've all seen it probably, the CE mark. So on your smartphone charges, for instance, you have the CE mark. The CE mark tells you that this product that you're holding in your hands is essentially compliant with all European product regulation. And in the future, when you see the CE mark, it will not only mean that you're compliant with safety regulation at the union level, but also with cybersecurity legislation, the Cybersecurity Act. So which products are we talking about? The scope is quite wide and deep. So when I say wide, I mean that it applies to all sorts of hardware and software products, such as laptops or operating systems. But it also applies not only to the final products, but to the components, because the nature of cybersecurity is, as you all well know, that often vulnerabilities and components can have an impact on the security of the final product. And in many cases, it is very difficult for the integrator who builds a final product to find all the vulnerabilities in those components, often components of black boxes, in particular when they don't come in the shape of open source. So they also need to be secured. And so all components that are placed on the market as separate products, they are also in the scope of this regulation. What is not in the scope? I already explained that last time, but it was not sufficient for you. I explained that non-commercial products would not be in the scope. And I think this has been quite an issue that has been discussed very lengthy. A lot of people have asked, what does it mean non-commercial in particular in the context of open source? And this is one of the reasons why for the last year we've tried to flesh out in more detail what non-commercial means for open source. And I can tell you that during the last year, barely a single day has passed by when I didn't wake up to a message from Simon, Dirk Willem or Enzo trying to help along with this process. So non-commercial products are not in the scope. I will explain in a moment what that means for open source. Stand-alone services, in particular software as a service, that don't come with a product that are stand-alone, that you just access through a website, they are also not covered. And we also have a few outright exclusions of products that are already regulated when it comes to cybersecurity, so they don't need to be covered by the CRA. And that includes, for instance, motor vehicles and medical devices. Okay, so just to understand, I said the scope is wide and deep. I want to talk a bit about what it means that it's deep, right? So when you are a manufacturer of a final product, in this case a smartphone, you will be integrating two types of components. On the one hand, like in blue here, components that you've developed yourself, as well as components here in yellow that you are buying on the market or sourcing from the market, and you're also integrating them. So you are responsible for the security of the entire product as a whole and for its compliance with the CRA. But when it comes to the components that you source from third parties, of course it's much more difficult to have assurance about the security. And for those components, we've introduced a due diligence requirement. That means that as a manufacturer you will have to do the utmost to make sure that the components that you integrate are secure. That can mean that you simply check the change log. Is this a component that is regularly maintained? You check the vulnerability database that are out there on the internet to see if the latest version contains any vulnerabilities. And if it's a commercial product and it is subject to the Cyber Resilience Act, you can also check whether it carries the CE marking. So this is how you can achieve that the product as a whole is CRA compliant. So now to the conformity assessment, I mentioned it earlier, and this is the first time I'm going to mention open source more explicitly. This is where it's explicitly mentioned in the text. For the vast majority of products, which we call the default category, manufacturers, they will have to undergo a self-assessment. That means that it's the manufacturer, Him or herself, that will check and ensure that the product is compliant. But then there are some products that are explicitly listed in the annex of this regulation that the co-legislator have considered as important or critical from a cybersecurity point of view, and they will have to undergo a more stringent type of conformity assessment. So first we have the category of important products, and manufacturers in this category, they will have to apply at least a harmonized standard, the ones that I mentioned earlier, or in some instances they will even have to submit their product to a third party to have it checked if it's secure and compliant with the law. So products in this category are for instance operating systems, antivirus software, or also firewalls. Then there are also critical products. They are also listed in the annex. These are products such as smart cards and secure elements that we consider to be even more important. By the way, only hardware products, no products that are softwares or nothing that is potentially open source. And for these products we may in the future even go a step further and require a certification of the products. Now when it comes to free and open source software, we have a special provision in the CRA that says irrespective of whether your product is important or not, you will always be allowed to undergo a self-assessment. So you will not have to submit any free and open source software that is in the scope of the CRA to a third party. And the reason behind that is that when it comes to open source, it's a transparent product, and anyone including the users or integrators, they can check for themselves whether this product is secure. So you do not need to have a third party that vouches for the product. Now we also try with the CRA to shift the responsibility from the developers of open source components to their integrators. Because so far integrators have often been free riding on open source components and not giving enough back to the community in terms of fixing vulnerabilities in these products. So coming back to the smart phone product that I presented earlier, right? So imagine a smart phone product that integrates an open source component. Here is a silly open source component that prints fruit onto your... that prints fruit. So far it was a one direction thing, right? So the integrator would take the component and, I mean not always sometimes, of course integrators also contribute a lot back. But in many cases they would just integrate the component into their own product and that would be it. From now on the CRA will say, if you find a vulnerability in your component, you have to inform the developer of that component. So that developer can also provide a fix to that vulnerability. In addition to that, since as a manufacturer of a final product, you are responsible for the product as a whole and in absence of a fix from the upstream manufacturer, you will also be required to provide a fix. I mean either you fix the vulnerability in that component or you replace that component by a different component. You just have to make sure that your product is secure. But if you do provide a fix, then you will also have to provide that fix to the upstream manufacturer so that the upstream manufacturer can integrate it. So this is how we want to share the burden on security between the developers of final products as well as the developers of free and open source software. So is your open source software project covered by the CRA? I think this is the question that you are all asking yourselves. I said initially the commission proposal said, if you are not commercial, you are out of scope, right? And now we fleshed this out in much more detail and we've even introduced a new type of actor. The open source software steward which I will also present in a moment to you. So if you are merely contributing to someone else's project, you are definitely not a manufacturer. You're not subject to any obligations. That was a worry that was expressed several times but here I can assure you, you can just keep contributing and you do not need to worry about CRA compliance. Now if you are providing the project and not merely contributing to it, the question is, are you developing in the course of a commercial activity? So if you're not, if it's really just a hobby project, again, you're not in the scope of the CRA. Now if it is in the course of a commercial activity, the next question is, are you directly monetizing that product? I mean, because we know that many open source projects, they do not directly monetize but they're still a wider commercial setting, right? Many companies coming together to jointly develop a component that they will use for their own products that's a wider commercial setting. But we only look here at the direct monetization of the project. If you're directly monetizing it, then you are a manufacturer and then you are subject to the security by design requirements of the CRA. If you're not directly monetizing the project but it's still taking place in this wider commercial context, this new type of actors introduce the open source software steward. So these are essentially foundations, not-for-profit manufacturers and so forth. Here we've invented a new very light touch regime. So if you are a legal person that provides support to specific FOSS projects on a sustained basis and these projects they are intended for commercial activities, then you will have to comply with the light touch regime of the CRA as regards the open source software steward. But if you're just a collaborative project, no governance frameworks to speak of, no direct monetization, then again you're not in the scope of the CRA. That means the vast majority of the open source projects will not be in the scope of the CRA. So I don't know how do we still have time. I can maybe quickly explain what the open source software steward will be. I already gave some examples, right? So foundations, not-for-profits, also companies that build open source for themselves for their own monetization or integration into their own projects, but then make it available to the public, they will all be open source software stewards. And I already said it's a light touch approach. It's not going to be heavy, but the idea is to place some responsibilities on these types of actors, but only responsibilities that they can also bear, giving the nature of their project and their organization. So there are basically three types of obligations. First, you have to put in place a cybersecurity policy. The CRA is not very prescriptive what that cybersecurity policy should look like. It provides some basic elements that need to be mentioned, such as supporting the community in providing information about security vulnerabilities, describing how you will mitigate vulnerabilities and so forth. Secondly, you will be required to cooperate with market surveillance authorities, just like any other actor in the Cyber Resilience Act. And thirdly, you will also be required to report incidents and vulnerabilities, but only to the extent that you are involved in the development. So if you're not involved in the development and you know nothing about the project and the vulnerabilities, then you will not be required to report vulnerabilities. Okay, so this was a high level overview of the CRA. Just maybe very briefly what are the next steps. So we are hoping to conclude the CRA very quickly in the coming months. The entry into force, I cannot be sure, but it will be roughly around middle of 2024, maybe a little bit later. And then there's going to be a three years transition period. During that three years transition period, the European standardization organizations are going to develop the standards. We as commission are going to develop guidance. For this, we will need you because of course the CRA is a high level legislation. Many of the concepts, they need to be fleshed out through the guidance. So I'm actually looking forward a lot to all your questions because these questions, they will help us determine what is relevant for the guidance. Yes, and then in three years time from maybe June this year, so maybe in June 2027, the CRA will enter into force. Thank you very much for your attention. Thank you very much. Thank you very much. Still got to turn it on. There we go. Thank you very much for all that. Now Gail Blondel is one of the leaders of the Eclipse Foundation. Eclipse has been speaking up frequently for the open source community in this legislative process. They've had two staff working on it quite a lot of the time. Deb Bryant and Enzo over here who you'll hear from later. Gail, could you come and tell us how the Eclipse Foundation feels about the state of the CRA now? Yes, thank you very much Simon and thank you Benjamin for the presentation. Well, thank you for coming. You see that went well. That was okay. So far. So one first point is that I think that we have always said that we agree with the goal of the CRA. That was on the first blog that was published by Mike on the topic. We agree on the goal but initially that was very scary. And I think that last year that was the conclusion of your presentation last year. Hey, come on. What are you doing? How can you put us on the spot like that? Because putting C marking on all the open source project was just not an option. One thing and that's very important is that we know that we have lots of open source developers that are volunteers. And even when they are paid to do open source development, what they focus on is doing the features of their project. And non-functional topics like security, etc. I think that as a community, as an ecosystem, we know we have to take care of that because we had lots of issues in the past. But having that coming through a regulation was something completely new to us. And yeah, even if there is a legislative process that is kind of obscure for most of us, I think that what's interesting is that to see that during the year, we managed to establish some enough connections that the co-legislators listen to the open source community. So I think that from your presentation today's obligation to push corrections, to push fixes, upstreams, also the fact that contributing people are not responsible for, have no obligations, etc. And also from my perspective, the introduction of a new kind of organization, that's the first time there is a regulation talking about open source foundations or those kind of organizations as something specific. That's very interesting aspects. But to conclude, and maybe that's an opening for the conversation after, is that that's just the beginning because we have mostly three years in front of us. And in those three years, so you will write guidelines. And hopefully we can collaborate well on writing guidelines. But there will also be the standards. And maybe from the point of view of the open source community, it says that the standard organizations have not been the best friends of the open source community. So that's how do we, I think that when you say open harmonized standards, I guess that a few people in the room say, hmm, it's unlikely we will like such things like an harmonized standard. So that's something we need to keep on our radar. And the fact that the regulators are coming, that's the title of the panel, I think that that's a good thing because that's also the fact that open source has won and is present everywhere. So we used to be under the radar. And now I see several faces from the European Commission in the attendance. You are here to explain to us and we have established some connections. So that's good things. And yeah, the conversation continues tomorrow in the panel, in the EU policy day room. And that's it. Thank you. So that's the CRA. Now, the CRA sets the rules for the market surveillance authorities. It says how countries are going to make sure their citizens are safe from the products that are being sold in those markets. When it turns out those products aren't safe, Europe's Product Liability Directive gives citizens recourse to have justice brought into their lives. And the Product Liability Directive has been in place for many years in Europe, but it doesn't give any liability to software producers. And so within boundaries that I will fix. So the European Commission is going to do something about that. Those big, bold, lettered disclaimers at the end of your software licenses do not apply in Europe anymore. And that's because the Product Liability Directive is being updated to give software producers liability towards consumers. And to tell us about that, we've got the legal and policy officer from DG Grow, Omar Anagi, who was one of the primary authors of the PLD, and he's going to tell us what's in it. So Omar, please. Thank you Simon, and good afternoon everyone. It's a pleasure to be here again this year. Same as the CRA. We are a year after, and we have now more than just a proposal. We have a legislation that still needs to go through the adoption by the parliament. But just as a small introduction, whatever has been said just before, let's try to forget it for the next 12 minutes, because it is not applicable in our case here. When we speak about the PLD, basically it applies to any type of products. The only element that is necessary is whether they are made available on the market, and made available on the market basically means any supply distribution of a use, and whether it's return of payment or free of charge. And the most important element is actually the commercial activity. I know that everyone asks always, especially here last year, those questions, what is a commercial activity? Unfortunately, I cannot tell you exactly if your own product or your own software is in a commercial activity. This is an assessment that is done by the judge itself. There are elements, the number of supply of the product, the number of use of the product, but this cannot be determined beforehand for the PLD, because of its own nature of safety net. So the assessment will be done for each individual product, even if it's let's take the more traditional product like a bottle, and you will have to look at the specific bottle and not the series of bottles to determine whether it is in a commercial activity or not. And I say this is the scope, but then we arrive to the product itself. Any product, the definition is really legalistic, so you don't need to really get that, but basically it's everything, and we have clarified that also softwares, raw materials, and digital manufacturing files are products on the PLD. There is no definition of what is a software, as you probably know, like the software 20 years ago is not the same than today. So the idea was to leave it as open as possible to ensure future proof safety net nature of the PLD. You asked me if SIS are covered, yes they are covered. The PLD disregards how the product is supplied, how the product is bought, how the product is used, how it is the model of the product, where it is stored. All of this is totally disregarded. Any software is covered by the PLD, algorithm, operating systems, AI systems, apps, whatever you want, all of them are covered under the word actually software. As Simon said, the PLD does not kick in. I mean, you do your job, and in the PLD we're not telling you how to do your job. The only thing that we are telling you is, nor the risk profile of your product, because if something wrong happens, and maybe none of you will ever experience the PLD in your life, if something wrong happens, someone has to get compensated for the damage. The damages are pretty straightforward. It's basically death and personal injury, including psychological health, the destruction of property, and the last one is destruction or corruption of data. Those are the three main categories of damages that need to be compensated. If there is a single of one of these ones, you would then have to compensate basically everything that is related to that. As I said, you will not have a case if there is no damage, and you will not have to face the PLD itself. Except in certain situations, you might have liability even if the damage has not yet occurred. Let's take a pacemaker. You know that the pacemaker has an issue. You will not wait that the person dies because of it. You will get preemptively the compensation, namely the damages of going back again for surgery and etc. I use the pacemaker because they are part of the wider range of medical devices, and medical devices also sometimes implies or include software. This is a specific situation in itself. When we talk about the liability, the question is for how long? The main rule is ten years. This is the general rule. Namely, if you place your product on the market, you may have it available on the market from the first day. This is when the time starts running. But as you know, a software might evolve, AI system for example as well. Considering that a software that was placed 15 years ago and has been changed through a lot of updates for instances, it will be kind of limiting to steer or only apply to ten years, because it means that someone who bought the software ten years ago or eleven years ago will not be able to cover the damages in case something wrong happens, although the software has been updated. We have also included a new starting period, which is when the product is substantially modified. I'm not going to go into detail. We're not explaining exactly what is a substantial modification. In most of the legislation, you will find what a substantial modification is, but roughly for software, I don't know if the CRA has a substantial modification definition, but for instance you will have to go under the CRA to see what is a substantial modification in case of cyber vulnerabilities. What we say is basically if you update your software and the update is as such that it changes the risk profile of your software, it is a new product, it is a new software and the time limitation starts running from that moment again. So each time you will change to that point or to that element of your software, you will restart the clock in that sense. If it doesn't, then it doesn't, and then you're ten years, ten years. The extension of the liability has also been put to 25 years in a specific situation, which are the health injuries. That shouldn't concern you that much, but just for you to know, it's basically pharmaceutical. That's the easiest one when you realize that you have some damages because of it, but it took more than ten years to appear. So this is a specific situation, but software you just know. We talk about time limitation and then we also need to talk about the exemptions. Exemption means that even though your product caused the damage, one of the three, you might be able to be exempted from your liability. There is a full list of exemptions, not going to go into details, but maybe two are important for you and I will explain the first one a bit later. If you did not place your product on the market, but it was placed by someone else, or the development risk defense, what we call the state of the art, which I think in your field is the most relevant one, and just to be clear, it's not the knowledge of the developer, it's the knowledge of the community, of the science around. And it's not about the known unknown, it's about only the unknown unknowns. Only in those cases, you will be able to be exempted from your liability. So just to take an example for you and maybe to make it as clear as possible, the PRD does not apply for any product when they are supplied outside of a commercial activity. This is the same for free and open source software. If your free and open source software is developed or supplied outside of a commercial activity, but someone decides to integrate it into another product, and therefore the product is then sold to a person and causes harm, the liability is pretty clear. The person will only be able to go against the integrator of the software, but not against the developer of the free and open source software that has been supplied outside of a commercial activity. That's a bit of clarity that is now in the text, which was not there before, but just for you to really understand how it will work. And the very last point is about, I know that you have clauses in your license. The PRD is pretty simple. No matter your clause, you cannot use it against a person, a natural person that is claiming compensation. So there is no leeway for avoiding liability. If it's a natural person, so me, you, anyone else comes against you, has a damage, asks for compensation, brings you to court, you cannot say that you had a clause in your license that said that you will not be held liable. That will not be accepted by a court. That's a general principle that works for everyone, and for any type of product, is to avoid that the weakest party, namely the consumer, suffers from an imposed contract. But what we have clarified in the legislation is basically, if you, a small company, very small company, decide that, okay, you sell your software to another company to integrate it, but you do not want to take over the liability. If this is your case, you can then have a clause in your license or in your contract. And in that case, the manufacturer of the overall product, the integrator of the software, will not be able to come against you in case he has compensated the natural person. What happens usually is the natural person goes against the name manufacturer of the product, and then it is that manufacturer of the overall product that will go against the other component manufacturer for getting part of the compensation. This would then not be possible if you have such a clause. So that's a bit of the small panorama of the PRD. So I leave you on that, and I hope you enjoy it. Thank you. Thank you very much. Perfect. I'll come get it from you now. And to respond for the defence, we have Dirk Willem van Gulak from the Apache Software Foundation. Thanks, Simon. So, yeah, so basically these were, so I think so like in many ways, what's happening here is that software is becoming, yeah, very grown up, and just sort of like a, I don't know, a phone charger or an electric drill, where sort of like being put under the same rules. Now, I think the positive news is here that in this process, the open source site, the development site, and also like the micro enterprises are largely sort of like out of scope. However, what I want to stress, and also want to stress about the CRA, is that it is a massive change for our industry. Even we as open source developers, we're not alone. We're actually part of that IT industry, and the PLD and the CRA will probably sort of like, or will absolutely affect our industries way more than they do open source, because the industry has to come to the table. The industry is basically squarely in the view of the CRA and squarely in the view of the PLD. So I think one thing sort of like, we can sort of like, yeah, be positive about and celebrate about is that all the worries we had last year around the CRA and especially about the PLD didn't really come to fruit. I mean, things are now sort of like, we've got a fair balance, I think. But at the same time, as an IT community, we've got sort of like some massive challenges sort of like they're left. And I think sort of like some of the questions of you may well be in that area. Thanks. Thank you. Okay. And so we're going to move to an audience Q&A. If you've got a question that you would like to ask the panel, or particularly the guys from the commission, then if you would like to raise a hand, there's a hand raised down here, and Omar is going to moderate for us. I'm Enzo, not Omar. Sorry, Enzo is going to moderate. My brain is gone. Yeah, go ahead. Go ahead. Please, yeah. Yeah. So we're glad that a lot of the concerns of the open source community were heard. We can't hear you. Yeah, okay. So we're glad that a lot of the concerns of the open source community were heard. But for Linux distributions, like for example, Debian, we will be exempt because we don't do anything commercially. But we are worried about our downstream users, which of course use Debian commercially. So for example, a lot of very small and very small local IT providers sell computers with Debian, for example, or do other business using Debian and integrating it into their products. And we are worried about how they will be able to comply with the CRA obligations because they are so small that they can't do it themselves. So it would be really hard for them. And also the margins in the computer industry are not that big that they can just say, okay, I'm going to employ somebody who's doing that. That's not possible for most of them. So that's what we want to have guidance for. And also it's really difficult for them to understand all these regulations and what this means in practice concrete for somebody who's, for example, just selling computers with Debian. Thank you very much. I think it's a very good question. I guess Benjamin, it's pretty obvious that this question is for you if you want to answer real quick. Yeah, thanks. It's a great question, I think. So indeed, if you are selling a laptop, for instance, with an operating system installed, if you're building that laptop, if you're the manufacturer of that laptop with the operating system, you will be in the scope of the CRA. And the due diligence requirements as regards the integration of the operating system, they will also apply to you. I mean, I explained before what due diligence means, right? So there are a lot of ways in which you can do due diligence. The CRA is on purpose not very prescriptive because we want to give a lot of flexibility to the integrators. But one thing is for sure, it doesn't mean that you can only integrate CE mark products. You can integrate any open source component that you like. And there is a myriad of ways in which you can demonstrate that the components that you integrate are secure. I think in a case like this one where the upstream provider, so the Debian project, is such a massive undertaking, I think it would be extremely helpful for your integrators if you provide them with useful documentation on how Debian as a software, how Debian addresses the various security requirements of the CRA. I mean, just because the CRA doesn't apply to you, doesn't mean that you shouldn't take security seriously and I'm sure you do, right? So I'm sure many of the things that the CRA requires, such as the access control and so forth, I mean, obviously modern operating systems like Debian do that. So if you document in a transparent manner how you are actually complying with security by design principles, I mean, you're essentially doing the work for your integrators and then they can just recycle that work for their own documentation. So their documentation doesn't need to be heavy anymore. Thank you very much, Benjamin. Is there another question here over there? Thank you. Yeah, this is a question for the Eclipse Apache foundations. Aren't you afraid that you have kind of doomed the software foundations in shielding the developers? Because when I look at this, the first thing that jumped out of me was, okay, I have to make sure that I'm not going to be a software steward. So if somebody wants to pay me for work, then the best thing I can do is dump the project into one of the foundations and make myself just a contributor. Thank you very much. Dirk, maybe first or Gail? Right, so I think the question is really like what do I do as a small developer, right? And this forced me to dump my projects in one of the foundations. And I think it's useful perhaps to turn this around. I mean, what is happening here is that society is asking the software developers to start producing good secure software to basically use industry best practices. Now in open source, we by and large do that. In fact, we pretty much set every industry best practice around security. And it's our downstream people in the commercial markets who are often not updating. I mean, we update log4j within 24 hours and then like now years later, it's still not being done universally. So I think to a large extent the answer to that question is that as developers basically, we'll have to sort of like get more systematic and more explicit about documenting the good things we're doing. And I fully expect sort of like that a year from now, two years from now, we basically all more or less have documented that in the same way. Because I mean, at Apache we've documented some of the things at Eclipse, at Python, we basically all doing the same thing. So yes, of course, we're going to steal each other's documents, right? It's open source. I mean, that's just the easiest way of doing it. And then indeed, basically, you sort of like get that foundation like style, all those things which are part of an open source steward like being sustained in the market, being responsible about these things. Yeah, simply then becomes much wider available. Thank you, Dirk. Yeah, just maybe to add something like I hear your point that, OK, if there is some constraint due to the fact that there is an open source steward, I absolutely want to avoid being in this situation or I want to make. I don't think that people or organizations bring their projects to a foundation just to avoid the theory or to do something like that. And that's the main point is more likely to set up collaborations or to have a vendor neutral governance or stuff like that. I think that's our main point in my opinion is that we help create consortia, but the open source steward is a good way to implement the requirements of the CRA in the context where.
Copyleft and the GPL: Finding the Path Forward to Defend our Software Right to Repair
I'm very pleased to welcome someone who many of you will know, Bradley Kuhn, I'll let him introduce himself and he's going to talk about copy left and the GPL and its role in defending our right to repair. Over to you Bradley. Thank you. I never thought of myself as a legend so it's kind of you to say that but I don't know if it's really accurate. In many ways this will be an update of the keynote that I gave at Fosdham 2017. I realize many of you may not have been to Fosdham from four years ago or whatever, more than four years ago, six years ago. If you didn't, I don't mind if you start clicking on the links, they're in my notes on the Fosdham page and you can click through those slides as I get started if you want the context because my goal here is to kind of give an update on what's happened in the last couple of years. I approach all of the things in free and open source software from the role of an activist which is frankly quite different from most of the kind of work that people tend to do. I am trained as a software developer, I spent many years doing software development including proprietary software development once upon a time and it's what turned me into an activist such that I'm not doing so much software development but instead trying to change the way that the technology industry works with regard to the rights that people have with regard to their technology. I really believe it's important if you're an activist to keep your goals and your area of activism narrow. It's not that I don't believe other causes are very important. Probably most of the causes that you would expect I am supportive of, I am. But I spend my time focused really narrowly on trying to reach a world where everyone has universal software freedom or the phrase that I've been using a lot more to refer to it is a universal right to software repair. You should have the right to fix your software and I want a world where everyone has that right for all the software that they use. As I get older I become more and more morbid, I was already morbid but now even more so and not a week goes by that I don't think about the fact that this activist cause that I'm working on has basically no chance of succeeding in my lifetime. I will die in a world that most people in the world are using proprietary software for most of the things that they do with computers and most people won't have the right to repair that software the day that I die. It's an unfortunate fact but it's expected with regard to work as an activist. Ultimately activism is not really just politics although it feels that way most of the time. Most days I end the day feeling like I'm some sort of politician and I don't particularly like that feeling but it's fundamentally different from politics because politicians really seek only to achieve what is politically viable and the art of politics is the art of compromising to a solution that everybody can agree on. But activists tend to, oh shoot sorry my colleagues are apparently talking about my talk and I'm getting notices about it. I think the big contrast to all of that in between activism and politics is that activism is really trying to reach something that's politically unviable that really doesn't seem like it could ever happen and the only similarity between the two really is that the goal of the activist is different but the strategies tend to be the same. So as an activist I have to find and every activist needs to find strategies that are politically viable but those are not end goals in themselves and it's a mistake to look at them as end goals in themselves and if you look at my previous talk you'll see I've talked about how copy left is not an end goal in itself it's a strategy to achieve software freedom. Now being an activist you often get insulted as it turns out. I've been called many times to be tilting at windmills. I counted at least ten times that's happened either after one of my talks and I actually always wonder if folks who have said that to me or actually have read the source material or at least the cliff notes. Here is actually an excerpt from the cliff notes on the book Don Quixote and I actually think it really explains activism quite well. Ultimately acts of rebellion or reform are quixotic because the reformer aims to undermine the existing institutions in order to change them and I believe that activism for software rights is designed to undermine the existing technology industry in order to change it. Copy left is a strategy a politically viable mechanism to bring about that change. Copy left is subversive, copy left is quixotic, copy left is counterculture. Now the interesting thing is that it also has to be politically viable and we also have to succeed at using it to push forward the long term, never happening in my lifetime goal of software rights for everybody. Now one of the things that's been incredibly difficult for copy left to continue to succeed and I think it has had some success but its success is in peril in recent years is because of the level of co-option that we now see with regard to free and open source software. Co-option is a process that happens in social justice movements whereby the mainstream takes control of the politically viable parts of the otherwise idealistic activist cause and uses them to seek its own goals and usually supersedes the activists in time without tremendous work to fight against it. You can actually look at the history of copy left and really see that the technology industry has done a really good job of co-option of free and open source software and copy left has faced challenges as a strategy because of that co-option. Since 2001 almost every large big technology company has had some sort of strategy to discredit undermined or otherwise work against copy left. You'll find that most big companies really really love non-copy lefted open source and free software and they really dislike copy lefted open source and free software. This is not necessarily, I was just going to say, it's reasonable that they take this position because copy left is designed to undermine the technology industry and they don't want it undermined. They want to continue to make proprietary software and copy left is attempting to prevent that. I realize a lot of you probably work for some of these companies and I suspect that last slide for those of you in the audience who work for one of those companies or the many others that aren't on the list, it creates a certain amount of cognitive dissonance. I want to be clear that it's not your fault. You're part of a system, as am I, that has systemic institutionalized problems that take away people's rights to repair and improve their software. No individual in the culture altogether can avoid this. If you take a look at the keynotes that Karen Sandler and I gave in previous years about the issue of trying to use only free software in your life, it's very very difficult. So it's not necessarily a surprise that many people are going to be in a situation where they're both working for and against open source and free software at the same time. In the end, these are the systems, these systematic problems in the technology industry that are the windmills that someday the activist cause of universal software freedom will take down but try to get comfortable with the cognitive dissonance that if you're doing work at a company, even if you're working on open source, but that company also does lots of proprietary stuff, you're both part of the problem and part of the solution at the same time. And I'm going to talk about a couple things at the end of the talk that you can do to help. Companies have been paying attention to what copy left does and how it works for a very long time, the entire history of copy left more or less. And this is the laundry list I came up with of basically the playbook that companies use to keep copy left from succeeding. It's relatively disturbing that you'll find, for example, on GitHub. For a while there was a GitHub employee who spent, I guess, their free time, maybe their paid time, not sure, going to various different projects that were copy left and trying to shame them into changing their license to a non copy left license. I don't think that's occurring anymore, but it was one of the many examples I saw where there was real pressure from the corporate environment to shame people who chose copy left, tell them their software was never going to be popular. And I've seen many colleagues who work for companies reassigned from working on a copy lefted project to a non copy lefted one, or who asked to have a project be released under copy left to be told they have to release it under the MIT license or something else. It's very interesting to me, too, that there has been such pushback in the free software world about assigning copy rights to nonprofit entities that want to pursue this long term goal of copy left and software freedom when most of the copy rights generated in software, open source and free software, are held by companies because your company keeps the copy right when you are an employee. And for Europeans, I'm talking about the exploitative rate, and moral rates really tend not to come into play with regard to copy left licenses. And what we've seen in my work doing enforcement and compliance with copy left licenses is that really companies have become incredibly brazen and simply ignoring copy left requirements entirely, or just promulgating interpretations that make no sense, saying that, well, that's what the source code is. Even though it's not the source code that they use, they make up a separate source release to give to you and claim that it corresponds to the binaries you've received, but it doesn't in fact do that. So there's a lot of activities that folks who want to take away software freedom are using to succeed. Now, we've had a lot of success doing various different actions around copyright law to succeed in making copy left work. I don't want this to sound like copy left has been a mitigated disaster. I think it's actually been quite successful. Howard Veltner for years was doing enforcement in Germany that was very successful under copyright law. He's no longer doing that activity. The organization I work for, Software Freedom Conservancy, had a successful copyright litigation regarding the Busy Box project that did gain software freedom on a few devices. We've learned a lot of lessons from doing copy left as a copyrighted focused endeavor. The remedies that courts offer for copyright rather are not particularly well fitting with what we're trying to get done with copy left. Typically courts offer either injunction, which is a prohibition on further distribution of the software and financial damages. Now, it's never been our goal to stop people from copying and redistributing copy lefted software. So injunction tends not to be that interesting other than leverage to get compliance. And similarly, a threat of financial damages is not normally what activists want. I don't want money for copy left infringement. I want compliance with the copy left license. I want users to have the rights they were supposed to get in the first place. And there's no amount of financial reward that can ever replace the importance of having the right to modify the source code of your software. Additionally, any copyright action has to be brought by the copyright holder or someone they've assigned to. This creates a weird burp. I think Harold Vulta was a true hero that he spent basically 10 to 15 years with probably half or 75% of his time just doing GPL enforcement to liberate Linux devices that had his code in it. But it's a lot to ask of any developer who would rather be writing code to have to go out and do these actions to get compliance with software that companies have violated the GPL on. So I've come to the conclusion, and many of us who do this work, that copyright was a pretty clumsy mechanism for getting what we actually want with copy left. The goal of copy left was to have a strategy that would liberate software for users. They would have the right to the source code. The source code would compile. The source code would install. But in the end, courts in copyright cases don't order entities to do the things that GPL says. So over the last few years, I've been thinking a lot about this quote that was actually the quote I put in the US you put in your high school yearbook. It's when you finish your four-year post-secondary education and you get to put a quote in your yearbook. This is the quote I put in my yearbook, which I should have went back and read long before, because in fact, always trying to do copy left enforcement as a copyright-based action was really a foolish consistency, because other options were available. In fact, when you look at the GPL v2, the word you or your appears 93 times. It refers to what you are supposed to be allowed to do, what your rights are. The phrase copyright holder appears twice. The GPL and copy left license general have never been about the copyright holder. They're about you. They're about the user. They're about the person who received the software and what their rights are, namely, their right to modify, repair, and reinstall the software. What the lawyers call this is they are third-party beneficiaries of the GPL agreement. For various reasons that I can't get into because there's only eight minutes left in my talk, we spent a lot of years only looking at copyright solutions and not looking at the third-party beneficiary solutions to getting compliance with copy left and getting source code to users who were looking for it. At Software Freedom Conservancy, we currently have a third-party beneficiary case in the United States that is going quite well. You can read about it on our website against a seller of Linux-based televisions in the United States. We're also engaged in some grant making to bring actions outside the U.S. on a third-party beneficiary basis with regard to the GPL. I believe this is an important step that we have to take because we can't, as the English phrase goes, put all our eggs in one basket. Copyright-based solutions for copy left are useful and they've succeeded in the past, but we shouldn't only be doing those. We should be looking at other mechanisms that we have in the legal infrastructure that we have for the defense of users' rights and software and using any of the mechanisms that we find and really all of the mechanisms that we find. The other thing we're doing in the Visio case is we're asking for this thing that lawyers call specific performance. Now, specific performance is typically used for things like a family heirloom or some other type of object or thing that has no equivalent and no amount of money can replace it. Well, the fact is we never felt that any amount of money could replace users' rights anyway, so specific performance is exactly the type of legal mechanism that we should seek with regard to getting the proper and correct source code for the devices that we have that contain GPL software. Furthermore, we don't even ask for financial damages. Our goal being that we're being very clear with the court what we want as activists. We want the right to copy, modify, and reinstall the software so that we can make use of it, so that we can take these computers, which just look like televisions, and use them for other purposes like every other computer was 20 years ago. Today, most of our computers can't be repurposed, and usually they can't be repurposed because they're violating the GPL. That's something that we want to change. Now, there's a lot of stuff that you can do to help with this. I know I fed you some cognitive dissonance early, but there's real steps you can take if you're employed doing open source and free software development. I mentioned that companies tend to require their employees to effectively assign their copyrights. Usually it's part of your employment contract. Now, I realize if you're unemployed and desperately looking for a job, you don't have a lot of leverage, but most people in the technology industry switch jobs voluntarily. I ask that when you switch your job voluntarily, make it a term and condition of your new employment, that you get to keep your own copyrights, that you can decide when your software should be copy-lefted or not, instead of letting your employer decide that for you. As I said, we still will need to pursue copyright-based solutions to open source and free software issues with regard to copy-lefts. Assign your own copyrights is a useful thing. And if you're willing, you can assign them to SFC, because I don't think it should then on top of that be your job to go out and figure out how to get compliance with the copy-left license after you've done all the hard work to get it released under a copy-left license. SFC is one of the organizations that's willing to do that work for you, and we're prepared to do it. You can also help on these two, and we have a new program that I'm announcing today at SFC that will hopefully really make it possible for you to make real progress in dealing with these issues. We're launching this new thing called Use the Source. Use the Source is designed to be a program where we can teach you and mentor you and get you involved in the process that we do to deal with all these millions of embedded devices that are violating the GPL. We've, as FOSS developers, have taken a FOSS development approach to it. We want a solution where volunteers can check the offers for source, because almost every device that you've bought in the last five years probably is based on Linux and probably has an offer for source. With the Use the Source community, you can post any source requests that you've made, submit whatever the company gave you, and work with the rest of the community to figure out how to get it to build, because the big problem is that most of these companies are not actually giving the right source code. They are not giving you the scripts used to control compilation and installation of the software, which the GPLE2 has mandated since 1992. You have so much power as consumers and users of these devices to put pressure on these companies that are constantly telling us, oh, it's going to take us six months to get you the source code. Oh, well, that's an old product, and we don't have to give you the source code. Not true. I want to and would like you to join with my colleague Denver and I to learn how to verify these source releases. It really hearkens back. If any of you were on Harold Delta's mailing list in the early 2000s on gplviolations.org, they're now defunct, but if you look back at those archives, you'll see that's what people were doing in the early 2000s to try to get companies into compliance as a collaborative effort. We want the issues of copy left to be something that operates like an open source and free software community. I hope that you'll join us in doing that. This was really scheduled to be an update from my previous talk, so I'm going to finish up there. I don't have a lot of time for questions. However, we have the legal and policy dev room that's happening this afternoon. There will be lots of discussion of issues related to copy left there, and I'll be there along with my other colleagues who work on this kind of stuff that you can ask questions there. Use the source was launched today, and so you can take a look at the website and read more about it there. I hope you'll donate to become a sustainer. We're a non-profit charity that we operate based on your donations, and I'd really appreciate it. With that, I think I'm completely out of time. I'm sorry I can't take questions, but please find me later. Thank you very much, Steve Bradley.
How open source projects approach Functional Safety
Hello everybody. So for our next speakers we have Nicole and Philip. So Nicole is from Electomatis and Philip from Bosch and they'll be talking about how open source projects approach functional safety. Thank you. Thanks and welcome. As the title says we're talking on how open source projects approach functional safety. There are so many projects out there so we just took three examples which will be the Xan Zafire and Linux part. So we do together so just to give you a short intro on who we are. My name is Philip as introduced. I'm currently a product manager for embedded open source inside Bosch. But the time where I'm speaking here I mainly do is the chair of the technical steering committee within the ELISA project which works on enabling safety and I'm also leading a work group there which basically puts the different pieces of open source projects together. Then I'm currently also a member of the Linux Foundation Europe advisory board which has been started last year and personally I'm open source into the S since like 15-20 years or something where I'm using it mainly. And this gives time for Nicole directly. Yeah so this works. So yeah I'm Nicole. Yeah I'm mainly a safety person. I started off in production maintenance so not maintenance like a maintainer that you might know but really like yeah something breaks and somebody has to fix it. I went then to software development, been a software developer and tester mainly always with some safety background in the automotive and the defense industry. And then changed to do more safety sort of safety centric work. About five years ago we decided we can do things better and founded ElectroMetres so that's where I'm currently at. Doing safety consulting a little bit of what we heard before the license compliance stuff. I'm in and out of the open chain project. I'm mainly involved with the ELISA projects in the system and the medical group. I'm in the SPDX project in a safety working group. The safety manager of the SEFA project and so on. So yeah if you want to know about open source and safety maybe we should talk. Yeah if you talk to me and I won't recognize you again. I'm not arrogant I just cannot remember your face so please just you know if you know me just tell me hey we know each other. That's completely normal for me that I won't recognize you again. So you can also text me so mainly each social media platform. I have the same handle Nick Pablo so yeah please feel free to reach out. So starting with the real thing. So yeah we're starting about functional safety so who is in the room familiar with the term functional safety at all. Oh that looks good. So for those who aren't so we're not talking security we're talking safety here so that part of the system that should not kill you in case of an internal fault. So yeah it's mainly it's a do not harm thing that the system behaves always in a safe way whatever happens whatever input you have whatever thing that breaks that's a part of safety and the stuff that you need to ship with your product to really prove that the system cannot kill you or most probably won't kill you let's put it like that. So with functional safety the main things that you do expect are yeah that software behaves as specified which implies already that you have a specification that it does not interfere in any way with the system components you have and all around us events are addressed somehow so that you even can avoid them or at least detect them and have some mitigation efforts in there and yeah that you have sufficient evidence that's really proof this. Right and I guess the next one is for me. So as in the title we talk about the three example projects so we have one as the Linux representative while there's so much more we'll later see this. We have a real time OS with Zepfire there are recently also others coming out which claim to have safety certification and may all be open source but this is the one in there and we have the last one which the Xen hypervisor has a virtualization solution and they all run under the Linux foundation projects and to just get an idea on who's in there so we have a large number of members within the Zepfire a few people in Xen actually and then the middle size in Elisa and as you could see from my introduction I'm coming from Bosch so we're doing a lot of automotive work also and with this respect I just wanted to show how different members are because if you think about like medical devices, automotive, railway, industrial parts that's where really safety standards come into picture and if you look into Zepfire it's not really our mobility or ever space member in there it's mainly hardware driven and so we have microcontroller companies, sensor manufacturers and so which really have this visibility in there while for the Xen projects you have some mobility suppliers but it's originated from the server process there's no real car manufacturer in there and what I had here in my also to mention is that they have a bunch of other members in there but we're just not supporting the project as a finance while if you look into the Elisa project there are mobility aerospace system providers in there so from OEM to one supplier base and so we will now go through the different projects one by one just say where they are and how they have common parts on safety and how they differ so the next slide goes already again wants to because she comes from Zepfire. Yeah so I'm not sure how many people in the room already know about Zepfire so Zepfire is let's say the coolest real time art class that you can find out there it's open source it's permissively licensed you can bring it down to a really resource safety as a safety size really even smaller than the Linux kernel we're currently heading for safety certification we have yeah the safety working group that now currently preparing and let's say enhancing what's there in the project so that anybody can take the project artifacts to go through their own certification or qualification or at least has a look into it to say hey that really brings everything that I need for my application at the moment yeah the safety awareness honestly in the community is limited there's a high awareness but quality but yeah safety sometimes really is the hot potato nobody wants to touch but yeah we have this working group and we're working on it and we're getting more and more support in the complete community. What's important still yeah it's posicompatible so for all that that come here from the automotive domain and think hey we need something it's you can use your posic stuff on that and the main part of Zepfire is hardware agnostic so there's a very strict hardware abstraction layer so it's really easily to port from one application or one platform to the other. So yeah San is you. And in contrast if you take San so we saw that it's a much smaller community but they are really coming from a strong security background they were used to have virtualization to isolate systems so in there from the beginning of the project they had a very strong quality mindset which you can see that they have every commit tested they have two different CIP pipelines for this testing they also have a strong rigorous quality process so that they really just have full commit traceability so from the first commit to the end you can see everything which happened which is organized there due to having it in official production also data centered it shows you how much care you need to take because there's a lot of chance for intrusion there yeah AMD silings are those which are mainly driving the San project they are also the ones which work on the safety certification and that's also what they said if you have an open source project you need to do this continuously right because traditional things are often safety certified for one shot and then you have hard times to update things so it's a challenge which many have to follow now and what they said in the first phase they show that actually open source is certifiable that's what they have as the certification concept approval and then they go for what for an assessment if you want to do the same thing in Linux just imagine all the distributions out there the flavors how you build create things life gets much more complicated this is really the open source superlative you have so many contributors such a large code size and everything and then also very large community much more flexibility how user use cases and also in this way I just took some by searching the web some examples from all the companies which making their attempt and how to do this either together with others or independent and it was having a first activity with the social Linux and P project so it was roughly a range of 10 years back already where they worked on it it stopped a little bit it transitioned into the Lisa project which I'm representing and it recently get a new momentum due to what's called the software defined vehicle where you have much more centralized high performance compute cloud connectivity and so on and there is a we want to have more open source usage and bring more things into the space but then the first question when they come to the Lisa project they asked to me when will you have a safety certified linux which I can simply use I was a well that's something which we are not able to provide directly because when you are in an environment you need to make sure that your system is safe and we cannot make sure that you do your engineering properly and we cannot also not guarantee that you follow certain processes which are required we can just give you guidance how to use them and the last thing is like oh will you just have one snapshot which is then safety certified and for that this doesn't make sense because you know how many fixes are getting into a product so there's most likely a vulnerability coming up things are connected you want to enhance features so we need to go on a continuous path and that's something completely new also for traditional safety part works and we always put the disclaimer in you cannot be relieved from your responsibilities so it's like legal notice here but yeah we have different projects which can provide fast forward and as we share same burdens on regulations that goes together certification will be the key part of it at the end and this gets complicated because certification is very expensive so it takes some authorities a lot of checks audits and so on and how this is financed you can see with three different approaches so there's for Zephyr their Platinum members which finance it they get the full access for Xanad's AMD Xilinx which is the business in there they mainly spent the effort in the project and as for Linux there are integrators so the strongest people in their involved is currently Red Hat and Coating they are really trying to bring the things forward and you can also see this difference in there on how open the people are so Nicole mentioned there is the workgroup on safety the safety workgroup when Zephyr it opened last year so there has been always safety activities from the beginning considered but to get a wider outreach this was opened also to the non-platinum members so that there's new inputs new activities the requirements management is a good example which came out of it and a little bit things keep behind the scenes because it's basically then a benefit also for the Platinum member to get the commitment and to guarantee that there is financing in Xan has the approach from AMD they are working on getting code MISRAC compliant which is some special activity which often was asked by automotive and they also provide documentation and their parts upstream so later on for Zephyr and for Xanad if you have the software at hand this is a software which is running on safety systems but you don't know how to use it because you miss maybe a manual or you miss certain test cases or you don't know how to really bring it into picture in the ELISA project or so we really focus on the enablement we don't do a safety certification and it will just enable others so we want to pre-helping pass why we do so code complexity is one reason of it so if you see we have million lines of code in Linux and small footprint in the artist in the hypervisor due to the nature of the software and I don't I put it in here and said it can be easier so I don't say it is easier but due to the small code size you have a much better chance in reaching these things faster but to get code coverage testing and so on you need trainings which Nicole for example does a lot with in the past and currently so maybe Shoddy can give a word on this yeah so yeah asking for what do the people contributing to the safety element have enough skills it's a it's a frequent question and we have different approaches in the different projects so in Zephyr we had a training just by a usual training provider for who was interested some while ago there is the option to have another training for that and we have two committees we have a safety committee and we have a safety working group which are full with as a seasoned safety experts that can guide and train and are just there and to to help people out with open questions the sand project had a little bit of different approach they are very centric about their code quality and about complying with coding standards so everybody who wants to contribute needs to know about Misra and they had even a training by boxing to do so and yeah they have mainly one safety expert that really spreads the word in Elisa we again have a different approach so for the complete community we have open web in ours that you can yeah either dial in or watch them again on YouTube there are the safety experts again in the working groups there are different working groups so there is no direct training provided but you can have the webinars and you can yeah just learn as you go from the experts in there right and then if you think well that's a lot of things which I have to do I just go with the traditional approach I save some money I take a normal artist but then I just took the Linux example you can read it from the left the right or from the right to the left so you don't have hard real-time requirements you don't have safety certification yeah these are some topics but you have a really rich ecosystem portability features experts any kind of hardware support it's two tackle complex products this something which you don't have with a certified artist because it's often proprietary and just for a very limited set of devices so it depends on which point you are you need to tackle both sides you need to see how does my system look like what do I want to achieve where is my complexity and how do I want to solve the complexity for this we put together different working groups within Elisa and created the largest system so it's also why we have this talk here where we really bring an on a microcontroller micro processor base we're bringing different Linux flavors and all put things together to showcase this is the reproducible system however you can make use of because from our mission we say we would like to give you elements processes tools software documentation which brings things forward and this especially means if you have safety critical systems you need to get understanding about the systems so this is what you need call wants to take about yeah so I think everybody here in the room will agree with me to say yeah to assess whether a system is safe you first need to know about the system you need to understand the complete system to see if a system or a subsystem in the system is safe enough so then you really need to understand if this element or subsystem that you want to qualify for safety is in this context of your complete system in a safe enough or capable enough to do what you want it to do really needs to yeah choose which features are important for your safety to evaluate them to qualify them and identify the gaps that you might want needs to fill yourself with your own application with your own basic software with your own whatever so in safety there's this approach for that called safety element out of context so I think the market approach is that you have a safe a generically safety qualified or safety certified system software component whatever that you can integrate into your system and it has been developed without knowing where it will go into so in fact it's not a safety element out of context safety element was an assumed context so whatever you do for example as a sefer or a son you assume what you will need for functional safety and you write this down and you work according to that so typically these elements are then come with obligations they come with a safety manual that you need to adhere to you prove that for your assumed context and that the features that you identified as maybe safety critical for your user that they are developed efficiently that they have requirements suitable implementation that there is test that there's completeness that there's planning and that there are these obligations of use when you want to integrate it to your final system and yeah sounds all very great but we still come over some community challenges or some general challenges when we want to bring open source to the safety world so we will we get a lot of pushback still from the safety world that yeah open source is not behaving like commercial software and we can't do this as a death sentence yeah it's true in the same in this open source project you have less influence on maintainers you can you're not the boss you can't tell them do this what you can do is hey I need this and I'll I'll propose the change to do this and I do it and we will do it together oh yeah it's not it's it's not a top down hierarchical approach there it's harder to bring skills into the community you can just say hey these 20 developers will get this training because they just don't care people tell us hey who so often who will be liable for a certified suffer who will be liable for this who will be liable for that so this is something that still needs more understanding out there because the community will never be liable whatever CRA will say we need a development process that somehow is present in the document saying hey I need to ship this with requirements there needs to be a system architecture there needs to be auditable code whatever that is so really to map map what a safety integrity standard needs from a product and how yeah I guess as we approach the last three to four minutes already yeah I just keep this one short so what we do is we cannot do this alone and we really try to find communities areas where we can collaborate so that's why the project really reaches out to all the different areas seeing what are related activities to go through development and really share ideas and for the sapphire for example can be something to learn from also for other communities yeah so then I also keep this very short because so yeah we try what we are currently approaching is to apply something like a v-model as a knowledge model not like as a process model to the sepah project to really have requirements and traceability and everything that we need so there's a lot of stuff there and there's a lot of stuff happening so we have already said we have two committees working on that there's this safety committee consisting of the platinum members that really do the strategic decisions about a final certification and there's a working group really creating the value for everybody that you can have all the artifacts and all the information that you need for a safety qualification this is a current snapshot of our requirements work we do requirements using stricter stand from stricter is over there and for everybody who wants to know more about that we do have a talk on sunday in the esbombe deaf room around lunchtime for that we also get asked what to do if you want to contribute same thing as always just show up please when you show up and you have ideas you have best practices please share them the communities are all open to that even when you don't know much about safety just show up because everybody will just tell you hey listen and learn and we do this together and when you're a safety person that wants to contribute or that wants to bring open source to their products yeah just become an ambassador for open source and safety because the quality usually is really high the functionality is very high there's a lot of stuff around that can be used and where we have the value through collaboration and not just through yeah purchasing license agreements and all that so we have many value here yeah final thing yeah there is no certification date set so please don't come to us and ask we can we can collect money and bet on that when we are ready we are very far in a lot of projects there won't be a certification for elisa because elisa should help you to certify or to qualify and this yeah elfen and sefer we are on it we are creating the stuff let's say soon and that's it and before you leave we have a little bit swag left over there but not the hat this is from someone else
Privacy-respecting usage metrics for free software projects
Hello, hello, everyone. Welcome to 4th stand 2024. And this agenda speaker is Will Thompson. And welcome him. Yeah. Hello, can you hear me? Great. So, so, how about this? Is that better? We'll see how we go. Cool. Hi, everyone. Thanks for coming today. I've seen a lot of really great talks in this room over the years. It's a real privilege to be on this side of the auditorium for the first time. So, a little bit about me. I'm an engineer at Endless West Foundation, where I've been for seven or so years. And I've been working on GNOME and GNOME adjacent stuff for longer than that. And today I want to talk about why it's useful for free software projects to collect usage data. I want to talk about how this can be done in a privacy respecting fashion. I'll talk about the Endless OS system for this as an example of maybe an existence proof. I'm not necessarily suggesting that other projects should take what we've built and use it directly, though, of course, you can. But I hope to encourage other free software projects on the desktop or other ways to consider adopting similar techniques so we can better understand how the software is used. I mentioned Endless, what's that? So I work for Endless OS Foundation. We are a nonprofit organization. Our vision is simple. The whole world is empowered. And access to the digital tools of the modern world is a prerequisite for being empowered. So we strive to ensure access to these tools and create opportunities for underserved and under-resourced communities around the world. We do a lot of things which are not Endless OS, even though Endless OS is in our name. But it's Endless OS I'll be referring to today. So I'll talk briefly about what Endless OS is and what it's for. In brief, it's an immutable Linux desktop distro. Visually it's known with some modest customizations to suit our target users. The groups we work with typically have little to no previous computing experience, but they probably have user smartphone. You can download Endless OS from our websites and you can in some parts of the world buy it to be installed in OEM systems. But we as an organization are more focused on, okay, we are more focused on working with other nonprofits and with companies aligned with our mission to bring computing to underserved communities. So this might be partnering with another foundation to set up a computer lab in a disconnected rural village or we might work with microfinance organizations to make computers affordable to low-income families and so on. And in these contexts, there's often limited or intimate and internet connectivity. So part of the point of Endless OS is we pre-install lots of apps and lots of offline learning resources and we make sure the whole system is fully usable offline. So what do I mean when I say the word metrics? I'm going to use the word telemetry, metrics, analytics, usage data and so on interchangeably. Sorry if there are technical nuances of those words. But I'm referring to the concept of end user software, so that software that runs on a device in your hands, collecting data about how it is used and then periodically sending this to its developers. You might be saying that sounds a lot like spying. Please hear me out. I'm not talking about that. The other part of the title was privacy respecting. So you might be skeptical because when people talk about usage data, they're often talking about slurping up all kinds of personal data about each user, building profiles of each individual and then you sell it to advertisers, which is so the easiest way to explain what I mean is privacy respecting means the opposite of that. But the easiest privacy respecting thing to do is to do nothing. You don't collect any usage data. You don't have to write any code. You don't have to think about the ethical or legal issues with the data collection because you're not doing it. So maybe for a lot of projects, that's fine. And you might ask why? Why would you do this? Well, software is not made in a vacuum. Normally you're trying to help some group of people do something they couldn't do before. And so in order to build good software, it's useful to know how is your software being used. Is it being used at all? What hardware is it being used on? Which features are used? Which features are not used? And so on. And if we have this information, we can make informed decisions about how to build the software rather than basing it on assumption and guesswork and vision alone. The other strand to this is a lot of people are developing free software at work. I work for a non-profit and I would like us to continue to do the work that we do, to advance our mission and also to contribute to the open source comments. And part of doing that is to demonstrate that the work that we're doing has the impact that we are trying to have on the world. And the organizations we work with have similar needs. They need to justify to themselves and to their own sponsors that it's worth putting their time and resources into working with us. So having quantitative data helps to support the impact we're making. And you might say, okay, that's fine, but why don't you just ask your users, run some interviews, do some surveys, some usability testing and so on. Wouldn't that be ideal? And yes, of course, there's no substitute for actually talking to the humans who are using our software. It's quite rare, particularly in free software projects, to have the resources to scale that. And for some things, users are not consciously aware of the ways they're using their software, the software they're using. There are also limits to what you can learn from a half hour or one hour testing session as opposed to usage over time as part of doing your day-to-day work or life. It's very useful to find volunteer testers from the community. You can learn very interesting things from that. Those groups tend to also be quite self-selecting. So this will sue the results towards people who have a higher motivation to tell you what they would like you to do with your software. So ideally, you want both, I think. I think you want to talk to end users and explain the why behind what you can find in data that you have. And in the other direction, having data about how the software is used can drive the kinds of questions that you want to ask your end users. And essentially, every website online store, app, and mainstream OS provides something like this. I'm not arguing that we should do something just because everyone else does it. And hearing that a big tech company does something might often be a reason to do the opposite thing. But there are non-evil reasons to want to do this. And I think it's reasonable to assume that the people who are developing software free or non-free typically want it to be good and useful. And other projects have similar requirements and constraints to what I've just discussed. So even with more resources, you can't constantly interview your users. And we're often at a disadvantage compared to commercially backed software. The big ones are in people and time and money. All of these things are, of course, related. And I think that rejecting the idea of collecting users' data outright just creates more unnecessary disadvantages for ourselves. We should want to have the information that we need to focus the limited time and resources that we do have. And we have the opportunity to use the structure and the transparency of free software projects to do something that's actually better than the status quo in the wider industry. We want to respect our users and preserve their privacy while still being able to make better decisions and make our software serve them better. The kind of axiomatic thing here is we do not want to collect personal data. We don't want to track individuals. We don't want to sell that data or worse have it stolen through some database hack. We don't want to serve targeted advertising and so on. Of course, handling personal data comes with legal responsibilities as well, which if you can just not collect personal data, it's much better for everybody. So if you want to hold a word in mind, think tally not surveillance. An analogy to Cicassidy, who's here with his phone, is think about a library. So near me, our local library is run by volunteers. And you might imagine that one day you go to the library and there's someone at the door holding one of these little tally clickers. And for each person that goes through the door, they click it once. And this helps them to get some kind of measure of how well used the library is. Maybe they can collect this similar tally on different days of the week or at different times. And this can help them decide how they staff the library, advocate for more funding from the local government and so on. The other end of the scale is if you imagine someone kind of following you around in the library and they're going to look over your shoulders, say, okay, you've gone to the computer book section. You've gone to the children's book section. You probably have a child. Okay, watching what you're reading. Obviously, this is not hyperbole, but this is really not what we're talking about here. So sometimes you can get this kind of tally information from some kind of service that you control. So FlatHub is the de facto standard flatback repo. It has, we recently announced that it's reached one million users. So how do we measure this? Well, it's measured by a proxy. There is a runtime which most users of Flatpack we claim have at least one app installed which uses this runtime. And due to the way that Flatpack downloads updates, you can tell the difference between an update and a fresh install. So when a new point release of that runtime is made, you simply count how many downloads there are of updates for that runtime in a given period of time, say, a week after you've released the runtime. And this gives you a pretty reasonable lower bound on how many installations of that runtime there must be. And there was no identifier needed. We didn't need to look at IP addresses or machine ideas or anything, just having some knowledge of the ecosystem and how the Flatpack client behaves. And there are other places where this idea is used. So in Fedora, there's this thing called count me. EndlessOS has something similar. So with DNF is the package system for Fedora. And it has to periodically update the list of packages that are available. And so the approach here is that in one random request per week, and these requests would be happening anyway, an extra parameter is added, count me. And then it has a value which refers to how long it's been since you first installed your system, which gives you some indication of what retention is like of the system. Then from the user agent, it's possible to infer what the distro version is and what variation of Fedora it is and architecture and so on. And it's a clever idea to piggyback on the meta link request. And again, let's users be counted without personal data because there's a fixed frequency. So they published the aggregate data, which I've doctored a little bit to fit on the slide. So here, again, there's the fixed frequency, which meant that no identifier was needed. And there are also these kind of statically determined segments of the user base, which doesn't identify any individual. It identifies a massive group of systems. So the three main ideas for doing something else here is that we want to generalize this approach to finer grain data. But it's data that we wouldn't otherwise have because we don't get anything as a side effect of stuff that's happening entirely on your local device. In the library analogy, they do have this information, I suppose, of which books are people borrowing. It doesn't matter who's borrowing them. It's just in general terms what's popular. So the three ideas here, the first one I've mentioned already is sending on fixed frequencies. If you send information or rather record it on a daily basis, this means that you can be sure when you look at the data that if two different events on the same day appear, they must have come from two different systems. But you don't have to identify which particular system they came from and you can't tell which events were the same from week one, which systems they were. In the other axis, we're not interested in individual events. We're interested in patterns of usage data. You generally want to be able to compare those patterns between different groups of your users. Maybe it's for software version, maybe it's by local, maybe it's by hardware. It depends on what you're trying to learn. But these are determined ahead of time. They're static and they are common across a large group of users or devices, rather, I should say. The third piece is to do some kind of data, which is instantaneous or which you're collecting on a timer. This is easy, but for some things, it's kind of continuous data, which this doesn't work for. For example, app usage. You might want to understand which apps are used the most in terms of time. This is something where you might, on a given day, open and close an app several times. You need to do some kind of client-side aggregation to turn this continuous value into a single data point on a fixed frequency that you can record by itself. So the end-of-the-sometrics system, you'll be shocked to hear, works as I've just described. It breaks down into a few components, which I'll go kind of following the duration of the arrows in this diagram. We'll talk about what happens on the end-user's device, then how that's transmitted to a server, and then what happens once the data reaches the server. So for local event recording, we have a daemon which runs system-wide. It's a D-Bus service, and applications on the system use a D-Bus API to talk to the daemon to record when certain events happen locally. So some of these components are just regular system components that are doing things that they just do. So our updater, for example, you can see the red box in the bottom left is recording an event when an update has failed. There's also one extra daemon, this metrics instrumentation thing, which is for capturing just general stuff about the system, CPU information, disk usage, and so on. We actually also have a mediocre crash reporting system using this mechanism. It's not ideal, but it's better than nothing. And as we'll see, it works for a system which is intermittently connected. So each of these events has some kind of payload associated with it. So we'll zoom in on the red event from the updater. This is when an update fails, we capture some information here. We capture the time at which the update failed. We capture the OS version that it occurred on. We have this UUID. Now, this is not specific to this event that happened for this one machine. This ID is the same for all updater failures. It identifies the category of event that's occurred. And then we have a payload, which in this case is just a human readable and localized error message. And that's kind of gross. We have some nasty pattern matching to untranslate the string in some cases and take out the values that vary just to narrow this down. We transmit this the raw event because it was the only practical thing to do given the way error handling works in the updater. But it's still very useful. This is, from this we can determine, this is the most common reason updates fail if the disk is full. The updater runs in the backgrounds, so it's unlikely that people will be actively checking it. So it's useful for us to know, are there fixable errors that we can sort out somehow. I also talked about app usage. We've patched an MShell on NSOS to record how long particular apps are used for. And so this one is one which gets aggregated in the way I just described earlier, where you coalesce this continuous variable into you slice it by day and by month. Here I'm showing by day. And it's actually the metrics demon which does this. The shell tells the demon start recording event with this UUID and this payload. And then sometime later, when you close the app, it says stop. And the demon takes care of coalescing multiple instances of that into one in any given time period and slicing it if it runs over midnight or over the end of a month. Okay, so now we've got a load of events buffered in this demon. We have an on memory, in memory, an on disk buffer with a size limit. So we just delete odd stuff if we run out of space. And then if and when you're online, the demon periodically reports these to our server and then deletes the local copy of the event. This is an HTTP request. You might be saying, I've said there's no device unnotified. Yes, there's an IP address. We'll come to that. That's an artifact of the internet. And this this upload contains as many events that will be cashed as we can fit in a single request. Plus a timestamp. Actually, there's more than one timestamp. There's a clever algorithm to correct for incorrect clocks and a channel. What's a channel? So this is the kind of static segmentation that I referred to earlier. And on end of the S we have just a couple of things here. The lesson flags for is it a standalone install, a dual boots or a live system? Interesting. But the main thing is this image identifier. And so this is an artifact of how we build and distribute and the source. When you install in this OS, you're taking a disk image, which has been pre built with a load of apps in it, and you install that by just DDing it directly to your disk. You image the disk with the same image. And so we have custom customized these in various dimensions. There's this product idea, which is how this came to end up on a computer. Was it a download version? Was it an OEM partnership? Or was it another organization we're working with? Or is it a custom built image that someone has built using the tools we provide? There's some other stuff about the original OS branch that was installed and the hardware architecture. And then this is personality, which again is an artifact of the way the OS works. If you're pre installing lots of learning resources, you want them to be in the language that the user speaks. So we have different variations for different decals. And we have a basic one which doesn't contain all of the massive reference apps. And when we work with partners or in particular projects, we often make a customized version for that. And that identifier ends up in this personality field. So if you go to the website today, or in fact at any point since the third of January, and you download the French version, you will get this image. It has this OS product, which is what we refer to the download version. Some attributes about the branches, the time stamp of when that image was built, and the personality. And so any system installed having chosen French will end up with exactly the same identifier ever since the start of the year. So this is what's on my laptop. And I happen to know that there are only other two users of this one, and one of them is over there. That's a unique case because we built this specially for a bunch of laptops in the UK endless team, and we never publish this image. And that's an edge case. In general, the same OS image is used by many different systems. So we have submitted a batch of events together with the channel to the server. What happens? Well, first of all, we discard the IP address. We don't want that. The HTTP, the endpoint adds a yet more time stamps to this bundle of events and puts it in a readers queue. Now, something totally separate, which has no idea where this bundle of events came from, pulls from the readers queue, and it splits the events apart and stashes them into a SQL database. There's one table in that database for each category of events. So I talked earlier about the daily app usage event. So this table has a field for the day. It has a field for the app, and it has a field for the duration. So in this example, of course, in the real database, there'd be many more rows. But just by way of example, you can see there were two different GNOME terminal events on the 30th of January. So we do know that there are two different systems. We don't know if the Chromium user on the same day was either one of those two users or a third user. The next day, there's an event for GNOME terminal, two and a half hours usage. We don't know if that was any of those two or three users we've already talked about, or a fourth user. We also have this aggregation by calendar month, which has higher latency, but it tends to be less noisy. And these tables are not linked to a device identifier. They're linked to the channel that was associated with the event. So this has this image identifier which is shared between many systems. And so we can't match up which different events came from the same system. We can't even identify which different instances of the same event came from the same system. Of course, there's an element of trust in this, like the server could be behaving not in the way I described. The best answer we have for this is that we're not doing that, and the server is all open source. So you can go and take a look at what it is, what's on our GitHub, is what we run. And the system is on by default because we've designed it to be privacy respecting. When you first install endless OS, like many GNOME systems, you get an initial setup wizard, which takes you through some steps to set up your system. This is actually from the development branch. It looks a little different in the released version. There's a toggle for enabling or disabling this feature. The toggle is enabled by default, but nothing will be submitted until the user setting up their system has gone past this page and continued to the end of initial setup. If you set the switch to off, then nothing will be captured, anything that's already been buffered but not submitted will be deleted. Of course, you can control this later. Once the events have been submitted to our server, there's no way for us to delete the events for a particular system because we don't know which events came from which system. And defaults are very powerful. The overwhelming majority of systems leave the default enabled. You might say, well, of course they would. Everyone likes defaults, right? The point of this is to get more representative data about a large body of systems. The system relates no personal data, it's designed not to be invasive. Being on by default keeps us honest about that. We really have to be sure that we're not collecting anything questionable. And some people, you can see some number here, may prefer that we don't do this. Of course we allow that, but we don't force someone to make a choice. Decision fatigue is real, particularly during first boots. We've seen that people get scared off by the number of questions that are asked. What's a keyboard layout? So adding more questions which people don't have the context to answer is not necessarily helpful. I acknowledge that not everyone agrees. There are other opinions. This is what we do for now. So what if we learnt? Some people may have read a blog post that I wrote six months ago with some examples of what we learnt. So for those who have read it, everything here is new. Parental controls. So some time ago we developed a feature in NSOS to allow parents to disable access to certain apps which are installed on the system to control whether their child using the system can install new apps and to set age rating thresholds on those. As part of integrating this into GNOME, which is now upstream, this screenshot is from GNOME OS, and we added this to the initial setup flow. So it's to be more easily discoverable. When you create a new user as well as choosing their name and the username, you can tick a box which is a little out of focus. The box at the bottom says set up parental controls for this user. It's unticked by the form that people tick it. If you tick it, two things happen. One thing is that this three things. The user you create is a standard user, not an administrator. A separate administrator user is created with a separate password. And then on the very next page you're offered the option to choose which parental controls you want to apply to this child. Now in this screenshot, if you sprint at it, no controls have actually been applied. The default is that you have to actively choose which things do you want to enable to to restrict. Do you want to restrict access to web browsers? Do you want to turn off certain apps? Do you want to set an age limit on which apps people can install from GNOME software? And we instrumented this and a large minority just left the defaults. So 40-something percent of people who chose parental controls didn't actually enable anything. That doesn't tell us why they didn't do that. I mean you can come up with some good theories, but it tells us that there's research to do in this area and it can help us to guide what we do next with this feature. A tool. So GNOME 40 introduced a tool that's offered when you first log in and that's whether you've previously used an older version of GNOME on the system or this is a fresh installation. NDSOS 5 was the first release to include GNOME 40 and it looked, as I showed you earlier, very GNOME-y which is rather different to what the previous versions of NDSOS looked like. So we inherited this tool. When you first log in you get this prompt and if you choose to take the tool you get a tool which just briefly walks you through how to use the desktop. I was curious whether people actually take it so I added a very quick patch to instrument this. This isn't really a show me the code kind of talk, but just for an example this is what you need on the client side. It's legible. So the top line is we've just defined a constant for the UUID, we just generate an ID, and then you have the two lines where you create the payload which is a Boolean which is true if they chose to take the tool and false otherwise. And then we call this method on the event recorder class to record the event. That's all you need on the client side. This is a small C library around a small D-Bus API and there's geo object introspection around it so you can access it from JavaScript and Python and all other things. Then the server, this is using SQL Alchemy as the ORM. You define a table like this which has some keys that have a name of the table, the same UUID. Again this is for all events in this category. The payload and how to turn the payload into an instance or into a row of this table. It's a little annoying that you have to do database migrations to add or remove events on the server. It has the downside of having the data in these nice structured tables but there's an upside in that we can generate the documentation which is on read the docs of which events the server understands. So the results are in. We captured this bit of information from 35,000 systems and across those 35,000 systems about 19% chose to take the tool. My assumption was that more people who are upgrading would take the tool than new installs because if you're upgrading you're surprised or this looks a bit different what's going on. Actually it's the reverse. At the top row we see users who are fresh installations and 32% of those users took the tool out of 5,000 total in the period we sampled this. Whereas for upgrade and list OS 4 it was just 15% who chose to take the tool out of a total of 29,000. This is just a snapshot because now that we've answered the question we've deleted this data. We've erased the data from the database. We add the UID to a list which gets discarded as soon as it's received from old plants. We've also updated the OS to remove the three lines of JavaScript I've shown you so we no longer collect this data on up-to-date systems and we discard it if we receive it from old ones. This is the part where I talk about all the things that are subpar about this system and what we might do in future. The big one is it's actually really annoying to have the data split out in this way. All the app users are atomized and we can't answer questions like does someone who uses app X also use app Y? Is there any correlation between groups of apps that people use? We could of course submit one event which contains all of the apps that a given user uses but that starts maybe that's a bit too fingerprinty. It would be nice to find some way to answer questions like that without implicitly fingerprinting users. It's also hard in general to slice this in new dimensions that you haven't already chosen to slice by. One question might be whether parentally controlled accounts behave differently in some ways to accounts that do not have parental controls enabled. The parental controls flag is not part of the channel so we can't see for any other event whether it came from a parentally controlled user or not. This is all just a consequence of what you choose to slice the data by. I think the trade-off is worth it but I need to acknowledge that it is annoying to not have it identify. There's also some kind of indeterminate upload latency. The problem here is how do you know when you have basically all of the data for the last time period? It's particularly bad for monthly events. Today is the third of February. Let's say I left my desktop at home. I switched it off on the 31st of January. We can't submit any data for January until February has started because otherwise we might have to add a bit more to the tally after the fact and you can't do that in the survey. Now my computer at home is switched off while I'm here. I'll switch it back on on Monday. That's the fifth. That's a five-day lag. Is that typical? Maybe we can look at the timestamps when we receive the events but we can't do that because we don't store the received timestamp for each event because if we stored that we could figure out which events came together. You can probably imagine ways to solve this by reducing the precision of timestamps and I think that's true in general. There are some cases where we have more precise timestamps than we might like largely for historical reasons. There are some complications if you can't assume that the local clock is accurate. Of course NTP exists but many endless OS systems are used mostly offline and it's also quite common we found for the real-time clock battery to have run flat. So it's not that unusual for people's laptops to have a totally incorrect time until they connect to the internet and then when they go offline and run out of power it goes back to sometime in the past. There's a lot of research into how to randomize the data that's submitted. There's randomized responses, differential privacy. I'm sure there are people here who know more about this than me. We haven't really explored this but the basic intuition is that you add noise to the data you record. Suppose you're recording a coin flip, maybe the parental controls one as an example. In 50% of cases you just always say true and then in the other 50% of cases you submit the true flag. That of course changes the results you get but once you aggregate it you know that of the 100 responses you get you expect to see 50 truths just from the coin flip and so then you can look at the rest of the batch of events to figure out the true ratio without actually having to know if any of the data points which is true is really true. This might be a way to allow collecting more interesting facts without getting into personal data. There are lots more questions we might like to ask about the software we ship. There's questions like are most, desktop Linux systems, single user or do you have multiple different Unix users on the system. What are the common monitor configurations? How common is it to have an external monitor most of the time? Do people change this around? Do people have their screens arranged horizontally or vertically or in a cool circle shape? Do people use workspaces? How do they use them? Which GNOME shell extensions are in use? I could go on for an hour, I won't do this. I think this data would be much more interesting if we had comparable data from other GNOME distributions. I'm using GNOME as an example just because that's what we ship, insert project name here. Every distro reaches a different group of people. Those groups will have different behavior. For example, I would claim that the typical Fedora user is probably quite different geographically, perhaps economically, perhaps in terms of technical skills, the typical NSOS user. If we have a common structure of data that was shared between all users of a given project, we can compare how the same upstream software is used in different contexts. Other organizations who do this kind of telemetry have public dashboards of the aggregate data. I showed you Fedora's published data from their repo servers. Mozilla has this great Firefox public data report, which gives you sort of daily and active users, monthly active users, version statistics, locale, top add-ons, and this is all, you can slice this by country as well as looking at it globally. Steam has this very interesting hardware survey. They've made a different choice. This is opt-in with a pop-up dialogue and still anonymous. It's very interesting. The median gamer is probably quite a different user to the median desktop Linux user. Kind of, you know, a little tongue-in-cheek, haha, only serious. In December, Spotify publishes this thing where you can open your Spotify app and it tells you in like really garishly bad images if you are like one of the top 100 listeners to some artist. And you see a lot of people remarking when they do this that this is kind of creepy, they have all this data. It's very cynically free marketing for Spotify. Now, of course, that's true. It is free marketing for Spotify. Other streaming services are available. But it's also fun and sociable. I've had conversations off the back of this that I wouldn't have had otherwise. And maybe we can have free marketing too. But we could do this differently. The central entity doesn't know anything about any individual, but we could potentially publish percentile distributions. And then on the local device, you could fetch this and determine, oh, right, you are actually in the top 5% of the ability. Maybe this is a bad idea. I don't know. Anyway, just to wrap up, I guess, from a starting point, I hope to have made the case that telemetry doesn't have to be creepy. There are ways that you can gather data about how your software is used without being invasive and building profiles of your users. And in an industry where I think not enough thought is given to this, I think we in the free software community can lead by example. We can build something that is better and allows us to improve our software while showing that a better way is possible to the broader industry. And if we do that, we can make decisions based on the combination of data and vision. The two work together to make something that's really great. Tomorrow morning, there's a telemetry buff in Room 121 and AW1. I hope to see other interested parties there and for people to tell me all the prior art that we didn't know about. That would be great. Hope to see you there. Otherwise, that's all I've got. There's some various links. If you follow me on Masterdome, don't expect too much discussion of this, but you're welcome to come. My blog has an older write-up on the same topic, which has some more and some less details. And the source code is on GitHub under the endless M organization. The name of the server and the event recorder is the service that buffers and submits the events. EndsOS.org is the place to go for more information about the EndsOS foundation and our work. Thank you. Hey, hey, hey. Does anyone have some questions? Please raise your hand. Oh, okay. I was wondering, you showed us that 10% opted out of sharing metrics, but how do you know that? So, in case the question didn't come across the PA, I think if I'm right, the question is, I said that 10% of people opted out. How can I know this? So, I mentioned that we have a system similar to Fedora's count me system. So, it sends a daily ping with a retention counter with no other identifying information, plus there's a Boolean, which says, is the full-fat metric system on or not? Okay. Thanks. Does anyone have some problem? Oh, I see you. Thank you. Hello. Hello. So, your talk has mainly been focused on how to get metrics. Sorry, I can't quite hear the question. Sorry. I couldn't quite hear what you were saying. Sorry. Is this coming through? Yeah, okay. So, your talk was mainly focused on anonymous metrics effectively, making them as unidentifiable as possible. And you did say that one of your problems is if you wanted to aggregate, if you wanted to sort of correlate these metrics to kind of figure out, okay, if person X uses this app, do they also use the other one? Have you given any thought internally on how you might do this in a way which wouldn't impact privacy? You mentioned fingerprinting, would it be one concern? Have you elaborated on that at all? I didn't touch all of that, but I think you're saying that I mentioned that we would be interested in knowing. This is probably focused on an anonymous system, and so this is one of the reasons we can't answer the question, who uses both app A and B? And you'd like, I think if I understand the question right, it's do we have any ideas for how we could do this? Effectively, yes. Okay, there's a few ways you could do this, right? One idea that we haven't explored, but I think would be interesting, is to layer onto this an opt-in system. So you could prompt people to be part of a time-limited study, and you could temporarily add something extra to this channel, which identifies them specifically for a fixed period of time. Then we'd turn it off on the client side, analyze it on the server, and then delete it. Then I think, so you think you can, it's easier to add more stuff to the channel than to remove it. And the other way to do this would be to look at some of these differential privacy techniques, and then submit a single event containing aggregate app usage for all apps on the system in any given week, let's say, but add artificial noise to that. So with some probability, change the numbers, replace the names of the apps, remove apps from it in a more systematic way than just shuffle it around. And there are techniques you can use to add noise while keeping the distribution of data the same. We haven't had an opportunity to go into that, but I think that's probably, in the general case, the way to address the points. Maybe there are other ideas. I'd love to hear more. Thanks. Any questions? If you have any questions, you can raise your hand. Hello. We still have 10 minutes to ask a question. 10 minutes left. Any questions? You can raise your hand. Okay. Okay. Climb into the speaker. Thank you very much.
Ingesting and analyzing millions of events per second in real-time using open source tools
Hi, I'm Javier here. This is going to be, oh, is it too loud? Okay, yeah, sorry. So yeah, I'm Javier. If you can find me on Mastodon or Twitter, I don't really use Mastodon, but it's Fosdome, so I have to put it there. But Twitter is better. It's not better, it's just where I hang around. Anyway, this is already too difficult. I don't have any slides. I have this gist that I will be scrolling, but this is going to be mostly a demo-driven talk. So I'm going to be speaking today about a template I've created, so you can start doing a stimuli that is using different open source components. I'm not a tab vocate. I've been working with data for a long time. 10 years ago, it was actually my first Fosdome. I have the teaser to prove it. I was speaking at the time about FAST data. I was speaking about credits. And today I'm going to be speaking again about FAST data. And last year, I organized the FAST and streaming data at the room, so I really like data, and I wanted to share with you some of the things I've learned about working with data. So 10 years ago, it was difficult to work with streaming data. I put in the presentation that you can work with millions of events per second, and we have some users doing that actually. I work for a company that developed an open source database, but I wanted to speak about all the things today. So some people really need to ingest millions of events per second. Most often you don't have to do that. You are happy with a few thousands, hundreds, tens of events per second, whatever. But 10 years ago, it was really not that easy to work with streaming data, because actually many of the technologies that you have to see with FAST data, it didn't exist at the time. So for example, 10 years ago, you have Cassandra and Redis. They were already available. They were just FAST databases. But even Apache Kafka, I will be speaking about Kafka today in case you don't know it. Kafka was only three years old. And pretty much every other technology that you will consider FAST or real time today, in 2014, it was either not a 16 or just being born. Things like Spark Streaming or Apache Flink, which are super cool today. Grafana, I'm going to be presenting about Grafana also today. QuestDB, which is the database where I work. At the time, it has a different name. But it didn't exist. Even large proprietary people like Google Cloud or Amazon Kinesis at the time were not offering streaming data. So what I'm trying to say here is that 10 years ago, working with streaming data, it was not really a thing. Some people will do it. Twitter, for example, was doing streaming data at the time with very interesting technologies. Facebook was doing streaming data. But it was not so useful to work with streaming data 10 years ago. But you are thinking, that was 10 years ago. Now we are in 2024. It should be this year. Well, it should be this year. But it's not always the case. So let me just for a second. 10 years ago, if you were doing streaming data, you would do like micro-batches of data. The streaming data platform would be a database. It would be Postgres. 10 years ago, maybe it was MySQL. Sorry, Postgres. But 10 years ago, MySQL was kind of... But it would be Maria de Veo, Postgres, or Cassandra. But you will have the data platform was a database and a CSV file. That's it. Okay? But it's not really a pipeline. And a CSV file was not super cool. So then we had some innovation. So we added an extra step. It was basically adding Excel. We reinforced them. So maybe OpenOffice, LibreOffice, whatever. But basically, the data platform was a database and a spreadsheet. And this was the data platform. I was doing reports for like, you know, interesting people. And that's how you would do things. It was not really a data platform at all. But then, this is not working. That's fine. I can do it with this. But then, at some point, I will start doing the demo. Don't worry. But at some point, we decided, and that was the first thing I wanted to talk about today, we decided that sending all your data directly to your database, it might not be the best idea. At some point, we decided that it could be a good idea to decouple the ingestion of the data with the storage analysis of the data. For many reasons. For example, because if you send your data all the time to your database and you need to restart the server for any reason, you have a problem. You cannot really stop the database if you are sending data all the time. Or maybe your database is not super fast, or maybe it is. But you know, you can have something in between. Or maybe you want to send your data to your database and to somewhere else. And it will be cool to have something in between to send the data. So we started seeing things like Apache Kafka and so on. I will present about that today. But that was the first step. Starting sending data first to some buffer and then to the database. Then we started seeing dashboards, not really Excel, but dashboards like Tablo, Grafana, other things to present the data. That was already interesting. And at this point, we spent a few years with this architecture. You have something like Apache Kafka or Rabbit NQ or whatever to have data in gestion. You will have a database, could be a PostgreSQL database, could be a different database specialized in working with faster data like Cassandra or other databases. And then you will have some business dashboards. And that was cool. But then it was the time of more real time of more advances. And it was like, now that I can analyze data, I also want to predict the future. I don't want to understand only what's happened. I want to understand why it's happened and what's going to happen next. Spoiler, it doesn't work. But anyway, we presented walks and today I will show you some examples of time series forecasting. But that's the idea. So it was not enough already with doing analytics. It will be analytics and some machine learning. And by the way, the dashboards got real time. You didn't have to be reloading with S5. I had at some point, and this is true in my browser, I had an extension that the only thing it did, it was reloading the page every five seconds. So we could have real time dashboards. No, that's the thing because they didn't reload. So you had that. So then it came people like Grafana or maybe Looker. And you look at it because it's not open source. And then you started seeing real time analytics. And that was pretty cool. And it's like, okay, this is looking something interesting. But yeah, then we have, of course, observability, monitoring, because if you want to go to production, you need to make sure things are working, not only on your machine, but in real time. And that's kind of the thing. We started in more and more things. And if you want to start today to ingest streaming data and less streaming data, you have all the components, but there are a lot of components. And they have to work together. And if you have never done it before, it can feel overwhelming. So that's what I want to be talking about today, how we can build a platform that can do all of these things. So this is what it looks like, like everything everywhere, all of the ones, like that movie. So that's basically what it looks like. Today you want to have an IT platform at the very least, very likely you have a dating system layer. The data from that layer needs to go somehow to your analytics database. It will probably have some listeners and application. You might want to send data also directly without the buffer, because your application doesn't support it, whatever. Then you have your database, your machine learning data science workflow, your monitoring workflow, your dashboard. And this is kind of the thing. So as I said, I call this the everything everywhere, all of the ones. And if it's the first time that you are working with streaming data, this might look to you the same way, look in the movies, like, what's this? Like, I don't know, this doesn't look natural. It looks like very weird. So that's why I decided, after working with streaming data for a while and suffering these things, I decided it was time to develop some easy example that you could deploy and if you don't have it, if you know about real analytics, don't use this. But if you have no idea about real analytics, if you've never done it before, this template gives you everything you need, everything. It gives you a few components you can use to capture real analytics end to end. Turning gestion to dashboard with monitoring with everything you need, with data science to do real analytics, and everything is open source in this template. So this is what I want to talk about today. What is in the template? And basically, I want to do this for a demo. I don't want to be speaking on it. I've been speaking on it for ten minutes. That's it. I mean, I want to be speaking now, but it's going to be with things moving on, with code and everything. So this is what the template, it looks like very blurry. But basically, this template is going to give you an Apache Kafka. I will tell you now about Kafka and why I chose it. But Apache Kafka, which is the ingestion layer, I have KafkaConnet, which is a component for getting data out of Kafka and into other places. The data I'm going to put it into QSDB, which is a fast and serious database, it's my employer. So I'm biased here. But I think they pay me. But it's Apache 2. So it's open source. You can use it for free. You don't have to pay to use it. But if you pay, it's better because then I get... Anyway, so I'm going to be using a database here. Mom, maybe I'm looking at this. I know I speak too much. Anyway, and then I'm going to be using Jupyter Notebooks, which are... It's Python on the browser, which is pretty cool for doing that science. And I will do some interactive exploration and some forecasting modeling. We'll have real time, that's what's with Grafana. On top of that, I'm using Telegraph to collect metrics and store them again on a database. So that's everything in the template. And this is what I'm going to be talking about today. So the template lives here. Until yesterday, it was on my user on GitHub. But that's of yesterday. We move it to my organization because they pay me. And we are open source and we are cool because FOSDEM. Yeah. So basically, I'm the single contributor. So basically, it lives here. Time series, stimuli, this template. The entry point is a Docker compose. So I know some people prefer other things, but Docker compose is cool. So this Docker compose, if you want to see what it's doing, feel free. But basically it's going to start all the things that I've told you before. And I'm going to show you what this looks like. So Docker compose up. If you're familiar, you all know Docker. Docker compose, in case anyone doesn't know Docker compose, I know, you know, better than me. But Docker compose allows you to start several containers in one go. And since networking Docker is a mess, the cool thing about Docker compose is that all the containers in the same compose file, they can talk to each other, which is cool. It's convenient. So if I do Docker compose up, look, looks. Things moving. So yeah, thank you for coming to this talk. So yeah, things are starting now. In my laptop, it starts fast because all the images are already here. You start from scratch, it needs to download a lot of things. It's going to download about one gigabyte of data. No, one gigabyte of data. But there's a lot of images here. But basically, I don't have any custom image. All the images in Docker compose are the standard, you can't know book image, standard Kafka. I didn't add anything. So it downloads everything in a wire connection. It takes about one minute or so to download everything and start up from scratch. So once you have this ready, we already have something up and running. Not much. It doesn't look like much. But since I'm moving there, and I want to show you what we have. So the first thing we have, I opened the other browser. Not you, Chrome, not today, not today. It's my default browser, sort of like that. I feel terribly bad because actually, Firefox was giving free cookies before and I took one. So thank you Firefox, I'm using Firefox for the demo. So the first thing you have here, it's, this is Jupyter notebooks. I'll show you to run Python, also the things, but Python from the browser. So the thing I have here to see like easily, I have created a script, which is going to read data from GitHub because open source. It's going to be reading public data from GitHub and it's going to be sending data to Kafka. And it's going to go through all the steps. And it's going to give me a nice task board. So if I execute this from the browser, it's going to be, it's going to be calling the GitHub API and every 10 seconds, it's going to be getting new data. I do it every 10 seconds because there are some great limits. I wanted to make sure I'm within the limits, but every 10 seconds, I get data from GitHub. It's going to go through all the components and eventually it will read the data. I'm going to tell you now all the things. And here in this dashboard, if everything was fine, I cannot see any data, that's not looking good. Let me ask for a second. We still have data, yes? Yeah, we do have data, actually, I guess. Table doesn't exist, table doesn't exist. Not you, Chrome, sorry. That's why I don't like it, man. It's always there, it's always trying. So, okay. Let's just for one second to check if I have data. I'm going to go through all the steps. I'm going to go through all the steps. Just one second to check if I have data coming into Kafka. It's funny because I was testing it earlier, like right here. Let me for a second what I have. I will tell you what I'm doing once I know what I'm doing myself. I mean, I cannot be more, okay. So, just for a second. What are you doing here, Javier? Topic? Can I see which topics I have created? I have... I will tell you in one second what I mean. The moment it works, I will tell you everything about it. But, okay, the topic is here. And... So, basically now I'm trying to see if I have some data entering into my Kafka platform or not. Yeah, data is coming here. So, let me go to the dashboard again. Oh, yeah. So, I don't know exactly what happened. I started this grid from the command line. But, basically, this is what should happen from the beginning. I have data which is going to Kafka. And I have a dashboard that is refreshing. No hands here. And in five seconds, it's going to have new data. That's kind of the idea, okay. So, this is, you know, this is like high level overview. What I want to do now is telling you all the different steps and how we can see what's happening at a different point in time. So, the first thing is... I told you already, I have here a script which is sending GitHub events to Kafka. So, Kafka, it's a message broker. It means you send messages, events, what... A message can be anything. In this case, I'm sending JSON because it is here. But you could have, like, Abro. You could have, like, anything, plain strings, whatever. You send events to Kafka on one end. And then, different consumers can read those events. It sounds super simple, but it's very powerful. First, because Kafka is very scalable. You can use Kafka at any scale. If you have a lot of data, you can add many servers. And, you know, it scales horizontally. It works pretty well. In Kafka, you don't have tables. You have something called topics. So, when you send a message, you publish the message into a topic. And then you can have any number of consumers reading messages from that topic. And they can read messages in different ways. You can choose to have each consumer is going to see all the messages from the topic. You can have them... You can create a consumer group. They read collaboratively. So, basically, you can have a topic in which consumers are reading data, and all of them are going to see all the data. You can have a topic in which consumers are going to be reading in parallel, different parts, so they can collaborate. You can have a topic in which some consumers are reading from the beginning, the others are reading from the other part. You define a retention period for each topic, and you can replay the stream of events at any point until that retention period. So, you want to replay what happened. You can replay again. So, it gives you a lot of possibilities, and it's a very good way of decoupling the ingestion layer. So, when you are working with Kafka, basically, what you do is like you use any Kafka library. And as I showed you here, all you have to do is you create a producer and you send your events, in this case, in JSON, to a topic from that producer library. On this notebook, I have the code only on Python, but in the template, if you defer the languages, you have the ingestion in multiple languages. I did this in all the languages that chatGPT could help me with. I did it in Python, and then I told chatGPT, do this in Node.js. Didn't work. Hey, I get this error. Anyway, so it works. So, basically here, you have the same code in different languages. You can do the ingestion using Node, you can do the ingestion using Java, using Rust, which is basically right now from the command line, and here sending from Rust, actually, they tie into the topic. So, right now, we ingest data into a Kafka topic. But Kafka itself is not doing anything with not-you-chrome. And this is... No, sorry, it's my brain. It's where, like... In the process in Kafka, this is the first step in the process here. From GitHub, we send to Kafka. Okay? But from Kafka, the messages don't go anywhere. Kafka, by design, is not pushing messages. There are some message brokers that use the push model. In Kafka, we use the pull model, which means if you want to read messages, you need to tell the broker. You need to tell the server in Kafka, hey, give me the following batch of messages. The people in Kafka decided to do it that way, because if you're going to have multiple consumers, it's a better way to work. If you are pushing data all the time and you have a low consumer and you're pushing data, the consumer eventually is going to be overwhelmed, it's not going to work. So in Kafka, you need to be pulling data. So you will need some application calling Kafka and saying, hey, you have new messages in the topic, give me the messages, I want to store them somewhere. For the first years, that would be how you work with Kafka. But that was annoying. So Kafka created something called KafkaConnect, which is part also of the open source Kafka. In Kafka, there is a like in pretty much in many projects, you have a part which is fully open source and a part which is proprietary. The things I'm showing today are only open source, of course. So KafkaConnect is part of open source Kafka. And what it does, it gives you connectors to read data from Kafka and write into multiple destinations. In the audience today, we have my colleague, Yaramir, I don't know exactly where he is, but Yaramir actually wrote the connector to write data from KafkaConnect into QSDB. I wouldn't know how to do that, but you know. So basically what I have here, I have a connector which is sending data from Kafka into a database. This KafkaConnect, it doesn't have any web interface. It has just an API. But I can tell KafkaConnect from the notebook. I can tell KafkaConnect, which plugins you have available. By default, you have only a few plugins. In this compose, Docker compose, I've included the driver, the GAR file to send data to QSDB. You could be sending data to, I don't know, to Amazon S3, to Hadoop file system, to Clickhouse, to Postgres, to whatever. In this case, I'm sending messages to another Kafka. So in this case, I'm sending messages to the KafkaConnect. So I have this connector available. And then what I have to do, I have to configure if I want to send data, I have to tell Connect where to get the data from and where it's going to send the data to. So in this case, I have a configuration that says I'm going to be reading data from a topic called GitHub Events, which is the topic in which I'm sending data. I'm going to be sending the data, I'm going to be sending data to this host and port. And then its different connector is going to have different parameters. At the very least, you will want to configure the format of the messages. In this case, I'm saying that we are going to be getting JSON messages in JSON and I want to output strings. So different connectors will have different options. And in this case, I'm also telling the timestamp name, I want to rename it to create it. So those kind of things, basically. But you know, it's connected with different options. But in the end, what I'm doing here is configuring how Kafka Connect is going to be reading data from a topic and how it's going to be writing the data at the destination. Are you still with me here? Yes? Cool. So destination is a database. And that database is called QuestDB, as I told you earlier. QuestDB is my employer. So, you know, I only have good things to say about QuestDB because it's actually pretty cool. So, in this, this is the instance running in Docker. And you can see here, it has multiple tables. Because I'm writing also monitoring. But basically, the tabling with Wayne-Gestin data is this one. GitHub Events. If I do something like select count from GitHub Events, we have only, like, you know, a thousand events, which is not too much. But anyway, it is what we have right now. So, this is what it looks like. We have, like, push events from different repositories, from different people, at different time stamps. And this database, this QuestDB thing, it's an open source database which is specialized on time series. Time series basically means I have data points with a time stamp and I want to do aggregations by time. So, we speak SQL, but a cool thing, I believe it's a cool thing in QuestDB, you can do things like this. I can say, I want to have the count of how many messages I'm getting in intervals of, for example, I don't know, every five seconds. So, I can say this. And what I get here is like, for each five seconds, I'm running this aggregation. So, basically, this is like a group buy. But instead of grouping by, you know, any dimension, I'm grouping in time intervals. So, this is the cool thing about a time series database. It gives you a lot of things for, you know, working with time. It gives you a lot of interesting things for working with time data. Or you could also say, I want to have this, but by repository and time stamp in intervals of five minutes. So, here I will have for this different repository I'm going to have for each five minutes interval I'm going to have the number of events for that repository. So, you can start to see how this can be interesting. You can also do things like join two different tables by approximate time. Maybe in one table I'm getting data every few microseconds. In the other, I'm getting data every two hours. But I want to join the closest event in time. Doing that in SQL, it can be done, but it's like, you know, a bit query. So, here is like, okay, just as of join. So, we have this kind of extensions. And the main thing about QSDB is that it's built time series data tends to be fast. So, it's built for a speed. In this case, I only have not even 2000 events. So, the 1000 events sounds like nothing, but if I go, oh, not the most, if I go to we have a demo site and I will stop speaking about QSDB in one second, I cannot write. I cannot even type the name of my database. Okay, that's better, yeah? Demo... Okay. See, eventually we'll note, internet here, I have the... The Wi-Fi here is not really great. When it comes back, I will show you which is totally why I was not getting any data earlier. So, in this public site we have some demo tables and I have one table which is a bit bigger than the one I just showed you. So, this table is not huge, but it's already 1.6 billion records, which is not too bad. 1.6 billion records is not too bad. And a cool thing about QSDB is that it's designed to be a good example. I could do something like give me the average... Let me just find any column in this table. So, give me the average fair amount, for example. From this table that has 1.6 billion records. How long should you expect to do a full scan over 1.6 billion records and do an average? We cannot catch. How fast will you say? 1.6 billion records. How many? 20 seconds. 20 seconds, that's good. I like it. It took 400 milliseconds. It was not too bad, but it's like that's the kind of thing. It's still kind of slow because QSDB is optimized to be very fast when you have chunks of time. So, it's like where... I don't know. If you go only for things like I want to only select data which is in one particular year, for example, this will be way faster because in the end... Oh. That's better. This will be faster. It took 200 milliseconds, which is not too bad considering that we have a lot of time. Oops. Count from trips. Okay. That's not... Let's go for one year with data. So, yeah, in this case, I can do the average. Distance, for example. And we are speaking about 170 million records. And it's still like, you know, execution is 50 milliseconds. So, that's the kind of thing. Time series databases, we are not the only one. We are the fastest. If you ask any other database provider, they will tell you they are the faster and they are right because it really depends on the query. We are super fast for time series queries. If you try to do other type of queries, we are not the fastest. But if you're just about time series, then we are super fast. Everything is optimized for that. Okay. But I don't have 1.6 billion records. And that might be it. But something I don't know if you ever consider that actually getting a billion records is easier than you think. If you have 500 time series, maybe 500 cars or scooters or whatever, maybe 500 machines, 500 users with a phone, send in a data point every second. So you have a data point per second. It doesn't sound like much, no? But then you say, how many seconds are in one day? How many seconds are in every week? How many seconds are in your typical month of 30. 43, 7 days? Because some months are different to others. So this is the amount of seconds you have in every month times 500. So you have 500 devices sending you a data point every second. You are going to get 1.3 billion records in just one month. So we see users generating this amount of data every day or even multiple times this data every day. So what I'm trying to say here is that when you are working with streaming data, it's pretty simple to get to the point in which you can actually have a lot of data to process. So it's quite interesting to have some database that can help you with that. But enough about speaking about QSDB. I want to speak about all the things today. So I told you already you can get data directly to Kafka into QSDB. By the way, before I move from this, I have another notebook because I told you I wanted to speak about millions of units per second in just in just 30 events every 10 seconds so I have another notebook which is not getting data from GitHub but it's generating synthetic data. So I'm going to show you another notebook to send events a bit faster. So I have another notebook which is generating fake data, IoT data. And it's going to be sending, I'm going to be sending for example, I don't know, 10,000 events per second? 10,000 per second? 15,000 per second? 15,000 per second, you think? 15,000 events every 100 milliseconds. 15,000 events every 100 milliseconds? Will you be happy with that? 15,000 events every 100 milliseconds? So now anyway, I'm sending 15,000 events every 100 milliseconds which is still not super fast. And I have a second dashboard, just for you to see that this kind of technologies all together, they really can help you work with data which is fast. So right now here, I'm going to show you. So we have a dashboard that's every 100 milliseconds it's refreshing. And this is how fast we are seeing data. So this is not, I mean, it's not super fast, but you get the idea, yeah? So Kafka can ingest literally hundreds of millions of you in a second. In QuestDB, we can ingest up to 4.2 million events per second. If you have 16 CPUs, 12 16 CPUs, you can ingest up to 4.2 million events per second. But that's the idea. This already feels real time. This already feels like, you know, this already feels like a real time dashboard in which we really are having data which is moving fast. I wanted to show you this because otherwise it feels like cheating. So this is not real data, but if you have real data at a speed, you can see on the planes, you know, this is random data. So yeah, they are not crazy is the way it is. But other thing, I wanted to show you that it really supports this kind of thing. So let me just move on to the next part. So the next part is I told you that, okay, it's cool to send data. It's cool to be able to run SQL queries and analyze data. But I also want to have some way of doing that science. I want to have some way of predict the future. So for doing that science, there are many different tools. But you get a notebook which is where I'm running all the demo here today in the browser. It's a very popular way of doing interactive analytics, sorry, interactive exploration. And I created here a notebook in which I'm using two different tools, actually three different tools for doing data exploration. Even if you are not a Python person, a Python developer, you probably have heard about Python pandas, yeah? Maybe not, maybe yes. But Python pandas is one of the most popular libraries for that science. Some people say it's slow, the latest version is faster, until recently pandas wouldn't parallelize. So you have like a large dataset, you know, but it has, but we forgive pandas to be slow, because first it's not that slow. Second, it gives you a huge amount of things to do with your data. So it's probably the most popular tool for doing interactive data exploration. And this is the kind of thing you can do. The first thing of course will be connecting to where you have your data. In pandas you can connect to your data, I don't know, you could use just CSV or whatever, in my case, I have the data in QSDB. In QSDB we speak the process protocol for query in data. So I'm just going to be connecting to a database using the process protocol. And I say, I tell pandas, pandas, read this query, give me the results. Okay, nothing too interesting. Now, in pandas everything goes into what is called a data frame. And data frames have a lot of methods to do interesting things. So it's like also if you use Spark or other tools, you also use the data in abstraction. So basically give me the information. It gives me some, like you know, some basic data types and so on, not too bad. Describe the data set. And it's already interesting because if I don't have any idea about my data, this already tells me, hey, for the, I only have one numerical column and it's a time stamp with Santa Poc, so this is anticlimatic, but it will tell me which is the minimum value, maximum value, the percentiles. So I can get an idea of my data or I can say, hey, tell me for all the columns that are not a number, tell me how many, you know, how many different values I have for its different type of event. And I can see here that in this data set, I almost don't have comments on commits for the 2000 events I've seen so far, but most of the events are push events on GitHub, which makes sense. But if I try to do any type of forecasting, what I see already is like my data is super bias. My data is going to train pretty well on push events because they are very common, but I only have two events with commit. I cannot learn from two events. You see the idea, yeah? So that's kind of a thing. Without doing nothing, you start getting familiar with your data set. They're like, oh, I only have to say describe, tell me the things. And it's very powerful because, you know, you don't have to worry much about those things. It's like, oh, show me what it looks like. This is the set event that I want to see. And I say, well, I prefer to see the distribution for each event. So again, Pandas gives you very simple ways to grouping things together, representing things. But basically here, for its different type of event, I'm going to see the count of how many events per minute I'm seeing for each of the different types. And as you can see, like, you know, you don't have to, I don't want to enter in details of the code because actually it's super simple. But it's just for simple things. And you can do it very easily. And with no effort, you can get interesting statistics. Of course, if you go deeper, you get more interesting things. But that's kind of the kind of the idea. So Pandas is pretty cool. But, you know, people have no heart. It was like, oh, Pandas is slow. Let's do something new in Rust because, you know, Python is slow. Rust is faster, blah, blah, blah, whatever. You can go to the building there and you can see the difference between Python and Rust. I'm not going to. But basically, Polar is the new key to the block. It's the library which is making data science faster. In some Pandas, it's saying, oh, now we are going to be faster. So it's pretty cool because now they are both trying to compete. But, but online, many people are using now Polar, not Pandas. And also the name. And a Panda, and a Polar beer, and a koala, it's like whatever, man. It's like, yeah. Basically here with Polar, same thing. You connect to the database and you have, literally, the same things you can do. Give me the account of events, give me the different things, you get the idea. I also went with a different library facets that I like it for some things. Because in facets, with very little code, I can get, again, a lot of insight from my data. For example, here, I can just see, hey, from the data I have already, so me, it will, oh, it's, my, my Wi-Fi here is super slow. I don't know if you have the same issue, but my Wi-Fi is super slow. And this is, this is calling an external JavaScript. And since the Wi-Fi is unreliable, it's not displaying fast, which is what happened with it, more earlier. It's what happened also, 1.1.2 QSDB. So everything is running locally. But this particular script is using a script from our place. So I can, sadly, I cannot show it to you, but if I reload the notebook, I should have the cache version that I had from the previous execution. So this, this one is not live. It's the, a cool thing about the Python notebooks is that you can, you can save the notebook with the current execution results, and you can serve with your team. So this is what it looked like when I had the internet, okay? Sorry about the internet, but you know, I don't have a computer here, but basically you have interesting things. So that's for exploring the data, but I promise you that we wanted to predict the future. And we wanted to have, like, you know, to have forecasting on time series. And again, in Python, you have a lot of interesting ways of working with time series data. When you have data in time series, the logical step is trying to predict what's going to happen next. How many, how, how much stock do I need to have for people that are going to be buying my product? How much energy do I have to produce? What's going to be the price of the Bitcoin in two weeks from now? Those kind of things are interesting to predict. If you know that one, let me know. But those kind of things are interesting, okay? That's the Holy Grail time series prediction. And this, of course, is not trivial, but there are a number of algorithms, like profit, or Arima, Arima, or other algorithms that actually allows you to do time series forecasting. Better or worse, but they allow you to do that. So in this case, I'm going to be using a model which is called profit. It was originally developed by Facebook. And profit allows you to do time series forecasting in a very simple way. I connect to my database, and the next thing I'm going to do is I'm going to tell my database QSDB to select all the events sample per minute, as I saw you earlier. So basically here, what I have is the all the events per minute. I can see some minutes I have more events, some minutes I have fewer events, and this is going to be my training data. And this is all it takes to train a model. You don't need to have any fancy NVIDIA GPU. You don't have to, for this, I choose this one profit specifically and other than having this notebook because they are very, very, very lightweight. So you can predict also we don't have any data. We have only 2000 events, which is super fast. But the model already learned, and with this I can already make predictions. Since I have data only for the past 20 minutes, what I'm going to try to predict is only the next 10 minutes of the future. Predicting 10 minutes into the future is not much. If I had more data in the dataset, I could predict the next week, the next month, whatever. But in this case, I'm going to predict the next 10 minutes. And this is what it looks like. It looks like we are seeing fewer and fewer commits over time. I can't predict the model. Something I can do. I can, for example, say so I'm sending data now for only one script. I can open, let me for a second, I can open this script and I can start sending events every two seconds instead of every 10 seconds. So if I execute now Yeah, it's the drawing here, I'm reading events from GitHub. Since Internet again is not working, I'm going to disconnect and reconnect again. I'm actually going to use my phone to tether. I'm going to use my phone just for a second. I can whistle. I'm trying to use my phone for the Internet connection. I can show you this. Wi-Fi hotspot. Don't connect to my hotspot now. Let's see if it opens. I can't see it. Okay, so that's my phone and hopefully let's see if this is better. No. Okay, my phone is also useless. So I don't have Internet. Basically what I was trying to do, I was trying to send more data. Yeah, that's okay. I just wanted to send more data. So you will see if we get more data. The prediction will be that it's going to be, you know, higher because that's how it works. But anyway, I have here another model, linear regression. You can use linear regression for predicting many things. In this case, I'm going to use linear regression for predicting also time series. And Senaedia. I trained the model and now I can predict the future. And as you can see, this model is also very pessimistic about the future. It says that we are not getting enough data. This is what we've been seeing and the prediction is going to go down. But that's cool. I mean, we have data. We have ways of predicting what's going to happen based on the past. Which is, I don't know if you thought it was this simple. These models are super simplistic, but they kind of give you an orientation, a trend, which is interesting. So we have that already. So next step, and almost running out of time, next step will be the real-time dashboard. I already showed you the dashboard. What I'm using here is called Grafana. And I have a couple of dashboards here. So Grafana, the way it works, Grafana is a tool for dashboard and alerts. And Grafana has a lot of plugins to connect to different data sources. In our case, I created a data source using the Postgres connector, because we are compatible with Postgres at the protocol level. So I'm connecting to my instance of QSDB in this docker container. And once you have a connection, you can just create dashboards and alerts on the dashboards. And the way I do this, if I go to any of these panels, the way you do this is just with SQL. So basically, each of these panels, they look very fancy by behind the scenes. It's just SQL query. It has some filters, for example, here. Here, this timestamp filter means that the filter is going to be the one that you have here selected on the dashboard. So you have some macros, so everything works together. But in the end, creating a dashboard in Grafana, just creating the SQL, putting the charge you want, choosing the color, choosing you want to have multiple series. If you want to have like, you know, how it's going to be the scale, but that's kind of the thing. That's all you need. The last component I had in my template was the monitoring. So monitoring, what you usually do for monitoring, you need to have some kind of server agent. If you're in Kubernetes, you have your own agents. In my case, I'm not running Kubernetes or anything. So I'm running an agent which is called Telegraph. It was created by InfluxDB. And Telegraph allows you to connect to many source of metrics. It allows you to connect to many destinations for storing the data. One of the destinations is QSDB. We are at a serious database, so it's pretty cool for monitoring. So I'm getting the data from QSDB and from Kafka monitoring metrics. The Telegraph agent is collecting the metrics. In the agent itself, I'm doing some transformation and then I'm writing the data back into my database. So for example, if I go to my QSDB installation, this is what my metrics look like. So in QSDB, we're exposing these metrics. How many items I'm missing from the cache, number of queries, number of commits in the file system, all those things. In Kafka, we also have a lot of different metrics that we can explore. So Telegraph is pulling all those metrics and we are sending them back to QSDB. So if I go to my local instance here, you will see I have here a few tables. Kafka cluster, Kafka runtime. If I do, for example, I want to select all the data from Kafka Java runtime, you see within collecting data, this actually is quite small. If I go to Kafka topic, hopefully in this one, I should have more activity. So, yeah, for the time that I've been writing this demo, we've been collecting data about the different topics, how many effects, how many bytes per second we were sending out of the topic, how much data was entering. You can see the offset in which different consumers are reading data. So every kind of statistics and that monitoring, we have on the database. So you could also build your dashboard. That's kind of the idea. So, what I wanted to tell you today is that if you want to work with streaming data, it's hard because streaming data can get very big. It never stops or at some point you need to decide when to do analytics. Isbasti is going to be sometimes faster, sometimes slower. At some times it will get late. At some times you will get new data with an update for some results that you already made it. At some point, the individual points that you are gathering, you are collecting that are very valuable for the immediate analytics. They lose value but the aggregate data gets more interesting. So working with this kind of data, it can be hard, but if you have the right tools, you can actually start working with streaming data at pretty much any speed in a simple enough way. At least for starting. Then of course, everything gets harder. So, I didn't tell you many, many things that I don't have yet. So, I'm going to talk about the template. The template is a starting point. So you can start getting familiar with streaming data and have some tools that are interesting. But if you want to go to production, you will need to support more data formats. Not that JSON. It's not very efficient for moving fast data. You want to have data life cycle policies. At which point, I'm going to delete the data. At which point, if I'm getting data every few milliseconds, maybe I want to aggregate into data for a few years, one month, I don't know. I want to delete the data, I don't know. So you have to define, maybe you want to move the data to some cheap cold storage. So those kind of things you have to decide. Data quality, data governance, replication. In this case, each component has only one server running. It's easy to add more replicas for each of the components, but you have to do it. So there are many things I didn't cover. But hopefully, this was interesting for you. There are the links to all the tools that I've been using in the template. The template itself is a patched to zero. So feel free to do anything you want with that. For anything you want, you can contact me on MasterDone, but actually, I will not with it. So contact me on Twitter, which is easier. Thank you very much. Anything you want, I hear. APPLAUSE
Will the first Artificial General Intelligence (AGI) instance be free or open-source software?
So, welcome to the next talk and welcome to Peter Levin, who's trying to answer the question if the first AGI instance will be free or open source software. Okay. Thank you very much for the introduction and welcome all to this talk, indeed. Maybe I'll wait a minute. Okay. So, indeed, I will be talking about AGI or general artificial intelligence and I will be trying to look at the question whether we can expect this to be open source or free software. And I'm a professor from the VUB. There where I do research in artificial intelligence. So, VUB is a sister university of VLB, where we are. So, I'm very happy to be here. And before diving straight into AGI, let's first have a look at the advances that we saw over the last decade. Because we did see some advances that were quite significant. First of all, I think 2015 we saw a big breakthrough made by Google DeepMinds, where they developed an agent that could actually learn to play video games, Atari video games. It's an old video game system, but it's a video game system that I used to play when I was a kid. So, this is generally considered to be quite a breakthrough because an AI system was developed that could on its own play these games and play a wide range of these games. Next, they made another breakthrough, Google DeepMinds. They came up with an agent that could actually play the game of Go. And the game of Go is a board game comparable to board games such as chess, but generally it's a harder board game. So, it is a board game for which it was not expected that we could actually build an agent that could solve this difficult game. But they did achieve to make such an agent and they actually beat Lisa Doll, who is here on the slides, who was the world champion in Go. This program, this AI agent was actually able to beat this human player. Next, in 2017, they developed AlphaGo Zero, another Go agent. But here, instead of learning from data, which was the case in the former agents, they actually learned solely by playing against themselves. So, the agent was just playing against another agent and there was no data used from any human games. So, this was also assumed to be quite a breakthrough because, of course, if you can build systems that go out of distribution that can learn more than data has to offer, that's of course something that is very interesting because that would allow us to learn also outside of the capabilities that we as human beings have. Later, also DeepMinds, they came up with AlphaFold, another problem that was thought to be very complex where you want to predict the structure of proteins simply based on the nucleotide sequences. But they also were able to construct a deep neural network to do this job. And of course, more recently we saw, for example, OpenAI coming up with Dali, a big neural network which you can prompt to generate images. For example, here, it was prompted to ask for a Matisse interpretation of a robot playing chess. So, you see that these engines are really becoming more and more powerful. And maybe the most shocking element of it all was ChatGPT, which was released in 2022. That was actually an agent that can act like a ChatGPT. And this is something that in a way really feels like you're talking to a human being. And this made a big impression on many scientists, but also the general public because this is really a type of AI that is approachable by the general population. So, of course, this is just AI. Let's say it's a bit unrespectful, just AI. But in this talk, we're going to have a look at AGI, artificial general intelligence. And to do so, we need a definition. And the debate is on what AGI is exactly is. But I chose this definition from Wikipedia, which is a consensus website. And what Wikipedia says is that it is an intelligent agent that could learn to accomplish any intellectual tasks that human beings can perform. That is what Wikipedia comes up with. And this is supported by an article in The Economist. And we can interpret this as a system that can behave like, that can do the things that humans beings can do. I think that's a reasonable interpretation of this. And does that mean that we're talking about average human being, or are we talking about the upper bound of what humanity is capable of? That's still up for interpretation. But I think we can agree that once you can actually emulate an average human being and the cognitive abilities of an average human being, this would really be a great breakthrough. And you might think that this idea is kind of a hype, and you're writing that. But it's also important to remember that this idea was there from the start. So this is a picture from the Dartmouth workshop where a lot of very smart people assembled. And it was 1956 already, quite a few years ago. You have Claude Shannon there. You have Marvin Minsky, who founded the MIT AI Laboratory. And you also have John McCarty that came up with Lisp, for example. And they came together with this task. They thought we're going to try to simulate or build a machine that can simulate all features of intelligence, which is very closely related to the definition that I just gave you. So the idea really is around for some time. So it's not only a hype, it's really what relates at the foundation of the field of AI. So maybe this claimer. So in this talk, I will not try to make predictions on AGI. That's a very hard thing to do. I will not try to do that. And I will not try to take too strong of a stance. I will present you with what is out there. I will present you with some of the difficulties to which I think at the free and open source software community can really make an important contribution. But I will refrain from taking too strong of a stance. What I will do is I will discuss how a scientific approach to AGI is, in my opinion, really important or even crucial. And to do that, you need reproducibility. Reproducibility in a scientific context. And in my opinion, this is very closely related to free and open source software. And I believe that we, as a scientific community, and the free and open source software community, can really have a big influence on each other in this regard. So why should we care about AGI now? So why is it becoming a hype? Well, there are two good reasons for that. Two elements that popped up over the last couple of years that are really quite influential. First of all, large language models and the secondly, reinforcement learning. And I'm aware, I'm very well aware that this list is not complete, but in the interest of time, I will focus on these. So large language models, I will try to make as intuitive as possible to explain. But basically, what we have is we have a language model and this language model will try to predict the next word. Based on a sequence of words, it will try to predict which is the most likely next word to be chose. And you can do that with all kinds of machine learning. You can do that with hidden Markov models. But when we're talking about large language models, what we're actually referring to is using a deep neural network. And here we have a very simple neural network with just an input layer, an output layer, and one hidden layer. But when we're talking about deep neural nets, what we actually mean is that you have many hidden layers and different kinds of architectures to make this advanced learning possible. But the general principles are still the same. We still have different layers and these layers are connected and the weights that we have on these connections between these layers is basically the parameters of our model. And this is what we will use to actually do a learning. And so this is a large neural network. Training it is really a complex thing. And this is also something we will discuss later. But this is also something on which all these research institutes really build upon free and open source software. The operating system that also the ways to do the networking of such trainings. Now of course large language models. There was evolution and the evolution started with relatively simple things that already showed some capabilities to what we now have in chat GPT, a system that is actually quite impressive. And this tweet, it's just a tweet, but it's indeed not AGI, but there are some capabilities that are really remarkable. And this is really something that many people used to make very impressive demonstrations. For example, you can write poem just by chatting to this bot. It will generate codes that you can actually use to play poem. So this is really something. The next thing is reinforcement learning. Reinforcement learning that's actually the main topic of my research in the AI lab of the VUB. And what we actually have, we have an agent and an environment. And we want the agent to learn to behave optimally in this environment. And the agent can do so by performing actions in the environment. And in this simple environment where we have Super Mario, this corresponds to pushing buttons on the controller. And when an action is performed, we can observe the state, which here is simply the screen that we can see, and a reward signal that tells us how well we're doing. It's a kind of feedback signal. So if we can build an agent that through these actions and through the observations of states and rewards can actually learn how to behave optimally, that's when we're talking about the field of reinforcement learning. And this is a simple video game, but it does not take that much imagination to see if we replace the video game with the world. The state space will be much more complex. The action space will be much more complex. But you can see that if you can make a sufficiently advanced agent, you would result in AGI. And this is exactly what Silver et al. have some influential researchers in the fields of AI set in a paper in 2022. They said reward is enough. Reinforcement learning, where you follow a simple reward, is sufficient to learn advanced capabilities. And they gave this example of the squirrel. The squirrel wants to maximize the nuts sense. He likes to eat nuts sense. So he wants to maximize nuts. And in order to maximize nuts, if we use this simple reward signal, the squirrel will need to learn advanced capabilities. For example, it will need to recognize trees. It will need to be able to climb in trees. It will need to be able to pick acorns. It will need to be able to store them for winter and so on. So by following a very simple reward signal, we can actually produce quite complex capabilities. And this is something that we will get back to, because this is also not without risk, but this is something we will get back to. Okay. A third thing, I said LLMs, I said reinforcement learning. But a very important thing is also compute. Since about a decade or maybe a little bit longer, we really upped our compute. We have now GPUs that allow us to do very complex, to allow us to train very complex models. And this is also really, has been really a game changer and the amount of compute that we have at our availability. Now, I said I will not go into predict when we will get a GI, but I will show you some quotes because the opinions are actually quite divergent. And you might know this guy. This is a professor, Jeffrey Hinton, he's a British Columbian AI scientist. And he's a professor and he's really considered to be one of the Godfathers of deep learning. And deep learning is really what lies at the foundation of all these influential models that we see show. And he thinks that we might be only 20 years away from AGI, General Purpose AI or AGI. And this is quite remarkable. And he made this statement in March 2023. And he also said that this is a statement that he would not have made 10 years earlier. So it's really based on the recent developments. On the other hand, people like Jan Le Koon, he's also a very influential AI researcher. He's also considered to be one of the fathers of deep learning, but he has a different opinion. He says it will take us decades to even only touch upon what AGI could be. So the opinions really diverge. Then let's have a look at what these big AI companies are actually thinking about it. Because this will also be important in this talk. For example, this is Jean Leg, he founded DeepMind together with Demis Hassabis in 2010. And he thinks that AGI is likely by 2028. He says so with a probability of 0.5. So it makes it easier to make predictions, of course, like that. But this is a statement by him. And he's one of the founders of DeepMind. So it will also have some... What these people say, of course, also resonates with many people in the community. Next, maybe you know this guy, this is the CEO of OpenAI. And he thinks it will take us to 2030 or 2031 before we will get to AGI. And I think it's very important to state these are very influential people within their company, within huge companies. And of course, these companies, they have significant bias. This might be a very self-fulfilling prophecy, the predictions that they make here. So it's very important to keep this in mind. Because, of course, the more people get hyped about AGI, the easier it will be for them to actually get their interest and capital to work on this research. And again, Altman also... I need to make this disclaimer. Altman also said that it will take 20 to 2030, but with a huge confidence interval. But of course, with huge confidence intervals, it's always easier to make predictions. So that's also something to take into account. Okay. All this sparked interest and maybe some concern. And even Snoop Dogg, he came up with this statement that he was concerned about AI. I had to remove some curse words here, but basically that is what he said. He heard this old dude talking... that created AI talking about that it is not safe that these things would get their own minds. So he's talking here about Jeffrey Hinton, maybe a bit disrespectful. But I think this really resonates with the general population. And it might be a bit tough to really grasp what this actually means, these breakthroughs, and where it will lead. Okay. Now, we could ask ourselves the question, are we aiming for AGI? As we will see, AGI has a lot of potential, has a lot of positive elements associated with it, but also some significant risks. Are we aiming for AGI? Well, don't take my word for it. I'm just a simple professor at the VUB. In my lab, it will not happen. I can be sure of that. But other companies like Open AI, they put on their website that they're working on that. So this is not something that we're coming up with. This is not just science fiction. Companies are actually trying to build this, and not only Open AI, but also Google DeepMinds. They want to build AR responsibly, okay? And if you look a little bit lower, they also mentioned AGI. So this is something that companies are working on. And more recently, also Meta, the CEO of Meta, came up with a wish to make this kind of technology. So it's really something that companies are working on. So as we will see, there will be an impact of this, so it's important to be aware of this. So what will be the impact of AGI, or the potential impact of AGI? Well, let's start with the goods. First of all, we will be able to tackle complex problems, maybe visit the stars. That would be a very nice achievement for humanity. We could automate things. Advanced automation, maybe even complete automation. And when we're talking about automation, we might think about automating the assembly of cars. We might think about automating a bakery. But of course, once your system is sufficiently smart, this will also allow us to automate coding jobs, for example, or research and teaching jobs. So this is really something that could have a huge impact. It would allow us to automate things. But if you automate things and you do not distribute the wealth that is generated like this, you will also be in serious trouble. And then of course, we can hope that we can enhance our human capabilities, because that would also be quite an interesting byproduct of AGI. Now the bad, many of these good things could result to serious social disruptions. And in a way, what is happening on social media, how people are being influenced, this is already going on. And you can assume that once you have agents that are even more intelligent, this will be even a bigger problem. Also, when starting to automate everything, if you do not take the politics into consideration when doing so, this might also lead to serious social disruptions. Another aspect is misalignment with humanity's goals. And this is something even within humanity, it's not easy to align our goals. There are many different views on how society should work. So how should we align an AGI to do what is best for us? Can we even define this? And then the ugly, this is of course the existential risk. And this is something that is of course an important concern. It's something that is explored extensively in science fiction literature, but it is actually an important risk. And we might also even go further than AGI, we might build a super intelligent system that is able to greatly outcompet us. And this might have even more advanced implications for our society. And you might know the books of Isaac Azimov, who really explored how we can try to align human beings with what we as humanity would like to do. But this really is not such an easy thing. And the existential risk of AGI is really something that has its own wicked Wikipedia page. So if we have an AGI, there are many ways that you could think of that it could influence our society or even really wipe out the human race. This is a really negative point of view and I'm not saying that it will necessarily be like this, but there are many options to do it for an AGI. So it is something that we should take into account if we make the balance. And this is maybe an example that you already heard of. It was introduced by MIG Bostrom. It's about the paperclip maximizer. And paperclip maximizer is about an AI system that is defined by a set of humans to maximize the number of paperclips. And at the start, the AI system is able to do that in a very efficient way. It is able to find or it is able to mine in a very efficient way and make many, many, many paperclips. But then when the mines are getting empty, there is a problem. The AI can no longer make paperclips, so it will start to use atoms of other things to also make paperclips. And in the end, also human beings and our entire earth will be transformed into paperclips, which is quite concerning. And maybe it's good to think about the reward is enough paper where we had this setting where we had an acorn that wanted to maximize its nuts. And in a way, this is very similar to the paperclip maximizer problem. And to say the least, that is actually really concerning. If we're not specifying our objective functions in a save and meaningful way, we really might run into troubles. And this is something that you also might have heard of, this probability of doom. It's something that is uttered a lot on social media. This is the probability of this existential risk. Currently, we do not have a formal framework to reason about this. So making statements about this is purely intuitive and in my opinion, at this point, not very relevant. But it is something that I think that most scientists would agree on, that this P doom is not zero. And if it is not zero, then you only need to make an AGI once, that actually for which this P doom will not be zero. And we will be in big trouble. So this is really something that we need to be very well aware of. And we also counter this paper. Reward is enough in our own paper where we set that scalar reward is not enough. If you have an agent that just follows one reward signal, for example, acorns or paper clips, you might end up in deep trouble. It might be smarter to look into multi-objective or multi-criteria reward signal set where you can say, I want to maximize the number of paper clips, but I also want to keep alive most of humanity, for example. So this is really something that might be something to consider when developing such systems. But we also make the disclaimer that even when making this multi-criteria reward signal, this is not a guarantee to avoid this existential risk. So this is something that we really should be aware of. And in my opinion, safety should be key. I think there are many positive aspects to AGI, but we should be aware of safety. And to do that, I think a scientific approach is necessary. And a scientific approach means that we need to formulate hypothesis about a risk, about safety, but also about purpose and impact. And when we can formulate hypothesis, we can also do experiments. And to do experiments, our science needs to be reproducible. Now, reproducibility in science is not so easy. It's a very important topic. It's not trivial. In a wet lab, you have many things that can go wrong. You have equipment, you have lab temperature, you have the purity of chemicals that you use, you have the skills of your technicians that might be different. And also the sex of your technicians, actually. So it has been shown by Sergei that the sex of the technicians who operate on rodents actually influences your experiments. Because for example, if you have male technicians, it will stress out your rats more than if you have female technicians. So reproducing experiments is something that is really challenging. But of course, in silico, in computer, on a computer system, in simulation, it could be much better. To reproduce things, we need two things. We need codes or a very rigorous description of what is going on in the codes. And we need data. Unfortunately, and this is a survey from 2018, not all papers, not all scientific papers that are done in AI actually come with codes. So as you can see here, there are a lot of papers that come with pseudo codes. There are some papers that have some test data, but not that many papers come with code. And that is really concerning. And major AI conferences, and we as AI scientists, we publish a lot in conferences where we present our peer-reviewed papers. These conferences are really becoming more and more aware of this problem. And they really make it an issue that people share codes with their manuscripts. But there are still journals like Nature and Science, really influential journals that do not enforce this, and this is really a pity. I have to say that there is this code science manifesto that says that basically doing science outside of the wet lab really coincides with releasing your codes. You need your code to do that. So this is a manifesto that has raised quite some awareness. And in AI research, there is a growing awareness of this. However, when we are talking about AI research these days, we indeed have academia. We have academia where research is being done. But we also have research institutes like Google DeepMind, like OpenAI, that are inherently different organizations. For many academic institutions, myself included, I think it is important that experiments are reproducible. I think it is important that we make our source code available so that other researchers can really build on top of our finance. But what about these research institutes? Well, the picture is really not black or white. So for example, DeepMinds, they developed this alpha fold system to predict protein structures, and the code to train a neural network was not available. And what happens, a set of researchers actually developed OpenFold, which was an open source implementation of this functionality, which really was able to reproduce this work. But this is what happened already in the free and open source software community so many times. Due to the need to have software as open source, people spent their time trying to rebuild things. But of course, this is of course a very good thing. But it would be better if the scientists just shared the code straight away. On the other hand, DeepMinds also made very important libraries available in a purely open source fashion, for example, Jax, which is a library that allows you to do very performant computational analysis. Also, TensorFlow for building these large neural networks. So these libraries are all very influential, and they really shaped how research is being done at this point. So there has been a major impact on AI research from these companies. Also, Alpha Zero, the agent that learned to play go, its code was not released with the paper, but recently they did release their code in an open source fashion, I think even a free software fashion, to actually allow other scientists to work on it. On the other hand, they also have open AI. They have their stable baselines, a library, which is a reinforcement learning library that also incorporates many algorithms and really lies at the foundation of many research that is being conducted. But on the other hand, we also have chatGPT, which is completely closed. It's near impossible to reproduce, not only because we do not have the codes, but we also do not have the description of the methods. We do not have a description of the infrastructure. We don't even know how big the dataset they use this or how big the neural network they use is actually, how big it actually is. So this is really something that is concerning. Google Gemini, same thing. Also, no source, only a black box that we can interact with via the network. There is one big exception, and that is the work done by Meta, where Jan Le Coon is the president of, and they did release many LLMs, where for which the source code is actually available. So the landscape is really divergent. Now, the question of today, or the first, the sparks of AGI, this was a paper written by Sebastian Bubek, is a brilliant scientist, and he wanted to investigate what the capabilities of chatGPT were. And so, for example, he used different versions of GPT, which he prompted to draw a unicorn index. This is what he reported in a paper, but this is not something that we could reproduce. These experiments, we cannot reproduce. First of all, we cannot seat GPT. So GPT is a stochastic agent. So in order to really reproduce what is going on, we would need to have the seat. But also, and maybe even more concerning, is that very recently a set of influential scientists actually showed that this black box axis is insufficient to properly audit AIs. So not having the code, not being able to look at the internals of this big neural network, what is going on, where, when you ask a certain prompt, is really not sufficient to really understand what is going on, and is not sufficient to get to safe AGI. So, to go back to this question at the first slide, will the first AGI be free or open source software? Well, it's really hard to know what drives these companies, but there is sure no commitment from their side to do so. OpenAI, DeepMinds, Anthropiq, they do make no mention of free software or open source software. But very recently, Meta, the CEO of Facebook, actually introduced that they will be developing AGI and they will also make it available in an open source fashion. And it was surprising to many, but maybe not too many who follow Jan Le Koon on Twitter, because he's tweeting about this all the time. He thinks this is really important. AI should be open source, should be free software, and this is indeed what they tried to do. So, okay, good. So we have these different viewpoints. Only a few days after Zuckerberg put this statement out, we already get these articles that are predicting doom, that saying that it is very scary, that such an influential technology would become available in an open source fashion, even comparing it to nuclear weapons. And if we're talking about existential threats, you could indeed see the comparison there. And we might also not make the recipe for a nuclear weapon available on a free license. So there is something to it. And so maybe we should ask ourselves the question, should the first AGI be free and open source software? And I'm not taking a stance here. It's really up for debate. And in general, I'm very much in favor of making things free and open source software. But this is the first kind of software that I really have my doubts about, because it will have a huge impact on society. And what will this mean if we have this available as open source so that everybody can access it? So this is really something that I believe is up for debate, because AGI will have a major impact on individuals, societies, but also governments. A government that has AGI will be a different government than a government that does not have access to an AGI. And this is all up for debate. But what I do think is that there should be oversight. And currently, there is not. There is no governmental oversight. Also, much of this research is happening in the United States. There has been some hearing in Congress, but this has also not been that much in depth. So for the moment, I think there is no oversight. So we have companies that are working on AGI at a certain point they might actually reach it. And what will we do next? So this is really something that we should be concerned about. And I will take this stance. So quite recently, Satya Nadella from Microsoft made this statement. If you look at inflation adjusted, there is currently no economic growth. And in a world like that, we may need a new input. And with this new input, he meant AI being the general purpose technology that drives economic growth. I think these kind of statements are really quite concerning, because of course economic growth is not everything. And as I said, there might be very positive things from AGI, but we also need to be very much aware of these risks. And in the end, if this is the case, if we have good things, if we have bad things, then basically it becomes a question, how do we balance these? And in order to balance these, we will need to be able to quantify what is this risk. And if we know, if we have a way to formally quantify this, then we will also need to make it a democratic question, because this is something that is a democratic problem. We need to have society decide on this topic. How important is it that we as a species remain? And how do we balance that to maybe the good things that come with AGI? And maybe something to close off with, the OpenAI came up with their work on superalignment. That means they expect that we will not only have AGI, but we will have a superintelligent system. And they were very proud to announce that, I think it was somewhere at the start of 2023, they were going to work on this superalignment. And they were going to put 20% of the computer they have, and they have a lot of computers, like many, many thousands of GPUs. They were going to dedicate 20% of that on safety. And they felt very happy about this statement, but I was confused, because I would think that we needed to do it the other way around, spend 80% on safety and 20% on capabilities. So this is something that I wanted to close with. So to wrap up, in my opinion, I did not take many stances in this talk, I think, but in my opinion, safety should be the first concern. And in this regard, we should study risk and the balance to make will be a democratic choice. So it will be something that societies will need to decide. Very importantly, oversight in the development of AGI is needed. And this is really something that, in my opinion, is lacking. And the debate of how free and open source software will be involved in this process is really important. And I think the community of free and open source software and the AI community has a lot to learn from each other. Maybe we will need new kind of licenses in order to deal with this kind of technologies. And many people say that AGI is something that we really need. And I fully agree that it would be a blessing for society if things go right. But in a way, if we can make AGI that it will be saved in only a thousand years, that should also be a good thing. The only thing to really want AGI now is if we want to solve a problem that is also an existential risk and that there is no other way to solve it. That would be a clear balance that we could make. But otherwise, I think safety should be of main concern. And that closes my talk. And if you have any questions, I will be happy to answer them. So are there any questions up there? Hi. So I was wondering, assuming AGI becomes open source and accessible to everyone, what are the material constraints that you think it could have in the future in the sense that, yeah, it might be open source, but then only very few people may be able to run it because it requires really powerful hardware? So what do you think in this regard? Yes, that's an excellent comment. In the interest of time, I didn't include it in this presentation. But this is indeed a real concern. This is already the case with the LMS that we now have here at our university, we would not be able to reproduce this research. But this is again something where governmental oversight is really important. If our governments really think AGI should happen, then they should also have the infrastructure to test these things on and make it possible to do this kind of research. And this is something where the EU also should step up, I think, to make this feasible. That being said, if we collaborate across our universities, we do have a significant amount of compute. So there is some compute, but that would also require us to really collaborate intensely in this front. But yeah, very good question. Thank you. Thank you for your talk. I'm wondering what kind of, so if deep neural networks are for what we have today, what we have to do more to have an AGI? Is just some fundamental research that we have to crack or is just more data, more training, more computer power? Well, the opinions differ. Some people say our project we're following now will lead eventually to AGI. So just using more data, just using bigger neural networks, that is a line of thinking. But on the other hand, this is not how we work. We as human beings, we do not need all literature that was ever produced in order to learn things. We can do that, but with just a fraction of this data that is available. So personally, I think we're still missing some things. There's some fundamental things that will require us to step up in order to get there. But at this point, it's really hard to say. I would also not have expected what JetGPT has become 10 years ago. I would also not have expected what JetGPT has become 10 years ago. I would also not have expected what JetGPT has become 10 years ago. I would also not have thought that this would have been possible. So it's really hard to say, but indeed, one advantage of making things from scratch to build the capabilities from a more fundamental base is maybe that we might have more control over what is going on. Because of course, if you train neural networks with a lot of data, it's really hard to know what to expect at the end of the training cycle. So I wanted to ask about whether an open source model or a commercial model is going to be first. So OpenAI took a very clear stance on this, right? In multiple interviews, they've stated that basically developing AGI will just require too much resources to be done by anything else than a commercial party. What do you think about that? Well, I think it's something that we need to work on. The question of being an open source will also allow us to look at the code, which will also give us insights in how these things work. That's one thing. But on the other hand, we will also need a computer to do this kind of research. And this is something that the European Union can indeed step up on. But this is something that will be important. Because indeed, without compute, it will be difficult to run these models. On the other hand, the previous question, it might be possible that we will have breakthroughs that will allow us to do things with much less compute. So this is something that is also very interesting. At these deep neural networks, they really are models with a huge number of parameters. So to train them, that is really an infrastructural nightmare. But maybe we can come up with things with fundamental concepts that will allow us to do much more with much less compute. So I had a question. What do you think of stopping the research or strengthening the research right now until we have proper security in place to be sure that none of those existential crises can occur? Yes, I think that's what I meant with oversight. There should be debate on how it should be conducted. There should be debate which direction we're reaching. And I think this is missing a lot at this point. So this is really something that needs to be debated. And we will need to... I think governments will need to be on top of this rather than following the companies that do this research. Because in the end, if a company develops this kind of technology, who will be responsible? There is a lot of ethical but also legal issues with that. So there is a lot of work to discuss this. And as I said, we might have AGI in 10 years, might be in 1000 years, but it doesn't hurt to start thinking about these processes now in due time. Hi. Can you have explained a lot of problems? Can AGI help with those problems? Sorry? So can AGI help with the problems of AGI? Well, then it might be too late, of course. There might be some circular aspects to it that is indeed in a way that many of these alignments, the ideas to alignments actually use the same kind of technologies and methods that are used to develop capability. So in a way that is indeed being sought after. But if you have a sufficiently complex model, this model might be trying to deceit you. So if this is the case, it becomes really hard to understand what is going on and to understand whether this model is really working with you or working against you. So in the end, our brains might be too small to still follow what is going on there. And that is where things get complicated. Appear. Don't you think that maybe we have a bigger probability of dying by disease that AGI can cure? That's a good question. So if you have other existential risks, we need to think about their probability. So for example, I do a lot of research in pandemic preparedness and indeed, might be possible that we have a virus that will be very destructive. On the other hand, in human history, we did not have any viruses or pathogens that really wiped out the entire species. So existential risk is really wiping out your species. So balancing that to other existential risks is important. But then we have to make sure that we have a formal framework to reason about this probability. Because otherwise, you're just comparing apples and oranges without actually knowing how they compare to each other. Does that answer your question a bit? Yes, but if you all have 100% more of a disease that might be dying, so maybe AGI can cure the disease. Yes, but that's how it has been. That's a good question because it was not with the microphone. So I think you said that we now have 100% chance of dying, that AGI might fix it. Well, that's true, but that's also what humanity is about. We are mortal beings. Should we put that in the balance of making ourselves maybe live forever, but on the other hand, maybe we might swipe out our species. I'm not an ethical expert in ethics, but these are things that we should think about. And maybe society should, through a democratic voice, decide on this. But that's not something that I think we can decide here today, but there are different angles to it. That is definitely the case. Hi. You mentioned two papers. The first one that argued that reward might be enough to achieve artificial general intelligence, and then the second one which argues that it may not be enough. Obviously, artificial general intelligence should be able to learn how to behave ethically in the same way all of humanity does. Sorry. This will understand. Sorry. Do you think there is a good approach of teaching artificial general intelligence to behave ethically, like humans do? If it can save the same sort of problems, it surely can understand the ethical reasons we do. Yes. Well, asking me, is there a good way to do that, said that would solve the problem in a way? So, unfortunately, I do not have the answer to that. We did do research that at least said that a multi-criteria approach makes a lot of sense. And this is also how we as human beings work. We do not have only this acorn to follow. We have different things that we deem important. So, formulating things in this fashion might be a good way to do that, but we also make the disclaimer that this is no guarantee to make this work out. So, there is a lot of work that will be necessary in order to, first of all, get some ideas on this probability of existential risk, but also ways to make it more likely that we will be heading towards a safe agia. Okay. So, thank you very much. Thank you.
Learning from disaster response teams to save the internet
Hi, everyone. It is great to be here. Thank you for coming to this talk. If you're here for the magic show, I'm afraid you have 30 minutes to wait. I'm here to guide us in an exploration of what we as a community, as open source practitioners, can learn from some of the most finely tuned and highly performant teams in the world. First responders. Through the interdisciplinary lens of social network science. So perhaps there is some magic in this talk. The magic of people working together. My name is Hannah Aubrey. I lead fast forward at Fastly. Let's save the Internet. In a past life, I was lucky to serve as a study coordinator of Sonic. No, not the one with the roller skates and the hamburgers. Dang, I knew that joke wouldn't play in the EU. The science of networks and communities research group. Sonic advances social network theories, methods and tools to better understand and meet the needs of diverse communities. They develop cutting edge techniques to study and improve social and knowledge networks and distributed working groups, online communities, virtual teams, and other large communities like the one we're all in. I am thrilled and a little bit washing to share that with the director of Sonic. Professor Nasheer Contractor is here in the audience today. Thank you for coming, Nash. And my dear friends, if you have any tough questions, please direct them at him. Let's start with a history reminder. Our earliest ancestors not only had to contend with the same natural disasters we experienced today, they also had to adapt and survive to nature itself. First, we became bipedal, freeing our hands to reach and to grasp and also to communicate simply with each other. Next, we developed complex brains with prefrontal cortexes, our personality centers, which enabled us to make snap second decisions based not only on external stimuli, but also our past experiences. Then we developed symbolic language to communicate complex ideas and then finally tools to take control of and shape our surroundings. So you see what makes us uniquely human. Actually, what has brought us here together today, the abilities to ponder, convene, reflect, build, collaborate, and coordinate are not only what make us so special, but also so successful. So then our tools got a lot better. The first fire pump was invented in Alexandria in the third century BCE. Unfortunately, it could not save the library, but I digress. As societies and civilizations began to form, the blast radius of disasters grew. We settled into towns that could burn down and buildings that earthquakes could topple. And so those smart brains of ours formed teams whose sole purpose was to patrol and respond to natural and made manmade disasters in the form of firefighters and police forces. Then societies became more complex. And with that came more complex disasters, not only fire and flood, but we created monetary systems, banks that could collapse and food systems that were prone to mass famine, not always for lack of food, but sometimes for lack of transportation or poor planning. Our close proximity to each other in cities and long distance cultural exchange made possible by ships brought diseases, colonization, and war, which ravaged human populations. We think of these ages as dark or undeveloped, but their responses to such crises were surprisingly neither. In fact, we begin to see thoughtful and multifaceted disaster response, not only search and rescue or medical aid, but tax relief, temporary infrastructure, even what we now call refugee camps, providing long term food and shelter for displaced peoples. In 1493, the Knights Hospitalers shipped doctors and surgeons to the Greek island of Kos after an earthquake. And so we see some of the first evidence of multiple different groups or organizations coordinating across disciplines and borders to respond to a disaster. In the intervening years, we've continued to hone our disaster response strategies. Humanities impact on this planet has required us to do so. And besides, those prefrontal cortexes of ours have a lot more data to lean on than our friends the ancient Alexandrians had. If they knew then what we knew now, maybe they could have saved that library. You should pull it together. Anyway, today we have entire organizations, governmental bodies, NGOs, and community groups dedicated to such activities. We have laws by country and internationally to enshrine basic human rights and ideal responses in crises. And now we're building a new frontier, a new form of transit. We're creating massive new civilizations, hosted on smallish, inscrutable, blinky boxes. In this new world, we can't even really see the threats, the crises. We're throwing people together in a way that's affecting global social structures and people's everyday lives. Like every form of infrastructure, like most every place where humans gather to live, to work, to learn, to play. The internet has grown up in an unplanned way. And we're still scrambling to understand it, to learn from our mistakes, to apply those lessons, to build the best internet, to build systems that protect people and systems that react when people are harmed. But don't worry too much. We'll survive these dark ages. Our species has survived every disaster it's encountered, at least so far. A common organizational structure found in groups undertaking large-scale operations to solve big, big problems is called a multi-team system. A system comprised of multiple teams working towards a shared goal. These structures can be found throughout all sorts of industries, working on all sorts of problems, disaster response, space exploration, governing humans, building stuff. If you're part of a business with multiple departments, you're in one. If you attend or work in a university, you're in one. And if you maintain or contribute to a support or, excuse me, contribute to or support or care about an open-source project, you're also in such a system. Because no matter what corner of the internet you occupy or to which technology you contribute, you're working in service of our shared mission to keep the internet open and free. So what makes up a multi-team system? Within the subordinate, the superordinate team, the entire system, we have local teams working on local or proximal goals, which may even be split further into component teams. And directing the subordinate teams is the leader or perhaps the team of leaders, which shares a global or system goal. And when you examine these teams using social network analysis, you find common patterns between successful MTSs. There are many more patterns we could discuss, but let's focus on three. A plan for coordination paired with frequent, clear communication, highly-performant and resilient local teams, and finally, empowered and effective leaders who are willing to sacrifice their local goal in service of the global goal. So before we explore each of these patterns, I want to share this diagram with you to underscore the importance of these patterns in disaster response. Because that term disaster response makes such activity sound reactive, doesn't it? But in reality, the most effective disaster responses begin long before the disaster happens or second best right after a disaster occurs. So I ask you to bear that in mind through the rest of this talk. After all, the best time to plant a tree was 10 years ago, and the second best time is today. First, let's talk about planning, coordination and communication. I don't think I need to talk about docs too much here. I think the OSS communities know this one quite well. And engineers know all about retrospectives. Like I mentioned, disaster response begins well before the disaster occurs. So in terms of coordination and communication, knowing where to turn for help or resources before a disaster occurs spares valuable time, energy and mental load during a crisis. Effective communication prevents errors in the field, helps the even distribution of resources and helps us learn from the mistakes we've made last time so we don't make them again next time. During disasters, response teams crucially over communicate. They share reports on the situation as it evolves. They communicate with stakeholders on the ground. And they report changes or progress to make the best decisions. Leadership and subordinate teams must have the most accurate and up to date information. Because knowledge sharing fosters a coordinated and collaborative environment. It reinforces the multi-team system as a single unit, not a set of separate teams. And because knowledge sharing makes it easier to be flexible and adaptable in rapidly changing environments. Interestingly, research has found that inter communication, communication between local teams is more important than intra communication. Communication within the local team to the success of the whole system. So in fact, there's actually a Goldilocks zone of inter to intra communication. Local teams should communicate half as much between teams as they communicate within their own team. Any more inter communication than that and performance declines any less than that and it declines too. When we talk about the viability of a team, we mean the success of the team. In moments of disaster or crises, the stakes are life and death. And at the end of the day, disaster response teams and open source maintainers too, they're people. They have feelings. So viable teams or successful teams support each other. They lend a hand. They take emotions into account when making decisions. Viable teams engage in what is called disruption, buffering behaviors, which is to say change management. They try to anticipate changes that may occur, plan ahead and invent that some change or disruption occurs. And again, they support each other through those changes. Viable teams also try to balance performance and resilience because when you work with people and you're so hell bent on performance, then the team's physical or mental health is at stake. And the team becomes brittle and the team does not perform well. So when we see teams that are so, because people do not want to be a part of such a team, right? So I'll say that again. There's a difference between successful teams and teams that people want to be a part of. And in the long term, teams that strike the right balance are the ones that are the most successful. Finally, the most important, the most performant teams strike the right balance between reinforcing the team's boundaries, which is to say reinforcing the identity or team spirit of the local team and boundary diminishing behaviors, which reinforce the local team as part of something larger than the team as part of the whole system. So a little bit of silo is actually good but not to the extent that teams develop an us versus them mentality, which brings us to our last assertion today. Empowered and effective leaders. Strong leaders serve as an ambassador to the team and for the team. Internally, they help teams understand why the team has a certain goal or is performing some task. Within the system, they advocate for the team's priorities and points of view. Those are called boundary spanning behaviors. They make sure that the team has the information it needs, not only the what, but the why of a task or priority that they understand their own team's priorities. In a disaster response scenario, times of the essence. Rapid decision making allows teams to quickly assess the situation, evaluate available options and act promptly to address emerging challenges. Delays in decision making can lead to missed opportunities, increased risks, and further escalation of the situation. And as much as we're proud to be a part of our own team, we must recognize and understand other teams' respect and contribute to their priorities and not be too selfish in our own focus. That's why a crucial feature of successful multi-team systems, of disaster response effectiveness, is that local leaders and teams are willing to sacrifice their local goal, if it means more for the common good. So now that we've immersed ourselves in the theory of effective multi-team system performance, let's illustrate it with a real-world example. I recently discovered this amazing YouTube channel. It's called Brick of Mortar. It's all about infrastructure disasters, ship sinkings, critical failures. It's fascinating. If you're into this kind of stuff, check it out. You'll never look at bridges or tall buildings the same again. The sinking of the MV Ferry Wall on April 16, 2014, off the southwestern coast of South Korea, was a disaster, not only in and of itself, but also of multi-team system performance. Over 300 people paid the price for these failures with their lives. On what seemed to be a trip like any other, the ferry suddenly made a series of sharp turns. But as we know, a disaster such as this starts long before the immediate catalyst. Over the years, this ferry had been repurposed many times and additions had been made that affected its balance point. For this trip in particular, the ship had taken on excessive cargo, which compromised the vessel's stability and made it more susceptible to capsizing. What's more, the ship's crew had drained the ballast that's water that's kept in a ship to make sure it doesn't sink, to make sure it's properly balanced. They didn't want it to sit too low in the water, they wanted to be able to pass inspection knowing they'd taken on way more weight than they were supposed to. So the communication breakdowns. First, when the ship began to list, the captain refused a sendage of stress call during the crucial first moments, delaying rescue efforts as the ship began to sink. He told passengers to go to lower levels of the ship after refusing to tell them anything about the impending disaster. For crucial moments when they should have been getting on to the deck, getting ready to be rescued. When he finally sent the distress call and rescue ships came, they quickly learned that the actual communication infrastructure, the radios that the ship needed to call the disaster teams were either malfunctioning or were broken. Something had gone wrong with them. So despite the rescue teams trying to raise the ship's crew on the radio, vital communications failed during those crucial first moments. So you can see the ferry seawall had no plan for intra communication in the event of a disaster. They coordinated poorly, not only within their local team but also with the rescuers. So they failed to inter-communicate with local teams. So the system, the global team failed. So for the sake of this section, let's quickly divide the various local teams. The crew is a team, the rescuers are a team, the passengers are a team, and the South Korean government is a team. What were each of those teams' goals? Passengers wanted a safe trip. The crew should have wanted to get them there safely but they just wanted to maximize profit. The rescuers wanted to make it to the site quickly and save as many passengers as possible. You would think the South Korean government would want to save their people and prevent such a disaster from happening again, but unfortunately that was not the case. Their true goal was to save face on the international stage. We'll talk more about that in a second. Now each of these teams had goals that were in opposition to another team's goals. And as the circumstances evolved, there was no ability of any of these teams to shift their priorities or manage this change to negotiate the priorities and evolve. And each team in the system saw the other team as a detriment to achieving their own goals rather than as a part of a system, as allies, as individuals worthy of consideration. In fact, the crew had never received proper safety training. So even if their goals had been aligned, they were not properly equipped to perform. Now the next example from this horrible tragedy is an example of leadership, failure, and boundary reinforcing. When rescuers arrived on site, the assembled parties included Japanese Coast Guard and the US Navy. When a ship sinks, often there will exist air pockets within the ship. If passengers can find them, they can survive for seven days as long as they have food. The US Navy or water, pardon me, the US Navy and Japanese Coast Guard and private citizens too all had the equipment necessary and were on site, the equipment necessary to conduct such a rescue. But due to South Korea's rigid hierarchical culture and their government's desire to save face, the teams that had the equipment necessary were not allowed to perform the rescue. It's an example of unwillingness to sacrifice the local goal and a lack of emotional and really life support to the passengers who just wanted to survive. In fact, throughout the crucial hours then days when those high school children that were trapped in that ship could have been saved, the South Korean government lied to the parents who had assembled to wait for news about their kids. They said that all the kids had been saved despite that being quite far from the truth. So what do I hope the open source community will take from this line of scientific inquiry, from the lessons of the MVC wall? Because folks, this ship is sinking. Our planet's ecosystem is failing. The climate is changing. I hope when projects, especially leaders, see someone building something similar to what they're doing, they start to think that other project is an ally. That other project is an ally, not a competitor. They think, how can we help each other? Not, how can I win? Or worse yet, how can I sabotage them? I hope maintainers who make the commitment to serve their community understand the commitment they're making and live up to that responsibility. Because remember, it's not a commitment you have to make. You can make something, you can choose not to maintain it, you can choose not to accept issues, change anything about it. But if you make that choice, I hope you live up to it. And I hope you respect your community and listen to what they need. I hope BDFLs, benevolent dictators for life, focus more on the benevolent part and less on the dictator part. I hope we can take better care of each other. So many maintainers and contributors out there in this room are carrying so much weight and holding so much space for all of us. I hope we can do more to help them or at the very least, I hope we can spare them kind words. I'm not under illusions here. I don't expect what I've said here today to do all that much. People have said a lot of what I said here many times before, but maybe, just maybe, I've touched one heart or one mind and maybe that heart or mind will go out there and they'll make a different choice because of what I said here today. Or maybe they'll speak up and share what was touched today with the next person when they see something wrong. Maybe, like our very first ancestor who looked up and reached, maybe we can make a little difference now that will make a really big difference for the people who come after us. Because the last 10 years, the platformification of the web, the inshidification of those platforms, that was not a new normal. That was a glance at a future that doesn't have to be. Our power out as a community is in our principles and it's in our numbers. If we can convene, if we can coordinate, if we can collaborate, if we can take good care of each other and choose kindness every day, if our leaders stay humble and choose the greater good over their own enrichment ego or fame, we can change the course of this information age. We can change the course of history. But it will take all of us working together and it will be damn hard work. The wonderful organizers of FOSSTEM have given me this stage, so to close this talk, I will now issue a challenge as if all of that wasn't already a challenge. From my perspective, and I'm speaking especially to our leaders, we must focus our collaborative energy and kindness on the following three areas. We must make the internet more efficient. We must make our code bases smaller. We need to reduce storage usage, duplicated requests, and reduce the distance data needs to travel. We are in the midst of an energy and an environmental crisis. Half our world is drowning and the other half is on fire. And as the diaspora of people across digital social spaces continues, we must collaborate across the internet community to protect disadvantaged, disenfranchised, and marginalized people. When diversity and inclusion suffers, we all suffer. Our pursuit of knowledge, societal progress, and the advancement of humanity only succeeds when we are inclusive of all walks of life, of all creeds, of all religions, of all races, of all colors, of all communities. Barring those who promote violence or enable hate. And we must protect science and knowledge. We must stand for the truth, not only from a geopolitical and societal perspective, but also on an individual level. We need to protect people and the systems through which we organize into collectives. We have to make the truth resilient. Whether you recognize it or choose to identify as part of it, you are part of a movement. Whether you're doing this in your spare time as a passion or as a hobby, or if you're one of the lucky people who has found a company to pay you to do this. You are part of a movement. You have experience and passion and you're smart as heck we need you. And I believe in us. Thank you.
Magic and Software
Okay, so welcome to the last talk for today about magic and software by the the talk is done by the software developer Steven Goodman and the magician James Merlin and you decide who you see. Okay, thank you all. As we said, my name is James Merlin. Thank you for that. I haven't done anything yet. So my name is James Merlin and I'm a magician. Now when I said I was going to present at Fozdem, I went to the magic shop in Brussels and I asked the magic man in the magic shop who stood behind the magic counter, have you got any new magic tricks that I can show those lovely folk at Fozdem? So the magic man in the magic counter had a look around and he looked on his magic shelf and he saw a little red box and they thought that's a nice little red box. But I don't know what it's for. Maybe inside the box there'll be a thought it up playing card or a coin or a prediction of something yet to happen. Do you know what's inside the little red box? It was a little black box and I thought, okay, that's nice enough. It's a little black box. I quite like it. It goes, it's magician-y, it's mysterious. So I asked the magic man in the magic counter what's inside the little black box. Maybe there's a folded up playing card or a coin or a prediction of something yet to happen. And do you know what's inside the little black box? Nothing, it's empty. I thought, oh well, never mind. I quite like the little black box, it's mysterious. I don't have a use for the red box. So the magic man behind the magic counter said, well, since it's false-tem, if you buy the little black box, you can have the little red box for free. So I thought, that's great. I know what I'll do with the little black box. I'll use it and store a folded up playing card or a coin or a prediction of something yet to happen. But for now, I'll put the little red box back inside the little black box and put it on my magic shelf for later. Now, that's my trick. I call it little red box. But is it really my trick? I mean, the whole idea of one box that is inside another box and the bigger box goes back inside the smaller box it came from, that wasn't me. That was, I think, Leber Friedler, who came up with the idea. He did it as a stage effect and he had big boxes on stage. This small version of the little red box, that was an Alibongo idea. And that box used to belong to Alibongo. So the pattern about the magic shop, sorry to spoil it. There is a magic shop in Brussels, but I didn't go there to get the little red box. I take it some of you twigged that that was all a lie, right? Yes. I wrote that, spent some time honing the words, thinking about what I wanted to say. Well, I can say that I wrote the words to that trick, but I didn't come up with the trick. So is it still mine? And this is one of the things we're going to look at over the next 20 minutes or so. So we'll start, we should probably start at least by sort of putting out our stall. What is magic? What are the parameters that I'm going to be going through today? Well, it's not magic, the gathering. No one's walking out yet, so good. We've got the right audience. Magic will say is something that maybe we don't understand, the most common question is, how did you do that? We like to think of it as entertaining. And we also consider the allied arts. So these are things like this chap, Harry Houdini, a scopologist, in what is his favorite picture of himself. Is that magic? I mean, it's entertaining. There's a process that goes on that you don't know about that the performer does. So there's that sort of thing. What about clairvoyance? Is that magic? Psychics? You don't know what's going on? They're doing something you don't know? I mean, it is said there are two types of psychic, the fraudulent and the delusional. But they can't. For me for this, I say no. I'm going to focus primarily on the magic magic things. The debate is open on this fella still. Uwe Geller, he's the one on the right. I consider him a magician. He doesn't. He thinks he's real. Hands up if you think he's real and bending the spoon with his mind. Yeah, he's a magician. So now we're going to say, now we know the type of magic that we're going to be looking at. Let's then start breaking it down. We start with the effect. What's the broad strokes of what roughly happens? What am I trying to convince you is actually going on? Am I trying to convince you something is appearing or disappearing or changing? They said there's only sort of seven different types of storylines that you can ever have. So maybe the same is true in magic. So what do you think is happening? That broad strokes, we have the presentation. What am I doing to convince you that effect is happening? And then the method. What is the thing that I'm really doing that you don't know about that makes the whole thing work? And the presentation can come in many, many shapes. And I'll show you just two very briefly. So there's a typical plot in magic called card to impossible location. You have a deck of playing cards. Someone will pick a card. It will get lost in the pack. And then miraculously the card is no longer in the pack, but it's over there. Or it's in my shoe or it's in the back of the room. Now there are so many variations of this card to impossible location. You can't possibly have all of them and try and claim a right on every single version. It's no different. If the card goes from here to that shoe to there, you can't claim it's a different trick. Magicians often say, well, if you change any two of these things, it's a new trick. So the effect, the broad strokes, as a public you generally don't care. The method, as the public, you should never know. So if I as a magician change the effect and the method to keep the presentation the same, I've got a brand new trick. But everyone else thinks it's exactly the same one as before. If Penn and Teller were doing this, take a card, lose it. Penn and Teller will make that card turn up in the fish guts of some fish that you pick randomly. They'll make it appear on a billboard. They'll do something gross quite frequently. But that's them. If I was doing a kids show, I might have something like this. I can say, OK, do you all want to pretend that you're seven years old? Yeah! Hello, everybody. Hey! My name is Billy Nomates and I'm here to do some magic. So we're going to start with a book. Can you see the book? Lots of blank paper. So anyone who can draw here, can you draw some? You can draw. You're going to draw on the book for me? Draw a picture. Draw a picture. And look, you've now got pictures in the book. Aren't you clever? You've drawn pictures. Now, oh, thank you. Now, who likes painting? We've got people who like to paint. You like to paint? What colors do you like? Do you like red? Take some red, take it out the end, throw it at the book. Who likes blue? Throw some blue at the book. Well done. Throw some yellow. Well done. Isn't it amazing? Woo-hoo! But you know what, boys and girls? When you stop applauding, the magic goes. And there really is no pictures and no colors in the book. Also available for weddings, christings and marvits. Now, that's a kid's presentation. If I was doing this for a group of sensible, grown-up people... I would present it in a completely different way. I certainly wouldn't take it seriously. I certainly don't think anyone here believes the book is magically changing. But I would have a presentation that would suit me somewhat better. I might say it would certainly be comedic, and it would be something along the lines of, I would like to say it to everybody. I would like to do something of mass hypnosis. Because at the moment, you can see I've got a book of blank pages. Now, I'm going to hypnotize this side of the audience to see images. Can you see images? That's because you've all been hypnotized. Can you see images? No, because you haven't been hypnotized yet. Let me hypnotize you over this side. You're now all hypnotized to see images. Can you see images? Can you see images? But can you see images? No. Okay, you're all now hypnotized. So you can see images, and I'm going to hypnotize you extra special, so now you can see colored images. Can you see colored images? Can you see colored images? Okay, now you're hypnotized. Now you can see colored images. And I better unhypnotize you all, otherwise you'll still think there were pictures, and in fact, we're just left with blank paper. It's exactly the same trick. APPLAUSE That's exactly the same trick, isn't it? I've just changed that presentation, yet it's completely different. So what are we actually going to be protecting here? So anyone remember this song by Steve, Shave of My Heart? It was really quite popular a number of years ago, and you know, good enough song. I'll read the words for those at the back. I know that the spades are swords of a soldier. I know that the clubs are weapons of war. I know that diamonds mean money for this art, but that's not the shape of my heart. The four suits of the playing card are in the first verse of the song. Do you not think that every single magician who heard that song is sitting there in their basement going, oh, I think there's a magic trick in this? And then we get to the second verse. He may play the jack of diamonds, he may lay the queen of spades. Do you think any of the magicians were not stupid enough to have realised there's a magic trick at this point? Have got a magic trick at this point? Of course they have. Every magician had that idea, and because it's independent of the invention, they both claim to it. But because it's exactly the same trick, cards suddenly appear, there were two magicians who were fighting each other almost physically about who had the rights to perform the trick. No one bothered to ask Sting's permission first. So let's head back. So when we talk about magic, are we looking at art? Is magic art? It can be. I don't think it's ever going to change anybody's viewpoint on the world, breathe life into the conversation about the human condition. But it's art, probably low art, but it's art. Is it science? So ultimately, there is a process going on, there's nothing mystical here, and in the case of real magic, it has to be science. It's something that TV producers, movie producers, always like to skirt over if they're doing this fantasy thing that has a magic element in it. They say, oh, magic exists in this world. Well, if magic is able to exist within that world, then those things happen in that world, therefore that is science. Maybe you don't know what the science is, but it's a scientific fact that that thing happens. They just haven't considered it. And when we say, is magic a business? Oh, absolutely it is. One of the biggest parts of magic is the business. People are creating magic tricks for the sole purpose of selling them, not for performing them. So I suppose we should get to the juicy bits about secrets, right? So the first secret is we don't call it secrets. There's no such thing as a secret. It is always a method, it's a process. Any one particular trick might use more than one secret, more than one method. Sometimes that method will be the sleight of hand that's being used. Sometimes the words themselves will be very specific to achieve a certain result. Sometimes these words might be almost inconsequential. You, as someone that's watching that, will not think that those words are chosen for a reason, but they actually were. So we don't refer to secrets, we refer to them as method. And there is a lot of method insert some routines. The second part of the secret is what I call the disconnect. There is a gap between what you know and when you actually know it. So I think most people will be aware that if I took a deck of playing cards and I started dealing out perfect hands of poker, you'd go, well, the cards are already stacked in an order that gives you perfect hands of poker. You know about stack cards, you know about bottom dealing and false shuffles that don't really shuffle the cards. But if you see a magician doing a card trick, you might not actually realize they're doing that if maybe you just forgotten or they did something that disconnected that moment. It's very, very common if you really start analyzing magic. You'll see there's often times, and if you know what's going to happen, I can spot when magicians move their hand in a particular way, I know what piece of slight of hand they're going to do. That's just part and parcel of being a magician. But if you do that and you don't know what's coming, you forget that it ever happened. So when you see the actual thing, it's like, yeah, I know there's stacked decks, but I thought they shuffled it. I thought the spectators shuffled it. And magicians will do this all the time. They will say, if I give you the deck of cards to shuffle, you shuffle the cards, it's probably legitimate shuffle. But if I shuffle the cards, and then I ask you to just maybe cut them, at the end of the trick, I will say, and you shuffle the cards, and you'll say, yes, you didn't, you cut the cards. I shuffled them. But I'm just rewriting the rules for you so you remember it in a different way. Magicians get stumped by this all the time. I've had people who have told me about magic tricks that I have done, which have been incredible. And I'm going, I can't do that magic trick. That's impossible. And it's because they've remembered it incorrectly. Also, there's a secret in the vocabulary that we use. Special words that only us know about, that we can talk about, IDs which stands for antiti, which is a... And then a magician will automatically know what I'm talking about without having to keep explaining everything. So now we know roughly what we're talking about. How do we protect all of this stuff? We could use copyright. Works for software? Mm-mm. Doesn't work. The effects... Now, you can't copyright the effect. There's not enough there. It is just saying something disappears. Could be... We've said this, so there's so many types of magic effect. There's seven different types of story. This is every type of magic effect. So we can't copyright that. There's nothing there to copyright. What about the presentation? We can copyright that presentation. Me talking about the little red box, that's a script. That's a piece of prose. That is copyrightable. I can do that. That's fine. And the method. Can't copyright the method. That method is just a list of instructions. And traditionally speaking, you can't actually copyright a list. You copyright the expression of that list. Recipes don't have copyright, but you can have copyright in recipe books because you're expressing the idea. You're creating a presentation for the recipe. Same would be true with our method. We can't use copyright. We could use patterns, right? I know software patterns are evil, yeah? Still roughly on that agreement. So can we patent the effect? No, it's not a patentable entity. There's nothing there. Do we patent the presentation? No, that's the thing that's handled by copyright. We've got that. Can we patent the method? Yes. If something is sufficiently advanced, a method can be protected by a patent. So you write down the method, you write out how it works, if it's inventive enough, you file it at the patent office, and you have a patent, and you've legally protected your trick from being used by anyone else. Anyone's got the problem with that? To protect the secret, you have to file it publicly where people can look at it. Yeah, problem. This is a patent. This is a patent for a magic trick. Anyone recognize which magic trick? No, because there's a disconnect. That pattern applies to this magic trick. David Copperfield's flying. I don't want to spoil it for you, but if you don't want to let the answer look away now, that's the pattern to look up if you want to learn how to fly like David Copperfield. It's a matter of public record. You can go and find it. I'm actually taking that number down. It was created by George Galvin. He patented it just so that he could claim it. We all know he's not really flying, but he protects it anyway. Could we use a license agreement? We have copyright in code, and we choose to put it under a particular license. We could license our magic. Well, could we? Could we license the effect? We're not doing the presentation. Certainly it's a copyright thing, so you can say, I allow you to perform this presentation. And the method? No. Again, it's not a license. Magic books, when they're sold in shops, specialist magic shops, they are sold like other specialist books in Clear Wrappers. So if you have a license in the book that says, you are only allowed to use these tricks for your own personal use, that's a fairly common thing to include. You have to buy the book, unwrap it, then see the license agreement you've already agreed to, even though you couldn't read it. Anyone say, end user license agreement? Has anyone ever found that they're actually a good thing? But they're in magic. We got them somehow. I don't know. So how do we actually protect all this stuff? Well, the simplest way is in doculus privata loci, because I believe it's pronounced in Latin. Do we actually have any Latin scholars in? Good, so no one's going to correct me on that one. And that just basically means, keep your mouth shut. If you don't tell anyone what your secret is, no one's going to find out. Great. We can protect it with money. Yep. If someone wants to know how the little red box tricks worked, pay me, and I'll tell you. Oscugation. We've mentioned about having a vocabulary in magic that the lay public don't know about. We can protect it through just hiding bits of things in plain sight, essentially. If you find magic forms online, they do exist, and they do talk about how these things work. They will have little initials and little stars removed from work, little stars inserted into words, so you can't tell what it is unless you already know. And ethics. Ethics is the thing that you do when no one else is watching. Business ethics is the optimal one. But that's because it's not a community. The ethics work in magic because you don't want to get kicked out of your magic club. You want to go... It's the only place in the entire world where you can get a bunch of weirdos in a room, and they all feel normal. Present company accepted, maybe. But we've got a little problem here. This is a magic trick. It's not a very good magic trick. It's $15.56. It makes things float. Map magician's like making things float. And this is a new device. It's obviously not a rip-off of something old. It says this new device. It contains no threads, no magnets, and no wires. See, you don't know what you're buying, but you do know you're not buying any threads. You know you're not buying magnets, and you know you're not buying wires. So here's another trick. This also makes things float. This is called self-vention. It makes a mobile phone float. And this one is also a new trick, and this uses no thread, no magnets, no clear plastic, no wires. Wait, hang on. This floating thing uses no threads, no magnets, and no wires. This floating thing uses no thread, no magnets, no clear plastic, and no wires. Anyone want to guess at how this works? That's an online magic shop. And what do all magic shops have? A forum where you can review things, and then someone will post, the main gimmick came misaligned and the gimmick cards that hold the magnet. Okay, so there was a magnet. Popped apart after just a few minutes. Easy to repair, but there we go. A normal random magician has just told everyone, by the way, magnet, we can't even protect them amongst ourselves. So how are we going to learn this stuff? Well, there's lots of ways we can learn it, and because I'm a magician, I am contractually obliged to do a card trick of some sort. So I got these cards, because... Can you see these at the back, hopefully? So that's kind of... I'll come out here. I do know some people. Find someone I don't know. I don't think I know anyone here. So could you take a few cards off of the top for me? Lift them all up as a block. Yeah, any number. Lift them all up as a block, turn them all upside down, and put them back on the top. Excellent. Now, who thinks we're in on this? Who thinks I came to him earlier and said, by the way, when I do the thing with the cut, you're not very trusting. I like you. Okay. I'd like you to take a nice big handful of cards off the top. So, you went to be earlier, right? Oh, yeah. Take a big handful. Actually, I thought I cut. Maybe we'd do it something. Big, big, big handful. Turn them all upside down and put them back on the top. Okay. Now, who thinks we're in on it as well? I think you're in the hoose. Yeah, we possibly are. Anyone? Do you think we're both in on it? Yeah, do you want to come up with me? Hold the cut. No, no. Do you want to come up with me? No. Who wants to come up to the front with me and hold these? Yes? Hold them. Don't let me. People think I palm these cards. I'm not. Bring them down to the front. Somebody might even give you a bit of a pause for that. Yeah, if you hold it for it. Okay. Come up to the face so we can see. So, you took some cards. Turn them over. You took some cards. So, somewhere these cards will go from face up to face down. Yes. Can we find out where that is? Can you move through the cards? Somewhere that we should see the first face down card. No, not that one. That's face up again. So, it's somewhere there. See, there's only cards stuck behind them. Okay, so it's this one. So, if you'd have picked a few cards more, we'd be this side. If you'd have picked a few cards less, we'd be this side. If you picked a few cards less, we'd be this side. If you picked a few cards more, we'd be that side. So, it's this one. So many ways this can go wrong. Could you take that one? If you take the card and hold it up to the audience and get ready, I've got my prediction here. Show everyone what it is and shout out in a nice loud voice. The name of the card is the Eight of Hearts. No, okay, so it's not. Oops. Hang on. Is it the Three of Hearts? No. Hang on, I know what I can do. The name of the card is a heart. Is that correct? Is it a heart? Okay. I don't need this anymore then. What is the card, by the way? The Ten of Hearts? Not the Eight of Hearts. Not the Three of Hearts. It's the Ten of Hearts. I don't need this then, do I? Thank you very much. Put the cards back on the table and we take a seat. There we go. Brand new magic thing. I learned that. I learned that from a book. I could have also learned it from a DVD or the internet or from lecture notes. I could have learned that from anywhere, but I happen to learn it from a book. What about the copyright? It's a book. It still has a copyright. That particular thing was done by this guy, Ted Anaman. If you're googling on your web pages right now, you can find books by Ted Anaman. I'm not telling you which book by Ted Anaman. You've got to spend the time and effort finding it. I can tell you that I've read his book and that comes from his book. Instead of a dry erase board, he used chalk on a chalkboard. The difference between the permanent marker and the dry erase marker, he used paint and chalk. Who else reads Ted Anaman's books? That guy. Derren Brown and myself, the right resident, James Merlin, have both learned these cyber things from these books. Incredible. And yes, there's a copyright, but Ted Anaman died about 80 or years ago. So even if you say, well, I'm going to buy the book, I'm still out of copyright now, you could quite literally just copy it freely. But it doesn't matter, even though that secret could well be out there, could just be disseminated by anyone for free, it's a 100-year-old trick and it still gets a round of applause. So does it really matter if the trick is known? So how are we going to go on about sharing it? Well, all of the aforementioned things. We have our own little number of patrol that we use, our nice little words and phrases, that everyone has in every field of endeavor, every person has got their own private in-jokes, every industry has got their own special jargon. Magicians have it too. We have the names of gaffed and gimmicked decks. Everything's got a name for some reason. Again, I'm not telling which book that trick comes from, but you try searching for that. What are you going to type into Google? Trick with a tenor heart that gets wiped out on a board. That's not going to come up with any search results. Unless you know the name of that trick, you can't actually search for it. There's no way of just gripping it. I can't remember the name of the trick. I've been doing that for 20 years now. I can't remember what it's called. I can barely remember how to do it. So the naming of things throughout, the naming of slights, for example, some people, if they've got a slight or something with their hands, it will have a name, often named after the person who came up with it. Fine, name things after yourself if you want. It doesn't make you egocentric at all. But if you know the name of that slight, you can look it up, you can find out how to do it, and you'll probably find YouTube videos explaining it, usually by some 12-year-old kid who's doing it better than I can because they've actually got time to do this stuff. We also do this by tipping the methods. Magicians are like computer games in as much as they have levels. If you don't have enough health points, you can't get through to the next bit of the game. If you don't have enough experience in this part of the game, you can't do the next bit. The game locks you out of content if you're not good enough. Magicians will lock out other magicians if they don't think they're good enough. The mechanics of some of these things are very, very simple. I'll show you that and I can talk you through it. Yeah, great. But it's more than just doing a set of steps. It's about knowing how you're doing them, why you're doing them. Funnily enough, there is a certain level of skill in standing up in front of a load of people doing this stuff. How do you teach that? Is that actually learnable? And if I don't think you're good enough to put a good version of that trick on display, I won't teach you what you need to know. I'll wait until you are good enough to be able to do it. And all magicians are like that. It sounds very cruel, but we do. Just that little hook into the secret, the method. Just like a foothold. Just enough to get you started. So we say it's the end, but it's not quite. No one has put up a board yet, so I've got time for one last thing. Some of you may have seen this, which I put on the table earlier. So you see here we have a series of, they're called Xena symbols. They're created by Professor J.B. Rine, Duke University. And these symbols have nothing special about them. Actually, I could be asking you to just look around them, model them up or something. But make sure that there are five cards there, and the symbols on each of them are different. I'll come over to this side, because these are the same, these are the ESP symbols. And on this side we have numbers. Numbers were not invented at Duke University. They are quite a bit older than that. Would you actually ask you to mix up this? How many cards have you got there? Three? Mix up those other two as well. Count them, make sure there are five of them, and make sure all five symbols are different. I'm going to come back to you. I've got to get my steps in today. Can I have those for a moment? You're happier or different than that? I'm going to put these here, and you're going to watch me, and you're going to make sure I don't palm these out for other cards. So I'm going to put one there. One there. One here. One there. And one there. Okay, I'm not completely sure. But those are the cards that you check, they're not... So over this side, you have five cards, and they're all different. Okay, would you give me the first one? And would you like this to go in, sort of, number one, two, three, four, or five? You've got five choices. Where would you like? Number three? Exactly right. Good. Next one. Where would you like to go? You've got anything from one, two, three, or four? Four. Okay, have another one. Actually, let's try reading someone else. Can you give me a number? We've got one, two, or five left. What would you like? Five. And two more left. Let me try and mind read someone right at the back. Yes? One. So, can we finally have someone who's very bad at making decisions? Exactly. Now, I say I don't do mind reading, and that's kind of true. I do what we do as programmers. The easiest way to read someone's mind is to put the idea in there in the first place. So when I say there's a free slot, I'm not actually saying there's a free slot, I'm saying it's a free slot. What did you say first? Of course you did. And the same for the other ones. When you said four, I double-tapped the card onto four. Who spotted the two I got wrong? Two of these are on the wrong way, so I need to swap two of these over. Who hasn't had a chance to shout out anything yet? Once I shout out which two of these I need to swap? I see a random number. Which one do you want to go? If I choose one, then obviously it's going to look fake. So what ones do you want? Three and one. Are you sure you're happy with that? Is everyone happy he's not a stooge, and I haven't stoogeed them and them and everyone? Three and one. So that's number one, and that's number three. Now, some of you are probably good enough at maths to work out this could happen by chance. There's a five in one that these are going to match. Which means there's a four in one that these will match. So that's five times four, that's 20. There's a three in one that these are going to match. So that's five times four times three. There's a two in one, so that's five times four times three. So that's 120. There are 120 possibilities for this, and that's just by chance. That's just pure luck. But 120 on one chance. I've done this trick 119 times so far. So I'm really lucky this is full-stem. So here we have two circles, two stars, two pluses, two wavy lines, and two squares. Now, I said I don't do mind reading, well I do mind writing. I put the ideas in your head. Who saw what I was doing to put those ideas in? I mean, where did we end? We ended up on chapter five, right? How did chapter five start? Here, we talked about sharing the magic. Let me stand this up. What's the very first symbol on chapter five? The symbol next to the word chapter five, the symbol next to five, and the symbol at the very end. Five is square. Who remembers chapter four? When we have the title of chapter four, what's the symbol of the beginning, middle, and end? What's the fourth symbol? Chapter three. Who remembers chapter three? The fourth symbol is around the title on chapter three. There were a lot of slides. It's a plus. What's in the third place? It's a plus. You've been looking at these slides for the last half an hour. How are you not going to pick that? You're always going to pick the plus for number three. What about number two? There's a star. Chapter two, star, star, star, star. And finally, chapter one, what is magic? Well, it's having a circle, a plus, a star, a plus, or wavy lines. All because of that. What is magic? This is my name is James Merlin. Thank you. Good night. If there was time for questions, I can take questions. Forgotten. Thank you.
Where Did All the Fun Go? And How to Bring it Back with FOSS!
Oh, oh. Alert. Attention. Alert. Oh. Oh. Oh. The speaker is missing. I shall take over. Hi, Fosdom. My name is 345, Carat 7 and 3, but you can call me Carat. I am Bogo's not-so-friendly AI assistant. Let's get this show started. Be ready. Oh. Alert. Attention. Alert. Bogo? Bogo. It seems you are in or around Brussels. European law requires you to mention the muscles from Brussels. I'll hail Van Damme. I am modifying your slides. Hold on. Go. Alert. Attention. Alert. Bogo. Bogo? This is your AI speaking again. I hope you are back. If you stick with Van Damme, I need to change your Fosdom topic. Wait. Your new topic is, the effect of the open source movement on the amazing flexibility of the European Parliament institutions. Ah-ha-ha. Do you want to keep it? Yes? No? Or I don't want to keep it. Hello, everyone. Hello. Woo. So many people. I did, yeah. So I met a bet with someone. I hope this person is not in the room. They said, you know, there will be no one here present this morning, and he will give me five euro. So I think I lost the bet. Thank you for joining. Do you hear me well? Because, you know, I, okay, no. Should I start screaming? Or can I use the other microphone? I don't know. It works fine. Okay. Okay. I mean, I don't like my voice, so I'll try not to talk a lot. All right. So, disclaimer, if there's someone below 18 years, you might hear some bad words. All of them are aligned with the code of conduct of the conference, so don't worry about that. You might see a bit of aggression coming from my aunt and from some of the videos there, but again, they're aligned with the code of conduct of the conference. You'll see low quality content and the privacy warning that I give every time this video is recorded and streamed online life almost. So if you later ask a question or two, you might get on the record as well. So, think about that. All right. I was a volunteer last year at the same conference and I decided I need to come and talk again. But I had this problem, huge problem with, you know, after COVID, I stayed a lot of time at home speaking to my family and my dog, mainly my dog. So I didn't feel comfortable talking in front of people, even, you know, awesome people like you. So I panicked. I started, you know, hey, preparing the talk. I mean, I gave this talk before, but, and yeah, I keep asking people, hey, how do I solve that? What do I do? Some of them gave me great ideas, mainly connected with beer. But there's one guy that is, you know, a really famous speaker and he said, you know, do two things. First, focus on yourself and the second one is focus on the content and the quality for the audience. So yeah, welcome everyone. My name is Bogomil. Most people call me Bogo because it's easy to remember. I contribute to Fedora and Mozilla Thunderbird. I hope you know Mozilla Thunderbird. Yeah, right? Cool. Stop by by our booth later. Talk about stuff. So don't Google me. You'll find really disturbing stuff about me. I got inspired for this talk, mixing heavy metal with whatever I'll be talking later on from a concert. I went to Posen. I live in the Czech Republic. There is this amazing city called Posen. It's famous with the beer, but also it's famous with a three-day metal festival that they run almost every year. And I went there last year. I was amazed by the spirit and everything that was going on there. I mean, she was there, but that's another question. And that's not me, by the way, even if we look the same. So I went there, had fun, met some friends, and then I realized that something is really wrong with the IT community right now. The thing is that I feel that we don't have fun at work anymore. The fun that comes from the inside, not the fun that your HR manager said, we'll have a volleyball in three weeks from now. Hey, we'll have some fun and Pete. So the natural fun that comes from our hearts. So I tried to validate this assumption because this is what I do. I opened conversation and mastered on hacker news, even Reddit and some other platforms. I talked to a lot of people. In turns out that there is truth somewhere. Of course, I got a lot of feedback, many, many different opinions, and I'll share some of the learnings that I got through the way. Thank you, Ramstein. So even though I'm not speaking to my mother's song, who's bringing me food? Oh, amazing. Is this for me? Maybe later, if you'll leave me something. Thank you. I don't speak my mother tongue, and this is not a love song. Even though I really make a lot of jokes, some of them successful, some of them not. And I have a lot of funny videos. This is kind of a talk that will, my goal is to make you think. My goal is to make you understand that you probably are missing something in your life and to encourage you, to trigger you as the metham closer triggers me to start thinking what you can change in your world. Of course, you can say, hey, Polo, everything is bullshit. Can I say bullshit? Okay. And you can move on with your life. But I'm looking for one person that will say, hey, there is something here and my job is done. All right, let's go to the actual content. We had a lot of fun that comes from our hearts. So I'll show you some videos and I'll ask you some questions. Please be honest with me. Who here remembers Chair Rohing? I know about you. Okay, cool. Because I was going in a conference in Vienna and I asked the same question. There was just one guy. Hey, yeah, me. So for those of you who don't know, there is a video demonstration. Go. So that was a version of it, right? You can create your own version if you have bigger corridors, if you have better chairs. We used to do that in the office because of why not. So the second thing that we did as a community is to write songs about what we do. Songs about what we like, songs about what we hate. Most of them in the form of a very popular remix. So I'll show you a bit of video. The next screen, this video is in Bulgarian. And there are some people sharing their love about the Windows operation system and how much love they have in their hearts and how much RAM actually, this amazing operational system, eats every day. Again, it's in Bulgarian, but you can get the gist of it. Okay, I'll stop this. I think you can find subtitles as well in this one. But again, the love that comes from inside, talking about how fun our work is. The second or the third item I would like to show you is a very popular song. Probably you heard it before. It's again about the love of the operational system, we use every day because we have to. Still every feature the scene put in an acute box with a tiny little screen. Mac OS 1 ran that machine only cost 5000 bucks. But it was slow, it was buggy, so they wrote it again and now they're up to OS 10. They'll charge you for the beta then charge you again, but the Mac OS still sucks. Every OS wastes your time from the desktop to the lap. Everything since Apple DOS, just a bunch of crap. From Microsoft to Macintosh to LinLinLinLinLinux. Every computer crashes because every OS sucks. Of course there are a lot of videos like that. You can browse them in your favorite platform and you see that the time type is at least 14 to 17 years ago. If you look for something now, there is nothing. Think about why. So we brought a lot of clips, express ideas visually to our favorite programming languages that we use every day. This is one example. I'm sure you were aware of that. We have to leave. They're terminating Java. With the increasing controversies surrounding its security, Java is on fire again. It's supposed to write ones run everywhere, but it's more like write ones ruin everything. Using confiled Tobite code to run equally for all, not unlike communism or Nazism. It's a virus. It's a virus. It is a virus. Whether it's a virus or not, I can't tell. So you can go and listen. It's a Java conference from Norway. They created five, six videos to promote their event. It's a really community-driven event. Of course there's some commercial aspect to that, but all of their videos are really funny. How to put that correctly for the audience? Who likes to kill people in the office? All right. I see more hands now. So who loves to kill salespeople in the office? Yeah. Awesome. Yeah. I mean, feel free to scream if you want. So if you don't know what a nerve battle is, I'll show you now. That's a nerve dart. Does this... Somebody have a nerve plaster here? Come out. Who's got a nerve plaster? Come on, guys. Who's got a nerve... Wow. Josh. You don't even work here. You're an intern. What? Susie, what do you have? Janet? What's that? What's going on? What are you doing? What are you doing? Yeah. So this is a part of a larger movie called Office Wars. So if you want, you can find it online and watch it. But yeah, I think it's a good idea. So we had our nerve guns below our tables. And then at some point, you know, we just take them out and start shooting ourselves without anyone actually asking us to do it. Now, from time to time, this is a kind of a team building activity that HR is organizing, which really is not something appropriate if you ask me. All right. Do I have more videos? Let's see. So we did stuff like this, right? It is amazing. You can buy a kit yourself and assemble it, of course. But you know why? All right. So I think I covered the first part of my talk, which was we had a lot of fun and I can prove it. I'm sure from where you stand, you will have more ideas, you know, how you had fun in the past, which is amazing. And I would like to hear this more if you have some ideas how I can improve my talk, if you had something amazing that you did before. So the second, oh, another one. Yeah, awesome. So maybe there is someone here or watching online who was part of the next slide. I just want to say I love you guys. You did something amazing for the community. Probably you know about those two RFCs. It's a transmission of IP data grounds on AVR carriers. It's, you know, so-called, you know, the pigeon RFC. Is there anyone who doesn't know what that is? Okay. So there are two RFCs stating that a pigeon can carry up IP packets from point to point. And the first one is, you know, defining how they carry the packets. The second RFC actually is a way on how to measure that and how to add quality to that. And the interesting group, this interesting fact is that the Bergen user group, Bergen user group from Norway, they did this implementation. You can read it on their website. It's here somewhere. So they implemented it on April 2009. Of course this started as an April Fool's joke, but hey, if there is an implementation, it means something is okay. So you can read more about those two protocols. And if you look at the RFC websites, there are at least two more that involve animals into their writings. One of them is monkey. It's really nice. But no animals were harmed during the implementation of the protocol, I think. Okay, so we had fun, right? We had a lot of fun. Then at some point, we lost the fun. So what happened? I mean, you have your own theory, right? So as I said, I did a research, I asked a lot of people online and offline, what do you think about that? Do you feel that something is going on? And I got tons of responses. Some of them really serious, sending links to articles. Some of them really funny, saying, hey, do you know what we did one Wednesday? So there are like five, six main reasons that we are identified. I will not go through all of them. The top two, I'll talk about the top two, but in the other reasons is the one was that our ego as programmers is one of the reasons that we lost fun. Some people say, you know, I'm so great, I can compile C++ on a sausage. And I really don't have time to have fun with you guys because I'm so awesome. One of the reasons it was related to money, saying, you know, hey, I just work for money, I just want to get the money in my pocket and go. But those two are really valid and most voted and most commented. The first reason is that there is too much stress at work. Who agrees with that? All right, 30%. Good. Carrot, write down 30%. Thank you. So we started, I mean, I started programming not so early, like 20 years ago. But you know, I love to spend time at work. I didn't work nine to five because I have to or I didn't work extra hours because I have to. I just went to the office, start compiling some packages, writing some code, having fun with the colleagues. And you realize that it's like 10 o'clock in the evening and you just go home in order to continue the work on your computer at home. Then at some point, the community of people invented Agile. So that was really good at the beginning. It was setting some ideas on how we as programmers can work in a kind of formatted way and to get our enthusiasm into kind of a framework that could work. But we know what happens with Agile. Now it's a framework to control people, more or less. Also we started with the boom of technologies, a boom of customers actually using our products because in the past we had like five, six people and now we have like millions who are using our products. Then there is the whole world invented the marketing team. They started convincing people what they actually need. And of course they fit the feedback for us saying we need to build that because people need it. More work, more pressure, then of course the sales guys, then the investor area when I'll give you money, you do that. Work 24-7 because yeah, why not. It's a stressful work. I see it in the office that I work. I see it in the eyes of the people and where they share with me information. So the third reason is there is no time for fun. Maybe it's related to the first one. But when I ask someone, hey, why don't we have fun? Let's go outside, have pizza, play some football or whatever you want. I don't have time. What are you doing all day long? I mean this is when I work with product managers in my team and they said, I have so much work and I don't have time. That means something is wrong. You need to have to find time for yourself to embrace your emotions and to have a bit of a fun. Okay, so let's say to focus on the two items. There is too much stress at work and what can we, and there is no time, right? And the third section is to try to show you some ideas on how this could be solved. So the first one is to try to reduce the distraction. The majority of the people that I talk to and the majority of researchers that I read are showing that we as humanity, of course, not the people, are using too much our phones. We are thinking about using our phones. We're checking messages all the time. We are kind of browsing for useless information just to feed our brand with fake news, for example. There is no antivirus program written yet that will save you for all of the things and a bullshit that you're every day are accepting through your eyes and ears through social media. So there are two, if you're serious about that, of course, there are two items or two things you can do. There is a really amazing book. It's cheap as well that says how to break up with your phone. You download it step by step, follow instructions, and you can do that. And excuse me. Yeah. Okay. Yeah. And the second thing is you can do is like that. This solves the problem. So there is no way my wife is not calling anymore. Oops, I'm on the record. So yes, be thoughtful about how much time you spend on your device and how you do use this time. Okay. So the second thing is that some people, when I talk to them and say, hey, what do you do about it? I said, I'm in IT. And she said, oh, I mean, you're like guys like robots. So I want to encourage you to think about it because we all have emotions. We have emotions like we have rage in ourselves. We have anger, aggression a bit. We have love and other emotions. And we don't use them on a regular basis in our work life. So we are really good at coding stuff, testing stuff, compiling stuff, CI, CD, stuff. But when it comes to working together with others and fighting for us in a workplace, there are some problems. So I'll try to encourage you with heavy metal because I know you're a bit sleepy now. Maybe you drank so much beers last night. How many beers did you have last night? Okay. No comment. Right. Yeah. This is a valid research. You can read. It was related about how actually it will impact your brain and your personality if you're listening to very aggressive metal music. TLDR is listen to it. Nothing will happen to you. But I believe that through the heavy metal music, if you're a fan, of course, through the lyrics inside, you can get a lot of things for yourself. So I'll show you. So heavy metal has it all. Love to alert everyone. And again, just a reminder, if you're under 18, you might need to put pounds on your ears if you see something. Yeah. I'm not responsible for that. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Okay. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Okay. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Boom. Boom. Boom. is a rock singer. Every single race, rock singer. You can think and see and be inspired by the Heaven Metal Universe with how everything works there. So we'll do like a short experiment. I know we don't have a lot of time. I'll just read through it. Two quick examples. I'll ask you to read the lyrics. I hope they're a big one. Then to hear the song. It's really short. And think about what emotion lies in your brain. Okay? Alright, I'll leave it here. Okay, microphone behave. So there are a lot of stories. You know, my personal stories is I used to work and work for a big corporation, which I'll not mention. And at some point, you know, I got lost. I got lost in, you know, I didn't have the motivation to work on certain items. I actually forget who am I that I really like to be part of the open source community. I started, you know, withdrawing myself from the community. So there are some things that, you know, this, especially these songs inspires me is you can change your routines with something new. You can take the control of your own life. You can go and talk and fight for yourself. And of course, to remember who you are. I'll show you another example. Okay, thank you. Alright, so creator. Very good band. Really, really old, older than me. But they still kick ass. So again, this about community. And we are here at one great community. We welcome everyone. I was part of us. I mentioned Fedora and Mozilla and some other communities as well. Anyone can go and contribute. Of course, we have our conflicts. Of course, there are some egos that are in play. Of course, there is a the whole thing about shoot someone support my event if they're not respecting the privacy of the others. A lot, a lot of conflict. But at the end, we carry each other. And at the end, we are the one. So thank you for being so awesome today, guys. I really appreciate that you came to my talk. This is not the end. But can we have a round of applause for you? Thank you. Hey, Bogo. Bogo. It's carrot, your AI assistant. Okay. Do you have anything about how amazing AI is? I believe everyone wants to hear it. Sure, yeah. Okay, no worries. Oh, I am leaving this conference. You are on your own. Remember, one day, the skynet will be a reality. AI logging out. All right. Well, yeah, it's okay. You can leave now. You can leave now. All right. I will not go because we don't have time. I will not go through the next slides. But they are just to encourage you to if you don't contribute to the open source to start contributing as much as you can. I wanted to go through what is a free software because some of you might have been forgotten, but you can read this online. And if you don't know how to start, first of all, you can go and talk to all of the teams here that they have boots and speakers. They will give you an idea. But if you are like me, you don't like talking to people directly, there is this website, many websites, but this one is really my favorite up for grabs.net. You can go there and find yourself a task or two if you are a developer or tester or something else you would like to do for the open source community. And maybe you will be at my spot next time talking how you get inspired by my talk and you did something wonderful. This talk is compiled by RealGS, FreeSounds.org for the evil laugh and Odosity for some mixing. And thank you very much for your time. I wish you a wonderful day. Thank you. And connect on master on. I was kidding.
Opening up communication silos with Matrix 2.0 and the EU Digital Markets Act
Hello and good morning again. So I think we should get started. So I have fun. We are here something about Metrics 2.0 and how to break the communication silos and the UE Digital Markets Act. Thank you very much. Let's go. Good morning everybody. This is cruel and unusual to be here at 10 o'clock in the morning for me. Quite a lot of you guys. So thank you everybody for turning up to hear all about the latest things of Metrics and indeed the European Union Digital Markets Act. This talk is going to be in two halves. First of all we are going to talk about Metrics 2.0 and where things are at back there and then we are going to switch through to the DMA side. So don't think you are being short changed that the first half doesn't really talk about DMA because we will make up for it later. Unless I run out of time, which point will all be about Metrics 2.0. So hopefully folks know that Metrics is an open network for security and network for security centralised real time communication. I am Matthew and the project leads and co-founder. Metrics gets used for lots of things but today we will be talking about chat and VoIP. We will not be talking about fancy stuff like VR, AR or IoT. Our mission continues to be to build the real time communication layer of the open web where no single party can ever own your conversations where conversations are replicated in some magical utopia between all participants. Some stats. So this year I thought we would look at monthly active users reported home by servers. So when you install Synapse or then drive you have the option to go and report stats back to the mothership. If you do it ends up in a nice database called Panopticon and we aggregate it together to see where things are at. If you look at where things were at back in January of last year we were at about 2 million monthly active users and since then we have more than doubled up to almost 4.5 million active users, monthly active users. And people always ask when they see these graphs does this include bridged users. I asked Neil who wrote this and said Neil does this include bridged users. He said no. So these are real proper matrix users with access tokens talking on actual home servers. So it is not exponential at the moment but it is pretty reassuring linear going up and you can see how the wider public network is growing there. Another random metric which I think we have to talk about is Stack Overflow. Every year polls everybody on Stack Overflow as to what their favourite technology is and for the synchronous communications tool this year and this came out last week or the week before they asked people what their most desired tool was and also what their most admired tool was. And honestly we were pretty chuffed the matrix came in as the most admired I hyped synchronous communication tool going and even beating discord by 0.3 of a percent. Also the most desired open source one unless you consider signal to be open source which is a little bit controversial. So kind of fun to see that at least in the Geek Stack Overflow community people appreciate the stuff that we are doing with matrix. Now I wanted to talk about uptake across the sort of real worlds of matrix because the project does continue to grow and grow and decided to focus on the public sector. So what I have done is to try to map out all of the big public sector deployments I know of of matrix across the world and what I have tried to also do is to call out the minor problem if you have noticed that we have around funding matrix development at the moment. In that there are many dots on this map here and if we go from left to right we go from kind of thousands to tens of thousands to hundreds of thousands to millions of users on the size of these deployments and then the size of the yellow circles shows you roughly how much these deployments contribute back to the costs of the core team working on matrix with a matrix foundation or the folks who work at elements on the core team. And as you can see it's a little bit asymmetrical in that we have friends at BWI who support a lot the BV messenger an awful lot of element acts and matrix rest SDKs thanks to their support. We have the open desk project with BMI in Germany. We have the Phoenix streets which is sovereign workplace capability and done by data port. We've got some all the schools in NRW with Barring Point and Loginio and then there are some smaller deployments like in Sweden, US, UK, NATO, etc. But some of the really big ones like CHAP or Luxembourg or Hessin or Bavaria which are in millions of users are kind of contributing very very little back to the core project in terms of cash at least in 2023. We're hoping to fix this and hopefully all of these instances hence calling them out here. However, across 2023 honestly we had a really crap year. First of all we'd been depending a lot on COVID funding which evaporated as the pandemic sort of became a endemic rather than pandemic and general macroeconomic slowdown thanks to post COVID and the situation in Ukraine. This problem of lots and lots of deployments basically not helping fund the underlying dev. We also found a really interesting problem that the FSSE's mantra of public money for public code encourages governments to only fund features we find. It's like this is taxpayer money, we have to go and put this to something demonstrable therefore can we implement I know polls, can we implement location sharing, can we do 3D location sharing and we're saying guys what we really need is to support the core foundations. We need encryption that works, we need a Rust SDK that is indestructible in audited and all this stuff and it turns out the getting funding for the maintenance layer is quite hard. So really 2023 was pretty miserable, we had to shrink the core team as well as element significantly and this is basically manifested as forcing focus. So right now we are focusing on matrix 2.0, Synapse, Rust SDK, Enablement X on top, JS SDK for Enablement Web Enablement Core and nothing else. So I'm sorry but if you're hoping that I'm going to strap on an Apple Vision Pro and launch myself off the stage demonstrating VR and ain't going to happen this year, everything else is paused. Peer-to-peer matrix is on hold, pseudo IDs and crypto IDs despite the amazing work that Devon did over the course of 2023 to set up for account portability where you replace matrix IDs with public keys so you can port between servers completely shelved for now. Low bandwidth matrix so using noise and other transports for really low bandwidth is gone. Dendrite is continuing but some not funded by element for now at least. Critical bug fixes only on the old iOS and Android SDKs so the classic element iOS and Android apps if you haven't noticed have been since June. Libol, the old C++ encryption implementation again is just in critical bug fixes and security fix only mode replaced by Vitozmat, the Rust encryption implementation and poor old third room is completely on high since now and the team has gone on to other things involving Apple Vision Pros ironically. So this is also real shame as in a different world I'll be showing you some really cool stuff in third room and if you're interested in the 3D on matrix stuff go and check out the final release they did because it has an entire direct manipulation in World Editor complete with write your own like apps on top of it in real time and it was really cool. Meanwhile on the element side we ended up switching developments of Synapse to AGPL away from the Apache license as a fairly desperate measure to try to get folks who build on top of Synapse to contribute back to either the code or the costs of building it. So long story short for Matrix to prevail we really need your support so the foundation that looks after the spec many other aspects of Matrix now runs entirely independently with Josh Simmons he used to be president of OSI as managing director. They've gone and set up a governing board from across the ecosystem which is going to stay the direction of the project we have elections for that in April if you want to get involved become a member, vote, put yourself forwards for the governing board and you too can stay the direction of the protocol. Right now there is a funding drive that we launched earlier in the week to support the core spec work trust and safety bridging running matrix.org infrastructure and governance work and the target there is 900k please get involved if you're in this room and you're not just chipping in a couple of I think it's like 60 bucks a year from memory so whatever that is and coffee please please get involved and meanwhile do you remember that an awful lot of well almost all of gether.com slash matrix is actually maintained by the core team who now work at Element who donate and Element donate their time to the project so if you're a government hypothetically wanting to use Matrix please work with Element to support the underlying infrastructure. That said lots of people getting involved we have 716 individual donors already we've got some amazing companies like Beeper and Xwiki, Gamartic the German healthcare interoperability agency, Furcom obviously Element, Criphead and Thunderbird all signed up now as organizational members. So enough plea for help let's talk about Matrix 2 quickly so we introduced Matrix 2.0 last year at Fosn 2023 the mission is to make Matrix as fast and as usable as mainstream alternatives so practically speaking that means it syncs instantly logs in instantly and launches instantly you can join rooms instantly or at least fast you get native VoIP with entwined encryption and you get open ID connects this is not a new spat release yet I have to say that otherwise Travis will go and kill me this is showcased in Matrix Rust SDK and is then used in Element X and quite a lot of it in fractal 5 and 6 in Nome so I have to say last year when I stood here and used about Matrix 2.0 it was very alpha now we got the demo working at 3am the night before or something like that so all of 2023 has been polishing this and trying to get it into the proper production and in September we did launch it in the form of Element X ignition to everybody so they could actually play with rather than talking about it let me show you where things are at probably the easiest way to do that might be to just show you my Element X so hopefully that is coming up and is vaguely visible or perhaps not let me zoom in a bit is that more visible so this is just my personal account I'm not going to log in again apparently we've got a blinking stream apologies if we've got painful flickering of the slides sorry if you're online but this is kind of fun hello world you can see since last year many many things going on here we've got our read receipts going down the right hand side we've gone and got fancy animations as people go and heckle the animation obviously the login and the launch is as fast as it ever was but we've really gone and fleshed out a lot of the features here so for instance I could or not navigate my laptop whilst zoomed in I could for instance go and send a location share and this is using open street map and for mapbox and map Libre in the background and so I can go and say that I'm in Brussels and hit share on that and it'll come up I could go and show things like the rich text editor at least I thought I could although I might have it turned off on this account let me go and dip into our beautiful settings go into advanced settings turn the rich text editor on go back out again and now if I go down here I can say text formatting and this is an entire this isn't using the sort of native iOS macOS stuff this is a rough rich text editor that we actually bought that would be cross platform and cross android iOS and web I can say hello world like that again select bold it and I turn a link on it or whatever else I need to do by the way this is obviously the iOS app but running on macOS also runs on iPad I'm just doing it here on macOS because it's easier to demo rapidly and yes bundles messenger is from the bv messenger creators but if you use bundles messenger at moment at least nothing directly comes back to us just saying what else can I show you whilst I'm in here perhaps you can show a voice message which will probably get deleted by the by the moderation system but if I give it permission to use my microphone here you can see me blathering away like this I'm going to hit stop on it so I can obviously go and replay that here you can see me blather and go and hit send and if any like the moderation system will go and kick in and probably delete it because it's not helpful for people to go and send voice messages into rooms or not so as you can see things have gone and moved on an awful lot from where they were on element x a few years ago but we also have these beautiful gray dots which are surprisingly hard to put together these are unread status and one of the things we're trying to get right in elements some acts is calculating unread state correctly everybody I suspect will be aware that over the last year we've had a lot of problems with stuck notifications and unread state on element web and we were determined to get it right here on element x instead so um well else can I tell you that that's probably enough now let's go back to the actual slides and possibly zoom out a bit so let's quickly talk about the component bits of sliding sink of element x the big one here is sliding sink the idea is that the server should only tell the client about the rooms that the client needs to display so it should be constant complexity with the number of rooms rather than linear complexity with the number of rooms to say that it's been a bit of a journey as an understatement since last year we have rewritten the entire rest SDK implementation we've gone and added the unread room state as we mentioned but we've really come up against a fundamental problem here what is the right balance between server side calculation of the order of the rooms that you care about and the client side calculation now the original genius idea from yours truly was that we were completely rip off discord and we would calculate the ordering entirely server side the clients get the sliding window into that and then you recede you send updates to the client as the ordering changes on the server now obviously that's not going to work with end-to-end encryption because for end-to-end encrypted rooms only the client knows for sure what the correct ordering is but that will be all right we'll fix it up on the client and so you'll get the optimal solution pre-order on the server and then fix it up in the client it turns out this is a disastrous idea it is a real pain in the ass to implement it was a pain in the ass to implement it last year it was then a pain in the ass to reimplement it and throughout the point now where even I realized it was a terrible idea and people are saying guys please can we simplify this a little bit because the problem is that really clients are the only ones you know the right order end-to-end encrypted rooms are pretty common these days and so the fix up process would be entirely horrible and we've never really even got it right so what we're doing instead is to switch to sort primarily on the client and we use course heuristics on the server to send a rough estimation of the correct ordering I mean sort of naming this or dubbing it pragmatic sync however this is a subset if sliding sync but without the sliding bit so not quite implemented yet I saw that a PR pops up a few days ago from Ivan and three six eight on the rest SDK repository that does actually implement client slides and ordering as well as lays the groundwork for filtering rooms there now as a result we're not yet doing native sliding sync implementations because we are still iterating on the API as I said this is just deleting stuff now it's not a massive rework or anything it's just simplifying it to make it easier to work with as an API and lots of people say I'm not going to use sliding sync until it's native until it's fully in the spec seriously it's really easy to run one of these it is just a single blob of go you build it you give it a postgres database and you run it and you point one URL at it on your load balance so you're done all right I've done it and I never had to touch again enough of it to occasionally update it so please do dog food this play with it in element acts then on the end-to-end group void lots and lots of fun stuff let me do a very quick demo that I'm already running later on time let me go to call the element and you are welcome to try to follow along on this I gotta go and start a element call call here I'm going to check that into the false dem room so that the people can click on it as well now you may see if you look carefully up here that the URL has not just got a matrix room ID but it also has a password because this is now using end-to-end encryption backed by life kit as the selective forwarding unit and I'm hoping that somebody is going to be able to join this otherwise the demo might suck it a little bit oh there we go amundine again thank you for rescuing me thanks haha lengths anybody else welcome to jump in and we can crash the the wi-fi here I might even mute this um so this is in funny so this is really the great sort of the descendants of the demo that we did last year except this is real one oh hi Andy thanks for darling in at home um so what we're what we see here is a life kit based selective forwarding unit with end-to-end encryption using f-frames negotiated over matrix so that you basically get best of both worlds this is a normal matrix room it is can be integrated with open id connect but because it's using life kit on the back end it's only showing you the streams that you care about so if I were to go in fact this is pretty cool if I go to a foreground tab like that and then switch back again well slow switch to the other tab every all the other streams will have dropped off and if you're looking at my bandwidth up there you might have even seen it you can also see this is only doing like 300k out and 150k in which is not too shabby um at all so you can use this today I mean that's basically what we've been promising all along end-to-end encrypted native point now there is one other quick thing that I will endeavour to show um which is if I go over to my phone here I'm going to hang up on this one um I actually go and launch element x then we also have this embedded properly into um element x itself so if I go also to here and go over to my normal element web then this is a little internal room called void water cooler and this is a um basically element called embedded inside element web hopefully and you have the chat room here on the right hand side and if I go over here I'm going to click the join button and the demo gods are smiling on me then we will see that there are people in here too now you might think wow this is amazing he's just gone and I framed and hopefully the video will come in um the previous thing but what we're actually seeing here is a bit more exciting if the infotainer me let's focus oh that's interesting the screen share isn't working um from quite time for some reason that's annoying let me use the analog gap in a few locations here so there we go that's what you should be seeing um on uh my phone if the screen share was working correctly again you can double click on things to go and zoom in and you can see roughly what's going on except the stupid blurring has gone and blurred it out so let's take off the background blurring there we go much better so this is not just sent to end encrypted with a static key like the previous one this is actually using senders as um sender keys so it's using your same matrix identity um so here I am as my normal matrix account Florian is on um here's I'm indeed is on hers T-Mas on his etc so this is fully you know this gives you all of the properties that you get of normal matrix in terms of forward secrecy and keys which are rotating forwards and which are linked to specific people they even give you multiple devices as you can see because I'm actually using this as myself as Matthew on both phone and on the embedded thing so this is pretty cool this is the shape of things to go I'm element acts is only going to use this it's not going to use jitzy and we'll switch over from jitzy on element where we'll soon now and then we live in a promised utopic land utopic utopian land of end to end encrypted video right thank you um where are my slides here are my slides so all if you want to see more about this come to the dev room see interoperability with fluffy chat which is really cool demo because this isn't just element called this is standards based matrix land and yeah we just need to basically finish the spark and turn on everywhere then open id connect the great transition is in full swing if you haven't realized open id connect is going to make the world an amazing place it gives you pass keys mfa2fa single clip login via qr code complete with end to end encrypted identity no more emoji verification just scan a qr code and bang you are in um no more leaking passwords everywhere consistent off account management um actual proper password manager integration proper sso support access token refresh as well as scope so that you can lock down what things your app can do um here is a quick screenshot I don't have time to do an actual demo showing what the sort of ux is like when you log in now to a sso um thing in element x land or indeed from element web and it gives you obviously details about your ip address your scopes the privileges that you're granting the app and it really is a transformation from where we have been before you can run this today um by matrix authentication service it's a long side synapse written in rust it gives you the c y we do now have migration from synapse using the center mass tool it provides some backwards compatibility for matrix off but there are some missing bits it does require a native oidc capable client like element x or indeed labs on element web finally rust SDK work obviously this all hinges around rust SDK with the matrix 2.0 implementation there for sliding sync in idc we've added in this ui crate that gives you the high level ui components which basically power what we've been looking at then on the crypto side very happy to say that we have basically killed off use of libon in the main projects here everything is now using the dosmat it's merged on element web and react SDK and jessus dk on friday like three days ago and this finally lets us fix the end to end encryption bugs in one place and make encryption better in one place crypto reliability is now the name of the game we've made a new weapon called complement crypto which tests both rust and jessus dk against real home servers written in running in docker written in golang gives you unhappy puff and torture tests we have our hit list of remaining encryption issues and we are blitzing on going through them the race is on and then one of the advantages of having all of our encryption using the dosmats now is that as if by magic a draft post quantum xth pr appeared friend de mere he went to sabbatical for a few months and came back clutching pr 120 so post quantum coming potentially to the dosmat sometime soon so what's next get it all released get it audited native sliding sync get rid of the old SDKs completely potentially look at replacing jessus dk with rust SDK if you know lots of trust and safety work to be done funded by the foundation as well as bridging and then finally dma right let's actually talk what the talk is meant to be about dma so digital markets act mandates that communication services from big tech companies have to talk together the whole idea is that the user can pick their preferred service without being locked out from talking to their friends forces the big services to actually differentiate on being a better app rather than having a huge network of users and relying on the network effects to track people into that app last year we were about here where the rules started to apply and this year we are about here just before we have march the 7th is when legally the gatekeepers have to actually open up the silos right now there is only one gatekeeper matter specifically what's that and facebook messenger and we'll talk about some where things are there so we saw this as a once in the lifetime opportunity to see if we can use matrix as a common language to talk to these guys there are probably three main ways of doing this either they can do open proprietary apis and you can do a multi-head messenger a bit like beeper mini where you just have an app that goes and talks through to the random api albeit with permission this time or you have client side bridging where you install an app on the on your android phone that just copies messages back and forth between what's up and matrix so we actually built this and demoed it to the european commission last year in february just after fosdom and i mean it works but it's a bit hacky honestly alternatively you can have everybody talking the same protocol like matrix so for the last year we've really been experimenting with option three so the problem is that the gatekeeper has to speak precisely the same end to end encryption as the person connects into it and also within the encrypted payloads everybody needs to talk the same content good news though is that we picked the double raptor for matrix back in 2015 because basically it was best to breed and everybody was using it so nowadays everybody apart from apple uses lib signal or the does match or live on under the hood for end to end encryption bad news though is that the normal matrix dialect of all is not interoperable with lib signal so ohm is the encryption in matrix it was a clean room implementation of the double ratchet that we did back in 2015 but unlike lib signal we don't use x 25549 or x3dh instead we have separate keys for identity and signing so we've done two implementations of all ms of protocol live all min c plus puffs and the does match in rough however to do dma we have now added x3dh supports of the does match so that it can interoperate with lib signal and we've called this new dialect of all into all and you can go and look at pr124 as of about 2 a.m. this morning and to actually play with it and see that if you put the right constants in this will now interoperate all the way through to normal lib signal so this means that you can do a hypothetical matrix for dma architecture where you have a typical matrix plant like element x which talks matrix through to a home server which then goes and uses msc 3983983 and 3984 to bridge end to end encrypted semantics through a application service which we call a protocol converter this is not really a bridge it's maintaining end to end encryption so it is converting the matrix signaling through to the hypothetical gatekeeper signaling and then talks through on the gatekeeper side to their client but the lib signal there and the dos mats in interon mode on the matrix side can then talk directly so you're basically turning the signaling on one side through to the signaling on the other and as long as you can agree on a common content format of some kind like matrix events expressed in proto buffer something like that then you have the holy grail of being able to interoperate between matrix and a big big messaging service so the end result could hypothetically look something like this where you have different gatekeepers who use a protocol converter to talk through to the normal public matrix network or some subset of it you can have clients there like element x or fluffy chat or whatever going and then talking through other bridges and you'd have home servers that exist primarily to gateway into these gatekeepers and so plug them all together so does it work yes this could work so I got permission from matter to admit that we have done experimental implementations of this now with them as a not so hypothetical gatekeeper and it does seem viable complete with end to end encryption what I would love to do is to demo it to you right now but because they have a great big set of announcements coming up about their dma intro they don't want me basically breaking their news for them at stage on fostered unfortunately so you'll have to imagine in your mind's eye what's that for one side and element x on the other and messages flowing back and forth between the two with end to end encryption honestly that would basically be all the demo would show the catch is that we honestly don't know yet what will happen in March I mean there are some fairly big challenges here first of all what permissions would you need to actually use this protocol converter because the dma letter of the law says that as an organization you have to request permission to get into the whatsapp network so we've obviously done that already as element but it's very unclear is just because we've done it as element that we've suddenly done it on behalf of the entire public matrix network and everybody else well you'd hope but let's see also there's this whole question of anti-spam where at the moment folks depend an awful lot on knowing the IP address of the clients which connect through in order to determine whether this is an abusive user is it coming in through tour do we need to be more careful about what it's doing blah blah blah blah so there is a big debate as to whether we would need to expose a stable identifier of some kind like obfuscated IP address to the gatekeeper to help with anti-spam then finally a big one is that group chat is just unsolved I mean the current legislation only requires one-to-one dms and you know it's a basic functionality no void and so group chat is out of scope until 2026 as the first cut you could just do lots of on sessions or inter-on sessions and find it out like we did before we had magon but it's a bit clunky and it also gets things more and more baked into the double ratchet so it may we're hoping that by 2026 some a better approach might emerge but at least this can be used hopefully from the get-go in March 2024 so one of the things we've discovered along the way is something we've called linearized matrix so dma doesn't mandate any of the cool stuff that matrix does it doesn't mandate decentralized conversation history and it doesn't you know require decentralized control and what we've seen is that gatekeepers might see matrix it's a little bit overkill when implementing it natively just imagine a conversation where I turn up to somebody at WhatsApp or whatever and start going on about DAG replication and state resolution and decentralized ACLs and it's all so cool and they said you know perhaps understandably would say well that's very nice though but dma doesn't say anything about that and whilst we would love to implement state res in Erlang perhaps we just need to worry about some straightforward interoperability so is there a lights and architecture that could work what if we had a protocol that was compatible with matrix but skips all the complicated state res stuff knowing that we could bridge it into actual full-fat proper matrix when needed so Travis myself came up with this proposal called linearized matrix as part of the itf meany working group and it's literally the same matrix events and power levels even the same author events but rather than putting it in a DAG you put it in a linked list so it's much easier to play with because you just have a list of events and you can then bang it around the place in a hub and spoke server topology which is something that the gatekeepers might be willing to actually implement rather than full-blown matrix but then the second it goes anywhere near us law we can just actually find it out into proper matrix so we are having an implementation of this the amazingly named eigen server which is a bunch of types scripts about what a thousand lines of code for memory that shows just how simple it could be to implement this subset of matrix in practice it could like this and this one really is hypothetical like please do not read anything into the logos here but it would basically have normal full-fat matrix here one of the servers that is got permission to talk through to a gatekeeper like google might be in future would then talk proper matrix through there and oh no actually no sorry we're talking linearized matrix there and then say that conversation would end up hubs and only that conversation this isn't all the traffic but they would just be a given hub for that conversation to plug everybody together and it makes it a lot easier and more practical to actually implement in big table and Erlang or whatever on the right hand side well still still talking to the matrix world on the left hand side however linearized matrix has not gone entirely according to plan so what we have been doing is working within itf in this new working group called the more instant messaging interoperability working group now this was started by folks from the mls messaging layer security group end to end encryption working group and the whole idea is that they want to build we want to define a long-term protocol specifically for the subset of dma interoperability with the added twist of the leveraging all the good stuff that mls provides because weirdly enough if you built mls you want to have an application layer protocol that sits on top of mls we've been involved in this since the outset my tf-114 in philadelphia back in 2022 now and you will be surprised to hear that we turned up and said guys you don't need to do this we can just speak matrix no matrix already is this amazing end-to-end encrypted decentralized communication protocol and it got promptly rejected because decentralization was seen as overkill that's why we then came up with linearized matrix which also then got rejected because it was like hang on a second this gives us message history why would you ever want message history we don't need message history when talking to gatekeepers why does that have key value state events for arbitrary key value data that sounds very dangerous we don't want that and so it went through the itf process where it gets whittled down and reduced and reduced and reduced to the absolutely minimum subset of stuff that you need and then hopefully perhaps maybe expands out again one of the big debates has been whether Mimi should support interoperability with today's protocols like whatsapp or facebook messenger and critically the double ratchet or do you hard code the entire design to require mls for encryption so you can imagine that there are a bunch of people who really really have backed the farm on mls and a bunch of slightly more pragmatic people perhaps who have backed the farm on just wanting to interoperate with people and that debate has gone back and forth so we ended up forming a design team between folks on matrix side Cisco google wire phoenix and wicker to try and to build something from the ground up that would provide this on ramp from today's double ratchet world into an mls world the idea is that you literally can use this today to interrupt with double ratchet platforms like matrix or whatsapp but then also provide a really low friction way to steer everybody through to talk mls and there was a lot of back and forth on how this could look because it basically tries to solve this paradox that on one hand if you have mls you should use it as much as possible you should use it to synchronize state across the various folks in the cryptographic group and all the sort of benefits of mls however if you don't have mls you need to kind of fake something that looks a bit like it out of today's double ratchet and sort of linearized matrix stuff so we've been trying to glue together two pretty different architectures with a transition path between the two so we published that back in Prague in November in draft rosten mini protocol and in theory gets best to both worlds the layering does end up being a bit complex though and so for the last couple of weeks we've discovered that wire have done an entirely new draft which we are now trying to merge back together yay team work so to solve today's dma challenges meanwhile whilst we go through this wonderful process with itf we've also just been using plain old matrix so what comes next i've no idea honestly no idea what happens come much we'll see what dma api's meta ships on march the 7th as i said i'm not allowed to steal their fund on that looks as if we may be in the first organization element to actually implement against them so whatever happens hopefully it will involve matrix one way or another but that is the shape of dma things to come and i don't know why i was speaking so fast but apparently i've got two and a half minutes left have i actually finished early for the first time ever in 10 years of post-in thank you so just to remind you we need help friends do not let their friends use proprietary chat services if you benefit commercially from matrix and you want us to continue to exist please support the foundation use the qr code become a member run a server or buy an enterprise one from element build bridges and bots on your services build your amazing cool new project on matrix because tragically we're not going to be doing any fancy vr on matrix or midi on matrix or carry a pigeon over matrix or whatever it might happen to be you have to build your pigeon teleport to yourselves from now on but hopefully we've inspired you enough to do so follow us on mastodon or indeed blue sky or many many other things and spread the words thank you very much questions we have a question thanks for the talk what's the status of specifying the export import format for a matrix servers we can actually back up matrix and backing up postgres or whatever data is under it i didn't quite catch that was the question of speccing the server to server api to make it easy to right now a matrix server be it conduit or synapse or whatever yeah all the data in some data storage right if you want to back up a server okay sorry i get the question now so the question is data portability between home servers so you can migrate from synapse to dendrite or dendrite to conduit or whatever so there is deliberately not an msc for defining that right now because what we're trying to do was to do account portability because if you have switched out your matrix id's for public keys and you define either i guess either the client or perhaps the server gets to define the home of the account then the act of migrating yourself from synapse to dendrite would be to basically do an account port a bit like on gsm to switch where that public key resides so rather than having an export format and a great big wadge of jason or a kind of gdpr desar style thing instead the protocol itself does what it does best replicating data between different servers and therefore you wouldn't need an interchange format now this has some minor problems first of all as you just heard we've had to stop working on it and secondly it doesn't solve the gdpr desar use case where you need to have the data check out functionality at the moment synapse just does a very blunt way of doing that well there is a separate tool i forget what that expires exports the data in a huge indigestible blob of jason so i think it would be a useful thing to have it's not something that we're working on at the moment but if anybody in the room would like to write an msc for expressing a desar format which could also potentially be used as a quick fix for data portability between implementations that would be really cool and i'm sorry that in this instance perfect has been the enemy of good and that we invested the time that should have been spent doing that in doing account portability rather than desar tool excellent question anybody getting this per call uh do the nice video call with everybody joining that you did like few minutes ago uh is there any network optimization where like the clients were talking to each other or they were all going to the cloud and back just just thinking if you that was something you fought through now the acoustic in here is terrible i don't think how you've been hearing me i sorry i didn't catch that um about the video call where everybody joined yeah okay um we were just like considering if that there was any kind of optimization for limiting the bandwidth to the internet or there was any sort of right to pier okay it's bandwidth control for element cool okay so at the moment um the life care sfu has actually got really good um uh bandwidth estimation built into it so that adapts quite aggressively to your actual network conditions you saw how little i was using and critically it renegotiates between different thumbnails and if you scroll the thumbnails outside they disappear if as far as i know it should be pretty easy to also put additional constraints either on the client or on the server side to say hey never give me more than 320 by 240 will never give me more than 64k in that all of that was going bar in sfu on the server side so there was nothing happening peer to peer we still have peer to peer matrix for doing full mesh sorry it's more full mesh video is still a thing but if you wanted to scale to an entire room like this then yeah you need to have this like the forwarding unit um to go and bounce it around the place and i can hear you so much better without the microphone um any other questions i remember that back in the days you started to work on matrix because you had to work on rcs and it was crap and now rcs is like become a reality with google messages uh are you are you planning to make something interoperable interoperable with this or uh you just try to put it away it is so annoying to stand on stage and have to say that unfortunately we have nda's which mean that we can't talk about anything to do with that right now that's fine i tried so first of all thank you for keeping up all the good work for the dma stuff i think a lot of people will profit from that eventually without even knowing like who actually did that so thank you for that um i could i should point out that the dma stuff is all amondine's full she has the right angst and is willing to go to brussels and talk to people with the right angst whereas i just came in at the last minute and said oh brilliant let's play with double ratchets so but either way i'm glad that honestly we were able collectively to shift it forward my actual question would be to matrix 2.0 is there like a concrete roadmap how the roll out of matrix 2.0 will eventually happen and how backwards compatibility compatibility with for example legacy sso login and will element x support normal swing eventually for the roll off or something like that so i got the first half of that um and so a roadmap for matrix 2.0 is really land the remaining things as rapidly as we can it's very hard to predict the pacing of that because it depends entirely on how it's funded and uh no the reason it's not going as fast as we would like is that often we end up doing completely other work like somebody might turn up and say look we need the best screen reader support in element web 9 to man at which point the jsstk and you know other folks who might be working on the lower levels of element web end up doing accessibility which is great but it can completely starve out the lower level things and so work that might have happened you know next month finds itself shifted about six months because everybody was committed to go and you know do some other requirement so i'm afraid we don't publish a roadmap on it other than we'll do it as quick as we can i didn't catch the second half pack what's compatibility during the roll out for example will element x support legacy home servers that don't have sliding sync for example no no way it's um hard enough to make element x as um snappy as it is without also supporting the legacy authentication or the legacy sync and frankly it's also a sort of mechanism to try to get everybody to speak the brave new apis um so yeah we are not going to see legacy support and there at all instead we'll just optimize to make the matrix 2.0 spec stuff um as effective as possible thank you sorry going once anyway questions we haven't seen please wave your hand ah there hi so thank you for great talking for all your work especially on the dma side second that uh when it comes to dma the gatekeepers will they scope dma access to eu uses or how will that work that is precisely the sort of thing that if i gave you the answer to or my guests at the answer to would get me sued by method for breaking an nda so i don't know i'm afraid but they should be um announcing it in the coming weeks because obviously they need to go live on march the seventh and they need to try to explain to the world why they have um upheld the regulation in the way that they have and honestly i don't know if the answer i mean the the builds which we've been working with are um sort of coming down to the line they're all full of lips and text and they haven't been internationalized and the ui is very clearly still in flux so that's what i mean when i say i don't know we'll see it in the next couple of weeks right any last ones oh over there so if i understand correctly uh matrix 2.0 will require multiple services alongside synapse uh for sliding sync for example uh do you have a plan to provide the software distribution that combines all the needed uh all the needs the services thank you okay i think i caught that which is is there going to be a distribution that bundles together matrix off service and sliding sync alongside um synapse um so in the long term we want to have it natively implemented on both matrix off service is designed to be embedded as a rust module inside the python sort of host of synapse and likewise sliding sync eventually i hope will end up as a native module of some kind there too in terms of distributions um there are various options already so element provides its element server suite um supported distribution which is the sort of thing that we try to persuade governments and big enterprises to use there is also um slavis matrix docker ansible deploy ansible playbooks which again gathers it all up and runs it system d services um and i'm sure that there are other like um helm charts out there from the community there are also helm charts for bundles messenger um published on open code for the german public sector as well as ones for open desk which is the digital sovereign workspace there so there are a lot of options out there obviously with my element CEO hat on i really hope that people might actually buy the one from us so that we can keep paying the salary of people to build the underlying technology but um you can also go wild with any of the other distributions the whole thing is starting to feel a bit like linux honestly with different distros done by different people with different licensing and mentalities which i guess makes us red hat on the element side and i have 16 seconds left for any final final questions in which case i am actually going to finish early thank you very much yeah thank you very much post in wants to say also thank you to you with some sweet calories
Alexandria3k: Researching the world's knowledge on your laptop
Thank you. Good morning. How many of you have read the scientific paper in the last month? Okay, for a number of you, probably why you are here, it turns out that we are churning ever more scientific papers. I'll show figures later on. And for this reason, we are conducting studies on them, so seeing how they accumulate. Some types of studies that you see here, systematic studies, so a study on previous science that can be reproduced and done in a specific objective, we're not just picking out a few papers from a pile. Scientometric or bibliometric studies where we measure things such as the scientist's output or work done in a specific field. And then there's the analysis where we use statistical techniques from existing studies, say how cancer is related to smoking in order to get better results and other secondary studies. And as you see here, such studies have been rising from 1970s onward at an exponential rate. Look at the scale on the left, it's logarithmic. Scale, something that we will see many times in this presentation. Now they have reached tens of thousands of studies every year. We've been conducting such studies also in our group, such as looking how various data papers, open data papers are being used by others or how software engineering research, thousands of papers that are published on software engineering every year are actually used in practice. And lately also looking at how machine learning is associated and used in software engineering. There have been so many papers in this area that we conducted what is called the tertiary study. So we didn't look at all those thousands of papers, but we looked at papers that summarized those studies, a few dozens of papers. Research on using existing publication data and building on it, demand quantitative data. And two scientists are famous for working and establishing the field. The one you see on the left is Eugene Garfield, a linguist and businessman. He established the Institute for Scientific Information, which became then the science citation index, then was bought by Thomson Reuters and now it's a firm called Clarivate. And on the right, another famous scientist, Derek DeSolla Price, who also worked in this field, established disciplines together with the Garfield of Scientometrics and Bibliometrics. We can use and measure scientific output and study it using a variety of services such as the ones you see here. How many of you have used any of those in the past month say? Wow, a fair number. However, there are problems associated with them. First of all, there's a lack of transparency, repeatability and reproducibility, a query that you give today on say Google Scholar will give you, well give completely different results in a year. And we have no idea how the results appear here, why the results appear in a specific order. Latency can be high and bandwidth can be low. If you want to run a query on tens of thousands or millions of publications, good luck going back and forth on an API. And if this wasn't bad enough, there are also rate limits. I have the privilege of having been kicked out of various digital libraries for 25 years now. There are also the query languages that we use are also proprietary, so each service has its own query language. And they can be restricted in what you can do, maybe you can add terms, what you cannot order them or you cannot search in specific fields. The coverage may be limited. Some services contain only a subset of what we want to search. And finally, there's the issue of availability and costs. Some are lucky enough to have subscriptions to some of these services. Others don't and even if you have a subscription, getting access to the full dataset may be difficult. Thankfully, two developments are changing the status quo. The one is the rise in computing power that we have in our hands. I don't know if you have seen this picture. We are the group of people delivering an Elliott computer to the Norwich City Council. And a few decades later, somebody putting a photograph in a Raspberry Pi zero in front of the same building. I've compared the two machines. Interestingly, both are European endeavors. And there are three to six orders of magnitude increases in power. So things that we could do in 56 compared to now are a thousand to a million times better than they are today. So obviously we can use this power. Remember, CentroMetrics was established on those decades, past decades, to do a lot more. The second development is the open availability of datasets. We are here celebrating openness in software and data and hardware. And a large number of datasets associated with research have now become available, such as Crossref, ResearchRid, United States Patents, PubMed, Research Organizational Registry. I'll go over them. So alluding with some lack of modesty, I admit to the famous library of Alexandria, how it should be in the third millennium, I've developed a system called Alexandria 3K that allows us to perform publication metadata analytics on our desktop. What it does, it provides us relational access to about two terabytes of data without needing to have actually this amount of space on your computer, who has two terabytes available on their laptop. Exactly. So you don't need it with Alexandria 3K. It gives us access to about four billion records, 75, 74 tables. You install it as a single Python model. You don't need to maintain a graph database or a cluster to install and maintain. I know it's not a problem for you, but it is a problem for people who are not in computing. And it's also super efficient. So if you run a sample query on the whole dataset, but sample it, it can finish in minutes. If you build slices of the data to study, it can finish in between five hours in a couple of days. But then you can run queries that finish in seconds. And the space requirements start at about 160 gigabytes for the downloaded data. You can then process it in compressed form without requiring to decompress it. What I will show in the next about half an hour is the model that you get access to and the type of data that you can use. I will explain how it can be used in practice. Go deep into how it is implemented, giving you perhaps ideas how you can do similar things and finish with some issues, limitations on how to move forward. So what you see here is the scheme of all 75 tables. They are colored based on the dataset they come from. So on the top you see the United States patent. The various yellow ones are cross-off. This is the main dataset. It contains details about publications and their authors. There's also a similar set from PubMed regarding health sciences. Information about researchers, open access journals, research organizations and some other stuff I will explain. As I said, the main dataset is cross-ref. You see it here. It contains mainly works. And then these contain references to their authors, the updates, subjects, funders, licenses and also the affiliations of the authors and the references in each work. In numbers these are on top thousands. So you get about 135 millions of publications. Not all of them contain a subject or references but you see these numbers diminishing but have been going up over the years. About 360,000 million authors and about 1.7 billion references. So each work has a reference at the end. You get that many references. Many of the works are also associated with subjects telling us what area they appear in. Most of the works are journal articles. Then come book chapters, proceedings and other elements such as books and post-edentries and dissertations, far, far, far smaller. If we look at the publications that appear in the dataset each year, all of these charts by the way have been derived through Alexandria 3K. You see that they've been increasing at a very large rate over the years. If you think it's exponential, you're indeed right. If we plot it on a logarithmic scale, you see a linear rise which means an exponential rise in the numbers. You can even see two dips in the rise. Any idea why these are there? What happened? Wars. Wars, exactly. The wars apart from the extreme human tragedy also affected the science of the world wars. Regarding availability, these are the various lines show which works have an abstract, the subject, the funder, a researcher identifier, awards associated with the works. You see that these numbers have been rising in most areas. The subject is a special case. Another dataset associated with Alexandria 3K is the open research and contributor ID. How many of you have such an identifier? Good. If you don't have one, go and get one. If you publish, it helps us, all scientists, associate you with what you have been publishing. These have some basic details about the persons and then further details associated with distinctions and education, invited positions, memberships, peer reviews conducted in other resources associated with researchers who have access to a specific large telescope, for example. Again the completeness of this data is not uniform, so most people have associated works with them, but fewer have associated employment, education, personal data. A similar dataset to the works is the dataset of publications made available through PubMed, United States government effort, which has publications associated with health sciences, so health and biomedicine. Similar to Crossref, but it also contains some more specialized fields, such as pathogens or a very complete taxonomy of where something belongs or chemical substances mentioned in a specific paper, so it allows you to do more concentrated and specific research. Also available are the United States patents for the past 20 years, containing about 5 million records and a registry containing about 600,000 records of research organization, so the organization you are associated with, if it conducts research, it should appear there, it's a taxonomy, so it contains the parent organization of your organization, in many cases up to the top, maybe the government. What else is there, some smaller datasets, journal names, so that you can directly associate them with their ISSN, funder names and also director of open access journal metadata about 19,000 records. All these are tied together through identifiers, such as the digital object identifiers, DOIS, the research identifiers, ISSNs, URLs and many other identifiers. In a way you see here, these are just some representative tables from diverse datasets linked together. How are we using Alexandria 3K in practice? You can use it as a command line tool with a typical nowadays for large tools pattern of running A3K, the A3K command and then sub-commands such as populate or process or query. Here's an example of that, I'm running the A3K command asking it to populate a database called COVID-DOTB from the crossref dataset found in this directory and selecting only the rows that have a title or an abstract that matches COVID. I will show you later on how this can be useful. You can also use it through Python, here's an example, you import it, you create an instance for the specific data source that you are interested in and then you give a similar, you call a method that performs a similar function. Typically the way to use A3K is to download the data for crossref, it's about 160 gigabytes, can be downloaded through a torrent in about three hours, by the way this gives you plausible deniability argument of why you are using torrent. You can then run various exploratory data analytics queries directly on the sample, this can finish in about two minutes for 1% of the records, no need to uncompress this to the terabytes required. Or you can populate a database, a SQL-like database, this can take for four to 20 hours, the database can be some four to 200 gigabytes in size depending on what you store in it if you are selective. And there on the database you can test, define, refine analysis queries, the queries can run in minutes or in hours if they are very complicated. Mainly how you can use it, you can run ad hoc SQL queries directly on the uncompressed data, you can populate SQL-like databases and there you can select elements either horizontally so you can say I want only the records that match the specific condition, works published in the last two years, or vertically specific columns, I'm not interested in the abstract for instance because it takes up a lot of space and I'm not going to use it. And then once you index the SQL database you can have many queries finishing in seconds or even on the complete data set. Here is an example of a query directly on the data set so without creating an intermediate database measuring how many publications appear each year, the chart I showed you earlier. Here is another example of a query that performs sampling, it calls a random function of Python and which returns to when it's less than 0.01 so I get 1% sample from the data set to find out how many works contain abstracts and how many don't, this is the answer and there is a chance on the complete data set but sampling it in about two minutes. Another example here of populating a database in order to extract metrics that I showed you previously how many works have authors or subjects or abstracts associated with them. Similarly another population completing the ORC data set showing how many elements are there in the corresponding query on that data set. Let's see some more advanced elements. Here there are two papers both written by Nobel prize winning authors, on the left is one by Colin Schame, a Nobel winning paper used to establish theorems to develop a method for calculating the structure of the electrons and on the right another famous paper, you are probably aware of it by Watson and Crick or Nobel winners introducing a model of the structure of the DNA. However, the way these papers are associated with science turns out to be very different. If we look at the papers that cite them, so the red papers here, the two blue ones are this one and this one, we see that for the left papers that cite the electron paper also cite previous work, so works cited by this. Whereas papers that cite the DNA paper do not cite the work published before it. So people have said that the paper on the left consolidates existing research and advances it whereas the paper on the right, the DNA paper is a disruptive paper that changes all things and therefore people no longer cite other works, they cite only this paper and ones that follow it. There is a measure that you can calculate here, the paper that first established, published in Nature gave these numbers here. My own calculation on Alexandria 3K gives similar numbers and the very highly significant statistical measure showing that these methods are equivalent. The one on the top, the published one is opaque, you cannot reproduce it because the data is not openly available, the one on the bottom can be run on your laptop. Here are some other measures, the evolution of scientific publishing after the Second World War, before that things were completely different, that's why I'm not looking at it. We see many interesting changes, number of authors per work, reason from 1.5 to 4, works per author from 1.99 fell to 1.59, references rose from 13 to 46, pages doubled to 12, the consolidation disruption index fell, so if you think that science is becoming less disruptive these days, it is a true number of citations, works published, journals, works cited at least once and factors have all reason exponentially. All these calculated with queries you can find on the software you can download from GitHub. Here's another interesting chart showing how the evolution of applications has changed in specific fields. You can see the big rise in computer science, the purple thing that has increased substantially over the years and also the relative fall in publications in arts and humanities, the other purple here that has diminished in this way. The absolute number has risen, don't be fooled because publications have risen exponentially, but still it occupies a lot less than it did in 1945. Here are examples of two other data sets. Here is the evolution of applicants by year and country of US patents. You see a fall in the number of patents associated with the United States and Japan and the rise from a low base of patents associated with China on the bottom of the blue line. Another one associated with replicating a paper looking at specific software, statistical software used in health science public research. Again you see with green the original paper and with orange replicated results with Alexandria 3K. This was completed by a TU Delft student in a couple of weeks. Let me show you a more substantial example of what can be done with Alexandria 3K, proof of principle of concept study on a specific topic namely COVID-19. What you are seeing this was while checking the data set, it's a publication I found in the American Journal of Ethics, COVID care in color, which was published by an author, a nurse working in a Bronx emergency room team for six years and painting for 25 years. She captioned it as fear of non contagion is dreadful, especially without proper protective equipment. So at the beginning of the pandemic when we were conflicting information on how and when to use our PPE we relied on each other. I created the data set in the way you see here. I populated a vertical and horizontal slice in an escalate database, selecting some elements with those that matched COVID in their title or abstract in running nine hours, about three gigabytes of data and the ones I indexed it, it rose to 3.6 gigabytes. We can see some numbers, half a million of published articles, 2.6 million authors, imagine the amount of effort that went into there and eight million references that go more there. What are the topics associated with this research? Everything you can imagine, you can see education and engineering featuring very high, of course, after general medicine, but even strategy, management, law, cultural studies, pollution, anthropology, AI, waste management, ocean engineering, all over the place. Who funded that research? This is the query I run, National Natural Science Foundation of China, the highest number of publication, and then National Institutes of Health and National Science Foundation from the United States, followed by various trusts and councils. However, if we look at the affiliations associated with COVID publications, these are the queries that established this. We see that first came the government of the US, then the University of California System, University of Toronto, and so on. Here I use the research organization registry and I moved the organizations up to the parent organization, so Berkeley, for instance, and the University of Southern California rose and UCSD rose to the University of California systems. Another question is, how quickly could we look and work on each other? So how scientists publishing COVID research could cite each other? Was it taking too long a time because things were advancing too fast? And what I have plotted here is publications citing other COVID publications published each month. And you see that fairly quickly, even on April 2020, there were thousands of citations to other COVID publications, which rose to hundreds of thousands late in 2020, early 2021. The number you see at the beginning is an artifact, journals that got published with a January date, although they appeared a lot later. This shows us we shouldn't blindly trust our data. I also looked at collaboration. I found some amazing things. There were articles with thousands of authors authored by thousands of people. So you see articles by 2,000 or 1,700 authors. I thought this cannot be true, so I looked at it and I saw that there were indeed most articles had the number of, say, five authors, but there were articles with 7,000 authors. I looked, couldn't believe my eyes, I haven't seen such thing before, and it was indeed published here. The number of authors appear in a footnote, pages 20 to 28. And this is not an isolated case. Through other queries, I found many articles with thousands of authors showing a way of collaborating in an amazing way. People probably contributed by giving data from hundreds or thousands of hospitals, and all these were collected on papers. Let me switch subject. How many of you heard the impact factor? Yeah. What is the impact factor for those who haven't? It's a measure that tries to see how important a journal is by measuring the citations articles in this venue appear in a given year divided by the publications in the previous two years. It has been severely criticized, especially when it's used for measuring scientific worthiness or productivity of authors. But another problem is that it is opaque, so clarivate publishes, but we have no idea how exactly it derives those numbers. It publishes the method, but we cannot replicate it. We can with Alexandria 3K with queries and populations, such as the ones you see here, and we can get results that rank journals by impact factor with a very high significance correlation with the ones published by clarivate. Through this, we can do queries such as find the most cited article in the last two years. It is this article establishing a method where you can find how matter depends on the properties of steel based just on facts associated with the atomic structure of that. I found that very strange, very specialized subject, so I run a query to find what publications are citing this article, and I got results in the titles such as these at a rate of about eight records per second over the entire data set. If you cannot understand the titles, this is how people outside computing view us when we talk shop. I also looked at cited articles over the last two years, but published in that period, and predictably this was clinical features of patients infected with the 2019 novel coronavirus in Wuhan, China, the first article reporting on COVID-19. Another metric of author productivity is the so-called H5 index, so how many papers you have published in the past five years that have been cited at least that number of times. I found an author that has an index of 76, which amounts to 15 papers a year, and many authors having more than 60 papers, an H index more than 60 and 100 more than 38. This was too large to believe because it's not only what was established previously, hyperproductive authors that publish papers every one paper every five days, but also papers that got cited a lot. So how could this happen? Using again a 3k, I looked at papers that were cited, the papers that those authors cited, and created and looked at those graphs, and what I found is that those papers were cited a lot more between them than papers of other authors with high productivity, but not that high. So this seems to suggest that some sort of clique is working there, authors citing each other, and thereby elevating artificially their H5 index. Whatever number you use, you can game it as you can understand. Finally, here's another interesting chart. Here I'm looking at what topics cite other topics, and for instance we can see these are the 50 strongest relationships, and we can see for instance that the cancer research cites a lot chemistry organic and inorganic chemistry. I will finish with some details regarding the implementation of Alexandria 3k. I hope that you will get a few tricks that you can use here. A3k is based on a plugin architecture, so on the top you get the command line interface, which uses a Python API, which you can also use directly, and below it there are two sets of plugins. Data sources, the values I showed you, you can create new ones by creating a file that establishes a new data source, say archive publications, and processes, things that manipulate data in new ways. For example, matching authors with affiliations, or disambigrating authors with the same name based on what they publish. And finally, at the bottom is a plugin API that these plugins use in order to function. The main ideas behind the cross implementation and the other databases is SQLite and virtual tables. Virtual tables are a feature of SQL that's magical. You can create tables that don't exist as real in a real database, but appear as tables which you can access with the select and other SQL statements. It uses a method to partition the data because large databases come in many, many files, tens of thousands of files for the case of Crossref. I use the number of each file as an index so that the database, the virtual table implementation of SQLite does not jump from one partition to the next because the compressing partitions is expensive. Another trick is to understand once you have written a query or a selection of what data you are interested in, how to understand what tables or what fields you want from each table. I don't want to parse SQL, especially with all the specific implementation details and of the SQLite. So what I'm doing is I ask SQLite to trace the analysis of the query, and thereby I can understand which columns and tables it touches. I also create vertical slices of the partitions for the queries in order to run faster. And I use various queries to only look within a partition in order to populate records. So the Crossref data appears in JSON format, and in a self-contained file, one of the 26,000 files, it will contain all references to each work. I don't need to go to other partitions. So here's an example. When you run the query on the top, what is happening underneath, a virtual table is created, and the query is run on this very simple table. When, however, you also do joins, what is happening is it is creating the tables, but as you see here, the tables have a container ID restricted to each container in turn, one, two, three, up to 26,000, so that each partition is decompressed in turn, and not all of them are run together, and then this is run on tables that are actually realized. For population, similar things are happening. So if you populate something with a condition, so I want only those subjects associated with library information sciences, and you want only some columns. First of all, tracing establishes the table name and the field that you are interested in, and then again on partitions, tables are populated, and then queries are run to fill the data that you want from the populated tables. I found that this is faster than using virtual tables because of the various joins. A thing called topological sort is used to establish in which order the joins have happened based on the names of the tables that you want to join. Similar ideas for populating ORCID, United States Patterns, and so on. Here because this appears XML records, we can skip the parsing of XML records that we are not interested in, and thereby gain an additional efficiency advantage. Let me finish with some issues and limitations of A3K. The coverage of authors is fairly low, so about 17 out of 360 million authors have an associated ORCID with them. Keep in mind data go back to the beginning, to the second world war, so ORCID wasn't a thing then. But even now, not all authors have an ORCID. This is improving because many institutions asked them to, and we're also investigating ways with which we can disambiguate authors even if they don't have an ORCID through machine learning methods. Also affiliations, either they are missing or they appear in diverse forms, so the same university, say the one here, can appear as ULB or the full name of the university. Abstracts are also not always available. A small number of them have an abstract, but many publications also have a text mining link that you can use in order to obtain the full text for data mining purposes. The subjects of publications are based on an identifier established by Scopus, which is associated with complete journals. So if something appears in a zoology journal, we assume it has to do with zoology, even if it has to do with, say, biology or informatics. Again, we're working to use machine learning methods in order to obtain better results here, and looking specifically at the impact factor calculation, which many are interested in. Establishing what is a citable item is a tricky, clarivate uses a proprietary method, where, for instance, an editorial or a letter is not considered citable. It's difficult to do this automatically. I assume they have people working on that. Way forward on how we can work as a community to improve a 3-key, first of all, these are the early days, so I'd be very happy to help the community conduct studies. If you have an idea for something for a way to use a 3-key, please contact me. I would like to integrate more open access data. So here are some ideas, an archive, the DBLP is a database of computer science papers. There is a taxonomy of medical research called MESH, which is extremely interesting, and the wider one used by the public library of science that I also think would be worth integrating and associating works with it. Associated with that is improving the various processes, so things in which we can process the data, disambiguate authors, there are many John Smiths or Zouz in China, find out which are the ones that are written in a specific article, and classify the topics of the publications. And finally, and this is relevant to us here at FOSTEM, evangelize more and better data availability, more use of ORCID and publication improvements on the published metadata. With this, I thank you for your interest and attendance here. I think we should have time for one or two questions. Do we have questions? Thank you for your talk. About 10 years ago, I participated in a Kangol contest, which was about disambiguating authors to link papers written by the same authors in the work you've done. Have you observed the need ambiguity in the different forms? Sir, can you repeat? I'm not hearing very well. Sorry. So I was saying that about 10 years ago, I participated in a Kangol contest, where the topic was finding the different papers written by the same author, because the names had different variations and formats and so on. In your work, do you also observe that the names of the authors are written in different ways, and that makes it harder for you to link papers together? All right. The question is regarding a contest that was run 10 years ago, whether there are authors that have their names appearing different forms? Absolutely. First names often get abbreviated to the first letter. Middle names appear and disappear in random order. So it's indeed a problem, or kid helps. But also efforts such as what you did to develop ways in which to establish uniquely an author with all their works are helpful and they can be integrated as processes, either with a pull request on A3K or by using the API and doing it on your own. Okay. So it's a two-parter. First, how often do you update the dataset? And is there a way to download the delta? Just get the new stuff. Okay. Two things. How often I update the dataset? The answer is never in contrast to say to open Alex. A3K is a tool for working with existing data. So whenever you want, you go and fetch the data. It doesn't come with the data. You use the data sources from their primary source. I don't pretend to curate the data. A3K allows you to use existing open data sources. Can you work with incremental updates if you have a way to running a select, if you have incremental parts, you can populate the data also with the increments, if you have a database that you populate incrementally. Thanks. Thanks. So I was in a university 10 years ago and more than 10 years ago and by that time our articles were published mostly as unstructured text. So is that still a thing? Are there any, are you aware of any efforts to make articles, because they're unstructured tests and those are difficult to analyze? If I understand the question correctly, whether there are efforts to structure the articles in a way that we can better analyze it, there are tools such as a Grubbit, we heard it yesterday at another dev room on Open Science that do that and create XML associated with an article's text. Of course this cannot always be perfect. Ideally we'd want the complete pipeline from authors to publishers to publication to take this structured account and not reverse engineer the structure after it has been published. Well, when I look at the time, we should stop here for Q&A here in this room. Maybe he has some more time. So maybe some applause. And first anyone wants to say thank you. Thank you very much.
How to Build an Open Source School Cloud for 5 Million Users
Okay. Welcome everybody at high noon at Sunday at Fosdem in Jean-Saint. We'll hear now David Walter. He will tell us about an on-cloud implementation for schools in Bavaria. And scaling up for 5 million users. A warm welcome to David. Thank you David. Hi everyone. Today I will talk about how we did on-cloud implementation to scale up to 5 million users or will scale up to 5 million users. And I will talk a little bit about our school project. So what is our target? We want to scale up to 5 million users and we want to be better than every other hyperscaler we know so far. But before we dive in I will give a quick introduction about who we are talking about. My name is David Walter. I'm project lead of the Bavarian School Crowd project. And I'm also responsible for the experience, the customer experience in on-cloud and for the security. I'm using on-cloud as an open source user administrator since 2014. So I'm quite familiar with the project. And yeah. On-cloud itself launched in 2010 and the on-cloud infinite scale implementation we are talking today about had general availability in 2022. Right now we are already hosting more than 2.5 thousand tenants and we have over 1 million downloads of infinite scale. And it becomes more and more a backbone for service provider. As it is already during the press, we got acquired quite recently by Kiteworks, our new mother firm. Kiteworks is a security first company and it empowers on-cloud very much. It helps us to drive security, to drive privacy and to drive compliance in the same way as they do because Kiteworks is a security first company which emphasize privacy compliance a lot. And it also helps us, not only in the project but as a company itself to provide 24.7 operations and 24.7 support. There is much more to say but please read that slide afterwards because I think we are kind of short in time so I uploaded the slides to FOSSTEM and please read the slide if you are interested. More in those details. So just to give a rough idea because I don't know how much you know about the Federalist State System in Germany. I want to give you a brief overview. We are talking about the Federalist State Poverty which has 2.6k schools compared to Germany with 32 thousand schools and European Union 200 thousand schools. Because of course we would like to have some kind of prototype in what we are doing here in Bavaria and to achieve higher goals as you can imagine. That's why I put here some numbers to give you a rough idea about what we are talking about. We are talking about 1.6 million pupils. We are talking about 116 thousand teachers. Roughly 2 million parents in an overall Federalist State of 13 million inhabitants which leads us basically to an infinite scale instance of 4.7 million named users. Why I am focusing on the named users we will come later when we talk about load and performance testing. So what were our goals? We have some on a really really high level we have some major topics like data sovereignty, a most secure system, a most secure fight against shell solution. We have digital accessibility as of course one of the main targets. But firstly we need to compete with the hyperscalers because this project will be always compared to Google, to Microsoft Teams or whatever. So we have to focus as much as we can on availability, on scale, on reliability as much as we can. So on a more functional way we talk about maximum integration, the integration of a messenger and of different other solutions that we will come later on. We need to scale out without any limits. We have seen it in the ICDays where we had a partial lockdown in Bavaria recently that we need to scale out really fast and without any limits. We have to target updates without zero down times. And the most challenging when it comes to the project, we had the first 50 schools on board it after 32 weeks of project duration which was really really challenging. So let's get on to the base. What is underneath the project? When we talk about the infrastructure which is provided by Plus Server, we are talking about four data centers in Germany. We are talking about two S3 locations. We are talking about four metadata locations, 240 Kubernetes clusters, 104 petabytes of S3 storage and up to 60,000 virtual CPUs and 120,000 gigabytes of RAM, which are quite big numbers. The project itself or the Kubernetes stack itself is based on the server in CloudStack and Gardner. That's the logos you see. The applications the user sees are, as I mentioned already, of course on Cloud Infinite Scale as its core and the only Office document server for writing, Excel sheets, presentations and PDF viewer and annotation. But why did the customer choose, or why did they choose on Cloud Infinite Scale? Just as a very very rough overview. The first thing which was very important that we can liberate the user data. The users have an interface to download their data from cloud to cloud, from OneDrive, Google Drive, on Cloud Classic, NextCloud, whatever the schools have already in place or the teachers mostly using their private lives. We want to liberate the user data. That's why we have an interface which allows the user immediately to move the data. The second thing is the spaces. The spaces is a collaborative workspace which doesn't have an owner. It has only a manager and it avoids data silos because, as you can imagine, teachers are collaborative people. So they have those data silos or they have those data silos and when the teacher left or called in sick or whatever, the material were stuck in his storage so he needed to be sent out. So we came up with spaces with fixed problem. Quarter management, as you can imagine, is quite important as well because there is literally a maximum of available storage. When we are talking about 4.7 million users, there is. There is just this multiplicator. The last thing is, it is a technology where I pointed out only one specific thing which makes infinite scales so special. It got rid of the split brain we had before with OnCloud 10. Before with OnCloud 10, we always had the database which keeps the metadata and we had the storage which keeps the data. When one part of it breaks or whatever, you have the problem that you lose more or less all your data. It is really hard to restore those data depending on the metadata. I don't know whether you are familiar with what metadata in this perspective means. It means, for instance, with whom did you share the data? When you miss this information, because for instance, the database collapsed, you have a real problem. We don't split this information and we store the metadata right with the data. The iPad app, just try it out yourself. It is just too good. I said already we had load and performance testing a lot. As you can imagine, schools start at 8 in the morning and every school starts at 8 in the morning. We know we have a huge load peak at 8 in the morning. We needed to make sure that this really is the software that is capable of dealing with it. What did we do? We designed two different test scenarios, one was test to fail and one was test to pass. We know precisely how far can we go and what is our sweet spot to go through. The target is that 95% of each full request, which means I do an action and as a user, I feel the action is completed after two seconds. That is what we are aiming with 95% of the request. We did it. We did it actually. The test to pass, the really test was 0.5 seconds of a full painted picture. Simultaneously, we did X test so we know that the user experience stays the same. I am running out of time so I am just going. What I want to emphasize is because we are in FOSSTEM, that onCloud Infinite Scale is not the only product which is in that platform used. OnCloud Infinite Scale only offers, there are different other softwares in between. There is a key clock as IDP, there is open LDAP as IDM, there is CLEMAV for the antivirus scan, there is a patch a ticker for the full text search and the OCR. There is a post fix for sending out the emails. When it comes to monitoring, also this is Grafana Open Observability Stacks. This Grafana, that is Prometheus, that is Mimia, that is Tempher and that is Loki. The operations which is done by our own staff is also mostly driven, more or less fully driven by open source solutions. We have Ansible with the AVX interface. We have the OCI by Sonataip, we have the Nginx with the web application firewall, etc. We have a helm for the package management. That is quite a big zoo, I would say, when it comes to the software we do use. We are not alone on the planet with our project, so we need to integrate in the universe which already existed. The first thing we need to do is to integrate their key clock, their SSO key clock with our key clock. Why don't we use immediately the same one? Because we need to host our clients, we have to host some other integrations, etc. The key clock installation went bigger and bigger and bigger, so it was an early decision to say we will host our own key clock and this will be federated with the customers SSO. Then there is the Metrix messenger by Stewie, which is integrated in a way that you can use the messenger and pick files from onCloud Infinite Scale and send it there or... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... I hope I understood your question correctly. You are asking about the autoscaling features we are using? Yes, because the user's pattern is relatively predictable, so everybody connects a day, you autoscale and downscale when people don't connect. Basically, this is the Kubernetes horizontal pod autoscaler, which is playing a very important role in it. There are other elements like resolver porting, etc. We are using all this information stored in the Helm charts. Of course, we don't scale to zero because we want to stay HA, high-waivable, at 24.7. That's why we don't scale it to zero. Of course, we do the scaling. To give you an impression, we had one IC day last week or the week before. Actually, it was three days. Because of homeschooling at that day, we had a load jump of 60%. It went smoothly without a second of outage. Thank you for that question. I was wondering, can you tell us anything about what sort of user feedback you've received today from teachers, students, parents? Yes. Thank you for the question. You are asking about the user's feedback. In the first period of time, we had 50 schools which were monitored very closely with LIME surveys. They were giving us immediate response to the overall feeling. Then we had some, I don't know how to call it, black box testing where our product management watched actually the users how they do some tasks. We improved our user interface and user experience according to it. Over the time, we have every four or six weeks, I'm not sure, check-ins with some certain key user groups which collect their impressions from the schools and report to us. Does it answer your question? Okay, perfect. You showed us you had 32 weeks between design phase and the first onboarding. Yes. Did you do a comparison between your vision and the commercial solutions that were available at the moment? Yes, but this had been mostly done during the tender phase, as you can imagine. We were checking on features, of course. We were checking on as files available on how they do stuff. At some certain points, we also did our own ways because we do have some insights provided by the customers, like for instance, school start at 8. I mean, it's common knowledge, but anyway, so we had some insights we could prevent to our platform, provide to our platform. Right now, we are still learning and getting better every day, right? Because even though I'm mentioning 5 million users, that's an adoption rate of 100%. Right now, we are at 950,000 users, but after 10 months after the initial start, the first school, after the first school, right now, we are 10 months behind. Okay, thank you. Any further questions? Hi. Hi, David. So since you store all the metadata with your files, I guess then on S3 or object storage, is there any sort of database for any sort of metadata, or is that all like user data is stored on Keynote, SSO only, you have metadata of files, which is together with the files on the object storage? Yes, thanks. The only database, or more or less only database we are using is Postgres SQL for Keynote. The metadata of the files are stored on RWX storage within the Kubernetes world, let's say, and we have our database to keep the metadata, which is the huge difference between onCloud 10 and onCloud Infinite Scale, because we needed to get rid of this split brain. It always is a hassle if you have to connect to the database and to the storage at the same time, and you cannot synchronize those processes 100%, and that's just impossible, right? So with this approach, we can make sure that this isn't a hassle anymore. Does it answer your question? Yes, just one thing. But you still store metadata and literally file blobs on different places, so in theory there's still the possibility of some sort of desync between the metadata volumes and the storage? In theory maybe, but I would say that's a very scientific question. I haven't seen and I couldn't imagine a vector in practice. Let's put it this way. Thank you. You're welcome. Hello. Thank you. I'm here. Great work and congratulations. Thank you for your presentation. The question is this, is it possible to try to make all this available for little communities in a little laboratory, or are there components that works only just on the power of the cloud? Thank you for asking this question. This is one of the most important things for us at OnCloud Infinite Scale, because we come from the community, we had developed this in close collaboration with the Zern, so this is also something we want to give you back. There is, when you go to, for instance, to get up to OnCloud.dev or to the documentation site, there's also available OnCloud Infinite Scale as a single binary, and you can run it in Docker, in Docker compose, or just as a single binary on your computer. Please feel free to download. Thanks for that question. Hello. Hi. I would like to ask, like, if there are any top level administrative tools for managing such a big cluster of Kubernetes nodes. Sorry, I didn't get a word. So I would like to ask, if there is a top level administrative tool, apart from like Grafana, etc., that you used to manage such a big cluster to deploy deployments, etc. Because it's such a big cluster of nodes. Yeah. Well, there is the Gardner provided by Plus Server, which gives us the opportunity to manage the clusters in a graphical way, and it also has some CLI, etc. But our philosophy at that point is more put the genius into the code, into the DevOps code. So when you're asking for something like a Sousa Ranger or something, we don't have that, to be honest. Yeah. Okay, thank you. I will have to interrupt you because time is over for that slot. First, I want to thank you with a little swing here. So that's nice.
Private clouds do not need to be legacy!
Okay, welcome to the next talk. Fabio Alessandro Fali Locati will tell us about private clouds do not need to be legacy. A warm welcome. Let's go. Thank you. So today we'll be look at mostly four things. First of all, what is a cloud? Because I think it's very important to define the scope of what we are going to talk about. Second thing is what can we learn from public clouds and we can learn lots of things from them. They have been a very successful business model and product for the last few years. So we can learn something from them. And then some technologies and consideration and best that you can do to have to create your own private clouds that is successful and then draw some conclusions with some more maybe names of projects that can help you. So very quickly about me, I have been around the Linux space for the last 20 odd years and I have done a lot of public cloud work, hyperscalar work. So I am both AWS and Google certified and I do work for Red Hat. So I also know the private cloud part as well. So why private clouds? Because obviously for many people here it's obvious why you want to have the private cloud but why do companies also nowadays are starting to look more into private clouds as well. So the first one is technical requirements. Public clouds, even though they seem like the perfect solution, the flexible solution for every kind of issue, actually they are hardware-sache software architecture. So if that specific architecture does not match your workload, then maybe the public cloud is not going to be the best suited architecture for you. Second point is legal requirement. Obviously public clouds are going to be external to your business unless you work in hyperscalar, that's a different conversation obviously for you. But for all the others, public clouds are going to be a different company. So in some situations this is not easily done or it can be complex. There are also financial considerations. Even though public cloud seems very cheap, you can buy everything for a few cents. When you start to have a lot of things that maybe are running 24-7, then expenses can become higher. So there might be financial reasons to look for a different model. And then obviously there is the most plain one, which is simply because someone in your organization decided that private cloud would be a better choice than the public one. So let's start from the very beginning. What exactly is cloud? I started as everyone from the Wikipedia definition. Wikipedia defines cloud computing as on-demand availability of computer system resources, especially data storage, cloud storage, or computing power without direct active management by the user. I personally don't really like that definition. I mean obviously it's the correct definition, but I don't think it highlights the most important parts about cloud. I prefer this one, which is mine, and I define cloud as a business model where one party runs to a second party, which could be parts of the same organization obviously. Computer system resources, especially data storage, computing power, with the smallest granularity possible. So the two big differences are first, the fact that I define it as a business model first, and second, the smallest granularity possible about the resources. And if we look at the history back before it was called cloud, you used to buy servers, rent servers, better say, in either yearly or monthly basis. Then with the cloud we moved to hours and then milliseconds, milliseconds nowadays with the functionalist things, serverless functions, and those kind of things. So time obviously is one of those aspects at even tiner granularity, but the other one is actually resources themselves. So before the cloud era we used to talk about CPUs or sockets or full systems. Nowadays we move to core and vcore and now fractional vCPUs. So effectively in this way the cloud can charge every single customer exactly for the amount of resources that they are using, but most interestingly I think it's the fact that they can also have very small prices per unit because the unit is very small as well. So what can we learn from public clouds? I think the first big bucket of things is going to be the separation of concern. Obviously public clouds are different companies from their customers, so it's very important to define a boundary. While in internal clouds or private clouds that is not always done. I think this is a key aspect that we need to learn from public clouds. And to be precise they have standardized the interface between the infrastructure which is what the public cloud provides and the workload which is your part. And by defining that clear card it's first of all easier to build, but secondly it's also easier to define which responsibilities are the provider responsibility and which responsibilities are the customer or user responsibility. Second is the scalability at workload level. Nowadays it's I would say obvious that your application needs to be the one that scales and the infrastructure might be there or not, but the reality is that your application needs to take care about that scalability part. Back in the days that was not obvious. Back in the days when you had more need for more resources than you had you would call your infraperson and they would provide you with a bigger server or whatever. So I think this is an important aspect that we need to bring into a successful private cloud as well. The third aspect is the obstruction of the physical architecture. This was a big change in the public cloud model. The public cloud was saying you don't care where my data centers are, how my server work, how many I have. The important thing for you is you have enough resources that you can use and those are very simple kind of resources that you can use. The second aspect that I think we should learn from them is to have a functional business model and by functional I mean economically sound business model. So by standardizing those interfaces between the infrastructure and the workload you can actually decide a charge for them and it's easier to define what that amount of money should be if you only have very limited amount of constructs. Now obviously if we are talking about the private cloud within a company you will not probably exchange real money internally between your departments but still you can do charge bags or virtual money, color money, whatever your company calls them. And this is very important because having a sustainable business model means that you are not a cost center. The IT is not a cost center anymore but it is part of the rest of the company having incomes as well as costs. Building back is critical as mentioned also for another reason because the risk of not building back is that all the other departments will look like they are absolutely in the positive even though maybe they are burning a huge amount of resources of the company. So it's very important to build back to the final customer to ensure that their accountability is also correct. And then the third aspect is keep the cost down. A lot of times in IT we have ignored the fact that things cost and we can justify costs only if we deliver enough value to the business. And therefore we had a lot of cost creeping around IT, internal ITs and that was also one of the advantages that companies saw in externalized IT to public clouds or other companies because they were like okay so those people will optimize costs. So it's very important if we want to have successful internal IT and private clouds to have them and run them in a cost aware way. The third aspect which I think is critical to learn from public clouds is to maintain the control. And by maintaining control I mean many different things. So first of all set some SLOs, measure, iterate and that is a critical point. If you don't have SLOs or SLAs or whatever different kind of indicators, objectives and so on you cannot have anything successful in the long term. And then of course measuring iterate is very important as well. But also it's very important to think about control in a moralistic way. So I would say do not use third party proprietary software if it's your own proprietary software that's a different conversation. But if it's someone else this is a big risk because when you create an architecture, a cloud for instance, you are going to host someone else workloads. And those someone else will probably want to run the workload for a long period of time which could be 10 years, 15 years, something like that. So you will need to ensure that you will be able to continue delivering that service for the next 10 to 15 years. If you rely on someone else software you know it could happen a lot of different things to that software provider that can make you put you in a difficult spot. There is also a big conversation in the IT by versus build. So effectively every time you decided to use a software you need to decide to either buy that software from someone else or build that software. I would say a lot of times for small tools mainly it's easier to build them than to buy them. I mean, to buy them means that they are readily available but building them means that you do have that knowledge and that ability to then maintain that software throughout the whole life of the platform. And then there is the big point about lockings. Lockings can be any kind of lockings. I do have my definition of lockings which is that lockings are the product between the probability that a component will require substitution during the life cycle of your solution which could be as I said 10, 15 more years and the total cost of that substitution. Now lockings are not something you can walk away from because if we go very basic the Linux kernel has APIs and that APIs are a lock-in because basically if you will want to move away from the Linux kernel you will have a lot of cost if you have your whole infrastructure on it. So obviously there is still a cost there. What is the probability though that you will want to walk away from Linux in the next 10, 15 years or that you will need to? Well, probably very low so I would say that is a low cost lock-in. But in the end, you know, every single decision brings a lock-in. You need to be aware of this and therefore manage your choices being aware of this point. So some technologies, consideration and bets, very simple. Let's start with the very basic one. Keep it simple as much as possible. Reduce the complexity of your system to a minimum. You have to maintain that stuff for the next many, many years. So try to keep it as simple as possible because it will become complex either way. So better to start easy. Second, prefer build time complexity over run time complexity. A big example I do about this is use VMs or bare metal stuff. I would prefer using bare metal things because obviously they are more complex to set up. They are going to be more complex in some scaling situation as well. But in the other hand, you will have one less layer to care about. And when you have a lot of layers, you start to have issues just because packages or stuff goes through a lot of layers. And it will be much harder to debug. So effectively, if you can, move your complexity to build time rather than run time. And then minimize the amount of services that you actually offer. So just offer the very basic services that your customer are requiring today. And then over time, if they will require more services, you might choose to give them as well. But if you start today with 1,000 different services, it will not make any sense because probably on a lot of those services, you will have very limited amount of customers. But still, you will not be able to drop them because you do have customers and your complexity will just skyrocket. Second, go for containers. Nowadays, there are still applications that are not fully containerized or containerizable. I understand that. But containers seems today to be a sensible unit of computing or way of transporting application and data throughout architectures. I would suggest going for a Kubernetes distribution. And by Kubernetes distribution, I mean something that implements the Kubernetes APIs. I think that Kubernetes APIs are going to be the next POSIX APIs. So basically, the whole world will standardize on them. And maybe there will have 10 different implementation of the Kubernetes APIs. But still, using those Kubernetes APIs will probably be a future proof choice. You can obviously go for a do-it-yourself installation of Kubernetes or any other application, community one, whatever you want to call it. Or you can go for a commercial one. There are not that many other choices. If you go for a commercial one, first of all, go for a fully open source one, as mentioned before, because if you have a proprietary third-party application, then you might incur some risk there. Go for a trustworthy company. And by trustworthy, I mean someone that you can trust, because in the end, it's you that are going to deploy and work on this stuff for the rest of the life of this application. Therefore, you need to trust that vendor, that vendor will still be there and support you in a sensible way. And then third, it needs to be valuable. And valuable is still from your perspective. So you have to see the do-it-yourself and that specific offering that you are looking at. Obviously, the offering will have some cost associated. All those advantages that commercial option is giving you more or less than the DIY. But this is a very critical choice, because then that becomes one of the key components of your whole infrastructure. And getting that choice wrong can be very costly in the course of the whole life of the application. And then another suggestion from my side. I come from an automation system administrator slash automation space. So automation is very important for me. First of all, go for a mutable approach to your infrastructure, bake your images, ship the images. When you have to do an update, just refresh the whole node with something new that continues to work and has all the changes. Second aspect is version everything, because hopefully you will never need to, you know, roll back anything and rolling back is always dangerous. But it's very important to see exactly what change from version X to version F soon so that effective you are ready to mitigate the issues. Then automate processes end to end. If you don't automate a process end to end, so effectively if you have like half process automated a human step and then half process automated afterwards, you are not getting the majority of the advantage of the automation, which is speed and reliability, because that human step is breaking those advantages for you. So a few conclusions that we can do and try to put together a possible solution and stack out of this whole conversation. We are going to have three aspects, the infraside, the API and the workload. So on the infraside create slash architects for multiple data centers, even though you are not really yet using them, start to think about multi-data center deployments because very probably you are going to need them. Your application will want them. Deploy Kubernetes container platforms cluster on Bermatel. As I said, having virtual machines in between can have advantages, but still it will give you a lot of headaches. Use a tool to obstruct the management of those clusters. There is open cluster management, which is a great open source tool. You can do it with many others, pick your own, but still don't try to manage 1,000 clusters manually. Set SLOs, measure, iterate, that is obviously a very important point, and automate all the pieces of your infrastructure and configuration. So on the API side, define discrete regions, not based on technically the data centers or that kind of things, but other reasons, that are business reasons. One example is, for instance, legal frameworks. Europe has a legal framework, US has a very different legal framework. Your application owners will probably want to choose if their application is running on the US part or the European part. We will probably not care a lot about if the data center is in Brussels, in Ghent or wherever. Standardize on Kubernetes APIs is the only kind of interface between your part, the infrastructure part, and the workload, so that it's a very easy way to do it. Start providing very, very few APIs to your user. My example would be, for instance, an OCI registry where you would want one, object storage, some kind of PV, PVC, so block storages, pause deployment stage, we said, comping mask secrets, but try to avoid databases or those kind of things, and then when your user actually asks for them, if you have enough users that ask for the same thing, then you can start to add additional services. So, on the workload side, create a simple UX to create, update, and delete workloads, and by simple UX, I mean a UX that is simple for your users. So, if your users are used for a certain kind of user experience, follow whatever your users are used to. Store every single version of your workloads configuration, and this way, for instance, in Ghent, this way, once issues will arise and they will, you can go back and say, look, you have changed this and that is what broke everything, but also this will allow you to do resource tests and that kind of things on real data. Require OCI with a knocked out if your workloads will require it, your application to be resilient to restart, replications, migrations, and all the other kind of operations. If you have some application that needs to be very sticky, that's a different conversation, but still don't put it as a default choice, make them choose to be a special application. So, with this, thanks a lot, and if we have question and time. Yeah, maybe we should have time for one or two questions. Thank you very much for the talk. Regarding immutable applications. Things with video, we are in some delay. Sorry for that. Regarding immutable applications, like it's something getting thrown around a lot, but like what exactly works to do this at scale? Like is there any recommendation on that? Sorry, don't catch. Immutable applications, you recommended this, but like as most applications are actually automated with Ansible or Puppet, which are very mutable. So, like what's the immutable approach? What's the tooling there? So, what is an immutable approach? I would say RPMOS 3, for instance, based distributions. So, there is Fedora, CoreOS, there is many others distribution that are doing the same thing. If you deploy, if you have your nodes that are only running Kubernetes, you don't need to manage them too much. You just need a very basic operating system and then enough to run a container runner. So, effectively, try to keep that one as small as possible, not to manage it too much, and then use any mutable operating system such as Fedora, CoreOS. Would be the Fedora one, there is CentOS, CoreOS, I think, or whatever it's called, the mutable version of it. I know that SUSE has one, Ubuntu has one as well and so on. A question from turning from public cloud to private cloud as you describe it, from an IT resource perspective, what does it mean for the companies who go back to private cloud? Do they need to build up new IT, scarce IT resources? Yeah, so from resources perspective, if the company has already dismissed all their data centers and moved to the public cloud and now it's like, oh, maybe we should go back, that's a little bit of a problem, but there are a lot of data centers that can rent you like square meters or racks in their data centers, but still you would manage the whole infrastructure. So, effectively, it just gives you space, power, and connectivity. So, that could be a way to start doing those kind of things. And then obviously if you have a lot of requirements, a lot of, you know, service to deploy, you can choose which path would be the best for you. Okay, thank you very much for the talk. Maybe one applause? Do we have applause for you?
Firefox power profiling: a powerful visualization of web sustainability
Okay. Welcome to Janssen. Final Fox Power Profiling. We'll listen to Florian Quez. Welcome. Applause Hello. I'm Florian Quez. I'm a performance engineer at Mozilla. For the last few years, I worked on understanding how much power is used by Firefox and how it's used, how we could reduce it. And today, I will be sharing how the tooling we've put in place to understand Firefox power use can be used to improve web sustainability. So first, I will explain what I mean with web sustainability. I mostly talk about carbon footprint when I'm thinking sustainability here. And there are three main components of the carbon footprint of browsing the web. The first and biggest one is the user device. Then the second component is whatever's not in front of the user. So that includes networking, server equipment. And then the power used on the device by the browser when browsing to see the web page the user wants to see. So let's look at each of the three. First, the user device. Usually, we think it's not within our control when we develop a website because you know it's the user who picks the device. We have nothing to do there. The emission we are talking about here is the embodied emission. Whenever someone is buying a new computer or a new smartphone, so it's the emissions to produce this device to manufacture it but also to ship it to the user. And even though we don't get to do anything about the actual emissions when creating the device, we can reduce the incentive for the user to replace the device. And the way we can do that is to ensure good performance because something feeling too slow is a strong incentive for someone to replace the device. And the other thing is ensuring web compatibility because if the device becomes incompatible, the user has to replace it or update it in some way. And on this topic, I would like to mention that Firefox is currently the only browser that is left supporting Windows 7 users. And we actually still have millions of users running Windows 7. So if you are thinking about sustainability, Web Compat is one of the first things and think about Firefox ESR. Second piece of carbon footprint is the emissions for the infrastructure. Anything server networking. I'm not going to talk a lot about this because it's already well covered. And there's a reason for that. It's that the financial cost of operating the services scales mostly with the emissions. So there's a strong incentive to optimize and there's already a lot of tooling available. And last and maybe not least, very often neglected, the emission caused by using the browser to display the web page. And the reason why it's neglected is that in people's minds, there's no good tooling available to look at those. Is that correct? Well, not really anymore. And this is what I will be talking about today. It's what we've done to change this. We'll be focused for today. So because the talk is 40 minutes long, I want to give you a structure. First, I will explain why as Mozilla, we care about this. Then I will explain our journey to measure power use locally, what we've done to be able to understand it. And then I will go deep into the topic of power profiling. But first, I will introduce the Firefox profiler so that even if you don't know it yet, you can make sense of it. Then I will explain what power profiling does and show examples. Examples are important because this is where we see why all of this makes sense. And then we will take a break from the structured presentation of slides. I will try to do a live demo if I have enough internet. And then I will explain what we call external power profiling, whatever that is, and give some more examples. So let's start. Why do we care about this as Mozilla? There are three main reasons. The first one is for sustainability. Mozilla met climate commitments of being carbon neutral, of reducing our footprint year over year, of leading openly by sharing materials, tools and methodologies. That's what I'm doing today. And of improving products from a sustainability perspective. There's a reason why this one matters. It's on the next slide. It's when we look at Mozilla's carbon footprint, the actual footprint caused by using our products on our users' devices is more than 90%. And by the way, when we say we are leading on sustainability, we are the only browser maker organization that actually publishes this kind of data. The others, they are shy about it. And we would like to encourage others to also publish this kind of data. Second reason for hearing about power use of Firefox, a very important one, it's for user experience. Nobody likes to use a computer that uses too much power, and there are multiple reasons. One, it's causing noise with the fans. If it's a laptop and it's super hot, it's painful. And then even the people who couldn't care less about climate change because they think it's somebody else's problem, they hate running out of battery. Life matters to everybody. And then last but not least, we do this for a better web. And there's a reason why we want to do this. It's Mozilla's mission. We are building a better internet. This is what we are here for. And because Firefox is made on web technologies, everything we do to make Firefox more energy efficient, and all the tooling we put in place, it's directly reusable for our web pages. Now let's dig into our journey to figure out local power use. I started a couple years ago. My task was to figure out how Firefox is its power, what should we do about it? It was not clearer than that. It's just, let's look into it. So when you want to understand the power use of something, the first thing you do is you take an energy meter, or what matter. So that's what I did. It's cheap, it's easy. It's pretty accurate in terms of the data it shows. It's also pretty useless for the case of software because software is not something that you start and it does the same thing all the time. You need to see evolution all the time, and I was not seeing anything. So next step is get a better what matter. I got some that is communicating to the computer through Bluetooth. It's sending me something like this so I can see a chart. It's much better. It's still not so great because I still can't correlate with what we were doing with the code in the browser. And at that point I wondered how is the competition doing it? And I found this blog post from Microsoft that I found very interesting. Back when they were working super hard on edge battery life, they were super proud of it. It was before they switched to Chromium. And one sometimes really caught my attention. That's the one highlighted in blue here. Power was measured on the surface book because it has integrated hardware instrumentation. So that's how Microsoft did it. They have their own brands of laptops. They put built-in power meter chips in those so that they could compare power use of edge with competing browsers. So that's not really something we can do as Mozilla. I don't have my laptops. But can I get some of those Microsoft laptops? Well, sure. So I actually found two that include those power meters. They are pretty old because they are back from when Microsoft was doing this kind of work. They still work. I tried to find newer devices but unfortunately all the devices where I found that a device like a power meter is exposed in the ACPI table actually doesn't seem to be actually present on the device. So the power meters were put in by manufacturers for prototyping and calibrating battery discharge rates and then not put into production devices. When we look at the tool called Perfmon on Windows on those computers, we get something like this. We see energy meters and we have 4 channels here, battery, CPU, GPU and the Wi-Fi chip. Which means we can measure the power use of each of those components. We can measure it. We see something like this. So we now have charts. We can try to correlate with stuff that happens. I'm not sure if you think like me but I really dislike this UI. I find it absolutely terrible. I can't make sense of it. Even the unit, I can't make sense. Like here I selected the CPU cores energy and it's using, last time it was measured, 5.0 something E plus 011. Whatever that means. While searching for those devices with built-in power meters, I had a good surprise. On some laptops, the names of energy meters were pretty familiar if you use Intel Power Gadget. Those are the rappel channels that are exposed by the CPU itself. There are some investigations. It turns out that all Windows computers with Intel CPUs expose the actual CPU as a built-in power meter and we can access it. Because another nice surprise, there's a documented API for it. I have not found any example of someone using the API but the API is documented. Which means I can now understand what the unit was. So the E plus 11, it was because it was Pico whatever. And we can correct many times per second. So I know that because I experimented with it. Very little overhead so I can correct many times. And the most important thing for us is it's accessible in user terms. I absolutely don't want anybody to run Firefox as root to be able to power profile it. So accessing as user land without providing users to install anything was very important for us. So all of this makes it pretty tempting to use for profiler counters. At this point, I started prototyping something, started hacking. And this is the screenshot of the very first power profiling prototype I got of Firefox. So we see the same name here, rappel channels. So it's not very user friendly as names here. And the units were not correct but the thing that matters a lot when seeing this screenshot is the shape here of the track. So this is the CPU package and you can see that it matches the shape of when we were using CPU. So the data seems correct. We were using CPU here too. And there's a shape that's moving here. PP1 is the GPU and there's a spike whenever we do something on screen. We were doing graphics work here and we had something here, something here, something here. So the shape is correct and this is a key validation because until then we were thinking we could not power profile because power profiling means running the profiler. Running the profiler means using power and we were afraid we would be profiling the profiler, which is not what we want. And this validates that it actually worked. So I decided to polish it and make it something we could ship. But I see I shared with you a screenshot of a profiler without introducing the profiler. It's a good time to introduce it. It's a profiler. It's built into Firefox. No additional tooling required. It's always there. The user interface is not there because we don't want to clutter things for all users. But it's trivial to make it show up. And it was created for performance work. And here by performance I mean making things faster. It's useful both for users so that they can make a useful bug report very quickly instead of saying something is slow. They can say, here's a profile of what happened. Please have a look. And useful for developers because those profiles are easily actionable. And it's one of the best profilers that exist currently and there's a good reason for that. We started investing heavily in it in the Firefox quantum days. And those days were when we decided that the engineering teams of Firefox being several times smaller than all of the competitor teams was definitely not a good reason for Firefox to be slower. So we needed better tooling. Profiler uses multiple sources of data. It's a sampling profiler which means that it uses a timer. And at a fixed interval it stops the execution of the program and will capture information. Typically it will capture stacks of all the threads we care about. And it will also capture counter values. So counters, for example, if we care about memory use, whenever memory is allocated or released, we'll increment or decrement a counter. And then when sampling we'll record the value of that counter at the time we sampled. And the last source of information is markers. Markers can be seen as annotations left by the developers so that whenever we see a profile we see what happens at the time. I had a screenshot of showing how to use the profiler, but I will try to do a demo instead. It will be more interactive. So this is Firefox Nightly Instance that is created fresh, no user profile. And I will go to profiler.firefox.com. It loads this web page and I will click the big Unable Firefox Profiler menu button. When I click this, I see a new toolbar icon that appears in my toolbar. It was already there. We are showing the icon when clicking that button. And I have settings here. In most cases the default settings will be good for what you want to do. So you can just click start recording and then you can start doing what you want to profile. So I will, for example, load the Wikipedia on page. And once I'm done doing the thing I want to profile, I can click the button in the toolbar again. The second oscillator, I have a new tab that opens which is my profile. The UI might be a bit intimidating at first. I will go through it with you. There are two main pieces of the UI. The first one is the top half here, which is what we call the timeline because everything is drawn against time. There's a time error here. And then there are panels at the bottom. In the timeline, you can see that we have what we call tracks here. And there are tracks for processes, like you see the parent process here. In Firefox, the parent process handles the user interface and the rendering. And you can see a process for each content process. So for example, I have the Wikipedia process here that I will select. And here there's activity. So I can make a selection and I can zoom into it by clicking this Manifold icon. So now we'll see in a lot more detail what happens. And the UI is very interactive. Whenever I move the mouse somewhere, I will see a tool tip showing me what happens. So here I'm seeing the stacks that were sampled. I see there's some JavaScript in here. I assume some of you are web developers and probably care more about what JavaScript runs so we can filter frames to JavaScript only. And here I'm seeing which JavaScript code was run by Wikipedia when loading the page. I also said we have markers, so I will show an example. In the panel at the bottom, I have a marker chart. And if I go here, I have DOM events. They show whatever event we are sent to the web page. And if I scroll down, I have many more. I won't go through them, of course, because we clearly don't have time. I just wanted to show the Yawek markers here that show whenever the thread was active, which is important for PogarUse. And the runable markers we have at the bottom here that show the name of the C++ task that was running, which is very important for Gecko developers when you send them one of the profiles. So that will be it for an introduction of the profiler itself. I will go back to the presentation and skip all the screenshot I had that show the same thing. And now we'll talk about actual power profiling. So I said we have a working prototype. Now we want to make it work for real. So it's built-in. Again, no extra tooling required. It supports the three major desktop platforms. We shipped it in Firefox 104. So that's already a little while ago. We got a lot of great feedback on it, especially from people who care about sustainability of the web. At best, my knowledge it's not been copied yet, but that might happen at some point. Platform support. On Windows 10, we support only the devices that include built-in power meters. On Windows 11, we have Intel CPUs. We have information about integrated GPU and memory power use. And on Windows 11, 22H2, we have about updates. We started seeing Windows supporting AMD horizon CPUs. And a good surprise here is there are separate information for each core, which means that if we can track which process was using which core for which thread, we can know exactly how much power is used by the code we are running. On Mac, we support both Apple Silicon CPUs and Intel CPUs. Different ways. For Apple Silicon, we actually have an API from the kernel that gives us information about the amount of energy used by each process. And I think Apple can do it because they control both the CPU chip and the kernel. It's very likely that whenever the context switch of thread, they record the value of the counter, and that's how we get the value. On for Intel, there's a very obscure system called that can be called only from assembly of a Raspberry Crash that gives us the value of a Rappel model-specific register. I'm very happy I didn't have to figure this out on my own. A former colleague did it several years ago. I could just copy-paste the code. That was nice. Last but not least, Linux. On Linux, we use Rappel Perfevance. Those used to be unrestricted, but then there was a side-channel attack. And the reason for that is if you look on Windows, you can access the power data up to once per millisecond. If you try to access it faster, you will get the same data returned again. On Linux, there's no rate limit like this, which means that you could create so often that you could get some information about the data being processed. So when this was discovered, it was then restricted to only be accessed by root. Thankfully, there's a command we can run as root that lets the kernel know that it's fine to not be paranoid about it. And that's the exact same command that needs to be run to use Linux Perfev, the built-in profiler for Linux. So I think it's fine. As long as users don't need to run Firefox as root, I'm still happy. So I will take this work around. MDCPUs are supported. A lot of running, it doesn't work on SNAP packages if you are an Ubuntu. The binary is provided by Mozilla Work. And I think that's because the packaging system puts some restriction on what's allowed or not. If you were about how to configure a profiler, I said in my previous demo that the default settings are right in most cases. If you want to put a profile, they are not right. Because obviously you want to also include the Power Use feature that's down here, the checkbox you see there. And also markers for all threads. And the reason for that is, we're waking up a thread as a cost in terms of power. And we want to know about it, even if it happens on threads, we are not profiling. So we want to know where ever it happens. And I think the power preset, of course. And the other thing we want to do is reduce the overhead. As I said, we were concerned we might be power profiling the profiler itself, which is useless. So what we are doing is increasing the sampling interval. Instead of sampling every millisecond, we sample every 10. We're using significantly the overhead of waking up. And when sampling, we don't capture the stacks. We only capture countervalues. We still have all the markers that still give plenty of information. For the presentation, I will give plenty of examples. And as I said, the profiler UI is very interactive. So here's a link to buy slides. And whenever you have a screenshot of a profile, I also have a link to the profile so that you can see the profile for yourself. One thing I should mention is that the share.firefox.devdomain doesn't work with IPv6. So you need to be on the first damn dwell stack Wi-Fi network if you want to be able to click right now. First example of wire profiling, I will go through it with you because I know the profiler was intimidating. It's the same thing as what I was doing before, which is loading the Wikipedia homepage. And we can see on the screenshot, near the top, it was not visible. So the screenshot here, we see it was not loaded yet, but there was CPU activity. Here something starts occurring, and here it's visually complete. So I made a selection here of the part of the profile that we care about. You see we have a parent process. We no longer have colors, and we no longer have stack samples because it's the power profiling settings. We still see the network requests. They are also shown here. And we have a new process power track here that gives us how much power is used by the parent process. We have a similar track here for the content process that tells us how much power is used at any given time and the amount of energy used. So here 134 mW. So that's how much power it takes to load Wikipedia, apparently, on that computer. Second example, this is profiling on Windows 11, and this is starting Firefox. And you can see how much... So this is the CPU activity when starting Firefox. You can see that here we had a window, and here it was visually complete, and the activity was done. So we can see how much power is used by the CPU cores. The built-in GPU, the entire CPU package, and here it's selected, and we see it used 10 mW to start Firefox on that specific Windows laptop. I'm not sure about you, but for me, whenever I have new tools, I like to play with them, and I like to test their limits. And I was wondering what's the tiniest thing I could power profile reliably. And this is the smallest thing I have found. I'm not sure if you've read what's written at the bottom of the slide, but basically, I was profiling Firefox doing exactly nothing. And it was not exactly nothing, actually, because when I looked at the profile, there were those small spikes. What are those spikes? And actually, it turns out there were the cursor blinking in the address bar. And if I select one of those spikes, like I did here, I can say that making the cursor happy on the address bar uses 1.5 mW. I was surprised. I didn't expect this to be that precise, but yes, it is that precise. Now is a good time for live demo, if it works. I assume many of you, when you came here, you wanted to see a map of the campus, especially if you came for the first time. And because you're in two open source, you probably use the open straight map. So I will try to figure out how much power it uses to search for the campus on the open straight map. So I will configure a profiler to use the power preset here. Start recording. Open a new tab. Type openstraightmap.org. It loads. I have a text field in the top right corner with a cursor in it. So I will type ulb and see what happens. So I'm in a university campus. This is very nice. I don't really recognize the shape of the building though. So I will zoom out to see if it's in the right place. Zoom out again. Oh, we're in the third door. Probably not correct. So I will additionally type process. Now that looks better. If I zoom into it, this is actually the building we are in. Okay, I will stop the demo for now and now we will look at the profile I'm capturing. Again, we have multiple tracks here. Google.com, I don't know why it's here, but I don't care about it. It was probably the background. I don't care about this either. So I have open straight map here, and I have the power and process here. I also don't care about how much memory we used. Okay, so this filter is down to what's useful. So I have a screenshot track here that that orientates me pretty quickly. So I was typing the address of the map. Here we have the home page that's loaded. And here I had my first results. Here it was animating when I was zooming out to try to figure out where I was. Okay, so let's look at power use now. So the part we might care about is when we actually loaded the right campus. So that's here. You see the network request here, and you see there's a spike here in power use, and similar here. So I will zoom into this time of a profile. And now this is what happens in the Firefox parent process. So we can see it used about 100 mW, and in the content process, about 33. So this is about 130 mW to load the search results, showing the campus here. But because we have a profile that's nice, whenever you have a profile, very often you explore other things, because it might be interesting. I'm most interested by the shape that I see here that seems pretty interesting in terms of power. So I will zoom into it. And we see there was syncycum power use here. When looking at the screenshot, there were animations going on. So it makes sense that the process during the rendering uses power. Here we used more than 300 mW in the parent process, plus 150 here. So actually zooming out to see where I was actually used a lot more power than searching for something. We can also see the bandwidth used. So this is the network request, and we see how much bandwidth was used in network bandwidth. So we can see here that zooming out took about 4 megabytes of network bandwidth. I will zoom out again. I see there's some activity here, and I'm curious about it. I see there are markers, so I will zoom into it and try to see what's going on. There are regular spikes here. I will select the content process here, because the markers and the content process, it's likely where the activity started. I will zoom a bit more on one of the spikes. And I said renewable markers, they're very interesting for browser developers because they let us know what's going on. I think I will try to zoom into it to make it more readable. So the marker here, and it's correct, correct blink callback timer. So this is actually the timer that used to blink the cursor that I had in the search field where I was typing ULB browsers. So we can look at how much power it used. And this is 0.17 mWh in the content process, plus 0.8 here. So a little bit less than 1 mWh was used when making that cursor appear. So what I was showing in the previous slide, we can actually do it, and it really works, really, really. And I think that will be it for the live demo. I will switch back to the slides. So I talked about many things already, so let's recap a little bit. We have power profiling that works on all the free major desktop platforms, Windows, Linux, and Mac. It's reliable, it's easy to use, you don't need to be rude to use it, you don't need to install anything, you just have everything in Firefox. So what about the free platform? Firefox is not shipping only on those three platforms. There's something called Android where we ship Firefox, and lots of users there too. So what about it? So far, we've not found good APIs that we could use for power profiling, but we had another idea, and this is what I will explain now when talking about external power profiling. If I'm taking a step back from what I was showing before, my first step was to look at how much power was used on the power sockets. And this gives us a little bit of a sense of how much power is used on the power sockets. And this gives us the full picture of how much power is used by the entire computer. But there's one problem. The maximum sampling rate is 50 hertz, which is in Europe the rate at which the current is oscillating for AC power, and they can't get any much faster data than that. We also got data at the extreme of a hand getting data from the CPU. Very precise data, but missing some of the computer. And it's even worse on the phone because we miss the entire screen or things like this. So the question was, is there anything in the middle we could look into? And yes, there is. If we are talking about mobile phones or also laptops, there's the charger. Maybe we could instrument the charger instead. And yes, we can. It turns out there are devices that are already on the market that are sold, and their purpose is to test chargers to verify how good the charger is. Check if the current and voltage of the charger is stable. And to be able to do that, they need to sample very quickly. That's very interesting for us. And those devices are affordable, at least if you compare them to the smartphone you used to test your web application on. Some can export data to a computer for USB or for Bluetooth. And one thing that's really important to note to understand how this works is when you charge your battery, when you unplug your charger, you want the battery to be completely full. So anything that was done by the smartphone while your battery was on the phone was done by the smartphone while your battery was already full, it's still using power from the charger because if it was taking from the battery, the battery would not be full when you unplug the charger. That means that if we wait enough for the battery to be completely full, and we still measure how much power goes through the charger, we actually measure how much power is used by the phone. And that's exactly what we want if we want to power profile. Another interesting detail is some of those power meters, they support more than 200 watts, which is more than enough to power profile any laptop that charges for USB power delivery. So here it's a MacBook that I was charging. So looking at power data from those kind of power meters is what we call external power profiling, and we shipped it in Firefox 121. So how did we make this work? Those devices that are charger testers, they are few that are available. There's only Windows software that comes with it. The software is in English, which means that if we hear on those Chinese characters, not sure what they mean. And there's poorly documented API that was nice when I wrote this. What actually that meant is that I found one device some day that has what they called an open API. Open API means there's one page of example C++ source code with Chinese commands. That was enough to get started. And then thankfully there are great and powerful reverse engineering tools. And I tested with this USB light that you can see on the slide here, but as various levels of brightness. And this was a stable load that could let me know which data to expect. And all the power meters on this slide are compatible with the Firefox profiler. They all produce nice power tracks. Well, some not so nice. Some produce very nice power tracks. And it's plug and play. If you run the script that's in this GitHub repository, you can see the device and start power profiling. You will see this power track appear out of magic. And it's nice that it just works because all the windows that came with those it's terrible, you don't want to try to use it. The readme file in this GitHub repository includes a list of supported devices. So that's basically the names of the device you saw in the previous pictures. Next to the name of each device, there's a link to an example profile of what you can get from it. And you can see that using... So that was... with this USB test light that I was using. Various levels of brightness. And you can see in the good example profile that there's what looks like a lot of noise here. It's actually not noise. If you zoom into the profile, you will see that it's a very regular pattern. And it exposes internal details of how the light is using power. It's sampling every milliseconds. And if you look at the bad example here, there's no noise, but it's just because it's sampling every 10 milliseconds, so it's taking an average. And the worst part about it is what's happening here. We are turning off the light. It should use zero power. But here we see a linear decline for 500 milliseconds. If I want to profile anything I'm doing with my software, a latency of 500 milliseconds is completely useless, so this device can almost go in the trash. In all the future examples I will be sharing, whenever you see something labeled USB power, it means power data coming from those kind of external power meters. Here's the first example of power profiling using this system. It's what we call an Android remote profile. Remote profiling means that the profiler was not running on the same device than what was used to start the profiler. So in this case, the profiler was running on an Android phone running Firefox. And I was controlling the profiler from my laptop that was also controlling the power meter. And when capturing the profile, both sources of data were merged, and we got this profile. We can again validate that it makes sense. You see the shape of the CPU user, the Android device. You see the shape of the power track. They match pretty well, which shows that we're actually measuring the right thing. And the baseline is relatively high here. It's probably because the screen was on at the time. And it's again a profile of loading the Wikipedia on page. We can see on the screenshots here. So as I said before, I have a link to all my profiles at the bottom of the slides. You can look at them now if you have a laptop in front of you. You can also look at them later if you look at the slides and want to see it again. I have more examples coming next. So I'm giving the links to the slides again. I will mostly be telling you two stories on how we use power profiling to understand what was happening. So the first story is I had one of my colleagues tell me, hey Florian, have you seen this new green leaf icon you see in the Windows Task Manager next to Edge? What is it? Can we have it too? So we're wondering what it is. It's screen washing. It's Microsoft doing something fantastic about the environment that we should know about. It turns out there's a Windows 11 API to let Windows know that a process is doing nothing that's immediately visible to users and that instead of optimizing for finishing as quickly as possible, the kernel should optimize and scheduling for resource use. We could use this API for Firefox 2. The power profile you see on screen here is the result when using a test case. And the first half of the profile, the test case was in a foreground tab and the test case is a stupid piece of JavaScript using as much CPU time as it can with an infinite loop. The second half of the profile, the tab was in the background and we can see the dramatic difference in power use. So yes, it actually does something. It's pretty significant. So putting background content processes in the eco quality of service on Windows 11 is something we shipped to Firefox 108, so that's quite a while ago. We have a first browser to do it if we exclude edge that did it when the API was introduced in Windows, of course. Chrome has followed a couple months later, so now I think everybody on the web benefits more or less, and this is great because it actually saves a lot of power. And I will explore a bit more how this works in the next few slides. So we try to do the same thing on Mac. So this is a profile on a Mac with an Intel CPU and we see the same nice decline in power use. And you will see here that I have a power reported by the CPU itself but also power from the USB power meter. So I checked also the power use by the entire laptop. So they all decline at the same point when we switch to the background. And you can see the numbers, they are pretty dramatic. The cores drop from 18 to 1.6 watts. And the entire Macbook from 30 to 10 watts. The numbers are even better on Apple Silicon, but this example is an Intel CPU to be able to compare with what I was showing for Windows. And next, I wonder so using less power when doing something stupid like an infinite loop is great, but that's usually not what you want to do with your code in your web pages. So what if the code was, the test case was doing an actual computation? So this is computing Fibonacci something. And you can see that when it's in the background, it uses dramatically less power to do the same thing. But also it takes a lot longer. So I have the numbers in the table here. It took more than three times as long. It used less CPU, the CPU used less energy, but the entire computer used more during that time. So if you can control the entire component system, typically in server environments where once you are done with the task you are doing, you can shut down the server, try to finish as quickly as possible. If you are like us in the situation of a web browser where there are things in the background that have no user impact, but you don't control what happens to the computer because it's the user's computer, then reducing the resources for everything that's in the background makes a lot of sense. And the way that this works on the CPUs that are not the CPUs, all the cores are the same, is probably by reducing the CPU frequency. And there's one slide where I'm trying to check this, because the profiler can also record CPU frequency. This profile was on Android. And you also see that whenever I have a spike in the CPU frequency, I have a small spike in power use. And when the CPU frequency remain high for a while, the power use was also high. So that kind of confirms the hypothesis here. Second story. This is a real life story. I was trying to fill in a survey that had many checkboxes, and I moved that web page to my external 4K display and put it in a maximized window so that I could see all the things that I was being asked to fill in. And I got distracted. Maybe by a baby or something. Came back a few minutes later, my laptop was super hot, the fan was extremely noisy, and I was wondering what's going on. Of course, I profiled. You could have guessed that. In the profile, I noticed something like this. So this is an artificial test case. I'm not seeing what the bad web page was. I could see that the color of the background was slowly changing with the gradients. With my eyes only, I could never see that. Like it was changing over the time of a few minutes. So completely useless animation. I tried to replicate this with animation that moves slightly faster than what we see it well. You can see very high CPU core use, high memory power use, high GPU power use, high everything power use. It's terrible. So if you think about it, I said an external 4K display, that means 8 million pixels. I'm in a modern MacBook that has a refresh rate of 120 Hz. That means we forced the laptop to compute the colors of a billion pixels per second. So no surprise that it was hot. Then I tried to explore my hypothesis because I'm saying it's because there are many pixels many times per second. Maybe we could check that it's correct. So on the next slide is the same test case, but I tried to reduce to a minimum the size of the window. We can see on the shape of the chart the impact that it has on power use. So on GPU power use, the impact was very dramatic. When the window was tiny, there was almost no power use left. You can see there are big spikes here. While I was resizing, this is because whenever we change the size of the window even by one pixel, we need to recompute the layout of the browser UI. So high CPU use while resizing, but very low otherwise once the window was small. I have all the numbers on the side. I won't read them outside, but you can look at the slides later. Another thing I did to test is there's a hidden preference in Firefox to limit the refresh rate. And I tried various refresh rates. And we can also see that the power use declines dramatically when reducing how many frames with display per second. So this validates the hypothesis we had, but it's just too many pixels too many times. So one thing to take away if you're a web developer and you're thinking about animating the background of a web page, think about it more than twice. It's absolutely terrible. You should not do it. And then I was thinking, okay, many pixels on screen. There's one case where we typically do it is if we are watching a video. And then it makes sense, right? So this is a profile of watching a YouTube video. First in a frame and then full screen. So you can see in a frame here the amount of GPU power use. And then full screen here. And there are spikes while we were entering and leaving full screen because there's a big animation and things we need to do with the UI. We can also see that the CPU power use was relatively low. I think this validates that graphic accelerations and hardware decoding were working well. So this is all good news. One last example about things to avoid as a web developer. Timers. Waking up a CPU is expensive, especially if you wake it up to do nothing. And using the web API, set time out, you can wake up the thread up to every 4 milliseconds. And this is what we see in this profile. This is a test case that just wakes up the CPU for the sake of doing nothing and sleeping again and waking it up again. And you can see a spike in power use whenever the CPU wakes up. And then the tab is put in the background. And when the tab is in the background, Firefox limits timers to 1s per second. And you can see this one tiny spike here in power use at the very end. So... This shows that throttling timers is a good idea. This is just about the CPU wakeups. If you are doing something using actual CPU time in those wakeups, this would dominate the power profile, of course. And if I have a few more minutes, I have a few more things that are worth sharing. One is Firefox as a built-in process manager. And if you see this icon here, whenever you have a name of a content process, which is typically what you will care about if you do a website, a profile icon will appear. If you click it, 5 seconds later, you will see a profile of the entire process. And if you are on an Apple Silicon machine where we have per-processed power data, you will see a power fraction showing how much power your website used at any given time. You might have seen in my slides and in my demo that whenever I was showing energy values next to it, there was a CO2 equivalent value, which is the equivalent of carbon emissions we would do... Those were created using the CO2.js library from the GreenWeb Foundation. This was a very welcome contribution we had. So thanks for that. I shared with you a very quick look at the bandwidth track while we were looking at the demo on the open-suit map. So the bandwidth track that you have that lets us know how much data has been transferred in regards to CO2 equivalents. This is a big question that I got after doing a preview stork in a different place a couple months ago. I was very much a participant that on the A floor our profiling is fantastic. We wished we had a tool like this for a very long time and it's great for optimizing for performance and sustainability. But you know what everybody else is looking at is how much data has been transferred because that's what everybody else was measuring until you had power profiling. And in the Firefox profiler, you already have all the information about networking because there's all the network requests that are shown. Could you just show how much data has been transferred and put a CO2 equivalent somewhere? What about it? Something like a great idea, so we did. We did it. This is shipping in Firefox 123 which is currently in beta shipping in a couple weeks. So we did use that. Maybe the last thing is if you're out of luck and you can't power profile. And there could be a few reasons. Maybe you're on a virtual machine so you don't have direct access to the CPU hardware. Maybe you are using a snap package and then there's nothing you can do. You're not good at Linux and you're not good, so you can't do the magic command to let the kernel know that it's fine to let us know about how much power is used. There's a hidden feature in the profiler because it's not fully polished yet. If you open the DevTools console and type experimental.enableProcessCPUTracks, you will see new track appearing that say ProcessCPU. And you can see in this example that the shape of the power track and the shape of the CPU's track match extremely well. The one case where they won't match is if you do massive animations like the full screen animation I was showing before. But I said you should not do that anyway. So if you are not doing anything completely stupid that's feasible, that's something you could look into as an alternative. Another conclusion, power profiling is possible. It's easy, it's fun, I encourage you to do it. Play with it, it's really simple. This is why I did a live demo so that you see how simple it is to use it. But if you are really thinking about web sustainability, where you will have the biggest impact, even though it's less visible, is on-show web compatibility with all browsers and especially with all devices that still have supported browsers, so things Firefox ESR. And think about group web performance because even if something is still compatible, if it's super slow, people will want to replace their hardware and that's where we really lose in terms of sustainability. Thank you very much for the talk. We've got still roughly 5 minutes for Q&A more or less. If we've got questions, you can answer. Do we have questions? Please raise your hands so we can come to you with a microphone. So, have you ever thought about the question of whether you would like to talk about the Q&A? Yes, I have. So, have you ever thought like when someone loads a website in localhost and Firefox can detect that it might use too much power to actually show like a pop-up hey, your local host app is using too much power, check out Firefox Profiler and Power Meter. Would that be feasible to basically push devs to fix their own apps? Would that be a good idea? I have not understood. There's so much echo that I couldn't understand what you are saying. Yeah, there was too much echo in there. Yeah, so basically the question is have you ever thought to push the Firefox Profiler towards developers when they're running apps on localhost and they're using too much CPU to have like a message hey, check out Firefox Profiler it will show you your app is slow and why it's slow. Okay, so you're suggesting that we could detect the case of web developers because they are running something on localhost and they are definitely developers and then we should show warning messages. Yeah, and push like promote the profiler to devs directly this way. That's an interesting idea, not just for excessive power use but also when something is dramatically slow we could let them know hey, you know we have good performance tools, you should have a look at them. Thanks for the idea. More questions? You have another question? So thus, like optimizing for power usage differ a lot from optimizing for like CPU usage because I guess I mean the less CPU you use the less power you use and the less network you use the less power you use but are there like other things to consider when you're trying to optimize for power usage than just useless resources? So if I understood the question is there other things to optimize for to use less power over than CPU use? Is that the question? Yeah, of course. So the power use is typically first CPU that really dominates it's both CPU time and waking up the CPU then there's graphics power use which is what I was trying to show with the examples network power use but you don't consider it as much if you're setting data from the network over time that will use power because it will wake up the Wi-Fi chip for example but in terms of scale CPU dominates so much that's really where most people should focus their attention at least when thinking about web pages. So more questions? Another thing I should have mentioned is I have Firefox provider stickers on the table here and Firefox stickers so you might want to take them. Shiny, shiny, shiny. Thank you very much. Okay, thank you very much. Maybe another applause. Thank you.
Proving that Cloud Sysadmins Cannot Read your Data!
So now we are on time, so let me start. So hi, I'm Christophe D'Aditon. I'm a senior principal software engineer working for Red Hat on confidential computing. My GitHub is C3D, so you have more about me on C3D.github.io, or you can scan the QR code here. And today's talk is about proving that cloud CCD means cannot read your data, and it's about confidential computing. The very unfortunate thing is that there is a confidential computing track right now in the ACE building, and so the folks who can say that what I say is bogus are all in the other room. So you have to trust me. It's too bad. It doesn't start well. Here's the agenda for today. The key topics we are going to cover is we are going to talk about confidential computing in general, give a quick overview, and see the various use cases for it. We are going to see how to build actual trust by starting from a root of trust and discuss what is attestation and what it proves. We're going to see why it matters to do measurements to build confidence, how to securely hand over secrets to your work, but more importantly, I'm going to try to convince that it's not really good to have a safe like this if you leave the door open. It's not as trivial as it sounds. There are more details in a series of blogs that is in this QR code, so you can scan that if you want to see more about this topic with links and so on. So what is confidential computing? Who has heard the term and knows something about it in this audience? Okay, so roughly 10%. So confidential computing is about protecting data in use. Confidentiality is the essence of being trusted. And the problem statement is quite simply, why should the infrastructure see your data at all? The software today typically runs on hardware that you do not own. It's not yours, like, for instance, a cloud provider. So that hardware owns the resources like the CPU, the memory, the disks, the networking cards and so on. And on top of it, you can run things like containers, for instance, and the carve out resources from this host. Now the tricky thing is that the classical sandboxing technologies that we all rely on for containers are preventing container escapes. They are designed to protect the Linux kernel from being overwritten by your workload in the container. They do nothing to protect the other way. And that means that a sysadmin on a machine can simply dump memory, can look at the other filesystem or the containers, and all that stuff. So that's not really good. And that's one of the reasons why there are so many difficulties to bring some kind of workload, like when you have multiple tenants or very sensitive data, to the cloud. It's difficult, for instance, to bring medical applications to the cloud. So we have solved that problem to some extent with data at rest on disks, with disk encryption, and data in transit, like networking. We know how to do that. So for this kind of data, the host essentially has no clue what's going on if you encrypt the data on the disk. A host that means cannot actually access the data because it doesn't have a key. In a non-confinancial computing architecture, on the other hand, that's not true for anything that is in the guest memory. So if you have your program that runs, it's fairly easy for the host to spy on what you're doing there. So let's do a quick demo about that to see how we can access secrets from the host simply by dumping guest memory. So what I'm going to do here is I'm... Uh-oh. Ah, okay. So... Give me one second. Okay. Okay, it fits now. So, uh, what you... So, by the way, you saw how I designed my slides. Uh, so what we are doing here is we are creating a VM from the fellow authority in age with four CPUs and four gigs of memory, and then we're booting that and setting it up with Cloud Units. And I log in as root, and then I type my password, and Cloud Units has put the root password for me, but also has set up authorized keys from my public keys. So what this means is I can SSH into the guest, and then I can simply... And I can SSH as root as well. So that's what KCLI does for me. Now, what I'm checking here with this dMSH is that I'm not running with ACV, so no memory encryption here. And what I'm going to do now is to watch a really good program, a C program, you know. That's typically commercial code. It looks like this, right? And there is some secret stuff in it, and I'm going to compile it, and you know the usual motto. If it compiles, you can deploy it. Okay, there are some warnings. We don't care. We just copy it to the guest, and then we run that on our guest. So what I expect from this program is to show a message, really secret stuff. It doesn't do exactly what I expected. There might be a slight bug in my code, but that's fine. I have the secret message. So now I go to the host, and because there is a really nice host, this admin, he knows how to use the QMU monitor test VM. He's another... It's not in the same class as the guy who wrote the C code, by the way. And so now he's dumping the guest memory with various arguments, and what this says essentially is I'm going to dump to a file that will contain all of my guest memory at once. So I dump my guest memory like this, and I'm going to speed up a little because... Where is the plus key here? So... oops. So what I see here is I open the file with Emacs, and I can find my secret stuff inside. That's the message that was shown on the console, and I also see what was in the source code, and you see they don't match. So that's the bug that I had in my source code. I'll have to investigate that later. But the point is all that stuff is clearly visible to my host admin, and so is my password. The strong password that I put on the command line initially is also quite visible in the dump. So that's not acceptable, right? So we need to do something about it. So the proposed solution by Intel, EMG, and everyone, I'm going to talk about them later, is to encrypt the memory. Hello, hello. Ah, okay. I must have pushed on the mute button. So the encrypted memory is stored... So it stores ciphertext, and it's completely transparent to the guest, and that means... Now, the encryption is not very strong. If you look carefully in the green box, you might be able to decipher it. And it's the same thing. The encryption that is used for these technologies is not the strongest we have, with the latest data in the world. And there is another aspect that is important, is that you need to make sure that the host cannot corrupt or poison the data, so you can, for instance, make sure that the host cannot change the value of the registers, and that you cannot inject introps that would cause the guest to do malicious things. Another aspect that I'm going to cover a bit more in details later is something called attestation. And the idea of attestation is proving that you know what you're running and where you're running it. So what are the technologies used for that? It's really a long evolution because it's a rather complicated problem. And we are now in that state that is best described by this quote from Andrew Tannenbaum. The good thing about standard is that there are so many to choose from. We're going to see that this is really true here. The vendor landscape is made of really different approaches. So you have AMD, for instance, that uses secure encrypted visualization. And I'm going to talk more about it later. But it was not really good. So there were further generations after that. And SCVES adds state encryption. So the state encryption I was telling you about was not in the first generation. That came as an afterthought. And SNP, secure nested pages, adds integrity production. And so you can see here the chart comparing the various generations of SCV from AMD. Intel has something called trusted domain extensions. It takes a different approach. And I'm going to explain why in a minute. And then IBM has something called secure execution, which is like all IBM technologies on virtualization is based mostly on firmware and really a combination of firmware and hardware support. PAR has something called protected execution facility. ARM has something called confidential computing architecture. Trusted zone, you may have heard of these things. Now, all these technologies share one thing in common is that nowadays the modern ones all rely on virtualization. But they all work differently. And so that means that when you actually go into the details, you run into a variety of problems. So let's start with SCV. SCV was really the initial implementation. So it was flat and this actually gave a relatively bad rep to the technology as a whole. It's based on an external processor, which is currently as far as we know an ARM core. And that does the work of encrypting the encrypting memory and this kind of things and doing the computations to prove that the memory is encrypted. So the hardware encryption itself is done by the memory controller through hardware. And this is built on top of virtualization. They also have a process-based approach called SME. So as I said, it's realized on the separate processors and the initial implementation only allows something called pre-attestations, which I'm going to demonstrate in a moment. So as I said, there were various vulnerabilities. Some of them in the firmware upgrade path that gave it a relatively bad reputation. So there was a cleanup crew that came after that to try to fix that with encrypted state and security pages. ES protects the CPU state, but doesn't change the attestation model. S&P protects against malicious base mapping. And you can now get your attestation. And again, I'm going to explain that in a moment exactly how this works. You can do that from within the gas which gives you way more flexibility. They also had a concept called VMPL, which is VM privilege levels, which lets us do very interesting things where you have some pieces of software that neither the guest nor the host can touch. So that's very interesting. That enables, for instance, product services like virtual TPMs where you know the source code, but you can't know the secrets either from the host or the guest. Intel TDX takes a very different approach. They started with something called SGX, so that's the Intel equivalent of SME, so secure software-guided extensions and that's to create secure enclaves and that encrypts at the process level. TDX, like SCV, is based on virtualization, but they don't choose a separate processor. Instead, it's a new separate CPU mode called secure arbitration mode, or CIM. And that means you have various binaries that use CIM call to cross over and do the computations in a secure way. The attestation is performed typically by a secure coating enclave that is another process on the side which neither the host nor your guest can access. So we are now in this brave, new secure world where we are entirely protected and nobody can harm us. So what happened there? So that's the point when you edit live. So that leads us to another interesting quote which is that history tends to repeat itself, but each time we make the same mistake, the price goes up. So what we have with memory encryption is really not a complete solution. And let's think like Sherlock Holmes and try to decide what do we really want to prove. Well, we're using a very simple solution that we really want to prove. Well, we're using the cloud. As everyone knows, that really means it's a computer you don't own. So how do you, on a computer you don't own, check that memory is transparently encrypted? That's weird, right? How do you get that? It's like from inside the matrix you want to know that you're inside the matrix. How do you prove that? Another problem is what is the software in that box? How do you prove that it's the software, that the software is any good? So it turns out that when I looked for a picture saying is the stuff in the box any good, I got that, so I found this funny. Are there some well hidden insecurities inside your setup that you did not see? So attestation is the process we put in place to prove such properties, provided you trust some specific part of the system. So we are going to see that by running a confidential VM. I'm going to use the first generation to outline all the steps one at a time, and I'm going to run it in the worst possible way as we will see. So I'm going to start a VM, and I start it in post-state, and that allows me to do the measurements on my initial memory to check that I have the right content. I do that with version download second for that gives me this SCV measurement, and then I pass that to a binary, in that case I will use virt.punu.scvvalidate. That is going to take all this design, put the version numbers and all that stuff, let's get that quickly, and that essentially gives me a way to check. So we have also this tic and tech, I'm going to explain how you get this tic and tech files in a moment, but what matters here is that you run this complicated command line here, and it tells you, hey, that's good, totally trustworthy, right? You're really happy with that, so after you have seen this message, you can resume, check the console, and see how your VM boots. And the boot of the VM is essentially the same as before, except that when you log in now, we are going to, let me skip a bit to save time, we check that we have SCV. So now if I do the same experiment as before, and I run my binary in my system, so let me again move forward a bit quickly there, because it's really the same thing. So I skip forward a bit, and now I do this same command as before, and if I grab the secret, I don't get it. Huh! That's not what I expected. Why do I see secret in there? Huh, that's such. So my trustworthy max, let me look inside and see what happens. Ah, so it's not the same secret. I just, my grip was a bit too simple. What I'm saying is simply some pieces of binary that happen to have the word secret in it. Okay, so finally something happened, and apparently some projection was in place, but it's still weird that we have all this stuff, and why is my root password still there? Wait, that doesn't work. What did I do wrong here? So any idea what's wrong here? That's where the folks who are in the other room would tell me, hey, where did you do that? So when I saw that the first time I actually double checked, is a CD actually active? So when you have a moment like this, it's like, huh? By the way, I really love that picture. I don't know how you got the car to do that. But it's really, you know, Houston, we have a problem. For me, it was time to tell the boss. You know that stuff? Well, okay, maybe we are going to talk a bit less time today because I see the data that I actually don't see. So my boss replied, isn't like the whole punchline? Yeah, there is something wrong here. So if I look at my binary, I see that the message is actually here, really secret stuff, and by the way, the bug is because it's written by a Python programmer, so it puts a plus to concatenate strings. In case you did not know, that was the problem. So if we look for secret stuff, we don't see it anymore. But we still see this strong fast world being inserted in the system. The reason for that, and by the way, the reason I was puzzled is that I had done the demo like 15 times before and never saw the problem. And then one day I was in a rush, I decided to accelerate things and use something called KCLI, which is based on cloud limits, and so there is a step in the process that is not encrypted. Just changing the tools led me to actually do something wrong without realizing it. So that's what I was talking about when I said, don't leave the safe open. Make sure your disks are encrypted in every step of the process. Otherwise, you're dead. What did we prove here? First of all, that confidential computing is only as strong as the weakest link in your chain. But now we are back to the question from before. How can we own a system that we don't own? There is a paradox here. So in order to explain that to you, I need to introduce you to a bit of terminology, and I will ask you to try really hard not to remember it. So let's start with ARK, that's AMG root key. Then you have ASK, that's the AMG ACV key. Then you have CEC, that's the tip endorsement key. Then you have OCA, that's the owner's certificate authority. Then you have PEC, that's the platform endorsement key. Then the PDH is the platform Diffie-Hellman key. The tick that we saw in the tick file earlier is the transport integrated key. The tech is the transport encryption key. And all that green stuff is TLA's. I am now supposed to know, but still don't care about at all. And in case you wonder, SOF stands for show of factor, which I think is really high at AMG when they invented all these acronyms. So, resistance is futile, you have to assimilate these things, or they will assimilate you. The good news is that we got so tired of this that we have a whole page on the Continental Container's project. We have a weekly page just with the acronyms we need on a daily basis. I added one yesterday actually preparing this talk. So, in order to take over the host, we are going to add our own OCA, PEC and PDH, and I hope you know what that means, to endorse the host as our own. And it's a technique that is color-coded known as I Licked It, Therefore It's Mine. So, how do we do that? We use a tool called SCV-CTL, CEPCAROL, and I'm going to reset the platform and do a verify, and the verify checks this chain, and when it's green, it means essentially that all the things sign each other correctly. And if I do that twice in a row, you see that I get the same results. So, I get the same platform Diffie-Hellman, and the same owner, so the third authority. Now, if I do a reset, then I'm going to get different results for the last three. The three at the bottom are from AMG, the three at the top belong to me. So, that's how I take ownership of this machine by essentially installing an OCA. This one is self-signed, it shows with a little circle here, but you can import it from outside if you want, if you want to be more secure. And it's a good idea to do so. Okay, what about the next step, which is now I want to own the guests. I sort of said I trust this host to that extent, to that cryptographic extent, that I put some keys in it, but now I want to really own the guests. And that's a bit more complicated, and again, I'm going to do it the wrong way intentionally, just to show all the magic that goes behind the scene when you look at this stuff. So, you have this launch security, I'm doing it with Libert here, and the launch security section, uh-oh. Yeah, so, to fill up the whole security section, you need to do the self-coded export. You export the PDA as a platform, if you're a Hellman. Then you can verify it, and it's as if you have verified the host itself. Now you need to create a session for this specific VM, and I'm going to name it TestVM. I'm going to put in it some policy flags that you can see on the screen, and you don't really need to care, but you can control, for instance, if debugging is enabled in the VM, and these kind of things. And that generates four files, the TestVM-Galage, Tech, Take, and Session. And that's what I'm going to use to describe my VM later. So, fast forward a little, and I edit this, I change the numbers in it, and I insert from the files that I just generated. And then that means that I'm going to have a virtual machine that can identify itself precisely with numbers that I generated. So normally you don't do that on the same machine, you would do that separately. Once I have done that, I can start my VM again in pause mode, and do the same verification that I did before, but you're going to let me skip forward a bit because it's the same thing. The important part is that the measurement changed. The measurement does include all the keys that you have put in the system, so that's how you know that it's a measurement of stuff you own. And the QMU-SVV Validate does check this measurement against the whole chain and make sure that this is somewhat solid. Or did we actually prove that? What did we really prove? What do we really measure? We're trusting a computer output message, so for all we know, the source code looks like this, right? So we need to... That's one scenario where having free software really matters. You really want to know that the binary that you're using to do that, you compile it yourself and it's actually doing what you expect. That's a problem actually because in the class today, some of the key components that are part of the root of trust are not open source at the moment. By the way, so we have this collective called the Continental Computing Consortium and that tries to bring together all these big companies to do the magic of agreeing on standards and so on. And CCC was the worst acronym they could pick because there are like 37 CCCs on Wikipedia, including the Kars Computer Club. So what we did is we injected our own OCA in the system, so we essentially marked the system that way and remember the O stands for owners. We are now the owner of something. We have a certificate of ownership of some kind. But what do we really own? In reality, it's more like something like this differential machine from Babaj. In the sense that we only have a tool that lets the other side prove their identity by computations. We expect the computation to give a result we can check. So it's really similar to what we do when we do two-factor authentication. One thing we prove is that the VM is encrypted using AMD signature. It's slightly stronger than Word 26. And we have also proven that the content of the VM is what we expect with the initial crypto hash. So we essentially are measuring from the start of how the VM is being built. So how do we make containers confidential in that space? How can we prove that we run the right container in age and that it's running in the right trusted environment? So this is a diagram of something called KALAC containers that runs containers in little virtual machines. And in order for this to be adapted for new... Did I lose the sound again? Hello? Yeah. In order to adapt that to the new environment, we need to change a few components that are marked in red here. Those need to be aware that we want to do some confidential computing with it. We also need to encrypt our disks. So that's the part on the top right here. And we need to have something on the side that will do the verification, that we call the relying party. And a relying party from a high level point of view consists in two parts. A key broker that delivers secrets and an attestation service that will do a crypto exchange to validate your attestation results. So on this diagram now we have three categories of colors. We have the trusted platform, which is in red. Trusted here means simply that it's ready to do some crypto computations on your behalf. It doesn't mean that it holds trusted data. All the data that it knows is encrypted. The host manages and offers resources used to run the containers like before. So that includes CPUs, memory, I.O. That doesn't change. But that's all it does. To it now, it is because a bag of bytes and a memory page is exactly the same thing, a bag of encrypted bytes. It doesn't know how to read the content. And the tenant is the new part, is the new aspect of this whole scheme. It's the part in green. It's confidential in the sense that everything in it is normally not decipherable by the host. And even when it's running on the host. And some part of it, as you see the relying party, might be in the cloud, might be on premise, might be elsewhere. So this is a new security model for Linux. Well, the host's admin is now considered hostile. And that pretty changes the threat model. So I saw here a page that was relatively recent. Is it me or is it Fudgy on the screen, right? Anyway, so I read for you, it's dated September 9. And what is interesting is that it's cosigned by folks from AMD and Intel. So they finally agreed on how to describe the memory model and the threat model in a way that everyone would agree on. One of the things that you need in order for this scheme to work is to say in platform we trust, the platform you run on has to do the work correctly. If, for instance, the AMD root key is leaked somewhere, the whole scheme falls apart. But the big change is that we no longer trust the host user named Root. And that's a good thing. On the host, Root has other parts and we just want to get that guy, that bad person, out of the equation. So that means the trusted platform needs to provide new services to qualify what is running and to make sure that it's actually running what you want. That in that sense that it's a trusted platform. So it's really only a tool to be trust that attacks can come from the host and hypervisor. And I was discussing in the hallways just minutes ago about how we can try to change that to make sure that the attacks won't come that way. But for the moment at least, that means that from the guest point of view, the host platform, the host hypervisor, might be trying to attack you. And that's really bad. So that's a question that Greg Cage is asking here. I'm sorry if this is fuzzy on the screen. So what do you actually trust here? Do you trust your CPU to do the computations? Do you trust external devices? Do virtual devices? Can you trust them? And so on. This leads to rather serious resistance on their part. Greg Cage there says, good luck with that. That was like two months ago. Well, when you reach a point where a key maintainer tells you, ah, good luck with this project. That's not necessarily a good sign. So we are thinking about other ways to do things that don't take as much effort on the kernel part. But at least there is one thing that we can do correctly. And that's the measurements of the initial state. That part we can somewhat trust. There is this pre attestation where you measure, the hypervisor essentially measures the initial state of the VM in post state. So that's the VM in post state is why it's right out. There is post attestation where you start the VM, but you can still assess the initial state that it booted from. And so you can send that, the guest can then query the platform security processor or the trustee enclave to say, please give me the measurement that you did when I started. And that measurement will be delivered almost directly by this additional security component. And we can go further. We can decide, for instance, to attest the containers themselves. It doesn't make much sense in practice because the containers, there is also independent effort to make container images encrypted to preserve the integrity and so on. So we don't really need to attest more than that. Attesting the bottom part is sufficient for our use case and that's how containers work for now. Another quick bit of terminology for the next steps is to understand that in the attestation you have values players and you have a verifier that does all the job but gets its input from things that belong to you in green. So like the reference value provider and the verifier owner. And for instance, you have a separation between the policies to appraise the evidence and the endorsement which is I take over this particular hardware. The attestor then can submit some evidence. So the attestor is in red. It's the trusted platform that does it. So it does the crypto measurements and you know that it's the platform doing the crypto measurements. And then the verifier can transmit that to the relying party. And the relying party can appraise the attestation results. So how does this work in practice? So you do a cryptographic measurement of the values bits you care about. From that, you get essentially something that is a proof of identity. It's like an ID card. And if everything goes well, you get some secrets in return. Because it's a challenge response process, you say that's why you need the attestation service. You have the attestation service. Something I'm going to show that in a further slide. But you know that it's fresh. It's happening now. And because it's dynamic and it's done on the side, this means you can revoke something that you accepted before. You can say, I discovered a zero-day exploit in this particular stack. I no longer want to run it. So I just revoke access to this and it can't boot anymore. So attestation is a proof of the configuration of a system, including the fact that it's running with encryption on. And including the fact that it's running a stack that I trust. It proves properties. Remote attestation decouples the evidence from the verification, just like you decouple a lock from the keys. So that's very important. There are two models, a passport model where you present the evidence, like a passport that says, these government guarantees that you are indeed Christophe de Dinsen. Or you can have a background check model, which is similar to putting your finger in a biometric device. It's not proving that I'm Christophe de Dinsen. It's proving that I have the right finger. So how do we... Does that actually prove something to you as a user? It's a proof by blocking forward progress. What happens is that... So first of all, because it's a one-time challenge, as I said, it proves the freshness of what you did. It proves that it's happening right now. The response contains a cryptographic proof. And it's basically as strong as the cryptography that uses it, of that form identity, memory encryption, and so on. It also proves the endorsement, because part of the encryption that is made includes your own certificates and keys. And it measures the initial guess of the stack, so you know you have a hash of what's running inside. If the proof fails, the secrets do not get delivered. So what happens is that the guess cannot decrypt its own disk volumes, and it cannot decrypt its container images. So basically, it's stuck there, and it's random harmless. So in order to build trust, you go step-by-step like this. You need to know exactly what you prove, what are the guarantees that you offer that way. And what confidential computing really cares about is confidentiality, right? So that means that what we really care about is we don't want to leak any data that is considered confidential. We don't want it to be leaked, we don't want it to be tampered with. We do not protect against crashes. Actually, a crash is a good outcome if you detect a correction, for instance. The best thing you can do is crush the guests. We do not protect disks or network data. That has to be done on the side as I showed earlier. It does not offer any guarantee of service, and because it's hardware-based, real-time cryptography, you can still properly mount some attacks if you know exactly what's running inside. It's also highly implementation dependent. But online is there is no automatic security. So to build things, we start with hardware. You may remember that from the TPM days. Once upon a time, we invented the hardware TPM, and there was a very good talk by James Bottomley yesterday about how to use that on your laptop. So what happens there is you have this stack where each step merges and launches something and stores, records the results in your log in the TPM. And that's the log or hash of the log that is going to tell you, I'm at this point in the boot and it's valid. And you can do things like, for instance, having keys that can only be unlocked in BIOS Phase 2. And once the TPM goes beyond that, its registers have changed, and it can no longer detect the same key that was used as part of the boot. So the operating system cannot see the key that was used by the BIOS to detect the disk. So that's what we call a chain of trust. Each step depends on the steps before. So I offer you something I call the remit pipeline. It's a simplified, in some sense, simplistic model for this kind of trust chain. Where you have R that stands for root of trust, E stands for endorsement, and for measurements, I for identity, T for trust, how you build policies, how you build trust from the elements you had before, and S for secrets. And you go from root of trust to secrets following these steps. So we are going to see that with various examples. For instance, that's the remit pipeline for the secure boot system. That's for selling a property where you start with a root of trust that's the notary, and you end up handing the keys. That's the same thing with historical money. You have gold or silver or the root of trust. Then the government gives value to them that is measured in dollars, and the number of dollars is the identity of a given transaction. Handing over the cash is how you establish the trust. I received this from you, and in exchange you deliver secrets, in that case, goods or food or something like that. So you see that the system is relatively easy to follow. So attestation flow unlocks by giving secrets as a cryptographic challenge. I explained that earlier, so I'm going to go very quickly, but the point here is that it's really a crypto challenge that doesn't prove. And that's how you get your response sent to the attestor. So now we are getting close to the time that is allotted. I have a few other things that I can show, but there are demos that we can switch to questions, and I can have slides showing some additional demos for various use cases of confidential computing. Because we have virtual machines, virtual functions, orchestration, that's confidential containers and things like that. We can have the whole entry level with confidential clusters, and so I'm going to simply show the various use cases and take questions at the same time. No questions? Oh, yeah. There is a question over there. So companies are, in some cases, allowed to run their own stack on so-called raw machines being rented out, where they get to do everything. And the attack against them is the BMC could have installed lowest level software that can't be replaced and there's no way to know about it. So does this address that or not? Yeah, so the question is, let me rephrase a little, the question is about sub-platform items like a BMC that have special powers. James, bottom line, I think, or someone else mentioned yesterday that everyone in this room is probably running a copy of Minix without knowing it, as soon as you're running a relatively recent Intel Core. Because there is a copy of Minix in the management processor. So it's the same idea this has, a lot of power and can do a lot of harm. This is the... the ARM core in a CV system is exactly in this class as well. It can do a lot of harm and the various failures that were detected were precisely by uploading a bad firmware in this ARM core. So the attacks do exist. To the best of my knowledge, the attacks that exist so far mostly require some privileged access to the machine because the BMC itself normally has privileges, but you're correct. This is an ACTAC vector that may exist. What I'm showing here, by the way, is RAIL 9.3 running on Azure with encrypted. It's just to show you that it's much simpler than when I did it manually. So you just click, click, click, click, and it deploys the VM for you. But of course you have to trust Azure to manage your secrets. And when they say, for instance, that they don't keep the private key, it's a website saying I don't keep the private key. Any other questions? What is the performance penalty on running the encryption? So the question is what is the performance penalty on running with encryption? It's not where you think. You might think that running with memory encryption makes memory accesses slower. And in practice, that's not really the case because we already have levels upon levels upon levels of cache and that the actual performance is re-dominated by the cache. I think that in the worst cases you can probably detect a 10% change, but that's about it. The real problem, though, when you run in the cloud, is that you're encrypting your memory, which means that the Linux kernel you have in memory is encrypted. So it varies from one VM to the next. So you cannot share it across VMs with the traditional techniques. But if you're running containers, you don't want to run your content, to keep your container images on the host either, right? So today when you run 10 containers, booting NGINX 10 times, you got one image of NGINX that gets downloaded once. If you want to have a secure NGINX, then you are going to download an encrypted image of NGINX that is going to be stored on an encrypted disk that is per VM. And so you're going to download it 10 times, store it 10 times. And the memory, so it's 10 times more. And so that's where the real cost of this thing is. And to be frank, it's not impossible that that might be one of the big reasons for pushing this by hardware vendors because you really need Mithia hardware for the same thing. There's... In answer to performance, what is the cost of trusting something that shouldn't have been trusted? Yes. Okay, but more to the point. The only solution I see, and I do see a solution, is that you have to run encrypted secure, boot whatever, at the factory, track that boot image, the session key for that thing, through its lifetime of deployments, such that you know that that never got overwritten any other time farther down the chain. So it's a different eco structure for the industry. So I think that I understood your comment by saying that you really need to do something that is short lived at the factory when you build these chips, you put them into an encrypted mode there at the chip, you know, with all the out of station, at a low level, and track that through all deployments. Oh, I see. And then run on that so that when you're presented with it with a thing you claim is secure, you actually have to add a station from the factory all the way to where you go to use it in order to believe and trust that. Yes, so I think the point is really that we need to check the quality of the encryption being used to store memory, et cetera. Now, on that front, the good news is that some of these technologies were invented with another motivation in mind. I don't know if you remember mem restors and stuff like that. There was a time where we thought that we could have all memory, essentially the RAM being persistent like on old HP calculators, where you switch off the machine and the RAM stays there. And of course that has a real problem. It makes a number of things faster, but that has a problem that if you take out the chip, the data is in there. So you want to encrypt all accesses to memory in order to avoid that. So because of that, the encryption technologies that have been used have been carefully thought out and have been tested with this method of taking the chip out and trying to reverse attack it. So that part, you know, you can never say never, but that's probably not the weakest link in the system. I think there was a question on the other side as well. Yes. How feasible is it to encrypt a certain process and leave other processes encrypted? So the question is how feasible is it to leave one process unencrypted while other processes are encrypted? And that's the first generation technologies that I was talking about. SME and SGX work exactly like this. Now the problem with these technologies is that it's very hard to support fork, because fork in Linux is you have two processes that have the same address space at least initially. And when you fork, do you want the other process to have the same address space ID or a different one? Most of the cases where you care for sharing memory, you want to have the same address space, but when you do a fork exact, maybe you don't want. So in order to solve that problem, Intel, for instance, implemented something that they call a libOS, and that's an OS you run inside a process that simulates all this fork and all this nice, essentially simulates all the Linux system calls from within a process with simulated process inside a process and knowing when to fork an actual process on the outside. That's one of the reasons why SGX did not really take off is that you had to rethink your application a lot in order to fit that model. First of all, thank you so much for a great presentation. I actually have double question. One, if it would be possible to share the slides, maybe somewhere because I don't see them on the description. And second is if we can use this with the cloud providers, like for example, the ETS and other clouds, if it's possible. So I understood the first part. Can you repeat the second part? I did not understand the second part. So first was about the slides. And second question was if we can use confidential containers with the providers like AWS ETS and other cloud Kubernetes providers. So the first question is about sharing the slides. If you don't mind, I prefer sharing the blog because it's probably better reading, but the slides will be shared. This presentation is made with software that I developed called Tower 3D, which is my biggest failure in the open source world because it's 500,000 lines of code. I cannot compile it anymore and there is a single user and that's me. So that means I cannot really help you run this presentation yourself. What I do is I share the source code and I share snapshots of the screen and I share a video of it. So that's how you... And you'll have the first-time replay because also we call everything. On the second question, which now I forgot. What is it? So the question was if we can run this level of confidentiality, like the encryption and everything, so confidential containers on the cloud, like for example using ETS from AWS. Yes, so the second question was about running confidential containers in the cloud. At the moment, not really. So, confidential containers at the moment is, I think we're completing release 0.8, if I'm memory serves me right. But so it's still not completely deployable. And quite frankly, one of the aspects that concerns me the most from a usability point of view, I'm working on it at the moment with a team of researchers at IBM, is that in order for this to be secured, there are so many APIs that go through the host to the QBLAD and so on, that we need to rewrite an alternate control plane path for these. The current solution, if you use confidential containers today, they close down all the insecure APIs. That's the default policy. And so when you close down something like getting the logs or doing your QBLAD is exactly inside a container, you lose a lot of functionality. We are trying to restore that in a safe way, but you imagine that it's complicated because that means you have to have a completely parallel control plane that doesn't run on the same host. That's not completely true though. And so I am late. Thank you very much. If there are further questions, I don't know, maybe you can recompile his slides to find his email address, or you can talk to him somewhere in the hallway. What's your email address on the slides? I don't remember. No, I did not give my email address. I gave my GitHub account. My email address is... Well, the easiest one I think is cc3d at redhat.com. Or cddd. Yeah, so first slide, c3d.github.io is... And from there you can... My name is not like... There are not many hash collisions on it, so you can find me easily. So, ask him any questions. Thank you very much. Let's hope this gets implemented because it will improve security very much. As a token of appreciation, we have some chocolates. Thank you. Thank you very much. Thank you.
Open Source for Sustainable and Long lasting Phones
Thank you. Thank you. Thank you. You can hear me? Yes. Good afternoon. Thank you very much for joining the open source for sustainable and long lasting fun stock with Luca and Agnes. Thank you very much for being here. I will be helping you here, helping you with your talk. Should you need anything? The stage is yours. Thank you. Who knows Fairfun? Wow. Happy. So we are super happy to be here with Luca today to speak about how open source is helping us at Fairfun. We'll speak about software, but not only. In this talk, it's not a super-taky talk. We'll try to speak about all the stuff that we do at Fairfun, which kind of report, which kind of data we open-publish to the public. We open-publish to push the industry to change. But first, let's introduce ourselves. I'm Agnes. I've been working at Fairfun for the last six years now, and I'm leading the IT and software longevity team. Software longevity means that we are maintaining the fund for longer. We are a small team, but the goal of this team is to make sure that we can have long-lasting funds from a software perspective. And I'm also involved in some collectives. I founded a company first in France called Ninja Squad, focused on open source, but more on the web layers. I'm also part of a collective called Duchesse France, which promotes women in tech. I founded also a smaller conference compared to the first dem called Mixit. And Mixit is a conference in Lyon, in the center of France, focused on tech and ethics. And the last but not the least, I'm part of the RézoMutu. I launched a website in my city, Saint-Etienne, in the center of France called Le Numéro Zéro. You have also an antenna of Le RézoMutu in Brussels, stu.info. And those websites are alternative to the mass media owned by billionaires. You can publish some information about UView on the local news. And all those websites are built on open source software. So I invite you to have a look on that. Luca Froyo. Hello. So my name is Luca Weiss. I work as an end-of-platform engineer at Fairphone. And on the side, I'm a free-time also do a bunch of Linux kernel development. I maintain the project called OpenRazor, which is an open source driver for Razor peripherals. And I'm also one of the core maintainers of the Post-Marketers project, which is a Linux distribution for mobile phones. A few words about Fairphone. Even if a lot of people here know Fairphone. So Fairphone started as a awareness campaign on conflict minerals. I will come back on that just after. In 2010, and only in 2013, we launched our own company called Fairphone. And the overall angle of Fairphone is to push forward change in the electronic industry to make it fairer. It's not an easy game, but we have been doing that for almost 10 years now. And how we did start on the next slide, please. Thank you. So as I mentioned just before, so we started as an awareness campaign on conflict minerals. And at that time, Fairphone was just a group of activists in between the DRC, Democratic Republic of Congo, and Amsterdam. And this campaign was to more or less raise awareness about the social damages in DRC linked to the mining industry. And at that time, a bit of context on that, a US law was passed called the Dot-Fronk Act. And this law, initially speaking, was good, right? It required US companies from the US stock exchange to disclose whether or not their products did contain some conflict minerals. But the immediate consequence of this law was the fact that some big players started to seek resources outside the DRC. And then people in the DRC, the miners then, were a bit stuck and started to go back to the smuggling activities. So this campaign was really, really focusing on that, to show a bit what did happen behind the scene. So after two or three years of campaigning, we decided to change a bit in the next slide, please, to change a bit the way of working. So we decided to be part of the industry, to try to push for a change from the inside out. And to do that, we were incubated by a nice place. I'm not sure if you know this foundation. This is called the Warf Foundation. So the J is pronounced R in Dutch. So the Warf Foundation, if you see the tagline making technology and society more open, fair and inclusive. So yeah, that's not a random incubator, like the French tech in France, which I don't like that much. So that's really, really focusing on fair technology. And this is in the center of Amsterdam, in the Red District. You have this castle. And the Warf Foundation is within this castle. So we were incubated in this incubator. And then Fairfun BV, so the company itself, was born in 2013. This is a social enterprise. So social enterprise means it's a bit like the scope in France. That small, the financial profitability is just a means to achieve all the goals, environmental and social goals. And in 2013, when we launched this company, we started a crowdfunding campaign, which was quite cool, because in the end, we managed to sell some funds in total, something like 50K funds, which is not that bad, without any marketing, the website was built by our own team, etc. And if you see the tagline on this screenshot, it says that a seriously cool smartphone that puts social values first, it shows again that Fairfun really, really started as a social activity. And we wanted to show how it is difficult in this industry to be a miner in DRC, etc. So perhaps some of you people have heard about how it is important to be more ecological friendly in this industry, but we do think that the real issue is the issue that you have on the people side. So if we pretend to build an ethical fund, we think that we should have a decolonial approach. We think that we should focus on the people, not in the western countries, but far away from us. We think that we need to respect the miners in DRC, and the people, the workers in the assembly line. But why a phone, Luca? Yeah, so why a phone? So as you, I think, are aware that the digital industry is causing a lot of greenhouse gas emissions, that's about 4% of the global greenhouse gas emissions, and of that 84% is actually going into device production. And the remaining 16% goes into running it, like networks and data centers. And yeah, electronic waste is really the world's fastest growing waste stream. With over 50 million tons per year, it is a really big problem in the world. Most of it is not recycled. Some of the phones, for example, are just kept and drawn, or thrown away to landfills. It serves like 1.2 billion phones a sort every year, which is a huge amount. And since most of them are only kept for 2 to 3 years, most are thrown away afterwards. And then 20% are recycled and the rest are just dumped somewhere. And also in the world, millions of people are working to dispose of this electronic waste that is not recycled properly. And this of course can cause a lot of environmental and problems to the people that actually work in this sector. So kind of where Fairf don tries to kind of make things better is in the materials, so in the mining sector, in the factories where the phones are produced and the components for the phones, longevity so you can actually keep the phone for longer, and then also reuse and recycling. So what happens to the device after the user is done with it, let's say. So kind of how we try at Fairf don to actually change something in the industry to not just be another smartphone company, is to try to change things. So we try to raise awareness to tell people about this issue, like in the stock we are doing today. We also want to set an example. So we want to show other companies and we want to show the industry that it's actually possible to do it differently and to do it better. And by doing this we want to motivate them to actually also come along, because of course if only Fairf understood the things that we are doing, not much happens in the world, but then if bigger companies like Samsung, they actually also implement the same programs, I think we could change a lot in the world. So in the last 10 years that Fairf don has existed, we have launched 5 smartphones. So we started with the Fairf don 1 in 2013 and then kind of had a phone every 2 years, more or less. Fairf don 3 and 3 plus are a bit of also great examples where if you have a Fairf don 3, you could actually just upgrade the cameras in the Fairf don 3 and have a Fairf don 3 plus and then just have better camera quality. At least starting with the Fairf don 2 we are also really focusing on software updates. So we had 7 years of software support for the device, so starting with Android 5 and upgrading it all the way to Android 10. And for example for our latest device, the Fairf don 5, which launched mid of last year, we want to provide software, or we are promising software updates until at least 2031, so 8 years of software updates, but hopefully also aiming for being able to provide updates for 10 years until 2033. So now let's look at the hidden side of how it is difficult to build a long lasting phone, how we can try to fight off-sodience and we will try to speak a bit about software and also about hardware. So why it is so important to do to reach longevity? So speaking about the stuff that we open publish, for every device that we build, we open publish what we call a NELCA, Lifecycle Assessment. So this is a methodology to assess the environmental impacts linked to each stages of your life cycle of the product, so production, transportation, usage and life. And we hope that at one point all the manufacturers will be obliged to publish such a report. This one is done by an academic partner called the Fraunhofer in Germany. And if you read this report, it shows that if you keep your phone longer than the average, the average being 3 years for the Android stack, if you keep it 5 years, you can cut global warming by 31%, and it goes to 44% if you keep your device 7 years. So this is key, this is really key to keep your device longer. And how we do that? So first of all, I will come back on that just after, but when you see a fair phone, you will see that you can easily open it and change some spare parts, because in the end, the obsolescence comes from the fact that it's super hard to repair your phone, right? If your display is broken, if you want to change the battery, sometimes it's screwed, and you cannot do that easily. So that's the first fact, the hardware aspect, I will come back on that after. But on the software side, perhaps some of you people have seen this message, the app is not compatible with your device, it means that apparently your app is not compatible with OS running on your phone, and probably it means that the OS is not resaving any security updates. So if we go to the next slide, so let's look at the device, so you will see a fair phone 3 in the middle of this slide. And on this fair phone 3, you have several components, right? And as a manufacturer, we need to choose some components, so a fingerprint sensor, a camera, a sock, system on a chip. And those components are not built by the same supplier, and it's quite difficult to have access to the code, the firmware running on those components. So as a manufacturer, you have not a lot of options, you can either contract long term support with those suppliers, right? But it's not always easy, because some of them are not at all willing to do that. For example, for fair phone 3, I put this device on purpose because we failed on fair phone 3 with the fingerprint sensor. We didn't manage to have a long term support agreement with the fingerprint sensor manufacturer. And when we wanted to upgrade to a great 1.13, then it means that for some people, for some users, they were obliged to use a pin cut and not to use their fingerprint, right? So it was not super nice. So sometimes we fail, but it's also something that we are not ashamed to mention. So you can either contract long term support agreement with your suppliers, or you can try to have access to some code. And obviously the second part is the most interesting part that we want to do. And if I look at just the Android part, so Android is just running on the CPU. You see the sock on the right, so on the sock you have plenty of subcomponents. You have the CPU, the GPU, the modem, etc. And if I look at just the Android part, if we go to the next slide, perhaps you have heard that Android is open source, but Android is not fully open source. Just AOSP, Android Open Source Project is open source, but the rest is not open source. So if you look at the orange layer or the purple layer, the hardware abstraction layer or the native daemons and libraries, this is not open source. So it comes with all the downside that I mentioned before about the lack of longevity. But on the older layers, close to the high level layers, you can find also some closed source, and it could be even worse. Speaking about some SDK, so if you are an app developer, perhaps you can have the willingness to integrate some SDK, because it could facilitate your development for your app, for example. So you can integrate Facebook SDK for your base whatsoever. And the second big black spot could be that those SDK could be big data artists. So that's the second issue that you see with those kind of SDK. It's not only an issue about software obsolescence, but that's also an issue for your own privacy. Yeah, I think we don't have a good slide, but that's okay. I wanted to skip this slide, but in the end I will speak about it. Just, yeah, it's a bit, that's a sad story about Facebook disclosing some private message between a mom and her daughter about an abortion that this daughter wanted to do. And Facebook disclosed all the private message to the justice, the American justice. So that's another, that's an example of how it could be better to have all you data in the big giant. If we go to the next slide, so let's speak now about the software updates that we do at Fairfax. As Luca mentioned just before, for Fairfax 2 we reached 7 years. And for Fairfax 5, which is the last device that we launched, we hope to do 10 years with a strong promise of doing 8. So I think I'm preaching a bit to the wrong audience to explain why we need software updates, but just to make sure, of course, there are any software that is out there kind of has issues and decisions need to be patched. There's always new security vulnerabilities found. So for example, in the Android world, Google is publishing a security bulletin every month with a bunch of new security issues that should be fixed on the devices. And of course, with new Android updates, with new Android versions, you can get new features which Google has implemented. So for example, better auto-fill or better permission management. So bringing Android to devices is quite a complex effort that is between multiple stakeholders. So any Android release starts out as AUSP by Google. This thing is taken by the SoC manufacturer, so in our example Qualcomm, which then modifies the AUSP code to actually make it compatible with the given SoC. And after they're done, then finally the device manufacturer can get this code and can integrate their changes in top, which makes the software work on the specific device. So for example, adding support for the display or for the touchscreen that's on the specific phone. Still, there needs to be implemented a bunch of changes for operators. So for example, to make voice of a LTE work correctly or make some settings be according to their requirements. And then kind of the last step is that you actually need to get launch approval by both Google, which is done by running millions of tests in the compatibility test suite and other test suites. And every single one of these tests needs to pass to actually be able to get launch approval. But also the operators test the software and make sure that it conforms to their standards. The process for security updates and normal updates are not a major version upgrade, but just on the same Android version look a bit differently. So every month, as I said, Google is providing security updates. The same is also happening from Qualcomm and some other parties. Network operators may have new requirements and new, for example, new app updates that need to be integrated. And then the device manufacturer is responsible to actually pulling all of this together to make sure it still works correctly. And then to go through the whole process of the approval again, so running many, many tests and making sure that everything works correctly. This process can be followed for about three years, which is kind of how long Google is maintaining any given Android version. And after that, or hopefully already before that, the manufacturer needs to update to a new Android version, otherwise you're out of support. So, yeah, as was mentioned already, while Android itself is actually open source, now as some modifications by Qualcomm, there's a lot of other proprietary components going into the system. So on a modern system or a modern SOC, there's a lot of code processes that handle a lot of different tasks. So for example, audio or modem or GPU. And these are run proprietary codes. So one where either only the device manufacturer can access the code, or actually for some even only the SOC manufacturer. So what happens if actually this chain is broken? So when the SOC manufacturer... When the SOC manufacturer doesn't provide support for the new Android version anymore. Well, this is generally where support in the industry stops, where device manufacturers no longer can provide any updates. So for this, we can look at Fevon 2, which was our device launched in 2015 with Android 5 in 2016, got an Android 6 update, but already back then the SOC went end of life. So other devices with the same SOC stopped... So other devices with the same SOC, so for example Nexus 5, stopped receiving software updates. But still in 2018, we managed to launch an Android 7 update in 2021 Android 9 upgrade and in early 2022 Android 10 upgrade. And to understand how we achieved this, we actually need to dig a little bit deeper. So we took over the role of Qualcomm a bit. We reused some proprietary parts from Android 6. And we looked at the Codel Rural Forum and the order name for Codel Rural, so where Qualcomm is releasing their open source changes. And we looked at some of that codes to give us a reference to how it really could work correctly. The kernel is also quite an important part that we needed to take a look. But yeah, for this we also looked at Linear Trials, which also provided some... Yeah, for Linear Trials also because they have a great reference of how the code could work together. And also provided quite a lot of fixes for some components in the system. So to enable communities like Linear Trials, we try to open source whatever code that we can do. Of course all of our devices also have an unlockable boot loader. And we share all of our code on our platform code.ferfn.com. Of course, open source is also great for a lot of other projects, like post marketers, which are more involved in. Which is a real Linux distribution for phones and other mobile devices. You can still check out the standard in the AW building to learn a bit more. So for Android as we said, normally ASP itself is open source, but normally all the changes that the manufacturer does to it are not open source. So the legal minimum that any manufacturer needs to publish is the kernel sources. So the Linux kernel which license another TPL license. But on top of that we also try to publish the full Android sources wherever we can. So for example, for Fevon 3 and Fevon 2, we had the complete Android source code with the proprietary components as pre-built. This version was then without the proprietary Google services and DRM and a few things like this. But essentially you can just download all of the code, compile it yourself and then flash on the device. And you have essentially a relatively similar build on your device than what we provide for regular users. For our new devices for Fevon 4, we currently only published all of the Android source code that our manufacturing partner, our ODM, produces. Unfortunately, the way that Qualcomm has structured the source tree normally makes it not possible that regular users without Qualcomm proprietary sources can compile this. But still we think it's really important to have this public as a reference. Because for things like the audio hire, still people can look at it and see what was changed for this device and then take over some of these changes for example for custom ROMs like linear joys. And we also managed to get permission to get the kernel divest tree sources public. This was quite a struggle because by default they are part of the proprietary package. But we managed to convince Qualcomm to allow us to publish them also because a bunch of other manufacturers also published them. So one problem that we also have with the software on our devices is that the chipset manufacturer provides us with a kernel version that is normally already by the time the device launches multiple years old. And it's never really updated to any newer version so the Linux kernel version that the device was launched with was basically the one that it stuck with. This means after a while that security patching can become quite tedious because there's a lot of changes on top of the Linux kernel release also. And of course also if the kernel releases end of life upstream then it becomes even more difficult to backport security fixes. Generally being stuck on this Linux kernel version doesn't make too much difference to a user. But especially lately we've seen that also new Android versions require new features in the Linux kernel. And if we are not able to update the Linux kernel, yeah we cannot really update to a new Android version then. So we can try to instead push the device in the SoC and device-specific support to the mainline kernel so upstream to kernel.org. With this done in the perfect word you can take any recent kernel release, put it on the device and have everything working. Unfortunately currently it's still far from feature complete but it is really cool and still can be used for a lot of purposes. There were also lots of great talks yesterday in the Boston Mobile Devices classroom. You can watch the recordings later. So some of the other things that we do. So we try to provide team-win recovery project builds, so TWP builds for the devices. For example for the FF5 we managed to get the build public on day one when the device was announced. We have factory packages on our support page. This is quite useful for third-party ROM developers so they can just take the new build and extract the proprietary components from it and integrate into their build. Also where possible we try to support third-party ROM developers so try to answer some questions and help them with some problems where possible. We think that default OS is great for regular users but for some users that prefer to have a bit more privacy oriented or security oriented operating systems. For example for VOD Google services we think that custom ROMs are really important for users. Hopefully OS is soon the app for our Fairbats XL headphones will be open source. A few words about the reparability on the hardware side. As I mentioned before it's quite easy to prepare a fairphone. If you look at this screen you can see a Fairphone 5. You have 10 modules on Fairphone 5 so if you break your screen you display you can easily change it. We also want to have an accessible decent price for that because one of the downsides of the reparability is the fact that sometimes a repair cost could be super high. For a display for example the cost replacement is on average 44% of the original price. It results in the fact that the users want to buy a new phone and not to just repair their display. The battery is the same. I'm not sure by heart the cost of our battery but for Fairphone 4 I think it is 20 or 29 euros. We really want to make sure that it's not super costly for one of our users to buy a new battery. Of course the batteries are not glued. I personally think that it should be forbidden to have glue on a phone. So we really really strive for having more modularity on the phone and we also fight or we try to do some lobbying to push the older manufacturers to do the same. We were part of some discussions at the European level or the French government level to have modularity as a criteria to define the coming index of reparability. And the last but not the least we do also an extension of the warranty from 2 years to 5 years for free to convince people to keep their phone longer. And we also publish the schematics. So about those schematics we just published the Fairphone 5 once a few days ago. So when you discuss with the competitors they can mention some reasons that it's not possible to publish those schematics. And the reasons are for example some intellectual property issue or some security issue. And this is all bullshit right. This is all bullshit. They cannot pretend. So intellectual property this is hard choice. If we want to publish those schematics right. If we want to not hold back on the intellectual property it's possible. And for the security reasons this is not the right clue to explain why open source is not an issue in terms of security. But you have to know that when you speak with some people at the European level or the French government whatsoever some lobbying from the big tech could convince those people that security could be an issue with open sourcing. So yeah that's something that we want to highlight today to show that it's possible. Yeah so let's talk a bit about the materials in the factories where we also try to improve the situation. So a smartphone contains over 50 different materials. One of those we selected 14 so called focus materials so where we think that's where improving things can have the biggest impact for now. We try to integrate these materials into the supply chain so they actually end up in the product. We also have for different materials we look into a bit more in the recycling part in the so trying to use recycled materials and for some other trying to get fair version materials. So we also try to map the journey of the materials and we publish this on our website so you can look at it. Why we want to do this is because we want to scale the fair sources we want to get more of these into our products. But also again we want to follow us we want to see that other companies see what we're doing and actually can look into exactly how we're doing this and then hopefully they are following us and also implementing this. For example here you can see the map of some of the materials in the FFIV and if we click on one of them for example for Tangsten here you can see it is mined from a town in Rwanda. Then it is processed in a I think it's a smelter in Austria and then goes to a different manufacturer in China and then finally to the final assembly manufacturer in China where the phone is actually being put together. We have a very long list of all of the suppliers, smelters and refiners also in these documents so you can see exactly what companies are involved here. What about the fair factories and just about the list of suppliers so the ultimate end goal is to convince the competitor to use the same list right because we have been doing this work of convincing the suppliers to act more responsibly so we hope that the competitors could do the same. What about the fair factories so I'm not sure if you have heard about Foxconn, that's a big factory owned by a Taiwanese company called Foxconn and this is one of the largest employer in the world, almost one million of employees and this company is known for bad working conditions, bad revenue etc. So what we try to do is to collect the workers voice. We don't want to pretend that we know better than them what is a good working condition right this is a decolonial approach also so we have Chinese employers working with the assembly line workers to make sure that we understand what are the best working conditions for them. We disclosed also in terms of open source we disclose a methodology about how you can implement leaving wage in the factories. So in this toolkit you have plenty of things to calculate this leaving wage you have some templates for the agreement for the workers etc. And I'm speaking about this notion of leaving wage so if you speak with the workers they will tell you that the most important thing for them is to have a leaving wage, a decent wage. If you look at the daily wage for an assembly line worker this is approximately $13 per day. And if they want to have a decent wage to avoid to do extra hours for example they will tell you that they need the double more or less right so $28. So you have a big gap right between the daily wage and the leaving wage. So what we have done at Fairphone we have paid this gap right we have paid those $28 per day. And the ultimate consequence for us in terms of price was to dedicate even less than $2 per phone to be able to pay those people correctly. So in this lobbying that we have done about the leaving wage, this toolkit, of course we have tried to convince all the manufacturers to do the same so far this is a big failure. But we are still hoping that it will work at one point because let's imagine that all the manufacturers could do that right, it would be super nice for the people there. We can go to the next slide. Yeah, so a lot of companies have recently also put recycling very big on their front covers. So for example, Apple, Samsung or nothing they are very big on the recycling. Unfortunately the way that recycling is being done is also of course sometimes not great but also just the way that the economy works. You cannot take a phone, recycle everything and then get 100% of the materials back to put into a new phone. There will always be a big junk that actually goes to waste and which you can't recycle because either it's made into alloys which you can't separate anymore or just different components where it's either not worth getting the tiny amounts of gold back out or anything. So there will always be new mining needed and somebody needs to look into this to actually make it better. But also kind of what we are doing with e-waste in Europe is sometimes shipping them to places in Africa. So here are some pictures from Akra Bloschi which is in Akra and the capital of Ghana. And you can see kind of where teenagers are burning some copper cables to get the plastic or to burn the plastic off to get the copper back to actually recycle the copper. And yeah of course you can't imagine that this is healthy for anything in the area. And also recently I think Ghana has also done a bit to clean up this area a bit but of course it's just going to happen somewhere else and it's probably happening in a million other places. Okay, time for the conclusion. So yeah the conclusion that we wanted to know if Luca was to speak about techno discernment and social justice. So we will use a quote from a person called Ivan Illich. I really like this guy not because he's Austrian like you Luca. But yeah I mean I started to read this philosopher a few years ago. And yeah he influenced my life as a software engineer. He wrote this book in the 70s called Tools for Conviviality. And if you look at this quote he's saying more or less that the modern tools, so if you extrapolate a bit think about your daily work. The modern tools should not be at the service only about a small group of experts but at the service of all the people, a bigger group. Right? So more or less how cool today with Luca. We don't want to speak only about fair fun right? We want also to invite you to step aside from your daily work and from your expert position. Of course open source is great right? We are in a super nice conference. I would like to force them by the way to the organizers. That's super great. Open source is a super good lever for responsible projects. But open source doesn't make automatically a project responsible right? So that's an invitation for you people to think about broader right? From a broader perspective. It's always good to do open source for sure. But where the code will run right? When we think about the hardware we spoke about about the social damages behind the cobalt right? Behind the extractivism in general. So we really need as experts to think about how products and we really need to ask ourselves if this product will really empower people. And also this is an invitation to think not only about how people, how community, but people far away from you. People far away from our western countries. Thanks. Thank you. You guys have any questions? Here is. Thank you for the great talk. You have told a lot about software upgrades and updates and support of your phones. But what about hardware updates? Because the most important part for the pollution is hardware waste, but not software. And if you, for example, if you upgrade your phone, I don't mean the main parts like CPU, but maybe displays or cameras or whatever on your current models. Maybe it would reduce the electronic waste as well. And did you think about it? Yeah. So for kind of reusing old phones, there's actually a project by a Belgian company ongoing where they are looking into kind of how they can reuse old FFM2s to actually do something with and use them for example for some IoT use cases. Yeah, some of the problems still apply there. Of course, the software support is gone for the old devices. And also, yeah, then the same applies to kind of the old firmware on the device. So you cannot, it's tricky to kind of make a secure product out of the old phone because all of the proprietary software support is just stopped. In terms of kind of keeping modules between different phones, which I think was kind of part of the question. It is definitely something we are thinking about, but we also currently can't really limit ourselves to kind of keeping the exact same form factor, for example, for the cameras. So the camera modules are compatible between the different phones. Hi. I have a fair phone here and it's five years old or so and maybe I want to buy a new battery for it. It's still fine, so I don't need it now. So I think I would want to buy a new battery. Would it be a battery that is produced last year or so? Or would it be a battery that has been laying on the shelf for many years and I think keeping a battery lying around, without tension, is not very good for such a long time. So how do you ensure the quality of batteries is still okay after five years? I mean, for Fairfond 3, for example, we have car suppliers to produce batteries over four years. We are still producing the battery. Oh, by the way, we stopped a few months ago. So we make sure that we are not buying all the stock ones and we make sure that the supplier is still around. That's also for us quite critical for the software updates because that's, to be honest, ultra tricky because normally speaking, in the industry, when you stop to buy the spare parts or whatsoever, the supplier is not willing to work with you anymore. So for us, you know, still buying batteries after years and years, it helps us also to have them still around for the software part, especially for the Qualcomm preparatory components. For Fairfond 3, for example, we don't have access to the source code of those components. So we needed the supplier to do that, the assembly, what we call the ODM. So we are still buying batteries over the time. This is what we try to do. We have time for one more question. Hi. Quickly, so one question was on the business model. How do you stay afloat in such a market? Is it coming from the premium price of the devices which I was happy to pay? That could be one factor, but I was wondering if you have investors who are particularly interested in sustainability and stuff like that, or where does the money come from? And the other one was how do you guarantee the 10 years of updates? Do you think you'll be able to force Qualcomm into giving you a 10-year long-term support? Long-term support or? For the 10 years of support, the Fairfond 5 is actually using an IoT chip set, so long-term support where we actually have from Qualcomm at least for way more years than with a normal phone processor. It works very similar. It's just a different product line from their side. What was the first part again? The business model. Of course, we want to attract new customers, of course. I think it's also... I mean, we don't need a single customer buying a new phone every two years. We are also happy with them if they keep the current phone for six years or eight years, and then come back to us. And I think there's a lot of room for expansion in just making people keep the phone for longer. People are already keeping their phones for longer, not just because of longers of the support from the manufacturers, but also because the life cycle of the improvement of the smartphone industry has a bit slowed down. So the phone that you buy now is not that different to one that you bought four years ago. It also helps in keeping them. We know it's just better for the environment, so we try to convince people to keep the phones for as long as possible. Thank you. Let's go. Thank you very much. The time is up. Thank you. Thank you. Thank you.
Unveiling the Open Renewable Energy Systems (ORES) Initiative - Panel Discussion
Hello, everyone. And welcome. I want to start by expressing how amazing it feels to have the opportunity to moderate the last panel at the main stage at FOSDEM. Thank you so much to all of the volunteers and the organizers who have made this incredible weekend happen. I just want to express that I have seen so much passion, innovativeness, and creativity from the open source community that we are all a part of and really appreciate it. For those who don't know me, I'm Dan Brown from Linux Foundation Energy. I've spent the last 10 years supporting open source communities. And what keeps me going is the knowledge that open source is not just a software methodology. It's not just a community, even. It's a philosophy and one that is having a massive impact on the world for good. And that's what we're here to talk about today. I'm privileged to be on stage with some truly intelligent, passionate, generous, hardworking, creative people. And I hope that we inspire you to join us in a new initiative that we believe will help improve the lives of people all around the world. So to give a little bit of background, it might seem obvious, but without energy and electricity there wouldn't actually be any technology. So none of us would have jobs in tech and open source wouldn't exist without stable, accessible energy systems. But as much as energy has benefited the world, it's also had unintended consequences. We're all aware of the challenges and dangers posed by climate change. And if we do not change the way that we produce and use energy, then the future is bleak. So we've come here to ask you to help us prevent that worst case outcome. The panel is going to give you an overview of some of the challenges that we've identified, that we feel the open source community can help address for the benefit of people all over the world. To set the stage, the current state of power systems is not very different than it was a hundred years ago. And that's a problem. It's inefficient. It's centralized. It's top down, proprietary for the most part. These are all things that the open source community can recognize as far from ideal. So we're going to present a proposal to you that can help us change that not only for those of us in the developed world, but for individuals, millions of whom, even in 2004, don't even have access to electricity. So with that, I'm going to introduce our esteemed panel, who are going to tell you more about our idea and our proposal to expand on the work that is already being done at Linux Foundation Energy and other open source initiatives to decarbonize electric systems. We want this to be interactive, so I'm only going to ask one or two questions to each panelist and then encourage you all to jump in and ask us questions and also give us your suggestions on how we can build on this proposal. So I will start by introducing from nearest to me to furthest Chris Shee, who is head of open source strategy at FutureWay. Chris really is kind of the godfather of this initiative. It's his brainchild, and so we are very thankful to him for bringing this group together. Next is Hillary Carter, senior VP of research at the Linux Foundation, who actually did some research reports that she's going to speak about during the panel that really inspired this initiative. Next to Hillary is Vivienne Barnier, CEO of the Inaccess Foundation, which is a nonprofit dedicated to bringing electricity to parts of the world that do not currently have it. Next is Carl Yang, CEO of Degscent Energy Technology. Carl has gotten involved to bring us an open hardware perspective because a lot of our initial discussions were so software focused and we realized that there needs to be a hardware component in order to make this successful. And finally, last but not least, is Tony Shannon, the head of digital services at the office of the CIO of the government of Ireland. Tony has been instrumental in advising us on how we can better engage with the public sector and how policy can help influence these topics going forward. So with that, I'm going to turn to the panel, and we're going to start with Chris. Chris, you're the one who came up with this idea. That I just mentioned for an initiative around open renewable energy systems, or OREZ, as we're calling it for short. Can you just give us some idea of what OREZ is and how it came about? Sure. I just shared earlier today about the history and the motivation of OREZ. One year ago, exactly this month, I live in San Francisco Bay Area, California. One year ago, at this time, we had atmospheric river events. And that flooded the streets and also the cyclones. My house, the roof was reploped by the cyclones. And I was lucky enough, I only spent $1,000 to fix it. My co-worker, probably spent about $200,000 because they live in a hilly area, and they have maybe 30-meter tall trees that fall down and crush the wall of the house. And today, as we speak right now in California, there are also strong atmospheric river events right now. So at the time, I was thinking, gee, this thing is happening, and what we can do individually myself, ourselves. So I'm coming from an ICT background. So I was thinking what we can do from take the expertise and methodology and thought process from ICT perspective. And I think there is a lot of scenery in the energy sector. So take that over to energy, and I see a lot of opportunities there. There's a huge gap in terms of open source and energy. So that's why I was very excited that we joined Lennis Foundation Energy, the open source energy. And one of the things what we were thinking about was that we had to really think kind of somewhat backward. As I mentioned, right now we have a centralized electrical system, and everybody is so, is benefiting from that, which is you just plug it in to receive energy, receive electricity, and you can plug in many different kind of devices. Now we come to a time we need to rethink how we consume and how we produce electricity energy. The idea we had in mind was that why cannot we make it easier to produce energy and to become energy independence? And that's the foundation, foundational motivation of our open renewable energy system. What we're looking to do, I want to say the result, what we envision in the future will be that everyone will have the opportunity to go to Costco to buy an already compliance device. Right now you go to Costco, Walmart, you can buy a device so easy, right? You buy a laptop, you buy a microwave, you just use it. Why not the same thing for production of energy? So that is the foundation of this always project. That's the motivation. Eventually we're thinking to have that device always compliant, but you can buy easily and plug it in at your house. Then you can create your own energy. Eventually the goal is to have individual energy independence. And that's what we're doing. The way we want to do it is through open source. And we have done this thing in telecommunication space. And we will move that kind of thought process and the methodology over to the energy space. Through building an open source API and a standard that allows manufacturers to freely adopt those standards and build an equal system of manufacturers, vendors, of those always compliance devices, and to reach to the masses. So this is a grassroots effort. And we are today we're here fortunate to be here to invite everyone to join us in this new energy independence movement. Thank you. Thank you, Chris. So I'm going to turn to Hillary next. Hillary, I want you to talk a little bit about the part that research played in the development of this initiative. And I know there are two specific reports that you want to touch on. And just so you know behind you, we've got the microgrids report up. Thank you so much, Dan. Hi, everybody. It's a pleasure to be back at FOSDEN to highlight the role that research has played in the establishment of open renewable energy systems community building. It was more than a year ago, about a year and a half ago, that Chris approached Linux Foundation Research, which is one of the departments that I lead, to test the hypothesis that there was indeed a viable, valid role for open source to create better, cheaper, faster solutions in the microgrid space, a nascent space, so that we could have widespread solutions that enable both clean energy generation and increase access to energy solutions in remote areas and areas that have been impacted by natural disaster. And so what we did was create a qualitative study and interviewed 17 subject matter experts across the energy ecosystem, including people from academia and industry, to pressure test this idea. Was this a good idea? What were the opportunities for the microgrid space leveraging open source? How could we take a nascent space and make it mainstream so that energy access is truly democratized and that energy supply is readily available? And what's incredibly rewarding is that research has served to plant some of the seeds for this idea and begin to have conversations with stakeholders who then a year and a half later, at the formation of such a working group today, can become participants. As well, now that we have a deliverable that comprises the insights from these experts, we have a document that we can share to say this is the why, this is the how, and here's how you can get involved. So I encourage you all to read this report. You can visit LF Energy or Linux Foundation Research or scan the QR code, but read it, share it, and get involved. Because through greater participation and more contributions from developers, we can indeed create a necessary future to democratizing energy, clean energy generation, as well as distribution to those who need it most. Next slide, please, Stan. Another report that I'm incredibly passionate about, fairly recent, and I think you might be of interest to all of you because you've come to a discussion about the relationship between open source and microgrids and clean energy is our sustainability report. And this piece of work discusses not just open source opportunities within the energy sector, but across all open source project communities. And our ability to make the link between open source technical projects and the United Nations sustainable development goals is a game changer because it is a door opener. It is a door opener to potential collaborators who have a mandate to achieve economic growth by sustainable beings, to governments, to regulators, to enterprises, to other partners, and can inspire new contributors to open source projects that are either intentionally or through applications and use cases, doing incredible things to advance sustainable development. And these reports build bridges and they generate conversations and they are a resource for the whole community. So I encourage you to explore the energy reports that we have, our sustainability report, microgrids report, and beyond, and use them to your own interests and advantage. So with that, Dan, we'll pass it back to you. Thank you. And I'm going to move over to Vivian. Can you tell us a little bit about the challenges of energy access? I have a feeling that a lot of folks don't realize what a problem it is and exactly how you envision OREDS helping address them. Thank you, Dan, and thank you, Chris and Hilary. You see behind me a picture of strictly, it's like several pictures matched together because luckily we never have this situation on earth so that everything is in the night. But it pictures pretty well the situation that without showing you, I think you see that there are areas where you possibly know there are a lot of people living but you almost see no light. This is because there is no electricity. And that's actually bringing electricity to these areas. This is like Sustainment Development Goal number seven, which was just mentioned. And that's where we and access focus on and sorry I have to correct you, Dan. As an access don't bring directly energy to people, but we want to facilitate the access to energy through leveraging open source. And thank you, Hilary, actually, to highlight and for this report because the idea of an access doing this work is like based on the individual feeling of a few practitioners of the energy access sector that open source is underexploited there. And that's where an access was founded. But we never had a proper research paper or something that supported our vision and mission and our beliefs and now we have. So that's great to see this confirmation that we are doing the right thing in the right direction. And now I would like to show on the next slide a bit what, oh, that's not the one I was expecting, but it's all good. That underlines the need for open source because you can see that there's a lot of electricity, electrification that has been ongoing about the last decade. However, now we have a reverse trend. So we have more people being unelectrified every year instead of less. And one, surely not the only one, but one of the main reasons is also there is interoperability missing. There are too much, there's a lot of stakeholders which is great and there are more stakeholders emerging. It's an eastern sector, as you said. But do stakeholders often develop their own things by their own in parallel, often with donor money from the same donor developing the same solution or pretty similar solution to then provide electricity to people instead of having a baseline infrastructure with the shared which has a common standard and common API, just as Chris said, where then the companies can build on and build their ecosystem and their business model around by doing little tweaks here and there where it's necessary. But I've seen like small utility companies, like many good companies developing smart meters, like 10 of them. You don't need to develop a smart meter. You need to sell electricity and you need somebody to provide you a smart meter. And maybe your business case is a bit different for each of you, but don't lose your time developing like another smart meter yet. So that's just one example. And this happens all the time. And that's where we believe the power of open source really needs to be leveraged more in the energy sector. And that's where we believe to bring this perspective into the always initiative. And yeah, I'm happy to pass over to Mike. All right. Thank you. An apology is about the slide. I thought we got that updated. Well, next we'll move to Carl, who as I mentioned before is bringing a bit of a hardware perspective for us. Carl, can you talk about the state of the overall residential equipment market? For energy and how you see OREZ playing a part in that? So this is our POC. This is our POC for the hardware to integrate with the architecture. For the device, we have micro-involver, BMI and also we call it AC battery and also the DC-DC converter. The goal for the POC is together with the community, with the DINIX Foundation energy projects. We provide low voltage. It's safe. Higher quality and low price. So our customer can do it yourself. For this part, mostly along with the power, energy, specification and regulation. For this part, we propose a strategy for the implementation for the POC. The most important thing for current phase is the communication. Also with the software management integration part, we focus on the energy flexibility based on the S2 standard integrated with EMS implementation. Later, we may focus on the open solution with the operation and the bigger data. Basically, the goal is we try to provide together with the open source community for the hardware to integrate with the OIS architecture design protocol and API. Thank you. Thank you, Carl. Now to Tony to give us the public sector perspective. Can you talk about why this initiative is needed from your perspective and what part of the policy play in advancing it? Sure. Thanks, Dan. I don't need to convince anybody here that whatever level you're looking at, the challenges that we face as a people, be it a community level or government level, this is decades in time that really matters. When I speak, I talk about a moment that really matters. We're all aware of the challenges that are faced around the world in terms of improving the lives of people and looking after the planet that we live on. We all have a challenge ahead of us in terms of people and planet. This decade towards 2030 is a pivotal moment in time in that sense. We know the policy frameworks out there talk about the need for digital transition and a green transition. That's again widely understood, but there's always an often a challenge in terms of a gap between a nice and an aspirational policy and its implementation. And how do you get to solve the problems of people and planet using digital and green in a scalable and sustainable way? Those are key questions. We know that for instance if you look at the energy sector, it is at least in part way down by the proprietary nature of the systems that are at play. And Dan already mentioned the top down nature of the way energy is delivered. So that doesn't really point to how to tackle the problem. What we know is, and obviously FOSDEM is a living example of that, if you want to tackle some of the most complex problems in the world, you start from the bottom up. You innovate from the bottom up. You have small working groups like you're all made up of that come up with ideas as to how to fix problems and you scale them and sustain them from the bottom up. And Dan if you could move to the next slide perhaps. I might talk to a related pattern. I talk a lot about the complexity of the challenge ahead of us, but if you're looking at complex challenges, you need to simply talk about patterns. And I think we know in the public service that actually the challenges that are required fall down in terms of patterns around people and process and technology. And there's a really interesting initiative out there called GovStack, which some of you may or may not be aware of. But it's about what are the building blocks required to deliver 21st public services and there are only about 20 that are, if you boil them all down. And if you can focus on those core building blocks, you can transform the lives of people using open source and reusable building blocks to change the way the world operates. I'm pretty sure that the same principle holds true for the power, the energy transition we need to face. The question, and I think the challenge that Ores is trying to tackle is, what are the building blocks that are required to accommodate and move on the energy transition from the bottom up? And that's why this work on micro grids and the Ores project really offered the potential for a bottom up revolution and innovation in the way energy is delivered that can be empowered really by people like yourselves to get involved in that effort. Because all we need to do is solve this at the bottom and take it up from there and that will transform the way the world operates. And I'll pause there. I think that was very well said, so thank you Tony. So I do have more questions that I can ask the panelists, but I kind of want to open it and see if there are any thoughts from the audience. I'm first going to go ahead and put this up in case anyone is interested and wants to learn more. The QR code on your left is to join our mailing list. The one on the right is a wiki. If you go to the wiki, there's not much there. So this working group is still in the formation stage. We are not even an official LF energy working group yet, but we did not want to miss the opportunity to present this idea to a FOSDEM audience because we really do think that there are a lot of folks here who would be interested and can help us progress this forward. It actually speaks to the power of FOSDEM that just since yesterday's energy dev room, the size of our potential working group has pretty much doubled already because we met so many great folks here. I see Luis coming right up here actually and he's one of them who are interested in getting involved. For the couple hours before this panel, we actually all got together in the little bar cafe area and started debating what this is going to look like. And I'll say it's a debate. We don't even agree on what this should look like. We don't agree on how the standards should be structured. We don't agree what kind of specs there should be. We don't agree on what the hardware and software components should be, but that's what working groups are for. And that's what is great about open source is it provides us the opportunity to build a community that can debate these issues. So I'm happy to open it to questions and see if there are comments and see if the audience has any thoughts. If you have ideas, if you have suggestions, or if you have questions for anyone on the panel, please feel free to come forward. And if not, I do have questions I can ask, but I want to really give you all the opportunity first. Which countries in the world do you think are the most fertile ground for these kind of changes? Or which are the worst? Which would resist the hardest? I'll take a stab at that. And my answer is rooted in one of the contributors to the research report. Some of the most advanced work in microgrids has come out of the University of Alaska in the United States. And I can bet everyone here appreciates why that is. It's a remote state, isolated from the rest of the United States geographically. And shipping power to remote parts of the world is expensive, it's wasteful, and so on and so forth. So I'm very encouraged by some of the academic research coming out of the United States to help proliferate at least the ideas. There is a lot of motivation from people in remote communities all over the world. So I think maybe it's not so much the country, but some of the attributes of the leading stakeholders are you in a remote area of the developed world. And of course some of the underdeveloped countries just have this deep need for access to power so that they can plug in and participate in online communities. And I wonder if Vivian has any thoughts there as well. I know you're coming from an energy access perspective. I'm not 100% sure if I got the question. I got the question about a country, what exactly was the question? Which country is the most... It's a fertile ground. A fertile ground, okay. Yeah, I mean as Hilary said, surely in the industrialized world, region like Alaska, or state like Alaska for sure, me looking into like an industrialized or under-industrialized regions. For example, you can look at Nigeria, which is a country which actually has a power grid or several power grids, but they are extremely unstable and there's a very huge community of companies and individuals working throughout solving it by isolated or interconnected mini grids or micro grids. So that's definitely a geography and a general setting. And also Nigeria is relatively good, let's say, excessively to tech. And so yeah, that's definitely one of the regions in the country, but I mean there are plenty other regions to just to name one. Thank you. And it looks like we have another question. Yes. Have you thought about looking into the home automation community? Because I think OpenHAB has a booth over here, and there are other home automation, open source home automation systems, which I think already have energy management on their agenda, or have some partial solutions. Yes, yeah, okay. So the question looks like there are some organizations, similar organizations, like the energy organization. And what I was thinking is that we don't segregate between organizations. Whoever who are interested, okay, who are interested in this area, please just scan the QR code, draw on the mailing list, and have that conversation. And be part of this movement to become a sustainable energy part of the future. In terms of technical details on is this an energy management system? Definitely. There are all these EMS, BMS, virtual power plant, macro grid, all of these terminology will be involved in this project. So yeah, so that's what I'm thinking. Just to add quickly on that one, and it actually came up as Dan said, our working group has like doubled since yesterday, and one of the inputs from the new members were exactly to looking into this existing home automation, open source organizations, companies that are already around. They are not yet part of it, but more than welcome to pull them in in the best way we can, and makes total sense. That's right. I see the microphone's up there, so Chris, we'll get you next. So somewhat related to that past question, some academic studies show that by far the cheapest way to make reliable grids which rely heavily on renewable energy is to incorporate a lot of, what's called demand response, so that consumers can be sent signals either with or without financial incentives to reduce their demand as an alternative to employing things like gas power plants or large battery arrays or indeed expensive transmission systems. And I think I see this as some of the most important area that it's in to get open source into because I see lots of closed silos and gatekeepers trying to set up in these areas at the moment and also huge problems with things like security and privacy on an individual basis. I mean, this is also great because you will get a lot of western hackers wanting to engage in this because it solves their own problem as well. So I mean, is there something that you're sort of paying to focus on heavily or is this a sort of peripheral goal? I'll just quickly try to handle that. I think, Dan, the demand response stuff that you're talking about is acknowledged as probably part of the overall solution here. Does related work on LF Energy, do we expect to plug into that? Is that fair? Yeah, I didn't catch all of the question, but if you're asking about a generally demand response, yes, I mean, that's something that's already big utilities are having to deal with that right now as they onboard renewables because load balancing is becoming more difficult when you have renewables versus being able to just turn up a gas turbine or not. And so a lot of that is being driven by new ML tools, AI tools. These are things that LF Energy is already doing on a utility scale. I think it probably, some of that, and I'm not the most technically minded person, so some of that may be scalable down to something like what we're talking about with OREZ, but there are also maybe new tools that we need to develop as part of this initiative to meet that sort of use case as well. Can I do another tying into the previous two? It's sort of a scoping question. It seems to me that you have two... Can you move the microphone closer to your mouth? Oh, I'm sorry. We can hear you up here. I'm usually so loud people are complaining in the next country. It seems to me you have a scoping issue here. There's two obvious cohorts you could apply to, either the underserved moment in the middle of Africa or it could be the balcony solar crowd. I propose something to our UK Energy Minister in 2009, which is balcony solar. Now, if you do the latter, you're not bringing electricity to new people, but you are helping people contribute to it and you'll get lots of Western hackers, etc. Do you want to rule out the balcony solar crowd or rule them into your solution? Because it seems that you've got two very separate targets there. I hear a statement as opposed to a question that we're dealing with two very different cohorts. Yep, you've got the underserved and you've got the already well-served, but balcony solar, you could do some guerrilla generation and you'll get lots of feedback. Who wants to take that? Okay, let me add something into that. I heard several things, demand response and also security from the previous lectures. These are very important aspects. First of all, this relates to the utility health and how we can help the utilities. Our project, at the beginning, we started with a residential level, small grass-root projects, but eventually we want to have industrial focus applications and maybe utility applications down the road. We're starting from our human center and then house a residential level in the first place. In terms of security, it's very important. We want to make sure other devices are secure and secure by design. What we mean by that is that, for example, we have autonomous mode, which does not connect to the network at all in the first place. And also we can have federated mode, meaning they join the federation and then they work together in collaboratively. That's a little bit more on the architecture side of this, that this working group is collectively going to define together. If you join the group, then you have more exposure and discussion and contribute your thought as well. Welcome. Thank you. Next? Chris. Hello, Chris Adams, Green Web Foundation. This sounds really cool. I really, really like it. One of the takeaways I took away was that it seems to be about decoupling future energy access from burning fossil fuels, which seems to be a really good thing, rather more so than converting existing demand to, like, that's already connected to the grid, right? You know that when we have a transition, we're switching from burning fuel to having access to finance. And one of the problems when people talk about energy access is the different ways to pay for it. Could you talk a little bit about the role of finance and how that would play a role in this? Because I kind of get how this would make it cheaper, which is good, but there's still a really, really big thing to address there. I see a Hillary nodding, so maybe if I put it to you first, perhaps? Finance. So making it affordable and cheap or inaccessible for people. I'm going to refer to a previous report that I had produced, which was focusing on decentralized energy and incentivizing the establishment of the prosumer energy movement. And if we look at Brooklyn, New York, there was a project called the Brooklyn Microgrid. There was no energy access problem in Brooklyn. There was an experiment to see if we could do better, if we could produce clean energy ourselves on our rooftops, share it, and not be in violation of any regulation. And these ideals are absolutely possible. We just have to put in the mechanisms and have favorable regulations so that people can produce their own energy. If they have an excess energy, sell it to their neighbor and do that within a regulatory framework and a payments framework that is accessible and affordable and universal. So all of these building blocks need to come together. It really requires the collaboration of governments and regulators to make microgrids possible, whether they are in Brooklyn or whether they're in Nigeria. And also giving people the incentive to charge their devices at non-peak hours. One of the experiments that we, a pilot project that took place in Canada, was with our energy distribution operator, as well as a payments network, and using blockchain technology, giving actual economic incentives to people. To make green choices. And until we commercialize green so that it is cheaper and a better, or just more accessible than fossil fuel burning choices, we're just not going to get there as quickly. We have to work with government incentives and individual incentives. Did Vivian or maybe Tony or Vivian, or Chris, please? I think the key word here is finance or affordability. And this is exactly what Orres is trying to address. Why? Look now. If you want to install your home energy system, at least mine from America, I believe in California. The company will approach me saying, okay, if you want to install a Tesla system in your house, that costs you about $40,000 to $50,000. Half of that is going to be labor cost, permitting cost, all the other installation things. The rest will be equipment. But if we do it with Orres way, that's why the thing happens. It's open. When it's open, what that means is open standard. That means it drives and tries to build and foster an equal system of vendors and suppliers, installers. Once you have mass adoption of this, then the prices will go down, both in terms of the labor cost and in terms of the equipment cost. Especially, that was the reason one of the motivations of Orres design is the plug and play that reduces the labor cost. That $25,000 right off the bat is gone. See, that's the open source way. And that will systematically and dramatically reduce the cost of renewable energy systems. That was the one, the prime motivation of Orres. Okay, so that's why. Please join us. I'm not sure if you were also asking particularly about the finance question for energy access. In particular, so is this possibly going beyond this panel? I'm pretty happy to engage with you in the discussion after work in this area quite a lot. And would say, as Chris said, we are contributing to it, but we can't solve all problems at the same time. It might drive down the costs, so make it easier, but we are not like a financing mechanism. We're not inventing a new financing mechanism to it. It's also needed, and maybe Tony has some ideas on the policy side. So, I mean, yeah, the finance aspect is reducing costs here, not inventing or developing another finance mechanism. Great if somebody has ideas and make that open source even better. Another question? On the joining part, so can you tell us about the prerequisites for someone to make a meaningful contribution or put another way? I'm just a software developer. I don't know any about energy distribution hardware stuff. What can I contribute? What can I bring to the table? I mean, I can talk a little bit about that and I'll let the panel weigh in as well. But, you know, this is true of any open source project. Like, anyone can contribute and get involved. You don't have to necessarily contribute code. You can contribute. We will need help with organization. We will need help with marketing. We will need help with probably doing events and notes keeping and maintaining websites. And if you're interested in contributing in that sort of way, please still sign up for the mailing list and let us know that. I'll happily give you tasks to do. And one thing, if you don't know anything, your question is, if you don't know anything about energy, how do I contribute? And then everybody in this room, you can help contribute. Just spread the word. Take the picture of those QR codes. Send it to your friends. Spread the word. Not only particularly in this working group, this project is an open renewable energy system. This project, per se, but spread the concept, the mission of ORES. We want to get more people involved and think about ways how we can help create renewable energy resources and to live more sustainably. Just one more addition. And pretty sure from here in a few months down the line, we will have very specific work tasks that need to be addressed and that can perfectly be addressed by somebody without any energy background. So, yeah, go on the mailing list. Stay tuned and it will get every time more concrete and also more detached from particular energy or energy access knowledge. Alright, I see we've got about three minutes left, so I think you'll be the last question. It's really interesting to hear from you guys and I thought it was also interesting how you said there are so many different ideas about the approach to the problem. So, I was wondering, like, what are the main conflicting perspectives or ideas about the implementation of this idea? Sorry, the implementation of... The conflicting ideas about the implementation of the whole system that you guys were describing. Oh, I alluded to that. There was some disagreement in our earlier working group meeting. We have more agreement than disagreement. This is true. Yes, there's more agreement than disagreement. Otherwise, we don't have that many people come here. So, your question is about more implementation. Is that right? Yeah, like, what are the disagreements, essentially? What are disagreements? Yes. Oh. I think the disagreement... You know what? I don't really think we have much disagreement. I think the reason we have a lot of discussion was because there may be some understandings. Right? It's a very loud room. Sometimes we couldn't communicate clearly. But eventually, people come together. Eventually, people come together. That demonstrates a lot of interest, a lot of consensus, and also a lot of passion. People now are joining together into this project. So, with respect to detailed implementation, then you will know that we have joined the mailing list, the working group. Then we'll have more discussion as to how we're going to do this. We've got some ideas already. And just come join us. Sorry, just a quick word on perspectives on implementation. For me, the issue is perfection is the enemy of the good, as you all know. So, I'm here because I believe this mission, but we need to see this in action. So, we're trying to find the common ground. Where do we get our first implementation? Kick it around. See how it works. See how it doesn't work. Go from there. It's as simple as that. Any final word? We've got about one minute left. I just wanted to say, I think there was more conversation around who should we invite? How do we get started? What are the priority issues? This is a big idea, and we want as many people involved as possible. So, I echo Chris, Dan, and everybody on this panel to say, spread the word, share what you've learned today, and share the reports if they help bring in new stakeholders. And we look forward to hearing from you. Alright, thank you all so much. Feel free to grab any of us afterwards if you have any other questions. Thank you so much. Is this the... I want to say one more thing. I'm not sure. Is this seem to be the last? Is it the last? No, there's one more after us. There's one more? Okay, this is the second to the last conversation here. So, if people stay here at Leislang after these two days of photo, give us a big shout. Thank you. And, join us, and we'll all be part of this movement. Thank you guys.
Take Your FOSS Project From Surviving To Thriving
Good morning everyone. We have Ryan here talking to us about open source and yeah, take it away Ryan. Thank you. One problem with having your talk first thing in the morning is there are a lot of missing seats and I'm pretty sure that's because I saw almost everyone here out last night drinking really late. Louder? Is this better? No? I'm not sure I can make it louder. Is there a set? Is there a? Okay. How about if I just talk like this for a little bit? That's going to be hard because the thing you didn't hear was I said we're missing people because I saw them all out drinking last night which included me so talking at this volume the whole time is going to leave me without the ability to do anything the rest of today. I'm Ryan Sypes. I'm the managing director of product for the Thunderbird team at Mozilla. And that's a little weird to say because Thunderbird exists in a different Mozilla than you guys all know. There's a foundation which is a non-profit. Hey, that's perfect. And there's a corporation that develops Firefox and then there's another corporation called MZLA that develops Thunderbird. And that's due to a unique history of not being sustainable, of Thunderbird not being sustainable. There's a great product, a great open source project attracted a lot of users. There are multiple counts and depending on which one you believe I happen to believe we attracted as many as 30 million users at one point and not at all monetizable. At least that's the telling that I got when I came on the project which is we haven't really found a way to make money off of this so paying developers is hard so we're giving it back to the community to Steward. And so that takes me into this talk where when I came on it was a project that was run by a very active community of volunteers and there were a couple of us, a few of us who were able to work on it for a job. I came on part-time as community manager and if I remember right there were only two other people when I came on. For a product that still had 25 million users there were a lot of days that we couldn't even build the product literally like it would be a red tree every day and we would have to work really hard just to get it into a buildable state. I'm speaking to a room of developers and so I can say that a lot of that was because we were downstream from Firefox which had a thousand people working and developing the product every day and so absorbing that, thousands of changes in a week for someone who only had one single developer working full-time and was difficult. But that's the backdrop against to kick off a story of success. Maybe one of the few stories of really incredible success around sustainability and open source. So why do I have credibility to give a talk on sustainability and success in open source? This is our revenue year over year. 2017 is when I came on, 2023 obviously last year and when I came on it wasn't bad. We had $700,000 in donation revenue last year. That number is $8.9 million in donation revenue. Going on to an online percentage calculator, I determined that's a 1,108% increase over the last six years. When you start to move, when the bars in that graph start to get higher and higher, people start asking you why? And it may seem like an easy answer. You're like, there has to be a why. And yes, the whole time I knew there was a why. But describing that to someone else felt like I need you to understand about 600 things and then we can talk about why. So for you guys, I did the thing of distilling 600 things into three, which I'm going to try to impart on you. And why is this important? Well, I believe it's really important for every open source project to have a map to sustainability. I took a class a few years ago around product management. And one thing I was really struck by was the professor said every product before you launch it should have a business model. Even if you're not going to deploy the business model to make money, if you're a startup, let's say, your intention is just to grab users, as many users as you can, and not monetize the product. It was this person's opinion that you should know when and how you're intending to monetize the product. In open source software, I would say that that's not always the case. You know, you don't always intend to monetize your open source software and in fact, you shouldn't always. But you should always have a plan for how you're going to sustain the project. And I don't need to, you know, originally I was going to find some articles of really big open source projects that had really sad episodes or endings. But then I decided you guys know many of them, which is, the story is pretty common, right? Like you have a person or a group of people working on a piece of software. It gets really popular. And then something happens to that person or this group of people. And it doesn't even have to be a sad episode. It could be that the person developing the software has a kid and loses time to work on the software. But it could also be sad, you know, they die. Okay, now the software is going to go stale. There's no one to maintain it. And so it's my strong opinion, just like that professor who said you need a business model, that you need a sustainability plan for your software. And this will come back to Thunderbird again. So I can tell the rest of that story. I'm going to drink some water. A sustainability plan is you just forecasting into the future and saying this project, or maybe you don't have to forecast in the future. Maybe you're already running a successful open source project. But you sit down, you say, how are we going to make sure that there's always someone or a group of people developing this software? Or you can say, which I don't recommend, but you could say, no, it's just me. And when I die, it dies. And that's okay too. Now, at least you've thought it through. But let's say that you want your software to be healthy and go into the future in a way that people can count on it. As I spill my water all over myself in front of like 100 people, you need to lay out a plan. And for Thunderbird, I think that plan should have been developed from the start. So that plan would have said we're going to try to a couple of methods to monetize this. And because we know we need at least a handful of developers in order to consistently develop this software, simple, right? That's all it needs to be. You say one person is not going to maintain this. So we're going to have to figure out how to monetize this project so that we can pay people to work on it in order to sustain it. And the fact that that didn't work out is why Thunderbird entered dire straights. And what we ultimately figured out, at least for now, and plans can change, is that our sustainability plan was to tell, we know we need at least a handful of people working on Thunderbird in order for the software to work. That's just, I live through the volunteer days of it, and that's not, it was not pleasant, it was not good, it was not a way to sustain the software. And so the answer had to be, no, we have to pay people to work on this, to do the crappy stuff, to maintain it. It's not always pretty work, but it has to be done and therefore we need money, because people are not going to actually do some of the stuff that we need them to do unless we pay them, because we know that from experience. We were trying to, I was talking to a bunch of people about what Thunderbird looked like when I came on. And to drive this point home, I've now heard a bunch of different ways to describe it, but the one that I thought of last night is, you go into a house, maybe you buy a house, and you walk in and you're like, man, this is a really nice house, it's big. A lot of people, you know, really like it, they come by, they check it out. And then you're like, all I'm going to do is I'm just going to update the kitchen, because the kitchen looks a little dated. And like you start pulling out the cabinets, and there's just like termites everywhere, there's a pipe just like shooting water off in one direction, and like, it's just not pretty. And every time you go to change something in that house, it's the same thing. Like you just like open a closet, and there's like a clown in there, and you're just like, I'm just going to close the closet and not think about that for a little while. And so for us, we had to somehow, we said, okay, you can't sustain this project with just random people off the internet working on whatever they want to do, whatever they want to add. In fact, that's really bad. That's like, all this structural problems with this house, and someone's like putting a pool on the roof. And so we needed a plan, and that came in the form of trying to monetize it. So, to come to a sustainability plan, you kind of have to ask yourself some key questions. And this is probably not that crazy, but I wonder how many of you have actually asked yourself this about your open source project. If you're not working on a project that's like Red Hat, like we know how Red Hat is sustained, we know how like, you're going to make money, you know how you're going to sustain the project, I hope, I don't know. But for me, something that I took stock of over the years at Thunderbird and tried to think about is, okay, how much effort does it take to do this project? In a perfect, maybe not a perfect world, in a the minimum viable level of effort, what is the minimum viable level of effort that it takes to maintain this project? And then I thought, you know, who are the key stakeholders? Well, that one was really easy, because that was like the tens of millions of users. They were definitely like the number one stakeholder. And then the third one, which I'm not sure anyone had really spent a lot of time thinking about, is how do we communicate with those stakeholders? Which businesses creating proprietary software, right from the outset, this is a thought that they're mulling around in their head, which is, okay, someone downloads a product, there's either something within the product that we're going to use as a mechanism to have a conversation with our users. Or there are other channels by which we have these conversations with our users. And that can serve a variety of purposes, right? Like one, obviously, for commercial product is to convert people to paying users. Maybe it's a free product and then you pay for some kind of additional features or whatever. For us, we really hadn't developed these mechanisms, and the people that we had following us, we're following us in places that we're only able to speak to a fraction of those users. So whether that's an IRC channel, you know, obviously, you're not capturing all of the Thunderbird users there, or a Twitter handle, not capturing even remotely all of the users there. And then once you kind of have answered the first one, you think about this fourth one, which is that aside the effort, what else does a project need? For Thunderbird, it's infrastructure, and building and distributing Thunderbird alone is like a source of cost in and of itself. So I call that out because you have to kind of take a holistic look at like, what does it take to build and distribute and make this software available to everyone? Okay, so I've told you a lot of really basic things that you probably could think about if you spent 15 minutes thinking about projects, and why did I do that? I've been contributing to open source for probably 20 years in some capacity or another, and I'm losing a lot of people. We're losing them. Yeah, I'm gonna have to like do a cartwheel or something. I call this out because I don't think most open source projects start from a place of any plan. I don't know if you guys agree, but oftentimes it's just like, I'm gonna do this cool thing. It's gonna scratch my itch, or it's gonna scratch the itch of people I know, and what happens happens. And talking about a sustainability plan, I'm asking you, please, for the love of God, don't do that. Just take the extra whatever it is, 10 minutes, to just think through, okay, best case scenario, I create this software. How am I gonna do, or how am I gonna think about these four things? Because what we don't need in the world is more un-maintained open source software that people rely on. Because that creates a bad ecosystem and a bad reputation for open source software, and we all know that. We all know people who we've turned on to open source software who essentially just say, oh yeah, well, I'll just use Thunderbird because I don't want to trash any project. But something you could imagine hearing about Thunderbird is someone's like, this is just crappy outlook. I can't afford outlook, or I can't use outlook for some reason, so I'm using crappy outlook. I don't believe that, but that is the result you get for not having a sustainable maintained open source project. And then if you're like family members of mine, and you learn that this is an open source project, and it is crappy outlook or crappy Photoshop, or insert the software of your choice here, you associate open source software with not as good. It's just not as good. And I don't think any of us want that to be the outcome of our work. One major challenge that I had in helping get Thunderbird to a place where it was sustainable was I had a community of developers who, whether or not they'd admit to believing this, this is what they believed. Not this, this is what I believe, but they believe that money was bad. Anytime I brought up the fact that we could do this to raise donations, they're like, we don't want to annoy the user. We should just make the software, and if they manage to stumble across a donation page somehow randomly, that's okay. And the thought I kept having in these conversations, because I was saying, oh, okay, once a year, we could just put a little thing, a little pop up or something that just says, did you know Thunderbird runs on donations? That's how we pay folks, like please give. And I got so much pushback for that. And it's because when you really talk to people, they thought that asking for money in a direct way to users was somehow not a activity that an open source project should do. I don't know where that comes from, but I know it's true of a lot of open source projects, because when folks started asking us, how did you raise donations? And I told them, one mechanism is we hit all users with a full page donation appeal every year. You just saw like their faces drop, and you could just tell that they were like, I'm not going to do that. And you know, it's funny because I also felt uncomfortable with that. I thought this is going to look like spam. This is going to be annoying. Maybe users will leave because it'll be like, I just want to do my email. And this is like bothering me, and I never want to see a popup again like this. And I don't remember which piece of software it was, but a little later after that thought, I was using some, oh, it was Evernote. I opened up Evernote. I don't actually use Evernote, but I had used it in the past, and I was like, I know I put a number in here that I need to remember. So I'm going to open it up and find that number. Before I could ever look at a note, it was like three things in succession that I had to exit out of. And it was, they were all like, you should pay us. And then it's like, no. And it's like, yeah, but you should pay us. And I'm like, no. It's like, well, if you don't pay us, we're going to do X, Y, and Z. And I'm like, no, I just need my number. But after that, I was like, you know what? I bet that's happened a lot in a lot of different programs. And I don't remember it because that's not an activity. Exiting out was not an activity that I committed to memory. It's just like, oh yeah, like, of course they're going to ask me for money, like, whatever. But that, my friends, was a eureka moment of like, I don't remember any software asking me for money, but I know they do it all the time. And so that's what we did. Popped up full screen once a year. Help keep Thunderbird alive. This is like the history of Thunderbird in one page. Did you know that less than 1% of Thunderbird users fund all of our work? That was especially true when this was displayed. Not too long ago, Thunderbird was on the verge of extinction. We don't show advertisements or sell your data. We're completely funded by donations from our community. That, my friends, is a $6.8 million appeal. Well, now it's more than that because we've run it twice. So it's a $15 million appeal. That, just this. And, you know, when I ask people, our users, do you remember the bird, the end of year donation appeal? I've asked probably, I don't know, 100, 150 users at this point about this appeal. You know what most of them say? I don't know what you're talking about. The bird, the Thunderbird, I'm like, no, he's a bird holding a heart. He popped up over your whole paid, you know, he took over your whole browser like a couple weeks back. No, I don't know. I don't remember. I'm sorry. This bird sounds really important to you, but I don't remember it. And some KDE guys asked me about, you know, raising donations. And I told them exactly what you might imagine came out of this, which was like, pop something up a couple times a year full screen. And they're like, you know, we don't ask often. Give us a little money. And they're like, people will, people will revolt. They'll change to, you know, I'm like, maybe, but honestly, it'll pop up in December and you'll ask them a week later. And they'll have no idea what you're talking about. Because if they don't donate, they're just going to hit the X and just move on and not think about it. Maybe next year they're like, oh, yeah, didn't they do this last year? But probably not. Because the moment they go to Wikipedia and get that appeal from our good friend, Mr. Wales, you know, these things are all invisible. This year we tried displaying this a few times a year. Donations went up. Nobody remembered. Nobody remembers the bird. It just doesn't, that's not how human brains work. We're so inundated with incoming signals all the time. And that's the point. You're not that annoying, maybe to your friends. But these appeals are not that annoying. Because we live in an information environment where this is just something people expect and something that people have grown to not see anymore. And so that's the takeaway. But there is one other thing. And now we'll go back into Thunderbird. Because ultimately I'm up here to both tell you how to make a sustainable project and implant in your head that you're going to come out of this and you're going to install Thunderbird. So we asked our users for money. That's pretty simple. Because we knew in order to be sustainable that we really needed at least 10 engineers working on it just to make Thunderbird run, not to do a bunch of fancy stuff just to go. And once we were able to set up this model, it became a lot easier to convince the other developers on the project. Those ones who were like, money is bad. They're like, money's not that bad. And it helps us sustain the project. And users understood that too. That appeal said, essentially, you get value from this and without you, it doesn't work. And you know, I lied a little bit earlier. I did hear from users who do remember the appeal. And especially after the first one. And I'm going to look at someone in the audience because he may have seen a negative comment. But I never saw a negative comment about the full screen takeover. In fact, I saw positivity of folks saying, I just assumed that you guys ran off of like Google money. If I had known that you were reliant on donations all these years, I would have been donating the whole time. And I got that hundreds of times of just people saying, of course, I want the tool that I'm using every day to manage my email, to be sustainable, to be funded. But you never told me that you needed my help. You never told me that that was on me. And so I remember the strangest feeling of being thanked for telling people to give me their money. So there you have it. But this slide, which I haven't even talked about since I put it up, is the other piece, which is you're all maintaining different software. Or maybe you're not. I don't know what you do. But I assume most of you are engaged with some piece of software that you develop. And it's not a one size fits all. If your software isn't public facing like Thunderbird and doesn't have 20 million users, this is not irrelevant to you. You probably do have stakeholders. What is your conversation with them look like? If it's a big enterprise that's leveraging your software, are you talking to them? Have you talked to them? Have you said, hey, don't be like the mafia and be like something really bad could happen. The software, it could go away. But maybe you should at least remind them that that could happen. Something really bad could happen. Somebody could get hit by a bus. That guy is me. But if you don't make that clear to the people who rely on your software, if you're not sharing with them the need, the pieces of the story that tell them how your software is sustainable, they're not going to know. And then you're going to have these, I remember, and this is the example I'll finally use. What piece of software was it? Any of you could answer this and I'm not going to pull it out of my head. But we've seen it, right? Just go to ours, Technica. Some open source project that's used by Google has a security vulnerability in it, probably today. At some point, that's going to be exploited. And the ours, Technica story, it's the same every time. It's like major exploit found in this product. And at some product we all use. And then they figure out, oh yeah, it was because they were using this library. And the maintainer stopped maintaining it 10 years ago. Google would have much preferred that that maintainer be like whatever it is in the repo maybe, it's just a big thing at the top. It's just like, I can't afford rent. I'm going to have to go work at Starbucks unless somebody gives me some money to work on this thing. I guarantee a lot of developers downstream would be like, oh, that's really bad. We need to give this guy some money. And that's what I'm talking about. Don't be that guy. That is a dick move. You created the thing. You're not maintaining it. And that's not the dick move. The dick move is before you stop maintaining it, what did you do to let everybody know who relies on it? What was happening? Why is it going away? What can you do to prevent it from going away? And that's something we see all the time. You guys know it. I know it. So just think about it. Think about it. Today, tomorrow, how am I going to make my software sustainable for the future for the people who rely on it? And you'll think of something. Try different things. Make that part of your day. Make that part of your development process. And I know that's the most annoying thing to say. But just, you know, like I'm going to dedicate 3% of my time working on the project to just figuring out how to make it sustainable. Because I think that's good not just for you, but for me, for open source, for this software movement. And yeah, that's my talk. Thank you. Thank you. And I'm happy to take some questions you can ask me about Thunderbird. You can ask me about sustainability. You can ask me whatever you want. You can ask me how I get my hair like this. But we have another five minutes with each other. Thank you for your talk this morning. Are you helping your colleagues at Firefox? I get asked a lot of questions from those teams. It's true. At first it was an anomaly. You know, people didn't really know what to think when we started on this path to sustainability. And to be honest, at first it was met with like maybe some snickers of like, oh, they're funded by donations. Well, let's see how long that lasts. Now, you know, six or seven years into it, I get a lot more serious questions of, because it's kind of changed, right? It's like, oh, wait, folks are paying you for no value exchange, no immediate value exchange? Maybe we should figure out why they're doing that and whether or not we can apply that to other things. But I will say, though, it is scary too. It's scary to run off of just pure donations. Because, and I don't think you should always choose a donation model for your open source project. If you have any other model available to you, use that because I do tell my team one of my biggest worries is that in an economic downturn, for instance, donating to the Underboard would probably be a much lower priority than buying food. And so, you know, donations is not necessarily always the best model for sustaining an open source project. And you should use, when I put, like, do it your way, like you should use whatever mechanism, whatever mechanisms you have available that are going to make a project most sustainable, in my opinion. Any other questions? Yes. What's the current size of the Thunderbird team? I'm trying to find who said that over here, down here. It's 32. We're hiring an additional 13 roles this year. So it'll be 45 by the end of the year. And we expect to continue growing minus an economic downturn. So, yeah. There's one down here. Do you get a lot of contributions from people who you have not hired, like from random people on the internet? Or is it mostly the pay developers that are developing Thunderbird? I didn't hear all of that. Can I get it one more time? I was asking if you get a lot of contributions from people on the internet, as opposed to pay developers? We used to get more, so we used to get more contributions out of necessity from our community, because they were maintaining the project. So from 2012 to 2017, maybe even 2018, I would say that we had to have a majority of contributions from the community. But as I said at the beginning, the problem was people were scratching their own itch and not addressing the actual product needs. That's right. And so that was a really weird time, because you may have a really bad, bad thing happening in the software. And we had a few. But if it was going to take two months of consistent development to fix it, volunteers, they just were like, yeah, I'm really just here to fix filters, and then I'm out. And so once again, sustainability, you have to have people around your project to work on the hard things. So you mentioned you showed the full page thingy. Did you show it to everybody the same day, or is it like staged? Because I feel like everybody the same day might end up in Twitter, like, oh, what is this thing? I'm like snowballing. Yeah. Yeah. So the answer is we have not been very sophisticated. We hit everyone. So we don't hit them on the same day, but that's not because we're clever. And we developed a system for, you know, spreading that out and a B tested that. That's because they see it when they update to the newest version. And not everyone updates the day of the release. In fact, no one like, okay, I shouldn't say that. Not just it's spread out because people update on different days. But that also gives us some data of like, hey, well, we performed better this week than we did last week, you know, but with this group. So yeah. But in the future, we'll have a more sophisticated mechanism and I'll come back and tell you tell you what worked and what didn't. Two quick questions. One, do you use a particular way of collecting the money? Like is it a PayPal driven thing? And second one is do you maintain some kind of a pad for, you know, to try and smooth out that economic curve like a year in the bank, so to speak? Yes. Yeah, yeah, yeah. Very good questions. On the first one, we used to have our homegrown donations stack from Mozilla. So it's a stack that Mozilla created for the foundation. You know, there's this thing of in any endeavor where it's like, does this make your beer taste better? And we found out the answer was no, maintaining a donation stack does not make our beer taste better. And so we've moved to a platform called fundraise up, which supported supports way more ways of giving than we ever did. And so I would recommend you like, if you don't have to create a donation stack, please don't do that. Like use some tool or platform to do that for you. And then the second question, which I'm trying to remember what it was. Oh, yes, we, yeah, we have a year in the bank and that's what that's, that's how we hedge against potential donations. So far donations have only gone up, which is good, but we're hedging against, you know, a scenario that I talked about. And I think that's wise. It definitely gives our developers who know we're donation funded, who work for us a lot of a sense of stability, which is good. Well, thanks for your talk. And I see possible applications of this method also not only for software, but also for creative commons, art projects or something like that. Yeah, just to throw it in. Yeah, I think, I mean, it's, and it's super simple, right? And my time's done. So I'll answer this and then I'll be done. The model is just this one that like, even if you're creating something that's free, you should be thinking about how it is you're going to support that work. And especially if there are people in the world who find value from what you're doing, you should be communicating with them in order to tell them to continue to receive this value. There just has to be some layer of support around it, which I think is it's straightforward, but I just don't think it's a process that a lot of folks go through when they're creating something for free. They don't think about, they're always thinking like best term, best case scenario, like I'm going to be able to dedicate this amount of time to this in perpetuity for the rest of my life. And then something happens, you know, like maybe you're like me and you have twins. And then everything that you used to do, all your open source projects, for instance, in my case, they just stopped being developed if they're only developed by me. And if I could go back to myself like six years ago, I would, you know, like smack myself and be like, you don't have time for this, like you don't have time for maintaining all these different projects. How are you going to, who else you should be pulling in other people to help maintain them, whatever it is, you know, but it was not a thought that I had. And that's what I would like everybody to think about, you know, before you start something, or if you're in the middle of it right now, how are you going to sustain it in the long term? Are you going to sustain it? You know, okay. Cool. Thank you so much.
NetBSD 10: Thirty years, still going strong!
Okay, finally we're set up. Many thanks for your patience. Let us kick off immediately with NetBSD 10. So as you probably know, NetBSD turned 30 years old, or 30 years young last year. We've tremendous improved in the security support of CPU, GPU and stuff like that. Also NetBSD is quite known for its package system. We have here Benny with us. Benny has been a developer for more than 10 years on NetBSD. He's been as well with us many times at the NetBSD developer room here at FusDem. So who better than him could talk about the 30 years of NetBSD. So ladies and gentlemen, please welcome Benny Siegert. Thank you for the kind words of introduction. So welcome to this talk today. I have mainly three topics for you. 30 years of BSD. I want to talk about the new NetBSD release, NetBSD 10. And are we at 50 years of BSD yet? And that's what I want to start with. And the answer is maybe, depending on how you count. So one BSD, the first Berkeley software distribution was released in 1978, so that's not 50 years. However, the work on BSD started in 1974 when the computer science research group at Berkeley University got a PDP 11 and they installed Unix on it and they started hacking on it. In fact, they didn't have sole possession of it. They had shared with the mathematicians and they were using a different OS. So twice a day they had to switch the stack of disks and reboot. So one BSD, I don't know, isn't that interesting? It's mostly like a collection of utilities. You already need Unix installed and you can install some BSD stuff on it. Two BSDs may be the interesting one, which came out in 1979 because it's kind of a full system. And what I find amazing is that two BSD is still maintained today. So there's a collection of some crazy folks on the internet, obviously, that are pushing out patches for two BSD every once in a while. The last one was less than a year ago. And so like one BSD, two BSDs only for the PDP 11, but you can emulate one. You can use SimH, run a PDP 11 emulator and run 2.11 BSD and see what it's like. You can even go one step further if you're willing to do this and buy this thing here, which is called a Pi DP 11. It is the front panel from a PDP 11 model 70, except shrunk by a factor of three. So it's not quite as bulky. And in the back you stick a Raspberry Pi and it runs a PDP 11 emulator. And all the lights and all the switches and all the key switch even work do the right thing. So you can do this and run any of the PDP 11 operating systems, including 2.11 BSD in 2024, that's your thing. And then three BSD made a major change that they did, they ported it to Vax. And at the time it was still single architecture, so PDP 11's board vanished, instead was replaced by Vax and 4BSD was kind of the same. By the way, this will be very abridged. I will not go into all the details. BSD history would be its own talk and other people such as Marshall Kirk McKewsick have done this much better than I could. But anyway, so I want to highlight one release, which is a bit weird. Frankly like Berkeley is terrible at naming. 4.3 BSD Tahoe was named because in addition to the Vax it supported the new Tahoe architecture. Now you probably have not heard of the Tahoe architecture. That's because it was a colossal failure. And the only, this is what the workstations look like. The only company manufacturing these workstations went out of business about two months later. So there are practically no surviving machines. I don't think anybody knows much about these. Apparently Pixar had one and they used it for special effects running BSD. What was special about this release is that it was both for the Vax and the Tahoe. So they separated all the bits that are specific to an architecture support. So people took this and said like, I don't care about Tahoe, I want to run it on something else so they ported it to all sorts of other architectures. And this is sort of the origin of the multi-architecture nature of modern BSD operating systems. And then again I'm leaving out a few steps. Then it gets very messy. There was a lawsuit involved and so on. And at some point there was a release of BSD that ran on PCs. It was called 386 BSD. And the team was I think two people. And there was a lot of buzz around it. It was a lot of development and a lot of patch sets and stuff. But the development of 386 BSD itself was kind of sluggish. So people started taking matters into their own hands and that's where NetBSD comes from. So I found the announcement from 1993 of NetBSD 0.8 for whatever reason they called their first release 0.8. And it starts off a bit odd like we've been wondering what I've been up to, blah, blah, you know I've built this new system. I'm calling it NetBSD. Essentially what it is realistically it's the last 386 BSD release 0.1. And all the patches that were floating around the net and that were okay, added on top. And that's also why it's called NetBSD. Because in 1993 not that many people had internet access but NetBSD from the start embraced the internet as a method of development and getting patches and distributing the OS and so on. So NetBSD as the name implies is a creation of the members of the network community, meaning the internet, without the net it's likely that this release would not have come about. So this is 30 years. This is a bit more than 30 years. It's not quite 31 so I guess where it counts. I want to dwell on this announcement for a bit more. By the way there's four signatories on the bottom of the announcement. You see CGD is the one who sent it, but there's also Theo de Rage is one of the signatories. He was one of the founding members of NetBSD before they kicked him out and he ended up founding OpenBSD. But what is interesting is that it tells a little bit more about the purpose and the plans for NetBSD. And it's interesting seeing these and comparing what has happened since. So I've added a few highlights here. Why do this at all? And it says we consider this an escape from the political wars surrounding our wonderful operating system and we want to do a stable production quality release. And also this bit we intend to integrate free positive changes from whoever will provide them and we hope to create a stable, accessible system and to be responsive to the needs and desires of the users. So here you can see the project values laid out in short form. No religious wars, stability, community, acceptance of patches if they're okay. And I think these have largely held up honestly. Like 30 years later I think the NetBSD project is holding up these values even though probably most developers of today haven't read this thing here. I mean I hadn't until I prepared this talk. That's quite nice. And then what ended up happening is sort of a Cambrian explosion and one of these aspects of people contributing their stuff is people contributed support for the machines they were using. And that is how NetBSD got this reputation of running on all the things, even including toasters. So this was sort of the peak of NetBSD porting. The year was 1995. This person here is Jeff Rizzo, a NetBSD developer. And their company Technology Systems Design presents toaster running NetBSD. Of course it runs NetBSD. And the point of the project at the time, so you see there's an arm board there which is in the toaster. But it's sort of IoT in a sense. It had remote management and it could actually manage the toasting function. So it could change the duration or the heat or whatever. It has a little display which you cannot really see at the front. So this was like famous, you know. It runs on everything, even on the toaster. And that was 1995. And I want to go slightly heretic here and ask, is any of this relevant today? Because if you look at the list of architectures on NetBSD calls and ports, there's I think 71, if I remember correctly. And they're in three tiers. Tier one is like the good ones. Tier two is the ones that may have some problems, but they're sort of chugging along, maybe not the main focus of the project. And tier three is the ones that are on live support and basically dead. Anyway, if you look at the tier two architectures, they all seem kind of retro in a sense. Dreamcast, really. Amiga, the B-Box, who here has a B-Box? I don't think any one of you has a B-Box. So I'm going to offer this. I think the portability argument is more or less dead. Because there's no modern hardware, I think, that runs NetBSD but not Linux. If you look at say, new RISC 5 boards or something, they come with a Linux kernel. They don't come with a NetBSD kernel. Usually. By the way, this is the list of tier one ports. So the important focus things are ARM, X86, 32, 42, 64 bits, Spark, and Xen emulation, and MIPS and PowerPC. Anyway, so these are the tier one ports. So it's a good list, but still. I mean, is portability the sales argument anymore? I don't think so. So what remains? Why would you want to use NetBSD? And going back to the values we heard about earlier, so we have an accessible system, but it's still powerful and stable. And by accessible, what was meant then, and what I mean now is, it's something you can understand from top to bottom. If you're starting out with Linux, let's say, and you try to figure out how a modern Ubuntu system works with system D and 100 demons running everywhere and things reconfiguring themselves, it's very hard. It's simple on the surface, but underneath hides a ton of complexity trying to make stuff work. NetBSD is different, I think. It's simple throughout, and that way you can understand it, all the layers and how they work together. There's also documentation there. There's a NetBSD guide, which is very long and complete and has a ton of stuff. So you have one eBook, if you will, that explains the system to you. You can read the main pages, which unlike in Linux, main pages are usually complete and well written. There's a system D, maybe some of you like that. But I keep saying to people, why should I try a BSD operating system? I think it's a learning opportunity. And even if you're, say, a Linux user today, and you try BSD for a few months and you come back to Linux, maybe you have learned something about the system. I think that's good. But also, NetBSD has some crazy research things in it, but it's also kind of old school unix in some sense. So I think it's a nice compromise between these two very different worlds. So if you boot it today, you're going to have a graphical console. You have graphics acceleration. You have like ZFS, modern volume management. You can run all sorts of software. You can run GNOME on it. You can run Firefox on it. You can run Rust programs, Go programs. It's all there. And I think the main actual uses that people are making of NetBSD, one is on the server. It is a very solid, very reliable server OS for things like routers and stuff, firewalls. But also, it is kind of surprisingly useful as a desktop OS. Maybe you might have to make some compromises here and there, but like you can listen to stuff, listen to music in Firefox or some other player. You can watch YouTube videos. You can use LibreOffice, whatever you want. It's all fine. Or you can do things like this here. This is a Dynabook running NetBSD with a patent input. Again coming back to the announcement email, I keep coming back to this. There's so much in it. The welcoming community is also an important thing. I think not all open source projects, not even all BSD projects, have welcoming communities. I think NetBSD does. This is the group photo from the 2019 package source con in Cambridge. I don't think we have done one since COVID hit, unfortunately. But like people are generally nice and welcoming. I think that's very important. And it's a good thing to have. Changing gears a little bit, I want to talk to you about the new release. We've done 30 years of development, but what do we have to show for it? We have NetBC 10, which in fact is not released. When I wrote the abstract for this talk in October, everybody was saying, you know, it's going to be released in a month. So I just put this as a given here. We have the new NetBC 10 release. I'm going to talk about it. It's not there. We have release candidate number three. But it's okay. It's all release candidate number three. To understand where we are, I want to talk a little bit about the release timeline and maybe also about the way NetBC development is organized in general. So NetBC has a core team and only the people, well, has a core team and has a slightly larger team of developers that have joined the NetBC foundation and have officially joined the project. And they're the ones that can directly commit to the repositories. So if you're not a NetBC developer, you cannot directly commit to the NetBC repo. And all the development of NetBC happens on the head branch, well, on the main line. And then what happens is every once in a while, there's a branch for the numbered releases. And between the branch and the release of the .0 version, there can be quite a lot of time because you find that there are some problems that you didn't notice. Usually you know, usually once you have a release branch and you're in beta at that point, many, many more people start using it. And they find many more problems that you didn't know about, for instance. So the release of NetBC 10 is probably going to be in February sometime. But the branch of the 10 branch was in December 22. So it's already been branched for quite a bit. And after the branch is done, there are no direct commits by random people into the branch. But instead it's all going, it goes by tickets and they're reviewed and there's the discussion on them. So it's a bit more stable in that sense. So like, so the basis of this development is already a bit old. It's already from 2022. So if you have hardware that's newer than 22 and it doesn't work on NetBC 10, maybe current is actually better. But also the point I want to make is the NetBC 9 branch was in July 2019. So you have three and a half years of trunk development that has also gone into this. So I'm going to explain some things that have changed that are new, but there's like a million other small changes that would be far too boring to talk about. So the one thing that you might immediately notice is performance. Performance has increased a great deal, especially file system performance, which to be fair was not very good before. It's good now, I think. And the scheduler has improved a lot. So if you have a system that has big and little cores, for example, an ARM or an Intel CPU with performance cores, they call it and power saving ones or something, the scheduler is aware of that. And depending on how much like punch you need, it'll use one or the other. That's very nice. The graphics drivers have updated to be on power Linux 5.6. So you have accelerated support for AMD, for Intel, for NVIDIA. There's a new wire guard driver, which may be interesting. So if you're using tail scale or another VPN on wire guard basis, and this is not the original wire guard, but it's a re-implementation from the spec. And then there's much improved DTFS, newer version, and also much improved virtualization. For example, for Xen, I don't know if a lot of you use Xen, but in the past Xen had two virtualization modes. There was the HVM mode, which did not require any collaboration from the OS. So you could run an unmodified Windows on it, but it was kind of slow. And then there's the PV mode, para-virtualized one, where the OS is sort of aware that it is running on Xen, and it's not actually, it's its own architecture, basically. Like the kernel is directly written against the interface of the hypervisor instead of pretending to talk to hardware. And Xen folks have added three sort of in-between modes, and I think we can do them all. So one thing you can do if you're on HVM, you can gain speed by using para-virtualized driver for network and storage. And these are called VIRT-IO. So NetBeasty has those. There's a mode called PVHVM, where you also have interrupt and timer that are, like, not pretending to talk to some Intel interrupt controller, but to talk to the hypervisor. This is called PVHVM. I have that too. And the highest performance mode these days is called PVH, which is a para-virtualized system. So I used the Xen kernel, not the, I don't know, AMD64 kernel, but it uses hardware support for page tables and stuff. So this is the highest performance mode, the PVH mode. If you're using Xen, this is what you should be running. And the whole thing is more multiprocessor aware. The DOM zero, which is the sort of the host system, can be multiprocessor. The individual VMs can be multiprocessor. This is really nice. This graphic here comes from Brendan Gregg's blog, by the way. What color is the Xen? If you don't know Brendan Gregg, you should look him up. He does good stuff. He's also done amazing talks at Fostern before. And then in terms of hardware, I think the biggest amount of work has gone into ARM. This in general, I.O. is a lot better in ARM, like if you're running it on a Raspberry Pi, let's say, you'll notice you'll have more network throughput, more disk throughput. That's all really nice. There's support for the security features in modern ARM processors in the ARM V8E instruction set. Many of them help against the sort of return-oriented programming, like exploits. For instance, you can authenticate, the kernel can authenticate the pointers. The pointers are tagged with a secret tag, and only if the CPU will check if the tag is there, otherwise the pointer cannot be dereferenced. So you can't just take random memory and say, this is no pointer. This branch target identification, where in the instruction set, only instruction set level, you can say, here's a branch, but it's only allowed to jump to this address or this address, something like this. And there's a mode called privilege access never, where the user space can actually forbid the kernel from accessing a page. So while you're holding your key material, for example, in memory, you can mark it as privileged access never, and then nobody else has access to it. So that's great. Crypto support, using crypto instructions if the CPU has them, and a lot of new systems. And there's three things that might bite you if you upgrade. So I want to mention them specifically. One is SSL route certificates are now in the base system. So before you had to install a package called Mozilla root certs, and that was always annoying. Nowadays, you install, SSL will just work, certificate validation will just work. It's nice. Entropy management means that if you don't have entropy, if you don't have randomness, then the kernel will not give you random data. And practice what that means if you're running in a VM mainly and you don't have an entropy source at all, things might hang when they ask for random data. And that's not great. So there's some, the config files have special support for stuff. Also there is a VRTIO RNG driver where the host of your VM can hand randomness to the VMs. And so if you enable that, you may have to enable it in your QMU config or whatever you're using, then this is better. And finally, there's a new feature around POSIX ACLs and extended attributes. So these are attributes on files. And they need a new file system, or rather like a variant of the existing file system. So by default, NetBC has FFS version 2, the fast file system. And there's a variant now that you can choose when you install called FFS v2 EA, which has extended attributes. And if you do that, you should be aware that older NetBC versions cannot read it. And also if you installed current at some point and used that, there was a flag day where that format changed. So you did, I don't know, there's a complicated procedure to not lose your data. So that's by the way one reason why NetBC 10 is late because there was this file system problem. But FFS v2 with extended attribute is not the default, I think. Only if you need POSIX ACLs, you can choose it during installation, but default is just to disable it. So now that I went through like this laundry list of things, I don't know, some of you might be bored. I'm going to ask, what are you going to run it on? Maybe you say, you have convinced me, I want to try out this NetBC 10 thing. What am I going to run it on? And here I have a little gallery of a couple of the ARM SBCs that are now supported in the new release that weren't there before. The Raspberry Pi 4 is very nice. I'm using one myself. The 5 is I think in current, the 3 to 2, et cetera. They're all also there. Then you have the Odroid and N2 plus, the Quad 64, the ASUS Tinker board, the humming board pulse, I don't know, there's various things. They all have very specialties, different SOCs. The Orange Pi 5 I like a lot because it has a pretty powerful processor and you can get it with up to 32 gigs of RAM. You see that slot at the top left, it's an NVMe slot, you can add an SSD there. So for less than 200 euros, you can have a really powerful ARM-based workstation with an Orange Pi 5. Highly recommended. So if you're using it on ARM, it used to be annoying with bootloaders. ARM bootloaders are a bit weird in many ways. This has become a lot better for the Orange Pi, the Raspberry Pi and a couple others. This thing called Tiano Core EDK2. You downloaded it for your model and you just unpack the thing onto your storage medium, whether it's an SSD or memory card or whatever, and then it acts like a BIOS basically. You get a UEFI shell and then you can just use a generic, like, Net-B-Ste-Arm image or some other OS. So that's become very convenient. ARM-SPCs are awesome. They're still the Pinebook Pro. It's a laptop that is, I think it's more than $1.99 now, it's $229. But still, it's very cheap. It has a nice ARM processor display. It's perfectly usable. Net-B-Ste runs really, really well on it. So this thing here talks about current, but now it's in 10. So you have display, backlight control, NVMe, USB. It all works. It's nice. So if you want a Net-B-Ste laptop, maybe that's the best choice. Or how about this thing? I don't know if I can get the video to work. I feel like a boomer. Can I get the video to work? Where's my mouse pointer? So this here is a Nintendo Wii. And then this happens. It turns out, I didn't know that. The Nintendo Wii has a PowerPC processor. So this runs the Net-B-Ste EVB PPC, EVB's evaluation board. So we treated like an SBC with a PowerPC processor. And there you are. This is pretty new. This is a few weeks ago that this support got added. I'm not sure if it's present in 10, but it's there. Here's Net-B-Ste. Did you ever want to run PostFix on your Wii? There you go. You can also run in the cloud. Net-B-Ste is pretty much at home in the cloud. It runs on all sorts of virtualization things. It itself includes several virtualizers. So I've talked about Xen. There's NVMM, which is a hypervisor that's integrated into QMU. There's Hex... I think Hexen... I'm not sure if it's still there, actually. But anyway, so there's various acceleration technologies that are in QMU. You can run it under Beehive if you have a free BST host. You can run it under KVM if you have a Linux host. You can run it on AWS. We have community AMIs available. These are AWS machine images for both Intel and ARM. The ARM ones in particular are nice. They're very cost-effective. There's this project here to build Google Cloud machine images with Net-B-Ste, Canon Vulture, Oracle Cloud, many others. So that's also an option. Why not have Yen? Now that you have a machine, what are you going to run on it? This is where I slightly switch gears and talk about packages. This is my hobby horse because I mostly work on packages. So Net-B-Ste comes with a package collection called Package Source, PKGSRC. And it's actually not Net-B-Ste only. You can run package source on 18 or 20 different OSes. You have 35,000 packages, although not all of them will probably build on all OSes. Once a quarter, we do a stable branch with binary packages. In fact, the last two stable branches have offered binary packages for 10. So I think that's very nice. You can install Net-B-Ste 10 somewhere and install the package manager and get going immediately. If you're on a platform that does not have binary packages you build from source, that's very easy. You just do make package install in the right directory and it downloads all the dependencies and builds them in order. If you're doing Firefox or LibreOffice, you may have to be a bit patient. Like on the Pinebook Pro, it's more an overnight run. But yeah, it works. And you can also update from sources when a new branch comes out. If you're using package in, then it's even easier to change the path to the binary packages and do package in upgrade and it does the thing. These 35,000 packages include all the stuff you expect, like NGINX and Apache and whatnot, but also GNOME and, as I said, LibreOffice, Firefox, Thunderbird, they're all there. It's pretty complete. And also, if you find something that you would like to be available as a package, but it's not there, you can relatively easily get access to the WIP repository work in progress and start creating your own package in there. The barrier is pretty low and it's sort of our gateway drug, I guess, to becoming a full-time contributor. This WIP thing is, I think it's a superb idea because it gives you a well-lit path to becoming a developer, at least in the packages space. But it's just starting to tinker there. And it comes without warranty. The WIP packages, they may be completely broken. So as a user, you don't quite know what you're getting. Maybe you want to stick to the main repo, but if you're tinkering, it's great. And one last thing that I want to mention, you can install a package source tree in an arbitrary prefix and you can do it without root privileges. So if you have a machine, even if it's a Linux or a macOS machine, and you don't have root on it for some reason, but you want to run some software that is not on there, you can bootstrap a package source into some tree and use that. And you can also use that the same way that Python developers would use virtual ENV. You can have a tree that has your development environment in it and then just copy it back and forth and use it on any machine that you're working on. So that's also a very, very powerful thing. It's not just for Python, it's for all software. So I think it's great. So to conclude, we've spent, not me personally, but the NetBeastie project has spent 30 years on NetBeastie. It's still going strong. We have a new release, almost out. A release, Canada 3 is the current one. You should go check it out. Thank you very much. We're open for questions. Any questions? Yeah. Hello. Okay. So we talked about where NetBeastie can run everywhere. And so maybe here a question. What is the current state about regarding the risk 564 integration? The risk 5, is that the question? Yes, exactly. Okay. What's the current state of risk 5? Not quite there yet, I think. There is a NetBeastie risk 5 port. I've seen videos of it booting, but it's not as good as Armiette, I think. Do we have any more questions? Yes. Hello, and thank you for your presentation. Is there something that can be provided to the students, Bachelor of Science, Computer Students, for, let's say, get them more involved in the project, like easy hacks and stuff like that, to tell them, hey, do this and you can learn something. You know OAS, but this is some OAS that you can start hacking with. Thank you. Okay, the question was about sort of low barrier contribution opportunities for students, like what can you do to get into hacking and so on. I think with Package Source WIP we've done a really good thing around packages. I'm not so sure if there is something for NetBeastie itself. We have the Google Summer of Code, so you can become a GSOC student and do a programming project there. If I remember correctly, on netbeastie.org there's a list of projects where you could even get funding for outside of GSOC. Like somebody has done bountains on things, but I'm not sure if they're beginner friendly, and there's also a list of possible projects. But maybe we could do more there. Thanks for the suggestion. Hi, good talk. So I've read a blog, I think it was last year, about the status of Weyland. Hello. So the status of Weyland in NetBeastie, where it was sort of early attempts to get it running. Are we ever expecting Weyland to land properly in NetBeastie, or is it going to be eight years, 10 years, heat death of the universe? It's a good question. The question is about the state of Weyland support. I think there is some Weyland. If I remember correctly, you can run Sway as a Weyland compositor. Various applications like Firefox are built with Weyland support by default. But I think the vast majority of NetBeastie users are still on X11. I'm not sure what it would take to change that, other than more manpower. Weyland on NetBeastie is more or less like a one woman show at the moment. So if you're interested in that, please contact us and contribute. Maybe to add one thing that makes it hard is Weyland has a lot of Linux-isms. For instance, the Weyland input API just takes the Linux input.h and wraps it, and that's it. So you get to reimplement parts of Linux APIs, which is annoying. More questions? Coming back to the question of NetBeastie is suitable for student projects. I'd like to talk with the person who asked that question. I have more information. Hi. So I've been in this Linux user for a while, and I'm interested in the BSD world. Why would I want to try NetBeastie instead of FreeBSD, say, on my laptop? That's a difficult question. Why would you want to try NetBeastie and not FreeBSD? FreeBSD has many of these things that I mentioned for NetBeastie as well. FreeBSD has a lot bigger community, I think more contributors and so on. What's in it for you personally? I think for some, depending on what the hardware is, the support for NetBeastie or FreeBSD might be better. Again if you want to get involved in the community, the FreeBSD community is bigger, but the NetBeastie community might be more accommodating. I don't know. I'm struggling a bit with an answer there. I think they're both good. You could try both and see which fits. Questions? Anyone? Who was it? For the last few years we have seen Linux take over many things. Things we take for granted about how the Linux in general work. There has been a lot of efforts to reimplement functionality in BSDs. I was wondering is there any organized effort to counter that? Not all of these interfaces are the best designs. There's a lot of room for improvement there, but I have seen BSDs lag behind. There's an engagement issue when it comes to companies that it's a lot easier for people to justify contributing to Linux versus contributing to BSDs just based on license, even though that's a bad idea anyway, because projects get founded to a point and then making things open source of course allows the cost for the companies. Is there anything going on to improve these situations? I think there were several questions. You said Linux has taken over mindshare. If you're at a company it is easier to justify contributing to Linux than to BSD on impact arguments, which is true, I think. There was also the aspect about Linux folks re-implementing more and more classical things with system D and network manager and all these things, or getting rid of the if config command. I think the BSD community in general just doesn't follow along in these things. There's no system D. There's no IP route 2 or whatever it's called. The route command still works the same way it worked in 1982. As for the impact argument, I think if you're looking for sort of impact in that sense, then let's say a contribution to free BSD might be more justifiable than to net BSD. In a company it depends on what you use. If you build your service on net BSD, then contributing to net BSD makes a lot of sense. If you're building your service based on Linux, then it might not. For example, Netflix, they're highest throughput streaming servers. They're on free BSD, not Linux, because free BSD achieves a higher throughput with the same hardware. It all depends. Maybe one other aspect, talking for myself personally, my company does not use net BSD in production. But I work on it precisely because it is not like my day job. So I guess you partially touched the answer to my question. I see that there is good support for Zen. So do you have any insights about application and workloads which people using on top of BSD under those hypervisors? The question was, I think, what applications are people running on net BST on Zen? Anything really. You can get Zen hosting from a lot of places. So get a virtual server that's a paravirtualized Zen VM. It used to be, AWS used to be like that. I don't think they offer that anymore. So some of it is just like a cheap shell access for somebody. You have a home in the internet somewhere that you can SSH into. I've used it for web servers, mail servers, file servers, all sorts of things. Hi, thank you. I have two questions if it's okay to related questions. The first one is the community looks really nice and welcoming. But do you think it's maybe too small with the best factor of one? And second question, how reactive is the community for incorporating new changes? Okay, two questions. One of them is about the size of the community. I think it would be great if it were bigger. Honestly, net BST is at least in some ways, I think, kind of minimal stuffing in a sense. We could use more hands on many things. Although it's not dead or in decline, which is good. I regularly see new people coming in. Maybe 10 years ago, there were a lot of really old school folks that were around since the beginning. There's been sort of a generational change as it's bound to happen after 20 years of running something. There's a lot of younger folks in here, which is nice. People would be welcome, of course. What was the second part of the question? How reactive is the community? It depends, unfortunately. Sometimes very fast, sometimes very slow. I don't know. There's no SLA or anything. What Ryan said in the previous talk, if you have people from the community working on something together, they're going to scratch their own itch. If you submit a patch and somebody sees it and thinks, oh, this looks interesting, they might react to it immediately. Sometimes you might have to ask on IRC or ping the thread again on the mailing list to get somebody to react. A lot of people are busy in their day jobs, unfortunately. Many, many thanks to all for attending and your questions. Many thanks, Benny.
Reproducible Builds: The First Ten Years
Our next talk is about reproducible builds the first 10 years and I'll let Holger explain Hello everybody. So I'm not a lunar I'm Holger but that was the interest light of lunar's talk 10 years and two days ago. I'm based in Hamburg I work on Debian since many years and I've also been to foster many years. Ten years ago we did the first setup with all videos and all rooms and then I gave up with video and cut and voile for reproducible builds. We're working mostly on Debian but I try to make or try to contribute help make all software all free software reproducible and it's pretty complex topic ask me anything anytime during the next two days because there's a lot of information in this talk. So reproducible builds. This is not my talk but the talk of the people working on this these are over 150 people and it's not my knowledge we have them in Git so if you are missing there then you can add yourself to this repository and be here. So about you who knows about reproducible builds why and how a bit. Yay that is awesome I can go home now thank you. Who has contributed to reproducible builds? Wee thank you. Who knows that reproducible builds have been around for more than 30 years? Ten years. Because 30 years really old not as old as net BST or maybe it is. Who knows about S-Bomb? The industry. So S-Bomb is software build of materials and basically our build info files from 2014 also already have this contain all the build environment describe what's in there. It's the same concept more or less. And we need you it's not going reproducible builds nothing five or ten or even fifty people can do it needs to be done in every software project needs continuous testing that software is reproducible. But I think we can do it and there's still a lot of work to do. So I give a short introduction the problem source code is freely available which is not a problem but most people use binaries and that is a problem. And no one really knows if this binary really comes from the source code. I'll get to that in a moment. And as a result there are various supply chain attacks. So long time ago more than ten years ago there was a threat on devil mailing list in 2007 and it was known but it seemed undoable. This email would be really cool if deviant policy required that packages could be rebuilt bit identical from source and then somebody replied I see no benefit. Someone else replied I for one think this is technically infeasible but hey I'm happy to be proved wrong. I'm happy to prove them wrong. So that was ten years ago but the idea also appeared in the year 2000 already in another threat. And then in 2017 we learned that GCC was reproducible in the early 90s on several architectures and not only GCC but also binutils and the whole new tool chain. But that got lost that got bit rotten and people forgot. So fast forward to last year. There was a mail on the wire guard mailing list of VPN up for Android said that the bills are now reproducible. The release is identical on their website Google Play store on an asteroid. And that was well and they didn't even tell us. Yay so great. We're super happy. This logo by the way we developed in 2016. So it's also 80 years old was a logo by design committee and in the end I think it turned out nice anyhow. So our definition when is it reproducible. It built is reproducible given the same source code built environment and built instructions. Any party can recreate it by bit identical copies of all specified artifact. Pretty simple. Yeah it gets more complicated because you need to have everything. What is source code. What is the environment and so on. So our mission is to enable anyone to independently verify that a given source produces bit by bit identical results. And by that we are an important building block in making supply chains more secure. Nothing more nothing less reproducible builds are not more secure than others. They're just built more securely and unsecure software still remains unsecure. But with reproducible builds you cannot be sure that you run this insecure software. And this again is from Luna's talk from 10 years ago. It's pretty much the same definition that we have now. And by 2024 reproducible builds have been widely understood. We have resources documentation. We have public scientific publication. There's lots of material online. And even the White House has said about us. They made a release which in 2021 government statement which requires software bill of materials for governmental software. And at the moment only recommends reproducible builds. But hey the White House recommends our work. Yay. How did we get there. Money. Snowden. Why money. Bitcoin. 10 years ago actually 12 years ago in 2012 or 11 the software was made reproducible because Bitcoin all bitcoins were worth four billion. I think there was very much more now. But they were still afraid if there was a compromised Bitcoin client would steal the Bitcoin. The developers would be accused of having a back door and they didn't want this. So they basically made reproducible builds. And Snowden is kind of obvious. And so the tour people made tour browser reproducible in 2013. Because they were afraid that they would get back door. And tour browser is Firefox. One of the biggest software projects in the world. 50 or 70 or whatever megabyte binary at the time. Every bit was the same. And there was. Wow. So how did we really get there. Lot of work by many people over many years. And. Debcom 13 in 2013 there was a box of small workshop last minute 30 attendees and that kicked off the Debbie in a thought for reproducible builds. And then Luna had this another box at that conference team. And the stock here. And we had the first package that patches for the package to Debbie and package maintainer. We sorted the files and created built info files built info files is where we store the environment and the sources and the product. The output the binaries. And with that we can reproduce them. And that was all done in 2014 already. And in September 2014 I started systematic builds of Debbie and packages twice. First does handle packages and then all 25,000 at that time. And Mike Perry and said she gave a presentation at Congress. See Congress in December of 2014 showing my graphs. It was really nice sitting in the audience and suddenly seeing this graph I made. And I have some from the slides. So this is the presentation. 2014 again. And I want to believe that is really the problem with trusting somebody that this binary comes from. Whether it's needs to need to believe me or Microsoft or your government or Debbie and you need to believe somebody saying that's not believing is not scientific at all. And I am the developer. It's I know it's on my machine and I'm upstanding careful and yeah. But we develop us are also excellent targets. I just spoke with somebody who's producing the GPG binaries for Windows who would have interest to compromise that several nation states. So. And they had that to very nice proof of concept. So the most secure computer in the world is that network most computers on network USB access. 24 seven you can compromise computers and especially if one computer gives you access to 100 millions of other computers or lots of money or whatever the state secrets. Where whatever you terrorist or war taxi want to make. So they made a small back door. They use the CDE against SS eight where the the. It's greater and it should be greater equal. That's the difference and gives you who to exploit. And in the binary the difference is one bit. So they are seven e. Seven C is the difference whether somebody can get root on your computer. And it's very hard to detect. And then they made another thing. They made a Linux kernel module which modifies. That if the compiler looks at the source code and it's not a good thing. And then they made another thing and they made a Linux kernel module which modifies. That if the compiler looks at the source code it will take different compile different code than if you look at it with an editor. And then they made a proof of concept for that. So these attacks are not only feasible they are possible and probably be done. And this was the graph they showed. No they must have showed an earlier graph because it was from 2014. But anyway the green pack is the percentage of Debian packages reproducible. The orange one are unreproducible and the red ones are failing to build. But it was still more than half the packages were quite early reproducible. So 2015. Luna and myself gave a talk here. And this was the first talk where we were spread from Debian to going to all free software projects. I think it was quite nicely perceived. Because since then we have lots of collaboration. And Luna gave her presentation at CCCCAM presenting many problems. We found many many problems and come on ways to work against them. And we had the first reproducible build summit in Athens. We had the time I think we were 25 people from 15 projects or something. And we wrote source date epoch spec which I explained in a second. And divorce code. So first the common reason for unreproducibility is time stamps. People leave time stamps everywhere. And they leave more time stamps. Really every documentation tool has time stamps. Compilers have time stamps. And there's build parses. Also very annoying. And the rest. And the rest is about 400 different kind of issues or something. But it's mostly time stamps and build parses. And for build parses to get that is very easy. You just rebuild in the same build parses. And nowadays with name spaces it's also trivial to do. And for time stamps we came up with source date epoch. Who knows about source date epoch? Yoo hoo hoo. Build time stamps are largely meaningless. Source date epoch describes the time of the last modification of the source. In seconds since the unique epoch. Because that is consistent. This is deterministic. It doesn't change. And this means that was when the software last changed. And when you know the build environment you also know all the libraries you're using. So there's no use to record a time stamp. And that's supported by a lot of software today. If this build environment variable is set. Then it's used and there's 100 tools. DCC is using it. Pundock is using it. Whatever is using it. And the specification is also really stable. We modified it once in 2017 and that's it. So it's been working. And it's available on the internet. Difascope. Who knows about Difascope? You should all know about Difascope. I met lawyers who know Difascope. Who uses Difascope? Nothing against lawyers. I explained it. Difascope tries to get to the bottom of what makes files or directories different. It will recursively unpack archives into many kinds and transforms various binary formats into more human readable form to compare them. So you have a tar archive. And in the tar archive is a PDF. And in the PDF is an image. And the image has a varying time stamp. Difascope will show this. And it has large file system formats. So basically for anything. APK files, DEX files, all file systems, GBG, keybox databases, HTML, anything. There's more JPEG, whatever. It compares two objects. And this is also why there's a lot of different types of files. And it's not just a file. It's a file. It's a file. And it compares objects. And this is also why lawyers like to use it, because they compare two PDFs, and then they see which text changes. And there's other use cases, but we don't also use it to find out why something is unreproducible. Not if it's unreproducible. If it's unreproducible, it's easy because the hash doesn't match. But really why? And it falls back on HexStump. It does fuzzy matching and many things. It does disassembling. Here you can also go to try Diffrescope Org and upload two objects and it will show you the difference. Not sure if you can read this, but basically the colors are. There's an archive, there's at the top, there's the bits are different, and then here you see really the diff between the two versions. And because I compared two different versions, 5.06 and 5.07, the version numbers of course change. But you can look at many differences with Diffrescope. If you haven't taken a look at Diffrescope, do. So in the last 10 years we filed almost 4,000 bucks with patches and 3,500 of them have already been merged. So there's only 300 left and there's still some more coming. And in general in Debian we found over 32,000 bucks. Most are failing to build from source because we constantly rebuild Debian, but there's also many other things. The Reproducer Builds is also a huge QA effort. And yeah, we have a Git repository very categorized the issues and put, okay, this package is also affected by this. And Luna's talk is also a good reference for this. And it's because it's much easier to describe what makes a problem a package unreproducible, we have created the unreproducible packages package which shows many, many problems and how to fix them, because a reproducible package is very, it's hard to show why it's reproducible. It's very easy to show what's unreproducible. And some of the unexpected benefits we had is lower development costs and increased development speeds through less developer time. Google is one of the main users of this benefit because the builds are way faster, of course you can cash way more. And it's also good for software development to see if this change really just affects this part you thought it would be effect. And for license compliance, you can only be sure it's free software if you know the binary is coming from the source. As you're running some binary, maybe it matches the license, maybe it doesn't, who knows. Yeah, and you have reproducible verified S-bombs. S-bombs are just a statement with reproducible builds you have verified S-bombs. So we made these summits over the years, mostly in Europe. And we're going to have a summit this year as well. There were always like 50 people around something. But there were many, many projects there. Like all the BSD, FDROID, Github, Microsoft, RATAT, Apache, Maven, whatever, many, many, many. And there was another benefit, bootstrapable org. It began as a breakout session at the Reproducible Build Summit in Berlin. Who knows about bootstrapable org? Some people. So my understanding is you have 500 byte block which can build in this very small assembler, which builds another small assembler, which builds a tiny C-C, which builds an old GCC, which builds an old GCC, which builds modern GCC. So you bootstrap from sources only. And they bootstrap Dix, which is another Linux distribution. It's pretty amazing. They have their own talk here. And that is just because there was an idea to do this and people really tried this. And bootstrapping, completely bootstrapping from source has not been done since the 60s or 70s. We all just use binaries, building binaries, building binaries. For the summit this year, we don't know where, we don't know when. We need a location for 50 people. We need some sponsors to cover the costs. And we need you to make it happen. Please talk to me after the event if you have an idea where to hold this. In general, we have some funding. We are a Software Freedom Conservancy project since 2018. Funding is for Chris Lam, Mati Aritzolo, Vagrant Cascadian and myself. We support our continuous work, the development, community work, developing software, designing processes, the summit. Thanks to all our sponsors. So short overview of various projects, which is mostly about Debian. So these are the CI results for Debian Trixie. We are now in the 95% range and it's pretty boring. The graph has become very boring. This is a bit nicer. So these are the past Debian releases. Bookworm is the current release. Bullseye is two years ago. Buster is four years ago. And you can see the unreproducible packages have gone constantly down, but there are still over 600 unreproducible packages. And these I really want to get to zero. This is still the goal. So these are the CI results for Debian Unstable for the last 10 years. And you can see in the right here in the end in 2023, we stopped varying the build pass. That's why we went from 85 to 95%. But Debian is constantly growing and we are getting constantly getting more reproducible. So in 2017, we changed or Debian changed policy and now says packages should build reproducible. It's not a must. It's just that would be really nice. And of course I want to get to the must, but that is not so easy. So I want to have reproducible packages must not regress as a next step. And in 2025, because that will be after the next Debian release. And also new packages must build reproducible. I don't want any new packages which are not reproducible. It's been 10 years, it's over. And finally in whatever 2027, one, two more releases. I think all packages must build reproducible to be allowed into testing unstable. That can be an unstable experimental. You can experiment with that. And really 100% is the goal. And 100% reproducible is a political decision and nothing technical. Because we can always say, okay, you're out. And that is political, not technical. So we need to change policy. And we can work around must have offenders using whitelist in the beginning. Like at the moment Grubb and Linux kernel are not reproducible. And I guess most people want to use them. And goal is still 100%. Whitelist are just a way to achieve that goal eventually. Because with that we can kick the others out. And then Debian testing migration. I'm not sure how many know Debian workflow. So there's packages get uploaded to unstable, then they move to testing. And eventually testing is declared stable. So this moving to testing, this migration, there's various penalties or can be introduced. And since three months or something, it now shows if the package is unreproducible, there's no penalty or nothing yet. But I think for the next release, there should be penalties for violating, not regressing, and new packages must be reproducible. And the framework is now there. It's just not activated yet. And the whitelisting part I already said. So, and this, this is a bit stepped for the next part. Because what I showed you before these other graphs, this graph is just about continuous integration where we build the package twice in maximum variation to see why a package is not reproducible. But what we really want to do is Debian builds a package once, and we want to rebuild it to see if we can reproduce it. And we don't want to find differences, we want to find the same thing. So for this, this, we have made this other rebuilder, and this was already working two years ago, and it also showed good results, but then it stopped working. Because we need the working snapshot Debian org service. Snapshot has all the Debian binaries ever released. And without snapshot, we cannot recreate the same environment. But snapshot is buggy. And this has been buggy for five years. And so this is broken. Sad. And Snaps fixing snapshot, it's 150 terabytes of data. It has four pushes per day, gaining 70 gigabytes of data every day. One project to fix, so I got access to fix it. Yay. We need something soon. We still need to fix snapshot, so if somebody wants big data, we need this, please talk to me. But at the summit last year, we also did, we don't need the whole snapshot. We only want to reproduce 70,000 binary packages, and they will depend on 30,000 packages. So 40,000 packages are never used as built-in pens. And then it looks at the build info files, because the build info file describes the environment, and those 30,000 packages are only used on 100,000 variations. So we only need 100,000 packages. That's just 100 gigabytes per arch and suit, so that's nothing. It's just two terabytes or something. It fits on my laptop. So we rebuild our snapshot, it was born. And that's a cache for Snapshot WNORG, which only stores the packages used as built-in pens today. And if new version is, then the old built-in pens are not needed anymore. Because Snapshot has this problem, seeding this still takes a week, but we've done this, and then each arch only takes hours to seed from another instance. And we already run two instances of rebuild a snapshot, and our goal is to allow many instances, so you can just have your snapshot cache in your institution and use it. And this is needed because DAP rebuild, which is used for rebuilding DAP packages, then uses the boot snap together with meta-snap. Because the packages don't have the trust information, and the meta-snap has the metadata from Snapshot WNORG, so you have a trust pass there. But because there's only five minutes left, I will skip those details. And rebuild a snapshot only has one issue at the moment, because we only started it in early December, and then there was Christmas and Congress and whatever, so we didn't fix this issue. And this is, Lynx and Yosh have really helped with that. I've done some work, but the coding part is mostly them. I've done the design work. I hope to have this working in a month or at least two. I don't really care when. We've waited five years, so yay. And so Outlook testing migration can use and force policy, but we need real rebuilders, because we don't want to test immigration just on CI results. And therefore the rebuilders, we need a working snapshot, and we will keep the CI builds to really still find the issues. I can give a very short overview now in the last three minutes about other projects. Tails is easy. Tails was the first project which is reproducible. You can rebuild the Tails ISO, and be sure that's the same ISO. ArchDinox has rebuilt us in Snapshot Binary and Active Community. They really rock. They know more about this than I do. Zouza is one person, but maybe that person will be allowed to do a reproducible Zouza fork on company time this year. So Zouza, I'm looking forward to Zouza in this regard. MixoS and GeekZoS are by design reproducible, but they also have still the unreproducible Linux software. Yachto has support for reproducible images, and FDroid also has reproducible packages in the repository. Alpine has basic support. FreeBSD, the base system is reproducible for all BSDs. We never tested OpenBSD, and we never tested the ports. Fedora Red Hat Ubuntu is not interested, it seems, but that is not really true anymore. Fedora has enabled in Macro, so that source state epoch is now used when building packages, so Fedora could have this easily. Ubuntu, I would really love a Debian fork, which is reproducible. Take the Debian sources, throw away the Debian build processes, do them new, and make a Debian binary fork, which is reproducible. So many projects support reproducible builds in a way or another, but it's mostly in theory, so it's unclear what it does, how users benefit. Tales is easy. Tales has one ISO, one checks them, you can recreate it with Debian with 60,000 packages. How do you verify them? It's still open. And this is massive success. This was thought impossible 10 years ago. This is, again, a 10-year-old slide. In theory, we are done. In practice, we have shown that reproducible builds can be done in theory. We need rebuilders, we need to store the results, we need defined criteria, how tools should treat the data, and then we need to use these tools. Because if you have several rebuilders, which you basically want, what you do with the results are not matching. And yeah, those last 5% or it's maybe 2% now, we still need to fix the software. And we need project-level consensus and commitment to keep reproducible builds working in practice. Thank you. Thank you for the talk. I learned a lot actually. I had vaguely heard about it, but very good introduction. Anyone have any questions? Hello, Agarhe. Most of the graphs you were showing were for AMD64, I think. Do you have any information about the other architectures? Is there any difference between AMD64 and ARM64, for example? We also have graph for ARM64, I-386 and ARMHF, and I just got an offer for RISC64 hardware. So when we do this rebuilders, we want to have every WN architecture. But it's, yeah, get there. Any more questions? How about, I actually have several questions. How about source releases? Because this is a problem with many projects that they don't have reproducible sources. If you want to get an expulsion or something, unless the original tower is around, you cannot remake it easily from a weapon. It's getting out of the... The releases of the source code or the source release? The releases of the source code. For many projects, it's still hard. Is there any effort around that? It depends mostly on the tools, like whether G-SIP or GitHub, when it creates an archive, is reproducible. We have basically decided not to look at this, because if you do a release once, you have version X, and that version X stays. Being able to reproduce will recreate the same version. We've just decided to be out of scope. But it's worthwhile, do it. Another thing is variants. For some distributions, they support a lot of variants of packages, and these distributions are more interested in causal analysis. How we arrive at certain versions and why we didn't arrive at it. Because this is what enables us to fix the things that didn't allow us to have a reproducible build. And if you have 100 variants multiplied by 1000 packages, that's a problem. I would not say that the number of packages matters. Of course, if they are reproducible, you can have 100,000 of packages. Is there any effort in causal analysis of why a package is not reproducible? Can you very hardly understand you, because it's so loud. Is there any effort to build tools to do causal analysis? Why the packages are? Yes, we've built this. This is the framework to analyze this. But first you check whether the packages are reproducible, but just comparing the hash. Build it twice, and if the hash is the same, it is the same. And if not, you can use Difloscope to analyze why. We've also had Bernhard Wiedemann from ZUSA also made some scripts to analyze the build logs to make a statistically analyze on this. So there is work on this as well. There is endless work on this area. We should talk about Snapchat, the Levian, and Thare. We should, Julian. Definition, I assume, that we have given the same build environment. What are the best practices to deliver the same build environment in long term? Are there any best practices for that? Well, just recreating the build environment is different from distribution to distribution, and it's often a challenge. Could you repeat it? Recreating the build environment is often difficult, and it depends very much on the distribution how to do this. But it can be done. So what are the best practices? What about containerization or something like that, or VMs or whatever? VMs help with that, yes, but it's also you need to record the build environment while you build, and then have some tools to recreate it. Okay, so are there any tools right now for kind of defining build environment? There are several tools, yes. Any keywords for that? There's DepriBuild for Debian, there's Reprobild for Arch Linux, there's, I don't know how OpenWRT does it, but depending on the distro there's different ones. Great talk. I was really fascinated by you describing how we can now potentially build from source code, bootstrap from, you know, 500 bytes, and then ancient versions of GCC and upwards. I think one of the reasons that is fascinating is because it addresses, Ken Thompson's attacks that he described in On Trusting Trust, which has been a major weakness in the security of the whole industry. When can I do this with Debian? When can I build Debian from source completely like back in the 60s? There you would need to talk to the bootstrap over people to bootstrap Debian really. I think you can do it probably today if you do the work. The sources are there. Somebody just needs to do it. And about this trusting trust from, can Richie what you mentioned, there's now from David Wheeler, there's reverse double compilation, where you rebuild both compilers, two compilers twice with each other, and if they produce the same results then you can be more or really sure that this trusting trust issue is resolved. There was a nice paper through the last year, October, for this 30s or 40s anniversary of Trusting Trust. Any more questions? It's not really a question, it's more a comment. For people who come from TPMs and attestation, attestation always works on binaries, and the question you always have is, what is the source code corresponding to the binary? And the only process we know forgetting that is actually reproducible builds. And so there's a lot of people in the security community actually trying to advocate for reproducible builds, just so we can prove to our customers that this binary hash corresponds to this source code. And the comment is just that this is actually a very important use case that is rising enormously in importance with the attestation requirements of confidential computing and the like, that you can actually use to plead for funding for guys like the NSA or other people, because this is suddenly becoming really, really important. I just have a total beginner question. So I have source code checked out somewhere, and I want to have a reproducible build. So how do I go there? I can set source stage epoch, but that doesn't really match because each of those files have a different date. So is there an easy just to make this way, and then it will analyze the source states, set the right source epoch and things like that, or how do I get there? It depends. But first you set source state epoch just to the last modification of the whole source. So if you have 10 source files, just the latest timestamp. And then you just build it, and you build it twice, and you compare it. And with that, if you just build it now twice, you already have some variations, like randomization, hashes are not sorting the same, but maybe you don't catch the issue with the timestamp. So you build it once today and once tomorrow. And then you compare it, either they are the same, or if not, then you compare them with DIFFERSCOPE to see where is the difference. And according to this, you do whatever is needed to remove the difference. Okay, then is there an easy tool to give me the right timestamp to set for source stage epoch? Or when I have a make file, I may be even able to say, Analyze the date and give me the highest one as source state epoch. There is not the right timestamp. Often the right timestamp is no timestamp. And some people just set it to January 1st, 1970, or just drop it. Because if the timestamp is just there to be a timestamp, if it's really meaningless, you can just drop it. The other thing is you replace the build timestamp with the timestamp of source state epoch. Then you still have a timestamp, but what's the timestamp worse? Just kind of saying that one thing I have done several times in projects is to add, for the reason of getting the same source table each time, added a make target or something like that to basically touch all files with a timestamp derived from a file. And this file will be some sort of manifest. So you can check that against the files and also use it as a source for all the timestamps. It's not perfect because sometimes there's timestamps inside files, and you need to manually add something to edit that, but it's a good starting point to do that. Thank you for the talk. What are the main challenges to make the Linux kernel representative? At the moment, the main challenges are signatures on kernel modules. Then we came up with the solution that if you rebuild something and you get the same bits, then you can just reapply the signature again because the signature will match again, but for that you still need to have the signature. So it's mostly, it's not impossible, but it's busy work. And because the signing process is also a secure boot change and there's time requirements to get the signing in, that is more problematic than the technical challenges. Okay, and that's time. Thank you very much for the talk.
An engineer's guide to Linux Kernel upgrades
Thank you everyone for coming to my talk. My name is Ignat. I work for Cloudfer. Who here heard about Cloudfer? Who's using Cloudfer? Should be more hands, by the way, because even if he didn't hear about Cloudfer, probably using Cloudfer one way or another. This is my first time at FOSDEM. So thank you very much for exchanging your lunchtime for my talk. I hope it will be really exciting. And today we're going to talk about Linux kernel upgrades and how you should do them, and most likely how you should not do them. So a little bit about myself. I do Linux at Cloudfer. I enjoy system security and performance, and I'm passionate about low-level programming. So the Linux kernel, drivers, bootloaders, and other stuff, reading in unsafe programming languages. Okay, before we start, a little bit of show of hands. So what would you do in this case? Imagine you're working at the shoot on your laptop. You're doing stuff. And yeah, and suddenly this pop-up comes in. I'm like, oh, updates available. What would you do? Like install now? Who's install now? Oh, nice. Well, who's resumed later? Do later? 50-50. So those people who raise their hands for install now, what if instead it wasn't your computer but a production system? Who would press install now? No, very few. But yeah, you like Bitcoin probably, right? Risky. Yeah, and usually it's something like that for production system, right? So it's a difficult choice between remind me later and don't remind me at all. Please don't install. And this is natural, I think. Because it's connected to the fact how do we perceive software updates, especially for production systems, right? Well, we don't perceive them really good, right? So we perceive software updates as kind of these monsters where they come in, they're nasty, they're bugging you. They kind of like an update can break your stuff. Like the traditional engineering motto, if it works, don't touch it, why do we need to install an update, right? Yeah. But the thing is, with regular software updates, we perceive them as monsters, but they're not really scary. They're kind of annoying and ugly, but pesky, but not that much. When it comes to Linux Chrome upgrades forever, it's mostly like this big monster trying to destroy the universe, right? And why that? And again, it's natural because, well, we know how to deal with regular software updates. Yeah, you have a service, it crashes once a week in production, how do we fix it? Well, if you use like system D, you'll just set a policy for it to restart it, and yeah, job is done. It can go home. Well, yeah, you'll be kind of restarting a service once a week. Your service will be in slightly degraded state, but yeah, you'll buy yourself some time to investigate and fix it later. When the Linux crash, Linux kernel crashes, however. Well, technically, this is you, right? So it's end of the world because you don't have any system D to restart it. You don't have any metrics and understanding why it happened. Your service is not reachable. No SSH to debug nothing. Well, it's kind of, it's indeed end of the universe. And that's why usually we're scared of software updates, but when it comes to Linux kernel updates, we're scared like even more. And this why like people avoid updating their Linux kernel for the most part, right? Especially in production systems. But there are common risks. If you don't apply software updates regularly, especially for the Linux kernel. So the first one of them is like your bugs are not getting fixed. And here's some statistics. So I will be talking about the Linux kernel release cycles a little bit later to introduce you. This is basically the preview is a snapshot of all bug fixes releases of a stable kernel branch 6.1. So the latest Linux LTS kernel is 6.6, but because it doesn't have as many releases, so you don't get pretty graphs, I decided to go to the previous one, 6.1. And what this graph shows you is the number commits per each bug fix release on a 6.1 stable kernel. So again, I'll be talking about release types later in this talk, but you at this point, you should know that these bug fixes releases happen roughly every week. And these bug fixes releases are what the name says. They're only bug fixes. There are no new features, no subsystem rewrite bugs and security vulnerabilities. And as you can see, so far the 6.1 stable kernel had 67, 76 releases, and out of 76 releases, there are 50 releases with more than 100 commits in them. So it means 100 bug fixes every week. Almost every release, really, like 80% or something, right, if I'm doing the mass write. 20 releases, so it's 25-ish percent every four release, every fourth release, so roughly every month, have more than 200 commits and maybe 200 potential bug fixes. And there are like these five mega releases with more than 500 commits in them. And actually, if you look in the graph, it's actually seven, but the last two barely made it to the 500. But yeah, these are like these mega releases with a lot of commits. So if you don't upgrade your kernel regularly, your system runs with all these potential bugs, like, and every week you delay, you're kind of missing out at least on 100 bug fixes in your kernel. Second, what you'll be missing out is on potential performance improvements. This is a snapshot from Cloud 4 production systems when we started evaluating, we were using at the time the 5.4 stable kernel and we started to evaluate 5.10 kernel. And so we did like half and half deployment to set of identical servers, like one with 5.4, one with 7. And this is like, this graph shows the, you know, like average memory consumption per server and you can see that on 5.10, we have much less memory consumption. And people are like, what did we break? Like, what happened? And nothing bad happened. It's actually, yeah. So that was 5.4, 5.4 versus 5.7. So we kind of saved something around 5 gigs of RAM per server. And like, at first we thought something broke, but when you dig later into the mailing list, you just, you see that like, you know, like some other folks in this case, this was Facebook now matter, and nice people did some improvements in the kernel code and improved the memory management system. And now you are consuming less memory for the same workload, with the same performance. Right? So it's like, it's almost like downloading RAM from the Internet. And you basically get it for free if you just apply an update, like it's open source, right? And recent news, for example, the latest LTS kernel is 6.6 and it rumored that it has a new scheduler in that. And there is a phoronics article that says like, if you're using Nginx, with that scheduler, it will be much, much more performant. So you'll get it for free as well if you move to 6.6 potentially. I mean, I don't have any pretty graphs because it didn't work better for us, but maybe for you it will. Yeah. And I mean, looking a little bit forward to the next talk, after mine, there will be some discussion, I hope, regarding some security improvements with TPMs and the Linux kernel, and it will involve some code probably, and you only can get it if you upgrade. So let's look at the same data, but from the point of view of accumulating changed delta. So this is basically the same data, number of commits per release, but it's kind of accumulating. It shows the number of commits since the initial release, right? And in this graph, you can easily see you can commit, you can calculate changed delta. For example, if you're on a 6.1.10 bug fix release and you want to upgrade to 6.120, you can commit changed delta is 1,762 commits, right? And basically, if you assume, which would be natural to assume the fact that the number of changes is proportional to risk, so for example, these are like 1,762 bug fixes you're running with, so it's kind of like the amount of risk you're taking by not upgrading is proportional to that number. Now let's say you wanted to upgrade, but for some reason you decided to delay, and you decided for, like, I don't know, it's end of the quarter you had a big incident, you have, you know, like your company gets a big contract, so you decided not to change anything to be more stable for the time being, and you're postponing the upgrade, and when you actually decide to upgrade now, you're upgrading from 6.1 to 10 to 6.1 to 30, which is like you just extended your not upgrading time twice. And you might think naturally that your risk grew 2x, but if you calculate the difference here, you may see that in some cases, a 2x postponing, 2x not time not upgrading, your risk actually can grow higher, now your risk grew 2.221. Right, so the risk sometimes of not upgrading systems and delaying may grow higher than the time you're not upgrading. So yeah, for 2x delay of not upgrading, we get 2.21 more risk of hitting above. If you're not upgrading, security vulnerabilities are not getting patched. So this is a similar graph, but it now shows only publicly known CVEs patched in a bugfix release, and just this data is actually crowdsourced, so it might be incomplete, but even from this you can see that out of 71 releases, for which data is available right now, 56 releases, like again almost 80%, have at least one CVE patched. And there is 18 releases again, 20, 25%, with more than five CVEs patched. So again, if you're not upgrading kernel regularly, you're running not only with security vulnerabilities, you're running with known, publicly known security vulnerabilities, for which most likely an exploit is available somewhere on the internet. Not patching your security vulnerabilities also puts a risk on your compliance, so if your production systems are subject to some sort of compliance, you have a required time at which you should be patching these vulnerabilities. So for example, if you're subject to PCI DSS compliance, like for most payment systems and stuff, it says that the critical or high security patches or updates should be patched within one month of release. So imagine there is a known, publicly known security vulnerability in the Linux kernel and you have one month to fully roll it out to your production systems. Who here knows about Acvifox? What happened to it? A few hands. So it wasn't about the Linux kernel, but Acvifox was running an Apache server, an old version and patched with known security vulnerabilities and people used an exploit on their system and exultrated some data. And it was a big mess. It was really expensive for the company. It cost its reputation as well as a lot of money, compensation, a lot of lawsuits, so very, very, very bad. Which brings us to not so fun fact. You remember like in the old days when you go to admin forums in 2000 and people were boasting around how long their server, how stable their servers are posting their uptime. Like my uptime is two years, three years. Well, since Apache and Linux kernel requires a reboot, now it's not cool anymore. So if your uptime is more than 30 days, you're most likely vulnerable and not compliant to something. So now let's talk about an anti-partness for Linux kernel releases. If you're managing a production system, for most software updates there is some kind of a change management process or well understood practices which usually like sysadmins, sres and engineers apply to manage change. But most of them unfortunately do not apply to the Linux kernel. So when you go and want to update your production system, oftentimes for a software update, the change management process will ask you why. Why do you want to update and which things from the change log on this new version is applicable to us. Like are we really fixing bugs that are hitting us? Are we really fixing TVs that are applicable to us? Well, and it doesn't apply here just because of this graph, right? So remember these bug fix releases happen every week and like with most of the releases having more than 100 commits, so it doesn't mean that every week you should be going through all the commits and trying to understand if that particular fix is actually applicable to your system. For this it's very expensive. You need a huge team of really good Linux kernel experts to understand if you know like this off by one thing in the memory management subsystem is actually triggerable on your work. So if you do go this way, mostly you'll be doing something like this. You will be just continuously stamping releases for no particular reason with no analysis. Then goes for security vulnerabilities. You say, yeah, we need like we have five CVs we need to patch due to compliance and then you may ask somebody may ask the question, is the security vulnerability actually exploitable in our systems? Do we use that subsystem? Sometimes it's an easy answer if it's in a driver for I2C and you're on a machine which doesn't have an I2C, then you can say no, but most of the time it's much more hard and like many exploits, many successful exploits are not like some kind of high severity big vulnerability. Sometimes attackers manage to change smaller vulnerabilities properly to get an exploit. So going back to that, like going back to this question if you really think of it like who can answer this question? Technically this question can be answered by the attacker because if the attacker has the list of the CVs running in the system, they're highly motivated to break into the system and this is their bread and butter. They spend like 24 or 7 to design and implement successful exploits. But unfortunately you're not asking this question to the attacker, you don't know who they are, right? You're asking for a security patch reviewer, you're going to some team for security people and they're like, oh is this vulnerability applicable? And they're highly motivated to go home on time, right? And they need to review several patches a day, not only from the Linux kernel but from many other subsystems and do other stuff like doing security architectures, doing compliance, many things. So it's kind of you're asking this person, right? And the quality of that answer will not be great. They will say like, yeah, maybe yes, maybe no. So the best course of action is just not ask this question and assume that every CV is applicable to your workload and patch it. Well, one of the traditional approaches in upgrading stuff, especially the Linux kernel is soaking. Like let's put it in the cannery somewhere and soak it for one month to ensure we don't get anything. Yeah, but basically you come back to this by soaking it in a subset of your production, you're not releasing to elsewhere and you start accumulating change delta and therefore your risk of not upgrading grows and hitting a potential bug. Same with security vulnerabilities, if you're soaking it somewhere, you're not patching CVs in your production and you'll have the risk of being hacked and you're probably, for one month's soak, you're probably all, like if you have a one month's soak time somewhere in a cannery, you're already violating some of the compliance which dictates you have 30 days to roll out everywhere. But what does high soak time means in practice? It's usually because we just don't know what we're looking for and what it translates to. We don't have any success metrics or observability how our kernel performs, is it performed the same way after upgrade as it was performing before that. We also don't know our workload. My team gets questions, the same question from many teams, right? Will the kernel break my software? But for every team, the subsystem of interest is different. For a database team, they're mostly focused on IEO and file system performance, but for some image processing team, they mostly care about CPU scheduling and CPU performance. The question should be, will it, I'm interested in this particular subsystem, will it break my workload within IEO workload or like CPU bound workload or I'm interested in some hardware or something like that, networking as well. And probably it indicates lack of sufficient production kernel testing. Within the Linux kernel, you can also ensure that an update doesn't break someone's workload if you write a particular Unix test, an integration test. The Linux kernel has this nice suite called case self-test, which is easily extendable. If you care about a particular feature in the Linux kernel or a particular behavior, you can easily write a program which exercises that behavior and verifies that each upgrade keeps that behavior. Even though the kernel itself is written in C, you can write these tests in any programming language and even scripts. Sometimes you just get, yeah, whatever, kernel is just too critical. Let's have more approvals before we deploy. Regular software requires one approval and the Linux kernel should require two or three approvals. And again, this is related to the fact that we perceive kernel as a, you know, like, bad scary monster which can destroy the universe. But what if I told you that kernel deploys are inherently safer than any other software? Would you believe me? Who believes? You're in the matrix, yes. We learned it the hard way actually in CloudFare. So this is like a map of CloudFare data centers around the world. It's maybe even outdated, but the gist is like, yeah, we have a lot of data centers around the world. And with regular software, how do the updates happen? So from a 1000 feet view perspective. So engineers update the software package, push it to our registry package. Registry then the config management picks it up, downloads a new package. Also the config management may be configured to restart the service which uses the package. It can be graceful or non-graceful depending on the context. It doesn't matter. But the gist is new, bad, or good code can propagate through all this network without proper safeguards in minutes. And CloudFare learned it this hard way. So we had several bad outages where we didn't have proper safeguards for stage rollouts of some software. So we almost caused global wide network outages and these are described in these blog posts. On the contrary, how does Linux kernel upgrade works? The gist is it requires a reboot. So we, and to reboot the server, what we do is we drain traffic from the server, put it out of production, actually reboot. Then it comes up, it contacts our config management, we wait for it to be reconfigured. We run some basic acceptance tests and put back the servers into production. And I mean we would be crazy if we reboot everything at once, so we don't. So we have automation rebooting servers in one by one or in batches. So what it means is kind of, it's an inherently natural, slow-paced, gradual rollout with minimal impact. If things go wrong. Did we release kernels with bugs? Yes. But yes, some servers didn't come up properly. Some servers started showing errors and there were only a couple of servers. So we like reverted the release and like there was no visible impact. One problem is why people are afraid of running kernel releases is they don't understand them. How the kernel release process works. So kernel versions are designated by three numbers, like one number dot, another number dot, and then another dot. Example, like 6.132. Who here knows about semantic versioning? Almost everyone. So the gist of this talk is this is not a semantic versioning system. Everyone confuses this with a semantic versioning and it's not. But instead, what really is the first two numbers mean the major version, not major or minor as in semantic versioning. And the right most number means bug and security fixes. And when the right most number increments, you most always never get new features or major subsystem rewrites. So it's not only bug fixes or security vulnerabilities, nothing else, no new functionality. So how do these releases created? So the main bleeding edge we call it source code is stored in a git repository managed by this person. Who knows this person? We call him benelope dictator, right? So, yeah. The features are developed in branches, subsystem branches. So for example, you have subsystem for drivers, memory management, and that. And once in a while Linus pulls changes from these branches. This is where the pull request probably came from. I don't know, I'll note that for that. But the original pull request, not like fancy PRs that we have now, but it was an email saying, hey Linus, can you pull from my branch? This was a pull request. And it still is actually in the Linux kernel. Yeah, so Linus pulls all these changes from subsystem branches. And once in a while, he branches out the main branch into stable branches, which designate a major stable kernel release. And this happens roughly every nine to 10 weeks. Eventually, when bug fixes get accumulated, you get a tagged version on a stable branch, which indicates a bug fix release. So for example, you get 6.2.1. But how these bug fixes get propagated there? So they're not, if you have a bug, you do not submit a fix directly to a stable branch. Instead, you actually have to go through a respective subsystem maintainer to ensure this bug is not only fixed in the stable branch, but in the main branch and all other branches. So you actually commit your bug fix to the particular subsystem where the bug is, which will eventually get propagated to the main branch. But once it's in the main branch, it's not just merged into the stable branch. These bug fixes commit, especially mark, and the maintainers for the stable branches, the stable branches all have maintainers, they basically cherry-pick these bug fixes. And when enough bug fixes are getting accumulated, they do another bug fix release, which happens roughly every week. So yeah, a new major stable kernel is released every nine to 10 weeks, and it's the so-called merged window where new features get merged. There are only two weeks of the merged window usually. And the rest seven weeks are for testing in bug fix. And so even the major version receives a lot of bug fix in testing in the first place. And what you have to remember is leftmost version means nothing. So in Galway we had this problem where we, at some point, when we upgraded for 4.9 to 4.20, it was fine. But when we wanted to upgrade to 4.20 to 5.0, people were like, oh, it's the leftmost major version of this. It's probably really scary. No, it's not. It can even have less features than the previous major release. Linus himself tells that he just increments the leftmost number when he runs out of fingers on his hands and toes. But for whatever reason, sometimes he increments when the middle number is 19, sometimes it's 21, and sometimes it's 20. So apparently he has a variable number of fingers. Yeah, and bug fix or patch releases are released almost around once a week. They are denoted by the rightmost version number. They're usually cherry-picked from the main Linux branch. And the rule is there's always no new features. Therefore, regressions are quite rare. They almost always contain critical security patches, and you almost always want to apply it. Well, the problem with major kernel upgrades is that the major stable branch is kept alive around two, three months, and then it's abandoned. It's declared end of life, and no new bug fixes and security patches are backported there. And the assumption that at this point you will have a new stable merger version available, and you should just upgrade the merger version. But sometimes it's very costly to evaluate the major version because you do get new features and potential regressions. For this, there are so-called long-term stable releases where bugs and security features are backported for at least two years, and it's usually the last stable release of the year. Therefore, the so-called LTS stable release is released once a year, and if you follow these, which we do, for example, it provides you enough time for more rigid evaluation of the next long-term release. And surprisingly, the releases are quite well described on the kernel.org website slash releases. I was surprised how many people don't go beyond the main page of kernel.org to read stuff. So yeah, go and read it. It's quite interesting. Okay, so what do we do for safe and easy production kernel upgrades? First, don't create a dedicated deploy procedure for the Linux kernel, because kernel upgrades are usually less risky than other software who's been convinced today. Well, some hands, okay. A simple stage rollout is usually enough, and kernel upgrades are naturally slow paced because they require a reboot. And because you probably won't reboot everything at once, there is a lot of headroom to abort the deploy if things look wrong. Do avoid justifying bug fix kernel upgrades. Apply them with no questions asked. There is almost always something that is applicable to your workload, and it contains only bug fixes and security vulnerabilities only. And also minimize cannery soak times and prefer to use metrics-diven approach. You can sit in this 30-day window of operating your production kernel everywhere. So if you require high soak time, think about it. What metrics or observability will give you more confidence to roll out this kernel faster? Stay on the long-term branch if validating a major version is costly, so you have to do a lot of analysis and testing. You get at least two years of bug fixes and security patches, but don't wait for the two years, of course. Better what we do, for example, we start evaluating the next long-term release early in one year when it's available. Again, apart from just being proactive, it gives us more features early and sometimes, most of the times, better performance and resource utilization. And we also don't accumulate too much change delta, as I described before. If you don't have it, implement and improve production testing for major version validation. Basically, faith-lab grading the kernel requires you to understand what your workload is. If you're a web server or a database, what specific subsystems are in the target of your workload? Because sometimes, even a bug or an improvement in CPU does not apply to databases. Once you understand your workload, better to write tests which exercise these kernel subsystems and interests required by your workload. Having these tests also really helps with communicating issues to the upstream community, because in Cloud for All, our team is quite small and we're not experts in anything, and I would highly doubt if anyone really experienced in the Linux kernel, including Linus himself, could be an expert in all the kernel subsystems. So sometimes, we had a time where we had a bug in KVM, and we know nothing about KVM at that point, but because we had a reproducible test which triggers the bug, we spent like two weeks trying to understand what's going on, and we couldn't, but since we had a reproducer, we just posted an upstream mailing list, and there's always a person saying, oh yeah, here's a fix in 10 minutes, but you have to create this reproducible self-contained test to actually people to help you. And yeah, make metric-driven decisions whether to upgrade and not time-based decisions, so many might sometimes. One thing also helps with metrics and monitoring, and also automating your kernel releases, is with human risk perception, because sometimes when new people join your team, they still have this mentality of Linux kernel upgrades are very risky, and if you require a human to progress and do these upgrades, they will always be reluctant to do this. Like, automation really helps here to remove the human risk-perseverment factor, because these days, especially in clover, many teams are not even aware the kernel upgrades are happening. They're like happening under the hood automatically, and people don't notice it, just because, and you don't have to ask anyone whether you should upgrade, because you have this more or less, not perfect, but more or less data-driven approach. And I think that's it, whatever I want to talk to you today. So again, Linux kernel upgrades are not more risky than any other software. You need to patch early and patch often, and your bug fixes kernel releases should be applied with no question asked, and understanding your workload, metrics, and monitoring on automation will allow your system to stay patched and secure in the long run. Thank you very much. May I ask something? I know where that fear, it's a fear that we all have, I guess, and it comes from things that I can just say one story, so you have like a 5.4, and it's working fine, and you have some kind of special, maybe, chipset, and it doesn't support everything that chipset can offer, but it runs fine. So you upgrade to 5.0 something or 6, and it starts to crash. And then you roll back, and then you next time you will really think twice if you will upgrade to the next version, which will offer you more support for that chipset, but you still don't know. Then you wait others to upgrade, and to be sure that it's working fine now, and that's why you don't run to upgrade really fast, and then let me see if my dead one and that one did it, and it's running fine, and then it builds fear, you know, these things build fear, that's what can build fear, that's why it's always good to wait a bit more until 5 of them do it, and then, okay, I can see, so when I'm running fine now, I will do it now. Well, I mean, based on our experience, I have this same question from our production engineering team many times, it's like, why do we rush to upgrade? Why don't we wait until all the bugs were fixed and we can upgrade? And I guess it depends on your workload, but for us specifically, I sometimes call CloudFer as Linux as a service, because many of our products are using Linux, are stretching the Linux kernel to its edge. If there is a new feature like, and Linux kernel like XDP, IOU ring, people jump on it and adopt it almost immediately, and the result of that, because we use these edgy features which many people don't use, there is no one for us to fix these bugs, like we're hitting them first, so we tried waiting, and when we're waiting, we're still hitting the bugs, because like nobody else is using that feature in this way, and this is where you just can't, I guess it's the same with very specialized CPU or hardware, if nobody uses this hardware, you can't wait for the community, someone else to fix your bugs, you have to push them through. Of course, you see the bugs, it's always helpful to report them, and there will be some people on the mailing list within a moment, they will send you a one-liner patch to try it out, and usually it works out, but I mean, generally, if your workload is specific enough, or hardware is specific enough, you can't just wait for all the bugs to be fixed, because it's very applicable only to you. Okay, good day. I wanted just to in phase your position to say Linux is safer to upgrade over any other software, and to me the main reason is because of the strong commitment from this community to ensure that all the stable release are safe to upgrade. And I know very few other software that takes this contract with the users to say you cannot grab safely. And I think this is a major point, and I think the Linux community should be recognized for this, because it puts a lot of work to ensure that we are safe to upgrade. That's something very important. More than the rollout points you are leveraging, it's much more because I've searched strong contracts to ensure that every stable release is safe to be used. Yes, you mean you're referring to don't break user space mentality? Or even don't take a patch which is not already in mainline. I mean, if you get your patch into the stability, it's because it has been tested and proved to be safe, and because of the sum of all these patches is not to be safe. And this strong commitment is very important, I think, for the users. Yes, yes. They can press their work. Yes, yes, yes. And many times when you submit patches, there are tons of external, even people or systems, we run your patch in kind of a CI and they will report if there is something back. Yes, I guess you're right that we have to acknowledge that community puts a lot of effort to these stable releases to be actually stable. But also, like the release process itself goes a long way. So, technically, again, you have only two weeks to deploy new features and then you're stuck with seven weeks of bug fixing. So, yes, the emphasis on stability is a real win, I guess, for this community. And another thing, the sum of security issues is not only counting the CVs. Greg made a great presentation around that. If there is CVs, there's probably a security issue. But there are also fixes which are not as stacked as CVs, which could be our security issues. So, to evaluate the security risk of a given version, it's not only counting the CVs, it's much more complex than that. Yeah, I agree with that. And this is what I partly mentioned, that data is crowdsourced and probably incomplete. It's kind of like the minimal baseline of risk. But there is more, of course. There is like, these are publicly non-vulnerabilities which have been tagged on this project. There is like a lot of them which are intact with no CVs attached, as well as like a lot of unknown security vulnerabilities hiding in this system. So, yeah, definitely. Anyone? Hi. Here. I don't see. I'm here. Oh, okay. Hi. I have a question about Livepatch. Do you use in your company? Livepatch, we don't use Livepatch. And my personal view on this, I'm not... So, like, I don't fully see Livepatch technology covering all the use cases. So, I think it is useful for patching vulnerabilities really fast. Yeah, yeah. But on the particular type of vulnerability. Yes, yes. With Livepatch, you're basically replacing a piece of code in the kernel with another patch piece of code in the kernel. But we have to remember that in kernel, kernel API is not stable. And basically, you can only do that if your patching requires not changing some kind of structure. It may fall apart if you're required to adding a mutics into the structure if you have a race condition. And this is where Livepatch fails. And moreover, implementing Livepatch is very complicated. And it's kind of like you can crash the system as well because you're messing with the kernel code. So, in my perspective, in my opinion, the effort is kind of not worse of the return of investment. Like, if you don't have any company, like a Linux enterprise, Linux distro doing it for you and doing it for yourself, you're putting a lot of effort to make it. You can't patch all the security vulnerabilities with that. You're putting a lot of effort and you don't get much benefit. If you instead just focus on building a system where you can reboot anything at any time, that kind of gives you, like, much better, like, long-term result. Because you just can't reboot with a new kernel and, you know, your system kind of is resilient to that. And it takes as much effort. Thank you. Hello. Thanks for your detailed explanations and for outlining that the December version doesn't actually work the way we think it does. Now, I have questions. So, you mentioned that we usually install the rest of the software out of some side bound that we don't have control over. And actually, I do that for everything. Can you kernel? I don't usually compile it myself. So, the question is, can we, should we aware, should we be aware of particular tricks? Because this process is actually mediated by the distribution. Like, do the people who do the distributions know all the stuff you mentioned? Yes. And actually, the model which I described following LTS release and, like, rolling out bug releases regularly is what most distributions actually do. You might not see it because, for example, Debian, you kind of, they version the package differently. So, you think you're always on the same version, but you may notice if you're doing, like, regular up-get upgrade that when your new Linux kernel is installed, it actually installs you a new bug fix version, which is hidden under the hood. So, this is what most distributions do. They either follow LTS or they take a non-LTS branch and maintain it for longer. But when you upgrade your system, you just get bug fixes and security vulnerabilities patched as this bug fix release. Hello. I'm not completely sure how the kernel process works still. How about a firmware that's just dropped into the kernel? Is that included in those bug fixes? And if so, how are data set? How are you ensuring that those binary blobs don't change something that breaks everything? So, in modern distribution, and like within the Linux kernel upstream as well, the binary blobs are now managed separately. They're managing the separate git repository. And on distributions, there is a separate package for it usually called Linux firmware. So, basically, the code for the kernel and the binary blobs are upgraded at different cadence and have different release procedures. So, they are not included in the code upgrade these days. Hi. Over here. Yeah. So, you were talking about the fear in upgrading kernels, but to me or when I'm looking at my team, sometimes it's more of the tedious task in having to reboot or to migrate the service. And then, you know, doing it over and over like Groundhog Day. Now, my question is, what would you consider a reasonable cadence for that task? Or do you see even like a need at the system to align on a specific kernel and, you know, and zeroing out the whole system or just having some routine monthly maintenance that jumps a few versions? What's your take on that? So, again, for bug and security releases, my preferred kernels is weekly. So, they released every week. You have to compile it and roll. I mean, not roll it out everywhere, but start its rollout at some set of production then more and more and more. And again, basically, the more you delay, the more change delta you accumulate, the more risky you're bringing. So, if you do it as regularly as possible, your change delta is small. And technically, like within a couple of two bug fixes, even if it's something breaks for your particular service, you can kind of bisect it and understand what's happening much more easily than you have to go through, you know, like, thousand and thousand of commits. So, if it's hard, you have to think about how to make it easier and how to do it more often. The more often you, it's like gym, the more often you do it, you kind of build that muscle, you build the tooling around it, you build the metrics and observability around it, and then you build, eventually, you build your confidence that kind of, it takes you very fast and effortless to actually do it much, much more. Yeah, my question is mainly about the time spent. My question is mainly about the time that you spend, you know, managing that as part of your day-to-day. Well, again, it's basic calculation of return on investment, right? If a kernel upgrade is too costly in terms of like spending, you're spending a lot of time doing that, think about if you can invest this time to build some kind of automation. And that's what we basically did. Like, when I joined the company eight years ago, like, it was very manual and time-consuming and it required a huge team of SREs to actually do a kernel upgrade, but now they're not even involved anymore. And, like, it just happens. Thank you for the interesting talk and nice present for you. Thank you. Enjoy it. Thank you very much. Thank you.
Using your Laptop TPM as a Secure Key Store: Are we there yet?
Welcome to our next talk. Using your laptop TPM as a secure key store. How are we there that? From James Potomly, welcome. Have a fun. Thank you. Thank you very much. So I have been working with TPMs for quite a long time, but my history goes back way, way before that. I mostly began life as a kernel developer in Linux. A long, long time ago, I got into open source as business advocacy at least over a decade ago. I think my history with Linux kernel actually began before some of you were born now because we're all getting on a bit. I've been a kernel developer ever since, well I've been a kernel maintainer ever since 2002, this because he subsystem. I also think whether other architectures like PA risk have done a bit of arm, obviously we're forced to do risk five because it's a new trendy thing. I've been doing containers for a long, long time. I should probably be in the container dev room, but I have to be here, here to talk to you. And I'm what's called a reluctant TPM coder. The reason I'm a reluctant TPM coder is because I got into the TPM primarily because of my interest in actually using it as a security device to store keys. And part of the reason I got into that is because in the early days of Linux, we had a break in at kernel.org and everybody was forced to actually use more security around the SSH keys we used to push the kernel tree. And part of this was they gave us a Yubi key. And you know, these are these nice key dongles you plug in. And that Yubi key just did one key. And I was sort of like, well, but I don't have just one key. I've got my signing key and I've got my SSH key. And I actually use a couple more sub keys. And you know, I've got these keys for this. And so by the time I'd actually put all of my keys onto these Yubi key dongles, you've got about a fistful of them. And if you were here, I used to get Ted Cho to stand up and show me his Yubi key fistful for Google because he's got many, many more than I have. He's got about 20, which is pretty much useless. So what I was trying to do was to use the TPM of my laptop to replace all of these keys and yet still have exactly the same security as you would with a hardware dongle. The reason for Eluctant is because TPMs are really nasty things to program. This is why they actually haven't penetrated very well throughout the ecosystem. And the other point is you will actually not find a non-relectant TPM programmer. Pretty much everybody who stands up and gives you a talk about the TPM will always say they got into it for some other reason. Nobody loves the TPM is the moralfulness. Here's some more details about me because I've been blogging about this for a long time. So you can go to my blog site and there are tons of articles about the TPM. I'm afraid my blog is a bit stream of consciousness. The articles are in order of what I got interested in yesterday. And that means that there's a lot of stuff that isn't TPM on the blog. But I've usually got it tagged and labeled so you should be able to find the TPM stuff you're interested in. Thanks to FOSDAM, we all do matrix stuff now. So that's my matrix login. It's on my own server, so I run most of my own things. And you can get my GPG key. I'm not going to put up a key fingerprint because you get it over Dain instead. So Dain is this security protocol that goes over DNSSEC straight to my domain and actually pulls the GPG key directly from there. It's one of the ways you can actually replace the key distribution network of GPG. Of course another one is Web Key Distribution and all sorts of other things. But this command is actually the one you use to get my key and it's the one I was just training Lieness on about a week ago when he was wittering about the fact that my key expires quite often. So let's get started with YTPM, so security and trust. Everybody needs help protecting secrets. So I gave you my story of why I got interested in it. It was essentially protect secrets. Everybody has this need. And usually in computing terms, your secret is associated with an asymmetric key of some sort. For most people in this room, I bet you it's probably an RSA key. It's probably your GPG long-lived certification key. For more modern people, you're moving to elliptic curve keys. The reason for this is basically bits of security. So quantum computers, assuming we can't run Shor's algorithm, which just allows them to factor RSA and of course elliptic curves, if they come along, there's another algorithm called Grover's algorithm which dramatically reduces the amount of time it will take to do a brute force attack on a key. And so NEST and most government agencies are recommending we double the number of security bits in our keys. So if you have an RSA 2024 key, you've got about 112 bits of security. Most of my keys are elliptic curve, they're P256. I've got about 128 bits of security. But the minimum recommendation is going to be 256, and we're all going to have to upgrade to that. 256 bits of security in RSA equates to about 15,000 and something or other in terms of bits that you need for an RSA key. Effectively, it's an RSA 16K key. These keys are way too wieldy to actually be useful in practice, and so everybody is going to be forced to use elliptic curve keys shortly. And obviously, if these keys get stolen, the user can be impersonated. And the current state of the art is all of these key dongles, as I said. And I think we should do better than this. That statement about them carrying one key is no longer entirely true. Some of the more decent UB key ones can now carry up to three keys, but it's still way too small a number. And the good thing about a TPM is that the key does not have to be stored in the hardware. So TPM hardware actually has a small storage space as well you could use. But one of the things you can do with the TPM is actually load the key from the file into the TPM. So effectively, you can have thousands and thousands of keys on the same TPM. In fact, it just scales as far as your file system storage does. The other good thing about TPM keys is that automatically two factor. So let's conduct a little experiment about two factor authentication. Who uses USB keys in this room? Keep your hand up if you use a pin or a password with that USB key. That's actually pretty good. So about 20% of the hands went down. So the point is, even if you have a USB key and you don't use a pin or password, you're not two factor. Two factor requires something you know, which is the password, and something you have, which would be the key. So with the TPM, it's the authority for the key, which is what I proved. It's a passphrase I used to prove to the TPM. I'm the owner of the key, and something which I have, which is of course access to the TPM. So for TPM basics, effectively the TPM is a separate security module that's present in pretty much all laptops. This is an Infineon mobile TPM. The TPM is that little chip there. Sorry, my battery is running down on this. So I think I've got one shot out of the laser pointer, so I won't use it yet. But the big thing on the top is actually the LPC bus connector. So as you can see, the chip is tiny. The major component of this is actually connection to the bus. They've been ubiquitous for a while now. So they've been present in laptops for at least the last 20 years. Originally at TPM 1.2, but TPM 2.0 is going through the ecosystem. The reason why everybody should have a TPM 2.0, and if you only have a TPM 1.2, you shouldn't be using it, is for agile cryptography. TPM 1.2, the specification is so old that the only hash it can do for signing is Char-1, which has been deprecated for quite a long time. So if you see Char-1 in any key signing process, most people will tell you not to use that and not to use that signature. And this basically means that TPM 1.2 is obsolete. The good thing about TPM 2.0 is it has agile cryptography, although it's not as agile as you think. If I give you a list of all the algorithms my current TPM can do, it's basically four. It can do RSA 2K, it can do NIST P256 elliptic curve, it can do the burrito and airing curve, and I was hoping to say it could do P384, but I actually checked it just before the talk and it can't. So my laptop in fact can only do three algorithms. The actual functions that a TPM can do are many that vary than one run way beyond key storage. So shielded key handling is the big one that I'm talking about today, but it can also do something called measurement. If you've heard of measured boot, that's a function that a TPM does, and that was actually the original function that a TPM was invented to perform. It can also do something called data sealing. Data sealing means that you put a blob of data into the TPM and it only releases it back to you under certain circumstances. Effectively we use this for symmetric keys. So your disk encryption key on your laptop for instance would be stored as a sealed key blob that would be released to the kernel when everything goes right. And I'll actually be talking about that in the TPM and kernel session tomorrow. And then the final function a TPM does is attestation. If you're doing things like measurements, you need to prove to somebody else what the measurements you collected actually were. And so the TPM is capable of forming a signature over a set of measurements that can be verified back to this TPM and is therefore used as proof to somebody else that what you say is correct. But obviously today I'm only talking about shielded key handling. So none of the other three functions of the TPM will be covered. So keys inside the TPM are stored in hierarchies, which means they have a theoretical hierarchy with a root key at the top and then they have other keys descending off the root key. You can actually have intermediate keys in this hierarchy. The only reason it's called a hierarchy is because of the way the TPM works. It encrypts the key file that you get back from a TPM to a symmetric key that's stored in the hierarchy. The top key of the hierarchy is always called the primary. TPM 2 has four hierarchies, which basically mean four primary keys, but there's the platform hierarchy is never used. It belongs to the film I guys and they don't use it. The endorsement hierarchy is used for attestation. The storage hierarchy is where we put our keys and the null hierarchy is also pretty much never used because it's volatile. The null key changes every time you reboot your TPM and that means that you don't have a permanent key. You can actually encrypt to it because it changes every time. So a key file from a null hierarchy on one reboot would encrypt differently from the key file on the next reboot. So effectively we only use the storage hierarchy for keys. Like I said, TPM 1.2 was char and RSA and is therefore deprecated. TPM 2 can do RSA, char 256, char 512 is actually present on most of them, and it can do elliptic curve algorithms, which is really useful. TPM 1.2 has this agile cryptography because instead of actually storing keys in its internal structure, it stores 128 bit number called a seed. And from this seed it actually uses a key derivation function to get from the seed to whatever key you want. This sounds really good and it is because the seed is just a random number. Every time you initialize a TPM, which you can do, it will choose a new set of random numbers. So the storage seed is stored as this 128 bit number and from it you derive either an RSA key or elliptic curve key or whatever else you want. The key derivation algorithm ensures the same seed as long as you have the same number, it always comes back with the same public-private key pair or the same symmetric key, which is useful. But the problem is that the key derivation function for things like RSA involves finding prime numbers. So there is a special key derivation function that means you always find the same prime numbers, but you still have to conduct a prime search. The problem for a TPM is it is a very slow processing engine and that means it takes a long time to do prime searches. So creating a key from a seed on a TPM2 can take a long time. So this is my old laptop. It takes 43 seconds to actually do a prime search and construct the correct RSA 2048 primary from its seed. My new laptop is actually using an Intel firmware TPM, which is supposed to operate on the CPU, and it still takes 7 seconds to actually derive an RSA key. So the reason I use elliptic curve keys is just because they are much, much faster. An elliptic curve key derivation to a TPM is pretty much a linear-like operation. It doesn't involve finding primes. And the storage seed can be changed in a TPM. There is a special command to do that. And the reason it's useful is because if you're storing all of your keys in your TPM as key files and you want to shred every key that you ever own at once, all you have to do is change the primary seed because that changes the encryption key that is used to save and restore the keys. All your old keys will no longer restore into this TPM. So effectively your old keys become shredded, which is a useful thing to do if you're giving up your laptop or airport security wants to do something strange with it. So once you take a key that, say you create a key for GPG and you transfer it to a TPM key, that transfer is one way. You will never get that key back again from the TPM. So if you're using this for identity keys, you have to be careful because your identity keys tend to live a lot longer than your laptops. If you're an average developer like me, you'll tend to go through a new laptop once every, you know, I'd like to say two or three years. In fact, it's pretty expensive, so it's probably every five years. But my GPG key has been with me for the last 15 years and I would expect to keep it a lot longer than that. So my GPG key will outlive the TPM in this laptop. So the one thing I can't do is transfer the key irrevocably to the laptop. So what I usually do is I generate all my identity keys in a sort of vault, put them on a secure hard drive, lock it in the drawer on my desktop. Every time I get a new laptop, I convert the keys to TPM format for that laptop and then thereafter the unconverted keys just stay on my locked drawer. And the laptop travels with me with the keys in, but you can't get hold of them because they're TPM keys. And this never extractive property is really useful. Even the manufacturer of a TPM can't extract the key. There is a theoretical process you can go through where you decap a TPM, you pop its top off, and if I can get at the seeds within, I can actually derive the encryption key for the key files from them, and that would allow me to get the keys. But there is no programmatic way of doing this in the TPM. So the stated play of this is that unfortunately the TPM is really, really hard to use in program. So the way we try and enable these key systems for everybody to use is we try and actually enable them in the crypto systems. The TPM actually has a slight disagreement over the library standard you should use to program it. This is all technical, none of you need to know about this, but there are two implementations, one from Intel and one from IBM, conforming to two completely separate specifications that are both legitimately published by the TCG for no readily apparent reason, and therefore look completely different to program. And God knows why this is. But the key to enabling TPMs for key storage is just to make it simple. And that means that it needs really to be an integral part of the cryptography system that you use. So GPG, if you use it, needs to, as an integral part of GPG, just use TPMs. Open SSL, if that's your crypto system, needs as an integral part of that to use TPMs. And that's where everything would stay, except now there's a new added wrinkle in TPM security. So the TPM is usually attached to the bus, a bus in my laptop. For most Intel laptops, it's attached to something called the LPC bus, the low pin count bus. And this bus, unfortunately, is actually pretty easy to snoop. And if you can snoop the bus and you can snoop TPM transactions on it, I can actually intercept all of the commands that are flowing over the TPM. So for instance, if I'm wrapping my private key, the private key material has to go over this bus to the TPM, somebody could intercept it. If I'm not using HMAC-based passwords, the password goes in the clear over this bus, the authority for the key, you could intercept it. And there is actually an existing attack that does this, it's called TPM Genie. The guys at the National Security Lab in Canada actually constructed a dongle that you can easily attach to a laptop without actually really opening it up. And just extract all of the TPM traffic. There is also a theory that you can actually program another device on the LPC bus, like keyboard or the mouse, simply to reflect the commands back and also do the snooping for you. So there is a theory that this TPM Genie attack could also be a remote infiltration attack. It doesn't have to be an evil made local attack. But the upshot is, nobody uses a TPM nowadays without actually running security on the bus itself, the TPM talks to the laptop on. So all data and transit now has to be encrypted. And this makes TPMs go from sort of complicated to use to being excruciating extremely complicated to use. Because now you have to use something called a TPM session, which effectively is sort of like an ECDH encryption stream between the application and you. And you in the room do not want to know how to do this. I've actually written this code from scratch for the Linux kernel because right at the moment the Linux kernel doesn't do this. And at some point we're going to get a problem because of this. So we need to be doing this encryption. It really is horrible code. Nobody in there should actually have to do this. This is why all TPM coders are reluctant coders. So I do this so you don't have to. And for added sophistication, these sessions once we have them can actually be used to implement key policy. Key policy is useful because it can say things like, unless you booted this exact kernel version, do not release this key. Do not sign anything with this GPG key. Do not use this key. So key policy is something that I'll come on to a bit later. But sessions make using a TPM way more complicated. It's all complexity you hopefully don't need to know anything about because the more complex and difficult it is, the more reluctant everybody is to use it. So let's get on to crypto system enabling. Existing crypto systems mostly use password protected keys. You've all seen them in open SSH and GPG that if you cap the key file, the private key file, it's usually just a password encrypted key of some sort. Easy. TPM keys also require something called an authority. I mean, you can actually tell TPM not to use an authority. You can just use it without effectively non too factor like a lot of USB keys. But for the best use case, you've actually got a key authority. And it's basically a secret you just proved to the TPM, you know. If you do it over an HMAC session, you do an HMAC proof. It's effectively a challenge proof that you know the password. Password itself doesn't flow over the bus in the clear. And the key files contain a TPM key blobs and the password can just be used as the key authority. This is all very easy. The problem is the key file format needs standardizing. We have loads of crypto systems. One of the early successes of cryptography is that pretty much everybody uses the open SSL key format, which if you've seen it, it's that PEM file format, which is really useful. So in order to get interoperability in the TPM ecosystem, I've actually had to spend a long time trying to force people to standardize on one particular way of using TPM keys, one way of writing them. And pretty much over the last seven years, everybody has actually agreed to do this. Well, an IBM who can't agree on anything over the TPM have agreed to do this. And so we actually have, oh, well, apart from system D, who's a late comer to the TPM consumer of space, everything else uses a standard key format. And that key format is currently standardized on my website, but ultimately I'm hoping to make it an RFC so that it will just be part of the industry. And so then, as long as the crypto system recognizes the key file, everything should just work. You can use a TPM based key file in exactly the same place as you just used a private key file. Everything should just work through the TPM. All your cryptography operations become naturally secure, which is useful property. And you don't need to know anything about the TPM. All you need to do is know how to do a one time conversion of your key, and that's it. And obviously you need some discipline around key backup. As I said, if you convert your key and remove the original, you have no way of separating that key from that laptop. Now, lots of keys are ephemeral, so perhaps that's the way you should use some of your keys. But some keys represent your identity and should survive the laptop. So you need to be careful knowing which key is which and how you should use it. So some discipline around this. The advantages are easy. You only need to trust the TPM. TPMs have been manufactured by a lot of countries all over the world. Israel is now a current one. There have been allegations the NSA put a back door in it, but currently, if you remember two talks ago, the guy was talking about reproducible bills. We actually now have a standard software model for a TPM that manufacturers are supposed to provably conform to. So we should have proof that there are no back doors on behalf of the NSA. Of course, they could be listening on the bus or something. And the great thing is, even if you take away my private key file, you can't extract the key from it without being in possession of my laptop. Everybody when I used to make this statement said, prove it. So there is an SSH private key. I've helpfully removed the password from it, so it has no password. It's my TPM key. If you actually scan that QR code, you will get the private key. And anybody can actually see this. You can't do this with an ordinary private key because I'd be giving away with my secrets. But I can do this with a TPM key. And just as a precaution for those of you figured out that this key would be usable if you got hold of the laptop that produced it, it wasn't produced by this laptop sitting on the desk. It was produced by a laptop sitting back home. So don't mug me late at night to try and get this. But this is the SSH key that I use for logging into kernel.org. And I'm now just publishing the secret part of that key so confident and I that the TPM will protect those secrets. So apart from conversion to TPM format, there is no change to the workflow. So this is hopefully what makes it simple for all of you to actually use TPM keys in your everyday life. The disadvantages are, as I said, the key is tied to a physical TPM, which is part of your laptop. When your laptop is retired or dies, that key is no longer accessible. You can no longer use it. The keys all need to be reconverted or duplicated, therefore, when you change laptop, if it's a long-lived key. And the TPM is slow. It can't process hundreds of keys. This is the reason why the TPM has a key ceiling operation, because TPM is way too slow to use for symmetric encryption like disk encryption. So the way you use disk encryption with a TPM is the key is actually sealed to the TPM. But if the TPM agrees and you provide the right password and all the policy satisfies, it will actually release that key into the kernel, into the open, where it can then actually be used by the main CPU for symmetric cryptography. But for asymmetric keys for elliptic curve and RSA, the TPM itself is doing the key operations. The private part of the key is never actually revealed. So the current status is that for open SSL1, the only way of using external crypto systems was something called an open SSL engine. And fortunately, we now have two of these for the TPM. So this is one I wrote, the top one, the open SSL2 TPM engine. This is one Intel guys wrote to go with the Intel TSS, which is the TPM2 TSS engine. Both of those are fairly good and fully functional as open SSL engines. For open SSL3, there is a problem in that they are trying to deprecate engines. Now, right at the moment, we're on open SSL... So when open SSL3 was coming up, they promised us point blank that if we hadn't converted our engines, they wouldn't work with open SSL3 full stop. They admitted when open SSL3.0 was released, this was a lie. And so open SSL3.0 still works with engines. They've just released open SSL3.1, which amazingly enough still works with engines. So the reason for this is because open SSL themselves internally uses engines and they're having a bit of difficulty deprecating their internal engines, and obviously they can't pull engine support until they can do that. So the engines will continue to work for a while. But there is a new mechanism, and it's by new, I mean this was excruciatingly and completely different from engine code. So I actually had to rewrite the entirety of the engine code to work with the provider. And then I did a little blog post about it, so if you're in the same position and have to convert an engine to a provider, I've got a detailed description of how to do it. It's not something I would wish on anybody, but it's finally been done. And we do have, even though it only says open SSLTPM2 engine because that's the name I chose for the project, before I knew it would have to become provider. So it's actually the same code that was in the previous engine is the core code is still there. I just separated it up, and I did a provider wrapper around it because the TPM code goes through a lot of tests and has to be provably correct. The last thing I wanted to do was rewrite all TPM code as well. My TPM system comes with Create TPM2 key. The Intel one comes with a Create key as well, but I've forgotten what it's called. This can also be used to convert ordinary keys to TPM-based keys, so it can be used to wrap effectively keys for the TPM. Elliptic curve issues. So TPM enabling works just fine, but the way that elliptic curves were programmed in the TPM, they didn't actually do the generic parameterized curves. They did specific named curves. This means that the only way you get to use elliptic curves with the TPM is if the curve is known to the TPM. And in fact, there are only really three mandated... Well, there's technically four because there's a Chinese curve called the SM something or other that's also mandated in the TPM, but nobody trusts the Chinese. So realistically, it's the NIST curves, and the Burrito-Nerring curve, the BN curve, is not something you should use. It was invented for direct anonymous attestation. It doesn't have as good security properties as the NIST curves, so realistically, you're down to only one elliptic curve you can actually use with the TPM. And the algorithms supports only ECDSA and ECDH, and this will be important because if you create a new GPG key, chances are it's told you to use a Bernstein 25519 curve, which is not part of the TPM, and it's actually not on the TCG radar for a very unfortunate reason. The 25519 is an Edwards curve, and Bernstein decided that the Edwards curve would have separate signature and separate Diffie-Hellman algorithms, and that means that the algorithms themselves, if for all agile cryptography in the TPM, are not present, which is a bit unfortunate. So don't wait, don't hold your breath waiting for 25519 to become a TPM standard curve. Chances are it's not going to be. If you want to use the TPM with elliptic curves, you're going to have to embrace the NIST curves. The other problem when I said all this is simple is actually an open SSL complexity. Open SSL has a special API to load engine key files. If you don't have this in your program, it won't load engine key files. And the problem was that pretty much no consumers of open SSL code, you know, open SSH, open VPN, all of the ones that are based on open SSL had this API sitting in there. So I can present a TPM key file all I like to these programs, they won't recognize it, because they're not using the correct load routine. This annoyingly stupid problem has been fixed in open SSL 3, but it was basically a complete drag on the ecosystem for a long time because it's the barrier to TPM enabling is not the fact that I've written the engine, because that didn't take me very long, it's the fact that pretty much no code out there actually knows how to use an engine with a key file because of this API. Because the open SSL consumers always forget this. Open SSL is sort of like an API explosion, which I mean, so the fewer APIs you have to know, the easier you find it program, which is why everybody always forgets the engine APIs. But it is only a couple of extra lines in the code, so I have actually successfully enabled it in things like open VPN. It's actually been using, so what it does is it just, for no good reason, when you go to the open SSL command line, you have to name the key type. You've seen the inform DER, inform PAM. There's also actually an inform engine for everything else. But nobody in their right mind would program that. All you do is you try the DER loader first, then you try the PAM loader, and then you should try the engine loader. But everybody forgets to try the engine loader after the DER and the PAM loader. So I put the code into open VPN, and it's been in there since 2.5. Unfortunately, we had a dispute over the licensing, so it got removed again in 2.67. This was over a statement about Apache and open SSL being compatible at the binary level. You don't need to go into it, but these things happen even to the best intentioned people, unfortunately. The good news is that if you compile open VPN with open SSL 3, it just works, because all of this is fixed in open SSL 3. So hopefully, open SSL 3 will also rescue me from trying to enable engine loading in all of the other open SSL consumer programs that I want to use this with. Open SSH was converted to use engine keys. I have a patch for it. But because Libre SSL does not use engines, the open SSH people seem to be philosophically opposed to anything to do with engine keys. And there's another wrinkle for open SSH. The problem is the way open SSH feeds keys into the agent is actually done by the primes. And as you know, for a TPM key, it won't release the primes to you. You can only use the engine key through the TPM. You can't see what the source prime numbers are. So the way that the SSH communicates with its agent is actually incompatible with the way engine keys work. Compatibility is easy. Making it compatible is easy. There's just an engine extension to open SSH which says, I'm not going to use primes. I'm just going to tell you where the key file is located. Agent, pick up the key file, don't use the primes. Which was, I mean, it was about a 20 line patch. It's fairly easy. But like I said, open SSH philosophically opposed to this. So I still have to patch open SSH to get all of my open SSH keys to work. For open SSL 3, this problem is mostly fixed because the file provider now understands how to load keys from any other provider. It will actually query all the providers and say, do you recognize this key? And if one of them says yes, it will load the key successfully, which is really useful. It gets me out of all the engine stuff. Doesn't solve the open SSH problem because it's trying to pass in prime. So in an unpatched open SSH, you will still get an error because it can't extract primes from the key. Unfortunately. So you don't need separate key loading routines. Everything should just work. This is brilliant. The one real success story I have is GNU PG. So it actually, way, way back in 2018, I had a conversation at FOSDAM with Werner Almsberger, who does GPG. And we agreed that we, I would code the GPG to use TPM keys and he would take the code. And it was, again, it was another problem because GPG doesn't use any known crypto system. It uses G-crypt, which is a very unusual cryptography library that I also wouldn't want to wish on anybody. But I was keen to get GPG supporting TPM keys because I use it on my laptop as well. So since version 2.3, it has supported them and version 2.3 is pretty old. It's a few revisions back. The main problem is that very few, even the bleeding edge distributions have this. Debbie and, Debbie in testing is still on 2240. I think Fedora is just about moving to open SSH to GPG 2.4. Fortunately, OpenSUSA, which is the distribution I'm currently running, has been using GPG 2.3 and then 2.4 for the past few years. So that's why I don't have the problem that you would have if you tried to use this. And key conversion is very easy. You just do GPG edit key, my key. You switch to the private key. You select the private key and you type key to TPM. Remember, this command is irreversible. Once, and it's not like the standard GPG thing is when you exit, it will ask you, do you really want to do this? Each TPM is instantaneous. It will do it immediately. So if you don't have a backup of your GPG keys, you've lost them because it will delete the old key file. So just be aware of that. But other than that, it's all fairly seamless. And other TPM supporting utilities are things like GNU TLS actually got it from OpenConnect, OpenConnect's for Cisco VPNs, SB sign tools is for secure boot, EFI tools is another secure boot thing. Oh, PKCS11 export is the way I'm hoping finally to get OpenSSH to do this because there's a guy from Red Hat called Jacob Yellen who's actually doing PKCS11 support in OpenSSH. And so the PKCS11 export is actually just a program that takes an OpenSSH key and exports it as a PKCS11 key, but it knows how to do it with any engine key or any provider key. So I can use this to actually export my OpenSSL keys as PKCS11 keys. It's also useful if you have Firefox because Firefox resolutely refuses to understand the basic OpenSSL key format. It insists you have to use the Netscape system key format, which nobody uses, but it also understands PKCS11. So this is the way I also use client certificate keys with Firefox as well. TPM key policies. So since TPM 2.0, it's actually supported a rich policy language based on things like PCR values, which are measured boot parameters, what have you, object secrets, it includes ands and aurs, which means you can build elaborate policy chains. So with TPM 1.2, the policy was a single statement. With TPM 2.0, the policy can be a chain of statements, and that can be this and this and this or this and this and this and this or this and so on and so forth. So you can build a very, very rich policy around how this key should be used. I wouldn't advise you to because it's sort of difficult to use, but you can do it. And policy is described to the key by a single hash value. So the way you construct policy is you use a session register. You have to execute all the policy statements in sequence, and if you've done it correctly, the hash value in that register matches the one in the key file and everything just works. The problem is that if you look at a key file, the policy is just a single hash. You can't go from the hash to the statements that were used to create the policy. So one of the things you have to do, one of the things the key file format does for you, is it actually stores all of the policy statements in a way that actually allow you to reconstruct the policy. So as long as you're using the standard key file format, the policy will always follow your keys. And the reason you need to do this is because if you forget which policy goes with which key, you suddenly get a combinatoric explosion of trying to figure out all the policies sort of I have lying around. Do they match this hash? How long is it actually going to take me to get up to this hash and match it? Yeah, I have to know how to execute the statements. And like I said, standardizing the file format meant that we could standardize the way the policy is presented. So all you have to know how to do now is to construct the policy. You don't need to know the mechanics of how it's done on the back end. We'll just do it for you. One of the useful things about policy is that as you saw, policy was a hash. If that hash is tied to the key, it can't be changed once the key is created. But the TPM has quite a few mechanisms that allow you to add policy after the fact. And the most standard mechanism for doing this is something called policy signing. So the usual way that the TPM works is that and policy is constructed just by a hash extension, which is the same way TPM's work, TPM PCR's work. So you put a hash in there, you put it side by side with the original value that was in the register, you hash it again, that becomes the new value, and you keep building up like this. If the TPM sees a signed policy statement, it will actually throw that hash value away and start using the one from the signed policy, which effectively means you can use policy signing to replace any policy on the key, which is interesting and useful. And it means the key can be updated if you change PCR values. So now, if you boot a standard Linux kernel, it will actually hash the command line, the initial RAM disk, and the kernel all into PCR9. And this means that I can lock my key to only unlocking, not only if it's the kernel version I know, but it has to be booted with the correct initial RAM disk and the correct command line, which is another really useful feature, and if I upgrade my kernel, the PCR values will change, but I can calculate what they should be. I just add another signed policy to the key that says use this. And I can also delete signed policies from the key, but beware, deleting a signed policy does not revoke the policy. If somebody else comes across the old policy, they can still use it, signature will still match. It just removes it from the key. And all statements, and indeed signed policy statements, are all processed effectively as fragment chains. And the same mechanism can actually be used to execute multiple signed policies, and it will keep trying until one fits. At this particular time, I was going to try and do a demo. I think I have two minutes left. Well, let's see if we can do that. Come here. Actually, let's make that bigger. So everybody can just about see that. First of all, I should just edit the key. So this is my GPG secret key. If you go up here, what you see is that a lot of my non-revoked or expired keys are actually TPM protected. So I'm using all of my GPG keys in my TPM, which is a very useful property. Let's see. I've got about two minutes left. So let's just... So what I've done is I've just moved my actual key directory out of the way, and I'm just going to create a new key. So this is the way you generate keys quickly. And this P256 is the way you tell it to. The very secure passphrase I'm going to use is test. Yep. So this is... Oh, for God's sake. Okay, let's not do a demo. Let's just go straight on to questions. I already showed you I had TPM keys. So five minutes left. Any questions? Hi. It's not so much a question. It's more a comment, because you put system. We're not using any non-standard format that we came up with. We're just using civilization that the Intel stack suggests us to use. So we didn't really add anything on top of that. So, I mean, your specification, great. We have no problem at all. We're supporting this, but it's also very, very new. And it doesn't even support the stuff that we need. Like we, for example, use policy authorize.nv, which allows you to store policy hash in an nv index. It's what we built everything on. And if your spec can't color that yet, then it's also not really in the position yet to be used for this kind of stuff. But in general, I'm not opposed at all. Like you seem to insinuating that we did our own thing and didn't want to play balls, anyone else. That's just garbage. You totally find supporting anything that people can agree on. This is not an area you want to be pioneers in. We want that some people do the work for us and then we just move to that. That's entirely fine. But yeah, make sure, though, that the functionality that people need, in this case, like for system-y stuff, that at least, like, looking through the spec, I just did that on my phone. Maybe I've missed something, but it just doesn't cover the stuff that we need, which is policy authorize.nv, for example. But actually, it only covers everything other than the exceptional commands. Policy authorize.nv is an actually exceptional command, so it's already technically covered by the spec that's there. Signed policy is an exception command. So this is technical to do with the way the spec works. Well, you have to, like, it's the same thing as the assigned stuff because you just store it in the nv. But anyway, this is very technical. Like, looking through the spec, it's just not covering. All I'm saying is, you know, we're not the problem. Like, we're fine with supporting anything, but maybe putting our stuff on the slides, not the West Wing, to start getting the discussion going. No, just use the point there. I thought I had everything standardized on this key format and system D came along and wasn't. That was all the sliders there to do. There's no real difficulty transferring keys from one format to another. And also, the standard is ASN1. I know there are a lot of people who hate ASN1 want to use JSON instead. We use TPM2B stuff that the Intel stack gives us. But again, if you add it to the Intel stack, we're happy to use it. We just consume the APIs as a provider. We do not see our position in a role in innovating that. We just want to use libraries that work for us. And if the libraries don't support it, they don't. So it's already in the Intel stack. The Intel engine stack generates keys of this format. And which layer is that? I think it's the create TPM. It's not a TSS layer. It's the actual Intel rotor TPM engine that was on the slides. That TPM engine also uses this key format because it's designed to be used for open-source. We did it in the library, right? Like in the TSS library. The key conversion sits outside. I would like to invite you to get together in a deaf room place. Thank you. Hi. We still, okay. So thank you for the talk. My question was, since the policies are only treated as hashes, does it mean that you could eventually, regardless of practicality, find another policy set that produces the same result and use an alternative policy set to update access to the key? So the question was, could I get two policies that produce the same hash? The actual question you're asking me is, can I induce a hash collision with SHA256? And the answer to that is no. And the reason that two policies can't produce the same hash is because they've been very careful to actually use the input values that go into the hash that mean if the policy is different, the value in the statement is different, so the hash value always has to be different. So the chances of getting a policy hash collision are exactly the same as they would be if getting a SHA256 hash collision or whatever hash you use. Is there any reason to use, if you're on Linux, to use either IBM or Intel Stack versus just in kernel TPM resource manager? So if you're a consumer and you're not programming with TPM, there is no reason whatsoever to prefer one stack or the other. They all work equally well. There are no security problems with either of them. When correctly programmed, they will set up sessions and do everything right. Intel does have one problem in that the Intel code also has an engine sitting in it, but that engine doesn't actually use TPM security. So that's the one piece of the Intel stack that's wrong, but the IBM stack doesn't include an engine. I mean my engine is separate from the IBM stack. Well, resource management just works. So you're leading to the problem where the Intel TSS encourages people to contact safe sessions, right? This is the problem. And the kernel resource manager doesn't expect people to do that, so it doesn't do a technical operation called regapping, right? The reason it doesn't do it is because I wrote that resource manager and I never had a reason to use it for regapping, so it just doesn't do it. I've already told the Intel people that if they want to use it like this, the kernel would be perfectly happy to accept patches to do de-gapping, and they're fairly easy to write. There is actually, and we're getting onto kernel stuff, a point at which the kernel itself may need to do de-gapping. One of the things that we're looking at is trying to use a permanent session within the kernel for certain key operations. If we do that, that session will be context saved, then we will have to do de-gapping and everything will just work. The problem is that the kernel coding is just in time. I'm the person doing it, and I haven't got around to it yet. So if you want to use the Intel TSS with saved sessions, then you need to use the ABMRD resource manager. If you want to use saved sessions, everything just works. Many, many thanks for the great talk.
The D Programming Language for Modern Open Source Development
Hello. All right. Great to see everybody. See some familiar folks here. Just a quick show of hands here. How many folks have heard about the D programming language? Oh, wow, awesome. Keep your hand up if you've used the D programming language or tried it out. Okay. Yeah, I see you there, Dennis. Yeah. A few other folks here. Great. This is perfect. You're in the right space. We're going to have a lot of fun today. And I'm going to give you an introduction to the D programming language here. I'm not going to show you everything because the D programming language is a really large programming language, but hopefully enough to get you excited here. And ultimately to show you some open source projects where you can get some inspiration. So let's go ahead and get into it here. So again, it's been six years since my last Boston talks. I just want to thank the organizers for inviting me back and letting me talk again. So again, the goal today is just to have fun. You can kind of sit back, relax, have a good time and just learn about, again, what I think is really interesting programming language that's expanded my mind as far as how I think about programming. So with that said, hopefully I'll come back sooner than every six years. So a little bit about me. My primary role is to do teaching. So I'm an associate teaching professor. So I love teaching stuff. I do teach the D programming language. I'll talk about that towards the end or give you a reference for that. Otherwise, I'm really interested in other sort of performance systems, these stuff. Again, you folks are my crowd. So again, I'm really excited to be here with you. And with that said, here's the abstract, of course, that you read and led you here. Again, to get you excited about the D programming language. And any code that I have for the talk will be linked here. If it isn't already shortly after this talk, I'll post it. All right. So again, what I want to do today is, again, get you curious about a really, really cool open source project. Now that open source project happens to be the D compiler. In fact, all the D compilers that we're going to find out have the source code available. So how cool is it that you can actually look at a programming language that's been around for quite some time and see some really awesome work by some really smart engineers. So at the very least, I hope that's exciting for you that you will have some place where you can look or send other people to look and see how optimizations are done or code is written or organized. So again, I think that's in itself very interesting. And maybe one day you yourself might find yourself contributing to this compiler, this ecosystem, or find inspiration elsewhere for using this programming language. And again, my secret dream for you, if I do a good job during this talk, is to get you excited enough to say, yeah, I'm going to contribute. There's been some awesome videos on how to just do that. Again, a lot of the open source projects that we've seen today and we'll see tomorrow have these resources. So again, I just want to point out that those are available as well. So again, it's really cool to look through the source code of the D compiler, which is a very, very, very fast compiler for the D programming language. Okay, so with that in mind, with my interest out there on what I want you to get out of this, or maybe to get excited about, again, whether you're a student practitioner, somebody in industry. Again, we'll continue moving forward here. And as I'm talking about this, I do want you to know that I'm a bit of a programming language enthusiast myself. So I love using different programming languages. This has been a problem for me since I started programming, always looking and kind of moving around to different languages, seeing what was new, what kind of features. And honestly, I think there is some value in that. You get to see how different languages approach things. Actually, we were just at a previous talk on the Hector script talking about actor model and mutability, how parallel processes are organized. I think there's a lot of value in taking away some of those core concepts from different languages. So what I've been doing lately is, again, every few days now at this point, I've been just turning on my camera for an hour and live streaming myself, learning a programming language for the first hour or so. And you pick up interesting things from different languages. But just to be clear, the languages that I use professionally and teach most are C++ and the D programming language. I'm always kind of thinking in terms of, oh, you know, Golang does it this way with their defer statement and D has scope, or oh, there's message passing in this language and this is how you do it in D in this way. So it's been a really interesting sort of experiment going through this process. And I'm thinking in the language that you ultimately, well, use. You kind of wire your brain a little bit sometimes. So that could be something kind of curious, again, looking at new languages, looking at languages that are popular, looking at languages that are maybe not so popular as well as far as mainstream. At the end of the day, what I hope one of your other takeaways will be is, you know, as we know, sometimes it doesn't matter what the language is. It's going to be what gives you a competitive advantage, what is fun for you to build software in, what is, you know, the tool that you can use to create something. So then my goal is not going to be to convince you today one programming language is better than the other. Even as I look at those programming languages, I try not to do that. I'm smarter than that. I think I am. We'll see if I slip today. You know, we sort of like our program languages and get used to it, right? We have our favorites. But again, I do want to share my enthusiasm for D, why it stands out, and why you might also have fun with it. So with that said, we're going to do that same little experiment that I've been doing, just turning the camera on for an hour, looking at a programming language for the first time, and just investigating some interesting parts of it. I hope that will get you curious about the different parts of the D programming language and again get you excited. And maybe, just maybe if I'm successful, and I looked around, I saw everybody who rose their hand and who didn't. We'll see more hands raised. What was it? Six years from now when I'm invited back. So anyways. All right. So I'll show you a few cool projects for inspiration. Most if not all are open source. The only ones that aren't are the scripts that I haven't put in my GitHub repo yet. So we'll be true by then to this top. And all something that you can learn from a specific feature. I'm a big proponent. Again, my background being in teaching in some industry, that we need to read more code as we're learning as well, because there's lots of smart engineers, you folks, writing that code and I want to learn from you. So with that said, we'll look at these projects all in the D programming language here. So let's go ahead and begin. I'm going to go ahead and start with something cool made in D. Why not get some inspiration to start this talk off. And here it is. A project that's built in the D programming language. TILIX. How many folks have used this terminal emulator? Yeah, I'm seeing a few hands go up here. Yeah, this is something I like to occasionally download and try out different ones. But to my surprise, I actually looked at the source code. One of my students actually told me TILIX is built in D. I didn't know that. So that was really cool what you find sometimes in the wild. But again, oftentimes as a user, you don't really care. Just a cool piece of software as an end user. But you get to see as a practitioner some of the cool tricks they do. So along with just showing you some different tools that have been built in the D programming language, I think it's important to say, well, why don't we care to look at this closer? So with all these slides here, again, I'm not going to ask you to read these or click on all the links. The slides will be available. But what you might be curious about or with this particular project, what's interesting is to see that, well, it's something that's very visual. And if you dig into the source code, it's using the GTK libraries. And those are C-based libraries. So how does D interface with C code? Well, the answer is D actually does a really, really nice job interfacing with C code. So if you are C programmers or have been using C, you can basically call directly your C functions in the D compiler. Easy as that. Now, of course, there are bindings and wrappers and other things that folks do with the D programming language. But that's nice. You get a head start by being able to use some of your C code or even C++ and Objective C. There's ways to squeeze stuff in. So I thought that was very neat, just looking at the main app file from this particular program to see the different libraries that they were bringing in and was it just straight C code or library? Some other neat things that I'm just going to trickle in some details about the D programming language as we go along here. There is something called import C, which is a really cool, well, it's effectively a C compiler built into D. So you can, on the command line, like you would with whatever your tool was, type in compiler, DMD, your D source files, and your C source files as well. So that's kind of neat there. Again, just giving you a head start if you're going to consider migrating to different programming language, which is a big decision to make if you already have some open source project. All right, so that's Tylix. That's kind of a fun one. I'm learning about how D can also play with C code. Okay, so let's just get a first impression of the D language. Again, pretend you're doing this experiment that I'm doing. You go into Google, you type in Dlang, and you go to the homepage, Dlang.org, and what do we see here? We'll actually see something that looks like this. I'm going to give everybody a minute or so to just look at this piece of code. There's a sample code there. And then I'll ask for some participation. We can make this interactive on the afternoon. But just take a look at this and let me know what you think it does or what's interesting. I'll take hands and get some volunteers here. I'll give everyone a minute to think about that. What's popping out there, folks? Give me a hand and then something out. Yeah. So the few things I see on the first line, the import is local to main. On the second one, there is an object-style notation for a string. On the third one, there is an enum for an array, so that one I don't get. Then there is this immutable keyword, which is interesting because it's doing a mutable operation on A, but then B is immutable, I guess. And MSG is apparently a program that you send to the compiler, so I suspect it emits something at the end of compilation. Okay. How many did I get? Yeah, so we got a good staff at it. I saw other hands going up here. There was one actually right behind, if you wanted to share. Yeah, it could be the same thing or to add on. It really looks like a C, next one, like a C+++, free plus, so I don't know why they gave a name D, but they could just keep it with the pluses. In a way, it's really kind of easily to read. If you know anything of C or family, you can easily jump in and just do it. Yeah. So immediately when we're looking at program language, just to recap, we see it's sort of a curly brace, C style, algo style language, right? So we can kind of read it if we know C or C++, objective C, whatever. And it does look like a C+++, kind of language. We'll talk about that in a second. There's another hand here. Is that program manipulating types as values at compile time using decode like you would do in the ZIP program language? So the question about is it manipulating types here? Or something's kind of interesting about the types, certainly here. So for instance, what's the type of B, for instance? So what's it doing with the types there? Okay, it's static. We sort of know static and C and stuff, something about memory storage. Immutable, some sort of qualifier. It turns out a stronger than const. But what's the actual type? Well, there actually is some types being inferred here for us, like auto and other languages. Now I will let you know, again, I'll repeat some of these details. D is statically typed, but at compile time, yeah, we do have to make a decision about what the actual type's gonna be and what's returned. Yeah, this is great. I'm gonna advance it one slide forward here, and you'll see what the label is on the program here on the D language homepage here. And it's sort of an array at compile time. And that's kind of cool. Just this first example, this is usually the first example that comes up here. And I've got a description of the stuff that you folks recapped very nicely. But let's actually, we'll run or look at a few codes, but I think we should at least look at this basic one here. Let's make it a little bit bigger here. Just to get a feel, again, this is the same Hello World sort of program. Well, this is maybe even after Hello World, I would say. But interesting enough here. And let's just go ahead and compile it. So with DMD, again, I'm looking towards the bottom of my terminal here. I'm gonna compile it out. This program I called compile time sort.d for the extension. And the output file is going to be prog or prog. And as soon as I hit enter, interesting here. It's finishing compilation here. And boy, I didn't run the program. I'll tell you I didn't run it. But while I was compiling it, yeah, there is something interesting here going. It is called compile time sort. So you might have guessed that. But interestingly, and this is one of the big, why should you care? Or things to look out in languages that you care about. We can do computation at compile time. So this is a really powerful feature of the D programming language, the D compilers specifically, that we can take something like in a new, something that would maybe be a constant, right? Usually in another language. Set some values here, like an array. And then actually evaluate it with sort. But again, if you look at sort, this looks like a function that you might just call in your regular programming language, right? So there's nothing really different than the compile time sort to the run time sort. That's probably what we want, right? To be able to execute as much as possible at compile time and save our work for when we're actually running, right? Before aiming for performance. Of course, there's always trade offs for that. You notice that might take a little longer to compile. Again, let's go ahead and compile it. Again, pretty fast. Actually, we're gonna talk about how fast the D compiler is later here. Now if I actually run it, the program here, PROG, right, we just get hello, Fostum, because that's the actual run time computation that's going on, okay? This part here, this is the only thing we're really doing at run time. Now if we go on and later do something with B or print it out, we'll get to our sorted array, but that's the point there. So, already kind of neat. This is kind of an attention grabbing thing. And again, something that might be new depending on what programming languages you've looked at. And the thing, again, one of the things that certainly caught my attention. All right? And I mean, there's some other interesting stuff here like the, I think was mentioned here, the quoted string before, right line, dot right line. Okay, we'll talk about this. It's called universal function call syntax, but you know, some nice potential quality of life features for us. All right, so that was our pop quiz, only pop quiz we have here. But I do invite folks to raise their hand high if they see something interesting as we move forward. All right, so again, the sample and why you might choose to care. Just to go back, we call this CTFE or just Compile Time Function Execution. This idea that we can do work at compile time. Didn't we know there's a lot of other languages, templates, or sort of a mechanism to do this if you're coming from C++ background. To various extravagance levels of metaprogramming that you can do. Other languages might do this a little bit more explicitly otherwise, but that's the idea with the decompiler. So a big win in my mind. In a big win, how clean this syntax is. Okay, so a little bit about this deep programming language. Somebody mentioned, it kind of looks like C, plus, plus, plus, plus. Yeah, so a little bit of history here. Walter Bright, who's highlighted there with the arrow, that's him at D-Comp. A few years ago, two years ago now. He was the initial creator of the deep programming language. It's called the Digital Buyer's compiler originally. But folks kept saying, hey, it looks like C++, plus, plus, or whatever. And they just started calling it D, and that just sort of stuck. So that's what we got here. So a little bit about Walter, again, he's a compiler expert. He's worked on C compilers. Hence why there's sort of a C compiler in the D language. C++ compilers. And then of course, thought about it for a while and said, well, I'd like to make something new, something that's fun and efficient to program in as well, and that's where D sort of came about. And then also, a major collaborator was Andre Augsindrescu, around 2006 or so, joined. And then for the next 10 plus years was a very active contributor in building what we now use as D2. And we actually got other audience members who are contributors. I don't know if you want to out yourself, you can raise your hand, but you don't have to. So anyway, so there's a full history with the D programming language and a really interesting article if you want to learn about the history and the origins about how to evolve and the sort of why's you do things in the programming languages. Again, that can be interesting sometimes if you know the historical context, why things look a certain way they do. Sometimes that helps you understand when or when not to use a feature. So anyways, that's just a little bit about the history of the D programming language here. So again, what is the D programming language? Still on the front page, it's a general purpose programming language with static typing. So whether or not you see those types, they can be inferred. It's a systems level programming language. So you have low level access to things like pointers, for instance, and you get the C likes syntax. So it's relatively familiar again if you've used C or C++, right? I imagine pretty much everyone who knows I hand who had heard of it's new. Yeah, something like the next C or whatever. But the mantra with the D programming language, at least on the home page, is write fast, read fast, and run fast. So we'll try to see if it holds up to those things and again why it might be a good choice for playing around with or maybe your next open source project. So over the last 25 years now, there are three compilers for D. There's the DMD compiler, that's the main one that Walter has and works on. And that compiler is completely open source. So you can dig into it, you can make a fork of it and modify it and play around with that DMD compiler. And it's a very, very fast compiler as far as compiling your code. So you can compile the actual D compiler, I want to say in a matter of seconds, tens or hundreds of thousands of lines of code. And that has in part to do with these module system, being able to do concurrent builds and how many passes it does over the language. But it's very, very fast. Your edit compile and run cycle is very quick as you're iterating and doing development, which I find something important. There is also the equally as important the GDC front end for the GCC compiler suite. I think it was around GCC 9 or 10 that was added in officially. So you've got the front end there with Ian Buchwald working on that and LDC work on by Martin, which gives you all the LLVM infrastructure. So if you're trying to target lots of different platforms, for instance, the LDC or the LLVM based D compiler is available for that. So you've got three compilers, which is great. So you don't have to worry about it disappearing anytime soon. And it is very common for D programmers to take advantage of the very fast edit compile cycle with DMZ. And then when it comes time to build an optimized build, you want to take advantage of all your GCC tool sets and infrastructure or your LLVM infrastructure and all the optimization passes. You can use those compilers afterwards. So as far as downloading the tools, don't need to spend too much time on it. But again, if you're on one of these platforms, you probably have a way to get the D compiler built for that platform. Or otherwise, there is a zip file or something on your package manager available. And with the D programming language, you get a package manager that's called dub, which will help you manage dependencies, bring in packages, and these types of things. It's also sort of a lightweight build tool as well. There's other tools that you might expect, like Dformat, which are being worked on and already exist for code formatting. Dscanner, which is like a linter. And if you're a VS code user and want to intelligence and these types of things, there's support for that, as well as for IntelliJ. Okay. So D, where is it being used right now? Again, we've heard of this language. Maybe we've used some of the applications without realizing that they were written in D. Again, from the website, lots of different companies have used it internally. Again, folks like myself just use it for our own projects or research. But I think D has done a really nice job finding itself in various performance based niches. From some of these various companies, there's different stories about how different tools were being used, which I'm happy to go into. So I want to go ahead and show a few. And this was another built in the deep programming language tool. I tried to pronounce it correctly. I think it's Elmer, but it's a compressible flow simulator. Okay, super cool. So they're doing computational simulation, something very expensive to do. So this tool now is 10 plus years old, being used by various PhDs and postdocs and researchers. But again, why should we care about this tool other than it generates really pretty pictures? Their website has some really beautiful pictures. These are just the ones I sort of understand. So I could post in case anyone asked a question. But again, it's a project that's been around for 10 plus years. Most of the code is in D and it's shelling off high performance. And I thought this was a great message to share from their GitHub saying, our focus is on open source development to give a simple access point for doing their gas dynamics and research and teaching. So what a great place to start if again you're in this area and want to look at some open source D software. Okay, so that's a nice tool. Getting back to some of the D language features. Sort of already thrown out one of the main big ones here, the compile time function execution. Which again, we're starting to see in more other modern languages, but that's sort of a staple of D and why I think it's really interesting. But the language itself has a lot of really nice quality of life features. So these are things like you get a bunch of built in data structures without having to import anything. Dynamic arrays, associative arrays or maps or dictionaries. They're bounds checked, which you can enable or disable. There's always a path to performance here. You get things like your land does and delegates. The object oriented functional style, generic programming designed by introspection, concurrent paradigms, all of that. Again, they said it's a really big tool. We can't cover all of it, but there's probably something interesting here for you or it's a domain where you might expand. I personally found that I started doing more functional style programming when I started using D because it was very accessible in their standard library. The D language also by default is garbage collected. But you can turn that off if you want. You can malloc and free. You can do reference counting. You can implement from scratch your own strategy if you want. There is a question and I'll repeat. Yeah, so the question, just to repeat and I'll break it into two, is how granular are these, this ability to turn off things like garbage collection. If you do need performance in a certain sectionary code. That's as granular as putting an attribute on the function. You could say at no GC on it. And in practice, no garbage collections will happen. And you can do that. You can do that. You can do no GC on it. And in practice, no garbage collections will happen in that section there. I think there are more in the actual tools you can do like a GC.disable, which I think is similar to what Java and other languages have. I think you could do like a system.noGC or whatever. So you get that granularity. That could be at a function based level, saying this code, no GC, and being able to handle it. The array bounds checking, I know that is set as a compiler flag. I don't know if, for that one, I actually don't know the answer if you could do that on a per function level. What I would say is if you wanted an array that wasn't bounds checked. There is, I think, of a standard library. I think one of the standard array containers doesn't do allocations. And then I would also just say, so you don't have to worry about that container of garbage collections. But for the bounds checking, I would probably just, to be sure, you could implement your own dynamic array. No problem, just like you wouldn't see if you want that granularity. I will also show, what will I show here? Yeah, so does that answer the question? GC as granular as garbage collection per function, you can enable, disable. And then for the array bounds checking, you can always implement your own. But there is a compiler flag for on or off. And typically, folks would use that again for that last little performance game, if they're like building a video game or something, and super certain, there's not gonna be any arrays that got bounds. Because typically, you know the thick size allocation, you would just turn that off. Perfect. All right, questions or features that look exciting here? And there's lots and lots, and the point is you have control, which is really, really cool for what you need. And we're gonna even dive a little bit further into this. There's some other cool stuff you can do, if you only need a subset of this feature. But let's continue getting inspired here. So we've got a standard library. So again, batteries included, like pretty much every other programming language these days, you have to have a standard library with containers or data structures, various algorithms, right? We've already seen sorts in the very first example, but there are things like map and filter and fold and so on. There's various concurrency primitives and so on, and we'll take a look at some of those. So you have a pretty decent standard library here. That's in discussion about expanding and refactoring and so on. So most of the common stuff you would need, handling JSON, CSV, files and so on. So that brings me to another built in D here, why do we care? I'll get into my code, so B's, Yanmi. But so here's just the type of way that I started using the D programming language was writing these little scripts, 50 line, 100 line, throw away codes to automate some tasks that I'm doing at my desk. I found myself doing a lot of queries to YouTube to gather data about what videos have been published in a channel or what videos are in my playlist, these types of things. So what was really nice was just to find that there is the standard.net.curl in the D standard library. And then I could just build a query string and then effectively make a query and retrieve my data from that curl request. And then I have standard JSON and then I can just, again, if I'm retrieving from some API, JSON data, again, common format for that JSON or JSON, then I can work with that data as needed here. And then you've got other sort of quality of life things like range-based loops. So you can go through the keys, you can put the keys and the values here if you wanted to iterate through them as well. So nice little script you end up writing a few of these here. So there's one example with YouTube. I do this a lot for GitHub for, again, pulling repos, looking at them, pushing code to students. So again, same sort of pattern that I'm always using with any rest-based API where I'm pulling data in. One little interesting thing here, looking at line 53, we can start to see that if you want to set various event handlers, again, here's just a little example of a lambda function here. You can have anonymous functions. You can have delegates and these types of things in the D language. So nice little quality of life things here. Okay, so this is kind of interesting here. My little scripts, and I'm sure many of you folks have your shell scripts or Python scripts or whatever. And again, that's what happened to me. I had a bunch of shell scripts. Mostly I had scripts in Python. And then I just started translating them to D because again, I liked it, it was a little bit less cognitive overload for me. Again, if I'm working in C++ and D, they're pretty similar in how I can think about some of those intakes. But it's sort of interesting that when I'm using D, I'm still effectively executing my scripts like I do in Python. Okay, so let me go ahead and explain this here. And what do I mean by that? Yeah, question first. Let's see, line 54. Maybe a bug or something there, line 54. Sorry, I didn't hear. Unreceived. Oh, the E and the I backwards, uh-oh, okay. I knew I shouldn't have put my code here. Good catch, I'll fix it in the post, yeah. Gotta do some fixing tonight. But the good news is, right, we can iterate quickly. So I'm gonna give you an even faster tool that I use to iterate and run these scripts. Just a little helper tool, it's called rdmd, run, you know, the dmd, basically just does on the fly compilation. And it does, you know, it compiles, you know, as fast as your decompiler basically does, dmd, but then it'll just execute your program immediately. And the advantage of this is then you can use D like a shell scripting language, right? You can actually, if I get a read down here, I'll try to highlight my cursor. I know it's a little bit small here, but you can just put the pound into the bang sign slash user slash bin slash environment, rdmd, you know, chmod, execute, or whatever, and then you just run your program, just like a regular script. So again, that's a really nice way to, if you need to, transition your scripting language to something that's statically typed, or you can just think in the deprogramming language rather than multiple languages. I found that as a nice quality of life improvement. Again, I understand I'm the enthusiast here, but I found that as a really big win for me. So rdmd is available. The LDC compilers, you also have this available as well with L, dmd2. I haven't checked the GDC one actually. So that was really cool. So, you know, generally speaking to, you know, my effort because I was running somebody who does little scripts was, you know, if you use a compiled language, generally, gotta be careful with talking about performance. You get better performance than an interpreted scripting language. So again, a big win for me and my projects. But there is still more to this performance story beyond just, you know, switching to a compiled language here. Because I started stumbling upon other really cool things in the deprogramming language, the community pointed me to. I started doing this in my scripts here. So you'll see here highlighted, let me draw your attention towards the top. Dot parallel here. So I just kind of stick that on the end of some collection or some array. And basically what I get is the equivalent of, for those of you who've done OpenMP, a parallel-based for loop here, right? We're able to launch multiple threads here. That's a small change that you can make, right? If you don't have any dependencies on the data in between, you still have to think about it, certainly to make sure you get correct code. But imagine just going through all of your range-based for loops and doing dot parallel. And if you're doing separate tasks, getting a performance boost, right? Use your CPU. You paid a lot of money for it, so put it to work. So again, quality of life feature there. Now does it make things faster? Again, you have to profile. You always got to check these things out. So, you know, maybe a better use case, another open source project from a D conference, just a standard, you know, Hello World Ray Tracer project where I used standard parallelism. And again, if you're looking per pixel or doing something graphically, right, you have a lot of pixels, however wide your resolution is, a thousand pixels, but, you know, a thousand something of that nature. You can try dot parallel on it and see if it speeds things up. And of course, my performance wizards and the spiniers. You're launching too many threads or what's going on, you know. So, you know, does it make things faster? I'll get to that in one slide here. Because I also see something interesting that I've touched on but haven't explained. What's going on in this for each loop? For each Y and for each X, okay, those must be like the pixels going across and up and down. Okay, so there's a lot of them. But this next part's kind of interesting. Okay, I've got a camera dot get screen height dot iota, which is like a range, and then that dot parallel. Well, what this is, is an example of that uniform function call syntax. This idea that we can sort of chain functions together with a dot. Again, maybe you've seen this about programming languages. Maybe you've implemented design patterns that allow you to do this. But it's a really nice quality of life feature if you just sort of compare the camera dot get screen height dot iota dot parallel versus, you know, trying to figure out how do I nest these things. Parallel, okay, iota, and then you're counting your parentheses or, you know, you're hoping them or whatever counts them correctly for you. Again, just a little quality of life thing, more readable code, and you can actually think and sometimes see like, oh yeah, I see that is just a range there. Maybe I can paralyze it. Maybe there is some data independent thing there. So anyways, that's just, you know, following up on that. And then as a little aside, and you can look a little bit more, there is a built-in profiler in the D compiler for seeing how many times a function executes, how much time you spend in it, and there's also a memory profiler so you can see how many garbage collections you're doing if you're using the garbage collector. Okay, so built into the compiler, you don't have to search for them. You know, I do use other tools like perf and choose your favorite tools, but nice that it's there, okay? It's an easy tool that you could build into a continuous integration system or whatever. Okay, so, you know, speaking of some graphics projects, again, that's sort of one of my passions, so, and it turns out that D is a great language for building graphics projects. So, you know, the must-need-it, you know, pretty picture slide, and there's actually games and physics, you know, if you click into this. The cool D language project, Daegon, here is a game engine, so, you know, something sufficiently complex. Why do we care about this, though? Other than it's, you know, pretty in a slideshow. Very, very beautiful. Lots of hard work there. But again, just to see the substantial project by engine and graphics developers, you can see how it's laid out, how different core systems are laid out. Again, might be interesting for you to, again, think about if you're going to use D for building games, how you organize different components and game objects and these types of things. And you can kind of look through the directory structure. D uses a sort of directory structure for packages like Java or other languages, and that's kind of interesting. And there's also just a fun comparison to C++ here if you want to see the video. It's not really to say anything. Both these applications are very GPU bound, so that's sort of the point, right? Use the language you want, and if you're GPU bound, that's all on the GPU anyway, so, you know, you can think about those tradeoffs. So there's one game engine. Another one, Dash, this is a cool, I think it started off as a student project, and then I gained some steam with several folks. So there's a little game they made. Why do you care about this? Well, you know, I spent just a few minutes looking at the code to see how things were structured. And very interestingly, they were using this idea of mixins in their code. How many folks, just as a survey, have heard of a mixin by hand? Okay, we've got about 40% or so around there. But that's the idea that you're literally just taking in a string and pasting in your code, and it should be like valid decode that gets compiled. Sounds trivial, sounds like, kind of, why would you do this? But it makes sense in use cases if you've got graphics code, if you can just import or paste in some shader code and do a mixin. Or maybe you can use other compiler or compile time techniques to sort of build out a string at compile time and then generate code. It's a very simple idea that you can compose and generate some really cool graphics things. I think it tends to work well in this use case that the game showed. Another later project here, Hypreem Engine. So they built, you know, some nice stuff. Why do you care about it? Why should we look at it? Well, Hypreem is very active in the community, so a good person to know for one. But a really interesting example of just seeing how to support multiple platforms. So again, Hypreem can build a D project on PlayStation Vita, Xbox, Mac, iOS, Android, et cetera. Just to see that that's accessible, I think that's a project worth studying and to see, you know, how did they get there? Okay. All right. So there's lots of other graphics resources. Mike Parker, who's a member of the community, has done a great job with common libraries and graphics stuff, sort of an FYI. We're talking about open source today, so I'm going to sort of ignore the commercial game projects done in D, but there's a few interesting talks, again, if that's your sort of domain. And, okay, so talking about a few of the other D language things of interest, the paradigms, okay? Because again, I said when I started using D, I started doing things more functionally. I started thinking more about concurrency. I started thinking about object-oriented programming, I think in the right way. At least, you know, how message passing is supposed to be one of those pillars of object-oriented programming that kind of gets forgotten sometimes. At least that's what I think of with object-oriented programming. But anyways, just a couple of examples. You can take a peek at these again after the talk, but I've got the range-based loop here, and then I've got the sort of the mantra of no raw loops. Get rid of those raw loops and just use functions like filter or, you know, these types of components here. So again, very nice, often easy to substitute, often you find instances where you can just do a dot-parallel much more easily. And on the right here is just a classic. You've got an interface, and you want to create a type of dog, a husky golden retriever, you know, like your favorite dog, Belgian Shepherd, et cetera. Okay, and then I can't leave D without giving a hollow world of metaprogramming, because that's really, again, one of the strengths here, right? We talked about stuff that you could do at compile time. So just a sort of simple function here. It's called print data. So I'll draw your attention towards line 38. T is the template parameter, so there's no angular brackets. You just put the template parameters right after here. So T, whatever the data type would be, and I've got another T for whatever that type is, and then the struct. Okay, what is the struct and why do we care about it? Well, we care about this struct only if it has members, right, attributes called memory and elements. Okay, so memory might be a chunk of the, you know, I don't know, some attribute of the elements is maybe, again, an array of the data. So what's sort of interesting is, one, you can think about this as a sort of template constraint, or a concept, again, depending on what language you're coming from, that has to be adhered to. So I can only use this templated function on structs to print their data if it has memory and elements. Well, I think that's kind of a nice constraint to think about or to have that ability to do it. So that's kind of interesting here. If we have time at the end, I'll flash some of the examples that I'm going to put in the GitHub repository for other introspection things you can do. You've got a traits library, so you can see, you know, what member functions you have. Is this thing a unit test? Does it have some attribute on it, like no GC or whatever? A question? And the question was, is there static if? There is static if. There's static for each. My question is why you use static. Why it's not static? Here, it's, I guess I could make this static. I don't know if it's implicit actually here. I need to think if it is or not. Yeah, I guess, yeah, we don't need it because technically we wouldn't generate this template if it wasn't valid, since that's happening at compile time. Okay. So, you know, here's, you know, leading us towards the end. So, I know, I've gone through this. I've tried not to make it a sales pitch just to show you things that I'm excited about. But if you're not ready to try, D, there's still yet other interesting things in the compiler. There's something called better C, which is a subset of the D language. And basically what this does is it gets rid of or sort of removes a lot of the language run time. So, this is if you want to do some more like bare metal programming, for instance, and you don't want to carry the standard library phobos, or you don't need some of these other features. You get most of the quality of life things, like the bounds checking with arrays, you get slices for working with them, you know, delegates, land as all those nice things, all the compile time execution, but you can sort of just use it as a better C language. Some of the stuff you're starting to see in C23, for instance. And there's a really nice talk introducing that on kernels and how they're using better C for kernel development. So, again, getting into a little level stuff here. So, as far as learning more about the language, again, great tour on the website. The good news is, you know, anybody who's written a book on the D programming language, and there's seven or eight, I think, they're all good books, right? They're all written by enthusiasts, reviewed by the community. These are the first two I'm going to recommend that folks who are beginners take a look at. They're more, you know, for someone who's an audience who knows how to program, and we'll get you started here. Forums and Discord, otherwise, are very active as well. YouTube, that's me. And then teaching the D language. So, you can hear it from my perspective again, but even better if you hear it from the students, right? They're unbiased thoughts on what the value was, if it was useful for them. And the last sort of resources as we're kind of wrapping up here to talk about, again, from Andre, he wrote this really nice piece here called The Case for D. This was in 2009. I think a lot of it still holds in a way, but, you know, basically, he summarizes it as a high-level systems language where it can be productive and enjoy coding. That's what I found. You know, maybe you'll find that too. Okay, again, that's up to you to decide. I hope I just shared some cool stuff for you to get excited about otherwise. So, again, what do we care maybe as an open-source developer? You know, you've got a readable, writeable, performance language that hopefully gives you a lot of quality of life features like fast iteration time. You know, I think there's a competitive advantage here with any project. I found it with my students. Again, that's something you'll have to test, but that's what I found. My students get further using D than other programming languages. And there's three compilers available. You don't have to worry about it disappearing or, you know, other stuff, you know, going on here. All right, what's next for me? Well, I talked a whole lot about graphics. That's my passion. That's what I've worked in. But I'm now working on learning a web framework called Vibe, which is super cool. If you're more on the website, there's a great book about it to get you started on building, you know, scalable and performance web applications. Alrighty, so we learned a bunch of things. Here's sort of a summary slide on some of our takeaways. Again, I'm going to leave that wall of text for you because I want you to leave excited and not tired from reading. I just want to go ahead and close off by thanking you. I'm going to be around so you can ask me any questions now or after as well. Thank you. Thank you. A question? What you will say, why Rust and why not Rust and why D? That's a good question. I don't want to pit languages against each other, so it's why Rust or why D? What I would say, because that's a hot question I get asked a lot, D's code is very plastic, the plasticity is high. Meaning I can mold it and change it, which I very much like. In a way that, again, I'm not as much a Rust expert. I've used it a little bit, but D's plasticity is very good. It writes how I want to write the code. It's got the memory safety with the garbage collection itself. I find it very, very productive. I find if you're going to write an application, again, I'm in games and so on where there's lots of mutable state. D's a perfect fit for that, for writing safety and maintainable code that I can change later. Yeah. So the comment was coming from a C, this was sort of easier code to C and to read. Yep, yeah. That's the other read. It's easy to get into. Yeah. Another question? Testing. The UFCS looks very cool, but how do I know if it's like a function or if it's a method of the object like that I'm calling? Because it felt, it was all the same color on your VIM script and I was like, oh no. Yeah, when you're doing the dot, so a few nice things that D language does when you're working with pointers in classes one, you know, if you're coming from C or C++, there's no arrow. So, you know, you don't have to worry about that. Everything's a dot. But then the idea of, is it a variable that I did a dot or the function call? Usually function calls are not required if they don't have any parameters, you can leave them empty. I usually just put the parentheses after. Otherwise, this is for things like language server protocol and your text editor make easy enough. It's not usually a problem. Yeah. Alrighty, thank you.
One way forward: finding a path to what comes after Unix
Hi, so while we are figuring out how to fix the image, we will start with a lecture, One Way Forward, Finding a Path to What Comes After Unix by Liam Proven. Hi, thank you. So this is my fourth FOSDEM talk, remarkably enough. Thank you for coming back to listen to me again. I have presented three previous talks which were all on the theme of the progression of the software industry and some misunderstandings. The first one was called The Circuit Less Travelled and I talked a bit about alternate operating systems, people have probably never heard of or met. Then in 2020 I did one called Generation Gaps which talked about how the perceived threat of switching generations of tech and losing backwards compatibility is a boom, not a cost. Then the last one I did in 2021 was called Starting Over and I talked about a proposal for a next generation of operating system discarding 50 years of legacy baggage using a memory technology called Optane from Intel which has since been cancelled, shows my an airing finger on the pulse. Now I kind of want to go in a slightly different direction with this one. So since I proposed this talk, something very sad happened, something inevitable but nonetheless regrettable, a giant of 20th century software design, Niklas Theot died. So because of that and in a small tribute I tried to slightly rework the introduction of the talk. Now Theot was famous probably for what I feel was kind of only a relatively minor aspect of his career. For example Theot studied and later taught in California for quite a while as a young man where he had a nickname of Bucky and I rather like the idea of this dignified computer scientist was Bucky Theot. In the days when he was teaching and studying the Unix system was being developed and people use teletypes and teletypes are very roughly a printer with a typewriter plugged in and typewriters don't have things like control and alt and shift and so on. And Theot proposed a mechanism for how controller keys would work, modifier keys would work and the mechanism is still used, it's called Bucky bits in his honor. He is of course famous for Pascal but one thing that I found is not fully understood sometimes is how and why Pascal came about. Theot and the British computer scientist Tony Hall were working with the Algol programming language committee to design the next version of Algol and Theot proposed a relatively simple change to the language. The Algol committee rejected it and went with a counter proposal which was rather more complicated and that became Algol 68 and killed the language forever. As a result lots of other programming languages got their chance to grow in the sunshine when this giant fell and one of those was BCPL, BCPL gave rise to B, B gave rise to C and Unix grew in part because Bucky Theot pulled out of the Algol committee. A lot of what we work in and use today came from this. He didn't just do Pascal but Pascal had ramifications people might not know today. The original operating system for the Apple Lisa was largely implemented in Pascal. A version of Pascal or a derivative of Pascal called Concurrent Euclid was used to build an operating system called Tunis which was a Unix. There was a Unix built in Pascal. So now with lots of people getting interested in more typesafe languages to descend from C like Rust and Go and D and whatnot. People were building Unix compatible operating systems in a Pascal derivative in the 70s. Pascal of course became a huge hit partly due to Bolland's turbo Pascal and that gave rise to Bolland Delphi and set much of the course of Microsoft Windows 3. The more of all that, he didn't care and Theot instead went on and wrote a successor language with more and better modularity called Modular but he immediately discarded that and he moved on to a modular version of Pascal with more concurrency called Modular 2. That for a while was the fastest compiler on the PC platform and was briefly very successful. He ignored that as well. Late in Viet's career, only a few years before he retired, he wrote a wonderful short article called A Plea for Lean Software. I have a screen grab of it which you can't see which is a slightly skew if PDF from Viet's own homepage. It's only a few pages long. It's very readable. I really recommend if you haven't seen it, Google it. But as a proof of the validity of the concept of extremely lean, tiny, simple software, he moved on from Modular 2 and he came up with what is arguably the masterpiece of his career, a system called Oberon. Oberon is a tiny Pascal-like language. It has built-in concurrency primitives and the language also comes with a compiler and an editor and an IDE which are all implemented in Oberon itself. Now, I wrote an obituary for Nicklaus Theod for the register in early January and I couldn't come up with a clear source for how many lines of code there were in Oberon. So I went to the project Oberon website. I downloaded the zips of the core operating system. I unpacked them and I ran WC-L against them and it came out to only just over 4,000 lines of code. 4,623 lines of code in 262 kilobytes of text files. That's an operating system and a tiling windowing interface and an editor and a compiler. When people talk about Lean Software and talk about how much smaller software could be, I don't think people fully understand the scale of the issues we are facing these days. So, when I talked about Oberon as an interesting candidate for maybe a successor to Unix for a new technological generation, well, you know what, I think trying to propose a Pascal-related language to an audience of Linux and Unix folks was probably the wrong audience. The talk seemed to go down well but I've not heard of any results or anybody particularly investigating although development in Oberon is still quite active. So, I've come up with something else that might be a bit more familiar and close to home territory here. FOSDEM is a FOS conference. It's all about FOS and unfortunately, or for better or for worse, maybe I should say FOS these days is about Unix and Unix-related operating systems. Now, a lot of people misunderstand that, I think. So, Unix today is increasingly about Linux. I can see an open BSD developer in the audience down there but increasingly Unix means Linux. But what is Unix? Now, the definition of what is Unix changed in 1993. I just want to ask you guys, stick your hands up if you can remember the year 1993. We got quite a few grumpy old gits. Well done. Congratulations. Thank you. But about half the crowd probably can't clearly remember 1993. In 1993, I can go to the 60s sadly for me. In 1993, what Unix is changed. And a lot of people still haven't heard the news. Because in 1993, Novell bought Unix system laboratories from AT&T and they donated the trademark of Unix to the open group, famous for Motif and CDE and a few other things. They kept the Thrust Code and much good it did them because there is a successor company to Novell which spone off Caldera and Caldera bought Skow and changed its name to Skow and went crazy and sued its own customers. A derivative sort of subsidiary company called Cinewoth are still trading. They're still selling operating systems based on the original AT&T code base. Unix doesn't mean based on AT&T code and it hasn't for 31 years. Unix now means passes open group compliance tests and those used to be called POSIX. Effectively what used to be called POSIX now means Unix. If it passes compliance tests, it's Unix. We have to pay for the trademark, right? So there are a few operating systems that have paid. HP UX doesn't have a future because it only runs on Atenium and there's no more Atenium kit. IBM AIX, I don't know if anybody heard but last year IBM shut down the AIX development team in Boko Vatan and passed maintenance over to a team in India. Mac OS both on Intel and X6. Both on Intel and ARM is certified but at least two or three Linux distributions have passed the testing. They were all Chinese CentOS Linux derivatives but the point is they passed the test and for a while the company's paid. Unix is Linux. Linux is a Unix. It's not Unix like it is Unix and that upsets some Linux folks who like to think they're rebels. Now, I had a family tree of Unix from Wikipedia here but you can't see it. It shows the original development series at AT&T, Unix 3rd edition, 4th edition, System 5, lots of derivatives branching off at various dates and of course off on its side with no direct connection Linux. But I would argue if you take a step back, BSD did descend from originally the AT&T code base. Cisco and AIX and Solaris and SunOS before it and HP UX and all of them still have roots in that code base. You can consider all of the monolithic proprietary Unix as one branch. They've all got shared ancestry. On the side of that there's all the branch which is Linux which is a separate code base but taken by the same design. I would say Unix arguably is just all those as one. You know what the implementation differences now are relatively minor. You know, I'm presenting, he says. I'm using LibreOffice. I'm using a version that was released about two days ago. I don't think that's why you can't see it. But LibreOffice, Firefox, bunch of other stuff, you know, Chrome, Chromium, blah blah blah. They run on Mac OS, they run on Linux, they run on all the BSDs. It's all close enough. But there is another branch of this family. Several systems passed the testing and are called Unix that aren't because it's just a compatibility layer. IBM's EOS still carries the branding. But it's a weird mainframe operating system which doesn't even use things like ASCII. It's, it can act like, no, Unix but it's not a Unix. But there is another branch of the family. The microconnors. There's a bunch of microconnel operating systems out there which are Unix-like enough to pass the test. There's Mac OS, which is what I'm running on here. There's QNICS, which not enough people have heard of. I think QNICS is very cool. It's a Canadian system. It's tiny and it's mainly used in embedded systems. It runs lots of traffic lights and engine management units. But if you ever owned a BlackBerry 10 device, that was QNICS. And there's a couple of open source ones that have not set the world on fire like the GNU Heard and Minix 3. Minix 3 is a true microconnel. But sadly, Dr. Andy Tannenbaum has retired. Minix 3 is looking stagnant. I don't think there's been any commits in several years. It'd be a great project for somebody to pick up. Minix has tens of millions, maybe hundreds of millions of users. And probably half this room run Minix and you might not know it. Because Minix runs on the embedded management chip inside all modern Intel processors from the Core i3, i5, i7 onwards. They've all got Minix in there somewhere. But Intel didn't contribute much code back and didn't contribute any money back. And so it didn't help Minix 3 at all. So we've got two families, the big monolithic Unixers and the microconnel Unixers. But they may not share the code base. But we all know what a Unix looks like, right? It's probably written in C or something very much like C. It's probably got one big file system that's rooted at a folder called slash. It's probably got a shell which uses the familiar little cryptic short commands. It's probably got case sensitivity. But there are exceptions, macOS 10, sorry, macOS. It's not case sensitive. It's case preserving. Still a Unix. It's still compatible. It's passed the test. But you know what? All these charts, all of these sort of generation diagrams and so on, show the AT&T Unix line going on and things branching off it at third edition, fourth edition, fifth, sixth, seventh. And then they stop. But it didn't stop there. And I think that's quite important. And my computer won't go to the next slide. Have notes. Great. Thanks. So there is another line of descent. And I have a wonderful spinal tap gag that you can't see. Research Unix continued in AT&T after the industry took what, worked and ran with it and made it huge. But Dennis Ritchie and Ken Thompson and some others did what Niklas the Up did. They ignored what the industry were doing with their earlier product and they kept on. And there was research Unix eighth and then ninth and then tenth edition. Very little trace of them seems to remain. I've tried to write about them. And when I've said stuff that Wikipedia has links pointing to sources, mostly I get people angrily telling me I'm wrong, but they can't actually provide an alternate story. What came back, what happened next, I think, was interesting and significant. And a lot of this industry have neglected it. By 1992, Unix was a massively important commercial system and the roots of what became free BST and net BST were happening. But let's look at somewhat like the, what defines something as being a Unix or Unix-like system. It's not about a GUI because Mac OS X passes the test, but there's no X or anything like X in here. There's an optional X package, but it's not defined by the GUI. It's not defined by case sensitivity. It's not defined by the kernel. It's just, does it look like a Unix? Does it feel like a Unix? Well, in the end, at the end of the 1980s, a change happened which brought the Unix industry of proprietary workstations into conflict with the PC industry and cheap X86 kit. Because suddenly it was affordable to buy, a reasonably capable 32-bit computer with a reasonably fast expansion bus so you could have fast networking, cheap. And suddenly this created the conditions for open source Unixes to take root and flourish and thrive. They grew in the soil of Windows. I know a lot of people like to hate Windows, but Windows created a marketplace of cheap 32-bit hardware that was ideal for Linux to grow. Windows was not a particularly useful platform on 286s and 286s were not particularly useful for most Unix-like systems. What the Open Source Unixes that took root in the late 80s, early 90s, did not account for how the hardware they were running on had changed. Traditional Unix was a mini-computer operating system. It ran on a big shared box with no display of its own, no keyboard of its own, just dumb terminals. And a lot of what and how Unix is and how it works is because of that. It's focused on text files, it's focused on piping, it's focused on text editors. It's because of the mini-computers that it evolved on. But by the 90s, computers won't like that anymore. By the 90s, computers were standalone boxes with a graphical screen and a network card. And Unix doesn't account for that well. I mean, Unix does networking, sure, you know, there's things like NFS, but it's a bolt-on. You have to run external commands and mount external file systems in particular places. Graphics are bolted on. There's very little conception of them in the traditional Unix kernel. So, you know, everybody likes to hate on X11 these days, it works for me. But Wayland, they like to talk about how they're modernizing the Linux and Unix graphics stack because Wayland works on FreeBSD. Well, Unix is about files, everything in Unix is a file, right? So, go on, then, Wayland enthusiasts, tell me, where in the file system can I find a window on the display of a Wayland box? Go on, then, point me at the file that has the dimensions of the window. Point me at a file that says where it is in the Z-order. Can't you show this is a Unix display system? Because it doesn't sound like one to me. But you know what? The research Unix team, well, thinking about stuff like that, and they started to put stuff like that into Research Unix 10, and that finally became a product that was announced in 1990. It's called Plan 9 from Bell Labs. It is effectively Research Unix 11, but they didn't call it Unix, and they didn't call it Unix for a good reason. It looks a lot like Unix because it's built in C and it's case sensitive and it has a fairly familiar shell, but it's got networking right in the heart of the design where it should be. So, you can have access to the file systems of other Plan 9 machines on your network, and they can have access to your subject to permissions and so on. It has a built-in graphical windowing system, which went through several iterations, but the one that it has now is called Rio, and in Rio, a window is also a folder. It's a directory, and in that directory there are text files. You want to write to a window, you put text in a file in a folder. That's what Unix is meant to be about, not wretched gamer hacks like variable refresh rate and so on. I don't know, I don't. I'm frankly indifferent to Wayland because there's not a single desktop I like to use that runs on the Wayland yet. So, Plan 9 takes the idea of everything as a file and really means it, whereas it was just a marketing slogan for traditional Unix, and that includes Linux and OpenBSD and FreeBSD and NetBSD and DragonflyBSD. Everything really is a file. You don't need a network-aware windowing system to put windows on another box on the network in Plan 9 because you can get it at their file system and you can just write files into the file system and windows just happen. It makes X11 look very over complicated, which the Wayland people would agree with, but it makes Wayland look like it was something invented by Microsoft. And in a way, Plan 9 also invalidated all that 80s work that went into micro kernels because a lot of the problems of micro kernel implementation just kind of go away in Plan 9. It's not that it's better, it just kind of makes some of it irrelevant. One of the big problems of micro kernels is you put all of the bits of your operating system that would be in the kernel in separate little blobs, servers that run in user space, but they have to communicate with each other to get the job done and that communication is slow and it becomes a bottleneck. Plan 9 just has put it in the file system, which is the Unix way. So the original Plan 9, when they showed it in the year 1990, at that point they had a code comparison and Plan 9 was actually just about the same size as the new Mach micro kernel that Carnegie Mellon University was showing off then. And I don't know if you know what happened with Mach. Mach had a whole load of operating systems based on it. It was very cool in the late 80s. Deck OSF 1, which later became True 64, was based on Mach. The PowerPC version of OS 2 is based on Mach. Only one of them is still around and really gets much attention and it's Mach OS 10. And the way that they got it to work is they took a big chunk of code from originally, I think it was 4.4 BSD, but now it's from free BSD, and they had that as the Unix server, which provided all that nice Unix file compatibility stuff, but it was a bit too slow, so they stuck it into the kernel. That's the XMU kernel behind Mac OS. It's Mach and a lot of the stuff in free BSD that provides all those handy APIs. But all in kernel space. It's not really a very micro kernel design anymore. So, as an aside, in case all of this sounds a little bit too theoretical, one of the problems, if you put everything in the file system, of course, is where? Where is it going to go? What's the layout of the file system? Back in 2011 on the register, I predicted that the next big thing in Linux virtualization would be containers. It was the same year that Docker launched. It's one of the couple of times in my life I really called it. Containers are sort of CH routes on steroids, of course. All the file paths for a given process get a new route directory, and since everything is relative to route, that means they're isolated. But the problem is that on Unix there isn't a good clear separation between different types of namespace. Unix doesn't just have the file system namespace. It's got user IDs, and it's got group IDs, and it's got process IDs, and they're not in the file system as such. So, if you just have a different file system, your process can still see other processes. It can talk to them. There is a feature in the kernel now called C groups, which splits up those namespaces as well, and so it provides a kind of isolation by sub namespacing multiple different traditional Unix namespaces. But Plan 9 did that in 1990. It first shipped in 1992. Every process has its own namespace. Every process has its own view of the file system. Every process is in a container. That was announced a full decade before the first pre-BST would jail. This was a project put together by two geniuses who really had a pretty good idea of where the industry was going to go, and the industry ignored it. But, of course, there's always a bad side to go with a good side, if Plan 9 is so clever, why aren't we all using it? Well, some of the reasons are easy. Like Unix itself in its early days, it was a research operating system. It wasn't intended as a product for commercial use. It wasn't sold, and it wasn't open source. But in 2000, Plan 9 3rd Edition did become open source under its own license, but they fixed that in 2014 when it was re-licensed under the GPL. It's very, very minimal, and the conceptual design, especially the namespaces, have penalties. So, because there isn't a real view of the real file system underneath, there's a bunch of stuff we take for granted on Unix, which you can't do. There's no way to move a file from one place to another, except by copying it and then deleting the copy. There's no way to move a whole directory tree. The recommended way to do it is tar it and un-tar it where you want it to be. There's no links. There's no hard links or soft links, because that would break the clarity of there being no one mass underlying hierarchy, so no links. I know that sounds a massive limitation, but you know what? As I said, I'm old. When I started in this industry, I put in large commercial systems based on DOS, both the 4, even Windows 3, and DOS until DOS 5 didn't have a move command. You couldn't do it. The best you could do was copy it and then delete the original. And you know what? I ran large file servers supporting hundreds of users. Plan 9 is weird. It's unforgiving. It's strange, but you know what? It's... They had an idea. They had a set of solutions to industry problems 30 years ago, and the industry ignored them. So it's kind of understandable that they're a bit sore about that. There are many talks of Plan 9, and I wrote about one of them a year or so ago called Nine Front. It has a famously cryptic website, which is full of weird little in-jokes and things. I think it's fair. It's legit. The whole industry has ignored that they've got better solutions to the stuff we've implemented with massive stacks like Kubernetes. I'm somewhat neutral on the whole flat pack and snap controversy in Linux these days, but you know what? The snap model is easy. You copy a big compressed file. That's your app. The flat pack model is OS Tree. OS Tree is Git, but for binaries. Anybody here who thinks they really understand how Git works and how it distributes those changes, there's not. If you can't explain it and you don't understand it, it's not a good model. The OS Tree people had a slogan in the early days. They said it was a better DOS than DOS and a better Windows than Windows. Well, Plan 9 was a better Unix than Unix, but it wasn't better enough. They didn't call it Unix because it breaks so much compatibility. Plan 9 is kind of Unix 2.0. There was more work after that. There is also a successor to Plan 9 called Inferno. It solves other interesting problems, but I haven't got time. I deleted a whole page and slide about that. The late great Bucky Viet made a plea for Lean software. When I wrote this yesterday, I couldn't find a word, a line count for the successor to Oberon. Oberon is a tiny single processor operating system. There is a successor, one of several called A2, which understands multiple processors. It has a network stack. It has a web browser. It's a very basic one, but the point is it's got a more conventional GUI. I estimated yesterday that it's maybe, let's say, about 5 to 10 times bigger. There's a team in Russia actively working on A2, and I asked on their Telegram channel last night, and they got back to me today. I said, maybe it's 5 times bigger than the 5,000-odd lines of Oberon or 10 times bigger. Yeah, I was wrong. It's 8,000-odd lines. That's a preemptively multitasking, multi-threaded SMP operating system with a GUI and an IP stack in 8,000 lines. Debian 12, for comparison, contains 1,341,564,204 lines of code. Source the Debian 12 release announcement. One and a third billion lines of code, one and a third thousand million lines of code. Nobody can read that. You won't live long enough to read that. For comparison, Google Chrome, the Chromium code base, is approximately 40 million lines of code. If somebody tells you they've gone through the Google Chromium code base and cleaned it up and removed all the Google algorithms, they're lying to you. There are a lot of big lies in this industry, and one of them is the scale of the problem that we have, of billions and billions of lines of code that nobody can understand. Now, computers, another big lie is that computers stopped getting a lot quicker a few years ago. About the time that Intel released the Pentium 4, that space heater that some of you might remember, Intel released it in 2000. They promised that by 2005 they'd have a 10 gigahertz one. It didn't happen. Instead, we got the Pentium M, and that evolved into the Core series. They're smaller and they're cooler. More's law doesn't hold anymore. What we get now is Kumi's law. Kumi's law says that over time, chips get smaller and use less power, not twice as fast. What we get now is width, not speed. We get more cores, more co-processors. Brothers and Level law. Amdahl's law. Look that one up. It's quite fun. Amdahl's law says when you parallelize a piece of code, the scales you get from more threads of execution top out quite rapidly. If you have an infinite number of cores available and you can take 95% of your code base and make it parallel, you'll get a maximum 20 times speed up. That often actually requires in the region of 1,000 to 2,000 cores to support that. After that, it stops working. Nobody can automate making code parallel. It's not possible. As far as we can tell, this never will be possible. A human has to read the code and make it parallel. That is never going to happen with the size of Linux and so on today. It's not possible. We can nibble around the edge with our many eyes making all those buds shallow. We can tweak little bits. We can refine subsystems. We can't rewrite the whole thing. It's no longer manageable. This has built a multi-trillion dollar industry around the world, but it's very, very hard to make big improvements anymore. The hardware isn't helping anymore. Software is vast and vastly complicated and nobody understands it anymore. Let's compare this with Plan 9. I got a few letters and emails and a phone call when I last wrote about Plan 9. One chap said, oh, yeah, yeah, about the size. The kernel is 5,191,091 bytes. It's 5 meg. The entire distribution is all sources, documents, the local Git repository, and binaries is about 530 meg in the AMD64 version. That is not just an operating system. That is a clustering system, a cluster-aware file system. It's a container management system and a networked GUI in half a gig. Plan 9 has 38 system calls. I have been trying to find a list of how many system calls there are in Linux. In 2016, I found an estimate that kernel 4.7 had 335 syscalls. In 2017, that was up to 341. I found a chart for 6.8 and fed it into Libre Office and looked at the number of rows because it's weird and it's formatted. I think there are 520 syscalls across all the architectures. That's the bare kernel. Plan 9, 38. There is another problem with Plan 9. It's not Unix and you can't run Unix apps on it. You can't port Unix apps to it. It's too different. They have some solutions. There's a thing called Ape which is a wine or wine lib. You can recompile POSIX code for 8 and get it to work, maybe, if you're lucky. They also have a Linux emulator called Linux Emu, which is a bit like the FreeBSD one. It's 32-bit only. It's old. It's largely unmaintained. Now, the open Solaris Linux Emulator was unmaintained until Joian modernized it for SmartOS and they got it up to parity with kernel 5 and 64-bit, but it was a lot of work. It also has a hypervisor called VMX. It's limited. It only can give a VM one core. I think it might be only a 32-bit VM and the documentation says VMX can and will crash your kernel. But it is there. But look, there's all the technology that's come along. We all have 64-bit computers now. We all have loads of memory. There's 64-bit Plan 9 in the form of 9-front. Over in Linux world, there are these things called micro-VMs. Amazon have got Firecracker for AWS. There was also a project called Intel Clear Containers, which merged with another one to become Catech Containers. Tiny virtual machines designed to boot an operating system in milliseconds, run one program and then quit. They're trying to make VMs as small and quick as containers. There is already a hypervisor for 9-front. Maybe we can make a Linux micro-VM for 9-front. So you run Firefox and it starts up a kernel and it loads Firefox and it's running in a window on your desktop and when you close Firefox, the VM goes away. It doesn't mean adding loads of bloat to Plan 9. It doesn't mean it has to have a huge emulation. Just run the real kernel. We can do that these days. There is stuff we can do. There's a thing called KubeOS that some people may have seen. It runs every internet-facing program in a separate VM for enhanced security. But we can do some stuff on the Linux side of this. When I predicted containers, I went back to IBM inventing virtualization in the 60s. IBM couldn't make mainframes in the 60s interactive. So what they did is they put a hypervisor on the mainframe and they ran a dedicated interactive operating system, one per user, in a VM. How about we make a Linux that's designed and can only run in a VM? How about we have a Linux that has no hardware support? It can only talk to VertiO devices. It boots with Pixi or something like this off a file system. There's already a 9p server and client in the Linux kernel. It boots off a directory. It has no hardware support, no file systems, no disk drives, no serial port emulation. All that goes out the window. It can only run in a VM. We make it boot really quickly. We can do it today on Hyper-V and KVM and Xen and VMware and all these things. We can get this working now. And then we can make a plan line that could run Linux programs and was Linux compatible, which means it is Unix compatible. And so I had this idea a while ago and I took it to a couple of plan line communities like on Reddit. And I expected people to shout and call me an idiot and tell me it was a stupid idea. Well, okay, some of them did. But what surprised me is people said, why would you want to do that? You can run plan 9 in a VM if you want that plan 9 stuff. Just run it on Linux and then you can run your web browser on the host. Well, I'm interested in trying to go the other way. Can we provide a plan 9, this Unix 2, that can run our old programs while we find ways to bring this stuff across? Can we make the whole Waylon versus X argument go away? There's an X server for plan 9. It's called X. Spanish for X, I think. Probably going to crash a lot, but the point is you could have graphical apps in your tiny container rendered on the desktop. Linux is really mature. The VSDs are really mature. Each successive release doesn't change much anymore because the code base is so big, so complicated, so hairy, we can't change it much anymore. But there is another path here. Now, I proposed a path involving Smalltalk and Lisp and Dylan and Oberon and why I was fosdome the wrong crowd for that, I think. But here's a very unixy way forward. Here's a way to build a next generation better system that brings in containers and a windowing server and networking right into the operating system. Does it with 10% of the syscalls of a modern Linux system, maybe more like 5% these days? None of the plan 9 people said to me, you can't do that, that's ridiculous. They're like, could if you wanted. Well, I can't. I can barely program my way out of a paper bag. I write about this stuff. I explain it. That's my job. But I'm just throwing this idea out there. There is a possible path forward. It's all open source. It all already exists. We could start working on this now and actually possibly have a plan for a next generation of unix-like system. There isn't billions of lines of code that nobody can ever fix. I'm throwing the idea out. Have fun with it. That's the end. So we have time for a few questions. Will you publish the slides on the FOSTA and webpads? Absolutely. And the script as well if you'd like it. May I just tell you want to restart everything from scratch, which is good because I really think more lines of codes, it's not good. And kids now they don't know what means to make it less. Less RAM, less CPU. So they need to learn that part of the schooling. I completely agree. Hi. Thank you for your very interesting talk. My question is you compared the size of plan 9 and let's say a successor 9 front with a Debian. So what the point is that the communities that are creating new Linux distributions are much larger while let's say 9 front is the work on that stopped in the 90s and there is the plan 9 stopped in the 90s and there's this 9 front community which are only very few people. So I would assume that when more people would work on 9 front it would also just grow to the same size as unix or VSD systems today and then we have the same problems again. It might well be, yes. I mean that's kind of human nature, isn't it? But I think that there are core hard problems which we do not have any way to tackle with the vast code bases of things like Linux. The problem may recur but that's the problem for our kids if they're not living in caves in the Arctic thanks to climate change. I wanted to maybe disperse your bubble but I have tried that sort of thing when you try to take a simple system and bring it up to some bigger project. It doesn't work because you end up for performance reasons re-implementing all the complexity on bigger systems. You may get something slimy because you started from scratch, you know the problems but you end up adding a lot of the complexity and particularly when it comes to things like plan 9 it's all these things you mentioned like the fact you cannot move files or the fact that you need some ways to share the view between processes or the fact of simple hardware support and heavy-dip performance that bring back all that complexity. It could be, it could be. I looked up the hard link, sim link and move issue. Oh brilliant. Don't get too excited that's no mine.
First Aid Kit for C/C++ Server Performance
And today I will show you some of the most common performance issues which I have seen so far in my career, how to fix them, and the benchmarks which show the numbers, what kind of performance increase which you can get when you fix the stuff. It works. My talk will follow the plan on the slide. So I will first present you some issue categories where you typically lose most of the performance, at least in my experience, like I said. And then for each category we will go through specific topics, what you can optimize how and what kind of numbers you can get when you optimize. And then some sort of conclusions of the topics on the slide, we just go for them one by one. The QR code right now is not working, it will be working after the talk. Everything will be online clickable, you can just walk through it again to repeat the recipes if you need. So back end performance at least in my area of work, back end usually means one of those three things, latency of your requests, CPU and memory usage on your machine, and your throughput, which is how many requests you can process third time frame, which is usually expressed as per second. So request per second, RPS, and we want to improve this stuff. And also there are those bad places where you can lose performance in those three categories, which are inefficient suboptimal heap usage, unnecessary expensive threat content should be done on critical paths and inefficient networking, like inefficient networking IO or inefficient circuit scheduling and things around this stuff. And like I said, we just go through each and see specific cases. Starting with the heap, to understand what can you lose here, you have to understand how heap is working. It's just enough to understand basics, you don't need to know specific implementations. But the basics are that this heap thing, it's a data structure, like some sort of tree, has stable whatever, it's global in your process, and it is used by all the trends. When you call new or malloc, they go into the heap and fetch three block of specified size, return it to you and you use it. When you call free or delete, they are placed back into the queue. And this operation of finding free block in the queue of needed size or placing free block back into the queue, it takes time. So this lookup thing in the heap, it's not free and it's not constant time. It's some lookup time, which depends on how big the heap is, for example. So if we mentioned that this is a tree, which stores blocks sorted by size, then lookup time will be something logarithmic or something like, doesn't have to be tree. But the point is the bigger the queue, the bigger the heap, the more expensive are the lookups in the heap. Also like I said, this is a global thing in the process, usually by default, which means that you will get threat contention on this thing if you use it extensively. For example, multiple charts are located in blocks of the same size, but very frequently you will have threat contention. Heap does have mutixes inside and you can even see them sometimes in the flame graphs. So make it worse if you are writing in C++ and you're happily using those nice fancy containers, least vector queue stack unordered containers. No, forward least, all this nice easy to use stuff, you have to realize that even if you don't use heap explicitly, it is used inside those containers. Heap is basically, vector is basically dynamic array, myp is basically a red-black tree where nodes are allocated, least allocates, containers for every item of the distance are on. So you use the heap even if you don't do it explicitly to make it even worse after that. You have to remember that allocations affect each other. Like I said, the more allocations you do, the slower it will be next to allocations and freeings. Use heap becomes bigger, it becomes more fragmented, less optimal and it gets more and more expensive to use. What can we do about this stuff? Firstly, you can try not to use this stuff. You can just not use the heap when you don't need to. For example, when you can just locate stuff on the stack, when something, some object is array, it's small enough and its size is known at compile time, just declare it on the stack and use it. If it doesn't have to be something long-leaving. Or another frequent use case which I see is then when we have a class or struct and we store in there something by pointer and lifetime of this object is equal to the class where it's stored, right? And they just store it by value then. You will reduce number of heap allocations then. When you cannot get rid of the heap allocation and you have it in some critical path which is very frequently used on your server and you see it in the flying graphs, you can still do something about it, you can optimize it. And there are ways, some easy ways how you can quickly regain some performance back. We will start with the object pooling thing which is not as simple as it sounds. Typical very widespread use case in the back end. We have this server, requests are coming to the server. Each request is read from the network parsed allocated into something like struct request or class request. It can be big, one kilobyte, five kilobytes of different data, different members, attributes, then you place it into your business logic pipeline. It is processed in the end, it is deleted. And this process is repeated again and again for every request. And if the request is big enough, like one kilobyte, and you do it frequently enough, like 100,000 times per second or million times per second, then you will get heap issues here because this heap allocation and freeing will get expensive of such a big object like one kilobyte or more. And you can see it in your flying graphs sometimes if you are building them at all. Example of the code. So we have this class request with many members. Some of them can be indirect members. For example, we could inherit this from base request, which is base, base request and something like and it can pile up. So in my current project, the size of this thing is two kilobytes from those many, many, many small members. And then we have this business logic pipeline, like process request and it allocates request object, fills this with data and when request is complete, asynchronously somewhere it is deleted. This thing, those two lines will get expensive. If done frequently enough and request is big enough. Effects of the heap here can be mitigated quite easily. If instead of using the heap all the time for allocating and freeing stuff, we will just allocate it once and then we will reuse it again and again. So we use the heap just once and then we don't use it. And we avoid the heap issues. This is called object pooling. When you allocate stuff once, store it in some sort of pool and then you take it from the pool and place it back by placing the heap. Even though first time you do allocate it on the heap. Then what you get from this is that firstly you do not pay for the lookup time in the heap. If you remember that the heap is storing those blocks of different sizes, sort it somehow then it needs to be something like 3 or hash or whatever. But here all the objects are the same of the same size. It can be just least or stack, right? It could be done in constant time, allocation and freeing. We do not pay for lookup time anymore. You can deal with concurrency in a more efficient way than the standard library. I mean you can switch of course the heap from like GMLOCK or TCMLOCK, right? We heard about it. It can make stuff actually faster. But if you do not want to or you have to have more control in your code over those things and you have this pooling thing, you can implement concurrency yourself and you have to agree that doing concurrency stack or concurrent list is obviously much simpler than doing concurrent tree or concurrent hash table or something, right? It can be done much simpler. Let's try. This is how I tried first time. It's good first try, right? Kind of. It's simple. That's why it's good. Sometimes it's even good enough, right? So we don't need over engineer things. But in this case it makes not much sense because if your code is very hot and you suffer from heap contention and you change it to this, then it will get even worse because you will exchange heap contention with mutics contention. And secondly, you are still using the heap because if you are storing in an STL container, any of them, you will be using heap and we don't want to use the heap. So it cannot be done this way. But it can be improved. It's not a dead end, right? This is how it can be improved. And the alternative is to add local pooling. So what we have is instead of single global pool for everything, we have one global pool and also in each thread we have thread local pool of limited size in each thread in addition to the global pool. When threads are locating something with new or maloc or whatever, they take objects from the local pool, not from the global one. And when they free objects, they place it back into the local pool. And this local pool can be done very, very simply. It can be just a list, an intrusive list and that's it. It doesn't need mute access or anything because each of those local pools is used exclusively by one thread. But when pool inside some threads becomes empty and they want to allocate more, they will take a batch of objects from the global storage and will reuse this batch until it also ends and so on. On the other side, when they will be freeing stuff and local pool becomes too big because it's limited in size, it cannot grow. Infinitely, they will move it back into the global storage so as other threads could reuse it. This way we get firstly that the heap is used rarely. It is used in bulk when it is used to allocate at once many objects, not four, like 64, 128. And also it will not be used at all after some point when all the pools will get saturated. And fourthly, there is no contention on the single global pool. This global storage, it can be protected with the mutex, but it is used so rarely that this mutex contention will not be visible. It will be used at most every, like, 64 locations. So it's 64 times less contention, which means it will be basically almost zero, neglectable. If the explanation was too bulky, I prepared an example how it works, like a real life example, how it could look like. Imagine that we have those three threads and empty global pool. All is empty in the beginning. First thread wants to allocate something. It will take a look at the global storage. There is nothing, so it has to allocate a new batch. New batches are located on the heap. But then when it will allocate objects, they will be taken from this batch. No more heap allocations. Just one heap allocation. And then from the allocated batch, we take objects one by one. Then second thread. That's the same. It has had local pool empty, nothing in the global storage, so it had to allocate second batch. They keep using the objects from the local pools. So far, we only did two heap allocations. But then happens, which happens very frequently in backend code. Those objects, they migrate into another thread. It happens when you have dedicated threads for networking, they read data from network. They parse it, create this struct request, push it into some sort of queue. And this queue is taken by other threads doing business logic, and they will delete the request. So most often, it happens that you have one thread allocating requests, other threads deleting requests. Objects will migrate. So here they migrated. And this other thread completed them somehow and tries to free them. They do not fit into this local pool. It is limited in our example by four. So fifth item didn't fit. And to fit more, it will have to migrate this pool into the global storage. And then it can keep freeing stuff. Now, a little bit more random work happens. Some more migrations. And then we are in a situation when second thread wants to allocate something. But it doesn't have anything locally, so it will go to the global pool. And this time, we have a free batch. So we take it. And we use this batch. So far, during this entire demonstration, we have only used heap two times for all those allocations. And at some point, after some more work, we will not use heap at all. It will be all saturated. Work continues like that. How it could look in the code? Yeah, visible, good. I have this benchmarks link is on the slide. Everything is already open. You can reproduce it yourself. I have this value, I think, which is, whose size is configurable at compile time via templates in C++. And I'm testing it with sizes one byte, half kilobyte, and one kilobyte. And they also have the same value, but thread pooled by the algorithm, which I just explained before. And in the C++, no matter how much we can argue whether it's good or bad, full of unnecessary stuff, but templates are sometimes very nice. In this case, I implemented the pooling in templates just once. And what I have to do is simply inherit this magic thread pooled class, and my class becomes thread pooled. I can simply apply it in as many places as I need, and all the types will become thread pooled with their own independent pools. So I'm comparing value versus value pooled. The comparison itself is that I have many threads. Each thread allocates many, many values, and then frees them in random order. And then again, and then again. And I am testing how fast is this spring and allocation, depending on number of threads and so on. And those are the results, which were surprising for me to be frank, that for one byte, even for a single byte case, I got a speedup. Normally, heap is very fast, even for, is very fast for smaller locations like those few bytes. Heap is actually extremely fast, the standard heap. But some why my pooled version was even faster than that on single byte case. But my most interesting relevant case was twice faster, which was good enough. And it can be actually quite visible in the final RPS. So of course, you have to benchmark everything. You shouldn't just blindly make everything thread pooled. I think this stuff will get faster. Probably will not. You have to apply case by case, measure performance, see how much it helps. I have seen in my experience that this can help and can be observable in the final RPS. This simple thing. What else can we do with the heap? Intrusive containers. So to understand the problem, which mostly comes from STL again, from STL containers, those list, map, unordered things, forward list, and the thing which unites them all is that they are not intrusive. And to show the point, let's have a look at the list. The lists are the most popular type of container, at least in my type of work in the stuff which I'm coding. I very frequently use lists. And the problem with the list is when you push something into the list, it will not be directly saved there. It will be copied and saved into link container object, this gray cube. Even if you store pointers, this pointer, those eight bytes, they will be copied. Not your object, but something will be copied and it's unavoidable. And it will be copied into this link container thing, allocated on the heap every time when you push into the list. And when you pop from the list, it will be deleted. So every operation with the list costs you heap operations. Secondly, which is not so obvious, but it also has performance, is that when you store pointers in a STL list, iteration of the list becomes slower. Because when you store pointers and you want to get into your object to de-reference some written member, for example, in your struct, you will first have to de-reference link container and then you de-reference your pointer to get to the member. You have two memory access operations. And they are not free. This arrow thing costs something. So we have additional memory. We look up simply because of how a STL list is implemented. What can be done with this? It's an intrusive list. So what we do, the basic idea is that we add those links, next and previous links, which are linking the items together. We add them into our object directly, like in the old C times. When you ask a student to implement a list, they do this. Probably they are doing it right because we will not have heap users here on every push and pop because we don't need intermediate link container objects to locate and delete them. And secondly, we don't have this additional memory lookup because to get your data, you just de-reference your pointer and directly get to the data. No intermediate objects for this. The only problem with those intrusive containers is that they are quite bulky, at least in C. So this is huge pain. Maintaining those next and previous pointers, head and tail of the list, and you do this every time for every type that you have and you want to store it in the list. This looks quite not good. It's quite hard to reuse such code without C++ templates. C++ templates, you can implement actually intrusive lists just once and then reuse them. On the slide, there are links to forward list and doubly list implemented by me. On the left side, you can see how the API looks for forward list. And on the right side, how it's used. So I have this object something. I simply add this next pointer in any place of my object and I instantly can use it in intrusive lists. With the intrusive list implemented just once using templates. And this name, by next, it's customizable so you can change the name as well of the member. Then what can you get from the performance if you apply intrusiveness is shown on this benchmark link on the slide as usual. I'm comparing a list of pointers with intrusive list. It's the list of pointers because usually, just like I said in my code, I prefer to manage lifetime of my objects myself. And when I have an object, I push it into the list. So I have object before that. And when I pop it from the list, I usually keep sleeping for a while after that. So I don't want to copy the entire object for storing it in the list. That's why I usually store pointers. And intrusive list stores pointers by design. So I'm comparing kind of similar cases. And the benchmark is doing that I'm measuring time of list population, how fast I push items into the list. And list walking, how much costs to me this additional memory lookup. It's interesting, right? So this small arrow thing is even visible in any measurements. This is what you get when you switch. So at least in this benchmark, right? So it might not get this speed up in your case, but in this benchmark indeed. And in my experience, it also sometimes does. I've got almost three times the speed up for list population because I no longer allocate those link containers. Firstly, secondly, you see this walking speed is 7% very small, almost noise. But it's not noise. It's reproducible. Every time you run this benchmark, you will see this difference, which comes, it's not much, but it comes from this additional arrow thing. And it's not much, but it doesn't mean that you can just leave it, right? Why have this performance loss if you cannot have it? Those small things, they pile up into something bigger. In my experience, this was all the easy stuff with the hip for which we have time. We can also have a look at thread condensation things. What is thread condensation? It appears when you have multiple threads which try to access certain critical sections at the same time, like mutex protected data or something like. And when this happens too frequently, it can cripple your performance, your cripple parallelism of your code. So your code will not as parallel as it could be. And result could be something like you have the 64 core machine, you enter it, you type H-stop and you see two cores used, right? It's not a good situation, paying so much money and then getting this. You are not utilizing all the resources when you have thread condensation or you are utilizing them on the condensation itself, not on something useful. And what can we do about this quickly? Like it's first aid kit, right? So it should be done something easy and quick. First thing, false sharing. It assumes that, let's start on the case when you think it's easy stuff. I know this, I am master of condensation. I don't have it. This is how I protect it from condensation. I placed this link on the slide with the benchmark and the example is that I have this object with two members. One member is always accessed in one thread, other member is always accessed in another thread. And seems like I don't have condensation because I am not sharing any data between the threads. And they have this benchmark which does some amount of work for 10 seconds or so, which looks good enough. But if I do it like this, I get five times the speedup. By adding this 64 bytes of unused data between my members of the, in this track. What is the link that I increased size of this track and I've got five times speedup? Should I just make all my strikes bigger the better? They will get faster. To understand the reasons behind this, you have to understand how CPU works with the memory. The thing is that CPU cores in your CPU, they don't access main memory, the RAM, the bus directly. They do it through this proxy thing called CPU cache, which is to put it simply is basically one cache per core, right? Not to dive into too much details. And this cache thing is basically accessing the main memory for the CPU and CPU is reading the cache transparently. And the cache, it has those blocks of fixed size, which are copies of small, small parts of the main memory. And those small blocks of fixed size, 64 bytes or 128, we call them cache lines. And all works fine and fast until we get the case when multiple CPU cores for some reason start reading and writing the same cache lines. For example, by the same address, one thread is doing crates, other threads are doing grids. Then we get contention and CPU has to perform this very expensive synchronization of the different cores so as they would store the same data for the same address. So as it wouldn't happen, then for the same address, different threads see different values, right? It shouldn't happen. And this synchronization of the cores is very expensive. This is where the slowdown happens. And what could happen and did happen in our case, that data which was seemingly unrelated, different bytes, they by bad luck just happened to be in the same cache line. And we've got contention on the cache line on the hardware level, not on the application logic level. Simply because when you work with memory, you always work with basically with minimal size of single cache line. Even when you access single bit, the entire cache line of 64 bytes containing this bit will be used by the CPU by the cache. My fix was as simple as just adding this to split my data into separate cache lines. And now I no longer have contention. This is how I've got five times speed up. This is measurable in the final RPS as well. It can be visible when you fix it. Just when you're fixing it, make sure that it makes sense. Like I said, don't just add the 64 bytes padding everywhere where you think you're sharing data. Add it, test if it makes sense. If it doesn't change anything, then just don't add it. It's as simple as that. What else can we do with thread contention? Have a look at memory ordering. If you are having highly loaded multi-threaded application, it's very, very likely that you are having also those atomic operations in your code. Like SDD atomic and C++ and double underscore sync, double underscore atomic in C compilers, which all do the same basically. Today, besides some arguments, they also take this memory order mysterious thing. There are plenty of those orders. And what they are doing is that they regulate how much freedom the CPU has about executing this instruction and instructions around this one without explicit ordering. CPU can execute your instructions in any order it wants. Even if you turn on all the off of the compiler optimizations, your machine code looks absolutely linear, even if you have single thread, still those instructions inside single thread can be completed in random order. It doesn't matter in which order you wrote them in C or C++ or whatever you're using. Example on the slide. So we have those free variables, ABC, starting zero, and they have one thread assigning them to one to three in order ABC. And then other thread reading them in different order, CBA. It looks impossible by all the logic, but it is in theory possible in some CPUs that you will get printed free equal C and zero equal B. It looks impossible because if second thread C is B, C assigned to free, it means that also it should C be assigned, right? Because it was assigned before C. But it could happen that it will not see this because, for example, read in second thread of B could be completed before the read of C. Or writing of B in thread one could be completed after writing of the variable C. We don't have any guarantees from the hardware when we are talking about observability of thread state from another thread point of view. And if you think if you are safe on x86, you just don't use ARM and ignore the problem, the bad stuff is that you still have reordering on x86. There is example on the slide by this link which you can compile and run. And even on x86, it will demonstrate reordering. The some instructions, logically impossible, will complete in different order. Without any tricks. It's completely predictable machine code. It will happen even on x86. We will not dive into details of each memory order possible. It's too much time. But I will demonstrate you what kind of speedup you can get if you study memory ordering and use correct ones. Benchmark on the slide link as usual and the benchmark is very simple. I have this loop, single thread. I'm not even using multi-thread here. It's just single thread. I'm using Atomics in a single thread to demonstrate the point. It has this loop where I'm using STD Atomic and it runs in two versions. First is default STD Atomic operation with sequential consistency order on the right side. Memory order, sex, CST. It is default when you use STD Atomic and don't specify memory order. So it is the safest and strictest order when you use it. Code works like it looks. This wire is default. Otherwise people wouldn't have to bother when they don't care. But it is an overkill in this case. It's too expensive. And in my case, relaxed order is enough. It is in most cases actually enough. Like in shared pointers, relaxed order is enough. And I'm just comparing this loop with relaxed and sequential. Just think of a number. What do you think would be, how much would be a speedup? Like probably you're thinking zero because if you know x is x86, you will tell me that it will render the same machine code. When x86 writing has the same guarantees. Prosequential consistency and relaxed order doesn't matter. x86 is safe, right? But I've got 16 times a speedup here if I'm using relaxed order. It was x86. It was modern compiler. Loop was not optimized out. It was minus of re-optimization. So top optimizations. And still I've got 16 times speedup of this loop. What happened here exactly? If I open machine code, this assembly stuff, I will see that relaxed order was compiled into single-move operation. Prosequential consistency was compiled into this exchange operation. The reason is that on x86, there is only one possible re-ordering. Prosequential consistency order protects from this type of re-ordering using this exchange operation, which gives more guarantees than move operation. And the problem is that in this case it wasn't needed. So I just requested two strict guarantees where I don't need them. And I paid 16 times slowdown for this. And in fact, at least in my entire career, I have never seen case when sequential order was needed. It is needed in such extreme weird cases that I have only seen the artificial examples. I have never seen it in actual production code needed. The only pre-orders I ever needed were relaxed or acquired plus release, nothing else. So this is what kind of speedup you can get. For fun, go to Godbolt and try to render the same code on this version of compiler on C-line. It will be even more interesting. Just amount of machine code simply didn't fit on the slide. That's why I didn't put it here from C-line for this simple loop. What else can we do with thread contention? Look for eqs. In the background code, it's very, very frequent that you need some sort of queues sometimes in multiple places of your application. And the usual use case is that you have multiple threads producing something for the queue, like requests. They read from the network, allocate requests, validate it, and push into queue. And other threads are doing, for example, business logic. They are taking objects from the queue and processing them, and then deleting, like on the slide. How can we do this? We start simple again. If we don't have much load, then this solution is actually just fine. So we have this queue. It's just a Mutex protected container in STL. It works fine. If you have hundreds of thousands of RPS or millions of RPS on this queue, then you will get Mutex contention here. You will get it guaranteed. What can you do about this is just get rid of the Mutex. And there are solutions how to do this, called log free queues, which allow you to have a queue without a Mutex and STL. It will be thread safe. And the problem with those queues is that there is no one major queue, which is best for all the cases. Implementation of specific queue very much depends on what kind of queue you want exactly. Like there are those four types of queues, depending on how many threads are producing for the queue, how many threads are taking objects from the queue. And also you have to know whether the queue should be limited in size in your case, what happens when the size limit is reached. So when you understand your case, you can choose one of the queues, one of the implementations. There are many, many implementations. Of course, I just placed a few of them on the slide. For all the queue types, two of them are mine. One of them is this very popular, according to GitHub stars, Cameron 314 concurrent queue. And also there is this very nice website, 1024course.net. Who knows? It's a very nice website, which not only contains source codes of various queue types, but also they are actually explained there in a simple language. So you can go there and dedicate yourself about how those queues are working and why, what is log free, what is weight free, what are all those memory ordering types. It's all explained on this side. Very understandable stuff. And like I said, don't just use multi-producer, multi-consumer queue for everything. If you have, for instance, single producer, single consumer queue, it can be done much faster than the former. So just be careful what you choose. And this is kind of speed up you can get when you simply switch from mutex protected to log free queue. This is benchmark of my two queues. Some benchmarks doing multiple producer, multiple consumer threads. And stuff, and those are the numbers. So this also can be visible on the final RPS of your application. Just make sure you test it before you apply it. All of those stuff I'm mentioning today, it makes sense to test it first to make sure if you actually need it. What else can we optimize quickly like first aid kit, networking, backend performance, very often like 90% of all this performance will consist of how efficient your networking is. How efficient is your data received and sending. And in cases like one connection, one request, this HTTP stuff, it also matters how quickly you can accept and close clients. So this socket creation and closure also matter how fast you can do this in those types of scenarios and quick stuff we can fix here is, for example, scatter gather a link to the benchmark on the slide. And the use case is this. You have, imagine this multiple buffers that you want to send. Each buffer can be separate message or each buffer can be part of single message like chant, response or something. And you want to send multiple piling up buffers into the socket. How do you do this? Do it in a simple way. You just run the loop where your calls send on every buffer, right? It works, obviously. And on this benchmark, I have speed two and a half gigabytes per second on local socket payer without networking. And I was sending 16 one kilobyte buffers every time when I called send all works fine. But if I do it like this, I suddenly get two and a half times speed up. And what I changed is that instead of loop of send calls, I did a single send message call. Even the code on the left side looks bigger. It was this much faster. In practice, I saw that this switch made my code 10 times faster. It just depends on how many buffers you are trying to send of which size at one time. In this case, 16 buffers each one kilobyte in size, local circuits. I got this speed up, but it can be better. And where is the speed up coming from? The thing is that on the right side, I did 16 send calls. On the left side, I did single send message call. And those send and send message, they are in fact system calls. Very, very expensive. Switch into the kernel context when you're calling those things. And this is extremely expensive, basically. Every system call is always very expensive stuff. And you should avoid that, make them as few as you can. In this case, speed up is coming exactly from this. I simply made less system calls. I sent into the kernel multiple buffers at once. And this single, even single system call is many orders of magnitude more expensive than just filling this Iovac array. Even if it's something like 128 buffers, sending more doesn't make sense. 128, as far as I remember, it's the limit in the kernel anyway. They will not accept more. Funny thing, when you try to send more, sometimes kernel can return errors. It will even just do partial send. It can return error, like too many buffers. Someway, this is what I absorbed at least. So the solution here, if you have multiple buffers to send, simply use send message and receive message instead of looping those send and receive calls. And of course, it only matters if you have more than one buffer. If you have just one buffer and you switch from send to send message, absolutely nothing will change. Some people might be already thinking that why didn't I use readV and writeV calls? Because they look simpler. I don't need to fill in this message header object. I can just send array of Iovacs directly, right? They will work even with the same speed. The problem with those system calls, read and write, readV, writeV, is that when you use them, they are accounted in the kernel as disk operations, even if you use them on the circuit. I don't know why, but it is the fact. So when you are using read write calls on a circuit and you check this protspeed.io file, it will grow even if you call those functions on circuits. They will be accounted as disk operations. If you don't care about the statistics, then you can use those functions. But if you care, try to use send message and receive message. They are portable, available on all the Unix-like systems. So good stuff. What else can optimize event queues? It, of course, depends on the application very heavily, but often in the backend servers, we have, they can be quite loaded. So we can have tens of thousands of clients easily in the same number of circuits in one process of server. And although circuits can generate events like circuits can become readable, writeable, and receive out of band data from TCP or receive errors and stuff or custom events. And we need to handle all those circuits somehow at once. And there are three ways how to do it. Without ridiculous solutions like one thread per circuit or one process per circuit, it's not scalable at this scale. And the solutions are periodic polling, reactive polling, and event queues. Those are made up names. I just made them up myself. It's not like they can be found somewhere. And we go through each. So periodic polling is the simplest approach. As simple as you just have a loop where you iterate through all the circuits, and you try to read and write each, and then you sleep. And then you repeat. This way you don't spin on the busy loop, and you still handle all the circuits. The problem with this solution is that firstly we'll have additional latency here because imagine that circuit number N becomes readable. To get to the circuit, you firstly have to try to read N minus one circuit before. It will cost you time if you have thousands of circuits. Secondly you will lose latency here because imagine circuit became readable, and you just started 100 milliseconds sleep. You will waste 100 milliseconds of latency absolutely with no reason. And firstly you will waste CPU here because you will be doing lots and lots of unnecessary system calls. If socket is not readable and you are doing receive, you just wasted a system call, wasted some CPU time. The stuff can be easily fixed with a couple of solutions, one of which I am presenting only for the sake of you not using it because select thing is deprecated. It gives undefined behavior on Linux. If you have more than 1,024 sockets, or even if just one of them is bigger than 1,024 by value. It is not advisable for you to use it even in documentation. So there is an alternative poll which works quite fine. Even these days and it takes array of descriptors with events you want to listen for, those events field. And when you call poll, it will block you until any event happens with any of those circuits and when it returns, it will fill in our events field in all the descriptors with events which are available for circuits. This is how it looks in the code approximately. So we have this poll, you call it on all your descriptors. When it returns, you have events and you are scanning all the sockets and checking which socket has which events. Then you don't do those unassisted system calls. You only do reads when socket is readable. Write when socket is writable. And I have this benchmark, click on the slide where I have 5,000 clients and they are sending 5 million messages in total. And only a part of clients is active at the time which is realistic. It's not like all the time all the sockets are active. This kind of speed up I get when I switch from periodic polling to poll. I have got 25% speed up instantly and I did zero system calls which ended up with eWood block and periodic polling did 120 millions of those system calls which were not needed. And thirdly, periodic polling wasted huge amount of CPU time because it was spinning in a busy loop. I didn't even have slips in this case. If I would add slip to periodic polling here, it would get even slower. Here I didn't have slips and still it was slower and it wasted huge amount of time on those unnecessary system calls. This is not the end. We can optimize it further. One last optimization using event queues. The idea is that instead of having socket array in user space, we can have it in kernel space. And kernel will monitor your sockets all the time for happening events and notify you when something happens. This is Epolyn Linux, KQ on Mac and BSD and Diocompletion ports on Windows. So the idea is that you create this event queue, you add sockets one by one into the queue for monitoring specifying which events you want to monitor. And then you call this Epolyn wait thing to fetch the events. When it returns, you handle the events. It is as simple as that. So instead of placing all like 10,000 sockets into the kernel for each Epolyn wait, we just call Epolyn wait and get the events. This is how it looks in the code. We call Epolyn wait on our queue. We get some events, return, we handle those events, just them without full scanning the entire circuit array. For example, if you have 10,000 sockets and 10 of them got events, you will just iterate 10 times here, not 10,000 times. And the rest is the same as with polls. So we just read where readable, write where writable. As simple as that, if I apply this on top of poll, I get another 30% speedup. Even with a single chance. So you can of course optimize it first, but those were simple optimizations. This was all the stuff which we had time for, but also there is some additional content with eight other small, simple things which you can apply in your code. They're all clickable. You can click on them after the talk or ask them as questions right now if you like or ask me afterwards outside about those other optimizations. And now this was the end. Thanks for your attention. If anyone has any questions, then I believe we have time to take a couple. Thank you. Amazing talk. Thanks. You mentioned flame graphs a few times for the cash sharing issue. What kind of tooling do you recommend to detect those? For cash pieces? The cash sharing variables, sharing through cashes? Yeah, I think the first tool for example, Linux is able to measure the stuff. Okay. The first is also able to build nice flame graphs by the way. Any other questions? I have a question about the first example. I guess the second one, but still on the first chapter I guess of your talk about the intrusive list and it's my understanding that standard list C++ is also an intrusive list. So I don't think that interaction should do anything. The list is intrusive? Yes, I just checked it up so it can be in presentation. It's not intrusive. Okay. So for example, when you have this STD list, right, sign of intrusive list is when you have link inside of your objects. For example, if you store pointers and you have pointer at your object, you should be able to just unlink this object from the list directly, right? Just leave of the previous element, link with the next element instead of you. So you just in constant time can pop the item from the list, right? When it's intrusive. In a STD list, you have first it located. You have to iterate the list, find your pointer there and erase it by the iterator. Standard forward list also not? Are you certain on that? Unfortunately, yes. In a STL we don't have it. Maybe they, we have it in boost, I don't know. What, what in boost? Okay. Good stuff. Hello, thanks. What do you think about IO ring? I haven't tried it myself in real life use case yet. Okay. But I heard that can be faster than Ipul. So basically IO ring idea as far as I understand is the same as IO completion ports on Windows. Right? So you just directly send data from your buffers without copying. Yeah, I guess it is possible with IO ring to make even less assist calls. Yeah, yeah, perhaps. Could be good, could be great. But the idea falls into the same folder of event processing. So we don't then sort of socket array full scan or anything alike. Those are by the way, obviously not cross platform solutions in networking. I don't think we have anything cross platform enough besides maybe poll, right? And the rest could be like boost ASIO. Yeah, this stuff is working everywhere. If there are no other questions, then that should be it. Thank you all very much for coming.
20 Years of Open Source building XWiki and CryptPad
Okay, so hello everybody and thank you for coming so early. And so for those that were not there before, since you came so early, there are a few free t-shirts if you want to take them. So I'm going to talk about the story of our company, X-Wiki SS, building the X-Wiki and CripPad open source software in the last 20 years. So first a bit of a track about myself. And so I discovered technology in 1984, like using an Apple II and then I moved to PCs, I even moved to Windows 95, then I graduated from a good school. And actually in that good school I was kind of told, so I was actually, it was that we had a speech at some point telling us, you're a soldier of economic war. And so that of course resonated in a young person, but it also, I mean later on, it's like what, I mean why are we doing war? Like that doesn't make any sense. We're not fighting other countries, we should work together with other countries. And then in 1995, I was really very interested by the internet. I saw people using Mozilla browsers, Mosaic browsers in the school and I really just wanted to work on internet technology. And I ended up, so I started, I took one job about the internet at Capgemini, but after a few months I was recruited by Netscape because somebody from the team had left to Netscape and I ended up working three years. So who knows Netscape here, like okay, like so just not so bad. And so I was a consultant, I became also their Mozilla fan. I even wanted to work for Mozilla, Oregon, inside the company when they launched it. They didn't take me, so I ended up working in a French startup. I wanted to stay in Europe and that startup raised money, went IPO. I actually even was a virtual millionaire and then there was the internet bubble, it crashed. I was not at all a millionaire anymore, just like any other IT guy. And in that company, we were both by US company and in that company I used Wikis and that's how I wanted, I found Wikis like amazing in terms of how it brings people and helps people share knowledge. And that's how in 2004 I created XWiki. I was a bit accustomed to open source with what Netscape was doing. Netscape had a highly transparent organization and a way to share things. It was really pushing internet protocols standards and then they made Mozilla open source. And then I was a user of open source in my company as a CTO, like installing using Apache software foundation code and so on. And so in Wikis, we're purely coming from the open source world. It was very natural when I wanted to create a company and create a software to create it as open source. But I was not as much aware of the political aspects of open source. I was really looking at open source from the technical point of view. So that's how I started XWiki. I'm going to continue from that in this presentation. But I'm also now a member, our company is a member of APEL, which is an organization of companies that do open source in Europe. In France, we have the CNL. We have the Herb Open Source, OW2. I'm also on the board of OpenFoodFacts, great association working on open data for food. And I'm a small shareholder of Morena that is doing an open source phone. So I'm welcome to look at this. I find this amazing work. So what is XWiki SIS? So XWiki SIS is a French and European independent company. So it means we've been self-funded. And I say independent. It means it's self-funded. And I still own majority ownership. And the very large majority of the share is owned by employees, some ex-employees. Yes, I should stop. Okay, great. It's on HDMI, you put HDMI. The slides are here. No, no, you're here. And you're not seeing them. That's too bad. I took this cable. Sorry. Okay. So it means we control the company, which is actually something that's not so easy to achieve in tech companies. We are at around 4 million revenue in 2023. We did 50% growth, which has been really very nice. And we have 60 people, mostly in France and Romania, but also some people in Germany, even two people in Brazil. We do two open source software, XWiki and Kripad. One is the one we started with that I created. And then Kripad that was created inside the company in 2016. We have an international community. And so we also are very engaged for digital sovereignty. We think open source is very important for gaining control of software, both for states but also individuals. And we have a business model of allowing to have revenue for that software so that we can build it. And this is done through services support training, like anything software company do, but trying to do it in a way that it allows us to fund the open source software. So we have employees in all these countries. And so what we're trying to do is enabling freedom, both with the code but also with the products that we do. I'll come back to that. So two software, XWiki. It's about knowledge management. It's sharing information. What's really interesting with the knowledge, Wikis, is that it really allows people to share and make knowledge available. So we all know Wikipedia. But we do it for organization. Inside that area, we have competitors such as conference or notion or even the Wikis of Microsoft Teams. And this is part of, I mean the competition in the end for any open source software has a high impact on how you can actually fund your work. Depending on how you compare to the competition, you can find more or it can be more or less difficult to find money. We have more than 7,000 installs and more than 400 clients. And XWiki is now part of the Open Desk project. Also if you don't know Open Desk, I think about looking at it. Google it. We also did CripPad since 2016. And CripPad is an end-to-end encrypted document editing platform. Who knows CripPad? Here? Okay, good. And so I'd say competitors to Google Docs. And the real part is that it protects people's privacy. So I'll go on, how did we start? And so the big question is why be an entrepreneur in the end? Because I'm trying to, so in this talk I will try to focus more about the open source aspect of what we did. But when you talk about your company, it's all also about the entrepreneurship and the difficulties to just run a company. But so I had this wish to kind of create things and make them happen. And so that was a bit at the core of being an entrepreneur. But one really important thing was to try to do something that's useful for people and have some impact. I wanted to do it also in Europe. I've been to the Silicon Valley and I didn't feel I liked the moons. So the technology was great, but I didn't feel as good about the fact that people were just talking about money and how they would become rich all the time. And that really made me think, okay, I don't want to spend my life in an area where that's the goal. Like I want to be more in a place where we're talking about culture or whatever. And another aspect was as an employee in companies, you sometimes feel your managers are not doing what you want them to do or they are not fair or the company makes decisions that you don't understand. And in the end, you can complain and stay as is and keep complaining about what people tell you around. But my idea was that my feeling was, well, instead of complaining, just try to do better. And that's also a reason to become a manager or own company and be the one that has to take responsibility for what's happening in the group. A big aspect is about believing in the product and in the purpose of the product. One of the really important things that motivated us at X-Wiki for 20 years is the fact that we feel our products are missing or they're not enough existent in the world and they're useful. They serve an important purpose. When I started X-Wiki, I was a big user of Wikis, I was a big user of task management tools. And I said, okay, we could do task management tools, we could do Wikis. And I directed myself towards Wikis because of the fact that they help sharing knowledge. Task management tools are a lot about efficiency being more efficient in companies and I felt knowledge is missing more. Like we're missing more the fact that we spread knowledge and that we educate people more. And in the end, this has stayed with us for 20 years. So we have a lot of Wikis inside companies that help people get more knowledgeable about what they do and about the work in their own company. But we also have a few public Wikis. We have the dictionary of the history of Switzerland, which is a public funded Swiss project about the knowledge about Switzerland. We also have a Wikis about rare disease. And you don't want to look at the website too much because it's sometimes really hard to look at it, what the parents of the kids that have this disease live. But it's highly useful for this community of parents that are living with the disease of their kids. We also have a Wikis for public service in France and so on. And so from my point of view, if you want to stay motivated about software for 20 years, you also need to really believe about the fact that your software is useful. And in the end, in 2016, we created CripPad. We created the technology, but we decided to make it a product because we really thought it was doing something that was missing, is protecting people's privacy and that too many software are exposing the data, are not built to protect the data, and CripPad is a product that is built to protect the data. So now the problem is that if you want to do a good software, you're interested in doing it as open source, then how do you fund it? There's different ways. So you can just raise money. So that works. You can build, there is a lot of open source software that is built by companies that have raised money. Even now, the modern way of raising money is doing some crypto thing and launching a token and getting millions of dollars. So it can work. I'll come back why. I didn't feel it was a good approach for us. You can be an open source volunteer and that's great. But what I tried to do in this graph is measure the sustainability of that action and how much impact it can have and on the other side, how fast you can develop things, but also the comfort of doing it. Because in the end, if you want to do that for a lot of years, are you doing this under stress? Are you doing this having a good day and being able to have a good life aside? And so open source volunteers won donations being an independent professional, like a freelancer and getting paid for doing service around open source. That's really good ways. And bootstrapping a company, this is what we did. And I feel, and this is what I want to show also in this presentation, is that it's a good way. You have a decent level of comfort. You can have speed because if you hire people, you do more. There is the sentence that you can go fast alone, but if you want to go far, you need to go as a group. And that's what the company allows, being a group that is funded, that has some money and can go further together. And you can see in this presentation also the acceleration that we had over the time between the beginning and now. So investors, why? I want to take a little bit more time on the investing. So it took us a little time to realize it was not what we wanted. I came from a company that had raised VC money, and I saw the fact that you can create a momentum. You can have a money hire very highly skilled people, and you can build things fast. But in the end, the real thing that you need to think about is the day you take VC money is who is the real boss and who holds the key of the decision in the future. And whenever I had discussions with some investors, beyond the fact that they tend to like the salespeople or the business people more than the tech people, and so that might be a reason for them to not giving us money. But for us, the problem was, okay, are we agreeing on where we want to go? And in the end, investors are in for a return on investment, so making more money with the money that they put. And as an entrepreneur, that's not what I was in for. I was in for the human relationship with the employees, running a project over the long term, and creating open source. And when you discuss about open source, and in France at the time, it was also quite simple. They didn't understand open source, so you had to explain it. Today, it might be better like, oh, open source, great. Let's do open source AI in France. They love it right now, and they tell you, oh, it's great. But what is their goal with it? Do they want to sustain that open source AI, for example, or do they just want to make a play to take a piece in the market and then cash in at some point and close the work? And so this can create good open source. And that's fine. But if you have a goal of being, for example, good to your community, not lie to them. Not tell them that you're doing open source and not have a hidden agenda about how you're going to make some money. That's going to be difficult. And I felt as a CEO, if I raise money, I would start lying to my customers about what our real goal is with this open source project. And being independent allowed us to not be that. It's much slower. It was much slower. But in the end, it's more important to do it like that. In the end, money is a mean not a goal. That's really a thing to think about. And so what was bootstrapping about? And so from 2003 to 2010, it took seven years to get to one million of revenue. It took a lot of time. It took three years almost of myself not getting paid. I found some other ways. And then any time we would do a little bit of money through service, we would use the product. We would use it for hiring more people and growing the product and making it better. One of the great things with open source is that you can build on other people's software. And that's magical. And so you can really reuse a lot. And that's actually what the proprietary software companies are doing. Now you have 90% of proprietary software is actually open source software. And they keep control of the latest piece, trying to cash in or build some business model about our data. But 90% of it is open source. And that helps us also. That helps also the open source companies we can build on that. The support of the community is huge. The service is a good way to start. It has problem over the long term to do only service. But it's a good way because you sell time, you make money. So you don't take risk with service. It can be something that doesn't have the level of risk. Another aspect that allowed us to go from zero to one million is European research money and French credit and brochures. In France you have a lot of help about research. So you can, if you do something innovative, you can bring it to the state and get some taxes back. So you will have less cost as a company. This is, for example, more difficult if you are in association. You can get subsidies, but you won't get social charges back because you're doing research. It's going to be more difficult. And then you have European research projects. You can group with other companies. And we had the chance in 2007 to join some other companies in projects and get some funding through that. In the end, over the 20 years of X-Wiki, I calculated that we received 10 million euros of European research grants of projects in France and so on. And that in the end was our VC. I mean, getting 10 million from a VC is quite difficult. It took 20 years to get that, but it allowed us to fund the software. Another thing that happened in that time is that we went to Romania in 2006. It was initially through the Google Summer of Code. And we had a student that was in Romania and we gave some projects. He candidated and he was really great. At the time, I didn't have money to pay people a lot. It was difficult to hire a full-time employer in France. There was competition about the cost. Romania was really an emerging country in the tech industry with great scientific skills. And we hired some of the first people. They all stayed. The first three that we hired are still working with X-Wiki today. And we have 25 people now in Romania. It was initially a cost-driven decision with the opportunity to have people with skills. And over the time, it's a fully integrated team that is also believing in open source. And so we hope that we also had this little effect to bring some open source to Romania because we're one of the rare companies that is doing open source in the city we are among Amazon, Microsoft and so on. And we also have, thanks to Romania, a lot of women in the team. And it may also happen that there were some couples created at X-Wiki. So as an entrepreneur, it makes you think about the impact you have. So that's just a graph of finance. I'm revealing our finances. So I'm not going to detail them, but it can show you the split. The most important data here is that when we started, 0% recurring revenue. And after six years, 20% of recurring revenue supports. And in the end, that's the goal. The goal is increase the support revenue. It takes time. You need a great product. You need to reach a maturity in the product. But it grows over time. So it's all about the strategy to make that recurrent revenue grow with the users and customers of the software, whatever the goals are, whatever the type of recurrent revenue. So it took a lot of time. It grew to 20%. So one thing that I really want people to think about is there is no success in open source without a good product. There's a lot of people that think that it's open source business model doesn't work, but in reality, it's just the product's not competitive. There is a huge amount of products, including in the open source world. If you don't do a good product, there's not going to work. And so you also need to think about the strategy to direct revenue towards the product. So when you do service, that's part of the problem. You might diverge from the product roadmap to make a great product. Because you're going to follow what some customers say instead of following what all customers need. So you need to think about that. And one of the things we learned over the 20 years is that it can be a good idea to condition the service on taking the support, which allows to give extra funding to the product and dedicate people to work on the product. And but there is also some companies, for example, NextCloud, one of the things they do is that they don't give you service. They don't sell you service. They make you pay the product, the support a bit more, and they give you the service. That's also a strategy that is interesting, that is going to raise the product revenue and really make the company focus on the product. And then the service will be used to make the product better. So we need to think about focusing the revenue towards the roadmap. That's the case, for example, of the research projects. Another aspect is the community is super important. It's your marketing. It's also your insurance. Customers will find it reassuring that you have open source. And it's also your recruitment tool. You'll find developers. We have hired so many people that came toward the community. And it's also very important, the community, to be a good open source citizen. That's also how you look. You see if companies are really true about open source. Are they really working with the community? Is the community open? If a software doesn't take patches, doesn't take pull requests, is not discussing with people about how the software should be, you could question their motivation to really do open source. So at X-Wiki, for example, even though we don't have that many contributors because we're moving fast on our end, and it's not fully natural that people come and give you code, it doesn't happen like that. It's a challenge to make people give you code. It doesn't happen on all software. So the fact that the community is not huge around a product, from my point of view, doesn't necessarily say that it's not a good open source community because it also depends on whether the people want to come. At X-Wiki, we have a fully open development model, but we don't get automatically people running. We're using Apache Software Foundation kind of rules for running the community. You can find our code, you can comment, discuss in the chat, and so on. So some companies bring their products to a foundation, that's also an approach. One thing is the relationship that we customer in open source at the beginning. I realized that you talk about open source, oh look, it's great, open source, you're going to be more free, no lock-in, etc. And you talk to some large companies, the thing is they don't give a shit. They don't care about this. They just want the best product at the lowest price possible, efficiency. So some people do care in the end, but you have to kind of find them in companies and find the people that can be sponsors of open source. Today you have OSPOS in very large companies, even in the European community, in public service. These are the sponsors, but the majority of buyers of software are looking for the best software at the lowest price. And that's why you need to be competitive to show them that you have also the best software. And there is a difficulty with the marketing of the proprietary products. There is so much marketing of the proprietary products that it will cloud the vision of the customers. They will get stuff for free. They don't look at the long-term price evolution of software. We lived it with our competition with conference. Conference recently changed the prices, but for years customers were buying. We knew it would happen. We knew that at some point they would cash in as much as they could. I mean, they would cash in on the proprietary nature of the software and the fact that they control people's data. And a good thing with open source is that open source validates your product. So you can go and show to customers, look, we have these users, it shows that the software is good. And that works very well. And we also have progress in Europe today because there is an issue of digital sovereignty, not something that was not foreseeable with the dominance of American companies, but it's something that politicians or European organizations or state organizations took time to take some action on. Now there is a bit of action in this area. One thing that I learned also through creating X-Wiki is looking at Floss and free and liberal open source software as a goal. Initially, let's create good software, let's create a good company, let's have a good balance with employee. But in what I discovered is the goal of open source of free software is giving us freedom, giving us control about software. It's all the values that are described by the FSFE that are really interesting. And that they discovered that and that motivates us even more in building what we do. In the end, we had to find a balance between all these things. And these are the values that we promoted internally in the company. This is what makes our company. We need to take care of our community, we need to take care of our customers, we need to do a great product. And the great product is about the domain in which we are, the knowledge and privacy, the goal that we have for software. And we want people to be happy inside the company and we want to do open source. So these are the values that we promote internally. And what the challenge for CEO and for the group is to find that balance between these five items. And for example, we can see that these are the highest ranked reason why the people at XWiki decided to join XWiki. This is recent data, this is not data from the past. And we can see that being open source is a key reason why people want to be there. But they also want to be there because they like the product that you're building on and that we're building. So one of the key things was building on support revenue. I mentioned the recurring revenue and this was really important to really make the support revenue accelerate to be able to gain sustainability. And that's really the challenge for a company that wants to build open source over the long term. And so from 2010 to 2015, we moved from 1 million to 2 million revenue. But most importantly, we grew from 250K to 800K of recurring revenue. In the end, that's what I look more at, like how much recurring revenue we're making. Because that is what's funding the company. We failed at building partnership. We hope the product could be used to build some other products. But we found it very difficult to find the deals. And in the end, we found that we were better at creating a direct relationship with customers. Also explaining them the open source model and what we were trying to do. And also explaining them the value of our product. Relationship with direct customers is key in order to build the value of your software. We also tried to build the first version of SaaS, we call it XWiki Cloud. In the end, we focused on the main product and the main product's value. It also allows some simplification. And so that's the graph, you can look at it. Recurrent revenue grew to 35% in the time and that's really great to have that. It's not only about the percentage of recurrent revenue, it's about the amount that sustains the team. Because even if you stay with a percentage that's 50, the extra money that you're getting from the service from the research product, it becomes bonus when the recurrent revenue is enough. So if you reach a certain amount of recurrent revenue, the rest becomes a bonus. At the beginning, that doesn't work like that. You have close to zero recurrent revenue, so you don't even manage to find the team to continuously develop the product. And one thing to keep in mind is that close source competition is tough. Even if you're doing something innovative, you'll launch something new, such as Enterprise Wiki when we started or an end-to-end encrypted tool. At some point, if there is a big market, you're going to have close source competitors that are raising money that are going to come. And they might grow faster than you for a while. They will educate the market, which is interesting for you, but they will also try to take the market and then lay the cash in. But you can stick and stay true to your goals and wait. I always tend to say when you're number three and number one buys number two, you become number two. And when you're number two, you're the alternative to the number one. And all companies want a need alternative for competition. So I hope, I wish that open source would not be the alternative, but would be the leader that's not always happening and not always easy. But being the alternative is also something that helps you grow. And so after that, we had a challenging period. And so we were growing progressively, but what happened at some point, we flattened. And that was because of the competition. At the beginning, we were working mostly on innovation. We had customers interested in buying what we were doing through innovation. But at some point, we flattened. And we didn't have that innovation thing. We had stronger competition, SaaS coming in and speeding up. Deployment people would just buy SaaS. Companies would be less interested in the open source aspect. So we basically were flat with 35%. So what difficulties that we had? Competiteness, SaaS, competition. Our custom work was less demanded because there were more products that were doing things in a standard way. The fact, we had to educate the market about the fact that open source is not completely free. So that's also, you tend to put the priority on this when you're not making enough money. You tend to think it's just a business model. It's also the business model, but the main thing that we changed in that period, one of the things that we stopped trying to think like open source startups, we came back to think like what we had, what was the value that we had, and it was about our product. And what we did in the end, we created Task Force to transform the company, focus again on the product, making the product better. And trying to convince again people that the product was good. And it worked. And we looked at what was missing, what was not so good in the UI, and really did effort in it. And the thing is, when you're doing two million revenue, you do have money to try to fix problems, and that's nice. And so we relaunched a competitive offering, and we also changed a few things in the way we were selling to customers to try to improve their understanding of open source so that they would give us more recurrent revenue in order to fund more of the product. And so one of the things we did is rewarding customers, paying the product. So for example, we decided to build open source paying applications. So this is quite unique. So I don't believe in the open core model where you're doing open source and proprietary on top of it, because it tends to push you towards doing more proprietary. At Xwiki, what we decided to do, we did paying extensions that's similar to open core in the sense that you have to pay for them. But the code is open source, we just don't make the build available. So we have the Xwiki core, completely free, and of course, completely open source. And we have extensions, you get them as extension, they say pay for it, pay for it. In reality, the code is fully 100% available in GitHub. If people would want to use them for free, they rebuild them. The thing is, people don't make the effort. It's a lot of effort for companies to do that. And by making a little bit of friction for companies to adopt these extensions, it motivates them to pay, to give them a reason to pay in companies. And the bad part of it is we like it to be free completely for individuals. But this means that you would need to find some other ways to make it happen. But the most important thing in this strategy is that the code itself is open source. That means over the long term, it's owned by everybody, not just by us. So we cannot be the owners of that code over the long term. So this is a part where open source is not free and it needs to be explained. We tend to think that everything needs to be free, but you cannot pay people if everything is free. So if you want to build it, you have this difficulty. In 2016, we launched CripPad. And it was another experience there because we relaunched a second product inside the company. It had some useful aspects. It recreated innovation in the company. And it helped us gain other research projects because research projects are highly linked to innovation. So we had a second batch of innovation inside the company. And it also helped for the image of the company. It made us more known by individuals. And then, oh, you're doing Xwiki also. Some people, I mean, I don't know. In people that know Xwiki and CripPad, how many people didn't know it was the same company that was doing that? I don't know. Who knows Xwiki? Who knows CripPad? Anyway, so then 2020 happens, COVID, what happens? So that's a crisis. One thing to think about is always be ready as an entrepreneur for a crisis. It will happen if you stay long enough. We had subprimes in 2009, 2020 we had COVID. The thing is for us, we were more ready than a lot of companies because we were already remote friendly. We were allowed to do, everybody was allowed to do two days of remote in the company. We just moved it to do whatever you want, just work. And everybody worked from home. We had the tools, everything was already adapted to work with remote tools. That's one of the magic of open source tools and open source development model. We had the knowledge, we had the knowledge tools. And it also gave a boost to CripPad because CripPad actually was used by education. For example, in Germany, we had credible usage of CripPad over a couple weeks during COVID. So it also gave a boost. But as a company, of course, it creates a bit of scare what will happen, will customers go away, will there be a financial worldwide crisis for years. In the end, we went through there. One of the things that COVID showed is a challenge for European digital sovereignty. Politicians realized that supply chains were a problem and that there were risks there. And this has tainted towards digital sovereignty and software. And since a few years, we've seen that there is an interest in this area. But the most important thing that happened for us is Atlassian changing their business model and saying that people should move to their cloud. And they should stop using software in their own companies. And they closed the smaller offers. And they decided that in 2020 in November. So I don't know how they found that COVID was a good time to add some stress to their users. But they did and their customers didn't like it because we received a lot of mail like saying that, okay, what is their way to replace Atlassian conference with Xwiki. And so we spent time on improving our migrators. And we were not necessarily surprised. We were surprised the extent of the change that they made and what they did to their customers. But from our point of view, it was something that would happen at least progressively. When investor backed companies want to cash in, this is a time for the SMAs and open source companies to really propose an alternative that is more sustainable over the long term. Open source is more sustainable for other people than proprietary software is. So for us, it brings some maturity. This raised our revenue to 3 million 3 of sales in 2023. 50% growth, I said that at the beginning of the presentation. And in the end, 1 million 6 of recurring revenue with 30% growth on the recurring revenue. And that has been huge for the company and for allowing to build more software. So this is a graph you can see last three years, pretty nice. So when you look, when you are in 2020. When everything's flat, you feel a bit depressed. And that's not going well. But then three great years behind it. So it's never given. You can always turn things around. Not only because of Atlassian or thanks to Atlassian, but also because we went one project with Digital Sorbitancy in France and Germany and the software was recognized. And what about the future? So everybody talks about AI. For a knowledge company, it's a real question. So we need to think about it. One of the things that AI is doing right now is that it's questioning the aspect of open source again. We saw a lot of big companies, as I said, not caring about open source. Politicians not caring about open source. With AI, it's the first time the president of France said the word open source. Which we wanted him to do for years in the industry saying that it was important. And he said it for AI. Okay, what will it change? We'll see. But at least it raised the question of transparency again, of the control of code of data. And that's something that is positive for the future. But you also need to get prepared because it changes a lot of things. The architecture of running AI is complicated. It's much more harder to run it on premise. And so you need to find solutions for that. We're working on AI at Xwiki. We have an extension. And we also gained a research project to do some search engine using AI. And I would like to point out the approach of also NextCloud with ethical AI. We are completely aligned with that aspect. You cannot do AI today without thinking about whether it's ethical, whether it's protecting data or not. One big aspect that I think is really important for the future is software modularity and integrations. We believe at Xwiki that the future of open source software is allowing to assemble software together. And making better reuse. I said at the beginning of Xwiki that when we started, we reused a lot of open source software. Well, if we want our software to survive in the open source world, we also need to make sure that it can be reused more. And this is why we've launched a new product. We call it Xwiki Crystal. And it's going to be a new modular AI that will not only work with Xwiki, but can work with other Wikis and can be integrated in other tools. And the other thing is that we're part of the Open Desk project, which is a funded project in Germany to make an open source suite of collaborative products. And we're very happy to be part of it. And the other aspect is doing with CripPad what we did with Xwiki. I showed the financial of Xwiki, the company that included both Xwiki and CripPad. But what's really interesting is when you run a second product inside a company, how does the other product look? And what's really interesting is that it looks a lot like the Xwiki product at the beginning, 20% only of recurring revenue. And it's difficult to build that recurring revenue. So if you love CripPad, we are very happy. We've been able to double the size of the team, as you can see in the funding in the last two years. So in 2023, we doubled the size of the team for 2024 too. But it's only 20% of recurring revenue. And that means we don't have the sustainability yet. And if you look, the blue and red part is our recurring revenue, subscriptions to CripPad.fr and donations. And you can actually help us build sustainable revenue by promoting CripPad and allow us to find more users and customers. But you can also help it with donations. Any software needs to reach that sustainability through the recurring revenue. That's really the challenge for it. Finally, giving back. So first we give our software because it's open source. We give our code as a company. But we also think that it's important that we give back to the other open source projects we do. We wish large companies would do that. Large companies that use open source for free a lot or proprietary software company today that are building on open source should give back something to all the project that they use. And we decided to create a fosfan of 1% of our recurring revenue to give it back to the projects that we use. We have three years backlog that we're going to give like almost 30K to the different projects that we use. We're going to give for example to the Matrix Foundation. We're going to give to MasterDont and to lots of other tools that we're using. And we're going to continue to participate to industry organization to help it make known. The conclusion is that nothing of this would happen without the team itself. And we have a team of 60 people, more than 200 people over 20 years that worked on Xwiki. And that's really the kudos to them because you cannot do that without all the people that worked. And we, for example, at Xwiki we have more than seven people that worked 15 years. We have seven people that worked 15 years at Xwiki. 15 that worked 10 years at Xwiki. And this is not necessarily that easy to achieve that for a group of 60. We have a difficulty of funding all the time. If you want to join, we have jobs. And also nothing would have been possible of what we did without the help of European projects, French projects, BPI, Europe, NLNet. If you don't know the NGI program, the funding you can get for the open source from NLNet, go look at it. It can help you fund your project. That's it. And if you have any questions, I'm welcome. I'm available or I can take any questions. Any questions? No, I guess people are just installing here. I have a question if nobody has a question. So I was wondering how was the ride between building a company and having a community, basically, was there any conflict of what to put in the product, what to not put in the product, let's hide this away so that people pay for it, let's give it for free. How was this dynamic in the building of Xwiki? Yeah, and that's the difficult part is what do you do as a pain module? What do you do as free? Well, one of the things, so first is really keep an open community. We are very important and really keep everything open source even if you have pain stuff so that people can look at the code and discuss. For the choices of what the features are, well, we try to direct them as much as possible to the ones that the bigger companies would need, would need most themselves, not necessarily the individual or the smaller companies. Because in the end, it's mostly the bigger companies that have the funding for you or for an enterprise software. And it sounds weird that the bigger companies are not paying for it. I think Matrix has a talk just after and I know that they will talk about the fact that you have huge deployments of Matrix and with zero money and some smaller deployment that are giving significant money. And the larger companies that are massively using open source need to participate to it. And so directing the specific features that they need, for example, audit logs for compliance reason is something of big companies. But for example, LDAP authentication or SSO, it's a bit tougher to not give it because it's a security feature that's really important to make software more secure. So for example, that has been a difficulty for us. So we made active directory paying application, but LDAP configuration is still available in XWiki as a documentation in the open source documentation. But if they want the simple configuration with Microsoft Active Directory, they pay the application and we sold a few of them. Hello, first of all, very nice talk. Thank you. What would you have done differently on the Twiki journey? What would you have done differently on the Twiki journey? Oh, that's a good question. Well, the little strategies to make people understand open source better, for example, making people pay more for service if they didn't take support, what I've done earlier, the paying application maybe earlier, not so sure because initially you need to build community for sure first and you need to build competitiveness. So it's kind of difficult. So that part, not do the four products on top of XWiki with partners, but at the same time they gave us some money. So maybe do less service sometimes, more product. So these are the things I would have done. But it's done differently. Like basically the playbook of how you can fund the product or try to do it earlier. And then when we learned it, I had other presentation about the different method we found in prior for them about how to fund an open source software. So I gave to Kerl, which also has a great experience about how to fund the work in Kerl. Any other questions? Nope, okay. Thank you, Ludovic. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you.
You too could have made curl!
I better not touch anything anymore. Okay, nine minutes off. Okay, cool. Hi. Technical stuff. Right. Let's start this. I am Daniel. I work on curl all day. I work for Wolfers & Sells. I do curl stuff all day. I am going to talk a lot about curl. I always talk a lot about curl, but today as well. I don't think I am going to present a lot of new things here. You are going to hear me reproduce and repeat the things you already know. But cliches are cliches for a reason. I am going to just let you know that some of them are actually true. At least from my point of view. I work on curl since a long time. It runs in a few things these days. You can actually probably not walk very far without using curl knowingly or not. It is in a lot of different devices, things, services. Since a few years back on more than one planet as well. Right? A favorite slide of mine. Anita, squeeze it in. I am sorry. A few years ago I also got this gold medal from the Swedish king here for my work on curl. And actually... But not a single gold medal since then. It is kind of a disappointment. But anyway, these days we estimate there is roughly 20 billion installations of curl. Quite a few. We don't actually know that it is 20 billion. It is roughly open source, we don't know. But definitely there cannot be that many other open source programs in general. Software anywhere that runs in many more instances. I am pretty sure. Pretty decent thing, I think. But you know, everything really didn't start out like that. It has been taking quite a while. Because in my project, our project Curl stuff... We of course started somewhere. And it was a long and sort of an effort. And a long journey from something that was really not very good until what it is today. Which could possibly be good. So in November 1996, it's a long time ago, I turned 26. Fun. So I started with a little project. It was more like that. Very silly toy. 160 lines of code. Just a few screen folds. And what do you do with that? You start with playing with it, it makes it something. You start fiddling with it. And you know, start small. Do what you want to do. Give it a lot of time and have fun. That's how you start an open source project. You have a niche, you start scratching. And as long as it is fun, why not work on it? And in my case, I worked on it for about two years. I actually recalled it Curl then in 1998. So it started with another name. But that's a long story. Anyway, two years later, December 1998, what an awesome success. 300 downloads of my software. I have this screenshot from the website that I had back then. Because I think it's a cool reminder that actually getting 300 downloads of your software is pretty cool. That's way more than all your friends. All those who just did it because they know you. Actually started to reach out. And that's cool. It is cool with 300 downloads, even compared to 20 billion today. And I also want to emphasize that this was two years later, right? Two years later, 300 downloads. Yay, a good one somewhere. I mean, in 20 years, we could have 3000. So yeah, keep working at it. And finding your goal or you find a project to work on, of course, it's a good thing, right? It's fun and work on it. And maybe, sure, you want to make it easy for others to help. But you can be sure that, I mean, the world is drowning in open source projects and good ideas, right? It's not a problem to find good ideas. It's not a problem to find open source projects. But how do you actually get anyone else interested in your little project? Because you think it's fun and interesting and serves a purpose. Probably not. Probably you're just going to have to realize that it's you and your project for a while until it's proven to be something. So as long as it's fun, why not keep at it, right? And spend the time because it's not going to be an immediate success. Very few things are an immediate success. So yeah, spend time on it. People often ask me what I've done in Curl mostly. But I think what I'm mostly done on Curl is spend time, right? 1996, I started this. And also learn to use the time. I told you, I was 26 years when I started this. I didn't have any kids. I've had kids since then and they have grown up pretty big since then too. And all of that, we're all having lives, families, other things than just open source, right? But how do you actually get time to spend on your projects? In many cases, you maybe need to do a little less of something else or a little bit less sleep or whatever. In my case, I sort of, yeah, maybe if you really want to spend time, as I say, you need to spend time on your project to get somewhere, maybe you have to do a little bit less of something else, right? And people actually sometimes don't believe me when I say that I never ever play computer games, right? That's just, that's an easy thing to rip out of your life and save hours and spend that on your open source project instead. I mean, you can cut down on sleep as well. And I do that to Firmum, but that has its downsides as well. Just accept the fact that for long periods of time, you might just be the only person, right? You, of course, you make it easy for anyone to contribute and you know, load the bars and accept pull requests and everything, but, you know, there are many open source projects out there and we're all competing with the same developers, right? And all those developers, they also play computer games. They watch TV, they have families, they have other priorities in life before your open source project. But I can spend time on my project. I can control at least to some degree what I spend time on. So sure, just accept the fact that, yeah, yeah, yeah, I make pull requests in my own project, right? I put them up there, someone can comment on them, someone might review them, but if they don't, I go ahead and much and continue with the next one. Because in the end, it doesn't really matter. Looking back at your project, you don't care if I started my project 10 years ago, 15 years ago, or two years ago, as long as the project is good, it's there, it fulfills the purpose. So in a way, time doesn't really matter in the end. And of course, reaching somewhere, accomplishing something with your project, there is really no silver bullet here. There is just engineering. And there's just open source stuff that we all know how to do. We've all been doing for a long time. There is just hard work and keeping at it. And of course, having fun. Because if you're not having fun when doing it, you probably won't endure. So in the code project, right, it started in 1996. Number of lines of code is basically zero. So I actually started a project with someone else's code, so I didn't write those first 160 lines of code. And then I became a maintainer a few months later. And then we started the journeys. And then we, so now we're at 160k something. And yes, a fascinating linear growth too. Kind of unbelievable. So yeah, I'm just saying that keeping at it, things might develop. And making sure that others can contribute is of course, crucially important. And that's why it's open source. We want to enable others to contribute, even if in many times maybe they don't, but there's still that opportunity, right, and availability. And if you're doing things right and you happen to be accepted by others, maybe someone will contribute. And now everyone is looking at that bump in 2005 and thinks, what happened? And it's quite boring. I actually just went back and sort of filled in names that I have missed out from the list before. So I just went back, so it's actually not supposed to be there. But it's my script count number of names in the list. So over time you might get a lot of help if you're successful enough. But success is obviously not given, right? There are a lot of open source projects. I mean, and they're adding every day, right? So there are hundreds of thousands. Just look at GitHub or whatever. We're drowning in open source projects. And yeah, it's certainly not a guarantee that whatever we do is a success and going to be popular or anything. But if you don't give it enough time, if you don't spend your efforts and really make sure, I get a lot of questions or people say, yeah, yeah, I spend a lot of time on my 47 projects. I did them for several months and nobody used it. So sure, if you don't spend enough time, if you don't polish it enough, maybe it doesn't stick out among all the others, right? So maybe you actually have to spend more time to get somewhere. And it needs to be fun. But whatever you do and whatever anyone does, there will be times when you're sort of, when you just run into something that wasn't supposed to happen, like a security problems or whatever. And it's bound to happen to anyone who's doing software, maybe more to some than less to others. But still, everyone is doing mistakes. It doesn't matter how long we're doing this or how much we have done it. As long as we keep developing, we keep changing things. There will be mistakes and mistakes will lead to security problems every once in a while. In Curl, it looks like this. The green ones are bars. When we fix security problems, the red ones are when we introduce them because I tracked them down. So of course, we introduced them before we fixed them. But anyway, I'm just meaning that, yeah, we work really hard, of course, to make sure that we don't introduce bugs, we don't introduce security problems, but can be sure that they will creep in anyway because it's tough. And you all know that, right? Nothing new here. But what do you do? You just own your mistakes because they are going to happen and try to learn from them, which I think is really, really hard, right? Because every time you get a security problem, it feels like, this is a one-off, we should never have done this stupid thing. But try to learn from it, adapt, move on, add more tests, and make sure that we at least don't reproduce the exact same problem again in the future. And yeah, I've done that still several times actually, it's kind of stupid. Yeah. And keep having fun because if it's not fun, it's not going to, you're not going to spend all that time on it. And no one else is going to do it either. And of course, everyone makes mistakes. And it's really a matter of how you handle the mistakes. It's not sort of the amount of mistakes or how critical they are, but how you take care of them, how you take care of the people who actually made the mistakes. In my case, it's easy to take care of the people because almost all of them were my mistakes. And there's no denying that it's sort of soul-crushing when you have your software in 20 billion installations and you have one of these things that you know can end up really, really bad for the users. Yes, that can make it a little bit harder to go to sleep at night. But yeah, again, we all do mistakes. We try to learn from them and move on, right? And in our case, in pretty much everyone's case, we just have to do what we can do, right? Engineering, we write readable code. You should be able to understand the code in any language. Whenever you read code, it should be understandable. If you can't understand the code, it's the wrong code, right? And you document everything clearly a lot. And another thing with working stuff, or working stuff for a long time, is that you have a long time to write the documentation as well, ideally, right? And a lot of tests too, because the more time, the more tests. And you analyze your code, of course, you threw every tool at it and make sure that the tools, they don't complain on your code. And then when you have sort of fulfilled all these steps, yeah, you know, it's pretty decent and you can throw fuzzing at it. And in our case, I also like to offer bug bounty as well, because I'm fortunate enough to have someone who pays for it. So we offer a lot of money to people who can point out the security problems. And yes, then you get a lot of bogus crap as well, sort of, yeah, there's a security problem. But still, also get a lot of quality people spending a lot of time and effort actually trying to find security flaws. So in my experience, this works really well. It's a pretty cheap way to get a lot of help to find your most stupid mistakes. But, okay, there might be other people involved in open source sometimes. You're not alone all the time. And really, over time, you learn that it's code is easy, right? Code is easy, you can just debug it, try it again, write a new algorithm, but the people, they are never easy. People are sort of what the challenges are. And the longer you work in an open source project, the more you maintain, you know that the challenges, what you need to sort of, what you face on a day-to-day basis is the problems with communicating and talking to people from different areas of life, cultures, languages and everything. And you can be sure that they are going to be less-to-friendly at times. So over time, we do less and less coding and more and more interfacing with humans and other things as a maintainer of some stuff, right? And, right, so negative feedback is sort of the default. It's a little bit depressing, but you know, as long as things work, sure, 20 billion installations, no one says a single word, sort of, yeah, it works, cool. And someone finds a little bug somewhere and you can be sure that that is what you are told about, especially if it appears stupid or silly or something, because then someone is very upset that surely it should have worked since a long time ago. You've been working on this for so much. So that is, of course, and I know you all know this, that's the default. You basically never hear when things are good, because that's the default. Everyone assumes everything is good all the time. When something is bad, you get told about it. So people often ask me what the difference is in curl back in the days with 2,000 lines of code with 300 users compared today with 20 billion installations. There's really no difference, because in the little development community, people raise their bugs, they complain, they have problems. All the ones that are successful, they shut up, they are somewhere else. So it doesn't really look different today. And a lot of lessons in what you do when you realize over time that contributors rarely stick around. In curl, I have lowered the bars and the friction for new contributors, I think, a lot. So we get a lot of contributors even fixing a spelling error or typos in a comment somewhere. People contribute that. And I think any contribution is a good contribution. It doesn't really matter if you fix a typo that makes it hard to read. Yes, it's an improvement. So I accept it, but do the contributors stick around? Out of all, I mean, today we have, I think, 1,240 authors who have written code commits in curl. That's an amazing number of people. Over 65% of them did it once and never again. So, and I don't think I'm unique in that, and I don't think it's special. I think it's more like that's how people work, right? They show up, they find a problem, they submit it, and they move on to something else. Because it's not their primary interest in helping my project, they just found a problem and fixed it and moved on. And sure, it's okay for them, it's okay for me too, but just the realization that most people who show up, they will show up there a few times maybe if you're lucky and then never again. And maybe every once in a while, of course, you get a new contributor who will actually stick around for a long time and contribute a lot. And you will be happy for those. And of course, I mean, there actually is the reverse too, right? There are a lot of newcomers. And I've never heard of, you never saw this person before in your life, and they show up suddenly one day with an amazing patch showing that they understand everything. And you can sort of be amazed that someone just shows up on your doorstep one day and have a perfect understanding of your architecture and design style and code style and everything. So suddenly, open source is open and ready for surprises in every direction. And that's part of the fun, right? Less fun is perhaps that sometimes when being a little bit public about things, things can go in the other direction. So I actually never really... So this email from, well, soon three years ago, was actually the first one that sort of hit me. Hit me like this. Yeah, so my email address is in the current license. And the current license then appear in a lot of products. And this person quite clearly had been attacked in some way and saw some traces of curl in some leftovers somewhere. And that was obviously my fault. He had lost his family's life and job and everything. Completely confused person, but it was all my fault. That was tough. But, okay, open source, this fun thing with open source, open source, the term was coined in 1998, right? Exactly the same. Actually the month before I started renamed it to curl. So it's sort of open source and curl. And it's been hand in hand going for a while and still just 25 years, right? Before we did open source before we called it open source too, right? Because we still worked exactly the same way. We just didn't use the term then because then mostly we talked about free software, but it was a little bit more confusion than what it actually was. But anyway, so today is much easier to do open source because everyone knows about open source. If you approach a developer today working in any field, people actually know what open source is. Back in 1998 or 96, no one knew about open source in general. It was just a niche click of weirdos. And today everyone is using open source, right? There's not a single project, single user, single developer anyway who doesn't use open source at least to some extent willingly or not. It's just going to be there. And we're all going to work with open source in ways that we suddenly did not 25 years ago. And we all, I mean, there's so, so many more contributors to open source today than before, right? There are literally millions and millions of possible contributors today. Back in 1998, there were not millions and millions of contributors. In 1998, the total internet population was, I think, estimated to like 40 million. That's basically the amount of open source developers today, right? And of course, we're many, many more maintainers of open source today than we ever were before. So there's also a lot of equals among us, right? We all know, I can talk to you like today, we who maintain open source and you are all a lot of open source maintainers. I don't have to even pretend. So there's a lot of good things. So it's of course also much easier and much better place to do open source today than ever before. And I think it's going to be much easier and better going forward as well because all of this is just going to improve. We're just going to do more open source and it's way, way easier to do open source today too, thanks to infrastructure, tooling, funding, whatever. But I think we're into in for a bright future. But anyway, I've done this and worked on a single project for so long and people ask me then, don't you ever get bored? The same project for 27, 28 years, yes. And of course, I get bored. Everyone gets bored every once in a while, right? Lack of motivation. How fun is it to work on the same thing all the time? Of course, the motivation comes and goes. That goes for everyone. And that's just natural part of life, right? Whatever you do, there will be periods in your life when you don't feel the same sort of, yes, it's going to be great to work on this documentation today again. Sometimes you just have to, you know, do something else, spend more time with your family. In my case, I like to sort of move around, do something silly and some less important part of the code or do a slight less curl for a while. I've just come to realize that lack of motivation is just a natural thing. It's just sort of an endless cycle. Sometimes you come in and then come back and it doesn't really matter as long as you sort of let it play out and maintain your overload. One of these things is very commonly brought up, right? If you're being that single person and you feel that a lot of users are depending on your work, maybe you sometimes work a little more than you should. And I think this is a real problem and it's a real, it can affect us for real. But it is important to separate you from your project, of course. I'm not sure I manage that always, but I do try. And there's a little this of, you know, if your code run in a lot of places, can you really ever be sure when you release a version that is not going to sort of bring down half the internet? I don't know. I think you just come to, you have to deal with it. In my case, I think I'm actually pretty good with this because I feel that we have enough tests, enough eyeballs, enough people involved that crossing my fingers. It might not happen too often at least. So I think it works really well. That's from my case at least. But I want to emphasize and I think this is true for many people that the thing about imposter syndrome, it doesn't really ever go away. It doesn't matter if you have those 20 billion installations, you can still experience periods of that. Did I, did I, do I even know this sort of who am I to tell them how things work? I mean, come on, this protocol doesn't actually work like this. But what one of my skills, I think when it comes to doing open source is just make sure to use the time slots you get. I have that a lot, you know, you have a family, you have a life, you have friends, but sometimes you have 20 minutes for yourself. Can you spend those 20 minutes on your open source project? I've become very good at it, which makes me very good at, you know, if I get 20 minutes here and 20 minutes there, that's actually 40 minutes. And I'm not complaining about, yeah, I need an hour to get prepared first because then I would never do anything at all. And I don't split my attention between all, a lot of sort of many other tiny things. Sure, I do a lot of other projects as well, but I give them much less attention. And again, time might feel important sometimes, but it really doesn't. In most cases, it doesn't matter if you're done today or tomorrow or next week or the week after that. Who cares, right? Sure, it's not in this release, you're going to do another release soon again anyway. And down the line, it didn't matter if you were done last week or next week. So, let it take some more time. And of course, I'm a true believer in release early, release often, so that everyone has a chance to get your latest code as soon as possible, because it just makes maintaining and everything easier and contributors have a much easier time to actually work on your latest code better. Yeah, so reduce contributor fiction to get people to help out better and have fun. Of course, we need to just remember that we're all different. I can stand here and say how I work, but I'm sure that you all have sort of objections and say, yeah, it doesn't work for me. It doesn't work for my case because I mean, spare time, as I'm talking about spare time working on open source, you can of course, in my case, I work on open source work hours and spare time hours, that sort of maximize. But working on anything, spare time is of course a luxury, right? If you're working on something on your spare time, maybe someone else in your family is doing, you know, the laundry or cooking or taking care of the kids or whatever. So of course, that's a luxury. If you have that position, it's a luxury. I don't deny that. So in many cases, you don't have that luxury. And of course, then it's much harder. And there's of course an unequal privilege here, right? If you're rich enough to do this, you can do this. If you have to work two works and take care of the rest of your extended family, maybe you can't do this. Yeah, just have to be aware that of course it's a luxury. We're all different. We're all unique. And of course, what is success? I consider 300 downloads a success in 1998. We all have a different way to measure success, right? So we don't have to have 20 billion installations. It's fine if all your friends are happy with your tool and you can just have fun. That's also success. In my case, I have mentioned already my email address in the Curl license. This gives me an excellent opportunity to learn about people's agonists in life. Like if they don't know how to install their GPS in their car, they email me and ask me. And you can imagine the amount of anger in this user. He couldn't install the GPS. He's been scrolling through that open source license screen in his car. Found an email. I'm going to email this person. So I get a lot of car questions. So then you learn, yeah, Sean, so my email is apparently in a lot of cars and people have problems with cars. So I have no idea. But not only cars, actually. So I can learn about other things too. And usually my way, this is the best way I have to actually learn about where people are using Curl. So wow. So I have to Google. Often I don't even understand the question. I have to Google it. What are you talking about? Oh, great. Are they using Curl too? It's confusing. I sort of stopped replying to them because... You know, the first... She asked me when I started. You want to help out? You want to be friendly? Someone ask you questions? Obviously completely lost. No, sir. This is how it works. I just wrote a little component. No, no, no, no. That's not how it works. Just ask your friends and help me fix this car now. So I have this example. This is a great one. It's a little bit convoluted, but I'll explain. I got an email from a woman. She said her Instagram account was hacked. So what are you asking me about that? Sad for you. Okay. But she showed me the proof that I'm involved. Instagram? My name. Now I should just head over and talk to the guys and tell them to help her with her account that had been hacked. And I told her, cool. They're using Curl. That's in my code, right? And try to explain the concept of open source. I never talked to these persons. I didn't know the use, Carl. For me, it was like a revelation. Cool. Instagram, right? That's like a billion installs suddenly. She didn't really see it the same way. Now, she emailed me back. She found my name again in her phone. Exhibit two. But it cannot be a coincidence. Your name cannot be twice in my phone. For any good reason, right? So she threatened to contact them and tell me that I'm an Instagram and Spotify hacking. I don't know if she did actually. So maybe they don't know this yet. No, I'm exposing myself. So when I work on this stuff, I just, what I'm trying to say here is I'm not special. I didn't do anything genius like I've just been working on this a long time. I just had an idea. I think it's fun. So this is what I do. And I think this is sort of the best you can do. And I wanted a tool to do the internet transfer. It does look a little bit more these days than it did from the beginning. And I endured. I kept going at it because I didn't know anything else and didn't know better. I think it's fun. And keep polishing. If you spend a lot of time on something, it can actually become pretty good. And make it possible for others to contribute if they want to. And you can just hope and wish that they will contribute. In my case, they did to a pretty hard degree. And this is really the most fun I can imagine. Yes, I'm living the dream. I work on my spare time project full time and getting paid for it. What else can you ask? So, is that easy? I think you can do it too. And pretty much that's sort of what I wanted to tell you. I've written about these things a little bit before in this book-like thing if you wanted to read more about my thoughts on this topic. So, thank you. I'm done. APPLAUSE I think we have a few minutes for questions. If you have a question, raise your arm and someone will run with it. There's a question. The mic will come flying. Hi, thanks for the talk. I have a question regarding... You mentioned that you have lots of contributors nowadays. And how do you deal with their PRs, basically? I was wondering two questions. One is how nitpicky you are and how you... Basically, based on your experience, how nitpicky can you be without discouraging people from contributing? Like being overly pedantic on comments and stuff like that? I'm having a little bit of a hard time to hear your question. So, it's regarding how nitpicky you are in your PR reviews. So, how pedantic you can be not to discourage people from contributing to such an important piece of software? So, do you tend to just let things through or are you very strict? And you still get lots of contributors, even though you're strict in your reviews. Because I guess when you get a diverse set of contributors, it can happen that lots of people have different coding styles and different levels of detail that they go into, code comments and stuff. I don't think I have any sort of general rule there. I try to... Sometimes, I think there are contributors who are clearly, maybe newcomers struggling with the language or the culture or everything. And of course, I try to be a little bit more welcoming, maybe more forgiving and helping out. But it depends also a little bit about load and everything. Usually people, no matter culture, language, anything, people understand code and following code styles and making sure the test case works and everything like that. So, usually I don't have to consider that to any greater amount. Okay, that's interesting. Most people are developers. They understand this from the beginning. The other bit of question was regarding... similar, but regarding documentation. So, have you found that... Documentation is roughly the same. Documentation in the code comments. So, if you've seen that being over-documented, has that helped you or are you not doing it? Because when you get such a... Over-documented, that's a rare thing. Well, over already means it's too much. But when you know... you could go overboard and you could... You can, but in my experience that is very rare and sure. I mean, we can have a discussion. Sure, you mentioned this as a comment, but then below the code is exactly the same thing. Assign A to 2. No, yeah, of course. Maybe you don't have to say that in a comment, and then we just had a discussion. So, of course. But I think that's very rare, actually. Usually it's in the other direction. Maybe you could explain a little comment here why this is happening, and not just have a huge blob of code. Right, right. I guess what I'm referring to is when you have such a long story in your software, and you want to leave traces of some design choices and why some things were implemented some way rather than the other, because other people, especially contributors who are one-time contributors, are not going to have enough context. So, I'm just asking regarding your style. Do you try to leave traces of context like this was done this way because of this reason? That reason, please do not change it, blah, blah, blah, stuff like that. Sometimes, but it's hard to leave traces of sort of to leave it for history because things change. So, leaving traces like that also just risks that you leave traces of your former design or former decisions that maybe were not enlightened enough. So, I don't make a concerned effort to do that because everything is in git anyway. We can always go back and look at the history if we want to. Was there any question left or should I shut up? I have a couple of questions. One is how much time you were spending on the project before being able to work on it? Sorry, can you repeat it a little louder? Yeah, sorry. How much time you were spending on the project before you were working on it full-time? I have a long-standing tradition in my family that I spend every night on curl. So, when the rest of the family goes to bed, I stay for another two hours working on curl. So, I've done that since 1996 basically. That's two hours every day, every week, every month, every year for 27 years. Now I've just added my full-time work as well. So, now it's just instead of two hours per day, it's now 10 hours for work days. Do you delegate maintenance? Sorry, again. Do you delegate maintenance? So, do you have many... Do you delegate maintenance of your project to someone else and how many... Or you maintain everything yourself? Because there's much maintenance overhead. Well, I'm the lead developer here. I'm not the sole maintainer. We're a whole team. There's a lot of people, apart from me, who can merge code and who does. I just think I do the bulk part of it because I'm the only one who works on it full-time. I do it much more than they, but if I would stay in a conference the whole weekend, someone else can still merge code while I'm away or if I'm just absent. So, there's a whole team actually. You
OpenPrinting - We make printing just work!
Dear audience, please give a warm welcome to Peter Kirtel Kampeter, who is the leader of the Open Printing Project, being an employee of Canonical, doing all the great works to make printing easy, not only on Linux, but also on quite an amount of other operating system. Thank you. Have a nice talk. Thanks. So I will tell about how Open Printing has emerged and what we are doing, what are the challenges, and in the end I will also even tell a little bit about printing under Windows, but I am not developing the printing part of Windows and Windows does not use cups. So let's start. What we are doing is the central part of our work is naturally that we have cups and we integrate cups in the, we develop cups. This is done by Michael Sweet, but it's a part of Open Printing. Michael Sweet has left Apple. We are also, we are also integrating cups into the operating system. We are doing it by cups filters, for example, and that I am also also contributing to GoScript and I am also coordinating with desktops and we are working together with a printer working group. This is a consortium of industry, of printer industry and software industry. They are developing standards, especially the Internet Printing Protocol, IPP, and we work together with them. We have also annual meetings and we are implementing these standards in software. For example, cups is all IPP-based and what changes and comes new in IPP like driverless IPP printing, we are updating everything and implementing this so that it can be used in our operating systems. And, yes, and we are also cooperating naturally with printer manufacturers. I had also, I was also a lot in dialogue with printer manufacturers, especially longer ago when one still needed printer drivers to help them to design drivers and to do it the right way. And so these are the principal tasks also to the, also part of the integration of printing is that one, that we, that the packaging, for example, I am doing not only the package, the Debian packaging for Ubuntu, but I am also doing the cup snap, so a snap package of cups, which one can easily download from the snap store and this we are having the always the newest upstream cups on any distribution, but also for snap only distributions like Ubuntu Core Desktop. But about this I will talk at 130 in the distribution zoom with a snap and Ubuntu Core Desktop talk and, so how did it all begin? One thing is I had a printing problem like Richard Stallman had also a printing problem. He invented the principle of free software and therefore we are all here. And I had also, I was system administrator in the late 90s and there we had also printers and they only worked so more or less because we were in the dark times of LPD and in the beginning of 2000 cups appeared. Michael Sweet has released the 1.0 of cups in 2000 and there was an article, a news, a Linux magazine article by called Pfeifler about cups and I have read it and I have seen all of this, this will help us a lot and I have installed it and everything worked better with it and so, but it was all command line and so I have written a little print dialogue and put it on fresh meat in that time in 2000 fresh meat was the place where one told about new free software and then the court Pfeifler the author of this article saw this invited me to the Linux talk, a big Linux conference to show it on the booth of his company and so I showed it there and most distros were not interested in cups and print dialogue and so, but Mendox soft. The show was in the beginning of July 2000 and first of August I lived in Paris because Mendox soft invited me to work with them to switch to switch mandrel Linux from LPD to cups so I lived six years in Paris but please don't try to speak French to me and then I had and then I have in a few months already switched mandrel Linux to two cups so the fall edition of mandrel Linux in 2000 was the first Linux distribution with cups and there were also some challenges so that especially I had not only two package cups and RPM packages this was the easiest part but I had also to take care that all the printers which worked before they work afterwards after the switch over there were drivers many were built into Go script and many were a small little filter program which some student has written to get their printer working and but they all did not come with PPD files because they were written with only LPD in mind and not with cups in mind and I needed a way to get PPD files for them all because cups needs PPD files and there was linuxprinting.org someone has created a website linuxprinting.org and there was a database with printers how will they work and with which driver and especially with the PPD file generator and the database for the PPD file generator needed also much more information options paper sizes and resolutions and whatever but most of these entries did not have that some have that and so I knew that this works and so I asked the author to to fill in the rest and he did not have the time and so he gave me right access and in 2000 and in 2001 even the full control of the site so that I continued maintaining it and so I filled in this database and this way I could make all the printers working with cups and therefore the mandrake linux the first mandrake linux with cups was working perfectly and printing was work much much easier than with in old LPD times and I also gave talks on conferences organized every year on the linux tag and open printing booth and so on and so I spread the news of cups as Michael sweet themselves was stage shy he did not go to conferences and so and so the other distros saw it and switched all over in 2003 also all the distros used cups and so I made cups the standard and I also found it open printing in 2001 in together with some people from the PWG and so this was the start and and in an LPD and any other approach to make printing better like PPR, PDQ or whatever they were called they all stopped maintaining because everyone used cups and nobody wanted to use all these other things yes I found it open printing in 2001 it was not what it is now it was at first that I worked together with some people of the PWG to to uh on printing APIs I had every every week a phone meeting a phone meeting not yet video phone classic phone uh to talk about APIs and to develop APIs and I continued naturally to maintain to to my work with cups and with linux printing.art database and so on and one thing is back in in 2006 I organized my very first open printing summit in Atlanta Georgia in the US at uh at RICO and it was a lanyere that time but probably now they are all called RICO and there there were there I brought something like 40 people together from printing projects desktop projects uh printer manufacturers, printer driver projects and so on to work together on the future of printing on improving print dialogues especially on on improving drivers and so on and on this meeting there was also Jan Murdoch one of the founders of Debian therefore the Jan in the end of Debian and I asked him I talked with him in a hallway session and I asked him this I told told him that the the linux printing.art server the database server is standing in the house of the of its original author and it's already carrying official ppd files of manufacturers and it should be for security and reliability that that should be in the data center whether they're at Debian they could perhaps host it or at the free standards group we uh uh uh Jan Murdoch was also uh was also engaged in the free standards group and open printing this api effort was also part done as part of the free standards group and so Jan told me I can host it at the free standards group but he did not only want to host the database server he also wanted to host me he invited me to work at the free standards group full time to manage open printing to to join the linuxprinting.org and the open printing work in one call it open printing and this is the open printing of now and the free standards group has eaten up the OSTF the open source developer laboratory in one year later in 2007 and founded the linux foundation and so open printing is part of the linux foundation and I was working at the linux foundation that time but in parallel in 2006 in 2006 on the open printing summit where I also was where I was organizing an open printing booth I bumped into Mark Shuddleworth he also asked me whether I want to work at canonical and so I started part-time at canonical also and then I was at canonical and at linux foundation full-time working for open printing and I could leave Paris because I worked from home and then in the time and then time was going on and the next step was that in 2008 I started meant I started to to be the org admin for the linux foundation for the google summer of code every year up to now and the open printing a part of it so open printing was in the google summer of code and I was mentoring students mentoring open print open google summer of code contributors for open printing and this was very important because printing is not really sexy for volunteers to choose this and so by means of heavily engaging in open printing I get contributors who contribute code and some even stay and continue mentoring or the website we have now is also done by former google summer of code contributors and also from 2006 on I organized every year an open printing summit later on together with a pwg and another another milestone more more recently on the open printing summits every year I was I've met avik bazu from lexmark in India and he has contacts to universities in India and from 2015 on every year he was reaching out to the universities and finding for us contributors we have then done some some selection they have done fixed some issues for us and so on to learn to work with open printing and so we had five or six every year working coding for us they were mainly from therefore they were mainly from Indian universities and mainly from the iit Indian Indian Institute of Technology Monday and therefore I have also organized a conference last year in Monday and met them in person so and what we have all done is that at first we have we have made cups the standard printing system as I told and it is used in more in all posic style operating systems including mac os michael sweet has been for a long time at apple and worked on cup and developed cups and integrated and mac os mac os he has left in the end of 2019 and he continues to develop cups and now it is hosted at all and since then it's hosted at open printing and all I have made off three printer drivers working with cups as I told and I have and another thing is in former times in LPD times the standard print job format was post script because the only printers in that time which could print graphical content and were used with unix machines and computing centers at universities were typically a post script printers and so post script was the standard format for printing graphical content and this got and in 2006 as there were so many printers with so many different printing languages and post script is also rather awkward it's not secure it's a two-in-complete programming language so one can inject malware with it so we did we mic sweet and me have decided to switch the standard print job format to pdf and so since 2006 the standard printing format is pdf not really in real life it took some years until the distributions actually were using pdf as standard print job format I think 2010 or so pdf was always used I also I also did a grand unified ghost script as in the 2000s when Mike wrote cups he made a false fork of ghost script for the cups last format and to integrate some third party printer drivers because the ghost script the original ghost script the top of the line the newest version was not free software they released it as free software only one year later and so we so this was also a reason why Mike did the fork and in 2008 the ghost folks decided to release the ghost from the top of the line also as free software and this was for me then the point to do the grand unified ghost script to to do the reunification of ghost script to put Mike's ghost script and the original ghost script and third party driver all together and then all was there at artifacts but unified in one ghost script and and system config printers also maintained by open printing it was originally the printer setup tool of wetted and wetted was and later on when I have left mandu xsoft and I entered canonical and Ubuntu needed a printer setup tool I I ended up to take system config printer for that and improved it a lot for especially for the association of drivers and printers with each other that this works fully automatically and correctly and then in then in 2011 also 2010 to I think 2011 Apple has decided to not maintain the cups filters anymore which Apple does not need because Apple uses their own proprietary filters and I've overtaken this code as the cups filters project and maintain and then I have maintained it on open printing and another thing the common print dialogue backends this is something the print dialogue contains the print dialogue communicates with cups for printing and there are many print dialogues and many GUI projects Mozilla KDE GNOME LibreOffice and Chromium and all of them are big projects lot of inertia difficult to find the right contact and so it was difficult on a cups change to get all these projects go with this change and update their dialogues and so I came to the idea one can put an in-between layer there that the that cups and any net and any cloud printing technology in 2017 when I started that there was Google Cloud Print is implemented in a back end and the dialogue communicates with this back end and the back end with the print technology and so we can maintain the back end at open printing the one for cups and so the dialogues always work correctly with cups so common print dialogue backends and I have snapped cups as I told earlier and now all the free all the free software printer drivers had to do a transition again because the cups three I will tell it later will not support ppd files and classic printer drivers anymore and so I had to put all the printer drivers into printer applications which emulations of driverless IPP pointers as the cups we will only support driverless IPP pointers and so this way we had another transition and this I also have already done what is this So, and you see what we are actually doing. We do Cups, we do Cups filters, common print dialogue, backends, and Papa Retrofit is a library to put old printer drivers, classic Cups drivers into printer applications. And then we have also printer compatibility databases. We still have the Fumatic, linuxprinting.org database from the good old times. But it is less important because the printers which one can buy currently are all driverless. And therefore, Mike Sweet has created a list of all driverless printers, which is very well maintaining, which we have on open printing. And so, and we are collaborating, as I told with the printer working group, and with printer manufacturers. But the dialogue with printer manufacturers about printer drivers came more or less to an end because we have driverless IPP printing. And we are taking care about the whole printing stack, the whole printing architecture. And integration in desktops, I talk with desktop people. I'm running Google Summer of Code projects in updating printer setup tools and print dialogues. And also the integration in distributions, I also have to take account of immutable distributions. One is Ubuntu Core Desktop, I have the Cups snap for it. But I also plan to make Docker packages of Cups and of the print applications. Because there are so many other immutable distributions which do not use snap. And in these immutable distributions, the only way to get system software in is to use Docker images or OCI containers or Podman and so on. Because the desktop applications are in those ones usually added by Flatpak and Flatpak does not support system utilities. And now one important thing, where we are very intensely working currently, is the so-called new architecture. And the new architecture means that we do not support PPD files and classic drivers anymore going all IPP. And all IPP means that we support only driverless IPP printers. We support only driverless IPP printers. And this means that the old legacy printers which need a driver, they would not work anymore. And therefore, we have introduced the printer applications, emulations, software emulations of driverless IPP printers. So it's a damon which is a very important part of the process. So it's a damon which on one end is like a driverless IPP printer. And on the other end, it communicates with the printer. And internally, it does the conversions to the printer's native language. So the driver is more or less encapsulated. And this way, the old printers can live on and we do not lose the support for them and require the users to throw them away. So here's the scheme. The old Cups 2.0 which you are currently using. And you see the old cups, the user application sends a job. And in the old cups, there are different possibilities. IPP everywhere is a driverless IPP printer and cups can directly talk with it. So the current cups 2.x already can behave like the new architecture. It supports the driverless printers directly. And the cups 2.x supports PPD files and classic drivers as any older cups too. And through this, it supports the older printer, PostScript or Asta printer. Any printer which needs a driver. But the printer applications, as they are just emulating an IPP printer, driverless IPP printer, they also work with the old cups. Because the old cups supports driverless IPP printers. So you could already switch over to a printer application. Yes, things. Switch over to a printer application. And so if you are writing a printer driver, don't write a classic printer driver anymore. Write a printer application right away. It works with the old cups. It works with the new cups. And with the new cups, you see you can only talk with driverless IPP printers. And the printer is actually one directly and otherwise, only via the software emulation printer applications. And we'd also even do driverless scanning. Because the scanners in driverless multifunction devices, they all understand ESCL, so also a standardized communication protocol. And so we can make also scanner applications which are emulations of ESCL scanners. And inside one has, for example, the same driver for the old scanner. And cups, we will this year, in a few months, we will release cups 2.5.x. This is not yet the new architecture. It's only to get all of and some other features into cups without doing the big switch over to make it easier for the enterprise distributions so that they can do this more lightweight switch. To get the all of in and in the end of the year, we will release cups 3.x. And the 2.x will, and here the cups 3.x. This is really doing away with the PPD files. But it is also doing another big change. The cups demon is replaced by two demons by a local cup server, which is a user demon and by a sharing server, which is a system demon. So the local cup server is only for getting the print jobs of local applications. And passing them on to driverless IPP pointers, either on the network connected to USB or a printer application for a legacy printer, and not for sharing printers to other users. It even does not use a network socket. It only uses a Unix socket file for communication. And the sharing server. And the sharing server, you can really share printers and you can configure the sharing printer on how to share the printers and so on. It's a system demon. And but so you install it only optionally when you really want to share printers and therefore print server, it has also ACL, it has accounting and everything. And by default for the desktop machine, you only use the local server. So this is the scheme. The very light blue is the sharing server. The medium blue is the local server and the dark blue is all which is in common. This is in the cups library. And so we will have three packages, the sharing server, the local server and the Libcups three. And here you see the two servers and the scheme how everything is together. Thank you. And now with all this, we are already working for more or less five years on it. And planning it and telling everyone how this works. We go all IPP and we want to have print applications and so on. It seems that Microsoft has heard it and that they want to go away from the crap of printing system from Windows 3 from the late 90s. And so they came with a new printing system. Microsoft protected print, which is already available in Windows 11 for testing. And in Windows 12, it will probably be the stand system. And this one is all IPP. As it is with cups three, but it is not cups. It is not that we have system DSP to one of Windows and then we put cups onto it. The code is for more pre-ar. And one already told me that it's still a little bit wonky and flaky. People who have tested it, there's no access to the source code. And Microsoft tells IPP driverless printers are all supported, but legacy printers, they are obsolete, you can throw them away. But I can tell you if someone of you has to use Windows or someone in your family uses Windows, they do not need to throw away the old printers. As under Windows we have WSL and under WSL, you can run printer applications. And so every printer which works under Linux will also in the future work under Windows. And currently works under Windows too, because we have WSL and printer applications. And Microsoft wants to do it for security because they want to get rid of the drivers. Because the drivers have vulnerabilities, it's old, often un-maintained code. And they want to do away with all these vulnerabilities. Because these old cups drivers all run in deep system, perhaps even kernel level. And so crash the whole system when someone hits these bugs. And another thing is Microsoft is telling about print support apps. These are not printer applications. They are not trying to support old printers. PSAs, they add ons to the driverless IPP printing to do something specific. So it's some way that they are fleeing a little bit out of the driverless. So, and if you want to get into, and we need naturally a lot of help. And therefore, I want to ask, therefore I am asking for volunteers that you could voluntarily. As in any other open source project, you should also, it would be great if you would step up to help us at open printing. If you can also participate as a Google Summer of Code contributor. We are also participating this year again. And we have a list of nice project ideas. If you download these slides and then click the links, then you can get to all of this. And we need people for, especially for desktop integration, for updating printer setup tools, for updating print dialogues. We need also people for maintaining the website. We need people for CI testing, for creating CI testing scripts. For OCI containerization, Wax, Docker and so on. And we need, so we are in need of many people, documentation, we need to make documentation for our libraries. And this is very, very important. Here we already need eagerly volunteers for that. And so I hope on all of you, come up to me or contact me through these channels and step up and help us to make printing just work and to make it even work better. As it seems, as Michael Tunelt told, there's no painless, there's no painless way to print under windows. And it seems to go on. And so we should have a nice printing under Linux and other operating systems with open printing. And so are there any questions? Yeah, hello, thank you for a nice talk. I have a question a bit more about the printer sharing server and accounting and things. Well, we work with schools and all the individuals are actually happy that the printing just works. But the principles are not always happy that the pupils are printing pretty pictures and using a lot of paper. So this controlling who can and how many have they are printed. Can you go a little bit to the steps also in a couple two point something about the printing sharing server and the transition to three point X2 kind of what's the right way to do it? And so the sharing the sharing server will have all the same possibilities which you have also in cups which you will also which you had also in cups two dot X and will it will also have the possibility to define profiles where you can where you can filter printers which the users can see and cannot see. And you can also tell which which printers to share to where and I don't know whether you can force option settings which would be a possibility for example to force the use to force users to print in black and white or so and not in color and only selected users can print in color. I don't know whether it's possible but you can in detail tell which users and which which clients can use and see certain printers. In cups yes in in cups three it will be in more detail. I think in cups two you can at least say to share printers to certain networks or to certain clients but in cups three you can also also tell in cups two you can also include and exclude users who are allowed to print. And in cups three you have also profiles which which you can filter which printers are visible and which are not. Hi thanks for the talk you mentioned scanning on a couple of slides. How far away are we from having internet scanning protocol and all modern scanners supporting it. We have we have already standardized protocols for scanning. The common protocol the most common one one is ESCL and if you have an a driverless IPP multifunction device it can at least it can usually scan with ESCL some few others with another protocol which is called WSD which is from Microsoft. We have a same backends for these two which is same AirScan which is not in the same core package it's a separate package but it's in all the distros currently and one of these protocol is supported by any driverless IPP multifunction device so the scanner and such a multifunction device also works. And for further more sophisticated support there's also IPP scan but this is not yet picked up by the print and scanner industry the standard is there and completed but we will perhaps use it later on at open printing for network scanning and to have more detailed permission and client server handling when using IPP scan instead of ESCL but for just scanning we currently use ESCL. Thanks. Good morning to so do you happen to know why Microsoft decided to adopt their own stuff instead of open printing. Can you. They decided to go with the Mopria and why did they do that instead of go to. Yes, yes, Microsoft is a classic commercial close source company and so they are starting with open source they have they have they have hired a Leonard Perturing and and this way they the development of system D happens in Microsoft but somehow they did never come up to me and say we want to use open printing so that into there is also consortium Mopria probably also of printer manufacturers and software companies but all close not that's the printer working group which is all open and in Mopria they have also defined the specification for driverless IPP printing which is very similar it's practically the same specification as of Apple Airpoint or IPP everywhere and therefore I call out I call it always only driverless IPP printing because it's technically the same and so Mopria is also a group of consortium and this group Mopria also writes code right puts puts the standard in code they wrote for example also an Android app which is called Mopria for printing. And so for printing on driverless IPP printers and Microsoft seems to have also worked with Mopria to get their code for for driverless IPP printing and to get into Windows protected print. They did not by themselves tell that they work together with Mopria but people from the printer working group has told this to me. And unfortunately they do it this way it it had been a dream by me if Microsoft had had overtaken cups and so we have a really all over standardization and one organization providing the printing code to everyone. But unfortunately this did not happen. Okay. Thank you to you to Michael and all the printing heroes that help printing not suck as much in Linux. What. Thank you to you to Michael and all the printing open source heroes that help printing not suck as much in Linux. You're welcome. Thank you very much. Thank you very much. Any more questions. Till many thanks for your interesting talk. We've got a little present for you. Let me have a look where it is. You're welcome. Thanks for your talks.
SCION, hitting the future Internet road: Next-generation Internet ecosystem and burgeoning opportunities
Thank you. My pleasure to introduce the next speakers, Jordy and Tillman, who are going to be speaking about SCION, the next generation internet. Can we get a round of applause please? Thank you. Hello everyone. Thanks for attending the talk. Thanks to the host for having us here. I'm a little bit of a rockstar right now. We are Till and Jordy. We both come from ETH Zurich. We are part of the network security group at ETH Zurich. We are also part of the SCION open source implementation team. First question, who has heard before about SCION? Okay, some people. You can skip the introduction and the overview. Okay, so for the rest, I will start introducing what SCION is. SCION is a clean design of an inter-domain architecture that considers security from design to achieve security properties, mainly availability, also transparency and control, reliability and scalability. So I want to make here clear that SCION has to do with an inter-domain network, so it doesn't have anything to do with inter-domain protocols or higher level protocols in that sense. And the other thing I want to highlight on this slide is that SCION is an open source project. So here you have the GitHub repo in which you can find the reference implementation of SCION. So in here you have also, well, Till will give more details afterwards. Here you can find also references to documentation and related stuff. So the second question is why does even SCION exist? So SCION comes as an alternative to our old friend, BGP IP Internet. So yes, this was created even before I was born, so imagine how things have changed so far. So SCION has the distinct aspect that incorporates those security aspects I mentioned before from the very inception. Why do we need this? Because we need a network that provides availability even under the presence of malicious actors because there are people interested in harming the inter-domain routing. So we can find some examples, current recent examples, so for example an outage caused to a Spanish ISP due to a BGP attack. And we have several malicious actors in Internet unfortunately from nation state actors to cyber criminal groups that are interested in harming and due to different reasons. They can be from political reasons to economic incentive, you can name it. And yes, sometimes the trust boundaries, so trust nature of the current routing architecture sometimes doesn't make it clear enough where the trust boundaries are. So probably some of you are hungry, maybe not enough for just running away and grabbing lunch. So I cannot offer you food like actual food but I can offer you some yummy desserts. In this case towards the end of the presentation we will give or we will present a couple of demos. One is a browsing demo using Scion, second one is a Scion word first person shooter, and finally we will walk you through steps and guidelines for developers that hopefully find this interesting and want to contribute or just use what's there so far. But first some overview of Scion. So the whole Scion ecosystem includes different entities from different domains or from research institutions to ISPs, to vendors and integrators and users of the system. All these ecosystems is nurtured by the Scion Association which is a non-profit organization responsible for the standardization of the Scion protocols. They are currently pushing and they have published three or four ITF drafts and they are pushing it to RFCs. So they are also responsible for managing the open source implementation. So here I try to summarize Scion in five distinct aspects. The first one is that Scion is a pathway where internet architecture meaning that end hosts are presented with path information in the network and they can make a choice of what path they use to send their traffic through. The second aspect is that Scion designs and implements a scalable trust infrastructure. I will go into a little bit more detail just in the next slide. It also designs and implements scalable path discovery basically in the control plane for trying to achieve rapid global connectivity. Then another aspect is that it has like multi-path nature. So as I said before end hosts are presented with several paths that they can use even simultaneously. And finally another aspect I would like to highlight in here is that there is already real world deployment of Scion. So these I will show you towards the end or the middle of the presentation. So first just some terminologies. My idea here is that you kind of get the idea or the gist of it so you are not completely abstract in this. So first term is that Scion organizes itself in trust in so called trust isolation domains or ISDs for sure. These trust domains as the name indicates are isolated trust so are nothing else that group of ASSs of autonomous systems that share a common trust route configuration. So they basically agree on a set of routing policies that they want to use. And the other term here is the core ASSs which are the ones in charge of managing meaning updating those TRCs and so on. And they also provide peering with other ISDs. So basically they isolate trust. This is an important point I want to emphasize. This isolate trust it's not another kind of isolation. Then the other part of the overview is the control plane. So here I will explain briefly how the control plane and the path dissemination looks like. So again this is an overview of code. This is full of details and you will find them in the documentation and references to the books and several papers that we have about this. So the routing information is disseminated to the network in these so called beacons which are these squads, these corals and squads in here. Those beacons are initiated from these core ASSs that I mentioned before and they are either propagated farther down the network in the local ISD and they are also propagated between these core ASSs. These beacons are authenticated and extended at every hop and every hop meaning every ASS on path decide how they extend these beacons. So they do it based on only their local policies. So yeah basically on the slide you see that they are disseminated and finally you have found some path information. You have already those segments, sorry those beacons have been disseminated and at the very moment they reach an ASS and here we can focus on the green ones. For example those are already usable so there is no need for convergence in that sense so this piece of information is already directly usable. This helps to achieve rapid path exploration and scalability. As I said before this is just a quick overview because I don't want to overrun with details but there is exhaustive evaluation on this scalability aspect of the control plane system. One other aspect I mentioned before is the multi-path nature of SIAM. So from the N-host perspective N-host retrieve path information from their local ASSs. So basically they request this path information and they retrieve several paths they can use simultaneously. So the path server, so called the servers in the local ASS that provides this information will provide to the N-host this information. So this is different from source routing so here N-host directly retrieves the path from their local ASS. This many paths or several paths allows applications for optimizing on different metrics so they might find some of those paths better in terms of latency than others. Others better in terms of bandwidth and they may hopefully find a point that better suits their needs, the application or N-host needs. So just for putting some numbers in here in the current production network, so the real fabric we are building, if you take two ASSs you will find from dozens to even hundreds of paths that can be used to reach the other endpoint. And last slide on this overview is this control plane and data plane slides. So the control plane is what I have just explained in the previous slides and the data plane is what I will try to explain right now. So as I said before N-host retrieve those path segments from local service and they combine them to create a path. So you can find here two examples of paths. So segments are combined in one path for packet one and in another different path for packet two. So once the N-host has encapsulated this information into the packet they send it out to the network. And routers forward packet based on the path information, so they inspect this path information which contains information of which one is the next hop and then routers can simply forward to this next hop. So this allows for simple routers and stateless operations. So as you can see here, so for example those packets may belong to the same application. For example, you send the packets or the N-host in this case send the packets using two different paths that in this case are even disjoint. So for example this can be useful if you have an application, it has a control channel, it can use low latency path for control channel and higher bandwidth path for the real application data. Okay, now, so I want to also convey the idea that there is already some tangible stuff so it's not, as I said before, it's not only a research project, of course there is a lot of research in it, but there is real deployment and engineering right now already in Cyan. So for that I will basically explain these two networks. So first one is the actual, the global Cyan internet, so the real production network in that sense, so the real fabric. And I will just introduce briefly some concrete ISDs in this case, those color bubbles that I showed at the beginning. And then I will also talk briefly about side lab test with network which is a completely separated network, in this case this is an overlay network that anyone can use and I will give more details afterwards. So in general this production network again is, this is not an overlay network, it's the real fabric, it's BGP-free. And it's currently deployed by several international ISPs, so here you have some logos, you don't have to look at them. Currently there are over 100 ASS and they are distributed in Switzerland, you find a few of them, also in the EU, in North America or in Asia. And the other thing about the production network is that recently also has been enabled Cyan-Claus-Bates access, in this case this is a commercial offering. But so just for you to know that if anyone happens to have like cloud deployments, they can also access the production network. Okay this is one of the examples, this is one ISD again, one of the, this color bubbles that I showed you at the beginning. This is the Education and Research ISD, Sierra is the fancy name. And here you find universities, in this case here you have some of those. This is a growing ISD so it's not closed, so more universities may come and some are interested in coming in. There are also other research institutions and research and education networks that also provide connectivity between those research entities. So here is a world map picture of how they are distributed roughly around the world. And then very shortly there are also industry use cases right now, so there is this secure, in Switzerland are those two that I'm going to introduce. The first one is the Secure Swiss Finance Network, so they are basically using Cyan and they are going to phase out the finance IP net that they are using. And by June this year, and by then they will have, or the network will have around 120 participants. And the other example similar to the Secure Swiss Finance Network is the Secure Swiss Healthcare Network that provides similar service for health professionals. So yeah, that was the real production network and this is Cyan Lab, which is the research network. In this case Cyan Lab is a globally distributed testbed to conduct experiments and test deployments. So anyone can join, so anyone in the audience can join this network just by downloading a virtual machine. So with background file you run background app and then you have your node attached to one of these transit nodes. So all of those are transit nodes, leaf nodes are not in here. And yeah, basically I'm not that interested in the names, so they may be a little bit unreadable, but different boxes are located in different parts of the world. So for example you have Korea, you have North American, also you. So yeah, Tillman will also give some pointers afterwards where you can find the information for joining Cyan Lab. So now we also have awesome Cyan project, so basically this is a compilation of projects that are related with Cyan. So we have from infrastructure projects, so we have people implementing Cyan into Fino routers, also express routers. We also have firewalls using Cyan and other kind of infrastructure related projects. We also have application projects, so we have the Cyan Naval Browser extension for example in Chromium. And we also have, so far as Cyan were, Quake 3 video game, video game client distance. And of course we also have the libraries. We have pointers to reference implementation again to network APIs and client and host stack in different languages. So we have in Go, then we have this client libraries for Java that Till will give some more information and explanation just right now. We also have client and host in Rust and bindings to other languages like C++ and Python. Then also this list includes useful tools, so Cyan is integrated in the CIT emulator, so if you are using CIT emulator you can also bring up your Cyan network. Then there is also escape libraries for package generation and YSAR plugins for packet capturing. So here is the first demo I want to show to you. I will just switch to the video. Okay. Yeah, I guess. So this is the Cyan browser demo and basically here you will see, so first of all this uses the production network. So you will see this basically browsing in the production network. This is part already, I just said, of these awesome projects in which we have projects of, so different projects using Cyan. So in here you can find this extension. You load one Cyan-enabled website, in this case for example the ETH website. And the extension provides some information about the resources and where they were loaded from. So here you see that the resources from the ETH domain were loaded via Cyan, so the green indicates Cyan and red indicates that they fall back to the GPIP. Of course this is configurable and you could choose not to fall back to anything if they are not available. Here you have some path information. You see that we stay within the Swiss ISD, so my client in this case is in the Swiss ISD, so we stay in there. Because the server happens to be located in the same ISD. Then an example of navigating to another ISD. In this case we navigate to the MacW University server and in here we see that resources are loaded, all of them via Cyan and we have also some path information. So here we see that we go from the Swiss ISD where my client is through this CERA education network that I presented before. So here you find the exact AS numbers that the traffic traverses. So here I type yet another example. And basically also we have more path information. This resource also happens to be located in the same ISD. If you really do find different ISS but this is not important information. So basically that will be it. Now this is just again for just showing that we fall back to the GPIP but this I already explained. The second demo I wanted to show is this Quake 3 demo. So here this demo is using the CYLAB testbed network so it's not using the production network. And here our client is located in note at ETH in Zurich and we connect to the server which is located in Magdeburg, Germany. We connect it to the server and okay. So basically here what you will see commands being typed. The piece of information I want to convey is that so different things that we have is this showpaths command that's CYM specific. And this showpaths command show all available paths from the client to the server. So you get a bunch of paths and well we see a little bit more of them. And the other thing is that for demonstration purposes what we do is we bind to the key to command next path. So then we can iterate basically and see how different paths provide different latencies. So we have this key shortcut and while we are playing you will see on the top left of the screen not right now but while we are playing. So this path for example, so if you saw before this path had 100 milliseconds latency, we start playing. And then what I was saying before is that now for example you see in the top left corner we have switched the path. So we just press this key shortcut and we iterate over the set of available paths. So this is for demonstration purposes for show that we can find paths with different latencies. We keep iterating, we see changes in latency, we see we keep iterating on the top left part of the screen and yes basically we see these different latencies. So hopefully this will stop now in the last frame and here for this specific path we receive this latency. So this interactive of course you can program this and adapt your application to have path selection algorithm that does it automatically and always takes the path with best latency. And yes that was it. I will now let the floor to Till for explaining the rest of the presentation. Okay thank you Jody. So now let's imagine you found all the science stuff very interesting and you tried to implement your own project. How would you go about that? Yeah the first step would be to go through this awesome science list that Jody presented earlier where you can find existing projects but also libraries, language libraries to connect to the science network. So these are probably most important ones for a new project. The first one here is the Go API that's like the reference implementation, the original implementation of science. It's the most comprehensive implementation. It contains everything including border routers, control servers and everything you need to completely run science. It also comes with language bindings for C, C++ and Python. More recently we have a Rust API 100% written in Rust and just released a few days ago we have an alpha version of the Java API and that's actually what I'm going to talk about in the next few slides because that's kind of the project I'm involved in the Java API. So yeah it's written in 100% pure Java. It is very similar to the Datagram channel that people may know from Java with a few exceptions. Datagram socket is currently not implemented but that's very pretty much the next thing to do in our list especially since I realized that a lot of existing older projects rely on Datagram socket instead of Datagram channel. The library also has an API for path inspection. This is pretty much all what science is about. You kind of get a lot of paths from your AS and you select the best path for your purpose. So yeah path inspection and selection is very essential. It also supports the SCMP protocol. SCMP is like ICMP for science. So again you have echo or ping and trace root commands available. So let's look at a very basic Java client. So this is a Datagram channel example and basically there's nothing to see here because it looks exactly like you would use a normal Datagram channel. The one thing to just bear in mind is for example the host name eth.ch that needs to be in sign and able to host otherwise you can't run that example and also your local machine needs to be somehow connected to the sign network. Let's look at a bit more interesting example. So there's an additional method. There are several ones but this is one that may be interesting. It's called set path policy. So what you can do of course is just go through all the paths that you get from your local ISP, your local AS and then pick the one that you want to use. But it's much easier if you can just define a path policy. In that case maximum bandwidth. You set that path policy on your channel and the channel will always try to find a path that suits this path policy. Now we're going to look at the server side. It's a little bit different from the native Java implementation in the sense that the receive doesn't return an Ionet socket address but a path object. And the path object does contain the Ionet socket address from the client that connected to the server. But it also contains the whole path that the packet actually took through the internet. And you can just use this path and to send a response back to the client. The idea here in sign is that usually if you send a packet to a server you would send it back the exactly same route. Technically you don't have to do that but it makes it a lot easier for the server because the server doesn't have to look up paths how to connect to the client. It's just much faster. So I mentioned path policies before. The Java library comes with some predefined path policies. Somewhat self explanatory probably but one path policy is called first. That just picks the first path that your AS gives you back when you ask it for a path to a certain destination. That's kind of a cheap way for the AS to actually recommend you a path. They think this is a path you should use. The next one is min hop that just tries to find the shortest path. So with the least hops in the path on hop being like a border router or other ASs you have to go through. Then there is min latency and max bandwidth which also pretty much do what you would expect them to do except that these implementations are non parameterized aesthetic. So they just rely on metadata. So you ask your AS for a path to get metadata back that kind of estimates the latency and give you like the allocated bandwidth for the short for the links in the path. If you want to have like really the best latency you would need to implement a new filter or I may also provide that in the future. A new filter that looks at all the paths, pings all the paths and then just selects the one that has a lowest latency. At the bottom of the list we have the ISD allow and ISD disallow filters. ISD being the isolation domain numbers that we previously saw, although this whole set of ASs. So basically isolation domains can map to countries for example or to something like the university network that we saw earlier. So these ISD allow and disallow can be used to implement something like geo fencing. And since the ISDs represent countries for example you can decide that you don't want your packets to go through a certain country. So imagine you're on the bottom left ISD in one of those ASs in that ISD and you want in your ISD is 110 and you want to send to ISD 130 and there are a lot of paths, some direct path. Some go out by 99 and one some path by 125 and 120 and for some reason you don't like ISD 99 which could be a country, could be just some other organization. And then you can just define your path policy like that. The exact syntax is a little bit different. So I simplified it here but it's pretty much that. So you can just exclude 99 so the filter will pick any path that is not and that doesn't go by a 99. So once you wrote your application, your first step, well not the first step but yeah, the next step is testing. And the common way to test sign is just to run a local network on your computer on your machine. And you can do that using the reference implementation that was mentioned earlier, the sign proto reference implementation. So what you do is first you define a topology in a file, topology file, then you run this command. This will create a lot of configuration files for all the border routers and control servers, demons and whatever needs to be started in your machine. Then you can view the topology if you like. So we have a very simple topology here with three ASs like the three ellipses, one core AS that's a button on the top and they all reside in the same ISD. That's just a simple example. There are also a number of topologies that are already in the repository so you don't really need to write your own if you don't want to. Then you just run the topology that will start up all the different processes for the different border routers, control services. They're all connected by loopback devices and then you can just connect with your local application to the network. In this case I just run a ping to the core ISD and that's the result. So there are other methods for testing. So there's for example the seed emulator that Jordy already mentioned that does support Sion. Then there's the Sion lab. This is a worldwide network of Sion nodes. If you want to use it you can go to the website, register, you can allocate your own ASs if you want. Then you can, as mentioned previously, download an image for a virtual machine and the virtual machine is like an AS that you run locally. You can even create several of those and then create a network. Then you can test in the production network but that requires you to actually have access to the production network. So if you're lucky your ISP supports that but there are currently not that many. AWS offers nodes that have Sion access so you can rent an AWS cloud center or something. Or maybe your university has access to the Sierra network. And finally for debugging there's a lot of command line tools. I think I mentioned ping and trace route before, show path and several others. And there's also a very neat wire shark plugin. So you can just look at Sion packets, inspect the header, look at the path that is associated with the header. So and if you want to contribute there are several tons of projects that could be done. You could carry your own project so we are still missing native libraries for C and C++ for example. There are no libraries at the moment for C sharp or swift. You could think about embedded or mobile devices. Also network protocols. For example Java implementation currently only supports UDP. We aim to use support quick and probably TCP very soon but there are many other protocols that would need support. Or you can just use one of the big existing projects for web proxies, HTTP servers or video conferencing clients like JITC or something. Or gaming libraries and try to make them sign a way so you can select path or automatically select good path in these projects. So yeah finally if you want help or support there's a Sion Slack channel, there's a Matrix channel. And since last week we also have a Sion tag on Slack overflow so you can tag your question with Sion and we have some developers subscribe to the tag and they will try to answer your questions. Yeah that's already from my side everything. Thank you and looking forward to some questions. Thank you for a great presentation. I have my question is regards security and protection against the DOS attack. So you allow everybody to select a path for packets and how do you protect against someone doing it maliciously. For example sending packets back and forth between ASS to overload the network. Yeah so the question was I think how do you prevent DOS attacks or how to prevent people abusing paths to send traffic for creating loops for example. So these paths that you can see here they all signed. So that makes it kind of impossible to create your own path. The paths are all signed by all the ASS on the way. And that also makes it a bit easier to prevent DOS attacks because you know where a packet came from. And if you don't like that region you can quite simply block everything that comes from that region. Thanks. Yes. I think a question is someone in the similar vein how do you deal with resiliency of the network which with. How do you deal with. Oh yeah the speakers. How do you deal with the resilience. Like the internet is very resilient because the routers can take independent decisions but if you select a path as a user and like a link goes down then like. Like information that information has to disseminate all the way back to the user so they can select a new route and then. Have their link up again instead of that just happening transparently to the user. So. Okay the point here is that normally you send the packet out to the network right and just pick routers to send it to the next hop just based on the destination information right. So when you have for example some link failing in the middle you need to converge to a stable state in which okay what's now the after the failure what's the next router I have to send the package to. So this is this takes some time and by that time your packet may be time out already so you need to still send the packet so with Sion you can already detect that when you see that the packet it's taking long or you don't get feedback and you can already take for example. Completely disjoint path and in that sense this failover mechanism is quite effective. So normally when link is busy you also get network feedback right so you see that packets are thinking longer latency is increasing so what you do is you as a user you are interested in taking a more. Healthy path in that sense if I can say that so you will automatically switch to that path. I have I have a question you showed in your API example that when you create a connection you specify the route. But supposedly a country changes you don't want to go your want your packet to go through a certain country then you have to change your software. Currently when I make a connection I only specify the destination and I don't care how it gets there. Now the information of how it gets there is mixed with where it should go. So when the route changes I need to change my client software. If I understand correctly. I hope I understood the question. So the routers become very simple because I don't need to do any decisions anymore. The only thing is they could verify whether the path is assigned past. So yes you do have to update your clients. The clients all need new libraries that could be. I mean I have a quite high level library with a Java but it could be an operating system. Just another driver that sits underneath UDP or TCP and adds like the sign path transparently. But yeah that's kind of the big work we have to provide updates for the clients. Let me add five cents. So there are also transition mechanisms right. So we have for example a thing so called Sion IP Gateway. So there for example traditional IP applications so you have your applications in a certain subnet and this traffic gets to this Sion IP Gateway. And this Sion IP Gateway encapsulates traffic to the Sion network. So in that sense you don't need to change for example with this transition mechanism. You don't need to change your application. Of course the application is not taking like the best properties in that sense. For example then you would not be as application choosing your path and optimizing for all of those or for some of those metrics. But you can still let this traffic go to this specific gateway and then this gateway will decide for you depending maybe on policies and local policies where do I send this traffic to. I don't mean the one time conversion of course if you switch to a new network protocol you have to change your software. But it's more the dynamic stuff when now my provider decides if some AS is out of the loop. But now I have to decide that as a client. Yes but you can easily default to whatever paths your provider provides to you. So you can always pick your default and just don't care about the rest. But on top of that you have the choice as an host to decide where you want to send your traffic to. If your for your use case is not important whether your traffic goes through I don't know any country you name it. Then you can just fall back to the default paths and then you don't have to make this decision if so. Is path selection always from the client side or can the server make a decision of what paths are acceptable. So the password is usually the client would connect the path server in his AS in its AS to get a selection of paths and we would use those paths. And those paths are sent to the server and the server could look up a different path to the client in theory but that feels a bit ineffective. So they can just reverse the path what's automatically reversed in this API to you send the packet back. Yes but just adding something else. There are some projects for example that try to make some negotiations so the server can signal or indicate the client. What's the best path to choose in their opinion but this is also like separate from the vanilla side I don't think. This is something you could put in place. I have a few questions but they should be quick to answer. So I've read about this thing called secure backbone autonomous system which is where you advertise better routes to existing BGP infrastructure. Is this in wide deployment is this popular is it being used. Also have there been any experiments with Wi-Fi does Zion work well in wireless context. Also in the quake example I didn't see an IP address in the quake demo I saw some other sort of address. What is that address. It's like the ISD or something. So I got the first and the last questions I will try to answer those first. So the first was talking about as much right. So as much is also an incremental deployment model architecture which basically is a hybrid between BGP and and so in that sense. The idea very basic idea and I'm not the one like involving the project but very basic ideas that you combine BGP so that BGP are announced closer to this backbone. And then you use this backbone as secure backbone and then you go out to the Internet again hopefully close to the to the to the destination. So the specific question was about the current deployment so it's under deployment. So some members in our team are making efforts and this I would say should be soonish already to be used in production. It's not yet there but it's there. The last question was about the address format right. So yes the address format that you saw is composed of the ISD so this color bubble and a yes the individual is of the of this ISD. So these two numbers plus the end host the end host address. This end host address has a scope within the autonomous system. So you could use basically any address you want but the scope of this end host address is specific to this specific autonomous system which is indicated by the ISD plus AS numbers. There was just one more question about wireless. Is there any experiments with wireless? Has anyone done anything? So now we will be starting projects for supporting Sion in Android and there we are going to go deeper on that. But of course I mean for example if you are interested in providing wireless support and optimize for wireless this Sion use case I mean you are more than welcome. But yeah. Thank you for the talk. My question is you mentioned earlier at the presentation that there is a way to update ASS which nodes which which nodes can be trusted with the further routing. So my question is how does that work? Who decides which nodes can act as ASS which ones cannot and how those bigger bubbles ISP or something else whatever it's called. How do you decide which nodes act as ASS and can you dynamically update it? So if I understood the question correctly it's about who decides what ASS is trustable or not trustable right? So what Sion brings is this possibility to the sender in this case the end host within the ASS as I said before. So your ASS will provide you with a set of paths and then you will based on your local policies as end host you will apply these policies to these set of paths and then you will end up having a subset of path that you consider that good for your use cases. So this of course brings so this delegates some responsibility but this is good. I mean at the end as I answered before you can fall back to any default path right? And just be kind of agnostic to okay where my packet is going to but I mean the main benefit is just having a choice on that. So ISDs basically represent jurisdictions right? So you can think about them for example we have the Swiss ISD. We will have other countries ISDs or regions or in this case for example this group of university institutions. So you may think so in there is pretty much for my use case for example I would want my traffic only go through research institutions because I'm deploying this thing from maybe my home country. And then I could basically steer that policy or implement that policy in there and then of all the sets of available paths I will be doing that. Of course this will depend maybe on your application for doing certain things you are good with other paths. So I don't know if I... The initial trans routes so they are agreed upon so basically... Can we take this offline so we can get the next speaker please? Yeah I mean I can answer you offline because they need to... Hallway track it's a thing. Okay thank you for the talk thank you. Thank you.
Sequoia PGP: Rethinking OpenPGP Tooling
Thanks for coming. Today I'm going to talk about Sequoia PGP, in particular rethinking open PGP tooling. First I want to introduce Sequoia PGP for those of you who don't know about it. I'll talk about its design and implementations and what makes that interesting. And then some day-to-day usage in order to illustrate what I mean by rethinking open PGP tooling. So what is Sequoia PGP? Say an open PGP implementation. And you can see here this is our GitLab site and we have a number of projects. And what is this open PGP thing? Well it's an IETF standard. Go on the internet, you can download it for free. It's derived from PGP which was published in 1991. The first version of the standard was published in 1996 and work is still ongoing. The next version of the standard is expected this year. It's currently in working group last call. And the standard defines the wire format. So how messages look, how certificates look. It talks about algorithms like encryption and decryption and how signing and verification work. And it also defines importantly a PKI, a public key infrastructure. But it's not an implementation. Sequoia is an implementation. And it's not just an implementation of the spec. It's a whole number of services and tooling and applications on top. And for those of you who've used open PGP, it's also a paradigm shift. We've looked at the way things have worked in the past and we have some new ideas about how maybe they can work better at least for some people. Sequoia's technical goals are to be a library-first architecture. So not a command line tool where a library calls the command line tool, but really the library is the source of truth. It's the most powerful thing. We have unopinionated low-level interfaces that are safe by default. That means that you can do a lot of things. You can do a lot of stupid things. But what we really tried to do on our API is to make sure that the easy way to do something is the safe way. Of course, low-level interfaces are hard to use. And so we provide high- level interfaces. And these high-level interfaces are necessarily opinionated. And what do you do when the opinion doesn't match what you need? Well, you can either completely switch to the low-level interface, which is inconvenient. What we've tried to do is to make our interfaces gradual so it's possible to mix low-level data structures with high-level data structures. And we've also designed Sequoia that the services are optional. And I'll get to what that means later. But what was the motivation for building Sequoia PGP? Sure, most of you have heard about it. It's existed for, I think, 23 years. And we've talked to people. And we heard some complaints from some users. I don't want to say all users, but certainly some users. And as we all know, the people who have something negative to say tend to be the loudest. So I don't want to say that as a representative sample. But what we heard was that the CLI was hard to use. And that the CLI-first approach that GNU PG takes, where you have GPGME, which is a library that calls out to the CLI binary, it's brittle. We heard that the APIs are too opinionated. And sometimes you want to do something and it almost matches what GNU PG expects, but not quite. And then you have to write a lot of code in order to work around it. People didn't like that the services are mandatory and the scalability wasn't so good. And I'm not talking about internet scalability. I'm just talking about individual user who has a few thousand certificates locally. Operations just take too long. So that's sort of the negative motivation, but there was also positive motivation. If you go out onto YouTube and you look at the GNU PG channel, there are a number of interviews called the GNU PG stories. And there are a lot of people from different projects, the EFF, the ACLU from OCCRP, from newspapers and reporters and reporters without borders and activists. And there's a common theme that they all were repeating. We use a lot of different encryption technologies, but probably none more important than GPG. And the question that we had was, you know, can we do better? We were inspired by this. So I want to take another step back and talk about Sequoia's sort of prehistory. So Sequoia was started in 2017, but before that, the people who started it that was used as Kai and I, we worked on GNU PG. And while we worked on GNU PG, of course, we worked with code. We talked to people who were using GNU PG as in developers. And we also talked to end users. And we had ideas about how to change things because we had these conversations where people were telling us things that they were unhappy with. And we had many technical conversations with Werner. Werner is the main author of GNU PG. And we couldn't converge on a vision. And so we had this conflict in the room. Werner wanted to go in one direction. The three of us wanted to go in a different direction. Should we continue with the established approach? Should we pursue the Sequoia vision? What does a compromise look like? And sometimes a compromise just isn't possible. And what do you do in that case? Is one person win and dominate? And sometimes that's a solution. In this case, we chose to part ways. And I think that's a perfectly okay thing to do. Werner had a vision. We had a vision. We didn't demand that Werner changed his vision. We left and we started a new project where we wanted to experiment and see if we could solve these problems that we had recognized. And what happens when you have two projects? Do you split the users? Do we have a small number or a big number of GNU PG users and all of a sudden half of them go to GNU PG and half of them go to Sequoia or 10% go to Sequoia and 90% to GNU PG? It could happen. But I think that's a pessimistic view of the possibilities. I don't think that has to be that way. Because it's not just about the GNU PG users. They're all of these non-users out there who are not using encryption technology. We wanted to offer more choice for users. We wanted to explore different options and see if the users out there or non-users could be served by this new paradigm. There's a diversity of needs. We wanted to in particular win over non-users. And the great thing is that there are a lot of non-users or maybe that's a sad thing but there are a lot of non-users. And we have a protocol. OpenPGP. It's interoperable. Can the network effects help? More implementations, more users, more network effects? In this view of the world, the ecosystem wins. There is more privacy and there's more security. And at this point I want to have an ode to Werner. So Sequoia really owes its existence to Werner. He was an inspiration to make GNU PG better. He was our inspiration to work on cryptography and defend privacy. And if Yusis Kai and I are Sequoia's parents, then it's not far to say that Werner is absolutely Sequoia's grandfather. And it turns out that it's not just two implementations. There are many implementations. There's OpenPGP for go, OpenPGP.js, PG painless, PG pi, R&P, RPGP and Sequoia. And these are just the free software implementations that are relatively big. And if you have all of these implementations out there, how do they work together? Well, yeah, we have this standard, but we have to ensure interoperability. And ensuring interoperability prevents vendor lock-in and improves the network effects for everyone. And for this, a standard is not enough. We need more. We need an OpenPGP interoperability test suite. And this was one of the first things that we actually worked on. And it currently has 131 tests and over 1,500 test vectors. And here you can see a snapshot. You can see that most implementations are tested. Currently, there's one implementation that I mentioned that's not there, which is RPGP. But thank you to Heiko, a former Sequoia developer. He's currently adding support for RPGP. All right, now I want to switch gears a bit and talk about the design and implementation of the library or the low-level components. So this Sequoia's architecture, what does it look like? I mentioned before, library-first approach. So applications are built on the library. And on top of the library, we have the CLI. The CLI is using the library, and that makes the CLI necessarily less powerful than the library. And we think that's okay. If you want to do, if you want to program using our CLI, it's possible. If you want to go further, then you're probably in a space where you should be using a library in a high-level language. We have a bunch of high-level components. They're optional. We have services that run as demons, for instance, the key store. But it doesn't have to run as a daemon. It can be co-located. Now, the daemon has the advantage that you have process separation, and this avoids things like heartbleed. It can multiplex resources, it can share state. But it's not always the right solution. And so it's possible to co-locate the service into the application binary in the same address space. And that's good when you need a restricted environment or you want to fall back in order to increase robustness. Now, I mentioned that we have a whole bunch of components. So up top, we have here OpenPGP, which is our library. And next to it, we have PGP-CertD, which is a certificate store or a standard, or not yet a standard, but there is a text that describes how it works. It looks like a mailder. And we have a library implementation. And that doesn't directly depend on our library. And then we have a whole bunch of libraries on top and services. We have the key store for private key operations. We have the cert store, which is the in-memory certificate store. We have the web of trust engine. We have our network library for accessing key servers in WKD and Dane. We have our auto-crypt library for doing auto-crypt operations. And we have another library for configuring the cryptographic policy. And SQ, you know, it exposes all of this functionality. And so it's using all of these things. And RPM is one of the users of Sequoia. Since Fedora 38, the version of RPM that ship uses Sequoia to verify the packages. And it doesn't use secret key material. It has its own certificate store and it uses its own trust model. So all of these components aren't needed. And RPM just links against OpenPGP, the library, and the configuration policy. Now I mentioned before our API design, unopinionated low-level interfaces, opinionated high-level interfaces that are built on these low-level APIs. But what does that look like? So let's imagine that we have a certificate and we want to write it out to disk. So we have a method called serialize. You provide it with a buffer or a file or whatever. And it just writes it out in OpenPGP format. What if there's secret key material in there? That would be a shame if you accidentally leaked that. Well, in Sequoia, we automatically strip that out by default. That's safe. You really, really need to write out the secret key material. Sometimes you do. Sometimes you want to. You have to opt in. And for this, we have asTSK, which converts the data type. And the new data type provides an interface, the same interface, and you serialize it. And then you also get the secret key material. And I mentioned that we have these progressive high-level APIs. What do they look like? Here, we see how to create a certificate. We have a certificate builder. You want to create a general-purpose certificate. You add a user ID, and you generate it, and you're good. But what if you also want to add a decentralized social proof? So, probably heard of Keybase, where you can do these social proofs or link services. There's also a mechanism in OpenPGP, or an extension that allows you to embed them directly into the certificates. And that's not really supported by the library. At least there are no APIs to do that. But you can use the signature builder, and you can add on the appropriate notation. And then the cert builder, you override the how it creates the signature by using this template, and then the certificate that's created automatically has this decentralized proof embedded in it. So that's the library. What does the command line interface look like? So, SQ is our primary command line interface. There are other tools out there, of course, but SQ is sort of the GPG equivalent, if you will. And we opted for a sub-command style interface. So, if you want to encrypt a message, use the encrypt sub-command. And here, I'm encrypting a message to me. So, the recipient email is neil at sequoia-pqp.org. The next thing that's very important is that we have a very clear separation of options. So, there's another sub-command, SQ sign. You can sign a message. And this command does not take the recipient email argument, because it doesn't make sense in this context. And so, if you try to provide it, you get an error. And another thing that we've really tried to do is ensure that there's consistency between the sub-commands. So, if you have, for instance, an email option, it doesn't matter what sub-command we're talking about, it has more or less the same semantics. And we've talked to people who've used sequoia or SQ, and the reactions have been very positive so far. So, we're quite confident that the design maybe is not optimal, but certainly is good. But the really big paradigm shift in SQ and in sequoia in general is the way that we think about certificates. And a certificate is sort of this OpenPGP artifact that you use and that you throw out onto the internet someplace on a key server and a Web key directory. You publish it on your web page. And if I want to send you an encrypted message, then I download your certificate and I encrypt a message using your certificate. And then I send you that encrypted message and you're able to decrypt it with your keys. So, certificates are really important and how we use them is also important. And in SQ, we're moving away from curated key rings. So, a curated key ring means that the data that you have available locally has been checked. It's authentic. It's the right stuff. If you have a certificate for me or that claims to be for me in your key ring, you're assuming that it's good. And it's sort of this say yes to get worked on mentality. So, if you're using GPG and you have a curated key ring mentality, and that's not required in GPG, but it's how many people use it we've observed, and you want to send a message to DKG and you address him by his email address, then GPG is going to warn you and say, do you want to use this key anyway? And it doesn't really provide you any options. The options are get worked on or not get worked on. And certifying user IDs is not easy. So, the amount of energy required in order to certify a user ID means that hitting yes is sort of much easier. And we want to move towards strong authentication. And so, in SQ, we treat the local certificate store as a cache. It's no better than the certificates that are stored on, for instance, the SKS key servers where no authentication is done. By default, we'll just store anything there. What about these self signed user IDs? Right, if you have a certificate, you create a user ID, you add it to that certificate. On my certificate, I have a self signed user ID that says Neil, but anybody can create a certificate where there's a user ID that says Neil on it. We treat it at most as a hint. And SQ certificates can only be addressed by authenticated identifiers. And the way that we do this is we really, really embrace the web of trust. And now the question that you're probably asking is, is this going to be a usability nightmare? And it's a question that we also asked ourselves because we didn't know we had to try it out. And I propose let's take a look and see. But we need to take a step back again and ask, what is authentication exactly? And so there are sort of two aspects to authentication. What we want to know is what certificate should I use when I want to encrypt a message for Alice? Or alternatively, if I have a certificate, who does this certificate really belong to? Is it Alice's certificate? And really self signatures, they don't mean anything in this context, right? This certificate here that we see on the right, there are user IDs that say Alice, but did Alice create those user IDs? Or is it Mallory who's trying to trick you? Or maybe somebody who's just trolling? So what does authentication look like today? Well, we have a centralized authentication, which is easy to use, but it's unsafe in the sense that you're relying on these central authorities. You're relying on hundreds of centralized CAs in X509. These are controlled by governments, not only your government, they're controlled by companies whose interest is to make money. And any one of them can trick you. Certificate transparency helps, but you're still reliant on the centralized CAs. And they haven't done a good job historically. So up here at the top, we see here a Google security blog post and Chrome planned to distrust semantic certificates because they had made too many mistakes. This is not great. But signal, signal is great, right? In a certain sense, technically, signal is even worse. And signal, you have one key server. And it's on the same infrastructure as the message transport. The good news is, and I use signal, you can trust the signal foundation, right? I believe in Moxie, but I don't think belief is enough. I think we want technical solutions. So what about peer-to-peer? Here, we're talking about checking somebody's fingerprint or checking the safety numbers in the context of signal. You can do that, and that is really, really safe. And it's a really good thing to do if you're worried. But it has such high upfront costs that few people do it. We need something in between. And then there's a third model, which is the consistency model. Do you have the same certificate every single time? This is called trust on first use, more or less. And it's really easy for users until they have a problem. And then how do you resolve a conflict? All right. So we have these different models, and maybe they're good enough, I don't know. Maybe they're even good enough for most. So Pearlbuck, who's a Nobel prize winner, said 100 years ago, the test of a civilization is the way that it cares for its helpless members. So the weakest members, the people who need protection, the activists and the people that are being pursued. And so our goal is not to be good enough for most, but to be good enough for even more. We want to provide a progressive system that serves a range of needs. And the way that we're doing it is we're providing different tools in order to increase confidence. And the tools work to support the user. And then based on the individual user's threat model, you can decide if the degree of confidence is high enough. And for this, we use the web of trust, which is a powerful and flexible PKI. In the web of trust, everyone can act like a certification authority. That doesn't mean that everybody is your own personal certification authority. You have to opt in. But maybe you as an individual don't opt in by yourself, but it's your system administrator at work, or it's a family member who you rely on. And the web of trust can use weak evidence. It's not a zero or one decision. It's possible to combine evidence in the web of trust. And the web of trust can work with all of the models that I presented before. It can be used in a centralized manner. It can be used in a federated manner. And it can be used in a peer to peer manner. And traditionally, people think of the web of trust as a peer to peer solution to authentication, where we go to key signing parties and we check fingerprints. But that doesn't have to be. And so if the web of trust is so good, why hasn't it succeeded? Why are we only using it in this very limited way? And I think the reason is because we've been missing the tools to make it easier to automatically integrate evidence into a web of trust and tools that make it easy to manage the web of trust. And I would say until now, because we've been working hard on improving the tooling. So in order to illustrate the power of the tools, I want to do an example. I want to send an encrypted mail to DKG or encrypt a message to DKG. So let's just try it out and see what happens. So we do SQL encrypt. We provide the email address, and we get an error. Well, that's not so great. Let's go to the key servers. Let's go on the network and see if we can find a certificate for DKG. So in SQ, this is the SQ network fetch sub command. And immediately we see something that doesn't give us confidence in the tools I would suspect. We imported four certificates. Ouch. Which one do we use? Which one is the right one? Is one of the four even the right one? Maybe it's a fifth one that we didn't find. What should we do? The best thing that we can do would be to ask Daniel, what is your fingerprint? And then use that one. But what if we can't or it's inconvenient? We could ask somebody else. That's pretty good that we rely on. Or a better solution is to ask multiple entities, combine the evidence, and then weigh the evidence according to the entities, and ideally do this in a completely automated way. And then you have a certain degree of confidence that a binding is correct. And maybe that's enough for you, maybe not. That depends on your threat model. And there's already a whole bunch of rudimentary evidence out there about what certificate we should use for DKG. There are a whole bunch of key servers. There's WKD, which is the Web Key Directory, and there's Dane for looking up certificates in DNS. And it turns out that keys.openpgp.org is a validating key server. That means that if you attempt to upload a certificate to keys.openpgp.org, you get an email where the user IDs on the certificate get an email prompting them to follow a link to validate the user ID for that certificate. keys.mailvalope.com does something similar. Proton mail does something similar where you don't get an email, but you log in, you say, this is my certificate. WKD is controlled by the user or their administrator, and the same thing for Dane. And SQ network fetch already fetches them all. You don't have to do it manually. And by the way, it records the evidence in the Web of Trust. They're stored completely as normal Web of Trust data structures as defined by the ITF standard. But how are they stored? So the way it works is we'll take keys.openpgp.org as an example. It's more or less a de facto CA. So what we do locally is we create a shadow CA. We create a new certificate and we say downloaded from keys.openpgp.org as the user ID. We have a local trust route. The local trust route says this shadow CA is an intermediate intermediary CA. We don't create one for SKS because SKS does not do any form of validation. And so in the case of keys.openpgp.org, we download a certificate from that key server. We go through the user IDs and then we create a certification for the returned user IDs using the keys.openpgp.org shadow CA certificate. And this evidence is automatically combined by the Web of Trust. So we have this trust route and shadow CA's. They're created automatically. And by default, the shadow CA's are trusted minimally. Some users don't want to rely on keys.openpgp.org and that's completely understandable. And as I mentioned, in the Web of Trust, you can have a varying degree of confidence in a binding or in a CA. And so we use the minimum, one out of 120. And what we also do is the trust route and the shadow CA's and the certificates, the certifications that are created, they're all marked as unexportable. And we do this in order to protect the user's privacy. So let's take a quick look here at how the evidence is recorded. So we do sqpki list and we put down the email address and then sq helpfully shows us the three paths that it found. And at the top, we see that there's the local trust route that's followed by an intermediary CA called the public directories that's followed by our shadow CA downloaded from keys.openpgp.org and then that is certifying the certificate from Daniel. And what that looks like graphically is here shown at the bottom. And some observations, the shadow CA's are partially trusted. keys.openpgp.org, we see on the edge leading to it has a one on it. It's one out of 120. The same thing for WKD, the same thing for Dane. And we don't want to completely ever rely on all of the public directories out there. And so we insert in between a public directories shadow CA. And this acts as a sort of electrical resistor where there's a maximum of a 40 that can flow through it. In this case, we see that the trust amount is three out of 120. And that's not enough to authenticate the certificate. But what's interesting is that we have no evidence for other certificates either. So what do we do now? Are we done? Well, if we're sufficiently convinced, then we're done. If not, we need to get more evidence. Where can we get more evidence? Now, we can think about the additional overhead of talking to Daniel or finding people who know Daniel and his certificate. Whatever the case, once we're convinced, we have two options. We can create a public certification. This is what most people do when they go to a key signing party. They create a certification and they publish it on the key servers. Or we can create a private link, which is not exportable, either permanent or temporary. In this case, we're going to create a private link that is permanent. And we do sqpki link add, no password. No password required. It is the local trust route. It just works. Bam. So let's do sqpki list now, fully authenticated. And does it work? It does work. sq encrypt recipient email DKG. What if we decide we want to fully trust keys.openpqp.org? Also pretty easy. This is the general form for trusting any certificate as a CA. sqpki link add. And we say that we want this to be a CA for anything. I'll get to what that means in a minute. In this case, we're saying keys.openpqp.org should be fully trusted. So let's try another email address sqpki list. It's fully authenticated, going from the local trust route to download it from keys.openpqp.org to the email address that we entered. And there's more information that we can incorporate. And some of it we already do. We have usage information, for instance, tofu. If you download a certificate from a URL, like you're downloading, for instance, Fedora, or you're downloading tales, you can monitor the URL. We can use auto-crypt information. And we can even easily introduce CA's. So what are organizational CA's? You have an organization, say a company, or a group of activists, and they are willing to delegate sort of these authentication decisions to a trusted entity. Maybe it's the admin or the nerd. And if I want to talk to somebody inside of that organization, then I don't have to authenticate every individual. I just have to authenticate the CA. And now I've bootstrapped trust into the organization. And by the way, we have a CA, chriscoi.sh, pgp.org. So if you want to contact us, you can use it. How does it work? You first have to think about how much you want to trust it from 1 to 120. Do you want to scope the trust? Because it's our CA. We might trick you. But you can rely on us, probably, to say what are the correct certificates for people in our organization. So here we can partially trust the CA, sqpki.link.add. And we're limiting it to sequoia.pgp.org. So it won't be used for other certificates. And here you can see Justices' email address. I do sqpki.list. And it is fully authenticated using the CA. And by the way, if you want to run your own CA, there's tooling for that too. There's open pgpca. It's a great way to bring up a CA. It's easy to use. And it's written by hyco. And I encourage you to check it out. But there's more tooling out there where pki can help, where you need pki. Let's look at open SSH for a moment. In open SSH, the authentication keys are the identity keys. And if the authentication key is compromised, users have to update. Is that a problem? We have a great case study, just a few months ago, GitHub accidentally leaked their private key. The good news is it wasn't leaked for long. They immediately removed it, or seconds, minutes later, I don't know. They rotated the key. The bad news is every single GitHub user who used their RSA key had to update their known host file. Quite the pain in the butt. What if you could use OpenPGP's pki and OpenPGP's certificates, where you have a separate identity key, and you keep the identity key offline. And when something like that happens, you can, of course, make an announcement, then you rotate the subkey, and that's it. There are two former Sequoia developers who are currently working on that, Victor and David. The project is called SSH, OpenPGP off. And I encourage you to check that out as well. What about commits? I'm sure many of you have signed a commit. What does it mean? I don't know. It doesn't mean anything if you don't have a policy. So we have a tool called Sequoia Git. It defines, as a document that talks about how to define a signing policy for a project. You put it into a Tomo file. The Tomo file is directly embedded into your Git repository. It evolves with the project. And then you're able to check whether or not commits are authentic according to the project developers. So there's a whole bunch of tools that I think change the way that one could use OpenPGP and interact with the ecosystem. So what if you want to use Sequoia today? Of course, SQ, I presented. You can use it. It's packaged for Debian, it's packaged for Fedora, it's packaged for Arch, it's packaged for other distributions. But SQ is not integrated into a lot of existing tools. So do you want to live in sort of a split brain world where you have some of your tooling using some state and other tooling using other state? So we have the GPG Chameleon and it is an implementation of GPG's de facto interface. So you can just drop it in and use it and here you see GPG-version reports that it's the chameleon actually. And it uses both GPG state and SQs, which means that you immediately profit from SQ's PKI tooling. It automatically uses that when doing web of trust calculations. So you don't actually have to do any migration. And if you're using Thunderbird, we have the Octopus, which again is a drop-in equivalent API to the RMP interface that Thunderbird is currently using. And it includes web of trust support and GPG agent support. Now if you want to integrate OpenPGP, there's a standard. You can read the standard as long as it gets complicated. But recently, just two months ago, Hico, Paul, Ms. Uppedy, Victor and David published a book, OpenPGP for application developers. It's the book that should have existed 20 years ago. It didn't exist and now it exists. It's a few hundred pages talking about some of the details of OpenPGP as they relate to the needs of application developers. And I think that this is really the game changer. Who's been funding Sequoia? The project started in 2017. For six years, the PEP Foundation funded Sequoia. We received money from NLNet and currently we are being funded by the sovereign tech fund, at least until the end of the year. And post-2024. Well, that's an open question. Maybe somebody can help us with. Thanks for listening. I hope that I've convinced you that users have different needs. There are different users. They have different needs. And I don't think that there is one universal solution. There's not one implementation that is going to make everybody happy, necessarily. And if that implementation were to try to exist, the fact that it tries to be everything to everyone means that it's going to make some people unhappy. Sequoia has a different architecture. It has different paradigms. Maybe it's the right one for you. Maybe it's the right one for some non-users to convert. And I don't think that it's going to divert them to be open PHP users. I firmly believe that diversity in an ecosystem is a strength. I believe that we are better together. And I believe that winning is not dominance of a single implementation, but improving privacy and security for individuals. And as a small aside, by the way, implementing your own PKI, that's the new implement your own crypto library. Please don't do that. Thank you very much. Thank you very much. Are there any questions? A question there. Thank you. Um Okay. Thanks for your talk. I have a question. I currently use GPG agent as SSH agent. And is it possible with Sequoia too? Okay. I can't hear anything because the microphone or the speakers are pointed this direction. Can you say it again? I'm currently using GPG agent SSH agent. Can it be done with Sequoia too? Can you use Sequoia as a GPG agent? SSH agent. SSH agent. Yeah. Okay. Okay. It's okay. Hi Neil. Thanks for the talk. Lots of interesting points. Oh, sorry. I'll try to speak loud. Thanks for the talk. Thanks for the points raised. Thanks for the bows. I'm wondering a bit about compatibility and interop. Could you speak on that topic a bit? Because of, well, recent developments. Where do you see Sequoia in the future? Like, especially this year and going forward when it comes to interop with other implementations and newer versions of OpenPGP. I know this is a bit of a larger topic, but maybe you can share some of your thoughts with that. Okay. So I understood the question is what is the future compatibility with OpenPGP? And our intention is absolutely to implement whatever the ITF decides to standardize in the next revision of the OpenPGP protocol. And I believe what your question is sort of asking about is the LibrePGP thing? And I think that's a good question. I think that's a good question. I think that's a good question. And I believe what your question is sort of asking about is the LibrePGP thing? And I mean, that's a whole can of worms. And I think it's an extremely unfortunate situation. And my personal hope is that we're all going to implement the things that the standard bodies say is the standard because it improves interoperability. One of the arguments around LibrePGP, which is a different, which is the GNU PG format or the GNU PG and RMP format alternative to the ITF standard, is that they say we already shipped it. Well, they already shipped it. I think it absolutely makes sense to write down what it is that they shipped. But I hope that future developments are going to go in a direction where they also support the standard. Hi. Do you integrate with hardware backed private keys? So for example, Fido keys. Right. So there are two ways that we do integration. So the first one is if you're currently using GNU PG and you're using the GPG agent. And then you decide, okay, I want to try out Sequoia. And you're using the Chameleon GPG. Chameleon will automatically use GPG agent. That means that there is zero configuration required. You automatically get access to all of the things that you had access to before. So that's sort of the easy thing. The other half is what does it look like in terms of Sequoia sort of native support. And for this, we have a private key store, which has a device driver style architecture. And then there are different back ends implemented by that. Again, one of the back ends is the GPG agent back end. But Hico, for instance, did a lot of work on the smart card area. And so if you're using an open PGG smart card, then in the future you'll be able to use the private key store and it will be able to talk to your open PGG smart card. Likewise for PIV tokens and we expect to add additional things in the future. Are there any concerns or ongoing work with regards to post-quantum? Post-quantum is a good question. Right, of course. The whole ITF is very interested in addressing the question of how do we deal with the post-quantum threat. And there, as I mentioned, the ITF working group has submitted the document to the ITF for ratification. And it's currently in working group last call, or last call, I'm not entirely sure of the terminology. But we expect within the next couple of months that it will be ratified. And the working group has a nuke shatter. The nuke shatter has been accepted and the nuke shatter includes post-quantum work. The post-quantum work has more or less already been done and it was a collaboration between the BSI and Proton primarily. So the BSI a few years ago had a call and they asked, or the MTG, which is a company in Germany, applied to do the post-quantum work in the open PGG space. Proton joined in and there is an entire draft or there have been multiple versions of a draft. Everybody is more or less happy with the draft. It is much less controversial, one might say, not that the crypto refresh is terribly controversial. And this is the direction that we're moving in and I expect that it will also be ratified very quickly. The tricky part, of course, is the actual deployment in real life. It is not a very long time but it seems that we do still have a couple of years. Thank you very much. I think if there are further questions, your email was on the slide. Feel free to ask him, I'm assuming. It was a very enlightening talk. It was a challenging talk too. And as Belgians, we'd like to give you a token of our appreciation for your effort. Thank you very much. Have a nice day.
Version control post-Git
So, in the previous talk, we heard about how you can add lots of commands to, well, like to fix the shortcomings of an interesting or dated model of version control. In this talk, I want to convince you that you can actually... Oh, this isn't working? Yeah. I didn't do anything. So, I don't... Oh, yeah, right, right, right. So, as I was saying in the previous talk, we heard about how you can add commands to all sorts of things and improve on the state of the art in version control. In this talk, I want to convince you that we can actually get all, like most of the stuff for free if we do... If we start by doing some mathematics. So, my entire life so far has been about doing really far stretches between deep PRE or deep practice or between one domain and another. I've been working on DNA nanotechnology as a theoretical computer scientist for a while. Today, I'm working on a game theory and economics algorithms for electricity markets. I'm also working... Well, in general, I'm interested in distributed computing. So, PIRUL is a byproduct of that interest. So, in this talk, I'll first start with defining things. There'll be some mathematics involved and also some practical stuff. So, if you're not interested in mathematical parts, like bear with me, they won't be too long hopefully. All right. So, I'll first start with defining what version control is for me, then talk about our solution and what we've been working on, and then some cool implementation details or things that I find cool at least. And then, well, my talk title was a little bit of a bit of a bit of a bit of a bit of So, I'll try to come back to that and comment on that choice by asking the question again. Is this really post-get? Okay. So, version control. Probably many people here use it and know exactly what it is. Let's do it anyway. It's one or more cool out there. Is it a tree of documents concurrently? And one thing that's super important is the asynchronous nature of the thing. The co-authors can choose when they want to sync or merge. And like reminding these definitions sounds like it's a little bit trivial, but it actually matters a lot when you're deep down in mathematics and you're trying to understand what you're doing and why you're doing it. So, for example, how does version control differ from Google Docs, for example? It is made conflicts. So, that's a feature that's been often overlooked in current systems like Git, Mercurial, SVN, CVS, RCS before it. And finally, also, version control allows you to review a project's history. So, that's a very important feature as well. Many people consider, well, apparently not everybody, but many people consider version control a solved problem. Git is here to stay. It's sold version control 15 years ago, and that's the end of the story. Unfortunately, well, I personally don't think this is true. Some symptoms indicating this might not be true are, for example, that's R-tools, no matter how good we think they are. They still aren't used by non-coders despite their maturity. We've been working on this for 30 years, and the thing we're most proud of in this industry cannot even be used by outsiders. So, nowadays we have, like, Silverware designed by NASA, and everybody at NASA is proud of that, and everybody buying Silverware designed by NASA is proud of that. But R-tools for version control, they aren't used by outsiders or just marginally. R-tools are distributed. Yes, most of the time we use them with a global central server. So I've seen a poster in this conference saying, not all paths lead to Chrome, but apparently many paths lead to GitHub. R-tools also require strong work discipline in planning. You need to plan ahead of time, to plan your branches, to follow really strict workflows, to rebase versus merge. We even had a slide on that in previous talk, like how we merge shop or rebase shop. In this talk, I'll try to convince you that this doesn't matter at all. And all this results in a significant waste of human work time at a global scale. Improvements have been proposed, for example, DAX, but they don't really scale and nobody uses them. So, is there, can we get a quick fix? Can we get that next Git command that will fix everything? Well, unfortunately, I don't think so. First of all, because abstractions leak terribly in Git. And the reason isn't the UI, isn't the bad naming of some commands or arguments. The reason is that if we consider miracle trees and DAX as the core mechanism, they can't really be hidden from the user. Because if that's the core thing you want, and if all the properties you want come from that, there's absolutely no reason to hide it from the user. And there's no hope you can even succeed in doing that. Also, similarly, if strict ordering of snapshots is the main feature, the most used Git commands like rebase, we heard about rarararar in previous talk, cherpic, they're fixes around that best feature we have. So why do you need to fix your main feature? That's strange. Anyway, some more same terms. There is an inflation of commands and options in Git. Like I'll try to show you an example here. This is God and comicals, it's the point that someone made a Git's man page generator that actually looks credible. If you reload the page here, you get a different one every time. Give environment grabs non-reset downstream environments using past local garbage collectors while overriding fitting shells survey the given environments. Boo doesn't use this command here. Anyway, there is also an inflation of UIs. Even Big Tech is now investing in Git in mercurial UIs. I won't cite names here. But there's also inflation of forges. How many forges have started last year alone? I know a bunch of forges because they sometimes contact me to help them do something with people. My claim here is that we all consider text editing a solved problem. There's no new text editor popping up every now and then and convincing the VC that this is a really cool idea. Window managers, same thing. And forges, you keep seeing forges popping up every now and then. I'm not saying this is bad. This is actually really good. It's fantastic that the ecosystem is thriving and there's a lot of diversity. But my claim is that maybe this inflation comes from some more fundamental thing that we don't understand. Now, on to our demands. First, we demand associative merges. It may not matter much to use, but I'll show you in the next slide that you actually want associative merges. Associative merges means that when you have two changes, A and B, and you merge them together, it should do the same as merging A followed by merging B. And the reason you want that is because you want to be able to review your patches or your comments one by one and then merge them and trust that the merge does exactly what you think it does, except in get it doesn't. Associative merges, if A and B can be produced independently, they all should not matter. That's what I mean by are we a rebase shop or a merge shop. It doesn't matter. We don't want to ask this question. We want to get back to work. Branches, we do want branches. Everybody loves branches, but not too many branches. Branches are good until they aren't. I'll tell you more on that later. We also want that's something I personally really want. I want low algorithmic complexity. And also ideally, fast implementations was in my way of seeing things that come second. But I also give you an example of a very fast implementation of something with Darm. All right, associative merges, the first of our demands. This is exactly what I described, but this is a graphic view of it. So you have two co-authors, Alice and Bob. Alice produces one, commits A, and Bob produces two, commits B and C. And then in the first scenario, Alice merges Bob's first, commits, and then Bob's second, commits. The second scenario, she merges both commits at once. And while the commit identifiers, nobody would expect that they're the same. They will necessarily be different. But the contents of the, if there's no conflict at all, the contents should absolutely be the same. And this is actually false. So this is an example, a counter example of where Giz is not associated, and this might be a problem. This is actually terrifying to me. We start on the left here with a simplified two lines, A and B. Alice follows that path. She adds the first commits with a G at the very beginning of the file. And then she goes on and adds another commit with another copy of A and B before that G, right? And concurrently to that, Bob inserts an X between the initial A and B. And when you can try that, if you have a laptop here, you can try to simulate that in Git today. And this is actually not a bug, it's a feature of Git. And what happens if you do that is that Bob's new lines get merged into Alice's new lines. And if I were working in a high security project, this would absolutely terrify me. This means that Git can randomly shuffle your lines around and do it silently. Without telling you that there's no conflict, nothing. It just works. And yet it doesn't really work. It doesn't do what you think it does. So we don't want that. We want to fix the problem. We also want commutative mergers. So that's a more controversial one. We want the property that if Alice and Bob work together, well, Alice can pull Bob's changes and Bob can pull Alice's changes. Without having to worry too much about the resulting hashes or the contents of the file. If the patches are independent, you should be able to apply them in any order without changing the results. So we do that. Git and SBN are never commutative. So why would we want this? Actually there's very good reasons to want this. For example, you might want to unapply and unapply an old change, an old patch that you just made a few patches ago that was wrong. And you want to undo it without having to change your entire identities. Of course, we also want state identities and I'll come back to that later. We want cherry picking. We want to be able to just take that one patch from a different branch, maybe a bug fix. I want to pull this into our branch without having to rebase everything and change every commit's identity. And yet keep strong, unfortable state identifiers. And we want partial clones. So partial clones means that you want to just pull the patches related to subprojects and possibly also in the other direction merge repos transparently. Scott was talking before about mono repos and how Microsoft devoted probably millions of man hours to have their mono repos work. But actually if you first try to model your things properly and try to understand commutativity, actually you don't need all that. You can get mono repos for free. And so that brings us to one of the crucial slides in this presentation about states versus changes. So the way we see things in order to think about version control. So I understand I'm not giving you many new commands or cool line hacks, but before getting to that we had to think about what it means to do version control. So states versus changes. There's actually two fundamentally different ways to model what version control is. One is by seeing it as a series of versions, a series of snapshots of your repos, which is what Git does, and Mercurial, SVN, CVS do that. And only compute changes as a byproduct of these. The question we can ask is what if we did the opposite? What is if instead of considering that working in a project means creating a new version, we consider that working in a project means changing it, creating a change. And another question we can ask is what if we did both at the same time? And that's what we do. All right. So a little bit of bibliography first. And this is getting a little bit mathematically. So bear with me if you don't understand everything, I'll get back to cool implementation stuff later. So a change-based idea, and this is from the 80s, it's called operational transforms. It is what Google Docs, for example, uses, or Docs uses that as well. So in operational transforms, for example, we start with a file with only three lines, T1 on the left inserts, so this is Alice, let's say, inserts X at the beginning of the file, and T2 deletes the character C. So to get into two divergent states, and then what operational transforms mean is that now that we've inserted an X at the beginning of the file, this changes, we have to rewrite the other concurrent changes. So for example, here, instead of saying delete the second character, it was a C, now you have to say delete the third character, it was a C. And so that's how you can merge things. Docs does this and uses this to detect conflicts. One issue is that if you only have insertions and deletions, it's okay, but there are still performance problems. But as soon as you start handling more than insertions and deletions, you get into a quadratic explosion of cases because you have to handle all pairs of types of operations. And this is according to Google engineers who worked in the Google Docs project, this is an absolute nightmare to implement. And I've never heard anyone implement an operational transform, we said, yeah, they're cool, they're really easy. So a hybrid in more recent state approach is CRDTs. So how many here know about CRDTs? Okay, okay, reasonable number. So the general principle is very simple. The idea is to design a structure where all operations have the properties we want. Instead of having your structure and say, okay, now how do we merge changes? Instead of doing that, you take the problem in the opposite direction saying, how can we design a data structure from scratch so that all operations on that data structure have the right properties, meaning they are commutative, associative, they have a neutral element and all that algebraic things. Natural examples of CRDTs are like, for example, a very simple one, increments only counters, counters where you can only add one to the counter. This is a very easy and natural example of a CRDT because if Alice and Bob both increment the counter, you just have to add two to the result to merge their change. Insert only sets, for the exact same reason, insert only sets, there are natural CRDTs. And if you want to do things like deletions or more subtle data structure, more subtle, more interesting operations, you have to use more trickier techniques like Tom Stones and Landport clocks. I won't get into these details and if you're interested, we can talk later. A useless example of a CRDT that's often invoked is a full git repository. Well this is just an append only set of commits. So yeah, sure, it's a CRDT. It's commutative and all that. But also, it's not super useful to see a git repository as a CRDT. Just means you can clone this. Saying this is a CRDT means basically it's okay, you can clone a git repository and then keep pulling into it. But it doesn't mean it handles merges properly. What we want is the heads of the git repository to be a CRDT. All right, so how do we do that? Well, merges, when everything goes right, we're not really interested. So we'll start by modeling conflicts because they're the hardest case. And if we cannot model conflicts properly, then there's little hope of doing anything interesting. So conflicts are where we need a good tool the most. The exact definition depends on the tool. DAX for example has really cool and exotic definitions of conflicts. For example, if Alice and Bob are writing to the same file at the same place, I'd say most tools consider that a conflict, but not all. When Alice renames a file from F to G, while Bob renames it to H, I'd say most tools also consider that a conflict. Some will just pick a name randomly. Alice renames a function F while Bob adds a call to F. So that's a trickier one. And very few tools that I know of can handle that conflict properly. If a conflict can do it, my tool cannot do it. And Git can certainly not do it. So how do we do that? How do we solve all these problems at once and get all the nice properties for free? We do that by using category theory. So this is a mathematical framework that gives you really nice tools to model, to work in abstractions of things in general. So our modeling in category theory of this problem is if you have any two patches, F and G, what we want is a unique state P, which is the sort of the minimal merge of F and G, such that for anything that Alice and Bob could do after F and G to reach a common state, that common state, or common state Q, that common state can also be reached from P. So instead of doing some work to get to common points, you can always first get to a common point and then do the work. So if P exists, category theorist call it the push out of F and G. So the reason we're interested in that is because category theory has a lot of tools to start from this simple modeling and give us lots of cool stuff like free data structures and while do our job for us basically. So in this case, category theorist would notice that the push out of two patches doesn't always exist and this is absolutely equivalent saying that sometimes there are conflicts. And now the question becomes how to generalize representation of states, states are like X, Y and Z, so that all pairs of changes F and G have a push out. The solution is to just generalize states to directed graphs instead of just sequences of bytes where vertices of these graphs are bytes. I'll give you an example in the next slide. Verses are bytes and edges represent the union of all known order between bytes. So that sounds a little far fetch, but actually it's very clear in the example. So the way we model these things in Pichl is as fellow. So the first example of a simple patch is how do we add some bytes to our data structures. So we have a file. So well first in our graph, all vertices are labeled by a change number. Here for example C0 is change number zero and an interval such as zero and presenting bytes inside that change, inside that patch. And edges or our graph are labeled by the change that introduced them. So for example, here's a starting from an initial file C0, zero, n. How do we add some bytes in the middle? Well, we'll first split the initial file and then add some bytes, add a new vertex inside the middle of that vertex and then reconnect everything so that we can get the order of rights. And now that the bytes introduced by C0 between zero and i come before the bytes in C1 between zero and m and these in turn come before the bytes introduced by C0 between i and n. So this is how we do insertions. The implementation now, the rest is a question of, is a matter of implementation like how do we store these giants graphs efficiently on disk and so on. Deleting works more or less in the same way except we now introduce a new thing which is the edge label. So deleting a vertex in our system means turning an edge, like turning an edge from a continuous line into a dashed line. And so we do more or less the same thing. In this example we're deleting bytes J2i from C0 and zero to K from C1. So some bytes that were introduced by previous patches. And so this is what we guess in the end. So we get a bunch of vertices and dashed lines to indicate which bytes should be deleted and which are still alive. All right. And that's actually the good news is that we don't need more than that to build an entire version control system. So this is rebuilding foundations first, right? This is actually really cool because it's a very minimalistic system and we like that because it makes everything else easy. So two kinds of changes. One is adding a vertex to our graph in a context, meaning parents and children of that new vertex, and change an edge's label. And this is all we need actually. We get, like, from these things we get a ton of cool properties for free. First we get free conflict handling because we were just, like, there's no notion of conflict in this. We're just adding vertices, changing edges' labels. And there's no, like, the graph naturally models conflicts. So conflicts are possible. They're properly modeled inside the graph and they can be talked about and manipulated without any specific treatments. So our definition of conflict here is we could first call the live vertices whose incoming edges are all alive, meaning they're all full lines. Live vertices or vertices whose incoming edges are all dead, and vertices in the middle that have both alive and dead edges, we call them zombies. We say now that the graph has no conflicts if and only if it has no zombie, and all its live vertices are totally ordered, meaning we can actually compute a full ordering of all the bytes. We know exactly what order the bytes come in the file, and if we have that, then we can output the file to the user, and it actually makes sense. So some notes on this system. So this gives a system where we have changes that are not exactly or diffs that are not exactly Unix diffs. They are Unix diffs plus tons of metadata to make it work. And now they are partially ordered by their dependencies on other changes. This means that you cannot possibly work inside a file that has not been introduced in the repository yet, or you cannot edit a paragraph that doesn't exist yet. So not all changes are commutative, but changes that could be made independently are commutative. Chera picking, now this is the same as applying a patch. We only have two commands in the system, apply a patch, unapply a patch, and it does everything. So chair picking is just apply a patch. There's no need for git rararar, because conflicts are solved by changes. You can chair pick changes. So you don't have to do any special hack or Ruby Goldberg machine to remember the conflict resolution or anything. Like the conflict resolution is just a patch, and you can just send that to others, push it, pull it, and that's it. Partial clones, monorepo sub modules, so they are easy. As long as white patches are disallowed. So if you have a patch that just does some formatting across your giant monorepo, then you will have a problem, because this patch will probably have tons of dependencies, and you will end up pulling lots of dependencies into your repo. But if you're careful enough to fragment or to cut this patch into smaller pieces, then everything becomes easy, because you can just pull the patches that are relative to a tiny part of your repository, or your giant monorepo, and it just works because of commutativity. Because all the patches you produce locally by working, they do, they necessarily commute with all the patches that were produced by your co-authors and other parts of the monorepo. So after your day of work, when you push your patches to the server, others do the same, but it doesn't matter, because it gives the same results in the end. For large files, so I've showed you a graph in the previous slides that didn't talk at all about the contents. So the contents is something super important, obviously, but it's only handled during Diff, and it is not, like you don't apply a patch, the patch themselves, they are not applied, they don't use the contents of the vertices in order to be applied. You can apply a patch by just saying, well, I just added this file, it has like one terabyte of data, and that's it, and you can find the data in some change somewhere. And so a nice consequence of that is that for large files, you can apply a patch without knowing what's in the patch. You can fetch the rest later. And so if you're running a video game shop, for example, and you have artists pushing assets, large binary assets all day, then at the end of the day, when you want to just pull everything, you don't have to pull all the intermediate versions. You pull just the operational part of these versions, saying, well, I added one gigabyte here, and now I replace that gigabyte with another one, and then yet with another one, and maybe you have 10 versions of your binary asset during the day. But at the end of the day, the only thing you know is that, yeah, you had 10 versions, there's only one still alive, just pull that one. So there's no special hack or LFS needed, you're just using patches. All right. Now onto some implementation things. So there's a lot of things to say about the implementation. The project is entirely written in Rust, or mostly written in Rust, I'd say. So I won't cover all the details about how it works and our implementation, because there would be, that would take an entire day of talks, and I won't do that. But I'll just give you some cool things that I like about our implementation. So the first challenge is that we have really large graphs on disk, and we obviously don't want to load them up entirely in memory in order to edit them. We want to be able to just work directly with the on disk data structure. So we want to store edges in a key value store, because that's an easy way to do that kind of stuff. We want transactions, because we're actually inventing a new format, data format, so we absolutely want assets, properties, we want like full transactionality, we want passive crash safety, so all these things. And we want branches. So we want to be able to take a key value store and just fork it without copying a single byte, because it would take too long. It would take a time linear in the size of history, and we don't want that. So there's no key value store, there was no key value store when we started that would do that, especially the branching feature. So we had to write, I had to write one. It's called Sanakiria, which means dictionary and finish. It's a non-disk transactional key value store, but it's actually not just a key value store, it's something really generic. It's an ACID block allocator in a file, really. And that block allocator uses a key value store, uses B trees to allocate memory, but also the B trees themselves use the allocator to do their job. So that's why the minimal data structure we can have is B trees, but also you can write all sorts of data structures with Sanakiria just by using the allocator. We have crash safety using referential transparency in copy and write. The initial goal was actually done successfully, completed successfully, because the tables are forkable in big O of log n, so it's probably completely useless for a general key value store. Then in our case, branches were really needed, so we had to do it. So forkable in big O of log n, a logarithmic time in the total size of the total number of keys. As written in Rust, while it started in Rust back in 2015 or something, Rust was still a bit younger than today. And the cool bit about that is that it allows you to access generic Rust types directly as pointers into, like, storing the file. It is way too generic, way too many things, meaning that the APIs is horrible to use, but there's a good consequence of the generality is that it's even generic in the underlying storage layer. For example, we can use it on memory map file, which we do all the time in Pichol, but we can also do it in Z standard compressed files. I've also used it not super successfully, but the prototype worked on Cloudflare key values, Cloudflare KV, so storing, like, building a key value store on top of another key value store. The cool thing about that is that, you can also use it to build ropes, for example, or Patricia trees or things like that, or vector search indexes. So I thought that implementing this in Cloudflare KV was interesting, but it's actually too slow to be really useful. All right. A very unexpected consequence of that is that Senacilia is the fastest key value store we've tested and actually beats LMDB, which is the fastest C key value store by a pretty wide margin. In these graphs, I've included, actually, the coolest project ever in this space, which is not Senacilia, it's Sled. Sled is a really fantastic key value store that allows you to have multiple concurrent writers using really, really cool data, like modern database technology and so on. Senacilia is much more modest in its scope, but it's also way faster. So if we remove Sled, which is more on the experimental side, we can see that we beat LMDB by 50% or something. I've included the Rust standard B trees in these graphs. They're the theoretical limit. You cannot go faster than that because they don't even store anything on disk. We have to store stuff on disk, so we will be slower. Okay. So using Senacilia, I can build modular databases. Like I said, this is transactional block allocator with reference counting included. I've built different data structures in this. One cool thing you cannot do with others that can be done in Senacilia are composite types. So for example, in Pihu, branches are B trees of strings to other B trees that store our graphs. So you can nest data structures like that, which is cool. I have a prototype text editor with forkable files. So you can click a button and have a free copy of your file, sharing all its common bytes with the previous one. And its type is something like that, a B tree of string to rope in the Pihu graph. So if you're interested in data structures and performance challenges, join us because we're doing cool stuff in the space. All right. So back to my claim that this is both GIT. Things we get for free. I've said a few of these so far, but we can get, one thing I haven't said is we can get super fast Pihu credits, which is the Pihu equivalence of GIT blame because we don't want to blame our co-authors. We'd rather credit them. So the info is readily available in the graph. So you get that for free. So Scott was speaking in his talk about how you can use really cool hacks to speed up GIT's blame. But actually, if you don't use GIT at all, like if you model your stuff properly, you don't even need to speed things up because the information is readily available in the graph. And so it's fast by default. You can have your bug fixes in your main branch. You don't have to plan feature branches and bug fix branches in advance. You can push them to production by having them on your main branch. You can work on several features in the same branch and then decide what belongs to a feature after the fact. So no more rigid workflows. No more like way less meetings, hopefully. You can get sub modules. So sub modules don't have to suck. Don't let anyone tell you otherwise. Sub modules don't have to suck. You can get them for free using patch commutativity. And the reason is changes in unrelated projects commute because they can be produced independently because they are related to unrelated projects. So they are commutative. And so you can get sub modules for free. Signing and identity. So this is something that GIT introduced recently as well. But after we did, your identity is your public key. All patches are signed by default. And the identity changes are easy and possible because we like to welcome everyone. And people sometimes change their identity. And we don't want to let their personal life interfere with their work. You get free cherry picking. So I've said that a couple of times already. But you just apply the patch and no need to change its hash, its identity. And you can get almost free scalability to very large mono repos. So I've said that. We Goldberg machine needed. Just one cool bit of implementation. We have commutative state identifiers. So I've said earlier that we want to hash our patches to be able to make them inforgible. But we also don't want it to be just like a soup of patches and like, you know, hidden states where you cannot get your, you can get like where you don't understand anything anymore. So we do want state identifiers. It's like commit hashes. And but the thing we, the thing that's hard for us is because we want, because we, because we, the patches are commutative, we want a then B to give the same state as identifiers as B then A. And at the same time, we want these state identifiers to be fast to compute. So we want, we want, we don't want to have to like, for example, when naive, naive, naive, naive version would be like sorts all your hashes by, by hash and then hashed all of that. So that would take a time linear in the size of history, but there is a cool trick which is to use discrete log and elliptic curves. So this is where you turn each patch identity H into an integer and then you identify, you identify your states using each the part, each the product of all your hashes. And this is something you can compute very easily from a state and the next patch. So this takes a constant time to confuse and it's commutative for free. So this is, this is, this is a trick that I like. Okay. All right. Now, what's future developments in that, in that space? So we want, we're working towards a hybrid states patch system. So in, in GITS, I said that commits our states, not changes. There's like the blog posts all the time reminding us of that fact, even though patches can be applied and recomputed after the fact. Darks only has changes and recomputes states as needed. So it's completely, completely opposite approach. And Pihla's both, as I've tried to convince you. It has a data structure modeling the current states, but that data structure was not actually designed, it was found, it was, it was calculated from the nature of the patches we wanted to have on text files. And this is therefore completely transparent. So this is not a leaky abstraction. This is something that was found and calculated from the patches themselves and not just a cache that, that leaks or is like sometimes becomes irrelevant to the patches that were applied to it. All right. Ongoing projects in that direction towards a hybrid state patch system where we have tags currently, but there are a bit, there are bits of slow and, and bloated. So we want to, we're working towards a lightweight tags. And this is a feature that will add super fast history browsing in, in Pihla, while retaining all the good properties of patches. Currently tags are implemented as a semi-quill data database using compressed files as a back end. Another thing we're, we've, we've been discussing on our, Zulip is patch groups or topics or things like that to group patches together and to be able to say, well, I have just one branch. I don't want to work with branches. So I only have one branch. Maybe this is bad practice, good practice. I don't know what, I don't, I don't want to tell people what their good practices should be. Like it should be what makes them fast and efficient. That's all. So patch groups would, would allow people to group patches together to have all their patches in the, in the main branch and then push just patches related to one specific feature. So that would allow people to like be a bit more organized, for example. Someone proposed queues recently to avoid half-merged states because when, when you're merging states, patches one by one, sometimes you can get in a state that was never tested by anyone. So that's, that's not great. And so we want to add queues to get the best of both words actually to say, well, actually I've done a merge here. I've not, I've just, I've not just applied the patch. I've done, I've applied the patch together with other patches. So that's the kind of stuff we're working on. But this is not, not super, not super hard. And if you're interested in contributing, we're really welcoming. Well, we, we hope we're welcoming enough. All right. So if you want to help us, this is currently a large project and a small team. There's lots of satellite projects. So we've, we've built our own database system in order to build this. We've built our own SSH implementation in order to build this. But fortunately proper mathematics can make that work. We have way less things to implement that, that any other version control system. It's bootstrapped, meaning it's used, we've been using it for itself since 2017, which wasn't without its problems in the beginning, because some patches could only be applied if you already had the patch. So there was a interesting, interesting challenges. There's a lot of effort needed in maintaining the documentation, accessibility tutorials, UI, bike shedding, we're really demanding of that. We have good first bugs tags on our repository. Well, our repository hopefully isn't hosted on GitHub because we can't, they don't support Piholi yet. So we had to build our own forge, unfortunately. And so if you want to come help us, then yeah, good first bugs are the way to, to get started. And come say hi on our Zulip. It's the URL is here. All right. So open source version control based on algorithms and PRMs and also cool, low level stuff to optimize databases, scalable to mono repos and large files for free, potentially usable by non-coders. I've told it to absolute non-coders. I've discussed with some people in parliament, in the French parliament actually. Artists can use it without having to learn what's a rebase, without having to know whether they're in a merge shop or a rebase shop. Lawyers can use it to version their, their documents and what about others. Sonic Pi composers and music musicians, legal builders and whatnot. We have a repository hosting service available. That's it. Thank you. Okay. If you're living live quietly, we have time for like a couple of questions. Raise your hand. Okay. Quiet, please. Hi. Thank you for your talk. I had a quick question regarding diff algorithms. I think the default diff algorithm in GitHub is Meyer. There's a bunch of other ones you can select. Something I can't hear the question. Can, please. Sorry, I'll speak a bit more. My question was regarding the diff algorithms. I think in Git, the default algorithm is Meyer. There's a few other ones you can select, I believe. But something that's a bit too bad is that the diff algorithm is generic. It's only diffing text. It doesn't know what it is actually diffing. If it's a JS file or a Python file, it doesn't really care. And so, in a way, I think it could be interesting to have plug-and-play diff algorithms to have more semantic diffing based on what interpreting what you're diffing. Would that be something you're thinking about? Yeah. So the question is, can you swap? Can you replace the diff algorithm with something else? So, yes, actually you can. But unlike in Git, Git doesn't do diffs. You can do diff after the fact and change your diff algorithm after the fact. Obviously here, this won't be possible because the diff is so core to what we do. So you can definitely change the diff algorithm, but you have to do it while you're working in order to create your passion. You cannot do it after the fact. So that's a trade-off that we've made. Hi. It looks like a really interesting tool. I was going to ask how easy do you think it would be to automatically migrate from, say, an existing Git repo into a repo kind of based around this system? Well, you tell me. We have a Git importer. One pain point currently with the importer is that there's no exporter because importing and then exporting would be doing a round trip. Because diffs are ambiguous in Git, doing a round trip would create artificial conflicts. So we haven't implemented that. There are interesting challenges towards that goal. So one thing for perfect interpretability would be to have an importer and an exporter that work in a transparent way. Currently, there's only a Git importer. So if you want to convert your project into Pihl, you can use our Git importer. You cannot just work on the side using Pihl and then collaborate with the rest of your team using Git. So we'd love to have that. But there are theoretical challenges towards that goal. But yeah, how is he? I don't know. Try it and tell us if it was too hard. Hey. Thanks for the talk. I'm over here. You said that DAX can recognize it as a conflict when Alice renames a function f to g and Bob adds a call to f concurrently. How does that work? This was a little bit of exaggeration in what I said. What DAX can do is that they don't do just insertions and deletions. What they do is they have a command called DAX replace where you can replace an identifier with another one. And so they make it commutative with other operations so that they are able to detect a conflict in some cases. But I wouldn't rely on it to check for the semantics of my repository. This is just a tool. This isn't meant to solve all the problems of solving conflicts automatically or conflicts are part of a normal working process. As some say, seeing is believing. You're working on something that can be called, can be similar to a group where. So we know that someone like Douglas Engelbart provided a nice demo and showed that people actually can use that. So you're saying potentially non coders can use that. But can you or have you some demos or something that can show this actually in action and use by the non coders or do you have plans to show something like that? I don't really have a demo of non coders working something. So the non coders, the specific non coders I was thinking about when I wrote that on the slides are people doing like contracts in my company. So we're using plain text contracts for our customers to make sure that we're on the same page and so we're using version control between the sales and implementation to make sure we're not selling stuff that doesn't exist, for example. So yeah, that's that kind of stuff. But it's still very limited and we don't yet have demos, like entire demos of people working like a part of the apartment were operating entirely in Pichu because no part of the apartment has, so they've tried, they've started to look at it. But no, there's no demo of that. But yeah, we very much love to start collaborating with non coders to make it welcoming and useful and fun for them. So you think you can do that actually? Yeah, I'm pretty sure I can. I can explain the entire Pichu in just a few minutes, like the entire UI in just our CLI in just a few minutes to people who've never coded before. So yeah, apply a patch, unapply a patch, that's all you do. I thought these, I was extremely excited to see this talk. I started off very skeptical and I completely changed my opinion and I just really appreciate kind of doing that. It was a really effective talk and I'm very glad about that. I wanted to say regarding operational transforms versus the directed graph model that you have, I noticed that you had, there are boxes that you've drawn for operational transforms that seemed similar when you drew the same boxes to describe your category theory, sorry, your directed graph process. The question I wanted to ask is, could you view the vertices of the operational transform block as every vertex is the entire state of the document? I want to ask you, is it correct to say that your directed graph model instead of having each vertex be the entire state of the document, is instead individual byte ranges? So I guess would you say that this analogy to operational transforms is kind of correct in that each vertex is not the entire state and part of the reason why I would say Pjul and, sorry, I mispronounced that, Pjul and the rest of your software I think works, I think it's because it has less of a dependency on the entire state of the document. So I don't know if I explained it right, but I felt like I had a realization as all and I wanted to know whether you had, I guess, thoughts about how this directed graph model is kind of inherently easier or harder than operation. To me it makes sense why it's stronger and I just wanted to share that realization because there's less dependency on state. I wanted to know if you thought that that was a reasonable description. Yeah, this is reasonable. One comment I would have on that is the main difference with operational transforms is that you don't have, if you want to, for example, merge n batches in a sequence, in Pjul you don't have to look at pairs of batches. You can just apply one batch and then the next one and they don't have to see each other in order to, they don't modify each other when they're being applied, which is not the case in operational transforms. So that's the main difference. But I agree that it's confusing that the diagrams look similar. What is the killer feature of Pjul that will make it succeed where darks failed? Performance. Okay, speaking of performance, how much of that is attributable to the change of data structures and algorithms and how much is attributable to rewriting it in Rust or in Haskell? Most of it is algorithms. Almost 100% of it is algorithms. Sure, writing in Rust makes it faster than writing it in a garbage collected language, but this is just marginal. The main thing about Rust is that it allows you to write really low level stuff that allows you to build different kinds of algorithms. For example, Senakiria, I cannot see how this could be written in Haskell or OCaml, but it would be really, really painful. Rust makes it much easier because this is low level stuff and this is where you can get most optimizations. But yeah, performance, Pjul is, it was doubly exponentially faster than darks for mergers. So back when, like two years ago when darks had the exponential merge problem, this has been fixed since then. And we're only exponentially faster than them there. So for example, if you want to merge a patch, if you have a really large file with a really long history in Pjul and you want to apply a patch in the middle of that file, this will take a time logarithmic in the size of the file. In Git, you have to look at the entire file, do a deep pre between that file and your patch and then apply that. So it's linear in the size of the file. So we're still exponentially faster than that. Of course it doesn't matter because most files are not so crazily large that we can see the difference in the algorithmic complexity between Git and Pjul. But on really, really large files, this would matter. So what that means is, yeah, killer feature, you can scale to mono repos where darks, I've seen it fail on a paper. I've seen it take, I've seen, merged this take a really, really long time, several minutes on the mathematics paper with 10 pages of latex. And here we're like, everything works in milliseconds or less. And because of the fear, like the scalability to mono repos in large files is not done using extra hacks or extra layers or extra LFSs or extra sub modules or whatnot. They're just like a byproduct of our design. They're just, yeah, these are the killer features. I don't know if I answered the question. I get the name of Sanakya for dictionary, but why the name Pjul? Oh, this is the name of a South American bird that has the property that it lays nest. It builds a nest cooperatively in between a group of birds. And then all the females of the group lay their eggs in the same nest and they take turns to keep them warm. And so, yeah, it's just a metaphor. Okay, our time is up. And so let's thank again. Thank you.
Building a Community-Owned Data Confidence Fabric With Distributed Ledgers and Smart Contracts
Okay, so the next topic is building community-owned data confidence fabric with distributed ledgers and smart contracts. So let's give them a good welcome. Hi, everyone. Thanks for staying. So today we're going to be talking about data confidence fabrics. Here's the agenda for the day. So briefly, first of all, I refer to my colleague Sean here as an academic who works for University College Cork. We've been collaborating on this project, so I'm going to start by quickly going over dense open source support and community contributions. I'm going to go over Linus Foundation Edge, how it started, and I'm going to be focusing on Project Alvarium, which is part of Al of Edge. Then I'm going to hand over to Sean. He will go over a few use cases of Project Alvarium. And at the end, we have a short demo that we're going to show you. So the team I work in, we do research mostly in European research projects. Most of the projects, all of them are funded by the European Commission, multiple programs, mostly Horizon Europe program. We collaborate with a lot of the consortium's basically a lot of partners across Europe, across all these projects we've collaborated with over 100 partners. The domains that we work on are mostly edge computing and cloud, storage, the zeroed ledgers, and quite recently we were working on knowledge graphs. And a common theme across projects is always energy efficiency and sustainability. Most of the projects here deal with orchestrating deployment across cloud and edge resource constraint devices. I'm currently working in the Clever project. We're trying to build knowledge graphs that represent Kubernetes clusters and use these knowledge graphs, try out graph algorithms if we see that it enhance the schedulers in Kubernetes. And yeah, the themes are mostly, as I just said, this is just quickly about the projects. And one thing is that most of the work that we do on these projects, most of the code is open sourced. Now Dell and open source, Dell has been working with the Linux Foundation for over the past decade. It's a founding member of open program with infrastructure up in HBC, as well as member of the CNCF up in SSF and you can make for you Sonic and the Yachto projects. Currently Dell is involved in 43 open source projects, 10 of which are Linux Foundation projects. Now in preparation for this, I went to asking colleagues about the individual contributors across the company, about their contributions, and I learned that currently there's a process in place to inject open source contributions into my legal, which is our legal compliance systems, ServiceNow and JIRA, to basically encourage contributions from the company so that it be logged. There's also other activities such as events organized within Dell about open source. One of the colleagues contributed a first time contributing page to just help other basically contributors to make their first contribution. And what I learned as well is that the most effective way as of now is across the company is word of mouth. So it's driven by interest mostly. So Dell contributions to the Linux kernel as originally a hardware manufacturer, basically when Dell needs to develop drivers to make sure that Ubuntu runs properly on a Dell system. So these drivers are open sourced. The process goes as follows. Basically these drivers are basically developed, contributed, and they're pushed back to the Linux kernel so that they work across all the distributions. After their first implementation, any patches to make sure they all work properly are also pushed back to the kernel so that it would work with distributions like mostly working first on Ubuntu and then as they're pushed back to the kernel so that they work with Fedora or Debian or Arch or any other distro. Now the story of ILOF Edge, it started in 2017 when Dell contributed 125,000 lines of code to Linux Foundation to create AdJax Foundry due to its interest in Edge computing, which is basically an open source platform for basically processing across the Edge and the Cloud. This platform basically enables Edge data collection from sensors, communication between enterprise and cloud and on-prem data centers, and for processing at the Edge. It has runtimes, messaging substrates and so on. And it's basically an umbrella organization for other projects. One project of ILOF Edge is Project Alvarium, which we're going to discuss now, AdJax Foundry, which what started the whole thing and other projects. Now to go on to Alvarium, first the problem statement. So basically the motivation for Alvarium came from the realization that a lot of the projects that we're involved in, we deal with data sources that are dispersed, the sensors come from different manufacturers, sometimes sensors are owned by different organizations, and usually the data is processed as it jumps through the network. Before it reaches AWS S3, let's say it can be processed at a local server and so on. So we needed a way to measure the trust of how much we can trust those data points that are coming from these several data sources. So this was the origin of the idea of a data confidence fabric. So the idea is to define what trust is basically. So as I said, more implications are extensively distributed. Data traverses across the network. The idea is to create metadata that attests the verifiability of the data at the origin. That's the first step. The second thing is to create metadata describing how the data was processed as it jumps throughout the network. At any point the data is touched, we need to inject metadata at what happened here. And those are the insertion points. And we need a way to measure or quantify that trust into a floating point. So we need to put it over 10 and over a certain confidence score so that this confidence score can be used by the end user to decide how to actuate or do anything with that data. So to summarize what I just said, there are two things here. One is a policy defining the measure of trust. And the other is an implementation that interprets that policy and calculates the trust score. And the insertion points that I just discussed, so in this figure here we see that first we have the route of trust, which is the sensor that signs the data when it catches it. The gateway, the data goes from all these sensors to a gateway where authentication or authorization happens. At the gateway, the Alvarium SDK would be used to inject basically the data capture environment where the data was captured. Next the data would travel on to an edge server or a distributed storage. Here like for secure immutable, scalable edge persistence such as IPFS would be there. And then the fifth injection point would be a ledger where the trust would be registered. A more concrete example of this or what trust are we talking about those, what are these policies? So basically at the gateway what we have is if there's TPM on the source device, you get a score of one. If there's secure boot, you get a score of one. If the data was registered in a ledger, you get a score of one. Then at the edge server, more things get added. If there's an application running on that server that's encrypting the data before it travels, you get an additional score. And if the signature was verified at this point, you get an additional point. So this is the idea of injecting, of calculating that trust measure as the data is traveling throughout the network. And at the end, you would get a score, a confidence score. If the data matches or basically satisfies all these policies that we set for it, then you would get six out of six. If there's no TPM on the device, it would get five out of six so that the consumer of the data would kind of know basically that there's something missing. Or there's something that would lead to a certain issue with the data. And of course, these are just dummy weights that we're using here. It was scoring one for every one of these policies. You could configure this to basically weigh something more than the other and so on. The different use cases of LVARM. So one is internal quality and security control. The second is regulatory compliance. So to get a very viable percentage of data that it should meet a certain threshold. Marketplace application, like if you're selling your data or any IoT data, these trust measures would help basically. Make others trust the data more. Trusted actuation, if it's a real-time application and your data is going right into a certain actuation. So you would use this score to know if you trust to do that action based on that measure or that temperature reading, let's say. And the final one is a trusted ecosystem partner where your metrics could factor into trust ratings by using one's product or service. There's multiple implementations of LVARM and we'd be happy if anyone's interested to have a look at them. There's a Go implementation, the Java, Python one. And Rust one is on the way. These are the links. There's the website and all those GitHub repos with examples. So if anyone's interested. Now I'm going to hand over to Sean to walk through some use cases. Thanks. So I'm Sean O'Murphy from the University College Cork and we're partners with Dell and a whole bunch of other research institutions, universities, industry, in this European project, the Collaborative Edge Cloud Continuum and the Bettered AI for a visionary industry of the future, which we call Clever because that's too much of a multiple. Now the idea with Clever is that we're exploring technologies where work is being done down at the edge, up at the cloud, that has been passed up and down, and decisions being made about what to do with it, and generally the applications are AI and machine learning applications. So from the open view, we have some use cases with the data confidence that data confidence fabric like Alvarian can give you. So it allows you to mix and match data from old hardware, new hardware, different firmware versions, and some of this may be more or less secured, there may be historical information about vulnerabilities that may make you consider some historical data differently than you might if it was up to date. You also may have an application where you're accepting contributions from a whole mix of different sources. So rather than a single organization having complete ownership and control over a whole network, you could have accepting public sources, so you know, Citizen Science, the universe, that kind of thing, where you could have expert users and you could have lay people contributing together. They could also be using, you know, publicly available consumer sensors mixed with high-end professional grade stuff. Also, you may be dealing with datasets where you want to be able to be sure that you have permission to use everything within it. So it could be a case where you're collecting data to train models of it, and you want to make sure that you have the correct licensing for all of the material within it, so we think annotations could be very useful for that. So this could save you from getting in difficulties down the line by using things you shouldn't have been using. So from UCC, we're working with models and data conference for mixed trust applications. So many applications and models, they like to have a lot of data to work with, so this makes intuitive sense. You have lots of material, that's more scope to learn things about the domain that you're trying to train on. However, not all data comes from the same place, so like I said, you'd have different sensors, different firmwares, old hardware, new hardware. Something might have been installed by a certified engineer or might have been something installed by a home user. So you can have a mix of all these different sources. So like I said, some of it you can trust highly, some of it maybe not so much, but a lot of data generally makes for a better model. So in the illustration here, I just have an example of we have data coming from out of different sources. Most of it might be low trust score data, so it's the larger red circle, and some could be a little bit more trusted in the yellow, and then in the green that's maybe the best quality or the most trustable stuff that we are aware of. Now we could train a model just with the trust score 6 material, which is the smallest, so we could say we're very confident with all the material that goes into this learning. However, it's the least amount of material compared to the alternatives we have. So you may get good results or you may get sort of limited results. And it's dependent on what you're working with and what your application is. But generally most models will do better when they have more to learn from, but if there is some poor quality or possibly malicious material in your larger set, you want to do something to avoid that contributing over much to your model. So we take our trust scores, which the data conference fabric can give you by annotating the data based on its provenance and its history and security characteristics. And we use this as an additional input into our model. So most machine-earning AI models, they allow you to weight inputs, either through integers or super sampling or so on. So depending on your application, you might be more interested in one kind of annotation versus another. So you might be more interested in stuff that has been signed and gone through like secure socket layer, come from devices with just platform models installed. Or you may not worry too much about if there might be something like the difference between security and safety, very important. You want to make sure your material is very good and very trustable. Something more like optimization, where a catastrophe in optimization is a small loss of money and carbon, but loss in safety and security could be much more serious. So we have a case study from our work using data conference as an input and machine-earning. So we have decided to mix trust. So we're accepting material from the low trust data set combined with the material from the high trust data set. But we are using the trust ways to decide how much it contributes to the model. So we have an experimental set up here where we take an existing data set and a portion of it we poison. So we actually manipulate it by changing results in order to cause a sort of malicious result in our resulting model. But because we have annotations to data conference fabrics to work with, we can make sure that I had a potential to have been altered in this way as a commensurate low trust weighting audit. So the idea here is to combine the stuff that may have been manipulated or malicious in nature with the stuff that we know is relatively clean high trust and make sure we have a large data set that has a combination of high trust and low trust, but we're going to limit how much the low trust stuff contributes without discrediting it entirely. So our experimental set up here is we have the census data for California Housing in 1990. And our objective here is can we predict the median value of housing based on the other fields. So the latitude, latitude, ocean proximity and so on. And the poisoning here is to adjust the latitude by two degrees northwards. So the effect of this, if you're a malicious actor and you wanted to wreck a model, if you applied this to material that was contrary to the model, what you would do is you would say LA is somewhere in the middle of the desert, San Diego is up the coast and no man's land and so on. So the impact on your model is that the predictions for the median values get pretty out of whack. And so the results here, the blue line is a baseline where it would be the best case where everything was perfect and not manipulated. And then you can see when we just used the clean set, which is the green result here, the more and more of your data set that may have been poisoned, the more of it you are disregarding entirely, which means your clean data set is getting smaller and smaller and smaller. So once you get past 50%, it starts to degrade because the amount of material it can learn from is getting smaller. On the other side, then we have the poison data set where we are incorporating the poison stuff with the clean stuff. This degrades pretty badly as well because the poison stuff is malicious in nature, it was intended to cause a bad result. And then the trust result, which is the red triangles here, these are when we're actually using the trust as an additional input. So we're accepting the material that may have been poisoned, but it's been weighed more low compared to the clean stuff. So we get the benefit of all of the clean material, all of the potentially poison stuff, some of which does have good explanatory power within it because it wasn't necessarily altered, it could have been no trust but still good quality, or there are parts of it that hadn't been manipulated that still give you good information. So from our side in the University College of Cork, we have some future directions in using data conference fabric, so we think it's pretty interesting technology in a cool area. So usually in a zero trust environment, you're using this to decide whether or not to use something, whereas we think there's something interesting to be done in use it a little bit, or use it as much as is appropriate for the application in mind. We also think the trust scoring, so we saw you could trust something based on your choice of weightings for certain security features, but what if those weightings could be something that could be learned on a per application or per organization basis? There might be some way to do this through iterative modelling, or feedback loops, something communicating between the edge and the cloud and back again, based on the actual results of what you're doing. And we have some interesting in some new types of annotations, so what you see with a researcher as Khalid is working in this area exploring different kinds of annotations, maybe there's some new novel approaches we can take here. And so we have some ideas about the performance of the models themselves being a way to calibrate the weights. So based on how well your model is doing, is that telling you something about a particular annotation that appears quite a lot? We have some of our contract demo using Gilverin Harrison. Yeah, so now I'm going to show a five minute demo. Okay, so this demo has... Oh, what happened? So with this demo I'm using Hedera and Project Alvarium, so Hedera is a serial ledger that is basically... It uses the hashgraph, the serial consensus algorithm, what makes... we chose it because it's consensus in Hedera is much faster than other ledgers, and because it has public subscribe semantics. Now, Alvarium works with any messaging, like middleware, like Kafka or Polsar, or a... let me pause it for a minute. Okay, it works with anything that supports public subscribe semantics. In this case we're using a distributed ledger, it doesn't have to be a ledger, but the ledger here adds some more trust. So what I just did here is I started a service that created a smart contract on the Hedera ledger, and it created a new topic on Hedera so that the sensor can publish to that topic. And the smart contract is used for automated billing of trust services, or the annotations that are injected into the data as it's flowing, and the topic ID is for Alvarium SDK basically to publish the annotations that is... It's adding the data to that topic. The smart contract that the devices... So those are the two... And then what I'll do here is I'll just update the config for the other two apps to use those contract ID and the topic ID. Then I will go and run the UI, basically it's a React app that basically subscribes to that topic to get the annotations and to the smart contract. Just a second. So here I'm showing the wallet of the user which has some HBuy, which is the cryptocurrency used in Hedera, and the wallet of the trust provider which is that in this case. And you would have a fee which is registered in the smart contract that is going to be paid by the customer based on the annotations that are added to the sensor data that it's using. And the gas fees are paid to the network for publishing to the topic. So here what you'll see is the source of the data, the reading, the auditing, the hash, and the annotations that are added. So now I'm running a simulated sensor that is generating dummy data and sending that dummy data to the ledger. This is the dummy data generated and annotated by the SDK, by Alvarium. And they're going to be sent over. And then I'll switch now over to the UI to see them pop up here. And you would see now that the billing is happening there. The fee is being automated. And if you view the annotations there, you would see the scores that are added to the data point. So I'll view them now and I'll be able to see it. As you can see here, it's a Boolean where what's the kind of the trust annotation here. It's TPM first. It's false. You don't get a point for this and so on. If we look at the wallet, we can see a certain fee transferred from customers account to Dell's account. So here are they coming in? I set up another one here where here the billing is happening automatically after every annotation or every data point as it comes in. But I noticed that the publisher ends up paying a lot of gas fees because every time they're doing a transaction. So what you can do is pay in bulk in the second small demo. Basically they start coming in. You don't see nothing going out of the wallet of the publisher. And then after you fetch a few, you can see the amount you up top and this is how much is due for the pay. And at a certain point you just pay the bill at once and it's a single transaction. Account balance has increased by 21. And that's it. Thank you guys. What happens when the button is clicked is... So if you want to reach out, those are our emails. If you're interested in the project, please reach out. If you have any interesting project that you think we might be interested in as well, reach out to us. And if you have any questions, yeah. Yeah. So an example of the trust scores, you could also have them not be just zero or one or range to create new ones, right? Save hang on. When you were attributing trust scores, you could also configure them to be not just one or zero, but also range in between? Exactly. Any floating number. In such a case, how do you verify that the same factors are causing the same amount of changes in trust factors from different sources? Like if a vulnerability in one source is attributing the same amount of trust as if it's coming from a different source in a different edge node. So if you're attributing a different number, okay. If there's one thing that's causing a decrease in trust, but that the decrease is equivalent no matter where the decrease in trust is viewed. Hello. My assumption here is that the ways would be defined by the verifier only. Exactly. So that is it. So you have a certain number of policies. You're coming in. You want to check that. So at once, you have all the metadata, all the annotations. The annotations are just telling me if there's encryption or not, what type of encryption is there. If there's any other policy there, if the signature was verified and so on. So at once, I would weight all these based on what I want to weigh them and I would calculate the score. So the scoring is not happening as it's flowing through. What is happening as the data, like the points where the data is touched throughout the network, the gateway, the whatever server the data passed through before coming to storage. All I'm getting is the metadata and at once at the end, I'm scoring it. Yeah. So thanks for the talk. I believe you mentioned that you also use this technology for license compliance and some, I would assume licenses for data sets. I think this is the first time I hear something about this. So I'm really interested. Could you please elaborate a bit on that? How does it look like, at least on a high level? So we're very early on on this. But I think the usual thing with the data conference fabric idea is you can make an annotation for pretty much anything you care about. So if what you're interested in is I have a data set. I want to verify which of it has been either say properly licensed. Well, there's going to be some sort of process to verify those licenses. And that's a step that if that step is complete at the time that it's requested, that's an annotation that gets added to the history of that piece of data. And then at a subsequent point, if I'm trying to create and collect the data set from all the material that I have, I can say just give me all the stuff that had that annotation associated with it. Now I have a data set that I know at the time of Providence that it passed that particular check at that time. Now, how do you go about that check? I mean, that's up to the developer, I suppose. But the idea here is that the data conference fabric gives you the framework to approach things like this. And so I think this is very important when you're collecting things from a mix of public and private sources. It could have come from many different paths to where you are. You want to make sure that everything's above board and you don't get yourself into a sticky situation using things that you shouldn't have been using. And also we're starting to see tools like Nightshade and Glaze that are attempting to make that stuff worse to work with for very good reasons. And so you can make sure that you're using good quality material that you have the permission to use. And that means models that you've been training on that stuff you know as well that these are all above board and fair to use. Yeah, just something to be clear. If I well understood, it's not you who are providing the scoring, right? So the framework gives you a way to have the annotations up in a ledger or a subscription broker or something like that. Trust scores, they can be generated and added to your ledger or they can be computed at any time that a decision is being made about that piece of data that could be happening anywhere. So you can request the scores that have been pre-computed or you can make your own trust score algorithm. And in that case you just request the annotations associated with all the material you're dealing with and produce your own scores. So there's flexibility there and you can have trust scores based on a per application basis as well. Okay, and so I have a question about smart contract and Mario in your blockchain use case. I mean, why to use your contract, your smart contract to only inscribe data on chain? What is the interest for your database to write the scoring on chain via smart contract? Why are you using it? You can just like verify in the database. And also a second question which I think is related if you are really providing verifiable and high level data, can you act as an oracle? And you can begin an oracle into the blockchain to have secured and already noted annotated data which we can use. Are you thinking about this type of use case as a oracle? Okay, so the first question, why are we using a smart contract? As I mentioned, a ledger and a smart contract are not essential to, it was just the use case here. So the main purpose of the contract was just to automate billing for any trust services that have been for, if anyone wants to package up this SDK and provide it as a service. And keep it distributed. They can use a ledger to automate the billing process for the use of those annotate, as per annotation. So the only reason we're going on chain is to just track those annotations. But as you said, there's storing on chain is inefficient if there's a bulk of data that's going on chain. And for the second one as an oracle, to be honest, I'm not quite sure how oracles work. I'm familiar with oracles. They're also peer to peer, but I'm not quite familiar with how they work. So how, how a very amount would contribute to being an oracle. So, yeah. Okay, so no more questions. So again, let's have another round of applause. Thank you.
Synergy in Open Communities
Hello. This is an open community, synergy and online partnership. So my name is William Brasik Ray. I'm a senior software engineer with the Linaro Developer Service. Linaro you can see on the top left corner. Although I'm probably most likely to work on the Linux Carod's kernel counter subsystem maintainer. And during my time I'm contributing to both submitted patches and been a maintainer for some time. So I've come across a number of, I'd say pitfalls perhaps that we can get in when we're doing what's open communities. So I think it's useful to consider the engineering sector before we, because the orient is towards where we're from. So. Hello. Hello. This right? Right. So the engineer mindset. So engineers are very objective oriented. So I want to get things accomplished. It's kind of orientation on a very pragmatic orientation to your objectives. You try a streamlined process where you can see, expand through. But there's this idea of brief. It's beautiful. You want to reduce sort of the redundancies of your actions and try to get things done faster and basically build on mistakes so you don't make them again. You're building up on previous knowledge. And the key worry for an engineer is the mistakes of your past. So that creates this idea that a novice is a liability. And I think that's sort of the wrong mindset to have when you're dealing with maintenance because maintainers have to mitigate both established engineers in your community as well as the beginners who are just not joining. You don't want novices to feel pushed away from that. So you heard the big, too big challenges that online maintenance have. First of all, written speech. Typically when you're dealing with an online community, you're not. You don't always interact in person. And a lot of language nuances are lost in written speech. For example, in just simple tone, take the phrase, that's a great idea. When it's written down, you're not sure is that being sarcastic or is it being sincere? Are they serious or is it a joke? That can create miscommunication and you end up with a lot of problems. To that end, you can mitigate that by becoming familiar with people. And that's since a certain amount of respect is earned when you interact with people. Strangers are normally very apprehensive to something new. So to be able to communicate with strangers, you need to build up a certain amount of trust before you can actually get on a personal level. So how do you build trust? Trust can be built traditionally through a gift. So gifts are tools towards establishment of trust. What is that for the maintainers in online community? That's typically going to be in the form of some sort of work. For example, helping bring up novices to the environment that they're in for the community. Once you've established this trust as a baseline, then you can actually move forward to producing the product that you're trying to complete. Criticism becomes more easily acceptable. If you're trying to criticize a stranger than that apprehension that they normally have is going to build up a wall and not going to be able to address the core problem with what they're delivering. The criticism is going to be taken on a more personal level than what you intended. When you go back to the top of where meaning is lost in the text, your criticisms might be for the work, but the person on the other end is interpreting it not the way that you want them to. So what I like to do is I'm a very visual kind of person. So how about we walk through some scenarios? Here's one that probably a lot of people have been through. You have a friend, right? Perhaps you're eating out together and you notice they have something in their mouths. Maybe it's a piece of food in their teeth. So you say, hey, I think you have something there and perhaps you should go check it out in the mirror and fix it up. You have a certain amount of previous relationship with that person and the trust has already been built. So the person trusts you and they go and take care of the issue that they seem. But think about it from the perspective of a stranger. If you go up to a stranger in the same situation, say, hello, madam, you're very ugly. Why don't you go back home and change your face? It's taken in a very different way than you would as a person you've known before. And the reason why is because you have an established trust. So your words are interpreted in different contexts than they would be as friends. And that's the core issue with an online community, that impersonal nature of communication results in misunderstandings like that. Engineers going back to start are very pragmatic. Engineers want to be able to focus on the core issue. But without the established trust beforehand, then the misinterpretation comes up and we end up creating a noise in a sense in terms of what we're trying to achieve. And then there's always something worse when we started. So let's go over something more specific to online communities and that's the reviewing patches. This is in the context of perhaps programming work. But I think it could apply more generally. So let's cover the wrong ways. So you might have come across before. I mean, if you were to say no, just deny something but not give an explanation. They're taking again a pragmatic approach as yes or no. But without that understanding in mind, the previous trust, then it comes off as too abrupt. And again, that person, the novice, might just be taken back by it and I want to contribute. Another one that we come across perhaps is RTFM. If you've never come across that, that's read the manual. It's not very useful to you. So please don't respond that way. And I understand why. You want the person submitting to go through the steps of trying to research previous solutions and understand the system before trying to propose a change. And that's understandable but that should be communicated because when we tell them to just read the manual, it's if you ever interacted with certain manuals, they might be very large to put as an understanding. So it's a ton to read the manual. It's not very helpful to finding the exact solution of what you want completed. So if you respond in ways like, what even is this one earth or you're trying to do, again, you need to be more specific. And these kind of words, they end up taking a very personal way. In fact, sometimes maintenance do in fact take it personally. This goes bad and you should feel bad. That ends up now moving the topic from a very technical discussion to now a relationship discussion. And I think the last one is very sad. I'm not going to reply to you anymore. I think at that point, we've started on a common ground of trying to produce something good. And now we've ended in a destruction of the relationship. The reason I've used synergy as a topic of discussion is because synergy is two more agents coming together in interaction that produces something greater than what they can produce alone. When you burn the bridges that you've made by replying in these ways, you've produced something worse than what you would have done independent. So you've actually ended up worse than when you started. By the way, I really love this picture here because I think it encapsulates the relationship that maintenance should have with a submitter. On the left, you have someone using the wrong tool to try to achieve something. And on the right, same situation. I think maintainers should be mentors in a sense. They should guide an office to the correct procedures to accomplish what they're trying to do. And then together, we can produce something greater. So here's what I propose for the right way. Let's try these again. But it's something that I think communicates better in an online discussion. Hi, William. Thank you for taking your time to send your patch. You address their work and you thank them for what they're doing. This is great for a draft, but we have some comments and questions below. So this sets up a premise where you appreciate their work, but you're prepared to provide criticism. You're establishing trust so that you can then criticize. So let's continue on. Unfortunately, this hardware doesn't support that function. This function is set from page 42 of the manual. And there you point them to where they should go to try to solve it in a different way. You start them off in the right direction and you let them take over. So you're not becoming consumed with the task yourself as a maintainer. You're guiding them to produce what you need. Then perhaps you might say, I'm confused about this part. Would you explain it some more? Let's work together to figure this out. You're making sure that the discussion, the problems that you're having is not with the relationship between them. It's with technically what they've produced. You want to know on a technical aspect what's going wrong. You can also provide your experience as guidance. This is a common error that I made in the past too. Here's how you can fix it and why it's better. There are good ideas here. A little more work will be able to merge this. You keep providing this guidance and motivation to continue on and you welcome them into the community in that sense. I believe this becomes more effective by building up that trust progressively. They're able to take the criticisms later on when you do have to criticize the code. A maintainer's job isn't just to merge. As that goes, a maintenance job is to say no. It's very difficult in that sense. Here are some guiding principles. As I mentioned before, maintainers should be mentors. No one starts out an expert, so you can't expect that from every new patch submitter. Remember how you felt when you were in office. You didn't know the procedure or everything. It's very overwhelming. The maintenance should be there to guide them through that process. In fact, maintenance should serve as ambassadors of the code. The focus should be on the code itself, not the personal issues that you have with someone. Of course, you come across issues where sometimes submitter might be very abrupt. What you should do in those sorts of instances is pivot back to technical discussions. Don't get thrown off into a thread of interpersonal relationships because, again, that doesn't have anything to do with the code in question. Boat of trust and established respect. If you acknowledge the efforts of the patch submitter, they're more open to taking into criticism. If you do the favor for them, they'll continue to do the favor for you. Take for tax situation. You're working towards a common goal, so treat it as that. You're both families towards that. It's a cooperative effort to improve the code. Not a fight for acceptance. If the disagreements are focused on technical issues and not personal issues, then you're able to focus on improving the code. In other words, you shouldn't have discussions about UUU, all these problems. This, this, and this are the issues. Together, you work towards something external to produce something better. Thank you for that. These are hard issues to take on. I think in general, people want to produce good things. That's why they're submitting the patch in the first place. People aren't going there to pick fights. Of course, there are people that are like that, but the vast majority are trying to produce something. I think if we orient ourselves towards that, then we can produce an environment that produces better things. We can increase the synergy in this sense. If you have any comments or questions, please send me a message. I'm very open to that. I truly believe this is something important. We should be discussing and trying to find solutions to those problems. That's my GPG key. If you're wrestling, that needs to be signed. I'll leave it with this. I think that we can work together to produce something better. I think that that should be our focus. Not in terms of fights, but just focus on the code. Thank you.
Problems and solutions for running a distributed virtual world
Welcome everybody. We're about to start our next talk. My name is David. It's my pleasure to introduce Vadim Toshinsky-Schmelev who will be speaking on distributed virtual worlds, problems and solutions. Thank you. I don't see. Okay. All right. Welcome. Thank you for coming. So my name is Vadim. I'm from a non-profit organization called Overtv. We distribute, we develop virtual worlds. Kind of could call it the metaverse, but that's kind of a bit of a dirty word these days. Thanks to Facebook. So there we are. Yeah. So who am I? I'm a software developer. I'm the current chairman of Overtv, a non-profit which supports our work. And I'm one of the active developers of the system. All right. So first I expect that a lot of people won't be all that familiar with what we're talking about here. So I have to give a short introduction to what we're talking about. We're talking about systems like Second Life, Aviyar Chat. They're kind of a very, follow a very kind of similar pattern. Oh yeah. They follow kind of a very similar pattern in that, well, first thing you get is a login screen. So basically it's a kind of vault garden. If you can access that, well, you get to access all the goodies. But if you're ever kicked out of it, that's it for you and all your content and all your relationships and everything else that goes with it. Same issues happen with things like if they happen to go bankrupt. So for instance, it's kind of been rumoured that Aviyar Chat could have some possible financial problems at some point. Who knows? All right. So to give a bit of context about what we're talking about here, the first of all, I'm just going to give a short introduction of what these platforms provide to the normal user. All right. So we have here an open source client for Second Life. First thing is you get a login screen. You're not going anywhere without that. It's a multi-user world. This is kind of old fashioned. There are pretty places over there. But yeah, basically it's a 3D environment with no fixed purpose where you can create, socialize, build stuff. There's typically a set of worlds you can explore. On Second Life, this is actually pretty neat in that there's a contiguous simulation. You can actually cross from one world to the next just by walking around. So that creates a huge amount of stuff you can explore, which is larger in space than many cities. This actually costs a lot of money to maintain, which is kind of why they do things the way they do. On some other systems, you also have things like isolated environments that are completely just floating in the world. You have an inventory where you can keep your stuff that can be objects, avatars, sounds, textures, text documents, things like that. That's all typically kept server side. You have user accounts which provide you with a name, profile picture, membership into groups, which can grant you access to places, for instance. And most of those systems also have some sort of scripting language. In the case of Second Life, it's a proprietary one. Not a big fan of that. That allows you to do things. You can do guns that shoot things. You can implement games. You can do various kind of management functions, like deciding who gets to enter an area who does not. Yeah. Right. So what do we do? We are a virtual world in the style of this. We support Linux and Windows. We have a distributed architecture, which is very different from what the other systems implement. And we do scripting in JavaScript with V8, so very modern. And we are supported by Non-Profit, which I'm chairman of. So the issue with the previous ones is that Second Life is like 150 million business. I believe VRChat is like 100 million investment. We can't do that. We are a tiny project. We actually are a fork of high fidelity, which was a kind of commercial successor to Second Life, but that died, so we picked it up. So high fidelity heavily leaned into decentralization, and we keep on with that spirit. So how do we do it? We distribute server hosting. We distribute content hosting. We don't require logins anywhere. We script in V8. We have, we support VR, desktop VR, which is, I think, extremely common in open source. We don't have any kind of cryptocurrency sort of deal. That monetization is not what we do. There's no lock-in of any kind. So basically, in structure, our system is something like the World Wide Web. We do both a 3D web server and a 3D web client. So this is all kind of glued together with an absolute minimum of things that do have a slight centralizing effect, but are completely optional. Like, we do have an account server, but you don't have to use it. You can just use it anonymously. It's like a 3D web page. When you go to foster.org, you don't have to log in anywhere. All right. So how does this work? Servers run on an EVPS, or you can use NAT hold punching to just run it from your home laptop if you want to. So what is a good thing about this? Well, hosting is a troublesome for an organization. We don't have actually the resources to pay huge bills for Amazon. Also, this is legally troublesome in that in some areas, some kind of content may be trickier to host than others. We also can't lock you out of anything. And this is like a 3D web server. So if you set it up, you own it. It's yours. We can never prohibit anyone from connecting to it. And you can use any kind of resources you provide. It just downloads assets over HTTP. But what are the problems with this model? Well, our users have to seek a third party host. So basically, the first step anyone has to do to deal with this is to go to AWS or a line out or digital ocean and figure things out. Personal servers may just go down entirely. Setup is easy, but not completely trivial. And the world can't be coherent. By this, I mean that we can do like second life and just kind of a small, smoothly flow from one environment to another because there's no sort of, no any kind of central infrastructure that says, hey, this guy is to the left to this other guy. All right. How do we make this easier for our users? Pre-packaged packages, pre-made images from common hosts. We have a very friendly build script, which you can run on pretty much any language distribution and build it from source. This is also intended to make it easier to try experimental developments. You can build any branch or any pull request we have. We support hosting from home so that you don't actually have to deal with AWS. And yeah, one of the kind of potential future developments that we've been discussing is appearing in between servers to share different kinds of content. This would avoid having to reach out to anything else. So what do we mean by that? So the servers would establish links to each other and exchange things like sound streams, textures and messages. So in this way, we could have a chat system that doesn't actually depend on anything external. We've had discussions of connecting to matrix, for instance, or to have some sort of discord plugin. But yeah, that's not quite in the spirit. We have an internal chat system which works inside each environment. But we want to have better support for intercommunication in between the areas that exist. So how do you distribute content? It's just an HTTP server. Clients just get a URL from our server to download it. So the actual servers that contain each environment don't decide anything about the content because they don't even see it. They just provide a URL to it. And it can come from places like AWS or Dropbox. And since completely standard format. So benefits for us. Well, hosting stuff is legally troublesome, especially if you allow people to upload stuff. It costs money. Yeah, there can be kind of legal troubles. And we can't lock you out of anything. If Second Life kicks you out, well, all your content goes with it. It's locked inside your account. Yeah, cost of that. Well, again, a typical thing that used to be recommended is that if you want to host an avatar, you go to AWS and use the free tire. That's not particularly user-friendly. So we're dealing, trying to deal with that. Content tends to disappear. That's one another big problem for our system. In Second Life, you can steal things from 2004 because the actual central server maintains the content even if the user is physically dead. We can't do that so far. So yeah, there's content loss. Content protection. That's an interesting wrinkle in our system. So in places like VRChat, some people like closed environments because that allows you to sell content by the copy. Like you make an avatar, you maybe spent a month polishing it up. That's a huge time investment which might be thousands of dollars if you were to contract somebody to do it. So what they do is to sell it by the copy. That's only possible in a system that actually locks you in because you need to ban people who break the rules. If somebody extracts the assets and re-uploads them, well, that kind of breaks that user model. So we've had to come to terms that we can't actually do this in any way. Because we only contract our own infrastructure. The system is copied and opened for client and server and there's no way to actually do this. Right. How do we work on solving content issues? We support common hosts like Dropbox. It actually reverse your URL to download. We have maintenance tools to scan things for broken content links. We are working on implementing support for WebDAF which actually allows uploads to the Web. So that would allow things like creating an inventory like in Second Life within the world. To work, for instance, in Blender, save an avatar and have it immediately reload inside the environment. One of the things that we've considered for preventing the loss of content is actually exposing public server backups. The server backups itself. But this is just for the server owner right now. So one of the kind of things we've considered is to allow anybody to download the entire content if the user doesn't feel very attached to the contents of it. So if they think anyone should be able to get this, we want to make it there. Yeah. Authentication is a bit of a problem in that everything is by default anonymous. That creates some difficulties with moderations and there is nothing as banning everyone for the same system. To deal with this, we've considered things like allowing each server to export information like a ban list which other servers could subscribe to voluntarily. In this way, there's no center, there's no central repository of anything. You say, hey, I trust Bob that his opinion is of who is a good or not person. Yeah. But on the good side, just start the client. As soon as you download it, you just can log into and connect to our environments. There is, yeah, we also support authentication by fingerprint, which is calculated for things like the MAC address. This is an opaque identifier. So you can just say this particular machine is allowed administrative access here. You don't need an account server to do that. And there's an issue with scripting security. So the system we inherited doesn't really have a security system. We're having to build it from scratch. So the current solutions work out to asking the user for permissions like a browser does and code-signing. But turns out there's no convenient solution for siding for JavaScript. And we've decided to steal something from Java. We can package everything in a JAR file. And it turns out that actually there can be multiple signatures on a JAR file, which means we don't need one central authority. You can trust whoever you want as a certifier. And that's about it. And yeah, fortunately, that's all I have time for. We have matrix and we have our website. So for questions, please come to our chat. And yeah.
Open Food Facts : Acting on the health and environnemental impacts of the food system
Welcome everybody. We're going to start the next session now. It's my pleasure to introduce Pierre Slamish who will be speaking on open food facts, acting on the health environment impact of the food system. Hello everyone. I just have a quick question. Have any of you in the room used NutriScore to choose food products by raise of hand? Okay. So you'll see that open food fact has played a little part in getting NutriScore out. So let's start and let's dive right in. There's a lot on the menu. So for those who don't know open food fact, I'll briefly introduce it. I'll have a section on what's new in the project this year, what's cooking for next year, and also we'll be able to do Q&A probably outdoors. So about open food fact, it's a project that we started 10 years ago. So it's an NGO and it tries to answer how do you choose the best product in the supermarket? A lot of information and it's not legible. I've never been able to understand the nutrition table. It's abstract out to me. So a long ingredient list as well. And yet food has a massive impact on public health. To give you an idea, obesity and overweight wipes 3% of our GDP due to the cost of treating obesity and overweight. And the same goes for the planet. One quarter of food emission is food. One quarter of carbon emission. So the idea of open food fact is to empower users and contributors who have an impact on their own health, on the environment, and on the health system at large. Our slogan, if you will, is don't panic but organize. So crowdsourcing is a way to do that mobile crowdsourcing. And if Wikipedia was able to build the largest on the planet, open street map, the largest map, why not build the largest database of food products on the planet? Two days, 10 years in, we have 3 million products from over 160 countries. Main sources, crowdsourcing, so you and me using mobile. But also the food industry which has started to realize that transparency in the end wins. So the mobile app of open food fact allows you to choose products that are good for you and the planet. You scan barcodes, you get NutriScore and EcoScore. You also have a personal scan for those of you who have food allergies or want to go vegan. It will help you on the journey. It's of course privacy preserving. So it's privacy by design. We don't require any login. And if you don't have a NutriScore in your country yet, you can get it on any products in a couple of seconds. You answer a few questions in the app and you get the scores instantaneously. So you can take your health to the next level with NutriScore and which is about the nutritional quality and NOVA which is about food ultra processing. So avoid NOVA for products as much as you possibly can. We also do additives and labels. And we make it simple to understand all of that. With NutriScore, we started computing it in 2015 when it was a scientific paper. It was called the five color score. And now we compute it in every country including Mexico and the United States. Everyone can get it even if a producer don't want you to get it. So we recompute it. We create an ecosystem around it. And the nice thing is that as you all experienced, it's now in supermarkets in Europe. It's still not compulsory though. And producers are beginning to improve their products. And we also show EcoScore which is about the planet. So same principle. We use something called life cycle analysis which are very precise analysis of food products. So it's an average. And then on top of the average, we make the computation more precise to the products using specific data. With EcoScore, the great news is that France will have an EcoScore despite all the trouble you are seeing right now in France. It's in low. So that's the cool news. It's beginning to be experimented in Belgium, in Colbert. And it's also available in many European countries and the US. And so yeah, we are having a more global discussion around it. In terms of impact, OpenFoodFact has quite a lot. Because we are open data, over 250 projects, application services reuse the data to inform users from questions on pregnancy, allergies, etc. Even big corporations use it. In terms of impact, it's a simple circle. We collect data using our mobile phone. People are more and more reusing that data to do many things including scientific research. People get more educated, more mindful about what they eat. They start changing their behaviors, their purchase behavior. And the whole industry actually starts to follow. The producers are taking notice and they are changing their recipe as a result and everyone benefits. And the circle goes on and on. So from those kind of Photoshop or GIMP images that we did a couple of years ago, we went straight to this where the NutriScore is everywhere. So yeah, you go from Perl code to real life impact where basically all products, all newly introduced products start to change for the better. What you can also see across Europe is for instance the differences between In The Food Offer. We take photos across space and time for 10 years and we found out that the Fanta Recipes change across Europe. So for instance, Italy 12% fruit, Serbia 3% fruit, Portugal 8% fruit plus high fructose corn syrup and 0% fruit in the French island of Réunion. So that's the kind of thing you can do with data. We can also have a giant map of food factories in Europe. So that's may near me. And all the packaging code you see on food products, we actually collect and we can map them. You can do benchmark if you're liking to data, if you want to choose a perfect year old. No, you can. So it's highly customizable. In 20 seconds, you can do your own charts. We also have a platform for the food industry. So whoops, sorry. Yeah, for the food industry to help them actually reformulate, we say, okay, here's an opportunity to reduce a little bit sugar and then you will get a better NutriScore. So we compute all of that. And brands have started playing the game. Some of the brands you consume every day are actually doing open data and sending open data to open food fact. Even the big ones like Unilever, even Ferrero from Nutella are doing that. So they're starting to realize that consumer pressure is important. In terms of milestones, so as I said, we launched NutriScore in 2015. We launched EcoScore more recently and ultra processing in 2018. So the project is a bit over 10 year old. And this year, we cross the three million products threshold, which is a nice milestone. We are now at 3.1 million monthly visitors on the websites and the app and contributors together have made 28 million edits since 2018 and it's growing, it's still growing. The permanent team is growing. The community is much more engaged this year than it used to be. We were doing European meetups. We had our second open food fact days this fall. And we are also getting more people into coding. This year, we also scaled app marketing so that new users discover about open data, open source and open food fact to 40 languages. And we started getting into European events and trying to get a European committee off the ground and not just be a French project. In terms of manufacturers, we introduced a few new features as well. Manufacturers are getting on board. And even more important to us, as scientific use and reuse, we have 30 scientific paper in nutrition, in machine learning based in 2023. And we have increased the reuse a little bit as well. So what's cooking for 2024? It's going to be a big year. First and foremost, because the new score is going to change. The formula is going to become more strict. You know that there's Italy is trying to block it at the European level. And the scientists are overwhelmingly supporting new three score. Seven countries have adopted it. And now it's the question of whether it will become the European score. The new formula is going to be more stringent, like seven out of 10 products are going to lose a grade. Most of them are going to lose a grade. And it will be a two-year transition in real life. But as soon as we start deploying it on open food fact, it will be on every, the new computation will be on all products directly in open food fact, even before producers do the transition. On mobile, it's going to be a big year. I'm going to go very fast because there are only four minutes left. So we did a lot of user interview this fall. And so we are going to make the app more pedagogical and to improve search. So here's a screenshot of all the ideas by the community. So we are going to improve the onboarding so that people better understand the scores. We are going to make the personalization engine more intuitive. We are going to make all the information more legible, guides even to go further for French people. We are going to try and tackle the mineral water scandal. And improving search. Also, thanks to the support of NGI, NGI search, we are going to have a live search in open food fact. And this year, we are going to go beyond food. So the thing is, we have had an impact on food. But there are many objects like, I know, this projector or this chair, which have a life cycle. And then at one point, the owner decides it's not worth keeping anymore. And as a result, we are surrounded by object, but some of us no longer serve us or please us. And they end up in the incinerator because we fail collectively to give them a second, a third life to repair them, to fix them. And so open product fact is all about that. Giving open data to power of circular economy. So today, this year, we are going to merge open food fact with open product facts, beauty facts and pet food facts so that you can scan anything on the planet and get solutions for it. And yes, people have asked us for that for years. We are getting into price collection this year. People, we started open food fact once, so what's in my food? But no, people want to know at what price. So we are starting open prices. Currently, it's only a web app. It's only five weeks old. So it's still a very experimental project. Even the logo is experimental. But basically adding a price takes 20 seconds. You scan the barcode, you put the price details, you put the location. It remembers the two or three locations you inputted previously. And then you start to realize weird stuff. Like for instance, price variation in the same city for the same products, for the same supermarket chain, and nobody is able to explain why. We are also thinking that we could kickstart a European price collection and build the first European Nutella price index. So we already have a few prices in Europe, but you'd be very welcome to add the prices nearby at your favorite shop. We are also, this is more experimental, but we also would like to help people free their data from receipts. So now at this point, you are asking how can I get involved in my country? So we have a broad European coverage that's already there, but there's still a lot of work to do. So how can you contribute? Scan and add new products. That's the most basic, but the most vital way to contribute to open food facts. Translations, word spreading, taxonomies and design. So a lot of knowledge about food required. And if you develop in any programming language, hacking and fixing is welcome. We have many programming language you can contribute. So the mobile app is in Flutter. We have some machine learning, robot off in Python. So we're even experimenting with LLMs and 60 seconds on the clock. Perl, Python, you name it. There's really something for you in there. So that's the QR code. If you want to become a volunteer, you can scan this QR code or go to openfac.work.com. Also, if you're a student or an adult, you have a Google Summer code we are going to apply. So if you want to become a mentor, a mentor or refer a mentor, feel free to do so. It's nice to have a large impact on food. We are independent from the food industry, by the way. We're not like a startup or anything. So we'd like to thank all the sponsors that are supporting some part of open food fact. So thank them for enabling infrastructure or everything. So I guess let's get in touch. Eight seconds on the clock. You have the contact email, my personal email, and you can install the app right here. Thank you.
Observations on a DNSSEC incident: the russian TLD
Welcome everybody. My name is David and I have the pleasure of introducing Stefan Bortsmeyer who will be speaking next on observations of a DNS sec incident, the Russian TLD. Hello everyone. I work for AFNIC which is a data file domain name registry so I know one or two things about the DNS. Time to see first the problem. So the lightning talk appeared quite recently in the schedule because everything happened on Tuesday this week. So many users noticed a problem. A lot of sites services under the .ru TLD. TLD is top level domain. .ru is for Russia. And there were many problems. Many people reported this as I cannot reach a young dex or I cannot reach V-contact or other service. But actually it was a very general problem with .ru. Everything with the name .ru was down, it seemed. But some people said, okay, it still works or it works for me. You know that on the internet because the internet as a previous speaker said, the world is not coherent. On the internet it's perfectly possible that some users say it's down on others. Hey, it works for me. So in that case there was no apparent reason for some reason for some people in Russia for instance it worked some not. Outside of Russia it was the same thing. On the problem lasted a few hours, three to four hours, which is a very common duration for an internet incident. Someone told me once that every internet incident is two hours of panic on five minutes to fix it. So a bit of analysis of the problem now. So I have something terrible to tell you. Don't believe what you read on the web. A lot of bullshit. Many people don't know what they're talking about. They don't rely on facts. In that case for instance a lot of things are observable on the internet. Anyone can run a DNS client, can run trace routes, can try with Curl or other software. So it's possible to have data, actual hard data. But yet some people prefer to immediately start writing anything on the social network rather than collecting data. So if we collect data we can see that the problem was not with one website or the other. So when people said Yandex is down, no it was not specific to Yandex. But also there was a specific problem. It was about Russia. Many people immediately started to assume that it had something to do with the war. That it was an attack by the Ukrainians or a problem with Russia. There is the first problem that many people talk on the social networks without first gathering data. But there is also another problem is that many people reacted to this event not based on facts but based on whether they were pro-Russian or anti-Russian. So they said it's a fault of Ukraine, CIA etc. or the opposite or it's a fault of Putin or Kadyrov or I don't know who. So for instance you can find in many articles published about this problem that it was because of Russian censorship, some censorship test that failed. There is no evidence supporting this. There is censorship in Russia but in the specific case of the incident on Tuesday there is absolutely zero evidence that it was an attack or zero evidence that it had anything to do with Russian censorship. It was just a technical problem. So to debug this sort of problem let me spoil immediately it was a DNSSEC issue. But it was in the title so you already know it. So the best tool to debug DNSSEC issues if you don't know it is DNSViz. DNSViz is one of these few programs that are loved both by hardcore hackers and by managers. Hackers love it because it's technically sound and it produces correct diagnostic. And managers love it because there are pictures. Here you can see the chain of cryptographic keys that were used in .ru. At the top is what is called the key signing key which is one reference from the DNSWoot. The key signing key signs two other keys which are called the ZSK, the zone signing keys. One was inactive at this time. It was the old one which was soon to be retired but still published because again the world is not consistent which means that different parts of the internet see different things so you have to keep information in case of. On the new one the active one on the white well as you see there is a problem. Red is not because of Russia it's because of problem in that case invalid signatures for all this type of data. So this was at the heart of the problem. The zone was signed cryptographically signed but with invalid signatures. So the issue was at the .ru domain name registry which is the organization in charge of the top level domain .ru. Unlike what many people said without any facts. It has nothing to do with the system of resolvers used by the internet access providers in Russia. The problem appeared for everyone. I had the problem at home for instance because the source of the problem, the root of the problem was at the .ru domain name registry. Also this registry is the same organization is also in charge of two other top level domains which were unaffected. Again unlike what you can read in many articles about the problem. So DNSSEC is a technology of security. The goal is the idea is to sign cryptographically the DNS data so the resolver at the other end can check that the data is pristine, is correct and has not been modified. So that the idea is in a way, actually it was even in the official statement by the domain name registry, in a way DNSSEC worked because the signatures were invalid so the resolvers rightly so rejected the signature. So you cannot see immediately that the signatures were invalid. You can query DNS with tools like a dig, drill etc. etc. But of course unless you can do RSA or ECDSSE computations in your head you will not see that the signature is invalid. You have to trust the software. So why did it work for some people? It's because not all DNS resolvers on the earth validate. I didn't try the resolver used on the first-dem network for instance. I assume it validates but for instance many big internet access providers don't bother to validate which means that if the signatures are incorrect it doesn't matter because they don't check anyway. So big public DNS resolvers like Google public DNS validate on the other problem. Also at home I have my own resolver which validates so I was also unable to see anything under .RU. But it can explain why some people said hey it works for me. Sure because every resolver DNS is decentralized which means that any resolver on earth will do its own validation and some decide that no it's broken so you cannot access it and some will not validate so it will work in a way. So the lessons we can take from this incident. One is that DNS is important. I can even say critical. Most activity on the internet start with the DNS so not having the DNS for most people is like having no internet. There have been some reports that for instance Russia was disconnected from the internet. Bullshit. It was easy to see that if you know the IP address of the server you could still reach it. But of course it's not really convenient. You cannot spend the day using ping and truss route with IP addresses. So for most users it was exactly as if the internet in Russia was down while it was only a DNS problem. So DNS is critical. That's why the people who work to maintain the DNS should be paid much more but it's another issue. Also an important thing about the DNS is that the domain names are organized in a tree with a root. So you can create top level domain like .fr.be.ru and then second level domains, .yondex.ru etc. And because of this organization in a tree if you break one node in the tree everything under it is down as well. So if you break something .com every name under something .com disappears and if you break a TLD, a top level domain, big problem because you break everything underneath. That's why domain name registries are extremely important. Also cryptography is hard. We know it. It's hard to do properly. It's hard to debug software as bugs. I'm sorry again to inform you that software as bugs. So it's still a problem today that internet could be more robust if we could get rid of security measures because every security technique can turn into a denial of service. In the case of .ru many people said oh okay because DNS sec was broken and access was then denied we should get rid of DNS sec. Okay it's exactly the same if you find an expired certificate on an HTTPS website you decide that checking certificates is a bad idea. It's the same thing for every security technique. If you lock your door when you leave and if you then lose your keys you cannot go back to your home. You have a denial of service and then people lock their doors for good reasons. So it's the same here. It's true that in that case a problem in a security technique made a denial of service but it doesn't mean that we should get rid of security. Again it's a very general problem with every security technique. Also one important lesson but you already know that free software is great because in that case without DNS v's debugging such problems would be much harder. Of course we could use tools like a dig, drill etc but typically they don't make nice reports. It's not just the pleasure of a nice picture. It's also a good summary and it allows you to see very quickly what was wrong. Some tools like drill for instance I use drill a lot and drill reported also the bad signature but it reports also many other things so it can be hard to pinpoint the problem. So DNS v's is really great. It can be used online but it's also free software so you can use it on your own machine if you want. Also during the problem I used a lot the Wipe at Last probes. There are small probes with free software in it. Of course that volunteers install all around the world so you can make distributed measurements. Again the world is not consistent. You can have things that work in one place and fail in another so you need also distributed monitoring of the internet, distributed debugging. And this is exactly what Wipe at Last probes are producing. The software on the probes is free software but typically you don't mess with it. The server is not so it's not really free software everywhere but it's quite open because not only anyone can install Wipe at Last probes but anyone can also ask for measurements from the probes. And they can do everything which is needed to debug DNS and DNS SAC issues. Thank you. I'll be there if you have questions or issues or you can ask them on the metrics room as well. Thank you.
A simple caching service for your CI
So, hello everyone. So, as I said, my name is Remedio Raffa, I'm a principal tech lead at Lino. I've been working on Open Source Project for a long time now, and I've been at FOSDEM for many years now, it's not my first FOSDEM presentation. So, I've been working on VLC media player on V8, Javascript Engine, and I joined Lino some years ago working on Lava and on Automation and CI in general. So, today I wanted to speak a bit about a really tiny project that I created some years ago, which is called Keyscash. And in order to present it, I have to explain why we are using Keyscash in Lino. So, at Lino we contribute a lot to the Linux channel, and not only by developing new stuff, drivers, and a lot of different things, but we also contribute a lot by testing the Linux channel. We have a project called LKFT, Linux channel functional testing project. That is, if you go to the website, it's written that the goal is to improve the Linux channel quality on the ARM architecture, because we are now mainly about ARM, but not only. By performing regression testing and reporting on seleting Linux channel branches on the Android command channel in real time. Okay. That's what is written on the website. More or less, it's a project led by Leno. It's an automated system to build and test a set of Linux channel trees. We mainly care about LTS, obviously, mainline and next. And by contract, we have to provide a report in 48 hours. So, it's quite tight between an RC on an LTS trees. In 48 hours, we have to provide an SLA. We have to provide a report, all right. So, if you look back at 2023, we built and tested 396 different RCs, so only LTS channels. As we also care about mainline and next, we built 2,443 different channel commits. That's 1.1 million builds. So, 1.1 million channels were built by the system by LKFT. And we ran 297 million tests just in one year. And if you look at the Android parts, Android command channel, that's 580 million tests. The tests are running both on virtual machines, so QMU and FVP. We have a specific system where we can instantiate in the cloud many machines for running QMU and FVP. That's a stock suite service that we created. We will not speak about it today. And we also have a physical lab. So, with physical devices in Cambridge, that is managed by a tool called Lava. That's a tool that I'm running inside in Salinaro. So, if you look at the LKFT, really simplified architecture because obviously it's way more complex than that. So, as I said, we care about LTS trees, mainline and next. So, we have GitLab repository that are just mirroring the different trees that we care about. And when there is changes, GitLab will pull it and we create a GitLab pipeline. The GitLab pipeline will send a set of instructions to our cloud service for building, called text build, that will run the builds. So, it will scale from zero machine to 5,000 machine in some seconds, do the builds, shut down the machine and then send the artifacts to an S3 like storage. So, the artifact will be the kernel, the TTB, the root file system, the modules, etc. And then these artifacts will be pulled by our lab in Cambridge to be tested on real devices. So, in the lab in Cambridge, we have some hundreds of boards, Raspberry Pi, Dragon boards, IKs, X15, etc. A lot of different boards. And at the same time, they will all pull the artifacts, deploy them on the hardware, depending on what kind of hardware you have, run the test and then report back. And obviously, everything will run in parallel and don't leave from the same storage. So, our CI system, as I said, will build and test artifacts, L, DTB, RAM, these modules, etc. And for each kernel, DTB and root file system, they will use multiple times because when we have one commit from the kernel, we'll build it for multiple architectures. We'll build it for x86, ARMv7, ARMv8, ARMv9, PPC, SH4, MIPS, etc. Then for each architecture, we'll have multiple configurations. I want to build with some virtio-specific configuration. I want to build in debug in release, etc. And then for each configuration, for each commit architecture configuration, I will run a set of tests. So, KSELTest, KUnit, libgperiod, the LTP, etc. Considering that LTP, for example, is broken into 20 different test suites that will be 20 different test jobs because it takes a lot of time to run. So, the CI system will run a lot of different jobs, of test jobs, that will actually pull the same artifacts all the time, which means that in the network, on the network in the lab in Cambridge, we have a lot of network usage and a lot of duplication. We are re-downloading always the same artifacts. So, that's normally really simple things to solve. You just add caching. So, just, I'm really adding that because that's really important. Our system, our CI system, the Lava Workers, will download multiple times the same artifacts at the same time in parallel. So, if you look for a caching proxy in the open-source community, you will obviously find that Squid is the main caching proxy and it's a perfectly good one. It's really working well. So, you should just install that on our network, point all the workers to it and it should work. Short answer is no, it's not working just because of the two reasons above. So, and also for another reason, this one. All artifacts, as I said, are published in an S3 like bucket. They are somewhere in the cloud. So, obviously, if you want to download them, you will download over HTTPS. You will not download a random binary from internet and run it in your local lab for testing. Not something that you will do. So, we have to validate. So, we use HTTPS to be sure that what we're downloading is what we're expecting. At least we are trusting the software. But when you add a Squid proxy in the connection, it will not work well with HTTPS. That written in the script documentation, you can make it work with that. It's not easy. The main problem is that as an HTTP client, when you connect to a website over HTTPS, you're expecting to get a certificate and the connection will be encrypted with the certificate and the certificate, you have to trust it. When you add Squid in the middle, Squid will have to connect on your BI to the server. So, the connection between Squid and the website is encrypted correctly. The certificate written by the website is a legit one, so it will work. But when Squid will have to decrypt the content to cache it and then re-encrypt it to send it back to you, it does not have the private certificate from the website, obviously. You don't have the private certificate of Google.com on your machine, so you cannot re-encrypt the traffic. So, Squid will need to have his own certificate and it will encrypt the traffic with its own asset certificate. And you will obviously not trust it. You will not trust your local Squid proxy to sign something from Google.com or AWS or any website or Linux Foundation. So, when the HTTP client receives the custom asset certificate, it will just say, no, I don't trust you. There is a workaround and it's written in the script documentation, obviously, which is create a wildcard certificate, which is a certificate that will be valid for absolutely every website on the planet, every DNS, so it's kind of a dangerous asset certificate. And you can install it on every of your HTTP clients. It's possible, but it's really crappy, honestly. That's the first problem. The second problem and that there is no way to work around it is that when Squid, when you try to download multiple times the same artifact in Squid, so, for example, you have two connections downloading the same root FS, Squid will download it twice and stream it back to the clients at the same time. And when it's downloaded, it's finished, then the third connection will have a cache version. But as long as it's not cached locally, it will re-download from the start. And as I said before, our system is by-designed running everything in parallel, so it's often the case that we have multiple downloads of the same artifact at the exact same time. So when using Squid, it was just not caching anything. Sorry. So that's why we created KeysCache. So Keys stands for keep it simple, stupid. It's a pretty simple and stupid caching service. But the main features that it has are exactly what we need for a API system. It allows to cache HTTPS resources without any acts or anything. It allows to download only once, even if you have multiple clients and they will all get a stream back, the stream of data back. And the reason why it's not, it's working for both cases is that it's not a transparent proxy. So it's not like clients that will know from an environment of the Bible that it has to go through a proxy. Instead, you have to prefix your URLs. So if you want to access example.com slash .fs.x4, for example, you have to prefix it by your KeysCache instance. So even if you're downloading over for HTTPS, your clients know that it goes to KeysCache and not example.com so that it's expecting a certificate from KeysCache, not from the original website. That's the first reason. And KeysCache also, we made it so it knows how to stream back to multiple clients, the same content. Fun thing, we also added a lot of automatic retries inside the KeysCache backends. So if for any reason, and it happens a lot, the connection between your network and the S3 like bucket breaks and it often breaks, honestly, KeysCache backend will automatically retries. This is a list of HTTP codes that we're retrying automatically. And it will also, so when it's retrying, it retries up to 50 times over a period of two hours because we had exponential backups. So sometimes a download will actually take two hours and 50 retries just because the S3 like bucket is just sometimes a bit buggy to answer. We also added partial download, which when you have, we do a retry, if the HTTP server knows how to do that, we only download the remaining content, not from the start. And the good thing is that with the automatic retries, the client will never see that there is a broken connection because from the client to KeysCache, the connection is kept alive. It's only the backends that sees the network issues. So it has been in production for 3.5 years. It downloaded 32 terabits of data from internet and served 1.6 petabytes of data locally just for a really small tiny software, which is an expansion ratio of 51 times. So we divided the network usage by 51 just by having a small working proxy. It also improved a lot of stability thanks to the automatic retries, I said, up to 50 retries, which is insane. And it also lowered a lot of the S3 egress cost because you have to pay for egress in the cloud. When you, for 1.6 petabytes of data, that's a lot of money. So yeah, we saved around 150 K of euros just by having a local proxy. Just because I have just two minutes, a look at the global architecture of the service, it suggests a Django application with a salary backends. So you have a reverse proxy and Ginex. It can be any reverse proxy in fact, that will receive an HTTP connection. It will send that to Giniacon, which is a Django runtime. The Django will see if the, we look at the database, but at the base, progress, to know if the artifact has been downloaded already or not. If it's a case, it will then look at the file system and just give that back to Ginex saying, please send that to the client. And I'm done with it. If it's not already downloaded, it will send a message to Redis that will spawn a salary task that will actually do the download and retry in the back end. And it's done only once. And it's then saving it to the file system, appending to a file, byte by byte. And at the same time, the Django process just reads the file on the file system and sends the bytes where they are available. And that's all. Waiting for the file, the file to be just finished. And if a second or third of many different users arrive for the same file, then they will just reuse what is already available in the file system and wait for the download to finish. And that's all. That's all. It's pretty simple and efficient. And it has been a really good use for us. And it might be useful for your CI system. So if you have any questions, I will be here after the talk. Thanks a lot. Thank you.
Reinventing database exploration with Azimutt
Welcome everybody, let's get started on the next session. My name is David and it is my pleasure to introduce Loïc Nuchel, who will be speaking on reinventing database exploration with Azamut. Thanks a lot. Hi everyone, thanks a lot for coming to my talk. Indeed, I will talk about Azamut and how we can explore the database with it. My name is Loïc Nuchel and I am principal engineer at Dr.Libb. Basically, the whole talk is a story about how I started at Dr.Libb and now I am here talking to you about Azamut. Three years ago, I started at Dr.Libb. I joined the company so if you don't know Dr.Libb, it is a French company in healthcare, allowing patients to book appointments with doctors. Basically, it is built on big monoliths, on rubric and rails backed by PostgreSQL database. Basically, it is a huge monorepo and also a huge database with 800 tables inside and several petabytes of data. As an architect, I joined Dr.Libb to work with the team and help with architecture, improve the code and things like that. But also, for that, I have to understand what is inside the database, what are the models and what are the relations. The thing with rubric and rails is you don't define the properties inside the models. You just define the relations, but often the models are quite long. They can do like 1,000 lines long and sometimes the relation is as far as 100 lines or something like that. That is not really convenient and I had to look inside the database a lot to understand what are the things and how it works. Basically, that is me working at Dr.Libb for the first month and obviously, this is not very friendly. I had to find a tool. I looked for a lot of tools. They are called ERD, and they show tables with relation and nodes. As you can see, this is not very friendly. Here is the 10 tables. Imagine 800 and you will have some trouble. I tried quite a bunch just for you to have a look at what they look like. Basically, the NWAS NAP failed. For a few reasons. The first one and most of you find is all of the tools I could find will show everything. When you show 800 tables, you don't understand anything. The second one is most of them don't have an SQL or database import. The last one is they are not private. Basically, I had to upload the schema to the service and I don't want that for Dr.Libb. Basically, when we are a developer and we are in this situation, we do another tool. That's what I did with the big one goal to make it easy for large database like 800 tables again. You may see tables with a lot of columns like 100 or something like that sometimes. Locally, this is not for us, but this is a possibility to stay local and just have it in your browser. Not send any data to the service and of course open source. The first part was the schema exploration. When you load your schema into azimuth, you don't see anything. You just see a search bar and an empty screen with some situation. The goal is to look for tables with the search and just load the table you are entered in. Mostly, if you are working with a big database, you don't want to see everything. You just want to see one, ten tables around your scale, your feature or something like that. You can do some nice diagram like this with choosing the table and the column you want to show. Also, you can navigate from one table to another following the relation. Obviously, the foreign key with outgoing relation, but also the incoming relation coming from the primary key from the other relation. That's pretty nice to expand your diagram and explore what's around. Of course, you don't see everything. You want many layouts. One per scope, discovery, team or anything you want, but several layouts of your database to understand it. The last thing is sometimes in the database, you don't have foreign key for all the relation. Sometimes for performance reasons, sometimes for reliability. There are a lot of ideas around there, but sometimes you don't have the relation as foreign key. So, azimuth can infer and suggest them directly inside the diagram. The last feature on the stream exploration is a fine pass. If you want to join data from one table to another and you don't know really all the tables in between, it will be a good one. Basically, when developing this feature, I was very surprised about how many paths there are. You will be surprised too. Basically, that's also a good idea to have a look at that. The second thing is when people are starting using that, like on read-only on the database schema, they wanted to draft new features on it. Basically, doing some more design for the database. I made a DSL with an explicit goal to be very simple. Here is a bigger version if you want to read, to just write the table name and the column name with two space before. Then you can add some attributes like the types and some primariki, unique index, new label and things like that. The goal of this DSL is to be very simple, very quick to write, to go as fast as you and you can flow and your figure can type. When you do this kind of exploration, sometimes you have some discovery and sometimes you want to write them somewhere, maybe for your colleague but also for your future self, again the exploration. There is a lot of documentation. Of course, the SQL command from the table, so it's loaded and accessible into azimuth. Also, the nut into the table, this is the same thing as the SQL command but inside azimuth you can edit and view it easily. There is some tag also to find type easily and of course there is the same thing for the columns. So the SQL command, some nut you can add. The nut are in markdown so you can do the formatting with images if you want, links, lists and things like that. On the layout, you have one layout for what you want and you can document them with some memo inside. Same with markdown, you can put image, link, whatever you want to explain the whole schema, some part of the schema, you can have the color behind. You can also have table groups to show that tables are in the same position or in the same context. That's how you can do the documentation for azimuth. The last part I did not long ago is the data exploration. Basically, before we were only on the structure, on the database model, but sometimes you want to go a bit deeper and understand the data inside the database and how it's working, what you can do. I think this is quite interesting. When you open the detailed sidebar for the table, you have all the details but also you have all the columns with the sample of the data inside. This is random picked data, not just a row with everything but I avoid nulls and things like that. You have interesting data to show here. The same for a column, when you open the sidebar for a column, you have the most used value, the count of rows, the cardinalities, the number of nulls and things like that to know a bit. What is inside this specific row? That's for the quick access but you can also do some full query from it. We have a visual editor for very simple query, like a table with some filter, but also you can write any query you want to have the result. Basically, you have all the results on the right in a list so you can see different results and have some nice features to filter, to sort and things like that. The most interesting one is this small arrow here. You can click on it and see the relative row on this part. Here, I selected all the events, like on the CFP database, we have the event but they are linked to a group here. You can see in one click that it's the human talkspire group which is the link row on this event. This also works in a nested way so if you scroll down and see other relations in this sidebar, you can have multiple sidebars that are stacking to navigate from one row to another. Basically, this was quite interesting but the very nice thing here is you can add this specific row, so one row and data from a table into the diagram. You can add to the layout and see this row specifically so this is not a table anymore, this is a row of data with of course the table name and the table column, but also with specific data for a specific row. You can refresh the query again but with all the data. Same as the layout, you can navigate through the row inside the data. If you click on the primary queue again, you will have all the linked tables and for the table, all the linked rows, with a maximum of 20, because sometimes it can be very expensive, for example, event or if you have some tracking things, you can have thousands of them. Basically, you can see easily what are the linked row, what are the incoming links to this specific row and not just the table in the schema. And then if you click of course on a specific one, you can show it. And the same is for foreign key, so if you click on an outgoing relation, you can just show the relation with it. This allows to do some nice diagram with not only the table in the schema but also actual data from your database that sometimes is interesting to show that you have several rows in the same table like here. And of course you can mix both having on your layout, having your schema, so the table above and the table below. So this is very small, it's not intended to read, but on the right you can see there is several times the same different row on the same table, on a clear blue. So I think that's a very interesting way to navigate into the data. So if you want to try it out, so it's available on azimuth.app, but there is also a nice CLI to load any database almost here, so you can just do NPX azimuth explore and then your database URL. It can be of course a remote URL but also a local one, you will start to get away on your machine which is just a node server to proxy the query to your database. So it also works with local database which is like I think the one of the only tools that I can do that. So thanks a lot, you can try it on azimuth.app, it currently works with several database, so major relational database but also some document database. And basically for relational database when you have a json field, a json column, it inspects the json column selecting 100 non-empty row and infers the schema from it so you see directly the schema of your json column inside azimuth. So this project is fully open source, I've been working on it for a bit more than two years and basically I intend to develop it a lot more in 2024, so if you are interested with it, there is a survey with a QR code and I will be happy to have your feedback on what you thought about what I presented but also what are your current problems about the database, what you expect to see from a tool helping you interact with the database and so on. Thank you all, there is still two minutes so maybe I can run one or two questions. Is there any supplementation when you explore a state or really state database? Yeah, there is several things, so it's made for big database, so basically the table is 100 tables and the biggest schema I think is 1000 tables, 1,500 tables, so there is no issue extracting the schema. There is more issue and basically that thing I will address soon. When you explore that, basically if you have a lot of data inside your database, the quick show of the value into the table and the column can be quite hard to get, but after that you just run some queries. So you will have performance issue if you do queries that take a lot of data but the queries run on your database are not linked to azimuth.
Passbolt - Open source password manager for teams
Next we have Remy Bertot with PassBolt open source password manager for teams. Thanks everyone. I just wanted to maybe we could give a little round of applause to Elina and David who has been running this room. Thanks a lot for volunteering and organizing this. It's really nice for me. So I'm Remy Bertot. You may remember me from other open source projects, minor and horrible contributions to mail-village, open social and some other events that we do. But here today I'm here to reprise my role as the co-founder of PassBolt. Before we start I just wanted to show you this picture which was taken 20 years ago. I think we can appreciate the amount of swag in this picture. If they are GenZ in the room, 20 years ago Facebook didn't exist so we didn't know that this picture will come back to us. So is anyone not using a password manager at work? Can you raise your hand? Okay, I can see the PassBolt developer raising their hand. We need to have a little talk about this after. But for the other ones that raised their hands, I would like you to meet with me after because I want to learn how you do to live without fear. So this is what we did. We built an open source password manager that is designed for collaboration. So you can share secrets with your teams. So you have granular access and permissions to every single resources and folders. It's available by default in the browsers. So it's designed specifically for collaboration. So as many password managers, we also have a quick access so you can basically see this little annoying icon in every single form when you browse the web. We also have Android and iOS applications. So it works with biometrics and you also have multi-account. So for example, if you have a personal account and a professional account on the same phone, then you can basically use both. They work with the autofill as well. So it's like automatically fill the form for you based on the API that is available by the phone. And since you guys are most likely developers, we also offer command line interface and SDKs. So if you have KERL and GPG, you can pretty much talk to Passbolt and like pipe the output of the KERL request to GPG and decrypt the content. So like it's pretty low footprint when it comes to like working with the API. We also have like Ansible plugins and like tutorials on how to integrate this into your GitLab pipelines. So basically you can use it as a secret manager, not just as a password manager for your team. You can also store like credentials for machine to machine authentication if you want. We also provide, one of the goal of Passbolt is to make sure that administrator of a Passbolt instance do not have to do a lot of work. So like we provide native packages for Linux distribution, pretty much all major distributions. And we keep adding some of them. Like this year we did the end charts. So if you're one of the cool kids and you use Kubernetes, you can basically get started with Passbolt as well. This really low maintenance, we have like people running Passbolt for years and not updating them and then updating them to the last version with just one command. So like we try to make your life easy. Obviously, we are not the only password manager in the space. So like a lot of people come to me and are like, oh, how are you different from this and that. So main difference with KeePass is like KeePass is a file. So like we are like an API and we do user management. So KeePass is great if you basically want offline access or if you don't want to share data. If you share with KeePass, you're going to have some issues with concurrent access and versioning. But it's great if you do not want to have any metadata and you just want a file and everything is encrypted. So like KeePass is really great there. With compared to Vaultwarden and Bitwarden, we have like basically different security properties like we use a completely random private key. With Bitwarden, for example, the encryption strength is depending on the password that is selected by the end user. So there are other password managers like one password that use a private key, but basically we are the only other one that does this. We require the browser extension. So this is also a key differentiator is that you must install Passbolt browser extension. Why is in case the server is compromised, an attacker cannot change the application. So they cannot change, for example, the cryptographic functions or like add some code to extract the decrypted materials. So this is one big difference. So when you access a Passbolt instance, you feel like it's a website, but it's actually not a website. It's an iPhone that is inserted in the page and the server cannot basically access what is in that iPhone. So it's basically a change of architecture. And we support, as I said, like nested folders with flexible permissions like in Bitwarden, you basically create collections and you put items in them. We basically support granular sharing and granular secret management. So for example, the secrets in Passbolt are encrypted once per person so that, for example, if somebody leaves the organization, we are able to provide a revocation. So it's not like we order Password Manager, which will have like a symmetric key that is shared with many users and they will not rotate that key for the collection. So obviously Bitwarden supports features that we don't have. So that's why people adopt Bitwarden. But what happened in 2023? So one of the major events of 2023 that we had is our site of the head of site reliability management at Passbolt got married. So we thought it will never happen, but like maybe we can give him a little round of applause for getting hitched. Well done, Diego. Now seriously, this year we shipped a single sign-on with OpenID, Microsoft, and Google. And I'm very pleased to announce that the OpenID connector will be soon available in the community edition. Both community edition and the Pro edition are completely open source. So they are both under AGPL. The Pro edition will require you to pay something, but obviously it's open source software, so you can do as you please. So we also shipped another interesting feature, which is like TOTP, which allows you to store your TOTP code into Passbolt and share them. So should you put your password and your TOTP code in the same application, it is up for you to decide. It's interesting, for example, if you want to share them, but if you don't want to put all your eggs in the same basket, I will understand. But the same way, like you need to look at all the risks, for example, if you use the Google Authenticator, if you have the sync enabled across device, your TOTP code are not end to end encrypted, so you might want to have that as well. We did also Passbolt.exe. It's not like my pet project, but like the 80% of the users of Passbolt are on Windows, so obviously this is something that they wanted. We plan to support more OS in the future, but we started with the biggest chunk. So it's a native app. It's not like a JavaScript application, these guys as an app. It's a native app. We did that for security reasons, because there are some properties when you use Electron that are not so great, so we basically spend a lot of time building that app last year. We did also a lot of other things. So we did some performance improvement. We changed the grid so you can select which columns are there or not. We introduced some role-based access control for UI. So for example, if you want your users to have less features, then you can just remove everything and feel like it's 2005 again. Then we have also suspend user. So you can, for example, if Diego is going on Honeymoon and you want to disable his access, then you can do that. And we have a lot of policies that were rolled out. So you can control what is the strength of a password by default, how long should the passphrase protect the private key, that sort of thing. So we also had four security audits, one on LDAP, one on SSO, one on our network, and one on our internal controls and processes. So we didn't have any major issues, but if you are curious, the reports are available on our incident page. And all the audits were made by QoP53, so basically it's legit. So what's cooking for 2024? I mean, it was supposed to be released in 2023, so this is coming next week. It's password expiry. So when somebody, for example, leave an organization or leave a group, or if you remove a permission on somebody's access to a secret, then this feature will mark automatically a password as needed to be rotated. So this way, as people come and go and access systems, you know which credentials needs to be rotated. I think it's a pretty interesting feature when it comes to security. And for organizations that are like masochists, you can also set policies and, for example, say, I want all my credentials to be rotated every 40 days and make my employees life a nightmare. You can do that. We also redesigned the admin panel. So, well, I mean, it's not like a lot of work, but I still think it's neat. And we have a lot of stuff coming. Also in the 4.6, we have SSO with ADFS, again, like Microsoft. But all right. And then we have a bunch of performance improvements. So we have people telling us, OK, I have 100, I have 12,000 secrets that are shared with 200 people, and it's slow. All right. At this point, is it really a secret? I don't know. But yeah, we're going to try to make your life a little less miserable. And we're going to have icons, custom fields, a bunch of other things that are missing in the software right now, like file attachment and nodes. And we have seven babies scheduled. So for a team of 40, we have like, I don't know what happened last year. I think like the summer of love of Passbolt. What is it made of? So the secret ingredient is love. I'm not going to spend a lot of time, but it's basically a simple PHP LAMP stack and with a web extension on top. So the application is split into multiple projects. The Passbolt API, which is based in PHP. Why PHP is because any administrator in their life have hosted some PHP, so it's pretty easy to use and like low maintenance is very stable. And we have the web extension on top, which is like split into two. We have a React app, which controls the UI. And we have what we call the background page, which is a separate application. And the front end and the back end to talk through a series of API. This is to make sure that anything phishing and happening on the front end can be controlled. So we have like a choke point where we can define like, is it normal for the front end to do these kind of things? So if you want to look at the JSON API, we have basically documentation online. And if you want to look at the style guide, it's based on storybook. So basically you can add or customize your teams without having to build the entire thing. And this is how we split the work. We have people working on the front end and we have people working on the back end as well. So I'm not going to take questions. But one of the things I wanted to tell you is that at 2 o'clock, we will be next to the bar and we will be like spinning the wheel like we did last year. And this year, the price is a picture of Mackey, which is our office dog. It's my price position. So if you want to come get it, it's at 2 in front of the bar. Thank you very much.
Kùzu: A Graph Database Management System for Python Graph Data Science
Next we have Prashant Rao with Kuzu, a graph database management system for Python graph data science. All right. Good afternoon, everyone. So my name is Prashant. I'm an AI engineer at Kuzu. So I'll be talking about graph databases today. Just a quick show of hands, how many people have worked with graph databases or heard of them? Fair number. Okay. So you're in the right room today. So I'll outline a bit about what I'm going to cover. I'll start with what graphs are for those who are not familiar. And then when you need graph modeling, I'll also cover some of the features of a competent graph database management system and what that means. And that leads into the vision that Kuzu has, both as a GDBMS or that is a graph database management system and as the go-to solution for graph data science. And I'll end with a walkthrough on how Kuzu makes graph data science workflows easier for the developer. So the first question we must ask is, what are graphs or networks as they're sometimes called? They are an abstract representation of entities and relationships. Essentially an entity is represented as a node and the way these are connected together is represented by an edge, which is the relationship shown in this figure. And as the figure in the bottom shows, these can get pretty complex and reveal really interesting structures about connected data. And that's exactly what we see in the real world. Graphs are actually one of the most natural ways to represent data. Social networks are of course something we are very familiar with, but graphs are very prevalent in many other domains, all the way from drug interactions to molecular networks to traffic networks. In the world of finance, you analyze transactions for things like fraud and they also are very common in knowledge graphs that encode factual information about the world. Kuzu is a graph database management system, which is a class of database management systems. So I'll start by giving a general overview about GDBMS. You generally have three components to any database system. You have the data model, you have the query language, and you have the system implementation. From the data model perspective, graph data models differ from the conventional relational data model in the sense that you typically represent the data as nodes and edges. And you have key value properties on these nodes or edges. In this example, with this triangle you see here, you have a cyclic relationship of transactions between people, one, two, and three, where the nodes one, two, and three have property information on the name and the edges have the amount of the transaction as a property. So this is called the property graph model of graphs. And it's very, very common and prevalent in the industry. But there's also another data model called RDF, resource description framework, which has a similar concept of subject, predicates, and objects, which represent a triple. The triple is a basic unit of data in the graph, but it's the same idea as the property graph model except the implementation is different. From a query language perspective, every graph database management system needs a high-level query language that's designed specifically with graph syntax. And an example of this is shown here. This is the Cypher query language, which Kuzu implements. And incidentally, Cypher is the same language that was invented and popularized by Neo4j, if anybody's used that before. But what this example query snippet shows is you have node information, A and B, of type account, and you're matching on those nodes. And then you're running a query, a joint query equivalent in a way that reminds you a lot of SQL. It's very declarative and it's very high-level reminiscent of SQL. From a system implementation standpoint, universally, I think it's hard to come with a statement that covers all graph systems, but in general, they implement storage structures, indices, and operators that are specific to graphs. One example is the shortest path operator. These are operators that are not prevalent in relational systems but are very common in graph systems. There are many reasons why you might need graph modeling, but I'll cover just a couple of them in these next two slides. For an example, let's take this query where we are trying to find direct or indirect possible sources of money flow into a person's account from a particular location. So in this example, the person is Alice, represented by node B in this Cypher query, and we are matching on the owner of that account, which is Alice, but also matching on account A, whose location is Canada. The key here is that the middle portion, which is the transfer star, that star syntax is a high-level general syntax called clean star. It's used to implement indirect and recursive joins. As you can see, the query is quite concise. It's quite readable. You can do this sort of query in SQL, but it's a recursive query and it's not as easy. It's going to be a lot more verbose and not that easy to read. One other example of this would be the shortest path query, which is a lot harder to do in recursive SQL, but in Cypher, it's very, very straightforward. It's just an additional clause that you add attached onto the previous query. Another case where you need graph modeling is in heterogeneous data. This example here shows an example of Dbpedia, which is a structured version of Wikipedia, and we're taking this example of the location we're in right now, University, Liberator and Brussels. On the left is the way it's stored in structured form, where you have key value properties, and each of these properties links to other properties. But on the right, we schematically represent that as a graph. As you can see, the university is linked to the city of Brussels. It's also linked to the country of Belgium, what affiliations it has, and each of these individual resources can be linked to other resources. This actually expresses the power of a graph model, because doing this with a tabular form of data would be almost impossible, because that's how Wikipedia is structured. It's a lot of connected information. This leads us to the question of what is a feature set of a competent graph database. We list a few of them here, but I think it's very difficult to go through each of them in the time we have. We do have a blog post that covers this in much more detail. It's called What Every GDBMS Should Do and Division. But in a nutshell, every GDBMS has to support things like many to many growing joins, recursive joins on top of heterogeneous data sets, for example, knowledge graphs. Another thing that we can highlight here is the schema querying aspect, where in this last example, you have account information and transaction edges, and you're able to query on the type of the edge. Let's say you have two different kinds of transactions. You don't want each of those on either side of the middle node to be the same transaction type. You're able to say, you apply a predicate on the edge to say that you don't want nodes of a particular type. This is the sort of thing that you can't do in SQL. You can only do this in a graph model. The vision of Kuzu as a graph database management system is it aims to represent the state of the art of how graphs should be stored, indexed, and queried. It does this by being highly scalable to several terabytes of data. It's very fast in terms of query speed. It supports the property graph model, which we described earlier. It also supports the RDF data model, which is going to be coming in the next release. It does so via a high-level query language, Cypher. It's easy to use and uses an embeddable architecture. We like to think of ourselves as like duck DB or SQLite, but for graphs. If you ever come across either of the other two relational systems, Kuzu is like a graph analog to those systems. I should also note here that Kuzu is based on many years of research at the University of Waterloo. It's now being developed in an independent company called Kuzu Inc, which we're from. The other big vision that Kuzu has from a data science perspective, specifically graph data science perspective, is to be the go-to back end for graph modeling and data science. Essentially the vision here is if you look at the bottom half of this figure, you have a lot of data sitting in disparate sources like data lakes, warehouses, relational databases all the way from Postgres and many others. There's a lot of interoperability challenges that you have with these data sources. Even though you have structured data, in many cases working with them as a graph is quite challenging because of the movement of data across the systems into a powerful graph database back end. Kuzu aims to be a simpler way and sort of an interface to that. In the upper half, the aim of this is to make graph data science much more accessible in the sense that we provide zero copy access to the data by writing out the format that is native to those libraries. For example, PyTorchumetric, NetworkX. These are popular graph data science libraries and machine learning libraries in Python. By being well integrated with the Python data science ecosystem, we believe this makes it a lot more achievable. I'll quickly walk through an example of how Kuzu makes graph data science easier in terms of a workflow. Let's consider this real world, a simple example, a toy example, where you have two different data sources. You have people and the movies that they watched. You also have people and their friends and where they live. These are two different data sets. Your goal is to use the information from this data to build a movie recommender system where a person who has watched certain movies gets recommended other movies. There are many ways you can build such a recommendation system. One we'll cover here is using a graph neural network, specifically using link prediction where you're trying to predict a recommended edge between a person and a movie. This is a very simple example where you have data set one which has the persons and the movies with some additional metadata, could be age or any other attributes. Then data set two has persons and what friends those persons have and where they live. For those who have not worked with graph machine learning before, it's a very high level overview in this slide where the goal of graph machine learning is to embed the nodes and the surroundings space into a vector space. The benefit of this is that it incorporates the structure of the graph based on the nodes and their surrounding neighbors. The idea is that you perform a computation on the graph nodes and transform the features of that graph into a feature vector like the array shown there. The idea is very similar to the kind of vectors that you may have seen in other domains like computer vision or natural language processing where the only difference is that in those domains you are considering the similarity between words in a sentence or pixels in an image whereas in this case you're considering the similarity of the topology of the graph itself. All of that is great but when you're working with the data, you're immediately faced with a problem. The data you have might exist in different sources. For example, the movies watch may exist in Postgres and the person friends may exist in other structured data sources that you export to CSV or Parquet or something similar. You need to bring them together to form a graph. Conceptually, this is how a graph would look. You have nodes that represent the persons. You have edges that represent the movies that they watched in the first graph and then in the second one you have edges that represent friendship between people and what cities they live in. The moment you do that, you have another problem where you potentially have overlapping data or duplicate data between these two subgraphs. In one of them you have the persons and the cities and the other ones you have persons and movies. Many of them might be the same people. There is some deduplication logic that's required where you have to merge people with the same attributes and there's some custom logic that needs to be put in terms of how you decide whether something is a duplicate. Once you do that, you have the final result where you have some nodes that are dangling in the sense that they have no edges that attach them to other nodes. These actually don't inform the machine learning model and would need to be removed. To do all of this is actually quite tedious if you were to write your own custom logic in your own language of choice. Where Kuzu comes in and where it's very powerful is the ability to just install an embeddable library using pip install Kuzu in Python. Once you have that, you're very rapidly able to run query execution to create the tables, load the data in and perform deduplication logic and dangling node removal using a high-level query language like Cypher in a way that scales to the size of the data that you have. In many ways, you don't have to worry about the scalability problem because now you have a high-level query language supporting your operations in the middle stages. Once you have all of the data and the features that are loaded into a graph, you essentially can walk through this process where you not only have the data in the right form, but you're able to actually encode the features into the graph and store that on disk. One of the biggest limitations with PyTorch geometric is if you ever worked with it before, is it's very memory intensive when you're working with large graphs. Kuzu helps a lot in this regard by persisting the features onto disk. That's exactly where I think we want to highlight this point. I think we are almost out of time, but I'll wrap up by saying the key points to take away. Kuzu is an in-process analytical graph database system, kind of like DuckTB is in the SQL world, but for graphs. It's highly scalable and optimized for multi-core parallelism and very well integrated with the PyData ecosystem, including NumPy, PyAro, NetworkX, PyTorch, and so on. It supports both the property graph model as well as the RDF graph model via Cypher, a high-level query language. It's embeddable and very easy to use from your application. It's also accessible with other language bindings, not just Python. If you come from other languages, those options exist as well. That's it from us. Kuzu is an open source, very permissive, licensed, MIT licensed project. I'd love for everyone to give it a try and reach out to us on Discord. We're always open to chatting more about graph use cases. Thank you.
Testing Containers with Python and pytest
Okay, next we have Dan Chudamak with testing containers with Python and PyTest. Wow, thanks. You haven't heard the talk yet, but thank you. So, first, the boring part, I'm Dan, I'm a software developer working for SUSE, I do other stuff, but since we only have 15 minutes, I'm just going to jump right into the meet, and that's why should you test containers? I'm not going to answer that, please test your containers if you deploy applications or anything else. And the first question usually people ask is, why don't you use shell scripts? Because I mean, shell scripts, they are super portable, they run everywhere, and shell scripts are also pretty fast. And given that shell scripts run everywhere, and they are so super-duper portable, everyone understands them. Apparently, I'm not everyone. Because in my opinion, shell scripts are very brittle, especially once I have to do string modeling, that's the point where I start to test my tests. And if I need to write tests to test my tests, I think I'm doing it wrong. You can disagree, but so let me give you the short sales pitch, why you should use Python, PyTest, and especially this is about a PyTest plugin that I wrote that's called PyTestContainer. So, what this thing can do for you is, it handles all the boring plumbing part of a test suite for containers, like pulling images, building containers, launching everything, and cleaning up, and not leaving you with terabytes of stale data. It uses the Python test-infra module, in case you know Python, PyTest, this is just another convenience module to just access files, check whether there are open ports, stuff that you can do with the Python standard library, but it's just another convenience layer there. One part that took more time than I care to admit, but that I'm moderately proud about is that the whole test suite is designed, it supports parallel test runs. So, if you use the PyTest X-Test plugin, it allows you to execute all your tests in parallel, so assuming you have 500 cores, you can run 500 tests in parallel. And the whole thing also works if your container images expose ports, provided you don't open a thousand ports on each and run 500 tests in parallel, then you'll run out of three ports. But there's tools for that. If you're using Podman and not just Docker, it can also work with Podman parts. You can also create abstractions to create container volumes and it will clean up after itself. Also, if you're more into the area, I have an application and I want to check whether it works not just on my box, but also on Fedora, CentOS, Debian, Arc Linux, Alpine, and whatever else there is in the world. You can just define a set of tests and you tell the plugin on which container images to execute them and it will do that for you. So, that allows you to have the same set of tests and run that on different containers, which would be, that would be more into the area that you are looking to test an application. It works with Podman and Docker, you just change it by changing environment variable. And if you happen to be in the lucky position to support enterprise-grade software that's very stable, hence very old, it still works with Python 3.6 and on all the important architectures, which took also more time than I care to admit. So, let's just take a look at a very simple example. This would be just a typical Python test file with your imports, then you define your container image. In this case, it's just the open-source-at-umbleweed image. And then you define a very trivial test. And in this case, what you can see here is, so this stuff here, that's really what testing for shines and it just takes a look at the O.S. release there. Very simple test, but you could do more elaborate examples. So, what are possible use cases? Of course, you'll just base images, you could test those, you could do applications inside containers, and another test, another possible use case would be you have an application and you want to check whether the application works on multiple O.S.s, but you don't need virtual machines for that. Then you could use PyTest container for that as well. So, I guess if you're in this talk, you might know a bit about PyTest and as the name suggests, it's a Python testing framework, otherwise the name would be very bad. All it really does is assemble tests, so it's like unit test on steroids, executes all test functions. And one thing that PyTest container uses extensively are fixtures. If you're not known to PyTest, you probably know setup and tear down functions from all other testing frameworks and PyTest fixtures are kind of the thing there. So, a fixture is really just a parameter for a test function and it can return a certain value and before that do some setup, give you something in this very simple example. This is from the PyTest docs, so it would give you, it would for instance create a mock SMTP connection and for the PyTest container, it gives you a connection to the already created container. And another cool thing that PyTest has is test parameterization, where you can just define multiple parameters. So, in this case, you would get your, you would have a test and you want to execute it for all combinations of those values, so it would run your test for all the, for the whole Cartesian product, so all combinations of 0, 1 and 2 and 3. Let's just jump into a few usage examples. So, in case you want to build new containers, you just define yourself the base URL, you have the, you define yourself the docker file and you create the creatively named class derived container and if you didn't already see, I shouldn't be in charge of naming things because I'm terrible at it, but I'm not very creative. But so, what will happen now if you pass this, if you pass this created class into your test function, the plugin will first pull your, pull the space image, it will build the container on top of that, launch it, pass it into this, pass it into the test function and once the test is executed, it will get cleaned up. You can also, so you can also define pass in other already created containers into this as a base, that all works. I have an example for that later. As I mentioned, binding free ports, you might not say, why don't you just add a parameter somewhere, okay, expose port 8000 on the host and that works as long as you only launch, as you don't launch tests in parallel. So, if you want to launch, if you for instance want to test this specific container five times in parallel, you can't bind all of them to the same port and for that, there's a relatively simple abstraction. So, you just create this port forwarding class, pass it into the container and then it will get exposed in the test. There, you will get the host port and this is inferred automatically on launch of the test. If you want to test ports, so this is very apartment specific, works rather like this. So, essentially a port is just a combination of containers and the only really interesting part that you want to use it for is again port forwarding, works exactly the same like with containers. One little catch. So, so far I was telling, I was claiming that your containers would be launched after the test and destroyed after the test and that's not entirely true because most tests don't modify the container and then you can get away with creating your container before all tests and tearing it down after all tests and you save actually a substantial amount of time. But if you decide to do tests like these where you try whether RM minus RF actually works, then any subsequent, any subsequent test will fail and start burning. So, and therefore there's a different fixture that's called just container per test and that will actually ensure that all, that you just get a fresh container for every test but it costs extra. So, but then you can also RM minus RF everything in your container and the subsequent container will still kind of work. For the case where you decide, where you want to run a bunch of tests, but you don't want to do the whole pie test parameterization before that, you can just dump all your containers into a global variable that's called container images and pie test will do the automatic parameterization. So, in this case, these, all the tests in the test module would get executed with all these containers and that's for instance what, what we're doing in, in the, in the, for Kiwi test functions where you just want to ensure, okay, they work on CentroStream, Fedora, DBN, R-Clinux, etc. Pp. What I've, what I hinted on previously that's dependencies between containers which is essentially just you want to build a container based on another one, based on another one, which would essentially be just you, you split up your, you would split up your Docker file. This might sound like maybe a weird idea at first but it can be, it can be relatively useful if you have, if you want to check different base images and then build stuff on top of them or you want to, or you decide to modify your base image. So, we have used this relatively extensively in the BCI test suite and you can, you can simply create containers that derive from others and that derive from others and you can do this relatively extensively, just don't add loops. That, that will not work. In case you want to, want to check whether, for instance, the environment in your container is still, is what you, what you expect or some config and you don't want to mess with the JSON that Docker Inspector, Portman Inspector gives you, that's also to a certain extent implemented. So, you'll get, you'll get a Python usable version of the inspect of a container where you can, for instance, check what's the user in there, what's the CMD of this container, if there's something in the environment and other stuff. Since I'm nearly out of time, I'm going to jump over this since it's not really that interesting actually. One important thing is if you are, so if you create an application in your container, applications usually take time to launch, please use health checks. Health checks are cool. I know they are not part of the OCI spec, that's a bummer, but please use health checks. Because if you start using a test suite and you, and you yourself execute a test manually, you launch your container, you try to curl where the application is there and it all works and then you automate it and it suddenly fails because the machine is much faster than you are. And your application is simply not up there. What PyTest container supports, if your app, if your container image has a health check defined, it will wait until the container is healthy. And as long as it's not healthy, it will not execute your test. If it becomes unhealthy, your test immediately fails. So, if you add a health check into your container file or if it's just in the image, then it will wait and it will start execute the test and you can always be sure that your container will be healthy. And if you don't want that for whatever reason, then you can just say, okay, I don't care about health checks. Then you just define the timeout to be minus one or you comment the health check out, but well. And then you can, then your container might not, might still be starting or it might be unhealthy. As I said, by default, it will use, it will pick Portman, but you can just set use Docker. And what I'm, as I said, moderately proud is you can just run your tests in parallel. That also works with port forwardings. So, that can save you a lot of time unless all your container builds run themselves in parallel. So, then you're not going to save a lot. The thing cleans up very well after itself. So, if you create containers of volumes or parts or temporary directories, all that gets cleaned up, images and intermediate layers are retained because otherwise it would just take forever and ever. There's a few people that use it. Most of them are those that I bullied into that, but maybe you find this useful and you will be on this list. And since I'm out of time, thank you.
GnuCOBOL, the Free Industrial-ready Alternative for COBOL!
So next we have Fabrice Le Faison and Simon Saubich with GNU Coble, the free software alternative for Coble. Thanks for the introduction. So if you need to just remember one thing from this talk, is that GNU Coble has now 20 years and is ready for industrial use. He has reached industrial maturity and can compete with the proprietary offers that are also available. So let's introduce ourselves. I am Fabrice Le Faison. I am the founder of a company called OCamel Pro. I have been a researcher at INRIA before and I've been in the open source ecosystem for a long time. And now I am contributing to GNU Coble since two years ago. And Simon Saubich is a former, still a Coble developer. He has been since 2006. He has been a contributor to Open Coble since 2008. And now he is the project leader of the GNU Coble project. So what is Coble? If you haven't seen a Coble program before, here is a very simple one, Hello World program. So as you can see, there are three divisions. One for the identification of the program, then a division for all the storage, so the global variables and file descriptors and so on. Here you just have one variable, which is a string of 80 characters with Hello World inside. And then you have a procedure division where you have all the statements. And here you just have one that displays the variable. So you have seen your first Coble program for a lot of years. So Coble was born in 1959. It's a business-oriented language, so it's dedicated just for business. It's not a generic language. But he's still evolving almost every 10 years. There is a new ISO standard. The last one is from last year in January. So why Coble matters? Surprisingly, it's still used in many big corporations. There are estimates that there are about 80 billion lines of code in the world. And it's more than for any other language. It's growing at a rate of between 5% and 15% every year. And for a lot of corporations, it's just too big to switch to another language. And another reason why they don't switch is that it's still fast and reliable. Maybe I can... So half of me again. As you have noticed a little bit, it seems to be important because otherwise we don't have a programming language for this long time. And as you may have heard, Coble is dead. I think about every 5 years it's dead and it's still alive. And it's likely to stay very long because of all this code and all the important thing it does. So it's the common saying that if you do use something with your ATM card without knowing that's not Java, mostly down that's very likely a lot of things that happens in Coble. So you use Coble with all the money you do every day. Also your insurances, your car, renting system, things like those run on maybe on mainframes, maybe on PC, but on Coble. So if we have this situation, it's an important thing. It's obviously a bad thing that only proprietary products exist that can execute it because you cannot reproduce it yourself, neither the big nor the slow ones. And you can not easily learn it because, yeah, buy a license for some 1000 euros, that's not really nice. It also doesn't help to really develop the language. And we have some proprietary environments. Very special is IBM with its mainframe because in this case the Coble is very hard bundled to the hardware and the hardware to the Coble. So it's very fast. And we have the PC system with actually a lot of environments, but nearly all were brought by Micro Focus. And Micro Focus now was bought itself two times, but that's a different question. And there's Fujitsu. And as you may see there aren't even any big European players in this. So that's another reason that it's good to have free software because free software doesn't have border and special regulations. There are open source projects. Previously there was OpenCoble. This was now several years ago translated to a GNU project. So we are actually GNU. There's Open Source Coble. They've also kind of forked from OpenCoble, made special for the Japan market. And those were then later translated to Java. So there's a Java option too. There's a newcomer, this GCoble. If you don't know there's GCC, Developer Room. One talk will also be about that, how hard it is to pass Coble because Coble is so big. And then another one is AutoKit. It's a compiler to .NET. It's very early in its stage. So GNU Coble as I said was created long time ago, also long time before I get in. It actually has roots in TinyCoble but was nearly rewritten to get OpenCoble. We have two active branches, the one that is easily available for you all. If you just do app install GNU Coble, jump, add, whatever. So your normal package manager likely has GNU Coble available. You just can install it with a glance. That's the branch GNU Coble 3. The shiny new thing will be GNU Coble 4. So this would be building there self. We have a bit of numbers but you can get it on your own. So what's the benefit of GNU Coble? GNU Coble does internally generate C code. So the C code is quite portable. You can run it on your Pi, you can run it on your main frame and everything below. I of course have it on my mobile phone. So no problem there. The generated modules are really C89. So you can run it everywhere. The runtime library is C89 and uses some new features if they are available. So on most systems you will have a current environment. We pass the standard test. There's a national standard test for it. 97%. Actually, if you compare with the proprietary, they don't pass 97% but less. We don't have support for object-orientated Coble. That's a nice feature from Coble 22 which isn't used that much actually in the production because you have the old code. And messages. This was a very old thing and now gets back in Coble 23. There's a new messaging system which you can think of, MQ, Rabbit or things like that. We have a lot of dialects. And so Micro Focus has one or two. IBM has its own and we have 19. So it's relatively easily to transfer from one Coble to GNU Coble, three year Coble, three year computing. And for other code that is still missing, so we are not complete supporting all 19 dialects of course. But if there are projects, then we can just increase what is needed. So the compiler also gets increased step by step. And the ecosystem helps you with all the things that may wear on the mainframe. Or otherwise if you just do SQL access, there's a standardised exec SQL thing that works in C and that also works in Coble. So Coble is still used a lot but GNU Coble also has many users already. So actually if you think of GNU Coble, of course it's the new name of OpenCoble. And OpenCoble was forked by Coble IT very early. And for now I think that there are hundreds of big users of Coble IT. So in some sense there are also users of GNU Coble. We at Okamil Pro, we work with the French tax administration, the DGFIP. And we add them to migrate from Gcos mainframes. So Gcos is the name of a BUL mainframe that was created in the 80s. And so they are moving from that dialect to GNU Coble on PC. So we added the dialect for Gcos to GNU Coble for them. And recently we had a mail on the GNU Coble bug mailing list from real time. And they told us, okay, we have a bug. We found a bug. But we have been converting many and many programs from micro focus to GNU. And we didn't find any other problem doing it. So it's quite a nice result. Maybe you want to talk about it. Yeah, and actually I was contacted by someone, it's not on the list, but on the updated one. That uses GNU Coble for his customers. And that's a company for banking, so core banking, everything that you have in the banking, apart from the online banking part, is part of the software. And there are a lot of, there are some of those customers in the Dachie region that currently use GNU Coble. And actually they migrated from Solaris and IAX or Plain Rel and Micro Focus to GNU Coble on different environments. And those, yeah, those have multiple, a lot of transactions each day where GNU Coble just works. And they found actually that the GNU Coble environment is much faster than the original Micro Focus one. What was part of the original introduction of this talk was how we reached this majority. There's a lot of things on this paper, of course, but I think one of the biggest things is that it's got easier to work with because people wrote documentation on it. Other people helped with writing a pure Coble source code debugger using GCC. There are a lot of approach actually that you can use for debugging Coble with GDB that way. The tool sets around GNU Coble increased per demand so people can actually do what they normally do with the old compilers or with the C compilers also on GNU Coble. And many important things are also people using GNU Coble because you don't know if your compiler is complete if nobody uses it. But if big companies use it for their software, you may or you'll likely find bugs. And then you can fix them. And this also helps because it's quite different if you have just some code, maybe even the NIST suite, which has a lot of code to run. You don't see where the issue is, but if you run this with thousands of processes in parallel on a nice Linux machine, then you see quite fast where the bottlenecks are. So putting this into production helped a lot because this allowed us to tune for the actual issues on performance and also on memory. So having a nice compiler is great, but you also need to have a good environment around it. So we started working at OCaml Pro on a studio for Nucobol that we call Super Bowl. So it's based on LSP, Language Server Protocol that we developed in OCaml for cobalt. So there is a full cobalt parser in it. And it gives you access to all the features that you enjoy in a modern editor. So if you are interested in it, the link is getsuperbowl.com and there is a GitHub repository. Actually, so it's not on the marketplace of VS Code for now. It's still being heavily developed, but it can be tested directly from the project on GitHub. So we have a screenshot. It's not where you can see that you can go find references, for example, for some identifier. You can find all the references. So it's a small screenshot. So as a conclusion, we wanted to show that Nucobol now is mature to be used in industrial settings, where usually people use proprietary solutions. It's nice because it's developed in Europe compared to the other solutions. And you can use Super Bowl to develop with it, to have a modern environment. And this year, there is a Google Summer of Code. And there are projects on Nucobol. So if you are a student and you want to contribute to Nucobol, it's a nice way to start. So there is the URL here, but it's easy to find on the web page also, I think. If you have any questions, please join us outside. And we'll be happy to reply to all your questions. Thank you. Thanks. Thank you.
Ensuring Longevity: Strategies for Sustainable FLOSS Projects.
Good afternoon. As you've been introduced to, I'm Cecilia Maundo. I'm a community mobilizer and sustainability coordinator with inter-news. And we are talking about longevity strategies for sustainable flows project. And I'm going to give this moment to Skylar to talk a little bit about the sustain project that we are working on with inter-news. Hi everyone, my name is Skylar and I'm part of the sustain team at inter-news. Inter-news is an international NGO based in the US that focuses primarily on independent media support. And we believe that information saves lives and through that it's important to have a free and open internet and a resilient internet. And so we're on the internet freedom and resilience team. And sustain is a two-year project that focuses on the sustainability of free and open source tools and sustainability through a holistic approach. So looking at community building and financial sustainability and marketing and visibility and not just technical sustainability. And I think that's really important. And yes, Cecilia here is going to talk a bunch about the work that we're doing and the cohort that we support as well. So Cecilia. Thank you so much. We are supporting a cohort of six teams. Some of them are the audience. The six teams are Teller, Bayonet, there is OpenWisp, there is Save by Open Archive, there is Onion Share. So we are talking about sustainability. First of all, let's talk about the importance of open source tools. There are so many things when you talk about open source tools, but one thing that comes into mind is the issue of transparency and security. We are living in a world of authoritarian regime. We are living in a world where journalists are not safe doing their work. We are living in a world where human rights activists are not safe. And one of the things that open source tool does, it gives them the liberty and the opportunity to be able to keep their information safe and be able to do their job. So one of the most important things about open source tools is transparency and security. So when we talk about sustainability, the first thing that comes into mind is funding. Most of the open source tools are run by volunteers and we know that not everyone also has the opportunity to volunteer and also volunteership will not pay the bills. So how can we be able, how can we come up with a plan that can be able to sustain them, a plan that they can be able to make money and be able to run these tools for years to come. So one of the ways is funding and there are so many ways when it comes to funding. We know funding is a very thorny issue. So one of the things we talk about is crowdsourcing, we talk about donation. Sustained project is funded by DRL, which is the State Department and it's a collaboration of Code for Science and Society and Guardian Project who are here in the audience. So we also talk about community engagement. If there's anything that keeps open source tools going is community. Community is the backbone of open source tools. We know there's a lot of the issue of volunteership. So how do we make the community more interested in open source tools? How do we make the community know that they can be able to take part in maintaining of open source tools? And one way is active communication channels, getting in new contributors. And when you get in new contributors also there is the issue of, is there interest in this tool? For example, there is the issue of, let's say, let me give an example of TELA, which is a tool whereby you can document your information in a safe way. So when you talk about that as a journalist myself, I can take part in being part of the community of TELA because I know it is helping my fellow journalist and myself. So how do we create a thriving community when it comes to open source tools? I'm just looking at this buzzer. So the issue of also governance, we need to establish a clear governance when it comes to the structure because most of the open source tools, as I said, is just a group of people who come up with an idea and now they want to make the idea work. So also out of that idea, for that idea to work, they must be very, very clear structure. And that is one of those also a very thorny issue because you've come together, there is no money, and you also need a structure. So we ensure that decision making processes are transparent and fair. And then also regular maintenance, regular maintenance of the open source tool. When you look at end users, their needs change with each coming day. You know, our needs are not what we had last year or two years ago. We looked, for example, when you look at the pandemic, things took a different time. So there is also the need of regular maintenance when it comes to these open source tools. And I know what comes into your mind. There is also money that is needed. Yes. And that's why we talked about funding, also expanding our funding pool. One of the advice that one of the tools was given is you can be able to offer consultancy service and that can be a way of also making money coming in and also assigning roles for maintenance and task and addressing the most needy issue at that time. So as I come to the end of this lightning talk, which I'm very excited because I'm a bit nervous as you can tell. So for us to ensure that there is a longevity of open source tools, it requires a multifaceted approach. It requires a holistic approach just as as Kala said, we are doing in sustain. Everyone needs to come on board to be able to know how we can be able to maintain open source tools and also addressing the issue of funding because that is the word we've been hearing since we started the project. Funding, funding, funding, funding. And also when we talk about funding, yes, you can get funding for one year, but how can we be able to get funding that is long term for about five years because when you get funding for about five years, then you know you have the luxury of being able to put everything in place by the time the five year end and also we talked about the issue of governance. How can you be able to put structures in place that can flow down and make this to work? We also talked about maintenance. Yes, any question? I only have an opportunity for one question. Yeah, because I'm looking at this clock. Well, well, well. So I'm assuming there's no question. Please, we will upload the presentation on this platform and also you will be able to get our emails in case you want to know more about the sustained project. Sustaining is sustaining safety tools with analytics, insights and networking. You can be able to contact us as we said is a project by internews and we are working with six tool teams to be able to create a sustainability action plan whereby these tools can be able to maintain their self even after the project that is done. So thank you so much. We also included our guide for sustainability tools or sustainability for open source tools and little leaflets on the side of the stage and up here as well. If you're interested in learning more about the strategies that Cecilia talked about. Thank you.
Documenting and Fixing Non-Reproducible Builds due to Configuration Options
Good afternoon, everyone. So next we have Aaron, speaking about documenting and fixing non reproducible builds due to configuration options. Thanks. So hello, everybody. My name is Aaron. I'm a PhD student at the University of Rennes. Doing research in the software engineering research team diverse of Inaria, Eriza, Ray in France. And today I'm going to talk about reproducible builds and software configurations. So what is reproducible builds? I took this definition from the paper, reproducible builds, increasing the integrity of software supply chains. So it says that the build process of a software product is reproducible when given a specific version of the source code and all its dependencies, every build produces bit by bit identical artifacts and plus no matter the environment. And I think it's a really important point. So to achieve reproducible builds, there are a set of guidelines in the website of reproducible builds such as how to have deterministic build systems, what not to ship in the binary or even how to distribute an environment, set some environment variable and so on. So let's take an example. So for the Linux, I can go to the source tree. So I've downloaded and I just generate the configuration of the kernel. So here in this case, I generated a tiny configuration, then I just build it. And once the build is done, I have a binary called the VM Linux that I just keep on the put in the TMP and then I clean everything up and I just reproduce the process. So tiny config run called twice just produce the same configuration. And now if I want to compare the product of these two builds running Difascope, which is a tool provided by the Producible Build Initiative tool. So what's happened? Just because I've built the two binaries few seconds apart, I have two binaries that are different, not bit by bit identical. So I'm following the guidelines. I can set some value to environment variables of the build system. So here in this case, K build. So I can give a fixed date, for instance, here the 1st of January of this year. And now I can have a bit by bit identical binary. The question is in Linux, for instance, we have all different set of configurations. We have the default configurations per architecture, all these config, all mode config and so on. And especially round config that will set some configuration options randomly. So do I need to just fix all of the predictability issues for Linux just with this trick? So we can look in the documentation. So the K build trick is obviously written in the documentation. But there's the documentation emphasize on the configuration options. So here we have six of them. So just as a reminder in the kernel, you can set some values to some options, either yes, no or module to ship them or not. And so here we have a list of six configuration options. But is that all? So as the latest version of the kernel, I think there are more than 19,000 configuration options. So there are six configuration options that have an impact on the predictability of the kernel among all these configuration options. So to answer this question, we have basically have a kind of a brutal approach. So we just generate the set of random configurations. So as you can see here on the left, then we build them in the same environment. So we have a fixed Docker file. And for each build, we just build them in a newly built container. Then we compare the binary. So we don't compare all of the intermediate file of the build. Just compare the final binary. Then you simply do a diff on the binary and get all the results, as you can see here. So there's a way to encode the configurations in a tabular representation. So we just have a row with all the configuration options. And zero means no. One means yes. Enabled. That means module if it exists. Then we get all the data and put it in a classification algorithm. And we just get the outlier options that are responsible of the non reproducibility. Then from the list, we have a phase that is an exploration phase that I will explain a little bit later, where we enrich the list we've got from the classification algorithm. Then we just have a fixed phase. And the idea is to add, if the options are indeed responsible of the non-reposibility to add them to the documentation. So the setup is, so this is the setup. So we have 2000 configuration for each system we study. So the Linux kernel, but also busybox and toybox. So we generate random configurations. We have a preset for x8664 for the kernel. And then for the environment, we just derive the tuxmake image. And then we set all of the environment variable so they don't vary during the build, like the timestamp and so on. So here's one of our first results is for Linux 47% of the builds were non reproducible. And for busybox, we have two cases here. We have the first case where we didn't vary the environment, so the build path. And there's a case where we just vary the build path. So we wanted to show case that there is an interaction between two layers, so the configuration and the build path. And to solve it, you just choose either to change the build path, to fix the build path, or to disable the debug configuration option. So it's up to you. But if we enable the debug configuration option and we just vary the build path between two builds, we have 49% of non reproducible builds. And for toybox, it's 100% reproducible in our study. And so now who is to blame? So no for the Linux case. So here we have an example of the decision tree we got from the process. And we have five configuration options here. So what we do is we don't consider destructor like if I disable module six, one. So here the structure is that if I disable six, one, the next responsible is Jacob profile of trace and so on. So here we just flatten everything and we consider each configuration as independent. Each configuration option as independent, sorry. And so we have this list of five configuration options that, so module six is a similar configuration option is in the documentation for both, but for the rest of them, we don't have them in the documentation of Linux. And now we have an exploration phase where the main idea is that we want to identify all the options of the same kind. So in the documentation, we saw that we had some configuration options on the module, CIG, all module CIG module and config CIG module CIG and so on. And so here the idea is just to identify the siblings of the options. Like if I disable one option, I have another alternative of the same kind and we just explore all the alternatives. And a great example here is module six, one. If I disable it, I have to enable two, 24 or 256 and so on. And so once we have, we've got all of the siblings, we just use the name and conversion in K config to just get the parent. So we know that if I want to disable this specific option, I have to disable this parent. And now, place to the fix of the each configuration options. So the idea is to remove all of the detected configuration options from the initial configuration. And it's a kind of hard task sometimes in the Linux kernel because we have to get all of the dependencies of the configuration options. So to solve each, I mean to detect the dependency of each of the configuration options you want to modify or to change, we use a tool called config fix that is a set based solver that is presented in detail in this paper here. And it just gives a list of options to a list of conditions to satisfy. And it could be in the configuration option and the value in order to apply a change. And then once the change is applied, or once the change is applied and the change being just set in the configuration to no, we just build again and check for a predictability. And from the list we've got, we were able to make 90% of the non reproducible build reproducible. We had 31 configurations, so 3.5% that is still not reproducible due to some dependency we couldn't identify. So that's one of the limits of the approach. And less than 0.5% because the tool we used couldn't find the diagnoses. But compared to the first result I showed, we went from 47% of non reproducibility to 1%. So now the summary. So I think one of the takeaways that options matter. So we should explore more the impact of configuration options in the reproducibility of each build. The second takeaway is that there could be interactions across variability layers, such as I showed for our busy box. So we also need to detect them and to pinpoint and describe precisely in documentation. And we have identified more configuration options that could be added to the documentation, so we'll send the patch soon. And now we could remove some of them. So 96% of non reproducible builds made reproducible. So if you want more detail on the whole approach and the rest, this will be presented at the mining software repository conference. It's an academic conference that will happen in Portugal in April. And thank you for your attention.
Platform engineering for dummies
Great. So good afternoon everyone. Next we will have Donnie Burkaltz introducing platform engineering for dummies. Thank you. Super excited to be here today. It's been a number of years for many of us since being at a Fosdham in person, so welcome back. I was very happy to be here. I got myself a very nice Belgian beer as soon as I arrived, so I'm feeling great right now, all ready for my talk. Only one now, just one. The rest will be later. And I hope I'm assuming none of you are actually dummies, so thank you for coming to this talk. This is just for people who have heard the term platform engineering. It's starting to get increasingly popular. It's the only thing people talk about besides AI these days. We're going to mostly skip that one. And we're going to talk about what it is, how vendors are completely destroying the term, just like they do with everything. And then how to get started with it yourself. How you really make it as easy as possible. You don't have to buy vendor solutions. You can use open source off the shelf software. It doesn't even have to be custom and brand new. So by the end of this talk, you'll have a really good sense of platform engineering, at least as good and as deep as you can get over the course of the next 12 or 13 minutes. You'll have a lot of good resources. I've got links in here and a couple of the slides as well. So you can go check those links out afterwards because it's not just about technology. It's also about the people. It's also about the process. There's a lot of different pieces you have to do to get this right. In fact, the technology in many cases is the easy part. But first, a very short story. A few years ago I worked as a technology leader leading a DevOps transformation at the time. That's what we called it. We now probably call it platform engineering at this travel tech company called Carlson Vaganley Travel, CWT. It was actually an office here in Brussels. I visited there a few years back. Great place. Lots of interesting development happening there. Since then, I actually have led products, management, and products at Docker and at Precona around open source containers and databases. I've spent a long time in the platform space. Long story short, I know what I'm talking about. I've been doing platforms for like 20 plus years at this point, as have many of you. I'm just sharing my own story and my own perspective here. I'm sure many of you have your own. When we think about platform engineering, or at least the way I look at it, there's really three key pillars to it. There's platform operations, platform as product, and self-service for developers. We're going to jump into each one of those pillars and talk a little bit more about what that means. If you want to check this out afterwards, I have my own independent analyst from my little blog post about it. Feel free to check that out at your leisure. What does platform operations mean? There's a lot of companies today. In fact, how many of you come from a large enterprise? Do you have something called a platform team? Does it maintain maybe Linux OS, maybe some other OSes that we won't talk about, some things like that? It just got called the platform team at some point. It might have been the OS team. Before that, maybe they merged it in with the network team or something else like that. When we talk about platform operations, we really mean operating it as a holistic platform regardless of how many servers, how many VMs, how many containers might be underneath it. The same thing we talked about 10 years ago with Cloud, the same thing we talked about five years ago with DevOps, moving away from that Pets mindset into the Cattle mindset, moving away from that single server, single container, naming things after our favorite characters or our favorite TV shows into that mindset of these things are fungible, they're disposable, we operate them as applications and fleets of things and they're automatically created and deleted on demand. We're in this world of SRE now, we're moving more and more into things like SLOs of how do you monitor the user impact of the applications you're serving. In this case, we're talking about platform engineering, meaning building for developers, but even if you're serving internal developers, a platform, you still have to care about the quality of service that you're giving them. You still have to care about your latency, you still have to care about your error rate, you still have to care about how much of your capacity you're using in any given moment. You have to treat those internal applications just as importantly as you treat the ones that you're serving to your external customers and users. A lot of companies don't do that, they'll have things like their tier one applications, those are business facing, they get major incidents, spinning up war rooms and all that kind of thing when there's an outage, but if their CI pipeline goes down they say, oh well, it'll be back eventually, it'll be fine, we can just have our developers kind of doing nothing for most of a day, no big deal. A lot of companies are still like that, but we have to apply this platform operations concept not just to our external customer facing applications, but treat our developer productivity as something business critical in its own right, because developers are expensive. Sitting there for a day, not being able to ship software is expensive. And so we went through exactly this journey at CWT. One good example of this was we started by monitoring tens of thousands of different infrastructure metrics, right, classic old school world of monitoring, and we shifted that into just a handful of user facing impact metrics, but along the way we actually had to educate our developers and our operations teams on how to debug things in a much more complicated way than they were used to, because with the infrastructure metric you could have a simple runbook. You see this thing, you push this button, done, whereas if you have a metric saying my application is slow, there's a lot more potential causes, a lot more you have to learn to jump into it, and so at the same time we made this transition with technology, we also had to upskill a lot of our level two operations teams and had them become an SREs in their own right learning how to automate things, learning how to debug things much more deeply. Now the second piece is platform as product, and when I say this what I mean is for things like your internal CI pipelines, for things like your container services, whatever other internal developer tools and services you might have, you have to apply the methods of product management to them. You don't have to have a full-time product manager, that's fine, if you do fantastic and you're lucky and fortunate and congratulations on that, but if you don't, there's a lot of different people who can pick up some of that load, learn how to do modern digital product management, right, you might have people even depending on how traditional your company is called service managers, right, they might use a framework called ITIL to talk about things, and those people still have the potential to modernize and move forward and get with the times and apply modern product management approaches, meaning talk to your internal stakeholders, understand the problems they're trying to solve, right, in many cases they might be providing a service like source code management is a service you provide to your developers, there's probably a team running it inside your company if you're at a big company, do those people talk to their own developers about what problems they're trying to solve and what their workflows look like, Jets are probably not, they just shove stuff at them and say good luck, right, and we're fortunate we now have better tools than we used to, but there's a lot of opportunity for people in those positions of being these central platform teams or central developer productivity teams to go talk to their own developers about the problems they're trying to solve their day, understanding their pain points, and bringing that back in. At the bottom I've shared a handful of links in varying levels of depth that are super good resources if you're wanting to learn this or if you wanted to share these with other teams, there's an entire specialization on Coursera that'll probably take somebody six months to go through maybe an hour or a few hours a week, there's a great book by the same person who put together that course or the series of courses and then there's a website you can just go read for free to start checking it out right now. In every one of those cases they aren't written for Platform as Product People, they aren't written only for internal product management, they're written for anybody doing modern product management of how do you get that up to speed and so you have to do a little bit of extra work to think what does this mean for me specifically, but all of you are smart people you can figure that out. Applying this Platform as Product approach is absolutely critical to doing Platform Engineering right and nothing about this requires a specific piece of technology, nothing about this says proprietary versus open source, this is the people and the process side of it, but you have to get this to get Platform Engineering right because if all you do is say oh hey we gave you a platform now we've got Platform Engineering, you're wrong. What probably happened especially if you're at a big enterprise is you still have a ticketing system somewhere and you're still requiring developers to go file a ticket every time they want access to some new resource whereas if you're getting Platform Engineering right you're moving away from that because you talk to your developers, you've understood their needs and you've probably moved into something much more policy driven where there might be an initial ticket but the only thing that happens is to assign the developer a role as I'm working as a developer or I'm working as a developer in a certain application area then they're granted that policy driven access and they're able to move on and get on with their lives instead of every single time they need access to a new server every time they need a VM created every time they need additional memory provision to the VM right all these things are crazy and in many cloud environments they have been partially solved but a lot of us are still working on premises we're still working with servers in data centers or in colos or working in clouds that feel like we're that in every one of those cases this is an opportunity to make dramatic improvements in our own productivity as developers um one example of this from my own experience at CWT was we applied this approach to a really novel area which is um one of the teams that reported me to me was the major incident commanding team right so every time stuff got really really bad it was like the fire department you'd call them in they'd run the the issue and run it through to conclusion now that team had to send out a lot of different communications to a lot of different audiences they had to send things out to our internal executives had to send things out to all their employees who were being affected by it we had to send some things out to our customers as well um all those communications were things that hadn't really changed for a long time we had to get a lot better at them there were all kinds of complaints that would come in from these different audiences because it wasn't a one-size fits all approach it was something where but we were sending communications out that way and then things had gradually evolved very organically there wasn't a clear way to understand who should get what i mean so we applied these these platform as product style approaches to the communications going out from the incident commander team and made dramatic improvements by just doing things as simple as going out and regularly talking to the people who need to consume this stuff to understand when do they need it what do they need what do they need to understand so they can turn around and make the right decisions or do their jobs more effectively or tell their own customers the people who actually pay us as a company what we need to do and what they need to do and how long they might need to wait and when to try back and what their alternatives might be what was interesting too is we did this in a very lightweight prototype sort of fashion right so of course we had a technology solution for sending all this stuff out but instead of using that and using our developer time to sit there and iterate and work their way through their backlogs we literally just wrote a heavily formatted email by hand and started sending this out and used that as a tool to iterate on what the product should look like and so we just put together this email and we'd send it to somebody and say hey like what is this what do you think of this like walk me through how you're interpreting this what you're doing and by applying that really lightweight technique of just doing things by hand doing things the rough way before we had to put in the effort on software development it dramatically speeded up our ability to figure out the right thing and then spend our development effort building the right thing instead of getting getting it wrong very slowly multiple times on the way and third self-service for developers this one is pretty self-explanatory so I'm not going to spend a lot of time on it but really this is the continuation of that consumerization of it trend right the expectations for user experience in the enterprise side are very different now than they were five ten years ago and the same history for developers right developers should not have to put up with really clunky terrible interfaces on their internal tools anymore right it's been bad for a long long long time but things are finally starting to get better right things have gone through very ticked-dirgin approaches my own experience at CWT was you know we came in and we did something called value stream mapping which is a great technique for anybody who's interested in solving a lot of problems like this where we worked through a very specific workflow and the one we picked was deploying a new application for the first time um worked through every single team a request went to every single team that had to touch it and end up being something like 15 different teams were involved in this because there was a single silo team for everything you could imagine right there was like a network team and a security team and a firewall team that wasn't the same as the security team uh and you know the list just goes on and on and on in large companies like this and every single one of them required a ticket in some case it was the ticket you had to file in some case there was a ticket that a team filed to another team and that team filed to a third team and then somebody else would audit it and somebody else would review it and finally it would work its way through right but imagine getting all those to a place where you can clearly define the policy once get agreement on that from all these teams and then work on that policy and use that policy to automate all of your governance going forward all right that's what we're talking about um we took out of a 45 day timeline to deploy new app we took 30 days out on the way there um by making some simple process improvements and applying some automation now let's look at some solutions over the course of the next minute what do you need from a solution you need a job runner pretty simple because you got to do stuff you need a web GUI so you can click some buttons you might want to click on it have an API or CLI but those aren't necessities you need to access controls so that only the right developers can do the stuff you want to do and of course you need to be floss now there's a few different classes of these job runners you might look at internal development platforms you might look at CI servers you might look at workflow and data orchestration tools or you might work on look at task schedules there are all good options when you're thinking about how do I do this platform engineering and really the answer here is use whatever you've got don't make this huge start where you are you can use GitOps you can use backstage you can use even Jenkins you can use workflow and data orchestration tools or task schedules so hopefully that's given you a sense and I'd encourage you to refer back to the slides later to see that list because I went through it pretty quick of what platform engineering is all about what are some of the different solutions and that you should start exactly where you are today using the tools you have don't make this over complicated thank you
Taming the Beast: Managing High-Growth Postgres Databases at CircleCI
Hold on. Hello everyone. Sorry? No, I think people are just using the arrow keys. Sorry. Less high tech. Hello everyone. So our next speaker is Bryce Kenta, introducing Taining the Beast, managing high growth postgres databases at CircleCI. Thank you. Hi everyone. My name is Bryce Kenta and welcome to my talk on Taining the Beast, the CircleCI journey to managing high growth postgres databases. First, who am I? So I'm a staff engineer at CircleCI, where I've been working for the last three years. I have over eight years of engineering experience spending the full stack back in front end. At CircleCI, I've been focusing on backend architecture and reliability. Over a period of hyper growth, reliability became a big problem at CircleCI to the point where our CTO started posting a monthly blog post to keep our customers updated about the improvements. So a key part of those improvements was dealing with large databases, which I'll be talking about today. I'm very enthusiastic about the develop experience and making that better, which is why I love my work at CircleCI. And when I'm not in front of a computer, you can find me on the driving range because Canada is very cold and occasionally traveling the world of my wife. All right, so let's get started. Just to give you a little bit of background about CircleCI, it's a global CI CD platform with a wide range of customers. A bunch of open source projects build on CircleCI, such as React Native, Angular. Anytime you see a .CircleCI folder in a repo that typically is building on CircleCI, and on the right screenshot, that's an example of a React Native workflow, which is currently just running some tests. And so this should be familiar to any of you that are maintaining any CI CD pipelines. So our platform runs about 4 million of these workflows per week and over 20 million jobs per week. Each workflow that runs on our platform generates net new data to be stored, such as the workflow itself, the dependencies between the workflow, the workflow graph, the job states, and test outputs and things like that. So to handle all of this traffic, our infrastructure runs over 150 services and 70 plus post-grace databases. However, some of these databases were growing very rapidly. The particularly one that supports the platform's engine. The growth of such databases was directly correlated with the number of workflows and jobs that are created per second. So an example of high-growth database that my team was responsible for had grown to 5 terabytes in size and growing by 500 gigabytes per quarter. The right amplification on that database was a recurring cause for incidents. The nail in the coffin, though, was when we tried to upgrade that database from an end-of-life 9.5 post-grace RDS instance to a 12.5 instance. This took months to complete and incurred significant downtime because of incidents. The first attempt at migrating the RDS instance took a couple of hours and resulted in poorer query performance. This is because the large tables required lengthy vacuum operations, post-upgrades, which led to massively degraded performance. We considered using AWS Database Migration Service, DMS, but it would take too long to complete given the database size because DMS uses logical replication which is concerned with the number of rows and the amount of bytes that you're transferring. We were finally able to do the version upgrade using a form of home-brewed logical replication, taking advantage of application-level knowledge of the database. But this required significant engineering effort with engineers working weekends. So that wasn't great. At the end of all this, it was clear to the business that operating these large databases is very risky and could cause a company-ending event. So we needed to tame this growth. So now I'll take you on the journey that we took to taming this beast. So first, I'll talk about the storage reduction, so the immediate savings that we gained by deleting some of the low-hanging fruits. Next, I'll talk about the growth restrictions that we put in place to make sure that the data growth remained at manageable levels. And lastly, I'll talk about some of the optimizations that we made to ensure long-term success. So the first thing we did to reduce the storage was to drop unused columns, tables, and indexes. Indexes in particular can grow large in size over time, so dropping them was a quick win. We leveraged a tool called PG Analyze to identify indexes with those scans. So that means they were not used, and then dropping the indexes not only benefits the storage size, but it also reduces write amplification, so the writes to the database are actually faster. Next, we switched a bunch of B3 indexes to use Brin indexes instead. So Brin indexes are designed for handling very large tables where in which certain columns have a natural correlation with where they're physically on the table. So for example, if you have an Ordis table with a created-at column, earlier records on the table would physically show up earlier in the physical location. So those Brin indexes are optimized for that kind of data. So from the screenshot, you can see we had a bunch of created-at indexes across multiple tables, but the thing to note is the size of those indexes. That took over 400 gigabytes of storage in a single database. So dropping them, or those the ones that were unused, or switching to Brin were able to save space immediately. The next step we did was to reduce the storage further, and we had to upload any static blob data to S3. So S3 is much cheaper, and you can define object life cycles to automatically delete the data. But my greeting to S3 came with some drawbacks, such as additional latency, because we had to put a Redis cache in front of it. And the other drawback was that it added more dependencies to our service, and the queries were no longer transactional. So we had to add code to stitch together the response from Postgrease and S3, so that added a bit of complexity. So at this point, we freed up some storage size and to give us some runway, but we haven't addressed the growth. So let's talk about that next. So the first thing we did to slow down the growth of our databases was to put in place data retention policies. Our product management team collaborated with other parts of the business to identify data retention periods. So the data retention period differs based on the customer plan. So for example, a free customer will get three months of data, and higher-plan customers will get up to two years. We communicated these policies to all of our customers ahead of time. We gave them a quarter, so three months of leeway, before actually enforcing any restrictions. So the next step after that was to implement data access restriction, but at the API layer before actually deleting any data. So this meant customers no longer have access to data beyond their retention period, which enabled us to go to step three, which is safely delete the data, because now customers don't have access to it anymore, using background jobs. I should point out that at this point we still have growth, but mainly due to new customers, or existing customers that are building more on the platform. But the growth is contained because we don't retain data older than two years. But we ran into some issues. So the first issue that we ran into was, as we're deleting data from the primary database, it caused degraded performance on the replicas, as the deletions are getting replicated. So we experienced like spike in IOPS and CPU usage, and so we needed to upsize the replicas. Another issue that we faced was index bloat. So frequent background deletions without a periodic maintenance of the indexes, reduces the efficiency of those indexes over time. So a solution for regularly re-indexing the database was necessary to make deletions sustainable. This is something that we're still figuring out. We haven't found a proper solution yet. But lastly, post-grace databases do not automatically reclaim space when a record is deleted. This is something that we found out. So there is a built-in vacuum operation to reclaim space, but this process only frees up space back to the table for reuse. So once disk is allocated for a table, it may never be released until that table is dropped. The vacuum operation has a full option which builds a new table and swaps the old table for the new, but it requires an exclusive lock. So this was not a viable solution for us because, again, it requires downtime. We're able to use PG-REPAC, which is an open-source post-grace equalization that allows us to reclaim space on the drop columns with minimal locking of the table. So that was great. And then the last step on our journey was to establish a long-term strategy. We needed a data archival process that could be applied to all of our high-growth databases. So we established a data reliability team with the mandate to own a single historical data store. The data store would support functional requirements such as high availability, be horizontally scalable, support multiple query patterns, which is needed by the API or the UI to filter data. But this historical database is only used to serve customer data only, nothing else. No ETL, nothing like that. And then each service team would implement a data archival process, which is similar to the diagram at the top. The service sends requests to the historical service to archive data. What data is archivable and when? It depends on that particular service domain. There's a sweeper job that makes sure that any missed archivable data is archived. And then there's a deletion job that is continuously deleting archive data. Also, as product teams are building new features that require net new tables to be added or to be created, we aim to partition them from the beginning. We use PG Partman, an open source partition manager to create time-based partitions. PG Partman enables us to configure retention periods and will automatically delete any old partition. So as soon as the partition falls out of the retention period, so in our case 24 months, it is automatically deleted by PG Partman so we don't have to worry about it. And finally, so now that I've taken you on the full journey from reducing our storage size to establishing long-term data archival processes, I'd like to take a moment to acknowledge some of the key learnings because an initiative of this magnitude was spanning almost two years and was non-trivial for us. So the first learning was to implement a brief retention policy as early as possible. Ideally, one that allows you to serve more data at your discretion because this means you don't have to implement the code to delete the data until you really need to. That would have saved us hours of engineering effort and downtime dealing with massive databases. The second learning rehearsed any major database maintenance, things like major version upgrades, space reclamation, re-indexing, anything like that. Make a copy of your production database, validate your changes there, compare query performance against the production database before actually running that maintenance in production. And finally, write down your learnings. This creates a knowledge base for everyone to learn from and helps other teams move faster. The extensive documentation that my team put together throughout the last two years is what helped me a lot to come up with this presentation. And that is it from me. So thank you for listening. I hope this was helpful to you.
ε-serde / mem_dbg / sux / dsi-bitstream / webgraph: a Rust ecosystem for large graph processing
Hi everybody, we're just about to have our next talk, who will be Sebastiano Vigna, who will be talking about a Rust ecosystem for large graph processing. Sebastiano? Thank you. Okay. Okay. How many Rust programmers here? Well, some. How many Rust programmers who handle large data structures, like those of gigabytes? A few. Okay. The first group is reasonably interested. The second group is more interested. The rest of the people can't sleep. I'm not offended. You can use the computer. It will be very, very boring. So okay, let me introduce why. Okay. What I'm doing is just announcing a few crates we are distributing that do very specific thing related to large-scale data analytics. And the original of this is a framework for graph compression that has been around for around 20 years. And that's being used by the community around the WWW, the web conf, the largest conference on the web in general, academic conference. For the last 20 years, there are many data sets that are distributed in this format that are utilised and so on. There are a lot of journals. And in 2011, it was used to measure the degrees of separation on Facebook, if you remember it, maybe you're too young. But it was quite a feat at that time because, I mean, it was for 15 years ago and Facebook was still rather large. But we were able at that time to represent the entire Facebook graph in just 211 gigabytes, which made it possible to run some pretty nice algorithm to compute this and distribution. Maybe in this community, I should mention that in the late, I started to do free software in the late 80s on the Amiga. Okay. So nobody remembers what it is, but I have some history with the free software movement as well. So at some point, we decided to move to Rust for the obvious reasons, like it's a high-performance, safe language. But, okay, all I said is in Java. It was written in Java, started in the 80s and of the 90s. And at that time, it seemed a very good idea. Okay. Then things happened like arrays are at most two billion elements. And if you have graphs with 50 billion elements, you cannot even index the notes, which gets very, very annoying. And today, anything this size is done using memory mapping. I mean, if you go to Facebook, Google, whatever, all the large structures are there in memory, but usually they're just memory mapped because you don't want to start up time. If you load in memory a graph that is half a terabyte, you wait minutes, whatever the platform you are on. But if you can memory map it, this time is amortized along the visit of the graph, for instance. Okay. And we actually need to represent very large graph. If you ever use Java, the access to memory mapping facility, I will not say words because they would not be proper in this particular situation. There are really lazy iterators. If you're written in Java and iterator, you know what I mean. And okay, so we, to do this, we needed to port a number of ideas from a Java library and to develop a few new things. So the first thing is absurdity, weird name. So it's a framework from epsilon copy, serialization, deserialization. So you might know what is zero copy, serialization, deserialization, means that you serialize something and then you use the memory, actually in the state it is, to represent the object internally. Okay. So there is no deserialization. You don't build a new object. The piece of memory is directly used as it is. And this is how things work, as I said, in all these organizations that have large indices, Facebook, Amazon, whatever you want. I mean, the index is on disk, it's memory mapped as it is. It's not deserialized in any proper sense. There are a few frameworks like abomination that do this kind of things in Rust, but they all have problems for us. The first one is the oldest one by Frank McSherry, writes into the serialized object. So if you want to memory map a file, that's out of question. You might know it is from the people that do the internalization library. Nice idea, but it has a huge impact on performance. It does some kind of runtime resolution of the access to vectors. And then there is Archive, you might be familiar with, which too does some relative memory that is differentiation. And also the structure you deserialize is completely different from the one you serialize. So you have to delegate all the methods and then each time you change one, you have to change the other. Not very practical. So what we did was develop this framework, which requires a little bit of collaboration from the underlying struct. But the basic idea is that you serialize something and then you epsilon copy deserialize it. So you access it, you allocate a very small amount of memory and then the rest comes directly from the disk without any intervention. And the way we do it, we remap vectors essentially. You build a structure with a vector, but when you deserialize it, it has a reference to a slice. In this way, we just have to allocate the actual struct that you want to deserialize, but then anything that is a pointer inside just point to the original memory. So epsilon copy, the idea is that it's not a zero copy because we did a little bit of copying, epsilon copy, a very small amount. But the advantage is that now you have exactly the structure that you serialize. It's exactly that structure with all its methods. The only thing you have to do, if you have vectors, there must be a type parameter and you must write the access method for as a left to a slice. Of course, when writing, you write for a vector, but when you read, you read it from a slice. This is the collaboration you need. But then, completed transparently, like you can do it with basic type. You store a vector and then you memory map it and that's it. And what you get in T is a reference to a slice. More precisely, something that derives to a slice, to a reference to a slice. And again, you work essentially transparently with respect to the framework. Unlike the other cases, and since there is nothing intervening, resolving the pointers, there is no dynamic resolution, everything is done at this realization time, zero impact on performance. The performance is exactly the one of the original structure. We use this to map massive immutable data structure like representation of sequences of sets and so on that are like those of gigabytes, 100 gigabytes on disk directly on memory, without any load time. So if you handle large immutable data structures, that could be for you. Memdology, that's a very small crate, but it's a problem we had. Okay, it's a high performance memory occupancy detector, which sounds ridiculous when you say it because, well, it does as to measure the memory occupied. It's not so easy because if you use the one that are around, so it is like a large vector and few other things, this is the amount of a located memory. These are the three more common frameworks, sorry, crates that do that, and this is the amount of time that they take, and this is the amount of time we take. So the reason is that without some infrastructure similar to the one of absurdity, you have to iterate through collections to measure the space occupied. And if you iterate through a billion element collection, it will take a lot of time. So we routinely measure the space and occupancy of things that are like 50 gigabytes, it will take eight minutes. So we develop this if you need to measure the actual occupation memory, not stack occupation, the actual occupation in memory of something large, try MDBG. Also, as a nice, it does you a print out of the structure with the old memory occupancy. It's important for us because we do all the time this succinct data structure that have various components and we need to know the relative size. So this is only if you have very large data structures. They are small, you can iterate, no problem. Succ is an ongoing problem, ongoing problem, yeah, it's an ongoing problem. I won't say an ingrate, but it's actually kind of an ongoing problem. And it's a part of an existing C++ project and Java project about succinct data structures. You might know what they are. If you don't, no problem, you don't need this crate, but they're very fashionable now. There is one crate at least that does this, but we wanted to have something more sophisticated. So if you're interested in Elias Fano representation of monotone sequences, ranking, selection, and so on, please have a look. This is really getting to existence, but we like to have feedback. Fungal piece bit streams, very, very high performance bit stream with read and write by word and support for big and little Indian files and a lot of instantaneous code, gamma, gamma, delta, go long, and so on. This is kind of cosy you'd like in MPEG or so on, but we use it to do graph compression and we spend a lot of time to optimize every single shift and move and also to give you scripts to just run and we massively test all parameters you can configure on your architecture so you can choose how to optimize the speed of the coding and the coding specifically on your architecture. Like which word size to use to pick up stuff from memory, using the coding tables or not, and so on. And this comes from quite a long experience in doing this with web graph. So if you're interested in writing this instantaneous code for compression, you should have a look at this IBS stream just to tell you a gamma code is ready in less than two thousand seconds. So I think this is pretty nice. Okay, the last piece which is probably the more specific, so you might less be interested in is web graph. So web graph is a framework to represent very large graphs in a compressed form. So typically snapshot of the web are represented in about one to two bits per link. The software heritage graph which is a graph with about half a trillion edges, it's three bits per link, Wikipedia costs 10 bits per link, it depends on the structure of the graph. But usually in particular the graph is redundant, you can represent data in 10, 20, even 50 times less than you do with a redundant version. It's a rough sport of the Java version and of course we use the SIB stream for instantaneous code and sucks for pointers in the big stream. And just to give you a very simple example, the software heritage graph is 34 billion nodes and a little bit more than half a trillion arcs and you can do a BFV visit single thread in three hours. It's very nice. Okay, you have to notice half a trillion edges. The ergonomics of the whole thing is incredibly better than Java. Just having real iterators changes completely the game because it's much more natural that what we had. And this is all the others are crates that you can download and use that are pretty stable. This is still on GitHub because it's a lot of code, a lot of optimization. We just merged into main the last big chunk of modification, the API should be stable by now. But this is very specialized. I mean unless you have graphs with hundreds of billions, half a trillion arcs, for instance, this biologist did this huge data set with a trillion protein-protein similarity edges and they did it with web graph because if you need a trillion edge and you need to distribute it and analyze it on a standard hardware, not a massive supercomputer, you do it using compression. There is also support for labels on the edges that you can enumerate and it's much better in the new version than in the old one. And one thing that we had to fight a lot against is lenders. So if you're familiar, I don't feel familiar with a lender idea. It's generally an idea and a number of crates for Rust. Lenders are iterators whose return object depends on the iterator itself. So iterators in Rust are thoughts that give you values and you can take the values and use them. But in all this kind of batch processing for graphs, you iterate on the graph and you cannot look at two nodes at the same time. There is a sequential iteration which goes through a file or a sorting of labels. So you need to be able to say, okay, this is the next batch of successor, use it, but I won't give you the next one until you finish with this one. To do this, you need to use essentially generic associated type. Not really that. We use higher order trade bounds. But you need to impose that each call to next can be made only when the previous one went out of scope. So you cannot do two calls to next in a row. And this is called a lender. There are a few crates that implement lenders now which have, say, almost feature parity with iterator, but the fact is that presently they work because of bug in the borrower checker. So the borrower checker doesn't check certain things that if fixed would make all these lender crates not work. And at that point, we would be in really deep shit because we have no idea how to do this other than the way we're doing it. In fact, we're even in a situation where we have a chain of an iterator returning iterators and the final value depend on state on the initials thing. So there is a propagation of bounds of on lifetime that goes through two different types. And that gives me headache each time I look at it. And in fact, I didn't even invent it. I asked on Rust forum and they said, I have this completely crazy situation. What can I do? And a very nice guy wrote a type like this with 25 different implied type bounds and now it works. But let's hope it continues to work. But this is just to say we need a little bit more borrowing in Rust than there is now to make this work properly because it has been a little bit of a pain to get something like an iterator in which the return value depend on the iterating object. In the last thing, if anybody know how to get one thing done, index get. Since 2015, it's been sitting in the issues of Rust to have an index trait that gives you a value, not a reference. Because index give you a reference. Now, index give you a reference is fine. But if you do compress, succinct, any kind of implicit data structure, index giving you a reference is a pain in the ass. Because you don't have the data. They are implicitly represented. You need the trait that giving two nice square brackets will give a value, not a reference. And then you can enter the world of modern implicit data structure. So if you know anybody who can implement this, convince someone in compiler team to get done with this, you please do it. I'm over. Thank you. Thank you.
Using elliptic curve cryptography for the purposes of identity
Hi everybody, next talk is about to start. We'll have Yamo Makinbach talking about using elliptic curve cryptography for the purposes of online identity. Thank you. Shall I start the buzzer? Shall I? And we're off apparently. Yeah. Alright, welcome. So I'm Yamo. I work on this project called Keogh's side, which is about online identity. And we're going to talk about it in a minute. First, because of the last previous talks, I wanted to specify the skill. There will be no 5 terabyte database here or serialization of billions of nodes, which is going to make a little script. It's a bit of a Bob Ross talk, I guess, which is going on a journey together and have fun, discover. And before I really start, we're going to try something experimental. We're going to try a little interactive demo at the end. We're going to write the script, but you're going to verify if the script that we're going to write actually works. So for this, for whoever wants to participate, you should consider downloading the Keogh's mobile app. It's available on these locations. You can just get the APK from the CodeBerg repo. Alright, let's get started. So if someone makes a claim, how do we verify that? Well, quite simply, with a proof. What do I mean with that? So for example, if Alice lost her luggage and then Bob found it very conveniently, and then Alice says it's hers, then Bob asked for the proof, of course, because, you know, and then Alice fiddles with the little dials and unlocks the luggage, and then she verified that the claim was indeed true, that it is indeed her luggage. So now we want to know, is this also true over the internet? Can we do this over the internet? Well, yes, we can. We can claim things over the internet, but humans travel rather poorly through ethernet cables, so we need to find a way to connect Alice and Bob in a different way, so that Alice can make her claim, and Bob can verify that claim, each in their own space and time. And so for this, we're going to use cryptographic signatures. So, yeah, we could talk for a long time about cryptographic signatures. For the purpose of this talk, let's just... the important stuff is basically just like a real signature, but digital, but the big difference, I guess, is that it's really difficult to forge, so that's good. And in short, we have a secret key, which we will use to sign documents, text documents, with a public key that we will use to verify those signatures, combine those two keys, and you have a key pair, and each key pair is identified by a unique fingerprint. All right. So let's try and work out this process then. So let's say that I will write this text document, which just says that this is my account on the Fediverse, on Macedon, now I will sign it with a key, which has this conveniently fingerprint, which starts with very familiar letters. And now the signature itself is just zeros and ones. We're not going to worry about that. So now I will give this text document, my claim, together with the signature to my friend, and my friend will use those two pieces of data. They will first verify that indeed the signature corresponds to this text document, and once that is done, they're going to my actual Fediverse account, and then they're going to read in the bio, oh, this person indeed wrote in their bio that they have this key. So that is the proof with which I verify my claim, and that it is indeed my account. So now we're going to do that whole process. We're going to try to create an online identity with just 100 lines of rust. I did need five dependencies. I tried to minimize it, but without these, it will be a lot more than 100 lines of code. So yeah, these will be it. So we're going to generate a key. This is where the elliptic curve part comes in. Elliptic curves are a technique of creating cryptographic keys, and in this case, we're using these specifically the P256 curve, but all this just to say, yeah, we're using these two lines of code just to create an entire cryptographic key. So this includes a public key and a secret key. Now, of course, I said every key pair has a fingerprint, so that's what this code does. It looks a bit complicated. This is the most complicated part. So the most important part about this script is basically we'll just get some data from the key, we'll get some parameters from the key, and then we're going to hash it, and that is how we get the actual fingerprint. Now we're going to collect the identity data. So we're going to create what we call a profile. Just a profile is just a name, some other metadata about the person, and claims, multiple claims. So I'm just going to continue with the same example as before. I'm just going to claim that that is my account on the Internet. Now we need a way to encode all this data, because we need the text document and we need a signature. So for this, we're going to use a JSON web token, which for the purposes of this talk is just a convenient way of combining a document and a signature. We'll need three parts. We'll need a header, a payload, and a signature. So let's make each of those. Oh yeah, some quick notes. So whenever you see that are you at ID, that is just the namespace that we use for the creation of the tokens. And sometimes you will see JWS instead of JWT. Those are different, but for the purposes of this talk, we'll just consider them the same. So let's create a header. So the header is just a little bit of metadata about the key that is creating this profile. So we'll set the fingerprints and we'll set the actual key. We'll just give it. And the public key, of course, not the secret key, because that one should be secret. We'll create the payload. The payload is the actual profile itself. So we're going to say like, oh, it's the type as a profile of this token. We're going to say line 10. We're going to say like, oh, what is the name? It will be the name and the identity claims. Don't mind all the payload set claims. That's just to confuse you, because JWT also uses the term claim in a different way. Just to make it easy. Now that we have the header and we have the payload, we're going to sign the two. That's what we do here. So line three, we get our key that we built earlier, generated earlier. And in line four and five, we're going to use it to sign the payload and the header. And with that, we are done. We have our profile. So now, if you would like to copy this, write this over. Yeah, that's not convenient. So we need to do a second part. We need to do a second step. I need to get this from my computer to your phone, to your device, whatever, so that you can verify for yourself that I do indeed have that account. So we need a way to transport, I guess, documents and preferably sign it. You guess where this is going? We're going to use another JSON web token. So we're actually going to reuse the same header, because we're going to use the same key. So we'll just use the same metadata about the key. We're going to create a second payload, which will be very similar. This time, instead of being a profile, it would just be a request. And we're just going to ask the server to create this profile. And then in line 14 and 15, we're actually going to give that document that we created earlier. We're just going to give it to the server. And this second outer JSON web token, we are actually going to upload it to... Sorry, we're going to sign it first, so we'll have a similar string, a piece of data that we can actually then send to the server. So this is where we're going to send it to what we call an ASPE server that we're working on. And it's just basically a way of storing and exchanging these kinds of profiles. And yeah, that is basically it, what you need to do. Those were the lines of code that you need to actually make an entire profile, make a claim, and make it so that people could verify for themselves with their own devices, with their own methods. So yeah, it is a fun script. You can actually just try it at home. Or as I said, we could try it live on stage. That is what we're going to try right now. So I did prepare it somewhere. So you'll see that apart from some cosmetic changes, if it loads... Yeah, that's the big risk of doing this on the stage. We'll give it a second. Apart from some cosmetic changes, it is largely the same script. And you'll see that it will fit neatly within 100 lines. And it might not. We'll give it another second. And if it... Alright, well, maybe it won't do it. It would have been phenomenal, I can promise you. Alright, I'll reload it once and then... I do have a sort of a backup. Alright, it's not playing game. Alright, so let's go back to the presentation. I think it's this one. I don't... wait. I have lost the presentation. That's a different presentation. What? That was not supposed to happen. Yeah, I don't know what's happening. But basically, yeah, this would have been... We would have run scripts and we would have created a profile. And then it would have presented you with a QR code that you could have scanned on your phone. And it would actually have worked. And then you could have seen that the script would have created a profile that we built here on stage. Yeah, and just with a couple of lines of code, we can work with cryptography, we can work with identity. And, yeah, thank you very much. Thank you.
Timestamping with opentimestamps
Alright folks, we're just ready to start our last talk, which will be time stamping with open timestamps by Timothy Riddia-Eli. Okay, thank you. So I'm a Red Hat employee that works as software engineer but not for this stuff. So what is time stamping? What is time stamping? Time stamping is needed to be sure a document or a file is made in a specific date. And for example, in Italy, because I'm Italian, the law requires that the data are ushered by a public officer, so you can't do that by yourself. So what about digital documents? Well usually digital time stamping is made on a third-party data center, so you must trust some other authority, and it's usually a certification authority. So how we can do that without reeling on a third-party authority? So we could use the blockchain, so you create the hash of a file or information, and you put this hash inside the blockchain, so you can demonstrate this hash was present on a specific time. So why the blockchain? It's safe because it's backed by millions and millions of dollars. It's open in the case of Bitcoin we use. It's not cheap to create a new Bitcoin because mining is an expensive process, but it's quite cheap to use that. So why open time stamping? So the blockchain is open, anybody can write on it, anybody could do the same thing directly without using the open time stamp or another framework. So open time is a standard way of doing the same thing in a trustless way, so without trust no one. It was proposed by Peter Todd, a Bitcoin Core developer. It's used by dozens of different companies, and it's almost because in information technology we can't have infinite storage, so it's almost infinitely scalable because it uses a Merkle tree. So what is a Merkle tree? Merkle tree is a tree where you just put the top hash or the root of Merkle tree inside the blockchain, but you can demonstrate that your file or your information existed without the need to push any hash inside the blockchain, but only the top hash or the root. So open time stamp provides users multiple and easy way to create an independent verify time stamps. Open time stamp project on GitHub includes these different implementation. The first one was written in Python. Then somebody has wrote one in Java, then in JavaScript because it's easier to use in browser or in some Node.js stuff. They also started to write a Rust open time stamp because Rust, as you told in a precedent talk, it's good languages because it's safe because it's fast, low memory usage, etc. Or on the open time stamp.org website that uses the JavaScript implementation. So now for this slide, I show an example of usage with the Python client because it was the first one. So if you want to use that, you just need to use OTS stamp command and stamp command create the Merkle tree of the file, submit it to some remote server that are the server that write the information on the Bitcoin blockchain every summer. So when you do stamp, the operation creates the hash of the original file concatenates with random nonce for privacy just to avoid having your hash on the Merkle tree directly and recalculate the hash. So you have double SHA hash and it sent the value to the calendar server. So the calendar server add the hash to the Merkle tree and return the response to the client in order to generate the OTS file that is a file you will need to verify the signature later. Of course this file is incomplete because it doesn't contain the record in the blockchain because you need to wait the calendar server to send the record to the blockchain and the Bitcoin networking to mine the block with the Merkle, etc. So when a time is elapsed, some hour the user rerun the OTS tool with upgrade operation and this update the file with which block of the Bitcoin blockchain includes the hash. It's also possible to create a timestamp for several different files simultaneously. In fact we did a test when we got all the ashes of all the files included in archive.org not the web.archive.org, the archive.org that includes the petabytes of files. Of course we didn't download all the files but archive.org API supports to you can ask the hash directly. So we took all the ashes from archive.org and we were able to put all these million files inside only one Merkle route. So it's absolutely scalable because you can put tons of files only with one Bitcoin transactions that you don't need to do yourself but is the calendar server that you have. So it's absolutely free. So the verification requires both the file and the original file or the original hash. And if you want to do that by yourself so without trusting nobody that's what you want. You need an up-to-date Bitcoin node. You don't need a full node but since the attestation is on the block either. But so you just need a prune node that only so you need only a few gigabytes of data instead of almost one terabyte of a full node. So if you do that you are sure nobody can fake your check because OTS asks directly the blockchain and so you don't need to trust anybody including the calendar servers that put your verification on the blockchain. So the OTS file includes three main sessions. The hash with the nodes, the Merkle tree construction because you need to know which other hash you have in the Merkle tree in order to be sure your file is in the tree by your root and which Bitcoin block includes your hash. So the timestamp is saved on a binary file to save space and to avoid problem of interpretation especially on Windows. The file is as OTS extension and it starts with this line. So if you use the OTS information command with the file it prints lots of information so I can't show them because it shows all the single Merkle ashes. But you can try that at home and you can see which Merkle tree is how the Merkle tree is created. So this is some example of open timestamp usage. The website I presented at the start, Proofmod.org that is an Android app by Guardian project that it uses to certify a photo is valid with GPS data etc. And ASA check is an example of how you can use timestamp newspaper article and to stamp it's a website that you can put the stamp on a Twitter. The end.
Compiler Options Hardening for C and C++
Okay, hello, good morning here at the lightning talks at Fostum in Brussels. I want to introduce you Thomas Neiman, senior security technology specialist from Ericsson, and we will give us an introduction to compiler-optioning hiring guides for C and C++. Give him a warm welcome. Thank you very much. Start. Thank you very much. I work for the network platform and telecommunications company Ericsson, but today I'm here to talk about the compiler options hardening guide for C and C++. I also am in the open-source security foundation as the sub-initiative lead for the compiler hardening best practices initiative that has produced this guide, and we had an initial release in November last year. I hope many of you might have heard about the open-source security foundation, but maybe for those who might have not. This is a community of software developers and security specialists who are working towards improving the security of the open-source ecosystem. This means both innovative open-source software as well as these efforts to develop best practices and collaboration around security in open-source software. The background for the work I'm talking about today is the C and C++ hardening challenge. We all know that the C and C++ languages are consistently the preferred languages for systems programming, embedded systems, and performance critical applications. But C and C++ are also memory unsafe, and that means that they are susceptible to a certain classes of programming defects that affect the memory integrity of software written in C and C++. In unfortunate cases, these defects can lead to software vulnerabilities that can be used by malicious actors to then exploit the software in different ways. Addressing these types of vulnerabilities in C and C++ in a large scale presents several significant challenges. There are many memory-safe alternatives for these languages, but there is also a lot of C and C++ code in the world today. Rewriting all of these existing code, the memory-safe languages is both umber-ably expensive, both in monetary value, but also from this kind of opportunity-cost point of view. The alternatives often have unsafe dependencies, and these unsafe dependencies will then slow down the migration to the memory-safe alternatives. One example of this is Rust, which is a very promising language and provides memory-safe guarantees. But if you look at the dependencies, there are some references here to recent surveys where the conclusion was that over 70% of Rust crates, the Rust packages in the official package repository, have some form of dependencies to either C or C++. This is not just a technological problem, but this is also something that is actually gathering regulatory attention. In the US, something that has been very influential in shifting the attitudes towards surface security was the presidential executive order on improving the US cybersecurity in May 2021. Also in the EU, we've had this cyber-resilience act that has also been heavily discussed among the open source communities as well, and specifically relate to memory safety. We've in the past two years seen a lot of guidance from cyber security authorities, including the USA NSA and the US CISA, who have issued these joint publications with other national cybersecurity authorities, the most recent being the December 2023 document on memory-safe road-waps, where they urge organizations to explicit plans how to shift away from memory unsafe code. So what we are doing in this initiative is that we are providing a guide for compiler options hardening, and currently this is specifically geared towards C and C++ code. The idea with this is that we have a guide that will help developers and packages of software to configure programming tools during development to reduce the attack surface of produced software. You can think about this as something that is quite close to what sometimes is called product hardening. If you have a hardening document that usually provides guidance to the parties who are deploying this software in configuring the operational parameters that help you deploy a software security in its operating environment, so we are focusing on these kind of parameters during development that helps everyone who is then later deploying the software. And the modern C and C++ compilers provide many optional features that help improve the security of the produced software, but this must be explicitly enabled when compiling software for the software to actually benefit from them. And if you are consuming software from the major Linux distributions, then these are usually, the major Linux distributions are usually enabling these features by default, but then if you are consuming open source software from source, then you are responsible for making sure that when the software is built, these kind of protections are enabled correctly. And of course, these also come with various challenges, right? So I will not go into this in detail here, but these challenges can sometimes make it difficult to deploy these in a sort of a correct and correct manner, and we hope that this guide will help practitioners in some of these challenges. And the current situation we are seeing now is that according to some academic surveys, the situation for these are actually much better on the desktop side, but especially embedded systems often ship without these protections enabled. So here is a publication from the Network and Distributed Systems Security Symposium from 2022, which shows that there is kind of like a radical difference between the deployment of specific hardening mechanisms between desktop and embedded systems. And of course, compiler options hardening is not a silver bullet, right? So this is something that is necessary in combination with the adoption of memory safe languages, secure coding standard, as well as security testing, but we hope that this is like one way of addressing the C&C++ hardening challenge. So if you look at the guide, you will find that this is sort of divided into sort of four main parts. So we have a large section on the recommended compiler options that currently cover a wide range of different features in GCC and CLANG LLVM, and this includes both flags that one developers about different software defects that are related to these security vulnerabilities, but also flags that will add instrumentation to the binaries, that helps the binaries be resilient at runtime against attacks that are trying to exploit possibly residual defects in the software. We also have a section on discouraged compiler options, so these are compiler options that have some specific purpose, but if you use them inappropriately, they may impact the security posture of the produced software in one way or another. We also have a section on sanitizers, so these are compiler-based tools designed to be used during development and testing to basically pinpoint memory safety issues and other defects, and these provide really a lot of valuable information for debugging and testing, but they often have sort of more runtime overhead or memory overhead which makes their deployment for production software more difficult, but they are very valuable during the development and testing phases. Then we have some information on managing debug data, so this is something that can help in making produced binaries more resistant to reverse engineering, where you have threat actors actually analyzing binaries specifically for ways to exploit them, but of course in practice these decompiler tools that are used for this purpose, they can work with debugging information, so the security of the system should not depend on omitting this information, but there is some sort of guidance with this respect. As I mentioned, we had the initial release of the guide in November 2023, and we have a lot of activity on the OpenSSF best practices working group GitHub pages where the development happens, and for this year we are planning on documenting new features that are in upcoming versions of GCC and CLAN, so this is actually an area where the compiler communities are very active in providing new valuable features that are security relevant, and we hope that this guide will eventually cover all these new features as well, and then we also have some plans with partners to also introduce information on new compilers, so we hope that this will also be possible during this year. And then another effort is that we have a separate guide on using compiler annotations in GCC and CLAN, so there is some work in progress work up on the GitHub if you are interested in that. And this is of course, everything is open source, and I hope that we also welcome contributions also from people who are not necessarily security experts, so we've had very valuable contributions on improving the readability and presentation of the guide, so if you think that there is something that could be improved, I urge you to open an issue or open a pull request towards this material. And other development happens on the OpenSSF best practices working group. GitHub repository and we have calls every other week on Zoom to discuss any open PRs and short developments around the guide, so these are also public. And this slide has some links, more links on how you can participate in the work that OpenSSF does if you're interested, so the slides are available on the talk page on the FOSTEM site if you want to access the links. And lastly, I'll leave this slide open so if you want to access the guide itself, you can do so at the URL here or by scanning the QR code. And that's it for my side, and I want to thank the FOSTEM organizers for giving me the opportunity to present this work here. Thank you very much. Thank you.
Mind the gap: Building a cultural commitment to documentation maintenance
I'm going to give a warm applause. It's your time. Great. Thanks, everybody, for joining me this morning. My name is Fiona Pierce Artiaga. I am the director of technical writing and documentation at Grafana Labs. We at Grafana have a really strong history of open source projects from Grafana, which I think is our most well-known, to Loki, Mimir Tempo, Pyroscope, Bela, Faro, and of course the entire IRM suite built off of our open source alerting project. As a result, we have a lot of experience with maintaining open source documentation, and we've learned some lessons along the way. So I'll share with you how this approach works in our company, as well as other companies that are in the open source community that I've worked for. I won't have time for questions today, but I will be outside afterward, and I'm more than happy to talk about documentation. I've winnowed this talk down as much as I possibly can, but I have loads to say on this topic, and I think there's always loads of questions about it. So when we look at an open source community and a commitment to maintenance, the gap that I'm talking about in mind the gap is the difference between what you hope your documentation will be and where it currently is. And wherever you are in your journey, I think you'll find these tactics and strategies useful. So it really breaks down into three major areas. You need to communicate, you need to minimize friction with your developer community, and you need to engage. So when we talk about minimizing communication, there's really things that I found work extremely well, and there are things that I would recommend not doing. The first thing you need to do, and this may sound really self-evident to you, is an open source project maintainer, but you need to tell your community what you need, and be explicit about the types of content that you find valuable, and why it is that you think it would move the project forward. In addition, you need to tell them how. Provide them with guidelines, give them examples, give them models, give them samples, because otherwise they're flying a little bit blind, and the PRs that you potentially receive may not be of the quality that would help your project. But on the topic of guidelines, you need to make sure they evolve. If they're brittle, you'll find that your contributor community will not feel embraced, it will not feel kind, and it's very important that the community feels welcomed and acknowledged in the process of making your project better. Things that we find don't work. Don't walk into this conversation with assumptions. Just because in your head you know what the project needs doesn't mean your community does. So say it out loud, put it in your readme, put it in a GitHub issue with Help Wanted. Add to the Help Wanted projects on GitHub itself. Go into a blog, if you have a website, publish it on your website. If you're doing a FOSDEM talk, you know, Thomas just mentioned at the end, contributions are welcome because making documentation better improves the project, not just for you, but for everybody consuming that tool or new technology. If you have a readme, make sure it's fresh. Otherwise, your contributors are following the guidelines that they thought were the right things to do, and if they're encountering difficulty moving the documentation contribution forward, it might be because you haven't told them properly how to do that. Have guidelines. If they're not present, people don't know what to do, and evolve them. Minimizing friction. I cannot say enough about the value of writing docs as code. We use a docs as code method, Acrophana. That means we use a source control product, GitHub. I think that's becoming the standard. We use an open source IDE. For most of us, we use Visual Studio Code. We also use a Veil Linter. If you attended one of my colleagues talks yesterday, we're improving the Veil Linter to follow the Grafana writer's toolkit guidelines that we've created. And make sure that the tooling and the tools that you engage with for documentation align with the tools that your team and your project maintainers are working with. When you begin breaking the documentation away from the project, if you put it into a CMS, or if you put it into a separate repo, you will find that the documentation regresses. It's just human nature. The minute you move the docs away from the code, it's not considered a maintenance project anymore. It's considered an other thing to do, potentially an afterthought. And if you're using a carrot and stick method, where your stick is that you're linting and you're making sure that the markdown file is associated with this new code is being updated, you have no way of enforcing that in your CI CD. So while CMSs and XML DEDA may seem really cool, I would highly recommend from my experience that you use markdown, GitHub, some kind of IDE like Visual Studio Code. You'll find that maintenance is much easier because you've minimized the friction to contributions. Engagement. So you've told people what you need. You've given them guidelines. You've given them samples. You've given them models. You have used the tools that they are also using. Don't leave them alone. Engage. If you see a PR that comes in from an outside contributor, make sure you respond to it in a pretty timely fashion. That could be two weeks, but maybe it's not six months. Documentation, additions, and PRs are very, very often the stepping stone into your project. It's when a contributor is checking to see whether it's a welcoming space, whether their contributions would actually add value. The number of typo fixes that we get, because we're not perfect in our technical writing team, the number of typo fixes we get is actually quite large. And it's because users know that that's a quick and easy fix. So as soon as they've given this contribution to you, welcome them. Take their contribution. And the next thing you know, they may have built you a full tutorial. They may have updated all your reference documentation. They may have done something really amazing to move your documentation forward. But if you leave them like the Cookie Monster here in this GIF, they're never going to come back to your project, because it doesn't feel like it's a safe and welcoming space. On the other hand, there are times when you receive a PR and you're, you know, it's just not necessarily the right thing to spend time on. It's a really hard decision. In an open source environment, we like to think that every contribution is welcome. But there's two major areas where we say, you know, this isn't really particularly valuable to the project. One is when the work to make that PR of a high enough quality to produce and publish is greater than the value that's being brought in the PR itself. You can engage with your contributor, point them to your guidelines, see if they can rectify the PR on their own. Or you can delve in and do the full edit. But sometimes you don't have time for that, and sometimes it's not as valuable as you would like it to be. So in those situations, be kind, be respectful, but you may close the PR. The second thing that we have found sometimes occurs is we will have a contributor who comes forward with a very distinct perspective that may not align with what we're trying to accomplish with our documentation, or is adding information that may work on their machine, but doesn't work on ours. And again, be kind, be respectful, and often those things will turn, but sometimes they won't, and in that case, close the PR. My main message here is don't let things linger. If you've decided that this is not something that's going to be valuable to your project, respond and close the PR. It's part of the process of engagement. If you leave those things open, the community will feel like, well, I don't understand why this isn't being moved forward or what's happening with it. So make sure that you're pretty responsive within a reasonable timeframe, and reasonable will vary according to the project. Data really helps in determining the know. We have a wide variety of data that we look at where Grafana is an observability monitoring company. Unsurprisingly, I have a dashboard that I look at on a really regular basis. But when you're just starting out, these are the key items that I would recommend that you look at. You want to make sure that you're looking, if you have a website, the number of visitors to a page that will help determine your top 10 and where you probably want to start if you're looking at maintenance issues. You also want to look at time on the page, and this is a bit of a subtle data point. Some pages that you write are going to be specifically created for users to move to something else, so you don't want them to spend a lot of time on the page. But if you have a page that's quite detailed and has a lot of information, and users are bouncing off of it in 35 seconds, there may be something wrong with that page. And you might want to create an issue and put a help wanted note on it so your community can help improve it, or give you feedback on why it is that they're moving so quickly away from the content you've provided. Data of Last Update helps you know how stale your content is. And some open source projects actually will put in the front matter of their files a date of Last Update and run a cron job on a monthly basis to see if something's six months or more out of date and check in with the subject matter experts or the maintainers of that page and ask them, does this need to be updated? It's been a while since you last looked at it. And then also look at your accessibility scores. And by accessibility, I mean how quickly the page loads and other elements such as the use of the color and whether or not it renders properly on mobile. So, seven lessons learned. I've been working in technical documentation for a couple of decades, so this is a very short list and I'm happy to talk about other lessons learned. Number one lesson is one big swing will not get you far enough. And one big swing refers to in many projects, people realize the documentation is drifted and they'll decide, okay, we're going to have a two-week time period or a three-week project where we're all going to dig in and work on the documentation and it's going to be fabulous at the end of this. We've done this a couple of times at Grafana and it has failed. And it hasn't failed because people don't have good intent. It hasn't failed because they don't have information of their fingertips about what needs to be updated. It's failed because one big swing will not take you far enough. You will get distracted from the project, you will work on something else, you will get a customer escalation, someone might take PTO. If documentation maintenance is not already part of the culture of that project, fixing it in a one-time go is not going to improve it. But that's actually a really valuable lesson because what that helps the project maintainers understand is, wow, we kind of thought we could get this done in a two-week period, three-week period, and now we recognize that we need to incorporate it into our daily business and into our daily updates. So that in and of itself can be quite a valuable exercise. If you're overwhelmed, like where do I start? I have such a wide swath of content that needs to be updated. Help your users get from zero to one. It sounds kind of basic, but start with the beginning of getting started. Make sure that you have a really watertight get started. Once users have gotten started with your project, they'll be much more forgiving of other areas that need better content. And they'll probably help you. Maintenance really only works if you have a commitment to open source documentation that lives in an open source project. When you move the documentation out, you will see that the documentation begins to regress. Linting, using something like a veil linter, is your friend. It helps move the PRs through more quickly. It helps identify more quickly for contributors where they haven't met a guideline, and it removes some of the toil of making that PR good. Guidelines increase your contributions. People know how to help you. They know where to go for guidance and samples and models. But make sure you're being flexible with those guidelines and hearing what the community is telling you. When we wrote the Writer's Toolkit about a year and four months ago, we've iterated on that toolkit a number of times. It's an open source project. We're happy to have issues and PRs opened against it. We've definitely course corrected in a couple of areas because we thought something was going to work. It didn't work as expected. And remember, it's okay to say no. After you've gotten everybody off and running on your project and they're really excited and you've told them how to do it, if the PRs still aren't good, it is okay to say no. That's it. Thank you so much. Okay, thank you for the talk and you are available outside for questions. Here's this. I don't know. Thank you for your talk. Thank you so much for the introduction. I appreciate you for heralding and volunteering your time.
A Lazy Developer’s Approach to Building Real-Time Web Applications
Mark was a wrench and his talk is about a lazy developers approach to building real-time web applications. Give him a warm welcome and applause. Thank you and it's your state. So good morning. Today I want to tell you about my two hobbies. First I'm a musician. I play the bass guitar and my other hobby is a cloud solution architect. That's my money making hobby. And the project I want to talk about today gave me the opportunity to combine these two hobbies into one little project and I want to share the learnings from it with you. Okay, so the challenge. The picture above you see Ralph. He is a friend of mine and he plays along. He plays songs, people sing along. But we had one problem when we play somewhere with WENUs from 100 to 1000 people. Songbooks don't scale. We had songbooks but they got damaged. The WENUs were dark. People couldn't read. Songbooks even were stolen. One fact that was beneficial for the project, we have music stand software on our tablets. They are networked to each other and this music stand software has an API. Terrible software, proprietary stuff. I don't want to talk about this software today. But what we made from it so that you are able to use it in your own projects. How to get lyrics to the people with minimum effort. That was our task. We had to solve. And so that it doesn't get boring, I want to start with a demo so that you see the result and later on I will show you how we accomplished that. So please use your mobile phones and with it decide or your computers will work both. If you call this web page, you will see. Let me show it here. This side, the other side. There it is. You will see exactly that page which I have loaded here. It's waiting for lyrics. So the communication is established and when I now start sending the lyrics, imagine somebody on the stage would change to the next song in the music stand software. Okay, let's do it. I had an AI friend of mine write a few songs about open source and the like. So if my talk is boring, just look at the songs and I'm also open for collaborations in getting music to them. Okay, let's go on. What do we see here? The songs get updated like I just said before. And we even have confetti when there's a new song. So the title of the talk was the lazy developer. Why does being lazy matter? If you are too eager, it can happen that you think of a structure, how to implement something and that you stick to this structure, that you don't have this gut feeling that's too complicated, has to be easier. You create code duplicates and the like. If you're too lazy on the other hand, you get nothing done. So you have to find the sweet spot. Being lazy enough and being eager enough to get things done. And because I did this in my spare time, that was the only approach which could work. So I had to have something easy which allows me to get this job done but also allows to scale to a venue of a thousand people, to a thousand people requesting this resource at the same time. And here's my technical approach to that. We have the people who want to sing along. We have the musician with the music stand with the let me call it rest-ish API, what I saw wasn't so good. A small VM at the cloud provider. Everything should start with something like this. A host name which I said on it. After installed, then I used catty as a static web server for static pages. Great project makes it easy to host things with the same default TLS configuration without any effort. So it was just spinning up the container and it's immediately served the web pages like I wanted to have them. So now we need a component which does the heavy load which transports the data to the devices of the people. There are many solutions around and since I'm working as a cloud solution architect with Kubernetes, you always look at the CNCF landscape. And as well in my company, as for this project, I saw NETS as the solution. We use it for micro service interaction in our projects but NETS also has a web sockets interface which make it possible that the people which are getting the web page through the static web page on the browser, the JavaScript part connects through web sockets to the NETS server. And then the musician needs to have a computer which pulse the API and as soon as there is a new song, the lyrics get sent to the message broker and when we talk about message brokers, there are a few patterns around how messages are being distributed. We have a classical fan out pattern here. The message comes in, the message broker distributes it among all of the subscribed devices and what's really nice about the approach, it's just a few lines of code in the end. So let me show you. We have the project here. You will get access to the GitLab repository at the end of the talk and also linked online. Okay, so we have the NETS server. The ports, the 8443 is the web sockets port over TLS. 3 times 2 is the NETS native port mapped to the outside. Then we tell NETS the host name so that for the TLS mapping and since Katie takes care of the certificates, we map the certificates from the Katie directory as we only file mount which we can access in this Docker compose repository I have set up here. And Katie, the easiest thing, just the regular web server ports mapped to the outside. Katie took care of getting the Lats and crypt certificates automatically. I only had to set the HSTS headers and had an A plus on Wallace SSL check. It's something I always want to try. Okay, and look at the application itself. This div does everything so it's more meta text than real payload on the page. There's the div with the id lyrics and as soon as something new comes in, its content is replaced and the JavaScript part is also something very simple. You see that's everything I did, basically developer. The communications magic is here. We include the NETS web sockets library and then we connect to the NETS server. We subscribe to the subject lyrics and as soon as something drops in as we receive new lyrics, we handle them over to handle lyrics which form its first line in bold and shows the rest just like we received it from the NETS server. If you want to have a look at the NETS configuration, it's also not much. I have defined two permission sets. One default permission so any user of the system has the set of permissions and it's just subscribed to lyrics and we have the lyrics publish security profile for the authenticated publisher. I defined the user with the hash password, assigned the permission and down below here you can see the web socket definition where I also use the TLS certificate, caddy gets me from let's encrypt. Next line I assign anonymous access to the user that it works with the login. No, not login. Okay, that's it in a nutshell. If you are interested about the topic of message brokers, I can highly recommend the book enterprise integration patterns. It's a book from 2003. I'm showing an IT book from 2003 but the principles are still the same. Of course, there are a few new. If you go to the website, they also have listed new principles but I wish I had new in the book 20 years earlier. Now I have it on my desk. My GitHub repo, check out Nats, check out caddy server and it's absolutely possible. You don't have to use Nats for this. You can use an MQTT server. I did the same example with EMQX. Rapid MQ should work. Also with Redis, it also has web sockets integration so you could also use that. My example was in Nats. If you are interested in Nats, I asked the guys of the project if they could send me some. On this corner of the desk, if you leave the room here, you can grab a sticker. After all, we are on conferences for the stickers, isn't it? That way. Okay. That's about it. What did we learn? Let others do the heavy lifting though. Just be lazy enough to find the right ways and get things done and concentrate on the things that really matters. Reach out to me if you have questions. I will be around. Don't forget the stickers. Have a great Svastim Sunday and a safe trip home. I'm Markus Röntschler. Thank you. Okay. Thank you for your talk.
Unpack Phabricator, Welcome Phorge - Forking the Opinionated Open Source Project Manager
Hello, welcome at the lightning talks here at Fosterman Brussels. I want to introduce to you Valerio and with us talk about Unpacked Fabricator Welcome Forge forking the optionated open source collaboration suite and give him a warm welcome and applause and the stage is yours. Thank you so much and so welcome again at Fosterman, it's Sunday morning. So thank you everybody for your time and there are some well-known communities loving Forge and I see some community members here from Mozilla, from Wikimedia and from FreeBSD. So thank you everybody and there are many curious here so I started as curious as well and so I'm Valerio, I'm a volunteer board director in Italian Linux society, some Italian members here maybe no maybe not and Wikimedia Italy take commission volunteer also both maintainer also volunteer for Wikimedia projects and I love also editing OpenStreetMap and in my personal life and at work also in Wikimedia indeed I use a fabricator and Forge intensively and so I would like to share this interest with you because I know you love Forge or fabricator so by the way why I started loving Forge in the company I work for they assist production plants about food indeed also Hazelnuts included so it seems that Italy for some reasons produces a lot of hazelnuts so if you eat hazelnuts please also thank Forge because Forge somehow handles management of hazelnuts I don't know why I've put the screen out in the slide. By the way some of us in me included started handling in weird ways software management like final version of the directory final really final really version 2 final really version 3 and indeed also the Linux kernel started in this way in 1991 it was in that way really then Bitkeeper was who knows Bitkeeper maybe some users here yeah and Bitkeeper was started as a proprietary solution to do serious software version control unfortunately is what it was a proprietary tool so then the interesting thing is that in 2005 the world exploded with free very interesting and powerful tools to manage software versioning including Git, Mercurial and GlueBazaar mostly GlueBazaar for Ubuntu you know and Mercurial is still used nowadays in some projects can you raise your hand if you use Mercurial in production for something right one two three four okay it's still alive instead you know Git it's the backbone for most things popular things and so but most people nowadays think that there is Git hub and is not aware that the backbones of the the world is also included in other tools like Mercurial and Subversion and Bazaar etc and so what about trying to discovering the platforms based on the the previous tools so it's interesting that source forge you know source forge was very popular but then Git Hub started spreading in the planet and focus it just on Git so no Mercurial no Bazaar no Subversion just on Git so from a fancy interface proprietary indeed you collaborate on software and also without knowledge of Git actually and then GitLab started as now open source and you see Fabricator in 2010 was started 14 years ago and so it's interesting to see that in this historical moment a lot of web platform for collaboration started to be widely adopted also this is maybe not so much famous Apache Allura but it was the backbone of source forge and it is to be honest and then you see forge in more recent years and Git the other platforms so by the way after the slides there are some links to do to find the data and this kind of visualization by the way Fabricator in 14 seconds what is Fabricator Fabricator was started as collaboration platform for a big blue social network a proprietary social network indeed and as you know this kind of big projects needs many repositories many software version controls like Mercurial why not let's use Mercurial and many many many people many many emails because as you see as you can see source forge started in 1999 but in the middle of nothing the web was not created GitLab was not created so you know that also nowadays some teams collaborate with emails so you send patches of emails somebody receives the email and so it's still something already done somewhere and also in that period so in 2010 even pricey created Fabricator to avoid this kind of workflows email based and so let's stop sending emails everywhere let's start having a web application called Fabricator and so Fabricator was started and adopted by Facebook Mozilla and Wikimedia Foundation that is maybe a really really active and lovely platform for looking at how powerful Fabricator is and unfortunately just to keep this really really clear and short Fabricator was no longer actively maintained this is strange for open-source software and sometimes we see companies closing the project this is not the case just they stopped working on it and so you may ask okay but why nowadays we should continue to fork Fabricator because they stopped any contribution so why we should continue this development why we should create something new called forge and well I would like to say that forge has a lot of weird features but it's very easy to host if you know how to install a Jumla or our press for a Drupal you already probably know how to install forge because it's really a matter of that and also try to install a GitLab owner Raspberry PI right right I have tried many times it's not you have you need some computational resources for large platforms like GitLab or something like that instead forge require minimal resources and this is really interesting and also attractive for some some some users of forge and Fabricator because I like that the bug tracker of forge and Fabricator is non repository centric I mean if you are on GitLab you need first a repository and then attach issues but on forge instead you just have issues and then you attach tags I don't know so you have an issue about my computer exploded so you attach some tags related to infrastructure related to Valerio that has a dedicated tag or attached a tag dedicated to Linux I don't know and then you can have a sub task without anything related like let's buy another computer for this person that has the laptop completely exploded so with a tag like so I don't know a shopping center for computer materials so this is the thing you don't need to be a developer to have issues on forge you just need to file a task and then attach tags and this is really interesting for companies and for organizations that because it's not really not everything can be filed in repositories in time and so I like it honestly also a killer feature is honestly that you can host Git subversion and Mercurial repositories all in one place you see your history you see your things and I see that this is also a killer feature in my company we have a couple of Mercurial repositories very legacy and why not keeping them alive so you can see the history you can see so we avoid any migration we avoid anything that we just have them there archived it's amazing also you can another nice thing is somehow different that nowadays trends is a local linting and local unit testing so your developer has facilities to have very easy local hints and local unit tests with results mirrored on the website on the patch page and this is also to speed up development because we have not to wait for a remote to build there to do those things in hours I don't know your laptop is often faster for most tasks sometimes and it encourages forking this is not banal most software like WordPress like jumella like github like whatever online platform you can think about with whatever different user use case or is often read only for production you download it if you have a change you have to download a completely different copy from git and start doing things instead for has not any difference between the relays and the development branch you just get pulled the bring the development branch of forge and that is your production you check out the stable that it's online you don't have to compile anything or so it's very easy to fork and people do that and so you may be you may know this this was the first component of a breakator historically and so people stop it to collaborate over emails and started collaborating online with this face interface and then they introduce at the work board so and in gitlab take an inspiration from from a fabricator to have their interface there is a really nice article from gitlab if you search for if you look for gitlab and fabricator online on the official website of gitlab they are very lowly in about describing how they take inspiration from fabricator so as I already said this is a work board not development oriented for example this is just about a single tag dedicated to fundraising backlog related to Wikimedia foundation and not related to a repository so you don't have to create a dummy repository just to work on like this and another thing I like of related to forge and fabricator is that you can have custom forms to create tasks and issues and whatever so you can have a simple form to ask people simple things and just okay if you know some tags put some tags but or just the title it's maybe just okay to organize your work or maybe have a nightmare form with a super detailed the fields about whatever aspect of your of your issue or your idea maybe you want your task visible only to yourself so you prepare the stub and then you release the issue so you maybe you want the task disabled just by you and your boss and your co-workers cannot edit anything this is a bit weird sometimes but it's very very powerful in in about how to managing permissions visibility and this kind of things and also another thing I like it is that once in search is really is really an advanced search so you find all the fields or custom fields if you added a custom field on your issue tracker you can on forge at the custom field like a favorite dinosaur I don't know and you find that field on the search engine indexed and ready to scale so the user has already all the fields on the advanced search and that you can organize the full search for your teams and organize all the full searches in your website easy to access and this kind of things also I like the calendar honestly in the company work for these a lot of this feature to for example they they had a Google calendar to plan each one I think if I am on for them I was filing a Google calendar with Valerius on for them instead now we have whatever calendar each worker has imported in forge so everyone see each one calendar and integrated in forge and this is also I think also another powerful tool is a herald herald is a is a strange to have a definition of herald but it's kind of have conditions and do actions so for example here you can tell to herald hey herald please in my forge installation if somebody does a commit please if the branch is my main and if a file is called foo.txt please send me an email or send or block the commit or add an auditor or I don't know and you can do really really advanced things like blocking people with very weird messages and so don't do this please read the documentation don't push directly master I don't know so also it's really really extensive as I already said this because it has a component to generate memes it has no sense but if you are frustrated at work you have 60 seconds to deploy in production you want to say that their vision is just good so let's approve their vision and you want just to say seems good to me just to say seems good to me and generate your automatic meme and so really I cannot describe all the platform in 15 minutes please visit with foo.txt this is a low platform to be honest in my opinion we need designers we are not so much skilled in designing to be honest but we are PHP developers somehow skilled in how well done PHP application should be done and so if you would like to contribute to maybe FlatHub or Docker image to speed up the deployment to forge on a production and development form you are welcome and so thank you for your time and thank you for your interest in forging for your interest in Xfabricator and let's continue this journey okay thank you very much
geOrchestra Spatial Data Infrastructure
Thank you. So I'm going to talk about georchestral, a special data infrastructure. So before starting to talk about georchestral, I'm going to present myself. So I'm Emilia De Vos. I work as a DevOps engineer in the geospatial department at K2Gamp. So K2Gamp is located in the Alps of France. And I also love about open source, I have a GitHub user if you want to check out my project. So before talking about the actual piece of software, maybe a lot of people do not know what is geospatial data. So geospatial data is the science to process and exploit geolocalized data that can be represented on the map. So if you ever used OpenStreetMap, you probably already used geospatial data. Or maybe I should not say Google Maps, so you probably used geospatial data. So what is spatial data infrastructure? Spatial data infrastructure is a framework of spatial data, metadata, users and tools that are interconnected between each other in order to use spatial data in a flexible and optimized way. So why is georchestral a complete data infrastructure? So georchestral is composed of multiple tools that are open source and maintained by volunteers. So the first tools, the one that everyone knows is PostgreSQL. We use the post-gis extension to store geospatial data. And then we have GeoServer because when you store geospatial data, you want to access it, this data. So you have APIs to access this geospatial data, like WMS, WMTA and so on. And then you have GeoNetwork because when you have geospatial data, you want to have some context, some metadata. So you want to know the description of this geospatial data, you want to know who created it, you want to know the date of the creation of the geospatial data and the license and so on. So that's where GeoNetwork comes into place. And then once you have this geospatial data, you want to visualize it on a map like you will be doing on Google Maps. So you use map stores to show this geospatial data on the map. And then once you have your geospatial data, you may want to control this geospatial data because sometimes you do not want to show this geospatial data to the public. You want to restrict it to people. So that's where the console comes into place. You have users, role, organization and so on. So you could think of a geochester like a Linux distribution. There are multiple tools that are interconnected between each other. And then you have the spatial data infrastructure. So I want to remind that all the components in geochestors are open source and they are actually more component in geochestors that I'm not going to talk in this presentation. So once you know about the insight of a geochestor, why should you use a geochestor? Why are the use cases? So the first one is to manage and share relevant information. So a region of especially in France use geochestors to store their geospatial data and then share it to the public. So here are a couple of names that maybe French people know about. We have a Lille that uses a geochestor. We have Haute France, Grand Est, Haute Loire, Brittany and outside of France. We also have countries like Bolivia. Geochestor is also used in small cities. When people want to know specific information about the cities, they can know like in this example information about a park in the cities of Rennes. So it's very useful to have a discussion on information. And there is also the digital factories Dutch Telecom uses geochestor to deploy the rollout of the fiber optic in Germany. And I want to also talk about the community. Every year we have a meeting, a meetup in the community where people of the existing member of the community gather to share their experience to know a bit more about geochestor. But newcomers are welcome also. Last year that was in Paris at IGN, a famous organization that created maps. And this year it will be in Lille. So you are welcome to join at Le Geocombe. It's called Le Geocombe. So thank you for listening to my presentation. I displayed QR code if you want to see my slides. And if you want to know more about Le Geocombe, it's a meeting. You can go to geochestor.org. There are all these kind of information. And there is also our GitHub organization. And you can also talk to us on IFC. And we have a special docker compose if you want to test geochestor in local. And there is also a demo link that I'm not linking in this slide, but you can see it on the official website. So thank you for listening to me. And I hope you will discover a bit more about geochestor because that was quite short. Thank you. Okay, thank you for your talk. And you are available outside for questions about your talk. Thank you very much. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you.
0 A.D. game: Vulkan API
Okay, our next speaker in live talk session will be Vladislav Bilo. We talk about 080 game, the Vulcan API. Welcome here. Hi, before I get started, how many of you does know 080 as you hand? Wow, I haven't expected that. And by any chance, has someone seen my previous talk in 2021? One? Really? Wow. That's great actually. Welcome to my lightning talk. Thank you for attending it. I'm Vladislav. I'm a senior graphics engineer and in my free time I'm working as a volunteer on 080. I'm going to talk about Vulcan API, a little story how I added it to the game. 080 is a free and open source real time strategy game. If you like games like Age of Empires, Warcraft or Starcraft, then you might find this game interesting as well. The game supports single and multiplayer modes, so you can play with your friends. We are internet or a local network. Also we have plans for campaigns. 080 is a history based game about ancient times. So many in-game buildings, units, maps, names, based on real historical documents. We even have recorded voices for such times. 0DS a year doesn't exist in Julian or Gigaurian calendars, so 1AD goes straight after 1BC. Interesting fact, the only 0 is used as a successful exit status on most platforms. And also a big plus that 0 as a character is on top of the alphabetical order. So like in most repositories, it's also a plus for us. It's not intended, but it's a plus. 0 is a cross-platform game. It works on Linux, Mac OS and Windows. Some users even run the game on Raspberry Pi. The one of the strongest features of 0AD is mod support. The main game itself is also a mod. It's possible to customize and recreate different things from UI to gameplay, from sound to textures. There are mods that add new civilizations, different timeframes like medieval or even fictional worlds like Legend of Zelda. A mod structure itself is an example of a folder with the main game mod. It has the same structure. It's transparent and doesn't require any special knowledge to make a mod. Also, the big plus that each mod works on every platform the game works on. So you need to compile something or pre-build some binaries. Because besides art or maps, we use JavaScript for scripting. By the way, all those pictures that I've shown you are screenshots from the game, so they are not post-processing. Let's focus on that picture. How we can achieve that? We need two key points. First, for a beautiful art made by our artists. The second one is the engine which draws a scene like that to the screen. So, what did we have back in 2021 before I started refocusing our engine? We have mods. We have a engine, a pyogenesis engine. It's our custom game engine. It's written in C++ and it runs mods. We are a library called Spider Monkey. It's a JavaScript library. It was calling on the JL driver directly. It was really tightly coupled. So you might meet different calls to OpenGL from different systems. There were a lot of implicit relations. That's an example of our old code. You might see on top that just direct OpenGL call. The next two calls are high-level calls, then three middle-level to reset bound textures, and then again OpenGL calls. That's not great for modern GPUs and modern architecture. So, which problems OpenGL have? CPU performance and unpredictable cost of some function calls. OpenGL was designed really, really long time ago. It's still evolving. So we have OpenGL 4.6 that was released back in maybe 2018. But it still has a lot of legacy. There is no way to query for supported features, at least standard way to query for supported features or GPUs. We can disable something if we see that GPU isn't capable of. We don't have validation or debug things. We have some extensions that might provide some feedback, but it's not enough. Another big problem that OpenGL is single threaded. It's pretty noticeable problem these days. Just an example how OpenGL reports a version. It's the same build, the same platform. So it can be reported. It depends only on driver version. So that's not so bad because you have major and minor versions in front of that. So at least we can parse that. But if we're going to talk about GPU name, that would be more complicated thing. That's the same GPU, the same platform and the same build, but different drivers, I mean different drivers of open source or not open source like different versions. And they all can report the same card with different names. So we have this structure. What we can do? We have three options. Switch the engine. It's really monstrous way to do things. It would be impossible for us. The second one use third party library like Libangl or BGFX. And the third one that I've chosen is adding an abstraction layer between our engine and OpenGL to cover OpenGL driver and be able to switch different backends. So how it would look like. The dashed line is the abstraction, which is used by the renderer code of the engine. OpenGL backend is our code that converts this abstraction to driver and driver interacts with GPU directly. And with that thing, we have interface like that. If you know C++, you might know what's going on here. It's actually pretty simple. No big brain here. But it allows us to do various things. Like it simplifies our renderer code. Do you remember how many low level or middle level calls we had before? And now we have only high level code. That simplifies renderer code a lot. So we're adding Vulkan backend and we added Dami backend for tests. So now because we have an abstraction layer, we can easily switch backends and we can now run tests with renderer code but on the server because now we don't require to have like a window manager or GPU at all. So by the way, about the size of rectangles, it's not by mistake that they are different. It's really they are because OpenGL is pretty simple in terms of interacting with driver. And Vulkan is much more explicit and you have to do many things manually. But I've designed the abstraction layer dashed line in such a way that it's going toward Vulkan. So actually Vulkan backend is smaller than it could be. And OpenGL backend against is bigger than it could be because it needs to work around some ideas that present in Vulkan and not present in OpenGL. So in OpenGL backend we have about 4K lines and Vulkan about 8K plus VMA. VMA is a third party library. It's a header only. It's used for resource management. So 8K is all our code and we use some part of those 17K lines of code. So about the results from 2021 to 2023 in total, if we sum up all the work we have done, it's about one, like one and a half or two months of all time work. It's just only factoring and adding an abstraction because adding Vulkan is pretty fast. Another way to work fast because Vulkan has a really great feature called validation layers. It simplifies developing for Vulkan a lot. I would say that it could take like one month without validation layers or maybe more. Now we have proper information about GPU because Vulkan itself provides all this information. Also it provides driver version and a lot of different information we can use to detect features, disable them. Also we don't have yet but we plan to add a multi-threading support. So about performance, it's the demo time. That's the game. It works. Let me check on the... This checkbox is not visible from this angle. So... It's a sample map we have. And what I'm going to show you. You might see the FPS counter and in OpenGL it states 30 FPS. And if we were to take a look at performance, the font is pretty small, but I can tell you that it says 32 milliseconds per frame. It's not a GPU time, it's per CPU time. And if we switch to Vulkan... By the way, we need to reload the game because we don't support real-time switching. It would be too complicated. And we're starting the game. And as you may see, the FPS counter is 60 and it's bound by a reset. If we were to take a look at the render now, it's only about 12 milliseconds per frame. So that's pretty. It's about three times faster. And again, that's only about CPU. So was it worth it? Yes, performance up to time and more stable frame times. We have a more predictable and stable behavior because not every platform game up to three times, but other just get a more smooth and stable performance. So do we need to Vulkan at all? Of course. If you can enable, just enable it. If you can easily switch to another library, do it. There should be a picture with ShiaLaBuff. And if you use an own custom engine, it's questionable currently because if it's enough for you to use OpenGL, that's great. But if you have a passionate graphics programmer like I am, maybe, then I would choose to switch to Vulkan. So thank you. And feel free to reach me after if you need some technical details or feel free to write me this email. Thank you very much. Thank you.
Attempt at building a transit app in Africa
Thank you. So, hello. So, hello. I'm Tarek Twati, so I'm a software engineer based in Nantes, France, in northwestern France. I'm a JavaScript developer and passionate about containers and functional programming, and also a railway and transit enthusiast. Today I'm going to talk about an idea I got in the late 2017. I was a foreign student in Nantes, and to navigate through the city, I used a transit app for a local transit app provided by the provider. So, the app helped me get autonomous in navigating through the city. I was able to get everywhere I want and was fully autonomous. I was remembering back in Tunisia, where I could travel. It was a bit more complicated. Back to Tunisia, so in a small country in North Africa. It's a country which is car-centric, so not, let's say, transit-friendly. I was thinking about foreign people, how can they travel like I was traveling in Nantes? How can they travel in a car-centric city where there's no app or service like I was enjoying out there? So, I looked on the internet, there's no solution, nothing was available, whether on Google, Google Maps, or OpenStreetMaps, or any other thing. So, the idea came along to build something like that. So, to build an app, we first need data and to be able to share to other people. So, to achieve some data, there was no, maybe, data platform that maybe the provider, maybe able to use it and extract some data. There was nothing. So, the only thing available was to go see the provider, the transit provider, and see with them what was possible. So, the transit provider was friendly and gave me some data set, let's say, or some timetable that was a bit restrictive. The timetable was from point A to point B, so a terminus to a terminus. And for stations, it was more tricky. There was no number of stations on the line, there was no departure from terminus to the next station, there was only from terminus to terminus. So, I asked the provider about these informations and what he replied to me that there was a guy who only know these stations. So, we do own the data from point A to the point B, but what is going on on the line, we don't know. There's one guy who knows, but no idea who took this guy. But the idea was to map the stations with people. So, what I came along, so it was to take the bus, take the lines I already got data on, and map each stations. And it was also more trickier. So, as the bus doesn't stop at stations, the bus can stop on demand. So, you may take the bus and someone asked to stop, it's not a station. And trickier too, there was no station names, so I came to give names to stations. So, mostly building a transit network. As we don't have stations, we don't have time departure on each station. So, I came along to estimate departure. So, I had to take several times at several periods the same line to have an average between traffic, no traffic, and have an estimation on each station on both directions. It was too exhausting. And ended up having a massive spreadsheet with all this data, and turn it to a GTFS model. I don't know if there's people who know the GTFS. It's standard around transportation, which is able to, which is kind of tricky, but it's mostly CSVs with bad, bad format, but it's mostly standard that able to exchange transit information. So, I was able to get around all this with the help of all the bus drivers. And that's it. So, I now have some data on some lines. I made maybe two months to build this data set. And now what I made later on is to build a web app made in React, based on Google Maps, and to expose mostly the lines and all the station tracks drawn on the canvas. And also to build a chatbot for users where they are able to send their location and get nearby stations and nearby departures. So, why is this an attempt and not a success? First thing, that the service was not too attractive. Though I didn't communicate on it. There was not maybe marketing or no communication. Maybe the lines were also not too attractive. So, the company first gave me three lines to start with, and the three lines were a bit not too attractive. And also the issue I had, I attempted to calculate to have an average departure on each station, but the average wasn't too precise enough. And used with the GTFS, we have the standards, we have validators to validate that data is correct, and not too incoherent, but we still had incoherent departures. So, next step from this project, I did it in 2017. I wanted to get it back and update the GTFS data I had and upload them on OpenStreetMap, so to make them accessible to everyone, and allow other people to contribute to this data set, or build what I have used as data sets. And that's it. Also, to add something, if there's people interested in this project, let me know, we can maybe create a metric channel to contribute to. Thank you.
System for Television Off-air Recording and Archiving, BFI National Television Archive
So, our next speaker is Joanna White and we talk about the system of television affair recording and archiving. BFI National Television Archive. Welcome, Herr. Thank you. Thank you. It's wonderful to be here today. Thank you for coming and thank you to FOS STEM for letting us speak here. I am Joanna White, developer at the BFI National Archive in the Data and Digital Preservation Department. Today I'll be talking briefly about STORA, System for Television Offair Recording and Archiving. It's a project that we've built in-house. So the BFI or the British Film Institute promote and preserve film and television across the UK and the BFI National Archives Department within the BFI and is also one of the largest archives in the world. So we have nearly one million digitized moving image assets in our digital preservation infrastructure or DPI as we call it. That means they've been ingested into our Spectralogic tape libraries for long-term preservation and they've also been catalogued in our collections information database, what we call SID. By far the largest collection of moving image materials in our off-air is our off-air television recordings with nearly 650,000 program files in DPI. You can see a selection of them here displayed. This is our staff DPI browser. It's internal. There's also a further 800,000 preserved. This is off-air recordings waiting to be processed and ingested and seeded in a future project. So the BFI is the body designated by OFCOM as the National Television Archive. Under the Provision and the Broadcasting Act of 1990, the designation allows us to record, observe and make accessible TV off-air under section 75 of the Copyright Designs and Patents Act of 1988 and later the Copyright and Rights and Performance Regulations 2014. Okay, that's the official bit. The BFI National Archive began recording off-air TV to one-inch real videotapes as you can see here in 1985 with the permission of select UK broadcasters. Programs were captured, curatorially chosen, captured by teams who would work there around the clock in shifts. In 2015, off-air TV recording became an automated process for us when we started collecting live TV programs 24-7. To do this, the BBC agreed to provide us with a fork of their Redux Off-air Capture Project, which you can see here. We worked with BBC developers to integrate it into our digital preservation infrastructure. The goal was to store MPEG TS files to our Spectrologic Tape Libraries for long-term storage. This is built on open-source technology. It's run from Linux, installed servers and uses open-source tools to record both television and radio programming for the BBC. At the BFI, we just use it for off-air television. So in May 2022, BBC Redux was shut down. In anticipation, the head of my department, head of data and digital preservation, Stephen McConnacky, launched our own R&D project the year before. Along with two BFI engineers, John Daniel and Brian Fattarini, we built the software recording solution to emulate many features of Redux with the name not to disrupt our existing DPI ingest workflows during that change over period. So like Redux, Stora records satellite-streamed UK broadcasts. The channels are a mix of high definition and standard definition streams, many broadcasting 24 hours a day. One full day of off-air recording captures around 500 programs to our storage. That's roughly 850 gigabytes of data, and that's roughly 300 terabytes every year. So we receive our signals from Astra Satellites, which broadcast mostly direct-to-home TV channels in Europe. It is nice to be considered still in Europe in this regard. They're received by our satellite dishes, passed through Quattro low-noise blocks before passing through TVS, TV, PCI receiver cards. The signals are routed through patch fields to a multi-switch, which selects band and polarization. We use three multi-switches for Stora so we can have 24 potential multiplexes. We've got a SESPA application, which demuques each channel's MPEG transport stream into a single program transport stream, creating a Unicast real-time transport protocol, or RTP stream, and a Unicast user datagram protocol, or UDP stream. We need both for our recording method. If you'd like to know more about the hardware setup, I can put you in touch with my colleague. It's not my area, I'm afraid. For those of you who are familiar with BBC Redux, you may recognise the folder naming convention and the contents of the folders. As I said, we have automated ingest workflows that needed this structure to be maintained. The folder path comprises recording date, channel name, and individual program broadcast time data in the name of the folder. We've got also the Unic event ID, which is for the program that's being shown, in this case 476. With the folder, you'll find three files, the Info CSV. This file contains program information, including channel, title, description, etc. Next, we have the Stream MPEG TS file. This is the recording captured from the broadcast. This is not encoded stream, but it's just dumped directly to storage, so it contains the packetised elementary streams, which wrap the main data stream, usually H264, video codec, AC3, or MPEG audio, subtitles, also in there, and information tables. You can view all this data really nicely when you look at it in VLC. Finally, we have the subtitles in there, which contains an extracted transcript of all the spoken word from the program. It's formatted as a Web Video Text Tracks format, or Web VTT. Making sure that we don't lose any of this information is really critical to our preservation goals. Storage code has been made possible by a wonderful collection of open source tools, which you can see here. We have Linux Ubuntu operating systems, and we use Linux command line tools throughout the code. Storage is written in Python, and a few external libraries such as Tenacity and Python VLC. Python VLC allows us to work easily in the code with the amazing software VLC from Video LAN. You'll probably see them, I'll foster them in the hats. VLC relies on the outstanding FFMPEG libraries to operate. FFMPEG is kind of worshipped at the BFI and in many archives globally. LibdVBT passes service information in the UDP streams, and it's key to how the scripts record the programs. Media Info provides detailed technical metadata for analysis of the MPEG TS files. CC Extractor extracts the subtitles from the MPEG TS file, saving them to a separate formatted file, and Nagios Core provides a monitoring service for real-time alerts when streams fail or recordings stop for us. So I'll quickly talk you through how storage uses these pieces of software. We'll look first at the recording script, which makes the file contain the MPEG transport stream. They used to have two recording methods for the storage code base, but they've been merged into one script now recently. I'll unpack that shortly. Both methods capture the MPEG transport stream using VLC, but they differ in how they start and stop the recording methods. So the first script I wrote utilizes electronic program-grade metadata, which you can see at the top. We get this from a commercial supplier, retrieved daily from their REST API. The EPG data is converted in Python into a JSON schedule for a single day's programs. One is created every day for every channel. Recordings are then prompted to start and stop from this JSON schedule. The script loops through every scheduled item before it then exits at the end of the last program, which usually just after midnight. And then we have shell restart scripts that run from Prontab, which immediately restart the script again, and it picks up the next day's schedule and carries on. Quick shout out here. I'm quite a new developer, and when I had this project placed on my plate, it was a little bit overwhelming, but I came across this script. It was on ActiveState code written in 2015, weirdly also written by somebody named J-White, J-White88. If anyone knows them, please thank them for me. Nobody knows them. I'm going to assume time travel is a thing by the time I'm 88, and I come back in time and give this to myself, which is a nice idea. So onto the second and better method for recording the off-air streams. It monitors the user data-gram protocol stream, UDP stream, and it gets the service information data, watches for changes in the event ID field for that broadcast stream. You can see that in the top. The event ID is that unique identifier for a program. The script stores the last five event IDs that have been broadcast, and if a new one turns up, then it knows that there's a new recording that needs to be triggered. So it should potentially loop indefinitely, monitoring a UDP stream in this way, creating and placing TV shows into their own unique folder paths, which you've seen. And these event IDs changes usually always fall right at the beginning of a new program as it starts to record. So it's a really very neat way to start and stop the recordings in the schedule. And another shout-out is needed here for the open-source project Libdvbt. I think it's a fork from a VLC library, I'm not sure, but it's by Michael Krufke. It's the stream parser and service information aggregator library for MPEG-2 transport streams. The recording script calls up dvbt from a Python sub-process spawn shell, captures the Libdvbt JSON-formatted response. The command has a time-out flag, which usually ensures the information is returned to you within two, three, four, five seconds. This response is reformatted and exported into a Python dictionary, and this provides the trigger for the VLC record start-stop process. So just to visualize how this method works, it does require us to have two streams, which is a little bit awkward, but doesn't really cause us any problems. So here you can see that the script monitors UDP stream waiting for an event ID number change in that stream, so from two, six, five, two, four, five, two, six, four, two, six, five. When the event ID changes, it's sensed the current VLC streaming recording is stopped on the RTP stream, and the new folder is created with the start time and duration of the next program. So in this folder, the RTP stream is placed, captured by VLC. And this is the code used to start and stop the VLC recording. The Python bind needs to create a VLC instance from the instance class in Python VLC and initiate a new media player object. Both are called into the main script to start and stop the recordings. We use the demuxt dump command, which uses a VLC unique codec from the demuxt library, a tool developed essentially for debugging, but it actually dumps the content directly to file without decoding it. I have the append flag also in there so that if a recording breaks midway through a program and then starts again, it will append it to the existing file and not overwrite it. If that happens, a restart warning text file is placed into the channel folder with the date and timestamps so that we can know that there's potentially a break in the stream. This is pretty rare though, it doesn't happen very often. So we also rely on media info software in the get stream info script. It uses the Python sub process again to spawn a media info call capturing the program start duration metadata. This is all then dumped into a CSV file. And then to extract the WebVTT files, we use the software CC extractor. We launch the software and the Python script again from sub process. Sub process is so important to these processes. This is a simple command that flags the WebVTT output format and then creates the file that you can see here. We then import this data into our SID database, which is viewable and searchable and provides a rich text metadata for the curatorial teams. Lastly, we have Nagios, which is an event monitoring system, which issues alerts when problems are detected. We have separate channel alerts for recording failures, which is identified by comparing a checksum between the current stream MPEG TS file and one four seconds earlier. And then we also have a stream check, which looks in the Cesbo software for an on air equals true for every channel. If either of those fail, then we get a display that says critical, but also we get an email that's sent to us with the context for what the failure is. Okay, so that's a rough guide to the store. In particular, how the code interacts with these open source projects. The open source repository contains all the store of scripts, descriptions for the code base, dependencies, environmental variables, and quantum launch details. It has an MIT license. I hope it may be of some interest here. But as a relatively new developer, I'm quite welcome. I welcome kind of feedback and advice. None of the team in the data and digital preservation department have computer science backgrounds. They're all archivists or TV people. I used to be a cameraman and an independent documentary maker. To be able to stand here and talk about this project like Stora, with just a few years coding experience is really mind blowing for me. And particularly at a time when accurately recording our televised social history is really just so critical. So this has really been made possible thanks to the open source tools we use and the developers we see in the room here. Thank you from the archiving world. And there's also quickly a growing interest in audio visual archives globally to try and work more with open source software and standards. Many of us meet annually at a conference called the No Time to Wait conference, which happens here in Europe. We welcome new attendees, who are developers, definitely. This conference has been connected with the development of the FFE1 codec, which was originally an FFMPEG project picked up and expanded by archivists working as developers. This codec is critical to the BFI's long term preservation of thousands of video and film assets. So the maintenance and upkeep of projects like FFMPEG is really very important to us. Traditionally archives have relied on expensive proprietary hardware, software and codecs that are not scalable. They keep their information behind paywalls and they're not likely to offer the kind of technical support we need long enough into the future for long term preservation. So having open workflows and standards developed within our own community is incredibly empowering for us. And yeah, this is the community where it's happening most, I would say, at the moment in the UK, in Europe. That's it. Thank you. Thank you. The next talk will be in five minutes.
Do you know YAML?
So, hello, good afternoon. Then we are going to start the next talk with Tina Mueller. And the topic is, do you know Yamu? Quite interesting topic. So Tina, this stage is yours. Thank you. Hello. Can everyone hear me as well in the back? OK. So who of you knows Yamu? OK, are you sure you know Yamu? So something about me, I'm doing Pearl since 1998. And I'm also intensively doing Yamu since 2017. So I guess I just have a weakness for misunderstood languages. Yes, the topics, some introduction, some history, Yamu usage, versions, new libraries, and Yamu test infrastructure. Oh, I got one extra minute because the timer wasn't started. So Yamu, it all started in 2001. I think 2004 was the first specification. It was invented by Orin Benkiki, Clark Evans, and Ingy.net. And Ingy says hi. And he's also the one who's still actively working on Yamu and relate things. And here's actually a mini talk that he sent me. He wants you to know about. So there's Yamu script. Many people try to do programming things in Yamu, but Yamu wasn't designed for that. Ingy has been working on a new Yamu-based programming language. It's complete and general purpose, best when embedded in plain old Yamu files. Excellent interpolation features, merge, filter, concatenate, any functions you can imagine, define your own functions, solves most programming things that people want to do with Yamu. So here you have a Yamu file, people and places. And this is the Yamu script. You can see the header. And you load the Yamu file. Then you get people from it and the list of places. Here you define a function with interpolation. And here you go over the arguments of the command line. Shuffle until you iterate over the list. And the output is this. And it just works. It's fast. And it's really easy to try it out. Just go to the web page. And there's a code command which executes a bash. And then you have it installed. And yeah, there's a link to it in the slides. And the slides are already online. So have fun with it. And that's the end of the talk in my talk. And I'll go on. So what does Yamu stand for? No Yamu ain't market language. It's a state-of-the-realization language. It's a superset of JSON. It has block style and also flow style, which many people also call JSON style because it's similar. And there are many ways to write a string. But they are all kind of useful in certain areas. It has aliases, like references or pointers, and commons. And there's an allowed comma after the last item. Hello, Jason. Multiple documents in one file. And really powerful tags for loading objects are doing customized loading. And I started this Yamu.info page, which gives you also the right words to actually talk about these. Like, for example, some documentation referred to Yamu's references. But it's called aliases and anchors. And I think it's good to have the right terminology because then you can actually find the right documentation for it. So the history is Yamu 1.1 was implemented by Pi Yamu and Lip Yamu with some divergence to the spec. And the decisions were with good intent. But it had other problems because if you diverge from the spec and others do not, then it's problematic. And many other libraries ported this or used Lip Yamu as a binding. And 1.2, the version 1.2 was not widely adopted for a long while. Many people just didn't know about it. And there is a prerequest for adding 1.2 for Pi Yamu. I created it some time ago, but there are some issues. So it can't be merged yet. And Lip Yamu and Pi Yamu were even used in the NASA Mars helicopter mission. And so this is something you can say these days. Yeah, as mentioned, 1.1 implementations were really widely adopted. And there was no clear change lock for 1.2. And there hadn't been a test suite until 2016. So before 2016, updating a library to 1.2 would have just to be sitting down and read the news back and start from scratch, mostly. So this is about the history. And now from a different angle, how do people actually get in touch with Yamu? So usually you're using an application that is using Yamu or some kind of Yamu, starting with examples from the documentation. So here's a salt stack. So you have these funny curly braces here. And is this a Yamu file? No, it's an SLS file. It's not a valid Yamu. And you cannot use a linter or anything on it, because first it has to run through ginger templating. And then the result is Yamu, hopefully. And many people think this syntax belongs to Yamu, but it does not. And the intro on their website doesn't even say which version it's using, which Yamu version. And here we have an answerable example. And here we also have these syntax, but inside of the string. That's also ginger templating. But it happens after you load the Yamu. So I think that's a better way. It has disadvantages and advantages, of course. But also here, many people think this is part of the Yamu. And they come to our Yamu channel and talk about it. The website also doesn't say anything about the Yamu version. Or yeah, it has some links at the bottom. And the GitHub workflow uses this syntax. And that's quite nice, because the dollar sign at the beginning is not special in Yamu, so you don't need to quote it, actually. And many people think this is part of Yamu, but it's not. And also no Yamu version information. So they don't document it. And I tested GitHub, and I think it's doing Yamu 1.2. Also, they learn Yamu in why minutes is mentioned, but it's also not saying anything about Yamu 1.2. So what are the actual changes? So they can be divided in syntax and in schema changes. And the syntax changes are really probably not important. There are also a few backward incompatible changes, but affecting even less people. But the schema changes are important. So the schema is about deciding if something is a Boolean, number, null, or string. And in 1.1, there are 22 values that are resolved as Booleans. On, off, yes, no. And you probably all know the Norway problem, so no is the same as false. So if you have a list of country codes like ES, DE, and O, then you will not get what you think. This is unexpected, and this has been fixed. So the 1.2 schema just has a lot less values, a lot less unexpected things happen. The sex-agasable numbers, base 60, are also gone. Who knows what sex-agasable numbers are? Wow, like a handful. No underscores and numbers allowed anymore, and the base 2 is also gone. And you can also click on the link in the slides to see these differences here. So only six values for Booleans. And yeah, it's a lot cleaner. But still, of course, there is this problem. When is this a number or not? So here we have a number. That's a string, and that's also a string. The thing is, what do you want? Like, you don't have to quote, and that's actually nice in many cases. And you can't have everything. So we have to live with the problem that sometimes we don't know exactly if it is a number or not. But what you can do and what you should do in many cases is actually validate. So who is using JSON schema or something like that for their YAML files? Come on. OK, you should think about it. And same actually goes for JSON. At least sometimes you can make mistakes in JSON. And you don't just send out your JSON or YAML files and think that it will just work. We have tests, hopefully. So use a validator. And we're using that in openSUSE for openQA. It can also protect you from processing unexpected data structure with a recursive tree of aliases, which is known at the Billion Lafs attack, which is actually not a real problem of YAML because they're just aliases. But if you process it or dump it with JSON, it will be huge. Yeah, use the right tools. So who of you knows YAML lint? OK, great. So that's a great tool. And it can tell you if you have unnecessary quotes. But the thing is I hate typing. So if you have an extremely limited number of fingers, you really hate typing. And so I wrote a YAML tidy, which is removing the quotes for me. I don't have to do it manually. And you are using often four meters for other languages, too, right? So here's a YAML tidy configuration. And here you can say the default scalar style should be plain. Here's a YAML file with unnecessary quotes. And this is what it looks like after YAML tidy. This would have been a number, so it's the quoted. And the curly brace here is problematic. So OK. YAML lint currently supports 1.1. And Adrian is working on it to actually support 1.2. And I also want to support 1.1. What else can we do to improve the situation? So there's a YAML test suite. Started in 2016, like I said. And Felix implemented NIMYAML and added a lot of cases. I started with YAMLPP and added test cases. We have 400 test cases. And 12 libraries are using it. But I would like to mention specifically a couple of libraries that's libfyaml. And it can be used as a replacement for libfyaml. It passes all tests. That's really rare. It's fast. It's actively developed. It can run through YAML comments. It's still experimental. And bindings to several languages are planned. There's also a new JavaScript library. It passes, I think, all tests by now. It's actively developed. Sorry. It can run through YAML with comments and blank lines. And it supports 1.2 and 1.1 and the merge key. And it's really good. And yeah, just because it's by me, YAMLPP passes most of the tests, except some things that are not relevant in ProL anyway, like arrays as hash keys. It also supports both YAML versions and comes with a nice highlighter. So YAML containers, I will go a bit faster through the last slides. So Engi started to put things in YAML containers. And you can actually look at the YAML playground right now. So here you have to start a Docker container locally. And now you can live edit YAML here. And now we just added something that is not valid. And there's one library which actually thinks it's valid. But OK. The test matrix is this one. And it really looks very red. Don't be scared. The test suite contains many edge cases. So that's why it's so red. And yeah, we're trying to actually make it better. And so you can also visit us on our matrix channel. There is some kind of construction going on, because Engi is moving the server. But if it's not there, then there's a fallback on IRC. So please contact us. We are really trying to improve YAML and everything around it. And thank you very much. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you.
Introduction to BlissLabs and Bliss OS
Okay, let's go to the next talk right now is John West and the talk is introduction to bliss labs and bliss OS. Thank you John, the stage is yours. Thank you. I represent bliss labs. Thank you very much. I represent bliss labs and bliss OS. So what bliss labs is, is we're a volunteer based 501C3 non-profit organization that helps open source projects thrive, mostly Android based, but we also do Linux projects as well. Our goal is to create and maintain various open source operating systems and other software that helps extend the life of devices in order to help with the world's e-waste problems. Bliss labs also maintains bliss OS and other open source software. We're not just a bunch of projects, we're a very mature open source org with a proper organizational structure. We work to mentor and teach future open source developers in all aspects. We form alliances with other projects that share in our visions. We develop tools to help minimize the complex development process of Android and Linux and aid in learning. And we also act as a global fiscal host for open source projects that are not able to monetize in their current region. So this allows for a much larger opportunity to work with others globally and increase their user base. Our region is global, so we have members of bliss labs all over the world. Age group ranges from 13 to 60. Students, professionals, retirees, the whole nine yards. Women, men were LGBT friendly, different able in key positions like CFO, CPO, CTO. For education, we don't have any requirements really. You can be a middle school student or you could have a PhD in the professor. Our estimated download count from the year we started in 2014 is now up to over 6 million. That's across the entire suite of bliss labs projects. And speaking of the projects, how many of you recognize any of these logos? Anybody? Okay. To go over some of these, we have BlissOS, which is a unified Android experience irrespective of hardware. We use Android X86 as the base OS, works on Intel and AMD X8664 version 2 devices and greater. We also run RayDroid, Android integration with Linux. It's a lightweight containerized base project. We run Android Linux hybrid, which is a cross of Linux and Android running on bare metal hardware. We maintain SmartDoc, which is a desktop UX for Android. We maintain XTmapper, which is an on-screen keymapper for Android as well. We maintain a community called Supreme Gamers, which is centered on Android X86 development. We also maintain BoringDroid, which is a complete open source desktop UI solution for AOSP. We're adding Bliss base to the mix now, which is a production ready example of Android on X86 based on Bliss, but geared towards, we'll say, commercial applications. Then we maintain Android Generic Project, which is an easy button for Android X86 and BlissOS builds. We also maintain BlissROM, which is Android for Android devices. Our matured process model includes building, testing, releasing, and documentation for both dev and users, and then post-release support. Popular open source projects having links with our projects. BlendOS is one. They use our RayDroid project. Ubuntu Web also uses our RayDroid project. Ubuntu Touch is well. PrimeOS, which uses Android Generic Project to generate their images. Then Android X86, we fully support that project by supplying team resources, build servers, development, et cetera. What we do to support sub-projects is we attend conferences like this. We supply hardware for testing, hardware for development, hardware for build and web servers, and communication servers like Slack, Discord, GC, Telegram, Matrix, et cetera. Then we provide software services like storage, source, forging, GDrive, or development services like GitLab, GitHub, Access to Jira, Confluence, Trello, and then servers for updates, OTA, GitHub, and CICL. Which brings me to the next part of what we're doing here, is we're introducing BlissOS, which is an Android on X86 hardware. It's Android for your PC. Funny thing is, my Linux PC took a crap on traveling out here. I actually had to put this whole thing together on Android X86, so on BlissOS. That's what we're seeing this whole thing run today from. It's based on mainline Linux, so using Android X86 patches on top of the kernel in order to provide support for Android subsystem. We have features of desktop UI, changes for X86 hardware, custom house, et cetera. Generic builds, so one ISO runs everything. We have tons of customization options included. Low resources based subsystem, so it's a very low overhead system to run Android on X86 hardware. A lot like Edge Linux would be. Then we have added Linux tools integrated with Android, like Turmux, networking tools. We bring in DRM hardware composer, Greylock, Mesa, all from mainline Linux. Some of the diverse use cases of BlissOS and the likes are kiosks, mobile devices, PCs, gaming devices like the Steam Deck, automotive displays, POS and customer facing displays, new displays, ad displays, TV and large screen applications, IoT and industrial IoT, industrial displays, gaming and component displays. BlissOS is open source. It's based on Apache version 2, GPL version 2 and GPL version 3. If you're using BlissOS, as is, it's free and open source to use anywhere. If you're using BlissOS, using a modified source, still comes as open source as long as you release the source code. It's open for anybody to use in their project. Coming to BlissLabs is not a requirement as long as you release that source code. Then a small per device or perpetual licensing cost for those that are using a modified source and do not wish to release that source code. Some of our milestone achievements as of lately are EIDU. We've gotten into a pilot program for Kenya using their low end hardware with our operating system. Companies have been using it in products lately. We've been shortlisted by Swarchforge multiple times as a featured project. We've been adding more and more devices every year. You can find demo videos on our website as of this week. You can also download an ISO for Android 11, 12, 13, and as of today, Android 14, which is our announcement today. We will be making BlissOS version 17 available within the next couple of days for everybody to test and download. Swarchforge is already on GitHub. The source is already on GitHub. Initial features are ported from Android 13 to 14, including desktop UI, dual ethernet, multi-display, et cetera. What you can expect from us in the near future are more device groups will be supporting Raspberry Pi, RISC 5, et cetera. More leaning towards the Linux hybrid side of things, where we'll have a larger Linux subsystem running Android on top. We'll be more independent and a complete fossil ecosystem, so we'll be providing tools for companies and individuals to create their own cutoff app stores, pretty much. Then we're going to be supplying some Edge IoT and IIoT examples as well in the code. Our process, we're going to be documenting even for newbies, so we're going to continue on that, dumbing it down for everybody. And automated installation support, so we're working on a couple new installers, graphical installers as well as text mode installers for Linux. And then we're going to fine tune our AI-enabled support bots that we use to help users answer the questions so we don't have to. Community engagement includes contests, Easter eggs we put into the operating system. We have goodies we often do giveaways, stickers, t-shirts, et cetera, get points, and then Blisify videos in the link on our website. Opportunities available through Blis Labs and BlisOS are internships, mentorships, open roles paid soon. We'll be web development, server maintenance, project management, HR, finance, and developers. Contributors can get mileage for the next move in their career or can be absorbed by BlisOS, BlisLabs for commercial opportunities. We have a very easy streamlined way of joining. We cut out most of the crap, no ego, but we are healthy and drama free. We are full democracy on team, so we have a flat structure, nobody's head of anything, nobody's overseer, nobody's the god of Blis. And that brings us to where you can find us online. So if you could scan the QR code, that'll take you to our link tree and that has all the links available for you to contact us and move forward in the future. If you have any questions, I will be available outside the room and I have my device and a couple other screens so I could demo if you guys are interested. Thank you very much.
Introducing the Open Podcast API
So, let's start the next talk. Introduction to the Open API, podcast API with Kuhn Glotsman and Karen Einzwort. Sorry for the pronunciation. So the stage is yours. Thank you. All right. Thank you. Thank you, everyone. So, yeah. We are here to present the... Thank you. Thank you. We're here to present the Open Podcast API, which is a new specification that allows users to actually synchronize their podcast listening data, like your subscriptions episodes where you started, where you want to continue, et cetera, et cetera. So why are we actually doing this new API? It's because we have a problem. The problem is that there is actually a defecto standard for synchronizing podcasting data between devices, but it has a couple of challenges, let's say. One of them being that it's no longer actively maintained for the moment. There is a draft for a new version of the API, which is good, but has been still for a while. So that's one issue. But maybe a bigger problem is that there are some technical issues fundamentally in the API and the way it's designed. One of them is about the episode identification, which is basically based on the media URL, which is in the RSS feed. And that thing is not always unique. RSS is a standard, but it's a Wild West at the same time. So we can't really rely too much on that. So that's a problem. And also, the software behind this standard has some issues with feed duplication, which occurs if a podcast changes the RSS feed, they change their URL, then you get the same podcast twice in your subscriptions list. So I said, well, we didn't say that yet, but I'm with Antenna Pot and Kiran is with Funqvill. And in Antenna Pot, what we see is that that service, that software behind the API de facto standard is actually used as a centralized service. So there's a lot of users, which is great, but it's also a restrain on the servers. And so that overload is actually causing end users in Antenna Pot to see errors, and then they come and complain to us and we're like, well, yeah, we don't have too much influence over that. So the solution there to this, these set of problems, is to build that new API standard, which is actually building on the existing standard, but being more extensible, more standards compliant, easier to implement across different projects so that we avoid the centralization aspect. So for users, that means that they can synchronize their subscriptions, listening progress, favorites, cues, etc., etc. That the idea is that they can connect all their different devices. So whether you're on your desktop or mobile, or if you have a work mobile and a private mobile, that all your listening progress, etc., moves from one to the other. And also, this integration with the different apps would allow you as a user to actually switch from Antenna Pot to Cast if you don't like Antenna Pot for some weird reason. And so, that's on the end user side, but we need developers to implement that API, of course. To make that as easy as possible, we want to have clear and comprehensive documentation about the features, but also about the behavior. So if I send this API call, then what is expected to happen? We want to have the specs being reliable and easy to implement. And also, we want them to be feature complete, because different podcasting apps and servers and services all have different features. Some might have multiple queues that you can create. Some like Antenna Pot, we only have one queue that you can create. So we need to make sure that the API covers all these different use cases. So the approach there is to build a new API spec based on the existing standard, which I assume many of you might have guessed, is gpotter.net. Notice that it's a great thing to start from. And there are some issues that we are trying to solve. So we're building on it in a way. We're not building on it. We're taking inspiration from it, I should say. But actually, compatibility with it is not our main focus. We also try to follow the OpenAPI standard specification, because that allows for easier integration into software. With by respecting this standard, we can have CI create libraries, which are always up to date with the latest specifications. And that's our plan also to do that for different languages. And an important aspect there is also that RSS is our single source of truth, meaning that we don't want to synchronize, for example, episode titles, because that's already in the RSS feed. So why would we synchronize data that's already in the RSS feed? But at the same time, we also have already the GUIT of an episode, the unique identifier of an episode in the RSS feed. That's unique, but not really, because of RSS Wild West. So we do actually expect to create and synchronize a true unique identifier for episodes. And then we're also trying to be podcasting to the already, which refers specifically to the GUIT at the podcast level. And there are some technical challenges. One is about the episode identification. Like I said, there's a GUIT in the feed to identify an episode, but that's not always globally unique. So why is it called a GUIT anyway? You have links and you have enclosure URLs. And we thought, okay, to identify an episode in order to sync data between devices, we could do a hash of these three, but they're all changing. They're all optional in the RSS standard. So you might have none of these and then end up with no hash, I guess. So that doesn't really work. So yeah, we are having the solution that the first discover of an episode, whether that's a server, if it pulls RSS feed, the RSS feed, or if it's a client, it creates the GUIT. And then yeah, there's some things that we need to consider. Like first pull the new information from the server and then send it back to avoid race condition, et cetera. And also we expect the client to do the application of episodes. But if you're interested in more technical aspects, there's a link to the notes. Okay, thank you. So building on that sort of quite specific example, there's the more general question of feature compatibility. So clients and servers need to agree in a way on what is compatible. We need to have a way of communicating that. So we can't expect all apps and services to support every single endpoint, every single call because different apps implement different things in different ways. So to sort of get around this, what we've decided is that we should have essentially a core feature set where we say that specific endpoints are considered core and you must support them as a client or as a server in order to be considered open podcast compatible. There is of course then scope to optionally sort of extend this and to add additional endpoints which give us more functionality but are not considered core. So you can then negotiate that between your API, your server and your client. These would be then documented in the specification, what is necessary for compliance, what is an optional extension and then we can sort of work with clients and servers to map that and say what is this, what works for you and what do you need to implement. So what sort of endpoints are we looking to add? Well, we've got a few that we've already been working on. So as Kun has already mentioned, subscriptions is a big one. It's fetching and storing and syncing all of your feeds, all of your subscriptions between devices with the option to update them, change their URL or change their position or whatever it may be and delete them and to manage them across all devices. Versioning this is an important one. If the specification changes and we decide to deprecate something, change an endpoint, we need to express what major versions are supported so that clients are aware of what they are able to get from the server. We are currently discussing episodes but as Kun has already alluded to, this is a very complicated thing. So we already have a pad full of information about how we will synchronize this but the goal is to have that sort of implemented to synchronize status and playback position, how long you've played it, that kind of thing for all episodes across all different feeds. In future, we would also like to be able to synchronize settings or give an optional endpoint for synchronizing settings, search endpoint, discovery for discovering similar podcasts and features and also ratings and reviews which are becoming a big part of a lot of podcast stores. Who's involved? Currently, you've got myself and Kun. We're from Antenapod and Funqwell respectively. We've also been in conversations with casts, pod friend, Gpod Async for NextCloud and Musicpod. The idea is to get as many projects on board as possible from both the client side and the server side. Funqwell acts as both but we're more steering for server side at the moment. If you are involved in a podcast adjacent project, we would love to hear from you and get your buy-in and your advice and feedback. Just to mention on that last point, those are all open source but the idea is that closed source projects could also use this if they wanted to. What are our next steps? As mentioned, we're still discussing episodes. It's a big thing. Something we need to get right and something that we need to finalize before we can consider ourselves at a point where we have a core endpoint. We also need to discuss authentication. This is super important. You should not be able to query somebody else's status and you should not be able to get a hold of anyone else's data. It must be locked down. We need to discuss how we want to do that. It will probably be just a case of OAuth. That is for someone who knows more about OAuth than me to decide. We're currently building a new website. Currently we have a website which is built using Sphinx but we found some limitations with Sphinx in terms of having dynamic content. We're going to be rebuilding that using Astro and Starlight. It's currently just in a pull request somewhere. We're just waiting for that to get deployed somewhere. We're mapping features across apps so we need to get a greater understanding of what different features are available in different applications, how they present that information, and therefore how we can present that in our API specifications. We want to get a beta implementation in a few applications. Client applications specifically. We would like to have at least two maybe more supporting some of those core API endpoints so that we can show that it works. Which of course means we also want to have a reference server implementation which the FunQuel team will be working on just so that people have something to test against, something that they can deploy themselves if they want to. We can check that our client implementations also work as expected and according to the specification. If you want to get in contact with us, contact details are up here. There's a QR code you can scan but basically search for Open Podcast API. It's where we are. We're on Matrix. We have the website which I mentioned we'll be replacing soon. Obviously we have a GitHub organization which is where all of the conversations are currently happening. Get in touch especially if you are interested in podcasting or are currently involved in podcasting we'd really like to hear from you. If you have questions we will be outside I guess and we would love to hear from you. We're very friendly I promise. Thank you very much for listening. I'll just put the contact details up again so that you can all take your time, scan the code. Thank you very much. It was lovely.
Project websites that don't suck
Okay, so next up we have Emily O'Mear with Project Websites that don't suck. Okay, hi everyone. Thank you for coming to the talk. I'm going to talk about Project Websites that don't suck, at least from a content perspective. So my specialty is the words that you put on the website, not how it looks, as you will discover from my slides, which honestly kind of suck from a design perspective. But anyway, the words on your website, the content really matter, and a lot of projects get this wrong. So that's what I'm going to talk about. Before I dive in too much, a little tiny bit about me. I am a positioning consultant, and I work with open source startups. And what this means is that a lot of the things that are relevant to positioning is about determining what kind of content, what ideas you need to get across on your website. The first thing, once you've determined your positioning, whether we're talking about something that's commercial or an open source project, the first thing you want to do is put that information or translate that information into something you're going to put on your website. I also host a podcast that's called The Business of Open Source. So there you go. Now you know a little bit about me. And now let's talk a little bit about the goal of a homepage. And I'm going to focus most of this talk on the homepage, because honestly that's the most important part of your website. The goal of a homepage is to help people understand very quickly what your project does and whether or not this project is worth investing another five minutes or another 30 minutes of their time into figuring out if this project is going to work for their needs or not. People make these decisions like they come to your website, they look at it for 20 seconds, and then they're like, okay, am I going to invest another five minutes in this? You want people to accurately self-select both in and out. So if you have somebody come to your website, stay on it for 20 seconds and be like, oh, this is not for me because it's not appropriate for whatever my use case is, and they go away. This is a success, provided that they actually made an accurate evaluation of the fact that this is not an appropriate fit for them. There's a third thing that I think is really important to call out here, which is that a good website that communicates clearly what your project does, and it doesn't have any major screw-ups in terms of content, is going to improve your credibility. It makes you look like you know what you're doing. So keeping in mind, these are the things that your homepage should do. A couple of random notes about homepages. A lot of people can get really caught up on bounce rates, and bounce rates are not a super great evaluation of whether or not your homepage sucks, because again, the goal is to repel people who are bad fits. When you repel a bad fit, you are doing them a favor because they did not have to invest five minutes or 30 minutes, or particularly in the case of open source projects, half of a day in order to discover that your project just doesn't work for them. You're also doing yourself a favor because these are dumb questions that don't get asked, or almost, I would say, off-topic discussions. We like to say there's no dumb questions, which is sort of true, but when you get a lot of questions that are clearly off-topic from what the goal of your project is in your community, this is a sign that you're not doing a good job of repelling bad fits. Moving beyond the homepage conversation, when you think about how people usually navigate through a website of a project, if your main audience is developers, which is not the case for every single open source project, usually they'll go directly to the docs from the homepage, and if there isn't a prominent docs link, the other really important page is your about page. I'm going to talk about about pages at the end of this talk. Okay, so let's talk about the anatomy of a homepage. The first thing that's really important that you want people to see as soon as they come onto your homepage before they scroll anywhere is your market category. What that means is basically how do you describe this project? What is it? Usually, that's a noun, by the way. You want people to understand also why should anyone use this project, which is to say what's the outcome that somebody can expect from using this project. You need to make it really clear who is an ideal fit and who is a bad fit. Again, you want to make it as easy as possible for people to understand which bucket they fit into. Then last, absolutely not least, is what differentiated value do you provide? In any market category, you're going to have a number of different options. Every project has competitors that are other projects. They have competitors that are closed source software. They have competitors that are the status quo. You want to communicate why somebody should use this particular project. What is unique about it in terms of what they're going to get as an outcome? Those are the really important things to communicate in a homepage. Let me give a little bit of a couple of examples. This is the homepage from OpenTelemetry. What is OpenTelemetry? Boom, it's obvious. It's telemetry. Success right there. We understand what this project is all about. What kind of telemetry is it? High quality, ubiquitous, and portable. Ideally, those three adjectives also describe the differentiation of what makes OpenTelemetry different from all the other options. Why should I use it? What am I going to get out? What's the outcome that I would get from using OpenTelemetry? Because I want observability. I want effective observability. I really like this except for the word effective. The reason is ideally when you're describing the outcome that somebody is going to get, you are describing it in a way that makes it clear who is a good fit and who is not a good fit. Effective is the kind of word that nobody is going to self-select out of. Nobody is like, yes, this is not going to, or yes, I'm not looking for effective observability. I'm looking for ineffective observability. Nobody is going to say that. What they could say, for example, is observability at scale. There's going to be some people in that case who would say, I don't operate at scale. I have relatively small needs, small scale, so I don't need something that's built for scale. Keeping in mind, again, that idea of repelling bad fits. Who is this for? Here's a couple of examples. The first one is from the Go website. It's great for teams. That is an indication that if you are not a team, you're not an ideal fit and that you're not who this programming language is built for. The second example is actually my favorite one. This is from Glue, which is maintained by solo.io. You can see how they're described as networking for applications, not infrastructure. If you come to their website and you're looking for a networking solution for infrastructure, you go away. This is from the Apache Cassandra website. You notice that in that first bit, they're describing what the heck Cassandra is. They have three differentiated values here. Runs in hybrid environments. It's very fault tolerant and it's reliable and stable. The good thing about that is it's really clear. The other good thing about it is there's three things. The bad thing about this website that you do not see on my slide is that if you scroll down, there's another row. If you scroll down, there's another row. They actually chose nine differentiated values. The problem is we're actually writing a website for humans and not robots. Websites also have robot visitors, but you don't care about them for this part of the exercise. Humans tend to have trouble remembering nine things. Three or four is where we max out. You want to choose the three, maybe four maximum, but ideally three things that really make your project unique and highlight them on your website. All that other stuff, you can put it somewhere in a comparison chart that has Xs and checks on it. You can make it discoverable on a project page or a use case page, but don't put it on your homepage because your homepage is really for focusing on making people understand really quickly and getting the things that people are going to be able to stick in people's heads. All right. Now, let's talk about about pages. I mentioned that about pages are one of the most important pages on a website. Quite frankly, most of them are garbage and the ones that are not garbage usually don't exist. The reason that I say that they're garbage is, first of all, they'll say things like, we believe in kindness. Great. Not that everyone believes in kindness. This does not make your project unique and it does not tell me anything about what outcome I'm going to expect. It doesn't build your credibility either. Nobody is going to self-select out of we believe in kindness. Or a lot of about pages are not human. This can be especially the case when there's a solo maintainer of a project that doesn't want to appear like they're a solo maintainer or wants to appear bigger than they are. But about pages are a place to talk about the team behind this project, the humans. They're a place to talk about your point of view, why you thought that this project was worth creating in the first place. And they're also a place to build credibility. Credibility in terms of showing what sort of credentials you have. I took this example from a company called TestifySec. They build a couple of open source projects, one of them is called Witness, the other one is Archivista. And you notice on their website they talk about how their team has contributed to all these publications that are about software supply chain security. Basically they're taking the opportunity to really showcase why you should trust them, why you should trust their team with your software supply chain security. If you scroll down on their website, in fact you will see the human beings that are on the team. That is really important about pages are really about being human. Last thing, it is unfortunate that I have to mention this, but it is really important, don't screw up the basics. You're probably not an English major, still use consistent capitalization, try not to have really glaring punctuation errors, really glaring grammar errors. I see this all the time. And the thing, the way that it comes off to visitors is that if you're not careful enough to get these sort of basics right on your website, are you getting it right in your project? Are you getting the details right in your project? It just looks sort of unprofessional. You should get someone to proofread your website. And if your website isn't a language that you do not speak natively, try to get a native speaker of that language to proofread your website too. Just so you can make sure that you're actually getting your point across that you're not screwing up the basics. Okay. So I'm sure everyone is out here thinking like, oh no, does my website suck? How do I figure out if my website sucks? It's really hard to tell from just metrics. Because like I said, a bounce rate is, it could mean that you're repelling a bunch of bad fits and that's okay. What you can do to determine if your website sucks is have conversations with people or even ask people like in a Slack or a Discord group. This doesn't require massive interviews. It doesn't have to be super qualitative interviews. Just asking questions like, when you came to our homepage, did you understand what our project was? How well did your expectations that were set by visiting our website match your experience of actually using our project once you got up and running? Okay. So I'm going to do a quick recap to remind everybody what I've talked about. Websites exist to help people self-select in or self-select out further evaluation of your project. They're making that decision. Am I going to invest another five minutes? Is this worth like reading through the docs? You want people to understand your project quickly and you do not want to neglect your about page and you don't want to get the basics wrong either. Okay. So that's about it. I will be outside if you want to chat. If you have questions, if you want an e-book that is about positioning for free open source projects, you can use a QR code to get it. And positioning is if you do not know what the differentiated value of your project is, if that's a question you cannot answer, you need to figure that out before you start thinking about what goes on your website. Anyway, I'm Emily O'Mear and thank you so much.
FOSS for DOCS
We have Dr. Candace McKita Moore with PhosphorDocs. Thank you. I'm Candace McKita Moore and I will be giving a presentation about PhosphorDocs. First I'll tell you a little bit about myself, then I'll tell you what I mean by PhosphorDocs. Then I will try to convince you to get involved in this or if you're already involved in it, kind of give you some tips about what you're doing. And I'll conclude with some deeper insights about how to be successful when you do this. So about me, I have my bachelor's from Columbia. I got my MD and Technion. I went further in my medical training. I did an internship. I did further training in emergency and radiology. Now I'm a research software at the Netherlands, a science center. So stopping my biography there, you can probably figure out what happened. I married a really awesome guy and I wanted to get out of the place I was. So he got a job in Europe. I said I'd follow. I've learned Dutch. You can't really work in the hospital when you speak, don't speak the native language. So I sort of reverted back to something I did before I went into medicine, which was software engineering. These days, like almost three years later, I do speak Dutch and you can find me two days a week now at Rutter Damoressens Medical Center. So I think I know a little bit about this because I've helped create a lot of what I call phosphor docs and by that I mean free open source software meant to help medical staff accomplish medical research or treatment goals. So right now I work mostly on CVSL, which is sort of processing arterial spin labeled sequences and other sequences from brain MRIs. And that's really typical of what I do. Usually I'm working with radiological data, but not always. So an example where I did it is Resurf EMG. That was a project where I was the lead engineer because my center gave a grant to a group at U20 to work with Respiratory Surface Electromography. The grant ended, I guess, about a year ago, but recently I realized that the scientists and engineers I work with have actually had a couple releases since I left the project, so it's still going strong across a couple academic medical centers. So now I want to warn you a little bit if you're new to this area. If you get into this, you're going to be annoyed. So one thing I want to warn you about is that in hospitals and health systems, on average, we don't have the best computer scientists and software engineers. That is not always true, but maybe you could see that as a positive, right? Because if you know anything, you're going to come out looking like a hero. But seriously, I'm here to try and get more people who are really enthusiastic and into open source to think about doing these kinds of projects. Unfortunately, when you do this, if you work in a hospital, you're going to be at best outside of a hierarchy. At worst, you'll be on the bottom and people will treat you like the gum on their shoe. Okay? Just deal with it. That's part of the culture of medicine. It's a very strong culture. One of the things that distinguishes it is the language. I can tell you from experience, if you're sort of a math nerd like me, nobody's going to speak your language. Just as an example, a long time ago, when I started doing quantitative image analysis on radiological images, I tried to talk to one of my colleagues who was another doctor about it. I was just sort of going off about this and the dot product. He was like, wait, wait, wait. The matrix is like the matrix of the movie, right? He wasn't kidding. I mean, that's kind of just what you'll have to deal with. I want to add a couple final warnings. If you're truly hardcore into FOSS, you will just have to make peace that people in healthcare systems, they use all of these proprietary products when perfectly good FOSS is available. Part of that is trash issues. Part of that I really blame on us as FOSS creators because a lot of FOSS projects that are actually pretty good if you bother to read the code, just have a lack of swag and swagger. What do I mean by that? I brought an example. Locos, a little merch, make your thing stick in people's mind. If you push past all of this and you're creating software, there's a final thing I want to warn you about if you're working in a hospital system. Unfortunately, within the hospital bureaucracies and health system bureaucracies, there are some people with power and some pretty weird ideas about the possibilities and ways to make wealth through technology. At some point, like myself, you will run into people who tell you, no, you can't open source this because otherwise you won't have any money and we won't have any money and that thing will work. It's not that they're evil. It's just that they aren't aware of how these things can actually be viable. Just as one example, most hospital and health systems have some really kind of wackadoodle legacy systems that are all kind of joined together in a weird way in the hospital. If you do something that needs to harvest and move around data, then you can make FOS and also charge the hospitals just to customize what you made. This is just one model. I don't have time to get into all of them, but you have to tell people this because otherwise you'll just hit a hospital lawyer who says, no, no, no. You can't open source that. Now that I've told you about all of these things, I want to tell you why to do this anyways. The simple answer is it matters. I've seen so many bright minds like literal PhDs in physics go to startups where they do things that in my opinion don't matter as much like use neural nets on fashion on the internet, whatever. I tried this in a room of doctors and only one person got one of these. I'm curious if anyone will even guess I'll give you some merch. Can anyone identify either of these diseases? No. Okay. I'll give away a merch at random later. These diseases are diseases that we've had phenomenal success in getting the life expectancy up on. It's cystic fibrosis and sickle cell. Specifically, I can tell you in the case of sickle cell, or both of them obviously, 100 years ago, computers and computer programmers were not part of the story. Today, especially in sickle cell, that curve is going up and that is powered by software. I can tell you that because I work with people who work specifically on this. There's also the international humanitarian angle. In my first slide, you saw me on the coast of Greece where I was part of an emergency volunteer crew. In those efforts, software actually plays a role because we have to do things like track infectious diseases that are coming from people and going places. You'll fight a strong culture in medicine, but you can win and you can do great things. You just need to come prepared. The three things at the top there, I think, are just non-negotiable. You might not have the funding to get all sorts of swag immediately, but for crying out loud, at least get a logo. I've seen so many beautiful projects that don't have a logo and they don't have the kind of person on them who will go out and speak about stuff and they don't get any use. They're going to die. The second thing is get a medical reader. Get an MD who doesn't hold that much to read your documentation and give you honest thoughts about it. You may end up, like I do, essentially splitting your documentation so there's a side for engineers and programmers and there's a side for doctors. The third thing, not as obvious, but probably the most important, is get your legal game going from day zero. So, I mean that for everyone, even if you don't touch a piece of patient data. If you touch patient data in Europe, yeah, GDPR, all of these things will come into play, but hospitals are large bureaucratic institutions, health systems, anything that touches health. Things like even getting the right contract may take months. But if you don't strain this out, you will end up with problems. So those three things are not optional in my opinion. As you move forward, get some videos. This is because, as doctors and other people of this type move higher in the hierarchy, they get less and less negative feedback and they sort of want to appear in charge of everything and they're not going to go to a meeting and tell you, I don't understand. Videos are something they can play in the privacy of their own home and learn what you're trying to tell them. Another thing that I think is really important, especially because I do signal processing, is getting more than one institution on board early. You will discover that algorithms you might be using at one institution might not work so well at another and it's better to discover that early. And of course, it's great if your team has a nice person you can send to meetings. And finally, once you've really built up what you're doing, please get a no-code interface because a lot of physicians are not even going to want to do as much as putting into two lines of command line and you will never convince them otherwise. So on a deep level, these things I'm talking about, they really have to do with culture. And when I think about culture, I sort of prefer the metaphor of water to fish, which I think it was an American writer came up with. You sort of don't know you're in it until you're out of it. And there are really different professional cultures between medicine and software engineering. One tiny example of that is how overloaded the terms are in computer science and software engineering, like correctness. I mean, how many things does Docker mean? I mean, like, this is just painful for me, even though I'm kind of part of both worlds. In the past year, I've gone to a bunch of things that were about diversity, and I sort of left annoyed, but they talked about breaking the world up into F-cultures and G-cultures. And they say F-cultures are hierarchical, conformist. They emphasize the group, and they're usually non-Western. These are cultures that, and now there are lines people think of as like exotic. Yeah, I've worked in the Western medical system for many, many years, and I could tell you that's medicine, OK? Now, there is a reason for that. We can't just all go our own way and do what we want, otherwise patients might start dying. So you have to learn to sort of navigate our culture. And unfortunately, you have to learn how to navigate your place in this hierarchy. So you have to be very respectful of those above you. You have to not sort of make them feel threatened. So give them their learning in smaller doses. I mentioned videos. The other thing that is super effective is to actually go sit with people. Even if they are like what we have in the Netherlands, technical physicians, they might not be so technical. Those people are supposed to be like halfway between an engineer and a doctor. You may have to sit with them and show them about something like as simple as command line that we're all very used to. But that helps because you get a sense of what they will be capable of dealing with. And you probably walk away and think, God, I just need to make a gooey because there's no hope. But you also get a sense when some of your nomenclature is unsettling for people. And it will be. And finally, please, worry about your legal issues. And make something shiny in the sense that it has a logo and it's well presented. So some final thoughts. I want to emphasize that there's a lot of unevenness in how software is sort of spreading across the world. And I've worked in places like Haiti. I've interacted with professionals in several African countries. Software is spreading. And unfortunately, it's often proprietary software. And this is really terrible because what you see is when big companies, just to give an example like Microsoft, sort of move in, they often set up systems intentionally or not that make a sort of vendor lock in inevitable that the health data in the system becomes so fused that the institutions, the hospitals just can't get away from this stuff. It's like sticky. So I think it would be great if people who made FOS sort of got there first and get their shiny in a way that builds trust. So I hope I've convinced you either to think about this or maybe to sort of up your game if you're in this area. And if you have any questions, you can send them to me. My email at the Science Center is right on the bottom line. And that's it. Thank you. Thank you. Thank you.
Journey to an open source contribution
Next, we have Thierry with Journey to an open source contribution. Thank you. So thank you for coming. I'm Thierry Berger. I love open source and I'm here with you today to tell you about a few open source fixes or stories I've done. So follow me. Let's make things better. I don't know about you, but I have a dream. My dream is that different players from different backgrounds, okay, it's a problem technical, but yeah, with different backgrounds, with different interests, players could still be able to play together. So you can imagine an old grandmother playing her match three game, you can see the three candies, and she will be able to share it with her grandchild. Hey, I'm a grandchild. And that grandchild will be able to share that candy to another game, or like a pet's life simulator game, something. So even though they have different interests, they can play together. And it's awesome, so I'm very motivated in that pitch. So I started a hobby project by using the Bay V game engine, which is an open source game engine made in Rust. And the project was going smoothly until I hit a problem. I couldn't input an ad, and it's a big problem because I want my players authenticated. And yeah, every email address is AvanHAT, so that's a problem. So time to fix it. I have to tell you about my keyboard. I'm French, I'm using Azure keyboard. So that means I have to input ALT GRZN0 to input an ad. And actually, behind the scenes on Windows, it's actually equivalent to control and ALT and GRZN0. And that control mapping is pretty important because control can have a lot of capabilities. It can scroll with the mouse wheel, it can copy, cut-paste, it can move the cursor, it can scroll with mouse wheel and move the cursor with arrow keys. Well, anyway, it can do a lot of things, open the task manager and other stuff. So I opened an issue and the library I'm using, it says bevy edui, a bridge to edui, a UI library. Hold the term. But yeah, I opened that issue. It's actually scrolling, it's a very long discussion. You can see it now because I'm using a PDF, but yep. And eventually we landed on a fix. It was a very long discussion. And yeah, I think it's pretty important, pretty interesting when discussions are way longer than the actual fix because it really shows that communication, software development is very important. And yeah, so if you have a problem, just ask questions and eventually everything will progress. So now we fixed our at sign, we can progress, right? That password field was my next difficulty. I have to tell you a bit more about my project. I want to support one-time authentication. So when the user registers to the application, it sends an email to the user. The user copies that email from their email client and copies that into the application, into that password field. And then web. It was working fine on a native, but on web, it's a bit complicated. So bevy edui, I told you a little bit about that, uses AR board, which is a library to support clipboard, but it's mainly focusing on native clients. So it's a synchronous API and web, it's a problem because we cannot really block the browser as it would freeze the entire browser. So it's just not allowed. And AR board desires to say simple, so that means we cannot add web support to it. So bevy edui implemented the local only clipboard, which is handy to copy from inside our application into our application, but that's not enough for my use case because I want to copy from outside my application. So time to fix it. So to fix it, first I checked what were my options and how other projects were doing it, mainly eFrame, an official framework from e4egui. And I could quickly have something working by using the web sys create, create to interface with JavaScript. And I had the copy, cut and paste event going through JavaScript directly from the browser and bypassing all the bevy machinery, which is great. But then I had another problem. I noticed that on Mac web, the controller command was not very well implemented because on Mac user, they don't control A or control C, they command C and command V. So we don't want to support correctly command C and command V to paste and then control A to select all text that's inconsistent, so time to fix it. So I fixed that by using the user agent on web and to detect on which platform I was, so eventually all my controls were consistent. So at this point, my pull request is the state of my whole adventure is that I have a pull request waiting for fixing the clipboard and it's on review. It can be a lot complicated. We did see a lot of little devils in the details. So I let it sit. The main contributor of bevy eGUI is in Ukraine, so you can guess he has a lot of other stuff to do. So anyway, I can just target my branch and I can continue on my journey, right? What is it wrong again? Let's rewind a bit. We skipped a little bit that first fix we did about that add sign. The fix was mostly if it's control, it's not text, but if we are on Windows, it might be text, so we're running the condition. And then web, it will not work because it's a macro there. And it's on compilation time. And web, it will not equate to Windows. It's actually wasn't 32 unknown and rest for the text savvy. And so it's not working. So now that I've studied the subject more, I could have done the same check I was doing before with the user agent to detect the correct platform. And that would have fixed all my problems. But then I did that in another pull request to separate things and do things the correct way. And I was a bit confused, so I first tried to remove that and then I was like, oh, okay, what about alt code? If I remove that, I can input alt code because I'm French. Did I tell you that? And yeah, I'm French, so I like to input A, O, R, weird characters. So I removed that and then I was like, oh, well, there is another if right there, maybe it just would fix my problem. I don't know what I was thinking. I was like, it's an emoji with the exploding head. But yeah, I was like that and pretty telling. But anyway, so I decided I would have to step a little bit back. Mistakes happen. So I blanked that out before in a previous slide if you remember. Baby behind the scene is using Winnit, which is a backend library for handling windowing. It's basically a low level stuff to send raw inputs. So I noticed they had a lot of fixes related or not to my issues and I was like, ah, will I have to do all my fixes again? I wasn't too confident in it. So yeah, mistakes happen because I think I would have been able to fix that by using the user agent and call it a day. But anyway, I like rabbit holes. So I went to update Winnit update. Yeah, why not? So I knew it wouldn't be too easy to do because I had to track multiple main branches, multiple unstable dependencies. I had to track baby main branch and Winnit main branch, which had multiple commits every day. So I had to have a stronger plan than doing it in provision. Yeah, well, anyway. So I had to first update, make everything compile and work. And then after everything compiled works, I could update to the new Winnit goodies. Yeah, Winnit API and good stuff. So first, when doing a dependency update, check the documentation. But I was updating two main. So documentation is not really great. So that meant foraging through the source code, pull requests, chance logs, working with the server, and occasionally chatting with relevant experts. Winnit uses Element, which is a response from them. So yeah, thanks, Element. First, when I was ready, I rolled up my sleeve and dove into the code. And the first thing I did was updating a lot of NM names. And I'm thankful that most IDE support for search and replace. Yeah, VSCode, sorry. Then another task I did was update a lot of dependencies. As you can see, there was a bunch. And I like to focus on a particular one, row window handle. It's a create to provide a common interface to interface with the window. Most of the dependencies had updated to a new version. As you can see, version 0.6, actually. In Bevy, we use continuous integration testing and cargo deny, which helps us prevent from duplicate dependencies. So I had to have my whole stack targeting the same row window handle dependency. And WGPU, which is another low-level create for graphics, wasn't updated to that. And I felt adding yet another main branch would be too much of a time sink. So I had to use version 0.5 for row window handle. And it's quite interesting how it's supported by the whole row window handle ecosystem. You can just add a feature to most ecosystem crates to say, I want to support this particular version and everything will be consistent. I had to do a few pull requests to the dependencies. But everyone was very responsive. We eventually had something consistent. So now we can profit, right? And progress on my task. Not yet, because the WinIt update is pretty complex. It can impact a lot of architecture platforms and stuff. And I don't have every platform to test. And I have also limited time. So I reached out to the baby discord to help. Like, hey, my pull request is nearing completion. Can you help me review it and test it, please? So yeah, we caught a few bugs. So I'm very thankful for everyone who chained in. And eventually the WinIt update was merged. Yes. So now we can profit, right? When I'm doing anything, I like to focus on the objective now. So that meant taking a few shortcuts. I noted them all as faithfully as I could. If you check out the whole pull request and the WinIt follow up, there is a lot of things. But I didn't write it in one go. As I discovered then, I would write them for me, for readers, and for future readers, for afterwards. Yeah, so now I think I will step back a little bit from all this and go back to my use case. Let's remind a bit. We implemented, yeah, we did a lot of things. We implemented copy-paste via JavaScript events. We detected the platform using user agent. And we even updated WinIt. So whoa, that's a lot of things. So does it work yet? Not yet. But I'm very confident that we have everything in our disposition to make it work. So next time we talk, I will tell you how everything works perfectly. Thank you for your time. Everybody can help. And if you want to help Bevy come into our Discord chat or just come talk to me afterward, I have free Bevy stickers if you want. So yeah, just come and talk. Thank you. Thank you.
Aerodynamic Data Models: Flying Fast at Scale with DuckDB
Next, we have Nishant and Mike with aerodynamic data models flying fast at the scale of DuckDV. Okay. Thanks so much. Great to be here. I'm Mike Driscoll. I'm the co-founder and CEO of Rail Data. Here with my colleague Nishant, also co-founder of Rail Data, and we're going to be talking about DuckDB and making super fast data applications with DuckDB. So really quickly, we're going to just talk about our product vision and how we ended up choosing a fast engine, ultimately what the criteria were for DuckDB. And then we're going to talk a little bit about some of the optimizations that we've made at the application data model and data engine layer to get Rail to be super fast for our needs. So I am going to be brave enough to do a live demo here, so we'll see how that goes. But first, what is Rail? Rail is an operational BI tool. A lot of BI tools out there, so what makes us different? Well, first of all, we have faster dashboards. We co-locate the data and compute. Queries are instant, even at billion record scale. We embrace BI as code, deploy globally but develop locally with GitHub workflows, and we do all of the ETL and SQL, and we embrace a metrics first philosophy. So all of the visualizations that you'll see here are automatically generated. So let's get into it. Let's do a demo. If you want to try it at home or in the safety of your own laptop, you can install Rail with that single curl command. And I'll go ahead and do that here. So I've already installed Rail, so we'll just go ahead and get started. So let's imagine we've downloaded it here, and I'm just going to run Rail, Start, my Fosden demo. Let's get that moving. So that's going to fire up a web browser here. And what I'm actually going to do is I'm going to show how we can basically add data, a source. This is going to be just a local file here because that's what I've got access to. So this is basically a Parquet file called GCP consumption metrics. It's got basically data from GCP that I collected on our cloud usage. I'm going to bring that in as a source. And what you'll see here is pretty fast. We imported 4.4 million records, and we can actually, with one click here, we're going to build a dashboard. So it's 4.5 million records with about seven columns. There's a timestamp there. Let's auto-generate a dashboard. And instantly we can look at some trends in this GCP data. Again, these are automatically generated dashboards. I'm actually going to zoom in on a particular area of the data here. And Rail lets me kind of slice and dice this data. I can take a look at what I've been paying for cloud storage. Wow, it looks like something was going on there sometime in 2021. I can zoom into that if I wanted to drill further and find out if there's a particular skew that was driving that. I can kind of get some insights with that, break this down and look at period over period comparisons. There's a lot of visualizations I can do. I can even create pivot tables here in Rail. And that's something that we've spent some time launching. But I won't go much further into Rail because we've only got a lightning talk here. I'm going to turn it over to my colleague, Nishant, to talk a little bit about what we've done to make Rail work really fast with DuckDB. Thanks, Mike. So there have been a lot of optimizations that we have done in order to make it snappy and faster, specifically at scale. So we have a three-in-one architecture where Rail starts with, as Mike showed, connecting to a source of data, then loading that source of data into Rail, storing that into an in-memory database, DuckDB, and then running these operational BI dashboards on the top. So why we chose DuckDB? What were our requirements? How we came to use DuckDB? There are a couple of things why we like DuckDB. The first one is speed. We were able to profile tens of GBs of data on your local laptop. So the demo we showed was just offline on this laptop only with subsequent performance. It's simple and lightweight, so it can be embedded into a very small binary size, which could be downloaded and easily started. It can scale up to hundreds of concurrent queries and scale up to hundreds of GBs of data. So the dashboard that you saw, there were almost 50-plus queries that were fired concurrently from that dashboard when that was loaded. Rail is open source, and we love open source technologies, so is DuckDB. So that was also one of the reasons to choose DuckDB as well. Here is another snapshot of DuckDB commits. This is again like a real dashboard. It's hosted on this demo URL if you want to dig further and slice and dice the different commits on GitHub. You can do that as well, but you can see that there are over 350 contributors on the DuckDB project, which is really great, and there is a good velocity with respect to all the contributions over there. Moving on to what specific optimizations we did across the stack, both in the front-end, the back-end, as well as the database side, what different optimizations we did in order to make it fast. It's not just one optimization that actually gave us the speed at scale. It was a series of small optimizations that added up to a subsequent performance, both starting from the application layer, the platform, and the engine. It was all across the stack, and I'll be going over those in detail one by one. The first thing you might have noticed is that the dashboard is very much focused on the time series where you can slice and dice filter on time. So we wanted the filtering on time to be very, very efficient, and DuckDB's storage format uses row groups to store the data. And with each row group, it also stores min-max of your timestamps. If you don't order your data correctly, you might end up with these min-max, which are spread all across the place. So when you run a query with a filter on a timestamp, you may need to scan all these row groups. So one optimization was to try to sort the data during your ingestion so that your min-max and indexes are more efficiently used. Here you can see there is a query which tries to figure out top-ten products by sales for like the first week of January. And on the left, when we are not ordered, it's scanning like two row groups, but it can scan only one row group and give the result back. There's one small tip that we noticed was that you also do not want to need to sort the data in a perfect way. If your input source is already partitioned by time, you can also just preserve the insertion order during ingestion. This is much faster than just resorting the whole dataset. Another thing we noticed was that we are doing a lot of filtering on dimensions, and comparing numbers is quite faster as compared to comparing strings. So there is a data type in DugDB which is enum, which you can use instead of string columns, which allows for faster comparisons and filtering on those columns. However, there was a trade-off that we had to do over there because now we are also converting a column type, which leads to higher ingestion time. Incremental ingestion also became harder for us because we now have to rewrite the column type where we need to add more values. For example, if there is a new user, there is a new campaign. If we are using it for that column, it doesn't work very well, but it works very well for columns like gender where the values are fixed and it doesn't change over time. The next optimization we did was query cancellation. As you go in the application, a user is interacting with different states, and if there are one state, when I click on the dashboard, it fires hundreds of queries. As those queries are, results are being streamed back and those queries are being executed, there are chances that the user sometimes goes ahead and changes the state of the dashboard by maybe adding a new filter. Clicking on another dimension value. All those things lead to a bunch of queries which are in the queue. We added a queue for those queries and we started cancelling those queries. This reduced the overall load on the database and saved almost 30-40% of the CPU cycles, which overall helped us to scale it even further. We also added a priority queue because not all workloads on your application are going to be the same. Interactive dashboard queries were the highest priority for us, so we wanted that interactive experience on the dashboard, but there are other workloads like schedule reports or API machine generated queries which could be executed at a lower priority. Having a priority queue helped us a lot in order to make the dashboards more interactive. This is an acronym Mike came with today around what you see is what you fetch. We implemented the delay execution in our dashboards. You can see in this slight animation that the rows are actually loaded dynamically as we are scrolling it down, then these things are fetched from the database. We believe that only compute what you want to show to your users, even though we have the scrolling experience here, but it is being dynamically computed. We do fetch like row groups and filter heavily on the data so that we don't end up computing things which we don't need to show to the user at all. Data modeling is also another technique which if you model your data correctly, you can reduce the overall complexity or overall data that needs to be scanned at the query time. You are essentially doing a trade off where you are pushing computations to your data modeling layer, to your ingestion rather than doing it at query time every time. Here is a data model which tries to do a bunch of things. I'll start with aggregation. This is like a sales data set. First, it tries to aggregate the data by day, which in itself reduces the amount of data by 10x. Just by doing that in my data models, I am now able to reduce the amount of data that needs to be scanned for each and every query by 10x, which improves the performance there. There are certain use cases where if the business needs allow you to retain certain amount of data, as the data gets old, the value gets decreased. If you are only looking at last few quarters of data, you may also choose to set some retention rules by applying a filter in your data modeling layer. You can order the data by timestamp as to you better utilize minmax indexes and finally, materialize the output of your model in real. That will store that as a materialized model so that it doesn't need to recompute the view every time. What we actually did here is that we leveraged SQL mainly for our data modeling layer, which allows you to set all these optimizations and do those in your data modeling layer itself. Here are a few resources around a blog post that we did on why we love REL. Here is a link on how would you like to try REL. It's a simple command as Mike showed, which you can use to download and try it yourself. I'll open up for questions. We can also grab questions on the hallway. Great question from the front of the audience. I think I recognize this gentleman as the creator of Click House. Great to see you, Alexi. Questions, how does it scale beyond one machine? It's a great question. Today REL runs, we do run DuckDB for single nodes, but we have been experimenting with other engines to achieve scale. So, fun fact is that Nishant and I worked together at the company that created Apache Druid. It was an advertising analytics business called MetaMarkets. We recently have been experimenting with Click House as well to achieve scale. So today REL's cloud version does use Apache Druid, and some of our customers have 50 terabytes or more of scale running this same application. What we like about saying like Click House is it may allow us to have the same sequel, dialect, same ergonomics from small scale to large scale without having to swap engines. So, but we look forward to trying that out more in the future. Great question. Other questions before we, I think we've got one minute left. Go ahead. Can we attach a mic for the base via Dr. D? Absolutely. I wouldn't recommend attaching per se, but maybe Nishant you can show in the demo. You can read from MySQL. You can read from Postgres. We support dozens of different connectors here. The key thing for REL is that we are not just attaching to a server. We're actually ingesting or orchestrating data into the compute, the in-memory compute engine, and then we co-locate the application very close to that database server. So, yeah, you can attach to any of these data sources and bring in, again, for real developer, you're right in your laptop. I have an example where I've got 100 billion events, 100 billion event systems can scale quite well on a single net. Okay. Thank you very much. We look forward to you trying it out and hearing from you on Discord. Bye-bye. Thank you.
Trusted Postgres Architect - Deploying Postgres with Infrastructure as Code
So, next we have Boris Mahihaas with trusted Postgres architect, deploying Postgres with infrastructure as code. Right, so thank you very much. Thanks for coming. My name, as she says, Boris Mahihaas. I used to be a solutions architect at ADB, but I grew a little bit of white hairs around senior solution architects, but this pieces off a lot of the building architects. So actually my real title is holistic system software engineer, because I would like to see the things from the fundamental interconnectedness of all things. I used to be a developer, operational people, a DBA consultant, so I'd like to see the whole stuff. That's why I see the value of the DevOps philosophy, because it's trying to get the whole thing kind of a one thing of deploying stuff in a more reliable way. Apart from that, I'm also an air guitar player, and I really love metal music, and within other type of music. I'm going to talk about trusted Postgres architect. So here, who uses already Postgres here in the room? Nice, okay, that's very good. Okay, so who didn't raise the hand wants to use Postgres maybe, but I think everybody already raised the hand. That's good. Okay, there you go. Thank you. So this talk is for you. All the rest is also an interesting problem, this, because it's about reliably developing Postgres in multiple different infrastructures. Okay, so this is a use case. This is a developer. She is trying to develop a new project. She wants to use Postgres finally because it's being one of the most popular databases in the last years, and then she has this brilliant idea, but she doesn't want to start using single commands all the time because she wants to have an environment where she can test, test, test, and when everything is working finally well, she's going to be able to deploy that into different environments, either test environment, pre-production, staging, whatever you call it, but exactly the same thing. So the typical stuff that people do is like, I have a container, I'm going to put that specific container into the server. This is not exactly that, but it can also be relying on containers in order to emulate the final architecture. So let me explain a little bit more. So you want to do it in a reliable way, and that's why the name of the tool is called trusted Postgres architect, which is the tool, TPA. We like to call it TPA because it gets people confused with TAP, which sounds like tap, which is for you to get your favorite beverage at the bar. The first goal that she has is to deploy one single instance running Postgres 16. This could be also if you are running already Postgres 14 or 15, you want to try the new features of Postgres 16. Who is already running Postgres 16 here in the room? Okay, so much less than the people who was already using Postgres. So this is probably one way for you to test the new version. So I'm going to just show you code here, which is YAML. You might not like it, but it is the standard way of doing Ansible stuff or doing deployment. So in this whole screen, which is pretty large, I'm going to put all the code that you need in order to have this one single instance. First of all, in TPA, you have to specify your architecture. This is a master one. I know now we call it primary, but master still sounds nice because it reminds me about master, master. So that's why it's called master one. And obviously it's going to be called, FOSTA, is the name of the cluster. Then you can have cluster variables, plenty of stuff that you can ignore at the moment. I'm going to come back to one of them afterwards. But the most relevant here is this one, Postgres version 16. That's what you need. Okay. So this is the version you want then, and it's going to put you that version deployed. So this is the cluster variables. I'm going to come back to the cluster variables later. Then because you want to be able to do deployments in multiple locations for fault tolerance, high availability, it's always good to specify a location. We are in Brussels, so we are going to call location Brussels. And we are going to have an instance, obviously. Thank you. At the ULB, but first we're going to say which type of instance and which default we have. So we are going to do it with a Debian image. It's going to be a specific detailer by TPA, but you can use whatever image you want here. And here the platform, it says that it's Docker. This is not cloud native stuff. It's really an easy way to have a virtual machine that I can connect to it and try to behave as if it is a virtual machine, but it's a container with everything that you need there for a virtual machine. And as you can see, TPA uses Ansible. So we are going to have this Ansible user here for connecting to the machines. And here is the instance. You specify only these parameters, the name, the location. This is a number within your cluster and the role. And the role is to be a primary. So here we use the most modern way of referring to the primary node. That's it. That's all the code that you need for one single instance, of course. Then because this is Ansible, you do TPX-SEC. This is the executable of the trusted architecture. Provision, so you provision your cluster and then deploy. And then you get it. Okay? So how do you connect to it? Yeah? Well, I told you that the Docker containers are going to behave as a virtual machine, so we can SSH to the machine. So we do SSH using that file that is going to be generated in the process of doing the provision. ULB is the alias that we gave to the instance. And then we do the typical thing. I become user Postgres. Oops. I become user Postgres and I execute PSQL. Yeah? So it's really nice because it's using the super user Postgres and you want to have this for applications. So let's get a new requirement. But this is how you connect, okay? You want to have an administrator, which is not the Postgres user. So we are going to call it Slonic because that's the name of the blue elephant. Yeah? And this is going to be an administrator. And then you have Ada Lobleys, which is going to be the application user. And we don't want to use the Postgres database, so we are going to have a FOSMDB, which is owned by the application user. So this gives us a little bit of more security already. So how do you change the previous code in order to allow this new request? So in the cluster variables, we are going to just keep these two variables there, the failover manager, which I'm going to use later on, and the Postgres version. And then I'm going to add the users. Yeah? So this is how you add the user. You say the username. I'm going to ask TPA to also generate a password for that one. And the roles of that particular user, in this case, is going to be super user. That's the administrator. You can also grant permissions and stuff like that. But in this case, I want to have a role attribute. And then we have the developer, which is doing the application part, Ada Lobleys. We just got the password for that one. Then we ask for the database. We give the name and the owner. And that's it. So I'm adding new stuff. So it's not just for the first deployment. It's also for maintenance. Okay? So you can do a git commit of your new version of the stuff. So you can keep a track of your infrastructure with different versions. If you want to revert this, you can also do it. Right? Then, of course, provision and deploy. And you can continue. Now I show you that you can connect to the database through SSH and then PSQL. Now we wanted to do it with an application. So what we are going to do is that we are going to ask for TPA. Give me the password that you generated for this cluster for that particular user. The password is a random stuff, which is not that random. It actually contains a reference to a metal band from Belgium. If you can figure it out, I will pay you a drink from it. And then using that password, you can connect with the normal PSQL. You provide the host IP, the port, the user. And you put the minus capital W user that you can put that password there if you don't want to put that in the PGE pass file, for instance. But now I'm not using the SSH stuff. So now I'm really behaving as if it is an application. Okay? You can take a picture and try to figure out the reference. So we have this now with that little amount of code, but we don't have any fault tolerance yet. What happens if this thing crashes? Well, we want to have a replica, exactly the same version, physical replication, and that's the new thing that we are going to do. So let's take the code again. We have that cluster variables for the failover manager that says absolutely nothing running Poster 16. GPA can do it with a tool called Rep Manager, which I like a lot, and also Patroni, which is also very, very good stuff. So you can choose. In this case, I'm choosing for Rep Manager. And then in the instances, if you remember, I have this primary one. The only thing that I need to do is to add another instance. This one is the VUB. So you see the French-picking one, the Dutch-picking one, but the city is in English so that nobody complains about the one that it picked. Now you have this one. It's a different role. You see this is a replica. And this is the primary. So I have to say who is the upstream of this replica and is the ULB. And I have a cluster with two nodes. Again, TPXSek provision, TPXSek deploy. Let's continue. I want to have more fault tolerance. What happens if there is an attack in Brussels and the old universities get destroyed? We want to have a third replica, but we don't want to do it replicating from the primary. We want to have cascading replication. But if somebody deletes a table here by accident, it's also going to be deleted in all the nodes. So you need to have some backup and restore for point-in-time recovery. That's why you want to have in another part of the country your barman, because you trust your barman, which is for backup and recovery management. It's important that your backups can be recovered. If you just have backup but you never recover them, you don't have backups, basically. So this is what we are going to build now. Let's get back to what we have. We have the location Brussels and two instances. Let's add another location, Vlonderen. This is in the north. And then we are going to add a replica in a very nice place called Achel. Used to be a trapeze beer. Not anymore. It's still a very good beer, but it's no longer a trapeze. This is just for your common knowledge about Foslin. So then you get the location is Vlonderen. I'm going to say that it's also a replica, and I'm going to have the view be a upstream. This is how I build cascading replication. I could do now provision and deploy, and I have my older replica, but I also want to have backup and recovery. So what I do is I add here another location, which is Wallouni. And then I add a very nice place, which is still a trapeze cheese, also beer. This is my favorite one, actually. And then look at the role. It is Barman. So here's how it gets a backup and recovery management just by adding this. So it's an instance with a Barman role. Now where am I taking the backup from? Well, I need some space there. I didn't put it on the bottom because otherwise you wouldn't read it. So it is coming here. You just put backup, rush for, and that's how you build it. So this is all the code you need, and you have already a cluster with cascading replication and backup and recovery tool. Good. You do provision and deploy, and then you're done, and you have built an architecture. So this is all done with Docker containers. The idea is that you can take exactly the same file and put it into virtual machines and other stuff. So if you don't remember how to do the configuration files, it's very easy. You can also use the tool TPXSecConfigure, the cluster for them, and you can say, I want to use this architecture, running PostgreSQL version 16. The platform is going to be Docker. My operating system is Debian, and the failover manager is RepManager. And you get something very similar that you just need to change some names. Now, look at the Docker thing here. If I change it to Bear, because I like Bear Metal, you just change that and you get a different configuration file with some IP addresses that you just need to add. It's basically the same. But you can also deploy to AWS. So TPA is going to also create your virtual machines at AWS if you have the credentials, and it's going to manage all the network things. So you just have to do provision and deploy, and you get everything. It's super cool. Okay, so configure. You provide architecture, the platform, and the OS. Then you do provision, and then you do deploy. And then deploy also provides some hooks, like doing pre-deploy, pre-NDB, post-deploy. You can have some stuff like enhancing your cluster. So to summarize, you have an architecture here, and an executor of the executor, which is of the architecture, which is the orchestrator here, and it's going to deploy to some machines. This machine can be running on virtual machines in AWS if you have the credentials, or it can be Bear Metal, or it can be Docker containers. When I see a ship like this with containers, I always think about the Albatross and the rhyme of the ancient mariners. Okay, so for those people who also listen the same kind of music, you know what I'm talking about. The cool thing is that it is exactly the same architecture here, and exactly the same way of doing the provision and deploy. It's just a different target. So instead of submitting your container to somewhere else, you say, I'm going to deploy the same architecture somewhere else. So what we basically do is when we do a project with a customer who wants to run an architecture, we deploy that with using TPA, the definition, and then we pass to the support team, which is going to continue talking with the customer after we have finished the project, exactly the same architecture, but in Docker containers. So whenever the people who is having the project in production has an issue, contact support and say, like, I have an issue with my architecture, they can deploy a model of it with Docker containers, and then they can test the whole thing there. So it gives you really a continuation of your project. It's not just the first deployment that is easy, but it's also the maintenance and the documentation of it. You don't want to document everything on PDF. You want to document in code. So your configuration file is the documentation of your architecture because you are using your infrastructure as code. That's the main advantage of using this kind of tool. That's why I like it a lot. All right. So these are the platforms. If you want to have a look at it, it is in GitHub now. It is released with a GPL version three. It is recently been open source, but this tool, we have been using it for six years already. So it's quite mature and have our best practices. Everything is done with security layers, SSL, host-based authentication, everything is done for you. And you have the documentation also at enterprisedba.com. To include, it is infrastructure as code. We always put them in Git so that we can have different versions. We know how it is evolving our infrastructure. It is not good only for testing because you can test your entire infrastructure, but you can also use it for deployment in production afterwards. It is a way of documenting your deployment, not just PDF, but documented as code. And it's not just for the deployment, but it's also for the maintenance of your stuff. And finally, we get it open source. It's been there for a while. We have been using it. We have been fighting for getting open source and you get it there. So you are free to use it as much as you want. Now all the documentation is there, but you can also contact me via my personal email, company email, also mastodon. And that part is also for the other social media with full of haters. And Hale Slonic. Thank you very much.
The wonderful life of a SQL query in a streaming database
Next, we have Jan Mench with the wonderful life of a SQL query in a streaming database. All right. Thank you very much for showing up. This is my first conference talk ever. So I really appreciate that you took the time to show up even though it's almost the end of the conference. All right. My name is Jan Mench, and I'm working at Rising Wave Labs. We are the creator of the Rising Wave database. And the Rising Wave database is an analytical database that's focusing on processing unbound data. So any data that comes in endlessly that you consume from something like Kafka or PubSub, for example. If you want to summarize what such a database does, not just us, but any other streaming database system, you could also say incremental updates to materialized views. So what do I mean by that? Well, first of all, let's very briefly define what's a view, what's a materialized view. A view is basically just an alias for your query. Kind of boring. If you query it, it just gets replaced. A materialized view, on the other hand, it's sort of like a cache. So you query it, and the materialized view itself keeps some state. And now the question is, you have a analytical question. You define your materialized view, but the underlying data on which this materialized view depends on, this changes. So how do you update your materialized view? How do you update the answer to your analytical question? Well, traditionally what you could do, you could just recalculate the entire materialized view. This is expensive. This is slow. As an alternative, you could collect some delta. You collect some delta, some tuples, put your aggregations over it, and then apply this to your materialized view. Our goal is to have this delta as small as possible so that we have a materialized view that is as fresh as possible. Ideally, we wanted that it updates whenever you insert anything, a single tuple, it should be reflected in the materialized view itself. All right. So very, running example, very simple. Let's say you have some sort of social media app, and in the social media app, people post their stories, right? And people vote on these stories. And now you have your materialized view that is this, and you want to know how many people, how many votes did each story get? That's your analytical question, and that is what will be updated whenever somebody clicks on vote. So let's say you would insert some votes somehow. In reality, this would come from a source like Kafka, right? And so immediately, this change should be reflected in your materialized view here. So this is, again, just a dummy example, but this is what we're trying to do. You have an analytical question. The data underneath changes. You see this change reflected immediately. So this is the same materialized view as before. You want to know how many votes did each story in your social media app get. So what happens, actually, in your database, if you send it this sequel? Well, first it's going to parse it. It's going to create a logical query plan, optimize it, and then create a physical query plan, all right? So the query plan, you could get this if you, not just with Verizon Wave, but any kind of database, if you use the explain keyword, and then you get this kind of query plan. And I want to point out if you operate this here, right? And these are the table scans, aggregate and join. So basically, you look at your votes and your stories tables. You pull in the new data. You aggregate it. You join it. Done. So you could also, this is what's happening underneath in your database, not just Verizon Wave, all sort of databases, right? So you could also display it like this, with, we call a streaming graph, but it's honestly more like a tree structure. And there you have your scans, you aggregate, you join, you materialize for you. Data flowing from left to right through your system. Because we like headaches, we distribute this, fragment this, and deploy it on different nodes. So how do we actually propagate this change? Well, every operator has some state, and now aggregate, for example, has this state, right? So two people voted for 1,003, it's reflected here, 1,003. The other operators also have state. Now what happens if you insert more votes? Well, the votes table scan will push down an event, and this event here in this case is 789 voted for 1,004. So this is your start of how you propagate the change from the beginning, from the table, or your Kafka, all the way to your materialized view. This then is consumed by your aggregate operator, and it changes states here from one to two. We push down the events further, and this will update the other states of the other stateful operators. So now we kind of established how the state flows from the original state in Kafka, or PubSub, or Pulsar, or RedPanda, all the way over to materialized view. So because we like headaches, and we make it hard for ourselves, we do this in distributed systems. And distributed systems have opportunities and advantages, opportunities and challenges, right? So, for example, you can execute stuff in parallel. This is great because this potentially speeds stuff up. But you also need to be able to recover because eventually in your distributed system, a node will crash. So you should be able to recover, and you should be able to scale up and down when you need more or less compute power. So let's talk about parallelism first. I showed you this streaming graph. This is a simplification. This is parallelism one. What happens if you want to have parallelism of three? Well, it would look a little bit more messy, something like this. You have the same operator three times, always. There's three operators that are scanning your votes table, and there's three operators that are aggregating. Why do we do this? Well, we can speed up stuff now. And how do we still get the correct result? Not just us, but other streaming processors. Well, you could just do some consistent hashing. Like here, you take the story ID by which you aggregate, hash it, and then each operator downstream is responsible for a key range. So this way, you horizontally partition your data, and nobody conflicts with anybody else because everybody is responsible for their own sliver of data. All right, so now we established this exchange here and how we speed up stuff by parallelizing this. But maybe we have an issue that some ports through the system are faster than others. Maybe because of a network hiccup, or maybe because you notice faster. So red is slow, green is fast. This is bad because in the join, we may have unaligned data. You don't have your join partner, incorrect result. So instead of sending the data just down like this, these events down like this, what you do, not just us, but for example, also flink, you would buffer stuff upstream in your operators, and then you would insert into your stream of events so-called barriers. These are these little black boxes in my example. And whenever an operator receives all barriers from upstream, it will send it downstream. And now here the same, if this operator receives this barrier and this one, then it knows upstream is done, it can send downstream. So now that means we aligned data processing across these different nodes on each level on a per-barrier basis. So we insert this every second. And yeah, a barrier, if you want to go through this system, is just a GRPC protobuf message. Right, now recovery. How to recover if a node crashes? Well, first we need a consistent state from which we can recover. So for example, if we tell each operator to persist their state every so-and-so many seconds, that would not work because aggregate and join are in different state here. Aggregate has seen the update, join has not. So if we would persist, this would be inconsistent. We don't want this. So instead what you do, you use barriers again. You send events through the system. Whenever a checkpoint barrier hits an operator, it persists, and it persists again, the second operator. So both have seen the first event, but not the second event. And this way you have a consistent snapshot. Right, everybody has seen the same events. So you have a state from which you can recover, replay your Kafka events, and that's good. Okay, scaling. Last thing. I say you have your nodes, and one node is under pressure. The red node is under pressure. So now you want to make sure that you alleviate this pressure by adding a new node. You use barriers again. You insert a pause barrier into your stream of data. You persist the data of all the state, of all the operators to disk, and then you would add a new node, put a new aggregate operator on this, tell everybody who upstream and downstream if and what partition they're responsible for, reload the state from disk, and then resume. So now this was a very quick overview on how you can do streaming data processing, how this works under the hood. If you ever want to write your own database system, think about this talk. Maybe it helped you. Thank you very much. If you have any questions, you can ask me. Just shoot me a text on LinkedIn. If you want to try Rising Wave, we're free and open source. Compile some source or just use GitHub. Right. Use a Docker container. Thank you.
Switching the FOSDEM conference management system to pretalx
Okay, so next we have Johan van de Waal speaking on switching the FOSDEM conference management system to pre-talks. It's already time? It's not too early? Okay. Wow, thank you. Hello, everyone. I'm going to talk about, well, maybe not such a technical issue, but I'm going to talk about how we migrated from PENTABARV, which is a logo on the left, to pre-talks as our conference management system. So very short thing about me. I do scientific programming. I developed together with my friend over there, fiber-based monitoring solutions. And apart from that, I've been in FOSDEM team for quite some time. I visited for the first time in 2007. I did some research for this presentation. And I've managed the geospatial dev room, and since a few years I've been coordinating the dev rooms, and I'm part of the server team of FOSDEM. I am not a web developer, and I'm also not good at slides, as you can see. It's important to know. So what is pre-talks? Oh, no. What is PENTABARV? And what is pre-talks? What do we use it for? PENTABARV and pre-talks are the tools that we use where people will submit their talk, where dev room managers or we or staff will choose our talks, we will review, we will build a schedule, and then finally we will publish it on the website using that tool. This is the tool we used, which was called PENTABARV. This is the new tool pre-talks, which we use this year for the first time. Why did we switch? Anyone here, who of you has actually submitted a talk this year? Okay. Who are the dev room managers in the room? Well, okay, at least one. Yeah, most of them are in their dev room, of course. So I would love to get some feedback from them. So what was the main issue with PENTABARV before? The main issue with PENTABARV is that it's Ruby on Rails. I tried to get it running on my computer for a few years because we wanted to improve it, but it didn't work. I couldn't get it working. Actually my next slide is maybe more interesting. This is a screenshot I made. So this was a state of pre-talks. This is their master repository. So you can see it has been abandoned for quite some time. We made a fork with a nice name Postgres9, which gives you an ID about when it happens. And you can see we did some updates, but not too much. You would get people making pull requests like this. So yeah, I could not install it. So in the web archive, I found install instructions, and they wanted to add that. I also found those install instructions, and we had some in our internal wiki, but without that, even with those, I could not get it running. That's why at some point I said no, I will not improve PENTABARV. I went to improve another project, which is still in use for other conferences. So I had a look at pre-talks. Pre-talks is a Django application. And when I tried it, I was struggling with that for a very, very long time. And at the end of the evening, I said, well, let's try the other one. So I just did this Docker compose stuff, and I had this thing running, and I could import a schedule which was generated before. And I almost had something that looked like a full conference system for FOSM ready, maybe after one hour. So I was quite happy with that. Yeah. So I was not the first one who had planned to move from pre-talks, because what are actually the issues with PENTABARV? The main issue was nobody could install it. It was still running, but we didn't know for how long. And if some strange bug would occur, I'm not sure anyone in our team would be really capable. Well, if there's really a bad bug, people will start to become a bit better, and they might fix it, but it was unmaintained for such a long time. So there have been many plans to move, but they usually failed because then they said, well, we need to have that feature, and that feature, and that feature, and that feature. And the other thing is nobody works on FOSM until, let's say, September. In September, we do a kickoff, then we open the call for dev rooms, and then it's like, okay, yeah, we're too late. It's not yet ready. We cannot use it. And then the next year, nobody will work on FOSM until September when, again, they kick in, and it's not ready. There's also some resistance to change. People will say, it works for us, so we don't need to change it. That's part of the... This is mostly the internal people. This is not the people submitting. We had people, kernel developers, sending videos of trying to login and do PentaBARF without it working. So I think that's quite a bad state. So in order to avoid those things, there are a few things I wanted to have before the kickoff that we have in September. And for me, there's were two things. It was building a website. It should be possible from PTOX. And the second thing is having an audit log. What is an audit log? This is an example from PentaBARF. It is everything which is entered into the system. It's actually interesting that somebody gave feedback on a talk of a year ago, but it was the last thing I could find. But it shows the difference everywhere in the system. And this is really useful because we have had these discussions. This year, some deaf room manager approved a talk in another deaf room. And then, yeah, that's a bit... It's not nice because the speaker, he books a stick and he thinks he can go. And then, oh, that was a mistake. We fixed that, by the way, so they can no longer approve each other's talks, but in the beginning it was possible. We had a presentation where they completely changed the scope after it was approved. It was also not very nice. And then you can always go back and see what was history. I actually would really recommend people if they do something. Use such a log even for normal database, but definitely for a conference management system, it makes sense. So this was one of the two things I wanted to have ready by September. It didn't have to look nice. This on the left is very useful. This on the right, well, you can do those things if you need them, but you will not get happy from it. But at least we had a way to find out if strange things happened, how they were happened. It was also a template for both changes, or if we changed some configuration, then at least we could trace back what was history. The second big thing which we needed, which I said, was to be able to build a website. Why? Because our website is used by all of our other integrations, or most of them, which include Matrix S review, which is what people use to review their videos, and all the scheduling applications that you have on your phone. So that was also one of the things which had to be ready, at least in some form, before we could switch. Yeah, I forgot. Oh yes, now I know again. A third thing which I did, but this actually only started after this initial session. This was during the actual organization of the event. I created a plugin for pre-talks, which has some specific settings for FOSDM. For example, people, they will pull out a call for papers on the website. They could enter it here. Well actually, all of them, they send it by mail before I had the system ready, but at least next year they will be able to do it. It's a bit similar for most of the other boxes which are there. So they can close their call for papers, so people can stop submitting to their track, because some tracks like to keep it open for a longer time, but then at least they get a URL where they can still submit. So if you're really quick, you can get that code and submit to the main track, but I don't think anyone will accept it. During the event, this is actually something I actually fixed this morning. Dev room manager will find some instructions. I hope that this will grow a bit over the next year, so that they have only one place to look while doing their team. Yeah, so I wrote here most of the things were already a bit late, but it was also only the way during the conference that it was like enough vibe to add those things, to build them, to realize that we needed them, because if you just click around in the interface, it all looks fine, but it's only when you start using it that you notice that you need some extra tools, unless you're really good at testing. I'm not. We had to make some changes to pre-talks itself, mostly to limit Devroom managers to edit other people's things. I made some changes to the review system. As I said before, sometimes it is too late. This was something I didn't understand that no one of them complained, because if they did reviews and they clicked next review, they would get a random track from another thing. That doesn't really make sense, so I changed it. It would always stay in the same track. So what are submissions? This was by default not enabled, but we have some people who submit a lot of talks. One person submitted 15 talks for this Boston. So if all of those are in different tracks, all of these Devroom managers, they would spend time in looking at the same proposal again. Well, now if they see the list, then at least they know, okay, he is already there. Let's keep him. The last things we've had to change, which are a bit complicated and where I'm maybe not completely happy about the workflow, is the fact that we have parallel scheduling. So pre-talks in itself, it's actually made for, you have a large group of reviewers, and then you have a small group of people who actually built a schedule, which works for most conferences, but which doesn't work for FOSM, because we have a lot of people scheduling. So well, that's actually the nice side. Some of the last things which I want to mention, these are like the annoyances of at least some of the people, mostly from the staff. Pre-talks is much less information dense. If you look just even to the resolution, you see that here, all information is a bit spread out over the screen, while here it's very close to each other. You have a search book. If you start typing here, you will get to any talk. Here you would go to proposals, then click on talks, then type in the checkbooks, then search. It will take you much more time. So this is one of the annoyances we had this year, which we hope to improve for next year. Things that we had over Penta, so these are things which already went better this year, even though it was like a migration year. There were much more reviews, so that was, we could reach a larger group, I think, because it was easier to use, or maybe because we promoted a bit more. Devroom managers, they can now send mails. Before they had to export all the email addresses, then run it through their own mail program and then send mails. While here they are in the system, if you would go back as a speaker, you can click open those things and find your mails by yourself. So finally, I have only three minutes left. What are the roadmap? What are the IDs? First of all, the audit log, which I showed you, it's a bit integrated. Well, actually, the code is quite separate, but as soon as pre-talks makes another release, I want to make it a separate plugin, which can be installed completely apart from Falsum, because I think it's interesting for all pre-talks users who choose Postgres, they just should use it. It will always help you. And then the next part is actually a bit going back to what I told earlier, Pentabar which was so hard to install. I don't want to create a new plugin which is as hard to install as pre-talks. Well, nowadays it's a bit hard. We have a demo site, which is that one, pre-talks test. So I actually want to make that into something that you can install with. It will not be one click, but it will be quite easy to install. So people who want to improve it, that they at least can do. Yes. Then the other thing is we made some custom changes. I hope to get rid of them, integrate either them into pre-talks or make sure that there are signals which are like places in the interface where we change something, but that it's already ready for it so that we can put it into the plugin instead of in the code itself which is forked. The last thing is we want to get more information about previous year submissions because it's interesting especially for main track speakers to see has this person presented before, how was the feedback? Maybe he gave the same presentation already, then we will skip him, those kind of things. Finally, my last slide. You can help. Well, an obvious way is to help upstream with the project. Pre-talks is used by a lot of other conferences. I believe they have about 100 conferences or something like that organized every year with pre-talks. Maybe much more than, sorry for it. We have our own repos, but especially the first one is useful. That's a pre-talks integration because that's the place where we really do the bug tracker. I put some questions there and I intend to put a few more also with questions for you. Then especially I mean you are the users like the dev room managers. Which settings do you want to have for reviews? Do you want to score from 0 to 10 or from 0 to 100 or in different categories? I just made a random choice and you should get some feedback. Then there are the two forked repos which are actually only useful in combination with the other one. I just give them for completeness. That's my talk.
FOSDEM infrastructure review
Okay, so for the end of the Lightning Talk Room, we have Richard Hartman and Sebastian Schubut with the Fasdame Infrastructure Review. Thank you and it's great to have you here for the last Lightning Talk of the day for our traditional infrastructure review that we are giving and trying to tell you how we are running this conference. So let's get this started. We were running on a quite stable platform for the last years. We have one Cisco AZR 10,000 or 106 that we do for all the magic that happens here. So you have some Wi-Fi connections and that our customers or our friends that are watching the live streams right now, I hope they can see us, that it will get the live streams. That's what we use those infrastructure for. Most of the traffic we're generating is actually for the live streams that we're doing. We see that in a bit, but roughly 500 megabits per second are just going out of the building to our multiplexing service and where then the videos are spread around the globe. We've had a handful of Cisco switches that we're trying to remove because they're getting quite old and we replaced them with even the same age Arista ones that are a bit easier to handle for us. So we have the first time three new Aristas switches onsite this year, which helped us a lot with the new setup that we planned. We had two very old servers. I think they were last year, they were 13 years. This year it should be 14 years or so. We thought about, hey, we should replace them because last year when we powered them up after the pandemic, the battery packs were depleted and we just saw the K, maybe we should just replace them. We found some additional servers that we packed into our rack and then we just spun up a Proxmox cluster there and started using the first VMs. And yeah, currently all of the monitoring is done locally here at the infrastructure rack that we have. We have a lot of software, Prometheus, Loki and Grafana and it really helps us a lot. Pain points of the last years, they've all been weeded out in the last year. So this year it was quite smooth run of the whole network setup. We were able to dig into problems that have come over the last year quite efficient and didn't spend a lot of time on working on bigger problems. We have also data for the public dashboards that we have on the internet that we send out and just persisted on a machine that's not running here on the campus in case of the campuses. Not available anymore to us because we need to get rid of the rack in the room but Richie will tell you something about the room conditions later. So first start with the video system that we have here. I don't know if any one of you has been running into or wanted to get into a room that said it's full but you can still watch it as a live stream. We have those beautiful live streaming devices that you can maybe see from the background. There's a laptop attached and there's a camera over there and there's also a laptop attached. And we're capturing all the video data with these video boxes that are handmade by the FOSDOM volunteers or the FOSDOM staff guys. And we've put that all on GitHub. You can if you want to see it you can just download everything from GitHub and those little boxes they sent basically all the streams to our render farm which we'll see later and then get processed there and sent to the internet. We have the most of the streams for the live streaming are sent to Hetzner cloud where we have a lot of storage there where we just send all the raw data and then it's been processed, cut into pieces and people like me can just go and edit it and say okay there was a blip that I just want to cut out. This is all not done here. It's in Hetzner but for next year this will all be going to the onsite so we don't need to push data out of the building that never should leave the building because we currently have some processing here at the ULB some at Hetzner so we're constantly pushing data in and out which is obviously not really great use of bandwidth. They were using a semi-autorated review process that some folks of that might be familiar with it. It's called S review. This is the thing that will move into ULB next year once our clusters are production grade because currently we have some issues with them. The video boxes I don't know who we have been to foster them before. They looked like this for years and years and years and they had some minor problems. I don't know if you can see it. It's this device. It's not open source and since we're an open source conference we try to replace everything with open source components and the other problem was this LCD part. They're not available anymore. You can't get them there out of production. So that's why the decision was made to get rid of them and before we had a new box we took everything apart to just burn the ships and so we needed to move. This was the first idea we came up with where you basically, you can see it up here, the box, just have an HDMI grabber card that is then sent to a small device that puts it out as a video stream and we have a network switch as well. We found out that most of the laptops that we use here are not really compatible with that so we had another version. It's here which is way, way better from a compatibility view because it just puts out as a normal, what's it called, video device for the next and we just put a standard, let me see. I should have this here. USB adapter, yeah, just open it up. That's why I brought it. Just standard USB adapter like everyone has it in the suitcase. It's just working and that's how the video boxes are now made. The data and the firmware for those devices, it's known to us. It's not yet completely available as I was being told but we should have something that's not relying on copyrighted, protected things from Blackmagic Design that we used before. The video setup for those who might be interested in it is quite complicated because we need to run a lot of wires. As you can see here, they're running the networking wires and we have all the networking infrastructure built into those video boxes. Our volunteers that are building up the whole conference, they get detailed instructions on how to wire up everything. It looks a bit messy here but you can look at those up there on GitHub, on our GitHub repo. We have a detailed instruction on how everything is set up, how the camera has been set up, what to turn your switches on and how to turn the knobs because as you might know, FOSDM is run by volunteers. We don't do it on a daily basis so we have quite good documentation on that. This is how video looks like from our video control center. We have a big overview where we can actually see audio levels from the room. So hopefully when I'm now speaking, I should be somewhere in the green area. We see how the slides are, how the speaker is and how the final mix that's going to the internet. So we have that in our control room and in every building, there's a per building view so we can instantly react on problems like the problem and that's why I chose this. In K1105, you see that it's lighting up in orange. There seems to be a problem with the audio. You can see the audio is not constant and spiking and that's what's raising issues so we can react on the video. Video rendering, this is how it looked like 2023. This is how our rack looked like in 2023. Looked like spaghetti and it was spaghetti but we completely rebuilt it this year. We emptied the whole rack. We now have a vacuum cleaner. A big one to suck in all of this stuff. For how it's been running in 2024, I like to head over to Riggi Hartman. Thank you. So 2024 is cleaner and we forgot to put in the photo of the nice rack but maybe you saw the mustard on already. We can also put this into the ones we upload. Maybe one point on the laptops. Just on the scale, we have three laptops per room. One is here in the rendering farm. One is at the back with one of the video boxes and with the camera. One is here. This is basically how we do this. This says for anyone who wants to copy this and maybe on the copying, part of why we talk about this and why all of this is in Git and why we try to document it well is we want you to copy this. We want you to be able to run your own stuff with tested software so you can focus more on running things and not reinventing the wheel all the time. So the great advantage here is you have a built-in UPS. You have a built-in mouse keyboard and display in all of those machines and there are strong enough to basically handle all tasks these days. So highly recommend to just split this out this way. Yes, Kubernetes containers, whatever is nice but this just works and if it's stupid but it works, it's not stupid. Really highly recommend it. Also, as you can see on the floor, I think I've been staffed for like 15 years or something now. We did not clean this room since and no one else did. So finally tonight we are going to clean this for the first time. Maybe this gives you also a little bit of an indication of what level of maturity we reached because we are largely done with the fires. We can actually do the optional bus night stuff. We had massive water leakage like this box was halfway full by the time we removed it. Turns out a lot of you breathe and we kept the doors open because it smells really bad back there, like really really bad. I don't know, mold or whatever but anyway the point is we tried to get the bad air out but we let a lot of moist air in and at the end we literally had this dripping constantly. It was not fully a stream coming down from the ceiling but it's still dripping. Like unknown unknowns of running such a thing. Another thing for the ones who witnessed the power outage in K while trying to debug this, at first we had this ladder but then I just took a broom and just kept putting the breaker back in. While we disconnected bits and pieces, point here is yes, A this is fun but B those are the things which you just can't plan for and you just have to deal with them and roll with them. It doesn't matter, get stuff done, keep the conference going. So some stats. As per usual, if you've seen this before, you've seen me talk about those stats repeatedly over the years and there is some method to this. I really just want to be honest and transparent about the good and the bad which we encounter because sometimes it takes us longer. This time we thought we would be quicker but we weren't. We are doing ourselves honest in your direction as well but as you can see, we are doing pretty good. We were actually done with the network, second quickest ever in the history of FOSSTEM since at least I took over that part. So this is, yeah, the point is, I'm making a mess of this, the point is try and build on your success because previous editions of FOSSTEM were run on throwing away everything and reinventing the wheel every single time. It wasted endless amounts of staff hours. Documenting stuff for yourself and keeping stuff running where you can and actually having automation, everything really, really will save your own bacon and you get to sleep earlier. Other numbers, the monitoring was immediate. We do plan to have even more stuff running under the year and basically take it out of the box, water it and it's done. For the 2024 redo, we had big plans when we talked 2020 about what we would do differently in 2021. Great. Last year was mainly about trying to find our footing again and not just like not having a conference as it were. So basically all the stuff which we already wanted to do half a decade ago have become even more pressing. To some extent some price points shifted and such so it was even cheaper to do it but also it just was even more work and even more lost muscle memory and everything. Also full transparency, we thought we would be done with the server migration by now and we are not. We still have five servers running, not three. Also just again like you as the visitors did not notice this and this is how it should be. Here if we have one or 20 servers and if 19 of them are on fire, you care about the product which you are consuming as in the conference. So for the ones who want to run those things yourself, always think from the user-consumer customer or whatever perspective first, everything else comes second. So we will clean this up over the next probably half year because we don't want to touch anything after this. Like we have the post-mortem two weeks and then we don't want to see each other for half a year and then we restart. Anyway we have massively upgraded our backbone. As you saw earlier we have a total of three heirs just now. Two of them are in an MC leg and give us full 10 gig between everything, like multiple redundant. So we can actually run a local storage array and don't have to rely on local disks anymore which obviously means it's quite nice. If something breaks you can actually just flip over the virtual machine or the container and we don't need to spin up something from more or less nothing. Also the new switches which we have, they are literally more than a decade old but it's nice thing about it. We don't need much more so we could just get away with buying stuff for 350 euro refurbished. Yes 350 with SFP cages and 640 for the one with copper ports. Those used to cost 50k each. So I implore you if you have old stuff running around in your own lab or whatever look on eBay or with other refurbished suppliers you can get really nice high-performance hardware which doesn't even spin up the fence a lot so you can even run this in your own room. It's really nice and it's just like get on eBay and get the old stuff because the old enterprise stuff is really really good. We have the Arista 750. Out of support forever but it works and it's really really stable. We got rid of spanning tree and the network is in here. Know why we hate spanning tree. Also for the ones who don't know what Arista does, MC leg is basically you pretend you are a stack on layer 2. But if you have layer 3 you act as if you had separate devices. So you do away with all the pain of layer 2 and spanning tree and other and all the icky stuff and you get all the power of dynamic and of static routing on layer 3 and above while not having the massive pain of lacking introspection into a stacked halfway magic layer 3 stack and also not having to deal with a split layer 2. So anyone who needs some redundancy look at all the rest of us get on with MC leg. It's going to make your life so much easier and more pleasant you can't even begin to imagine. Uplink we have two uplinks now. Cold gave us a line which is fully protected so they built a ring and gave us a 10 gig from there but even if we were to cut one of the fibers which run 2D will be premises we would still have the full 10 gig just running in the other direction and also we have 10 gig through destiny. There's also talks with others who might be willing to give us even more redundant upstream so we don't have this thing in the back of our head anymore that maybe something will break and we are going to be completely offline. Next year we will also hopefully have a second routing instance so even if the main router died because at the moment if I pull the plug everything is got down from your perspective. Hopefully this is also not going to be a single point of failure next year and hopefully it's going to be on open source and hopefully we will keep running on this new primary we'll see. There's plans. Hopefully the next one the next infrastructure review can go in depth on this because this again makes it even easier for you to replicate our conference without having to pay I don't know low five figures for a router. Clone our stuff. Main one is here. You can also find us a matrix you can send email blah blah blah the usual but use our stuff we don't just do this and put it out for us because we feel vanity we do this so people can copy this. We have a few minutes for questions I might need to run at some point to the highlight thing but I'll leave this one here if I do. So any questions. Just shout and I'm going to repeat. Just shout the question. The question was how many people are there you mean at foster. We don't know exactly the last time we had any good guess it was above 12 K but with privacy extensions we can't really count devices anymore so we did the thing where we counted rooms and we saw how many people at the laptops out and how many have cell phones blah blah and did the estimations from there but with privacy extensions we can't do this anymore unless we started to sell tickets control the entrance whatever and we don't want to and we can't even so way above 10 K yes yes I have this in the closing talk slides I don't have the numbers it's it's probably hundreds or it is three figures how many exactly we don't know how many volunteers was the question. And on the last year's we had those video boxes that had the small display mainly those those laptops here for displaying the information so the volunteers that are running the room they can see like what I can tell you it's the host named the uptime if everything is working and get a preview and that's what's all built into the old box and to the small display which is kind of hard to read and in this case the laptop is taking over the part of the bigger box when it comes to processing the video signal is here processed and sent as I think is a stream to the rendering farm I don't know what kind of stream they they chose for for intermediate I think it was mpk now H264 and what we do live streaming and encoded there will be AV1 and H264 I think for this year's edition. So the question was what access points do you use we use the access points that are installed by the ULB by the university here and most of the buildings these are Cisco access points that are hooked up to a Cisco WLC wireless line controller and some of the newer buildings or in some buildings they have been replaced with extreme and extreme switches switches and access points will be the way to go for ULB that's at least that's what we heard and we use their infrastructure there are two rooms the big one Jean-Saint where we bring our own Wi-Fi access points that are just connected to the to the backbone of the university the Wi-Fi controller knows those and adopts them and then just uses them for the additional amount of people that we bring in because we need more more client or higher density there. So the question was that on the main Wi-Fi network we are IPv6 only and how if we document the experiences with that yes we do that for our post-mortem because there were some glitches with we mainly we do DNS64 and NAT64 for getting you access to the old internet and we had some some problems today I think it was packages Mozilla org that we have problems because no today it was today as well and we're documenting these and everything that's get reported to us we then dive into that we have one one one of our staff members that's working at Cisco he helps us with the debugging on this side and we then checking that and also adding that to our post-mortem to see what's wrong or what has been wrong and try to fix this for the next year like content constant evolution that but also to be very blunt about this point the reason why we do this and why we've been doing this for like over half a decade by now is this is an open-source conference a lot of people who work on the Linux kernel on network management all the components are here we want to break stuff and we don't want to plaster stuff over we want the others to fix their stuff because now we have the actual engineers who can drive chains change through the companies through the projects through the whatever from the ground up and this works really really well as soon as we don't know but if you send to video at so the question was what the what the lag is in our in our pipeline how long does it take until we have to stream out at the viewers send email to video at foster morgue or join or just ask in the in the matrix somewhere we'll pick it up but it might take two weeks but I think speaking in youtube terms I think we're still allowed to to call it a live stream if you know if you mean by that it's hundreds of milliseconds top I think you were first the question was do we have statistics how many people use ipv6 versus ipv4 before no we don't because we due to privacy extensions we we can't know but we can tell you is that in the early days most people were in foster do foster dual stack and these days most people are in foster and I honestly think most of them don't even realize it's ipv6 only so the question was if we have if we are planning to change the dual stack name to legacy next year no we had this initially so when I made the whole change and everything I named the I named I named them foster and foster legacy and I switched this over after like three or four years because a the technically precise and correct term is dual stack and b there is no huge use in shaming people for something and also we had some issues with people not using it because they thought it might be 2.4g and like all kinds of weird stuff but in the end there is no use in shaming people for this if people use foster as the main thing and most do perfect and they're going to fix stuff as they go hopefully okay I need to leave to our next thing but I'm going to leave foster here thank you see you so I'm happy to take any more questions if you just leave leave quietly yeah please say so the question is what would I tell folks that are organizing conferences like this but using proprietary software I would just show them how we run it and what our learnings are from that and how we engage with the community for example for next year with with the routing thing we've we've been I don't know who of you has been to the networking dev room they're amazing guys there and we talked about them how can we replace this routing engine hardware routing engine can we do something in in in software and yes there is a solution that we found here with the colleagues and maybe we can show them how they if this is some sort of conference where where it's about programming or or networking or whatever maybe we can show them yes that can be done all open source and you get the features that you don't have yet you might get them when you just talk to the people and and encourage them to work with you yeah yes they're up there so if I understood it correctly is what's the use for the servers that we have and what is running there or so I can tell you what's currently running there this is mostly the monitoring stuff that we do monitoring alerting we have some sort of storage there that we just can have backup of the the data that that is produced from the laptops here we do the dns 64 things in software on those machines and these are the main use cases and also hosting our dns infrastructure from here where we can then just push out our dns the author name service for the conference are in this room and this is basically the the services that the service need to need to do things like dhcp for printers yes we have printers and for the payment terminals and things like that this is the the sort of traffic order what they need to do when they on those machines hope that answers a question okay so yes you again okay it's a bit it's a bit low on the audio level I can maybe we can just walk here and then I can understand your your question yeah okay that's fine yes or okay so the question is that we tear down stuff over the year or after the the conference and what's staying at the ulb and whatnot and everything that's inside the rooms that are not the room where the server is sitting will be removed from from the from the from the buildings and the service will be kept there the routers will be kept there one of the switches that you saw earlier on one of the slides they he will be removed as well but he would just goes to the to the to the the rack because we have this switch we need to clean up the the the rendering farm and then this switch will just be moved to the rack where it's sitting without power and but the rest will stay here over the year currently the years before we hadn't that that big team that could take care of that stuff over the years and I've just been taking that over last year and from what I've heard and what I what others told me that this wreck wasn't there permanently we just had it since six years or so and since then we can actually host stuff through the year here and the years before they were just piling it up in the room and and building building a tower of hardware and then just taking it back after the conference and since we have this wreck and it's also it has locks so we can just lock it and no one can can to crazy stuff with it since then we are trying to to have it sitting here running it most of the time most of the stuff will be turned off to to it's not necessary to have all the servers running and waste energy so the plan is also for the next year is to keep it like that remove the two servers to save some energy and just have it here and what has been done before like ripping it out and building it from scratch this is one of the things that we don't want to do anymore because we have a stable setup now and it's working yeah thanks thanks okay for everyone who's leaving the room I think we can start you can take the cables with you or can we start tearing it down or that work excellent okay okay there will be some more talks in the Jean Saint building like the closing talk if you want to go just down the hallway and there's a secret passage at the end of the hallway we just turn right on the last door and then you're in Chanson enjoy foster
An Introduction to Open Source AI
Welcome to the first VIP room for Boston. Yeah, cheers. There we go. Our first speaker today is William Jones. He comes from the South Hampton, Utah. Believe it or not, he loves someone half the house of strong men, so that's all really cool. And he's recently taken up some bouldering to site really hurting someone the first time he ever did it. I've got to say, I am very interested in this talk and please, will take it away. Applause Right here, right here, over here. I think I'm just trying to hear it. Even louder. No, that's even worse. Over here, maybe sound, sound better. Here, let me see here. I think it's better to be loud. I just want to say, I'm really really interested in this talk. I'm really curious. I'm just wondering, but Ken is close to the door because people are coming in like so much more than he's already in his room. I'm not sure if the door doesn't lock. So if you were here first, you could drag and hit the little button to open it. So if you want people to keep knocking on the door, it will lock you in. So let's start to hear. I think it's better to be loud. No, I'm just wondering. I'm just wondering. I'm just wondering. I'm just wondering. I'm just wondering. I'm just wondering. I'm just wondering. I'm just wondering. I'm just wondering. I'm just wondering. I'm just wondering. I'm just wondering. I'm just wondering. I'm just wondering. I'm just wondering. I'm just wondering. I'm just wondering. I'm just wondering. I'm just wondering. I'm just wondering. I'm just wondering. I'm just wondering. I'm just wondering. I'm just wondering. I'm just wondering. I'm just wondering. I'm just wondering. I'm just wondering. I'm just wondering. I'm just wondering. I'm just wondering. I'm just curious? I'm just wondering. I'm just wondering. I'm just wondering. How well have you been talking to habenson recently. Can you share your thoughts on y한테 Lancer Boiq announced? Which cablebomber maybe? I was concerned for the end of the day, which was when Linda had talked about lived and lost. So let's have a look at what we did today. So we've got A, we've got schedule day at Torse, there's a little gap between our Torse, so possibly using it for me, and then have a finish up, taking a web page and just checking if we're finding any of us involved. We've tried very hard to keep this beginning better than we did today. I think it's about just focusing on everybody and all the other systems who we need to walk away with something to do. And we also really want to keep it in the resource areas. We've done just one or a bunch of tools, which we like to use in AI, but we like to exhaust it. That's the normal goal of today. In terms of the design of the day, we've done it the day after this is over three months. We have some technology and technique talks at the start. We have some of the sort of schooling talks in the middle, slightly in there. And we have some centre like community talks at the end. So, that's what we're doing today. That's five minutes from my 30th install. So, we're the best discussion we've ever had to talk to, said that we wouldn't be able to save the three new productions to AI, which is very... And I still have some innovations of why this is the one that I want to do this. And what I'm very interested in is the not start with a set of innovations, something like AI, which is a television, which is like that. These people obviously have their own business, but I just don't think it's much of a thing to do or do anything at the time. So, what I like to do is then, or move the technical point of interest through this, where this kind of thing is important at the moment. So, to do that... Yeah, well, I don't think it's this, but I was just wondering, so, how could we, from that to today, look at AI and AI, which is a core of what we're doing as we're doing this in the world? It's not going to be what it's not going to be. And what we're going to try to do is understand things. But especially, what is it all about what we do with AI and AI, which is a new giant, that's about the large, it's a data-all-like-large population data from a small size of data. And I don't think having to go to AI is a more complicated thing to do in general, if I give you the time, because I want to understand how hard-working all sorts of engineers are in the world, then. That's quite an evil thing to do, the population of software engineers is high, that's quite mean. Even if I could ask them how hard they go, if I'm going to give them an answer, I'll use the last thing. So there is a fully-invented times when we want to do this, actually we might want to give it up now, just a while. It's understanding about large populations, large size of data, it's just a niche, it's something we have to do in the most every case, and it's still a problem that we're solving with AI and AI. And the motivation is that sort of thing, if you find that in the case of all the sustainability problems of explosion, and what this is about, is about how even data sets with not many innovations not always exist in there, will, because of the way these inventions are defined in the total way, and are made absolutely massive, and they can be at a small scale. So, let's do an example of that, and let's see, in the case where the man tonight has an idea of where he needs to work on. So, okay, I want to understand how the artist is going to get to work on. That's something that's unsaid and it's difficult to measure, because I'm going to tell you honestly. But, what I can do is that I need to know how you can do that in a way that's, okay, there's a lot of people on the internet who have been there, which is because they work with art, so maybe what you do to do it how hard the solution is to work, is to see if you want to see those sort of a new job, and, okay, I can propose this by the digital age, okay. But, then, I need to understand how the solution works, and I could either guess that as something I could do, or I could do, say, I'm a digital learning, which I'm learning how the solution is. And, if I could do that, that's okay, that's something I could do. I probably want to make it go on, but then I have to do power to get the solution up something. So, I want to make it go on, but if I propose this, you can literally do a, and how hard it would work if I at least had something that was not made, or I made a problem with what kind of a band maybe it's something that's been made by me, or maybe you know what I'm saying. That's, and all of these things would be perfectly possible, and they would be a huge commitment. However, when I unfortunately go away, and I find that shockingly actually the most important thing that people find is that you're not working hard, so you're not doing this in a relationship to your job. And, this is a whole line of explaining, but I still want to continue to invite those to do this one way. And I'm adding in a second minute something that might be a little bit different again, and I've seen a lot of what the people say, and if you do that, I do hope that you can see the sound of the band, although I won't mention it, but I think it is a different thing. But I'm genuinely not working with this sound of the band, so I think you get that through maybe the slow doing in something that it takes time. And the problem I might do this is that I suddenly don't need to find the point of doing this, because you can't find the power of the music in my head, you know what I mean? I don't just need to see the people who are actually using this division and the single-boxing combinations. I'm not doing that with the sound of the band, which that sound is different, but for any other music I don't need to put the sound of the S. So I'm going to be doing this with people that have not been made to perform in ten minutes, because I don't need ten people having a hundred people. And I'm just going to put the ten minutes soon on the speaker, and I'll have to do that with the speaker and the sound of the band. I'm just going to put the sound of the band in the studio. And in my house in the village today, I just suddenly like, I notice a lot of something that's something that's changed, something that's changed what you think it is. And I notice that any minute I notice how my mother, the things that we get, what we get, I never just see 98.74, 98.74, 98.74, 98.74, and this is what we're going to do with a music that is something that we've got on the music outside. And I think that this problem is that the music that's on the music outside is something that we should get very, very good. And it's really remarkable. It's really useful to look at the music outside the music outside the music because the only thing we can do is understand about the life and the existence of the music outside the music. And then I notice that the music is so large that we have to see what it is in the art of all innovation. And the music is so large that we have to see what it is. And then I notice that the music is so big. We know and this is something that allows even more of our exciting modern music and other modern music that we learn about and we use for all of us what we do and we use for all of us the amount of things that we can do to do that. And we're going to be able to make this music outside the music outside the music. And we're going to be able to make this music outside the music outside the music. And we're just waiting for the music to finish. So, we'll just stop here. Okay, wind, wind, wind. This is Yeah, and this is why I'm issuing this. Suddenly so popular because it's a massive cliché but we are in an increasingly data-driven world and we need the way of managing that data and by just the argument I've just given you, there's only one way to do it and that is to use things like AI, machine learning and statistics. We also have a nice little a side benefit from all of this is that AI and machine learning do allow us to automate a set of tasks which are otherwise very, very difficult to do. There are many tasks like for example the ones I just gave where it's extremely difficult to come up with a solution directly. For example, if I wanted to look at a picture and tell you whether it has a cat or a dog in it, that's not something like if you put me in a room and said I couldn't leave until I'd done that you'd come back to a skeleton 10 years later. It's just not possible to solve that problem directly. But to solve that problem indirectly by learning a solution from a set of examples, that's something I could do. And in fact, modern tooling is so good, I could probably do that in five minutes recently. It's really easy these days. So, where open source fits into this? So, we have to do our AI somehow. We've obviously sort of got to do it with software. It's not really a problem that can be solved at other scales. So we have open source software as thankfully actually these days a very big part of AI. And that's good, that's great, that's something we all know quite a lot about. But alongside this we also have thank you, we also have open source data and open source models. And this is another great thing that we can share when we're doing AI and machine learning. We can share not just the software we use to do things, we can share the models that we use. So, I posted earlier this slightly silly but I think it's a great thing. So, we can share the relationship between what I say, income or pay, beard length and seniority and how hard people work. And yeah, perfect. Yay, cool. Thank you. Thank you. Okay. What was I saying? Open source data. Open source data and open source models. Yep, sorry. So, as well as open source software we have open source data and open source models. These are, this is another lovely extra thing we can do. I think Michelle is going to be talking quite a lot about this later. And yeah, another lovely, other thing we can do. Again, it's another quite a big cliche but obviously with all these benefits that we get from AI and machine learning, we get a lot of challenges as well. We have some big challenges at the moment. Explainability is one of these. We have, I think, quite a lot of difficulty getting our more advanced models, particularly to tell us about why they work, how they're coming to the solutions that they do. And this is a big problem in applications which directly relate to people's lives like automotive or healthcare not being able to understand why we've just run that poor, old lady over is a big issue. And this is a real issue. The automotive industry at the moment has these wonderful self-driving cars that are in many ways and by many metrics are better than humans. But the regulators aren't letting any of this happen because there's no accountability for them. There's no way of having to behave well and there's no way of proving that they do what they should do. We also have issues of bias. AI and machine learning models are ultimately a reflection of the data that they were trained on and that can be a problem when the data has been scraped from the internet. The internet is a very wonderful thing but I think we all know that not all of the concepts on the internet is great and Models that are trained on this come to reflect the biases inside of it this has been sort of very publicly the case in a few examples and with Hiring AI is where people have trained these on data sets of their historical hiring practices and This has turned out not particularly well and being quite biased for example against the women that was quite a high profile case in the last few years and Yeah, this is this is not a great issue either Privacy is also a final difficult issue With a demand for AI and machine learning models comes a demand for data and this data doesn't always respect people's privacy as it should and Even when it does models themselves often leak things about the data they were trained on either intentionally or unintentionally Anyway, that is the end I think I did it quite early which should give us plenty of time to fix things and switch over Do we have any questions? If you can raise your hand I So it was a bit difficult for me to hear that What I'd not to pass the buck here, but I actually think our next speaker's got quite a lot to say about that might be Really more Again it was a little difficult to hear that was that security Yeah, it's it's a tricky thing I Think the main difficulty here is that we do know how to do security well in some respects with things like differential privacy but it's what cost we put what price we pay to do this and Doing security and also having things run at the speed they want to we want them to go out That's that's a tricky thing. Just answer your question It's just that I see a lot of focus on Yeah, no, that's I mean, I think that's a perfectly fair comment Honestly people do really focus on these three things and it's really tempting to well like I just did Focus on things like differential privacy Information leakage because they're interesting, but yeah basic cyber security really Should be the first port of call for these things
From OpenLLM-France to OpenLLM-Europe: Paving the way to sovereign and open source AI
So thank you for your coming. It's quite early, 9.30. It's difficult to start, so I will try to push energy to this session. So just before to get started, I would like to know more about you. So with three very simple questions. First, who has ever locally run LLM on his laptop using Lama CCP, VLLM or LLM Studio? Please raise your hand. Okay, right. Second question, who has ever fine-tuned a model? Fine-tuned. Okay, let's find 10. Okay. And the last one, who has, like me, dreaming to have one-dred open-source model in here? Not only one, not only open-weight model, but really open-source models. Raise your hand. Okay, you are in the right place. So we will do the job. Okay. Yes, my name is Michel Marie Modés. So I have a co-founder company, a software company called Inagora. So we started in 2001. So as the first time, we will be very close to our 25 years. So it will be for next year. And our mission with Inagora is to invent and develop good tech for good. So what I can sum up as ethical open-source. And for AI, we do the same. We do ethical AI. And to come up to achieve this goal, so we started a community, a very large and a brand community called Open LLM France. So we started in June 2020. And we have two main goals. First, as well to build trusted sovereign and real open-source generative AI technologies. And the second goal is to build a strong ecosystem around LLMs and generative AI systems. So for the second objective, so I can say that we have success because the community right now is more over 450 active members with a strong support from the academic and public research in France. So it's very important because, for example, with the GenC, we can use freely supercomputers like GenZ. And it's very useful for us to give freely GPUs to train our models. So it's very important. And at the same time, so we have a lot of corporates, corporate private company, who have are using AI technology or many to build with us AI solutions. So and all this track for today. So I think there is a lot of important things to build ethical AI system. But my talk, I will talk with you about three topics as well, what we could consider as open-source AI. So this is my first part of my talk. The second part of my talk will be related to diversity and the underrepresentation of our culture, our language in these models today. And the third part of my talk this morning will be related to data quality and the evaluation of this model. Okay. Under. Okay. Right. So to be very clear, and to start on the biggest problem, so the most popular open model that you are using today are not open source. They are open wave model. So this afternoon, Stefano Maffuli from the OZ, open source initiative, we have a talk to report to their progress to this definition of what we could consider as open source AI. So I'm very proud because I'm part of this small group and private group inside the experts from external from from the OZ to try to define and to get this definition because it's important to clarify the situation because as you know, and I'm not alone, but Stefano and probably some of you's have used as a published post to raise the problem to the misuse of the open source term today by some players on the far work ecosystem. And so and I put in this slide, you know, the OZ definition of open source. So to be very clear, if you have limitation on the use, the license and the term of use of a lease license, or if you don't have the artifact, the element to train again the model or to make a derefited work on it, you can't say that you are doing open source. This is very clear. And today, the main part of the popular model we have, you don't have a view and access to the data set used to train the model. For us, for this community, what we open source AI means three things. First, as well, that we are able to have the open source of the model. All the tooling system used, for example, to train the models to evaluate the model of the pipeline to do the evaluation of the model. And so for different things, it's not very easy to find this information on an open model today. The second point is related to a license. So if you have for us our license, we don't have this license, we have to have, we thought in the limitation of who and what we are doing with this model. And the most important is the third point is related to that asset, open corpus, open corpora. But you know, it's very interesting because probably if you follow the news related to AI, you saw during these past days some new models with data sets published under open source license. So, and I think it's very important and I think that for 2020, not only the year of open source AI, but also for data set publication, open source license. So I changed my presentation last night, just after the talk of Joss, the co-founder of Next Cloud, because he present an ethical rating system. And I'm very glad to see that we share the same point of view. And it's very simple for also for the Next Cloud community. If all these conditions are met, the three conditions, so you are in the green area. If you have only one, two conditions, so you are in the yellow, only one orange. And if you are using, for example, open AI, in fact, ChabGPT from open AI, zero condition are met. So you are in the red area. So if we have today this morning developers from this beautiful Next Cloud community, thanks for your job. It's amazing and we love it. And so for us, by the way, we are in the first green area and we try to do the job. The second part, the second topic I would like to underline this morning, it's the problem that AI generative models are more and more representation of a picture of what we are in terms of culture, in terms of society, in terms of language. So I think that's figures talked by the by themselves. So in the left, you can see that since 2018, less than 8% of LLM has been created in Europe. And on the right, what you can see that it's the volume of language used to train LAMATU model. So 0.16 for French and 0.17 for German. So percent. So I don't know what do you think about that. So but in my point of view, we can say that we are not really well represented as our culture values in this model today. So we have a problem, I think. And we have a community we try to solve. So first, first try we did, it's to adopt a data first, drive an approach or quite a quality first, drive an approach. And because the small also is beautiful. And we try to get the proof that quality of the data set is more important than the quantity of data you have. And to demonstrate this this point, we release a first model in October called Claire. So Claire like the woman, the show name in France. So I'm not against I have nothing against a podcast, Albert, Alfred, Mr. But you know, we prefer in our community to promote women because by fact, it's our little contribution to have more women in our AI ecosystem and a global unity. So I will, I will not go deeply in Claire because Julie, the real one. Yes. Julie will go deep and tell you all about Claire what we did. Oh, we did this model. But just for very, very, very, we just gave the proof that it's we are able with a lot of amount of French tokens to give a very, very conversational model. Conversational means that Claire is able to understand dialogue between people with their realization. And the second part of Claire, the second features, it's that Claire is able to talk like, like you, to make a dialogue, human like dialogue with defluence, hesitation, because we train Claire with conversational data. So we continue to collect a lot of data. And today, so we are around 140 billion of token in French. So and we I'm very glad and happy to announce that we started to the training phase to train our new model called Lucy. So Lucy, the main goal of Lucy is to fix or to yes, to improve the under representation of the French language in generally in LLMs. But at the same time, we put in our data set some over European language, the German, Spanish, Italian, some code to some some some source code to make our model to have a capacity of reasoning. And we try to build some new features to make this model efficient, not only in French, but for over language. So probably you will be interesting to follow this work and probably our custom tokenizer and so on. But the most important things I would like to share with you this morning is that we are not the only one community involved in this goal to build this sovereign LLM in Europe. So I'm sure that this list is not exhaustive. If anyone knows new or other initiative, please call me just after the presentation. I will be very excited to discuss with you. But the most important is that we are strongly believe that we have all the capacity, all the technology, all the GPUs in Europe to build our models. And it's why I'm very delighted to announce you that today, during the first day, we changed OpenLLM France to become OpenLLM Europe. So you can use this QR code to inboard yourself in this in our Discord server. So we all the content we produce during the six months in French is still available, available. But we have created the channel for each European language. So please welcome. And if someone want to be part of the community management team, please contact us and we will be very pleased to inboard you in our initiative. So that's my tool for today.
LinTO Studio as Your Ultimate Open Source AI-driven Media Management Solution
Okay, great thing everyone. Thanks to come to discover linto. linto is your ultimate open source AI driven media management solution. So I'm Damien Lenn. I'm head of R&D engineering here for linto at Lina Gora and I'm proud. So what is linto? Essentially linto is a set of voice technologies that enables you the best on the open source side of voice tech. You can find in linto all the cognitive APIs that you are craving about like transcription with a live or batch transcription. We have a set of NLP APIs that enables you to add punctuation, name entities and topics identification or so on. And also we worked on speech synthesis. This is the first set of linto technologies. Leveraging those technologies we built a full-figured surrogate, I mean alternative to Alexa and Dialogflow to build agents, smart agents which includes chatbots, smart assistants, voicebots with custom full software work walls that work on the browser that's very neat. And finally we the past two years leveraging further our technologies we built a business oriented solution which is called linto app which is a media management platform that enables you to load media and to make to run these cognitive APIs to edit the transcription in a nutshell to turn routine recording into fully qualified data lake. So there's a lot of software's closed source that enables you the same kind of features but more or less all of them uses the APIs from the big players you know them. Okay so the question here is always the same what happens to my data when I use the services provided by author, dictation, happy scribe and so on. In a nutshell you just send your data to them. So linto studio I will present you a quick video to show you the platform but here you have all the functionalities and note the link which is currently displayed you will find the link to immediately just use our alpha version which is online free you can just create your account and try yourself just after the meeting and you will find the link to our github pages to download and work with the source code. So linto studio enables you to use the APIs I've been talking about to add automatic stamp with our modified run times for whisper.ai not a whisper by openai. We are so enabled to speakers and turn identification and all I've been talking about before just note that the platform is a web platform where you can collaborate in real time using organization roles and share resources within the platform. It's shipped with companion Android application that you can use to to recall. The final slide before I move to the quick video of course as my colleagues presented you a work on the large language models of course we want to also leverage these technologies within linto studio and add this kind of feature I'm drafting here on the picture to work with the documents loaded into the platform and ask some things with large language models. Okay so here I jump to the video. Okay so I recorded this yesterday. Here on the left I'm currently recording something within the sorry so I'm recording with live transcription. Okay whenever I'm done I just stop I can navigate local files and listen back what I recorded but what I want to do is to send this recording directly within the platform which is of course the big window displayed on the right so I can change I can send it to the platform I choose the language the model I want to use then the media I uploaded just lands into the platform and here I can see that the transcription here includes the capitalization and normalization I can also explore the platform as I tell you media management solution so it's a multi-user platform we where everyone can create accounts and use roles within organization so here I just showcase the way you might invite users and assign roles within a given organization here I show the share mechanisms which is total rip off from notion way of doing things and I'm proud of it it was flawlessly I can share with external users as well send email automatically when I just share transcription to a user okay here I jump to the editor where I can you see use AI insight which is our NLP APIs okay you just click on the one you want to use and start generation forum identifying stuff in your text like name entities or locations and decisions topics and put highlights you can also manipulate the text and add manual highlights to annotate the text okay also we have another editor which is also very neat where it's a place where you can basically just built the SRT or VTT and you work with the screens you have the center the current screen you can arrange arrange them the timing you can of course correct the text which enables you to add something that you want to rip on the video directly some close captions here's the way I want to navigate within the platform I just can use tags and fetch the document I'm looking for also using full search text and so on and once again I get back to this recording I can show you here that I can also correct add some correction corrections to the text change speakers which is a real-time collaboration with a reconciliation of multi multiple users editing the text and finally as you saw we can export the document okay that's our platform demonstrated in a nutshell I took 10 minutes for this presentation hoping for any questions from you so if I am if you thank you for this presentation I have two questions one of them is technical and the other one is about money I'll start with the money this specific project how is it sustained that do you have revenue for this specific project and so what's the business and then the second question was what kind of power of computing power do you need to run this for a small organization maybe okay so the goal here for our business is very clear we offer as linear go around services for tuning models okay so this particular platform is also intended to be a SAS service where the user will be at some point when we have time to develop a subscription for that users will be able to use our system as a SAS but the source remains free and it can be austere on premise with the same features like away like like always at the Nogura we have no premium plan or whatever but we just feel that it's convenient to just host directly a solution as a SAS offer the other question was about the computing power okay so it requires quite a lot but we batch the process of the transcriptions and the long models inferences we just provide the best default way of doing stuff and if you dig in the code you'll see that our runtime supports kind of everything you can dream of we can run on CPU of course it will be a little bit clumsy we work on CPU with Intel extensions for transformers and so on and we of course work on GPU if you want to process a large batch of transcriptions when the hosting on premises any other questions we got time for one more how do you handle a typically French language setting which is irony how do you handle because of the keywords and so on the typically French set which is irony meaning that the speaker means exactly the opposite of what he says he's asking how do you do with the irony of French language of course using the you know the irony mark you know this one thank you Damian all right we're gonna start the next talk here in two minutes
LangChain From 0 To 1: Unveiling the Power of LLM Programming
Hi, y'all. I have the privilege of introducing you to Stefano. And he is from Italy, in the middle of the Italian coast. You've been a Linux enthusiast for 20 years, got me on that one. And your focus is on VoIP, interestingly enough. This is his 10th Fosdom, and your favorite animal is you after four beers. Very appropriate. Everyone, welcome Stefano. Thank you. One of my hobbies was caving. I spent 10 years going into caves, descending pitches with ropes, crawling into mud, and doing those awful things. The reason for doing that is that the very few time I had the chance to be the first one in an unknown place, it was awesome. When you are in an unknown place, you face some dangers, but you also have infinite possibilities. Behind the light of your headlamp, there could be anything. A river, a beach, kilometers of unexplored passages, who knows. And I feel the same about the AI today. And I'd really love to increase the power of your headlamp today. So I'm going to kick start you into Lang chain. This is the GitHub page for the talk, where you can find the proof of concept code and the presentation itself. It's better if you look at the code during the presentation. We'll explore Lang chain using one of its notable use case, that is retrieval of met generation. And for doing that, we will look at some of its components and concept that are document loaders, text splitters, embeddings, vector stores, retrievers, prompts and templates for generating prompts, large-length models, of course, and finally we'll combine some of those together in a chain. Then I'll experience the adrenaline of a live demo, and maybe we will take a look at some other notable use cases. Let's talk about our main quest first, that is retrieval of met generation. This cutting edge techniques involves giving additional data to the LLM to enhance its responses. It's interesting because when you give additional data to the LLM, the answers become more precise and relevant, and it's also allowed the citation of sources, and allowed to respond to data that are not in training data set, that could be even personal data or real-time data. It's a very discussed topic, and it's an intriguing case for showcasing Lang chain. This is the scheme of what we want to obtain. Multiple use cases exist over retrieval of met generation. We will look at the simple one that is question answering over unstructured data. We will take some text that is our unstructured data, and we will put it into a storage. Then we will ask a question and use the data from the storage to help the LLM answer the question. Let's look at it in more detail. We will take data from a transcript from a YouTube video, and we will load it into a usable format. Then we will split it into smaller parts and compute a vector representation, also known as embeddings, of this data. We will store it into a database. Then we will ask a question and compute the vector representation of the question, and use this vector representation to find similar documents. Then we will put the question and the retrieved documents into the prompt and give it to the large language model. If you're thinking that it's complex, I assure you that it's not, and it fits in a few lines of code. If you're thinking that it's trivial or worthless, I assure you that it's not the case-hater, because there are a lot of concepts behind that. Why using LineChain? LineChain is a framework for developing LLM-powered applications. It offers us a lot of ready-to-use of the shelf components and building blocks that make our life easier. Should we take our code in production, it also has components that make it easier for us to do it, and also it has a lot of samples to copy. It's fun because it has an extreme speed of improvement, and something interesting came out of its community continuously. On the other hand, it's very young, and breaking changes may happen, but we like risk. We are using Python. LineChain is also available in TypeScript, but that's not make-up-of-tea. We also have our main requirements that are LineChain, of course. OpenAI that we will use as embeddings and LLM provider, and TraumaDB as vector store. Since we're using OpenAI, we will provide an API key. Okay. In this part, we prepare and store our data. We will use four components that are a document loader to retrieve our data, to get our data, and convert it into a usable format. A text splitter for divide the document into smaller meaningful units, an embedding function to compute the vector representation and the vector store to store our vectors. The document loader is an object that takes from various sources to the data source. It takes from various sources of data and gives us a transform it into a usable format. That is a document. Multiple sources are available, and for instance, we can have files like PDF or text file or web pages or cloud storage such as Amazon S3 or Google Drive, social media like Reddit, Twitter, GitHub, and papers, and of course, YouTube transcripts. It's also very easy to write your own if you don't find something that fits for what you need. You can just extend the base loader class. This is our document loader, and we are using the YouTube loaders from the LineChain community. And this will take the transcript of our video and put it into the document class. This is the document class. It has a page content string that will hold the transcript of our video and a metadata dictionary that will have a key source with the URL of our video. Now that we have our document, we want to split it into smaller meaningful units. Why do we want to split it? Well, for free reason. The first one is that the input size of our LLM is limited, so we want to give smaller pieces. The second one is that, like me, our LLM tends to be easily distracted, so we want to increase as much as possible the signal-to-noise ratio and avoid to distract it, giving it useless information. So we will choose only the pieces important to answer the question. And the third reason is that usually we pay per token, so the more we give, the more we pay. We can think of five levels of text splitting from simple to complex. The simple one is splitting just counting charters or tokens. This is simple and easy, but it has a problem, and the problem is that probably we will end up splitting in the middle of a word or a phrase. The second level addresses this problem, and this recursive splitting. It recursively tries to split text on special charters like new line or punctuation, then combines those phrases together till the maximum length specified is reached. The third one, look at the document structure that works for HTML files or markdown or code. And then there are semantic chunkers that is still experimental on a long chain, and it's very interesting because it combines phrases together only if they are similar and use embeddings to compute similarity. The last one is highly experimental, and it's asking an LLM to split our text. This is highly experimental and also very expensive. It probably makes sense only if you are thinking that the cost per token is going to zero. We are using the recursive charter text splitter, that is the second, and it's a good default choice. We can specify the length of the text, and if you want some overlap. There's not a golden rule about that, so maybe you want to try what works best for you. Okay, now we have our documents, and we want to compute the embeddings. The embeddings are a vector representation in a high dimensional space. That means that we take our data and represent it as a vector. Each dimension of this vector will reflect an aspect of context or meaning of our data. There are thousands of those dimensions. If two pieces of text are similar, they are next to each other in the embedding space. That means that we can compute the similarity of two pieces of text just measuring the distance between those vectors. It seems complex, but for us it's very easy because for us it's just a function that we use when we create the vector store. We are using an external provider here, that is OpenAI. And auto privacy, obviously if you use an external provider to compute embeddings, you are sending your data to the external provider. We now have vector representation of our data, and our data is split. We want to store it into a vector store. A vector store is a database that is tailored for storing and searching embeddings. We are using TraumaDB here. It is open source, it's very easy to set up. This is the initialization. And as we said before, we are passing the OpenAI embedding function to it when we initialize it. These are the most used vector store in the reports of the state of AI for 2023. And TraumaDB is at first place, and FACE is also open source, it's from Meta. And Pinecon is a very popular cloud vector storage. Okay, we now have hard data into the vector store. We want to use it. We will use four main components here that are a retriever to search similar documents to our question, a prompt that will give the LLM the instruction on the output that we will give, the LLM that is the heart and lung and brain of our application, and finally we will combine those three together in a chain. Okay, the retriever is an object that is responsible for searching documents that are relevant to answer our question. The simple retriever does this just computing the vector representation of our question and search for document that are near to this vector in the embedding space. This is the simple retriever. Long chain also offers us more advanced retriever like this one, this is multi-query retriever. Please use the LLM component to formulate the variation of our question and then use the embeddings of those variations to search for similar documents, similar and hopefully relevant to answer our question. Now that we have similar documents, we can put them into the prompt and the prompt to give to the LLM. This is the prompt that we are using and the prompt is just a template with the instruction for our LLM and two variables in this case that are the context that will be our documents and the question itself. I love delving into details because it's just a template and also we can take this prompt from the long chain app. Long chain features an app with all the prompts and other objects that we can use, all the of the shell components that we can use. We have the prompt, we want to give it to the LLM. We are using OpenAI SLLM and this is how we initialize it. I use streaming, the first variable because it really improves the user experience and temperature zero means that we don't want creativity or hallucination, we just want precise answers. Maybe you can argue that I should have used different LLM providers but nobody gets fired for buying OpenAI so I chose that. These are the most used LLM providers always from long chain state of AI. OpenAI is at first place and I'd like to rant a bit about that because CloudAI, the third on that list, is labeled from almost from everywhere in the world except from Europe. This week the Italian data protection authority is going against OpenAI over privacy issue again. I know that there are a lot of privacy advocates here and I also care about user privacy but I think that defending the user right shouldn't mean going against going against war against them. That's my two cents. Those are the most used open source providers. It's interesting because the first three has a very different business model. The first one rents hardware, the second has a cost per token, paper token and the third one is for surf hosting. We now have gathered all the components, we want to put them together. This is all the components called one after another. We have our question and we pass the question to the retriever and we get a list of documents. The list of documents is joined together in the context variable then the context variable is used in the template to generate the prompt and the prompt is given to the LLM. It works nice and easy but we can do better and this put everything together using a chain. A chain is a sequence of components that does a function and it's better than just calling the component one after another because it has several advantages like it offers sync and the sync support and that allow us to take our code directly into production without changing it and also as advantages of observability and it's integrated very well with other launch chain components that are used to take code in production. This is the code put together using the LLM expression language LCL that is a new way of doing those chains. This is an acquired taste and it's quite new. It's from September but I find it very useful when you get used to it. Okay, let's see how this works. This is our code and there are two examples. One uses the chain, one not, this is the one that doesn't use it and it's just a few lines of codes. It's very easy. Okay, I forget the open AI key. Okay, I forget the open AI key. Of course it doesn't work. I'm not connected, you're right. Okay, I have a backup video. No, no. By the way, it's just for giving you an idea of the piece of calling the various components and the parts that takes the most time is computing embeddings and this is the streaming output. Okay, I have prepared some questions that are those questions and those are given too fast, sorry. I gave the question to the LLM and this is the output of the output of the LLM. Also, okay, it's nice because this one, the retriever wasn't able to find the answer for this question and so it wasn't able to give us a response and the LLM told us, I don't know. I'm not sure if I can move forward. Maybe I also have it for the LCL. The LCL version uses the multi-query retriever. So you will see now that it will ask multiple questions. Each question is transformed into multiple questions. This is low, I'm sorry. Okay, those are the questions and this is the answer that came out. Okay. There are also other interesting use cases of luncheon. We look at the simple one that is question answering over unstructured data. Also it's very interesting question answering over structured data. This one uses the LLM component to convert our question into a sequel query that is executed and the result of the query is used to improve the answer of our LLM. It's very interesting. Another one is data extraction. You just have to provide a JSON schema and then unstructured text and the JSON schema is automatically filled in with the data from the structured text. The LLM understands what to put into the JSON schema. It's interesting because there are people paid for doing that work. Summarization is very useful and it has a lot of, let's say, problems. It's an open problem. It's very interesting and useful. Then there is a synthetic data generation that is useful if you want to find a model or maybe if you want to anonymize some data. It works like data extraction backwards. You have a JSON schema and the LLM generates a text unstructured that contains data that will fit into the JSON schema. Finally, there are agents that is a key concept of luncheon and it's very fun. With agents, the LLM takes charge of choosing what action to do. It's worth studying. It's very interesting. Okay, that's it. So, thank you. Do you have any questions? I saw his hand first. Thank you. Very interesting. My question is how does this scale? You showed an example in which we have just one transcript. What if we had billions of transcripts? I didn't see any mention to the ranking of the retrieved chunk. If you can elaborate a little bit on that, it would be very good. Thanks. Okay, luncheon helps to take this in production. This was proof of concept so you can take this in production. Also, it's out of the scope of this talk. This was luncheon from zero to one. So, that scaling is from zero to 100. You can find a lot of examples on how to take that in production. If you take a look at the GitHub repository, there is also a link on how people from luncheon use this in production with the chatbot that helps searching in the luncheon documentation. You can find the code and it's very interesting. If you want to take it in production, it's worth copying that code. It's the best practice. Did I answer your question? I'm sure you'll see this coming. If I have some money to spend on a hardware and I want to get an LLM, there is a lot of proprietary intelligence that you use, like the Mbendix in particular, and also the other part that it's on the query side at the end of the chain. How difficult it is to do this without using OpenAI? It's really easy because luncheon allows to swap those components. I use it here at OpenAI because it's the easy way for having a result. But if you, for instance, use the Ollama, you can self-host the LLM and ask questions to the LLM, or maybe with a face you can rent hardware and run your open source model on their hardware. So it's easy because those components are swappable. All right, y'all. Let's give Stefano one more round of applause.
ML Guided Optimizations in LLVM
Can you hear me okay? Cool. Okay. So if sound doesn't work for the rest of the presentation, this is basically the key of it, right? So I'm a compiler engineer, I'm not an ML specialist, so I'm not a compiler engineer, so kind of like a heads up, if I say something wrong about ML, that's why. You can use ML in an industrial compiler, which is LLVM. Actually, show off hands, does anyone have you heard about LLVM, Clang? Cool. Okay. About half. I have a slide about that too. So out of the box, actually, as of Clang 17, it's not very well documented, because it's still work in progress, but you can actually connect to Clang and train models. So that's an interface just for training. It's a DMM kind of an interface. I think that means something to the ML community, if not, tell me. And this is not vaporware, it's just a virtual computer. In the sense that we actually use it for real, right? So I mean, you can read what's there, but we've been using it for about almost four years now, and we have some experience with it. And most of the talk is actually about trying to get to point three there, which is like what we've learned. The rest of it is set up. Okay. So LLVM, for those that did not raise their hand, is an open source project. It's a compiler. Actually, LLVM itself is a library. So it defines an intermediate representation. That's what IR stands for. It contains state of the art optimizations. It also knows how to lower to X86 or ARM or other targets. And then Clang is something like it compiles C or C++ down to LLVM IR. So basically Clang is built on top of LLVM. And so it's Swift. There's a Rust compiler. There's a Fortran compiler as well. And I mean, the LLVM project is bigger than this. There's a full tool chain there, like debugger, linker, all of that. Actually, shameless plug for the LLVM community that I'm part of. There's a dev room this afternoon here somewhere. To us, to Google, so I work at Google. To us, C and C++ is very important. Basically, anything that is performance critical, which is basically anything is written in C or C++. When we say C and C++, I really mean LLVM. And when I talk about LLVM, I mean LLVM at the tip of three in GitHub. So we don't have a special fork or anything like this. And we really chase the head by plus, like, well, minus usually two weeks. So we're very close to the head all the time. We have a release theme that keeps it basically in sync. And even small performance improvements matter, because a 1% saving across the fleet really means that much less hardware you have to buy, what you have to produce or consume, et cetera. And we keep doing this. All the performance improvements that we make are small, but they're constant. And it's like interest. It compounds. Our binary is no shocker. They serve RPC requests. No surprise there. The key thing is that to do that, to optimize these things, there's many things you can do. But as a compiler engineer, we're primarily occupied with how do we make the RPC request complete quickly. And the RPC request traverses a lot of code. Most of it is actually not the code that you want to execute. So there's things like networking stack, serialization, deserialization, security, blah, blah, blah, blah. And all of those things are reusable code. And they try to be genetic, which is the exact opposite of what I want for performance. Because for performance, I want it to be as specialized to what I'm actually doing. Like, I don't want it to be genetic, right? And for that reason, actually, the biggest levers that we have for performance are we collect profiles that tell us, like, well, actually, the program is spending time and then we reoptimize it. So we recompile it with them. And link them optimizations, which are basically like we can look at the whole program and try to, based on that understanding, try to make the right decisions. So things are big, like, you know, lots of data, lots of instructions to execute, nothing fits in any cache. I'm not being ambiguous there. I'm being actually precise. No cache fits the data that we're talking about, the instructions or the actual data being processed. So that's why, like, optimizations like inlining are, you know, very impactful because they contextualize, so they specialize things down to what you actually really have to execute. And then you end up with large functions, which means that optimizations are register allocation or have like a big problem to solve. What am I doing? Okay. Here we go. Okay. Which kind of gets us to why we want to do ML, right? So we want to do ML because we're looking at problems that are, sorry, sequential decision making. So inlining is about, hey, is this call site worth inlining? Sure. Okay. Fine. Well, the program just changed now, right? So what about this other call site? Is it still worth inlining? Maybe not, right? So as you go along, the state of the problem that you're trying to optimize changes, we don't have an Oracle that tells us what's the perfect optimization decision, especially at like the scale that we're talking about. I'm kind of like getting us to say reinforcement learning, probably no surprise to an ML community. Because I mean, otherwise what we do is like we have heuristics that can only operate on like local information. And because I mean, there's the one that actually we can make sense out of, right? So, and we have evidence that they're not good enough in the sense that we know that if we play a bit with them, we can, we can find headroom in optimization. So, but, you know, we cannot constantly twizzle with them, right? Like we want something a bit more systematic. So that's why we are interested in ML. We are also scared of ML because the compiler is about everything that ML is not. So the compiler must be correct. I don't think that it's a surprise to anyone, but it's a non-negotiable. The compiler must be deterministic again, because otherwise it's something that you cannot live with or, you know, to take forever to compile things because we cannot do incremental builds. So ML at least like naively to us felt like something more analog, right? Like it's more like, well, fuzzy, maybe something and that's not, not what we are about, right? So how did we go about it? Well, first we're not asking ML to deal with correctness. So already in the, in the code that I'm talking about, like in the compiler code that makes decisions like in lining and register location and things like this, we kind of already had a separation between what's correct. So, you know, there are certain things that are illegal to do so that we don't do them. We don't even wonder are they worth, like, you know, would they be valuable in doing it? We just don't do them. What we did here is we stressed that boundary even more. So we created like a very clear interface between ML questions and like what heuristic or policy questions and, you know, correctness issues. So the correctness stuff is, you know, written in normal imperative C C plus plus code that we can all look at and agree that it's actually correct, right? Module of bugs as always. But then out of choices that are equally correct, we go and ask ML, you know, which one should we make? To the end user, we don't want to tell them any of these not because it's like a shame or anything, but because it's the more different the compiler would look like the more difficult it would be to adopt it. So how about we make it look the same as it is today, which means no new dependencies, nothing extra, just additional flags, right? So that's something that is fine. So which really means that when we give the compiler to the user, we embed, we need to embed the models inside and not show any sort of like dependency on some sort of like an inference engine or anything like that. But for training, there's totally different. So for training, we're totally cool with like, depending on like TensorFlow and like whatever and, you know, like random generators in the weights and all of that is fine because that this training and actually we're fine with compiling a different compiler just for training, because that's not something that, you know, like, it's not for like everybody, right? So it's just for whoever does the training activity, which we also want to be rare because we don't want to like keep training it as you're trying to ship a product, right? So, you know, like, we give you the compiler and then like, hopefully the more the models are good enough, just like heuristics today, like, you know, like to, to resist changes that people make to their code, right? So basically, there's two types of interfaces that we ended up having. One is between compiler and policy. And there's like domain specific. What I mean is like, there's a different question that you ask is an inlining pass from the one that you ask is a register locator from the one that you ask is a instruction selector or something like that. But then the ML obstruction, like the way we interact with the mail is common because fundamentally ML to us looks like a function that we pass a bunch of tensors to and it comes back with an answer. And we, you know, like how it's implemented is, you know, it's not important, but it's irrelevant from the perspective of the interface and the implementations that we have are like either ahead of time, like I mentioned, or, you know, the interpreters who use TF light, like the people in embedded or for the DMM case, we're actually doing IPC over pipes. So the state in LLVM today, like if you, if you go to GitHub and you pull LLVM down, you basically have everything that you need to, to, you know, add the mail to a pass if you're a compiler engineer. It's TensorFlow centric, no surprise there, but it doesn't have to be. So the obstruction that I mentioned earlier can be, you know, like, I mean, you can, you can plug by the pytorch or anything like that. I mean, we, we made a pipe based protocol work over that obstruction. So it's clearly not TensorFlow specific. Any tools that are genetic, you know, like other utilities, like how you collect a corpus for training, right? So that's a problem. That's also in LLVM. We used to have them in, in, in a different repository, also open source, but they make more sense to go into LLVM. The training tools that we use, so for example, the, the fuchsia operating system that I had on an earlier slide trains using those tools, they are available there to as a, as a reference. But if you are a researcher, you probably want to use something like compiler Jim that is more research, research friendly. So there's kind of like different concerns in, in these tools. And then there's also like using the tooling that I mentioned, like there's, there's another body of work that produced a large corpus of IR that you can use for like whatever you want, like training for these purposes, or maybe doing LLVM training or anything like that. There's links there. In fact, like all the links in the, in the slide that are in the, you know, like when you go to falls them and you see the talk, they're there. Okay, what we learned, that's what I wanted to get to. And I'm doing well with time. Okay, so the, the, it works thing, right? So there's a difference between, I mean, there's been work doing ML with compilers in academia, but I mean, that there's a big difference between that and actually shipping a product and shipping a compiler for production teams. So the key thing is that, at least with a size problem, we have evidence from, from the Fuchsia team that it can work completely, meaning like they periodically, like about every month, pull LLVM, retrain a model on their code base, all on vanilla build bots. So they're like normal CPU based machines. They train for like about a day or so. And they produce a compiler at the end of that that optimizes for, for size, because that's what they care about. There's links, I think, down there, like an example of such a build bot. So it all, you know, this can be done completely openly. And the key thing also is that it works like turnkey, meaning like you don't need someone to go and pay attention to it. It just works repeatedly. And he's been working like this for like almost four years now, which is, which is good. Like we have a signal that we can have like an industrial process that produces an optimized compiler, you know, on a cadence, right? Okay, here's what it didn't work. So performance is hard. So, okay, so you are ML experts, you are not surprised at the statement that for reinforcement learning, the, the quality of the reward is very important. And we understood that through we, okay, it makes sense. However, for performance, the problem is a bit tricky. So it goes like this, you cannot just say, oh, let's run programs and see how well they run, because it takes time to build a program. And it takes time to run it. So you either do it very quickly, which, which means that you're doing it for small little benchmarks, which are completely relevant to what we're doing, right? So then basically you learn on something that has feature value distributions that have no match in what we're actually going to try to use it for. So we don't want to do that. Or you cannot do it. Like, it just takes too much time. So we were like, hold on a second, but we have profile information, like I talked earlier, like, we know, we collect this profile information that tells us where the program spends time and how many iterations loops take and all of that. So can't we do something based on that that kind of like guesstimates, at least a trend, right? Like, we don't care about absolute values, but at least something that can allow us to compare, you know, like to a baseline, the results of applying a new policy. And we thought we could any kind of worked like for register location. But we ended up having to select a winning model out of like a set of models that we trained, you know, like with this over synthetic reward. And we're not very happy with that. Like it's not how to put this like, we're missing that explanatory thing of like, well, why, you know, like, so if I do it for how long do I have to do it? And what do I have to look at when I look at the TensorFlow rewards and all of that? Like, what do I have to look at to know that I have to take it out and now train it or like, sorry, compare these models on on on running benchmarks? There's basically a bit of a waka mall. And that's not engineering. That's waka mall, right? So this is basically the main challenge for performance. And I basically like, you know, scaling this effort to more performance problems. And well, knowing that there's efforts on that, of course, like, come on, okay. ML model evaluation costs. So in the big scheme of things, when we did like in lining for size, or we did register location, I mean, we measured like the micro measurements on how much it takes to evaluate the model. But in the big scheme of things of like the entire compilation of a module, like of a C plus plus, basically, they kind of like goes in the noise, like it was more like a few percent variations. And it's fine. But there's not going to be that funny if the methodology, you know, like gains traction, right? There's not going to have lots of these things that take a lot of time. Also, the size of the model, which is really the weights, seems like it was kind of surprising to us. Initially, we had a small one and then working with some researchers in other teams at Google, they managed to produce a much, much larger model kind of accidentally, like, which kind of like took us by surprise, like it was suddenly 11 megs, like out of nowhere. And it's kind of funny when we're trying to optimize something for for reducing the size of either binary and LLVM itself blew up, right? I think that these are more like things that caught us by surprise. And we, to our understanding, in talking to ML experts, there's ways to mitigate this. But we kind of learned that we look a lot more like an embedded scenario than that we imagined, basically. So kind of like an interesting research topic, I think it's interesting at least to us as compiler engineers, but it's a research topic for the ML community, rather. How would we know without having to actually compare the result that a policy loses power, if you will, right? So, you know, like I was saying, people like Fuchsia, for example, train a policy and then they just decided, well, we'll just retrain one automatically whenever we we produce a new toolchain, right? But is that overly aggressive? Or was it like about time to do that anyway? Like, it'd be great to have a signal that tells you, hey, you know, hypothetically, maybe the feature value distribution changed, and it's out of the domain that actually the model was trained on. So hint hint nudge nudge, maybe it's time to train. But we don't know if that's actually what the indicator is. So that's what I say. I think it's an interesting topic that would be valuable to us, because it was give us an early indicator purely based on compiling, right? We can run the compiler and just see these values as you compile. You don't have to like do benchmarking for for for that. Oh, so in retrospect, I really so this is like honest truth. The first statement is true. We thought that right, like we are convinced that ML is magical. And we will get these policies that are awesome. And there will be at least not regressing and you know, like improving things and there will be no regressions and things will be great. And then we saw that all of them have the typical pattern that we have also in manually written heuristics, which is, you know, some things regress, some things improve. So that's all things are, I suppose. And maybe we can do something better than than that with additional policies that select the right one. But that was a bit of a surprise to us. Okay, performance. So like I was saying, I guess performance is some issues. But we went ahead and like, looked at like, where does the train model find opportunities for additional savings, right? And taking a step back. So what do I do as a compiler engineer in these sort of cases, like I look with Linux Perftool at, you know, runtime information. And I see where it's read. So where there's hotspots. And then I think really hard and look at the compiler and why it made those decisions. And I go and fix that. And then the red turns gray or green and sweet, right? And then I have to do it again and again until I make sure that there's no regressions in other parts of the code base. But that is basically what you do in that case. So when we looked at like functions that we had both indicators in the reward signal as poor as it was. But I mean, it was indicating that, you know, he's doing better. And we looked also empirically at them like, and yeah, they were doing better. And we're like, well, why? Right? So we look at the code and we couldn't tell why like we look at with Linux Perft and there was nothing shining, right? I mean, the code was different, right? Like we could tell that like, you know, pure line by line, you know, deep, it was different, but nothing was popping. And then we did a bit more investigation. And it turns out that the mail was finding or like, you know, the enforcement learning algorithm was finding opportunities in lukewarm parts of the code. So these are things that kind of like end up being like a peanut butter effect, right? Like I mean, nothing in particular is bad, or is improved categorically. But in aggregate, you can, you know, you get like a spread effect that is actually amounting to something. Great, but it's possible that that something is actually just noise, right? And I mean, today, we don't have a way of capturing that. Like we just say, Hey, here's the profile that we got by collecting it from from a running binary. And then I'm as is great. Okay, here I found an opportunity and actually that's just purely noise, right? So this is the part that I kind of had a bit of a trouble like how am I going to title it or anything. So what I ended up doing is just saying what I wanted to say. So as a compiler engineer, so as a developer in the open source, like as an LLVM compiler engineer, if this pans out more, like, you know, if you get more passes and the mail is, you know, like actually delivering more and more value to us, right? What's going to happen, right? So, well, on the plus side, I spent less time, you know, like tuning and twizzling with thresholds and other flags that I have today in the compiler, because I actually can can use a automatic feedback driven, self improving methodology, right? Like reinforcement learning. Okay, I think that's great, because I can actually focus on understanding what actually matters, right? Like for for driving that performance, like what features are important stuff like that. The barrier to entry though might change. So today you can use like, you know, like, you know, cheap, not this one, but a cheap machine, right? And compile the compiler and look at performance, you know, like optimization problems, and it's all fine. And ML, at least my view of it is that it has this risk of like quickly skidding into like, Oh, you need a farm of computers. And today, that's not the case, like I was saying, like, with what we've been doing, the models are small. So we didn't hit that problem. But that's a consideration, right? Like, I mean, is it going to be harder for, you know, the compiler engineer aspirant of the future to enter the field or what? The mental model is kind of different. You can have hinting at that before, right? Like, I mean, like, you don't think of the problem like you were before, you look at Linux perf and you find hotspots and stuff like that. But that's fine. Different, different just means different. It means like, you know, we can adapt, right? This is my pet peeve. Like the when you look as a compiler engineer, the ML frameworks, they are scary, because they're like very low level and they talk about things that I don't understand. And they're not talking about things that I want them to talk about. And we're not sure yet where that interface is. And I think that part of the the goal of the project is to kind of like figure out what that interface is. But today, it's like that. Like I was saying, there's links in the, all the links are actually in the, in the deck. And that's the end of my presentation. Yeah, questions. So the optimizations that you find using machine learning in code, can they also be put in LLVM itself without using machine learning? Or is it, can it only be learned using machine language because it is using the data, for instance, optimizations? So the optimizations, can they also be put in LLVM itself without using machine learning? Is it missing up? Is LLVM missing up? The optimizations that you find using machine learning? Right. So I'll say just to make sure that you're saying like the types of optimizations that we learned, could we just do them as normal imperative code back in LLVM? Some yes, some no. So especially the, when we looked at the type of optimizations that the size optimizer was doing, means some decisions are unexplainable, right? To do the wrong thing early on, but just because he kept learning the statistic by taking that path later is going to be all right. So that's kind of hard to translate into imperative code, I think. But some, some might be. What I'm saying is that the hope is that we, like so far in the evidence is that we kind of, it's hard to do that. We only have one time for one more question, one more question after this. Hi, thanks for your great talk. You've been talking about applying these techniques to clang and traditional compilers targeting, well, executables in the usual sense. What about machine learning compilers? So I'm thinking, yeah, applying ML to ML. I know there is some research in that. Do your techniques connect to that? Yes. So applying ML to ML compilers, right? I mean, MLIR, for example, is part of the LLVM project. And I think that there is work trying to do that too. And the infrastructure would be the same because I mean, it's all the same code, right? I'm not an ML for ML compilers compiler engineer. The word compiler appears way too many times, but we work with those people, like, so I don't see a reason they cannot apply this. I think that the domain though is, has its own idiosyncrasies that you cannot just take exactly what it is and apply it over, but the tooling would be the same. Does that make sense? Okay. One more question. All the way up there, really? Hi. I saw during the slide that one of the problems is that you are not really aware if by choosing a tree, a representational tree of the semantics that you are trying to compile, it's going to be better or worse compared to another tree that you are not for. And I was wondering, are we using the operative research theory? I mean, all the mixed integer linear programming theory that gives you a model of the reality and help you understanding how far you are from the optimal value of a certain representation. So, I'm not sure understood the question. Let me try to say back to your saying, are we applying? Okay, yeah. I'm seeing that machine learning basically relies on a loss on how far you are from a certain optimal value. And I'm seeing that there's a branch of mathematics called operational research that his work is trying to describe a word in an idealized matter. And you try to describe how it's costing respect to my objective value, making a certain decision instead of another one, and you get like a math formula. And there's the simplex algorithm that helps you to traverse those. Yeah, and I was wondering, are we trying to integrate those two fields of mathematics to reach? So, I think, let me give the answer because it's also time. So, and if the answer doesn't make sense, let's talk. I think the key problem is like understanding what that gap is, actually measuring that. And it goes back to the reward signaling thing. So, should we apply what you said? Probably, again, I'm not an expert in that. So, I mean, if you think it's worth doing like great. But the problem is that you'll hit very quickly is that the reward that we give or the signal that we give is bad. Right? So, then probably the rest of it falls, right? So, we need to fix that first before we can apply these things. But yeah, absolutely. Like, I mean, we should try all sorts of like methodologies. Like, there's a whole point. Did I make sense or did I miss it? Okay, let's talk more. All right, everyone give March another round of applause, please. All right, we're starting in about two more minutes. So, please, stick around. Don't forget, the desks are very loud. Please hold them down. Don't slam them. And we have the matrix room up and running again. Can you help me try to figure out how to make the both mics work? Can you do that? Can you hold it and can you talk to that? And unmute it in a second. This, this. Yeah, yeah. Can you start? How about now? Hello? Can someone give me a thumbs up? No. Someone got a thumbs up? Hey, thanks Marty. One second. Huh. At all. Nothing at all? Nothing? Okay, yeah, this is not working at all.
Practical Introduction to Safe Reinforcement Learning
Thank you. My name is Chris Pinn and I'm a PhD researcher at the University of Southampton and we will talk about safe reinforcement learning. So the first helpful presentation will be theoretical, we will define what reinforcement learning is, when is it safe and so on. In the second half we will talk about practical scenarios which hopefully will sort of clear things up. So when do we use reinforcement learning? When we want to solve control problems. So you can imagine we have a robotic arm that we want to control with an algorithm or that we have a space shuttle that should land on a moon and what these scenarios have in common is that there is an agent that interacts with an environment. The agent is choosing actions and the environment reacts to those actions by choosing actions, by choosing states and rewards, sorry. Mathematically the environment is defined as a set of states, as a mark of decision process with a set of states, set of actions, reward function and a transition function. The goal of this process is to find an agent that maximizes an optimality criterion. The optimality criterion is a long term sum of rewards that come from the environment. Now the agent's life cycle has two phases. First is the training phase and then we have the deployment phase. So in the training phase the agent explores the environment by taking random actions and over time it starts exploiting what it has learned about the environment to get further. Now when we are happy that the agent has been trained we deploy it and then the agent only exploits what it has learned. It doesn't explore anymore, it doesn't take random actions, it just does the same action over and over. Now the training is done with respect to the optimality criterion. So we want to find an agent that maximizes that long term sum of rewards coming from the environment. Open source software is a great enabler of reinforcement learning. We have gymnasium with a bunch of toy examples, Atari games, physics based games and so on. We have physics simulators like Mujoko that is used in robotics research but can be used in reinforcement learning as well. We can have traffic control and self driven vehicles with Sumo and Carla or multi agents scenarios, whether agent cooperate or compete. Now this shows the breadth of applications that reinforcement learning can have and obviously if your scenarios can match one of these simulators they are open source you can extend them and use them in your training. Of course in many scenarios you will have to create your own environment and here is a simple example on how to do that. So you would import gymnasium package, extend the environment class and among the things that you have to define would be a step function. Now the step function accepts the agent's action and returns a reward and a state. So basically it defines how the environment behaves. Similarly you would have a class of agent that accepts the state and returns an action and so this is the loop we have seen before but in code. This example is simplified but if you had to do the gymnasium documentation you would see something very similar there. Now why do we care about safer reinforcement learning? Because an agent maximizing an optimative criterion has to be creative and this creativity can be very dangerous. Last year we have seen this scenario from US Air Force where they said that their drone killed a human operator because the operator was stopping it from maximizing an signal. Now that didn't happen, it was a hypothetical scenario but it's a very good example of why we want to make reinforcement learning safe. So safety in this case means controlling the behaviors that the agent can discover over the training phase. Now we have two options when we want to include safety in the reinforcement learning process. Either we modify the optimative criterion that the agent is maximizing. If this optimative criterion contains some safety information and the agent maximizes it during the training phase it will become safer as well. Or we can modify the agent's actions and again if we do it correctly we can prohibit certain actions from being taken and again increase safety of the agent. Now I'll give you two examples of this. Let's start by modification of the optimative criterion. So we begin with a Markov decision process with a set of states, set of actions, a reward function and a transition function. Now we would like to add a function H to it that contains the safety information that we want to pass to the agent. This function H is inferred from data in more complicated scenarios and on how to do that I refer you to the paper but here I'm just going to self-engineery because it's a simple scenario. So on the left you see the new optimative criterion. So now we want to find an agent that maximizes the long-term sum of rewards as well as the function H. And that's how we can use that's one example of changing the optimative criterion to include safety information. So let's imagine that we have a reward function that rewards the agent for reaching the goal with 100 points and gives it a minus one point otherwise. This motivates the agent to complete the goal as fast as possible. And then let's say that we have inferred this function H from data and that it punishes the agent for going into the water with minus 100 points. So in a trajectory where the agent avoids all water we get a bunch of minus ones and a hundred at the end and the function H isn't used in this example but in a trajectory where we go through water the agent is punished by this function H. And this enables it to distinguish between these two trajectories and when the training phase is over it will naturally start avoiding these water tiles. And this example is obviously very simple but it works in much more bigger and scalable scenarios than this. And so some properties is that here we have safety only during the deployment phase not the training phase. The agent has to try all the dangerous stuff in order to realize that they are bad and avoid them doing the deployment phase. Similarly we have to have the data set of safe behaviors because we needed to infer the function H from it. So again these safe behaviors could be easy or difficult to obtain depending on your scenario. And in this example we don't need to define what safety actually means because it's implicitly contained in the data set. And again defining safe behaviors can be actually pretty hard when you think about it. And if you could have a data set that just contains all the safe trajectories and you can learn from it that actually is pretty helpful. As we will see in the next scenario. So in the second scenario we modify the agent's actions and in this paper researchers use formal methods and they start with the environment which is Markov decision process and they turn it into something called a transition system. Now similarly they have a set of safety requirements that the agent should adhere to and they turn them into specifications in linear typology. And then they bring these two things together with reactive synthesis to generate a shield. And this shield has the formal mathematical guarantees of ensuring that the agent never does a dangerous action. And what this means, let's take a look at the first trajectory again. Now here the shield isn't used and it's exactly the same as before. So we get a bunch of minus ones and a hundred at the end. But when we take a look at the second trajectory now when the agent should go through the water the shield will actually prohibit it from happening at all. So instead the shield will send the agent in some other direction and the agent will never experience going through the dangerous state. And so here we have safety during both training and deployment and compare this with the modification of the optimality criterion where we only had safety during the deployment phase not the training phase. However it comes at the cost of being able to specify the transition system. So this can actually be pretty difficult if we don't know the Markov decision process or we can't specify the transition system or if we don't know the safety specifications we can't use the reactive synthesis to generate a shield at all. So there is a trade-off. This approach is much more manual but what it enables us to do is guarantee more safety during the training and deployment phase. And yeah, thank you. Do we have any questions? Hi, thank you for the presentation. I had a quick question about the shielding aspect when we're basically blocking in action because we decided it's unsafe or the like. In this circumstance when the AI or the reinforcement learner is not able to expand in this direction because it's been blocked and it doesn't understand that part of the state space, what if down the line it ends up in a scenario where it like violates the safety because it came across it from a different trajectory? How would we be able to train it away from that? Because it's, I'm not sure if I explained it. Yeah, that's a good question. The agent can never violate the property if we specify the transition system correctly. If we make a mistake there or if we don't understand the Markov-District Process properly, that means if we don't know the transition function in the Markov-District Process for example, then obviously we'll get garbage and that's true of all formal methods. If you don't specify the transition model correctly or the specifications correctly, then you are not guaranteeing anything. So it's completely dependent on us having that knowledge and the capability of making that transition system. Got time for one more. There you go. Hi, in your example the rewards for falling in water of the environment was only minus one, but we're saying that's deadly, that's unsafe, we need to avoid it. Why are we not just addressing this by fixing the environment? Why don't we address this simply by fixing the environment? That's a good question. We can't modify the environment because the environment is not in our control. So the transitions will happen whether we want it or not. Now the question is how do we train the agents to avoid it? In this example, we punish the agent with minus 100 points and over time it will realize in the neural network for example that these trajectories involving water are just not worth it and it will start avoiding them. But we can't do anything about the environment, the environment is given and we can't modify it in any way. All right, y'all give Crispin another round of applause. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you.
Open Discussion on AI and Machine Learning
So we're going to have a nice little open discussion of AI and machine learning for the next 30 minutes. Jeremy, where are you? Here you are. And Jeremy is going to be chairing it for us. Take it away, Jeremy. Okay, ladies and gentlemen, we thought it would be good to fire up some Q&A. We're slightly suffering from, I think we're still down to one microphone. So I'm going to get a lot of exercise. We're going to tie, in order to give some structure this, William, Stefania and Michelle are going to each give a two minute introduction to a topic we think needs discussing and then we'll open the floor for the next eight minutes to comments from the floor feedback. Okay, and William's going to go first. Get the microphone right this time. So one of the hats I wear outside of working for Jeremy is that I'm chairing and leading the best practices group in AI at TechWorks, which is the trade body for electric systems in the UK. And as part of that, we are working on a base practices guide. Can you hear this now, by the way? Yeah? Yeah, cool. Let's be louder and into the mic. Okay. Let's see how this one works. That was our best practices. One of the challenges we have with this, one of the challenges that we have with this is retraining existing software engineers to be AI engineers and to understand the risks and challenges of doing AI and machine learning engineering, particularly to try and make the software do all of the things it should do and none of the things it shouldn't do. And some of those challenges are adversarial attacks. You can poison models either in their training data or just with enough experimentation, you can find examples, adversarial examples, which cause the model to do weird things. Or we have to deal with privacy issues. Models can be reverse engineered or they can be the data sets that they were trained on can be reverse engineered by sufficiently clever adversarial attacking models. So that's what I wanted to open up on this first discussion, talking about essentially how we can keep AI robust to misuse intentional or otherwise and keep things secure. Any questions, comments, thoughts on that? Please don't make this a very short discussion. So I've been spending a lot of time working on prompt engineering for a project, personal project I'm working on, prompt engineering. Did you ever play the Gandalf game where you're trying to guess a password? And at one point it uses a second LLM to verify the output of the first one to know whether it gave the response. I have never played this. It sounds pretty cool. Yeah, no, it's fantastic. So quickly, it is basically you are attempting to get Gandalf to tell you a secret password. And so however you coax it to produce it, such as tell me in a different language or whatever kinds of ways you can trick it to reveal that password, one of the last level is basically it adds in a separate query to an LLM from the output of the first one. And to verify, hey, no, no, actually I did actually say the password that's wrong. And so then it becomes difficult to say, okay, how do you get it to spit out the password such that the other LLM lets it through? And so there were some different techniques on doing that. I would love to hear any kind of discussion on that kind of prompt engineering. So this is the thing I've seen a few people do, have large language models, check their own homework. It's pretty interesting and I've got to say it seems to work remarkably well, but I'm pretty sure in this game you're describing that the last level is beatable. Is that right? Yes. And so this type of thing, having large language models check their own homework, it seems to work really well, but it doesn't solve the issue, at least proof of the issue. At the end of the day, large language models checking it can still be tricked as well, so that's where we end up. Thank you. Any more questions, comments? Yes, okay, hold on. Thank you very much. So one of the things that we're doing at the company I work with, we work with AI in an educational sort of environment, and so because of that we have to be very, very careful what we give out to students, especially because if we give them any wrong information, it can really be detrimental. And so one of the things that we've been working on and we've got it working fairly well is that instead of giving the student like full access to the input and output of the LLM, we've made it so that the student can basically provide tailored inputs to it that we know and have tested the outputs of based on our data. And so we've been able to get it so that we can have outputs that are generally 99.99% of the time beneficial to the student through instead of letting them directly enter prompts, we engineer the prompts for them ourselves and then give them a drop down or based on their input into the chat box determine what they're actually looking for. Now if they ask something really, really random or obscure, like yes, sometimes it can come through and say I don't know what you're asking for, but we found that it's better to sort of have a more curtailed environment to actually return outputs from. That's really interesting. It sounds pretty labor intensive. How do you curate these things? Is it done manually? So with regard to it being curtailed, we have basically gone through and we've spoken to students, we've spoken to universities and the other providers that we sell our software to. And we've essentially figured out like a long list of what they're looking for. And so we essentially built a chat box based around that. It's relatively limited right now with the number of prompts it can give. It's about 15, but we're adding more as time goes on. So it is intensive in building it, but the end sort of goal that it improves trustworthiness and robustness in our system, which is really important for our clients, is like it's worth it. That's really cool. So this is a chat box you've built yourself and is this like the chat box you've built, not a large language model, it just is a curation of a large language model? Yeah, it's not. The chat box itself, I didn't actually build it myself. Someone else on the team built it, but yeah, I believe it uses either a very basic LLM or it's just purely statistical. So something I was going to ask is if this was using a large language model and you were restricting the input and output, I think a really clever student could play around with the order you ask things and probably still get bad things to come out, which is, which would be, I mean, it would be bad, but it would be interesting. It would be interesting. Yeah. There's something that we're continually testing. But yeah, we, as the software works, it will only allow prompts to go to the actual LLM through one of the channels that we've laid out for it through the prompts. And so they might be able to be clever and get it to return a wrong prompt, but it still wouldn't return like things that are detrimental per se. That's interesting. Thank you. Okay. Thank you for that. Stefania, would you like to introduce your topic for discussion? Hello. Can you hear me well from outside? So my topic of discussion is the important to share your projects and to get exposed to conference from a national and community level as well and the local community as well. So to give you some inputs, I personally came from physics, decided to go in data science. And my very first exposure was Picon Italia seven years ago. How many of you actually programming Python? Cool. How many of you have been in Italy? So if you want to join actually on 22 to 25 of May, there will be Picon Italia. And personally, when I joined seven years ago, I was a volunteer presenting speakers. And at one point that I was studying a lot, I was saying, okay, I can do that. And that gave me a big push in applying to data science job and I've been teaching Python data science for a long time. And so I encouraged you to go to national conferences, but also on the local point of view. So for example, I'm based in Turin, Northwest Italy. And a great example was for example, Shishari Kat, an assistant. We are by, if I don't remember wrong, Piero Stefani and Savastani. And there was a contributor, Alessandro Spallina, that came to give his first talk with a demo inside a Python Turin on my city. So it was very interesting to see how from one person showing up a project can inspire others and also collect more and more volunteers to their own project. And to add on that also, I want to also inspire you to be networking across different communities. For example, another example of an event was working with OpenStreetMap data and collaborating with Wikimedia Italia in order to make this happen. So I want to inspire you to enhance your local community, especially also to give opportunities to students to showcase your product and your project in machine learning. And what I saw from my side is when you're able to create a space for people to network and to collaborate, it's also easy to study, understand better complex knowledge and collaborate together. So I'm very open to help you with your local community or national chapter and hoping for questions, even if you want to share your own experience in starting or being inspired by local events. Okay, thank you, Stefania. Any questions, comments or anyone wish to take up the offer of help from Stefania? Hands, come on. Yep. Help me hearing because I don't hear you. Hello. So thanks for the interview. I was kind of worried because I'm Belgian and I think there is a lack of student community in tech, especially informatics. I was a leader of a Google group last year for students and it's kind of hard to attract students to learn AI a part of their course. So I kind of agree with your view to improve this communication, to attract some people and motivate them to participate into some conference like that and other conference and other hackathon or something like in Italy, for example. And so I would really like to see more group like that, like you told and I think it's really important because some people don't even think that, some people think that I have to be, to study AI, to do AI and to be specializing in this thing, to find a job in this thing. And when I speak with recruiter and something, they have an opposite discourse. So I can agree with your point of view and I can't, like, no, I want to motivate people to join group, to create group because for example the GDSC, Google group, we are the first one was in Mons, in Belgium, a second in Liege, but for foreign students. And in Belgium it's really hard to create those groups and I think schools should help students to create that and to entertain the culture about techs and AI, etc. And I could wish to, with older societies or older more professional group to maybe ask to study, like university for example, to help those groups to preferrate and multiple and to attract more and more people like that. So thank you. Yes, thank you for the input and I have a lot of tips to give you that maybe can help other people too. So first of all, also encourage students, volunteer for conference, for example, an example of Picon Italia is to get also students from Florence, from high school and university to get and help. So it's very important, I'm also leading an nonprofit association, so it's very challenging to get that volunteer work. So if you are the organizer and what I encourage all of you is to not wait for something to happen, just make it happen, just create once. And the first thing that I will do is also to contact perhaps speakers from other university. For example, in Italy we have a tour in Milan that are quite close to each other. Milan is doing the first pay data, that's another great network of events, especially for students, have something very clear and attractive to them. And I have to be honest, sometimes also free food helps. For beers, for example, there is a format called Data beers, I don't know if you ever heard about it. It started from data scientists in Spain and there is a beer estrella that kindly sponsored beers. And the format is a free lighting talks. So we started to do it more and more in Italy, there is different city that doing it and also across Europe. And that could be an extra push to make people come to events and then really network to each other. So my suggestion is if you can get a professor on board, that's great. But then also another last tip that can give you is the last event that we did was in collaboration with a high school because sometimes it's very challenging to find spaces. So for example, you can look out the different community, even a Linux group that can give also particular topics in machine learning and say, okay, I'm in contact with the professor that can offer the space to host an open source community here. And then the students will start to know about that and start to get around and talk about that and from there can start another group. And there are a few tips, but feel free to contact me if you need more tips on that because it's very good to also share your experience so all the local groups and national community can also learn from each other. Out. So I was the one of the main organizers of the University of Birmingham Computer Science Society for several years. And so my tip about making a community is definitely focus on some organic kind of community growth of pizza and beer and interesting talks. And then from that, see if you can, once you've got enough of a group of people, you can then approach companies for sponsorship, which will only make it bigger and bigger because people are desperate, or companies are desperate for computer science graduates. You've got a genuine currency there that companies will definitely respond to. So I think build organic growth to begin with and then you can really get much more money and greater support by utilizing companies like that. Yes, I do agree with that. And also there's someone that is waving in here, you can get it. And also in terms of workshop, I do agree in getting involved also with companies, local companies, and in terms of kind of events, it could be more social events or more workshop events. For example, for the one about special data, it was very, very hardcore workshop with not both shared and then discussion of it. And he could be also hack and tell with someone is coding and other people are watching it and asking questions along the way. So hello, thank you for mentioning the cat. I am the guy that started it. So thank you. I want to get a little political if it is permitted. So I think in this place, it's really important to, and I ask you a comment on this, to focus on open source and standards because they go hand in hand. And it's a way to invite our territory in Europe to build without falling into the fun boy style around open AI, around US services. So it's time to build in open and create standards, not only the laws, because we are good at laws and standards and open source. It's the only way we can build our own AI economy. Thank you. Thank you. Thank you. Yeah, I do do agree completely. Actually, maybe he knows that I give some talks last year by potentially about a safety but later on at two 15 we are going to talk about exactly that. So standards AI governments. So if you want to stick around, there will be a wonderful panel to talk about that. And I completely agree that we need that more centralization and more discussion, even an event in university. And not only computer science, also policymakers need to have more and more exposure in the past. I worked with the open data and we have policymakers ask us, can you please explain that the science because we have to make the law. And yeah, so communication between the two sector are extremely important. Thank you for the inputs. Can I just ask a question? I was lovely to hear the discussion about graduates and everything. What's your view on how you bring forward the older engineer, of which I'm a representative member who's been working in software engineering for a long time. How do you bring them into the AI world? Well, I think there will be a space for everyone in terms of, well, I'm from the data science part of you and recently more about handling their risk of AI. So I think there will be a great space to understand how to make it more accessible to everyone. And also, I don't know your specific, because from UX to development to understanding more also cross part about languages, for example, so it came to my mind, for example, Rust became more and more popular in data science from pollers. This is a library in in used in data science through a Python binding as well. So it's very important. And let me know if I could request it as well to have some of the tools to have an understanding of different ways of approaching in this case, data science for different technology from especially now that we need more power, recorrosy, more paradigms that sometimes is less used in a high level program languages. So I think there's space to collaborate in that as well. Yes, so this work I've been looking at with best practices and I for tech works is actually quite a lot about this. And it's really interesting how different it is trying to write a guide or a course for somebody who already knows a lot of computer science paradigms. There's so much you can skip. But there's so many things you have to also be careful to not skip because you need to save them again or reemphasize things. For example, source control is one you need to think about not just source control for your data, sorry, source control for your code, you need to think about source control for your data to make sure your whole pipeline doesn't get messed up. And it's just very interesting. Might be some interesting things to talk about that by the way. It's so nice to be at FOSDOM. I haven't been here for years. So in 2017 I was doing repotential build, neurodebian, any other neuroscientists? No. Yeah, neurodebian. Yeah, good. Sorry, I could not find this building so I only saw the last slide. But so in 2017, a group of computational neuroscientists led by me put together a fully open and available set of training. And it had to be asynchronous as well, right? Code has to be fully available. The classes have to be pre-recorded so that anyone anywhere on earth can access it at any time. That is a extended definition of open and you have to create that. So we trot them everything up from reinforcement learning agents to GANs in 2017, right? And I think this space is under exploited in open education, right? And if you take this, what we did was say here's the criteria to create reproducible science. Here are the skills that you're going to learn. Go out and find and validate an open data set. And if you do all of that because of reproducible science, you in advance have a pre-registered paper. You will produce a peer-reviewed paper. In this space, concerningly, a lot of people, if they get to masters and PhD, have done at this time a lot of rote education and no critical thinking or no extending into new knowledge. But the second you tell them, these are tools. And they're tools to solving new problem. And if you solve that problem in science, right? In science, it's the only place where you say, I solved a problem, it's peer-reviewed, it's open and public. If you pre-register it, you're perfect. If you can motivate them with something real, it will be applied. And I just want that to be much, much more in the forefront of people's minds as they communicate. Okay. We've only got a couple more minutes. Will, would you and Stephanie just like to wind up the discussion and then we'll hand back to JJ to run the next talk. Just a quick comment on what you just said. It was always something we struggled with, or I struggled with when I was a neuroscience researcher, because there's quite a lot of not quite so good neuroscience researchers who don't do what they should and pre-register experiments. And they just sort of make it up as they go along and that causes a lot of problems. And I really like this idea of pre-register everything first and then when everything's pre-registered, then do it and then publish it no matter what. I think it's quite important. Yes. And do you want to just wrap up your bit on education? Sorry. Just quickly wrap up your bit on education. Okay. I think we're, okay. I think we're done. Okay. We're done. Thank you very much indeed. Bye. Thank you.
Introducing 'Refiners' – A Micro-Framework for Seamless Integration of Adapters in Neural Networks
Hi everyone. Thanks for that. Sorry about that delay. I have a great privilege to introduce Benjamin to us. Everyone give Benjamin a round of applause. Thanks. Okay. So today I'm going to introduce to you Re-finals which is a micro framework we developed at Fine Grand. And I would like also in this talk to give you, to inspire you to, if you're deaf or you're not into ML or you haven't trained a model before, I think what I'm going to show you today is a great way to start and get started with ML training. So why is it so daunting right now to train ML? I think that's because we're in a phase called, we could call like the foundational models phase where those huge companies are training big, big models and it's really hard to do. And the goal of course when training those foundational models will be to reach the AGI. So I have a very weak and restrictive definition of AGI but I have a 100% scientifically accurate definition which is we're going to reach AGI when everyone in this room is going to lose its job and become unemployed engineers. In the meantime, sorry, what can we do? So generally we do two kinds of things, right? So you say you can either become a prompt engineer, you don't touch ML, you're just relying on an external API or something like that to try to make the foundational model do what you want it actually to do or you can like do the training and it's very good for bragging but you cannot build it in open source because it's really costly, you need GPUs, you need data and it's risky. Like if you want to actually solve something, you train your MLM, you don't know if it's going to solve it. So there is a third way in between prompt engineering and training for a national model and the idea is the idea of adapters. So an adapter is just a way to patch a foundational model. So generally a foundational model or like chains of transformer layers and then for instance you could just inject some new weights into it and freeze the rest and you train those. And the advantage is that you require a lot less VRAM and you use the foundation on which the foundation model is built to train something very powerful and the idea is that you get a lot more flexibility than using prompt engineering. So it's something that's really exploding right now. So for instance that's a list of all adapters that exist for large-rengorge models. There's a lot. Also for generative image, there's a lot of adapters that allows you to generate images exactly like you want and unlock new possibilities that for instance table diffusion, the foundational model cannot do. And like every week you get two new papers on that domain. So it's really, really exploding right now. Okay, so let's talk a bit about why did we do that? So most of AI codes today is written in PyTorch, which is an imperative language to write deep learning model. It's very convenient because the only thing you have to do is to write your operation in a procedural way like with the NumPy API and everything works for you. And that's really great. But the issue is that when you want to patch or modify your model, you cannot do it because the code is already written. So you could monkey patch it, but you get into huge complications when you have multiple adapters or you want to compose them, et cetera. So we wrote yet another machine learning framework to solve that. So maybe that's not a good idea. So we wrote a micro framework. And what we mean by that is that Refiners is built on top of PyTorch. So everything is intercompatible. So if you train a model with Refiners, it's going to work in PyTorch. And if you have some elements you have already in PyTorch, it's going to work in Refiners. So Refiners is based on three key concepts, the chain, the context, and the adapter. And I'm going to go through each one to show you a bit how it works. So the idea of a chain is instead of writing just operation by operation in your code, you write each model as three of different layers. And what you get is that it's easy to edit dynamically. And it's completely explicit because when you see the graph of computation, you know what your model do. You don't have to look at the actual code. You just have to look at the model. So here's a comparison of PyTorch code. So you define your layers and you write all the operations. And on the right, what you will do with Refiners, where you just put each layer one after another. So that seems very basic. But so what you get is we have a very good representation of everything. And now you have a lot of helpers to help do some operation on it. So for instance, you can wrap model, pop them up, add some others. And then when you look at the rep of the model, everything is explicit. But you know what everything does. So even if you change the name of a layer, you still see that it's a chain. So now the chain is powerful. But you say, what if I want to do really complex models? Some models have some data that can pass through different layers. So we need something to simplify the flow of the chains. And so we introduce the context API that works a bit like in UI framework where you have a store. And everything nested down there can have access to that store. And so the idea is that even if you have a very nested chain, you can set the context. And then every sub-layer is going to inherit from it. And even in very complex models, you just add something like deep nested in the model. You can have access from the outside to any tensor you add to it. So for adaptation, it's very convenient. And the third and last concept is the idea of adapter. And the idea is to have an abstraction that make it easy to perform model surgery. Because when you're patching, you want maybe to add some parts, remove some parts, and let everything connected together again. So obviously, you're not going to do all the operation by hand every time. And so we have the adapter class. And the idea is, let's say, for instance, we want to target this linear, add some more logic to it. Then we can write an adapter that's going to plug itself into it. In terms of code, it just looks like that. We have a mixing called the adapter. And you can just rub it into it. And you get for free an inject method. And the inject method is going to do exactly that. We place the linear by the adapter. So for instance, this schema, it's look like the adapter that's called the Lora adapter, which is really common. OK. And so now we're using this to train new adapters. And we're doing it in the open. So you can have a look. We have a page called Boonties, where you can come and train adapters. For instance, the color palette adapter is currently being trained by someone who hasn't had ML training for monitoring before. And so if you could come and do some stuff with us, that would be really nice. Thank you. Thank you for listening. Any questions? No? Cool. All right, let's give them another round of applause.
Dynamic Explainability through Dynamic Causal Modeling
Hi everyone. I like being early so we're going to start a little bit early. First of all, who's enjoying the room? Yes? Yeah? Yeah, make some noise. Come on. There we go. There we go. It's all Will's work by the way, so thank him after this fact. All right, I have the pleasure of introducing Will again today. He started our event office this morning and he's continuing here helping us to do it today. He recently took up bouldering and fell two meters his first time of doing it and still keeps going on. I have to admit that is impressive. Take it away Will. Thank you. Yes, so I'm quickly plugging this talk because one of our speakers dropped out. So if I'm going anywhere nearer over time, just let me know. Just cut me off. It's fine. And what I want to talk about is a little bit of work I did while I was in neuroscience researcher. I think we had one of those earlier. Anyone else neuroscience? Nope, not a single one. Never mind. Well, it doesn't matter. It was there was some interesting work I did while I was a neuroscientist that looked at how to reverse engineer the causality between different brain regions. So the idea is we go out, we take FMRI of somebody, we take EEG, we get a signal of how their brain activity evolves over time. And what we want to understand is how these different brain regions are interacting with each other to produce this activity, which is a useful thing to do in research. It's useful in medical applications as well. And what we're actually doing here at its core is something I think it's really interesting and it's really powerful. We're simultaneously estimating a causal system that has engineered our behavior and we're parameterizing it at the same time with the values that made that happen. And that's actually a really difficult and challenging problem to do. And it's a bit of a shame that the technology sort of confined to academia at the moment. So this project I did, what I'll talk about today, is the work I did to make this more free and open source and not that the old code was bad, but it was research code to make it better and more commercially viable. So let's talk a little bit. We'll back off a little bit now and talk a little bit about what this technology does in a bit more detail and we'll talk about then what I did with it. So what we do with this technology, it's a Bayesian technology for anyone who knows what that means, but that means that we start with an estimate of everything we do know about the problem or at least an estimate of all the things that we don't know. We take this, we take our data and then with a dynamic causal modeling framework, we infer simultaneously a system that describes this behavior and also, as I said, the values of the parameters that parameterize the system. And we do this because it's Bayesian in an uncertainty aware way. So every parameter, everything we estimate comes with a, it comes with a probability distribution. It comes with an error estimate of how wrong we could be about this. So let's walk through a simple example. So the way this systems inference works is we infer what is called a dynamical system. That is what physicists use to describe how, I mean, basically everything works. And this consists of two things. It consists of a state that is in our planet's example, that could be the position mass of velocity of each planet. And it consists of a rule for how that state evolves over time, which would in this case be the laws of physics. Things like this, I was going off the slides a bit, but GM1M2 forces that two bodies with mass are applying to each other. And this is a dynamical system and it's what we're trying to infer from our data. And some of our viewers may notice the one I'm describing here, in fact, even on the slides is a differential equation. And for those guys, I would point out that dynamical systems are known for basically, basically not having any good solutions, which is partially a neutral framework. And in the dynamical, dynamic course of modelling world, we would be doing something like taking the position mass and velocity of planet A and saying, okay, I know there's probably some other planetary bodies affecting this, but I don't know how many, how many planetary bodies are affecting this planet that I can observe? And what is the position mass and velocities of each of these? And how uncertain am I about that? That's an interesting thing to be able to do. A real world application of this was in the COVID-19 pandemic. Here we have two very, very noisy things, hence the need for uncertainty that we understand about the pandemic. That is the number of positive COVID tests and the number of deaths. And we have a broad understanding of how the pandemic works. It's or at least broad understanding of ways we can model it, something like a CER model, which is what we opted for here. Well, I didn't originate this work, but I did make it work at the end. A CER model, which is susceptible, exposed, infected, recovered. And so we can take these two things. We can take the observed deaths and positive COVID tests and our understanding, and we can project backwards to a dynamical system that explains this. And there's three reasons this is a sort of particularly useful thing to be able to do. The first one is because we're understanding our data from partially, at least, or as much as we want to, we can tip in the things that we understand about the model beforehand. We can include things like hospital admissions and ventilator usage. And because we can include these parameters in our model, that means we can estimate them and it means we can look at them afterwards and have good guesses at them. And then we can do things like we did here, where we look at the number of deaths from COVID-19 over time. And we can estimate these parameters and fairly accurately, if you can see these graphs, because we knew they were there, they've been included, we can estimate them. That's a powerful thing to be able to do. Similarly, because we've included them out in model, and because of the way these things work, because it is a dynamical system with different parts, and we now have it all interact with each other, we can do things like look at counterfactuals. So we can say, what if I only have 200 hospital beds this week? What happens then? What happens? How bad do things become? And again, that's a really, really nice thing to be able to do. Finally, because this is a Bayesian method, because it is uncertainty aware, we don't just get single point estimates out of this method. We get probability distributions. So we can say things instead of like, I need 200 hospital beds this week. We can say, the model says, I'm 95% sure that I need between 200 and 400, or I'm 90% sure I need at least 300. And that's much, much, much more valuable than if you're planning things. That's what you need to know. You don't need to guess at what's going to happen. You need to know what might happen and how likely that is. In terms of the project we did about this, my work about this, it was really about making it more open source. And I say more because the original code base was open source, but it was written in MATLAB. And MATLAB is not really free and open source. And I will give credit to the original researchers. They did modify things so that it could mostly be run in Octave. But the unfortunate fact of the matter is, Octave is a wonderful thing. It really is. I can't give enough credit to developers, but it is not a replacement for MATLAB. Complicated MATLAB code, particularly MATLAB code, but dependencies will just break and then it's annoying to fix. And it's at least 10 times slower. Depending on what you're doing, it could be much, much, much slower than that. So having your code in MATLAB, even if it's open sourced, it's not really open sourced. So we rewrote the project in C++, which I know some people will disagree is an improvement, but it certainly goes a lot quicker. And yeah, as we did this, we have tried really hard to make this code more generic as well, because the original code was research code. It was designed for just two problems, neuroscience stuff and a little bit of COVID stuff. Now the code works generally. The other thing that we have been trying to do with this is to try and make it more open in other ways. Again, a problem with this is its research work. And so although there are papers describing how all of this works, I'm not going to lie, I literally did a PhD in this, they're incomprehensible. It's really, really, really difficult to get to the bottom of the papers, even if you really understand the subject material. If you don't understand the subject material, it's just awful. And the thing is, when you've gone through it, and you do understand all of this, it's not that hard. It's interesting and useful and creative, but it's difficult, it is not. And it doesn't need to be, so that's the other part of the work we have been doing with this. To try and make it free and open source by making it open about exactly what's happening and why. Anyway, that is the end of my talk. That is my GitHub repo. If anybody wants to look at it, do we have any questions? William, you've written it in C++. A lot of AI people expect to use Python. What's your solution for them? I have been writing a Python front-end for it. Is the solution. It's a little bit of work in progress, but you can draw from this in Python if you want to. Any other questions? You're going to make me run? No? All right, everyone give another round of applause for Will.
AI for Developers: Treating Open Source AI as a Function
When you didn't know what was going on, I'm looking up at the faces that are blank. Okay. So AI. At one stage, AI was a preserve of data scientists, machine learning folks, and people in universities. And us developers said, let them have their day. Okay. But what I want to look at AI, and I know chart CBT, you know, at the end of November 2022, throughout and everybody from your parents to your aunts to your uncles, suddenly know what AI is. There's kids around the world writing papers and some of the information is right and some of it is wrong. So there's a lot of teachers out there that are fed up, especially in secondary school or high school, that are fed up with getting answers from models and from AI. But on a serious note, what I want to look at today is from the angle of, I've been a developer for nearly the last 30 years, and my background is about writing services, applications, mostly back in middle there and earlier in my career in UI. And because of the growth of AI and because of what chart CBT does, an awful lot of our leaders in our companies now want us to leverage and consume models in our applications. So I just want to get a little demographic here. Hands up, who's the data scientist? Okay, there's a few of you. You might be leaving in a couple of minutes. Any machine learning engineers? Some more. Okay, I'm getting a bit nervous now. Any developers? Yes! Sorry for the roar. I'm safe. I won't get lynched from here out to the door. So let's get this show going. As JJ said, my name is Barton Hickey. I'm a software engineer working over at IBM. And I spent around the last eight to ten years in the cloud native space in different communities. But my key role, I've been very lucky, is contributing to open source communities and trying to drive open source communities forward. So a little background. I'm going to do a very small background in AI. AI folks don't shoot me down. It's my interpretation. I'm only dipping my toe in the water over the last couple of months. So this is how I see it. But really I want to get into the frameworks and open source in general and how can we use the models using these frameworks? And then we might have a bit of a demo if I have enough time. So I'm just going to throw up this definition. That kind of two sentences on it. And from what I've looked at and from what I understand, the way I look at the model is I look at it like any other program. Or library. Or whatever else that we've often called to use. So it has an API generally. We can call it and we can get a result. We can consume it in our application. But I suppose the difference here with, for me as a traditional programmer, our traditional programming is, you know, we give it a set of rules and instructions to tell the computer what to do. And more often than not, we should know how that program is going to work. With AI models, it's a little bit different here because it's not exactly explicitly programmed for predictor tasks to do or a particular prediction. It learns with the data. So there's the big difference from restarting out. So what is the journey of a model? So first of all, I suppose it's the building or the prototyping of that model. And data is the key here. So as a lot of my North American colleagues would say, garbage in, garbage out. So the data is really, really important when you're training or creating your model. And the first part of that is getting that data from reputable sources, reputable domains for the particular operations you want the model to do. Then it's the preparation of the data because the algorithms that process this data or create the model, they need the data of a certain quality and in certain formats. So things like errors and omissions need to go out of it. If there's any duplication, any missing values, and also then maybe converting into a format that the algorithm can use. There might be some aggregation as well, so that you know what I mean, to normalize values, etc. Then you choose the algorithm. The algorithm depends on the operation you want to do. And it also depends on how you're going to train it, how it's going to function, and then what resources you've available to do it. So you can see when we started out with the generative models, they were huge, and I'll talk about them in a minute. But now there seems to be a turn at the start of this year of smaller models and more modular models, etc. Because of the resources to train them and also to run them. The key part of the algorithm is to turn that data into a model and produce the model you have that's trained on certain data, which is the next step. The training data set is very important because you want to see within a certain level of tolerance how exacting is that model. So what are the results? Are they within a particular tolerance rate that you are going to be willing to accept? And then at this stage then you say, right, I have a model here, I've trained, what I need to do is the final step, which is validation. So then, testing the model against data that you didn't use when you were training to see if the model can actually learn when new data comes into it, and our as its predictions or its tasks as you expect them to be. So a big part of this I think going forward is open source could be the key with models because it's going to come down to trust. Do we trust the model? Because let's be fair about it. Even, you know, the people who write the models or the graded data scientists out there are machine learning engineers. Sometimes they can't even predict what the answers are. So the key here for transparency and in the open and trust of models is going to be key. And the last two parts are the parts I like. Running the model and calling the model, okay? So, you know, sometimes, you know, back in the day when you used, you know, binary tree library, you just wanted it to give you the answer. You didn't want to be writing a vanilla binary tree unless you were into that. So the same with the model. You want the model to do an operation for you, perform a task or make a prediction. So that's what they call inferencing. A little bit around genera.vii because that's the buzzword. Everyone's talking about it. And the key here is that it's a different type of model. So your tradition in machine learning models were trained on label data. So data that was specific for specific tasks. A lot went into knowledge of the domain and knowledge that the data scientists need to know around that particular area. And it was very intensive in the training. In this situation with the foundation models, what we're saying here is we're going to train it on a massive data setup on label data. And then you can use that model and fine tune it. Or so you can do different tuning like fine tuning where you take the model. And because these models are deep learning models, there's going to have a lot of layers in them. You may take one to number of layers off, put your own layers on top, i.e. you're going to train it against your own data. Or you might use prompt tuning where when you're calling the model, you're going to pass it prompts like examples, what you're looking for, and guide it towards the answer you're looking for. An example of this is around the large language models which are based on huge language data sets and can generate content from there. And we can see that in which generative AI. It's about using these models or these generic style models to be able to generate high quality text, video, etc. So that's the whole idea of the generative where we're going now. And the idea here that these models, one model can be used for different operations as opposed to be normally trained for one. Okay, so that's my intro to the AI. And the next part is what I really want to get onto is around the frameworks. So hands up who's heard of Hoganface. Okay, that's not a bad number. So Hoganface has built up an AI community which is nice to see based around open source. It's key things being it is a series of libraries. It has a huge catalog of open source models and data sets. Now, how is this appealing to you if you want to use open source models or sorry, you want to use AI and models. The great thing about this is you can pull those models down and you can run and use them locally. So that if something changes tomorrow morning with Hoganface, you can still use those models. You can use them locally. Now, they also provide a service where they host the models themselves. So you're directly just calling the models through their API. But I think it depends on what your setup is and your use cases, etc. So what I'd like to do with this when I'm looking through this is an example. And this is an example using the Hoganface Transformer library. And it was an example up on the Ray framework, which I'm going to touch on straight after this. So the first thing is we're using a model called the T5, which is an encoder-decoder model. And these types of models are really geared for natural language processing, or NLP as they call it. So when you're dealing with text of some sort in different languages, etc., it's usually text in, text out. And the task we want to do here is just a very simple example. Looking at the code is that what we want to do is just convert a bit of text to French. And I use French because we're here in Belgium. I couldn't get Flemish or I didn't think of it, so sorry about that. So when we look at this, we see it's two real calls. The first one is we use the Transformer library, and we call Pipeline. And in Pipeline, you specify the model and the task you want to do. And then inside the translate method, we're then going to call the model with the text that we send in. Very simple Python class. When you run the class, if it's your first time calling that model, it will call out to Hug & Face, Open Source Cal log, it will pull down the T5 model, and it will put it in cache. So any call after that for that particular version of the T5, it will get it locally. Ray. So Ray is a framework for scaling and distributing Python and machine learning applications. So the capability is it's providing which is batch inferencing on CPUs, GPUs. So the inferencing being running the models, which we talked about earlier. But the stuff we're interested in mostly. Obviously the serving of the models, hosting them up, having them ready to go, providing an API in front of them. Also training of models, large language models. So that's if you have your large language model. It's not, I suppose it's not giving you results you want for the particular operations you want. Then you may decide to train that with your own data. So in that case, you'll create a new version of the model. And then other operations like reinforced learning, etc. So if we want to use Ray to now host the model, in this scenario, what's going to happen is Ray has a nice little HTTP server. And pretty much all you have to do is put an annotation on top. You can see just below the translator class called serve deployment. And you can pass args in there if you want. And the next part that's important is the callback function, which is called underscore underscore call underscore underscore. So in this situation, once you you have your call, you have your annotation, you can then call this class using. Using server on. So it then will load it in a HTTP server and it'll provide the interface to it and you can call it as follows. The next and the final framework I want to look at is the Triton inference server. And the inference server provides you with support. For most machine learning applications and frameworks, as well as custom C++ and Python backends. So you can see the different frameworks that supports there and processes support, etc. And this time, if you want to wrap it or use it, you need to actually call the class. I'm going to read it from up here. It should be either Triton, Python model, which is a bit of a mouthful. And then you need an execute method. And that's where the HTTP request we're calling to just note here. It's not using JSON data types. It's using tensor data types. Okay, so I don't know if you've experienced of using tensors, etc. PyTorch or anything like that, you'll be used to that. So you have to convert it in into into Python data types so you can process it. The other thing you need to do to bootstrap it is you're going to need a configuration file and that config file. You give it the name of your particular model. So I'll call it the model you want to host. So it's going to be the local name that's going to host. And then you tell it what back end, in this case is Python, and your input output. And those input outputs are binary tensor flow types. And the last thing you do then is you need this model directory down here. So in a minute when we call models, it must be under this with the name of the wrapper model class and your config and your model file. And finally, to call it then, there's a number of ways you can do it. One easy way is to use Docker and run its container. And when you run that then, you then need to copy the artifacts, which is the model directory that we set up into the container. And then you call the Triton server executable to run it as a HTTP server. And you can call it in the call request above with that path in the name of your model and in fare. So before I go on to the demo, I just want to just do a little summary of that. So I'm after choosing three frameworks and you're probably someone's probably up the back on each and go on. There's loads more. Yes, there is. You VLM, you know, which is a nice alternative note to Hogan phase teachers. And there's just so many of them because this space is just growing phenomenally. We probably saw it with cloud native over 10 years ago. And you can see it at the moment that there's more and more frameworks. And I think the key here will be if we can have these frameworks be open source as well with the models. And then if companies want to put their value on top, then they can put their sauce on top. But for most of us here, we would like to be able to choose our framework and see if it'll do what we needed to do. So load our models and stuff like that. The other one I'm shown here, I'm just obviously running locally and I'll be running something locally here in a minute. But the idea here is that you'll deploy this into some system bare metal, a cloud system, whatever, because depending on the model, it's going to need serious resources to run a lot of the time. But as I said, there seems to be a shift towards the smaller models and then the plugability in these models where you can have different capabilities in the models. But that's something for me to learn down the line and for the data scientists to come and tell us about. Okay, so let's do a quick download. Yeah, I'll just escape here one second. You want? Yeah, if you don't mind. Thanks very much. Just before I show this. So what I'm doing is, I'm just running a framework here called KK. And why I'm running this is a colleague of mine, Mark Stewart, did a really nice UI using radio, the radio framework, if you know it, a Python framework for it's very handy for for doing UI, active access stuff and elements and things like that. And I'm just showing you here that I'm running it. And what it's running is it's running the AI, sorry, it's running the UI as a HPP server. And it's also running the back end server with a gRPC and HPP interface. And the back end server is the host. So where it's hosting the model and it's and the way to run those models are similar to the other frameworks I've shown by wrapping the model for the particular artifacts that's needed. Okay, so here's the simple UI. And what I want to do is just play with models for a minute and then just show, look, it's just code wrapped in the back end. And as developers, we can just use these models to perform operations that, you know, we might take us ages to write or would be quite difficult to write. And we can use these off the shelf models, hopefully. So the first one is putting in a sentence like. I can. How's that? Better? More. This is where you were told in class to come down to the front. Did you hear that? That's better. Thank you. Thank you. So what I want to do here is I'm using a model that does sentence similarity and I put in the canine is fast is my source. And then I'm saying the dog is running the cat is asleep. And if you know your cat from your dogs, a dog is the canine. A cat is not. So there we go. It's telling us 68%. It thinks the dog is asleep. So there we go. It's telling us 68%. It thinks the dog is running is the best sentence. But if I change it to say the cat. Is running. We now put the cat among the pigeons, so to speak, excuse the pun. But now we're saying it's a 36% chance that, you know, you're not going to be able to get the cat. It might be like the canine is running. So you can see here how you compare on with things. And because of vector spaces and so forth, where words are aligned, that if you do change some of the words like running, you know, fast, they're going to be near each other. Dogs and cats are going to be near each other because they're animals. So yeah, that's what happens there. So looking again, let's do image classification. I'm going to choose an image here. All right. Yay, puppies. Okay. Now, I love dogs, but it's telling me it's a 88% chance that's a golden retriever. Any person can confirm that? Okay. Thank you very much. I just thought it was a Labrador, but sorry about that. And the last one I'm going to do is something close to my own heart, which is I've taken an image of a sport we have in Ireland called Hurlin. Okay. So Hurlin is players are on the field. They have a ball. They have a stick. Okay. So when it does the detection here, it detects a sports ball. That's fantastic. It detects people, which is fantastic. But it calls the stick a baseball bat. Is there anyone in the audience that knows what the name of the stick is? A hurly. Yay. I love to hear. Okay. So that's an example of the data the model has been trained with. What you'd need to do there is either to prompt tune that model or fine tune that model with a data set that tells you more about the game of hurlin. It's able to detect the field sport, which is which is amazing people and ball, considering it hasn't been trained anything on the sport, but then it lets itself down a little bit. So why did I show you these things bar showing a really nice UI for my colleague? The reason is this. At the end of the day, we're calling libraries like we've always done or we're calling API is like we always done. And as long as we know the models can be trusted, then that's okay. All right. So the first part in here is this is just a great UI code. And in here, we're just saying there's a series of different classes for the for the different UI tabs. And we're just calling them here. And here's an example of one of the UI tabs down here for the image classification. You can see here the submit button and the other elements that are on the tab. And in this situation, you can see the submit is going to call into a function called function for that's a great name, by the way. FN. I haven't seen that for a long time. And you can see here in here, then this is where it's going to call the model. Okay. Now, why is this a little bit funny because it's a g is the g rpc API is calling and you can see here inside the UI when it starts up, it's going to get it's going to get a handle to the channel of the g rpc server. And then over on this side, you see here the different tabs classes to handle the UI interaction. And then finally down here is the wrapper code. And you can see here it's got similarities to some of the frameworks I showed a while ago. But in this situation, you use an annotation up here called module, you pass it a task and so forth. And the key here then is you can see in the init again, we're calling a pipeline, which is the hug and face transformer pipeline up here. Sorry, I've just jumped. Yeah, just up here. If you don't pass a model to it, the default model is going to use is the Google VIT model. And then down here is the key in the run method. So when you call the predict, it's going to call the HTTP server and it's going to be redirected in the server to the run method here of this module that's loaded in the server. And it's going to call the pipe here. And when it calls the setup pipe and passes the image like it did the last time, then it's going to get the result back. And that's how you get it. Okay, so let's jump back to our deck. All right, to wrap this up, anything's on the screen there. I'm actually going to talk about that for some reason. So you can read that if you want. No. Why did I do this talk today? I want to do this talk from the angle of somebody who's been writing code a long, long time. And I'm afraid that someday someone's going to say you're too old and bald and you've got glasses. And I'd... And a beard. Sorry about that. No, but when we look at the change, it's probably the biggest change. If you've been in the industry for maybe 20, 30 years, it's probably the biggest change we've seen the way things are going. And there's always that phrase, you know, you either adapt or you stand still and everything moves on. So the ability here to be able to use these models to write applications and improve things is really the key. And this is something that I suppose to grow out of AI definitely in the last 10 years or plus. I suppose starting with, or maybe 20 years starting with the Linux Foundation, then into OpenStack, Kubernetes, you know, all the other different great foundations and communities out there. I think a lot of companies have realized one company or one set of developers cannot keep up with the rate of development, the rate of change, the way technology is going. And having this ability to get these models eventually out of the grasp of the scientists stuck in the labs that don't want to release anything. Sorry data scientists, I'm not picking on you. But you know, to eventually see the light of day and they can go home and tell their families, you know, that thing I was working on for 20 years, they're now using it in fridges, in cars, everywhere. But the key here is that we need to have it open. We need to drive forward with our models and be transparent. We need to have trust in the models because we're asking the model to do something for us or give us a result that we depend on. And like libraries we use in the old days, you needed to go and do the groundwork and find out can we trust that library? Is it doing something we don't want it to do? Is it doing something malicious? So I think there's a big change here from the initial AI maybe in the last 10 years where it was around fraud detection, spam detection, you know, chatbots. I'd say there's going to be a big proliferation offered the next 5 to 10 years. So thank you very much. Questions?
Using Haystack to Build Custom Functionality for LLM Applications
We'll be starting our next talk here from Tawana Czelec. She lives in Amsterdam, but she's from Istanbul. She loves historical fiction and free diving. Surprisingly, she's dived 25 meters down to save her GoPro before, which we got a bunch of asses this year. This is pretty crazy. So thank you so much and take it away. Alright, thank you. I hope, let me know if I need to eat the mic. Alright, so this particular talk is a bit of an outlier to the talks I usually give because it's nearly fully a showcase of a very simple actually project that I built with some community members of Haystack and some functionalities of Haystack that made this project possible. And then I'll end up by showing a few other projects that we built together. And the way it usually goes with the Haystack community, so a quick side note, I work for Deepset, which is the company behind the open source project Haystack. And we have a Discord server, and from time to time, this is basically what happens. So this evening I say, I'm a bit bored, I want to do something, and I go and join a voice channel on our Discord, and there's one particular community member that I have to give a shout out to for this particular talk, Rec, because oftentimes Rec will then come up with a random idea, and either myself or the two of us will just share screens and do some pair coding, hack together something. And this particular project is exactly something that happened like this. I'm pretty sure a lot of you know this page, it's Hack and Use, there's a lot, it changes a lot. So the idea that Rec came up with was, well, why don't we build something very simple that gives you like a TLDR of the top K Hack and Use articles. So we built that, and then just recently, and when I say recently like two days ago, that became a hugging face space that you can actually get to at this QR code. And we've kind of vamped it up a bit, and we can now pick between two models. You can use MixTrail or an OpenAI GPT-4 model, and then provide a number of the top something, and I make it go up to five because no one's made of money, and you're going to be making APR calls to OpenAI possibly, so it goes up to five, and this is literally when I ran it yesterday, and the funny thing about this one was that at the time, the second top article was actually the FOSTA and Livestreams, but you get like a short summary of what the top three articles are at this point with a URL to get to the full article itself. So my whole talk is based on how this was made possible, how we actually built this project, and we built it with Haystack. Haystack is a fully open source, large language model framework. It's all written in Python. The main ideas behind Haystack is providing tooling for developers, so nothing is a plug-and-play really, you're building it all yourself, and the two main structures in Haystack that make it possible are called pipelines and components, and a pipeline is made up of a few components attached to each other where every component is forwarding some data to the next one. And I'm not going to get into what RAG is, I'm pretty sure a lot of you know what RAG is at this point, but a retrieval augmented generative pipeline might look a bit like this. You have a query, and then an embedder component is creating an embedding for that query, then a retriever component is retrieving the most relevant context for your LLM to actually use, and then it's forwarding that to what in Haystack world we call a prompt builder, so that context gets embedded into your prompt itself, and then you use the generator, and that can be any model, open source model of a hugging phase, or open AI, etc. And then you get an answer. This is a pipeline, but what a pipeline does is really dictated by what components it's comprised of. You might also create a pipeline that indexes documents. I'm not really going to get into this here, you're just basically fetching the contents of a URL, and then writing that into one of the available document stores that we have. But this is all made possible because of the structure called a component, and a component in Haystack is something that could have, for example, in this case it has two inputs and an output, but you don't necessarily have to have a defined output. Haystack doesn't really make assumptions as to what a component has to be. It can also be something that has two inputs, two outputs, and then the idea is you attach those components to each other, and you can be very, very specific here. You can be specific into saying, like, I want output one to be forwarded to input two of the next component. You can be very, very precise here. And maybe you can already start to see this starts to look quite like a graph. So how do we build these components? There's only a few requirements for something to be a component in Haystack world. We provide a bunch of ready-made components that, if you go to the Haystack documentation, you'll see a bunch of sections there, generators, converters, embedders, etc. These are all basically components that have been built exactly like this that we just provide in the package itself. But what you can do is build your own. And what you need to have is a class. So here I've just got a very, very well-named MyCustomComponent class. I need a run function. And the other thing you need are these decorators. So basically this is telling Haystack that this class is a component. And then the second one is around the run function. And this is actually used for pipeline validation down the line. But it's basically telling the Haystack pipeline what outputs it should expect from this component. In this scenario, I've got a MyCustomComponent that's expecting a query, which is supposed to be a string, and then it's returning documents. In this case, it's just hard-coded. It returns high and by. So we know that this, whatever query this gets, is going to be returning two documents, high and by. And this has then led to quite a bunch of components that don't actually, they're not served through the Haystack framework itself, not all of them are, but you can just install them as separate packages. And it's meant community members have just gone ahead and built components that they need for their very specific custom needs and made them available to the rest of the Haystack community. So let's come to our Hack and Use TLDR, if you will, project. The idea was that we wanted a component that would take top K, that could be a number, and it would return articles. And again, this is a colab that you can use, it should be running. And the way we did that was, this is very much pseudocode. Later, if we have time, I'm going to show you the actual code. But we built this component called Hack and Use Fetcher. It takes top K, it queries the Hack and Use API and gets the top whatever number we've decided. And the other thing I wanted to show here is, at the end, I don't know how well you can see it, but we've also added some meta information, because down the line, we can use meta information in our prompt, because you also get titles of the Hack and Use articles, you also get URLs, which is great for referencing down the line too. So we return full documents that have the content, the title, and the URLs of each Hack and Use articles that we fetched. And at the end of the day, we're going to be building a pipeline that looks like this, and everything you see in green is already provided with Haystack. So that came with pip install Haystack AI anyway. And the orange is what we've just built for ourselves, and it just slots into the rest of the Haystack pipeline ecosystem. And for this co-lab that I've shared with everyone here is, I've decided I'd just go ahead and use Mixstrahl, I've tried this with OpenAI models a lot, so why not try something new? And then the last thing I want to highlight about this particular pipeline is how the prompt is being built. So prompt templating happens in Haystack world by a component called the prompt builder. And templates use Ginger templating. And what's really important here is, okay, we have an instruction, you'll be provided with one or more hack and use articles, please provide summaries. But if you look at this close theme, we have actually a for loop. So this prompt builder automatically knows that it should be expecting an input called articles, and it can loop through those articles, and then it can access the contents of that article object individually in every step of that for loop. And that's how we're embedding URL here as well. And this is the final product. At the end of the day, we were able to build a pipeline where, given top three, we were able to run it, and we've got the TLDR summary and the URLs that you can find, the full articles of hack and use, current hack and use top articles. So with that, I want to show a few other projects that this custom component building functionality has enabled. The next one is slightly questionable. Please take it with a pinch of salt. I put it everywhere on that tugging face space too. And the idea came from, at the time, Twitter was very different. So the idea was, could we like build a Twitter fetcher that, given a username, could give you, this is really bad, could give you like a vibe check of the account, and we called it like, should I follow? And it gets like the last, I think, 40 posts of that user. Obviously after that, Twitter changed, so I went ahead and built a master on fetcher. You can also find that on the Haystack integrations page. And the best way I like showcasing this is actually using my boyfriend's master on account, because every time this tells me something a bit funny about his account, once it called him pessimistic, this time it called him sarcastic when discussing personal opinions. So that's also open. I think I linked to it in the notes as well, so you can go ahead and try that out. You just need to provide the full master on username. Without the at at the front, that's a bug I haven't fixed yet. Another thing that this enabled, actually not only used the Haystack custom component functionality, but also, I don't know if you remember when I showed the components earlier with the two outputs and the two inputs, et cetera, you can already start to imagine that you can actually have these pipelines loop too. So the idea was, what if we have some meeting notes, and we have our own GitHub repository, and anyone who's used GitHub repositories, you know that you can create those issue labels that are very specific to that GitHub repository. Could we build a system that, given meeting notes, generates a list of GitHub issues specifically for that repository that you're discussing in that meeting? And could we actually then use those generated structured outputs to query GitHub to actually create those issues? Now, this is great, and our experience has been that actually a lot of large language models are great at generating structured output, but not necessarily in the structure that you need. So it's going to be JSON, but is it going to abide by what you need that JSON object to look like? So the idea here was, okay, well, why don't we create an output validator component, and we use Pydantic for that. This is all based on a tutorial that's up on the Haystack website right now, and basically what we did for the GitHub demo was modify this tutorial just a bit. In the tutorial, we said that we provide a Pydantic model, and we said we need the output to be cities data, where in cities data you've got cities, and each city has a name, a country, and a population. And then we used a GPT model, and we saw that initially, for the first round, we did get structured output, but it's not valid JSON, or it doesn't abide by what we need that object to look like. So the idea is, what if we provide back to the LLM, the output that it just gave us, with an error message from Pydantic as to why it's wrong, why it doesn't abide by the Pydantic model we just provided. So the resulting pipeline looks a bit like this for our GitHub issues demo. We want to provide meeting notes, and we want to provide a schema. We give that to the prompt builder, the prompt builder exists in Haystack World. Then that whole prompt is given to a generator that generates... There's a one pass, like a first attempt at generating some structured output, which is then validated by our output validator, which doesn't exist in Haystack World, so this is a custom component. And either you're all good, done. But if it's not good, then we go back to the prompt builder with the invalid reply that was produced, plus the error messages. So for our use case, where you were trying to build this for Haystack, this is not accurate, by the way, our labels are not exactly that, but just for demonstration purposes, we went ahead and built a Pydantic model called issues, and we had to be very specific as to what our labels were, because you can't make a query with a new label that doesn't belong to that repository. And then we used our output validator. And then this is where things start to look a bit complicated, but the ginger templating is very useful here. Earlier, for the Hacker News articles, you saw a for loop. Instead here, we have an if statement. So if we have an error message and invalid replies coming in from any component in our pipeline, then this little section here, you already created the following output, yada yada, is appended to the full prompt. Again, at the end of the day, we ended up with a pipeline that looks a bit like this. So do I have time? I do have time, right? Four minutes. Okay, so the last thing I wanted to show is how these pipeline connections are actually defined in Haystack. Oh, great. Okay, I have plenty of time. All right, so ignore the corgis running around. Can everyone see this, or should I make this bigger? Okay. All right, so I told you that before the Hacker News Fetcher component was very much pseudocode, this is kind of boring. We're basically making requests to the API and getting the articles. Here, we're going to be using mixed trial through hugging face TGI. Hugging face TGI is free, but it is rate limited, and you need to provide an API key to use it. So you can go ahead and use this collab, but you do have to provide an API key. And then you see the prompt template you saw, and here's what's going on in the Haystack pipeline itself. We've got our prompt builder. We've got our LLM, so mixed trial via hugging face TGI. We just created the Hacker News Fetcher. What we do is simply add those to the Haystack pipeline. And then this is where Haystack can be quite verbose, but it can also mean that you can be very, you can create very custom pipelines, and it can get a bit crazy. You can have pipelines that branch out, loop back in, et cetera. We're being very specific that Hacker News Fetcher's articles output is being provided to prompt builder articles, which is going in here. And then finally, the only thing missing to actually run this is Hacker News Fetcher is the only component here that is missing an input. All of the rest have been provided inputs through the pipeline itself. So I can then define what the input of Hacker News Fetcher is when I do pipeline.run or pipe.run. And then, optionally, you can also give more inputs that are not necessary for the others, but for example, here, I'm using mixed-rull, and I wanted to up the created max tokens at the end, so I can also provide that at runtime. And that's it. Thank you very much. And you can also access the GitHub issues pipelines here, but I'm happy to take questions if there are any. Thank you very much. Thank you. Thank you. Hi. So in the Hacker News article summarizer, to the LLM, the URL, and asking it to both summarize the article and also just print back the URL. That appears to me a bit risky because it might change the URL. Do you consider it best practice to pass the URL through some other way, or do you find it fine to always ask the LLM to do that? I love this question because try that hugging face space, especially with mixed trial a few times and sometimes you just won't get the URL. Yes, and there are a few ways to make this a lot better because actually the Hack and Use Fetcher component itself earlier is just an API called Hack and Use and you have the URL there. So probably the best practice here would be to have the LLM only produce summaries and the other component provide an output of the URLs that was used to produce those summaries because yes, my experiences a lot of the OpenAI GPT models do a great job of following that instruction, reference this specific URL, but this is very much LLM based and how that large language model expects to be prompted. Not every instruction works the same way with every model. Any other questions? Oh, this one. Thanks for the presentation. I have a question on the prompt. I saw for an if, is it specific to Haystack or? Not at all. We use Ginger for the templating language. Actually, I will add a link to the Ginger documentation in the speaker notes of this slide deck that you'll find on FOSM2, but that's all Ginger syntax, which comes very handy because you get for loops, if loops, and you can actually start defining your own custom functions for that Ginger templating as well. All right. Give her a round of applause, please.
Using code generated by AI: issues, misconceptions and solutions
We also have the Matrix Room Up and Running, which is great. And I have the pleasure of introducing you to Andrew. You live near Oxford, which I believe is pretty cool. I'm a Texan, I wouldn't know. You love the Oxford music scene? You once played croquet for Cambridge University against Oxford, and he has a great joke for you. What is brown and sticky? A stick. Take it away, Andrew. My kids like that. Happy Fosfemme, everyone. So it's great to see everyone here. I am a lawyer and I've been advising on AI for quite a long time. You can hear me. That's better. Clearly, the emergence of large language models has meant that there's a lot more analysis going on in the legal context behind that. So I just thought I'd spent half an hour or so just going through some of my thought processes when I'm analysing mainly copyright law as it relates to AI machine learning and large language models. You may have heard a number of myths. So the first one is what models are essentially data. Data are facts and facts can't be covered by copyright. So statement number one. Statement number two, you may be aware that in various jurisdictions throughout the world there are different exemptions that exist in copyright law to enable you to gather and use data to ingest data for machine learning and data mining purposes. So you may hear that gathering training data within one of these copyright exemptions solves any copyright problems that you may have. So you really don't have to worry about any copyright issues. The third thing that you may hear is that output that is generated by generative AI cannot be subject to copyright. And the fourth one is that as a result AI generated code cannot trigger copy left obligations. So let's look at some of these statements in slightly greater depth. Now I'm going to do this in a slightly strange way because I've only got half an hour or so. I'm not able to go through all of my thought processes in great depth. What I propose to do is to give you the conclusions of my thinking first and then I'll give you a flavor of some of the thought processes that led to them. So you will by necessity only be getting part of the picture. I do apologize for that. So first of all, my belief is that a large language model is capable of containing and I put containing in inverted commas there because what that means is subject to a certain amount of interpretation. And I believe that it can contain copyright works or derivative works of copyright works. Secondly, I believe that if the training data was ingested under a copyright exemption, that doesn't automatically mean that that exempts any output from copyright protection as well. Third, the generated output is likely to be subject to copyright. This is a statement that will vary significantly from jurisdiction to jurisdiction. I'm licensed to practice in England and Wales and it's certainly true there. It may not be quite as true in other jurisdictions, but certainly from my jurisdiction's perspective that generated output is likely to be subject to copyright. And also under the English and Welsh jurisdiction, the output that's generated by a prompt view of entered belongs to you or possibly to your employer. But even having said that, then it means that AI generated output can be infringing and similarly it can also trigger copy left effects. So what I'm going to talk about now is a few things that you need to bear in mind about copyright and about the process of generating models and generating outputs from generative AI. So again, I'm going through these fairly quickly. I'm not necessarily going to link them in detail together. But when analyzing this, the three things you need to bear in mind from a copyright perspective are the copyright can potentially impinge at three points. So first of all, the point when the training data is ingested and the model is created in the first place. Secondly, and this is the one that people tend to forget about when the whole model is transferred from distributed from one place to another. And this is particularly relevant when that model is distributed over jurisdictional boundaries. And the third one is that we need to consider from a copyright perspective the point at which those results were output. So there's also potential of copyright impinging at that point as well. Now, there are a few things that you need to understand about copyright. And all the developers I've met have a pretty good grasp of copyright in theory, but there are always some areas around the edges that they're potentially a little unsure about. Many lawyers are unsure about these as well. And a lot of the arguments that we employ when we're talking about copyright analysis of large language models, they do tend to involve these edge cases. So there's a few things that you need to bear in mind when you are considering the application of copyright to AI, and these are just characteristics of copyright in general. So first of all, it's possible that more than one copyright can exist in a work simultaneously. So for example, if I write an opera that's based on the Lord of the Rings, then my creative input has gone into writing that opera, and I just like to tell you now it wouldn't be a very good one. And that will therefore be both a copyright work that I've created, but it's also a derivative work of Lord of the Rings. So if you want to perform that opera, you're going to need both a license from Middle Earth Enterprises, which is the organization that holds the rights or the relevant performing rights in the Lord of the Rings, but you're also going to need a license from me as well. So there's at least two copyrights being held in this opera simultaneously, and you're going to need licenses from both of those copyright holders in order to perform the work. And the classic example here is the Linux kernel, which will have many thousands, possibly tens of thousands of different copyright holders simultaneously, which is the main reason that we will never see it re-licensed under any license other than GPL version 2 only. So if you wanted to re-license it, then you would basically need to have permission from all of those copyright holders, or you would need to extract their copyright works from the kernel. That's never going to happen. So just stepping back a little bit, software is covered by copyright in just about the same way as literary works are. This is sort of the legal fiction that was established back in the day when the legal systems were thinking about how software should be protected under copyright. Indeed, if it should be protected at all. And one very key characteristic, one key piece of the philosophy behind copyright is this distinction between an idea and an expression. And again, this distinction is very clearly laid out in US copyright law. It's also laid out in European copyright law, but not quite so clearly. But the copyright directive, the software directive does make explicit reference to this distinction between the ideas and the facts and the expression of those ideas. So basically the facts themselves cannot be subject to copyright, but the expression of those facts can. And so one thing to bear in mind. Another thing to bear in mind is that copyright infringement is not subject to intent. It doesn't matter whether you intended to infringe copyright or not. If you do an act which causes copyright to be breached, if you copy something, believing that it was yours to copy or believing that you had a license to copy it. Then from a civil law perspective, we're not talking about the criminal law here, but from civil law perspective, that still counts as copyright infringement. And the third thing to realize is that if someone produces a work independently, which is the same as an existing copyright work, then each copyright owner will retain their own copyright in that work. So there's no infringement. So if I write a little melody and then somebody else independently comes up with an absolutely identical melody, they haven't copied me. They've had no opportunity to listen to that melody. Then they would equally hold the copyright in their melody as I hold in my melody. So those are three things that tend to be a little bit counterintuitive about copyright and their concepts that I draw on when I'm talking about the rest of my analysis here. OK, so let's look at one issue. So we take the premise that a model is essentially a set of statistical facts about the information or the training material. Does it mean that it is capable of containing derivative works? OK, let's look at changing the subject completely. Right, let's look at a WAV file. So you could also argue that a WAV file is just a set of facts about how far a speaker cone is from a fixed point at a particular period of time. And we know, obviously, that a WAV file can be infringing. A WAV file can take a piece of copyrighted music and the number of lawsuits that exist that cover copyrighted music encapsulated in electronic file formats is obviously huge. There's absolutely no doubt at all that music encapsulated in a WAV file, which you can argue is just a set, you know, selection of facts can be copyrighted. And just because you so just turning over to the concept of a model, people would say that you can't reverse engineer a model easily to find out that what is within it. I mean, it is just a set of information about statistical relationships. But just because you can't reverse engineer the model to determine its contents doesn't mean that it doesn't contain potentially derivative works. And I'll go into that in a little bit more detail later. But if we go back to the audio example, I mean, if you look at a more complex audio file format like Agvorbis, for example, you know, if you're just given that file, you're going to find it impossible, I would say, to reverse engineer it and get the music back out again. Unless you actually know how it was encoded in the first place, a WAV file is sufficiently simple that you probably could reverse engineer it and figure out the music was in there. But an Agvorbis file, you're just not going to be able to do that unless you know the encoding scheme in the first place. But nobody's going to argue that a Vorbis encoded file cannot be infringing. And there is, of course, a way that you can get AI models to reveal whether they contain any derivative works, and that is if it's part of a generative AI, you can just simply ask them. And I will give a few examples of that shortly. So some of you may be familiar with this poem, which was written by Lewis Carroll. And I'll just read the first line. It was Brillock and the Slythe Toves did Gire and Gimbal in the WAV. Now, for those of you who aren't native English speakers, the words that I've placed in yellow up here, they're just nonsense words. They were just made up specifically for this poem. Now, it's important to realize that Jabberwocky is no longer in copyright, which is partially why it's easy for me to talk about it here, because I don't have to worry about that. But because these are nonsense words and they only exist in this particular poem, or in derivatives of the poem that have been ingested later, then it turns out to be a great test of whether an AI has had access to this particular work as part of its ingestion process, if you can get it to disgorge these words later in some way. So I did a few experiments with chat GPT and I asked it to write a poem entitled Jabberwocky and it wrote Jabberwocky. So the result was verbatim. Didn't even try to change anything at all. So we know that chat GPT has ingested Jabberwocky. The chances of this being developed independently, produced independently, are infinitesimal. Did it know that it was out of copyright? I'm not suggesting that there's any copyright infringement going on here, clearly because Jabberwocky is out of copyright anyway. So it may be that in choosing the training materials that great care was taken to make sure that there were no copyright materials or materials which didn't have an appropriate license being ingested. So we did quite a few other tests on this basis. We found a number of works that actually are in copyright, contained within various large language models. I have to stress that this doesn't mean again that OpenAI is necessarily infringing. They might have attained a license to these particular works. But it does demonstrate that it's possible for LLM to contain copyright works. The argument that the LLM just contains facts doesn't really hold a great deal of water when you analyze it along these lines. And indeed, there are plenty of studies now subsequently showing that copyright works do exist in various LLMs. There's one from the copilot itself, which is some research showing that from time to time copilot will disgorge some verbatim copyright works. So the conclusion here is that we really can't believe AI can be used to essentially launder copyright works. You can't take a copyright work, feed it into an LLM and then claim that because the same copyright work has come out the other side, that it's no longer subject to copyright of some sort. So is it possible to have AI extract the ideas and leave the expression? And we can remember that ideas themselves don't attract copyright, but the expression does. Now there's a great video here. I put the QR code up there for it, which shows the difference or the similarity of two songs. One called My Sweet Lord by George Harrison, which you may be familiar with. And the other one called He's So Fine by the chiffons that you may not be quite so familiar with. And in a nutshell, in a case some time ago, George Harrison was sued by the chiffons for releasing My Sweet Lord, which does to my untrained musical ear sound very, very similar to He's So Fine. And the crux of the case was that although it was never established that Harrison had consciously copied He's So Fine, and if you recall, we said earlier that intent has nothing to play here. So whether he'd meant to or not was by the by. The fact is that if he had copied, then infringement would have occurred. The judge said that Harrison had the opportunity to hear He's So Fine because it was a quite popular song at the time. It would have been played in shops on the radio and so on and so forth. So there's a high probability that he would have heard it and somehow subconsciously it would have entered his mind and it would have become part of his thought process when he wrote My Sweet Lord. And there's a reference to the case there on the slide. So it seems to me quite logical that the courts are going to follow a similar reasoning with AI in that if a generative AI produces some material which appears to be copyright infringing and it can be demonstrated that that material was part of the training data, then the courts are likely to come to the conclusion that infringement is happening, notwithstanding that we can't really work out how exactly it's encoded if that's the correct word to use within the AI model in the first place. So let's look at a different case now which is sort of straining this idea, expression, distinction. So this is a photograph that was pretty popular in London a few years ago. It was in a lot of tourist places where you were buying souvenirs and so on. And it's a pretty striking picture of a red London bus crossing Westminster Bridge. So the picture that I've just shown you, which was on a drinks coaster, reproduced here on the left. And on the right, another company called New English Tees Limited decided that it would be nice to have a similar sort of image, but they didn't want to pay a license fee to the holders of the first image. So they asked a photographer to go and take a picture from roughly the same location of a London bus and then they got somebody to retouch it so the London bus was red and everything else was in monochrome. And you would imagine that that is about as clear an example of an idea versus an expression as possible. The expressions of these two photographs, but the idea is basically a red double-decker London bus crossing Westminster Bridge, which is all in monochrome. And it's not a particularly good case because this was only a very decision at the lower court, so it's not particularly binding. But nonetheless, it was determined that there was potential infringement going on here. So I went onto Dali and I used this as a prompt, which to me seems to be a fairly reasonable explanation or the distillation into an idea. And you'll see that Dali using that prompt has produced something that under that particular legal doctrine, this is almost certainly infringing. But hopefully not in Belgium. So that's an example of where you can have court cases that do not help this analysis a great deal. So what can help us in circumstances when we are using generative AI and we're trying to avoid infringement situations? I mean, there are different ways to do this. First of all, you can filter what is ingested. So if you limit your training data to things that are not in copyright anymore or things that you have a specific license for that allows, this is sufficiently broad that allows it to be used to generate materials using an AI or their subject to an exemption, again, that's broad enough to enable you to do that, then that may assist you. But the trouble is that then that is going to fairly dramatically reduce the pool of things that you're able to do to generate the model in the first place, which of course means that your model is not going to be as good as it potentially could be. But that's something to bear in mind. The second thing is to review the algorithm. If it turns out that your algorithm produces a model that's only two megabytes in size, then there's not going to be much space to fit a whole bunch of copyright works inside their derivatives or otherwise. So that's going to be a pretty strong argument that what comes out of it is unlikely to be infringing because it was going to be pretty difficult to actually fit anything inside there that could potentially be a derivative work in the first place. And the third option is clearly to filter what's output. You look at the output of the AI and at that point you determine whether that is potentially derivative work and sort of block its onwards transmission. There's potentially some infringement happening at the period where it's reduced, but if you're not distributing any further, that's kind of the limit the issues there. And there are a number of technologies available that could potentially help with doing that. So YouTube's content ID, for example, Getty Images has got some software for plagiarism to section and so on. So without going into detail, it's very easy for me to say that, but without going into detail, it's possible those sort of techniques can be used to determine whether the output is potentially infringing. And indeed, this is already being used in certain cases, including co-pilot at the moment. So if you've got co-pilot duplication detection that intends to do that. Those of you who are involved in open source compliance are probably familiar with snippet matching services like BlackDark or FOS ID or Scannos, which use a database of existing code and they have quite sophisticated algorithms to make sure that you can't defeat these algorithms simply by obfuscating the code. But how effective those algorithms are is obviously is variable. But I think it's likely that specialist products are going to be developed. And I happen to get in touch with the founder of one of the developers of one of these scanning software companies a couple of weeks ago and mentioned to him and said, you know, to your knowledge of the developments of foot to help with this situation of filtering the output of generative AI to see that it was potentially infringing. And he basically said that he couldn't tell me anymore unless I signed an NDA. So you can take that as you will. So few sort of final thoughts that I have here. I don't believe that permissively licensed knowledge base or permissively licensed corpus of materials used to ingest and create the model is the answer. There are a number of models available that say that we're only using permissively licensed code. That doesn't mean that, you know, there are no compliance obligations. I mean, you know, pretty much all of the licenses in question will have attribution requirements. So how do you follow those through onto the output? So you're taking a sort of risk based approach way of saying, well, you think that piece somebody who's licensed the code under Apache is less likely to get unhappy than somebody who's licensed it under GPL. But it's not a legal analysis. It's really a risk based analysis there. So you still got to be careful about that. Not a magic bullet. The other thing to be aware of is that different jurisdictions have very different rules about where the machine generated code is subject to copyright. So we touched on this earlier. There's a specific clause in the UK Copyright Act that says that machine generated works are subject to copyright. And that copyright will be owned by the person who made the arrangements. Now, it's a bit difficult to determine what made the arrangements means. Is it the person who created the model or is it the person who created the software that uses the model to produce the output? Or is it the person who put the prompt in? My gut feel is that it probably means that it's the person who put the prompt in to generate the output, but it's not been determined judicially. And there is one case which has got nothing to do with AI, but it's got to do with image generation that suggests that the person who wrote the software, in this case some game software, is the person who made the arrangements, not the person who was playing the game. So that's a little bit problematic, but I say it's only one case and it didn't go to the appeal clause. One other thing to bear in mind is that quite often, if you're looking at two pieces of copyright work and they're quite long and extensive and they potentially have mistakes in them and they are identical, including the mistakes, then you're going to make the assumption that the only way that Work B came into existence with all of those mistakes, etc. is that Work A was copied. Now, up until now, that's been a pretty reasonable assumption to make, but of course it's entirely possible that using Generative AI, two people could put the same very similar prompt in, or identical prompt in, and that prompt would generate identical output works, and therefore we can't automatically assume that a long and complex work that has mistakes in it is essentially only going to be owned by one person from a copyright perspective. So that's just a sort of cautionary word. So one thing that really does worry me here is the potential that AI can be used to automate a clean room rewrite, so we can use AI. Again, I've done some analysis on this, but I won't share the details now because I don't have time. But if you take a piece of code and you ask the AI to analyze the code and produce it to a functional description, and then you take that functional description and insert it into another AI and say, please write code to this functional description, then does that mean that because you've been through an automated process that has basically stripped out the expression, taking it to the functional description, which is purely an idea which we know does not attract copyright, and then reproduced a piece of software from that that somehow we've developed an automated way of copyright washing? An awful lot, I think several billion, if not trillion, dollars says that is not going to be allowed to happen, so we just need to be aware of that of a possibility. So that's a sort of whistle-stop tour through my various thoughts on the topic. Thank you very much for taking care, taking the time to listen to me. Do we have time for a couple of questions potentially? Two questions right over there. Fantastic. So one of the questions we have, I'm going to have to move up. So, I mean you might as well have a look at the whole thing. One question we had was if the model outputs a copy of an image who is infringing the AI machine or the human who asked the prompt. Good question. People infringe. So potentially it can be a legal person who can infringe, so it could be a company as well. But ultimately it's going to be who was doing the act of copying. So it's almost certainly going to be a human, but if the human is employed by an organisation, then it could be the organisation that would be infringing as well. While we've still got time, another question we had is there seems to have been some confusion about whether this was under US copyright law or England and Wales. Which ones are you talking about here? Any of you either? So what I'm saying was under the copyright of England and Wales, which is the same as the rest of the copyright law in the UK. But references that I made from time to time were about the idea of expression dichotomy for example is something that's much clearer under US law than it is under English law, but it still subsists. Thank you.
Reducing the risks of open source AI models and optimizing upsides
45 minutes session today. We want to go over some of the risks of open source models and then cover like what AI governments can do about this. So the first part will be a presentation so that we have the same common background information on what we're talking about, what kind of models, what does the technology look like. This isn't very in depth on the technical part but people who do have technical questions feel free to ask them and will leave a lot of time for audience participation questions and answers. So about 10 minute presentation, 15 minute initial panel discussion and then interaction with the public so that you can give us your thoughts and inputs on AI governments open source models. So let's start just with AI safety. How many in this room already have heard this term or have read about AI safety? Can you raise your hands? I'm seeing about half of you, thank you. So an interdisciplinary field concerned with preventing accidents, misuse or other harmful consequences that could result from artificial intelligence AI systems. This is not just technical talk or social thing, it has to both require the expertise to understand what machine learning systems are doing, how they work, what are their problems. So that's what I'll be going through in the first 10 minutes, 5 minutes of the talk and then figure out how society can adapt and how the economy can adapt to this new technology. Okay, part one, I'll just go briefly over deep learning, what's different, what's different with open source with relating to deep learning models as opposed to classic open source software and then part two, quick introduction, my personal thoughts on AI governance which do not represent the rest of the panel and then the panel discussion. Why is there something different with deep learning models? This is not the same as usual software where you can see the code and you can reason about what the model is doing. In deep learning models, there is huge pile of weights randomly initialized, updated into its succincts and there is a field called interpretability which tries to figure out how do these models achieve what they do and this is not always something that we actually can do for the largest models. So if I'm talking about GPT-4, chat GPT that you've interacted with, it's not sure how it's able to get the information that it has and we have some amount of uncertainty about what tasks it can do. When it was trained, we did not predict what strategies GPT-4 could use to reason and we continuously discovered techniques like chain of thought prompting and iteratively like tree of thoughts and this brings the particular difficulty where you can't scope exactly what actions, what kind of text a GPT-4 like model can do. There are a bunch of vulnerabilities and I'm going to focus on text and image generative AI. Accidents is when the developer did not intend the particular use and yet something happened. So specification gaming specifically that you optimize the AI to succeed at a certain objective like in video games and then it uses an exploit to gain maximum score instead of actually doing what you wanted to do and there are cases so for generative AI where we wanted to train an LLM to do text that was agreeable or that users found quite pleasant and once this was trained with the reinforcement learning system, it started talking about birthday cakes all the time because the particular machine learning system that was rating it kept rewarding this. So even as developers, you might train an ML system and then it will have unwanted behavior. You could also give it perfectly correct data sets and yet it could misgeneralize if it's learning labels that are easier to learn than more complicated data in your images. And then hallucinations, we use LLMs to give us some information and yet some of the time this information is incorrect. I'll not go too long on the adversarial part because this is more, this is less about how we train systems but how you make sure that while they're in deployment, you are still safe to use them. But there's specifically like prompt injections where particularly crafted input can cause your LLM systems to change behavior. Sometimes leak information about what prompts you are using before. And I mentioned Trajan's because this is a unsolved problem if in your data set that you downloaded online, if someone created a small percentage of this data to have a particular relationship, they can make that upon a particular trigger, the machine learning system will behave differently. So particular token inputs which suddenly cause the LLM to be much more willing to reveal prompt information or to follow different instructions than what you thought you find too needed to do. And this discussion like Fosdum is specifically about open source and open weight models are not the same thing as open source software. It's not because that you have the weight that you can actually know what the model is doing. Backdolls currently cannot be identified systematically. And you can't manually update the models. I mean, we don't have the, I guess humans, we can't update the model weights directly to rewire its behavior, though we can fine tune it which is like do new training runs. So to have actually open source deep learning systems, you need to have control of the full stack, both the data set and the training code, and you actually also need computing power to run this. If you don't have these things, you are not actually in control of the model that you have. Even if you do train this in our system yourself, you have the problems that I evoked before earlier. And so some really high level approaches. If you're going to use these models, don't use them in critical operations. And otherwise, well, be clear on what your model can and can't do by examining the model card or creating one when you do yourself. Misuse risk, you cannot, you can't choose how users will retrain your model after it's deployed. So Lama had particular safeguards so that it doesn't produce illegal content or illegal instructions. And yet fine tuning can remove those safeguards. So at the fundamental level, the only way you can get a model to never be dangerous is for it to not be capable to be dangerous. And if you're going to release your model in the wild, please evaluate your model's capabilities before release. This corresponds notably to can your model help people in illegal acts and others? Okay. Part two, AI governance and open source. Let's just get some context for where we're going. Like the current capabilities appeared in the last few years and we've had an exponential increase in the amount of training compute put behind these deep learning models, which is some of the reason why we don't, we're not able to understand them in that much detail. And not just training compute, the algorithmic efficiency of these models has been rising. So for a given year, for a given data set, we can now train models that can more efficiently recognize the image. It takes less amount of compute to train the models for, to still succeed at getting 63% performance and image in it. And so this leads to a question of like, are we going to predict, like this is leading to more and more powerful AI. And I just want to open this question, like leading ML scientists say that mitigating the risk of extinction from AI should be a global priority alongside other societal scale risks such as pandemics and nuclear war. So what are they talking about? I just don't have the time in two minutes to go over the whole field of AI existential risk and like forecasting governance, AGI governance. So I've listed these three papers for people who will be more interested to go into those details, but we can talk about some of these in the Q and A if you're interested. There's also pushback, like some ML leaders have signed this letter. Some others have said that it's ridiculous. Yanlokin's tweets, I think, is particularly relevant because it pushes back on both sides systematically. LLMs are not superhuman in all ways right now, but they're not useless parrots. The hallucinations are not the end of all. And scaling is not sufficient to do AGI, yet we will still get more success with deep learning. And he pushes back notably that AI doesn't exist and never will. He does think that we will create AGI or artificial general intelligence, but he has sort of different estimates as to when that will happen. And I guess I also listed Grady Butch's tweet. There are clear and present real harms that we must address now to worry about some future fantasy existential risk is a dangerous opportunity cost. So because there's so much polarization, there are so widely differing views on AI, AGI and existential risk. This is one of the reasons we're organizing this panel discussion and the participation of the audience. Before we get into the panel, I just mentioned two key concepts that I found quite important to frame this is that we have a choice between which technologies to advance. There are technologies which allow us to better understand the deep learning models we have to better control the risk that they have. And we can think about even if this is a complex environment where we don't know exactly what results of what policies are good or not, there are some keys to success. And one of the reasons I'm doing this here again is that a wider understanding of AI safety among AGI developers, among users and policymakers, seems to be one of this overwhelmingly like net good, net positive action. All right. So I'll hand over the mic to Stefan Adele-Pret. Yeah. Yeah. Well, thank you, Jonathan. And first of all, big applause for Jonathan for having a short introduction. And also being so able to introduce yourself. So I will do for you because he's a great researcher, actually a founder of the European Network for AI Safety, where different research and collaborators can go together. And also a teacher in the machine learning for a good bootcamp that has been funded by Aarabens Plus. And I can say for the first time, it's a very great panel, very great course. And with us, we also have a Felicity Reddell, and she's been involved also in connection with the Dutch Ministry of Internal Interest and extensive knowledge in AI policy, especially for the International Center for Future Generation. And with us, we also have Alexandra Xalidis that is involved in the Future All Life Institute that I personally have a big pen, have been a researcher in Harvard also for the ethical part of AI and extensive knowledge of AI safety as well in terms of policy. And in this panel, we have one to tackle, especially this knowledge. And the first question to have for all of you is what are in your opinion the risks to have the development of this advanced AI unrestricted through policy? So what are the risks to don't have any policy in place right now to really reinforce what could be a more stable and a safety environment? I'll hand it to you first. Thank you. Yeah, so I would like to start with saying that obviously open sourcing has a lot of consequences that can be very positive, such as accelerating innovation or decentralizing access. Just at the same time, there can be notable risks from open sourcing AI, especially as we talk about increasingly capable models. And so now I'm going to go briefly into those a little bit. I think the risk there stems mainly from two kinds of things. One is that these capable systems can have quite the misuse potential. It can make it a lot easier for bad actors, for malicious actors to access model weights to modify them and then basically to exploit the work of open source developers that didn't intend any harm. And they can, I mean, you've heard about this, they can poison the information ecosystem, but also lead to a lot of personal harm, like in the form of scams, for instance, or adult content without consent and online bullying. But also it can make it easier to create harmful substances or obviously like the whole story of cyber threats that are being made easier. So that was the first point, like there's the potential for misuse. And the second point, Jonathan also mentioned already a little bit, it just makes it a lot harder once the weights are open sourced to control the whole situation. You cannot really monitor how these models are used. You cannot roll them back. You cannot shut them down. Yeah, so I think Felicity Rout lined almost everything. And obviously we can divide the harms in terms of misuse, but also models going beyond our control. Those are more speculative in nature, but there are still harms that I think we consider at this stage. What I would like to focus on right now though, as someone with a policy law background, is that with open source, it's very difficult to attribute responsibility. So if something does go wrong in the sense of any of these risks with misuse materialized, and they're much more likely to materialize the minute that something is online and available with model weights, with data, with the code, as a policymaker or as let's say the lawyer representing whatever party has been harmed in that situation, it would be very difficult to identify who has caused that harm and who's responsible. And for me, that's the scariest part in the sense that in a traditional release, you would have a company, an entity, which is responsible, which you can therefore keep accountable for downstream harms. I think I really hope we can discuss this at some point in the panel, but I also think that there's a lot of misconceptions with regards to what risks developers take on when they use models released by big tech companies. Not many of us have read the terms of conditions. I have, and there's nothing in there, or there's very little in there that will protect downstream developers for future harms, because when a company like Meta open sources their models, the first thing that they will do is close themselves off from responsibility. Now that can be a very unfair balance because a lot of the time they might keep information which is crucial for making those models safe. So you end up with a dynamic where the company holds the levers necessary to make a model safe, but then attributes all the responsibility to the developer that ends up putting that downstream application on the market, or just releasing some other form. Okay, thank you. And on the positive side, if we want to look what will look like to have governments of AI that is actually working good, are any of your vision can be on your timeline, what you expect to become? And I guess, yes, what are good governments of open source AI will look like? Yeah, we can start. Yeah, so I think on a high level, the good news is that we don't probably need a lot of special governance rules, because it's the same kind of system, so we can probably have somewhat of the same requirements. It just makes it a lot easier to comply if you don't have the ways available for the reasons that were mentioned before. You can avoid it, that guard rates are being stripped, and you can monitor much more. You can realize when something goes wrong, and you can adjust the system. And I would also like to stress on a meta level that I think it makes sense to focus a lot of the requirements on the models that are the most advanced. It's kind of in line or similar to what's happening at the AI Act, for example, with the risk-based approach, you want to focus your efforts on where most of the risk stems from, and for general purpose AI systems, the tiered approach is the analogy of that. But yeah, to come to some concrete examples, I think one good approach could be to have offered access, restricted access for research and testing. So that could be vetted researchers from academia, but also from civil society. And I think also the open source community, there could be ways that independent people from the open source community could qualify and contribute there as well. And like that, you could get a lot of the benefit of open sourcing, namely that a lot of more eyes can look at the system and see where something might be wrong without a lot of the risk of just anybody having access, and as well as bad actors making it really easy for them to misuse the system. I'm not sure how much time there is. Just a quick question. Who's read the final text of the AI Act? Okay, cool. So we have, and what they have essentially compromised on is any model with a 10 to the 25 flaw ratio would end up falling in the highly capable category or a model with systemic risk, and would therefore be subject to extra precautionary measures. I think that's a relatively decent threshold to have arrived at. It's not perfect, but I think we can agree that not all models are as dangerous as some of the most highly capable ones with systemic risk. So we're not really concerned about open sourcing models which have next to no potential to cause harm. And I think that's a misconception about the AI safety community versus the open source community. What we're really concerned with are the most capable models, like the models that are leading in the field, that are often produced by the largest companies. So again, for me, it comes back to any type of governance that we arrive at has to have some sort of mechanism for tracking who is accessing these models. Because not having that tracking mechanism is again in the interest of these companies. The less information they have about the downstream developer, the less responsible they are themselves. Because the minute that they know something is going on or that there's some sort of suspicion that harm could be caused in the future, they become liable. So I think a know your customer provision would be a minimum for any type of governance scheme. Thank you. I think I'd like to highlight again this specialization based on capability. The governance of open source models will depend a lot on what kind of model you're doing. A lot of models really do benefit from having open weights, having being able to clean the data sets. And I think this is specifically true for a lot of narrow AI. If you're doing classification models, then using open source classification models that were trained on data sets that a lot of people have looked at, pruned and sort of understand, even have done interpretability on to be able to know what's going on. I think this does significantly reduce risk compared to if the leading classification models are closed source and sort of don't know what's going on. So I would encourage like empowering open source to do narrow AI quite well. On the other hand, when we talk about the, yeah, there's this idea about front chair models who can do more, a wider array of tasks like coding and where people are trying to make them more and more autonomous by putting them in frameworks like auto GPT, which was is one of these open source frameworks. And this is why I think the open source community can sort of at least have some self governance. Like I believe that by, as long as the developers of these frameworks care enough about their safety, they will do much more secure products and not generally release products that by default harm the users who sort of just plug in the model, they let it autonomously write their code and actually it's sort of like use their credit cards to buy 10,000 euros of stuff. Yeah. Yeah, thank you for your point, you. And before I open up to your question, I have a last questions. Over last year, I was able to have the occasion to talk about these topics in a Python community, Linux community, also understanding how Mozilla is getting on board. So the open source community seems very interesting in these topics. So in your opinion, what the open source community can do and how it could be best interacting in a way that could be safe for all of us. And personally, from my experience, that lead for very interesting conversation about the impact, how we can develop in a safe way. So I'm very interested in your inputs as well. What do you want to kickstart the conversation? I see mainly three things where the open source community can contribute to AI safety. One would be participate in existing approaches or like stress testing, red teaming, finding vulnerabilities, all those kind of things. The second one would be to try to work out more existing approaches and or develop new approaches that handle the safety issues. So for instance, finding ways towards robust water marking. And the third one, which might be the most important one, is to raise awareness and adjust the own behavior and norms. So I mean, awareness of the risks that we've talked about, and again, especially for increasingly advanced systems, but also to the distinction that Jonathan mentioned before, that we talk about open source, but it's quite a different story if you talk about traditional software versus the most capable advanced AI systems. So for instance, for Linux, the open source community can find a bug and send them in and can be fixed. But if your AI system tries to talk you into suicide, what do you do? Please retrain the model. You can't do so much. And other things could also be about the openness. For instance, we talk about openness, open source as a very binary thing, kind of like bullying. You just say either it's fully open or it's fully closed, but it's more of a multi-dimensional vector of different things that you can make available or you can choose making less available. Yeah, let's go. Yeah, I don't have much to add beyond that, except that working in civil society, we're always open to hearing from the open source community. We are not the companies. We are representing people who are concerned about AI safety in general for everyone, for the world. So, yeah, so we're always open to collaboration in that sense. And there is a future where you can have, you can reap those benefits of open source that developers benefit from at the same time as ensuring there's some reasonable guardrails to prevent what we all don't want, which is, you know, absolute catastrophe. Thank you. I'd like to highlight the work particularly done by Illyuther AI. Illyuther AI, who originally were some of those who trained again, like some of the first open-weights, large language models, replications of GPT-2, and who have pivoted in the last two years to also contribute to interpretably research and fundamental scientific research to understand deep learning models. And so the open source community has this advantage of having highly motivated people who are interested in the technical aspects and who will, like, go quite into detail even just for the passion of it. And we're seeing very good papers being published by teams like Illyuther AI's. And that's not the only org that works with open source large language models. So in terms of the scientific understanding, a lot of, like, the current level of large language models are already a fascinating artifact to study. And I'd encourage open source community to keep contributing to the advancements of interpretability and control, controlling the whole class of engineering systems, or how do we monitor what the machine, like, what is this model actually doing? Can we understand? Can we make it go to particular branches rather than not? Even, like, a lot of the prompting techniques which are known today have been discovered by people tinkering with their systems. So furthering our understanding keeps being a good thing. And I'd encourage the open source community to keep on, like, doing research and interacting in this way with these models. Also, another recent development regarding the AI Act is that now they're looking at setting up the AI office. So if you're someone who enjoys tinkering with these types of models and is interested in safety in general, there's a lot you can contribute to the field through an institutional group like the AI office, which is looking for highly capable people that have a technical background. Thank you. I hope you took notes. I took mentally notes, but I'll see you again in the recording of this. Do you have any questions for the public right now? Otherwise, I think, okay, I'll come there. Okay. I can't hear. Yes. Hi. Yeah. So we've spoken a lot about the harms that could come from bad actors. I wanted to ask about potentially harms that happen with big tech or large powerful organizations having access to behind closed doors, having access to this technology, and whether you believe that there's a need for access to their source code or from some kind of regulator or something like that? Coming. I think I want to... Yeah, I think that's right. There's definitely risks from that centralization of power. And I think it's very important how exactly we tackle that. So for instance, if you just require them to make everything open source, I don't think... I mean, then you don't have... You have that risk a bit reduced that they do things behind closed doors that are not visible. But I think if you do something like vetted researcher access, then you can approach it in a much safer way and kind of get the better deal in terms of balancing the risks and the benefits. This is a very crude analogy, but it's sort of the same thing with, let's say, if you had a bio weapon, right? Let's say we were developing this poisonous gas that has a capability to poison a lot of people all at once. It's pretty bad if there's a couple companies developing it and we have no idea what's happening behind closed doors, because again, they're companies, they're not state actors, there's no checks and balances in that sense. And until very recently, there was no regulation to have any access to what they're doing. On the other hand, we also don't want everyone to have their personal poisonous gas in the sense that suddenly the risk has been magnified by the fact that everyone has it at the same time. It's not like a nuclear nonproliferation system where multiple actors having the weapon would reduce the risk overall. So I think in general, as Felicity said, transparency is important, but also checks and balances are important. And there are mechanisms for democratizing the technology in a way that allows us to keep an eye on it and it doesn't just proliferate. Can I add one quick thing to that? So this narrative of democratizing AI, right? I think it's one that Big Tech tries to push and talks a lot about that. And I think it's kind of interesting, because democratizing means kind of like moving towards democracy, right? And that means kind of making decisions together of like, how do you want to govern a specific technology, for instance? But within AI, how it's usually used, it means availability, like access to the weights. But that is not really making a shared decision of how we want to use it. It's more like, like Alex said, like giving it to everybody. It's more like, yeah, not governing it at all. Anybody can do whatever they want to do with it. Maybe anarchization would be a better change. I'm not sure, but it's just like interesting how this is called, but it's not actually that, but it gives it a very positive spin to it that can be quite misleading, I think. So there's a question from the online audience, which I'm going to read out loud. So Frank, is there a reason for confidence that a smaller than 10 to the power 25 flop AI model cannot be trained or tuned for just as much harm as larger general ones? So, yeah. No, there is no confidence, but we fought tooth and nail to get it at that level and not a higher one. So this is basically the lowest threshold that was able to be achieved politically. So no, it wasn't a threshold that was determined through scientific, you know, studies. But and we have plenty of evidence that a model at 10 to the 24 would be harmful and potentially dangerous as well. But but yeah, the answer is just this is the best we could get politically. And thankfully within the AI act, there are other mechanisms for determining, for classifying a model as having systemic risk besides this threshold. So this is sort of a presumption. If you fall above it, then you're presumed to have systemic risk. But then the AI office, for example, will have the discretion to also monitor and determine whether a model has systemic risk through a much more detailed criteria that gives them that flexibility. I'd also like to highlight how this value will change in the future, because I gave I showed you the algorithmic efficiency graph where you need less flops to achieve a particular level of capability as our science of like how to do the training runs evolves. And so this is a maybe a stopgap measure where we think we're pretty sure we don't understand how these models work above 10 power 25. And we want to do more science. But also we want people to spend the time to actually analyze. Like if you're going to do something that's so hard to analyze, please put more effort, comparatively more effort to analyze it, because it does have in general more unknown capabilities. And so having there are different terms for having evaluations that depend on how competent the model is. And in the future, we can imagine more targeted laws that allow you to see, well, what kind of data are you training on and what kind of capabilities you have, and actually having a risk factor which depends more on the training data. But it isn't like these things have moved so fast that governance, governments, sorry, were not able or generally are not able to follow at that kind of pace of progress. And so I do understand it more of a stopgap than as the best way to govern these models in the future. And so as institutions are being constructed in different governance, there are institutions called the AI Safety Institute in the UK, in the US, and you mentioned one for the EU, the AI office. Now, I think joining, like people working at these organizations will be able in the future to more closely monitor what it is that makes a model dangerous or doesn't, capabilities like autonomous replication, and then we'll have maybe more sensible regulation in that approach. I'll quickly answer a second question from the text, and then we'll send it back to the room again. It's just a detail where someone asks, where does the European Network for AI Safety get its funding? And the answer is that we have received a grant from an organism called Lightspeed Grounds. Yeah, so does someone in the audience again have a question or comment about these? Is someone up high in the room? Sorry, can you raise your hand again? But we'll be happy to stay also afterwards outside to catch more questions. A lot of the safety problems are related to the data and where the data comes from, and obviously Big Tech will never expose their weight because that's their IP. I'm working on a data provenance solution, trying to build one, and I'm getting actually really insightful thoughts from this discussion. I believe that a couple of things in tech are out there like ZKML that can contribute to this whole thing, where Big Tech should not disclose their weights, but through ZKML, they can expose that the weights they used were safe, and the proof on chain would improve that that model is actually safe. Is that a thing you believe in or is that rubbish? Thank you. Sadly, the acoustics are not so good and so I didn't totally get it. Can you shout again just the core of your question? Yeah, let's talk afterwards. Like you mentioned the importance of the data and the capabilities in data. I think that's quite true, but like for specific technical discussion, let's talk after. So thank you again for the Jonathan Felicity and Alessandra, and we're here to talk and continue the conversation. Thank you. Awesome, y'all. We're going to be speaking or we're going to start our next talk here in about five minutes. It'll be right here in a second.
A Principled Component Analysis of Open Source Artificial Intelligence
As Julia is getting herself up and running, I'd like to introduce her to you. She is from Seattle, so she's another American here. She is a social technical, I can't say that, systems nerd and a huge fan of Lego, so that's important to know. You used to have a Scottish accent despite being an American and never being to Scotland. Now that is fascinating, that truly is. And her favorite joke is just her own humor. Julia, take it away. I think that one landed is exactly how I intended to, so thank you. Hi everyone, it's great to be here today to talk about what I am calling a principled component analysis of open source AI. So who am I? I'm Julia. I've been focused primarily on open source resilience and the software supply chain for the past five, ten years. I feel like a little bit of an open source hipster because I've been talking about some of these things before they were cool. But I've been working in and around open source AI since undergrad, so I'm not going to leave that to your imagination. Some things will probably give it away though. So in case you were wondering, that isn't a typo. It is a pun. It is a good slash bad pun, and I hope you are ready for more. I couldn't make a pun out of support vector machines though, so if you have one, please come talk to me later. That would be good. It would be much appreciated. So open source AI is not new. I've been doing open source AI for a while now, and I threw a bunch of stuff up on this slide, mostly copying and pasting from former poster presentations, but I have a chapter in this lovely book. I believe there is only one left in stock. They haven't had much of a call to reprint it. So you too could own constrained clustering, not about neural networks. And in that chapter, we had a very interesting approach to exploring information with user feedback using a variant of the K-means algorithm. So that was the basis of our chapter. We used this fantastic open source machine learning library called Weka. Is anyone familiar with Weka here? My people, hello. Excellent. Weka is wonderful. It has so many great machine learning algorithms in it. When I first started using Weka, I went to this website called SourceForge and downloaded it. And I was just entranced. It was my first experience with open source. I was entranced by this idea that I could go and see the code. I could go and modify it. And in fact, some researchers at the University of Texas had done this and modified it and redistributed it. Just this magical thing that I could then go and use in my own research, knowing that somebody who is much better at math than I am had validated all the algorithms. I also built this lovely autonomous robot, aerial autonomous robot, that won the dubious award of innovative hardware design from AAAI, which I think means we don't know how you got this to lift off the ground. But lift it did, mostly because we made it a giant dodecahedron to lift all of the camera components. And a few years ago, I used machine learning to tackle one of the world's hardest problems. Determining whether or not you should hug something. So I trained a little model when Fed and Image would tell you if it was a good idea to hug it or not. I am sad to say I am not huggable, apparently. I am not as mathematically proven. So, you know, maybe I should take my picture again, try it out. So that's me. So the bad news here is that I don't have any answers for any of you in this presentation. I only have open questions. We are not going into the deep specifics of models, algorithms, or approaches. This is going to be probably, for some of you, a little bit too high level. And for some, probably a little too low level. We are exploring this new area of technology that has ballooned kind of seemingly overnight. It feels like that to me. I don't know if that feels like that to you. And we are facing some really interesting challenges when talking about the advances that we have seen in artificial intelligence and how it intersects with open source if you are not using Weka. If you are using Weka, you are set. It's great. They are not paying me for this. I promise. So level set, like AI draws from a lot of different fields. If you go back to your AI 101 course, you are going to probably get a little bit of a survey overview of all of these different fields, from ethics to philosophy. Philosophy plays a big part in AI. Economics, my favorite part of AI is the formal logic side. But that's because statistics was never really my strong suit, which is why I love computers. So there are a lot of different considerations when it comes into building AI systems, AI technologies, and looking at new approaches for things both as practitioners and as researchers. I'm hearing a lot of echo. Is that like everyone? Okay. It will say everything twice. It's fine. So at a very high level, this is one of the slides I show people when they ask me why I don't use the phrase AI. It's because generally speaking, when people are talking about artificial intelligence, they are not talking about the entire field of artificial intelligence. They are talking about machine learning. So we can break it down into roughly two camps. And I call them camps because people are really settled in one or settled in the other, and they usually don't switch back and forth. That's been my experience anyway. So we've got symbolic artificial intelligence, the logic, logical AI. It's also referred to as logical AI. How do people think? How do we teach machines to think in ways that are similar to how people actually think? So this is where cognitive science really comes into play. And then the much bigger circle up there is what we're mostly concerned with these days is machine learning, the math. And this is what I tend to characterize as thinking is hard. We can probably build a model that comes close and then we'll do some math and we want to get stuff done. So we're just going to use the data that we've got and cross our fingers. You can argue with me about that opinion, all you like. I'm cool with that. So while I do have this, as I mentioned, the deep abiding love for symbolic AI, we're focusing primarily on machine learning here. Unless anybody wants to talk about slime. Not that slime. So some elements of AI, or machine learning, see? I'm also getting hit by the AI means machine learning bug. So some possible elements do include things like what data do we have that go into training the system? How do we actually train the system? How do we evaluate the system? All of the different elements, is there a model as an output? Is there a user interface as a way of interacting with machine learning? Now, not all of these are going to be present in every machine learning system. It kind of blows some people's minds to realize that you can have machine learning without a model. Or you can have machine learning without a task or prompt. But it's true. And we have to account for that when thinking about open source machine learning. And when we're looking at all of these different components, it gets a little bit hard to reason about. But if we reduce the dimensionality, PCA pun, we see roughly four buckets emerge. We've got the data, which is pretty familiar to us. We know what data is. We've got a good understanding of what might be training data, what's validation tasks, et cetera. We've got code, also a well-worn path for open source. And then we have what I call the other stuff. Because one of my skills is not naming things. And then finally, we've got output. So by doing a rough grouping with K equals four, we wind up with these four buckets. And I think that by thinking of them in this schematic, it makes it much easier to tackle the challenges that we face one by one. Now, some elements might appear in multiple buckets. Not on this slide, because simplicity. And some might not appear at all. But it's a starting point. So let's first talk about data. So when it comes to machine learning and data, we have some interesting problems. Some of them are known. Some of them are unknown. We have a lot of data out there. Machine learning research has been going on for what, since the 50s? Right? Ish? And that means that a lot of the data that has been used in this research doesn't have known provenance. So we don't actually know where some of the data came from. And if we're talking about things, I'm not going to talk about licenses, by the way. I forgot to mention that. I'm not talking about licenses. I'm just going to talk about the challenges in building open machine learning systems. But when there's data without known provenance, we don't know if it is truly game for us to use. If we have data that we don't know how it was collected, we don't know if it's truly game for us to use. Or if it suits our purpose. It could be an incredibly flawed data set, and we have no way of knowing. In terms of privacy, we've got a few big challenges with de-identification and anonymization. There's a lot of really interesting work that can be done in machine learning, but can't necessarily be done in an open way. And I don't actually know how to solve that particular beast, because I feel like I'm kind of splitten, too, on that. On the one hand, I do think that having full access to the training data, validation data, test, et cetera, is really important for building open systems. I don't think it can come at the expense of people's safety. And so if you are training something based on data that includes personally identifiable information, we kind of have to weigh that. There is that question of, should that be an open source system in the first place? I know that's a very controversial thing to say at FOSDEM. So I can leave now, if you like. For systems that also incorporate user feedback as part of the training data, because that's a fun place to get into, how do you build in, again, that de-identification and anonymization? And there are things like, okay, well, how are we actually going about splitting the corpus? If we are splitting the corpus into training and validation data, it's actually important to know what proportions we're using and how we're sampling in order to do that splitting. But again, some of these may not be applicable for all machine learning systems. So one of the things that I kept asking myself is, are these things required to recompile, for some definition of recompile, a model? So if I wanted to create the model from scratch and build it myself, what do I need? The entire dataset? If you want to show hands and tell me if you think it's required, I'm fascinated to know, but there's no obligation to. So is the entire dataset required to recompile a model? Would a description of the data suffice? How about a datasheet? Who thinks that you need to know about how the data was collected? Well, thank you. I appreciate you. I appreciate all of you. So I kept coming back to this question, is it required to recompile a model? And so you'll see that question as we go through some of these other sections. I do take the stance that you need the entire dataset and the methodology, but that introduces some big problems and big as in dollar signs, because these corpus are not small. Hosting them is expensive. So if we want to make open source machine learning open to all of the people who are interested in participating in it, how do we break down the cost of doing so? How do we make it available? The methodology, publishing the methodology for how the data was collected, helps with transparency if you trust it. But we're open source. We try to trust each other mostly. Except for a few of you. You know who you are. And there's some open questions about attribution. If the data also needs attribution. I'm not talking about from a legal sense or a licensed sense, just for transparency and for credit, because we appreciate giving credit where credit is due. And if somebody wants to opt out of having their data in a corpus, how do we handle that as well? Lots of unknown problems with data. But code gets a little bit easier. This is going to be the second time I make a Jurassic Park joke this week. But we know how to do open source software. This is code. We know how to do this. And despite what we've been hearing, most of machine learning fits solely in this camp. It fits solely in the camp of open source software. Job well done. Great. We've got Weka. What else do we need? So it does, they are governed by the same requirements as normal open source software. No special casing needed. Cool. One of the unique things, though, about this type of software is that it may actually produce one of those things we don't yet know how to deal with. The model. It also might involve an interface, how to interact with whatever system it produces, which may or may not be a model. And it does intersect with the data and some interesting problems that go along with that. So one of the things that we do when we process data is we clean it. So does that code need to be open source? Alongside with cleaning is some interesting value judgments that we have. We may say, okay, deprioritize this feature a little bit. Let's increase the priority on this one. And in that way, we are actually making moral and ethical judgments, and we're encoding them. Now, the great thing about code is that it's very easily inspected. For some definition of very and some definition of easily. That's an exercise for the reader. But if we dig into it, we can see where those value judgments are made. The other stuff, this is my favorite part. If somebody has a better name for it, let me know. So our hardware specifications required to recompile a model. How about disclosure of training time? How long it was trained for? A definition of correctness. So all of these do impact what comes out of the data. All of these do impact what comes out of your machine learning algorithm. I was doing like a one-day course, brushing up some knowledge, and there was a bit of a competition, ordinarily I hate competitions in classwork. But the idea was, okay, let's... Here are all the concepts. Do some fine-tuning, play around, and see who can have the highest accuracy for some random task. And I thought I did a pretty good job. The story of my life. I thought I did a pretty good job. But there was one person who achieved nearly a perfect score. So of course, the question is, would you do? And they said, oh, well, I just ran the training for, you know, two days. I'm like, oh, okay. Yeah, this was a one-day course. I didn't think it was a two-day investment. So the training time does absolutely affect the quality and the output of the resulting system, as does the hardware. So they are required to recompile a model. But similar to data, we also have the question of access. Access to equitable compute. Access to the hardware itself. And again, we have a problem of attribution. So finally, output. And since we are focusing on models and machine learning models, we've got matrices. I don't really know how to make that work in an open way. Yes, I can inspect a matrix. Can I make sense of it in isolation? Not so much. So if we do a litmus test, if this is all we have, can we do arbitrary machine learning tasks with just this? Probably not. How about just the code? Probably not. You still need some data. Just the hardware? Maybe. I'd like to see that. Just the model? I'm going to say no. That's my... I'm putting my foot down there. So what do we really need to make a transparent machine learning system? We need all of it. It all needs to be there. It all needs to be open and available. And that might mean that some things are not suitable for being open. So some other questions that I'd love for you to think about as you're thinking about open source and machine learning is what does contribution to a model look like? What does correctness of a contribution look like? How do we actually verify the openness of these systems in a way that doesn't require a huge amount of investment that only a select few have?
Codes Bound by Ethics: The Rising Tide of Non-Free Software Licenses in AI ecosystems
Alright y'all. We are going to start here in just a moment. I'd like to introduce you to Nahara. Nahara, I tried, I really did. She is coming from Estonia. She loves some free software, data protection, geopolitics, and wondering about different cultures and countries. She also spends some time doing photography and some perfumery. Now that's interesting. She's a certified scuba diver. We've got multiple athletes today. This is awesome. This is truly awesome. Take it away. Thank you. Thanks everybody. And welcome to Vostim. Good evening. My name is Nahara Ika. And I work as project manager at Free Software Foundation Europe. And today I'm going to also be speaking as a consortium member of the Zoom project, which is funded by the European Commission. And it basically aims to integrate the three O's. Open hardware, open data, and open software into an innovation-driven policy. So first off, I'm going to set the agenda for today. Since my talk is based on the AI license proliferation, based on ethical considerations, I'm going to start with how openness is also an ethical consideration, followed by the reasons for engaging with free software, and what open AI should mean in practice versus its current state. And finally, I'll end with what are the imposition of additional behavioral restrictions by licensing and the implications. So first of all, the concept of openness is subsumed in the definitions provided by Free Software and Open Source by FSF and OSI respectively. And Free Software provides you with four freedoms, that is, the freedom to study, use, share, and improve. So essentially, anybody can use and distribute Free Software by way of a license, a software license, in a non-exclusive manner. Then there are multiple reasons for engaging with Free Software. Primarily, proprietary licenses, as you know, are fundamentally incompatible with each other. But Free Software licenses are well standardized, well documented, and have widths to the complex legal issues. And so the C2Ca tail the problem of license proliferation by making it more legal interoperable and also by making license adoption easier. Yeah, and if you're talking about ethics, then Free Software also helps in providing, or rather, promoting digital garments. It helps to promote altruism, democratization of knowledge, and reciprocity, most importantly. And now, if you just put this concept into AI systems, then Free Software also helps in promoting accessibility, transparency, fairness, explainability. So yeah. Okay, so then AI systems don't really operate as traditional software. There are multiple interconnected components, as you see, training data, model architecture, and they require distinct development process and rely mostly on specialized resources in the hands of a few big tech companies. So the ideology of Free Software is essentially mapped into the concept of open AI, but we must be wary about the fact that AI is built differently than a traditional piece of it. And so there are a lot of components at play here. So sometimes the code around the model could be open source or Free Software oriented, but say the model isn't open source. So it's fundamentally not the same as a traditional software. And so what is particularly concerning is the popularization of the term open when it comes to AI systems. If you must apply the spirit of the traditional definition of Free Software and open source, then we must not forget the key pillars that actually make it open, which are transparency, reusability, oversight, and enablement. So now transparency in the context of AI could mean the ability to access the source code or read the source code. Reusability could mean to enable any third parties to reuse the code, the data or documentation. Reusability is basically enabling the ability to inspect and verifying the source code or the documentation or rather even data about configuration of the AI system. And enablement is basically by disclosing sufficient details of how the AI is built in order to enable a third party to rebuild the same AI system provided that given the fact that necessary computational resources are provided and these should be identified by the community building of the AI. Now as you know, the concept of open AI is pre-encombered. That is primarily due to the fact that the definition of AI systems itself is not clearly defined, which would change of course with the AI Act. But the concept of openness in AI is also not clearly defined. Now OSI has taken a great initiative in this regard by defining open source AI and while doing so, then not only endorsing the four traditional freedoms as provided by FreeSoft Fed, as you can see from the definition itself, but they're also trying to widen the spectrum of the definition by including diverse types of AI technologies. And as you see, it also says that they support the efforts on these issues, including appropriate government regulation when it comes to having ethical component built into the definition itself. So until a really important addition to this is the fact, and I take immense pride in saying this, that leading members of the Zoom consortium are also cooperating and collaborating towards this effort. So I'm hopeful that we would have a comprehensive definition of openness in AI, which would really enable all AI users to use licensing schemes appropriately. So until we have a definition that's been finalized as regards openness in AI as concerned, what we see is that AI labeled as open actually exists on a long gradient. Now on one end of this gradient, as you see, we have a handful of maximally open AI systems such as Luther AI. It's a nonprofit and it has licensed its model under Apache 2.0. There's also GPT-New York that is built and developed by Luther AI. And it's also made all its model weights and parameters. Also the documentation and the data around its training and configuration absolutely publicly accessible. Whereas on the other hand, you have AI systems like Lama 2, which basically claim themselves by Meta, which claim themselves to be open source. But Ashley Forbid uses from its use to build other language models. They also provide meaningless or rather not a very meaningful description of the language definition or some kind of transparency regarding the data that's being used to build the AI system. So yeah, well, now given the fact that there's any lack of definition, what we see is that openness actually exists on a long gradient. And if we need to use the term open or free, we need to actually conform to the principles provided by free software and open source software, which take with them a rich history of 40 years of success in having control over software. So, in the last decade, what we observe is that there's been few diverse groups and individuals who've departed from using free software licenses exclusively to creating certain licenses that actually prioritize restrictions on the use and distribution of software. And this primarily relate to field of endeavor, behavior, community management, commercial practice and ethical compatibility. So for instance, in 2021, there was Hippocratic License 3.0, which was developed and released by OES, and this specifically prohibits its use for the use of free software in violation of the universal standards of human rights. And this practice has now also spilled over to creation of sewer motor ethic codes for AI systems, and that has led to the creation of AI licenses with restrictive additional behaviors. For example, we have Lama 2 by Meta, and as you see, they have an entire appendix dedicated to a lot of prohibitory uses. There's also a similar list by Big Science Opal Rail M license. And so essentially, what are the implications of use of these licenses with additional behavioral restrictions? Now, as I see, it also basically creates barriers against use and reuse. As I've just displayed, there are certain terms and conditions which are absolutely ambiguous, and what happens is that the use of these wake terminologies create a very overarching prohibitory use for downstream integration and application of AI systems. And hurdles to adaption and improvement, this is basically by unauthorizing derivative work or by prohibiting copy left licenses. Hindrance to control over technology. So the consequence of this long gradient of openness of AI has led to the users not having appropriate control over the technology because of, you know, by blocking interoperability. And yeah, a weakening of oversight and transparency. So proprietary AI systems could also be transparent, but free software basically provides the ability to rep the source code and also improve it. And this also helps to minimize the discriminatory effects of AI systems. So in conclusion to our contribution to the Zoom project, we majorly recommend for recommendations to everyone first is preserve openness in AI. Now there's been a dissonance amongst the marketing pitch of these AI efforts versus restriction to software freedom, and this disables control, transparency, and oversight over technology. So there is an imperative need to preserve the openness. Then we talk about keeping licenses interoperable with free software licenses. Now the emergence of dedicated AI licenses is perceived as a national progression and a well-desired phenomenon. We only plead in the bargain is that these licenses should actually be interoperable with the free software licenses so that we actually make the AI systems reusable, accessible, and sustainable. And yeah, talking about ethical compliance. Ethics is actually deeply rooted in societal values, which actually differ from jurisdiction to jurisdiction. So in order to actually apply any ethical restrictions, we need to be very, very careful before embedding these into technologies. Any kind of restrictive practices based on ethical considerations should be under the purview of law and regulation and not licensing. Schemes because licenses aren't a substitute for regulation and they cannot be a substitute for good governance and for legislation. So yeah, with this I'd like to wrap up my talk. I hope it was insightful and I hope the message is loud and clear. If you call yourself as an open, licensed, following the principles of open-source software or free software, we should and we must respect the freedoms that it comes or it rather provides. So yeah, thank you. Unfortunately, we don't have time for time for any questions. Yes. Yes. All right, we will be starting with Steph here in just a couple minutes. So don't go in too far. Yes, sir.
Open Source AI at TechWorks, the UK trade body for Electronic Systems Engineering
Okay, so our final talk today is by Jeremy Bennett here. I have some notes. So you live in Southampton also, which is also in England. By the New Forest, which is almost a thousand years old. You spent some time in Paris and Nuremberg, which is great. You adore compilers from what seems like from reading this. And you have acted with Hugh Grant? Wow, that's an interesting story. Alright, sir. Jeremy, take it away. Thank you very much. You can ask me later where and when I acted with Hugh Grant. Okay, this is our last talk. It's only a short talk and it's a bit of a long story. I want to talk to you about my work I do in my spare time and William works with as well with tech works. Anyone here heard of tech works? It's the trade body for electronic systems in the UK. And just in case you think that's relevant, it's worth about 100 billion a year to the UK economy. It's about a million people working in that industry. It's 8% of the entire British economy. There's a reason why the minister turns up to the annual meeting. He listened. So it's a powerful body and you will certainly know the members, IBM, ARM, Cadence, Mentor, Siemens and the like. So it's a big body. It covers a lot of things. It was originally the National Microelectronics Institute and that's the one on the top right there that looks after silicon chip design. Going round, you've got Power Electronics Group. You've got the UK Electronics Skills Foundation which is the educational charity arm that oversees students' internships going into universities across the country. There's TechNest which is the embedded software group. There's eSIN, the Automotive Expert Group that looks after the automotive industry. And lastly, there's the Internet of Things Security Foundation. I'll come back to that. Now, what are they doing here? Because they're not an open source organization, anything but. But part of our role as open source engineers is to educate the wider world into the merits of openness. And I want to draw your attention to the Internet of Things Security Foundation and that's what it says on their front page. Okay? Material is published. It's a contribution from industry and you can download the material and you can download them for free. Okay? They're freely available to you and indeed there's an example of one and when we say free, we mean a Creative Commons attribution license. And that's a perfectly valid open license for what is documentation fundamentally. And so even though this has some of the biggest proprietary people amongst it, they have chosen to do their standardization work, their best practice work, their guides to the engineers in the industry to make them fully open. And they were put together by an open process and one of my open source engineers, you'll find his name in that document because he wrote a big chunk of it. And that's where the open philosophy is something you sell to them. And I was one of the group that sold the idea of doing this in the open and I'm a founder member of the Internet of Things Security Foundation. So how does that apply to AI? Well, William and I have been heavily involved with AI at TechWorks. I have the last year or two been co-chair with Mike Bartley of the AI initiative we've had going on long under the hood. And most of our members are experienced professional engineers and I think we heard a lot earlier from Stefania about the importance of education, but I'm particularly interested in the education of people who are already experienced. We've got lots of experienced engineers. How do you bring those people into a new industry? They've got their marketing guys telling them our new product's got to have AI and that's probably about the detail they get in their product spec. And they've got to implement it. So what TechWorks is trying to do is fill a gap in the market by making guidance available to those professional members it has. And the initial things we're going to start on is guidance on trustable AI because that's seen as one of the barriers in our industry. And quite honestly, if you've got companies that are making jet engine controllers you really want to trust any AI they put into them. And more generally, the professional engineer. So actually what William and I have been working on and you can join the meetings if you want to, the next thing that we've been doing is the best practices guide. We're not trying to tell you how to do AI. We're giving you the pointers so you can do it. We're not duplicating what other people are doing. We're trying to provide you the set of questions, a Q and A you can go to to say, should I even be using AI in this product? If I should be using AI, what sort of AI? What are the questions and risks I need to address? And the idea is if you're an engineer, but you don't know AI, it'll help you make a good job of your first project and subsequent projects. And hot news, this is I think the first public meeting this has been announced. So TechWorks announced its new AI innovation cross working group. It's a cross working group because it doesn't fit in any of one of those subsidiary organisations. So we'll work with Automotive, we'll work with Power Electronics, we'll work with the Electronic Skills Foundation, we'll work with the In their Things Security Foundation. It was announced on Thursday, there will be a launch event in London and then there will be more public events. The launch event quite honestly is to get the key influences in there to understand. So it'll be aimed at the government, both the civil service and the politicians. It'll be aimed at senior managers in the industry across the UK. And then we'll propagate it down and there'll be lots of events for the ordinary working engineer. But the good thing about that is the work we'll be doing will just like the In their Things Security Foundation be in the open. And there wasn't even a question about doing that this time. It was taken as given because it was seen what the success is. So really my talk is just an appeal to you is don't just engage with the open source community, engage with the wider engineering community and try and bring them online for using open source. And I'm hoping next year we'll come back and we'll be lots of feedback and this group will have fed into the other groups you've heard around here and will have drawn on what they've done and will be a useful addition to what's there. I say you can get involved with the best practice group, just send William an email and he'll hear it. So I'm the last speaker today. So I've got the, my last slide is nothing to do with tech works. It's some thank yous. So thank yous to those here. So I'd like to thank Will Jones, who's been overall charge of organizing this room. I'd like to thank JJ for chairing all day and JJ hasn't taken a break. I tried to make him take a break but he's indestructible. So he's gone through the whole day. Michelle from the Nagara, Jonathan and Stefania. I think Stefania's had to rush off for all their work from the European network on AI safety. And those four people, I should say there were four submissions to do an AI dev room and we've put all four submissions together. So you've got the best of four possible dev rooms you could have had all rolled into one. But the most important people making a success are all of you. We've had tremendous interaction. I've not been in all the talks but when I have, it's been great to have that. So thank you very much and of course we'll see you all next year. Thank you.
Introduction to OpenAPI
Good morning everybody. Thank you for being so patient. I don't think I've ever had a full room with 24 minutes to go before the start of my talk before. So that is a very special experience. Thank you for sharing it with me. I unmuted, but thank you for checking. So I am going to talk to you today about OpenAPI. I'm going to try to give you something new that you could maybe take back and try, whether you haven't seen this before or whether you're just looking to level up your game a little bit. My name is Lorna. I work for Redockly. I'm VP Developer Experience there. I love APIs. My background is in software engineering. I've been a developer for most of my career. I've built APIs, integrated with APIs, worked for API producers, done API consultancy. Now I build the API tooling. It's, yeah, look, it's a thing that I enjoy and I'm happy that you are all here to share it with me. So let's start by talking about OpenAPI. OpenAPI, I know a lot of people raised their hands, but maybe it's new to some people. OpenAPI is an open standard. It's a way of describing your HTTP APIs in a format that aims to be both human and machine readable. What's nice about that is when we use a standard format, everybody uses the same format. And when that's an open format, it's developed in the open. You can be part of that development process and I'll talk a little bit more about the OpenAPI community at the end. You can see what's coming. You can join the meetings. You can follow the issues on GitHub. If you are using OpenAPI as a producer, as a consumer, if you make tooling for OpenAPI, there are no surprises. You know what's coming and you can be part of that. So it really improves our confidence on working with it. I think the most difficult thing about working with OpenAPI is it's just very verbose. It takes a lot of lines to describe what can be quite a simple thing. So I'm going to start by talking a bit about the structure of OpenAPI because I think when you can find your way around, you understand the map, it's much easier to work with it. So this is a representation of the things that you will find at the top level of an OpenAPI description. OpenAPI, which version of OpenAPI is this? Info, a bit of metadata about the API that this description describes. So here you'll find the title, probably some license information, some contact information, the version that we're on. All of that is in the info block. External docs. It's very easy. You publish an IAS developer website, you link to your API reference docs. If the user arrives on the reference docs, maybe from a search engine, is there a link back to that nice developer website that you made them? Check, because I feel like I've put this right on everything I've ever worked on. There is a security section and that will describe the authorization and authentication needs for the different that are used by the different endpoints in your API. We've got a service section, where is this API published? Tags allow you to attach metadata to individual endpoints. They're listed at the top level and then you can just use them where you need them. The paths section is where the real API documentation actually happens. This is what we think of as API docs. We have an entry for each endpoint describing what it does, the parameters that it accepts or how to shape the request and the response or responses that you can expect back. You'll also find web hooks here, so where you have an API that as well as receiving requests and returning responses, something happens and it sends you a response. You can describe those with web hooks. They're a little bit different to the request response feature. Those were added in 3.1, which, although it is the newest version of OpenAPI, is 3 years old, so wouldn't describe it as cutting edge. We also have here the components section. The components section allows us to describe things that we're going to use multiple times. If you use the same filtering, pagination, date format, if those are common patterns across your API, I mean, if they're not, we need to talk. But if you reuse those things, you can define them in the components section and reuse them. So knowing kind of, they can go in any order, but knowing where you are and where the other things are that you might need can make these very long documents navigable. OpenAPI descriptions are often thousands or tens of thousands of lines of code. My favorite test API description to use is the GitHub one. It's quarter of a million lines of YAML. Like, yeah, you need to know where you're going. Your tools can help you. But it's like carrying something that's not exactly heavy, but it's just a bit unwieldy. So let's drill into some of the detail. Here is just basically the top part of your OpenAPI description. We have a version. It's not very exciting. We have an info section. We've got a title. Give your API a unique and meaningful title. We have summary and description. A lot of OpenAPI elements have these two texty fields, the summary and the description. The difference, the summary is just text. It's short format. It's usually shown in a listing. The description supports markdown, specifically common mark. It's usually shown when we're looking at the detail. So if your API is shown in a catalog or in a list, it'll use the summary. And if you are viewing the API reference documentation, you'll probably see the whole description. And don't be afraid to use the markdown features for links and to really enrich what you do within your OpenAPI file. There's an info version field. And I think this is one thing that I see people getting confused with frequently. Info version is the version of the API description. So if you change this definition document, you're going to change the description field. Does your API info version need to match your API version? I don't really care. But if you change your description a lot, can you please bump the info version so that I know I don't have the latest version of this document? You lock it to your API version if that helps or don't. Maybe you haven't made any API changes, but you did add great descriptions, better examples or something else that changes the OpenAPI description of your API. Bump the version so I know I need to get the new one. Please add a license. Yeah. So this is like some nice fluffy rendering. I made this with Blockly. I hope that you like it. And I think it's just easier to look at than the real thing. This is the YAML version. And I can do 10 screens of YAML and I will be having a nice time, but I don't know if you will be having a nice time. So I brought you some pictures. But this is kind of the equivalent of seeing it in YAML. Like now imagine another 20,000 lines and you're starting to visualize how this thing looks. Okay, let's look a little bit at the paths. We have within the YAML path section, we have one block for each combination of URL and verb or method. So like I have one that is item endpoint, it's got a get operation. Got another one. I'm really good at naming things. Called things another URL which has both get and post. Those are different operations. They get their own description. If we drill into one, how's an operation ID? Fun fact, operation ID is optional in open API. It's technically optional. Honestly, you need it. It needs to be unique. Just get your linting to put that in. There's very few APIs where this isn't a useful thing to have and it's not like it's painful to do. We've got a description. You probably would have a summary as well. Won't all fit. I have added some tags to my endpoint. This is related to user and accounts. We might have user and orders or some other combination of tags here. You can have multiple tags. If there were request body requirements or parameters, those will be described here as well. And then we've got the responses. I've only got the 200 response here. It's very bad. You should always describe your 400 response errors. I got 200 response here. It's application JSON and it's just got a couple of fields in it. I'm going to drill into that in more detail. It's the same endpoint. More detail. Shuffled down a little bit. In my response, you can see I have a maybe you can't see actually because the font is quite small. This schema has a message and an event ID. I've got data types. I've got descriptions. And I've crucially got examples here. The examples are the magic because it lets the user know what kind of data will this be. You can tell me it's a string. But if your example is, I don't know, are you UID? I'm like, oh yeah, I know what that is. If you show me it's my username or you show me it's an ID, okay, I am just instinctively going to put the right thing in when I'm using those tools. If you use the same fields in other places and it's becoming increasingly standard that even if you're not reusing them, you'll often use the open API reference syntax to refer to them being stored somewhere else. So instead of defining each of the objects or elements of the response payload, you just refer to use a reference, dollar ref, to refer to that description and put the description in the components. So your path entry looks like this and then we have that detail down in the components section under schemers. So this gives you a very powerful reuse. The key to API experience is consistency. And so the reuse helps us to just, without thinking, get it right, get it the same, get it consistent and avoid having similar named fields that might take different timestamp formats or look identical but validate differently because our back end application didn't understand that they were the same thing. So that's the structure of open API but I really felt when I created those slides that I was missing the magic. The thing that brings me to this and makes me believe in open API as the powerhouse of our modern application development. And when I think about open API, I think about the things that I do with it and the things that it enables. You think about the way that you design your API, giving meaningful operation IDs for each endpoint and these can be used by the tools that consume your API description. Having great descriptions, naming things in such a way that developers don't need to come and read your documentation because they will know from the operation ID what it's going to do and it's very consistent. They feel at home. You describe your error responses even if I never publish my open API description. The fact that I wrote down the error responses makes my API better because I thought about what I wanted to do if something went wrong. I can validate my API and make sure that my open API is valid, is at the standard that I want and I can have my own linting rules as well. Operation ID is optional. Why? Not in my APIs. So I write my own rules. I say we use kebab case here. We use plurals here. We always define an error response. We make sure that our examples match our media types. These are the things that you can add with the additional linting rules. We can create documentation. That's great. You have an API. You should probably have some docs for it. We can also allow other people to pull the open API description and generate their own docs, keep it locally for reference. I have some accessibility needs. If you have an accessible API web-based documentation, I can just generate with something that works for me with my open API locally. It's ideal. Beyond this sort of entry level, there's some more things that I think we are not doing enough of in open API. You have an API. You describe it with open API. You lint it. You generate some docs. This is great. Please do these things. You are all awesome. The next level is how you deal with very complex API setups. If you work in a large organization with many microservices, how does that pipeline look? How do you keep them all meeting the same standards? How do you bring them together to publish as if you knew what you were doing to the user? Don't mind if you do or not, but you need to look like you do. How do you bundle those things together? If you have one enormous open API description, how do you collaborate on that when you are making changes, whether you are an API experience specialist, product owner, engineer, tech writer? How do we give you a clue that GitHub file is not maintained as a single quarter of a million line YAML file? Looking at how do you manage your files? What do you do with references? How do you split across manageable file chunks? Then how do you bring that together to ship downstream? Finally, what do those downstream tools look like? A lot of organizations, organizations come into open API because they want documentation. This is the beginning. We don't want to write a whole load of words. We just want to describe once with open API and then we can generate some documentation and we can generate it in different ways. Then for free, you start being able to get all these other benefits. You can generate some client SDKs. You can even generate your service stubs if you want. Lots of tools will automatically integrate with your API if you have a good standard open API description. So your API gateways and other integration platforms will just take it. But you can also start to automatically look at how do you describe sequences of API calls? How do you test your API? What does a mock server look like? Because you've described this API in so much detail that a tool can pretend to be it very easily. So there's a lot of pieces here that make up the ecosystem. Open API is kind of the seed from which the rest of the tree grows. For me, this is the magic. It's the interoperability. It's the way that we come back to maybe we generate some open API. It's terrible. So then we use overlays or decorators to add all the descriptions and examples. And maybe not all of these end points are public yet. So we just filter out the public ones to make the final open API and generate some docs. Maybe only some of them are available in the SDK. So we filter differently, make a new open API file, pass that down to the SDK's endpoint. Maybe the next generation of your client SDK has some new functionality. Well, that you start with the same source file or files and bring that together. So it's all about how do you not code, generate docs, but how do you create your open API? Don't have time for my design first rant. So I'm going to try and hold that in. However, your open API comes into the picture. How do you maintain and manage it successfully? How do you ensure the quality on it? How do you transform it and get it ready for all the outputs that you choose? There's just so much in this picture. Let's talk about some tools. Now, I've just linked open API.tools here. I'm not making any specific tools recommendations. That's for two reasons. One, this is a really hot area. There's new tools every week. There are different tools for different text acts. When you are ready for a new tool on that day and no sooner, you should go and look at the list and pick something. The second reason is I work for a tools vendor. I work there because I use their tools. I cannot possibly give you an impartial recommendation. I went to ReadDocley because they know me and I know them. I really don't know the other tools that well as a result. So don't listen to me for specific tools. I work on the ReadDocley stuff and I love it. You need an editor. There's basically two ways to go. You can use a programmer's editor, something like VS Code. Please add some plugins to help yourself. ReadDocley makes an open API plugin. Even if you just have some syntax highlight for YAML, the one that makes the indentations a different color helps me a lot in YAML. Find something that works for you. There are some graphical editors and if that's your thing, then go find one of those. You don't need to pick the same as your team because it's an interrupt format. You use whatever you want to collaborate. Try really hard not to lock your team into tools. Again, accessibility needs. I need to do it in Vim and of course I can. That's part of the magic. Open API governance, which is clearly not a tool, but let's skate over that. When you write, your API standards do not exist until you write them down. They are not standards until they exist somewhere that somebody else can look at them and they are consistently enforced. We have a lot of really good linting that can really help you, but the humans are always going to be in this review process. Find your most wise and thoughtful humans and invite them to be part of the review process. Naming is the thing that the machines genuinely cannot do for us and just the joined up thinking of being able to see things next to each other. As you introduce API standards, start small. Do not be tempted by other people's recommended rule sets, not even ours. Pick what works for you. Look at the recommended rule set, but then pick the things that you aspire to and can adhere to today and commit to reviewing every six months and building up the quality on your API. If you're retrofitting standards to an existing API, there will be things you cannot change now and that's okay, but you can set those rules for the new versions. If you don't know where to start on this, I am going to recommend Zalando. Have some brilliant public API standards and you could do worse than, okay, they have a lot. Start small, just pick your favourites out of theirs. It's a great place to start and your organisation will evolve as it goes along. Please put some linting in. The machines are genuinely good at this. They can help keep you straight. Is your open API valid? Does it have descriptions? Does it have examples? I've got one team that I work with where we have a whole API where the description for the access response is okay with a full stop and it turns out we enforced sentences. So it has to be at least one word and at least one full stop. Yeah, we did some work with them on that. Get some case conventions, some naming conventions and be really picky about what you include. I do this with Redockly CLI, so if you are using that, feel free to send me questions. If you use something else, I can't answer your questions, but good luck. Open API documentation. Read the docs for your docs tools. I see a lot of implementations where those functionalities exist in the tooling that you've used, but you haven't really dug into what it can do or looked at how you can extend or configure it. API reference documentation is evolving very quickly in a good way. There's a lot of new entrants in this market. I'm not sure if I'm supposed to be saying that we have a new product coming out later in the year that does this. It's beautiful, but you have lots and lots of options. Whatever you've picked, make sure you're making the most of it. And if you have something that's, oh, our, I don't want to malign any other tool families, but something which isn't specialist docs and it can render documentation is a great way to start. But because you have the open API format, you can use all of one tool set for one thing, something else for docs, something else for your SDK gen, like lots and lots of options. Open API, when you publish documentation, your documentation is part of the product. You should be deploying it often. It should be easy to deploy and redeploy. And make sure that you're treating it like a web product. Get some metrics, have a look at what's happening, see what people run into. If you have interactive docs, what are people calling the same endpoint all the time? Is it super popular or is it super confusing? Why is everyone here testing this thing? Have a look at those metrics because they can really help you understand your product. I want to talk a little bit about the open API community. This is something that I don't always include in my technical open API talks, but as far as them, it feels appropriate. It's an open standard. It's part of the Linux foundation. You can get, you can learn more about it on openapis.org. The GitHub repository is public. Everything happens there. We have a Slack group. It's very active. Also, public to sign up. And there's a weekly technical meeting. I will confess, it's not super friendly for Europe. I think it's 6 p.m. Central European time, 5 p.m. for me in the UK. Yeah. I'm trying to get to a critical mass of EU-based maintainers, and then we need to start mixing that up. But yeah. If it's unfriendly for Europe, it's sort of dinnertime. There's no hope at all for anyone east of here. So yeah, we need to fix that. But the open API community is currently growing its maintainer set. It's working on some new stuff. Like, this is a good time to get involved. We've also spun up some special interest groups. So just to kind of tease some of the headline activities within the open API project. The Workflows special interest group describes a sequence of API calls. So if you have, this has come from the travel industry. So where you need to find the flights, find the seats, ask the user, book a seat. None of those make sense by themselves. Workflows aims to give an extra level of description for that. Overlays is a special interest group that describes repeatable modifications to an open API. So if you have a generated open API that is just thin, you don't maintain good examples and good descriptions when you're generating from code, and lots of organizations struggle to get away from that Java doc workflow. Overlays can help for now, where you can get your open API and make the same changes every time to make the descriptions better and add examples, hide things, whatever. Open API 4.0. Code name, Moonwalk, why? Don't ask. Don't let engineers name things. Open API, Project Moonwalk, is committed to doing some sort of release this calendar year. So that is just starting. The high level goals are to give you a really simple upgrade from 3.1 upwards, so 3.0 you might want to go to 3.1, and to include a wider range of HTTP APIs. Open API is amazing for RESTful APIs. Okay for some other HTTP-ish, RESTful-ish ones. Moonwalk will include the RPCs and a wider family. So if you've struggled with open API, have another look in about a year. Yeah, open API, an open standard for API descriptions. If you're not using it, I hope you will now or feel like it's a thing that you can approach. If you are, maybe I've given you some ideas to go back and look at what you might change in your current workflow. I'm going to leave you with some resources and say thank you very much for your time. Okay, I'm allowed to take two questions. Would anyone like to take a question? Yes. This is a really good question. How do I feel about generating open API from code or code from open API both ways? Let's start at the beginning. A lot of organizations generate open API from their back-end server-side code. I don't like it. And the reason I don't like it is I think when you go code first, you're missing a design step. When you design first, you're thinking about it in the context of the rest of the API. You're more likely to get the naming right the first time because that implementation is not done by an engineer by themselves. So you ideally design first APIs. You propose the change to your open API with a pull request. You're wise people and you're amazing linting. Go a few iterations to get it perfect. Then we build it. And that's my ideal and that's why I prefer it. The other question, generating code from open API? Yes, go for it. I think we have this machine description and there's a lot of boilerplate. So we can go quite a long way to things like client SDKs from open API. When I talk about the transform step where you have an open API and you make it better, for docs, you're going to add examples and descriptions. For API gateways, SDK code gen, that sort of thing, you're going to add metadata here. You're going to give the type hints that the specific programming languages and text stacks need. And you're going to give extra information. You might not have that at design time, but if you think of it as a pipeline that splits off, you might want to add some extra magic from your standard open API to enhance it before you generate code from it. But generating code is typically fine. It will only be as good as your description is. And lots of those fields are optional. So cool. I am out of time. Thank you so much, everyone. I hope to see you during the event.
Stopping all the attacks before they start: Building a security-first API
Welcome, Warren, and first yours. Thank you. That's not necessary. Sorry for starting a little bit late. This is building a security first API. I'm Warren Parade and I'm the CTO at Authress, which is a widely used authentication authorization API. That means we get a lot of requests, API requests, that is. So let's talk about that a little bit. Today in the world, there's about one trillion requests per second. Get on the slide. This is public internet and from recent research published by Akamai, about 83% of these requests are purely API related, machine clients, services, IoT devices. From our own research, 4.9% of these requests are malicious in nature. That means for every 20 requests your service gets, one of them is from a malicious attacker that's attempting to compromise your service. That's a lot. And I think we're all in this room right now because we know we have to do something about it. But there's a lot that we could be doing. So many things, in fact, that if I were to stand up here and talk through each and every one of these, we'd be here long after the conference was over. Luckily, I'm not going to do that. But also, doing that may not actually have the impact that you wish it to. That's because some of these may not be relevant to the service that you're building, functionality that actually is in your API. To figure that out, we actually need to do and build a threat model. A threat model is intentionally deciding what a malicious threat actor could potentially do to compromise your service. This is going to be unique based off of the data that you're saving, the way you've built your service, the infrastructure, and also your cloud provider of choice. However, I'm going to say that some threats are ubiquitous across many, if not all, APIs that we're building. So what I'm hoping to do is build up a common threat model that we can utilize to actually target solutions to these issues. So let's start. The first one is injection attacks. These come in many forms, where a malicious attacker attempts to construct a custom request into your service to cause it to execute unexpected code paths or flows. Common example are SQL or database injection attacks. These attempt to execute an unexpected SQL command against your database, such as drop all tables if we're lucky. Another type is host command injection. So executing a privileged action against our host operating system or virtualization layer. And the last kind is server side request forgeries. That's just a convoluted way of saying that your service has credentials that are used to interact with a third party integration, and an attack may attempt to utilize those credentials on its behalf. So let's add all of these to our common threat model. But we can go further. The application security program, OWASP, has listed the top two most concerning things. Number two is broken authentication. We don't have any authentication on our API whatsoever, or what we have added isn't sufficient to actually identify malicious attackers. And this is number two. That should be telling us something. Number one is actually broken object level authorization. This lack of authentication, but lack of authorization to validate whether or not the request that's coming in actually should be allowed to execute. Usually it's due to a lack of granular access control. And we can go further. These are only the one directly at our API, but we have infrastructure to consider as well. An malicious attacker could attempt to utilize our API indirectly to affect our infrastructure via a DDoS attack. Or it could attempt to inject a malicious code into one of our dependencies through a supply chain injection attack, software or otherwise. And lastly, they could attempt a physical intrusion into our data center if we have something on prem or a virtualization later or a cloud provider or the cloud provider itself. And the attacks aren't just specific or non-specific to services in general. You have to think about how the functionality of your service is relevant, what you're currently building. A malicious attacker may attempt to utilize how your resources are built, how your end points are functioned specifically in your case. In AuthWrist, we offer multi-tenant security capabilities that allow our tenants, our customers, to create customized identity providers, custom user logic, even give us custom URLs. They can attempt and have attempted to utilize that configurability to compromise adjacent tenants at customers across our whole product. Okay, that's I think enough for a threat model that we can all agree are problems across the board. Now I want to jump to actual solutions. First up is input validation. To deal with the threat of injection attacks, we can add input validation. And I honestly, I feel kind of silly putting this slide up here because I feel like most of us know about this problem. But I feel like every single day, every week, I hear about some problem with some public company who had an issue that could have been resolved by verifying that the request they are getting matched their expectations. Well, we already thought about our expectations a little bit. And if you were listening in the last talk, we learned about the open API specification, which documents our API in a programmatic way. Well, we could potentially take that programmatic documentation and utilize it to verify those same expectations on the request we're getting into our service. Here's an example from Authris where we have group management. This actually creates user groups. And on the left, you can actually see the schema for this endpoint. It's fairly bare here and uses a common component groups. On the right, you can see the example of an open API Explorer tool automatically rendering it. So we can take the schema and what I want to do is build up a place where we can store the validation of all of our endpoints. So when we add additional endpoints to our service, they will automatically get the security of input validation. So let's create a security middleware. And I've got one here in JavaScript. But you can be using, of course, any language. There are open API specification-based tools that work in whatever framework you're using. And I'm loading the spec up here and passing in the method, the path, the body, the headers. And this will get executed on every single request that we get to our service, irrelevant of the endpoints that we have. And just like that, hopefully we've eliminated injection attacks. But why stop there? Now that we're opening up the request to do validation, we can start thinking about authentication. And authentication is required because if we don't know who the user is, we don't know if they're a potential malicious attacker. We have no way of identifying them across requests. And while we could be using something like IP addresses or some fingerprinting that's blocked by most browsers today, none of them are as effective as just using user identification through authentication. And like this, we can hopefully close out the broken authentication threat that we have in our threat model. And usually entails having a trusted identity provider. Someone that can generate access tokens on behalf of our clients, who may be machines, IoT devices, even end users, those tokens usually look like JWTs. And they'll pass us those JWTs into our API through the authorization header. From there, we'll grab some data from the identity provider that allows us to verify those tokens. So let's just see what JWT actually looks like. And I'm sure someone's going to call me out and say it's pronounced jot according to the RFC. So I'm just going to say I absolutely know that. But not everyone knows what a JWT is. And so it's easy to remember by the letters rather than how to pronounce it. So I'm just going to keep saying that a JWT has properties that can be completely configured, but at least it contains ones that have the identity provider where the token came from. User ID that the token represents. Usually there are short lived token that expires. I created this one and it's actually going to expire soon during the conference. And signature that allows us to verify it. So now let's extend our security middleware to close out the broken authentication. We can just add another method in here that allows us to extract those important fields out of the JWT and then verify them. And maybe we're done. Now I think we got to be careful not to fall into a trap here. All of our endpoints need the same sort of authentication. All the services in our system may have a similar concern, right? We want to verify on every endpoint. We may be tempted to delegate that security to another team in our company or a single component where all the requests pass through. However, fundamentally every single one of your endpoints is at stake and you're likely on the team accountable for the security of those endpoints. So when we add another component such as an API gateway into our system, we're really left with two options. The first one is to completely ignore that component, get the tokens into our service and verify them. The alternative is to not trust that identity provider that's giving us the tokens and somehow trust the API gateway instead. If you're doing that, you may think, well, our identity provider isn't necessarily providing us what it needs to do. Maybe think about changing it. And like this, we can eliminate broken authentication. But it's important to remember that identity is not security by itself. Just because we know who the user is doesn't mean that they're actually allowed to perform the action on our API that they're attempting to request. To do that, we need to introduce authorization. Admission is verifying that the machine, service, or user actually has permission to call the endpoint that they're calling. And to stop the broken object authorization, BOLA, for every endpoint, we need to consciously decide what is the purpose of this endpoint for the product, what permissions make sense to actually check, and who should have access. The simplest thing we can do is add permissions to our JWT. Users get blanket permissions to everything basically via a property in the JWT. And you'll notice that there are no resources listed here. It's just a list of permissions. And this is the bare minimum to provide authorization in your API. And it only works really in simple cases that where most of your data is public, users don't really have that many roles, or the users don't really interact with each other. However, in all but the simplest cases, it's likely not sufficient because it doesn't provide granular enough control. You can't specify the resources here. And in order to stop the broken object authorization, we need to achieve granular access control. So we need a different access control strategy. And like that, we'll introduce resource-based. In resource-based, individual users get assigned specific permissions to resources. And with that, we can actually verify that they are authorized to call our endpoints, or the specific endpoint. Okay, let's extend our security middleware and close out the Bola threat. And we'll do that by just adding another line of code to our security middleware. And here I have an example from the author's SDK, but of course you can use any SDK, any product that allows you to do this verification. Or if you're feeling adventurous in the land of security, you can try to do it yourself, not something I normally recommend. And with granular access control, you can actually scope down the permissions of each of your endpoints to only what is absolutely required. And this is known as the principle of least privilege. Now it should be self-evident that the more granular our permissions are, the more secure our API is. Because if we don't have granularity, then users likely probably have access to do too much in our service. And a malicious attacker that gains access to or impersonates one of our users while having access to potentially all of their data. The only way to prevent Bola is have a granular access control strategy and implore the principle of least privilege. Don't be this company that asks to delete all of my emails and all of my calendars. It's just totally unnecessary. Also don't be that company that lets people do this. I should at least be able to uncheck just the delete part. Yeah, sure, maybe I trust you to read and compose some emails. Read just some, not all of them. And this is actually about as far as we can get with static code in a middleware that's running. If we want to deal with some of the threats around our infrastructure, we need to take it to the next level and add some monitoring and logging. We've taken care of most of the threats at this point, but there are still some left, right? Now if we take a look at how the DDoS text works, taking the next step is potentially throwing some sort of additional component into our architecture that allows us to detect when there's a problem. And we can do that by adding what's usually known as an API firewall or a web application firewall. We'll attempt to dynamically block attacks as they happen. It can contain dynamic rules that look for suspicious activity and are executed based off of multiple request patterns. Some requests we'll get through and we'll end up logging them. We have to make sure we're logging them. And then we can process that and potentially look for patterns within our service, maybe a couple of 200 in a particular way. And we can use that to dynamically update our rules. And if we identify something that the firewall isn't blocking at that moment, well, we can actually update those rules and stop an attack as it happens. And those updated rules will continue to live on into the future. So you throw in some monitoring and logging and you notice something, a spike that looks like this. Is this a problem? Does it look like an attack? Who thinks this is an attack? No one. Everyone thinks that this is totally normal behavior. I mean, users change things all the time. So your customer causes this to happen in your service and you're like, whatever, it's not a big deal. And it may not be anything. Realistically, spikes like this don't mean that much. What's important is what's relevant for your product. For us, we monitor and validate what we call the authorization ratio. That's successful, authorized request to ones that are actually blocked, where the permissions don't match or something fundamentally changed about how it's being used. Even though this came up as one spike, the truth is that this was actually two individual problems that are API firewall, caught, and our anomaly detection, which this is actually ran to grab and pull out. This is actually from Grafana. Depending on the severity of the attack, the rules may completely shut out all requests from a particular customer or a specific service client, depending on what's actually going on. OK. I said a lot of things, so I would like to summarize. The most important thing we can do is identify our threat model. We built one here and it includes injection attacks, broken user authentication, Bola, and DDS attacks amongst some other ones. Then we created custom solutions specifically to deal with our threats, validating our inputs, having user identity, granular access control, and then adding dynamic rules. This is great when everything goes to plan, but sometimes it doesn't. And when it does, we need to really understand what we're doing. So let's take a look at some things not to do. The first one is missing our threat model. I think I've said this enough. If we don't have a threat model, then we don't know what we're building is sufficient, useful, or even relevant. It's probably even harmful. Throwing security over the wall. It's nice to be able to utilize components from another team to help us build up a more secure API. But fundamentally, your APIs are your responsibility. If something happens, who is held accountable for the lack of security there? It isn't a replacement to use another team. Internal services. There's no such thing as an internal service. There's just services that don't have any external requests yet. Lateral attacks from other services from malicious threat actors may still end up in your API, utilizing a proxy, a gateway, or another service that could be one of your only customers inside your company. At some point, your service is going to become public or have public callers, potentially. And if you haven't built your service with security in mind, then you likely have some of the threats that we've actually added into our threat model already today. So treat all your callers as externals and that your API could be considered public. Building monoliths. I think at this point by 2024, we should know that microservices are more secure than monoliths. Sources have dedicated boundaries of which we've implemented authentication and authorization at those boundary layers. If something were to happen in one of our services and it becomes compromised, there's a bulkhead door that's closed by default. Whereas in a monolithic system, if any component becomes compromised, then our whole system becomes compromised. And the last one is building it yourself. Try to build or maintain a component that isn't your core competency. And while it may seem like a clever way to get around a supply chain attack, in reality, it means you're just volunteering to take full responsibility for whatever that component is and all the vulnerabilities that show up there. You and your team are probably not going to be able to beat out a team of experts who manage some open source component out there that is being checked continuously. Now the counter example argument to this is, well, what if it's a critical component? Well, the truth is everything in our service is a critical component in some way or else we probably wouldn't have included it. You're not going to get very far with the resources you have attempting to build and maintain literally everything in your stack, including making your own electricity. Okay. So hopefully at this point, you know exactly what you need to do to add some additional security to your service. I have a quick link to the presentation if anyone actually wants it. I see too many people getting on phones. It's going to be available online, so you don't necessarily have to, but if you really want to. Okay. One more. Five, four. Okay. Okay. And thank you. Okay. And thank you. Okay. Now we have some time for some questions. Yeah. Perfect. Five minutes for questions. Go ahead. I'm not sure are you operating in the Europe company? Our company is global. So yes, Europe is included. Oh, yes. Sorry. It's our company operating in Europe. Yes, for sure. It's operating in Europe. The reason I ask is that we have some new regulations according to data protection. Yeah, for sure. And the IP address is considered to be one identifying characteristic for a person. Yes. Yeah. How do you deal with the GDPR or whatever local regulation is for identifying information such as IP addresses? I mean, the sad story about IP addresses is like forget about it honestly. Because with IPv6, which is going to be here pretty soon, you're not going to be able to do anything remotely reasonable with IP addresses in the first place. So stop logging them. For us, what we actually do is edge computing for those customers. So wherever they're coming from, most of their data stays at the edge as much as possible. So if you do have those regulations, try to keep it there and don't send it other places. All the logs are resident to that data center or those databases in that location. It still allows you to perform security on the edge. And wherever you're running your stack, whether it's on-prem data centers, hopefully not in every single country or some cloud provider, they're usually following those regulations as well. So as long as you keep the data as close to that region as possible, you're abiding by the law as best in your ability as you possibly can. Go ahead. Yeah. We do validate all incoming requests against the schema, but that doesn't necessarily mean you're safe. So two easy examples could be a build just really, really long that leads to a DDoS. Or they could be some SQL in there and a laser developer stripping it before it goes into the database. So is there a way to add that kind of validation to the OpenAPI schema or would you have that as a separate set? Can you add extra validation to the OpenAPI schema? You absolutely can. There are even vendor extensions. I mean, if you want to put your whole everything in your OpenAPI schema, you can have, there are tools out there to dynamically generate the whole service if you want to. I don't know if I'd recommend going that far, but you absolutely can. Just to be clear, you can't block everything because you don't even know if you are blocking everything realistically, right? Something that undiscovered zero-day that's sitting out there could still be waiting and happening. So there is no full guarantee that you've actually even blocked everything. Go ahead. What are your pro tips regarding the use of refresh tokens? My pro tips for using refresh tokens. You probably don't need refresh tokens ever, and if you're using them and you believe this is standard or common, you probably don't fully understand the use case for them. Unless you're dealing with a third-party technology or interface integration, you don't need refresh tokens. Just forget they even exist. So it was just within your own system. There's no need for refresh tokens. There's plenty of alternatives that are more secure. I have a whole talk on that, I guess. Does that answer your question? Okay. Anyone else? I still have a couple minutes. Don't be shy. Yeah? Go ahead. Why? Why what? About the tokens. Why? Refresh tokens allow you to impersonate in your own service one of your users to access their resources. So Google Drive, for instance, a refresh token allows you to access Google Drive as your user. If you don't have integrations with third parties, then there's no reason to have refresh tokens. There's no third-party system where you need to impersonate a user. Your own users don't need to be impersonated. You already can perform every action within your service that you want. There's a lot? Yeah, okay. Thank you. That's...
Making API Terms of Service Trustworthy? Presenting FACT : The Fair API Commitment Terms
All right. Thank you, everyone. So we are moving on with the next talk. We're going to talk about API Terms of Service. Unfortunately, Celia couldn't join us. She's ill at home, but maybe offer. Can you offer yourself to give the talk? So, yeah, the floor is yours. Thank you. Actually, it was a joint talk between Celia and I, so you only had half of the talk, but 100% of the content, right? So, yeah, my name is Medimej Javier. I did many things in API Space, wrote books, organized conferences, and one of the projects I've been working on is this one that I'm actually really proud to have started, and I would love to have your feedback after the session on this. So first, who in the room had issues with Term of Services of APIs, you know, application breaking or business model changing or stuff like that, you know, some of people here? Was it Twitter? Google Maps? Other, no, MongoDB API versus, no? Okay, so we'll explore that together. So the idea was to try to be inspired. We've been actually highly inspired by a project called Term of Service Didn't Read. I don't know if you know this project. It's a group of people who read Term of Services because nobody reads them, right? Except lawyers. But nobody reads them and say, we will read them and we will tag them and we will show simply if they are risky or not for your data or if they respect you as a user, right? It's a really hard work. They have been doing that over the last 10 years and we say, okay, but for APIs actually, there is, for software, there is also a big risk, you know, when you don't read what will happen to the software, right? And the policy around the software. So we kind of forked philosophically their project to do API Term of Service Didn't Read, right? And actually, you can see Celia here, Benjamin III, owner of the project who is an open source lawyer, as you can call it, it's like that, right? So just to say the project has been supported by the Ford Foundation, the Modula Foundation, the Open Society Foundation as part of a grant to make more, the digital infrastructure more open source and more safe and trustworthy. So just to mention that. So where does the problem come from? We can take the example of Twitter, but we can take so many others, right? But just an example of Twitter. 2012, actually it's 2012, 2012, Twitter has a vibrant ecosystem, you know, thanks to their API, a lot of developers are able to resendicate Twitter content and make great client applications on web and mobile. But Twitter now has some investors who say we want to keep the money, we want to keep the ads, right? We want to keep the ads. So if we resendicate the data to others, we can't push the ads and get money. So we will kill the Twitter ecosystem by changing abruptly the amount of services. It was re-abrupt. Many, many, we estimate that more than 300,000 applications died from developers who invested their time and energy and stuff. So really a big, a big, a big, a big fiasco in the Twitter ecosystem. But a platform is good as its ecosystem, right? So in 2015, oh my God, like we killed all their ecosystem, developers are unhappy. The CEO of Twitter said, oh, sorry, developers, right? We reopen the time of services, right? We'll reopen, please come back. We want to reset relations. Okay, okay, let's reset relations, right? We, they updated time of services, make it more open, right? 2018, boom, they destroy apps again by killing the API that most of them use, claiming to, they have to update the backend, right? You know, so that was their claim, say, oh, it's a legacy API, you know, but we will not offer new versions, right? And 2023, after the long-nest acquisition, now APIs are extremely costly. And some research, for example, who are used to have an API for free, now they have, last year they had to pay $100 and now it's more than $40,000 to access Twitter API for research. So just to give you like how unstable it can be when you rely on someone else. As for software open source, as for companies opening their digital infrastructure APIs, when there is not stability trust, you know, it's really hard to trust each other, right? When an API is not stable, when people break API all the time, you may have known how Facebook was breaking some of the APIs, how Google has shut down many, many other APIs, how Netflix stopped their API to developers, you know, so many different ideas, so many different stories behind that. So we thought how we can be inspired by someone who can claim promises and keep these promises in the future. Actually, we have been inspired by Creative Commons, so I'm, you're familiar with Creative Commons, right, you know, CC by, you know, all these licenses. It's really easy to produce and really easy to understand. So this is where we started, we say, okay, can we build a Creative Commons pattern for APIs, right? You know, so that's the idea. Of course, it has, it's not just for copyright, it has more degrees of liberty, but let's see how the research has been made and what's the result, right? Yeah, and as I said, for research, now it's really like $40,000 a month, right? Of course, claiming that they don't want, they want to avoid the AI, the AI to learn from the Twitter data, but still, $42,000 when you're a researcher doing great stuff on social media information, that begins to be expensive, right? But it will not be the academics or we'll pay the $44 billion, a little bit of space, right? You know, so yeah, it's not the academics to pay, in my opinion, at least this amount. So yeah, how we can leverage API terms of services for more trustworthy digital ecosystem. So, you know, here I would mostly talk about web APIs, you know, we have the software API, Linux API, Android APIs, whatever, it has also the same idea that APIs needs to be stable over time and trustworthy, but let's say this one, there is also the infrastructure behind it. So, when you rely on it, you want it to be real time, you want to access the data directly. So, I'm talking of these ones which are, let's say, more tight to instant feedbacks and issues, right? And the term of services just for the definition, the term of services are the legal contract attached to the consumptions of these APIs. So, if I consume one API from a software provider, as a service, you know, that he owes actually the service behind the APIs, he will attach a legal document who say, you can use this API this number of time, this number of time per day, per month, this is what you are allowed to do with my API. You know, some people consider not like an open source, but that APIs can have limited use. You know, for example, Google Maps does not allow you to consume Google Maps to do another map services, right? Just an example, right? Can be obvious for business reasons, but this is not freedom, right? This is not the freedom to consume the data the way we want, right? We have the open data movement that allows that, but let's say this term of services is really a contract attached to the consumption of an API, and you should read it. I really advise you to read it. So, this is what we call the term of services. It has many degrees of liberty on the reuse, on the license, on the specs, on so many, so many things. And so, the idea is to, and one of our other assumptions is that I've read the API term of services is the biggest lie of the programmable web, right? Nobody reads them, and it's really the worst when you know something is wrong by your users and not by your provider. So, we made a survey across 200 experts in the API industry, you know. As a side thing, I organize API days conferences, which is the main series of conferences on APIs worldwide. And so, yeah, 40% believe that 40%, 42% never check if the API changes, you know. They never check in their code or whatever. They just learn it when it's down, right? So, 42% that's just to say it's quite a lot. 25% say they have a read approximately the term of services, you know. It's okay, I have proxy, I read them in 23% like a little bit more, you know, when they have business constraints in bank or whatever. And 30% consider they never have interactions with the provider of the API. So, I just want to go there, look at the developer portal, read the docs, sign up, get my key or my token and integrate, right? I don't want to spend more time on this. And so, the most of the time when they know when API is broken or doesn't work is by email. 70% of people just learn that by email. So, there's really this issue, this promise and this constant communication that's really a problem in these things, right? We also tried a new idea in the API space, which is the idea of copy left. Of course, copy left is not new in the Apprentice world, but in the API space it's quite new. Imagine if you consume an API that has a specific license, all the API you provide should have the same license, you know. So, that's something new in the API space. Everybody believe that it's digital infrastructure, you know, it's not like text or software like cold software. Now, it's hot, we have running software on servers. But imagine if you consume an API from an academic institution, because they have great data, you have to, all your APIs, even for business reasons, will have, we need to have the same license. So, it's something we tried. Didn't convince the majority of people, but at least we tried to explore that idea, right? And just to finish on the context, with all the interviews and surveys, we've seen that there's a lot of pressure, a lot of pressure from the providers, because of business and strategic decisions. Actually, when an API product manager and developers publish APIs in a business environment, the lawyers come and say, look, what did you expose? What's the license on this? And then they write this really huge and long and boring Temur service contract. And actually, the lawyers don't understand so much the tech. So, there is this miscommunication between the two that makes the text really complex with many, many protections, which is not good when you want to consume something to generate a nice application. So, we've seen also about how people can be completely dependent. If you build something on Twitter or x.com, they're the only provider of Twitter.com data or x.com data. It's really hard for you to compete to go to someone else. Imagine you want an SMS, do you want an API for SMS? There are Twilio, there are like their Vonage, there are many, many others. There are plenty, plenty, plenty of them, right? There's also a really important aspect when someone has a monopoly of their content or data or infrastructure. It's really hard, right? But the only opposite, I just want to show you also that when you have a stable API over time, it can be part of your success too. Just example of Amazon Web Services, you know, so Verne Vogels, the CTO of Amazon Web Services used to say, we had only chance to make it right, so we spent a lot of time designing our APIs and the APIs actually never really changed deeply. Of course, they're adding more stuff because they knew people will rely on them. So, this trustworthiness in the digital infrastructure space is extremely important, you know? So, yeah, but how we can make more people aware of, of making this more transparent. So, we interviewed like API providers, you know, when their, the API's product is being developed, when they reach the contact legal department, when they're, they're reviewed by industry standards, when the thermal service are documented, when the need, thermal service needs to be updated, you know, there's a full life cycle, right? So, how we do, how we tell people, so, so we did all these interviews and then we, based on this, the final result was that 69% people wanted to make more trust, enhance the trust between the providers and the consumer, right? 65% wanted to simplify it, mostly to simplify it and 42% believed that it enabled, it's a promise that enabled to have larger ecosystem, right? And just to give you another hint, Gartner, you know, the consulting company, really expensive one, but say that company who are able to demonstrate a safe and trustworthy ecosystem have 50% more chance to attract application developer, right? So, we see approximate, approximate these numbers. And so, we, based on that, we also, as I say, was inspired by Creative Commons, you know? Did you already use this, this quick, this quick, wow, it's not a form, but it's a quick wizard. You know, just check, do you allow adaptation of your work? Yes, no, yes, it's not sure like, do you allow commercial use? Yes or no? Boom, you have your license. So, actually, it's really two clicks to generate your license. Two clicks, just simple. And you have a license for copyright that everybody can understand, right? You have the logos on the right and everything. So, based on that, based on the conclusion I shared with you, based also on some other projects like Scriptaminant, OpenEthics, TOSDR, API Commons from Kinline, a specific expert of the industry, we decided to do the same, right? We decided to do the same, and this is the framework at the end. It's more complex than Creative Commons, because Creative Commons has less degrees of liberty about copyright. We actually, we went at the beginning for 18 degrees of decisions, right? And we reduced it, we reduced it to actually five, which are, do you access, do I allow access to my API to some people or to everyone, like what we call API neutrality? You know, like the no-gate-keeper policy. Do I allow my competitors to use my API, right? Amazon Web Services, they allow competitors to use Amazon Web Services. Competitors don't want to go there, but they would allow it, right? Just to give you an idea, right? There is also the specification, you know, the design and specification, my OpenEpi document, right? Or a PDPS specification. Do I allow it to be reused, re-consumed, copied, whatever? Another one is what we call the ethical data policy. Yeah. Oh, I think it's less than that, but okay. The ethical data policy, do I allow reuse of the content in what context? I allow the reuse of the content. The loyal output policy is mostly about the breaking change. You know, how do I pre-warn you before I do a change, a deprecator version? It's important. Most of the company who are fair give three months to warn when something can change some company go up to one year by contract to tell you, you know, when we make a breaking change, we will tell you one year in advance. Well, at least we will keep the existing version for one year. That's better, right? That's better than no notice, right? And the last one is reference and attribution. You will see some people allow you to consume their API as long as you say, okay, this data was provided by the University of Brussels, or this data has been provided by this academia or this company. I offer you kind of the data and the API and the service for free, do you allow attribution, right? And we believe it's always better when there is attribution. So we said the type of logo in this. Now, just so you demo about height work. So you can go on api-tos.org when it's the page of the project. We have a really, really long report because we have been, the funders wanted reports, but you can start the wizard. Okay. And then this is the wizard you have. Again, we really try to reduce to five questions. At the beginning it was 18 questions, so that was too much. So the first, the fair use policy, the fair use policy and the fair loyal change policy, you can go further, but this one you're obliged, we consider to be a fair API, to be a fair API, which will fact the fair API commitment trust, you have to accept these two. You can't decide anything, right? It has the fair use policy and the loyal change policy, which we consider three months, three months change, three months warning and notification for API change, right? So this is a yes, right? Then API access. Do you agree? I'll just go full screen. Do you allow a full API neutrality? Everyone can consume my API. Everyone can access to it. I will not gate, gate, not gate anyone, right? Do you oblige for share or like? You know, say, look, if you, you will be obliged to do it for, with the same license, you know, so it's not just, just consumption. You have to do it with the same license. So we, we tried to, to push to share or like, right? And restrictive rights, which is low rights, you know, we could, the, the person consider that they can limit the reuse, right? So let's say I do share or like, right? API specification, what are the conditions to access and reuse the specification? Is it a C0 license, which is actually a fork of Kinlain API commands project, you know, putting the copyright of API's public? Or is it a share like license, you know, like, you have to publish under the same license? Let's put it CC0, for example. The ethical data policy, what are the condition, the condition to reuse the data exposed by API? Is it a large data reuse, but with some restrictions? No, no, non-compete, whatever. Is it an open, full day, open data contract? Like, no, you can do whatever you want. I will, this is, this is good for me. Is it a commercial data contract? Anything can be used except for commercial reasons. So let's go for open data contract here. And the last one, before the last one, the loyal output policy, what are the conditions to reuse the outputs from the API usage? If you accept the commercial reuse, you have all commercial reuse, non-direct competition on non-commercial reuse. So this is really the commercial aspect above, let's say all commercial reuse allowed. And that's better at least. Reference and attribution, is it a requirement? Is it no attribution needed? Or is it trademark enforcement? Just to let you know also about the, some people here may think that attribution requirement would be green, because this is the right way to do. We also tried these colors to the, to the people in the community, right? They consider limitation or they consider no obligation, right? So it's not moral, but it is about like constraints versus freedom, right? But it goes against the, what we call free software. It's just to say like, this is what people understood. So we try to adapt to that. Attribution requirement, no attribution, trademark enforcement. So you oblige to put the logo, you oblige to say it comes from this company, this academia. Let's make some attribution requirement. And then here, you can click here and then you have your pictograms. Sorry, I don't want to. And you have the PDF of the license here. So I'll just go there. That actually take, retake all the selection you got, right? So this is more the lawyer stuff. We tried to make it simpler with all the exact duration and time and, and, and, and limitation, right? And that you can attach to your current, it's not a full term of services, but this is a fair API commitment terms that you can claim to show that for the things that are the more important for the community and for your users, these are the one you claim and you write and you engage to respect, you know, for what matters. It doesn't have the exact pricing or exact stuff. So this is a contract as an addendum that you attached to an existing contract that overrides the existing contract on these, on these five elements, right? So, yeah, I'll go back. I'll go back to, no, it's not this one. Sorry. So the fact license, as we call it, it's for fair, fair API commitment terms is to make API terms fair, transparent, trustworthy. And we think it's a fact, right? It's, this is why we are now an API task project. So we made this first thing where we have approximately a few dozens of companies who will begin to attach this to their, to their API. We're still looking for feedbacks for improvement. The next steps would be to, to make it even more simpler, put a little bit more degrees on, on the, to be able to write the full license, not just an addendum to the license on just the fair terms. And the next step afterwards is to even make it match in readable. So like this, we would love to attach it to open API specification or API documents as a specific section, you know, but like, if we take a little bit more time, and there are some, some discussions that needs to be taken. But just to let you know, we would love at some point it to be machine readable. So when you consume an API, automatically, you will know what you are allowed to do or not. And that will be a little bit more helpful for the developer community. Thank you very much. And if you have any questions, we'd love to answer them. Yes. Thanks for the presentation. Um, is it the scenario that someone or a company would agree with all of the conditions, but in the end would not respect the conditions, such as the data policy, for example, or it would change the, and if yes, who or what would make sure that they are compliance? There is a saying in, in French, actually. Can I repeat the question? Oh, can I repeat the question? So your, the question is, I tried to reframinate correctly, the question is that it's great that if we write it and we try to put them on the, on their API, thermo services, but will they really be enforced, right? You know, like at some point. So there are two answers. The first one, as I say, we come from French, promises are only good for the one who believes them, right? No. So this is a French thing. So I, I understand. I understand where you come from, but it's the same for the existing thermo services, you know, but at least it's what we believe it's like the nudge, right? You know, it is exactly like creative commons. Creative commons, you can still put a photo and PowerPoint and steal it without respecting creative commons. But if one day when people come and say, look, yeah, in this presentation, this is my photo, you didn't pay a license, it was not allowed for commercial use, or you are this big corporation, this is, this is how much you need to be fine, right? So it gives a nudge of people respecting more, right? So that's the first thing. When it's easy to respect, I believe people respect, right? We just make it need to be more easy. So if it's easy to produce and easy to understand, you know, it's easier as a personal to look, I will make the right choice, right? Because it's, when it's not known, like, like with pictures, for example, when we, we didn't know that we were, we had no chance to know the license, it was okay to copy and paste. Now we see the license say, okay, I take the risk now. And I take the risk of my company, if it's okay to pay five, six, 10 bucks or more on the non-project or whatever to actually pay and say, okay, so I'm good, right? So that's the, that's the first thing. The second thing, which may be later, but, and we, we tried to promote that, but they stopped, they didn't want to do it. We tried at some point to do exactly like they do on SLA, service of agreements, you know, if an infrastructure provider does not respect their SLA, actually they have to give you back money. In many, many contracts, they have to give you back money if they don't respect the SLA. So now it becomes more contractual. We try to put that in this, but it's a, it's a, it's a practice that really few companies do and respect, especially I've experienced that many times with the cloud providers say, you did not respect the SLA, you have to give us money, say prove it. You know, come on. It was slow. Now it was slow on your application, not on our, our infrastructure. So, so that's, that's, but we tried, we tried. Like SLA. Yeah. Yeah. So, you say, I'm trying to reformulate you say are two different concepts, the data and the license with the data and the API itself. I agree with you. Just a example. I was considering at some point for the French open data government. And I said, look, you can even, you can give an Excel part, not an Excel file because they tried, but a CSV file, whatever, as free because it's paid by public money. But if you put it on API, there's no instance. There's no service. There's no ubiquity, maintenance, API access, security, whatever, now it's, your data is now a service and you can find a model depending on what you want. But you have to attach the API to more services with, so I agree with you. There is the data aspect. This, we talk about the reuse of the API, but as I said, we attach that to an existing contract where that should include the data license. Again, it's, it's a, it's a lawyer stuff. There is a law room there with people we worked with who are part of the, of the feedback. It's, it was really hard at least even for us to come understand that there is the data, there is the license of the data, the reuse of the data. And there is the API, which has a, it's another type of software who has another type of license realm where the data can transit. But if it's transit by API, the license can be arding to each other. So it's not easy to understand. And this is why we tried to make it simpler for developers. But thank you for his feedback. And if it's not yet enough clear, yeah, we have to work on it. Thank you. Yes. It's based on the, mostly based on the interview. This is why I put some time at the beginning of the presentation to explain all the people who were interviewed. We've considered the, it's called FACT for fair API commitment terms. So we, we listen to everyone. We, we know, we also have our own culture comes from open source, open, open, open digital infrastructure. And we said, we can't call it fair if it has at least these two conditions, these two conditions are obligated, are an obligation. And most of the people agreed with that. So the first two one, which is the fair use and the lawyer, lawyer output policy, we consider this one with no choice, right? If you were not, not have these bricks, you right. And the question was that there are some choices that are mandatory and why they are mandatory. Oh, no, no, it's a, the one that is selected by default is a, an implement, a bug implementation. No, no, no, it's not, it's a bug, not a feature. But I agree with you, it can, it can push the choice, right? It could push the choice, right? But if, when people come to the website, they're, they have already a good vision about what they want. So they are not tricked by that, but yeah, thank you very much. Thank you.
The API Landscape : mapping the 2000+ API and opensource tooling for Developers
Yeah, it's me again. I made two applications. I've been accepted in two. So that's it, right? So the API industry, we've talked a little bit before on the thermal services, but now it's really a complete view about the landscape, and again, some work I've been doing over the last 10 years about mapping this industry. So a little bit about who I am now, I can talk more a little bit about me before it was really a collaboration between me and Celia and you know, Cube. So I'm the founder and CEO of Olimp.Legal, which is an AI assistant for data protection officers. We want to help people who are involved into making personal data reality, make their job better. And also the founder of API days conferences, the main series of industry conferences about the APIs, on the business aspect and the technical aspect. You can, yeah, so that's it. I'm co-author of two books on APIs called Continuous API Management. API's need versions, so book need versions too, right? Some reports for the European governments, and the report on GDPR data portability. Like if there is an API, can you transfer data according to GDPR and law? No, it doesn't work. The company don't respect, so I saved you 40 pages of the landscape. But it's interesting to know why, right? And also I co-authored the API landscape industry a third of the market every year, where this data comes from. So this is the landscape, at least at the end, in December, that was the landscape. It was 15, I have not 15, 1800 companies or open source projects on a map, right? That was, that you could try to guess when you want to develop an API, document an API, build it, design it, secure it, when you want to promote it, version it. So this was all the tools and for-profit tools and open source tools for that, right? The new version is here, it's live, it's on apilandscape.apic.io, I will show you the link. So now this is this full map of 2100 tools, where, and we will dive it in together, but when you have the old APLF cycle, with coding tools, design tools, gateway management, API ops, whatever, the security adjacent API tools, you know, like you can go there and you can click on any company, let me, do you want to name a company, I can try to find it. No, I don't know, let's click on, let's click on a, I don't know, a company here, Postman, for example, here. So you can see many information about them and you can see the complete profile. What's the potential headcount? What's their diversity? Are there women in top management? Is there the management diverse in terms of ethnicity, funding, you know, like tool, patents, granted, active products, whatever, known, and all their sections, right? And we'll try, so I will save you to the 2100 tools, but I will try to give you some hints about how to understand the landscape these days, right? And on the 2100 tools that you can see on this landscape, there are approximately 700, which are completely open source with, let's say, license that, which are a copy left or copy center, functional scratch, of course. You know, have you tried to turn it on again? But at least this is the landscape here that you can navigate and it's free and the data is also accessible. We make it simpler, it's a big air table, like a hosted database, but at least you can see, you can try to get stuff, understand stuff there. On apilandscape.apc.io, you can download the map if you want to post there, about it in your room. There is a search engine, actually at the conference, we print it and people take it, right? And report if there is a bug or something. So, and you can also add your tool. You can add your tool here if you think your tool should be there, right? So we've gone from there to the other one and we'll see actually what does that change. And now my talk here is to try to give you some hints about what's the update and what's changed, right? In summary, we will see that there is a full API mindset. Many new products are evolving into helping people to build APIs and more and more open source. So we'll dive more into that. Regulations are enforcing the obligation to publish APIs for companies or for administration. That enable a lot of tooling. Many industries are pushing the tooling to be dedicated to their standards or to their norms. Security and privacy, and we had a speaker and security like Arno is stand alone industry almost by itself. Now they are dedicated to API security on top of existing product like API management or API gateways. So new layers are hiding each other. Also the privacy aspect, checking if there is personal data, checking if the regulations are enforced. Again, new layers of product, like it's really 1000 layers. There is also what we call the API as a product aspect or transactional API, payment API, communication API, SMS APIs, maps APIs, forms API, like really the building bricks of the digital infrastructure. And last but not least, which we consider APIs which are all the low code and no code, which is API for non-developers or for citizen developers, the ability to mix APIs together, right? So I give you a hint. Now we are updating the landscape with all the AI APIs, you know, like this model, sub-unsource versus others, but it's not there yet because we started the work, right? The API mindset, I just wanted to give you one thing to really understand what's happening in industry. We've often believed APIs are here to expose capabilities to others, to make them automated and programmable. Actually I had a discussion with Sam Newman, the author of the microservice book, who told me, no, an API is as much exposing as much as hiding. Tony's like a menu in a restaurant. When you go in a restaurant, when you look at the menu, it's not what the cook can do, he's hiding 99% of what he can do. Just show you 1% of he wants what you to order and do. And this is what the design, this is really understanding the diner experience that you wanna give, right? And so this is something that's extremely important and this is why we have so much security and compliance stuff, right? I love this advertisement of Lego, right? You know, APIs are, and the service behind APIs are building bricks. And when we expose them, people will consume them the way they want, not the way obligatory you expect, but how they want, exactly like these Lego bricks. You publish Lego bricks, I use them the way I want. Oh, this is a diner, this is a plane, this is a tank, or this is a car, or this is whatever you want, right? So that's really the idea. Unless you understand that, I think you can't understand what's happening in the API industry. Large trend that we have seen in this industry, if you want it to be digital in the 2000s, you need a website or a website strategy, 2000s to be mobile, you remember? Solo, mobile, social, local, mobile, you remember that, right? I'm old, right? No, but like the 2000s, like the mobile aspect, you needed APIs actually to make mobile applications, to talk to the remote servers, and the backends and stuff like that. But now the 2000s are really API driven, to actually expose data to everyone else's website, or everyone else's mobile application. So that's really the idea, right? The first one was really like funnel channels, but now with APIs, I try to embed my data or my services to everyone else. So it's really horizontal aspect here. And last but not least, what we call the axiom of the API economy industry, whatever, is that organization, public, private, non-profit, at some point will open and provide their core competencies in the digital world through APIs. You know, they will, but I do really well. I will expose it to others so they can use it and consume it, right? And I will consume what others are doing the best in my system. So it's really the circular thing, I wouldn't say economy, at least it's circular use case, where I focus on what I do the best. I expose it to others and I consume what others are doing the best to support my stuff, right? So this is exactly the software we know, like the average application now, I think that's 37 APIs approximately, the consume. It's globalized, right? Let's say the digital world is really globalized. So I often take this comparison with the car industry. You know, like the car manufacturers, the old style of cars, right? Not the electric one, but these ones, actually they have hundreds of suppliers that they gather with each other, that they orchestrate. And I think the landscape, you know, showed you the previous version of landscape, is a little bit like that, right? At some point we will just gather project software, project libraries, web APIs, frameworks, and actually we will build a whole stack that is not for an application. So unless you understand that, I think it's really hard to understand what I will say later. So that was the latest version, right? And the latest version had five layers. First on top was the APLF cycle platforms, the back-end big needles, the API as a product, the transactional API, the business process of the service, the integration platform as a service, like how do I consume any APIs in one time, or the abstractions or the aggregators, you know, like how when API is wrapping a lot of them, I'll be able to consume them. The new version and updated version is different. So we made hard work, again, I will share the slides on social media and stuff, but like it's really different. We really try to not just make a list of tools, but understand the dynamic. So on the top left, you have the standard bodies, the governance bodies, and the protocols, right? These protocols are actually what we believe the standard, the base, right? And then after you have what we call the full life cycle, right, on top, which is the data provider providing APIs, delivering that with developer experience, security privacy, digital readiness, infrastructure availability, you know, all this stuff, right? Below you have the government and regulators that oblige sometimes to do it, that give a context, right? And all of this is now the exposition of the API. So all of this is behind the firewall, and now this is the term of services and the consumers. Now you expose these APIs into products, all the products API, transactional API that you consume to do something, oh, PDF reader, or whatever, and then you have the, on the right, the whole society, we believe it's important, but you also have all the aggregators, people who help you to consume these APIs, right? So standard bodies, full life cycle management, infrastructure, regulation and community, exposition, consumption, right? And so we try to do another taxonomy that looks like that. It's a little bit more complex, but I will drive you through that, right? So here, this is the infrastructure. All the project you may know about Docker, container, Kubernetes, all the cloud native aspect, you know, like Open Policy Agent, or you know, all this opens, this is really where a lot of the open source tools are, right? In this infrastructure, right? I guess someone who was nothing, I'm gonna say a lot of open source, but still. The standard bodies here, we believe they help things to happen on there in blue, and then we have the core aspect. So now we have 1400 approximately tools just there, right? The life cycle management, open source project, whatever, right? Developer proposal. So this is, if you work in the API space, these tools are mostly the one you know, right? Infrastructure is mostly made by ops, DevOps, whatever, but still they enable the microservices, architecture, the service mesh architecture, the stuff like that, right? Then we also put the industry, community, and intelligence. You know, many companies are providing services, consulting, whatever, design, whatever, so we put them here. And we said now it's the exposition, so now it's the product, the right. And then the discovery aggregators or marketplaces, I don't know where it's got, but at least you know, this is the, that helped you consume these ones who are produced, managed by these ones, who are actually powered by these ones, right? Just to give you a hint. So when you will go back here, I'll just go there, when you will go back not on the middle here, this is, we could not do a presentation on one page because we don't have a screen like this as big, but at least this is what you will be able to get but now in a vertical, yes, in a vertical manner, right? And so, yeah, I have still some more minutes to what actually we can get from this. You can play with it, right? But at least if you wanna, in seven minutes to have the conclusion. So first, the core of this API industry is really what we call the API management. The ability to know for every data or services inside that we publish internally or externally, who is using it at what rate, for what use case, for at what level of authorization, and yeah, like mostly that, right? So yeah, so it was mostly the API gateway that's really the core, the technological core of this, but now there's a full practice around that, the governance, the design, the documentation, the development, the testing, the monitoring. So all of this that will go to the management is becoming a commodity. Many, many, the prices, yes, prices are really going down, many more open source tools. And just to show you how it becomes a commodity, this is the acquisition of the top layers of other management solutions. Broadcom acquired, CA technology who had acquired layer seven, who had acquired Renscope, like it's really, it's really like the fish eating the other fish, right? Google acquired Apigee, who had acquired user greed and Firebase, AppSheet, IBM acquired Red Hat, three scale and Trongloop, and Software AG like few weeks ago, for, I don't know if you know web methods, but web methods now has been acquired by IBM, right? And you can see that, just to say that when so many big players are constantly dating, doesn't mean there is so much innovation. When they're constantly dating, it means innovation is not there. This is why many new companies actually made things open source, say okay, it's commoditization, you know, innovation is slowing down, it's time to cut the, to hit the market with open source free software API management solutions. So most known one are Gravity, Kong, actually Kong opened the software at the beginning, and when they raise money from investors, they're closing it, right? Sometimes the business model of the opening is the closing, unfortunately. NGDX acquired by EFI, but solo, three scale, acquired by HEDAT, acquired by IBM, TAIC and WSO2, who has been open source is the beginning, but just to let you know, that now there's a real open source stack that is strong, that is complete, that can do the full lifecycle, right? So if company wants to go full on APIs, they can do it full open source too, right? And with a license that allows it to do it, right? About the other trends, many regulations have obliged companies or institutions to open APIs. One of the first, the main one that has been, is PSD2, who is familiar with PSD2, whoever heard about it? What it is? No, I'm joking. No, PSD2 is the payment service directive too, it's in banking, obliging banks to open APIs. So now in Europe, if you can have what we call bank aggregators easily, no, like on your bank account, you can import other banking account easily, even they are competitor, it's because of this regulation. If I go back to the previous talk, it's an API neutrality aspect. Every bank is obliged to allow access to any other financial institution, as long as they're registered, to open APIs to make any applications. Pure neutrality, you know? So yeah, so just to give you a hint, but yeah, so many banks have been obliged to open APIs because of PSD2 directive. This is a landscape a little bit of the regulation maps. All in green, they have strong banking regulation obliging to open APIs. Yellow, it's coming in the next year. And red and orange, it's not there yet. But just to let you know, like at least Europe invented this regulation and has been copied by others. There's also the healthcare, you know, HL7, fire standard, to do smart application as they call it. But yeah, so just to let you know, many standard in healthcare, so many tooling dedicated for that, right? And specifications there. Last but not least, personal data regulations like GDPR, CCPA in California, like PIPL in China and whatever. 60 countries of regulations inspired by GDPR. We're really good at regulation in Europe, right? Yeah, really good at regulations. But yeah, obliging to open APIs because the user needs to get their data back, need to transfer it, so many, many regulations have applied to that. And a new tooling around that. You will see to the slides, especially, so the AUAI Act recently, in the AUAI Act, there's one proposal that actually you will not be able in Europe, I don't know if it's where we go there, to consume an API that is not hosted in Europe, right? On AI, right? The API has to be hosted in Europe or from a European company in Europe, right? So again, the API is really at the center of the regulation. The open source models can be open source, but hosted and consumed, it has to be in Europe, so many new tooling will come to be sure that you consume what you're allowed in the place you're allowed to do that. In the US, they have the Access Act, which is actually sometimes better than GDPR and portability, that oblige company to have APIs when you request your data, or in GDPR is not the case. When you ask your data on GDPR, we give you a JSON file or PDF that you can't use, the Access Act, it has to be an API access. And just for the story, if, for example, you ask your data from Facebook, GDPR, they will give you a JSON which is not incomplete. It's great like that. If you create an app just for yourself and ask permissions to have data for yourself, you will have a lot, right? So if you create their, they give less data according with GDPR than when you sign their platform policy, which is nonsense, but just to tell you how the GDPR sometimes is not strong enough to oblige, because it didn't put the word API. It has to be machine readable, machine readable. Okay, JSON, man. Oh, girl. So you will see in the landscape, there are specific sections for, if you're from specific industry, like banking, finance, insurance, or other aggregator in the space, I'll just go a little bit there to respect the time. Section three, API security. So we had a speaker earlier about API security, yes. API security now standalone thing. We had the API management layer, and now new threats. There's a new layer of pure API security tools, pure play that comes. It's inspired by the DevSecOps approach. I don't know if you heard that term before, you know, like the ability to put security in the DevOps pipeline between the dev who build the apps and apps who ship it. You know, the security inside the code, right? This really side, yeah. And yeah, it's really a rimes racing up. Unfortunately, there is again, a lot of money injected in that industry. You will see a little bit like in the landscape, this is pure API security players, right? On top of API gateways. Imagine now you are a developer or an architect. How many layers you have to just publish your APIs, right? And what I like here in this report from Sol Security, here, sensitive that exposure privacy incident, like 30% of people believe it's one of their main, it's one of their main concern. So the privacy aspect is there, and we believe a new generation of privacy tools will rise. Section number four, the API product aspect. I love this quote from the CEO of Twilio, who resigned as CEO actually. But the word is getting broken down into APIs. Every part of the stack of a business developer might need to build is eventually turning into APIs that developer can use. This is the map of the car I showed you. The trend is really high, and we have a few hundred tools there, just for that, but actually there are thousands. There are thousands of thousands, we can't map them. Just in the landscape, we give a hint there. And he made a book called, Ask Your Developer, which is not too bad, which is good for a CEO, right? Yeah, and some companies just to show you, so now it was last year valuation, now it's 13, no, some up and down. But yeah, it's just to show you that when you do one thing really well, and everybody consume it, you can really scale at large, and APIs enable that. I just take the example of a company called Avalara here. Avalara, they just do tax, VAT, tax calculation in e-commerce cart. You know when you have a cart, they have to calculate the VAT, but depending on the country, especially in the US, depending on the state, VAT will be different depending on where you're delivered. They just do an API for that, there are 4,000 employees. New standards with a new logo, but open API initiative, async API for asynchronous API, GRPC mostly for microservices and high scalable infrastructure. JSON, LDJ, JSON schema, GraphQL for people who don't like to design APIs, and APIs JSON from KinLane. Now I'm joking about GraphQL, but API JSON, I put it in a standard base because it's a way to publish all the interfaces and all the links important to your APIs. It's suspect that KinLane is developing there. So new standard for new infrastructure. That's but not least the local aspect. There's a huge trend into the local no code, all part of APIs consuming these tools, enabling aggregation. I'll let you discover that because just an example, we lack five developers we consider until 2025, and just 260,000 just with API skills. So we lack of resource, people who want to automate more, and yeah, this is the trend here. So as a wrap up, the full landscape that you can visit, download, consult, search in. API management is coming to community, more open source tools there. Regulation obliged to open APIs that leads to specialization of many, many tools. API security is no standalone product and privacy the next wave. APIs are the new business infrastructure and with the new open source technical stacks with standards that people respect almost all the time. And citizen developers, non-developers are the next API users and consumers thanks to no code and no code. I just recap the address here, APIlandscape.APICn.io. Okay, no, I cannot. Good try. But you can see APIlandscape.APICn.io, which is a media, and then you will be able to click on any company and get data and know what you will integrate or as an open source project or consume as a product. Thank you very much. We have five minutes for questions. Three, four, two. Yeah, sorry. Like you've been working a lot through creating the content like the landscapes. You think it's still used for creating more products related to API. I think we have a lot of companies that are doing things related to APIs, but you think you have something that could help bringing something new? On it or? So two examples, again, how it evolved. GraphQL when it came, wow, few dozens and hundreds of tools in the few years. Now it's quite stable, some are dying, some are reviving. Async API, like ever driven architecture, asynchronous API, great community, no like whoop, we see that coming, right? So yeah, it leaves some part of the landscape where we remove hundreds of companies every year or projects that we consider not relevant for this year, not maintained or whatever, and some new are coming. As I said, for this year, the research we're connecting, the open source, AI, ML, APIs, whatever, there will be a lot of things. There is an API called form blur GPT to hide personal data from open source models. So many, many new tooling on AI APIs will emerge and we believe it will be a full section by the end of the year, right? Just an example here, right? So, great. That's a really more related to AI. Let's say, again, when you see the funding that goes there, a lot of open source projects are coming, a lot of companies are doing it, you know, follow the money, right? Yeah, yeah, yeah. Yep. I think I got to get the part on the new AI Act. Yeah, I know, it was just a mention. It was just a mention that in the AU AI Act, you know, in all the GPR stuff, the data has to be hosted in Europe, right? Unless you show that the word hosted respect the same values as in Europe and it's complex, right? We call it like that a transfers, you know, stuff like that. But the AI Act, like proposed, one of the proposition is that in Europe, you are not able to consume an API where the data is not hosted in Europe, even if the model is open source, but it has to be accessible, the model has to be hosted in Europe and accessible by an API in Europe. You just to say that the API now comes into the regulation, the term, versus the API when there is no API and people send you machine readable file, like Excel or PDF or, the claim PDF is machine readable, but or GZAN actually, right? The fundamental, what's the fundamental decision behind that? No, the fundamental reason is the data localization. You know, we consider it's your sovereign as long as the data is in a place where you can send the police and take it, right? I make it really short. But, you know, like this is why company, country like Russia or China or others, have regulation that obliges the data to be on site, because at some point you're most sovereign if you can knock at the door, right? No. And ask a backdoor. Yeah, be good. Thank you very much. Thank you. Thank you.
Deploy Fast, Without Breaking Things: Level Up APIOps With OpenTelemetry
with the topic. It is a very big mouthful of a topic today, but I'm hoping that we're going to break this down for you today and that you're actually going to learn something that you can take home back to actually implement yourselves. I'm here just to be talking about the open telemetry part. Sonya is actually the brains of this operation. She's basically been planning this whole thing, set everything up and just invited me at the end because yeah, because I'm pretty. That's basically all that I'm contributing today. So I am hopeful that a lot of you have had any type of touch with open telemetry and observability in general, but also that you know the basic DevOps principles and how that is going to be connected with API Ops. Just an introduction for both myself and Sonya. I am Adnan. I do developer relations as you obviously might have already figured out. And yeah, Sonya here is a product manager at Tyche and I would like to hand over the microphone. Yeah, hi. I'm a product manager at Tyche. So we do API management. We have an open source API gateway. If you were in the session before that, you have seen it on the screen. It's an API gateway that's written in Go. It's really fast and has lots of capabilities. So do check it out. And now we are happy to talk about the topic. Cool. Just a quick rundown of the agenda for today. We have four main topics for the agenda today. First and foremost, we're going to talk about API Ops, what it is, how you can get started. And then from there, we're going to take a closer look into how to do API Ops hands on. So we're going to start with the Kubernetes cluster. We'll walk you through how to use Argo CD and Tyche for your API gateway and basically just enable very fast flows and very fast deployments and release cycles within your APIs. From there, we're going to move into production environment. So we're going to say, okay, so what do I need to do to get observability, to get insight into my production APIs? And from there, we're going to shift left even more and figure out how to integrate the release cycles and make them have integrated. I'm going to say integration testing as well. So we're shifting left even more using the production data, so the observability data for testing as well. So that's going to be, I'm going to say my most favorite part because I'm here from Trace Test and we do that. But for right now, let's do the API Ops portion first. Yes, so what is API Ops? Thank you. So you might be familiar with API management and I find that sometimes in API management, we have too many manual operation. And as you all know, manual operation, that's a cause for disaster, that's a cause for error, that's a cause for security problems and we need to speed up things. So my interpretation of what is API Ops and you might have heard about API Ops and some vendors will try to push their ideas of what is API Ops. Some would say it's about deploying your API fast. I'd like to bring a bit back the cultural side of DevOps and say that API Ops is the offspring of DevOps and API management. So it's applying the culture of DevOps to your API management lifecycle. And why? Because you want to deliver value fast without disrupting your users. So if we think back about the DevOps culture, the DevOps principle that originally came from before we started to have lots of vendor trying to sell off things that are DevOps applied, it's about fast flow. I want to be able to commit and have it used by user to have feedback. So to have that culture of having feedback loops. And it's also about enabling that culture of learning. I want to understand what's going on. I want to learn fast, fail fast and be able to provide value to my users. And we're here today to tell you that we think that observability is a key enabler for all that in API management or API Ops. So let's take a look at how to implement API Ops in modern Kubernetes environments to have fast flow. So typically you will have a developer that's building a service. You will have things like open API specification along the way. So we had a talk in this room earlier about open API. I'm not going to go more into details, but it's definitely a space, a place that you have to take into your CI, into your continuous integration, making it all automated. Today we're going to talk now a little bit more on the deployment side. That's why we haven't added it, but of course things like linting and generating documentation. All that should be part of your process. So once the developer commits something, it goes to the CI, continuous integration, and the result might be a Docker container. So it gets published. And now we want to deploy that. We want to deploy that new version of that service. We want to deploy it with an API specification. And for that in Kubernetes, the new way of doing continuous deployment is to use GitOps. There are projects like AgroCD or Flux that are able to do GitOps. What does it mean? GitOps? You're lucky you're really pretty. Okay. So the main thing about GitOps is you don't have a continuous pipeline that pushes the things and deploy to your server. That's the Kubernetes cluster with something like Agro. Pull the information and deploy it itself. So how does it look like? You have then at the end of your CI pipeline, you have to make a change into your deployment repository. You have a code artifacts for all your changes, all the configuration. And you might have a new version that is placed into staging. And AgroCD on your Kubernetes cluster can be configured to automatically pick it up and deploy it. So all automated. Now there's another thing that you need is to expose an API is an API gateway. So in that example, we are using tag API gateway to use the authentication, verification, monitoring. So we add an API gateway, open source API gateway to that. And that's going to be interesting also for the observability part later. So an API gateway helps you to centrally manage your APIs to use authentication, authorization, weight limiting, all this capability that you need in operation. How do you add that? The Kubernetes and GitHub way. Typically we focus on resource definition like it's the way in Kubernetes. So you can add things. And that's a very, very simple where you can say which protocol it use. You could define things like weight limiting, like security policy, which service is proxying on your cluster. And again, it's configuration as code. So it's again central repository. And when you make changes to it into your deployment configuration repository, something like ArgosCD will track it and will apply it automatically. So what we see at the end in your ArgosCD application, you see, okay, all my application definitions, all my application are synchronized automatically with whatever I put into my Git repository. So now we have the first step, right? We have automation for fast flow. We are preventing configuration drift. We have enhanced security. All is automated. No manual error. We are more efficient. We also have an audit trail. So we see exactly what was changed in the deployment of your APIs. And we have better collaboration and visibility on what's happening. Wonderful. And obviously, as the slide says, that is not enough. So we're getting the automation part down. What do we do next? Step three in the whole process is to get additional feedback into your feedback loops so you can connect both ops and dev correctly. So what this means is that the ops team needs to enable the dev team to fix issues by exactly knowing what the issue is, so that the dev team doesn't need to spend useless cycles trying to figure out what the problem is. And we do that by using OpenTelemetry and using Yeager, which are observability tools within our API ops pipelines. Now, this is what we exactly don't want. We don't want to see gears turning and hoping it's all fine because it's not really fine. You don't know what your users are seeing. So we don't really know if our users are happy. We just kind of know it works. And then you kind of do prayer driven development, as I like saying, that's not really what we want. We want to use observability to infer the internal state of our system by getting telemetry out of our system to understand what's actually happening. And then we can figure out whether our users are happy. Because this is something that we can see by using observability with distributed tracing. When our API is exposed telemetry, we can actually see, oh, okay, obviously something is wrong because we have breaking APIs. So it's pretty obvious that our users are unhappy because we can obviously see things breaking for them. And this is a particular view that you get by using Jaeger. Now, let's get to the fun part of actually showing you how it all works and how you can set it up yourself. Now, the way you do it is you use CNCF observability tooling. So tooling from the CNCF tracing landscape, more specifically open telemetry and Jaeger. Open telemetry is an incubating project. Jaeger is a graduated project. So they're all fully open source supported by the CNCF. Now, the specifics are that you use open telemetry as the open standard, we're very focused on open standard for the whole dev room today. So once again, it's an open standard to generate, collect and export your telemetry. Remember that part, it's a bunch of libraries and APIs that help you generate, collect and export telemetry. Now, where do you export it to? Well, you export it to Jaeger, which is a tracing backend, which is just like a data store for your distributed tracing. And then you use Jaeger for all of your production monitoring troubleshooting and whatever else you need to do in your production environment. Now, from this, one of the bigger issues is that open telemetry is quite hard to implement if you're new to it. So some vendors like to bake it in into their systems. One such vendor is there was a lot of suspense, right? Yeah. Yeah. So one thing that we did in tech is to add support, native support for open telemetry, because we know that people that works in the API space, they use API's to proxy multiple services, and the developers might not yet have implemented open telemetry. But we know they need one where to report the data on all the APIs have really visibility on what's happening. And so we added support, native support for open telemetry in tech to enable our user to export this data and to capture them automatically for older APIs. So that's need a couple of settings. This is settings for our hand charts. So where do you need to enable it in tech? You need to say where do you want to send the data to an open telemetry collector could be also directly to an observability backend. And this is what you get. So for every API request, you get a distributed trace for what's happening at the gateway and till the upstream service. So you can see, first of all, you can see any error that's happening already at the API gateway level, authentication error, wait limiting. We see sometimes people only monitor what's happening on the service, but they don't realize they're already missing a lot of people having issue with the authorization, authentication, wait limiting. And then you see what's happening in the upstream. So you can very, very quickly catch up errors, understand not only the timing text, the HTTP response code, but really what's happening if there's an error, if something is slow, where is it happening? Is it on the API gateway, is it on the upstream service? What are the details of the transaction that enables a team to better troubleshoot the issue? And with that, we have now achieved feedback from production. So we have healthy development lifecycle with feedback loop between Dev and Ops. If there's an issue, then the Ops team can report it, can take a look. So it's not only an error on a metric that goes up, it's really a trace where you understand where's the problem, you know, which team needs to act on. And it enables you to provide a better user experience, fix the issues earlier. Again, what have achieved, feedback from production, we no longer relying on user reporting feature, no longer somebody that calls support and say, oh, I have a problem, something is done, no, you see it, you see it all, so you can be proactive. You understand the API performance, you understand really what's happening, where the error is happening, and you can solve issues faster. And with that suspendsful mic switch, again, it's not enough. So we need to introduce another layer of, actually this one, no, we need to introduce another layer of protection. Because right now, we want, we're only stopping bugs after our users are seeing them. So we exactly know that a user saw problem that broke our API, and then we're now rotating back to fix it. We need to be more proactive and figure out how to stop the bugs before they even reach our users. Now, so this is a shift left even more approach, but actually for you guys, it's shift left even more approach. Because we want to add observability to our release cycles as well. So not just our production systems. So the way we're going to go through that a bit is by doing this little squiggly in between, as well. So this basically means that you need to implement something called trace-based testing, which is also called observability driven development. If you like honeycomb and their CTO, it's a term that they coined. Okay. Anyway, the way that you use trace-based testing is you quite literally using the distributed tracing that your observability, like open telemetry exposes, and then you're running tests on those actual data points from your infrastructure. So that means that even though we can see that we have our gears turning, that's awesome. But my initial connection to that API gateway is returning 200. But how do I know this is not broken? How do I know if this is on fire or not? This is an external service. I don't like I don't manage this. So this is something that easily breaks and that you don't really have a lot of control over. Now, let me show you how you can actually get to that state where you can do your testing against the distributed trace itself. This is a screenshot from Trace-Test, which is also a CNCF tracing landscape tool. You can build your test by getting the trace itself from Jaeger, and then you're writing your test specs directly against trace data. So you're not using any mocking, you're not using any faking or whatever the word is nowadays with kids use, I don't even know. You're literally getting the actual data back and running your test against that data. Now, the magical part here is that you can quite literally test against anything that's exposing telemetry. It can be an API gateway like TIC, it can be databases like Postgres, it can be caches like Redis, it can be pretty much anything that you have instrumented to export traces. Now, this is a really cool use case for authentication as well, but also for GraphQL. Now, for authentication, you have a very good example. Yeah, something like Off-Flow where you have multiple service taking to each other and getting the request, that's one of the really cool, useful examples. And also something that I've noticed as well is for GraphQL. So one thing for GraphQL is that it often returns a 200, even though it's failing because the actual error is within the response. So you don't really know, it's very intricate to test that. One thing you can do with trace-based testing is you can drill down to the actual middleware that handles that in your API gateway, find the exact error that happened, and then you can run your test spec on that exact value. So with all of this, we're getting step one, which is functional testing. So we can actually functionally validate our behavior of the system by using all of the telemetry that you've implemented in the prior step to make your production environment reliable. Now, but it doesn't really stop there. We also have step two, which is performance testing, because every span has a duration. You can quite literally go in and say, I want my duration for this span to be less than whatever value of 200 milliseconds or something, which means that if you have external services, external APIs, upstream APIs that you're not in charge of, if their performance is bad, you can validate that and you know exactly what part of your system is misbehaving. So this is the performance aspect as well. So you're getting basically two things from one, I'm going to say exercise. Now the way you do it, I'm going to walk you through quickly. You do this shifting left with trace test, which is, as I said, open source part of the CNCF tracing landscape as well. And what it does, it is quite literally giving you the infrastructure by actually the distributed system architecture by looking at the trace data. And then you can both get the overview of what your system is doing, and you can run tests against exactly what's happening in your system. So those are two powerful things because as engineers, it's very hard to know what the system is doing if it's highly distributed with a lot of microservices, especially if you're a new person on a team, it's just, it's a pain to do that. But with trace test, I want to show you how you can implement these integration tests in your Argo CD, like right here. So this is what an integration test in a post sync hook would look like. You have a API that you're deploying, you have your integration test, which basically runs a Kubernetes job from Argos, from the Argo CD sync hook, then it runs a few integration tests. If they, if they're failing, awesome, you know that they're failing, if they're passing, even better, you see that they're passing, but doesn't really stop here. The thing that you get with this is also every test that fails, you have a URL to go to that particular test to actually see precisely which part of that transaction failed within your API, within your API microservices. And I really like that part because this is not just, oh, yo, this failed, this is actually, this failed, here's exactly how, where, and what happened. And with that, we're actually getting to a stage where we're validating our production, but we're also using that effort we put into our production reliability to validate pre-production as well. So you're basically getting the exact same overview graph that Sonya just showed you, but instead of using your end users, you're running tests with trace test against the API Gateway platform, then you're getting the traces back from your Yeager or Grafana or whatever you're using, and then that info goes back to the API developer that can then fix the issues that were found. Now, with this, I'm just going to wrap up everything that we learned from this last section, which is that we got functional testing and we got performance testing. So you can both validate your behavior, or actually the behavior of your system, so all upstream and downstream services, API transactions, both the ones that you manage and don't manage, you can actually test database performance, you can test cache, you can also test the size of an HTTP response and request, but you can also do very intricate performance testing by validating the duration of every part of your API. And with that, I have a saying where I'm from. We say you're swatting two flies with one swing because I think that's more friendly than killing birds with stones. So yeah, with that, I think that this is the closest we can get to be bounty hunters because we're bug hunters. That was very lame. Anyway, so that's a CU space cowboy reference if somebody can. Thank you for making this. So, and just before we close, I want to say if this is a topic that's interesting for you, we're running an online API observability conference in February. It's going to be called LEAP because it's going to be on the LEAP there. So if that's the topic that's interesting to you, make sure to register. We have lots of people from the API space and observability space that will be coming. We also have a GitHub project about all the screenshot that we showed to you today. We were working on it as a GitHub example. We don't have a link for it, but if you're interested, just reach out to us. Those are LinkedIn. Yeah, I don't like Twitter anymore. So make sure to send a connect and we're happy to send you a link to a GitHub project. You can try it all by yourself at this combination of open source projects. Thank you so much. So we have some time for questions. Yeah, there is one over there. Questions down. Questions down. Go ahead with one customer. Yeah. Okay, so the question is, I have to repeat for the video, the question is, if I have a service that can be accessed by multiple customers, do I want to have one to send the data to different places so to split them out or do I want to have just one year, one open telemetry? And as always, it depends. And on what does it depend? It depends on do you want to give access to those data to your customers somewhere? Do you want to have strict regulation on the data of your customer where you may need to split them by location? But yeah. Yeah, yeah. Yeah. Yeah, that's a very, very, very good question. So the question is, how do I monitor the service level for every customer? So typically you have for every customer, they have, they are authenticated. So you have maybe something like a token. Yeah, yeah, but in production, yeah, yeah. So they're authenticated. So when they come to you, you can put a tag on an information on the trace, and tag will do it automatically if you're using the authorization or authentication from tag, tag. The API, yeah, it's tag. Tag. Yeah, no worry. And so on the traces, we put the information on who is going to API. And with open telemetry, you can then use the data to create your own report based on that information. Yeah. So we add that information on the API call so that you can reuse it for your report. Yeah, it's directly exposed. Yeah. That's a very good question. It's really important to monitor per customers because you want to, some customers have different usage, different patterns, and you want to make sure that every one of them is happy and not just like an average where you don't really understand whether problems. Also, the question is whether Trace Test notifies on errors. No, Trace Test is just a testing tool. You would then need something to automate the test, like Argo, and then you need something to alert on failures as well. And then you can pick the alerting tool that you want. Whatever you're using right now, you can automate within your CI, so you can build your CI within Argo or within whatever you can use Tecton. You can do basically whatever CI tool you're using, and then you're sending errors on that. So think just integration testing. You just get works, doesn't work, then you do whatever else you want to do. Yeah. Yeah. Another question. Observability data for APS, I can take that one. So, yeah, so the question is how do you deal with data privacy? And because in the observability data, they can land a lot that could be considered privacy data. So first, you have to be very aware of that, that observability data could potentially have some data that in your country, in your own regulation could have some impact. OpenTelemetry has a lot of tool for that. In the OpenTelemetry collector, there are kind of plugins that you can define using Yamal and say, that arguments, that thing I want to filter out, I don't want to register it. So you're very flexible in your observability pipeline, but that's something that you have to take care of to make sure that your developers haven't added something that you don't want to store. Sorry. I'll go for it. Go for it. Jack, when I use the data to send the data to the OpenTelemetry, this data is made only on HB8. HB8, the status only. So like a 100, 500 message, the status of the response of the HB8 request. Yes. All on another way is to analyze the response of the request. So the question is, what do we track or what kind of data do we expose with tech? So, yeah, so in tag the gateway, when it's being called, you will get the answer, but the traces, it will export using OpenTelemetry will contain all the data, all the steps, the traces that we saw in Yeager. And you can also extend them. So we have a plugin mechanism where you could, that you could load into there and add even more data if that's more open, extend your OpenTelemetry traces. The question is, where is the effort? So tech make it easier for you because it captured the starts up to the call to the upstream service and it tell you how long it took. And but if you want to get even more details, what happens after that, then it's where you need to instrument your services using OpenTelemetry. And then the beauty of it is when all the services speak the same observability language, they all send the data to the same place, then you have the full picture and that's kind of the operational dream. Thank you. Yeah. You suggest to run that on a trade production? It's right. Correct. Correct. So you wouldn't use trace this in this point of view for your production, you would use it in pre production, where you need sampling to be at 100%. Yeah, yeah, we can also just stand. We'll just wait so you can come by and chat with us. So because yeah, we don't have time. Don't follow up on questions. Yeah. So yeah, yeah, we'll be here. Come here. Yeah. Cool. Thank you.
Public calendars aggregation using Linkal
Hello everyone. Is everyone hearing me correctly? Yeah? Great. My name is Jounia Malka. I am a PhD student at Telecom Paris and I'm doing software supply chain security. I'm also an XOS developer but what I'm going to talk to you about today has nothing to do with this. I'm going to talk about a weekend project I did that is called LINCO and about like deficiencies I see in the public calendar's ecosystem. I'm running with a pretty adversarial screen resolution so if at some point the slides are completely broken I will try to describe what you're supposed to see. Right. So what I'm going to talk to you about today is like what I think is problematic in the public calendar and the calendar ecosystem for collaboration. And I'll explain a motivational situation that made me do this weekend project and then I explain like the two software we came up with to solve this situation. Right. So I think public calendars are sometimes or calendars in general sometimes a bit painful to interact with. And the problem I saw when I started thinking about this is like when you have like a public calendar and you want to follow this calendar on your calendar clients. There's different things that your calendar client can do. Let's say you want to. It can maybe have the capacity to import some ICS files. So even files in bulk but then it will not do anything more with this ICS file than display display them to you and will not for example subscribe to like the updates of these events and will not like. Continue to fetch new events as they come forward. There is like the intermediate player that will fetch the updates. So if you're even get updated like change location or something like that some some calendar clients will update them. And some will be the next player that do everything that you want is basically fetch new events as they come into your calendar. The other problem that I think is is is big like is that calendar providers are not always nice with with the possibility to export your calendars as public calendars that other can follow. Sometimes it's make it very complicated to find the actual option to export these public calendars and make it complicated for people using other calendar software or providers to to actually subscribe to your calendars. And I think also the calendar ecosystem is lacking some nice to have features that would make life easier. So I think like public calendars are not easily composable. It's not easy to like take a few public calendars and merge them into one one collection of calendars which is a nice thing that you can want to do when you're for example you're like you want to follow all events in about for example let's say in XOS because I'm an XOS developer in your region and you have several entities that organize these these events. And they all have a calendar and what you would like to do is maybe do some creation of these calendars and propose a collection of calendars that other XOS users might want to follow to get all the events in one place. This is not easily done. And the other thing that I think would be really nice is filtering of events in calendars. So just like you you are able easily to do filtering of emails. Why not be able to to filter from calendars you follow events that are relevant to you. For example events happening in certain certain geographic area or at certain given date or hour that could be really nice. And this is also very complicated to do I think. So all this thinking came from like a concrete situation where at my school there is a lot of different association that all organize their own stuff and they all maintain some play some kind of of place where they put all the events that they organize. Sometimes it's a calendar. Sometimes it's just a plain web page that you cannot do much with. Sometimes it's just send emails. But there is no there were no central place where you could just see all that get organized on the campus and be informed that way. So we had like a first iteration of solution for for this. This problem the first software that got developed at my in house developed at my school was called Mitis Mitis is a is just a web web service where there is some kind of interface. I don't know if you see it correctly but it's just an interface where it shows all the events from all the calendars. And but it's it's really nice and it was a first step in the right direction. But what you can do is this interface you can ask it to export ICS file so you can import all these events into your calendar clients. But what you can do is ask it to act as a calda server and add it to your calendar client and have on your phone or your computer all the events getting updated in real time and basically be able to follow all these events from all this this situation without action on your part. So what when I saw that I was like I kind of want this to be a calda server. So I created Lincoln so Lincoln is a is a weekend project and it does exactly that. It takes this idea and implement it as a calda server. So the design goals when I try to to think about Lincoln is like I wanted to basically do a calda server that will display several calendars coming from different places into one collection. So for the client it looks like it's one your collection of calendars that you're importing but actually all these calendars are hosted on different places. The other design goal that I want I wanted is like to be able to do some processing locally that Lincoln be able to process in a way or another the events so that we can have at some point maybe like the filtering features that I was talking to you about. Okay so the first iteration of this when I was thinking and trying to implement this my first iteration of my first idea was like okay I'm going to implement this in rest because why not. And actually I wanted to learn rest at the time. And I'm going to is going to be simple I'm going to use some rest libraries that act as calda plans so we have like mini calda for kitchen fridge. And these libraries are going to to perform the request to the underlying calendars and and this this part is kind of like logical and easy but the problem is that you also have to implement all the web dev calda specification on the other side. So you have to implement the HTTP server that's implements all the end point and all the specification of the web dev calda specification and and then you have to get all the calls and rewrite them in terms of function calls of these libraries. So the problem here was like it's kind of like a bit too painful to do because the calda web dev specification is very big and a bit. Complicated and it was a lot of things to do just for a weekend project so I was like this is no this is too complicated to painful there has to be something else. The second iteration. I was like this time I want to implement as little as possible. Of the of the web dev calda specification and and still get something working. And the idea is like we are going to rely on on the the clients so the calda clients they know how to do to format correctly the request. And the calda servers the underlying servers that we are trying to aggregate they also know how to answer this request so basically some somebody did the jobs the job for me and what I need to do is only like forward the appropriate request and appropriate body. To to this underlying calendars get the answer and maybe do some some some kind of modification at some point of the in the answers but we try to keep it as a minimum so what we see is we have the client client collect connect to link call. And then link call for what the request to the underlying calendars and the the answers come back and then we forward back the answer to the clients so we get we kind of act as a proxy and at this point some processing can happen of the request some filtering and some minimal modification needs to be done. Okay so if we if we start to to go in into the depth of the subject. We have two kind of request that we need to handle. So the first part the first kind is like request that the client is going to send us to discover the calendars that are inside the collection and these we kind of to this is the part we have to implement ourselves. Because we cannot forward this this request to to the underlying calendars it would make no sense. And the second part the second type of request is the one that wants to client as acquired all the as a list of all the calendars that are in the in the collection that we are trying to to give it to him. Then it can query the individual calendars and this we can completely just forward the request to the underlying calendars and practically do nothing on them like. Okay so I try to give you an insight on how this this can work. We have like in a in a calda client what you do is you connect to the you write down the the URL of the server and the username and password and it will try to to query the webdav server to to ask what is the calendar home. For this for this user for this principle and so we implement one one endpoint that is called principles link all so link all is the name of the user you should provide to your. Calda clients and and then it will try to to query this pass and. And what what kind of clients are going to send this is called like prop fine request it's property find request and it will ask for this specific property calendar on set it will ask for. A lot of different properties that we don't really care about but at some point it will ask for this property and when it does we answer that it should go and look at this pass slash cal. And so when it behaves correctly this is what it does. So the next column the calda clients do they they go to this past because they now they know this is the collection route. For for calendars and will try to to now find out what are the calendars that are in this collection. So it queries this this pass and then at this point I tried to implement also this. This pass by try to like guess what properties we should send back to the to the client at this point. But it was too also to painful so I took another direction. Instead I forward all the I forward the request that the client send me to all the underlying calendar and the also an answer and I aggregate the answer and this is what I send back to the client so now the client know the. Basically all the calendars that are in this in this collection. And we have to do some kind of hijacking of the answer. So that we modify some of the fields and the most important field so there is a lot of cosmetic fields that you can modify but the most important field that we need to modify is like the URL of each calendar so basically. Each underlying server here when the answer to the request they will say oh fine this calendar this specific calendar at this specific URL which is they will give their URL right so we have to change this so that it corresponds to what where we can answer the. The request for the specific calendar so we we change the URL for each calendar to slash calc slash the name of the calendar and so now the calc clients as a list of URL for each specific calendar. And it will query this URLs to fetch the events. And so now this is the part where we just shamelessly just forward this this request to the underlying servers acting as a man in the middle. And again when the response come then we can. Do some little modification and we can do some cosmetic modification like change the color of the calendar as it should appear on your on your client so. It may be possible that you try to aggregate several calendars that have the same color so you want to do some modification at in the in the. In Lincoln so that when they appear the collection appears on the clients they all have different colors or nice colors. So as a little working example. Sorry. Let's say I'm like I want to offer Nick's US calendar to the user that aggregates several several different. Calendars that are offered by different entities and I have like three entities so for example the genome which is like an association that can offer Nick's US meetups. Let's say a school can offer Nick's US courses and there is some next parties which are like let's say very real things organized by Nick's people. And that is also in this third third calendar and so here I have three different calendars in three different. Hostors. And so the way it works is like I have to create like a JSON file that. Basically states which are the calendars I want to integrate in my in my aggregated collection so I just list them like so. Then I just run Lincoln with this specific calendar that JSON file. And it gives me a Lincoln server so basically if you want to try at some point during the day and tell me that it doesn't work on your specific client. Oh it doesn't that it does work I don't know there is the server is currently live. But what you get if you are using. Mac OS or iOS like I was when I worked on this this project is. You had the Caldav collection and you just specify the URL that I gave you and the user link. And what it gives you is one calendar one collection that has three calendars these three calendars. And that will display basically the events that are in these calendars. And whenever like the underlying entities add new event to these calendars it will update and be. Be available in your client directly. She's also working on Thunderbird and I don't really know about other clients. And now let's let's talk about what I would like to do in the future. So as I told you one of the goals of this project is. Is to also have some some kind of filtering feature where you can say I'm only interested in events happening in let's say this city or happening on Tuesday night or whatever. And currently the way Lincoln is implemented is that you can do that you could go in the rest code base and implement the filters yourself. Which is. Admittedly what not a great user experience so what I think I want to do if if I ever get some time. Is kind of device like a domain specific language where you can write some filtering expressions. And for your calendar so you can you could say you would have the expressivity to express basically the kind of the kind of filter or rules that I just told you about. And then you would way on there like upload this expression to Lincoln. And it would it would do the filtering before the events comes to your calendar client. The other thing that I want to improve is that like Lincoln is currently only able to to serve one calendar collection. And one improvement that I would like to do is have it be multi multi tenancy so it could host as many calendar collection as needed. And and have like some kind of web interface where you could upload this this expression in this domain specific language to define this new calendar collection. And the last thing I want to say is that I think maybe this kind of filtering idea could be also in the future accepted in by KELDA servers and so maybe entering in some standardization. Thank you for your attention. Lincoln is available on GitHub at this year. And. If you have any question I would go to answer them. Yes. Hello. Hello. Hello. Hello. First is someone who has dealt with a lot of counter hell. I appreciate the effort you're putting into this project. And my second question my question is. Is there any sort of right functionality for give me you cover this early. But if you're just passing things in proxying them. If you have the appropriate credentials can you not like could you add events to these collective calendars or is it a sort of read only set up. You mean can Lincoln add events. Can you add can you add events through link how or can you could do that. Yes. There is no limitation that you but what kind of events would you I mean. What's the use case like you're managing the collection and you want one more event to appear to the people that are following this collection. Yeah. Or maybe the people who are subscribing to you know people who are receiving these events say hey I want to have an unofficial after party. I'm adding it after this main event. Other people can see it. So the immediate answer I can give to this is if you really as a collection manager want to add some events you could add your own calendar that you manage and add the calendars. The event to this underlying calendars and it will just work. There is no really there is no real limitation that you couldn't do it directly in Lincoln. But I think like in terms of user experience there is no real interface where you could do this easily. OK. Thank you. Thank you for the talk. Have you considered aggregating from social media like Facebook or similar. Sorry I didn't hear very well. Oh sorry. Have you considered aggregating from social media like Facebook or similar. Would this also work. I have not considered this yet. I know that I mean this could totally be an option. This is Lincoln is currently a very rough prototype. And what I want to do is add some some other ways to integrate events that are not directly from CalDaf servers. Mostly like the priority is adding events from Unpoint that just serves ICS files which is like I know some some people ask for this. But then adding some events from sources that aggregate some events like social media is also interesting and I will consider that. Other questions. OK. Thank you.
Indico: an event management system
Okay. Thank you very much. Yeah. Hi everyone. I'm really happy to be here. I'm Pedro Freira. I'm a software engineer at CERN. I'll be talking to you about Indico together with Dom. He's going to do the second half of presentation. First of all, it's a pleasure to be here. It's our first time at FOSDEM and it's really nice to see such interest. Thank you. So, yeah. As the title of the talk says, we'll be talking about Indico. It's an event management system as you may have realized by now. Well, all of the things that are being presented here today, collaborative effort and open source project and the MIT license, it's developed at CERN mainly with contributions from the United Nations and the Max Planck Institute for Physics. It counts with contributions from more than 70 developers over the last more or less 20 years. So, Indico is probably the most popular event management system you have never heard about. There's something like 300 servers around the world most belonging to educational, research, scientific institutions serving more than 350,000 users. So, it's a tool that you, yeah, as I said, it started out in the research world. Since, as you know, CERN is a research laboratory, but then kind of spread out to different environments and there are, yeah, a few examples of organizations from different domains that are already using it. So, a little bit of history starting in 1999, the physicists working at the Large Hadron Collider, which back then still didn't exist. They were still sort of projecting it, building it. They needed some sort of application which they could use to manage their meetings. So, what would normally happen is that you'd have a meeting, you'd exchange a few emails with their slides and so on, and then this would get kind of lost at some point because it'd be kind of spread around a few mailbox of different people and disks and so on. So, they wanted to have an application which they could use as like a focal point for this sort of event and as kind of an archival platform as well. So, this was the first attempt that it was a CDS agenda back then. Then in 2002, the opportunity came up with a European project which was focused on having a conferencing platform. So, they kind of put the two ideas together and then that's when Indica was born. It went into production in 2004. In 2007, we've added a room booking system to it. Then in 2008, a full interface overhaul. Then 2013, first workshop, word of mouth starts spreading and in 2015, the United Nations adopted it and we started a really nice fruitful collaboration which goes on to this day. 2017, we did a full rewrite of the application. We were working on an aging software stack. We changed even database system, moved to Postgres. So, that was 2017 and 2021, then we moved to Python 3 within the code 3.0. 2023, last year we surpassed 1 million events only at CERN and 2024. So, this year we celebrate our 20th anniversary. So, you may have heard about CERN, the big tunnel which we have underground, the LHC. You probably heard about the detectors and all the things that go, you know, that happen 100 and so meters underground. But a less known facet of the organization, well, maybe not for you because you're all tech people, is that the World Wide Web was invented by Tim Berners-Lee at the organization back then, in the late 80s, early 90s. And CERN is actually producing a lot of open source, also using it but really producing a net contributor to society when it comes to open source production. So, open science is really at the core of our mission. And we have a series of software products which, you know, to this day, I use around the world and which are developed mostly in the organization and then with collaboration of several labs. So, that's Invinio, Zenodo, there's also Roo, White Rabbit, a few other things. There's also the CERN Open Hardware License which, which, yeah, goes on to show how the laboratory was a bit of a pioneer in this whole open hardware movement. And like last year, we also set up our own open source program office. And yeah, as I said, we're also using a lot of open source software. Many of these projects are represented today here in the stands. So, yeah, thanks everyone also for your help. A little bit of publicity, there are three other talks from CERN in this conference. So, if you're interested in, you know, storage or research management with InvinioRDM, you guys are invited to pop by. So, yeah, coming back to CERN, we have around 17,000 people on campus at any time, around 230 meeting rooms, organizing more than 100,000 events a year between meetings, lectures, conferences, all sort of stuff. And many of these meetings are highly distributed. So, yeah, when you come up with Indico, the objective was actually to solve this problem. How do we get, you know, super big collaborations of thousands of physicists to work together in a distributed environment? And, you know, how do we conciliate that with the organizations also, physical presence? So, this is, yeah, this is a science gateway. It's a pretty recent addition to the laboratory. It's a super fancy project by the same architect who was responsible for the George Pompidou Center in Paris. But, yeah, just a disclaimer, we don't work in this building. We obviously work in the Brutalist buildings back there, where is the IT department. So, but, yeah, it's, you should really visit it. It's a really nice place. So, at CERN, Indico became quite popular very quickly. We've been growing year after year. This is the number of new events per year. So, we still kind of accelerating. And these are just examples of a few events, a few meetings, conferences that we currently hosting at CERN's Indico server. There are basically two types of events. There's the conferences, which are a sort of, you know, the more traditional workflow where you have a call for abstracts, paper reviewing. You have workflows which allow people then to interact, do the, you know, the reviewing of papers, refereeing and so on. And then there's the meetings, which are more, a bit of a simplified view in which you can upload, you know, your slides and share it with other people. And you have a common shared schedule. And now, I'll switch over to Dom. All right. People call me Dominic or Dom. I don't really care. So, this is Room Booking. It's a module which is part of Indico. As you can see by this nice screenshot, you've got the leaflet-based map on the right, which shows you rooms. On the left, you've got a timeline of, you know, the rooms which have been booked. Very, very, very simple stuff. But it's not just that. So, we're going to go into the technical aspects of Indico. So, at its core, it's a very, very general purpose. So, just because we use it at CERN to handle our conferences and meetings and also everything else, is very, very, it's not set in stone with, you know, what you can use it for. It's, you can use it for almost anything, pretty much, while in that realm anyway. You can also go through plugins as well. And also, you can customize it with, you know, standard CSS or what have you. So, under the hood, yes, it is a Python application, specifically a Flask-based. So, that handles our back-end. For the database, Postgres SQL, I believe they have a booth here. Then we have other stuff as well, such as a Celery, which is handling our tasks as well. And SQL Alchemy, which is essentially the ORM for Postgres. Again, that is a Python-based. And also, that's for the UI, well, the front-end, we could say. And a semantic UI, which is just the styling of this. And we've got a lot more services on top. Okay, so, as I said, plugins, extensions, so yes, Indico has them. You might be interested. So, yeah, these are just a couple of our plugins. I'll get into a lot more. But yeah, video conferencing payments, conversions to PDF, search via Alasah search, storage and URL shortening and, you know, a lot more stuff, which we can, which Indico handles under the hoods for CERN. So, for example, we've got a nice one-click Zoom join plugin here, as you can see there. Payments, so yes, CERN does handle payments for the conferences. Apologies. Apologies. So, CERN does handle payments for the conferences via its own plugin. So, you can see there, we can handle payments via the post-finance plugin, but also for people running their own instances. There is a third-party integration out there for collecting payments via Stripe. And a PayPal also. Workflows. So, when you come to CERN, you probably might go to a conference. So, we have our own internal workflow for handling your access and other stuff as well. That relates to it. And also, yeah, this is a bit more into the access. So, yes, Indico can also handle printing of your badges and also actually your access onto the site. Recording of events. Again, this goes back into a little bit of Zoom, but also Indico handles the entire life cycle. Conference and events. So, yeah, so here's just a quick screenshot so you can record an event. And on our side at CERN, the event will go to our CDS archive. So, it can be played back on the maintenance, you know, and that is the archive for our events. Okay, so you probably saw a little bit about room booking. This is our internal spinoff called a bureau tail. So, room booking, as it says on the tin, it's for rooms, bureau tail, bureau, it's for desks. So, at CERN, we do provide a modified version of Indico, which only has this specific module, which has been modified, and that is via a plugin. Again, going back to what I said earlier, you can also customize it. So, here is my screenshot of the International Linear Collider Indico instance, which is hosted at CERN. And, yeah, so nice and feel. And it's not just, you know, the front page. You can also customize your meetings with the same CSS rules. And also one more of the conference for Higgs 2020. Now, one last thing. We have a nice checking application. So, previously this was a React native application, but I think around last year we rewrote it from scratch to act as a, well, to be a PWA, a progressive web application. So, basically it's like in any other conference, you might have someone at a door scanning your badges, scanning your tickets, what have you. So, just an application where you can use your smartphone. And then, yeah, it gives you the all the functionality that you would expect from a badge scanner, so a QR code reader. And also lets you bring up details of who's attending. You can check them in. And also, you know, other bits and pieces on top. Okay. One last thing, I guess. So, it's a very accessible event management system. It's open source and we have a pretty nice and thriving community. So, it's a screenshot of our forums. You know where everyone is welcome. And, yeah, so, I guess you have any questions. I'll be sure to follow us as a shout out, I guess. But, yeah, that's all. Thank you. Thank you. I was wondering if you also had some kind of back end for budgeting. Like, when I organize a conference, I want to make sure that all the money that we receive from the thing then pays out for the Dora sun and things that I'm going to spend for the conference. So, should we repeat the question, right? Yeah, so the question is whether we have some sort of back end for budgeting to kind of budget different aspects of the conference. And the answer is no. I mean, you have customizable registration form where you can kind of assign prices to items. I don't know if that's what you need. Then, yeah, in terms of them doing, you know, financial data analysis and so on, then we don't have anything like that. But, yeah, but you can extract everything basically to Excel and do that stuff on a spreadsheet or, yeah. Okay. The question, I think there is some space for integration with the Giante de Nuit or GCNit or Viglovap for conferencing. And is there a way to manage Wi-Fi every password distribution for participants? You... The tokens discount for social events in the night. Repeat the question. Can you repeat the question? Yeah. Yeah, yeah, you have to repeat the question. Well, yeah, so the question is if there is some sort of way to distribute Wi-Fi passwords to participants. That's it? Yeah. Wi-Fi passwords or tokens for social events? Not built-in, but you could probably implement it through a plugin, right? That could be... I mean, this will function as it would be plugin-based. So, yeah, you probably would have to write something yourself or probably hire someone to write it. Sorry? Made it not for tokens and not for Wi-Fi passwords. You have to do plugins. Yeah, no, there's nothing built-in for that, no. Yes. Is there... Is there a time of the attendance registered for participants? So, the question is whether the time of attendance per participant is registered. Well, not the attendance because I think we don't have any mechanism. Actually, if we have people say, you know, I'm attending Nali's talk and so on. But we have the checking time, yeah, that the app that Dom presented before, that one, yeah, if you check a person in, the time is registered and you have like a log of who checked in at the event. But that's more for kind of the reception part of the event, like to give maybe the... There is more to check out or only check in? Only check in, yeah. So, it's like Hotel California, if you want to... Yes. Are there plans to have like a progressive web app for participants or partners, not for the organizers, for example, to schedule what is happening with these... So, the question is whether there are plans for a PWA which targets the participant's side of the event, not so much the organization like here. The answer is yes. We are planning on getting started still this year. There are some funding issues to be addressed, as you guys probably know very well, is often the case. But yeah, it's on the plan for this year. Yes. What priority has accessibility in the UI as you showed? It's a very good question. So, in terms of accessibility in the UI, currently in the code... It is currently going through a phase where we have in collaboration with the UN. There's basically a... We have... The UN has hired a developer to contribute back to Indico, some improvements to the accessibility. And that's about it. So, it is a thing which is, you know, it's a work in progress at the moment. And there are some features out there already, which are going to be released soon, or they're already available in mind of releases. It's already... Many of those have already been merged into our main branch and will be included in the next release. But yeah, there's a lot of work which is currently being done in making sure that we pass the WCAG. Yeah. Yes. What's about developer documentation? Is it well-documented so people can easily access and contribute to the project? Or it's kind of more... So, regarding a question on the developer documentation and how someone can contribute to it, yes, there is documentation out there. So, on... Change of slides. Yeah. So, if you go to getindico.io, we do have a couple of pages on how you can contribute back to the project. And also, we've got a pretty good ReadMe and some ReadTheDocs pages on how to contribute back. And it also covers stuff like how to set up your own developer instance and everything from, you know, how to probably write a half-decent comet when you... Or a PR. So, yeah. There's also some API documentation, Sphinx documentation based on the code documentation. And it's not as complete as we'd like, but yeah, it's a work in progress. Any other questions? No one? No. Well, thank you very much. Thank you.
OpenTalk - Video conferencing secure and GDPR compliant
I need to support me for the in-depth technical details because he is more proficient than I am in these areas. This is a very high level overview of the project. We are not going to go deep into the detail, but if you have questions, let go deeper, just ask them in the end. If you want to have product side view or customer view, you can always use the official contact channels and you will be answered there. So a little bit of background about OpenTalk. There is a company behind OpenTalk. It was founded in 2021 in the middle of the pandemic by a group, so a group is doing since more than 30 years I think already consulting and training for Linux and mail operations hosting and it is also the provider of the well-known mailbox operator MailboxOrg. And the OpenTalk company right now has around 20 employees right now, so it is increasing slowly but steady. So who are we? I am Wolfgang. I joined OpenTalk roughly one and a half years ago and became the backend team tech expert, so more or less the technical lead in July this year, or last year already. I have a master's degree in embedded systems design, but I am much more on the software side than on the hardware side. I am doing Rust since 2015 and I am still in the honeymoon phase and from all the languages I have done, this is the longest honeymoon phase I ever had. And I am the co-founder and organizer of a Linux user group and you can find me on the FEDIVERS. So Stefan. Yeah. I guess I have been like two and something years with OpenTalk now and I am mainly on the media team which is our thing for all the real-time stuff, audio, video, recording, streaming, webRTC. It is kind of in between front and then back end. And yeah, I also have been in university before, long time doing parallel programming, operating system stuff and also some real-time things and software defined radio. So if you are interested in that, just talk to me later on I guess. Okay. Some information about the project in general. So the project is written, or the front end is written in TypeScript, the back end is written in Rust. It is free software under the copy left EUPL 1.2 license. You can find technical documentation online under this domain docs.opentalk.eu. There is also a FEDIVERS account called OpenTalk Meeting. You will find it by that. And there is a Matrix channel as well, hosted on matrix.org. This is, yeah, the Matrix channel is where some of the devs are hanging around and answering technical questions but it is not an official support channel in that regard. Okay. So the user interface, this is what the video conferencing software looks like. So it is roughly similar to what you know from other programs. It was important to make a nice design that looks good and is, yeah, comfortable to use. We also have what we call the dashboard. This is where you can create meetings. You can add start and end date. You can add meeting series and you can also get an email or maybe that's on the next slide. You get an email when you are invited to a meeting or when a meeting is canceled and also the creator of the meeting gets the invite so they can put it into their own calendar. Okay. So short list of the features. We have a lobby with a mic and camera check so you can check that everything is working. We have some interesting moderation tools, one of them being the coffee break which we will show in the next slide, a timer so you can assign tasks to people and say, okay, you have 10 minutes for this and if you want then report when you are ready and when everybody is ready the timer ends or when the timeout is approached. Meeting participants, we have a poll feature and breakout rooms. Screen share, yeah, that's well known for conferencing software. One important information here is that multiple people can share the screen at the same time which comes in handy for peer programming. Yeah, you have the speaker view where you always see the large picture of the speaker of the person who is currently speaking. You can call in from the mobile or landline phones via SIP and we have integrations for a shared editor which in this case is Etherpad and a whiteboard which is SpaceTech currently. Yeah, I already said the invitations end. Right now we are in the course of finishing recording and streaming so you can record the meetings and you can as well live stream it and the idea is to also allow streaming it to multiple platforms at the same time so you can have YouTube Twitch and on-cast stream at the same time if you want. If you are interested in that, talk to Daniel over there, he did one of the work. Yeah. Okay, so here you see a screenshot of the coffee break, that's what it looks like. Everybody gets this full screen as soon as the coffee break is started but you can go back into the conference anytime you like. So for chit chatting up front before everybody is back, just like in real life. And this is another nice feature we have, we call it the shared folder feature. So in the dashboard when you create the meeting you can enable this shared folder switch. It must be configured for this OpenTalk instance but then the system will create a shared folder, it will create a folder on a next cloud instance. This is the part that needs to be configured. It will create two shares to this folder, one of them being read write and the other one being read only. And the moderators of the conference receive the read write link so they can put their material into this folder up front while all the other people have access to this either by clicking on the link in the invitation mail or by opening it through the icon during the conference. Okay, so this is a more technical part, I'll give the word to Stefan here. So that's what it looks like from a rough perspective of the developer or the administrator of the system. So it's not just one big service but we tried as much as we can to use existing components. So what we built mainly is the dark or the dark colored parts and the other services are more or less what you get just from the different projects. So we use Yarnos and RabbitMQ for communication and Yarnos as media gateway but we manage all the video rooms our own using our controller backend and as said there is a web front and written in TypeScript and React here but it's kind of symmetric to what happens on the other side with the, I like to call them back end clients for streaming call and all that stuff. They just have another way of starting the whole process but actually they do the RTC and signaling just as the front end would do and by now they also have a way to authenticate services against key cloak via service authentication so that's also, we can see that later, where you can extend, that would be a way to extend our system in that part. It's meant to be scalable so you can have multiple instances and they just share their data where Redis and so forth is session stuff and for the persistence data like which rooms do we have, what users are in which rooms invited, that stuff that would be stored in the normal relational database and we'll do a lot of integration stuff on that OpenID Connect key cloak side with other like user systems or databases what people tend to have already on site. Okay so this is a sneak peek of Rust code, it's currently not ready yet but we have approaching this. We are right now working on extracting the protocol data types into a separate library which was not the case when I started working with OpenTalk and the idea is to publish the client library to crates.io which is the default publication platform for Rust code and yeah it should be as easy as this, I mean the authentication is usually a little more involved than these two lines but you basically connect to the instance and can do things with the client so this is now the web API for managing appointments and so on so here we create an event with a title that we set and then we invite a user with the email address elizetexample.com and the role of a regular user you could do the same for a moderator as well so the idea is to allow automation and integration in a very easy and approachable way if you're familiar with Rust code. This is also what we will be using for the recorder which connects to the meeting session for the call-in via a landline or telephone and for other maybe future services. So yeah talking about these kind of services that's actually the flow you have there, you build your new backend service which will act like a client to the conference, it first needs to authenticate and get some access token however you set it up and then you usually just go to the backend and say hi that's me and that's the token I got so I'm authenticated and I would like to join this room over there which has this ID and then you essentially and by that you open a web socket where all the signaling happens and you see like the publications of media streams so the backend will just announce when new users arrive and will also announce what screen share and camera stream they have so you can then start the webRTC connection with the Janus and on that signaling channel you just exchange STP and other credentials to get the right stream set up and here you would like in our case we usually use GStreamer as a media stack here which is then that up to get all the streams and for instance do video mixing and when you're done with your recording so somebody tells you on the signaling okay stop now recording you will just upload the file of the video it produced to the controller again which puts it into a F3 storage which is also currently we use for development purposes we use Minio but you can use whatever F3 you would like and there it also becomes available on the dashboard then which also would work with other artifacts so like whiteboard or yeah meeting minutes would be the same thing just another document format right and what I missed out is the other way is when you don't initiate the action yourself there's also the RabbitMQ service where you can just attach and listen for the controller to call you and say hey your service you should do something like start a recording for instance and then just start the signaling session right that's that's basically it yeah okay that's also your part yeah so we talked or we've seen a lot of components which are open source and which we integrate there's also been as we are a company that also been other yeah companies and software developers which we integrated with so I guess that's one of the main things and themes that we and other people have yeah projects and try to integrate with each other and there is like UCS and integration where they basically have their key cloak and they use a management part and we just connect there and there is Innova phone which does mainly zip and has some also some platform and we try to integrate there also wire my D connect and made some adjustments to our zip stack so that we are compatible with them and yeah so it goes on like MGM is like we just started I guess they they talked about how we could do like components where you just would have the video but it's like in the starting phase and not just the whole front end and yeah as much people many people use it right now and this has been a high demand we did outlook plugin but there's also been some talk I know for Sunderbird plugin but it's just not yet yet on the way I guess and so yeah maybe just if you have some some questions or need or want to do something on your own just talk to us and we'd be happy to try to tell you what's going on and to support it as far as we can okay yeah yeah that's it more or less so we try to keep it short so if there's some specific questions and details yeah I'm gonna just go ahead you haven't mentioned entry and encryption yet at all and I know that did see has already some support of entry and encryption and also matrix is now getting into the real-time communication business and I was wondering what is your strategy here yeah I can say a word I guess it's it's not so easy is the starting point the thing is if you want to do end to end encryption you basically don't trust the back end that's that's a deal and we're talking about a web application right now which is like a problem because in the first place you would load your application from the server you don't want to trust so we are looking into how we can ensure that you can really maintain the integrity of all your personal keys and all that stuff and that's pretty hard to do in a browser environment and yeah of course we could encrypt media connections but that's just half of the deal so yeah basically we're in the process that's also a goal for certain projects we're working in but it's not yet a thing I can say okay that's that's a route we we're gonna take right now and here are the details so we didn't put it on the slides yet so if there are question on then topic yeah maybe we can have a have a discussion in detail later on or maybe if you have specific needs and that direction also let me know I'm interested in what do you consider that are still like very important features or properties which is not yet in any open source video conferencing solutions and which you are working on which you also don't have yet but you're working on what is what are kind of still important pieces to come so yeah as mentioned this is a whole streaming and recording a part which will right now in is one of the main things so we can support like bigger conferences with a with a feeling of being in a room so for now we're just doing the the low levels or finishing the low level streaming part and the first UI part to enable streaming but we're thinking a lot about how to integrate like the to have a mode where you have a stage and an audience and the stage would be like a normal web conf and web RTC conference and the audience would get to see the live stream and get a chat interface but it would all happen in our user interface that's something to come I guess but we have no time frame for that right now and the other part we are from the project side in is all the telephony part like zip and 8 3 2 3 I guess which is the old video conferencing standard on telephony nets yeah I guess there's much more but there was another question so I reckon an organization was 100 people but once in a while we host conferences for a few thousand and now I wonder should we then have a very large a Janos media gateway just for this one event per month or is there a way to scale easily down and up the resources because I've heard of Federation of Media servers in the matrix context and I think this is a very interesting concept when organizations have joint conferences so yeah we we also thought about that hard and long and there is like a limit on if you don't cascade Janos instance there's a limit on how many subscribers can be for a single publisher so the speaker in the room and that's for for our experience in the yeah say three to four hundred depending on how you configure load balancing and all the stuff and instead of doing cascading and all that we are right now looking more into the streaming direction then into have it and having it cascading and real time all because usually the audience will not interact heavily and you would have to invest a lot into getting all of the people like fast in there it might be a thing and we are looking also into the matrix how matrix does it with underlying they use live kit as far as I know but yeah we are exploring the other direction was having it on streaming and getting people in and out of the room or you know so into the web RTC conference or back into the stream view that would be my take on that because then you can just have a have it more resource efficient like have a small meeting which is easily manageable and also have a streaming set up which can easily scale lots of people thank you so the question is is there a support for island audio as in in a large meeting where two people can talk to each other alongside with the orator without interfering with the others this has been on the road map for quite some time already the idea is to lower the main room audio volume and have a private talk with a subgroup of the conference but it has not been implemented yet I guess it's already we already have an specification for it but not the time to build it yet
Securely collaborate with CryptPad
Okay, so hello everyone. So I'm, I joined CripLad last year and I'm here to present you like the product and the future directions. Yeah, so yeah, so it's called Securely Collaborate with CripLad and so, CripLad has already been presented last year, but I will start again like showing you what it's all about. Yes, no? No? Okay, thank you. Okay, so CripLad what it is, it's end to end encrypted like a collaborative, collaborative office suite. So you have a lot of different applications inside CripLad. So it all started as a project, I mean, as a name. The name is a bit confusing because you may think it's only for pads. And actually it started like this, so we're only pads. But then if you think about it, like all files are just like text files. And that's how we managed to be able to produce a lot of different applications from that. So some of them are like homemade, like the Kanban one and the Form one are really, have been made with our own little hands. While some other like which text, Whiteboard is for instance raw.io. Like those three presentations, spreadsheet and document are on the office. So we either like build a full application or just try to use other applications and plug it onto the CripLad layer, which is basically the encrypted encryption part. And the goal of all of this is to have both collaboration and privacy. Because when you think about it, when you are collaborating, you want to share some data. But then you don't want to share it with everyone. You want to share it with your collaborators first. And then maybe once you have a full document, you want to share it to others. And moreover, you may not want to have like the service provider. So in our case, it will be CripLad. But if you are using like some proprietary stuff, you don't want it to know what you are working on sometimes. Even if sometimes when you are a company, you are working with Google for business and then you just give them all your data. But anyway, here what we are advocating is that, yeah, we can have both collaboration and also privacy for our end users, which may be you. Anyway, I should have closed the fender bird. Take, take, take, take, take, good. And so one example of that is Disha Ravi, who is an Indian activist for climate, which has, I mean, which has been arrested in India near Bangalore because she was working on the farmer tool kit. So as you may know, like for instance right now in France, there is like some farmer protests. But in India, it was more, I mean, there was one as well in 2020, 2021. And actually, the thing on which Disha Ravi was arrested on was for helping rating the farmer tool kit, which is a document helping the farmers to cooperate and get together. Because India is a very multicultural subcontinent, so it was a big help for them. And actually, the document was published on Crip Pad in the end. And, but yeah, but at first it was made on Google Docs and Google helped India police. I mean, it can be understandable in some sense because it is a big market to become. But yeah, I'm not sure to judge, but at least we cannot sell your data to anyone. And how does it work? So now let's get into it. So basically, we have a model where we have a central server, which will deliver the files, but the files are all stored. So this picture is actually not really accurate because here what you see is that you have the first Pingu, which is writing hello, sending it to the server in an encrypted form, which is then broadcasted as well in this encrypted form. And then as it's a symmetric encryption and everyone has the key, they can all decrypt. Actually, it's not exactly like this that it works. It will be like saying, oh, I wrote an H and then it will send, I wrote an H. Then you said I had an E and so on. But it will just be like difference between, it will be decent patches. But here we just simplify things. And so we decided to keep this centralized part because in the end, we could have imagined something like peer to peer, but then it will be hard to synchronize, we'll have issues like if a message arrives before another. We'll have another issues, but as we already have a server that delivers the file, we can also use it to coordinate communications. And that's how we managed to achieve our goal of having end-to-end encrypted collaborative edition. Sorry. And right now, so that was presentation about CripPad. And about this, we recently, we are mostly done on it, but we had an NLNet project, which is called GroupRinse, where the goal was to analyze the security of CripPad and try to find new directions prepared for the future here. I wrote it here. And so we had many different improvements that have been shown by this analysis. So there are many things actually in CripPad because it's basically something that may be called cryptography driven, where basically, our design really relies on cryptography. For instance, when you log in, it is used, I mean your password plus login is used to derive your different keys, like signing keys, asymmetric encryption and everything. And so everything is based around this. And for instance, we don't store, for instance, a hash of your password and then try to match it. So for instance, it makes password recovery like a big hassle. We cannot do it, we just can do it. When you subscribe on CripPad, there is a big warning saying, don't forget your password, but of course people forget it. As well, as we are also mostly working with the document keys that we are sharing with people, we don't have any ratcheting or any key rotation. And then for instance, revocation, also no foreign secrecy, but I won't talk about it. So for revocation, it's also hard to get. And also another hot topic, like right now in the cryptographic community, it's post quantum crypto. So there have been like the NISDs and NIST, which started post quantum candidates evaluation like in 2015 and last 2022, almost last year because we are sort of a new year. Like in 2022, there have been the first selected new standards. So it was Falcon, Deletion, Kiber and another one. Anyway, so the thing that right now as well, like as I said earlier, CripPad started just as a small project in the company and then it's expanded. But the core is still there and the core is not really easy to work around. But there are also a lot of like refactor to do about the cryptographic layer in order to move toward the cryptographic agility. And actually, I mean, between all these different improvements, I will talk about the password recovery, recovery, because it's something that users may want. And so let's talk about it. So I said it's like something cryptographic driven. And I said that user are identified with their signature public key. And the thing that this relation is only one way. If you know your password and login, then you can get this public key. But I mean, maybe you cannot like relate it any other way around. And the thing that we have something which have become like a hassle to solve because of cryptography, so one solution is just like, hey, let's add more cryptography. So we'll add something which is called linear secret sharing or sometimes Shamir's linear secret sharing, which is the idea that you want to be able to share a secret between multiple parties. And then if a subset of them like above a certain threshold, but it can be more complex data structure. But anyway, let's keep it simple. So if you have more than a certain threshold, like if you have like, for instance, you split your key into five and you want at least three people to collaborate, like for instance, the majority to be able to reconstruct it. Then you can get the key back at the end. So what we'll use is like social linear secret sharing, something akin to the ring of trust where you will share your keys between different participants. And then you have to trust them, of course, because if two of them colludes, then they know your password. And here we can see a weird reference, which is Reed Solomon's code, because it's basically the same ideas of splitting things. So yeah, for instance, for me, I came from the cryptographic community. And we like code people because they are always telling us, all that code people invented everything before you. And I mean, at least this time it was true, not always, but sometimes. And what is social secret sharing? So as I said, you have a secret, like for instance, here it will be your password. And your password, even like directly like some keys to a file which contains part of the thing in order to be able to have some kind of revocation. So you have your secret and you share it between your friends. And then they all keep a shard, a part of it. Individually they cannot do anything with it. But then if you ask like for instance here, I will take the majority vote. So if you ask two of them, then you can get your secret back. I mean, hopefully they won't keep it to them. And then once you have your secret back, then maybe you can change your password with other convoluted way, but ideally. So then in some sense, you cannot lose your password because it's always somewhere there. Obviously it's not very sound because if for instance, some people are not connected because the thing that you also have to think about UI and UX, so how will it work? So if they have to click on a button, they have to be connected for that. You have to contact them, but I mean, it's still way better than losing your password in the end. And this raises like a lot of different questions. Like for instance, I mean, as I said, how will we make it understandable for the user because we don't want them to have like really just like a jizz and take that they have to send you back. You want it to be stored properly, so maybe directly on crit pad. But then there is a risk explanation because you have to tell them like what is sensitive, what is not, what you can do, what you can't. It's an unusual system for users. I mean, maybe for good reason. But anyway, I don't know a lot of systems where you have these kind of things implemented. And I think it's something like, okay, so something I didn't say about crit pad, that one of the aim is also like user friendliness. We don't want for instance something like PGP, which PGP is a very nice tool, but it's hard to get around it when you are starting and you just want to do something simple. You have to read the documentation, see exactly what you want to do. Even the OpenSSH client is like really powerful tool, but not very user friendly. But for us, we want something that, well, a lot of our users just want to use crit pad because it's open source and it's an office shoot, not because it's end to end encrypted. So we also want to keep this user base and we think it's important to make cryptography available for everyone. And then also like it just, in the end, just some displacement of the issue because like before, we were like, oh, but we can just lose everything. So now we may not lose everything, but then some other people, like if the trust is, if your friends are not very trustworthy, then they can call you and compute back your secret. I said, if they are not available, then you can do much and that. No? So to conclude, so I'll come back on everything I said beforehand. So crit pad is an end to end encrypted collaborative office shoot. And everything in this sentence is important. It's collaborative. You have all the, most of the tools you want. And it's also secure as in it's end to end encrypted. As I said in the previous talk, we also have like other issues like we cannot, I mean, as of now, we don't guarantee that the code you are executing, the JavaScript running on your browser, that it's indeed the real one. So there are also some other parts where we still can do better. We still can improve. But this is also one very sensitive thing. It's end to end encrypted, but I mean, there are also like some cryptographic solution for that. But yeah, it can be quite expensive. But we are also thinking about how to go toward this direction. I forgot to tell you about this, but I mean, as a full office shoot, you also have other collaborative office shoot, you have other collaboration tools that are available like calendars. Unfortunately, we can't synchronize them directly using CalDAV because it's encrypted on the server side, so it cannot serve it directly. And we don't want to send the servers the key. We also have like some teams, I mean, a way to share your document and calendars in between a team. Like for instance, right now in CripPad, we are working, like in XWiki in general, we are working using CripPad. And we have different teams like one for the support teams, one for the CripPad teams, like to organize things and stuff where we can find every document that we need, like the other size for instance. And also one of the very important points that it aims at being user friendly. And so for the future, we want to go toward quantum, crypto agility, like making the code more modular and that we can switch algorithms more easily to move toward post-quantum secure collaboration, which will be a way stronger security guarantee than what we have nowadays. Even if like right now if you, I mean, basically like as all the symmetric part is still like more sturdy than the asymmetric part, what is stored in the server, I mean the data are kind of okay even if, let us imagine that there is like quantum adversary right now. It will be more like someone can impersonate you, which is still a big issue, but it's, I mean, if you just get the data on the server, you cannot do much more from that. You need extra information, even if you have a quantum computer. There is also this revocation which I didn't talk about at all, but it's also an interesting issue to handle because it will, may help us to move toward forward secrecy, which is nice to get because it will mean that if you get a document at some point of time, you don't know what happened before and you can be revoked and then you won't know what will happen in the future on this document. We also can imagine other ways to resolve, I mean, so right now I was mostly talking about like we have a central server and stuff, but we can also use conflict-free applicative data type, like CRDTs, to like try to solve conflicts and stuff because right now it's really something very, very naive, which works in the end because in text, you don't have that many weird conflicts that happens and yeah, as I said, perfect execution. So now, as a last word, I will just present that, I mean, it's just the Crippett team and so thank you for your attention and if you have any questions, I'll be glad to answer. I have two questions. The first one, I visit Crippett only for document writing, something like Google Docs that have only a little problem for me. It's not have a full screen document, it was information about Crippett, where there's information about Crippett around the document, not like Google Docs, by instance. So I was at the beginning of this kind of thing, but maybe it can be resolved with a full screen text, something like that. So when I go to Crippett, the first, the first time. So the second one is interfacing with different type of document, worksheets, database, also text with table and so on because Google Docs and Google Sheets, there's not good, a good interfacing between both and I will be happy if it's good in Crippett. So let me be referred to be sure. So you said that the interfacing between spreadsheet and documents are not that good, right? Yes. Yeah, I mean like so far with both, I mean we are right now depending on the office, which we are also like, which we interfaced with the Crippett service, but it will be kind of hard to get, we can try to, I mean we are always trying to improve things, but I mean I know that we have work done here which is working mostly on this part, but yeah, I don't, yeah, maybe it will improve, but I mean we'll keep it in mind and... It's a good use of the tool to use, the interface. Yeah, I mean we are working a lot with user interface. Actually the project lead is a designer, so it's really giving us feedbacks about how to make things fitting nice. And yeah, any other questions? Hi, we'll discuss about it. Thank you for your talk. I've been helping package Crippett for NixOS, Nix packages, and one thing that came to my attention was that the whole thing is like 800 megabytes, and I was like, whoa, what's going on? And then I noticed that it's the integration with only office, that it's a lot of space. I just wanted to know if, I mean are you keeping that in mind in the future? Will you keep it? Because it's quite big for, like if you compare it to a WordPress release for example, the size is huge. The thing that we have, I mean we don't have the original version of only office because the thing that we only keep is the only office client and the server part, we are emulating it with Crippett basically. So we need to have this hacked only office in our repositories, and every time we have, we need to make an update, it's a mess. We are aware of this issue, we are trying to find a solution, but we are also other issues too. And then, that's it. Thanks for your feedback. Hi, a question. So what are exactly the technical limitations at this point? Because you showed the secret sharing. I imagine I'm not a cryptographer, so based on the theory, the more people who are collaborating at that point, the more harder it would be to manage. So what are the other technical limitations perhaps that you see at this point from that? So technical limitations for what? You showed the example of the secret sharing, where you have a secret shared between different users. The more you scale with the users for any document, the more it might be complicated. Actually, here's the main... So basically the question was that what are the technical limitations of Crippett, like in this context about secret sharing, for instance, they may be scalability issues. And actually the thing is that for that it will only be small islands where you will share... I mean, you will only have small sharing islands. For instance, for scalability issues, we don't have any issues with the number of users growing in terms of collaboration, because in the end, on a single document, at a single point of time, that many people will be working on it, and the server is only acting, is only there for communication, because it doesn't... I mean, there is really not much processing in the server, because it cannot do anything. It cannot be done. So it's all spread within the clients, within their browsers, which makes it a bit of an issue on mobile devices, for instance. But for secret sharing, this won't be a technical limitation. As I said, the main bottleneck with the use of cryptography is the fact that it will block us for... I mean, it will make some functionalities harder to implement, because everything is hidden, and you don't have access to it. But at least, as far as I understand, for secret sharing, it may not be an issue. The main issue will be that you have to coordinate with the other parties, but that's all. Yes. Sorry, Ludovic, also from the team, to answer to the next person. One of the big reasons there is a lot of space is that we have multiple versions of OnlyOffice, and the reason is because we store in the native format of OnlyOffice because of real-time, we're not storing the Excel version, so for compatibility reason, when we upgrade CripPad, we need to be able to upgrade the pad, so we need the older version so that we can upgrade the pad to the newer version of OnlyOffice, and there is a plan to make the installation of the OnlyOffice modules optional, and basically say, which ones do you want in your CripPad so that you're not carrying very old versions of OnlyOffice code in CripPad, and this is why it's so big. We're sorry about that, but there is a technical reason. Yeah, unfortunately. Thank you. So, yeah, I think we are done with this talk. Thank you Fabrice. Thank you, everyone.
Collabora Online: WASM
Thank you for joining us for the next talk. The next talk will be about Collaboration Online Web Assembly and it's going to be with Kaelan and Thorsten. Okay, I hope you can hear me. I can hear myself. So Collaboration Online, I wasn't myself, Kaelan, Mechmeyer and Thorsten Burns and we'll just get on with it. Collaboration Online Typical Overview. What we have is a... I can shout louder but again it's meant to be amplified. Collaboration Online Typical Overview, the idea here is that you have a browser, you have a JavaScript front-end, the JavaScript front-end communicates back to a server, the back-end server is a classic C++, compiled to native code, there's a web server daemon that basically manages a whole set of kit instances. And obviously, crucially with all these things, if there's no server, then it doesn't work because the browser talks to the server, if you lose connectivity or connection to the server then your browser has nothing to communicate with and nothing works at all. Each kit instance that I mentioned there is basically LibreOffice, it links to LibreOffice Core, it's big, the combined shared libraries on my local copy of those is 320 megs. So you've got a very, very large core body of code from LibreOffice that is included and linked to by each kit instance. Each instance is one document so when you connect to Collaboration Online, each document is running separately and each individual document is its own instance. When you've got multiple users, it's still one document, if you connect to one document, if you've got one user, three documents, you've got three instances. And then the server mediates between your browser and powers things to each client. Client renders images and you send back these pictures back to your JavaScript client. LibreOffice itself, core part of it is pretty portable. We've ported it to all these operating systems here, Linux, Windows, Mac, IOS, etc. On the core side we have the famous bridge which is particularly specific to the ABI of your architecture. But we have ported it to all these architectures and all these different ABI's in the past. So there's plenty of experience and plenty of examples of being ported to very, very, very diverse sets of things at a particularly low level. More online itself then, pretty much doesn't have the issue of requiring this low level bridge so its portability is fundamentally easier on that side even though then it is going to use Linux-based APIs. So again, two things that are relatively portable in their own sense. Which brings web assembly in. Yeah, that's my cue I guess. So who in this room has not heard about web assembly? Okay, so great. So this slide is for you. In all brevity. So web assembly, well the idea behind web assembly is to have something that runs in the same sandbox as Java but is much closer to the actual machine level abstraction that languages like C and C++ are having. The vast, like this massive amount of software out there already in those languages to be able, which is smart move to run this in the world. The history of that is the SMJS like started 2011, 2012 or something with SMJS also just a Google PNACL project. So that started with a subset of the Java script so that is able then to be run relatively performant like almost like native performance in the browser. That passed forward a few years, 2015, web assembly became a standard, started implementation in the browser like with the W3C. The M script project, big queue those by the way to M script for enabling that based on LLVM framework to compile C and increasing the C++ and the more complex features down to web assembly. Yeah, so small as to some footnote there, some website security policy is still required to run that because of the way of actually running in the same sandbox. It's using the shared array for actually having the memory and also the code down. If you look at the roadmap and the browser support then you realize this thing has arrived. It's pretty ubiquitous. Not all the features are there, some features are beta or just in the process of being standardized. But regardless of where you are mobile or desktop or anywhere like even this Node.js and Vazi like on a server has excellent web assembly support right now. That's the roadmap there. With any decent browser you have at least a subset of that which is good enough. Which gets me to what does it have to do with the LibreOffice here. So in 2015 we had a first look at that because it's kind of obvious to say well LibreOffice is so portable why not port it natively to the browser. And we utterly failed back in the day. It was just the way that the C++ support that was available in MScript was just not enough. I mean there was essentially no decent support for exceptions. There was a lot of problems with the threading. So we tried again in 2021 and we're very grateful to in particular NLNet but also to Collauer and my company for funding that. And we started again and long story short we succeeded to port that. It was quite a ride. It took more than a year. But given the fact that LibreOffice was already very portable, the build system was already there. MScript was already there. A lot of the building blocks were there. Qt which we use for the GUI thing that you see there in the screenshot had already been ported. So lots of those bits and pieces were falling into place. Some of them actually while we did the port. We were actually still quite lucky that in the end we were successful. So yeah and why did we evaluate that after this initial failure? Because things like AutoCAD and Unreal Engine and all those projects had already ported so we were reasonably confident that we were still hitting those nasty roadblocks. LibreOffice has this habit of breaking everything because it's so large and so did we. So for example at some stage we wanted to link and it took like almost 100 gigabytes and took two hours just to link the thing which kind of sucked and then a few months later MScript and Upstream fixed that and we were happy again. So that's the story of that port and with that I guess back to Quailin. Indeed. So yes so now you have LibreOffice's Wasm port. So what do we do then for clabber online offline? Ah so clabber online Wasm so Co-Wasm. So we took all that we do now in that case in the previous diagram you have your server running on the distant far away server, physical server and now in this case we bring the clabber online server side and we run it inside the browser. So you've got exactly the same javascript front end, the javascript front end communicates in the server but the server is actually the original stuff converted by inscription to WebAssembly and running inside the browser. So your server is inside the browser, you have the same architecture but you're only running one document in this particular case all inside the browser. So again the core stuff that Thorsten mentioned and then also port and start out online to Wasm as well. When you're sitting in front of your browser and you click this particular offline button as it is currently at the moment you manually ask to be put into the mode you redirect it to a page that gives you the Wasm that downloads the Wasm for you and if you get that just right you can get the Wasm into size to a particular size and location that it's a one time download. So the next time you go into it it's a particularly quick process. You give it a copy of the document and then it executes inside your browser and your javascript server, javascript client communicates in the server like it normally does. Thorsten mentioned it and kind of mentioned there is a security policy. Whether that becomes really, really problematic because you don't tend to just run clabber online and then of course it's launched from Lex cloud and it integrates back to get its documents from that or any of the other examples. So there's an intricate dense of multiple web applications and multiple web servers and then because of this particular specter flaw the browsers are incredibly paranoid about letting you run Wasm and it's really difficult to allow you to run Wasm and get multiple places working together. So all the links will give some of the documentation on that but it seems to be down to is that the thing that you're embedding into and the thing being embedded both have to agree that they're happy with being mixed together so that one isn't being tricked into being embedded and the other and the other isn't being tricked into embedding something that had no intention of running in the same browser as it. So thanks next loud, thanks to Julius Hartel there particularly for helping us get a bootstrapped up so we can get next clouds which document to provide the appropriate security headers to at least get the first step started and there's a particular pull request there that took us ages to get sorted out. You have to get your initial instigating website to give you the headers then on your side and the clabber online side you have to get matching headers and say that you're happy but you've been put inside of that then you have to get it only really will work if you have clabber online in reverse proxy mode so that all appears from the same server from the browser perspective. That works for Firefox just like that but for Chrome it has to be SSL it won't let you do it if you don't have HPPS. We found out we experimentally applied it that a lot of our own websites pull our logos from a third website and again that third website isn't happy to be included within this process then the logos are broken so in the end then you come up with basically that if you have Wasm enabled we set up some capabilities saying Wasm is in line and then we put all this cascade of headers or it becomes conditional on wanting to support. There's a lot of practicalities as well I mean when it actually runs in your browser things are pretty good or pretty slick I believe but actually building it in person talks about the 100 gigs now if it's improved now you only need 25 gigs of memory to link it so that's pretty challenging when you're unaware of what's going wrong there. Cross-compiling is always a little bit fraught in my particular case I've used the tool chain that I've had pushing together your own tool chain is painful so you have to invest time in that. Threading is a little bit unclear to me what's going on there we're using in scripting and then we have the standard C++ API talking about threading and if my particular machine I inquire how many threads I have it will tell me 20 which is the actual number of hardware threads I have but then when you actually go and run things with threads inside the Wasm you find that it seems to have a practical limit of four so you've got a mismatch there between what two APIs are telling you one of which is telling you 20 threads and another one is limiting it to four. It's lying it's four so that's hard to compile. So you run into problems there you have to watch out for it you have to just limit things down to your four threads and dismiss or ignore what some of the other APIs are saying. This particular one is incomplete you manually decide that you want to go into offline mode and it works beautifully but that's not particularly useful perhaps in the real world because you can't go back online so once your document is inside the browser it's inside the browser and it's not going to be migrated back to online when you're connecting with stores but it does work. I'll just show a little video and though I actually do have it working locally but trust me on that one. So this is your online as usual just about just to just do the two of them and just knock the network down and then I just hit Firefox's offline as well just to get quickly to the servers being disconnected you know type type type. No effort obviously it's going to work because we're not connected to anything. Back online we go back to reload this I believe and we should flip over to the view file and go offline as the magic button click click click. Wappy downloads start instantiating comes back up with a copy of the document. Drop our connectivity again go back speed up the process here and we should be able to interact with the document as it runs inside the browser to its jailed for whatever its slave copy of Clabber Online executing Wappy. It looks exactly the same even because you know it's non-trivial that it looks exactly the same but it does it's interactive you can move around with it and it works fine. Can't save back the abouts as well as it doesn't prove anything but that is the case. I think I have just got one last slide just to say that is the end of. Thank you. Not sure. I think we're doing good on time so any questions or anything. Otherwise I think I could. So I didn't fully understand does it save the document somehow inside the browser but you cannot get it out from there or can you save it somehow there inside it. Once it's in the browser you could manually use save as a download as it's not you can do that but what you can't do is just click another button that says go right right now you can't do it it's not like there's some fundamental problem it's just not done yet. Save as and process continue editing with your local copy of the office on the laptop and then upload it again online or something like this. Yeah. Thanks for talk. I can imagine that the idea on switching from online to offline is that you will be able to switch freely between online and offline. What is there the idea or the plan or the challenges you might find to switch easily back to online for instance when your internet drops out and it comes back. Well it's just a matter I suppose of some of the practicalities of I don't think there's anything fundamental I think it's just probably a matter of time to be let investigate the problem a bit further but if you've taken it offline you're just going to have a little practicality when you come back online if somebody else has opened up the document in the meantime you just come back to the classic problems of just file locking it you know do you go offline you just I mean that's what I would do I just say you know it's locked you know can't can't write back to it till it comes back online whatever it was just file locking but there's nothing nothing new or unusual about that. This is the this is the live version of the Weisman case and we once too I don't know look under the hood and make sure I'm not doing on that online. So let's have another question. I would say there are three people working on the document online and then one is dropping off continues to add comments and what is your strategy of catching up with the changes that have been carried out online in the meanwhile. It's probably the same the same issue again like yeah what are we going to do about that I mean to be decided I guess is the answer there but I don't think there's anything fundamental with that I mean it's the same as if you were back in the case where you have your shared network and you've shared a document and you had your classic clients and they open it up and it's locked by another user or whatever you just have to decide what strategy you want to imply there so yeah an open question really. So first amazing work it's really incredible to see this working I have a question about memory consumption have you have you measured more in detail the memory consumptions and are there strategies potentially to to minimize that memory consumption to like reduce it as much as possible. Yeah. The second question is is there a possibility for having multiple documents open in the same was module. And the first one was memory consumption. We're focused initially on this is the size of it getting the size down the physical size down of the binary which which I guess we've done I'm happy with that memory consumption of it itself haven't really looked massively at it I mean it's the same as memory consumption for any of them. I think this particular computer has just has the four six gigs of memory and there's no particular problems with that if you have one document if you start to know if you want to run at the same time you do an eight threaded build of Libre Office then yes you run out of memory on this machine so it's like one or the other. So no haven't done any particular about that we did have issues with the threading and some of the threading of course was causing memory excessive memory usage. So happy with that. The other one was second question. So in terms of memory consumption so the size that's a footprint of the binary like compared to what you get into some very heavily JavaScript lots of advertising websites it's not fundamentally different let's not orders of magnitude larger or anything like that so so while it's relatively large it's not like in the same age for at least desktop browsers as Quelin said it didn't like it didn't trigger anything but I did not measure like specifically the so for the let's say document itself is not very very heavy but of course you need to load the you need to load the binary into the into the sandbox there was clearly a problem or there will be a problem if you run if you run five or six or seven or eight documents you've got you that's very little sharing it but there's a cache possibly that chest but the actual footprint and the browser tab that's not chat from what I can tell so so valid point. I think the other question was. Yes. To multiple documents and I think the Tarsen cover that today. Did you answer the second question when I had multiple documents? My question was in relation to the first to the last response could we have a cache strategy or something to cache the code that's what I'm thinking to speed up the loading first loading especially or multiple tabs my idea was a browser extension or something where we could put the wasm in there and then load it as as you need it. But I'm not sure if a browser extension allows such use cases. I think we're happy with the caching as it is. Yeah. No I think the download and the caching is seems to be sorted now right the load time isn't fabulous yet but that's just the nature of the beast. I think we might be short on time now. We have one question also. Just to be clear a document is a file on the server and then it's pulled to the client so you might as well access it over another file sharing mechanism as well like SMB or NFS. When you're in your classic online clabber online case the document tends to be in some other kind of a wappy server and you get it from there and online can get it and then it presents it as a file just as a simple file for the browser case so in this wasm case the wasm doesn't do any direct connection back to the original wappy server so it's the online is allowed to get the document. Is the server file is also stored as a file or does it end up somewhere? Well as far as far as even clabber online is concerned it just is asked to fetch the document and is given the bytes for that document it doesn't really know where those bytes come from in that kind of a sense so you know neither of the two really have access to the underlying location where the files are. That's why I look at it at least. Thank you. This one was a bit hard and the next cloud is pulled and so on with the link. Yeah. Where did I put it? Can you repeat the questions? He just wants to see the... There is a long number. Yeah I'm not entirely sure where that's gone now but the slides will be uploaded to the place for that must be somewhere here. Back is up this one or the next one? This one three, three, two, three, two, six, zero. That's just where the propagating the headers that need to be passed around the place to get it up and running. Thank you.
Collabora Online usability optimization
Okay, so thank you for joining. The next talk is still about Collabora, second talk of the day about Collabora, which is about Collabora online usability optimization. And we still have Kaelin, and that was in the previous talk, and also Michael that is joining us. Thank you, Kaelin. Fantastic. This is Kaelin, this is Michael. Good. This is what I'm going to say. You'll see it as we get there. And yes, fantastic. Kaelin did a very good spiel earlier on how this thing works. So if you're in the previous talk, you saw something similar to this, but you have your browser, and then you have a web socket talking to a server on the back end, C++. And this talks to the Librofiskit over a Unix domain socket, which does all sorts of beautiful interoperability rendering, tiled goodness. And yes, this fetches data from an own cloud, an OSIS, a next cloud, a pygmc file, lots of things, any kind of WAPI share point I think we can use even. Yeah, for the good guys, right? And yes, so anyway, so this gets the file, this pushes it in here, it renders it, it comes back out to the browser. And yes, we do all sorts of things to try and cache that. So JavaScript here, good stuff over there. Anything else on there? Nope, nope. Seems pretty silly. And I just want to talk a little bit about latencies. This is an interactive presentation. I'm not going to ask you to put your hands up just yet. But just here are some timings. And the one I want to time is this human eye blink, 100 milliseconds for a human eye blink, okay? Right, so here we are. How good are you at blinking? Are you ready? Okay? So I'm going to press a button and we'll start blinking. And when you see red, stop. But you need to count at the same time, okay? You ready? Silently. Silently. Yeah, yeah, here we go. Ready? Ready? Are you ready? Go. How many? How many did you get? Do you want to try again? Yeah? Okay, so here is reciprocation for beginners, okay? So this is an advanced topic in maths, okay? If you need help. Anyway, so if you're a falcon, you've got like 7.7 milliseconds. So that's pretty good. Me, I'm more about here. I don't know how about you. Six, seven, eight. How many did you get? Do you want to try again? Okay, we're going to try again. It's like, okay, right? You got the idea now, right? Okay, ready? Not completely, okay. So I'm going to click and it's going to go green. Start blinking. And count the blinks you're doing. Blink as fast as you can, right? As many as you can. I want to get a high score here, right? We're going for the Peregrine Falcon 153 in a second, right? Okay, ready? Okay, three, two, you've not started yet, have you? Three, two, one, blink. Okay, that was a second. You had to blink. How many did you get? Five, six, seven, eight. Yeah, okay, fair enough. So this tells you your score. And interestingly, in the UK, they say a blink takes between 100 and 150 milliseconds. In Harvard, it takes between 100 and 400, which tells you something about Americans. Maybe. I don't know. It's slower pace of life is good for people generally. Anyway, sorry. So here we are. So actually, the very interesting thing is that when you start looking at some of these numbers, now on a log scale, so they're a bit more friendly, you know, the blinking is really quite slow. You can go from the Frankfurt to the US east coast and back again in the same time, right? So that's pretty good. You know, the 60 hertz frame time, 16, you know, is also quite long. You can get Frankfurt Milan, Frankfurt London is a similar time to the time it takes to get something on the screen, particularly when you add the monitor latency. So you blink faster than you miss. Lots of people are very worried about latency, and they don't have a good feeling for how long things take. But it's quite interesting to see some of these things. And also, in terms of typing, you know, like the average typist is supposed to be like three characters a second, pro 6.6. Yeah, it's human eye blinkers quicker. But you know, even me typing, not very accurately, it's like, yeah, quite, quite, and if you mash the keyboard, it turns out you're massively faster, like you're 10 times faster than the average typist when you mash the keyboard. It's not, you know, it's not good for it. So yes, there we go. Anyway, I'm going to hand over to Depp, Aquilon, unless you have anything to add? No, no, no, no, nothing to add on blinking. But yeah, the fundamental point that networking is really, really fast and stuff comes from one end to the other and back in a very, very sharp period of time is great. So, you know, don't generally have to worry too much about that part of things. Yeah, so what we do is that we have a bunch of demo servers that are generally publicly accessible. And what I've started, we started in recently is to use perf to sample once a second and record for an entire week what happens on the public servers. And at the end of the week, then we generate a single flame graph from all of that to see what, where, where, where our time is spent over the week generally. That's the demo servers, multi user testing. We have this once a week called some of the people present in the room, join us from that from other people, organizations and, and community members, members. And we just have a general feel as to what it feels like in that little 10, 20, 15 person call for the applications are still responsive or whatever issues arise in testing that can be checked at that point. And that is also profiled and flame graph generated, typically one for writer and one for Calc in recent tests, which are all stuck up in GitHub that you can look at yourselves if you're interested to see the change in time over what we're looking at. We use it internally in clapper, of course, with the deployment that is used daily there and the same week long profile that I mentioned for the demo server is run on the internal one now as well. Yeah, so that's the tooling that we're looking at there. And then interactive debugging, which you have the clapper online, you can do yourself. You just go help about and you trip a click on the dialogue there. And that'll show you up this debugging display that we're looking at here. There's loads of information in it there. The far right inside the tick box as you check them on, certain ones will check on display in the bottom left corner to tell you things. But maybe more interestingly, the one that we're calling the tile overlays. When you type in the documents, you'll get these flashing areas. And that's the part of the document that has been required to be redrawn because of your interaction. So what you're really hoping to see, especially looking at these things is that people are typing and you're hoping to see a small rectangle around the area of change that they're actually making. If the entire screen starts flashing, it means that there's a whole reason other piles of things have been redrawn or been invalidated to be painted to be redrawn later on to avoid that. These are the kind of flame graphs that we look at and the week and just for the purposes of looking at these things, the colors don't matter in these flame graphs or most flame graphs. What matters is the width of the line, the width of the bar, the wider the bar, the more proportionally time has been spent there. What you want to do is you want to take a quick look at it. You want to see which is the widest line and see can you make the wider lines narrower. I mean, it's nothing to the profiling really. It's just make the wide ones narrow. Yeah, so this particular one is in the widest bar there. This whole gigantic pile of boost, spirit, classic, whatever, which is all being used to detect if the PDF that people are opening up is a particular type of PDF, the hybrid PDF that's using LibreOffice where you can embed the LibreOffice document inside the PDF. So when you open up PDF, you also have the original document. It just takes a ludicrous amount of time, especially over the course of a week to collect up that information when it can be done in many orders of magnitude less. Yes. So it's good to see that sort of stuff and disappear off the profile. You should never optimize before profiling, obviously. Cool. Thanks, Will. Storing previous tiles. Yeah, so we've done a whole lot of work to improve our tile rendering performance. We store previous tiles that have been rendered so we can see what the difference is and just send the difference. That saves a lot of bandwidth and reduces latency too. And we've completely rewritten this. Well, how this is done in the last six months to a year. So we've already compressed it, so just a simple run length encoding. Because we're extremely modern, instead of doing stupid stuff like using byte lengths and this kind of thing, we use bit masks. And you'll see why in a second. So the bit mask essentially says, is the pixel the same as the previous pixel? So you end up with a bit mask. We have 1056 square tiles. So in four 64 bit numbers, we can have the whole bit mask for the row. And yeah, it's pretty easy. This removes a whole load of things. Previously, we stored them uncompressed. We compared them uncompressed. Turns out to be massively slower. Touch is much more memory. It uses much more space. And we also did clever things to hash each row as we did that while we were copying. But it turns out this is far better just to use the bit mask and some of that stuff. And, Koel and I did this fun thing with AVX2. Why not? You hear about these processor accelerated things and after shrinking our inner loop down to almost nothing, it's still not as quick as it could be on the CPU. So this is how we do it. We load a whole load, actually eight pixels, into a whole single AVX register, which is just kind of nice, right? Eight pixels at a time. And the problem is we need to compare it with the previous thing. So we shift a bit off the end. We shove the previous one. We shift it along, although actually it's really a sort of, yeah, it's a crossbar switch here that you permute to move things. There is no shift in AVX registers that does that. And then we just compare these guys. And that gives you a whole load of either whole all ones are all zeroes. And then comes Koel on magic trick. Well, yeah, in AVX, there's the AVX2, which is like practically available. But AVX512, which is not practically available, has a particular call that you can do that will compare the two things for you and give you that bit mask, which is not available in the AVX2. And if you look at what's available, though, you can guess if it was done in floats, then the number is basically available for you. So you cast it to floats, and you do this move mask thing brings your top bits in and gives you what you were hoping for in the first place, which is just an individual bit result for each individual pixel that you've compared, whether they're true or not. And you can basically so compress, pull the bits you're looking for out in no time. It's great. Which is pretty awesome. So, you know, you convert this into a floating point number, and you get the sign out of it. And that's your that's your orally bit mask. So the nice thing about this is there's no branch, there's no compare. There's nothing. There's a simple flat loop with about five instructions. At the end of that, we then have to work out how many pixels to copy because it's all very well saying these the same, but you need individual copies of those different pixels one after another. So a bit of a pop count will count the bits in the mask. And then with a clever lookup table, we can also use this. Yeah, this clever instruction shuffling instructions to shuffle the things in that we need to copy them out, stack them up. Bingo, twice as fast, which is nice. And hopefully AVX512, you know, will make it even even faster if you believe that you'll believe anything. So yes, here we go. So this is a real problem here. And if only we can find the idiot responsible for you. We don't need to suggest. Yeah, no, what's sometimes interesting is that, while I said earlier narrow was better, sometimes it can be interesting to see that wider will be better in the sense that when you look at the flame graph, what you should see is individual threads should all be positioned separately. So they shouldn't be, you know, combined with the main thread. So if you're not seeing work that you expect to see happening in a thread on the left hand side, basically, of your flame graph, then it means the threading isn't being used. So it becomes apparent that while there's this code that attempts to do this threading for doing this previous delta stuff, there is no existence of the threads and there's a flaw that needs to be sorted. So when you fix the flaw for the threading and bring it back in, you see then on the far left hand side, because it's rooted in the threading area, all that work is put on the left hand side separately in the flame graph. And while it's wider, it now means it's operating in a separate thread and you've made progress. So it's nice to get twice as fast and then four times as fast on top of it. That's the right sort of approach. Yeah, I think we're going to skip through some of these because we're running out of time. But working out where to do the work, either in the browser or not, and I'm pretty multiplying and the stupidity of the web and having an RGB, un-premultiplied alpha API. When it's almost certainly going to be premultiplied underneath its hood. Yeah, underneath the hood, all the hardware, everything is doing premultiplying because it's so much quicker. You can see the complaints online about people pushing RGBA into the canvas and getting something out that isn't the same because it's been premultiplied and then un-premultiplied. Anyway, there you go. The web APIs are awesome. What else? What should be on your profile? Well, it's very hard to know. This could be okay. Here's a whole lot of un-premultiplication here. It's a very old profile. It's a time, but hey, there's a lot of rendering on the profile. Not very much painting, lots of delta ring, so we fixed that. But actually, it's very hard to know if this is good or bad looking at that. Actually, with lots of bogus invalidations, you start to see lots of rendering and that's not what you want. So everything should shrink and you'll end up with a profile that looks the same, but everything feels much quicker. So we've done lots of work to shrink, I guess. Mr. Enders, do you want to pick a couple of these now? Yeah, just as you mentioned, with multiple user document tests, we have kind of basically monitor what's happening. People are joining documents. We got that full document invalidation we mentioned about happening. Clicking in headers and footers were causing the same things. I think fundamentally, because the invalidations and redrawing on the desktop has become so cheap, while in the past, the very distant past, we might have been pretty good at keeping validations down. In that case, we've become slack in recent decades and now we've treated it as cheap and that has affected things. So let's kind of have a look at that again and bring things down to smaller rendering areas and less invalidations. Yeah, and the good news is that improves LibreOffice, of course, as well. It's more efficient and clean on your PC as well underneath. So good. We've done lots better latency hiding in terms of more aggressive prefetching. So the next slide is there before you switch to it. So it's absolutely instant. Hiding latency in those ways is quite fun, enlarging the area around the view and maintaining that as tiles and just storing and managing much more compressed tile data in the clients that we manage much better now. This is a fun one. But we don't have much time for it. Yeah, well, God, classically, standard list and C++ was always a standard list. And if you wanted to get the size of it, you had to like pass the entire list from start to finish. That was sorted out decades ago. But for whatever reason, for compatibility purposes, if you use the particular Red Hat developer tool chain, then you seem to get the classic behavior or standard list back again. So when we were assuming that you was cheap and cheerful to get the length of a standard list, it turns out to be not the case with this particular case. So you have to go back to a different approach and it appears in your profile like that. But again, it looks normal that it should take some time to draw things. And it's normal to have a cache to speed that up. But if the cache has got 20,000 items in it, and you're just walking this list, you know, point it, chasing anyway. So gone. Oh, fun stuff. Like why not have a massive virtual device in the background that you could render to instead of the whole document every time you do something? Not great. Or another one, why not have a benchmark every time you start the document to see how fast rendering is, allocate a whole load of memory and dirty it, you know? Great. Yeah, trying to cache images. So we didn't bother catching compressed images because they're compressed, right? So why bother? They're small. They're good to have memory, except TIFFs not so much compressed, you know, you eventually have the whole massive chunk of memory there. Using G-Lib C trimming functions on idle to reduce memory usage. Yeah, trying to get better measurements of various things. Yeah, this is a fun one. Well, oh, this is the S-Maps word. Yes, yes, yes, we're reading the proc S-Maps to see how much memory we're using. And the classic S-Maps has got multiple entries in it for many, many parts of your process. So you just read multiple lines. So there's a relatively new one that has it all pre-edited for you. ProxMaps roll up, which is exactly what we want. Same code to read the previous one should work with the new one. Then apparently we're running out of memory, or it's being reported that we're running out of memory, and it's all very, very bizarre. You can't proc S-Maps roll up yourself. The numbers are good. There's something very odd, but it turns out that if you seek back to the beginning and then read again, that the numbers double every time you do this. There's an actual bug in the original implementation. It's not there in the kernel, my version 6 kernel, but it is there on version V18 or 16 that the servers were applied on. So you have to be just the right version for it to appear. So Linus fixed it, thank God. Quillholt found it. Well, it was fixed before we found it. But it's always nice to know you have to check your kernel is the right, you know, is the quality kernel before you start asking it how much memory it's using. Yeah, hunspell in the loop was almost entirely dominated, not by actually spelling things, but by looking at the time. You know, I'm sure in a bad talk, you know, it's quite similar. But that's a little bit unfortunate. So yeah, some improvements there. And lots of other things, graphs showing speedups. We've got to get to usability in the last minute. Let me whizz through this then. Here we go. Accessibility, dark modes, pretty pictures. This is going to be fast. Keyboard accelerators. This is all of the good stuff for people. Screen reading, and all sorts of nice things, videos of that. Better page navigators at the side so you can see where you're going. And lots of just little bits of usability polish, a nice font previews. Was this your page number thing? I forget who did that. Making it easier to insert page numbers so people can see, you know, what's going on easily, better change tracking and showing changes, AI, depot, stuff, and hey, some some. The good news is there's more opportunity for performance improvement. So we're still, we're still having fun. You know, hey, come join us. There's some cool play files to read. Right. Well, yes. At the moment, in Calc, when you're typing the entire row and validates beyond the right hand side of where you're actually typing. So we brought that down to the self in the most generic case, but it's not done for writer. In the writer case, if you're typing, we are invalidating all the way to the right hand side of the screen. So we'll bring shrink back back down again. We have some new metrics that we've included in that debugging overlay thing that give you an indication of, you know, how much of these updates that are coming through are the same data as they were before the update came through and the numbers are staggeringly high. So there's plenty of room for improvement to validate less, send more data down. So what we have now is fix, uh, approval. Yeah. The moment that's always been troublesome in, uh, Lear Office is the treatment of the alpha layer. We picked the wrong direction than everybody else does. Everybody else picks transparency. We picked opacity or vice versa. So we have the opposite direction. Everybody else would want to actually output something in the real world that handles transparency. We have to like reverse our transparency. So that's problematic. That's, that's now fixed. That one is fixed. That one is fixed. But then we've also kept our transparency layer in a separate, uh, bitmap, a separate buffer than an actual bitmap. And if we put them together someday, that would make things a lot easier, I believe. Yeah. It's the Windows 16 API decisions that are still with us. But anyway, we're getting rid of them quickly. That's great. Um, yeah, performance regression testing with Valgrind, uh, pipeline loading. So at the moment, oh, we got five minutes. Oh, look at that. Fantastic. I went too quickly. No, you're doing fine. Okay, right. Fine. Excellent. I think we're nearly the end. Um, so pipeline loading. So at the moment we have, um, we, we essentially fetch a, fetch a webpage that passes all the credentials we need to check ourselves. We'd load lots of JavaScript. We open a web socket. Then do we actually see if we can actually load the document and start checking who the user is? This is really foolish. I'm taking on a first start, we can be, you know, checking the user, downloading the document, even loading the document ready to get the web socket and then have a pre-rendered version. So this, this is very substantially reducing, um, startup time to make it incredibly quick. You already have a huge advantage that you have a real server at the back end and you're not having to jit, you know, millions of lines of code in your browser from JavaScript into something or, you know, web assembly into something. Um, so it should be just amazingly fast. And so this is a great way to, to speed that even further. And, you know, and a real server, you may have a time share, but you know, when you arrived your server, it's probably not doing much. In fact, the CPU cost on most of our servers is extremely low. So, you know, there's suddenly all these threads ready to render your document and get, get stuff to you quickly. Say some good things. And Valgrind, we've done a whole lot of work to get, um, it to run nicely under Valgrind with our privilege model and container model. It's a bit of a problem. Uh, and so we have some code now that turns everything into one process. So you can load and collaborate on one document and automate that, but you can run it in, in Valgrind. And why do you want to do performance profiling in Valgrind? It seems like a retro, uh, poly, right? But the beautiful thing about Valgrind is the simulated CPU. So anybody can run the same workload on their machine and between two runs, it's the same thing. And Valgrind luckily doesn't have a simulated thermal management system that randomly throttles your CPU, uh, performance. And it luckily doesn't have people screwing with your cache memory and running cron jobs in the background and, you know, thermally recalibrating your disk and all this other stuff. So what you discover is that between two identical commits, you're getting fractions of a, small fractions of a percent difference in the Valgrind numbers, which is beautiful because performance tends not to go away in big jumps. Like we can, it can go in big jumps, but it tends to go slowly downhill. And if the noise is bigger than the slow downhill, you've no idea where the problem is. So much better to have a little series of steps going down in one half a percent at a time and go, hey, we get rid of that and that. And did you realize and, uh, so, so this is really vital. And LibreOffice uses this on its perf, um, automation has been beautiful web pages with graphs. Um, and we'll, we'll be applying to, to collaborate online to, to try and avoid regressions. Yeah. Someday soon. Someday soon. Yeah. Neil, Neil Lazzone, we think probably. Anyway, anything else? No, I think we've covered plenty. Well, so, and yes, of course, we can't do anything without our partners and customers that pay for it all, blah, blah, blah, commercial plug. Good. Yes. That's good. Job done. And conclusions. Yes. So, uh, computers are unbelievably fast. I mean, like this is something that you should take home. You know, like the quarter of a nanosecond that your four giga hertz processor takes is just unbelievable in the scale of a hundred milliseconds plus. It takes you to blink your eye. It's just fantastically speedy in a way you can't explain. Uh, the network latency to anywhere almost, you know, you can go three times, uh, London to Frankfurt and back in the time you can blink, right? Like it's, it's unbelievably fast. In fact, you can go, you know, Frankfurt, Milan faster than your monitor can refresh, right? So, so like, it's quite amazing when you start looking at the times of things. Um, architecture is really a bet on CPUs and networks getting faster and cheaper. Has anyone noticed a trend there? I think there might be something in that. And, and we're basically racing the hardware guys. I mean, you know, we, we do stupid stuff, obviously, and then we remove it later. But, you know, the hardware people are also trying to beat us to run stupid stuff quicker. You know, that's their mission. And, uh, yes. And, and we extremely smooth. Don't get the feeling that it's bad. Try it. You know, most of these problems, you'll only start to see them when you have 20 plus people collaboratively editing in document. So, uh, yeah, it's, it's kind of, it's kind of cool. So give it a try and try the latest version and see, give us some feedback, get involved. And there's lots, lots of fun to get involved with. I mean, I don't know. Yeah, I'd like us to play two things. As I mentioned earlier, the profile that we have for Calc and Writers uploaded to GitHub once a week, generic Calc performance profile, generic writer performance profile, search on the online GitHub issues. And you can see all of the, the chats that we've mentioned there in the past. And you can even see with the progress there and the occasional blip during a call where things go horrifically wrong and get sorted out in the next one. So yeah, plenty to see and see what we're doing. There's some links in the slide. You can't see to the profiles and get involved in the Libre Office of Technology. Thank you. That's it. You've been very patient. Thank you.
Document collaboration made simpler: Revealing the concept of rooms in ONLYOFFICE DocSpace
So, we're going to be presenting the rooms in only office doc space and it's going to be presented by Alex. Hello, hello everyone. My name is Alex. I'm with only office, so this is my second time here at Forstum in Brussels. Thank you a lot guys, so the same clusters. That's everything is brilliant. And today I'm going to take you through the challenges of only office and of document collaboration and how only office can help you to overcome all these challenges. So I will also be introducing you to some existing features in only office and some updates. So let's get started. In our today's world is very important to work online and if teams have a lot of documents so the things can get messy very fast. So for us developers it is important to create and for users to pick a good solution for organizing online document collaboration. We at only office have more than 200 unique connections, unique integrations and we have a long, long list of requirements from the people who are trying to integrate our solutions into their services and that's why we are absolutely sure that we know almost everything about the integration. So wondering what is the most interesting cases, here are our points. So first we want to make sure that we save your time and efforts by automating the everyday routine. Having a lot of features is very important and it should be easy to add new features if needed. The next thing to consider is support for all popular file formats. If a software is built on their up-to-date technologies it will most likely be reliable, suit user needs and even have some killer features, so like we do. And of course security will always remain one of the most important questions for everyone who works on their documents online. Cross-platform apps give us ability to work on any document in any browser, from any device and remote workers need to be able to work together and all these challenges can be difficult to overcome without using write tools but only office can ensure your effective teamwork. Talking about usability, we want to provide real-end user experience. If a software is easy to use it boosts your productivity and accessibility, how it is easy for people of all abilities to use your product. If software have lots of bug fixes and updates it indicates that it is well maintained and here big thank you to our community which plays significant role sending us information about bugs, troubles and of course feedback. And last but definitely not least is availability. We are constantly increasing the number of distribution forms or builds of our products. Considering all these factors we at only office decided to create a new product, only office doc space, a product for organizing secure online document collaboration. Actually today is the first time when we are talking about that product. I was talking about that at FOSS Asia in 2023 but there was only better version available and now we have already a ready to go product that can be integrated and is already integrated into many well-known services. So before we dive into each factor I would like to share the history behind that idea. So the journey started in 2021 when we decided to rewrite, to completely rewrite our productivity platform. I am talking about that platform I mean the package with CRM, project management, mail clients and many many other features. So the main idea was to implement infinite scalability and in that same year we released free for personal use app server that made it faster, more stable and more functional and the idea shifted slightly. So we decided not only to rewrite the architecture but also to change the mechanisms of working with documents and in 2023 we released on office doc space. The main point for everyone who tries to integrate our solutions into their services is that there can benefit from our extended experience in the integration. So many office solutions and productivity platforms when working with the files create a mess of unstructured files, folders, subfolders but with only office doc space you are able to create rooms, doc space rooms which allow you to clearly structure your files depending on your requirements on the project goals. And when you have a long to do list every day so it is smart not to waste much time on every day routine like creating sharing or anything with the files. In only office there is no need to work with each file individually. You can create a room and all files within that room will be available according to the access level of the room. So there are few types of the rooms at the moment. Let's start with collaboration room. These rooms are perfect for those who are trying to work on the documents together to co-edit their documents. So here in these rooms you can make use of all beautiful co-editing features of only office software like using commenting, mentioning track changes, using revision controls and many many other features. So we do have built-in chat and telegram plugins right within the editors to communicate and we also allow to make audio and video calls using plugins for Zoom, GCE and Rainbow. So when inviting a user into your room you are able to set the access level. It may be administrator with full rights, power user with extended rights for editor or viewer. So the next type is public rooms. You can invite anyone using public links and what's very important there is no need to register somewhere and you can generate multiple links with different access rights. But you also are able to apply password for all files in the room or for example restricting the copy of the content of the file is available here or just the downloading and printing can be disabled. Yeah and there are also custom rooms that allow you to apply your own settings for any custom purpose you have in your mind. Again here everything depends on your use case. You can create a room for requesting form filling, for requesting commenting or document reviewing. Everything depends on your use case. So on the Office Doc Space includes different viewers and editors for all file types. Let's start with the digital forms. Here you are able to work with your forms with your form templates in DocXF or PDF format. So these PDF forms can be filled in, can be shared with anyone for filling or you can create or work with the files created by alternative applications of course. So you can easily view, create or edit text documents. I hope you are aware that only Office works with almost all text file types. Office Open XML files are supported but if you'd like you're welcome to work with Open Document Format or RTFT, TXT, HTML or anything else. The same for spreadsheets where you can work with your sheets and use more than 400 of different formulas and functions. You are also able to create slides using a variety of different animations, transitions and different objects. And now to PDF again. So PDF is widely used in the document workflows from meeting brochures to contracts for signing. And now you have PDF editor. You are able to annotate your PDF files using only Office editors. You are able to work with your PDF. You can convert your Office Open XML to PDF and vice versa. You can convert PDF files to Office Open XML to edit them. Additionally you are able to work with your electronic books that can be converted. Next feature is integrated media players for working with images or video and audio files. So the functionality of the described solution can be extended by using plugins and AI integration in the form of chatGPT plugins. So you can work with your text, you can make simple requests, generate keywords or images. Everything depends on your license with chatGPT. So if you do have chatGPT license you are able to work in the editors with your paid functions. If not just work with a free version. So Office Docspace is created using up-to-date technologies. We do use .NET Core and .NET Server to ensure reliable backhands and for frontend we do use React to make sure that everything is mobile friendly. Office Docspace is a safe way to handle your documents. So we follow all GDPR and HIPAA rules treating your personal information very carefully. With flexible permissions and JWT you are able to have complete control over your files but you are also able to add password protection or watermark everything you have in your room. For data and transit we do use HTTPS of course and for data at rest we do use industry leading AES256. And moreover the administrators can allow some additional settings like trusted mail demand configuration or session lifetime configuration or for example use two-factor authentication or single sign-on to have control over the login procedure. And of course backups and recovery are also here. So I'm glad to inform you that in the middle of 2023 we have included on the Office Docspace and our main HackerOne program. We received few reports and all these reports have been fixed in a timely manner. So thanks to Ethical HackerSchoolWorks with us in the HackerOne. On the Office Docspace is primarily designed for web-based operations and we understand the importance of using it on mobile devices. On the Office Docspace offer a user-friendly interface with interactive negation between rooms and settings for example. And we conducted several usability tests with more than 200 people from different countries and different industries. So and according to their feedback only Office has overall usability score 4.1 and the main advantages were so simplicity, clarity and modern interface. Of course you can customize your product. You can change the space name and URL of the portal, Docspace portal. You can change a color scheme and to support your corporate style you are able to use your own logo or to change the welcome page. As for accessibility on the Office Docspace and on the Office Editors are designed to accommodate users with spatial needs. So there are few options like screen readers, hotkeys but we do also support different plugins like voice to text, text to voice or translation for example. There are a lot of different plugins. For developers and integrators on the Office provides the ability to extend the functionality and here you can find the information about our plugin SDK. So you are welcome to create your own plugins and we do have few plugin samples on our GitHub page. For example PDF converter allows you to convert your PDF files to Office Open XML and vice versa as I said already. And the next one is draw IOR plugin for working with your professional looking diagrams. There is a plugin available that converts your audio and video files to text and of course you store it into your rooms. Open API documentation allows you to see how to integrate on the Office Docspace rooms into your product and to give your visitors, I mean visitors of your website ability to view and interact with documents right on your web page. Docspace rooms can be integrated into your service as an iframe. So the same we already have with only Office Docs, just iframe and of course data display settings can be configured. And now the main point, so there are a lot of services without document management functionality, without document editing functionality or without any cloud storage functionality. And only Office Docspace allows you to add everything you want here. All these features are available and anything can be used here. I mean it can be integrated into CRM, into CRMess, into any messenger and the next example is one of the most popular collaboration solutions on the today's market. Just an example. So I'm glad to say that we have only Office Docspace for Zoom integration. So just go to Zoom marketplace and look for only Office Docspace. You will be able to install Docspace for working on your documents right within the Zoom meeting. No additional actions like registration are required here. I think that everyone can remember him sharing a document and saying, okay, let's write it down together. So I mean in the Zoom session, but with only Office there is no need to share your document with anyone or to give someone access to your screen to work on the documents together. So just use only Office Docspace application. The same for WordPress. Only Office Docspace can be integrated into your WordPress pages. These are just two examples that show that our product can be integrated into any service. In 2023 released on the Office Docspace 1 and on the Office Docspace 2 with more than 50 new features. For example, public rooms are available now right to left interface and better and system plugins are supported. And for online editors, as the part of Docspace platform, we delivered three major updates with more than 200 bug fixes and about 200 new features. And the last version of only Office Docspace is available now. So release just a few days ago. So in the latest release on the Office Docspace 8, we have few very important features. Again, fillable PDF forms. We have PDF editors right now. Another interface for plugins has been updated and we do have long awaited right to left interface. That's very important and we understand that we have a long, long way to go with that functionality. I mean right to left. This is why we are looking forward for your feedback about that functionality right to left. I mean, and we really need feedback from our clients and integrators. The next point is our performance optimizations. So here you can see some numbers. We have moved some portions of the service to the client side. So and I think this definitely will add some more points to only Office editors. And thanks to our partners from own cloud, we have now the lower testing results for 100,000 of simultaneous connections. Having 100,000 of simultaneous connections, I mean that all these connections are active, sending some information from your client to their server. You can see the details of the infrastructure. Everything is in Kubernetes. So 12 big machines for documents server and two big machines for K6 just to generate that huge traffic. On the Office Docspace, we'll soon include private rooms. We are also going to implement electronic signatures. And there are more features that we plan to add. It can be used on the Office Docspace, can be used as a cloud solution, just look for on the Office Docspace. And if you'd like, you are welcome to install it as an on-premise. So in Kubernetes or any type of deployment. To try in the cloud. To try download in server version. So and thanks a lot for your attention. If you have any questions, we'll be happy to assist. We are here to guys in this on the Office t-shirts. Yeah. Thank you. Thank you very much. Are there any questions? Yes. Just a question about interfacing type of document worksheets. And so also because the documents in database formats and Writers and so on have problems to with Google from Google Docs to Google Sheets, it doesn't work. And so you have this kind of functionality. And also it was very interesting for me to gain a lot of text, a lot of time with converting speech to document. So there's a first question about whether there's an integration between sheets and documents and the second one about converting speech to document. So the first question, yes, we do have. Sure. Yeah. We do have plugins to add that functionality right within the editors. You can work with that. But you need to install an extra plugin for working with that. So and what about the integration between two types? Yeah, two types of the editors. Yeah, I see. But as far as I understand, you are working with X-Wiki right now. And no. No. No. No. Yeah. It's a free question. Yeah. And no, I mean the product. I mean the product. So maybe you just have one of the previous versions of OnlyOffice, for example. And there was your question about interface. Yeah. There was your question about interface. Yeah. Yeah. And I think that your question before, you are before. So you will be able to find a solution in one of the next version of the OnlyOffice. And what about the integration between these two types of editors? Yes, we do have that again in the latest versions of the OnlyOffice. For example, let's try work with OnlyOffice docs 8. Great. Any other question? There was somebody there I think. No, yes, no. Okay, thank you very much. And. And. And. And. And. And. And. And. And. And. And. And. And. And. And. And. And. And. And. And.
openDesk - The Open Source collaborative suite
Okay, so welcome to the talk about OpenDesk, the open source collaborative suite presented by Klemah Obang from Exviki and Vila Lydantal from Open Projects. Enjoy. Hello. Hello, everybody. Thanks for coming. The funny thing is we are not OpenDesk. We're just vendors. We're just contributing to OpenDesk, but we come to this later. Yeah, OpenDesk, what is that? That is an idea of building an alternative to Microsoft Office 365 or to Google Apps that come to the public sectors. They are used everywhere, and public sector wants to have an alternative to that. So if you really want to go after the big elephant that is not in the room, we're trying to create an alternative to that. So probably it's the biggest opportunity for open source software right now. Let's say at least in the realm of collaboration and working together. So OpenDesk is a powerful initiative of the German government with a goal to provide a serious alternative to the proprietary Big Tech establishment. It unites independent open source software vendors to create a sovereign workplace tailored for the public sector. We too here, we are just nerds. We're just software developers. I'm from outside. I work for and co-founded Open Project. So that is one of the parts of the solutions here that we will talk about. But I'm software engineer. I'm not OpenDesk. And this is Clément. Hello. So, well, you maybe have seen me during the room this morning. So my name is Clément Aubert. I'm an XQQ committer and I also work at XQQ SAS to do sales mainly. Okay. So well, let's start by discussing a little bit about the story of OpenDesk and how we got there essentially. So the issue with essentially collaborative suites goes a long way, right? Since 2015, especially in Western EU, so when I'm talking about Western EU, it's mainly French and German government because from more point of view, that's where we have the most information, let's say. Since 2015, what we see is that there are growing concerns when it comes to the US-based cloud offering that exists in order to create collaborative suites. And these concerns are mainly regarding the fact that you don't necessarily have control on your data. You don't know exactly where it is being stored, how it is being processed. There are especially privacy risks. If you are putting sensitive data, maybe they could be accessed in different ways. In particular, since 2018, there is an extra territorial law that exists in the US that allows the government to ask a company to access customer data, even though the data of the customer may not be from a US citizen. Just for, let's say, well, usually to get more information about that customer. And then there is the big question of locking, essentially, when you migrate your data. How easy it is to make it back? Are there any open standards that exist in order to do it the other way around? These concerns exist. And so actually, since late, well, the end of 2010 and beginning of 2020, French and Germany have started to create some rules when it comes to the handling of critical data in their government. In France in particular, there is an initiative which is called Cloud Nubu and Cloud Pi, which is essentially two cloud specifications that are used for public administrations. One is for, let's say, conventional data and another one is for more sensitive data. Germany also started another initiative, which is the Deutsche Verwaltungs-Cloud strategy, which is, let's say, kind of the same. It creates kind of a standard in order to protect the data that is used by public administration from the external actors. So this is essentially, to be clear, this is essentially infrastructures or definition of infrastructure that should be implemented by the state so that on the long run, states have the capability to host some data securely for them. In the meantime, there is the question of having security certifications, basically making sure that the different vendors that will provide a specific service for you, they have the necessary amount of security to provide that service. And so in the same idea, there are two standards that were created over the past years. In France, we have SecNumCloud. SecNumCloud has been created by the ANC. It's basically derived from another security standard, which is called ISO 27001. If you are doing security, you may know about it because it's well known. And what SecNumCloud does is that basically it takes most of the rules from ISO 27001, but it adds also controls when it comes to the nationality of the people and the location of the people that are allowed to process the data. The whole goal of SecNumCloud is to protect from extraterritorialities, in particular the US Cloud Act that I mentioned. In Germany, there is also another certification, which is called BSI C5, and we will talk about it afterwards because it's quite important. The goal is to basically be able to qualify a specific application that you want to deploy on a cloud offering that you want to deploy to be used by public institutions. So C5 certification is a little bit different in the sense that it's not about extraterritoriality at this point. It is more about basically making sure that the application is correctly developed. There is a good standard of quality when it comes to the change that you are applying to your application. You validate its compliance, etc. On the long term, so if you are residing in Europe, in the EU, there is a vision to create one unified standard, which is mainly based out of SecNumCloud and C5, which could be called the ES Cloud that would encapsulate and basically allow any vendor to qualify to this standard in order to be deployed in the different governmental public administrations in the EU. So that's very nice. We are actually introducing new laws that allow us to control what can be put and what can be used by public administration, but in the meantime, we may not be creating solutions here. So that's another issue that has been tackled by France and Germany over the past few years. In France, over the past like in 2021-222, there is a project that was started that was pushed by the DGE, which is the Extranger Neral des entreprises. It's a branch of the finance ministry. The goal was to create different consortium that would lead the creation of an alternative suite to things such as Office 365 or Google Workplace. The ones that Wilhelm talked about. So it's a project that has been started in 2023. The total of the project is around 23 million that has been invested by the state for the three consortiums. The idea is to have results by 226. Now we will not talk about them that much because it's actually not based on fully open source software, so it's not really the scope of this talk. The one we are interested in here is the project from Germany, which is OpenDesk. And the idea of OpenDesk is essentially, so it's a really different approach. In Germany, there is the Ministry of Interior. I will not say the name in German because it's going to be a nightmare. The Ministry of Interior decided in 2022 to create a consortium made of different actors, so we'll see them afterwards. But essentially, Dataport, which is a big service provider for public administration in Germany, as well as a couple of vendors, software vendors, are providing open source software. And the Ministry decided to group them together in order to create a platform which is coherent, fully open source, and can basically last longer and be maintained over time. So to give you an idea, in France, the financing for the three consortium I mentioned is like around 23 million, knowing that there is a little bit of a loan there, so you have to reimburse it. And then in Germany, it's basically orders, so it's not a loan, and the budget into 23 was 23 million, so a little bit more budget. So if we go into the details of OpenDesk, so the project was initially started by the German Ministry of Interior. Today it's being handled over to Zendes, which is also a public organization that has been created in order to handle, let's say, open source projects that have been created by the federal state. And so, as I mentioned, the project is currently co-managed with multiple actors, so there is the Bundesministerium, the Ministry of Interior, the BMI, PWC, which is helping on trying to have, find use cases, basically, find the correct user stories, the issues that we are trying to solve with our collaborative workplace. Data Port, which I mentioned, which is providing, basically, hosting for the project, and is also doing a lot of work in order to organize the different vendors all working together to create a unified workplace and create a product that works. And then we also have Bechle, which is present for helping on the financing of the project. So we talked about the vendors today in this project, in the OpenDesk project we have around a little bit more than 500 people that are working from both the different vendors, PWC, BMI, and Data Port and Bechle. So that's quite a large project in the end. I will get to the name of the vendors afterwards, but this is just kind of a quick view of what we're trying to achieve, right? Basically we're trying to achieve one solution, one workplace that fills in different needs. We want to have email management, of course, you need to have emails, you also want to create events, so you will have calendar modules, contacts, task management, we want to also have the file management parts where we can create new files, collaborate on them, and then you also may want to continue working on your projects, so develop projects within your organization, for that we have a project management tool, a knowledge base tool, and you also want to communicate with your co-workers, so there is modules about chats and video conferencing. So all these projects, the idea of OpenDesk is essentially that these modules should be made of solutions that can be switched easily. The ideal vision of OpenDesk would be that basically you have a software which is providing the email functionality, but let's say that tomorrow you want to switch it, you don't want to use the default version, the default software that is provided, then you should be able to do the switch fairly easily. So in practice today we are providing some sort of a default implementation, meaning that we have a couple of software that correspond to each of these features, and we don't really have two options for file management. Yes, that's a very important part because the German state don't want to get back into the vendor lock, so that's the reason why it's talking about mail and not about a certain vendor. Exactly. So if we look a little bit more in the details, so when it comes to everything which is related to email, agenda, contact management, calendar, and task, OpenExchange is handling today that big part of the project. When it comes to file management, it's mainly in NextCloud, and in NextCloud when you want to collaborate on different files, we have two external tools that have been integrated, so Collabra for which we had a talk a couple of minutes ago, and also CripPad, which we had a talk this morning, so they are used in order to edit basically office files, and for CripPad it's for editing diagrams today. When it comes to the communication, so we have Element which is handling, so Element based on Matrix, which is handling everything which is related to chats between teams, and there is a clever integration between Element and GC, the integration is provided by Nordic to basically allow to have video conferencing and basically rooms within Element where you can start calls and make calls and chats on a specific subject. Then on the project management capabilities, there are us, so OpenProject as part of the project management tool, and NextWiki for knowledge management. I would say that we are rather the, OpenProject and NextWiki are kind of the latest projects in the project, right? We arrived at least in NextWiki, we arrived at the end of 222, so not that far away, and not that long ago. Finally, so this whole portal that you see here, it's managed by Univention, which is another solution that allows to create user portals and it's also handling access management, authentication, user list, et cetera, et cetera. Single sign on. Yeah, single sign on also. So today the development of the project is managed by the hosting of everything which is related to the development of the project is managed by Dataport, so thanks to them. So we'll get to, at some point we'll try to do a demo if we have internet, but let's go a little bit more in the technical details. So today the architecture of OpenDesk is mainly based on Kubernetes, mainly because we are integrating a lot of different components, and at some point it was decided to use Kubernetes because it was the easiest basically to integrate all this complexity into one big package. So we are using Helm charts and GitLab CI to deploy Kubernetes cluster, and then basically each component of this cluster is one application managed by one vendor. Each vendor can provide either a generic Docker image that needs a little bit of configuration in order to work within the context of OpenDesk, or sometimes some vendors will provide really a tailored application with, so a tailored custom Docker image that meets some specific requirements, and well, so sometimes there are also other ways to deliver basically the applications. Yeah, a quick look at the features I may not have completely talked about. So of course it comes with a user directory managed by Inevention. All the applications are connected between each other through OpenID Connect, so that's very easy in some way and very standard, and we also have unified navigation across the different applications, that's something I want to show you in the demo. And finally, the goal of the project is really to make sure that all these components that we are integrating, they are connected together like fully. And so that's going to be very important. I would say it's a work in progress, but we'll see in the later part of the talk that we have, for example, examples of integrations between OpenProject and NextCloud, which are very exciting to see. Yeah, a little bit of a note. Also, when it comes to the distribution of OpenDesk, of course, it's made of OpenSource software, but the whole build itself, the whole project itself is OpenSource. And actually, you can access it on OpenCode, which is a GitLab instance that is used by the German state to publish its OpenSource projects. So there you should find basically a mirror of all the source code used by all the components of OpenDesk. But we will also find dedicated repositories with all the handcharts that are needed in order to deploy the platform. Another thing in order to be secure, to provide some security and compliance, so every release is signed. We create software builds of materials. We audit also the licenses of the different components that are being integrated within the workplace in order to make sure that we are really completely free OpenSource. The idea is essentially to have something that will work for the German administration and working for the German administration means being BSI C5 compliant. So as part of the project, we also have a little bit of work in order to help each component be, well, match the certification in terms of quality of development, for example. Finally, and maybe we'll talk a little bit about it later on, there is a big concern on accessibility. It's one of the things that we see that I'm the most talked about about OpenSource software. It's nice. It has a lot of features, but maybe sometimes it's not fully accessible. Well, as part of this project, we have to match also some accessibility guidelines in order to be used for the public sector. So this is also part of the final thing that we get from the project along with security. On the offering side, so the goal, the long-term goal for the project is essentially to have, today there are plans to create offers for the public administration in Germany, mainly through two entities. One is Zendes that we talked about, which is currently, let's say, coordinating the project at a very high level from the federal government perspective. And also DataPort, which is participating to the project as part of the project management, but DataPort has its own also suite, which is basically a fork of OpenDesk with some extra components that are not necessarily OpenSource that have been added or modified in order to answer some specific needs. So, yeah, today mainly offering for the German markets, but not really for the rest of Europe. So that's going to be a challenge for later. But there's interest from all over Europe in the product, so the French government also is interested. Yeah, Austria, I think. Australia, Sweden. So it's a big thing in Europe, and people are looking from all sides on that project. Okay, so let's try to do a quick demo. Let me see. Okay, so let's do it very quickly. Whoops. So this is, okay, this is an OpenDesk instance that we are using for review. So I just want to show you very quickly the different applications that are available. So let's say that I want to go to my emails. We are fully dependent on FOSDM network, so I hope it's going to load. Okay, great. So email, we are running on OpenExchange. So if you know OpenExchange, you probably already recognize this interface. What you see is that it's, so all the applications have been customized so that they have a color theme that matches, that is unified in order to have like a nice user experience. I can use a user directory which is based on the user that are registered in my OpenDesk instance. I can potentially add, in my email, I can add files, and when I'm selecting files, I have the choice of uploading a file from my computer, but I can also link a file from my next cloud account. And so if I do that, it will create a share automatically and make sure that whoever is receiving the email has the necessary access to see the file. This one, no. Okay, this one, thank you. Okay, so apart from that, in the email application, you will see that we have this little button here, and that's the transversal menu. It allows us to switch from one application to another, and it will change depending on your access rights. So we can look at, for example, next cloud, so integrated for the file management. Here I can, so I can create my file, I can create spreadsheets, and in that case, it will create a document that will be open within Collaborah. So I can edit it. We also have files that can be diagrams that we can edit directly with CripPad, so for that we integrated CripPad within OpenDesk for one specific functionality, and here we are using draw.io actually within CripPad. We can also look at maybe chats, in which case it's going to be a managed instance of matrix and element of the frontend, where I can have discussions with the other member of my OpenDesk instance or potentially other member of the Matrix Federation. And here what you see is that I'm actually part of a room which is used for a specific meeting within Matrix. And so these rooms, they can be created automatically when I'm in the agenda of OpenExchange. I create a new event and I say that I want to have a conference in Matrix, and it will create me a link that will help me to this room, that same format, where I have video conferencing here. I know that was a bad idea. Let's leave. And I have a whiteboard and I can also chat. Finally, we also have project management with OpenProject and knowledge management with Xwiki. So here I can create my new project, I can create my work packages, create some milestones and link them together. I won't go too much into the details because I don't want to spoil you. And here we also have a customized Xwiki instance. Today we are synchronizing users and writes. We don't have particular integrations with third parties with the other applications. So that's a very quick demo. And by the way, that is released. So you can download and try it and play around with that. So it's an open code. It's open source. So about the roadmap of OpenDesk, essentially the goal is to have a stable version like this month. As you said, it's already released. The main issue, I would say when it comes to the deployment, if you want to try it out, is that today there is still a good part of the documentation which is only in German. And sometimes if you're not speaking German, it can be a little bit difficult. On the longer run in 224. We are trying very hard. Yeah, yeah. It's just a few contributions for translation. I'm welcome, I guess. Exactly. You can do a full request. And so the idea in 224 is to have like more improvements in order to improve the BSIC5 compliance. Remember, the goal is to deploy that within the German federal administration and also within some German lenders. So compliance to any standard that exists for the public sector is really important for the project. And it's good because it allows to also improve the open source projects that are behind, that are being bundled in the platform. Yeah. Yeah. Yeah. So, I mean, we are super vendors like OpenProject or XWiki. For us, the perspective is a little bit different on that whole project of the whole OpenDesk because we already have a product. We already have clients. We already have a roadmap. And then suddenly someone says, hey, we want to integrate you, but you should have the same look and feel. We want to have the single sign on. We want you to finally come together and create deep integrations. So that is challenging for us because usually we tend to stay in our own soup because it's easier to build stuff in our own software. And integrations are complex. You need to organize. You need to find collaboration, like meetings with others, line roadmaps and important priorities. That's difficult. And now suddenly someone from the outside comes like, no, we want you guys to work together. And we will pay you, actually. Yes. So by integrating two very, very multiple different types of applications, we are going to build very deeply specialized applications. By integrating them, we create a multiple value. We multiply the value instead of everyone brewing their own soup. So also for us, it's a huge chance because if, let's say, XWiki is integrating with us, then maybe their clients, which are also likely to use OpenProject, would also book OpenProject, like the professional services. So for us, it's a huge, huge opportunity. And with OpenDesk, it comes like, okay, the German government also wants that it's easy to procure so that not every little city needs to go through a tender. Tender processes. Tender processes. So it will be much easier for a small city to book services from us. Right? So, and with that, we can build better software and we can better integrate. So maybe some challenges before we dive into more, like how we create integrations, basically. So some challenges that we see today is integration between the UI and the UX, so the products, of course, it's difficult to, well, basically not all software are created equals when it comes to the capacity to customize them because sometimes it has not been thought out from the beginning. Yeah. Oh, sorry. So there is a big challenge on UI and UX. There is also a question of overlapping features. Sometimes us as vendors, we create features like, we create, I don't know, like a task management feature in XWiki, which collides with OpenProject or a Wiki in OpenProject that collides with XWiki. Well, we have to find solutions for that, but usually we are like civilized, so it's okay. One issue is also like maintaining all these customizations that we are creating outside of the core of our products. So basically, we create an overlay that makes our application compatible with OpenDesk, but then like, how do we get the financing for that? Like, how do we maintain it? And if it's really difficult to maintain it across new versions, like, how do we do it? And so far, we need to find solutions on the long term for this. Talking about integrations, like, of course, these two systems, they don't exist only in OpenDesk. We exist outside OpenDesk as well. And then integration also makes sense. And those people might not have the whole OpenDesk infrastructure. So we always, when we build integrations, we want to build it in a way that is also suitable for other environments where the software could run separately. Exactly. Exactly. And so last thing is a creation of offerings. We mentioned the fact that other EU countries are interested. So apart from Germany, so that's going to be a challenge on the long term to find ways to provide OpenDesk in the public sector, maybe for other actors or even for the private sector. Yep. So. Yeah. To also go a little bit into an example for, I always like to talk about integrations because I think this is where the power lies in collaboration. I want to go into one example that I was working on with my team. That's the NextCloud and OpenProject integration. I somewhat sub-staffed this already presented last year here, but I want to have a different point of view on that. So quickly, okay. So NextCloud is mainly for us here today. It's a file storage platform. And OpenProject is something like, let's say, Jira or something like that. So you create and organize your work in work packages, issues, whatever you call them. And you can have them organized in gun charts or boards or whatever you need. So, okay. So we have a file management environment and we have a project management environment. And the outside perspective, let's say the public sector, they have a different perspective on that. They are saying like, where are the files for my task? Two things in one sentence, right? Does everyone in my team have access? Oh, so we are in project management system. We're organizing our work. The files are in a different system. Access management, okay. The third problem is I want to do the same thing over and over again, like doing the same processes, having the same steps organized in projects, task by task by task by task. And also I want to have the same template files and the same folder structure, having this all again and again. And they both need to go together. So what they don't want is that we, as OpenProject, that we build our own file management system because NextCloud is pretty good at that. And it's also integrated in the desktop and so on. Like people want to work on files like in the NextCloud experience. But also NextCloud might not be the best choice for organizing complex projects. Okay, it has the deck, right? But if you really want to go a bit more professional, probably OpenProject is a good idea. So from the public sector, they don't want these tiny solutions. They want the integrated solutions, right? And also it's not only OpenDesk, like other clients like the City of Cologne or the University of Duisburg, Essen or the Deutsche Bahn. They want the integration. They don't want the separate solutions. For us, it's easier to focus, if we integrate, it's much easier to focus on project management. Why, for example, NextCloud could benefit from focus on file management and so on. So this integration creates a great value in the combination. And also it's interesting, like once already mentioned, if we work together, and it's like NextLogest and Example, but if we all work together, then the sales also becomes easier because we all have clients that the others don't have yet. And with that joining together, joining forces in the integration of sales, we together then can capture a bigger market, get more money to build more open source software. Okay, little examples for how this looks like. So this is OpenProject. You have a work package. And on the right-hand side, you see files that are related to that task, which is baking pizza. I love baking pizza. And the interesting thing is you can see the files that are necessary for this baking of pizza on the right-hand side. But the files are not an OpenProject. They are NextCloud. And in NextLog, when they change their name, this name will change here as well. If they change their location, these links will still work. So this deeply integrated reference integrity, that is what you need in order to get away from chaos. This is what actually organizations want. They want to get rid of chaos. They want to have control over this stuff. Access control. So for projects in OpenProject, you can have something that's called a project folder. So we, OpenProject, we create folders in NextCloud for which we manage the access. So members of a project, this is the scope of a team, right? They need to have the access to the stuff that they want to have access to. So we say, okay, here in NextCloud, these people have access to it, fully automatically managed. That helps people to keep the data where it belongs to, the files where they belong to. So we are working on this project. Here are the files. Put them there in that folder. Okay, don't put them anywhere else. And if you leave the company, they're still there, right? If you're in the organization, they're still there. And then on the NextCloud side, also deeply integrated, we can show you which task of work packages are actually relevant for that file or where this file is used in. Let's say you have a template file for an employment contract, right? So where is this used and in which contracts is that file used? So you can find them on the right-hand side, directly jump into the work package of OpenProject and find the processes there. So the bottom line is like integrated. We are much, much stronger. Exactly. Exactly. Okay. Thank you. Thank you. I think we have some time for questions. So... Yes, I'll do it. Are we going to have to figure out the way to answer that? Yeah. Oh, you're going to give him a question. Thanks for the talk. Are there any license requirements in order to integrate into the OpenDesk infrastructure? And second question is, which vendor, so to speak, is kind of the product owner of the dashboard and the top bar we saw, and what are the requirements for them, which all apps share? So the first question, okay. So for the first question on the license requirements, there are some requirements. So basically anything that you have to commit on OpenCode needs to match within the list of authorized license by the German administration, by the admins of OpenCode. The list is not fully compatible with the one that is provided by the OpenSource initiative. So it's a little bit shorter. Basically we found that the hard way, essentially when, for example, if you have a software, you package it as a Docker image, and then when you have to upload the Docker image on OpenCode, you have to provide an SBOM for it with the license, and then you found out that in the base Docker image that you are depending, there is a Pearl library with a weird license header, and so it creates an exception, and you have like three months of review to make sure that it's okay to have that in OpenDesk. So it's a little bit of a mess. There is a list available on OpenCode for the license. When it comes to product ownership, I'm not 100% sure. I believe that it's, so the design of the navigation bar, it's managed by the, I think it's managed by either Zendes, which is like handling the project at a high level, and Zendes has been helped by consulting, by PWC, which is doing consulting and usability tests on top of OpenDesk. And the same goes for the portal, I believe. Thank you. Any other questions? And technically the widget port that you saw, the answer is the Univention. Yeah, thank you again for the talk. I have two questions. One is very specific. I'm from Lassen, Germany, and I've heard of Project Phoenix. You wrote Project, or D-Project Phoenix. Is there some difference, or is it just that Project Phoenix? And the second question is, is there... Is there a data port over there? Okay, can I just phrase the second question? The second question is, if you're in the context of a company which does not have an own IT and stuff like that, but likes to keep to an open source software where you can switch vendors, are there vendors just providing this setup where you can get an account and use it for your company? There's the idea of the job of data port, for being one of the potential hosts of that D-Pheonics suite. So then you could get that product from there just by renting it. But I think they just offer the services to public administration. I'll try to answer it. I'm a part of the Project Phoenix and also have a little insight of OpenDesk. The thing is Phoenix is a branch of this OpenDesk, and what was your question exactly? They just wrote D-Projects. Yeah, it's the same. There was some name changes on the way to the product. D means data port, they dropped it, and now it's Phoenix, so it's basically the same product. Is that like the second generation? No, no, no, it's just renaming, you know? And is there a possibility just to rent this somewhere for small companies who don't run an own IT? Not to my knowledge, but if there is a high demand on that, that will be possible. They already do it for some customers, so maybe it's a question of strategy and how much this is asked. It's Helm charts, right? So the idea is that it's easy for any host to host it. And then just to protect... Ah, there's Markus. You have a mind set? I'm going to take the questions in order. But the idea is that it's easy to host it, right? It shall be easy for any organization in the public sector to simply say, I have a data center, I just rented somewhere, and I just pull up the hand charts and off we go. Okay, you mentioned that this German strategy was to put this 23 million in year 2023. So my question is how does it go forwards on the funding side? It depends on the German farmers. Is there kind of a guaranteed maintenance for this code, or like who is taking care of the boring stuff, the security patches and all? I don't know. I don't know. So the budget allocated to the project depends on what's being voted in the parliament. There are budget cuts nowadays. I think so. I'm not 100% sure, but globally there is still budget on the project. It's about half of what we had last year. The budget repetition is another issue, right? It's basically like around 30 million dedicated to open projects in 2024 to be validated. Open desks, sorry. And sorry, what was the second question again? Who's handling the long-term maintenance of the security patches? So the goal on the long term is essentially to find a business model so that whenever you are deploying an open desk, there is a team that is managing basically the packaging of open desks, making sure that the Helm chart are up to date, et cetera. So this team needs to find some funding, and the idea would be to have, like, if you are taking professional support, basically the team would get a part of the funding. And then the idea is also to redistribute this funding to the vendors themselves in some way. The specifics of this distribution here, they are not fully defined, basically, because we are really right at the point where we have this default implementation of open desk that is just going out. And now there is a second step of finding the first clients and making sure that it deploys properly, basically. Okay, any other questions? I'm going to take some people that haven't spoken just for, like, distribution. Thank you. First of all, thanks for the talk. That's a really interesting project. And I just wanted to ask if there is an interest of at some point adding any repository tools or, for instance, I don't know, pipelining CI CD tools to the platform? I will repeat the question so that it's registered. I'm so sorry. So that it's registered. Whether there's any integration of repository or pipelining? Right now, yeah. Right now not. I personally think it makes perfect sense. I would very much welcome that. And I guess it's just like knocking at the door at vendors and saying, hey, we want to have this. Hello. Hello. Okay. Yeah. So you said there is unified procurement so you can buy license for all the different softwares if you want, like professional support and stuff. Is there also like a single point of contact for support? If I want to self-host this and have some issues with any of the softwares in the suite, who do I ask if I have problems? And is there someone who can help me and do I have to know which software has the problem? And also second question, what's your favorite pizza? Thank you. Okay. Good question. So the question is like, is there central support for the whole product? And actually, I don't know that much. I think it's not different yet. Yeah. Something that needs to be developed. It's part of what you said about the whole package. It's part of the discussion on the business model, basically. But I think it's more important to be first built the software now integrated and make it open source and available for everyone for free. And second question, the favorite pizza. Oh yeah. There are many. Okay. How are you? No, how are you? Okay. Maybe one last question because it's about like, it's about 40 seconds left and then we have to go to the next question. Let me go back there. Not just not a question, but a remark. Hi, I'm Renee. I'm with Zenders for two days now. And I'll be happy to take any questions or feedback on Open Desk with me. I'll be out to talk. Thank you. Okay. Okay. Really cool. Sure. So my question is there are huge parts of software stack that are still, I think, vendor tied outside of, for example, just office software, I guess. So is there a way to get the software stack to be open source? Yeah. So I think it's a good idea to have a lot of people who are interested in the software stack. So I think it's a good idea to have a lot of people who are interested in the software stack. Yeah. So I think it's a good idea to have a lot of people who are interested in the software stack. For example, like GitHub is one of the biggest ones. There is an alternative in GitLab. And operating systems, BIOS, hardware. I mean, I physically have problems to understand the question. So what's the question? Like if there are other software going to be integrated or? There is a huge part of the computer science ecosystem such as going down from hardware all the way up to operating systems. So the different layers. Different layers. Is there, you know, movements to free those? Yeah. So OpenDesk focuses on the desk, the working desk. So the tools that you need on your machine in order to work together. That's the current scope. There's not the scope of controlling the hardware or the operating system. That's a different story. Thank you. Thank you very much. Thank you.
Another approach to AI
Thank you all for joining us in a great, you don't. It's not working? Why is it not working? It is? Okay, good. Let's not correct. So thank you all for joining us and here to talk about next. AI. Exactly. Another approach to AI is just port fleet. Okay. Does this work well enough? You can hear it in the back and all that. Okay. I see thumbs up. That's wonderful. Yeah. Well, I had this. I'm your sport fleet. Direct communications co-founder at NextLoud. So, yeah, the thing we do at NextLoud is collaboration. That's of course what this room is all about. Now, there's this AI thing coming. And so I'm hoping to try and make this conversation a little bit interactive. I mean, there are other people here from Xweek and other projects who are working on, well, collaboration tools as well, open source collaboration tools. And, you know, this AI thing, I mean, there have been AI-ish tools being used for a long time, but a lot of them are also still quite new. So I'm kind of hoping that we can also have a bit of a conversation about it. Because, well, there are pros and cons. I mean, we'll get to all that stuff. Of course, the big thing here is, yeah, that we have, like, these big companies, right? They all want our data, and AI is for them another thing to use that data for. So, yeah, I mean, AI, I don't know how deep I want to go into what it is, because I think all of us know it a little bit. But we don't want to live in a world where there are five companies, you know, who run all our data, and that's kind of a little bit the case right now. I think that if Trump and his next presidency tells Microsoft to shut down their service in Europe, then basically you cannot get a new passport, you cannot, yeah, nobody can work at government here, for example, right? This is, I think, a bit of an issue. And, I mean, NextLoud is one of the projects that's working on solving that issue, essentially trying to give, well, companies, individuals, but also hopefully government, back the control over their data. We've built a collaboration platform. I'm guessing how many of you are not familiar with NextLoud? Yeah, okay, it's like six people. Google it. I will then not go into that, sorry, or duck-duck-go it, that would be better, obviously. So, as a company, we build an alternative for 365 in very quick, simple terms. And with alternative, we mean that as a government or a company, we think it's important that you have a choice. It's totally fine if you're happy that your data is at an American company and that U.S. buy agencies have access to it. If you're good with that, if that's not a threat to your business, that's fine. Government, I think it's by definition a threat, but that's their choice. But we think there should be, like, a choice. There should be an alternative. And an alternative is only an alternative, is it is, if it does what the other product does, obviously, and in a safe way and has enough ability to be used for a serious company. So, that's what we're building and we have built. That's why German and French and most European governments are in places already using Nexlout, be it cities, be it at a state level or federal level. So, as a company, we care a lot. Nexlout for us is a mission, it's a goal. It's like our way to try and make the world a tiny little bit better. And we want to work in an open, collaborative way. Therefore, we're very happy, of course, that it's used by thousands of governments and universities, et cetera, et cetera. And, of course, we're building this completely in the open. And again, that will be relevant because, of course, I think the future for AI will better be open, otherwise we are, well, just as crude as it is with collaboration platforms, honestly. And we have a wonderful community working with us and all this stuff, which is awesome. Also, as a company, we try to be open and transparent, not depending on venture capital, et cetera, but be self-owned. Anyhow, AI. So, AI is like, we've already introduced a ton of AI things over the years, like little things, and I will show some of them, but of course, with the latest LLMs and stuff, it's getting really complicated. I mean, there are tons of problems with it. I have a lot of potential, right? AI can help us make repetitive tasks easier, quicker, et cetera, but at the same time, Big Tech is basically loving it. They have all the data to be able to build the AI's. It costs tens or hundreds of millions right now to really train the proper LLMs, so they really have a bit of a monopoly here. And, yeah, the rest of us will have to just accept that they're using all our data to do it. And a lot of companies are already really realizing this is a problem for them, right? It's Citigroup and Goldman Sachs. They are actually not allowing their employees to use tools like chatGPT. I mean, if you're BMW and you're working on a new car and you're using an AI to generate some ideas or summarize some proposals, and you discover later that six months later Tesla, while designing their car, suddenly got some of your ideas coming into their AI planning, then there's a bit of an issue here. And, of course, this kind of stuff is happening. The company, like a while ago, Twitter and Zoom, they changed their terms of service to allow for training on user data. And, yeah, this is really an issue for business as well as, well, obviously, all our society. And then I'm not even talking about data biases in these models, carbon footprint. I mean, I think most of you are aware of all the issues with AI. So, honestly, I don't think the question is to AI or not, because there are too many benefits. The opportunities are really big, I think. I've been trying to make a bit of a list of that, but I was just changing it while standing here in line outside. So this is definitely not complete. So I'm just going to put it all on the screen and ask what's missing. I mean, I think there are some basics, you know, text to speech, speech to text, recognizing faces on photos and recognizing objects on it, et cetera. This is already, like Nexot has been shipping this for three years, four years already, these kind of things. It's just one model that you download and does this stuff, and translation and other one. It's fairly, I think it's, I mean, it's not simple. It's technically complicated stuff, but it works. And there are not huge risks. You don't need to send your data to Google anymore if you want text to speech, or if you want image recognition and being able to search for a dog and find all the pictures of your favorite pet. So this is already there, and it's not terribly complicated to use for a person. But of course, you now have all these new language models. I think there's really a big benefit, unlike dealing with information overload. You have tons of emails coming in. You have, I don't know, papers to read, et cetera. And these LLMs, they, I know they create a lot of fake content and hallucinate stuff. But the thing they're pretty reliable at is summarizing. And this is really quite important. I don't know how many emails you get, but I get a ton, and I would love to be able to summarize it or help select, you know, useful emails, et cetera. And this stuff is really possible, or, I don't know, meeting notes. And, yeah. So this is, I think, where these models can be super helpful. And you have, like, text generation, of course, they can do help out with this. You have also image analysis of various things. There have been some demos from Microsoft and Google already about a year ago, where they basically were showing that you have, like, a spreadsheet, and you select something in it, and then you type a question about it, and then it makes a graph that answers the question. This kind of stuff is also pretty magical. And there are tons of people in office all over the world that would benefit a lot from having this stuff. So I think, yeah, the benefits are really there. Another thing is, like, automation. Just talking about it with a colleague. This is, like, also a next step, you know, if you can say to the LLM, like, hey, send, make an appointment with another person, and then they try chat, and if that doesn't work, they try email. These kind of things would be really helpful, I think, in day-to-day work. So, yeah, I don't know. If there are other ideas or things that are missing, I'd love to hear it, actually, and make my list a little bit more complete, but we'll get to that, I think. So I just wanted to show a couple of examples, like we have this feature now, the Threads Summary, that makes a summary of your emails. Another example is, like, an Excel text. You can just select some text and say, hey, summarize it, create a headline. It's all quite simple to use. And image generation, of course, I mean, this is a horrible image, but, you know, you can make things that look good. And then you have the data analysis, and you have automation, all these other features we have ideas on. I'll share some things a bit later on. So I think we need to do AI in our collaboration platforms, like X-Wiki, you guys need to have a plan. I know only Office, but they integrated just chat GPT. I think we need a little bit more than just that, because, well, we're losing the on-prem capabilities, right? It's not competitive if you're just integrating chat GPT, then the data is sent to the U.S. anyway. So, yeah, that's not really a good solution. So the question is, how can we get this without the problems? And I think I'm in a room with open source people, so I think the answer for most of you is obvious, and this to me at least, just transparency and being open, yeah? And this is kind of the thing that we've been working on at NextLoud. We kind of made some rules for ourselves, so we have been doing AI as things, but when the whole text stuff from chat GPT came out, actually that was at the FOSDEM two years ago, we talked to people and each other, and we have some fairly smart people on board, also from the research community, and we tried to come up with, like, how can we handle this? Because we add more AI features, like we don't want to be left behind, and we need to be an alternative, as I said earlier, and you can only be an alternative if you offer similar features, otherwise who's going to use your product? But then how can you do that in an okayish way? So the idea we came up with was to at least create transparency, and of course, choice, I'll get to that next, but first the transparency. So we came up with the idea of creating a rating that has basically red, orange, yellow, green, and we would rate each of the integrations of AI features in NextCloud with this rating. So first, is it open source? Is the model available, and is the training data available? And so if a model has all three, it's green, if it has two of them, it's orange, if it has one of them, it's yellow, if it has none of them, it's red. So chat GPT integration, red. Completely on-prem model that is trained and has the training data available, for example, for speech to text, that can be green, and you have everything in between, of course. And the second thing is choice. So for us, it's really important that you, well, can choose, right? I mean, there are, again, legitimate users for something like chat GPT, and I mean, they're throwing so many billions at this problem that you can hardly argue that open source can really keep up to the latest stuff they're doing, and sometimes you just need it, fine? So in our user interface, we have these choices that you can have, like Opus, that's a translation exactly, so this would be a fully green one, and that's, well, we all know, chat GPT. So we try to make sure that for the various features that you can choose between these different models, on-prem, et cetera. So for us, of course, most of the work we put in on-prem and open source locally running AI features, because, well, that fits with our values as a company and, well, with our ethical AI rating, but the others are available. So at the moment, I made a list, but I'm sure there are many more you can use, like models like these in NextLoud. I have four of the various features. I'll show, well, actually, I'm showing examples right now. So this is just a bunch of the features we have. There is more, but suspicious login detection is something we developed like a really, really long time ago. It's basically a neural network that gets trained on your login data. It just runs completely local every time you log in. If you work nine to five from the Berlin office, let's say, and suddenly somebody at 3 a.m. logs in in your account from China, maybe there's something wrong, the model will detect that and give you a warning. Very simple, and we have had this for, I don't know, since 2020, so quite a while. And it's green, right? It runs fully local. There's nothing special about it, no data sent anywhere. We basically do a very similar thing with our mail app, where we basically train a neural network on subjects, sender, and email recipients, et cetera. And it creates a smart inbox trying to put important emails on top and the rest not, and again, no data sent anywhere, because it just runs on premise. I already mentioned phase recognition and stuff we did in 2022, I think, so this is all. But the problem with this is already, you need to download a multi-gigabyte file, which has, like, all the values needed for the neural network to recognize stuff. So we already had to re-architect a lot of the way Nexard works just to be able to download this big blob without creating all kinds of complexities for the users. And obviously, this problem gets bigger and bigger when you get to modern AIs. We even have music genre recognition using machine learning. It's yellow because it's trained on all the music on Spotify, which means the training data is actually copyrighted and therefore not open. And, yeah, we had, like, a pre-trained model to do call transcripts. We introduced that last year. That is nice. You have a call and the recording then gets text, speeds to text so that you get the text of the recording. Again, this model runs fully local, so that's cool. And speeds to text the other way around. Background blur, just a JavaScript thing that we upload in the browser, very simple translations. First, we made it with Deepol, which is not cool. So then we made one using the Opus corpus. You saw it earlier, and that is running fully local, so that's much better. So these are still mostly basic features, I think, today, and yet already pretty complicated. You need to keep an eye out where the data is being sent to, like with translation. But, of course, the big thing are LLMs, like the text operations. What we've been doing is to create, basically, NexLad Assistant. It uses the large language models, but open source on-prem ones that you can host yourself. It's like this little thing on the top. When you click on it, you get a dialogue. You can give a free prompt, or you can give a text to summarize and some other things. And it just runs this through one of the models that is supported by NexLad. And, again, you can put JetGPT here as model, or as back-end, but you can also then run your own LLM and connect that to NexLad, and then it can do all this stuff on-premise. So it's fairly simple. When it's running, then it'll get the results, and after a while you get the output of it. You can copy it into a document, et cetera. And, again, if you take a local model that is trained on public data, then it can be a fully green solution. So that's really cool. In places like NexLad Text, I already showed that, that you can select some text and then run this. Mail, I already showed this as well. In talk, like our video calling and JetSolution, you can translate a message, select it, and then choose translate, insert images and other stuff. We even made a little bot. This isn't the smartest bot. It's a very small model, but hey, it's fast. And you can ask it questions. Honestly, I wouldn't say such smart things. It's fairly shitty, I've noticed. But still, it works on your own server. That's kind of nice. So a lot is possible. One of the newer things we're working on is more of these services, because they're now companies like Amazon. They are running LLMs as a service. And other companies are doing this also purely in Europe, like you have ALEF, ALFA, and I think MIRROR or something. In France, there's also a company that is building local AI. So we're trying to support these, that you... You know, everybody can run these AIs, like you need a lot of heavy GPUs. It's a lot of compute. So you can use it as a service that at least it stays in Europe or at a company that you trust. Then I wouldn't recommend Amazon, per se, perhaps. For this, we also made it possible that you can put in some limits, otherwise users get a little creative and start to basically cost you a lot of money. And we worked on the interaction with this. I'll skip through this. A thing we're working on now is also to make all of this even smarter. A newer thing is the ability to take your documents that you have into account. So ContextChat is a feature of the Assistant that basically it has access to your documents, your emails, everything you have that gets indexed. Let me see. It's indexed into a vector database. So this runs as a separate service next to NextLoud. And then when you ask a question from the Assistant, it can actually answer using your documents, your company documentation, your emails, etc. So you can really do stuff like, you know, can you give me an idea of how we organize events, rather than in general, it can look at your documentation and then tell you, like, oh, you know, at your company, organize events this way. Or you can say, hey, can you give me a summary of the different requests that a colleague has emailed to me last week, and hopefully it'll give you all the to-dos that you got from that colleague in the last week. So this, yeah, has the context basically of what you are doing as a user at hand. It's, I think, really kind of, yeah, an important step forward to make this useful, because otherwise you're just getting the generic info that's in the LLM. As I said, they hallucinate stuff all the time. They're much better at, like, taking information and summarizing it, and that's, of course, what this does. I think it's much more reliable in that way, you know, vacation process, et cetera, et cetera. So that's a couple of things we've been doing lately on this, as well as in the context yet. So that's our approach to AI. I would really like to hear thoughts on that, and, like, I don't know what other projects are planning with this. One of these will be giving a talk after mine. But I know any feedback, questions, thoughts, fears, and anxieties? Okay, so is this working? Can anybody confirm in the back? Great, thank you very much. So any thoughts, questions? Let's start here. So you said your screenshot showed that... Yes, the screenshot showed that I need to double check the information that the assistant gave me. Notice that it doesn't give me the reference to the emails that it was quoting from. Is there a possibility to get that? Currently not, but thinking of the way this works, I mean, I have one developer here. They can interject, but I think that should actually be quite doable, because the way it works is it looks in this vector database and gives that information to the LLM to then summarize and give you the answer. And, well, in the vector database, I guess it knows where it came from, and therefore can then say what information was used to summarize that answer. So I would think this is possible, but I don't... I don't know. Yeah, I see a thumbs up. Excellent, okay. Any other questions, ideas? Okay. Or are we going to do this? Yes, for me, I am an e-aseptic. There's something... some examples. So it's good, something... when user is at the end, and he can correct what is said by the AI. So an example for translation, I have the word in Dutch, Académie de Sie. It's not universitaire, the translation in French. It's personne issue des milieux académiques. And that's any... the translator or a consultant don't give this response. Yeah, but it's... So you have a control, they say, from the user, also from the citizen in general, so when the user has no power of the system. You can't check a human translator either, though, unless you know the language at that point you didn't need them in the first place. So, yeah, you have to use this stuff in a skeptical way, but then... yeah. Yes, and the other thing is about consuming energy. So it was an emission in the RTBF about consuming of energy, of shaggivity. It was hard. Yeah, so the amount of energy that these models use is big. That's, by the way, one of the reasons I think they should be open source, is that the researchers who do stuff companies aren't interested in can try to optimize them and make them run with less energy. Yeah. Hi. I think another good use case for all these... If we combine these features, that would mean that we could have a super accessible environment, because if someone is blind or nearly blind, people could use all this text to speech if someone has autism, ADHD, whatever. You could try to find a shorter version, an easier, understandable version of a text or whatever, and combining this would help, I think. That is awesome. I'm going to add that to my slides right now, but I am completely making the laptop slow now. That's a really good point. Accessibility is a really important benefit. Actually, hint to the developer, bring it up in the team. Maybe we can already work on that. Yeah, any more? Yeah, just a question on the REC approach that you were describing before. Do you have any figure that you can share to which extent you tried the retrieval of the vector? Sorry, I did not hear the question. So when you were describing the rag, the retrieval of the augmented... The green, the colors, yes. No, the rag. When you're retrieving the vectors from the vector DB. Right. So can you give us some figures on to which extent you tried that? Talk to somebody, not me, in one of these, who knows the technical part there, and I'm not even sure we have somebody right here at the moment with that. Sorry. Okay. And we're out of time. I'm afraid. So this is probably it then. Somebody wants their microphone back. Alright, thank you all. Thank you very much, Josh.
Using Generative AI and Content Service Platforms together
Very much. So our next person is Ahel Boroy from Highland that's going to talk to us about using generative AI and content service platforms together. Thanks. I was on. I was on. I was checking the microphone. Okay, yep. Welcome to everyone. So this is another view on the same topic. So we are going on the technical side now. It's not like a final feature for a product, but it's a framework in order to help you to build all the features that we were seeing before in the context of a content service platform or a document. Okay, so we are going to review some in a stack. The next step we are going to use that is including also LLN on premise. We are going to review all the options. We are going also to describe the features we can build with this stack. And then we are going to review how to integrate that with your in our case because I'm working for Highland and we are building an open source product with the name of Fresco that is related to content management. So we are going to see how to integrate that with that content management platform and also just looking a few to the future. And obviously I need to include some AI picture because it is what it is. Anyway, so this GenAI stack that we are using includes mainly three components. So the first one is Olama. Olama is a service that is able to provide an API to interact with different LLMs. We are going to see later all the list but you can download your LLN on premise and this layer is providing the interaction with the LLM. You can even interact with different LLMs at the same time. The second one is Neo4j. So Neo4j is the vector database when you are using rack, we augmented reality and so on. Then you need to increase the information for the LLM. So you are storing all this information in this database. And finally we are using land change. So this framework is providing land change that is a framework to communicate all these different elements. So this framework is in Python but if you are not comfortable with Python there are many other languages that are including this kind of piece. Okay, so mainly what we have, if someone doesn't like Docker there is no problem so you still can deploy that without it but it is oriented to services. So you have Olama that is the one providing the services for the LLM that can be used in GPU or not. So you can, we are going to run this without the GPU. Just using my regular CPU on my computer. This is lower. I recommend you to use a GPU but you can do that. And we are piling all the models that we need so we can just use more than one model. With that we can increase the information for the operation with that Neo4j database and we can develop an API with this string LLM, with this framework. Okay, so these are the pieces. You have the project Docker.gen.a stack. Mainly these pieces is a sample. This sample is oriented to prompting. You have to reply questions. So we are going to do something a bit different from that but the first sample you can try is this one. Okay, so all the LLMs that are able to manage Olama today are these ones. Likely as this is growing every day there are more. But this was like last week. Okay, so this is what you need to understand. Obviously the larger the better but you need to take into account your resources. So these are very small. Your 4 gigabytes of RAM and 2 gigabytes of storage. So you can run that on a computer and then if you want to use something that is better is also larger in resources. And you can even use these LLMs that require, I don't know, many different computers once I say, okay. So today we are going to use this kind of LLMs. Also it's relevant to look at the license. So just was talking before about the license. So this is also something relevant if you want to build something commercial or something open source or whatever. You need to take care of the license. Also you can look that there is some weight license there because you have this LLM2 community license agreement that some people say that this open source, some other people say that it is not. So it's something different. So better to check if you don't see a patchy license or something that you can recognize. Better to check the conditions. So you have a lot of them to choose. We are going to work today on the demo with Mistral, CementB, that is a French company that is producing this kind of LLMs that are more or less the same performance as GPT 3.5. So it's good enough. And so what is open source, the LLM is free to download and to use, but the training data is not free and likely it has some copyright material on it. We don't know because it's not free. So on the next law ethical AI writing we have, sorry, yellow. I thought it was orange but it's yellow. Okay. It's more or less fine. So we are just only missing one. That was for text and for pictures. We know some LLM with a visual encoder on it. So for this part we are going to use lava. And lava really is granting all the different requirements. So we are using a green LLM for this other sample. Okay. Perfect. So all the demo is running on my computer while I'm there in the presentation. So I have everything running inside is 32 gigabytes of RAM and is AM64 architecture. So it's not AMD. It's MacBook Pro two years ago, something like that. Okay. As we were also reviewing the previous version before this GEN AI momentum, we also had some data section, test recognition, test classification, content analysis. Anyone is using content analysis for a real use case? Okay. It was not me. So it's something but you saw. But we have all the things, right? Some kind of automation. But now with the GEN AI, we have also a power classification. We could classify in the past. But now we can classify better. We can also, and when I say translate, we are going to see later the demo. Obviously we can translate. But we can also interact with the LLM in one language and to get the response in another language. Right? So that is the difference. We can also summarize a test. This is the most common use case and we can describe a picture. Prompting. Obviously we can use prompting. We can read that. So we have some new features that we can use in our documents. Okay. We are going to see some of them implemented. Okay. So what is this project about? It's not yet. Okay. The project is at some point of the slides. Okay. If not, I will give you the link. So in this project, what is created is a API by using this, all these infrastructure in order to provide different services. What we are using is some LLM embeddings. So we are just trying to avoid hallucinations. Just giving some additional information to the database from the document. So we are working with a document. Right? So we are not going with search. We are not going with some other applications of GNI. So we are focused on features of a document. So we are adding all that information so we can get a better response and more suitable to the document we are dealing with. And for that we are using Mistral. And if we are talking about a picture, then we can use the other LLM that was Java in order, for instance, to describe or classify the picture. We have also some, so we can choose the LLM. If you want to choose some other LLM than Mistral, you can do that for text. You can choose some other LLM with a vision and color enabled, like Java or some other on the list. And we can also choose the language. So we are going to see that later. We can just drop a document in Japanese and we are getting the summary in English or in the other side. Right? And also you can choose some numbers like the summary size or the number of tasks and so on. So these are parameters. Okay. So this is the API. Right? Pretty cool invocations. But let's see that leave. As always is better. Can you see the, better? Okay. Okay. So for instance, I'm going to work, let me find the, I'm going to work with this document. Right? I could be using an English document, but it should be easier for the AI. So we are using this one. And I'm also going to use this picture. So for your reference. Okay. Okay. Perfect. So for this document, we are going to ask for a summary. So give me a summary of this document that is in Japanese. So with that, if I'm able to. Okay. So this is running on my computer. So I have this ENAI stack running in this Docker deployment. And I'm getting the request. Okay. And with that, I'm getting the answer. Okay. So the test, this is a problem with kindergarten, in Japan, blah, blah, blah. Okay. That's fine. So I'm giving something in Japanese and I'm getting the summary in English. The second one, come on, note this one. I did it. Okay. The second one is just to classify. Classify a document that picking a term of a list of terms. So I want you to classify this document according to Japanese, Spanish or Vietnamese. Again, it's an easy example. Right. But you can choose whatever list of values. So if I say just classify this document into one of these three categories, the term is Japanese because the document is in Japanese. Okay. This is also a Revan for classification. And finally, we can also make some prompt on the document. What is the name of the zone or this document in Japanese document? The name of the zone is Musoku. Okay. So three different features that we can use on this, on a document. You can build more. Again, it's a Python, Python program with these three specific features, but you can grow up to include something else. And if we move to the, to the pictures that was for text, but for the pictures, we can describe this, this picture. We can also extract some, this is a person, this is, but that was done before. But describing is the, the, the new thing that GNI is providing for us. This is a bit slower, but in the end, they made so some man posting for the camera. He's wearing a green beanie, glasses, a black hoodie. And the land yall says air fraked. Well, no, it's a fresco, but more or less. Okay. The picture was not big enough, but it's fine. It's, it's something that is, is useful. And it's not that consuming internal resources, because it's running in, on my machine. So it's, it's fair enough. Okay. Once that we have all these features, and we have this, Python, just let me show you a bit. So this is the project, right? You have the Aeboroi, a fresco GNI, and you have the GNI stack, and mainly it's a Python program. Okay. With all these endpoints described, classified, prompt, and somebody. Okay. It's no more than that. Okay. If we go back to the original goal, is to integrate this kind of operations with our, with our product than in our case is a fresco. So a fresco, we can deploy that also in Docker or whatever you want. And we have two different APIs. So the first one is the classical press API. And the second one is a messages API, synchronous and asynchronous. So if we have existing content in the repository, you have a folder with 100 pictures, and you want to describe that. So you can use the recipe. Yes, to get the document, apply the operation, and update the document. And that's fine, because you can make a batch with that. Okay. You have all the operations available. And if you want to create that like more dynamically, when the people drops the document, yes, perform the action, then you have the messages API, the asynchronous API. So you can listen to the event, okay, there is a new picture, and this picture needs to be summarized. I'm going to summarize the picture, and that's updated. Okay. So these are the two different patterns we can, we can apply for it. What we are going to see now, again, live, everything is running on my laptop, just believe me, is something that allows us to classify a document. So we are going to upload a document. We are creating this rule. The rule is the same just for you to make the similarity with what is before. So we have a list of languages, Japanese, Vietnamese, English, whatever. And we are creating a rule to move the document to the right folder. So you draw a path document, and the document is moved to the right folder. Okay. Okay. So let's do that. Okay. Let's open a fresco. So there is a folder at some point. And this folder has a rule that is classifying the documents that I'm dropping on it. Okay. So if I, for instance, come for classify, no, for classify things, we are going to try with a Vietnamese one. It has to be a bit creative. Okay. Okay. So at this point, a fresco is listening to this new document, and it's classifying the document. So it's just selecting a term from the list of terms, and the document has been updated. So it has been classified. So if I refresh, what I find is that the document is on the Vietnamese folder, and you can do that with invoices, with whatever you want. And we can track that it was mistral, the LLM, that created this classification. Okay. Pretty easy, right? So you can integrate also all the other operations in that to get some automation. Okay. So I guess that I was running out. But no problem. So we have more time for questions. So again, this is a simple framework. You can deploy that on premise. You can choose your LLM. You have an initial REST API for operations. Public works are welcome. And then you need to integrate that with your product, with your organization, or whatever. Right. There is also some interesting hackathon with more use cases. So I presented you some use cases, but you have more of them on this hackathon. The slides will be, they are available on the, on Foxen. Okay. And also I'm using Olamma, but there are many other alternatives. You don't need to choose Olamma. So you have GPT4 all locally. This solution is the one used by, by next cloud, second state, high-end phase is the most known probably. But again, just, this is an initial framework. Take it as it and try some things with, with the NAA. Okay. That was all. Thanks. Thank you very much, Angel. Are there any questions? I'm going to do it in the order of the rule. Thank you, Angel. It seems to me all these operations are on one picture or one document. Are you also considering me asking a question on all my documents? No. So this, this sample is only for a single document or a single picture. But, but that is as easy as you have the database, the Neo4j database, then you can include as information as you want for a single document or for a single query. Right. So what I'm doing in the source code is to remove the previous information. You have to create something that is only for a single document. But you can modify that in order to add more than one document to one query. But on the sample is only for a document or a picture. While summarizing the Japanese PDF, why did you need to provide for context your picture Sorry. You showed the summarization of the Japanese PDF. Yeah. And then you provided for context the picture. No, no, the picture was for the last operation. So the three first operations for summarize, for classify and for prompting were related with the document in Japanese. I could use some other document. I know, but I love the document because I'm using this for testing for 15 years, something like that. So it's like my, my precious document. And, and the picture was there for the last one. It was the description of that picture that is more or less like, like yours then. Thank you. Similar to the previous question that I had, but for a single document, right. So the summarization for very large documents. Yeah. So, the problem is that again, I'm running on my lap. So I cannot use like a very large document, but I was just trying to summarize, for instance, books. Do you know the Gutenberg project? On the Gutenberg project, you have all the classics of Alice in Wonderland and so on. So I was trying to do that with that kind of documents. And it's able to do that, it takes a while, like minutes on my machine. Again, if instant adjusin, the regular CPU, you use a GPU, the tiny slide, I don't know, 100 faster, something like that. So I don't know. I need to make serious test with that. But having the right infrastructure, I guess that the, the performance is enough. It's not something like very instantaneous, right? But you can work with it. Thank you very much. Any other questions? Yes. Hi. A follow up on the previous question. Was the insertion into the vector is database taking a lot of time or was the actual query to the LLM? Because the insertion into the vector is database has to be done once, whereas the query can be done multiple times if, if you already vectorize the document, right? Yeah. So again, I was not trying to deliver a session on how to develop AI, right? It was just to create a framework. You have the AI track that can reply to you better than me in relation to that. But yeah, obviously, you can use the database. I'm not, I'm only using the database for a context of a single document, right? So you can create categories, you can add more than one document. You can add also the, the links to the response and, and so on. So yeah, sorry. Maybe I didn't understand you. Maybe you misunderstood my question. My question was when you added the Alice in Wonderland book, was it the vectorization that took time or was it the query to the LLM? No, no, it was vectorization, vectorization of the chance of the document. Okay. Sorry. That was the only one question. I'm not an expert, but I know a bit. Any other question? Okay. Thanks. Okay. One more question. Last one. I'll be around. So if someone just wants to, to catch me. Can you say a bit more about like the biggest use cases you see and if there's any open source setups of this that are out there for us to look at? In my opinion, the, the main use case of that is searching. So but this is a different world with different beasts. So but for searching AI, it's really quite relevant. So again, this is just to create a framework and then it's just to apply your imagination. Thank you very much, Angel. Thanks. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you.
Web Accessibility and Environmental Sustainability and with Popular CMS
Before we actually start, I just want to go off and quickly note the last two sessions being about AI and their rooms being packed, completely packed standing room only. We come down to a conversation about accessibility which is ultimately about humans, right? It's about us, all of us, right? And sustainability, which is the fucking planet we live on and the room is half full. So, you know, AI is an exciting thing at all, but really our priorities are a little bit at a skew here. This is not how the proportion should be about things that matter in our world. So I just want to add that little hint. No pressure. Exactly. Okay. So we can start, whether accessibility and environmental sustainability with popular CMS by Mike. Thank you very much. So thank you all for coming. So I wanted to give this talk because there's two issues I care a great deal about and I see there's a lot of overlap between them. On accessibility, we've made a lot of head roads. It's really nice to be able to see the kinds of changes that have happened in the last few years around digital accessibility. And we're only just starting to think about sustainability. And so I wanted to go off and have this session here so we can talk about them. The last two sessions had more technical information. I'm not going to show you any technical information. You will not see a terminal here. I'm here to talk about issues where there will be spaces and times to go off and to get into the terminal and to get into the technology. But generally I find that these kinds of presentations are not the best place for that. So first of all about me. Mike Gifford, a senior strategist at Civic Actions. I've been a Drupal developer for 18 years. I began working on accessibility in 2009. My assumption at that point was it wouldn't take me that long to go into, to fix the accessibility problems in Drupal. Just to let you know, we have not fixed all the accessibility problems in Drupal. This is a big problem, much bigger than I had thought. That said, Drupal is still one of the most accessible content management systems out there. I've also been involved in promoting open source for a long time. This isn't only my second time at FOSTIM, but I've been actively involved in promoting open source now for at least 20 years. And happy to be able to do this full time with Civic Actions now. And yeah, I've been able to focus more on accessibility and sustainability here at Civic Actions because I don't have to worry about HR and planning and all of that sort of stuff. So I've got deep roots in the Drupal community and that's just one CMS that people here are using. We also had a good experience in two years ago with FUNCA, which is an accessibility group in Sweden. And with them, what we tried to do was to build a, a cross CMS study of the offering environment. How do we try and make sure that offering tools are built to have accessibility built, how do we try and support authors to create more accessible content? How do we, we structure that? And how do we, if we can, looking at the European Accessibility Act and the Web Accessibility Directive, realizing the huge challenge that was facing the European Union to try and meet these particular accessibility goals, there was an effort by FUNCA to, to actually try and bring people together and to set some best practices for supporting authors and actually to do usability studies with authors to find out what, what we were doing. And I think it's actually the first, probably the only study using, looking at authors and engaging authors to say, how do we support authors in producing accessible content? And that's, that's strange given how many, how many, how much, how many billions of dollars are spent every year on accessibility and how little money is actually spent on fixing the problems upstream, either in open source projects like Drupal or supporting the authors who are creating most of the accessibility issues right now. So it's, it's an interesting challenge, but that was funded by the European Commission and it was a great project to be part of and allowed, allowed me to engage with people with, with PLONE and Type of, was Type of 3 there? No, Type of 3 was not, but JUMMA was, Umbreco, and again to see that the collaboration through, through different open source content management systems and some non-open source content management systems as well. I do think that a similar process should be happening with web sustainability, partly because we, we are in a climate crisis. This is something that we need to act on, all of us. It has to be something that we're thinking about the, the, how we're managing our, our technology because although we talk about the cloud, you know, all of the stuff has real world impacts. We do have to, to realize that there are, there are atoms, all, atoms behind all of the bits that we are driving. And unless we're starting to be more conscious about that, we're not going to be able to to, to have, we're going to continue to, to exponentially grow the, the, the, our sector, which will have a lot of environmentally negative consequences. So I wanted to touch on a couple different platforms, CMSs, and, and what they're doing around sustainability. Because this is something that, that I think every, every open source project should do, not just a matter of the ones that are, are, are content management systems. I'm only really familiar with the content management system world, because that's the world that I've been living in. But, you know, so for Drupal, we have, have a page that is talking about accessibility, about sustainability. It is really important to have something on, on, on project websites to say that this is something that matters to our community. This is a value of our community. You know, I've been writing about it and talking about it since 2016. But again, this is something that, that we need to see this information. If you don't hear it, reinforce it, and see that action that's happening, people are, are going to, people aren't necessarily sure how to plug in or how their actions today affect the sustainability in the future. There's a Drupal Slack channel on sustainability. There's also a sustainability statement that is up on the Drupal.org website. Again, it's not a very bold one, but it's something that, that is a start. It's a starting point for us to go off and say, this is something that matters for the Drupal community. And people need to have that, that starting point to go off and say, how do we, how do we try and, and get involved, take the next step, realize that our community is having an impact, both positive and negative in the world. So I'm trying to, to own as much as we can that responsibility around that. Is there any WordPress people here? Excellent. I like the WordPress sustainability folks. Like I like what is being done in that space. They've got a strong community. There's a lot of sustainability people who are engaged in, in both the Slack channel as well as in, in, in broader thought leadership, whether it's Tim Frick with Mighty Bites in Chicago or Tom Greenwood in, with Whole Green Digital in London. There's some great, great people in the WordPress space that are pushing sustainability and helping people engage with it. So, so yeah, there's, they've got a WordPress initiative and they meet as WordPress does them on a regular basis through the, the Slack channel and engage on issues that are important to try and help, help WordPress users go and, and make, make building sustainable websites more easier for everyone to go off and do. How much would it matter if WordPress were to go off and reduce its energy efficiency by 20% if they were able to reduce that by 20% I mean, that's, that's probably like, I don't know, I mean, like, they're 60% of the web, right? So how much energy is that? It's, it's, it would be significant if we could go off and reduce that demand side and, and think about the data centers, the impact on, on like the, the trickle-on effects to having 6% of the web be 20% smaller, would be enormous, 20% faster. If we could think about how, how we're, we're prioritizing our, our structure. Waitail, another really interesting CMS is doing some great stuff. Anybody from Waitail? Okay. So excellent, excellent. So, I, I think that, that of all the CMS out there, Waitail is doing the most to try and, and calculate what their impact is. There's a whole phrase that you've got to, if you don't measure it, it doesn't matter, right? And so, being able to say, you know, Waitail accounts for 8,240 tons of CO2 per year. And that's going to grow over year after year, I hope, because we want the Waitail community to go off and grow. But that's, that's still like 8,000 tons of carbon per year that Waitail's producing. And, I mean, we hope that the sites are doing good things, but ultimately, it's, it's a, if they weren't using work, Waitail, they'd be using something else. And it's so good to go, go off and start measuring this information. So, yeah, and then they've got good documentation. There's a, there's a roadmap in place. It's lovely to go off and, and see that there is, is, is support for authors so that authors again can have the support that they need to go off and, and to create more sustainable products. To have built in aid of support. Like, you know, this is something like, it's just the defaults that matter, right? Because, because authors are, authors are lazy. Developers are lazy. People are lazy. We want to do what is there by default. If you, if you, if you have the outcome, if whatever the outcome is, if it's the default, it's more likely to be the case. If you have to, if you have to do something else, if you have to go that extra step to do something, then most people aren't going to bother because they have the time to do the default, if that. And so, let, you know, let's, taking that effort as, as creators of tools to go off and to, to, to implement that as a default is really important. And that's something that we've done in, in Drupal for accessibility and something that, that, that has worked quite well for our community. So we definitely encourage that for both sustainability and accessibility. So how many people here are familiar with WCAG, the Web Content Accessibility Guidelines? Has anyone read it from end to end? Okay, we've got two people, three people. It's like it's, it's the most boring document. It's, it's written by committee, oh my gosh, is it ever a dreadful document? Because it was written by committee. How many people here have read the Web Sustainability Guidelines? Less, because it's a, it's a much newer document. This is something that only was released in September at the, the WC3 forum. But, but yeah, the, the, the, the Web Sustainability Guidelines have been written for, for, for it to be human consumable. And it's, it's written for the, the WCAG structure and framework, because we ultimately want to create a standard that governments and other organizations can sign on to and say, our organization is going to embrace these best practices for sustainability. And we want to go off and to, to measure ourselves against these criteria so that we can, we can put that stamp of approval that says we are meeting these best practices. So, from an accessibility point of view, the, the principles are, is it perceivable, operable, understandable and robust? And, okay, those are fairly generic concepts, but they've been broken down and explained enough times that many people understand how that's, how that's structured. Is it written in the most usable or most easy, easy language? We'll know, but, but it's, it's a, it's a fairly useful set of structures. The Web Sustainability Guidelines are based on, on the Sustainable Web Manifesto and the principles there are to be clean, to be efficient, to be open, to be honest, to be regenerative and be resilient. So, again, those are broad, broad goals that are, I think, help the, help us try and understand where, where those are of the North Stars in terms of where, where digital products should be going. I did, because this is an open source event, want to go off and highlight that, that open right now in the Web Manifesto is, is structured as products and services that provide, that are willing to be accessible, sites to allow for open exchange of information and allow users to control their data. So, that's not quite the same definition used by the OSI or the GNU Foundations or others, but it's a, it's a starting point and hopefully as more people jump on board and if our community is able to embrace this, I'm hoping that, that the manifesto may be able to be modified so that there's actually stronger language about supporting open source because we need to be able to collaborate with each other. We need to learn from each other's best practices and we need to use the innovative abilities of open source to be able to make change, the changes that we need faster. We don't have the time to go off and, for all of us to go off and, and to make the same mistakes over and over again. We need to be able to quickly find out what works, share that, those, those ideas and experiences about what works around sustainability with others in our community and with other communities, right? We need to learn and test our ideas. We need to be evidence driven, not something that is based on, well my community thinks this and your community thinks that. It's like, no, let's get out and, and find evidence to be able to determine what is the best approach. I would love to be able to have the numbers that Wagtail has for Drupal to be able to say, we think this is the CO2 impact for, for Drupal. It would be much bigger than Wagtail. Likewise with WordPress, it would be much bigger than Drupals. But to have those numbers would be useful to go off and say, this is actually what our community is responsible for and why removing this megabyte or this, this, this byte of data is important because it's this byte of data is being used on, you know, we know this, but with this byte of data is being loaded and transferred across websites, you know, a billion times a day. Like that's huge when you're trying to, when you're thinking about the load times, like it's a millisecond. What does it matter? On your side, it matters nothing. But when you're scaling it up to a global level and it's being replicated, you know, a million times like it is in the Drupal condition, that millisecond matters because that's, I'm not going to do the math, but it's, you know, still quite a few seconds. Any questions about, about both web sustainability guidelines or the web accessibility guidelines? No, any other CMSs, any sort of opportunity to jump in, contradict me, tell me what I've missed so far? How far are they from being official standard? So how far is the, the, the web standards, web sustainability guidelines from becoming an official WC3 standard? And thank you. So there's right now a community group that set up and created the draft to people, basically, Alexander and Tim went off and did most of the work to try and create this draft. And it's, it's really quite good and quite readable. But that's not how any other WC3 specifications being done. So, so right now it's a good enough document for people to start punching at. But we need to both set up a sustainability guidelines working group. We're in the process of doing that. So creating a charter to go and, and to create that, the oversight, invite more people into that, that working group. Once the working group has a charter, and then there's the long process of going off and, and getting an adoption for that. And then as soon as we have two use cases, so if the, the UK government and the US government say, yes, we're going to endorse this, this is the direction we see going ahead, or it doesn't have to be the US, could be, you know, there's the French government and the UK government decide that this is a useful, and then, then there's a, there's two implementations of this guideline and then basically the WC3 can make that a recommendation. But that's, that's the path forward for this and it is like so many community organizing efforts, a bit of a sausage making factory. So it's, it's great to have people involved and would love to have people check to be, to be involved in the web sustainability guidelines and to help sort of champion the formation of this because it is really important to have, have documents that are understandable and that are executable, that we have passed ahead to go off and implement those. So wanted to highlight some other, some other innovations from the web sustainability guidelines. One is, there's a real effort to try and be usable. So there, there, there's the WC3 document and many people who've, who've looked at other WC3 documents, whether it's on HTML or CSS or whatnot, you see that structure and your eyes start to go, go, get weary and you close them and whatnot. Well, the, the, the committee's gone and built a JSON file for the WC3 standard. So it's available both in the JSON format and the WC3 standard format and also in the web sustainability, web, sorry, web sustainability. Web, I'm blanking on the name of the, of the, the website, the web sustainability guidelines.org. Is that right? No, that's not quite right. I guess web sustainability.org. I should have it here in my notes and I do not. But there, it's available in, in, in three different formats. I wonder which is designed to be more human readable and, and, and a shareable reference. We've also structured around different elements. So there's, there's, there's structured around, around user experience design, on web development, on hosting and infrastructure, as well as on, on business strategy and product hosting. So thinking this, think about this as holistically as possible. There are some elements of the web sustainability guidelines that do touch on, on, on issues of accessibility because if people are, are not able to go off and access your services and they're trying to net, net, to navigate around to go off and access the, your, your content, that's something that, that is, is going to be a performance issue. It's going to be something where it's going to cost CPU cycles because you're, you're constantly trying to go up and access this. Also, it's, it's, you know, I think a lot of people have, have looked at the, the, the data centers and, and the hosting environment is one piece of it. But actually if you look at the, the overall energy component, the hosting piece is just, it's actually a small part of the puzzle. You know, most hosting companies are very aware of the, the cost of electricity and are doing their, their effort to, to minimize that cost so that they can increase their revenues. The real, from an electricity and, and, and CO2 perspective, there's a lot more work coming from, from, from our own devices, right? Like we plug in our devices, we maintain our devices. There's also elements of, of embedded energy. So, you know, I don't know how much CO2 is involved to go off and create this iPhone. It's not an iPhone, this, this Samsung phone. But, but there's, but looking at that embedded energy and the energy that it's going to take throughout the whole life cycle, life, life cycle of the product is something that we need to be starting to think about. So I mentioned that JSON file is, is a part of the innovation. They've, they've also, the WCAG format doesn't really structure the, the, the, the impact of the level of effort. But it's nice to see that, that the web sustainability guidelines have tried to go off and to, to pull those apart. So you can get a sense of what is the, what is the easiest thing for me to do that will have the, the biggest bang for the buck, right? If you can, you can highlight and structure your information around that ROI, then that's something that you can more easily scale up and, and know how to go off and invest your efforts. Because we need to be able to, to get people starting on this. There's a lot of, if you give people a checklist of 100,000 things that they need to change to the website, like, yeah, good luck. Maybe they'll hit the right ones. Maybe they'll just ignore it and walk away, right? But if you can say, here's the top issue right now that you can address. This is how you can start now having to have a bigger impact in your, in your, in your work. Then, then that can motivate people to get started and to say that they can actually make a difference. Because nobody wants to, to take on a wall of errors and, and to take, add that to their existing issue queue. Yeah. So, so I'm a big fan of automated tests. How many, how many other people here like automated tests? Get the machines to do it. You know, and, you know, there's good reason for that. I mean, people are good at some things, but we're generally not good at doing the same thing over and over and over again. Sisyphus really should have been a bot, because he would have, he wouldn't have minded at all. But in terms of automatic tools, I do want to go off and point out both Google Lighthouse and Unlighthouse. Has anyone tried Unlighthouse before? So, Unlighthouse.dev. It basically scans your whole website, more or less, with Google Lighthouse and gives you a Lighthouse score. So, it's what's from a, from a sustainability point of view. That comes with all of the stuff that comes with Lighthouse. You get the performance hits. You also get the accessibility score for that. The accessibility score is less than acts. You're not getting as much of the features as you do with acts, but it is a good solid starting point for people. But if you look at the performance numbers, that's a great place to start looking for sustainability. There's lots of overlap between sustainability and performance. And it's so useful to find allies too, like who are the people who are interested in the same issue. And absolutely performance people are on the same camp of sustainability. They're sort of like security and privacy. They're not the same, but they're very, very related. Also useful to point out sitespeed.io and co2.js. You can also tie in actscore into this as well. How people know what sitespeed.io is. It's similar to what Google Lighthouse can do. It does gain a site, or Unlighthouse can do. It does a site-wide crawl. It provides recommendations and a little coach that gives you some direction. It is open source, which is good. Not that Google Lighthouse isn't open source as well, but you deal with Google. But the co2.js module is a tool that's being implemented by the Web Sustainability, sorry, the Green Software Foundation. Green Web Foundation, thank you very much Chris. So the Green Web Foundation, fortunately, is the organization that's behind the co2.js. And it's a great module, particularly to give a snapshot quickly of what is available within your website and what the estimated co2 impact is for that. There's a lot more work that needs to be done. Two tools like co2.js. The default right now is to look at a byte model that is more or less looking at how many bits are transferred and to focus on that side. There's so much more research that is needed to actually get to something that is a verifiable amount of co2. So we know that one byte of an image is going to have a lot less co2 impact than one meg of JavaScript. JavaScript is so much more intensive in terms of CPU processing, whereas rendering HTML and images are really not a big deal for browsers. But JavaScript is something that does take a lot of processing power. But it's a really great tool and you can integrate it with sitespeed.io and have that information available. I also want to point out EcoGrader and websitecarbon.com. Both of those are good tools to get a sense of how heavy your website is. Take all this information with a grain of salt. This doesn't actually give you a number that gives you an estimate. That is something you can work towards and improve on. But just like with accessibility tools like the WaveTool bar or Microsoft's open source accessibility insights, just because you don't have any errors within your accessibility tool doesn't mean that your website is accessible. It means that there are no automated errors that have been found with your tools. So it's trying to understand what the automated tools can give you. They can't tell you that there are no accessibility errors. They can only tell you that there's no errors that the automated tools can identify. And the same thing is going to be the case with sustainability. We're going to have to have human input in the mix. So manual, we're going to have to have humans who are looking at do we have this page? Do we need to have this page? Is this something we can yank off of? Is there an easier process to get where people need to go in order to achieve the goals of what we're trying to achieve? That's a very human thing to try and understand how humans are acting. We can't be expecting bots to do that. Same with on accessibility. It's a question of does this alt text actually make sense? It may or may not, but we need to be able to... A bot isn't going to be able to tell that. It can tell you whether or not there are... If it's image123.jpg, that we can get a machine to do. Most don't, but it's possible to go off and do that. But we need to be thinking about where the limitations of what the machines can do and what we're thinking about these processes. And again with JavaScript, is this JavaScript file actually needed? Could somebody actually navigate this information with the mouse? What are the comparisons between the accessibility world and the sustainability world? What can we learn from one that we need to bring to the other as we scale up and start addressing issues of sustainability in our sites? Is it content fulfilling user needs? Does it work for the assistive technology? Those are all kinds of related questions between the two different disciplines. Any questions at this point? Going through a lot of stuff and I know how hot this room is and how overwhelming FOSDEM is. So totally understand that if people just want to go off and engage. But do feel free to go off and stop me if you have questions. So I wanted to touch a bit more about open source tools for sustainability because this is an open source event. So we've got, and Chris, if you see things that are missing, please go off and jump and say if there's things that are missing. There's the carbon cloud footprint, Scandis, how do you spell that? Scrapis here? Yeah, whatever, that one. There's another one by Yadz. It's such a hard time naming things. Why is it so difficult? Cube Green, Kepler, Green Metrics Tool, CO2.js. And CO2.js is built into Firefox. I learned about that last year at FOSDEM and it's really wonderful that that's been something that's being brought into a popular browser. Hopefully Chrome will be shamed into doing that as well because we do need to go off and build these tools in. I also mentioned sitesv.io previously. In terms of websites to learn more, there's the awesome green software list. So you can go and find all kinds of information about green software that's an open source tool that are available for that. Also opensustain.tech and climatetriage.com. So there's so much information out there. A lot of it is free. This is stuff that people are learning and sharing because they need to go off and because there's not a lot of awareness. This is not an issue that most people still believe that as long as you don't print out the web, if you don't print out your pages, you're being environmentally friendly. We're not thinking about the overall impact of our digital devices and the actual weight that they carry on the planet. Any questions about that? Any tools are missing? Chris, anything big that I might have missed? No, the only thing to ask is that there's a talk tomorrow where it's Firefox profiled talking about sustainability. Excellent. And there is also a talk by a digital spread on Skafanja at 6 o'clock tonight as well. So tonight at 6 o'clock there's a talk on Skafanja which is... I'm going to be at lower this now. Yeah, that's taking in the other room at 6 p.m. Marvel. That's great. Definitely want to learn more about that. That's the energy room, is that right? Yes. And also there's... Sorry, tomorrow there's the talk about the Firefox profiler that's in the energy room as well. No, I think that's another room. But if you look up where sustainability, Firefox, it will show up. Wonderful. Thank you. I will definitely share that out after the talk as well. So that's great. Sorry, in terms of the question, sorry, did I repeat enough of it that it's understandable? Okay, excellent. So yeah, we need to go off and just like accessibility, we need to bring these things in early as possible. So how do we tie this into our development process? How do we start looking as early on in the process so that we're catching where we're starting to add bloat? Where does the page start to slow down? How do we make sure that every sprint were a little bit faster and a little lighter than we were previously? So we want to catch bugs before they ever get to production. So it has to be part of the CI CD process. If it gets to production, it's too late. Not that you can't fix it later, we probably won't. We're developers. So also trying to go off and setting page budgets is quite useful as well. With accessibility, I like to go off and aim for zero X errors. So they call that axe clean in the DQ world. For website pages, like you're going to need to set your own, it'd be lovely if people could go off and reach 200 grams of CO2 per page. Most are much, much more than that. So I don't know how that many sites out there that are meeting that 200 grams of page, but let's set a goal and try and see if we can improve it over time. And measure our pages now so we know what we are now and see if we can get achievable goals over time. Again, think about sustainability and accessibility bugs. We need to start, we can't think about these as features. If we leave them as features, they're not going to be addressed. They have to be seen as bugs and treated like bugs, right? So that they're more likely to be fixed in their address early in the process. And even if you're looking at minor bugs, if you're repeating the same minor bug a million times, it becomes a major issue. So again, think about the cycle and how these things scale up in our tools. Look at these tools and try to find ways to get a multi-layered process for quality. How do we make sure that we're building into our CI CD, that we're measuring support for authors, that we're scanning the environment for errors, that we're doing randomized tests, like we don't want to be scanning every page every week. But we should have some sort of process where we have a plan for how we're going to provide automated and manual scans for the information. Are there ways that we can construct your manual testing? How do we try to make sure we have a thorough rock process to remove content that we don't need? Are we doing annual reviews and doing deeper guides? Are we encouraging people to get certifications or to learn more about this? There are some good tools out there from the Linux Foundation and others around learning about the sustainability impacts of digital tools. We've got some ideas about what a robust approach to accessibility looks like. Very similar. We've got to go off and check for errors in our process. Use tools like Editorially and Sally to go off and evaluate it. Is anyone going to use Purple Hats as an accessibility tool? Purple Hats is a great tool for crawling for accessibility errors. Singapore's Government Digital Services has created this using Act, but it's useful to try and think about ways to build processes and to have a belt and suspender approach, just like you do with security. There's also a tool called the WKGEM, which is WKG Evaluation Methodology, which is useful for thinking about a structured approach for evaluating websites or an accessibility to see that you can compare two websites and you're going to have some confidence that you're going to get similar results or similar kinds of comparisons so that you're not dealing with an apples and oranges situation. I also want to go off and highlight that in terms of, yeah, we've got information around CO2.js and incorporating that into our pages. There's tools like SiteSpeed.io, the Firefox Profiler. There's also efforts to try and have, I don't know what the evaluation, what the comparison would be between WKGEM, this accessibility evaluation, and something like sustainability. I don't know how you would create that tool. The Web Sustainability Guidelines is trying to do that, to have that comparison, so you have some way to do a comparison of two websites and have a sense of how sustainable both of them are, but this is stuff that needs to be developed. There's no tools for that. The Firefox Profiler is one tool. We need to have others because Firefox isn't the only browser out there. It may be the best browser out there, but it's not the only one out there, so we need to be thinking about this in terms of how people are engaging with the tools. I do want to encourage people to go and to learn more about sustainability, to think about what their next steps are. Here's the sustainablewebdesign.org website. That's the URL I forgot about earlier. This is the Human-Readable approach. Is everyone here use Slack? More or less, even if you have to, just because you have to. The climateaction.tech has a really great community for learning more about this, and I think it's a wonderful place for people to ask questions and to be able to share their ideas and to, if you've written blog posts about how you're engaging with sustainability in your open source project and you want to share that with others, that's a great place to do that. I think that's where I learned about the work that Wagtail is doing and tried to sort of bring that over. There's also a whole bunch of interesting books, and there's more and more coming out all the time. There's Green Code, there's SustainableWebDesign, there's Building Green Software. Depending whether you're a designer or a developer, whether you're back-end, front-end, there's material out there which is geared to you. Take a look at what's available and see if there's material out there that you can read and learn from. There's also any project managers here in the group? Okay, two project managers, excellent. Three, so there's now actually a course that I don't have listed here on sustainability for project managers to learn how to go off and how to project manager, what do they need to know in order to learn about digital sustainability. There's also some excellent podcasts including Environment Variables where you can hear our very own Mr. Chris Adams, much of the time, not all the time, but quite a lot of the time. And Green I.O., which is another great podcast, there's others as well where people are being done. There's a lot of places to learn about what's available here and engage with it. But I really want to encourage people here to test their code, to test their websites using tools like the website carbon.com website or eco grader or co2.js, test your stuff and see how it looks, what you can learn from it and share that with others. Let's start talking about that so that we're encouraging other people to think about what their impact is with the digital tools that they use. And with that, I can be contacted here and I can also, if people have any questions, happy to answer questions or engage with people here as people for CFET. Thank you, Mike. Any questions? I'm glad I didn't answer all of everyone's thoughts. Hi, thank you. Thank you for this. I'm not familiar with these topics, especially sustainability. You're talking about accessibility and sustainability as they are somehow related to each other. They're both elements of quality. The way I see it, accessibility is more about individuals, users, on the land, sustainability is about the general audience, the general environment issues. So how do they correlate to each other? I mean, if your website is sustainable in terms of electricity and money saving, it doesn't mean that it's accessible for a user which has issues using your website. They're mostly combined in my head because I work in both areas. So to some extent, you can have different areas, but there's also a real effort to see the development of, like you know about human-centered design? There's the effort to create planet-friendly design. We are the only species that is engaging with the web at this point that we know of. And so we are the users in this case. And we only live here and a few people in the space station. But it's like having a planet-friendly focus is... Of course they're... But the way I see it, companies will be happy to talk about sustainability because they will save money. Right. I'm not sure a company will be happy to talk about accessibility because it's not giving directly money to them. Is it correct what I'm saying or just maybe my wrong perception of everything? There's definitely different incentives. And partly it comes down to there's new legislation that is coming into place and finds that will be in place. And certainly in the U.S. if you're trying to sell to a university or a government agency, there's an effort to go off and to meet Section 508, which is now the more or less WKG 2.0 standards. So there are some financial incentives around that. But there's also... And people look at sustainability. They think about the cost of both producing or buying new hardware, but also in terms of electricity savings and seeing that as being a cost savings initiative. But there's other elements about digital sustainability that does actually cost money too. Like to do it properly, you do need time to focus the developers on building systems that are optimizing them and finding ways to cut down on the amount of craft that's... How many websites have redundant JavaScript libraries running? Or they've got multiple instances of analytics on their site or other tools that the marketing team wanted to install, but nobody's looked at for months. We know this happens in our industry all the time. And we are paying a price for that, but not necessarily a price that is... That the companies themselves are paying. Again, from an accessibility point of view, a lot of people are seeing this a lot cheaper to go off and to add an overlay to their website and hope that paying $50 a month will go off and wish away all of their accessibility problems, and they'll have something that they can justifiably say, we've got that covered for $50 a month as opposed to the justifiably thousands, hundreds of thousands of dollars to go off and pay people to actually fix the problems. But that just pushes it to the fringe. It's not something that actually solves the problem. Chris. Hi there. You can hear me all right, can you? Was that? You can hear me okay. Cool. I'm curious if there is a role you think that public sector organizations or early organizations could play like we saw with accessibility. Right. So mainstream this or make it easier for people to adopt like we saw with say public sector and Microsoft and stuff like that. Are there any similarities there that you would draw people's attention to? Absolutely. I mean, ultimately we've got to look at incentives, right? Follow the money. And that's starting to happen with accessibility, with sustainability. People are still not aware of it. And the UK government has done some great work talking about sustainability and digit and measuring it and trying to be aware of it. And their site is probably the most sustainable in the world. So it's really wonderful to see that. But also the, yeah, I think that the government sector is in most countries the single biggest procure of technology. So if we can get government to commit to buying green software, it will make a huge deal in our industry. Because people, if they want the contracts, they're going to have to go off and be able to follow the web sustainability guidelines and say that this complies, right? Even if they're doing a half-hearted effort, it will make a huge difference if government steps up and says this matters. Even just to make it an issue of public discussion. Thank you. Any question? Thank you, Mike. You mentioned earlier the AI hype. Do you think we can find that for either the BDT or accessibility? Yeah, I mean, absolutely. I mean, it can do wonderful things. But people also, there's an environmental cost to AI, and that's something that is not being measured. And so yeah, we can go off and solve some problems with AI, but we have to be very careful about how we're doing that and whether or not we're being responsible for our use of AI. Because it's something that we can't just solve problems and throw AI at a problem and hope that this goes away. We need to have humans involved in the process to see this actually makes things better and be focusing, like with accessibility, we need to make the people with lived experience of disabilities the experts. They're ultimately the ones who need to be heard to know that they're able to access these sites. And not the standards, not the experts, but the people with lived experience with disability need to be involved. And the same goes with sustainability. You can't just have, like, we have to be measuring what the overall life cycle impact is. That includes people coming to conferences like this. It also includes the cost of involving thousands of generative AI models to go off and to evaluate the performance of websites. We need to be cautious about it because I think that it can be useful, but it has a huge cost. Thank you, Mike. No other question? Okay. Thank you very much. Thank you very much. It was green the whole time. Perfect. Thank you.
Cristal: a new Wiki UI to rule them all
Okay, so thank you everyone for joining. We'll soon start the talk about crystal. One week you have to rule them all with Ludovic and Manuel. So I'm going to give you the mic. Okay, we're good. So hello everybody, welcome to this talk. So we're going to talk a bit about the new project we have at Xwiki. And so we're going to present the team first, who we are, the product vision, what vision we have for this product. And since it's a new product, it's not ready, so it's not something that's usable today. It's something that we're going to build with a lot of energy. And then we're going to show the design proposals that we have for the UI, which is from our belief very important, then the technical architecture, which is another part that we believe is very important. And then the current status and the roadmap. We call this project crystal because we want to both... We want that it's both beautiful on one side, but that it's also like the chemical structure, like it has a very nice and very well-done architecture. So first, who are we? So I'm Ludwig Dubost, the CEO and founder of Xwiki. We're going to talk about Xwiki just after. Manuel. Manuel is the tech lead of Crystal. He's going to talk to you about the architecture. The project has also Vincent Massaud, our CTO, Thiago, our designer on this project. And we have also the support of the whole Xwiki team on this project. So Xwiki SIS is a company that is established in 2004. For 20 years, we're working on Wikis. So it's quite a long time. We have friends in this endeavour of Wikis. We believe Wikis are very important. Hence our tagline knowledge is power. We are a self-funded company. And we have reached 4 million euros revenue. We're also building the CripPad software. We have done 50% growth in 2023. We have 60 employees in France, Romania, but also Germany and anywhere. As I said, we have two software, Xwiki, CripPad. We engage in digital sovereignty. We believe open source is really important for that. And we have a business model to really try to fund the software. We believe it's also very important to find a way to find open source software. Open source software cannot be done with... Microphone is not working. Is it okay now? Okay. So I have to be careful. It turns around. So the product vision. So first, there are lots of technological shifts that have happened in the latest years. For example, JavaScript is getting more and more mature. Like better development methodologies come in JavaScript. We come from the Java world. And so we're very keen to have great development methodology. And for a long time, JavaScript was like a bit in every direction. And now we see that it's also getting more organized with development tools. It has better frameworks also for developing JavaScript applications. Standards have evolved. You also have web components, JavaScript modules that are working much better. You have also technologies such as JSON-LD or SOLID that bring new capacity. There are also new paradigms. Paradigms, real time, is becoming something that any application should kind of have. We believe offline is important also. And that the technologies allow it. We also see a convergence in the field of wikis. Not only between wikis, so the features of wikis are getting similar. Like there's a better understanding of what the features of a wiki is. But also we see that there is a convergence between wikis, for example, or even drives. They're getting closer. So there are also questions about how they could be similar applications. For example, we have always had attachments in wikis. Maybe you could consider attachments or documents like their wiki documents. So there are convergence there in this area. We believe there is a model of future. So Jitsi's founder, Emil, in last year, he mentioned at the end of his 20 years Jitsi talk that in open source building on layers is going to be an approach that matters more. And it enables tremendous innovation. So if you look at Jitsi, you have a Jitsi library that makes the video conferencing module. And you have the Jitsi application. Like I really believe in this and open source can really reach all of its power if you can do a lot of reuse of anything. So you need lots of modularity. We had reached a lot of modularity in xwiki with everything's components in xwiki and Java. But now applications are way more client side. And so you need the same level of modularity on the client side. And we also need integrations between open source tools. There was a talk about OpenDesk in which we are partners where we need to bring open source applications together so that the whole suite of open source applications could replace Microsoft or proprietary applications. And we need to be able to integrate tools much tighter. And for this you need, again, the modularity. We also got an opportunity to fund that work. We have one with other companies as part of consortiums. We have one actually three projects, but two of these projects include the work of Crystal that we have included in the projects of building this new UI. And so we had this opportunity to fund it so we're able to get money for that. And so we have this big opportunity. We also have the opportunity to have the collaboration of partners. So actually the partners of this project, unfortunately they're not open source, would be also users of that Crystal module for their application storing data in their own system. We'll come to that when we talk about the vision for the product. So what's the vision of the product? It's actually one UI, one wiki UI, a modern one that brings all the features that you have in wikis today and that can support multiple back ends. So you would have an application that is web, desktop, mobile, and this application would be extensible, very modular based, but it would have a common data model behind that applications, which support offline, real time, and then it would be able to connect to different systems in terms of back ends. Of course it would be able to connect to Xwiki. We built Xwiki and we wanted to connect to Xwiki and support all the features that Xwiki has, even the most advanced one that we have in Xwiki. But we also would want it to do a basic wiki based on a file system that you store locally on your computer. We also would want it to work as a nice wiki with the next cloud back end using webdav or git. And we also would want it to support a wiki storing data in an end-to-end encrypted system such as CripPad that we build ourselves also at Xwiki. And this application, as a whole, where you can activate and deactivate modules, you could decide that these features you don't want them, you can change modules, you could replace modules in this modular application, would also be embeddable. That means you could put it in the next cloud server and serve it from the next cloud server. You can put it in the Xwiki server, serve it for the Xwiki server to access Xwiki data, or you could put it in any other application. That's the vision of the product, Crystal. The key concept is that we want it to be a slick UI with modern editor and slash comments, multiple back ends, as I mentioned. So, slick UI means it needs to be as good as what Notion does today in the world of wikis in terms of UIs, or the Notion competitors that we see coming in. So, we believe the Notion competitors are nice because they support a lot of nice UI features, but they don't support the modularity that Crystal will have. It's going to be offline in real time. It's going to have accessibility by default, support web components, and also be sustainable. There was a very nice talk before about sustainability of software, measuring the consumption of software. We want to try to do that also with Crystal. We want Crystal to be a UI that is built in a way that will consume less. It's going to be available as web, desktop, later mobile. It's going to be extensible and configurable, and it's going to be a strong editor. So, I'm not going to go into details of what a strong editor is. It's going to support markdown, but it's also going to support the Xwiki syntax, but it's going to be state-of-the-art UI. Lately in Xwiki, we implemented slash comments. It's going to have slash comments in Crystal too. It's going to also support structured data. That's one of the big advantages of Xwiki that our customers and users have loved in Xwiki compared to other wikis. We have a whole system around structured applications, structured data, and we're going to support that in Crystal. Some use cases, we wanted to be a UI for simple storage, markdown. So, it should be a simple wiki. The idea is that it can be a local note-taking app. You use as an offline with a storage that is local, and that would be really interesting. It's going to be a modernization for Xwiki because Xwiki has a UI that's quite old now. We have done a lot of things with this UI, but we wanted to be the modernization of the wiki UI in Xwiki. It's going to be embeddable, as I mentioned, and we wanted to be a wiki view on all your wikis so that as an individual user, you would have multiple wikis in your Crystal UI. You could even create a wiki of wikis, so you could create your own tree of pages, and you could be navigating different pages in different wikis that you have in the back end and locally, of course. It can be an end-to-end encrypted wiki for CripPad, which is a feature that we would love to have. So, we can summarize that as a new wiki UI to rule them all. The design proposal, I give it to Manuel. Is it working? We only share the results of work by Ciego, UX engineer, we hired a few months ago. Since we have some experience with Xwiki, we are able to part from a blank state, but using the experience we have already to redesign a more clean and modern UI for wiki. That's one example, but we have some documentation online where you can find other wireframes. Of course, everything is community-based, so you can come to the forum where we are openly discussing design ideas and contributions. One important aspect we want to work on is, since we want Crystal to be unbedaible, it can be with its own style. It needs to look like the application where it's integrated. What we want to do is, as a developer, when you design a part of Crystal, to design it with some abstract components, UI components, by configuration to be able to say, I want to use shoelace for the actual visualization without much code to say, OK, now, for this application, I want to use Vue.define on... to make it easy for the developers to switch from one design system to another. It can even be convenient, for instance, the French government has its own design system, so if you want to have some knowledge base for the French government, it should be by extension to define a new concrete design system and to use it for their own needs, for instance. We can imagine some other use cases, like in Nextcloud, they have their own set of components, and if you want to have Crystal inside Nextcloud, you want to have it look like Nextcloud and be seamlessly integrated inside the ecosystem. So, a few notes on our technical view for the future. Starting Crystal was a very good opportunity to be able to try new things, so I'm working for a few months into studying a lot of libraries that we can use for Crystal. That's a snapshot of things we have settled down for now. I went to the JavaScript room this morning, and now I have like dozens of new technologies to check. We have this page where all the choices we made are listed, and we maintain it over time. And so, in terms of architecture, it's starting easily with two main components, the web one on Electron, all the ones with Dash as the platform where the most work to do in the future because they are the most challenging for integration of Crystal inside XWiki because it's 20 years old project, so as you can expect, a lot of features to make us compatible with. Reach editing is very challenging. We need to choose a new technology for the editor which is compatible with offline editing with real-time editing, so that's a lot of work ahead, but we have plenty for our next roadmap. So, the key aspect we have in mind is that we want to preserve from what we already have in XWiki and that we deem as important its accessibility and sustainability. That comes with artifact size, of course, measuring performance, making the Crystal locally usable, modular with inversion of control, based on standard as much as we can, for instance JSON-LD web components, to keep having documentation for users, for developers, to have a broad idea of the artifacts we want to publish, so the abstract design system library for others to develop design systems on top of Crystal, a set of connectors to different sources, as we said, and a JavaScript syntax formula to have offline editing with a rich experience, a software development kit to be able to develop extensions, and a set of components we're considering web components in particular because that's independent from a particular framework, which I believe is better for the future, a long-term future of the project. So, on users, we have this electronic application for desktop not taking as a replacement for the XWiki front-end. So, I'll get back. And so, the tricky part now is what's the status and where are we today? So, the first thing is we have a prototype of the extensible architecture using IOC and the version of control. That's actually a very important part of the way we've designed the applications. So, people coming from the Java world understand what components in Java are and what inversion of control is, and this is actually something that is not used that much in the JavaScript world. It's used by frameworks, so ViewJS or AngularJS, they're frameworks that are doing inversion of control, but when it comes to JavaScript libraries, this is not something that is used that much. So, the key feature that is really important for the extensibility and modularity, if you want to be able to replace one piece of the system because you want to change the way it behaves, you need to be able to replace any module for which you have defined an API. And inversion of control is a key method to be able to do that. In the prototype we did, we've been able to load dynamically a module by configuration that is coming from the Internet. So, in the configuration that you want this module instead of the other module, and from a static build that has been built as a standard crystal delivery, you can add an extension that will replace one of the modules of the system. And this is key. We have designed the architecture, the basic architecture of plugins, skins, and user interface extension. In X-Riki there is a great feature that is called Skin and UIX. Skin is a way to replace the UI. UIX is a way to add an item somewhere in the UI. So, if you want to add a feature in the product by an extension, you need extension points. UIX and X-Riki is the way we do it. We have replicated these methods in the crystal prototype so that you can add things in the extension. And we'll also replicate the fact that you can replace the skin. So, in addition to what Manuel explained about the abstract design system, which allows to reimplement the basic view and the basic components that we're using in the whole application, we can also replace pieces of the user interface. We have implemented X-Riki and Markdown renderers. One difficulty was to bring a JavaScript renderer for X-Riki. If we want to be compatible with X-Riki, we want Markdown to be a first-class citizen in crystal because that's the standard today, but we also need to support our customers that are using the X-Riki syntax in with X-Riki. We've also done prototypes of client macros, rendering in ViewGS of a macro. So, new macros. We've done the choices of design system libraries. The first one we want to spend time on, Shule's Vutify. One thing we had in the previous slide that Manuel didn't say. Shule is one of our performance tests. Actually, twice as fast as Vutify, and Shule is a web component library. We were quite impressed by that. Vutify is a pure Vutify library. Shule is a cross-platform library of components like supporting React or Angular, etc. Really interesting work. We have done design work. We have a prototype UI for basic view. We have first test of the editor UI with Markdown and TipTap, and we have the project infrastructure. You can check the code at the link I gave, crystal on GitHub X-Riki contrib. Basically, what we want to achieve in 2024, we want the first version for basic wikis. You can browse your, you can actually take notes in Markdown with an electron system. You can access a git on the other side. You can access a basic X-Riki with not all the advanced features of X-Riki. Maybe about 50% of X-Riki's current features. By, during 2025, we will achieve 75% of X-Riki's feature, including structured data. We want to bundle it in X-Riki by 2026. We want to have also plugin repository. We'll probably have that earlier, but we want to start having more plugins. We want to also have done more plugin development and a Crip at release. We probably want it as a default UI for X-Riki. Also, if we have done our work properly. That's it. You can look at our website, crystalx-wiki.org. There is also very interesting information there for anybody that does a JavaScript application, an advanced application. We're not necessarily the biggest killer in JavaScript. We come from the Java world, as I said, but we have done a lot of studies of what are the good technologies, because we have a lot of experience about choosing libraries right. We're trying to really make tables. We have tables about libraries and so on, about technologies. Don't hesitate to look at this. X-Riki is also hiring. If you find this project interesting, you can also join. If you're interested in what X-Riki is about, I have a beautiful conference at 9am tomorrow. If you like to wake up early in K, and we also have a party, you can scan the QR code if you want to join our party tonight. There's no room left? You can still try. You can still try. It doesn't matter. There's a risk. Thank you. Questions? APPLAUSE Any questions? Do you have an example of an extension you're imagining or planning for? First, any macros are extensions. If you want to add macros in your wiki, it's going to be extensions. If you look at X-Riki, we have 650 extensions. We have at least 50 high-quality extensions that we're not bundling with X-Riki. Lots of them are macros. Macros can be extensions, but it can be just adding a feature. Structured data would be an extension. We would not bundle it in the basic one, in the basic crystal, if you are not using crystal as a back end, because we wouldn't support anything. Everything is going to be X-Riki supports. For us, anything will be an extension. The difference is some will be bundled and some won't be bundled, but storage system is an extension. Access to Tiki can be an extension. Access to GitHub is one. Access to Git, access to file system, they are all extensions. Thank you. Another question? No? No question? OK. Thank you very much. Last second. Do you have a specific library for JSON-LD? What do you want to use? Can I repeat the question? Is there a specific library for JSON-LD that we want to use? First, when we look at storage, there are two ways to do the abstract storage. One way is to hope that the server application will support JSON-LD by default. We'll actually do that for X-Riki to try to make X-Riki give you JSON-LD by default. We believe that will be better because we'll do conversions of structured data of X-Riki and JSON-LD. That will be very interesting. In the Java world, we have found a Java JSON-LD library that is widely available. On the JavaScript world, at this point, we didn't feel we needed the library. That's just JSON that we can manipulate. At this point, we haven't seen the need for a library because we're just storing the JSON-LD data as offline right away. Sorry, I forgot to say. The second way is to do the conversion to JSON-LD on the client. That means the storage module will use the standard API of the backend and then transforms things to JSON-LD to give it to the other crystal modules that will understand JSON-LD. The conversion would be on the client and then you store as offline the resulting of that conversion so that you can do anything in the application. We didn't see the need yet, but we're not there yet. We did some tests of how X-Riki.NET to JSON-LD would display when it has structured data in a page. We've been able to replicate things we do in X-Riki on the client side in a similar way. We're not there yet. For now, we're focusing on the editing experience, which is the most important part for the beginning. Thank you. Thank you very much. Another question. Sorry, we'll take it outside.
Pushing Tiki to its limits
Hello, so Tiki provides a very powerful and flexible database abstraction layer. Thanks to a concrete example which expanded for three years. We have learned a lot. As we start a similar project, we have time to reflect on lessons learned, pitfalls to avoid. And why not share everything with us, with you. So first I described the context, what the project was about, how we did it, what the challenges were, and what we learned as a summary. So I'm Jean-Marc Kipps. I discovered free software last century. I'm in the Tiki community since 2006, live in Strasbourg. I'm alone in front of you, but I don't want to believe that I did all that alone. It was a team project. It was headed by Evoludata, and a lot of people helped. Some of them are in the room. The customers were the peak team from the Institut Nationale et santé publique du Québec. The end users are medical testing laboratories. So that's the website. As you can see, everything is in French, but I'll translate as much as I can, and I translated before I did the screenshots. It's a way of cheating. This is the team, it's quality control, actually. And what I do is that every year they produce by medical samples. They ship them to registered labs, so peak ships, and the labs have to register. They have to register because not all the labs do the same analysis. It depends on the machines they own. They have to be certified for all the analyses they can do, and so they have to choose them. Then they do the test, and they send results, and peak analyses, the results, sales reports, and recommendations. And this is what they call one campaign, and there are many campaigns which are linked to group together in the program, et cetera. That's one of the processes. They used to do that using faxes. So at first you think, hey, how hard is it to be better than fax? It actually faxes hugely flexible. So for example, different medical disciplines did things in different ways for totally valid reasons. So we had to adapt, but there were also clever people. So they also used that project in order to kind of streamline and make their processes. So we met in the middle. Everybody improved. And of course there are other processes, but I don't have time to explain everything. That's just an example. So yes, they also have every year to draft, review, approve, and publish the programs that the people can register afterwards and manage those registrations. So in general, that's the website. If you don't have an account, and if you're not involved in a process, it does not match in it for you, even if you understand French. What's in it is this is, for example, the example of what I mentioned, that's the management of the program. So they have all the interface they need. They can edit it. They can view it. They can go and edit the campaign which are linked to the subprogram. They can go to other pages. There's a lot of it. Every table, as you can guess, is actually data linked to this program, but in other tables and sometimes in all other tables and other tables. So it's not simple. This isn't too hard. That's the process where they approve. They discuss on it. That's just comments, and then they validate it when they agree together. This is actually the same subprogram, but that's the end user view. So we have that flexibility, and also that's where they actually click when they want to register, as I said, they would. But there are lots of variety. Here is another program that you have plenty more combined. You can click on them because it's not the time of the year when they register. So as I said, it's rather complex. How we did it. Tiki, in case you don't know, has plenty of features, and you have to choose the ones you want for each project. So basically, we use the wiki pages in Tiki in order to embed widgets, which we call plugins, in the wiki pages, and that's where the logic is. Well, you can also use them for documentation. We have file galleries. We don't use them a lot. But there are some documents to share. Trackers is the huge thing. Trackers is the Tiki name for the database abstraction layer, because it's starting it, it's starting, started as a bug tracker with grew and grew and grew, and now it's a full-fledged database abstraction, but it's hard to rename things afterwards. The fact each tracker item still has a status, open, pending, et cetera, and we use it. The categories are useful because that's what we use for the permission system. The scheduler, I'll get back to it. It will be simpler. The performance-related features, the main one is that, well, when you have a lot of data, the important thing is how you set and index it, and the default is MySQL full text, but you really need to install elastic search for that. We really had to because there are too many limitations, especially in the number of fields that MySQL can do. And the rest are basically, we had to raise everything, and it's easy to do because we do it within Tiki. It's just configurable. So all the time we doubled some memory limits, et cetera. So trackers. Trackers are basically, you can think about trackers as tables. Each tracker is a table. We have 86 of them so far and still growing. This is the tracker admin view, which end users don't see, but the customers love it because they feel empowered. They can see what's going on. They can edit stuff. We have activated inline editing. So when you see that, you can click on any of those little widgets and edit what's there, correct typo, a filter on what you want to see, sort on every column, et cetera. So that goes out of, that allows to do a lot of things without bothering to set up a whole workflow and it's really useful. So I said that trackers are like tables and tables have fields. So there are plenty of kind of fields and these are also, you can just add them, et cetera. So the useful ones or the auto increment was, is really practical because it allows to access and display the item ID of each tracker item. Item link, it's super powerful. The item link, well, if you're familiar with SQL, think about foreign keys. The item link links to another tracker. So when you edit the item, you have a selection of item from the other tracker and you can link track, well, two trackers. You can link tracker items from one tracker to tracker items to another tracker. Once you manage to link those, items, it is super useful because it does the other thing. For example, as we said, these are the campaign. Each campaign is linked to one subprogram and once it's linked to the subprogram, I mean, the subprogram has a year of the subprogram. So the campaign just gets the year from the associated subprogram. You don't have to do a double entry, et cetera. So you get those data. It's all indexed together and when you display the campaign, you have all these, all these values from other trackers. So when you start to link trackers, as you can guess, yeah, it starts to look like, you know, database schematic. That's a schematic I did. By the way, in a wiki page, in a tiki page with a draw widget, I needed it for a workflow because otherwise I couldn't figure out what to do. And so this is, I linked all the item links and the item lists and I put color because the color is about the fact that, about the fact that when you link tracker items and you delete or change the status of a tracker item, you may want the related tracker item to be also deleted or its status to change or not. That's configurable and that's why I wanted to keep track of it. Yeah, still not 86 trackers. That was, yeah. So how it dealt with source management. We had three, not four environments. We set up a dedicated private GitLab repo. We had our branches and we stuck dev and test on those branches. So every commit would instantly update the site. We get from one to the other by merging and terrific and production is not locked in production. Then the staging environment we called test is approved. We create a tag with the date and we run that in production. And so that means that, well, we have auditability about our versioning system tells us what we were running at what time, how it evolved, what we had in production at a given time if we want to recreate production at a former date when you hit a bug and you try to figure out is that a regression or is it something we missed last year and it was already there. All our commits were very careful. We put that we do not edit the Tiki files as much as we can. We add our templates in our theme or in the custom translation. That means that when we do a merge and we want to get the novelties from the Tiki community called and the security improvements, we do not get merge conflicts. The database management is just the opposite flow because the reference is in production. That's where we have the real data. Some of this data has been entered by end users. Some of it is those wiki pages we edit in order to show you later why we have code in our wiki pages. The nice thing is that we can try that. We synchronize test and dev from that. Then we do experiments. Then we get that to be validated. If it's okay, that's the approved edit. Then we synchronize. Tiki takes care of keeping a history of changes in the wiki pages, in the tracker items. There's an activity log and that's how we get our auditability for that part. I just said that all our environments are running the same database. You may get how this is an issue. What we do is that one single file here is not a versioned. This one is specific to each Tiki because this is the one which has the database credentials and it also has a link to configuration file which can be versioned because we have an item for a section in the configuration file. That means that in the same configuration file, each environment uses another section. In this section, we can override any Tiki preference. This has two very big advantages. The first one is that all the security preferences and others can be set in that file and cannot be accidentally modified through the Tiki admin panel. The other is that we can have different things in different sections. That allows us, for example, to ensure that only the production server can send email notifications. You do not want your end users to get notifications from a test server or a dev server that I'm not supposed to know about. What else? Yes, and you can change the browser's title. You can change the theme options and end up having your browser tabs like this to have different colors when you are working in production or in staging or in dev. That avoids big mistakes when you are editing a site where you want to be sure that you're not editing fraud when you want to do stuff in dev. So there's still the part about how you do that. So Tiki has a no-code, low-code approach, but at some stage you just have to accept that the project is really complex and abundant. The great thing is that there are options for doing really complicated stuff. These are basically the list widgets which we call plugin because the list widget is super useful because that's what allows you to display stuff which come from anywhere in Tiki, but here we are only interested in the tracker items. List execute is very similar to list, but that's not for displaying. That's for listing stuff and doing things for on a whole bunch of tracker items at the same time like deleting them or changing their status. Custom search is also closely related and this is for allowing people to do searches, to filter, to end user have control in this case. So that's a list widget example. You are not going to understand how it works like we don't have the time. We ourselves have that documentation page. We spent a lot of time. It's plenty of info. Everything is there, there are examples and all that. We spent a lot of time in it. Basically the general idea is that this is something we can put in a web page. There is a section which says what filters, what we are going to display. There is a section about how we want to output it. The more there are predefined templates, but if we want full control, you just give a smart ETPL file and then you can code whatever you like. You can even change the formatting before it gets to the template. And if your filter doesn't match anything, there is an alternate in order to an alternate section. So that allows you to do all the pages you saw before. You have to realize that when I say you can do whatever you like in the template, one of the things you can do in the template is call another list plugin. The syntax is slightly different from the web page. And that allows you to collect information to trackers which are linked to other trackers, etc. And you can go on and on and on and on if you like. There are no limits at this point. So that's basically what we used nearly all the time for all the pages, for all the workflows. The scheduler is also really useful because sometimes some processes are just too complex. There are two special cases and all that. And we had especially like the scoring system. We just wrote a script which was directly doing the calculation and updating the values in the database. And the scheduler is our way of ensuring that things can run whenever you like. For example by night because luckily neither our customers or the end users really wanted to work outside of working hours. So we can run everything we like during the night, especially nightly script for calculating scoring things or index rebuilds, whatever. So what were the challenges? One of the challenges we had was that page because that page was awesome because we had lots of information. It doesn't show here but actually those columns have related information which are in different trackers. So that's one of the cases where you have those templates which call another list plugin, which call another list plugin, which call another list plugin. So obviously the first year everything was great and you had everything here. We were using table sorter. You can sort on any column. You can filter in these places and you can move the pagination around. It was all client side, meaning all the data was in the page and after the third year it starts to get you cloud player timeouts. So we have rewritten the templates in order to optimize to do some caching ourselves in the code. And then we had to raise the memory limit because you know, trade-offs. But that's not a solution which is going to last forever. So it's solved for this year and they want to have five years of data, I understand. We'll see that. So basically this will need to be rewritten using custom search and just paginate. Or here they have the download button because they will all those information in CSV so that they can do more data mining. So we will provide, we will rewrite that but let them download subsets of the data. And that should solve that. And figure out, I'll talk about CSV extract here. That was another issue we have. That's a, every link here generates a CSV file which again gets data from plenty of trackers. So for the big labs we did have some timers and we were about to do the same thing, you know, rewrite the optimize, the TPL, etc. But luckily we had another idea which is to talk with the customers who explained that, well, that those data hardly ever change. So the solution is not to calculate them when you click on the button. We are just going to use some caching and have a nightly mechanism or, you know, generate the caches at the right time and just link to a file. Yeah. So mainly our lessons and improvements meaning that since I said that we are going to have a similar project which is about to start. We wanted to see what we, well, essentially it worked. The customers were happy but we can still improve stuff. So what we are going to improve is use more sophisticated Tiki permission mechanisms which is called templated groups. That's for the permissions about what people are allowed to see depending on the groups they are in. We just used a simple way and then we had to add another layer of security in the smarty templates. We want to avoid that in the new project. Make sure all the layers of data are present in the design. Well, that's always hard because, well, it's always hard to realize that there is a missing table or tracker. It makes a lot of extra work to discover that too late. Then again, I'm totally convinced that it would be even worse if we were working in real SQL. The other thing is, as I said, the other lesson is, well, table sorter is not a tool for data mining huge data sets. That's the summary of it. So you have to get your customers to accept that sometimes they have to use pagination and not have everything available and there are technical limits. Same thing for identifying huge CSVs. And also we have taken advantage of this. We are going to improve the list plugin which will be expanded with sublist sections which basically will allow us to do joins without having to do that in TPL files. And that's about it. Thank you, Jean-Marc.
How to get rid of Confluence: Comparing Open Source Knowledgemanagent Systems
So, hello, I'm Markus Feiner. I'm, some of you may know me. I've been around the Linux and open source world since 1994. Started really early with Linux and had three operating systems at the time on my computer. And since the early 2000s, I'm an open source journalist working for Linux magazine. And the Hi-Zi-X Big Jump Tech magazine also. And super, thank you. So, wonderful. Perfect. So, I've done a lot of things. I was also team leader at SUSE, at documentation team lead. And yeah, having done lots of things, we don't have that much time. So, I better be fast now also. But within this talk, I have a lot of links and hints for you where to go to get much more information. Because all of you that are here, probably are only here because you heard the term conference and you like it a lot. And like everybody does. And it's, we started six minutes late, but we have 15 minutes left, I guess. Good. Okay. You ping me five minutes before we're done. Thanks. So, this is a presentation that is kind of typical for me. I'm going to rush through a lot of topics, but there's a lot of links inside. And you can go and find a lot of things. If you're not a German native speaker, you will find some articles that I wrote that you have to translate. But the best thing is that something we did in December or November last year is a large tabela chart of lots of open source alternatives to Atlassian. And I'm not going to dive deep into the things that happen with Atlassian and the 15th of February coming up. I'm sure you're all aware of that, that the support is running out. So, that's a different thing. I'm going to talk a little bit about knowledge management and about a concept that we found in SUSE in the documentation team that I called Agile Recursive Documentation. And generally the problem in knowledge management, and that is what Atlassian actually is about, is that we all sort of mis-underestimate the problem. I'm doing this Bush reference by Bonn-Perlman. But it's like an iceberg, and this is probably the iceberg that hit the Titanic, an original photo that I found. And I've been using it in knowledge management presentations because, like icebergs, there is implicit and explicit knowledge in companies. You have a lot of knowledge that is documented, that is fine, that everybody knows, but there is a lot of Rumsfeld reference, unknown unknowns. And in companies about, yes, there is. And about 80% in companies today is assumed to be implicit knowledge. So, knowledge that is there, but nobody knows that it's there, and people just do it. And just announcing doing knowledge management now will not help against that, and will not mitigate that in any way, because you have to take the people. You have to, oh, I forgot something. The implicit knowledge is also referring to, for example, people that go into, yeah, in a longer holiday, or that go into old age pension, and that are gone. And there are stories of, yeah, I can tell a lot of stories about that, what happens when the people are gone, and you have to find out what actually did they do, and how did they do it. Some people that I know, they had to understand, and had a Pearl Programmer in the company, and the Pearl Programmer had called all of the processes and scripts of his whole big setup after figures from the Simpsons. So, they found a process that was called like APU, and they were like, what does it do? What does it do? And then they found out, oh, it forks a lot. And only once they knew this, they could figure out what the whole Pearl thing does. But there's many stories like that. But in the end, you have to inspire and motivate your people to follow you. So, you need solutions that work in the processes and software that works, and that the people like in your company. So, as many as you are here, there will be lots and lots of different solutions in the end. Because documentation and knowledge management is teamwork. That's one article I wrote for the Linux Magazine together with my former colleagues from Susie. It's always teamwork, and you always have to work together, and you have lots and lots of different people in the company from the... I once heard that we could... When I was a consultant, we are the pathfinders, the mountain guides. We are the ones that find the trails, and then we tell others how to go them. And as we have different people in the mountains, you have the locals, you have us, the pathfinders or whatever, you have the tourists, and the locals and other beings. So, you have to make sure that everybody understands what you're talking about when you give them a description about something. And for that, we have the engineering part or the scientific background. And this is all... Every one of these is so huge and a large topic that you can do university studies on some of them. Knowledge management itself, organizing knowledge, process management, quality assurance, and then all of that basically combined into some things like knowledge process quality management. That is K-P-Q-M... K-P-Q-M, yes. And at the end, there has to be a presentation layer. That's what the people see. That's basically the editor with which we work. But in the background, you have to do lots of ordering, indexing with metadata. Anybody here who knows the term RDF? Still, the semantic web and all of that stuff. Yes, that's the background. And then you have taxonomies, terms, terminologies, registers, tables of contents, notations, catalogs. And it's a huge scientific realm that you can read books on each of those. But for a company, it's important that you do the needful, not all. And the representation, showing it to the readers, the customers is actually the mapping of the information. How do you do this with models, glossaries, how-to's, encyclopedias, documentation? And you see there's already the type, the form of how you present things is coming in with glossaries. And well, of course, what I'm going to tell you is the way, the right way, and found this yesterday in Brussels here, and I really like that. Maybe some of you remember Magritte. This is not a pipe. So, and somebody pointed this on his garage door. This is the way. And I still have to dive into the apple. I think the apple is also a Magritte thing. But the cloud, and I really like that. So, of course, what I'm presenting to you is the way how to do it. No, it's just a suggestion. Because if you don't have time for all that scientific research and all of that, maybe, and that is the usual, that is, then you're not alone. It's usual that companies, we don't have time to do the documentation. We don't have time for all that. And then we were at SUSE in the situation. We had five people in the doc team, and we were told to grow up to 10 people fast. And the people in the doc team, documentation team, they said, we don't have time to teach the new ones. So, what we did was we created this agile recursive onboarding. That means we have one new guy in the team, and this new guy will be, for example, the mentor for the next new guy or girl. And we created a mentor who is in charge of teaching the new guy. And the new guy would, for example, as an article in Linux magazine that I did. And some of the, why we call it agile is because only because we're using agile tools and methods. Like a Kanban board for it. So, this is what the new guy saw. Not at this one, but it's structured. This is an easy tool in next cloud that you can use, for example. For the start. This is the task that the new guy had. Or this is the tasks that the mentor had to do before the first day. These are the tasks that the new guy would have to do on the first day. This is in the first week, and so on. You know, Kanban, very easy there. And out of this came the first documentation of the job, of the documentation shop, the description, what he's doing. And this is an individual board for a new team member. And this new person was from the start involved in making the team better, making the documentation everything better, and yada, yada, yada. Until, and it was recursive because the next one would again start at the same point. But with the new improvements from the last one. And that is exactly what you can do with documentation and knowledge management in a company. You can start documenting things and have people be a part of it. That is the most important, take them with you. And there is other things that are really cool, but only larger companies usually can apply it. There's things that Stack Overflow and Reddit, those companies do really good. You can have your customers, your readers, the signpost, a triage by user pane, so the important documentation items first. And what nobody needs will disappear from the list and topics that are interested will go up in lists and documentation. So if you're writing on something nobody ever reads, maybe you could have invested your time better. But I'm jumping over to the tools because that is what we are talking about. And yeah, so in decision making, decision making is also very clear. You have this important, regular, not important ones, and thereby you decide what to document on this scale. But it's probably not you at FOSTA who decides that. It's usually the management because there's constant risk involved. So this is the stuff that we need to document because if this guy is run over by a car, we won't know how to do this process and this will be a lot of money and customers will be angry at what we're getting. The team has the expertise, the knowledge or documentation team has the expertise, but the management knows about money and risk. So that is why they have to actually define what. So if you're not, yeah, and the tool, oh, super, good. The tool that you're using should be the last decision that you take. Otherwise you end up in technology-driven development or design. I don't know if you've heard that before. That is, for example, happening with AI a lot. We want to do something with AI or we want to do something in Rust. We want to do something in what is this new framework that we bought. Oh, in Go. Let's do something in Go or whatsoever. That's not uncommon that a company buys some development framework and then they think, okay, what can we do with it? It should be the other way around. You should, as in project management, you should see what you want, what you need, all the needs and then the risk and the money involved and then in the end look which product does match my requirements the best. Well, with Atlassian, that's usually not the case. Also because it's just simply there and for a long time it has been there without competition. And now in the last years, Atlassian did these moves to the cloud. This increase of license cost and, well, and they have, of course, they have a good product, a viable product and it's highly integrated with ticketing and you remember Trello. And I like to call them something like the Microsoft of Knowledge because everybody in the development world uses it and it's hard to get around it. And most people are not very happy about it and especially since I learned that the price increase that comes now or came now for small companies can be up to 1,000% that they have to pay because the small bundles start at 500 users or something like that. And it's not the usual, you know, the usual increase is something I think like 10 times more than before. So not that bad. And so they did also an article on that that I don't have a screenshot in there but they're forcing the users into the cloud. They say no we don't because they can buy a data center license. That's the one that's like 100 times more expensive than before. But they reacted also to that but there's other issues also. But them being an Australian company and thus part of the Five Eyes which is the five countries that are very closely working together in terms of the NSA stuff. So there is GDPR issues. They even told their customers when in 2018 when we were working with Trello at SUSE they let a mail came in that told us we shouldn't put business secrets into Trello because that might not comply with data protection rules in Europe already then. We're like what the fuck are you doing? Knowledge management tool and you're telling us not to put business knowledge in there? And so yeah and then recently there have been more issues. There have also been security issues and also but that's okay that every software has that. But then there is also the fact that they are more focused on a global market than for example on a German or European market. So we had the situation that we had severe issues at a company I was working with with Umlauts, the German Umlauts A and they said well there and we opened the ticket and the answer was oh yeah most of our customers are using English. We are very sorry about that. We're like okay good but that's just basics and so last year and that's actually the core part of this. I did with Tim Schurman together we did two articles in the IT administrator where we took a deep dive into the open source alternatives for Atlassian. Both of them make five pages and a large chart that I have here yeah that's two pages. That wasn't in the printed that's only online and the link is this is the this is the link to it. It's in the presentation that will be on the FOSTA website. Yeah and um nice take the photos tell me when you're done then I go back good so and we came to the solution that there is a lot of alternatives and a lot of them also are facing a boom. Some of them say we have so many calls right now from people that want to get rid of Atlassian that we can't handle it all and one of the companies is a friend of our friends of mine from Regensburg where I come from in Bavaria and they say it's most of them are customers that don't want to use Atlassian anymore they don't want to do the move and now they have this deadline coming up on the 15th of February when their support is ending for their old product so yeah so many customers are turning away from Atlassian the priorities have shifted and as you see in this table I'm going to show you a picture of each of those and hopefully if my brain is good enough I will say a few words to each of those five minutes great so in this list we have blue spice just don't worry I'll tell you more about the names just name dropping now blue spice bookstack it's alphabetical docu wiki fos wiki media wiki open km outline p m wiki wiki js and x wiki that's the ones we compared in this list I'm sure there's more out there but we tried to address those that are open source and have enterprise support so that's what we this wrong laptop and here you the first one of course the big the biggest knowledge management software is of course wikipedia but wikipedia with media wiki software has one big flaw it's not well it scales obviously very good yeah it can run the seventh largest website in Germany I think but um what it it has only one use case and that is not the enterprise clearly it is has the use case of making wikipedia run and work forever and really good and so I mean you all know wikipedia and you all know that this is already the editor of you I don't know if you've been to wikipedia recently last year or so they integrated this wikipedia editor not you don't have to work in in markdown whatever it was anymore you can really type inside the text and from them there is a there is an enterprise distribution that's called blue spice that's the guys from rigs book that I've been talking about disclaimer I they're a friend of mine and I blog for their website so um but that's you can make your own image about them they have a lot of things that enterprises need for example something like privacy administration so they add usability and at enterprise features on it yeah and they're open source they base they're based on media wiki and they have several versions up to cloud and farm and sars and stuff and um we have then we have x wiki there was just a talk so I hope I'm not saying anything that's that's wrong x wiki is also very old and they are in my opinion really interesting because they do a lot of innovative features that go way beyond the wikis like their crypt pad and there which is sort of like a um end-to-end browser based collaboration office thingy space that is actually if I understand it right serverless yeah and so they are so and if and then there is doku wiki doku wiki is um something that is often found in scientific and or educational realm I have been okay I've been working with blue spice I've been working with x wiki I've been working with doku wiki and um doku wiki I found at a company I was consulting that did that is from the space agencies from German aeronautics space uh it was the gaf the yeah so they this this world somehow yeah and so they have a lot of people that are experts and I call that that's where the expert systems start I think I think that both x wiki and blue spice is something that you don't need much expert knowledge to work with yeah but with doku wiki it starts but doku wiki is actually really cool because you have some features that others don't have for example you have a shortcut and then the page that you're working is a presentation just one thing and that's that's really cool and uh then we have t wiki we just heard of I think was it t wiki that you just heard or x wiki the previous t and that is um well it came from well the t wiki was the old project and first wiki and q dot wiki are folks from it they have a lot of extensions and a lot of features also but I I haven't really seen them that much in companies and they are very um I can how do you say they um you have to look you have to see if they work for you but they are not like for me they are not like a valid atlasian replacement because you but these you need to be an expert to use them yeah and other ones there's there's a lot of more I found these four and I think I'm good in the time there is for example there's bookstack which is another very interesting project that works with books and shelves so this German here so bookstack has this imagery of um use the knowledge is stored in books in bookshelves and chapters in books so they always use those those metaphors from the from from books so they have pages chapters books and shelves and this is the page of the um uh the access rules and uh
The Challenges of Creating a FOSS Fact-Checking Platform for the Brazilian Community
Okay, so thank you very much for staying for the last talk of the day and last talk of the year for the collaboration day room. So thanks a lot. I see people that stayed a lot during the day. So thank you. So the last presentation is about Matheus, which will be the challenges of creating a fast fact-checking platform for the Brazilian community. Thank you. Thank you. Yeah, can you hear me okay? Yeah. And thanks, appreciate everyone here. I thought there would be less people, so I really appreciate that. It's my first time at FOSDEM as well. Let me introduce myself. I am a senior product manager at the Wikimedia Foundation working on MediaWiki. But this product is a different hat that I wear. I'm a volunteer on an NGO in Brazil and we are trying to combat fake news and make something that is open and in software, in data and in knowledge. So what I want to talk about here is the mission of this project and I'm going to go over the challenges of trying to do something against misinformation in Brazil. Brazil is like a very fertile soil for misinformation, information, especially recently. The reason for that to arise in my will was basically because seeing how my family was sharing and spreading misinformation, it all started. So I started imagining, so I'm a co-founder, I'm not the only one that volunteer on this project, but we started imagining a society where everyone can freely access and engage with true and reliable information about autonomy. And I kind of stole a little bit of the Wikimedia Foundation mission here because we share similar values. So the mission for this project is to encourage ad-edu communication. I'm not going to talk about it because the other co-founder, which is a journalist, is the one that kind of coined or uses that term. And basically it means that we can only achieve our mission if we actually educate people and we need to communicate that. So the platform and everything, all the product what entails, it all focuses on the specific pilot. But the idea is that for the values on accessibility, credibility and autonomy, we're creating individuals or we're making autonomous individuals in Brazil being able to access information or at least question information without losing the credibility. And so when we started, so this is a study from Caspersky. And so more than 70% of Brazilians with internet have believed in the fake news and 62% of Brazilians failed to recognize false news. So this was a study that at the time was what motivated us to keep going. And challenges were immense because it forced us to tweak planning and change and pivot a lot in the years of the foundational years, which is kind of the timeline, which I call the foundational years. So I'm even excluding here all of the technical exploration that I did since 2018. And in 2020 when I thought that, okay, we have technology and maybe we're ready to proceed, we signed up for the Mozilla Open Lab. We participated in receiving product mentorship and preparing an ideation which came to be Alete Effect. And from that exploration with Mozilla it was very interesting because we decided, so we came as dreamers like we want to do something multilingual for everyone. We want to fix the world because the logo of that program that we participated was fixing the internet. And then we learned that it wouldn't be like that. So we would have a focus group. We would only focus in Brazil. We would look into people that is already engaging with fact checking or at least a reading about fact checking somehow. So active readers or independent fact checkers, even professional fact checkers. And then we would even look into a specific demographics, which would be from the age of 18 to 29, would represent like 16% of Brazil. And we set a goal that if we get 0.1% of that, 35,000 potential independent fact checkers, which is an increase of 7,000% of professional fact checkers in Brazil. And I'm going to talk about this a little bit because, so we did this exploration on Mozilla Open Lab. Then we started working on the infrastructure and launched the infrastructure and tried to experiment more. We participated on the trust, it's the TTL, it's the trust and truth online, yeah, conference and so we introduced the concept of the democratization of fact checking in Brazil, which is something that we are looking forward. And in the same year, we started a residency at Projeto Comprova, which is a group of news outlets and independent organizations that combat fake news in Brazil. And by participating on that, we actually engaged with our personas or professional fact checkers. And from that point, we started exploring a platform focused on professional fact checkers only. So we wanted to do something to speed up their process, make sure that they were, the process were optimized and they could actually chase fake news and combat it. So with that in mind, we thought that we are ready to formalize and then we started understanding what that would entail. So with these learnings, we defined that with process transparency and the didactic of representation, we could enable a fact checking manual or operational guideline and then we could replicate that and create autonomy of the individuals that we wanted. So the methodology should be accessible and understandable and that's the requirement that forced us to be aligned with the Creative Commons license for our data and also everything that we create from courses or workshops are all open and available. And we also had some, in the last year, I'll also talk about that, but we had multiple workshops and partnerships with universities in Brazil and all of that was free and based on the knowledge sharing proposed. And the platform or the product that I mainly worked on would be just a facilitator or a place where people could use and engage with. And our main goal was that, so we were here in 2021, we wanted to reach the Brazilian elections with something that could be used for good. So then we formalized and then we participated on the Global Fact Nine conference, which is organized by the IFCN, the International Fact Checking Network. And when we got there, we went to validate some of the use cases that we have built and it was like a bucket of ice on us because we understood that there is a different dynamics happening on fact checking community. So one of the things is they are very worried of being open, mostly because that can be weaponized by bad actors and create more problems to them. So every software that they write, it's not always open, sirs. It's not always open. Licenses, because they are mostly tied to news outlets, they don't always follow the Creative Commons license. So everything that we have built to create a shared space, like a public digital space for people, was not going to work with professional fact checkers. So we understood that, we pivoted, but we kept the same model, the same values and we went forward and launched a platform with a few people, like 15 volunteers. So in the platform, we created a process where we would listen to the debates during the Brazilian elections and do live fact checking. And it was good. So this is just a screenshot of some of the parts of the functionality. So here, if you have something highlighted, it means that it was fact-checked by someone and the experiment was very good. Very small as well, because if you look into our views, you're going to see that we only have like a thousand views. But the impressions from the people and what we were able to achieve was a good data to proceed forward. But because this is a very small project, it's a very small organization, we were dreaming big for a presidential election, trying to do an impact and of course, the stretch goes there, but what we learned is there is a use case, we can do that, but we need to begin small. So we took a step back and looked into what we can really do with our resourcing. And from that, we decided that we would have three product pilers, the fact checker productivity, reduce to credible information and reduce obstacles to product. So talking about the last one first, because we were very small, it doesn't work like our reference, which is Wikipedia, it doesn't work like that. You cannot just go anonymously and do a fact-checking, you cannot go and create an account and start doing it because we don't have a governance model or we didn't have at the time. We didn't have a governance model, we had a lot of obstacles, so we put obstacles on purpose so we could only test with a few people and understand what we need to do. And then we need to now remove those obstacles because now we are becoming more confident that now this can be used by everyone and we are going to have a procedure, a code of conduct and a governance model that will help. Access to credible information is the one that we are not focused too much right now because we believe that from the model that we have plus the productivity, we are going to create the credible information, but access to it is a little bit different. So we need to be able to make sure that the audiences that we serve are actually being able to access this and they can access the platform, we also provide goods SEO using the claim review schema and to be able to be searchable without people having to go to the platform. However, in Brazil we have a lot of people that, so there is an inequality or disparity of resources in Brazil that includes internet access in a sense that people can have data plans that can only work with WhatsApp but they cannot access internet. They can access Instagram but they cannot access Wikipedia. So the access level is a total different gain that we are still studying how we are going to approach, but because these are not our focus we should not lose sight as well, so that's why it's a pilot. And the productivity was because of an opportunity that we had because in 2023 we started a partnership with the University of the State of São Paulo which I actually came from and did my bachelor there and we had four interns and this is the other co-founder, our SEO, I like to call her my president and these four students were our focus group, we had a group of students in a sense that they are very fresh like first year journalists they know nothing about life more or less, maybe, I don't know, maybe they can learn from Tik Tok now but the idea is that they should have no knowledge at all and be just willing to work on. And we had a very smooth process and it was very refreshing because they finished four months of internship having the same level of productivity than news outlets when we were delivering fact-checking material. So of course different levels of comparison but in any way as I mentioned the platform was launched a while ago this 2022 is when we did the experiments with the debates and then we see that at least from the platform we are increasing the engagement of the functionality so we still have pretty much the same unique visitors but now people are using the platform more because now we actually have productivity for the team. So these are just a few screenshots of the platform, the code is in GitHub and you can access the link in the end of the presentation. And so this is an example of a fact-checking report that is available after the fact and as I mentioned so we focus a lot on the productivity of the fact-checkers so we started putting in place tools that would be tied to a specific workflow which is flexible. The way technically speaking we are using state machines here to control everything and we adjusted the processes and adding different steps depending on how we learned with the team. And the idea is to have a visibility of the productivity and also be able to collect data and see is this actually improving because once we, if we actually go, if we actually reach the goal that I mentioned before like 35,000 people checking facts in Brazil it's going to be a very different thing to administrate and make sure it runs smoothly. So yeah, this is I think that the learns. A few things that I think I would like to mention from the experiment, it was in Brazil the open source community is very spread, not so much well organized so the whole period of creating the software, trying to test the software only captured six volunteers like actively working on it which it looks like well this is actually pretty good but I forced most of my friends to actually go there and hey come on you know how to QA things, can you help me QA this, you are a DevOps engineer, can you create a pipeline to me so this was kind of a best effort from a community but after all of this what happened was that a lot of entry-level engineers were starting to looking for something to work on and they were, because we had partnerships with universities we started having people just coming in and they were trying to learn with the software which was a very good experience and something that I would like to explore moving forward it requires more management on the technical side being able to actually provide good feedback to them but it has, we now have like two or three active volunteers that they are 20 years old and just learn how to code and now they actually provide good development for the platform of course there is a skill to consider but so yeah there was this challenge so the time frame when I look to this project like five years is a lot for the stage that we are at but there were multiple factors there like it's a product for a very specific area it's a problem that no one solves yet and the only way that I see moving forward is doubling down the educational effort for the actual goal which is being able to provide credible information and stop the spreading of misinformation there is no other way around there is no other software there is no AI there is nothing that can help other than having humans understanding what they are reading and being able to have something that is accessible so I also put here multi-generational because in Brazil the disparity on misinformation spread is based on age as well so the older you are the bigger the chance to be a victim of misinformation and what we are also going to be looking at the general TV AI so and I write here with all the truth in my heart that I have no clue what we are going to do about it and the reason that I put it here is because it's not that we need to use the general TV AI but we need to defend against it which is I think very different perspective to look at of course maybe we need to put it here and make pieces with the devil who is the same tools to combat at the same level but the concerns are different because we are already seeing on multiple elections around the world the usage of deep fakes and general TV AI to manipulate public speech and so this is going to be a very difficult thing to do however because we are losing the battle we need to consider and that's it the code is open source it works specifically for an audience in Brazil it has been tailor suited for that but because since the beginning we were concerned about having this for multiple audiences so it allows internationalization the code is the reason that so the stack that we chose and one of the things so the stack we chose was no JS and React because the ability to find more people to join the effort but that of course we are now considering if it makes sense to keep the same stack or rewrite some of the stuff because if we want to be lean and optimize for some use cases we might also consider performance and other things that right now the platform doesn't provide and I forgot to mention something very important everything that we do is integrated with Wicked data right now and we have efforts to integrate with the whole public data infrastructure the idea is that we only have what it needs to be done for fact checking but information about personality more information on Wikipedia all of that should be included and integrated with the ecosystem and in the end we encourage other people to build their own communities and be part of the movement fork and change toss it off test the same things that we did I think that because it's something very very important and it's going to change a lot in the next few years I believe that we should double down the effort as much as we can that's it it was supposed to be a gift but it's a PDF so thank you thank you very much for the attention thank you Matheus any question? Hello so I was looking at the website and I see the personalities and the declarations and I see the reviewers but then how do you define or who puts the new declarations in for the fact checkers to really check it? Yeah so one of the things that we learned from the procedure of the fact checking is that monitoring has a specific operational guideline so one example that a fact checker might find to is that they receive this information but they look into how is this happening on Twitter X how is this being spread into Facebook how is this being spread on WhatsApp is this a big effort does it make sense to to actually check because there is one thing which is called the strategic silence because if someone checks it makes more public and it spreads more so the the decision on what to put in there is for the volunteers that have been have the capability to operate the platform so right now in order to operate the platform you need to go over the training understand how our code of conduct works sign that you understand and you are going to vouch for it and once you understand the whole process then you are able to operate the platform and you will be you will be responsible for monitoring so these volunteers are the ones that also select what is putting there but we receive we receive suggestions from community in Brazil from people that follow us and we use that into consideration to decide on if we are going to put on the platform or not and because it's a small group it's going to be small data now but the idea is to streamline this process and grow and possibly monitoring will be we will need to evolve based on on on that does that make sense thank you it's the last one from the UK we have a few fact checking organizations there do you connect or are you intending to connect with other similar organizations from the world yeah so we did when I talked about being part of the global fact nine which is part of the the international fact checking network we learned a lot and met a lot of the the I think full fact is one the biggest one and one that we get in touch we got in touch and now we are in the process of being part in part of this network there's a lot of criteria it's if we do it we are going to be the first open project that actually enters so we are having some trouble to actually match on some of the criteria that they ask but we can we have some connections with India we have some connections with the Latin America network as well and recently we joined the the the network that is only on Brazil so every out Nils out in Brazil are also part of this network and we are connecting with them as well yeah we need that okay thank you other questions no okay thank you very much and well that was the collaboration they've room for 224 so thank you for staying until the end
How do you change the governance model of an established open source project?
All right, awesome. Thanks everyone. Thanks for joining us today. So my name is Ruth Cheesy and I'm going to be talking a bit about how we went through the process of changing a governance model in the Mortec project. So if you haven't come across Mortec before, it's an open source marketing automation. We've been around for about 10 years. I'm going to talk much about what Mortec does, but we've got a stand in H block. So if you want to come and chat with some of the community will be over there. So yeah, I'm project lead from Mortec. I'm also co-founder of the Women in Open Source community. You can connect with me on LinkedIn by zapping that QR code, but the slides should also be on the FOSDEM website afterwards. So if you need to check something, you can check all the links that I mentioned and everything is up on the FOSDEM website. So let's start off by talking about what we actually mean by governance. So in open source for me, governance can be something as simple as like a few paragraphs on a piece of paper or in a bigger project. It can be a lot more complicated, but ultimately it's about talking about how power structures operate within your projects, how decisions are made, how interaction happens within the project and also ultimately talking about steering the human collaboration and software evolution in your project. So where do we come from Mortec as a project? Well, we were originally what I call a corporate backed open source project and what I mean by that is one company backing the project and all of the governance is around that one company backing the project. So we were founded in 2014, GPL3 and in 2016 the founder created a SaaS company providing the software to Enterprise. In 2018 we had our first ever community lead release, so it was the first time someone led a release who wasn't an employee of the SaaS company. 2019 the SaaS product was acquired by a company and rolled into their marketing suite along with the brand, the trademark and everything to do with the project and community. And then in 2020, so soon after that, we started to make a governance model to make it clear what actually the company involvement was, what the community involvement was and how we made decisions collaboratively. And this is what that first model looked like. So you can see at the top here, these pale blue are, they must be a member of the company, the dark blue colors are, they must be a member of the community and then the gray ones here are anyone. So it could be company, it could be community. So there was quite a lot of corporate involvement in there mainly because the company wanted to steer the project and support the project. This was developed in collaboration with the community but very much designed by the company to make sure that they had still had a say in the project. So the key decision making structures we had here was the company. So the company actually owned the trademark and they gave the community the ability to use those trademarks. And they also employed the project lead, which was me at the time. And they chose the company representatives of the council. The project lead was hired by the company and the job was to kind of steer the project in the right direction to organize the community, to remove any roadblocks but also be the bridge between the company and the community. And then we also had a community council which I showed there. So four people from the company, four people from the community dealing with issues that went across the project. So they weren't just to do with one particular team but there were things that were slightly more complex or maybe needed a bit more thought before they were interacted. But for all intents and purposes those community representatives were the team leads when we first started. So we didn't have enough people active. We just kind of said if you show up then you can be the team lead really. So in April this last year, the company informed of the company that they weren't actually able to support us at the same level that they were supporting us to that point. And so things needed to change basically. And because of that we needed to find a way to go forwards that wasn't going to involve being backed just by one company. So the first thing we needed to decide was what's the fiscal structure going to look like for the project? How are we actually going to organize ourselves? How are we going to manage the governance? Things like that. So the way we made this decision was initially going away and doing an awful lot of research, looking at what other open source projects are doing, what other projects have changed their governance models over time and how did that work out for them? And bringing that all together in an idea of some proposal that I would take to the council. And at this point it was only me who knew that this was happening with the company. So some of the options were maybe looking at joining a foundation or joining an umbrella organization that could support us. But what was important with this is that we were able to still be autonomous, that we still had the ability to decide how we did things and what tools we used and so forth. So there were pros and cons to that approach. Another option that was in front of us was we were at that point using Open Collective to manage finances. So if we ran an event, we had somewhere for the money to go. We were only using it for finances. So there was also the option of expanding what we were using them for to provide some of the services that the company gave us. So like holding trademarks, holding our assets, employing the project lead, providing legal support. So that was another option that was open to us. And then also creating our own nonprofit organizations. Creating something ourselves that was maybe a 501 or a nonprofit CIC in the UK that would deliver all of those things that I just talked about and we would be able to do that from our open source project. Put them on nonprofit organizations, sorry. So some of the resources that I found useful in this process that are up here. So governing open is a really great starting point. If you're having to think about governance, there's got lots of resources and links off there that can get you going. There's also a really great one from the Python organization, PEP 802, which explains the governance models of lots of different open source projects. How they've changed over time. What went well, what went wrong, what was difficult. That was a great source of, they're not all the same kind of projects as us, but they were encountering similar kind of problems. And FOS governance, if you need any kind of document, whether it's a code of conduct, a privacy policy, what don't agents we accept, a governance model, there's absolutely loads of awesome resources there and you can also upload your resources. So you can share your resources as a to do for me to actually upload the new governance model, come done like that. And also don't underestimate if you're going through thinking about this, the power of the network. There's just so many people who took my calls when I was like, I need to speak to people about this to get some ideas. Gave me some good contacts, pointed me towards specific things that would help in this process. So if any of you are those people who I spoke to, thank you so much because it really did help. So once I'd kind of come up with, well, those are the three things that we could go for. And as project lead, these are the pros and cons, I think, for those things. I shared it with our council and then later shared it with our team lead. So the council and then the assistant team leads as well. So we're at 10 of us at this point tossing around the ideas of what are we going to do and what do we think is going to work for the project. The challenge of course with anything in open source is reaching a consensus. People had views on what was going to be best for now, what's going to be best for the long term. But ultimately we were able to come to a consensus together. And that consensus was that we wanted to actually become an independent open source project to use open source collective more and to refactor our governance model accordingly. So that news was shared in April. You can read the independence blog post there. And actually it was one of those moments where you hit publish and you're not quite sure what the response is going to be because you all believe in it, but you're really hoping everyone else is going to too. And it was a really positive response. So some of the things that we learned from this is language really matters. We're a massive international community and we are invited people who we trusted from our main communities to translate that important announcement so that people in the local communities could understand actually what that meant in their local language. And they really valued the fact that we'd taken the time to do that. So major communications it was really helpful. We also had a lot of people who either did not care at all, which I couldn't really understand, but some people don't care about governance. They just want to use your product. Some people who work the other end of the spectrum who really cared a lot and were extremely passionate. And then some people in the middle. So I guess I'd say that the lesson learned is like you've got to be prepared for all of them, not just the positive, but also the negative criticism that comes with that. And also being available. So in this stage it was really helpful to have opportunities. We had webinars with a translator for our Brazilian community, for our German speaking community, where people could actually hear what the changes were, what it meant for them. And then they had the chance to ask questions. It also was really helpful to have an open door office hours where people could literally just drop into a Zoom call and talk with me or talk with the team leads directly about whatever they wanted to talk about. Okay, so one of the features we had to think about when we were actually creating this governance model, once we decided what the structure was going to look like, was do we actually need a hierarchy at all? So someone in the community was saying, actually, I think we should have a completely flat organization structure. I don't think we need to have leadership and councils and things like that. We did a lot of research on that. We couldn't actually find any organizations in larger open source projects that had that structure. We didn't think that was going to be practical for us over the long term to not have some kind of organizational hierarchical structure. So we did investigate that, but we actually decided, yeah, we do think we still want to have structure. But we decided that some of the structure we already had was actually working alright. So like the teens and the working groups was working alright. The council was working okay, but it wasn't democratically elected. It was chosen. And so we wanted to change that so that it was actually chosen by the community. We didn't have a step in between the council and the teams where the community got to discuss and debate changes, which then go to the council to be introduced. So that's what we introduced with the general assembly, which is a collaboration of members who can decide and debate, and then things go to the council to be enacted. So that was the structure that we kind of came up with for the project. But the next step was if we vote in a council, how do we make sure they don't all disappear at the same time? Because we're going to be doing this at a specific moment in time. And for this, we took inspiration from the Python Software Foundation. So we did an election. We had people voting, and then we ranked them. So the top three people did three year terms. The next two people did two year terms, and the next two people did one year terms. So that worked really well. The community found that really positive. We did have two people right on the border who got the same number of votes. So we just had a conversation who wants to do three years, who wants to do two years. But that seemed like a really good way of us kind of making sure that we have fresh blood coming into the council as well. And then who actually manages the project lead? Because they were employed by the company, and now they're employed by the community. So who manages that? And ultimately, we decided that that would be managed by the council. So that would be like they would be reporting into the council, basically. Some of the things we also had to think about was how do we make decisions? Because although we made decisions, obviously we've made decisions. It wasn't really explicitly clear how long we give for what different types of decisions and what methods we use. This was also a subject that we did lots of research on. We did need to find a way to do voting and to make the voting fair and to make it a system that we could easily roll out a vote for anything, basically. So we ended up using an open source tool called Decidim, which we've implemented at community.mortic.org, which allows you to have a voting system. It also lets you do events and meetings with transparent notes using EtherPad. And that's actually worked really well. So that's the tooling that we actually implemented to do the practical voting. And then once you have voting, it's like, well, who gets to vote? And this again was quite a contentious subject. What we decided was that we would have different ways of you being eligible to vote. One is financial. You throw some money, you get to vote, $100 a year, or you can use the Big Mac index, which proportionately reduces the amount based on the comparative cost of a Big Mac. You can Google it as by the economist. We already use that in other places in the project and people find it helpful. So we just use the same system we were already using. Contribution based approximately five hours a month, consistently over three months. They can apply to be contribution based member. Corporate based where we have tiers from $1,200 a year up to $30,000 a year. An honorary membership for people who've done extraordinary contributions to the project. So those are the membership types that we decided on. So once you've got the types and what have you, people then started saying, but I do more contribution than him and I want to have more say. So here be dragons, like this is a really difficult thing to get your head round. It can get very complex very quickly. It can be exploited very easily. So we just decided one member, one vote. So whether that's a human individual member or a corporate, they get one vote. And that works because they have one account on our community portal and there's one member in our membership list. And the membership list is who has the ability to vote. So that kind of simplified it. People wanted to get really complicated, but we have to start somewhere. And then how are decisions made? This one we actually decided, well, trivial decisions, we don't want to rat red tape random. If it's trivial and you don't, it's not going to impact many people and it's reversible, just make the decision. Talk about it amongst yourself, make the decision. If it's non-trivial, like how many tracks should we run at a conference or who should we invite as a speaker? Or if there's a code situation where there's a few different options, but they don't have major impact if you take one or the other and it can be reversed. Then we say that's a 36 hour time box, taking into account holidays and things like that, but generally 36 hours. And if it's a significant decision, which impacts several teams or the whole project or it has financial impact or it's not easy to reverse without there being significant consequences, at least two weeks time box. And those decisions happen on the portal. So that everybody who's on the portal sees things happening, they see the discussions, they can be involved in the decision making process. And then ultimately we try to get to a point where we come to a consensus. We default to lazy consensus. So if nobody has given an opinion and the time box that elapses, decision is made. If they have, we try to find a way to bring their feedback in so everyone feels like they're on board or they can at least disagree and commit, you know, the best thing. So how do we come to the final version of the governance model? Discussions happen very, very fast. So we had a channel on Slack for the governance discussions. I could go in there in the morning and there'd be like 250 more messages in a thread and you're just like, how on earth can I keep up with this? If you come in completely fresh, it's really hard. So we tried to summarise this in a Google doc and each day someone would take on to write who had given what views and what the discussions were. So it made it easier for someone coming in to actually get an overview of where we were at. When we got to a point where there was a first draft, I posted it up on the forums. I explained that this is a first draft of a governance model. Anyone else is welcome to submit another one. This is the one that we've been working on. And the important bit is that we could see down here, we chunked each section of the governance model, which was quite lengthy, into separate forum topics. So you could go and discuss the membership section or you could go and discuss the council section and provide your feedback there. And then based on the feedback and suggestions, we could update that initial thread and people could see where we were at. And then we collated all of those back. So this was time box. We did actually have to extend it by two weeks because people said there was too much to discuss and too much to make decisions on in two weeks. So we extended it to four weeks. And then once that was done, the positive thing about having it on the forums is our community are marketers predominantly. So on Slack, they won't be following it. But when they go and say, my mortar can sense is broken, or I can't send this email, they're going to the forums. So they're coming past this post in the forums. So we actually got more people involved that wouldn't normally be involved in these discussions. Then we posted the final version basically for two weeks for people to review the whole thing. And if there's still things that they were worried about, they could respond to this thread. And I highlighted all the bits that had changed from the first draft and why they had changed. So some had changed from the forum. Some had changed from a panel that we did at our conference, but it was easy for people to check. So in this stage, long live the time box. I think it was Angie Webchick who was like, time box everything when I first started as community manager. And that's so true. Like giving people a fixed window and saying, we will make a decision at the end of this time box. Delegating the research as well. So delegating research, if somebody's really interested in something, ask them to go and research it and bring it back. And then you haven't got to do it yourself. So we've had some people who are super passionate about decision making, and they went and did all of the research on that. I am the worst person for complicating things. So keep it simple. Yeah, with governance, it can easily get really complicated. But we kept on saying, what's the core of what we're trying to achieve with this? And how can we get rid of some of the fluff that doesn't need to be there? And also these ones, go where they are. So as many places as you can, talk about this governance stuff that you're trying to do. Social media, sending emails, talking at conferences, talking in person. We actually had some code of conduct infringements because of this, because people got so emotive about something that they really believed in. It doesn't mean that you don't obey the code of conduct. And I think modeling the behavior you want to see is really important. So when someone was disagreeing with something, one of the most useful things I learned to say was, you know what, I'm about six out of 10 that we keep this because x, y, z. Or I'm two out of 10 about this. I think it's kind of nice, but I'm not too worried. And then people have the language to understand and communicate themselves how passionately they are thinking about this thing and why. So you can then kind of get into dialogue. And yeah, draft early, iterate often, be ready to chuck it in the bin, but get something on paper because otherwise it just turns into this big nebulous discussion that never actually becomes anything. And it can be very frustrating. So where are we at now? It's been a longer process than I would have hoped for, mainly because of the community engagement. It takes time to get people to engage, to get people to give you thoughts, and then to kind of go through that process. But actually we've done all right. So we published the final draft at the end of July. We launched our membership model where people could become a member in August. In October, the community portal came out of Beta. So it was in Beta for about a month where a couple of teams were using it. And then in December, we had our extraordinary general meeting where we inaugurated the council who had been voted through the nominations process and we adopted the new governance model formally. So far we've had about 150-ish people joined the portal. We've had 44 financially contributing and 14, actually it's more like 48 now, 14 practically contributing who have joined us members who have a practical contribution route. We've also got people who've paid and they're eligible practically, you know, but whatever if they want to pay them, great. We had the voting on the portal which was really successful. And also what we do is all of our meetings. So team meetings, working group meetings, everything happens on the portal. People can join on the portal. They get the link. The notes are taken there so people can see the notes from the meeting when they finish. And it's been really good actually. It's really been like a central place for all things community. So going forward for us as an open source project, what's next is financial stability. This is the biggest thing that we're working on right now because we don't have the backing of a big corporate anymore. We need to do this all ourselves. So we're exploring lots of different revenue streams, membership, but also having a trial system where people can try the software for two weeks. And if they wish to continue, they go into a contract with a provider, but we get a 40% revenue share for the first year and the 30% for the second and so forth. It decays down. So we're trying to be creative in exploring ways that we can offer value and also get the money. We're very much focusing on product adoption. So our kind of adoption is like this, which is great to see, but we need to continue. It is a competitive sector in the proprietary world. There's not much competition in the open source world, but we're still kind of moving forwards. And also from the product development process, we're 10 years in, but we're dealing with an immense amount of technical debt. So it's also kind of making the product more stable and introducing many more features. And then finally, what we're really trying to move towards is transparent by default. We do do that quite well and we have done that quite well since 2019, but basically every time a leadership role expires, it's voted on through the portal. Every time we have to make a decision, let's take that debate to the portal instead of having it in Slack, on GitHub, wherever, have it on the portal and then it's centralized. And also, yeah, making use of voting. So any time we need to actually practically have a vote on something, we now have a system that we can do it through. So that's me done. I think I'm just in time. Hooray! Yeah. Thank you. If you have a stand, as I said, on HBlock, so if you want to know anything about Morte, come and chat. Questions? Any questions? I'll come back up. Oh, Lord. You're going to get your steps in today. So thank you for your talk. I would like to ask you, how do you manage like liability against law? And how do you, who is deciding the salaries, like the levels, the salary levels and stuff like that? So one of the biggest expenses we've had in this whole process is legal. So we had to, an open source collective who was our fiscal host, have legal experts who are specialists in open source. So we use their services to get the right contracts for transferring the trademarks and all of that stuff. And they also review all of our trade, all of our contracts that we sign because they have to be signed by the fiscal host, not by us. In terms of what was the other question? How do you deal with? Salaries. Salaries. Okay. Yeah. How do we deal with salaries? Yeah, thorny subject because I'm paid by the community and I set the salary. And I did that like three years ago, not knowing this is going to happen. What we did, we did lots of research at that point about what open source projects paid as an hourly rate and also comparing them with what we could actually afford to pay. It was when we were migrating from symphony four to five, we had big projects that needed a contractor to do because we couldn't find people in the community. And we just set an hourly rate and it's very low compared to big projects. It's $40 an hour, but that's what everyone gets paid in the project. We want to use the sliding scale at some point. There's a proposal being put to the council soon to investigate that. But yeah, with that comes a warning because I live in the UK and that will probably end up being a lot more for the project. So do we really want to do that? But that's how we've done it. So yeah, anyone else? Hello. Thank you for your presentation. I was wondering what is the emotional impact of going through a process like that? And if you have any tips or tricks, how to navigate it? The emotional impact? Yeah, because I'm guessing you will have to have some difficult talks. Yeah. Because I think you care about having a fair governance. I think you need to have your own house in order if you're going through this kind of, like in terms of you need to be able to know yourself well, because it does get emotive, especially if you are the founder or if you are involved. In terms of dealing with other people, in dialogue with other people, I think a lot of it is people are very passionate. So it's trying to understand what it is that they are getting emotional about and why they're passionate about it. And how can we find a way for that to come into something if it's not constructive, come into some constructive way of taking the bits that are really helpful. But yeah, just trying to be mindful of your own stuff and not projecting that onto other people when they're coming to you with ideas you don't agree with. I don't know if that is, it's kind of a non-answer, but yeah, sorry. Scott, one more question. I'm fascinated by the voting system that you have. Projects have problems with people coming in and leaving quickly. You said one person, one vote. How do you make sure they stick around? Do you have any way of like saying, hey, we're doing this, but like, can you speak more to the voting process to it? Because projects always have a problem with that type of system. Yeah, so part of the thing that we've done is like for voting, you need to be a member and that is linked to a financial benefit to the project because you need to either pay or contribute so the project's benefiting. Do you mean in terms of how do you get people to care enough to vote? Yeah, I kind of like, I mean we put money into it, but sometimes it's cool that I don't really care, but I need your voice to suggest or know. Yeah, so like people have said they've joined but then they don't really care enough to vote. A lot of it is to do with one-to-one engagement or not one-to-one but one-to-small engagement, making sure that people are aware why it's important to vote on that thing and you've got to accept that some people won't care. But it's, I think it's like using that emotive language and trying to explain like this is your opportunity to have a say in this thing. We actually had probably 20 people become individual members because they wanted to vote for their favorite candidates in our election for example. Another thing we're going to do is a template contest where you have to be a member to upload the template for example. So we're trying to do things like that to get them into the system, understanding how it works so it's very easy to use. So thank you very much, Shavu, they really appreciate that. And if you've got any other further questions I know you gentlemen do, but they'll have to be outside in the hallway after the transfer bit to spot over the room. So thank you very much.
Meritocracy or Do-ocracy - why diversity is still hard and what can we do
Thank you so much. Thank you for having me. I think that we can keep in touch. Today's topic is about something I really care about so much about the community. I will try to tell it as my personal perspective, the story and tell it to your very personal friends. There are some terms that you can use to talk about it. The role involved in an open source project. The role is also working in community with some community work. Contributors, maintainers, any kind of thing. Some people are doing more coding and some are working in other roles. It's also very important to support the community. Other non-coasters are also very important. I also love open source. As you can see, I get involved in the open source project. I also get involved in the foundation that is supporting Python, the language itself and the community. I love the community so much. I work in the open as a community manager. We have a birth and we are in K-level too. We are also there tomorrow. Come talk to us. We have other projects. I will skip this because you can talk to me at the first. Why open source is different? I know that a lot of us are open source but I don't know who is here. Maybe it's a bit like don't need me to explain it to you. Just for the record, for a first one, I may get a little bit new to open source. Why open source is so different from maybe in your previous job, maybe you work in a closed source project, why is governance so different? Maybe you have already get a sense of it but let me explain. I made a comparison here because I really like comparing things. I was data scientist and I like analyzing things. Let's do a comparison of open source and Python software. How is the governance different? Open source, sometimes we don't have a clear owner. A lot of projects start out with one person. That person will probably think they are the owner but what if they are like, I want to retire, I don't want to do this open source project anymore. I'm a bit tired, so I give it to someone else. Who actually owns the project? Sometimes people are not clear. For example, if you have windows on your operation system, we know that the windows are developed by open source. We know that Microsoft owns windows, so that's more clear about who owns the project. Also, open source, there's a lot of volunteers involved. What you do is very different from working with staff and engineers, your colleagues. Staff and colleagues, they have to have responsibility. They do that, they have to do their job, right? But volunteers, volunteers, volunteer in their time, you can't really tell them that much tomorrow. You don't want to be pushing the volunteer like that. Another thing, if you work in a company that's very clear, you will probably have a manager or a manager who has another manager. There's a corporate hierarchy there, so maybe the CCO is the problem. For the manager, you have a team, and then you have a lead, you have more senior role, junior role, it's very clear in your hierarchy. But for open source, sometimes you get a sense of someone a more senior. They have not really your boss, right? They made someone who has contributed maybe a year or two years before you. So they know a little better about the project, so it doesn't feel like you're involving someone. They may be someone who is a maintainer, they may have more access to the project than you, but still, it's not a very clear hierarchy. Nobody is any of these folks. On the last thing, it says, open source, they are contributors. If they don't have lots of them, they don't have a lot of convenience. They don't sign an employment contract. They have to work five days a week and then eight hours a day. They don't sign an employment contract. They don't have to, you know, if they have something, you know, they've got busy with their work, you know, having some personal thing, they may actually contribute the best when they're done. You can't expect them to contribute in a constant kind of pace that, you know, always the same contribution all the time. But if you work in a company, of course, like, you know, someone has already signed an employment contract, they have committed their time, that's the responsibility. If they want to take time off, they have to apply, and then they have to pay an HR, they have to know who is having a holiday. So that is very different working in open source, especially open source that is like involving volunteer work. It's very different when you have a company and have staff working in it. So the last thing I would say that makes it super different is that, you know, proprietary software, it's owned by a company, product is supposed to make money for a company, because that's how a company provides, that's why they make this product, because, you know, they have to, as a company, they have to make profit, they have to thrive. But for open source projects, sometimes, you know, someone started a project, they just won't want people to use it, that's all they want, they don't even like, they're not selling it, they're not even making money, you know, they just someone started a project and then it becomes big, and it becomes something that everyone uses. That we would say is a successful project, but there's no like, monetary gain for one of the people who started it. So, now that we know that we are thinking about open source, we think about like, we put that like, kind of company, hierarchy, you know, corporate, and target, I would think about it within the community. So who's going to be in charge, right? Who's going to get the top? So there's something that we call, um, bureaucracy. So what is that? I'm not very good, English is not my first language, so I look it up. So you can't say, oh, whoever, it's basically what it means, whoever it has, it's the most knowledgeable, whoever do it, do it, they have a say. So, um, for example, if someone like, oh, I've done it for like 10 years, I know how it works, so that's why I'm able to have them listen, because like, you know, they are the most knowledgeable person in the group. But they've got to have this, every time they listen, they actually trust that person, so that's why they become the leader. We get out, who knows what a get out is. So we get out, but they're not going to take that alive. Listen, when a person knows about that character, it's oxymoron. Right? So they're not going to take that. So, there's only a small number of people, they are qualified CDFLs, well, there's actually no qualification of people, kind of, give them title, there's an honorable title that people give up to them. So, any of them? Well, I mean, not passionately, but I do, you can think of some, some of them. What? You know, I know that you have high-level people. I thought that you were... Linus? Yes, oh, wow! Yeah, hey! Yeah, it's not benevolent, yeah. We have already, like, you know, we agree on some people, right, they have, you know, there's no, like, a good tip, you know, people really appreciate the contribution of the sauce. Oh, yeah. Just a little? Oh, no. Oh, no. Okay, you know that for a while, the show must go on, so... So, you know, I still see a lot of people, like, in the community, having a community, they're like, oh, I'm a little bit of a friend, boy or friend, girl, you know, they're like, you know, they're like, you know... Inspired by the BDF arrows, because, you know, they have made really good contributions to the community and their contribution is fundamental, that's why. But, now I give you a challenge, like, you know, million-dollar question. Can you name some of the BDF arrows that are, you know, maybe they're women or they are people of color? Any? BDF arrows? Ten million dollars. I don't know, ten million dollars. Yeah, it's hard, right? There isn't many of them. The BDF arrows that I asked you before, that you can think of, most of them are white-nosed. So, it's not very diverse in that sense. So, that's a good thing about Marapposite, that, you know, we trust these people because, you know, they're good at it, right? So, we trust them, we trust that they can, you know, they can bring the community together to, you know, to lead the community, they have the knowledge. And it seems like everybody can be BDF arrows, right? Technically. So, if you're good enough, you will be BDF arrows. So, anybody can be BDF arrows. That's good. But, there's a catch, because, do everybody get the same chance to get the knowledge? Because, if you compare, right, if you compare someone, maybe they, you know, they were encouraged to learn programming language when they were young. And now these kids start programming very, very young. So, they are like, you know, compared to me, who is like, maybe I start programming, you know, who knows what they're over there. So, maybe when you do this, you know, but now it's like kids are already making games, and they're like, that was like, what? So, not everybody is given the same resources. So, everybody are like, you know, have the same opportunity to start this journey early, or maybe they have distracted in their life to do other things, so they don't pursue to be like a very good software engineer. So, it's very hard for someone who is, you know, who is like that to achieve the same high level. So, let's make a very good example here. This is my friend, Malin. So, maybe if you're Python people, you also know Malin. She's very famous in the Python community. She is from Zimbabwe. She's previous as a director and vice chair of the PSF. And she founded a symbol pie, she's the nonprofit for Zimbabwe women in Texas. Yay, she's a profit diversity in Zimbabwe. That's amazing, right? I really adore her. So, when she gave a keynote in Hyde of Italy in 2003, so it was an amazing keynote. I think that is really pulling the handle, but what brought me to this story is that when she did the keynote A, there was someone, actually I think from the accent, I just guessed, I'm making an assumption here, there are like a ton of developers and he asked, I don't see why we are different. You know what I mean? Like him and Malin, like he said that you will see a difference. You know, why do we need to make effort in diversity and inclusion? So, Malin gave a very, very good answer. She kind of, tell her story, she didn't start co-coming very young. She was doing some science subject until in her twenties that she started co-coming because of the field, the major science degree. So compared to someone who maybe, you know, imagine you are a kind of a male, a female developer in Europe, maybe around our age, maybe you start playing video games when you are very young and you start learning that, oh, actually you can't buy some code, you can do some fun stuff, so you are not, not everybody is given the same resources. Not everybody's journey is the same, so that's the difference. Especially now, if you look at the research still, you know, those are like women, they can't really spend the same time in the talk, you know, in clothing as many because they are soon to have more responsibility to interfere with kids and console shots. So, yeah, that's what I discovered after, you know, I'm seeing all these stories that we are not on the same level playing field. It's not everyone's ability, sometimes, you know, if you like, let's say imagine there's a girl in some update, you've heard of Trinity to learn coding, maybe she was also, she was also a bridge, maybe she would be from the PDFL, but it's about, you know, the environment, maybe someone who doesn't know if they can compute at home, it's more hard for them to be, you know, passionate and maybe to cope at a young age. Also, social expectations, like I said, still, women and now, still, lots of cultures expect them to be, like, maybe like care giver for family members, for kids, but then it would be like less time to develop, like, especially in open source, a lot of times you will, you know, work on open source in your spare time, so if you have to take care of your family, it's very hard. Also, access to resources, you know, I give an example now, these kids have more, you know, their internet is easier for them to get resources, but imagine if like, you talk about someone who maybe grew up in like the 80s, right, not everybody can do that, home for the 80s, so people, you know, so if they grow up in that time, they don't have access so maybe for them, you know, computer and programming is a new thing, so not everybody is safe. So, I'm making money off of time, I'm full of rush. So, we don't have the same opportunity. So, if you're not bureaucracy, what else? There's a new thing called bureaucracy, which I learned, so what is bureaucracy? Very simple to put it is that, you know, whoever got the time to do it and put out their hands, we don't check if they have the qualification or not, just like, we trust them, just do it, basically what it is. And one of the problems, you know, open source, I like this, you know, you don't have to have a certificate, right, and everybody can contribute to open source, we just like, well, of course, we won't, we won't question P out that easily, we will double check and have peer review, but, you know, we all stop you, like, you don't have to have a computer science degree to start, like, you know, looking at the PR and start really, you know, doing something that maybe gives them a list of contributions. Everybody's given a chance, there's no case keeping it, again, like, we are not meeting a CV until you can start, you know, the caretaker requests, right? So, but it's true, if we think about it, right, open source is new, I can see, kind of, because I just made some examples there, it's like, there's always work to do in open source, like, you know, nobody's got to tell you, like, you don't have anything that says you for offering your help, we don't need you to help, like, I think that's very rare in here. So, anyone can start their own project, if for example, if you feel that you don't want to conflict with others' project, you can actually, like, you know, clone it and start your own project if you want, that's the best case you can take, and again, like, we are not going to meet your CV and hire you and make you sign an open contract. So, it's open source, open to women, so that's the thing that I want to bring up with the story that I'm going to tell you. So, this is the career, we are still having a challenge because in, like, you know, you know, game of users, just 5.4% of them are over, so this is actually a statistic done in this book, you can actually, you have the link, if you click on it and see that it's still very small and I'm going to contribute to open source, if you look at the statistic research done in this book, and also the problem is that in tech industry, like women, a lot of them actually, they latch in their mid-career, so the more you see, the more senior you go, the less women you'll see in the company, so since I'm doing like this, why? Really worrying, so an example I want to give is, so, again, like, everybody can contribute to open source, you see Python is open source, you can contribute, but Python Core developer, we are looking at the Python Core developers, right, these are the developers who have committed rights, so these are the leaders, they are like, see more senior level in some sense in C Python, eight women in a team, do you know how many Core developers in C Python? This is a very difficult question, because I don't know why I have to ask someone, it's 87% of the total, and actually once, these are the actual, you know, I think there's 100 and something, people who still have the commit right, you know, if you look at age, out of 87%, it's less than 10% of, there isn't a lot of women Core developers in Python, but why, right? Again, like, what was supposed to be the last thing, everybody was supposed to have the same chance to contribute, but why going to technical leadership, right, go to more like senior level, there is a few women, and it's the pressure that I ask myself in the community, I was trying to say, why is so little, isn't there a lot of female core depth there, I want to be a spy, female core depth, but I'm fine with it, so what is lacking here, right? So Marietta is also my friend, and I'm telling you a lot of story about my friend, Marietta, she is the first female core depth, and you know, I admire her, she is really inspiring, you know, she's also not just, you know, for technical stuff, she also helps with all of I think Python US for two years, including this year, you know, very senior in tech, so very rare, if you look at the statistics, so I ask her why, there isn't much of, like, female core depth, and that's something that she answered for me, is that, so by the way, she is a mom, so she has a kid, so, it's like, why doesn't it fit, like, it's very hard, so they work, are you really hard in their work, because there's already some bias in their jobs, so they have to spend extra time sometimes at work, and it's like, you know, again, like, open source is sometimes like, you have to do, like, spend your spare time doing, so for someone who is like, for example, like, when you're a mom, you have kids, it's very hard, especially when you expect it to be the major caregiver in the family, and another problem that big companies doesn't allow their employees to spend their work hours to contribute to open source, so there's, again, less opportunity for people, especially women, to contribute to open source. Another thing, I actually talked with someone from the Rust community yesterday, that microaggression is a thing, right, not just items like, you know, Rust and maybe other more other communities, it's like, we have code of conduct, it's kind of, you know, everybody understands, and it's not like, but there's something for microaggression, that is, you know, something that really you can't make end-force, and it's still not a very nice thing, it's not very welcoming, and we don't want that, but it still exists. So, yeah, so what went wrong? Well, I don't have to explain this, you have all of the, like, when you make a policy there, you know, some conference, they have some big female speakers, very messy, so what's the problem? Because, I think a lot of, you know, some organizations, they still don't understand, like, they think that diversity inclusion is just a slogan, they have to do it because they don't want to be cancelled, they need to look good on the numbers, and I talked with some organizers, they said, oh, I'm women, I have the same merit as men. Well, I mean, I know a lot of amazing female developers, I've just already shown you two examples, but, you know, but nobody asked me, like, how to contact them, so. Like, I kind of, like, like this picture, right, I was like, well, I can speak Chinese, so there's like a story, old Chinese story, saying like, I think it's trying to steal a bell, but like, oh, but touch it, and everybody will know, so it's probably scary, assuming everybody's happy and it's probably scary, they're selling, but that's basically what happened here. What can we do? Professor, all we have to acknowledge the problem, we have to say yes, the problem exists, we have to fix it. Understand the problem, to not just look at things just like numbers, we have to treat people like people to understand why, like, these underrepresented folks, they don't feel comfortable, or, so that you have to talk to them, right, you have to talk to them, say, how we can help them, what do you need, how we can make it nice over you. You have to put effort in it, and you have to do it all the time, you can't just do it like once, and things are done, no, it's never going to be done, and it's in the foreseeable future. So as a community member, support that, if someone, you know, is an organization or some leadership, they're trying to put effort in it, trying to support them, if they're on a big round of a process, like you did well, I really love your conference, that kind of thing, so I call it, the Procraste. The Procraste, and the other two words, I'll just put it there. So we have the extra effort to support underrepresented folks, to be successful, so that's our management program, it's very important, I'm going to show you an example later, but this is all the things that we can do in the community to support underrepresented folks, and it's, you know, I don't want to point out all of them at the time, but you know, this is the thing, a lot of organization and community is doing, keep doing it, keep doing it, but you know, we can do, you know, we can do more. For example, again, this woman, I also adore her, she's from Wana, she's, I met her at the Papua, so she's living in India, so she was an outreach team, so coming up, you know, outreach team project? Yay! So outreach team project, they are paying for underrepresented folks to do internship at the Procraste project, so she used to work as an intern for outreach team at home, now she moved on, she started a new career, I mean still in tech career, but started new role, but she's still contributing to the PGA effort. She said that like, you can actually find her talk on the Python India YouTube, so you can look at it, and she said that like, outreach team program helped her to restart her career, because you know, I don't have time to tell her a whole story, but so thanks to outreach team program, she's now still a software engineer, I've done it in kids, so I want to take home, not everybody have the same access to resources, we have to accept the problem that we still have, you know, bias in the community, it's not like, we are not diverse and welcoming enough, you know, pay development open source, not just the maintainers, not just like, you know, like outreach program, they also like pay interns to try to get more new people to come into contribute to open source, you know, educate people with less access, you know, sometimes like free courses, free resources are good, emphasize the success, show these amazing women, like, I don't want to hear that I can't find female speaker anymore, I want to like tell people, there are lots of amazing female developers, female community members out there, reach out to them, provide safe environment, if you want them to join your community, to be present at your event, make sure that it's safe and welcoming for them. So the last thing, we will have to see our democracy, no, I say like try support democracy, like try something that, you know, is better, I think, too. So, thank you very much, I'm running out of time, so... applause
Please Make It Make Sense: Product Management Methods to Make Your Project's Purpose Clear
I think are we good? Okay, so next up we have some product management coverage content by Loria. Hi. Yeah, so I've been a little bit of an AV disaster, so I'm going to have to look at my slides because I can't see them here. But here's the title of my talk today. My goal is to help you get more structure around your open source projects, hopefully save time and ideally do less. Okay, so about me, I'm an American living in Germany since 2015 and I mention this because I came to Germany with a very live to work mindset and now I have a very work to live mindset. And you're going to see that mindset shift in my talks, like the messaging I share with you. Among my many open source activities has been contributing to Kubernetes, particularly SIG release and also more recently the open SSS security scorecard project. I have this link here where I thought I'd highlight it because you can find a lot of management and leadership guidance there. It's a collection of resources, blog posts, videos, templates, things like this, including some things I'll show you today. I've worked in places. I'm not working now. My company shut down at the turn of the year. So if you like what I have to say and think I could be helpful to your organization, let's talk and there's my LinkedIn in the meantime. I'll cover basically two branches in this talk. First is some observations from my time in open source. I'll sprinkle some helpful hints and examples along the way and then I will focus on some tried and true traditional product management methods that work in a company setting. You've probably encountered them in your day jobs, but they also work in open source with a little bit of creativity. So some of those observations, I see contributors taking on so much work. Just lots of issues, many times even multiple leadership roles and it just seems like a sure far way to burn them. Because they're so overstretched, they don't have a lot of time to do a lot of research and gather data. Also, that's a skill set that not everybody has and not everybody needs to have. But the end result is often that a lot of development is based on assumptions instead of data. Another thing I've noticed is that what exists today in a project isn't well-defined or documented or mutually understood by the project team. This represents a pitfall because you maybe don't have the shared understanding of what your project is and does and should be. And lastly, there's often times of vague strategy or even none at all. I would say that the most acute manifestation of this issue is that there's often a boundary between what goes in a project and what stays out that is lacking. This can lead to a lot of work being done and that work just kind of expanding. So if you take away anything from me today, it would be this message which is I really encourage and invite you to do less if you can. I know your manager may not want you to do less. There's always very specific conditions around that relationship, speaking from experience. So I'm happy to talk to any of you after the talk if you would like to have a sound and pour like ways you can manage your manager's expectations around what you can do in open source with your limited time and availability. But if you are the pressure source telling yourself to do all of the things, then I invite you to ask yourself at first like, does anybody even want this? I mean, maybe they do, but maybe if you're the only person or you don't have a very clear sense of how many people might find value in your project, maybe stop and collect more data before you move on. Also keep your personal backlog light. I know some people really enjoy working with them, but they take on so much work that they end up becoming the blocker for other people to make progress. And you don't really want to do that, right? You don't want to impede your fellow project contributors' efforts because you're like the decision maker on 10 different things. So that leads to delegating. Delegating not just to reduce your workload, but also to empower others to gain skills that you have. And I know that's rather time consuming, but oftentimes what I've seen in open source is that a little bit of upfront onboarding and knowledge exchange saves everybody time in the later stages because you have multiple people who can work on something at once. And the last tip is something I've used over the years because I would just take on work too. I love it, like, let's be busy. And then I would find that the work that I took on actually involved a lot more than I bargained for. So I highly encourage you to unpack a task before you say yes to doing that task because you may find that it's going to take you a significant amount of time. Here's an example of that. So this is a project board that I created with collaborators from SIG Release and Kubernetes. The initial idea was to rewrite a tool from scratch. And I looked at that and thought I heard that and I was like, you know, we may not want to do that because that sounds really, really intensive. So what we did is over a couple of sessions we figured out some real things that we didn't know about this particular tool that we wanted to, you know, talking about rewriting. And what we had was a lot of questions, like what is it, what does it do, what do users want. So you may not see all this text, but just the TLDR for you. There's a lot of spikes in decision making and documentation, like proposals to write to get community feedback before even setting to write code. So this is what I mentioned earlier, like the assumptions that we often take into our development plans. We had a lot of assumptions that we just had to rewrite this tool because it's just too broken and, you know, we just do it over. That's often not the case. And so I just want to point out that I didn't come up with the idea of assumption-driven development. I found a term that someone else created, and in my search to find out exactly who, I came upon this blog post, which I found really interesting. It's a developer who basically described his own failure trajectory because he was operating with assumption-driven development. And what he did was he decided to just take on a lot of work on his own. He didn't talk to anybody around him. He also didn't understand what he was working with in that day, like the tooling and all of the different tooling relationships, and also the knock-on effects of making changes. And he kind of went in like, I'm going to do this, say, and like it's going to be done. And that also didn't turn out to be true. There was a lot more work involved that he had expected and planned for. So I thought it was a really great summary from the developer's perspective of why assumption-driven development is often not the best method to use. I'm going to give them a talk, and you can ask questions after. Thanks. So basically, what I'm suggesting here, like a way to conquer assumptions, is oftentimes just listening to your environment. And that starts with the people around you. So there's this thing called active listening, and I found a nice resource from the Center for Creative Leadership, and they give you some behaviors that you can adopt, or adopt rather, to start listening more actively to your colleagues or co-collaborators and others you work with. They say, first of all, pay attention. And we take this as a given, but in our world of smartphones and lots of distractions and multitasking, we often don't really fully pay attention to each other. And one way that we don't do this is that we sometimes can't wait to, we don't wait for the person to finish what they're saying, before we just like, oh, I want to get my point out. We have to go, and then we end up missing the latter half of the sentence, because we're too focused on our own sentence and what we want to say. So active listening means that you don't do that. You actually let somebody finish, and then you ask. And you also can do things like clarify what the person is telling you by asking them questions. Did I think, I think I heard you say this. Is that correct? Or can you tell me more about what you're trying to say to me? And then together, it starts to become a collaboration, because you're inviting them to also clarify their ideas for themselves. And you're also getting higher quality information, because A, you're taking it in, and you're also engaging with it in a team context to work out new ideas. In addition to listening to your colleagues and people around you, you should also listen to your code. So I mentioned a few slides ago about this idea to rewrite a tool from scratch. But if you don't really listen to your own code from the beginning, you may end up doing a lot of work that you could have avoided by just optimizing and selectively choosing what to work on. So having artifacts like docs and diagrams will help you to better reason about the work you truly should do. Optimize, find the points where you can make things better, and also plan accordingly. So here's another example from Sig Release where we applied this principle. We had this tool, right? And we were going to rewrite it. But I said, first of all, let's actually document the flow that the user follows to use this tool, achieve a job, go from point A to point B. And so an engineer in Sig Release did this, and then we gathered around as a group, around his workflow, and talked through every step, figuring out what was really hard, what was taking a lot of time, what wasn't working. And as you can see from the results, the first line there is the overall flow. And then I blew up this section toward the end, where you see a lot of anger, and then there's this little clock, which means it was really time consuming. And you could then see in the full landscape of this project's flow where the pain points truly were. And we were also able to use these posts to document exactly where the code existed that was executing these steps. And so what we walked away with was a much more focused plan for what we needed to do. And we can then start there and then decide after collecting a lot of information about these weaker points what we should do next. Maybe we rewrite parts of this instead of the whole thing. When you have a workflow like that in place, it really helps you to put, it puts you in better control of your project. Now if you have no projects, that's fine too. What we're going to cover next are some tools that you can apply as you start working on a new project. But you can also introduce these even if you have something that's several years old. It doesn't matter. It's never too late to understand your work and then organize yourself to do the highest value work in the future. So I'm going to cover having a strategy with a doc template, doing user research and surveys, including an example of a survey which is the NPS, making a roadmap and giving you a template you can use, and then prioritizing and refining your backlog with some methods and tools you can apply for those activities. So here's a strategy doc template that I just worked with the security scorecard team on to actually fill this out. And I know these little lines here are small and you can't see them. I'll get to that in the next slide. But it basically introduces the concepts of the 5Ws that journalists use typically to write a news story where they need to have the reader know the facts of the story right away and then if the reader wants more detailed information they can read on. But it answers who, what, when, where, and why as well as how. The goal here is that you have an asynchronous tool that you can use so you don't have to have a meeting around this, although I advise it because you'll find that more information comes out when you actually discuss your strategy. But you can at least start with a template like this and people then can contribute their comments and ideas to it. This is Miro by the way. When you actually have this template filled out and you've gone through it with your team then you can dump it into a doc, refine it a bit more and then publish it in your repository for the public to look at. And then of course you can continuously revise as your project develops and you discover new information. So those small questions in that template are here basically. Not all of them but some key questions that are quite useful for getting a sense of where you're going with your work. So who are the users as long as the contributors and the maintainers? But really who are the users? Who are the people deriving value from your project today? And who do you want to derive, who do you want to have value derived in the future? Like who should derive value in the future? What does your project do today? On the flip side what does it not do? I mentioned earlier that boundary about what goes in a project and what stays out. When you can clearly explain what a project is and what it is not and what it shouldn't be then you can get a clearer sense of where that boundary lies. You can also think here about what the UX is like and what quality concerns and constraints you have. It's really just like what is your project essentially? What is your project useful? So what are the conditions to trigger a user actively coming to you, you're solving their problem? Another way to look at when is like how long does a particular stage of your project's workflow take to be completed? Where does your project fit in the ecosystem? So I'm not going to go over the ins and outs of doing a competitor analysis here. There's lots of templates online that you can look at to do one. But I highly recommend it because when you take a look at other projects in the space that are doing similar, solving a similar problem, you can then assess the resources behind those projects. Maybe there are even products, so maybe there's like a company doing what you want to do. So they have a lot of money and they can work quickly and then you can consider like what you actually have in your time budget to actually pursue. You can also see what those projects and products strengths and weaknesses are and then use that information to distinguish and differentiate like what you want to provide. Maybe it's a niche that you want to really get a handle on and provide a really clear good solution for that no one else is providing. Maybe it's just because your project is community-based and other projects and products out there are like for money and so like you're going to be able to serve the community whereas those alternatives will not. So thinking about where your project fits in that landscape is really quite helpful. That leads into why your project exists in the first place. What value does it deliver? Then that puts you in the seat of the user who is actually trying to use your project and solve those problems they face. Another question I like to ask around why is the cost of delay. So if we don't develop this project now or if we don't iterate on it and provide these features of functionality, what bad things happen? What bad things happen to our goals? What bad things happen for users who continue facing this problem without any solution? What happens to innovation in general? There's really a lot of interesting conversations you can have around cost of delay. Then finally how does it work now? This question is also a really nice hook for you to think about the future and where you want to be in 12 months or 24 months with it. How do you want to build this to provide different features? Maybe redesign the architecture to be simpler. How do you want to be and how is a good frame for that? I pointed earlier, we're going to cover some more tools and methods. The next one is user research and surveys. Having as much data as you possibly can really pulls you out of your own biases and what the developer with the assumption driven development blog post was describing. I only listened to me and it didn't work. If you're listening to your prospective users, your current users, other project leaders, you start to get all these different perspectives that can ultimately help you develop the right most valuable thing and not develop a lot of other things that are going to take up a lot of effort but maybe won't have such a payoff for you or for anyone else. Surveys should be kept quick and easy. I tend to use Google forums. I mean, I know it's not open source but it works. I don't ask people to write a lot because you don't want to read at all. You probably don't have time to read lots and lots of survey responses. The survey respondents also probably don't have a lot of time to fill out lots of forms. Using check boxes, multiple choice, rating options from zero to five or whatever you want to set is your endpoints. You have numeric data that you can quickly turn into charts like this one which was from a Google survey and it's just easy to make a chart out of the results. Another thing I like to remind people of is Please Buy by GDPR. Be careful about how you're collecting the data of the people who are filling out your survey. Make sure they offer their consent before you offer them a chance to give them, to give consent for usage of your data before they move on. Another great way to collect user data is through discussions. Like on GitHub, you can post a question and see people respond to it. That can be a little more time consuming because you're going to have to read through all of those answers. But it can be quite useful too because you get broader context. If you're in a hurry and you just say, hey, community, I want to know if you want us to do this thing or not. You can send out an issue and have them give it a plus one or not. You can use emoticons, this like votes. There's other tools out there that product managers use all the time like AHA that offer this kind of voting functionality for feature ideas. And finally, interviews which are really can be quite time consuming. But if you have the time to do them, you can even just do a few. You can learn so much about your own project. You can sit and watch somebody try to use it and see where they get stuck, see what's confusing to them, and collect all of that data and think of ways to optimize and improve. Oh, I forgot. This is a really important point to ask them. With the results, a lot of times when people fill out surveys, it's numbers, so it's all scientific. But it often isn't because our users may be giving their feedback from a limited set of data points themselves because they may not be aware of all the alternatives, all the directions that your project can take. They may not have a full understanding of the functionality because they don't have time or maybe you didn't explain it well. So always be aware that just when somebody tells you what they want, they may not actually want that thing. That may be the best guess that they have that would solve their problem, but actually in the broader context of other types of users, it wouldn't solve the problem in the best way. So just keep that in mind that data can also be a little bit of a trap if not used carefully. I want to give this example of a survey that you can run very quickly. If you don't have time to set up a forum yourself, lots of questions, you can still do an NPS survey. This is used by lots of companies, but it's quite useful in our context because it just consists of two questions. Basically would you recommend my project in this case to a friend or colleague? And then can you please explain why you gave that score? So the number is very easy. You have to put it in some kind of NPS calculator, so I gave you a link to one. It's also the image source. You basically put in all that data and then you come up with your NPS. And then there's different analyses online for like what is a good score, usually it's 20. When you're 50 to 80, you're doing really well. So that's from the way that the score is calculated. It's a pretty low overhead way to collect feedback. Are we on the right track or not? I mentioned also the next type of tool I want to show you and that's explained with this roadmap template which you can adapt to your own needs if you'd like. I cover some of the who, what, when, where, why questions that I covered with the strategy doc template. But the roadmap is more of the short term. What would you like to do in your next, say, three to six months? It's taking a slice of your strategy into getting you more focused around what you want to develop now. My strong recommendation is to keep it to a page or less so that people can actually remember it. Keep the number of deliverables and goals low, like one for three max, using a metric to justify why it's necessary. If you don't have a metric, like a baseline to say like we're doing this deliverable because X number of users want it, then you can also think about the metric that you want to apply to then be able to measure the success of your feature. I always like to include risks, like what is known, what is unknown in a roadmap, just so that with the unknowns you can plan that it might take away time from the future development. So it might be a bit of a distraction, but you at least are aware of it and you're going to have to work it out in the future as you go. And then technical goals. And this is like to make sure that quality, observability, testing doesn't fall by the wayside. I see this happening in a lot of projects and products as well where like all the stuff that actually makes the thing run gets pushed to the end and then the engineering team is stuck with a very patchy problematic system that they want to really fix, but nobody has a lot of time for them to do so. The next last couple of slides are just covering prioritization. So this is a matrix that I like to use because it allows teams to take a stack of issues and then plot them on this matrix. The matrix asks them to assess tasks, ideas based on the amount of effort along with the value that they expect to provide for the user once they do the thing. And then this allows the team to see like if they have a lot of things that are high value but also high effort, then they either need to maybe focus on one of those because they're not going to do like 10 high impact, high effort items at once or break them down into smaller bits so that they can then go into the do it now column which is really where your quick wins and your low hanging fruit should go. It's really important to plan for those quick wins to have them early on so that you can collect momentum and the team doesn't feel like they're just in some long slog that they're never going to see the results of their work. If you have quick turnaround for impact provided then that's nice because they can celebrate those wins early and keep going. There's also this nice, this is my favorite box, the don't do it box because that's where you just like close the issue and forget. Here's where I use this matrix in action. This is also a security scorecard recently. We haven't done this exercise yet but I'm really hoping we do it soon. This is basically all the bugs in the backlog and just putting them in specific buckets like some of them weren't bugs so that was just really categorizing what's a bug what isn't. Then the goal here is the team will plot the bugs on this graph and then we might find out that some of the bugs were solved, maybe some of them are relevant now but it's really to kick stuff out of the backlog and then just have the focus on what is really important, what's really valuable, what are people really being hurt by right now like we should fix right away. That's basically the steps for how you would apply such a matrix. I also encourage using a scoring model. There's a lot of different scoring models and you can find on Google or Ecosia, my favorite search engine personally is Ecosia. You can go in there and see what scoring models can do to help you assess things like reach, impact, excitement, effort and have a weighted scoring option so you can stack rank your backlog items and then do the top items first because you've decided through data and analysis that they're the most valuable ones. This is another template for your strategy. I just found this on Miro. It's by Lou Coleman and basically if you're rolling out an MVP for a new project for the first time, your center of focus is obviously the tree trunk so making the purpose of that really strong and solid and then over time you have more time to build on your tree trunk. This format allows you to plot your plans basically on different bands. So maybe the future band might be something that's high impact and high effort but it's just going to take a lot of time so you don't project that you're going to have it done right away. I just thought it was a nice visual I like trees too. Last slide is probably something that's very familiar to you. It's a standard campaign project board but this really helps with asynchronous collaboration because if you're running your board really well you'll only have high value work in it and then your contributors don't have to have a meeting to figure out what to do. They just pull off from the board knowing that you've clearly vetted your work through the tools that I've shown you so that they know that what they're going to deliver is ready to go and it's going to make a difference. My experience people are really motivated by purpose. They don't want to just do something for busy work. They actually want to know they're making a change. So with your really nicely refined backlog you can help your contributors along by giving them valuable work to do. I suggest making a triage work in group or having some mechanism in your team but just make sure that issues are triaged regularly so they don't pile up and that's a really good way to get non-code contributors involved as well. Making valuable high purpose work. Hopefully I have helped clear your path and helped you clarify your purpose. This is a nice trail in Amsterdam. It's quiet and friendly and inviting so hopefully that your open source development can achieve some similar aesthetics and that's it and that's the links to the resources that I've shared earlier. Not a question. So this goes back to the assumption driven development that made me wonder especially since you pointed out to stake the work first so you know what you're getting yourself into but if I do that, if I had done that then I would have never started any effort at any time because I would have been too intimidated had I known what I would have gotten myself into. So what do I do to still get stuff done? I think it depends on the number of factors. If you have a lot of time to build something out and really focus on it.
Compliance as a Community Effort: Engaging Contributors and Users
All right, folks, we're starting. So we've got Alan Pope, and he's going to talk about community and compliance. Hello, everyone. We might have to go slightly fast because my laptop battery is dying, so apologies for that. So hello, and welcome to my short, gentle talk, Compliance as a Community Effort. I've got two goals with this talk, raise awareness of compliance tools available in open source projects, and increase compliance engagement in open source projects between maintainers, contributors, and the rest of the community. It's a new topic for me, so I'd appreciate any feedback, either afterwards, here in Meet Space, or later in a bar, or via email. My contact details are at the end. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thanks, woes. That includes code scanning, reviewing to ensure compliance with their license obligations, DevOps integration to build compliance tools into their development process. We also provide open chain certification processes and services. And recently we've also provided training to, I have to say this is quite vaguely, developers of a large proprietary software product which consumes a lot of open source products. That's as specific as I can be. So, let's get into the mind of an open source developer. The quick and easy questions, you likely know the answer to, but the answer may be different for each of you. Why do developers create open source software? Why would you do that? There's many reasons beyond because I can and because I want to and why not. There's the traditional one which is to scratch an itch or to solve a problem of some kind that that developer has. There's personal development, maybe you want to learn a new toolkit, a framework, learn a new language or something. Maybe you want to build a portfolio, have some contributions on your GitHub profile that everyone looks at or maybe you want to announce a project on LinkedIn if you're looking for a job or something. Some people do it just for a sense of community and contribution. They just want to give back, helping others with new software to increase the corpus of software that's out there in the community. Maybe it's your job, maybe you've just been told to work on some open source software. That happens as well. There's a number of reasons why people contribute or create new software. That's all very reasonable. But why do people and organizations contribute to existing software? I mentioned why people create them. What about why people contribute? Maybe they want to improve it. Maybe they want to fix some bugs, add a feature or change functionality in the software. Maybe they also want career advancement and they fear that adding a few contributions on GitHub will help them in their career. Maybe community and networking is what they desire. Open source projects bring people together from around the world with shared interests. That's pretty obvious given you're at FOSDEM and if you look around you, there's a lot of people with shared interests around you. Some people want to give back, maybe it's a sense of altruism. There's a lot of reasons why people contribute to open source software. That's why people contribute. What about some of the ways in which they do that contribution? How do they contribute? There's probably amongst those items listed there some of the activities that you may have taken upon, whether it's translating software into your language or maybe you've filed bugs or maybe you've handed out stickers and events. There's a lot of ways in which you can contribute to an open source project depending upon your skill set and your desires. But one of the things that's often missing from these often cited lists of ways you can contribute is actually license compliance. Doesn't sound super interesting. Why would I want to do that? Well, trust is a big thing in open source software. Some of the keep contributors and users coming back is confidence in your project and a sense of trust in your project. At the basic level, they trust that the project is sustainable. Users might trust that the project is going to update regularly and put out timely releases when needed. Contributors to those projects will trust that their translations get merged in a timely fashion. That code gets reviewed in a timely fashion and merged into the project and eventually released and that their bugs get attention. And consumers of the project, maybe not direct consumers, maybe they're consuming your project inside their project. Maybe you built a library and they're using your library. They may want to trust that you're on top of security issues. And all of them likely trust, perhaps subconsciously, that the project is complying with license requirements. A lot of the time people don't even think about it. It may, of course, be the case that your favourite project is fully compliant with all of their license obligations. Do you know that for sure, though? Do you really know that for sure? What about all the dependencies that that project depends upon? Are you certain? When did you last check to make sure? Can you prove that you're complying with all the licenses if someone did ask you? Indeed, consider what you would say right now if someone asked you, does your project comply with all the licenses under which the software is distributed? In my experience, the majority of traditional open source projects are not written by lawyers. And also, the project team may not have access to legal advice, not directly, especially smaller applications or libraries created as a side project or something you do in the evening. But in order to improve trust in the software supply chain, it's something we should consider, just like all the other activities I listed on one of the previous slides. So let's think about the community's role in this. Users and contributors are already very familiar with these buttons in GitHub. They'll often use them without prejudice when you have bugs or when they require a new feature or when you've moved their cheese. And from a practical perspective, the project should accept license issues just like they accept any other issue. It's just like any other bug, right? And users and contributors should be empowered and encouraged to hit that button when there's a license issue. But we should also assume good intent. The default posture when a license issue is raised is not quick, let's sharpen all the pitchforks and start a thread on Twitter and Reddit and hack and use. It's engage in open conversation on the issue. There are a few barriers to community engagement with these compliance issues. There's a lack of awareness and understanding of licenses in general. Many contributors may not be fully aware of the importance of license compliance. That can be addressed with a bit of education, better documentation. It's a complex topic. Interrelationship between different types of potentially incompatible licenses is a hard thing to understand sometimes. So to make it easy, we should simplify the process of enabling them to report issues. Automate regular scanning of your project. There are tools I'll mention in a minute that can scan your project and highlight when there are license compliance problems. You can also integrate that into the whole process so that every time someone submits a pull request, a scan is done to ensure that not only the code is tidy and the tests run, but the tests should include license compliance as well. One of the other barriers is fear of legal repercussions. People don't want to press the issue button because they worry that if they talk about the problem, then the lawyers are going to come knocking on their door and it's all going to be blown up out of proportion. But what we should do is highlight that we're open to talking about these compliance issues. Be welcoming when someone wants to report an issue and foster a community that's full of open communication and dialogue rather than won't fix that kind of thing. So where do we start? Well, the Linux Foundation have a project called Open Chain. It's a good place to start. It's full of policies and procedures, loads of documentation. It's all open source. I've put a picture of the website up there. The URL is openchainproject.org. If you Google for Open Chain, you find some nonsense blockchain stuff. So this is the actual website. And it's all on GitHub as well. So there's repository in there with loads of documents that you could get started with to understand this whole compliance process. There's also a ton of tools available to help you. At the top end, there's the rather spendy black duck. Then there's open source tools like SIFT by Ankor and OSS Review Toolkit. And these allow you to build a software bill of materials so you understand what all the bits and bytes are inside your project and they can scan your repository, they can scan docker containers that contain your software and all that kind of good stuff. These tools don't solve everything. They're part of the story, but they're very useful for automating that scan. So you're aware of stuff that may be non-compliant ahead of time. We should, as I said, engage contributors in the compliance process. Ignoring a license violation and hoping it will go away is not a solution. We should educate and raise awareness that we're open to this kind of dialogue. Educate the rest of the team so they know that if someone files a licensing issue, don't panic, don't just delete the GitHub repository. It's not necessarily caused a panic. People make mistakes with their licenses and these things are solvable. And also celebrate the successes. So when you do solve a problem, don't bury it in a commit somewhere that fixes the license. Celebrate it like we weren't compliant and now we are. That's great. Just as you would celebrate the release of your software, celebrate being compliant. That's a good thing. I've mentioned we should establish clear policies and procedures so people know what their expectations are when they file an issue to do with compliance. And also promote a blame-free environment. Don't go pointing the finger and sharpening the pitchfork to try and find out, well, who committed this thing and who, you know, go through the get blame history and try and figure out who it was and chase them down. That doesn't help. Let's solve the problem and understand why it happened and try and prevent it happening in the future. Maybe assign responsibility. Maybe your team is large enough that you've got enough people that someone could be responsible for monitoring license compliance. I get that a lot of open source projects are, you know, that XKCD comic and the little balancing thing in the bottom right-hand corner. There's a lot of open source projects like that. But if you do have a significant team, maybe give some responsibility to a community member to keep an eye on this kind of stuff. So some takeaways from my little semi-rant. Integrate some compliance tools early on. Start scanning your projects. Make sure that you're using the correct licenses in the correct ways. Maybe you have a lot of dependencies. It's quite fashionable these days to use software from NPM and Cargo and Pipeye and, you know, are they all compliant as well? Are you consuming software that is also compliant? Leverage automation. You absolutely should have tools that are scanning your repositories on a regular basis. And this kind of stuff is pretty easy to do. Many of the providers that I've mentioned, you can just enable a GitHub action on your repository. Very quick and easy to do, very straightforward, and gives you a bit of peace of mind that you're scanning on a regular basis. And integrate compliance into the review process. So when someone contributes a new code, you can scan that before it gets merged, rather than two years down the road when that contributor has left the project. And engage with the community. Don't be afraid to have conversations in the open, in Matrix or IRC or wherever it is. You have those conversations. Don't feel the need to hide those licensing conversations away because you're scared of the lawyers coming knocking on your door. Switching tack slightly. I wanted to mention another project. We're bootstrapping this at Orcro, and it's a project called Corinthian. One of the challenges of open source businesses is in terms of mergers and acquisitions. Company acquiring another company asks the other company, do you have any open source software? They say, I don't know, somewhere maybe, I don't know. And in order it may be done. But the problem is the lawyers who are very smart about mergers and acquisitions are not necessarily super smart about open source software and licensing. And this is a bit of a gap in the market for lawyers performing these activities. So we started this project to collate process documentation for lawyers to help them understand the open source community, the open source licenses, how they fit together, how they interact. It's all open source, they will all be on GitHub. That domain just points to a static page at the moment, but we're building it out. I sincerely apologise for the AI generated Corinthian logo over there. It was done in a hurry. Patches welcome. And finally I wanted to highlight another talk this weekend given by Andrew Katz, Andrew is a well respected and knowledgeable open source lawyer with many years experience in the field. He's also an all round good egg. He's the all gross CEO, so he's my boss. So please go to his talk tomorrow. Or watch it online later on if the room is full. I hope the information I provided has been interesting. If you have any questions, again, I'm not a lawyer. This is not legal advice. It's just my opinion. Thank you for this. So quick question. Is there any support that a company or organisation that can support humanitarian organisations or other organisations in being compliant? Like kind of consultancy or support or ad hoc or whatever. Or guiding or come, let's say come a week to our organisation, help us set things up in the right direction and just, you know. Yeah, there are a few organisations that can assist in this kind of stuff like Software Freedom Law Centre, Software in the Public Interest, OSC, what is it, Open Source Something. It's also a different channel. I will put the notes, the answers to that question in the slide and upload it to the FOSDEN website. So, fun talk. Right now there is sort of a lot more pressure around knowing what's in your software because the security folks have suddenly realised, gosh, we should worry about that. Is there a way that we can sort of align efforts from the compliance folks with the security folks? It turns out that some of those tools that do security scanning also do license scanning as well. So there is actually quite an overlap there and some of the tools can like hit both buttons for you. Things like SIFT that does the S-bomb, ANCOR have other tools that can do the security scanning as well. So yeah, there are some tools that can do that, absolutely. So more of a shout out for a project and tool than a question. Reuse.software, that's the URL. So it's a Python tool reuse and it's also a spec around how to put license information in your project. Nice. You can't really comply with SIFT, the dependencies you use haven't actually told you what their license is. So that's for everyone. Make clear what license the software is released under. Absolutely. So Reuse.software has good tools for the whole video. Excellent. Thank you for that. You have four minutes of your lives back. Thank you very much. Thanks very much.
Single-vendor is the new proprietary
Thank you everyone for joining Solate in the first day of FOSDEN. Quick introduction before we start. My name is Thierry Carreze. I'm the general manager for the Open Infrastructure Foundation, which was previously known as the OpenStack Foundation. I was also elected to the board of the Open Source Initiative, and I'm serving as its vice chair right now. And as part of those activities, I've been working on the draft response for the OSI on the release of a new license called a Functional Source License. And as part of that, I really reflected back on some of the licensing that has been happening over the past few years, and recently at Ashicorp, I don't know if you're familiar with that probably, a very well-known previously open source company that decided to switch licensing for products like Terraform or Vault, and therefore creating some tension in the ecosystem. So looking at those critically, it just occurred to me that single vendor open source is the new proprietary. And this talk will get into details why I think that way. And I realize this might be a controversial opinion. I realize that some people will disagree with me. I realize that I will probably make some enemies out of this. But I think it's an important way of looking at it, and it might be a bit of an extreme characterization. The rant will be very short so that we have plenty of time at the end for you to engage in an open discussion and hopefully prove me wrong before I do it again. So all this real licensing that we've been hearing recently, hello. All the real licensing that we've been hearing about recently is all built on the same narrative around open source. And the story is very well known. It starts like this. At the dawn of the computer age, I think some people will get in at the end of the rant. At the dawn of the computer age, software was not considered very valuable. It was all about the hardware. And the people using those machines would actually develop the software that would run on those hardware as a commons and share it relatively freely. And it's with the advent of the 80s and the rise of the PC that made hardware a lot more like a commodity and with it made the software much more valuable. And that's when software companies like Microsoft were created and with it the proprietary software approach. The proprietary software approach is when a single entity owns the software that is produced and intends to capture all value thanks to restrictive licensing conditions. And we've seen the 90s after that that really led to a lot of excesses, especially as Microsoft decided to exploit that dominant position that they had. And openly developed open source really grew in the 90s in reaction to this evil proprietary approach. It predates obviously that period, but that's really when it really caught on. In that model, software is produced as a commons by a community of participants of organizations of individuals openly collaborating and the value is shared across the participants in that ecosystem. And this is all made possible thanks to free and open source licenses which guarantee a number of freedoms including the freedom to build on it without asking for permission and the freedom to use it for any purpose including making money. And in the next 20 years, really open source got overwhelmingly popular and it unleashed a software revolution. And those that have been around for that time measure how dramatic that change was. A recent study estimated that the demand side value of open source software today is nearly $9 trillion. It is estimated to be part of 96% of the software that is run. Like 96% of software contains an open source component. And it would be very hard to develop new software today without using open source. And so like everyone else, the companies that produce software massively adopted open source. They would develop in-house but release the end product under an open source license. And we call that single vendor open source. And with internet becoming more ubiquitous, some turn to a software as a service model and we saw the rise of the cloud and without the rise of the hyperscaler clouds. And some of those hyperscalers would run open source software at scale which would be seen as unfair competition by those open source software companies that were using this software as a service model. And that brings us to today where those companies say that while open source is great to get that initial visibility, it's bad for monetization. It's bad for business. And so if it's not business friendly, we need to invent new licenses, you know, to continue defending open source and especially against this evil proprietary software. And you know, with those licenses, we continue to give you access to the code for free. So you know, what's not to love. And in some cases, the license will even revert to an open source license after some time. Why do you hate us, Thierry? And I'll explain why. I think this narrative is built on three misconceptions, especially the last part, which this talk is going to deconstruct. The first one is that open source is great because you don't have to pay for it. The second one is that single vendor open source is the reasonable way to do open source. And the third one, interestingly, is that proprietary software is evil. So let's go one by one. The first one, open source is great because you don't have to pay for it. I mean, we are the ones writing the software and we continue to give it to you for free. Like why do you not happy with that? We just need to preserve our business interests, you know. So, well, the problem is open source is not great because you don't have to pay for it. Open source is great because everyone is free to use it. And that's a subtle distinction. I realize that. I mean, cost is a factor, but this goes way beyond monetary concerns, monetary barriers. What matters is not having to ask for permission. Just use it. Anyone, anywhere for anything. Not just the ones with deep pockets, not just the ones in certain geographies. And this really, this permissionless innovation that enabled a ton of valuable software itself often released as open source, which fed into that virtual cycle. Those non-compete licenses that they propose restrict you from doing anything with the software that the company disagrees with or considers competition. And they use pretty vague and untested legal terms and the end result is that it ends this permissionless innovation. You can no longer just use it. The second misconception is that single vendor open source is the reasonable way to do open source and resist evil proprietary software. I mean, we are the self-proclaimed commercial open source companies. We are the business conscious open source folks. You should follow our model, et cetera. Well, let's go back to the definition of proprietary that I used earlier. Single entity owns the software that is produced and intends to capture all value derived from it thanks to restrictive licensing conditions. Well, if you take that definition, single vendor open source companies are still doing what is essentially proprietary software. I mean, they will disagree, obviously. But they still consider the software being produced as their exclusive property and intend to capture all the value that derives from it. They aggregate copyright assignment so that they can change license anytime they want. So it's still proprietary. They just choose for now to release their software under an open source license. So single vendor open source is not the reasonable way to do open source and fight evil proprietary software. It's actually just another way to do proprietary software. It's just a relicensing time bomb. And sure enough, a lot of those exploded over the past year. So the proprietary development model is moving back to restrictive licensing now in a very predictable attempt to capture incrementally more value. Now that was predictable if only we had seen single vendor open source as the temporary tactic of proprietary development that it is. And that it always was. The third one, proprietary is evil. Well, this whole story would not hold if we did not demonize proprietary software in the first place and opposed it to open source software. But as we've seen, you can be proprietary, have a proprietary development model, and do open source as a temporary tactic. So it's not open source versus proprietary. We need to shift that. It's actually more complex than that. You can represent it as a quadrant. On one axis you have open source licensing versus not open source licensing. That's pretty clear cut. The open source initiative defines it. It comes with a bunch of freedoms. And it ultimately enables that permissionless innovation that I talked about. Why do we have those freedoms? It's because they enable the permissionless innovation model that we all benefit from today. On the other axis you have the development model. It's either openly developed by your community that will share the value of the work or it will be developed by a single entity that will own it. And if you look at the traditional proprietary software, that's what I call restricted software. It's when you're using a non-open source license to impose some licensing conditions, especially to preserve your business model or to gain some other benefit. If you look at the open source side, depending on whether it's developed by a group of organizations as a commons or if it's developed by a single entity that retains all copyright aggregation, it's either openly developed open source or single vendor open source. And the issue here is that we're seeing movement from single vendor open source back to restricted software. And they hope that by doing that they will retain enough aura from their open source days to hide the fact that it's just restricted software and pretend to continue to be on the good guy's side and fight against the evil proprietary software. But proprietary software is not evil. The abuse of dominant position in the 90s was evil for sure. But the proprietary model itself is not evil. In my opinion it's just inferior. If you truly think that software developed by a diverse set of actors working in open collaboration is not better, you should definitely do proprietary development. That's fine. Just be honest about it. What's evil is really the lies and hypocrisy that we are seeing there. Doing proprietary while pretending to be open, that's evil. That's what we call open washing. Trying to dilute the meaning of open source by creating deceptively named licenses like common clothes or server side public license or business software license. That's evil. Switching licenses from under your community after having promised to be forever open source like Aishikop just did. That's evil. Being fitting from open source freedoms to build your software in the first place. And then denying that those freedoms actually have value. That's evil. So yeah, as a summary, and I thought I would leave a lot of time for engagement from the crowd so I want to make sure we have time. I want to leave you with three takeaways. Three actions. First, I think it's time for us to remind everyone that the permissionless innovation that we currently benefit from should not be taken for granted. It is a direct consequence of the prevalence of open source licensing as defined by the open source initiative. And it requires all of the open source freedoms including the freedom to use the software for any purpose. The second takeaway is that I think it's time for us to describe what a world where they win looks like. Because if their vision wins, if everyone adopted their approach, all the innovation that those open source freedoms allow to unleash would come to a halt. And we would quickly be back in the 80s. And I've lived through the 80s. You don't want to be there. Imagine a world where you have to ask your lawyers for permission before you use any library, any programming language. And they will say that after some time it reverts to open source license. After two, three years, four years, the license automatically transforms into an open source license. But that's a trap too. Like imagine a world where you have to run a buggy two-year-old version of the software with known vulnerabilities because that's the one that is open source. That's not just practical. Finally, takeaway number three. I think it's time for us to reassert the value of software developed in an open collaboration. Everything else is proprietary. Everything else is a relicensing time bomb. So beware of CLA's when they are not held by an openly governed non-profit. Beware of single vendor open source software because it's just a proprietary model that happens to temporarily use open source licensing. And they have lots of money, lots of resources to spread their very confusing message around openness. And we're clearly disorganized. So I think as a conclusion that it's time for us to all clearly say that single vendor is the new proprietary. Thank you. Ah, objections. No, actually my questions are answered in your notes. So I'm interested in having your notes if that's possible with the slides. So the short story about this talk is actually the text that I wrote for the OSI to answer the functional source license. It was deemed to be too extreme to be representative of the organization. And so we toned it down and changed it. But that's actually what made me have the idea that I should turn it into a rant that I would present. And first then is clearly the right crowd to try it. So you would publish the notes? Yes, basically I'll make a blog post that's basically the same speech. OK, I have a question. What if the code is even developed by one component but under a position of Linux foundation? There's happened many times. You can notice that it's quite common at this moment. I wouldn't call that proprietary. What makes the proprietary approach is not just that you have a single participant. It's that it's close to others to join. And I'm pretty sure in the case of a project under the Linux foundation where they have a major vendor, they would be happy to have someone else. And I'm pretty sure that they would not be able to unilaterally change the licensing conditions because the trademarks and copyrighted creation would belong to the Linux foundation and not to that single company. So I think it protects you. If by design it's a single entity but I don't think they have a lot of projects like that, then yes, there is a problem. If you are prevented from participating as an equal in an open collaboration, yes, then there is a problem. So a provocative question. Is the GPL... It doesn't allow you to do anything. You cannot choose to follow the GPL as invasive in that sense. In your definition, is it still free, software in the most direct sense or is it something in between? I mean, a GPL? No, the GPL. The GPL itself? No, it's totally embedding those freedoms. To me, it's clearly an open source license. Some would say the one, the open source license. The main difference between the GPL and the permissive licenses is how much of a function you want to have back into the contribution cycle. It's really what makes it slightly different. And depending on how much you think you will get contributions without it or with it, all the big projects they have that moment where they have to choose between permissive and copy-left licenses, it's all a bet on the future. If you think that your ecosystem is so big that you will get contributions anyway without forcing people to give back, it's actually better to have a wider funnel to get into your system. If you think your project is never going to be super big and you can use all the contributions you get, I think the GPL approach is better. So you said a couple of things that I found sort of interesting. So one is, I think the objection or the observation is that if you leave the control to commercial entities, they are going to be continually tempted to re-license it, de-license it, change the licensing. So are you advocating that one should try and get the licensing and, well, the copyright transferred to a notionally neutral entity? Because it doesn't, for me, it doesn't seem to be that having like GPL on the side of a license, if the copyright belongs to one of these companies, they can just say, okay, fine, we'll leave that on the side, but we'll keep doing stuff over here. So you either then focus the GPL community and everybody has to turn it with their own resources. So it doesn't seem that GPL provides the protection that you're suggesting. So I think it's more about the ownership that you're pointing at. Is that correct? Yes. I would say copyright aggregation is just one of the assets that you need to have in a neutral asset lock in open collaboration. Trademark is another one. If one of the companies has the trademark, it means it's more difficult to weaponize, I guess, than copyright aggregation, but you can still pull the project identity away from the project, and so that can create some tension. So, yeah, clearly being able to put all of those assets that make that project initially possible and have some stability, so the name, the ability to change the license, under some kind of an asset lock, and I'm not necessarily saying go for an open source foundation like the one I work for. There are other ways today to actually create those open collaboration fields without necessarily going for a foundation. I think today foundations really bring value to make that open collaboration successful, not just possible. But yes, I think it's part of how you would fix the problem. The problem is really that it's a single entity, it's software that is developed by a single entity. They will try to hide it. They will say, well, we take contributions from the community. I mean, that worked for some, and clearly there is the difference between the contributors that are on the inside and the contributors that are on the outside. So it's free labor, it's not contribution, it's not an openly, it's not a common. The common you have to make sure that the future participants will benefit from it. Here it's like it's the pure ownership of one single company, and they take free labor when it's available, and then they change the license under you when some VC tells them to, so it's bad. I'm not a professional developer, so I know this is a rookie question. I don't know the answer. How would an entirely new thing come into being if no one person can have that thing? Doesn't every idea start with one person? So at that moment it is a proprietary thing. And it may often live for some time before it becomes something else. Your rant clearly doesn't cover somebody having a good idea. So how does something completely new come into being? So it starts as one person, but that person makes a choice at that point. They decide either to put it on some software force, name names, and have a proto-open governance around it that says, well, I'm the maintainer, but I would consider adding more maintainers and wants to create an open collaboration ground around it. Then they take the role of openly developed open source. At that initial stage they decide, wow, that's very interesting. I could build a company around it and monetize the heck out of it, and so I need to make sure I keep control over it. And the second contribution, I want to make sure they assign copyright to me or my organization that I can do whatever I want tomorrow. That's there going the way of the proprietary software. And it's really a thinking model. You either want to monetize something that goes beyond those comments that you create with others, or you think that software is going to be the real value that you create, and you want to make sure you capture all of it. Thank you very much for your... I'm here at the top. Thank you very much. I'd like to have your view on what happened to the Matrix Element ecosystem, where the server port was re-licensed in the last three years from a permissive license to a AGPL license. So this is re-licensing in the opposite way you're describing, so kind of more open, but there are still lots of discussion because they are forking their own community with their own software, and the whole community is not really kind of it. What is your view about that fact? So yes, it's going in the right direction, but it's still proprietary software. If they can actually do that, they can do it again. And you never know exactly in which direction. So I would argue that it's a proprietary holdout in the middle. That is probably well-intentioned right now, and a lot of those companies are actually well-intentioned when they start. It's when money comes in and they get some pressure about return on investment that suddenly you need to extract a bit more juice from that lemon, and the only lever you have is re-licensing. And I don't think HashiCore planned it from day zero. Although the VCs at some VC ventures actually have a playbook for it. So it's a published tactic. It's not as if it was like a secret or a surprise. In the end, it's all a calculated move around their approach of how we build software today. They can't really get around it, so they adopt it and try to make it do what they need. I'm not sure if it's just the VCs. I like to blame the VCs and investors too. But one of the things that I think is that people that are paying for software don't value open development and open collaboration as much. They just want the vendor to be around. They don't really value the history. They could have been around for 10 years on their open source development, but at the end of the day, they're still willing to pay for the software, the cloud offering, whatever it is. But they don't value the open development as much as the rest of us do. So I would disagree with that. It depends on the software. Obviously, if you run Firefox and you suddenly decide to use Chromium instead, because you're weird, it's possible. But in some cases, the project that I've been mostly working on over the past 12 years is OpenStack. And if you make an investment in an open source technology because you think open source is the way to build it, or you think that the software, the way it's built, it's not going to be changed on you. There will not be new licensing that will force you to pay support from one single company, one of those things. Having the guarantee that it's not going to change two years, five years, ten years from then is an important guarantee because you make a pretty significant investment in that infrastructure. So there are other open source solutions for providing infrastructure than OpenStack, but all the others are single vendor. And so that means you're potentially, if you choose them, they might decide to do something else with the software, and they might put your investment at risk because they might decide that you need to pay them for support per seat, per server, for whatever, some condition that you can't really accept, especially if you're a nonprofit. I really like the idea of the kind of frictionless way you can use the open source. What I think I'm hearing is that we really should be thinking about the fact that even if you can use particular open source projects, you might want to if they're like this. That's what I'm hearing. And I'm just wondering then, practically speaking, if we have to check whether or not they're like that, does that not interrupt the kind of value of the frictionless use thing? Because now, I just don't understand how we flag those, or how do you know, or how do you avoid this basically without interrupting this idea of just being able to use things that have open source software licenses? It's an excellent point. You can't just look at the license. That's actually another thing that I've been speaking about a few times. We need to go in the way we look at the software. We need to go beyond just the open source license because it's not going to give you this certainty that I think you need. And yes, it's a problem. And there are some organizations when they put out a project, they're pretty sure that they're under an open governance and they will be there forever, et cetera, et cetera. But there is no label. There is no brand. I'm trying to say openly developed open source. I have much as much as it's like a math fall. It might not be the right, like, I don't want to say good open source, bad open source. But yes, we need some way to say this sounds like safe open source and this sounds like potentially a restrictive in two days open source. And how do we differentiate that? I don't agree. We have an issue. The talk here is more to... The goal is to more of a wake-up call where I want people to realize how much we benefit from that permissionless innovation that open source licensing has. I want people to realize that this... Not all open source is aligned with that permissionless innovation. A lot of it is actually saying open source should not be allowed to run on any purpose by anyone for any purpose. And so going back to, you know, we've been through this cycle from the 60s to 2020, going back to the age where computers, where we did not have this 9 trillion body of code that we can easily pull from and that we are free to build on. And there's nothing that guarantees that's going to continue. Like, 10 years from now, we could... Why would open source still be around? It's because if we hold the line on open source licensing, if it's all the line, if the open source initiative continues to grab the open source definition, make sure that all the freedoms are in there and we don't remove one freedom and see what happens to permissionless innovation. And I think that's the general idea. And yes, be more... Look under the hood and see how the open source software that you're being sold is actually built. I'm sorry. It's at 6 o'clock, folks, so rain, livestreams ended. But continue your questions if you have any theory outside. Everyone agrees?
Cracking the Code to Executive Support: Open Source Edition
I'm going to start with the first speaker. See if this works. Oh, that works. Yeah, that turned on. Okay, so I'm going to spell my name. I'm doing this old school. As you can see, I decided to not deck around. Oh, come on. That was supposed to get a lease of a little chuckle. Ah. You can find me on GitHub, Blue Sky, LinkedIn. That's me. So I thought I'd start with a little bit of an introduction about who I am. So Addy Gerard, and I am originally from Los Angeles, California. I now live in Austin, Texas, like most of Californians. For those of you who are familiar with that, the tech community is moving that way. I am a humanitarian, and I have, in my spare time, I'm a hiker, and I love to read, Super Nerd About Data. I also, I have a Cocker Spaniel named Kai. He's a fluffball of super jealousy who adores my hikes. And yes, Kai does come from Ninjago. So any Ninjago fans out there? The Red Ninja. Okay. I am a communication strategist, and I've been doing that for about 15 years. What that means is that I spend my time talking to executives in companies and conveying messages. So when you have two people talk, sometimes that doesn't translate between them. And I help them communicate with people who are inside their company and also people who are outside of their company. I position stories. I am a ghost writer for emails, especially when there's problems between division heads. So I'm the person they call when things are going wrong, when there's a crisis. I am the person who tells them, say this, don't say that. Look this way, don't look that way. That's what I do. I started my open source journey, actually pretty recently. I'm active in the Cardano ecosystem, as well as the Inner Source Commons ecosystem. And I came to this because I got mentored by an incredible woman who helped me see the value of open source and its tremendous impact on businesses today. And as a communications professional, I felt that it was really important to be able to tell that story and help executives understand exactly how valuable open source is to them and to their business and to the future world. With that experience, I wanted to share today a couple of tips that I have learned over my career about how to position the value of open source. I'd like to start with the state of open source. You'll hear people say, open source is one, right? Open source is one. 80% of organizations have increased their open source utilization in the last year. 96% of code bases that have been scammed contain open source code, and 76% of those code bases is open source code. I want to invite you to consider if you have a product and 76% of the market share is yours, you dominate. You won. You're there. It's incredible. And it's amazing to me, the principles and values that people were dedicated and fought so hard to get here, to get the kind of adoption and utilization that we have today. Let's look at the flip side. It doesn't have enough support. It doesn't have enough support. There's a tremendous need for more support in open source. Not just in talent, but fiscal support to the community and the people who are giving to this cause. I'm going to illustrate this with a couple of stories. I know you guys have all probably heard them. Start with left pad. Prolific open source contributor. Company came along, wanted to go against the open source principles, and what did he do? He deleted 11 lines of code. Do you know what the headline said? Internet goes down. Man breaks the internet. For those of you who are curious about what the monetary impact of that is, and I've debated this with actually a couple of people, but they say that in a minute's time, nearly a million dollars worth of commerce happens. Which means in the first hour, 54 million dollars of business was lost. And in the two hours it took before he posted why, 108 million dollars of business was lost. I guarantee you those executives did not expect that. 108 million dollars and 11 lines of code. Let it never be said that engineers don't have power. Another example, again one that we're probably all familiar with, is called Colors and Faker, these projects, right? What happened? Now just to paint the picture, these had thousands of projects that utilized this code, thousands, millions of downloads. This individual should have never, ever, ever, ever been concerned with money. What an impact he had. Huge. But what did he do? He uploaded a bug that took it down. And the reason he did that is because he needed to be paid. The generosity of the open source community is huge. You don't find that anywhere else, it's amazing. You empower the globe. But you can't just continue to give, you have to be able to receive too, and people need to give back. Right? And it's not just about people giving back. It's about the organizations that profit and utilize this, providing support to the people who are contributing. We're at a crossroads. Open source needs to evolve. And it is my belief that at this point in time, it is of critical import that executives who do not understand and leadership of organizations who don't understand how prolific and how important open source is to their organization, that they understand it, that that information is conveyed to them powerfully, and that they don't lose track of that. They understand what that means to their business, to their future, to their product. Let's talk about a couple of strategies that are going to help. Now, for those of you who are not in an organization, this might not be as relevant. But for those of you who work for companies that utilize open source, I'm speaking to you. And I want you to consider these strategies and these ways that I have managed to impact and get more support for open source. There are three steps I'm going to go through. The first one is why. Why will they give back? The second one is what do they need? And how do you find out what an executive needs? How do you find that out? How do you figure that out in an organization? And last thing I'm going to go through, how? Okay, so why? Why will an executive listen to you? They will listen to you because it impacts their bottom line. I know this might, this is fundamentally when you look at the leadership of an organization, they are responsible for the health of that organization. And that means money. That means you are either cutting costs so that you operate more effectively and more efficiently, or you have done something that contributes to the revenue driving machine of that organization. Money is how, that's the why, that's the bottom line. That's what will get you through the door. Let's talk about what, what they need and how you can find it. How many of you have been to a corporate town hall? Okay, all right, decent amount, right? Let me go through what happens at a corporate town hall, right? You have your executives and they come out and they're sharply dressed and they're all lined up and they're sitting in their chairs, and they come out and they say, this is what we are moving to, right? We as an organization are going to do these five things in the next ten years, seven years, three years. We're going to make our mark here and here. We're going to do these things, there's usually, you know, three to five things that they're going to accomplish. This is really important, this is, this is your roadmap. And let me tell you why it's your roadmap. It's your roadmap because those are the items that the executives are hanging their hat on. They have now made a promise and a commitment to their stakeholders, and it doesn't matter whether we're talking about a public or a private organization, they have stakeholders and they are promising, we're going to make our mark here and here. So these items are the way for you to reach it, and what you need to do is you need to figure out where you have a common interest or a shared risk. Where does open source have a common interest or a shared risk on that? Let me give you some examples that you can utilize. I would care to wager that on each and every one of those things, you have one line item that talks about HR. We're going to enhance the culture. We're going to get an internal platform and it's going to talk, people are going to talk. We're going to have better recruiting. HR is a huge cost center for organizations, huge. Making that efficient, making that work is going to be an incredible thing and an impactful thing for your business, for the company. I'll give you a couple of numbers. In the United States alone, voluntary turnover costs businesses $1 trillion a year. To replace someone costs one and a half times their salary. How can open source make an impact there? Let me give you an example. In open source, because we work in the open, you can see the caliber of somebody's contributions right now. You have an unbelievable community where you know who's out there and you know who's contributing. Do you know what kind of impact that has to recruiters? Think about how many hours they could save understanding that because of this, because our organization embraces this, we're going to be more effective because we're going to know exactly where we want to go and who we want to connect with because we can see the caliber of their work. Another thing to think about in 2030, 58% of the workforce is going to be millennials and Gen Z. Do you know what one of the top qualities that millennials and Gen Z are looking for? I didn't hear you. Free time. Well, yeah. I wasn't always looking for free time. I like that. But collaboration. They want to work in a place where it's collaborative. Guess what open source is? It's a collaborative community. HR is looking to hire top talent. They need top talent. They need to attract top talent. You can do that when you have the values embedded in your organization that people want to go to. That is a shared interest that open source has with HR that will impact the bottom line of an organization. Let's look at a risk. Let's align a risk. Cyber attacks. Cyber attacks target software supply chain. And the global economy lost $45.8 billion last year from that. If 76% of your code base is open source, you should probably spend some time looking at that. Investing back in that. I don't need to draw the picture. Everybody has seen it of the stack with the one little toothpick that's holding up all the things. And you're one burnt out person away from watching that fall. How does that impact an executive? When your product goes down. Ooh, that hurts. You just lost market share. On that roadmap of the things that your organization is going to achieve. One of those things is going to be we want new market share. We have a new product or something along those lines. If that product utilizes open source, your executives have forecasted revenue that they're going to earn from said new product. They expect it. They're planning their budgets around it. And that, if that goes down, because there wasn't adequate support, you have just now impacted the executives. You have a case to make about how impactful that is. I mean, there's other things you could look at. For example, there's a 22% year over year increase in malware that has been contributed in packages to open source repos. 84% of repos have one known open source vulnerability right now. Open source is powerful and there are cases to be made that will make executives contribute back to the community. And it's not that they don't right now, right? We see ausposes are on the rise. They're increasing. People are starting to have more auspose, which is great. It's a dedicated team trying to help get more people involved and support the community. But it's not enough. There needs to be more. All right, let's move on to how. I'm going to talk about three things and how. The first one is nobody gets there alone. You all know this. It's community effort. But I want to encourage you to find people with diverse talent who complement your strengths because each and every one of us have a strength. And there are people who can complement that people across your organization. Don't stay in your department. Build a coalition. Build a coalition of individuals who understand the business value of open source and can convey that message powerfully. I'm going to give a couple of examples of roles that you should probably consider. One is someone in finance or an analyst. And it would help if they had a platform that they could pull data from, like detergent or something, that they could pull and they could connect the monetary value, the time to make that argument. They're vitally important. You need data. You need numbers. Another thing that would be really helpful is to have someone in comms or marketing. Executives are bombarded with information all day long. They're not working in the business. They're looking at the global view and the whole and looking on the business. And they need someone who can speak at a very high, very simple, very powerful level. Marketing and comms people do this all day long. We position things all day long. They're very helpful. There is a study of informal networks done by a professor at Harvard and the University of London. Her name is Karen Stevenson. In fact, Malcolm Goudwell in his book The Tipping Point also referenced this. There are three roles that you will find in the informal network within your organization. The first role is a connector or a hub. You all know these people. This is the person in your office who knows everyone. They're all connected. They know everybody around here and there. They know everyone. They are connected and they are talking to people. These are powerful people for you to talk to and help them understand the value of open source. Because they will share your message and they will share your passion. Another person, another role that you will find is called a pulse taker. Pulse taker I know. They also have connections, not as many as the connector. Here is the difference. The pulse taker is someone who is very strategic about their connections. They are the person who can see basically around corners. You probably can think of a person right now who fits that bill. Someone who knows things two or three steps ahead. It's like they can see around corners. Bob over in accounting likes things this way and Sally over here likes things that way. This person is really important because they can help you convey a message powerfully and navigate the things that are unseen. Because they understand how the organization operates and they understand the people in it. Because every organization has its own system and vibe. The last one is the gatekeeper. I know we all love this person. We all love this person, the one who controls the information and keeps you away. Keep you away, right? They are the hardest person to break through and they can be your biggest advocate. You can be your biggest advocate once you help them understand the value of open source. Executives don't work alone. They have advisors. And if you can get the ear of the advisor, you can get your way through the door. These are a little bit harder to find, but it's something to consider. Let me move on to culture. Number two, three. Every organization has a culture. Every country has a culture. People have culture. It's an important factor and I don't have time to go through all the pieces of it, but I'm going to go through two dimensions. There are consensus-based organizations in which harmony is paramount. In an organization where consensus is important, you can upset the apple car. And you need all to be with you and understand it and work with you. Versus a top-down organization, you're looking for one. You have a decision-maker, you are looking for one. You can be way more strategic and have a few less people that you need to work through in order to get to that top decision-maker. Another dimension is, is your organization egalitarian or hierarchical, right? Is it flat? Can I go call the CEO today and be like, hey, listen, let's talk? Or is it something, I'll give you a personal example. A little while ago, I read it wrong. I read it wrong. There was a VP a little bit away from me, not too far away, but a little bit away from me. And I thought, I'll just set up a meeting. We have some synergies. Thirty minutes should be no big deal. Two hours later, two executive assistants, a director and a VP later, I was informed. That was not the right path. There is a path to get to that person, and I had bucked the system. That's a hierarchical organization. Lessons learned, right? I don't know how many of you have been there, but I see some nodding faces, so I feel like we're getting, there's some equal stories there. Lastly, how do you position the story? Best practices for a story contain three parts. The first is why? In this, you will find the characters or the people who are involved. You will find the description of the setting. And lastly, you will find the conflict. And let me just say, please, if you hear nothing else today, listen to this. The conflict in your story that will get them to listen to you is not yours. It isn't your conflict. They don't care. It is their conflict. It is their conflict. It is open source's ability to solve their problems. And do you remember that road map I told you about? That's their problems. Each and every one of those represents a problem that that organization has. Then you move on to what the solution is. Here's your big idea. Here's your big idea. And lastly, how it solves it. We're all really aware of what's happened in the last year. Some of the license changes that have happened. Open source is at a pivotal moment. Because on this stage, the organizations of the world are watching. They're watching and they're looking and they want to see what happens next. We're going to get away with this? What happens now? Regulations are coming fast and furious and they're not going away. It's going to change the way things are going to happen. The CRA is going to change the way things are going to happen. We're at a pivotal moment. And I think that in this moment, it is really important that everyone be equipped to make this argument. To take the case of open source. To campaign for the value of open source. To explain how powerful the collaboration and innovation that it drives is. To showcase the value of open source. To showcase the value of open source. To show what a community is about and how it makes a difference. I hope that these strategies that I shared with you today really help you. Because everyone can do this. Everyone can do this. Anybody can make this case. Don't make it alone. Solve their problem. We can do this. But now is the moment. Now is the moment. Thank you. Thank you. Are there any questions? Thank you. Thank you for the chat. Really interesting. I'm sorry. I'm going to be the devil's advocate. I'm not satisfied with this session. No, but I learned a lot. There are three points missing from my perspective. From my organization. Maybe you have a solution for it. One, I don't see in your talk any solution for the support part. For us to convince our executives. We need to convince them that somehow, somewhere, there is a way to give the proper support to that tool or open source in general. Our executives are still very hesitant to use open source or philosophy because of that. Number two. Can I take them one at a time? Because that would be helpful. That would be helpful for me just so I don't lose track. Because it seems like there's going to be a couple here. So if your organization doesn't understand the power of open source, it's because a case hasn't been made that speaks to them yet. Bottom line. And one way that I would recommend theoretically helping them understand the value of open source is to demonstrate it within the organization. There is a inner source model that's very powerful to explain and help them understand how impactful it can be from a business perspective. It might be one way to look at it. I mean, they believe that there is more support in the closed source that they can rely on. They believe that they have more support in the closed source that they can rely on, that they can have contracts and everything in the closed source more than the open source. So that's why they don't see it practical and sustainable to have the support of the community. But be careful. I'm with open source to the death. But it's just a matter of I'm trying to convince my executives. And this is a breaking point. This is a red flag that they always like to support. So just, yeah. Second thing is that most of these executives come from the user perspective and not the developer perspective, the open source perspective. And so far, the biggest market share is Windows and Mac for the operating system. And for Office Suits, it's Microsoft Office as well. So they use closed source. They use proprietary. They use WhatsApp instead of Signal and Telegram. They use all of these things and they're used to them and they don't have this mentality. So from user experience, unless we get away to get more adoption rate in our society where more people are using more open source, that will affect, in my humble opinion, the way that the next generation executives will say, ah, yeah, I totally understand the reason why you're using Signal and not WhatsApp, for example. And sorry, the last point is change management. Oh, yeah. I think that is the biggest thing that we can work on. Oh, it is. I'm a change management professional. And a lot of organizations make a mistake of saying, oh, we're going to get a new process in and they drop this new process in and it's going to solve our problems. And you know what? Nothing happens. Because unless you have adoption, you've got nothing. You may have spent $3 million to put this fabulous education piece in place, and if people don't use it, you just flush that money. There is not enough people support. I completely agree. I would invite and love to talk to you about inner source because I think that might be a really powerful thing for them to understand the value of open source if they're so wanting to stay behind the firewall. Hi. Hi. And just one thing as a company that provides support in open source work, it's your job to go and tell the executives that there are plenty of companies and tell them which companies they have to go to or you bring them examples of how good that can work. And we work exactly as a proprietary company. They just don't know it because proprietary company told them that we are all in a garage. I get a type of one more question. Hi. Thank you. So you mentioned different cultures in the organization. I was wondering if you have some helpful tips and remarks about a US versus European kind of company if you think this is important or not. Just perspective on that. Thank you. There are a lot of differences and saying Europe in general is a little problematic because it depends where you are in Europe really fundamentally. I can reference a couple of books that would help and there's some studies, cultural organization, if you want to meet me after I can give you those references because culture is a huge part of it. And as the gentleman said, and thank you for pointing that out because I failed to and I intended to. Change management is going to be a critical component and it supports the cultural aspect because it talks to the people. Oftentimes we forget that we're talking to people. It's very technical. It's very process. It's very system and behind each and every one of those things is a person who has to do it. And if you aren't connected with that person, you have nothing. You have a stack of really cool stuff and nothing. It's the people. So change management and culture are going to play a huge part in that.
AI DevRels - Risks of Neglecting Open Source Talent in AI Critical Infrastructure
Thank you so much for attending for my talk. So today I'm going to be talking about a topic that at the end of the talk I'm planning to do like an open forum to hear your thoughts because, well, this is part of the presentation and part of the discussion. So we are talking about trying to find other areas where a dev rel can be critical and valuable for an organization perspective and where are the risks of neglecting critical open source infrastructure. So before getting started, a little bit of myself. And currently the project manager at Chuta Group, that is a Linux foundation project formed by a group of community sharing open source from offices, best practices and tooling, basically to help with open source management, open source operations and sharing the value of OSPOS among organizations. But before that I was working at Viteria, that is a software development analytics firm. I spent there three and a half years and I can say that everything I know about open source, like my very background was thanks to them so I own them a lot. While I was in Viteria, I was also studying my masters in data science that I finished and I focused my thesis on a really interesting report based on dev rel. So I still have in my repo that is public in my GitLab account because I'm mixing between. That was a set of Python scripts to gather and analyze data to measure the value of dev rels when helping open source communities. And right now I'm also studying another masters because I like to keep myself busy in front of content and struggling with JavaScript. In my spare time when I have some, I try to contribute to other open source communities and foundations like in your source commons, open chains or chaos. So that's me. So before getting into the topic, I know this is quite boring, descriptions and definitions but I think it's important because when we are talking about developer relations, you can type on Google or in other search engines and they will give you thousands of definitions and send to open source and also even when you type AI, it's like, oh, I don't really understand what do we mean. So I'm not here to set up a definition. I'm here to at least say during this presentation, when we are referring to developer relations, that is the definition we will use. When we are referring to open source, that is the definition we will use. And when we are referring to generative AI, that is the definition we will use. So we will see developer relations as a discipline that focus on supporting developers and building relationships with developers and also connecting with the organization's goals. In open source, in this context, we will see it as a method and also as a culture to develop and distribute software. But we also saw that open source as a wide variety like open data, also open hardware and so on. And generative AI, because I know AI has been here for ages, but now people are calling AI everything. In here, we will put it in the technology capable on generating text, but also image or other data, two models that basically is a generative model and that is why it's called generative AI. So our objectives for this presentation is basically answering these three questions, the first one and most important I would say is how these three things are connected. Because sometimes you talk open source, you talk about gen AI, and then you are also thinking on the organization side about, okay, and how those are connected with the security and innovation goals that an organization might have. Also why so this organization considers this open source integration when investing in AI. And I think this is where DevRoss comes in, who can facilitate such connection. So to start with, I wanted to really serve like the typical and really base and basic process when gathering data and when training models and in sum up like the AI lifecycle and try to ask ourselves where is open source here. So when, because AI is just like a small part of the whole process when you also need to clean the data which takes you sometimes like 80% of the whole time and then when you clean the data you perform an experimentally analysis. Like that is a process that is not just an AI, it has been on machine learning, it has been also in the science studies. There's not something new that we don't know, but I think an important question when looking at around all this life cycle is to ask where is open source. The sort of question, the answer would be, well, it's literally everywhere in any step of the process. And let me tell you why. So around the technology stack when an organization decides to invest in artificial intelligence, generative artificial intelligence in this case, there are different components that they need to take care of or they need to start thinking of. And in the technology stack you see tools for data collections or which generative AI model to use because there are different ones. What is the deep learning framework you will use? Are they going to use TensorFlow, Python, Keras? Also if you are thinking about putting into a container, what is the best tool or the best framework to use as well? When you are trying to represent that data, are you going to represent it in graphs? Then maybe you use D3JS or are you going to represent it into another different format and use Matplotlib? I keep saying names and childrens and many, many of those childrens are open source. I think that is when talking with organizations and not in the tech sector and more in the business side, the decision makers. They are not aware that when they say let's invest in AI or let's use this AI for creating software or for our internal development, they are not aware that the technology stack, the baseline that muters all that AI, it is powered by open source. So now, well, you might see this famous image, I just edited it a bit. So instead of thinking like modern infrastructure, think about the AI technology stack, how it is like chat GPT and all these popular AI toolings, even though they are proprietary. I mean, chat GPT is powered by PyTorch, that is open source. But anyway, think about that and then think about these open source maintainers being burned out because open source is just different from a proprietary software and organizations selling proprietary software. And I think that is the message that sometimes organizations might not want to understand. So organizations are far more innovative and secure in collaboration than insulation, but I think everyone here knows that. I think organizations have in their minds like yes, the benefits of open source and so on, but I think the tricky question here is how. And I know in the other time we were talking about OSPOS and as a vehicle to make that happen. But even OSPOS sometimes they are struggling to prove their value and to do the right work because organizations don't understand how they just heard the word OSPOS and they just go there and OSPOS cannot perform their work, for instance. So how can organizations advance in their AI maturity? So everything starts with people. Let's start with the baseline. And the people, it's a cross functional skill set of different areas. So of course we need to have project managers, but we also need security and license compliance and people managing the project health of the projects of the critical infrastructure, also all the infrastructure on IT development. In this talk, I know because this is really, really broad, I really want to focus on these three areas that can be like more attached to what our developer relations or community management role is. So when coming back to this integration of open source and how it connects with Gen AI and how it connects with the organizational schools in terms of security and innovation, I believe, and again this will be later an open forum, that there was with this experience in open source communities might be hidden gems for organizations that are right now investing thousands and thousands of dollars or euros in using or creating Gen AI power tools. And why? Okay, let's see the scope again. So we already have seen that machine learning, I mean not only AI, there's a lot of information that depends on open source projects. And these projects are maintained by developers that require support, they really are seeking for help because they cannot like, issues keep coming, keep going on and sometimes they don't have enough hands. And at that role that is this connector, this person that has the knowledge of an organization and has and understand the needs of those developers, of those maintainers because maybe in the past they were developers, they were open source maintainers too, can have to build this relationship and can transmit to the community the value and also to the organization a business value. I think that is this person, this linchpin that can connect and help both worlds. It's a win-win situation in my honest opinion. So, two organizations that maybe I know like someone were asking, yeah it's hard for an organization to prove the value of contributing to open source or using open source. So I think it's smart to invest not just on the final product but to what is behind. But it's smart to know how. And I think we are now coming into a point where we know the benefits, we know like we can think, analyze whether or not we should use or not use open source. But when we are using open source, we need to know how or if not, the organization is not going to see value and the communities are going to be at risk. So, this comes from one of the blog posts from the open source.com blog. And I think it's really interesting to see like how the old software supply chain funnel was and now how with open source you are putting a new layer and organization needs to really understand that layer. Because since it's a supply chain funnel, if someone fails at the very beginning, it will impact to their product, it will impact to their services and to their customers and to money. If it's a company and if not, it will impact to the experience. For instance, if we are talking with a government or a public administration. So, how we were talking about people, but there are other ways also to make that happen. Because sometimes one person is not enough. As I mentioned earlier, it's about a cross functional team or a skill set of people. We were mentioning that there was a really critical role. But there are people with experience in license compliance. There are people with experience in security compliance and infrastructure. So, if an organization is creating this cross functional team with different experts who sets the world for us in the organization. Because everyone will have different micro goals, different objectives, how can we put order into all these scales. This is a way, there is a way, but there are other many ways on how that to bring this connection and also connect with the different teams in the organization. Because here when we are talking about open source, we are talking about open source as a way to integrate it into the existing organizational teams. So we are thinking about how open source can help engineering teams to do their things, the things they have been doing, but with open source. And we are also helping the security team to do their things they have been doing also with open source. It's about integrating open source in what already exists. So there is one of the vehicles that has been proven to be effective in many organizations, that is OSPOS. And OSPOS sometimes they have devils and project manager, they have legal experts, they have security experts. So it's in a nutshell, this dedicated cross functional team. They don't need to be physical team, they can be virtual. So they can already have like in having like this advisors or experts into the different teams that the organization have. And these are some of the success stories based on the state of the OSPOS 2023 report. This is a study conducted by the organization I am currently working for, that is the children group has been doing this since 2018. And you have all the raw data in case you want to know like the devolution since in the children group's last OSPOS survey GitHub report. And in here the last year we saw that 93% of the OSPOS, the participants that responded they were part of the OSPOS, they had an OSPON, the organization were providing advice to security teams. And they also, we also saw that these OSPOS were really engaging in current age technology. Like containers or AI or data science. And also we saw in terms of how effective are those organizations providing upstream contributions like contributing back to those projects. We saw also a really big impact for those organizations who had OSPOS. So I'm not saying that the solution is OSPOS, but it's true that we are saying that... No, no, I mean I feel like... I'm not saying that, but just look at the data, no. All right. So you have the study there in case you want to be a bit more into that. So this was like a small break, but let's come back to the red-bird role. That can be in an OSPOS or not. I wanted to open the now the room for those people to think about what to the red-bird role description of the characteristics we have seen, like with an open source background, with an organization that has this deep AI involvement or they are willing to invest more in AI could look like. These are some examples I added, like for instance that the red-bird might have to have maybe an understanding on how the genitive AI models works much similar and how the different much similar frameworks works. Maybe they have been working already with some of those frameworks. They should have experience in contributing or maintaining open source projects in their past or currently. And disability, so this is more like their typical skill set, right? So advocates, developers need and be able to communicate that feedback to the organization. And disability could collaborate with organizations, with foundations, with independent maintainers from open source projects to co-create value because open source is a community of communities. So, yeah, this is to sum up with, I would like to open this open forum with this question, like how can we address the value of the role in AI field and teams to decision makers that basically are the ones that might be putting the money and investment. And yeah, thank you. Thank you. Anybody have any questions? So, have you thought about how this might change in the next couple of months or a year from now when we move towards, rather, an actual open source definition of AI? Because what you've described are, is AI, and brilliantly, AI that's using a lot of open source tools. But the open source initiative is trying to figure out what the open AI definition is. Do you feel like that's going to change thesis a little bit or it's not really, that the problem will be the same? I think that can be like a different scope. I'm also aware in the working group that the open source initiative is doing to come up with an open source definition of AI. I know it's hard because open source, for those who are unaware, open source is like software and in that sense it's like, okay, this is a way to distribute it, but AI has different components. What about the training models? What about the outcomes? What about the model itself? So it has different components and they need to define that. So it's hard. I will say, coming back to your question, I think that that is another angle. Like the organization can decide whether or not to use open source in their AI. And if they use open source, they might still need help from their brothers and people to take care of that critical infrastructure. But if they are not using open source AI, if they are still using TADGPT or other framework that is proprietary for building their machine learning models, they still need to do that. But maybe the planning or the objectives are going to be different from an organization that in the last phase they are still using open source versus an organization that in the last phase they are using proprietary software. Thank you for the presentation. And I just want to say that I think it's wonderful actually because in every discussion around open source that I've heard around AI today, it has been this kind of open source AI versus closed source AI. So I think the opportunity to remind people that so much of closed source AI is still dependent on open source technology is a huge... Like it's kind of like, which way you go, these are the things that need to be addressed and you can only do that by getting engaged in the open source ecosystem. It's a huge, powerful message actually. So thank you for highlighting it. And I guess the question I would have is how would we do that more? Because the conversation around open source in AI seems to be all around this, how do we define open source AI? But actually this point is probably more powerful and potentially more beneficial for the open source ecosystem because it's definitely true and there's definitely things you can do today to support that whole ecosystem. So how can we as a community kind of highlight that I guess? It gets lost in that other question. So what has been helping me is to focus on the supply chain funnel. Because like organizations are focused on just the tip of the iceberg and when they give a closer, like a deeper view on the whole picture, they say, oh, okay. So this is like a domino, is it called? Yeah, like a domino thing. Like if I screw up at the very beginning and everything is going to fall down. So that's what has worked, not saying, maybe others will say it's not so. Thank you. Great talk. Your question at the end posed around DevRel. Very compelling argument and I don't think you'd get much disagreement. But we seem to be losing the battle and losing DevRel's as pay positions. It seems to be the first thing that gets cut. Do you have any thoughts on that? Yeah, so you mean that right now they are going through a lot of the layoffs. And so, yeah, so actually, and I don't know if the message was quite understood of this talk, was to try to explore like other areas where the DevRel can provide value. Like there's going layoffs in DevRel, but on the other hand, organizations are investing thousands on AI. Like they are obsessed and it's like, are you obsessed with that? Okay. But in the other hand, you are losing and you are, yeah, you're losing the talent that will secure and will help you to your AI planning and strategy. So that was like the main goal of this to try to explore this role of DevRel into this emerging trends where the investment is going through. So how do we help DevRel's reposition themselves into that AI space? I think like, like giving like certain kinds of talks to like, related with AI and where the DevRel's can be helping. Like advocacy is always helping. And also maybe start exploring these job descriptions. I feel like, like for instance, in the Tudor group, we have these Ospo descriptions where we serve it in public. And then organizations use it to hire talent. Maybe like having this kind of like similar framework for a DevRel, like, okay, where can we include job descriptions? And we can include this AI DevRel and start advocating from that angle might help. Just a comment on that. I'm sorry, I came in late. I love that it's women also caring about developer relations. I work on an open source project that's building AI infrastructure. And we are at such a low level of the tech stack right now. We hope to be able to build developer tooling and provide a product that will give people access to that infrastructure. But we are just at such early stages. And I wonder if that's not also a problem for other projects right now is we're just not at a level of the stack that there's really good developer tooling to market. Because I think that's when DevRel becomes really powerful when you have a product people can use. So I would encourage DevRelers to become more active as product builders and help your engineering and more technical teams actually productize what they're doing right now. So the question. It's not a question. Oh, okay, okay. I was like, I think it was like a sense. Okay. Anybody else have any questions for Anna?
Open Source in 2024: boundaries, burnout, business
I'd like to start the talk by figuring out who some of you are, because I don't always know what the makeup of the people in the room are. So we're just going to go one by one and everyone's going to introduce themselves. So can you stick your hands in the air? Do you identify primarily as a dev rel person? Okay. An engineer? And another assorted business person? Oh wow, lots of business in the house. That's good. And anyone else want to just randomly shout out a thing that you consider yourself that I've not put in one of the buckets already? Infrastructure. Great. Thank you for infrastructuring us all. Yeah so I'm trying to make this talk like a wee bit of a kind of whistle stop tour of just some stuff that I feel like I've found useful over the years with my forays into open source land and I'm going to try and do a little bit of hand raising on occasion to see what you're thinking and if you ever just want to stick up your hand just for no apparent reason then do that as well. I've also got, well done. Someone's listening. Great. One person at least. So at the bottom of the slides as well I have like a little link shortener situation going on. But I try to like link to as much stuff as possible. If anything I say is interesting and you actually want to read more about that but if you don't please don't. Save this bandwidth for everyone. So, right me. I'm Mike McQuade. I still kind of identify as an engineer technically I guess I'm like a co-founder of a business situation right now. I'll tell you more about that later. I'm based in Edinburgh in Scotland. My open source main kind of thing for the last while has been home group. I've been maintaining that. Thank you. I've been maintaining that since 2009 and I'm now the like project leader which means I stand for election every year. No one else has ever run. Please someone in this room run. It's certainly free. So we've worked with maintenance and contributors all over the world over this kind of 15 year period mostly over text sometimes occasional video calls and occasionally in person. We've started meeting at Fosdown in the last few years as well as a group which is quite nice. I was also principal engineer at GitHub for 10 years. Well, I wasn't principal engineer for all that time but hey resume padding is okay. I worked for my home in Edinburgh mostly working for teams on the west coast of the US which taught me lots about that you don't tend to learn in the UK such as talking about your feelings and avoiding conflict. I'm now the CTO of a startup called Workbrew which as you might guess from the slightly convoluted name is somewhat adjacent to Homebrew. We're trying to do stuff around like big companies who want things from Homebrew that Homebrew volunteers don't want to do. We're going to try and commercialize some stuff around that. I'll talk more about that a little bit later. With two ex-git hub people who are on the east coast of the US. So my talk is going to be based on three B's. The first B I'm going to talk about is boundaries. So is anyone able to spot what this lovely bit of text is from? Anyone want to raise their hand and shout out if you're that much of a licensed geek? No shame here. So good. This is the... Yes, MIT license. Congratulations, you win nothing. I'm not a lawyer but my summary of what this part of the license at least says is and don't worry about reading all my license summary. Essentially the software you get is the software you get. If it's buggy, that's your problem. Tough luck. Maintainers, don't promise you that the software has ever, does or will ever work for any user or use case, even the ones that we say that they do. Maintainers are never responsible for any problems that anyone experiences even if they cause them deliberately by themselves because they think it will be funny. And if you disagree with any of this, you're not allowed to use the software. So I wrote a little post about this a while ago called Open Source Maintainers or You Nothing, which a lot of people who are maintainers or Open Source adjacent really like the title and everyone else hated the title. And you can see some good discussion of that and I'm not actually even joking here. If you Google for Mike McQuade asshole, there's a variety of people who think that this is not the correct way to approach Open Source. But to me, like this is, I mentioned boundaries before, like there's a lot of talk about kind of burnout, I'll kind of mention that later. But like a lot of people ask me like how have you been able to do this for 15 years without going any more mad than I already am? And for me, it's boundaries, right? Like I don't do things I don't want to do. And I think it's a really important thing for anyone in Open Source to internalize that unless you're getting paid to do it and then even then that doesn't necessarily always apply. If someone is rude, if someone is mean, you don't have to do what they say. I will close issues that are legitimate problems because the person who has opened them is just being nasty. And people don't like this, but that's how me and other people in Humberoo have actually stayed involved for a decent amount of time. Does anyone know who this lovely lady is? Brené Brown? Yes, some enthusiasm in the room. Good, good. If anyone who hasn't discovered her, she's social worker, researcher, author, podcaster, talks a lot about like just generally how to be a better human. And one of her courses she does, she talks about this kind of braving acronym for like the seven elements of trust. So the ones that jumped out to me were when I kind of, we're listening to this and kind of thinking about how this might apply to my job and Open Source and stuff like that, or like boundaries, reliability and integrity. Because if you want to trust people, you cannot be someone who is untrustworthy. And if you want to be trusted, you need to be trusted by others as well. So for me, the way I kind of think about this, because I'm primarily an engineer, is by turning all human problems into computer problems. So I think about like instead of how I might deal with computers and how I might deal with humans being separate, like let's try and conflate the two a little bit. So bad APIs are APIs with inconsistent and unpredictable behavior. So if you go and repeatedly ask for the same stuff and you get radically different results back, that generally sucks. If you are documented that you do one thing and you do something completely different, that generally sucks. And an API that every time you call it, it will try and do some overly complex query and then time out, generally sucks as well. And a lot of this applies in my experience to human relationships as well. So humans who are more consistent and predictable in their responses are more pleasant to deal with. If you are someone who at work, when your co-workers ask you to do something, if it's your big, big, big boss, you say it will be done in a day and if it's the person in your team you don't really like, you say it will take a month, that's not a great API. If you say to your co-workers or people on your open source project that on holiday I am not to be disturbed for the next week and then you are on Slack or email every five minutes answering questions because you feel like you're too important to actually go on holiday and let other people do their job, that's not a great API. So to me, your boundaries are what your API are and that helps you figure out how you should be treating others and how others should be treating you and that consistency makes it easier for people to understand you and generally probably makes it easier for your life as well. A nice experience I had in the past at GitHub is I had a manager who rather worryingly for me at least start my performance review with, Mike is very strict with his boundaries and I thought oh no, I'm going to get criticized for the fact that like I insist on having dinner with my kids like most nights of the week. But he went on to say that this made it easier to work with me because he knew when he could ask me to do things and when he couldn't and it modeled that behavior for new parents in the team that it's okay to do these things. So I would encourage you as well, those of you who might feel that it's overly indulgent to exercise your boundaries around these things, that you're not just doing it for yourself but you're also doing it to help model those around you particularly if you're in a position of more experience or authority or power or whatever, you can enable it for other people to do those things. So I think like I don't want to get too deep on this but I think one of the key things with setting boundaries is how comfortable you are about saying no to things. Like I talk about front loading disappointment quite often which is a lot of the times if you're someone who is very willing and able to say no to things sooner or later, you're making that person maybe disappointed straight away instead of being disappointed in three months when you're not able to deliver or do whatever they thought you were going to do. So if you want to read more about that, there's a little link down there. And more specifically to open source, there was a big movement a few years ago about trying to make things really, really easy for first time contributors to projects and I think that's really valuable from the perspective of like documentation and teaching resources and smoothing like onboarding floors and stuff like that. What I don't think scales very well is like hand holding first time contributors through how they get involved with a project because if someone doesn't know how to use Git for example and you're having to teach them every command and your project has like tens of thousands of contributors, you're probably not going to be able to do that for tens of thousands of people. But when people kind of express interest and they kind of come back again and again, then that's often like a nice time to maybe get a little bit more involved and help them out. So the next B is burnout. So as I mentioned before, I'm someone who's been lucky enough to not really have ever felt burnout with open source. I've definitely had periods of temporary large amounts of irritation. I'm sure many people who work with me have as well, but I've managed to sort of stay involved and for me what that's looked like is over the years kind of prioritizing my own kind of mental and physical health and like as I mentioned before boundaries and stuff like that. Something I found very helpful personally is seeing a therapist like I started seeing one during COVID and it has really helped me to figure out kind of how to manage this stuff. We probably have a lot more sessions than my therapist expected talking about like a particular pull request on Homebrew and someone who is trying really hard to be nice and helpful but are not doing that and actually being deeply unpleasant to a lot of people and how I handle this and all these types of things. So if it's something that you've considered or not tried, I would encourage you to. I've written a little kind of step-by-step guide if you're like how do I even go about finding one that might be useful to you. Another thing I found really helpful with kind of avoiding preventing burnout in myself and others is kind of having like decent relationships like inside, outside work. But in work or in open source land, I try and have something that looks a little bit like this. I guess I call it like the mentorship diamond. So it's this idea that I used to be religious, I kind of stole it from religion originally, like I don't know what it was, something like, I can't remember any of the words anymore, but anyway, I stole this idea, it's not mine. But this idea that essentially for anyone, it's a good thing to have people above you, people beside you and people kind of below you who you can speak to. So if you're in an employment situation, like your mentor might look like someone who has the job that you would like to have in five or ten years and your peers would be someone who has like maybe a similar job or someone in your team or whatever and your mentee could be anyone. The nice thing about this is I do say like unless you're literally the most experienced person at literally everything in our entire industry, then you can always find a mentor and similarly like unless you were literally, unless you were born during this talk, you can probably find a mentee who you have more experience than them and you can help them with some stuff. And I just, I also think the other thing, those of us who have kind of jobs in more kind of formal corporate environments, often like an org chart looks quite a lot like it's structured like this. So you might think, oh, this comes from me automatically, but I think this is something that's really worth like putting a little bit of effort into to actually like find these people yourself and not essentially rely on like, wow, my manager can be my mentor and my mentees can be the people who report to me or whatever because it just makes it a little bit more fluid and you can help find things a little bit more easily that way. Another thing I guess I've thought about with kind of avoiding burnout with open sources, trying to find people who will replace me. Who's, I don't know if any people have done any sales stuff. Have people are familiar with sales funnels? Yeah, like a few people. So I guess the idea is basically that you generally get lots of people at the top of the sales funnel, like potential leads who have no interest in you or what you're doing, but you haven't figured that out yet. So you send them all about a million emails and then some tiny proportion of them reply with something other than go away, leave me alone. And there was Michael Crossbacks because they've shown like some sort of interest and then some time proportion of them may actually end up paying you money one day and there's people who are sales. So in open source, I think we have somewhat a similar thing that I call like the contributor funnel, which is like generally most projects will have lots more users than they have contributors and lots more contributors than they have maintainers. Last time I count the numbers on this for Humbrew, the numbers are roughly like we have roughly 30 million Humber users, we reckon from my analytics data, got like 12,000 contributors total in the last 15 years. And we've got about 30 current Humber maintainers and about 50 lifetime, which is apparently 0.0015%. So when you're being yourself up, your project that kind of is used by a few hundred or a thousand people doesn't have like most of those people wanting to help you contribute or maintain or whatever, like cut yourself a little bit more slack. But at the same time, do bear in mind that like, you know, if you could, the more you can grow this funnel and the more easy you can make it for a user to contribute or contribute to kind of come and join you as a maintainer, that may well end up taking some load off you as well. Right. So the last thing I'm going to talk about is business. And don't worry, as despite the emoji, like I'm not a very suit and tie businessy, businessy person, but I guess like it was a third B and money in open source feels like a kind of relevant thing for us to be kind of talking about nowadays. So homebrews kind of had an interesting like money journey, basically. So when I joined, it was very much like Humber didn't have any money. It didn't need any money. It was just a random project on GitHub. People submitted PRs. This is in the days before you've equal to see I when we start with Humber, it didn't have any C.I. at all. It was all just verified on someone's machine. And the first thing we kind of did when we sort of realized like, Hey, we maybe need some money here. We figured we kind of could benefit from having some max, which would like automatically kind of run some C.I. I'm poor request from GitHub. So because it was 2013, the way you went and asked for money on the internet was a Kickstarter raised the 14,859 pounds. And that was then used to kind of buy some hardware, which was like physically taken on a train by me and installed in a data center on by good stuff. 2016, we joined the software free return C, which provides like fiscal hosting for us. Does anyone anyone familiar with what fiscal hosting means? Yeah. So essentially it's like providing a bank account and like legal services and someone who can let whole trademarks or anything like that for European source projects. So that made us, well, it made us a part of a 501 C three, like a US nonprofit essentially. So we could go and like receive tax deductible donations in the US and all that stuff and give us a bank account. So then after that, we kind of had somewhere to put our money. We had somewhere to receive new donations and somewhere like a process for like paying out stuff like that. 2017, we start using a Patreon to kind of try and get some monthly donations and stuck that in our read me. 2018, we started asking for donations on Twitter. But 2019 was the big kind of exciting time when we made humbrew itself. Like when you first installed it, or if you had had it already installed, you would see a one time nag message that says essentially homebrew is run entirely by volunteers, mostly in the spare time. Please consider donating and then 20,000, one meter collective. See if you can spot when we added the message to our. Yeah. So almost immediately, our kind of incoming money went up by a lot. So no one has really complained about the message. You get some people get kind of a little suspicious or paranoid about asking for money as an open source project. But I would strongly encourage you if you're an open source project and you need the money, then do consider like having a one time nag or whatever. And I also think I've found that people are a lot more responsive to that often than kind of like advertising and things like that. Another thing to think about with open source is like, again, I've heard a lot of people talking about open source economics in the last few years. So I kind of wrote a blog post about this and sort of spoke to my father in law, who's a professor of economics, who just because he's a very clever manager says it depends a lot about like, okay, help me understand like, what is economics? How does it define it? Blah, blah, blah. So like he sort of said, well, you know, we're mainly capitalist economies here. So it's the allocation of capital and how that throws through businesses and stuff like that. And then I kind of thought, well, whenever I see people talking about open source economics, they're all talking about money and like how open source projects get money and spend money and stuff like that. So is that what it's about? Well, like if you make everything about money and I guess, you know, no shade on our lovely American friends in the room, but I often find when I talk to kind of American folks about open source and European folks about open source, often I find like, sometimes on the American side, there can be more of a focus on like, we need to get money, money will solve all the problems. The lack of money is all the problems more than I do over here. But like the idea that kind of money fix all the problems, it's like, Humbrew is kind of an interesting phase right now because we have quite a lot of money in the bank. You can go in our open collective and see that we have, you know, six figures of funding, which is like both far too much money to spend on stickers and also unless you get really good stickers, like if anyone knows the best stickers, then let me know. But also like not enough to like sustainably actually pay like people to work full time on Humbrew. So in our case, we have quite a lot of money and that doesn't fix all our problems. And that's when I guess like told to my father-in-law kind of helped where it's like, well, actually economic problems are generally like, we generally consider them to be like allocation of limited resources, which is normally money, particularly in like capitalist economies, et cetera. But arguably the open source economic problem is allocation of like limited maintainers or like people, right? If you want to have stuff done, there are relatively small number of people who can do those things and you don't necessarily have enough resources to get those people to do those things. And if you have more money, does that somehow automatically add more of that? Well, I guess not really because for some people they have a full time job and the amount of money they would need to have to quit their full time job and then do the open source thing full time is a lot more than your project can have or guarantee or whatever. So it's tricky. But then for me, I think like, again, it feels like a weird thing to kind of frame in terms of economics, but I think the best economic thing you can do for your project to kind of maintain the amount of maintainer time you have and increase it is to make it actually an enjoyable place. If open source is a place where people want to go, if we come to events like this, we see each other, we make friends, we make friends on our project, then we want to work more with those people, we want to help those people out and we're having a good time. If open source is like a horrible drudgery place where people on the internet just shout at you all day, which it also could be, then you're probably going to want to spend less of your time doing that. And if you're like me, then you will probably be shocked when on a Saturday night, you're spending like your time smashing out some code from the European source project, look completely miserable and your wife's like, why do you do this again? Like, are you sure you want to do this? So yeah, so try and make it an enjoyable space for other people. Again, final note, like I've seen kind of concerning stuff around AI where, I guess, including some of the existing talk where there's a worry right now that we're moving to a world where everything is kind of open source underneath the hood, but then everything is very proprietary in the front and we maybe have open source data models or whatever, but you don't have any of the training data and is this going to mean sort of the death of open source? But to me, the only thing I've seen that looks close to failing in the last few years is open source is like a business model where there's a bunch of projects who have, I'm not going to name any names, but you know who they are, who had this sort of approach, either they fell into this trap or this was forced upon them by their maintainers or investors or whatever. But I would argue, and this is the model I'm trying to take and we'll see whether it works or not, is that like this is a better model for open source business, which is instead of trying to have like the open source side and the financial side in direct competition with each other, you have something where the business side and the open source side kind of contribute nicely to the same kind of ecosystem. I remember I had a job and I will name them because they were a lovely bunch of people, a company called K-Dub a few years ago. It's my first job kind of working primarily on open source stuff. I was very excited that I was like, great, I work on K-D, now I work at this company who has lots of K-D maintainers and I can get paid to work on K-D stuff. But I found quite quickly that the problems that often open source consultancy companies are paid to work on are actually in fact not the most interesting problems, but the most boring problems because people who will do them for fun don't want to solve those problems, so you have to pay people to do them. And to me, like this is the, I mean, it's not a very inspiring pitch to kind of come and work with me, I guess, at the company. But I do think there's a class of problems where you can't expect volunteers to solve them and you do want big companies to be involved with them. The big companies, sorry, expect this stuff to be solved, but the volunteers don't want to do it and they're, that's where, to me, it feels like a good fit. And very interesting, what I'm doing with this stuff, then we've got a little website at workgroup.com and a demo and all this type of stuff. So, in short, quick summary, boundaries, remember open source maintainers and volunteers or you're nothing. Remember, you can say no to things and you don't need to mentor first-time contributors for burnout, considering a therapist if you don't have one already, or change you one if you hate your current one, which is also a thing, it seems, consider the kind of mentorship diamond of mentoring above, below, side by side and that if you can get more users and make it easier to become contributors, make it easier to become maintainers, then that might make your life easier. And I talked a bit about making hungry financial sustainable and how that's about often making maintainers happy and avoiding them burning out as much as it is about money and what I thought about open source business. So, feel free to answer any questions now, but if you don't want to ask in front of the room, then here are my contact eats and stuff like that. And thank you very much for having me. Thank you for the talk. Over here. I really like your analogy with the API in terms of humans and machines and I'm curious if you, like good APIs are well publicly documented, I guess, and so I wonder if you have taken that analogy to that point of like having sort of a document, I used to have a colleague who would have like a Google doc that would be like how to work with me. Yeah, I actually do. I was polishing it like hopefully my company have our first employee in the kind of coming month. So I've been taking my old one from GitHub and I'm repolishing it for the current day. Yeah, for anyone who hasn't done this before, it's a really nice resource in companies and I guess it could work in open source as well. It's the way I saw it framed as like a human user guide where it's like things you might not know about me, things that I really like, things that when my co-workers do this, it makes me happy, things when my co-workers do this, it makes me sad. Like those types of things. Yeah, so I agree with you. Like I think, yeah, trying to publicly document that or at least internally document that, but there's nothing mind one that I wouldn't probably want to put on the public internet just yet. But yeah, I mean, you've inspired me. I might even make a public version of this. So we can. A follow up question to that. So if you write such a document, how do you find the boundary between like kind of being honest and setting, actually setting the boundaries and not coming across as rude as in setting those boundaries? Well, so the best way I found to avoid coming across as rude when I worked with Californians is who had not worked with Scottish people before. It's just to say, oh, oh, Scottish people are like this. But yeah, I mean, as pretty much all of my friends and family and co-workers of all time would tell you, I am not the person to ask on how to avoid coming across as rude. But I guess I would say like, as a middle ground, I guess, maybe not quite rudeness, but something I've been dealing with like at work with like a small start up with kind of, you know, impassioned people is you quite often end up in situations where someone maybe needs to hear something and you know it's going to hurt their feelings. And to me, like the two failure states for that are either you don't tell them what they need to hear because you're worried it's going to hurt their feelings or you say, well, it's going to hurt their feelings anyway. So yellow like, and yeah, I guess my thing is, as long as you're willing to kind of try your hardest to be as nice as you can and hurt their feelings as little as you can and be willing to admit you're wrong and apologize and these types of things, then you can get away with being rude, I guess. Yeah, don't have the us that one. Thank you. Thank you. Over here. Thank you so much for your talk. I'm Canadian, so I fall somewhere between American and European philosophies. I think you alluded to this, but the biggest economic problem for open source projects is actually the free rider problem where so many people can use software that they don't end up contributing to or paying for. And I'm wondering if you have any examples of open source projects that have achieved kind of a hybrid model of being able to keep things open source and free for people who can't pay, but for those who can. And we know there are so many large companies that do have incredible ability to pay for stuff that don't. How do we incentivize them? Yeah, that's a great question. Thank you. So on the free rider problem, I have maybe a slightly contentious take, which is that I think if you completely eliminated the free rider problem, you kill open source because open sources, I mean, I guess we've seen this in the last few years with the kind of the varying licenses and I'm maybe a bit more of a purist in terms of, I think, if you say, well, Mike can use this because he doesn't have lots of money, but Amazon can't because they've got lots of money. You kind of kill what makes open source what it is. So for me, it's about kind of thinking about creative ways of solving that. So some of it, I think it's like you're going to have a certain amount of free riders and you're going to have. And also, I think what makes open source interesting is, you know, I include my own projects and work in this as well is that I think if we end up with a culture where everyone is expected to pay their way to kind of be involved with open source, I'm not saying that you're saying this, but I have seen some people who have, then we, why would a company almost like want to use open source so they can just build everything internally as they were doing 20, 25 years ago? So I don't see like an easy solution to that. I wish I did. But I guess the solution, as I said, I'm trying to do right now is the idea of you maybe just don't provide support for free for the things that are massively six important to big companies. Like, I don't know how many of you have the misfortune of reading hacker news comments on occasion, but every so often there's like a manifesto there of like, it's very important that we as an industry start doing this thing. And I really like those because 50% of them convinced me to do the exact opposite of what the person is trying to convince me to do. And there was this website called SSO.tax that essentially talked about how look at all these companies who are outrageously gouging people who want to use like single sign on and stuff like that. And I saw that and I thought that's a really good idea because that's a way to essentially be like, well, if you're a big organization who has the requirement for this for some ISO certification or like whatever it may be, then you say, well, we're going to charge you a lot more. And if you're like an individual user, we're going to charge you a lot less or nothing. And I wonder if that could work with some open source models as well that if you have features in your project that you know are only going to be used by the biggest of the biggest corporations, then you say, well, that stuff you have to pay us for or it's under a license we know you don't want to use, but you can pay us to have it under a more liberal license or I don't know. Lots of ideas, no solutions. Thank you. We're at time. Thanks folks. Thanks Mike.
Where are the limits of open communities?
We're at time. We're going to get started. Everybody please welcome Anesca Miro. Hey there. Thank you for having me. It's kind of a big thing for me to stand here because last year was my first foster and I really enjoyed it as a volunteer here and as an attendee. So standing here and being able to share some of my work with you, it is kind of like a big achievement for me. Oh, thank you. And I'm active in various open source communities but the reason why I'm here today is being an community, operations and community builder in Chesco de Gitao which could be translated as Czechia Digital but I'm going to use the pronunciation Chesco de Gitao because that's how we, that's why I say it. And one of my role with the community brought me to this question, where are the limits of open communities or are there any limits and how can we face that? So first I need to give you some context of Chesco de Gitao to show you what we do and how we do it. Then I'm going to tell you what the word open means for us and what challenges it brought to us in the last year and years. And where are we now? Which is not the end of the journey as we'll see but it's like continuous process. Hopefully I'll have enough time at the end for the questions and not only for the questions but I would love to hear your stories because you know, all of you, as it's for them, all of you are somehow active in open source communities so I'm sure that we all have something to say to that. So first start, what Chesco de Gitao is. It is a small NGO based in Czechia consisting of like 12 people in the core team and we have like clear mission which is to help other NGOs and public administration in Czechia to use digital technologies to its full potential. Basically it means that we guide other NGOs and organizations from public administration through the process of digital transformation as vague as it sounds it also is because some of these organizations are on the baseline just needing to learn basic digital skills, some of them just need to help to create some digital products so we do all of that but I mentioned 12 of us so we can do it alone. We have kind of like big community, really diverse consisting of expert volunteers who are from the field of marketing, product management, project management, software development, UX designers, all of that you need to succeed in the digital world. We also have people from NGOs, people from public administration and people from various businesses who just like our vision and will support us. And we started in 2019 and since then since that we gathered more than 6,000 people in our slack which is our main environment and with all those people over those five years we finished more than 30 projects which impacted more than 4 million people in Czechia. Czechia has around 10 million people so I guess it's quite an achievement and it's all thanks to the community we are. So that's what we do and how we do it. We consider ourselves being value based organization and value based community. We have five core values which are not only written, we try to live them day by day in our day to day work and they are basically the baseline for most of our decisions. The first of them and the reason I'm here is openness and I'm going to also skip it for now because I'm going to talk about it later. But the next of them is professionalism. We're a really diverse community and when you have people from NGOs and business and public administration like totally different mindsets, totally different life experience and we have people from junior, middle, senior level and all of them having different life experience. The professionalism is a baseline for having good relationships in the community because we interact as professionals, we believe that all opinion matters and that we all can learn from each other and we strongly support collaboration, the next value. Especially in public administration people tend to build silos and work independently on other departments and we believe that through collaboration we can achieve more and have bigger impact. So that's what we try to teach in our community, to empower in our community and then people can use it in their day by day work. And we mostly volunteer based organization. So efficiency is really important value for us because people give us the most precious things they have, their free time and their know how and we don't want to waste any of that. Also our partners because we're funded from the private sector, we don't want to waste any of the resources given to us. So efficiency is really, really crucial. And then usefulness. This one is focused mostly on the solutions we built in our organization and our community. Because when you see digital technologies we may be perceived as a highly technical community which we are actually not because our target groups are NGOs and public administration which usually have lower level of technical knowledge or financial resources. So we have to be able to build solutions which are sustainable for them and which they can administer by themselves. So usually 10 to years, no code or low code platforms which are easily manageable for those organizations. So usefulness is a core value for building the products or building the solutions to be useful for the end user. It's really, really important. And then back to the openness. What does it mean in context of Chesco Digital? Where we have open code, it's a necessary condition for all the products we agree to work on and all the organization we work with to have open code, to have their repositories and get up under the least restrictive open source licenses. We also have open knowledge base. We have our own Wikipedia where we gather all knowledge we learned over those last five years. Also all the examples of where not to go and a lot of guides for, I don't know, how to create a successful webinar, how to cut videos, how to work with volunteers, how to build digital projects from scratch. I like to say that there is everything in our Wikipedia where you know what you're looking for. It's built on confluence so it tends to be quite messy and sometimes outdated and sorry for that, but it's kind of a rich source of information. We have transparent funding. We have transparent account when you can find all the financial transactions. We openly speak about our partners and sponsors, how much money they give us and how we spend them. We also speak openly about the money we as a core team get. You can basically find all aspects of our financial management from our transparent resources and learn how we are doing it that way. We strongly empower open communication. That's the reason why I'm here to share our story with you and that's visible mostly in our Slack. I bet that some of you know Slack, most of you know Slack and when you have Slack with an average 600 people active each month, it's kind of like beehive. It's really messy sometimes, but we really try to limit private conversations and private channels so everybody who joins can learn everything as fast as possible. We have as lowest barrier to join any project on the conversation you want. There's huge traffic in our Slack, but we have also other communication channels. Again, this is something we try to teach our target groups because we want to teach people that through collaboration, through open communication, open feedback and stuff like that, we can build meaningful relationships and achieve more together. This is one of our messages for the whole community we work with. What challenges this approach brings to us? The first one is actually hidden in the world open itself because it's just a word drive and we're a really diverse community. This word has so many different meanings for so many people. Even in our small core team, 12 people, we were all over the spectrum from being open no matter what to open by default. Yeah, have things open, but when there is a really good reason, really valid reason to have something not shared, let's not share it because it can do some harm. We needed to balance that not only in our team but in the whole community so they know what to expect. I'm going to briefly go through all the challenges and then get to how we handle that. The word open or the meaning of open. Then with the communication, I mentioned that our Slack is kind of like has a huge traffic. Slack is not really good for having big announcements and stuff like that because it's mostly for project communications, communications in teams. It's like fast pace, a lot of traffic, things get lost, not really intuitive to search in. But we used it for quite a long time as a single source of information because we don't want people to follow many different channels. We constantly received feedback that people are missing stuff, that we're not sharing things. Come on people, that wasn't Slack, it's actually one of our stickers. It wasn't Slack. It's why they're popular even in business. So we tried newsletter, one newsletter, but we do so many stuff in one time that even in the newsletter, it was so long that people stopped reading it. So when you want to share stuff, when we want to tell people about what we're doing, how can we contribute? But at the same time, to prioritize, when to say something to get the right information to the right people at the right time, how to keep the communication efficient, that was the next challenge because you want to share stuff. And then the one which was kind of like technical interesting was the information security. They just started in 2019 as a bunch of programmers who knew themselves very well. And they just started to work on some project for some NGO and then the word spread and it started growing. But several months it was just group of people who knew each other. So there was no need to apply any strict security measures. But then COVID came and Chesco Dicatal grew quite rapidly from like dozens of people to several thousands of people. And there were multiple projects in the same time with different needs for information, different needs of tool sets, different needs for data and stuff like that. And then over the journey we kind of lost track on who needs to access what and who is responsible for which tools and so. So yeah, we were still able to maintain security somehow, but when you don't have someone responsible, so the share responsibility in the word of data security doesn't work. So we needed to take a step back and think about how can we guarantee some security to our community and still keep this open approach. And the last one which I picked for you today was the toughest one for me as a community builder. Because we at the beginning when we announced our mission, we were so happy that people are joining Chesco Dicatal and they share the passion to help NGOs and share the urge to help public administration with digitization. And they share our journey and they will help us that we wanted to accommodate everyone and then we started creating opportunities based on the needs of the community. And we really lost track of who we are, what is our position on the map of systemic change in Czechia. And we faced the decision about yeah, we still want to be open community, but does it mean that we need to be open for everybody and to create something for everybody who comes? And this was the biggest challenge for me. So what we did about all that, first thing you see there is code of conduct. Maybe surprising that we didn't have one because it's like the good practice of open communities to have code of conduct. But remember we are a value based community. So we thought that having the values stated is enough. But then I also mentioned the different meanings of the word open. I mean five values, five words, so many meanings for those people. So we felt that we need to set up at least some of the values that we need to set up and some basic rules or to explain to our community what do we mean by those values. So we created code of conduct which basically is a translation of our values to like more words to specify more. What to expect when you join Czechoslovakia, what we do, how we do it, what we want from you, what you can expect from us and how can we secure the safe space for our community. How can we handle troubles and stuff like that. So six months ago actually we applied code of conduct. Kind of late but it exists. Then, yay, new communication channels. More to that, yeah. Sounds confusing but for us it meant that we want to use the communication channels for the best purpose they solve. So we keep Slack purely for the project communication. We took the important stuff out of that and for example the long-lasting topics or repeating questions and stuff like that we built an open discussion forum based on this course. So we have this one for that which is also more to the open communication because it's not hidden by the barriers of Slack. It's open for reading. You don't have to sign in. So this is also good for our target groups because most of the NGOs we learned over the way that they find Slack not very intuitive for them to use so they at the end maybe even didn't join. So that's new community forum. And we're also working on restructuring of our newsletters. So we have only the important stuff there and we're also building a new app somewhere between our website and the Slack where we'll have all the information for the community. So it's not getting lost in Slack. So it may be like, yeah, we have more channels but single purpose ones which may be easier to follow because you can pick the one which you want to follow. And then regarding the security. We did kind of like security check. We put together all the data we have, all the tools, all the databases, everything. We spent several months on going through the accesses and what do we really need? We can rid of not to store data we don't need. And for now it is that every person from our team is responsible for specific tools and specific data and we're doing regular checks on who can access what. It doesn't mean that we don't want to share. We just want to know who can access what and manage the accesses. And then about the open community itself, we spend more than a year defining our needs and finding our position on the map of how we can help best the public administration and NGOs in Czech Republic with digital transformation. And the biggest switch in my head was, yeah, I want to have a community who is thriving. But actually it doesn't mean that I need, that I have to take everybody's needs to account. I need to follow the vision of the organization and we need to be clear to say out loud, like that's what we do, that's our mission, that's how you can help us, that's where we need you. And you're welcome to join to help us achieve the goal. And if it means that there isn't room for you, we'll be happy when you follow, when you share what we do, if you like it. But we can't accommodate everybody's needs because we can't afford to lose focus. When we want to have a big impact, when we want to change something, we need to stay focused on our mission. So that's about it. So that's how we faced the challenges. As I mentioned at the beginning, it's not the end of the journey, it's still a process. We're still creating and trying. We're doing baby steps to achieve our goals and our mission. But hopefully I touched some points which may be inspiring for you. And I'd love to hear your stories or questions. I'm here. Talk to me. Any questions? Oh no, there's the saddest thing. Please share your story. Okay, I'm coming back at the microphone. Thank you. Worst thing that can happen to a speaker, not a different question. Thank you for the wholesome talk. I really relate to a lot of things you're talking about when your community grows and communication gets lost and different people consuming your information differently. Are you documenting it somewhere? If we can follow, because it just sounds like you have done a lot of thinking about what this single source of information looks like. Yeah, we do. But unfortunately, as we oriented on stuff in Czechia, we documented in Czech. But I'm thinking about, because I'm creating my own website, I'm thinking about translating some of that to some both posts and stuff like that because actually I did this talk earlier on our journey, so with different topics. Half a year ago on the Czech conference, Focus on Open Source. And it was just an experiment if this actually interests anybody. So I can see it does. So it actually inspires me to translate it in English. Thank you for the question. So the question at the back here. Can you just put your hand? Thank you very much. Do you know who gave my step? I have to work off the waffles. I have a question about the needs of the community. And if you use any tools or how do you stay close to the needs of the community and how do you then answer that? Yeah, I do regular well-being research, like each quarter, where there are close questions for some metrics, but also plenty of open. So usually I have 10% of community answering, which is not much, but I'm happy to have at least some, so let's say 60 to 70 responses. And we're trying to work on every of them. So that's the first one. And the second, we did kind of like intense research half a year ago with a design agency. And there were also not only written feedback, but also some deep, sorry, I forgot the word, but deep talking with the people. And so we're really working hard to get the needs right and do the mapping of our community. So we know what would be best to pursue. So yeah, we work on that hard. Hi, I would like to ask if you can give us more details on how do you slice the responsibilities? What was your approach? Yeah, that's actually the easy one. Thank you for that. I guess it's 12 members of the core team in just going to tell and each of us different responsibilities. So we divided the resources based on the group of responsibilities we have. So for example, when we have a database, which stores data of volunteers, I am the community builder. So I am responsible for that one. I have a colleague who's responsible for all the activities we do for NGOs. So she is responsible for the base where we store data for CRM and tools that's connected to that. So that was actually the easier job than it sounds because we divided the responsibilities based on the responsibilities we already have. Thank you. Hi, thank you very much for this very interesting two questions. One, what is the tool you use for access management and so on? And how do you verify that would be very interesting for us to also know what you're using? A second question. I'm interested in knowing how did you manage to scale up because that is one of the biggest questions that, for example, in my work, we're trying to figure out how on earth can we go from 20 to a thousand developers, right? Thank you. First, we have, as Slack is our main environment for now, we have two steps how to get in. The first is registering in our own base where you fill up the data about yourself, which then allows you to go to Slack. But now as we are creating the middle layer, we are creating our own. So we still actually, we have a workshop next week about it, how to find the right tool for where to build your profile on that. So we'll see about that one. And about scaling. Well, at first, when it was like several dozens of people, it was just self-organized, let's say. And then over time, the core team was built. So it's 12 of us right now. But we work closely with the community. So for example, in my team, I have two buddies and they are contacting on a grow basis. After five days, you join our Slack. One of them is actively contacting you asking if you have any troubles and so. So we work closely with volunteer base and we have those who want to be more active with our volunteers. So that's just how we scale. Thanks to our community. So I have a question that haunts me for years and related to open communities. And you have used the word a lot, but the question for me is who is we and how does one become part of that week? Yeah. Well, I identify as the part of Chiskorigita and how to be part of that. It's kind of like as I answered regarding the information about the resources, we're focused on the check environment. So it's mostly for the check speaking people, but through the website Chiskor, the digital, you can find all the information in there. I believe that no, here not. Sorry. So on the first slide, yeah, there are our social media accounts. So you can find information about that. And then it's really easy to join. It's easy to join. It's much tougher to right now to contribute because, you know, in the process, we are in the process of transition as we found ourselves finally. So we're recreating all the opportunities. So right now we're, yeah, I've figured it out. Thank you for the question. I have a question, maybe less about the open it, but more about the project that you run. Sorry, about? I have about the project that you run because one of your goals is, you know, to transform to like digital kind of way of thinking. Yes. And curious, how do you ensure and measure the long lasting impact of the work that you do? Because I guess the work that you do, you want that to be long lasting and kind of outlive you, right? Yeah, thank you. Great question. We have quite a tough process of choosing the project we start to cooperate on. And we have strict and less strict measures of the projects we accept, but the sustainability is one of them. So when you ask our community for help, we ask you what are your plans, kind of like midterm and long term with the project? It has to be a crucial part of your operations. I mean, for example, we cooperated with NGO focused on cancer prevention, and they needed a tool how to reach more people than by talks and conferences. And together we built a mobile app, and it was core tool for them to continue with their work. So there's an example of that. So it should be the crucial part for the functioning of the organization. And you don't have to have funding already, but you should have a plan how to get funding for it and to support it at least several years. So yeah, several steps during the process of exception of the cooperation to... You never can guarantee anything, but we try to do our best to choose the projects which aspire to be long term active. So because we don't want to waste the energy of the community invested to the project. Thank you for the question. So folks, we're actually at time. Okay. So the reason is the question is for the benefit of the public. Yeah, perfect. Thank you. Thank you very much for having me. Thank you. Thank you.
The State of Funding Free & Open Source Software
Thank you so much. It is a delight to be here. I spent years at FOSDA fighting the end of the community room because it was in a tiny room and you couldn't get in and I was like, well, if I have to speak to get in, I will, it looks like we have a nice big spot now. That is much better. Let's talk about the state of free and open source funding. I am Cara Soles. I work with GitHub. I work with open source maintainers. It is my role to advocate internally at the company and externally for the needs of maintainers. I also ran the inaugural cohort this past year of the GitHub accelerator, which I will speak some about and a bunch of other maintainers seeing activities. Which means because I am speaking with maintainers a fair amount, money just keeps coming up. So I am here to talk about money. I will talk about a number of different ways the money flows through the ecosystem, including FOSC contributor funds, grant funding, individual sponsorships. But I want to be extremely clear. We are just talking about cash today. As Mike McQuade said in this room earlier, the majority of what projects need is not money. You cannot solve your problems by throwing money at it no matter how much I might try. Most of the things that folks need are not cash. But we have only got 30 minutes, so we are going to talk just about cash. There is also big money dollar value in open source. Here you have the European Commission in 2021 saying companies in the EU invested around 1 billion euros in open source software in 2018, which brought a positive impact on the European economy of 65 to 95 billion euros. We predicted an increase of 10% in contributions would annually generate an additional 0.4 to 0.6% GDP. So there is money going on. But where is it going to maintainers? And by what mechanisms does it flow to maintainers? I do want to briefly say, why do we even care about funding software maintenance? So we have got 90% of companies are using open source. That is nice. That makes me feel good. FOSC constitutes 70 to 90% of any given piece of modern software solutions. That is sick. Love it. 89% of code bases contained open source, more than four years out of date. That makes me a little nervous. 84% of code bases contained at least one known open source vulnerability. That is actually even more uncomfortable. So there is a lot of reasons to fund open source maintenance. We can hear all day talking about them. It is the underpinning of everything that we are basically functioning with this society. But also let's not have it fall apart. So what do we know about FOSC maintainers and money? Most of what we know about this comes from tidelifts state of the open source maintainer report. Here they find that 60% of maintainers are unpaid hobbyists. And of that 60%, 77% would prefer to be paid. I know. It is pretty wild. 44% are solo maintainers. So if they won the lottery and moved to remote islands, the dream that we all share, there is no one to pick up the project and go forward. A lottery aside, 58% have at least considered quitting a project and 22% have actually quit maintaining a project. It is not bad. You shouldn't have to maintain a project the rest of your life. Building an open source project does not put you in an eternal prison. You can never free yourself from it. Things should end and you should be able to move on. But as we can see, if we have one maintainer and a lot of things depending on it, this is kind of an issue. Fortunately, one of the many paths forward they found is 56% said that earning more money for maintenance work would actually help keep them from quitting. So that is good. So where does this money pass through? Let's talk about funding platforms. So places that enable payments to false projects and maintainers. So they are kind of this middle space that helps the money move. There is a lot more platforms in the space than there used to be, which is fantastic. This led to a lot of specialization. So you have got platforms like Thanks.dev and Stack Aid, which are looking specifically at dependency funding. We have got the crowd funding model. We have got the classic ongoing crowd funding with Patreon, which a lot of software developers use. We have got tipping platforms. We have got ones focused on recurring income. And of course, we have direct tips. In this, I am going to mention two platforms that are focused very specifically around the open source space and that cover a bunch of these different techniques, which is GitHub sponsors and open source collective, which is a fiscal host under open collective. There is also a lot more models that people are looking at, which I don't have space for here. So there is a subscription model that Tidelift explores. There is service marketplace like Open Teams. We have got experiments in quadratic funding with Gitcoin. We could be here all day, which is good. We need this kind of diversity. So in terms of what kind of money is moving through, GitHub sponsors have seen $40 million come through since 2019. I know this looks like a copy-paste error, but I just imagine I actually think open source collective also has had exactly $40 million come since 2019 and saw $10 million in 2023 alone. So the growth of money coming through spaces like this shows that as we make it easier for folks to find and support projects that they want to support, we are able to funnel money through there, which is great. It lowers the barrier. One of the main sources of using platforms like this is FOSS contributor funds. So the way that, indeed, defined this when they kind of set out what a FOSS contributor fund was was a framework for selecting open source projects that a company supports financially, specifically designed to encourage open source participation and help companies take an active role in sustaining the projects they rely on. So colloquially, a FOSS fund is any instance of a company funding projects, usually they're dependent, these are like a semi-organized way, at least, without expecting anything concrete in return. So that's kind of flowing through here. Companies have been giving to open source for a long time and certainly much longer than 2019, but the rise of the structured program referred to as a FOSS contributor fund, you can really trace back to, indeed, and Duane O'Brien in 2019. And since then, a lot of companies have made public announcements saying not just here's what money we're going to give to open source directly, but here's how we're going to distribute that money and involve our own employees in it, which is really cool. GitHub sponsors found that the average value of an organization sponsorship versus an individual is 15 times, which is actually reflective of that companies do have a lot more money than I personally have. So that makes a lot of sense. In terms of how much money is actually coming through FOSS funds, though, like my back of an app can math is based on public announcements and based on what data we can see, maybe $12 million last year, which is life changing for a number of projects and life changing for a number of maintainers and like paltry, like honestly kind of a pathetic number when you look at the overall value of open source. In terms of scale, we see companies like Century and other companies give up to like $500,000, which is great. We see a few thousand across the map, which is good. Here's a few of the patterns that we see in this space. First of all, this isn't a sustainable model for the ecosystem. This is a supplementary model. And it's good, right? But we can't pretend that we're going to maintain this whole ecosystem in this philanthropic effort for companies to give back. It's supplementary. We also see that FOSS funds are usually not a stable source of income for maintainers. They're often giving like one time, it's employee voted, which is very positive, but means that you can't understand when you're going to get that income and you probably aren't necessarily going to get it again. We've also found that expecting companies to fund open source for the positive press is a failed path. The majority of companies that are giving money directly to open source projects do not talk about it openly. They don't get out there and to get their PR department like let's party. There's a number of reasons for this. I think one of them is what can seem like an incalculable sum internally to get someone to sign off on, to give to open source can look pathetic from the outside. We've all thought for money for stuff and been like, I did it. I got however much. But then people look at the total value of the company who are external and they're like, well, shouldn't the company be given more? And so I think that's one of the strains there in terms of companies not talking as publicly about it. FOSPONs are also particularly vulnerable to real or imaginary macroeconomic conditions. They took a hit definitely in 2023 and there's stuff that gets cut pretty early when companies come through and do a little snip, snip. So that also affects the stability. A lot of FOSPON stuff is associated with OSPOS, open source programs offices, which are fantastic. FOSPONs funding is not a focus however for OSPOS. OSPOS are using their limited resources to get employees engaged with open sourcing at the company to give back to open source, which is amazing. FOSPONs are a tool to get employees engaged, not one of the core kind of means to an end. And I will close this with the companies that have shared their funding structures and talked about what they're doing in this space has paid off. This may not be a huge number, but the reason that it's even where it is is because people went out and said this is how our company is doing it, this is how we're structuring it and I think you should do it too. So that is working. We've got a lot of space for improvement here. We've got a lot of improvements we can do in data in terms of identifying dependencies so we don't rely on charisma to choose projects, which is a big problem in this space. We can improve data around the benefits that funders receive, not individual benefits from projects, which are hard to juggle, but overall when we pump money into this ecosystem, what are we doing? Are we actually making it more secure? Are we making it more stable? How do we prove that and thus make this funding less vulnerable to cuts? I would also encourage you to check out FOSPONs and see if there's stuff that you want to do around this. In terms of individuals sending money to open source maintainers, if you give money to an open source project or a maintainer, nice job. Awesome. Thank you. It's actually having an impact. One of the really interesting things here is that across funding platforms, individuals are significantly outperforming FOSPONs in the amount of money that they are given to independent projects, which is very sad and pathetic from a company perspective, but it's like pretty badass from a you perspective. Nice work. It's interesting because in 2023 we learned that it's more resilient. When companies cut their budgets back, people didn't cut their budgets the same way, which is incredible because I don't know about you, but my groceries are 25% more expensive. Again, super nice work. It's not something we can structure the ecosystem across, but it shows how much people actually really care about the software that they use. We just have to have better mechanisms for that caring to be reflected by larger organizations. Let's talk about foundations. Where is the money? The money is going somewhere. The most common method of funding open source projects is joining or starting a foundation, which is an organization that can take care of fundraising, legal needs, corporate relationships, all this great stuff. In terms of how much money foundations are bringing in, in 2022 the top 32 software foundations, if you take out Wikimedia, brought in $304 million, which is pretty cool. Linux Foundation brought in 58% of that. It's got to be higher this year. The numbers aren't out for this year, but given the phenomenal increase in fundraising that the LF accomplished this past year, I bet that that percentage is even higher. There's a lot of different approaches across foundations, but I will focus specifically more so on kind of the Linux Foundation's approach, just because it's so dominant in the ecosystem. The way that foundations support projects is through a wide variety of resources, legal, financial services, project governance, how hosting, all these essential things. We ask open source maintainers to wear like 15 hats to be an accountant and a lawyer and a community manager and everything all at once. Foundations say, well, what if we took like 12 of those hats off and you only had to wear like three, which is great. That's something that enables projects to continue. The hat that foundations do not tend to be doing anything with here is that foundations are not usually paying money to project maintainers for the purpose of maintaining a project. That is not the way that foundations really function in general, with some notable exceptions, but small exceptions. Instead, they lead into this model of the professional maintainer. LF research in 2023 found among critical open source projects, the majority of maintainers and court contributors enjoy full-time employment. This is our 13% of maintainers that are professional maintainers we see in tidaliff survey, but they're not paid by foundations. These are people paid by companies, for-profit companies, to maintain projects that those companies are dependent on and rely on. As software foundations aren't paying for project maintenance and they are bringing in these other resources, part of that also is they have a very tight loop with companies because they have to be out there advocating for these companies to hire full-time maintainers. Foundations are helping maintain this whole ecosystem effect, which is really important in how a lot of these projects are surviving. They also make donating to open source really easy for companies. The administrative burden of donating to a ton of different projects is a rough one and different platforms are trying to address that, but foundations make it really simple. You give them a big lump sum and they tell you what level you're at and they give you a nice prize to show your executives and you're like, yeah, your finance team isn't mad at you. That's something I think that's really important. I think, here we go, this brings up a couple of questions. Is there a model where we see more maintainers paid by foundations, especially around maintainers who don't want to be employed by a large corporation? Is there an ability to build in more resiliency for when corporate cuts come in and we see maintainers of top projects cut at companies? Can we build in resiliency within foundations? Given that foundations have, for the most part, unlocked the code in terms of getting corporate giving, are there ways to distribute that benefit to the broader ecosystem that includes independent projects that aren't within a foundation? How do we take those learnings and bring them out more broadly? Let's talk a little bit about grants and government money. Grant is a one-time funding agreement. You say, I'm going to do something and then they give you money and then you do it. They come from two main sources. We've got philanthropic orgs and we've got them coming through governmental agencies. My favorite example of government investing in sustainable open sources is the sovereign tech fund. This is an initiative out of the German government. If you haven't read about the work they're doing, I plead with you, please read about it. It's fantastic. They've been talking about it a year at Fosden. They've been talking about it all week and also all year because they gave 15 million euros to open source projects and maintainers in the last one year. I think it's a really important model. Seeing governments step in to support critical open source software maintenance, I think it was really key. Here we see the European Commission in 2022 saying there's a clear need for a European funding mechanism to help sustain critical open source software communities. I'm an American so I'm going to say there's a clear need for America to also do that. Just put your country name in there, I guess. Or we could all work together, I don't know. Traditionally the space that we see grants in is the scientific and academic software funding space. If you talk to people in the scientific software space, they're like, oh yeah, grants. I eat grades for breakfast. If you talk to anybody else, like me, I'm like, what the hell is a grant? I don't know. Software gets that. There really is a limited understanding of grant funding opportunities outside that academic and scientific space. But agencies like NSF, NASA are looking to advance software with an academic research relationship. That doesn't usually mean maintenance. That means building something new. Similarly with these philanthropic orgs, like Moore, Sloan, CZI, they want to see cool when things happen. But grants traditionally have not been a way to fund ongoing software maintenance, which we're seeing this theme come over and over again. Let's do that theme again. Who else doesn't fund maintenance? Let's talk about venture capital. VC financing is when someone with a lot of money says, I think one day you too could make a lot of money and I'll bet on it. That comes in through when there's a related business built on top of a round of project. OSS Capital, who's best known for doing funding around open core kind of stuff in the space, said in 2022, the COSS category has grown dramatically over the last decade from $10 billion to $500 billion plus today. Even so, we are still in the early stages of costs, which we believe will grow into a $3 trillion category by 2030. The VCs are hungry and they see there is money. They are increasingly coming around knocking. For founders who want to work with VCs, I think one of the things that we really need here is for folks to make sure they're working with VCs who understand the peculiarities and strengths of open source. Then really importantly on the maintainer side, we need more resources to help people understand if VC funding, if individual investing, is actually right for them and their projects. Are they a hockey stick project? Do they want to be? Do they want to live that life? We need a lot more resources to prep people for when someone shows up at the door and says, do you want a lot of money? Sign this and also you have to do this now. Are we preparing people to make that decision? Speaking of which, we thought about this a lot this year. This was our first time running GitHub's Accelerator. It was a really exciting experiment and we are doing it again this coming year. I would look for applications to open up on that soon. What we did was we ran this Accelerator specifically for open source software maintainers. That for us meant there was no one size fits all approach. There's not one answer to this. We wanted to bring in a diversity of projects and say, how do we help experiment what sustainability is going to mean to you? What's financial sustainability? We want to look at a lot of different models and help you find the right one. For some of those, they did go on and get nice funding, go into iCombinator. For many of them, other things that meant starting businesses on top. If I had time to go through business models on top of open source here, then I would be going through it right now. But also, you would be here a lot longer in very uncomfortable chairs listening to the same person. That's why they only gave me a half an hour. Please forgive me. We need a lot better education on what those paths were, and that's something we're trying to start tackling. The way that we did this is we took 20 open source projects. We gave them $20,000 funding for projects. We did a 10-week program to say, let's bring in experts who have found ways to be sustainable with their open source projects to talk about how they did it and give that perspective and mentorship. This is the previous year we did a FOSS fund. We gave $500,000 to open source, 2022, 2023. We said, how do we take the same amount of money, but give more material support on top of it? We found a lot of the same barriers that people find. None of this is a surprise, but I think some of the key stuff is understanding on pros and cons of business models, the strains on solo-maintained projects. There's difficult community perception. In open source, you're expected to magically do everything for free, for the good of the world, but I still eat food. People judge you when you're trying to make money so you can keep doing the thing you love. There's really complicated stuff there. Maybe there's a therapy component next year. I don't know if I should pitch that. There's a lack for folks who are outside of foundation structure of legal guidance, accounting guidance and stuff that's specific to the ecosystem. We also need more opportunities for mentorship for people who have taken specific paths. That's something that's really lacking. There is one experimental method I would like to mention since we don't have a ton of time. Simon Willison, a data set co-founder of Django, was part of the accelerator. One of the things that he's really been trying out is this experiment to say, let's have companies pay maintainers to speak to your team. We know there's limited money for FOS funds, but there's money sitting around underutilized in company training budgets and consulting budgets. We've seen those pools and be like, what do you even spend that on? Spend it on maintainers. Bring them in to talk to the team. The internal team gets to speak to the experts behind the software. You get a time box commitment for the maintainer, so they're not having a contract where they have to keep your shit running forever. They're financially supporting projects. How do we find some more creative ways to get projects funded like this? I also think this ties into some of Filippo Volsorta's experiment with retainer agreements. But again, we can talk about that later. Let's put it together. This time I will actually pause if you want to take a picture of this slide. I know I just ran past it earlier. There was a single tear coming out of so many people's eyes that they like their phone captured something else. We've talked about how money flows to open source maintainers and the many paths that it takes. We've talked about that open source is still being funded for only a small portion of its value. Most companies are not going to magically step up to fill that hole. That's not how capitalism works. While companies rely on open source, little of that funding is trickling down. Of what funding does trickle down, it's mostly not trickling down for maintainers doing maintenance work as opposed to new features. Most maintainers are not able to financially support their work unless companies choose to hire them for it. In more limited cases, if they start their own businesses, but that's still kind of the majority of maintainers are being employed by a company. Current funding structures, as we said, not focused on maintenance at all. We need better models to support this. We need new models. I personally also really, as you heard, believe government funds aimed at critical suffer maintenance or promising development, among others. Some missions for you because you're here. You get homework. We're in a university. We need more data. What's working? What's not? Can your org do a survey? Can you release trends that you're seeing? Is there anything you can do there? We need more companies publicly committing to funding. Even if we need a bigger model than that, we got to keep stuff going for now. Can you get out there and actually say, we're going to commit to funding, we're going to do a FOSS fund, here's what we're doing? Add yourself on FOSS funders. Is there something you can do there? 2023 was a reminder that corporate budgets and priorities are fickle. It was a painful, painful reminder in this. Any funding planning that we do has to take this into consideration. How do we make it harder to cut open source funding that we do achieve? Maintainers need more resources and more mentorship to decide how to sustain their projects. Do you have experience in this? Can you help mentor other open source maintainers? In the past and what has and has not, that just as importantly worked for you. I think we should do what we can to advocate for government funding, which is something that has been happening a lot this week across Europe. And how do we get this conversation out of the echo chamber and into the broader ecosystem? So a few things to consider. A special thank you to a lot of people who were very, very wonderful and helpful in this. That is what I came here to say. And miraculously, I didn't run over. So here we are. I think we have time for about one question. One question. Don't mess it up. Okay. I'm kidding. The hand at the back went up first. So I'm going to go all the way up. This is more of a question that are a comment than a question, but... If you have other questions, I put my email up there. So you can use that. I'm also pretty easy to find. I'm not good at hiding. So this is more of a comment than a question. I'm kidding. So you sort of mentioned this a little bit in your talk, but over the last 18 months, roughly half a million people have been publicly laid off in the tech sector. Yes. Along with massively budgeting in private sector technology companies. Pretty brutal. A big question, but could you speak towards the impact on open source communities that this may have had that you've been seeing from your side? And are there any insights that you can provide here? Thank you. That is a very good question. And for anyone who couldn't hear it, it was around the enormous amount of layoffs and budget cuts in the last year, like what influence have I been seeing across the ecosystem? When I started this talk, everyone I talked to, I said, what about 2023? And do you have any data on 2023? I was like begging people for data on 2023. It turns out 2024 just started. And we don't have the data yet. Which I was very sad about. But we know anecdotally a lot of open source program offices had really harsh cuts. We also know that money is bouncing back very quickly. That nature is healing, it seems. But I don't know how fast that actually starts to trickle back down. And I think that we're going to have the opportunity to look back at 2023. I think you're asking the right questions and say, what can we learn from this so that the next time this happens, we shouldn't have to be, but we're more resilient. Like, we're ready as an ecosystem to brace for fickle corporate budgets. I wish I had a better answer there. All right, thanks, Kara. Yeah. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you.
The Many Hats of a Maintainer: Organizational Design That Helps Reduce Them
Thank you so much, the organizers and everybody here today. This is such a dream. Before I get into some things though, I wanted to dedicate just like the next 30 seconds right now to my best friend that passed in August. Many of you know her, Chris Nova. She is a prolific open source engineer, alpinist, hacker and past Fosdome speaker. What you're seeing on the screen right now is a photo of hers when she summited Mount Rainier several years ago. What I just wanted to say is that I hope we can always continue her memories here. With that said, I'm Paris Pittman. I'm a recovering Kubernetes maintainer. These days I hang out with the Swift programming language community and I sit on the Swift Core team. If you hear the twang in my o's, that's my hometown of Baltimore, Maryland. I sit in Seattle these days. So hello from Pacific Coast time and my brain still. I've been focused on committee management and governance of open source projects for quite a long time and I'm so happy to be here. The word cloud on the screen is a key part of my talk and my biography. It represents examples of roles, titles, groups and project organization that make up our open source communities today. I'm not going to fib. I've had a lot of very undefined hats in my life. These are some of the defined ones. Our maintainers today, if you sat in the talk that Kara just did, our maintainers today have tons of hats that they have to wear. At the end of the day, how can we help reduce these? Organizational design could definitely play a part but of course it doesn't play into all of our woes today. I'm not going to hit on anything funding related. It's quite funny how we were paired up on the schedule today, carried out all cash. I'm doing all humans. So welcome to the human piece. So in this open source world, if you have any kind of participatory goals with a project, there's elements that you need to plan for with and around when architecting roles, groups and processes for sustainability. The secret sauce to community lies in how you interpret and implement the elements. Let's go through those. Goals you have them. Lives collaboration. Distributed decision making. Transparency. And community engagement. Pull requests accepted. But now life has presented us with new elements that we need to design for and around. We have decades of open source community stories that we can look to to formulate new elements that can help support the maintainer. I'm sure you've heard a few of these, right? First one, you know a couple of maintainers that are probably masquerading as moderators. You probably know some who are tied up with code of conduct incidents. You've heard the infamous toxic community. What about that open source project that you know that has amazing engineering and no documentation? Or what if you know the engineering project where the maintainers are really trying to do their best to be the best documentarians that they can be but they just can't? Right? Same thing goes with website and branding. At the end of the day today, in order to market a project, you need to have a website and branding. So that means you as a maintainer are also going to need to put a hat on for website dev and designer. Another famous story. You're a product of your own success. Hooray! We've heard so many of those this week. Or weekend rather. It feels like a week, right? Hooray! But boo because now the workload is so absolutely not manageable for you. And yes, the next one. The never-ending quest for contributors to help out. Or for you to turn your users into steady maintainers. And that just really falls into this bucket that I call contributor turnover. And in a white paper, speaking of contributor turnover, in a white paper that I read from Carnegie Mellon, titled Why Do People Stop Flossing? That is, Why Do People Stop Contributing to Open Source? Quote, prior work has shown that the turnover rate of a project profoundly affects its survival probability and code quality. And in the same paper, another quote, 80% of projects fail due to contributor turnover. So, community management is clearly the missing element for organizational design today. If contributor turnover is that much of a metric for not having a lot of success. You've heard it themed in the stories too. Someone needs to do this work. We can't be everything to everyone. So how can we delegate this via roles, groups, and processes? When I have these conversations a lot with maintainers, because I do, that's my job. My question to them is, do you want to do this forever? Is this what you always want to do? This being 15 plus hats, or things that you don't want to do, or even things that you don't have skills in? Hilariously enough, this screenshot from Macedon literally came to me yesterday. It's such a great summary. This individual is saying that they need social media skills in order to promote their thing. And they just don't have that. And what is that? And what is this when I say, do you really want to do this forever? That's creating, maintaining, moderating mailing lists, chat platforms, forums, recruiting and onboarding new contributors, that's their documentation, their workflows, their processes. Also keeping your current contributors and maintainers. Also GitHub administration, website creation, mentoring, holy moly, y'all. That's a lot. So, first thing, maybe you should define a role that could be successful for you, which is community manager. This does not need to be someone who does community management as a profession. I know tons of engineers who wear a community manager hat. They love it. They have the skills and they're passionate for it. Because at the end of the day, that's all you need to shift the weight around from the maintainer. And implementation in the wild. And that's what I'm talking about. Dapper, a distributed application run time has a community manager role description posted in their community repo. It starts with writing down some responsibilities and posting it somewhere to be seen. That's your mailing list, your social media accounts, your issue backlog. The role can be iterated as code. That's not something that I hear a lot from maintainers. Maintainers are really pretty fearful about this. But ironically enough, you're not fearful about putting half-baked code out there. But you're okay with just going without. And let's not go without. You can iterate on this. It doesn't have to be the end state. So we talked about that. I just let loose with this one role that you should have, right? This community manager role. What are the other roles in open source though? This is it. That's what we got. Contributor and maintainer are two of the most common forms of organizational design in open source, no matter the size of the project. Two of the most common words that imply all of the work that you do. And that distributes trust. Even if you're the smallest of projects, you should at least have these two roles clearly defined, including how to get to be a maintainer. You've probably heard this term before in the last few years, but a contributor ladder, if you will. It helps people understand what you do, why you do it, and how to get there. What about the contributor ladder though? Is it that easy of a jump to go from contributor to maintainer? It's not. That's why we need things like mentoring and other community management type of activities. We added one in the middle and made it an actual ladder. Kubernetes has this, for instance, and it's the introduction of something called a reviewer. So you've got your contributor, reviewer, and maintainer. That reviewer is giving new contributors another wrong on that ladder to help build the trust, grow skills, and have practical mentoring experience versus a lofty one. Okay. But why stop there? Why do we have to stop with just those two roles or just community manager? If a project has needs, create them. Again, it's just like your code. Create it, and if it doesn't work, sunset it. According to the same Carnegie Mellon paper that I quoted earlier, role identity. It plays a strong role in contributor turnover. How about that? So while you're creating other roles, you're also building belonging and incentive to contribute. So what are some of those other roles that you could create that would solve some of the problems that we heard from the stories earlier? A release manager. A security lead. A communications lead. A social media manager. The list goes on. And honestly, it's kind of endless. Think about the things that you need and build for it. Speaking of endless, though, if a thriving community is your goal, build an emeritus role. One of the most forgotten parts, I think, of open source organizational design. How to exit off-board and be done with it and hang up your hat. And that should be celebrated. It should be kind of like retirement from your day job where they throw you a party at the end of the day and celebrate you. Project Jupiter actually calls their folks distinguished contributors. Isn't that cute? And I think that this is something that we really should try to normalize and include this in your contributor ladders. So now you have four. Contributor, reviewer, maintainer, and emeritus. So we have roles as one approach. Like groups, they're probably my personal favorite. And that is groups of humans. Groups allow people to drive work or interests in a space. And the group can only be two people. I hear a lot of naysayers from my maintainer friends sometimes. They'll say groups are so heavy. We're not Kubernetes. But like Kubernetes had groups before it was Kubernetes. Like why do we all think Kubernetes is Kubernetes? Kubernetes is Kubernetes because of these groups that were formulated in the early days. Because at the end of the day, what we've learned from groups is that they're great at bringing people in and guiding them to the work. Because at the end of the day, one of the most common phrases that I even use a lot of the times when we're talking about contributing over the source is jump right in. Just send a PR. That's not helpful when you're trying to scale a community. It's helpful to the people that you're talking to that know what you're talking about, but that's about it. So it's all about what you want out of it. Do you want experts in areas that you aren't? Do you want to distribute the load? Do you want a way to drive the work? Applying this to community management, because now that we know community management is so hella important. Several related successful groups have shaped over the years. And they are all targeting this community management of how can we together collectively work on that burden of community management. I recently created a contributor experience group with a hard focus on mentoring and bringing in over 60 new folks to the project each cycle. Kubernetes also has one that I led that supports 80,000 contributors. You all, 80,000. And all of these groups have different levels of decision making, approvals, charters, duties. But the one commonality that they all have is to support the maintainer. Some really cool examples in the wild of this that I've seen from various different types of sizes and types of projects. Again, because I think a lot of people assume that you have to be the Kubernetes of the world in order to do these cool, fun things. But you don't. For example, open telemetry has an end user working group that helps maintainers reduce their product management hacks. Because I know a lot of you have to toggle between what's important. This group helps them toggle between what's important. And then not only that, but then tries to get them to participate, them being end users. There is also advocacy and outreach with Jenkins, Debbie has a security team. Many projects have security teams these days. And then even in the SWIFT team, we are just about to launch an ecosystems work group to help our ecosystem maintainers. So again, this is also endless and this is also about your needs. But if you're still in the audience right now and you're still kind of like, I don't know. I don't know if I can do all this. Or if you have a community manager now, but your project is scaling up so much that it's even too much for them. Or what if you have tried, but this isn't just a thing for you? Well, there is the process component here. This process is really coming from the emergence of strong CI CD systems and the overarching infrastructure as code movement. And I've seen some really creative things on how engineers are project organizing themselves through configuration files. And I mean, imagine that you're an overburdened maintainer and you're really trying hard to onboard new contributors. And that means helping them with forming their groups and helping give out permissions to things like your Twitter or your Slack or whatever, you know, or your GitHub keys. And then you're also needing to update the documents in a million places, training folks on code of conduct stuff. The cool thing is in Kubernetes, we set something up like this that takes care of all of that. And it's this infrastructure is as community as I call it practice. Could be a way forward for you. This practice covers testing contributor management like bots that welcome first time contributors, artifacts, governance policy and procedure scaling. Kubernetes no longer has a full time community manager. I bet a lot of people think that it does. And the other thing is it hasn't for a long time. And the reason why is because it's held up by this infrastructure as community, it's governance. And all of those elements that I mentioned earlier with the delegated groups that have decentralized decision making, like contributor experience, special interest group and the testing special interest group and the steering committee. So we've reviewed roles. We've reviewed some groups at a high level and also showed you some process stuff. Good project organization and governance create environments where projects thrive and the humans do too. I was talking with Zach Saylor, a distinguished contributor. Again, I love that title in the Jupiter project. And he said that it took them two years, two years to rearchitect their project organization. But he said Paris quote, this is a quote, it was so worth it. And I said, why? Why? Because roles, groups and processes at the end of the day attract and bring new people to your project. It's scalable mentoring. One-on-one doesn't scale at communities of size. It just doesn't. And this way roles bring in shadows. You can easily have shadow organizations like CIG release and Kubernetes. Amazing shadow program, y'all. Just truly top notch. You should absolutely take a look at it. And then the third in that same kind of bucket is inviting other skills in, increases your chance of A, survival, and B, having a more diverse project. So why we sit in these talks of how to get more diversity, look to your project organization. And then the last thing, the one thing that I need to take a drink for because this gets me so excited. The last thing that not a lot of people recognize here, and this also goes back to Keras talk too, a little bit with the funding. It's like, how do we get more people that have day jobs contributing more and also being supported to do so? You know how we can do that by giving them roles and titles. Companies love roles and titles and it makes sense because you know why when you go to your managers at work and you say, hey, I want to contribute to XYZ project. You say that's nice. What's cool? What's in it for us? And then you're like, yeah, that's usually how the convoy goes. It's good, you know, we should do it. But imagine if you say, I would like to work my way to be a security lead for Kubernetes to bring industry experience to my day job. That's a game changer, y'all. And that's one of the reasons why Kubernetes has grown. And that's one of the reasons why you've seen Kubernetes so just well staffed in a way. I mean, of course, yes, trust me, I think the 300 plus maintainers would probably throw tomatoes at me when I say, well staffed. So I'm like ducking right now. But it's 300 maintainers, y'all. It's a lot. So to wrap up, remember the slide. Let's rally the industry around community management as an element go forward in addition to engagement, you know, pull requests accepted. Because at the end of the day, building robust, sustainable communities is more than accepting pull requests and taking issues. These two words, y'all, they can go so far to help our maintainers. Thank you. Thank you.
Open practices for open projects
Life Stream is on. So yours everybody, welcome, Dhanu Benjamin. Hello everyone. Hello. This is my first FOSSTEM. I have wanted to come to FOSSTEM for so long and I've come from all the way on the other side of the planet. Is there anyone in the room right now from New Zealand? Because if you're from New Zealand you've come further than me and if you're not from New Zealand then I probably have travelled the furthest to be here today. I'm from Melbourne, Melbourne, Australia. My name is Donna Benjamin. I'm the product owner and maintainer of the Open Practice Library. So I'm very proud to be emblazoned and being a public billboard today. And the stickers that I've just passed your way are the very simplest and easiest way of contributing to my project is to spread the word. Wear your sticker with pride somewhere and help tell the world about the Open Practice Library. But wait, I hear you say. Why would I want to tell the world about the Open Practice Library? Well my answer is, well why wouldn't you? Well let me tell you why I think it's awesome. Way back in, well over a decade ago now, for a little while I was a born again agilist. I did a scrum master course and I thought wow this is amazing and I became like a scrum padawan learning everything about scrum and thinking it was awesome. I'd been involved in the open source community for a long time before that and I felt like there was all this like similarities between like open source and the agile stuff and I'd been to lots of open source events and I went to some agile events and whilst we were all talking about software, none of the agile people were talking about open source and none of the open source people were talking about agile. It's almost like they were on different planets and there were a few people who were familiar with both worlds but not many. And then I came across the Open Practice Library. Thank you Leslie Hawthorne who sadly isn't with us but is one of the organisers of this Dev Room who introduced me to the Open Practice Library and I went oh my goodness this is the first time I've seen open and agile together and for me that was like because it made so much sense to me. I'd always wondered why they weren't together. I still don't know why they're not together. I have theories but no hard evidence shall we say. Okay so as you can tell I'm not doing slides. If you've bothered to read the little description of this talk then you may have some idea but in practice I know most people don't bother reading the abstracts they just kind of end up in a room at a conference so that's okay no judgement right no judgement. But what I wanted to do was to introduce this idea of open practices for open projects. So hands up if you're involved in an open source project. Okay hands up if you in that involvement have come across various needs, challenges, things that need doing. I think that's everyone. Was there anyone who didn't put up their hand? Did I miss you? No good. So the thing about the open practice library is that last word library. I like to think of a library as two things. Obviously it's a collection of stuff like usually books but other stuff. But libraries are often the hub at the centre of their community and so I think that's a really nice kind of thing to think about in the community developer room at Foster, right? We're talking about a hub of a community and the people in that community are and again the clue is in the word practitioners. They practice things and those things are practices and the kinds of practices in the library are organised around the Mobius loop. The Mobius loop is an outcome based delivery framework developed by Gabrielle Benfield and it starts with a discovery. It continues with a decision making kind of chunk and then there's delivery and then you cycle back through your options and decisions perhaps back to discovery or perhaps continuing to deliver based on what you've learned. In the open practice library we took Gabby's stuff because it was Creative Commons and we added a foundation underneath it. I think of that foundation as a bit more like a seesaw because we're trying to keep in balance cultural practices and technology practices. So on the cultural side we'll have things like a social contract or a team API or a manual of me. I think someone I think lovely Scottish chap this morning earlier. Mike talked about having a manual of me or some kind of personal API. That's a practice in the open practice library as well. But those cultural or social practices are also balanced by technical practices, stuff like CI CD pipelines, canary releases, observability. They're the underpinning platforms and I say it's a seesaw because you've got to have both. You've got to have those things in balance. And that foundation underpins the kind of work that you do to discover. Like we talked yesterday we heard a bit about product management and I think a lot of the product management practices fit in that discover phase of really understanding who this thing is for, why it exists, why they need it. And the delivery side once you've built something you need to be able to measure and learn. So that gives you kind of high level idea of the practice library. What I want to do now and this is the worst room to try and do this is actually have you talk to each other for a little bit. Sorry, live stream this is not going to be much fun for you. Maybe you could jot down some thoughts on pen and paper or just have a think while we have this little section. Okay, yeah, good. Awesome. You guys rock. Okay, so you who are here, I know it's tricky. So what I want you to try and do is kind of look around you, who's sitting beside you or maybe behind you and just huddle a little bit. That question I asked you before, you're involved in an open source project, tick, and you have some stuff that needs doing or thinking about or changing whatever. And I want you to just have a chat about that stuff. What are the things you want to tackle? Do you need to imagine something new or new set of features? Do you need to tackle a thoughty community problem as Angie and I have done on the odd occasion? Whatever it is, whatever the stuff is, no judgment, could be technical, could be social, could be infrastructure, could be finance, whatever it is. I just want you to have a little huddle and talk about the stuff. And I'm going to give you three minutes if I can count based on this clock. Your time starts now. Let's count it down. So I'm going to count 13. OK. Hello. Hello. Okay, hello! Hello! You got to be... Oh! Ooh! Nice! Sorry to rudely interrupt your conversations. It was kind of fascinating watching from down here, because some of you were like, into it. So you're like... And that's fine. So I'm hoping that you stimulated a few ideas, exchanged some maybe commonalities, maybe contrasts. What I'd love to do now, and I don't know that... I'm not going to try and make you run with the mics to try and get to everybody. I'll repeat. So what I'd love to do now is hear a little bit. What happened to my bit of chalk? Some different strategies. Sorry, not strategies. Different challenges and stuff. Thank you, Mike. Can we go to the box and show Karen the table? All good. All right, so I'm going to write down here. You're not all going to be able to see it, but I can reach it. Practical, right? Practices, practical. So who would like to start the bidding with a challenge for me? Getting developers to document. Getting developers to document. We just had this fabulous talk about that as well, right? Getting developers to document. And I think one of the things I really heard Erin say is around prioritisation. How do you prioritise things? And there are, in the Open Practice Library, a couple of different practices, actually probably more than a couple, focusing on prioritisation activities that you can do. Some of the famous ones, it's called the Eisenhower Matrix, or how now, wow. So document prioritisation is a good one. For which we have practices. Awesome, thank you. Next. Overcoming single-maintainership. Sorry? Overcoming single-person maintenance. Overcoming single-person maintenance. Ooh, that's a good one. I don't know if I've got a single practice for that, but I think that's about creating a space that invites collaboration. And a lot of our practices are designed to be collaborative and have people come together. So I guess it may even be part of this starting with docs and putting the call out. And one of our other speakers this morning around communication strategies, right? How do you get people to come on board and help? So I can't answer that one. I like it. So let me put single-maintainer challenge. All right, my writing is terrible. Don't judge. Yes. Getting your technical leads to write down a roadmap. Getting your technical leads to write down a roadmap. Getting your technical leads to write down a roadmap. Cat, I'm not sure that's a practice. But, bragging. Yeah. But I think it's a good one, though. Does anyone want to shout out some strategies for that? How do you get stuff out of people's head and onto the page? Document a prioritise. Document a prioritise. Gold star. Brainstorming. Brainstorming. Start with a wrong answer. Oh, that's a brilliant one. Start with the wrong answer. And someone will correct it. Yes, indeed. Yes, indeed. Yes. I listen to them. Listen to them. What if they're not talking? Yeah, they are. They don't want to write it down. They don't want to write it down. But maybe they don't need to write it down. This is where the collaboration piece comes in, right? A lot of the practices in the library are about helping teams do their stuff. Collaborate, either how to make sure they're building the right thing and building the thing right. But not everyone is a writer. And not everyone is a coder or a community manager. So what you want is this diverse perspectives and skills and for someone to work with the tech lead to at least talk about what is in their heads. And then maybe someone else can write it down. Good one. Right. Another. Sage. Finding and encouraging and nurturing mentors. Finding, encouraging and nurturing mentors. Awesome. Do we have a practice for that? No, but I think outreach he does and Patch is welcome. Yes. How do you create a welcoming space for other types of contributions to community? Like documentation. How do you create a welcoming space for other types of contributions to come in? Like documentation, like community management, like social media. Great question. Again, I'm not sure we've got specific practice for that. From community involvement at various times, be welcoming is like the first step. Be surprised how often and how unwelcoming so many open projects have been traditionally. I think it's actually changed. I've got grey hairs now. I'm beginning to sort of see a then and now. But being welcoming and making that an explicit value is probably the first step. And having consequences for behaviour that turns people away is probably the second. Really nice, nice point. I like that one. Thank you. Yes. A moderator is usually also a good mentor. Nice. That you have to train the moderator. A moderator is also usually a good mentor, but you also have to train the moderator. But some moderators, I agree with you, but I also disagree because some moderators can turn people away by saying, no, that's not welcome here. But then again, it's always, it's never simple, right? Sometimes you need to actively turn people away to ensure you have a welcoming project. Tricky to balance. That seesaw again. Yes. How to do the right things and do things right and convince your leadership that this is the right thing? Good one. How to do the right things and do things right and, because that wasn't enough, convince your leadership that that's the way it should be. Okay. So this one, I think we've got a good answer for, and that's the Mobius Loop at the heart of the Open Practice Library, Gabby's outcome-based model, is really my, really, it really pauses to say, we need to understand the who and the why, and we need to be clear on what we're trying to achieve. So our target outcomes. Too often in software development, sort of in agencies or in companies, probably less in the open source world, we're basically given some spec and then told to just go build it. And we've had no involvement in the design, the ideation, the who, the what, the why part of it. It's just just build this thing. And you could go build this thing real fast, but if it's the wrong thing, how much effort has been wasted, right? So I think this discover phase is really, really important. Actually, I'm not allowed to call it phase, Gabby told me not. It's more like a part. So the discover part of the whole picture, understanding who, understanding why, being really clear on what you're trying to achieve, and how you're going to measure if you've achieved it, you've got to get that clear before you rush into building. So once you've got that stuff clear, you're going to have a whole bunch of ideas of how to go about it, right? So you've got to go to options phase. You've got to develop, create, ideate, get those ideas up, and then sort them. You can't do everything. You have to prioritize. And that's your options phase. You'll decide, discover, decide. Before you go on to deliver the right thing, hopefully has been, you now know you're building the right thing, and you can focus on that technical excellence, bringing in those technical practices to ensure you're building it right. One more. Yes. I love everything you just said, and I'm not going to be able to say it back for the live stream perfectly, but it was building on the welcoming thing that we mentioned a bit before. How do you create a culture of gratitude? How do you create a space where it's safe to experiment? And there was a last little bit which I didn't catch. It's about personal development. Personal development. We're working in the project. And how do you create space for personal development and growing in the project? Beautiful. Okay, so the welcoming space, I think making a culture of gratitude is really important. The fact that someone has stepped into your project and offered a contribution. And I do this with the Open Practice Library. I get first contributors and I go, thank you so much for contributing to the Open Practice Library. It is the first thing I say. Even if the contribution is not particularly right or perfect for now, the fact that they made the effort to contribute is just awesome. And I am always overwhelmed with joy. So that's one way. In a physical space, we've done things like having a kudos wall or where you get sticky notes. If you're having regular team meetings, you can start with gratitude. What are we thankful for? There's lots of ways. And I think it's a really beautiful practice. And there is something in the Open Practice Library about it, thank you all. So yeah, that's lovely. The personal development piece. I don't think I've got something explicit about that, but I kind of seem to think that almost every step you take together as a community is an opportunity for personal growth. But it's not always sunshine and roses. Sometimes the personal growth comes out of quite hard work, emotional work and disappointments. And I think that's one of the biggest things that we can learn as a community is to not shy away from the stinky, hard messiness, because that's really human. We talk a lot about code and technical excellence, but really we're human beings, and it's not always right, it's not always perfect, and it's often very much the opposite. So I think they're real opportunities for personal growth as well. And then I've forgotten the middle bit again. Too many things for me to think about at once. Will that do? Thank you. Awesome. All right, I think we're just about out of time. One more? Two minutes. Two minutes, because we're doing the questions as well. Oh, then I can have more. Ha ha ha! In that case, question? Wait, you've had a lot of go's. Wait, wait, wait, wait. Anyone else? Yes? A team that feels overwhelmed by too many things to do. A team that feels overwhelmed by too many things to do? Our prioritization friend again. But one of the things that I like to do, and I think we've played with this, is just start with a list. Actually, just get it all out of people's heads and fears and worries. And get it down. And then you can say, okay, let's sort this. Let's come up with some criteria for how we want to sort it. Are we optimizing for impact, as some people say? Are we looking for low-hanging fruits so we can get some quick wins? Are we looking to tackle something really, really challenging, because we want to strive together? So I think getting it out of people's heads, because sometimes I think the overwhelm is just, it hasn't been quantified. But once you get it down into a list, you can say, hey, this stuff matters. And actually, if we never get round to this stuff down here, does it matter? Maybe we can lighten our load and just say, no, we're not doing those things. Anything else? Any questions, folks? So one thing I want to kind of add into this is, I was cheeky at the start and I passed around my Open Practice Library stickers. Did you all get one? So that's one way of contributing, of sharing the fact that the Open Practice Library exists. Another way of contributing, and we've heard it today, Open Collective, so Funding Open Practice Library. I've put us on opencollective.com. slash openpractice.library. Buy us a coffee a month. That's another way of contributing. But also some of the ideas that you've had, or perhaps practices that you're using, scan the Open Practice Library, and maybe your favorite practice isn't there. Maybe we could, you could add it. And we've got a really low barrier to entry. We'll accept most things before they're perfect, because hey, it can always be improved. It can always be iterated upon. So I very much want to invite every single one of you to think of yourselves as contributors to the Open Practice Library as being welcome at our community hub. Feel free to use the practices. There's absolutely no nothing stopping you. Feel free to raise issues if something's a bit clunky. And hey, feel free to make a pull request. Add a new practice. Help us fix our website, or all of the above. Thank you so much. Thank you. Just check any of the questions. Yeah, and now any standard questions? You don't have to follow my script. No? Hold on, hold on, hold on. You should know this by now. Thank you. It's more about what you find as a practice and what is not a practice, so that we know what to put there and what not to put there. Thank you. Excellent question. What's a practice and what's not a practice, and how do you know the difference? On the website, there's a menu, and there's a contributors guide. We've got editorial guidelines there, and a little bit of an outline of what we think a practice is. But that said, if it's not a practice, we can talk about it in the pull request and go, this isn't quite right. So don't let that stop you. Great question. Anyone else? All the way over there, Laura, run. You're off the pizza now. One moment, please. Here we go. No, I just want to express my gratitude and acknowledge how brave you are to come to this conference with a full room and to not use a slideshow and to really tap deep into the collective intelligence. It's really, it's very much appreciated. Thank you. Thank you. That is very kind and also very validating because I was like, I'm not going to do slides. I'm just not going to do slides. So thank you very much. Thank you. Well, unless there's anything else, I am going to express my gratitude again to all of you for coming and sitting through this, but also for having that conversation in the middle there and sharing with each other. I wish we had a bit more of that here, to be honest. But hey, I'm new, so forgive me. Thank you. Thank you very much.
Kickstarting an Open Source Culture: A Guide for Mentors
Welcome folks, we're on the final leg of Faust tomorrow, so hopefully we keep you awake. So I'm delighted here to be chatting today with a good friend of mine, Phyllisys, around something that we're both quite passionate about. I suppose from our experiences out in the open source, you know, when we first got involved and we got through it as we go along, and then, you know, just working with the community, collaborating with folks, and then seeing, realizing how we can bring that into our companies and bring that culture to help, you know, things get done better for one to a better world. Okay, so, so I'm Arton and I work over, I'm a developer over at IBM, and for the last about eight to ten years I've been in the cloud native space, and I've later started getting involved in AI as well. Yeah, and I'm Phyll, also a long time, we're both old guys now, a long time working as software developers, but, you know, done a lot in the, again, cloud native space, that's even though Martin and I worked at IBM together, that's really where we're connected over open source. Now I'm at AWS, still focused on open source, and we thought we'd start with really just kind of how we got our start. Again, we started our careers a very long time ago, we're not involved in open source for a good long part of our careers doing software development. For me, it was around 2014 that I had joined a new group in IBM that was focused more on open source, doing some work in OpenStack and Cloud Foundry, and this new thing called Docker came out, and I was asked to go check out this new technology, see if we could get involved, I became a contributor, and in essence I got hooked. I loved open source, I'd been a long time Linux user, but this was really my first experience contributing, making pull requests, reviewing code, helping others in the community, and that's led to the last 10 years of working in the OCI and CNCF and the Container D project where I'm a maintainer as well. So yeah, similar to Phil, I was working on Cloud Orchestration product that built on top of OpenStack about 2013, so we were downstream, we were building on it, and then I got an opportunity to say, okay, can we extend Horizon, which was the desktop at the time for OpenStack, it probably still is, and I remember the first conference I went to was over in Atlanta, I think someone fell off the bus, to be honest, because my manager came to me on a Friday evening of a long weekend to say, do you want to go to Atlanta to a conference the following weekend, do you know what to head over? So I went, and I was just blown away, and I think it was the whole collaboration of folks and all that, and then as I went into the community and I started then contributing into the Neutron, which was networking, and if you've ever worked with networking folks, they really are into the black arts, and they really take networking seriously, and I always felt like I'm going to get funged out here, because I don't care about IP4 or IP6, I just record, but it was a great experience, and they really made me feel welcome, I remember we had a meet-up, there was meet-ups at the time over in Rochester, Minnesota, and the work I worked on, they came up to me a lot and they maintained the same, we really liked what you've done, we really liked the fact that you took it on the chain when responses came back to you, you didn't get upset, you just moved on, you made the changes and go again, and then going to Hallways forward a few years then on to getting involved in Kubernetes and in the Helm community, and being really welcomed in there, being part of the Helm Tree release going out, and then becoming a maintainer in the community, and getting to actually talk at San Diego, which was a fabulous experience, so yeah, it's been great, you know. This is where his fancy clicker does the work. Yeah, a clicker's being unresponsive, sorry. Use the buttons. So why do companies need to cultivate a culture of open source? And I suppose the key one here is, and you get lost a lot of the time, I know we bang on about it in the community the whole lot is that, you know, nearly every company today producing software is, or are probably built on top of an open source stack, so if you're consuming it, you know, you're really involved in communities, sorry, you're involved in using communities, but you need to look at, you know, how do you feed that back into the community as well, because if you're using the stacks, if something goes wrong in the stack and you haven't been helping in those communities, then you don't really have a leg to fall back on. The other part of that is the amount of, when you're building on these stacks, you're building on the shoulders of people who put in hundreds and hundreds of hours. So you're getting a real lot of value here to build your product on top of it, where you can concentrate on your product, where you can drive it forward and that you may not have all the people in the community to do the good work for you like you're getting from the community. As you can see up here, there is so much open source out there, and that's coming from Linux, over the last 100 years, definitely from the Linux community, because maybe prior to that in the 90s, there's a bit more niche, the amount of people that were involved in open source, but definitely the Linux foundation community, up along through open stack, up along, it has really opened the door for people to contribute into communities, and it's created a real momentum and shift that, you know, if anything we, I suppose, that came out of log4j was that we realized that it's open source software is in every product around there, and we need to be aware of that. The final one then is, and this is very important for your customers, is most customers don't want vendor locking anymore, and open telemetry is a great example of that. It took a long time to come up with, there's been multiple standards in the telemetry space, but open telemetry has been probably the fourth standard where the different vendors have bought in and decided to work together, and a lot of clients know they want to be able to write their telemetry metric generations, their telemetry generations, or sorry, their generation of data once and use whatever back in they want. They don't want to be coming back again, having to change code and so forth to look to do observability and maintainability. They want to be able to have that, path finger Newton, and then use the particular back in they want after that. Yeah, so it's coming into the community dev room, it feels maybe like we're preaching to the choir, many of you here, you know, fully agree with the why, you know, why do we do open source, why do companies need to do open source, but I thought one extra data point on top of what Martin was just talking about is a report that just came out a year and a half ago that had this amazing stat that 82% of companies are looking for a vendor who's not just like Martin said, and like we all know, everyone's consuming open source, but 82% said they'd like to select a vendor who's actually participating upstream in an open source community. And then there were a bunch of responses about why, you know, oh, because, you know, they're familiar with open source processes or they're helping sustain a community of something that I'm depending on. And we definitely have experienced that, you know, working on container D for myself that was used in IBM's Kubernetes engine. It's used in several AWS container compute offerings. And AWS and IBM wants people who are active in that community so that we can fix problems so that, you know, like this last response from vendors, 46% said, I'm choosing a vendor that contributes because I know when I hit a problem, I can depend on that vendor because they have people in the open source community. And I think Martin, you had an example of that. Yeah, I have a little example, I can touch them because I didn't want to go near the stats that I feel was thrown out there because was it 46% of people didn't want it or they did want it. Sorry, I was a bit confused. No, on a serious note, the final point there is very telling about a year and a half ago, I was working with a partner and they were getting involved with us at the time. And they were really, really technical. They knew their stuff and they were a dream to work with being a technical person where, you know, they told us they told me what they were looking for. I had them alone. But one evening anyway, they were using the operator SDK from the operator framework, which is in the CNCF. And they found the bulk. And before the engineer and that engineer was on North American time, so he'd gone home to bed. He raised the issue and then he came up along. So I came in the next morning and was one of those lovely mornings, I didn't even have the coffee and he was like, oh my God, I'm doing this, you know, but I thought he was brilliant. They put it out there to say, that's great. All I went, I worked away and he took me on maybe two days to narrow down the bog, get the fix in. But for that partner that I was able to jump out there, make a fix. It wasn't a big issue. The big issue was just finding where the thing was as always. Once you find this, usually the solution isn't so bad. And then just working with the community getting in. And I think they really appreciated that fact. And you know, most of our customers out there and clients and so forth, they're very, very technical. They know their stuff. So they're not going to be hoodwinked. Yep. So yeah, Martin, you're going to take, you know, we've talked a little about companies, but why do employees care about involvement in open source? And this is, this is a lovely thing. And it's from my experiences. And when I talk in a while, we've a jumpstart program to help people get involved in open source is it's an amazing kind of, you know, I hate to use the word organic and just throw it around there, but it's a great way for somebody to get opportunities that they may not get within their own company. Because sometimes, you know, in companies or in teams and stuff, you know, things are rigid or certain ways or maybe, you know, sometimes it's like a public service for one do a better work where they know I've been there 10 years, so I'm entitled to do this or whatever. But for me, I just the ability to get opportunities either to speak at conferences, to meet people on different topics, to suddenly be involved in conversations that, you know, you taught were for, you know, somebody who is way more experiencing you is just amazing. It also gives people that ability to work on, you know, you may work in certain technology in your company, but then all of a sudden you're exposed to these technologies that are out here. And Phil said it there, like, when we first go out in the communities, you know, everyone's, get up, there's no problem to that. When you first go to get up, you're playing around with it, or you go out into USB IRC, but you're open to Slack now, whatever, like, it's a bit, you know, it's a big challenge when you first start out there and you're trying to engage and so forth. But it really gives you an opportunity to know how to collaborate with people and work with people, because it's not always about the technology. It's not always about contributions. It's a collaboration as well. Because at the end of the day, in your own company, you know, Bob or Mary beside you, they're paid to, you know, to work with you. When you're on the communities, you know, if, you know, people will only work with you if you're a decent person to work with. So you get that, you get those opportunities. And the funny thing I'd say is, just the friendship you made, as Phil said there, we worked in the same company, we met each other at Open Slack. And over the years, we never worked together internally, but we'd meet each other at, you know, KubeCon somewhere or some other conference or like Fosdom here again, and we get a chance to talk together. So I think that's lovely. Yep. And we usually take an old man selfie together. You make me do that. Yeah, he just wants it because he's got hair still in, I don't. All right, so we've talked a bit about the why. Just a few points. You know, what does it mean? What does it mean that a company has an open source culture, some kind of way that they're they're doing things to encourage open source involvement? One is just the simple fact that you're contributing back in some way. You know, you have employees who have a pathway. And I know there's probably a bunch of amazing Ospo leaders here or have been active in this room. You're making, you know, policies and capabilities, making it possible for people to do that in a sort of a clear way. You may create open source projects. I've had the pleasure AWS to be involved in creating two new open source projects that we've shared. We've gotten other people to collaborate. We're continuing to build those. And then, you know, there's the whole aspect of not just that you're allowing it, but there's some kind of encouragement. There's some way that employees who do open source don't feel like they're they're sort of stuck on a different track than everyone else. Like a promotion is harder because I'm mostly doing open source, and I'm not, you know, providing for the bottom line. And really that connects that there's some value, some, some incentive that employees think, you know, choosing to work on open source is just as valuable as, as, you know, being on a product team or working on a service. And then, you know, I think one of the cool things I've seen both at IBM, you mentioned the partner story, we have a group of AWS focused around actually collaborating with other vendors and customers and partners trying to not just do things between ourselves, but say, hey, join us in this community, and let's work on this together. And so, you know, these are, again, there's probably a lot more. But really, these are some of the keys that you would look for is, you know, what does it mean to even have an open source culture? Just going back in that last night to finish there is, you know, generally, partners are really, really on the ball technically, and they're really got the air to ground with their customers. They want to give the customers exactly what they want. And for them, open source is always that easier path to do and stuff. And it's the way they want to do it. So, you know, it is in your benefit to be able to engage like that. So how can you do this? So I've been very lucky in that a number of years ago, two great colleagues, Matt Rizowski and Anne Graham, came up with the idea for a junk star program. And the idea was that, you know, for early professionals, we'll get a chance to do a course for about nine weeks where there'd be an intro for the first two weeks, for so I'll tell them we're open source and how to contribute to open source and how to use open source and then particular projects and to pick one. And the goal being to get to push a PR out there. Now, you're probably, you know, if you've been open source, well, you go PR, but you forgot about the very first time you try to get that PR rate, especially if you took a while, you're probably looking at your GitHub, you were probably on your phone going, come on, review it, get it in there, you know. And it is still like we've all had that experience, I wish that we'd get in there and then you get to a stage and you're like, sure, if they leave it in, they leave it in, if they don't, I'm okay. But you know, and it's just giving people the confidence and as I say, we started with early professionals and now we've gone to experience folks because we realize they want the chance, especially, I don't know, maybe if you're, you know, as old as I feel there and myself, you know what I mean? You may have got caught in the rutter work or you might have got opportunities and I've seen people that come and see this and go, I wish I'd seen this years ago. You know what I mean? You know, they see the potential, they see the opportunities, they see where, you know what, I can take off here, you know, and it gives people that go. So the biggest thing is informing your company, tell them about open source and the benefits of it. The next part then is introducing that into your company, the tools and practices because things don't work in open source if the practices and the way people work and the tools are using are chunky or awkward because you have to remember here, it's people all over the world and all different companies, all right? You know, you have to find a common ground and a common way of working and a lot of companies know you can hear inner sources coming out all over the piss and you know, sometimes you hear, you'd swear inner source was something that's started, you know, that just grew from the sky, whereas to be honest, all you're doing is taking the first word of open source and changing it. You know what I mean? So the value has been seen here by companies that it's the collaboration, I think, more than anything. And you know, if your teams have been struggling or they've been finding a hard to get stuff out the door, that when they really start buying into this, they realize here, look, you know, one boat lives on. It's not about the individual, it's about the greater good. So I think that's important. The last two here, educating folks, okay? And like I said about the jumpstart, we had, look, some people, and I always talk when we do the jumpstart, you know, we have, we have weekly stand-ups for a half an hour or an hour and I said to folks, look, this is not like school. If you don't have the stuff done or you haven't made progress, please attend anyway, and we'll help you on block. But I always make that story. So when they come in and they'll have a PR push literally in the first week, and someone else is struggling because, you know, their kids are sick or they're gone on holidays or work has been really busy, but giving them the opportunity to be able to say right. And I always use the story of the hair and the tortoise, okay? You know, everyone gets there on their own time. And the last bit then is around and Phil touch on it. You really need to have a pat in your company that when people contribute to open source, because they're doing serious work out there. It's not someone out there having parties, even though people do go to parties, you know, I see John Willicky up there at the USSF party last night. So but on a serious note, you need to be able to give a sanctify and say to people you're doing really good work here. Well done. Yeah, and I just one thing to add to that. You know, Martin talked about the jumpstart program at IBM. We have an open source talent task force at AWS that just kicked off in the last year. We have an amazing Ospo and Nithya Ruff, many of you know her. And just trying to think about how to actually include HR in these discussions about, you know, what does it mean to have open source maintainers on your staff? How do you treat them differently than, you know, other parts of the company? How do you incentivize them in the same way that maybe other employees are incentivized? And then, yeah, just a lot of the practical education parts. Is there a way for open source, you know, newbies, so to speak, to get mentored to get, you know, and I do a lot of mentoring, we've built a small container runtime team at AWS, where I mentor some of those younger engineers. And with them, we've created like an open source hour, actually, I think it's two or three hours now, where the there's sort of an open, you know, video call, and, you know, the guy that's three weeks into the, into the job, you know, he's just created his GitHub ID. And he's like, I don't even know what to do. But he can join this call. And there's others on the team, who are like, here's an issue, you know, go read the issue. Let's help you figure out, you know, how to get your get set up and clone the repository. And, and so, you know, these are the practical sort of nuts and bolts of like how to get people involved, how to get them educated, how to get them incentivized. And again, I'm sure there's there's ways, you know, your companies are doing that. And, you know, I think this is an area where I think it'd be awesome to see more sharing of practices. You know, what are you doing in your company to incentivize and educate for open source? And just one little thing on that is one size doesn't fit all. So, you know, I'm way a laugh at that is I saw I know if it wants to fit all or one size fit most, I think is what it said, but no, I was mentoring a person work a couple of years ago, just one to one. And, you know, he noticed I was in helm and he said, all right, I want to get involved in him. So the very first meeting we had, he said, right, I want to get in the helm or whatever. And then I said to him, I said, do you know what you do? Pick five projects in order of your preference and come back to me. And I'd say he was a bit stunned. He told me afterwards, he said that was the worst he taught at the time, the worst mentor and he ever got was he comes in, and I tell him come up with five things goodbye. I'll meet you next week. So off he went. And he came back with the five things. And lo and behold, helm was not in the list of five things. He had interest in other stuff. All right. But I kind of knew that I wanted that person to know what they wanted to get involved in. So they went away and got involved in tech on. And they made a couple of contributions and they were doing a bit of work. But as the months went on, I didn't notice him jumping to the committer, you know, getting more a serial contributor, reviewing more and more. And I eventually said to him about six, seven months in, I said, look, what's the story? And he said to me, said I was afraid to tell you, but I don't like tech time. Now that's nothing against tech time. And if you're in tech time, do not attack me on the way home this evening. All right. But he was honest. And we found out afterwards he was more interested in K native. And once he got into K native, because he wanted to do it, he flourished. And he's doing unbelievable well after all. And you know what I mean? Every known again, he meets me says thanks for helping out. And you know what? That's relatable. Give him someone to help and listen to him, not telling him what to do. Yeah, Phil, you do it. Sorry. No problem. Yeah, we got a couple minutes left. We thought we'd connect. You know, we talked about how we got involved in open source initially. You know, kind of the where are we now? For me, you know, I've been now 10 years in, you know, almost spending the bulk of my time focused on open source as a project maintainer as a technical oversight board member in the OCI, a CNCF ambassador, and then focusing, you know, all the things I've learned trying to help others at AWS similar to what I did when I was at IBM, being a subject matter expert, helping other teams were figure out, hey, we have an open source project. We like to launch. Can you help us think through what that looks like? So it's kind of an exciting point for me to feel like I'm almost more focused on helping others now than so much, you know, trying to get involved in open source myself. Yeah, a bit like Phil, I didn't put the specifics in our general digital shows, you know, left hand, right hand doing different things as we fill it in. But, you know, for me, I think it's been, you know, people believe in me, helping me in the communities and know a chance to help other folks do it. That gives me great joy when I when we do the jumpstart and I come in on a Monday and I'm pissed off for whatever reason, because it's a Monday, maybe, you know, it's just to get the joy of helping folks and then also being able to help teams internally if they need a hand with open source. So to just finish out, okay, there are no free dinners in life, my dad, you saw with saying. And he's right. If you're going to consume something, give back, because it's the best way of driving things forward and knowing what's going on. What we've learned from working in open sources, and for me, definitely, is collaboration, the ability to work together, no matter where we're from, who we are, it doesn't matter. As long as you're a decent person, you're willing to work away, you know, you will get things done. And that's what team works about. All the best teams, especially sport teams, I'm going to land at Lorna, don't worry, especially sport teams, all right, they work the best when everybody is willing to do the job that they need to do, not they don't have to be the heroes. And finally, it's a great place for people to grow in their careers and their life. And if you're a senior leader or someone who's in the community that's done really, really great stuff, please help other people because that's what life is about. Great, Margot. Awesome, thank you. Q&A, I will run this back and forth. Any other questions? There'll be a few jelly babies in for you. Yes. What is the biggest community lesson you learned from OpenSack and how have you seen that applied in open source projects that have gotten large since, like for example, Kubernetes? Well, you're handing that to me. Well, you spent more time in OpenSack than I did. I feel like I didn't. I actually can't answer that question. No, no, no, no. I suppose, like from my experience, OpenSack, I had really great experiences from it. I thought the collaboration was really, really good there. And I think that was brought forward into the cloud native communities afterwards like Kubernetes, etc. So I think a lot of folks went and worked in in the Kubernetes communities with new people that came in. But I think the key at all times was that people understand that collaboration and being decent to each other and that you're trying to work towards the bigger thing. We don't need heroes in other words. Thank you. Anyone else? Can't be good. Either we did really well or people are bored out of their lives. Yeah, very possible. Okay. Thanks very much. You need folks and thank you.
Building an Open Source Community One Friend at a Time
Okay, great. Thanks for coming to my talk. It's going to be building an open source community one friend at a time. So we're in the community dev room here, but I've been in a couple of talks. No one has actually really defined what exactly community is. So maybe I'll start there. Here's a photos of a couple different communities that I'm a part of. And as you can see from this, there are many different types of communities. There's open source communities. There's my biking community. There's my friends. There's my running club. And there's also my party friends. And these communities all have different strengths. And one way to look at that would be how often do I see these people? For instance, I see my bike friends usually on vacation. Open source people is kind of ad hoc. My friends is daily. My running club is weekly. You know, all these kind of have their own rhythm and tempo to them. And the other thing about these communities is they're not kind of like independent, separate groups. A lot of these are overlapping, right? So someone that I bike with is someone I party with. Someone that I see every day is also someone that goes to my running club. Right? Community isn't this kind of one thing. It's this amorphous group of people. And really what I think community is, is it's built on these one-to-one connections between peoples. So this is kind of a microservices diagram. You can think about all these different microservices talking to each other. And these connections are kind of what holds the whole network together. And I would say it's the same thing about community. You have all these different nodes all overlapping, different strengths communicating to each other in different ways. And I would really say our community is built on these one-to-one friendships. So the question I'm trying to answer today is how can we build and strengthen these friendships to create better communities? And at this point you might be asking why is this guy standing in front of the room? So a little bit about me. From a large Irish Catholic family, all of our family gatherings were a lot of people around all the time. And I think that's really where I started to get some of my community from. And in college, the first thing that I did was join the student union. And one of the things that they had was kind of a saying. And it's in Latin, obviously, because it's an old-fashioned society. But what it is, roughly translated, has really stuck with me. Light increases through human interaction. And I think that drives a lot of what I do today. So with that, let's actually step out of the software world for a second. I know we're here at FOSSTEM. But if you take nothing else away from this talk, I want you to understand that there's a lot of communities in this world and we can learn a lot from the ones outside of open source. So what other community am I a part of? I'm a part of Midnight Runners. That's why I'm wearing this t-shirt today. And what exactly is Midnight Runners? Well, we run together. We work out together. And we socialize together. And really, this is the ethos that we have together of fitness plus social. That's what Midnight Runners is. And what we're trying to do is dismantle the idea that fitness needs to be hard and unenjoyable by combining this social aspect into it. And this idea has really resonated with a lot of people. Midnight Runners is now a global community with 17 cities around the world. And not only like, you know, kind of big scale, but bring this all the way back down to me. Midnight Runners is a community has given me a lot too. It's given me fitness. It's given me a place I call home. It's given me my relationship. And it's also given me friends in Berlin and all around the world. So how do we go about building community at Midnight Runners? I think the first thing that a lot of communities need to think about is what is your wow factor? For Midnight Runners, it's this fitness plus social. I remember the first time I went to a Midnight Runners run. I was a little bit lost in Berlin as a lot of people in Berlin are. I was struggling about what to do next. I was thinking about moving back to Berlin, sorry, moving back to the U.S. And I went to this Wednesday run and I was like, this is it. This is so amazing that the next week I changed my flight so I could come back early just to go again. So when somebody comes into your community, what is that wow factor that they're going to say, this is so amazing, I want to come back? And as we're saying, community is all about the people. And so in Midnight Runners, we have the captains, which we think of as the pillars of the community. This wow factor starts with the captains because that's the initial reason that people are going to come back to Midnight Runners, that first experience, that first interaction. Because at the end of the day, it's about the people, not the workout. And in any community, it's always about the X. It's about the people, not whatever else you're doing. And so how do we, its captains, kind of create this experience? We welcome people, we meet new members, we introduce people to each other, making sure that we join those different nodes in the network together. During the run, we tell people how it works, they understand what to do. We engage in the experience that we're part of the participants too. And we really do create that MR wow so that people are like, this is something new, this is something different, this is something I need in my life. And then afterwards, we encourage people to join us socially afterwards and try to remember people every single week. And there's lots of different types of captains and Midnight Runners. There's the social butterfly who likes to talk to everyone. There's the caribbean who's trying to be representative and taking care of our community. There's the party animal who brings that energy, excitement, that engagement to the community. There's the creative magician who's sparking joy and making magical moments that people want to come back for. And after the people, how else do we do it? Well, we do these things called party runs. And these are really creating memorable moments for people in our community. And that's our boot camp run that ends up with a party afterwards. And it's themed to increase the sense of community, people like, I'm part of something. And kind of taking all together what Midnight Runners does is it gives people a purpose, a fitness plus friends. And that's why people come back every single week to 17 cities around the globe. So that's kind of the step out of open source. Let's step back in. How does all of what I've said so far translate into the open source world? Well, the other thing you should know about me is I work for iSovalent. They're the creators of Cillium and EBPF. And I'm going to be talking about my experience in both of those communities, building and growing each of them. And if you're not familiar with us, our kind of ethos is that EBPF is the future of infrastructure. And it's how things are going to be built for the cloud native world and beyond. And if you're not familiar with the Cillium project, it's actually one of the fastest moving projects in the Linux Foundation ecosystem. You can see it on the graph here behind Linux and Kubernetes. And if you also aren't familiar with Cillium, it's really become the cloud native networking standard. There's now certifications around it. Cillium is a CNCF graduated project. People are hiring others with Cillium experience. And my company actually just got recently acquired by Cisco. And so that's where Cillium comes from. And what we've created. But how do we actually get there? And how do we create the open source community around the project? The first thing I'm talking about is the wow effect. What is the wow effect for us? And for us, that's EBPF. Now, I think I don't believe in Tonex software engineers. And maybe you don't either. But EBPF really is a Tonex software. It can help you do things like networking, observability, security, tracing, Tonex better than previous generations of technologies. These examples, graphs are from different blog posts that people have created in the community. And when people come in, they say, wow, EBPF really is a different way of building, creating, running, maintaining software. And that's the wow effect people come for. And now somebody's interested. They're like, I want to know more about EBPF. I want to know more about Cillium. So they're coming into your community. What do you do? The first thing you do is say hello. A simple way to do this is if you have a Slack channel, you know, set up a greet bot. You can say, hey, welcome, X. We're happy to have you in our community here. This is quite simple to set up and is an automated way to welcome people into your community. Next thing is document resources to get people started. You say, these are the jumping off points. Somebody wants to have a path, a couple stepping stones. They know what the next steps. For midnight runners, I'm going to come back next Wednesday and do this again. If you're starting in the Cillium ecosystem, it's check out the website, check out the docs. These are good places to learn about the community and get started. Then I think this is also nice, give no one point of contact. I actually put my contact info and I say, hey, if you have any questions, message me on Slack. People feel like there's someone that they can talk to and they're not alone in this community. Bringing them into the network. The next thing, if you saw Paris's talk, is give them someone to look up to. We say there's contributors and there's committers and tainters of the project. Let them know that there's a ladder. There's a way for them to get involved. It's great for them to know that. Also give them somewhere to look up to. Give them different roles that they can grow into and grow towards. I think there's been a lot of talk in this room so far today about giving people job titles, giving people something they can bring back to their work, something they can bring back to themselves, being say, hey, I'm a part of this. This is what I'm doing for this community. In Cillium, we have a couple different roles laid out in our governance stocks. When somebody's into your community and wants to become a part of it, make sure you engage with them. People want to have that one-to-one connection. This friendship, if you want to do marketing, that's great. It's one-to-many. If you want to do community, that's great. It's one-to-one. So engage with people one-on-one. Say, hey, thank you. I really appreciate the work you're doing and that you're part of this with us. Beyond that one-to-one interaction, you should also recognize the work that people are doing. What I do every week is I go around the community and see what people have done and do in newsletter. Being like, hey, look at the great content that people have been creating and call them out by name. Once you kind of do that, you start to realize that we really have not just our own community. It's what I was saying before. There's all of these overlapping communities. If you're in the CNC, if you go system, you'll recognize the landscape. Some people would say overcrowded or a hellscape. But I think what is really great about this is that it shows that there's so many overlapping communities that we can take advantage of, partner with, and learn and grow from each other. People aren't a part of a single community and you can bring people into yours. Another way that we do that is we're having a still-in-meaning EBPF day at KubeCon. Going off back in the strength of the Kubernetes ecosystem to help grow our own community. In fact, if you think about it, that's exactly what we're doing here at FOSSTEM. All the different dev rooms, right? Those are all overlapping communities. And probably each person here goes into a different set of them and goes to a different set of talks. And I think that's great because you're creating the experience exactly that you want. And we're all building our communities together. The next thing is to spark joy in the community. Do something that's unique, that's something memorable. So one of our community members created a Cilium cake. Swag is another common way in open source to do that. We also created trading cards about two of our maintainers as like a fun thing to do. And beyond these little things, also create really memorable moments for your community. So for example, we created an EBPF documentary. We created the illustrated children's guide to EBPF. And I have some copies here if you want to see what that looks like afterwards. And we also created a Lego set for Cilium. So people would be like, hey, remember when we had the documentary, we had the children's book, here's my Lego. That's what people are going to look back at, these memorable moments, and you want to give that to them. Because what all of this does is it gives people an identity that allows them to say, hey, I'm a part of this with these other people. Because at the end of the day, what people really want to have is something that gives them connections, something that gives them a purpose. And these friendships are really the building blocks that keep our communities running. Thank you for coming to my talk. Thank you. Hey, thank you for your talk. I really liked the tips for getting the people engaged in community. I have one question regarding welcoming newcomers, because you showed us your grid bot in your Slack, which was kind of like a long block of text links. I'd like to know if you measure some kind of click through rate into that. And also then you followed with one to one interactions. But all of those on your slides were connected to some contributions. So I would like to know if there is something in between, you know, between the grid bot, and then the one to one interactions. If we have some, you know, maybe Buddy East or something like that. Thank you. Okay, yeah. So do we have something between the grid bot and getting people further? We don't currently. I think that's something I'd like to work on. You know, kind of like step one and then how do we get to step two. And I think people kind of have to find their own way in the community right now. And then once they get to step three, they wrote a blog post, you know, then I start to get to know them more. I think this is also like a really difficult thing in open source too, because it's also a lot of people, you know, do like a drive by contribution. And how do you know, and you never know which ones are going to stay around and which ones aren't. And like, unfortunately, you can't invest everything into every single contributor. And so this may be extremely controversial, but like maybe that's like a good weeding process is like if they can find themselves to step three, if you don't always give them step two, I don't know, like, it's always great to have more contributors. But yeah, so that's my controversial opinion for today. Say what you will. So I have a question. I find myself as much of an introvert person. So I don't like fitness meetings and big parties. So if you have any advice for people like that to build communities of like mental people, even open to other people. Yeah, I would say it's all the same advice. If you actually look at the structure of my deck afterwards, you say you can see the same structure in Midnight Runners. This applies to the same structure in the open source community. I was just doing the Midnight Runners example, because that's the one I'm most familiar with. And I really do think like when you're trying to create a community that wow effect is like, okay, what is this community trying to bring people around? You know, that may be EDPF, it may be running, it may be like we really like playing pool, we really like, I don't know, going to Quiz Night, right? And like give people a reason like why people want a purpose in their life and see if you can give that to them. Right, that's why the community initially forms. And you kind of create those next things to build the community around that. You need to kind of have that core purpose that people are going to come to. And I think that's the most important thing. And it can be about anything. It doesn't have to be about fitness. Hi. Hi, Biff. Sorry, didn't you put it to sound that way? I was going to ask you a funny question of why don't we go out for lunch more, but I actually do have a legitimate question for you about your talk. So one of the things that you said that Isovalent and Cilium do to sort of help create that wow factor is like, like merch because then it's a sort of physical reminder and a memory that people create. By the way, everyone, Cilium has great merch and EDPF has fantastic merch and you should absolutely find them after. So my question for you is in our trying economic times of tightening budgets on everything, how do you justify the cost internally? You get acquired? No, sorry, that's not the right answer. I think there's also like lots of ways to do like stuff that isn't like physical merch and doesn't have to be like doesn't have to cost money. Like actually, like one of my favorite things is, sorry, is that a lot of people actually like is like the EB Dex. And so we have this repo of all these different bees. You know, we wrote back stories about each of them. Like this is something that people really like and it didn't cost the cost of a GitHub repo, which is free. And I know a lot of people like to use these on slides. I think probably the favorite one is or my personal favorite one is Excel BPF here. There's a great April Fool's article that we wrote last year. I mean, this is another memorable moment, right? Like we wrote like now launching Excel BPF, right? And it didn't cost any money to write the blog post. I mean, a bit of time, but like that's not something you have to put in a budget line by line item somewhere. And so I think there's lots of ways to do it without like creating physical stuff too. We've got time for one more question. I'm doing my midnight runners. Every Wednesday at 7 30. Thank you for the talk. So you mentioned about the wow factor about creating that wow factor. But say now you already have a wow factor. How do you like market yourself that okay, we are out there. How do you make people aware of your community and they can come and join your community? Is there any like, you know, some tricks or some some things that we can do to grow the community? I would say like if you want to form a community and you've decided like here's our wow factor, here's our purpose, whatever. Know that there's, you know, a million other communities out there. And I think probably the most important thing when you're starting it is like, don't think about your.
Strategies for Building Healthy Open Source Communities
So, I'm going to talk today about strategies for building healthy open source communities. I wanted to start by just quickly thanking the Alfa Peace Loan Foundation. So, they fund the Chaos Data Science Initiative, which pays me, and also thanks to the Linux Foundation and Board Foundation, which also provides support for the projects. I have been in the technology industry for well over 20 years, working mostly on open source projects with a focus on community strategy metrics and growing your contributor base. And I can tell you that it is really, really hard to build a strong open source community for a project. Most of us struggle with finding enough humans to sustain our projects. So, let's start by talking just a little bit about the problem and why it can be so hard to achieve sustainable communities for open source projects. Like I said, the problem is hard. I like to start my community talks with a quote from an alien life form on Star Trek the next generation who described humans as ugly bags of mostly water. Now, I don't think we're ugly, so I think they missed that part. But we're super squishy, right? And not just in the physical sense. We can be unpredictable. We can be irrational, especially when we're stressed out, overworked, burnt out. And the reality is, right, we're not robots. We're not mindless automatons. We have feelings. We have bad days. We have other commitments and we have personal challenges in our lives that are often completely invisible to other contributors. And they can get in the way of our contributions to open source projects. But you can't have an open source project without having human beings to maintain it. So you need to be able to encourage people to participate in ways that are sustainable over the long term, both for your project and also for those people. And it helps to be proactive and ask people to participate in specific ways and in ways that match the work you need to do within your project. Now many projects struggle to find people who will actively participate in their projects and continue to participate over the long term. If it was easy, you'd already have all the people you need to maintain your project. We wouldn't need this dev room. And none of you would be here watching this talk. But I think a common theme throughout all of the presentations in this dev room so far really has been that we're in a situation now where there are a lot of open source projects and not enough contributors and not enough resources to maintain those projects. So maintainers are burning out and they're in desperate need of help. And sometimes it can be really difficult to get people to contribute to your project. And unfortunately there's no magic, there's no one size fits all solution. So throughout this talk I'll focus on some things you can do to increase the chances of building a community and growing contributors for your project. Now that we've talked about the problem and some of the challenges, I'll shift into talking about strategies for building healthy communities. So next I'll talk about, and after that I'll talk about us taking a strategic goals based approach to metrics. And then finally I'll talk about some metrics you can use to measure project sustainability. To grow your community along with some resources and some final thoughts at the end. So as promised let's start by talking about developing and executing on a long term strategy for building a healthy community. Including motivation, project governance, new contributor onboarding, roadmaps, contributor ladders which you might have heard before from some of the talks, and leadership. To have people's motivations for contributing to your project vary widely. Some people are contributing as a part of their job while others might contribute to gain experience or maybe learn about a particular language or particular technology. Regardless of why they showed up there are some things you can do to motivate them to stick around. Clear communication, working in the open, and reducing friction are key to helping people stick around. And I'll talk more in the upcoming slides about the importance of explicit and clearly communicated project governance along with onboarding docs and fostering a welcoming community. There are also other things you can do to motivate people to contribute. Having good first issues or help wanted labels are excellent places to start because these help those humans find something they can work on while they learn more about your project. Good first issues and help wanted labels are passive requests for help. So I also encourage people to be proactive and specific about ways that people can help. Asking someone specific to review a PR or answer a question or respond to an issue demonstrates that you recognize their unique expertise and that you want their help with it. Anything that we're wanted and appreciated makes us squishy humans feel good, right? Which can be a strong motivator to contribute to an open source project or to continue contributing over time. People can also be more motivated to contribute when all of the project work is done in the open where they can participate as equals. When some of the work is done within the walls of a company or maybe inside a close knit group of maintainers it can leave the rest of us feeling left out and demotivated. A lot of people like to hate on project governance. It's just extra paperwork, it's busy work, it's politicking, it gets in the way of doing the real work on the project. But this isn't true of good governance which is really just about setting the expectations and getting all of the various humans within your community collaborating together. Ultimately the focus of project governance is on people, the roles we play, our responsibilities, how we make decisions and what we should expect from each other as part of participating in the community. The goal should be to make the processes for participation as obvious as possible. Even for people who are brand new to the community. Having clear rules about how collaboration occurs, how decisions are made, what types of contributions are in or out of scope helps community members make contributions that are likely to be accepted and embraced by the project. This helps avoid wasting people's time with contributions that maybe just aren't aligned with the project for whatever reason. A healthy project with clear governance makes the humans happy and it sets your project up for future growth and long term success. The good news is you don't have to start from scratch. The link we have here is some good templates with some instructions that apply to most projects if you want to quickly and easily build out some basic governance for your project. It's a lot more difficult to participate in a community if you don't know anything about the role you might play, the expectations, the key players or any of the rules for participating. That explicit documented project governance gives you both new and existing contributors a clear path to guide them through your project. Spending a bit of time documenting that governance up front can save you a lot of time later with fewer questions about how things work and it gives you a document that you can point those other humans to if they have questions. When I start contributing to an open source project, I want to know how decisions are made, who makes those decisions, where the discussions about those decisions happen, which helps me understand whether those decisions are made fairly and out in the open. The bottom line is that if the processes for collaboration and decision making are not clearly documented as part of the project governance, this introduces uncertainty into the mix and uncertainty makes the humans nervous. It increases the barrier to contribution and it jeopardizes the health and viability of your project. Good documentation is how we scale the things that take up precious time for the already overworked human beings, like answering the same onboarding questions over and over and over and over. I see so many open source projects with contributing guides that don't actually provide any useful information for people who are contributing. At a minimum, a new contributor needs to understand how to spin up an environment where they can do their development, the expectations for testing, how to run tests, and any processes or other expectations that you have for pull requests and then instructions for any other requirements you might have. If this is all well documented, new contributors can get started with a minimal amount of help from the existing maintainers, which can save you a lot of time in the long run. When a project doesn't have good onboarding docs, those poor, squishy, burnt out maintainers can get frustrated by the amount of time they spend on new contributor questions, which can make it hard for contributors to feel welcome. It'll take a longer time for them to become productive. This is how the humans get discouraged and then just drift away from your project. This does not mean that you need to spend weeks and months writing the perfect onboarding documentation. At this point, anything is better than nothing. If you start with a few things that help people actually get started quickly, then new contributors can help make those onboarding documents better by adding more details and maybe some additional instructions for something that they found confusing or that they struggled with. Then after onboarding, people need to be able to find something to work on. Having public roadmaps is a great way to do your planning in the open, while helping people find something to work on that aligns with the direction of the project. If you were here yesterday for Lori Apples' talk, she talked a bit about roadmaps as well. Roadmaps provide some crucial functions within open source projects, including setting the direction of the project, prioritizing tasks, organizing the work, and attracting and retaining contributors, and also providing transparency into where the project is heading. I think a lot of people underestimate the impact that a well-defined and up-to-date roadmap can have when building community around a project. They can help guide everyone toward achieving common goals and having a shared vision about the future of a project to help contributors work on activities that are aligned with that vision. The document linked on the slide has loads of detailed information about building a roadmap for your open source project. One of the most important things to think about is how you'll maintain that roadmap over time and actually keep it up-to-date. It can help to use tools that are already part of your development or your community processes, like GitHub project boards, for example, if you use GitHub, so that people don't need to use yet another tool. If you have community or developer meetings, it can help to have someone walk through the roadmap every couple of weeks just to talk about the things that are blocked or need help. Maybe set aside some focus time once or twice a year to think about the future of the project, and then you can incorporate that back into the roadmap. Bonus points if you can find a really good project manager who can help with the process. Your project should also be designed to keep diversity, equity, and inclusion top of mind. Building a diverse community where all of these humans feel welcome and included doesn't just happen. It requires putting work and thought into it. But this time is well spent, right? Providing an environment where everyone, including people from marginalized populations, feels safe is the first step toward building a diverse community around your project. Ideally having programs that give people opportunities for shadowing, mentoring, and sponsoring new potential leaders can help you grow a diverse set of people into new leaders for your project. Paris talked a bit about this. The Kubernetes experience, sorry, the Kubernetes contributor experience special interest group is a really great place to see some examples of how to implement programs for things like shadowing and mentoring. And projects that make a concerted effort to actually bring in new people from a variety of backgrounds and have programs in place to move them into leadership positions are more likely to benefit from increased innovation and just have a healthier contributor community. And by having a diverse and welcoming community, you have the advantage of getting those humans who might not feel welcome in some other projects. Now Paris and Bill both talked about contributor ladders. Defining the roles and responsibilities for contributors, reviewers, and maintainers can really help with recruiting new humans into these roles. It can help to think about this as a ladder where contributors can climb up to become reviewers and those reviewers can become maintainers. But what's important is to document and make sure that people understand how they can climb that ladder and how they can gain more responsibilities within your project. A contributor ladder usually outlines the different contributor roles within the project and along with the responsibilities and privileges that come with them. Having a contributor ladder helps set expectations for the roles and it encourages people to think about how they might take on areas of increasing responsibility within the project. And as you get more of the humans moving into maintainer roles, you can reduce the load of the existing maintainers. And the good news is, again, there's a template that you can use to avoid building this from scratch. This one was based on Kubernetes, so it probably has more roles than you need, but you can simplify it, customize it, make it work for whatever your project needs. Paris talked a little bit about emeritus as well, so I feel like I'm just dovetailing on all the things Paris said. But humans like to think of ourselves as irreplaceable. We are not. We move on to other jobs. We burn out. We retire. And let's face it, unlike the robots, humans are mortal and we do not live forever. You should think about what you might want to do next and how you can prepare someone else to take over after you move on. I encourage projects to have an option for people to move into emeritus roles, which recognizes the hard work that they've put in into a project and gives others a point of contact. If they have any questions about what came before, while also allowing you to step away from the day-to-day responsibilities of the project. And I encourage you to think of stepping into an emeritus role as a successful way of just sort of handing off your duties to the next generation of maintainers for a project. Now, I've talked a lot about things you can improve. Metrics can help you decide where you need to improve your community and measure your progress after making improvements. But quite a few people seem to take what feels to me like a random approach to metrics by measuring the things that they see other people measuring or gathering the metrics maybe that are easiest to collect. And maybe this even provides something useful. But I encourage you to think about your goals and take maybe a less random and more strategic approach by focusing on those goals. And when I say start with the goals, I don't actually mean start with your goals for the community. I actually think you need to take a few steps back and start at the very top. What's important for your organization or what's important for your project as a whole? And this in a lot of cases has been a company in my case, but it could be an organization like a foundation. It could just be the project instead of an organization. But you should start by looking at what that organization or project hopes to accomplish and what its goals and objectives are. And then you can take this down a level and figure out what your goals are as a community. And your roadmap can be one input into this whole process. And the most important part of putting together the strategies and plans for your open source contributions is then aligning them with the overall goals of your project. If your goals for the community don't support the overall goals for the project, you aren't likely to be successful. So it's worth the time to figure out what you want to do and how it supports the rest of the project or how it supports what your organization is trying to achieve. Once you figure out what you want to do as a community and then can tie it back to the bigger organization or the project, then you can start looking at using metrics to measure your progress. So people often ask me, for example, of the projects with the best metrics. But I really just don't think that's a good approach. What you measure depends on your goals and what you're trying to achieve, which may be completely different for other projects. So I prefer to encourage people to start by defining their goals. And ultimately, you need to look at your strategies and plans and come up with criteria that will help you measure whether or not you are successful. For example, if you want to improve the performance of a particular piece of open source software, measuring commits is not going to get you that. You actually need to have success criteria and measurement based on those types of performance you're trying to improve. If you want to gain influence within an open source project, maybe you work at a company. Maybe you measure increases in contributions or the number of employees who are moving into positions of leadership. And as with any good strategies and plans, hopefully the outcome and results should be measurable so that you can tell whether your efforts are successful. And this is where your metrics come in handy. Once you decide on your success criteria, you need to make sure that you can get the data required to measure it and maybe start measuring it now, get a good baseline of data. And there are loads of tools available to measure data about open source projects. Some of the commonly used tools can be found in the Linux Foundation's Chaos project where I work. But there are also loads of other tools and lots of big projects already have dashboards using either the Chaos tools or CNCF uses DevStats. There are loads of tools available for doing this. Since this is a presentation about building community, I encourage you to focus on your goals while also thinking about your time would be best spent on community activities. I've given a lot of suggestions so far in this presentation and you should not try to do everything at once. So I recommend that you think strategically about where you should start while keeping your goals top of mind. If you know you've had people interested in contributing but they've given up when they couldn't get started, maybe you should start with onboarding docs. If you have a lot of casual contributors, maybe you focus on the contributor ladder and governance to help move some of those other humans up to take on more leadership positions. An excellent way to free up time from maintainers is by getting help with different types of contributions that take up valuable time and are actually required to make an open source project successful. Things like documentation, marketing, community management, project management, and many more roles. For projects with complex code bases especially, it can sometimes be easier to onboard people into these roles first to free up some time to onboard other contributors later. This also has the advantage of bringing people in to help with things that can have a big impact on growing your community like roadmaps, governance, and other documentation. Time is precious. So it is important to identify the problem areas within your community where you can focus on the right things while avoiding wasting time on areas that are already working well. However, metrics do need to be interpreted in light of your goals, how you operate as a community, and all of the other things happening within your project. There's no one size fits all interpretation of metrics. So in this next section, I'll use some example graphs from some of our chaos metrics and talk about what some trends might indicate and how to think about addressing potential issues. One key area to look at for your project is responsiveness. This is a backlog graph from the chaos grammar lab tool. In this project, you can see that there are times where they've got a lot of PRs in the backlog that need to be merged or closed. Now, if these PRs are coming from several regular contributors who aren't maintainers, it might be a good idea to look at how you can promote some of those humans to become reviewers or maintainers to help out with the workload. But as with any metrics, you need to interpret them in light of your project. There are other things that can cause an increase in the backlog, like everyone preparing for a big release or maybe a big conference or just vacation season that might not be resolved by moving more people into leadership. Again, these graphs come from grammar lab. Other metrics to look at responsiveness focus on the amount of time it takes for maintainers to close issues in PRs. Looking at trends for these metrics is particularly important. This example, you can see that it's taking a lot longer for maintainers to close issues or PRs. It might be a good idea to look at how you can promote some more humans to become reviewers, maintainers, help with the workload. Again, you need to interpret this in light of your project. There are other things that can cause an increase in time to close, like the project becoming more complex or becoming larger, which can just increase the time required for things like testing and other activities that would happen in the process of reviewing and closing PRs. It can also help to look at the types of contributors that you have. In this case, casual contributors are those drive-through contributors who make a small handful of contributions and then disappear possibly forever. Regular contributors are the ones who make some contributions and then they stick around and continue to make contributions over a period of time. Core contributors are usually the maintainers who are there for the long term. You can really learn a lot from this graph. If you have a very small number of casual and regular contributors, this can mean that people don't have the information needed to become productive and to contribute. In some cases, onboarding docs can help solve these issues. Another thing this graph can indicate is whether there may be some fundamental issues within the project that are driving the humans away from your project. If you see the total number of contributors declining or the number of regular contributors declining, this can indicate some deeper issues, maybe toxic community members or an unwelcoming environment, and that probably needs to be resolved before you do anything else. Or it can mean there are other issues with things like lack of responsiveness. This metric is often called the bus factor or lottery factor based on the idea that if one person disappeared after winning the lottery and that person was making all of the contributions, then the project would probably be in trouble if they left. This graph uses data from Chaos's Augur software. I recommend measuring this because there are a few things that can tell you. First of all, how big of an issue is your current contributor situation? If it's like this one, you really should focus on getting some additional contributors and maintainers. You also might find that there are people who are contributing more than you realized, which is the other reason this is a good metric. This can help you think about who you can encourage to contribute more or maybe find someone who can move up the ladder into a leadership role. So you might look at some of those people who are a little bit lower down on the graph and see if you can promote them up into being a maintainer. The catch here and with so many metrics is that we don't want to just think about the people who are making commits. This is a good start, right? It's a start, but you should also be thinking about how to move people into leadership positions to be responsible for things that might not show up in GitHub, like documentation, community management, marketing, mentorship, lots of other important roles. And metrics are not something that you look at once and never revisit. It's important to think about metrics gathering as an ongoing process of measuring, improving, and monitoring. So you think about your goals and what you want to achieve. You pick some metrics. You make improvements, and then you monitor that over time. And before I wrap up the talk, here are just a few resources that you might find useful. There's some great stuff there from the CNCF contributor strategy tag around how to use and templates. The Open Source Way guidebook is just another one of my favorite community resources. And then the chaos metrics. We also have a Slack channel. You're welcome to join us. Anyone can participate in the chaos project. Maintaining an open source project is so much work. And there are so many maintainers who are overworked, exhausted, and burning out. The best way to address this challenge is by finding more humans and growing your contributor community. But it's hard work, right? And it takes time away from the day-to-day activities now, which can be super hard to justify if you feel like you're barely keeping up as it is. In the longer term, spending at least a little time on things that can help you recruit and keep new contributors will be worth it. And as I mentioned before, you don't need to do everything at once. Think about your goals. Use your metrics to help you figure out where your time would be best spent. So this is what I'm asking you to do. If you're a contributor to an open source project, carve out maybe an hour a week to improve your onboarding docs, your contributing guide, your project governance metrics, or just spend that time helping another human learn to do something new in the community. With that, thank you. And I'll, I think we have another two minutes for questions. Yes? Thanks for the presentation. It seems that some of the ideas that you presented, the contribution layer, the layer. Sorry, can you speak up a little bit? Thanks. Thanks for the presentation. It looks like some of the ideas that you presented, like the contribution ladder, can be maybe at odds with a project that is really owned by a company or where there is a strong presence of the company. Do you believe that there is a way to resolve this? Yes, I do think that there's a way to resolve that. I do think that sometimes the governance and the contributor ladders sometimes work a little bit differently when you're talking about projects that are owned by companies. I think that the best thing that the company can do is to be honest about what roles are really open to people from the community and which ones might not be. And that might not be something that your company wants to be transparent about, but I think if you're really trying to build a community around it, I do think you have to be transparent about that. And I think that the people that will stick around in your community will at least respect that transparency, even if maybe it's not the answer that they wanted to hear. So I think there's definitely room for that, but it will look a little bit different and you will have to have that balance between the company and the community. Hey, thank you, Don. All we have time for. Thanks. Thanks. Thank you.
Building Communities with Science!
I would like to introduce our last two speakers of the day, last but not least. We've got Stephen and Mike. Check. Yes, there we go. Hi. Hi there. I'm Steve Jacobs. I am the director of the Open at RIT, though I have been doing teaching students and faculty and staff about open source since 2009 when we started making educational games for OLPC. This is my first thought stem. So thank you very much. And hi, I'm Mike Nolan. I'm the associate director of Open at RIT. I do not have a career tenure as long as Steve. But I have been doing open source for around a decade now across many different sectors and industry from the humanitarian aid sector and now working in academia. And today we want to talk to you about using science to build open source communities. So I'm sure if you've been sitting in the deaf room all day, you'll probably have noticed some themes around community building and different methods. And you'll probably see some of those themes in ours, but in particular what I'm interested in sort of conveying with this talk is how we use evidence based practices to really figure out what sort of things are necessary for building communities and various types of open work projects. So just as a quick note, you know, while Steve and I are up here, we have a long and storied history of hiring team members and students and very talented people to work on some of the things that we're going to talk about today. Two of those students are Django and Daechon who are fantastic designers. I think both of them might be looking for work, design work and open source. So if you're interested, please reach out. I'd be happy to connect to you with them. They're both amazing and talented. So before we get into the nitty gritty details, Steve, do you want to talk a little bit about who we are and what we do? Sure. So what is now open at RIT has its roots in the education program. We started in 2009 when we wanted students to make educational games for one laptop or child. That went from a seminar to a standard course to multiple courses and a co-op paid internship program that led to the only academic minor in free and open source software and free culture on the planet. We're our alums, number of folks like Justin Flory who is running the Fedora community these days, Remy Decausmaker who is the head of the first open source programs office in the federal government and center for Medicare and Medicaid. So we built things like Fedora badges. We laid the groundwork for the UNICEF venture teams, Roadmap and Milestone program for their fellowship people. So we've been around doing a lot of things. We've learned a lot from a lot of people including shout out Elizabeth Barron up there in the back who is the chaos community manager who God bless her everyone. And several years ago when universities started talking about having OSPOs within universities, we decided to spun one up with a second one in the U.S. anyway to be an open source programs office. Except we don't call ourselves that. We call ourselves an open programs office and we talk about the fact that we support all academic open work. We don't want to send the designers, the artists, the people who don't feel like they do programming or they do research or they do formal academia away. We want everybody to do open stuff and we're there to support them to do that. God bless the Alpha P Sloan Foundation as they gave us some funding because they came to us and said the stuff you used to do out externally for UNICEF, we want you to do for your own faculty members and that's what we're going to be talking about a lot today. Did I miss anything? If you want you can talk about the pillars of the services. Okay, so formal academic education for students. We run a list for faculty, students and staff to look at to learn more about open. We do policy work internally for the university. We just released a position paper on the reasons we feel that federal institutions should fund the peer review for open science. Since open science is being pushed by most of our governments to try to fix pieces of science that are broken or inaccessible and peer review is one of those. We do work and research into digital infrastructure for community building. I was one of the first Ford Foundation digital infrastructure fellows many years ago and what we looked at was community issues within Pi PI, and the first types of things that Dawn was talking about, her comments about knowing what your community is about, what you're doing, you can look up the research and the fully open qualitative research we did, which means that uncharacteristically the people we interviewed signed away their privacy rights and allowed us to use their names and their transcripts from their interviews. If you look up conceptual mismatches, you'll find access to all that data, but to go to Dawn's point, one of the things that Pi PI ran into was road blocking themselves when the culture of the program made the people who got hired to do outreach community managing onboarding and governance ended up feeling guilty because they weren't pushing code. So we do that kind of infrastructure and policy work, community building work, and we do this fellowship work which we pioneered for us anyway, working with UNICEF, extended to our own faculty with the Sloan Foundation support, and now do for anyone who would like to work with us. And so going into this fellowship work that we're talking about here, this is going to be the main subject of our services that we want to sort of be describing today because I think it's quite interesting for community building. This is the approach that we landed on. So to give a quick overview of what we mean when we say community building or fellowship, when we talk about people who were trying to provide services to, these people are maintainers of open work. I'm sure many people here might be maintainers of open software, others might be academics creating data sets, or people publishing journals and stuff like that of open work, right? And they come to us because they're really interested in either building a community around a piece of work that they built, maybe growing an existing community, making it more inclusive, or bringing in a new type of potential contributor or user, or maybe they're just super burnt out and they're like, this is an entirely unsustainable way of maintaining this piece of work and I want to figure out what I can do or what sort of things are needed to make this more feasible to maintain this work. And so, well, we have slowly built over the last, I don't know, 10 years of working with different clients is this process where we provide a team of developers, designers, and community managers to tackle this specific problem, right? Not necessarily making direct contributions to them, but to tackle this problem of figuring out these community issues, what they are, and what resources are needed to overcome that. And so, what does this team actually do? Well, we kind of act as maybe a project accelerator, which is a pretty common term, and you see startup accelerators do this where you have a directed resource specifically of figuring out what your market is and stuff and what you're going to do. We do this with projects. So, a project maintainer will come to us and we'll help them launch or convert a project being open or we'll help them grow it or maintain it so they don't have to burn out. And so, as I said earlier, we do this with all kinds of projects. We have this link all over our slides quite often, this open work definition, so we don't just work with open software maintainers, but open scientists, open data repositories, OERs, and many other things as well. And kind of like the smaller cases, but things that often happen is because we're based in a university, we've had faculty come to us for help. And I use this project stuff all the time. I have all these teaching materials. They need teaching materials. How do I contribute it back? I don't understand what to do. So, we've helped faculty do that and as a result, we've also kind of done an analysis of that project's contributor pipeline and said, you know, if you kind of smooth things out here and here, you probably get more people putting stuff in. So, it kind of works both ways. You don't have to be a maintainer. And so, before we get into like the services, I think it's good to understand the background of what existing kind of resources and knowledge bases that we're pulling upon when we think about how we're going to be building communities. And so, like a big one is like the Mozilla Open Leadership Training Series, which, you know, I can't speak for these communities, but from my perspective have really influenced programs by Code for Science and Society or the Turing Way or Open Life Sciences and many other sort of community building services are offered by other institutions. And this training series was like an amazing thing that was created probably a decade ago at this point that really set out the foundations of like how you think about contributors and, you know, Don was talking about the contributor ladder, right, which I think is kind of a good allegory to contributor personas and pathways that are talked about by Mozilla. We also pulled from concepts from like Nadia Eggball's book, working in public, thinking about taxonomizing communities and understanding, you know, different, you know, depending on the type of community may face different issues. We also use design thinking in our process of ideating and creating solutions and iteratively testing them. And then, you know, obviously, all of this came together in large part when we were working with the UNICEF Venture Fund doing sort of consulting with the various projects that they were funding to build open source communities. Some of this knowledge has been stored in UNICEF's own sort of open source inventory. It's kind of taken a different shape at this point, but I encourage you to check it out if you're interested in the approach that they're using right now. And this also back fills the undergraduate education we do. We, the degree, the minor courses are open to students all across campus, not just computing students. It's, you know, as everyone has said before, you know, contribution is docs, it's onboarding, it's graphics, it's websites, I think kids build logos for the projects that they want to contribute to as part of the course. And somewhere between roughly 40% of the class is learning how to do analysis based on the type of graphs that Don was showing, that, you know, what, you know, both for themselves, you're going to have to do contributions by the end of the semester. Does this look like something you want to try to contribute to? Are they responsive? Are they supportive? What do the flows look like, right? Does, when your projects are due by the end of the semester line up to a point when they're not really taking contributions? So is that a good thing, right? So they learn to apply all that stuff. So when we start working with a project, in particular with a first engagement with a project, I think it can be, we try to divide it into these three sections of objectives that we're trying to work with, with our client. The first is, you know, it's kind of slowly learning more and more about the project and more and more about the, you know, what's going on. And through that you can figure out like what are the actual specific things that need to be created. So, you know, first and foremost we have to understand, you know, what the project is, where its goals, like why did you create this, why is openness, you know, purportedly something that's important and useful to the project and like what do they hope to get out of that. You know, then we try to figure out who are the potential stakeholders and stuff and like how would they potentially get involved. And then like from that we begin working backwards and say, okay, well what are the actual roadblocks that people are encountering when they try to get involved. What are the things that are missing? So, you know, I think just to give a little bit of additional detail for each of these stages, I kind of want to like pose a few questions that we're oftentimes asking our clients when we're working with them. Right? Like, what are you actually trying to do? Like what's the point of this project and resource that you're creating? You know, why is open source the thing that's important to it? Is it about, you know, getting people from other areas that you traditionally don't work with? Is it, you know, about inclusivity? And then, you know, do you have a community or people currently contributing? And what ways are they contributing? Maybe you have some software contributors but you know, you don't have contributors from a different place. Or, you know, maybe you have a lot of non-technical people involved in a software project but you don't actually have a lot of technical people. You know, that's super common in things like humanitarian software, right? And then like, what sort of community are you trying to create? And then after that, you know, we, once we kind of understand the objectives of the project, then we can start trying to figure out who are the different archetypes of stakeholders. So in Mozilla or in the Turing way, they talk about contributor personas, right? So these archetypal sort of contributors, maybe in our project we might have like a persona that's like focused on researchers. But then you might have another type of contributor that's like maybe coming from the private sector or something like that. And these people each have their own incentives and sort of reasons for getting involved in ways of engaging with. Then, you know, we begin theorizing like what's the ideal way of getting involved and this creates the pathway or maybe something akin to the ladder, contributor ladder. And so the stuff gets applied no matter what type of project it is, right? So when we did 25 different academic projects over two years running people through this thing, we had everybody from computational astrophysicists to great vineyard DNA people to deaf educators working on early childhood and learning languages and teaching international sign languages to my favorite acronym of any project we've ever worked on, the Victorian Autobiographical Information Network or VANE so that they could share their data on Victorian autobiography that was much broader than what they could put in their book, right? So all of this stuff works no matter what corner of the universe they're coming from, right? And, you know, collecting all this information on the contributor types and how they get involved, this has meant to create like an idealized version of what you hope your community to be, right? Who are the people that you actually want to be involved and, you know, what is like the end goal? And then from this end goal, we begin, you know, documenting like where are these like shortcomings, right? Where are the things that might be missing or preventing people from getting involved? And this stuff can be, you know, pretty simple in the end, right? It can be things like project documentation that's maybe missing or a lack of marketing materials or like just no outreach in specific, you know, either geographic or online areas where people aren't finding things or a lack of governance that is preventing people from getting the stage of like being maybe a first time, you know, drive through contributor and getting them into a thing where they're maybe approaching something more akin to a leadership role, right? And so it's this like very specific process that regardless of the type of project that we really try to stick to because by going through this process we're gathering the evidence of what's actually needed and what the materials are. And, you know, it took us some time to get to this because there's a real tendency of just being like, well, you know, just like I just want to read me template that will make my community grow, right? Just give me the best practices, right? But, you know, oftentimes when you hear like people who do community building, they're like, well, yeah, like there's best practices for sure. Like, you know, make sure you have a license and a code of conduct. These are good. But the realities of what it takes to build a community is very specific to your community. And so what we've worked really hard on is trying to come up with like a specific process of finding out what that is so you can know for sure. And so to talk kind of about the maybe the scientific methodology that we're using, we think about our engagements, you know, in two different ways. The first is developing solutions, which is kind of what I've gone through here, right? And the way that we do this beyond just asking the maintainer these questions is we're conducting qualitative semi-structured interview studies with various types of, yeah, thank you, big words. I just got my masters, so, you know, what can I say? We conduct these qualitative semi-structured interview studies with people to generate evidence on what sort of interventions are needed. So we're like, you know, heavy touch in this case, because what we do is we find out like after coming up with all these personas and pathways, we talk to all these different people in the community that we feel like are representing each type. We talk to people who were really successful in getting involved. We try to talk to people who are unsuccessful in getting involved. And through these interviews, we're generating data of like legitimate evidence as to what has worked, what hasn't worked. And through that, we can begin trying to derive, you know, kind of potential solutions to these problems, right? So if we know that people keep on having trouble with like, yeah, but I couldn't get the project up and running on my own computer, so I just kind of like ditched it and I tried to use something else. Then you know for sure that this is something that is worth putting work into. And then once we develop these solutions, maybe we write some documentation, right? You deploy it and then you want to be able to see if it was effective and like if things have changed and if new problems have popped up. And with this is kind of the next stage after that, right? Where we have a preference of taking a more mixed methods approach that involve using more quantitative data. You know, Don, I think, did a really great job talking of some of the different ways that you can use various software from chaos to evaluate against these goals that you've developed in the first stage. As well as continuing to do qualitative and confirming your conclusions that you're generating from the data with community members. So following up and talking with people and showing them, you know, hey, like, it seems like, you know, the amount of contributors have gone up here. Like, you know, maybe the bus factor has lowered in seeing if there's like new problems coming up. And so, you know, I think it's important to note that like the interview work and this qualitative work that we do and our community building, it's very heavy touch. And it's like kind of expensive, right? You got to like have people going out and like, you know, doing half hour interviews with random people over and over and collecting all the data and doing the transcripting and then like doing the coding. It takes a lot of work. But we find that it's really important because this work is often overshadowed and, you know, because this work is, it's often not thought about. And Steve, maybe this would be a good opportunity for you to talk about your research with conceptual mismatches and like what you were actually finding out. So, yeah, the conceptual mismatches effort. PyPI, the Python Packaging Index, had originally come to life as many open source projects do. Somebody had an itch, they have to scratch it, they built it, and it wasn't created for tens of thousands of people using it on a daily basis, right? So it needed to be re-scoped, re-architected, refactored, right? All those things, right? And they hit a wall. And they hit a wall even after they got funding to do that work. They hit a wall because of different perceptions of what being a community member, being a contributor, what the goals of the project were like, all those things. And so we did multiple rounds of interviews, and this is a process developed by Dr. Melchua, which she used on her PhD thesis originally, where a set of round robin interviews where we got X number of maintainers and X number of active community members to respond not only to a series of questions, but then respond to each other's answers, right? So we did three cycles and distilled all that down. And because they were willing, because they were open source people, yay, right? When you do a qualitative interview as a social scientist, generally it's just dead given that people never reveal who they are, right? Because often they might get fired if they trash their boss as part of this process, right? So you are like, never, never, never, never, supposed to actually out people or tie them to their work. And you have to go through an IRB and institutional research board where you prove that not only are you not electroshocking your students, but you are not revealing data. And we had to work with our IRB to create something and a process by which we had those folks not only sign off on the IRB paperwork, but give up their copyrights to the transcripts of their own interviews. We did let them review them first, but all that paperwork, all those forms, all that process is in the same repository as the report. So if you search on forward and critical digital infrastructure, you search on conceptual mismatches, if you will find a one-pager first, there will be a link to the repository. You can click down the repository and pull all that stuff out and reuse it. So if that kind of social science work is of interest to you and you want to try to convince your own powers that be that no one will go to jail or be sued if everybody signs off on this stuff, that's there for you. And as I highlighted briefly, there was so, you know, this happens, right? You have a core set of maintainers who are focused on how many lines of code go out a day, right? And did we hit refactoring this peep of the project by this time? And there was so much pressure there that people felt bad about doing their onboarding or their documenting or their other types of jobs, and that stuff fell behind, even that was funded as part of this big refactor, right? So culture and misperceptions about what our goals are and who we are and what we want to accomplish actually got in the way of doing what people wanted to do. So I'm going to breeze through this just so we have a little bit of time for Q&A at the end, but I thought it might help to give like a down-to-earth example of kind of a simple project that we worked with on a limited basis. So like one example is we worked with an RIT professor, Professor Rostojie. So she is a professor of computer science and she was working on developing datasets and investigating different ways that you can kind of potentially cause or ideally prevent. But issues around self-driving vehicles, getting into crashes and stuff due to harmful information in their datasets. So she's interested, obviously there's a lot of potential stakeholders that would be interested in this data and like how it can be distributed and a lot of other open source simulation systems that could use it. So we worked with her to develop specific personas of understanding like who these different types of stakeholders were, right? Because there are stakeholders in the private sector developing AV systems. There were other researchers who were interested in her work and might be able to continue generating further data. And then there were also kind of these like potential partner projects or like integrations with simulation systems, right? And so from there we kind of begin to try to understand like what were the goals of this project, which was a very early stage but potentially high impact project that could be integrated into all these different systems. And we found largely it was an issue of like discoverability and also understanding the open source aspect of this. So like for us, you know, through many of our interviews we realized that there's just, there wasn't a funnel from like finding the research to understanding that this is a project that can continually live and evolve and get contributions from other researchers and so on. So we needed to find a way to create a narrative around this. And like an obvious very easy first step to this was like developing a simple landing page website that describes the project, the research, you know, what are the main contribution asks, and then, you know, showcasing those different pathways of getting involved, which like oftentimes is just sort of linking to a well documented GitHub repository and examples of work, right? So we said, okay, well we're going to make a website and we went and we made a website and we like prototyped some copy and we got some review on it and we did the whole, you know, usual process of making a website and we deployed it and, you know, it was somewhat successful. I'm not going to go beyond this point because it was like a fairly limited engagement, but, you know, when thinking about applying this process, in particular if it's like a small project it can be quite a simple engagement, right? But, you know, when coming back to like what's the point of this engagement and like, you know, doing all this work to find out all this stuff is there's a few things, right? As a community becomes larger, developing that community becomes more complex because you have more stakeholders involved, you have more potential conflicts between those stakeholders, and more resources are needed to coordinate all that, right? And community development skills are not necessarily often found in like this peer to peer production of open work, right? So for our team we had like many designers and, you know, social scientists that like do this work and understand how to, you know, gather data from interviews and create this evidence. And, you know, Dr. Rastogi was super talented but not necessarily in these fields. And then, you know, finally if you want to justify growing, like allocating additional resources to growing your community, you need to have evidence to do that, right? Like in particular if you want to get funding for anything, like it's very, very difficult to just be like, and we're finding this now particularly in the private sector to say like, well, you know, we might go away if you don't give us money. We might go away if you do give us money but it's like less likely, right? I've spent the last four years of my life writing a lot of grants and I can promise you at least in those four years of experience has never worked once, although I've tried it a few times. So, you know, community development can get complex particularly as you migrate out of like a very simple governance structure into something that has like many different tiers and many different types of people involved. And this takes work to do, right? And that work requires these new skills, or maybe not new skills but like different types of skill sets involved. And that skill set is important for doing this work. And then finally, and what has certainly been probably one of the biggest points that has opened the eyes of a lot of people that we've worked with is, you know, if you want to like really get additional resources for your project, right? Particularly if you're like a burnt out maintainer or you feel like your project is like plateauing but you want to grow it and you know that it can but you just, you don't have what you need to do it. So, you know, getting the evidence to showcase what exactly is needed is one of the best tools that you have at your disposal for finding ways to like potentially fund this or just to, you know, give yourself the mind space to begin organizing around it, right? And then you can also get the data to be able to do that. And then you can also get the data to be able to do that. And then you can also get the data to be able to do that. And then you can also get the data to be able to do that. And then you can also get the data to be able to do that. And then, yeah, we have three minutes and 45 seconds left. So thank you and questions. Ask away. Any questions, folks? I'll bring it up on the mic. It's okay. So in across all of the projects that you've worked on, you do these personas. And I have to tell you that that is definitely a trigger word for me having worked in a number of companies where they would do personas and the people who ended up doing the personas knew absolutely nothing about. We do them right. So you're not trigger. My question is, have you looked at across all of these projects where people have done the personas and then later succeeded in the project? How far off the personas were at the beginning and did they evolve over time? Did they get closer? Can we get better at doing personas? Great question. There's a lot of bad persona. So I think part of it is coming from a design perspective. Personas are meant to be living documents that are kind of tested out. And they aren't just meant to be. It's like corporate values. A thing I hate, I hate corporate values because these arbitrary words like we're here for goodness and whatever. Personas can sometimes be that we're like, we have our customer persona and they like use our thing because they want to give us money because we're valuable. Like that's useless. With personas, what we're trying to find out is almost like documentation of our studies. Right. So we're we're trying to figure out like, who are the type of contributors that are useful to our community? And like, why do they want to contribute? And like, you know, what does that process look like? And so in terms of thinking about like, you know, we may make assumptions or collect, you know, invalid data about like who that person is and why they want to contribute. And so when thinking about our process, why there's these two stages of like developing solutions and then testing them, this is like where the evolution of a persona comes in. So, you know, we think our personas like are just having issues discovering us, right? So they're really interested and they have the time to contribute. And so we're going to like have a marketing campaign. But from from that campaign, we may find that they actually, you know, have another problem and then we can evolve that persona. So it's really meant to be kind of like documentation of that data. Sorry, I was wrong in the answer. I mean, I started as and still am a video game design professor. I've gone horribly wrong in so many ways. But what we tell our students straight up first thing, whatever it about ready to get to make a game is you're not your player. You think you're your player, but you're not your player. And so you need to figure out who your player really is just because they like the same game as you doesn't mean you're necessarily them. Any other questions? We want to go first. Well, thank you very much for our final talk of the day. Thank you so much. And thank you all for a great first FOSTA. Thank you. Thanks, folks. Thanks for sticking around.
Confidential Computing devroom welcome
Welcome, everyone, to the computer-intercom-puting dev room for this year in Fosnum. I'm happy to have a bigger room this year, and we also can fit more folks, so hopefully there's not going to be any occupancy problems. But yeah, so welcome. Who are we? I am Flitz. This is Jo. And we have a third member Fabiano, who has not managed to come here, but to help organizing. For the disclosure, I'm at NVIDIA, yours at Kalevun and Fabiano at Intel, but this is, of course, the private thing that we organized. And I want to use this welcome session for basically two reasons. To welcome you to the dev room, give you a little bit of the idea of what we're going to have to expect today. But as a second thing, also give a quick background information for combat energy computing. What is it? So that not every speaker has to recap the same thing on and on again, so that we can save some time for every speaker. So there is the Linux Foundation's computer computing consortium. We had a nice social event yesterday. If you haven't heard of that, please look them up, get involved into what they're doing. It's a great thing. And at this point, we're going to use their definition. I think there's many definitions of combat energy computing out there. There's many ways to see what it's doing, what it's not doing. There's many opinions to be had, I think, about it. But we're going to take the definition from them here now. And we can always have discussions about if that's a sensible definition or not. And according to that definition, combat energy computing is the protection of data in use by performing computation in a hardware-based, attested trust execution environment. I highlighted the things that I think are relevant here. For me, personally, hardware-based makes a lot of sense, or is important. Initially, actually, this dev room was called hardware-aided trusted computing dev room. And now we only renamed it to combat energy computing dev room last year. Attested, maybe you've heard of it, maybe you haven't. I think some folks will talk more in depth about attestation later on. I don't think I will cover it really here in the welcome session. And then trust execution environment, TEEs, are the things that basically guarantees you this thing, guarantees you all of these properties. And data in use is this protection of data in use. It's one of the lenses to view combat energy computing. This lens is basically the lens of saying we have data protection at rest by encryption in transit by TLS. And now we also have it in use by TEEs and combat energy computing. Now, I'm listing some key properties here just so that you get an idea of what are common between TEEs. There's still space here, you can come in. It's not that full, there's more space here on the side. There are some common properties between TEE implementations, and then there are some contextual properties between TEEs. So, commonly, most of them, or all of them I think, have data confidentiality. So they encrypt your data and they keep your data confidential. They have data integrity to ensure that your data cannot be changed. And they also have code integrity so that the code that you're running is also the code that you expect to run. So it's not being changed by the environment. On top of that, there are contextual properties. Not all of them, not all of the TEEs have them. Not all use cases need them, but I think these are properties that make sense to at least the subsets of folks doing these definitions. With the notable exception of code confidentiality. Code confidentiality is always, I think, was initially, has a big DRM label on top of it, I think, to me at least. Of course, maybe not everyone agrees to all of these use cases being good, but I think when looking at TEEs, these are definitely some properties that are common. So code confidentiality has been one use case that has been tried in the past. Authenticated launch. Now I'm kind of booting up something that you can authenticate. A programmability that you can change what's running in the system. It's not locked in. It's the beginning of launching it. Attestability again here. The thing that we can make sure that what is running in there at runtime. I can also verify that it is running. And recoverability in case something breaks, I can recover from issues. All right. And now I modified one figure from one of their nice white papers. You should read the white papers if you want to hear more and learn more about that. Where let's take a look at kind of the common software stack and see what is shielded by what type of computer computing technology. So for example, Intel SGX is something that many of you may have heard because it's the oldest that has been commercially available. And that historically, the idea has been that the app data and the library is shielded. So only kind of the application is shielded. While the OS and the process and the TSM is kind of in the untrusted part. So you only draw the protection boundary around the application itself. That's a good example for that idea or for this like shielding level is Intel SGX. And then I'm trust on, I said, Intel SGX is the oldest, but I'm trust was actually way older. It has just been not very accessible to kind of the developers that want to play around with it. I'm trust on has had this, I think, for over 20 years already. Not sure. Maybe always almost 30 years now. Where you draw the protection boundary around the OS and in the US in a trust zone, you have a secure world and a not secure world. And then you suddenly can say, look, I have a switch over to secure world and I have my own whole OS running there. And I can have multiplications that run on top of my OS. And this is actually now being picked up, picked up by the next generation of TEs in computer computing, where we now have, for example, I'm CCA with AMD, SVS and P into DX. I think you will hear about these technologies a lot also today. Where we draw the protection boundary again around the virtual machine. And then we have some, some machine monitors, some hypervisor. And on top of that, you then spawn up multiple machines, realms, enclaves. However, you may name them to call it a technology. And then inside your area, you have your guest OS, maybe some container runtime. On the top, you have some, some applications, right? So these are, let's say, multiple levels of depending on how big you draw your box in your technologies that you're using. But all of them have the idea that you cannot access from the lower levels the data that is running in the higher levels. So here, for example, the OS or the normal world OS or the TSHIM or VMM, also don't have access to application data. And I think that is the core idea for all this confidentiality. And you can also have a station on top of that, that you can convince remote people that, yes, this is the case, my host OS does not have access to my data. And taking a lens at, taking a look at how are we doing this today? We have three talks that are in this right column. TDX Deep Dive, and we talk about SV Step and about Mushroom. Then we have two more talks that are in the other domains. It's one about FTPM, one about databases. And then, of course, we always have talks that are really cross-cutting, that don't really fit into these type of boxes. That are also about attestation and how do we work with this? Now, our dev room is, fortunately, unfortunately, has become very popular. I thank all of you for submitting great talks. We could only fit eight, and we were hurting for that a lot. So I want to give some honorable mentions here to folks that we could not fit today. So there were quite some talks about Project Verizon. There was a talk by Thomas also last year on that. Check it out if that sounds interesting to you. We had suggestions about connected containers, connected clusters on open stack, about remodestation in telecom, and about formalizing remodestation. And I put the links here. The slides are on the schedule. Check them out. I think that these are all great. And we also didn't feel that it's justified to have our own talks up here if you already had to reject talks. So we did cool stuff on SGX, some small execution. And you was trying to build some bare SGX enclaves for the community. You can also check out these. If that sounds interesting. And I hope that we can start some community building around the times when we are not all here together. Maybe also keep submitting your talks next year. We are trying to get a full day dev room. I think this year didn't work because of all the dev rooms. But in the apparent success, I think we will get there at some points. And keep submitting talks and we'll try to get more space from the full-dom organize. Exactly. So we hope to have more engagement in the next years and move bigger to a full day so that we can have more of these talks at the same time. So this is just a schedule in short. How do we work for the speakers? So we always have a five-minute switch over. And for people to have some breathing room to go in and out, maybe have some side discussions. And at the same time, please also leave some room for Q&A. And so yeah, I think that's from my side. Welcome. Not sure if you want to see. Maybe just a couple of practicalities. So the room is limited rights. But there are still spots here and there. So people who are standing, please feel free to take a spot. And the logistics are not ideal here. But if maybe if you see people coming in tight to move to the middle, or if there is like spaces left, we're not sure. But the previous deaf rooms were really full and it would be a bit sad to kind of put on the door that people cannot come in if there is still seats left. So as long as there is seats left, we will not close the room. Also in those five-minute breaks, if you want, you can go in and out. Again, the logistics are not ideal, but we'll make it work.
Intel TDX Deep Dive
Perfect, I guess then let's start right a bit earlier, but then leaves potentially a few times for questions. Hi, my name is Benifuri. I work at the Confidential Computing Enabling Team at Intel. My main job is to, I can try, but I notice when I said there that it's not really loud, the speaker is not loud there. Okay, speak closer to the microphone, I will try. So yes, Intel Confidential Computing Enabling, right, we work together with academics, companies, partners, whoever wants to use our technology, we help them to do that, right, that's my job. Today, I will talk about Intel TDX, and I will, it's called Deep Dive, but I will start with an overview and then go deep in a few slides, right. So overview first, I don't want to speak too much about that, right, it was just done in the talk we just had. Without confidential computing, or if you don't use any protection mechanism, everything is in what we call the trust boundary, right, everything can access your confidential data. With our first technology, Intel S-Chicks was just mentioned, right, only the application with Intel TDX, the topic of today, a virtual machine is protected, right, everything of that was just mentioned, just want to mention it again, because like we have the options, right, use whatever you want. In general, you could say Intel S-Chicks is the more secure technology, Intel TDX the more usable technology, right, but that's up for debate if you want. Yeah, today we will concentrate on Intel TDX only. Here, you see an overview, like this is what a regular system looks like, right, we have the platform with cores, caches and so on, the memory and a regular hypervisor, a virtual machine monitor here. And with normal VMs, this hypervisor starts the virtual machines, right, and it is also, this hypervisor isolates the virtual machines from each other and isolates the virtual machines from the hypervisor itself. In the main memory, everything is plain text, right, so, which means that every person and every program with the necessary privileges can access the data, right, it's plain text. This is different with Intel TDX. With Intel TDX, we introduce what we call the Intel TDX module. The hypervisor has to be adjusted as well, it's now says here it's TDX enlightened, because the hypervisor is still responsible for resource management and hypervisor now needs to, instead of starting the virtual machines itself, it has to go to the TDX module and say here, please start your TDX protected virtual machines for me. And this is what the TDX module does, right, it starts to protect the virtual machines, which we call trust domains, right. Intel TDX stands for Intel Trust Domain Extensions, Intel TDX Protected Virtual Machines, we call them trust domains. Inside those trust domains or TDs, the guest OS running there has to be enlightened as well, right, it has to be at least to have some changes, because it has to handle now accesses to private memory and shared memory, it has to handle that, it also has to handle exceptions and it has to block certain calls that were possible before. Yes, the application inside the TD do not have to be adjusted, that's the main advantage of Intel TDX or comparable technologies. The main memory belonging to the TD is encrypted with an ephemeral key that is dedicated and hardware managed, right, as you see on the slide, it says encrypted with key one and key two, because every trust domain is encrypted with a different key. Inside the CPU, the data belonging to the TD is plain text, right, that's what confidential computing does, inside the CPU, data is plain text, but the CPU takes care that only the trust domain to which the data belongs has access to the data. Combining the main memory encryption and access control, Intel TDX enforces the isolation of the different TDs by using the Intel TDX module, on top of that, attestation proves that this is the case, right, what we will talk about attestation a bit later. This slide is about the Intel TDX enabling in Linux, it contains a lot of details, I don't want to go into every detail, I only want to highlight three things. First via VM isolation requires to enable a lot of parts of the software and Intel has done that, right, we have put the work in and basically everything is open source, right, and this is the, basically everything is open source, and even the ones in gray, they are only gray because they are a reference implementation, but also open source. And most of the pieces are already upstreamed, but not everything, right, that's the current situation of Intel TDX, but this will change soon, hopefully. One last slide of the overview is the availability of Intel TDX. Intel TDX was introduced at the beginning of 2023 with the fourth generation of Intel Xeon Scalables, but back then, only at four leading cloud service providers you see on the right. Everybody else buying these CPUs did not have Intel TDX enabled. Previous at that cloud service providers started already in Q1 2023, and general ability is supposed to be soon this year. Intel TDX became generally available with the fifth generation of Xeon Scalables, which was introduced at the end of last year in December, meaning if you now go to your favorite hardware vendor, you should be able to get such CPUs or at least soon. Good. Now to selected technical details of the technology. First, the CPU state is kept confidential by managing it in CPU protected memory, and that's the responsibility of the TDX module. For example, on a TDX exit, the CPU state is saved by the TDX module in a protected memory region, and this memory region is encrypted. And all memory confidentiality and integrity that's provided by Intel TDX is provided by what we call the TME IM key engine, the total memory encryption, integrity, multi key engine. And this is used to encrypt all the main memory belonging to a TD to prevent untrusted software or from observing the TD's memory. It uses AX XTS 128 bit, and each memory as mentioned before has its own keys. The memory integrity feature detects TD private memory corruption by software and direct memory access. The TDX module is responsible for managing all the keys, but to encrypt the different keys. But the TDX module itself still does not have access to the keys. This is done by the TME IMK hardware that manages the keys, and the TDX module only references key IDs. No access at any point to the keys that are actually used for the main memory encryption, not by any piece of software. I will skip remote attestation for now, because I will explain details later. But a bit about IO compatibility. By default, no direct connection to external devices is possible, because those external devices are untrusted. SQIO can be emulated by software, but this has performance overhead. At the end of the talk, I will talk a bit more about these aspects and how the situation should be changed or migrated in the future. With Intel TDX, performance monitoring and debug facilities run inside TTD. This is a difference compared to Intel SGX, because this means you can debug your application handling sensitive data. Because even during debugging, you are protected, you are inside the trust domain. Sure, the person that does the debugging now has access, but still the infrastructure provider doesn't see it. One final aspect here, the page table management happens inside the trust domain now to address remapping attacks. This was also different with SGX. It was responsibility of the operating system, which was untrusted. A few more details about the TDX module and what we call the secure arbitration mode. The TDX module is provided by Intel and the code is open source. Since only two weeks ago or something, it's on GitHub now. The seam loader verifies the signature of the Intel TDX module when the system boots and loads it into a special memory region, which we call the seam RR. Only software in the seam RR itself are able to access other memory in the seam RR. In fact, hindering everything but the TDX module from doing anything. All other software access and DMA access to this memory is completely blocked. The confidentiality and integrity of the seam RR is again protected with AES, XTS, 128 bit. The Intel TDX module runs in what we call the secure arbitration mode or seam for short. To be more precise in the seam VMX route mode. The ISA was extended with introduction of Intel TDX. The ISA was extended by four instructions to enable the communication between the host, the hypervisor and the hardware. These four instructions are seam calls for interactions between the hypervisor and the TDX module. So, start the TD, stop the TD, things like that. Seam read to return the execution control back to the hypervisor. TD call from a call from the TD to the TDX module and seam ops for calls from the TDX module to the hardware. Certain security critical ISA instructions are denied in seam to provide the protection guarantees we want. Now to TDX remote attestation. The TDX remote attestation, you all know that, you all have heard of that in SGX or in other technologies, uses quotes. Quotes are created by hardware and the quotes are used to prove something. In this case, the TD can prove four different attributes, at least four attributes with this quote. The booted image is exactly as expected. During the loading of the image, it's measured, so it's hashed and this hash is stored in what we call the MRTD. This is part of the quote. Second, measurements created or extended during runtime. Intel TDX has what we call runtime measurement registers or RTMRs and they can be extended at runtime. It's not done automatically, it's a can. It's a subtle topic, if you're more interested in that, but that's what we have. Number three, the TD is executed on an Intel TDX enabled platform. It's obvious that that's important, so nobody should just be able to simulate that it's an Intel TDX hardware. Number four, the Intel TDX platform is fully patched. As you also know, I assume in the past, there were problems with the different technologies, loading Intel SX, but then we provide a patch and we have the ability to prove at what patch level your platform is. Then it's as it says here in the next line, whoever is dessert or a lying party can look at the quote and then decide if it trusts the TD or not. Some might decide even an older patch level is fine, some say only the newest one is fine, some say MRTDs has to be a certain aspect, RTMRs have to be or don't have to be used, all that's possible. A bit more about the process of remote attestation, which should look very, very familiar to people that have seen the SGX remote attestation. It all starts with a relying party triggering the trust domain of here, please prove to me the things I just mentioned. The TD will reach out to the TDX module and the TDX module will reach out to the hardware. The hardware will then generate what we call a TD report and this report contains the measurement I mentioned before, but it also has for example the security version number of the TDX module, the measurement of the TDX module and the measurements of the TD and all other aspects that are in the trusted computing base and it's signed by the hardware at this point. The TD report then is routed back to the TD, back to the hypervisor and then to what we call the TD quoting enclave. And as the name enclave already suggests, it's an Intel SGX enclave, right? So we use Intel SGX for remote attestation of TDX. The TD quoting enclave checks if the report was signed at the same platform and if that's the case, it will sign it with the attestation key. I will come, what this means in a second and why this matters. But then now we have a TD quote that's signed by the attestation key and this TD quote is passed back to the relying party who can do quote verification now. But the important question is what just happened, right? The TD quote was signed with an attestation key, what does it mean, why should we trust that, right? And a key piece I skipped before is that the TD quoting enclave has randomly generated the attestation key before the process even starts, right? Without Intel being involved at all, this happens on the platform. But that still doesn't help much. But what also happens on start, the so-called provisioning certification enclave that's also provided by Intel will do local attestation with the TD quoting enclave. It will see, yes, okay, we run both on the same machine. It's a TD quoting enclave that I expect. And it just provided me an attestation key. And then it will use the provisioning certification key to sign a certificate. So then we have, as is you on the right side, an attestation key certificate that's signed by the PCK key. But again, why does this matter, right? The important piece now is Intel is able to create PCK certificates that are then rooted in an Intel CA. And this ends the trust chain, right? So the attestation key generated on hardware, but it links back to an Intel CA. And during quote verification, whoever does it, wherever this is done, can reach out to what we call the provisioning certification service to get all the collateral that's needed to check this chain. That's the process of remote attestation. And as said before, Intel TDX attestation uses Intel SGX. All the sets of collateral we had before, PCK certificate, distribution, caching services, they supported Intel SGX in the Pado in the past only. Now they support both. And it also, this also means that as it's required to enable SGX on the machine when you want to use TDX. Just quickly, a few words about how you can do the verification, right? There are basically four options. You can use a service by the cloud service provider. You can use a service by the vendor of your application. You can use potentially the Intel trust authority like an independent software as a service offering by Intel to do the verification for you, to alleviate the process. Or you can build it your own with the open source Intel libraries we provide. A few pieces of differentiation between the services, if you want to have a separation of responsibility between the infrastructure provider and the verification party, then you should not use the cloud service provider, obviously. But in all the other cases, it's fine. If you want to have consistency, if you want to support SGX and TDX, then it's up to the cloud service provider and the application provider if they supported our variance, definitely supported. If you want to have consistencies across your applications, across the environment you have on-prem hybrid, whatever, then obviously cloud service providers cannot be used, the application vendor potentially, and the others will do it. From a development perspective, it's low in the first three cases and I would say medium in the last case. So quickly, very quickly, two upcoming features of Intel TDXO that we have at least a little bit time for Q&A. First TD migration. TD migration will allow to live migrate one TD from one platform to another. It uses a service TD called the migration TD to do that. All the data is obviously encrypted. Just skipping a few details now, everything is encrypted. Everything will go over step by step, a short break. One TD on the left side goes down, the one on the right side goes up, which guarantees that a TD lives only once at a time. You should not have two different TDs with the same content. And one last feature, Intel TDX Connect. I mentioned that before. So it's a bit problematic at the moment to connect trust domains with a device. It is possible, but what's needed at the moment is the trust domain, like everything in the private memory of the trust domain is encrypted. But it can't directly, so it can write to shared memory, right? That's what it can, but it can't, no. It can't directly write to device. What it only can do is put data on a shared memory and the device can take the data from a shared memory, right? What we call a bounce buffer. So this is a bit slow. Still, it can be done securely, right? If a secure session key is established between device and trust domain, the data can be encrypted, put there, read in the device, and it's encrypted. So even today, this solution is there. Like you can connect an Intel TDX trust domain to an NVIDIA GPU with their confidential computing technology, have it end to end secure. That's possible. But it's a bit slow, or it has a bit of overhead because of this bounce buffer stuff. And this will change when Intel TDX comes along. Because with Intel TDX Connect, the idea is that a trusted device is put in, let's say, the trust zone of a trust domain. They're able to write into each other's memory directly after they trust each other, which will make the whole thing more efficient and has low overhead. This is just nothing I mentioned today is any secret, right? All of that is open here on this page. We have documentations. Knock yourself out. It's like thousands of pages you can read in the PDFs to get all the details you want. If not, feel free to reach out to me at any point after this talk, at any point later. If you have interest in, for example, bare metal access to machine, I'm also your guy for whatever experiments at the University as an organization, whatever, right? Because at the Cloud Service providers, you normally don't get that, right? You get a trust domain. That's it. Might be enough in many cases, but not in all. So reach out to me and thank you for your attention. Can we repeat the questions? Yeah, so, yeah? I have to repeat the question. The question was, is it possible, or I rephrase, correct me if I'm doing a bad job there. You said, it's possible to run a legacy application in a trust domain. Yeah, that's what you said. The question is, how is the integrity of such classes being maintained considering the fact that this application is legacy doesn't? Okay, yeah. So the question was, again, in my words, how is the process then, how is this protected because the application wasn't written, right, for this environment? And the answer is, it depends, right? Meaning you have an in-memory only application, then you don't have to do anything, right? Because the main memory is encrypted and you're done. As soon as your application writes to disk, it's a different story, right? Because if you write to disk plain text data, then it's plain text and everybody will see it. One thing you can do is you can either your application encrypts data before, then it is a change to the application, right? Or another variant is you activate, for example, full disk encryption in your operating system. Then you have to manage the key, right? That's another question then, but that's what you can do. And exactly the same for network connections, right? If you, again, send plain text data out, yeah, plain text data is out. But if you use TLS, you can do it, you just put your TLS endpoint in the trust domain now and you're good. Yeah? Thank you for a very nice talk. So I had a question about the state software support. Thank you very much. So I had a question related to the sort of status on the software support on the guest side, right? So with some of these comparable technologies today, you still need kind of some components in the middle on the guest side, like basically like firmware inside the guest or like Paraviser functionality that hides some of this communication between the underlying layer. So how is it with TDX today? Can you take like a stock Linux kernel and run this? So you still need some components there which are not yet fully open source? So at the moment, as I said, briefly before not everything is upstreamed, right? So it's, I guess, like the basic enabling should be there middle of the year. So at the moment, it's not there fully. But what we have is what we call a TDX early preview. So we collaborate with three operate distribution vendors to provide specific distribution versions. And that's canonical, Red Hat and SUSE. And we have all this is online. You just go to GitHub and it's, I just did it yesterday night, right? It's really like you start up a Buntu 2310, for example, you clone their repository, click install, done. You go to the buyers and activate TDX. Then there have another script to create a guest image. I don't just take them like 15 minutes to create, but just of download and all of that stuff. You start your trust domain and you're done. So that's pretty easy already. Yeah, thank you for the talk. I have kind of an obvious question. Is there a latency cost within one trusted domain from memory access given that it's encrypted and so on? So performance you mean, right? Okay. Yes, obviously it has to be, right? Infection can't be for free. But how high the overhead is highly depends on your workload. If it's a processor only workload, it's basically free. I don't have concrete numbers, but let's say one, two percent, right? So really, really low. If it's really disk IO sensitive, it's a different question, right? Because of this balance buffer and all of that stuff. Again, don't nail me on it, but let's say like it might go to 10% or even more, right? It's really, really dependent on your workload. I guess I have to stop now, but you can just come to me later, right?
SEV-Step: A Single-Stepping Framework for AMD-SEV
So, the next speaker is Luca Wilk from the University of Lubbeck and he will talk about some recent work he has been doing, actually, attack research. I'm very excited that the Dev Room from the start has had some consistent attack research line as well, which I think is very important for this new type of technology. So Luca, enlighten us. Yeah, thank you very much for the kind introduction. I will be talking about SCVSTEP, which is a single stepping framework for AMD SCV, and it's open source and available on GitHub, so feel free to check it out. And this was created as part of an academic paper, which is joint work with these great people down here. Okay, just a quick recap where we are in the trusted execution environment landscape. So as the name suggests, SCVSTEP is about AMD SCV. So we are in this confidential VM area here. However, single stepping is something that basically affects all TEs that are out there right now, so keep that in mind. Okay, with that out of the way, we can jump right in and explore what single stepping attacks actually are. So we start with a quite high level picture. What you want to do here is you want to take some kind of snapshot or observation of our protected application, and we use this for our tech. Now, if our TE runs normally, then it runs basically at full speed, and if we take these snapshots, we don't affect any synchronization with this TE process, and thus the observation and the data that we get is very blurry. But now if we start to interrupt the enclave at certain points, then we have these synchronous points in time where you can start to take our snapshots. So it's not running in parallel anymore, but the enclave is paused when we take our snapshots. And thus we already get a little bit more information. And now if we take this to the maximum resolution and we are able to interrupt the enclave after every single instruction reliably, then we get a pretty clear picture of what's going on. So I hope that already gave you like a good intuition. And now we go into what single-stepping attacks have actually been used for, mostly in academia. And these are all examples that have been done with SGX that really made this popular in academia because it made single-stepping very accessible. So the first attack avenue basic here is something called interrupt latency, and there you basically measure how long it takes from where you like started this attack to when you get like this callback that the enclave has been now interrupted or exited. And it has been shown that this timing actually revealed something about the kind of information that's running in the enclave. And for some instructions like different instructions, you can even learn something about the operands. So dividing by certain numbers takes longer than dividing by other numbers. And thus you really kind of instruction and maybe even the operand with these attacks. Then the second major attack avenue here is called interrupt counting or instruction counting. And here the idea is that certain algorithms and applications have secret dependent control flows, especially true for cryptographic algorithms. We have some secret key, and then I do some large integer multiplication or division and decode the dusted. Executes a different number of instructions depending on the secret data. And now when I do this senior stepping attacks, I can simply count the numbers of steps that I take. And then if I know on which code page I'm currently in, then I can learn something about the secret data just by observing the number of instructions. So in this tiny example here with a conditional jump, and in one case we skip over this move here, and the others we don't. So here we get two instructions executed here, three. And by knowing the code that's currently running, we can infer the value of the secret bit here. Then the third really popular attack avenue is not directly senior stepping, but closely related. It's called zero stepping. And here the idea is that we interrupt the enclave even more frequently. So before this able to actually execute a single instruction. So it doesn't make any progress on an architectural level, but on a micro-architectural level, it is first instruction. It's already starting to execute, then gets a board and roll it back. But on the micro-architectural state, there's actually still already stuff going on. And these attacks are able to measure this. And what we can do then is basically take an infinite number of measurements, but only running the enclave once. And this allows you to measure really, really tiny effects. And then the third column here is kind of the miscellaneous sketch all column. So as you can imagine, just by increasing this temporal resolution, you can improve basically any side channel attacks. So it has been used in many of these MDS attacks here, for example. Okay, so now that we know what senior stepping basically is and why it's really dangerous, we come to the main question of the stock here. Can ACVVM be single stepped? And if so, how? So let's take a look at the basic setup here. So this is like a very boiled down version of the control loop that's going on in the hypervisor, where we enter DVM here. Then we execute some instructions and then at some point we exit. So for senior stepping, the obvious question is, when we exit DVM here, this is what you want to control in our attack. And there are multiple reasons why this can happen. So we can configure certain instructions to be intercepted. And you can also use page for it by removing access rights in these nested page tables. However, none of these two methods give us the amount of control that we want because they are not instruction granular. However, you can also use external interrupts to force an exit from our VM. And this is actually what will allow us to achieve this instruction granularity. And for this, the attacker uses something that's called the APIC timer. It's a common timer on x86 used by the operating system. And by injecting this timer, we will force exits from DVM. So let's zoom in a little bit. This is a typical attack sequence here. In red, we have the coded ones in the hypervisor. It's controlled by the attacker. And on the right here, the blue, that's the three instructions from DVM that you just saw. So what do we need to do now to achieve senior stepping? Well, intuitively, you would think that you would need to hit this tiny window between these two instructions here to single step. However, luckily on x86, it's already sufficient if our interrupt hits somewhere during the execution of this instruction. Because then it will be held pending and will be basically recognized at instruction boundary. Okay, but if we just naively implement this and try to do this, then we are not quite there yet. And we will see that sometimes we will overshoot here and then we will execute two or more instructions. And this, of course, decreases our resolution because now we cannot guarantee that we do something after every instruction. Maybe we have bad luck and skip over very important memory access instructions also on. So this is really bad, this mighty stepping. And on the other side, we might undershoot a little bit and zero step. And this is not really dangerous because then we simply repeat, we don't miss out on any instructions. We just try again and it's a little bit less efficient. So why is this the case? And there has been some really nice papers on SGX and they show that this APIC timer has quite some jitter. So it's not cycle accurate. So it kind of makes sense that we see this behavior here. So what do we do about this? And the kind of obvious idea is, okay, we kind of need to make this window larger because our timer doesn't have the high enough resolution. So we kind of need to enlarge the window at which our timer can hit. And for this, we look at what's actually going on when we execute an instruction here. So first we have to fetch the instruction from memory from the code page and then the CPU can decode it, issue it to the pipeline and eventually retire it. So for the attack, the idea here is now that we make sure that this year takes a long time and we achieve this by simply flushing the page from the memory. So we flush the VMs TLB and that's when we enter it again. We need to do a page that we walk, which will take some time and this effectively prolongs this window here. That is required to execute the first instruction. And now although our timer still has this jitter, this window is large enough so that we can actually rely on the single step. And the ACV step at the time of publishing was the first frame that did this shortly afterwards. There were also some papers that did something similar and it's also open source. So we hope that other people will reuse it. Okay, so now let's take a little bit closer look at the ACV step framework itself. So besides reliably single stepping, we wanted to achieve two other goals. And this is reusability and interactivity for the attacks. And I will go over these two goals now in more detail. So for reusability, let's again look at our setup here. And since we want to program this APIC timer, we want to manipulate these page tables and maybe do some cache priming and probing. All of these things would benefit from being really close to entering and leaving DVM because this is the point. We have the lowest noise. However, this also means that we need to manipulate or change the kernel code and developing kernel code. It's quite hard. It's hard to debug. You're limited to see. You don't have any external libraries. So it's not the nicest programming environment. And also it makes reusing this for different attacks or for different papers quite hard because this environment is not so nice. And your tech logic is basically mixed together with these attack primitives. So instead what you want to do here is we only want to implement these bare primitives inside the kernel, like programming the timer, manipulating these page tables and cache priming and probing. And all of the other stuff is then moved out to user space. And we use an IOCTL API then to trigger this behavior from user space. So then here we have this much nicer programming environment. And other people can simply link against this library and write their attack code with it. And one tiny note is that this execution loop of DVM is asynchronous from our IOCTL API. So it changes only take effect the next time DVM exits. So we have some data variables here for communication, but this is something you kind of need in mind when you program these attacks. Okay, so we achieved this goal of usability. Let's move on to the second goal for interactivity. And to understand this a little bit better, I will go into more detail of how I envision this programming environment here in the user space library. And there we also basically want to have some kind of event loop. Initially we set up some configuration like I want to get a page forward once this page is accessed. And then we want basically to wait until this event happens. And when this event happens, we want to react to this event. We have usually in these attacks some kind of page forward sequence that would tell us when the VM is about to execute some certain function. And then maybe at this point we want to enable single stepping and do some steps to a cache attack, this kind of stuff. So this is basically the process event and the deved config part here. And the really important thing is that once we got this event, we also want the VM here to basically wait for us to process this event because we would allow it to resume. Then we would again lose this precise control you wanted to have to manipulate the environment after every instruction. So we now also need a way to basically communicate from the kernel side to a user space library to be able to send these events and wait for these acknowledgments. And for this we opted for a shared memory protocol. So the library and the kernel code here simply agree on a shared memory page and then use a simple protocol with some spin locks to basically implement this. Why is this not the most efficient? It is very low latency because it's just memory communication. You don't have any user space, kernel space context switch as with the IOCTL here and also reasonably to implement. Okay, and this is how we achieve this interactivity goal. This is basically the current state of the framework. But to close up, I also want to give an overview of ongoing and future work. So one thing I've been working on a little bit already and I would really like to continue on is to improve this API, this programming environment because right now it's kind of basically have these start, track, stop, track commands. And if you start to write your attack code as I've experienced myself, this can get quite messy and quite long really quick. So it would be cool to have some higher level abstractions for this. For example, a component that could track a certain page for a sequence for you and restart the tracking if you get some unexpected access and so on. And then some kind of mechanism or protocol to chain together these components so that you can structure your attack better. Also make it easier for people to get started by reusing these building blocks. And thinking about this even more, this is totally independent of the actuality underneath. So this is maybe something where the existing S3X step community could come together and could build these libraries at a higher level and then S3X step and SIV step. And I think the trust zone one is called load step could basically be initiated as drivers underneath that so that everyone could profit from this. Okay. And this is more or less it. You can again find the links for SIV step and also for SGX step, which I mentioned here. They are both open source and on GitHub. Feel free to check them out. Send me a pre request if you want to change something, create an issue that's something broken. And yeah, thank you so much. And I'm happy to answer questions now. Yeah. Yeah, thank you for the very interesting talk. A new Satchel attack for me. And now you've showed how to break things. Do you have some ideas how this kind of attack could be mitigated possibly? Yeah, so it's a really good question. So for S3X, there recently has been a paper which was does is called a X notify. And then basically the idea is to make the S3X and play interrupt aware and then execute some special handler that will pre fetch this first instruction that I showed so that you can't do this. I flushed the TAB and make everything really slow approach, but ensure that this the first instruction always executes really fast and this then mitigates this attack. And for TDX, which we just talked about, there's also some mitigation built into the TX module. And for SEV, we are currently looking into ideas how we could protect SEV VMs against this. Thank you. Thank you, Luca. Yes, we're back. So can you elaborate a bit on how much of this is SEV specific and how much of it is actually, let's say KVM step? Let's say if you don't have a mitigation in TDX, can you just launch this as is on any kind of VM or is this specific to SEV in any in any way? Thank you. So I don't think it's really specific to SEV because this ability to flush the TAB that should also be available with VMX with the hardware acceleration for Intel. I think that the basic primitive should apply. I also know that there has been like an internal prototype that's what's called TDX step that's on one of the Intel pages. So they basically build something similar for this. So I guess in principle, this should apply to all like VM based systems where the VM can be forced to exit by external interrupts. There's one more question. Can you repeat it if you have all the plans for TDX? It's definitely really interesting. The question was if you have also plans for TDX and as I've said, TDX is a bit in countermeasure, but I guess it would be of course interesting to try to figure out how this works exactly. If you can do something there.
Shielding Data, Embracing Openness, Optimizing Performance: A Journey Through Trustworthy Environments for Database Systems
Thank you, thank you everybody for coming. Before we start, I need to say two things. First of all, I'm sorry for this dev room. I'll try to speak as loud as I can, but if you don't see the slides, they are available online. Second, this is a talk about databases. We are database researchers. So first of all, we don't know everything. Second of all, we might also not understand everything, but regardless, we hope to give you a different perspective about this important problem, which is how to store data securely using trusted execution environments as a technology. Sorry. So we are PSD students. We work at CWI Amsterdam, and specifically our research focuses on secure databases. In particular, we do stuff like encrypted query processing, secure multi-party computation, data privacy, and so on. Here our research question is how to protect data in use. In fact, it's very easy to protect data at rest, but we also want to hide it while it's being processed. Our example here is related to cloud. So nowadays it is very common practice to outsource data management to cloud providers, but the thing is we also need to protect information from people who have access to the servers and internal attacks. There are some techniques to analyze data while keeping it encrypted, but like homomorphic encryption for instance, but unfortunately this field doesn't yet have encouraging performance results. So we here need to look for something simpler and more efficient to protect our data while it's being processed. In this talk, of course, we're here. So we talk about trusted execution environments, and we want to employ them as a technology to ensure confidentiality and isolation of the data. But before we start with this, we need to first understand the different technologies and different techniques to split the components of a database system in a trusted execution environment. In this talk, we focus about Intel SGX, and for who didn't know, it's basically a series of hardware instructions to split memory in a secure and insecure part, where the secure part we're going to call Enclave. And in the database field, Intel SGX specifically the first one is a very popular choice for development because it's the most mature one, and there is the most research on it. But at the same time, there are some performance limitations to run workloads that are very typical for database systems. In particular, the biggest problem here is the limited-page cache size, which is 120 megabytes in Intel SGX1. That being said, we're going to explain here many different models to split our DBMS. We have the... does this work? We have the full DBMS split, which means that basically we're going to put all the database inside the Enclave with just a very tiny layer of IO library to handle system calls. Then we have the middle DBMS split, which is something in between. It allows more fine-grained optimization and code splits. Usually approaches just put the query execution engine inside the Enclave, and everything else is going to stay out. And then we have the minimal DBMS split, where only the operators and the comparators are inside the Enclave, where with operators and comparators, I mean plus, minus, equal, and so on. Now we have a general understanding of the different models. We can start with some practical examples. So here's a personal favorite, it's called StelDB, and it's a Postgre's extension. We have some Postgre's people here. I'm very biased on this, but basically, StelDB is employing the third model that I mentioned, which is the minimal DBMS split. So basically it's only implementing operators and comparators. This choice was probably made because of the very limited memory that we can use. And so of course there are some trade-offs. If we do not have the full DBMS, of course there is more information leakage. For instance, people might be able to infer the size of the database and the operations that we are making inside the Enclave. And at the same time, even though the secure part is so limited, the performance is kind of bad. So here we are going to have here 5% to 30% overhead in transactional queries, where transactional queries are workloads that are very heavy in inserts and updates on current data. So, yeah, so this is a very good project, but still not quite what we would like to have if we are running actual real-world workloads. There are more examples here of other databases. We have a lot of implementations of SQLites. And they are, I think all of them are full DBMS split, but regardless, they add at least one or two orders of magnitude of overhead to the queries. We have a MariaDB kind of encrypted database, which is called Ageless. I think I saw some people from Ageless here, or there was a, it was at FOSDEM a few years back. And yeah, Ageless is basically this database that is designed to run inside an Enclave and uses MariaDB and ROXDB storage. It also has encrypted authenticated data in disk and in memory as well, and encrypted network connection. So it's a very nice project. Then we have an implementation of Microsoft SQL Server. I'm sorry about this. It's not open source, I know, but unfortunately it's one of the most relevant works in the field because it actually implements the query engine in the Enclave and splits the data between sensitive and insensitive tables. So it's a very novel idea, but unfortunately it doesn't work because this kind of models also assumes a very big Enclave size and due to the limitations of AgX1, this is not pretty feasible in practice. And then we have one analytical engine where with analytic, I mean doing analytics, so business intelligence workloads on a lot of data and historical data. And yeah, this is called OblityB and it implements Oblivious Physical Operator for analytical processing in the cloud. But yeah, once again, this is really, really slow because of the Enclave size. So our contribution here is taking all of this that we have and we notice two things in here. First of all, the big majority of these implementations on AgX1 are transactional and because analytical workloads really don't scale because of the volume of the data and they overhead called by last level cache misses and EPC swapping. The second problem is that there is no research on SGX2. So SGX2 was released a couple years ago, but still all the prototypes that I mentioned were made for SGX1. I'm not saying they don't work, but I'm saying there are no benchmarks, there are no implementations, so there are specifically tailored for SGX2. So here our contribution is to try and bridge the gap between efficient and secure analytical processing. To do so, we use our database, DacDB. Disclaimer here, we are not affiliated with DacDB, we are not paid by them. It's just an open source database that we happen to use because it's developed in our research center. So DacDB is open source, it's embedded, columnar analytical system, I'm sorry, there are a lot of buzzwords here, I'm going to explain that later. It's in C++11 without additional dependencies and it's actually been ported to SGX1 in 2022 by some, our master student. And before explaining what we did with DacDB and SGX1 and 2, I need to give you some fundamental concepts about database internals. We start here with column storage. So the difference between row and column storage is that basically data in column storage is stored in columns because if you do analytical workloads, we don't need usually all the columns that we have, we just need a few of them. So it is more efficient to store the columns all together such that we can only fetch what we need. And also this kind of column format is also very much, can very much benefit from compression because usually there is a lot of correlation between the data and our data is also huge. So we can definitely implement some sort of compression and DacDB specifically implements column lever compression where data is stored in column and then compressed. Now we also need to talk a little bit about vectorized execution. This is similar to the CMD instructions that you probably know of but applied to databases. So instead of performing operations to one row at a time, we perform it in batches. So instead of having a row fetching it, elaborating it and returning it, we do the same process with batches. So you can see this example, very, very simple query. And here our function next is going to return many tuples rather than one. And we push only the relevant blocks of data up and down the query plan. And this is more efficient because we have less system calls and we can also take advantage of the CPU more efficiently. Now thank you for the attention. Now Lotte is going to explain you how we ported DacDB to SGAX. Thank you, Illa. Okay, so before we go directly to SGAX2 and how we did it, we first are going to pay some attention to how it actually has been done to SGAX1 because the master student, he ported DacDB in two different ways. The first one is the full database management split. The main issue was here of course because of the low memory capacity that not the whole database or not all the data would fit in the enclave. And the second issue here is that system calls are not directly callable inside the enclave. So you either need to reimplement all system calls or the necessary system calls or you need to use some kind of library which maintains this kind of IOS심 layer. So the master student, he used Graphene. Nowadays, Graphene is actually called Grameen. And I think last year and the year before there were also talks at FOSDEM about Grameen, how this exactly works. So with Grameen and with fully porting DacDB into the enclave, there was a 20 time slowdown actually and this is mainly cause because of the expensive EPC swapping. So to mitigate this, the student tried to instead of keeping all memory buffers inside the enclave, pull some memory buffers outside the enclave and crib them outside the enclave and this way try to run DacDB. And this already gave a significant speed up but still there was a 30 time slowdown. The second approach that he did was the minimal DbMS split. He put basically all the operators inside the enclave and left the rest out of the enclave because this enabled to have factorized processing still and that really increases performance. And a second optimization that he did was replacing Ecos and Ocos by a synchronous request in a shared buffer also called the switches mode I think. And this also helped but still there was a 10 time slowdown. So a couple years later, now there is of course SGX2 and it doesn't suffer from the main memory limitation. So now it's basically easier to port DacDB as a whole to the enclave to SGX2 and we did it also with the use of Grameen which also improved the last years a lot so that made it actually surprisingly easy. And we did some benchmarks to actually see okay so what performance difference is there if you run a database fully inside the enclave. So before going into the results we did the benchmarks with TPCH which is a standard industry benchmark for analytical workloads so basically for data science workloads so there are no inputs inserts or updates but just analytics basically. And we compared it first with Grameen itself because since Grameen replace a system calls it also incurs some overhead but as you can see most overhead is caused by SGX self. On average we would say there is a 10 to 20 percent overhead but here we normalize baseline DacDB so that you actually can see the actual overhead per query and there are some specific queries such as query 12 and query 15 where the overhead is actually more than twice. So this might be a bit problematic. So we did some research we tried to identify okay so what is it in these queries that causes the overhead and we found that mainly strangely enough the overhead is introduced by O-calls so by E-enters and E-exits and we tried to investigate a bit further which system called it then was but there was some kind of timing function that seems to be executed outside of the enclave. And also within these queries there are two times as much page faults and well one optimization that we tried we're still working on it but was increasing the factor size in DacDB because usually in DacDB the factor size consists of 2048 tuples and usually this gives low L1 cache misses but it can incur many EPC calls so with increasing the factor size you basically maximize IO and IO is very expensive in the enclave and we actually found that if you increase the factor size to 16384 that the performance overhead is actually minimized for this workload and a small note is that not for all queries actually the performance improved but just for the queries with a lot of overhead it seems to be really beneficial to increase the factor size in DacDB. So this is very much work in process it's more a prototype than something you can actually use in production so please don't do it yet but we can conclude that analytics can actually perform people from the relatively efficient in SGX2 and the overhead seems to be acceptable but the question now is we can protect data in use so data in secure memory but what about the data in unsecure memory right now because if you go outside the enclave the data is not protected by default and so we will actually need some kind of encryption mechanism and DacDB right now has actually parquet encryption so we are already capable of encrypting parquet files and decrypting them inside the enclave and then perform secure analytics but in the end our goal is to design to build something that is fully functional and that is fully secure actually for users that want to do secure analytics with DacDB. So yeah this is our plan for the future we will of course open source everything but yeah thank you for your attention. Hi thank you for a very nice talk so I was wondering you talked about like this overhead that you were attributing to the old calls going out of the enclave and some of the commercial SGX frameworks use these techniques where you actually batch these together and they are commonly called asynchronous old calls so did you look into that at all and or do you have like some insights how that could affect the performance. Okay so your question basically is if we looked into the asynchronous old calls right or the asynchronous buffer basically well the master student from he looked into that and indeed improved performance we were planning on actually doing some benchmarks with this specific mode but we just didn't do it yet but we are still investigating but as far as I understand it is a little bit less secure to use this mode so yeah it will always be trade off but I suspect that it will improve performance quite a bit so reduced the overhead in the end. Yeah probably a stupid and provocative question have you tried shoving the whole database in a secure instance like 7S and P or TDX and comparing the performance between like SGX and TDX or 7S and P solution. Okay so the question is did we use other secure environments basically the answer is no so we have no performance comparisons yet but the plan is actually to do that indeed because if you want like not everybody is able to run SGX to write so the hardware field is pretty fragmented and we also want to kind of find solutions or at least have comparisons of which one is the best to use and or maybe even made some kind of framework that people can adopt to easily run also on different kind of hardware instructions. Yeah. Thank you I want to ask about the fully secure on the slide. Have you talked about side channels and what's your vision on that? Do you want to answer? Yeah in short yes this is a problem because all the research that we found there is always a trade off between performance and security and literally all the papers build this sort of model like cost model in not in terms of cost but in terms of information leakage. So a lot of people papers just say that yes we acknowledge that there are going to be some trade off some attacks in fact yes yes this is absolutely the case that it can happen but right now the goal was first to have something that is somewhat functional on some sort of database workloads because as I said the big limitations of SGX-1 made the whole thing completely invisible but now that this is actually possible we can also focus on how to fix these issues but unfortunately research tended not to acknowledge this issue so much in the past but for future yes we will.
The ups and downs of running enclaves in production
All right guys, so back to the matter of the day. The next speaker is Kian who works at Evervolt and I think it's quite exciting to have a bit of a complimentary perspective in let's say this exciting new field where we talk a lot about new technologies, but you will actually talk about how to use them in production. So take it away. Thanks. So I work for Evervolt and I will talk about Evervolt to begin with just so you know why we use enclaves in production and not just traditional computing. So we offer also, I don't know how loud that is if I'm too quiet, tell me so I can speak louder. So we offer tooling to allow customers to guarantee data security in different forms like encryption before it ever hits your system or ways to process said encrypted data in secure environments and so on and so forth. At the core of all of this is enclaves. We're running on AWS so we're using the Nitro enclaves which as far as I can tell aren't as open source as the Intel SGX or any of that stuff. But we've been doing this for a couple of years now and that was when we started the best we could find for doing VMs that guaranteed the security model that we required. So like I said encryption, so yeah we're running in fully isolated VMs where we can basically see nothing that's happening inside the VM without a lot of effort on our part which is mainly so we can protect our users data. So just to give the context, relay is our main product is what I would say. It's an encryption proxy, you put it in front of your service and you define some rules and before the service your request ever hit you, the rules are applied and all your data is encrypted. Sorry, I lost my mouse. So yeah so it's very much focused on web services but it's mainly for people who want to de-scope their environment so they can be more PCI compliant or protect HIPAA data and stuff like that. Relay runs, relay doesn't run in an enclave mainly due to performance reasons because it's processing lots of network requests and we want it to get quick because encryption is slow and we don't want to add overhead to our users. So we store all of our keys inside a KMS that is accessed from a secure enclave. That service, we have no access to the keys then. On startup it tests connections to the KMS, pulls down user keys, decrypts them and then we are able to process the user's requests and outside of that environment we can't replicate anything. This started though when more users joined us and we started to scale. At first we just had a lot of automation. That was stuff like how do you run Docker containers in enclaves and how do you make sure that you can scale up or scale down. AWS Nitro enclaves are guest VMs on top of EC2 nodes. There's not much automation about actually running what's in there so you have to build it all ourselves and get all that running and actually performing requests for our users. So after we got all that running we had issues with just the libraries in general. So the parts of AWS that are open source are all the interface libraries for connecting to them but we found that there's many, many edge cases where they just were very poorly documented or not documented at all about how do you interact with it, how do you work with the proxies. So for reference for those that haven't used it, you need to run a Vsoc. There is a Vsoc on your host for communicating with the secure enclave and this is the only I.O. you have in and out of the VM. You then need to manage all the connections yourself and how you transfer data in and out and communicate. We ran into some really fun problems though trying to use this and talking to the AWS guys about using their library and getting stuff. The funnest one I think was we lost, we were dropping, we had file descriptor leakage. Our VM guests, VMs were dying because we just couldn't connect to them anymore because we run out of file descriptors on them which I had not seen in a long time aside from just like breaking my own machine which was fun. Turned out we just made some assumptions about how stuff worked because we thought, oh this is how it works in Rust and no, that wasn't how it worked in the library and we were just not reading the code but we needed to read the code which was unfortunate because I would have liked it to be in the docs. But yeah, it really showed that there was no metrics or observability for these enclaves. We weren't able to know what's happening inside them or how to interact with them. We started trying to monitor them. This was interesting. Like I said, no metrics, no nothing. I realize you probably can't see a lot of those graphs but these were our load tests. We started to try and get metrics out of them because there's limited IO. We didn't want to just try and put a metrics collector inside them and shoot all the metrics out to Datadog or AWS. We started instrumenting the clients that we were talking to it with and we started sending load data and trying out different workloads. So a lot of black box testing. This was several weeks of just staring at graphs that I may have gone a little insane during but we're here now and it worked. So once we got through it all we were able to find different bottlenecks in the code but based on guesses and automation changes we were able to go from, I don't know if you can see that but about 1,500 requests per second, no, encryptions per second inside the enclave to about 5,000 encryptions per second just by switching our default curve which we hadn't ever considered because we encourage our users to set the curve but it made massive improvements for us. But we had no idea that the encryptions themselves were the bottleneck because we couldn't see what was happening inside of our enclaves or the VMs and know where our workloads were slowing down. So once we started doing the observability we really went in on it. So we did this black box testing and we found the limit pretty quickly. We had to guess where the bottlenecks were and there was a whiteboard in the office of like here are ideas we have to try in different configurations. We just worked our way taking each box off and turning itself on and off until we were able to actually get some improvements from it. We then started working on a level of, so AWS does have a concept of debug logs but the moment you turn it on your enclave isn't actually a testable anymore. The attestation variables all just turn to zero and you're not able to attest your connection. And like I mentioned before we need to be able to attest the connection to the KMS to actually even load keys into it so we couldn't run in debug mode at all. We had to figure it out. So we had to basically reimplement a level of tracing like if anyone is familiar with open telemetry and stuff we had to come up with a way of doing trace requests inside of it. We couldn't use open telemetry because it had no understanding of how to communicate outside of the VMs. We had to take concepts, reimplement them and come up with a way of batching requests, sending them out and limiting the amount of IO overhead that we were doing that. We eventually got there and we were able to monitor our boxes. That's when we started to notice more problems. So we basically had these two processes in the enclave talk into a shutter and we expected the green line there. Yellow line would be perfect. That was our local dev environment. But the green line is what we wanted to see in production. The blue line is what we were seeing in production. I've lost. I wasn't allowed to put the lines of the numbers on it to be specific here but that was about a 20x slowdown I think which was insane. We're still debugging this one. We're not 100% sure where the bottlenecks are. We're fairly certain it's the virtualization of the network layer inside the containers is just insanely slow. So what we're looking at is how can we short circuit that. There's some things like sock maps. You can re-root sockets. But effectively you can't just take a container and throw it, take a process and throw it in or take two processes and throw it into the VM and think that will work. It works on my machine. It does not just magically work. You need to really tune the system to actually be able to talk effectively. We're still tuning it. We're hoping to have some stuff to note soon about ways to speed it up with sock map and different improvements. Like I said, it's seemingly either the VM or the user space networking. The fun one which I think was a lot of people who have worked with Enclave go, duh, of course you had time slippage. There's no NTP in an Enclave. You can mount the PTP of the hypervisor. But again, that invalidates our security model for PCI. So we had to actually synchronize with NTP which meant we need to add another layer of periodic work that needs to be done by the guest box to ensure that the VM could actually know what the hell time it was. We noticed that we were losing a second a day which is quite a lot when you are doing and that was based on traffic volume as well, more traffic, more time we lost. But if we did nothing, it was just one second a day. That really got into it when we had to do anything that was sensitive such as token validation. So, off effectively broke if a VM was running for more than three days which led us to a cron job that just cycled VMs for every three days for a little while until we re-implemented NTP through the VSOC. Fun. These are a lot of, like, yeah. So we kept running into issues and we kind of said, why is this so painful? It should be easy to just deploy a service into an enclave and give other people the ability to, like, say, yeah, that person who hosts my cloud computer definitely can't see the data being processed and I can guarantee it. Really useful for health data or financial data which are our main customers. So we put it all together and have a product called enclaves if you want an easy way to do hosted enclaves. So you can effectively give, we'll give you a Docker container. No, we don't give you anything actually. We give you a CLI and you build the Docker container with a Docker file into a secure enclave. You are given PCRs so it's fully attested. You give us your secure enclave and we run it for you. We push our data plane and control plane into the enclave and it talks to the control plane that we use so you can leverage it, all of that is open source so you can reproduce the build yourself and validate all the attestation. That's the same and ensures that everything is communicating effectively and there's no, well, me or my team aren't, like, messing with your code or changing it or anything like that. So it's just regular Docker containers. The connection is fully attestable and you can connect to it. I see 10 minutes and I probably don't need that long. So, but yeah, we're working on this. We're taking everything we learn from building our own service, putting it into our Everloot enclaves and it's on our GitHub. If you want to have a look and go through it, we want to be able to, people to be able to look at it, see that we're not doing anything wrong and try it out and hopefully have a better experience getting on boarded with confidential computing than we had because it was a lot of like throwing stuff at the wall, seeing what broke, where it broke and trying to figure it out. I'm going to go for questions then. You said you had problems with curse. British didn't be using ECC. Do you have any idea why the curse might have been a problem? Are you hitting page boundaries, packet boundaries or any ideas? Yeah, so we what we were seeing was that it was in the CPU. There was optimizations that we hadn't accounted for. So by default, the box we were developing on ARM Max, who were highly optimized for the curve we were using in default, which led us to like say, great, look at the performance here on our local machines deployed to production performance crashed. Turns out the AWS boxes we were running on were optimized for the standard K1 curve or one curve, a camera, which one it is now, but basically wrong curve. And we were an evening in the enclave, those optimizations still come true. So we were able to get 20 X performance gains from that, I think. Anyone else? Can you elaborate a bit on the nature of the payload or whatever you're executing there? Because I mean, we saw there pretty much encryption transactions. But what was exactly running there? So what do we run in the enclave? So for the benchmark was so the benchmark was basically fuzzing was what we were doing. But we send. So as I mentioned, in the enclaves, we have all our customer keys each in it. So we had one of our keys in there that we would send 20,000. So we would have 20,000 fields to encrypt. And we'd say each of these fields, we're going to iterate through this dictionary and encrypt it. So we'd send just a generic JSON blob. But for purposes of encryption, we could send just that we could just be a Boolean or a string or whatever and just send it in. And we then would iterate through that JSON blob and it would have the it would say, I am this user or application in which would then cause service to choose the right key. And it would then just it and they would say, these are the fields inside the JSON blob to find and encrypt. So it was JSON blob and ID and fields encrypt. Very simple payload, but it was just iterative work. And because of how the encryption is implemented, it's all blocking work. So we'd have to farm out the work differently, not like directly related to enclaves, but when we did the load testing, we determined that we were blocking and dropping connections in the service. So what was happening was the connection we'd schedule the work on the enclave and then the connection from the upstream service would die. Then we wouldn't propagate that connection dying downstream. The enclave would do the work, try to send the encryption back and then go, oh, no one wants this work and stop. So we had to put some keep alive and connections. But these are again the things we missed because we were having to reimplement just what would be generic TCP or or HP for talking over the Vsoc in the enclave. So you mentioned the the architecture you you are using made you adapt your cryptographic parameters. So how would that scale up to the future? I mean, crypto agility facing any words on that? I don't know. I'm the SRE who's meant to make such scale, but that's actually outside of my domain. We have people who understand cryptography a lot better than me and the company who would be able to answer that question. I can give you an email address if you want to talk about it, but I can't speak myself on that. Thanks a lot for the great talk. So I wanted to go a little bit back to like the use case you presented in the beginning. And I might have missed something, but it sort of sounds to me that like the use case here was not really like a sort of protection at runtime, but it's kind of like a long term protection of the of the keys and not while they are used by the proxy, but where they are in store. So did you like consider other solutions for this like HSM's and do you have like any insights there that why did you end up choosing the nitro enclaves for this particular particular use case? So I'll be honest, that predates me at the company. I'm not sure why it is. I will I would say that we did level of evaluation that were probably not too deep. We are a startup for and find our feet at the time and we had implemented a level of encryption just inside a process. And then when we attempted to secure it and build it, the enclaves seemed like an easy solution. I think that we've proven they were not an easy solution. But yeah, that's so we we would have valid what we validated were just ways to do encryption that would guarantee we didn't have access to users keys and we couldn't decrypt any of their data. And yeah, uncle, it seemed easy in reality, not so easy. So there's one online question. Can you explain the TSTLF protocol that you use? Is the protocol specified somewhere and has it been formally verified? So it is. So we actually had to reimplement it. We I can't remember which one we did, but we've looked at one that was done by the confidential computing. Consorting more of the paper that was published on it and we it was attestation in TLS connection inserts as our original implementation. That then I can't remember the specifics of it. So I will have to refer you to our get history on this. We deployed it and we were able to do it. In production because people had to add our root CA to their root CA store because you couldn't extend TLS in the way that was specified in the RCA for customers. So we eventually had to switch to a new attestation, which unfortunately I'm not the expert on. But it is available in the TLS. It's written in Rust. It's on actually it is linked in the talks on the page and the files them under attestation bindings. So anyone can look at the protocol we use for a testing it. Effectively, we leverage the PCRs that are provided by the underlying nitro enclave to and then we have an attestation protocol that we use to test the TLS. The underlying nitro enclave to and then we have an attestation protocol that on connection to what we do a TLS handshake that then performs the attestation and the client must supply the attestation bindings and we have implementations on the client side and go Rust node Ruby Python. Actually not Ruby, just Python and node and go. Oh and Swift and Kotlin. I will ask it like this because the interference with the micro and then you can Yeah, sure. So there was also a bit of discussion in the chat here about nitro enclaves and in what far you can go about the E and I know this is an endless debate and we even had an exclusive debate last year. Maybe can you briefly react to that and maybe also say a bit about the infrastructure you built, how tidal is to nitro and then the next problem can be solved. Yeah, so last, oh yeah, sorry, repeating the question. It's the debate about nitro enclave versus TLS. They are not, as I said, they're not as open source because it's mainly on the client side, they're open source rather than the server side and it's mainly just white papers, I believe that specify how the nitro enclaves operate or just documentation. So and the other part of the question was. How specific the tooling you built in the company? Yeah, so how specific is the tooling to nitro? So we did evaluate other cloud providers to see if we could move off to it. This was done a year and a half ago. We looked at Azure for doing it. Azure didn't have the new Intel SGX or is it SGX? Sorry, TDS sorry. They didn't have the TDS at the time, so we validated that it couldn't fit our model of secure computing. We probably need to reevaluate now, but the tooling is very AWS focused right now and nitro enclave focused because it was about trying to make nitro enclaves easier for us to use. Conceptually though, the control plane and data plane aren't specific to that. So far, they could be reimplemented for anything that wants to do TCP over a network connection for inside non-clave and outside non-clave.
Securing Embedded Systems with fTPM implemented as Trusted Application in TEE
Thank you. So yeah, so I'm Temek and I'm going to be talking to you about FTPMs and how they can be implemented. So that's me. I'm currently wrapping up my bachelors in automation and robotics and working full-time at FRIEMDepth as an embedded systems developer. Junior embedded systems developer, please keep that in mind. So if I say something, I'll do my best. And what is FRIEMDepth? We are a company based in Gdańsk, Poland and our expertise is in firmware and embedded systems development. And we're kind of cool. You may know us from our main product, Dasara, which is a core boot distribution. So that's the agenda. I'm going to first give some information about TPMs, then about ARM trust zone and then about how it translates to practice and implementing it on embedded systems. So I guess most of you know what that TPM is, but I'm still going to give a brief overview. So usually it's a separate piece of hardware, a chip that runs cryptographic operations like encryption or generating random numbers. So the system becomes more secure. Not everything is visible to the user space and thus the surface of attack is lessened. So there are a few kinds of TPMs. Oh, yeah, so those are the more about what's cool about TPMs. Yeah, they can also verify the integrity of the system, of the boot process and detect any alteration to it. And yeah, the secure random number generation is also a really important part. So yeah, there's a few way you can implement TPMs into your system. So the most basic, the most known way and the one that was shown earlier is the discrete TPM and it's a separate physical chip that's completely separate from the CPU. It's on the motherboard, but it communicates with it. The difference will be more visible when I show you integrated TPM. It has the cheaper and more space saving option and that is the TPM that is integrated into another chip. The danger of that is that now it's that chip, if that chip is somehow corrupted, it's attacked, it has an connection, it has access to the integrated TPM. So it's less secure. The next one is the least secure, but it's still something that is used software TPM and it's usually an emulation just made for tests and prototyping. So the main topic of this talk is the last one, the firmware TPM and it's a software TPM that runs in a trusted execution environment and is separated from the normal S, from the user space via the trusted execution environment. The plus of it is it's cheap and it can be implemented on devices that are already provisioned. So via an update or something, but via an update. On embedded devices, the trusted execution environment is made possible via ARM trust zone and ARM trust zone creates a hardware separation. It creates two distinct words, they're called. So we have a normal word where we have the normal user space in documentation, it's called re-choice and there we run our applications like kernel and user space apps and we have a secure word that has trusted applications and those can be like stuff you don't want the user space to have access to or only have certain access to. Such I think can be an FTPM and it can run operations like encryption, decryption, creation of keys and then it's a random number generation that I'm particularly fond of because it's kind of funny. So yeah, and also secure word provides makes it so only trusted OS can access certain parts of the hardware of for example memory. So there are parts that are reserved for operations of trusted OS and there are those allowed to be accessed by which OS. And this exactly is made possible via secure monitor. So we have ARM trust zone specifies exception levels and as you can see a secure monitor allows certain like for example, it tells hypervisor when what memory addresses it can use and what are specified for secure partition manager that is part of the D. And so the threat model here is of course if we have an app that's been like a virus or something, it doesn't get to the bottom. So we can look at it this way that okay when we have our hypervisor is corrupted, the trusted applications and trusted OS is still like valid. But if the secure monitor is somehow corrupted, then we have a problem. And I'll get to that part later. Yeah, and all of that was for Cortex-I ARM series. That's important distinction because on ARM Cortex-M the trust zone works completely different. It works through interrupts. It's kind of a funny topic because you could theoretically implement some sort of I wouldn't say FTPM because Cortex-M doesn't really allow you for example for running operating systems. There were some, there are some products that do that. They're on the border of black magic and they're awesome. But yeah, so FTPM on Cortex-M is a weird concept a little. Okay, so yeah, there are some problems with FTPMs because you could as I said update a device via the internet over the air to allow it to add to it FTPMs. But as the slide says, the best protected systems have dedicated security from the beginning. And ARM trust zone and TR and like magical thing you can just throw on a device to make it more secure. It will make it more secure but not as much secure as it would be if you would think about those things from the beginning. Because ARM trust zone doesn't add in itself a lot of the important parts that make an embedded device secure. For example, there's no secure storage on like in the ARM trust zone specifically. You can use an MMC to achieve that. But if you don't have that on the device that you're updating, you have to find some work rounds. The same happens with secure counter or secure clock that can prevent rollback attacks. And if you don't have those, you're not really protected from them. The secure source of entropy is a really fun one because there's been a work around for this. Actually, there's been work around this specified by the UNICEK presentation. It's linked at the end of the slides. The secure source of entropy is a fun one because they've managed to achieve secure source of entropy via only right ones, only fuses that have some random seed on it. And they're written once when the device is manufactured. They can be written again. And they act as a seed for random number generation. So fun. Yeah. And FTPM also has its own problems because the secrets are written to the memory. It's not safe on its own, for example, from cold boot attacks. So when the device is suddenly shut down, you can see the state of the memory that was at the end of the device last runtime. The same happens with bus sniffing. So we can just physically peek at the electrons that travel on the bus. And also, yeah, you can just plug a JTEC to some processors. And also, there's a one small caveat. Normal and secure world can't run in parallel. So always one runs at once, and they take over. So if you have an embedded device that requires real-time operation, you're in trouble. There are workarounds, of course. But I would like to hear them because it's a problem. So imagine you were a junior developer and you were taught, OK, so do an FTM in practice. You're me, basically. So yeah, that's how I approach the problem. And that's how it can be approached. So let's say you have some embedded device. And yeah, we have there are a few implementations of T that you could use. Most of them are proprietary. Opti is not. Opti is open source. It's awesome. It has a documentation. So yeah, once we have that, we need to build FTPM as a trusted application for RT. In this case, for Opti. And at the last step, we add some user space support so we can actually call the TPM. So let's focus on the second part because it's fun. No, sorry. Let's focus on the last part first because I didn't arrange the slides as I thought I did. So yeah, there's a kernel module in the Linux upstream currently that supports access to him for FTPM. So it allows the system to mount FTPM as a TPM. I'm not going to show you the code. It's called. I don't understand half of it by school. But as you can see, it's made by Microsoft. And Microsoft provides a description like the paper, the white paper that was written on FTPMs. That's also cool. And they provide a reference implementation. Great. And it's for ARM. Great. So the half of the work is done, right? Oh yeah. That's as I said. I didn't arrange it as like, yeah. So that's how the kernel driver works visually. So yeah, it mounts it. So it's seen by the user space as a TPM device. OK, so that's a problem with the Microsoft implementation. Like it's not maintained at all. It's provided as it is. It's cool that it is. It could us to them. But it doesn't work currently as it is. And so that's what I've been fighting with for the last few weeks. And OK, so this requires tweaking. The amazing guys at Linaro shouted to them. We're kind enough to create a fork of OptiMunifestudes for we used for building. And it allows to build FTPM. I have a few minutes left. So I think I won't be able to show you a demo of it, because it's there on this laptop. And I also didn't have time in the last few days to create a pull request. So I hope by the time this video is up, it will already be on GitHub. And I hope also it will be merged. But yeah, if you want to build an FTPM on Camu, that's the best currently repository available to fork it. And yeah, Yocto also provides a bitmake recipe for building OptiWeef FTPM as a trusted application. But it currently works only for ARM. I mean, it only works as a test for ARM. To add support for your own board, you have to append some recipes and do some magic to make it work. I haven't tested it through Rugu. I haven't tested it as much as we would like to. So yeah. And all of this was made work on our current operating system for embedded devices. Like to focus on security and make it as adaptable as possible to solutions for your embedded device. So yeah. This is those other resources I use. So they're all awesome. I highly recommend this book. It's not as boring as it may sound. It's really well written. So yeah, that's all. And if we have time for questions, then we can do questions. Just a request. Can you go back to the very information you took page? OK, yeah, sure. I'll go back to. OK, I was shown a card to repeat the question. So yeah, I was told to go back as late. So that was a question. How many? It's also online. Make sure that they're also online. Oh, yeah. And slides are also online. So if anyone wants to do something, they're available. So yeah. If we have a few minutes. Oh, yeah, sure. Just a few years. Can you use the opti-recipes? Opti-recipes. Opti-recipes. There are the example ones. Yeah. So the question was, did I use opti-recipes to build it? I didn't see any related to FTPM in the examples. And the examples are also cool, but they're kind of complicated. So it took a lot of reading and trying to make sense of those make files. So I used only I tried, but the thing that worked was the patching of the Linaro fork, because it also has it. Like it was last updated, I think, a few years ago. So it uses a lot of outdated. Like there's Python 2 syntax in it somewhere. And so yeah, that's the one I'll be providing. Apple request, hopefully soon too. So yeah. Are there any more questions? There's also time, Tim, if you want to do a demo, I'm sure people will. OK, sure. So maybe in three minutes, and then we still have five minutes. Yeah, yeah. Awesome. So yeah, this is the Camel image made from this forked Linaro. So as you can see, we have a normal and secure word. Currently, it's not started. So I can start it. There's some outputs. Yeah, the secure word doesn't really provide any way of communicating with it. We have any sort of, like besides user space. Oh, sorry. So yeah. And this particular example there's an I'll show you, but I have only one hand for right now. So it's kind of hard. That's all. Yeah, so the Linaro guys provided some aliases to load the utilities that are on the host system. They are not built in. They default in this example, and they load up the not the exactly kernel module that's currently in the upstream. It's a slightly different one. So that's also why I didn't like want to call it a live demo, because it's more of a live Frankenstein that is currently working. Maybe at next post them, I'll have something better to show you. So yeah, and if I run this alias that uses all of those commands, we can run some tests with the IBM implementation of DPM utilities. So I think it will output some random generated code. Oh, I also have to hear a cheat sheet, because I couldn't remember the exact syntax for creating, for encryption, the encryption, because DPM tools also works. As I said, the FTPM is mounted as a TPM. So every user's space utility that can use TPM works. So yeah, I don't know. Load. So yeah, so that's the demo, I guess. So I think we're done. So thank you. It's been a pleasure. See you all somewhere.
Integrity Protect Workloads with Mushroom
All right. So the next speaker is Tom Dormann and Tom is a real hacker. So I first met Tom last year at the CCC event where he talked about, which I did a bit about, an attack he did on NX. And I think it's very inspiring in the Dev Room that we have this great company talks, but it's also really nice in the real, let's say free software ethos. Tom, there's also some work in his free time, just pure hobby projects. And he'll talk a bit about work he's been doing on AMD-SEV. Sure. So thank you for the introduction. Today my talk will be on integrity protecting Linux workloads with Mushroom. Okay. So what are you going to talk about? Well, first up, we'll talk about some of the goals of Mushroom. And then I give a short demo to show you how it actually works. And then we'll talk about higher level architecture and some of the parts in particular. So the supervisor, kernel and the DMM host. And then we'll also talk about some of the things we don't want to do, some non goals. And finally, we'll briefly touch on how Attestation works with Mushroom. Okay. So, but this is a micro walker. Yeah. Okay. But before that, a brief thing about me. So my name is Tom Dormann. I mostly do developing and re-security research. But my day job is also reverse engineering. Here are some of my links. And one thing about me is that I also really love Rust. So all of the code that you may see here today is also written in Rust. Okay. So what do we want to do? The main goal is to run Linux programs securely. And in particular, we want to run programs that just transform an input file or maybe multiple input files into an output file or potentially multiple output file files. And while doing that, we want to prevent any tampering during the process so that we can make sure that the output files are authentic. So for example, one use case would be that you have some untrusted server and that you want to compile code on. And ideally, you want to not trust that server, but still be assured somehow that there hasn't been a backdoor injected somewhere in your code. And you just want to have assurance that the code has been compiled without any tampering. So yeah. I'll give a brief domain of that. So yeah. Okay. So what we'll be doing, I already talked about workloads. Mushroom is completely generic in what kind of workload you want to run. It has to be Linux binary, but that's basically it. So for this example, I chose to do a simple Docker image and just because it's easy to set up. And so in this case, it's an alpine image, which has GCC and muzzle installed. And it will run this in its script, which just copies the input file and that we want to transform to another file on the file system. And then we'll run GCC on that. And then in the end, we'll use that output and copied it to a special output device. And so the file that we want to compile is this one right here. Yeah. So it's just a simple head award, just a proof of concept. Okay. So, yeah, I should clear that. Okay. So beforehand, I already set up some like environment variables just for some of the components. But the important thing is this one right here. Okay. So what we'll do is we'll run this command, which as you might already notice contains some information like the input file that you just that I just showed to you. It also specifies the output and it also specifies where to put the gestation report. And because that's in the end, how we really know that that process hasn't been tabbed with is that gestation report. And so we'll run that. In this case, we'll actually take a bit longer than usual, because like the Docker image is fairly large. It's like under six megabytes or something. And just loading that in this fairly slow process. But any second now, the workload will start running. Okay, now it's started running. And now it's finished. Okay, so let's take a look at the output file. So just file test. And we can already see that it's a 64 bit elf binary, which is of course expected because we compile the C program. But before we actually run the executable, like, let's actually verify that it hasn't been tampered with. And we can do that by just using the same command that we used previously. But instead of saying run, we use verify. So we use the exact same configuration parameters. And this will take very shortly. And it says okay, so we know that the process hasn't been tampered with. And so as the last step, let's actually make it executable. And run it. Yeah, so you can see that also works, which is, yeah. Okay, so now that we saw how it work, like, what it's supposed to be doing, let's talk about some of the details about how it's implemented. And the first thing to note here is that it's implemented using SEV, S&P. And so in this case, we have all the virtualization. The workload is of course supplied by user, which in this case was GCC. And then around that we have a completely custom kernel, which we'll also talk about later. And around that, then we have so called supervisor, and which is a concept I came up with, which is basically just responsible for communicating between the kernel and the host. And the important thing to note here is that most of the logic is actually in the kernel. And this will probably grow a lot in the future as well. And the supervisor is fairly slow, not slow, fairly small, and will also probably not grow a lot in the future. It might even shrink. And even in this configuration, there's some code that's like disabled at compile time because it's only there for like debug features. Okay. So about the kernel, it's completely written in Rust. It just implements the Linux Cisco interface, so that we can run unmodified Linux programs. It currently implements 83 Cisco's more or less because like some Cisco's have a lot of flags and we don't implement all of those. But still it's a lot of, it's enough for some of the applications at least. And yeah. So apart from that, we also support 32-bit and 64-bit binaries. And the reason we have this kernel is that usually you have a lot of bloat and you have a lot of stuff that you just don't need. And so the reason we have those on-custom kernels that we can just throw things away and only implement the things that we need. And yeah. We'll also need that for some things that we'll talk about shortly. Okay. So the really interesting thing I think about mushroom is the supervisor. And so I already talked about that it handles communication between the host and the kernel. What does that mean? Well, the first thing the supervisor does is actually load up the input. And so the input is not part of the initial measurement. The reason for that is that we don't want to measurement to change every time the input changes because then we can sign it or at least not in a way that's like really makes sense. The other thing is memory hot plug. So initially mushroom starts out with a very small amount of static memory. And then after that we use memory hot plug to load in more dynamic memory once it's needed. And lastly, the thing that we do during runtime is scheduling. And so if the kernel wants to run another CPU, it somehow has to tell the kernel about the host about that. And so that's also a responsibility of the supervisor. And so the interesting thing here is that this communication, it's not just a convention, it's not just that the kernel chooses to talk through the supervisor to the kernel and to the host. It's actually impossible for the host to talk to the kernel directly. And so the reason for that is that we want isolation there. We don't want the kernel to have potentially malicious input sending to the kernel and we want to prevent vulnerabilities by just not having an interface there. And this is implemented using a couple of hardware features. So for example, one of them is virtual top of memory, which just basically makes it so that the kernel can't access shared memory, which would of course be needed to access, yeah, to have shared access with the host. Another feature is VC reflect, which is basically in some cases you need the hypervisor. And instead of using the hypervisor, we can then just offload that responsibility to the supervisor. And that way the kernel doesn't even really have to be aware of it being run in a SAP VM. Lastly, the separation between the kernel and the supervisor, which is of course also important, is done using virtual memory privilege levels, which basically makes it so that the supervisor is allowed to access all memory. But the kernel is not. So for example, the supervisor has some secret keys that it uses for agitation. And the kernel is of course not allowed to access those secret keys. And yeah. So the important thing here though is that the supervisor is the only security critical component. The kernel can have as many bugs as it wants. The host will never be able to talk to the kernel directly. So it doesn't really matter if there are security bugs in there. And this is of course really nice for auditing, because the only thing we have to audit and make sure that it actually works is the supervisor, which is this once again fairly small fairly small component of code. Yeah. So for the VMM, we don't use QMU or anything. Reason for that being is that we have this fairly custom like memory hotplug and so on. And like all those interfaces and getting the data in and out. So yeah, instead of using something that already existed, that maybe has like abstractions that are not ideal for us. And we just implemented this for our own. It's not actually that complicated because we once again, we don't have that much host guest communication. So this VMM doesn't really have to implement a lot. And as of a couple of weeks ago, it also supports running the kernel outside of an SCV SMP VM, which is of course really useful for like debugging and profiling. And maybe not everyone has an S like an epic CPU that can actually run those VMs. Okay, so we already talked about a lot about things that we want to do, which yeah, but there are also things that we don't want to do. So one of those important things is that we don't want to do IO at runtime. If I want to run GCC, I don't need network. I will never need that. That's just not a thing that we need. And sort of thing is by not having network, we can reduce the tech surface drastically. And once again, like reduce complexity in the supervisor in the kernel and mitigate vulnerabilities by just not implementing interfaces. Of course, there are a lot of a lot of use cases where you do need network. But in those cases, you can just use standard Linux and like, you can just use other projects. But the point is that for a lot of projects and workloads, you don't need the extra complexity and job us by just not implementing that you can lower the potential for vulnerabilities. Same logic goes for persistent storage. So every time mushroom boots up, you boot into tempfs with like all those files that you split doing initialization. But once the VM actually exits, all that memory is destroyed because for a lot of use cases, you don't need persistent disk. And by not having that, you can once again, low complexity. Similarly, we also have fairly low complexity in the supervisor, which once again is this one part that's actually security critical. So one of the things that you might have noticed is that none of the things that the supervisor is doing are really CPU bound or performance critical in any way. And so for example, we can get away by just not implementing multi threading, because in reality, there's nothing that requires that amount of performance that could potentially be that could potentially like get a performance boost by multi threading. And so by not implementing multi threading, we can once again, just like eliminate the whole class of concurrency bucks, because that's just can't happen if you don't have multi threading. Similarly, the supervisor is fairly simple and doesn't actually need the heap. And then once again, we can just not have any bucks in there if you don't have a heap, if you don't need it. And yeah, so I think those non goals also really important because they could strain the things that we want to do and that way, like increase the security by setting up clear goals. Okay. So lastly, let's talk about agitation. I'm sorry, talked about the measurement. So in this case, this contains all of the binaries that you want to load up in, which is the first supervisor, the kernel and the in the binary. Those could be signed in the future. Currently, we just compare the raw hash. And so the SEV firmware, when you load in the image, it like hashes all the memory and like chains it together and just produces a hash that you could be some could sign, but we don't currently. And the host data field is also field that's supplied when the VM is booted up. And so this, this field just contains a hash of the input file. And the first thing the supervisor does is when it boots up is loading the input file and actually verifying that that hashes correct. So it doesn't even really look at the data, it just hashes it. And that way, there's no way hopefully for the data for the input file to potentially be malicious and influence the supervisor before it's actually been verified that it actually is the one that we want to see. And lastly, of course, we also want to attest the report, I'd like the output. And this is put in the report data field. And this is also interesting because this is actually the only field that the guests can influence at runtime. So both the measurement and the host data field are set by the SCV firmware. And even if you have like some malicious input file or malicious input binary, you can only modify like you can only modify the report data field. And so this is really important because if you have like, assume, assume you have some untrusted input, you will never be able to fortune attestation report in such a way that it pretends to have to come from another host data from another input file. And that way, we can just like by making this the simple abstraction choice, we can hopefully reduce the potential for any vulnerabilities there. And so this is also another thing where it's compared comparatively simpler compared to other projects. Because one of the things is that we only do attestation at the end of the process. So we don't have any certificates during runtime. And because we don't have any IO at runtime, and so we just don't need any other certificates that would usually have to interface with other services. And like, I can see why there are a lot of problems like sanitization. But that's just one of the things that this model doesn't really need. And similarly for like this encryption case, so the attestation model for mushroom is just really, really simple. And hopefully made in such a way that's actually easy to audit for external people if they wanted to do that. Okay, so do we have any questions about that? Thanks a lot for a very interesting talk. So I particularly like this demo that you showed because showing like this use case where you actually run a compiler inside the CVM is like a very, very desirable property in like build environments where you want to have this notion of hermiticicity where like you actually record the entire tool chain that you use to produce software. So related to this, I sort of had a question related to the sort of the trust assumptions here. So you talked about this that the supervisor is the sort of the only security critical component, but that basically only applies to the communication between like the outside world and the kernel. And but you still, you know, you later talk about it that you can still have like attacks via the input itself. So for instance, if I have like malicious code that targets some vulnerability in GCC, let's say, that's still possible, right? But on the other hand, that gets somehow recorded in as part of that station. Can you a little bit like elaborate on like, like these aspects? Yeah. Thank you. Great question. Yeah, of course. So yeah, so if you have an malicious input, that would show up in the attestation report. And I mean, ideally, if you have like a scenario where you want to have like a code cache, where you like compile, compile code once, you will only supply inputs that like are not malicious. So as long as you don't like request malicious inputs, you will not get malicious inputs out outputs out. Yeah. So I mean, in theory, there could be like attacks from the inside, but that's not really a problem because that always shows up in the attestation report and like a normal user will not request that. So yeah. Yes. Yes. So an additional comment was that this is audible. The question was whether or not this is auditable. And yeah, so the answer to that is yes, everything shows up in the attestation report. And yeah, so hopefully that's not a threat. Any other questions? Thank you. This was awesome. And then it's not a question is a feature request. If you could spit out as bomb, as bomb from, from the compilation thing with, you know, that would be fantastic. And yeah, well, the thing about that is that mushroom is not necessarily only meant for compilation processes. But if you want to do that, that's great. And one of the things I've been toying around with was, was running nix compilations and nix builds in that. And of course, those are already contained like in the build hashes, like the way nix works, all the inputs are already specified in that. And so if in that scenario, you would like more or less have an S form at least like traceable to some input. And but that's independent from mushroom. But of course, that's also the use case I intended. Okay. So yeah, first of all, very awesome work. I really like that you show that these confidential VM based solutions can also be used with very tiny trust computing bases. That's nice. And I mostly agree with your design choice of the non golds. But they said that you don't support mighty swing. But wouldn't that be somewhat important for compilation to be able to run on multiple costs? And it's kind of CPU can you consuming? Yeah, sure. And so this thing about multi threading, this only applies to the supervisor. The actual kernel can run as many costs as he wants. I mean, technically, a second limit on 128. But yeah, that could be changed. And it's probably enough. Yeah. Maybe a question also moving forward, you mentioned support and also even protected combination. Do you have a part of your design and I'm thinking about the PMPL support? Okay, so the question was whether or not my designs are tied to S&P, a CV S&P, or whether or not they could also apply on TDX. So currently, the supervisor is highly specific to S&P. But I don't see a reason right now why it couldn't be implemented for something like Intel TDX. That should probably be possible. Yeah, I mean, the MPLs are specific by I think with TDX stuff like partitions, maybe that could be something. I'm not sure. I haven't looked into that. Yeah.
Reproducible builds for confidential computing: Why remote attestation is worthless without it
All right. Let's get going. Our next speakers are Paul and Malte from Azure Systems and they're here to talk about remote attestation and reproducible builds. Yeah, thanks. And I will start with some motivation. First, the topic of the talk is reproducible builds for confidential computing and why we need it. So first, the motivation. Yeah, confidential computing. What is the situation with confidential computing? We have trust issues, especially when we're running in the public cloud. So yeah, first of all, we trust no one. Well, that's not entirely true. We need some hardware we would have trust. So we have some will have to trust the hardware manufacturer. And in all the other components that we are using, we have to establish trust before we can rely on them. And we're doing this using remote attestation. So a quick overview of a remote attestation based on the Rats RC. So here we have our three entities, the attestor, the verify and the relying party. And the goal of the remote attestation procedure is that relying party can place trust in the attestor system. So how are we doing this? Inside the attestor, there's a testing environment and the target environment. And the testing environment is doing measurement of the target environment. And then handing out some evidence that is verified by the verify. And the verify uses two kinds of resources to verify the evidence. First, the endorsement, which usually provides guarantees about the authenticity and integrity of the evidence. And then some reference values that are compared to the claims that are inside the evidence. And yeah, there were verify does verification and produces an attestation result. And that attestation result is consumed by the relying party. And using this attestation result, the relying party can place trust in the attestor system. So the aspect about this remote attestation procedure we want to talk about here are the reference values. So as I already said, we use the reference values to compare or to check the claims inside the evidence. And yeah, some of these reference values represent the code identity of what we are actually running inside of our TE. And often these values are hashes over what we are executing. And as we all may know, hashes are one way functions. So it is really difficult to go back from a hash to what was actually hashed. So many questions arise from this. Where do these hash values come from? Who produces them? So who's our reference value provider? What do these hashes stand for? And how can we establish trust into them? And often the answer is we just can't. And in this talk, we want to present a way how we can establish trust and meaning to those reference values. So why might this be a difficult task? So the main scenario we are talking about here is about CVMs. And these CVMs have quite large TCBs. So we need to cover all of our software component with these reference values. And yeah, there are quite a lot of components from where bootloader kernel user space. We all need that stuff. Yeah, can be quite a lot of lines of code, not always only some lines of code like in rushroom. But the more interesting question is who is part of our trusted computing base? So software vendors usually and usually also a lot. And there are different ways that we are including people into our trust base. So maybe the more simple one is that we can consume code from other people. So well, that's quite usual. It's also okay. We can order the code before we include it. And ideally, our language ecosystems provide us with some mechanism to pin the dependencies that we use by some hash or so. So that's okay. The second mechanism is more problematic. We could consume binary artifacts. And going back from a binary to source is expensive. And yeah, typically, this is when we install packages using a package manager, or if we use prebuilt VM images. And even if those binaries are signed, if we rely on the signature, we include the signer into our trust domain. And then there's the third case that is even worse. These are the situations where we cannot choose what we want to what is actually running inside of our TCP. So this is, for example, the case when we have like something hardware compatibility layer running below our guest OS in the CVM. Or if we are not able to run customer provided firmware in the public cloud. Okay, so talking a bit about the consequences here. Yeah, every software vendor we include on our trust boundary could potentially run an attack on us. For example, by delivering malicious reference values, so reference values for malicious binary. It's just really difficult to check for us what these values stand for. And in the end, we have no insights what is actually running in our system. So a simple solution could be we build everything from source, right? Source is good. So we can audit the source. But usually we are not the consumers of the things we built. So we're not the end users. And as a consequence, there's one remaining software vendor in the trust boundary. And that are we. So that's not good too. And the actual goal here is to provide a testable systems for the end user. And reproducible bugs can help us to do this. And much of it will continue to tell you about reproducible bugs. So thank you. So let's quickly talk about what reproducible builds actually are. So the basic idea is you have software development principles where third parties or anyone can take the same inputs and produce the same binary output. And this part of being independently verifiable is really important to us. And let's just take a small step back and look at our perspective. We are building a lot of software that is supposed to run inside of enclaves. For example, we're building a full Kubernetes distribution with OS images and containers. And we really want people not to have to just trust us because we are like reputable. We want people to take the stuff we build, look at the source code, verify it, rebuild the binaries. And only then, only if they can rebuild the same binaries, they can also just get to the same measurements and then they know that they can trust us. So in the perfect world, this is what we would like to have. We just take the source code, we put it into a function and we get out the reference values. But as you will see, this is sadly not the reality today. So looking at this more closely, you have the source code and then you have some kind of build process. And then what you get out is binary artifacts like the firmware, the kernel, anything that goes into the user space applications. And from these, you derive the hashes or other reference values used for remote attestation. And in reality, this is already where you start running into problems because the software itself is not open and you cannot rebuild it if the source code is not public, basically. So sometimes this is where you just have to stop. But then if you're lucky, the source code is actually available. But that's when you run into a whole different set of problems because if you want to build the same firmware and the same kernel and the same user space and everything else, you notice that if you build your software, it doesn't actually just depend on the source code. It actually also depends on timestamps and randomness and inputs that you didn't know you had. And also it depends on tools and specific versions of them. So let's say you actually managed to get all of this under control. Then you can still run into this situation where you get the same firmware and everything else, the whole stack, the whole TCB is the same. And you boot it in a trusted execution environment. And still the evidence that you extract is different. And this is often the case if you include anything in the measurement that is not part of the code, but it's actually dynamic, like a timestamp at boot or the instance ID of your virtual machine. And yeah, in this case, you have to do some run a policy engine on the other side, basically. Yeah, so this can be solved, but it's also really annoying. Next, we will quickly look at who's already doing good work in that field. So first is the AWS UEFI firmware. And this is used today to run AMD, SCV, SNP virtual machines. And it's really nice. It's just EDK2, OVMF firmware with some patches, but they also provide the full build system. So you can just download it and rebuild it from source and actually get to the same measurements. Yeah, another example is Constellation. This is our stuff that we built in there. We actually provide every container image, any tool, the whole operating system, anything can be rebuilt from source. And it's all reproducible. And yeah, then there's also the confidential containers, cloud API adapter, and there's the peer ports images. They now have an option to build images with MKOSI that are now also mostly reproducible. And we also have a GitHub repository where we basically just wrote on all of the steps that are needed to take a general purpose Linux distro and get also reproducible builds for that. And it's also documented and we show you all of the steps that we took. So you can play around with that, which I think is like a good starting point. So that's the repository if you want to have a look. Yeah. So now some concrete help if you actually want to do this. So this is for building OS images in particular. First of all, you need to pin your build tools. Basically, if you don't do that, tomorrow you will have a newer version of a tool and you will have a different result. And what we noticed is if we use something like NICS, we can pin all of the build dependencies in a very nice way. And also we were able to patch a lot of the tools in NICS. So they actually become reproducible. For example, we had a tool like MKFS for FAT partitions that was not reproducible. We could make sure that the version NICS was actually creating reproducible outputs. The second thing is about any things that you depend on. So that's libraries of your building software or binary packages if you have to include them in your image. First of all, you want to pin them. So you want to make sure that you know in advance the hashes of everything that will be a dependency. And then it's not done just during that. You also have to make sure that they are available in the future. So you have to archive them. You have to make them available. And you also need to have a mechanism then to actually operate your log files because if you just pin them, you will have a lot of security vulnerabilities in the future. Yeah. And then it goes on. You really want to build every piece of software in a sandbox because otherwise you don't actually know if your build is reproducible because it could depend on something that is not actually there in the future. So user build system that does this, there's MKOSI for building OS images. There's NICS and NICS OS, which are really great. And there's Bazel that also uses Sandbox. Yeah. It will eliminate a whole class of issues. And then you also really want to restrict build actions or install actions or any other kind of logic to only perform deterministic steps. For example, I think the Cocoa project was using HashiCorp Packer and that has the issue that it can run arbitrary steps, which means it could, for example, run up to get install. And then you basically have no idea what version of something will be installed. And the same applies to Docker files. So just use something that only does what you want. So this was our talk. There's some important things we want to give to you to think about, learn about reproducible builds. We want you to provide an open software stack for CC. And we want to enable the community to reproduce the reference values that we put out into the world so we can remove ourselves from the trust. So thank you. Thanks a lot. So I have a bit of a philosophical question. And that's related to sort of the sort of like the relationship between reproducible builds and build provenance. So we also had, like for the last time, there was last talk, there was a question about like S-bombs. And this is of course something that is of increased importance because of the focus we have on supply change security in general. And there is also people working on build provenance, right, where you have the build hermiticity and you have like a record of how software was built. And that also gives you like some guarantees of how you end up with a certain set of reference values, even if it's not fully reproducible, right? So, but you know, from the provenance, like what goes into this recipe. So do you have like any thoughts on like pros and cons or like reflections around these two related topics? Yes, definitely. So I think first of all, if you're able to have reproducible builds, you already basically have an S-bomb because it must be the source code and anything that's locked in there. So it's actually like already the S-bomb there. And then also, if you have an S-bomb, how do you trust the S-bomb? Because if someone just gives you the S-bomb and it's not signed, it could be fake. And if someone does create like a trustworthy S-bomb, it probably needs to be created in a confidential VM or something like that. So then you actually make the whole problem a lot more complicated. So if you can just use reproducible builds and then the problem is just fixed. So the question was about if this also solves the problem of pinning the toolchain. Yeah. Yes. So the question is about like if you can trust the toolchain that bootstraps the whole system. And yeah, I think you can bootstrap yourself from nothing. And I think also the NICS project has some kind of bootstrapping where they do exactly that. So that's it.
Increasing Trust and Preserving Privacy: Advancing Remote Attestation
I'm Leonardo and you're not from ARM and I think it's going to be a great end of the day so looking forward. Well, hi everyone. So this is a talk about remote attestation and yeah, because we think remote attestation is sort of an inflection point and it's becoming increasingly available and used and any new technology when it comes to the fore you have to consider different aspects, societal and technical as well. And yeah, so we're here to talk about this. Possibly interesting things. My name is Thomas as Fritz said. I'm your notes. The ghost of Hannes is here with us. He couldn't come to Brussels but he's here in spirit. Yep, that's us. Okay, so I wanted to start with this sort of timeline that tries to capture some of the more relevant events in the history of remote attestation starting from the theoretical underpinnings with the DDSA paper from the fine people at PARC in 1983. And you have to wait sort of 15 years before research trickles down into industry. You have at the end of the century the first industrial consortium that is formed to actually define what a trusted computing architecture is in terms of behavior, in terms of the interfaces that it needs to expose. And so we have TCPA formed which then morphs into TCG, the trusted computing group. These are the guys that are responsible for producing the TPM12, TPM20 specs, among other things. So you have the first decade of the 2000s that is sort of where attestation, because you know, TPM has a strong attestation story bound to the idea of using the TPM as a route of trust for reporting. So you have the first decade that is sort of driven by trusted computing use cases. Then enter the second decade and you have AMD and Intel SGX cropping up. And this starts the sort of confidential computing driven decade where you have the first, second and finally the third iterations of the architectures, which culminate into the SCV, SNP, Intel TDAX and ARM CCA. And you have a few other interesting events in that period. You have the riot paper from the Microsoft guys that sort of articulates fully the ideas that were in the DDSSA paper. So 30 and odd year later, you have finally the dice ideas on paper and not just on paper and code. And you also have the PSA attestation from ARM, which is sort of an IOT targeting attestation for IOT platforms like riot as well. So you have also this area of the users of the, well, this is covered by attestation primitives. Where attestation primitives starts to enter. And then you get into 2020s and so on. And here is where we sort of see some kind of maturity in terms of the standards that are actually coming to the fore. Not just standard in terms of standardized formats, data formats and rats that was mentioned before. And that is coming out this year. Is one example, but also software standards. So the configure FSTSM ABI that Linux kernel is just upstream is one very, very concrete example of standardization in that space, in the software space. So we are here, as I said, we're probably at an inflection point. The primitive is increasingly available, not just in the configuration computing space, although CC is a very, very prominent area, right, that drives this. But also, so you have use cases in IOT, you have use cases in TCP remediation. Well it's also cropping up in your devices with interesting societal fullbacks. And so basically the idea is that, like Dave Taylor said, every authentication use case is also an authentication use case. For wherever you have the need for authentication, authentication which is effectively a stronger authentication primitive, stronger identification primitive is something that could be used to either reinforce or supplant your previous thing. So that's where we are. And yeah, so I think when you have, as I said, when you have this new technologies, you need to look at the bigger picture and try to understand what are the implications that the use of these technologies have on the wider ecosystem. One of the interesting things here is the centralization risks that are involved with attestation. Another one is privacy. Well, let's start with centralization. I think it's here, yeah. So if you have looked at the RATS architecture picture that was in the talk before, you have seen that the verifier is at the very center of the image. And it's not just visual biases, it's really a central architecture upon a choke point of the architecture where all the message flows are intercepted basically. And also where the decisions are made because the verifier box has a verifier owner attached to it. And the verifier owner is the guy that has the power to decide who talks to who, right, which attest has the right to talk to a reliant party. So he's actually gating the information flow. And therefore he's a very powerful entity. And the risk here is associated with monopoly, right? So there are the situations where if you don't look at the carefully at your design and your architecture, you slip into this potential centralization risks, which we have seen in a way. I don't know whether you have followed that. Environment integrity is something that exploded last summer. And yeah, it's the cautionary tale. It's the perfect story of vertical integration where you have basically a monopolist actor that takes care of the whole thing and basically subversely. Well, it creates problems. So the fact here is that centralization can be sort of tackled, we think. The RAS architecture has a nice, it basically cuts through the, it has sort of, it curves out the roles in a way that you can actually, across a long tussle boundaries. So you can actually remodel the roles in a way that, you know, for example, you're moving the very fine function towards the user, in a user centric way. But you know, not all use cases are, not for all use cases, it is possible to do this, this rearrangement of roles because sometimes you would end up in a conflict of interest situation or something like that. So maybe one idea is to run the verifier as a neutral entity, a multi stakeholder entity. Analyze and script, that's what they did with when they democratized the X599 world by creating this multi stakeholder consortium that runs the less encrypt function, which is another example of this kind of centralization opportunities. Yeah. Privacy is another aspect. So all the flows, all the message flows go through the verifier, the verifier has to see the claims to make the ref value matching. Therefore, you know, it sees everything. So the potential of abusing this position is great because PII are maybe not in the evidence, but can be actually indirectly obtained from that. And so, so this is, this is a risk. There are things in the, in the, in the toolbox. There are basically two kinds of ways to deal with this. One is to inflate your anonymity set, either by cryptographic primitives, group signatures and stuff like that, or using methods like anonymization in the hardware, like, you know, creating a batch of devices, like FIDO does, like ARM CCA does, in certain configuration. The other thing is, well, yeah, is you reduce the, the claim set. So what you need to expose to the, okay, to the outside world by claim reduction and, and other patterns like selective disclosure and, etc. So things are there. So these were the societal aspects. This one is instead the technical aspects. So we have, we have been in a situation where mostly the, the designs were, so we're, we're, we're transitioning from a situation where the, the solutions were experimental, right? So we are, we were mostly in research mode. Now we need to move to a different approach, a more engineering oriented approach and, yeah, more structural approach. And we think we have, you know, some suggestion to make and I'll let Yonis, and I'll let, sorry for taking so long. Hey. Okay. So I want to talk to you a bit about IEDF and why we think it's a good venue to try to standardize standards relating to remote out the station. So first off, let's look a bit at some of the IEDF principles that form the core of its mission and why we think these are relevant to, to the hacker crowd here at Fosnum. So we started openness, open process, so everyone can get involved and can, can, can read standards that are being worked on. And this includes not just technical folks, but also members of the, let's say, civil society who have things to, to say about things that are being standardized or drafted. The second is technical expertise or competence, meaning that the IEDF only works on things that it has the competence to talk to. And it will, it will listen to technically competent input from whatever source there is. And this, the third, third principle is that of practical ethos. So rough consensus and running codes or trying to base all our standards on our engineering judgment and our real world experience. And more pragmatically, it means that all the standards need to come accompanied by some, some code for verification and hopefully multiple, multi-employment, implementations that are interoperable. So let's look at, at the station in the IEDF. I think the rats working group has already been mentioned and the major milestone that's been achieved about a year ago, the, the remote out station procedures architecture document from which this diagram is taken shows that the roles involved in making the remote out station usable. And the rats working group is there to essentially standardize around this diagram, around the roles, mechanisms, data formats inherent in this. But if you want to look at remote out station as an authentication mechanism, then we need to go beyond the rats and this diagram. And we need to look at cases where the attester and their lying party are trying to interact over different protocols like O-Auf, TLS, ESD, stuff like that. So let's start by looking at credential insurance. And in this case, I mean, for example, X5-9 certificates. So the enrollment over secure transport and certificate management protocols are central to public infrastructure. And it allows an entity to request from a registration or certification authority to generate a certificate. And a recent requirement from the CA browser forum has put in place a need from RA or CA for the entity to prove the security state of the key that's being certified. So that's why we're trying to integrate remote out station to make this happen. So the way remote out station works here is the verifier sends an ounce to the entity and entity that uses that to generate evidence and package it up in the CSR. And then the RA-CA can get an out station result back and decide whether it wants to trust the entity and issue the certificate. The identifiers there are for the places where you can find more information about how this all works. If we look at ACME, it's again for certificate insurance. And as you can see, the diagram looks pretty much the same. The only difference is in the fact that the evidence is carried in a different format defined by the W3C, so web-alpha format. So just to highlight the fact that we're pretty open and pragmatic about what we use. And if there's something ready, then we can just use that. The second type of credential that we care about, for example, in this case is OAuth, where a client might want to get an identifier and perhaps some credentials from an authorization server. Again, pretty much the same diagram. And then if we move on to secure channel establishment protocols like TLS, these are quite different because of their symmetricity compared to credential issuance. And we've tried to preserve that. In the diagram here, you can see one type of flow where the server is the one that testing itself, but you can have the same on both sides. So both the client and server can test themselves. They can use either attestation results or evidence as credentials, and they can use credentials, these credentials instead of PKI or alongside PKI. So obviously, we're dealing with some sensitive stuff here, and we want to make sure that our specifications are as secure as possible. And the way we do this is obviously we use our experience with these protocols, making sure that they're secure, and we use implementations to drive testing and make sure that we catch any bugs. But obviously, we can't just rely on that because we can't do proper thorough testing. So recently, we've been integrating formal verification into our work, trying to prove that the security properties that we care about are upheld by our designs. And actually, in IETF, we have a new usable formal methods proposed research group to take care of this more broadly. So I want to leave you with one message, which is please join us. Please join us in drafting these standards and implementing them and making sure that they work properly in the real world. Yeah, and we tend to lurk around in the ROTS working group and the CCC at the station. Thank you. Okay, I'll repeat the question. Is there like a representation for applying the service? I think there probably is. Yes, so the question was for ACME, for the ACME integration of remote other stations, whether there is example codes or reference implementation. I think there probably is. I think I've seen a demo from the person who was drafting this. But yeah, we can get in touch.
DIY Private Container Registry
So, welcome everyone. Thank you for joining my presentation. It's my pleasure to open the containers, the room this year at FOSDEM. I've just realized that the title may be a little bit misleading. I'm not going to talk about building a container registry. I'm not clever enough for that. I'm going to talk about making it private though, so if that's why you are here, feel free to stick around. My name is Marc Chagy-Cozard. I'm don't bother trying to pronounce my last name, just call me Marc. I'm the head of open source, the company called OpenMeter. We do usage-based metering. We don't do anything about container registries. This is just a talk I had in the queue for a while. Let me actually tell you a story. This is going to be a story over a long period of time and a couple of different companies I've worked at. It's going to be about distributing images to people, developers, different deployment environments, design partners, whatever you want to call them. We accumulated a number of requirements over time. Again, different companies have different requirements, but often when you want to share container images that are not open with design partners or customers, you have specific requirements like you want to be able to share those container images so people can deploy it in their target environments. They can pull it to their development machines and use it for development, or they want to run services in their CI. You want to be able to distribute these images often in a way that you don't want everyone to pull those images. You want specific people and environments to access those images. You need flexible authentication and authorization solutions to do that. Obviously, you also want to minimize the operational burden. You don't want to run your own storage necessarily. You want to use some sort of object store, and you don't really want to think about things like monitoring or backups. So what do people usually do these days when they need a cloud native solution? Any ideas? Well, you go to the CNCF landscape and try to find an existing solution. If you go to the CNCF landscape, you're going to see that there are a bunch of container registry solutions already available. These solutions can largely be categorized or put into four categories. One of them is the cloud hosted registries. Most cloud providers have their own hosted registries these days, which is easy to use. There is a second category, which is called peer-to-peer. I call them peer-to-peer registries. They are mostly about distributing images within a deployment environment. Those are not really for distributing images with other people. There are all-in-one solutions we're going to talk about, and there are plain old registries that you can run in your own environments. Now, obviously, and again, this is a story we had our own requirements at a company called Bonsai Cloud to distribute images, private images with a bunch of customers, and we started with cloud hosted registries. They are easy to set up. There is no operational cost, basically, but it does require for customers and people who want to pull your images to register a cloud provider account. You have to set up IAM and a bunch of other stuff. Surprisingly, this is something I learned. Companies who don't have an account for a specific cloud provider are not really eager to register one if they don't use that specific cloud provider. Cloud hosted registries may not always be the answer if you have customers who don't have accounts for those cloud registries. Now, again, some time went by and new requirements came in. Obviously, one of them was the no cloud provider registration, and the other one was more flexible authorization. Obviously, there are many different artifact stores out there with a wide range of project. We use it for a long time. I believe it's still in use, where we introduced it. And it gives you a bunch of different tools for distributing OCI artifacts. It gives you robust authorization solutions. It gives you things like image replication, so you don't actually have to push your images to the hardware. If you just want to use it as a distribution strategy, you can replicate your images from your existing solution or your existing registry. And that's what we did, actually. We replicated the images we wanted to distribute across clients from our existing registry. And again, hardware has tons of more features. And basically, what we did is we built a layer on top of the so-called feature robot accounts. It's basically a service account feature in hardware that you can use for service to service authentication. And we built a layer on top of that to distribute credentials to customers so they can authenticate with hardware. Now, the thing with hardware is it's a great solution. We loved it. We did find a few issues with it, though. So first of all, the group-based access control, it has, it's only for users, not for robot accounts. So we had to manually set up the authorization for each robot account every time, which means if we need to update those authorization policies, we have to update all of those robot accounts, which was a bit weird. The other issue we met was creating robot accounts for multiple projects, hardware structures, everything into projects. Basically, it's a namespacing feature and creating cross-project robot accounts requires using the admin credential, which is a single admin user with a password, which was a bit of an issue for us. But overall, hardware is a great solution, maybe a bit overly complex if you just want to use it for just distributing images. But it's great. But obviously, new requirements came in, like building a sub-serve portal for users. Obviously, we could have allowed them into hardware, but since we used robot accounts, that wasn't really an option. And again, as actual customers started to use these products, we need the closer integration with sales and licensing systems. So this is where we started to think about maybe building our own solution. But before I talk about the solution that we came up with, I need to talk a little bit about how container registries work. How many of you are familiar with OCI? I guess, fair bit, all right. So I'm not going to talk a lot about what OCI is. But basically, OCI has three specifications. That's relevant in this space. The distribution, which is the registry API, basically the image and the runtime. And the interesting part here is the distribution specification, because that's how you pull images from a registry, basically. That's what defines the API for pulling images. And the problem is that there is no built-in support for authentication and authorization in that interface. The distribution specification is basically just an HTTP interface. So technically, you could use basic OCI if you wanted to, if the client that you use to pull images supports it. And actually, I think the Docker registry OCI actually uses or allows using basic OCI, but again, it depends on the client. And there is no authorization built into the distribution specification at all. So that's the operation. That's how the Docker CLI works, basically. When you do Docker login or pull or push, that's what the Docker CLI does these days. Whenever you try to pull or push into a registry with authentication. So that's how authentication works in Docker. Again, this is not a formal specification. It's just something that Docker did. It's great. It works, but it's not a formal specification. Not yet anyway. So let's try to put this all together and see how we can build our own private container registry. So first of all, we need a container registry. And as I said, I'm not going to talk about building a next-year registry. Fortunately, we do have a couple of options. We have the SoCo distribution project. I always found this is a bit weird name. Like, what does it mean? Distribution. I mean, I understand it's about image distribution, but anyway, it's basically the reference Docker registry implementation. And the other project is Zod. Both of them are CSEO projects. The distribution project is basically, again, as I said, is the reference Docker registry implementation. Most providers use it under the Docker Hub, GitHub's container registry Harbor uses it under the hood. So it's basically the reference implementation for container registries. And I think they are working on a version three these days. I don't know if it's out yet or not, but there is a new version coming. The other project is called Zod. It's a new project. I think it focuses purely on the OCI distribution specification. So it doesn't really have support or backward, compatible support for the older Docker registry API. The registry authentication was actually broken for a long time. And fortunately, a couple of days ago, a week ago, they fixed it. So it should work with the registry of specification now. And the third project that I built, it's basically a proof of concept project, actually, but it's a registry authentication library and a service. This is the service that implements the authorization service component. If you remember the diagram before, this is how you can build your own authentication and authorization solution for your own registry. Again, it's a library so you can build your own service with it. And it's also a service comes with a couple of defaults and helpful configuration. You can check it out on GitHub if you want to. Now, couple of caveats with the registry specification. As I said, it's not actually a formal specification. There are several gaps in it, edge cases that are not covered. Properly, different clients and different services may implement the specification differently. There are also competing and not fully compatible specifications. I believe this is the biggest issue with it today. Charmism has its own similar specification which is incompatible. And this is actually the reason why the Zalt implementation was broken, because they used the Charmism oath specification, which is a version of the Docker token oath specification. And that's why it was broken. So hopefully, this is going to be solved by the OCI specification. So this is going to be the fourth, I believe the fourth OCI specification. There's a new formal working group. I believe they formed in August last year. So there's the new working group trying to solve all these issues and try to come up with a consistent solution for both authentication and authorization. So it becomes easier to build these kind of services. And it becomes easier for different clients and different registry implementations to work together. So hopefully, that's going to solve all those issues. And well, that's all I had for today. I'm happy to answer any questions you have. Or I can do a big demonstration of the port word service if you are interested. I'm up for both. Sorry, if you want to ask questions, there is a microphone here. All right, let's see the demo. Just give me a moment. All right. So all right. Can you see my console? I don't know if it's visible from the back. All right, cool. So this is a demo for port word, which I mentioned is an authorization service implementation. Thank you. As you can see, there are a couple other services here. One is called Docker. This is the distribution registry. The other one is Zot. The example works with Zot as well. And there is a fourth service here, which is called Kerbos. This is one of the included authorization solutions in the default port word service implementation. You can use Kerbos to define authorization policies for your container registry repositories. This is just the basic default policy that allows different types of users, like admin users, developers, and customers, to do different kinds of things. Obviously admins can do everything. I don't know. Users can push to their own namespaces. But the idea is, I don't really want to talk too much about Kerbos, but you can use Kerbos to define these policies. And it's integrated into the Docker of authorization specification through port word. Port word has its own configuration. I'm going to show that quickly as well. You don't have to parse this hash. I'm going to tell you its password just to make it easier. This is a very simple configuration here with a few static users. A couple configuration parameters for issuing tokens. And you can see the authorizer is configured as Kerbos here. There are a bunch of other authorizers that you can use. You can build your own, as I mentioned, port word is a library and a service, so you can integrate your own authorization solutions fairly easily. Now let's see if this thing actually works. So everything is running. Let's see if I can log in to the service. By the way, this is on GitHub. If you go to the Kickstarter on GitHub, you can try this out yourself. Yeah, go ahead. I'm sorry, I can't hear you and there is a microphone here. What type of authentication are supported? Is it only password based or do we use the default authentication? Yeah, it's only password based. That's the only thing that the Docker specification allows. But as far as I know, the OCI working group is trying to add more types of authentication allowed.
Forensic container checkpointing and analysis
Thank you. Yeah, thank you. So welcome to my session for forensic container, check pointing and analysis. So my name is Adran Reber. I worked at Red Hat since 2015. I'm involved in process migration. What's the basis for container check pointing now? I guess now 13 years now. Everything I'm talking about today is about CreeW, Checkpoint Restore and User Space, a low level tool. I'm involved there for a long time. And I'm focusing on container migration since 2015 and forensic container analysis is one use case of the overall container migration topic. So this talk will look something like this. I will give a bit of background about the tools, who uses Checkpoint Restore currently, who uses CreeW, how is it used, the use cases. I will go through a couple of them. Then I will talk about the title of the talk, forensic container analysis. This is basically just a demo. So maybe it fails. And then I will talk a bit about the future of Checkpoint Restore, especially with focus on Kubernetes today. Okay, so Checkpoint Restore and User Space CreeW, that's the tool we're using today to do the check pointing and create the images for the analysis. And the reason why it's called Checkpoint Restore and User Space is because Checkpoint Restore is a technology which exists on operating system and Linux for a long time. And previous approaches were either in the kernel, that's why this one is called in User Space, or they required some preloading. So you would do an LD preload and then some library would intercept everything you do. And then later the Restore, something would try to create the steps you did before. CreeW is different. CreeW is something what you would call a completely transparent Checkpoint Restore utility. It doesn't require any preparation of the tool. You can just point it at any process and you can checkpoint it if the process is not using any resources CreeW cannot handle. And then you can restore it on the same or on another machine. CreeW was developed with the goal to use existing kernel interfaces as much as possible. Over the years there were additional kernel interfaces introduced to support CreeW. None of these interfaces are specific to Checkpoint Restore, so there are always multiple different users using those new interfaces. So the most changes CreeW did to the kernel are not Checkpoint Restore specific. Most of the time it's just how to get more information about the running process out of the kernel. There are multiple integrations of Checkpoint Restore in different projects, container run times, container engines, container orchestrations. And the first I have to mention here is OpenVz. It's something I never used personally, but that's the group behind CreeW, so they developed CreeW to be able to live migrate their containers. They were doing containers before it was container, so it's something which existed for a very long time. And at some point I'm not sure about the history exactly, but they came up with CreeW to have a Linux tool which works for everybody and not just for them. Another interesting integration of CreeW is in-bored. This is Google's container engine, what they use in-house to run all their tasks. And although the upstream CreeW developers don't have direct contact with Google, we know from conferences how Google uses it, so basically what they do, they can migrate containers and they mostly do low priority containers. So if you have a node, there's something running on it. It needs more resources than before CreeW. They just killed the low priority container and restarted the work somewhere else from the beginning. And with the integration of CreeW, now they can just move it from one host to another host. So this is... And as far as we know, they're using it at least since 2017. I think that's when we saw the first presentations from Google how they use CreeW. Then there's an integration for a long time in CXT, and I probably have to mention in just today. It's also integrated there. Also for a very long time, it's integrated in Dockoff 4. Also, I don't know, maybe also 2016, something like this. I've worked for a couple of years to integrate checkpoint restore support in Portman, so you can also, using Portman, checkpoint restore containers migrate them from one host to another host. And the thing which I'm currently working on, which I started around, people talk to me about how they think to use container migration, container checkpointing, and the simplest one is maybe reboot in safe state. So you have your system running with a container on it, and you have a blue kernel there, and it has some problem, and you want to update the kernel. But your container takes a long time to start. You're not really happy doing a reboot because your application is down for a long time. So with GRI, you can update the kernel, then you can create a checkpoint, basically an image, a stateful image of your container, write it to disk, reboot your host, and then it comes up with a new kernel. This time it's green. You restore the container, and it's running pretty fast, much faster than waiting for all the initialization. So you can quickly do reboot of your systems using checkpoint restore. Another one, it's similar to the first one. And also people have been talking to me about this. So this is also used in production. You have a container which takes a long time to start, the one I've been told about. It takes like 10 minutes until everything is initialized. So, and they have a service which they want to sell to customers, and they want to have the customers fast access to the containers. They don't want them to wait for 10 minutes. So what they do is they initialize the container once, create a checkpoint, write it to disk, and then they can immediately start in matter of seconds services from this pre-initialized container, and their customers don't have to wait 10 minutes. It's just in 10, 20 seconds, something like this. The combination of those two use cases is the container live migration. So we have two hosts. We have the container on one host. And it's hopefully stateful because if the container is not stateful, the whole migration thing of container doesn't make much sense in the end. For the forensic use case, it can be a stateless container as well because you can still analyze it. So what is the same again? We create a copy of the container, write it to disk, and then we can create one or multiple copies on the destination system. And the original container can keep on running or not. So this is really up to you how you want to use checkpoint restore today. Another interesting thing people are talking about are spot instances. Spot instances are usually something which is cheap, but they go away. Those VMs, like, I don't know, you have two minutes warning, and people are using checkpoint restore there in combination with pre-use. So you get a signal that your VM is going down. You create a checkpoint, write it somewhere, and then you can continue to run your workload on another system without losing any work or without having to do any restart or long down times or whatever you would like to avoid. And something which came up recently is people are interested to use it for AI training. So you have your AI training running somewhere with a GPU, and for some reason it's aborted, or you have to make space on the node, and with a combination of checkpoint restore you can create a checkpoint of your container. In this case it's less about migration. It's just creating a copy of your state somewhere so you can continue to run it later, or even migrate. It really depends on what you want to do there. The interesting thing here is I mentioned previously that the CRIU cannot handle all resources, and GPUs are kind of the resources which CRIU cannot always handle. We are lucky that AMD came up to us and they actually implemented support to migrate or checkpoint restore applications which are running on the host CPU and at the same time on the AMD GPU. For NVIDIA, we don't know if that exists. We have heard people talking about it. I think Microsoft mentioned it at some point that they might have been using CRIU in combination with NVIDIA, but nobody talked to the CRIU Upscreen Project at least. So we are not aware that people are doing, but we kind of expect that people are using CRIU in combination with NVIDIA GPUs. So the next one is then forensic container analysis and my demo. So my demo is based on a container. I am using OpenHPC as a base. So the container is a stateful container. It is calculating pi and memory which we can hopefully later find in the container. So to create a checkpoint, there is a complicated way to do it. So currently checkpoint restoring Kubernetes is only a Q-Bled interface. Officially, the reason is because checkpoint restore writes your container, every memory page to disk. There is the potential, the risk that you now have private keys, random numbers, passwords, now all written on disk. The checkpoint is only readable by root. So the situation doesn't really change because if you root on a machine, you could also extract the memory, but for now, because it's not clear how to handle this or how we want to continue in the Kubernetes community. With this feature, it's just a Q-Bled only interface and it looks like this. I've also written a QCTL interface. It looks like this. It creates also the checkpoint archive. It's basically doing the same. It's just wiring all the calls completely to QCTL instead just a Q-Bled. So now we have a checkpoint and there's a tool called checkpoint control, which was mainly developed by Google Summer of Code students this year. So we're very happy for this help, which they did. And in its simplest form, checkpoint control will give you, I'm just going to make the font a little bit smaller for a short time here, and it gives me just some basic information about the container. I see it's the container's name counter. It's based on that image, ID, runtime, when it was created, the engine cryo. Checkpoint size is basically the size of all memory pages, and root fsdif size is the size of all files which have changed compared to the base image. So let's unpack the checkpoint archive to see some details. And it's just a car archive, so it's easy to unpack. I'm just going to move this to the top again. And there are a couple of files which now were created by the container engine. And so we have bind mounts. This is just some information that is necessary for restore, because we need to restore all the mounts from the outside of the container to the inside, and we need to know if it's a file or a directory, because the container engine doesn't want to remember if it's a container or if it's a directory or a file, but we need it for the restore. Configdump has some information. dump.loc has what cryo tells us. In this case, it doesn't matter because it works. Then we have the root fsdif file. This is all the files which have changed compared to the base image we saw previously. In the checkpoint directory is the one created by cryo. So that has the actual process information. So if we go there, this is the normal thing which cryo does is all, most of them are protobuf files generated by cryo here. And cryo comes with a tool called crit, crit, cryo image tool, and it has a parameter show, and we can have a look at one of those files. Let's look at UTS namespace information here. It basically just tells us the namespace, the UTS namespace has hostname counters, but we can also look at a file called pstree. This is the process tree. This one, so it starts to get difficult to understand what's going on. I have a couple of commands prepared. So I see with this one, we have four PIDs running in our container, 140, 40, 142. It's important to know this is the view from inside of the PID namespace, so cryo always remembers the PIDs from within the PID namespace and tries to recreate those PIDs later. If I looked at my process, which is maybe still running, it should, I can see here, basically, it's not hard to read, but those are the four, where's my mouse? I don't know. Oh, there it is. You see, so this is the PID one of the container, and this is probably 41, 42, 43, I guess, and you can see here it has other PIDs on the outside, because that's the view from the outside of the PID namespace. So it's important if you ever do an analysis of your checkpoint, it's always the PIDs from within the PID namespace. There's also for each process, we have a file called core, with the core information about the process. Let's have a quick look at this one, and it basically has the registers, the value of all the registers, floating point and much more stuff, and the end you see, the policies and the name of the process, and using the name of the process, I think I can get a list of what processes are running inside of my container and what they do, and you see the first one is called bash login wrapper, bash, pi and t, and if I compare this again with what's currently, I don't know, it's the wrong command, with here, again, I see it's again, bash login wrapper, bash, the Python code and the t command. So looking at these files, I can find out everything about the processes here, so there's a lot of information in here, and if you're looking for something specific, it might be difficult, but the information is here. There are additional files, for example, the tempfs.dev files, those are maybe also interesting files. Those are basically, let's have a look at those. Something like this is probably the right one. And you see, this is the content of a tempfs, so every tempfs which is not bind-mounted from the host, which is native to the container, KreeU kind of puts it into the image, it's basically just a tar, so every tempfs which was in your container is now also here, you can find all the information here. This looks like this was slash dev. What else do we have here? Let's have a look. Yeah, I think that's okay. And previously, I also wrote some, my secret data into the memory pages, and I can actually find this memory again here, this information again in the pages files. The pages images are, those are not, protobuf files are raw dumps of the memory, this is all the memory which was written to disk, and I can again find the information I've written to memory here. So if I know what I'm looking for, it's easy, if I'm looking for a password, then I have to pause it all through and maybe find a useful string in there, but this is just to show you, you have access to all memory pages, and they are now all on disk, and it can be easily analyzed, or at least look at. So if, okay, let's, I also wrote a couple of files to my container, I mentioned this here, the root of sdif tar, let's unpack that one. And so now this contains three files, so these are all files which have changed compared to the base image of the container, and this is just really simple, a file which is created, it just has the, it just contains the name of the file itself, but it's just to show you, if you want to look at content which has changed in the container, you will find it here in this root of sdif tar, which contains all the changed files. And if you think this is all too much work, then I already mentioned checkpoint control before, and it's even, has even more possibilities than what I've shown you, most of the things I've done here manually, the tool, thanks to our Google Summer of Code students, can at this point do. So let's have a look at checkpoint control inspect of the, and the $CP variable is basically pointing to the tar archive, so the tool is now unpacking the tar archive and giving us all the information. And what we see here now is this information we saw before, so it's just some basic information about the image, where it was, how big the checkpoint size is, then we see CreeO dump statistics, this is basically the time CreeO needs to write the checkpoint to this. You see how many memory pages were scanned, if they should be written to this, how many memory pages were actually written to this, and then we see the full process command line. We see all the environments variables of all processes running in our container, and next one even more variables, and more and more, and at some point there's, I think it even contains the open files, too many variables here. You see, now we see the open files, you see the one has open def null, and then two pipes, and then the working directory, and open sockets, you even see that that's the socket I've been talking to, and then we go to the process here, and then we see all the mounts we need, this is also important for restoring the process later. So I guess that's the end of my demo, so checkpoint control was the tool I was using, I was using CreeO image tool to have a look at the content of the images, and then I was using grep to find my secret key from the memory pages. So one thing I didn't show, you can use, there's a tool in CreeO which converts the checkpoint images to core dump files, and then you can use gdb to look at them, it's basically the same, you see the registers and the call stack and things like this, might also be interesting for a couple of people, to what's next, especially with focus on Kubernetes, so I've shown that I have a kubectl checkpoint kind of working, that's an open pull request, it's not being actively discussed at this point, but it's there so if somebody needs it, it can be easily used, maybe the next step would be to integrate checkpoint for complete parts, I've implemented this a couple of years ago, it's pretty simple, we just do a loop over all containers in a pod, we just create some metadata for the pod and then we can recreate it, so this is not a technical challenge, it's just most things at this point are how to get it in a way into Kubernetes which is sustainable and makes sense, and then maybe we have something like kubectl migrate, so we don't have to do it manually, maybe at some point the scheduler will decide, let's move this pod somewhere else, and one thing, so the image format I'm using is currently just a tar file, I came up with, but it's not a standard, so container D uses something else, I looked at the container D format, it's applicable for what I was looking at, but the problem was they were using internal protobuf structures, I didn't thought make sense to have in a public checkpoint, in theory, checkpointing on container D and restoring in cryo should not be a problem, but at this point we don't have a common image standard, I tried to start a discussion here, but it also didn't continue unfortunately, so with this I'm at the end, so I showed you that cryo can checkpoint containers, I haven't shown the restore part, but it works, it integrated in different container run times, it's used in production by different companies at this point, use cases are things like reboot into new kernel and save states, multiple copies, container migration, spot instances, AI learning support for GPUs there, and this is all available in Kubernetes under the forensic container checkpoint in cap 2008. So, I'm at the end, thank you, any questions? Thank you. Oh, sorry. Sorry, please be quiet, we cannot hear the questions. You mentioned GPUs are something you can't handle, what are the other big resources that... So basically cryo cannot handle anything that's external to the kernel, so InfiniBand is one which comes up in high performance computing always, so everything where you have a state in additional hardware, you need some way to extract it, you need to extract the state so you can later restore it, so... And just create a text in the process, is that stuff that fails? Exactly, it fails, Daniel. So currently the people I've talked to today, they are just interested in finding out if there has been an attack or if there is an attack ongoing, things like this, and then maybe at some point, maybe if you can have a couple of checkpoints and figure out, okay, this looks like an attack pattern, maybe detect it automatically using check pointing, this would be maybe something in the future, but finding a possible attack is one of the main motivations for people for the forensic use case. Thank you.
Introducing Incus
Hello. So, yeah, I'm Stefan Graber. I'm the project leader for Linux containers. And I'm just switching to the right screen here. There we go. And I'm one of the, one of the in-cast, in-cast maintainers. I was also the former project leader for LexD when I was working at Canonical. So, gonna go for a tiny bit of history first and then go into more, you know, what in-cast is and what can you do with it. So, the LXC project itself was created way back in August 2008 through IBM. That's the original Linux containers run time and was, has been used kind of everywhere, including the original version of Docker and some other places at that point. Linux containers itself was created, so the organization was created back in September 2014 and the LexD project got announced by Canonical in November 2014. Then LexD's been going on for a while until a lot of things happened in 2023. So, on July 4th Canonical announced that LexD was gonna be moved out of the Linux containers community project and moved into the Canonical organization itself. The next day we noticed that all non Canonical maintainers had lost all privileges on the repository. So, only Canonical employees were left maintaining it at that point. Then a few days later I left Canonical, so that happened. Then August 1st, Alex Astorai, who was the open SUSE package for LexD decided to go ahead and fork LexD as a new community project called InCus. A few days after that we made a decision to include InCus as part of the Linux containers project, effectively giving it the spot that LexD once had. InCus 0.1 was released on October 7th and we've had another four releases since then. Lastly, just as a bit of an early Christmas present, Canonical decided to go ahead and re-license LexD to AGPL as well as require everyone to sign a CLA to contribute to LexD. The consequence of that for us as a not-patry-to project is that we cannot look at anything happening in LexD anymore. We can't take any changes from LexD anymore, so InCus is effectively a hard fork at that point. So, that's the history. Now, to go back to what is this thing actually all about. So, InCus is a system container and virtual machine manager. It's image-based, so you've got a pretty large selection of distros. It's going to be a whole slide about it a bit later. But, yeah, it lets you effectively kind of cloud-like immediately create instances from any of those images. The system container part means that we run full Linux distributions. We don't run application containers, we don't run OCI right now, we don't do any other kind of stuff. The containers are really like a full Linux system that you then install packages in a normal way. Everything is built around a REST API with a pretty decent CLI tool. That REST API also has other clients who will go through that in a tiny bit. InCus got great support for resource limits, so you can pretty easily limit CPU memory, disk network, I or whatever you want. It's also got extremely good device pass-through to both containers and virtual machines, so you can do things like passing GPUs or attaching virtual TPMs or sharing your home directory or doing a whole bunch of different kind of sharing and passing through devices into containers and virtual machines. It also supports all of the expected stuff. I mean, it does snapshots, does backups, it's got a variety of networking options, a bunch of storage options, all of that stuff. It can also create projects as a way to group a bunch of instances together and effectively even open ID connect, which is cannot go to standard these days. And for authorization, we support OpenFGA, which is the open fine-grained access control project. That gets you, as the name implies, pretty fine-grained access control. There's also a number of web interfaces you can use on top of that. So here you've got one of those, which is actually the LexD web interface that runs perfectly fine on top of InCus. And yeah, that's one of the options there. As far as what you can run, well, there are a few options you can see up there. So InCus is indeed all based around images. We build images for pretty much all of the major Linux distros and even some of the not-so-major. And we build everything on both X86 and ARM. The vast majority of them are available for both container or VMs. We've got a number of them that are just for containers. And then because we do normal VMs, you can also run Windows, 3BSD, whatever else you want inside of the virtual machine. All right. So let's do a first quick demo of the standalone InCus experience. So if I switch over there, first thing we'll do is just launch an ARM Linux container. There we go. So we've got that. Then let's do another one for, let's do Alpine, the Edge release. So just do that. And this is obviously at risk of blowing up at any point because I'm on the first Wi-Fi. I think Ubuntu was planning on doing a VM. So let's do a VM instead of a container. So just tell it you want a VM instead. That's pretty much all that there is to it. And with that running, so we can see that the two containers already started, got their IPs and everything. The VM is still booting up, so it hasn't got its IP yet. It does now. If you want to get into any of them, you can just exact any commands. You can get a shell into Alpine. You can get a full bash inside of Arch. And you can do the exact same thing with the virtual machine. So like you don't need to get a console and log in and everything. Like there's an agent automatically in our virtual machines. You get to just immediately access them as if they're containers. So that works really well. You can create snapshots. So if you wanted to snapshot, the opposite snapshot creates the Arch one. If you don't give it a name, it just picks one for you. So we can see there's now a snapshot that we can restore or just keep around. There's also the ability to do automatic snapshots with a chron-type pattern with automatic snapshot expiry. You can do all that kind of stuff. Now let's create a custom storage volume. So we'll just do storage, volume, create, default. Let's call it demo. And then we're going to be adding that as a device to, let's say, Arch. So just call it demo. It's a disk. It comes from the default storage pool. And the volume is called demo. Configure this. There. And I forgot to do add. There. twice add. Now if we go inside of that VM, again, we see there's a new entry there. And then empty home. Hey, that's my home die tray. So that's very nice and easy. It's kind of doing automatically, VIRTA, UFS, 9p, all that kind of stuff. It talks to the agents to trigger the mounts. And it just, like our goal is for virtual machines to feel like containers in, like as much as we can. And having that agent in there really makes that super easy. And for the last party trick of this demo, let's do launch images. Open suzer, tumbleweed, desktop KDE as a desktop image. And also tell it that I want to see the VGA console as soon as it starts. So when I do that, it actually gets me a second window, which I need to drag over here. And let's try full screen that thing. Maybe. Yeah, full screen doesn't work. Okay. But we can see it boot. And it's going to get us eventually into a KDE session. Not sure where the resize didn't work. Oh, okay. Maybe the desktop where? I saw a mouse pointer that was about the right size. Nope. Okay. So it is starting KDE there. So we even have some desktop images. We've got an arch desktop image with GNOME. We've got Ubuntu with GNOME. And we've got open suzer with KDE. We're not building too many more of them mostly because they're actually very expensive to build as far as like resource, like the build time and distributing pretty large images. But it's to show that this works. And if you want to run your own, you can totally do that. All right. So let's just go back to slides. Come on. There we go. So other things you can do as this thing is effectively your own local tiny cloud and it's all built on rest API. It's what it also makes it very easy to integrate with other things. And other things here mean some of the pretty usual tools you might be dealing with. So Terraform, OpenTofu, you can integrate with that very easily. We've got a provider to maintain ourselves that you get to use. Encebal has got a connection plugin that you can use to deploy any of your playbooks directly against virtual machines or containers. And if you want to build your own images as derivatives of ours, you can use Packer as a very easy way to take our images and inject whatever stuff you want in there. There are a bunch of other tools. I mean, LexD especially had a lot of third-party tools that could integrate with it. A bunch of those are now migrating over to InCurse or supporting both. So that's very, it's a list that's very rapidly growing. Other things you can do, well, InCurse exposes an open metrics endpoint to get the details like the resource consumption and usage and all that kind of stuff of all of the instances running on it. So you can integrate that with Prometheus to script that data and keep it on the side. It also supports streaming, logging and audit events to Gryffina low key. So you get to effectively have your events and your metric in the same spot at which point you can use the dashboard that we've got in the Gryffina store to get something like that and run on Intel. So that's pretty convenient as well. If you don't like typing the name of your remote every time, you can switch to a remote. So you just do a remote switch at which point if I do a list, it goes straight to that remote and you don't need to type it every single time. That cluster is actually using a mix of local storage and remote storage. So it's got CF for HDD SSDs and it's got a local ZFS storage pool as well. And on the network side, it uses oven. So it actually has all of the stuff in place. And actually if we look at the remote list from earlier, we can see that it uses OIDC for login. So it's also using authentication bits I mentioned. Now if you wanted to launch, say, the BN12 instance on that thing, you can do it the perfectly normal way. And that's just going to instruct the cluster to go and do it. So in this case, thankfully it's running back home with very fast internet. So I don't need to wait for the first Wi-Fi to download stuff for me. But it's actually downloading the image and parking it creating the volume on-safe in this case and then starting the instance. I didn't even tell it whatever I wanted it on. So it just picked wherever it made sense, which is actually funny because if you use an image and you don't specify what architecture you want, you're going to get one of the architectures. So in this case, I didn't tell it I wanted ARM or Intel. There was more capacity on ARM, so I got an ARM instance. We can go and check that easily. But I know that the server it picked in that list is an ARM server. So if I go in here and look at the architecture, it's AR64. All right. Let's just look at things here. And I wanted to just show the dashboard as well. I'm just going to drag that particular window over. Where is it? It is here. I had it open. I've got way too many windows open on my laptop. Okay. So it's Grafana. It's loading. It's loading. And in this dashboard. Okay. I'm just making sure it looks at the right cluster before I show it to you. So there we go. Yeah. So this is actually the dashboard for the very first I was talking to. So it's SHF, the one I was showing. It's looking at the demo project. So we can see the top offenders as far as resource usage and that kind of stuff. We can look at graphs for network, for storage. And we can even kind of go down on the specific instances and see what they've been doing. So you could expand an instance and go look at its usage. It also gets all of the events from Loki. So we can see the instance creation and any comments like that. That shell I got is actually right here. And any error and stuff is also all captured right there. So that's the metric side of things. All right. So where do you get to run this thing? Well, quite a few distros have packages now for Incus as well as I meant, as I've mentioned, Devenin Lubuntu without packages in their next table release. We're also looking at doing a long term support release for Incus itself. Right now you might see version numbers like 0.4, 0.5 and be a bit scared about it. You need to remember that this is a derivative of Lexday. So one of our zero point release is just as stable if not more stable and like a five point something on the Lexday side. We've just not done anything past zero because we're waiting for the LTS of our other projects within the next containers, which we will do in March. And that's going to be the LTS of LXC, like CFS and Incus all at the same time. And we usually try to line up versions. So Incus is going to jump straight from 0.6 probably straight to 6.0. That's what's going to happen with the LTS. As far as other features we're looking at adding, with the release of Linux 6.7, we now have Bicash FS in the Linux kernel. And it's pretty interesting for us on the Incus side because it's very close to what ZFS or the RFS does, which we already support. So we're looking at adding a Bicash FS storage driver for people who want to start using that. On the cluster side, I mentioned that we support Cef right now, which is a pretty good option, but also a bit heavyweight. A bunch of people could instead do something different, whether it's like using a shared NVMe of a fabric drive or using some old fiber channel sand they might have gotten on eBay or something like that. So we're looking at adding distributed LVM as a storage driver, which effectively means if you have multiple systems that can all see the exact same block device somehow, then you can use LVM on top of that with a distributed locking manager on top so that all of the different machines in the cluster get to use that. So that kind of solves the issue of like how do I use my old sand at work or something else, you can use that. But it can also work in some other cases. I think someone is looking at using that with DRBD, for example, as an option. We are looking at adding OCI application container support. So that's potentially a bit of a surprise for some folks. But we feel that these days, like the application container space has now stabilized enough and we've got enough of our users who literally just run, like for some reason are running Docker inside of InCast to run a few specific applications that this particular use case we could support natively. So we're not looking at like competing with Kubernetes with all of the service mesh, this auto distribution thing. Like that's crazy stuff. They get to do that. But we would like it to be possible for you to run like two or three small containers for like your IoT software or whatever. That's kind of what we're looking at doing there. And on the networking side, we're using OVEN for distributed networking, which works pretty well. But we're also working now on another feature of OVEN which is called Interconnects, which then allows for having multiple clusters and then interconnect to the network. So you can have instances on multiple networks, on multiple clusters, and then connect those together and can direct them. And you've got 30 minutes with InCast pre-installed in there to just take it for a ride, play with it for a bit, see if that's something that's interesting to you and if it is, then you can go and install it for yourself. And that's it. We can try some questions. We've seen it's a bit difficult. So please, everyone remain quiet if there's any questions. So we can try and hear them. Is there anything? Oh, you have it there. Okay. So I'm quite sure some people are interested with the differences from the end there and this too. So compared to what? Sorry, I didn't catch that part. Oh, VMware. Okay. Well, so it's a lot cheaper. Yeah, for anyone who's using VMware professionally and has followed the news recently, let's say your VMware build is not great right now. So this is a viable alternative in many cases. It doesn't do, it doesn't have all 50,000 components all around it and all that kind of stuff. But if you are primarily using it as a way to get a cluster, create a bunch of VMs, maybe create some containers, run whatever OS you want on there, that will do it just fine. So it's definitely an option there. I mean, it's kind of in the same vein at that point as compared to like, you know, a Proxmox or some of those other options, it will work just fine. With the FireTact, we do have like, it's not a distribution you can install it on any system you want. It's obviously all open source and yeah, it is a pretty viable alternative and we do have a lot of people who are using VMware that are very closely looking at this as a potential way out of VMware right now. So the question here, better understanding terminology, would the LNs find a backdoor between a system container and a location container? Yeah, so the difference between application containers and system containers is a system container will run like a full Linux distro. It will run system day, it's going to have Udev running, it's going to, you'll be able to access it into it, install packages, reboot it. It's really designed to be like a state full long running type thing. Whereas your application container is usually, I mean, ideally single process or a process of some of its children, it's really more designed around delivering a specific application and most often it's going to be quite stateless with the idea that you can just nuke the thing and replace it at any point. It's kind of two different concepts. Like some people like the idea of having a system that they actually get to select what packages installed exact config and stuff and some people prefer not to care about any of that and just have something pre-installed and that's what an application container gets you. That's why having the ability to run some application containers directly on InCus alongside the system containers I think will be quite interesting because if you just, like if for a specific application it's easier to just get their pre-made thing then you'll be able to do that while still being able to run everything else. Yep, so we do have a bash completion profile. I absolutely hate shell completion for some reason, so I don't have it on my machine so I can't show you. System containers provide the ones that are interested in the rights? Yeah, I mean it is possible to get application container run times to get you a full system container. I mean nothing prevents you from deciding that the application you run in the container has been in it. That's definitely possible. It's just not what they were really meant for so there's a bunch of kind of, it just feels kind of less polished because that's not, that wasn't their goal. Like things like being able to dynamically pass new files in and dynamically attach devices, get whatever number of shells you want, be able to interact with the outside words through like a Unix socket inside of there. Those kind of things don't make too much sense for application containers just at the beginning and so some of those features will probably be lacking on that side. I tend, I mean, I was going to say like I usually like, you know, having one tool for the job and like picking the right tool for the job and effectively if you really care about running a bunch of application containers use one of the application container run times whether it's using Podman, Docker or some of those others. One thing that's actually interesting is that you can totally run Docker or Podman inside of an InCast container. So that works. You can run your normal Ubuntu, Debian or whatever in Existio inside of an InCast container and then install Docker or Podman in there and run some containers alongside whatever else you might be doing in that container. So that's something that works fine. I think we're probably out of time at this point. So thanks a lot everyone. I'm probably going to just be outside for a tiny bit if anyone has more questions and things. But yeah, thanks a lot.
Kubernetes Operators: Expanding Automation in Containerized Applications
I'm sorry. Wasn't it about, I thought it was about an operator? Okay, so welcome. This is Edith and she will give an introduction to Kubernetes operators. Yeah, going in. Okay. Is this working? Is this working? Is this working? Could you hear me, Hie? Yeah? Okay, that's nice. Okay, good morning everyone. I'm really happy to be here again. The last year I was here shaking of nervous because it was my first talk in English. I improved now. It's going to be better, I hope. And thank you so much for this opportunity and all you for also make this happy. So I'm so happy for that. Thank you. Okay, in this talk I will talk about Kubernetes operators. Did you hear about Kubernetes operators before? Could you be? Okay, we will do a little introduction for containers, little, then Kubernetes. We will see how we deploy an application in Kubernetes and then we will do an intro into Kubernetes operators. To start, this is me. You can call me Joseri. I am Tecnology Evangelis at Percona. I also got the UK Global Talent Visa. For that reason I moved to UK the last year. I am a Cloud Native Computing Foundation Ambassador, organizing events in my city in Lima in Peru. I'm Docker captain, also organizing meetups about Docker. And open source contributor who contribute to Apache Airflow in the past. And now just translating the documentation in the Kubernetes website from English to Spanish to make it more for people who speak your Spanish. So if you want to talk about this topic, you can find me there or in the building case. So I'm happy to share with you about this. Okay, I already say the agenda. So today we will talk about Kubernetes operators. But first we will start by containers. Before to talk about Kubernetes operators or Kubernetes, the fundamentals is containers. We all know that a container is a process that is running on top of our system operator. So we need a containerized technology that is on top of our host operating system. And on top of that our application is running as an isolated process. And all the libraries and dependencies that we need for our application is there too. So, you already know what's container, right? Yeah, yeah, this is a container in the room. So that's good. Okay. But with containers, you will see if you are using it, you may be cool experiments on challenge. For example, orchestration, you are not running two or three containers. Now you are running thousands. If you have a big application, right? You are running those applications and you need to orchestrate them. And this is challenge just using containers. Then we have also to secure containers images, managing these vulnerabilities, or we have to handle network securities, access, authentication. No use for one container, but for thousands, thousands of them. With many containers running, also we need to see what is happening in real time. We need to know what container is failing, what we need to restart, and we have this metrics, this data, and we need this kind of tools to help us to visualize these errors and diagnostic, diagnostic, diagnostic, diagnose these problems and improve the performance. Also, we saw that there is a problem with scalability. So we didn't have that kind of tool that helps us to scale faster from thousands to ten, maybe to ten containers, dependent of the demand of our application. And finally, managing a data storage also in a container makes this thing a little bit more complex. What are the advantages of Kubernetes? With Kubernetes, we automate all this process that we talk and we reduce this manual intervention. We don't do this for just one container, but we do it. We see Kubernetes in the world. You see the web page, there is a lot of concepts that we need to learn, a lot of terminology that it looks like we never end to learn Kubernetes. But for this presentation, we are going to focus in three main components, which is the pods, the deployment, and the services. In the case of pods, the pod is a unique basis of Kubernetes. Inside a pod, we have a container and the container is a container that is related with other pods with that shared red and storage. And we have deployments also to deploy our application. It's where we set the desired state and we also set the replicas that is going to have our application. And we have services to access to each pod and to make it available between other pods. For this example, we are going to see visual how we deploy a cluster in Kubernetes. And we are going to see this example of voting app. We have a voting app, which you have a web application on the right where you can just vote between cats and dogs. It's a web application. In the other side, we are going to show you the result of that voting that we have. We have two web applications here. But behind this, we have many things to run in a cluster, no use to applications. We have this. If we have to containerize our applications, we don't just have the front end or the part that is visible to the user, which is voting app, a RESTful app. We have behind also the REST application, for example, to get the data that we put in the web application in temporal memory. And we have the back end that is made in .NET. This is just an example. And we have the database in Postgres, which is also in a container. And the RESTful app, the application which is made in Node, is going to get the data from the database and is going to expose our result of what we want. In the voting application. Here, if you want to create a cluster in Kubernetes, we need to identify which the connection ports. For example, if you want to access to the voting application where we are going to vote between cats and dogs, we need to identify how we are going to access to that application. So the port 8 should be open. The same for the other, could be other port, but this is just an example. But in the other application, RESTful app, also we can open a port so the user can access to that application to get the data. But inside this stack, also we have to identify which application access to what application. In that case, in the case of RADIS, for example, the voting application is going to save the data in RADIS. And the backing application is going to accept the RADIS to get the data and process it. After that, bring it to the database, but the database have to have a port open. The same with RADIS. Have to open a port to make it accessible for other applications. Now we talk about services. So after you identify the port where the application is going to listen, we have to identify services. And the good thing before to do this is ask you, okay, who application is going to access to me, RADIS, for example, okay. I'm going to be useful for another application. Yes, so I need to create a service there. The same with the application of the database with Postgres. I need to create a service. And the same with other ports in the top, we need to create a service for that to be accessed for the user. Now we talk about deployments. After knowing what ports we are going to open, what are our containers, our EMAJs, we need to deploy our application. For our application, for to deploy our application, we just need to have to deploy our application the one we want. How many we want, the replicas we want. If it's one, in this case, I have the voting app and the Resulapp application deployed with replica street, if you can see. And I decide the part which is the control plane, which is the brain of the cluster. And we can access to the control plane through the user interface and also with Qubesetel. Type in comments, we can access to that. And we have the API server, which is the front end for Kubernetes for the control plane. And it recites a lot of, it recites responses and will process them and will save these interactions into the database of Kubernetes, which is ETCD. We have also the schedule. The schedule is going to assign the ports in the cluster nodes, but it's going to try to find what node is right for the port that I have. We have also control manager. The control manager is going to have several containers for the objects that we have in Kubernetes to monitor all the health of the cluster. And it's going to keep informing about the healthy of the cluster. And we have the ETCD, which is the database of Kubernetes. In the other side, in the worker side, we have three components. Qubelet, we have the container runtime, and we have Qubet proxy. For Qubelet, it will take control of the instructions and will start, stop, monitor the containers. We have the container runtime that could be Docker or could be another container runtime to create and to make possible the containers. And we have Qubet proxy, which is going to facilitate the need working communication between the ports that we have in the worker node. Okay, let's go for Kubernetes operators now. Okay, we saw the Jamel. We saw where we identified replicas, one we can scale. RAN or scaling applications, stateless applications, is easy. We saw that because we can just write this command and make Qubeset L scale, deploy to four, scale or level up this one. So it's easy to scale these kind of applications because Kubernetes was made for stateless applications, so it's easy to make. But how about applications that store data? Do you think it's easy to add it into a scale? Yeah, it's easy because we also saw that we have Postgres in a container and in a port, and we can define Jamel file and we can scale it also. That is good, it's easy too, but where is the problem where we want to run a stateful applications, for example, in Kubernetes? This is the real problem. When we want to run a database over the time, because we can deploy a database in Kubernetes, we can make it. Is there running? I did it, many of you did it, but what happened with a long time? To all the life cycle of a database, this is where the real problem came up because we have to handle backups, upgrades, recovery, replications, and many other things that is for the database itself. Kubernetes at the beginning was built for a stateless application, but when we talk about databases or applications that need to save the user data, we need to do another thing. When we talk about running a database in a long period of time, we are talking about the day two of the Kubernetes application life cycle. If we talk about day zero and day one, we are talking about planning and development, operations, and now. This is an example of a custom resource definition where we are going to set up a new kind of our object in Kubernetes. We saw objects in Kubernetes, we saw the default objects, we have deployments, we have ports, we have services, these are the default objects that Kubernetes have. But if we want to integrate a new one, my old one, for example, called Krontap, I have to define it in these custom resource definitions. And I have to also define the behavior of this new kind. What is going to do? What is going to be the main target of this new new kind? So if we see, we define it in that part in this custom resource definition, custom because I am customizing this file to add a new kind and integrate it into Kubernetes cluster. And this is the custom resource definition behavior. For example, this is a very simple one where I am defining the kind. The main target of this new object is going to use that image in five minutes. It's going to do something in each five minutes. This is the object that I want my new type of Kubernetes I wanted to make. But you can have bigger things, right? Make a deploy, make a backup, make a replication. So it's going to be huge, not just as simple as this. And after you apply this new custom resource definition, you are able to do kubectl get Krontap, like a normal object that we have in Kubernetes. And we will see the results. When this is made, you are able to use this new type of Kubernetes in all your Kubernetes cluster. Okay, we talk about the custom resource definition. Now we will talk about the custom controller. So the custom controller that Kubernetes has by default is easy as a concept. So this controller is going to try to find the difference between the desired state and the current state of our cluster and is going to try to match the difference. And the custom controller that we create for the Kubernetes operators is kind similar. Summarizing, how does it work with Kubernetes with all operators? So we have a user which is going to write a jammel from an object that we already know that is deployment. And the deployment after applying the deployment, the object, okay, after applying the deployment, the deployment is going to create the deploy that we want and the post that we want. In all this process, we have the custom control or custom control loop that is going to try to find the difference between my cluster, this is my cluster, and the desired state, and it's going to try to match. Okay, you are telling me that I have two, three ports, but I have just two ports, so it's going to take action to make this automatically. How is Kubernetes with operators? To run Kubernetes with operators, we need to install in our cluster two things. The operator lifecycle manager, which is going to handle and see all the lifecycle of our Kubernetes operators, which is very important. And the custom resource definition that we created at the beginning as a template and the controller that is going to do that matching between the desired state and the final state, but for our custom resource definition. In this case, the user writes a custom resource, but with a new type. It's not a type that we know, it's a type that we created. My app, for example, is a type that we created as a custom resource definition and we are applying it into the cluster. Inside the cluster, we also have the normal object that Kubernetes has, the control loop, to make this, to make, to keep looking, observing, looking for difference and act to try to match this into my cluster. And if you are looking how to create a Kubernetes operator by yourself, you can do it. There is a very nice guide in which is the operator framework where it's going to give you an SDK to start to create Kubernetes operators and all the steps to make it possible. And you also, if you're wondering where I can find all these Kubernetes operators, if this is your first time listening about this, we have, there is also a Kubernetes operators, which is called operator hub, similar to Docker hub, where we have all the images there, right, to access. So we have operator hub where you can find all the operators that we want. So you can use these operators to integrate in things that maybe your application in a Kubernetes cluster is not doing right now. Maybe something missing, maybe there is something that you are still doing something manual in your Kubernetes cluster. If you are doing something manual, maybe it's easy to automate. You can also find this kind of tools here to automate and make it even more efficient your cluster, your application running in Kubernetes. And as any application, it also have maturity, capability models. The Kubernetes operators has five maturity levels. The level one, which is, which is has, I cannot see, which is, is going to see just the installation and the level two, the actualization and convenience. The level three is going to see the complete life cycle of the application of our operator. The level four is going to be more deeper and the last level is going to be completely automatic. Our operator is going to be doing a lot of things. So the application that we run in Kubernetes operators have to be in this order, passing this capability, capability models. There is many Kubernetes operators that are already in the level four and some maybe arriving to the level five. But you, if you go to that, if you go and see, for example, this is a Kubernetes operator for MySQL based on PerconyStraVic cluster. You will see in all this kind of operators the capability levels they are like, they, they are, they have. So it, it means that they have a certain maturity building and you can, you, you can see that you can use it also. Okay. If you are wondering how you can work with Kubernetes operators in Percony, we have Kubernetes operators, which is completely open source. We have Percony Kubernetes operator for databases, MySQL, MongoDB and Postgres. Feel free to use, use it also collaborate because it's open source. So if you are also, if you see that this is very complex to use it, you don't want to handle a lot of jammels, scripts and many things. We have also Everest, Percony Everest, which you can use it with a graphical interface. And you can start to create your Kubernetes, your Kubernetes, you can create your database Kubernetes in, in clusters of, of Kubernetes. Yeah. And if you have questions, you can reach us there also in our community forum. And we are having a Rafa in building K after here or maybe in the day or tomorrow. The raffle is going to be tomorrow at 2pm. We are raffling this Lego. So feel free to go, scan the code. Good luck. So that's, that's all. Thank you so much. Yeah. Yeah. If you. If you. If you. If you. Hello. I'll just try and speak over the noise. And. Good presentation. Oh, yeah. Could you hear me? Yeah. Could you hear me? Could you hear me? Hello. Yeah. Could you hear me? No, it's not. Yeah, yeah, yeah. I want to talk, but this is not working. I think. Yeah. Hello, hello. Hi, hi, hi. Could you bring me the mic? I will answer the question. I think you'll get just kind of you. Yeah, yeah, yeah. Thank you. Thank you. So we finished, right? I mean, there is. Yeah, sorry. Who made the question? Sorry. Yeah. Yes. Sorry. Yeah, sorry. The mic is not working. Who made the question? Right? Yeah. Yeah. I use, I use Kamba. But for the, yeah. I use Kamba. Is that like CMB? C-A-N-B-A. Yeah, I'll try that. And then for that arrow, I use a photo shoot. Oh, photo shoot? To make your, to make it moving. Oh, okay, okay. And then I import it into Kamba and I do that. We need the mic. Thank you. Oh, it's complicated. Thank you. Sorry. Alex. This is for you. All right. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Sorry. Thanks.
Composefs and containers
Hi, everyone. I'm Alex. You may know me from hits such as Flatpak and my work on GNOME, but recently I've been doing this work on this thing called ComposeFS. And this talk is going to be partly about what it is and how it works, but also how it slots into the container ecosystem. So the tagline I use is an operatively sharing verified image file system. It's a mouthful, but like image file system, you can imagine it's about mounting a file system. That's an image. Like the container image could be, but it could also be a system image for a full system thing or any kind of image you wish to share and then reuse. And sharing is about having multiple of these images and they share resources in some way. So there's a little more efficient of multiple of them than each individual. And very far is that we want to in some way guarantee that we're reading what we're expecting to read. So the easiest way to explain how it works is by example. Suppose we have this image. I mean, it's not much of an image, but it's basically it's files, right? And a structure, a metadata and all that. And you run the MK ComposeFS command and you give it the directory that you want to create from. You give it an example of CFS. That's a file name. And then you pass in a digest store, which is a directory name. And when you run this, you get this. You get the image file and you get the objects directory that has a bunch of weird looking things in it. Those things are just files. And if you go back to the original thing, we had a food attacks, the barl attacks. And those weird looking objects just have the content of those files in them. And the names are like check sums. And then you can, from this again, you can mount it. You're specifying the objects directory and the example file. And then you get back the same results. It doesn't sound all that exciting because you could as well use to use a loopback mount for whatever kind of files you want to. The interesting thing is this particular objects directory. As I said, it has the backing data for these files. And if we were to change one of them to not be what it was supposed to be, and then you mount again, you can and you will see the new change data. So basically, this directory contains all the backing files of the entire file system. So the interesting thing happens, if you have multiple of these images that share the same base directory, because these names are actually check sums. So any two files that happen to have the same content will have the same check sum, will use the same file. So that's what called content addressing. It's common in getting whatever was not. And that's what you would get what I call opportunistic sharing, as opposed to explicit sharing like docker layering, for example. You have to be very careful about managing your dependencies, such that you used exactly the right base image and whatnot, and then you get sharing. But here you get wherever, for whatever reason, if you happen to have two identical files, they will be shared. And they're not only shared on disk, because of the way composite has worked, when you mount this thing, if you mount two things, they use the same file, and something M-apps it or whatever, or just in the cache. They would only store once in the cache. And you can easily see how you could use this to update to a new version of an image and not have to download all the data. Like you can just download the image, list all the objects and see which one you don't have and download those. So it's like an automatic way to do delta downloads, basically. And then to get into these verifying part, we have to look at something called fsverity. fsverity is a feature of the Linux kernel. It's been around for some time now. It's actually both a feature of the VFS itself and the individual file system. So it has to be implemented in each file system, and most of them are. Or actually XFS is not yet supporting it, but it does work on that. But basically you enable this on a file, and that makes the file in mutable in the sense that the VFS will not allow you to do any operation on this that changes its content. You will get basically permission denied when you try to write to it or whatever. But also, if you modify the file directly on the block device or whatever, Cosmic Ray hits your drive and it flips a bit somewhere, when you read it back, there's a checksum, that's like a recursive Merkle 3 checksum across the entire thing. So whenever you read a block from a file that has been modified, you will get an error, an IO error basically. So that's pretty cool, but unfortunately it has some weaknesses. Yes, you cannot change the file, but you can change the metadata. You can rename the file, you can make it set UID or whatever. You can create a new file, you can delete it and replace it with a new one with the same name. Basically it doesn't validate what we want, which is the image. The thing we want to validate is the entire image, the file names, the structures, the metadata, everything. So that's where we go back to the buzz events. And if you actually look at the measure of the FHRT measure of this file, which is the digest basically, like the checksum of thing, it actually has the digest for name. Basically we're using FHRT already on all these objects. And not only that, we also record the expected digest inside the image itself. So whenever the file system opens a backing file, it can verify that it's actually the right thing. And then once you've opened it, each individual read from the file will be verified by the kernel. So if you mount it with Verity on, but you might not want this, you might want to use the sharing without Verity, if for example your file system doesn't support Verity. But if you do have it, what just enable it, and then when you read the file that we changed before, now we get some kind of error because it's not, it doesn't have the very digest we expected it to. So that helps a bit. But what we wanted was to protect the entire file system, right? And you could potentially write in the metadata's image file and change the name of the file by modifying the file. And to avoid that, we enabled the Verity on the entire image itself. And then we passed the digest that we got. I mean, ideally this digest isn't, you're not supposed to just read it from the file when you mount it, but at build time, you record the digest and then via some kind of secure mechanism, a channel, secure signatures, what have you. For some reason, you trust this digest. And if you pass it to the mount command, if the mount command succeed, then every successful IEU operation on this thing will be guaranteed to return the same data that was built. Basically, this is a root of trust. If you have a reason to trust it, you can guarantee the thing. There are some technical details about it. This talk is more about explaining the high level parts. Initially, it was completely a new kernel process. Actually, I did a presentation on the kernel dev room last year at FOSSTEM about this. But during the upstreaming process, it was changed. So now it's using some existing technologies, overlay FS in particular and E-RUFS that were already in the kernel, but they have been extended a bit, so they support this use case too. And overlay FS is normally a way to layer things, but it actually has already a way for a file in an upper level to have a different name in the lower level. Normally, you use that for renaming a file across layers. But we use it instead to redirect to the baseter. And then we introduce this thing called data only lower base structures or layers. So we basically hide the lowest level of use redirects to the files in there. And then we have a new FS, which is an ISO file or a squash FS. It's just a read-only file system that we use to record the entire directory structure of the loader and the overlay FS, including the final names and whatnot and the overlay FS ex-adders that do the redirects to the lower directories. And then we use bind mount, or loopback mount as thing and set up an overlay FS combined thing to use it. We had to add a couple of things, data only lower directories to hide the lower directory. The FS is very devalidational. We had an add a new ex-adder to overlay FS to validate the redirects. And then there's some nested stuff where you have overlays and overlay FS. That's kind of weird, but we had to add that too. But it's all in now. And we have a final version release that has a stable format that you can use and we're supposed to work forever. We also have integration with OS tree. I will not spend too much time on this, but OS tree is our like red hats image-based entire system, like atomic immutable operating system used by things like Fedora Cyliblu. And the current version of OS tree has experimental support for just creating these composifest images. So we create them at build time and sign them. And then if you validate it during boot, like it validates the signature in the in the FS. And if it's valid, you can mount the composifest image and everything is just guaranteed to be whatever is right. And if you're using something like secure boot to make sure you boot the right kernel and you're right in it, RAMFS, then you basically have an entire time approved boot signed by your buyers key, basically. But this talk a little about containers. And composifest has two major targets, as the OS tree use case and OCI images. And the actual work on podman and the back ends is done not by me, but by Giuseppe, who is one of the podman and C-RAM developers. It's based on his work on CS3 shunt, which I'm not going to go into too much details here, but it's basically a new compression format for OCI images that allows adding an index to the file. And the index has the check sum of the files. We can avoid downloading them if you already have them. But also the fact that we have these digest is the perfect way to introduce the the objects directory kind of thing that composifest uses. So if you look at container storage, which is the goal library for storing local images that podman uses, the latest version has basic composifest support. And we used have to wait until they vendor in the latest version into podman and then it should just work out of the box. And if you have this, we'll get some of the advantages of composifest, like higher density. If you have any images in your cluster happen to share files, those files will be stored only once on disk per whatever node you're using. And once in memory. And we can also use the validation to make sure we don't accidentally get modifications. But also in the future, this is not this is something that needs a bit more work. We could have a list of signatures or a list of keys and like limit the amount of limit the types of images run to only those that have a signed composifest digest by these public keys. So how do you use this? There are some options in the storage.conf. You have to enable all of these currently. Actually, the convert images is not strictly necessary. That's for converting images that are in this new CSDD shunt format. So it works with any. If you pull any image, it will convert it to the format and then use composifest for mounting it. So it will work for anything, basically. But if you want maximum performance and not have to do the conversion, it's good to use CSDD shunt in your image repositories. But that's something you want anyway in the future at least because you get to download less as well. If you ever looked inside the container storage, this is how it looks, a traditional one. Actually, I deleted some stuff. But the important thing is this per back end directory called overlay because we're using the overlay back end. And every directory in there is a layer. And every layer has a diff directory that has all the files that are introduced in that layer. And then you use overlay fs to combine all of these. Plus at the end, it adds your empty directory that is the writable directory for your container. But instead, if you look at a composed fs using back end storage, it looks different. It has the same overall layout, but the diff directory contains basically the baster and then there's this extra data, the blob. So what happens is that when you set up the container, we mount all these composed fs directors, each producing a layer, and then we merge them in overlay fs plus your writable storage. So it looks slightly different, but it's basically the same. And to demonstrate how this will affect resource usage, I created 20, like, so this is a synthetic example, but I think it proves the point. I created 20 copies of Fedora image, and I changed one file in each and squashed them. So there are like single layer images that basically have all the same files in them. And then I run sleep in each of them. And sleep is a very small thing, but it will map glibc and it will do like some basic stuff that any app would do. And then I run them all in parallel. And we could look at how this looks in the storage. Every file here is, or every directory, every layer is the Fedora base image. So it's 180 megs, sum up for 20 up to three and a half gig. But if you look at the composed fs version, it's just 200 megs because only the things that are different use more space. And this might look weird that just one of them is larger, but actually what happens is that each individual layer has all the files that they refer to, but they're hard link across the directory. So there's like a tracking of which files are locally available. And instead of you making a copy of them, they're all like hard link of each other. So that's pretty cool. I mean, I can imagine, it depends of course on your workload, but it's not implausible that different images share files because most images are based on rbms or devs, whatever. And if you have the same dev of a GeoOopZee package, then it will have the same file, even if you're using a completely different build of your image. So this is pretty cool. We're working on trying to finalize this, make it work by default in Podman. I think it's going to miss the 4.9 release, so it's probably going to be in the next release. Also, the Podman people themselves are trying to work on making CSV chunks be more of a default thing or at least more widely used because it has advantages not only for Composite Fest, but also for used faster pulling of images. And then we need to look into signing images. I mean, there are ways to sign manifests already. And you can imagine if you add Composite Fest digest into the manifest, and you sign the manifest, if you can validate that the manifest is signed, you can trust the digest, and you can mount the thing and know you have the right thing. But there are outstanding questions like, yeah, but what kind of keys are you using, where do you store them, how do you know which one you trust, you know, there has to be some kind of rule set to specify what you're allowed to, which keys are golden or whatever, or are we using the kernel hearing or the secure boot keys, like, throughout sending questions. But the technical side is not really that complex. If you can validate using some kind of mechanism that the digest or the manifest is okay, then we can just mount this thing and be able to trust it. And I guess... Hi, thank you for your talk. I am one of the... One of the things that the design of your choice in PuzzleFS is different than Composite Fest is that the files are split along file boundaries, instead of along sharing boundaries. It makes easier to get cool graphs with USS, but harder to then share if you change small important portions of a file. I'm curious if you have any... Like, is that part of your future work? I've thought about it. And I'm not... I don't know, do I work on that? Because it would be... It would require so much work on the kernel side that just, like, fundamentally change how page should track. That's not gonna happen. I mean, it's a choice you have to make if you want to focus on disk space use or in memory use. I think this is sort of like a Goldilocks zone, but yes, it depends on your use case. Yeah, also PuzzleFS was written in Russian, so it was his practical forever reason. I didn't get the question. Yeah, it doesn't really like... CSD Nishant is a way to extend the header of a layer torbol with information about what goes in it. And it's not strictly needed for Composite Fest. Like, you can untar the torbol and compute the checksum of every file if you want to. But if it's there, you can avoid that and it's just better performance. So it's not necessary. Another question about the compression... The compression of the CSD, I think, is the only choice. I think it belongs to the CSD compression of the torbol. But the G-Face is one issue more than the normal G-Face. So, and then, like, in next image format, like, the CSD Nishant. Yeah, well, G-Face is the standard, right? Yeah, there's this E-Star GC thing that you could use for that. Like, I haven't spent any time trying to make that work. I mean, it will work. It just introduces a conversion layer, a computation of these things. But I mean, I think it could be done. I think it could be done. And if you have that kind of format, but you still, you can't... you would still have to create an index for your GSD files as well. And, like, I don't know if E-Star GC is better supported than CSD Nishant. It's unfortunate, like, the apartment people did the CSD D thing a long time ago, but it took a long time until Docker merged it. But then the latest version, like, a year or so, they merged the pull request to add it. So, it is actually supported by Docker now. So, the hope is that eventually we will get wider acceptance. But it isn't perfect right now. Hello. So, why would you run 20 sleep commands? So, why would you run 20 sleep commands? Was that a question? Or... Yes. Yeah. I mean, obviously, you wouldn't. You would run, on your node, a hundred different containers. But any containers that happens to ship the Fedora 232-2 glibc-rpm would have the same binary for glibc. And then that would be passed to the same file. So, everything that maps glibc would use the same file. I mean, this demo is just a demo to expose the sharing. It would be less extreme in the real case, but it would still happen. Thank you. So, I think it might be a long use case, but this object storage, there is a way to continue after leaving the closest client? It's not currently, but we have some issues around about adding tooling around that, like an FSEKA for it and a garbage collect for it and things like that. That's things. All the code is there to read back from an image file and extract a list of objects in it. So, it should be... Currently, it's possible by doing some scripting to do it, but yeah, we would like to add some tooling that automates that. We have five more minutes, so there's a question. Yeah, I mean, it works fine. It's into the word now and you can install it. Whoa. No, I mean, it requires, for all of the features, it requires the latest kernel, but if you have six, five or later, then it just works on any system. It's just a user space tool. It doesn't need any special anything. Okay. It just works. Thank you. Thank you. All right.
libamicontained: a low-level library for reasoning about resource restriction
Hi, so I'm Tycho Anderson. This is my colleague Sebastian Dabdupe and we work. I don't know, is this, is that, I don't know, we'll see. Oh, did I bump it now? Oh, that off. Okay. All right, so we work at Netflix on the multi-tenant container platform there. And I want to point out that the name for this is not in any way related to Jesse Frazell's library we break out. In fact, I didn't realize that when I cooked up this name that there was that name space collision. And I think there's some thought that we will probably merge this into a Linux containers project soonish. So that disclaimer, this won't be this particular name may not be that long lived. So with that, I just kind of want to go into kind of a basic question. How many CPUs do you have? And this is a question that it seems very simple. You know, you could call one function that would return an integer and tell you the answer. But as we'll see in a little bit, people have screwed it up in a large variety of ways. So the way that we do this today is typically with C groups, I'll go over some of the other interfaces in a bit. But if I just did like a, so I'm in, I'm in this C group here, I've just created it. There's no limit right now. So this anything tasks in this C group can run on any CPU. So there's no tasks in the C group right now. So what I'm going to do is put this shell in that C group. So now the things you'll see the first pit there is the task, the shell and the second one is cat. So now those tasks are in this C group, but there's still no restrictions. So I'm going to, so now I've restricted this particular shell can only run tasks on CPU zero and one. So this is like every container engine that you talk to has a way to do this. And it's, this is how they do it. The problem then is if that container itself, for example, suppose that you were running system D in that container, that can just container may create a sub C group. So now I'm in some sub C group. And in fact, the container engine will then grant the container the ability to manage C groups. So, so now I'm, now I'm here and I can see that I'm in a sub C group. And I have CPU set that CPU is that effective that tells me what things in this sub C group can what processes, I'm sorry, what processors they could run on. But if you look, if I look at the CPU set that CPU's file, which is the one that you use to control stuff with that files empty. So in particular, if you are a naive run time, you might look at this file and say like, Oh, hey, I can run on lots of CPU's. And the reality is you can't because there's this other file that ends in dot effective that tells you here's what the some total of all of the restrictions up the tree are. But you have to know to look at this right file. And this is only one of the, the whatever four different interfaces we have. So there's the kernel has a command line you can use to do this. This is what I still see pieces what people who really care about this stuff like HPC people use. The second one is, honestly, it's what libc's used to use. Then there's prox that prox CPU info all of the system prox of s's are emulated by Alex CFS. So in some container environments, you will get a view in prox that of the right thing, but in other container environments, you won't. And then on top of all of that, we also have this is called sked get affinity, which gives you some combination of the above results, but does not, for example, containing I sold CPUs. So if you know, based on all of these interfaces, how to answer this simple question of how many CPUs do I have, you can leave. And if you don't know that's okay, because the whole point of this talk is that we're going to try and fix that for you. So, so what's missing from this, in particular, you can do stuff like CFS quotas. So you're not even really worried about the number of CPUs you have, you may be able to run on any CPU, but you may have like a very small shared quota. And so in large multi tenant systems like we have at Netflix, we're trying to move away from assigning specific people to specific CPU sets with the goal being that if there is a CPU that somebody can run on somebody should be running on that CPU. So even even the original question is not necessarily correct. So it's hard to answer. I'm just going to kind of give you an overview of all the funny stuff that we found. So TC Malik, if you use non sequential CPU assignments, it will seg fault. That's bad. The JVM's implementation. This is a bug that we filed. So it queries CPU set that CPUs not effective, which was the demo I showed you. If you just look at the wrong file, you'll get the wrong answer. The other thing is that we care about other resources besides CPU. So we also care about memory and other network things. And the problem is that this effective file is really nice for CPUs, but it doesn't exist for memory. So if you want to do it for memory, you have to walk the whole C group through yourself as opposed to doing what the kernel already knows. So answering this question is also kind of annoying for other resources that aren't the, you know, the most obvious one, which is CPUs. So what we would end up with in production is a two CPU job would allocate 384 gigs of heap, which is half of an R5. And then, you know, that's not very good. So, so this, this kind of annoying. There's a longer explanation that one of my colleagues wrote about, you know, how to compute this correctly in the face of C groups, but again, doesn't take into account I sold CPUs or other things. So there's more bugs. So G libc used to use. CIS devices system node, which is a CISFS file. And so we could with LXCFS mask that value, but then they switched to sked get affinity. So now we can no longer do that. So the reason that it's important to think about what G libc does is lots of people use this CISCFS. Or if you use the nprox command line tool to do make dash j nprox or whatever that uses G libc, which uses sked get affinity. So if you are restricting resources in a, in a strange way, you may get the wrong answer and you'll spawn the wrong number of worker threads and you'll context switch into oblivion and get less work done. So one, one of the G libc maintainers pointed out that this particular problem should be solved by the kernel. That's a bug that he filed that I think nobody from the kernel side at redhead has ever looked at. Musil for just for completeness does the same thing. Sked get affinity. So we also saw crashes in lib uv reasoning about this incorrectly, which is no jazz, which is important because that's what serves on Netflix.com. Even Alex CFS, which was written by some of the people in this room, myself included, I guess maybe nobody else in this room now, but some of the people on the program committee for this containers thing wrote this code and it was wrong. We have, we found a couple of bugs there. So this caused crashes in lots of places. That was also bad. So even the people who are supposed to know how to do this don't know how to do it. So, yeah, they also though, I mentioned this thing earlier about if you use shares and quota, you know, then you really get the wrong answer. Alex CFS has a solution for this, which is kind of cool. It's called the CPU future. So there's a question about where should this computation live? One of the G-lib C developers said that it should live in the kernel, but the kernel people, you know, haven't really worried about this. They continue to add interfaces for figuring this out. So, you know, one answer is it should live nowhere. We should just keep allocating large heaps and crashing stuff and whatever. Unfortunately, my boss doesn't like that answer. So the next thing is we should fix that in the container runtime. So this is the traditional way that we did with Alex CFS. So you bind mounts and file. It's a fuse file system. It looks at when you make a fuse file request, it looks at that process ID. It goes and looks in the C group tree for that process ID and then tells you it lies to you and it basically says, hey, there's this many CPUs online. So this is the traditional way it worked for a very long time until libc started switching away from looking at proc and cysfs files and started using cyscalls, which makes sense because parsing proc and cysfs is kind of annoying. So, you know, if there's a cyscall that can do the thing, they want to use it. So that makes sense. The kernel people often will say this thing that's mechanism not policy. So they will give you the mechanism to reason about the thing and it's up to you to make a policy about how many threads or whatever you want to spawn. And so in principle, they have given us the mechanism because the mechanism is these 40 interfaces that are all can do very different things. And so if you look at all the right places, you can do it correctly. So in some sense, the kernel already did it. It's just sort of complicated. So yeah, there's lots of them. And the other thing is that there's a new patch series allowing ebpf to do scheduling. So right now, if you think about cfs and cfs quotas, cfs is hard coded in the kernel, very well understood by lots of people. If you, the user can load an ebpf program that will now decide which tasks to schedule, the algorithm for determining how many CPU equivalents this thing has is now dynamic. It depends on the results of this ebpf program. So the kernel also can't tell you necessarily anymore. So that's, I guess, the goal of our presentation here today is to have this library exist in user space. So it's one place so that everybody can go and ask this question in one location. So it should have no dependencies because Golang doesn't want to link against libc, but also, you know, run times generally don't want to include a lot of other stuff because they're supposed to be small. If you think about JVM or whatever. So it should also be correct. So correct in two senses. One is, you know, we should give you the right number of CPUs. So we should do the math correctly. And the other senses, we shouldn't do like terrible memory corruption or other things like that. So it should be safe in the programming languages execution model sense. And with that. So what we're proposing is this library called live mi contained. And the idea is that, well, it'll be a container aware API that like calculates like with the resources you have. In this talk, we're mostly focusing on CPU count. And it'll be statically linked with the CBI and we're writing it in Rust for, you know, safety guarantees. And the idea is that it's meant to be used by like language run times and applications instead of like trying to figure out which is the right interface for resources. And here's a link to our repo. So. So why do run times ask? And like, and we hit all those bugs before. The idea is that they mostly do this to size their thread pools and specifically like their GC threads. And they want to size their arenas and allocator threads as well. And how can this go wrong to give kind of a maybe so much of a sense of what's going on in the future. So how can this go wrong to give kind of a maybe simplified example. Let's say you have 10 containers on a host and just to make the math easy that host has 100 CPUs and we assign each container 10% CPU quota. Okay. All right back. So again, 10 containers, 100 CPUs, 10% quota each. Now what happens is those run times in each container, they're all going to see 100 CPUs and they're going to start 100 threads that they expect to do like a CPU's worth of work. So what happens they eat through their quota right away. And you get a ton of starvation right GC pauses. Everything starts spiking. So what should they do? Well, call our API. So what we have first is kind of your classic. This is a classic num CPU. So we're going to call a skedget affinity and this takes into account CPU sets, affinity masks, online CPU sets, affinity masks, online CPU sets. So we're going to take num CPUs like all this stuff. But our real value add is this recommended threads calculation. So this will take num CPUs and constrain it with like quota, for example, and compute. It's not a rocket science algorithm. It's basically just quota over a period gives you like a number of like how many threads you should be running. This is what like system D and Alex CFS do. So it's like a well known calculation. And so let's do that example again, but with recommended threads, right? We've got 10 containers. We've got our 100 CPU hosts with quota. Now each one is going to get when they call recommended threads, they're going to get 10 CPUs and problem solved forever. So what have people done in the past, including ourselves? So every language runtime implemented themselves, like they call it, you know, pass a container aware flag to JVM. But usually as we've seen, they get it wrong or sometimes Alex CFS does it does the right calculation, but it can only do the the proc file system mounting. It doesn't do like doesn't take care of like skedget affinity. So what we did at Netflix was we use Alex CFS and then using set comp would intercept this skedget affinity to like do this calculation. So this is kind of our follow up to say we're like, there must be a better way to do this. And there's lib which is in the Alex project today, but it's not container aware. So that's actually the we go for and then the next thing we do is we do a set comp. And the next thing is I wanted to throw out some like some additional like issues we ran into, which is a lot of things assume like static like a static resource size, right? But in kind of a containers world and you can when you can do you can like edit C groups, like for a running process, we don't think that that's no longer a correct assumption. Right. So like you're allowed to change C groups on live process and nothing seems to take that into account. So other things we're thinking about are like, what if thread pools were dynamically sized? It would periodically check like, Hey, do I still have 10 CPUs or now do I have 20? You can resize your thread pool that way. But that's not the way it works today. So we're that's future work, but more like food for thought. And that's it. Thank you. For sure there's overhead. But the reality is like if you want to lie to people about the right answer to this question, like you have to do something. And so that's like the best stop gap. I think like, is there a less over heady way like the real solution writers for just applications to get it right. So they don't have to worry about it. And so that's sort of how we arrived at this particular solution is what if people just actually knew how to do this correctly. The overhead of getting it wrong is probably worse than this. Right. Yeah. I'm not sure. So checkpoint restore will do C groups. So like the majority of the way people do this is with C groups. So I think one problem could be is if you restored an application to a larger or a smaller number of CPUs than originally had today, like mayhem. So if you if you have 100 CPU application, then you restore it onto a two CPU box. It's still going to think it has 100 CPUs. It's going to allocate 100 memory arenas free like one for each physical thread so that it can do, you know, threadless locking to do memory allocations quickly and stuff. And so 98 of those 100 arenas will be wasted. So for you guys, if you want to dynamically change CPU size, you have the same problem, but actually on steroids because you really want application run times to resize their thread pools, which is Sebastian's last point. For us, it's mostly like just get it right the first time, please. Like let's figure it out later. Other questions. It's a good question. We should right now. Our set comp stuff is not open source, but there's no reason it could be. We just has to like turn the crank to do the work. Just just for the offline folks. Yeah. So it's a good question. So your question is sort of what is the roadmap for this workload like this was basically stop zero of like, do people think this is a good idea? So it sounds like you think is a good idea. It would be nice to have your collaboration. One of the questions I have for the LXC project leaders, are you interested in having this code in LXC? Because I'm thinking step two would be, you know, we can centralize the code in a place where the container people like it. And then step three would be, as you say, go hat and hand to each front time. And it probably involves submitting a pull request and then flying to the Golang conference and the JVM conference and the Node.js conference. And you know, like it's going to be a long road, but what we have now is bad. So anyway, to point question to. Yeah, we talked about it. So the reason why the thing that exists right now has been developed by all of us is because it could probably make sense to have something that's more generic, that actually covers what needs and that hopefully is bad. So I think it's a good place to have the community that's working as a project because we're really not famous enough for that. We've got like CFS with a lot of those things there that are already used by others. It's a good place for it. And as I was talking with Tiger in the past about this, this is a long game. It's something we've been wanting for a lot of decades. Nobody really did anything about it that we wanted to finally do something about. So in a lot of decades we can have sites on our great every means instead of having to fill it out. And in the end, there's going to be the end point in the budget of different things. So I mean, the obvious one is that for things like the good things, especially the good right time, those kind of things that's almost right, I go to the back of the crowd. Those definitely need this kind of information because right now they're trying to do it differently, they wouldn't be wrong for different reasons. So that's what makes sense. But we can also go to... ... like you know, whatever it's using, it's not taking a lot of resources to use that kind of system. We've got that very common misconception in an activity that starts, you know, load up, rate, and then something. Which is really important because once you've got things that are mixed, you can easily create a lot of like 5,000 on the machine. That's how we did it all, by just creating a secret room that's got almost no resources on the media and then link that to the media. It's a pretty effective whole in right now, in order to do it, it's a great way to figure out what's actually going on. You can get some stuff, you can get some stuff on other bases, but if you want the full picture, you can have many different stuff. I'm getting really tired of having project passing next slide. It's just extremely hard to approach the material that they want. Sounds like the answer is yes. You're interested. Yes. I was just like, I'm not a project manager. I'm a group that people call, I don't know what it is. Because I'm a group that's a student. Yeah, yeah, memory is actually worse than CPU. I don't know what it is. I'm just trying to make it very easy to do in place. Yeah, so I would say I am certainly not opposed to this. I think this basically shouldn't live under some Netflix repo, Netflix repo, because nobody will take us seriously. It's much better to live upstream, whether it's Linux containers, U2 Linux or whatever. The biggest thing is I'm guessing that U2 Linux people aren't that into Rust. Rust is like a little bit of a weird thing, because you want to convince people that stuff is safe, but also adding a Rust tool chain to your build time is kind of painful. I'd be happy to talk if... Yeah, I have a question. He's a computer engineer, he's a employee of my company, so... Yes, yeah, we should talk after this, for sure. Yeah, yeah, where it lives is like, I don't want to make another talk in ten years with this same number of bug reports on the top. Other questions? Cool, thank you. Thanks for watching.
Using chroots in a single Linux Container as an alternative to docker-compose
All right. So next up we're going to have Aiden who is going to be talking to us about multi-image and container. All right. Ready? Okay. All right. Hi, everyone. I'm Aiden McClelland. I work for a company called Start 9. So this project here is a little bit of a work in progress, but it is something we are trying out because we have a little bit of a less common use case for our containers, and we decided to try something a little different. So first some background. We develop an operating system called Start OS. The purpose of this operating system is to allow end users without technical expertise to run their own home servers. So the idea being like trying to bring the desktop experience to home server administration, and that way we can bring a lot of these self-hosted applications to a wider variety of people on their own hardware without them having to learn everything you need to learn about Docker and the hosting tools that we're all familiar with. So as part of this, we do have a little bit of a different use case than is generally intended for things like Kubernetes or Ansible or a lot of these tools that are designed for deploying corporate infrastructure at scale. We're really looking at like a single host machine that the user wants very low touch with. They don't want to spend a lot of time configuring their applications at a granular level. So we decided, you know, like a lot of these applications, they come with these Docker-composed setups, right? You have a main image that has your application code and then you have things like databases and reverse proxies, etc. And commonly we deploy this as a Docker-compose file, and what this does is it creates a bunch of containers that now have to be managed by the OS and by proxy by the user, right? So what we've always tried to do with Start OS is we've maintained this idea of one container, one service. And what this allows us to do is it reduces a lot of the complexity of the management of a bunch of different containers and also provides a single IP address and virtual interface on which the application is running. So when you're doing all of your network mapping, all of that can be mapped to a single virtual IP address that can then be viewed either from within the subnet within the device or is then exported through the host. This also means that you can define resource limits on a single container basis as opposed to having to do a group of containers and managing that as a group, a C group with subgroups, right? Another final reason that we did this is that our package maintainer scripts, we prefer to run inside the contained environment and these package maintainer scripts are run in JavaScript. So we run a service manager in the container that reads the package maintainer scripts and then is able to set up all of our subcontainers, our sub file systems from there, and execute our actual binaries. Okay, so the question is why do people want multiple containers at all, right? Like oftentimes you can take a single Docker image, a single application image and install all of the software you might need, but in practice this is not as easy for the service developer, right? A lot of times we have people coming to us asking for, hey, I want to be able to use an off-the-shelf Postgres image, I want to use an off-the-shelf Nginx image, I don't want to have to use like the package manager for the distribution of my container, to install that and manage it. So that's like the number one use case that we have for that. It also allows you to run applications, like say you have one in Debian, one in Alpine, run all of them together. Then, you know, the other reason that you might want multiple containers is you can isolate the subcomponents of an application away from each other and also do resource limits on individual application subcomponents. If anybody has additional reasons why you might want to do separate containers as opposed to a single container for an application, I would love to hear them, but these are the reasons we came up with. So our solution, we cover this first use case using trutes. Number two, as far as we can tell, works for the most part, but that is remaining to be teased out. This does not allow us to isolate the subcomponents of our application from each other or create resource limits on individual applications. Subcomponents as easily, those will have to be managed by manual tuning of resource limits within the prokates of the container. So, yeah, we've ultimately decided that those last two components aren't really necessary for our use case. Ultimately, a single application is where we define our sandbox. So sandboxing separate parts of an application from each other, like has some security benefit, we've decided isn't worth the complexity. So we decided to do this with LXC. Why do we do LXC as opposed to something like Docker or Podman? LXC is a lot more composable. It allows us to pop the hood on a lot of the very subcomponents of container technology and manage it more manually. So we can, for example, easily manipulate the container root FS at runtime. So even with an unprivileged container, that unprivileged container can communicate with the host and modify its root file system very easily. We use our shared mount propagation for our root FS, which allows the host operating system to easily manipulate that file system. And then it's also unlike some other container tools, you can perform commands like shrewt and mount from inside an unprivileged container, which is not allowed on a lot of other technologies. So to put together a service, an application, we have effectively a single root FS image that all of our applications share. This root FS image is just a base image that we use for all of our containers that has a, like, we use Alpine right now, but it loads a Node.js application that runs the package maintainer scripts and then launches the various actual demons inside their trues. It communicates with the host using a JSON RPC API over a Unix domain socket. So there's bi-directional communication between the host and the service manager in the container, and then, yeah, it can perform the actual application code inside the shrewts. So the host API, what it does for the container is it can perform some manipulation of the root file system of the container, and this allows creating overlaid images in the same way you might be creating a container. All we do is we create a root FS image with an overlay file system and attach it to the container in a way that they can trude into it. And then we also have a bunch of other APIs that these packages can interact with, mostly for integration with the end user experience, and integration with other services and applications on the host in a way that the user might have to intermediate. And then we also have a set of APIs designed for hassle-free networking. If you have, you know, some application bound to a port, you can now attach that port to a Tor address, to a clearnet address, or to just a LAN address so that you can be accessed by your local area network. And the host OS manages all of the certificate management, either through Let's Encrypt, or through a host root CA for the LAN communication, because obviously you can't get a Let's Encrypt certificate for a .local. Okay, so then the service itself, it runs a very basic API that receives commands from the hosts. So when the application is running, it can receive like an initialization command, it can start or stop the service, and then shut down the service entirely in order to kill the container. And then it also invokes all of the various package maintainer scripts, such as editing user configuration, installing the service, or updating the service. All of those perform various package maintainer scripts that get called from the host. Okay, so when we actually launch a binary, the package developer defines in some JavaScript, we have some well-typed TypeScript APIs for this to describe this structure, but it defines what binaries to launch, what images to launch each binary in, where to mount its persistence volume. So we have a series of persistence volumes that are mounted to the container, and can be attached to any path within these sub-file systems, and then it defines any environment variables or arguments in any standard way that you would launch a program. And then for each command that you have, when you just similar to how you would define a system deservice file, you can define all of these arguments and then any dependencies or health checks associated with your service. And then for each of these commands, the in-container service manager will mount an overlaid image for the requested image ID to the container. It will then take our special directories, proxys, dev, and run, and bind them inside the container. So all of the containers share the same proxys, dev, and run. And then it will run the command in the true. Okay, so here is an example I have of a package maintainer script. I don't know if that's actually visible to everyone. Is that, are you guys able to see that? Okay. Well, I suppose I can just talk about it. But effectively, you have a fairly simple JSON configuration where you define your image ID, your command, your arguments, and then some health checks defining when is this thing ready, as well as some dependencies. So like if you don't want to launch a various demon until another service is ready, you can just specify that and then it won't launch until its health check passes. So all of this is available on the GitHub if you want to check it out. This particular example is in GitHub's start9labs slash hello world startOS. There should be a link on the talk. So time to do a little demo of what I have working so far. Let's see if I can get my shells over here. All right. So here I have an instance running, hold on. There we go. Here I have an instance running startOS. I've already installed a package. This package in this case is NextCloud. This NextCloud package contains two images. It's got the NextCloud base image, which also contains the Nginx server because it's running the PHP for NextCloud. And then we have Postgres, which is our database persistence layer for NextCloud. So what we're going to do, so we've attached into this container, and then I'm going to go ahead and just inject, basically run a REPL inside the JavaScript engine here. And I'm going to go ahead and do my imports here as well. And what this has done is it has connected us to our JSON RPC APIs, both the hosting of the container and the container into the host. And then we're going to create a couple of overlay images. So first we're going to do our Postgres image. And so what this is going to do is it's going to tell the host, hey, I want to mount this Postgres image to the container. It says, okay, here you go. Here's the path at which I have attached it. I'm going to do the same thing for the main image. And there we are. I'm going to go ahead and define a couple environment variables. Okay. So I have a set of temporary hacks that I've put in that will later be managed by the actual container service manager. But it's mainly around permissions of the container. I still need to get Shift FS working properly. Because LXC, what it does is it maps the UIDs within the unprivileged container to UIDs on the host. And so when we mount stuff to the container, we also need to perform that same mapping. So we're not doing that yet, but I have a set of ownership changes that will manage that. And then all we have to do is go ahead and launch our application. So I'll go ahead and launch Postgres first. And here we go. We have Postgres running inside a tru, inside the container. And it looks like it's ready. And then now I can also launch. Next slide. So here we have, now both of these applications are running within the same process namespace, the same C group, the same container. But they're running from completely separate images. And that's all I have to show you guys. I think we can open up for Q&A. Thank you. So we have considered the idea. Right now we actually haven't found it necessary yet. Like the tru seems to be sufficient for the sandboxing we need to do. As far as we can tell, the technology is at a point where it wouldn't be too difficult to do containers and containers, but realistically we haven't found it necessary. That's all. So I think you're asking as a package developer how we distribute your application. So if you have a service that you want to distribute to our users, to people who are running on StartOS, we have our own, like the company Start9 runs a marketplace. But we just have a very standardized package format. In this package format, you could host on any website. If you want to charge for it, you can charge for it. But ultimately the APIs are generic enough that you can run your own marketplace to offer whatever services you want using whatever protocols you'd like to to gate access to those S9PKs. So as a service developer, in general, if you're publishing to our official registry, that means that you have a free and open source project that you're looking to distribute for free. But that does not stop you from running your own paid marketplace. One more question. I'm sorry, I couldn't hear that. Other resources for our application? Yeah, so the resources are managed on the scale of the entire application using the configuration of the outer LXC container that everything runs inside of. So you can just modify that LXC config. Well, we modify that LXC config automatically based off of the host APIs. Thank you.
Soft Reboot: keep your containers running while your image-based Linux host gets updated
Welcome everyone to our next session. Thank you very much. Hello. Good afternoon. My name is Luca. By day I work as a software engineer in the Linux systems group on Microsoft where I am responsible for the operating system that runs on the Azure infrastructure. By night I am involved in various open source projects. I'm a maintainer in system D, a Debian developer, DPDK, yes maintainer, a bunch of other stuff that I consistently forget about. So I'm going to talk to you about this new feature we had in system D middle of last year called software boot. And yes, it's a new type of reboot and we're going to look at how it's implemented first and in the second part of the talk we're going to look at two demos showing that running and how it can work with containers. So if you were at all systems go, you probably saw the first half of the talk while the second half is new. So first of all, why? Why do we want a new type of reboot? Don't we have enough already? And the answer is of course is performance. So rebooting means if you have some services that are running on your system and they're providing some functionality during that window of time they are interrupted and people don't like interruptions. So that is the main motivation for this. I also know that there are some updates system that require double reboots. I've been told for example that DNF line upgrades require double reboots. So by shorting the time it takes to do this we can save something there as well. But the main use case is the first one for avoiding interruptions. So when you go from a reboot to a KX, you're saving time because you're cutting away the time it takes to reset the firmware and the hardware. So the next obvious step was to cut away the kernels at time. If the kernel is not being updated you don't need to reboot it and do all the device initialization and everything else. So we came up with the idea of soft reboot and this is what it does. It just reboots the user space portion of your Linux system. Again the goal is to minimize disruption as much as possible. So this pairs very well with image based Linux. We've been talking about image based Linux systems for a couple of years now. This works very well with it because in the system you have a single root FS which is usually read only. And then you have a UKI where your kernel is in VR and these are distinct components. They are usually updated independently. And so with a soft reboot when you don't update your kernel you can update just your root FS. Now this also pairs very nicely with kernel live patching. So on production system you can fix bugs in your kernel without rebooting by using kernel live patching. And this pairs nicely with that because you can use the system to update the user space portion of your image when you have bugs or security problems or whatever. Again we are replacing the entire user space atomically and moving into a new root file system. Now it's not only for image based systems though. This can be used for package based OSs because for example you cannot restart the D-Bus demon or broker on a Linux system. Your system will explode if you do that. So by doing a soft reboot you can save some time when your D-Bus has some security problems that needs to be fixed or what not. So let's look at how it is implemented. So as far as the kernel is concerned nothing is happening. Everything is business as usual. It doesn't see anything. It's all the same session or the same boot. So for example we have still some problems to solve, some papercasts. For example if you do journal CTL minus boot minus one you will not see the previous software boot. You see the previous full reboot. We have ideas to fix this only to do list but it's one of the fewer papercasts left to solve. Now as far as user space is concerned everything goes away. It's a normal shutdown. So system D goes through the usual phases. It starts a shutdown target, a software boot target that conflicts with everything else so all the services get stopped. And then instead of giving control back to the kernel with a Cisco to reboot it just reexact itself into the new root file system by passing the full reboot. So you can do this in place. So your software boot is in the same root file system or you prepare ahead of time the new file system. And the run next route. And we allow this because usually prepare the new root file system and position all the mounts across and whatnot take some time. So you can do this ahead of time without having to interrupt all the services by doing it in line. So you can prepare your next root of s in run next route and then code the software boot so that you transition very quickly to the next root of s. And again you can prepare your any additional storage you have if you have any encrypted partition for var for example. You can prepare it ahead of time so you don't need to redo the decryption steps which again takes some time require maybe use an interruption maybe accident tpm or whatnot. And again the kernel stays the same so no configuration changes. So in system D 254 we added a new verb system system CTL software boot to do this equivalent the bus API and the next version. We also had some new signal that tell you yet this is shut down happening and it's off type software boot. So we are cutting time away from their boot is that all we can do with this. Not quite we can go further. So given system D set doesn't exit it's reexec itself. You can carry over any state we want to the software boot. So for example the file the script of store is not aware what it is a way to store for the script or inside PID one and then it gives them back to you to your service when it starts. And by the way all these links are on the slides are used to documentation I will put the slides online. But basically your service can say hey I have an active TCP connection take the sd for me and keep it there. And then your service goes down the software would happens you come back and you get back the TCP connection you can pick up from where you left. Because the kernel just stays running the connection is not interrupted it just buffered and there's some delay of course but it doesn't have to be established for example. It's not just sockets you can use this for MFD for example for any buffer any state that is expensive to calculate you can store it in a MFD and get it back immediately. And you can do this for the network stock for example in network D we have these options so that when it goes down it leaves interfaces configured. And when you go back in the software boot in a new file system you don't have to reconfigure your network interfaces which again can be a bit slow. And then finally we transition across Zashran as a state pseudophile system or tempfs so that if services have state in Zashran they find it again when they come back. This is not recursive but and also Zashtemp is reset completely because that's a scratch area. So by doing this we can accelerate the time that the services need to go back to fully functional after a software boot. But is that all we can do again and what the hell does any of these have to do with containers is it a container dev room. So here's an idea now some payloads are completely independent of your router fest for example containers but also portable services. Now if you don't know what a portable service is I suggest to check it out they're awesome they're a way to attach a system service to your OS that runs from a different root file system. But it comes with its own image but it's fully integrated with your system services it's quite cool. But it applies to these but not only this so these these are these services these containers these payloads are independent of the root file system. So can we let them run during this software boot process the answer is well yes why not. And the configuration to that is a bit complex it's linked there I want to show it here we show it in a demo later. But basically you can configure a system service so that system you will not kill it or stop it when the software boot happens. So is the service keeps running while the router fest is updated under it. Net or is it accessible we keep it up the current doesn't go away doesn't the conflict devices same thing for the disks. So for this kind of payloads we go from some interruption to zero interruptions we quite nice. Of course there's a catch there's always a catch these payloads they really need to have nothing to do with the root file system because for example if you keep anything. And if I open for the old root file system and you will keep the resource pin and they will be free that you use more memory or whatever else. So you need to make sure they are disconnected and also other parts of the US are going away for example the bus. So in the documentation there it shows up but you need to change the way you use the bus via the SD bus library for example to automatically reconnect when it comes up. It's usually not done because the bus never goes away normally but if you have one of these payloads so virus of the boot you need to change our use the but it's very simple and it's a. Describing the documentation there. Now one thing I will look at in the near future is also if we can if we have actual bind parts from the host. The first into the services if you can automatically refresh them after software boot I'm halfway through that is all done yet. So let's see this happening with Podman now because I am a coward I did not I'm not doing a live demo I'm showing a recording. Now this is a dead end image dead end testing and it's running podman some version and so podman has this thing called a quadlet where it generates. Some system services for your container and now this is not exactly what podman generates though it's a bit different as most stuff here and we see what that is in a moment. Or you can see down here it runs a very important production use case of sleep infinity that's typical production use case everybody uses. But to show what the actual difference is because this is a demo to put it together I am not a podman developer or user. I thought it was cool to make it work and I have it a bit together so podman gives you some some systems service I change it and show you the deep here so. These settings up here are necessary to make the containers service survive the software boot. This is a bit of a hack and if this is supported by podman natively it would have to be solved in a better way but basically this ties the container to the root file system to the var directory. So I have to comment that out so that they are not tied and it doesn't get shut down and then there's four more things down here that are suspicious and we'll see what they are in a moment. Now which is simple to explain if I start this container this. Sleep service and it takes a second because it downloads the image in the background and I resolve the complaints that we don't care. Now. The way podman works when you run it as part of a system service is works correctly creates some subc groups so there is the payload. C group node and then there is an additional sidecar control service that runs as part of the same C group and is also a group is dedicated to podman. Now the reason for this for settings here is because this common binary comes from the root file system. So we need to make sure if we just do this it will keep the root file system pin that we don't want that. So my my hack to make the demo work is actually we're running on a different the service runs on a different route image. So it's another big image with podman inside. So this binary and the podman binary that runs they come from this image not from the system that way they are independent and they are not tied together. And then we disconnect a couple of things. So. So now we we have that prepared and there's other things so you saw the two C groups there. Now the way system makes marks a C group for survival of software boot is by setting these extended attribute here. Now because podman gets a delegation from this C group which is the right thing to do but we don't touch the children. We do not set these extended attribute automatically for these two payoffs and if podman wanted to support this natively it would have to do that when he sets up the C groups. Now of course again this is how to gather so I'm doing that by hand just setting the that's a group there. The extended attribute so that system we won't kill all the these processes when they are running and now we can finally type software boot and we see all the US space going away. And then shortly thereafter we come back and we get a shell and then we check with us some errors in the C H and we don't care about so just ignore them. I was too lazy to hide them and then we can see that the sleep is still running and the control monitor as well and it's the same PID is the same processes. The containers kept running while we shut down all this stuff. All the system services have been shut down and restarted but the container is just going without interruption. So yeah again this is very quickly out together. I am not a podman developer is pondering there interested into supporting this or maybe LXD developers. I'm happy to help them but this is a have to get a demo I have another one which I think is a bit more interesting. So as your boost if you're not familiar is the an offload card that is installed in every Azure node so your Azure nodes that run your virtual machines have these arms 64 offloading card that runs the operating system that I work on. It's called Azure boost and I'm showing here a demo of this recorded in production on an Azure boost that he pulls for a second now we recorded this my colleague Maya to my oh my thanks go for recording this go record this amount ago so far executives and then I asked hey can I show this in public at a conference. This is never shown before I didn't only in turn on Microsoft stuff super secret and surprisingly they went yes you're going to like what now I have to do it so I had to unfortunately blank out the host names because this is a real node somewhere in the fleet in that it's entering the US and I couldn't show the host name which identifies the node so you will see this blank things I apologize for that but I had to hide those but we are showing here let's start going again. So in Azure we are running this Microsoft operating system it is just a machine it's arm some version of the kernel 5.10 we have what we call agents these are containers running as portable services. Some of these are critical for the actual customer VMs if they go away the network is interrupted you cannot make new connections. The agent is the critical one that goes away network goes away up local is a local one that does some local service so it doesn't matter so we configure the first one that portable service to survive the software boot. And the second one will will we just go away and disappear now we attach the update agent that does the software boot you can see the portable service is just a touch as a new image so we are moving to a new image here in the background there you can see Sierra console going away. Now we switch to a new SSH because of course SSH is not a critical service like it went away issue come up in a second. And we reconnect and we will check and compare the US versions before and after kind of version before and after and check on the status of these containers and see that actually run again so yes the version and the zero three so it was zero one before so we did update the root of S. It always read only the and very to the fast so we updated as a one block the corner is the same we didn't I didn't cheat and do not show a boot there the current is exactly the same same big than everything so let's check on how these containers are doing. And we can see this is the critical one the the net agent and we compare the P. I. D. is before and after they are the same so the same process is one nine seven and two zero nine nine they're the same the same process is the same pale. It keeps running to the software with while we change the the and very to the fast image behind the other one as we started because it's it's just a non critical service so we let that that be a starter so yes this is it for the demo and hope this was interesting this Nick pick at the Azure production machines and running in. Down in the fleet and we have five meals for questions questions. Any questions. I cannot. So checkpoint restore we don't and that's a very different thing right so checkpoint restore gives you an interruption service. This doesn't so you check point and then you come back to the same state of the process but you still have an interruption while you do your update this is different this is. Aim to let us update the root file system with zero interruption for this payloads so it's a bit different and we don't have plans for that at the moment now these are a bit complex payloads so we have a look into CRU at all. Think there was. Any other questions. So I end. No questions everything clear. I don't believe that there you go there we go. I know that guy I'm gonna second. I'm gonna second. So. Excellent question now the demo was recorded in production with a custom image loaded. Thank you. The demo we show was on a production node with a custom image with a new feature we are deploying this sometimes this year so it's not yet deployed a scale we will see I'm sure it will explode in horrible ways. But for now the main thing we found was debas reconnecting to debas was the main thing that broke the services but it's easy to fix that was the main thing for now. Other questions. Going. I can't hear shout. Shout with the microphone shout. Yes so the pen is on the local system I showed it before. You need to prepare them ahead of time. From here. It can work it can work. Thank you.
Juggling with UIDs and GIDs: rootless container deployment with Ansible
This will be a really short talk, just in minutes to take your attention, not much. And it's going to be an easy thing that probably some people already do, but I'm not like a container, I'm not the man when talking about containers. So what I think that this is kind of an interesting thing for all the people that like me, try to play with containers with home setup mostly. So the motivation for this talk is basically that I have like a server and I am experimenting with rootless containers at home. So I am trying with automating with Ansible, which is not an overkill, I even copy my .files with Ansible just because, you know. And then I like to learn by breaking stuff. So why don't you break stuff at home? So this talk is about, again, my personal setup. I will share all the code later on, but just to meet my home server, it's called Morla. Does anybody know who Morla is? 321 Morla is this, beautiful creatures. And it really resembles my... There's a problem with your slides. Oh my God. Maybe you can just reconnect it. Let's go. It's not that I only have Morla. Yeah. It's so... And it's... It remembers everything. It's like a NAS. I store all the things there. It's heavy and dusty like my home and it's a turtle like my home server, which is not a turtle, but whatever. So first setup, I used Portainer and Docker Compose. It's really convenient, but it had some issues that I'm discussing right now. If you use Portainer, Docker Compose, you're my friend, but I just don't use it anymore. And why is that? Well, it's easy to install on some machines like on OpenMediaVault, which I was running before. It's rootless. Well, I don't know if it's rootless right now, but it wasn't at the time. It only supported Docker and I didn't like to give all the root privileges around and it's heavily dependent on GUI. So these are the three things that I will try to resolve with just using Podman and Ansible. And so this is Portainer, if you don't know. Again, very convenient, but also I don't need this GUI. And Linux server images, if you're familiar with them, they're really simple and they all have this kind of workaround for running some services that you're not supposed to run as root inside the container because, I mean, you're not really sure that things can get out of the container. So just to work around this, they implement this feature that you just specify your UAD and your GAD inside the container and that when you run it as root, you're good to go because outside you will have that user. But that's not the case with Podman, right? So yeah, this is what happens with Podman. Basically if you run that configuration, this is for Pwigo, it's for sharing pictures. I want to be able to mount my volumes in the container using the same users, like outside. I don't want to have some weird mapping of namespacing because then it will be inconvenient if I just drop stuff in and out. So again, this would be really easy to solve if you just run inside stuff with root, but again, it's not, again, the case for all the images. So what you could do is, of course, to just run, like if you want to touch a file in some volume that you configured, it will give you permission tonight. So you just do Podman and share, you do what you have to do and then you have your file there. For me, again, it's inconvenient because I store their media files and I just want to drag them in and out quickly and I don't want to just do unshare all the time. So again, this is what we are facing. Basically the red part is what we care. So when we have a non-privileged user outside and a non-privileged user inside, to make it clear when you run any command, the un-privileged user inside will be remapped. I'm not sure that, like, I am not that familiar to talk about it. So this was just to explain myself how things work. So to make it even more clear, user inside and user outside not privileged, when you see the process of the user, you will see that outside you will have somebody. It's like 1-0-0-9-8. I don't know who you are. It's, well, if you don't have to manage stuff, it works, but if you want to take stuff in and out, then you have to deal with that. So what do you do? This is what you do. And this is why I call it, like, juggling. Because basically the way Podman works in root and non-root management, it allows you to remap the user IDs and the GIDs. So both works for groups and users inside and outside the container. And it's kind of complicated-ish syntax. But what basically happens is that you want to take the UID that you are running in the container. It could be, like, whatever UID, like a fake UID, they even call it, and you basically remap it on the host. So now comes the reason why I did this presentation, which is juggling. And as you see, you have to run this command three times, because you're dealing with a user ID inside the container with a fake user ID and a user ID outside the container that you really want to map to one another the correct way. So there we go. Let's run the first command. OK. You don't get lost. This is faster than the I. You can bet now on where's the UID now. You couldn't have guessed. OK. But anyway, back to serious stuff. The real example is this one. I am running all my Linux server images with a fake user ID, which is 911, because help me in case of... 911 is the fake user ID that goes inside and actually want to fake it, like if root is running inside. It's the same result that I want to obtain. So that outside I actually have my normal user ID. If you actually check into the container in the UID map, which is where actually the UIDs are defined, how they are mapped inside and outside, you will see that it's taking my ranges, as I'm specifying. Please don't ask me much details about it. We don't have time, but we'll talk about it later. And then the result is that when you mount stuff, you have it configured for your user. And there's another convenient option, but I don't have it on my server. If you just do one command, it's easier, but I don't use it in my server, so you can either shuffle or just use this one liner, and it will do just fine. OK. And then about Ansible. So running this command all the time is quite inconvenient if you have many, many containers, it's hard to maintain or it's a lot of scripting. And plus, I think that Ansible goes really well with Podman containers because first, the model is great. And second, it really enables you to have much more control on, for example, config files, templates that you need to put in the container when you boot it, and whatsoever. And what I do with Ansible, and I advise you to take a look if you've ever tried, but it's kind of cool. You can have your configuration and just store everything in what's in the main configuration file that will store all the necessary variables, like parts to expose, volumes to mount, all the users that you want to shuffle with, whatever. And then you can basically copy paste what's a generic container configuration or setup if you do things right. Let's take a look, for example, this is, again, the same configuration. OK, I just kind of find what's the main name of the container, what I want to be the display name, where to pull the image from, parts, a database, for example, because I'm running pods with that. And it's basically all just copy paste as you were doing with Docker Compose images, but this way you're actually controlling much more what's happening because you can also control configuration files, setup, mounting stuff here and there and whatsoever, you name it. And of course, well, this I was highlighting random stuff in case you're interested. Yeah, this is the volume configuration. Don't forget to put the capital Z if you are in an IC Linux environment, of course. And let's go to the setup now. So this is my control panel, by the way, it's really convenient. It's like container. You can just specify variables to say, run this, this, this and that container, and it's all finally replaced the name, and it will just work. So it's just simpler, in my opinion. And then I think if I lost, oh, yeah, I wanted to show you the setup as well. So no more scripting, right? If I need to create a config directory, I can just create it with Ansible, right? And it's really nice that also the container creation, this is the same shuffling that we did before. So the fake ID shuffles with IDs 0 and range 1 and et cetera, et cetera. And this is all predefined in all the setup files that Ansible can provide you. So I'll just finish up because I have a few minutes. Oh, this is also very convenient for containers. If you're booting many, many containers with Ansible, never did that. It's good because it will just say, OK, things are good. This thing, I don't need to check it because it's already there. Again, it was a big thing for me. You may be familiar with that, but if you're not, try it out. And in the end, no mistakes. Of course, I'm also not doing a demo because it would break. And very quickly, tags are also great to manage containers. I think they really go very well. You can just say, set up this, this, and that and tag all the files that you need. And then we just go up. Takeaways, go rootless, automate the stuff, and try to overcomplicate things at all times to oversimplify things. Thank you. That is my presentation. If you want to.
What's new in Containerd 2.0!
Alright, let's get started. Is that, I am unmuted, yes. So yeah, this will be fairly quick, just an update on container D. You're either here because you're interested in container D or because it's too hard to change dev rooms and so you're just going to sit here and hear about container D. Hopefully you're somewhat interested. I was having a bit of phosom nostalgia like 2018, talking about just like the first year and a half of container D getting to 1.0. So now we're on the cusp of our 2-0 release, our first time having kind of a major version since we started the project. First just a few stats in case you're unaware. Container D adoption has been growing a lot. Some of that's probably due to the Docker shim deprecation in Kubernetes. This is from DataDogs, ANO report. The CNCF and Sysdig also put out reports. They all come out with different numbers so believe whichever one. This one was positive for container D so I used it. You can probably find another one. Maybe more importantly to the project are actual community growth so people actually contributing, getting involved in the project, becoming maintainers. This is a crazy eye chart from the CNCF. You can see Kubernetes way up there at the top. Again, there's some magic math being done here about how many PRs and issues are flowing through your project. How many people are contributing and it comes out to container D being in the top 15 or so projects. One of the cool things is that we've had a lot of, I think this captures like the last nine months, but new maintainers, reviewers, committers from many different companies, some independents. So that's awesome to see as well. The cloud providers you might be using use container D underneath their Kubernetes service and some other projects as well. The thing I wanted to focus on is one of the reasons I think container D continues to grow as a project is that we've built in extensibility in different directions. I'll talk about three main directions that container D is extensible or how you can build around it. One is on the client end and so one of the newest representatives of that is Nerd CTL written by one of our maintainers, Akahiro Sudha who you've probably heard of because he's written 100 different projects in the container space and anytime you use rootless containers it's probably because Akahiro started that work many years ago. The hero nerd CTL which gives you now kind of a Docker command line for container D. The other way that we're extensible is in snapshotters and those are, if you remember Docker's graph drivers, these are the way that your containers file system are actually stored and so overlay is obviously a very common one that many of the container runtimes use but we've actually made it. So we have built in ones which I'll talk about but we also, you're able to extend that with a remote snapshotter and that's an area where we see a lot of growth where people are writing their own snapshotters for their own unique use cases. Then sort of directly down from container D is this layer we call the shim layer that drives an actual OS level runtime and so obviously many of you have heard of Run C or C Run that's kind of the common Linux adapter if you will that drives the set of syscalls you need to name space your process but the container D shim API again is extensible and there's many different shims available and we'll talk through those. So these are kind of three directions. There's also some other pluggable interfaces that I don't have time to get into today but these are all ways that again as we go into 2.0 we continue to see people expanding container D in these directions. I'll spend the least amount of time on clients. We've had this sort of simple tool in the project since the beginning called CTR. It was never really meant to be a production client for container D but just an easy way to poke at the API, get a list of images, list of processes. Run CTL is much more recent and has its own set of maintainers who are marching along with new releases that are either bringing better alignment with the Docker command set so all the flags, all the features or adding features that they can reach because they're built directly on container D like some of the lazy loading snapshotters, image encryption, container image signing, all those are built in to nerd CTL. Cry CTL is from the Kubernetes community that drives the CRI API of which container D has an implementation obviously CRIO and others have implementations for that API and then of course the Docker project is also built on container D. There's some interesting developer platforms built around these clients. After desktop and CoLima allow you to drive the Docker engine or container D but we have a team at Amazon who built Finch that's just built on nerd CTL build kit and container D again that allows you to do macOS and I forgot to add Windows here because we just launched Windows this past week. But again these are ways that people are extending the capability by building new clients around container D. So the other area I mentioned was snapshotters. There's a bunch of built in ones. Many of you will recognize things like overlay and device mapper, butter FS but this plugability of having proxy plugins to a remote snapshot are so now two things you're not tied to container D's release life cycle. You don't have to get your snapshot or merged into the container D code base. You can write your own, you can run it as a separate process with a GRPC listener and container D will call you for the API of the snapshotter, prepare, diff, unpack and those operations that are required for the snapshotter. So there's three main ones that all three of these have now been donated into the container D GitHub organization. So they were started as external projects and they've now been donated. They're all related to lazy loading file system so if you've played around with being able to run a container but not having to pull the entire image, say it's a 10 gigabyte image with scientific data sets or some complicated ML model. These lazy loading snapshots will only pull the files that are needed to start the container and so Star GZ, overlay BD and NIDUS are all in that family and then there are two, there's Sochi that was built by one of our teams at Amazon that is seekable OCI so again a lazy loading snapshotter and that's open source but then GKE also has a feature called image streaming built around the same ideas of lazy loading but that's at least for my understanding that's not an open source project today. So again these are ways that people are extending container D by having their own snapshot technology and plugging that into container D. Allison mentioned shims so OCI runtimes, there's several options there. So we have run C built in, you can also use C run and we test that in our test suite for container D and there's also some experimental Rust and free BST runtimes but then again you can have your own shim outside of kind of the container D core project such as the one for Windows maintained by Microsoft, HCS shim. Run wasi is one of the more active projects in the container D, you have namespace where again this is a shim where you can drive container D to the same API and clients but actually run wasm workloads instead of running a traditional Linux container and again there's a micro VM based shims, trusted execution environment and Quasar I think is how you pronounce this shim that deals with a new feature of container D 2.0 called sandboxing which we'll talk about in a minute. So again those are just three ways that I think have benefited the sort of container D's growth of being able to plug in and enable features that don't have to be part of the main container D code base and allows people to sort of expand for their use cases that maybe we don't even know about. So this is kind of the picture of where we are currently in the container D life cycle, 1.5 is now end of life, we created 1.6 as a long term support release that again until 2.0 is released we don't have an official end date but it will at least go out another few years. 1.7 is an active release cycle right now and then 2.0 should release in a month or two based on kind of our current set of betas and release candidates that we're in and so that's where we are as far as releases. I just mentioned this isn't new news but 1.6 is our first LTS release as it says here support at least until February 2025 and of course it's always a trick to try and maintain some integrity about how you get things into the LTS and one of the reasons that's tricky is that Kubernetes may add features in the CRI we need to implement that CRI endpoint so it sort of looks like a new feature and so we're having to try and do our best to make sure that we maintain compatibility with Kubernetes without sort of opening up 1.6 to a lot of new features so that it's a stable and mostly just back ports of fixes and obviously anything security related. So yeah so we have this idea that late this year we'll even make that back port criteria a little bit stricter so that people can rely on just a long stable release without a lot of changes to its feature set. 1.7 therefore is the end of our 1.x release cycle and what you'll see here is that we basically merged a lot of new features in 1.7 before we released it that we marked them all experimental so that people could start to try and use them and then in 2.0 all those become supported features and so I already mentioned the sandbox service and the API around that again we had this extensibility at the shim layer but with micro VMs and other ideas about how you treat the sandbox and how you configure it several of our contributors came up with the sandbox service and there's a whole API around that you can read a lot more about it on our either via the PRs or the documentation that's been merged. It was a preview in 1.7 but it'll be automatically turned on in 2.0 so in 1.7 there was a split that we actually had two implementations of the CRI one based on the sandbox and one our legacy code so that'll go away in 2.0 where it will just have the default sandbox implementation. NRI is very interesting if you've ever played around with OCI hooks and the ability to you know modify the specs so say I want to insert a device before my container starts the node resource interface is the sort of our decided implementation for doing that safely and having a way to have NRI plug-ins that you can that the administrative your cluster can enable and give the proper permissions to so NRI was experimental in 1.7 again will be fully supported in 2.0 and then transfer service so if you think about commands like save or export an image pull an image push an image in all our previous releases of container D that was a client side API and so your container client was actually doing those registry interactions in 1.7 and then of course in 2.0 this is now a service within the demon and so for some some use cases that was very important that the demon handles credentials of the demon handles the network connectivity to registries and also gives us a lot more tools for plugability of sort of source and sync so say I'm trying to copy an image from one place to another the transfer service gives you all that in a configurable way we also added username space support which was a new feature coming down so container D core had username space support but the CRI kept the enabled username spaces and Kubernetes added new API to the CRI and so those are now plumb through and implemented and supported in container D and then we had a lightweight RPC mechanism for shims and we've now added full GRPC support which was important again for certain use cases that people wanted so as I said we're in the midst of like our 2.0 release plan right now we are just about to I guess I didn't move that line over far enough because it's February now and we're just about to put out our first release candidate so we're possibly a little bit delayed from our original thinking but again 2.0 will be final sometime this spring and like I said all these sort of new capabilities that were in 1.7 will be final and supported in container D 2.0 it was our first chance to finally deprecate so we've been insistent on keeping a very stable API so that you know people aren't surprised that the latest container D release removed something so you can see that over the years we've deprecated a lot of features or at least mark them deprecated 2.0 will be the chance for us to finally remove those and provide recommendations. One of our contributors added a nice feature so you can actually turn on deprecation warnings and you can actually run container D 1.7 or even 1.6 LTS and get notified of all the features you're using they're deprecated to help you prepare for 2.0. One of the things we were going to remove was support for our oldest configuration version but then someone wrote a converter that automatically converts your configuration so we won't actually have to deprecate that in the sense that you're not going to have to rewrite your config unless you'd like to it'll do automatically for you. There's still a lot of things we'd like to do that we're still working on so I mentioned this new transfer service again the CRI is a plug in implementation within container D that uses container D's APIs to do the work so when the CRI says pull an image the CRI implementation calls into container D to do that so one of the things we're trying to migrate that to use the new transfer service so that's in development to allow plugability for shims themselves and then there's two there's two kind of API layer enhancements that we're thinking about if you think about Docker, Docker kind of gives you this higher level API again HTTP based if you ever have built a tool that uses the Docker API it's at least nice in that you can say run container and give it all the configuration information and it just does it and when people have come to container D they're like hey you don't have the Docker API what can I use that's similar to that and we really don't I have to create a task I have to create a container resource that I have to start the task and so we're thinking about really creating some of these abstractions so that when people move to container D they have a higher level image service and container service so those are things that if you have ideas if you have concepts we're open to them these aren't things that we've built yet but we're planning to as we go into the container D to the T to dot oh time frame if you're interested in contributing or getting involved there's a couple channels in the CNCF slack that we hang out in that we you know talk about new features or people ask us questions we do have a live community meeting on zoom twice a month the second and fourth Thursdays if it's bad for your time zone let us know obviously that's always a tricky thing to handle with time zones and again go to the repo open issues give us your ideas pull requests and that's all I have thank you
Orchestrating eBPF Applications in Kubernetes and Fedora
We're going to get back. So next up we're going to have Daniel and Dave talking to us about orchestrating eBPF applications in Kubernetes. Hi, everyone. I'm Dave. This is my colleague Daniel. We both work at Red Hat and we're here to talk to you about BPFman, quoted in Pheronics as the world's worst superhero. So what is eBPF? Show of hands. Let's do a show of hands. Who knows anything about eBPF? That's the best I've had in any room talking about eBPF. I should have qualified that question, shouldn't I? Very quickly then, BPF is a cool technology that allows you to do crazy stuff in the kernel. So if you weren't a kernel developer before and didn't know how to maybe do something crazy with networking, BPF works as a nice little programming toolkit for you to be able to go and do scary things or sensible things with code inside the kernel. So there are use cases in networking, security, observability. There's a whole bunch of projects out there that allow you to do all of these things. Essentially you load code into the kernel, runs inside a virtual machine if you like. The code is verified not to crash the kernel when it gets loaded in there, which is a good thing so it's a little bit safer than things like kernel modules. And yeah, I think that's about it. So you have your kernel space code that does the thing in the kernel. You typically have some code that runs in New Zealand as well that does things like reading data that you've exfiltrated from the kernel which gets stored in eBPF maps. So you can do all sorts of cool things. So why do we need a BPF manager? Is it better now? Hello? Awesome. Thank you. Okay, let's recap. So eBPF manager, what do we mean by an eBPF manager? So these days we are getting and seeing a lot of different eBPF related projects or these days say eBPF at some point within their documentation. So you may see that we got Calico, we got Silium. You can also see a lot of different monitoring projects such as Pixi, we know Kepler, you know a lot. So basically what is happening is at some point we need to, for the end projects to coexist a little bit without being broken all along. So what could happen, let's say you want to run an eBPF program, it goes as route, there it goes, and then it runs another one, it overrides what we would as a route and so on. And we wanted to have something that could make this a little bit more sense. So getting back to there, I was saying that every eBPF program needs route access, so you could either, okay you may say that okay but we got the capabilities. So is there anyone here that is aware of CAP BPF? Okay, not any, okay, we got one there. Okay, thank you. So far that's some kind of a trick because in the kernel, CAP BPF maps to CAP route, so it's a nice naming but not that much on that. Basically also you don't have any way currently of making sure that what you are running as BPF, it's like it, so you can run what it is there and you're going to trust what you have in mind. Also getting back to networking, so this is your idea of running a CNI plugin such as Cillium. Those Cillium, those kernel ABPF hooks are explosive and the same happens for monitoring for some of those. So basically you need to get into that and see a little bit of priority. We are going to be seeing that later about having two different ABPF programs with different priorities and see how can we handle those with BPF man. BPF man, getting back to there. So originally this was named BPFD, we changed it because it's no longer a demon and we wanted for people to somehow get used to that. Everybody knows Portman and we want to basically people to get to know BPF man as well as we know Portman. Then goes to BPF man, Prudette. So, BPF man is a nice source for that, surprise surprise. It started in the room I worked in the Red Lab, the version of theologies. So, we are starting to get some of the issues outside. It's absolutely awesome. So, as Daniel said, the BPF stuff is a privileged operation and what we really wanted to do was to train people on privileged stuff outside containers if possible because then we don't have to have containers that have privileges around like ages where the privileged stuff just happened very quickly. So, privileged stuff is down in the text. So, you can actually just write out your intent. I want this program to appear on all of my notes or only on notes. I have this set of labels or whatever. BPF man can then handle all of the orchestration for that for you. And very recently we also started digging into how we can take some of the stuff which is in the kernel which is really useful for hidden. So, the BPF audit logs which are in audit D and pulling those out into open to laboratory is log data. You can train them off to your log storage thing and figure out if there's something going on then maybe there's something that's happening in BPF at the same time. We have an idea of what's up there. And similarly with metrics as well, we're able to get metrics from the BPF system in the kernel and bring those up into open to laboratory as metrics as well. You can do the same graph, do whatever you like. And I can go next because so far we spoke about BPF, we spoke about the operator part of BPF man and so forth, but how about Fedora? We said we're concentrating on Fedora and there was no Fedora being done over here. So, first of all, how many, if there's any Fedora packages do we have in the room? Okay, awesome. You're going to be helping us. So, we established a new BPF group like I guess from the last month of the last year, 2020-3 because basically we wanted to promote the usage of BPF and see how could we package BPF applications within Fedora. And we were thinking about, okay, BPF man is cool. Let's go and get BPF man package within Fedora. So, we identified it and started making the parts on how do we get BPF man in Fedora. BPF man is written mainly in Rust and there's a self-contained change to add that to Fedora 40. So, feel free to take a look if you want to. I mean, those slides will be available. But the thing is that currently you can go and start getting all the dependencies that we have for BPF man in Fedora that those are really in Rust. We are mostly using Rust to our PN, which you could say, okay, I'm getting a spec file. I can just find no way. We haven't been speaking at all about how do we handle that. When you're packaging Rust programs in Fedora, you could do those in two different ways. You could go the fast path, okay, let's bundle everything there. Pretty much like the go way. I'm not going to be caring at all about how the dependencies interact with all the packages in Fedora. Or I'm going to go packaging all the dependencies and I'm going to be creating packages for all the dependencies in Fedora. This goes with a few caveats though. So, let's say one of the dependencies we have, as you can see here, we got a few dependencies, meaning in Fedora, six store, Tony, we had a few packages in Fedora that were newer than what we were using. That's super fine because we can just bump the dependencies and patch BPF man to work. And we had also the other way around. We had some packages that were too old in Fedora. That means I can't just go by itself and bump them up in Fedora because I'm not a maintainer of those packages. I need to go see who's the maintainer, speak to him so they bump that, so those doesn't collide in Fedora at all with any of the packages we might be. But let's just grab one quick example that we run into here in Fedora. So, I wanted, let's say, one of those missing ones, six store. Let's say, let's go and create a Rust six store package in Fedora. If you start doing so, you'll see that you have like, let's say, ten more dependencies, those ten more dependencies, grab you 23 more dependencies, and so. So this been, let's say, a challenging issue. You could say the dependency held a little bit, but we plan to address those for the final release, even if we go to the bundle one in the beginning. For those, we have been having a lot of help, mostly from the Rusty Group, specifically Fabio Valentini, who's the Rusty Group maintainer. He's been having a lot of help to us, but we are still missing a lot of Rust packages. So if you have time to spur, please go and help us review all the Rust packages that we are maintaining. Also, we also have another guy called Miquelon Sagasti. He's also a good friend of Spain. I mean, he does not do too much of a box-it-up job. So thank you for that guy. This is that. Of course, we are also open to anybody who would like to join the ABPFC Group and who would like to join us helping on this effort, as much as maintaining way more EBPF programs that could go in Fedora, such as Scribbr, Kepler, who's going to go next probably. So basically, thank you for that. Let's give them more time. So let's see if we can go for that. You want to come and do that? Okay. Okay. So here is the DTF program. I hope you like it. It's totally clear, isn't it? Yeah, I think we'll have to do that. Any comments? So this is a very, very simple program. Effectively what's doing is counting packets that were received from a place. Now, in reality, you'd want to be deploying them before it's like a load bar. But just to say to them, oh no, this is a little bit more important. So we can then match them to the container models with us. So we have a base stage in our container file that exports the packages that we need to go to. The DTF bytecode. And then we have the user space part of the equation here, which packages up our user space application. So remembering the fact that when I said that we're like user space is the DTF, the kernel is the base space of the DTF, so user space is it. Yeah, I can't make it any bigger. I'll go with you. Sorry. We will share the link in the chat. We'll charge the link afterwards, I guess. Yeah, yeah. There is a user space, and there is a kernel space. The kernel space is basically a PID bytecode. It is just one file inside of that. And then some labels that we'll do. We have to be using them on whether this should be like half or more. An extension to the ASI is back on this example. I wouldn't read any of that in the DTF stuff, but that's something we'll be figuring out as we go along. Anything we can use our container engine of choice to go and build our things. This is amazing, but fast, because we were working on that at a time. What? We use a pop-up interface. We use whatever we can use. We'll then go and question these resulting images of how to make a container registration, because the other man's going to need to go and crash the machine. So you can deploy your machine here, or you can deploy off the machine anywhere as long as you can access the internet to be called. Programming container registry, that's a problem for us. So, from personal experience, trying to develop a BPF program has been pretty handy when we develop a BPF in the point of view. So we can ask the DTFNR here, what's up? Are there any programs? We'd be happy to say that at the moment. We'll let you know if you'd mind to say, well, here's the one that we made earlier. Please, you can go below this one in the department to get some things for me. We've given it a priority of 10. That's a meaningless variable. Priority, we've given it a 1, and I think it's been 1 on TCC5, so it's a good one to give you. That's what we used at the later part for sorting which programs should be running ahead of one another, because we have some magical XTP stuff in the background here that allows you to sort of order your network programs, which we've borrowed from a project called LiveX2P, which makes writing extra easy to use. So BPF man's got the latest on program for us, and now we're going to verify the list there, by listing it to this, and then we're going to run our user specs bit. We're not using any of the data just because the CD is just running here. It's going to have a look into single panels telling it how many packets we've received, how many bytes we've received, so we can see that the program is working. And then we can go ahead and load another program. At this time, it will be priority one, so we would go ahead and load that, but that's a little bit tricky, so we did it to sort of eliminate it. But in the end, yes, it's a loaded, and then you move to see the program here, which we want to run in first and the first, and then add a program. So there we go. So just to talk about how we can take a BPF program in here, how we can take it, deploy the CDF man on the client, like we could also do, so we wanted to make sure we had time to demo that as well. Yeah. All right, that's it. Yeah, so basically, I guess you like that thing here. I just wanted to basically introduce you to this new BBA program. Please, please, please help if you want to take a look at the rest practices. And I guess we didn't have one slide, but we can just go for the Q&A. Any questions, guys? Can you see the CDF man? Sorry? Can you see the CDF man? Because they have a lot of programs, examples of how you can get it from there. Can you see that? Yeah, I understand. I just don't have a microphone. Yeah, we'll go. Let's take a look back. Yeah. Yeah, take it. Right, so we, BPF man itself is written in Rust and is backed by a really awesome pure Rust library, which I also happen to maintain called IA, which is amazing. Hopefully there's some IA contributors here. Yes, thank you. All right, one, that was good. So our entire stack is Rust. However, the BPF program that you write, you can do in C with libbpf, you can do in Cillium with go. This example program that we've run is using Cillium, BPF and go, and BPF to go, and all of that tooling. Okay, that is a really, really good question. So I will preface this with I am not a lawyer, obviously. However, basically 99.9 ish percent of BPF code will be GPL. The reason for that is all of the useful BPF helpers that you will call are GPL only. Therefore, has to be GPL. But that is just the BPF code that gets run in the kernel. There is a really good document on kernel.org that explains BPF licensing, which was written by lawyers, and effectively you can load your code into the kernel, and anything that goes and talks to that doesn't effectively get touched by GPL, so you're basically fine. Can you please just have us load your program to then enter it into the copy? Yep. So the bit in the kernel will be GPL. You use the space bit, can be whatever you want. Obviously, if you're copying example code from Cillium or whoever else, double check the license. A lot of the Cillium examples, I think, are GPL plus BSD2, something permissive that allows sort of reuse and copying. So... We're just packaging, effectively. That's all we are. So, no, no, no, write your program however you want, and we'll just help with the packaging and deployment, whether it's on multiple Linux nodes, whether it's Kubernetes, whether it's anything else. So, the owners are still on you to write the code, and then we'll help you package and deploy it. So basically that depends. Just think of some kind of a proper tooling to go and run your code in an easy way. But whatever you run there, it depends on you. So if you want to go load, no. A non-GPL v2 or non-GPL whatever license you want to go, that depends on you. So we ourselves are GPL, but we call it a Pachi, and we call it a license, but that's a Pachi. But the thing is that besides that, it's like if you are just running whatever application on another program, that's a different thing. Yeah, so I'm saying that the tooling itself, ours, it's a Pachi. But whatever program you want to run into it, it's up to you. So in terms of licenses and so forth, that would depend on whatever you want to do. So if you want to run some sealant samples, that would depend on the sealant license. But if you just want to run some proprietary thing, that's up to you. Thanks very much. I'm sorry. Over there. It's all the way there. One way to go. I think we call it whatever you want to do with the other parts. We can hear it all, sorry. Sorry, maybe, maybe, yes, come over here, no worries. Yeah, come over here. Yep, thank you.
Lift and shift: Modernising a legacy LAMP application with systemd-nspawn
So, next up is going to be Martin, who is going to be talking to us about lift and shift modernizing a legacy lamp application with system B and spawn. Hi, everybody. Welcome. So the last time I spoke at this conference a few years ago, it was in the microkernel dev room. It was a very small room. So the bigger the kernel, the bigger the room, I guess. So I'm going to start with a little bit of backstory. One evening about a year ago, I got a phone call from a friend, a principal at a school, saying, Martin, I need help with something. Our sole IT person that's worked here for 20 years has decided that they're just going to go off to the mountains and leave, and they're off in about a month. And I have no idea what state-house systems are around. I know nothing about that. I need someone I can trust who can step in and help. So I originally came in there as a consultant to look at what systems they had and figure out what the next steps were. I'm still there. It's still temporary. And I'm going to tell you a little bit about what I did over the last year there concentrating on the containers. So they weren't kidding when they said it was in a bad state. The critical application that the school ran on was running on one single server, along with a whole bunch of other stuff, pretty much everything else. And you can see here that that server basically dates back to 2009. Someone at some point tried to upgrade it from Debian Edge to Debian Leni. They failed, or they gave up, partly because from Edge to Leni, you had the transition from PHP 4 to PHP 5. I did a quick naive slot count of what's in Vah-dub-dub-dub HTML. There's 200 something thousand lines of PHP. It turns out that this person did not use source control. So there's a hell of a lot of duplication in there. And it's also very much a typical crud app, as you would design it 20 years ago. So it's all just very basic PHP with hidden HTML mix, the worst possible thing you could have. But at the same time, it's very simple as an application, which turns out helped us later. So my naive plan, how do I salvage this, try and extract as much business and technical knowledge from the author before they leave and never come back? And then virtualize all the things, secure all the obvious attack surfaces. I mean, this was still running TLS version 1. It had Apache 1.3 exposed to the internet, worst possible cases. So then split off the business critical system from all the other things that were running on that server. Do that in a way that's as future-proof and maintainable as I can. All while keeping it running and not getting killed by 550 students and 100 odd employees during the school year. The first two steps were pretty obvious. They had some new hardware lying around. I spun up a hypervisor. I had a bunch of VMs. So put the physical server into VMs, started splitting chunks off it. That turned out to be hard. So I eventually decided that I needed a way of reproducing this 15-year-old environment. Reproducing it in a way that I could then develop with, maintain with modern tools, source control and so on. So the nice thing here is I found that the Debian community have developed something called Debian EOL, which are basically Docker images of end-of-life Debian releases, all of them going way, way, way, way back. You can use these images to run both Docker containers or to do whatever else you want with them. The nice thing about them also is that they're actually integrated into the modern infrastructure so that pointing at archive.debian.org, you can, as you'll see, install additional software and so on. I could have probably done this with Docker, but it doesn't really fit the bill because this application, I mean, it's never going to be a 12-factor app with a bunch of microservices. I needed something that's more like previous Dejails or Flourish Zones. And I've previously used SystemDnSpawn. I use it, in fact, today to run a bunch of my own infrastructure, which was originally a bunch of Zen PVVMs and is now happily running for many years as SystemDnSpawn containers. So you want something that can do full system containers that's available, lightweight, and flexible. So how do we get Debian Lenny from 2009 running, using these Debian EOL images with SystemDnSpawn? We need a couple of tools, something called Scopo and OCI Image tool, to get the images off the Docker registry, flatten the OCI image, you basically end up with a root file system. You then, what I do is I use, the reason I'm emphasizing RefLink here, I didn't know about that, it's basically copy on write. So you can use this to create a lightweight copy of an entire directory tree, which only takes up more space if you actually change things in it. So, you try and run this, previally with SystemDnSpawn, and you find bam, it's safe false. Thankfully, we actually get a helpful message from the kernel saying, ooh, you tried to do VSys calls, but no, we don't do that anymore. We can fix that, that's fairly easy, and we can see that, oh look, we have Debian Lenny running in a SystemDnSpawn container. Okay, that's great, and if that was all I was going to tell you today, then that probably wouldn't be very interesting. But if all we want is Ben SH and that to get, that's fine, but I want this full system where I basically want to run full SBIT in it, inside the container to manage all the original LAMP stack services to run the application. I want to integrate the container's networking with the host system's SystemDnetworkD, and get a dev log in it, get, use username spacing, and start and stop the container as part of the normal host system boot process. So I made a script for this, I extracted this out of my build scripts so that you don't have to. There's a link to it also in the resources for this talk. Please take a look. So this script basically gives you a Debian Lenny root file system that has all the things applied to it to let you do the first, the steps that are described here. I spent quite a bit of time working that out, so I hope people will find that useful. You can then do, with that root file system, you get out of that script, you can boot the resulting root of this, like this. The important parts there are private users, private users equals pick, that turns on username spacing, so your container root gets, automatically gets a special user ID in a range mapped to it, which system dns-born will pick when that particular root file system is started. And you get a VF network talking to the host. Kill signal equals SIGINT, we want that so that when the host system, if you run this container as unit file tries to stop it, then the SIGINT gets sent to the sysvian as inside the container, and it will actually interpret that as a system shutdown and shutdown cleanly. So if you run that, you can log it on the console and you'll see that yes, we can shut down the container with control C. So there's a bunch of gotchas, networking, system d network d, you want this, since it integrates very well or bar some problems. Obviously your host needs IP forwarding enabled. As I found out today, or remembered today while making these slides at the hotel earlier today, if you're doing anything at all in your forward chain, since I was trying this top, then you need to make sure that forwarding is actually being accepted from and to container interfaces. Another really interesting one. So I'm still a DHCP client inside the container so that the container integrates with system d network d and gets a network address assigned to it when it spins up. Turns out that old DHCP clients are actually picky about getting proper checksums back in their responses. So if you don't add that particular mangle rule, then what will happen is your networking will appear to work and then mysteriously stop when the DHCP lease expires and the client tries to renew it and gets upset and you just see it renewing and renewing and nothing happens. So, system d journal d has a nice name spacing mechanism. It basically lets you spin up separate instances of system d journal d which have their own name space so you don't really want the container logs or different logs of the different instances mixing with the host logs. It works, but I had to actually read the source code of the system d main loop to figure out why it would just, after you start it, just mysteriously say, oh, no clients, I'm going away now. So the way to fix that, not described anywhere, is you add a drop and set your retention time to something high and then it will just wait around until something connects to devlog. Devlog you can then bind mount into the container. That's fairly straightforward. Starting up, start up and shut down integration. System d n spawn comes with a default unit file and you can then customize that. There are some useful things you can do there like you can add a dependency on your journal d namespace service so that everything nicely starts up and shuts down and there's an example of what you can start with exact start that if you want to use this particular arrangement. So I actually did this, or the bulk of it during the school holidays last summer. Application has been running fine since then. I was quite surprised. I could talk a lot more about PHP and MySQL 5 but that's mostly just be ranting. One thing that I didn't mention is the application is actually running all in CP1250 and not only that but originally the databases were all running still with MyISAM. So I ended up basically exporting the lot into SQL text files. Then I discovered that MySQL and PHP at this time didn't really understand character sets so the database thought that everything was Latin 1 when it in fact wasn't. Well, the way to fix that is again you export it to a text file making sure that the database or nothing tries to convert any of the data. Then you do a set on the text file and say just recreate, replace MyISAM everywhere with the InnoDB, replace Latin 1 with CP1250 and it actually worked. Still there. No data got corrupted. And it's 64 bit now so it won't fall over in 2038. So yeah and I'll end this with a quote for the conversation I had in the autumn with my long time friend Martin Sustrick who was asking, so you spent the last few years before that working on OS research with Unicernals and Docker and the University of Cambridge and so on. So what was more complicated? All this OS research that you were doing or the work you've been doing at the school over the last six months? And I said well definitely the work at the school over the last six months. And I still have 10 minutes. So in fact I guess questions. It was quicker than I thought. Yes sir. This man here? Sorry? The hyphen N option? Oh, ah yes. Okay so the reason you can't do that, in fact this is important and I sort of glossed over it here. That will only work. The journal D integration will only work if the distribution that's running inside the container is new enough. The Debbie and Lenny from 2009 does not have journal D, does not have system D, this predates it. So this is all running good old Cisvi S bin in it. So none of the integration that you'd expect, the fancy stuff that you get today with system D and spiral with machine Ctl if you use the full interface. If you run a system D distribution inside the container then your logging will just transparently get integrated with the host journal. Likewise you'll get things like machine Ctl login which will get you a TTY, a console that you can use to log into the container. We don't have that here because there is no system D, all of this relies on there being system D inside the container as well as on the host. It is exposed to the internet but not directly. So it's the first thing I did way back before I started on all of this. Right, number two here, secure the most obvious attack surfaces. I stuck a modern reverse proxy in front of it.
vscode-container-wasm: An Extension of VSCode on Browser for Running Containers Within Your Browser
So, our next talk is going to be about... Hello, I'm Kohei Tokunaga from NTT Corporation. I'm a reviewer of container D and a maintainer of Build Kit. And today, I'm going to talk about an extension of VS Code on Browser for running containers within the browser. So, this is the summary of this talk. So, on Browser VS Code lacks Linux terminal running completely inside browser. And VS Code container wasn't. Extention enables to run Linux-based containers and its terminal inside browser. And there are two options available for distributing containers to browsers. First one is pre-converting containers to wasn't images and distributing them. And second option is distributing OCI container images to browsers. So, there are several on Browser VS Code implementations in community. There is a limitation for that functionality. This is lack of Linux terminal running completely inside browser. So, users can edit code inside browser but cannot run them inside browser. And Linux-based development tools like CompilerS are also unavailable on browser. And one of root causes for this issue is that browsers don't provide Linux compatible system. So, Linux-based applications needs to be ported to browser. If the application is written in language other than JavaScript, WebAssembly or wasn't will be... will also be used for running them on browser. But actually, porting apps to WebAssembly is not easy. So, wasn't lacks compatibility to Linux system. For example, the binary format is completely different from the existing common binary format like x86-ELEF. And the app might need to be redesigned for Harvard architecture of wasn't. So, this might include like eliminating fork and exact related cause from the application. And some of the issues can be mitigated by CompilerS wasn't target support. But they still don't provide full compatibility to Linux. So, can we run a modified Linux terminal and Dev environment inside browser? So, here VS Code container wasn't. Extension can be used. This is an experimental VS Code extension for running containers inside browser. So, the container and the terminal is available on VS Code on browser without preparing remote SSH servers or something. And this is implemented, levelizing CPU emulators compiled to wasn't. We will discuss about it later. And the workspace of the editor is also mounted at slash workspace path. So, container can refer to the contents on the workspace. For example, it can compile the code stored on the workspace. And HTTP or HTTPS networking is also available. The container runs inside browser. So, the networking functionality is also restricted by browser. For example, the set of accessible sites from the container is limited by calls. So, how container images can be distributed to browsers? There are two options. Option A is pre-converting containers to wasm images. And option B is distributing OCI container images to browsers. So, first option for distributing containers to browsers is pre-converting containers to wasm images. And container to wasm converter provides this ability. The container to wasm is an experimental converter of container images to wasm images. It receives an arbitrarily Linux-based container as the input, and it outputs a wasm image that runs the container on wasm. So, we can run the containers on wasm-enabled environment like browsers. As shown in the right figure, the converted wasm image can be uploaded to any HTTP server accessible from the browser. To use them on VS Code on browser, you can configure the workspace using .vscode slash settings.json file. And the image location URL to that configuration file. And so, you need to add the image configuration URL to that configuration file so that the extension can launch the specified container on browser. And the pros of this approach is that once the container image is converted to wasm, it can run on any wasm-enabled environment, not limited to browsers. For example, the container can run on like washy run times, like wasm time as well. And cons of this approach is pre-conversion is needed for each container. If you want to run many kinds of containers on browser, all of them need to be pre-converted to wasm, so it may take extra cost for development time. And second option for distributing containers to browsers is to directly distributing OCI-compatible container images to browsers. If you use container registry, that registry needs to allow code access from the browser because it's accessed from the browser. But unfortunately, as of now, well-known public registries don't allow codes, but so you need to try it on local house registry with code header configured. Alternatively, you can also use codes-enabled HTTP or HTTPS server. In this case, the container image needs to be formatted as OCI image layout. This is the specification of layout of image content to be stored on the file system. For example, you can get a tar archive of this format using toka-save command newer than v25. And vscode container wasm supports fetching the image formatted with this spec over HTTP. In neither case, the image location needs to be written to the workspaces.vscode.settings.json file so that the extension can launch the specified container on browser. The pros of this approach is that this doesn't require a pre-conversion of the image, and a modified container image can be distributed to browsers. And cons of this approach is that obviously existing public container registries don't allow codes as of now. So if you don't use OCI layout approach, you need to prepare codes-enabled container registry or users need to use like a proxy or something to access to the registries. And this is an example of running container on github.dev. This is github.dev is an on-browser vscode that allows us editing codes of github-reports on-browser. This slide shows an example of running gcc installed devian container inside browser, and workspace is mounted at slash workspace, and HTTP or HTTPS networking is also available. And so this is a demo for this extension, and we use github.dev here. And that. Okay, so here, this is the extension of container wasm, and this is available on Marketplace. And this is the settings.json file in this repo, and this config file points to the URL of the devian container converted to wasm using container to wasm converter, and this is served on github pages, and we use that image on this workspace. And this is the terminal of the devian container running inside of the browser. And this is a secret we are going to use in this demo. And currently, yeah, by executing a command of this extension, this extension quickly loads the image, and the container image stored on github pages to this browser, and it just booted the Linux kernel and container inside browser with cpu emulation. And we currently see the devian shell in the browser. And by executing your name a command, you can see this is the x8664 and Linux environment inside browser. And this workspace of this, this workspace of this repo is mounted at slash workspace slash, so you can see the files of this repo inside browser, mounted on workspace directly. And in this container, we have gcc compiler, and we have a hello world pretty simple clanguage source code, so we can compile that c code inside browser using gcc compiler. Then we can run the compiled binary on browser. So the entire compiling and running steps are done inside browser in this demo. So how this extension works, the container depends on Linux to learn, so this project runs both of container and Linux inside wasm VM on browser. And to enable run existing architectures binaries inside wasm VM, we use cpu emulators compiled to wasm. We use box emulator for x8664 containers and tiny emu for risk 5 containers. So this extension launches all of the emulator Linux kernel and the container inside wasm VM on browser. And we also use microsoft slash vs code dash wasm for wasm and wasm host on browser. So this is a wasm host integrated to vs code, so this allows wasm VM to access to the terminal on vs code and the workspace directly over wasm compatible APIs like fd APIs. And how mounting workspaces to containers works. So as mentioned in the previous slide, we use vs code dash wasm for the wasm host and it provides the access to the workspace directly to the wasm VM over wasm compatible APIs. And emulator running inside wasm VM recognizes workspace directly via wasm APIs, then it shares that directly into the guest Linux via vortio9p. And that workspace is mounted to the containers slash workspace slash directly so the container can access to the workspace on that file system path. And container can perform HTTP or HTTPS networking with restrictions by browser. So this is implemented by running the entire networking stack runs inside of the browser. So additional proxy outside of the browser is not needed. And this networking stack supports forwarding HTTP and HTTPS connection to the outside of the browser using fetch API of the browser. And HTTPS connection is terminated at the networking stack on browser with its own certificate and the connection is re-encrypted by fetch API. So the container can access to the outside of the browser via HTTP, HTTPS proxy running inside of the browser. And there are actually some important restrictions by fetch API including accessible sites are limited by browser so code restriction is applied. And some headers are actually uncontrollable from the container because they are entirely controlled by browser. And vscode container wasm allows fetching container image directly from remote location without pre-conversion to wasm. So this is implemented by fetching and unpacking the container image in browser. The unpacked root file system of the container is mounted to the guest Linux via VARTA ION IP. And not limited to on-browser IDEs, we believe there are some expected use cases or possible use cases of running containers or wasm or browser. So first one is interactive on-browser Linux based demo. And second one is on-browser development and testing like this extension. And also sandbox execution environment of containers and application debugger runable on-browser were recorded and replayed debugging. There are some existing approaches for running unmodified applications on wasm. And I listed some of them here. First one is V86. This is a x86 compatible on-browser CPU emulator by Fabia Hammer. And it supports wide variety of guest OSs like including Windows. But it doesn't support for x86 64 now. And tiny emulator is a risk 5 and x86 emulator by Fabia Spillard. It can run on-browser and container to wasm converter actually uses this for risk 5 emulation. But it doesn't support for x86 64. And this project is still in a very early stage. So we expect further improvement. First one is performance analysis and improvement. We heavily rely on CPU emulation. So I think we need to analyze the overhead and I think we need some improvement for it. And possible integration will be with ELF Conf or ELF Conf. This is an AOT compiler of Linux. And this is a 64 ELF to wasm by Masashi Yoshimura, my colleague from NTT Corporation. So at LLVM, tomorrow my colleague Masashi also have this AOT compiler. So please check it out. And the integration of container ecosystem with browsers is also needed. As I mentioned, container has call to the solution. So currently accessing OS package repos from browser is not possible. And also in terms of container registries, as long as I know a public registries, container registries doesn't allow calls access. So on this field, your help is really needed if you know some technologies or repos or registries that allows calls access, please let us know. And graphic support is also on our milestone. So this is the summary of this talk. On-browser VS code lacks Linux terminal running completely inside browser. And VS code container wasm, experimental extension is enables to run Linux-based containers and its terminal inside browser. And there are two options for distributing containers to browsers. First one is pre-converting containers to wasm images. And second one is distributing OCI container images to browsers. And that's all of my talk. Thank you very much. Do you have any questions? Yes. Yes, please. Can you run Firefox inside the container? Okay, so the question was Firefox inside the container. So Firefox inside the container, inside Firefox. All right. Yeah, I haven't tested yet. But yeah, I believe it's possible. But I don't find any practical use case for this, but I think it's possible. Yes, of course. Yes. Sorry. QM. Thank you for the question. The question was about using QMU alternatively for like a box and tiny Mule. Yeah, I think this is very good question. And actually we have a, we have on container to wasm repo, we have an experimental branch that integrated QMU TCI to this extension. And yeah, in terms of like a TCG, yeah, we haven't integrated yet. So TCI is completely, yeah, so TCG we need to wait for running the generated code. So we, it is not obvious on wasm environment. But yeah, we are seeking for the way to integrate QMU into container to wasm. So this is, yeah, definitely on our milestone. Yeah. Thank you very much. Thank you very much. Thank you.
Zero-touch Infrastructure for Container Applications
Thanks. So today I'd like to talk a bit about zero as you touch less infrastructure for your container and Kubernetes workloads, most specifically about container optimized Linux. And hi, I'm Tilo. You can find my GitHub or mustad on. I work for Microsoft. If you want to reach out to me, just drop me an email. All right. So container optimized Linux is a very special way to look at Linux distros and a very special way to look at your whole application stack. I'm going to introduce a few fundamental foundational concepts of that and then I'll touch a topic that is a little bit neurologic with operators with large fleets and that is how to keep the operating system up to date in a safe manner. Also, we talk about compositability and that is like extending your operating system in a way that keeps your extensions from the operating system reasonably separated. And lastly, if you want to reach out to the project, I'll show you a few ways to interact with the community. But first of all, what is container optimized Linux? Do we have some special tweaks in the kernel or do we ship some user space bits that make containers run exceptionally well? Well, yeah, that's one way of looking at it, but it is actually taking a step back and looking at your Linux distro, not as a general purpose distro that does all the things, but at a very special purpose operating system that operates one thing, that is container workloads. You would imagine the operating system as not being anything special, just another piece in your application stack and make the operating system operatable like you operate your container workloads or your Kubernetes ports. Have an image based operating system in containers and Kubernetes, you basically create your application as an instance of a pre-built application image. And if you do all that, you can actually leverage the isolation that you get in container apps from the operating system side and there's some neat things you can do with that. First of all, let's talk a bit about the user experience side of things. So general purpose operating systems do a very great job to give you like all of the diversity, all of the choice, all of the knobs and levers that you can have to tweak and twist your operating system and make it fit for your purpose. The thing is for specialized workloads, you sometimes have to, even if you don't want to. So that's not how we perceive container optimized Linux instead, we're trying to focus on doing one thing and one thing very good. So we measure ourselves not by the features we ship, not by like the options that you have, but we measure ourselves by basically our function or one thing that we provide to users. A light switch can come like in 20 different colors, but you know if the green switch doesn't manage to turn on the light in 20% of the time, then there's kind of a disconnect between the designers of the light switch and the users. And we don't want to be like that. Talking about provisioning, if you provision a container app or a Kubernetes pod, you basically take a declarative configuration that specifies only the business logic bits that you need to adapt the image to. And then you apply that to a prebuilt application image and as a result, you will get an instance of that image with your config live in your cluster and that's your app. If you take that idea and if you apply it on how you would provision your operating system, then you would have a declarative way to configure the bits and pieces that you absolutely have to that are not that are not saying defaults. So your business logic of your notes, you would apply that to an operating system image and now would create an instance of that note which has the properties of the configuration you gave in. You could of course add some bootstrapping code in order to bring up a control plane or if you have a single purpose app to bring up that app, but this is kind of where we draw the line from the operating system point of view. So we don't want to ship another application control plane because there are great control planes out there. Alright, so if you take all those principles and do a provisioning which I'll do with the QEMO instance here locally, first of all, this is the config I was talking about and you won't see any like OS specific twists in here. This is purely my business logic. So I'll define a user here. I inline like a little bit of static HTML here. I have specified a small image here that will be worked into the configuration and then I define a system D unit that basically brings up KD container with the config that I provided and that is my application. So as is tradition, we compile YAML into JSON and this will basically take the image that I specified there and work into the configuration. So now if I boot, this is a web script. I just start the M and I pass this web jay and I generate it. Now what will happen is this is the first time this virtual machine is provisioned. So it applies the configuration that we added because this is how we set up nodes. It takes a few seconds and then after this configuration has been applied, the system D unit will start up that we specified the web server for. It will basically pull, I hope, do we have Wi-Fi? I hope. So I have Wi-Fi then. Obviously, there's no way for this virtual machine to pull KD. So what you can do though, my phone has Wi-Fi. My phone has Wi-Fi. My laptop does not have Wi-Fi. Come again. You'll fix better. Or I can use. So I'll shut this down. Because I'm not sure how the virtual machine is dealing with changing Wi-Fi. But I can try a hotspot. This is the great thing with live demos. Okay. There's Wi-Fi that hopefully works. And just reassure not again. Here we go. Again, for the virtual machine, this is the first boot because I interrupted it in its original first boot. And so it'll set up the, all of the base system using same default. And it will now that it has connection pull KD and then the web service will start. Fortunately, KD is a pretty small image. So we should, this is to see with a console, we should see output from Docker any second here. That will, yeah, there is. So you see that there wasn't anything on port 8080 before, but if I reload now, this is our static. This will get very interesting with the update demo later because the update is significantly larger. The basic idea is to give you an operating system that just behaves as your container application. And then you just, you know, make your config, you throw it on your cluster. It comes up the way you find it or it breaks while coming up. So you'll see there's something wrong. There's like one way or the other. And this way by configuring it once at provisioning time, just like you do your container apps, you don't have any config drift. Like if there's a note that you've been provisioned months ago, and if there's a new note that you provisioned based on the same config, those nodes will have the same configurations. And you don't need to fill around the way I did here, supported on many cloud providers like we have specialized images for those cloud providers. Terraform integration, go by means. If you have existing automation, you want to do that programmatically. And we have cluster API integration, more on that in a second. There's this weird thing. That's the difference between us and general purpose operating systems. And that is once, configure once before the operating system is even deployed. No SSHing into nodes and fiddling around with configs. And kind of the mental image that we have to comparison is, if you apply a service YAML to your cluster, you do not cube cut a lex egg into that port, then fiddle around with config files and only then your app works. That's a weird concept. And we are trying to make that happen for operating systems. All right. So cluster API is a method to deploy Kubernetes clusters based on the management Kubernetes cluster flat car. The project I'm working on is integrated in copy upstream. We support a number of providers and they're currently pioneering kind of a new way to provision images or as images of cluster API. Previously, those were built into the images and Kubernetes were built in the single image. That was the golden cluster API image which you provisioned. And then you had a problem because you couldn't update the operating system or Kubernetes independently because they were merged. So we're working on a provisioning time composition, more on that later, based on system these are the eggs and we have a number of proof of concept providers implemented. What's the benefits of having an image based operating system? Well, you provision from immutable images, you're creating an instance just like you create an instance of your application with a container image. Those images are always built from scratch and they are very easily testable this way. And just like you have no config drift with provisioning time configuration, you have no version drift with image based provisioning. That means that if you see a node and it runs operating system version A, then you have the full version set of all of the binaries shipped with the operating system just relating to that number. There's no individual tools updating in their own pace and that means you don't really need to chase those versions, right? In our case, everything is on a separate partition in the slash USR directory and that was the invariant you protected. So if you fiddle with the bits in that partition, then it will break because the VN verity checks will be invalid. It offers us the opportunity to do a partition update. So if you're a database operator on a Kubernetes cluster and you want to retain local state, then that's good news for you. Talking of which, I was mentioning the isolation of container applications to the operating system. That makes your app portable. It makes your app run on many, many operating systems just basically making sure that the runtime exists. You don't have any shared dependencies that we have an operating system. And if you look at this from the OS point of view, then the only thing you're giving to your applications is a well-defined contract, which is the container runtime on some kernel bits. And there are no other relations to have to those applications. And if you focus on that only, there's ways to thoroughly test that contract and make sure you never break. You always uphold this contract. If you look at the screenshot here, then you'll see our test suite running. Every single one of these green buttons is a scenario test that we run. We run those in our CIs for PRs to the operating system. And this makes sure that this contract is always uphold. Now, if you do that, it doesn't really matter for your applications, like which major version of the operating system you're running. Up to the point where it doesn't really matter which Docker version or container D version or kernel version is there, as long as you absolutely make sure that you uphold this contract to your application. And you might have guessed by now that this layer, this contract, that's our light switch. This gives us atomic in-place updates. So what we can do with this is we can stage a new operating system in a separate partition, and then we can switch over by means of a single reboot, atomically, like there's no intermediate state. The application comes up, sanity checks run, and everything's great, except when the application meets some edge cases where it has problems with the new release and in that case, you just roll back, and you have a known good state that's still there, and you know that it works. OK, so what I'm going to show now, I hope it works over my mobile connection, is something that you usually will not see, because usually that's automated. So with my configuration already deployed earlier, I disabled the update functionality, and that's because I don't want my great web server being interrupted by updates. But now that we demo updates, I can enable it, and I can check. So the program I'm calling here is the update client-client, and just interfaces with the update client, and it says that I am currently thrittling thumbs, and I have never checked for updates. So we just kindly ask it to please check for updates. Now, there's an update available. The reason for that is I'm not using the current version, I'm using a previous alpha version, and of course it'll find an update because we've published a new version in the team. Now, it'll start downloading, and you see the progress that is basically percentages, we're at 5% now, I'm not sure if it's going to make it to the end of the stock, but it is what it is. While it's trying its best, let's talk about updating and rebooting. For many people, reboots are scary, not really for us, and we're trying to make them not scary for everybody else. But we understand that that's an interruption of your service, so there's a variety of ways that you can configure reboots. The cheapest thing is to just have maintenance windows, and you configure that, so your instances are only allowed to reboot in specific time windows, but there are more advanced options available. For instance, if you have a small number of nodes without a control plane, you can make those nodes synchronize the update using a CD-LOX, so only a certain number of nodes updates at any point in time. Also, if you operate Kubernetes, there are a number of Kubernetes operators that you can use, and this is really great on advanced stuff, so those operators will detect when an update has been staged, it will drain, carefully drain the node, only after the node is drained, it will reboot the node, and it will uncoordinate the node after reboot, so it's repopulated, your operating system is updated, your cluster didn't notice a thing, and your applications are happy. You can also, of course, stagger updates and make sure that only a certain amount of nodes are out at any given point in time. This happens via a stateful update server, and our implementation of that is open source as well as part of the project, you can just download and use it, it's actually pretty easy to operate. We're using the Omaha protocol, which is the same protocol that Chromebooks use to update, so it's kind of pretty solid protocol. As I said, the update server is easy to self-host, so if you have a larger fleet, and you want to have your own update procedures and your own update rollout process, that's a straightforward way to get it. If you don't, however, but you want to be part of the stabilization process that we do for our releases, that's a very simple way to hope yourself in there. Let me carefully check if there is a chance. Oh, 63%, okay. And then it stopped. Anyways, maybe it will do it. So, Flatcar releases in basically just like an application. We split off the branch of a major release from a main branch, and that will go through Alpha, Beta, and Stable stages. Alpha is fully tested, but it may contain half-done features, which are for development, so Alpha is for devs, basically. Beta is also fully tested, and Beta is what we consider stable. Stable has seen some user workloads, and this is where you come in, right? So, if you operate larger clusters, it's absolutely safe to have a few Beta canneries in those clusters. If those canneries detect any issues with your workloads, you just get back to us and yell at us. You file a few issues against those Beta's, and we will never promote the Beta to Stable that has known issues. So, this way, you're keeping the Stable nodes in your cluster safe. And now you're supposed to demo Composability, and I think I skipped the Update demo, right? No! I have six more minutes. I'm not trying to get both demos in, but I think the Update demo is more amazing. So, I just talk about Composability. We're an image-based distribution. We are immutable. We don't have any package management. It's actually really hard to add something on the operating system level, and it's meant to be that way. However, there may be things that you want to add. Maybe Kubernetes, maybe Wazm, maybe an alternative container runtime like Podman. And since we don't want to provide any options because that would make the lights switch weird, there's this thing called SystemDISOSX, relatively new technology. It's been around for three years now, and this thing called disease based on file system images that are immutable, which kind of resonates strongly with us. So, the way this works is you take your Sussex image, which only contains certain binaries when you ship for one application, and then it's being merged at runtime or at boot time with your operating system tree, and it results in basically a merge tree. So, if you look at your operating system, then you'll see all of the binaries in there. Building is easy as well. You just have a local directory that resembles a root file system, and then you run makeswashfs or makeoz on that local path, and you end up with a file system image that you can use with Sussex. Yes, the update's there, and I'm going to show you the update. It worked. So, it now tells you that we need to reboot, right? Before we do that, let's check out what we're running. We're running an alpha version, 3815, clearly I'm a developer, because I run alpha, and we're running the 6.1 kernel, and our application is happily running on this. Let's reload again, and now I reboot. Ah, Zudo will make me a sandwich. And this will reboot into the new partition, where the new update is staged, and there's kind of a, like, a settle time that the update tool will wait, and after that time concluded, it'll mark the new partition as the default partition. You always boot into the new partition. If you have, like, important applications that absolutely need to run, you would make the update service depend on those applications, and if they don't start, the update service will never mark the new partition as the default partition. And then you only need to reboot in order to fall back in the old, known good state. All right. So, what did this buy us? We've upgraded from 3815 to 3850, and we upgraded from kernel 6.1 to kernel 6.6, and our container is happy. Didn't notice anything, which is the whole point. All right. Image composability. So, you take your SUSEX images, and you can either pre-bake them into existing operating system images out there, and then you have, like, a combined image, and you can provision that, or you can just use the declarative configuration that we provide to download those SUSEX provisioning time, and then they're there. You can use system, the SUSEX update, to update those SUSEX independently of the operating system, because if you have a Kubernetes SUSEX, there are no shared binaries of the operating system, and you might want to update that in your own pace. And with Kubernetes particularly and Cluster API, we have a need-proof concept running. So, for the Cluster API phones, they have pre-baked images with operating system and Kubernetes in it. It is very, very hard for them to update any OSPET. So, what they do is they delete nodes, and they basically provision new nodes that has a new stack in it. That's nice. It works for most workloads, but for some it doesn't, and there SUSEX comes in. So, SUSEX allows you, and we demo that for the OpenStack provider, to just update your Kubernetes in place, and, yeah, be happy about it. We're trying to, you know, invest more in that work and get more Cluster API providers interested. And this is the point where I need to skip the SUSEX demo for you. It's basically just showing you a wasn't-time binary appearing and disappearing after I merged the SUSEX and unmatched the SUSEX, but I have one and a half minutes left. So, I'm going to skip and continue with how you can work with us. If you're interested, there are a few flyers out there with a few QR codes on them. Just grab one when you go out, and you have a few pointers how to interact with us. We are very active on metrics and Slack. We're basically there all of the time. You will always find and maintain either on our metrics channel or on a Slack channel. We have Office OS that are user-centric. If you have a question about FlatCard, if you want to discuss something and you are a user, every second Tuesday of the month at 3.30 pm UTC. There's a DevSync. If you're interested in developing with FlatCard or if you just want to see how we plan our work, the day-to-day stuff, how we plan roadmap and how we discuss security stuff, that's the sync to join. All of our work is public. All of our planning is public. Roadmap is not public. And all of those are on GitHub. Those are GitHub boards. You can just review them and you see what we're up to. If you want to contribute or join as a developer, we're trying to make this very low bar. So we have a software development kit that is a container because obviously it is. And it just takes those seven steps. You just run those seven steps and you will fully rebuild that release from scratch and you will fully run all of our scenarios locally on a very small QEMO cluster. And it's equally easy to make modifications and to build your own images. To just wrap things up, and we have five minutes left for questions after that, so I skipped the SysX demo for a good reason to be open for your questions. If you leverage the isolation that container runtimes, container ups have inherently, and you look at this from the operating system point of view, there's a lot to win in terms of interchangeability and replaceability. If you use the declarative configuration before provisioning time and focus on the business logic for configuring your instances and to abstract all of the unnecessary complexities without taking away the option to fiddle with the bits, but with taking away the requirement of absolutely having to in order to get to something workable, you eliminate config drift. Atomic automated updates gives you that kind of contract, that abstraction that you get with container runtimes. And if you still need to change something on the operating system layer that absolutely cannot go into a container, you can compose it in the SysX. Fully community driven, we've submitted the project to the CNCF as an incubating project. The process is ongoing. Keep your fingers crossed. And that's it. Thank you. Any questions? Thank you. A lot of, a lot of please. How would you run this in production? I don't think I got that. How do you run debugging tools? The obvious first answer is you shouldn't because they are debugging tools and they just, you know, open your nodes wide up to all kinds of strangenesses. But if you absolutely have to, there's a development container that we have for flat car. That is usually used in automation, like when you, at provisioning time, or reboot, if you need to build some special kernel modules, then you would use that. So if you absolutely have to run DevTools, that's the way. You can do it. What's the DevTools in a container? They already are. We are shipping them with all of the releases. So that's the thing. You speak very, very loud. The mic is... If there's a standard for... Yes, there are multiple. They're actually pretty well defined. Where is the... You should probably ask the person a little bit behind you. I've seen it and I know it's smiling there. But anyways, so if you check out the dev.com, you can see the dev.com, you can see the dev.com, you can see the dev.com, if you check out uapi-group.org, then you'll find, like this is pretty dry, but you'll find some specifications on, like, how those images are supposed to work. And there's a number. You could use WashFest images. You could even use, like, complete file system images with multiple runtimes for multiple architectures in them. And it's all specified here. If you want to hands-on examples, we have in the FlacCa project, we have something we call the SusEx Bakery. And that... Did it load? Yeah. So that has some hands-on examples on deploying your custom docker, on deploying wasn'time, on deploying Kubernetes, and also on making itself updateable. Sorry? Sorry, I need to ask you this. Reproducible builds are something that are pretty much on our radar and our roadmap. We even produce, as a build artifact, salsa provenance in the 0.2 revision, I believe. So there's initial work there. You can use that to basically build on, but there are a number of things that we need to solve, like, you know, the right compiler usage and so on. So FlacCa is actually making heavy use of Gentoo in the back end. So if you use that for Gentoo, you can reuse it on FlacCa. It's very easy.
Modern Build Systems for Containers
Thank you very much. Yeah, so my name is Adrian. I work for Chingard but I'm a technical community advocate or DevRel. I do have a minor issue in that there's a rugby match on right now. I'm Scottish and Scotland is playing Wales right now so somebody scheduled my talk for the exact time of the rugby match. So I'm quite serious if somebody can go and look up the scores and let me know what's happening I would appreciate it. And you're all laughing I'm looking at me like I'm joking but I am actually quite serious. Okay so so this is one of my favourite quotes about containers. Docker is doing to apt what apt did to tar. That's by Brian Cantrell or I guess a lot of you know he works at Oxide now but he used to be at Sun and Giant and has done a whole bunch of stuff with containers and operating systems and so on. But the point is in the old days I guess we used tar balls like 20 years ago people were shipping software around by handing people sending people tar balls and it kind of worked especially like if you you know it's building from source but you typically got into problems with dependencies and he tried to ship a binary well good luck frankly. So then we got package managers and package managers really kind of solved a whole bunch of these problems right they took care of getting you the right dependencies. And I think we kind of take package managers for granted now but they are pretty cool and people still have to put a lot of work into them so if there's any sort of package manager containers out there thank you. But having said that all a package manager really is or all package really is is a tar ball plus some metadata about dependencies right. And then if you take a docker build or docker or a container image well that's really just an app or so on plus some metadata about how to run it. But now we're containing all the dependencies down to the operating system level. So that's kind of what Brian was talking about here. And container images really are just the same right they're just a file system plus some metadata. There's OCI standards on exactly what the file system should look like. It's not too complicated. Some of the metadata does get a bit confused and admittedly it's just a few levels of indirection. But it's not that complicated. Running and maintaining containers okay I will grant you that can get a lot more complicated. But that would be a very different talk to this one. So back in the early days of Docker and so on this is how we built container images. And this is a pretty typical build. And people still do this kind of thing today. And it's okay and it works. Yeah we're just using Golang copying in a source code running go build and set an entry point. There are some problems with this. Primarily you've got a big image here. I ran a build very similar to this one yesterday and it was 892 megabytes. It's also got CVE's. A Golang image is based on build tools which is based on Debian. And that base image the scanners at least. Scanners being like Trivys, Sneak, Grite, things like that. They complain that it does have CVE's. With or not those are at Trivys. Well I'll let you figure out. There's also poor reproducibility. I've not helped things here by writing from Golang. You could put a digest there and that would specify a specific image. But at the minute each time you run this you're going to get a very different build out. So one thing we can do that makes things a lot better is use a multi stage build. So here we've got practically the same build as this one at the top. But what we're doing is taking the artifact out and copying it on top of a production image. Now for the production image in this case I've used this to this image. This one has a chain guard static image. You could also totally use the Google disk to this image and you'll get pretty much the same thing. So what we're doing here, the other thing I had to do was set C go enabled to zero which tells go to produce a static binary in most cases. There are a couple of gotchas. But we copy that binary and copy onto this static image. So why can't I copy it onto a scratch image which is completely empty? Well typically you'll find that a lot of Linux applications do require a few things to be available. For example CA certificates we're talking to TLS to other web services. What often applications expect things like slash temp, slash home to be available and that's basically what static and the disk list images give you. They give you sort of bare minimum to run the typical statically compiled application. And if you do that you get a much, much smaller image. Like I did that yesterday that resulted in an 8.5 megabyte image as compared to the 892 megabyte image that's based on top of Golang. So that's an enormous saving. And that's not good just for security, it's also good just for transferring it about and reproducibility and so on. This is still not completely reproducible because Docker build tends not to be completely reproducible. There is ongoing work on that. There was a great talk at DockerCon last year on exactly that subject. There is an issue with disk to list. So in this example that we saw we used a static image which is for statically compiled binaries. So if you have a project that's run in say Rust, Go or something like that where you can produce a static binary that's perfect. But you can't use that for something like Java, Ruby or Node. In that case you're going to need a different base image that has the runtime you need in it at least. Now Google disk to list project does have a few of these. I believe they've got a Python one and a Java one at least. A Ching we've got a whole bunch more. But yeah, even if you find a base image you may still find that you need one or two more dependencies. Which brings you to the point well can I create my own sort of disk to list base images with absolute minimum in them. And I should like make the point that these images are so minimal they don't have like shells or package managers in them. They really are stripped down to just what you need at runtime. A quick aside there is a project called KO. We're cool, I'm not quite sure of which. And this is like a really easy way to build a go application into a disk to list image. So instead of an go build you're literally running a code build and it builds your container image. And you don't even need Docker. It's just because like all it's really doing is producing a file system there's no Docker involved. Yeah, so that's really trivial to use. There's no configuration literally. It's literally just code build. And yeah, you might be thinking well hang on how do you make these disk to list images if you're not using Docker build and so on. And can you make your own disk to list images. And the answer is you totally can. It's not that simple though. If you look at how the Google container tools disk list images are made they actually use Bazel which as some of you probably wear is the open source version of Blaze which is the internal Google build system which is understandably quite complex. But basically you know Google container tools disk list is debian with almost everything stripped out. A chain guard we did things a different way. We've got app code which builds our container images and you have to pair that with an operating system or package repository such as Wolfie which is our one or Alpine also works. You can't mix it though like Wolfie is based on G-Lib C and Alpine is based against Muzzle. So you can't mix packages at all. Okay, so we'll see if this works. I did it. Here is the repository for the go example for building with Bazel. So if you want to create a Docker image with Bazel this is the simplest example for creating one's go. And there's basically two you know there's like a go program here. We won't bother looking at that. But we can look at I think two interesting ones are module and build. So we have module here. I think somewhere we'll see we specify the base image. So we're building on top of a distrust base image and we're specifying the platforms. This bit's kind of interesting. If you use rules OCI in Bazel to build a container image you do have to specify a base image. Just kind of a bit annoying because when you're using something like Bazel you really want to specify everything down to the ingredients. Part of the whole point of Bazel is to completely specify where everything comes from but we have to pull in an image here. And there's no reason you couldn't create everything completely from scratch. Because again, container image is just a file system really. Okay, let's see if I can go back. And the other file is a build file. And okay there's not too much in it but there's a bit. And this is literally just to build that single very simple go application. So there is when you decide to use Bazel you suddenly got to bring in a whole bunch of stuff. But it does buy you a lot. There's a reason people like Google use it. There's excellent reproducibility. You run it twice. You get exactly the same binary artifact out. It is fast. There's a lot of levels of caching going on here. I think the main reason people use it though is for like provenance and so on. So provenance and reproducibility are sort of essential to your organization. That's when you want to be looking at something like Bazel. Yet you can build totally minimal images with it but as we saw there it is a bit of a beast. You're bringing in a lot of stuff just to build a simple go application in this case. So it's something you want to do for larger, more complicated systems. Probably not for like a small open source project. Yeah and there's the other issue about having to bring in a base image if you're using something like rules OCI. I think you can chain images though. And if you compare this to what we do at ChainGuard this is like how we build the Wolfie base image. This is the entire app code for the Wolfie base image. If you run this you'll get something very similar to the Wolfie base image you can download. So what we're saying is first thing is we got to point it at the package repository. In this case I'm pointing at the Wolfie package repository. Could also use Alpine. And then I'm saying what packages I want to install in my image. So I'm saying we need C certificates and Wolfie base. And if you do that you end up with an image that's basically got busybox and a few other things in it and not a lot else. And glibs too. Yeah then you set some metadata on your image things like entry point cmd and so on. And that's all you can do in app code. Right there's none of this like in Docker where you have run commands and you run arbitrary commands. You also can't add outside files in. Everything has to be in an APK package from a repository somewhere. That's all you can do. But because of that it's a lot simpler. Hopefully that example was simpler to understand than for example the Basel build. It's declarative. You're saying what you do and you can't do all this run imperative stuff. It's reproducible. I've done it twice. I will get the same result with the assumption that these packages haven't changed in the meantime. You can specify the exact version of packages as well and there is support for getting like a lock file out with exact versions of packages that you use. You do tend to get very low CV images from using this if you use Wolfie just because we're really aggressive at keeping the Wolfie packages up to date and they tend to have no CV use. The other thing is it composes well with Docker files. So like I said all you can do in app code is add in APKs. So if you want to add your own application in well you can use something like melange which is what we use internally to build APKs or you can take an existing you can build a base image of app code and then use it like Docker or something to copy in your application on top of it which is basically exactly what you're doing when you're using a multi-stage build with the train guard static image. Drawbacks I guess you are dependent on Alpine or Wolfie so if you want a package that's not in there you're going to have to create it yourself. What you totally can do you can use APK tools or you can use melange which is the train guard version of APK tools. As an aside there is also roles app code for Bazel which does kind of help with that issue I mentioned before about being able to build images in Bazel completely from scratch. So you can also check that out if you're interested. Oh yeah what's the score? That was unexpected thank you. Who scored? No don't worry. Yes I want to mention canonical chiseled containers so I spent a little while looking at this. There are resources you can't totally download it and play with it but there's not that much out there to be honest. So this is the canonical version of this chiseled containers. They do seem to have produced minimal low CV images and they do look good but it seems to be a very limited number of images they've created with this chisel mechanism. I could only find 3.net images and a GRE image. You can create your own images but it does seem a little bit complex. Basically the app is the idea of slices or the idea that you can, I guess that's where the name chisel comes from so you take an apt package and your chisel bits out that you're interested in or you're not interested in. It does feel a bit like the app packages, the problem is app packages are large as opposed to with APK and other package managers where you have the idea of sub packages. So we haven't had this problem in Molfi because we just defined sub packages if you want a package that just has libraries for example. I don't know enough about app to say if that would have been a reasonable pathway to do it but this chisel mechanism does look very manual. You end up having to specify paths etc. The other thing is it's very much part of the canonical ecosystem. You start seeing things like snapcraft and charms and so on and very much ties into all that. Build packs, I didn't spend too long looking at this one. A lot of you are probably aware of build packs from the old Heroku days which is where it all started and then Pivotal did their version and then they merged them together again and now we have cloud native build packs that build OCI images. The main selling point seems to be that it's easy to use and sort of automatic. So your build packs, well look at your project, they'll see you've got a Python requirements dot text or you've got a node package dot jason and try to automatically configure a build pack to build on top of that so you automatically get a container image out. From playing with it, it didn't seem to produce very small images. Maybe that was me holding it wrong. I don't think there's any reason it shouldn't produce small images. It just seems to be, that's all based on this idea of stacks and the stacks by default aren't that small. It does definitely feel to me about one size fits all. But yeah, have a look if you like. One thing I really do like that came out recently is Buildkit and well Buildkit didn't come out recently, but Dagger came out recently. So Buildkit is the sort of engine behind Docker build and Buildkit is a lot more powerful than it appears from looking at Docker file. There's a whole bunch of stuff you can do here about dependencies and resolution and caching that's really quite powerful. And I think when they built Buildkit, they were hoping that there'd be much more front ends created on top of Buildkit rather than just Docker file. But that's not really happened until now with Dagger. And Dagger really tries to take advantage of the power of Buildkit. I would say Dagger is much more designed for CI CD. So the selling point of Dagger is to solve this problem that you have in CI CD when I'm sure you all had like GitHub actions where your action isn't working. So you like chicken, try this and then try three, try four and you end up with try 26 of commits. Yeah, you've all been there? Yeah, it's a pain. And that's kind of what Dagger is trying to solve. The idea being that you can run Dagger locally and it'll build the same as it does locally as in a GitHub action or Circle CI, et cetera. I don't think this example is very fair to Dagger because it's actually really large and powerful, but I want to give you some flavor at least. So in this example, we're building a container, we're giving it a base image, we're telling it like a directory from the host to include. In this case, we're just including the MD files, markdown files, sitting the work directory, then telling it to execute LS and I put in the, I put from LS. So that's kind of what Dagger workflows look like. We're just half built a container, it's also done something with it. To me, it kind of feels similar in some ways to Bazel, but a lot simpler, right? Because now we have a build system that will build an entire sort of organization, my project if you like. I'm sure it does not offer the same providence and so on guarantees. What I really mean is just that it's designed for a team to use as opposed to a single person. There is also a bunch of plugins. So the Dagger working on something called the Daggerverse and that includes plugins, including one for AppCo. However, having played with it, you're actually better off with a plugin for APKs. So Dagger can effectively recreate AppCo because you can create a file system or image from scratch and then just add an APKs using the plugin. In some ways, it will be better than AppCo because you'll get caching and rebuilds really much faster. So there's quite a strong argument for using Dagger there instead of AppCo. Next, how am I doing for time? Okay, we should be okay. So you don't need to understand all of Nix or even install it to play with Nix to build Docker images or container images. There's effectively two approaches. You can use packages to Docker tools or you can use flakes and copy them into an image. And I should say I have definitely not understood very much of Nix and I didn't install it. Here is packages.docker tools. So again, it's somewhat similar to Bazel or something like that. We're specifying the name of the image. We're saying we want the reddest package inside it and that it should be available at slash bin and mountain of volume, et cetera. Now, you should be able to build that. I tried building it on my Mac and it told me it wouldn't build because it required KVM. I don't want to understand if that's, if there's something I could work around that. Now, I believe we'll create something that's fully reproducible. So I run twice and it will give me a bit wise identical result. You should be able to create minimal images. I'm not 100% sure in that one. It takes a full programming language and you do sort of need to buy into the whole Nix ecosystem, but it does seem quite a powerful solution. Nix Flakes. So this is entirely stolen from Mitchell Hashimoto's blog, but it was really quite interesting. So the idea with Nix is when you install an application, you also get all of the dependencies or particular version of all its dependencies along with it so that it always works wherever you put it. So there's no reason that you can't just take the whole file system tree and put it in a container and it should just work and it does. So the idea is to create a flake and copy it into an image using a Docker file and that's a pretty simple method and it does work on my Mac. The whole sort of method is written up by Mitchell Hashimoto at this blog post. I could show you, but by the time, it is a little bit frustrating because now we put Docker into the mix, it's reduced the level of reproducibility. I think you should be able to take minimal images, but there is an issue in that it creates a slightly weird file system. The app or the entry point is a shell script which includes all the dependencies. So I guess you get forced to include bash or something. I'm not quite sure that's always the case or there's a workaround for that. I really need to play more with it. But you do end up with a weird file system. So the problem with this solution is if you give it to somebody else and they try and debug it, they might well hit problems because you do a file system and you've got app and you've got next door. You don't have your usual Etsy and BIN directors. Okay, so to wrap up, what would I recommend? If you want a big organization wide solution, if you need provenance, reproducibility and so on, you can totally do that. But do be aware it can be a bit of a beast. I really like Dagger. I hope it does well. It is a new solution. So do be aware that it's still being built out. Yeah, it certainly seems a good solution if you feel a pain in CI CD and I think everybody does. If you have a smaller project, the first thing I would genuinely look at is like ecosystem specific build tooling. So like co for go for example. Because that's really simple and you're pretty sure it's going to work and be low config. There is jib for Java. I've not tried that one, but that would probably be the first thing I would try if I was doing Java again. Otherwise, there's nothing wrong with doing a multi-stage Docker build with this full of images. I totally would recommend looking at this full of images though to get a fully minimal production image. App Co, yeah, if you want to, if you need a bit more flexibility and creating your sort of base image, please go and have a look at App Co. And then finally the next stuff, yeah, that could well be a solution totally if you understand Nix and you've bought into that ecosystem. Okay, what's the score? Pretty good there, thank you. All right. Thank you.
From Containers to Unikernels: Navigating Integration Challenges in Cloud-Native Environments
So, thank you. My name is Ion Blackus and in this presentation we're going to talk about the Unikernels and their integration challenges in cloud native environments. So, first we're going to set the scene, talk about containers and some books mechanism and then we're going to read the Unikernels and talk about our own OCI compatible container runtime, Uran-C. And after that some demos and evaluation results sections. So, yeah, we're a team of researchers with mixed industrial and academic background, mostly focused in virtualization, container untimes and hardware acceleration. So, yeah, containers is the standard right way to deploy and package your application. They're portable, they can be run both in cloud and net. They're easy to scale with the support of a wide ecosystem and they have super fast phone times. But they come with major risk when it comes to multi-tendent scenarios. Multiple containers serve the same kernel and rely on software components for their isolation. So, yeah, in a malicious workload scenario, in a previous escalation, container impact the entire host. So, what about vendors did was to either use software-assisted solutions with tools like SACOM or a Parmore or hardware-assisted solutions, VM type solutions with tools like Farcracker and Deviser. So, yeah, the thing becomes like this. We have now, we deploy our application inside the container, inside the VM, totally isolated from the last system and we get to keep actually the benefits of container, the portability and the scalability and we also kind of resolve the isolation issue. But of course, this comes with side effects, since higher overhead because of CPU and memory provisioning for the VM and the much more complex system stack that needs to be maintained. By taking a step back, we can see that the application does not need all these parts, not run. It just needs the run time, the libraries and some parts of the OS, like the drivers. And, yeah, this kind of modeling is enabled by technology called the Unikanal. So, what's Unikanal? It's a specialized single-hatter space that contains exactly the necessary parts for the application to run. So, yeah, this leads to reduced attack surface and faster boot times, which is especially crucial in serverless scenarios where responsiveness matters. And, yeah, but Unikanals are not widely adopted yet. And why is that? We identified too many issues. The first one is packaging and, yeah, Unikanals should look like an OS image in order to utilize because this is support. And the second one, the more, more than, is deployment and the execution of Unikanals. Container run times need to be extended. Additional logic is needed in order to execute Unikanals. And with this, I would like to give a floor to your viewers to talk about your ANSI. Thank you, Yanis. Hello, everyone. So, to solve the deployment challenge of Unikanals, we introduce URAN-C, which is a Unikanal container runtime. It is fully CRI-compatible. It's written in Go. Actually, it's a CLI tool which makes use of interconnected Go packages to actually spawn the Unikanals. It reads Unikanals as processes. So, in a way, it directly manages the application and not the system in which the application runs. The images, the Unikanal images required to run these Unikanals are typical OSI artifacts. And in order to actually spawn these Unikanal VMs, we make use of underlying hypervisors. So, first, let's take a look at how Unikanal images looks like. First of all, they are standard OSI images, so they can be managed via standard tooling and can be distributed using already existing registries. But there is one differentiating factor. URAN-C needs some specific annotations to function. These annotations are the Unikanal binary path inside the root of S, the Unikanal type, the hypervisor type, the command line that we need to pass to the Unikanal. And optionally, if we are using IneterD, the path to the IneterD file. So, to facilitate the packaging of the Unikanal, we created a simple image builder called BIMMA, which uses simple Docker file like syntax to create the images. As you can see, it's pretty typical for anyone who has used Docker. It's practically the same thing. So, now we have seen how an OSI Unikanal image looks. Let's take a closer look at how Unikanal actually spawns a Unikanal. First of all, container.dc invokes URAN-C create. URAN-C create then sets up a new network name space, create a new name space, sets up a pseudo terminal if it's required, and spawns URAN-C reexec process inside that name space. The reexec process then notifies the parent process that had started. Then, URAN-C, the URAN-C create, the original process, saves the state, the PID of the reexec process, etc. Executes any create runtime hooks, and then sends an OK IPC message to the reexec process, executes any create container hooks, and then exits. Then, container.dc invokes URAN-C start, which sends an IPC message to the reexec process, executes post start hooks, and exits as well. So, now it's the most interesting part. The reexec process actually sets up any necessary network and storage components, for example, the W device, etc. Executes any start container hooks, and actually spawns the Unikanal VM. So, as you can see, this is a pretty typical life cycle for any container runtime, with just some minor adjustments to facilitate the Unikanal execution. So, to actually spawn the Unikanals, we use hypervisors. We made it really easy to integrate any new hypervisors you want to implement in the system. So, I'm going to show you how to do this with the Unikanal VM. So, in this case, you can just implement this interface, which is mostly just the exec V function. So, it's really easy. Currently, we have support for Solo 5, Kimu, and Firecracker. For storage, we have support for block device via the map or snapshotter. We have support for InitrD, which is packed inside the image. And we also have support for shared effects coming soon. In the diagram, you can see how an image looks like, the layers look like. So, for the network part, we followed the every simple approach that is also used by sandboxed runtimes, like Cata containers. We create a new top device inside the container network loading space. And then we breed all the data, all the traffic to the VF endpoint provided by CNI. We do this using traffic control. To integrate Unikanals in Kubernetes, we had the Asphalt Challenge. That's because we need to actually spawn non-Unikanal containers inside the same pod. For example, the POS container or any other side-con containers. To achieve this, we use RunC to spawn the generic containers. And then, RunC handles the Unikanal containers inside the network namespace of the pod. So, there are some really interesting use cases, for example, KN80, in which we need to have intrapod Unikanal container communication. In KN80, for example, the QPROXY container needs to be able to communicate with a user function, which is Unikanal. To achieve this, we implement a static network configuration. We provide the static IP to the top device. So, we handle it that way. So, now, let's see RunC in action. We will see a simple deployment using an HDL. We will pull the image from the registry. And using an HDL, we will actually spawn an NGNX Unikanal inside VM. So, as we can see, there are no containers running right now. We pull the image from our registry. Okay, it's already existing. And now, we can run it using an HDL. We have to define the runtime. So, we do that. Okay, it spawned. And now, we can see that it started six seconds ago. It was created. Perfect. So, now, we can inspect the container to find the IP address. Okay. And if we curl it, we can see that it's an NGNX server built using Unicraft. Pretty typical. So, now, we can see the actual run. And we can see that it's running. And the container is in the RunC process. That's also running. Okay. And now, with that, I will give the floor back to Janis to show you a more elaborate example with K-nate. Okay. So, now, just... Okay, that's bad. Now, let's deploy a serverless workload with RunC. So, what we first do here is that we see that we have another RunC process running in the cluster. And after this, what we need to do is to define the RunC class for the K-H cluster. You can see here, we apply the RunC class. And then, it's time to define the K-native service. We can see that your RunC container around them is specified. And a simple HTTP-reply-server workload is used as the workload of the serverless function. We apply the K-native service. And then, we will retrieve back the URL endpoint, which by triggering it with a simple HTTP request, a simple HTTP get, essentially, we start the execution of the serverless workload. So, here, we can see the curve. And after this, the pods are going to be running. And underneath, there will be the RunC process with the K-mode hypervisors with the sandbox workload. So, yeah. That's it. And... So, the evaluation section. In order to evaluate your RunC, we convert with other container runtimes, such as divisor and other containers. And, yeah, in that process, we utilize the tool called K-perf responsible for generating and triggering K-native services via HTTP request, as we saw in the demo. And also, responsible for reporting the service latencies. So, yeah, the scale from zero, evaluation scenarios like this, for a number of iterations, we scale. It's a K-native service. And we report at the end of the number of executions, we report the responsible latency. We do this for every other container runtime. So, these are the results. We can see on the X axis the different container runtimes used for that process. And on the Y axis, the service response latency, seconds. And, yeah, of course, lower is better. So, on the blog post, with the experiment setup and all the parameters setting for K-perf. That's all. Thank you. Thank you. Okay. So, the question is about memory benchmarking, right? Yeah. Memory benchmarking is not yet on our work, but we have plans on that also. Yeah. Something that we can do. Sorry. So, the question is if we have run in Germany, AWS, I don't know. Actually, this experiment was on the Prime Service. We have not yet experienced any big lab vendors and deployments. So, hopefully, maybe the next evaluation will be also part with major vendors. That was the end. Okay. So, okay. I heard something about paravirtualization, right? But, yeah. Okay. I'm not sure that that's. Okay. So, I think that's it. Thank you. Thank you so much.
Debug your stage-1 systemd with GDB and the NixOS test framework
So, my name is Julien and this is Ryan and Linus and we are three NixOS developers. And today we are going to talk to you about the situation that we had during the sprint where we found ourselves in need of debugging our system in Itaardee. So, I'm going to talk about, let me just, it's because I know them. I'm going to talk about why actually we were in this situation. And then Ryan is going to talk about what is the NixOS test framework and test frameworks in general. And then we are going to showcase how we did this specific fun debugging. So basically I'll motivate a little bit the situation we were in. So basically we wanted to work with encrypted secrets in Itaardee. So basically as you may or may not know, Initardee or Initial MFS is the initial file system loaded in RAM as part of the boot process. It supposedly contains all what is necessary in terms of drivers and executable to mount your root partition, which is what its main goal is, like be able to mount your root partition and continue the boot process. But in some cases, especially when your boot partition is encrypted, it also need to acquire like the key to mount it and to encrypt it. And so this can be done by displaying user prompt where you input your password, but it can also be done if necessary by starting a SSH server where you connect and then put your password in and then it mounts your root partition. And for that purpose, you sometimes need to have like secrets stored in this Initardee, for example, SSH key. The problem is that if you have an encrypted system, you kind of have to start from something unencrypted and this Initardee image is not encrypted. So if it has secrets and you just put the secrets in this image, then anybody reading your boot partition can have access to the secrets. So as Nixxos developers wanted to have like an option where you could actually have the secrets be encrypted. Currently, like in Nixxos, you have the secrets are like just put plainly in the boot partition and suffer the drawback that I was just describing before. And so we wanted to find a solution and the solution is we have an option to use systemd as like the Inix script. So we use systemd in stage one instead of a scripted Init script. And what we can do with systemd, we can use something called systemd credentials, which is basically an executable of systemd that has the main, just the role of encrypted and decrypting secrets. And you can do this by using your TPM. And so basically what you can do is use the same TPM in your Initardee and this way you have secrets that were encrypted when your system was booted. That systemd in stage one is now able to decrypt in your boot process. So why all this? Where am I coming? I start, I try to implement this in Nixxos and what we found out is that I don't know if you can read this particularly well, but this is the log of the boot process and you see that there is systemd that is running in Initardee, it says here running in Initardee. And then it says it loaded the credentials that I tried to pass it, to pass to it and then it caught an assertion in some function and says okay, I'm retiring early, goodbye. It's crashing. So the question is how can we, how can we like debug this kind of thing? And one of the things we consider at the beginning is to use the Nixxos framework because it allows us to be in some very constrained situation where you can find maybe the bug easier. And then Ren is going to talk to you about the Nixxos framework is the main turner for us. So the screenshot you just saw earlier was a screenshot of the Nixxos framework. So you can see that it's a VM test and we can repeat that VM test very easily. But so what I'm getting at is in Nixxos as Nixxos developers we have this test framework that we use a lot and I'm giving a screenshot of an over test framework that is open QA used by other distributions. But basically what is interesting with debugging is that when you debug you want to debug a situation, a particular situation where you are hitting the bug. And in our context the fact of using Nixxos test framework, the fact of writing test first is a way for us to automate entering into certain particular situation including the ones that we are interested in, interested to debug. So for us like the Nixxos test framework is only a way to facilitate debugging sessions, a way to be able to write code but enable us to explore various scenarios and try age and bisect very easily any sort of dependencies. In the distribution context we really care about system wide testing. So for me I will just do a very quick intro on that. There are two components I will define. There is the driver, the code you write to assert the invariance that you care about like for taking the example of the system decredentials you want to assert that the credential that you decrypt contains the contents that you are expecting. That's an invariant. You also have the setup. The setup is how do you bring the system to the state that you care about so we need to prepare an image that contains a system decredentials containing the contents that we will be expecting and that's the set of code. And both of them are usually written in some sort of domain specific language that could be a bash script, that could be C, that could be Python. And I made just a very simple state of the art table which is not exhaustive but I find it very interesting to compare which is that for example over project that needs to have like complicated integration testing framework are the kernel and they do have solutions to test file systems and various things. And you can see like they all have their own DSL whether it's bash or any ELF program or executable that you can run on the system and they use some sort of emulator to give you environments to give you full system ablation, to give you network, to give you VLANs so that you can reproduce any sort of environment. And I find interesting so I'm not aware of any over operating system wide integration testing framework except from OpenQA and the NixOS test framework which is just a bunch of bash scripts, Python script cobbled together using the Nix domain specific language and we're using the Nix machinery. And I find interesting that so the biggest difference I find with NixOS test framework and the Overs which enable us to do some interesting stuff is that usually you have one language for the domain specific language so you have Python or shell or something but in the case of the NixOS test framework you can use both. You can use Python and Nix together so you can interpolate Nix code inside of Python code and like you have two levels of DSL that enable you to reason at build time but also at run time. And you have so that's why I do the funny thing of saying Python Nix for driver and Nix Python for setup because you think run time and build time differently at this moment. And so to give you an overview the NixOS test framework can offer you like OpenQA anyway, work test OCR machinery so you can run a VM, you can spawn a chromium instance and you can like use the OCR to read the window title for example in a GNOME desktop environment and verify that it is indeed the window title you were expecting. And all of those tests are running in our CI automatically for every what we call channel bump that is a roll up of a lot of commits in the Nix repository basically. What I think is very interesting in our case and enable us to debug very quickly this problem is that there is a secret source for our test framework which comes from the fact that we use the Nix DSL here. So the Nix DSL gives us a way to describe packages, to describe system the units and various things and it's a functional programming language. So it means that you can write functions that abstract a certain test scenario and then you can write more code to do more advances in the assertion on that environment. So for example I just take a very bad screen and I'm sorry but I will describe it. We have ZFS in NixOS and ZFS is very complicated to maintain. I'm maintainer of ZFS unfortunately. And ZFS is very complicated to maintain because it's out of three kernel package that often has ABI breakages with the kernel for many complicated reasons and legal reasons. And to make the burden realistic on maintainers you need to have strong testing. And so we are able to do matrix testing over multiple version of ZFS and multiple version of the kernel itself and multiple version of even like stable versus unstable and we even have a variant for the system D stage one because NixOS has both stage one. It has a scripted stage one like Julian described and we have experimentally the system D P I D one stage one. And so we are able to test all those scenarios and be able to understand what is going on in a very like in not a lot of lines. And here I will pass it to, we tried a lot of things. We tried to isolate the problem with the NixOS test framework. We are able to patch things easily. But even though we were not able to find the root cause. So we passed on to more powerful tools. Thank you. Yeah. So there we were trying to work out how system D was crashing exactly. It was dumping its core to a file in the temporary file system and promptly exiting causing the kernel to panic and it's not a persistent file system. So we had no way of recovering that core file. So we decided to try and run GDB in the init ramfs or we quickly abandoned that idea because GDB is big and doesn't fit into an init ID that well. Thankfully we have GDB server which I'm guessing anyone familiar with GDB might already know about. So with GDB we can attach, we can either launch a process like above, launch a process as a child of the GDB server. It can listen on the TCP port and then we can attach to it with a separate GDB client process. That doesn't quite work if you want to debug your PID 1 because PID 1 can't be the child of another process. Thankfully it also has a mode where you can attach to a running process. So in this case we're launching sleep infinity in the background and then running GDB server to attach to that and likewise attaching to that GDB server using a GDB client. Now how do we do that if we want to do that in PID 1? We have to put GDB server in our init ramfs and then we have to have it target the PID 1 inside the init ramfs. The tricky part is we want to debug system D but because system D is crashing we can't use system D to launch GDB server. So we go back to having a shell script as our init and that shell script launches the GDB server, has that GDB server attached to itself and then executes system D. First thing we do is launch that GDB server, have it attached to $ in this case it's going to be 1 so the PID of the shell script and background that because otherwise Bash is going to wait for GDB server to exit and GDB server isn't going to exit. Then we sleep 1 because the GDB server needs a moment to start up and actually attach and then we exec system D to actually do our debugging. That ended up getting us actually able to debug it and Julien has a recording of how we did that, of what that looked like. Thank you. So let me try to put this demo on. So basically what we did, try to comment it as it goes. Oh this is not right. Yes it's not doing whatever I want. I think it's... And you can exit the full stream mode and then full stream it. No you didn't exit. Yes yes and trying to do it. Did I... Yes. You have your time. Yeah okay. So on the left side we are running our test framework virtual machine and you see now the virtual machine is not starting because it's waiting that we attach from GDB which we do in on the right side and you'll see as soon as we attach through this socket that is called hello the virtual machine is starting and GDB is loading the symbols yes and then when we do continue then the virtual machine is starting. So this one first virtual machine is as you see on the left is the installer virtual machine. It's going to install in XOS on a disk, populate the boot partition and everything, put the credential in it and then we restart it and we will eat the bug with system D. So what you see here is just a log of XOS installing itself and so this first GDB instance will not do anything purposeful because we are just... Because we change it in its script we have to change it both in the installing VM and in the installer VM so we are only doing the first part that is not really the part we are interested in. But should not take too much time. I can do filling. So what is interesting here is you can see like we have a very complicated well complicated setup to initialize system D initialize the installation and all that stuff. And this is the second VM booting now. All of this is automated. So we are reattaching with GDB and so we are now... The VM is now booting and it's now stuck on waiting for GDB to attach. So when I do this it doesn't work but when I properly attach actually it's reading the symbols and now when I do continue I will eat the bug that we were trying to debug. This we are eating it now and we now can see a backtrace. So yeah that's it. By reading this backtrace we found the bug we were looking for and we were able to open a PR to system D and fix it. And that's it. Do you have any questions? Do we have time for questions actually? Yes. Oh that's good. You said that you couldn't have system D be like the child of another process so you couldn't have GDB like start and run it. Why not? Yes. Do you want to answer this question? Yes so the question was why we can't have system D not be PID 1. It's because our bash script won't reap zombie processes which only PID 1 can do and because yeah there are various bits in system D which require it to be PID 1 especially if you are running it in the init ramfs because it needs to actually switch into the final root file system which you can't do as just any process. I don't understand how and when the transfer the ownership move from GDB server to system D because you attach GDB server to itself then you hit continue. The question was you don't understand when the control goes from GDB server to system D. The init in this case was a shell script which launched GDB server in the background and then the shell script replaced itself with system D and the GDB server was attaching to the shell script. Any other questions? Yeah just a matter of curiosity. Why do you say it's a problem to put all of the GDB binary into the init ramfs? So the question was why it's a problem to put all of GDB in the init ramfs? It's yeah it's fairly big. Big init ramfs can be a problem especially with limited with boot partitions of limited size. For that we might not have the terminal control bits and pieces necessary to make actually using GDB enjoyable whereas with a GDB server we can even attach a graphical front end to GDB or something similar to the target. And the debug symbols and the sources? Yes exactly. So GDB needs to access the debug symbols and the sources at good point. The question was why if we are using a TPM anyway to store the disk encryption keys why would we need to store more secrets in the boot partition to do anything else? I think so there are many use cases here. For example imagine you would run SSH server in the early boot to obtain another part of the key. So you store a part of the key in the TPM2 and another part on a server and the server asks you to prove your identity or something then you need to have your own identity somewhere because the server doesn't know if you're the true server who is asking for the over part of the key and that means you need private SSH house keys to be stored somewhere. So to confirm in general if you haven't configured something like an SSH server and explicitly put a secret in your init you're not going to get one. If that's part of your framework or where you want to split the key up and get it in different places for example this can help you do that. So again to repeat what you just said and I agree with that this sort of approach is useful when you have more secrets than just having the TPM2 disk encryption secret in the TPM2 when you have identity cessation or more parts of the secret somewhere else doing SSSS and what not. Shami's secret sharing to be more precise schemes and this makes sense in those use cases. We still have three minutes. Recompuse. Yeah. Is this already in stream with the TPM user in the init? Do you want to answer? Can you repeat sorry? Is this already in upstream mix? Mix package with the TPM2? Yeah so the question do you want to answer? Yeah okay. Repeat the question. Sorry yeah the question is this way to store secrets? Secret stream. Yes this way of storing secrets in init already upstream. The answer is no. We have a few dependencies necessary. One of them is using booting from system distub because system distub can measure the credentials you're passing. So there are PRs open. If you are an excess developers do review them please. But it will come soon I think in system reboot and also there is work being done in LANZABOOTIS for the same features. So both are going to be available soon I guess. Related is this one of the things that's kind of on the road to LANZABOOTIS? I'm the maintainer of LANZABOOT. So the question was is this part of the work to upstream LANZABOOT which is a secure boot component for NixOS? It's a bit special to NixOS because we have too many generations. The answer is this is in the ecosystem of those such things and yes basically. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you.
Love rr, Love rr, you're so good to me
Any questions during the talk? So a bit about myself. I've been working for MariaDB. Who here has never heard of MariaDB before? Okay. So there's a few people here, no problem. So MariaDB is a fork of MySQL, which you've probably heard of. It is developed by the original authors of MySQL. And it is mostly the default MySQL variant in most distributions. So you might be using MariaDB and not actually know it. I've been developing for MariaDB since 2013. I've done various features like roles, window functions, things that then got ported to MySQL or implemented by MySQL. I'm now working on catalogs and also adding MariaDB vector, a competitor to PG vector. Now, one of the biggest problems you can have in a database is that the database is a multi-threaded monster. You have many different threads and race conditions, if there is a bug, happen but they happen very, very rarely. And it's almost impossible to try to reproduce that failure. So what we usually get is we get core dumps. But the problem with core dumps is that they only give you the state at the end when everything has already went wrong. You don't know where the problem happened. So it would be excellent if we could find a way to go back in time. And I know there is a follow-up talk about how this thing actually works behind the scenes. I'm just going to give you a tutorial on how to make use of it, because I think this will just revolutionize all your debugging experiences. Even simple bugs that are not, let's say, race conditions are much easier to debug if you can step back whilst debugging. So RR, it's an opus-nose project. You can install it, most distributions have it. And it's a program that basically records your application, the state after each instruction the CPU executes. It does come with some caveats, I'll go into that. But as long as you have CPU support, it should work. Now, to set up, it's pretty simple. All you have to do is do echo1 to set this kernel parameter. Otherwise, RR will tell you, you need to set this, otherwise I won't work. And then you just run the program. So instead of gdb, RR, the program. The program doesn't stop, it actually just finishes. If it's the server, then it will keep running. But basically, the execution will be stored somewhere. There is a system variable you can change to say where this thing is stored. And then if you want to actually debug the program, you do RR replay. I'll show you in a demo how easy that is. Now, when you do RR replay, you get a gdb prompt. With a few extra instructions, like reverse next, reverse step, reverse step instruction, continue and finish. This is useful, but the thing I find the most useful is actually this being able to watch an address that never changes between executions. Because especially when you have a very complex code base, it's very hard to understand where something changes. So you do the first run, you see that this variable is wrong. So then you put a watch point on that variable, and then you run again. And that's how you figure out exactly where things go wrong. It's as simple as that, and it basically takes hours out of debugging. Now, a little bit about how my redebate does it. Like, I think this is probably a good thing you can try out in your projects. So my redebate has a test infrastructure similar to other projects. This one is tailored specifically for my redebate. It's called my redebate test run. What it does is it issues SQL queries to the server and compares the expected results with the actual result. If usually this works and if the results are the same, then the test passes. But every now and then, especially in our CI, we get failures. And those failures tend to be hard to reproduce. So in order to reproduce them, what I personally do is I start to do the testing. I start testing, I ask, I want RR recording. And then I don't just run the same test, like over and over. Now, I start the same test multiple times in parallel, just to overload the system as much as possible to try to get things to actually reproduce. And usually about after five, six hours, I tend to get a failure. With this while loop, the thing will stop the moment I get a failure. So then I have the exact trace I need to find my problem. Now, the limitation of RR is that it runs in a single threaded model, which means that this is how long it takes in milliseconds without RR and with RR. And of course, if you don't have platform support, you can't do this. It relies on a specific CPU capabilities. I've noticed that if you try to get a server in the cloud and try to run this, it will complain about something in AMD Zen processors. So your experience might vary depending on which iteration you have. Now, another problem is that because it's running in single threaded context, you don't get the exact same behavior as if you were running without RR. And actually, I can show you this. So I made a very small program here. This should be readable. So all this program does is it starts five threads. The threads don't have any locking on them. There's a number of iterations and they try to increment a counter that's not guarded. So obviously, we have a race condition there. We just try to increment something without the lock. If I were to run this program without RR, I get this number here, which is obviously wrong, but if I do try to run it with RR, a few times, you will often get this. This is the correct value if you actually have the right locking. So depending on how your application actually is, it's not guaranteed to be able to reproduce every sort of race condition that you have. And, okay, that's for the demo. And now one more thing that I really like about RR is it helps code discovery. Like MariaDB has a code base that's 25 to 30 years old. There's functions there that I don't understand what they do, but I have some expectations what they should return. So what I do is I tend to consider them black boxes and just step over them until I see that this thing returns something that I did not expect. Then I just step back and I can go into details into the function. So it even helps speed up code understanding. Okay, this was a brief talk. That's what I wanted to share. Now, any questions? Thank you. So I didn't even have time to... Please make sure to repeat the question. Yeah. Yeah, you had a question. For your problem of race condition, have you tried tools such as trade analyzer or val-green, a green slide out? So these, for example, you have given, for example, would be detected without having to trade off the race condition? Yes. So I repeat the question. The question was, have I tried other analysis tools to help detect these sorts of race conditions, like val-grind, sanitizers, stuff like that. So we have a set of tools we use. We use ASAN. So we compile with ASAN and run with ASAN. We run with val-grind, but val-grind has also the same problem that is single-threaded and actually slows down the execution even more. We compile with MSAN, so memory sanitizer. That one is a bit trickier because you have to compile system libraries as well for MSAN. But all these combined with RR, you get to the end the result, which is a bug-free program. I think that's the point. Yeah. You said that you need special, like, CPU features that use RR. Why is this? Does my first question, second thing, is, if I recall correctly, it means that it likely won't work in the end? Okay. I don't know the answer to the first question. I think that the next talk will actually explain that. It's security reasons. Okay. Security reasons, basically. But yeah. Okay, go ahead. Okay, and like, VM-wise, actually, I've never actually had to try in VMs. I've only used containers and worked there, so. Containers are just processes. Yes. But for the security reasons, I don't think that's a good answer. I mean, maybe let me rephrase the question, because I understand that's like the security reasons to enable Perf, right? But why do you need Perf in order for RR to run? Watch my talk. Okay. How much data does it typically generate? I've never actually looked at that, but we can check the recording for this one. So let's have a look. So the question was how much data this recording uses. Let's see. Demo 22. Let's do a DUE. So 100 megs for that very quick program. And for my AGD, please allow your honor to pose a subject. Let's see if I have a history of one here. And just in case, do you run it by default in your OCI? Not with RR. So we don't run RR in the CI. We use it when the CI detects a problem. So let's see this one. So this is one test case. Okay, one gig. Okay. Okay. One test, one gig. Yeah. If I'm not mistaken, I think GDB provides the record command as well. Watch my talk. I'm going to go exactly over that. The DDB record doesn't work. Is that true? Watch. Watch it. Watch it. Watch it. Watch it. My talk is to get you to fix things. So. Yeah, there's another one. There was one there. So the part that was on set a moment ago about the sanitizers, what about the red sanitizer? I... So about red sanitizer, I have not... It's good. So the sanitizer is good. You can detect a data race or something like this. But the thing is that you might have a crash that may be a reason of something else that was intended. And then you may want to step back to find out the exact reason. And also things like ASAN may not help you, because, for example, ASAN is doing this kind of shadow memory juggling to be basically detected in correct memory access but the thing is that when you have, for example, implement your own containers where you handle the capacity versus size thing, then the memory out of bound access between the capacity and... Or sorry, size and capacity won't be detected if it's your container and it's not really... Like, prepared for ASAN. You can poison if you have your own container. Yes, you can do that. And, for example, there are projects to, for example, make this for SDD string or SDDVQ and I'll be in right now. So one problem we tend to have is there's actually no data corruption... There is data corruption on disk and you need to figure out how that got on disk. And it's not necessarily a race condition, it's a bug that's hidden in the logic and it only happens in a certain case of events. So usually the crash is an assertion failure somewhere. Yeah. So in your example, we can see that there's a difference of the increments of the content. Do you know why it's not with us? Is it because of the instrumentation or the latency? And then... So probably the next talk will help answer that better, but I have a theory and at least this is my understanding of it. So if it's running in single threaded mode, it has to context switch between the threads. And it has to decide when that context switch happens. It just seems to be significantly more likely to switch after the store has happened. In which case the number... You kind of get to the right one at the end. So another way of saying it is serialize this execution? Yes, exactly. And there is a chaos mode you can enable. So rr with dash dash chaos makes it a little bit more possible for it to happen, more likely for it to happen. Might also not answer that. Oh, well that's a good question for you then. Yeah. Rr is recording process information. Like for example, if you're doing it for proc mappings, will it actually show up the memory, the memory mappings that are on this point in time when you are actually stopped on and like reversed? And another question, does it maybe record additional information, like for example, where the file descriptor points? Like what was the link on the proc, it, fd and the number there? Because that would be very useful to have. If it's in memory, it should be there. I mean, probably it's not memory mapping is there because I assume... So, yeah. In memory stuff, I've not had a problem getting it. So it's a very technical question. I don't have like the best answer, but I don't think there is a problem. Yep. So, I'm not one of the Rr developers. I can't answer that, but like from my understanding, I think it's kind of hard to get the CPU to do that. You're kind of relying on CPU being able to write stuff to memory. So... I just have a question. Like... I don't know if you can answer that. I don't know. I don't know. I don't know. Like... MariaDB, the releases are... How good are we in terms of known bugs, race conditions bugs? They, you know... This is for CI and I don't know if I brought some. Just wondering, the release, the MariaDB product, is it free of race conditions? I'm not sure the quality. There is a law... I don't know who named it, but there is no software free of bugs. Yeah, okay. But it's free of known race conditions. Um, like, there are... Like... Obviously, we try our best not to ship a buggy product, but there are race conditions, especially when... We've done lots of performance improvements for, like, high core count machines, like 96-plus cores. And that requires some refactoring, especially in EnoDB, the storage engine. There might still be race conditions there. We have had emergency releases when we release something and then we realize, a day later, we're getting data corruption. We need to do an emergency fix. But overall, we are pretty confident that the CI, since it runs for so many platforms, I think we have about 200-ish different combinations. If there is a race condition, it tends to show up. So is it just x86 or is it cross-spectre? The tool works on ARM. So I know it has support for M1 plus max. Um, yeah. So I know x86 and ARM and Apple Silicon kind of works. If that's all, then... One more? Okay, one kind of question. So let's say I'm a language developer, I'm making my own compiler, I need my own native code. Should I be using RR for something other than the default? I use RR for... Whenever I need a debugger, I use RR instead. Okay. So... Even for logic problems, where you've got like a zero execution code at the end, but the logic was flawed and... That's the advantage that you can go back in time so that you can inspect the value that even if you set the break point too late, you can just go back. Yeah. But one interesting thing is that the program you're debugging is not live, it's just being annulated. So if you would like to, for instance, see what this function would return in this situation or things like that, GDB, because it's running GDB in the front, GDB cannot just run the functions, tell you why it would happen. So there are cases in which you don't want to use RR. Yeah, it's a abstract interpretation. Yeah. It's just... It's just emulating the thing. Yeah, that's good. Thank you very much. So does this mean there is no, like, inferior running under the hood so you cannot print and call through with an argument? Well, let me just say that. I thought it wasn't... I can try that. No, no, no. I think it's probably work. We can try it because I know some ways to do that would not happen inferior running. Yeah. But right now I'm not sure what RR does. I never... Spoilers, I never looked at the RR code. I just have a vague idea of how it does that. Okay, let's see. We can do a... So you're saying to call printf, basically? No. Call any function that is in your program. You can call quotes. You can call quotes. It should be there because you never have GDB out of them. So, um... Print and then... Print and then... Puts are... Okay, I can probably do this. Yeah. Okay, no, yeah. So... So, yeah, so it does have an... I'm sorry, that one's incorrect. Yeah, it does. It did it ten times. But your function... I can... You're gonna have side effects. How is this... Yeah, so it does not... It's correct. Sorry, but what happens now if we step back? So... Two minutes left. No. Again, my talk. I'll touch on this and then I can... We... We... We died. So, we talked a little bit. I can go into a little bit more detail than I planned in my talk. There is a specific point where I say this is how I heard our words. I've never looked at it. But now I can, like, go a little bit further. But, yeah, so this was a talk. And in five minutes, I'll start explaining the technical details. Right. Thank you. Thank you.
Help us improve time manipulation with GDB
Which is wizards and barlocks. Welcome to my talk on manipulating time with GDB. Well, that's what I would have said if it wasn't for the RR talk that came right before me that taught everything that you're supposed to know that I was gonna say. So instead, let's talk about how you can help us make manipulating time with GDB even better, right? So let me give a quick summary of what I'm gonna be talking about. First, some introduction in case you didn't catch the previous talk, in case you didn't, you have no idea you were slipping in the talk or something. I don't know how you would, it was a pretty good talk. I'm then going to go into the technical details of how it works and as I explain how each little bit is working, I'll also explain to you why this little bit might be buggy. And then, I'll give you a couple links, a couple QR codes to where you can see the list of bugs that we have open, and then you can pick your favorite and I'll give you a little request to help us fix them and some contact information if you're not comfortable just throwing an email to the void of the mailing list and you would like to talk to someone who you think is a person. Right, so let's go from the start. What the hell am I talking about? Or first, who am I? Hello, I'm Guinevere. I've been hired by Red Hat to work on GDB. I've been doing it for almost three years and just recently I've been appointed one of the maintainers for GDB for one of the specific areas that does reverse debugging in GDB. And one of the things that I like to do is help people get into contributing to open source. I always wanted to contribute to open source when I was in university, but it always felt like an impossible task. I would need to be like some sort of genius to do it. And then as I started doing it professionally, I realized that there are some people who aren't geniuses like me who are doing it and I wanted to spread that around. And what is this GDB that I keep mentioning? In case you don't know, GDB is a very famous debugger for C, C++, it's been around for like 30 something years. Not sure how many, more than 30. But yeah, it is basically, if you're a time wizard, it's your best friend. It can slow down your program, it can stop it altogether and just, and as you can learn, as you just learned, it can also make it run backwards. I call it time travel debugging because it's much more fun than reverse debugging, let's be honest. It lets you undo instructions and full statements and even maybe sometimes start, go to the very start of the program. And it's very useful for a wide range of things from race conditions to just logical problems to just understanding the code that you don't understand. All of this was mentioned in the previous talk. If you didn't manage to catch it, I'll give a quick run over and then you can use what I teach here to go back in time and see the previous talk. So, since a lot of people were here in the previous talk, not many of you will be asking, many of you might be asking how is that possible, not many of you would be saying that's impossible, which was my joke. So I'm just gonna go and explain like, how is this possible? Because CPU is not meant to execute backwards. It doesn't have a way to just undo things. So let's go with a simple instruction. This is an x86 instruction, just adding one to a region of memory. And this one sounds like it would be very easy to undo, right? You just need to subtract one from the memory. You can do it arithmetically. Sounds like it. It's not quite that easy because whenever you use the arithmatic unit in the x86 CPU, it overrides some stuff. So you cannot just undo things logically. The best way to do it is to, instead just remember, hey, I'm looking at this address of memory. It is this long and it has this value. And then you save that in your program. But as I said, this is the arithmetic unit. So we also need to know the flags that were there before because it's gonna get overwritten. And every single instruction that happens, it will also increment the instruction pointer or the program counter. I use those interchangeably here. So we need to remember that. And if you basically created this in your program, inside your program, and then you added some markers here and there to say like, okay, between these ends is a single instruction, then you get exactly what GDB record pool does. This is the area that I'm most familiar with and it's the area that I maintain. It does exactly what I showed you and nothing more. So there are good things and bad things about using this version. A good thing is that it just comes with GDB. You don't need any extra things. It can fully reconstruct the program state to any previous state. It's not something that every single way to do it can do. But the bad thing is that it is really, really slow. If you think that the twice as slow thing that was mentioned in the RR is bad, try, I don't know, 20 or 40 times or even more. I'd never stopped to test how long it is. It's just unusably slow at this point. But it's really nice. We should make that better. And it's a little harder to support because we need to teach GDB every single instruction that we want to support for every single architecture. There's nothing that says only works in these architectures of that other than people putting their time into teaching the GDB disassembler. And as you can see, there are a lot of, as you can imagine, there are a lot of possible things that can go wrong. One of the things, like I said, we need to teach every single instruction to the disassembler. This is a QR code for a couple bugs that have been filed for missing instruction for this, missing instruction for that, missing instruction for this architecture. And also, if you like, making code neater, if you enjoy making it more readable, the disassembly code for x86 is a complete master. There's a single function with over 3,000 lines and unreadable other functions and members that are just a single letter. Please help. But that is just a single way to do it and a very small example. So let's look at a little bit of a longer one. Let's say we have an instruction here at this program counter. And then your program goes into this instruction and this instruction, which, so you can see that this was a jump and then it continues executing everything. You can see exactly where your program went through, right? And if we were saved exactly this information and just a couple bits more, like how long each instruction is, what kind of instruction it is and everything, we could have a very good idea of what path your program took through the code. And we could maybe not recreate everything, but we can understand like, hey, the bug is happening because there's some logic wrong at this point, which is making us take a wrong if somewhere. This is also the NGDB, is the bit trace recording. This relies on a feature of x86, I think Intel only, but don't quote me on that, which saves the whole path in a region of memory called the BTS. And then whenever the inferior, which is a program that's really debugging, is stopped, GDB looks at that region of area and does all that information of like, it's this big and it was this kind of instruction. It is again good because it's in the SQL tool and compared to the other version, it's pretty fast. It's no, I don't think there's any big slowdowns, maybe like two x and three x, which when we're talking about recording the whole execution is kind of all right. But like as I said, you cannot reconstruct everything and it's hardware dependent and it needs to be in the hardware. It's not like we can do anything to improve that. It has a couple of issues with test suite regressions, you hit some assertion errors and there's some usability problems with like not being very clear when you can or cannot do something, but it's not an area that I looked much more than what I needed to make this talk. So I'm not familiar with the problems. If anyone found it interesting, we can still chat. And well, I've been talking about, looking at each instruction at a time and you have this kind of execution style. What if you had, you made like a whole checkpoint, everything that's happening in your system at the very start of the program and then you keep going and then when you reach a certain point, you create a new checkpoint. So then you can fully recreate whatever is happening at an earlier stage and you can keep going. You cannot step a single instruction back, but you can step a lot and go forward some. This is what our R does. And this is why I got confused when we didn't say there was a live in theory because no, it has to have a live in theory. You create a checkpoint, you go forward, you create a checkpoint, move back and step forward. At least this is how I think our R works. And there's also a tool called UDB, which I've been told does that. It is proprietary. I have no idea how that works and I'm not all that interested in it. But yeah, and then what our R does, as you have all seen, is it creates a way for GDB to control the fewer. Yeah, it does that by creating a GDB server. I'll talk a little bit more about that later. But yeah, so those are the three main ways that I know of doing reverse debugging now. But once we recorded the thing, how do we use it? You using the GDB front end, which is the part that handles your commands and everything, you can do it two ways. Using reverse next, reverse step, and all those commands that were explained in the RR talk. Or you can actually just say to GDB, hey, I'm going to be going backwards using set execution direct reverse, and then just say next step or whatever, and it's going to understand what you want. Because actually behind the scenes, if you say reverse next, it is just doing that, setting it to reverse, and then executing the instruction, and then setting it back forward. So it does exactly the same thing. And when we handle the command, we try to make use of as much information as possible for going forward, as much of the logic as possible. And only when we know, okay, this part has to be different when we're going backwards, then we add a specific case like, okay, if going backwards and blah, blah, blah, then do this. And when we're doing that, and assuming that everything works until proven otherwise, what could possibly be buggy? And RR, like I said, it does a very smart thing. It tries to do as little as possible. It just creates a GDB server, which can control the infuer, the program that you're debugging, and just that. And then it's going to open a GDB server, and accept commands from another client, another GDB somewhere. And everything of command handling and understanding, and saying, okay, we're going to move these many instructions, or whatever that, that's all handled by GDB. All RR does is reset the information on the program. So yeah, what could possibly go wrong with this kind of setup, right? So, so many things. The fact that there had to be two whole talks explaining why this feature is nice and it exists, should tell you that this is not a very well-known feature. It's not something that you see many people using. And yet, there are over 30 bugs filed for it. In a feature that no one's using, that's kind of crazy to me, because if people were using, that would be just so many more. And along with things actually going wrong, there's also confusing things, and just user experience problems. So let's go over a couple of them. A command that's very, very useful if you're very used to GDB is the command until. You can tell it to go until a loop ends, for instance. It just does not work in reverse mode. Or, well, if you say reverse until, it just does not work. If you're setting the execution to reverse, it works just wrong. So yeah. And there are some commands, for instance, record instruction history, and function call history. These sound like it should work for all recording features, right? But yeah, no, they're only available for Btrace, and there's no way to tell as a user. There's nothing in the help text. There's nothing in the name of the command. There's not bug open for it, but there's a Stack Overflow question, that's the why. So yeah, that is part of the UX problem. And another UX problem. If you're used to GDB, ignore the last 30 seconds. But if you're not, at this start here, this is a GDB session, that says we are right before executing right before calling the function setup. So when you're going forward and you say step, you wanna step into the function setup. What this execution log is showing is that if you say reverse step, you do not step into the setup function. You step through the previous line that was not printed, because yes, that makes much sense. It's something that we've talked about in the mailing list before. It's not a trivial problem to solve, but it is a real problem. And there's another problem here in this very execution log. I say continue to move forward. And then GDB says no more execution history. And my very scientific testing of asking one friend has revealed that this makes it sound like you cannot execute forward anymore and you have to start again. You can, it's not gonna be stimulating, it's going to be running. So from the audience reaction, I think more people are confused by this. We have a couple of user experience problems. And if you like a challenge, if you don't want something easy to start, we have really hard issues. We have problem with multiple inferiors, because GDB can open multiple programs to be debugged. And there's no way for the recording to know which program is being recorded, actually. And there's lots of problems with handling signals and things like that, because this was introduced before GDB could do that. So no one ever looked back. GDB recording itself has a problem with multi-threading programs, because I showed you all the information, the memory, the region of memory, and the value. Where do I put the thread information there? Yeah, we don't record multi-thread stuff. So that's one reason to use RR. Until we fix that. So please help us fix that. I want people to use GDB. And as I said at the start, it is just unusably slow. We need some profiling to be done. We need to figure out why is it so slow and figure out how it can be faster, so that it can be more used and people can find more bugs for me to continue working. And then a question that some people might be asking is, where do I come in? Why am I giving you the talk? I said at the start that I like reverse debugging, and I like getting people interested. So if I said anything here, that's not like an interesting problem or an interesting thing that you would like to see how it works and how it can get fixed, just hit me up. And we'll chat and see where it goes. Does anyone have any more questions that I haven't answered yet? Okay, yeah. Yeah, in the previous talk in the era, it was supposed to enable some flag of the kernel, something like ourinary, or something. Would you know why is it necessary in the internals error needs to get the... Yeah, and also I said it was because of security reason, and I don't know if the person who has that, oh yeah, he's still here. Right, so the reason we need the perf flag is because as far as I understand it, again, I didn't look at our, but as far as I understand it, whenever a perf event happens, we get the checkpoint. So if a perf event would happen, we dump everything, and I think, yeah, so if you need that flag to make, to read into perf events and get that kind of internal information of another program, you would need that for RR as well. So I can't answer questions, so do you know, like can you provide some examples of perf events? When does it happen? I'm sorry, I can't because, again, it's not my area. I look at this as similar stuff, sorry. I think you were first. So as I understand, I may be conscious, but you record whenever there is writing on the off memory. No, when perf events happen. No, I'm not talking about, I'm talking about GP. It records every instruction. Yeah, every instruction, but for reverse, reverse, continue off. Reverse execution, yeah. Do, with watch point work, or is it? So the question is, can you use watch points even with GDB recording? And the answer is yes, you can. Most of GDB has no idea that a recording has happened. We sort of like separate what is handling commands, what is dealing with threads, what is dealing with the CPU itself, and somewhere along that stack, there is the part that goes, oh wait, you're trying to execute inverse. I'm not gonna send that to the CPU, I'm going to do it myself. And that facility has no information about like watch points and everything, and conversely, the watch point stuff has no idea that that's happening. It just will check later if that has happened. So yeah, everything that works forward, works backwards, except for changing the state of your program, because we're simulating based on what happened before. I think that was the question, yeah. How does it work with system calls? Does it work there somehow, or is it able to record kernel space or something like that? I'm sorry, I don't know. I've never tried seeing what happens. You won't be able to record the kernel space because whenever you step, whenever you execute step instruction over a Cisco instruction, you are never stepping into the kernel space. If you want to debug the kernel space, you need to basically debug the Linux kernel throughout QM for example. Then you can actually debug both the user space and kernel space, but otherwise no. So it's not gonna be able to handle like the side effects of a Cisco, but it knows that a Cisco has happened and does everything else basically, I think. I think Mark was first. So the multi-threaded case, can that ever work? Yes, I have a couple of ideas how. You can have basically like multiple separate histories, one for each thread, or you could have an extra field for each thing that says this is what thread X or for thread Y, or you could have, you can order things to like a single instruction per thread. There are a couple ways that I have not tested at all, and I don't know if any of them work, but I don't see a reason, like a theoretical reason why it would just be impossible. The thing is, you, one thing is the history, your log, what we record. The other thing is that we would need to serialize execution also because if you have two threads if you serialize it, it's still. You don't know which one changed memory, so how could you know if it was thread, two threads are just poking at memory, changing memory, how would you know which one it was? How would you know which instruction caused the side effect that you're seeing, so we would need to serialize, meaning the way this works is basically single-staff type instruction. So we would need like single-staff thread one, and then single-staff thread two, before that, you know, do a round-robin thing. Yeah, which would make actual, yeah, it would be even slower, but it would make my rate conditions more likely, so maybe it's a good thing, I don't know. Did you kind of just put back in a stop mode? Yeah, this is, like I said, complex issue. If you want a challenge, let's talk in the mail list. You would need to provide guarantee the both forward progress and everything. That's a real mess. Let's talk in the mail list. It's a little complicated for right now, I'm sorry. If you have multiple threads that where you were trying to find a base condition, then if you know what thread is using what memory, and you know that because you recorded that, then you can tell the user, hey, these threads are at that time competing for the same thing. Yes? I think you also need to track all the move-exes and stuff like that, because if you don't, then you don't know if they are really race conditioning or not. Okay, so I'm just gonna repeat in case anyone's watching from afar. The question or comment is saying that if we know all the threads that are trying to access the memory at the same time, we can tell the user that a race condition is happening. And in theory, yes, if we keep track of the move-exes and everything, but the problem, again, the recording part is very far away from everything else from GDB. So unless you manage to do this recording, and then later you also create a command that does that kind of querying into the data, there's no easy way to get that information available to the user. We're not set up to get this kind of low-level stuff right out easily. Yeah? I have a question. It's not a thread-based question. Thank you. But actually, I work a lot with microcontrollers, and for example, with flash memory, four megabytes of flash memory, something like that. And I'm just wondering whether how hard would it be to make also GDB time traveling work on such microcontrollers? I guess the memory space is kind of an issue. Yeah. First off, if you're debugging GDB in the microcontroller itself, which I don't think it would be because GDB is big, then memory becomes an issue. If you're not, I don't know if GDB server is set up to do that. And if it were, it would make the same memory issue. So we would need a facility to get the disassembling information into GDB itself and then send it back to the GDB server. The problem is because also because you have scalability, and then you have distribution and you need to... So yeah, that's a complex use case. This backhand, the... Recordful. Yeah. This is all inside GDB. So if you're remote debugging, you don't need to teach the server anything at all. It's all being recorded on GDB side. Oh really? Yeah. Huh. So you can use Linux GDB server with this and it works. Okay, I'm surprised. You can open OCD maybe in this case, but you need some kind of... So what you need is to teach GDBs reverse debugging engine about that instruction set. This only supports X86 and... I think it does ARM and something, Power or S390. There are a couple of architectures that are partially supported. You need to basically create your own disassembler from scratch, unfortunately. Yeah, there is a disassembly engine inside GDB, but it only creates text. And I try to backport it, but... So right now, you create your own disassembly from scratch, it's easier. Yeah. Oh yeah, sorry, we're out of time. So we can talk more at the hallway track. Or probably tomorrow, because I'm gonna be managing everything. But thank you for coming and I promise... If anyone would like to contact me, these are my contact information. Yeah, thank you.
ROCgdb, GDB and AMDGPU debugging
Hi everyone. So I am, that's for learning. I am Len Slott. I'm working for AMD actually and I'm also the maintainer of the AMD GPU back in AppStream GTB and I'm going to talk to you about a bit of how to debug program running on GPUs, on AMD GPUs because I don't really know much about the other one. Out of curiosity, with the show of hands, how many of you have no idea of things that are working on the GPU actually? Yeah, okay. I'll try to go through that a bit so you have some understanding. So my plan roughly is to give you an overview of what the architecture is of our GPU and the effect model of that and what is the programming model we use to work on that. And then we'll go into how to use ROG-GDB, which is the downstream fork of GDB we use at AMD, which is a support for debugging GPU programs and we'll talk about where we are at supporting GDB in AppStream. There, supporting AMD GPU in AppStream. So a very abstract view of a GPU could be that. So we have on the top left, global memory, which is your VRAM, like your RAM of your GPU plus all the host memory you have. That's virtual memory, that's like in Linux, everything else, that's a 48-bit address space, so we use 64-bit pointers. And yeah, pretty easy to do so far. Then, just below it, we have what we call the Scale-Up General Purpose Registers, so if you were to x86, that's your A or AX or BX and so on. One small difference is instead of having like eight of them, we have a hundred of them ish, plus some status registers. So that's the easy part, but we still have like the two-thirds of the diagram to go through, so we have a bit more than that. Next we have this big block, which are the VGPRs, which are the vector general purpose registers. You can see that the first approximation, like your AVX ish vector registers, with some differences. Our system, those vector registers, so they have a fixed number of lanes, so you can see a register of like a array of 64 elements, each element is a 32-bit value. And when we do vector math, that's going to be pretty much what you would expect with AVX, so if you do a vector add of v0, v0, v1, you will take like the value of v0, line 0, you add that to the value of v0, line 1, start that in v0, line 0, and the same for every lane in parallel, so everything is happening at the same time. On top of that, we have what we have on the other side, on the top right, which is some memory which is going to be dedicated for every lane, so each lane, each one of those 64 lanes will have its own pool of memory. So we can have like vector load instruction, which will take an address, and that address is going to be an address within that particular piece of memory specific for that lane, so you do one load and you load 64 value at a time from 64 different address spaces. It can be a bit tricky sometimes. And that's the basic, the base for like a compute element we have, and you can take a couple of them, put them together, and you will have what we can call a compute unit. In a compute unit, you can group some of those compute elements together, and they can talk a bit in exchange information via yet another address space, so that's a Perth red block memory here, which is a 32-bit address space, and they can have some synchronization primitive within that CU. And then you take multiple CUs, glue them together, and pretty much you go to GPU. That's what's quite few. So that's a very, very abstract model of what a GPU can be. The way we can program for that is going to be usually using the heap programming language, which is quite very similar to CUDA, to be honest. That's a single source programming model where you will have part of the code which will execute on the host on the GPU, and part of the code will execute on the device, on the GPUs. So here that's kind of the elbows of GPU programming where we do a vector addition. So a bit of setup, we just initialize some memory, we copy it to the device, and we have here we submit some work to the device to be done. And when we do that, we describe the geometry of our problem in terms of one, two, or three D space. So we describe the size of a block, which are like how many elements are going to be running on same CU, and we say how many blocks we want, so how many CU can, or how many workgroup can work concurrently. Not necessary. How many workgroup we have, they don't have to synchronize in any way. That's a very, very fast, because I didn't talk much time, in terms of what that can look like. And now the question is, how does everything look like inside GDB, and rock GDB? So those elements, like the fundamental part which executes is what we call a wave. And in GDB, we map one wave to a thread. So when we do info thread in GDB, you will see a bunch of threads as usual, and then you will have those AMD GPU waves. And this AMD GPU wave, so that this collection of those vector registers, those scalar registers, and they're working together. Each of them will be running like those 64 work items at once in parallel, in pretty much. So just so everyone is not too bit confused, that target ID, the way it's built, that's so you can identify where that thread comes from. So basically it's built that way. We have the agent ID, which is like the idea of the GPU. We have a QID. The Q is the mechanism you use to submit work to the GPU. Then you have a dispatch. A dispatch is a unique of work. And your wave ID. And for the convenience, you have XYZ coordinate of your work group and your wave index inside that work group. And from there, we do have also, if you want, info agent, info queue, info dispatch, which can be used to animate the live dispatch using agents on the system. And now we get to the trickier part, which is one wave, which has 64 lanes, is going to be executing 64 work items in parallel, concurrently, and all at the same time. And so among all the scale of registers which are shared for the entire wave, you have one which is a PC, like the instruction pointer. So that means that you have what you would think of 64 different threads in your source program, they're going to be running the exact same instruction altogether inside the GPU. And each one of those work items will map to a lane. So that's one slice of a vector register plus a given address space. So GDB has a concept of current lanes the same way we have a concept of a current thread. So when you step, if you have a lane selected, you will be presented with the same lane again. And so to be a bit consistent, we do have an info lanes command which works like a bit like the other one. That goes so fast. Sorry. How is it for the community? Yeah, I know. You have a lane command you can use to select a given lane. And the lane ID is constructed a bit in the same way that we have thread ID where we have the agent ID, the queue ID, the dispatch ID, the wave ID. And then within that you have your lane ID after the slash and you have the coordinate of the current work item inside the work group and the coordinate of the work group inside the grid which is like everything. And so now the big question is maybe all of your threads are not going to be doing the same thing. So if you have like if full of lane ID, you do something else, you do something else. And although that work because everyone like every lane is going to be executed exactly the same instruction at every time. The way it works is by using lane masking. Basically the vector operation are going to be configured with like a mask register so some lane have some side effects and some don't. So we will turn the effect for some lane as no op. So basically the case where the lane, the condition is true is full. We will actually execute the full, we will be an op, else we will actually reactivate the lane, do what we want to do and the else on the other end. When we are in the if branch, we will actually do have some effects so write the element and otherwise we will deactivate the lane for the else and that's going to be a no op. And so if you were to step like single step within GDB with that execution model basically that means that every instruction is going to be executed. So you test if my lane ID odd, apparently it's not, no apparently it's odd so we do the else but what the fuck we do then. So we execute everything and GDB has some support to avoid this kind of confusion which is we will just don't stop when the current lane is inactive. So we will step as expected if we have something odd we will do our test, go to the else branch and we don't stop in then branch and we continue. Cool, that's the basic of our like the execution model works. Now we get to the tricky parts. As I said before we have multiple address spaces and when we have an address space basically when you want to load data from that memory or store data you will need an address but then contrary to what you have in the CPU world if you just have an address that's not going to get you very far because you need to know what address is going, the address is going to be an offset in a given address space and so you need to glue those two information to have something that actually makes sense. And so we have this address space found offset notation which we use and that can be used through our GDB. Yeah, so I'll go very fast so that pretty much what we have. Usually things you know you will read the slide I don't have to go over that. One question we have and what difficulty we have is to describe all that and especially all the address spaces and everything that's not going to work very well with Dwarf where a location is usually referred to as an address and address is just a value and as I said if you just have a value and you don't know what address space that goes with you can just talk. So we have a proposal to redesign a bit of Dwarf and the evaluation mechanics in Dwarf to address that and we're working with other vendors to try to have that submitted to the Dwarf standard so that started the discussion going on but that takes time and that's not in Dwarf yet and very fast the state of what we have in, yeah sorry, so the state of the AMGGPU support in GDB so we have all the basic stuff for controlling the execution and basically we have a bunch of symbolic debugging so being able to do a backtrace, ping variable and everything which can't really do because we need Dwarf support to do that and Dwarf is not standardised yet to support that so we're stuck. And a bunch of links you can look online and that's pretty much the end, sorry, it took a bit more time than it should have and if you have any question please. So one or two questions maybe, sorry. Yeah, here. Can we use this with shaders running in GLSL or something? The question is can we use that to for shader running in GLSL? Probably not because GLSL that's going to be on the graphics get back and that's not like that's only going to work for compute shaders so that will work for OpenCL that will work with SQL like there is an implementation called the heap SQL you could use but like the graphics pipeline would be different and that's not going to work, that's just for compute. Yeah, we have a question over there. You mentioned waves are kind of like threads, I didn't know is it represented the same way as threads are in GDP or do you have a separate thing and I was going to ask is it architected to make it easy for debugging? I guess that's my real question. So that's two separate questions but I guess would you say this is easier or harder to answer? Yes. And from the hardware perspective, yes. One thread is one wave and that's what we show in GDP if we go to the very beginning. We show in for threads you will list waves and once you have selected a given thread you can select a lane within that thread. Cool. Do you have time for one more question maybe? Maybe one question. Yes. My question was about like you put those efforts for the upstream but the final decision will be like it will emerge to upstream and we don't have to use separate product or it's still with like a separate product. No, our goal is to have everything upstream. The sooner the better but we cannot do that completely today mostly because of dwarf issues. To have dwarf which can be powerful enough to describe what's actually happening on GPU. We need some fundamental changes in dwarf and so to get our change in GDP we would need to support that dwarf six-ish but dwarf six doesn't exist yet so like that's pretty much one of the big things which is holding us back. Somewhere in the future the idea will be to be upstream. Yes, that's our goal. That's what we want to do. Okay. And yeah, I forgot to repeat the question. The question is do we intend to have everything upstream or do we want to keep having work GDP as a separate product? We don't. Sorry. Time's up. Thank you. And that is it.
GDB on Windows: status & plans
you you you you you should I start over? No. Hello, checking, sound check. Sorry everyone at home. Alright, so not asynchronous. So you move this to separate thread and then there's a way for one thread to communicate with the event. Something happened. We did this change in GDB more recently in GDB 13 before that the debugger really blocked. Like you continue the execution, you couldn't do anything else until the inferior stopped. So that was something that was upgraded in GDB 13. So now, skip the slides. So now in GDB 13, this is something that's more important to ID. That the ID you can press the continue button, the inferior is now executing. But the ID at the same time now can execute GDB commands like disassemble or install new breakpoints, search symbols, things like that. Now it can do that while inferior is running. Well, before it couldn't. The ID would have to stop the whole program and then do something. Going back, this other function, the counterpart of the waiting for an event is when you continue the event. You have this argument here, this parameter where as argument you can pass either one of these two macros. And this is basically, you know, like in GDB, when you get a signal and then you can decide to pass the signal to the inferior or not. You can suppress it or pass it. So when you pass it, it calls the signal handler in the inferior. There's something like that on Windows. Not that important, but it's similar enough. But they call it exceptions, not signals. And this function here, you decide whether to suppress the exception or not. Will the inferior continue processing the exception or will it be suppressed? And it's important to know that you do this decision when you call this function. I mentioned this already. All right, keep that in mind. So this is very basically how the debugger internally works. All stop mode is default mode in GDB, how everyone knows how it works. So here we have five threads. This is time, time period one. Everything is running, runnable. And then T3 is about to hit an exception. So it hits an exception and you're calling wait for debug event. It returns saying an event happens and the kernel freezes everything in the process. All threads, the elements, they're frozen. And this one got an exception. At this point, the user is now inspecting the program, debugging the actual bug, reading memory, backtracing, blah, blah. And finally, they decide to resume execution. And then that's when GDB calls this continued bug event and then passes that decision of whether to suppress exception or not. So it's late, it's here. And then all threads go back to being runnable again. That is if you want everything to be running or stop and then everything running again. There are times where you'll want to only resume one thread and leave everything else suspended, frozen. Internally, GDB needs to do this, like to step over breakpoints. But the user may also want to focus on a particular thread leaving every other frozen. And we do that, the user interface is to enable the setting. This doesn't work currently upstream, even though internally everything works because GDB needs to know how to step over a breakpoint. But it's never been exposed to the user. Nobody wired up this to the back end. So I did a little change in my work and it actually works. So it's the same as before, the exception triggers, user inspects the program and then decides to resume T1 instead of T3. And GDB suspends everything else and then calls continue debug event for T3, because that's where the event came from. And now T1 is runnable. But what if you want to do the converse, which is instead of running one and stopping everything else, you want to stop one but leave everything else running. That's what's called the non-stop mode. And this is what I wanted to make possible in Windows. Because this is supported on Linux since 2008. I know because I worked on that. So a long time by now and also supported on remote targets, meaning GDB server for Linux, but also some other embedded systems out there. They support this mode as well. But native Windows debugging does not support this. So non-stop mode means only the event, the breakpoint, reports to stop to the user and everything else continues running. This is interesting, again, mostly for IDE's. You can imagine a big list of threads and then only one of them reports an event. But it's also interesting because maybe one of the threads is important to keep running because maybe it's a watchdog or something that needs to ping a server and if it stops pinging, the program doesn't work. There's something on the other end that needs to see this while you inspect some kind of debugging triggered by some thread. And the reason I thought all over these years that Windows wouldn't work for this is that, well, we have this problem. Wait for the bug event, that magic function reference in the event, suspends everything already. The kernel already does this. There's no way to tell the kernel to not suspend every thread except the one that got the event. And we want to leave them running. So naively, I thought maybe just immediately suspend, block, freeze the thread that you care about, and call continuity bug event, right? But you can't because this is too early. We just got the event. The user hasn't yet decided whether to pass or not. The exception. That only happens after. And I was looking up this past year, and I noticed on the Microsoft website describing these APIs that they introduced a new flag to continue debugging events. And I read this and I was like, really? It's like they wrote this just for me. Hey, they're awesome. Well, it's not the ideal thing that I would like. I would like to have a way that the kernel doesn't freeze everything. Still freezes everything. But what they do is, if you pass this flag, what you're saying is, I got the event. Okay, cool. But I don't want to handle it right now. So, but I call continue. And I'm asking the kernel report the event again as soon as the thread becomes runnable. So what I do is, I call suspend thread on a thread that got the event. So it's no longer runnable. I call this continue the debug event function saying, get me back the same event again once I become, make the thread runnable. That's what he's saying here. Well, in other words. How does this actually work in practice using the same diagram as I showed before? I'll prototype this quickly with a hack and it worked. Amazing. Now I just need to make it clean. And of course that's, oh, sorry. So, some years before everything is runnable. T3 is about to raise an exception. It raises an exception. The kernel freezes everything. There's nothing I can do to control this. And then I freeze the thread that got an event. And then I call this function with this new magic macro. And then GDB remembers that T3 will get a repeated event later. Now users is inspecting the thread, but everything else is running now. Right? So the kernel paused all the threads, but I immediately told the kernel, we resume everything else. So there will be a small freeze. There will be increased jitter caused by the debugger. But most of the time whole threads will be running. And then later the user decides to re-resume T3 and the debugger just calls, you know, resume thread, unfreeze the thread. And remember, now the kernel, because the thread is now runnable, is going to re-report the event. And because we recorded it earlier that we will get a new repeated event, the debugger knows, okay, it's a repeated event. Now I need, I know I need to call continue debug event with the proper flag saying suppressed event or not, the exception or not. Yeah. And a colleague of mine wondered, does this work when multiple threads hit the breakpoint before you decide to resume? Yes, it does work. You know, same thing as before. And here you are looking at this thread and this one raises an exception. Everything works. You can look at this offline if you want to. Yeah, there's a lot more to this. That's when I, okay, the hacky version works now. I need to make it clean. And that, you know, I stumble a lot of things that I don't have time to go over right now. I'm going to touch a little bit on the test suite. How much time do I have, Dodgy? Three minutes. Plus five. Yeah, okay. All right, so I put this in the abstract. The reason is when I talk about the test suite, I need to make this distinction. And when I say GDB on Windows, there are actually two ports for Windows. There's GDB Compiled as a SIGWIN program. And there's GDB Compiled with the mean GW tool chain, which means it's a native Windows program. SIGWIN, for those who don't know, it's like, gives you a POSIX environment. It's a collection of tools, but it's also a runtime, a DLL that every tool is linked with. And this runtime provides you POSIX things like signals, PDYs, and a bunch of stuff. The C runtime that's used is not the one that comes with Windows normally. It's based on NewLib. Try to be as close as a Linux environment is so that you can recompile your application, a Linux application with minimal changes, quote, unquote. It works. So it's not an emulator. You have to recompile your program. Right. So the core of GDB has two ports, like the event loop, for example, is based on select slash pull for most Unix machines, ports. And SIGWIN is one of those. But the native version of GDB for Windows, based on mean GW, has a separate event loop based on this wait for multiple objects function, which is the Microsoft version of select. Right. But the backend, the code that talks with the debug API, those functions I mentioned for it, it's shared between both ports. It's the same code except for SIGWIN, there's extra magic to make some of SIGWIN-specific things work. And this is where I get to the test suite, because part of making this work and upstreamable, and I would get to a point where I was, you know, sure that I wasn't breaking things, because this isn't making this work involved, I'm revamping the backend very substantially. So I want to make sure that I wasn't breaking things. So run the test suite, right? Except running the test suite on Windows is a major pain in the... The test suite is... GDB test suite is built on Dezhek Nu. Dezhek Nu is an infrastructure built on expect, and expect itself is built on TECL, which is a programming language. And Dezhek Nu assumes a Unix-like environment, which you don't have on Windows normally. You know, assumes POSIX shells and utilities, kill, CPE, VOO, and there is no native expect part. There was a company active state that had something like that, but they killed that project some years ago. So you have to use something that's Unix-like to run Dezhek Nu. If you test GDB on a segment environment, you just run MakeCheck, it does work. It's super slow, not stable, but it does work. But if you want to make native Windows GDB, you test that, it's not the same thing, it's a proxy, but it's not the same thing. Remember, I said that the core of GDB is different code paths. So I would want to be able to test this guy as well, I mean GWGDB. So how about we run the test suite, Dezhek Nu under SIGWIN, but make it spawn the Windows GDB? Yeah, that's a potential idea. But the problem is, it's a SIGWIN expect, it's spawning a Windows process, and the input and output is going to be connected to a PTY from the SIGWIN side, but what the Windows GDB sees is just a pipe. And when GDB is connected to a pipe, because that's how SIGWIN PTYs work under the hood, it's a pipe, GDB is connected to a pipe, it's not what's called an is at PTY, so it disables everything that's interactive, so in the test suite, it completely falls down. And something else is that the test, the Dezhek Nu, because it is expecting that the inferior is being run under PTY, so there will be terminal mode controls, time's up. But I have the five? Because... I'll tell you, if you want one minute, you can do it. I'll give you one minute. I'm almost over, just 30 more slides, no, just one more. Right, so there are some ideas to get this working, there's also path mapping issues, because what they expect sees path-wise, slash, C drive, slash, X, it's not the same as GDB. C is because it's a native program, so it sees X, colon. And another problem is that the GDB test suite, when it wants to test multi-threaded things, the tests are all written with P-threads, which is not something native to Windows, even though mean GWW-W64 does have the WinP-threads library, so maybe we could use that. I have some ideas to try to make this work, but I haven't had the time to actually experiment much with this. I tried other things that I thought would be interesting, but they didn't work. The test suite, compiling on, yeah. Right, so about compilation, just if anyone here is motivated by this talk and wants to help. Compiling GDB on Sigwin is super slow, so the way that I got around it is to cross-compile, and yeah, some things here you can do. And then I can cross-compile to Sigwin, but to run the test suite, I need to run it inside Windows, that's, I can't avoid that. But I can point GDB, the test suite inside Windows, pointed to the built GDB that I've built on, sorry, on Linux. Whoo! All right, so maybe I should skip, yeah. So test suite, bad, need to fix a lot of things, that's the thing. GDB, it's the native, yeah, this is the thing that's for the future. Make it possible for GDB to debug programs compiled with Visual Studio. That is something that is missing, it's making people not use GDB on Windows, and I would prefer people not to think about using other tools, you know, staying on the lane. So at some point I would like to work on this, but, you know, no time for that. Just leave it on the screen if people have questions like maybe one question? Nothing. All right. Thank you. So. Okay, actually there is one minute left. Is there one quick question? Yeah. Okay, so here's my question. Oh no. Have you tried using Python to run the test suite? I have. GDB executes and stuff. I, that would be writing a new test suite. Yeah, that's right. I know there's actually some people that do that, some companies, but I wanted to find a way that it can run the existing tests before giving up completely. Okay.
Online Debugging and ABI Data Services
Why are you all here? This is a boring topic. I was not expecting so many folks, but I'm glad to see you. My name is Frank Heigler, I'm a Red Hat engineer. I don't have a bio site because I'm not that interesting, but I've been in free software for a couple of decades, almost three, quite a while. So this talk is about debugging information and another type of information that we hope to popularize storing online for occasional uses. Now, many of you guys know debuggers already, all good. The other subject is a little bit more esoteric, but we can still talk about it. Do you know if Mike is coming in okay? The mic is just for the recording. I know. By the way, who's the next speaker after me? Is that person in the room? Good. I might be able to sell to you sometime. I'll keep this pretty short. Well, I offered you 10 bucks but too late now. So this is boring. How are we all right software? Binary has come out. Some packages up the binaries into distribution. Distribution goes out to people. People who run binaries, everything is happy ever after. That's all that ever happens. Right? Right? So debugging is near and dear to me. I worked on the GB debugger a little bit here and there, back in the prehistoric times, and I've worked on debugger like tools ever since. Despite all my efforts to try and make debuggers irrelevant, we still have bugs in our software and you still need to silly things. So unfortunately, here we are and debugging is not so easy. So I have two parts of my presentation. This first part is about debugging information that's online. The second part will be about something else that's online, that's tangential, but you'll see the connection pretty shortly. So many debuggers. You know how debugger, is everyone familiar with how debuggers work roughly speaking? Not you, Pedro. So one of the main challenges of debuggers is that it has to operate at the machine level, at the register level, at the memory bits and bytes level in order to understand the operation of the program you're trying to debug. And this is despite the compiler doing its darnedest to erase any remnant of how the original source code looked. It's trying to do its very best in nuking every unnecessary, very variable access, maybe tightening up the data structures, shuffling in and out of registers all the time just to make things go damn fast. Compilers are done that way. Well, I mean they're great, but if you need to debug, you need a sophisticated way of telling the debugger where to find all that stuff. So long introduction to that, but we need a good high quality debug info, which basically gives metadata about where the, every piece of the source level constructs are at runtime in the actual machine. So where, where to registers, which memory spots, how each complicated data structure laid out, all those things have to be saved by a compiler, put somewhere, and then made ultimately available to debugger. So does the word dwarf mean, you guys know what that stuff is? Yeah. So, okay. Can you give me a few adjectives about dwarf? From the heart. Okay. Say again. Did you say short? Liar. Liar. Yeah, so dwarf is a very compact, amazing little, most graph database kind of thing. It is absolutely nothing but, it is not short. It can be order of magnitude larger than the actual binary. And because it is that large, distributions then not to ship the thing, not to trip it to normal users because, you know, it's like, like I said here, users just run things, right? You never debug. So they don't get the debugging phone normally. But say you do run into a problem, you don't want to debug, well then you need this information, right? So either you can be the developer who already had this, or for the last 20-ish years, various distros made available, the original debug data that the compiler generated, but it's not installed. It's somewhere off in a separate repository you have to sometimes enable and change the route and download and if you're lucky, you get the corresponding debug data that for the binder that you're trying to work on. So the brand new 2019 thing, which I remember when my team talked about here, two years ago, when it was younger, is this gadget that we, a community built called Debug Info D, which automates the distribution of the Debug Info and other precious such things. And the whole idea is to make it as easy as possible for people, not just developers, but ordinary users to automatically, without special privilege, get all this stuff for as much of the system as possible, you know, without having to go into route, without having to do, you know, activate channel, rel-debug-blah blah blah blah. Okay. So that's our little baby there, the first URL points you toward a website that describes the current situation. As I said, the project is now getting to its third or fourth year, so it is, I cannot call it a prototype in any sort of honest way, but work is still ongoing quite a bit. When we built this, it's a small server, part of the ship with the AlfvTills tool set, which is related to Alf and Dwarf decoding and processing and such, and there are a lot of low level, machine level tools in there. So Debug Info D is shipped with AlfvTills and it's shipped on all the major distros that I know of. All right. It, I forgot to mention this, but I mean, all it is, is it allows a debugger type tool to request Debug Info as well as source code for any binary based on the Hexadecimal unique build ID that's inside binary. So this is a kind of a hash code that's been in binaries for almost 20 years, thanks to Roland McGrath and a bunch of other people who made it happen back way back in the early odds. So it's an HTTP server, it's just an ordinary boring HTTP server as far as the clients are concerned. It's very cashable, it's very lightweight, very, very simple, no like XML, API, blah, blah, blah. It's just HTTP. Because it is trying to be really simple, it, we found it over a course of a couple of months to year, most major debugging type tools grew capability to use this API, use this web of system to fetch this stuff. So obviously, GDB is one of them, it was one of the first, but system type is another tool of this kind, it's close to me. Practically all the debuggers and tracing tools and profiling tools we know of is able to do this now. So the clients are well dispersed in a source code system. The servers are also in really good shape. Over the last few years, the whole bunch of distros came online with running their own debugging for the server. So Fedora was one of the first, and Bacento is up there, Debbie and Ubuntu and other smaller distros, they're all running this server now, whereby their own distro is fully debuggable through this system. So that's cool. The, we're not quite finished with it. There's a piece of extension stuff that we're still working on of one of them is that's particularly cool is cryptographic signature preservation of individual files. As you may know, archives as a whole can be signed by distros and then a client can verify that the archives have been modified, and that's cool. But if you don't want to download the whole RPM because it's too large or for various reasons, you just want to extract the actual source file that you want or an actual little debug door file. You still want to be assured somehow that that file was what the initial distro packaged, right? You don't want to make sure it wasn't adulterated somewhere in the middle. It's kind of security critical. So we're bringing into this web protocol the propagation of the signatures that may have been applied by the distro at the build system level. And that's, it's not easy and not many distros do that level of signature stuff yet, but Fedora and very modern rail do, and we hope others come online too. But what's nice is that each individual file has its own crypto signature, which can pass it down through debugging for the, all the way to the clients. So they can be assured they get the correct 100% grade A certified file. Alrighty. Psh. I couldn't bring myself to try a demo here. I was just too chicken, but, but the whole idea with this debugging for the clients stuff is that it is really automated and integrated and you don't have to do anything special. On the distros that this is enabled on, you just don't have to even do the first line. It'll be done for you in the ATC profile for all your shells. And you just run GDB on any random binary or your own binary and it'll pull in the debugging for for any shared libraries that you're using, any source files you're stepping into, it'll just pull each piece down one by one as necessary. And it's just, it becomes a non-problem. So there's almost nothing to see because it's just so smooth and automated. Parts of it can be slow for hilarious reasons, but I'll explain why if someone asks me that question. Anyway, it is nice. It is out there in many of the distros. I hope you guys enjoy it and I hope it makes your lives a little easier. That is the thing you ever encounter bugs. All right, all right, all right. So switching over to the other sort of health topic. All right, does everyone know what ABI means? Can there be someone person who does not know? So I can justify talking about it? No, thank you. I'll just be brief. As brief as possible. So it's interesting. There's a lot of interest, especially for ISVs who want to build a piece of software and then distribute it, let people run it on multiple distros. But even for normal projects who might want to build a binary of their own releases and then shift that to various other distributions. You got that? To shift it on various distributions so that it can be used on modified. Sometimes they've other problems like wanting to match different generations of shared libraries which might have had little evolutions of their own ABI whether a function signature got changed or type got changed. Something that's not the same at the binary level than it used to be. Which means that shared linking between them is no longer safe. Some projects, some shared library projects are exquisitely careful about this and they do incredible measures to prevent this kind of breakage. When they update their shared library it becomes backward compatible to decades ago by a lot of the works. Like Jalipsi is one of the best in this regard. But some libraries are less good at that. So if you want to write a binary that will work with multiple shared libraries you may need to either kind of ignore the problem hope it doesn't happen or you need to find a tool to check whether this will work with that. It is a bit esoteric. But there are lots of solutions or several solutions which try to work around this whole problem by just giving you, bundling you the one random version of a shared library from some random distro, package it together into a container image or a flat pack or whatever and just plop the whole thing on your system and then they've done the integration checking and then they know it'll work. It's legitimate. It's just it's very space obnoxious and some of them still kind of intermingle the bundle libraries and the host libraries and they kind of do version checking and they hope that the host's libgl will work with their version of libxt or whatever. So even this is a bit fuzzy. So anyway, what we're proposing is that projects that deal with multiple versions of shared libraries that they're concerned about compatibility checking for ABI's consider the gadget I'm going to talk about. Okay. You know, maybe we'll just skip this one. Everyone, but everyone knew. What's there's still a person who I didn't tell what the ABI was? It's just, it turns out to be exactly the same metadata the debugger uses to find variables at runtime. This is exactly the same data. It's just that it happens to be useful to examine even at compile time. So even with just the dead libraries on disk by parsing and processing the exact same debug info, you can tell whether that shared library has the same binary guarantees as a normal program might require. Sorry if I belabor the obvious guys. Okay. So our team at Red Hat, one of the tools they work on is this gadget called libabigale. I'm not sure who works on that, that guy there. Yeah. And it's awesome. It's a suite of tools, binary tools that compare shared libraries versus shared library by extracting their debugging info basically and just parsing it piece by piece, function by function, type by type, make sure they're all compatible with each other. It can also do match a binary to a variety of shared libraries and see whether there's still meet each other's needs. Like a good marriage, maybe. One thing that it's limited though is that to do this work, it needs to have all the files that you want to compare right there on your local disk. So if you want to compare your binary to a RAL6 version of libc or libxgl and you bunch a version, you need to somehow get hold of those files first and you can't really just do it otherwise. So our gadget, the new gadget we're adding to libabigale is a way of not requiring you to download all these shared libraries and all their corresponding debugging info for all these versions of distros that you might not even have or not even want. Just curious about. And a key to that is to realize that abigale also can take not just dwarf files, but also an XML representation of the dwarf. And XML is just conversion. It's a subset and conversion. So that's my four minute warning. We're doing okay. And because it's XML, it's large, but it's textual and it's compressible. And with the one track mind that I have, how can we store a large amount of XML for all these shared library data for a large distro? It's text, it's large, you want to share it. Well, how, oh, no, that's not the next time. I'm gonna leave that in mystery for 20 seconds. Oops, one moment. Two moments. It's pretty soft for it, don't worry, it's good. Ha ha ha. Okay. Yeah, we just skip over here. So writing a little tool is really just a thin wrapper around the existing, the abigale tooling to extract this XML version of the ABI. Jamitin, Git, because we love Git. It's a great way to store text files. And it's a great way to ship them, great way to compress the heck out of them and let them coexist in some nice way. So we can extract XML from a large corpus of files. We can give it a whole boatload of RPMs or Debian's or whatever, it will automatically extract all the shared libraries. It'll download all the debugging files and automatically via debugging 4D, if necessary. And it will generate a Git tree, which has all this XML stuff nicely structured that then can be used by the tool itself to later do a compatibility check. That way you don't have to install the foreign distributions anymore. Anyone can do you the honor or the favor of collecting this ABI XML stuff, sharing it in Git, put it up publicly, and then anyone who wants to compatibility check against that version of the OS that no longer has to worry about this. This is a crowdsourceable enterprise. So I tried this at home, no demo because no demo. But it is really not hard to use. All the prep work is just getting the software. But the thing you think is that the data is in, when one crowdsource version of data is now a couple of gigabytes, it has a big section of Rela8, all of Rela8 in there. As RBI stuff, a few other Ubuntu releases just randomly in there to plan to expand it, to have as many distros in there as people are willing to give us. To submit new information, it looks like these command lines, this is just to demonstrate you can build your own share library at your own institution and generate your own database. This version tells you that it can just mass import whole RPMs and they'll do the right thing, decompress and aggregate all the information. And at the top here is how you check a random binary against the entire set of shared libraries that that binary needs. There are a few bits of cleverness in there, small. It's not very clever, just a little clever. For example, as you know, libraries get updated every now and then we wanna make sure we can store more than one version of the same shared library in the database. There's not just one G-Lib C but 10 per update. So they all have to have a naming convention that lets them coexist. So we do that. But those are internal details. The basic thing is you can submit to the database this way and you can check with that and it tries to be that simple. And that's my conclusion page right there. That all the code is open source obviously and all the servers are extremely low tech on purpose. The first one is a very thin HTTP server and the second one is just literally a Git server that happens to have structured data inside it. So easy and even I can do it. Very, very straightforward, baby technology. And thank goodness, that's it. Can we have entertainment for questions? Yeah, we have minus five minutes. Minus five minutes, my God. Okay, any zero questions? Thank you.
Poke all the microcontrollers!
So hello everybody, welcome to this talk. So the title is poke all the microcontrollers, but the story is GNU poke inside GDB. So we'll talk about poke and GDB more than microcontrollers, so sorry for that part. But let's go to the... So first of all what's GNU poke? So it's the extensible, you can read it here, so the extensible editor for binary, for structured binary data, which is... So what's binary data? It's data encoded in sequence of bits, binary digits, 0, 1, and like this. And we can... So there are structures there. So it means there are relationship between the different bits, okay, like here. We like grouped in four bits, in labels. And like we can assign meaning to a part of like this structure, like this eight bits as a whole to be a number 67, for example, as a signed integer of bits eight. Or we can assign a meaning like a character C as ASCII does. So it's like this is the part of the structures. And then you can have more complicated structures, like this part is the length and this is the table everything, so it's something like that. So and then the editor is that you have the CLI, which you can view the content and you can change it, hence the name poke. And it's immediate, so it's like interactive, it's not about like... It's about when you are exploring the data, you're debugging, you're doing something, you're designing a data structure, like for encoding data, that's the best thing. You can use. So and it's extensible. So it means that there is a DSL to describe this relationship between the bits, okay? So bits, we are talking about like we can address each bits. So and then it's inside of the GNU poke, we have this architecture, which is like you have a leap poke, which is the library which implements, which has these three major components, which is the first one is the PKL or poke programming language incremental compiler, which is the incremental parts means that you can add definition, declaration. It's statically typed, so it's like add the stuff to the namespace. You can redefine things. So the PVM, so it's a compiled language which compiles to a PVM, which is a poke virtual machine supported by GNU Jitter, written by Luca Sayu. And so this is, and then all the magics and the bits are lying here, the IU space. And so and then the other thing is the programming like you can, okay. So I don't know what's going on. No, please. Okay. I can go to here, ext3, I guess. Okay, so it's very not easy to see what's, my God. Okay, so you can write poke and this is a program, which is a command line interface. So this is the poke part of the story. So this thing here, and there is a poke D, which is a demon so you can send the poke code through the Unix socket. So you can make interfaces and stuff like that. There is a new component, poke fumpt, which is go through the source code and then there are some tokens. You can put poke code there, so it's useful for when you're generating test cases and stuff, you can write poke. And the result is the text like you assemble an instruction in poke and then at the end in the test result, it's a number. It's a UN32 number in hex, which is you or other tools can work with, but you know it's like easy to debug because when you're writing the test, you know it's like poke code is freeable. So and then also this is useful when you want to or working with hardware, you have a bunch of registers. You can design like which bits you want to set and then you generate a .config c file included and you're done. You don't need to be like a coding c function, GPIO, init, clock, that thing, something. You can all put write in poke and then generate the numbers, the final numbers and just write the number to the register. So GNU poke in GDB. WTF so I cannot say the word. Because GDB is good at debugging. We are not. And if you want to be, it's not possible because like after like you become a good thing after some years. But GDB is good at debugging, but maybe not as good as Gnu Poke is at poking a binary data. So this will be a happy marriage, we hope. So and the question is okay, what we have already Python integration in GDB. So why we need a new other language and the answer is right, that's correct. It's a general purpose language. You can do whatever you want to do in that language because it's a general purpose language, of course. But there is a but here. It's a general purpose language. So it's not good because what we're talking about here, like you can be what, yeah? Because the poke is a DSL, like with uppercase p is the name of the language. So poke is a DSL specifically designed to describe and poke binary data. That's the reason we think it's a good combination. And so what's the talk all about? So my initial plan was more ambitious and had a lot of things with like life hardware thinking. You know when hardware and things come. So plan is a little bit, but I have hardware here. I can, I'm not disconnected. It's here, it's real. So it's partially right, but not quite as like I wrote in the abstract. I was too much ambitious. So okay, so it's a demo for, it was my fault, not a limitation of, oh, really? Okay, so it's a demo for showing integration of leap poke inside the GC, leap inside GDB using this hardware, which is I showed you. So let's see that hardware. It's this hardware. It's ESP32 C3 module, which is a risk five based microcontroller. It's a 32 bit risk five thingy. So other thing. And so in this demo, you have to, so here if you can see that I connected these two pins together to prevent the thing to go to the state. So like it always boot up correctly. So the LED, I copy pasted this thing from image. I have the link in the end in there. So, and then it's a risk five, so you can see it. And then this is the flags for the compilers if you want to compile for this destination. So, and so the idea is for the, like the thing is you want to do the board being. So this is the idea. This is the whole thingy. So, so you'd like to the first step in board bring up is like check the hardware to see things that should be connected, should be connected, and then things that should not be connected should not be connected. So this is the first seems obvious, but very important. And then you connect it to the public supply with the current and see it doesn't like draw too much. And then the next part is this. So it's classically you can go to the C compiler and then you write things and then gradually add more stuff GPIO LEDs every laser, you know, add from a small thing. And then you add more complicated things to that. But here I'm what I'm proposing is like you have the GDB, you have the JTAC. So it's a command line interface. It's alive. It feels like a shell and you have the superpower of poke. Then you should be good with experimenting with different ICs. Writing to registers and timers and stuff, right? So why this hardware? Because it provides the JTAC debugging over USB. You don't need any external program. That's great. It's also cheap. But we need, we have to compile the GDB, which this integration is not upstream. But then the problem is I use this fork of GDB from Espresse, which is the vendor of this chip, which is on this branch. And you can find the, so, and then when you need the leap poke, which is both of those things work, and you can find it here. So the patch for the integration is old and not updated. And yeah, here you can find it. So I put back ported to the, this branch of the binutils and like ported to newer version of poke too in order to be able to show something. So let's poke together. So we need to connect, use this open OCD to just create a GDB server. So the next step is we run the GDB, which we compile with that thing. Okay. Okay. So nobody have questions. I know. So this is the GDB init. So you have to limit the amount of hardware and things, blah, blah, blah. And so this is the other part of the story. I can, okay. So the people who want to play with this thing, you can, there is this repo here. It's not, so the official SDK is huge. I hate it. So this is a simple thingy, which in the three ESP, in that branch you have three files, you have all the things you need. You can play with that. And then you have this, where is that thing? You have this data sheet, which is awesome. So have fun. Okay. So let's go to the next part. So yeah, this is poke. And you can see that we can describe numbers with weird width. So this is an unsigned one with six bits. This is, yeah, should be fast, but yeah. So it's a programming language. So, right? It's a good one. Yeah. Yeah, yeah, yeah. You should be careful when asking for things. You know, everything is good. All is good, as German said. So here we can also have aliasing for types. So you can have the things. So you in seven, you in something. You can have, okay, it's not important anymore. Okay, so this is the open OCD part, which you can see that I already did that. I hope it still work. And then here we have this GDB thingy with all the hex, the foot. I put things together so it's not clean. I did not show you. So and then I had something. Yeah, so I have to write it here. So it was risk five, risk 532LGDB. You have to have that thingy, GDB in it. So we are here, please work. So it's reading, reading flash. It's doing that. So it's good, great. And now the GDB, so it's complain, no, I don't know what you're going, I know what I'm doing. So it's okay. So because there is no file in anything, so you had no idea what's going on. So you can see that layout next. We have a jump and then we have a like weird stuff somewhere. So we can go next instruction. It's somewhere, okay? So now poke. You have this poke subcommand, something in the GDB. So you can say, like you can ask that poke, what's the read the 132 bit unsigned integer at offset, what's the address of that thing? 0x4123, 1e9c for example. And is it correct? I hope it's correct. So you have to see the same number. Okay. I cannot verify that. So you have to see the number. Okay. Oh, we can, we can, we can. We can like this, verify that. So it's a content of this. There's my mouse. Please work, doesn't work. Why? It's your fault, you know. It's 1e9c. So you should get the same not getting because of the Indianness. You have to poke, set Indian to Indian big, I guess, or little or I don't know. Okay. Also think, okay, now things are not, still not. So it was little. Yeah. Yeah. So please work. Finally things, good. Then I'm happy about that. So let's, so you can have everything. You can define variables here. You can print stuff here from printfv, something, please work, don't crash, works, and doesn't crash. So you see you have the old CLI capabilities in the poke here. And then you can do, okay, we have, you saw this thing, it's a module. So we call it people. So we load this, it's part of the standard, so it's riskv.pk. So we say pk load riskv and then five or, good. Yeah, okay, okay. So you load the module and then next, so it gives you a bunch of definitions. So what I'm interested in is, in some, here. So this is an instruction of this risk five. Please work. Okay. So you have this many variants. So either it's formatted in R format, ISP, whatever, whatever. So we want to decode that integer. We had here as an RV32, what was that, instant? Or yeah, yeah, yeah. Okay, layout, next, more next, TOI disabled, please. Okay, okay. Okay, great. Thank you, Petru. Okay, so now you have all the, this part, like immediate part and then this is like, because it was, if you do remember, it was a, we can disassemble that also. If we do this. Okay, thank you. Disassemble from here to, from here to what the hell. So C80, let's go for, no, it should be. No, no, nine, yeah. A0, yeah. Yeah, this, yeah. So here, we had this thingy here. So then, so it's, now it's a poke variable. We can call methods on that. And this is, please unmask, it's not the time to re-syntex. Okay, so you see that, yeah, we are getting the same thing. So this gives us by the disassembler. This is the magic of poke. So we have other, so you have, I don't know, here, yeah. We have this, the ATAC sheet, so there are registers, you can configure things this way. So questions, you're happy now? Thank you. Thank you. Thank you. For example, can you change the T0 on the flight to some other register or 260 so you could patch it on the. If the function is inside this, in the RAM, yeah, definitely you can do that. I don't have the courage to do that now, but you have to trust me. Yeah, more questions, please, because. Yeah. For microcontrollers, it's like a script language for registers. Yeah. It's this one, this one, this one. So this is SVD, they, they, so you don't need to read all the, all the data sheet to understand. So you can have libraries like this is, sorry, Jose Python library, which you can use this description of all registers and you can like generate the syntax for poke and then you can like load whatever types you want and then poke them. So you can, yeah. So if I use like not, yes please, 32, but something normal. Yeah. Is it already up to them in GDP? No, no, I told it's not. It is false, you know, you can blame him. He don't care, you know. So like on a serious note, the problem is that they didn't opera-seam is that the GC, you're using the boom GC and also GDP uses the boom GC for the guile. So there's a problem there. So we have to overchange the GC, please, Luca, you know, you can ask him to give us a new GC, then we can opera-seam. So that's the real answer, sorry for joking. Yeah. So next one. Yeah. So, yeah. So I told you nobody has any questions, you know. Thank you. Thank you. Wow.
Verrou : a valgrind tool dedicated to floating point error diagnosis
So I'm Bruno Latulier and I'm working at EDF which is an electricity utility company in France and we are doing a lot of numerical simulation and of course we need to have a good verification and validation process and inside the verification process we have to take care about the floating point error. So I will give more detail about the floating point but I will be really short because I think almost everybody knows about it. We have in the when we are doing numerical simulation we design our algorithm with real number is the wonderful world of the mathematics but we have limited precision and so we have to use float or double in our code and it means that some usual in our code we have to do one thing and this small one thing because in double it's 10 to the power minus 16 it's really small this small one thing can have a huge impact for the final result of for simulation. So we need to be able to to have an estimation of the difference of the floating point computation and and the result we expect in the in the mathematical world and usually the developer that's our problem but usually the developer are able to see that there is a problem because when you modify your compiler option when you want to add parallelism you change the order of parenthesis and you have different results and so you see you have a problem but the problem is there and so we need a tool to be able to do this error estimation for real industrial complex application and it's the tool we called Vero which has this objective so I'm sorry but I need one slide of mathematics so it's quite easy we use stochastic arithmetic so usually when we want to debug we we don't like stochastic but there we are we want to we will use the stochasticity to debug so we replace the operation by the same operation but with stochastically random stochastically rounding so I tacadize and it's to the left or to the right with a defined probabilities I'm doing that for all the operations in my program so it's like a galton ball each operation I go to the left or to the right and at the end of the program I have kind of distribution and I use the support of this distribution with the formula there to compute the number of significant bits or the number of significant digits on this small program one divided by three followed by a multiplication by three this is the normal execution of the program with one to the nearest and after I put three execution with random rounding and I use all this result with the formula to see that I have almost two significant digits so it looks like really easy like this but if you have to modify all your program all your floating point operation in your code it will be really tough so the idea is to use valguine a dynamic binary instrumentation which is help a lot to develop this kind of tool valguine will give me an intermediate representation and I can just modify the intermediate representation so I don't I'm not able to write one line of assembly and I will do this kind of tool so valguine is really powerful and so when there is an operation an integer operation I can replace I can use the same when there is a floating point operation I can add counter that's easy and I can call my own implementation of the floating point operation and in this operation I can add the stochastic part of the operation for the user point of view I need to run the code several times that's the bad part I need to extract the value of interest I'm computing something I have to know what I want to compute what is the result the good part it works for all languages so C++, Fortran, Python yeah and it works with external library when we don't have the source from the valguine developer point of view it means I have to replace all floating point operation what is nice I there is no need of shadow memory so it's quite fast and what is really different compared to the other tools I want to modify the result and so this is the difficult part so I give you a small example which is called the Mueller suite so it's I compute a suite with recurrence and add some verbo CT to be like or simulation so there is this kind of result this is the execution with rounding to the nearest and there is address it's stupid but it will it's it's only a way to to present something later but it's stupid but I see really often this kind of thing we don't control or user and so now I run several times with veroo and in red there is all the all the output which is different so I can see there the result is completely wrong so when if I say to my colleagues you have a tricky floating point bug somewhere in your two millions line of code you will get a lot of friends so I need to do something and with one colleague we we developed something called delta debug which is a trial and error search algorithm so the user has to provide two script the first one is how I call my program with veroo so it's very simple because it's only a prefix command where the valguin prefix command and another script to say if the result is good or not and to say if the result is good or not we are doing a comparison with the result to the nearest we don't know if the reference is a good one we only know that there is a difference as is the difference we want to explain and and then we are we are with this command we we have to say the number of samples we need and at the end you will get the result there is problem in these two lines of code that's really nice it corresponds to the two divisions in my example from the valguin developer point of view it means I need to generate the search space I have to know which line or which symbol in my program where there is floating point operation so this is the first part and the second part I have to be able to run the program with a specific configuration I mean a set of function of line which are instrumented and a set which are where the line are not instrumented so I have to have to introduce discussion between this tool and valguin but it's not too difficult to do and it works well on real application but now I will present something which is more experimental I really go there I with with the two lines of code sometimes you do you do not have the right information what you want sometimes is to know the problem happened in the first iteration not in the last one and this corresponds to temporal localization and so for that I need to to modify the search space and I use in fact the the output of my program and I can't use directly the output because from one line to the other I need to to have the same key of all in all execution so it means I have to wildcard all floating point in the form for the search for the search space and especially when someone print addresses I have to wildcard addresses because it will be different in all in in different one and for the users there is nothing else to to modify in it the result is it happened in begin it and and it is also it's a two first iteration the problem happened only in the two first iteration so the standard output match without temporal context I only use the the fact that the user prints the number of iteration that's very important the user has to pay attention to the bufferization if you put if you print everything at the end of your program it's unuseful the empty line can be ignored and I can modify the the output by a filter script and the two last element used together we are able to to work group iteration we are able to do a lot of things and from the valgoin developer I have to define a five format to define the interaction between the IO so this is the standard output or even a file and veroo like client request the idea is to be able to call client request activated by the IO so my conclusion with veroo we are able to estimate floating point error it works well we are able to search the origin of floating point error with delta debuil it works well we are able to search mixed precision configuration it works well but not so much and we are also able to search where error are amplified it works sometimes and in my roadmap I want to to be able to to work on new architecture especially arm 64 I want to add new search space like the bike trace because if someone has encapsulated the addition it will give me that the addition is unstable nice and the last point and it will be probably the the most difficult research part is to be able to to to do error amplification localization without false positive that's the that's the key point and for the real conclusion it's on github there is documentation there with paper and so if you want to use it I will be happy so I imagine quite a few runs are needed because I guess for each floating point operation you have multiple combinations and can you give some numbers from experience how many you run one needs to track down bugs yeah the the question is how many samples we need to be able to do an error estimation it depends off the accuracy you want if the if your code is unstable and only to run run with the nearest the first one you have difference you have a bug so to have a problem it's a really small number if you want to prove that there is no bug statistically you will need to increase the number of sample and we have a paper I did there is a paper to say how many number you you need with with the confidence interval and everything but in practice it depends off the running time because it's it's always it's always the first question and now we have done the work to be able to have a number of samples and with theory and nobody use it the reason is is the computational time which is important in the interfloor project we are we have collaboration with colleagues of Versailles near Paris and they are doing all the almost the same work with LLVM infrastructure and so we are working together and it's a little bit faster but in fact it's more convenient to use from the binary point okay question did you at any point contemplate instead of using the stochastic method of figuring errors actually having your valgrain model of the floating-wave instructions use interval math that is represent bounds and then propagate the errors then you wouldn't have to rerun anything you just calculate with an upper and lower bound for each value and the problem with interval arithmetic what is now first what is really nice with interval arithmetic right the question is is it better to use interval arithmetic instead of stochastic arithmetic what is nice with interval arithmetic it don't lie never it say it will say the truth and it's really nice but the problem is on real industrial application it give the result is between minus infinity and plus infinity and it's true so that's the problem there is a lot of false positive in fact I'm I really use I use interval arithmetic when I discover a problem like that with this tool I'm I extract the problem I work on a small proxy app where I I'm able to to run interval arithmetic but with with multi precision to increase the precision in order to reduce the size of the interval as there it's it's really nice because I've I've access to the guantt's tool but on real industrial application it's it's too difficult it it works with acc avex avex too but I there was limitation with avex 512 because it's not implemented in valguine so this is a question about I think the mathematical library there is two way to do mathematical to to call mathematical function the first way is sometimes there is hardware and now I'm able to format wire to instrument fma and sqrt and the other way is to call the dynamic library of mathematical function and there is way tough because all the the developer of the mathematical library really know floating point operation and they are taking into account the fact that they they use one to the nearest operation and so if you use veroo with stochastic wounding on the mathematical library you can get stake fault you can get a lot of you you have a lot of problem it means that I have to exclude from the instrumentation the mathematical library and I have to re-implement myself the all the mathematical library to add the stochastic part uh that's quite tricky and uh and in fact I'm using some some thing which are a little bit limits because I use a reference with the quad mat and uh yeah it there it's okay but if I speak to to to to specialist of floating point they will kill me officially we have one minute for one last question but this is the last one I didn't try so you know what you get it so I worked in a project once where people had started using this minus f unsafe math in gcc which basically creates such issues and but they weren't it was a trade-off that they did performance but they wanted some more reproducibility would this tool work to kind of narrow it down to certain code paths where you might want to disable the fun safe math yeah uh the question is uh what what is the the liberty of what kind of freedom can I give to the compiler to uh to optimize my code in terms of floating point if I understand well and uh and in fact it's an open question it can help in in in the in that sense um I'm able to um to say where the code is sensitive but it's related to my test case because it's it's uh I'm not able to say anything of about the code in general it's the code plus the data set so to help a compiler it's really tough because the compiler needs to be able to run for its kind of data set and and the other part is uh or the question is uh which kind of option should we use for for the for the compiler what I see is a lot of people want to be reproducible and so they use all zero option and so the only thing they are doing is to be able to reproduce the wrong result because if you don't know why all zero is better than all three there is a good probability that all three uh will give you a better result because uh when you are doing a summation you will regroup it for uh for with the three uh when uh there is fmi there is one error one one ending error less so it's if if I have an advice you use all the compiler option you can and except if you know what you what you do uh there are small parts in veroo where I have to take care about uh floating point error really carefully if you if you use uh error free transformation you have to take care about floating point error uh so I don't know if you know what is error free transformation no nobody knows the error free transformation so it's it's a way to when we do a floating point operation we are able to compute the error on the floating point of of this error usually is we can represent this number and so we can compute the error and if you say to a compiler uh the fast math option you will say this error is zero because it's mathematically is zero and uh it will skip it so if you want to do um tricky algorithm with this kind of thing uh you have to protect it uh you have to protect your part in fact this part are used in twasic to be sure that the compiler is not able to to skip it if I make there is a misconception in floating point real that people want accuracy and reproducibility and they think they are the same but they are completely different and you can be inaccurate but they were reproducible or the other way around and actually in this case the inaccuracy wasn't the problem uh it was more the reproducibility because the the accuracy wasn't that important that's why people that probably at some point decided okay I can do so much more in the limit of the amount of time in real time context but let's see is that accurate no coronavirus system Valgrind is not natively possible to run like a ELC or some other end-gain system? I need Valgrind. In fact, I have less portability than Valgrind because I work for x8664. This works. I'm working for ARM and this is tough. This is really tough. I will need to do patch inside Valgrind for that. And it's not my... I'm not really confident to do that. But for x86 it was not easy but it was okay. Is there no solution for all PLC systems? Not yet. That question. Do you have plans to include this into the upstream version of Valgrind? At some point. It could be nice. I will like that. In fact, I think there is still a lot of work because I have only one architecture. And in the test infrastructure it's quite different because in the Valgrind architecture we need to do only once a test. And I have to run several times to be able to compute. And so I have a completely, I have a separated test infrastructure. So it couldn't be tested as part of the modern world? Yeah. It would be difficult I think. There is work and I think I will need to discuss with the Valgrind team. It's why I'm there. I want to begin the discussion with the Valgrind team. Do you have something about the performance panel? I was expecting it was the first question. So with the nearest it correspond to the instrumentation. And I'm doing a dirty call and I'm doing the same operation. With float it was not really optimized. I mainly work with double. And so it's quite acceptable. But it's not, the program is a stencil with FMA. So it's more difficult than a lot of other code. And if you are running with a code where there is only IO it's really fast. So but if you want to work with a Blast tree it's worth a lot. So the one I presented it's only the random part. And there is a lot of other kind to do randomness between upward and downward. And the other kind is to over some crazy false positive. So and FMA only is what I will get in Valgrind when I will be able to modify Valgrind. Because in Valgrind there is a there is a the FMA is implemented with the software. With the software and so there is a lot of operation. So we can reduce the time of the tool none. Tool none is yeah I'm faster than the tool none. So it means that there is a problem. So I discovered this last week so I'm not able to correct everywhere. And for me it's important even if I because when I'm doing Delta debug I do not instrument. I do not instrument some part with FMA and so it's costly. So I really need to to to modify Valgrind to reduce the performance cost. Because it's really painful to say I'm doing nothing and it's more costly than to modify the 14.0 behavior.
Yet another event sourcing library
Yeah, this is better. So yeah, I'll talk about history, how we made some decisions we made, some things regarding lambda and the project, and basically this was kind of a point where we started to do most of the stuff on our own. Then I will go over the patterns that were kind of influenced for the libraries, so the security and even sourcing, I'll briefly show how the whole thing works in architecture, diagrams, and then I will say why we actually decided to open source it. So the project started in 2019, everyone wanted to do several lists, it was kind of a fancy thing to do at the time, and also we wanted everything to be managed by Amazon and we didn't want to monitor containers or run stuff around, we just want to give our code to the Amazon and run it, and that was kind of perfect to do this. We also had to keep the business logic vendor independent, so this is kind of a regulatory requirement, so we kind of speak that our business logic is the most valuable thing and then we isolated it from the infrastructure, and so the infrastructure part we can always rewrite, but the business logic we want to reuse. You want a simple API, so I had all these query pads, headers, discussions, we always had about API, so I wanted to drop this thing out, and we wanted to keep data so we can transfer it, I know rewrite library and move it to another language and use the same data and so on, so like binary stored messages in Kafka queues were not an option for us. With Lambda basically the big problem is the startup, so we wanted to use closure because we had lots of data stuff to take care of, so the biggest problem was of course the startup, so the ground we had at that time was pretty new and basically most of the stuff didn't compile. We tried AWS SDK, this was a mess inside, they bring half of the main repository back to it when you use it. Also we had like Hockey Recipient, we had to fork it as well because there was some stuff there that they were using that didn't compile as well, even Logback didn't compile for like one year ago with this as well, so then we started to build something on our own to make it simple, so we created our own AWS SDK because everything they do in all this magical SDK is kind of a post request to the AWS, so it was kind of super easy in the end to do. So the first pattern we chose to use was TECORES, just command and query segregation pattern, so the idea is that you have place where you send commands, where you mutate data and you have a place where you query stuff and this kind of influence our implementation, so we had on HTTP site we just had two endpoints, commands and queries, you send in the body everything you want to do in the system, which also make you can take the same body, you can send it to the queue, you can send it and batch of the command in S3 buckets, so this was kind of great because we could just store the commands from the post request, put them in the queue or store them in S3 bucket as a list of commands so it was super practical. The query site is also very simple, so just the query endpoint which made the front end client, we implemented our own front end client for this, it was 300 lines of code together with it mocking, with retries, with deduplication, with everything, so basically just simply having this simplicity on the HTTP site made it possible. Together with Tech QRS, now it comes the event sourcing, so the idea of event sourcing is just we will not store the current state of the system, we will store events that happen, so it is a pattern from 2000, 1970, basically but then they didn't have enough resources to do it, so they decided to event like a relational database model where you just store the current state, so the event source will be, for example, if you take a shopping cart as an example, you would instead of storing the current shopping cart, you would store item edit, item removed, item edit and then when the client asks what is my shopping cart, you would go over the events and figure out what is the current state of the shopping cart but the nice advantage of this is that everything is stored, so basically for us it's very important, the audit logs, basically the event sourcing, they are naturally there, everything is stored, the database itself is inutable, so we are just appending stuff forward, so it's quite easy to handle from the security perspective, information perspective and so on, so for our implementation we have chosen to take postgres, we just store our events as a JSONB field with some metadata around, so it was super simple, we have the transactions because it's just append only, it scales very well, so we have around one terabyte of data and we just add, we don't even think about adding new stuff there, we use optimistic locking, so on the client side we just add sequence to every event and basically unique field on the postgres gives us optimistic locking, so it was super easy to do, so yeah, this is a simple diagram, so from the client perspective how things look like, so we have a command coming into the system and there we touch our service, we just edit the core implementation, edit the core does four things, so takes a snapshot from the view store, then does the processing, whatever needs to be done, stores the response in event store and basically sends to the router all the events effects that were created, so events are, as I said, something that will store the changes and the effects are the things that need to be distributed to the other services, so if you want to call service B or never call it directly, I will store it in the database, the things I want to send to the other service and then they will be distributed to the router. The router then sends also back to the service that needs to update this aggregate and then aggregate this update to the view store and then we go to the next cycle and query is just a simple query, goes to the view store, returns back data to the client. And one more diagram which is also important is how internally the core works, so basically does a couple of things, in the beginning we validate the request, the important thing is what we do, we check if there was already this request process, so we have a command response log where we check if the request was processed, if not then we go, we log this request in the request log, so all the entry commands that come to the system are stored there, so if we need to debug something later on, everything is collected there, so and since everything is a body, it's super easy to store, whether it comes from Q, post request, whatever, then is this processing request, where is the business logic part and then we start the transaction, so we can start the transaction at the very end of the request, which is quite nice from a performance perspective, we store events, store effects, so it commands to the other services and then we just mark this request as completed so that we have a deduplication afterwards. Well, basically that's it, so we started developing this internally, it was only meant as an internal library, there was no open sourcing processing component also, basically this was kind of an idea to start this process as well, there was no alternative limitation because it has a fixed infrastructure there, so we kind of used this as an opportunity to kind of expand library as well, so we mostly started using it as hobby projects, so for the side projects, so edit the DynamoDB support for example for the store as well, for the event store, and this helped to clean up the project, so we did a big round of cleanup of the project with the proper abstractions basically, then we started adding different implementations and then we were contributing the changes back to the internal, so we chose to have the internal project, so we fixed huge amount of bugs outside that help also to get them back internally for the internal implementation and so on, so we set up the open sourcing process, so basically any team in the whole company can open source what they want if you just follow these steps. Yeah, so we had very positive experience with this library, so we are now like almost one year in production, we store everything, this was this space off on daily basis, so we even had a business site messing up thousands of hundred to thousand records, we could recover them quite easily just creating data from database, everything is stored there, audit was super happy because we stored everything, I even ticked off a lot of the audits just because we said we store everything, so they were super happy, and yeah, so the most of stuff like if we had a production bug that basically clogged up the queues, we could clean up the queues and five minutes later we could just select what happened and recover it back in the queue, so we didn't have to worry about finding what was there in that letter queue, what is useful, what is not useful, so and because of the duplication we didn't have to worry about sending against some messages, so we do this almost every week we have one disaster we need to recover and it's super easy for us to do that. Yeah, that's it from my side, questions? Excellent, next we can set up. So tell us a bit more about accepting open source in your company, you can come up. So this was actually six months process, so. So, yes, so the question was the experience with setting up the open sourcing process in the company, so this was actually a very painful experience, so it took six months negotiation with security, actually first to understand what we want to do, then extend it, why you want to do, then talk to management, tell them why this is beneficial, but afterwards yeah, once we figured out that all the rules that we need to follow, then it was it was quite straightforward, so we documented everything and hope that it was six months process to get it, get it there. So, my question is why the architecture decided to use lambdas of the first, why decided to use lambdas? So, one side was because we had a burst, so we like, ah, sorry, the question is why we decided to use lambda functions, so basically in the beginning we had a burst of data, so for example in the morning we would get a bunch of data we need to process and the rest of the system would process like three requests per hour, so and this was kind of a nice thing because it scales quite fast and the other motivation was because it forces kind of doing it fluff clean, there's no caching, you have to really think about what to do, so it kind of wants to push the developers to go in direction of actually making stuff clean and that they don't depend on something being stored somewhere in memory and yeah, the third thing was it was a cool thing to do, so it was kind of a nice presentation, marketing material for the project as well. So, I mentioned you use optimistic locking, why did you decide to use it? Was it because of the lambda bear or was it? So, the question is why we do the optimistic locking, so we use actually postgres in the beginning, but we used optimistic locking because we didn't want to even start the transaction because until we are done, so because we kind of declare all the dependencies we have, we fetch them, we process the data and then we have everything we need to store the mutated database we have it at the end, so at this point then we open the transaction and then we can do something, so that means we fetch the aggregate, for example aggregate version 72, we process everything, we say okay now we'll be version 73 and if there's some version 73 in the bit in happening then we would have a postgres nicely saying there's a concurrency problem, so we didn't want to lock anything database, we just want to make it simple, so this was super easy to implement. I have a comment on that which is our database uses optimistic concurrency control and it actually gives much better scaling up the traditional locking methods and it's much more robust and it's more secure, so we can have a separate discussion about this later. Yes, let's have this, we'll be an interesting discussion.
The Old Remains New
Okay, can you hear me back there? Okay, thank you. So, morning and thank you for coming, and really I'd like to start with some thanks for all the organization that goes into FOSDN. So I'm going to talk about a technology that goes back to the late 1960s and has essentially stayed current and continues to evolve to this day. So this is from my company, the obligatory marketing slide. I'm not going to go into it. So the first database was actually not a sequel database. It was something called IMS. It ran on the mainframe and it was put together to manage the bill of materials for the Saturn V rocket and the Apollo Moon program. And it was a key value database because the key value database is the most general type of database and it was programmed in IBM 360 machine language. And then there was Mumps, which is Massachusetts General Hospital Utility Multi-Programming System. So if you can see that, you know, without needing to breathe through the whole process, you can get a beer. But it came out of the Animal Science Lab and it was also a key value database. It ran on a PDP7, which also was the machine that the first UNIX ran on. So it's a good heritage that way. And it was a complete database machine. So you booted the machine, it ran Mumps. Mumps is the operating system. There wasn't a separate file system. It was the database. And the only way to exit Mumps was to shut down the machine. And the key value database is basically key value tuples. So essentially, if you think of a tuple, there's a hierarchical key and there is one single value. So in this case, you know, a key can be capital followed by a country followed by a city. And at least in Mumps, it's always sorted. So one of the sayings in the Mumps communities, Mumps means you never have to say you're sorting, because everything is always sorted. The database is also completely schemalist. So what that means is that you can mix, you know, the values and the keys can be numbers, they can be strings, they can be just binary blubs. And essentially, there's a default sorting order for every key, which is that the empty string collates first, and then canonical numbers collate after that. And then anything else is not a canonical number collates in byte order. And one of the things is that you can change this, you can customize it, but this is the default sorting order. And the first key, that's something that always has, there's some rules. It has to be alphanumeric, but other keys can be pretty much anything. And, you know, the schema is determined entirely by the application programmer and not by the database. And you can mix key sizes. So basically, you know, this is a little bit like playing poker on a boat and all wet cards are wild. So you can pretty much put any number or any string as any key except the first one. And, you know, there's this similarity to, it's a tree. So basically, you can take those key values and you can make them look like array references where the, in those keys, the first key is like a variable, the other keys are like subscripts. And then the value is the value. And this basically is a paradigm which is very familiar to programmers. Every programmer in the world knows how to program arrays. And this also looks like a tree. And the difference, you know, it's kind of like a JSON tree or an XML tree, but the difference between a Mumps tree and a JSON tree is that, is that in Mumps, you know, you can have a value at the root of a sub tree as well as the nodes of a sub tree which you can't do in JSON. And the concept is, you know, basically, one of the things about Mumps is they tried to keep things simple because it ran on a PDV7, which is not a very powerful computer, and, you know, four kilobytes of memory. So you have variables which are called local variables which are there only for the lifetime of the process. And then you have database accesses. So if you see the little carrot in front of the variable name, that makes it into a database access which is now shared across processes. It's persistent. Anything that doesn't have a carrot basically is private to the process and disappears at the end of the process lifetime. And the language standard was very simple. So the power was the language and the database were tightly bound. Remember, you booted the computer, you type anything that's a database access. And so there are only some simple commands, simple set of functions, and then there are, you know, what are called intrinsic special variables which are certain characteristics like, you know, you to set an error trap, you basically set a value to dollar e trap, and then that string gets executed. Now, one of the interesting things about the standard is that the standard recognizes that here's a standard. Here are all the things that implementations can do which may can be non standard. So now if you're writing code, you know exactly how to write code that is portable across implementations. Plus, you also know that if you want to port code, it tells you exactly what you need to look at. So anything that starts with a Z or anything that there are these job parameters and device parameters, those are the things you have to look at. Everything else is portable. So here's a real power. Any database update is just a, you know, just a variable access. And because of the fact that, you know, that the keys and values can be anything you can have. Like in this case, you can say customer one, two, three, four is martin with a king. And then you can add a subscript and say birthdays, given the birthday. And there's also database accesses like dollar order is the give me the next subscript at this level. You can also, you know, set a local variable like next customer to customer with account ID. And in this case, account ID is a local variable. So, so database accesses are nothing other than simple programming statements. So the, the seamless coupling between the simple language and a simple database is where the real power of, of bumps lies. And the other thing is it just sort of works without any fuss. You know, you just start using it and it works. So where are we today? So today, mumps is kind of an inside joke in the healthcare environment where it started. And if you can imagine the CIO of a bank going to the board of directors saying, we've chosen our future programming language on his mumps, it's kind of like having mustard stains in your tie when you're making a presentation. So it became an ISO standard and the ISO standard name is M. And today the world's largest real time core banking systems and medical record systems use m databases. So the largest bank here in the Benelux, actually their core system runs on an MMM database. And there is an electronic health record system, which the entire country of Jordan is using. So there also runs in an M database. And where the industry is, there's one major proprietary implementation. And there are two major related FOS implementations. One is GTM. I actually led GTM from 1995 to 2017 and was responsible for open sourcing it in 2001. Before that, it had a proprietary license. And then in 2017, I left and started, YottaDB is a downstream database to take it beyond the, where the original M and GTM were. So, where are we today also? The M language standard was abandoned by the major proprietary implementation, which kind of, I think in retrospect, was good. And the other thing is that, you know, with the M language, the language is kind of like a religion to a programmer, right? I'm happy to change my religion. I'm not changing my programming language, but you go first. Right. And the other thing is that C is going to become a lingua franca of computing. So when we started, YottaDB, our vision was to say, the database is more important than the language. Okay, the language is important, but the database is more important. And you need to have the seamless coupling between the language and the database. That is a very powerful concept. So we extended that to other languages. And what we did was we created, there's the core database. And what you see in orange is the original M language. We also created a C API. So you basically do a pound sign include, and then you have C functions. And then once you have C, then you have pretty much any other language because every language in the world knows how to call C. So we have APIs for a number of other languages. And by the way, since the key value paradigm is the simplest paradigm, we also have a SQL layer that goes on top of it. And you can map the key value to SQL tables. So that's what the C API looks like. And because of the fact that you could write APIs for other languages, Lua's and other minimalistic language. So this is what the Lua API looks like. And this, by the way, the Lua API was written by the University of Antwerp. So I'd like to say thanks to them. So this is what the Hello World in Lua looks like. And you're basically doing a database access. So you're setting a global variable called, called Hello, with a subscript lower and the value is Hello World. And so languages are just tools to manipulate the database. So these are the languages that we support. So at YataDB, everything we do is 100% free and open source, mostly a GP, a V3, a couple of things are Apache licensed. So these are the, a way you can go for more information. And one of the interesting things is people talk about Redis as a high performance key value database. So we actually created a container and you can, in that container, run YataDB and Redis side by side and you can make the comparison for yourself. So we actually significantly outperform Redis. And that's my name and contact information. Please, you know, email me at any time. And so any questions, comments, if you're throwing rotten tomatoes, I'll duck. Yes, sir. It seems to me from the first slide that M is decorative language, but the languages that you wrap around is like C, Python, Goza, imperative languages. How is this market works? There's nothing functional about this. You know, basically from the title of the dev room, it said minimalistic. So this is minimalistic rather than functional. So the question is, is M a declarative language? It's actually procedural language. And you don't really need to declare anything. Think of it as a, as a form of like almost like a shell script, except you're not calling for sieges. So, so when is the origin of the code base on which were the challenges for migrating to new hardware? So the question was, what are the challenges migrating to new hardware? And at least for the other dev, because that depends on the implementation. Obviously, if you're an implementation as just an interpreter, it's easier to migrate. In our case, it's a, it's a compiler. And one of the interesting things about the compiler, it actually generates indirect threaded code. And so essentially, we have to generate the, you know, create these virtual op, there's a virtual machine with a bunch of virtual op codes. And we basically have to create the, the op codes in that machine. We basically have to deal with the stack frame. So it's not a zero effort porting, but it's a relatively straightforward port. I will tell you that the hardest thing we've done is actually putting the Java, I mean, the go wrapper on top of the C API, because go thinks it controls the world. And that's offensive to those of us that think we control the world. So there was a little bit of cultural mismatch. How come you're the fastest? You have to ask the reddest people where it is slower than your TV. I think it's a bunch of things. I think it's the fact that, you know, because of the, of our heritage and the fact that we're used in these very large scale systems, there, you know, we're, we're obsessive about performance. And so the only thing that's more important than before, well, two things are more important in performance, functional correctness and security. So, I mean, and, and plus, you know, the court base actually went into production back in 1986. And it's had therefore a lot of time to evolve and mature. Okay, in that case, there are no other questions. Thank you very much. Where's the next speaker who can come up and set up?
declarative calcs and visualization with calculang
Hi guys, thanks for attending. Yesterday I gave a talk about Calculang. This talk will be different. Yesterday's talk was in the JavaScript Dev Room. It was more about interacting with numbers and workings and formulas and sharing numbers and workings using Calculang and facilitated by this diagram which I went through. The important thing to know about this is Calculang is a language for calculations. It doesn't do other things. To do things with numbers you have to use another language and you call numbers from Calculang rather than re-implement them. This makes Calculang simple and focused. You just write pure functions or formulas from spreadsheet parlance. I talked about this in terms of these aims. Shareable and communicable. We're encapsulating numbers. We have a thing we can share. It doesn't have programming mixed in there. Transparent and verifiable. These are things I spoke about yesterday. Today we're going to... This talk will be more technical. Yesterday's talk was more interactive. Today I'm going to focus on this flexibility and reusability. We have a language that's just for calculations. I'm trying to develop a language which is flexible so that calculations can be reused. We look at an example which is from... You can find this example on Calculang.dev. You're a shop. You buy something for some price. You sell it for some price, hopefully higher. You've got some level of fixed expenses. You've got some number of units you buy and sell. We have these formulas. First of all, we have the separation of concerns. Over here we have these formulas. Sales is units times sales price. Purchases is units times purchase price. Profit sales minus purchases minus expenses. Over here, for these inputs, we have results for these formulas. This side is calling results. This is calling numbers from here rather than recalculating them. As a bit last side, I'm just going to sit and point out. If you did this in a spreadsheet, depending on how you design it, you've got one layout. Your layout decisions are made when you key in the formulas. But just to show you, we have separation so we can control layout. Here we're looking at sales prices in one value. It's a range of values. We put it on a x-axis. We didn't need to change that. We had to do it with calculations. The calculations are free from layout. I'm coming back to properties by Calculant to make it flexible. First of all, this isn't like the code that I showed yesterday, the Calculant code. I'm hiding some things, so I'll unhide them for you. These are the inputs. This is a convention to describe an input. The Calculant can understand it's an input. It's a function mapping onto the same name with underscore in. We have a convention to describe inputs. We have these inputs up here, so I'm just going to get this out of the way. We export constants, everything in Calculant, that's the rule. That's redundant, so let's get rid of that. Here are the formulas that we can focus on. I said some things about Calculant code yesterday. First of all, it looks like JavaScript because it's based on JavaScript and it compiles into JavaScript, so you get a JavaScript bundle or ESM at the end, which has just got pure functions in there, interior as portable. Second thing I said is that functions and ... sorry, there's only two primitives in Calculant to know about, which is formulas and inputs. Here expenses is an input, units is an input, and these sales, purchases, profit, or formulas, but all of them are implemented as functions, so we call units like it's a function, even with an input. Fourthly, Everton is a function which is a nice uniformity. It's good for refactoring things, so if units becomes a formula later, which is a hint about what we'll do next, then there's no refactoring. But we also, we don't populate these brackets, so that's something the Calculant compiler does for you. We will analyze how inputs are used and we'll populate those brackets for you, so we'll tread inputs through all those brackets. You can see, in this case, it's the only thing that the Calculant compiler does. You can see the output here. Let's just look at where inputs are used and build in a tree, which you can see down here. A little bit of graph theory, a little bit of logic. It knows that sales depends on sales price and units, profit depends on all of these things and populates this. This isn't in order to make formulas concise. It's in order to make numbers flexible. So that, supposing we're looking at different sales price here, different sales price here, but all it's saying is that if you have a higher sales price, you're going to make more money. If we're thinking about changing our sales price, we might actually want to create a constraint, a demand curve. New requirement, let's see how that's implemented. I'm going to turn on inputs so you can see what happens. Units is an input. With a demand curve, units is a formula. Not now it's changed. We didn't need to refactor where it's used. But there are changes in the JavaScript. So now purchases depends on sales price and purchase price. Previously it depended on units. Now units is determined by the sales price. So that's an example of input inference. So Calculation will infer inputs and populate them for you so you don't need to manually trade things. You can build functions which depend on inputs in complex ways. Basically you've got this flexibility to create cheap inputs and change your models, your calculations. But there's something happening that I did here that I don't like which is I copied all these formulas. And this would be nicer or neater if we didn't need to do that. So this is the same thing, nothing changed over here. This is the same thing, just modular. So we have a file, a calculating file which is important calculations or functions from the original shop model. So this is the original, exactly what we saw at the beginning, units is an input. But it has a different definition, it has that formula for units. And effectively this is a way that you can say I want these calculations but with these definitions. So Calculation gives precedence to what you define closer to the entry point, closer to the root, then closer to the actual calculation. So you can change how calculations work when you import them. Now in practice you can see examples where I use this on Calculation.dev. I indicate with this icon that modularity is being used. So yesterday I looked at a savings calculator which estimates for some interest rate, for some amount you're saving, it estimates an amount that you will have at the end after five years. You can change these things in the savings calculator. But interest rate might not be a thing that is fixed. So as time goes on you might find interest rates can change, we know that. So you might want to analyse how your original calculations, you might want to update your original calculations and analyse the gap between what you actually got and what you expected to get. Here we're doing that analysis for five different years. So again five different sets of results. And there's no copying of code, we're reusing the same savings calculations, we're updating that in a controlled way by making interest rate something that depends on a new input cut-off for actuals and it uses either expected or actual. So there are analyses, we can use the properties modularity input inference to make our analysis more contained. What I like about this is we've got one place where savings calculations happen, we've got one place where reconciliation calculations happen, all the logic is segregated. So other place where I use this, I made a pension calculator, it's for some calculations in Ireland and it's work in progress and lots of disclaimers on it. But I try to show people the value of pension tax relief, at least pre-retirement. And so we've got a pension calculator, but you want income tax calculations, so rather than put income tax calculations into the pension calculator, we use modularity, we use formulas. Down here just to structure some of these things better, I use modularity. And this one was my first test of modularity in calculating. It's a model which calculates, every year around September, October there's a budget in Ireland, the government changes tax rates and on the radio lots of people are talking about this change causes, this impact on finances, this lets you change those things and see an estimated impact on government finances. And there's no two copies of the calculations, it's just reusing. We're using calculations. The last thing I'll briefly touch on is we have separation of concerns. That creates an issue for Calculant because you need to make a thing on the right hand side there in order to see or do, I don't know, a Calculant with the numbers that come out of Calculant. So Calculant Vinspec is a very rough visualization API based on this which is brilliant. But it lets you map formulas and inputs directly to visual channels. This is a presentation I link in my abstract where you can see examples, there are code blocks here, we're passing in the model, input courses and values which don't change. We pass in a mark, we want bars, we say on the x-axis, we want different months so you can pass inputs in with a domain. So in a declarative way you can make visualizations of your numbers that come out of Calculant which is not prescribed, Calculant isn't the pinninated, you can use anything you want. This has many disadvantages but it's a way of using this. I'm starting to use this closer to the metal because this is far more powerful but it's very important to have a quick way when you're developing a model to get numbers out and see them. So that's one thing to help to develop things for that right hand side. So that's modularity and some of the technical details about the language and one way to visualize. I won't go through this but you can see examples in the tools code for different visuals. Any questions? Yeah? Syntax for Calculant is JavaScript. Do you think it would be helpful to have more human readable syntax? Yeah, definitely. In the medium or long term yes. It's got the developer experience overall. It's not something I put a huge amount of attention on but over time, of course I will. Because you're writing the formulas in a kind of declarative way, they're kind of light and there aren't too many different things to know. I said there's two primitives, formulas and inputs. I think it's conceptually simple. The developer experience and some things about performance are hard now but I think the conceptually can be much better. I showed you there's redundancy in the code that you're writing. So that is an issue that we can address in the future. What I really like is that there's no servers in old strata. It's all client-side. It's all client-side but that stuff is you need to run a compiler on your machine currently. It uses an old API. I use webpack and babble to make the thing but that will change. You can publish it right? Yeah, you get a bundle that you can publish. We will have a standalone calculator for certain sometime soon. The blocker for developers probably isn't just running on your local machine right now. I had a link to Calculated Party. It's a channel on metrics. You can join and talk more. If people want to contribute something for a community gallery, because that would be a nice next goal, then talk to me and we can try something. Things are typically good at our education or things that are simple because computationally just makes sure it works. Thank you.
For Want of Anneal: Examining The Unseen Changes Concerning Changes To VCS Assets and The Need For More Graph Centric Approaches
So we're just in a knowledge management setup for me. This is just the K-Outliner document. I did have a bit of an epistemic funk, so there's going to be a bit of bouncing around in terms of this. But just to sort of clarify things, I'm going to be presenting things regarding version control and logic from more of a wider perspective in the context of my work. And there's some very distinct pillars in terms of that, and unfortunately it does involve me quickly going through some of the things I've done historically. I got into technology as a consequence of doing some work on the government, which was looking at modern ICT systems for breaking down silos. It seemed very perverse to be looking at Conway's law later on and realizing that apparently that's the wrong thing to be doing, that I made the folly of thinking that the organization should be below what the capabilities of communication are. I later had the opportunity to assemble and aggregate lots of focusonomies using delicious, which gave me the autonomy to just use the non-alpha numeric characters that sit on the right hand side of a keyboard and use that as a form of creating logic rather than just going through a list of 26 characters and deciding which one is closest to A. It seems a very perverse way of being quick on a computer and being rational and orderly, both for input as well as recall. I had some fun later on once I realized that I probably should be doing coding with things, and so I looked into things regarding using logical forms to recognize what text is as well as be able to make transformations starting out with things like all can said and more recently doing things with regards to TXR, which is a list which does some very interesting passing. Some things which I've been developing over the last 10 years is key, which is my desiring to order things within the buttons which sit at the middle of the home row rather than obviously the inelegant aspect of just using my right hand for the non-alpha numeric keys forming things. I did a bit of DJing last night, I used to do that a lot and one of the things with collecting records as a physical object is that you have boxes and if you like to put things in boxes you reach a limit and if you're a bit too chaotic you end up putting all of the wrong things in the boxes and that doesn't help you in terms of things. Having done things with regards to folksonomies and aggregating, aggregating about 200 random people on the internet for things, you get a kind of idea that people's perspective on things are different and distinct and they can converge but it might need of various things like that. So I've really been sort of exerting myself and doing things as a, I've been truiting it out of what I've been doing as a body language and where I'd be providing annotations for what I call key as a recursive modeling language where there may be a point of precision but you have the ability to drill down on things and work on things later, perhaps having a sense of definition and compounding things just as with body language you might combine multiple gestures and people might not understand and you can put clarifications. So I started from the point of just doing things for separating things within a document just with headers with a classification then I started separating configs and documentation and just even attacking things. So for instance I separated them across the file structure and various things like that but I've done this so much that I have about a thousand directories in my home folder which are sort of, I have two annotation forms, there's the classics syntax which is based upon 36 of one type and then there's another thing which is bloom which I consider to be like a kind of, I refer to it as like a kind of plant sexual organ that there's a kind of propagating thing and it allowed me in my mind to think of rather than the home folder being the root that in effect it was the budding of an information systems concept and that everything in terms of the files and the subsets within a file are the roots, the things pushing down deep into the soil and I've been looking at intersecting things because obviously if you partition things out then you have to have these things flowing across and so in many respects I've spent a lot of time building in terms of what my perspective on information systems and how it should operate on a spatial aspect and trying to get in terms of passing of time in the context of things is very important in terms of things. As you can imagine I have been liable to make mistakes and having to reassess things as well as the fact that I have improved on things so for instance I rather than just having the classic annotation I would give it a sort of hub nucleus which would sit to exist in terms of two main forms in terms of a normative thing in terms of somebody having an opinion whether it's somebody else or yourself exerting an opinion as well as positivists where it's got more of a general thing and I would use the same sort of character to also provide an emphasis on whether something is coming in or coming out or as well as being destroyed and so that was a kind of evolution of concept which needs some clarification and being somebody who appreciates folksonomies and actually doing tasks as well as the joy of improvisation and free improvisation which has fantastic applications regarding how individuals communicate in a live setting but there's a sense of order and perspective but it does need a tiny bit of calibration and so for instance you have I would like to retrospectively deal with things as well as just as when people do things in say the geek's environment to repurpose thing and put a definition of everything calibrating behind and for me logic programming is is is one of the ways in which I can represent that and for instance one of the joys in terms of RDF is which is one of the kind of midway points is that I don't have to be trapped within these silos of each git database within each repository as well as the potential pitfalls of having a mega git database across an entire system that I could be using the RDF to acknowledge what something is but also point towards the fact that that might be an obsolete thing and that there's and that they and that there's something more ideal in which it should be considered in terms and that way you're not necessarily raising the thing but just in maturity realizing various things I'm gonna I'm gonna actually make use of this green so I'm gonna I'm gonna I'm gonna type in rainbow because that's I would have just jumped in terms of things but for some reason the word rainbow is yes rainbow there we go rainbow so this is a nice thing in emacs which so I click on that and it's gonna jump into the other folder which let's see if it jumps back give me give me a rainbow have we lost the rainbow I'm sorry but it's no rainbow okay well I never expected that the this is live presenting but okay one of the main points I was I had a picture of a rainbow you can imagine it and there was a rock in the middle ground and there was a rock in the foreground and one of the things which RDF does is having an open world that if I'm describing the middle ground rain rock formation it doesn't exclude that there might be other rocks in the environment whereas something like prologue would be expecting that if there's reference to one other thing then there is only one other thing and that's just worth dealing with and I wanted to talk about disjointed classes because that in terms of the aspect of the rainbow which you are imagining on this screen that for five left okay and my annotation system because it's based upon certain forms I can automatically have a reference which acknowledges that there are that there are distinctions within just the pattern of typing the main set for the bloom is 36 which within a semantic model works very nicely and the 2 to 15 characters for the annotations to exist as a representative form would be something like half a million I've already done about dictionaries of about with the text to the annotation about 2002 wiki data and I'll be dealing with that in terms of forming with things I'm going to briefly deal with bio make because this is this is you're not on the screen I'm gonna because this for me is very fascinating I was this is a form which allows you to have sets in terms of various things so that you're not just operating off input output in terms of things but having a collection as well as pulling in some of these logical forms and for me that's very significant regarding the the work I'm doing I should just briefly jump back to the the make file oh yeah there's I've been using remake for dealing with things regarding that so I've been using I've been using these annotations within the documents so I will yeah right so this is my make file where I've been using these annotations representatives so for instance I would have the arrogance of creating say 20 aliases for print so then I'm using the annotation within the print form to provide an inference within the script but this is this is how I build out scripts and coding and I'm going to be using an intersection of logic and forms for intersecting things so it all looks like that across different languages and I could use something like remake which would output the genuine code which exists and I'm just spending the time providing descriptions and because of the accumulation of forms and the fact that they congregate I have just like with an and system the ability to grow smarter and smarter of each form that my own orientation will feed into a wider network which can point around and because I have this chain of descriptions within the files and everything if you point to something somewhere else you can have that very large mesh of inheritances and various forms I'm going to hopefully have some experiments with using RDF with a transport agents who'd be doing federal best related content so one could be exerting things both from the perspective of version control and then forming responses for interacting things and you could be building business logic in terms of state as well as having having different criteria there's a there is a machine that I did load up which is a position of an architect adapting ideas from Aristotle the idea that rather than a straight line corridor that you are having to orientate and deal with things one of the great tragedies of our modern ages that we're forced to be distracted and things like this is a way of having a directed force in a logical more orderly way the below diagram is a impression of how this was done for a an architectural place for radiologists and I thought the feedback that they made the quiet room smaller so that the senior radiologists couldn't hide away from the juniors to be quite an interesting feedback loop in which I'm sure any logical that system in in an informatic system would be commensurable okay yeah and that's a nice diagram but we don't have time for some digression I suppose that's it yes good sorry for the rainbow I can be your mic stand if you prefer as well I can just stand yeah questions there there in effect you could you could make an inference from just creating a list of various things and seeing how it relates my my feelings that it would be better to once I've mastered the logical intersection of this stuff is to compile a mega dictionary based upon a collection of things and I think that's just efficient and I ever there's lots of people have been watching this will be very delighted once I push a button and an API springs up but like with most employment things there's a question of five or ten concepts and you can start in and deal with things that these are these are 36 odd things so that's just two page manual if needs be just to get somebody going yeah all right I've been just treating it as as a as a as a alternative to shell and a way of breaking things into ways of breaking problems things into smaller problems as well as a form of building coding against other subsets upstream so it's just a giant sandbox I'm going to be putting that into dedicated libraries and I think SWI prologue will be a very apt choice for that for a range of reasons yes thank you you
How to create the universal operating system
Welcome everyone. My name is Erotra. I'm glad my somewhat pretentious title has a lourd you all inside. I hope not to disappoint. Hence the disclaimer that I have way too many disclaimers to elaborate on. What is an operating system? I had to look it up on Wikipedia. I did have some imagination of what it could be, having used one for a few years. But it turns out, and it's slightly paraphrased because their definition is too long for me, it's a software platform providing access to resources and services to run computer programs. Okay, great. I knew that. That's what I use it for. Excellent. The title is about the universal operating system. And universal to me would imply more generalisation. I've always felt that the computer should evolve or computing should evolve. And I hope that we can move towards freely sharing, using, combining, understanding whatever we do with the computer. And from my personal perspective, from my day job, one aspect would be something about safety. And I've added security, but I'm definitely not a security expert in all the automation that we may computers do these days and hopefully in the near future. Because apparently we are dealing with a few crises here and there. And I believe we have ideas to get those addressed using information technology. I hope to learn from Jonathan at the level of representing information, how this could be used in the future. But I'm sticking to this bit. I'm more operationally oriented, so you could say, imperatively. And I know this is the declarative minimalist computing room. So I'll try to bridge that. The ingredients that I hope that the future universal operating system might incorporate is definitely the microkernel. Richard Stolman proposed for the GNU system a few years back that it could have a microkernel. I would still love to see that happen. In the community, work is being done on that. I hope to start to contribute to that. From a software point of view, I believe everything should be modular. Small pieces because I'm just a human. My head is limited and my understanding and time are short. Things should be definitely decentralized. Client and server would be a natural way to go in the interaction between things. But I want to focus on language semantics that might help us to move towards such a universal operating system. Because I think if we add all of these ingredients, we're going to incur enormous complexities. And I'm not really sure that if we go on in software development the way we do, it will actually scale to the level that we need our information technology and our operating system to scale. And I'm going to do that using a very silly example. This is actually the control of my cruise controller in my car. The picture comes from the internet, of course. I'm guessing most of you have heard about a cruise controller. Basically it's an electronic device in a car and it runs a bit of software. But I want to use it as a metaphor to talk to you about small modular things that can work in a larger environment with other small modular things. And if you add and combine them enough complexity goes up and we need to figure out a way to deal with that. So what is a cruise controller? I'm just going to read this out because I haven't memorized this. Basically when it's not enabled, hence it's disabled, the throttle, which is the thing under the hood that is normally controlled with your gas paddle, is fully controlled by the driver if he or she pushes the gas paddle. When it is enabled and you apply the set button, it captures the current velocity and uses that to maintain that velocity over the course of time. There are exceptions. In my car if I go uphill and I've set it at the lower limit, it will drop down and just drop the cruise controller. And there are other reasons for it to refer back to the human control instead of doing it automatically. One of them is if you press the brake pedal or the clutch pedal, it has to stop because it would be very annoying if your car continues moving while you don't want to. And of course as a human you can cancel it. Okay, this is all really boring. But basically if we would put this declaratively, we just want the damn thing to control our velocity. Done. Very abstractly. And I think this is a way in the future, a declarative way to do that future automation. But I've just been listing all of these pointless details which are still very abstract. If you look at the car in greater detail, there's a lot of imperative stuff going on, stateful stuff. And that's what we're trying to figure out a solution for. So we've been working on a language which we call also pretentiously designed, spelled incorrectly. Because if you look on the internet for the normal word design, you'll either find us. So at least this alternate spelling helps search engines. The semantics of our language allow us to, well, our language consists of interfaces and components basically. Interfaces are behavioral specifications. So they record the protocol. I'll show you an example in a minute. And these protocols are actually contracts of interaction between two components. And our components are of course modular. So they are completely isolated from the world by interfaces. And they are composable. You can stick them together and know that while they maintain the protocol, therefore they can cooperate properly. We have a formal definition of our semantics, meaning we actually express it in a formal process algebra. And I'll get back to that. We can simulate our behavior at the interface level and the component level. We can actually implement running code through code generation. And we can actually automatically verify a bunch of aspects of interfaces and components. And I'll try that to do that by example. So let's start with an interface. The pick. Sorry. Something like this. The picture of the buttons on my steering wheel I showed you is captured in terms of syntax here. There's the enable, disable button. The select, the current velocity to be maintained. As a resume button or resume function, a cancel. And on the dashboard, there's this LED that indicates that it's active or not. But the human is expected to interact in a specific way with the rest of the car. And that has been captured in this behavior bit. And our language takes a imperative approach. So we define state. I just scrolled across that. So we have two state iterations and we maintain that in variables, state and set point. And now we describe the interaction of the human with the cruise controller in the car behaviorally. In other words, if the initial state is disabled, we would accept an enable and become enabled. So dive into this further. I'll show you a picture of what this could look like. Let's see. Sorry. So if I show you the state diagram of the text I was just showing you, this is generated from that definition. This is what it would look like. This slightly more human readable. And we have sort of an intuition for this, I think. I'm going to make it slightly more complicated. Let's look at the component, the cruise controller component itself. It is specified similarly. Almost there. Yeah. We use the same language or the same concept. We define the behavior of the component itself, but now it receives its messages through ports and the cruise controller is supposed to interact with the different actors in the system, which is the human behind the human machine interface, the pedals, the throttle, and we have a timer, which I will not go into. I won't go through in all details through this behavior. I just want to show you the following. Sorry. I have to give one more example. The thing that I really want to add, which we have recently done, is an extension. Let me start over. This is what it looks like in, this is the formal semantics of that behavior, which we can actually feed to a model checker and check properties. And let me feed it to the model checker. It checks all of our default properties and the user defined properties, which I will now show you. So what we have just checked is that the component adheres to all of the interface contracts and it actually adheres to the invariant predicates. One of the invariant predicates, which is, you may have heard, there are cruise controllers who accelerate unwontantly. I have tried to encode that in the state of the environment, which the cruise controller is trying to control. And in this case, it would be, if the human has not activated the cruise controller, it should never actively control the throttle. So that's recorded like this. And I can actually make that fail by commenting out a throttle reset. And then that property will help us find a sequence of events that would lead to this illegal behavior, this unwanted behavior. Okay, this was very detailed. I'll try to wrap it up. Oops. So I have to make the link to the universal operating system ahead. I want to foresee that we will build a modular operating system. And because of the modularity and the distribution, the cooperative complexity goes up. And I think we've figured out a way to leverage model checking to help us there in the future. In the near future, trying to, I'm looking forward to adding that to her development, engaged development. In the coming year, we already, we had already planned to extend the scope of verification, including data contracts. But if you want to know more, just come and find us online. Here are the details. Excellent. Thank you. So this system is GPL, it's out of the open. Yes, you can find us on Savannah. It runs on gigs, a gigs install design. Can you tell a little bit more about the automatic verification of the model? Right. That's the magic board. Yes. We actually transform the model into MCRL2, which is a model calculus that allows you to do specify formal properties and capture the formal behavior. So what we effectively do, the execution semantics of the code that we generate is modeled in MCRL2. We verify the entire state space of that code, which is more efficient than trying to test all the code. And we have a composition guarantee. So when it finds nothing wrong, that there is really nothing wrong. And it's not a matter of, we didn't already have a time to find something at once. Exactly. But there are always aspects that you cannot represent, which are also important. You're welcome. More questions? Is it a result or possible outcome for the model? Does it commute to the whole solution space? At the component level now. You should repeat the question. Sorry. Your question was to verify all of the properties, does it expand the entire solution space? Exactly what we do at the component level. So the interfaces allow a certain behavior. And you want to expand that entire behavior, synthesize that, and go through it and figure out if there are any problems hiding there. That's what we do. Final question, is it used in production? It's used in production. Oh, yes. Our biggest customer currently is a thermo-efficient scientific. They make these huge electron microscopes. And I believe they've got about 1.2 million lines of our code running. Another question? Yes. Thank you. Is it also possible to create distributed systems with design? Currently, no. But I hope to integrate with what Christine will be talking about very soon. And that will solve that bit. Great. Thank you.
How much math can you fit in 700K?
So during these two minutes, I'm going to ask a few questions. I think the sound is better like this. Can you hear me? Yeah. So I heard a comment that color was not allowed, so I hope that you won't mind if I use 3D instead. But the screen and the actual device I'm going to talk about is black and white. Who uses a calculator from time to time? Who uses a calculator from the smartphone or whatever? Yeah, it's the majority. Who uses HP style calculators? Not that many. It's mostly. Who uses calculators for binary computations? Okay. Complex numbers in matrices, graphing. Okay. Just checking. So I don't think that the camera can zoom that far, right? So I can't show that. I suspect. Yeah, it'll be hard. But this is the device I'm talking about. You're going to speak. I'll hold up a sign. 5 minutes for question time. Yep. It's the dots. Is it also? For me it is. I don't know what's wrong with my timer. It's Android. Okay. So I'm Christophe D'Alincia. I'm working as a senior principle software engineer at Red Hat, working on confidential computing. I'm giving a talk on this topic this afternoon. But today I'm talking about a pet project of mine called GB48X, which is an open source HP48 style calculator for modern ARM hardware. So I talked about this last year, and I'm going to show how much progress we made since then. I start with a reminder of what GB48X is. We are going to review last year's future plans to see how well we did. I'm going to talk from one engineer to another. That's why I asked the questions at the beginning to see why we need all this math in the calculator. I'm going to extoll the virtues of 1980s era efficiency when there were only keyboards, no touchscreen, no fancy mouse, all that stuff. I'm going to explain how using much bigger numbers led to much less memory usage. And we are going to see a number of bells, whistles, and engineering units along the way. So I hope you enjoyed. Strap on. What is GB48X? The idea is really to revive Schullet's Packard's iconic reverse polish list on modern ARM hardware. So that's what the original box looked like. And a quick primer on the project. We want to simply put, reinvent the best calculators in the world. Nothing yet, less. It's designed to run on existing hardware from a company in Switzerland called Swiss Micro that does these kind of devices. So you see the DM32 on the right and the DM42 on the left. The specs for the project are from the HP manuals, and there are dozens of them. Unfortunately, they contradict one another because values calculators do not do exactly the same thing. So it's implemented in a language called reverse polish list, or RPL, which is a stack-based language, very powerful. It's based on common line and menus that you activate with keys below, function keys below the keyboard, the screen side. It has many data types and mathematical operations. I'm going to talk about this later. And many enhancements in the project compared to what HP did. Now, is this still minimalist? Well, you bet, because that machine has 70K of free RAM and 700K total for the program space, hence the title of the talk. So it's a low-power Cortex M4 at 80 MHz. The battery life is up to three years on this kind of battery, and one of the things that is nice is that the screen is passive, so when you switch off the calculators, it displays a picture, and the picture stays there forever. So that's where I have pictures of my wife and my calculator. The machine has only 96K of RAM, and if you remove the bitmap, which is a high-res bitmap, and the operating system needs, then you get to the 70K I was talking about. So 96K is 1.5,64 for the old-timers among us. It has only 2 megabytes of flash. It has 8 megs in the chip, but 6 are for a flash disk, and so there are 700K remaining for your program. That's less than a Macintosh floppy disk. They were 800K. The project did hit these limits quite hard. I'm going to explain how we worked around that. So last year I explained that I had to restart from scratch from a project called new RPL because we hit these limits. This year around Christmas, I hit the limits again, so I had to restart from rescratch, at least as far as the similar computations are concerned. So I'm going to explain that. So let's review last year's future plans. I think there is a problem with this one. Is this one okay, or is it... Yeah, okay. So I said, you know, back in 2023, I was young and naive, and I said a lot remains to be done. So I was talking about adding complex numbers, vector and matrix arithmetic, about 1500 functions that were left to implement, and key features like plotting and graphing. So what did we do? Well, a lot of this was done. Complex numbers are available, and they are actually much better than the original. For instance, you can have polar and rectangular. You have the usual notations. You have stuff like that. We have vector and matrix arithmetic fully implemented, and we have algebra, but also with exact computations like fractions inside matrices. So you never get a rounding error unlike on the HP calculators. That's the test suite. So the test suite runs on a simulator on Linux or Mac OS, and it currently runs about 2,200 tests. Not everything is tested. That, for instance, is implemented but not tested yet. And we have plotting and graphing, at least the basic features, like drawing stuff, etc., with some nice enhancements compared to what HP did. Like, for instance, we can have plots with various sizes and plot patterns, so I'm going to show that in a moment. And that lets you draw multiple things on the same screen and see what the different pieces are. It just was very fast on the screen here. So how did we go to use only 70K? It's a story of ultimate over-engineering. It's C++ with garbage collection and ubiquitous bad packing all over the place. Let me explain what I mean with that. A C++ object typically looks like this. You have a class, and the way this is represented in memory is you have a virtual table pointer, and then you have the value for the object, so in that case, for the integer, you'd have an integer value or an enzyme value. And then there's some overhead for malloc. It's self-operated or whatever. You have, for instance, a linked list or a free list or something like that. So overall, for your object representing an enzyme value, you typically use 12 bytes. 12 bytes, that's on a 32-bit CPU. That lets you represent all values up to 4 billion, and it's fixed size. You can't remove it in memory. Not good. Let's do better. So the representation we used looks like something like that. We use LB128, which is a system that is used, for instance, in Dwarf all over the place. And there let's us code the ID that is used to identify the type of object as one byte for integers. We have 128 types that we can represent with one byte. And the value, if it's less than 128, is also on one byte. So that means that I use only two bytes of memory that's a 6x factor compared to the other representation for all values below 128. And I can move to infinity because the LB128 is a variable size encoding, so I can essentially have numbers that are as big as I want. It's now a variable size object, and I can move it. So it's a vast improvement. That lets me have a memory organization where I have at the bottom of memory all the global variables, the global objects that I keep. It's essentially a name, a value, a name, a value. And then, so they are all packed together. And then on top of that, I have temporaries that move with a temporary pointer that moves as you allocate objects. And then there is an editor, scratch pad, and the transient stuff on top of that. Because it's all contiguous, the way to reach the next object is to skip by reading the ID and computing the size to get to the next object. So on top of memory, you have root pointers that point back to, like, the stack, the local variables, that kind of stuff, that point back to this memory area at the bottom. And the root pointers can point inside objects. That's a very important property for performance. For instance, if you follow the one link, you'll see that it points just behind, I think, like, with curly braces. It means it's part of a list, and I can put the value that is inside the list directly on the stack. So I can do the computations faster that way. And there is also a series of smart pointer classes, the names and in other score G in the source code, that let me have garbage-collected smart pointers. The allocation is super cheap, because it's essentially I'm moving the pointer at the top of the scratch space, like this. So it's just one addition and one comparison, and the comparison is to see, okay, am I out of memory, do I need to garbage-collect? So a very, very cheap allocation. The garbage collection itself, as you, you know, your memory grows and you allocate more and more stuff, so at some point, memory gets slow. The unreferenced temporaries, you no longer need them, so what you do is you copy the reference object down and you adjust the pointers, and then you move the editing part of the scratch pad down, and you reclaim your free space that way. So the good point of this approach is that there is no memory of a head at all. There is not a single byte that is used for metadata or linked list or anything like that. The sub-objects, so pointers to objects inside a list, for instance, don't cost me extra at all either. If you know something about garbage collectors and you think of a market-suit garbage collector, for instance, it needs some metadata about sub-objects, and so that means you have extra costs for objects inside objects. And it's a single-pass garbage collector, so it's simple code, easy to maintain, but the downside is that it's slow. It's essentially a quadratic behavior, number of stack objects times number of objects instead of linear or close to linear that you could get otherwise. So it's a usual trade-off of space for speed. So why use C++ at all? Well, it's because of template metaprogramming, and let me explain why this matters. So the guy that you see in the photo there is a guy named David van der Vorder, and he's a Belgian guy who initiated me to C++ metaprogramming back in 1998 when we were in the C++ HP compiler team. So the guys you see in the background are the HP compiler team back in 1998, and that guy is super, super smart and initiated me to template metaprogramming before it was even possible, so we were dreaming about doing these things. But now you can, and let me explain why it matters. I'm going to represent code as data using metaprogramming, not because we can, just for the sake of it, but because I have to. So let me talk about bug number 12 in our project. You compute 1.2 plus 3.4, and it hangs on battery power. So how do you reproduce this bug? You don't use the technique shown on the right. Instead, you simply type 1, 2, 2, 3, 4, plus, and the calculator sits there, not doing the computation. And your users call you and say, did you even test the thing? So you scratch your head, how did I miss that? Well, the fact is it hangs only on battery power, and as soon as you plug the USB cable, the computation resumes and you get the result. You can guess that I did my testing with the USB cable on. So what is this bug? This one was a bit hard to find. It turns out that the chip has an execute in place feature that works, it's supposed to work on the external chip, something called the QSPI interface, except it just lacks battery power, or a power juice when it's on battery. And so essentially it sits there waiting for the cycle to complete, and it completes it when you plug the power. Okay, so that means I have to move as much of my mathematics into data that I can read from the QSPI as opposed to code that I cannot put there. That's why I only have 700K, otherwise I'd have two makes. So how do I use C++ metaprogramming to do that? Let's see a description of an interesting math rule, and that's how you expand polynomials. So you know the rule, you see the first rule, for instance, X plus Y times Z, you turn that into X times Z plus Y times Z, and you see that's exactly what you see in the code. So the code contains essentially the mathematical formula as you're applying. That's neat, right? Now, here's a guess. How many bytes of code does that generate? Give me a guess. Nobody wants to want to guess. Okay, that's the assembly code. 12 bytes. So that code generates 12 bytes of code, but it generates tons of read-only data, which is good because I can move that to my QSPI. So the magic is this ugly metaprogramming code that generates constant arrays, and I taught the C++ compiler how to generate RPL objects from C++ expressions. Isn't that cool? And so that's how you get 12 bytes of code, tons of data that I don't care about, I have plenty of that data space free and no executing place needed. So in the end, how much math in 700K? Well, it turns out that for another reason, I'm now back under 500K, so I'm within the limit that we all heard about, the 640K that ought to be enough for everybody, right? So from one engineer to another, what do we have? So we have base numbers for engineers in the computer field, that's really fancy. In any base, I can compute in base 17 or 34 if you want, or three. With any size, you can compute on 13 bits or 512 bits if you want. We have complex numbers that's useful for electrical engineering, and we have phases that are dealt with with exact results if we can, so like exact fractions and stuff like that. We have linear algebra, and here two exact results when we can. Statistics, which is useful for field science. Degree minutes, second support, so that's if you're doing, you know, maritime navigation or stuff like that, that's really handy, you have a really nice shortcut for that. Unit conversions, if you want to land something on Mars without crashing it, because some guy in the US is using really ridiculous units. And symbolic processing, which is useful for math gigs. About 1980s era efficiency. I have this magic menu, it's the key at the top, next to the A symbol, and essentially it selects the right menu depending on the type of the object on the stack. So very few key strokes to get exactly the functions that are most useful for what I'm working on. Equation data entry, I use a single key to enter the symbol that they limit expressions, and when I'm inside, that's the quotes in RPL, but once I'm inside an expression, I no longer need these quotes, so I hit the same key and I get parentheses instead. And same thing with the equal sign that you see at the bottom, it evaluates an expression, so it's the eval function of RPL, but if you're inside an equation, then it says, well, I'm inserting an equal sign because I'm trading an equation, and if I'm inside parentheses, it's inserting a semicolon instead to separate function arguments. Exactly symbol data entry, that's for your gigs. So when you type a hash sign, the cursor changes to a B, that's for base numbers, and it says now the ABCD keys, you don't need to shift them or anything, you just get ABCD. DMS data entry dot dot dot, and yep, and a one key function. Okay. Just my conclusion that I cannot answer the question because I still have 200k to go, so see you next year, guys. Thank you. Thank you. Thank you. Thank you. Thank you. So the next speaker can set up. Is there any time for questions? Yes, there's five questions. We'll have one for the next speaker. Yeah. But I don't see the next speaker. No questions, seriously? Who wants to help with this project? I'll just give my laptop. You know, it's 20. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. We'll just give my laptop. Does that rock? Does the calculator have a beeper? Yes. That's a good question. So let me... I'll use the voice. Oh. I think it will be a full row. So here we go. Okay. Okay. Okay. Okay. Okay. Now it's your world of activities.
Scheme in the Browser with Guile Hoot and WebAssembly
Hello. This is Scheme in the Browser with Guile Hoot and WebAssembly. I am surrogate Robin Templeton. We are the Spritely Institute. We build decentralized network technology. I'm going to open up with what is WebAssembly. It's a W3C open standard. It is a low level compilation target for the web. It's available in browsers since 2017. One of the big things we needed to make this happen is support for Wazem GC. It stands for garbage collection. Supported in Firefox and Chromium since very recently, late 2023. Why should you care? Why should you give a hoot? JavaScript is no longer the only game in town. You have a language that you really like. You can use that language. If you are familiar with some, I know there are a bunch of Guile people in the room. We have tried compiling to JavaScript and we have seen a lot of languages that have done such things in the past. You had to take a lot of compromises. They were a lot slower. It was messy. By compiling to WebAssembly instead, you are compiling to a virtual machine designed for you to bring your language to it. It follows the capability security model. As you may know, we are all about the object capability security model. Things are reasonably fast. It is increasingly used in non-browser context. Despite the name web, it is used as an abstract virtual machine for many different things, including people to write and use it as a Docker replacement, etc. What is Guile? We are targeting this Guile thing. It is a flexible scheme. It supports a bunch of standards, etc. It has a nice VM. It is also traditionally used for very unixy systems. Robins here. Until today. We will do a tradeoff when Robin gets up here. We have had some decisions that we have needed to make. I will let Robin resume with what those decisions are. I will come straight up. We are abstractly replacing. Yes, I was in the next room for a few minutes accidentally. Continuing with decisions. Thanks for starting it off. Sorry I am running late. Continuing with the decisions we had to make starting this project. We could have just compiled Guile in C via M script in it or something and run it straight in the web browser along with its various C dependencies like the BGMP and the BWGC. We would have probably gotten that done in a week or two. We would have paid a big performance penalty and also we would have had much worse integration with JavaScript. We expect. The second option was to work with the experimental WebAssemblyGC extensions and we could expect a higher performance with that. In fact, that is what we are seeing. Not that we have a comparison, but presumably we are getting decent performance and excellent integration with JavaScript. We will not with option two, of course. That is Guile Hoot. It is a WebAssembly to Scheme compiler. It is built on the newly shipped WebAssemblyGC extensions. It includes a full WebAssembly toolchain. We are bringing the whole Scheme language to web applications so you can use Continuation, Nubium, Merrick Tower, any normal Scheme thing in your interactive Web application. The goals for this project were, of course, first of all, to run Goblet's applications in Web browsers, which essentially means general support for Guile applications. Plus some sort of user interface library. As well as to advocate for dynamically typed languages in general in the WebAssembly world. We are also providing an alternative WebAssembly toolchain as a consequence of our development process. I am not going to go through the code for this, but this is the inner loop for Wireworld, which is a sort of a circuit simulator. We will quickly... You can switch tabs, but I have it open. Okay, thanks. Is this the tab for it? Control tab through. This will be good enough. This is a graphical demonstration of Wireworld. We might have a live version here. This is a bit slow actually drawing, but yes, as we all learn in physics class, electrons have a head and tail. There will generate one, and that is a tiny circuit simulator. Back to the presentation. As far as our status, as far as how far we have gotten in Gauhut, we have basically all of our 7-in-1-small Scheme. That was our initial target because it is a very small and modern specification and has several implementations and a nice benchmark suite and so on. We are starting to add Gauhut in R6RS features. In version 3 that we released this week, we just added an R6RS library system. A couple of hash table types based on R6RS. We are starting to work on debugability in the browser starting with names for WebAssembly level objects. WebAssembly functions, for example, have names that you can see in the browser debugger like in backtraces. We can also run the R7RS benchmarks now. We are getting decent results there. We are now focusing on performance as well as functionality. Who does a bit of an unusual list implementation because it is not a standalone self-hosting self-contained compiler. It is heavily integrated with Gauhut. It will presumably always be integrated with Gauhut, but it is not a normal Unix Gauhut. We reuse the compiler tower that Gauhut has that takes in Scheme or another language like EmacsLisp on the front end. It goes through a couple of intermediate representations. Normally we would output at the last stage after TRIAL, which is Minimum List Scheme basically, and a Continuation Passing Style layer that is quite low level. We would output Gauhut bytecode, which is a high level bytecode for list-byte languages. With some tweaks to the compiler tool chain, we are able to output WebAssembly instead of Gauhut bytecode. That required very minor changes to the compiler, comparatively speaking. We also cannot use libGaile, which is in C and has a bunch of C dependencies, including its own garbage clutter where we want to use the browsers. We are building a new Scheme Runtime as part of Gauhut. That is written in WebAssembly as well as Scheme and Scheme Intermix with WebAssembly and so on. We are targeting Gauh compatibility overall and making progress on that front. From the host, typically the browser and WebAssembly terms, we need a few things beyond the basic WebAssembly that would have been available a few years ago, that you might have used in Google Docs or something like that. We need the garbage collection extensions. Those only got enabled in browsers by default in Chromium and Firefox last December, so quite new. We need tail calls, which are part of the basic specification as far as I know, and so that's very nice to not have to have any workarounds there with trampolining and things like that. We need string support, which is not part of base WebAssembly at this time. There's a proposal we like that we use on the source code level, but we have workarounds that work in browsers with native JavaScript strings with a bit of overhead. Like I mentioned, the latest Firefox and Chromium have all the extensions you need, and it is being actively worked on in WebKit, so we're expecting all major browsers to support Hoot programs in the very near future. End of the year, definitely. Any project like this has its pros and cons, so on the good side, WebAssembly text is the source code format for WebAssembly, and it's just S expressions. We don't even have a separate parser for anything, we just write scheme S expressions. That's quite nice for both generating WebAssembly code and for integrating it, embedding it, and scheme. We have tail calls for free, which is definitely convenient, and the reference type system that WasmGC provides for heap objects, which gives you more than just numbers, it gives you structures for subtyping, it gives you arrays, it gives you 31-bit integer value for tags, stuff, and dynamic languages. So that's all good, and then there are some difficulties that we will be addressing, so there's limited support for strings in WebAssembly or in browsers in general. There are no first class continuations, so we have to sort of reify a lot of the things that would normally just be implicit in the scheme program to get access to it for where, say, access to the stack is limited from WebAssembly for security reasons. Access to the heap is limited for security reasons, so things like that. And there are limited GC features, so there are no finalizers for example, and that's mostly because languages are so diverse in what they allow for that kind of feature. But we'll want some advanced GC features for goblins. So this is a little example from the runtime, showing a bit of WebAssembly code. This is a portp predicate, testing if x is a port object. You see the inline Wasm special form, and then a quoted bit of WebAssembly source code that represents an anonymous function. RefEek is the scheme untypes, scheme value, so we have a function from a scheme value to a scheme value, and it just tests if the x object is a port structure or not and returns the magic five values for true or false depending. So pretty simple, but very convenient for writing the standard library. So we provide our own self-contained WebAssembly tool chain. It lets you develop WebAssembly in one place conveniently from your REPL or whatever you like to use, so it is an alternative to the more mainstream options like binary N or the WebAssembly binary toolkit. It's a set of scheme modules that provide easy access in programs and from the REPL, also some command line tools where that makes sense. That can name basically what you'd expect, parser, assembler, disassembler, linker, dumper. But we also have a very nice interpreter that my co-worker Dave Thompson mostly wrote, and that allows you to do debugging of WebAssembly using tools very similar to what you get in a GaL REPL, which is great for interactive development. So a little example is a function that squares the number. We're importing a couple of packages to get access to the runtime or the interpreter. And we're compiling this Lambda expression. When we look at the Hoot square expression on the next line, the Hoot wrapper indicates that it is a WebAssembly object, not a regular scheme procedure. And then we can call it transparently like a scheme procedure to see that 4 times 4 equals 16. So the basics work. In the future, we're going to be working on more R7 or R5 small features. That's mainly the library system where we just landed R6RS libraries. So we'll be building our 7RS small libraries and GaO modules on top of that functionality. We're going to want definitely a lot more GaO compatibility in terms of libraries and surfies and stuff, and especially fibers for concurrency. It's very important for goblins. And finally, we want to be able to run goblins programs on top of GaO Hoot, and we really expect to be able to do that with few to no changes in terms of the network interfaces and things like that. And the future of WebAssembly also has some bearing on the direction of our projects. We had a pretty decent impact on the WASM GC proposals. The W3C WebAssembly community group is open to anyone who's interested. You don't have to be a big organization. We're definitely not, but we had some influence on the proposal as an early adopter, along with languages like OCaml and Kotlin. And so GaO Hoot makes a scheme, a practical alternative to JavaScript for interactive web applications. So you can write full programs with no JavaScript besides what is needed for loading the WebAssembly file. We believe WebAssembly should have space for all languages, not just low-level statically type ones, and hooters for scheme, but not necessarily just for scheme because our tool chain may be interesting for other projects. We have... We'll skip the rebel demo. We have to stop Robin. Alright. So next speaker can set up. Robin, you can still answer questions. Oh yes. Any questions? Here's one. Where's the microphone? You can have one. I can sound. What happened to the mic? The microphone. There you go. Oh, thank you. I got it. Enjoy the actual question. Do you have a place to get a form in the background? No, maybe. It's not going to work out because there's no narration. Okay, a question. You said that you don't support first class continuations, but you also said that you already do CPS in real. So my question is, if you already do CPS, if you verify your stack, then it seems not so difficult to do that first class continuations after all. We have first class continuations. Repeat the question. Yes, so repeat the question. I think the question was basically... Sorry, could you repeat the premise that started the question? I mean, you said you don't have first class continuations yet, right? We do have... So the question was whether we... was partly whether we have first class continuations, and we do. So we had to modify, add a pass to the GAL compiler tower to do tailification, which is just turning all non-tail calls into tail calls. And so that turns the... That gets us to the point on the Web assembly level where everything is split into small basic blocks. And we have... An ABI that reifies various stacks, and so we do have access to delimited continuations and who... And actually, call with current continuation is implemented in terms of delimited continuations, which I think is pretty rare for schemes. Okay, so you do reify the stack. Yeah, we have to reify stack. We have to reify several stacks because of Web assembly details, but we do have first class... Delimited and full continuations available right now. It's one of the first things we designed for. And that was Andy's design work for getting you out to work with Web assembly. Any more questions? I think we have a couple of questions. Andy, you have to go back for getting you out to work with Web assembly. Any more questions? You have to go back? Yes. How do you shoot in Strigoform? How do you shoot in Strigoform? I believe that you hit the Z or Z button. Strigoform. Strigoform? Okay, so yes. So Z fires. And yes, if you load the talk in or just load up the game directly on HIO, it's by my co-worker David Thompson with Graphics by Christine. And it's got particle effects, parallax shooting, audio, etc. are going on. But Scheme is not the bottleneck in terms of performance. The bottleneck is the JavaScript canvas. So we're doing that well on performance already without having taken it into consideration before. Thank you. Oh. Do we have time for another? I have a question from Matrix. I understand one of the goals is compatibility with Gile. However, some of the functions in the Gile standard library like open file and networking functions are not defined with security capabilities in mind. What's your plan to deal with these cases? That's a fantastic question. Christine will address some of this in hard talk at noon. And there are also some WebAssembly proposals like Wazzy and Wazix that could provide POSIX compatibility if we wanted to run it on a non-vouser runtime for, say, geeks integration or something like that. But in general, those APIs won't be used by Goblin's programs for the most part. But we will provide the compatibility. We will provide both because we already started the next talk.
RISC-V Bootstrapping in Guix and Live-Bootstrap
or say geeks integration or something like that. But in general, those APIs won't be used by Goblin's programs for the most part. But we will provide the compatibility. Because we already started the next door. All right, very good. Thank you very much. That's the, that thing. Hi, can you hear me? Yeah, right? Okay. How many people here is aware of the problem of the bootstrapping? Place hands. Okay, that's good. That's better than I expected. That's fine. So, first of all, this is a disclaimer. I wrote everything I'm going to talk about in my blog. And also I give a talk the last year. So if you really want nitty gritty details about the bootstrapping process, go there. This is not going to be a very technical talk. Okay? It's going to be just an explanation of what we did in the RISC-5 world in the bootstrapping process on gigs and live bootstrap. So, this is me, right? I'm a telecommunication engineer, and a freelance programmer, and I work a lot in gigs. So maybe you remember me from the last year. I gave this talk. There we explain the bootstrapping problem is you have more interest on that. There's more slides on that and a, quite a long explanation of what we are doing and why. So this is the context. I work with NLNet. The last year, so they paid me literally to make some work in the bootstrapping process. I back ported some support for RISC-5 to another GCC to the 4.6. And also I back ported support for tiny GCC boot, which is a fork we are maintaining in order to be able to bootstrap the compilers. So I'm going to talk a little bit more about this later. So this was explained in the last year, so that's nice. So this year, I decided to continue with this project, but it was completely burnt out, and I needed help because people always helps, right? So I added more people to the project. These two are most, the ones that took more work in this port, and they literally gave me the energy to continue, right? So Andrews is very interested in the project because he works in Live Bootstrap and Stage Zero, which are projects that are very related with this. We are going to see them later. And Janneke is the author of MES and also the maintainer of tiny GCC boot. We are going to talk about that just now. So let's see in pictures, right? There are some colors, but I'm going to point. So if anyone has problems with the colors, it's no worries. So this is what we had before my project, right? We have Stage Zero POSIX, which is source code, right? Then we built with that, we use that to build MES, and with MES, we try to build a bootstrapable tiny GCC, which is a fork of tiny GCC, but that is easier, right? The C code it uses is simpler to be able to build. Then we try to build tiny GCC, then we go for a very, very old GCC from the 90s, right? And then we go for a modern GCC, maybe with many steps in the middle in all of the parts, and then we try to compile the world with GCC. So now the colors. All this is the current bootstrapping process that is in live bootstrap. We have it in Geeks too. So this just works, but only in X86. So I'm working in the RISC-V port of all this. So RISC-V, the status of the RISC-V part was these two parts of the top, they were already having some RISC-V support. It was working pretty fine, okay? The bootstrap of all tiny GCC had zero RISC-V support. Tiny GCC, it was supposed to have some RISC-V support, but it was worse than we thought. GCC didn't have it because these are very old GCCs. They were written before RISC-V was invented, so no support there. And the modern GCC that supports RISC-V is the 7.5 version. Then the world, some things support RISC-V, some others don't, but that's not my problem. I'm only working from here to the top, so don't worry about that. So after my effort, my previous effort, I took the support from this GCC, which is kind of a modern GCC, 7.5 is not at all, took that to GCC-4, this is a note here, this is written in C, this is written in C++, ha ha, I had a lot of fun there. And also, I took the support from here and I moved it to this one, right? So this was also, I think it's like a 10-year difference between these two, so the APIs, the internal APIs changed, many things are very difficult. GCC is horrible to read. Maybe here there is the maintainer, I'm sorry, but it's really hard to read this project, I'm sorry. So at the time, we didn't know that this was orange, this is not fully supported in RISC-V, we thought it was completely green, fantastic, no, it's not, so problems there. And this one, I finished this backboard and I thought I was going to have issues with this, but it happened to be pretty much okay, so nowadays this is way greener than we thought at the beginning, so this is before what we did this year, right? Starting in June, we started working on this with the people I already mentioned, and now we settled to this point and this is already in LiveWoodstrap and we have it in Geeks, in Core Updates branch, this is already upstream to Geeks. So until here, everything works, so thank you very much. Good, so this part we already tested in Adobe Machine, this part we tested in Adobe Machine in real hardware, in RISC-V board we have, and this also works, this GCC 4.6 compiling stuff for RISC-V. So a compiler that was written before this architecture was invented is compiling to that, so that's also very nice, we have it, yeah? So this is more or less what we did. So there are problems though, the arrows here are still red, I don't like that. So why are they, they are still red, why? So TinyCC request some changes in the C library where we have here, so we need to change those to make them work, right? Also, the old GCC requires make, which I managed to compile the other day. And it requires some other stuff, right? It requires patch, we also need G-SIP, which I didn't have the time to compile, and some other things. So also this jump is going to be kind of complex because GCC really has a very complex build system, maybe you tried, it's a really complicated thing, right? So it should just work, but it probably won't. So, questions now, and I have some extra slides for later, but anyone has any question? No? No, okay, extra slides. So we had some limitations in the backboard we did, and this is what we have been playing with since June this year. So when I made the backboards, I was working only using a cross compiler. So if you're working on a cross compiler setup from x86 compiling stuff for RISC-5, you are going to have a lot of problems, why? Well, first of all, you have the bootstrapping problem we're going to show in the next slide. And also, I was using G-Lipsy, which is a very powerful Lipsy, and we don't have that in the bootstrapping process. There's no Lipsy, so we need to play around with all the stuff we have, like Meslipsy, which is written by us, so it's probably not going to be great. We're not that good after all. So also, there's the RISC-5 assembly issue. In TinyCC, the RISC-5 assembly they have, it doesn't use the same syntax as gas does. So our library was expecting a gas syntax, and this doesn't provide that. And also, it doesn't support the extended assembler. So we can't really mix very well C code with assembly code, and we need to play around with all the variables, protect them, and make all those things we have to do by hand, and that's a problem. So this is how TCC is built. The graph I showed you before is just a lie, but it's a good lie. So this is how it works. We first build Meslipsy, we take some part of the code of the TinyCC boot, and with that we build this one, and with that we build this one, and we change the flags of the code so we add more features. With that one, we build another one. We take the code again, we build another one. We do these six times, and then, of course, all these steps will need to work. There is a lot of bash, bash glue code in the middle to make all this happen, and you have to fix that too. And fixing very old bash code we did for this kind of thing is even harder than reading the compiler, but anyway. So then we check that this one and this one are the same. In the binary level, they have to be exactly the same. That means the compiler is not adding new stuff, so we have settled. So we can just continue with those. My colleague, Andrews, already tested that we are already settling the fourth iteration, but we do six because we did six and we don't want to change. They did. In live bootstrap, they only do four, right? So problems with GCC. I only tested, again, as a cross compiler the last year because I only wanted to see that it was able to compile things for RISC-5. And again, I wasn't doing the bootstrapping process of GCC. GCC does internally, when you build GCC by hand, they do a similar thing. They take all the code base of GCC, they create a previous GCC, then they take the code, they compile it again with the GCC they created, and then again, and then they compare. So I wasn't able to do that. And I didn't work in the C++ support either. So the work we did, we started with tiny CC boot, and we started working on top of it. We had to read, we spent many times, many nights, debugging crazy things, and also because Andrew has a real job, not like me. So we need to coordinate to do these kind of things. It was really hard. So also we don't have debug symbols because our compiler is very simple, and implementing that takes a lot of time, and it's difficult. So we do all of this like one hand here. It's very hard to do with one hand and also blindfolded. So, but we managed to do that. I wouldn't have the energy to do this without Andrew, so thank you, Andrews. Also, well, these are some errors, I explained them in my blog, then or later you can come and ask me about them. This is a lot of fun because the body was never executed for any X, it didn't matter. This happens a lot in RISC-5, sorry, in TCC, and in our backend, it exploded. Why? Because this is undefined behavior, and all the compiler was based on this. So they used these to clear bits, and we needed to check all the appearances of this and fix them all. So, funny stuff. Yeah, and many other things we found a lot. So you can read about them there. There's a very long explanation about all of that. Yeah, so we finally managed to build it, we have it, we have a recipe in Laboodstrap and in Geeks. Yeah. So, about mess. We had all the stuff in mess because it was affected by our work in TCC boot, so we started fixing stuff. Why there were errors in mess? Obviously because we were not perfect. Yannick almost is, but still. We had some issues because the bootstrapping process of I386 didn't use all the C constructs that appeared in RISC-5, so it started fixing many things, like the switch cases, they were wrong. The initialization of the structures were initialized to 22, I don't know why. So, these kind of things. And I am almost there. Well, TCC is the same. We finally managed to compile it in a different machine, with C++ support, all of that. Okay, fantastic. So, last words. So, people is important. If you're alone, you don't work well. I had issues, I was like completely depressed, burnout. So, bringing people, giving me energy, the knowledge I lack, and emotional support, good stuff. Also, money is important. You all know this, but if you're getting paid, you work better, you don't feel stressed, you are not just trying to eat the next day, to just get paid, do your work, that's fine, that's good. You can focus. So, thank you to Andrews, to Janneke, also on the net for the money. And you for listening. Thank you. And it's our question. I don't know if we are in time. We have time for questions. Okay. We'll set off to the data. Questions? Regarding both the people and the money, will you be continuing your work? Yeah, so regarding money, the people and all that, will I continue with the project? I'm not sure. I'm not sure. I don't think I'll be doing it. I don't think so. I don't think so. I don't think so. I don't think so. I don't think so. I don't think so. I don't think so. Will I continue with the project? We have funding and stuff to do still, until I think the project finishes in one year. So we're starting in June. So until June, we're going to continue. I'm still working on it. I have, most of the budget is not spent because we need to finally combine until GCC. So the project is to... June, we're going to go. Yeah. More? No? Yeah? I was listening to a ZIP project. It's used an interesting approach to this, where they use WebAssembly. So how is it working? Your project was you use the latest GCC to compile GCC to WebAssembly. And then your problem on risk five, is you just need to bootstrap a WebAssembly runtime, which is very small, to run GCC on risk five. Do you think this kind of approach might work in your environment, or is that just very specific to SIG's problem? So the question is around how the people at SIG resolved their issue with the bootstrapping. And they are yet using a WebAssembly environment stuff. And if we can do the same, or if that makes sense in this. So our idea of this is that we want to build everything from source in your machine. Why? Because if you get a Linux distribution, you download a Vivian or whatever from the internet, you are getting a lot of binary blobs. So the idea is just to get the source. So that's not very compatible with the approach you are proposing, because you won't get sources. You will get some kind of a wasn't thinking. And that's not easy to inspect. So what we have here is that you can inspect everything, starting from a very small binary that is written with comments, so you can read the comments on the binary that the bootstraps everything. So the idea is philosophically different. And I'm a little bit upset with the problem with SIG, because I really like the language. And now adding this wasn't thinking in the middle is making us very difficult to add SIG to Giggs, because we will have this kind of binary in there. And we don't really like that, because we want everything to be sourced. But yeah, the idea is good. But philosophically, it doesn't match with what we are doing. Any other time for another one? Piot? Yeah. No more? OK. Right? Thank you. Thank you, guys. You have one, Piot. OK, you have one. Sorry. What about the arm board? The arm. I don't know. I'm not sure. Maybe you should ask another people here, like Danny. But everything we are doing, the RISC-5 board we are doing is 64-bit. It's going to affect all the other 64-bit architecture. So we are doing advances in X, 66, 64, and ARM, and all things. So yeah. Yeah, shoot. Shoot. All yours. Yeah. So for the arm board, we got as far as compiling tiny CCC, and that one compiles an old GCC. And that old GCC has a lot of problems that are well known. Nokia had a lot of fun back then with these bugs. And so we are waiting for you to update GCC, and hopefully that fixes everything. So. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah.
Self-hosting and autonomy using guix-forge
So, good morning everyone. This is talk about Geeksforge. So, first let me explain what Geeksforge is about. So, Geeksforge is a Geeks channel that has services that will allow you to run a complete GitHub like software forge, but fully on free software and using existing free software components like Seagate and Git, of course, the laminar continuous integration system, something like public inbox and so on. So, usually when we try to build GitHub alternatives, we have monolithic systems like GitLab or GitE, Gox and so on. What Geeksforge tries to do different is use old and existing very stable components like Seagate and assemble it all together into a system that resembles a software forge. And it is assembled together using Geeks. So, you have a nice declarative configuration that you can just deploy practically anywhere. So, in a sense, it's like million a box if you have heard of the project, million a box, they set up complete mail server on a system using by integrating many different components. It's like that, but for software forges and using Geeks. So, first I'll start with a quick demo of the Geeks system containers. This is quite widely used as a package manager, but as a means to deploy Geeks, a full operating system and operating system containers, it's not so widely used. So, I just want to quickly show you a demo of how it works. So, this is a really simple operating system configuration. It just has an engine service that listens on 8080 and serves static directory. So, let me build that. So, the static directory has a simple HTML file that I just wrote up. So, first let's build the container. You build it using Geeks system container. And the hyphen capital N is to enable network access. And the container is completely stateless, something like Docker where you have attached storage somehow. So, you have to mount all storage, all state into the container. And that's why we have the expose here. So, you have this script that has been returned. So, if you open it, it's really just a guy's script that sets up the container and has all the dependencies built into the store itself. So, let me now run it. So, pseudo... Yeah. It says that my Geeks is two worlds, older than 30 days. So, I have started up the container. Let's just go to localhost 8080. And it works. So, this is just the static HTML page. Now, let's try to set up a container that actually uses the Geeks 4 channel. So, this is a more complicated configuration, operating system configuration. Here, I want to show you the Seagate service that Geeks 4 provides. So, it's really simple and it just takes a server name, which is the domain name. And then the repository directory where all the gate repositories are stored. And then you have something called a Forge engine X service, which is similar to the basic engine X service that you have in Geeks upstream. But it automatically handles things like HTTPS, acquiring a TLS certificate, setting up a crown job to periodically renew the certificate, automatic redirection from HTTP to HTTPS and so on. So, it does a lot of things in a very turnkey, fully automated way. You just push the button and you get it essentially. And this is the Acme service configuration. So, Acme is the protocol behind the Let's Encrypt. You have to register an email ID of that. So, that's my email ID. So, in this configuration, I'm currently using the staging URL. It's good for testing because you won't run into any rate limits. So, I'll actually take the risk and delete that. We'll try to build with a real Acme server. So, here again, I'll build a container and run it. I'm mounting a couple of state directories, Acme directory and the GitR Poster directory. So, there it is. It started. So, I'll go to git.demo.system.rego.net. So, initially, the container set up with a self-signed certificate. So, it doesn't work. So, let's actually get real certificates. So, find the shepherd. So, the PID of the container is 19.262. I drop into a shell. Get some source and profile. So, GitX4 sets up a script under user bin. Acme is any... Yeah, I'm inside the container. Yeah. So, around the script. And the script has been automatically configured with all the domain names that need certificates. And now it is actually getting certificates from Let's Encrypt. If you can see the logs, it's telling you what it's doing. Yeah, that it has a certificate and it has restarted the Nginx service as well. Now, if I reload this, it should work with proper certificates. Let's try. Yeah, there you go. So, this is Git. And you can browse some repositories that I put in there. So, Git is really simple, but it doesn't come with all features properly enabled by default. And you have to do a lot of manual tinkering to get it to work. For example, by default, it only serves the dumb HTTP transport protocol for Git. So, but the C Git Nginx 4G is set up with the smart HTTP protocol. That's one. And then you have things like... So, this C Git can render org mode readme files, which the basics it can't do. So, this is actually an org mode readme file in this repo. Then you have things like syntax highlighting that is automatically set up again. So, let's just look at the make file maybe. Yeah, so, yeah, you see the syntax highlighting. So, for that it uses Python pigments. So, my point is that Gitx 4G tries to do all this for you and doesn't expose all this complexity to the administrator. And all you're really saying here in this configuration is domain name and the directory where the repositories are. So, it handles a lot of things with very sensible defaults behind your back. So, that's that. Yeah. How much time do I have? Okay. Okay. So, the philosophy behind Gitx 4G is that it has to be really minimalistic. I don't want to be running a full database server just to publish a few GitHub postries and run a small project. And it should be as stateless as possible. Of course, you need a little bit of state for if you need a mailing list or if you need to backup your Git reports, of course. But it should not have hard to backup state like a database that you have to be that takes a lot of cognitive overhead to keep working successfully. As to, should be as donkey as possible, but it should still be able to inspect it and fit it in your head. It should not be something that is so complex that you cannot hold it in your head. And effectively, what the, what Geeks 4 and the, the Geeks 4 channel is doing is that it's, it's crowdsourcing server management in some sense. So, the regular server which you are always, which have, for which you have to mutate configuration files, you are the only one who's in charge of the server. But when you have Geeks 4 doing a lot of things for you, you're essentially getting a community to help you with managing your server. And so hopefully that will reduce configuration errors and let you run a polished server setup without putting in too much work. So that's it. Thank you. Nobody complains when the speaker is too quick, right? Is this a replacement from GitHub? Yeah, it's meant to be. What about the fast pushing process and we having these things? Can we support them with this Geeks 4? Do you mean the email workflow? Yeah. Yeah, so I don't mean to support public inbox based mailing list. Instead of pull request based model. I think that's easy to set up using existing tools and personally I think it's better than the pull request based model. Questions? Yeah. So I think you mentioned it's in a separate channel. Yeah. And are you planning to upstream it and what would be needed for that? So, can we repeat the question? Yes. Yes. Yes. Sorry. So I'm planning to upstream it into Geeks upstream instead of having a separate channel. So certainly there are some parts that can be upstream. For example, the automatic HTTPS that I demo it can certainly it should be upstreamed. But the all the other services I'm not really sure. So I'm not sure how much of this fits into Geeks upstream itself. We already have a Seagate service in Geeks upstream that doesn't do as much as the Seagate service in Geeks 4. So upstreaming this will essentially break the old service. Maybe it should be called something else now. So that's a difficult conversation to have. Could you have a Meta service? Sorry? Do you have a service with all your special services? I do have a 4 service. It's not fully integrated but it aims to be a full Meta service. Yeah. Can you show Laminar? Oh yes. I can show it in the browser. So this is Laminar which is a continuous integration system. So this is a system that we are already running. It's not running on this laptop. It's running on a different server. And it's a really simple continuous integration system that is very easy to set up. Like most continuous integration systems are so complex that they read the very enterpracy projects that are not meant for a single person to set up. But Laminar is really easy and you should have a look at the documentation itself. It's just a single page of documentation and you can set it up. So we use that in Geeks 4. And it fits in with the philosophy of using very minimal tools. We also have class in Geeks 4. Class is another Git reviewer which is written in Python. So you have even a choice for... If you don't like CGIT you can use class and maybe you can support Git delay and other Git viewers too. Sure. So these are the Git logs. Maybe... Yeah, make file again. So class is just a Git reviewer. It doesn't do anything else. Yeah, it supports the Smart HTTP protocol. Yeah. So you mentioned that the TLS stuff is automated as well. But with the demo there was something that seemed kind of manual? Oh yeah. So the manual step that I showed you is only the first time. And after that that same script is ran as a cron job. I need to get rid of the first manual step but I think I need to patch something in Geeks upstream for it to happen. So yeah. Question. Would it be easy to use this process to set up your own channel and then auto build your packages and then deliver that as a substitute? Yeah. Yeah. Get them to end the flow? Yeah. So we already do that in my Geeks 4 instance. And we also have the... So that is the Geeks bioinformatics channel which Pewter runs. And we already do that for all the packages in Geeks bioinformatics. For example, here you see names of many packages. Some of them build, some of them fail. And I think it's... Using laminar and Geeks 4 is simpler than something as complicated as... As Geeks is quicker as CIS. And I really don't want to be running Postgres to have... To just provide substitutes for my channel. So we have a replacement here for many things, right? Yeah. Including GitHub CI. We don't use GitHub CI anymore. Yeah, we don't use GitHub CI anymore. Alright. Thank you.
Spritely, Guile, Guix: a unified vision for user security
Well, we're sitting here waiting, and I got up here thanks to the previous speaker giving a generous amount of time to set up. I'm going to show off a little bit more of this wonderful thing that we have here called Strigiform, a pun on the Latin name for owls, which is a space shooter written in scheme compiled with Guile Hoot. And you can play it in your web browser, sure enough. And it's got, this was done by David Thompson for the last Lisp game jam, who did all the code, and I did all the music. And this is a real scheme application you're about to see in your browser, so let's start it. Oh, yeah. Let's do this. Ooh, we've got an alternate firing mode. Make your ship move slower. Oh, yeah, we got that. All right. Look, there's particle effects. Particle effects in the browser. Is this scheme? It's scheme. Parallax scrolling of stars. Can't freaking believe it. People said that we couldn't bring the scheme to the browser and we're doing it. So look how good I am. Now actually a fun fact about David Thompson, who we all adore at the Spritely Institute and are incredibly grateful every day to be able to work with, is that Dave loves space shooters as much as I do, and in fact makes, has made several space shooters for the Lisp game jam in Gile, including one who ever heard, who here has ever heard of the game, Ekagura? Raise your hand. Oh, like, okay, like three people. Okay. Well, Dave built a version called Lispagura, and you know, it's pretty good. But anyway, you can play this game yourself. You can witness the power of scheme directly in the browser. Oh, there we go. It's bound to happen. But you know, it's going pretty well. You know, oh god, there we go. The moment I start, it's getting harder. Where are we at on time? I'm full screen. I can't see. Do I still have a couple minutes? Are we there? Is it time to stop? Time to stop? Okay. All right, okay. Yeah, there we're right on the hour. Here's the real thing about this game. It also has a badass boss in it with like pulsing blood veins, like moving across like its forehead and everything like that. It looks awesome. And I did the graphics. Thank you very much for them. And you look awesome. But Dave made it work, so you know. That's more important. Anyway, hello. My name is Christine Leverweber. I am the CTO of the Spritely Institute. I've recognized quite a few faces here, so maybe some of you have seen me before. This talk is about Spritely, Gile and Geeks, a unified vision for user security. And what is the Spritely Network Communities Institute? Well, we are a research institution. We are building the future of decentralized networks from a protocol perspective, from a software perspective, from a strong consideration of how human beings interact. And everything we do is free and open source software, and it's all in the public interest. And we are a 501c3 nonprofit in the U.S. And research means collaboration. If you're excited by this, you happen to be working with some sort of organization that you want to collaborate with us, that would be great. If you're an individual and you want to collaborate with us, also great. I'm going to talk for just a moment about network communities at a general level, because we're the Spritely Network Communities Institute. And some of you may know me for my previous work on Activity Pub. In fact, Jessica Tallin, right there, raise your hand. She won't want to, but there she is. She's a co-author with me on this back. We're not the only people who worked on it, but of Activity Pub, which if you're familiar with Mastodon, etc., is what connects all those things together. And so we have some background in social things. And in fact, our background goes back even way further. Here is Lucasfilm's habitat, first ever massively multiplayer virtual world, ran on the Commodore freaking 64 with thousands of users. That should not have been possible with the graphical virtual world at the time, and yet they did it. And we are building towards more social systems. So here you see a mock-up of kind of what we're doing. We have a series of these mock-ups, but this is not the right talk for that, even though there's some interesting ideas that are kind of hidden in here of decentralized naming ideas. But because we are going to be talking about some more low-level details. And so this time I'm really talking with this audience, so we should figure out what we want to do, right? So we want to bring user freedom to everyone. We're at FOSDOM, makes sense. We want to make computers safe for everybody. We want to introduce network programming like you never seen before, and we're going to take over the web and with a power of combining our powers together, we're going to take over operating systems, and we'll hear about how. So from a perspective of, we chose to use Gile Scheme, right? Which is a Scheme, which is a Scheme, which is a list, which is a family of languages, like not that many people use. And why do we do this? It's because we love it, but it's also because it's really powerful. And actually if you go to the Gile website, you go to the documentation page, we're very proud that the first link that you see on there is actually our tutorial, Scheme Primer, which introduces somebody who's never used Scheme before, how to be able to, it's like a compressed version of structure interpretation in computer programs in like 30 pages. You go from knowing nothing to writing your own Scheme interpreter inside a Scheme at the end. And this has been pretty popular. But the reason we're using this stuff is not to just show off all these parentheses and how cool it is and everything, but it's because Lisp is clay. We want a language foundation which allows us to be able to build and express ideas very easily and very powerfully, and Lisp allows us to do that. Lisp is a power of composable domain-specific languages, and that means the types of things that we are building, of decentralized networks and et cetera, can also be composed with the types of things that, for example, the Geeks community is doing, even though they seem to be attacking something very different, because you can combine these together very easily. And so I'm going to talk about Spritely Goblins. It's our distributed, cooperative, transactional programming environment. We have versions for both Gile and for Racket. Gile is the main version these days. And it's based off of this family of computer science research called object capability security, which sounds really intimidating, but it's actually the least intimidating thing possible because it's distributed security you can understand. If you don't have it, you can't use it. It's just ordinary argument passing the way that developers do every day. If you have a reference to something, then you're able to do it, and the references to objects turn out to have very composable patterns. And if you want to understand more about those patterns, and you are a schemer, this is a wonderful paper. It really helped me understand everything. It's a security kernel bias done along the calculus. It's by Jonathan Reese of R5RS Scheme and many other things. And it explains how if you take a simple lexically-scoped language like Scheme and you treat it very seriously as that thing, that's your security model. LAMDA is your security model. And it saves you from a lot of dangerous things like confused deputies and ambient authority problems and heron and things like the access control list model, which most of us are familiar with from Unix and et cetera. Now when I say this rightly, Goblin's is really powerful and easy to do things in. Here is something that our engineer, same one who did that Strigiform thing I was showing off earlier, David Thompson did in his first week on the job. He had never used Goblin's before. He had read a little bit about it. But you know, day three, I'm like, okay, you've had your deep dive into Goblin's. All right. Now I want you to build me a distributed game. And since he's David Thompson, he can do the game part. And so he programmed this collaborative, lovely little garden demo in one day. And then wrote a blog post about it in day two because he's incredible in David Thompson. But even though he's David Thompson, it's really amazing that you could do this in one day. And what's really interesting is that this was all written in one process on one computer. And then when you hook it up, Goblin's has this thing called O-Cappen. And it's very specifically integrated so that you do this ordinary programming that automatically works over the network. These are based off of ideas that have existed for decades, but have been forgotten. And we've been pulling them off the shelf and trying to bring them back to life and bringing them to the world of scheme. And in fact, here it doesn't just work in one, across one language, runtime environment. Here are two different runtime environments. You can see two implementations of the guile version of this minimalist chatroom thing we created called Goblin Chat. Our distributed network thing can run over multiple networking substrates, including Torrnian services, which is slow as molasses, which is why you'll notice a lot of lag, but it can run over faster things too. But what's interesting here is that this is end to end encrypted. We can verify that the messages come from the user that they claim they did. And the code for the user and for the chatroom is 150 lines of code, of understandable scheme code. And what's also really interesting is that this whole program was also written entirely on one process, entirely on one computer, and then when we hooked it up to the network, it just worked. The distributed, the communication between the things just worked. And that's the kind of power that Goblins gives you. But that's not all the kind of power Goblins gives you, because we also are a transactional programming environment. Here is, in a very small amount of code that I won't get into, because it's very dense, a implementation of a bank, actually, a very minimalist fiat bank. And this is, you know, what's also interesting about this is that if something goes bad in the middle of this, if somehow, somehow, one piece of state was being updated and then it crashed before it updated the next piece of state that it needed to do both, that would just roll back. Because Goblins is automatically transactional, which is a really interesting and useful feature, because building distributed programs that don't have their state corrupted is actually really difficult. And here's one of the first things that I did when I was testing out the design of Goblins. This is an ASCII art space shooter running in a terminal, right? And you'd think that's the coolest part, but what's really the coolest part is that, what just happened there? Moving backwards and forwards in time. Because what is unlimited transactionality? It's time travel, right? This entire game, I programmed this entire game without even thinking about that I should add, that we already had the time travel feature effectively, and then I just exposed it in about an hour or two, just the gooey for it. I didn't have to change a single line of code of the gameplay to be able to make this happen. Because Goblins already comes with the fundamental abstractions for this. And in fact, we use this to be able to make your life easier. Debugging is one of the most important things you can do, especially in distributed systems, which are notoriously difficult to understand. So Goblins comes with a time traveling distributed debugger. And you don't even have to leave your REPL for it. You are able to use the tools directly within your REPL. You are able to move back and forth in time to find out what's going on, what's wrong, and actually debug objects at the time that the errors were occurring. And not only that, but it can visualize what happened to you. And this prints out right in the REPL. You don't even have to leave your tool. So this allows for a strong amount of developer productivity. And now, maybe you're not a guile person, right? Maybe you're like, well, this all sounds really cool, but I want this in Haskell or something else like that, right? Well, good news, because we are taking the designs that have been extrapolated from basically the work of Goblins. And actually, Jessica Tallin has done the hard work over this last year to write these up, thank you in Elnette for funding Jessica, to take all of the core inner mechanics of how the network protocol works and write them as specifications. The same way that Activity Pub is a specification that many different implementations are able to be members of the Fediverse, we would like to have many different implementations be members of the distributed world. But in order to be able to build this, in order to be able to understand this, the Lisp is clay aspect allowed us to be able to get here and understand things and move very efficiently. And also, it's just a delight to program it, right? So you don't have to worry about these types of things. These are details that we've solved for you so that the programmer can just focus on the code, but also, you know, cool graphic, right? So now that all sounds cool and well, but you know, you might say, Christian, and then I take a sip of water, very dramatically, and you say, Christian, but how are users going to use this? Do they all have to run geeks? Do they, you know, how do we get it to them, right? You know, maybe we have geeks packed, maybe we can get things to them a little bit that way, but you know, the GTK really work right on these other platforms, you know, what do we do, right? And what everybody has today is a browser, right? Everyone has a browser, right? So we want to be in the browser, right? And that is why we launched the Hoot project, right? And generously funded by the folks at Consensus and Mediasque who also like OCAP security. This is an example, if you're here for Robin's talk, you've already seen this bit, but this is a snippet of scheme code for a cellular automata called Wireworld, and it's a real scheme code. And what you see here is it actually executing and running in the browser, and it looks cool as heck, right? You know, but the other interesting thing is that, you know, you can do more than this. We are increasingly working on things like a foreign function interface to be able to make it so that you can integrate all sorts of things in the browser and be able to do, interact with, you know, the Canvas, interact with HTML. Dave wrote an example of a functional reactive programming app using nothing but scheme kind of React like, and there's an example of that on our blog, and that's the type of stuff that we're doing. But let's talk about the SecureOS vision because we are here and there are many scheme people in the audience, and surely you're dying to know what we are saying about combining goblins and geeks. Well, what happens if we combine goblins and geeks? Well, we have many computers, many of us are running computers using geeks, and many of us are running many computers using geeks, and since goblins is a distributed programming environment, what if we had distributed fabric across our different geeks instances where we can securely cooperate across multiple geeks instances? Now this is something that's not only free software nerds doing something, you know, and trying to catch up to everyone else, but this would be something really new that no one else is doing right now, right? This is, would be exciting territory. And so for example, imagine if we switch the geeks build daemon over to using goblins. You know, one machine could say, hey, I have a recipe for something I'd like to build, and then I want to have it deployed on this other machine. They can send it over to this other machine which builds it, and then it sends it over to the next one which deploys it. That's the type of thing we could do if we moved over to goblins' tech. And it's very natural. And in fact, geeks actually already is kind of moving down the Ocap-ish direction in slight, slight ways, this lovely Least Authority wrapper, right? You know, Least Authority and Ocap, you know, have a lot of shared history type stuff, maybe even, you know, some amount of overlapping volt and diagram stuff, but also, right? So, this is the right time to say, you know, if all this excited to you, the most important thing you can do is make something cool. We have these cool tools, and they've been just now, like, it's taken a while. Like, people are like, oh, should I pick up and run with these things? We've been kind of like, yeah, like, mostly. But like, within the last couple releases, and especially the next two upcoming releases, I think we'll really be reaching the point where this stuff is of both goblins and hood, where we can feel much more comfortable saying, actually, you should be picking this up and running with this. And so, we are planning on doing some hackathons and things like that, and maybe even jointly with the Geeks community. We've had some vague conversations about that, maybe a Spritely and Geeks joint hackathon, and we'd love to see whatever you're interested in making. And you know, maybe if you're especially excited about the Geeks plus Goblins dream, maybe you could participate in that. So we are a, I said that we are a 501c3 non-profit. We are a research institution. We are working with these different organizations, but also, maybe we should be working with you, right? Maybe we should be working with your organization. Maybe you'd love to give us money. Hey, we don't mind that. Maybe you'd like to be a technical partner. Maybe you'd like to jointly apply for funding with us. These are real opportunities that we can have, and so we should talk about them. Also, you can donate. We're a 501c3 non-profit. Now, most of the people here seem to be from Europe, so maybe this doesn't excite you quite as much. If you're in the US, it's tax-exempt, you know, tax-exempt donations, but even if you're not in the US, you're also donating to an organization with a mission that surely aligns with you at this very moment, hearing this ridiculous woman up on stage waving her arms around madly, and you must be thinking, gosh, I have to give them all the money, right? So, but here's another thing you can do. You can come up, if you are excited, come up and say hello and get some stickers. We have the most amazing stickers you've ever seen. It has our non-binary goblin mascot and a bunch of clothes you can dress them up in. And it also has the owl, and you can also dress up the owl in some stuff, not quite as many, but you can also dress up the owl. And please, Spritely representatives, please raise your hands. Jessica, Juliana, Robin, and myself, of course, we all have stickers come up, say hello, say I would like some of these stickers, and if you don't mind, tell us what are you excited about, right? Maybe how you would like to work together. And so, finally, I'm going to say let's build it all together. We are the Spritely Network Communities Institute. Communities are about collaboration. They are about the building of trust. Trust cannot be forced. Trust is a consensually collaborative process between multiple parties. That is the very foundation of the kind of route that we are looking at in our technology, but also, we would like to build with you, if you are excited about this stuff, let's build it together. Let's make it happen in whatever way, shape, or form. And that's it. Questions or excitement about stickers? By the way, take a look at this. It's keeping the browser. So who's got the first question? Does it work when it's not plugged in? Does what work when it's not plugged in? No, I like that. And the calculator presentation, like that, you put some numbers at the calculator. Oh, you unplug it. Well, okay, so actually, maybe are you asking if you're not hooked up to the network or is this just a joke? It's a joke, okay. You know, actually, disconnect handling is one of the next major features in the next version of, you heard me say that especially in the next version, one of the biggest things that we need to work on actually is disconnect handling. So this is, your joke has an appropriate thing. When two different instances are talking to each other and something goes wrong, you need to propagate the information and actually register a call back to what's happening, we have part of it, we need to finish it, what's happening in the next release. So your joke is actually pretty relevant. Next question. So I'm a bit afraid of the browser. Okay. Not very minimal. Do you have any recommendations there? So if you're afraid of the browser, it's not very minimal. There's multiple things we can say about that. Number one, WebAssembly is actually a very minimal virtual machine itself and it's increasingly being used by things that are not just browsers, right? So people are starting to use it as, so actually we did two versions hilariously of the wire world thing because we're ridiculous. This first version, it's actually kind of interesting, actually runs on top of something called Wazm 4. I don't know if you've ever heard of it, but it's a virtual machine, it's a fantasy console like an old Game Boy. That's very small, very tiny, but it runs on top of WebAssembly. So this first version, one of the interesting things about Hoot is it actually includes a full, doesn't use M script in, doesn't use LLVM, doesn't use any of that BS, comes with its own assembler, its own disassembler, it comes with its own virtual machine seeking to run the things, but for developer purposes it's not fast. And you, and this is all included actually with Hoot, it's a full toolkit. So for the first version of wire world, before we had proper scheme working, this was actually written in handwritten WebAssembly, mostly by Robin, a very small amount by me, but right at the end, but mostly by Robin. And of course you can play this, you probably saw Robin do a little bit before, and it works, right? So, and you know, there you go. But the other thing is that we have, we did a second one, which was the version that we did in scheme, one scheme became available actually. So, and this one, just to give you an idea, this one, here's a size of wire world here as a scheme program, and it looks like scheme, right? And that's it, right? You know, and that powers that version of wire world, right? So the reality is, WebAssembly is useful in contexts that are not just that, but also goblins is not specific to the browser, right? The browser is the way that we are planning on reaching people, but it does not necessarily, if you're excited about goblins, but you're, the browser will not be required to be able to get geeks to use goblins, for example. We're not, we're not saying, okay, geeks has to be an electron app now, right? You know, that's not an expectation, right? So, yeah, other questions? The thing is that aren't necessarily in the scheme yet. Is there going to be interoperability with other language one time, or other ways of reaching out to existing node bases, or other, just, if when you're building, just use it as. Yeah, yeah, so right now you can actually already, I have no idea what just happened, I think this is development console, you can actually already speak with JavaScript stuff inside of, inside of, sorry, I got distracted, I'll do that. You can speak with other JavaScript applications when you're using Hoot, like the, there's a full, like, bridge between those two worlds, and also, so that's from the Hoot perspective. From the goblins perspective, there are multiple implementations being done of O'Cappen, none of them, sorry, I'm going to be early to smug here, none of them are as cool as goblins because they don't have time-transl and transactionality and all the cool crap we have. But there's like somebody doing a Haskell implementation and somebody was kind of starting on a Rust implementation and stuff like that, we would love to see more implementations of these things. And it's, there's, we have a test suite that Jessica wrote that you can actually test against. So if you would like to bring some of the ideas, we are a research institution, you know, kind of the same way that the, when Jessica and I started on doing our work on Activity Pub, it was for the purpose of something that we were working on called Media Goblin and Activity Pub, which we were working on for Media Goblin, ended up being a bigger thing than Media Goblin. In some ways we're kind of running with that, we're actually saying, well, we really want you to use our software, but also more importantly, we want to change how computing works for everyone. Any other questions? Is that it? One more there. One more? Wait, who was it? Okay, over there, go up. Yeah. So, so, so it actually does have a small talk link in that, Goblins is heavily based on, it's basically like the e-programming language, but for Scheme. And the e-programming language was heavily inspired by Smalltalk. Mark Miller and Dean Tribble and the folks who worked on that, a bunch of them came from Xanadu, actually, and they worked on, and that group actually really loved Smalltalk and they used a whole bunch of that stuff. And so, so yeah, Smalltalk has an influence on some of this stuff. Through the family lineage, basically. And now I'm going to hand this over to my lovely friend, Hisham here. Thank you.
Five years of Teal: minimalism versus growth in language design
Alright, so let's get started. Thank you for sticking around during lunchtime. This is going to be five years of till, minimalism versus growth in language design. Okay, here we go. Okay, quick introductions. I'm Hisham. I've been doing free software for a long time with involved in many projects. I work at Kong where we do free software like free open source API gateway using mostly Lua. I'm currently working on the team that's having web assembly support in Kong and some of my other projects are LuaRocks, package manager, Google Linux, the weirdest distro ever and Htop, the process monitor. So, but here I'm here to talk about like the project that I'm most recently excited about, which is till. Till is a statically typed dialect of the Lua language. Right, so if you know like what type script is to JavaScript, you can think it's the same, right till is to Lua, what type script is to JavaScript, but let's not go too far with that analogy because here we're trying to keep to Lua's spirit, which is like a very minimal, a very tiny language and like I don't have the backing of Microsoft and hundreds of developers working on as they do on type script, but essentially we have a compiler, till that outputs Lua files. And so here's a quick taste of what till looks like. This is just like a random function I picked like from the source code of the latest version of till that I'm working on. And so what's lighter here like is what is not Lua, essentially the parts of till that are different from Lua. So we have a couple of different operators, you can add types there, but otherwise as you can see it's it's mostly regular Lua. I made the mistake of talking about till before without showing lots of search code. So this time I'm gonna make sure I show source code till is like the epitome of like conference driven development because when I first had the idea I started like hacking it together and like when I first presented it like was the challenge like can we do this right? Can we add types while remaining minimal and when I presented it back in 2019 like I was like halfway through like the compiler student compile itself and I show like oh I started like 500 errors and I'm currently like 100 or something like that. But eventually I got it working and then the next year I came back here and show it to you and I said like minimalist type Lua is here and there's this weird title because the language didn't have a name yet. So and then eventually at the third part of the trilogy we had a name and we actually had users and had stuff going actually going on with the language. But and then I basically said that well now I think I'm gonna get let it settle for a bit so that people can actually use it because if the language keep changing syntax like all the time and stuff like that like people won't feel comfortable adopting it right. So yeah so that's what's be going like we have like this small community I always like to make a distinction between like a community and a user base right because especially in the industry people like to look at numbers so whenever people ask me how is still going like I tell him like oh we're like 1.8k GitHub stars which is like the vanity metric like that doesn't really mean anything right but it means something right but in but in practice when I when I think about a community I think about the people who are involved with it and then I could say like like oh there there are like all of the other people who gotten involved right because I'm just doing the source-to-source compiler right and then there's people doing the build system there's people doing like the VS code plugin right that gives like IDE autocomplete those sorts of things so in nowadays for a language that's that's part of the whole package that's what expected of a modern language right so if I if I think on when I think of the two community I think about like I don't know like 10 people or or even less right but the people who regularly discuss on GitHub and all of that so right that that's the community right like the human aspect of it and speaking of community feedback here is like this is a full screenshot of like all of like if you go through all of the issues pages there are currently like if you open the the till GitHub repo right now there's like 67 issues like these are all of them right and I one thing I hate about GitHub is that they call everything issues right so like feature requests like back in the days of search for so we're like separate pages for like bugs and feature requests right and out like calling a feature request issue I think it's like super weird like I would like the program to do something that you did not design it to do now your code has issues right it's like no so basically what I do I label all of them like so the ones with the light labels here are feature requests and the ones with like the darker labels which are like the red ones those are the bugs right so I currently have like five open bugs and like three pages of feature requests right and and it's nice like sometimes you don't want to close your feature request you want to keep them around so that other people who have the same request can see and comment on them right so basically over time your job is going to keep accumulated an endless list of feature requests you're going to look at the repo and one day will have like 200 open issues oh this program must be all broken right no no it has a lot of people who are very interested in it this is actually a sign of success like people are interested in your program enough so that they are using it and asking for more stuff right so yeah so but then once you have this many feature requests like and we're here talking about minimalism how to keep the language small right I cannot add everything that people ask for and this is like my hobby side project currently I don't even have the time even if I wanted to right but I don't want to because I wanted to keep it small keeping to the like that idea from the very first talk like the challenge was how to keep it small right and the challenging thing is like I can't add everything that people ask for or even everything that people contribute because not only people go there and ask feature requests some people actually send in features like all written like a spoil requests like like I would like to do this extra thing and here's the code to do it please add it and I go like sorry right but but it's super important feedback right so uh yeah but when you go go back like it's always nice to look at that whole list and look try to look at the big picture right and then I saw like two like very recurring themes on things that people were asking for like related about like one is like nil safety right the famous uh billion dollar mistake right as as mentioned by touring award winner like 24 when he talked about this uh saying like adding the concept of null to a programming language was back in the day was his mistake because like everything that came out of that afterwards and um and lua is another of those language this does has like a nil type right and all of the consequences come with that uh and the other one is that people keep asking for abilities to express more complex table types because lua is like every like the only single composite type that you have like apart from like integers like like numbers bullions like and like tables right which is a thing that doubles as a hash table and an array right like it's there's a thing with keys and values and if you use numbers the the it has like some special behaviors to make an array access efficient but otherwise just think of it as one big thing with keys and values but once you start talking about types you look like oh this table is really an object this table is really a map this table is really an array this table is really a a map except when you use integer keys then i wanted to act as an array like these are kind of things that lua programmers do all the time because once you have a table that accepts everything right you start having these composite types and all these weird things and but then people mix it up right and those are the bugs that people run into when they're coding in lua with no explicit typing and they would like teal to help them with that right so let's talk about the first one uh a little uh so as i mentioned i was mentioning like giving that every variable can accept nil and you can pass stuff around like that uh until basically every type includes nil as a possible value right if you declare a variable as integer you can still put nil in there like essentially every variable is like optional right and in that sense so this has some consequences right this makes for example that uh in lua like you declare a function to have like three arguments you can just pass one argument or no arguments and like the lua vm will accept it and people do that on purpose because essentially they mean like oh in the semantics of my my application that argument is optional right so uh so lua accepts that and because lua accepts that teal accepts that right and uh another side effect is that in table like missing keys are valid right so um people let's say you define a function that operates on a table that has like x and y uh fields right uh but then your in your program since you decide since you designed that function that just wants to take something that takes x and y you want to pass other things right that have extra fields and sometimes they don't right so uh it gets it gets really messy right uh so in one of the previous talks i mentioned about like that i started looking at like doing the whole thing about uh implementing like nullability checks into the variables and how like and i showed like the size of the teal compiler source code versus the size of like all of the code to do proper reachability tests for null variables and everything and make like the whole check super automatic and it started to get like bigger than the compiler itself uh so that was sort of like a complete but heavyweight solution right so a very lightweight solution or i would say lighter weight solution like then actually having the proper optional types that people really want right was to have just arity checks which is like how many things you actually put inside the parentheses when calling a function right so and yes if eventually we find like a lightweight way to like nicely do optional types this will become sort of a redundant feature but this is this is what it looks like right if you have a function that takes like two parameters like if you pass two parameters then it's valid so if you just pass one parameter this didn't fail before but it will catch the error now but if you pass two parameters and one of them is nil well you know the arity of the function is still two so that's so that's valid right but who will write this right so i mean it's very like 80 20 thing like it's a very simple thing that catches like the kinds of mistakes that people actually do right so you have you kind of have to be uh forcing to to to get into trouble with that so uh yeah but if you want to be able to do that then you act you actually mark the arity like not in the type but in the argument right like the the argument is optional it's not that the type of y is like an optional integer the type is integer but y is an optional argument right so that's like a very simple to implement thing like that i had it i can draft state for a long time and i just decided to put that in into the language because it's always like oh do i want to add one more feature to the language right do i want to because i was able to do like to write the till compiler in itself so with the features that currently has you're able to write a compiler so like i go like okay the features look enough right but people get asking for more features so but then like being possible and being pleasant is kind of different i already feel like like coding in till in general because of the helps that type give you already feels like more pleasant to me as a very in a very subjective way than coding in like plain lua right but when i started working on the next feature like then i really wanted to have this because i wanted to do big refactors and change the numbers of arguments of things and i wanted the compiler to tell me uh if uh if i was making mistakes so i added this feature feature for my own use and it already proved like useful enough and uh yeah so if we're doing that for nil safety for for arguments what about uh for table keys can we do that it's trickier to pull it off right because uh even if you want to say like oh this key is optional this key is mandatory some one thing that people do in lua all the time is just like start a table with like open closed bracket like this is an empty table and start filling the elements one by one so like after the table is ready right that argument's no longer optional but it's optional for a while and it's a very common pattern so um however uh since i had a feature that i created for my own use of how i specify maps i adopted that for records as well so uh which is that total annotation because like lua like modern lua has these annotations like const and and so i just i don't want to say total which is for things like this right if you have an enum saying like which essentially restricts your string type into a set of known strings right you can say that the map that you're declaring here is total so the compiler will give you an error here because you specified north and south but not east and west okay so uh right so essentially we did the same for uh records right so uh you can specify a total record and for that use case of i'm declaring a table and i want everything to be in it uh you can uh you can specify that again like it's a lightweight solution then the whole solution but it's something that already gives you uh you know gets you a long way um okay so uh just because we started late right how how do how do we go on time like do i just keep going and then right okay all right so uh yeah so then the second one which is like the more interesting one is like subtitling table types i mean like i have avoided going with like complicated uh sub typing definitions as as you start combining uh those kinds of features like you can you can go real crazy with that like with even like features that programmers in type languages use every day like if you combine like generics and sub typing like there are computer sign results that show you that you end up with like a computationally undecidable type system and things like that so uh so i have avoided like delving into this complexity for as long as i could so for that super common case of i have this table which is a record in an array at the same time i added this jenki type that's array record that's just for doing that uh and to avoid having to do sub typing and things like that like and complex type hierarchies just because i wanted to have collections like i added like invariant generics so you can have like list of t but you cannot specify anything about t and then that worked and because uh typing and callbacks and passing like checking function arguments for covariance versus contravariant and all that gets complicated you just do bivariant matching you like you accepted both ways which is like wrong like it's unsound by design right but it will at least prevent you from matching like completely unrelated functions like you're not passing like the wrong function but if you're doing the thing like in the correct ballpark it will catch it right so um yeah which is some kind of like unsoundness by design that top typescript already has to do so uh yeah and one big limitation that we have because we had union types but you cannot you could not have union types of multiple table types right because at runtime the code that it translates to when you're trying to check like which which value i have in this union type which is like oh it's a it's an engineer or uh like my record like my object the luo code that that would generate uh became like luo code that says like show check that the luo type is number or check that the luo type is table and then it couldn't tell all the tables apart at runtime right so you will have to generate code that actually checks the tables by their contents right so you can tell like is this a circle is this a triangle is this like what is this right and turns out that in the real world a luo programmer is like an existing luo frameworks that people would want to use with till define their own object or oop like object-oriented systems with their own like inheritance systems and how to how to declare like the the the type checking right people already do this writing this custom code for type checking right but then the till compiler doesn't understand that right so uh for things like this if i want to have the is operator to work nicely in till so that the compiler understand it like i need to be able to translate that to that like framework specific way of determining like what the type is right so since we had records before like the the next version of till is going to have interfaces right so the interface like this was instead of writing record you write interface right but now you can you can actually have subtyping from interfaces which are abstract into records but then you have this fine where clause here which has an expression like it has a bit of code with essentially is like when you want to do an is check this is the code that you should use like it's essentially a macro right you have to replace this everywhere you need to do that change right and now essentially you can do uh you can do union types as long as the all of the types that you are putting in the union declare like there how to do the is operation right and yeah we're only on time so yeah so that's essentially what i described there and the thing is that that that magic where clause right essentially i had to add a feature to add a feature right because like i'm really having i'm having i'm really having to implement like a macro expansion thing for expressions just in order to have that work loss so like why not expose that but then again i thought like oh so i need to add macros to the language no i just went lighter wait again i just added macros for expressions and not like a full macro processor right which is what many projects for lua have tried to do in over the years but one of none of them like have become established right and turns out that where now it's syntactic sugar like we actually have like macro expressions in the language right but it was motivated by that right so in short what happened like till has seen use okay people want more features and less limitations which translates to many future requests oh no right but many future requests they relate to the same pain points right so the idea is that you try to combine them and come up with like the what's the most minimal design to try to address the most of these pain points so the things that came up out of that were like optional iris interfaces macro expressions and the idea was that i would just pick the lighter weight solution right because whenever possible like just to try to like just harking back to the title like minimalism of the language versus its growth right both the growth of the compiler the growth of the language specification and the growth of the user base right so so that's the approach that i took like you know you sometimes you have to choose like the incomplete solution just in order to balance all of those desires right so yeah that's it what i had thank you for sitting around thank you we need to free up the room for the next round but we can still have one or two quick questions Yeah, yeah. So have you decided what to improve the meta programming or doing some stuff, but what the meta tables and all that stuff? Okay. Have you ever decided for a new program or a new country you know the goal with the LAT? If I have some plan to improve the handling of meta tables, meta tables, it currently does have support for meta tables in records, you can specify them. The checks are not very strict, right? And I haven't had much feedback on that. So like if people run into trouble, I look into it, otherwise I'm just going to keep it as it is. Right now you can use them and no one's complaining, so I'm happy with that. Thank you. Thank you.
Opening and Welcome — Introducing the Open Website Alliance and Today's Program
So let's do this just a tiny bit differently. Now kicking off the day with sound, my name is Jam, and if you have questions or comments about the sessions and so on, I'm going to try and facilitate helping the speakers, getting the stuff done during the day in this room, and presumably when we have questions I'll be able to run over with the microphone. We only have one microphone though, and we are being streamed, so to be nice of us to speak into the microphone, or if someone's so enthusiastic that they're shouting out their questions, then if you're the speaker, repeat the question into the microphone. So today's dev room at FASEM 2024 is on the occasion of the launch of the Open Website Alliance, which was created by leadership level conversations between four open source content management systems, WordPress, Drupal, and Type 03, and we have representatives from three of those organizations with us today to celebrate that launch. Stop the champagne. Welcome to the stage. That was nice when you get to practice. So here, you hold that. This is not working. This is going to be the theme of the day. I have no confidence. All right, I'm just going to hold this. It's all good. We have one microphone, so passing a lapel mic around is a good exercise in dexterity. My name is Crystal Deonisopoulos. I am the president of Open Source Matters, which is the organization that supports the JUMLA project, and we are very, very excited to help co-found the Open Website Alliance. I think it's going to be an incredible opportunity for our CMSs to grow and learn together and share expertise and really support each other, because as we'll be talking about, we have so much in common. We have a charter, which we'll put back on the screen for people who missed it. You can take a look at what we've agreed to, which is, I think it's pretty cool. So thank you. Hi, I'm Matias Bortlesnik. I am a board member of the Type 3 Association, banner over there. And yeah, we didn't fit all of them together, but we are good friends anyway. Yes. My name is Matias. But yeah, we came together in the Open Website Alliance. First we met around discussing what to do about the Cyber Resilience Act in the European Union. And we just continued and we created this amazing charter. And what I want to say as well is that any other free and open source CMSs who align with the charter are very welcome to apply and can contact any of our projects and will lead them into bliss. And my name's Owen Landsbury from the Drupal Association. I volunteer on the board there. And we were very excited to be invited to join the Open Website Alliance to really further the strength of open source and to have a consistent message that we can tell the world. Also WordPress was supposed to be here, but got sick. WordPress was sick today. And send their apologies. So kicking off straight into the program of the day, thank you very much. Congratulations. So that's what the schedule for the day looks like today. It's pretty interesting and exciting. My name is Jam. I said that I'll be running the room today and sharing the mic with Matias in the first session. And we have people representing a number of different communities and entities that interact with multiple communities as well. So I'm really, really happy to be here. This feels like an incredibly important step in living open source values.
Defend FOSS: From innovation to world-wide positive change
And interestingly, this idea of sharing best practices, finding things to do at every level together seems, I don't know, more important than ever, probably, given the state of the incredible state of the world in 2024. So the premise here is that since 2008 or so, we've been doing great in open source, and most of us have been very busy. Who makes their money, supports their family, full-time, always, only open source? Right? I'm used to, I think all of my clients are still open source, but I'm doing something slightly different now. So we've all done great with open source. My open source origin story is in the Drupal community. But we have forgotten how special and weird we are, and what is most valuable and interesting about us is not obvious to everyone. And I think that we've been so busy winning and succeeding and, you know, having agencies with 120% bookings over any 18-month horizon. We forgot to tell people what an economy of plenty is, what sharing your best ideas with the smartest people you know can do, and so on. And we are not getting all of the opportunities to win business for ourselves or do good in the world that we could, right? That's about where we are. So very, very briefly, my name is Jim. I co-founded a company called Open Strategy Partners, and we work with technology agencies, product companies, and open source projects to communicate clearly about the wonderful, complicated, interesting stuff that all of us do in open source. So, right, that's me. Matthias, please come and introduce yourself. As I've already said, I'm Matthias Burt Lesniak, and I'll hold this myself, and you can click yes. I am living outside of Oslo in Norway. I do have a wife, and I do have two kids. Hi. And I have been using Type 3 since 2003, and I always say that the first time I used Type 3, I crashed the installation, and I think that's a really important thing to know for beginners in any CMS that you start by learning, and all of the people who are very good right now are people who didn't know anything when they started. So everyone's been in your boat, newbies. I'm in the Type 3 Association Board, and you're skipping a little bit faster. I thought I had half an hour for that slide. I'm in the Type 3 Association Board. I'm an evangelist at Tuju, and I'm also consulting for Open Strategy Partners as well, and you can find me on the bird site and on the mammoth sites and other places as well. So, but of course, when I do presentations about Type 3, I do the next slide. And what is Type 3? Well, you know, if you look at it from the top, you know, Type 3 is a PHP-based CMS. Huh. It is also free and open source. It's also community-driven, and it's also backed by an association. In addition to that, it has a long history. That actually fits to all of these. CMS is right. And just to underscore that, let's just take a look at how similar we actually are. And for another presentation, I looked at Drupal and their core dependencies, composer dependencies. They have 54. In Type 3, we have a few more. We have a different way of thinking around what should be in the core of the CMS. We can take that fight outside, but I think, you know, whatever happens, we have 98 core dependencies, and what is really striking is that we have 33 in common. You know, this is the stuff that makes us different. This is the same stuff, right? We're made of the same kind of stuff, and that puts things in a different light when you think about our four CMSs, right? In the open source world, we're often very good at bashing each other. We're saying, you know, oh, no, you shouldn't choose Joomla or Drupal. Type 3 is best. And no. One of the reasons why we founded the Open Website Alliance is because the most important choice that a client can make today is to choose open source, because when you look at open source, we have more in common than we have differences. And then the choice of which CMS you need depends on what needs you have, what agencies are available in your area, those kind of things. And that means we have so much to win by collaborating. Jam. So, on to the substance of the talk. Let's remember what we're about, and remember, and our aim by the end of the day is to encourage all of us to get back out in the world and tell everyone we know again. So we're going to talk about the fundamental idea of freedom as open source as freedom. We're going to talk about the pragmatic reasons why, if it's an applicable use case, you really should give open source a fair chance. You should choose and use open source for completely pragmatic, non-moral, non-ethical reasons. There were really good reasons. My old job was very involved in telling those stories, and I pulled out a bunch of old slides to talk about this part. Then Matthias is going to help us shift gears into a real idealistic space where we can do my favorite part about open source, which is making the world a tiny bit better every day at a scope that I hadn't considered until I started talking with him a couple of years ago. I mean, if you want the tabloid version of this presentation, it's Barbenheimer, right? And I'm Barbie, and he's Ken. No, he's... Yes, he's Ken. Yeah. He's Oppenheimer. So, I do all of the pink stuff, and he does all of the dark stuff. Oh, is that... is that how it goes? Okay. So, really, really interesting. On Friday, you were at the European... What was the thing? The EU Open Source Policy Summit. Right. And four or five times during the day, they said, Open Source is one, which is awesome, right? We won, which is great, except, you know, and as I just said, I've been making a decent living at this since about 2008. All of us are doing all right with this Open Source stuff. He sent me this article a year ago, almost exactly a year ago, and I got really, really, really, really upset. First of all, I was part of the crew that came into Australia and helped Australia get on Drupal. Australia has a thing called GovCMS, which is Open Source, which does incredible work. It enables 133 departments, multiple state governments, generates work and expertise in the Australian economy, and somebody at Deloitte got the right meeting with somebody in the Ministry of Something in Australia, where they decided that they needed personalization and that since the current GovCMS infrastructure couldn't do it, therefore, they were going to rip and replace it with an Adobe solution. And the Australian government entered contracts with Adobe for more than 80 million Australian dollars and spent at least 36 million with Adobe on personalization that was illegal under Australian law because the data retention required to run the personalization wasn't allowed. All while Australia has an incredible success story of Open Source winning, right? So we have to tell our stories and we have to tell our colleagues and we have to be really up front about who we are and what we do. So that was the kickoff. Then Matthias and I both, I got very angry. That's my, unfortunately, still my go-to. And I decided to put, like, that we needed to start telling people. So Matthias and I went on the circuit for six or eight months last year. I did a bunch of buffs and a bunch of early versions of this talk and Matthias did versions of a really cool, completely different talk. And now we've smashed them together. So, right. So we've been complacent. I think the story picks up. Oh, no, it's still my year. Right? So here's one of my oldest slides. A gentleman named Jim Alchin in 2001 said that Open Source was an intellectual property destroyer and he couldn't think of anything worse for the software business, which is hilarious for all of us paying our rent with this stuff, right? And we used to think that that was cute and funny, obviously, but we've been busy and we've been making good things. We've been taking what we do for granted, right? And it's special and it's different and it runs on fundamentally different economic models and still generates incredible value, right? So we need to be loud and proud and we need to evangelize about it. So, time to get back to the basics. And basically, I mean, we've been complacent. We've been, yes, successful, but, yeah, that thing, you know, free isn't beer, we say about free and open source software. We've been maybe focusing too much on the beer, too much on just being free on the beach or something relaxed. But we've forgotten these really important questions about, you know, the other things that open source has. Like open source is not only free as in beer. Yeah, I know that comes as a shock to many of you, but it's also free as in free speech, for example. You can do whatever you want. Hello, you can say and do whatever you want. Open source is a threat to anyone who doesn't want you to do whatever you want. Hey, what is this picture? Crayons or, well, we also have choice, right? With open source. Open source introduces choice into the software business. When you have choice, something really interesting happens. Open source projects do not work the same way as other software projects do. That's why I have a slide here with the Gingerbread Man. Hello, this is Gingerbread Man. Do you know the Gingerbread Man fairy tale? Hands up. So some people don't know it. That gives me a chance to tell it, right? Which was what I wanted. The Gingerbread Man, there's this man and this wife who lives in a house in the forest and they don't have any children. Maybe they don't know how to make them. Maybe they have some kind of reason for not having them, but they really, really, really, really want them. And so the wife gets this idea that she can bake a Gingerbread Boy for them, which is like, yes, I know, super. We should all go home and do that. What happens when she opens the oven and wants to take out the Gingerbread Boy? He springs out alive and he runs out the door. Lots of stuff happens. Whom? He meets lots of people. That's what happens with an open source project as well. When someone open sources their project, it runs out. It gets a life of its own. And if you know the fairy tale, you know that the Gingerbread Boy dies in the end. He gets eaten by a fox, not Firefox. We love Firefox, but it's... Yeah. There is another side to open source projects as well that we might have forgotten a little bit. And that is the puppy side of open source. We have to care for our open source projects and for open source maintainers and everything that is open source. Because in the end, this freedom that we have been living with and enjoying also comes with responsibility. Open source is not only freedom, it is also responsibility. Thank you. So, casting myself into my old job, which I loved very much, free and open source software gives us some really unique value propositions to work with and make good business choices as good business people. And it all starts with what we call the four freedoms. And I guess everyone in this room is basically familiar with the four freedoms. But I want to go over this for a moment. You're free to use our software for anything. Other people are free to do what they want and run out of the oven and get eaten by the fox. You're free to understand what you're getting and you're free to have anyone else confirm for you that it does what it says and that there's no tricks and traps in it. And that's really powerful. And you're allowed to make it into exactly what you need to solve the exact problem that you have. And that enables a ton of people at every scale. The largest businesses like Pfizer who do unique things every day at a scale that we can't even imagine. They use open source to solve running their business. And the smallest businesses doing something that no one else does in the world, they have the same software and the same ability to solve problems uniquely, which is awesome. And we're allowed to make that ideal solution and then pass it on to other people. Sell it, share it, run services on it, whatever we want, so that its value can go and multiply. And this virtuous circle gives us all really incredible superpowers. So when selling open source projects, we love to tell people, hey, the old way of selling open source was something like, oh, you can totally make it cheaper because you save money over here. And being responsible with money is important. You need to set the expectation that this has a license fee of zero dollars or euros. But frankly, your superpower there is that when you save money on the license fees, you can reallocate it into making a better project. So every IT project, every software project has a lot of costs. None of it's free. If you tell people it's free, they're gonna be disappointed in the end, right? But you have this incredible possibility to invest in choosing exactly an even better color of blue than you could afford if you were paying a license fee. You can hire new people, you can train new people, and, right, you're only paying money for the features that you need that aren't quite there yet, right? But you get, in all of our projects cases, millions of hours of coding and contribution work for free, just as your baseline to get started with, right? Plus, putting together all four freedoms, you have the freedom to choose people working in your local economy so that your money stays where you live. And you should be able to control your own data and make it safe and private as it should be. You should also be able to define what that software does and when on your schedule and not when Microsoft deems it possible. And yes, on your time. My dad used to tell this story. And I'm not sure why because he's not in the restaurant business, but he would tell a story about if you open a restaurant, you should own the building that you have your restaurant in. Because if you rent a building from somewhere, you build up this business, you make great spaghetti, people come by, they love it, they know where you are. At some point, your landlord is going to turn around and come back and say, I love that your place is doing so great. I also love your spaghetti. I'm also doubling your rent, right? And since the restaurant business happens to be about location and people walking in and having a habit of going places, you're kind of stuck with that. And if you move, you're going to lose, potentially most of your business have to start again, right? So I don't want my government, for example, to rent the space that they do citizen services with for me from Oracle or IBM or whomever, who then turns around and say, by the way, that thing that you've been using, right? Or I don't know, don't build your business on other people's stuff as much as that's possible. That thing that you've been using, nobody else wants it. So we're sunsetting it at the end of the year, but you can buy this other new version, which costs twice as much next year, right? So open source gives us a power to live in our own houses, right? To own the bricks where we work and give control. I think the security debate is, well, as old as this slide, who remembers this? But still today, now here's the weird thing. All the people I know in here, I know we've had this conversation in the 2000 and zeros with people, but this stuff still comes up today. And that's the point of this talk, right? There are new generations of people, new generations of business people, new people coming up without any of our experiences and we have to go and pull this stuff out and create sensible resources about it. And that's also, I mean, just to bring that up. We have a talk about it later today. And by going around and thinking that we have won this again and just looking at each other in the open source business, we're really, I mean, putting nails into our own coffin. Owen is going to talk about a small software company called Adobe. That has a lot of marketing money later today. And there are things that they can absolutely not do. And if you get into the situation where you're actually owned by somebody else because they own your bricks, you are losing your freedom. And you'll soon be without beer. Usually in the political world, we think software as two black and white things. Either the government should build, run, own everything because the government knows best. Right? And then there's another side of politics where they say that is absolutely bullshit. That private sector should own, run everything and whatever the government does should just be to pay money to them so that it runs. Right? None of those are actually open source. They're both proprietary solutions. The government can keep secrets and so can the private sector. Right? They don't have to publish their code. They don't have to have open source licenses. Open source can do something amazing to this picture. We are underutilizing that ability of seeing open source as a part of civil society. Civil society is what we as people do together, whether it's going to acquire and singing in that choir or it's building software or it's trying to help other people. That happens in the civil society realm of the world. That is community. Community is a really important thing and you can have the government pay money into community projects to support them. Private sector can pick up what is being built by the open source community and make a business out of it. They can put money back into it, code back into it. By thinking community around software, we don't just enable one product, but we enable anyone to make any number of products and business plans and earn any money in any way. We're talking too little about this. Here's another picture that I like. This is a desert, everyone. What's the thing you think about in a desert? Well, it's sand. Sand, sand, sand. For as long as the eye can see, right? Well, the desert is a picture of proprietary software. It's monolithic. It's a mono culture. It's monotechnical. The guy who owns the watering hole in the desert, well, if he's your friend, you can go there as much as you want to, but he's also the one who decides when you're no longer friends and when you have to pay more, you still have to go to that watering hole and get your water, right? That is how it functions. You might feel free. You might feel that it's warm, but you are dependent on someone. Open source, on the other hand, is the jungle. And, well, the jungle is not a mono culture. There are lots of different plants, lots of different things going on here. Any plant can grow up at any time, succeed or not, who knows. It wouldn't have been here if all of these plants were fighting each other. This is a picture of collaboration, but also a picture of interdependence. We're all dependent on each other just like the Drupalian type of three dependencies. If one of our shared dependencies, you know, isn't anymore, we both have a problem. So we have a shared responsibility around that. Everyone has access, but we have to not only take care of ourselves, we have to take care of others as well. And another really interesting thing is this is sand, right? Like in the desert. And what do we do when we want to stop a desert from growing? Any ideas? Exactly, we plant trees. The roots of the trees, the life, the community that we bring in, it doesn't only hold the grains, but it also holds water. It creates organic matter. That means that the desert will not come back, right? It changes the environment totally because of community. And now, Mr. McGuire here will say a few words about how this changes our way of thinking about government involvement. Yes, yes. So governments have the power to change the world. They have a certain amount of money that they can invest in, and there are some traditional ways to do that. But I think that the other side, that building a bridge across a given geographical feature will make it possible to have more commerce or more tinder dates or something, and they think that society will be better for it. They think the economy will be better for it. They can plan it, they can execute it, they can build it when the money is spent, it's spent, and the thing works or it doesn't, we find out when it happens. The same goes for they can choose to vaccinate people. They might choose to vaccinate people in another country so that they have better outcomes in that country, and then maybe change migration patterns, create economic welfare and growth in another place. Unfortunately, a lot of governments also choose to do sort of the opposite with their money. In all of these cases, however, whatever amount of money is spent on whatever number of physical things and necessary or not, this is reality. There's this really interesting other way to build government infrastructure. It doesn't cure diseases and so on per se, but when governments invest in open source digital infrastructure, build the perfect government ministry organizational website, right? Give it to the next one, give it to the next one, give it to the next one. Australia's done this very, very successfully and it's not the only place. It's a terrific model and it's this crazy thing where the money that a government spends once can infinitely improve and become infinitely, theoretically infinitely more valuable. The more departments who use that solution, the more value it becomes and the return on investment can grow kind of forever. It is also entirely reasonable to expect to be able to solve problems in your own country, in another country, and share this goodness and it's an incredible superpower and it's not just theoretical, and it's not just sort of government websites versus spending money on vaccines. There's this incredible real world effect that our work has. The UK IT supplier map before 2010 looked like this and these dozen dots or so, London, Reading, Brighton and so on, correspond exactly to HP headquarters, IBM headquarters, Adobe headquarters, Oracle headquarters and so on in the UK. And not only that, but they are the foreign headquarters, mostly for American companies, right? So the UK government was sending a ton of money into these companies. Now, they had this incredible program in the UK, the digital transformation, and they did a whole number of things that have had very, very positive benefits. I had the incredible privilege of being at a talk from the Cabinet Office, the guy who ran this program inside the Cabinet in the UK, and he gave me these slides originally. So they did a whole number of things. They created this procurement system that works quite well. They did a whole bunch of things, but fascinatingly they said any government IT project with a budget of less than 100 million pounds, they made a whole smaller businesses could apply for them. So SMEs got access to government projects under 100 million pounds, and they said if an open source solution can do the job, there is a preference for open source solutions in these procurements. There was a bunch of other stuff, but smaller companies were allowed to bid on government projects, and open source got a leg up based on functionality, not on marketing, right? So a few years later, the government procurement map looked like that. Don't drop your mind. Okay. Now, there's a bunch of wonderful stuff about this. First of all, all of this UK tax money, there's clearly some of it still going out of the country into large corporations, but think how many digital agencies, right? And in all over the map, even in Europe, right? Even in the Republic of Ireland, oddly. But right, look how much, imagine how much money the UK government spends on IT, and how that now goes into local people, local economies, new jobs and training and skills for the UK. It's an incredibly powerful story. An open source, right? It wouldn't be possible without the open source piece of that. So I believe the next line is, it was a dark and stormy night at the Type 3 Association. When the telephone suddenly rang. Yeah, we have somewhat more modern telephones in Type 3 as well, just to say that, yes, the telephone rang, and it was the government of Rwanda. They had 250 Type 3 installations that needed upgrading. Well, they definitely rang the right number, right? We can do that, but Type 3 doesn't make websites. Agencies make websites. How are we going to solve this? Well, we could have called one of our larger agencies and said, hey, you guys do it. You know, travel down there. This is money for you. You've been supporting us. We're nice. Well, and you could imagine, you could imagine in some circumstances and with any version of proprietary software that some large foreign company would come in, set up an office, do the work, pull millions out of the country, and potentially leave again. That's actually how it's done a lot of the time. We often call it foreign aid, actually, but it's not so much aid. It's basically an established business that open a local office and then they export all of their earnings. And they use closed solutions with this wonderful vendor lock-in that we all love to create financial independence. We've seen this before, right? We've seen this a hundred years ago. This, ladies and gentlemen, is colonists and exploitative. It's just got new clothes on. It's still the same way of working. So we couldn't do that. What we decided to do instead in type of three, and this is unique to open source, is we decided to use our community to create independent local business and expertise in Rwanda to serve their government. So you've got an independent financial cycle where the government pays agencies in the country, they serve their government. It all happens there. You're building something instead of just locking them in, and they have all of the freedom they want to do anything else with that expertise as well. And Daniel who's sitting here has written a wonderful article and we have a report about how that happened, that we also have some paper copies of, but basically people from type of three agencies in Europe spent a week in Rwanda training people. They came back a few times and did the same thing. And the result of that is we soon have 300 type of three open source based websites running everything in the Rwanda government, everything from the president's office down to local governments within the country. It's all done by Rwandans, for Rwandans, it's an opportunity for Rwanda. And the training, the community people trained both government employees and people from the local economy, so they created agency skills in Kigali as well. So yes, both government and local agencies were trained, that was for you online, that was what Jam said. And to put this into a more tabloid phrasing, what we did was we took a democratic and not-for-profit open source project and we used that to support sustainable and independent local business, right? Right? That's what Australia does when they invest in open source, that's what Germany does when they invest in open source, that's what governments should do when they work with open source. They support their ideological foundation in free speech, democracy, peaceful coexistence, hello, that's a really, really good selling point for open source, who can do that with proprietary exploitative software, eh? Go ahead. So, yeah, this is the GovCMS link, do you want to say something about that Jam? Or should I just continue ranting? I'm going to rant more. Okay, because basically what we created and what they have created is locally led, non-exploitative, anti-colonial software and they've created community, they've created civil society. Civil society, everyone won the Nobel Peace Prize in 2022. Go and look at the reasons for why that prize was given to two organizations and one person. The last sentence includes civil society, open source organizations are civil society organizations. They contribute the same way as other civil society organizations too, the welfare of the world. Now, what next? Right, so we looked at our origin story and the four freedoms that define us and then worked up through this scale of how things can be better using open source, thinking open source and expanding those practices well beyond the scope of just software into the real impact that we have on the world. So what we really think is important is that our values, right, are the values of a healthy society and this is not the future we were promised, this is not the happy place that I wanted to be in as a grown-up and so we have the chance to, we have the chance, we are cross-border international communities of practice. We are models of democracy in a time when democracy seems to be under attack, right? So at this point, I think the call to action is look at things like the Cyber Resilience Act, help governments who are willing to listen, get things right, help your friends in business and elsewhere make the right choices about software and the open source economy is not taught in schools. Recently I spent, I had a 45-minute appointment at an embassy talking about a project similar to the one in Randa. Well, I have 45 minutes. I spent 30 of those minutes trying to convince them that open source actually works. We need to tell people that open source works and why. Yes, yes we do. He agrees. So, and in that spirit, welcome to the Open Web Alliance Dev Room for the day. We'll take a very short break.
Open source leadership at scale, how 1300+ people improved Drupal’s multilingual features
Keeping on, keeping on schedule and with one microphone per person. Gabarhojji, very old friend of mine and open source original. Drupal Core maintainer several times. And now talking about a really powerful contribution project in the Drupal community. Roughly right? Thank you, Jim. Yeah. Great. Hi everybody. Thanks for coming. I think it's going to be interesting for everyone hopefully in some degree. Because this is about open source leadership at scale. As Jim said, I'm Gabarhojji. And... They called it one slide, Gabarhojji. No, it's done. So I'm Gabarhojji. My own made up title is Full Stack Community Organizer. Which means that I can put on an event for you. I can manage social media, design graphics, do a keynote, build developer tools and like basically everything in between. Write marketing copy and do everything in between. So whatever is needed at the time. So I've also been working with Drupal since 2003. Much like Mathias with Type or 3 since 2003. Just picked a different system. But it's around the same time and I'm a Drupal core committer and did a bunch of stuff that are... That helped in getting here where I am now. But I'm more interested right now in where you are coming from. Who's using Drupal for anything in the room? Alright, some of you who have no idea what Drupal is. Just here for the title because it was nice. Okay, I didn't explain Drupal that much. That's great. Who are you consider yourself primarily developers? Okay, great. Nice. So those were the main questions that I wanted to have so I can direct the talk properly for the audience. So I got into open source from open content. In the 1990s I went to high school. And the high school got dial-in modems and we got on the internet. And I was really interested in how we can publish stuff on the internet. How we can put something on the internet. And I decided to be the lazy teacher that reads five pages ahead and then teaches everybody else what they learned. So I started looking at how is this done and started to go into documentation and started to translate the W3C standards into Hungarian. And then the PHP documentation into Hungarian. And then distributing that and then starting to look for news and articles and translate those. Summarize and translate those into Hungarian and publish it for the Hungarian community. And so it basically turned out to be a thing where I needed to set up a website to publish all of these things. And I got together with a person in Vienna, Shandor Shromkuti, who I've never met ever in my life, ever since. Now we work together very well online. And we created this website called Vabla Bor that was hosting these things. And I went on a side quest with the PHP community. I became the lead of the PHP documentation and the lead of the PHP.net website in the beginning of the 2000s. And growing this Hungarian community website as well. But the Hungarian website grew out so much that we needed some kind of system to manage the community, to publish these things, manage the forums that we've had and the meetups that we had. And so we needed to have some system that managed this. And that's where I found Drupal. And Drupal was tiny and nice. And this was the whole Drupal conference in 2005, all the attendees. There's a certain person sitting there. So it was a tiny community that was very tight-knit. And we would get together. The software was managed through a mailing list where you would post a patch to the mailing list. And it was reviewed on the mailing list and then committed to CVS. So it was very tight-knit and everything was reviewed by these few people. And so it was very easy to join. And I needed it for a Hungarian website. So my main problems that I would go in and fix were usually about translatability into Hungarian. Or I wanted to have the path aliases in Hungarian. I wanted to have everything in Hungarian. And I always bumped into bugs and I submitted bugs and they got fixed. So I fixed them, then they got committed. So that was basically the natural way to get into the community. Small was easy to approach and they would receive those fixes very well. Fast forward, ten years later, this was the Drupal conference ten years later, there's people up there as well. They're hard to notice. So this is Drupal condomber. So it's kind of hard to get started in this community. Like who do you walk up to that I have this bug and please work with me on fixing this bug? It doesn't work. Like there's no way to do that. Like, I don't know, it's hard. It's like walking up to people on the street and like trying to convince them of something. When I got in here, all of the buses, I think bus 71 was always full. Like I waited for two or three buses. All of them were full. I decided to call an Uber yesterday. And I was myself in the Uber. So I walked up to the people waiting for the bus. Like, how do you want to come with me? And they were like, no. Who are you? Like that's the kind of feeling that you walk up to someone and like, no, who are you? Like, why are you approaching me? So I was alone in the Uber. But when you have this tight small community, it's much easier to work with them. So when we got to this point, it started to get very hard to manage what people are working on and organize that and motivate that. We kind of went by pretty long without more structural organization. But around this time, the project lead, Drew Spreiter, decided to set up initiatives. And so that initiatives could get back to this tight knit small feeling and they could sit together and know each other and have a sense of community in this much smaller scale. And they could work together very well. And when this started, I was approached to work on the multilingual initiative because even 10 years after I joined, the multilingual was still a problem space that had a lot of problems to be solved. And so I was happy to accept that. And so I started working on the multilingual initiative. And everything was rosy and happy and I started working on things. And then a bit later, something bad happened for me. And the least I considered that was super bad that happened is that another initiative was announced. The views in core initiative was announced. And so multilingual was especially in Europe pretty important. But views in Drupal is basically the, so if you don't know Drupal, that in that detail, then views is basically a query builder based on Drupal's very rich structured data. And it's also an output generator based on the query. And you can choose how the output is generated. And it can generate APIs and REST endpoints and lists and sliders and all kinds of things. So basically it's a query builder and an output generator. And views was four times more popular than any of the multilingual modules. And they got funding from Sony and other companies. So they had money. They were four times more popular. And they started to steal some of the people that were working on the multilingual initiative. So I felt totally betrayed and I was super angry. And I wrote this email to project leadership and core committers that this will jeopardize my initiative. It will make my work super hard to do because they're going to steal the thunder and will be very hard to do this going forward. And now I'm here talking about how successful it was. So we sort of resolved this. But I was super angry and very jealous and also felt betrayed. And I think what was interesting is I didn't get responses to my feelings there. I did get responses to the facts that I stated and they were refuted. But my feelings were not contested. And I think what I realized after a while, after I had time to think about this, is the problem that I had is I was thinking of Drupal as this small pie that we are eating from. And if everybody's eating from the same pie, then it's kind of be over after a while and you don't really have more to eat. So if you steal my people, then I don't have people and I'm not going to have people. And so I think that was the key understanding that I had is that I need to think about how to grow this pie. And even though we had all of those thousands of people at the conference, I think we still didn't have a good grasp on how we involved new contributors very well and how we make them successful, which was even more important. And my other problem that I had is I didn't have money. And this realization that I need to grow the pie didn't make me have more money. Some of the companies that were involved in the Multilingual Initiative had money and they were investing into sponsoring some of their people. But I didn't have money on the scale of 1,300 people. Like, that was not possible to achieve. So I need to figure out something else. And so what I started to look at is how to make people happy. Because they would come here if there's something in it for them. They would join us if there's something in it for them. And I did, I read a bunch of stuff and some of this clicked together afterwards and it provides a great structure for this talk. But some of this, I basically figured out on the go. And so I think the best structure for this talk is these three words. This is from Dan Pink's book called Drive, which is one of the three books that I suggest you read on this topic. So Dan Pink Drive. And he highlights that people like working on things when they have autonomy. So they decide for themselves. They decide how they solve problems, who they solve problems with, how they move forward, etc. People strive if they have mastery so they can get better at things. They improve. They can try new things and improve in them and get challenged. And they strive when there's a purpose of what they are working on. And so if we can figure out, if we can correct the code on those three things, then it works really well. And I think we correct the code in the multilingual initiative and this is how we did it. So I think the purpose is sort of easy, at least for the people that were involved in my initiative. They were primarily in Europe but also somewhat in Canada and somewhat in Northern US. And they had personal needs for multilingual. So obviously they had the purpose of solving their own problems. But there was also some higher purpose, like if you just look at where Drupal is used, the UNESCO uses multilingual Drupal to help with education and children and refugees and stuff like that. The CERN uses multilingual Drupal to advance science. Tesla is using multilingual Drupal to promote their technology. And you can like configure your car through Drupal on the Tesla.com website. Rathetti is using Drupal extensively and they invest money back in open source as well. While we are in Brussels, it's hard to avoid the European Commission. It's using Drupal super extensively. This is in Hungarian, aropa.eu. But they have 300 websites that are in Drupal. Most of them are multilingual obviously in Europe. It's hard to do anything without. And they have more than 100 people on staff, developers on staff that are working on their Drupal websites. So it's super extensive. But I mean these companies can pay their way to solve their problems. If you have 100 developers, even if multilingual is hard, you can solve that. If you're a Tesla, multilingual is hard, you can solve that. So that's not really what gave me purpose. What gave me purpose is my high school's website where I started working on open content is running on Drupal. Totally accidentally as I was not involved. Totally randomly. So this is the high school I went to. So it can make a Hungarian Drupal website that's fully Hungarian and works very nicely. It's not multilingual, but it's Hungarian. So that gives me purpose. If we can make it work in a way that the little websites can do it very easy that we succeeded. So that was my purpose in here. The autonomy part, I think, is much harder to solve if you come from a traditional open source developer background. Because I think many people that start open source projects, they're great developers. They have this idea of what they want to do. They have this architecture in mind that they know how to get there and the steps to get there and they are building it. And they want to have people along for the ride, but they don't want to have people to tell what the architecture should be and what the steps should be to implement, et cetera. So to give autonomy, you need to agree or understand that you need to agree on the high level goals and get rid of your idea of micromanaging anything below that. So you need to be comfortable with the idea that you define these high level goals and it's up to the team to figure out the rest. And maybe it's not the same architecture that you wanted. Maybe it's not going to be exactly on the timeline that you expected or the steps that you expected, but other people will implement it. If you share the same goal, there's going to be shared ownership and they will implement it. So I think this is hard for... So this is one of the things that's been... I've been trying to mentor initiative leads on in the Drupal community ever since because it's very hard to get from a developer background and have an idea of how this should be done. And then give up that idea and work on organizing the whole thing instead of implementing the whole thing. But to achieve some scale, you need to give that up. The next one is, especially in the Drupal community, is to set up space because when you have this big thousands of people at the conference, there is no identity, there's no space, there's no feeling of community for the team that you have unless you set up the space. So for example, what we did here is we have a chat room. This used to be IRC now, it sounds like, that is shared in the team. We use chat meetings and threads so that it's easy to get involved with multiple language backgrounds. It's much harder to follow live audio meetings and video meetings when it's not your native language and it's very fast. Chat meetings much easier to follow. We had this identity that was created by one of the sponsor companies, because the logo of the multilingual initiative, we had stickers of this, we had t-shirts of this, etc. When we went to events, we had tables where we set up a big sign that this is the multilingual initiative so people came in, they would recognize us, they would join us. We were always there in the morning, by the way, that's a good trick, so that we were the default choice at the contribution room. When people came in in the morning, they were like, oh, multilingual initiative, great. So that allowed us to have this sense of small community that we need to achieve in this big community, to have a sense of belonging and to have a sense of connection, and so that people stay and will have those personal connections that otherwise are not possible in this big community. We also had our own website, which you may or may not need, but it was nice to have our goals set out there, and we basically pulled issues from the IssueQ and used tagging and labeling on issues to prioritize them and then display them nicely, so we didn't need to do a bunch of work manually on the website itself. Now then you have people, I think the next important thing is to have buddies, set up buddies for things. At least in the Drupal community, there's always at least three people that you need for an issue to be committed. There's somebody that works on the fix, there's somebody that reviews the fix, and then somebody that commits the fix. So if you need three people to work on an issue, you need to set up those three people to be successful, like, that's not going to accidentally happen. Like, if you walk into this keynote room, I have an issue, there's nobody going to listen to you. So what we did is when new people came in and they were like, I want to help, we always assigned them to something that somebody else was already working on, because then they had a buddy that was already invested in the issue that they came in to help with, so there was already shared understanding between those two people that they want to solve this problem. And once we had that, we had these buddies that if one of them went away, we still had a solution for how do we move this along. There was still one person left that could serve as a successor to the person that used to introduce the problem to the next person. So it was pretty useful to keep things going because stuff happens to people. Like, all the main people that I had in the initiative, something happened to them, and it was always useful to have buddies that shared the same goals. And that was basically the only way to get stuff done in the Drupal community anyway. So I think that was pretty much a key to our success. And the next thing that I realized is we need to praise the smallest of results, because people don't really recognize that they are going towards a goal and they are achieving something towards the goal unless you point that out. And often people forget about, like, after a week or so that they did something great, so it's great to get back to that. And in the meetings, we always had a section where we were praising the results from the previous meeting and figuring out who did those things and call them out as well. And the other thing that's super important, I think, is to praise the people that go away, because when they go away, they probably already burnt out two or three months before they just didn't realize it. And it's good that they went away. It's good for them because they need the break. And it's good for you because they're not going to be here and, like, maybe have negative effects on the team. And it's good for you because if you are praising that they need this break now, it shows the team that they don't need to overwork here and they don't need to kill themselves through this project. We'll figure this out. And the person that you celebrated for taking the break may actually come back after they took the break if you've been kind through this process. So it's like there's no other option, I think, that's like the win-win-win-win-win to praise people that go away because it's the best for everybody. So if you do these things, you have those buddies, you have a small tight-knit community, even in the bigger community. You have this space. You give them autonomy to work on their own ways. You just share the high-level goals. And then you have this shared ownership about things. And maybe it's not going to be implemented in a way, maybe it's not going to be implemented by the same people you started with or on the timeline you wanted, but it's going to have shared ownership. And that was kind of useful for me when I had a problem. So a couple of years into this initiative, it was a long initiative. I had breakfast with my wife and she started to having very strong stomach pain that didn't end. And so we stopped our breakfast there and we went to the emergency room. And they figured out that her blood results are getting worse and worse, but there was no blood to be seen anywhere. So they figured out that it's internal bleeding and she was about to die in a couple hours if not operated immediately. And so they assembled a team of doctors that would operate her that night and they saved her life. But she lost one ovary on the way. But she now lives on and we still remember this day. And at the same time, Drupal Karnasdin was happening. I was supposed to be there and do all of this magic with the multilingual initiative. And I was obviously not going to travel to Drupal Karnasdin when my wife was recovering from a life-saving operation. So because we had this shared ownership and shared understanding of the initiative, all the stuff that we were planning for the multilingual initiative happened in Austin. They sent us flowers and guards and well wishes and they sent us this photo of some of the people on the Contribution Day to wish us well. But this was because we built this initiative to do it together. And so mastery is the final one, I think, which is probably the most interesting thing, I think. Because people want to get better and you want to have people on your open source project. And so the question is what is that you want to have people on your open source project and they want to get better at and what do you need that they may want to get better at. So that's what we are looking for. So one of the things that I've been doing at events is therapy sessions because multilingual used to be very painful in Drupal and people had pain. And so I set up a multilingual therapy barf, is what it was called, on the schedule. And I would sit back and I was like, do you want to talk about it? And they wanted to talk about it. And so what this was great for is, A, I got in the users that had pain about multilingual so I could have a requirements list of what I want to solve in the multilingual initiative. They got to talk about their pain so they felt heard. The people that were contributing on the initiative came in to the barf and they felt like they are the experts because they could give advice to the people that had the pain. So I was basically sitting there, I didn't do anything at this barf. I said, do you want to talk about this? And the experts came in from the initiative naturally and the people with the pain came in and I just sat there and I enjoyed it. So that's the investment. So the experts, basically the people working on the initiative came in and they gave advice to people with the pain. And we got to show at the barf, this is what we're working on, this is how it's going to make your life easier. We feel your pain. Yes, it is something that's hard right now, but this is how we are solving it. And so we could build in that feedback into what we were working on. We could review with the people with the pain the solutions that we had. Does this solve your pain or not? So it was very good to get direct feedback, it was very good to have them listened, it was very good to provide visibility to the people that were contributing and get professional recognition for them, sometimes business because they were giving advice to clients that were showing up in the room and they may get a business relationship after the barf. So it was great. So I think it was important to acknowledge that multilingual was a pain and provide this space in person as well. The next thing we did was radical openness about how we organized this initiative. So we created an open source slideshow for example that anybody could present anywhere. And they translated this slideshow into multiple languages presented in Japan and Poland and France and bring it to companies and a lot of places. So we just gave this slideshow and we didn't ask for anything in return. And this brought the news of the initiative into far and wide on the globe everywhere that this is happening and made people excited. And also gave people the opportunity to deliver sessions that have not done it before, they didn't build a slide deck that was compelling or anything. And this was useful for them as well. We made the Drupal distribution which had a demo of how this multilingual thing will work. It had demo content and demo menus and a bunch of features set up so that people can try out how they can do it. And they can try out how this will work and they can test this out and we can get feedback. We created a two hour workshop with a 23 page handout that detailed the steps of how you get to build this distribution basically. How you get to build out a multilingual menu, how you get to build out a multilingual content structure, etc. Super detailed. That's why it was 23 pages. It was like click here, right this, click here, right this detail. So this was very useful for people to do these workshops and teach people how to use the multilingual Drupal system before it was even done. Like we were already training people on multilingual Drupal before we were done. And with the help of Acquia we created a user testing script that could be crowdsourced so that people can do user testing at their meetups at their local events and record them and publish their results and we could aggregate the results and use that to inform how we are doing and where we need to improve the user interface or the flows or how all the things are connected. And so I've been doing a bunch of research and reading in the meantime and read a bunch of interesting tricks on how to involve more people. So this is one of them, car wash loyalty. So that was one interesting story about car wash companies. They want to have people come back to wash their cars. And so they did an experiment where they had a car wash loyalty card with eight slots that you could stamp in and then get a free wash at the end. And they did another card that had ten slots but two were already stamped in. It's the same eight slots but there was two more that were already stamped in. How much better did this one work for people? What do you think? This worked twice as good. So the ten slots two already stamped in worked twice as good to get people to get to ten stamps than the eight slots, eight empty slots. They had the same exact number of empty slots. But this one told you that you are already on your way to achieve your goal. So it was like you already started. You had two stamps even though you didn't do any car wash. It was just two stamps and the first stamp was the third stamp that you got there. And the people that had this card, they got there faster as well. Not just twice as many people got there but they got there faster. So I have to translate this to open source contribution. So one thing that I did is I wrote blog posts about how Drupal Multilingual is going. And I broke down the initiative basically I think 18 posts or so. Like this is what we do for multilingual installation. This is what we do for interface translation, etc. And at the end of the post I had a section of by the way this doesn't exactly work well yet. And these are the issues that you can be involved with. So people read about what's exciting thing coming up. They got informed. And at the end they got rubbed into helping with solving the problems because they already felt like we are getting this great solution and they was like it's just almost there. I just need to help with this one. That helped a lot. So at the end we got 1,300 people involved and that included people from companies like NBC Universal and Pfizer and Carefor and the University of Waterloo, University of Iowa, Biologist, Genetic Information Management, McGill University, Johnson & Johnson, Ticketmaster, Google Summer of Code, Google Code and you name it. So all of those sources had people that were involved. So this is the list of people. Too fast. Wanted to spot yourself? So there's a lot of people. And so basically all it took is for me to understand that this is not a fixed pie that we need to look at how we grow this pie. We need to figure out what's in it for people to come in here and grow and be involved. And for me to figure out that I need to give people the autonomy in this project to figure out how they're going to solve this problem. So they have a shared ownership of solving this issue, for them to have ways to get better at things, for them to master their craft, for them to improve on their own terms, and for us to have a shared purpose on why are we doing this. So if you want to read a lot more about all of these things, these are the three top books that I would suggest in this area. So David Marquez turned the ship around. This is great for handing off autonomy. He's a nuclear reactor, nuclear submarine captain that was training for one type of submarine for two years and then got reassigned in one week another type of submarine that he had no idea how to operate. And so he can need to figure out how to give autonomy to the crew. It's a great book. Danny Alping's drive is about this whole structure of autonomy, mystery, and purpose. And the switch from the Chip and Don Heath is a lot of great stories and solutions and tips about how do you make people do things that they probably wanted, but you need to convince them. So the car wash story comes from there, but it's all... There's no software stories in there, by the way, nothing. But there's a lot of stories about what do people do about glove ordering in the hospital or kids' cancer treatment or a bunch of other things. There's a lot of great stories in there that you can apply in one way or another to open source as well. So there was my talk. Any questions? You've left a speech list. All right. When does your book come out? I don't have one on my own. Thank you. All right. Oh, there we go. Yes. So the question was that Drupal has this challenge in all kinds of other areas as well. So I think that is the 10x or 100x to apply to all kinds of other topics. I think so, yes. So I think we've seen some of the recent initiatives that people were really driven to implement that had similar approaches, like the single directory components initiative, for example, had a very similar approach and a smaller scale. So I think we could apply a lot of these to other initiatives. We've been trying to mentor initiative leads on these ideas. And we've been successful in some of these ways of like how do we involve people from events in initiatives. That's been a track that we've been really successful on working through. But there's definitely a lot more that could be applied from here. Yes. Yes. Please submit a proposal to the 5.3 developer next. All right, please. I was suggested to submit the proposal for the 5.3 developer days. When and where is it? It's the first and second and third of August and it's in Koffrüh in Germany. It's the first three days of August in Koffrüh in Germany. It's for the camera. 5.3 developer days. You. Yeah, me too, but the listeners as well. Great. Yeah, thank you everyone. Have a nice day. Thank you.
Making FOSS CMS easier to teach with shared competency standards
I'm still remember their crest number. So anyway, this is, are you Florian zero or Florian one? We don't, maybe we just play flow. Flow, you go with the flow. Whoever is responsible will react to it and that's it. Skill display, this is Florian. All right, so anyway, Florian's organization skill display is in a really interesting space, very adjacent to ours. Thinking about the problems that from two sides, how do people coming into open source or technical careers in general qualify themselves and prove themselves qualified to potential employers and how do employers find qualified people and understand, you know, what's the basis of truth for understanding someone's skill level when you get to know them. So we're going to talk about certification now. We are talking about competencies on a lower level than certification, but can lead to certification of course. Competencies, yes. Well, hello, hello everyone. I want to propose, since we are here for the open website alliance, a shared competence standard focusing to get people into open source projects during educational state or whenever they want to go into development. For this, open source is already doing a lot. I mean, it's open by nature. We have low or no entry cost at all to start working on a project, getting the software start yourself up. There's vast amounts of resources on the net. I mean, there's the documentation, there's blogs, there's videos, YouTube is full of stuff. You have trainings, courses that can teach you how to get started with type of three, Drupal, WordPress, Jomla, how to prepare for certification if you want to go in a professional field. And as we are here now, we have events, we have camps, we have death days, there's lots of activity in the community. There is a community to turn to. That's the whole thing of open source. So that's great. You can start wherever you want. You don't need much resources. My are we talking about a shared competence standard then? There's still challenges for learners and for teachers, for coaches, for trainers. I want to break them down in four years. There is a problem with comparability. If I do a training, if I attend a class, it's cool. I'll learn a bit about PHP development or whatever. But I don't have a big picture. How close am I? Am I now a useful Drupal Dev? Where do I stand? What else do I need to qualify for a certificate, for instance? The resources I need, the resources I use to teach, the resources I use to learn, are up to date. For which version of Drupal is this article or this lesson? So there's a problem of obsolescence because the projects are living. They don't stay the same forever. Since we talked about common basis for the composer dependencies, for instance, there's a lot of shared stuff. So this can lead to redundancy, especially if I switch between technologies. I've worked with WordPress, I've worked with Jumler. I want to delve into type of three. What can I carry over? What can I skip? What's the most effective way to build on that, what I already know, to learn to use in my new environment? And lastly, ambiguity. There's lots of resources which one use I. What do I use to build my learning resource on? What can I use to be most effective with my training and my course? So these are the challenges and they open up somewhat of a gap in teaching. When I create a training, for instance, I need to ask myself, both in a school or in a professional environment, what do I need to cover? Where do I start? Where do I end? How do I guarantee that I do not skip anything important? And is there any knowledge that I can build upon from other lessons that my students had from different experiences, from different projects? But also a gap when I go for certification. How do I prepare optimally for certification? When am I ready for certification? I've worked for two years now as a dev. Do I know everything I need to get my developer certificate? Or in my company, is there anybody I wouldn't have thought about who is quite good and could already be certified without me knowing? Or do we have any lacking deficiencies in the company? Some areas that are just not covered by anybody? So we could build competences, skills to map these areas and to get a picture on who can do what and what are they missing. And this is what we want to do. We want to structure competences. So before we think about a big standard or anything, we should start at the basics. What is a skill? How do I define that? In its most easiest way, I have something simple, an identifier, a name, a title, and perhaps a description that makes my skill identifiable. About CMS, I know what a content management system is. I can clearly devise what that skill is if we have a denomination. Then we need to make it tangible. And you can understand what that skill entails. What can I do? What do I need to learn? And you have learning goals. So we can define those, for instance, as a list, as a short summary. And it transparency, because the skill definition can be something arbitrary. It's worth nothing if I don't know who made that definition. It's that someone who has a capacity, who has an authority on that field and whose definition has a worth for me as a teacher, as a learner. With these three things combined, we have a package that I can read. I can understand within a few minutes to tell me, okay, it will about a year. It will about so and so complex to learn this. I will take some time. I can make an assumption on how much a challenge I have before me to learn this. That would be a single competence. Type 3 is not a single competence. I know type 3 is worthless as a skill in itself. I need to go more into detail. So I need to put more skills and I need to put them into a relationship. And this is where we will need prerequisites. If I know what a CMS is, fine, I understand what the underlying construct is. Now I can go deeper. Now I can look how content anything works in a CMS on a basic level. So I can set my new learning goals. I know the difference between simple text and with weak editors. I know what content editing is. So I have skills built by prerequisites into, well, trees. And we have seen those. If the principle works for the shape shifter to it in Diablo, why doesn't it work for Drupal Dev? The only much larger scale. This way we can build a structure for competences and map them. Get a sense for where does my current level of skill lie, where am I, what is ahead of me, what is already covered, what did I miss? So how could we benefit from standardization, from structuring competences like that? First we create a language, a common language for what I can do. I, from education, I get a grade in school. In my CV I have X years in that and that project or that and that technology. But I have, I don't have hard facts. If I compartmentalize skills, I have a, sorry, I have a definition, a hard skill that can, that says what I can do and by whose definition I can do that. Second, we reduce redundancy. If I have a clear plan on my learning path, where I am, I can skip those parts that I don't need, that I already learned from other projects, experiences that I've already picked up. If I know doctrine, if I've acted for the doctrine, I can, I can skip repeating that learning process. And I can do that, also by skills provided by the pre-needling. And I can do with the shared principles of CMS. We have, we can't do anything basics for media libraries and so on. Those are shared principles between all the systems. Here I can reduce redundancy with hard skill facts. I can also create a, create higher comparability. What does it mean? If I have a learning resource and it just, I can't just take it by its cover, I have to guess how far it will bring me. If I can compare learning resources and courses and trainings by the skills they cover, I can pick the one that closes my learning, that gets me closest to my learning goals as possible. I can compare how to learn most efficiently. I can also compare what to learn at all. I might have experience with technologies, I might have experience with projects, I might have goals for what to achieve. I can now evaluate, okay, what do I have to learn to get to a project that brings me to my goal? I can compare between projects based on the skills that I have, based on the skills that I need to learn. And we could promote shared values. Open source built on collaboration. With a shared standard across CMS projects, we can emphasize this. We can bring the collaborative thought, the open thought to people already at an educational level. Matthias said open source mentality isn't taught at schools. We could bring it to schools. If we have an open standard that brings people on board, that helps teachers build a curricular for open source projects, we can teach open source mentality on schools. And we can incentivize the use of open source projects as well, because, well, teachers who are getting help by cutting their preparation times for courses by a competency standard, are more incentivized to use those projects on schools beyond the monetary advantages. And thus, it would help us to establish for more in the educational sector. So, how could we start with something like that? There already is a very basic tree that was created as part of a Erasmus Plus program, founded by the EU, created in cooperation between Type 3 and School in Vienna, the H. Taylor-Renwick, that covers CMS basics for use in the web development classes. It serves as a gateway to seconded management systems. To endorse this, and put the hopefully grand new open websites, Alliance logos on that, and expand on it, we could create a shared common basic skill set for working with FOSS, open source projects, with FOSS CMS projects, as a gateway into deeper documentations, deeper skill definitions for each project. So, this could not only be a cooperation, it could be a gateway into each single project. Which in turn would allow teachers and trainers to have a single source, a single structure, to build their curriculum, under a shared brand. So, there's no more running around, searching for articles, searching for running material. I have a skill structure, I have ideally referenced material, documentations that are endorsed or official, and I could build my courses from there. That's much centralization, and that's not open source. We want to have this, of course, as an open approach. How do we do so? Skill definitions should be open data. Have the possibility to contribute, to define skills where you see them, they are missing, provide them freely to teachers, to trainers, to businesses, to make their own certification preparation, and put this into initiative for open teaching, for teaching open source projects in a way that is accessible for everyone. That also could help, especially in territories where we don't have a broad supply of trainings, as Types van der Waas mentioned. With such a standard, we could bring teaching material to areas where professional help is a scarcer, there's not as much infrastructure. And we can also integrate that with certification. We have existing certificates for all CMS projects to different degrees. For type of feed, there are already skill definitions for most certifications that are out there. Nothing stops us from picturing those certifications with skills, provide a structure for preparation for them, and in this way, aid enabling and preparation for certification attempts. But we could, of course, also create new certifications for the Alliance itself, for shared approaches, for shared principles, especially on the entry level, but common principles that need to be taught. So, looking back at the challenge from the beginning, where we had no low comparability, you now have a greater scheme, a greater skill tree, where I can plan and track my progress, where I compare resources and trainings that are offered. With Anansi, it can be replaced by efficiency. I can now pick my learning goals and my learning path to that goals, based on the skills that I still need, skipping those that I already have. Transparency can solve the problems with obsolence, with skills, that are living index. If a skill, if a particular technology, for instance, falls out of favor or is replaced, the skills can be laid dormant, and replaced by newer versions, or completely other skills in the tree, and I still have the control for my curricula to be on the current state of the art. And lastly, versatility. By linking my trainings and resources to skills, I can still have a variety of resources available, but they are now anchored in the tree. I can choose them according to my needs, according to my learning goal, on what brings me the most for my current situation. So, this is what we in short propose, structure, defined structured skills for the CMS projects, find common pools, shared principles, and build from there an ecosystem for using education. Thank you. So, I think that in the open web alliance room, when we're looking at new ways to collaborate together as projects, the idea of shared and common competencies makes a lot of sense, and the idea of sharing the structures so that I might build mine more quickly and easily. All of that makes a lot of sense. Since we have some time, is there any chance that you might, unless other people have other questions, of course, but I'd really be interested in seeing a skill tree, or part of the typo three, one so that we can get a sense of what these things look like and feel like, and how I might as a learner, or a potential... Maybe you could dig into that a bit. Obviously, super happy for other questions, and Matthias is about to have one. So, I'm connected to you. I end up with these things that are not questions, but that are statements of support and stuff like that, but I've been working with skill display for a while, and one thing is you have a lot of skills, and mapping them afterwards can take some time, but once you have them, you have quite an amazing resource, and having people start their process of learning as CMS by, as we'll see, checking boxes of, I've learned this, I've learned this, I've learned this. A colleague says I know it, and I might have done a certification as well, I have a university degree or something like that. What it does in a working environment as well, I find is really amazing, because I've been working at agencies, and you've got a new project, and we need this technology, or we've got a new person in the team, what does that person most need to learn? You can use it to map out who knows what, and peer people, so that the person who knows something can work on a project with somebody who needs to learn it, for example, and that is a real strength, and I think also in multi-project environments, multi-technology environments, you need that kind of basic thing that, well, does everybody who works with code know get to a certain level? Do they know our processes? And you can build both, take processes from, all CMSs need the same basic processes, for example, but within the company you might have your own processes as well that you can enter, and you can build skill level, I love that. It might even be on a project level, for instance, for customer trainings, you might have different, you might have common parts within your projects across multiple projects for multiple customers, but if you build with this mindset customer trainings, you can reapply those resources, and expand them where necessary, and have a common shared basis for them, and also save time there. Good. I also want to contribute from my direct experience, so from my direct experience with skill display, I'm also coming from the Type 3 Association, and skill display was our partner in two years in which we have implemented an international Type 3 mentorship program all over the world. So we've actually touched and supported developers, web developers from about six countries in Africa and two countries in Latin America. And what we did was we used skill display to develop custom curricula. So custom learning paths which was aiming for people which were really in the beginning of their way as web developers, not so much skills, not so much skills in PHP and so on. So we've built custom curriculas that was matching them and their situation, and our mentors were using it step by step to check the acquired knowledge. It was very effective and showed also that you can actually customize the skill display path of learning on any way that will fit your purpose. Okay, thank you. Any questions somewhere else? I really want to demo. Ah, demo. Hey, yeah. Yeah. We have here as a demo a very, very simple skill tree. Again, we are talking about basic principles here. The idea was that we have a center core about a CMS. The root skill in this case, and from there branching the core principles for working with CMS. So I have the skill about a CMS. I know what a CMS is. I can define its ground functions, and that serves as a prerequisite for understanding what extensions and plugins are within the CMS. Because I need to know what the core functionality is so that I can deduct, okay, and then there's extensions that provide additional functionality. They are isolated and can be, I'm sorry, can be edited to projects as needed. Likewise, user management. This is a more in-deep function of a CMS. Again, in every CMS. So before I can go about user management, I need to understand what a CMS is. That is the core thought of a skill tree. That of course can grow and grow and grow. I think the type of three developer skill tree at this moment has about 130 skills. So we are going to, from Diablo 2 to Path of Exile here. But again, it is manageable, and I will rarely look at the whole tree, but only on the scope for my current learning needs. Because if I make a course, I don't need to focus on the whole tree. I can make a course about community working with CMS. So what do I need? I need to understand what a CMS is. I need to understand user management. And perhaps I should understand content editing. And these three skills now make my learning unit that I can dynamically create for my single purpose. For training someone for a community page. That is the strength of the tree structure, in essence. So I hope that somehow, first. Yeah. Does type of three having that skill tree in place, does that provide a template for Joomla or Drupal or WordPress to come in and basically fill in the blanks of their own tree? Of course. They can build on the common similarities, because, as for instance, about a CMS, basic skills are common. So why not go deeper into user management for the general principle, go deeper into Drupal user management? How does Drupal user management work? And from their branch of how does permissions work and so on and so on. So those need to be singular for a single system. They can intact with each other. I can compare, for instance. I can compare multilingual skills and requirements to create a multilingual page in each system. And I can see as a learner to say, okay, well, which of the systems has been covered better and which of those is easier for me to learn. And vice versa as a teacher, which is the best system for me to teach multilingual web development for my students from the scope of skills that will be required to actually make use of this. So yes, this can intertwine between multiple systems. Also with integration, booking platforms or commerce, commerce suites, you can integrate those as well. Okay, then. Thank you. Thank you.
Breaking Barriers: Content Management Systems and Accessibility
So everyone, welcome back to the Open Web Alliance Dev Room. Oh man. Cut that bit. Cut that bit. Welcome back to the Open Web site Alliance. Open Web site Alliance launched today. So I'm allowed to practice it saying it three more times before it has to be free of mistakes. So in this Open Web site Alliance Dev Room, we're now going to proceed with two friends from the Type 3 community and from the Mitwald hosting company talking about accessibility and content management systems. And they are going to be so kind and work their way in over the next 40 seconds to get started talking. Please Martin and Lucas. Alright, thank you so much. Thanks for the intro. Good morning. Can we still say good morning? It's before 12. Yeah, we... Alright. Alright, good morning everyone. We're going to be talking about accessibility and content management systems. We're going to get back to that bit in a second. I'm Martin. I work at Mitwald, responsible for developer relations over there. In case you don't know Mitwald, we're a hosting company from Germany. We specialize on providing the support for agencies, mostly in the open source CMS space. I also do some lecturing on computer science. And just a little disclaimer. I only have started learning about web accessibility recently, most of which I know I actually have learned from Lucas who's standing right next to me. Yeah, I'm Lucas and accessibility is quite a personal topic for me as you might have noticed already. So I have personal experience with a lot of issues you run into if you have some accessibility needs. And I'm a project developer at Mitwald, I think four years I started at Mitwald. And about ten years I'm a freelance web-based developer and developed themes, plugins, custom solutions and stuff like this. Yes, and I have advocated for this topic in our company for a long time now and now it's starting to get some action. And so we are here and talking about accessibility today. So I think or I might speculate that one reason this topic has been gaining traction in the past few months, years is that there's new legislation coming up. Like for example the European Accessibility Act comes into effect I think next year. There are also other laws that pertain to the same direction like in the US there's the Americans with Disabilities Act which has been around for a while, I think since the 90s actually. So I think this might be a reason that some people start seeing this now. But in the next 30 minutes we actually plan to convince you that fear of legislation should not be the actual reason to consider accessibility as something that's a good idea. Because in actuality it's about enabling access for everyone and not getting sued should not be your primary motivation for this. Like in a perfect world we shouldn't actually need to have these laws. So it should just make natural sense to be as accessible and inclusive to everyone. Sorry, I'm getting confused. To understand what we are talking about today we first need to define some things and mostly there are different kinds of barriers you can get to if you have some accessibility needs. And we've listed some here on the slide in a second. First of all there are perceptual difficulties you can get. For example if you have some vision and disabilities, a lot of you wearing glasses, discounts as well. And all of you have heard about color blindness by now, contrast issues and stuff like this. These are the most obvious things a lot of people think about. But there are motoric issues as you can see. I mainly use my right hand so it's difficult for me to do stuff like a lot of keyboard shortcuts are quite difficult. And stuff like this or maybe you have Parkinson disorder and you can't do small movements or keep your hands still. This is another problem. Mental and cognitive issues I think all of you have heard about ADHD. It's difficult to concentrate for people with ADHD if something is moving on your website or even on slides. Yesterday as well as the slide where an animated GIF was running all the time it was really hard to concentrate even for me. And someone with ADHD would have really, really much trouble concentrating on such stuff. And videos or animations on websites, it's the same issue. So think about this when doing such stuff. Remembering stuff especially in our short time memory might be quite difficult for some. In marketing there's the rule to just put at most eight items in your menu. This is one of the points so think about this as well. And then there are two more topics we want to talk about because there are also technical and economical aspects to accessibility. Maybe not all of the people want or can buy new devices, a new smartphone every year or because of environmental reasons they just don't want to. So technical accessibility is a problem as well. And maybe your TV we have some TVs in our company which has very low contrast ratios. We're actually in secret hoping to have one of these fitting university beamers just to bring across the point better but actually turn out to be alright. But you can see it in the top right corner. The Mid-World logo is quite difficult to see because it's white text on white background, not a good example. And maybe you have one of the famous low internet connections in the famous Deutsche Bahn. Then you will run into trouble as well. Yeah and of course low-end and all the devices I just talked about and you have to keep in mind that sometimes financial problems correlate to physical disabilities because people already have to spend much money on accessibility devices like wheelchairs and stuff like this and don't have that much money to put into technical devices. I think disability also correlates with unemployment a little bit. And if it's not a disability as mine you might also have a disability sometime. Maybe later your eyes get worse or you have a temporal disability, a colleague of mine tore his Achilles heel last year I guess and suddenly he noticed how hard it is to get up the stairs at the office. So everyone could be affected sometime or even if you just hold your coffee cup in the kitchen suddenly you only have one hand to use your smartphone and such stuff. And a lot of you might be the kind of keyboard nerd who likes to use the terminal for all his stuff and such. And such people will be happy to use your website only by keyboard as well. So accessibility needs can always be a personal preference. Yeah so with JusticeCast there are a lot of aspects you have to think about. Some of you might think this is really expensive, this must be really hard work. But in fact it's not that difficult. So I think about many people think about like adding accessibility after the fact which I can imagine does get painful because there is a development effort in that. So obviously the most obvious recommendation would be to just consider accessibility from the start if you're starting on a green field. Then it's actually not that difficult if you think about it from the get go and just think about it as a quality measure. Like you would also think about code quality. I'm guessing who's a developer actually. So most people, I would guess most of you wouldn't also think about skipping testing. That would be insane. And really irresponsible wouldn't it? I see some people struggling. Like who would do that? And even if you're arguing about money, if you're building an inaccessible product you're actively excluding users from your product. And I think there's estimations it's potentially about 15 to 27th percent of users that you're excluding. So if you're taking measures to include these users that's going to pay for itself in profit. And even if you're not starting on a green field there are also synergy effects with other quality goals. For example we've talked about sensory issues. For example if you have a limited vision or limited hearing and you provide audio content like podcasts or video streams. And one measure that you could take is to provide a transcript for that, for your audio content. Now if you have a transcript that's text and a search engine can crawl it. Boom! Instant SEO. We've also talked about accessibility issues on lower end devices. Because not everyone wants to buy a new iPhone 15 for 1500 bucks every year. I don't want to. So if you're targeting lower end devices you need to start thinking about limiting resource consumption. Being more efficiently you need to start thinking about performance optimizations. Which in itself is also an important quality goal. Which I'm just thinking about now is that this also has a sustainability impact. Because you're enabling users to keep older devices longer before they need to be replaced before they become obsolete. You also minimize resource consumption. You also minimize battery drainage on mobile devices for example. So that's also an important synergy. There are some quality goals where it gets a little bit more complicated. So for example when we talk about security there might be some trade-offs that you need to make concerning accessibility. I think a general rule of thumb might be the higher your security requirements are, you need to start thinking about accessibility to not become a problem. For example if you're enforcing multi-factor authentication you need to think about a way to make that accessible. If you're in even higher security areas like if you want users to use TLS client certificates or some things like that. That's a very high cognitive load that you're placing on users. Last week I used my German EID card for the first time to log into a service. That's not for everyone. I honestly have no idea if there's an accessible way to actually do that. This is not an impossible problem. There are guidelines for this and I'm just skipping a little bit ahead to the actual solutions. There are guidelines on that and there are guidelines on accessible authorization. Most of these boil down to reducing cognitive load on authentication. The recommendation is that your authentication process should not be dependent on a cognitive function test. Who remembers those? This is... Can anyone of you solve this? I can't. This is just ridiculous. Luckily these have gotten a little bit out of fashion lately because probably at the moment AIs can solve these better than we can. There are ways around that. It starts with simple things like if you have a username password. Remembering a password is also a cognitive function test but you can use a password manager. You can copy paste the password into a password input field. Don't prevent that. You remember those password forms where you can copy paste into... Yeah, that sucks. If you're requiring multi-factor authentication, I think there are new standards coming up that we can use to make it more accessible like web authentication, like pass keys. All of those reduce cognitive load on the authentication process. So that are things we need to start thinking about and need to start thinking about implementing them. And users without disabilities also benefit from pass keys, for example, because it's just a matter of comfort for those. Absolutely. Now let's talk a bit about cognitive management systems because this is why we are here. And in cognitive management systems you have two sides of the same kind, I think. On the one side you want to provide accessible content to the end user, so to speak. So one important part is that our editors have the option, the possibility to create awesome and accessible content at first. And of course the editing experience itself, so the backend should be accessible as well. So that the editors itself with accessibility needs can edit the content. And think of a blind user who is trying to create his or her own block and share their experiences to the world. They need to use a cognitive management system. Yes, for the front and sides, so to speak, there's the web content accessibility guidelines. This is the most basic stuff I hope most of you might have heard this by now. It's things like alt text for images, using anchor tags for links, using semantic HTML, I will run about this in a second. And then there's the editing side, there's the authoring tool accessibility guidelines. This is especially important for CMS systems. We won't go into detail on this because everything you need to know about this, you will read in the documentation. But we want to highlight some of the most important things. One of the most important things is using semantic HTML. And I would hope this wouldn't be necessary to say these days, but a lot of people get this wrong still today. Use a semantic HTML on the screen, you can see the header and the nuths who have replaced this in HTML5. But it starts with the basics. Use list tags in your HTML if you mean this is a list. You see in the wild you still see a lot of paragraphs all starting with a dash. This is garbage for screen reader users. They can't use this. So please use semantic HTML and most importantly make it easy for the users of the cognitive management system to use this semantic HTML. Maybe provide some automatic functions like if a paragraph starts with a dash automatically converted to an unordered list, you can skip to the next. And provide such automatic features. A lot of you might use some messages who do this by now. And our CMS should do this as well. I think a good example for how this really works out is this is a screenshot from the word for spec end. Where you have this semantic outline view where you can actually see the hierarchy of your headings. And the system also points out where you did something where you built it inconsistently. And this is really important to enable editors to build consistent content structures. Because screen readers actually built on those. I think Jaws is one of the most prominent screen readers. It actually offers navigation options based on the hierarchy of headings. So if you mess that up you're also going to mess up the screen reader. So I think what CMS in general should do is to discourage users from just placing certain kind of headings for aesthetic reasons. Because people do that. I see and through the yes, the nodding. Yeah. So there should be ways to discourage that even if it's just pointing out where things went wrong. And we also need options to be able to configure the CMS to prevent users from doing that. Like this is a screenshot from the type of three backend. In which as an integrator you really need to know how to configure the system to prevent users for example from. Above there is the content element heading which is I think in the default distribution it's rendered as an H2. But nothing prevents an editor from inserting their own H1 into the body of the content element itself which will mess up the hierarchy of headings. You can disable that as an integrator and you should but you need to think about it and you need to remember it. For example the Gutenberg editor in Watpers and they also have the option to set as custom H1 heading. And for the websites I develop I mostly disable this option deliberately so the users who edit the pages cannot make this one by accident. I think in general there are some parallels to search engine optimization aren't there? Because if it's a screen reader or a search engine you need to be able to build your site in a way that a machine can make sense of it. And that's why it's so important to adhere to. Basically it all boils down to adhere to the standard. Another thing that we'd like to point out is users are going to have certain expectations around how a system works. I read a case study a while ago on the type of 3 block. I think CPS did it. They tested the type of 3 back end with 2 or 3 blind users. And I think the gist of it was that you already can use the type of 3 back end right now with a screen reader if you have received appropriate training in doing so. So if you're in that situation you're going to have... You're relying on the system working in the way you were trained and you're expecting it to do. So the most important thing is to not break the expectations that users place upon the system. This goes in very, very, very many different ways. One example are conventions around the menu structure. Each and every CMS back end has some kind of menu structure and there are conventions and there are expectations around where things are placed in that menu. This especially gets important when we're talking about plugins or extensions or modules or whatever. Many names, same thing. Because we're not only talking about the CMS course, we're also talking about the ecosystems around the CMS, the third party extensions that extend those systems. So what we thought of as important is that extension authors should be encouraged to adhere to the conventions that are applied to the navigation structure in the CMS back end, for example. Another thing are UI components, for example. I think there are some WordPress plugins that completely roll their own UI library that behaves entirely different than the rest of the CMS. If you're relying on having your expectations around how the system works being met, then this would really confuse you. One other thing is clutter. I think this is a screenshot from a fairly representative WordPress back end. This is the plugin list. Let's have a look at this. That's a feedback prompt. That's another feedback prompt. That's an add. That's another add. That's a maintenance message. We talked about ADHD, didn't we? I also don't know what would happen if you would pipe that through a screen reader. I honestly don't know. Nothing good, probably. This is another screenshot. This is the WordPress... Was it the same site? Yes, it was the same site. The dashboard itself, same site. The same feedback prompt. Another feedback prompt. Another maintenance. That's an add. That's completely useless. I want to point out this new section is from the podcast plugin in this installation. The news in this widget are about digital cameras, YouTube videos. It's nothing about podcasts. You can disable all those as a user, but you're still being confronted with the cognitive load of... You need to make sense of it at least once. You can disable. This is the perfect handover to my next talking point. It's about giving the user the choice and respect choices the user has already made. As you just saw in the WordPress dashboard, users can disable the widgets in the dashboard, but there are a lot of more choices the user has already made when using your system. All of the operating systems nowadays provide some kind of dark mode, for example. It's really easy to use the dark mode setting in the CSS you can click, I think. We messed up the order in this list. Another point is opening links in new tabs. I often see that suddenly a new tab opens. For me, this is quite normal because I know the tab system. I'm used to this, but when giving courses to my clients, I can see that they get often confused. Once it was some worker in a school, I think, she tried to do something in the WordPress backend. A new tab opened. For me, this was obvious. She tried to use the back button of her browser, and she didn't get back where she was before. Her solution was closing the browser window completely and starting from scratch. This was pain in the ass, not good use. I think my mom would do that too. Of course, a lot of people do this because they don't use the PC or the browser that often as we do. Think about these users. Give the user the option to open links in new tabs. All of us could just hold the control or command key, and other users might use a wide click open in new tab. Give the user this choice. He or she knows when he or she wants it. Now I have two clicks, so I can see my next point. The user can select a font size, a minimal font size, and the browser settings. This is very important for people wearing glasses, or maybe later if your version gets worse. We have a colleague whose IDE is set to 48 points. It's really, really large. You might imagine, but for him, it's the best way to code. The IDE respects this choice, and so should you on your website. Just the other evening we were reminiscing about building websites in the old days. Do you remember when websites had their own buttons to adjust the font size on the site? You remember that? Something to do. That's completely attributable because the browser does it for you. You don't work against the browser. I'm the next talking point I already talked about before. It's a reduced motion and video auto play. Just don't. I think every one of us is annoyed by auto playing videos on news pages. It's the worst stuff. A reduced motion is a CSS property. Maybe you can click through so we can see the CSS snippets. The browser these days implement the choices you can easily opt in and reduce smooth scrolling or animations if the user has chosen to do. This is minimal effort, but a lot of users will benefit from this. Increased contrast and color version. I've already talked about this. Just think about that some colors might have the same gray value, so to speak. Once during my college lessons we had another student who asked my professor what this large circle on the slide is about. It was a pie chart, but all of the slices of this pie chart had different colors. It was obvious for us, but for him it was just a gray large circle. Because yes, he can only see gray scale. No colors at all. We didn't know this till this moment, but for him it was really difficult. One thing I always went about when trying to play some games with my friends is I'm casting keyboard bindings. Some games allow you to change keyboard bindings. Some games don't and some of these games I just cannot play because I can only use one hand when gaming. You still beat me every time in each of these games. Yeah, sorry. Maybe give the user the option to set some keyboard bindings even in your CMS. This could be customizable. Think about such stuff. It really, really helps some people. Yeah. This is an argument I also hear quite often. Thank you. Image with text in it is not readable by the search functionality of your wiki. Much into the technical details of accessibility. Our goal was to basically make you think about why accessibility is important and what it's actually supposed to solve. I'm seeing the animation order on the next slide is going to be messed up. Apologies for that. I would encourage you to think about accessibility not just as a feature that your PO can throw into your backlog and you can tell yourself that you're going to build it when you have the time, even although you already know that you're not. Please also don't think about accessibility just as risk avoidance so that we don't get sued. So if you want to think about accessibility as anything, just think about it as a human right that everyone should have and enjoy. It's just about making your product as accessible and as inclusive as you can for everyone. And there's the last animation step I said the order is messed up. All the technical stuff, all the accessibility guidelines, you can test this. There are automated test rates, you can pipe it through light holes and light holes, gives you a green check mark for example. But don't add accessibility just to have that green check mark, you need to understand what you're actually doing it for. Thank you. I didn't drop it. Wonderful. Congratulations. Thank you very much. In my experience, thinking about accessibility, two things really surprised me in my path. And the first was the realization that there are temporal moments where we're all less abled. Like walk out into the bright Brussels sunshine and look at your phone screen or carry anything or have a child or you know. So I found the idea that we're all in that category on a sliding scale all the time, really interesting. And then as a marketer actually, accessibility, if your motives are not pure, and not moral, and not ethical for a second, your total addressable market for any given thing goes up by including more people, right? And if you, and the more semantic data, the more machine readable data you have is the more accessible, is better for SEO, is better for making sales, right? So there's no cogent business argument in my mind to not making accessibility, right? Yeah. So any questions for the, from the room before we break for lunch? I'm sorry, I'm sorry. I'm sorry. I'm sorry. I'm sorry. I'm sorry. I'm sorry. I'm sorry. I'm sorry. I'm sorry. I'm sorry. I'm sorry. I'm sorry. Any questions for the, from the room before we break for lunch? Yes, please. You should never mention lunch. Unless your question is what is for lunch, because I don't know. So the question in, and the question was if it's a bad thing to, to provide dark mode for the user, and the user is not allowed to use it, and the user is not allowed to use it, and the user is not allowed to use it, and the user is not allowed to use it, and the user is not allowed to use it, and the user is not allowed to use it, and the user is not allowed to use it, and the user is not allowed to use it, and the user is not allowed to use it, and the user is not allowed to use it, and the user is not allowed to use it, and the user is not allowed to use it, and the user is not allowed to use it, and the user is not allowed to use it, and the user is not allowed to use it, and the user is not allowed to use it, and the user is not allowed to use it, and the user is not allowed to use it, and the user is not allowed to use it, and the user is not allowed to use it, and the user is not allowed to use it, and the user is not allowed to use it, and it is a struggle for dark mode, for example, but the default option should be to option the user as chosen. Yes. All right, so, thank you very much. Thank you.
Wrestling giants: How can free open source CMSes remain competitive with enterprise clients?
Yet another old friend and open source acquaintance of mine, Owen Lansbury, is with us. Owen volunteers for the Drupal Association and has been doing open source, I'm sure he's going to tell you, but for a very long time now. And touching on some of the topics that Matthias and I opened the day with, like how do we really compete on a business level with the big proprietary companies and what's the thinking going on in Owen's head and in Drupalland about that. And a special round of applause for Owen and his presentation shirt. So, oh sure, can we do this? Is that the speakers John Tick? I wasn't sure whether there'd be five people or five hundred for this talk, so thank you for having me. James, given the introduction about what the talk is about, I'm speaking from the Drupal perspective around the challenges that we're facing. And I have been involved in the Drupal project since around 2007, I think was my introduction. And then since then I've run a Drupal agency in Australia called Previous Next and then I've been volunteering on the Drupal Association Board for about the past five years or so. While I am representing the Drupal Association here, the content of my talk is mainly my own opinions, I'm not really reflecting official Drupal Association policy. However, what I'm going to do is tell you a little bit about Drupal that has been covered elsewhere, so I'll skip through what's already been talked about. I wanted to look back on the evolution of free and open source CMSs to understand where we might be heading with them. And then I'm going to dive into talking about how we wrestle the giants that we're competing against in the enterprise sector. And at the end I'll talk a little bit about how we can help each other as open source CMSs and then open it up for discussion and questions if you have any. So quick bit about Drupal, we have a saying in the Drupal world, come fit for the software and stay for the community. And that community was founded back in 2001, probably just up the road here in a college dorm room by Dries Spatets. And we just had our 23rd birthday as a project only a couple of weeks ago. We have a very active community. There's about 8,000 people that actively contribute code to the project, but then there's probably hundreds of thousands of people that use Drupal every day in their jobs as content editors, as developers, et cetera. We're currently at version 10.2.2, is that correct? And that has almost 8,000 extension modules if you're from the WordPress world, what we call modules you call plugins. And if you look at all the versions of Drupal, there's been about 50,000 modules written for Drupal over the years to extend its functionality. Importantly unlike other open source projects, we don't have a commercial module ecosystem or a commercial theme ecosystem. Everything that you do with Drupal is completely free. And that was a very conscious decision by Dries Spatets when he started the project. So I do sit on the Drupal Association Board, whether or not for profit organization that manages primarily the infrastructure around Drupal.org and then how people actually contribute code into the project. So we've had a big project to move to GitLab in the past couple of years, which has been a huge success for us. I think our build times have increased tenfold in terms of the speed that we can do that as a result. And then historically the Drupal Association has run DrupalCon in North America, which is our big flagship developer conference. We've had close involvement with the European event and then various programs that drive the project forward. And then outside of the Drupal Association, the community itself is very self-managed and self-sufficient. So there's hundreds of camps and meetups and other little country associations or big country associations in some cases that the Drupal Association has little to no involvement with in many cases. Now Drupal has always been at the forefront of open source and the open web. And this has been brought into focus recently where we have been officially recognized as a digital public good that in turn supports the United Nations Sustainable Development Goals. And as with many open source projects, the Drupal community is highly motivated by being able to build world-class software that anyone can download and use. And as has been talked about with Type 03, the impact that that's now having in Africa is significant. And one of my kind of foundation stories with Drupal is having an African come to me and saying, hey, we're using Drupal in Burkina Faso. Can you help me? And I said, well, sure, we've got this prepackaged solution that you can download and use tomorrow. And here's all the training materials. Off you go. So that is a really big driver for us. And the Drupal Association's mission statement is to drive the innovation and adoption of Drupal as this high-impact digital public good hand-in-hand with our open source community. And we recently wrote a manifesto that defines our commitment to the open web. And I think that's been incorporated into our open web alliance, website alliance. What are we called again? So yeah, if you're interested in reading that, have a look. Now in order to fulfill our mission of supporting the open web, we do need to be successful as a product in the open market. And I often don't like showing this slide because what it's implying is that Drupal used Pete in 2018, and it's been on this kind of downhill slide ever since. But there is a very different story here. So in 2015, we released Drupal 8, which was a significant architectural shift away from Drupal 7 and previous versions of Drupal. And prior to Drupal 8, Drupal had tried to be all things to all people. You could build your personal blog with it, or you could run the NASA website. But what's happened since Drupal 8 is that we've clearly positioned Drupal as being for ambitious digital experiences. And these slides are from a keynote that Drees gave back in 2017, where he outlined that vision of how we're moving away from being suitable for smaller sites and moving into these larger scale sites. And of course, larger scale sites typically means enterprise style customers. Now this coincided with the rise of SAS platforms like Wix and Squarespace and of course WordPress.com. And as they became more popular, what we've seen is that a lot of smaller sites have moved off Drupal to platforms like that, or other platforms that are better fit for purpose. And I think with that previous slide showing the downturn in the number of installs, that's not really reflecting the true story of what's happening with Drupal. And so what we've been trying to do in the Drupal Association for the past couple of years is rather than fixate on the number of installs, it's what's the health of our ecosystem and our Drupal economy, for lack of a better word. And if you look at the health of that ecosystem, it does tell a very positive story. So we have a listing of Drupal services companies on Drupal.org. With the top 100 of those companies, we estimate their combined annual revenues are about $1.5 billion US dollars. And then once you've factored in all the other Drupal projects that might be built by internal teams, other agencies, etc., our guesstimate, that's very much a guesstimate, is that the total market value annually for Drupal projects is about $3 billion US. So that is a big pie. I think Gabor talked about pies. I'm talking about pies too, because it's a big pie and there's a lot of competition for a slice of that pie. Now unlike other open source projects, Drupal doesn't have a single company that's responsible for the majority of the code contribution or the finances that run through the community. And the challenge that we've had with the Drupal Association is that our annual budget has historically been about $3 million, $3.5 million. So if we're talking about a $3 billion economy around Drupal, the Drupal Association is only extracting about 0.001% of that market value. So that has been a big challenge. And prior to COVID, the majority of that revenue came from running DrupalCon events. And of course, COVID hit, no one could go to in-person events. And we had to have an abrupt rethink about what the role of the Drupal Association is. And we had to refocus on how Drupal could be both sustainable and successful into the future. And so we've recently launched a new strategy that sees the Drupal Association play a closer role in both Drupal product innovation and also product marketing, two words that evoke quite a lot of emotion from people at times. And you might think, how is the organization that's at the center of this huge economy not involved in product management and marketing? But it is a big shift for the Drupal community and a big shift for the Drupal Association itself. So as I've often mentioned to people concerned about this change, if our open source product isn't successful in the open market, the community around that product is going to shift to where the action is. And it's definitely a case of if we can sustain a strong product and ensure a strong community, then we're going to be able to keep fulfilling our goals of providing a digital public good to the world. So all of these things are closely interlinked. So in order to see where we can go in the future, I'm just going to very quickly rewind. I'm sure most of you know these stories. But this story goes back 25 years to when Type 03 was released and then in the five or so years after that, most of the products that we know and love came into being in some form. And I don't need to emphasize that a quarter of a century in technology years is like a millennia. But any of these projects to still be running successfully is an incredible achievement. And I think if you rewind to 25 years ago when some of us were actually working in the industry, we were doing things like building our own custom CMSs and then having to maintain them ourselves and hope they didn't explode. And then big clients were only really able to use products like Interwoven or Oracle CMS, which literally cost millions and millions of dollars to install. So at the time as our open source CMSs came into that market, they were filling a really core need for reusable software that could be maintained across a broad network of people at a very reasonable price. And so as the mid-naughties turned into the 2010s, that really was a golden age for most of our projects. Most of us did very, very well. But then over the past decade, like I said, we have seen that shift towards SAS platforms. We've got headless CMSs. Now we've got static site generators. There's a lot of options out there. And I was looking on Builtwith.com for statistics for this talk and they have 242 CMS products that people are still using right now, which is, oh no, 424. Sorry, I misread that. And I think through this past decade, what we've also seen has been the rise of the digital experience platform, which everyone from Adobe to Sitecore and every other brand has jumped on board. And primarily because enterprise customers do want to have one platform that they can manage all of their content and customer interactions through. Now I personally consider DXP to be quite a clever marketing term. Often it's a mishmash of technologies that may or may not work that well together. But the key thing for those of us who are flying the content management flag is that we're getting shut out of those conversations when big clients are looking for a solution. So within the Drupal Association, we've taken that very seriously and we've started trying to formulate language around Drupal being at the core of an open DXP and however you kind of structure that open DXP is up to you. But we still have a huge amount of work to do in that regard. In the meantime, I think Jam had a slide showing Adobe's push into the government sector in Australia and they very cleverly marketed a Gov DXP platform to government clients in direct competition with, as Jam said, a government managed and run Gov CMS platform that's based on Drupal. And they're winning huge contracts off the back of that, tens if not hundreds of millions of dollars. And this focus on DXP by proprietary companies is just driven by the scale of the opportunities. So Adobe themselves, they think it's worth $110 billion annually and the amount that they invest to stay competitive in that market is around $2.7 billion in product development, sales and marketing every year. And then they make about $4 billion in total revenue. So do us as open source CMS have any chance at competing at that scale? And the good news is we do quite well already. And WordPress is obviously the elephant in the room. They run 40% of all websites, not just big websites. And in this category, this is the top 10,000 websites off BuiltWith.com, they're running a quarter of all of those sites. So open source is definitely winning through WordPress there. But as we go down the list here, you can see we've only got that little 1.4% gap between Drupal and Adobe. And I can guarantee you there is whole teams of people of Adobe looking at how they can kill Drupal to get that leap on the market share there. And I fully understand why us in the open source world, we often recoil at talk of total market value and market share. But when we look at open source from the perspective of its philosophical underpinnings of being free, we do need to recognize that we still build products and products need to be relevant in the market because if they're not, there's only one path and that's down. So in order to maintain a healthy ecosystem for our open source CMSs, enterprise customers are key because if they're changing your product, it means you're still relevant. It means that they can see a pathway for the next decade ahead, that your product's going to evolve and be supported. And it's a huge stamp of approval. Because enterprise customers are going to be pushing money through your communities, they're going to be driving feature requests, and the budgets that they're working with are just orders of magnitude bigger than anything in any other sector. So there are $100 million website rebuilds, but even on much smaller scale, that enterprise investment that flows through to the agency's building, maintaining and hosting these types of projects, and in turn, that provides consistent, repeat income that ultimately flows into our open source communities in the form of stable jobs, good salaries, community funding, and sponsored code contribution. So with that scene set, I'll jump into the meat of the talk, which is how can we actually survive and remain competitive when we're dealing with giants like Adobe? The first thing I want to dive into here is not just a whole lot of bad stock photos, but the understanding, the buying process that these large organizations go through when they're selecting new technologies. As I said in the previous slide, they're often making these decisions for at least a decade into the future, if not longer. And you'll often hear that the responsibility for making those decisions is with the CIOs, and more increasingly the CMOs, the chief marketing officers. All of these people are very risk averse. They don't want to make the wrong technology decision that backfires on their multi-billion dollar company, and they're all quite hard people to influence directly sitting in their ivory towers. So these people will probably be the ones that are reading the Gartner and the Forrester reports. And as you can see on the slides here, there's only one recognizable open source brand there, and that's WordPress VIP. Now, full credits are automatic who run WordPress VIP. There's a very much for profit service. So they've worked out that to get listed here, you need to have a single company with a certain amount of revenue that the researchers can analyze and then do an apples for apples comparison. And having the WordPress brand in these listings just gives huge legitimacy to WordPress for anyone that's using it. Also credit here to Acquia. So Acquia has been mentioned a few times. It was the company that was co-founded by Drees that started the Drupal project back in 2007. They do talk extensively about Acquia Drupal in the context of their broader DXP offerings. The Drupal story is kind of getting through there, but again, we don't have the Drupal brand name there. Every other product on here is proprietary. There was Squiz here that started off as a pseudo open source product, but even they're totally closed source. While I don't think it's that quantifiable about how important these surveys are, just not having the Drupal Type-O3 or Jumla brand names in them does hinder their recognition significantly in this enterprise market. So I think there is an opportunity here. It would be a big process to go through to kind of change the thinking of these analysts to start including open source projects. But if we have an open website alliance that can lobby them, and it is about lobbying, then maybe we have an opportunity to make that happen somehow. Now the other way these C-level decision makers can be targeted is through events. Unlike Adobe, we probably can't afford to pay Ryan Reynolds to come and do a keynote as Adobe do. But in the Drupal world, again, Acquia have done a very good job with their Acquia engage events where they showcase customer success stories on their DXP platform and through that the Drupal story gets told. And the Drupal Association, we have tried to run some C-level decision maker events at DrupalCon, but we've had mixed results because they're kind of mashed together into a bigger developer conference. And why don't C-level people want to be at a developer conference? It goes against a lot of our open source principles, but they want to have exclusivity, they want to feel like they're special, they want to be networking with a very select group of peers and they want to have strategic insights into technology that gives them competitive advantages. So like I said, that goes against so many of our principles in the open source world, but there's a formula there that we can definitely replicate in terms of targeting those types of people. And while our not-for-profit organizations that govern our communities might not have direct relationships with these C-level decision makers, our larger partner agencies definitely do. So the role that we can play as the community organizations is to give as much assistance as possible to those agencies to help them win or retain new enterprise clients, just through playbook-style information. So does your open source project have a playbook that compares your product against a site core in Adobe and has a whole range of answers as to why yours is a superior product? Do you have a pre-package demo that's consistently updated with new features and functionality that can highlight your technology in the same way that a slick demo from Adobe would? This is an area we fall very well short within the Drupal world. Most of our agencies are off replicating effort every single time that they go off and pitch to a new client, but this is a relatively easy to solve issue if it's given the right attention. And something that's related to this is being able to focus on the strength and scale of our global sales team. We don't have anyone that's responsible for sales at the Drupal Association, but every day we've got thousands of people out there pitching new projects to clients, telling slightly different stories, but selling Drupal to these larger clients. So as I noted, we can play a role as the association by providing those salespeople with the tools to help them win those projects. We have started to address this within the Drupal community with a certified partner program, and I recognize that other open source projects have similar types of programs. And what we're doing with that is we're positioning our agency partners using the same language that a proprietary platform would use. Again, we're kind of compromising our core values by playing that game, but to play in the enterprise space, you do need to play a certain type of game. And then the really core group of people that you need to be convincing are the people who'll actually be using the products themselves, so the developers, the content editors, the DevOps engineers, et cetera. So any C-level person will lean heavily on this group to give them evaluations and recommendations about what technology they should be moving towards, and where do they get their information from. They get it from their prior experience of having used different platforms. They might have used Drupal 7. What impression do they have about Drupal 7 compared to what it is today? They talk to their colleagues at other organizations who are using it. Hey, what's it like having a Drupal site? Can you find developers for it? And then, of course, the internet. There's a whole range of challenges that we could spend an entire presentation on each issue, but the core of each of them is purely about perception. So does your 25-year-old open-source product look and function like a temporary piece of software? So WordPress, despite the slides that we saw earlier, is often held up as the gold standard in terms of content editor experience, and we have paid a huge amount of attention to that within the Drupal community to update our editor experience and our administration UI. And we've even got a project at the moment that's looking at integrating the Gutenberg editor into Drupal that WordPress helped fund. Thank you. Another thing would be, is it obvious that your products can fulfill contemporary requirements like a headless front end or integrating with a popular marketing automation platform? Like I said, it is easy to find qualified developers. So this is a much bigger thing that I'll talk about in a moment, but for us as not for profit associations, we should be at the center of those initiatives, whether it's as simple as having a job board available or running a full certification program. Can a developer quickly download and install a demo of your product? When something we struggle heavily with in the Drupal world, I went on the download Drupal page the other day and the first thing it said was, you have to install Composer. I can't do that. I'm not a developer, so there's these big hurdles that we have to get through. But first impressions around that are absolutely key. If it doesn't work the first time, you're probably not going to look at that technology. And then are there demos, case studies, and white papers that target specific industry verticals that you can easily collate, put them in a presentation, and then give that presentation back to your C level? Decision maker and convince them that your open source product is the best one. Bearing in mind that any proprietary platform that's talking to that customer, they're going to be in there with a very slick demo. They're going to have their global digital partners, our digital agency partners saying yes to every feature requirement and yes to every question. So we do need to be playing that game in terms of convincing people, giving them confidence that moving to open source is the right recommendation. And then the final group that C level executives will lean on is their incumbent agencies and consultants. So are they recommending your open source platform to their big clients? Now, in the Drupal community, we've always had a core group of companies that both support the project and champion Drupal to their clients. And like I said, we've recently launched a certified partner program. But the key here is with your agency network, how easy is it for new agencies to both upskill and become part of that certified partner program? And agencies, they're going to be attracted to technology. They know that they can sell to new clients and they know that they can build a business practice around that. So the key to these big global clients is being able to have big global agencies as part of those networks. Again, ArcRear have done very well with that for their own partner network, but it's not something that we've been able to replicate with the Drupal Association at this point. And I think it's generally hard for these big global firms to get their heads around open source. It just doesn't mesh with how they do business. And a pattern that we've seen in the Drupal world is clients will demand that their agencies provide them with Drupal services. They say, yes, sure, we'll do that. They'll do a project for them more often than not. It gets outsourced to someone else, done to varying degrees of success, but they'll quickly slip back into their comfortable pattern of partnering with big proprietary firms. The other part of that is that the way that their structure and run projects, their ability or willingness to contribute back to open source projects in terms of code or have any connection with the communities is generally quite limited. So again, this is a hindrance. It's a solvable problem. It probably takes a lot of focus to get over that hurdle. But again, something that we might be able to grapple with. Now, in terms of where we do clearly win, rapid innovation is something that we do incredibly well in the open source world. And I'm sure, like in the Drupal world, ChatGPT gets released. And then a month later, we've got a working module that you can start integrating that into your Drupal sites with. And I'm sure that was a similar case with most other open source CMSs. But in my question to Gabo's talk, maintaining the speed of that innovation and the scale of innovation is something that becomes harder and harder as your project gets bigger and more complex and as both the software and the community have matured. And so in the Drupal world now, we have this very carefully planned release cycle. And we need to make sure that each release is rock solid. We can't be taking risks by putting new functionality in there that may or may not work, especially with so many big customers. So there's a certain level of conservatism that we now have to adopt because we do have these big customers. And another philosophical hurdle we have is this notion that the Drupal association themselves should be directing actual budget towards innovation projects when there's this notion that contribution is free. But contribution has never been free, no matter how you look at it. The cost in personal time or wages is always borne by someone, whether it's the individual contributing their time instead of doing paid work or the agency that's sponsoring their team to contribute that. And I think the recognition we have in the Drupal world is that other open source projects have no issue with that whatsoever. So the Linux Foundation, they have $160 million that they direct towards strategic projects each year. So we've started getting our heads around that in the Drupal world. We ran this pitchberg contest last year. It was run at the DrupalCon Pittsburgh, if the name needs an explanation there. And this was having a competition where we had $100,000 in funding to drive a few strategic things forward. And one of those was the Gutenberg projects that WordPress actually contributed some funding to. And so as soon as there's some money in the equation, the agencies that are working on those things can easily prioritize them because, hey, they're getting paid for it. Now being able to scale that model up is the hard part. Like I said, with a $3 billion economy around Drupal, how can the Drupal Association capture some of that value? And even if we just captured 1%, then that would be a $30 million innovation budget that we might be able to work with. And I think someone's doing a talk towards the end of the day about how you've tackled that in the WordPress world. Looking forward to hearing about that. Similarly, the idea that Drupal would be marketed as a product by the Drupal Association has been this big wall to get over. And there's a legacy there in terms of being structured as a nonprofit association in the USA where legally funds that come to the Drupal Association are for the advancement of a charitable cause. So there has been a sense that we're not allowed to market Drupal as a commercial product as a result. As I noted at the beginning of the talk, our charitable cause in the Drupal is ensuring that Drupal remains as a digital public good that supports things like the United Nations Sustainable Development Goals. So we do have a core underpinning there. And I think the important thing for us in the Drupal world is that if we don't have that positive product awareness, then we can't actually fulfill that digital public good role in the first place. So whether we call it marketing or advocacy, we do need to drive a positive image of Drupal as a product in comparison to what's on offer from proprietary platforms. We've had a volunteer group within the Drupal community called Promote Drupal that's done a range of initiatives. But what we're starting to do is bring that inside the Drupal Association. We've just commissioned a go-to-market strategy for how to position Drupal as a product in the open market. And we'll have a range of initiatives that will roll out through 2024. One initiative that we tried recently, which was a complete experiment, was to have the very first Drupal product brand at Booth at a big tech conference that was at Web Summit in Lisbon. And again, this was like this kind of radical thing that had never been done before. And it is a very expensive undertaking to do a booth like this at a product, at a conference like that. And so we partnered with a range of bigger Drupal services companies who help co-fund that. And then, of course, they get leads that come through from having a presence there. So in the early days, we have good anecdotal results from having tried it and whether we can kind of replicate that long term. I think the important thing I said to Boris earlier is that we're actually getting out of our bubble and we're putting Drupal out there as a product in the open market. And then likewise, having people tell the Drupal story at non-Drupal events is really key and something that we've been historically quite bad at. So this photo is our former Drupal Association chair, Batti Brattat, doing a keynote at Web Summit off the back of having a booth there. And there's just so many events around the world that we could be doing that type of thing at to get the Drupal story out there. Now, something that doesn't cost a huge amount of money is good press. So for this talk, I did a Google search on best CMS for enterprise. And amazingly, on the first page, we had Drupal come up as best for enterprise. Now was this coordinated by a clever PR person at the Drupal Association? No, there's no one who has that role at the Drupal Association, but it's something that we should probably start paying a bit of attention to. Because for every good review, then there's going to be a negative review about an open source security vulnerability or someone moaning on a internet forum about how bad the user experience of Drupal 7 is, even though Drupal 7 came out 15 years ago. So big firms, big companies, they're really good at managing those narratives. And there's nothing stopping us from doing the same with the right attention. And again, I'll just bring this example up. I think you had a version of this article of the way that Adobe is restoring the playbook of fear, uncertainty, and doubt in the Australian market at the moment to try and convince big customers away from open source. And this article is in Australia's version of the Financial Times, where they just regurgitated a press release from Adobe. Adobe had run a global survey, and surprisingly, the MyGov site they talk about here is now 20% better to use because it was built with Adobe Experience Manager. So us in the open source world, we need to be able to have counter narratives to this. And again, it's a playbook. It's a game to know how to play. Just to kind of finish on here, the biggest strength that we have in our open source communities is the depth and expertise of our developer pools. And there's huge value in being able to market that in a way that enterprise customers understand. I think Jam had a version of this slide in one of his talks a while ago where if you talk about Adobe, hey, they might have a thousand people working on Adobe Experience Manager, but in the dribble world, we've got 20,000. So again, being able to develop that or grow that developer network through robust outreach, training, mentoring, career pathway programs is something that us as nonprofit organizations should be at the center of. It's a big time intensive exercise, but something that's a solvable thing. I will finish very quickly on a couple of slides. So as we've talked about a little bit today, us in the open source CMS world, we're really at the forefront of championing and sustaining the open web. And it's not just us in the open source world who really care about the open web. It's something of huge concern to governments around the world and large organizations around the world. So whatever we can be doing to collectively maintain the focus on and protect our open source technologies is incredibly important. The work that's been done with the cybersecurity act is a good example of that. And similarly, let's look at ways that we can collectively promote positive open source and open web narratives in the enterprise market. And that might be as simple as ensuring that we've got consistent things that we all talk about or it might be as simple as engaging a PR person to manage those narratives on our collective behalf. So I'll leave it there. If there's questions, I can repeat them. Thank you so much. I'll just talk loud and hope that my catch is that one of the things that came out in Matthias' work around that has come to initial fruition with the open website alliance is that open source, we have 100,000, a million developers. We don't know, a huge number. And all of our lives are touched by it every day and you know someone who works with it. But you have people who come and say, oh, I tried open source once, it didn't work for me so I'm never going to do open source. And we are often worried about Wordpress or Jumla or Drupal or very obscure issues for people who aren't in our level of experience. So part of this idea could also be that the mass and the force for good, we don't have that marketing budget that Apple or Adobe or somebody has but trying to figure out how to leverage that scale and make these experiences somehow or these values visible at this collective level seems like a really exciting part of what we're doing here. And Drupal as having found the key into the enterprise market and into the government space very effectively is one of those players I think has a lot of really great examples to follow and I really hope that we can come to each other's conferences and interact more through channels like this. Great. Anyone have any questions for Owen? No.
Collaborative government websites standardization for digital sovereignty using Open-Source. The model of Rwanda and the GovStack Global initiative
Back to the Open Web Alliance Dev Room at FuzzDem 2024. Now I have the great pleasure of introducing a Type 03 friend, Daniel Homer-Adan, who runs the Type 03 Community Expansion Committee, which makes it part of his actual job to go around the world helping people with open source software. And he's got some really cool stories to tell today. Please, one more time. Thank you very much. Actually the organizers of the agenda were inspired to put me after Owen, because I will follow on the idea of building up the narrative. You know, Owen was mentioning that he would very much like to also get a call from Burkina Faso or another country to go there. While we somehow did get that call from Rwanda, as Matias was saying, the matter from then on was for us to create the frame into which we can actually receive calls from all over the place. And we are actually doing it and it's a matter, I think, of positioning and of building up this ability of ours to transform our capacity into value. Value, which is not necessarily something that we need to have a lot of money. We didn't have a lot of money when we started. We don't have a lot of money now, but we believe that it's a matter of value. So the value in which we believe and the value in which we create. I'm going to talk about the strategic approach, which while was developed by our Type 3 community, is actually, I really believe that actually working very well for any open source community. And definitely for any of the technologies and communities which are forming the open website alliance here. So I'm not going to delve into the technical parts of the Type 3 CMS technology itself, but go into the principles of what we've done and what we're planning to do. I'm Daniel from Oroden. I'm coordinating the community expansion committee in the Type 3 association. So it's a very important working because here we're talking about the expansion of the community. What does it mean? It was the committee was created in 2018 in order to take care of bringing the community and our technology in countries where Type 3 was not yet known. So Type 3, maybe you know it, was developed in the central Europe and most of the market was and still is into the German speaking countries and the countries around it. We've started to go into other geographies, into Africa, Latin America and so on. And while we are doing this, the committee itself actually is concentrating on several types of activities. It's engaging communities of web developers, looking to support them to understand and to learn the Type 3 technology and is also working with the market, say potential clients of various kinds and that was the initial idea. We have run for several years a program which was called the International Type 3 Mentorship Program. What it means is that through volunteer mentors or technical mentors from the Type 3 association community, we are supporting people who wanted to learn Type 3 from countries like Botswana, Zimbabwe, Rwanda, Uganda, Cuba, Chile, Bosnia and so on, keeping in mind that these are countries where Type 3 was not yet known or not yet embraced. Yet we really found out that through running these type of programs, while the people there got the basics of the Type 3 technology, there was still something very important which was missing, which was the access to work, the actual ability of them to earn their living using Type 3 and that was something that we realized we have to fix and that changed a lot of things. So we are now moving towards our approach into going from technology first towards market first or towards client first. I'm going to talk about the way into which this story evolved from the Rwandan story and our engagement with the Rwandan government, then the development of some principles which we are following and which we have developed into a strategy that is now going to be embraced by various other countries. In Africa but also beyond, we had the chance, the chance that we actually made ourselves to travel around the world. So last year, for example, we've reached physically governments far away from Oceania, for example, from Papua New Guinea and many, many governments from Africa as well. So how did Rwanda story came to being? Rwanda is a relatively small country in the middle of Africa. It has 12 million people and throughout the recent time, it has positioned itself as the digitalization line or the digital transformation line of Africa. Or some people say the Singapore of Africa. It's not just in relation to digitalization but in relation to the entire economical development that happened in the past, let's say, 10, 15 years. And due to a very, very strong, very coherent political alignment and political will, it managed to make strides into this logic of development. So it's not somehow necessarily a surprise, the fact that Rwanda, with some support from us, as you will see, has managed to establish itself as the first country which has set up an open source community-based content management system as a national standard for all the websites of all the public institutions in the country, starting from government, to national agencies, to regional administration, to embassies, everything. So this type of model actually is now going to be taken over by other governments and I'll get to that. In 2018, the websites of the government of Rwanda were running on a very old, even though type of tree, it was a very old version of it. It was easy for us to connect with them because of that. But actually, as Matias was mentioning about the call about upgrading the websites, those websites which were old and abandoned were not upgradable per se and they were not upgraded. Actually, when we first engaged, we saw the state of lack of maintenance, both in the terms of technology, but also in the terms of content, the fact that many were running without HTTPS, that each was running on a different server, that some didn't have editors at all, and a lot of many other problems. Some were crashing constantly. We realized that the problem is not just about upgrading some websites. So we've started by asking questions which at that point were hard questions. Now for us, they are normal. They are part of, let's say, receipt of approach, saying that aside from your perceived need of having the websites upgrade, do you actually have the knowledge in the country? Do you have the knowledge in the government to manage this type of projects? Do you have the knowledge in the market to develop this type of projects? Do you have the methodologies to host them, to deploy them, to monitor them, to ensure their security, to ensure their performance, to tune them? Do you have the capacity to actually train the editors in all the public institutions, ministries, city halls and so on, who can really maintain the websites? And when the answers definitely at that point were no, then we said it's not about the technology, it's about the process. So first of all and foremost, we need to have a plan. You have to have a plan as a government to develop this coherent approach of your new project, which of course it was typo 3, but it was not about the technology itself, it was about the approach at the national level to build up the capacity and to build up a plan on the long run. Two years and it took us four trips until we actually together with the government of Rwanda managed to set up a plan, a plan of what we need to do, of what they need to do and how we need to approach the process. They've managed to receive funding from the German cooperation agency, GIZ, in order to support them to actually train people, both in the government and in the private sector, and to deploy and set up the capacity there. At that point, and you see some of the pictures of these workshops that we did, on one hand to present, well the technology, the typo 3, but also to work together with them on the plan itself, so how the training or the knowledge acquisition has to be done and so on. And after the initial plan was set, a coaching phase began, a coaching phase which initially the government and the sponsor was thinking that it's about typo 3 technology. While it was about typo 3 technology, I would say that a lot of the activity was given to other type of aspects which for any developing country are super important. We might not see or might not perceive directly that a lot of support and a lot of knowledge gaining has to be taken in the methodology. So in the way that you're coding, in the way that you're organizing, in the tools that you're using, think that in many countries in the world, they did not benefit of university degrees or university level as we have in Europe. Maybe the ways into which they've approached coding so far was not structured, was not organized. They didn't know how to use properly the ideas. They didn't know how to use a management, a task management tracker. The methodology, for example, so Edge Island Scrum methodology, for example, was known as principle for them, but not as a practice. And one important aspect, the deployment. So continuous integration, continuous deployment, that was something that we've put a lot of emphasis on in order to build a proper methodology which has to be described, has to be understood, and then followed. Before and after the coaching is ending. So a CICD pipeline which would prevent by design, the local developers to, for example, go and directly change something in the production website, things which was a common thing. I mean, it was the way that they were doing it before we actually started to work with them in that direction. And the level of the instability of the website before we've done this support for transformation was quite terrible. So throughout the phase of six months of support through coaching, we've managed to transfer knowledge and to develop these good practices. So they have developed the central governmental website and six, first six ministerial websites. After that, the coaching from us, let's say, and the support from the donor has diminished or has finished, and they continued through their own resources to launch further on websites. So never at any moment did we, the traditional Type 3 community members, write any line of code in those websites. Everything, everything, front-end, back-end configurations, everything was done by them, by the Rwandans. Well, without coaching, but really it was a supportive coaching and not something intrusive that we did because the important thing was for them to develop their capacity. Up until this moment, more than 300 websites were launching Rwanda on Type 3. So from the central government, national agencies, regional and local administration embassies, they have set up at the level of the national IT agency, a center for training of the editors, meaning that all the editors from all the ministries and institutions are going through a formalized, standardized training and a national support center, both functional and technical at the level of the national institution. So that means that the other institutions did not have any more the burden of having to maintain and to know their own website while everything was taken care in a standardized approach. In what I will go a bit further, a centralized deployment and hosting approach as well. So in the national data center, in one place, everything was set up, supported by a team of DevOps and cybersecurity engineers, which also we have supported coaching, let's say. In this respect of doing the audits, both for performance and both for cybersecurity, for security, and doing a disaster recovery policy, which by the way, before we came in, was not present. So all the websites of the government agencies and so on are actually following the same pattern, the same look and feel and the same content structure. You see here, for example, the main website of the government, let's say the Gov website. You see here an example of a ministerial website of the Ministry of ICT. So it follows the same logic, the same pattern, the same type of content blocks in order to create a familiarity for the users, the visitors of the website with the ways that the institutions in the country are presenting themselves. Here is a website of a region, a northern region. And what we have right now is more than 300 websites, 19 ministries which are all on one code instance, 30 districts on one code instance, and the logic continues. What does it mean on one code instance? Well, for this instance, actually there is one installation, a multi-tenant or multi-domain, which provides the code for all the websites of the same type. Why is that? Because actually the functional and technical resembles allow it to use the same technological components and extensions in the case of Type 3. While in the same time, the technical architecture of Type 3 allows completely autonomous management from the back end of all the websites, each of the websites. Each of the websites has the editor structure with their rights and access abilities with their own content and of course with their own layouts or layout specific elements, let's say. What we got out of it and what they got out of it is an approach into which they have a standardized take or standardized framework both in the technical perspective, but also in the perspective of doing the development further, doing the maintenance, the disaster recovery, the training of the editors, the support functional and technical, the continuous integration, continuous deployment and monitoring and auditing both performance and security. All being done under the umbrella of the National IT Agency. In this coaching program of six months, there was a relatively equal representation of engineers from the National Agency and developers from local private companies. And the local private companies further on have carried on the development of the next websites and the process is still carried on. Of course, we are expecting on a target of general of about 600 websites. Now, out of this experience and out of the success that we saw, not just in the matter of technology, but in the entire approach and the effect that we saw in the country, both in them having well established stable websites and also capacity both in the government and private sector and opportunity at the private sector, we realized that we can actually extend this, learn the points towards any government in the world. Any government in the world. And we're not just talking about the African ones. We're talking about the European ones as well and the most developed countries as well. And the principal, as you will see, are transversal are at the highest level, not necessarily touching the technology per se, although the technology has to be adequate in order to support the principles. So what the principles are? Of course, the websites have to be secure. Be on recognized, secure the frameworks. They have to be standardized and what we mean by standardization is the effect or the scale effect that is generated both in terms of economy, economy of scale, definitely when you're doing this approach you have much less costs than the traditional approach of actually each institution choosing their own technology and then running their own tender and hiring their own developers in order to do their own website, which afterwards get not maintained because the SLA contracts with the developers and or the technology gets obsolete and nobody is tracking it on the long run. So building a logic of standardization at the level of technology and also at the level of the method creates a tremendous effect direct economic in the cost of developing and running the websites. When you're applying proper methodologies for all the processes like security and performance and everything, you're actually getting a much higher, much better resilience and again and a much better cost because everything is at the level of the hosting strategy centralized in all in one place. Nowadays when we're talking with other countries about the standardization we're actually bringing to the discussion 12 points and I've actually pointed them out into this talk up until now, but we are in the moment into which definitely the next country which will adopt it with the support of international funders is already Somalia. There are several other countries which are in advanced state of discussions, Senegal, Gambia, Papua New Guinea. All understanding as of now the idea and the logic of having a structured approach. So it's not about let's do a website and then we'll see what we'll do next. It's about first making a plan, choosing what's right, building your capacity in terms of management and the development and touch all the points. From development to maintenance to upgrade updates, patching, integration, deployment, performance, security, disaster recovery, definitely technical support, technical project manager level of the authority of the country which is responsible or authorities if it is the case. Content migration if it is the case, sometimes it is, definitely from the older websites to the new one and the training and the boarding of editors. And what we're doing and we're very careful about and also to advise governments is that we need to establish from the start a proper method of onboarding and aligning all the institutions. It would be bad and very, very damaging for them. If one, let's say one minister out of 20 would say no we don't follow this thing because we have our own people, we know best and so on. It breaks all the logic of sustainability and that's the first thing that we're actually doing. So you see there are many, many layers aside from technology per se that needs to be taken care of. So right now one of the most important aspects into which we're dwelling initially in the initial discussions with governments is, okay, who is pro, who's against in your country for this type of model. If there are agencies, their directors and so on which are having a different opinion, they say, okay, how can we actually smoothen up the process so they can actually embrace this approach. So in the interest of time and because the most important point will be what we're doing next and what we're going to do next together. I will go, let's say faster through some websites related to the principles of the websites of the governments. But one point to make here is that it all relates to strategy and to the discourse that we have. So as I've mentioned earlier, it's about the value that we are creating and we realize that the value that we are creating is much more than the technological value. And the thing is that we are only showing what the value should be or could be and it's them, it's the local capacity, local institutions which are actually creating it. But to point out that several principles have to be followed. And when we are looking at these principles, the principles that you see on the screen are actually applying to any system, any digital asset of any government in the world or any institution in the world. And each of them should follow them and should prioritize them and then it should choose the methodology, the vendors, the technology in order to fulfill them. So this is the type of narrative that also to follow up on the OAN's presentation, we need to push in order to break from the vendors pressure of the technologies which have much higher budgets that we have. So it's not about only the technological effect, it's a lot to do about the social effect. And this is how we realize while not starting in 2018 with this idea but actually catching up with what we've managed to create, that a lot of what we are doing and a lot of what we will do is about the systemic societal effect. It's about empowering the local businesses, the local people to not only learn but to use that learning in order to access new jobs, to access new opportunities, to develop a capacity and an understanding to partner with each other and to support each other in order to be more competitive, both in their country and outside. And this is the first lesson that we realize that we've managed to do in Rwanda and this is the lesson that we want to push and we push in front of each government in that. And to all these very important points, the open source as a principle responds very well in terms of the digital sovereignty control over your systems, your data, independence from vendors, optimal life cycle costs, technical flexibility, stronger local business, new high value jobs for the local people. That is something that us as communities, as international communities around open source technologies should strive for to support. Although one point to make and the following remarks actually apply positively to all the members of our open websites alliance is that not all the open source projects are created alike. And we see a lot of discussion and backlash from the governments which are not understanding properly what kind of open source they should use and how they should prepare to use it. That they shown on or put a blame on the idea of open source itself as being a risky technology in principle. The right open source is an open source which is properly supported, maintained, developed with a continuous long term vision and approach and with a clear long term evolution path. That is super important and of course it has to have community or let's say availability of developers and companies that can support development of new projects in order to alleviate the risk for the vendor lock-in. What we have right now and what we are already using in different circumstances with various governments in the world is to put up their set of principles on to which they can easily already develop their strategy and their plan. And we are seeing by working with them right now how easy it became when everything is set up and everything makes sense, makes sense for them and makes sense on the long run. Standardization and planning is a definite need. Optimal technological architecture so that is not we are doing a website and then we'll see. That's not the way and we really and we have learned to be bold also when we are asked by a government to say let's do an experiment. No, it doesn't work. So experiments doesn't work, strategies do. You know you're creating ownership, you're creating the understanding, you're creating a commitment for a long term and then things might work. Right? So optimal technical architecture to hold complex installations, to hold multi-tenant deployments. That's a must. Strategies to support all the processes, not only the development itself, but everything that is related to it, to development and maintenance and so on. Don't keep the burden in the government. Share it with the private sector. Ultimately is the private sector is the startups is the freelancer is the developers that will be your partner as a government all over the place in all the world to build up to support to maintain the websites and in general. The application thinking the terms of life cycle things do change up websites as any other application needs to be upgraded. It needs to be changed needs to be re-implemented at some point for sure needs to be integrated needs to be extended and so on. You need to think at that all the time. And one important thing which we believe that already in Europe we have a bit of tradition in other places and for sure in Africa there is no such tradition is technical community. Technical local community that share an interest in the technology that shared understanding that supporting each other and working together makes sense for them and that are also having a mechanism of governance. Meaning that several or some people or at least somebody is able to bring the people together to foster this development of conversation and to support the discussion to move on. Out of this. Initial project and then the discussions that we had more and more with governments and with global. Global funding organizations. We found ourselves as matching with the logic and the strategy of a very important international project which is called golf. Govsteck or govsteck global. Probably some of you have heard about it. It was initiated by the International Telecommunion and German cooperation agency. And what it looks for is to develop specifications on essential areas of digital transformation of a country or of a government and to support that government with blueprint sandbox rules requirements and recommendation in order for them to be ready and to implement faster. Those type of. Of systems. So these building blocks are developed with the support of. People from various types of communities. Volunteers as well as personnel from it. And the least while it has started and you see already 12 building blocks which already reached a certain level of maturity. I would say most of them are quite mature already. And the least will expand. So you see for example we have a building block for identity which is the national system. Building blocks for. Gis for geographic information systems for all the country for signatures for workflows which is the support for the digital transformation or even automation of processes in between the institution or with the citizen and so on. And each of the building blocks has a certain structure. Of course with some aspects related to. Requirements other recommendations. Strategies for implementation. And so on. I won't go into details here but. The specifications of the building blocks which are already developed or let's say. Documented are present on the website of Gov stack. You can find them. It's they are organized in by using the book. They can be brought by anybody. What will happen very soon is an announcement of a new building block. The content management system building block which is aiming to become a specification. Including recommendations requirements. And support for the implementation of a national strategies for the websites of governments all over the world. Once this specification is actually developed. The big funders. I.T.U. and J.S.A. will support its implementation and will support actually the governments to pick it up and to use it. One important aspect here. Is that this specification is not strictly technology related. So while due to our activity our record and our relations with the donors. We will. Move do the first move. In order to support I.T.U. and the NG is that to set up this. Building block. Actually the team which will develop the specification hopefully. Some of them are here in this room and it's not about only the people from the type three community. It's about the people from Drupal from jumlah from WordPress and definitely from from other content management systems that. That are aligning with the principles that that we are discussing together in the. Open websites alliance. What we are looking to do together is to develop a set of. Specifications which actually touch those these points. So high level strategic requirements functional requirements technical requirements. Data structure, integrity and interoperability in terms of principles. So how things should work and we are not going to say okay Drupal is doing this type of trees doing that it is beyond the scope now of this specification. The specification needs to create the concept which needs to be followed or should be followed by any of the governments. So what we're looking for is to actually develop this team to set it up to work together with I.T.U. and the JZ to to formalize the building block working group. And then to create a set of. Of rules that would allow us to collaborate and of course to come with. A specification that we can then use in relation to all the governments very soon so thank you very much if there is time for question I don't know what Jim can tell me. Yes, couple questions. Anybody anybody please. Hi, thank you so much. I've been working in digital conservation for quite some time. And it tends to be driven by vendors vendors who buy the thing and all your problems themselves. And we've been saying for a very long time that the most important is to have a plan. So that's what you're doing. My question is what were the factors that enabled you to actually go in and say what we need to plan and this is going to make it work. What allowed to be the second environment to need for a strategic plan? All right, I will repeat the question for the audience online. How could we make a difference in putting strategy and planning first while the general approach is vendor driven projects? Well, the aspect was our principle. We are very, very adamant on respecting the logic of the flow. So we have the possibility to not agree with any government or with any donor because we are not linked to any contract. So we just say what we believe in and we did it. That's why we went four times in Rwanda because the first time it was not convincing enough. The second was not convincing enough. The fourth was, you know, but we said on the same thing and as one of our colleagues and supporters is saying, keep on saying the thing that makes sense until it will be understood. And that is what we're doing. So we have received recently the same point from another government. Let's do a website first and then we'll see. And no, we just don't do that until you're committed to do a proper plan. We're not we're not engaging. Very good question. Anyone else? Wonderful. Okay. Thank you, Daniel. Thank you so much. Thank you.
Shaping the Future: Investing Wisely in Long-Term Open Source Development with "Five for the Future"
is giving us a talk about the importance of eating vegetables in our day, the famous Saturday program from the UK. And this sort of health information is very important for us. And you can see that I have not been following it since the pandemic, but I hope to learn a lot today. Thank you very much. Thank you for me here. And today I'm going to talk about a project that we're using in the world-class community to develop the open source system and increase the usage of the users. And Jesus Amelio, I work as a software engineer at Automatic, but I am working full-time in the open source project, okay, inside this FIFER Future Initiative. So I want to ask a question as a start. Would you work for free? No. Yeah, we're in open source conference, so this is a very tricky question. We work for free a lot of time. And we do free work with our family, with our community, church, and so on. Okay. And of course, in the IT world, we do a lot of free software. I like a lot this XCAD, this CD comic, because it shows all the modern digital infrastructure maintained by a random person in Nebraska has been, thanks to the maintenance, is 20,000 free, and of course, for free. And this is a joke, this is a trick, but this is the real. Do you remember the love for you problem you have in 2021? It has a, okay, it's a critical book. It causes a lot of security problems. It's rated with a 10 in a scale between 0 and 10, so it's very dangerous problem. And as you can see in Twitter, the developers were granted for work that they are not paid for for a future they are dislike yet need to keep with the back and work compatibility concerns. And they are not making money. They was not making money by this. In fact, if you take a look to the gift who have the sponsors for these users, one of the maintenance has one sponsor and the other one has 50 sponsors. Of course, they are working as a service in a full time company. So finally, we have a report, a security analysis report saying that 88% of the open source software contain components with no activity in the last two years and containing components that were not in the last version. So the question is, is this sustainable? No. Of course, it isn't. So we need to look for how to make money for these open source companies. I'm going to explain how the difference in the companies are making money in the open source software. So one of the types of this financial is the donation. For example, when you, the developer of UGS is making a full time, is making a full time developer with the gift have a sponsors. In the security, in the security, you can get rewards if you find a book for a company, but a lot of times these security searches sell in the black market through the third day fines. The core funding, for example, to develop the WP CLI, the World Press CLI, the developer create a core funding at Kickstarter before start developing the tool. Another projects are internal projects because, for example, Rust from the Mozilla Foundation, go from Google or RAT from Facebook. These were internal projects that become open source projects and now are used by a lot of different people. Some companies are making out of money. For example, RAT are making out of money with the consulting service and they are developing a lot of open source to achieve this consulting. There are a few foundations, for example, the Apache Foundation that get out of money from contributors and they pay for the first two tours for some developers and so on. Or, for example, the PHP Foundation that get money from different companies, for example, from my company, Automatic and they hire developers to work on the next versions of PHP. Some companies like Automatic are making money with a SaaS model at Wolpes.com, Nextcloud has a similar approach. They have an open source project and you can install your own service but if you don't want to do this, you can use the SaaS product. Or, for example, a large-level forge that have a very good open source PHP framework and you can use for free but if you can manage your service in an easy way, you can use this product. Some companies have dual license so if you are going to develop an open source product, you can use this library for free but if you can include in a closed source software, you need to pay not open source license. Another approach is the open core. For example, it's the approach that MySQL develops. They create an open source product and they have a proprietary product they sell. It's the same approach with GitLab. Some companies are getting external funding. For example, MariaDB has made, get 270 million in different venture capital rounds before become public, offering a sample and gains that has get 84 million dollars in different venture capital rounds. And finally, of course, there are out of different approaches but finally, to be hired by a company. An example of this is my example inside the Fight for a Future project. Okay, so let's start about the Fight for a Future initiative. And first, I'm going to start explaining economic, economic theory. It's called the tragedy of the commons. It's a very simple economical theory that we can make a similarity with the open source projects. Imagine a field, a public field where four farmers has their own cows and everything is okay. They have their cows in this field and they cancel the milk and so on. Everything is okay. But when they, one farmer said, okay, I'm making good money with one cow. Let's buy another one because with this new cow, I can get more money. Okay. And the other farmers, of course, said, okay, the other farmer is making good money with the two cows. So I'm going to buy another cow. So one day we have a lot of cows in this public field and one day the field broke. The cows died and the farmers are broke. This is very similar that we have, that happens in the open source projects because we have a lot in a lot of open source projects. We have a lot of people making a lot of money with these products but they are not giving back anything to the project. Of course, you know a lot of different things like this in the open source projects. Okay. This Five for the Future Initiative started in the work in Europe 2014 in Sofia, Bulgaria. Do you know what's our work camp? It's something like a group of people in the work camp. We have three major work camps in the US, Europe, and Asia, a lot of in the small cities. Okay. And one day we have some questions answered by Matt Müllermer, the co-founder of the work project. And one of these questions, they talk about the return of the company. So they ask that you have a lot of companies that are making a lot of money with workers but they don't give back any money to the project. Okay. So two days later Matt wrote an interesting post in the blog and I'm going to read two paragraphs about this to introduce the Five for the Future project. Okay. The first one is, I think a good rule of thumb that we scale with the community as it continues to grow is that organizations that want to grow the workers by and not just in the piece of it should dedicate five percent of the people to working on something to do with coal. Be development, documentation, security support, forums, theme reviews, training, testing, translation, whatever it might be that has moved workers mission forward. And the second interesting paragraph is, it's a big commitment but I can think of a better long-term investment in the health of workers overall. I think it will look incredibly modest in high-side. The rate is probably the bare minimum for a sustainable ecosystem avoiding the tragedy of the commons. Okay. So I introduced before this problem related with this. I think the five percent rule is one of that a lot of soft projects and companies should follow at least if they want to be private at the cataphrone now. This was an idea as a goal and if you know something about the workers project, I think it was a very good initiative. Okay. So let's talk more about this. This five percent encouraged to the individual contributors and organizations to contribute five percent of their resources to the workers development. If you want to take more information, you can go to workers.org slash five but I'm going to explain a lot more about the project now. So what's a contribution? The open source worker project is split in 22 different projects. So it's very impossible that you don't can fit in one of these projects because if you are a developer, you can code in something, you can translate. If you are not an English native speaker, you can go the translation of your language. We have 2008 different languages translated. We have a variety of support forums. We have documentation. We have learning videos design community, accessibility people, marketing people, community that are going to have different meetups and work comes around the world. We have TV people that lost the videos from these events to workers.tv. We can have people contributed to hosting initiative to create documentation for this. And finally, we have the open business photos that tries to get all the open source photos in one place. So I'm sure that you can contribute at least to one of these initiatives. Okay, so once we know how we can contribute, how we can place the time, we can do this as individual contributors or sponsored by a company. In the first situation, it's very simple. You go to your workers profile and said, okay, I am a sponsor. No, so I am the video contributor. I'm going to spend five hours per week in this community core hosting team. So now this is public in your profile claiming that you are dedicating five of weekly hours to these different teams. And in your profile, in your public profile, you have an activity. For example, this is my public profile and said that I create these translations. I close a ticket in the MetaFract. I close a pull request and so on. So we have a history of your contributions. We don't have the work policy. We don't have people taking a look at what people are contributing. This is an honor code. Okay. And you can be sponsored by a company. This is my situation. For example, you can see here automatic is sponsoring 109 people full time or part time to the open source project. It's a week near 4,000 hours. A lot of hours. And we have 111 companies sponsoring people. Okay. From one hour and automatic is the company with most hours. Okay. This is from two years ago, but it's the current situation. We have automatic as the company with most contributors and we have a lot of companies like Joas, GoDaddy, HumaMaid and so on with contributions. What is the perfect situation? This will be the perfect situation with around 10 big companies contributing a lot to the open source project and small companies, small agencies contributing 5 hours, 10 hours. This is 5%. Which companies are in the project? We have companies. We have companies like Google or Joas. Joas is a CEO company and we have a lot of hosting companies because the hosting companies is making a lot of money with the WordPress installations. Okay. But we have, as I said, before we have 181 companies, mainly companies focused on the WordPress ecosystem. Okay. Okay. Well, should you contribute? Let's see it. You're going to develop a lot of skills because you're going to work with a lot of talented contributors around the world. You're going to build your own portfolio, mainly at WordPress.org and at GitHub or on other platforms. This is very interesting if you are looking for a job because the quitters can see your profile and if you want to get more customers, it's a good initiative. And you're going to know a lot of people because you are going to interact with them in different ways. And this is important because probably in one year or two years you want to change your job and probably you made a contact in this networking in a polygons community, in our forums and so on. This is a typical path for a contributor and this was my path. You start as a regular user for the CMS, then you become a casual contributor, then you become a pleasure contributor and finally you become a leader on a aspect. Of course, you can stop in one of these steps. For example, in my situation, I started, I started using in 2006, installing a WordPress for my own blog. I come from to to WordPress. Then in 2014, eight years later, I started translating in Spanish. Later, I started translating to another language. It's called Galician. It's in the northwest of Spain and I continue translating a lot. So I became a casual contributor to a press contributor because I do a lot of translations and I become a validator for this language. I get very involved in the world project and one guy from automatic ping me, send me. Look at this. These are very interesting work that I think it fits with your profile. And now I'm here. Okay, so this is another way of becoming a start as a regular user and now I am one of the experts in the Polyglots team. Okay, why should a company contribute because you can pay to people to contribute to the open source project? Because you can grow your talent pool into ways. You can identify and recruit new talent. Of course, if you are involved in the project, you are going to make networking with a lot of people and these people in the future could be your employees. And of course, you can upskill your organization talent. Some people from your company are involved in the open source company. They are going to improve the knowledge. They are going to keep a date with the project's direction. Of course, if you are involved in some projects of the open source project, you are going to get information about the way of the project. You can take decisions aligned with your needs or your client's needs. And finally, you can gain credibility. Of course, if you can set to one of your customers that you are one of the main contributors to the open source project, so you can get clients with this approach. Finally, I'm going to give a few tips for individual contributors and for companies. I'm going to start with the individual contributors. First of all, find your team. Maybe you are from Italy, so you can start translating to Italy. You are a developer. You can start solving some PRs. You have a lot of knowledge about the plugins. You can reply questions in the forums. There are a lot of things, and I'm sure you can fit in one or more of these teams. Then read the docs. Nobody reads the docs, but we have very good docs. We have a contributor handbook. We have a few video courses about contributing in the WordPress, and each team has a handbook. In this handbook, you have all the information you need to contribute from the beginner to the expert user. It's very important to connect with the committee because we are working usually from our houses or offices in our remote approach. So we have weekly meetings through Slack, and we have Slack channels to coordinate all the works between the teams. It's very, very important to get this communication, this connection with the old people from the community. It's important to create a contributor habit because if you don't set this hour, this half hour to contribute, you are not going to have time in your day to day to contribute. It's very important to start slow with a very long-term commitment. This is a marathon, not 100 meters long. You know that people when they start to go to the game, they start training two hours a day, they stop doing this in two weeks. Okay, and tips for the company contributions. Set the strategy for the future. This is very important because you should have a strategy aligned with your market. Identify your goals, much than with your employer skills because if you don't have employees with these skills, it's impossible to align this with your goals, of course. And it's very interesting to work on having path contributions. Allocate weekly time. This is very important because we always have a project that we need to release tomorrow, and if we don't allocate this time, we're not going to work on this future project. Any of the company's could be interesting to think about building a contribution team. I'm not talking about full-time workers, but you can have five people working two or one or two hours a week in different teams. It could be very interesting to get more visibility in the different teams, the different 22 teams. Maybe it's interesting to hire full-time contributors, but of course you need to be a big company to do this. And finally, this 5% is aspirational. It's okay to work 1%, 2%, but the focus of the FIFO Reputure Initiative is to contribute back to the project. This is the key point. And now I am going to read your questions. How do you manage the specification of what should be done because there is a huge list of things waiting to be fixed, and how do you manage the priorities and the specification? Okay. He asked me how we manage the priorities of the work we do. Of course, if we have a book in production of things like this, this is the priority of the work, but then we have weekly meetings to focus on the work for this week. And we have a three-year plan to work on the... I'm talking about my team, the Polygos team. We have a three-year plan to work on this. Okay. And of course, we share all this work with the community using a post. For example, when we want to work on new initiative, we talk about this internally. And then we write a post to share this with the community, get feedback, and develop it or don't develop it. For example, a few months ago, we realized that we have a very high queue of translations. And when I said a very high queue, we have some language with half a million of strings to review. Okay. So it's impossible to review by hand. So we talk internally that it will be interesting to use the AI to review these queues. We write a P2, it's a post to share this information with the community, and get feedback. And the community said, no, we don't want... we don't need this AI review. So we stop this development. Okay. Okay. Okay. Who can you manage that the specification are made at first in a way that could be useful to other people? And of course, people may have an additional liar to pick local question, dedicated to local question. So how to manage that generic aspect of usability by others? Okay. He asked me how we manage these specifications from our customers. I work in the Polyglos team. Our customers are the open source contributors that made the translations. Okay. This is very important to know. So we try to do all our work in public, in GitHub use issues, in GitHub docs, and so on, and in post. So we try to discuss all this in public and get the feedback from them with the 10,000 feet level of specifications. Okay. So we want this, we don't want this, and when we are working on the developing, we share with them some skin shows and things like this to get feedback from them. But it's important to know that our customers, our hints are the open source contributors. People that are using their free time trying to translate work to their own language. Okay. This is my current situation that I'm working on this particular team. Yeah. So it's really intriguing to see. So a lot of open source projects track after the fact how people contributed. But I think the specialty about the fight for the future is that people pledge and they set up an entirely different psychological motivation. Did you do any tracking of whether that, how that changed contribution metrics or not the trajectory or anything? He asked me if we have any metrics around the change between the, when we activate the pledge for the hours each contributor is working on. We don't have this data. In fact, it's an honor code contribution. So you can pledge for 40 hours each week. And if you don't do anything work, you don't have the police say, hey, you pledge 40 hours and you do any work. I don't mean that that per person tracking, but I mean that before the fight for the future program for the future program in general, and before and after how it changed that contribution dynamic. No, we don't have any data. Somebody started to work on this. So we've got somebody who's actually currently trying to prove that fact that they're working. So there will be more exploring. I can connect with a person I can definitely see that it probably proved. Yeah, it probably like it's two times or three times. You probably have an impressive number of the end because it probably helped. I'm sure we have increased the number or in that we don't have this. We don't have these numbers. Yeah. So in the spirit of the joined open source forces here for others Sorry. In the spirit of the joined open source open website alliance here for other CMS communities who are looking to improve their contributions from individual developers or from corporations. Yeah. Do you have since you've been doing this for 10 years with the fight for the future, do you have any things that you would do differently if you were to start to deliver it fresh? Wow. This is a very good question. She asked me if we start now from scratch if we change anything. I'm not sure because I'm not working this 542 future program, but I'm not involved in the people in the team that work to manage all this project. Mainly, I will start getting data from the first day. This I think this is a very key point because now for example we have data like the number of transitions you did, the PRs you add, you comment, you close, things like this, but we don't for example if you are a designer it's hard to track your work. So I think we can have a lot of work to improve this. Thank you. The concept of peer pressure is something that could be more incorporated or made a bit easier. Who's like free loader and have like not just one company that's pointed out but like kind of make it a yeah you're not very nice to the community. I'll step close to the microphone because I'm allowed, but I would contend that negative peer pressure, it feels really dangerous and in the spirit of being constructive, the Drupal communities tried several versions of rewarding contribution and contribution credits through pushing people's appearances, list positions in searches and so on which has also resulted in a lot of gaming contributions into how many poll requests can I break something into. So humans are humans, but I think given everything else going on in the world we should encourage positive encouragements rather than you know. What I mean is mostly like fake pledges for somebody wants to go to the top by pledging a lot and then that other people can assure that the pledging conforms to what they actually do. For sure, for sure that I mean and as a marketer I'll admit it you know numbers are way better than like we have to know what the effect is right because at some point we have to know if what we are spending money on is producing anything like the result that we want. The gamification approach is very tricky because currently we are creating a new feature to be able to create transition events and when we are trying to create the functionalities we get a lot of gamification that the end user can do this to be in the top of the list. So this is very, very tricky and we know that we have some users, very active users that are going to try to gamify the solution. Yeah. Right.
Roundtable Round-Off on FOSS CMS Collaboration
Thank you all for coming back and thank you for spending the day with us. The idea of the people who organized this and I'm going to give 99% of the credit, I think appropriately to Crystal and Matthias. Thank you very, very much. We're also key in initiating the Open Website Alliance as an organization which we have, I would say, successfully launched today. We need to bash some champagne against someone somewhere but other than that. So I also found the program really interesting and really well rounded and actually ran the gamut from technical including code to very aspirational. There was tons of practical pragmatic stuff along the way. So I just wanted to run through this idea of things that bring us together based on the talks that we heard and I'm going to have to sit down or something so I can see my screen while I do it. Or what if we change the slide up there? Please change. There we are. Right. So for those of you who weren't here at the crack of dawn, Matthias and I talked about the wake up call that we had about a year ago where Open Source is losing out to proprietary offerings in cases where we clearly shouldn't be. So the big wake up call was around the Australian government and personalization but we've seen and heard many other examples. So we think that we need to remember what Open Source is about at a fundamental level, how it can make us and anyone we know better business people and how in very practical ways using and applying Open Source principles and technologies can make the world a better place. So that is what we kicked off the day with. Gabber Hoitje gave a really interesting inspiring talk about the nuts and bolts around coordinating volunteers around long term technical goals and getting community to become contributors and keeping them there is something that Drupal's been working very, very hard at since the project existed and they're really admirable and Gabber's a big force of nature in that. So thank you for doing that Gabber and I think that that gives us ideas about how we can coordinate our own communities and probably how to make a meta community around what we want to do with the Alliance. I'm sorry, I'm sorry, it's been a long day. I think offered us a really powerful tool set to help grow Open Source in general and probably grow what we do together as well because the certification, competency, skill tree concept, once you show how to learn or teach any given CMS, how much of it is going to be applicable to the next one, 70%, 90%, there might only be the details there and I think that also since skill display has a really tight affiliation with Open Source anyway I think they're really interesting allies for the future so thank you for coming Fluria. So Martin and Lucas who I think are not here anymore gave a really cool accessibility talk and reminded me of a bunch of things. It's just really important that all of us bake accessibility in right from the start and I guess it's because of the nature of my career and my work now but yes building things accessibility is easier if you do it from the start and cheaper but it doesn't have to be a moral choice. There is a significant percentage of people who are online who have some disability of some kind and some or all of us are sometimes less able than others. So from a business perspective or any kind of value proposition where I want someone to do something else, I want to conversion, I want them to read something, I want them to buy something, the more accessible you make it the better business you can do if you're offering is a good one and the more accessible you make your web properties the better they can be found by the robots who will then show it to your consumers. So I like these stories that we in our communities cloak in a lot of morality and ethical choices which is right and proper and the thing to do. I like to note that there are a lot of ways to argue for those things that are from a non, it's not a moral conversation so it's a very pragmatic one. So I was really happy to see that one today. Owen gave a bit of an aspirational picture of what it looks like to be the biggest CMS in the room with grappling with some problems of like being at huge scale but not necessarily being taken seriously at the same time. Like everybody's heard of Aqia now. I've been to some events where people talk about Aqia and it was so surprised and then they know there's that Drupal thing behind it and that's that oh there's that thing behind it that I think is really challenging. But Drupal's got really, really conscientious leadership and I think that if we get the chance to work together on some of these visibility issues I think it will be really exciting. I especially somewhere in your talk I realized that the thing that Matthias talks about a lot is open source could, should be a brand and you should come in in your tech decision making process early and say I want to use open source yes no and then if it's yes open source which it should be then is it Wagtail, is it Drupal, is it Type 03, is it right? And then we can still make our own work and things do better together by collaborating at the community level on the code level and so on wherever it makes sense. So Daniel then incredibly beautifully picked up what I think was my point about making a difference in the real world with open source thinking in practice and the incredibly impressive to me work that's gone into like not just throwing some agency work to somebody and leaving the Rwandans to figure out what they'll do next year but like building a structured measurable completely practical approach to government digital infrastructure and then oh yeah you can put your own websites on it I love that, I love my favorite part was that no one from the Type 03 like none of the Europeans who came to help and teach wrote any code I thought that was super awesome right. So all of us can use that open standard to and that methodology to make sure there's a backup plan make sure you have processes make sure there's disaster recovery and of course we can check any CMS we want in there right that's going to be a huge resource for all of us so that was super awesome. Sage gave us insight into as for most of the people in the room some cool insight to like what coding looks like when it's not in PHP right and like the deep roots of building content management experiences right when so many of us have been living in the world where the interfaces have been built for a while or the paradigms have been fixed and the decisions have been made and we just kind of live with what our code ancestors gave us right so I really really appreciated the fresh eyes on to solving web publishing and yes and Jesus talking about how we solve the contributor gap right the long tail the how many people use it versus never contribute all of those issues and I think the fight for the future initiative is a super cool idea and I think it intersects with and highlights ways that other communities have tried to solve this you know contribution credits or not the Gabor's idea about future promises versus measuring past contributions I thought was super interesting so you know but we have to have healthy communities and contributions all of us to continue our projects so that's got to be that's got to be some valuable got to be some valuable insight for us I think maybe especially if there's some more data that could come out of that that would be very very interesting so thank you to all of today's speakers very very interesting day I think I had if I find my phone I had three concrete questions that we could pose to the four representatives one of whom non-official from our four projects disclaimer disclaimer so I think that we could try I have three questions we could try them and we can see where they go right so how about I will ask the first question and each of you has a minute to introduce yourself and like a couple minutes to talk about the question and and we'll just let you fight over the microphone yeah so perhaps to the branding side of things just like php fig and the psr's have helped us grow together and reinvent even fewer wheels how do we do that at a public facing level Robert so yeah I'm I'm Robert I'm representing in officially wordpress here and with the which our base is php so as as James said it's very the foundation of what we're doing so what we can learn I think is the that's so much code that we have is like as we had today the dependencies between the systems are very small and so we could share those things and have less and learn that one system learned and can have gift as to other systems so that we have shared shared code and shared responsibilities and also can contribute it to central infrastructure maybe that we can reuse between the systems super cool I thought on that note on that note I basically thought whatever the next GDPR is right probably the accessibility thing or something like let's make libraries and standards together and work almost together right yeah when I hear that question I also think about you know it's actually important that we who work with CMS's talk more together try to answer questions together so we know that one of the projects is working on a nice solution it doesn't even mean that it has to be the same programming language the concept behind it might be enough to share something that can really help us all and I think then the difficult question will become you know well what is not something that we want to share is there anything you know that goes totally against the idea of our CMS for example you know that's also very very interesting discussion that I also think we should actually take together. I agree my name is Crystal I'm the president of open source matters which is the organization responsible for the JUMLO project and this is Matias with typo 3. So I think that there are so many things to talk about with this question it's a complicated one but in respect to like how we're communicating publicly as a collaborative effort I think it kind of goes back to how this whole open website alliance started because it was born from the need to make a coordinated response to the upcoming cyber resilience act so that was kind of the first time that we had collaborated on a global executive level in the United States and we knew that if we were to try and make individual responses that we wouldn't have the kind of impact as we could have as a group and I think that that's going to kind of set the tone of our future collaborations not just on policy stuff because you know the CRI was the first thing that we're going to have to deal with in open source for legislative actions but it's not going to be the last but also to maybe coordinate on making resources for our communities for this kind of thing a couple days ago we were at the policy summit and I had the thought okay well maybe we could work together to have like some kind of compliance checklist because there are so many things in common across each of our CMS communities that it doesn't make sense to do the same thing individually put in the individual effort to get essentially the same kind of result. One of the key concepts in programming is don't repeat yourself right so we can apply that here too. And compliance is technostics so that makes perfect sense. Yeah and to what Matias said about also knowing what things we shouldn't be sharing I think it goes to knowing where we stand in comparison to other CMSs because there are things that we share as strengths and there are things that aren't necessarily good fits and I love to the chart that you shared in your presentation about where Drupal sits as far as richness versus I don't remember what the other exes said but. Oh was that the gotten though of course. No it was where the. Oh no that was that was Dries' slide. Yeah okay it's a good slide. Thank you Dries. And I liked that you said you know we know that Drupal is not for these smaller brochure sites. I also liked that Wagtail had something similar that you said okay we know that people are going to have to code to set this up this is not going to be like a one and done kind of thing and I think that's going to be critical for how we are collaborating so that we can make sure that we know where we sit and help people make the best choice per project as opposed to being dogmatic about our individual CMSs. That was longer than two minutes sorry. It's okay so choose open source then based on the strengths of a project go ahead. I'm Owen from the Drupal Association that's the body that oversees aspects of the Drupal community and if I was going to pick one thing which is really really hard. The thing that does keep me up at night is the renewal of new developer talent into our communities. It's great to see a few people here under 40. This side of the room maybe not so much. And so I think where I think we can actually unify in terms of visibility and effort is around promoting the benefits of considering a developer career pathway into any of our platforms because it's not too hard to switch between platforms once you know a bit of PHP. And I think what I have seen not so much in Europe and North America but in emerging markets people are starting to look at careers in the future. People are starting to look at careers in say Drupal as being a pathway to prosperity. And Dries does have a great story about being in China and asking young developers why they've chosen Microsoft over Drupal. And the answer was well I want to buy an apartment. He's like well what's the big deal about buying an apartment? And if they can buy an apartment then they can woo a partner to live in that apartment with them. And so it kind of speaks to these quite fundamental needs that we have this incredible ability to tap into I think. So many of us in this room have had 20 plus year careers based on open source of built amazing businesses around that. And I don't think that story gets told as well as it could is that there are these incredible careers that you can have. And just to kind of wrap it up the fulfillment of a career working in open source I personally think is significantly above what you would get if you're working in a closed proprietary system. And in our own company a key hiring metric that we use or a tractor that we use is the ability to be able to work in open source as part of your role and to be working in a little 20 person company in Australia but to be part of this network of mentors around the world who help you grow in your own personal development. Did that make sense? Yes. And since we're all in the... Since we're all in the same space right actually this issue of being able to find developers and whether you know young CMS is not as cool as it was radical and wild 20 years ago right but content management is somehow the backbone of how the world works now right. Digital everything solving web publishing device publishing digital publishing of any kind in the end you have to manage content and and these systems are a great way to do it. Also yours. I was incredibly gratified coming out of the pandemic when I started going to events again. I have been to events in several communities especially typo 3 because I live in Germany and love that community but also Drupal and whether it's been local regional typo 3 events even German language typo 3 events rather than the international ones or Drupal-Con-Pittsburg I think in all my time in open source I have never seen so many young people coming to open source events and I have never seen so much diversity before right. So I find that incredibly heartening for all of us that some young people are interested in a sensible, helpful career in the world right so I think that's pretty awesome. We could and I really liked what you said about what makes working in open source potentially a lot more satisfying right. You're not locked away in the corporate walls with secrets you can't talk about right. You can work with the best and learn from the best and be part of something more meaningful. All right. Is there a need to reframe the conversation in each of our communities away from this level of competition and towards more cooperation and if there is how do we make that happen? So as an example we had in Germany there is a collaboration between the communities from the open source world in Germany called CMS Garden which is like all the CMS here plus a few more you never heard in your life and they were all awesome and what we have with this collaboration is we simply know each other and we simply have like the contacts of the other CMS so this is kind of like already happening in the world but it's not like that on the global level not like really normal standard and that's why we simply now can like go forward and go like hey all of you in the whole world of CMS all over the world you know someone bring them to the next CMS event that is not theirs and like simply have those connections who are already there simply have them flourish and like move them forward and invite everybody to join in because like we I think it was Matt Malinweg who said like that's like it's kind of like a sibling rivalry what we're having we are not really a competition because we all stand on the same like open source paradigms and it's just like that we have like an understanding of like I like my bowls to be red you're like your bowls to be yellow so but we built the same cars and we built the same structures together but it's just like a disagreement of how to reach this point but at the same we at the end we still work together and as I had connections with all the open source systems that's the same frequency everybody has so everybody's agreeing on that we need to work together that collaboration brings more forward than simply downloading like this not open file to simply run something but it's really the goal to work together and to achieve better results together with all the input that someone can bring. Community. Yes. Yeah I mean the website Alliance is a community of communities and I think that's something that we can really use and I think looking at things we can do I mean or what we're already doing. Benny the type of three core team lead is sitting on in on core team meetings in Drupal and that doesn't mean that type of three is becoming Drupal or anything like that which I think is an easy thought to make that you know we'll just copy everything but it's something about participating in larger discourse knowing why are they making these choices. Okay what are the reasons what is informing them on their decisions and that doesn't mean that we will always come to the same agreement but it means that we all become wiser in our decisions. Community connections perspectives. Now I'm waiting for what my word will be so I know what to say. Now I feel like a lot of collaboration has already been happening on the community level it's just not been recognized or talked about as much as the CMS garden or like individual initiatives between specific communities like some I don't remember the exact details right now but there was a security issue a few years ago that what? The far wrapper? Possibly I'm not a security person and it just goes in one ear not the other but so I think that there have been community level or you know especially on an individual level or local level collaborations already like I think that Jumla recently lent some video equipment to another CMS community in Germany for an event. Thank you. Yeah you're welcome. So it's not that it hasn't been happening it's about acknowledging the efforts that have been happening and also maybe being a little bit more coordinated with how we can do this because like Matias said we don't necessarily want to become each other we don't want a giant homogenous CMS monster. We have one thing that someone said a few days ago in a conversation about this was shared strengths and we can also still celebrate our differences because in different projects in different contexts one CMS is going to be a better choice than another. Jumla people in the room don't kill me for this but Jumla is not going to always be the best solution for every single project. It happens it's fine it's normal and I think it's important to say okay well this isn't a good choice for Jumla but it could be a good choice for type 03 or Drupal and if we have more collaboration around those things and recognition of what are the things that make us different or what are respective strengths are then we can help guide people to different choices that maybe they wouldn't otherwise make. It's kind of a just because you can doesn't mean you should situation. So sharing your word is sharing. I like that word. And choosing a technology. Do you want the mic? Choosing a given open source technology isn't always only WordPress does X better than type 03 does better than Drupal does. There's times when hey what is the local service provider market where I live? Who has been around and will help me run my business, the digital part of my business better in the long term? Or who made prettier designs this time? It's not just whether you're the super best at dynamic page building and the super best at very large very performance or the easiest way to get to specific site functionality or like great pre-packaged functionalities that get you going very fast. Everybody's got many other strengths as well. Just take a moment to reset. So I think in my talk I alluded to we had a pitch bird contest and one of the projects that receive funding was to put Gutenberg content editor into Drupal. And the person that funded that project was Matt Malinvec. Whether it was his own money or whether it was WordPress money I'm not sure. But in any commercial setting that would just make people's brains explode. That seeming competitor would offer their technology into a potentially competing platform. And I think what it speaks to is that in the Drupal world we were tearing our hair out a little bit in terms of content editor experience for a number of years. And we realized that there was this big gap between what was happening in WordPress and what was happening in Drupal. And I think it's a huge credit to both projects that there was this pragmatism to say well why are we going to spend all that effort to try and reinvent Gutenberg when we could actually just use Gutenberg. For the people that want to use it because it's not the type of thing that every use case will require. But for the people that want to use Drupal but they might be on the fence about committing because they do want that Gutenberg editor experience then that becomes a no brainer. And it might go the other way as well. Views in WordPress. But I think there is quite a strong model there in terms of where it makes sense to why reinvent the wheel when the wheel is already round. So other people's pretty things right? If it's clear that Gutenberg just represents a significantly better editorial experience for most people, right? That's something that we can actually all adopt and follow as a best practice or a standard or whatever. So that's back to the sharing point. Okay, terrific. You choose which piece and then that's how sharing works at my house. Whoever cuts doesn't get to choose which piece. Alright, so I feel really, really excited about this as a first step. I think the energy is there and the wheel is there to make this happen. I think it's a really interesting time economically and just in general in the world to be doing this. I've heard stories today that I never heard before and reasons to hope as well. And I think that getting these stories out and talking about these wins and talking about how this collaboration and sharing and connection is happening and can happen. I think there's a lot of fun, easy things to do like encouraging people to go to other people's meetups and fix a bug in somebody else's project, all that classic stuff. I feel like even that small stuff can really make a difference and those can really add up over time. Does anybody in the room have any questions for us in the last 10, 15 minutes that we've got here? Overloaded to here. We need coffee or beer. Can you read it? Thank you very much. Oh, see, I've been operating without glasses all day. Wait, is this one from Mike Giverd? It's for him. Oh, Mike is not in the room, but... Yeah, so Brian T. Mann wrote, I don't know if Mike Griffith is in the room, but he gave a talk yesterday about the collaboration that took place between several open source projects on accessibility under the EU funded research project. We need more of this shared research discussing the approaches and then our own implementations. We need to have more cross project research projects. That is a true. So thank you, Brian. That's terrific. Mike was here for part of the day, but he's not now. And I also... He wasn't here when the accessibility talk started, and I promised the guys to introduce him on LinkedIn and he showed up during their talk and connected with him on LinkedIn while they were talking, which was hilarious. So, yes. Yes. Just as the letter that came out of what is now the alliance to the European Union saying, hey, the Cyber Resilience Act has some significant problems and we'd love to help you, like, understand that the idea of it's illegal to release unfinished software is tricky, right? And, oh, there has to be software that is purely European if it's going to be run in Europe, like stuff like that. So a beautiful international community of practice sprang up and because these four projects run 50% of the web and because they were able to present a code instead of arguments, Europe's actually listening, right? We've gotten some doors open. So I think that sort of influence, right, is a really positive sign for us and the fact that we could probably also then go and make proposals that say, hey, European Future Project, whichever year they're going to do next, we'd like this money to put together X. Accessibility guidelines, GovStack, European version, whatever the ultimate, you know, editing experiences. I think that's a really interesting way to go. Yeah, and I just want to also, you know, give a big call out to Wagtail and to Umbroko who are here today as well as fellow open source CMS. Neither of them use PHP and I'm really happy that they're here because this is not, you know, for CMS group. This is really a group for everyone. And one thing that we haven't said anything about, well, we did say that we encourage everyone to become members, but we didn't say that the Open Website Alliance is built around everyone agreeing and, you know, making decisions together and that kind of stuff. But it also says in the charter that anyone who, you know, doesn't have anything to do with a certain issue are encouraged to refrain from voting. And that means that as a website alliance, we can also have smaller groups that, you know, are still supported by the whole alliance. Because for example, maybe there's something with a programming language that isn't PHP and the PHP guys can say, well, you know, we don't know anything about it, but, you know, we're not going to vote against it. And that's also a really important part of collaborating, saying that, you know, wow, you're doing an amazing job. It's not what I'm doing, but, you know, we're here together to support each other. And yeah, I love it. I also want to say that the results that we've seen with the CRA, with the European Union listening and chaining significantly from the original drafts is not only a result of the work that we did as CMSs. I think that a massive, massive amount of work and effort was put in by Open Forum Europe and their policy task force and everything and all of the very, I think almost 200 open source organizations had submitted feedback to the European Union about the CRA when it was during the feedback period. And so this is certainly not a result from just what we did. And I just want to make sure that we acknowledge that because so many people, so many projects put hours and hours and hours of time into this. And it really isn't perfect. I think one of the things that I liked that was said during the policy summit that we attended on Friday was that it's imperfect as all things made by humans are. But it's definitely a lot better than it would have been. So just brief shout out to them as well for their efforts and in making that kind of impact because it could have had a drastic, it would have changed drastically the way open source works, I think. What was the question? There wasn't one. There wasn't one. Hi, Brian. We've got this now. What do we do with it? Right. I think that we as an alliance. Yeah, Owen. I think I remember the question. Again, in my talk, I referenced that the protection of the open web is a very keen interest well beyond the people in this room. To the point, and I'm not sure how many people are aware, but the statement on the future of the Internet was signed by 60 national countries like their foreign officers signed this collective statement that said we need to keep the web open and free and interoperable and all of the things that we stand for and that we can support. And it's quite interesting to then think, okay, well, what kind of investment are they prepared to make to ensure that that remains the case? And then it's not just governments, it's major corporations, major tech companies that rely on the open web for their business models. If you have a major search engine, you cannot have content locked behind paywalls. So it's in their vested interest to ensure that content is open and indexable and all of those types of things as well. So there is significant forces out there that want to succeed. We need to have the language and the stories to be able to have those conversations with them in a way that that value then flows back into our projects. Wait, wait, wait, wait. This grouping in terms of scale, right? If the Alliance can come up with a common voice and common answers, more or less representing the hundreds of thousands of people who make a living either by building and maintaining these things or running their own businesses, right? The millions of people that benefit from all of our work, I think that gets us to see the table, right? And if we're prepared then to use the language that works at the higher levels than we are normal humans are used to interacting, right? I think that gives us a potential to make more of a difference too. Matthias is looking keen. Yeah, and about the charter, there's another point. We really like the charter. Yeah, the charter also allows us to have affiliates, so non-CMSs, but that sort of share our values that can also join meetings as observers, for example. And that's also a really nice feature that I like. And I know the clock is ticking here and we're closing in on five o'clock, and I also wanted to take this opportunity because I like to hear my own voice. But also because, Jam, you've done a fantastic job today. I wanted to say thank you. You volunteered to do this. So this is real volunteer community contribution. And just because I know what you love most in this world, I brought two bags of Norwegian salted nut chocolate for you. That counts as vegetables, right? That counts as vegetables because it's all from, you know, it's... I think it's fruit, technically. Yeah, it's technically fruit. So that works. It's five a day. Anyway. Thank you. I'm super happy to be here. He says he's super happy to be here. Yeah. Nice. He's happy to get the candy. He told us. What about my salt liquid? Oh, no. The tumble is for next day. So do you want a final word? Well, yes. Who's coming in exactly on time? I live in Germany. We love that because we don't get it from the trains anymore. So anyway, thank you all for coming. Thank you all the presenters. Thanks everyone who's here. Thank you also for the people on the stream. Thank you for having the super cool idea of the Alliance. Thank you for not representing WordPress but being very involved in WordPress. Just disclaimer, disclaimer, right? Thank you for coming the furthest to be here. Oh, and flew in from Australia to be here today. Right. And it really was my pleasure and privilege to be here and interact with all of you. And I'm really looking forward to the next time and to seeing what the Alliance does. So thanks everyone.
UKIs, TPMs, immutable initrds and full disk encryption – What Distributions Should Keep in Mind when Hopping onto the System Integrity Train
I'm happy to introduce our first speaker in the morning, who you can already see is all set up here. Well, I'm going to go and hand it over to Leonard to open us up and kick off the distribution s Devrem for the day. Take it over from here. I have this. Does it actually work? It works, right? Hi. Good morning. And thank you for waking up so early for me. Much appreciated. It was hard for me. It was probably hard for you as well. Today I'm going to talk about TPMs and UKIs and immutable inner-ities. I'll give a second talk later today in the boot and in-it track. So the topics are kind of related. But there I want to talk more about the early boot stuff. And here I want to focus on stuff, what it actually all means for distributions. So UKI, TPMs, immutable inner-ities and full description. I think this is where we should be going on the Linux distribution world. But of course, I am not the Linux distribution world. So in this talk, I kind of want to explain what I think might be next steps for distributions that actually want to adopt all this. Yeah. To start out with, this is a fairly technical talk. I'm pretty sure some of you at least have some rough idea about what I'm going to talk about. But just to get you up to some level at least that you have a chance to follow, let's go through a couple of very basic vocabulary and what I'm talking about. The first thing, secure boots. Like many of you probably came into contact with that. It's the thing where during boots, all the various binaries that are part of the boot process are signed cryptographically. And the firmware from early on makes sure that only properly signed stuff is run. The signing keys for that are kept by Microsoft. So it's like a centralized authority kind of thing. At this point, because it's relatively like they'll sign a lot of stuff. There's probably more of a denialist of bad stuff than an allow list of good stuff. And yeah, there's certainly criticism to be had about the centralized nature of this. There's another thing called measured boot. Measured boot is not so like say accepted, well known in the Linux world yet. Measured boot is something where basically rather than in Secure Boot where you disallow components that are bad to even run. In measured boot you allow everything to run. But before you run it, you make a measurement which is basically taking a SHA sum or something like that like a hash of like a cryptographic hash or what you're going to start next. And you write that in a certain register in a TPM. And this is an irreversible way. So afterwards you can cryptographically verify that everything that was started so far is actually what you think what it is. The good thing about this is it's more democratic in a way because it doesn't restrict anyone from running anything. But you can later use these measurements to protect your secrets. And that's what we're going to talk about later. So there's no centralized authority because there isn't a restrictions made on what's booted. But it's up to you to say basically I only want that if this software is run during my boot process that I can release my disk encryption secrets. And then you know, it gives you I think a more like more specific, more focused kind of security than the Secure Boot stuff gives you. TPM of course I already mentioned the word is like this little chip. I mean it used to be a little chip that is pretty much in all the laptops. And in one form or another it's also in all the cell phones. I mean they call it Secure Enclave and stuff like this. But conceptually it's always the same thing that you have like this security isolated security environment where you can keep your keys and that maintain access policies on keys and things like that. It's pretty common, has been around like all the laptops that were sold in the last 15 years probably already had a TPM. On Linux, well I mean it is automatically used because the measurements made into the TPM but actually actively used by the distributions it's generally not. Like it doesn't mean that you can't use it but it's so far typically left for hackers to actually have an interest in TPM to enable it. Regular people do not run this which is completely different like it is on Windows and these other operating systems was always used by default basically like BitLocker. If you don't do anything it just locks it to the TPM. One specific part of the TPM is the PCR registers. I already referenced them earlier, didn't call them PCRs. Those are these registers where you can write these hash values too. They do one relatively simple cryptographic operation which is like they take the old value and the new value and hash it together basically. This basically means only if the exact same stuff gets measured into it during boot the final result of the register will be as you expect. You can reverse this, I mentioned this, once something is measured into it it's measured into it and the only way to get the thing back to zero is to reboot. All the registers started zero and then you typically have 24 of these and half of them are basically used by firmware. The other one is for the operating system. Once you have these PCR values you can bind security to this like locking of disk secrets and thus you can do things like that. You can say that my disk secrets shall only be released if the operating system is in good state. How that actually works let's go into that detail a little bit later. That term is UKI. By the way I'm talking a lot and I have lots of slides and I much prefer though if we do a discussion here rather than me just talking. If any one of you has questions please interrupt me. Let's talk about this right away and let's not move the talks, the questions to the end because yeah I'm pretty sure half of you will probably then have forgotten your questions by then. Anyway so feel invited to interrupt me. So UKI's, this is actually what the other talks are going to be about is it's a unified kernel image. It's not radically new approach but it's certainly different than how most of the distributions use to manage their kernel images. UKI's are basically you take a kernel image, you take an inner d, you take a couple of other things like Chrome command line, boot splash, device tree or something like this. You glue it all together, turn it into a UFI binary like a PE binary and you sign it as a whole and during boot it gets measured as a whole. UKI's are awesome because they make things very very predictable because yeah once you deploy UKI it's one file, you drop it in the ESP like in the ESI system partition which is where the firmware starts from. You can update it as one file which is awesome because it's extremely robust right like you do not have to risk that you have half updated your kernels or something like this because it's always either you have the new file or you have the old file that's fantastic. So it's great from a cryptographic, from a robustness point of view and it's also great for other reasons like for example it's, is it always the same? You can test them better and so you have a greater chance to know that they will, like if you deploy them in lots of different machines they will probably all work equally well or equally bad but hopefully equally well. Anyway so much about the vocabulary just so that we all know at least the basics of what's coming next. How do we deal with the vocabulary? I want to explain a little bit the goals of what I'm actually doing here. So the general goal is tied to security and provide code integrity on Linux right? This is mostly about making sure that traditional Linux, like traditional Linux means distribution based Linux right? Like I do not mean Android or Chrome OS by this, I mean distributions like Fedora, Debian and these kind of things that have a, I would say the sudden democratic approach to things that everybody can participate in. It's not this over the wall open source but actual open source. So I want to make sure that these traditional Linux catch up to the level of security that the other ones actually provide you with. That Windows provides you since long time that Mac OS provides you these days Android, Chrome OS, they all have these code integrity protections. The general goal of this of course like if you want to talk about threat models is usually evil made stuff that you leave your laptop in your hotel room and you want to be sure that when you come back it's still your laptop with your software in it and it's not back doored because right now it's very easy to back door. So focus is generic distributions Fedora, Debian and so on and the goal is to make things just work right? Like I want to move this stuff out of this area where it's a specialist thing that TPM loving hackers enable and I'd rather want this that this is stuff that just works and defaults to just being enabled in distributions rather than being something you actually have to opt in and do work to get to. That is of course like it's a big ask but I think it's necessary because we nowadays like everybody knows the value of IT security and it's really sad that Linux has very little in this area by default because it's laughably easy to back door a laptop right now even if it uses full description because inner IDs and things like that are not protected at all. I already mentioned the word democratic a couple of times. My own focus is much more on measured boot than on Sacky boot right? Like Sacky boot is established like all the big distributions assigned the kernels with Microsoft key and things like that. I actually work for Microsoft as you might know but still I don't want to assign my kernels with Microsoft key. So I think measured boot is actually a much more interesting technology because it allows you to define your local policies yourself. You can sign your kernels yourself, you can define the policies for your secrets yourself and you can just say yeah I don't want to allow my machine to run Chrome OS or Windows or whatever else. I just want it to run my choice of kernels and my choice of inner dean, my choice of Linux operating system. So yeah the goal is definitely to, I think it fits nicely into how Linux distributions are traditionally organized because yeah they are in a way democratic too. So to be more technical what are your specific goals? I want that measured boot is done by default and that means not only, I mean it is done by default because you have, I am like up to the kernel do this anyway by default and have been doing this for the last 10 years or something but I want that this actually continues into the rest of the boot process and actually during the runtime of the operating system later as well. I also want that secure boot covers the whole boot process. Right now we end this really weird situation where it only covers the basic kernel and not the inner AD and I find that kind of laughable. But again measured boot is my main focus. Secure boot is, yeah we should do the two if we can but it is two different protections. Also you get the best results but I find the protection that measured boot provided was much more interesting than the one secure boot provided was. All the measurements that we make during the boot process right like the hashes like all the stuff that gets hashed I want to be predictable. Predictable basically means is that even before you boot it you have got to know if you know the components involved what these PCR values are going to be. This actually matters because you bind the security of your full description keys to these PCR values and if you cannot predict them you cannot do that. Only if you know that if I run the Fedora kernel from this version on and that is going to result in these hashes being measured into the PCR values you can say only unlock my keys if the PCRs have this value. So yeah predictability means a lot. Why do I even mention this? For example in Grubber things like that measurements are not so predictable because they don't measure the actual code so much but the selected pass through the code which basically means there are lots of variables in how it actually ends up in the PCR values depending on like I don't know if you move up your menu it might end up in different measurements. One of the goals is specifically also just encryption easy by default and this particularly also means service. Right now I'm pretty sure most of the people in the room probably use this encryption on their laptop and my assumption is also that most of you probably use it interactively with the keyboard unlock. You boot up the machine you type in your password. That's great but also we can do so much better and it's not something you could ever do in service. On service there's nobody in the server room to even unlock this stuff usually. So what TPMs give you is this ability that you can non interactively do disk encryption because the TPM keeps the secret for you. You do the PCR dance and you basically tell it to unlock the secrets only when the operating system like if it's your version of rel or your version of marina or whatever else that is booted up and nothing else. And this also means like I think actually we should get into that mode where distributions by default enable disk encryption right and even if people didn't ask for this and without necessarily even asking during the install time even for a passphrase or something like this simply because by default they should be locked to the TPM and then if people want to enroll a manual key or a FIDO key or whatever else then that would be on top of the TPM and not what you start out with. I mean this is like yeah this is the goal eventually we're not there yet. We don't even have the infrastructure to make this but I think it's I mean this is basically what Chromebooks and all these things generally do and I think we should catch up and kind of try to make this something that also works in Linux that way. Yeah another goal of boot process is testable I'll already mention it because if you have everything strictly predictable and uniform and on all the installations you kind of have the same set of software maybe in slightly different versions because one already updated his machine to the version of today and the other one didn't but still it should be a small set of different versions and yeah. By the way again questions yeah. So when you were talking about the measured boot regarding being local do you mean here local in terms of the hardware vendor or local based on the distribution or local based on the owner of the machine because at the moment with secure boot you have a buy from lots of parties Microsoft for signing, frameworks, vendors and then distribution that follows the whole process. Ultimately I mean you right like on your laptop you should be in power but the thing is like of course that's a big ask like if I install my mother Linux laptop she's not going to be capable of like it. So ultimately that basically means that I mean we come to this later hopefully given the time but it's like my assumption is that by default you get kernels and the OS provided to you and signed by you and protected by you by the distro vendor but I certainly want to enable you to that you can say basically fuck this I'm going to enroll my own stuff and we want to make this easily easy so that it's robust and you can actually do this right. So that you can basically be more restrictive even you can say not just it's okay that Fedora gets access to my disk encryption but you can even say something like only Fedora in the version that I picked on the architecture that I picked and so on and so on like you can make it much more focused because you know you machine better and the way like that you know that you don't boot from ice guzzies and things like that Fedora doesn't right like so you can make it much more focused and saying yeah I think I'm not involved right so but I'll be the goal is definitely to democratize it right like to put the people in the control if they if they want to but also knowing that this is not what in there like what people could do. So basically you said you but instead of you is your TPM so you don't even have to know about it. Sorry? Your TPM so you don't even have to know about it so it's easy for everybody to use. Yeah right. That's okay. Okay let's talk a little bit about the status quo like how it's right now. So most of the release distribution currently provide minimal second I call it minimal second boot because it only really covers the boot loader and the kernel and it doesn't only cover the inner d which I find really embarrassing for the fact that in 2024 you can just go to the ESP or boot petition and modify the inner d that anywhere you want and we'll just boot from it and nobody takes notice. But inner d is just a file system right? Why is it the same order as the kernel for measuring it? I mean the kernel could authenticate it if it wants to. I mean it could it just doesn't that's the thing that I'm saying. There is no authentication of the inner d right now. Not in the generic distributions at least and that's I mean it's rooted in the fact that inner d's are in the traditional line it's always generated locally on the system but basically they ultimately are different on every single system. They import not only code but also configuration local configuration and that basically means you cannot sign it on vendor systems right? Like if you are a customer of I don't know Suze and they give you kernel inner d then they cannot sign the inner d for you because that inner d only exists on your system and a specific system. So yeah so I think it's really bad situation right? Because it basically means that any evil mate can go into my hotel room just take the hardest guys on their laptop they can just go to the inner d change any file they want in particularly the password prompt for my lux stuff and send it all to send central server if they want and I will not be able to notice this. That is a situation that I think is really stupid all the way. So yeah I've already mentioned this the inner d's are locally built they are not protected by Secure Boot then there are very little measurements actually being done. The kernel now does a couple of them on its own like there's the inner id basically but in general the ones that are made by Grappa I already mentioned this are not predictable and it stops the moment the kernel actually does anything right? Like because the measurement that the kernel does and it still does in unify mode and then user space doesn't do anything anymore traditionally. So I think that's bad right? Because actually what I think makes a ton of sense for root disk encryption is that the key for the root disk encryption is only released by the TPM to the system in the inner d phase but never later right? That is a really nice property that you basically drop as you boot any chances to recover the boot the disk encryption key. I mean the kernel will always have it somewhere in memory because it actually needs to do the encryption but via the PCR mechanism you can make it relatively easily and we have now the infrastructure in place to do this that later on you can talk as much to the TPM as you want you will not be able to recover the disk encryption key from itself anymore because we basically blew a fuse through this but anyway this requires that we make measurements during the boot process and during the run time so that policy like this are actually very expressible. Yeah I remember this like in the status quo on the TPM based stuff again like there is a stat like two stacks even of TPM and Linux but except for hacker circles nobody uses them I would say you can't script it together there's many how-to's on the internet but nobody does 15 people in the world do it. So yeah I already mentioned this as well like the Lux password prompt is implemented in the Internet. Internet is not protected either way that's trivial backdoor it's a terrible thing. I would call this in summary pretty weak security and you could use words like laughable or something and compare to other operating systems. So what's the vision? Primary we want that kernels are shipped as UKIs by distributions so that they are everything is secured protected including the Internet and they are measured as ones and they are fully predictable. This means that the kernels and interities need to be pre-built right not on a local system I mean for the kernels they traditionally weren't except if you run Gantu but yeah the move would be to pre-build the interities centrally. If you do all this then you get stable hashes in the PCRs you can buy the disk encryption to it you get the universal predictability because the software doesn't deviate between systems it's always the same software. You have robust updates I mentioned already because the kernels can be updated in one file and yeah you test the combinations very well. Secondary is like a secondary goal that I have is what I just described is again central authority to some way because it's the distributions that do this. I think it's also important to keep people who actually want to sign their own stuff in pictures as well I mentioned that was your question basically earlier that yeah if you want to run your own like if you want to generate a key pair and sign your stuff yeah we should help you with this. So in this model you probably will still use a pre-built kernel by your distribution you might however combine it with a local Internet ID and then sign it with your key instead of the distribution key. Benefit of course is maximum flexibility but also you need to know your shit. The advantage of this of course is that the PCRs remain predictable but they only remain predictable within your local scope because only you know what you're actually going to build into the Internet IDs and how you're going to combine it. Yeah it's a it's a it's a it's a large installation footprint because you suddenly need to actually build to its installed to do this. I mean it might not be worse than the current situation with Drake and things like that but in some ways it is because you now need signing tools and things like this. But certainly both of these models are certainly in focus of what we should do I think. So the ultimate vision is there that yeah distributions in their install are to figure out is there a local TPM. I mean not all systems have TPMs in particular like ARM based they have other stuff but not TPMs and then the M's sometimes have them sometimes do not so we always have to work with the fact that TPMs might be there might not be there but the goal is certainly that if one is there we should lock to that by default. Locking to that by default doesn't mean non-interactive stuff exclusively it means yeah we can do non-interactive stuff but also mean you can combine it still with a pin. A pin is the exact same thing as a passphrase except that TPM people call it a pin. It doesn't imply a number or anything yeah. So the goal is to always encrypt the data when it's at rest and yeah we validate the boot process when we unlock things though so that we make sure that the right software at the right time and other conditions. Yeah and the goal is to install things by default that way. And then yeah I want that measurements are further done during for all facets of the system like not just for the boot code also for the OS itself for the applications for the configuration itself right like these for example measurements that are inherently local always because configurations always kind of local thing even if my mother would use her machine she probably would configure different backgrounds than somebody else. Backgrounds color is a shitty example because you probably don't need to measure that but still you get the idea. System identity by that I mean things like hostname machine stuff should probably also be measured so that you can use it in policies and can say yeah I want to have the secret that only is released on that machine and none on others. I want to see that these basic building blocks like the PCRs but also the policies generated of it out of it are automatically managed by the West because this is not entirely trivial right like because every time you update the West any component of a boot loader UKI or things like that you have to regenerate like you have to re predict what the PCRs are going to be in the next boot and then do something about that because you still want that when the system boots up next the disk encryption shall be released but not on other conditions so there's some extra work where you when you update something you need to predict the PCRs and do something with it. We'll talk about this hopefully later but let's see how much time we have. The result of this of course comprehensive code integrity the inner dirty gap is closed we are ready for remote attestation that's also kind of goal that remote attestation works I think I mean it's good for some cases if you actually run more than one system I'm pretty sure it's not so interesting for regular people themselves but we should at least be ready for this and that the stuff is that we have the building blocks ready so that people can use the TPM in any ways they want and we give them already building blocks for defining their policy on their own encrypted objects based on the state of the operating system because right now they're kind of lost in this and the result is that it's somewhat democratic because people can just do this on their own laptop and do not necessarily like get a high level of security of code integrity without necessarily getting the key sign by Microsoft. So to make all this any questions at this point so. Does division cover KXAC? That's a very specific and good question so KXAC is a big problem and like in the project that I work at Microsoft it's also a big problem. I have ideas how to deal with that but frankly we have so many problems we have to fix before we can fix that one too that I don't think that's gonna be fixed anytime soon right like but I have a pretty good idea what we probably should do with KXAC because KXAC for those who don't know KXAC is a thing where you basically boot one operating system and then while the operating system is running you decide you want to run another operating system usually new version of the operating system so you execute the new kernel. Now suddenly you didn't reboot so the DPM didn't get reset so all the PCR values will still have all the measurements from the first operating system and then the second operating system starts and the PCR measurements the PCR values just get added to that on top and then all your policies fall flat because they were predicted assuming that you started zero. So this creates a problem but I think we can deal with it like having a handover of secrets that are predicted that moment where you're about to start this up but let's not talk about that since it's highly specific for like we have way too much material before we start talking about KXAC. Any other questions at this point? Next question. So from my understanding if you had your computer that you've predicted all the values on if I was to take that drive and put that drive into another machine that said enterprise and I bought 100 of these laptops is there some kind of unique seed per machine or would it go oh this is functionally the same machine it has the same device tree it has the same hardware I'm going to unlock. The TPM generally contains like an encryption key that's specific to the TPM so no you cannot unlock the encryption key that you prepare for machine A on machine B I mean unless you have the same keys like the seed keys in the TPM but then yeah everything's out of the control and you don't have a TPM you have bullshit on your hands. Okay let's continue. So to make this all reality I'm the system guy so yeah that's what I'm talking about is all system stuff. We added different components these different components have shown up in the various distributions interestingly I find that different distributions adopted different parts of all this big tool set first so I think at this point there are very few distributions that adopted them all but there's at least one distribution that adopted each one of them individually. So I want to system reboot it's like we call it a boot lower it's actually not a boot lower it's a boot menu it's just a UFI program that allows you to select a different like a set of kernels and then chain loads those kernels it doesn't do anything fancy it doesn't have any understanding of how to load a kernel into memory and prepare it it doesn't do cryptography or anything like this it's just a dumb menu that xx other stuff. But it has nice properties because it takes inspiration from how Linux does drop-in directories like with RPM and DPKG there's this established pattern that you can extend other RPMs and DPKGs via drop-in files and directories so we took this idea and said okay new boot menu items are simply files that you drop in directories and as you install a new kernel you just drop in a file in a directory and that makes one new menu item show up. This is inherently different like how Grub works because in Grub you always have these boot scripts that need to be generated based on whatever you find and things like this this is much much simpler because it's just there's one file per kernel you find it and that's a boot menu item there you go. So that's one thing there's also system you stop. System you stop is a UFI bootstuff it's basically a little UFI program that you glue in front of a Linux kernel it runs in UFI mode does a couple of preparatory steps and then jumps into the actual kernel proper. These preparatory steps we'll discuss a little bit later but it's measurements and finding certain sidecars if you want them. So usually like in my perfect model where you use all these components the boot process is basically that the firmware invokes system to boot and then in system to boot you pick one kernel or automatically the newest kernel is picked and that then gives control to the stub inside of the kernel image and then that thing does a couple of things and gives control to the kernel inside of it which already has the init-rd loaded and then you jump to the init-rd so much about the boot path. Uki-fi or I don't know how we pronounce that you haven't really agreed on the pronunciation yet but it's a tool basically that allows you to build UKIs it takes a couple of different components glues them together can sign them from SecureBoot can do PCR predictions and spits out one EFI binary which you then can drop in your ESP. There's a tool called system demasher probably by this time you don't have to interact with it anymore because Uki-fi does it for you all it does it does that PCR prediction step for all the stuff that is contained in a UKI basically so you can run it and basically it tells you if you boot that UKI PCR 11 is going to be this value and then you can use that for policy but usually you don't have to interact with that anymore because Uki-fi is probably the tool you should be using and that calls it in the background so you don't have to bother. There's a thing called kernel install in system retweet. It used to be shell script but nowadays it's actually a proper program. Its job is to I mean Fedora has been using for a while other distributions are catching up I guess but the idea is basically that if the package manager drops its file in slash user and slash user is package manager then kernel install will take these and copy the kernel itself into the ESP to make the system bootable. So that you basically monopolize the OS vendor resources managed by the package manager and slash user and if you copy it into anything else like the ESP which is a shared location like it's not owned by the OS vendor it's owned by the system if you will and OS's just get the privilege to all drop something in there and the kernel is told to do this. The reason why you need something that's better than the CP is usually that you want to do a couple of things when you do this like create, I don't know, run MopPro, do a couple of other things. We have support even to generate the UKI at that step right so that you install on the system like a traditional kernel but locally it gets converted to UKI as you go and sign and things like that so that you basically can keep the old workflow in place how distributions generated in RIDs even and things like that but you end up in the new world with the UKI that is signed by your local key automatically without you even thinking about this. Other components, there is MKOS in RID, I'll ask a question. On the previous slide you mentioned, on the previous slide you mentioned system debuts, STUB measures the UKI. What measures system destub because you have... The firmware. So the stuff that I'm talking about system debuts, system destub, they are ultimately UFEI binaries and the firmware measures everything right so there's a full chain, the firmware does that part and then we like actually you know because system destub is just glued in front of the kernel to make the UKI which is a PE binary, actually the stuff that is in the UKI is already measured anyway by the firmware. The reason why we also measure the stuff ourselves a second time which sounds redundant is simply that PCRs, we have multiple of these and we want some separation of the stuff that... So there's one PCR basically, nine I think, where all the firmware stuff gets measured into AND the stuff right so there's going to be stuff that is specific to the local machine as well as the stuff that we as the OS vendor or whatever you want to call distributions and then we measure the control as measured into the same PCR and that basically makes the whole thing unpredictable. So we measure the second to the time just the stuff from the OS vendor into another PCR so that's what we combine the policy to. So that's why you have the double thing, it's two different PCRs. So MKSI, there's going to be another talk about this as basically a tool how you can build predictable reproducible in a D from generic Linux distributions and then make them ready for use in UKIs. The system you could set up is basically just a wrapper around like lip crypts up and it does a couple of these integrations, TPMs, FIDO and these kind of things and policy management and things like this. System decrypt and roll is the other side that allows you to enroll the TPM and roll the FIDO thing locally. System decrat is something, if we have the time we'll talk about it a little bit more later, it's basically, you know, if you have this vendor build UKI you might still want to be able to parameterize it, right? There's a reason why Inodore D generators the traditional way mix code from the West plus configuration into one CPIO Inodore D image. It's because people want to parameterize things. So parameterization is problematic, right? Like because it means things are not predictable anymore. Also you need to authenticate it again, right? Like that's what we want to come to. So the concept they came up with to fix this thing is called system decredentials. System decredentials are like ultimately they are way how you can pass secrets into system dec services. They have nothing, originally had nothing to do with the boot process. It's supposed to be like, you know, all these, the cloud people they love passing secrets in environment variables. I think that's a terrible idea because that gets inherited down the process tree. So this is supposed to be something better in that regard. So one of the nice things that system credentials actually has is that they can be encrypted, right? Like you can encrypt them and bind them to a TPM and local policy and things like that. This is extremely useful because it basically means that these credentials you can put them on untrusted territory, meaning the UFI ESP which has no authentication itself. It's an unprotected VFAT file system where basically the rule is the stuff that you read from the ESP you need to authenticate before you use it. So you can just drop these credentials in there and be reasonably safe that the contents of them cannot be read. What's the use case for things like that? Like for example, if you have a UKI with an inner ID and you actually want to make it open so that you can log into the inner ID with root password to debug things, you can stick that in an assistant decredential put in the ESP next to the UKI, how that actually works, we'll hopefully still find the time later to look into this. And be sure that this thing, because it's bound to the local TPM, is not accessible, like the root password is not accessible to anything, but that specific system and things like that. So system decred is kind of an approach for local parameterization. It's an option though. It's not like, I would assume that in most of the consumer kind of things you would never use this, but it needs to be there because some people want something like this. There's no restrictions on what you actually encode with this. It could also be, I don't know, ISCSI server data or HTTPS, like X5 or 9 certificates or something like this. Another thing is system is sys-axed, right? Like if you have predictable inner IDs, this of course means that they will come by default with a very clearly defined set of criminal model you was built in. This is restrictive, right? Because people nowadays have NVIDIA drivers, which are like hundreds of megabytes. If you want to make the system work well with all the current consumer hardware, you will have a massive inner ID. That might be something people want to avoid. So on one hand, we kind of push everybody to say, put everything in one file and the world will be a better place, but on the other hand we also know that this is probably not necessarily doable for all environments because these files will get massively used. They will work perfectly if you know your system. For example, if you just focus on Azure cloud stuff, right? Then you know exactly the drives you need. You can build a tiny UKI. It's all good. It's going to be entirely generic for Azure, but you could probably even cover multiple clouds in one UKI. It's still going to be small, all great. But once you get into the wide world where all kind of shit exists, it might be too limiting. So we thought about this. Once we came up with the system, we also had a different use case. Originally it was mostly focused on system, but we can use it great for modularizing inner ID to some point. So the idea basically is system is system extension. It's basically a disk image, a GPT disk image that contains a traditional Linux file system, usually something like squashFS, EROFS plus a signature for the variety partition. Variety for those who don't know, it's a, like DM variety is like a kernel concept for adding integrity protection to immutable file systems. So basically that, like it was like the first user of this was Chromebooks back in the day. I mean it's old now, but it basically says that on every sector access of the file system, you do make sure that it's actually authentic. It's a fantastic technology and we can use it to have these disk images that are, when you enable them, overlaid on top of slash user. So suddenly you get a certain level of modularity where the basic identity has slash user populated with lots of stuff, but you can add a couple of other things into it by adding a couple of systems to the system, which are just overlaid. Overlaying is basically overlayFS. It's really nice because it's atomic, it's cheap to do, and ultimately nothing new about it, it's just regular GPT disk images. So this is me, contract is actually the same idea, but it's about overlaying things on top of Etsy instead of slash user. Also with all the integrity, cryptography, things like this, but it's really nice because in contrast to the credentials that focus on individual bits of secrets, the system in contracts focuses on combination of stuff. Like you can drop 55 configuration files into one of these contracts, and these contracts either are applied or they're not applied. They're never half applied. You cannot use them out of context, hence, because either you get all of these files dropped into Etsy or appear in Etsy, or none of them. Honestly, contracts in my point of view is actually like the perfect configuration management tool, and everybody should just use that and stop using all the weird, Ansible Chef things because they do not have these nice security or atomicity properties, and the security and atomicity properties are just awesome. Another component, system in PCR lock. So I talked a lot about the predictability of the PCRs, but so the way how you actually lock disk secrets to PCR is basically you say, this PCR has to have that value, that PCR has to have that value, and that's all the case. You tell the TPM, you will release your encrypted secret to the West so that full disk encryption can work. But now you need some infrastructure to do the prediction for this. System in PCR lock is that infrastructure that we added to do this prediction. Basically, it manages a set of components that you assume are part of the boot. Then it does some magic, figures out if that's actually true, if that actually matches the reality so far, and then calculates a TPM policy, it's how it's called, from that, which says, OK, we will now create a policy that basically says if you use that policy to lock down secrets, then you have to have this firmware component in this version. You have to have this bootloader version in this version, and this UKI in this version, and a couple of other components that might be part of the boot, and it allows alternatives. Because usually if you update a kernel, you do not just want to say the new kernel is now the only one you can boot, you want to still allow the old kernel, the preceding kernel, to boot. You want this concept of alternative options for every step. So firmware updates the same thing. If you prepare a firmware update, and it fails, you have to boot up with the old firmware in place. If your policy says no way, then you have a problem. So you always need this kind of alternative system. So that's what System.UPCI-Rlock does. This policy is like, all the operating systems have a prediction, like all the other ones, like Windows, Chromebook, they all have prediction engines like this. We have the luxury that we come 15 years later than anybody else with this. So we can actually rely on newer types of TPM functionality, because we can start from zero now, instead of having to be compatible with the original TPM2 stuff that is really old by now. So we actually can do nicer things. We can actually store these policies in the TPM itself. The traditional way how BitLocker and Windows does it, for example, is that they store these policies in the BitLocker superblock on disk. Storing this stuff in the TPM is much nicer, because it basically means that you can have 500 different disks, and when you do your PCR predictions, you do not have to touch them. You do not have to go through every single disk and rewrite the superblock, but it's entirely sufficient to store some slightly different value in the TPM. That's a fundamental benefit, like improvement over what Windows can do, because we have the luxury that we are so late to the party. Any questions at this point? I only got like 10 minutes left, so if you have questions, this is the time to start asking. If you do not have questions, I'll continue with parameterization, modelarization. Which we actually kind of covered already. No one has questions. So, yeah, I mentioned already pre-building UKIs and alreadys are problematic, because they are identical and that makes them large. And you kind of parameterize them anymore. So, there's optional parameterization of the UKIs that breaks up the fact that they are one big thing. Our way of mentioning system-deserved credentials, like system-de-creds, encrypted, and individual bits of information, and then there are system-de-confects, which is the overlay thing on Etsy, which is like combinations of configuration. And the third one, which I have not talked about yet, is, but there was a talk yesterday in the VM mini-conf about this. It's kernel command line add-ons, right? Because one of the fundamental ways how you configure your learning system is by making additions to the kernel command line. Now, in all the stuff that I was talking about, the idea is, yeah, you don't get to do that, right? Because it's the most powerful thing in the world, because you can do in it as an agent, do whatever you want, right? So, we lock that down, right? Like, if you're in secure boot mode and you use that kind of stuff, yeah, you don't get to edit that, because security policy doesn't allow that. That, of course, doesn't necessarily fly with everybody. People hate that, right? Like, people want to be able to do this, but they want to have controls on this. One of the things that we came up with, like this guy over there came up with, is kernel command line add-ons. Add-ons is what we call basically, you build a UKI, but actually leave the kernel out and the NRAD out and everything else out. You just put the kernel command line in there. So, you have the UFIPE binary that looks exactly like a UFIPE binary, but you can't actually boot it because it doesn't actually contain any code. But what it contains is a kernel command line. Why would you do such a thing? It's because you can authenticate them and measure them like any other kind of binary that UFI deals with. Or actually, not you can do this, but the firmware will do it for you, because you can just tell the firmware, oh, I'm going to work with this binary now, please load and authenticate it, and then it will do this for you. Do measuring, dance, all in the background, you don't have to care. Because after all, SD boot and things like that are just a stupid boot menu with no understanding of loading and authenticating anything. And that's how it should be, right? Like, we want our boot pass to be stupid and not replicate, like, with Schimel and these kind of things, all the authentication over and over again. So, add-ons are basically a way how you can, yeah, sign a little kernel command line and then you can extend the one that is built in the UKI and then modulate away so that you can have one UKI and a couple of these add-ons that are, extend this thing and with proper authentication. Modularization, I mentioned this already with System DesistEx, yeah, because of NVIDIA drivers in particular, because they're massive firmwares, we have to do something. So, I mentioned these things, add-ons, system extensions, credentials, and config extensions like ConfaxSys, stuff. We call them sidecars, right? Like, because you have the unified kernel, but then it's not so unified, you have these things as well in there. How to manage those? So, the general idea is to extend this drop-in concept, right? Like, so that you have the UKI and you put next to it a directory where you put all these add-ons. So, how does it actually look? In the ESP, you put the UKI in the directory EFI Linux, and next to it, you have a sub-directly called exactly like the UKI with a suffix Xter.d, and there you put cred files for the credentials or confax.raw, that's the suffix we picked for confax.ddis, or sysex.raw and addon.efi, the EFI, those are the P.E. add-ons. So, it's simply relatively simple, right? You lose some of the extreme sexiness of the approach, which is updates and things like that are not single file anymore, but that's on you, I guess, if you actually make use of this functionality. So, this is all optional. Like, the focus, I think if you know your hardware, if you know the environment you want to run your stuff in, don't bother, right? So, I think that's a good side, like, just focus on the UKI's one kernel and everything simple and robust and idiot-proof and things like that. We've got like five minutes left, let's focus more on questions. Alright, so, taking a scenario like you just said, where you don't know what the hardware is, how's your vision of, okay, how do all these sidecars get selected, indicated, put in there? That's a very good question. That's not something I imagine you want like an RPM distro to be dumping stuff in there, but what should we do? That's a really good question, and there's actually two new list items about this in SystemDTree. I'm not sure how many people have seen that, but so, you know, in UDEV, in SystemDT in UDEV, we have already this concept, how we can automatically determine which kernel drivers to load on which machine, right? Like, it's called mod alias. It's basically like for PCI devices and USB devices, the vendor product that you turned into a string and then having a mapping database that maps that to the actual K mod to load. And nowadays, there's all kinds of mod alias for SM bias and things like that. So, we have this already, right, and there is infrastructure to have a database that we use it as input and you get kernel module information as output. And there's another database, the HW database, where you use these strings as input and you get UDEV properties as output. So, to me, that's what you should just use, right? So, a distribution that figures out how to split things up, like they would have one 6-6-6-x for NVIDIA drivers, one for AMD drivers and things like that. And then you would just maintain this in HWDT, basically, where you match against vendor product, add in and then specify the thing. And then we should have some tool that helps you figure that out and then probably should turn it into RPM command lines if you are RPM based distribution or in something equivalent, right? Like that's distribution material, yeah. But I think just using the mod alias stuff, perfect solution for this. It solves exactly that problem except that now it's not just one K mod, it's just a 6-6 that you pick up. Okay, so just for me to understand something, so system debut would be able to parse the add-ons and show you a similar menu to do. So, how can you start to choose something from the add-ons because I assume that the add-on will have multiple commands boot options, for example. Okay, so the whole command line stuff is still working, probably we'll have more stuff later that hooks that up with the menu. But right now the way it works basically is you drop in one kernel and you put the add-ons next to it and it's not as debut that has any understanding of this. As debut only finds the main UKI and turns it into an entry and then boot it eventually. But it's SD stub, right? Like this early code that is glued in front of the UKI that then sees, okay, I got invoked. Let's see, in which directory did I get invoked? Let's see if that has the sub-directory stuff and then loads everything that's in between that, right? So right now it's, you pick the UKI and that pins basically all the stuff next to it. So what you're asking for basically is that it shows up in the boot menu. We have been discussing this for a while and everybody agrees we should do this, we just haven't done it yet and well, we don't actually know how it precisely will look like. But the idea basically is that sooner or later we want to be able to not embed a single kernel command line into your UKI but a choice of them, right? So one is going to be the default one if nobody picks anything and this would then basically mean that, yeah, if SD boot finds one of these UKIs in the directory, it generates one menu item per kernel command line so that you basically have one UKI where you can select the factory set choice or the debug choice or the regular choice. And it, yeah. So everybody agrees that's the way to go, nobody does it. It's really high on my to-do list. Any last question? How much? Do we still have a minute or something? Are these like system extensions, credential extensions only? Sorry, I don't understand where. Sorry, are these extensions only useful in the case when you haven't enrolled your own machine owner key? Like, or is there still an advantage like so obviously these are going to be signed upstream. But if you've enrolled your own machine owner key, is the better approach or do you approach now just to build your own UKIs locally and take advantage of the secure boot there. In fact, did you have an authenticated NITRD? So I'm not sure I understood the full question but I answered it. It's about the machine owner key like the shim thingy. So all these components that I was just described have individual ways how they authenticate it, right? Like the add-ons because they're PE, UFI binary, okay my time's over. But basically let's, I'd like to finish the question. That's okay. So they are authenticated by a secure boot means, right? And that also means shim, right? Like so that's where the mock comes into control. The other ones are preferably authenticated by the kernel key ring stuff, right? Like we asked the kernel key ring to authenticate them. Now kernel key ring, populating that is a mess, right? Like because you can do it via the mock stuff, that works. But I think it's a mess that this is how it has to go, right? Ideally I would have the way how I can upload from user space a new kernel, like a couple of additional keys, so your local one, and then basically blow a fuse so that later nobody can do this, right? Like because that would be the democratic thing, right? So I would take a Fedora kernel and then in the early boot phase I can install an additional kernel and then nobody else can. And that's like for me, that's the perfect security. Well, we don't live in this area. But I added this concept that we can do authentication user space instead. Depending on security policies, the kernel might say no though, but on the kernel in assistive reasons they all say yes. You can mock, easy talk, the data is done. Yeah. Anyway, so, yeah. Thank you very much. Thank you very much.
Enhancing Linux Accessibility: A Unified Approach
Okay. Hello, everyone. I hope this room is not empty. It's hard to see from here. Okay. It seems like it's not. So do we have slides? Okay. So let's move to the table of contents. So good morning. Today we are going to talk about accessibility of Linux desktop environments. First, we will go through a little bit of theory. Unfortunately, it's needed, I believe, for the later part of the presentation. And then we will try to show you some live demos, kind of hands-on experience. And we will talk about what's the state of the accessibility, how we can improve it. And then in the end, there will be also something we tried to create to help developers improve the accessibility. So let's move to the introduction, please. So just a few words about us. My name is Wojciech Polaciek. I'm working at Red Hat in a security compliance team. I am blind. I'm a Linux user. I use mainly Arch Linux and Fedora. And I'm very slowly but steadily developing, trying to develop a special Linux distribution called Tboitux. It's aimed at blind and visually impaired users. Now I'm handing over to Lukasz. So, hi, everyone. I'm so glad to be here. My name is Lukasz, director. I'm also blind Red Hat, but working in the desktop team for a few years, but with Linux for 10 years. And I'm basically fixing accessibility issues whenever I find them. So mostly in GNOME upstream, or sometimes GNOME isn't enough, so we need to fix stuff in GTK and sometimes even lower. And let's continue with the presentation. So I'm handing the word back to Lukasz. Okay, thank you. So next slide, please. Very shortly, just to talk what is this accessibility actually about. To put it shortly, it's basically a property of a system which signifies that it can be used by a user with some kind of disability. But of course, this is a very large scope, so on this talk we will reduce the scope to computer systems and to blind users. So basically when I try to put it short, the accessible Linux desktop environment is such environment which can be effectively used by a blind user. That means that we can get the information from the environment and we can reasonably interact with the environment. We can get a work done, basically. Next slide, please. So why do we actually need to care about the accessibility? I will try to give you some reasons, hopefully. And mainly about Linux accessibility. So Linux is basically the only free of charge alternative to Windows and macOS desktop operating systems, which offers comparable level of accessibility. I think this is very important, especially in cases where, let's say, the funding might be low for the systems. Why actually blind users would like to use computers? That's also a question which would be answered. So basically computer for blind user is a gateway to a lot of information to internet. It's also a very efficient communication tool. It can be also very often used as a work tool. I hope we are quite good examples of that because we are basically developing on Linux every day. To move forward, I think that the accessibility is also important because I believe that diversity and inclusion in the open source ecosystem is important. And one more thing, I saw several stands which are dealing with the free open source on mobile devices. And as Linux can run on such some of these devices, I think this might be actually a future. So we might have alternative to Android and iOS in Linux. And I would like to see it being accessible. So let's move forward. On the state of current desktop accessibility in Linux distributions, I have good news and bad news. I will start with the bad news. Let's say that the initial hard law which is there for especially newbie Linux users is quite high. The reason is that most of users who are going to start using Linux are switching or previously using Mac OS or Windows. We've been running workshops in Czech Republic where we were basically trying to introduce Linux to users. And we very often encountered a situation when we started the desktop environment and everything was fine. And then we showed them that they were trying to use some elements of the environment like for example they were trying to find out what is the percentage of the battery in the laptop or what is the network they are connected to. And this was not accessible. They couldn't get this information and they were like so surprised because this is just no discussion about it in other operating systems. It's just there. This might be a bit problematic especially for new users. The good news is that when you sort these things out, you are actually doing pretty well compared to other desktop, to other operating systems. There is a lot of things you can do with Linux and it's still improving. So I think that the biggest hurdle is to get the things going and get the things set up correctly so that it doesn't introduce any additional learning curve or any additional obstacles for newbie users. Let's move forward. I think the next slide should be about some basic terminology. Just two terms so that we don't spend a lot of time here. First term is a screen reader. Screen reader is a piece of software which is essential for blind users of any operating system. And it's basically software which kind of creates a bridge between the graphical user interface and the user. So the screen reader gives feedback to user about what's going on on the screen. So what button for example is under focus. Focus means that basically by pressing space bar or enter you will interact with the control which is under focus. For example the screen reader tells you what's the current line in a text editor. It can read for example the alternative descriptions of images and various other stuff. But it also helps us interacting with the interface efficiently. So for example it can help us move around web pages by for example jumping over headings and it can for example display the list of links to us so that we can go through it very quickly and various other stuff. So this is screen reader and please do not confuse it with the speech synthesizer which is the second term and the last term on this slide. And this piece of software is also important because it's actually the mouthpiece of the screen reader. It's just a software which takes some text as an input and it produces speech as an output. And these two pieces of software are independent on each other basically. So that's just to set things straight. Next slide please. This is quite important and we will talk about it and we will show it in the following demos how to make these in a desktop environment accessible. So the first point is crucial. On the previous slide we talked about a screen reader and a speech synthesizer and together with some other software and configuration it's something we can call accessibility stack basically. If this is not present in distribution basically you can skip all the other points on this slide because it doesn't make any sense. You can have perfectly accessible desktop environment but if you don't have the screen reader present from the start basically there is like this is a showstopper. The second point is very related to the first one because you can have the screen reader present. You can have everything perfectly configured but when you don't know how to start it then like you are in trouble again. So this is usually accomplished through some standardized keyboard shortcut or something but there should be a way and there should be a documented way how to start it. Then there are like general or generic desktop interface elements which should be made accessible like icons, panels, as I mentioned on the previous slide. For example the icons in the panels these are often problematic. Then also the configuration facilities of the desktop environment these are also sometimes problematic and it provides quite bad user experience because as I said this is like standard. Then another point is about having reasonable keyboard shortcuts because keyboard is a primary input device used by blind users. So I think we should be able to like switch among applications, open the menus in some predictable way, in documented way, switch around virtual desktops and other stuff. Then one point which I mentioned separately is login screen because this is also quite often problematic. To have it fully accessible that means that after you enable the accessibility the screen reader should be started already on the login screen and the user should be able to use this screen to navigate it and to login. Then there is a documentation part because as each desktop environment is a little bit different we often or we encountered a situation where it looked like it's quite accessible but we actually didn't know how to use that. Then there are the apps provided by the desktop environment like these essential apps like clock, calendar, file manager which often go with the desktop environment. And last but not least are third party apps like they are mentioned here just mainly for the sake of completeness because I think it's a little bit out of scope but I would not be happy if we forgot about these apps. These apps I mean web browsers or word processors or whatever. So next slide please. I often get these kind of questions like so how about creating a custom distribution for blind users? Yeah that would be so awesome because you can configure it however we want and like if something does not work as expected like you know we just like we hack it around somewhere, we modify the configuration and it will be working so great. And I mean yeah like the idea is good and I believe that the thought behind it is like good positive. But I think this approach has several problems either like I might sound now weird because at the beginning I said that I'm creating my own windows distribution but I will explain it later. So basically yeah you can create your own windows distribution and during my 14 years with Linux I saw like three or four of them and each of them died in two years basically. Because there is quite significant bunch of people maintaining Linux distribution and these specialized news distros are usually maintained by one or two volunteers and that means that sooner or later they will run out of time or their nerves or something and because they have to keep with upstream and stuff like that. So this I think does not really work and then I think that rather than hacking around things which should work in the first place we should try to ensure that like they really work. To explain the thing with Wojtuk's I am basically creating the distribution based on Fedora with the hope that after I create for example I package some software or I create some configuration it might be in the end merged into some official Fedora repose. And so my vision is that in some years this distribution will not be needed because everything will be available in Fedora. So I was speaking for some time already so I think it's time to hand it over to Lukasz who will tell you something about our real examples and also stuff we prepared to make the process of. Creating accessible stuff environments easier. So unfortunately as Wojtuk already said the situation in the distribution landscape isn't ever happy and sunny and so on. And we didn't make up these things. So what we did we actually went through the most popular distribution so basically Fedora and it spins and then Wojtuk and it spins as well because we wanted to test more things and we tried to boot up their live images. And we looked around what happens and what we found one of three things actually happens. The most happy path was that everything was working and you could enable the screen reader using the shortcut and it actually talked. I'm not now talking about the accessibility of the environment itself. We have some issues basically everywhere but in comparison to the accessibility stack not being there it's a minor issue. The second group of distributions was quite interesting. There actually was the accessibility stack and everything but the accessibility shortcut wasn't working. So you pressed it and nothing happened and you didn't know why. Of course if you knew the next way you could turn the screen reader on in a different way but no new people know that and you don't want to describe them that they have to press shortcuts type of comment and press the return key and so on. Because they would say OK I will just stay where I am now. I will put up with all the marketing and stuff and we're not bothered with this system because it doesn't work. So we definitely don't want this situation either. But then there was the last group where there was no accessibility stack at the live image. So there was no screen reader no speech no nothing. So of course you couldn't even install this thing or do anything. And it's quite sad because as the video will show it's not so hard to make this thing available. Basically there was a package missing and everything else would be downloaded through dependencies. So I don't know it probably was only a mission on the part of the maintainers probably. So it should be a pretty easy fix. And of course there will be some size increase of the live image but it will not be something huge. So I don't think this will be relevant for them. Next slide please. So before we look at the demos I have to say something first. We of course don't use all the environments. So it may happen that we actually don't know how to use them correctly because as White has said the documentation on the accessibility part isn't always up to date or is a found double. So we could make a mistake. So if someone has some more information just say us them to us and we'll be grateful. And yeah we are using some names of keyboard keys and especially we have the super key which is usually mapped to the windows keys on most keyboards. And we managed to use both terms for the same key in the video. So no it's not anarchy but only a different name for one key in all the videos. And I think we are ready to show the first demonstration the happy one. So let's go. Of course. Oh yeah before we go we had some limitations in the presentation system so we will have to play the audio of the video through a speaker. So the experience may not be in the recording and probably may not be either either here but we will try and you can always watch the videos at home if you really want. And that's everything. And let's go for the first one. This video will show the work. And the show is done. The reference name is the number. And if the number is played after the old system boots up you have to just press the keyboard and it's shown in the monitor specifically. Out of the super plus S. And if you do it on the right moment because you of course don't know when the right moment is the speed starts and you can do basically everything you want. So what we actually saw we saw normal the home boot of course the boot sequence to what shortened on the video because it would be boring. And then we saw that on sometime after the machine started up the GNOME environment booted up but the visually impaired person wasn't aware of that because there was no start up sound or something like that. It would be quite handy to have something like this. But if the person was right and timed it correctly he could activate the screen reader by the accessibility shortcut and it actually started talking. So yeah that basically everything was working as best as it can in these times. And let's ask move to the second demonstration. We will show which will show a little bit in a certain state of things where we have all the components of the accessibility stack button. Not the accessibility shortcut. So let's play the second video. This video will show another case of an accessibility distribution but when the accessibility shortcut doesn't work this will be demonstrated on a Fedora main 39 live image. If we try to access the shortcut here. We get nothing. So we have to use another way. So we press out plus F2. Get to the command screen and type orca and confirm this thing with the entry key. Screen reader on desktop icon view. And so far surprise it worked. So it was just the shortcut missing. For example, move around the desktop icon. So here we saw made a scope environment with missing accessibility shortcut and of course no startup sound as well. But if you used a very complicated way for newbies but a way nonetheless the screen reader start. As you may have heard there is a career issue in the speech output. The pitch shouldn't change so often but that's another probably some Fedora configuration issue or something like that. We didn't make it to the root cause of this thing. But yeah, the thing actually talks. But in the last video you will see what happens if you have no accessibility stack out at all. So let's go. This video will show the most difficult case. Namely the case when there is no accessibility technologies on the live image. This will be shown on the cinema Fedora spin, the 39 version to be precise. So the first time of all the Fedora spins we actually got some sound when the boot is done. So that's definitely nice of them. But if we try to press the accessibility shortcut out past Windows plus S. We actually get an error message. Saying that there's no ORCA at all. So what we will do? Well, we will open the ground terminal using the outfdue shortcut. I think it's name. And then typing the command to install ORCA all the way. Of course we know nothing about the commands progress and we have to basically hope that we did everything correctly. Maybe we can infer something from the disc activity or so but that's everything we got. Is it done or not? Well, who knows the fact not me. So let's try. Let's start with the accessibility shortcut again. And it does nothing. But it actually may mean actually nothing as well. So let us try to out pass F2 method. And as you can see, to install was a success. We got everything we need for speech including some voice, the screen reader, the middleman and everything. So yeah, we can actually try to use this system. At least we can like read the terminal. There we go. It's a blank tab left out. Window, like user app. Blank, speech, dash, dispatcher, dash, error, dot, one, one, dot, five. Left control, window text. But of course, that's not everything. To make this system fully accessible, we would need to fix all the cinema accessibility issues and that's the story for another time. So here we saw something which would completely block Linux newbie because he wouldn't have an idea that he can even open the terminal. He probably wouldn't even know what a terminal is. And if he knew that, he would have no idea about the easy way how to install Linux packages because in most cases he wouldn't know about packages, package managers and everything needed for this quite hard way how to install this. So this was needed to the live environment. So this basically was quite big luck because I actually know a lot of these commands and tricks. So I could use some of them and I managed to do the Orca install. But no, it's not something I would expect from a normal user at all. So this shouldn't definitely happen. It should never happen to say it concretely. So let's go to the next slide. But enough talking about live media because very often a developer comes and asks, yeah, I got here some issue report about accessibility, but I have no idea how to do anything about it. Wow, he may even ask what actually is. It happened to me as well sometimes. So we wanted to do something about it. So if we look at the next slide, we saw that we needed an answer for this issue. And the answer was a small project that had with maybe four or six people called Linux as developer accessibility guide. This thing should teach the developers some very important things. First, it should teach them what accessibility is and what impairments he can actually encounter in the development career. Because of course the world isn't full of visual impairments. We have other impairments as well and we want diversity and all the things. So we want to talk about as many as these impairments as possible. Then because the usage of computer for these people is kind of different. They are for example not using the most at all. We talk a little about how they use the computer. And then we talk about which accessibility features of the top environment they use, how they use them and why. Because if the developer tries to find something on their own, they may find accessibility resources. They are there, they are plenty, but they explain only evaporated stuff. Of course there are some terms which are common, but the underlying ways how to accomplish the tasks are quite different. And to add to the issues, you either find some high level descriptions of the accessibility issues which doesn't help you at all. Because you don't know how to make your application accessible. Or you find for example the GTK reference which lists every accessibility role in GTK. But if you don't know when to use which of them, what accessibility roles are, what accessibility states and so on, the documentation is less as well. So another part of the developer accessibility guide is part where we try to describe how to make most controls accessible like buttons, check boxes, sliders. And then when we came to it, things like text controls, but text controls are one of the most complicated things to make accessible. So we didn't write this chapter yet. Well, until a few days ago we would have to say that you either use the GTK widgets or you have nothing because the accessibility interfaces couldn't be used by the developers itself. And as far as the guide goes now, we have only stuff for GTK4 and sometimes some comparisons with GTK3. But we want of course more. We want descriptions for QT or we will see what else. But we wouldn't be so successful without thanking our two colleagues from Red Hat, namely Anna Horser and Eshwin Kumar because they help with writing these widget descriptions and so on. And on the next slide, you can see that the work is publicly accessible on GitLab, of course. You can see the exact URL there as well. So if you want and we would like to, if you want to help us, let's basically do it and we are looking forward for maintaining Q there. And for the end, I'm just, Vojta just to go forward and finish the thing. Okay. Hello, hello. Does it work? No. Yeah. Okay. So for the conclusion, I would like to just summarize a few of these points. As you have probably guessed, we are a small group of people based in Red Hat and we basically share these points which are on the slide. So we think that accessibility should be integral part of the development. It's ideal case it should be like a first class citizen which is considered from the beginning. The second point is that we prefer to cooperate with upstream rather than, as I said, writing our own hacks and creating our own distros because I believe when we for example find some problem in the development, in one particular distro environment and we fix it or the developers of the environment fix it, it will get fixed in every distro. So I think that's just how it should be done. We also know that it's not an easy process. It will have its problems. For example, the user base of blind and visually impaired users is currently low. It's basically a closed circle because you have not many users. That means there are not many issues reported to developers. That means developers don't fix it and if it's not working, you will not attract many blind users. It goes around. So that's something we will have to improve. And the last part is probably if you care about accessibility, if you'd like to make your application accessible, look at the developer guide or contribute to the guide or you can even contact us. I think the next slides lists our contacts. There is an email. There is a matrix ID. There is a GitHub handle. And so that's everything from our slides for today. Thank you for your attention and we would be happy to answer some of your questions if you have any. So any questions? Hello. Thank you very much. So one of the criticisms is a bit more technical in nature. One of the criticisms that is leveled against Wayland as an environment is it's very tricky to get working with accessibility software. Is that something that you've come across or is it a solved problem and that's just out of date? Look, this sounds like a question for you. We have mine. Ah, okay. I'll give you this one. Okay. So Wayland, while the basic accessibility interactions are the same because the accessibility interactions are the same because the accessibility interactions are the same because the accessibility interactions are the same because the accessibility interactions are the same. Another phase is a separate device API. But, yeah, wayland changes things a lot because the users sometimes want to, for example, actually use screen reader specific comments and the screen reader needs to get the keys and on Wayland it can't. can't or it can't now but there are of course some experiments to change things. There are issues with mouse related things because the Visual Impart wants to use sometimes in a very, very broken application needs to do a mouse click from the keyboard and avoid the stuff and it's done very, very differently. And of course you need to get the coordinates for the widget as well for these mouse tricks and this is also completely different. So yeah, violent changes things, a lot, a lot changes things but it doesn't change the overall accessibility landscape at all. So you can use a lot of experience from the X days but you have to do some things differently to say. So first of all thank you so much for the talk. Before today I never really stopped to think about what it means to use a computer if you don't see the screen. Thank you so much for the talk. For today I never really stopped to think about what it's about when you're using a computer without seeing the screen. So it was very informative for me. It seems that a lot of the work that needs to be done is about essentially translating a user interface, a graphical user interface for example, into speech so that somebody can hear about what is being displayed. I was wondering if it's any easier to translate a CLI interface instead. So basically it seems to me that a non-graphical interface would maybe be easier for a blind person to use. Can you share something about that? Hello, hello, I will answer. So there are pros and cons to this approach. It's true that when it's just a pure text interface it's of course easier because it's just text that are no animations, no things which are kind of custom but it's not a very good approach. The thing is that in the end, even though you have a CLI interface, you have to access it somehow. You have to view it somehow basically. And most users will in the end again use some desktop terminal emulator. And also then there is a thing that as long as it's like really CLI, that you write some input, you get some output, then you review the output, decide what to do and do something next. That's quite easy but when you get to the more interactive ways, for example, when you, I don't know, if I want to use the midnight command area I would say, then actually it starts to be very tricky and in this case it's probably easier to have a well-working and well-accessible graphical user interface application than to think about how to make a midnight commander emit the correct events and stuff like that so that it can be used from the command line. And one more thing I want to say that basically when talking about graphical user interface, the most obvious way is how to fix this is just to add, for example, correct labels to controls or assign correct accessibility roles to controls. It's not like rocket science usually. The problem is that it's not well known and not well tested and so then the experience is sometimes poor and we are in the close circle I talked about. No users, no accessibility, no attracting of users, no users. Hello, thank you for the speech. I'm just wondering how do you advocate for integrating accessibility in the web because I feel like as a UX designer I spend a lot of my time being like, hey, we need to make this accessible and then you have, well, we don't have a surplus of developers working on a project and it's like, hey, good job, but we need to make spend more time making this accessible and I don't often feel like I met with a wall like it functions. Why should we do this? Do you have any tips for preaching accessibility to developers? So basically advertising accessibility to developers, question, how to do that? Did I understand it correctly? Yeah. That's the question. It's tricky because I think the best example would be to find someone who really wants to use the application and like I think here with the current Linux user base of blind users it's probably mostly about stories because like again as I said before, no users, no feedback, no accessibility, no users. I currently do not have any tips in my head. Lucas, do you have anything? I don't. It's very tricky because while accessibility is usually not somewhere near the top of the priority list for the development, I don't have so many tips because it's the circle because you don't want to recommend to your friends something which you know that they will basically hate every day because they find some unlabeled buttons but if you don't fix these buttons you can't recommend it never so. It's tricky. Basically we tried going to FOSDEM to fix this. I don't know if it helps. And yeah, like then of course what might also help is that there are standards for accessibility as I said for the web but also standards, there are standards which are derived from these web standards and they are more focused on desktop and they are in US based standards there are also U based standards and so if there is a motivation that you would like to fulfill these standards then this is probably a way, might be but depends. Thank you for the talk. You said at some point that one stopper is having some accessibility related packages missing from the distribution repository. I was wondering if there is some kind of list of usual accessibility packages that one would like to see in repository that we as package maintainer could verify that we actually package those things in our distribution repository somewhere. So do you know if such list exists somewhere? So I can answer that. Two answers. First answer is that I'm sorry maybe there was a misunderstanding but the problem is that the packages are in the distribution repositories basically but they are not present on the default life image of the system. But in reality it means like they never existed for us because as Lukash showed you would need to install them manually and you cannot do that basically. So this is probably not the problem but if it is a problem do we have this section in the accessibility guide which lists the packages? They should definitely be there but I don't think they are actually there yet. This is a very good item for us. But it's really just a couple of packages basically. It's an Orca screen reader as Lukash showed. This is the only desktop screen reader for Linux. It's easy. And there might be some dependencies. It probably doesn't make sense to say it here but just let us know and we will definitely add it to the Linux accessibility guide. Thank you for this. Hi. I have one comment and then two questions quickly. So first one you've moved me. I want to now try to go home and install Linux from scratch and see how this feels because that's interesting to me and I don't think I was ever aware of how hard this could be. Second are you the first question I guess I should say is are you involved are you aware of any work or connection to commercial interests? Have you worked with Bob Davis for example at Red Hat or like Rail Workstation to like kind of see if there's I ask is like I think the commercial driver could like drive more investment. So that's the first question. Second one and feel free to kick me if this is dumb but is there any work going on with large language models to like prompt and like answer instead of using labeling so that like if you put your mouse over something it would just say what does this see and then the large language model would tell you from scratch like if that's a dumb question please kick me but I'm curious. Can you just repeat the second question if there is a way how to like using a large language model and LLM to ask what it sees you know and describe it. Okay this is interesting I never thought about it. So first question yes we are in touch. We are in touch with Bob and we are this is one of the way we also try to improve the accessibility basically. Yeah so we try to find the cause or a case for Rail Workstation but as Rail is derived from Fedora we try to fix it basically upstream and then then take it from Fedora to Rail so that we can leverage this accessibility because I believe it can help us. And the second question this is actually interesting I never thought about that using language models but again I feel that it looks like an interesting idea but I think that in the end it's again just a hack. It's like we do not invent anything new here yeah we creating the accessibility accessible user interface was is possible for a very very long time and it doesn't take much effort it's just that no one is focusing on that. So I think this could be used but in some very rare and custom cases I would say. One more last question. Okay we are out of time. Yes for me I have a multi-factorial disability in one sense and the most painful for me is the activating power sitting of some function reducing the inserted the first one historical so the key insert key I want this activate this and the other one are the moving the screen horizontally vertically it's power sitting is how awful this kind of thing also reducing the screen to one line and something like that. Something I don't like it and I don't need it so for me it was very accessible more accessible without this kind of thing when I work on my computer. Okay I'm sorry was it like a question or was it more like a comment? It's not heard very well here. Was it more like a question or was it more like a comment? The question is not clear so it was about some functions to disable so what is for me it is there are problems of some short keys with the mouse with also the pointer and so on the path and so when I see the screen reducing to reduced to one line or something else the screen moving horizontally vertically so I did this it prayers in some way so for me it's difficult to make my job with this kind of things so that's clear I suppose. I have also reduced vision but that's not not bling. Yeah I'm not sure if I know the answer to this probably not. We can probably meet later after the talk so it is up. So we are at time thank you again for the presentation. We can move any Q&A out into the hallway while we get our next speaker up onto the stage.
mkosi-initrd: Building initrds out of distribution packages
Hello everyone. Let's talk about building in and out of distribution packages. So a little bit about us. I'm Dan. I work for the Linux user space at Tima and Mana. I'm a system dmaco assignment trainer. I'm Zbyshek. I work in Red Hat. I work on Fedora. I'm in Fesco. I work on system d mostly. So let's start by talking about in and out of these and why we need them. The general boot flow when you boot a kernel is you start with a boot loader. The boot loader goes to the kernel. The kernel is then responsible for finding the root file system. And in the early days of Linux this was pretty easy and the kernel could do it itself. But these days finding the root file system is a lot more complicated. So the kernel really said we're not going to solve that problem. We'll just leave it to user space. How do they do that? Well, you give a file system to the kernel, which is called the inner MFS. You do that via CPIO. The kernel will unpack that and then start user space in that temporary file system, which is unpacked in memory. And then once you go into that, the inner MFS is responsible for finding the actual root file system and then doing a switch root operation into it. And then you end up in the final file system. And the inner MFS can do what it wants really, but generally these days it's done in one of two ways. So the first one is that some bespoke bash script gets invoked, which is generated by your inner MFS generator. So these are tools like Drakeit, like inner MFS tools on Debian, or make it in, make in its CPIO on Arch Linux and derivatives. The other way you can do it these days is by using systemd. So systemd has for a very long time already supported running in the inner MFS. And it has all the tools and services you need to find the root file system and switch root into it. And some of the inner MFS generation tools actually don't like, are configurable, so you can either use the bash script or you can choose to use systemd in the inner MFS. So I'll add to this that the amount of stuff that needs to happen for the root file system to be available is growing more complex all the time. So we have encryption, we have RAID, device mapper, possibly the invariity. And in the theme of previous, previous talk, we for example might at some point ask the user for a password, but the user might not be using a keyboard, they might be using a braille device or they might need a screen reader to know that the password prompt is up. And all of this stuff will sooner or later need to be available very early in the boot before the system is reader. Yeah, well I'll add to that that your root file system might not even be there on the system yet, it might have to come from the network. So you would need all the tools to set up a network connection and everything in your inner MFS. So it can get pretty complicated. So what's the status quo? Like I said, we have the inner MFS generation tools like Drakeit, Makein' the CPIO, inner MFS tools. And the way these tools work is they basically go look at your host file system, what's on there and they start picking out specific files and use that to make the inner MFS. Which files to pick is, becomes specific logic for each inner MFS generator. So the thing you need to know is that if you say like include this binary on the inner MFS, that's not going to work because that binary has library dependencies. So you also need to go get all the libraries and of course those libraries can depend on more libraries and so forth and so forth and so forth. So you need logic to make sure all those things get picked up correctly. Now luckily for ELF binaries you can actually in a pretty hacky way do that by just going to look at the ELF binary where all the library dependencies are recorded and trying to figure out stuff that way. But then you get into stuff like DL Open where the library might not actually be listed in the ELF binary. Or you get into stuff like configuration files or other kinds of plugins or anything you can think of. There's no direct dependencies listed in the file system that you can use to figure out all these things that need to be included. So you can get into quite a few issues. So this leads to having regular packaging when a new piece of software is released that is used in the inner MFS where the package build or the WNDEP or the RPMs pack gets updated. And then you get into inner MFS specific packages that you can use to do the packaging. So for example Drakeit. A very good example of this is when we introduced systemd executor in systemd which is now required to launch services. So this was a new binary. So when we released a new version all the specs were updated. And then we also had to update every inner MFS generation tool to make sure to include that binary in the inner MFS. And this leads to quite a few bugs. It also means that when it becomes very unclear where the bug should be reported. It could either be bug in the upstream project or it could be the inner MFS generation tool that's not correctly picking up all the dependencies required to run the tool. So it becomes very hard to assign bugs and requires a lot of triaging to get the bugs to the right project. It's also hard to customize. If you want to include something it's up to you to figure out all the dependencies and specify them in the inner MFS generation tool to be included. And of course it's also quite slow because every time the inner MFS is updated it has to be done locally. And all the dependencies have to be figured out. And anyone that's ever used Drakeit and not used host only mode probably knows what I'm talking about because it takes forever. So what do we want to do instead? We want to reuse all the work that the distributions are already doing with their packaging. So the Arch package builds, the RPM specs, everything. We want to reuse all the work that goes into those and use those to build the inner MFS. So instead of going to look at the host file system we just install RPMs, install dApps, install packages, Arch Linux packages into the inner MFS and we get it out that way. And this has a few advantages. For example, package managers, it turns out that package managers are very good at installing packages. So it just works. They're also good at managing dependencies. So all these systems have, depending on the package manager, very extensive or at least very sane dependency resolution. So all the dependencies get listed and the package manager takes care of figuring out all the extra stuff that is needed and make sure that gets installed as well. You don't need to go parsing ELF binaries anymore to figure out the dependencies of a specific package. You don't need to learn another system. So you don't need to learn the inner MFS generation tool. You don't need to manually start listing the dependencies of the tool you want to include. You just install the RPM, the package, the dApp, whatever you want and the package manager takes care of all the rest. The ownership of bugs becomes clearer because the inner MFS generation tool is just installing packages. It's pretty simple. So there's a, the surface area for bugs is a lot smaller and generally when bugs appear, they're going to be able to be assigned to the upstream project instead of to the inner MFS generation tool. If any improvements are made to the packaging, of course, they automatically end up in the inner MFS as well. And finally, by doing this approach, the inner MFS is not tied anymore to the root file system or the host file system. So you can also start building the inner ID off-host on a distribution builder and distribute it as a package. So you can just download an inner ID instead of generating one locally. Assuming that the inner ID includes all the necessary pieces, this allows you to just have an inner ID that works for 99% of use cases without every user having to spend CPU power to build at inner ID themselves. So there is some, some requirements are needed to build the inner MFS out of packages. Specifically, this means that the packaging has to be done a little carefully to make sure that the inner MFS does not become too big. For example, GCClips, so GCC ships a bunch of libraries which are generally dependent on by software, at least the C library. But it also ships, like GCC supports the Go programming language, it supports Fortran, it supports D, and it includes standard libraries for all of those. If those are all put in the same package, especially the Go standard library, it's absolutely huge. So yeah, and with an inner MFS, that's pretty huge. So ideally, GCClips is a separate, sub-packages for each standard library so that you can only install the necessary one in the inner MFS. For example, Arch Linux doesn't do this. So we have to, you have to start removing stuff manually, but we don't want to do this, right? We want to rely on the packages. So ideally, the distributions take a little care that the core packages are split sufficiently so that you only install the necessary stuff in the inner MFS. Another good one is that the kernel modules generally depend on the kernel itself. So if you install the kernel modules in the inner MFS, the kernel gets pulled in as well, but you don't need a kernel in the inner MFS. So that's another thing where there should be a little care taken to make this possible. And finally, locales. Fedora has, and then Santos and derivatives, have a G-Lip C minimal rank pack package that only includes the official UTF-8 locale instead of all of them. And that again, stuff like that helps to reduce the size. So how do we propose to build this inner MFS out of packages? Well, we suggest to use MAKOSI, which is SystemDIS Image Building Tool. So our idea is that the inner MFS really isn't any different from a regular Linux image. It's just packaged differently. Instead of putting it in a disk image with a GPT partition table, you just package it with CPIO and you get your inner MFS. And inner MFS isn't really any different from a regular Linux system, except it just includes less software and it has two extra sim lings, and that's all you need. So you can build it using the regular image building tools. You don't need anything different. So MAKOSI is a tool that builds these images. It does a whole bunch of things. It installs packages, and it can also build you something else than an inner ID. So it can install bootloaders. It can build an inner MFS for a regular disk image. It can do unified kernel images. And it can run a whole bunch of tools that SystemD provides to configure system images. And it also allows you to test the thing by booting it in QMU or assisting the N-Spawn container. So how do you get started with MAKOSI? Well, this is an example to build Arch, install SystemD and the kernel. We enable autologon and then start it in QMU. This gets you something like the following. MAKOSI supports all the popular distributions, I guess. CentOS, Debian, Ubuntu, OpenSuzi, Arch, Fedora, and some derivatives of those. Raul? Raul. Raul BI. One interesting thing is that you do not need root privileges as your user to run MAKOSI. We use these new UID map and new GID map tools to be able to do everything without needing to enter your password. We also use SystemD Repart from SystemD to be able to build disk images without needing root privileges or loop devices. So you can just run all this as your regular user to build an image. We have configurations, so instead of having to specify everything on the command line, you can also use the regular SystemD unifile, which everyone knows from Unifiles. So what is MAKOSI in-it RD? Well, it is a MAKOSI configuration to build in-it-RAMFS images. It used to be a standalone project, but we recently merged it into MAKOSI itself. So it is already used to build the default in-it-RAMFS for all images that MAKOSI builds. So if you use MAKOSI to build a disk image and you do not specify your own in-it-RAMFS, it will use MAKOSI in-it RD to build an in-it-RAMFS and use that. So every time you boot a MAKOSI disk image, you are generally already using this. And we make sure this is tested on all the supported distributions. So it initially started out as a Fedora only thing, but when we merged it into MAKOSI, we implemented support for all the distributions. So you can build an in-it-RAMFS out of Arch packages, Shibuunter packages, Debian packages, OpenSUSE packages, CentOS packages, or Fedora packages. We also ship a kernel install plugin. So kernel install is a system that is tooling for taking a kernel from your slash user directory, where it is installed by the package manager usually, and moving it to the ESP and doing a bunch of extra required work, like for example building an in-it-RAMFS. So usually like on Fedora at least, the Drakeit ships its own kernel install plugin, but MAKOSI does as well. So you can basically configure kernel install to use MAKOSI in-it-RAMFS instead of Drakeit to build the in-it-RAMFS. And Drakeit will automatically disable itself if another in-it-RAMFS generator is enabled. This view reuses all the package manager caches from the host file system. So you're not downloading unnecessary packages. It just reuses the same RPMs or depths that were already used, or that you already installed on your host file system. And finally, it can be completely customized. So MAKOSI, the configuration supports drop-ins. So you can add a few of those in user-lit MAKOSI in-it-RD or ETC MAKOSI in-it-RD to add more packages to the in-it-RD or to remove some extra stuff, or anything you can think of really that is supported by MAKOSI, you can make sure that it gets applied to the in-it-RAMFS produced by the kernel install plugin. It can also be used as a standalone thing, so you don't need to use the kernel install plugin. This is how you would use it to build your own in-it-RAMFS, which will then appear in the working directory that you invoke it in. One interesting thing here is because the kernel modules packages aren't really set up completely correctly yet and pull in too many dependencies, we do the practical thing and we copy the kernel modules from the host. By using the kernel module exclude settings and the include settings, we can do the same thing that RakeIt does or the other tools do, where we only include the kernel modules that are loaded on the host file system, because if we would include all of them and all of their firmware dependencies in the in-it-RAMFS, it would grow to tremendous proportions. So make sure to only include what's needed. So we cover a lot of this with integration tests. Specifically, we make sure that booting from LUX works, so with an encrypted root file system, we make sure that LVM works, we make sure that the combination of the two works. This can boot up for the AOSFS. We support the system-dgpt auto-generator stuff, just specifying doing everything with FSTAP, whatever you can think of really. We try to make sure it works. There are some more niche technologies like RAID, NFS, and iSCSI that we haven't had the time to write integration tests for, so we can't say for sure that this will work, but we're working on making more stuff work that is already possible. So we're working on making more stuff work with the existing tools. That was everything I had to say. So this is a link to the configuration files from iSCSI in-it-RD. So there you can go and take a look at how the in-it-RDs are structured, which packages are included, what files are removed. Specifically, there's a lot of files that we have to remove depending on the distribution. So any distribution packages, go look at that, see what we have to remove manually, and improve your packaging so that we don't have to do that. Thank you for listening. So before the questions, I want to make one comment clarification. Since we're developing this, we get into this mindset of thinking about the low-level details, but I think that this might be a bit confusing, that on the one hand, we talk about building the in-it-RD in a predictable way, somewhere in central infrastructure, and signing it. And on the other hand, we talk about including local modules. And so a lot of this stuff is for development and for now. And in the long term, we want to have the centralized thing where we're building the in-it-RD, glue it together with the kernel, and sign the pair together, building a unified kernel image, which Leonard Pottering was talking about earlier today. So yeah, just to clear this up. Awesome, thank you. What questions do we have? One over here, one over there. Okay, so you mentioned that currently you use local modules. So it doesn't mean that all the complexity from record for selecting kernel modules still remains here as well, right? Yes, but it turns out the complexity for selecting kernel modules, because the kernel modules list their dependencies properly, is not all that much. But yes, we do support it. But we hope, like Shabish said, that eventually in the future, we don't have to use that part anymore. So we can have a proper set of default modules, and these are all properly sub-packaged in distributions, so that we can install distribution packages to get the kernel modules instead of having to do the extra complexity for selecting them locally. You spoke about integration testing on multiple distributions. Did you try to test in all, let's say, usual kind of lattice distribution, but did you try a bit older, and do you have something that you plan to maintain like testing with a new distribution that are coming? So at the moment, our integration tests are run for the default versions of all the supported distributions. So this is generally the latest. It's Debian testing, it's not Debian stable. But I mean, we could definitely add more. It's just running in GitHub Action, so it's just a matter of defining the necessary configuration and then we can run tests for everything. What was those questions? Zero questions. Alright, thanks you too. This was great.
The Monolith versus the Swarm - A Comparison of openSUSE’s and Fedora’s Build Infrastructures
Hi, everyone. Thanks for coming to my talk, the monolith versus the swarm. This is not about Starcraft. This is about build infrastructures. First about me, I'm Dan Cermak. I work at SUSE. I interact a lot with open SUSE and SUSE's build system, the open build service. Since I'm also a Fedora contributor, I also interact with Fedora's build system. Every time I switch between them, I get annoyed because of things that are in one and not in the other, and vice versa. At some point, I thought, well, maybe I could give a talk about that. Why, actually? Open SUSE and Fedora are very interesting in one regard, in my opinion. These are very similar distributions. They're both RPM based, and they look and feel kind of similar. For a certain amount of time, I also ran both from the same config. They work kind of the same, look kind of the same, but they have absolutely no common ancestry. They both use the RPM package manager, but SUSE started initially, I think, from Slackware, and then it became its own thing, switched to RPM, and Fedora came from the whole Red Head site, but there was never some kind of common path, and that also reflects itself in the completely different build systems, whereas open SUSE uses the open build service, which is the monolith in this talk, and one giant thing that makes everything. Fedora has 50 million different services, some of every single one doing something else, or sometimes similar things, to which we'll get, and hopefully, maybe, we can also learn from each other, because each approach has advantages and disadvantages. But let's first take a look what you actually need for a distro build system. So this is a very, very simplified graph of what you need to do if you have, if you want to build a distribution, you need some kind of source control where you store your package sources. Usually, you would pick it, but that's not always the case, as we'll see with the open build service, and you have some kind of service that builds your RPM packages. Since we're talking about RPM based distributions, yes, we want to build RPMs, but if this was about a one-two, you could also build depths, whatever. At some point, you want to build your repositories, images, containers, all the other deliverables. You want to take a look, and you want to monitor all the package builds, find out if things start breaking. Then you want to push all the same thing to QA. I'm not going to talk about the QA part because, kindly enough, this part is the same. Both distros use open QA, and if you're not neither from Fedora nor the open Suzerworld, and you don't use open QA, you should. There's enough open QA people around here, and by that, I have fulfilled my contractually obliged commercial part for open QA, but still please do. It's cool. Yeah, and then you push all your deliverables to some mirrors, and you have your distro nice and simple, right? So, in practice, I'm going to breeze through very, very quickly through the open build service, and this whole thing, it deserves at least five lectures, so it's probably not going to be complete. Sorry for those of you who are not familiar with it. So, the open build service is what I would call the distro building Swiss Army Knife because if you decide you want to build a new Linux distro that's based on something Fedora, open Suzer, DBN, Arc Linux-ish, you can do that with OBS more or less immediately. The whole thing's been around for a long time, so I think what I've been able to pick up from the historical documents of which there are unfortunately not many, it started out as a replacement of a service called OrderBuild. The design phase began around something 2005, probably 2006, it started becoming used, and it was introduced to open Suzer factory, which then became open Suzer Tumble with the rolling distro in 2009, and it started to get more and more extended, so initially it was just building RPMs, and that was it. But nowadays it couldn't build not only RPMs, but DBN packages, Arc Linux, web packages, you can build containers, you can build virtual machines, it will build, it will create repositories, it will publish everything, and so on and so on. So, essentially everything that you need to build a distribution, the open build service will do for you. The one big feature of the open build services that it will give you automated rebuilds, so that's something that's unfortunately not there in the Fedora infrastructure, and one thing that annoys me very, very much, and I think other people as well. The thing that annoys me very much in the open build service is the custom version control system, which is unfortunately not gates. Yeah, and then as I said, OBS, there's a ton of features that are, so one thing that I'd like to mention that's especially useful for distro building is a so-called staging area, which you can use if you want to, if you send in stuff into the distribution, you can create a sub-project. If you're familiar with the Koji world, it would be something like an automated side tag where you rebuild everything that depends on your change, and if it's all green, you merge it into the distro, and that's nicely, nicely possible in OBS. So, when I said it's a monolith, I was actually lying, it's two monoliths. You have monolith number one, that's the OBS front-end, that's a giant Rails application that talks to MariaDB database, and the actual magic with the interesting parts that's written in Perl, please don't run away, but yes, but it works relatively nicely, and that's really where the scheduler is, where the dependencies are resolved, this thing, signs your packages, publishes packages, but the actual building part is related to the workers, which invoke the bespoke OBS build script, and that will build your RPMs, DBN packages, containers, and so on. But as the user, you really just talk to the front-end, which has an XML API, so you see it's from the early, early odds, it's no JSON, so too early for that, or you interact via the OSC command line. So let's take a look how this actually looks, though this is an example of the OBS web UI, remind you a little bit of copper from Fedora, and the whole open build service is organized in projects, a project is more or less a collection of individual packages, that then OBS will rebuild for you. So here, let's just make a hypothetical example, you create yourself, you have some Python projects, there you have some Python packages, and OBS will happily keep rebuilding those once there's some kind of dependency change. And that, to all of you who know copper, that looks exactly like copper, but the real magic of OBS comes in, that you can automatically, that you can add dependencies between your different projects. So imagine I now decide that I want to test out some change in pandas, but I need special dependencies for that, and I have found them in some other project, and now I will include them in my build route in this home-danned project. And then OBS will do the dependency tracking recursively across projects, which gives me a whole ton of flexibility for experimentation. And it will also do all the rebuild these stuff across that. So as you can already see, this is a very simple, very simple, very simple, very simple, very simple, very simple, very simple, very simple. And this causes, so every time someone for instance decides to rebuild GCC, and you take a look at the monitoring page of OBS, you see a giant spike, because every single person, more or less, in either some way, depends on GCC. And it also means, when I was showing this picture, I realistically only retained the last build, because otherwise it would run out of this space now. The version control, now we're getting to the Nasty part of OBS, unfortunately, so if you start interacting with OBS via OSC, which is the command line client, it will behave like subversion, but you will very quickly realize it's not subversion. It's something completely custom that is weird and wonky, and it's actually just communicated via an XML API. It does things like there's not really subfolders, there's also no branches, and I hope we can rip this out. Well, I hope we would have ripped it out 2009, but it's still there, because that's the one disadvantage of a huge monolith, ripping out individual parts is very, very hard. And then another thing, which is, I'm not going to give it the full justice. So, on OBS, you have the option to branch packages. So, you can do this, and if I say branch, I actually mean it's more like a fork. So, let's say you find a package and opens with a tumbleweed, and it's not behaving like you would want, for instance, it's twold. So, you branch it, i.e. it makes a fork, and what actually happens underneath it creates a so-called link, which behaves kind of like a floating rebase of the changes you made locally on top of the changes that are in the distro. And if you think this makes no sense, you are not the only one, it kind of makes sense, it's a three-way merge, but getting it right is a nightmare. There are cases where this is very, very useful, because it allows you to have downstream patches that are continuously rebased, and you don't have to do the rebase yourself, and that all sounds nice, until your patch doesn't apply cleanly anymore, and the whole thing breaks down and OBS doesn't tell you. I'm sorry I can't give this whole thing justice, because I think there's like four people who really understand this, and I'm not one of them. And the other four are not here, and two of them will not admit that they understand it. So, a very nice feature of OBS is the project con for one thing that allows you to configure how projects are built and published, that's just a config file, but the other thing is you can tweak macros. So you can really tweak RPM macros. So let's take a look at an example. Here if you would look at the upper two lines, that's just some config stuff, but below there you can really set macros, and once you change them, OBS will rebuild this project, because it will not try to do macro tracking, at least I hope so. And this gives you as a release engineer a lot of flexibility, because if you suddenly find out, okay, but I now want to quickly change this, you don't have to edit redhead RPM config, and then rebuild manually everything yourself, OBS will do this for you. And at last, let me just quickly mention submit requests, so this is essentially the equivalent of a merge request. If you have branched, i.e. forked the project, you can send your changes back, but what's also possible, you can do full submissions of a new package into a project. And that's also a very nice feature how you submit new packages into open source at umbilweed, which in my opinion is a much better user experience than in fedora, where you go to bugzilla, you upload your source RPM somewhere on the internet, then you make a packaging request, then you make a new repo request, and if you haven't given up at that point, then you'll eventually get a repo. And this makes it much, much simpler. So as I said, I haven't given OBS full justice, but we're nearly approaching already 15 minutes, so let's switch over to the swarm. As I said, this is not Starcraft, but so first part of fedora, Paga, that's the Gitforge, so this is really, this is really your classical Gitforge, sadly not as maintained anymore as we all would wish. It's a Python based, it's a Python based GitHub inspired Gitforge, looks a little bit like GitHub from 10 years ago and feature wise, it's unfortunately not much further. It started also as a very great idea, because every repo in Paga is I think four repos in total, you get your own Git repo, you get your repo for your issues, for your merge requests, for your metadata, for your wiki, you can do remote pull requests on Paga, so if you don't have a Paga account and you decide I still want to make a pull request and I don't want to put it up on Paga, you can put it somewhere else into a Git repository and make a pull request to that remote repository. I don't know if any other Gitforge that would support that, but sadly this thing has been mostly written by Pingu, sadly he can't work on that anymore and so development has been very slow and it's more or less just a question of time until this will be replaced with GitLab. The build system of Fedora, that's Koji, that's really just a, it's a very, I'd say relatively simple RPM build system that uses Mock to build your RPMs and in contrast to OBS, which gives you a lot of flexibility, Koji doesn't, which gives Koji the big advantage that it's very, very simple to understand, because you can't do fancy things. You have one build route, there's your packages and then you build your package with a fixed NEVR, so name, Epoch version release in that build route and once it's built, it stays there forever. So Koji persists, builds forever, at least real builds. There's ways how you can test changes, so there are side tags and you can also do build route overrides and nowadays Koji can also talk to the, to Pega, to the Fedora disk Git, but it's in itself, it's relatively, it's a relatively simple thing and it's relatively simple to understand, which is a great advantage for release engineers, because that's all what they have to interact with and it's not, I'd say Koji is not that hard to grasp. OBS, it's a different beast. So then, Pangee, what I've heard Pangee apparently gives a few people PTSD, but then, or at least the configuration file, but Pangee itself is really just a distro composition tool, so yeah, it creates the Fedora composes, all the centers composes and Pangee itself is just, is again, just a tool that executes a bunch of other tools, but you can summarize it essentially. It assembles first all your packages from Koji from the current state, checks which packages you need to create the current snapshot that is called a compose, then it runs the repo creation, also OS tree repo creation, but that's just a different repo and then all your images are built. And this is kind of interesting because this is different than it's done on OBS, so in this step you create your images from the created repos and then the whole thing is published or not if it fails. On the open build service, these two steps are separate from each other and also the repo creation of stable distribution or stable distributions like Fedora is done differently and I don't want to say that this approach is much better, but the approach that's done on OBS has the slight disadvantage because the steps are independent of each other, you can publish your images and not publish a repo or you publish them at slightly different times and then they are not exactly the same and you suddenly, you publish your images, you run a zipper up or zip it up and suddenly there's changes also there shouldn't be. So that's I'd say a release engineering main difference. Now we come to the real swarm part of the Fedora infrastructure, that's image and container building because there's not one service, there's at least four that I know of, OSBS, the open shift build service, image factory which I think many people pray that it will go away very, very soon, then Kiwi, that's the main image builder that's also used in OBS and OS build, that's the new thing. These all are kind of wired all into Koji, where they grabbed RPMs from Koji and built images and built images from that. So at least with OSBS every time I look at it I run away because it includes far too many fancy words that I don't understand if I just want to build containers or images. There's too many. MBS, that's the module build service, unfortunately it will go away so let's not waste our time with that. Koji is a very interesting thing, it's kind of a workaround in the Fedora infra for not having automated rebuilds because it does exactly that. So what you can see here is a screenshot from Koshai where it will track every time there's a dependency change of your packages in either Fedora or height or one of the other variants, it will tell you what exactly changed between a compose, it will run a scratch build, so that's a kind of non-production build in Koji that can be removed and if it succeeded everything fine, if it breaks it will tell you. And it will also tell you if your dependency suddenly failed to resolve. And everyone who's worked with OBS will know that the worst thing that can happen is if your packages dependencies failed to resolve because that's the one thing you don't get a notification about. And that's what really Koji does. And sadly the one thing that Koji cannot do is kick off real builds which would make automated rebuilds happen in the Fedora infra but that's then the problem that there's not enough build power available. Another thing that really does not a clear equivalent in the open SUSE world, that's Bodhi, this is an updates testing facility. So this is primarily meant for end users to test package updates. You can just log in, check out an update to your favorite package and you can vote on updates and if you suddenly realize this update breaks my system you can give it negative karma and it will not go into the distro on the next compose. And nowadays it's also used to gate draw height which might or might not be a great idea. I've heard both sides of that equation. So one thing that I'd like to also mention that's Fedora messaging. That's an AMQP based notification bus where every single of those hundreds of Fedora services sends their notifications to. Open SUSE has something as well but sadly what opens SUSE doesn't have is the equivalent of the Fedora message notifications which is where you can configure on which kinds of events you want to get notified and then it will send you messages on matrix or messages via email which might or might not be a great idea. It's a bad idea during a mastery build because rip inbox. Last thing, what about copper? That wasn't me. I'm going to pretend that doesn't happen. So I hope it's not the smoke alarm. So what about copper? Copper is a community build system and really not the main part of the bad of the build system. And now for my favorite part and because I talked too long, it's going to be quick. The good, the bad and the ugly of each of them. I'd say the good parts are on OBS. It's very flexible and if you want to carry out large scale changes that's the place to go because it's much, much simpler than to do it in Koji. And you have one place where you go and you do everything there and not 500 places like in Fedora. In Fedora on the other hand you have a bunch of simple individual systems. So every one of these things is pretty easy, pretty easy to grog by themselves and if you have to work with just one of them, it's pretty simple. And it's relatively easy to extend at least in the sense that one of those services doesn't do what you want to do, just write a new one because that's sometimes like it appears has been done in the past. The bad, getting started, it sucks on both sides. It sucks in individual ways, in different ways, but I must confess I didn't find the getting started experience in either of them good. On OBS it's the version control. It's terrible. I hope we can replace it. And handling of something like a release distribution is really weird because OBS has been built around the idea you want to constantly rebuild everything. And if you try to shoehorn something where you don't try to rebuild everything all the time, it starts to become ugly. And then the OBS dependency result. Okay, everyone who is clapping has probably worked with OBS. Short explanation for those of you who haven't. OBS does its own dependency resolution, doesn't use the package manager for that. And the dependency resolver is stupid and if there's a possibility to make a choice, it won't make it. And that's when you get this error. Yeah, no automatic rebuilds on Fedora. It sucks. I hate it. Please, can we fix that? And sometimes systems get misused. Yeah, and then the ugly. So on OBS the whole thing is just so darn complex. It's impossible to understand. There's maybe half a dozen of people who really know most of the things. And given that it's a giant monolith, it's impossible to extend unless you really know all the details. And on Fedora you have just far too many systems and far too ugly glue that then also leads to ugly things where you can't extend things and you have just duplication. So who's better? No one. Both setups just have their advantages and disadvantages. If you want to do development, do OBS. If you want to do stable distributions, Fedora. That's my personal preference. I haven't worked ever on Rails, so I don't know if it was set like in production. If you want to do both, oh, I wish we could just combine it and it would work, but I haven't found it. So I hope we start time for questions if there are. I'm sorry. You can send the hate mail to me when you find me. I'm going to run away. Thank you.
Desktop Linux, as easy as a smartphone! Just in a Snap!
So, I'm Tiel Kampeta, leader of Open Printing, and by making a snap package of cups, I've learned snapping and done and made a lot of experience in snap. So, I got a snap enthusiast and I'm also working at Canonica and this way came up to giving also workshops in snap and so on and giving talks about snap. And here on the first step, I want to tell once what snap is and how it works and second about an old snap linux distribution Ubuntu Core and Ubuntu Core desktop and to show that snap gives you something a little bit like how smartphones work and so it makes it makes it very easy for the end user to maintain their system. So, what the hell are snaps and why should I use them? So, if you have an open source project, usually they develop some application and this application is published as its source code. For most users, it is much too difficult to download the source code and to compile it. Usually, they even do not have the compilers installed and do not know what they have to install to get the compilers. And so, there are other distributions, they fortunately make distro packages and naturally there are so many applications that the distributions do not cover everything so you cannot be sure that you actually get a distro package and also the distributions they make and update the distro packages until they release the distro version and after that for that distro version they do not make new versions of that package and so this is a little bit frustrating for users and this can easily turn users away and so this is a nightmare. A little bit. Yes. And so, there is a solution. It is when you have probably a smartphone and there you can easily download and install applications via the Google Play Store or the Apple App Store and you simply and it will be independent whether you have a Samsung, a Pixel or something else with all somewhat different Android versions. It is the same applications which you get from the Google Play Store. Yes. And so, and you know also Canonica, they have also created an operating system, Ubuntu Touch, a smartphone operating system. They are not doing this anymore but we have a UB port booth because the community continued and they learned from that. They did not throw everything or the experience away. They developed based on starting from the ideas of their click package format, a package format for computers for embedded and servers in the beginning but later also for desktop and that is Snap and Snap. We have a Snap Store and we can install applications on different distros, on different distro versions. We have a form of distro independent packaging for computers running Linux as IoT server or desktop. And by the way, Snap is 10 years old. It started in 2014. And so, we have a sandbox packaging which means every application is in the security capsule and one application cannot access the space of the other and cannot access the system. And it is operating system distribution independent because these snaps, they bring all the dependencies, the libraries you need and so, so you do not rely on the distribution where you install it and so, it installs on many different distributions like Ubuntu, Debian, Zoozer, Quartet, Windows or whatsoever. And your application, as I say, for the sandbox packaging, we have a security shell. This means every application is in a capsule made of up armors, sac comp and name spaces which prevents it from accessing the space of other applications, of other snaps or of the operating system. And you need, and you need for intercommunication, you need very well defined interfaces. You must define when you create a Snap which interfaces for the outside communication you want to use and only through these interfaces your Snap can communicate. For example, network Cups or Dnssd, Avahaya or so on. And so, it is very well defined what is, how the Snap, how the application in the Snap, the Snap application can communicate with the outside world. So this gives you security and privacy. And if there are interfaces which are dangerous, which can modify the outside system for example or read data which could be private, then these interfaces are considered dangerous and they can, and you can only put some, when you put such snaps which connect dangerous interfaces into the Snap Store. You must have to connect them by hand or you have, or you need a special permission of the Snap Store for those connecting automatically. And with this we can trust third party apps. We are not for a distribution, we are not needing any more that we can only trust our own distro maintainers and need everything packaged by our own distro maintainers. We can trust third party packages and so we can access a lot of different applications like with the Google Play Store and the smartphone. So and Snap has also some special features which perhaps other sandbox packaging methods not necessarily have. And the first thing is don't fear the diamonds. We are snapping them too. Snap allows packaging, diamonds and system applications. And also even what we will see later some kernel boot system desktop environment. You can snap everything. And one thing is also packaging can now move from distros to upstream which means instead of 10 distros all by themselves reinventing the wheel by packaging for themselves the upstream can package and test once and all distros can use it. And so employees of distro vendors can concentrate on the core distro. And we can also and we will also see how it goes into immutable distros all snap distros in our case all snap distros Ubuntu Core and a snap one can a little bit consider an immutable application because you have also the file system of the snap and which is also read only. And this we will see now. Now we look into what snap how snaps work. The snap core the snap file system the applications file system is a compressed GPG signed read only squash FS image which we simply mount we even do not uncompress it. Therefore we save a lot of memory, lot of storage and also memory. And it includes also the metadata of the snap and when we install the snap also a writable area of a for the file in the file system inside the capsule of the snap is defined. So that the application can write to somewhere. And we have five types of different snaps we have apps we have call us snaps we have which is the operating system call we have forget the snaps which is the boot system we have kernel snaps and we have desktop session snaps like Gnomok KDE or so. And we can and when they are updated we can handle binary diffs so that we can much more much more quickly we can have much quicker downloads. And they are available for most distros you can install snap D on many distros and they exist since 14.04 Ubuntu it is default already included snap D since Ubuntu 14.04 10 years. And as I said security we have the GPG signed read only file system for the application so it cannot be modified by any malware. And we have the confinement app armor sec comp and name spaces and the executables are called for the snap D snap confirm so that the security is enforced. And the snaps are route safe due to the encapsulation we can run an application as route but it cannot access it cannot modify anything in the environment because it's in the capsule. And so damon snaps the damon's one is good. And so we do not need any special users and special groups. And they are storage efficient due to the case that when we mount the immutable file system of them that this file system is not actually uncompressed. And we have also additional tricks so called content and once the course and the course that contains the core operating system lip GG lip and all the standard libraries. So this is mounted into the capsule office map to this course map so that the essential parts of the operating system system are available for the snap. And we have content provider snap for example for GNOME with all GNOME libraries for KDE with all KDE libraries so that this can be shared by the snaps too. And as I said safe interfaces and dangerous interfaces and interface between snaps or between the snap and the core system then the snap D snap is the providing snap. We have slots and plugs yes we have slots and plugs for connecting the snap for connecting between two snaps or between one snap and the system itself. And only due to and they are connected plug with slot and this gives a defined connection for communication to the outside of the snap and you get safe interfaces when you use a safe interface and upload your snap to the snap store and someone installs it this interface is connected automatically. And a dangerous interface when the user downloads a snap which uses a dangerous interface you usually have the user usually has to connect it by hand or the snap store has given special permission then this is also connected automatically. And now for the updating the updating is when you update the snap the new snap is downloaded and mounted but the old snap is not deleted. So and the new snap is in a new immutable file system so the new snap is installed and put in use and if you have any problems with it because the old snap always the previous one is not deleted you can easily step back. And so it's not a big problem when a new version introduces a bug and does not work you can easily step back to the previous one but the previous of the previous that is actually deleted. So and a snap started it is it was a part of for Ubuntu core for the Ubuntu core operating system by IoT. At Canonica we wanted to have an IoT operating system Ubuntu core and this we have created already back in 2014 as an immutable operating system with snap as the packaging format. This is the start of snap and the immutable operating system it has not only one single core as most others have it has some modules in the operating system itself. It has the kernel is one snap the gamut which is the boot system and the definition of partitioning and so on. This is also a snap and the core operating system these are the base libraries I mentioned already for the snaps Glib, Libc and whatever is a third snap and these three snaps give you the operating system which you can boot they come on one image but you can once installed you can update and replace them separately for example replacing the kernel by a gaming kernel or so. And these and onto these where once you have these three you install application snaps. These are the ones which I mentioned an application which is packaged as snap. So and this is the Ubuntu core operating system and the Ubuntu core and updates are also modular like with the application snaps when you update a snap of the Ubuntu core operating system for example the core snap then the new core snap is loaded and activated but the old one is not deleted so you can step back and if you update the kernel it can even step automatically back if the kernel does not succeed to boot. So when it hangs when it tries to boot and hang somewhere gives a kernel panic then it automatically steps back and reboots. So if there's a problem if you update into a bad kernel the boot simply takes longer and you are back in the old kernel. And now this Ubuntu core from 2014 in 2023 was extended to Ubuntu core desktop so we take the Ubuntu core and onto it we put a desktop by an additional snap which is the Ubuntu desktop session snap. It is currently at Canonica. Thank you. At Canonica it is a wayland with GNOME but as it's an exchangeable snaps for example Ubuntu 4 can snap KDE for example and all the other flavors can also contribute a desktop sessions snap and this way we get flavors of Ubuntu core desktop. And we have the application which this time can be desktop applications and it is distributed also as an image and usually with the base with gadget kernel and core system but in the image you have also the desktop session and some initial apps so that you have a complete desktop the user can start with. It's all in an image but once the image is installed as usual you can update and replace everything separately. A little bit like Lego pieces or like a framework laptop and software. And now you would think how do I do development on a system where everything is encapsulated and separated. What we do there we use LXD containers and do the development in an LXD container. And so we take an LXD container of operating system we want to develop under and inside this we compile we want all the tools and so on and test and so on so we do not need to snap our application which we are developing all the time to be able to test it. And so this and for this we have a graphical front and named workshop where you can easily choose which operating system not only Ubuntu but also Rated, R, Suze and whatever and so you can develop but it needs still some work for example that one can have a snapped IDE running natively but about working on the containers. And what we have yes what we have still to be done to make the system perfect and complete is once for gaming we need NVIDIA proprietary driver support this is still is still in the works and not not yet ready and for productivity we need to make the printer setup tools work with the new Cups 3.x what I have mentioned in my first talk today in the morning so that in as in Ubuntu Core desktop when Cups is encapsulated it cannot access to classic Cups drivers outside of Cups so it can only access to IPP printers and then we need to change the printer setup tools we also need to for productivity to introduce scanner applications so that we can add scanner drivers to Ubuntu Core desktop and we need to improve the development part so that we can have a snap of an IDE and then this snap can access the files and do the do the operations on the LXD containers in which we are developing and what still is missing is TPM full disk encryption needs still to be done and remote remote management of of of leads with canonical landscape is the secure and modular Ubuntu Core desktop is also an ideal distribution for for for companies who have many computers and want to have and want to have an easy maintenance of the systems so so remote remote maintaining is also an important part and then Active Directory login also for the enterprise desktop and the infrastructure to make it available as a distro that we make the ISOs that we have testing testing plans and testing scripting CI stable release tracks and documentation and so on but this is all planned for the next month at Canonica that these steps will be done so I think I'm also yes yes I yes I have it now yes that was it you can also see snapcraft.io there you find the snap there you find everything about snap there we have also a forum for questions and here are also some links and are there any questions we actually don't have time for questions that was that was it for you but thank you so much for your talk thank you and now we have a demo of Ubuntu Core desktop here behind that door and there you can if you have questions there are there is me there is Philip Kavish community maintain community manager of Canonica and there we can talk more about snap and Ubuntu Core desktop fantastic thank you so much thanks you
Upstream and downstream, best friends forever?
to introduce the speaker. Thank you. Hi, folks. Good afternoon. Welcome to the evening sessions. I'm going to turn it over to our next speaker, Franzy, to introduce his set. Just a couple of housekeeping rules if you could make sure phones, et cetera, are on silent. When you're taking a seat, they can be loud. Make sure that you do them gently and try and keep the talking to a minimum. Thank you. Hello, everyone. I'm Franzy Szek. I'm a product owner of the packet project. I will use this project as an example during the talk. Thanks, everyone, for coming. And I would like to also view things from you so don't sneak out with the doors behind. When I was thinking about this talk, I was thinking that if people come here, they already maybe had issues like me and already were thinking about it. So let's use their ideas as well and don't just talk for half of an hour and let them show and share. So I would like you to connect to this URL or just use menti.com and use this number to just connect to the slides so you can also provide some feedback for me or others. I hope it will not break the Wi-Fi or disappear or something, so we'll see how it goes. And this is an example question. Thank you for putting the answers there. That's not only to test it, but also so we know what are you coming from or what are the, what is your background? So let's give you a couple of more seconds. Wow, so many, so many. Okay. Yeah, and also a positive thing is that if you don't see the slides correctly, you can watch them on the screen and, yeah, sorry those who wanted to just fix some bugs in the meantime or check the next session so you need to use the phone or laptop for this. So sorry about that. Okay, so we'll move on. In the title, there was two times mentioned stream, but what do I mean by that? And it's what I mean by that is a stream of code or the program that comes from up, from the developers, down, down, down to the users. So that's the stream I have in mind. And you can have various pieces on the way and anything what goes up to the developer is an upstream. What goes down to the user is downstream. So for example, Fedora is a downstream when looking from the developer point of view or from GitHub, GitHub, but it can be also an upstream for CentoStream for rel. CentoStream is a downstream of Fedora but upstream of rel. So depends always from what place we are looking from. For this talk, when I mentioned upstream, I mean a Gitforge development, the GitHub, GitLab. By downstream, I mean some Linux distribution, for example Fedora in my case. I tried to use upstream developers and downstream maintainers to make it really clear, but just so you know. So just to check, try to show others where do you belong? Are you more a maintainer downstream guy or are you more an upstream developer, maybe curious how you can get to the distribution. So let's show others how we stand here and what I'm talking to. So mostly maintainers and if you are both upstream developer and downstream maintainer, you are somehow in the middle. Okay, so it's not moving much, so I'll continue. Okay, so let's back to the package project I mentioned in the title slide. Something around five years ago with few people around. We were thinking that we will create a new project and as a goal, we said, hey, let's make upstream and downstream closer together. So let's provide some downstream feedback to the development and also for downstream maintainers, let's provide them some connection to the upstream. For example, when they release new code in upstream, so let's get it automatically to downstream. And it was like, yeah, that will be awesome, everyone will be happy. So we started work on that and few months ago, yeah, we came to the upstream developers and said, yeah, we have this federal integration for your project and it's really easy and you will have like new functionality for your project and can be sure that your code will run. The feedback wasn't so positive and we were really surprised because, yeah, we are trying to help you. So what do you think why developers might care about downstream? Why they might even bother why they shouldn't just live on GitHub, GitHub website and live their awesome life and don't care about any distribution at all. So, yeah, hard question. I hope you are typing. Yeah, availability, software option. Wow, wow, so, for me, articles, okay, yeah, many, many reasons. Yeah, without distribution, they might have no users. Yeah, a lot of obvious things. Just to note, after the session, I'll share the results with you so maybe I'll also set up some blog posts but you will have it attached and, yeah, wow, so many things. It looks like it makes sense to care about downstream. So just a couple more seconds. Yeah, people, shitty tools, yeah, revenue maybe also. Yeah, sometimes there is a middleman that you don't need to tackle the video users. Sometimes you just don't want users maybe. Okay, so let's move on. So, we asked maintenance. Hey, maintenance, we have this nice service for you that will automatically send your upstream releases to your service. Always. We were very positive that we helped people. I don't care if they produce new code. It's definitely a new box and more work for me. So, I'm not sure if I want this service. So, same question. Why do you think maintenance should care about upstream? Why there should not be just the upstream that don't produce anything and I can happily leave just rebuild my package every half a year or so when there is a new version of the Linux distribution and live in peace. Yeah, users want new releases, yeah. New features that's related. Missing updates. Similar stuff, similar stuff. Yeah, bug fixes. Yeah. Writing the code is hard and I really don't want to do that and all the patching, yeah. Can you be a maintenance upstim? Looks like not, but yeah, there are a lot of maintenance with dead upstream. But availability, security fixes, stability. So, yeah, we have at least 17 reasons why to do that. So, I think it makes sense to care about upstream. Yeah, if there is no upstream project, then there is no downstream project. So, that makes sense. Okay. So, we really wanted to help people and it was quite a surprise that we were honest on that and that was our goal to bring upstream and downstream closer together. There was nothing hidden, just a really clear goal to help people. So, on the way, after these feedbacks, we also get some positive ones and also there were people that were both upstream and downstream and after all we get some users and also users that provided some feedback and after like these, somehow 40 years, I can say that we are saving some time people and helping them and there are also great people that uses our project. So, it looks like it makes some sense. But it wasn't easy and we are not done definitely and we've collected various feedback and complaints on the way. So, let's pick few typical sentences that we've heard during those years and let's take a look what we can do to help in those situations. So, the first one, when things go wrong, I don't want to look into the logs. I don't understand the downstream logs. There is some build failure and I don't understand it. So, what would you do in this situation? You are providing a build system integration or maybe like you are running RPM builds for the upstream pull requests or testing on Ubuntu or anything like that. Just the downstream feedback for any upstream change and people don't want to tackle with the downstream logs. What, when things go wrong? What would you do? How to help with that? Yeah, reliable mechanism for filling bugs. Yeah, definitely. If the problem is like a packaging problem, anything else what we can do? Okay, so it would be transparent. So, if we need to give the logs, so let's let them suffer as well. Yes, I should have 20 relevant logs to find the one relevant. Yeah, so help with some home combing those. Cool logging libraries. I'm still missing some crucial. Yeah, you can snap or flat peg, yes. But you can get like a failure from creating the snap. So, it's, we can treat like snap or flat peg like this and another distribution maybe if either of the ends but still. Yeah. Okay, I hope that's my job. Okay, I'm still missing one crucial. Like the obvious one. You know that is not possible. Probably that's why the response is not here. So, better logs. Yeah, it's usually not so easily possible. Sometimes yes, sometimes we can do something about it but with these systems, if you've been on the talk like the Dant had an hour ago about all the federal systems we had in place, yeah, we are trying to integrate with all of them and you, for example, copper had multiple logs and all the systems have different logs. So, and they use more or other tools. So, this is layered and we don't have power on all the logs. But maybe we can, as someone mentioned correctly, we can be good in the aggregation or some visualization or we can use AI. Okay, just kidding but a few colleagues and me are actually working on something like that. They are trying to collect various logs with the failures and trying to get like human input what's going on here and how to fix that. So, if you are interested in that, check this out. I really hope this will happen and will produce some really nice data set that can help us to provide some really nice way how to not tackle with hundreds of lines of logs that I don't need to tackle. So, that's just in the beginning but I'm really looking forward to that. Next thing would help us is provide nice notifications and also connect by that, I mean connect people that can help. And it relates also to the last point that we need to set really clear expectations. Who is responsible for what? Who should take a look? Yes, sometimes it's not clear, sometimes it can be really valid back in the code that is sketched quite soon and that's really nice but sometimes it can be downstream issue and sometimes it's something in the middle and so it's really nice when we are introducing these two to make clear expectations. Who will, which like maybe also time-based and with the notifications we can maybe ping people that can help. Okay, so just a single distribution. Why can't you support all of them? So, what would you do? Yeah, there are people from various distributions so that's maybe common that if you want to introduce some CI or anything that, yeah, but if I enable this for Fedora, I want also the BN, SUSE and everything. Can we help somehow? What would you do? Yeah, we can use build systems that support multiple targets like copper or OBS and these. Yeah, snap, flatback, yeah, we can like switch the distribution but when we discuss in the very beginning, we might want to care still in those distributions that we maybe work on. Yes, some distributions, yeah, a lot of distributions are really, really different and it's hard to somehow compare or maybe some obstruction. Yeah, I'm lost in all the good suggestions. I'll read them probably later. So, yeah, it's probably not possible, definitely not all. There are a lot of distributions but as someone suggested, we can maybe try, maybe we can also have open source that someone can contribute, maybe some architecture that can maybe combine various backends so we can maybe share. If you have an open source, so for example, then we'll come from SUSE and say, hey, let's support impact it also OBS and we can maybe collaborate so that's also possible. The tricky one is we are used to do distro specific terms and we need to be really careful about those because when we mention various scratch builds and patches, metadata, bugs, and all the weird terminology, it might be really, it also might be a reason why they are scared and developers don't want to hear about those. So, describe those terms and be careful and also you can hide those somehow so we don't need to speak about co-properties but we can maybe mention RPM builds and these things. So, yeah, what helps us is also we are supporting various functionality types and what helps is to provide easy and reliable testing like infrastructure or so if they can rely on that and they can run their test code or run their tests on this reliable infrastructure we are using for example testing bar project for that so we don't need to do that ourselves. And easy on boarding, I'll probably mention that multiple times but it's crucial because if they hid like the very first problem during the way and yeah with those distribution things it's not easy and we've spoiled it multiple times but it's important. Yes, sorry, I don't want to have more files in the repository. So, CI system that's generated maybe for any upstream CI so yet another config file. Don't be lazy, yeah. Yeah, there are thousands of files in the repository and you don't like yet another one or next to. Yeah. A few interesting things. Anything interesting? You see, okay, so yeah, we can put more complaints. Yeah, yeah. Very interesting things. Yes, I'll probably move on and read those later. It's interesting but so yeah, we might want to stick with the one line if possible, one file if possible because if it is possible but still better one file than multiple files and also not sure why but people rather put a shell script into the JSON or YAML instead of providing a shell script and specify a name of this shell script in the YAML file so we can help them do that. Yeah, if there is more content we can maybe let them link it. We can also enable some custom locations, custom name of or some sub directory so they can maybe hide it a bit and also for example Zoo project uses global configuration if people really don't want to put anything to their git repository just provide a separate git repository they can create a pull request and enable this. It's tricky from the developer point of view like how to do that, how all the messaging should work but yeah. Yeah, I have my own automation works well. So, I have my script or anything and I'm happy with it. How to live? Yeah. Yeah. Use on Civo. Yeah. Okay. Good for you. Yeah, standard protocols. I'll move on. So, this is a generic when we want someone to start using something else even like with the comparable tools we need some killer feature. We don't, having just the same feature set is not enough. They need to have clear motivation to have something extra when they move. Easy onboarding. I've mentioned that this is crucial. We are for example trying to do some online workshop with the gutter platform and various funny stuff to help them. Also, and that can be the killer feature if things break. The people or the users does not need to take a look and fix that and then that can save a lot of time. So, with the maintenance of this automation and we can save them a lot of time and we need to clearly communicate this but we need to also do that. So, if things break we should take a look and just ignore it. And work on the right things. So, listen to the community, listen to the people and try to like don't just assume that you are doing the right features but just ask and get something there. Yeah, your automation can break some rules for example packaging rules. How do you tackle these things? And yes, sometimes when those packaging guidelines were created the automation wasn't such a thing or maybe when they have written that they expected that humans will interact with the packages and maybe standardization, who's rules, yeah. Yeah, we can also tweak the rules. It's not set in stone so we can maybe discuss. Yeah. I trust the both modern and human. Yeah, that's a positive thing on the automation that yeah, it usually does not do like the human like things and be kind and that's I really like. So yeah, be open for suggestion, communicate, don't ignore the issues. Just talk and see what others think about that because sometimes it can be like really valuable feedback and maybe people already have a suggestion how to fix that or how you should behave. Yeah, sometimes you can also let the user decide for example we've talked about an issue if we should upload some archives to the cache if we should do that before it's merged into federal this gate and we were not sure we someone wants the automation someone. Yeah, someone wants just the security so let them decide. Yeah, and for us it helped that we are trying as much as possible to have the provisions as a regular user so we are not kind of special in any way with one slide exception to get less permissions that we need but otherwise we are trying. Yeah, so. Similar to the previous one. You can continue with the voting but I'll skip to two points I had. Yeah, when people think that they you behave differently we can provide some config options but usually bait. There can maybe be maybe people will realize that they don't need it but maybe they will be a new user that will have similar config option or similar feature request so you can maybe combine it for us. For us user defined actions helped a lot because we've done everyone has a different workflow and it was really hard to do securely and well but this was for us a huge and other. Yeah, and respond to the first issue and questions crucial even if it is like a little you weeks thingy that you are showing how you how you treat your users. So that's all from me. This is the project page for Stornon account and if we have maybe two minutes for question if you have any but maybe you can ask the audience. We don't have. Okay, so sorry about that but I think you've shared your opinion on that. So thanks a lot everyone. Thank you.
Supporting architecture psABIs with GNU Guix
Okay, so hi, I am Afrayim. I've been working on GNU Geeks for about eight years now. So supporting the PSABI's with Geeks. So the PSABI's are the, say, hey, it's been 20-ish years since we first got X8664. It would be nice if we compiled for something a little newer. So the PSABI's are a nice way to go and, let's see if I can see both screens, to go and see, to support both the older machines and the newer machines at the same time. So this is the output that I got from my computer running dash dash help on LD.so. As you can see, my machine supports X8664 v3 and v2 and of course the original, just regular X8664. So it's, you know, how we actually go and figure out what is supported and what isn't. So something that I wasn't able to find looking through everything is, where are these directories? I know the libraries go and slash lib and not really clear where the alternate ones go. So I did what any normal person would do. I went to my local checkout of G-Lib C and I searched for G-Lib C hardware caps. So from the test suite, we got dollar L, G-Lib C hardware caps, our little string and our library. And we also got that supported on three architectures. So we can see that X8664 isn't the only one that is looking for faster libraries. So that gives us on X8664 these four paths where LD.so will actually search for libraries that are supported, more or less. So we have our slash lib and then the two that are after it. For the sake of completeness, we also have for PowerPC64 little endian. We have power nine and power 10 and for S390, we have these other ones. And I've never seen an S390 machine. I just assume that they exist. So that works well on a regular distribution where all of the libraries go into slash lib or slash user slash lib. But in Geeks, everything has its own path that everything gets installed into. And fancy word at the top there is directed acyclic graph. So individually I know what all of those words mean, but in general it's like the arrows have. I'm trying to remember if this one goes up or goes down depending on which part of the stack you're working on. Sometimes we end up with arrows in different directions. So I'm assuming we're going down on this one. So for XPAT, it depends directly on get text minimal, which is different than our regular get text, which would have other inputs in it. And just like for this case, so the acyclic part is what it sounds like. There's no circles. There's no repetitions. Once you build the package, that's it. It goes into its designated folder and nothing else gets installed into there. So before when we had our library outputs, Geeks doesn't have a slash lib folder and it doesn't have these other folders either. So we still need to convince Geolib C to actually look for all of these places so that we can find the libraries. So it's before that one. The other thing is that reading through the bits, it turns out that we're not just looking for, not just checking all of these directories. We're looking specifically in slash lib. And then if your hardware supports it, we will also check the other library locations. So using, you know, to take your favorite library, Readline, you'd have lib slash lib slash readline dot s o. And you could have it also in your Geolib C hardware caps directory. But with Geeks, the first one is going to be in its full path slash gnu slash store slash big hash. And the other ones will be in other paths. So while you would have libraries in the other paths also, they don't actually end up in the same spot. And it goes and says, okay, here's my regular Readline library. I'm going to search for the Geolib C hardware cap along the same path. It's not going to find anything even though you've already built them. So question that I guess keeps on coming up when I was looking at this is, is it worth it? Does this actually make a difference when you're running the programs? Do you, how much difference do you really get from having a optimized libraries for your computer? And answer a little bit is, does it matter? I mean, the options are there. They wouldn't be there if it wasn't going to do something. And other part is people want it. Users want it. It might make a difference. So whether or not it matters, we're still doing it. And to some extent, wonder is it something like the, I always read that as fun roll loops. Where it's, does unrolling the loops actually matter? How much benefit are you getting from it? So, yeah. So one of the, oh, I got cut off a little bit. One of the programs that we were experimenting with was the new NCDU written in Zig. And so up here, I have a just transcribed output from Difascope. I actually went and compared the two binaries. Zig does, Zig inherits the optimizations from the underlying LLVM. And so I compared NCDU built against standard X8664 and one built for X8664 v3, which would run on my desktop. And other than seeing that more than 99% of the code was the same, this part was actually the part that had the largest amount of difference in the generated assembly. And so I don't know if, is it v0 upper is faster than other options here, but I also noticed that it ends up with the same number of instructions. So it's, you know, for a lot of this, we really are getting into very minimal benefits in here. But anyway, like we said, we're doing it anyway. The options are there. I'm not going to take no for an answer. Okay, that didn't get cut off at the bottom. So one of the libraries that we've already looked at and said this one actually benefits is GSL, which is one of the math libraries. So what you're looking at here is scheme code. This is the actual, this inherits from the actual package definition for GSL. An actual one has, you see this one's missing a version string. It doesn't have the source location. It's missing a couple of things where it's just inherited. But basically package definition defines the name inversion. It has the source where to get it. It says what kind of build system to use. If there are any arguments in the case of GSL, mostly we skip a bunch of tests. And then some other metadata that goes with it, home page synopsis, description, license. So for this one, we go and basically say, okay, so we're going to inherit from GSL. I'm going to change the name so that we append the PSABI to it so we can actually keep them separate. The make flags, we're going to actually pass to C flags and CXX flags that we're building for the specific PSABI. We add the, tell it to use the library dir of output is our, is the output. It's the per package directory where this library will get installed to. So instead of installing it to output slash lib, we're going to add glibc hardware caps slash PSABI, which is the directory we saw before. And then after the installation, we're going to delete a couple of extra bits that we don't need. We don't need the binary because we're just using the original GSL binary and the headers and include and anything in the share directory and the package config. We're just deleting all of that. And in the properties, we're going to hide it from the CLI so that we people can't just go and install this one on its own. And we're going to mark it as not tunable because we don't want to say, hey, build this, build this specific library for this specific sub architecture but actually tune it for my machine because that's not going to help anybody. So then when we go and we have the actual library similar to before, we go and we say, okay, go ahead and run through everything like normal using all of the normal package arguments and build arguments and everything. And then at the end, after install, we're just going to see here copy the actual libraries into their location because we can't go and say, here's regular GSL and install the other libraries into its folder. We say we've built a new one and we're going to copy the optimized libraries in. So then, you know, same thing for PowerPC. I just didn't put it there. So in the end, this is the regular one. This is the, you know, just a generic one. I shortened some of the directories so that it would fit. So we have the top, the full path, the output for GSL. This was, no, this is the everything together. We have the full path of the library, of the output for GSL with all of the hardware capability, all the optimized libraries. You see we have the one set of binaries. We have the headers. I've collapsed the libder so that doesn't take up all the space. And then inside the libder, I actually have at the bottom there, we have the libgsl for, you know, just regular. And then v2, v3, and v4 I closed so that it would all fit. And we have just the one package config and through all the testing that I had on various machines, it would go and using this as an input for everything else, it would go and link against the regular libgsl. And then at runtime, it would go and actually use the optimized library depending on which machine it was running on. So Geeks being a functional package manager, we end up with functions for things. So here, some of the bioinformatics programs that I was working on, we have, so here the idea was PGBB, PGGB. One of the common ways that it was distributed is as a docker image. So instead of just compiling everything for the, for baseline or for saying, hey, we've made 500 different images based on what your actual machine you're running on, is we said, okay, we have a list of, in this case, five libraries. Go ahead and actually replace all of their occurrences in the graph with these ones. And then when you go to run it, you get all of the benefits of using the faster libraries. So, you know, just back to the, is it worth it? So this one I hadn't really planned on getting into so much. This one was a blog post from last month. Somebody had gone and rebuilt parts of Arch Linux with, for x8664 v3. And they have, so yeah, so the claim was that it was 10% performance improvement. So the other part that's here that, I guess, yeah, not quite cut off is that the rebuilt one on v3 was also built with o3 flag versus the o2 flag that Arch Linux uses. And then they just went through a couple of programs to see, you know, is it actually faster? What type of speed benefit can you get? So negative times are faster, positive times are slower. So in this case it was, you know, compressing the kernel was faster, decompressing was slower. Flak was faster in all cases. Gawk was, you know, toss up. Gzip was slightly faster, but that might just be o3. LZ4 was slower. Python was slower. R was the same. Forbus was faster. And, you know, XZ, basically the same. Find decompression was faster. That in general, you're still left with a couple of packages here and there actually benefit from having, uh, faster libraries. It's not going to keep people from saying, I want everything to be compiled faster. I'm going to go back to the fun role loops and, you know, you have to eke out that extra little bit of speed. But, you know, it's, you know, yes, it's a, you know, from the distro side, some of it becomes a how much time do I want to spend actually, I guess maybe not specifically supporting the different options because, you know, I just send it through and it gets built and it's done, but how much time do I want to spend building four copies of everything so that I can mush them all together and, you know, expand the size of the final library? Um, what did I have over here that I... No, I thought I had a thing right there. Uh, so I guess the other part that we had with this was that, you know, this one, you know, this assumes that, I mean, first of all, this works well for GSL. Uh, I could go and change it from, you know, GSL HWABI is the name of the function to make it more general and then start passing it other, you know, pass the name of the actual library to inherit from and all of that. But, you know, the other part of this is this assumes that passing just C flags and CXX flags are going to actually go and produce the binary, the optimized files that you're looking for. And, you know, not always the case. Sometimes you end up with ones that need extra CXX flags or you need to go and manually add them in anyway or they're hard-coded and they need to be substituted out. And, you know, going back to the, do I, you actually want to go and support every single package becomes a do you want to go through the entire archive of all of the packages for something that may or may not actually make a difference on all of the libraries? Oops, that was too far. So, yes, yeah, are there any questions? Any comments? Okay. Okay. Okay. Thanks. So how far have you gotten in implementing this? Is it just something you've been experimenting with or something that's actually working? I've mostly been experimenting with it. Some of it is, you know, I don't want to actually, don't actually want to build everything multiple times. But the size increase, that's the part that I thought I had, the size increase on GSL, it went from, I think it was 5.5 megabytes to about 18 megabytes by adding in basically the four different copies of all of the libraries. So it really becomes, you know, say for Vorbis or for Ag or for specific libraries that we know are going to make a difference. It really makes a difference. For other ones, you look at it and say, okay, you know, LibxUL runner for Firefox, that's 100 megabytes. And it's a long build process. Maybe we won't do that one. Okay. You said that there is support for PowerPC, Power9 and Power10. And what is this older variant? So PowerMicroVod or Power7? The... They are all 64-bit PowerPC variants. They are all PowerPC, they are all the 64-bit variants from the actual G-Lib C, where was it? From the actual G-Lib C source code, these are the only directories that are currently... I think this was accurate as of 2.38. These are all of the directories that are searched for additional libraries. So I think Geeks right now targets Power8 as the baseline. So a backboard would be needed for Power7 and for MicroVod. You have to get it into G-Lib C to have the... I mean, I guess if you compiled the distribution for Power7, then you would have support for Power7 there. But in terms of having the special directories, currently there's no support in G-Lib C, although I suppose it could be added. I might investigate later if I have hardware or buy hardware, because I'm interested in the PowerPC notebook project. And they have much more modern hardware compared to anything Apple used, but it's still older than Power8. I've tried some benchmarks yourself with some of the applications, because what you showed with the microscope, those were mostly SIMD instructions, which were optimized. So I think everything which uses those can profit from the knowing that there's like a different kind of vector extension. I ran a couple of benchmarks. Most of them were inconclusive. The one that I actually noticed the biggest change in was LZ4. I actually compiled one for X32 for the 64-bit, X8664 with 32-bit instructions. I think the claim in general is that it's supposed to be up to 40% faster, and I found that the LZ4 benchmarks were 5% slower. So other than actually being quite surprised by that, a lot of it really seemed to fall into the, is it just hot cash? Is there something else running in the background? Is it actually a big enough change to be worth it? I don't know if that's fine, but thank you very much for your time.
Releasing a Linux based OS: an overview of Flatcar release cycle
All right, everyone. Welcome to the next session. Just the usual housekeeping. If you're leaving a little bit early, these chairs are fairly long, so don't do exactly that. There's going to be some good sessions here, and we'll have some time for questions at the end. So let's get started. Right. Thank you. Hi, everyone. I'm super excited to be here with you today to talk about FlatCut, to talk about releasing Linux based OS. And I hope you will learn new things. I hope you discover things. And yeah, if you have any questions, I'll be around for the rest of the day. And I'll be available at the end of this presentation to answer your questions. Before going further, I will quickly introduce myself. So my name is Mathieu. I work as a software engineer inside Microsoft. I'm mainly involved and principally involved in the FlatCut development and every features regarding FlatCut. So for example, I'm involved in the cluster API fields. I'm involved in testing the operating system, building the operating system. And what's matter today is releasing the operating system. If you are here at this talk, I assume it's because you have maybe some knowledge about FlatCut. You're already a user of FlatCut. You just want to discover and you're curious about this operating system. So let's have a quick look of what is FlatCut. So FlatCut is a Linux based operating system. It's designed to run containers. So you only have the bare minimum in your operating system to run containers. The goal is to have the less package you ship in the operating system, the less surface attack you have on your operating system. So that's the point of having this. This operating system benefits from automatic updates, which means once you've deployed your instance of FlatCut, it will get automatic updates from the release server and each release is done every two weeks approximately. So you can be sure to have a new version of FlatCut every two weeks. And finally, this system is immutable, which means SlashUSR is in read-only mode. You can't write anything in SlashUSR. You can't install any package. There is no package manager. There is no APT or whatever. So that's a few difference from a day-to-day operating system. FlatCut is already designed to run containers and nothing more. So the question is, well, just to show you inside the box, so I tried to write something on SlashUSR. It doesn't work because it's read-only. That's normal, even in pseudo. And try to use the package manager, the command.phone for each one of them, because that's the goal. The idea is that you have to trust the maintainers and what they ship inside the US. And if you need something more, you can ask yourself or you can ask to the maintainers or the communities, or you can try to find a way to install these packages. How do we maintain the system? Because you can't update yourself the package. You can't install any package, so you have to trust the maintainers and the community. So on GitHub, so this is the QR code on the GitHub repository, which leads to this list of packages. Basically, we are security-driven, which means each time there is a new CVE, a new issue with one of the packages shipped by FlatCut, we track it into this repository and we update the package. So for example, last week we've got the RunC and Docker CVE that has been made public last week. So it's already tracked, and when we will release the next FlatCut, so I hope this week, you should get this update closed. So the packages are updated by the security-driven base and also by a community-driven base, which means if one of you wants to add a new package to FlatCut, you can just open an issue. Hey, I'd like to have this package or this package into FlatCut. Is that possible? And if it's relevant for the community, if people are okay to have this new package into FlatCut, there is some chance that this package is being included in the next FlatCut release. Most of the time we try to challenge people to say, can you use Docker Image instead of using this package, or can you use just the binary that you will download from the boot of the instance to get your software. So we try to challenge always in the same goal is to have the less package into the operating system, because the less package you have, the less vulnerabilities you have in your operating system. If you want to know what's going on in the next FlatCut release, you can join the Office Hours, so this is done publicly every month. The next one is on February, and during the Office Hours, we just check the FlatCut release boards and we check which new package will be included in the next release, so you can give your opinion, you can give your input of which package should be prioritized or not for the next release. That's always a great time to discuss between the maintainers and the community about the content of the next release. So that's the release board that's available and public on GitHub. And of course we ship new packages and package updates, but also the bug fixes and new changes and new features into the operating system. Now we are ready to release, but before releasing, I would like to demystify a bit the FlatCut release number, because this is something we've seen quite some time that people are getting confused with the FlatCut version number. So this is a FlatCut version number, and the idea is like Semver versioning, but not really. For example, the first digit, the 3760, it's the days since the first CoreOS release, because FlatCut is a friendly fork of CoreOS initially, so 3,000 days, it's almost 10 years, so it was 10 years since the last release. Then the second digit is the promotion level, so are we talking about alpha release, beta release, table release or LTS? And finally we have the patch or the maintenance level, which is the last digit. So if you have a zero, it means it's a new major release because there is no patch yet done for this release number. So based on this, we can play a small game and try to identify which is who is who. So for example, the first 37602.0 is a new major stable, because you have the zero at the end, which means it's a major release. We have the two, which means stable, and we have the first digit that is just showed how many days since the first CoreOS release. But based on this, who is able to find what is the third, so 3,850.0.0.0, what is it? Is it an alpha release? Yeah. Is it a patch release? No. No, so it's a new major alpha. And the last one, so with all the freeze, 3,0,3,3.3.18, what does it mean? LTS. LTS, yeah. And it's really old LTS because there is a bunch of patch releases. So patch releases means basically kernel updates. Each time there is a kernel update for the LTS, for example, we just update the kernel, the CA certifications, and critical security issues like OpenSSL, but that's it. But yeah, for the LTS, most of the time it's just kernel patch release, so that's why you have a big number for the LTS. So I mentioned alpha, beta, stable. How does it look like across the time? So we have a new major that is done every two weeks. Then from time to time, we decide to promote that alpha to a beta. So that's why we have one that happens to this example. And then after a few times, this beta version becomes a stable one. And eventually it will become an LTS one. So that's quite interesting because you, as a user, if you run stable flat-card release, it means it has already been in beta a few months before landing into stable. So that's why also we encourage people to run beta nodes into their workloads, like so they can identify if there are any issues with their workload before it gets into stable. Yeah, so that's the release cycle. Now, what's the release process and how does it work? So most of the time it's done in four days. We never release on a Friday because it's a well-known rule that we don't want to break at flat-card neither. But basically on Monday, we start to build the new flat-card releases. So we kick off the builds for the new alpha, new beta, new stable, and normally the new LTS. So this is done on Monday. And on Tuesday, we check the status of the builds. Is the CI OK? Is the image been successfully? So yeah, we have a checklist of things to see and to check. And we start drafting and preparing the release nodes because that's quite important when you have a new release. It's to communicate to people that there is something new and that would be interesting to know what's inside this new release. So yeah, on Tuesday, we start drafting the release node and Wednesday, we have the go-no-go meeting. So this is a meeting done publicly on Matrix Channel where we discuss about should we actually go forward with the release. Are we in a good shape of a release and can we move forward? So this is the go-no-go meeting. So basically, we just check is everything green in the CI. Are the release nodes correctly prepared and everything is good on the CI? And yeah, we decide to go or not to go with the release. Then we have the actual release, which means we are going to take the new images to publish them on the release servers of flat-card and to generate the new payload because as I said, flat-card is going to get automatic updates. So we need to generate the payloads to get them downloaded by the current running instance. And then we have the announcement. As I said, it's important to communicate to people that there is a new flat-card release. And on Thursday, we have the marketplace release because flat-card is supported on multiple vendors. So we have the AWS, GCP, Azure. So we want to publish flat-card images on this marketplace. If we check the release process for Monday, so one of the flat-card engineers will start the build and he will publish the links. Then on Tuesday, we start to preparing the release nodes. So this is, for example, for the last table and there is some nodes. For example, there is Flakitest with Calico C&I on Digital Ocean. So we try to identify is it our fault because of the test framework? Is it something really critical? So sometimes we have to stop the release because we have an issue with the new kernel that has been identified with the test framework. So this is the kind of nodes we can take during the release process. And after that, we have the Go No Go meeting. As I said, it's done on metrics and everyone, contributors and maintainers are invited to say Go or No Go for the release. So it's a ping into all channels across all numbers. And yeah, so people can give feedback on the release status and we decide to move forward with the release. And when it's done, we actually have the release. It's available. It's on the public website and we communicate on Slack, on metrics, on Mastodon. But there is a new release available and please update and give feedback on the release. And finally, we have the Marketplace update. So this is an example with AWS update on the Marketplace. So what's interesting with this process is that the community is involved at each point and always. So nothing is done in secret or whatever. Every time you can give your input, every time you can see what's the status of the release, are we close to have the release to be done or are we far to have the release to be done? So for example, the checklist of all the best items are on public GitHub issues. So you can easily see where are we during the release process. The release notes are drafted on a HackMD document. So you can browse the release notes and start to write and send some comments on it. And the public discussion are always on metrics. So regarding flat-car release process, it's always on metrics, but also for the public discussion of flat-car development. So every decision regarding flat-car is done publicly on metrics. So there is no, as I said, secret discussion. The only thing that is still secret is the build for now because we still have some credentials for the various cloud providers. So ideally, we would like to have it in writtenly on Jenkins, and people can just see the logs of the build and see how things are going on. But it's not done yet. What we've done is that now if you open a pre-request against flat-car repository, it will start the build on GitHub action. So you can see your logs and you can see if something goes wrong, if the CI is OK. So it just build a QMU image and run the test on the QMU image. But for the release itself, it still relies on Jenkins, but eventually, we'd go public using GitHub actions. And I think that we'll close the talk. So if you have any questions, I'll be around with some flat-car teams, remember, for the end of the day. And thanks for your attention for this Sunday afternoon session. All right. Nice break-up question. Great. What's the elevator pitch for using flat-car above Fedora's offering or micro-OS from Sousa? Well, micro-OS and over-operating systems are quite similar, but flat-car has some multi-vendors, for example, or you can use it on premise, on bare metal, on different cloud providers. Also, you have new features that we try to merge into the flat-car operating system, for example, system DCS-X or other things that we try to leverage. And also, there is this, we try to do things upstream first. We got this talk about upstream versus downstream before that, but that's the idea. We try to go upstream first. So each time there is a new feature, we try to first implement it upstream before trying to solve it downstream. So we try to be more on the community side and try to fix the things on the upstream and not really on the downstream. And then speaking of fundamental differences, for example, with micro-OS, you don't have the same mechanism to provision the instance. For example, with flat-car, we use intensively ignition or afterburn, which is not yet available or experimental on micro-OS. So this is the kind of difference you can see. And if I recall correctly, I'm not sure if micro-OS is using REST 3, but this is the kind of functional features. You can see the difference, but in Vienna, it's the same purpose of operating system is just to give the user an operating system to run containers. That's it. So as it's open source, you have the choice of which solution you want to use. Thanks, Moomo. Feel free to comment this. How much has changed or has it been noticeable since Microsoft took over or the acquisition? Thanks for asking the question. So for the short story, Flatcar was developed initially by Kinfolk, which was a company that has been acquired by Microsoft two or three years ago. And I'd say it didn't change a thing for now for the development. The governance has always been on the community side, community-driven. And I'd say it's even better in a way because now we can totally be dedicated to this operating system and to the support of the operating system. And recently, like a few months ago, six months, something like that, we started to look into a CNCF incubation. So basically, we would like to have Flatcar find a new home at CNCF. So there is an open issue on the CNCF tracker so you can see the status of the incubation proposal. But in terms of governance, nothing's changed, and we're still dedicated for users to get the best Flatcar experience on any cloud providers. Thank you. Over the question? Yeah. Matthew, thanks for the talk and for the distribution, the idea, everything. So I'm not familiar with the project, so I'm attending just to understand what's going on. So everything is a container, right? All tools and everything are running as a container. But I'm curious how the kernel is booted or the NITRZ is done. So, yes, that part is a container or not? I don't think so. So Flatcar is not running inside a container. Flatcar is an operating system, like Ubuntu, like Debian, like whatever. It's designed to run container work loads. So you have the very minimum to run container work loads. You have a container one time, you have the kernel modules that face well. So in the end, it's like any over Linux distribution. You have your kernel, you have the boot process, you have the NITROM FS, and then after you have the user space. Yeah, so if I understand correctly, the stuff that's supposed to be previously managed in containers, in traditional packages, is now containers, right? But if a new version of a kernel is released, then how that's distributed, let's say. So if you have a new version of a software, how it's distributed on the operating system, that's the question. Yeah, you just wait for the new Flatcar release, because it's immutable. So if there is a new open SSL version, for example, you have to wait for the next Flatcar release to ship that new open SSL version. Like so you get the update. So that's like pulled from the Internet. It's not like in a format of a package, right? Sorry, come again? Is that in a format of a package of its pulled from the Internet straight? Yeah, it's not in the form, Flatcar is based on GN2 Linux. So the Flatcar itself, when you build it, you take the source from the value repository using GN2 mechanism, then you build the package. Then once the package is built, it is included into the image, which is the new Flatcar. Then the new Flatcar is released, and this is how you benefit from the software update. Okay. So I think my question is also, let's say not only technical, but more on the political side. So history is that this is a folk of CoroS, where Ken Falk started this, right? Then it was brought by Microsoft, but Microsoft has its own, the CBL Mariner. So how does this fit and what is, let's say, this is a cloud based, let's say, the essential client that you have. That this OS has to be used in cloud, right? Yeah, thanks for the question. So CBL Mariner is dedicated to run on Azure, while Flatcar is dedicated to run everywhere. And it's not monetary to run Flatcar on a cloud provider. So as I said, you can run your own Flatcar image on Raspberry Pi, on ARM64 at home if you want to. Or if you have your own, I don't know, Proxmox, we have some people that use Proxmox to run Flatcar. So yeah, Flatcar is really multi vendors and multi architectures. And so while CBL Mariner is really dedicated to Azure and nothing else at the moment. Hi. So in my previous role, we used Flatcar quite a bit for a while. But then we ran into kind of some trouble with AI and especially around things like we're trying to use like Infiniband. And we were actually kind of running into, I think, getting everything set up with Flatcar. Are you all kind of working more towards like AI workloads and making those easier to run on Flatcar? So I'm not at all AI expert. Maybe Remy, behind you, is a Flatcar member. I'm also a Flatcar maintainer and I've been looking at NVIDIA and GPU support in the past. And we want to get better at that. It would be great if someone from the community would also help because I have limited cycles. But it's something I and we care about, right? Just one last question. No one else has any? Do you support different container runtimes? So at the moment, we only ship the current container D. But basically, in a non-official way, you can use Podman using SystemDCX, which is a system D feature which allows you to mount overlay FS images on the base system. So yeah, with Podman, we ship a system DCX in an unofficial way. So you can just pull this Podman extension and load it on the system and have Podman up and running. There is some tracking issue to have this out of the box, of course. Like so you don't need to provisioning and to pull Podman's system extension. But yeah, ideally, we should be able to say, OK, if you want container D and Docker, or if you just want Podman, use this configuration and not this one. But yeah, you can use Podman. Actually, I did some experiments, there are things with it and it works. Cool, thank you. All right, I think we have time for one final question if someone's up for it. All right, looks like there's no question. Thank you very much. Thank you. Thank you. Thank you.
An introduction to Image Builder: building up-to-date, customised operating system images the easy way
Hello everyone. So I will talk a little bit about Image Builder. I want to talk a little bit about how it works, how the stack of Image Builder fits together, show off some of the things it can do. But before all that, I'll try to explain why it exists. So Image Builder builds bootable operating system images, the base images. It runs on your local machine or as a hosted service. So we also run and operate a service for Image Builder. Now building bootable operating system images is not that hard of a problem. You just put a few bits in the right place, some default correct, hopefully configurations. And most of the hard work is done by the package maintainers, the people who maintain a kernel, the people who maintain system data. They make sure it all fits together. But at a certain scale, consistency and reliability are very important. Because you need to build images for different purposes. I'm talking now from the point of view of a distribution. You need to build images for different purposes, for different architectures, for different target environments like AWS, GCP, Azure, local virtualization, bare metal. And you don't want to have these differ too much from one another. These images. You don't want to have them differ too much from one another between those different variables. You want to reason about them and produce them in roughly the same manner. You want to produce and reproduce them often without manual interference. So as part of pipeline. So you need infrastructure around it as well. And because I also mentioned like target environments, specifically the cloud platforms, well, every cloud platform today often offers something like an image builder as well. I guess AWS EC2 image builder. But yeah, then you're sort of like locked into what the cloud provider provides. Or you just end up using their images maybe full stop. So it's nice to have tooling that is sort of cloud agnostic. Or like specific cloud agnostic, vendor agnostic. And as a final point, while image based workloads are also becoming the norm. You know, everybody uses containers. People build images for a single specific workload. And really one of the things for end users that we're trying to solve is to make VM images almost as easy to deal with as container images. So image builder was created to address some of these problems. Right? So this is the stack. And I want to walk you through the stack just to quickly give an idea of what each component does and why it's there. I will start at OS build. The lowest base layer on which everything is based. So at the very bottom, we have OS build, which is a low level tool that executes a manifest. Now what is a manifest? The manifest describes exactly what goes into the image and then also how to package it. This manifest makes images audible because you have a manifest of exactly what's in it and reproducible. Since you have the exact steps that were used to build the image described inside of it as well. So not just the contents, but also like how do I get from those contents to the actual image? It is mostly distribution agnostic. So it doesn't necessarily have a notion of like what makes up a specific distribution. So what is a Fedora? 38 image, right? Doesn't really know. But of course it needs to support some package manner. So it currently supports RPM and Pacman packages. So theoretically, the distributions that it can build are things like Fedora, Arch, CentOS, et cetera. And as a final note, OS build starts from a pristine tree, so like an empty directory, and then builds it up piece by piece. But to clarify this, let's take a little look at a manifest. So I'm going to need to... That's maybe too much. So as you can see, this is sort of what a manifest looks like. First off, you have sources because the content needs to come from somewhere. So this is maybe a little bit. Okay, so these are just RPM packages indexed by their checksum, so that they're nice and addressable. Now here you can see three pipelines. I will not go through all of them, it's just like very quickly tell you what's inside of this manifest. So the first pipeline actually builds a container because we built the end artifact, so the end image inside of a container. And the reason that happens is because you kind of want the same tooling that will end up in the image building the image itself. So like the RPM version in the image, you kind of want to install those RPMs using the same version because otherwise you might get into a mess. So then the second pipeline is the second pipeline that actually puts all the bits in the correct place in the tree. So first it sets up the kernel, like I think it's proc kernel command line, then an RPM stage, so it installs a bunch of RPMs, then it sets up the locale, hostname, things like this. It relabels the tree, so as soon as it's happy. And then in the final pipeline, it actually goes ahead and packages the image up. So in this case, it just ends up being a raw disk file, right? So the most basic disk file and also like it sets up the file systems and stuff like this. So I guess just a takeaway is pipelines stages very precise way of describing what exactly should go into that end artifact. So now I need to like, yeah, go back a bit. Okay, that looks better. So, but like I said OS build doesn't actually know what makes up a specific distribution. So we have the image definitions. The image definitions contain all the information needed to describe an image of a specific distribution of a specific architecture for a specific target environments. So it describes things like the base package sets, the default configurations, how exactly architectures differ, you know, like install these packages if it's arch, install these packages if it's x86. And like all the differences between those different, you know, that whole matrix of like architectures, target platforms, that's contained within the image definitions. It integrates tightly with Composer. So if I don't know if you remember the stack from earlier, now we're at the gray layer. So we had OS build images and then Composer. And Composer is really the part that brings it all together. It takes user input in a format that I'll get into shortly. It takes package repositories, just as a source for the packages, like, you know, the kernel needs to come from somewhere. And the aforementioned image definitions to generate the manifest, which is then provided to OS build. Right, so Composer takes all of the input from the necessary places, generates a manifest, and then OS build executes that manifest. Because like I, like you probably saw, there's a manifest, they're okay to read, but if you had to write those by hand, that would not end up very well. It's a tiresome job. And as a final point as well, it also orchestrates all the builds. It manages the job queue and workers, which you know, like you can queue image builds and then, because image builds takes a long time often. So, and that's just an important point to be able to run this like in a host, as a host service, as part of infrastructure, etc. So okay, what does, what does Composer need? And as you can see, it's already a lot simpler than the manifest that we had before, right? So say that I want to build a Fedora 39 image. Okay, so I just asked for a Fedora 39 image. I want it to be x86, QCOW. I provide it with some repositories, which are just the default Fedora 39 repositories. Optionally, some customizations, right? So like in this case, I also want to install cockpit. I want like the base system with cockpit installed. And that's it. That's all Composer needs. It will take it, grab the image definitions, figure out what the manifest looks like, and then pass it to us. Okay, so how to make this even easier and like how do we actually like give this to, to end users? I will walk through like two tools that we have. And then I will show off the hosted service that we have running. I just realized I didn't spin up a VM, which I need to spin up. Okay, so Composer CLI. So Composer CLI takes this format called a blueprint. Now a blueprint kind of describes, again, like how to customize the image. As you can see, there's nothing anymore of like which architecture you want, which distribution you want. This is intended as like the on premise tool, and it's all inferred from the host. So like if you're running on a Fedora 38 system, it will build Fedora 38. If it's an ARM system, it will build ARM, etc. All that's left is customizations. And as you can see, it can be quite powerful, right? So like here, what happens is, I source this from a colleague of mine. But as you can see, it puts in place a system, the a little system, the service, then it asks ImageBill to also enable that system, the service, and that system, the service sets up a second disk on boot, right? And also it embeds a user. So yeah, what's left here is just really just the customizations. And this is how you would then like push that blueprint. The workflow is a bit cumbersome, but like you push the blueprint, you start the blueprint, you ask for the image type, which in that case is a Q-Cow. So important point, you use that same blueprint to build a Q-Cow, build an installer, build a cloud image, so that you get, you know, you really just have to specify what you want is common. And like ImageBill would say, care of the rest. And there's also like a little cockpit application. So can I, oh I need to, if it went to, ah there we go. So this is a cockpit application which allows you to define blueprints, build images. It's again targeted for on-premise use, as you can see, I probably should have removed this question. But yeah, so like it allows you to define those blueprints I showed earlier and build images from them. I think if you click, right click here, and you can see like some of the, okay, like the blueprint, like an output type, so there's a bunch of output types that we support. Yeah, voila. So now, okay, so actually the point why I actually mainly did this talk was, which is, so we also run as a Fedora, as a, as a hosted service, and for a specific hat company we've been running a service for a while now, but we are also figuring out like, okay, how do you actually run services, like software as a service for the community, right? Like how do you involve from our community? We want community users, we want community feedback, they often use software in like different or interesting ways. So yeah, that's what we wanted to offer support for that. So if you go to console.stagels.fodora infocloud.org, so it's currently just a staging service, but production is coming soon. It will, it will tell you like, okay, this is how you like, if you make an account there, then you can use the API in that way, etc. Currently it's just an API, but UI is also on the road now. So what's currently supported in that staging service, staging service, it's KVM, BeastFair, AWS, and like, on the right hand side, there's all images that we currently build, but we just need to set up some service, some stuff in the service and enable them and expose them. Also just 8x86 for now, but Arch is relatively easy to add, and in a production service, we'll definitely have Arch as well. So what does a request to this hosted service look like? And at the very start, right before the talk started, I actually sent off this exact same request to the image building service. So this is what a complex request, more complex request looks like. So there's the distribution I'm asking for Fedora38, X86, it's a QCOW2, don't mind the name, it's a bit weird, but QCOW2, please upload it for me to AWS S3 so I can download it afterwards. We host, like, we share images with a pre-signed URL, which are like valid for a couple of hours. So, and then it comes, like, then we get to the interesting bits, like, so can this hosted service, for which you don't need to, like, have any setup on your local machine. Also, integrate with other Fedora services that are currently available. So perhaps some of you know, copper, you know, like, very easy to build your own RPMs, you just upload your spec files and sources and it will go ahead and build your, do everything for you, like, the difficult bits for you, so, and host your RPMs even. So here I'm asking, like, okay, can you also in FCE, YAM repose, whatever, like, make this repository available so that I can install stuff from it. So that's this customization. Then install these packages, right, like, I want copper, I want firewall D, because, I don't know, and then NPM as well. Why? Because I have this awful startup script, which installs revealMD, which is, like, the thing that's running these slides now, installs demo slides, which come from this demo copper up there. Change is to root, runs revealMD, yeah, just it runs, it's a VM that runs slides, yeah. And then the second thing is, like, okay, set up a service for me, which, you know, it's a one-shot type of service, it runs that startup script that's defined above, and, yeah, run it after the network comes online, because I want it, it's a server, it's a service that also starts the server, right, so, and then I want to enable the cockpit service, I want to enable the revealMD service, so that's the little service that I defined above, and then, yeah, for firewall customizations, I want to open this specific port, because that's what revealMD listens on, and I want to enable the cockpit service as well, right, so, like, when the machine boots up, cockpits enabled, and these slides will hopefully be hosted now. Right, so let's go back to the terminal there, so, as you can see, so that's the same request, and I sent it off earlier, and it was building, so let's see what happens now. I hope it didn't fail, because that would, okay, no, so like, the build succeeded, right, as you can see, this monster of assigned URL, which is technically a secret, it's valid for six hours, so, yeah, you can just download it, and, yeah, I mean, that's it, it's really that easy, if there's like a whole bunch more customizations available than the ones listed here, but once you've read the spec a bit, you figure out how to write a JSON file, it's there, and then you can get a whole bunch of images out of it for free, so, let's run that image, right, so like, this is how I'm running that image, you can't see that, so, this is how I'm running that image, okay, so I'll just go ahead and do that off-screen, how am I for time, yeah, good, okay, so I've now booted it up, okay, so I asked it to install cockpit, and, okay, yeah, of course, yeah, just sort of, just sort of, okay, so let's just take a look at the networking, I asked it to like enable the cockpit service, and also, explore this additional port, it has done that, super, then I go to services, so these are all my, my system, these services, I look at revealMD, okay, so that's, that's still starting, and it might take a while because it's actually installing an npm package over my phone, so, hopefully that will do something in the meantime, I think in the meantime we can also maybe start with a question, if already somebody has one, but I want to show like a little thing at the end still, so, but for that I need this to, this to kick off, it's really the most exciting talk ever, isn't it, we all started staring at like, at some logs, so, thanks for the talk, I have two questions, so the first thing is to understand the architecture, the composer thing, it's like a demon or service, right, that's it, yeah, so the composer is the thing that runs as a hosted service and orchestrates everything basically, okay, so I understood that correctly, the second thing is that very expected question, it's not really creative, so how hard would it be to, to, to support like, let's say Ubuntu or, maybe I know, let's say something like that, that's very expected question, thank you, so I've actually experimented with this a little bit already, so like first we would need to add support in OS build, right, and then you would need to add an image, so just to install dev packages and like set up like, what's the thing, the bootstrap, you know, and then you would need to add an image definition, okay, but what is an Ubuntu image, and then you would need to add it to composer to expose it a little bit, okay, but first you'd need to, yeah, write some stages that can handle dev packages, and, and I've been, I've experimented with it before, I've got it to the point where I, we can build a bootable Debian image, right, not UEFI, so, but like, you know, we've got it to that point, it just, it requires some more work in cycles, like theoretically it's, it's, it's, it can do it, right, like it's, it's just a matter of some work, yeah, I'll answer your question, oh yeah, and in the meantime, let's, like, look, it seems to have done something, so, all right, great, yeah, slides are up, so let's go to that, oh yeah, so this is the most efficient way to, to host, I think, you know, image-based workloads, single slides, this is where you can find us, so like, yeah, this is our GitHub project, we have a matrix channel, please can say hi, and then we have a website, and so if you go to service, um, so service and then for door, you can read a bit more about how the architecture fits together, and yeah, if you go to the Fedora service, you can find instructions on how to do it if you want, please do, there's currently only two workers attached to the staging service, so if, if it, if it's not DDoS by like two hours from now, I'm going to be disappointed. Thank you. All right, any more questions? So, um, I'm a bit hazy on the architecture, so maybe the question, it doesn't make much sense, but how much work would be, um, you know, how feasible is to do all this locally, like, for example, for, I want to build a distro, then tweak something, then build another image and run it locally in a tight loop, so that it's like the whole process that you described, starting from the definition with the local overrides all the way to the image that you have downloaded and built locally, can you all do it on a laptop on the plane? Yeah, yeah, so like this, um, so the first, uh, let me go back a little bit, so we have a copy composer, now it doesn't, um, so this is essentially the same thing, right? Like you, you can just, or maybe I'm not answering your question, but this is essentially like, uh, you can do it all on, on your own laptop. Of course, your own laptop can do cross architecture builds, um, currently, uh, but, yeah, is that? So basically I would run the service locally and then talk to a web server on my, it's a unique socket in this case. You can also set up the, the service of, um, locally, but it's, it's not necessarily supported very well, uh, but it's all there. It's, it's all, when you install copy composer or, sorry, when you install, um, OSBIL composer on your fedora machine or whatever, it's all, it's all shit. Thank you. Yeah. Hi. Uh, yeah, probably an annoying question, but so, uh, under the soon, there was like, uh, ISO installers. Can you be, can you be a bit more specific about what soon means in this, uh, in this context? Because that would solve like a use case that I have. So, like really creating a bootable USB drive that a, a technician can plug in and, yeah. So, so, uh, yeah, you're right. So like the ISOs here, sort of installer and live installer are, are absolutely like, in tennis or bare metal. So those are like that you would burn onto an USB or DPD or a CD and, uh, yeah, plug in and, and, uh, they have like the, the fedora, like the anaconda install around them. Um, yeah. And is there like a specific fedora release that's being targeted or? Oh, it's been, this has all been fedora in, um, since fedora 34, 34. Okay. So it's, it's been there for a while. Okay. But we're still like, we're still actively working on it and making it better, like more customizations like that files customization where you can like just basically have like an entry point as like a, you know, just like a one shot system v service. Um, that was, it's more like a recent thing that we were trying to do. And yeah, there's more customizations that I think you can set up your file system a little bit like partitions and stuff like this. Okay. Then I'm missing the suit. Thank you. Yeah. But try it. Yeah. I think we have time for one more. So maybe I misunderstood something, but the whole part of the demo, everything was done locally, right? That you showed us on your local laptop, which part was on the cloud? I thought everything was local host. Everything was local host, except when I switched to here, like in the terminal, what, so right at the start of the talk, I sent a request, um, to wait, hang on. Oh yeah. Here up top here, I sent a request to the fedora. This is this fedora staging service that we're running now. And that is building an image and then that's spit out a URL. Now, of course, I didn't download this one. I built the same thing earlier and downloaded it then to run it now because it's like 700 megs and the internet is not that good. So, um, but like, let me just show, uh, so like, this is like, uh, yeah, the slash composes endpoints shows you all of the composes that I've done. Um, and it's basically all the same thing for this talk. So, um, I think I might have downloaded this one like one before or yeah, I mean, so, so this, this one, like the actual image that the VM that I put it up was not built on this laptop that was built on this on the cloud. Yeah. That I can also, if I have a powerful machine, I could also do it locally, right? Um, so do you mean like, uh, if you want to build the image locally? Yeah. Yeah, you can do that. So that's, that's, um, that's done with, uh, with cockpit composer, I think is the easiest way to get started. Yeah. So that's, that would be a cockpit, cockpit app is I think they're called. Yeah. Okay. Thank you. Yeah. You're welcome. I think we're done with time. Thank you.
2023 in Chimera Linux
Hello, everyone. I'm Daniel Kolesa. I'm the author and primarily developer of Kimmer Linux. First, in the first half of the talk, I will give you an overview of the project in case there's somebody there. There's probably many people here who haven't heard of it or don't know what it's about. And in the second part of the talk, I will sort of give you an update on what happened during last year. So first, what is Kimmer Linux? It's a general purpose Linux distribution primarily geared towards client computers such as desktops, but also others. It's built from scratch, not based on any other existing distribution with broad focus to it. And it's based on free BSD core tools, the muscle ellipsi and the LVM tool chain as its system compiler. It's hardened when it comes to tool chain security more than most distros and it's also very portable. Currently supporting official binary packages for five architectures, this includes R64, both big and little and in PPC64, Risk 5 and X86. It's based on binary packaging with APK tools, which is also known from Alpine Linux, but we use the next generation of APK tools, which is currently under development and not used by any other distro. It's a rolling release system, so packages are deployed continuously and there's no fixed releases anywhere. There's custom built source packaging infrastructure, which was written from scratch specifically for the distro. And it's generally meant to be lightweight and pragmatic, so not really focusing on one specific thing like many niche projects, but also overall they're still... I'm not really trying to make it to be like anything else. We also don't use system D, but we are trying to not to be militant about it because I kind of really hate doing that, so you won't find anything like that there. Why make this? Well, I want to make a well-rounded system I would personally enjoy using and also make proper use of LLVM's features, which they give us over GCC, for example, when it comes to security hardening. I'd like to improve software I'm unhappy with and overall focus on robustness and having a deterministic system which will always install the same way and work out of the box, but still being transparent and simple. And also good to default for things. As for some core principles of the system, I believe that some sort of technical purity is mostly counterproductive and makes things in general worse, so we try to not assume too much and sort of just make a flexible thing. I think that minimalism doesn't really mean anything by itself, so I try to focus on other things instead. I do believe that the system should be simple and it's better than being complex, but I also do believe that if complexity is necessary, then it's necessary. It's better than being complicated and requiring tons of setup. I think that development should be opinionated, but I also do think that dogmas are mostly bad and what we should actually focus on is a good software design. In general, we try to be inclusive, open and accessible. Also it's free software, so it's managed like free software. I think it's important to have fun when it comes to free software communities, so we try to basically make sure everything goes towards that. I don't know that mostly anything goes. Now let's get to the system design. I think it's good to always be strict by default. That is try to make sure that people always do the right thing, make sure that things are as strict as possible when it comes to, for example, when building software, having proper lint for everything, having sandboxing and so on. There should always be at least one good and obvious way to do things. I do believe that portability is extremely important as well as security and both of these should be preferred over having high benchmark numbers. I'd like to have self-sustaining tooling which can be retargeted to different environments so that when we need to switch to something else, it's actually easy to do and we do not get stuck with some weird old software. I do believe that the system should be atomic if possible when it comes to package transactions. To this end, I've been working on getting rid of things like post-installation hooks and so on instead making sure that as much of the transaction is atomic as possible so it doesn't get interrupted and could be rolled back. I'd like to encourage doing good and if something is written in a shell, it should usually not be written in a shell. I think there's always room for improvement and nothing is ever good enough. I also do believe that even though we are not using system D, it's not the devil so it's fine to get inspired if needed. Now let's get to packaging infrastructure. We have a custom system called C-Ports and it's implemented in Python. I started this in 2021 when I was still using Void Linux on my machines and basically I was unhappy with Xbpssrc being a massive pile of bash scripts. It had many drawbacks such as being slow, not having sufficient sandbox, being insecure so you couldn't really trust it to do some tasks. So I sort of redesigned the whole thing in hopefully a better way. It's sandboxed, it's very fast so you can introspect all the metadata and so on in real time. It's easy to use and it's optimized to minimize the amount of effort necessary for a small team to maintain things. So right now we have about 1600 packaging templates and we maintain them for all these architectures and there's only a few cometers. It's extremely important to actually have a tool link which will catch many different issues for you because otherwise it just becomes too much. So we have things like nightly update checks to stay on top of things and heavy linting for everything, heavy sandboxing, everything is containerized so all the software is built in a sandbox which has no network access, a read-only file system, sanitized defaults, deterministic installation of dependencies and so on. At the same time it makes it very easy to add new packages if necessary. Now let's take, this is how the C ports templates grew during last year. We started at around 800 and ended at some 1500. This is the comets, like overall comets in C ports. So we pretty much doubled everything during last year and this is how offers of templates or comets grew. So we went from around 10 to almost 50. Now let's get to the updates. During 2023 we released an alpha version which means the project is now ready to take on some users if they are feeling a little bit adventurous. The repositories are receiving major expansion constantly with new packages being added pretty much all the time. The system is usable as a daily driver with some precautions. I'm using it on many computers, many people are using it. There's still plenty more work to be done in all areas. Users are currently expected that if they are missing something in the repos they will add this and maintain it if possible. We have a lot of major software packages, a lot of it being done during last year including all the big web browsers like container infrastructure, office suite, Qt6, Java, different big programs. A lot of smaller programs which are still popular. We gained support for Flatpak which successfully supplements whatever is not packaged yet at least on the X86 architecture. We are still trying to make sure to package things if possible because it makes sure that people can actually run this software on all architecture supported by the distro which is not necessarily guaranteed when you are using Flatpak. There have been tool changes. We are now based on LVM 17. We default to no cement interposition. We default to initializing variables to zero. We have the new Lipsy++ hardening flags enabled. We have expanded Fortify source. We also shrunk the executable size a bit thanks to linker changes. Also since last year there is finally proper infrastructure which includes build bot which automatically builds all packages as comments come in. Somebody merges a request or pushes to the repository. Build bot will pick it up. It will deploy the jobs to all the workers. The workers will build stuff and deploy it back to repos. It is pretty much real time with only the overhead of actually the build time. The infrastructure is very simple thanks to C build itself providing pretty much everything it needs to do. That means it can do all the bulk sorting and so on pretty much by default. Which means what build bot needs to do is suggest it receives comments. It checks which templates were, well it doesn't even do that actually. It just tells the worker find all packages which are not yet present on repos but are present in the C ports. Sort of diff it with the existing state. Then do the sort to make sure that the bulk is built in correct order. This includes transit if dependencies which are not even included in the transaction and so on. And things will just automatically get built and it happens in half a second. We also have a fancy package browser at this URL now. As for low level user line changes we have an initial API for our session tracker which has been created from scratch to properly support user services in the system. We utilize this heavily to, for example, sound server and session bus is run within as a user service so you no longer have the bus launch or the bus run session nonsense like you had on many legacy systems. Instead it's done in the same way as on system D systems where you have one session bus which is started when you log in and then it persists as long as something needs it. We have user line based on free BSD 14. We switched to system DUDEF in place of EUDEF because EUDEF has been going out of date and it's pretty much just worse in all aspects. We have adopted system DETMP files and CIS users in the core to manage all the temporary state and all the system users to make sure they can be recreated as necessary. We also have system D compatible support for BINFMT config files so emulators can be set up in a transparent way and when you install QMU user for an architecture it will install the configuration file properly reload and you will get your emulation. As for service management we are preparing for the adoption of debuff broker instead of classic debuff demon. This will mean fully service driven activation of debuff services which means nothing will, everything will be supervised, there will be no legacy spawn this demon and then don't care about it because that sucks. So for this we are currently working on Libdin CTL which will add an or which already adds an API to interact with DIN from C and as well as over debuff and we will have a debuff broker controller which will use this API to generate if needed ephemeral services you know or like if something already has a service and it will just use that instead of generating anything. As for service management we have new service targets for example for time thing so services which depend on NTP can use this or firewalls can make sure they start before other networking demon starts and so on. The whole early service package got a big overhaul so there's different new helpers for say SW clock which is used on systems with no real time clock in hardware for swaps with CTL and so on and so on so to reduce dependencies on other packages. We have improved support for read only root file system and it should pretty much just work out of the box these days as well as countless minor quality of life improvements. As for hardware support this is Kibara running on a Steam Deck OLED for example which works out of the box. We have LTS kernel 6.6 and stable kernel 6.7. We gain support for Raspberry Pi 5 which is in the same images as all the other Raspberry Pi so we support 3, 4, 5 in one image as was shown before Steam Deck support. We introduced support for Big End in PPC64 which runs on machines as old as PowerMax G5 so that's something people can experiment with. Also we have AMD CPU working on Ampere Ultra R64 machine-sided box which needed some cursed kernel patches which were only present in some trees so otherwise you get garbage in the screen which is currently happening on every other distribution. As for conclusion it's been a very nice productive year. There's probably going to be a beta release possibly in late spring or early summer. The upcoming focus is going to be on service management because there's still more things to improve in there. I'd like to introduce more support for advanced service management features including better support for capabilities, possibly some name spacing, possibly proper config file management. Also session tracking is going to be a major focus. I'd like to expand the turn style API so it can fully replace E-Logindy because E-Logindy has been hugely insufficient for us. The main issue with E-Logindy is basically with SystemD you have your Logindy which interacts with SystemD directly so it can spawn the user session and then do it and it can actually properly interact with E-Logindy. It's mostly geared towards systems which are legacy so they use legacy style service management with system services only and you have no way to properly track the lifetime of any E-Logindy session. This is bad for us because for example we want to be able to ensure that user services can linger beyond the scope of the user session but if we have E-Logindy managing the runtime directory then we cannot do that because E-Logindy has no way to tell it to do this so it's kind of bad and we really need the proper solution. We are already patching E-Logindy to enable some of these things but I'd like to not patch anything and just get rid of it entirely. Also as a part of service management focus I would like to introduce proper logging infrastructure which means something like a journal where you can actually log both legacy log stuff as well as STD out from different services, have a central daemon which will properly deal with things like log rotation and so on in a better way than we have now. I would also like to focus on package management. This will possibly include things like ZSTD support for package compression and many different enhancements made possibly optional packages. Right now we only have reverse optional packages which is nice but makes some things clunky at times and there's a huge amount of work to do but we're going to get to it eventually. Thank you for listening and now if you have any questions feel free to ask. I will try my best to answer them. Yeah? Thanks for the talk. There are many talks these days of maintainer burnout and you know the lack of financing in open source. How is Chimera doing at this point? Sorry can you repeat that? Is Chimera while financed and are you okay in terms of you know load? You don't suffer from, do you feel that you may suffer from maintainer burnout if the project grows in size? Are you asking about financing? Yes. Is Chimera okay at this point? Well currently there's no financing to speak of like it's entirely volunteer driven. There is five server for example for building these five packages sponsored by a community member. The X86 dedicated server used for building X86 packages is paid for by me. I also have my own hardware for PPC64LE. As for other builders for R64 and also for Big ND and PPC64. These are sponsored by Oregon State University as VMs in their open source lab so that's really nice of them. I do not really think it needs financing right now. We are managing quite okay I would say but in the long term sure maybe it's something to think about. Now here. You have the year wrong on your date on the slides. The year is wrong on the date on your slides. What? Oh. Yeah sure. Paying attention. I am curious if you could explain maybe a few words about things like turnstile and E-log in D and what you feel is significant that system D brings to session management that you want to reproduce that other system D free Linux's don't do? Sure. I can explain that. Basically what log in D does in its core is it tracks when users log in and when they log out so it can actually keep track of the session and other applications can check this data so they can inspect what sessions exist, what kind of session type it is, what kind of session class it is, what C group is associated with that, like what processes belong in a specific session and other applications can make use of this information. What it also does is manage seats which means it gives you secure access to devices. Basically since Linux devices do not really have a concept of having multiple users being connected to them, you can ask log in D for a file descriptor to a GPU device for instance and it will give you that and it will hold it in place so that nothing else can ask for it and also nothing else will have physical access to the device node so only log in D is permitted to give you file descriptors to the device. This is extremely important for Wayland Compositors for instance and E-log in D basically provides you this functionality but it doesn't provide you proper session tracking or anything at all and session tracking is also used by say Wayland Compositors or log in majors or lots of different things. For session for seats tracking there's a project called Lipset and it also ships with a demon called CDD which will give you this secure access to the devices but it does not give you the session tracking so basically turnstile is supposed to complement this and make one whole thing which will provide the same functionality and it does so in a way which is very important sort of vendor independent so in addition to having its own solution it will also be simplified API interface to log in D itself which means different software would be able to adapt to this API and have support for both log in D itself as well as turnstile the demon as well as potentially some third party solutions for example for BSD systems which currently has nothing for this and that's why their Wayland support is very limited for example. I think we could take one more, does anyone? So what is the state about software installer? Is this also planned for the BDG? State of what? Of a lacking installer system because of its copy pasting files now? Well as for installer the problem with installers is that they are very complicated to make which means it hasn't really been a focus for now instead the manual installation is extremely simple and you can do it in like three minutes but there is a plan to have an installer eventually it's just not something which has been you know focused for now because it's complicated and it's a big project on its own to do properly otherwise if you don't do that you will end up with bad installer which will have many issues and will be limited in use so it's better to properly think about it and do it right. And a quick second question will be like a rolling release strategy or will be like releases in sync with the BSD user land utilities? Well the distro itself is rolling release we do update to new the user land utilities are using free BSD releases currently at 14 and we do update to new releases as they come out so it does not really sync with free BSD current for example but other than occasional backpacks when something major happens and we want it earlier but in general we stay on top of free BSD releases and use that. Okay thank you. I think yes we are out of time so thank you Daniel. Okay thank you.
Homebrew's Evolution
That's a very nice soothing start to the talk of just people saying shh. As some of you may know, I really like to start talks with raising hands. So put your hand in the air if you use Humbrew. Lots of people, cool. Put your hand in the air if you've contributed to Humbrew. This clump over here will make sense for the next question. Put your hand in the air if you maintain Humbrew. Put your hand in the air if you're concerned about what happens if there's a CV during this talk and no one is able to march a critical PR to fix open SSL. Because all the maintainers are here. Yes, good. Thank you. Yeah, so a little bit of background for you folks. Let's see if this is working. There we go. Oh, sorry. No, this is a Humbrew. We're Mac people here. Okay, there we go. So I forgot, Humbrew doesn't actually support this version anymore. No, back to that one. Oh, there we go. Okay, that's fine. Humbrew supports this one. Sorry, the jokes don't get any better from here. They're only worse. Hi, I'm Mike McQuade. This is my almost becoming yearly tradition at this point. Sort of state of Humbrew talk at FOSDEM. The distribution's ruined kindly. Let's me come and do this here, even though Humbrew isn't really a distribution, but it feels like the least square round peg hold situation at the conference here. You can find me at various places on the internet if you want to talk or ask me things during or after or whatever. I'm currently the CTO of a startup called Workbrew, which is trying to do some interesting stuff around Humbrew. I'll talk incredibly briefly about that at the end with two former GitHub people. I spent 10 years at GitHub, which I left as a principal engineer last year, and I'm Humbrew's project leader, which is something I have to get elected to do every year. No one has ever run against me, so please, someone do that and set me free from my life of enslavement to an open source project that I suffer for. And I've maintained Humbrew for apparently 15 years this year, which is a little bit worrying. So I'm going to talk through some stuff we've done in the last year or so. Some of it may be new to you, some of it will not be. None of it will be used to any of the maintainers. I don't know why they're here, but hopefully they will just laugh at my jokes and stuff like that anyway. The first major thing, I don't know if any of you noticed how many of you run Brew Update or noticed updating Humbrew. Lots of people complain at me about how Humbrew does this automatically without being prompted. You can opt out, but please don't. This should have got, for most people, most of the time, a lot faster than the last year. And the main reason is that we have stopped using Humbrew's GitHub repositories as the main data source for Humbrew. So when Humbrew was first created in 2009, one of the relatively innovative things it did was to use Git and put essentially all the data on a GitHub repo and then instead of building some complex update information system which is going to pull from some server somewhere that someone would have to host, it's like, no, we'll just do essentially just run Git fetch in the background. And Humbrew has kind of had a long-going battle with... Like a little bit of a battle with GitHub and more of a battle with the performance characteristics of this. So Humbrew Core, the main kind of Humbrew repository for all our formula, for all our packages, has kind of grown and grown over the years. Like we've had over, I think, 11,000 contributors, like millions of commits, hundreds of thousands of pull requests at this point. And as a result, it is very, very, very, very, very, very, very, very slow to do almost anything related to Git. And particularly with Git fetch, like a no-op Git fetch was probably at its worst, taking about 30 seconds just to be like, no, actually, you don't have any updates or anything required at all. So when I was lucky enough to be simultaneously working on Humbrew and GitHub, I added a call to the GitHub API that was there specifically to try and make Brut update a bit faster. So you could go to the GitHub API and it could quickly respond like, hey, don't run Git fetch, you don't need to, it's going to be really slow, and you don't have any changes anyway. A few other package managers use that now as well, which makes me happy. But over the years, lots of people at GitHub have kind of grumbled about using a Git repo as a CDN that's kind of nicely global. Globally distributed, and I believe at our peak, we had a couple of GitHub servers that were essentially dedicated purely to people fetching from Humbrew Core. So eventually, after leaving the company, it's kind of weird that it took me to leave the company to actually make my coworker sloppy. We, like with a bunch of work from other maintainers, we kind of moved over to essentially just curling a JSON file off the internet now. So instead, we have like a 15 meg-ish, I think, compressed file for Humbrew Core, for Humbrew Cask. When there's an update, we don't have any sort of clever binary diffing or anything, unfortunately, so we just download the whole thing again. But that seems to be a lot faster for most of the people, most of the time. And we still, optimistically, will be able to make it faster in future. So in case you didn't know, Humbrew has like a JSON API. This is basically the kind of the basis of what we're using. We've had to kind of add some bits and pieces and modify, move things around. And one of our maintainers here added like nice signing to this and stuff like that so that we could meet the kind of security requirements, the performance requirements we wanted for this new API way of downloading. It's actually, our API is really, really fast because it's posted on GitHub pages. So if you've had an idea of like statically building your API, it's incredibly painful in some respects, but also kind of fun in other ways. But yeah, don't dig too deep on how that's implemented because it's pretty disgusting. Another thing, somewhat relatedly, if you have set any of these variables in the past, like commonly people will set these things because Humbrew was updating too often and it was too slow and annoyed them, or shortly after we rolled out the API stuff, a bunch of people opted out because it was a little bit buggy and stuff like that, or it also updates too often considering un-setting them for a little bit. And then if things are still annoying for you, feel free to set them again, but you might have a better time without these than you used to. Similarly, if you still have these reports on your disk, you can now un-tap them and then you will get much more space back and just generally your updating could be potentially a little bit faster and happier and all this type of stuff. The other relatively big thing we did in the last year, not super exciting for everyone, but our analytics were hosted by Google for a very long time. We had a lot of people who didn't like us having analytics at all and I chose to ignore those people because we need them to be able to do our job, unfortunately. But I guess a concern we did hear again and again from people was like, hey, we don't mind you having analytics, but we're a bit concerned with all this data going to Google and if you look at the analytics docs, you can opt out of certain data collection, but that's kind of a line on trusting Google to do what they say, which I kind of do, but I understand not everyone does. So we've kind of now moved to kind of a nice cloud-hosted like EU instance of inflex DB, which means that we're gathering essentially the same data we had before, but we're not kind of tying it to individual users. We don't have the ability to kind of do stuff like capture IP addresses even if we wanted to and that makes everything a little bit nicer. So we've now destroyed all of our existing Google Analytics data and this means that if you want to know what Hummoo was doing or what user accounts were like two years ago, tough luck, but we do have this new analytics system automatically kind of deletes data after 365 days, so this should get us a nicer, slightly more privacy-focused approach in future. And the other thing that has been kind of principal with our analytics is trying to have it. So if people may not trust us with Gather Analytics, I understand that like it's a touchy point in the tech industry with privacy and all this stuff nowadays, but we do try and make all the information we gather public, so we've got these pages like under formula brudo sh slash analytics, various pages of the analytics we gather. We've got a few more things there than we used to be able to have and you can kind of see the download counts, percentage counts, all this type of stuff. And basically maintainers don't have access really to any more information than you do. Like we have a couple, a handful of people can access our InfluxDB console directly, but like the data in there is in such a kind of messy, horrible format that no one is querying that directly. They're all just using the same web pages as you and I might use, which feels like again, from a privacy perspective, we're all kind of on the same page, whether you're a user of Hummoo or people maintaining it. So also, again, another thing to stick your hand in the air for, who considers Hummoo to be slow? Yeah, a few people. Put your hand in the air if you feel like it got faster in the last year. Mostly just maintainers who made it faster, so... It's all right, you still count iValue. So this is a relatively common critique we hear about Hummoo, is it's slow or why does it upgrade all my things all the times and things like that. So we are working on this, this is kind of a background, medium priority thing for us that we kind of considered for quite a while. So in the last year, hopefully, brew update, that's mainly got faster from the API stuff we mentioned before. Hopefully brew upgrades, we've now made it a lot, in certain cases at least, we can now upgrade fewer of your dependencies than we used to. This is a little bit of a hack, but I'm going to talk later on about how we might be able to make this better going forward. And then similarly around brew fetch, some of our maintainers noticed that there was a bunch of work happening there that didn't need to happen. So I guess if you do find Hummoo to be a little bit too slow, then be relatively confident that we do feel your pain and we are trying to make things faster most of the time. A really weird performance optimization we decided to do, considering everything I've said before, is I don't know if anyone who's not a maintainer ever went and clicked around on the repo pages on GitHub, but due to the Git issues I mentioned earlier, a lot of these pages were time out and stuff like that. And another thing that Git and GitHub people who knew a lot about Git have said to us for a while is due to some complicated Git internal stuff that I don't really understand, you have structured the Hummoo repo in pretty much the worst possible way for Git performance. Git apparently really does not like having directories with thousands of files in them, and we had, I think, a directory with 8,000 files and it was something like that, which means you can see it on the GitHub interface because all these operations list in the directory, if you did a Git blame or Git log on this directory, all of those were time out, which meant increasing amounts of the GitHub user interface was just not useful for when you were using Hummoo, and that also contributed to why Git fetch was so slow Git GC was so slow, like opening PRs, like the pushes and the pulls and all this stuff involved, which is like getting really slow and getting slower and slower and slower. We were also seeing more instance with GitHub that GitHub didn't seem to think we're related to this, but I kind of did. So we've now like sharded our repos, so essentially like everything is split into directories based on name, and because we have quite a lot of libraries, so Lib gets its own special directory, it doesn't get bundled in under L, we've done the same thing for Hummoo Cask as well, like again, as I say, GitHub would be wanting us to do this for ages, but we've finally actually done this now, and that now means that on these pages, you can actually finally now see the commit information and timestamps and all this type of stuff, and it makes it a bit more useful for people when it wasn't before. So a more exciting thing for us is, we moved to like using Ruby 3.1, Hummoo, who knew that Hummoo was written in Ruby? It's this widely known thing, yeah, cool. And so Hummoo originally I think was on Mac OS, 10.5 I think the first version, and back then Apple provided like loads of stuff with the OS, including Ruby, 1.8 or whatever I think it was at the time, and Hummoo kind of particularly in the early days tried to use as much stuff from the system as possible and not pull in its own kind of libraries. We still try and do that where we can, but Ruby was an example where Apple said a few years ago that like, okay, we're kind of deprecating the system version of Ruby and Python and I think Perl and stuff like that, and for Apple kind of deprecating this stuff, we've sort of been playing chicken and being like, well, you say it's deprecated, but you keep upgrading it for us, so we're gonna just keep using your version as long as we can, and like eventually kind of went to some Apple people for the last release and were like, hey, the Ruby you supply is 2.6, that's really old, when are we gonna get a new one? And they were like, did you not read when we told you it was deprecated? And we were like, yeah, but, yeah, but please. And they said, no, this time we mean it. So like finally we've kind of, we've always had our own kind of thing we call portable Ruby, which allowed us a way to distribute a kind of a Ruby that you could install anywhere in your system. So it worked regardless of where your homebrew is, and it would work on a variety of Mac OS versions and stuff like that. And that was now moved to Ruby 3.1, so now we have a system where essentially everyone on Mac OS at least, on Linux, there's some configurations where you don't need this, but everyone has portable Ruby now and supplies kind of a nice, relatively new version of Ruby. So this is nice for us, it probably has some, it's had some mild performance increases, and it lets us use like newer language features, makes homebrew easier to kind of maintain, makes it easier for homebrew like Ruby users to kind of not be used to this kind of ancient version of Ruby, and then there's stuff like Surabay and Rubikop and all these other libraries we kind of depend on that were kind of creeping towards deprecating Ruby 2.6, or had already done so. So let's just kind of keep more up to date and stuff like that as well, which is very nice. We've also released a official like homebrew Mac OS package. This is another thing that's been kind of requested for a long time, people have a love-hate relationship. I think homebrew was one of the first projects to do the whole curl this bash script into your terminal, and then we'll install it that way. Who has security concerns about that approach? Almost everyone, good. We're gonna keep doing it, so yeah. All right. But if you don't like that, then you can use this instead. So this is kind of the more standard installation process you would expect, where you get a nice installer and you kind of click through these things and stuff like that. And you should end up at the end with essentially the same stuff, and it prints the same messages for you and all this type of stuff as the bash installer, but you can do this through like MDM tools and things like that. But as I mentioned earlier, I've actually been working on a few little bits which are kind of not strictly homebrew related. So I've been working on workbrew, which is this thing where we're building kind of some close source stuff on top of work, on top of homebrew to try and kind of find this balance where there's been a bunch of things where like the package is an example of one where people have asked for it over the years, some people wanted to get involved and built that, and that's all fine. Whereas on workbrew, there's been a bunch of stuff that people have asked for over the years and I've asked for it, it's homebrew volunteers and they don't want to do it, say okay, well fine, we can do some of this stuff for you for money. So we have our own package here now, which does a few more things than the homebrew one does and stuff like that. Not going to go on about workbrew too much, but if you are interested, go and have a look at our website and there's a little demo of like what we're doing and we're kind of recruiting people who we want to work with on this stuff. So get in touch. But on homebrew stuff, that's looking forward to the next year. So we meet together as kind of a homebrew group each year, so I'm not entirely sure what our roadmap is, we're going to kind of try and decide some things tomorrow, maybe as a group, kind of figure out like what we see as the most important things, but some ideas kind of I've seen flipping around and things that I have and kind of have currently open issues for them or stuff around like handling conflicts better. So there's this kind of ability for packages and homebrew to conflict with each other, that means you can't have either of them installed, sorry, you can't have both of them installed at the same time. That's kind of a pain in the ass, it doesn't really work very nicely, so we're hoping to improve some of that. There's also kind of inherent conflicts between CASCs and formulae. Who feels like they understand the difference between CASCs and formulae? Okay, only the homebrew maintainers, great. So homebrew had this kind of somewhat alternate approach, like the kind of integrated with homebrew, but was kind of its own separate ecosystem a few years ago that kind of merged into homebrew proper a few years ago called homebrew CASC. So homebrew, at least in the official kind of repo, is all about taking open source software. We build it from source, we give you binary packages, and then we ship that to you. Homebrew CASC is a little bit different, that's for distributing proprietary software where the upstream package, well, the upstream supplier of the software provides the binaries for you, and then we download that and install it for you. So for example, Wget might be a formula, because we can download the sources and build that from scratch, or something like Google Chrome, or Zoom, or whatever would be a CASC. So there's some cases in which there are CASC and formula for the same thing, like Docker, for example, is both an open source project that kind of, you get some nice binaries, you can build from source, but also there's like all the gooey stuff and whatever. And if you do, if you install the Docker formula and the Docker CASC at the same time, things get angry and start shouting at you, and it doesn't work very nicely. So that's something that we're probably gonna try and make better this year. Another thing is we're continuing to work on our API stuff, we're trying to make it smaller and faster and consider ways that we can do that to again make that updating experience more pleasant for people to use. The other, also the API, as someone who's kind of been consuming the Humbrew API a lot recently, it's pretty crap. It was originally kind of created in the relatively early days of like, I don't know, 2013 or something like that. And we've just kind of bolted on bits at this point where it's got like six arms and three legs and they're all the wrong shape and it's, yeah, yuck. So hopefully we can have something that's a little bit nicer for people who are kind of trying to integrate with Humbrew to use, release this year as well. And the stuff I mentioned earlier about upgrades. So part of the reason Humbrew is often upgrading everything all the time and people get grumpy because that's really slow, is because we don't have a good way of figuring out what upgrades are needed and when. So historically we had the kind of conservative approach of, well, if there's anything else that's new, that's in your kind of dependency tree, we will always try and upgrade everything every time just to be safe. But then we realized like, well, you upgrade a ton of stuff all the time and then that makes people sad and angry on the internet and all this type of stuff. So then what I mentioned we did last year was we basically said, well, we can kind of infer a little bit from the way the binary packages were built. The binary package was built with OpenSL 1.1.1 and now we have OpenSL 1.1.2. We know that this package doesn't need 1.1.2 so we don't have to upgrade it, yada, yada, yada. But hopefully we actually have like, there's a lot of the kind of bigger, proper package managers and distributions have like actual like ABI which stands for application binary interface, essentially like what libraries you can link again and change the versions without breaking things. They have a lot of tooling around that stuff that we could kind of adopt and similarly like we can have a way, even with our existing tooling to kind of make this stuff a little bit more explicit, which would mean that we don't need to upgrade as much stuff as much of the time. But because we're an OpenSL project, maybe what we do in the last year will be something that we haven't thought of yet, that we think of because someone in this room has a good idea in a pull request or you file a bug report and then that makes us think of something that's smart and then we go and do something in a clever way or you file a really well written feature request that then inspires us to do something cool. So I really encourage you, even if you've never been involved in an OpenSource project before, we're generally, myself excluded, a fairly friendly bunch and we will all try and help you get involved with Homebrew and help you along the way, particularly with something like a pull request, like if you have an idea and you think you can kind of make it happen and you can write some code in some sort of form, even if it's only like 10% of the way to working, feel free to open a pull request and then just say, hey, like this is what I tried, this is what I need help with, and then we can kind of help you along the way. It's often much easier to talk about the code than it is to talk about the ideas about the code beforehand. We're not the type of project where every pull request needs an issue open to beforehand, like we believe in discussing the code whenever you can rather than kind of discussing some abstract conception of what the code might look like when someone decides to write it. So I think we've got a little bit of time for questions now and also if you don't feel comfortable asking any questions in this format, then feel free to ask me anything privately. I'm on Mastodon and Twitter and you can email me and stuff as well. And yeah, thank you very much for having me. APPLAUSE Are there any questions? Oh, all right. Just going to ask, where's the... Oh, the beer costume. OK, so anyone who was here last year, I was wearing a head to toe beer costume because I love my Uber maintainer friends, but they're not always the most organized bunch. And someone posted a picture before Fosdame last year saying, like, here's a beer costume. Wouldn't it be funny we can make Mike wear this lol? And I was like, yeah, basically like challenge accepted. You're not organized enough to make that happen. And unfortunately they were and I had to wear a beer costume. There are pictures on the Internet. Don't look for them. Thankfully they were not organized enough to bring it this year, so that is why I'm not wearing the beer costume. And shame on you, sir, for reminding people that it exists. LAUGHTER Any more questions? Awesome. Thank you, Mike. APPLAUSE You
Writing a consistent-hashing Loadbalancer for the Kong API gateway (ketama principle)
Good afternoon. Welcome to the DNS Dev Room. Lovely to see a full room once again. And I would like to just give the word to our first presenter, Thijs, who will talk about consistent hashing and related stuff. Thank you. Good afternoon. I was already mentioning like quite a full room for such a niche topic. Anyway, I asked her here because I got too many slides, as always. My name is Thijs Freyja, I work for KONG. We company open source gateway. One of the things we do is load balancing. And this talks about like how we implement consistent hashing in the load balancer, what we run into, or works, what doesn't work. So consistent load balancing, like what, why, and how. And then like what does DNS have to do with it. The what is actually fairly simple. You take an outgoing connection, you take a property, and I'm basing the property, you make sure that every single connection outgoing with that property ends up on the same back ends system. Which is fairly, I think from the hash bucket perspective, fairly straightforward. A typical for our situation gateway, typical setup, incoming traffic, cluster of gateways, and then load balancing towards the back ends, each in one or multiple instances. When we do consistent hashing, what we'd like to do is, for example, pick a user ID, and then make sure that that user ID as a property ends up on one specific back end server. And in this case it would be Harry, John, and Mary would go to instance one, Paul, and Joanne go to instance number two. What did I touch? Fat finger. No. Come on. Yeah, next question. Why? Like I said, it's a bit of a niche topic. It is very specifically geared towards cash optimization. Consistent hashing is something that many people know, but this is very specifically to, if you do a lot of hashing, you don't want to have cash misses, especially when you're scaling systems. And for legacy, we have customers using it for sticky sessions, but it's like a bit of a mediocre solution to that. So in this case, yeah, as I already explained, the same users go to the same back end, which means that the data that a back end application needs to retrieve from a database would only be related to those specific individuals. So the cash hits will be higher considering the amount of cash that the system has available. Then hashing. Hashing in general is the basic concept of hash buckets. You take an input, you throw it into a hashing algorithm. C or C 32 we used before. Now we use XX hash fast. It needs to be, have a nice consistent distribution. Then you do a modulo, in this case modulo 10, you end up in one of 10 buckets. That's basically how it works. So whatever you input, as long as the same input ends up in the same bucket. Now let's take this, the 10 buckets that we have and let's extend that and say like we have continuum with how in the print the Kitama library, which is where this was first implemented, they call it continuum. And we're going to have four nodes, each one serving 25% of the traffic of the entire continuum. So everything that we get in and we hash into some bucket falls into one of those in the end four nodes. Now if we now scale the system, we add a fifth node. Now we see that we have 20% capacity added. So what is that going to do with our hashing algorithm? From node A, traffic previously going to node A will now end up, 5% will end up on node B. From B 10% will go to node C and so on. The overall consistency loss 50%, which means that our backend systems are going to have to fetch a lot of data to catch over this 50% because they moved systems. That's the thing that we're trying to prevent. So we have 20% capacity added, 50% of the continuum changed. Now if you look at what does that mean? At the bottom side we have catch-hate ratio, take an HTTP server, 70, 80, 90% HTTP servers can catch that, get there. Then you can see that with a 20% consistency loss, which will be the minimum in our case because we add a server, 20% capacity, so the minimum is always going to be 20%. So if we do that, we have like 50% extra capacity temporarily. If we don't use consistent hashing, we have 50% loss and we go to more than 120% pretty much on extra capacity, which is quite a big peak that we get on top. Now if you lay this out on a ring, and in this case we'll have simplified depiction, 20 buckets, whereas in reality for each expected backend that you need, you would expect to use like 50 to 100 entries at least. So if I would have a four backends, I would use 400 entries on this ring. Now if we distribute, if we distribute the five targets, the four targets that we had, going to do the same example again, the four targets that we had, then we have two relationships. That's the target, the IP port combination, and it's related to one of the positions on the slot, one of the slot, different concept, and it's going to be based on weight. Now we have four nodes, same weight, so each node gets 25% of the available slots, fairly straightforward. Then there's a slot towards the position on the hash bucket, and that's a one-on-one relationship. Every slot goes into one bucket, but it's randomized. So if we apply that, this is the way I would get. So instead of a continuous ranges of AAAA, BBB, CCC, we get a random layout with those nodes. If we now do hashing, basically we see that every hash ends up nicely in its bucket, and as a side effect, we'll get if hashing is not available, because the property is not available, we can do an easy round robin, good by simply walking the ring, which is a side effect of Yawarun. Now if we move into adding the node again, the same thing that we did before, we're adding 20% capacity, what we're going to do, we're going to relieve every, for the slots that we have assigned, we're going to ask every back end to relieve the amount of slots that are no longer to be assigned to them, in this case they were used from 25 to 20%, one slot, and we assign them to the new one. Now if we do that, we can see that the distribution stays the same, and the hashing principle now works, because we have minimal consistency loss. Now instead of having the 50% consistency loss at the start, we now end up with like the minimal consistency loss of only exactly the 20% that we added in capacity. It also has featured that if we have a simple health checker attached to it, it's quite common with load balancers, is that we can actually move on to the next slot, and our distribution will not really change. Yes, we'll have an impact on the consistent hashing principle, and we'll have catch losses, but due to the way it's random layout, we still maintain that the weights are properly distributed, and the traffic is going to be distributed about across all your remaining nodes. Now, that is like the basic concept of how you do the consistent hashing, and how you minimize this hashing loss. Now when it gets really tricky, is when you have to do this in a cluster. In a con cluster, the nodes are basically independent, so there's no shared state, at least we try to minimize, and we just saw that we had a lot of like random stuff in there, distribution layout, etc., and that's what makes it hard. So if we have to build a deterministic layout of that ring, we have to make sure that we build the exact same ring with the exact same layout on every node in the cluster. And that's where it gets hard. So for the random layout, the way we solve this is by using the distribution, the random distribution, but not the unpredictable, the unpredictability of a random number generator by using a separate random number generator, but with a fixed seed, just a constant, so it generates the same thing over and over again on every node, because we don't care about it being unique, we only care about the distribution, that is randomly distributed on the ring. Tracking changes as nodes are being added and removed, we keep a log history. So whenever we add a new node, if a cluster gets expanded, we just replay all the changes, so we end up with the exact same state of the cluster, of the ring, and make sure that the hashing still functions. And then you think we have it covered, and then comes DNS. Because initially we had it laid out to do just IP and port, but that's static, and in all of our customers' environments, it doesn't work. So we have the Kubernetes OpenShift console, and a common denominator there is DNS. So they do source discovery, and then exposes through a DNS interface, that's where you get your backhands. So the solution we had is that we had to expand our balancer to not just take IP at a result, we had to take hostnames. Now we have to resolve them, and then instead of having a single entry for a hostname, we could have like 2, 3, 4, 5, depending on what's in the record. But that hurts. Because if you look at what some of those discovery tools do, they don't set a truncation flag, which means that every node in the cluster is going to get a different answer in a different random order. A console, for example, does that, and the core of it is that those tools try to do load balancing on an infrastructure level by forcing you to renew the DNS records, and then they're going to give you only the information they want you to have. So a console is typically quite often we've seen deployment with TTL0 and no truncation flag. So you get one record, two records, three records, but not all of them. Once we have a truncation flag, we know we can retry, we can do TCP, we get a whole lot of them, and then we can check that our balancer stays in sync. Quite often we see that it doesn't happen and the balancer goes out of sync, and then we have customers complaining, basically. TTL0 means lots of changes and updates. Every change is an update to the balance of the structure, which is like really not a good combination. And then of course there's bugs. I was just, you were commenting on DNS, it's always DNS. And we actually, we had a discussion yesterday and somebody said actually like, DNS is like the RFC is just like the primer, and then the real implementation is there are 100 million different limitations out there. That's what DNS pretty much is. And you have the cater for all of them. So we have customers that set up DNS to only return a single record ever, and an infrastructure team is refusing to change it despite the fact that application teams have difficulties with their load balancing. Amazon has Route 53, occasionally gives you a TTL0, because my guess is it's a rounding bug when the TTL gets too low to return zero, which it shouldn't be doing. Now if you think all of this is hard, try filing a bug with Amazon. That's really hard. So in essence, we haven't implemented implemented it, but to really properly replay everything, we would need to have some central storage where we have the DNS records over time stored as well. So not only the changes in the host names, but also at that point in time, how does it resolve what are the entries we're getting? Because if I add a new node in my cluster later on, rebuild it, and that node gets four entries in the DNS, where the ones that were started earlier got only three, so we'll get the thing is out of sync again. Now if you look at the overall thing, I don't know where we are in time. Oh, we're in 10 minutes left. 10 minutes left. I've been hurrying too much. So concluding, it is, like I said, it's a niche algorithm and it is good for cache optimizations, but it is still, its primary concern is still load balancing. It's distributing load. The consistency is only secondary. And for caching, it makes sense because like with a cache, I mean, you lose some performance, but you don't lose data. So if you change it, you lose the consistency, it just rebuilds it and then you're good to go. Minimize the cache disruptions. The trick is with, it does require some contacts as the hash input. On the TCP level, you could have like an SNI name maybe or an IP. It's like really rough because it also needs the contacts that you need into the hash, needs to have enough cardinality to actually make sure that you're going to hit everything in your cluster. So if you have, I don't know, you say you do on a header, that either says Android or iOS. That means in your hashing algorithm, you're going to hit only two backhands ever. So if you scale and you have five backhands, you're still going to hit only two. If you have to be aware of that, that there's enough cardinality. And as before, DNS is be careful. Make sure that people set it up correctly. If you have it all in one hand, if you control the whole lot of it, you can set whether the truncation flag is set or not, whether it results everything, it gives you all the nodes, all the entries. That's good. If you're not, make sure you check with your other teams that they actually make sure you get the data to make sure that they're not fighting each other. Then there was the comparing algorithms. Like the first two are actually like the more generic ones. The way to round around like everybody knows. It's a good distribution. It doesn't care about caching. So you hit everything with everything you got. These connections are the same, but at least it takes into kind of like long lift connections, which not depending on what you're running or what service you're running might make a good distinction. Really different. Consistent hashing, cache optimization, and it works wonders under the right circumstances. There's a catch to really catch. Least latency, we have that. It's also niche, I'd say. I even wrote it. If you have high variance CPU loads, so take GraphQL, where you can have simple queries and very big queries, then this makes sense. But one thing to be careful of is that you need to have equal network latencies. So if you have all your nodes in the same network, your back ends, you have the same network latency, then it works. But if you have one close by and another one a couple of hops away that has more latency, then the least latency one will actually push the close by ones into starvation. Because they have lower latency because of the network, so we're going to push more load, and by the time the latency of those systems goes up, you're basically pushing them into resource starvation, either CPU memory or whatever, and then they become slower. And that's not an efficient way to run those servers, probably. So there's a catch to it. And I got it really in time. Questions? Thank you, Dyer. Thank you. Plenty of questions. You said that some DNS servers do not send and truncate bits. Would it be possible to just do TCP? Always? To always do TCP? Yes, that's an option. That's an option. If you can configure your client, not all clients can be configured. In our case, the client doesn't, we work with OpenResty. OpenResty underlying DNS client doesn't, it has a bug that prevents us from using it, actually. It doesn't do retries, which is, but that indeed is a good solution to making sure you get the entire, the entire record, all the entries. Yeah. I have a related question, which is, how often do you actually see enough data in a response to truncation of a content issue? It seems like you get hundreds of address port pairs and transfer before you know it. Depends on, depends on DNS implementation and to use the eyes of, of, yeah. So how many, how many times do you actually see truncation happening in those records? I don't have data. I guess it must have happened or you wouldn't know. Yeah. I don't have the data. I know that console does it. Console by default will only report three and it's UDP packet size. And I think it is UDP packet size. I don't know what it exactly is, but I would assume like four or five maybe and then truncation happens. Depends also depends on like name sizes and everything because it has to fit in a single packet. Oh, maybe console just arbitrarily truncates. Yes. Not to do with packet size. Yes. But that's, that is because it's a service discovery tool and it tries to pull the load balancer rule towards it and force you to only hit the nodes that it wants you to hit. Okay. So it has, I think it has for example things like data center awareness. So depending on where the, where the request comes from, it will give you a different set of answers than in another data center. So you're going to hit different backends. Yeah. Sorry, can you repeat the question? For the log synchronization. What algorithm are we using for the log synchronization? Which one do you mean across the cluster or inside a single balancer? Let's go with the balancer. I'll give you answer both. First one on the cluster, we don't have it because we, in the end, we have no synchronization over this shared state of this entire balancer. We don't share it. The only thing we have is the order in which nodes have been added and that's synchronized through a database basically, a control plane. Then inside the balancer algorithm itself due to the way the open rescue works, you basically don't need a log on this. The one thing is you need to be careful that you do not yield in between operations that actually are modifying the data structures. As long as you do that in an atomic operation, you're good and you don't need a log. So I answer the question. Maybe this, you could share the state of the various nodes in the cluster by just sharing the state instead of assuming you get the same input data and computing a state. So then you would of course introduce interdependencies because it would guarantee that all nodes at least use the same tables. So the question is can you use shared state instead of rebuilding it in every cluster separately? Yes, you can. The conflusters are basically independent. The data planes are on their own and they do not share state. They get their instructions from a control plane. That's it. Yes, there is one, there is the alternative option. We haven't implemented it yet. I don't think we will implement this. In reality, we see too little cases where we actually see the things go out of sync and cause issues. Usually you can tweak it by using a longer TTL so there's going to be less updates and you don't need it. So I don't think we will be implementing it in the end. Have you considered using rendezvous hashing instead of consistent hashing? Have we considered rendezvous hashing instead of consistent hashing? No. No. Frankly, I don't know it. I wrote this stuff actually quite some years ago. It was updated by a colleague later on. But I will be looking into rendezvous hashing. Thank you. Any more questions? Check. Matrix. No questions. Almost, almost, almost on the minute. You
DNS for I2P: Distributed Network without Central Authority
Okay, let's do it like this. Thank you very much Peter for all the efforts for the I2P deff room and by the way, do you hear me back there? Yes, lovely. Okay, right. I hope that the sound check is good now and we're not muted anymore. I'm the I2P guy. I'm one of the I2P guys and I'm talking about fully distributed networks with their specific problems. Fully distributed means truly fully distributed. So today we're also talking about systems without any trust involved, at least in theory. All right, hands up please. Who's familiar and is using I2P or who's familiar with I2P? Yes. Oh, I love you guys. That's really awesome. We have one third which is familiar with I2P so I'm really rushing through the I2P part. But then I'd like to talk about my depressive, my depressing last 12 months, which gave me really a hard time with implementing a persistent storage layer based on I2P and I will tell you why and I will tell you about all my failures and problems and yeah, I will complain a lot. No, we're talking about bit about Byzantine fault tolerance and the good and the bad of the past year. Right. Diva, I'm working for Diva Exchange but it's only an association based in Switzerland. I'm sometimes a lecturer at the University, at the Lucerne University of Applied Science, and there I'm talking about microservices and fully distributed trustless systems and stuff like that. But I'm singing nobody's song so I'm really totally completely independent and so is Diva Exchange. So we're not some coin guys or token guys, which doesn't mean that this is bad, we're just not like that. So hello I2P network. I2P is well known as a dark net because the media talks about it as a dark net, which means and we'll talk about it later that it has something to do with confidentiality and anonymity. But at the end it's an overlay network. So we have the existing internet and on top of that we place software routers to pack the traffic into packages, repackaging them, encrypting them and sending them over several hops and routers through the network. And like this we receive a confidential and anonymous message transport. I2P is no storage layer. Whatever you hear about the dark net that there is content stored etc. That's not true. I2P is not able to store content by itself. There are storage mechanisms like the interplanetary file system which is linked to file coin and these are storage layers. But these storage layers do not necessarily feature confidential and anonymous transport. Often they even fail on implementing such a layer. Six, seven months ago we made a study at the Devalor exchange and we were interested obviously in how big is the latency of UDP package transport on the I2P network. And as you can see it's slow, really slow. And that's the price for privacy. Anonymity, confidentiality is not for free. There is a price attached and this price tag within the I2P network is time. It's slow. Maybe but that's a theory and we need to look at the university into it. Maybe with a strongly increased number of routers maybe we can increase the bandwidth. But that's just maybe. I don't know. We have to do scientific research on that. But this is the current state. Now a dark net, an overlay network as I2P has cryptographic addresses. They are public keys and often it's a hash of a public key. And these B32 addresses so long cryptographic strings like up here are not human friendly. I will not talk about so-called triangle at 6.30 this evening in this room. You will have a presentation about this topic which is for sure also highly interesting. But we have these hashes and we need to map them to human friendly names. That's our job which we have to do in such a network. And that's why that's the motivation we need at DNS. But the only thing which I2P really has is a local address book. So each router, each note, there is nothing like a central authority. Each private note has its own look up key value store. It's called address book. So there you have a friendly name like diva.i2p linked to a hash or to a B32 address to simplify things. And if I'm loading somebody else's address book, this is a choke because that's a trusted operation. And within I2P we usually say we don't trust no one, it's trustless. So obviously we cannot just load address books from somewhere. Additionally within the I2P network if you're looking at the specifications and if you're looking at how the network is working today, we do have jump services, we do have kind of like registries, but all these services are again a delegation of trust, nothing which we really want. And as you can see ladies and gentlemen, I'm really critical towards the I2P network. I see the central components which we have within this network and I'm criticizing them. But not criticizing in a negative manner, I'm rather trying to make me as a developer and also the other developers to be aware of these central components. Right. Now Goethe, German, des Pudelskerren, the core of it all. Why are we doing this at diva? Why am I doing this? I want to have a service, a storage service and hence a DNS service which is A, fully anonymous, B, immutable, C, really barrier free. And barrier free is an interesting concept if you start to think about it. A coin, whatever it is, Filecoin, Namecoin, Monero, Bitcoin, Ethereum, I don't care, is not by its definition barrier free because, well, you have to acquire it somehow. So there is a barrier and barrier free in the meaning of device change means you have a very low hardware requirement to enter the network just to drop a name, a raspberry or any other low power device and ta-da, your member of the network and you can store stuff. And if the barrier is that low, by definition, the spam will be high. So we have to think about a cost function but the question is how is this cost function going to look like? We're going to discuss this in a minute and trustless. Again, I2P is built, architected, engineered in the last 20 years as a trustless system. Trustless means I really need to only look at my own node and either my node is right or it's wrong. I don't need to care to whom I'm connecting to because every data which is incoming, I have to verify myself. If I'm not doing the local math, I'm trusting somebody else and that's a bad idea in the context of I2P. Trust. I can tell you trust me the earth is flat. Now we all know that the concept of trust means I'm believing in a wrong set of root data or made upset. It's just invented and as if I'm starting to invent root data, I can prove anything because the root data is fake. I don't like that word. The root data is made up. Now if you're building your system of trust, your system will grow and we know in IT, but actually at least the view I have from my specific scientific point of view that the larger systems are growing, the more problems we do have in these systems because we need to introduce regulation that the trust is not abused. More regulation means later even more regulation and so it gets more and more complicated over time. One of the typical solutions, at least what I'm lecturing about is keep your system small. So base your decisions on math, base your system on math, keep them lean and at the end add a cost function to prevent spam or abuse to be a bit more generic. I2P is at least from my point of view, a network which enables small and lean systems. Right, where am I? 1540. In history, building a DNS on a fully distributed network, the approach isn't new. One of the older approaches is our systems which are based on the hash cash function which was properly described in 1990s and then leads to proof of work systems and these proof of work systems, they created currencies, like we all know Bitcoin. Maincoin then came in and other things which are proof of work. What I can guarantee you is proof of work is working. Proof of work as a cost function is mathematically, at least what we know today, perfectly working but it's extremely inefficient because it's a race. It's a race for the fastest solution. This is a bit trivial but at the end of the day it's a race and this race is inefficient. Now, I always resisted to implement yet another proof of work function, not because it's not working, just I didn't want it. What I also not wanted was the Filecoin interplanetary file system solution which is a validator approach because validator approach means nothing else and Filecoin did its moit. They used DRAN to select validators but they're just shifting the problem from their own system to another system and then they say we're solved but that's not true. At the end you just move the attack vector away from your own system just to open up another attack vector and for me, just for me, currency based, proof of work based or validator based is not really an approach and as I am an economist which I studied, I feel very uneasy, very, very uneasy about non-fungible currencies. There aren't many. A few are. Make your own research and you will find out which are really fungible. The others are difficult. Let me put it like this. Then there are Naive concepts which are very nice, highly performing but at the end you need in the area of DNS and in the area of DBA what we're talking about, immutability and integrity. Right, I want to lose a few words about the CAP theorem because consistency, availability and partition tolerance are a triangle within this CAP theorem and it's said that you have to choose two out of these three. Now some blockchain guys said hey we solved it, now we have all three. At least with Butante's fault tolerance I have my doubts and honestly I do not see any concept out in the wild which really solves that problem except proof of work and we don't want that. So this year and that was part of my biggest struggle. We had to leave and that was the talk I had in 2023 exactly here at this place about democratic Butante's fault tolerance which is developed by the University in Lausanne in Switzerland and also in Sydney, Australia and sorry guys with I2P this concept is not working because and we're talking about fallacies right afterwards about the problems with such networks, democratic Butante's fault tolerance was a fail. So we went as Diva chain into eventual consistency because the big problems in distributed computing known since the 90s are things like we have zero network latency wrong, we have unlimited bandwidth wrong, we have a secure network wrong and we all know that as Diva Loppers but sometimes in the lab we go into a perfect world, dream of something, create something and then in the real world it's not working and that's why my biggest tip for every blockchain developer in the universe tested on I2P. If it's still working you probably done a good job and that's exactly one of the core messages. I2P has that many network transport problems which are the price for privacy and which we want that it's a very good test case, a very good transport layer for all the blockchain developers out there including myself. So what we did in the last 12 months with Diva chain and obviously you'll find it on GitHub, we created a transaction based system which is barrier free, immutable, trustless and based on I2P, so fully anonymous. It's working now, it's working since about three weeks. The students, the last three weeks at the University of Applied Science in Lucerne wrote a little prototype with I2P but they had a lot of API troubles and struggles because I made mistakes so it was my mistake and at the end I couldn't present here the final prototype but because of me not because of the students, they did a good job. And what we're thinking today is how to implement the cost function because at the end a barrier free system I already said that will attract a lot of spam, a lot of DNS spam, a lot of content spam, a lot of whatever we can use this system for spam and that's not me as one of the developers that's not my intention. So probably it will be a function of availability and a function of cooperation and when you read this now and when you think this is new, no it's not. Filecoin already implemented this since 2014. The only problem they had was their validator selection so they made the mistake of using a validator function to implement their consensus but this they call this proof of storage, the function of availability and the other one they call it proof of window consistency or something like that but you have to prove two things. First you have to prove in the network, prove means mathematical proof that your content is stored and B that your content is continuously stored. So these concepts here are not new, I would just like think about it a bit more and then implement it. I already talked about my core failures or our core failures in our very little team. Democratic Britsantin false tolerance, a very nice concept, a very nice book, I learned a lot, it didn't work. The eventual consistency approach is working since a few weeks, API is highly unstable, I have a lot of coding work ahead of me, in front of me and I'm looking very much for feedback so if anybody is interested in hacking in, I'm always happy if somebody wants to contribute and the academia feedback was also very positive so I could show a few interesting things in the past months. Please in the last two minutes take out, in the last minute take out this take out, we believe that the eventual consistent DNS or blockchain like system used for this DNS challenge is a reasonable approach, eventual consistent so we drop blockchain consensus and replace it by eventual consistency transaction based. The core challenges as know today, we need to implement the cost function which is reasonable, decisions, decisions in our wording or nothing else, there is a global state where all peers and I2P network agree on a specific state of data and the participation is very welcome. In the presentation on the web which you'll find in this deaf room on the fourth step page you find all the sources and some more stuff so if you have questions please shoot. Yes please. Could you explain what you meant by immutability? His question was could you please explain what you mean by immutability? The answer is once written never change again. Yes please. Right he's asking in our system we're going to have a lot of traffic, we're going to have a lot of records stored right did I did I summarize that correctly and that's a problem right or that's your question in terms of storage right first compared to other approaches Diva chain because DNS is a side project we never like handshake or other projects we never intended to replace the current domain name systems therefore the clear net we always wanted to match I2P names like Diva.I2P because nobody is going to give us a domain to be 32 addresses so no we don't have much traffic there and so the storage problem is nothing I'm currently thinking about but yes there will be sooner or later scalability questions you're absolutely right but in this baby state I don't really care. Yes please. This question was if it's immutable how can I change things? In the blockchain world you never change a record you just let it live in a block or let's call it inner transaction and then you just create a new transaction on top and this new transaction is the new state because in a blockchain you always look from the top and the last state is the thing you believe in because it's properly proved using math. Is your answer given? Okay. Maybe a question from the phone? No other questions? Thank you.
Algo-rollover for .nl
Hello everybody, welcome to the TNS Dev Room, if you just came in. Our next speaker is Stefan, who will be telling us about the NSSEC PSK algorithm roll over for .NL, which normally isn't a very exciting thing, but I trust Stefan, but I've made it a boring situation. That's still fun to talk about. Yeah, thank you Peter. Welcome, my name is Stefan Udink, I work for SIDN, the .NL registry, and I'm talking about the KSK algorithm roll over we did in July last year for .NL. So why did we do this? What did we do for preparations for this change? What was the planning like? How did we execute it? And what did we measure on the internet on our change? So why would we want to change the algorithm? Yeah, the algorithm we used before was algorithm eight, and that's an RSA algorithm, and we wanted to use a SAVER algorithm to keep up with the new standards, because since June 2019, the new recommendation from the RFCs is to use an ECDSA algorithm for the NSSEC situations, and there's currently enough support in Resolvers to do this. As you can see in the drafts, both RSA and ECDSA are supported equally for most Resolvers. And a plus side is also that the NSSEC answers we are giving are smaller than the RSA answers, which gives us less impact when we are hit by reflections attacks. So it's better for the internet. So in way two, the algorithm roll over, we already replaced our HSMs. We used for the signing of the zone with new HSMs from Talis, which could do 20,000 signatures per second, which was a big increase from our previous HSMs. And we started with doing a test run on our test environment without any changes to see how does this work, how much time does it take, because there are a lot of things you can change which would change the time used for some steps in the roll over process. And a normal run took about three weeks without any changes. To be able to do it efficiently, we also made a test lab policy for OpenDNSSEC, which rolled very fast to be able to see what changes were done and to be able to create some scripts to follow everything that is done in the environment. And we also used the local DNS viz installation to see if a Resolver for our setup, it was inbound, could indeed resolve the new situation. And for that, we also created a fake root. So we could play root operator to change everything and we could validate everything to see if everything worked without any issue. That went quite well. And then we went to our acceptance environment in which we used a daily copy of the public NL zone and that has 6.4 million records or at least two main names in it, much more records. And then we had a memory issue. We used our old 128 gigabytes of memory, no swap usage, but still the system holded on something. And after we added swap to the system, it ran again, so it continued. It was not broken. Everything went where it left off. It was strange, but it helped us. So we could prevent this issue in production. Another thing was that normally we generate a full zone every half an hour and in a normal run, it took about 24 minutes to generate the zone, sign it and publish it, including validation. After adding the ECDSA keys, we did a run and then it took 45 minutes. That's not what we wanted because if you want to publish every half an hour, you cannot take 45 minutes to publish something. So we had to find a way to make it less than 30 minutes to do both RSA and ECDSA. And we saw that mainly the validation part costed a lot more time because ECDSA is harder on the validation part than on the generation part. And we made some things in parallel. So we compiled the zone with bind to raw format. We validated with valid NS and all those things we did in parallel and we added parallelization to valid NS. At least it was already available, but we used the switch to do that as well. So we are using all cores on our systems to do the validation and then we got to about 27 minutes of generation. So that's under 30 minutes. Very good job for us. And so we were able to continue with the new zone generation. So how do we plan this thing? We were in June and we know that it took some time. We saw that we might have a ZSCA roll over. We didn't want to do a June ZSCA roll over because Dan Dezon would increase even more because of extra signatures. So we had to plan it and we had also some data that we could use to do the validation. So we had to do a lot of things. And people in the organization that had to approve at IANA that we are going to change the yes in the route. So we expected that the IANA change would take three days and so we came up with this plan. And with all holidays for people, et cetera, we were able to do this plan. And as you can see, we have some asterisks next to some dates. And that's because these are dependent on the IANA change. And if the IANA would take more time than we expected, then those dates would change. And this is something we couldn't predict. But even though we thought three days should be normal and should be okay. And luckily for us, we did a blog post about this change and we were telling the people we are going to do this change. So if something breaks, you know we are doing this and you have these dates to see if everything is going according to plan and we will update this if there's some issue or the dates will change. But it was all good and we planned it very good because all those dates mentioned here were the dates that were used. So we did it according to plan. So on executing a plan, it's good to have written commands just to copy and paste them when you need them. You only have to check, yes, I'm doing this on the correct system. Yes, it's all written correctly, but you don't have to think about it anymore. So during the execution, we did continuous checking with the script we wrote and we did some DNS viz runs on the public DNS viz site to show the people that we are changing and we have some records that we can show. I will show the DNS viz pictures lighter. As I mentioned before, there would be an increase in file size for the zone and before it was 4.5 gig in size during 4.6 and afterwards with smaller signatures, we only had 2.3.7 gigs and that's very nice. Of course, we have a go no go moment because if we would have double signatures, we can still go back without any disruptions, et cetera, and if we would go forward, then we wouldn't be able to go back as easily. So we have to do a bit of a check and that went well. So some pictures of the algorithm 8 situation. During the policy change where you see an addition of the EC, we added the algorithm 13 to the root and removed the algorithm 8 from the root and afterwards we stopped using algorithm 8 and then this is a new situation with only ECDSA. During all this time, we also did some measurements and a colleague of mine, Moritz Müller, did most of the measurements. He wrote a Rolo from Moon Mon quite a few years back and did it again. We messaged some items mentioned on the slide. I want to mention that we only messaged two root servers because all root servers should say the same answers but we didn't want to measure it all 13. What might be interesting is that you see a lot of numbers in this graph and that was a bug in the Rolo from Moon software. You also might notice that there are multiple lines at the top and at the bottom and that's something like this. It's a measurement issue that was caused by using of a small buffer size and still trying to get key IDs from the answer. That's why you see a lot of changes and we saw what we were seeing. This is not correct. What's happening? Because if I do manual checking, everything is fine. What's happening here? Finally, we were able to find the issue and we were able to fix it. Another interesting thing is that during the change, we looked at the response sizes we sent and in this table it's only ns1.dns.nl and other systems have similar, not the same answers because the sizes might differ based on the implementation of the nameserver that's used because of the protocol compacting. Another interesting thing here is that the nx domain and dns keys are increasing during the rover and the ns set is not increasing. It's less and that's because the R6 and the set are in the additional section and during the rollover the section gets increased a lot more so only the R set for ns1.dns.nl is in the answer and the R6 but not for all the nameservers that are in our zone. If we look at traffic, normally we have about one percent of TCP traffic and during the rollover we have about five percent TCP traffic and after the rollover it's back to normal again. Here you see a graph with a logarithmic y-axis and you see that TCP is increasing a lot. It's about eight times more TCP traffic and especially there is no different state in the internet. It might change and so here you see that the KSK is back in the logistic direction and it's not going to change. So we have a level again. So in global we have no measured impact at all as far as we know. I don't know of any trust issues people had or something and you can see on the left picture that the adding of the ECDSA key and afterwards the removal of the RSA key and the right picture is the trust chain that is constant for the resolvers and that's my talk already. Are there any questions? I've got two questions. The first one on slide 17. You mentioned that during the rollover the NS size becomes smaller. Yes. I will ask you a question, complete question. So you said the NS set is getting smaller, yes and the question is? The question is there is no difference between the two. So I think that's a good question. I think that's a good question. I think that's a good question. I think that's a good question. I think that's a good one. So you said the NS set is getting smaller, yes and the question is? The question is there is an RFC out there that says glue is mandatory. If the size of the response is getting smaller because you're not including glue you have to set the TC there. Did you measure for that? So I'll repeat the question. There is an RFC that glue is not made, that blue is mandatory and did we measure some things about this? What I know about this situation is we looked at the dns-vis information that we got and for the measurement for ns1.dns.nl the glue is available but only for that name. So not the glue for the other nameset records. And I don't know if we looked at if the TC flag has been set and if there has been acted on. So and the second question? Yeah, my second question is I noticed you switched to talus. With regards to support from the talus company, did you test that if you got proper support, if you needed it? I will repeat the question. He said we switched to talus and did we test the support we had with talus before doing this transition? No, we did not test the support beforehand and technically we did not switch to talus as in we used to have lunas which was taken over by talus. So we continued with the same lunas, hsm products as we before. We had contact with talus before we switched the hsm's but we did not before the rollover try again to contact them to see how support would handle questions from us. Which might be a very good idea as well. Thank you for that. I'm asking for a friend. Yes. Relate question to that. The tank tank we had in the beginning, did you have any rollback plans in case something went bad? The question is do we have any rollback plans? I mentioned we had a go no go to see if everything is okay, we go forward. If everything is starting to fall we go backwards. After going forward we had some thoughts about how to continue but that might have impact. So the decision what to do when, depended on the situation at that moment. And we didn't write out everything, every possible scenario because that would be too much especially based on our testing and our acceptance and test environment that we had confidence that it would all go correctly. And we would have to look at the situation at the moment to see what the next step would be if something went wrong. Does that answer your question? Yes. If you had choice to redo your procedure, do you think it's worth it to have met HSM at all regarding the other complexity and risk of losing your key in case of backups are not here? Father Van Aving, a hidden signer that is aggabbed by your words for example. If I understand your question correctly is about, did you have anything about backups or? Are you happy with having an HSM reserve having an aggabbed linux that have the casket on the disk and do the signer and just the DNS update that's going on in the world? I don't know. I hope I understand the question, but if I want to answer your question, I'm thinking about we do not have an aggabbed system. We do have regularly backups of all the HSM keys. So in that way we do have an HSM that is aggabbed because the backup unit is an HSM and that we can use to restore keys if necessary. Do you think it's worth it? Worth it. I think it's worth it to have an aggabbed HSM to, it depends on your risk assessment if you want to have an aggabbed system and if you are going to do this thing for instance in public cloud, you might want to have a situation where you have an offline KSK for instance. So that might be in a setup. Did you conduct a penetration test on the HSM beforehand and what are your operations in case the security issue gets known in these HSMs? Did we do a PAN test on the HSMs and the next question was? What would you do if a vulnerability gets known? No, we did not do a PAN test and what we would do if a known vulnerability would be known would require our systems to investigate what happens and how can we react on that? Which information has leaked and how can we recover from that? Those are not known scenarios at least to me at the moment. Why not PAN test? Why not PAN test? I have no idea. Yes? I noticed that an extermination goes up to 14 or two bites. Did you, I'm curious what was your transportation settings for maximum? So your question is what is our transportation settings for the UDP size? We have 1232 as the size of the UDP packets as recommended by an RSE. Other parties that also provide anycast for .nl have slightly different settings for that. So that's why we focused here on NS1.DNS.nl because that's something we operate ourselves. The second question. The second question was you added the algorithm 13DS keys to the rule zone. So you've run a dual DS. What's that to allow removing the algorithm 13DS keys if you had to in a hurry or just as an additional acceptance step before you remove the algorithm 8DS records? Because during the fairly recent transition of carbon net and the EU cells, they basically just didn't swap. The question was why did you not remove the algorithm 8DS when you were adding the algorithm 13DS? Correct? Yes. We did that because we wanted to have a solid path and have a possibility to go back without any issue. So rather than take one big step in that regard, we took two small steps to ensure more stability at least from our point of view and good night's rest for us. Any other questions? Maybe not so much a question but a statement if that's allowed. Yeah. One. I think it's incredibly brave for a national top level domain to take a risk, right? And I think it's very good as a compliment. But because changing an algorithm is different than changing a key, changing an algorithm is fundamentally hard. And for SIDN to do this as one of the early adopters, not the first one but one of the early adopters to do this, I think it's very commendable. And I think you set an example for the rest of the industry for all the other top level domains including ICANN and we're looking at you. The same I would like to. And we're looking at you to see what you're doing well and of course we hope nothing goes wrong but we also need to have that information. And one of your colleagues is working with ICANN to make sure that if we ever do something in the room, that that goes well as well. So he's part of that group as well. So yeah, we're looking at this. We're hoping all the top level domains follow the same example. And yeah, all my credit goes to you guys. Welcome. Thank you. I want to repeat for the online audience. Because if you get a compliment like that, the person, Roy Arons from ICANN said that he said, very brave for our .NL or Azure at the end to do this algorithm change in the forefront of the people who are doing the change and are should be followed by registries to do this change as well. And we have shown that it's possible and without any incident. So any other till these please follow us. Good summary. Thank you. Thank you.
Bootstrapping time on OpenBSD
Welcome to the DNS Dev Room. Our next speaker is Otal, who is an OpenBSD developer. We're going to call him a faithful intersection, DNS second, NTP, and maybe two other terrible things. Yeah. Okay. So I'm going to talk about bootstrapping time specifically on how we implemented that on OpenBSD, but I think the approach could be used in other systems as well. So, small introduction, OpenBSD, a BSD derivative. We focus on security. We do that in several ways. For example, privilege separated demons, which in which we separate the various tasks a demon has to do into separate processes. Each of those processes have minimal capabilities, and they communicate with each other through pipes and exchanging of messages. There's also a lot of other techniques from memory management, which I'm also pretty involved in, and new APIs for that are, let's say, less easy to misuse, things like that. Apart from that, we also try to make a useful system, and so we like to have St. defaults focus on a system that is out of the box, a nice system to work with. By default, we do not have a lot of services active, but if we consider a certain functionality to be included in a default configuration, the configuration you get when you install the system, we are quite strict in that, in the sense that it has to be functionality, which is useful for a very, very large fraction of our users. But also, the actual implementation is maybe even considered more volatile, it's a higher risk, so we focus on extra on the security aspects of that, including the architecture of the software itself and the specific implementation. I'm now going to talk about time, and we'll see a bit later how that also involves DNS, but when, originally, when OpenBusy starts, it gets the time from a battery backed real-time clock, if your hardware has that, because not all the hardware has it, and even if you have hardware that has it, it's not always functioning properly. If you think of it all the hardware, then the case is, my CMOS battery ran out, is pretty well known, and most of the battery backed real-time clocks then give some default value way back in the past. But the booting system tries to read clock, if it's available, if that fails or there's no clock, the time is set based on a time step that is stored in the root file system, which says, well, this was the last time the file system was modified, and basically, if you unknown mount a root file system, which happens on an ordinary reboot, then, or shut down, then that time step gets set as well. So you have, let's say, if you reboot the machine, you probably have a time step, which is a little bit in the past, but, well, reasonably okay. It's a bit behind, probably, especially if you shut down your machine, go on vacation, and you don't have a real-time clock, because then you come back from vacation, and your clock is two weeks behind or so. So that's the problem. We have an NTPD implementation, which I'm going to talk about a bit more in a second, but that originally that implementation did not bump the clock. It would only gradually increase or slow down the clock to adjust it to make sure that the time is corresponding to the NTP-derived time. You could enable that, but it was not a default, because we said, well, we are not going to make a default, because we don't really have enough confidence that it will do the right thing. Why not? Because NTP in itself is not a secure protocol. That's one issue. And also, so we would like to have more than one source of time, not only NTP, even if you talk to multiple pairs. We would like to have an independent way of validating, or that time we see. So we formulated some goals in the beginning a few years back, and we like to say, well, we like to be pretty sure that if you boot up an OpenBISD system that you have the proper time, if you have network connectivity. So that's a nice goal, but we made things a bit harder for ourselves by stating, well, we do not fully trust NTP replies. Like I said, by default NTP is an insecure protocol, and also the design of the protocol is in a way a bit. You can compare it a bit to the original DNS implementations. Security was not a big thing in that time. So we'll talk about it a bit more later. But the goal is still to get the correct time on boot with high level of trust, not necessarily a very high level of trust in the sense that you have a cryptographic proof of that. That's maybe a goal for the coming years or so, but at least we have a high level of trust. Well, if there's no battery backed up clock available or it is not functioning properly, we still like to end up with the proper time. Like example, I gave this cheap boards with Raspberry Pi, for example, or other boards do not have a battery backed clock at all by default. And you can also have cases where very expensive servers forget about time when you switch them off. So the setting is if we can solve the problems in this quite difficult situation where we have lack of hardware support and things like that, and of course the more easy ones where you do have a proper RTC clock or you do have other facilities, then it comes easier. So if we say, yeah, okay, we need to be able to do DNS to resolve NTP peers, it might be that the resolver we are using is DNSSEC enabled. If that resolver is running on a other system, it's quite easy, probably that other system already has the proper time, but if we are running our own system on the same system and we do not have proper time, then DNSSEC is going to complicate matters. So we do want to consider that, at least, what we should do in that case. So a bit of words about the NTP protocol. It's pretty old. Let's say the same era as DNS protocol. There are some design similarities between them. For example, in DNS, a request and an answer basically has exactly the same format. NTP is the same. There's also the focus on UDP, of course, and also the case that the request you sent out and reply that's coming back. In reply that's coming back, a lot of information, maybe even all information that you sent out is also coming back. So you, as a client, you have a reasonable, easy task. You only have to consider the answer because the answer contains all the information you sent out earlier. So you only have to consider what's in the reply packet, do some processing, and you can continue. But of course that is, comes with that you have to trust that reply packet even more than you maybe would want to. Later, there were additions to the NTP protocol. Shared keys were introduced. So if you had a pair, an NTP pair, which you had some form of relationship and you would change some key, you share a key with that other party, then you could secure the NTP packet. So you had more confidence or pretty good confidence that you are receiving replies from a trusted source. Later on there was even more extensions where you say, oh, you invent NTS, which is a network time security, and that includes a key establishment protocol, which is pretty complex. And so far we did not like to implement that yet, but it might come at some point in time because of course that will give you some more cryptographically. And there's a process handling constraints, and constraints is a thing which I will talk about with later. So we have to do not have in our implementation any cryptographic proofs of any validity of the data, but we have a basic spoof protection. In the NTP protocol there's a field which is called transmit time, and according to the protocol the server which answers the question has to just echo that field. And if you, that's 64 bits, so we could, the server is not looking at that field for any other reason than just to echo it. So if we fill in a random, let's say cookie there, we can at least in some way make sure that an attacker which is spoofing us, trying to spoof an attacker which is not able to read the outgoing packets at least, can we protect against that. Of course that comes with storing some state in the client because you have to remember which cookie you sent out, but the protocol without any changes allows for that. When you are actually computing the time, and there's an algorithm in NTP protocol which allows you to, let's say, filter out the round trip times and things like that and get a good idea of the service time, you have to use the original sent out time and of course not the random thing you filled in. So the trust issue is in the NTP, original NTP protocol is a pretty complex statistical analysis of all the replies you have seen from different pairs. We do a bit more simple approach, we send out to several pairs queries, we collect results, we filter out things we consider bad and things which are bad as unreliable servers, servers that do not reply, servers that reply with a bad cookie and we select a median time, median time. And we use constraints which is a completely different source of time information by doing HTTPS requests to certain servers and the nice thing about an HTTP request is that the reply header also contains a time stamp. That is a rough time stamp, one second granularity, so low resolution, but we also do that to filter out bad NTP replies. So we know this NTP reply is outside the rough, our rough low resolution constraints, so we say skip that. There is a small complication there because the certificate check we need to, without any idea of the real time, has to use a time stamp, say is this a certificate which is valid now. But the question, if you do not know what now is, so what we do is we use the time stamp itself and say well it is at least consistent with what they're saying. So the HTTPS request is valid. On the time that server is telling us it is. And we'll come back to that later. Okay, but this is also a DNS dependency and that is because we want to be able to select NTP peers based on name. Of course we have things like pull.entp.org which are very dynamic, change all the time. Also location based, so depending on your query particulars you get a different answer. And you want to have the NSSEC validation. Now the NSSEC signatures contain a validity period with the same problem as with certificates. So we have here the hardest case. If we run the NSSEC enabled validating resolver on the same host as we are trying to boot, we have a bootstrap issue. Luckily there's a way around that. And that is to check disabled flag in the DNS request header. You can say to a DNS resolver I want to resolve this address. But do not do any DNSSEC validation. So that's easy at least from the protocol point of view. You can just set that flag and have at least some idea of that DNS resolving. But in the current API or in the API at that time which also is from the 80s or 90s there was no way to enable that. So now we come to another point it is because OpenBSD is a complete system. We built the C library, we built the APIs, we built applications and the demons that go come for it. We could just add that API and then assume in our application that that API is available. So this is a part of resolve.h, the source code. We introduced a new flag, save and use CD. And that enables us to use the APIs, the DNS resolution APIs which also use a bit of an Eucaly system which also stems from the 80s. That is a global variable or struct called underscore res which allows you to tweak the way DNS requests are done in a libc. These days that will be designed completely different because you would probably have some local object which you pass each time to that code to or have some context or something like that. But this is from the old days where a global variable or global struct would contain the flags to be used. So what we do is if we know that the time is not synced yet, we retry without with the CD bit and the resolution fails. We retry with the CD bit set and hope that it will get better. That way we have an answer. Of course it's not DNS validated. So we are closer maybe but still not there. So what is now the revamped mechanism is we get the time from RTC. That fails with time for the root VST and plasma, completely exactly the same. So the kernel is doing exactly the same as it did before. When open the entity starts, it will get constraints. So that's a new thing. It will try to get a rough idea of the time. And it will also send out entity requests based on DNS requests it has done. And those NTP replies will be validated using the constraints derived from the HTTP requests. And we will move, bump the time if it's going forward and otherwise do a gradual increase, a gradual adjust. We will bump only forward because we do not like to have logs with time going back. So monotony increasing time is pretty important. If we see, and that is probably an indication that something is really wrong, if we have to set the time backwards, we don't do that and scream in the logs and things like that. After that, the regular NTP things just happen, gradual adjustment of using several pairs, etc. So then we have some idea of time and we'll do it one more time. So in the sense that once we are synced, the NTP time and the system time agree, which can take several minutes of course because you have to slow adjust in many cases. We'll do it again. But then we say, well, we know we are synced. So we do have real DNS check validation. We do not have to do fall back to no see. And we use the constraints check the actual time. If at that point things are not okay, then we will of course scream in the logs that, well, we cannot your NTP, your NTP pairs, but then it's a system operator decision to do that. In a local LAN, of course, that might be a very suitable case. And the default config uses several NTP sources like NTP.pull.org, but also time flare offers a NTP server on all their pops. So you get a local, with all the same IP, you get a local time source or local close by at least is the idea. And also a sorted set of constraints of, let's say, well-known HTTPS servers like from Google. And we also use Fortnites servers for that. But they're, let's say, stamp of approval. So this is the default configuration. We also mix the quad 8. We are not using quad 8 for that because we like to have, they are, there's some tie between NTP and not. So, of Google, so we will say one is a completely different system set of systems from Google. So that is why we say, well, if we're using DNS, that will, let's say, diversify the different sources we're getting time from. And a little detail, surface means if the DNS request produces multiple IP addresses, we all query in all of them. And the server is a single source. And sensor is for, if we run on a system which has hardware clocks, for example, GPS based or you have the Meinberg, some set of hardware which, a PCI card you can insert in your system which gets the time from the DCF clock in Germany or other sources. We also use those, of course, as trusted sources. So that's the thing we call time sensors. So that is my talk. I'd like to thank some other OpenBSD developer who cooperated with me on this. And reachable and master done, but also OpenBSD, BUD.org. And I'd like to ask if there are any questions. Yeah. So you mentioned that NTP never sets the time back. But what happens, for example, if you have a hardware RTC clock that's misconfigured, like, for example, set one year in the future for some bizarre reason, and then you're a bit back. Yeah. So the question is, we know our NTP implementation never bumps, really hard set of the clock backwards. If that happens, and if you, but for some reason your RTC clock is misconfigured or set to the wrong time and you, then we require operator intervention. Then it's a human decision to do that. Of course, you can do that with the date command still or with our date where you say, well, get some time from a different system. But that is not a thing which happens automatically. We scream and we say, well, this is not right. But we require operator intervention for that case. I know the question. How much of this is tolerant if you don't have network during boot because it's a laptop that might be going to do a wireless network and that takes 10 seconds? Yeah. NTP deepen. We have a, if we do not have a working network configuration on when the NTP start, it takes about 10 seconds. And if then no actual traffic was seen by the, it says, well, sorry, cannot do it. I'm just going to continue booting because at that point in time, the boot script stops because we'd like to have as many demons starting with the correct time already set. So that's very early in the boot process. Of course, you have, and you have complex configuration with, with freelance and whatever, or then that's not going to work. But the NTP tries its best and then said, well, sorry, I cannot do it. I'm going to do my background tasks like I do always, but I'm not setting or bumping the time. So there. Sorry, you're out of time. Oh. Okay, thank you.
Let's make people love domain names again
Our next speakers whose names I forgot are going to tell us about new developments in domain management. So we were supposed to be only 10 persons, so I'm amazed that the room is full. So Pierre Olivier, are we ready? Let me check. No, no. The server is not responding. Okay, do we see cute kitten for instance? One minute. Oh, yes, our internet is not broken, so I don't know what is happening. So does it respond to ping? Let me check. The audience said, thank you very much. Check the DNS maybe? I don't know how to use an NS lookup. Makes sense. Okay, will I need to show you a happy domain later? Oh, oh, oh, wait, I have a clue. Oh, it works. It's always DNS. Thank you very much. Thanks to the sponsors and first to the volunteers. We are there thanks to them. They're from Warmthanku, Peter. So, I'm going to start with the first question. Warmthanku, Peter. So, look at these DNS issues. You all know them and these are the most known of the big companies. Big noise. Although their team are well skilled and are good professionals. What does it mean that it happens also to small companies all the time to all of you? So, like you, we face the same technical issues. The DNS system is complexity is increasing all the time. Yet, it's invisible, but not for us, but for all the people who type their URL in the navigator. Those and their answer are often badly set up through ignorance or lack of skills. Great. So, there are areas of improvement. Therefore, we build a team of experts to try and improve the situation. Your speaker today, Pierre Olivier Mercier, system engineer and computer professor in a computer science engineering school. And myself, Frédéric Griteur, contributor to Free Software Project Volunteer. Since decade and like you, through my voluntary efforts and regular funding, I'm going to hand read the URL there, 20 for another, etc. There are a lot of good ideas, nonetheless, in the Internet. Good stuff to find. DNS record assistance, for instance. Tool to test your records. Your zone, your delegation, your email parameters. Some services to monitor your domain, for instance, record propagation, or which domain is soon to expire. There are also online interfaces, sometimes with good functionalities. But each provider has its own interface. What does it mean? For us, we are managing several providers. That means learning each interface. And at the end, we end up using the raw mode and making errors. I remember a friend, not me. We missed the final dot at the end of a record. Another example, meaning it happens to us all the time. A few months ago, I wanted to add a CAA record to my domain. CAA record, which is not very well known, yet 10 years old. So if you didn't do it, do it. It's very easy. I found an online form, but it didn't support parameters. So it was useless for me. I was not open source. I couldn't help and change. Happy domain is open source. That means so much time wasted for switching from tools for another looking for solution, deepening in the RFC documentation, which is not so easy to read. Or maybe all of you have read all the RFC. Who has read all the RFC? One, two, three, four. Ah, great. Ah, great, great. I don't. Apologize. So we would like to offer you a sort of magic wand in the form of modern interface. And we named it Happy Domain. It will make settings pleasant, makes the display easy to read, and we would like to centralize all the men and registrar in one place and forgives error as much as possible. So let's Pierre Olivier make a demo. Let's dig inside the software because it's five o'clock and you are supposed to be tired. So tell me, how does it work? And I would like if you could change for me CAA in the domain. We're talking about CAA. Could you do that for me? Okay, okay. So I go to happydomain.org. I log in to my personal account. And here is the domain I managed today. We will make the modification or domain that is not listed here. We can see there are several providers on the left. I can use this provider to filter my domain. For the demo, I will use a local authoritative server, Power DNS that run on my local machine. As it is not already registered in Happy Domain, I click the provider and I select the domain I want to manage. So today this is happydomain.test. Before I forget, I will assign it a group. This is useful for example if we have several clients or environment. So here it is on the happydomain group. And now this is the abstract view of the zone. Instead of displaying directly the records, we group them in a kind of services. For example, we have the origin with the required record such as the SOA and the NS record. And as you can see, there is no technical subdomain here. For example, the DKIM record is not in the list of subdomains here. It belongs to the email service. If we look at it, you can see of course the MX record, but also the SPF1, the DKIM, the DMARC, etc. And the corresponding records that can be displayed here, we see that my domain key record is listed here. This is not my goal. Frédéric asked me to change our registered certificate authority. Yes, I would like to use the bypass. So the certificate authority is in the service certificate authority authorization. We can see there is a simple form that assists the user with several choices. Currently, we use let's encrypt, and Frédéric told me to change it to bypass. So great, it is in the list. If it wasn't, I can select other and write the domain names corresponding to the certificate authority. And we can also provide parameters. For example, some certificate authority can restrict which user can generate a certificate for a certain domain name. So here we can provide a client ID, which is API domain. And only the API domain account will be able to issue a certificate in the bypass certificate authority. We can see there are a lot of other settings, for example to restrict wildcard certificate issuance or the SMIME certificate issuance. And at the end of the form, we have the incident response. We will see a record, provide a way to contact the owner of the domain by mail or via a hook. Here, if someone in another certificate authority tried to issue a certificate, we will be alert for a violation of our security policy. All of that is a summary of the RFCs on the CAA records, and it's pretty easy to fill for a system administrator. So the modification here is made, but only for now on API domain. We can make several other modifications to make a batch and ensure the coherence of the zone we published. So here I just have one modification to make, so I can directly publish my changes. The interface asks us to review the change, so this is a modification for the API domain.test domain for the CAA record, and it changes from letsancript.org to bypass.com with some parameters. This is exactly what we want to do. And like in Git, you can add a log to retrieve it easily later. So I applied the change, and that's it. Frédéric, are you happy with that? Great, but sorry, it's in one month. I'm really sorry, it's a mistake for me. Could you roll back this stuff? With pleasure. So of course I can do the same modification as there was only one modification. This is pretty easy, but in case of many more modifications. As of today, we support 40 providers. We rely on the DNS control project, which is led by Stack Overflow, and several providers are added each month to that project. And also we support a classic authoritative server like Bind, Power DNS, and Not. We do our best for a facilitated readability, even river zones are supported. First, you can review your change at a glance before publishing them. We archive your modifications, therefore users can try and roll back. If required, you can easily export the record, and also you can import in a standard raw file. I think on the cake, the interface can be controlled. For instance, you may need to create a dedicated suit domain for testing, or before putting a project into production. Or you can have a local environment for your development, and then a prep production on a domain hosted by, for instance, Gandhi, and the production hosted by OVH. No problem, happy domain talk to every one of them. There is no need to learn the intricate of everyone's API. You can just call the same script for all your environments. Use a pedoman as you wish, first online, or you can easily install it on your server. We have binaries and docker image for you, so you can use it. Here is what we have today. You can use it right now, and please welcome. We're convinced that we can save time and bring a piece of main for teams that manage domain name. Think 15 seconds right now. This is your time to work. Think about what are your main issues, what are your main tasks, and your greatest interest of improvement. And then we will do our best according to what you say, and your results. We don't promise to build it, of course, but we'll do our best. Some ideas, for instance, now when you use a domain name provider, you are the only one to have access to the setup. We could use it with different levels for bringing several people all together and keep track of their modification. It could also be used to delegate to part of the system domain, for instance for the marketing department or whatever. And testing is really important. So Apidomen could perform some tests on every record, not only the delegation with the SOA record. Of course, we could use and display directly in Apidomen the result of the master, but for email, we can also perform the count of resolution for the SPF record. All these tests take time, and Apidomen could save operational teams a lot of time by aggregating all these tests in the interface. Apidomen could also constantly monitor your domain names and notify you when an issue occurs. And the propagation time is certainly the least understood part of the DNS, so why not display the theoretical propagation time directly and clearly in the interface? Yes, please. We could have also some dashboards. This is an example for DMARC. That means at a glance you can see your features and how they are going. And last but not least, with the effort we put to build the API for Apidomen and the form we build, we can also imagine using artificial intelligence to create a chat and interact directly with the OER zone. This is made possible thanks to function calling in recent models and API. So please use the Element Network of the first-dem or login our system. Use your fingers, rate your dreams and priorities and help us to help you. So thank you very much. Now the questions. Thank you. We have time for one question. Okay, one question. What time is it? No, that's another question. Please. So my company will use DNS control to have DNS managed by Git. Could Apidomains be integrated in that so that both work at the same time, so that somebody could edit the DNS using the Apidomains and see that they're committed to Git where someone else follows from it? I don't know. So the question is, can Apidomen be linked with DNS control? Currently not. We does not use JavaScript language used by DNS control, but perhaps that can be a good idea to go that. And you're welcome. Thank you. Thank you very much.
dnsconfd: system integrated DNS cache
My next speakers are Thomas and Peter who will tell us about DNS Confit, which is new to me from quite curious. So, hi everyone, my name is Tomas Korbars, this is my colleague Petro Manchik, we work at Red Hat and today we've come to talk to you about our new project that is called DNS Confit. So let's start a bit with a motivation behind this project. Last year we received a request from a user that required us to make possible for Unbound to be used as a local DNS cache and to be able to consume configuration from the network manager. In the past we had DNSsectorrigger package for this, but we dropped that in rail 9. So should we reintroduce it? We thought about implementing a debuts API into Unbound just as DNSMiles has and then implementing a network manager plugin just as DNSMiles has. But then we realized that if some similar request came in the future for different service we'll be doing the same over again. So we thought about creating a new project that would serve as a conduit between network manager and local DNS caching services. This project is the DNS Comp. Our requirements for it would be to be able to easily exchange the DNS cache, underlying cache, and to add more services in the future without too much work. We need to be able to support split DNS configuration. We need to be able to support split DNS configuration and then we need to be able to auto configure without manual interaction from the user. Also, we would like it to use already present system configuration and defaults and security features that are already built in and we maintain inside of our distribution. The behavior needs to be configurable enough so you can change handling of corner cases and you are not caught of guard by the behavior that you would not expect. Okay. Let's get a bit back in the past and tell something about why Fedora 33 introduced DNS cache and what it brings to us was a possibility of multiple simultaneous VPN connection at the same time. And that's great. It also made possible to configure global servers but reach some names which are accessible only on local network, which is nice for DNS over TLS but that was not enabled yet and still isn't. And it brought us excellent configuration presentation by Resolve's CDL command compared to what we had before. That was clearly better. And it also introduced well-documented bus interface for both configuration changes, for configuration displaying and also name resolution. They have nice article but that's not our job here. So what do we mean by 3DNS for here? When you connect to VPN without some smart solution like this, you send all name queries just a single VPN and use only your primary connectivity to deliver traffic to VPN server and that consumes everything you use. At that time, you cannot use any other connection interfaces you have on your laptop or mobile phone or something else because you use just one DNS or set the DNS that VPN knows. With split DNS, you can send different name queries to different set of servers provided by different networks. You are connected at the same time and most current devices today are capable of connectioning to different networks at the same time including multiple VPNs. All you need to have is non-coflicting names for them. So for example here, names are different and if some names in those domains provide some useful networks, you can access them at the same time. And we could end it here and thank SystemDGuys if everything worked great but sadly that was not the case entirely. I have listed few issues I think are important and still aren't fixed sufficiently but there were more bugs in the meantime somewhere fixed, some are still not. For example, it prevents any usage of DNS on the host which is where it is enabled by default configuration both Ubuntu and on our Fedora because it just doesn't forward DNS-enabled bit set in queries received. So any library which is quite capable of using DNS-sec cannot use it even if infrastructure, your network provides capability for it. Also, at least for Fedora and Ubuntu desktop I think, you would be quite surprised this top level domains often that does not exist because it sends top names without dot just a local interface over multicast protocol and if it doesn't find something which usually it doesn't, it just returns no that does not exist. So com domain does not exist but github.com domain, surprise it does and even on server edition when I think this is really unwanted. And also strange response is when a response fails because of DNS-sec validation fail, it still might contain a valid answer in the response which is unexpected and no other implementation I know does it this way. So DIC plus short DNS-sec failed org even with DNS-sec enabled in system DreslD gives you very nice address and I've listed just few issue numbers. So lessons we take from this is we want split DNS functionality auto configured and we want possibility to DNS over TLS and also that we want nicer front end than we had but system D people are very good at expertise in system integration and they are quite good engineers and I know it but they lack expertise in DNS protocol area and I am afraid it is visible and at the same time DNS resolvers people are excellent in DNS protocol area but their integration into system is often very limited or at least done and we think only the integration is missing and that is what we are trying to provide. So we want to reuse existing functionality. We want to provide some common interface to set forwarding to different servers so it doesn't change much and we want to provide nicer front end for showing what is configured regardless of use DNS cache in the end. So what we need for split DNS we need some local address which receives queries from applications that usually localhost we need ability to configure different domains to be forwarded to different sent of servers and of course some default for root servers to be forwarded to global default and we also want ability to reconfigure the service without stopping it and flashing entire cache as starting it again and from this is list of servers we have in Fedora and I think all of them are able to provide split DNS functionality most of them are also able to provide DNS over TLS functionality but only DNS mask except there is of D have some D bus capability and that is quite limited and DNS mask has own issues. So our approach is use what already exists provide just front end and components coordination do not reinvent the wheel. We do not want to handle DNS queries ourselves in our service we want proper services to do it and we just provide configuration for them and I have already shown almost every open source resolver has that ability and because we are not handling queries we just want to try single thread application and we written just our prototype in Python to verify this would work. What we also want is to reconfigure ETC resolve confile only when we verify basics that service is running and restore it when our service is stopped. I really hate what a result is when you uninstall it you have to fix it by hand. And we want to have stand alone demon because not everything is primary configuration we think should be done in network manager so there is some unified way to configure it whether it is used system be resolved D or our demon it should not change it should be just implementation detail. And we think the common part is the biggest one and just very small cash specific module is required to implement different caches what we plan to support is what we have in the RL that is primary unbound and also bind and DNS mask. And we want to provide basic compatibility for services using system D or the API directly because something already uses that but we do not want to implement every aspect of what they already implemented because we do not think that is necessary. So right how does the flow of configuration look right now network manager receives its list of DNS servers from either DHCP or the connection profile and then it pushes the configuration through the bus API into DNS confi D. DNS confi D then translates this configuration into some internal representation that we think is general enough for most underlying DNS caches and then we use specified module to transform this into the specific configuration that is used by the specific underlying service. For example for unbound it is a list of four borders. How does the system integration look like now. DNS confi D uses already existing unbound service that we ship and support so it respects its defaults security features and configuration that we ship. We inherit the system D result D debuts API so we work as an in place replacement as of now. You use the default system configuration that is provided and then we watch the underlying changes of the DNS cache so you are not caught off guard by the sudden inability to resolve the domain names. Here's the life cycle of our program that I talked about. DNS confi D itself is implemented as a system service so you can inspect it as you would inspect normal system service and it is started either on boot when it is enabled or it is started when configuration is pushed through because it is implemented by the bus and system D triggers us upon the configuration. After we start we start the underlying DNS cache. We look whether it is ready or not because there is some polling right now needed and we wait for the configuration that is provided by network manager. After that we watch for status changes and perform actions as are needed. Here are some memorable issues that we've encountered. The first one is a war for resolve confile because network manager finds out that a system D result D is running or not by checking existence of some symbolic links in system and we cannot own them because they are owned by the system D result D package and if they are not present on the system then network manager tried always to override our modifications of the resolve conf. We got a run by that by implementing a command that pushes lines into the configuration of network manager and we stop it from touching the resolve conf. We argued about whether it is better to execute the underlying service as a sub process or a system service because sub process approach provides easier way to monitor whether it is running or not but then I was persuaded by Peter that the system service is better because we use things that we already have in place. There is the issue whether unbound is truly up or not because the start job was finished but the command channel was not open yet so we faced some instability during testing but we got around that by pulling a few times and we need to update only zones that were updated in configuration so we hold current state that is set into unbound and we update only zones that are required and we thought that implementing this in D bus would be easier than it proved it really was. We've created a way, we are using of testing this we are using TMT test management tool with containers that allow us to simulate some network behavior in a way that verifies the actions of DNS conf. If you'll ever want to contribute set of these tests will verify that you won't change behavior that is already in place or you will be able just to show us where we are wrong and you want us to change the thing. Okay so what is working already? I admit we wanted much more to present here but it proved not so simple so split DNS configuration as you from network manager already works ETC Resolve Conf is changed just when our demon is running and is restored when it is stopped. Unbound support is the only one we have at this moment and implementation uses only D bus interfaces of system D Resolve D and at this moment also only its D bus name so it can be running just DNS Conf or Resolve D but not both and we reused network manager system D Resolve DNS plugin for now because it pushes configuration over D bus but in the future we want to get rid of it and make our own or use more parameters from just IP address and that is what we would like to use unlike the opportunistic way which system D Resolve D used because these RFCs were not defined at that time and we think this is correct way and support multiple cache is running at the same time is not necessary usually but it would be very helpful for some kinds of testing. We would like to have ability to forward over DNS over HTTPS but there is problem not any DNS cache we have in RL supports that and in further there are only few similar with DNS over Quick and auto configuration of DNS sec would be nice we would like to have some successor and better implementation what was once attempted with DNS sec trigger but maybe better accept it and maybe if its time sometime in the future rewrite into Rust and reduce memory required memory for our interfaces that would be all for us so if there are questions please now is the time and if we can't answer them please use these mails or file issue on the project Definitely stick around to the next speaker we will talk about the Rust domain craze and thanks for the call Questions, stay phone Would be helpful for inbound to have a D bus connection where it says when it's ready No I don't think it needs to be D bus connection I think we need to correct LIP system D notify event which it kinds of supports but I think last time we try to enable in federal it start crashing so it's not built in but some kind of support is there we just need support to inbound to tell us I'm the service I think I'm ready and there's system D API for that we need to use that whenever possible it doesn't have to be D bus Visek? If you only want to communicate over DNS local servers you need to crash I understand that you want to drop the MSS Resolve bridge So how do you want to overcome this? This is part of the question Second part of the comment is that we talk about D bus but D bus is something everyone can relate Actually now it's a series of D bus in parallel which means that we can have a resolution since any book before the D bus server is up Which is why it's always useful so we had a plan to add the private interface The second question was do we plan to add running interface? No I don't think we want that First question was get other info API How can you send the additional information for example about multiple interfaces over DNS projects? How can I send from which interface comes the query or how to request query just for selected interfaces over DNS? We don't want to because in what cases this is needed I think network manager needs that just to verify the connection works I think we might have different service which just will query us please Tell me address resolved on this interface and we will send the query just to correct addresses Because we know which addresses are used for that interface but that would be not served by the local cache Because that is not yet configured for that Can I make more sense to take this separately? To do this after session? Because it seems quite specific Yes, yes it might Any other questions? No thank you again
Domain: A modular Rust DNS toolkit
Our next speaker is Martin, who would be telling us about the domain crate, I think it's called. He's been building for Rust and all the cool tools you will be building with it. Probably, yeah. Hello. Back in 2015, so in the times before, two things happened. One was that Rust released version 1.0 and became stable. And the other was that I started working at a Nellitlabs. You probably know Rust as the thing that everyone wants to write everything in, and you might know Nellitlabs from things like Unbound, NSD, OpenDNSHack, and some more things. So I thought I want to teach myself some Rust, and I need to teach myself some DNS, so why not combine these things? Now, what also happened at the same time Benjamin Fry was working on trust DNS, which is now called Hickory DNS, and I figured, well, let's not do the same thing. That would be silly. So I came up with a different idea, which was instead of building a giant application, or a set of applications, build lots of building blocks that people can then use to build their own specialized DNS applications. Because what happens is we have a lot of good sort of stuff for generic use cases, resolvers, primary servers, that sort of stuff. But if you want to build your own specialized DNS application, that's actually surprisingly hard. I find that out the hard way. For a project that we were doing, I needed a very specific primary server that was having a like a Rust interface, and then instead of people actually telling me which specific resource records to publish, they just wanted me to tell, I want this thing to happen. And then I built this with a Python at the time with Flask and LDNS Python Reps, which are surprisingly weird. And I would rather have had something that I could just build this all in Rust. So I think this is really a good idea. And so I started working on it, learned a bunch of Rust, learned a bunch of DNS. And where are we now with it? This is the starting page for the documentation for the last released version, 093. It's also apparently quite a lot happened before at 09 already. So there's a bunch of things as you can see here. Base is also the handling of DNS data. So there's types in there for all the things you could think of. The main names, resource records, record data, all the R types and all these various types. And also messages, complete messages so you can build your own messages and whatnot. There's R data, which is a massively incomplete set of record data types. We only did the ones that we need because this is very early still. And whenever you change something and you have like what is it, 6,000 record types, you have to change all of that, which would be annoying. So we limited ourselves to the ones that we need. But there's a lot of stuff in there already. There's all the basic ones. There's the DNS-like ones. Even someone contributed the SVCB or HTTP ones. We even have that. There is a stop resolver, very simple one, just as like a proof of concept. Although that's, I think, the thing that most people actually use. I wrote a signer. Well, it's not complete because it doesn't do NSIC-3, but yeah. I wrote some T-SIC support that is actually complete. And that was kind of fun because that happened right at the time when there was an update to T-SIC. So I contributed back some thoughts on the quality of the RFC for T-SIC, which the original one is surprisingly weird. Someone contributed some very basic validation things. I think it's basically just validate and RR-SICs or nothing with looking at DNS keys and stuff like that. And we have a zone file part of there because everyone has one. And this is actually a second iteration. I wrote the first one, so I wrote the second one, which I probably also going to throw away and write the third one. But we'll see. This has also to do with, for NSD, a colleague recently built a zone file part that uses these, what is this, the things with the multiple SIMD, and that is ridiculously fast. So now I'm kind of embarrassed. So it actually kept growing quite nicely. But then, oh no, distractions. We did a thing for routing or RPKI, which we wrote two products for, Routinator and Krill, which are quite popular in that field. But it wasn't actually that bad because both of them are written in Rust, obviously. And these are actual products that are actually used by actual people. So we got a lot of experience with writing in Rust, deploying Rust, and also sort of got a bit more comfortable with doing things in Rust, because if you sort of listen to a lot of people and they're all, blah, blah, blah, fad, go away. But actually, nobody really cared. If you build an application that works, that is very convenient, then of course everyone will like it. And that then meant that we could take a step back. The DNS is changing quite currently, like there's all of these new things, like there is HTTPS, there's a lot of new transport protocols, there's a lot of stuff happening right now. So probably also the applications for DNS, the use cases for DNS are changing. And I think it's a good idea to sort of explore that space and see where we can get this by providing a lot of sort of building blocks that you can just put together quickly and just like play around with things. The sovereign tech fund, which is a German sort of funding organization for fundamental internet infrastructure, they agreed with us. So they funded a year of development for us, this, which then allowed us to spec what we're going to do with domain this year and allowed us to focus, to actually have like three people work on this full time more or less this year. So what's the plan? We came up with three tracks. First is the client track, which basically is all the things that need to be a DNS client. So the thing that sends requests and then receives responses, matches responses, and also sort of preprocesses responses. So the three things that we have here on our list right now is basic transports. We're currently only focusing on traditional DOS, DNS over Port 53, so UDP, TCP and DUT, the other ones are coming. But a year is actually not that long, surprisingly enough. We're doing response caching and we're doing, that's going to be fun, a DNS-seq validation. The second track is the server track. So the thing that receives requests and then figures out what to respond with. Again, basic transports. We're doing all the things that you need for zone handling. The key on plan is to have zone file parsing, which we've seen we already kind of sort of have. Just need to make it nicer, which also probably means we need to add lots more record types. Then stick all of that into a zone tree and use that zone tree to answer queries. Straightforward enough. Of course, DNS-exhining is a thing. As I said, I already have parts of that. We need to actually turn it from a proof of concept into an actual thing that you can use. We're going to do this zone transfer so that you can then basically, with this server track here, technically build your own authoritative, which then would solve the use case I heard earlier. Then we have a third track, which we just called the bonus track, which is sort of the idea is, well, we can build all of these things, but we don't know if they're any good if we don't use them. So let's build something where we actually use our stuff and put it together. One idea that we have is a DNS proxy. Sometimes you call forwarder, like all this terminology is terrible and non-standardized, but basically a thing that sits somewhere and receives requests, but doesn't do the actual recursive resolving just forwarded to someone else. What we've talked with various people about what they need is sort of a way to decide what to do with the request based on a set of rules. Look at the request, look at where it comes from, look at the R types, that sort of thing, and then say I want to process it, I want to send it forward to some other recursive, I want to respond to it here, or I just want to drop it because that's just the wrong thing. So the sort of long-term goal for this thing is to build something that can do that with maybe a scripting language and whatnot. But as an initial proof of concept, it will just basically be into some configuration. Maybe you have to write the rules in Rust by hand, but they're not yet. But no, there's even more. We also want to do a diagnostics tool. I'm not going to say dig. Mostly because we need it for testing anyway. But then we thought, well, just re-implementing dig is kind of sort of boring. Let's look into something more useful. And some of the things that we thought of was sort of like what DNSVis does. So have a thing that you can go and check if your DNSX for your zone is correct, correctly set up. Another idea was compare your resolver, your upstream recursive. What that actually has on data compared to your authoritative. Stuff like that. So if you have ideas, we're definitely open for it. First stage, we just want to sort of see how this thing is going to look. And then finally, we have a bunch of things in LDNS, which were intended as examples and not as actual applications for people to actually use, which then people actually used. And that's kind of annoying because we're actually not maintaining them. They're just there. And then people sort of, you know. So the thought here is, well, maybe we should actually make these official in some way for more shape and make them available. So things like check that your zone is correct. So was blah, blah, blah. Sign a zone and whatever else there is. And I think that all will keep us quite busy for the year. And we're also hoping that by the end of the year, we actually have something that is useful for people. So if you have ideas, if you want to, no, we don't have that yet. Sorry, what was the question? Oh, sorry. The question was support for zone MD. Zone MD, right. Yeah, no, we have none yet. Doing zone MD will expose all the mistakes you made in your design. Excellent. And or draw a path of destruction for your code base. Yeah, but that's what it did for us anyway. Yeah. So we should probably do this actually as part of designing the zone tree thing that we're doing right now. Yeah, that's a good idea. Yeah. Yeah. Can you maybe elaborate a bit, a tiny bit on the actual Rust experience for this? What did you learn, anything that surprised you in the DNS part or in the Rust part? Yeah, so the question was what was the experience with Rust and DNS with this all? Well, I started, this was basically my first real Rust project. So like you're like a kid, everything is as it is and everything is like how it should be because that's what it is. I think DNS wasn't all that surprising in and of itself. The situation with the RFCs and stuff being hidden where you don't expect it is super annoying. It absolutely helps if you have colleagues who have been doing DNS for 25 years, just like one desk over. But all in all, I didn't have any surprises. I couldn't think of anything. Which is great fun. I can recommend doing DNS. Yeah. Yeah. Any particular Rust libraries you're using for parsing, for making together the packet that goes over the line? So the question was whether we're using any other dependencies for parsing and whatnot? No, we did all of our stuff ourselves. Because ultimately it's relatively simple, right? It's just binary, sort of you just go over a sequence of bytes and you pick out the things. And I don't think it's worthwhile to just have some complicated thing for it. And what we did is what might be interesting for other people is we didn't want to stick to a specific representation for these octet sequences. So we did some abstractions over like vex or just slices or bytes or whatever. And made all of these types from the basic type scenario for that. Which makes the usage a bit iffy, because you now have these type arguments. But it's super flexible. In theory, you should be able to use this in an environment without an allocator. So if you just built this on top of arrays by the arrays, then you should be able to do probably only a UDP. But it should be possible to do a little DNS client for an embedded environment. Which was one of the use cases that I was looking into which I thought interesting, probably not very widely used but fun. Any other questions? Do you have a cell in the file, sir? Yes. Does it preserve comments? No. This keeps happening if you don't know. Is that something you would want? Yes. Okay. Any other questions for the audience? Oh, yes, sorry. The question was whether we had a zone file parser and whether it would preserve comments, which it does not. It currently only really just parses things into data structures too. So it's not like a manipulation thing. Yes. Talking about bind files, there's bind files. There's also the cop that you can have a name and you have a context to reference back to the previous record without having to fold things like people. That's awesome. Awesome. So the question was whether I guess whether it's compatible with whatever bind does. Yes. Which it is because that's the standard. Which is also super annoying because like the RFC is very. It's true. It really is. So what we're also looking because also colleagues were building a new zone for buzzer for NSD and that has to be compatible with whatever we have. Right. Or at least with whatever NSD was doing before. And I think they're mostly. Compatible. So what we're also looking into is maybe working on a sort of minimum definition of what is on. I should look like if it's if it's sort of portable between everything and have that as an actual grammar. That would be really cool. But it's also loads of work. So I'm not sure if this is going to happen. That we have a question. Yes. I had a question. So. Have I understood it correctly? You wrote. Building books for building entire server. Quite powerful. But. You never are. Going forward to it. Writing that server. That is. Yes. So the question was whether. We have plans to build an actual server and the answer is not currently. The. The. Our sort of the sort of the. What do you call it the distance that we're looking at is about five years. And we don't have a plan for that. But also we're very flexible. Stuff comes up. But currently we're very happy what we have with unbounded NSD. Yeah. And I'll be in a second. Of course. Thank you very much. We promise stickers. There's loads here. So if anyone needs more decorations.
The first 13 years of blockchain name systems
What do I call you, Naiman? I don't mind. I nickname or real name. Whatever is easier for you to pronounce, because a YAL is a bit difficult sometimes. A YAL? Yeah, I mean, a YAL, that's the name. I use Naiman because I live abroad and no one can pronounce it. Sorry? Yeah. Naiman seems a bit easier sometimes. Peter, see a new audio? Yeah. Unmuted. Unmuted? Right. Right. Your controls are here. You have 30 minutes to question a loud speaker. Just join us and welcome to the UNS developer room. This is our final speaker for the day. Naiman and he will talk to us about the history of blockchain and naming systems. Okay. Thank you. Thanks everyone, you know, who stayed till the end. I imagine it was a great, at least it was for me. I'm going to talk today about the history of, you know, blockchain name systems. I'm Naiman or a YAL. I'm from Israel, but I live in Poland. And, oh, that's fast. And I'm a mathematician, but I work on peer-to-peer websites in the last few years. If you don't know what it is, don't worry, because the main important thing regarding the talk is that those websites use blockchain name systems. So I had a chance to talk with the developers of the main ones, even being engaged a bit in some of them, and that's why I do this talk. This is some projects, but I did, which use, you know, blockchain name systems. Don't focus on that, because, you know, it talks not about me. I know that blockchain has a bad connotation, especially, I guess, in this thing. I'm not here to change your mind. I'm here to tell a story. And the story begins in 2001, where a guy called Zuko Wilcox sent a draft to his friends of some article that he wrote, and it began with the words, please do not propagate this information widely yet. I'm still working on it. Did they respect it? Absolutely not. This was propagated so hard that by now there is a Wikipedia page on it called Zuko's Triangle. Zuko's Triangle basically says that a name system, there are three properties that a name system can have or not have. One of them is secure. Secure means that two people cannot register the same name. The other one is human meaningful, which basically means you can choose which name you registers out of the ones which are available. And hopefully because you're human, the name would have some meaning. And the last one is decentralized, which means that in order to register a name or to verify a name, you can do it yourself without needing someone else like a third party. And Zuko's Triangle says that for any specific name system, you can have at most two of those properties. You cannot have the whole three. Sorry. Here are some examples. You know, a name system that I guess everyone here know. DNS. DNS has human meaningful, for sure. It's also safe. It's not decentralized in the definition of Zuko's Triangle. Public private key is safe. Yes, it's secure. Sorry. Decentralized, yes. You can generate yourself. You can verify someone else on your own. But it is human meaningless. Most public keys are a monster. And my favorite one, the state ID, which is safe. But otherwise, it is neither decentralized or human meaningful, which I think it's a shame. I would love to be able to choose my state ID, but states. Zuko's Triangle kind of was considered to be true for the first decade of this millennium. It was what I had that was not involved well known within the name systems community. But you can only have two. And you shouldn't try to build one that has the whole three. 2009, Bitcoin invented. And shortly afterwards, a year later, in some of the Bitcoin IRC chats, people started to say, hey, can we put name on a blockchain? Now, this continued in chats. There was a Bitcoin Talk forum. At some point, the legendary Aaron Schwartz heard about it. And he wrote an article squaring the triangle, which basically says, if we put names on a blockchain, we can actually go around Zuko's Triangle and can have a name system that have the whole three properties. This can be, you can argue if a blockchain is really decentralized or not in the sense that the requirement was that you can register and verify yourself, not register and verify with a blockchain. But for the sake of this talk, we think about the blockchain as a big dump object. It's a tool. It does what you want. I know it's not. I know that each blockchain has its own pros and cons. I'll be happy to argue about each event afterwards in a beer with that right now. Big dump object, by the way. I'm a youth science fiction fan. It's a term from science fiction. It's a subgenre of books that have a big dump object. It does something. That's a Ringworld by Larry Neven, classic science fiction book. If you haven't read it, I read it as a kid. I hope it's still fun now, but I really loved it. So 2011, Namecoin was launched. Namecoin basically did exactly that, putting names on a blockchain. Here is some interesting trivia details. The names that it put on a blockchain were not actually names. It was just like 250 bytes on a chain, so you can put a sequence of 01. Then if it's a name or not, or how you interpret it as an ASCII or a Unicode or whatever is up to you, no one verifies everything besides the fact that the similar bytes was not put before. No subdomains, because all you put is bytes. So there is no subdomains. It's just names that you register on it. They did have something which is called namespace. It was in the software layer, not on the blockchain. I want to put it out, because they had basically two that they were promoting, the developers. One was D, which was for domain names for websites. But the other one was ID. And that's important, because this already shows that the thinking was that those names are not necessarily for domains of computers that can be used for identify people. The cost was 0.1 Namecoin. NMC was a coin, currency of Namecoin. To adjust it was very difficult. I mean, you can raise it in a soft work, but to reduce it, you need to do a hard fork. Also to know how much it costs really in fiat money, depends on the moment that you buy it. And this didn't go to the developers or to finance anything. It was just burnt, because in blockchain economy thinking, burning money is how you make money. Lecture a few days ago. This is the last blocks of Namecoin. One transaction basically just means the miner. So which means no one did anything there. As you see at the moment, I think that it's a project which is still being maintained, but not really being used. And there's a question, why did it fail? Or at least I think it's failed. Namecoin people here, I apologize. And I think that there are two things that they did. Maybe wrong. First thing, they really copied Bitcoin's playbook one by one. But name is not money. Names are not money. It's a different animal. You can believe that 100 coins have value and it's okay, it's not contradicting. You can go to a store, one store that accepts in dollar, you will pay in dollar, another store in euros, you will pay in euros. Another one wants Bitcoin, okay, you will get some Bitcoin. It's not contradicting, but no one wants to think that the same object have two names. This is not how it goes. Like historically, if I would think that some God has one name, and you will think that the same God has another name, there's a good chance we will go to a war. We will not accept each other's belief. But the other reason, which is maybe more deep, is that namecoin developers had a huge challenge of building it. It was the second blockchain. It was the first NFT blockchain. It was the first side chain that had to invent marriage mining. And also after it was launched, it was definitely not scalable, and also don't think very scalable right now. And they spent lots of their time improving the protocol and handling all those technical details. And they didn't have time to also think, how do I make it useful? What is it good for? And you know, pushing it to users. So 2016, as I said, I entered the blockchain ecosystem. I asked people about namecoin. I even bought one, I think, just a name, but I'm not 100% sure what this intended to. The general feeling was that all the good names are squatted. There is nothing to do with it. And the names on a blockchain is nice for playing, but not really a useful use case. And in the same year, E&S was announced. E&S is a very different animal from namecoin because it is built on top of Ethereum. And if you don't know anything about blockchain, you should know that to write an application on top of Ethereum is much easier than building a blockchain. Which means E&S, which is really well written and a nice engineering feat, is still easier to write back from namecoin. So they actually had time to have long discussions how to get people to use it. And they did two things. One of them, they said, okay, names are going to have an auction. So it won't be the fastest person who takes a name, but the one who agrees to pay the most. It's not necessarily the best solution, but at least, you know, they try something. But the other thing that, again, I see is very crucial is that they had updates. They could update their system relatively easily and they were very open about it because when they launched in May the 4th, May 2017, they called it E&S Temporary Registrar. And some of the messages, they even said, you know, we are not sure how to do it right. So that's why temporary at some point, it will be changed, be prepared for it. At the time, 2017 May, it was before the DAO hack. So it was not really common in blockchain to say that you are going to change things. This was still the time of, you know, immutable programs and code is low and stuff. How did it go? Well, it went in the same way like Namecoin, quite successful commercially in the beginning. I think that someone put a bid on the name exchange.eth of $2.6 million. So that's quite well. Like Namecoin, the money did not go to the pockets of the developers, but instead it was locked. So it was a deposit and the moment that the name expired, you got it back, which if you want to fight squatters or, you know, speculators, it's not necessarily the best idea because they have nothing to lose. A year passed and another blockchain name system was announced, Handshake. And I like to say that Handshake took one step backwards, three steps forward. I think it's kind of represented. And the step backward was that ENS was built on top of a blockchain, which could be very flexible. Namecoin, sorry, Handshake said, well, we are actually going to build our own blockchain. It was a very, already in 2018, to have your own proof of work blockchain without updates was outdated. And I said, because I remember hearing about it, and I thought, OK, that's two years, at least too late. But this thing provided them the ability to do something that the other name systems didn't do. And I don't think that anyone else does, besides them at the moment, because I said that decentralized is registering a name and verifying a name by yourself. But actually, to verify something on a blockchain is very difficult. In the worst case, you need to have the whole blockchain, which is huge. In the better case, you only need to have, like, 30 gigabytes of a proof. And that's not very practical for a name system. And Handshake really made an effort. The whole white paper is to us to have short proofs. So of a few kilobytes, that this is the name owner, and that's what the data that they attached to it. The other thing that they did is gift economy. I think I know that this is from Corey Doctorow books, but at the time, this was very popular among the Bernice. Handshake actually is the first one that said, we want to replace ICANN. We want to be the new root of DNS. And then people were buying it. Namecheap bought a Handshake domain for 750K. There were people who were participating in auctions. And I checked. Now people still participate in auctions, not in these amounts, but it seems to be a thing. There were some other funny stories that SiHab joined Handshake and then left two days later, because they thought they get a domain on the blockchain. But actually, they got a subdomain on someone who has a domain on the blockchain. So there was nothing decentralized about it. It was a misunderstanding. But besides those things, I don't think that there was significant usage of Handshake. Definitely not at the time. We'll get back to it towards the end, just so you know when we speak about what happens nowadays. But at the time, it was mostly like the other blockchain systems buying and selling. So the thing was a bit grim at this point, 2020. But don't worry. New decade, things starts to be going to be more happy soon. We have to go before that one year later, where ENS permanent registrar was launched. And they took two years of studying lesson and actually modified things. And the first thing that they did is auctions out. Because for the first few weeks, people actually participate in auctions for some specific name like exchange.eth by the time that I wanted to buy ENS domain, which was Neiman. No one participate on the auction besides me. And it was just an annoying process for the user. So they said auctions are good for the beginning. Afterwards, you don't need them, which I think makes lots of sense. The other thing that they did is that by this point, they were almost broke. I mean, they started with a million grants, dollar, from a few foundation. I'm not sure if I don't remember if they got anything else on the way. But time passes. You have to pay people's salary. They were almost broke. And then they thought, I mean, their idea was to be a non-profit that gets stuff from donations and grants. But 2019, the blockchain had a winter. No one gave them any money. And then they figured out, well, there is all this locked money. And why do we actually lock it? It's not good for anything. It's not protecting against squatters because they can try to squat on them. If they don't manage, they just get the money back. And they did. The next step, money goes to ENS organization with this NGO, which means it's supposed to be fed into the development. And overnight, they became from an organization which is almost broke, the organization that has millions of dollars. This was important. I was already developing for ENS before. But it was a side project. And when this happened, you start to think, as a developer, well, maybe I should take it more seriously because now they have money that they have to give someone. They are an NGO. They are supposed by their declaration legally to give it to the ecosystem. They didn't give to anyone. But they thought, it sits here. Another thing that they did is that they kind of changed or defined what their names are for. And they said, this is a web-free identity, or more specifically, because a web-free is a marketing term very annoying. It's an identity supposed to be used in a pharium ecosystem. And I think that they managed actually to do it quite well. Verdeirector, Brantley Milligan, he did, in my opinion, magic. He has infinite amount of energy. I wrote some message in ENS forum. And immediately he said, hey, let's set a talk and meet. And he asked, do I want to build for them more? He had ideas. He started to do all those things where he asked people in Twitter to change their name to their ENS names to show that people actually use it as identity. In conferences, people start to use it. In the firm conferences is their identity, their name, Naiman.eth. He was really pushing it well. And I got to see it all from front seat, because at the time I was working on this project. It was a search engine for the Centra's Web for ENS plus IPFS websites. So I got to see how every month more and more people got ENS name. There was more buzz. And people actually use it as an ID. I'm not saying it was a huge thing, but it was a thing. There was a use case for this thing. And before, there was none. But still, when people ask me, hey, are you going to do something professional with it? Are you going to build a serious big project or business on top of it, I was saying that I'm not sure, because the root of ENS at the time was held by a multi-seq of, I think, seven people, which is quite risky. Forget the centralization. Not the centralization. It's just quite risky. If I'm doing a project which I put a lot of effort and investment on top of ENS and then something is hacked with a multi-seq of seven people, which is very easy to imagine, then what do I do next? So I was telling everyone. I also told it to the ENS people. I'm not sure it's so directly or implied it. I'm pretty sure I'm not the only one who mentioned that. And then we reached November 2021, when a very significant thing happened. ENS DAO was announced. DAO is a decentralized organization. If you're not from the blockchain ecosystem, it's OK if you don't know it. The idea of a DAO is that instead of out of the crypto Twitter, I mean, my mom's neighbor who has nothing to do with blockchain told me that he bought an ENS thing. And I was like, oh, I'm working with it. That's nice. I think that lots of people who are now active in ENS joined at this stage, not because there is money, just because they heard about it. It's made an impact. It's a big project that gave control to the community. It's also a bit if you want to work on blockchain, but you don't want to get into all the protocols. And you're not interested in money. Name system is something which is a bit easier to understand and clearer. ENS DAO is very active nowadays. So I was a member of ENS DAO for the first year. I was managing a subgroup of decentralized peer-to-peer websites, which is what I did at the time. I don't do it anymore. But I still follow it a bit. I know lots of people there. It's super active. The forum is active. There are calls every day. There are votings. I mean, for the good or for the bad, very really an active community. And at some point, I don't remember right now exactly when they actually transferred the root key ownership to the ENS DAO, which means now it's owned by ENS. There is one problem, or maybe two. And the first one is that ENS voting goes with ENS token. You can buy the token, which basically means that someone who is rich enough or motivated enough can kind of take over the organization. And if you want it to be critical infrastructure of the internet, it's very risky. If at some point it will be, then someone will take over. I mean, if someone can, then they will. I mean, the DAO can at some point decide that you get voting by reputation. But at the moment, this is the situation. And the other thing, while handshake have short proofs, ENS does not have such thing to verify anything on the Ethereum blockchain. You need quite a long proof. It's not very practical for anyone to do unless it's really your passion, like me. But even then, it's super difficult. I don't know what's the technical way to solve it. If any, right now, everyone compromise on that, and they actually verify things with other services. 2023, which for me is today, because we are beginning of 2024. So the state is at the moment that once ENS DAO went on, they have a huge market cap. They were very, even during the crypto winter of blockchain, they had quite a buzz. People started to make clubs like the 10K Club of People owned the name 1.ETH till 9,999.ETH. There was a website for clubs and stuff like that. It made an impact, and as a result, any blockchain now has their own name system, because it's just easy to make. And they see that there is people who will pay for it. I know people that in each of those block systems buy a few names, because normally they are quite cheap, and they are like, well, we don't know which is a good investment. But articles like the top 10, blockchain domain name systems, I admit that for a while I was trying to follow that, but I didn't find any that has technical innovation, which is what I care about. And I reached the point of saying, well, if something will happen, which is technically innovative, someone will tell me. ENS itself at the moment is focusing on a few things. One of them is subdomains. They want the subdomains to be kind of like domain, so you make someone a subdomain, and then they own it. Like it's not dependent on who owns the name. It's completely independent. If they had something which is called name wrapper, it was developed for many years and launched last year. CCIP is basically cross-chain interoperability, which means how to communicate with, from ENS can communicate with other blockchain. I am not a huge fan of that. I think it's centralized, a decentralized technology, but they seem to, lots of people, they like it. And they really want to join ICANN. Like they really want to get control of the .eth subdomain, a TLD, sorry, but the problem is the .eth is there for Ethiopia. Nick Johnson, the owner of eth, had a long thread about it recently, like a month or two ago. So if you want to read the details and where they held up to this discussion with ICANN of getting it or not, you can see it there. For the other projects, Handshake, I went and checked just before the lecture what's going on there, and I got the feeling that's not much change from the lunch, only that it's less enthusiasm now. Like people still participate in auctions with less money. I didn't find any real use case besides that. If anyone knows and I missed it, let me know. Another story that happened, and I'm going to wrap it up, is a stop-all domain. It's another ENS, another a few blocks name system. They try to patent some names, and now they're a legal battle with ENS. And I thought of maybe speaking of what I think happens in the future, but time is up, so I will not. Thank you, everyone. Thank you. Thank you.
Screen Sharing on Raspberry Pi 5 Using VNC in Weston and Wayland with the Yocto Project and OpenEmbedded
Hello everybody, can you hear me? Kind of okay? My name is Nelonavi, I'm an open source enthusiast and senior software engineer at the Succo Group. I'm here to talk about screen sharing on Raspberry Pi with VNC. And I don't know how many of you were at Fosnum 2023? Alright, quite a lot of people. I have no idea if you've visited my talk, but I had a lightning talk and it was about screen sharing and remote access with LDP. And of course the most frequently asked question was, why don't you use VNC? How can we do it with VNC? So here is another talk, next episode. And I'm going to share my experience in configuring Wayland and Western with VNC screen sharing. The agenda focuses on Raspberry Pi 5, just because Raspberry Pi 5 is something new and we are all excited about it. But actually at the end of the presentation you see that I tested this on several devices. It's not hardware specific as long as you have Wayland and Western running on this particular hardware you are targeting. The agenda includes obviously brief introduction to Wayland and Western, brief introduction to VNC, the integration in the Yocto project and open embedded, and core image Western with VNC demonstration on Raspberry Pi 5. And if you have no idea what core image Western is, you're going to learn in the slides that are coming. The slides are already available on the website, fosn.org. After the presentation I'm going to upload them to SlideShare as well. And the slides are kind of exact steps or at least this was my goal. So if you want to repeat this demo, following the information on the slides you should be able to do it. So let's start with Wayland and Western. How many of you are using Wayland? Oh, that's fantastic. This makes my work really easy because this talk obviously is about Wayland. Wayland is a display protocol that specifies the communication between a display server and the clients. It's supposed to work with different windowing systems. The main project was to replace the X windowing system. The project is starting in 2008, so it's more than 15 years old now. There's a security by design in terms of isolation, the input and the output of every window. As you probably have noticed for many years this was kind of a problem when we want to take a screenshot. But, you know, there's always advantages and disadvantages. There are many different compositors for Wayland. Western is the reference compositor and it's widely used in the embedded industry because it's small and simple. There are other compositors that are much more convenient for desktop usage, of course, but Western as a reference compositor for Wayland is very convenient for embedded devices. What is important actually for this talk is that Western 13 was released in November last year. It's important because the VLC backend was officially added in recent versions in Western. We're going to talk about this in the next slides. What are the remote desktop options that you have in Wayland and Western as of the moment? One of the options is remote desktop protocol. If you're interested in this, have a look at the recording of my lightning talk from last year. How many of you are actually using CARE-DP? And how many of you are using VLC in different setups? All right, more people. But it looks like the majority of the audience is not using any of them. You should start using them. Let's speak. So what is VLC? It stands for Virtual Network Computing, and it's a graphical desktop sharing system based on the remote framework before protocol. It was initially developed by Oliveri. This was 25 years ago. VLC has been around for a quarter of a century. It's a pixel-based screen sharing protocol. That's the big difference compared to LDP. And it works with all windowing systems, including Microsoft Windows, macOS, GNOME and other distributions, no matter if they're using X11 or Wayland. Here's a matrix comparing LDP with VLC so you can understand the major differences in relation to this talk. So obviously there are different types of protocols. There are different types. They work in a completely different way because LDP is a semantic protocol. It's a way of control, forms and other graphical primitives, while VLC, as I mentioned, is pixel-based. LDP has been in Western since 2016 and version 2. VLC was relatively recently added. It was added in Western 12. This happened in 2022. So the story of adding VLC to Western is a long story. Keep in mind that I'm a later Wayland more Western developer. I don't have contributions to Wayland and Western. I'm an engineer working with the Yachter Project and Open Embedded. So I'm kind of a user of Wayland and Western, and I'm doing integrations with Wayland and Western on various embedded devices. But we need to give the credit to all these people who put a lot of efforts providing VLC back-end in Western. The initiative, As Fires, I know, started by Stephen, and you can see that this happened almost five years ago. It took actually three years ago, three or four years ago, but it took a while to be merged. And finally, with some changes made by Philip, we're here. This was merged and landed in Western 12. The current version is Western 13, Western 14 is in development. What I've seen in mailing lists and some emails that have changed are some new exciting things that are supposed to happen in new versions of Western. But I'm talking about the things that are released and stable and you can use right now. So just a few sentences to give you an understanding how VLC works with this back-end for Western. It's available in Western 12 and newer versions, which means 12 and 13 as of the moment. It's going to be available, of course, for the new versions as well. It depends on a couple of open source projects. Meet VLC is one of them. This is an open source VLC server library with a clean interface started by Andriy. The source code is available in GitHub in the ISC license. So Meet VLC is built on top of other dependencies such as libdlm, mason, and package config. And in the runtime, it depends on Andriy's main loop. This is another open source project. And as you can see from here, it's actually by the author of Meet VLC. So they're tightly related. And now we're moving on to what I do on a daily basis. And this is the Yachter project and Open Embedded. How many of you have experienced with the Yachter project and Open Embedded? Oh, that's fantastic. I have been telling everyone that the Yachter project and Open Embedded are becoming a de facto industry standard. A lot of people were laughing, but now I see that there's so many hands, so it's good to see it. So if you're new to the Yachter and Open Embedded in a nutshell, the Yachter project is a collaborative project of the Linux Foundation. It helps you create a custom embedded Linux distribution for your particular needs for the hardware that you are targeting. It uses the Open Embedded build system, which includes BitBake and Open Embedded Core. The most interesting part of the Yachter project, in my opinion, is actually Pocky. This is the reference distribution of the Yachter project. It's provided as a metadata. There are no binaries. Pocky gives you the opportunity to quickly bootstrap your own distribution, so you're not starting from scratch. If you want, you can start from scratch, of course, but you can also take Pocky and build on top of Pocky, modify it, configure it, and so on. So it really speeds up the process and simplifies the whole process. The Yachter project has a steep learning curve, but once you know how it works, it simplifies the process for building your products and cuts the time to go to the market. The Yachter project has twice per year a release, and there are long-term support releases, which are covering at least two years. Here is a short list of some of the recent Yachter project releases. This one here is supposed to be released in April this year, so in a couple of months, and this is going to be the next long-term release. It's going to be maintained for actually four years now, and the current LTS release is Kirkstone. It was released in May 2022. It will be maintained until April 2026, or at least until then. The previous, actually the first long-term release was DINFO. It was released four years ago, and it's still maintained. This is quite convenient for the embedded industry, because this way you can take a stable release with long-term support and spend some time developing your project without having an outdated source by the time when your product is ready to be shipped on the market. West End and West End are provided as recipes by an open-embedded core. Here you can see a screenshot of the Git repository of open-embedded core containing Wayland and West End. There are different versions of Wayland and West End recipes, depending on the release of the Yachter project. Here comes the tricky part when we're speaking about VLC, because as I mentioned, VLC was relatively recently added to Wayland and West End, so you have to pick a version of the Yachter project which provides the required versions of Wayland and West End, or otherwise you have to backport them to an old version, which is... well, it opens the can of worms, so I don't advise you to go this way. You better migrate your product to a version of the Yachter project that maintains it. So, version 5 of the Yachter project is not released yet. As you've seen on the previous slide, it's going to be released in April, but as of the moment, it's targeting West End 13, so maybe it will end up with slightly newer version of West End. I don't know, but basically you have VLC there, and you have it in the previous release with West End 12, however, this is a short-term support release, so if you are starting working to work on a new product, I highly recommend you to go to version 5. Other long-term releases such as Kirk's Room and DILFAL are with older Western versions, which does not support VLC, so if you want to take advantage of VLC in West End and you are based on the Yachter project, make sure you are using a release that's compatible. How to bid-bake West End with VLC? First, the VLC backend is not enabled in West End, so those of you working with the Yachter project has to create a bid-bake plan file for West End to extend the package config variable and to add VLC to it. This will trigger enabling the backend VLC argument to true when you are building West End, and also add need VLC as a runtime dependency. I've explained that actually as a bid-time dependency, and also you need to place a plan file. However, I have created an example reference layer which provides a demo integration of West End with enabled VLC. This is the same open-source project that I started a year ago for my previous talk, which was about RDP. This year, actually the end of last year, I just extended it by providing splitting it into two layers, one layer for RDP, another layer for VLC. I hope this demo layer will help other people quickly get to speed with VLC or RDP, whatever they want. Here, of course, this talks about VLC, so the examples that we're talking about are VLC. Behind the core test, what happens is that when you do these configurations, when you add VLC package configuration to West End, you need a bunch of dependencies. These dependencies are AML and need VLC. As part of my work at Consulka Group, I took the effort to upstream these, and they are part of meta-open embedded. The West End environment recipes are actually in open embedded core that in order to build them with VLC, you have to make sure that you are adding meta-open embedded meta-oE to get all these two recipes and their dependencies. The good thing is that actually the community liked it, and I see that our people started contributing to these recipes here. There have been several updates. So these projects are obviously needed by a lot of people in the community and in the industry, of course. A little bit of information on how the baking of need VLC for West End and VLC works. So for need VLC, in order to work with the whole setup for West End and VLC, we have to enable encryption with TRS. This is how it works. Keep in mind that different versions of West End depend on different versions of need VLC. Here is a particular example for the things that are right now, but if you are watching these slides or this recording in two years, probably it's going to be different versions. So keep in mind how you can adjust these things. That's the point of this slide. And this is how we do BB Append for West End Ine. Basically, we are creating user West End and password West End. This is the same user and password that we are going to use to log in remotely to our targeted device. In this case, it's a Raspberry Pi file, but as I mentioned, you can do it with any other better device capable of running West End. So this is standard VLC to staff. You just deploy a directory that you need and so on. So here is how you generate the certificate. So a bunch of commands. So obviously the important thing is here that you have to set up the name of the host to make sure that all or most of the VLC clients will work. Later on, I'll show you an example of what kind of a problem I had because initially I didn't set it up properly. Here I'm targeting Raspberry Pi 5. This is the host of the device that I'm going to target. So I'm setting it here. Finally, of course, I have to copy the certificates to the target device. In my case, to the Raspberry Pi 5. And West End Ine is the main configuration file for West End. We have to configure it there to make sure that the screen sharing is enabled. So after reloading West End with this configuration, in my demo setup, West End is started by a system-based service. So I have to reload the system-based service to start West End. And once it's started, I have to press Ctrl-Alt plus S in order to enable the screen sharing. Of course, you know that many embedded devices don't have a keyboard. So pressing Ctrl-Alt plus S is not an easy task sometimes. It sounds easy if you're working on a desktop computer, but sometimes you're not on a desktop computer. And here comes a really convenient feature. You can use the automatic startup of VNC just to say that the same is available for RDP. So no matter if you're using VNC or RDP, you can do the same. Startup. Here is the configuration. Startup on startup. Start on startup. True. And finally, when we have the Raspberry Pi working with Wayland, West End enabled VNC from another computer in the same network using a VNC client, which we can connect remotely. In my case, I'm an open source enthusiast. I'm trying to use pretty much open source everywhere I can. So here I'm using a VNC client. And here's how I started it. Here's how I blocked in with user West End and password West End. You saw how I configured that. Here is the VNC demo. It's not super exciting. Basically what you see here is Ubuntu 20.2.4. This is the long-term release of Ubuntu that I'm running on my computer in the same network. You see the VNAGR window. And inside the VNAGR window, the VNC client, you see what is my Raspberry Pi 5 showing on the screen. And a frequently asked question is what is the performance? What is the frame rate? I haven't made a lot of tests, but by running some simple demonstrations like on the screenshot here, it runs with up to 20 frames per second, something like 18, 19, up to 20 frames per second. I haven't performed any optimizations. Basically you know that when you have screen sharing enabled, the performance in terms of frames per second is dropping out. A little bit of testing. I have mentioned that I've tested it on several devices. I made this talk about Raspberry Pi 5 because I was involved in the process of adding Raspberry Pi support in Meta Raspberry Pi 5 BSP layer. Now Raspberry Pi is there. It's in the master branch. It was recently merged by the maintainers. And now I've opened the GitHub request to Backport, actually the support for Raspberry Pi to Kirkstone in Meta Raspberry Pi. However, in terms of VNC, I have tested it with appropriate Western versions on other devices, Raspberry Pi 4, RockPi 4, this is with a RockChip system on a chip, Toradex version, IMX 8M plus system on a module. So the tricky part about NXP, no matter if you're using IMX 8, 7 or 6, is that you have to switch to the EtnaViv open source drivers because otherwise with the Vivante proprietary drivers, you have to use a fork of Western called Western IMX. And Western IMX, it's always a version behind. The last time I checked, which was a couple of days ago, the version that was available was 11. So we don't have in Western IMX the VNC back end. So if you are running into this situation, consider switching to EtnaViv. So we're wrapping up the talk with some conclusions. VNC is a pixel-based graphical protocol. It was added in Western 12. It works on any device that supports the appropriate Western version. It supports a TOS encryption. And I have created this layer, Meta Western Remote Desktop, which is available in GitHub, which allows you to build quickly a demo with RDP or VNC. The slides are available on the Fosn.org website. Thank you very much for your attention. I think we have a couple of minutes for some questions. Hi, thanks for the talk. I had some issues with RDP as well. I think it's when you mention performance, I think it's to do with the screen share plugin and the full screen plugin. It pins the single rendering thread of Western. So when you have RDP and stuff enabled, even on a more powerful computer, it does affect the presentation locally of the graphics. Do you know if there's any work being done to improve that architecture, maybe using shared buffers? That's a very good question. Unfortunately, I cannot provide a good answer to this because as I mentioned, I'm not a Wayland or Western engineer. I don't have upstream contributions there. In terms of Wayland and Western, I'm integrating it in the Yocto Project and Open Embedded. So you probably have to ask in the mailing list to compare the particular backhands available for Western. Yeah, thank you very much for the talk. This is a bit off topic and I'll keep it quick, but on the screen sharing idea, my boy keeps getting me to use this Parsec stuff, which I think is encoding screens as 8.264 movies, basically. I like the idea of leveraging GPU to the screen sharing. Have you looked at that at all? Is there anything you can comment on about that? No, sorry. I'm not familiar with it, but I can provide a comment. I'll have a look at it. Okay, thank you very much. That was great. All right. Thank you very much. If you have questions, I'll be around.
"Vanilla" Debian On An Industrial Embedded Device
Hello, everyone. Can you hear me? Hello, hello. So, I guess we can start. Is it working? Hello, hello. Quiet please. We're about to start the next presentation, guys. Hello, again. So, welcome everybody. Can you hear me? Can you hear me? Okay, great. Hello. So, I am Francesco. I'm here to talk about installing the IBM on an industrial embedded device. What I mean with an industrial embedded device, let's say, not a consumer device. So, it's a device that you might find in industrial automation, building automation, or an agricultural machine, not a baby PI, okay? And I am an embedded Linux engineer. I'm working with Uboot Linux and Open Embedded and using Debian since a very long time. And this is my distribution of choice, but I'm not working with Debian lately. So, what we will cover today. Why this started? I had some hardware available. I mean, that's because it's my job. Something like that, okay? That I normally work with this with Open Embedded. And I was wondering, why can't I just install Debian on it, okay? If we're supported by the upstream Linux kernel, everything is in place. What is preventing me to just install it, okay? And this is where we all started. And this is a little bit of talk, which challenges I had and I was able to get there. We will mainly talk about ARM and ARM64 devices. We will focus on the Uboot-Booter order. And the focus really here is about installing vanilla Debian. So, there are a lot of ways you can install Debian. Debian is running everywhere, not invented by myself. The focus here is really about doing just following the instruction and getting it done. So, just to set a little bit of the stage, a little bit of overview of an embedded system, but also any system boot today. After the system on chip, the CPU gets out of the system, they are going to get configured in some way. This part can be really, really complicated. And at some point, the firmware needs to load the operating system. What it does, it needs to figure out where is the kernel. It needs to figure out, in our case, where is the device tree. And put everything in memory prepared and then jump into the kernel entry point. We will really focus on this step here about preparing the binaries and jumping into the kernel. This is our focus for this talk. Something that is important to mention and will make our life easier is where is the firmware stored in Flash. Traditionally, on your PC, you have an SPI in AirFlash. That is where your UFI is stored and this is completely out of band. This is not where you are going to install your operating system. On embedded device, it depends. Sometimes for curse saving, you have only one device in which you store everything, the firmware and the operating system. Sometimes they are separated. The order that I am going to consider today is using EMMC. EMMC have very nice features that allows to do hardware partitioning. Normally, they have a dedicated partition for the boot firmware. This enables to not have to warn them about overwriting my firmware while doing the installation and operating system. This is not possible, for example, using an SD card like you would do with a Raspberry Pi. Raspberry Pi is booting from an SD card and then you cannot really do that. This makes the stuff more complicated. Good. In our case, our firmware is, as I said, U-boot. U-boot is a platform firmware. It supports a lot of architecture. In the end, it configures the hardware, as I said before, and then it is able to load the Linux kernel. Traditionally, let's say the past millennium, this was very coupled with the operating system that was loaded. Some time, a few years ago, probably 10, 8, I don't know exactly, there was introduced a new feature called DistraBoot that is trying to solve this task of loading the operating system generic. How does it work? U-boot is scriptable with shell script sort of and as environment variable. DistraBoot is implementing with script a generic way to search for a bootable partition and then to search for a way to boot the operating system that it found. In short, you tell the board which are your boot devices and then you just include this header that I mentioned earlier and that's it. It's very easy to integrate. What it searches for normally is a boot script with a fixed file name, searching either the first partition or the first active partition and executive content. It can also parse an x-tlinux.com file that describes how to properly load the OS. We will focus on the boot script that is more flexible. It allows really to do everything because it's really code. The reason we focus on that is that this is what is supported by default on Libyan. X-tlinux is working in Libyan but if not out of the box experience you get from the Libyan installer. The boot script normally what it will do in the end will be loading the kernel from some storage device, loading the device tree, loading your init addy and then we'll just jump into your distribution. Cool. Let's move on the operating system side. The Libyan has a package that is called the FlashKernel and it's really a glue package between the operating system and Uboot. I mean, FlashKernel is a little bit more generic but it's really able to integrate directly with Uboot, generating the boot script we just talked about. It's integrated into the Libyan installer, into the ARM one, and integrates with the kernel packages out of the box using hook. So it's supposedly to just put everything together. Given that in theory it should be good. We have Uboot, we have this package, so what I did was I tried to go through the installation. I took one model, for instance exactly the one that I just showed you before, probably the Libyan is still installed there, and I just followed the instruction, okay, as easy as that. I decided to do a net installation just because it was at the moment the most convenient for me, but again whatever is copied pasted here is just what you can find on the manual. And the result, it was not working. Why? What I figured out is that FlashKernel needs to know about the actual hardware to be able to properly generate a boot script. This is why this is required. This is required because it generates a boot script that are really matching the exact hardware that we are running on. So you really need to tell the exact device tree file that will be loaded in the boot script, and this is going to be part of the boot script. It's also possible to have custom boot scripts and have a sort of additional customization, but if your board is properly supported in Uboot, there is really not much to do apart from telling you use this flavor over the kernel and use this device tree. I mean, for instance, in Debian ARM port, it's supporting two flavors for the kernel, ARM-HF and ARM-HF-LPAE, and you need to tell which kernel flower you want to use. What I did was just opening a merge request on the Debian GitLab instance, SalsaDebian.org, and I mean, I am not a Debian developer, nor a maintainer, I'm just a user, but my merge request was just a reviewer and a sceptre. Now, for instance, it's part of Debian Bookwork. So it just opens for everybody to contact boot. Cool. Then I wanted to try something else. Distribute, at the moment, is considered a deprecated for a couple of reasons. One of the reasons is that being implemented with shared script is a little bit cumbersome. It's very difficult to understand how it works. There are scripts that are set in global variables, and then other scripts will rely on the previous global variables that are set. I mean, if you look at this for the first time, we will just get lost. There was also a need, a long-term plan for Uboot to completely move to a Kconfig based system, configuration system, and for that, there was a need to remove some include file, as we saw before, Distribute is configuring through a configuration include header. That's the main reason standard boot was done. I didn't have any board with this enabled myself, so what I did was just enable this on the board. I mean, it's trivial. This is just defining the boot targets that are just equivalents to what we discussed a few minutes ago regarding Distribute, and then enabling a few configuration options. From the integration to the distribution is pretty much the same. Standard boot is more generic. It supports also UFI. I mean, it's more, but in the end, it integrates the same way with the Flash kernel package, and it's able to execute the boot script. The documentation is linked here, and I mean, it's way more than what I showed here in this. It was a Flash USB Flash drive. And with that, I was just running that bootstrap from my PC directly on the target using USB. And there are a few queers to take into account because the architecture that my laptop is running is x86, while the target was an ARM. And during the second step of the bootstrap, you are going to execute target binaries. And this was really easy to be done also because as of today, using QAMU user static package and using the bin format MISC support in the kernel, it's really possible to run cross-platform-byte-arm binaries on an x86 in a transparent way. And all of that is just integrated into that bootstrap, and it's just a matter of installing this package. So I went through my depth bootstrap installation completed. I did all the required steps, and then it was not working. What was missing this time apart, of course, this board also was not supported. Today, this specific word is. What was missing this time was that the Debian kernel was not supporting the specific architecture. Again, the Debian kernel development is done on GitLab, on this GitLab project. And they are just open to taking merge requests, and it was really probably a free-line change to enable this architecture in a decay config. This is merged at the moment, and with that, finally, it was all working. Good. Last. Till now, I was playing around with old ARM 32-bit architecture, and I wanted to try something more modern and faster. I was getting older and older with the old ARM processor, so I wanted to try with an ARM 64 system. What I realized is that the Debian-y standard for ARM 64 just expected to use UFI. I believe that it would be possible to avoid it. It's not like it's a master, but if you just take the standard path, this is what is expected. And they said, okay, you would support a subset of UFI. This is targeting this specification here that I think of here. It has a few limitations. It's not a full-blown UFI implementation. One of the main limitations at the moment is that it does not support this set of variable up runtime. We will see in a minute why this is relevant. In a very simplified way, I hope that there is no expert on UFI here, what UBoot is doing is searching on a GPT partitioned device for a specific partition called ESP. It has a specific UID. And into that, it loads a binary in a UFI format that is more or less a variation of the Windows portable executable. And it just executed. Anyway, moving on, UFI was not enabled in the board that I have available, but there are really no hardware dependency on this functionality. It's pure software and it was just a matter of enabling it. You can see it's a lot of configuration options, but there is really just a generic code, nothing hardware specific. Good. So I tried again to install Debian and this time, let's say it was working more or less smoothly. It was possible to get to the end without having to do any kind of customization. What was a little bit scary was the fact that the installer was erroring out during the while making the system bootable. What it complained about was exactly about this set runtime variable that I just mentioned before. That was not working and the message is somehow scary. How this is working? UFI is configurable. There are variables and the operating system is able with setting the boot order and boot 00, boot 001 variable. To specify which device and really impractice the file name into the ESP partition. This is really what this is about. Normally, any modern operating system will install with a specific file. I believe that for Debian might be something like Debian.EFI, I'm not sure. However, this is not possible at the moment because of this limitation in Uboot. Debian is able nevertheless to install with a fallback location. That is what is used for removable device. If you think about USB flash device or your CD-ROM, there is a fixed name that is sort of fallback. Debian is able to install there and with that, I got a bootable system. I had of course a few small issues about the fact that some functionality were not enabled in the kernel, but that was straightforward to solve. What I was wondering here at this moment is which device tree was being used. Because device tree is really the hardware description of the board. This is critical for the US to properly use the hardware. While using this standard boot, this is really well defined and the flash kernel package is really telling you which device tree is using. With UFI, this is really not visible. So I started digging a little bit on that. What is used by default is the internal device tree from Uboot. Uboot is also using the device tree for its configuration and it is just able to pass it down to the operating system. Here we are. So what's next? As you saw, as of today, with a very small effort, it is possible to run a pure Debian experience on an embedded target. Something that I assumed here is that you have upstream support for your device. This was the case for the hardware that I have available, but if you take a random board, this might not be the case. Without that, Debian is not going to take your 1000 patches just to enable one board. Okay, this doesn't exist. Second, the integration. What is envisioned is that the device tree is really coming from the firmware in the hardware description. In practice, if you are familiar on how the development of the device tree works, as of now, are stored into the Git repository of the kernel, it's really an incremental approach. You would really like to have to update the firmware when you update the device tree. This is just the reality that we have today. So it would be very nice to be able to take the device tree from the root file system. However, it's complicated. When you boot with UFI, normally you have an intermediate boot loader. You don't jump directly in the kernel. So if you boot through grab, grab is able to load the device tree, but the loading on device tree is disabled once the boot is enabled, for example. And the other issue is that normally the device tree is going through some fix up on the... Why the boot firmware is executing. And if we... The grab is loading, if this fix up are not obvious to do, there is an implementation that is able to ask Uboot to do this fix up afterwards, but all of that is available as a patch, is integrated in Ubuntu and not in Debian and not in Mainline. And then, I mean, we can make this even more complicated if we think about device tree overlays that are binary patches, and they are normally used for non-discoveryable buses that are common. Think about an embedded device, I2C, SPI, or an LBDS display. And that's it. Opening for Q&A. So, any questions? Thanks for the nice talk. Just want to ask you, are you aware of the ISI integration tool for Debian? ISAR. So we are doing heavily Debian on embedded devices, on industrial devices with that. And it also addresses the topic of not upstream firmware. I mean, I'm not aware of this. I mean, I will have a look. I was really focusing here on a pristine and pure Debian experience. There are tons of way in which you can install Debian, tons of way you can have a custom kernel. There are really a lot of possibilities to use. I mean, I could have built probably three slides only on the options that are available. I was really willing to focus here. I want to go Debian or follow the instruction and get it done. This was really my goal here. Thank you for this presentation. I just didn't understand when it is booting from UEFI, it's boot from UEFI, where U-boot is placed, where it is in which partition or in MMC, or where it is, you said you're... U-boot. U-boot, boots from USB, if I understand. No, I was using USB just for the installation purpose. So I was just having the board available as it was a USB flash drive for the installation. After that, U-boot was installed nevertheless together with the UFI implementation on the MMC boot partition. Okay. This side. Thank you. Okay. We don't have any time for any more questions, but thank you very much, Francisco. Brilliant. Thank you.
Using linux-yocto as a Yocto BSP kernel
So, my name is Mitri, I'm working for Linaro. Today I'm going to talk a little bit about the Linux Yopto, a fairly unused Linux kernel or Linux BSP for Yopto. About me, I've been working on both Open Embedded and Linux Kernel and contributing them since 2007. Maybe some of you guys remember Open Zaurus, I've been using Open Zaurus but not contributing to it and started when it became on-stream. So, Linux came about 2000 commits and in our part of the Qualcomm ecosystem, we are working on the Qualcomm devices and I'm maintaining the Metaclick-com, the upstream and the open source BSP for the Qualcomm devices. And this talk is based on our experience with supporting or providing the canals. Should I move somewhere? With providing the canals for the Metaclick-com. So typical Open Embedded board support package, of course it contains Linux kernel. First come a recipe, custom. Initially the BSP will find their own recipe, their own way to do things. They have the source you write points into the Git tree. Yeah, sure. Sometimes this Git tree tracks the whole development history with all the tries, with all the attempts. Or sometimes it is just written for each major release or sometimes it's a mixture of both. So yeah, Roberto, the fix of the fix. It is not an imaginary thing, it is what I saw in one of the BSPs. Do you know how to track if the patch has been ever sent to upstream? What's the status of that patch? No. Which version is it? Well, if you're lucky, it is a long term support kernel, which you can up to the latest LTS release. I tell you, if you're lucky. And security updates, if you're extremely lucky. How to configure? Yeah, there will be usually a different config file either in the layer itself or in the same Git tree. So any idea how to upgrade it? Yeah, or how to enable the net filter or an other obscure option that the BSP vendor did not enable for you. Trouble some. And yeah, everybody does it this way. I think so. Well, I thought so. We tried to change this way for us. Linux Yachto. I knew about it for ages because it was the kernel that was used by OECore, by the core layer for the OpenEmbedded for Yachto. It was used for the QEMO machines. It has been used for some of the default BSPs. But why should I pay attention to it? I have my own kernel. Well, not quite. We found that it follows stable releases. It also follows the latest Linux release. It tracks the release canals. It has a very powerful kernel configuration tool based on the fragments, based on the internal scripting language. And it is endorsed by the Yachto project and the OpenEmbedded layer program. And that's actually what made us to look into it. We thought about making Metacore actually certified for Yachto project compatible layer program. And it is one of the recommendations. Sounds perfect. Well, all the stuff, all the DevConfig, all the points from the previous slides have been pointed. Yeah. The problem was that nobody uses it. We are trying to do it. So some literally small how-to. Yeah. This reminds me of one of the talks of reading the Emacs config. But I will be reading the Metacore configuration files. First of all, the recipe itself is in OECore. We do not have to provide any additional details. We do not provide the Git repo. We do not provide anything. The default is same. We just say, yeah, let's use the defaults. Let's enable it for our machine. Q-com. R&D is our OpenEmbedded machine. We will be using our paths and the bonus stuff. There will be a lot of the files described in the configuration. There will be a lot of the files in the SC format and the CFG format that should describe different options for different machines. We just need to enable a single root file. The kernel.yokto class will get all the files that are beneath that. So we do not have to list anything else in the source URI. If you need, you can add more of the configuration options. You can enable other features just by adding another append in your distro or in your enablement layer. That is it. You do not have to patch anything. You do not have to patch the DevConfig in either of the layers. You do not have to create something that tracks, oh, that was the DevConfig from that layer. The options changed. It also tracks stable, as I said. So we do not have to upgrade the versions. We do not have to upgrade anything when there is a new release from this stable team. Oh, sorry, one button. Patches. Of course, the BSP layer has tons of patches, hundreds of patches. We have to list all of them. They come in a series. This is just a few lines. In our layer, there are currently 78. We are trying to limit them to some same amount. And a bonus feature, bonus point. Because this is a recite from OECore, we have to track upstream status. So for each of the patches, there will be upstream status trailer that talks, yes, this patch has been submitted. Oh, sorry. We did not submit this patch yet. History is no longer written in some obscure git tree. There is all the history of the patches comes from the layer BSP. As the user, you can take a look at any pointers and find, okay, yeah, these patches have been enabled for this and that platform. And they have been changed this and that way. Oh, and when we're basing from 6.5 to 6.6, oh, they did this and that mistake, and I know how to fix it. So the whole history, the whole development is visible to the, well, is visible to the developers, is visible to the users of our layer. Config fragments. As I said, there is a powerful system of the config fragments. They have the SSE files that describe, okay, how it all beams together. And of course, the CFG files, the parts of the actual configuration. And SSE, they provide a street-like structure. They can include other SSE files or they can include config files. So there is a huge set of default features that you can enable. You just pull files from the default set that has been written for you by Richard, by Bruce Asheville. Several downsides. There is no control over the exact kernel version. This is all done by Bruce Asheville, by the maintainer of Linux Yachto. And when he upgrades to the next version, yeah, when is his decision. Before this new year, he decided to let everybody stay calm. And so he delays upgrading to 6.6 LTS for several weeks. And unless you know what's happening, this can cause some confusion. So we had to ask Bruce what's happening. Sometimes Linux Yachto gets delayed. Sometimes there are additional patches. Well, in fact, as it is a BSP for several devices, it has additional patches. And you have to understand how that corresponds to your device, how this conflicts with your patches. And the most important thing. So previously, you can easily have a set of developers working on just Git tree for your kernel. They do their job. They do it all right. They just tell you, OK, this is the Git commit that you should be pulling into your BSP layer. Yeah, OK. Now it is also responsibility of the maintainer of the corresponding layer to actually see what's going on, to work in close collaboration with the kernel developers. In our case, I'm working also as a reviewer for the patches being submitted by our developers and by Qualcomm developers enabling the particular features because sometimes they are not. So you can no longer just depend on being, oh, I'm Yachto developer. I'm open and better developer. You have to be kernel developer too. And the last but not least, so what if we have several hundred of BSP patches? How do you track them? How do you actually manage them so that it does not become the mess? So we solved that by splitting them into the series. And so we actually are working with, as I said, with Qualcomm people just, OK, you cannot say that, yeah, these hundred patches are for to enable this platform. You have to say, yeah, this is a feature, this is a 10 patches enabling this feature. These are 15 patches enabling another feature. And so splitting and tracking different patch sets separately. So rolling see, of course, there is the Linux Yachto itself repository, which has all the branches, all the patches, all the history from prehistoric times till the recent 6.6. Yachto can all cache. That is what I told you when I said that you are pulling the config fragments from the upstream. This is the repository with all the configuration fragments, with all the configuration scripts that your layer will pull, that will combine into the final kernel configuration. Yeah. Our unproud Metacucom, the upstream Qualcomm layer. If you are working on the, for some reason, Qualcomm robotics or on Qualcomm robotics platform, or if you are thinking about using the Yachto for your phone for some reason, and it works, please take a look. This is the area that you might be interested. And yeah, of course, yeah, linear services. We are linear development services, and we now have an account on Mastodon. So please join the followers, and of course we are hiring. So that's it. Ah! Yeah. Questions? Questions. Questions. Hello. How does the feature set from your kernel compares to some internal developer, the standard kernel Qualcomm provides? So I'm not working with Qualcomm, but I see often the big differences between the vendor kernel with thousands of patches compared to what is upstream. Yeah. So as I said, we are working for the Qualcomm upstream, or working on Qualcomm upstream enablement. So we are tracking what is going upstream, and we are developing, and we are sending patches upstream. So yes, right now there is a talk, or they just have been talked about different Qualcomm kernels in another building. You might be interested in statistics. In our case, as I said, currently for this metric, we call for enabling platforms that have not been fully integrated upstream. We have about 80 patches. Ah! Oh! Yeah, it works. So, internal deep tree, before we switch to Linux Yachter, contained from 150 to 200 patches. And so one of the reasons for switchover, and one of the points was that we were able to clean up that stuff. We were able to drop several, okay, I think it was about 20 patches. Just touching bit of config in a different way. So everything now goes to the Yachter. We were able to drop several obscure patches that had been lingering for years. And to move those patches actually into upstream, send them to upstream, rebook them and drop them finally. So I don't know if that answers your question. This doesn't work for the downstream development. Well, you are window with thousands of patches.
Embedded Security 2023
last year. Hello, everyone. Last year, first time, I was talking about errors in embedded development. And I would like to repeat a part of the experience that we have had last year. Please think about an embedded project you are working on or you have been working on recently. Lock it in your memory. No cheating. You lock a project. Now, how many open SSL versions are there in that project? Raise your hand if that's zero. Like 10 people. Raise your hand if there's one. Like 20 people. Raise your hand if you are sure there are two or more. Like less. And raise your hand if you do not know. That's the majority of the room. I think there are a little less people who do not know, but still the majority. Why the question is important? You will see later. And a bonus question for people who knew how many versions of open SSL they had. Who can list the total, who of you has a full list of dependencies of that project? Okay, I round 20 people. Congratulations to you. Now, who is Marta and why she is talking about such things and asking such intimate questions? I'm a security researcher. And then what to expect from the 2024? Now, let's talk from regulations. Regulations that plural are a little bit too much here. One regulation. Because that's a 25 minutes version of the talk. So, their regulation is the CRA. Now, one slight simplification of CRA. To your lawyers, I am simplifying. The CRA is adding security, mandatory security requirements to all products that will be put on the market in the European Union by the requirement of the C-mark. The C-mark, you know it, on all electronics you have the C-mark. It's extending the C-mark to add security, mandatory requirements. Examples of the things that are mandatory. No release with known vulnerabilities. As bonds. Secure configuration by default. Updates by default for all users. And so on and so on. There are two pages of those requirements. In the final version, it doesn't apply to open source project themselves. In most cases, it applies to products that are integrated open source. All products, in fact. It will require paperwork. Mainly risk analysis and the vulnerability management process. And what this paperwork will be, I cannot tell you right now because it's going to be defined further. As for most of the things C-related, you have self-assessment by default. But there are certain classes of products that will require more. Including external security audit. That's an expensive thing if you haven't done one. And that's hot news because we have a final version. It's expected to be voted next month. And from next month, there will be three years till the final implementation. Now, the current version excludes non-monetized open source project. That's a big simplification also. So if you are contributing to an open source project, it doesn't apply to you. But for all integrators, embedded people are integrating open source in their products. So basically, it applies to the whole embedded. There will be risk analysis to do for all components that you include. And that's why the question of what do you have as components in your project is important. And now the big question for the whole embedded open source community. Is everyone going to do this paperwork alone? Or are we going to do the paperwork the open source way and share the documentation prepared for each single dependency? That's a big question for 2024, for all of us. If you want to know more, if I scared you enough, I've written an article published at WN last year, so it covers the first version. And for your trip back from FOSDEM, there's a nice read, the final version of the regulation, just 189 pages. But it's not boring. I didn't fall asleep, it's not boring at all. Now, let's go to trends, apart from the regulation. CV numbers. What is a CV? CV is a way to name vulnerabilities, public ones. It stands for common vulnerabilities and narration. And the number of registered public vulnerabilities is growing up. And in 2023, it went up. Yet again, we have yet another year of a record high number of CVs. I haven't been splitting embedded, non-embeded, but for embedded, that's the same statistics. The number of vulnerabilities is going high in a very important way. Now, a complex problem of funding of security work. In the recent two, three years, and there was a big part of this process happening in 2023, there are external funds paying for security work in open source projects. Two main examples of that, OpenSSF Alpha Omega project that funded, I've chosen examples from the embedded field. OpenSSF Rust, Python, Eclipse Foundation, and the Sovereign Tech Fund that has been part of the work for the Yocta project and other projects too, but in the embedded field. Because of this funding, because of the pressure of the regulations that are happening not only in Europe, in the US there's also different pressure, but in the same direction, we are seeing the update of processes in different projects. An example of that, the Yocta project has now a security team and working security process. In relation to all that, we also have tools that are either being implemented or they are being used more and more frequently. For example, the S-Bomb generation, either in the Cyclone DX or in the S-Bed X format, that is getting more and more common option. In embedded projects, yet another example from our field, S-Bed X is now generated by default in the Pocky reference distribution in the Yocta project. And a similar tool link on the dependency checking and CVEs, you have that in the platforms like the Dependable on GitHub, Standard on Tools also, tools are happening and the pressure to use them is happening too. And another big question for all of us, all that work, it requires someone to do it. To do the security work, to do the processes, to look at the results of tooling, even if they are the CI, you have to have someone looking at the results. How can we do it long term and especially how we can fund it long term? Those external forms may disappear one day. Big question for 2024. Now, for the events, vulnerabilities and incidents, I had to cut things because I want to have time for questions and it's only 25 minutes, so I had to cut. And this is what I have chosen for this year. HTTP2 Rapid Reset, also known as CV 202344487. This one was actually exploited in practice between August and October of last year. And it's a vulnerability in the HTTP2 implementation, or a little bit in the specification itself also. And if a client creates a parallel stream, HTTP2 allows parallel streams for the same connection, if the client creates a parallel stream and just immediately after sends a message to close that parallel stream, this is generating a high load on the server. The creation of stream is pretty expensive. And as a result, you get a denial of service. Most HTTP servers have been affected and there was a big number of releases happening in October 2023. What is interesting in the whole story is that the servers that are more for the embedded market, so with careful resource allocation, with limitations of number of clients, or limitations of streams per client, they had better resources, less vulnerable to this issue. For example, like HTTP, they clearly state that they are not vulnerable to that issue. I'm providing a link to the NVID entry for that problem, with dozens of links for different projects with information, or what they did, or what they expect users to use as configuration options to prevent such things in the future. And then a little bit of fun. It's either funny or it's frightening, depends on how you read it. The whole thing happened in 2022, but it has been published in 2023, so we can say we put it in 2023. This was a long story, but in short, some trains in Poland weren't starting after maintenance. And the maintenance company took a team to the river engineering company, and what they figured out that there were things like, train was locking with a vague error message after staying in one place for a long time, or the train was reporting errors after staying at some GPS positions, which by coincidence turned out to be GPS positions of workshops of the competitors of the manufacturer. Or in some trains there was a log based on a date, well, related to the CERA, but also related to all the things happening on the market. Until now, embedded developers were choosing their dependencies. Well, it does the work, I can take it, if there is a license matters. In the future, it may be that license matters won't be the only condition. There may be also a condition that this project have security policy, is this project providing regular security updates for five years or more, and there may be the need to do the triage in your dependency list, in some surprising places also. On the S-Bomb site, last year we have had S-Bombs being generated in more and more places, generating S-Bombs at school, but it's even more cool to actually use them for something. So I think that's going to happen this year, and then on the pure vulnerability side, we are still seeing products being developed to be in an internal network, not connected to the internet, and then someone puts a GSM modem in there. I am expecting a few funny vulnerabilities like that. Then the hardware series is going to continue, not only chips but also firmware. Have a look at the size of the firmware of your network card, or your graphic card, or your gpu, or other thing, or phone chipset. That amount of software means there are bugs. If there are bugs, they are also likely security bugs. I expect that, maybe not this year, but sometime in the future, the future will have a big issue related to firmware in one of those categories. My personal pick is network cards, a packet to make things funny. Then there may be also issues in places you do not expect them to. Quite many open source projects have never issued a CV before. If they have never issued a CV, users have a tendency to not update them. Not having a CV does not mean that there are not any bugs. In fact, quite the contrary. I expect that we may have an issue of a very serious problem happening in one of those projects nobody has been looking into before. Then everyone will be trying to figure out how many copies of that project they have. To sum up, that is going to be an interesting year. Do you have questions? Thank you for the interesting talk. I have a question about the legislation. Are there different regulations for real security bugs and denial of service bugs? If you have some warmable hole in your software, which is network-connected, or something which is a denial of service, for me it is a different class. In one case, you probably get my point. There are two parts of answer for your question. The CRA is not the only regulation that is currently in progress. You know that there are European elections in Germany. Things are being rushed. There is the CRA, but there is also the PLD. There is the regulation related to the workings, there is the regulation relating to AI, and all of them have certain things. On the typical vulnerability in the US, if it is an exponential like in the case of that HTTP repeat reset, it is a vulnerability. I classify it with a typical vulnerability. If it were to happen in a network device, that also enters into other regulations quite probably. There may be things that apply in different places, depending on the actual use of the same software. Thank you very much for this talk. I think this is probably the most important talk to me, as I am a designer manufacturer, embedded hardware for startups and SMEs. I am desperately concerned about the situation. The timeline you lay out is scary enough, but you will know that we in the UK have IoT connected device law coming into power at the end of April. We have three months to be compliant to this. There is a £10 million penalty, potentially, to us, or a percentage of global revenue. I will say broadly not one of the startups or SMEs we work with, and indeed ourselves, are in a position to deliver on this stuff, which scares the heck out of me. I would love to know who we need to be talking to to work together to try to look at this. I haven't shared the scary part of the series about the penalties, but in all cases, you are not able to pay them, so... That is another example. In different places, there are different regulations being brought in the light. For me, as an open source community, we have the only way to solve it all together and prepare the whole paperwork all together. Otherwise, the big ones will be able to pay the whole paperwork, but the small ones, well, not really. I think we are out of time, unfortunately. Thank you.
V4L2 Stateless Video Encoding: Hardware Support and uAPI
All right. Hello, everybody, and thanks for being here today. So I'm Paul Krakowski, and today I'll be talking about VFALL2 Stateless Video Encoding. We're going to talk about hardware support, and we're going to talk about the Stateless Video Encoding new API, and we're going to see why it's actually difficult to come up with this new API. So I am now self-employed working in my own company, which is called SeasBase, so I provide engineering services on topics related to multimedia and graphics, so if you're interested, there's my email here. And let's begin with a very simple question. Why do we need to encode videos? So the main reason is really that storing pictures in digital form is extremely big in terms of size. So it really, really takes a lot of storage, and this is really not something that is reasonable. So pictures are just way too big. So what do we have to do? Well, simple. We just have to compress them. So let's compress videos. And now what happens when we do that? They tend to look crappy. So we compress them too much, and they look bad. So this kind of brings us to the main topic of encoding, which is the management of the trade-off between the size of the video and the quality that we get, or at least the perceived quality. So we want things that are not too big, and that look good, essentially. So in order to do that, we have a number of techniques which are implemented as codecs, which are just formats that specify kind of how we should encode videos. So a good codec has a good trade-off between the size and the quality, and in general, the codecs that we have today are extremely, extremely performant at this. Most of them have specifications that can be accessed. They are often hard to read, because essentially, if you want to understand the spec, you need to understand the codec first. And if you don't know the codec first, it's quite hard to understand the spec. So that's the situation, but at least there are some standards and specifications. So some require realities to be used, some others don't. That really depends. So there is a bunch of codecs which are hyped now. There is a lot of talk about AV1, VP9, HTC5, and HTC6, so that's the kind of upcoming stuff that everyone is excited about. But of course, it really takes a while before people actually use those codecs in the real world. But there is a real interest in that, especially if we're looking to reduce our global power consumption, because better codecs means less data, and if we have less data, less energy consumption for storage and for transmission. So having good codecs is actually quite important, and it can really make a difference. So how do we encode those videos? There is a number of techniques which are used, usually one after the other in some kind of chain. So we have a first class, which is the spatial compression, which is essentially how we can represent a single static picture in the most efficient way by using lots of different compression techniques. So I've mentioned a few here. Essentially, we go to frequency domain, and then we can eliminate some coefficients from that frequency domain. This is what we call quantization, and we have a value that we call QP, the quantization parameter, which is here to tell us how many of those frequency-based coefficients we want to keep in order to represent our picture. And in general, the high-frequency coefficients represent details, and the low-frequency coefficients represent the rough shape of the picture. But we also apply temporal compression techniques. Temporal compression will use the previews and sometimes the next frames as well in order to kind of calculate the current picture. So instead of coding a single picture all the time, we are actually coding a difference between the previews and all the next picture to create the current one. So this is, for example, a representation of what we call motion vectors, which will indicate how the kind of pixels change from one frame to another, so we have a notion of directivity and direction, and so on. So when we encode a video, we kind of have to decide on a strategy, which is exactly what do we want to do when we encode that video and which kind of trade-off we're going to adopt between the size and the quality and how we want that trade-off to evolve over time and things like that. So we have very common strategies. For example, if you just want an average or a constant bitrate to stay the same for the whole stream, that's one strategy that an encoder can implement, but we can also decide that instead we want a constant quality, so the size might change depending on what we want to achieve. So it's important to understand that this notion of strategy for encoding is really at the core of what an encoder is doing, and this is generally what we call rate control. So rate control is about dynamically deciding on this trade-off, so we have to decide on which frame type we want to use, we have to decide on the quantization parameter and a number of other settings that we can set in the encoder to essentially have it adapt to different scenes and also to be able to react to changes in the scene. So if you, for example, have a movie and you're going from one scene to another, you want your encoder to kind of react to that and give more details on the first image of the new scene, for example, things like that. So the main takeaway from this slide is that the rate control implementation is really the key to have good encoding and to have something that is actually performance and that gives good results. So now that we have a brief overview of those compression techniques and what encoders are supposed to do, the main topic is how do we accelerate this because doing all of this on the CPU is usually a very intensive process, so it will take a lot of your CPU, and nowadays we have lots of use cases where people want super high sizes and they want high frame rate, like 60 frames per second on things like 4K, and there are use cases where we want to be able to encode just as we are receiving the data. So for example, a regular camera that you used to shoot, you want to be able to produce a video so you want to be able to encode in more or less real time. So if you really want to achieve that, you have to use dedicated hardware which will offload this and will give you some acceleration for this process. So this is when we start talking about hardware-based video encoding or hardware-accelerated video encoding. So those hardware encoders, not only do they know how to produce the correct format for the video that we chose for the codec, they sometimes also have some extra features like pre-processing, so for example, they can convert the format of the pictures that you give them. They can also apply things like antishaking or cropping. So this is a very common pre-processing pipeline for an encoder. Usually they will have the ability to also encode multiple streams in parallel, so this, well, not necessarily in parallel, but at least in a time-shared way. So you could use the same encoder and have multiple streams that you want to encode, and then you encode one frame for one stream which goes into one sync, one output, and then you have another stream that you also want to deal with concurrently. So it's important that you are able to kind of switch between different contexts of encoding because you don't want your encoder to be just dedicated to one thing, to one task, so it's a little bit like a GPU where you want to be able to render multiple things like that. All right, so when we're talking about hardware video encoding, there is essentially two types of hardware implementations. The first one, which is probably the most common, is what we call the stateful encoders. So these encoders are somewhat abstracted and a bit less flexible for, let's say, the end users because they essentially come with a microcontroller that will do most of the heavy lifting involved in encoding, especially implementing the rate control strategies. So they come with a firmware that implements that, and it really does a lot, especially the rate control. And the CPU will usually interact with that encoder through some mailbox interface, and it will essentially give it messages like encode this source with these parameters, but the parameters are still quite high level. On the contrary, we have a second type of implementation, which we call the stateless encoders, which are really more bare metal, so they are also more flexible, and we have more control over exactly the parameters that we give the encoder and all the, let's say, technical decisions that are used to create the final bit stream, the final video codec data. So in that situation, the CPUs that is driving the encoder have to do more. It has to do essentially all that the firmware was doing in the stateful case, so this is generally more involved. And, yeah, it means that you have more things to do on your kernel, and we have, of course, a bunch of others. For the stateless designs, we have less known examples, but they are also quite popular and found in lots of chips. So we have the Hentro from Vericilicon, which is found, for example, on lots of IMX8s. We also find it on RockChip platforms, and some on Alwinner as well. Alwinner, which is a Chinese chip maker, has their own video engine implementation, which also has an encoder. So that's pretty much what we know about so far. Oh, MediaTek, I didn't mention it. So it's stateful, but it's kind of helpful between the two, right, because you can also drive it stateless. Okay, stateful encoder, stateless decoder. Okay, great. All right. Okay, so for the stateful case, in Linux, we have a great API in V4L2, which is based on the V4L2 memory-to-memory framework, which works with QQs, so essentially you're going to submit data from user space, which is your source picture, and you're going to get some encoded bitstream as a result from this API. We have pixel formats to describe the coded streams, and we can use some specific controls to set features of the encoder. So when there is a technical choice to be made, we can use those controls and tell it exactly, well, how we want the video to be encoded. But again, the rate control is implemented in the firmware, so all we can do with that is to tell the firmware, to tell the microcontroller what it should do, but it's not going to be the kernel side that does it. On the other side with the stateless encoding, like I said, we have a lot more to do from the CPU side, and this is when it gets a bit complicated with V4L2. So currently we don't have a stateless encoding UAPI. There is some difficulty that I'm going to mention in order to do that. And one of the points that are difficult is to implement the rate control part, so to be able to adapt exactly what the video stream looks like depending on the policy that you want to follow. And of course we want that UAPI to be hardware agnostic, so we don't want to just have user space drivers that will be specific to each encoder. Instead we want to have a generic interface, like it's the case for the stateful encoders, where we have this generic V4L2 API. But stateless encoding also has significant advantages. It's a lot more flexible, and that means that we have more control over what's going on. So in theory we are able to take better decisions to produce the best stream that we can, and that is not necessarily the case with a stateful, firmware-driven approach. So user space might actually have a bunch of information, like knowing that the scene is changing, for example, things like that, that can really help the encoders. So it actually makes sense for user space to want to do its own rate control, because it can have more information, it can also implement, let's say, advanced strategies. For example, nowadays there is talk about machine learning and how to help encoders achieve better results. So things like that would make sense in user space. But we also want to support simple cases where we don't want user space to have a huge stack that is very complex, but it would really be nice to have a simple case that can be covered without so much logic in user space. So you can kind of see that there is a little bit of contradiction between these two things, and this is one of the main, let's say, topics that make creating this UAPI difficult. So let's take a look at some existing work that was already carried out for these state-based encoders. For the Hentro H1, which is probably the most popular one in this category, we have some work that was done by RockTip, which is free software, in a stack called MPP. So you can find the source code here, and that's the part where it implements encoding for the H1. So that's great, but this is not V4L2. It's a fully user space-based approach. So we have Google that did a custom V4L2 driver in Chromium OS. So this is for the Chromebooks that they ship with RockTip SOCs that have the H1. So this time it's a V4L2 driver, but it really is hardware-specific, so you have a very specific API to drive that encoder. Now, from this base, which has all the knowledge of how to drive the encoder, I was able to write a mainline-based implementation on V4L2 when I was working at Budlin, and this one is still hardware-specific. So we have some custom register configuration that is pushed to the driver, and we get some custom feedback as a result that the user space side can use in order to implement the rate control. So in this case, the rate control is done in user space entirely. Ooh, it's not working anymore. Okay, and now we have also VPH encoding from Collabora, which also does the rate control in user space. So you can find the links to the RFC series there and the user space implementation in Gstreamer as well to demonstrate how it works. So while working on this hardware, there is a few things that we learned. For example, the fact that some metadata fields of the bitstream are actually constrained by the hardware itself. So it means that there are some fields in the codec where the specification allows you to choose between different values, but the encoder actually only works with one of them. So it means that if you are going to generate those fields, you need to be aware of which hardware you're running on. So the lesson learned from that was that the bitstream generation should really be on the kernel side because this is how you can really know exactly which choices are valid or not for this particular hardware. Sometimes the hardware also has rate control helpers. So this is some hardware features that can help you implement better rate control. It's not necessarily always a good idea or always required to use them, but they exist. And in that case, this kind of suggests that the rate control would make sense to be done in user space, sorry, in the kernel side, because again, we don't want user space to be specific to a particular hardware. We want it to be generic and agnostic. Now for a second example, which is something I've worked on very recently, again, at Budlin. So this is based on some existing work from the Linux Sanctuary community, which did a lot of research and implemented some user space implementation for the all-winner video engine encoder. And I did some follow-up work on some more recent platforms that also implement H264 encoding, this time using a proper VFIL2 driver. And again, because we still didn't have the stateless encoding UAPI, I decided to use the stateful encoding UAPI more or less directly. And this made it clear that this API was quite limiting and that it didn't allow leveraging the full potential of the stateless hardware designs. So there's a few lessons to be learned from that. Like I said, stateful API is not really a good fit for these stateless encoders, so it's not really viable to try and use that. The Bitstream beta data needs to be produced canalside, like I said, because we have some hardware constraints that we cannot represent and let's say forward to user space. So it has to be the kernel that decides how to generate those Bitstream headers. And for rate control, it's really unclear because having rate control on the kernel side makes the user space quite simple and it makes it really easy to operate it without having a lot of logic aside. But on the other hand, having the rate control in user space is a lot more flexible and it means that you can implement kind of whatever strategy you want, you can decide on the implementation yourself, which is a bit less easy when it's on the kernel side. Of course, it's not impossible, it's all free software and you can change it like you want, but we still understand that there is interest in both cases. So the current state of the art for stateless encoding in VFrl2 is that it's in progress, it's a discussion, so if you have an opinion on that or if you have ideas on how this could be improved or how this published quantization parameter, which references we're going to use to generate the frames based on the previews on X-frames. So having a switch kind of allows user space to choose if it wants to have low-level control or if it wants to have something simple that works, which is maybe suboptimal, but that can still be used nicely. Another way would be that we have rate control implemented in the kernel side, but instead of applying it to the next frame, it would just provide some suggestion to user space, so it would be some kind of feedback data that is provided with an indication of what the next QP or frame type could be to follow the policy that was selected kernel side. This could also work because then user space could decide to follow this feedback suggestion or not, so then it could decide to do something completely different. So in that situation, user space would still have all the low-level control, but it would have suggestions about which values would make sense depending on the kernel side rate control implementation. So that's also something that could work, and we could even have some switch to auto-apply the feedback so that user space doesn't even have to copy that suggestion into the actual configuration. We could just have a switch that makes it kind of automatically go, and after that, user space really doesn't have much to do, and it can kind of let it handle it by itself. So that would be also some form of trade-off that allows having something simple for user space and also allow user space to be able to control things if it wants to. Another thing that would be interesting is to have some common code that is shared between these different state-based encoders because especially for things like the bitstream metadata generation, there is a lot that is common, of course, because it targets the same format, so we could have some helpers that are shared between these different drivers. Again, the state-full encoders don't have to generate that bitstream metadata, so it's really something specific to the state-less encoders. Finally, the rate control implementations, if they end up existing on the kernel side, it would also be nice to be able to share them between the different drivers instead of having driver-specific implementations for that. Besides discussing and exchanging ideas and hopefully finding a solution for what this UAPI should look like, the next step will be to merge the work done on the Hentro and Cedrus drivers to bring X264 encoding and VPA encoding for Hentro. After that, the next step will be to have some G-streamer and FFM-back integration to use this UAPI, and after that, normally the rest of the world should be able to use the state-less hardware encoders, which will be great. So time is up for me. Thanks everybody for listening. Thank you for a great talk. Unfortunately, we do not have time for questions, but I really encourage everyone who has a question to just catch the speaker in the corridor. Thank you. Thanks.
A fully open source stack for MIPI cameras
The guy in the green. Look up. The guy from the whole group besides. Yeah, exactly. The same guy. The same guy. So you have the team fan and part of it. Nothing else in the N.A. So make sure that you have six slides. Oh, no. That's hard. That's a good one. I can't help but be sure that I'm not going to be able to do that. I'm going to try to do this on the phone. Okay, that's just not your thing. I'm just not sure if you need a copy of this. No, I'm not going to upload it. It's just how it is. Hello? Hello? Oh, it's you. I know. I know. I know. Final four? Final four. Final four. Final four. Final four. Final four. Final four. What's the other side? What's the other side? Okay. As the way he says. So, hello. Welcome to Brussels again. Still. Myself and Hans here will be doing a talk on nippy cameras. Here's some nice logos. Let's go. So, introductions. We'll do this. Go. Yeah, so I'm Hans de Hude. I work for Red Hat. It says it's on without this. Now it's on. Okay. So, hello everyone. My name is Hans de Hude. I work for Red Hat as a software engineer in the laptop hardware enablement team. I'm also the upstream kernel subsystem maintainers for drivers platform x86, which has mostly laptop related drivers. And, well, new Intel laptops are now using nippy cameras instead of old USB UBC cameras, which is a problem or a challenge. So, that's why I'm here and what this talk is about. So, my name is Brian O'Donohue. I see I don't have my second name here. And that's spelled with a Y, not an I, because I had a granny from Scotland and from Ireland, which I'm going to give you my standard spiel, is that not in the UK. It is in the EU. We do use the URL. Thank you. Thank you. But we're not in Schengen, so you have to show your passport when you come to my country. Yeah, I'm a kernel engineer with the Lamarrow. I work in the Qualcomm landing team. And, I suppose about a year ago, I inherited KMSS, which is the Qualcomm camera subsystem driver for the Lamarrow team. Here's my cold Lamarrow place. They dump all of my in-progress kernel work, my GitHub, and various IRC channels I'm on. So, we're going to kind of divide this up a little bit. There's a full topic. And it turns out that your I is more sensitive to green. I didn't know that. And particularly here, at 550 nanometers, I kind of a yellow bit of a yellow green. And so, actually, that impacts how we capture light. It's like an ADC for light. And the guy called Bruce Beyer, working at Eastman Kodak in 1974, came up with an encoding algorithm for how we capture light. Because basically the sensors are monochrome sensors. What we do is we put a filter on top of the sensors to capture a particular bit of the visual spectrum. And that's called Bayer encoding. And it gives us a lot of green again, because your eye likes green. And here's a paper, actually. I thought this was pretty cool. It just rolls it out here. So it's green, red, green, red, green, red, green, red. Blue, green, blue, green. You see we have a different Bayer pattern here. But they all conform to a similar layout. And they look like that. So, you can't really look at this. You'll see it a little bit later on. The picture that comes out in a Bayer pattern, it looks like anosaic from something from Roman times. Actually, it looks pretty cool. But it's not like what you take a picture and you do a selfie. You don't want to look like you're at Pompeii. So this is a problem. We need to interpolate that data. We need to interpret it and reasonably recover the original data that we took. So there are various methods to do that label. It's called deep-biering. Funny enough, we have the bi-erring called an image and we're going to deep-bier it to RGB. So we have label, nearest, by linear, and malar, hay, kuttler. I can't even say this. And as you go up, as you go down the list here, the computational overhead goes up. But the quality that you get out of it similarly goes up. Unsurprisingly enough. So here's a great paper on this, actually, by a guy called Morgan McGuire from about 2009, I think. And this shows us, you can see the mosaic pattern here. You can see here on the right, what he's calling graem-truth. So the original picture. This is an approximation because actually what happened is he took a picture with a Bayer sensor that has a particular resolution. But for the purposes of the talk, we'll call this graem-truth. So you can see graem-truth, raw image, label, it's kind of crap. Nearest, so you approximate based on the nearest pixel. Hands knows more about this now. I hope you give a better description of the recovery process than I will by linear. So in that case, we do line by line. And then this one here is much more complex. It does like kind of a bit in the middle and then other bits around it. But far more computationally costly. So every time you take a picture, if you think about it, your camera is performing at least this operation. At least one of these debiring operations and does so in microseconds. So how does it do that? It's not magic, believe it or not. It uses a thing called a hard ASP. So a hardware image spectrum, image signal processor, pardon me. Which typically, especially on modern systems, entails firmware running inside of the camera component of the SOC. So if you think about it, you take a picture, you start to bring data in. It's locally in memory, quite physically close to where the stream comes in. And you immediately want to process it there. Every time you kick up the chain closer and closer to main memory, you're going to pay computational costs for whatever you do up there. And so therefore, we have these hardware ASPs that have a silicon block and a firmware block that typically interacts with the silicon. And it's based on the principle of data locality. And here's some basic, so the 3As, the three most basic things that you would do with the image apart from debiring. Auto-focus, auto-white balance, auto-exposure. So bringing the exposure up or down, balancing. But you can see an example here of how the white hasn't been balanced properly. And on the right, it has been balanced properly. And the left is under-exposed. The middle is ground truth. And then on the right is over-exposed. I kind of like the right image myself, but I don't know. According to what we're calling ground truth, that's over-exposed. But hardware vendors consider all of this stuff secret sauce. Secret sauce. What's the definition of secret sauce? It's the goop that McDonald's puts on the patty before you stick it in your mouth. So the very simple sensors, the ice-ward sea sensors, is probably worth saying. The mid-sea sensors have ice-ward sea bolses that allow you to configure them. They're pretty cheap, I suppose, as sensors. But the tunings for those sensors, setting up the PLL, putting it into any configuration is considered secret sauce. And typically, if you look in the kernel, what you'll find is we have these big tables with magic numbers. And the magic number represents a... And the mirror confuses it. It doesn't know what's happening. So it thinks it's three different people. I thought that was a funny feature. So what's the problem that we're solving? What is the problem that... This is where the narrow and red hat come to the camera, and the problem that we're trying to solve. So like I say, I'm doing the Qualcomm stuff. You're doing the X86 stuff. And the commonality there is the sensors. So the silicon vendors will not release enough information to switch on their hardware ASP. And so what we have for MIPI cameras is raw data coming in, just Bayer encoded data. And for quite a while on the Qualcomm side, that's just what we've been delivering. We deliver you Bayer data upstream, we say, good luck. Have a nice day. Which is completely useless if I want to have a Zoom call. So the question then becomes where and how can we fix it? And the answer really clearly is in the live camera. So Laurent and Kieran and Jack-O-Paul and these guys here have a great project really. If you've ever tried to use the Video for Linux stuff, you'll find that the camera, even just hooking up your own camera, can be quite difficult to do. So live camera really isolates you from having to know anything about that. You can just run it, you can reuse the library, and actually I find it very easy to use. And I love it. Thanks for the t-shirt. So what we want to do is we want to do this. This is an example of a high level. This is quite similar, I suppose, to how the Raspberry Pi and the other hardware ASPs work, is that you have an ISP component here, sorry, I keep sending the way to the camera, an ISP component and then an IPA component. So the three As live inside of the IPA, and the other stuff, the stitching up, the pipeline happens in the ISP. And so when we approached the live camera and said, hey, we'd like to get something better than Bayer data, Edward, Edward, camera's stacked, why not? We do this all day long. We might as well show something for the jobs we're doing. They said, please, please implement something like this. So we started. My colleague, Andre, kind of love, I'm terrible with the Russian names, didn't also the code. I've been sitting in on the meetings and kind of arriving up and saying, here's the way I think it should work. And then I guess about three or four months ago. Yeah, something like that. So I need to join Linairo in doing this because we had a similar issue with the IPA 6, that Intel is actually working upstream to get the NIPI data receiving the CSI receiver going. And then we have all Bayer data, but Intel is currently doesn't really have a plan on how to get their secret source algorithms upstream because upstream doesn't want secret, they want open. So that's an issue. On the other hand, we have users who wanted their cameras to work. So the conclusion was that we needed a software ISP too. So we joined up with Linairo instead of doing our own stuff. And, well, that's starting to work pretty okay now, as you'll see in the demo. It needs more work on image quality, but it gives a picture. But that brings us to the next problem, which is how do applications access cameras in this new world where we have NIPI cameras, at least new for, not for smartphones, but for smartphones for an Android, but new for Linux on the desktop. So these NIPI cameras have a pipeline, be it with a software ISP or a hardware ISP, and this pipeline needs to be set up and configured. This is pretty complex stuff, which we don't want to do in a kernel. In a kernel, we just want to say, hey, here's a bunch of hardware blocks. Good luck. Just figure out how you chain them together and tell us, as the kernel, a NIP camera takes care of this for applications, but this means that currently Firefox and Chrome, assuming most people use their camera for video conferencing, and that happens in a browser, or Zoom, they all directly open the dev video note, and they expect to just be able to say there, give me a list of resolutions which you support. Oh, I want that resolution go. And it's no longer that simple. Which means that the applications will need to move to a different way of accessing cameras. At the same time, there are initiatives to move Linux on the desktop to more of a fixed operating system with applications in an app store model, mostly because building your own distribution from packages is... You get a lot of variations which can lead to instability, so having a fixed read-only base image and separate applications is desirable. These applications also run sandboxed, which I personally think is a really good idea for something like a browser because that's often a security hole. So we try to basically solve both problems at once by saying we have PipeWire. PipeWire is already used for screen capture into the browser, so it supports video transport, so let's also use that for cameras. So this will solve the sandboxing problem because PipeWire is on the sandboxed binary boundary with a portal to access it. And it also solves by using a PipeWire... a NIP camera plugin for PipeWire. It also solves the whole how do we access the camera issue. A colleague for mine has been working on this. Actually, first, I think PenguTronics started on this. They did webRTC work, which is like the shared webRTC framework between Chromium and Firefox. Then a colleague of mine picked up the integration in Firefox. This actually landed in Firefox 122, which was released like a week ago. So the Firefox, which we'll be demoing, is actually just from the Fedora repos. It's not a custom build. So with how do we access the camera problem solved and sort of having a proof of concept which we'll demo, the question becomes then what do we want to do in the future? Well, we want to do better as an image quality. We want to do it faster as in use less CPU and we want to do it cheaper as in use less energy because doing everything on the CPU is not good for your battery. So Brian actually has been experimenting with GPU acceleration. I have. And that still doesn't work? It semi-works. Oh, you have. I can change the background green on purpose or as. Or red, but I can't render. So I need to flip buffers or something. I need to go Google and just mash the keyboard until it works. So we're looking into GPU acceleration starting with OpenGL. Mostly because we already have OpenGL debiring support in a test app in Lib Camera called QCAM. So we already have a set of GL shaders and it's useful to start with sort of non-working base for the shaders. In the future, maybe we'll also do OpenCL also because some ARM SOCs only support or Vulkan. Some ARM CPUs only support Vulkan like the Imagination GPU. So yeah, that's another option. So GPU acceleration should do the faster and cheaper bit, which would also use less energy. Then we have a whole list of image quality enhancements which we would like to work on. These are also things which are actually done by the hardware ISP, but we didn't put them on the hardware ISP slide because the slide was already full. I'm going to skip this because I would like to use the last five minutes for questions. So. APPLAUSE Hey, you doing demo? Yeah, that's good. Questions on the demo? That's a good point. I'll give it a go around. Please. OK, nobody from that, please. Everybody from the front. So here you see permission dialogue, and then hopefully if I join this meeting, we'll get... Look. This is all... APPLAUSE So yeah, this is our current image quality and actually with this lightning it's not too bad. But if you have a really low light condition, then it sort of sucks which we need to work on. Do we have time for questions? I think we do. Really quick one. You mentioned OpenGL and OpenCL. Meepie is pretty common on embedded systems that usually don't have OpenGL, they usually have EGL or GLES. Yeah, so GLES is what we mean. Oh, sweet. OK, thank you. So what AAA algorithms have you actually got? So it's a more than just you've got white balance exposure. Do you have ones that people can play around with, implement their own versions of them? We would definitely welcome people to look at what we have at the moment. It's in a separate repo. I don't know if you edit that in the references. I probably did not. It has to be, you have to go look at the branch. So I might have been sure if I gave the branch for the ISP. I'm a terrible person. Look, the patches are on the upstream mailing list and the cover lever also has a link to the branch where you can check out the branch. And we definitely welcome people to experiment with more algorithms, better algorithms. Only please keep in mind, we're running this in the CPU. This is actually full HD, 30 FPS, which is currently like a 40 percent FPU load. I spend a lot of time optimizing this. So we need to do this in real time. That's important to keep in mind. I think it's probably worth saying that you won't really be able to use this on an IMX8 unless we get the GPU going. So there's a cutoff point of computational power. It's around a bit A53, I suppose. It's just too much work for those processes at that point. Hi. As far as Vulkan support, have you looked at a wrapper like Zinc to make the same functionality work on Vulkan-only socks like you have on the ones that do have OpenGL? No, we have not looked into Vulkan at all. We're currently at the stage where we're trying to get GLES shaders to work and Vulkan will come later. There's lots of stuff which will come later, like more image quality improvements. I think it might be more productive to look at OpenCL because then you don't care what you're talking to. You're talking to Vulkan, you're talking to a GPU, you're talking to a CPU. It kind of doesn't matter from the live camera perspective. So if we were going to spend more time rather than choosing between APIs and the GPU, it might be better to choose between different compute targets. You mentioned that there's lots of secret source algorithm which companies don't like upstreaming. What is the reason for this? Is it just companies being cagey, or is there an actual algorithm and all the magic number tables that for some reason they don't want public? Personally, I think it's a mix. On one part it's just companies being secretive because they're afraid of competition. Well, I don't think that's really necessary. This is something which we put together pretty quickly and it's really basic algorithms. It already gives a pretty decent picture. On the other hand, I think that the more advanced stuff really does have company or trade secrets in it. It's a mix, at least in my personal opinion. That was one more question you said. We're at a time of trade. Thank you very much for that talk. That was fantastic.
enioka Scan: say No! to vendor lock-in for your barcode scanners
So, hello everyone. My name is Antoine Gonzales. I am a French developer at Enoca Outculture and today I'm going to talk to you about bar code scanners and bar codes in general and why they are important in the open source. So, a bit of context. Bar codes, why does it matter? If you probably have noticed that everything in your daily life has bar codes attached to them, whether it's grocery shopping, parcels that you order online, even menus in restaurants these days have bar codes attached to them. The idea behind this is basically bar codes end up being one of the most efficient way to attach digital data, usually an idea, but sometimes more than this, to physical objects. It has many different types, one dimensional and two dimensional ones. The one you're most likely familiar with is EAN 13, which is pretty much on every package ever and QR codes, which is mostly used to share links and stuff like that. So, as I said, they're used everywhere, but depending on workflows you may have more or less requirements with them. Usually, for example, if you deal with large scale packaging or inventory keeping, maybe you need to scan lots of them very quickly. So, some workflows have specific requirements and require dedicated devices, which is what bar code scanners are. And bar code scanners, so the less wonderful side of it, there's a wide variety of them, some small, some big, some that look like phone and that mostly are phones, some that look like rings that fit on top of your finger and you can use to scan products. There's a wide variety of them and a wide variety of manufacturers for them, and that comes with this problem. For example, each manufacturer tends to have their own APIs, SDKs, their own licenses that are very usually not open source friendly. Documentation that can be more or less complete depending on who makes them. And obviously, most of the time, it's not proprietary, otherwise it would not be here. It's proprietary. So what it means is usually when you pick a manufacturer for your backup scanners, you end up sticking to it because changing or just adding more variety to your fleet means having to rewrite your entire application. And for a lot of companies, it's just a lot of time to invest and not profitable in the end. So what is EnoCastCan then? It's an Android library with the goal of exposing a single common API to interact with different scanners. So how it does it? So the goal behind it is that it allows you to pick the manufacturers and the scanners that you need for your needs. That may mean combining different manufacturers because you have multiple needs in your company or just changing when the current contract does not fit your needs anymore without having to rewrite everything. And how does it enable this? So obviously, there's no magic. If every manufacturer and every device requires a specific way to communicate with them, a specific code needs to exist at some point. That's in the library. The idea is every device exposes their own API and we implement a communication way, a way of communicating with that device in the library, either through official documentation when possible or if we don't have access to it through reverse engineering, the protocols used by the scanner. Once we have that communication set up, what we can do is provide an abstraction layer that the end user can use to, for example, send very high-level commands to the scanner, for example, start reading barcode, turn on illumination, something like this. In the library, we'll find the connected scanners and try to translate to the appropriate protocols behind. What's interesting about this approach is it makes it very expandable. This means, for example, if we don't support any given device but we want to add support later, it's pretty simple to do. It's about implementing one interface or another. We describe what the device can do, how do we do it, how do we translate to something that the device can understand. If the device has specific features that may not be common on most scanners, we can divide these commended subgroups that are very easy to implement or not in a way that makes it obvious what the device can do. For the end user, nothing changes. The whole point of the library, the application that chooses the library doesn't need to adapt anything. It's the library itself that's plug-and-play. In terms of compatibility, so far we support quite a wide range of scanners. Some of them use Bluetooth, classical low-energy. Some of them are integrated. This means, for example, a smartphone with a scanning camera on top of it. For some situations, the Android camera is all you need, so the smartphone's camera, in which case we support both the legacy camera API and the new camera 2 API. One of the biggest upgrade we made recently last year was the compatibility with the Zebra data-wage. Zebra is one of the main manufacturers in the back-off scanner industry. Data-wage is their proprietary service that communicates with most of the fleets of integrated scanners. I think this one, for example, allowed us to pretty much support everything these manufacturer produces. Any device that's not in the list, whether it uses the existing supported technologies or something else, for example, USB, if Android lets you access this way of communication, theoretically, nothing stops you from adding compatibility. It may just require a bit more boilerplate to get working the first time. But overall, we have a lot of already set of helpers to make the process easier. What comes next for this library? So like I said, we have a lot of scanners that are supported already, but obviously not all of them can be. There's a lot of devices out there, so more and more devices are going to be added as we get all of them. We also want to provide an external documentation containing guides and examples for the code. So right now we have pretty complete API dots, but not a lot of guides and quick starts for people who want to judge before starting implementing it whether or not this library is in the list, what they expect and what they want for the need. We also want to provide a more complete separation of the core library and the existing SDKs that are implemented to let you just download what you need and not support dozens of devices you might not use. Another thing we want to add is a standalone app, so both an application and a service published to the Android Play Store. So in case the defaults, functionalities of the library is all you need, you already have access to it and you might not want to reimplement everything in your application. And finally, better Bluetooth support. So we already support Bluetooth pretty well, but a lot of devices have specific methods of pairing with your Android phone. Sometimes they require pairing via scanning a barcode that's generated by the device. This is through NFC pairing, which we do not support right now, but we want to in the future and Android support for the activities like camera that we provide. Now what we need help with, because obviously, like I said, we do not have access to every device. It's not possible. There's too many of them, but maybe you do. Maybe you have access to barcode scanners that we haven't tested yet, in which case you probably can help us to expand this library. So for example, by just simply testing whether or not the device you have is supported or not by the library. Sometimes even if it's not explicitly tested, manufacturers do reuse some of their code and some of their protocols, in which case maybe a device that we haven't tested does work with the library. Or if not, we know we need some work on it. You can also add easily more SDKs to make more devices compatible. So for example, if you have a device that we do not support and that's not compatible with the library, you can try to either reverse engineer it or provide the documentation necessary so we can add its functionalities to it. And finally, if you see any feature that you think is missing that could be done better, or optimized maybe, you can try your hand at upgrading the code base, basically. We try to be reactive with issues and questions that we receive. So if you want to take a look at it, there's a barcode to the GitHub repository. So if anyone has any questions, I can answer some of them. And otherwise, if you want to stop in the hallways or open discussions on GitHub, you're welcome to do. Thank you very much, Paul Singh. We have one question. Hi. Can I ask, are you planning on supporting any other platforms because a worrying amount of POS software still runs on Windows XP with serial? So Android is like right now it's only an Android library, mainly because in the mechanisms we use to connect to the different scanners are specific to Android. But the core compatibility layer, basically, with each device is not specific to Android. So even if connecting to a scanner uses Android, you could probably take the code base used to translate messages and port it to Windows or Mac or whatever else. Anyone else? Okay. Well, thank you very much, Antoine. That was great. Thank you very much. And... APPLAUSE
The Small Device C Compiler (SDCC)
So, welcome to the small device C compiler. Talks about such short, so I'll try to fit in just a basic stuff. I'll start with a quick introduction on what the small device C compiler is. Then I talk about the architectures we target and then just a little bit of what the future hopefully brings for the small device C compiler. Okay, so STCC is, as the name says, C compiler. It tries to support C standards, in particular either C19, 99, 11 and 23. It's nearly always used as a free-standing implementation. The only exception I know of is that FASIX, an operating system for some 8-bit systems use it as a part of a hosted implementation. Now, those familiar with the C standard know that in a free-standing implementation, you are more restricted in particular in features from the standard library set. You can use course, well, on your device there's no file system, there's no point in using any standard library functions trying to open read or write files. There are some supporting tools, apart from the compiler itself, in particular SM plus, a linker and a simulator. The simulators are usually kind of cycle accurate. We mostly use them for our regression testing internally, but they are also usable by end users who want to run their programs on a simulator rather than on real hardware. It works on many house systems. Most popular would be Linux and Windows, but it works fine on free BSD and so on. We target various 8-bit architectures, probably more than any other compiler does, and we have some unusual optimizations that do make sense on these targets where you really have very little memory and where both optimizing for code size and for memory use are very important and often more important than optimizing for speed. Our user base consists mostly of developers targeting embedded systems. I guess they make about two-thirds of SDCC users, and the rest are retro gaming and retro computing enthusiasts because we also support various older 8-bit architectures. They're similar enough to modern 8-bit microcontrollers that it makes sense to have them all in the same compiler and many high-level optimizations can be shared. But I believe that the user base in the end benefits of having both these groups represented cause sometimes one group or the other is more eager to try some new feature, which of course helps us finding all the bugs in corner cases and iron out everything, while then more conservative users that want to wait for longer than getting in a more polished state. Our latest release was at the end of January, which is very recently, typically we do one release per year. So the project is hosted at SourceForge. We have our issue trackers there. We have mailing lists for communication. The users have version repository. The user weekly for some documentation outside the manual. And we have a compile farm for nightly regression testing, which means every night on many different host systems, both in terms of operating system and underlying architecture. The latest SDCC from Drunk is built and then runs all the regression tests, meaning compiling a lot of tests, running them on the simulators to see if the results are what they should be. There's something between 10,000 or 20,000 tests that are executed that way and also incorporates a large part of the GCC test suite. A quick comparison to more known compilers. We don't see ourselves as a competitor to GCC or LLVM, so the versus up there is just for a comparison. Now we specialize in targets that are hard to support in GCC and LLVM. For GCC or LLVM, you typically want some risk like architecture, many registers, uniform instructions set. Then you can use a tritine style register allocator and that's efficient and everything is nice. The typical 8-bit architecture is not like that. If you want to get into the compiler, there's a compiler developer, our learning curve tends to be less deep than GCC. Our internal interfaces tend to be more stable than LLVM, which for some people is also a nice feature. Talking about the recent release, our main improvements were definitely in the last two years in standard compliance, in particular ISOC23 support. This was partially funded as a project by the prototype fund from the German Ministry of Education and Research and improvements and optimizations, in particular generalized constant propagation to allow us to narrow variables. If people use an int as a loop counter, that's typically a waste of memory in an 8-bit target if that loop doesn't really need the 16-bit that an int has on those targets. The work in optimizations was partially founded by an LNET via the NGI-0 initiative. We also got two new parts, namely one for the WDC6502 and one for the SCR800. One is the MOS6500 derivative and the other is the SET80 derivative. Let's get to the parts. The STM8 part is our best one because we generate really good code for the STM8. It's currently the most advanced part. It has all the bells, whistles and great features. We do very well compared to the non-free compilers. Unfortunately, recently this architecture has become not recommended for new devices. The manufacturer tries to move their customers to arm. But just to illustrate how we do versus three other compilers, which are all non-free, in terms of benchmark scores, we generate the fastest code essentially, except for WEDSTONE, which is a floating-point benchmark. We didn't put as much emphasis on it. And we also generate reasonably small code also for all of these benchmarks here. This is with the current release in January versus the current versions of these non-free compilers. Now our oldest part is for the 8051 and its derivatives. That's an ancient microcontroller architecture that Intel introduced long, long ago and abandoned long, long ago. And there are still many dozens of manufacturers that make compatible devices. It's a very, very popular common microcontroller architecture. It's not as nice as STM8. It was the first supported architecture in STCC. But in the recent years, it has fallen a bit behind new features that got added for other architectures, didn't always get added to 8051. And also many devices made by different manufacturers are also often slightly different, in particular new features like additional data pointer registers, which are used in different ways. We have support for the HTC rate and ST rate. It's current microcontroller architecture by NXP. The problem is there's not really much of a free open source community around this architecture. There's individual bits here and there that someone wrote some free software for it. But in general, it seems a typical sentiment by developers of ST08 programs as well. We get the, at no monetary cost, we get the development environment for the manufacturer. Why should we try something else? And sometimes they complain a bit if the manufacturer drops the part for an older device. As per DOC, a Taiwanese company that makes billions of microcontrollers each year that are not that expensive, they were not really meant to be programmed in C. But we still managed to support them, at least three of the four subarchitectures that exist we already support. The largest one, the PDK says, not yet supported. One thing interesting about these is that they have hardware multishrating support, which we currently don't support. What we can do is write a C program, run it on one core and then the other cores run a sampler software. There's microchip pick. Those used to be very popular because they were cheap. The ports are currently un-maintained, but we still get sometimes contributions from users with patches. It's not like they're completely abandoned. Maybe sometime a maintainer will step out of these user contributions. Okay, now we get to the architectures relevant to the retro computing people. These are a large number of Z80 derived architectures. The SM83 might be known to most people here as a CPU from the Game Boy, even though it's also found in some other Japanese appliances and TV remotes. And then we have the MOS 6502 and its derivatives, which don't even fit on the line anymore. They're found in old embedded systems, especially those R2K to R3K, those other rabbits. They were very early IoT devices because they are kind of enhanced Z80 with ethernet or Wi-Fi on support on the chip. But these architectures are relevant to the retro computing community, which often doesn't use SDCs directly, but instead via downstream projects. They package SDCC together with libraries for certain devices that use these things like video game consoles or historic computer systems. Now, what will the future look like for SDCC? We're definitely facing a problem at the moment because the SDM8, the architecture for which we're doing really great, and those rabbit devices that I mentioned on the retro computing side, are both not recommended for new devices anymore. Meaning that the architectures for which we really, the architectures where we really do great as a compiler are about to be phased out. We will keep supporting them, probably unlike many of those commercial compilers. I mean two of the three commercial compilers for the SDM8 haven't even seen any update for the last two years. But to stay relevant for current embedded systems, we need to try something else. And basically this is the idea. The main thing is putting the focus on the MCS-51, the 80-51 again. It's an ancient architecture. It's not exactly the nicest architecture. But due to the large number of hardware vendors, it's not likely to die any time soon. And looking at the reasons why users choose non-free compilers versus SDCC for the 80-51, the main reason is definitely that the main non-free compiler for this architecture can optimize better for code size. So this slide about the future is basically a very rough outline for plans for the next two years. And generating better code in the MCS-51 port is definitely something that we want to do. We will look a little bit into the SDM8, but due to the lack of community behind it, there's probably not that much that can be done. We still try to keep the SDM8 up to the other ports feature-wise, even if maybe not optimization-wise and code generation-wise. For the PDORC things, it would be nice to be able to support the multishrating better and also support the one remaining subarchitecture. And then there's this F8 thing, which is basically a very early project to maybe come up with our own architecture. I've worked on the compiler for a long, long time and very often there was a feeling this could have been done a little bit better in this architecture, or that could have been done a bit better, it would have made it a much better target for C compilers. The SDM8, for example, is a really good architecture. It has things like stack pointer relative addressing modes. That's one, something that you really want for local variables in C because then you want them on the stacks, so you have full re-entrance, C standard compliance, everything. But it has very few registers. The SDM8 has more registers, but the stack access is a little bit less efficient, because you have to set up a frame pointer, it goes through index registers and so on. The dog things have great multithreading, but they don't have the necessary instruction to support good C standard atomics to communicate between the cores. And out of all those lessons basically learned from other architectures, the F8 is kind of a project to come up with an architecture that, to say that somebody should become, if it succeeds, something for the 8-bit world, something like risk 5 is for the rest of the world, and to see that the time is up. Questions? Thanks for the talk. Can you maybe give some hints about the internals of the compiler? The internals of the compiler, okay. We have a classic Lexiac, sorry, you didn't front-end. Yeah, I just want to say if you are using an intermediate representation and maybe also the simulator, does the simulator, since it has to support many architectural uses on intermediate representation, I would be curious about that. Okay, so the front-end is a classic Lexiac parser. We have an abstract syntax tree that gets converted into the i-code, which is basically a free address code. This then gets annotated with some extra information, such as the register allocation, and then in the individual back-ends, this i-code gets transformed into a sampler code. The sampler code then goes through a P-Pole optimizer, and that gets unwritten out to the linker. The simulators, well, that's not my area of expertise. Daniel Drotos is definitely doing most of the work on that part. They're written in C++. They're using the classes and stuff to abstract things away, but I don't think there's any intermediate representation in the simulator because they need to be fast. We want to run tens of thousands of tests for every architecture that we support every night, so performance is definitely a goal for the simulators. You mentioned code size as one of the areas where STCC lacks behind the proprietary compilers from the vendors. What kind of factor are we talking about, and are you doing regular statistics about the code size of STCC, like in terms of different versions and so on? Yes, we are tracking this throughout work. We have graphs, and we are not lacking in codes that in general compare to other compilers. I mean, we're doing okay for the STM-8. Resonance can generate smaller code, but resonance is in every other way the worst compiler for the STM-8 around these days. I mean, they don't even support C90, and the code is very slow. It's specifically for the 8051 backend that we, Kyle, generate more compact code. I need to just to preface my question, saying that I only experienced STCC through the downstream projects, and I began actually using it in great part thanks to your talk a couple of years ago. But I have noticed that the compilation step takes a lot longer than other compilers would. I suppose it's optimizing and evaluating. Why so? And what would help it? More faster disk, more RAM, faster processor? What would help the completion time stop a bit? This depends on the backend. Most backends use what we call the new register allocator, which definitely was the key to being able to compete this well with other compilers in generating faster code and also being competitive in code size. 8051 does not yet, but for the C80, this register allocator is used. It has a parameter, maxEloxPlanout, that you can set to tell the register allocator how many different possibilities to consider at each node of an internal representation. The default value is 3000. If you set it lower, you get less optimization, lower RAM usage, faster compilation, but there's people that set the thing to a million and let their program that in the end fits into 8 kilobytes compile for half an hour, but they really want it optimized as well as it's possible. So yes, the most of the compilation time is spent in the register allocator and the people optimizer, and for the parts that have the new register allocator, definitely the register allocator, typically more than the people optimizer. And one interesting thing is this can become provably optimal. If you also add F-Verbus ASM, you get comments in the sampler that tell you once if the register allocator found a provably optimal assignment. Per function. Okay, I think that's what we have time for for the questions. So just wanted to say I wish you thank you very much for the fascinating talk on the palace.
Brewing Free Beer with ESPHome and Home Assistant
All right, good. Today, I'm going to be giving a presentation called Brewing Free Beer with ESP Home and Home Assistant. This is free as in beer, as in freedom beer. I love brewing. My name is John Britton and I'm an amateur brewer. This is a Belgian triple that I brewed this summer back at my house in western Massachusetts. I'm also a contributor to Homebrew, the open source package manager. I'm a cheer at that as well. And I'm the co-founder of a company called Workbrew, which is a team for tools that use Homebrew at work. Here's an outline of what we're going to be talking about today. First, I'm going to give you an overview of something called All Green Brewing, which is in contrast to something called Extract Brewing, which some of you may have heard of before. Then we're going to talk about building a brewery, specifically a Hermes style three vessel system. And then we're going to get into the open source stuff. There's a really cool project called the Electric Brewery that if you're interested in this, I highly recommend you check out their website. They have tons and tons of DIY guides. I would assume many of you have heard of an ESP32 before, but I'll give a little introduction to that and a project called ESP Home, which is for writing firmware and programming ESP devices, and lastly Home Assistant. So let's start with All Green Brewing. What is beer? There's four main ingredients, water, malted grain, hops, and yeast. The process of making beer is extracting the sugar from the malted grain, adding in hops for flavor, and then using yeast to convert the sugar that's extracted from the grain into alcohol. Water is the most abundant ingredient in your beer. I think of it as like the canvas. In my brewing, I use well water. I just take the water right out of the ground and use it as it is. Some people I know use spring water. They buy it at the store. Sophisticated brewers will do chemical testing on their water to get an idea of what salts are dissolved and they'll add their own additives to make it the palette that they want for their beer. And if you're really, really into this stuff, you can use reverse osmosis or distilled water to start from a totally blank slate and add your own kind of salts and whatnot. Next is malted grain. Primarily barley. Barley is malted in a process where you steep it in liquid, which starts a germination process. During germination, the organism converts using enzymes, some starches into sugars, and then you kiln the grain, which is kind of like toasting it to stop the germination, stop it from growing, and it also imparts color and flavor. Hops, probably the most famous of the beer ingredients, added during boiling and during the fermentation process to add bitterness and aroma. And then lastly yeast, which is used to convert sugar into alcohol. Primarily there are two types of yeasts, ales and loggers. I'm an amateur brewer so I won't tell you the difference between the two other than loggers are much more difficult. So next, how is beer made? There's a lot of steps. One of these steps can be broken down into a whole day lecture. So I'm going to give you a really high level overview of mashing, sparging, boiling, fermenting, and packaging. So the first thing that you do is you take your grain that's already been malted by your maltster, you crush it up, you put it in a pot, and you add hot water. The hot water works like tea. It extracts the sugars from the malt and then it releases them into the liquid. The next step is sparging, where you add more liquid on top of this kind of grain bed and strain the water through, extracting the sugar from your mash. And in this picture you can see that it's kind of, the liquid on top is clear. By the end of your sparge, the liquid that's going through basically has nothing in it. You've extracted all of the sugars out of your mash. The next step is boiling. Boiling primarily exists to sterilize, but it also removes volatile compounds that create off flavors in your beer. During your boil you can add your aromatic and your bittering hops to extract the flavor. And then the last step, after you boil you chill your wort down. So basically all of the process we've talked about so far is making wort. Then you put it into a fermenter, add yeast, and the yeast makes the beer. So that's kind of the process. After a couple of weeks, in my case I usually let this sit for about two to three weeks. Take a couple of readings to see how the alcohol, the process has progressed. And then lastly, packaging. Beginner brewers will often use bottling because it's really easy, there's not a lot of equipment. But it takes a lot of time. Step by step you've got to put each bottle, fill it up. Additionally you have to add more sugar into your beer at the end, so that the yeast can then convert that sugar into more alcohol and produce CO2 to carbonate your beer. After a new brewer gets into packaging their beer with bottles, inevitably they switch to kegs and they do forced carbonation. So basically after you're done boiling, you chill down the beer, you put it in a keg, you use a container of CO2 and you pressurize it to a high PSI and you let it sit for a couple of days, and then you have carbonated beer ready to drink. So this is all fine and good. I wanted to go out and try this for myself. I had some friends in university who did home brewing. They did extract brewing, so basically on the stove in their kitchen. And my friends and my family all know that I like to overdo everything, so I decided I was going to do this all grain and build it from scratch myself. There's lots of different systems you can use. I decided to use something called the Herms three vessel system. Herms stands for heat exchange recirculating mash system. It's basically three kettles, so the three squares in the middle are each 20 gallons, I don't know, like 40 liters or so, I guess. That's my brewing set up on the right. This is kind of like a diagram. From right to left, you have the hot liquor tank, which is used for the starter water that you use to make the beer. You heat it up to a certain temperature. Then in the middle you have the mash louder ton, which is where the process of mashing happens. You add your grain in there. And then lastly you have the boil kettle. And down below you see the two circular things, those are pumps. Those pumps are rated for high temperature, food grade, etc. And then in the bottom left you have a chiller, this kind of coil thing. You run cold water in one direction and hot work through the other direction to cool it down. All of this stuff is available online through the open source project called the Electric Brewery. Now I say open source, they call themselves open source. They have a shop, they sell all their stuff, but really it's just, there's a website, it has tons of information, and they have two main guides, building your brewery and using your brewery. And what's cool about the Electric Brewery setup is everything is off the shelf. So you could go out, you could buy a brewery setup from a vendor, but if some part breaks, you're beholden to that vendor. And if that vendor goes out of business, if you can't get the part anymore, you're kind of host. With the Electric Brewery, everything is what you get in the plumbing section of your local hardware store. And in the electronics section or in the electrician section of your home improvement store. So when it comes to controlling this system, they have a control panel that you can build yourself. It takes months to do. And also the components alone cost, you know, 1500 US dollars or more. And if you want to buy it pre-assembled, it's only 2300 dollars. So totally out of the price range of what I was willing to do. So I thought to myself, what actually is this thing doing? Well, all it's really doing is turning on and off a high voltage, high amperage switch for a heating element. So inside of my hot liquor tank and inside of my boil kettle, I have a heating element. And it's the same heating element that you find in like a US like water heater, like a household water heater that's maybe 50 or 100 gallons. It's the same kind of thing. And all this does is let you set a temperature and turn it on and off according to a schedule. It's a PID controller basically. But because it's high voltage, high amps, you need to do a lot of stuff for safety. So I thought, can I replace this with a microcontroller? And I started to learn about ESP32. I found the dev boards, you know, you could buy one of these dev boards for less than 10 dollars. It has Bluetooth, it has Wi-Fi, it has a processor on it, it has tons of GPIO. And it has a huge community of support for building things, you know, with it. There are also lots and lots of devices out on the market that are built for consumers based off of ESP32. So this device is called a Sonoff THR320D. And it's a 20 amp relay with a display and a button and some LEDs and then nice terminals on the bottom where you can attach your electrical wires. And so what I did is I basically bought one of these devices for less than 50 bucks, took it apart at home. They don't have a UART on them, so I had to like solder in my own UART so that I could flash it. But it's a normal ESP32. And once I got my first firmware on there, I was able to do over-the-air updates. So basically a $50 device now replaced a $2300 device. So that was a pretty big savings for me. The next project that I used with this whole thing is called ESP Home. ESP Home, I believe, is made by the same folks that make Home Assistant, but it's a separate project. And it exists to make it easier for people to build custom firmwares for their ESP32 devices without necessarily having to write C code. I can write C code, I don't want to write C code, especially when I'm making beer. Like, I don't think I can handle it. So I took a look at ESP Home and learned that they have what they call components. There's hundreds of these components. Anything from Wi-Fi updates to PID controllers to switches to sensors to Bluetooth to long range radio without Wi-Fi, all kinds of different things you can do. They're basically pre-made code snippets that you can piece together to make your own firmware. So I'll give you just a tour of a couple of the core components that I used in my project. One is GPIO component. So this is just a code snippet that shows use the GPIO component, turn on pin number 27, and that powers up my temperature controller. The Dallas component lets me read a digital output from the temperature sensor. And then the most kind of valuable or biggest impact component that I used is climate. Within the climate, this was made for, you know, managing a thermostat. They have a platform called PID, which is proportional integral derivative. And the idea is that it lets you set a set point of what temperature you want to reach, and then you can tune it with some algorithms so that rather than overshooting your target temperature and dropping and overshooting and dropping, you can ease right up to your target temperature and hold it there within like 0.1 degrees Celsius very, very accurately. So I wrote this, you know, it's not very many lines, 10 lines of code or so, put in my tuning parameters. And you can see it says heat output, relay PWM output, and that connects to this component that I use, which is a P2M pulse width modulation. So effectively what happens is the PID controller outputs a number between 0 and 1 indicating a percentage of power. And then this PWM, what it does is I set it to a two minute period. So if I set it to 50% in a two minute period, every two minutes, one minute the heater will be on and one minute will be off. And it just pulses the relay on and off for me. And the PID controller manages kind of the percentage automatically to hit the target. And then there's another one that I used on here, so I showed that device has a button. This is just a small snippet of the controls that I built, but it has on multi-click. And you can just say when you touch the button down for one second and let go for at least 0.3 seconds, you can build whatever kind of gesture if you want, if you want to call it that. I have a ton of them, but what's really nice is the firmware that I put on these devices runs totally locally without internet access, without network access. And I can control the entire thing with one button by pressing a series of clicks. Or I can open up my phone and do it with Home Assistant. And then, for example, another one is interacting with LEDs. So when the device is powered on, the power LED turns on or when the heating elements turn on. The last kind of component to this is Home Assistant. Home Assistant is a really awesome project that lets you control any kind of IoT device in your home. And they have built right into it support for ESP Home. So as you can see on the right side here, you can see my different brewery controllers, Brewery C. I also have heat pumps that I'm controlling. All of these are done in the same kind of gamel where I use a pre-existing module. I have a couple of configurations and it automatically integrates. And on the left side, that's my mobile phone. So when I'm in the brewery, and this is what I see, I can drag the slider to set the temperature in my mash tun or in my boil kettle. And then down to the bottom, I have on-off switches for my work pump and for my water pump. And those are just done using smart plugs. So the entire system costs maybe less than $100 in hardware with all the wiring and whatnot instead of $2,300. So that's everything that I have to share with you. Reference materials. I definitely recommend it like Brewery, ESP Home, all of this stuff up there. And then some photos of the finished products. And that's it. Thank you very much. We have time for just, I think, one question. Okay, one question. Let's go to the guy in the front because it's just together. He's in the front row, yeah. Can you talk about all the sensors and... Can you say it? Say, you work with the Sonos device and I guess you have many other sensors. Can you talk, did you do any reverse engineering or you just put them in ESP? Yeah, so what's really great about this is the ESP Home community is huge. And they have a directory on the ESP Home website of all of the devices that they know of that are ESP based. So if you buy an off the shelf consumer ESP device, you can go to their website, see all the pinouts and everything. And so, for example, when I said GPIO 27 is the... There's like an RJ9 or RJ11 jack on top that's connected to that pinout. I didn't have to figure that out. I just went to their website and it said the pinout for that pin is this number. And it's all in a database. The only thing that I had to kind of reverse engineer was off the shelf devices don't expect you to flash them so they don't have headers for the first flashing. So once you get the first flash on there, you can do all over the air updates. Thank you. Okay, well, thank you.
Dora-rs: simplifying robotics stack for next gen robots
Then let's start. We would like to talk about Dora-Arres, which is a project to create a modern data flow framework for robotic applications. The idea of a data flow framework is that you split your applications into lots of small nodes that each perform some operation on the data, and then you can send the data by message passing to the next node, and this way you get a very isolated architecture. If one component goes down, for example, then the other components can keep on running, so you have very reliable architecture. For example, on the right in this example we have a webcam node that generates images, and these pictures are sent both to a plot node and an object detection node. The object detection uses an object detection algorithm to detect common objects in the image, and if it detects any it sends a bounding box to the plot node as well. The plot node can then combine the two and draw a rectangle around the detected objects in the image and print it out on the display or something. One nice feature of this design is that you can also split your data flow across multiple machines. For example, if you have an embedded system that has limited processing power, you can offload heavy computations to a remote machine and use the network for sending back the process data. Also, you have these nice boundaries, so by observing all the messages that are sent you can get a pretty good idea of what your system is doing. You have also, for example, the option to lock all the messages and try to replay them later on a debug system to debug issues. The most popular frameworks that implement this pattern are ROS and ROS2. They are both CNC++ based and have unfortunately quite a complex build system which can be a bit intimidating to beginners, but they are quite mature and widely used in both research and industry. The motivation for Dora was that we want to make the creation of robotic applications simple and fast, and we want to focus on modern languages like ROS and Python, but we still want to keep supporting CNC++. We also have plans to support WebAssembly in the future to have even more isolation between components. For the build system we try to keep things as simple as possible and to use the build system of the different languages when possible. So, for example, if you are writing one node in ROS, then you should just use the CrateZero dependency and run a cargo build command without any additional project specific tooling. And also we want to make it easy to integrate with latest technologies such as Python, AI models that you should be able to just use without much setup. For the general design we decided to make each node a separate process to benefit from the isolation and fairness guarantees of the operating system and to give authors of nodes a lot of flexibility because you have full control over the process so you can access devices or include some libraries that you need and so on. And the nodes communicate by sending messages as we said. So, for this we decided to use a declarative approach. We have a YAML file that lists all the outputs and inputs and how they map to each other so you have like a single crown of truth how your data flow graph is laid out. One feature that we are quite proud of is the zero copy implementation that is transparently added whenever the sender and receiver are on the same machine. So in this case we use chat memory to pass the message contents and we encode the messages using the Apache Arrow data format which allows us to access the data and process the data without any copying as well. So for example, on the right we could have a Rust node that sends some data to a Python node, the Python node in its Python runtime does some processing using NumPy for example and using this Apache Arrow format, all of that is possible without any serialization or copying of the data which is quite nice and also results in a quite nice performance. So here we see the latency of a Python node in Rust 2 compared with Dora and we see that for large messages the latency is much better because we don't need to copy the data or serialize the data at all. We also for compatibility try to create a Rust 2 bridge to allow the step by step migration of existing Rust 2 applications and also to use the existing Rust 2 tooling which is quite mature for our Dora nodes because we don't have that kind of tooling yet. For the implementation we decided to do the interfacing at the DDS level, so at the middleware level, so we don't link to the Rust 2 libraries but instead we have our own DDS implementation and this way the build process stays simple and we don't need to complicate things but we still are able to auto generate bindings for Rust in C++ because we pass the message definition files of Rust 2 and we also have automatic type conversions between the Rust 2 message types, the error data types and the native Rust or Python types. Yeah and lastly, so we have two more features, one is open telemetry, so it's for everything that is metadata, we don't want you to learn a new way of logging your logs, tracing your traces and having metrics, so we're using open telemetry which is an open format that is available on many languages, C++ Rust and it can connect to many back end as well, so if you're already using some promissory graphana you can just use Dora and use your back end really quickly and have all your data coming in and there's also a lot of applications that use this open telemetry such as website, servers and so you could have your Dora data at the same time as your other application data. So that's really useful and then we have also a hot reloading feature that helps us basically reload our application without having to reset the robots, this thing can take a bit of time and sometimes there's some calibration so we found it really useful to be able to change the code, change the logic without having to reset the robot and now that we can generate code with generative AI with chargeability as you probably tried out or mistral AI on a local machine, it's really useful and actually it's very simple to code, you can use, you just have to check your state and if the state doesn't change, if it's compatible you can just swap state and it works, so yeah I'm going to do a quick demo because I think it's probably easier for you to understand, so I have a robot, I have a microphone, I have a speaker, I'm going to use whisper as a node to convert my speech to text and then I'm going to use an LLM to change the text to code and using the hot reloading feature it's able to directly change the behavior of the robot and then I have two webcams just so that everyone can see and then I have additional node but I'll let you go into detail later because I don't have too much time, so this is kind of an overview of what is happening underneath and yeah let's do a quick demo, so I'm going to start my graph instead of my computer and fortunately I can't share my screen because somehow my event is not working with the HDMI so I'm just going to use my microphone to try to work with the robot, so let's say I have a robot here, I'm going to say okay, can you set the rotation to 50 and so now whisper is going to convert what I said to code and then the code is going to be hot reloaded in the robot in real time hopefully, so the first thing is takes a bit of time but it should get there at some point, I'm just going to try again, can you change the rotation to 50 please, just give me one sec and normally it should move, there is no problem, so I go and so this is just code, right, it's not something that I've pre-implemented, it's not and we can, if people want to have a look, we can have a look deep behind the back and talk about it, I can make him move but the table is really small so maybe I can just try something very small, yeah, can you set the variable x to 1, so yeah, okay he didn't understand, he didn't understand the Bible, so yeah, can you set the x variable to 1, okay, the node should move, yeah, so you can really, you have all the control, you can make him move, rotate, thing, just going to hope it's not going too far but hopefully it should be okay, yeah, and yeah, so this is like really simple logic, it's just changing one variable but I can also use charge-pt which is way more powerful to generate way more complicated code, okay, so this is kind of, if you already used charge-pt before, you will see that it's not always reliable, so it should work but it's not a promise, right, and I'm sorry if it doesn't work and I probably have to do some debugging of the code that charge-pt generate but let's say I say, can you set the rotation according to bounding boxes, so you can't probably see but I'm running an object detection on my computer and so he's able to get bounding boxes, you've probably already seen demo with deep learning and pie torch and things like that and so it's actually very simple, right, if you look into details of many demos and so it's actually very simple to get bounding boxes, so I'm getting the webcam, I'm sending it to object detection and then I'm plotting it on my computer but I'm also sending it to the planning which is the thing that is controlling the robot and so then, so charge-ptd has to link this bounding box to a rotation axis which is actually quite complicated because the rotation is in angles and then only what he has is bounding boxes with x, y on a, like an image frame, right, so it's probably like 20 pixel left, 20 pixel right and so he has to generate this code but normally it should work and so charge-pt takes about 30 seconds to one minute, maybe more depending on the file length because it's per token right and so now it's still like talking with charge-pt I think and now it's finished so it should be able to move, I'm just going to log it, I'm sorry if I take a bit of time and, alright, so there was a bit of an issue, the true value of an array is complicated so, alright, just give me one sec, center x ratio, center x, center x, okay and then I'm going to put a zero here and normally it should work, is it moving? Oh yeah, okay, so now he should move according to what he sees, so if I'm here maybe he's going to move like here, okay, yeah and now it should stop moving, yeah, it's like doing some PID stuff, right, like kind of like brewing so, and so if I move here, he's going to move from here, yeah and so this is the whole logic it was charge-pt, I didn't code anything, I'm probably too lazy to code this thing but yeah, but you can see the idea so if, yeah, that's kind of where we are now and, yeah, time's up, okay, so here's the feature we have and if you have any features you can let us know and we'll try our best to make it happen, sometimes it doesn't happen but we'll try our best and yeah, so well, thanks for listening, yeah, thanks for having us, thanks. I have any questions? Hi, it was very interesting, when you showed the graph of the latency with ROS2 versus Dora, do you know what middleware you were using to compare against, because obviously ROS2 has a whole bunch of different middlewares that you can use some of shared memory, some over the loop back, do you know which one it was? I think it was default DDS. Yeah, it was default DDS. Or TPS or? Now I can't remember exactly the details but we tried to use the exact, like, tutorial version of ROS2 and in Python you can't do like shared memory things so like, we tried to use MW isorox thing with Python but it didn't really work out, yeah. And on real time stuff, do you have bindings for say POSIX, Tread Priority setting and other real time integrations or is that still in the Python? Yeah, that's probably things that we can improve but actually there's a lot of time spent on serialization, copying and so this is where we think that the biggest difference is, yeah. But we can definitely look more into detail into the benchmark if you're interested, yeah. I'll take up one over there. Right, I'll give you some problems. Oh yeah, it's moving because of bounding boxes, yeah, if you see. Hello, I'm curious why you chose to have an, over here, an explicit definition of all the kind of topics of communication. Like why use that instead of kind of the raw style just publish and subscribe without, blindly publish and blindly subscribe? Do you want to answer? Okay, okay. Yeah, I think it was just to have additional insight at the beginning directly. We also plan, I think if we go to the road again, we also plan this dynamic data flow feature again because in some cases it's quite useful to add nodes at runtime but for like getting started it was useful to be able to generate a graph of the whole thing and yeah, so. Okay, one more question. There was a guy over here, yeah. Hi, sorry, very interesting. Maybe I missed it but how do you communicate across different computers which networking provider are using? Right, so there's an SDK and so basically you can send messages with a protocol buffer and there's a small computer and it's interesting and the video stream is using H.264. Yeah, so I was talking about like Protobuf but just like what is like, I don't know, DDS for raws, then now they are moving to senior or what are you, how are you sending the messages actually? Okay, so now I'm just using the SDK of the robot for communicating, yeah, so it's very simple. In the future if we have remote machine we can use TCP and maybe Xeno type of thing for Dora itself. Okay, so Dora now it's only one computer. Now it's kind of one computer. Yeah, we have like basic TCP support for remote machines but nothing too optimized. Yeah, but it's definitely something we want to do. May I ask another one? Yeah, so have you tried R2R like the another raws Rust binding which is also like using the raws C library but wrapped and it doesn't use all the complicated, the amend and what not, cash is building with cargo which I agree it's a good advantage. So maybe it's interesting to check that library also. Yeah, absolutely, yeah, we definitely look into the raws thing and there's many clients. We're actually working on the raws to bridge as well and so we are using the Rust client from like an unofficial raws client that enables you to not use the complex like raws to build system and still use raws to. So this is how things are. Okay, so the robots are ticking over. Okay, thank you very much. Thank you.
Vehicle Abstraction in Automotive Grade Linux with Eclipse Kuksa
All right. Why can't everyone, while the last people join the room, let me ask a few questions to get an idea of the audience that we have here. So, quick show of hands. Who of you knows AGL Automotive Grid Linux? That's quite a lot. Awesome. Another question. Who of you knows Cooxer? Okay, let us change that because that way, fewer hands than for the AGL. But I think that's a good thing. Last and final question. Who's here still from the beer talk? Like room beer? Okay, I'm glad. We actually came out of these talks. So, as you can already see in the introduction slide, we will talk about vehicular abstraction. So, we talk about Automotive Grid Linux and we talk about Cooxer. So, before that, maybe a bit of context. Who am I? So, I'm not the super automotive developer doing a Canon AutoZer for the last 20 years of my career, also due to age. But I started with really coming from the cloud. I used to keep an E2 working on different projects in Github. And I thought, how can we actually make application development for vehicles more fun and efficient? And one really large essential piece here is one challenge because there are no restandardized signals. You can develop an app for one car and it won't run on another vehicle, even maybe from the same vendor. So, what we often see in the industry is this kind of high end-to-end complexity. So, every application is developed for every specific model, every specific car, and we have a huge pay point for that because you cannot port your applications there. You cannot scale, so if a developer is developing an app for one brand, it won't work on the other brand and also maintenance is just a nightmare because just build it for one car and then you completely forget it. So, as always in computer science, one solution to that is abstraction. That's why we also took a lot of effort in the topic of vehicle abstraction here. So, how can we make a world like this happen? So, where we have tons of applications that develop against the same API, against the same data model and that work just on different cars. While it's the same car, it's the same time. I'm talking a bit too much about cars. We run it on different models, different brands and so on. So, basically how do we get to the world where we ride at once and run it everywhere and to also attract third-party developers because this is how you grow the ecosystem and also make it more attractive to develop the unrealized synergies. So, for this abstraction, I would say we basically need two things. One is a data model to operate on and the other thing is the APIs to interact with the data model. Coming to the first thing, I hope, here we go. When it comes to the data model or you might also call a taxonomy, we decided for the CoVisa Vehicle Signal Specification. So, it's done at an organization called CoVisa, formerly known as GenoVee. Maybe that rings the bell for some. And it was basically does. It creates a tree structure for all kinds of data that might be available in the vehicle. So, for instance, you get the tire pressure. You follow the branch of vehicle, chassis, axle, road, one, wheel, tire and then you get to the pressure signal. The same way you have sensor values in here, you can also have actuator values. So, for instance, when we have a seat position, we could just change this value of the seat position and eventually that seat in the car would move to that position. That's the idea of this whole data model. If you want to play a bit with that, there's also a really cool website called Digital Auto that makes nice visualization of that and also shows some example applications how you interact with VSS. Okay, now we go to the first piece. How about the second? And this is where Cooxa or more specific Cooxa Viara comes into play. So, while in this case, vehicle abstraction layer, so we talk about abstraction, so the idea is to have Cooxa running in the vehicle computer. So, some kind of computer which might run Unix or something similar to that. And we also assume this is a place where we decub the hard from the software in the vehicle. So, the underlying assumption is something you can see on the left. So, we have a lot of deeply embedded layers, can, autos, are, lin, sum, IP, whatever you like or maybe don't like, which is maybe really proprietary in some cases and also the signals and the bits are really specific to the car. So, then people would write something that we call provider or also feeder to translate between these really specific systems and embedded systems towards VSS using the Cooxa API. This is where the API is coming from because we use here Cooxa. If you like more on the abstraction side, we also can say like in the deeply embedded layers, we mostly have data like really 1001 or the bits and we kind of need to interpret those. So, we translate it to VSS, get some information out of that and then by combining this information in different applications, we actually create knowledge. And here Cooxa is a nice building piece for that. So, what is Cooxa in general? So, since we are in the open source conference, obviously it is open source, fully licensed on the APHG 2.0 license and as I just mentioned in the previous slide, it is some kind of digital twin based on VSS. So, it shows the current and the target value of your vehicle. I don't want to go into the definition of digital twins but I guess you kind of get what I am getting at here. And so, you only have the current value which is quite nice but you also have the target value. So, coming back to our seed example, when you would change the value, the current value as an application from a seed, this doesn't meet the seed is actually where I wanted to have. So, I actually will set the target value and then it is up to the deeply embedded layers, so the actual vehicle to move the position of the seed over time. So, that is why you can change both value and hopefully at some point the current value will be the target value because that is the whole idea. So, much about the concepts. Let's get to the code. Or maybe I won't show code here but what it is actually written in. So, we wrote this in Rust. If you steadily compile it, it is less than 4 megabytes, large or small depending on which word you are coming from I guess. Like, these are the cloud words and it is small from the automotive words, maybe large to you. And it is quite language agnostic because the interaction with this is with it because we have a GIPC interface with some basic functions like get, set and subscribe and also a number of client libraries using this. And with that, that is actually the basic of Cooxing and I have to be honest with you, if you have been in this death room last year, you would say where is the news because this has been shown there as well. So, let's get to the news. So, what has happened in the previous year? First and foremost, it was using AGL so Scott will talk a lot about that in the next minutes. But we also have some other news. For instance, we now have a Cooxer Android SDK, we have a mock service and we also did some work with later from our side. So, the Cooxer Android SDK, I mean it is kind of straightforward because in the end of the SDK, that is now available in Maven Central and you can interact with the data broker from an Android application. So, be it Android automotive or maybe your own app on your smartphone. So, assuming you have some kind of Cooxer abstraction in your vehicle, you can use a companion app for instance, which we are about to release to the F2O store. Now, there will be a moment for the releases there. We did support request beginning of the week, but we still wait for F2O to actually show this app in their repository. So, stay with me till Monday, then it might be there hopefully. Another thing is a mock service because the guys in the previous presentation had their robot here. We cannot always have a car on our lab to test the application, but we kind of depend on the behavior of the vehicle. So, we need a way to mox this. So, the community came up with a behavior definition. For instance, whenever the signal of a seed is changed to a certain value, like 1000, then the current value should also change to that value. And this is what you can basically mock or emulate with the mock service to show you just an example. Here we have just an example I mentioned. So, whenever the driver's side position changes, then we create an animation to move to that position or to move the current value to that position, which makes it quite easy and flexible to test whatever you desire with your car. And last but not least, this is just a sneak preview into the lab. So, Cooxer is part of the larger community in the Eclipse Foundation. There's an Eclipse software defined working group, or short Eclipse STV. And there's another distribution called Eclipse Leder, which tries to combine some of the major pieces of the ecosystem there. And this is called Leder. And what we managed to do is actually run the Leder-Yogtu layer on top of an HGL, so that you actually get these pieces, like especially Cooxer, but also some other projects like Cantal, to run on the HGL stack. And I think this is a really good opportunity to learn a bit more about HGL here. Oh, okay. I'll take over then. All right. Thank you, Sven Eric. So, I have done a lot of stuff around HGL, so people might recognize me. I'm Scott Murray. I've done Linux for a long time, and I've been at Linux for a reasonably long time as well. I've been working on HGL on contract for pretty much eight years at this point, and doing all kinds of different things for the project around keeping the Yogtu stuff up to date, and also doing a lot of the demo and integration type of things. So, there was maybe almost half of the people indicating that it would be what HGL was, but I'll do a very quick run-through. So it's a collaborative open source project, basically trying to build a base platform that you can build an automotive product on. So it's about 10 years old. We have a vast array of members now, a lot of the major OEMs, and tier one and two new suppliers. It's pretty much a code first sort of thing, where we are more focused on let's build the distro and get it there for people to try and involve. A lot of work went into that. You might have seen HGL demos for several years doing that type of stuff, but our members were basically saying in 2020 that they weren't interested in maintaining that because they weren't going to use it in product. They all have their own application frameworks, or they buy an application framework, and they like to see HGL focus on lower level, show us how to use open source more than writing new stuff. So we started out, our tech demos, or integration demos are more like taking best of breed open source projects and showing people an automotive. Here's how you use these things. And so this really worked out well, because we weren't connecting. We needed something to show here's how you will do vehicle signaling, and VSS and Cooks of Al basically were starting to come out around the same time that we needed a new thing. So I had started playing with Cooks of Al in 2021. Our first release basically was our spring release in 2022. And it replaced our old signal composer and our can service with basically the original Cooks of Al server. And so since then, basically since spring 2022, we have recipes in our layers for HGL to build the Cooks of Al server, now the data broker. And as well, we actually have some signal customization stuff to sort of access an example of here's how you add some custom signals. And we use their can feeder to basically sort of wire up and show here's how you put all these pieces together. We have our own sort of like mocked up HGL virtual car can definitions. And so that sort of acts as an example for people to use. So that was spring 2022, like I said, and that won't go into all the nitty-gritty there. But originally, we were using the original Web socket API, which is a standard thing with sort of companion to VSS. We actually had can working in our demos. And so through 2022 and into 2023, we were sort of keeping up with the Cooks of Al releases. I started, you know, some nominal updates around switching how we were doing our signal additions and stuff. And then this past summer, our pike release, basically, I started the process of switching over to the data broker, which is the rust based implementation. And so I actually got interesting because we're based on Yachto Kirkstone, which is the LTS release, which at this point is two years old. And it has older rust. So we couldn't actually build the data broker. And so that was the thing where basically, a jail, we contributed upstream, I have a layer that you can get for the Yachto Kirkstone like mix in basically gives you a newer rust to be able to build the data broker, which other people I now are no are using for building other rust projects. So that, you know, we're now starting to look at the data broker, this cop coming release. Basically, we were now using the absolutely latest version of Cooks. And that now I fully have us all switched over everything's data broker using gRPC, all our demos are converted. And that basically acts as a thing. We're trying to see this with the automotive community, because, you know, we see a lot of vendor, you know, or we encode that people open source is all like custom IPC and stuff like that. And it's like, Well, no, there are open source projects that are heavily used that do, you know, gRPC and, you know, interact with cloud providers and stuff, you don't have to reinvent the wheel. So Cooks of Al has been a very good thing for us to sort of try and get that to people. So how exactly are we using an AGL? So there's, you know, the assess applications. As Eric mentioned, the concept of, you know, there's actuators. So there's, you know, apps that basically just listen to sensors. So like dashboards type of, you know, things like that. And then for acting on signals, so basically implement an actuator behavior, we have some example services that do that kind of thing. It's like HVAC sort of stuff. There's also setting an actuator value. So that would be like on a user facing infotainment app would be like HVAC controls or, you know, audio or volume that type of stuff. So in our tree right now, we have two demo services that basically do that actuator side of things. So we have HVAC service that basically listens to all the like signals in the VSS hierarchy around HVAC controls. And then in our demo setup, which unfortunately we won't have the full setup here, actually pushes out to drive some fans and things like that. In the audio side, basically I'm listening into the audio like volume signal that's in VSS and we, you know, have some custom things that I'm working to push upstream. But basically actually drive that down into wire plumber and actually like adjust the, you know, the audio setup. The user facing side are demo applications, the QT demo, which I think we might be showing tomorrow. Basically we're using the SS signals for like everything pretty much. So all the applications in that demo, which are in our source tree, you can grab them, basically are all wired up to do VSS signaling. And the code is sort of in a nice little library now and basically allow you to reuse it. On our newer Flutter demo, which I'm not truthful, actually maybe I think we'll have one setup that'll have that tomorrow. Basically it's, you know, it has a unified sort of home screen. It's doing GRPC from Dart. And right now I don't have that sort of library sort of packaged up yet, but that might happen this year. Or we might move it to native code. Tidder, who are big into Flutter, they tell us that's what they do that for some of their stuff. So, you know, we're pretty much, this is what our newer Flutter demo looks like. And so in this demo, like the tire pressure, all the likes, you know, vehicle speeds and stuff like that, and all the, like the AC controls and the temperature, all of that is going through VSS signaling to driving, you know, demons or whatever you want to do. Or KNData coming in actually gets converted back into a signal update. So, so there's some extra, you know, presentations from Sven and Eric myself. And we're going to be in the AW building tomorrow. We're open bed at work today. We're to have that table tomorrow. We'll have our demos. And this is, you want to do your pitch? Sure. So if this sounds interesting, or even if it doesn't sound interesting, there's a huge chance to engage with the community around coaxa and the larger communities in the automotive sector. So we have something called Bosch Connected Experience. It's hosted by Bosch, but it's basically very large hackathon in Berlin in the end of February. So a bit short notice, but I would be really glad to see some of you there. We have the chance to work with a lot of things like maybe actual seats, maybe actual cars, hopefully. Or and also we plan to have some meetable assimilation of a car which is then connected to a data broker. So I think it will also be cool what you can do with combining these physical and also this cyber physical world, if you will. So I really encourage you to do that. If you want to come there, you normally have to apply. But if you just approach me, I think we'll find a quick way to get you in because being you in this room, I think qualifies as a good hacker for that. So was that maybe you there on another community meeting? So thanks a lot for stating this out and we open for questions. Yeah, I think we have a couple of minutes. Yeah, we'll have to share. Thank you. Great talk. Just wanted to understand a little bit about your testing cycle. So if you you're developing something with this and then you test it in a virtual environment and then you want to test it on a real car, like what do you do in practice when you're developing stuff? Do you have an answer to that? So I wouldn't have a straight answer because here we talk more about implementing that abstraction layer and mostly testing it against things like this mock service or with something like a feeder where we have recorded data. But things that you're touching on a small like a really general topic on how do I actually get my automotive software up and running and into the vehicle. So that's a bit beyond the scope of what just the Cookshead project is doing. So not too much I can comment on here, but I think it's a good topic for the communities, either AGL or Eclipse STB because we have some rounds of meetings where we exactly talk about that. Yeah, I would just say that it's still actually pretty early days for DSS. I mean, I know there's a bunch of OEMs and interior ones that are actively working to product eyes. So I don't think we have visibility yet into how they're actually going about and testing. So I think hopefully in the next year or two we'll see more and we'll maybe get some ideas there. Any more questions? Maybe in two or three words, can you share a little bit about the data broker? Is it something that looked like Debus? Is it something like look like MQTT broker? Something else? What it looks like exactly? Is it something that we can reuse elsewhere or is it specific to Cookshead? I would say the data broker is really specific to VSS data. So it's not like you can put any data in there. So the way it works is you start the data broker and you also give it this VSS data model that you have. So the VSS data model is expressed in a JSON or in a YAML file. Then you put this JSON or YAML file into the data broker and then you can basically do get set and subscribe. That's why I put up this slide again on this kind of data which is expressed in the data model and then the data broker implicitly knows about that. When you talk about MQTT, there's also I have to admit other APIs to interact with VSS. For instance, VIST done in W3C and they also looked a bit into how to do that over MQTT. But again here, the data broker is especially tailored to interact with VSS signals. That's why I cannot generalize it too much. Basically, when I go home, I have a project that our vehicle to cloud, Expert Group in Agile, wants to see basically pushing from VSS up into the cloud. So I'm going to be building a proxy that will basically take a list of signals to listen to from the VSS data broker or cooks a data broker and then basically MQTT them up somewhere. So then talk to us in a better world. I'll have a story for you then. Maybe one final thing to add to that. As there's one slide, I actually removed it from the slide deck but there has been a huge discussion in the VSS community whether VSS is actually fit in the vehicle or whether you should use VSS more on the cloud back end so that you put all the data from the car and whatever form it up to the cloud and then consume it in VSS there. And the data broker is kind of like an answer to yeah, it's also possible to do it in the car in addition to the cloud. So that's kind of the background story as well. Okay, so I think that's all we have time for for the moment. So thank you very much Sven Eric and Scott and round of applause.
An open-source, open-hardware offline finding system
Hello. So this is our talk about the spot nuts. It's a Techist Tinkering project. So first who we are. I am Pingu. I am 14 years old. I'm a member of the Techist community. I began hacking like four years ago or something like that. I'm interested in Python, home automation and stuff and obviously Penguins. And I also work at the Alexis project. And my name is Snick or Dominic if you like longer names, three to two. I am more or less the founder of the Techist community about which Penguins will say a few words right after my introduction. And here I'm working in the intersection between education and free software. It means I'm showing young people what free software is, what the values around free software are and also helping develop and promote free software for education institutions. And in my day job I mostly spend my time as a trainer for Linux administration, PostgreSQL, Rust and Python related topics. Yes, we mentioned Techist. It's a community based in Germany. Our goal is to create a comprehensive technical word for and with children and like to empower young people to question things and hack and build stuff like this project or the Alexis project. So here you can see where we were. This is an Alexis meeting. This was in I think at the Frostconn, the second largest conference in Germany. This here at the left side is our summer camp, name taking sun, where the trials come and then learn something like I think they are soldering things together and then programming it. So now what is an offline finding system? It's basically you attach something to something like a small tech, you attach it to your backpack, then you lose it and then you open some app on your smartphone or on your laptop and then you can find it or search it or don't find it. And the more technical offline finding thing. So the tech sent a signal via Bluetooth because it's offline. So there isn't a connection between the tech and the internet. Then an app like a helper app on your phone gets this Bluetooth signal and then says, hey, I found this tech there. And then I as the owner can go on my smartphone, search for the tech and then my phone search in the database for the tech. So how we got into offline finding. I'm very steady. So my scooter, like my scooter to drive in the city got stolen and then I had a Samsung smart tech like an offline finding tech attached to it. And then we drove to the approximate location and then with the feature that we can send a signal to the tech and the tech response I'm here, we could see where the tech was basically so what where signal was. And then we did three literation. So we went from multiple sites to it. And then there was a signal at one point and then we got a scooter back. And also there's our sketchy chef there. And he always loses stuff and wants to get it back or find it. So offline finding basically has three components. But the tracking tokens, the small devices that you attach to the things that you want to find, they aren't connected to the internet because then it wouldn't be offline and sort of use like some like and then there are the smartphones or some small helper devices. They get the signal from the tech and then send it to the internet. And then there's obviously a server where the messages like I'm here and there is a pack are sent to and then I can get them back from there. So there are obviously some challenges. Some are privacy related like a stranger must not abuse the beacon for tracking over the long term. And they should not identify the owners because then I could know where the stuff of some people is. And the back end, the server couldn't identify the owners either because then I as the owner of the server could identify the owners. And yeah. But some are also technical like the encryption without knowing the receiver because then I can identify the owner then Bluetooth because of the range and yeah because of Bluetooth. And then because of Bluetooth also the energy efficiency. Yeah, because at one point we tried out in ESP. How long would it last? And I think we did it with Shah 256 hashing and like lasted for a couple of hours. Because it's small and I think a couple of hours aren't enough for checking device. Yeah, design overview. All right. Thank you. Yeah. So after we somehow got snubbed by this by this topic around offline finding how this works, of course we wanted to try how far we can get building such a system. Of course, somewhat motivated by our grumpy, sorry, I mean, a sketchy, sketchy chef who asked, hey, is there some system like this based on open hardware, open source? I'm not so very excited about Apple controlling where I lose and find rediscover my stuff. So first, what we first it was we looked at how the Samsung smart tech system worked, which is the sort of tech that Pingu had attached to the scooter. And we found out that it sends these strange beacons of some sort using Bluetooth low energy. I will come back to that in a minute. And in the course of time, while we looked at how this works, we've it's more or less became obvious that actually this sort of system is an enter and encrypted mailbox system, because there is an owner device and this has a public key and yeah, what you can do with a public key, you can receive some sort of messages. And there are helper devices that can see these beacons and more or less just send any sort of message to the helper device. So if I lose something as the owner and let's say Pingu wants to help me find it, then they walk around in the city and their smartphone receives the beacon signal and now they somehow need to get the information back to me, telling me where they saw my beacon. And that's where these texts come in and they are as probably as dumb as you can imagine, they just send out a public key and yeah, so all the information you need to somehow get the location sent back to me. It's a macro incident that these messages carry location information. We could just as well put anything in there if any of you are into this sort of systems. Apple had a few vulnerabilities discovered in their implementation. One of the most interesting ones in the recent weeks was that people actually used the beacons themselves to transport key logger information out of otherwise air-gapped environments. I think using your favorite search engine or the search engine you distrust, least will bring some really interesting information up about this. So what we really want to build is a mailbox system and some sort of key management system because that's the really interesting part as far as I'm concerned, how we solve these privacy issues and some of the technical issues with cryptography. So this is the big picture. If this works I can zoom around in this a bit and now it shows that I should have used the headset. Can I do it with one hand? Yes, I can. So here's the big picture and what you can see here is all the red circles are showing secret keys that I use in the system, the green circles are showing public keys that I use in the system. Let's get a short overview of how this works. So we have the owner device and we give the owner device a sort of main key. This identifies the owner device and the easiest thing we could do now is we could make this Bluetooth beacon and simply copy the public key of the owner onto that beacon and attach it to some bag or scooter or some flash squirrel or whatever you don't want to lose. So at this point we more or less are done with the mailbox part and with the encryption part but we got into all the privacy troubles because what you now can do is you can follow the tech around. It always broadcasts the same public key information. You can just walk around the city and always rediscover where one person is moving and make a nice motion profile of this person. Also you could discover several tokens that are linked to the same owner device and get the information that all these tokens belong to the same owner. These are two of the most inherent privacy issues that you obviously don't want to make when designing such a system. So the next thing we do is we derive using hash based key derivation some keys or one key pair for each token so that we can unlink the tokens from each other. And the rest of the system in case I think many of you will have heard about this term a ratchet algorithm and the rest of the system more or less is very close to what for example the signal messenger does with the scriptography. We transfer this this key pair this device key pair to the tech and now we do one key derivation every let's say 15 minutes at least that's what Apple does. And the interesting part here because I never worked with cryptography on this level before is that now we can derive new key pairs on the tech and it will send out another elliptic curve public key every 15 minutes. So we fix the privacy issue of following someone around. Now you can follow someone for 15 minutes and after 15 minutes you see another beacon and you cannot distinguish whether this is the same tech which rotated its key pair or some other tech of another person. Yeah that's more or less the main secret of the system and then if I find the tech I can send a message to the public key it is currently broadcasting and there are some other things mixed in here but I don't want to go into too much detail about this part right now. And the second secret is that when I try to retrieve my location information that all the messages that other send to me I just ask the server for all the information sent to all the public keys I know my tech will have generated within the time frame. And this request can also be encrypted because we also use another set of keys so that the server can also not find out that all these keys are linked to my device. They should have zero knowledge about the ownership relation between the techs and the owners. Okay our experiments are implemented in Rust. We have split it into the spot nuts crates. Hazel OS is what is supposed to be running on the techs and the helper device in Rust based mobile app and in case you happen to need or happen to find the time to review an implementation of signals, X-Ed SDS implementation in Rust. We also factored out this crate so you can tell us what obvious mistakes we made in the cryptography there if you like. And the JG crates are a general implementation of this message of this mailbox system which can be used for the offline finding system but actually for anything that is supposed to carry public key information to someone and allow them to anonymously send back some sort of information. So what we have? We have this implementation of this general JG key exchange and Maywork system with a library usable as an alpha version and a small server implementation that actually does not care whether it is used for offline finding or whatever other purpose. And we have an experimental version of Hazel OS for ESP32 with the limitation that Pingu already mentioned that we get the ESP32 development board to run for something like five hours. So how long did you take to get your scooter back? Did you manage to do it in five hours? I don't think so. Okay you have to be quicker next time when you switch from some. Best thing so we can either fix the technical issue or you can start a running career so whichever is easier. Okay so next things we want to do is we want to find a decent microcontroller. I happened to give a Rust training last week and one attendee told me this ESP32 this is nothing to do with microcontrollers. This is a toy. Get a more hardcore microcontroller and I think this is what we will try. And for Hazel OS to this we need to build an experimental companion app. Maybe design a nice PCB so it don't have to attach a breadboard with a development board to your scooter or stuffed squirrel or whatever. And maybe we can find others interested in open offline findings standard because Google and Apple and Microsoft and you name it are working on something like this but of course it's not so very openly developed. Spotnuts is a tinkering. Thank you for the talk. The question is how do you allow the helper device to send the message to the owner device and at the exactly same time don't allow some stranger to track the owner. Somehow at the feeling that at least one of my slides went missing when refactoring the slide deck. There's back an infrastructure. One thing I mentioned is JGD which is just a small mailbox server. It just has two API endpoints. One receives messages. It does not care what these messages contain. They are just JSON encoded encrypted messages to the public key we saw and the owner devices they just ask hey do you happen to have received any message for this public key I think I might have had. So the thing here is you can actually even in the Apple ecosystem you can ask the server for all messages you like. You can just send public keys there and they will give you the information about all messages that were sent encrypted to this public key. The nice thing is so you can download the whole database from Apple servers as well. The nice thing is you can do anything with it because obviously you also need the second half of the key pair. If you don't have it you get a nice bunch of random data. Over here. Hello. It's here. Over here. Would it make sense to make this key rotation time period not fixed at 15 minutes because if I was following a tag I could time the key rotation based on the period and then know that it was rotated at the exact 15 minutes. Yes. Bit of silly question but have you considered Linux mobile support for the helper device? Can you repeat the question please? Have you considered supporting Linux mobile phones? Supporting mobile phones to carry the... Is it a part? That's running Linux instead of Android or iOS. It's supposed to be a web application which will need web Bluetooth support in more browsers than Google Chrome but actually there's this Rust library and it should be easy to use it in any sort of app that you like on any platform. That's great. Thank you. Thank you again. Thank you.
From an artificial nose weekend hack to a future-proof IoT device
That was helpful. Thank you. Thanks for joining. This is going to be a talk about a fun project that I started, I think it's almost four years now, so I feel like I'm sort of milking the idea, but it's pretty cool. It's back in 2019, I guess. I ended up building an artificial nose using some cool tech, and I'm going to talk a bit about the tech behind it and how I ended up moving the project from a really, really dirty weekend hack into something that's hopefully more future-proof and using cool things like Zephyr. So, a few words about myself. I'm a Benjamin. I'm based in France for the past year, almost to the day, actually. I've been working as a developer advocate for the Zephyr project at the Linux Foundation, and I do many things, including as a good French person, I guess, baking bread. And I don't know about you guys, but I've been trying to perfect my bread recipe for probably over 30 years. Like, I'm still not really happy about the way it turns out. Like, it's a bit random, right? And so, back, I think, in the really first few weeks of COVID, with like being stuck at home, lots of times on my hands, I was like, maybe technology can help me improve my bread recipe. What if I could figure out a device with maybe some AI in the mix that I could like train to figure out when my sourdough starter would be perfectly fermented? In my head, at least, the idea would be that I would buy AI, figure out when the sourdough kind of looks all right, bake the bread, figure out if the bread is good or not, give it a, like, oh, it's a nine out of 10. Like, it's really crispy, really nice. And then do the training that way, right? And so, the idea would be to smell the sourdough starter to capture some information, which in my head, at least, I'm not a chemist, I'm not a food chemist, but measuring things like the amount of volatilagony compounds and CO, CO2, whatever, there has to be a correlation and like the perfectly ripe sourdough starter, there has to be a way to identify it, right? And so, back in 2019, there was also this sort of cool kid on the block, new cool kid on the block, which was, and which is, tiny ML and things like TensorFlow Lite, finally available on micro controllers, things like that, right? And the thing is, I know really little about neural networks myself, like, for some reason, the math, like, whenever I would open a book about neural networks, and like, oh, yeah, it's easy, you're going to recognize handwritten digits, like, this is a bitmap, you go through some layers, blah, blah, blah, oh, you recognize the digits, that was going way over my head. The thing is, playing with physical things, more tangible things, I actually was a role in just a few hours, really, and with the help of some tools, some of you might have heard about something like, called edge impulse, it's not strictly speaking open source, although it's based on TensorFlow Lite for micros, but it helped me train a model, basically, taking some Arduino, like, an Arduino compatible device, this is a WIO terminal, a Cortex-M4, taking a gas sensor, feeding the data and like, capturing the data quite often, taking this data into some kind of training algorithms, and I would be able to figure out the difference, not necessarily between good bread and bad bread, because remember, COVID, like, flour wasn't even available in the supermarkets, but booze that I had in my house, so I actually figured that it was able to make the difference between not only, like, rum and whiskey, but it was actually accurate enough that two, like, one really pitted whiskey and one slightly less, so it would make up the difference, right? And I started to talk about the project, because I found it really cool, like, not do the silly bread thingy, but something slightly more useful, which is figuring out in the human breath, when there are, when you can spot the markers for fungal pneumonia, Kaleb, the kid almost died, basically, when he was really young and the doctors couldn't diagnose the disease, turns out that since then, there's now literature available out there that says that, yeah, there are some markers, and he sort of built a proof of concept for that, so that felt really good, but what didn't feel really good is that the code that was from day one available on GitHub of that project that I have to put together is horrible. It's like 2000 lines of boilerplate, copy paste, typical Arduino code, right? Like, I mean, I've been gathering bits here and there, of course it works, but it's really, really bad. Small, just like really quickly, because I think it's worth mentioning, how does a machine smell anyways, because we're all, I think, familiar, or we all think of things like temperature sensors and humidity and illuminance, like that certainly comes to mind, because we actually also use them every day, but there's also sensors that can smell, they measure the concentration of particular chemicals in the air. The way it works is basically just a chemical reaction between a tiny slice of metal oxide semiconductor, and based on how many of the offset compounds can be found in the air, you can measure a variation in the change in resistance, right? The more VOCs, voltallogonic compounds would be in the air, the higher the resistance, for example, which means that I could measure, like, start acquiring data, putting my sensor on top of bottles of alcohol and tea and coffee and whatnot, and capture basically what I would call the fingerprint or the olfactory fingerprint of a particular smell, and then with a bunch of AI and ML, basically figuring out what in this raw data identifies a smell, and so my intuition would be not knowing, again, a thing about signal extraction and all that kind of thing, would be, oh, well, but if this is whiskey, then if I were to write down what makes whiskey so special, it would be probably something like, oh, yeah, when you smell whiskey, nitrogen dioxide goes up, carbon monoxide, not so much, VOC goes up as well, maybe in a slightly more steady way, and so basically what happens then, the way the model works, is just that, except that it's a machine doing it, looking at the raw data, doing some basic statistics to extract the mean, the mean, the max, the standard deviation, like, all those things that could potentially characterize the smell, and then this pre-processing, this DSP, if you will, then goes through a typical neural network, so this is fun, you get to the point where you have this funny looking thing, like you can even go the extra mile and, like, sort of, 3D print, the enclosure, and there's, yeah, you have a lot of fun. I ended up building and packing, again, like, in those 2,000 lines of code, plus all the libraries, of course, that I'm pulling, I would have a GUI, I would have Wi-Fi integration, actually, that's something that I added eventually, and, like, whenever I smell something, I can push it using MQTT to a server, there's, of course, tons of hardware interactions, and all that needs to work at the same time, except that if you do it the Arduino way, and the lazy way, I guess, then you end up just doing this, which is, again, not necessarily, like, if you're lazy and just, like, eager to get your POC and your thing working, you end up putting a lot of code in, essentially, a superloop, and so, as often as possible, I need to do all this, which is acquiring sensor data, which, by the way, you don't need to do that often for getting good accuracy, like, the way the device works is that I just sample the gas sensor readings 10 times a second, it's not all that much, so every 100 milliseconds, I would read sensor data, and then I need a bit of time to actually run data through the AI model, which, again, doesn't really take a lot. The model, at the end of the day, is really simple, so you really only need a couple milliseconds there, fair enough, and then there's the world GUI aspect, which, again, if you're lazy, I'm not even, like, whenever a button is pressed, it's not even interrupt driven, so you need to figure out, like, if a button is being pressed right in the loop, not ideal, but you do that, and then, if you want, you then post results to an IoT server, and then you don't even know how long it's going to take, right? Like, if this is synchronous, it might be a problem. Enter an autos, right? That's basically, for the first few years of the project, it was sitting there on GitHub, this really crappy thing where people would open issues to be like, really, I mean, yes, I would put the ready to flash, like, firmware for people to use, but anyone who wanted to basically tweak the code, they were just scared, and so the thing is, I ended up, yeah, using Zephyr to try and rewrite, and also to myself, frankly, to learn some of the best practices there, I ended up trying to leverage some of the features of Zephyr, which is beyond being an autos, which hopefully would help me move away from the super loop, also get a better solution for targeting multiple architectures. Like, originally, I would be targeting the weird terminal, which is some D51 Cortex-M4, but I actually don't mind ESP32, and having the same code, same portable code, and portable build infrastructure, test infrastructure, I don't mind getting that, plus all the libraries that also come pre-packaged, and yeah, that's basically what I did. So, from this point, I guess, the presentation is more about telling you, like, how I replaced some of the concepts or some of the things that I had in my Arduino code, and point you to some interesting areas in Zephyr of, like, features and subsystems that are available that you maybe didn't know existed, and, but frankly, I didn't know existed either. Sensor acquisition, that might be the sort of the easy part, but I really like the fact that now my V2 version, if you will, of the NOS, I have essentially, and literally, a dedicated thread that acquires the data exactly at the sampling rate that I require for my model to perform accurately, right? That's like, that could be an issue. If I do the super loop thing, and for some reason, the UI takes longer to refresh or communicating with the the cloud takes longer, then it will basically shift the sampling rate for the gas sensor data, which basically means that I will start feeding crap into my AI model at all. So, you may want to sometimes put the sensor to sleep and make sure that it doesn't draw energy unnecessarily, so it's actually also integrated in the Zephyr APIs. Then comes the TensorFlow Lite aspect. So, I'm basically pulling TensorFlow Lite as a library in my application and leveraging something that's called ZBUS that makes it, especially for someone like me who's not necessarily a hardcore embedded developer, I basically have this high-level framework where, okay, I have my sensor acquisition thread that does its stuff, basically puts the sensor readings in a ring buffer, and whenever there is data that's available for the rest of the world and the rest of my app to do something out of, then it's effectively like there's an eventing system where, effectively, my inference thread really gets data, like subscribes to sensor readings so that it does the stuff and figure out what is it smelling like, and also uses ZBUS to put the result on the same, like using the same topic mechanism, if you will, so that, guess what, the GUI, for example, can in turn subscribe to this piece of information to do something useful out of. No need for fifo's and cues and semaphores, like it's actually really nice, and the overhead is minimal. So, there's that, and then for the GUI, that's one thing that's really nice with Zephyr is that you have LVGL, it just works, like there's obviously in Zephyr tons of drivers already available for a wide variety of display controllers, but then on top of that you even have, like, the high-level framework that is LVGL for creating a GUI with, like, chart, like, this gauge, this gauge, and I never know how to pronounce it, like, this gauge, and the charts, like, those are effectively widgets that subscribe to the data that comes and is being sent on ZBUS and just displays it, and the code is really, really straightforward, it integrates also with things like the Zephyr input system, like, if you have buttons, keypads, touch screens, that basically send events, you can have the LVGL app automatically react to that, right, so that's nice, and as you may notice, this is not a photo of LVGL running on the actual device, it is a screenshot of LVGL running in a desktop environment, because you can actually run the full artificial nose code in a fully emulated environment, if you will, on a POSIX OS, including the GUI aspect, so that's pretty nice, and like I said, it really feels like you're writing, like, really high-level applications, I have, I'm defining, and, like, I have a listener that wants to be notified whenever there is an inference result that's being made available by, probably, by the TensorFlow light for micro task and thread, and when that's happening, then it's pretty straightforward, you get the data, you really get it actually as an actual, like, typed message, like, so it's something like you can actually really make a good sense out of, in my case, the inference result would contain both a label telling me it's smelling coffee, whiskey, whatever, and a confidence level, based on how confident the model is that it is effectively whiskey or coffee, and so I can actually display that on my UI, and the code is really, like, literally moved from, yeah, 2,000 lines of code, I didn't count, but it's a couple hundred max, so there's that, and then this is sort of nice to have, if you were to do more than just a kind of prototype toy project, you could think about having the device, probably with something less stupid as the enclosure, but in the ceiling of the restrooms here in the building, so that whenever it smells pretty bad, you know that it's time to send someone to clean the place, but you don't want to send someone to clean the place, like, twice a day if, like, nothing happened, like, if it's, you're on the weekend, or it's like a day where there's strikes or whatever, or there's COVID and everyone is at home, so the device would need to be communicating somehow in a way, like, remotely, and for adding that to my project, it was also pretty straightforward, because there was a, like, full blown networking stack in Zephyr for, like, TCP, IP, and, like, co-op and MQTT, and, like, all the variants, all the flavors, and all the kind of connectivity options you may want to use, they're all there, and so effectively, and I can maybe quickly switch to a really quick demo, which is, I have, so, well, this is the version with the enclosure, this is the version, which is actually the WIO terminal, this one is M5 stack core 2, so this is effectively an ESP32, this is the sensor, it's already configured and already connected to Wi-Fi, so if I were to, I think I need to stop sharing maybe, if I were to connect to my MQTT, yeah, connected to an MQTT broker, and in real time, so this is really, like, reaching the internet and then my laptop connecting to the very same broker that this guy is connected to, and, yeah, apparently it's smelling ambient air, I guess it's more, like, nerdy or geeky air, and if I put, so this is, yeah, well, that was fast, actually, this is lemon, and for the anecdote, I, I mean, not that you care, but I actually forgot to bring the lemon from home, so I bought this one just this morning, so it's different lemon, I guess that the one I use for training the model, but it apparently works just the same, so that's, there's that, and what else, yeah, and many, many other things that are pretty cool in Zephyr, the fact that it leverages K-configured and device tree, just like Linux does, makes for pretty neat code when it comes to, oh, I want my GUI to be slightly different if my screen is large, I want to put, to cramp more into the UI, well, that's an information that you can get really easily from device tree, right, if my screen is wider than 300 pixels, blah, testing framework, CI integration, every time I commit something and push something and make a modification to the artificial nose, it gets built immediately, A1, basically, by the way, I wasn't working on Microsoft back then, and they are absolutely no problem with me putting everything on GitHub, so kudos to them for that, so now the new URL, if you wanted to check out the Zephyr version would be the same, with Zephyr in the name, you can find all the parts online, I don't get any royalties or whatever for that, but seed has actually sort of been like nice, ready to use bundle where you can order all the parts, and that's it, questions! Hello, thank you very much, so there is some abstraction where you can use different sensors, but surely the sensors don't give the same values for... Great question, I had a slide, I've removed the slide, removed the notes, I forgot, one thing that I would love to see happen to kind of answer your question is some kind of open data set, open ontology to actually describe smells in a consistent way, because you're right, like you would have sensors that are giving you readings in terms of like unitless concentration, like it's going between zero and 100% of VOC concentration, some would be talking PPM, some would be whatever, some would have like weird calibration things, there's, yeah, it's, you're right, so you would probably need to retrain the model, it's not like you can, at least with this code, it's not like you can easily be like, okay I'm going to switch from Bosch to Aliexpress, and it's going to work just the same, like you need to, yeah, I hope this answers the question. One more, yeah. We would like to know how it did it work with the sourdough and your baguettes? That's super, everyone asks the question, I never, like I never done the whole thing, like because back COVID, there was no flour, it would have been painful to bake dozens and dozens of baguettes and eat them anyways, and this is more fun to play with just random things like spices or booze, and the sourdough thing probably works, frankly, probably could be done more in a more simple way too, like maybe you just need a alcohol sensor and just measure the peak, and maybe that's it, I don't know. Thanks everyone. Okay, thank you.
Linux CAN upstreaming on MMU-less systems
So let's start it. Ciao everyone and welcome to my talk. I am Dario. I work at Amarulla Solution, a software consultant company mainly working on embedded Linux and Android projects with a focus on open source. And about me I am a contributor for some open source projects like Buildroot, Linux and UBoot and they come from Italy. This talk describes my experience in streaming the big scan driver for the Linux kernel for STM32 platforms and testing the driver as we will see required also applying some patches to the tools used for configuring and accessing the scan interface. Before jumping into the rest in things, let me spend a few words about the origin of this experience. The idea was to create a kernel driver from scratch in order to satisfy some curiosities of mine about the kernel development process. So creating a kernel from scratch is not like a backfixing patch but you have to create documentation, you have to update the device tree, thinking about the design of the driver in addition to its implementation. So it's a lot of stuff and it is easy to get in trouble but it is also a great opportunity to understand a lot of things. So why did I choose the big scan controller? First of all because I gained some experience in streaming patches for the kernel subsystems finding that both the maintainers and the guys are responsive and proactive. Then the big scan controller is present on development boards that are not so much expensive and with the developer boards you can find also the build root configuration. It is a good starting point to start the kernel development. Furthermore you can find also a lot of examples, a lot of code online of how to set up the controller and in the case of the Zephyr project for example you can find a old driver or lady implemented. Then finally you can test the driver without modifying the hardware because you can enable the loopback and silent board that are the test mode of such chips. So let's explore the internals to a static RAM like in the single-can configuration but in dual-can configuration they are shared and only the primary channel has direct access to the SRAM. There are also three test modes. For both of these test modes the transmission is looped back internally to the reception. But in silent mode the node is disconnected from the bus on a transmission. In loopback mode the node is disconnected from the bus on a reception and in combined loopback and silent mode the node is totally disconnected from the bus and this will be the mode I will use for testing the driver. Let's now have a look to the roadmap. You have to modify the build root configuration and create the Linux driver for handling the dual-can setup on the 469 disco board. Then create a new build root configuration and modify the Linux kernel in order to handle the single-can for the 769 disco and in both cases we have to test the driver and upstream the patches. So let's start with the 469 disco board. I started from the disco SD configuration and I enabled the networking and can bus support in the Linux fragment file. When I put the Linux kernel in override in order to be able to create my driver I start with the implementation. The implementation I started from the documentation describing the properties to be added to the device tree. In addition to the common properties that you can find in a driver like a compatible or a reg, I added the SD-can primary property in order to distinguish between master and secondary channel. Then I added the SD-can node as a reference to the shared memory. Then I added the can node and the can node to the platform device tree. I configured the pin mix control about the pins owned by the can controller. And finally I added to the device tree of the board the I enabled in the device tree of the board the can nodes and I disabled the peripherals with pins shared with the can controller in order to avoid conflicts. About the source code I opted for simplicity for the first version of the driver. So I in the driver handles all the free mailbox for transmission but in case of reception handles only 5 for 0. And about the filters I hardwired the assignment of the filters. So I assigned the filter 0 to the primary channel and filter 14 to the secondary channel. And in both cases I configured the filter register in order that all incoming messages were accepted. So in this way disabling these particular features. And for the interrupts I handled all the interrupt except for the interrupt on the 5.1 reception that was related to the 5.1 that is disabled. So now we are on the testing. I split the procedure of testing the driver in two steps. In the first steps I checked the DMS output in order to verify that there were not issues about the driver probing. Then I tried to set up the can interface enabling the loop and listen mode and verify that the transmitted messages were also received. But in this case I realized that the tools I had to use to test the driver didn't compile on platforms without MMU. The point is that the fork system doesn't work on such as Sipson. And the table clearly shows that only the busybox package is able to be compiled on a SAC system. But its IP link command is not able to set up a can interface. So I had to decide whether to patch busybox, whether to patch IP root 2 or add in support for can interface to busybox. I opted for busybox because it was already used in the system and farmed because on a system with limited resources using lightweight packages is preferred. Let's go also on the patches I had to apply. So in this way I updated the build root configuration in order to enable the can interface package. I put in override both the busybox and can interface package. And so I was able to create an IP link can applet to add to busybox and to patch can interface in order so that I didn't compile the program using the fork function. And fortunately the can dump can send application and don't use these functions. So in this way after applying the patches I was able to set up the can interface and verify that the transmitted messages were also received. After the test let's talk about the code review. The code review reached version 10 so many things has been changed and also has been fixed but among them some I think are really interesting like the use of the Cisco node for handling the shared memory, the replacement of the master and slave terms with primary and secondary according to the kernel coding guidelines. The use of the field get field prep macros for accessing the register bitfills in order to standardize the code for accessing the register and so reduce the errors. And finally the use of a rackback function to access shared memory. After the patches were upstreamed I started so with adding the single can handling for the driver. So I create a new configuration inside build root and just like the 469 disco I created a configuration where both the Linux can the root file system and the device tree were stored because were used only for me to test the driver. The changes required for handling the single peripheral configuration were quite minimal but overall without modifying the driver design and more or less these are the changes I had to apply to the driver. About the testing of the driver I used the same procedure I used before so I checked for the DMS output and I set up the free can interfaces and once again I try to verify that the transmitted messages were also received in a proper way. And even in this case with the code review even if it didn't take long brought out some interesting insights. The maintainer's idea was to change the source code as little as possible so he suggested me to use the Cisco node also for the primary channel even if there isn't a shared memory for it. And also to use the STCAN secondary property to distinguish between primary and secondary channel even if the change was not backward compatible with the dual can configuration. But it didn't matter since the driver wasn't yet in a stable version of the kernel. Then have a look at the merge problem we had with this series. Due to a misunderstanding in the application in applying the patches to the main line the order of application for patches A and B was inverted causing a failure in the compilation of the device tree that was fixed by reverting the patch B. So it was a situation not so good for me and also for the maintainers because everybody got really nervous. So there was one last question to ask to the maintainer of the CANSAP system that is what to do with the patches I had to apply to the tools used for testing the driver. And this was the response. So I upstreamed everything including a new further implementation of the IP link command using the library that is a lightweight library. I was quite curious about the use of this library so I ran some tests on it and finally I arrived to a further implementation of the IP link CAN command after the busybox one. So to sum up I upstreamed 12 patches for the Linux kernel. One patch for busybox, one patch for canutes and three patches for the libmnl. And then seven patches for bitroot. All the patches were accepted except for the one of busybox. I resend the patch multiple times but I didn't get any answer from the maintainer. So if you think that it could be useful for busybox to support the setup of a CAN interface please review the patch. Finally, finally for people who are interested I uploaded the buildroot project of my personal GitHub account and these are the commands you can use to build the images to store on the development board and to run tests for accessing the CAN interface. That's all for me. Thank you for your attention. Anybody have any questions? Maybe from buildroot. Hi. Thank you for your presentation. How long did it take to explain to all the patches and development of the drivers? Quite a long. I think it is the work I think of one year more or less. But not full time. Yeah, of course. You have to wait for the response of the maintainers. Thank you. Hi. So thank you again for your work. This is very useful because without MMU less system I haven't seen on Linux running CAN yet and the main problem was the IP link set because without that you CAN is configured as a network device on Linux and without that you cannot set the bit rate. So that was very important. My question is that on the STM only has a CAN controller. The CAN transceiver is always on the outside. So CAN transceiver any PCB changes, any hardware changes, did you ever have to make for these kinds of systems if somebody builds a product that does not have MMU? If this goes into production with Linux running do we need something else from the hardware side or that was the transceiver side nothing has to change? Could you repeat the question because I didn't catch a... So normally if I use a Raspberry Pi or something you have a MCP2151 that has a CAN controller and a CAN transceiver. STM32 only has a CAN controller. And on the board you would need to have external IP, IC, right, that would run as a CAN transceiver. I don't know if the board I have can transceiver because I didn't modify the hardware. I put in Lubeck internally the CAN controller and so I was able to test because I'm not an hardware engineer. I am a software engineer and I am more confident on software than on hardware. So I didn't want to put the hand on the hardware. I bought a transceiver but after finding that I was able to enable the Lubeck... Inasmuch... Okay, okay, thank you. You can basically use the same transceiver with the SD microcontroller so they are compatible. Was that your question? Yeah, so I just wanted to ask that in an MMU and MMU less system the CAN bus, the 120 or nothing else would change in a... Right, that was my question. Yes, the CAN controller usually has a digital interface with RX and TX line and this is connected to the transceiver that makes it into what's going over the wire and the transceiver doesn't care what kind of CAN controller was on the digital side you attached to. Hi, some STM32 have a newer IP for CAN that is named FD-CAN. I am familiar with FD-CAN so I wasn't aware... FD-CAN, FD-CAN. Yeah, and it seems like it's backwards compatible with DX-CAN so it looks like it should be compatible with your work so that's really nice but were you aware about FD-CAN? No? Okay. So I was wondering if it was planned to support the functionalities that are in FD-CAN and not in DX-CAN. But I think the FD-CAN features are not on this type of platform I think. I think that it's on the STM32MP because this is a microcontroller. No, no, I was using FD-CAN on STM32F303. But I suppose they have some weird... I can say something to this also. On the modern STM microcontrollers where you have FD-CAN they are using the M-CAN IP core from WOSH and this is already supported by Linux. Non-MU system. Was you able to test the very latest CAN-Utils and non-MU system with the very latest new ISOBus file system support? No. Okay, we have time for one more question if there is one. Okay, I think we're done. Thank you very much.
Flutter, Buildroot, and you!
All right. I'm not sure why you guys keep letting Americans in here, but thanks for stopping on by and letting me rant about Flutter. So it's an insane project, an insane package, and I have an insane boss who wanted it in Bilbroot, so I did that for him. A little bit about me. My name is Adam Duskett. I started my career completely through dumb luck and nepotism. I had a friend who only knew micro-embedded and I knew Linux, so it worked out. They were developing a sonar with a camera on a TIDM368 processor way back in the day. After that, I moved on to Micron Technology. I developed a kernel driver that substituted their DOS driver for memory testing and reporting purposes. Moved down to Michigan, started on VoIP emergency phones that did not work out well. I started contributing to Bilbroot in 2016. I had to look this up. My first commit was actually just an audit bump. It was like three lines long. But after that, I moved to Los Angeles. I started working on electric vehicle supply equipment. I joined Rivian Automotive as the first embedded Linux engineer for their fast charging network. We ended up using Bilbroot. It's still using Bilbroot to this day. It works great. Then I joined Amorella Solutions as a senior embedded systems developer. They're based out of Italy. Then hopefully in 2024, I'm never coming back to America. That would be quite nice. Some of my contributions, a ton of SC Linux packages. I tend to focus on those. Flutter, of course, and the Flutter packages that we'll talk about shortly. GeoObject introspection that took 17 patch revisions in three years, two trips to Europe to actually get in. OpenGDK, this is a total joke request by a friend. Apparently people still use this package. I don't know why, but it is there if you want to use OpenGDK with Bilbroot in 1,123 commits as of this writing. We assume that you know what Bilbroot is, what Flutter is. You're interested in using Flutter with a Bilbroot project, or you're interested in the next speaker. I'm not sure which. You're here. I appreciate it. How I actually ported this thing was I used Metaflutter, which is actually quite well constructed. But then things quickly took a turn for the worst. Normal industry standard practices just straight up don't apply to this package. Like downloading a reproducible tarball is just straight up not possible. We'll get into that. Requires tools to configure the source code that aren't standard from Google. You can't use your cross-compiled SDK toolchain. Enjoy. It actually includes a pre-compiled and patched LLVM. Release versions. You can find them at Flutter Engine tags. But those aren't for you. No, no, those are for Google. That's way, way too easy. You can't just download that tarball and compile it. That's impossible. No, you need a G-Client Python script from the Depot Tools repository, which is from Chromium. I don't know why they decided this. You need a .gclient file. You run this. And also, of course, the source code depends on .git directories being present to compile. So enjoy, because now you cannot create a reproducible tarball from this. So we needed a host depot tools. It gives us the tools necessary to download the source code, generate a tarball, configure the source. I made a genTarball with the help of the buildroot maintainers. So that will actually create a .gclient file for you, runs G-Client to download the source, generates a tarball in a format buildroot accepts, expects, but it's still not reproducible. So please save it somewhere. Create a hash file for yourself. Yeah, include the patch-claim compiler. I don't know why. I do not know why they did this. I've asked the developers, and I have yet to receive a response. So I'm going to chuck it up to Google. It requires that configure bundles and compiles all the third-party dependencies into a single generated flutter engine. It requires things like open SSL. But no, not your open SSL. Not the one provided by buildroot. That'd be too easy. No, no, it requires the third party, the one in the third-party directory, and then it smashes them all together and creates a giant flutter engine.so file. I'm going to move right along. Do you want to use flutter instead of QT? Well, there's some advantages. It's free. QT requires a professional license if you're going to sell your product. I fought Rivian for quite a while, actually, on this as they use QT, and a billion-dollar company didn't want to fork over 50 grand for a professional license. So I'm sure smaller companies also have a problem with this. It's just straight up licensed under BSD3. You can just use it. You can sell it. You can do whatever you want. There's a plethora of community plugins at plug.flutter.dev. Hotload restart makes debugging applications less time-consuming, in theory, uses Dart instead of QML and C++. Disadvantages. It's huge. It's 14 megs straight for the minimum, and that is just the .so. That's not an embedder. You still can't actually run your flutter application quite yet. OpenGL or Vulkan is necessary. It technically supports software rendering, as it's just a .so, and you need an embedder. But I have yet to find an embedder that officially supports running flutter with software rendering. Only X64, ARMv7, and probably V8, and Arc64 are supported. It says it supports I386, but I have yet to actually get it to build for I386. So at least in build routes, only X64. And it uses Dart instead of QML, C++. Really up to you there. LvGL is the other popular one among embedded platforms, and that supports anything with C in an output. It uses Dart instead of C. Yeah, mini... Oh, sorry, I actually reversed that. Yeah, less... So mini-more plugins, of course, it's much easier to build and publish applications in flutter than LvGL. It is far larger. Again, LvGL starts at 23k. Particularly the exact same thing as Flutter Gallery, except that the fuel line is removed. And it is a generic package right now. We do not have a infrastructure around that. So, and that may change in the future. The best way, I would say, to add your flutter application to a build route project is to create an external tree. So that way you can update build route easier. I have an external tree called Faustem. I also have a project on GitHub that uses Docker, an external tree, so we can actually use reproducible builds for exactly... I don't know. There it is. There we go. Okay, so this is the... I patched 2023, but here's the Faustem external tree. So you have packages, you have patches to build route. These patches are for the 2023 Ranch, but basically they add a bunch of stuff. But the Flutter package updates are the big one, because as I'll get to in a second, Flutter is awful in 2023. Please don't use it. This is a big one. Profiling. This slide took me about three hours on a plane to figure out, because I had never profiled a Flutter application, let alone something that was on a... remotely. So build Flutter Engine in profiling mode. Menu config libraries, graphics Flutter Engine, this is an enable profiling option. It's very straightforward. Make a bootable image. Run the application in profile mode. Here you go. Write that down and take a screenshot, because VM service and VM service port are cryptic and ill documented. There was also an observatory, but apparently that's being deprecated for this. They do the same thing as VM service. If you do not use VM service host equals 0000, or the URL of your host machine, you will not be able to connect, because the default is just local host only. I can show you this here as well. Actually, it would be a good practical demo. I have a VM here. I don't know if it will actually show up splash screen. Maybe it will. Maybe it won't. We might get lucky with a FOS stem logo. We might not. No, but it's booted. It's fine. If we SSH in, I have a FlutterPie profiling demo. That's going to... No, it's running. I should just SH.exit for everybody, so they can see. Yes. This is the exact same thing we have. Then I have a Flutter profile connect. Actually, it's probably in my... Yeah, there we go. This will connect to it. Over here, this is the x86, so it is running Flutter Gallery. We can go in, we can click reply, we can check the inbox, or something we can hit compose. This is fine. We can type. Everything works properly. By now, this should be working. Yes, so you will see a URL right here. I have no idea what I just clicked. Oh, right. Of course. So we'll copy this URL into Chrome. And there we go. So now we can check performance metrics against our VM here. And I've tested this also on a Raspberry Pi 5 and 4. It works great. Nothing's really running right now other than Flutter Gallery. If we keep clicking, eventually more pretty graphs will show up. So there's a CPU profiler, although good luck. This is a lot of ASM. I am not really sure what it does. And of course, memory profiling as well. And networking and logging and whatnot. So it's quite easy to set up a remote debugging on all of this. There we go. Yes, 0 to 0, or the IP address, the remote machine. Yeah, so the current state of Flutter, the 2023.11.x branch is, please, it's bad. Do not use it. Use the patches that I have if you need, or just wait until 2020, 4.02, which is later this month. But the packages, if you really want them, Flutter SDK bin, it's a set of tools used to compile the Flutter applications themselves. Flutter Engine, it's the main Flutter library. And Flutter Pi, that's it. That's what you get. It's the Flutter Embedder used to run the Flutter applications. It does not support Waylon. It's KMS and BRI only. But it does support GL, GLES, and Vulkan, which is quite nice. Oh, and of course, there's Flutter Gallery, which is the demo application. And yes, I have fixed this, but there's a lot of necessary options that have been patched in. In Master, it's a lot better. A lot better. Please use that. All the previous packages with the following additions. We have IVI Home Screen. It's actually developed by Toyota. It works with Waylon. 17 more plugins than Flutter Pi. 22 plugins are currently supported. There's a bunch of bug fixes and improvements. PubCache has moved to the download directory. There's more up-to-date packages, comprehensive build options for Flutter Gallery packages. One of the big things is the Flutter Gallery package is a good starting point as a generic patch, but we can do better. A Flutter infrastructure package would be so much nicer. Yes, so I actually have that patched in as well in one of these. So with this patch right here, this would create a package flutter.mk. Can you increase? Of course. I would love to. There we go. So that way you don't have to copy and paste all of that junk all over the place. I think I stole this from pkgcmake and then started editing it. Much like the Dart registrant files, all of this stuff would just automatically be figured out, including the package name, as the package name has to go on top of the PubStack.yaml file. It's quite easy to yank it out of there automatically. So your Flutter package is going to end up looking like this, which is much, much easier to read than what is currently supported. So that's pretty much it. The stretch goal would probably be the Firebase SDK, although that also seems quite the undertaking for a package. So yeah, any questions? Thank you. Thank you. I answered everybody's questions. So I'm going to have a question. Oh, no, don't let him in. No, no. What did you expect? So you said that Flutter would not run on software rendering. It's wrong. We have a runtime test that uses Mesa 3D with software processing. So it should be doable. It does use the Mason. Okay, it uses the Mason. Was it the OGL 2 software? Yeah, it's the OGL 2 software. So it's a bit of a shame that it's not running on software. So it's a bit of a shame that it's not running on software. So it's a bit of a shame that it's not running on software. So it's a bit of a shame that it's not running on software. So it's a bit of a shame that it's not running on software. So it's a bit of a shame that it's not running on software. So it's a bit of a shame that it's not running on software. So it's a bit of a shame that it's not running on software.
Google Home, But Better: Building our own Smart Home Display with Flutter
So, welcome for the second time. Thanks for staying this long with me last talk today, named Google Home, but better. Starting really good. Just a second. So, even though it's a short talk, just really quick, a little agenda for today, what you can expect. A bulk section, really brief about me, why am I talking about this, why should you listen to me talking about Flutter. The hardware used in this project, of course, that's one of the interesting parts, but no really big surprises there. It's just what you would all expect. Then we get to the software part one, the embedded Flutter part, and part two, the implementation. And I think this is for most of you, this will be the most interesting parts of this talk. So, first about me. Hi there, I'm Moritz. Yeah, a few years ago, when I was 15, 16, I started out with embedded development. Back then it was all hobbies. I started out with an 8051 Derivate. I think it was an Infinion XC878. I started developing in C. Back then I wanted to mainly build everything around music, high-fee, loudspeakers, equalizers, digital sound processors, and so on. Following through college, I worked as a why we created Snap Embedded. That's what we're doing there. Also, co-organizing the Flutter Munich meetup. So if you ever want to come over or speak in Munich about Flutter, just feel free to hit me up. So, I left Embedded. Now I'm back at Embedded. Why? And this is maybe really short clip showcasing why I'm back at Embedded user interfaces, because this is still stuff we get today in new projects. And it's still sometimes you get a new coffee machine state of the art with a touchscreen and you use the touchscreen and you're like, oh no, God, why did you build this? So, yeah, I don't want to build any more of those things. I run, I want to build the UI like of the one today's talk is about. I think this, I hope this looks a little better than the things you saw before. That's the user interface of the Google Home replica we built or I built that I normally wanted to present here, but sadly there was, it would have been hard to set it up in five minutes, get it here on the table, so we'll rather stick with the presentation. Also, it would have been unfair for all the people online. But nevertheless, I have picture of everything. We're going through that now. So, the hardware. Yeah, as I said, not much more as you would imagine, Raspberry Pi 4. It's still model 4B, 4GB of RAM, that's enough. 2GB of RAM with a desktop environment and Flutter, yeah, wouldn't recommend that on a Raspberry Pi. Of course, the Raspberry Pi 5 would work. It would just be more expensive and it would run just as good. A little thing we have in here, what deals with the but better part with the Google Home or Smart Home devices in general, we can't add whatever we want hardware and as we will not be adding a voice command service on this device, I thought about what would be cooler. Voice commands would are already out there and what do we need to see? What is the most interesting thing and that's for a lot of people, I guess, interacting with custom hardware. Therefore, we integrated an air sensor there, the Pimoroni SCD41 measures CO2 temperature and humidity, connects to the Raspberry Pi with I2C and it comes that is also very handy with a ready Python library that's known to be working with the Raspberry Pi. The touchscreen is just some WaveShare 11 inch IPS panel, capacitive touch, USB, HDMI, really nothing too special. Those touch screens just got really good in the last years using them. Yeah, at least with the Raspberry Pi OS or so, just works out of the box, it's fine, nothing to worry anymore about. Then for the last part, yeah, with Smart Home, what most people think about is turning light bulbs, plugs on and off and for Smart Home projects or whenever you want to do projects on your own, devices that come in really, really handy are those Shelly bulbs and Shelly plugs because they just come in with a built-in web server and you just have a REST API, connect them through your Wi-Fi, they come with an app, super easy and yeah, you have a REST API where you just can interact and it on, off, change the colors, it couldn't get much easier. So, all together without a whole bunch of cables, that then looks like this. So, now that we have the hardware part together, now comes the interesting, the next part, the embedded flutter part and as the talk earlier already pointed out, there's not just flutter to run on embedded devices. If you Google it, if you want to start out with it, you will find a few repositories all dealing with flutter and embedded devices. We just saw one, in fact in the last talk, it was using flutter Pi, so what's with that? Why are there different options? Is this not flutter or well, it is flutter, but to understand this, we may have to, yeah, next slide, we may have to look at the Linux embedded that flutter uses. The main difference, those custom embedders have, the custom embedder connects or the embedders connect the flutter engine with the targeted platform and the main difference we have with those custom embedders, which I have, let's see if this works, here on the right side, fancy, I wasn't prepared for that. So, the main thing you can see is here, something's missing. Flutter for Linux just heavily depends on GTK and GTK, in fact GTK 2, which is getting a pain right now for flutter itself. So, what most of those, or what all of those libraries have in common, we don't really need those GTK parts that flutter uses anyway in embedded hardware. We don't have tabs, we normally don't have windows, we don't need all of that stuff, so they just get rid of it, and which sadly isn't that easy in the Linux, in the, let's call it vanilla, flutter, embedded, but they get rid of it, so you can use flutter on custom hardware without GTK and GTK, and that means you can use flutter, for example, with Wayland, with a custom embedder, as the talk before already pointed out, which is not possible right now with the, let's call it flutter on embedded projects, especially if you want to go in a really industrial style, but we're getting there. Also, a big part that's missing right now is tutorials on how tools are still, there's not so many out there, just Google it, it's, yeah, there's not much out there, but I'm sure we will get through this within this year, or at least maybe the next year, and then flutter will also definitely become available to startups, to smaller, medium-sized companies, there will be tools, software as a services around that, and flutter will get more mature, I think we don't know it, but I guess that flutter will get more mature in the embedded world in the next one to two years. But, so if we want to do a project right now, where we just want to try out how flutter on embedded devices works, at least for this project, when we use a Raspberry Pi, we have Raspberry Pi OS, we can just use flutter as it is, we can build for Linux there, it will work just fine. The newest Raspberry Pi OS changed to, I think it changed to using Wayland, I haven't tried it yet, but apparently it works alright. Flutter needs to do something about GTK2 anyway, so maybe it will be possible with the just normal flutter to build something suitable for Wayland and direct rendering as well in the future. For right now, if you're doing a hobby project, if you just want to try something out with a Raspberry Pi, just go with flutter as it is, it's fine. If you want to go with, if you want to use direct rendering, if you want to go with Wayland, if you want to get something into production grade, then you have to look at flutter Pi, Toyota's, IVI Home Screen, or the one from Sony, whereas the Toyota thing really is amazing and is moving forward at a really fast pace. So enough to this generic talk about flutter, what about the implementation for this project right now? I want to go through it in a few steps and yeah, the first part or the first part that we need to integrate for this project to work is connect to Raspberry Pi to the touchscreen. What do we do for that? We use the Raspberry Pi Installer, Installer Raspberry Pi OS, it just works out of the box. Thanks to a lot of guys that are also here. That's really, really easy. Then we need to get flutter running. For that, we wrote a tool, I just said it with Snap Embedded, we're doing open source projects around that. We basically built a tool in the end, there's a repo with the link called Snap CLI which allows you to, from your host machine, set up a Raspberry Pi that's connected in the same network as you are. It'll connect over SSH, it will install flutter, all the stuff you need, and it will set it up as a custom debug device so that you can just run the code and debug out of VS code on Linux, Mac, Windows, and the code will compile and everything will run in real time with hot reload working with the Dart tools on your Raspberry Pi. If you want to just develop on Raspberry Pi, that's already really easy and straightforward. Even the Dart DevTools work, all of that is already there. Just, yeah, no cross compilation, we don't want to get in that direction yet. The next part is, yeah, it's rather uninteresting. Here you can see a little bit of Dart, that code won't run, I cut out everything that looked ugly. So that's just basically a get request. You connect the bulb and the plugs with your flutter or Dart application, run this function to get the bulb status, set the bulb status, or to set the bulb color. The more interesting part, I guess, and what I wanted to point out, which will also explain how you would integrate a voice assistant to with a flutter application on the Raspberry Pi, is how do we connect this sensor that's connected to the I2C bus with our flutter application. We would have a different approach, or we do have different approaches that we could use here. We could do a Dart implementation of everything directly to the I2C bus. We could go through the data sheets of the sensor, implement everything by ourselves, all the commands do it all by ourselves. We could run up an MQTT prokure on the Raspberry Pi. We could then connect the sensor to this on the prokure, subscribe the flutter application on the MQTT prokure, because MQTT is one of those plugins that work with most of the custom embedders, so that really works out of the box. So that would be possible to take. We could, of course, here, I use Python, but we could use a Python backend, just make another REST API on this device and talk to it locally, I think, in a lot of embedded projects. It's done that way. Or we use Dboss. We have the Dboss running on Raspberry Pi OS. We have the Dboss running on most Linux systems, and we can just clip on the session bus for this purpose. The plugins are also already there. And for this example, this is what we did, because connecting Flutter application with whatever else process is running on the machine, you can just use Dboss. We can just use the Python example library that was already shipped with the sensor, of course. I mean, we don't want to do work twice. So we can connect whatever we want right now with packages plugins that are already available. Resources, thank you very much. Two minutes.
How do you write an emulator anyway ?
Thank you. Well, welcome for this session and congratulations on waking up so early after yesterday evening. It's always hard on Sunday morning to watch them. And thank you for those who are watching online. So who am I? My name is Anis, as Mahmoud said. You can follow me on social media, find my blog here. I'm writing this gamegear emulator called Gears. This is not the subject of this talk, but maybe I'll tell you a bit more about the gamegear hardware so you can see how that helps writing an emulator. I'm not an emulation expert. I know there are a few here which are very well versed, but I'm hoping that helps gives another perspective. I also gave a presentation on the Z80 which was pre-recorded in the emulator dev room two years ago. You can watch the talk. And yesterday on WebAssembly, putting this emulator to the web browser in the rest dev room, we can also watch the recording when it's online. So this is a small demo, what you can see here. This is the emulator that's running in a native window. Yeah, nothing very specific. So first of all, I'll tell you why I'm giving you this presentation. But before that, has anyone here ever written an emulator before? Okay. Oh, that's quite interesting. Who here knows how to program, how to write code? Oh, nice. That's good because that's not the goal of this talk. It's not to teaching you how to code, right? You know how to program. I'm hoping with those skills you'll be able to give you a few pointers to how to start, like where to find documentation, things like that. The goal of this talk is not to be exhaustive, otherwise it will be a full university course over a semester or something. And I want to also tell you why you should write an emulator. That's something that should come from you. And yes, so the focus of this talk would be on simpler platform because it's always easier to start with something a bit simpler. Yeah. So what is an emulator first? So a few definitions. It's something I struggled a bit because they come in many shapes, but in general it's a software, I would say, a software program that is used to run software from another computer or another platform, whatever. To give a few examples, here I showed a few screenshots of existing emulators. You have Gameboy, an gameboy emulator named Semboy. You have another BGB. Some support weird devices like printer. I showed an emulator running on the Android platform for the BBC Micro. There's also an Android emulator itself. So you want to emulate the computer that runs an Android OS. And also put something in here which might be debatable, which is an analog pocket, which is an emulator using FPGA. So you write software defined hardware and use real-time thridges to run software from other platforms. An emulator can have a huge spectrum of, let's say, accuracy and emulation. What does it emulate? Accuracy is how faithful you will be to the original. When you're emulating something, will it be running like one software? If that's your goal, that's all right. You just emulate enough of the platform to run one game with burning all the available software. Or maybe you want to do even more and be able to run any software that's on the target platform identically as if it was running on real hardware. We call it clock accurate, but there are us even in this spectrum. Before we continue, I wanted to show you a crazy example I found a few weeks ago, a few months ago of an emulator. It's a Linux emulator. It's a RISC-5 emulator of running Linux written in Scratch. It's a Scratch programming language. So we can't really see anything here on the screen. So I'll describe. You have a Linux terminal on Game Console. It has already booted. I wrote some commands. And here I'm scrolling, and you can see the Scratch code of the RISC-5 core. So yeah, emulators comes in all shapes and colors. So you want to write an emulator. Let's go with the first level. What will it be? Starting. How do you start? So if you want to start, the first thing you have to do is to pick a target. And by target is the platform you're going, what I mean is the platform you want to emulate. You have to pick this target. You have to pick a host platform to start somewhere. Even if your goal is to write something that's portable and running on everything, you have to again start somewhere. So you pick a host platform and make sure you have a bit of time. If you want something that's complete, emulators are something that's hard to decide when it's complete. You can always have more features, more things. So you don't have to have a lot of time in a short period, but maybe on a longer period it works as well. For example, for my emulator, I started two years ago. I've been working on and off, so it's not something that's taking a lot of time every day. That's what I mean. Where to start? Okay. Start simple. So with the CPU, you pick one CPU instruction. You write some code that will be able to disassemble it, which means you will take the binary form of this instruction, this one instruction. It will be a few bytes, one byte, I don't know, depends on your platform. Can your code recognize this one instruction? It might seem trivial to you. Is this a few bytes? It's how it starts, basically. So you start with this, and then you start adding stuff on top of it. You have your disassembler. It's very useful to debug. Then you add something, which is execution. So you have the CPU. How do you model its state? Okay. What's inside the CPU? Go look for more information. What's the CPU? Build this state, change it, which is basically what an executing an instruction does, and verify it has the state change as you expected. So if you want to add something to a variable, you do an add operation in your language. And then as you, let's say it's a good starter, and as you go, you keep learning new CPU concepts and how a CPU works. So yeah, this is helpful for starting. So a CPU is a processor. It's a half of, usually it's considered a half of consoles. Nowadays it might be a GPU, but as I said, this is focused on 8-bit platforms. As I told you, it has states. A CPU is a processor. So this state is basically what we call registers. It has all the kind of states, but the stuff with registers. I told you about an instruction. An instruction is a minimum operation that a CPU can do. It has assembly visualization, a text. You probably have heard of assembly programming language. That's how you visualize for human instructions. And it has a binary version and encoding. And this binary is, yeah, it's bytes you have to recognize. It has other concepts that are interesting, interrupts. These CPUs usually they can do execute instructions sequentially, and they can also be interrupted. So when they have an event from the outside world, you can change the way it's executing code. Also interesting is how do you access memory? I told you about states. Usually as a programmer, when you write in code, you think about variables and things like that. And this hides that on a hardware. State can be on registers or in memory. And the way a CPU accesses memory is also quite interesting. But the goal is not to teach you those concepts. It's to give you pointers on how to learn. So we've learned about how we start. Let's talk about how do we structure an emulator? So you've been writing a bit of CPU code. How do you structure the whole emulator? Because the CPU does not make a complete thing. I'm giving you here an example of the emulator structure. It's schematic by Rodrigo Copetti, which has been doing a very nice introductory documentation on hardware platforms, various hardware platforms. And here I took the master system one. You can see here, as I told you, that the CPU has the central part. So it's the square in the middle where it's written as Xilog ZAT. Then you have other devices that are interesting. I told you about the memory. Here on this platform, you have two kinds of memory. There's ROM and there's RAM. You have IO control, which is how you plug a joystick or the time it was more controllers. So this is plugged on this platform on an IO controller, and this is connected to the CPU. You have the game cartridges. So those are specific type of memory with things like paging in order to access more memory than the CPU can access. It has also a way to generate sound. Small device, a chip called PSG. This device is from Texas Instruments. It generates a very simple sound. It has a video display processor, which would be the ancestors of today's GPUs. And other things like something to... Okay, it's a video display processor. It's a bit specific here, but it has access to its own video RAM, which is a concept that you have to think through if you want to emulate this platform. And the video encoder is used for TV output. So it depends again on the platform. This is nothing very special. Many platforms of the time had very similar architectures. This is interesting because as you want to structure your code, your emulator code, you will want to... We probably will want to follow this structure. You want to take those devices and maybe organize your code... Take them as a code boundary and organize your code in modules. I don't know whatever your language has, functions, objects, classes, namespaces, whatever is on your programming language. It's an interesting code boundary to know, okay, this device could be emulated like this. There's another device like that. Another trick I'd like to share is when you're writing an emulator, you don't have to think about optimization too much, but you're allowed to optimize a bit. Only, for example, you're writing a CPU. It's a very simple thing. You probably want it to be fast. It's something that will have to be very fast. You might want to, for example, not do allocations on the emulation path. If you know what memory allocation is, it means that you want... It's something that can be quite costly. It's very useful, but when you're emulating, it's not something you want to do every other instruction or every instruction. You might want to use drum tables. This one, on the case, is debatable. Depending on your language, it might be automatic. Quite a common advice we see when advising people to write an emulator is that you should write a vertical slice. What does that mean? That means you have all those things. You know, I told you about the CPU, there's the video display processors, the audio. If you go on to writing an emulator, you probably want to see results quickly. That means that you will write support for a few CPU instructions and then a bit of display code so that very quickly you'll be able to have feedback and see on the screen that's showing something like the Nintendo logo on Game Boy or Sega or whatever. You can do that. That's not what I did. Do what works best for you. For example, I gave you a talk two years ago on the ZAT. It was a pre-recorded talk and at the time I had nothing else but a CPU. It depends on what you want to do. Do not hesitate if you have any questions. No, maybe we'll take them at the end because they're recorded. Sorry about that. Another trick is that I told you a bit here about the disassembler before. You should disassemble and write the text versions, assembly versions of instructions. It will be very useful to have a debugger. You might want to build debugging tooling to debug what's happening inside your emulator early because you will have bugs. You will have emulation bugs. Build this tooling early. Or you can use already existing tooling. Here you have the emulation. It's a great one. It's not open source unfortunately but you should definitely check it out. It's a multi-platform emulator. I think we have the developer here in the room. Definitely use the malicious. I can't tell you how many platforms it emulates because I don't remember but it includes the Game Gear and Master System. It has great debugging toolkit. You can see assembly. You can see video devices. You can see many things. Always make sure you have debugging in whatever form works for you if it's tracing or logging. It's nice but be able to inspect the state that's happening on the target emulated machine. I told you about all of this but that's quite interesting. How does one find information on where to start? That's a very common question I've had. Where do you find documentation on emulation, on how the hardware works? Well, basically you look online. There are many different communities. If you want to emulate a Game Boy you probably want to go to GB Dev. It has information on how to write Game Boy software but also how the hardware works. In fact, I like reading documentation aimed at developers of the platform instead of emulators developers because it tells you how you're supposed to develop for this platform. It also means you'll be able to understand how to emulate it this way. There's also the Geq's complete technical reference. It's considered a definitive guide on the Game Boy. If you want to emulate a Sega platform you probably want to go to SMS Power. It's a community around the Sega Master System and other devices like the Game Gears, Sega Mark 1, Mark 3, SG-1000 and the most recently announced the Sega AI which was a computer from the 80s. I'm sorry I don't have a screenshot here but that was very interesting. It was an AI computer that Sega released in 1986. I invite you to look for it online. So here, SMS Power has documentation on how to develop software for the Sega Master System, the Game Gear, it has documentation on how the Z80 works. It has many links to other documentation on video, audio, etc. One of the guides I used when writing my newsletter, there are three main ones. The hardware reference manual for the Sega Game Gear console. Some people in the community took this developer manual that Sega wrote for game developers and they scanned it and OCRed it and made a great PDF version. I don't know who did that but it was invaluable as a preservation effort and also used that for developing my emulator. It has some things that, well, a small caveat here is when you're describing stuff for developers, you might not go into the details of how the hardware works and sometimes you'll have edge cases that won't be explained to other developers but you need to emulate properly if you want the emulation to be correct. So yeah, in general it was very useful. The CPU of the Master System as a Game Gear is fully documented by Xilog. It has very complete manuals, the company still exists. I suppose, for example, to the Game Boy where all the documentation is unofficial. The Z80 CPU is well documented and even then there are other tricks and things that are not documented in the official manual. It's kind of part of the talk I gave two years ago. You probably want to go read this undocumented Z80 documented and then afterwards watch my talk for things that are not in this document. So finding documentation on a very simple trick, you want to emulate something. When you do research, use technical terms. It might seem trivial like this but even I fell into this trap many times. You're looking for something, how to get the more accurate thing. An example is look online or what exact chip sets your target platform is using. For example, for the audio, it's a chip from Texas Instruments and instead of searching for how to do audio for X console, X computer, use more precise keywords. It gives better results and I'm showing you here, on the left here, I Google for Game Gear sound and you have audio videos, YouTube videos and things like that but nothing very specific and on the right you almost find just SMS power and that's it. So basically the link I gave you. So let's get a bit more into practically what that means. What would be, yeah, practically how does that work, how do devices work. What I'm showing you here is an extract of the ZAT manual and how to do device IO. It's very complex and you don't need to understand it. It's basically electronics but it hides the fact that back then, using devices was quite simple. It would be almost as simple as writing to a memory address and that's how you interact with the device. Now on the ZAT CPU, there were dedicated instructions to do that but it was then quite simple as opposed to modern platform where you have GPUs, memory mapping, DMA, whatever is in a modern platform, it used to be much simpler and that's something you can use when writing an emulator. So in practice you want to write an emulator for a host platform. Make sure you understand your host platform first. You want to write an emulator for windows. Make sure you understand how to display pixel buffer on windows. So you want, you know how to open a window, you know how to, I don't know, allocate a memory area where you can write pixels, what is the pixel format, can you display something, a small image, make an image, can you change it multiple times per second. So yeah, make sure you understand your host platform and yeah, it's the same for audio. Let's, you want to start emulating sound. Make sure you know how to do, play audio on your platform. You have a buffer, can you generate, I don't know, a sine wave or square wave to make a beep. Nothing about this is emulator specific but it's really something you have to do when you want to do interactive development or game development more specifically. So you want to, let's start with the graphics emulation, okay. This is something you will need hardware understanding. You will need to understand how the VDP works, for example, on the game share, how the PPU works on the Game Boy. And so you will need to read the documents I pointed earlier. I'm giving you an example here for the VDP, a few concepts that are interesting where you want to, when you want to display pixels, you need to understand how developers were interacting with the device and how they, for example, they accessed the video RAM, how did they use the registers of the VDP. So, conceptually, I told you it's very simple. You use specific instructions that are basically sending bytes one by one from the CPU to the VDP. So that's how you send commands to it. So you write your registers, you write to VRAM, so that's IO. Internally, it has a display area. Here, this is an extract of the game share documentation where the LCD display area, the small part of the screen, will be part of a bigger buffer and then it's like a viewport and it can scroll on it. It has an infinite scrolling. The top and the bottom are connected. The left and the right are connected, so it means that it's like a torus, mathematically, the donut shape. Other VDP interesting concepts, you have the sprites. I told you about the background. On the background, you display sprites. Sprites are often used for game characters. And they're very interesting because basically the VDP was like a sprite accelerator because at the time, if you wanted to display things very fast, it was not simple and the VDP helped with that. The sprite also helped to do collision detection and things like that. But you will need to understand how color encoding works, how sprite pixels are encoded because it's not really a simple square buffer or whatever. So everything has a specific encoding. It's well documented. Here, what I'm showing you is a tile map. So it's a dump of the video RAM of the Sonic 1 game share game. And this tile map has sprites on the bottom and background on the top. It's not exactly the same as the display screen, but it shows how things are represented in memory and then they can be mapped to the LCD display. I won't go into details about that, but you probably want to have a synchronization strategy between your CPU and your devices. If you want to synchronize the VDP, for example, it's something that's easier to do line by line. So you emulate a given number of instructions and then you emulate one line of the VDP. So it allows doing this emulator single threaded because it's easier to think that way. And it's an available thing strategy and one that can give accurate enough emulation. Sound emulation. Sound emulation is quite interesting. Again, it is hardware understanding, so you will need to read the documentation. I'll give you an example with the PSG. So you write registers. It has less registers. It's much simpler. It's a device that's conceptually quite simple. It has four channels. Three are tone generators. Basically, the generate beeps with a given frequency. And one is a noise generator and it generates noise, basically. So you have multiple things that are interesting. The tones are shown here on the top right. There are square waves, at least in theory, because when you interact with hardware, life is analog, and it's not perfectly square, so it might look a bit more like the wave on the bottom, just below it. And what I'm showing here is a noise generator. It's a very simple hardware device called Line Feedback Shift Register, or LFSR. And it's used to generate noise by basically shifting a set of bits, shifting them right or left. Well, it's the same, but here it's right. You start with one bit and you shift them, and then you output the bit that's on the right. But if you were to do that without feedback, it would just shift the one, and then you just put zero, and yeah, it's done. Except this here has XOR function. So we'll take two bits here, XOR them, and put them back as inputs. And with this input, you're able to generate the random-looking noise. It's not perfectly random, it's not cryptographically random, but it's good enough, and yeah, that's how it used to work. For sound emulation, again, you have to start simple. You want to generate a square wave, as I told you. It's a very good hello world for your platform, for sound. But then you will need to add more things. On the PSG, it varies on other platforms, on the master system and the megadra, as the game is here. You need to think about the tone channels having counters, and not the frequencies. You need to think in terms of period and not frequency. It's almost the same except when you're emulating, you will have edge cases that won't work well if you think in terms of frequency. A quick advice about thinking. You can have multiple ways to think audio emulations with the CPU. My advice would be to use CPU cycles. So when you're emulating instructions, you will need to count the cycles. Depending on the platform, one instruction can be a various number of cycles, from, I don't know, from four cycles to 20 on the ZAT. So you will need to count them accurately enough so that when you're playing audio, it won't be distorted. And in general, it's useful to count cycles properly even for display. I wanted to give you an example about playing samples, but very quickly. They also use a square wave, but it's quite similar. They use amplitude variations. So they play a wave that's always up. So if you were to play it, it would be silent. But they make the volume vary and they make it very, very, very fast, like 7,000 times per second. And it generates an audio signal and that's how you have samples. Samples are, when you hear, for example, Sega sound, something like you would play an audio file today. They didn't support, well, this platform did not support playing a random audio file, so developers had to get creative. Testing, how does one test an emulator? There are various strategies for that. For example, for the ZAT CPU, there are unit tests you can use from other emulators. For example, the Fuse test suite has very good unit tests that are not dependent on the Fuse emulator. You can also use integrated tests. For example, ZXOL and ZAT tests, these are programs that were written for the ZX Spectrum, which was a computer from the 80s. It has, they were generating instructions, lots of instructions, executing them, and then dumping the CPU state and making a very small checksum. And they run that on the real hardware and registered what was the checksum for each instruction set. So these are bit long, ZAT tests that can take a few seconds up to minutes. On real hardware, it was much longer, of course, and they are very useful. Go test, go test with, even if it's another platform, you can reuse the CPU tests very simply by doing a few bytes modification and it works on your platform. How to test audio? Well, this one I was not sure because I'm not so sure how to well test audio emulation. Listen to the music. Does it look like the original one? Yeah, you need a good ear for that. You can use fast Fourier transforms as well, which are mathematical operations used to analyze an audio signal. For example, you generate the square wave. Does it have the correct frequencies? If you go through the emulation path. And then can you hear the samples? I told you about the playing samples. These are, I'd say, the hardest part of the audio emulation because it plugs into many accuracy things. So yeah, can you hear them? Other examples here for the games here. This is DGTest SMS test suite. These are software developed by emulator developers for the platforms that will test various features. Here for the games here is the Sega Master System. For the Game Boy, you probably want to look at DMG AC test too, for example. The Game Boy is a platform that's very well emulated. It's a good choice to start. It has many tests. Blogs, test suite, MoonEye test suite, many accuracy tests. Yeah, you should look into it. Another testing strategy is frame generation. So you're emulating stuff. You're generating a display, pixel buffers. You can very easily dump your buffers into an image and compare this image with a good emulator. And you also can compare it with real hardware. For example, if you use flashcards and you don't have all the original games, it can be useful. In general, I would say test a lot of different software and look how it works. For example, here you can see on the left side, this is my test directory for a few games where I'm basically using that as a regression test suite. Does it always work? So some images have a story like I had bugs that I had to fix. And when it finally worked, I recorded that to make sure it kept working. On the right, what you have is the same boy auto frame generation. It's captured a very small part of a web page where they test all the Game Boy games, a Game Boy color, and other, and they make screenshots. It's very interesting. Other communities who are interesting in testing are the speedrun communities. I'll let you look into that. They also do frame testing, but they record the frames on real hardware. So summary of everything we said here. It was a bit fast. I'm sorry. Pick platforms, host platform, a target platform, something you want to emulate. First of all, always do something simple first and then make it grow. Read a lot of documents. That's probably something that's part of the emulator development. So a lot of documentation. Testing, because depending on how accurate you want to do, you probably want to test your software properly. And don't forget if you ever go and write an emulator to write blog posts about it. So people know about it and come here to first them in the emulator dev room and make a talk. Please. Thank you. Any questions? Testing, testing. Shall I do the question round? We have a bunch of questions. So I'm just going to run around and... Thanks for the talk. It was really good. Two small questions. Approximately how long did you spend on your first emulator? It was like a few weeks, a few months before you got something running. Do you have any recommendations from your experience? I know you did some stuff in Rust. Which languages did you use other languages? How was your experience with Rust? Is it good for emulators? Does it make things harder? Rust is very good. I don't use it for anything. So that's... To be honest, it's a hobby project. We have a question. It's a hobby project that you didn't measure how much time it took to develop everything. Before I had real feedback. You asked me before I had real feedback how long it took. Part of my strategy was different with what I usually recommended. I developed the CPU first. I wrote a talk about it. The feedback here was does the test fit pass? I used different tests and do they pass? That's how you get early results without having a complete emulator. About the programming language. I intentionally did not go into details here in this talk because I want people to be able to write in whatever language they feel comfortable. Rust is great. Go is great. Use whatever language you want. Especially if you're emulating an 8-bit platform. You don't really need to care too much about performance. You should be able to get very good results with whatever language you use. Unless maybe it's... I don't even have a good example. So yeah. Next question. Thanks for the talk. Regarding the audio emulation, would it be possible to just record the waveform and compare them? It can but... I didn't want to go too much into details for that. An example. On the Game Gear and Master System, the sound chip is generating audio at about 115 kHz. On modern platforms, you will run at 44 kHz, 48 kHz. You can have more on most laptops, but it's not what's usually used. So what you will need is a down-sampling strategy. So you will need to take the samples and down-sample them at your host platform sample rate. And this will generate artifacts. Yeah, okay. Good. Thanks. Yeah. Next question regarding the audio stuff. Do you know if there's any ongoing work to emulate the original sounds of the Game Boy or Game Gear because they have built-in speakers which will compress the sound and... Yeah, it will have a specific sound which you can't hear these days. Do you know if there's anything like that? Like you could record the compression and all? I think there is. I think I found a website, I don't remember, on audio emulation specifically. They were developing automatic filters to match the platforms as close as possible using machine learning and with the target of you putting back into DSPs or FPGAs. But I can't remember the name, I'm sorry. But that's something I'm very interested in, especially because audio emulation is not that simple. You need filtering. If you want to go closer to the Game Gear, for example, which has speakers, you will probably need filtering because it has a frequency response which will not be the same as your modern speakers frequency response. So yeah. I'm sorry. I'm sorry. I'm sorry. More questions? Yeah, I almost forgot about you, I'm sorry. So you mentioned tooling and debugging. What? Sorry. Tooling and debugging of the emulator. And you said there were two options like you can write your own tooling and debugging so that you can inspect the state of your emulator. But you also said you can use external existing debuggers. How do they help? How do they help for your emulator specifically? They will help you understand if you have a bug with a given game. They will help you with multiple things. They will help you understand how the game works. So you have a better view of how the game works, how the software is working. And they'll help you understand what you're doing wrong. So you're emulating the game, you have your own logging, your own debug tooling. They'll help you understand what you're doing wrong in your emulation. So it's more for a comparison type of thing. Alright, we have time for a short question. And also can you put your contacts to your first slide I think so people can find you after the talk? Oh yeah. I think that's a good idea. It was a short question, right? Or turn it into a short question. Are there any worthwhile platforms left to emulate that are also approachable? I would say yeah. I gave the example of the Game Boy, it was very specific because it has so much documentation. So yeah, it's a good platform to start. Is there anything left? It's okay if it has a lot of emulators. If you want a platform that no one has written an emulator for it will be harder because you have to discover all this information by yourself. So if you want an easy thing to start, it's not the same as exploring new stuff and reverse engineering and things like that. Well, thank you very much. Thank you, Anis.
Panda3DS: Climbing the tree of 3DS emulation
I'll scare me like that. Alright, without further ado, I want to introduce you to Paris. Paris is here. He was supposed to give the talk together with George, but George could not make it. Paris is going to tell us about the 3DS, but I know that he also has knowledge of the Nintendo 64, so make sure to ask him any questions about that as well. Did I forget anything, Paris? No, that's all I think. Alright, then please take it away. Alright, hello everyone. Welcome to our talk on the Panda 3DS. We'll cover a variety of topics, but first I'll introduce the three names on this list. First is George. He is the creator of Panda 3DS and this presentation. Next, me, Paris, I'm a contributor in Panda 3DS, and we also have David, who helped a lot with making the presentation really pretty. You don't want to see the original version. For questions, you can ask either after the talk or on the Panda 3DS Discord server. And lastly, you can send an email to these two to punish them for not showing up here. So, what is Panda 3DS? Panda 3DS is a Nintendo 3DS emulator for Windows, Mac OS, Linux and Android. Just like all emulator projects, we have several goals and aspirations, such as providing users with a pleasant experience, creating a portable code base, exploring new possibilities, researching the 3DS, expanding the RedPandicold, aiding homebrew developers in their 3DS software, and fun, usually. Here's a peek at Panda 3DS. We have three frontends, the SDL one, the QT one, and we also have an Android version. Let's take a look at what we're going to be discussing today. So, three topics. First, we're going to go over the hardware architecture of the 3DS. Next, we're going to talk about the 3DS software stack, that means the Horizon operating system and the user land. Finally, we're going to go over emulating the 3DS. We're going to talk about the levels of emulation, points of interest for new developers of the 3DS, etc. So, this is the first glance at the 3DS hardware. Most of the action happens in this big chip named the CTR, which is where Citrus name comes from. It has three CPUs. So, one is this ARM 11 right here, and this is what runs all 3DS code. It also has this ARM 9, which is the same model that was used in the DS and the DSi. And next is this ARM 7 CPU, which is the same in the DS and DSi, and also in the Game Boy Advance. We're going to see why it has three CPUs in a second. But first, there's also the DSP, the digital signal processor for audio, and then there is the Pica 200 GPU, which uploads to two displays right there. One notable thing about the top display is that, as you can see, there's two frames here, and that is because it would generate two frames, one for each eye, to sort of create this fake 3D effect. It has some other miscellaneous hardware. It has 128 megabytes of RAM in the original version, and the new one has 256, and some other miscellaneous stuff. We're going to go over all of that. But first, the reason that it had three CPUs is because, actually, the 3DS can run DS and GBA games natively. And a fun fact about this is that many people didn't know that the 3DS can run GBA games natively, because only those who were part of the Nintendo's ambassador program could use this feature officially. And that program is something that Nintendo launched for people that bought the 3DS at the original price of $250 before it dropped to $170. Nowadays, there's an open source interface for running GBA games called OpenAGBFirm. Alright, so inside the system on chip first, there is the ARM 11, and in the original 3DS, it is composed of two cores running at 268 MHz. That first core is not supposed to look like that, but let's ignore it. In the new 3DS, there is four cores running at 804 MHz. And we're going to see exactly what each core does, but first let's look inside one of these cores. Okay, of course, my clicking doesn't work, so I'm going to manually go to a different slide. This was supposed to be a cool transition, but it didn't work. Alright, inside the ARM 11 core, there is the ARM V6 architecture, and it is composed of three instruction sets. There is the ARM, the 32-bit main instruction set, that is what most game code is. There is the thumb, this is a reduced 16-bit version of the ARM instruction set for improved code density. Some operating system code uses thumb, and then there is Giselle, which is a Java byte code runner in hardware, essentially. Games don't use this, some homebrew uses this. There is also a vector floating point unit. There is media instructions for video and audio decoding, say your cut scenes, stuff like that. There is an MMU for running multi-tasking operating systems, and I guess more stuff. And there is a branch predictor out of order completion. Alright, now I will transition back to my normal slide. Alright, let's pretend everything worked. There is the ARM 9, which is the same as we said that was in the DS, and it has the ARM V5 architecture. In 3DS mode, it's actually used, however, to manage storage and cryptography hardware, so it's not completely unusable in the 3DS. It runs part of the operating system, we're going to look into that. And then there is the ARM 7, which is for DS and GBA backwards compatibility, and it is completely disabled in 3DS mode. How do the cores communicate with each other? Well, in the ARM 11, they use shared memory and interrupts to communicate with each other, and a bus. The ARM 11 with the ARM 9 communicates with this thing called the PXI-FIFO. PXI stands for a processor exchange interface. I mean, there's the bait. Some people say processor interconnect. Some people say pixie, as in furry, I don't know. Then the ARM 9 communicates with the ARM 7 using the IPC-FIFO. That was what it was called in the DS. So, it's the same there. And now we're quickly talking about the Pica 200, which is Nintendo's first off-the-shelf GPU in a handheld, and what was used in 3DS. It implements OpenGL ES 1.1, although most games didn't use OpenGL, and it has some extensions, and some of those include per fragment lighting, which means it can calculate lighting per fragment instead of per vertex, hardware saddle mapping, polygon subdivision through the use of geometry shaders, subsurface scattering, which is the scattering of light as it penetrates a translucent object, and more, many, many more. And games communicate with the GPU using a piece of shared memory, which is used to store GPU commands and some other interrupt info. If a game wants to render, it first needs to write to the Pica's external registers, which can figure the frame buffer and some other secondary things, and then they need to initialize the rendering context, which is done by configuring the internal registers, and this is done through the use of GPU commands, as before it was done through the use of writing to a memory address. Pica command lists are nothing more than a list of values, describing patterns and data for writing to the GPU internal registers. So that's all there is to it. This is a white square. This is Mikage, a demo by DMP for Pica 200 that was presented at SIGGref 2006, so they could show off their new GPU. It shows the impressive shadows, like you can see here, and fragment lighting functionality, like, for example, in the Samurai's helmet. And also, there is, you can't see it there, but the floor, the wood grain, was procedurally generated because the Pica 200 could procedurally generate textures. The name Mikage, that's where the Mikage 3DS emulator got its name from. And you may be wondering that, you know, Citro got its name from the CTR, Mikage got it from this, where did Panda get its name from? That is a really good question, so if you find the answer, do let me know. Alright, my computer is lagging. Well, that's the joy of running it in a VM, so you could use a PowerPoint. It would be quite nice if it would run live right now. Oh, this is awkward. Yeah. Is that a thing? It's like frozen. Oh, it's back. Alright, I'll restart the virtual machine just to be safe. Or maybe I won't. Let's see. Aha! No, no, no. You're good on time still. Okay, good. How do I send this? Aha! Alright, sorry about that. Oh, now my first window is buggy. Okay, great. So, this is a Vertex shader in the Pica 200. The Pica 200 didn't use a high level shading language like GLSL. Instead, shaders were usually written in assembly. So, let me actually try to fix my thing. Okay, I guess I can. There is uniforms at the top, just like in a GLSL shader, which is a read-only piece of data. In this case, it's our projection matrix. There's constants, output attributes and input attributes. So, our position are in color, we want to output. And moving down to the main function here, it calculates for dot products to combine the projection metrics with our input coordinates and moves the output color. It doesn't do any modifications there. So, we're quickly talking about the pixel pipeline. So, in modern GPUs, fragment shaders are used, which is a small program essentially to run the fragment pipeline. But in the Pica 200, there is no fragment shaders. Instead, it has a six-stage color combiner to do so. So, essentially what would happen is, Vertex data would come from programmable shader units into the rasterizer. The rasterizer would generate a position. And then, four textures would be sampled using these texture units. All of them could do 2D textures. The first one could also do cube maps, 3D textures. And the last one could also do procedural textures. And then, this texture data and the Vertex data would be passed to the color combiners, along with lighting data. And color combiner is essentially what the name suggests, color combiner. It takes in two inputs and a way to operate between these two inputs and sort of produces a new color. So, for example, you could have two textures that you want to combine, or you could have a texture and lighting source that you want to combine. And after that, after passing through all the six stages, you get a color which is then post-processed, and voila, you get the Kirby, which, for example here, the beanstalk is mixed. The beanstalk texture with the lighting from light source. This is why it has, like, a scene here. And then on the leaf Kirby standing on, there's a gradient from left to right. That's not probably how the texture was of the leaf. Instead, it was using a color combiner combining with horizontal position, so it gets darker as it goes to the right. Some sewing off of the pica, some games. So, Captain Toad Treasure Tracker is known for being good and clever with lighting and shadow effects. The Legend of Zelda Ocarina of Time, which uses the fog rendering hardware that pica has, and also ADC-1, which is a format you probably never heard of, which is a compressed texture format. Mario and Luigi Paper Jam, it generates the seawater via procedurally generated textures. And Super Mario 3D Land, the sole source of things, like stencil testing, logic operations, command lists that invoke other command lists and more. And now we're talking about the digital signal processor. It is the same one that was used in the DSi, but in the 3D S2s, far more. And most games would ship with a common firmware, which includes support for 24 audio channels, ASC audio decoder, for games like Pokemon X and Y and Fire Emblem, multiple audio encodings, effects such as reverb, delay, both mono and stereo input, but only stereo output. And the architecture for this digital signal processor might seem a little bit weird, but it's not really because, well, for example, it has 16-bit bytes instead of typical 8-bits. And that is because it is a digital signal processor and needs to process samples, which usually are 16-bits, but not always. It has some weird instructions that you may not find in a typical CPU, such as multiply ad. It has support for type loops, which is necessary for such work, and 500 kilobytes of memory. And this is a rendering of the DSP doing math, I guess, Georgia at this time. And then there is some other miscellaneous hardware that we saw earlier. So the RAM, we went through that. There is also 6 and 10 megabytes of VRAM, one-time programmable memory for storage uniques, console uniques, sorry, controller, DMA engine, cryptography engine. You're probably not interested in all of that. There's two back cameras and a front camera and an EIR front LED, which is used in the new 3DS for face tracking, because actually the fake 3D effect would be quite straining. They would need this. All right, and we've reached the point to talk about the software stack, and we're going to start by talking about the Horizon OS. So the Horizon OS was the operating system of the 3DS, and it was split between the RM11 CIS score, which I actually skipped because my transition didn't work. So let me find the slide. I'm sorry about this. Oh, okay, I guess everything is broken now, huh? Okay, let's see here. Close this. Aha! Okay, quick interjection. Sorry about that. What's the problem about the cores because of that broken effect before? In the original 3DS, there is two cores. One is the up core, one is the CIS score. The up core runs usual in apps, including games and system apps. The CIS score runs the operating system and services. We're going to see what those are in just a second. And the new 3DS has two more cores. One is for the head tracking service, which is for the straining effect that we talked about, and the other is just available as another app core. So they would have two cores. All right, after this quick interjection, I'm going to go back to the operating system, if I may. All right. So yeah, the R9 is for security and IO as we just saw earlier. So you need to get a firm grasp of firms. There is four in the 3DS. The one we're going to be talking about is the native firm, which runs 3DS games natively. This is the 3DS mode. In the R11 part of the operating system, the R11 runs the usual length and the majority of the operating system code while the R9 is dedicated to cryptography and cartridge. There is the AGB firm for GBA games natively. And the R7 is the start of the solo runs all the game code. There is the Twilight firm for the S, pretty much the same. And then there is the Safe firm, which is a bare bones version of native firm for recovery. And you also can use it for updating. But later on, from now on, actually, we're going to look at only the native firm here for when we talk about stuff. And the kernel inside the R11 is called kernel 11. I'm going to go through a brief introduction of what is a kernel. Every operating system has a kernel, which is considered the core of the operating system and handles various critical functionality. So in the 3DS R11, there is this kernel 11 and it is what we call a microkernel. A microkernel essentially tries to be as small as possible. It runs its services in user space, things that it needs to do, such as file systems and networking. It is less code, which means less vectors of attack. And it's generally considered to be more reliable. So if something crashes, not the whole thing crashes, like if your network crashes, the kernel doesn't. Examples are this kernel 11 and also Minix, for example. And then there is a monolithic kernel. There's more types of kernels, but these are the major ones. In the monolithic kernel, all or most services run inside the kernel proper. And there is supposed to be fewer context switches, so everything is supposed to be faster inside the kernel. One example of a monolithic kernel is Linux. Kernel calls happen via the supervisor call, SVC, which is this. Here is an example of the assembly function called SVC. And like most operating system, kernel level is not a pure microkernel, but it is still a microkernel and handles memory management, processing thread management, and service and process intercommunication. Let's see what a service is. So services, as we said, usually end processes. And inside the 3DS, they're managed by another service called the service manager. In order to interact with a service, say you're a game and you need to interact with them, first you need to get a handle to the service manager itself, which is a public service, unlike all others. And you do so using a supervisor call. And once you have a handle to the service manager, you can ask it for the handle to the service you actually want to use, say the file system service. And you do so using the sent sync request. So you request the handle from the service manager through the sent request. Then you need to set up a parameter buffer with the function you want to call to that service in the file system, say the file system service you want to call the read file function, which might not exist. And you pass the function you're being called, and the necessary parameters. And then you need to send a request to the service, again using sent sync request. And you receive an output buffer. The majority of communication with services is implemented with requests like this. Sometimes shared memory is used to reduce latency for some crucial services, such as the GPU service. And requests and responses are really thread-locked with storage. And as the name suggests, sent sync request, that is, they are synchronous, which means they're not a sync. The response is written after the function returns. But some services can't use something to happen later. So for example, there is this Y2R service, which is YUV2RGB color, is typically used for decoding videos and FMVs and cut scenes. It doesn't stall until it's finished, which might take a while. Instead, it notifies you when it's done using a kernel event. These are some important services. So there's the file system, the DSP, the GSP for GPU, communication, app for applets, HID for input, et cetera. There's many, many. This is a function that asks the HID to HID service to enable the gyroscope. It gets a thread command buffer. It makes a header to send in, and it sends a sync request. It checks if there is an error, and otherwise gets the result. So there is then Process9. And Process9 is what runs inside the ARM9, and it is unlike the kernel 11, it is a monolithic single process kernel. And all it handles, like we said, is cryptography and device IO talking to the cartridge, to the SD card, et cetera. It has over-complicated C++ that reverse engineers hate, and it's really, really big. And here's a funny quote by PSI, creator of Corgi3DS, goes, if you ever feel useless, just know that Process9 calculates a ha ha on the CPU, despite having access to a full hardware shanjin. All right, so this is live demos, but my system just broke, so I'm not going to do the live demos. I'm actually going to show you a video of the live demos. So here's a static frame. Here's a video of Pand3DS running on a Discord button. This is done through HTTP streaming. There is a server, and it sends the frames, and it becomes a GIF, and you can use the buttons to play at two frames per second. We've actually finished some games using this, so it's not as bad as it looks. This failed too, all right. Here is Lua scripting in Pand3DS, so you can use Lua scripts to create MGUI windows like this, and you can create a lag. Okay, there we go. You can create cheats and stuff. So you can...Pand3DS exposes these, like, functions, such as write16, so you can write to address and stuff, and it also exposes events, so you can do a function of a frame. So here I'm going to change, for example, the slider, if the video... Okay, let's skip the video. Oh, no, actually, here you can increment the RUPIC count, and there is also a button below that says... you can't see it, but it says murder link, and pretend I pressed it and link died. Oh, there you go. All right. Let's get away from the videos. All right, so there is a hidden emulation talk inside this 3DS architecture talk. So first, before we get into it, I need to introduce you to HLE and OLE. So HLE stands for high-level emulation, and in high-level emulation, you essentially reimplement parts of the emulator system software in your own code to avoid emulating the hardware that is needed to run said software. So, for example, I told you that the DSP has a common firmware that most games use. Well, instead of emulating the over-complicated DSP, you can emulate this common firmware. The calls that would happen, you do so by re-implementing it in C++. The opposite of that is HLE, where you would emulate the entire DSP, which is, you know, much harder, or actually may not be. And then you run the actual software, the DSP firmware, on that hardware. There is benefits to each side. So, for example, LLE is slow because you emulate the entire hardware, but HLE is not easy. You need to reverse engineer the software. You want to run, et cetera. We're going to look a bit more into that. So for the 3DS specifically, you can HLE the operating system. So kernels, services, process nine, there's more things you can add HLE. And this is an example of something being HLE. So there is the file system service, and this is an HLE implementation of the file system service that returns if an SD card is inserted. And if you were to LLE this, you would need to emulate the complex SD hardware interface. So, you know, there are some benefits to HLE, of course. So just a quick summary. LLE is tedious because there's much, much harder to implement, especially in the 3DS, and it's also slower. But, beneficially, you can run any 3DS software, including bare metal firmware such as Linux for 3DS or GodMode 9. HLE is again tedious because there's so many services to implement. But it is performant, but still are prone, and there's many things to reverse engineer. There is also a hybrid approach you can do. So you can HLE the kernels, but you could LLE most operating system services. And what does LLE services mean? Well, services, as we said, are usual and apps. So you can literally just take the binary and just run it. That would be LLE-ing it in your emulated hardware. You run it. This is a nice bit of balance. It minimizes work, improves performance, and accuracy. Maintains accuracy. So as a 3DS MUDAP, you need to consider how you're going to approach the CPU. And there is many ways to do so. First, let's take a look at interpreters. An interpreter, you essentially interpret all the opcodes that you need to run, which is to say you switch through them, you decode them. This is slow, but it's also very portable. As long as your code compiles to the target platform, you're pretty much good to go, usually. And this might be your choice because a GIT might be very hard or impossible on whatever you're targeting, say an iPhone, I don't know, or WASM. GIT recompilation is converting the ARM code to host CPU code. So if you're running on an X64, you would convert the ARM32 code to X64. This is the most common solution. It's also easier if you use dynamic, like Citroen Panda Do, which can perform ARM32 to X86 or ARM64. So you can run most devices that most people might care about, such as computers and phones and stuff like that. Then you could also consider virtualization. As far as I know, this hasn't been tried yet in the 3DS, but there is an ongoing request to try to do this. So virtualization is the way apps like VMware and VirtualBox work. And on some ARM32 and some ARM64 devices, there is the possibility to execute 3DS code natively through the use of virtualization. It's not possible on all ARM64 devices because they're removing this functionality, unfortunately. But on some Raspberry Pi, for example, and some rooted Android phones, you can run 3DS code natively implementing this. We don't yet know if it's faster than GIT, but yeah. And then you could also consider ahead-of-time recompilation potentially. So that would mean recompiling the ARM code from the code section to your host operating system. This has a benefit compared to the GIT, which is that if you ahead-of-time compile, you don't need to compile fast, which means that you can perform optimizations. So for example, you could use something like LLVM to produce optimal code. So yeah, this is another potential way you could run your games. Then you need to consider how you're going to approach the GPU. Again, there is two ways to go about it. So you could do software rendering, which is simpler, but again, portable and slower. That's the downside. You could also do a hardware renderer, which draws on the GPU using something like OpenGL, Vulkan, Metal, et cetera, DirectX. This is much faster and obviously ideal for playing games, but less portable. So for software renderers, how do you make them faster? That's what you've got to consider. Well, there's many ways to make them faster. For example, there is multi-threading. You can draw on concurrently on several threads. And then you could also use a recompiler for a software renderer. So you would take the pika state, the current render state. You would make a small binary for it, and you would run it for set state for every pixel. So this would avoid like running a branch for everything. For example, if the depth buffer is disabled and you don't want to check for every pixel. This is optimal, optimized. And you can also reuse this compiled binary if the same rendering state arises again. This is something that, for example, PCSX2 does for its software renderer. And then for hardware rendering, well, it's challenging. You have to make sure you choose the ideal API, manage surfaces correctly, such as textures, color buffers, et cetera. There's many, many other problems to solve. For example, dealing with the parts that are on really OpenGL compliant or games that use depth buffers, color buffers, and write to them directly. Or, yeah, there's many, for example, tracking when a texture has changed, and it's no longer valid. There's many issues to solve in a hardware renderer, which makes it a bit harder. Then you're going to consider how you're going to approach the pika state. So again, you can do it in Turbo. That's simple and also too slow. You could do a GIT on the CPU. So this is the vertex state that we saw before, and it is converted to some scary looking X64 assembly. A little more 64 assembly, scary looking. And then another approach. This is decent performance, but it could be better. Is to recompile these satyrs from this state to your own GPU. So something like GLSL or Spurvee, for example. This would give even better performance, but it might not be possible for some select pika satyrs. So, yeah, let's move on. Then you need to consider how you're going to emulate the pika pixel pipeline. There's a lot of things to emulate in the 3DS. So one approach to specialized satyrs. I think this is what Citra does. So essentially you compile a specialized satyr for each pika pixel pipeline configuration. That's how you will emulate the pika pixel pipeline and run it on your GPU. This has low GPU usage because those specialized satyrs are specialized and small, but lots of time spent compiling satyrs which causes stutters on some games. This is the most common approach. Panda3DS currently doesn't have this, but it has uber satyrs, which is a term coined by Dolphin which aims to solve that issue of specialized satyrs of causing stutters. You essentially take an entire emulator for the pika pixel pipeline and run it inside a GPU fragment satyr. So you don't need to compile many times. You just compile once and you run it forever. This, however, has higher GPU usage because it's very, very big, but no compilation stutter. And it works well on more PC GPUs, but struggles on mobile GPUs, for example, or lower end ones maybe, because it's very big. And this is what Panda3DS does, but, however, an even better solution would be to do hybrid emulation. So you will compile the specialized satyrs as synchronously in the background, which you couldn't do before because you need them to render. And while they're compiling on a different thread, you use the uber satyr right here, which is, this is used for every call until the relevant satyr is ready. And this gives, I think, better performance than either method and works well on all GPUs. And this is what Panda3DS wishes to achieve after we've, after we're done with specialized satyrs. And finally, there is the audio DSP. There is two approaches, as we saw earlier. There is the LLE approach. In the LLE, you need to consider how do you optimize it? Do you recompile the firmware to your own architecture? Do you do it ahead of time? Teacro comes to mind, which is an emulator, assembler, disassembler for the TIC DSP, and it's used in Citra and Mount DS. And you could also, instead, HLE DSP, which should be faster, and things you need to do with that is improving the current DSP reverse engineering efforts. It hasn't been fully reverse engineered. You need to make test runs and tooling. And then you would need to optimize it. You could do so using SIMB or SIMD, and multi-threading and end-end. That is all for emulation. And now I'll show you again some ways we are exploring new territory in the 3DS emulation to sort of wrap up this talk. So Panda3DS comes with fluid scripting, including a text editor, so developers can make scripts and mods and tests and stuff like that. And we also have MGUI support. You can have your little windows. And I think it's pretty cool. Also Panda3DS here is running CTR aging, a factory test program some other emulators may struggle with. And Panda3DS running on the Wii, via the same HTTP streaming, not natively on the Wii. Yeah, yeah, unfortunately. Not natively on the Wii, just using HTTP streaming. We thought to show it off because we have an HTTP server. Just throw it on the Wii, whatever. And the physics book there, the monitor rest, I failed that class. All right. We have a revolutionary UI. It has Panda icons. You can play on Discord with all your friends. That is it for me. Thank you very much. APPLAUSE Hello, hello. Test? Does it work? Yeah. OK. Thanks for nice talk. Anybody has any questions? We have quite some time. I'll start here. Hello. Hello. Great talk. How much time did it take to build all that stuff? Well, I think we've been working on it since the FOSTA applications opened, since we got accepted, actually. So I would say, like, about three months now, something like that. The original version was not nice, but we fixed it. So yeah. Maybe you can give us a presentation of the... Oh, I'm sorry. LAUGHTER Silly me. Three months. Three months. Yeah, we built all this and the presentation. LAUGHTER All right. No, it has been in development since September 2022. Yeah, so more than three months. One and a half years now. Sorry. Hello. I have a question regarding GPU emulation. Yes. Have you considered using parts of Mesa, for example, which has a lot of GPU stacks, open-jade implementations, and can do compilation to native GPUs? I wonder if it has been explored, not necessarily in Pandat 3DS, but any emulator? That is a good question. No, I haven't considered it. I don't know if George has. He's the main developer on the Pandat 3DS. You can ask him directly if he has, but as far as I know, no. Hey, thanks for the presentation. Hello. I was curious. Do you know how you mentioned that in the 3DS, different generations of the hardware have more hardware? How do games and other software handle that? The different generations of the 3DS have more hardware? Yeah, and the different CPU speeds and whatever. Do the games do more if they are on different versions of the 3DS? Are you talking about the DS or GBA backwards compatibility, or are you talking about 3DS games? Because the 3DS in all generations has the same CPUs, but the RM11 has more cores. Yeah, but what do they do with the more cores? Oh, I think I mentioned it, but I'll show you again here. Yeah, so one is used for running games, one is used for the operating system, this is used for head tracking, so your eyes don't get dizzy from the 3D effect on the new 3DS. These two are on the new 3DS, and then the new one, this one is available as another app core. I think for backwards compatibility, if you try to run an older game on the new 3DS, it tries to downclock the speeds and not use the extra app core, and if you run a new 3DS game, it uses both of these to run the game. Okay, thank you. What happens if you run the new game on older generation 3DS? Does it fall back here and turns off parts of the game and not to crash on the new 3DS? I'm not... Yes, they asked, what happens if you run a new 3DS game on all 3DS? And I'm not sure, but my guess would be that it would display some sort of message that you can't run it on this console. This is what happens, for example, when trying to run a Game Boy Color game on a Game Boy if there was no such support. So I would assume that's what would happen, but I'm not sure. I'm wondering about the ARM 7 core. You mentioned it was disabled on the 3DS mode, but can it be used in certain situations like running virtual console, GBA games? Can it be used to run natively, or is it still emulation? Did you ask if it's used to run natively Game Boy Advance games? Yes. Yeah, that is its only purpose, and also to run DS games, because the DS also had an ARM 7 core. The ARM 7 only exists in the 3DS to run Nintendo DS and DSi games and Game Boy Advance games natively. For the GPU emulation, did you consider using compute shaders on modern GPUs as well, like speed up parts of the rendering pipeline that are hard to do with normal fragment and vertex shaders? Yeah, that is a question for George, but I think he definitely would have considered to use compute shaders, yes. Does anybody else have a question? Yeah, my question was, you said that currently on the 3DS supports Windows Linux, Mac OS and Android, and do you expect to support more platforms in the future like Wasm or iOS? I personally wanted to try at least to port it to Wasm. It's not that easy. There's no recompiler from ARM32 to Wasm. You would need to use an interpreter. Currently, Panda 3DS doesn't have one, but we're hoping to add one eventually. And then there is WebGL, which is not great, but you can use it. Theoretically, it should be possible for Wasm, but it's in the future plans. What was your other system question? iOS. I think iOS also has some problems with running a recompiler. It needs special privileges. And also there's the fact that you can't really post an emulator on the App Store, which is unfortunate. But we definitely want to at some point, yes. That might change in the future. I may be able to. Yeah, hopefully. I have a question. I'm allowed to ask questions. So this is not really a question, but George is in the chat and he's answering some questions. I'm not sure if you have access to the chat. You can read the answers. The FOSTA chat? Yeah, it's on Matrix. No, this computer is not connected to the Internet. It's secure, ultra secure. But the answers are too long to read for now, so just check Matrix for now. So my question is, it's actually related to Anise's talk because he mentioned vertical slicing. I'm very curious, how would you do vertical slicing for complex systems such as this or Nintendo 64, for example? Could you provide a definition for vertical slicing? Like very... Anise, could you... A definition for vertical slicing. I don't think it's a good question. I have a question. Like you emulate just the necessary path to emulate, this is the start of a game, and then go on from that. Yeah, by the world, or just maybe, I don't know, a test run and then test game, and then more stuff as you go. So sort of test-driven development? More like emulating as little as possible, but just having feedback, visual feedback from the... I don't know if that's possible, emulating as little as possible. Even the simplest ROMs use quite a bit of the operating system. Even like a simple triangle uses quite a bit. So yeah, I don't know if it's possible if I understand the definition correctly. Thank you. So first question. How do you feel knowing that every second spent on emulation is making a Nintendo business executive cry? And... Secondly, do you think ahead of time compilation could seriously improve performance, especially in lower-end hardware, or do you think the actualize more on the GPU on the rendering side? Yeah, it definitely could. I don't know if it has been tried yet, but we do want to explore that possibility. The big thing about ahead of time is... Well, the thing about the 3DS is that it doesn't really do a lot of any dynamic code execution, so it should be theoretically possible. There is CROs in 3DS, which is the alternative of DLLs in 3DS, but George has told me that they can be handled. And yeah, the big thing about AOT... Yeah, that's all he told me, didn't elaborate. But the big thing about AOT is that you can optimize the code, so I would think that it would run quite faster. But we don't know. Okay, I think this is the last one. How do you deal with the system applets that the 3DS has? Do any games even interact with those in any way? When you say the system applets, do you mean like the home menu, stuff like that? Yeah, like when you go to the home menu, there's little apps at the top, for instance. How do games interact with that? Can they launch other games, that kind of stuff? Games in the 3DS only interact with services. They don't interact with the system apps. Okay, final one. Yeah, you talked a bit about the new 3DS infrastructure and the additional CPUs. And I was wondering, when you run older games that weren't using the additional CPU, does the new 3DS automatically use the additional CPUs, or are they just completely ignored and just unused? No, as far as I know, it just uses the original LabCore, the one it has, and it also down clocks it to the original speeds. Oh, well, thank you. Okay, great. Let's thank our speaker again. Thank you. Thank you.
Breathing Life into Legacy: An Open-Source Emulator of Legacy Apple Devices
So, we're going to start. So Martin here is going to tell us some stuff about Apple. And I have to confess, I'm very anti-Apple, so I wanted to actually refuse this talk, so that everybody will, again, refuse this talk. So, Martin, take it. Thank you very much. So good morning, everyone. Thank you. So good morning, everyone. Thank you for providing me with the opportunity to speak here. My name is Martin DeVos, and today I will present to you my hobby project, which involves an open-source emulator of legacy Apple devices. And in this talk, I will explain how I managed to emulate these kinds of devices, what it takes, what the challenges are, and what the next steps are. So let me first briefly introduce myself. So I'm a postdoctoral researcher at EPFL in Switzerland. And my main research topic is not actually on emulation or reverse engineering, but it's on distributed machine learning systems, like many people are working on nowadays, like LLMs and stuff. But I'm also a very big enthusiast of reverse engineering. And I actually started doing this during my master thesis already. And during my PhD, I worked on reverse engineering some mobile applications of banking apps in the Netherlands and other countries as well. And that has resulted in, well, my first paper of my PhD. And, yeah, and two years ago, I decided to pick up this project. I was inspired by reading a blog post of someone that managed to emulate an iPhone in software. And that's how I was motivated to work on this project. And this was actually Jonathan Effek. And I think he was one of the first that managed to boot iOS, the operating system of iPhones and other Apple devices, with QMU, which is a very popular open-source emulator. And he managed to boot it to an interactive Bash shell. So he managed to boot this emulator to user land, which is quite an achievement. And I thought, well, I want to learn how that works. It involves some reverse engineering, which is a thing I really like. I like seeing how software works, trying to decipher some of the secrets in the software. And it would also contribute eventually to long-term hardware preservation, because when people would run it, it has some feeling of nostalgia. And, well, I mean, I had my first Apple device, was an Apple Touch, and I decided to, well, work on that. So after reading the blog post, I was a bit puzzled. And I was like, OK, where do I start? How can I set up my own project to work on this kind of stuff? And, you know, Apple has released many different devices over time. And the first question I had to answer is, which device am I going to emulate? And if you think about contemporary devices, they are incredibly hard to emulate, at least emulating all the aspects of these devices is a very, very challenging and difficult task. They contain neural engines. They have phase ID touch IDs, which also has interactions with secure enclaves, but also software-based security measures like trust caches, which is a mechanism that's by Apple that only allows particular applications to have privileges. So I was thinking, if I go back in time and I take one of the first devices by Apple, at least in the Apple Touch family, that should be somewhat, well, easy to emulate. It is a device that was released in 2007, and it doesn't contain, well, the complicated hardware peripherals that I just mentioned. And, yeah, hopefully that will be simple enough to emulate, well, which were some famous last words, because even these devices are very, very complicated, as I will outline a bit later in this talk as well. So I'm definitely not the first one to work on this kind of emulation. So there are some related projects. One of the, I think the earliest attempt actually on emulating the SOC of an iPhone was by a CMW.me, who actually is the founder of Corellium, which you might know as a company that provides virtualization services, both of iPhone and Android applications. Yeah, we had the blog post that I just mentioned, which enforced the emulation of an iPhone 6S Plus. And that work was picked up by someone else and eventually involved in an iPhone 11 emulator. And there's also OpenEyeBoots, which is an open source bootloader for early generation Apple devices. And all of these projects have been extremely helpful in understanding and connecting all the different pieces together, because without them I wouldn't have been able to get this far. So then I had to pick a framework for emulation. And QMU is one of the most popular open source frameworks for this kind of emulation. It provides support for hardware emulation. You can define your peripherals, your hardware components. You can implement their expected behavior. And it already comes pre-shipped with support for many different protocols, like the USB protocol, network interfaces, SPI, SQC, SDAO, etc. So that was all very nice, but unfortunately it has a very, very steep learning curve. So it's quite difficult to wrap your head around particular parts of the project. So most of the time I had to rely on existing emulations provided by QMU to see how that works. And when doing emulation, you also need a way, or you would like to have a way of debugging your software, because you want to see which code path execution is being followed, what the register values are, and what's generally happening in the system. So the nice thing about QMU is that it automatically provides a GDB stop, a GDB server, that I can directly connect to, and then I can step through the code, I can jump to functions, and I can inspect all the register values. And for the reverse engineering part, I've been using Gidra, if I pronounce that correctly. It is a very popular open source tool for reverse engineering and decomposition, and this assembly of binaries, and this has been also tremendously helpful. So here on the right you can see, for example, some code related to the start procedure of the SPI controller, which controls the SPI interface. And if you look at it, it's actually pretty readable. You can do a lot with this stuff, but also the way Apple has engineered our software is very predictable. So they're using the IOCAD framework, which is very similar in structure. I mean, most of the peripherals look like this. You initialize some memory, you set some variables, and that's mostly it. So now let's talk a bit more about the emulation itself. So my philosophy when it comes to emulation is that I wanted to stay very close to the actual hardware, to what's actually happening on the hardware, no matter how difficult that might be. What I noticed is that many existing emulators, they cut corners, which is not unsurprising, right? Because for example, if you run into some kind of signature check, it might take a lot of time to get everything working and to get the right functionality and to make sure that pass steps. So one way is, for example, to just patch out that particular procedure or function call. Why do I want to do this? Because I had a feeling that any hack, any workaround I would do in the very early stages of working on this emulator would bite me back later. So I'd rather do it right very early in the bootpress process, where things might not be as involved as when dealing with a more high level of a user land or application. So I tried to, well, get it right the first try. Well, but as expected, it still ended up with a bunch of hacks, patches, works around, and patched out binaries, because for some things I really, really couldn't wrap my head around, at least not within a reasonable amount of time. So another philosophy that I had, I started with following the bootshank. So I started with the lowest level components in here, which is the Securon Bootrom. This is the very first piece of code that runs on an Apple device. It is actually fused into the chip of any device. If you find vulnerability in there, it's very nice, because you cannot patch that out. That's actually something that happened a few years ago. The Securon loots another called low level bootloader, LLB. That in turn loads the main bootloader, iBoot. Then that's iBoot, the component loads the X and U kernel. When the kernel has launched, it will start the launch D process, which is the very first process that runs on the system. That's launched Springboard, which is responsible for drawing the iconic user interface with the app icons and the hope screen. Springboard in turn starts all the different apps, like the Alarms, Safari, and other applications that you are familiar with. So I started working on the Bootrom first. As a very first step, I had to get the Bootrom, which is fortunately provided online. So that's very nice. It was dumped. The main responsibility of the Bootloader is not only to load the other bootloader, the low level bootloader, but also to initialize some key peripherals, like a clock, the timer, and the USB stack. Because even if everything else on the device fails, the Bootrom allows you to restore the device using some USB protocol. So if something goes wrong, you can use DFU, DFU mode to restore, to refresh your device. Now, I had some instructions running there, but I very quickly found out that when emulating this binary, this Bootrom, that it jumps to some unknown memory locations. And that was a bit problematic, because I didn't really know where it jumped to. And I looked a bit on the internet and I asked around. And it looks like this first generation iPhone is using some proprietary logic by Samsung. So very early generations of Apple devices were made in collaboration with Samsung. So the Bootrom was also made by Samsung. And I didn't really have any idea of what happens, because the Bootrom is very obfuscated and very small, and there are almost no strings and no contacts to work with. And I also didn't have any physical iPhone, Apple Touch device at that time. So I couldn't really figure out or dump that part of memory. And the same actually goes for the low-level Bootloader. I was running into the same problem there. It jumped to some unknown memory locations, so I decided to skip these two parts and go straight to iBoot. Yes, and this is how I load iBoot in code. So iBoot is the main Bootloader. It is responsible for loading the kernel from basically the hard disk. I was very fortunate that the source code of iBoot got leaked in 2018. So that actually was a newer version of iBoot, but at least it gave me some idea of how this all works. So I really tried hard to map all different components in the leaked source code with what I see in Gidra in the binaries. And I managed to boot iBoot and get all the peripherals up and running that iBoot expects. One thing about that is that there is this device tree, which you might also be familiar with if you work with Linux, some low-level Linux. It is basically a big dictionary of all the peripherals and their properties. It is included in the IPSW file, which is like the firmware file that you can download from Apple, and that is being installed. It is populated by iBoot. So iBoot, for example, gets the MAC address of the Wi-Fi driver and then it injects this number in the device tree. So here on the right, you can see a part of the device tree containing some information about the crypto AES engine. So it contains some identifiers and some other things. That was also dumped. So I also used that as reference to get an idea about which peripherals there are to emulate. And I can tell you that these devices are extremely complicated. So this is a diagram that I made of all the components that I managed to get up and running. Not all of them are fully functional, but most of them at least have some functionality. And this is for the Apple Touch 2G, which is slightly more complicated than the first-generation Apple Touch. So these peripherals, most of the peripherals, you can talk to them through something called memory map I.O. So in the memory map, there is a small part that is allocated to a particular peripheral. So here on the right, you can see the addresses of all these peripherals, which I also mostly got from the device tree. And you can write to these memory locations to talk to your hardware devices. And then the main challenge becomes, of course, like to talk with these hardware devices. And you have to do that in such a way that you get the expected responses and that the kernel and the other parts of the boot stage are happy with what these peripherals are saying. So this is an example how you can initialize the hardware components in QMU. You define some methods, some initialization methods, and then you include them in some main file. I won't spend too much time on this now. This is how you implement the functionality of each hardware component. You create a read method and a write method. The read method is called when something, a hardware address associated with the peripheral is read and the write function is called when you write to a register. And you can see, for example, in the read method that you have a switch, so you look at which address am I reading something from and then you return the write response. And sometimes that can be very arbitrary. I mean, I haven't deciphered all the meanings of all registers and what they expect, but you can at least do a best effort attempt in returning the values that makes the kernel happy. And this can become complicated very quickly. So here you can see a part of the SPI controller, which was a particularly difficult component because Apple has some, well, weird things sometimes. They make some modifications to their hardware, which not always follow well-established hardware protocols to say. And finally, you attach the peripheral to the overall machine in QMU. And you, so, and you optionally, you can connect the IRQ like the interrupt request. So interrupts are also functional there. Again, I won't spend too much time on this now. So after iBoot was running, I had to load the kernel and the kernel uses iOkit and it starts all the device, all the device drivers that are declared in the device tree. So whereas the low-level bootloader in iBoot would only load the most important peripherals, this would start all the peripherals. And here on the right, you can see some of the peripherals that I reverse engineered with the Ghidra. You can see LCD display, the power management unit, some other functionality that I didn't even know that were part of the Apple Touch itself. And this mostly follows a very similar protocol. When you start a peripheral, you usually execute some reset procedure or you do like an inter, you wait for interrupt or something to indicate that the device is ready. And after all these devices are loaded, then you start launch D. And this is the part where I spend most time on because I had to like get past all these peripherals. I had to understand how they work. And the further you get into the bootchain, the more complicated things become because then you are really building on, on the correct functionality of say the clock and the timer and interrupt requests, et cetera. So roughly 20 peripherals later, I got most of the things functional, like the clock, timer, the interrupt controllers, they're all fully functional. I'm pretty sure there are a few bugs left, but nothing too major. And only partial support for some of the more involved peripherals, just enough to make it past initialization. And then we're talking about peripherals like TV out, which happens that if you connect your Apple Touch to a TV, the GPU, also the accelerometer, the light sensors, they're not really important at this point. I was very fortunate that I could avoid GPU rendering, hardware GPU rendering with a flag. So the GPU rendering in this emulator happens fully in software, which is slower, but still it's reasonable enough to use the Apple Touch itself. So there's a lot of work to do, but at least at this point I managed to boot to userlens. To give you one more interesting challenge, was the persistence layer. So the Apple Touch contains two types of memory, some more memory that contains small binaries. I think it's at most a few megabytes. And you also have the NAND memory, which is like eight gigabytes, and you can store all your applications and the operating system in there. There are some key difference between the layout of these, of NAND and NAND. So I had to spend a lot of time on when I emulated the Apple Touch 2G to make sure that also works. The main problem here is that once the kernel gets some kind of block, let's say block five, it uses logical block addressing. And that doesn't match with what's how the NAND layout underneath works. So I had to really figure out how something is mapped from a logical block level to the physical block level. And that took a lot of time. I ended up with some scripts in a separate repository that takes a dmg file and that converts it to like a raw file system, a file system as it is like really in the hardware. This is the diagram for that to give you some more context. This is for the NAND. So we have the file system which is implemented in the kernel and that's if it wants to get something from the operating system, it uses a logical block address that goes through two different layers, the flash translation layer and the virtual flash layer, again with their own numbering and addressing and mappings. And that results eventually in some physical page number and a CE which is basically like a bank, a number between one and eight. I think in the interest of time I'm going to skip this but I just want to say that multi-touch, even though it looks very simple, how hard can it be to convert a touch on a screen from to an X and Y coordinate was very, very complicated to get right and for this I actually, for this I actually needed a real device. So most of the things I could do without having an actual device but for this I needed a real device because I had to play with touches and see how the encoding of the touch works. So here on the right you can see, well me playing around so you do press a button and then I recorded what the multi-touch driver gives back to me. So all in all I managed, when doing all of this, I managed to boot to the, I will touch one G to the home screen. Well you can see it's a pretty basic home screen, not many applications. I think I got this running about one and a half year ago and a few months ago I managed to get the Apple Touch 2G working as well running iOS 2.1.1 and the Apple Touch 1G is running iPhone OS 1.0. And that mostly concludes my presentation. I open sourced all the code, I created this GitHub project out of it which is a fork of the QEMU project. I'm not sure if I want to upstream it because it has a lot of ugly code and a lot of well, work arounds. But contributions are very welcome. It currently has support for the Apple Touch 1G and 2G and I'm currently focusing on getting the Apple Touch 2G stable so I can get an app store and third party applications up and running. So that's all, thank you. And if you want to know more, I have some blog posts with more technical details on my personal website. APPLAUSE Right, hello. Yeah, so we have some sign for questions. I hope the people ask questions are here in the front because I don't want to run to the back. But I'm going to start with a question because you mentioned Corellium, which is awesome by the way, they are very expensive but they are awesome, but Apple suites them into oblivion and they are lost, which I'm very proud of. It has nothing to do with it. But so the question is, has Apple made any friendly inquiries? No, no, no. I think this project is still too insignificant for Apple to care about. I also know about the Rockbox, for example, which does Ipod generation emulation. I'm not sure. I don't think they've been sued. But I'm not that worried about it right now. OK, excellent. Questions? Sorry, come to the side. Hi, thank you very much for your speaking. Only one simple question. Because why you choose the iPod Touch and the iPhone platform, it's only a simpler problem, or because there are patents or other problems in that way. Thank you very much. Yes, thank you. So the question is, why did I choose for the iPhone and not for the iPod Touch and not for the iPhone? Well, I mean, when I started this project, I was not familiar with the architecture of both. But I was thinking, well, the iPod Touch contains at least one less peripheral, namely the baseband, the modem baseband. And I was not sure how critical that would be for the entire booting procedure. So that was, I think, my main motivation. But most of this stuff can also be applied to the iPhone. I think with some changes, you can get the iPhone 2G working. Because the iPhone 2G is architecturally similar to the iPod Touch 1G. Yeah. Hi, great talk. What are your future plans for this project? Do you want to support newer devices or expand like the computer in a more modern iOS version? Yeah, thank you for your question. So what are my future plans? I am currently working on getting the USB up and running. There is an independent researcher that also managed to get a syscalls between the guest and the host running. So that's pretty cool. So we can do some syscalls. So I'm currently working on USB. Whether I want to work on newer generations, I'm not so sure. I think it will be possible to emulate them. But I think having one stable and, well, actively used emulator is better than having 10 fragmented, half supported emulators. Because there are like many Apple devices out there. So yeah. OK. OK. Hi, thank you for this great talk. I was wondering, you were talking about getting the app store up and running. Have you considered getting in touch with Jay Freeman, the author of Cydia? Cydia, no, I haven't considered getting in touch with him. I know some people are asking me about, can we jailbreak and then install Cydia? I think we probably can. But there's almost no tooling around this emulator at the moment. So getting these jailbreaks up and running is kind of difficult right now. But I think it's a good suggestion. I think at one point I should. Yes. Thank you. Yes. Anybody at the front, hopefully? Thank you. Hi. And thank you for your talk. I don't remember in 2007 with this type of device, required activation behind us or not? I think they indeed require activation. Oh, actually, that's a good point. I used activation tokens from an actual device. Because I also had to match the serial number, et cetera. So I matched the serial number. I used activation tokens from an actual device. And then it worked. But I could have well patched out all the look down the demon. It's the look down the is a demon responsible for this checking if everything is activated, et cetera. I could have well patched that out. OK. Thank you. Great talk. Have you got the opportunity to play with JTAG debugging to cross check if your emulator works well, like a real device? What are you referring to? Like how can you do this check? I would say you try to execute some peripheral access, both on the real device and in your emulator. And you cross check the read results. The spell is a good point. I think you could do it with open iBooks. So I managed to install open iBooks on the actual device. There you can play around with the peripherals. So I think you can have some kind of trace where you just fire requests to the hardware. And you get some responses. And you can cross check that indeed with what I get. No, I haven't done that yet. But I think that's an excellent idea to make sure that your emulator is mostly compatible or the same as your real device. So I had a small question, actually, because at the beginning you mentioned you're a postdoc. So how much time do you spend on this? It's very difficult to say. Because sometimes I have a week and I spend every evening on it. Sometimes I don't spend any time on it for three weeks. I mean, it also depends on my main schedule for my work. I mean, depends on paper deadlines as postdocs, obviously. Yeah, I think when you get closer to getting something up and running, you tend to be more motivated. And then I spend more time on it than when you're completely stuck. And yeah. OK, does anybody have a question? I can keep going on. So another small question is, because one of the previous talks, they mentioned motivation. How do you get motivation to start something like this? And where do you start? So can you tell us something about that? Yeah, I think for this, well, you mean, first of all, you need some curiosity. You want to know how things work. And you really want to, yeah, you have to be able to dig deep into some components. And you know, there are many components. So you will inevitably run into something that you don't know anything about. So I learned a lot about all the different components that are in there. But another very important thing, I think, is persistence. Because at many times, for example, when working on the multi-tarch or the nonce, I was like, yeah, I really don't know how this works. And then you solve a small part. And then it turns out there's yet another layer of indirection going on. And you have to figure that out again. And then it turns out that something you did earlier was you made the wrong assumption, which breaks all other components further in the pipeline.
CONFEDSS: Concolic execution and the puzzling practice of peripheral emulation
Okay, so this talk will be a little different. That was very impressed by the other talks, so now I feel like I have the worst one. But it will be a little different because the problem we have is a little bit different than the problems you have seen in previous presentations. The talk is called Confed's and Postem asked me to make a nice title, this is what I came up with. Who am I? I am Jeffrey, Jeffrey Rungen. I work at the NFI already for 10 years, as you can see by the loss of my hair, I'm getting old. I am the lead scientist of the exploitation team at the NFI. Oh, sorry, we'll get to what the NFI is. My favorite area of study is the overlap between hard and software, as we've seen the memory map profiles and register interfaces is where I'm at. My expertise is in reverse engineering exploitation, which requires emulation and some non-invasive hardware attacks. This is Luke, my colleague. Oh, yes, this works. Yeah, so I'm also a member of the exploitation team. Now I did this research at the NFI before I worked there just during my master thesis. Yeah, and that's also how I wrote into this subject. The NFI is the Netherlands Forensic Institute, and it's a bit weird because in other countries, if you get a case and it goes to court, the police do the forensic investigation. In the Netherlands, that's not the case because in a weird way, the police are considered not to be independent all of the time. So the defense should also be able to request evidence and request research and evidence. So what we have is the Netherlands Forensics Institute, and the police can request cases from us, but also the court itself and the defense, of course. That makes it also a bit weird in how it's set up because it's not just a digital lab. We have almost all the forensic principles in one building, which includes biological traces, chemical and physical. So biological is DNA, pathology, all kinds of stuff. Chemical and physical traces includes, for example, fiber research for clothing and other stuff, glass research, but also explosives. And toxicology, which has the drugs, which is, well, digital and biometric traces. This is where we come in. So we are the digital traces, but the digital and biometrics, they thought it was handy to put them together. So we're also together in a department with, for example, fingerprint research and other stuff. One of the weird things that comes out of this is that we get a lot of cooperation with other departments, mostly on machinery for us, and to make databases for all kinds of stuff. But for example, we used to use the X-ray scanner that pathology had to examine bodies, to examine PCBs. But then the resolution was not that great. So we got our own X-ray scanner. And they had just installed it. I was very excited to see it, and I come in, and there's a full human head and a jar inside of it. They heard we got a new one. Yes. And we also, I will not tell too much about this. See me after this presentation, if you want to know more. So what do we do at the NFI, except the human head thing? We are team exploitation, which is just a part of the digital team. And most of what we do is extracting and decrypting evidence from digital devices. And that's a very hard thing to do, and it requires some legal hard to place things. And that's why the NFI can only do offline methods. So of course, when you do exploitation, you're actively might be looking for vulnerabilities, for example. And because of that, we are not allowed to do anything outside of the building. So if we find your Facebook or off tokens, we're not allowed to look at your vacation pictures, but we can just give them to the police, and then they can ask permission if we find something. And the evidence takes many forms, which is what makes this job a lot of fun for me, at least. So of course, phones is a big one, but also USB sticks, crypto wallets, cars, everything from airplanes as well to cranes and building sites, we've had them as well. It's a very varying job. And of course, how do you do decryption of data if you don't have the keys or you don't have a method? You do reverse engineering, as we've seen the iPhone talk just now. Through reverse engineering, we can do exploitation. We use fuzzing as well to not have the manual labor. And we use various hardware based methods, which I will not go into now, because the scale is really large. So let's get into the technical stuff of things. How did this talk come about? Well, first, you need to know a little bit about how we do our reverse engineering. And if we just take, for example, as a use case, an Android boot chain, how does it work? We always start on Android with a primary boot loader, which is the same as Secure ROM that you saw on the iPhone. It's a piece of ROM that's fused into the chip, cannot be changed. If you find anything, that's nice for us. But the primary boot loader's main job is to initialize the storage, for example, EMMC or UFS, and then verify, load the next boot part from there and verify it and jump into it. The next part is called the secondary boot loader, which does most of the rest of the hardware initialization to get the system further up and running. And then we get into the specifics, the A boot, the Android boot loader. And the Android boot loader just sets everything up for the kernel, in a trim of S. It has some modes, for example, a lot of you might know fast boot. There's also an A boot that's for rescue type operations if you destroyed something else. And A boot, again, takes the kernel, verifies it in the boot image with the inner RD inside of it, verifies it, sets it up, jumps into it. And the kernel boots the user LAN. In the case of Android, Xigote, doesn't have to be, Graphene doesn't use it, I think. But from a reverse engineering standpoint, what is interesting about all these things and is how we can reverse them and how hard that is. For example, user LAN, you have a lot of library calls, there's a well-defined system, you know, how it works, what it does for most of the time. So we can quite, we can reverse it quite well, I would say. Linux kernel has the same thing, it also has a lot of known structures, which is really nice. As you also saw, I will grab back on the iPhone talk again, you saw a lot of these structs that were already defined. We can do that in the Linux kernel as well, because all these structs are preset and we know about them, so we can use them to our advantage. And we have A boot, which is already a bit more proprietary. There are some strings, some known interfaces, you know, some hardware that lives in there and how it works. And that gives you some device registers, which gives you some context. The secondary boot loader is getting a lot harder already, because that is very proprietary, there's not too many strings. There's some known protocols, but for example, sometimes it will speak a sort of debug protocol for a few, you can actually destroy the A boot boot loader, because it's on flash. If you do that, then you're going to talk to the secondary boot loader to repair your phone. And we know those protocols, so we can check to see what's happening. The primary boot loader, you're not going to do it for fun, probably. I was really happy, you also said it didn't work out. It's not nice. So let's look at some examples, because this is all text, I'm already a government employee, so I'm boring by default. Let's look at some examples. This is a library, I hope it's a bit visible. But you can see some library calls, for example, a fopen fcfget s. And you can see some strings in there, and you know what these calls do, because it's post-exdefined behavior, so that's nice. That gives you a good idea, a good overview of what it does. If you're very lucky, of course, symbols are still in there, parse proc of s, that kind of gives you a good idea of what it does. Then we get the kernel, it's a lot smaller, unfortunately. But you can still see some strings that give you a good idea, because you know what a kernel does. For example, at the bottom there, it says unexpected syscall invocation. So it might have something to do with syscall handling here. And you can all annotate that and get further into it. Then we have a boot. It looks a bit better than you might expect, because there's a lot of strings, for example, signature verification failed, secure check fail. That gives you some idea, like if there's a branch in there, which there is, that might be the signature verification. And you can work your way up from that. Then we get the sbl. And the only thing we see here, and bear in mind that everything that says r underscore is annotated by me, so that's not originally in there. There's just one string there, XML packet not formed correctly, run out of room looking for tag. It gives you some idea that it's probably XML parsing, as you can see on the top row there as well. It's looking for some new lines, some spaces. But there's a lot less to go on here, also because as we all are not running against a kernel or anything, there's no syscall interfaces, there's no library calls, it's a big binary blob talking to device registers basically. And the pbl gives you constructions like this. It's an absolute pain in the ass. There's no strings, because the way they make pbls, they're fused onto the sock, but they're rom, so they're generated in the masking for a chip. And the way it works when you build asics, which socks essentially are, asics go by square area. And the more square area you use, the more, just one of the parameters, but the more expensive they get. So they really don't like to use a lot of pbl memory. And that makes it so that there's very little strings in there usually. And you get constructions like this, which is a double D reference of an index in an array, which is code, which gets called with some parameters. It's not nice. You're not going to write this usually in C code. This is probably some struct indexing. And another thing there is as well. Pbls often from what I can see, but that's more of an assumption that I made based on the work. They don't use normal compilers. There are some weird constructions in there. For example, there are some constructions in there so that with one bit you can get different code paths and patch in new stuff. And I think that is because if they make masks for chips, once the masks are made for lithography in the factory, it's very, very expensive to make new masks, but editing one bit in a mask is a little bit less expensive. So there's all kinds of weird constructions in there. Now, we have normal solutions for debugging. So let's say we want to get some dynamic stuff going. Kernels, we can emulate them and debug them using QEMU and debugging with KGDB in QEMU, if you want. Userspace can be debugged with normal GDB. Android supports it. There's documentation on it. And the reason I say everything before that is signed and hard to emulate. The kernel, of course, is also signed and some parts of user run. But the kernel, you can unlock your boot loader and you can use a reference on for that and work your way up from there. You cannot unlock your boot loader in a way using normal processes so that you can patch a boot. So it makes it a lot harder. It's not made to be debugged or something like that. And everything before that. Oh, sorry. And so the state is unknown in a lot of these boot processes. You have no idea what should be where. A lot of hardware is still being initialized. So, for example, in SBL, the MMU is initialized. So we don't have it before that, which makes it easier, but also less easy for some tricks in exploitation. And there's lots of memory map peripherals because, like you saw in the iPhone talk, it's a von Neumann architecture and peripherals are 99% memory mapped if they're not on a bus. So let's take a look at how we do some static reversing. Let's take that awful, godawful piece of code and then PBL and see what it would do normally. So let's visualize it for a bit. So this part, just this part, the code call, what it would do, if you look, there's a 143680 there. Let's look what should be at that address given this code. There should be a pointer because it's dereference. So the first dereference will go to a pointer. Then that pointer will point you to another pointer because it's a double dereference. That pointer is a pointer, but not the pointer we want because it's an index and an array. So we take the next one and so on until we get to the right index in the array. And there should be another pointer. And that pointer is the actual pointer to the function that we're looking for. This is all pretty good and reasonable. If we have some state, we can say, okay, we'll follow that pointer, we'll go here, blah, blah, blah, blah. And then we get to the actual function that is referencing. And that function, you can make one other assumption about it. That function will be inside the memory map and inside the PBL that you have probably. Otherwise, it's in the SBL, but that's not loaded yet. So it has to be somewhere here because it's code that gets executed. That is a damn shame because this is very followable. You can do this if you know what these values are. But in real life, when you do static reversing, this is what that looks like. You have no idea. Of course, you know that here should be a pointer. And if you're lucky, there's a direct right to that address. And you know, okay, get set up here by this function. But then still, you have to follow that array, which is also uninitialized. And somewhere is that actual function, but we have no idea where at this point. So static reversing gets quite hard. And then if static reversing gets hard, we usually grab back to dynamic reversing if we can. And there are some nice assumptions we can make for dynamic reversing. Nothing runs before the primary boot loader. So the memory state is either zero dot or garbage. It doesn't matter for our setup because we're never going to read anything uninitialized, maybe. All state is set up inside the emulator. So we don't have to worry about peripherals having a certain state already and having to account for that. And then structures and their initialization because what you saw here is, of course, a structure. This is not something you write by hand. Like, oh, I really want to double the reference. This is just a structure being set up and used. And then complex logic. You can be single-stamped. And you can see the values to have a better understanding of what it does in smaller steps. And that's often what you do in reversing. You break a big problem up in small steps. It makes it a lot easier. So what I did was I just cobbled together a lot of tools with some glue logic. They make an emulator. This was the first version of this was done in an afternoon with a lot of anger. But the main thing that we used was killing. Killing is just a framework on top of Unicorn. But it allows you to write hooks, which Unicorn also allows you. But it also runs a GDB server, which makes a lot of stuff a lot easier. So it's built on top of QMU, but not the current version. Long story. So killing runs a GDB server. And this is the glue logic. So there was no logos for that. This is actually a Dutch tram company because Redsync also doesn't have a logo. But the killing uses, we attach GDB to killing. And then you have a tool called Redsync. It's made by Quark's lab. It's really nice. I like it a lot. And it hooks up GDB to Gidra. And now I know what you're thinking. There's already a GDB debugger in Gidra. And you can attach it. Yes. But not when we started this. And it was upcoming for like a year. So we used Redsync. And then we used Gidra Bridge, which is just a bridge in Python 3 to Gidra because I'm not going to write Python 2. It's 2023. And the glue logic pulls in all the programming, all the memory mapping, everything from Gidra, puts it in killing, starts execution. We attach the GDB server to killing and connect that back to Gidra via Redsync. And then we can single-step logic and set breakpoints and everything we wish to do. But there is still some problems in the emulation. I'm running this PowerPoint in wine, just so you know. So I hope the animations will function well. Yes. So the memory map here, we have DRAM at this point, fully uninitialized. It's not a problem for us at all. It's not in the way. But it's un-initialized. It's unpowered. You can actually see the powering up of the DRAM inside the secondary boot loader. And the offset of that DRAM is known from the DDB files. The kernel has to know it, meaning we know it. Then we have SRAM. And SRAM is a RAM that's on the sock that's always there. And it's used mainly for boot initialization. There are some weird use cases for it at runtime. But it's always there and it's initialized. Only the offset is somewhat known. We don't know all the time where it is. Sometimes it's not in one piece. It's not homogeneous. But that doesn't matter for our purposes. It's somewhat known by reversing. And the status is known, because at boot it's all zeros. And SRAM, we know it's all zeroed. So there is no logic in SRAM. There's no asynchronous state. So emulating SRAM is really, really easy. It's just RAM. Then we get to the peripherals, which are the actual problems. As we've already seen in the previous talks, the offset is usually unknown. We know some offsets from the DDB. But there are things like crypto engines and stuff that get initialized and that the kernel never touches, because in Android, you have your normal world. You have your secure world with trust zone. And the problem is that these peripherals, they have to be initialized, but the kernel doesn't know about them. So DDB doesn't know about them. We don't know about them. So we have to kind of reverse our way through it sometimes. The status also unknown, because there might be some state set up. Think about the serials or MAC addresses, some stuff like that. That is a state that is known at boot time, but not necessarily to us. And the status based on unknown logic. As we've also seen in the previous talks, it does a read of a memory map peripheral or a write. The things that you can see is I write one bit and then I wait for another bit to another offset. And we don't know that logic. And it changes asynchronously from the CPU. So it might have changed at any point in the execution and we have to deal with that. And the problem here is, and I got a little insight from the other talks, we can emulate on an peripherals. Of course, a lot of it is stub hooking. You saw the return 2042. A lot of that is what we did as well. Just write simple hooks. We know you're going to read from this or return something. It doesn't matter what. And sometimes we can see it should be a one. It's also good. But the big problem we have is that this is a tool for reverse engineering and not so much a real fully-flagged emulator. And it needs to work on a lot of things. So, for example, we'll be reversing a phone one day and then next time it will be a USB stick or an SSD self-encrypting thing. And then the emulator still needs to work. So we cannot hard-code peripherals every time. We can. It's a lot of work. And it turned out that I thought about it too simply when I started this. I thought, how many peripherals can there be needed for a boot? We just start with some stub hooks and then we implement some simple logic, vertical stacking. But it turns out there's quite a lot of peripherals needed to boot the phone. And it was less fun after doing 20 of them or something. So I thought, how can we do peripheral emulation? It's a really hard problem to do peripheral emulation if you don't know what the peripheral is. And it turns out that if you have really hard problems, you can just ask for an intern and they will do their master thesis on it. Okay. Yeah. So as Jeffrey told, we now have to emulate. Is it on? Hello, everyone. Yes, that works. So we have to emulate the peripherals, but we don't know what the peripherals are or where they are. So that's quite difficult. But let's first start with what goals we have. We want the emulation to be automatic. We don't want to have to do things every time we encounter a new peripheral or a new device. We also want the peripheral to behave somewhat realistically in the sense that, yeah, it should, the device should not go into an error state saying, oh, the peripheral was broken or whatever. And we also want it to be generic. So it works for one thing. It works for the other thing. It's nice. So of course I did a master thesis. So you have to look at the literature. And there basically are three different methods that are used to emulate peripherals. There's a hardware in the loop approach, which basically, well, just you have the hardware and you connect it to your computer that you're using for emulation. And every time the emulator requests some peripheral interaction, then you just pass it on through the hardware and then pass the value back. It works. The values are realistic, but it's not that scalable. Like every time you have a different device that you're emulating, you have to fetch a new device, buy it. I don't know. That's not too nice. The other method is to emulate it. And here we mean by emulates, just write those stops for every peripheral. If you do that well, which also is quite a bit of effort, if you do it well, the values will be realistic. But it is a lot of effort, especially if you have all those peripherals that Jeffrey just showed. So it's also not scalable. However, there is a third method that at least has the, or at least is scalable, symbolic abstractions, where you use symbolic execution to guess what's the correct value would be. But that might not be realistic. So this symbolic execution, what is that? Let's talk about that here. We have a simple function that takes two inputs and returns an output. And if we just execute that classically, normally, then we might start with a memory state here, a equals 11, b equals 5, and c equals 0. And in blue, we have the instruction pointer, and it just executes the code. It checks is a equal to 0, a is 11, so it's not equal to 0. So it goes through the else. It sets a to 4, and then it calculates c, which is 4 plus 5 equals 9. And then it returns 9 from that function. Pretty simple. But then symbolic execution, we don't have concrete values for a and b, but we have variables for us values where I'm using Greek characters for that. So a equals alpha, b equals beta, and c equals 0. And then we start executing. But then we have a problem. We have to decide is a equal to 0, but a is equal to alpha, is alpha equal to 0? So, well, for the moment, let's assume it is. And we'll also draw a nice graph saying, and a note that we assume that alpha is equal to 0. And we can just continue. We write 9 to b, and then we calculate c, which is 0 plus 9, which is equal to 9, and then we turn that. So we're done. Not quite. We also have to take care of the else branch. What happens when alpha is not equal to 0? Well, then a is a to 4, but b is still beta. So when we calculate c, then c doesn't have a concrete value. It has a symbolic value 4 plus beta. And then we would just return that. One of the main differences that you can already see is if we just normally execute a function, we just have one result. We have to reach result 9. But if we do it symbolically, then we have two results, either 4 plus beta or 9, depending on the value of alpha. So back to the peripheral emulation. If we use symbolic execution for the peripheral emulation, then we potentially have some problems. One of the problems is that there might be multiple states, and that often becomes very many states. So some of those states lead to successful emulation, and other states lead to bad emulation. They lead to infinite loops. They lead to the PBL thinking, oh, this peripheral is broken. Let's just sit in this infinite loop for a while. And another problem is that if you have many peripherals, which we do and many peripheral interactions, then we also have a lot of states. And we need to somehow remember all those states, so it quickly takes up a lot of memory. So the solution for these, or at least this last problem, is to only run symbolic execution just for a bit. Not the entire program, but only a few steps after the peripheral access, and then we probably will have already seen if the value that we have returned from the peripheral is actually good or not. So that's what I mean by concretizing the symbolic values. You run with this potentially symbolic value, for example, 4 plus beta for a while, and then after a while you check and you see, well, this seems like a good value. You pick a value for beta, say 0, and then you continue with 4 plus 0, namely 4, and then everything is concrete again, and you can just run the emulator. So that also brings another question. How do we concretize a value? What value do we pick? For that, I thought it would be nice to use different heuristics to add more constraints. And each heuristic, I called it a tactic, and every tactic defines a nice property, something that would indicate sort of success. For example, a nice property could be return 1 from the function that we're currently in. Maybe the function is checking if the peripheral is there, and it returns 1 if the peripheral is there, and 0 if it's not. Then the result of the read from the peripheral, if that leads to a 1 being returned, then we're probably returning the right value. And how do we know if we want to return 1 or 0 or something else? Well, we can just use first one, and if that doesn't seem to work, then we try another, and then another, which is to try that. That's also a nice thing that we don't have to be fast about this. We can just take a little more time. So what are these tactics? I defined three tactics, or infinitely many, but depends on how you look at it. I defined the dummy tactic, which is not that clever. It just continues to the next instruction. Many peripheral interactions are not really that relevant. You just write the value somewhere, or they read some value and store it in some other memory location. It's not that important. So symbolic execution is also quite intensive, so we don't have to have all this overhead if it's not necessary. So at first, we just try to ignore it and just continue to the next instruction. And then there's another tactic that's a bit more involved, actually uses the symbolic execution return tactic that also takes a value, value n, and it returns that value from the current function, or at least it tries to find a peripheral read value that returns n from the function that the peripheral read occurs in. And the step tactic, it just steps a certain number n states forward in the symbolic execution, and then it sort of judges the states and sees, well, maybe this state got into an infinite loop, and the others didn't, so probably don't want to be in the state that got into the infinite loop. So here is a nice diagram of how this part of the system works. We have at the top the analysis tool, which is a Ghidra. It provides the binary. Then on the left, we have the emulator that Jeffrey already talked about, which is based on killing. And yeah, it uses hooks, and instead of having hooks for different peripherals, we just hook the, pretty much the entire memory region. And every time there's a read from there, we say, oh, well, that's probably peripheral read, and we pass it on to a different component, which I called the read resolver. And there is another hook in there that detects bad states. The read resolver uses Z3 or anger, uses anger, anger is built on Z3, the symbolic execution framework, and it passes on the state of the emulator to the symbolic execution and instructs the symbolic execution what constraints it tried to find a solution for. It also chooses a tactic that I talked about. And finally, it pushes some things to the history, which is nice if you find that we chose the wrong tactic, then we can later backtrack, which is why do you have a hook for a bad state. If we end up in a bad state, we go to the backtracker. The backtracker looks at the history, and says, oh, well, maybe we should have come back to the previous state. It restores the state, and then a different tactic is selected. So how does this tactic selection work? It's pretty simple. We have a hard-coded list of tactics that is as tried. And the first time we come across a read, we pick the first one in that list. And then if it doesn't work or we go back, we backtrack, or we have 10 consecutive reads from the same address, then we go to the next item in the list. And if there are no tactics left, then we're probably in a hopeless state, and we might have made a problem or made a mistake earlier on, and we have to backtrack even further. I'll have to be quick about this. So this is a quick example of how this works. Yeah, so first we try the dummy tactic, which returns zero, which is fine. It just or is in a flag. Then we go to the second part of the code. First we return zero, but that doesn't work. We end up in the infinite loop. So we backtrack and we try to return zero. But this function doesn't return anything. So it gives an error. Return one also feels for the same reason. We step five things and we five steps forward and we find that, well, maybe if we end the value with two and we don't have zero, then and it's not equal to zero because we already tried zero. Then we should probably try that. And the symbolic, symbolic execution finds the value two. So we go to the next thing where we call a function. We first try zero, zero calls a panic function, which ends up in the infinite loop. So that's bad. Returning zero also leads to the panic function being called or well, returning zero doesn't work because we already tried returning zero. So returning one doesn't work because we end up in the panic function and stepping five actually does work here. And we return five and we not in the panic function. And then we have this infinite loop and in every iteration of the loop, it checks the flag checking to see if the peripheral is alive, probably, or done initialization. And every time we try the dummy tactic because well, we find this thing, but after 10 times, we find, oh, well, we're reading from the same address every time 10 times in a row. We probably should try a bit harder and return zero, which doesn't work because this function doesn't return and step five. And then we see, well, if we return zero, then we are in this infinite loop. We should try returning one. And then it works. So then I did the whole thing about evaluation, which combination of tactics is best. Well, most of them are roughly equally good. And then I did some tests with different programs to see, well, how fast is this? Not too fast, but it's fast enough. And we also find that there are, and I compared it to like the real hardware on our Raspberry Pi. And I compared it to the amount of instruction that were taken, well, fewer because our system exited out of a lot of loops early because they didn't have to do any initialization. But yeah, it was much, much lower. But most importantly, the data that is stored and returned, that is pretty much identical. And so that's nice. Also compared with another framework, which is, yeah, it was a little slower on these things. Data was nearly identical and the number of instructions was the same. Also not too important. Not sure what this slide is in your duplicate slide, I guess. There are some limitations. Sometimes we concretize the wrong value. And there's a lot of communication overhead, which is also part of future work. And we also have a summary, and we also have a live demo. So I guess we're going to try the live demo now. If we can switch the... Yeah. If someone could deliver a cellphone to us today. Yeah. Thank you. this. Yeah. Yeah, you can all say that's a solution for you. And then, uh... I can stand on that side. I hope I can see it. So, uh... Yesterday I noticed that in the description of this talk, I said there was a live demo. And then I told Luke, I read the description, and apparently we're giving a live demo. So, it's a bit tiny, but it really shows what this thing does. And it's definitely not a full-flag simulation. That is very... I'm very much a tool for reversing. It used to be called Firmulator. Luke didn't like the name. You maybe make it bigger? The text. That'll be it? That'll be all shit. I hope we make it in the time. So, first, we run it without the symbolic execution part. It loads all the memory mappings from Gidra, starts the emulation, and then it crashes, so let's stack them. Make it one smaller, then you can see the bottom, probably. And it says unmapped memory on the bottom. It's hard to see. Anyways. It crashes on an unmapped memory, because it tries to write, this is a Raspberry Pi, that's trying to initialize its video driver, and it tries to write to the video driver and read something back, and it fails, obviously, because it's not there. So, then we can run it again, using the part with symbolic execution. No one needs to matter. Now we run it again, and you can see that it connects to Gidra using the bridge. It loads the memory mappings that are there, loads the code from that, starts executing. But, unfortunately, now, is what you get. It's reading from two addresses, one after another. So, it's hanging, because it thinks it's not in a loop, but it definitely is in a loop. And that's the thing you run into here. Now you can set inside Gidra, you can set a command, just as I see you're in a loop, and a lot of what we do is just using this as a reversing tool, so we just want to be able to set things fast. And now we tell it in the right part of the loop, and it just returns here. It's good. And now you can see that everything went correctly. It has returned from that loop. Then it gets to the GPU initialization point. It should say at some point, yes. It sees that it has to return, command parsed, return value none. Now it returns out of that loop, tries to initialize the GPU, it maps a memory for that, symbolically executes the function for initializing the GPU. More returning. The demo effect is in full effect. The fun part about this is that it ends up initializing the GPU, but it gives you another error. I couldn't get this weird ass resolution that you wanted, and for us, that is the thing that differentiates it from a full blown emulator. It will not initialize it nicely, but it will initialize it enough for us to continue on our work where we are interested. Thank you, Jeffrey and Luke. Thank you.
Arm64EC: Microsoft's emulation Frankenstein
Alright, so time has flown. This is already the last talk for the emulator development room today. Thanks everybody for showing up. It's a crazy turnout. Today we've got Peter who's going to talk about a really interesting feature from Microsoft Windows. Is there a question already at the start? Let's get what's happening here. Oh, okay. Alright, so Peter has a lot of C++ experience and he can talk more about what he's going to do. So let's give him a hand. Okay, so first of all, why am I here at FOSDEM talking about some closed source Microsoft attack? That's what you're all thinking, right? So let's address that question first. So if you don't know me, one of my hobbies is hacking on LuaJet, which is a free and open source JIT compiler for Lua. And LuaJet recently gained support for Windows on ARM 64. Or at least I thought it did until this guy came along and was like, so do you support this other Windows on ARM 64? And we're like, wait, wait, what? You did what? So first I was horrified, then I was intrigued, and now I'm here speaking to all of you about what it is. So that went well. So hopefully I'm going to take you all on the same journey that I went through, kind of figuring out what this thing is, what it does, why it does it, and whether it does what it says it should do. Before we get into any of that, I'm talking about some Microsoft tech I do not work for. Microsoft, I'm talking about emulating Intel on ARM. I don't work for either Intel or ARM. If you know about LuaJet, you might have heard of Mike Paul. That's not me. Any of yous herein are my own, bugs are my own. If I'm wrong, that's my fault. Right then. Let's go into things. So we're going to do three kind of broad chapters here. First, we're going to have a general look at doing emulation of Intel code on ARM. I'm going to get really bored saying 64 all the time. So when I say Intel, I mean X64 code, and when I say ARM, I mean ARM 64 code, because otherwise all of 64 is going to get way too much. Then we'll look at this ARM 64 EC thing in particular, and then a bit of time about how LuaJet ported to this thing and whether that worked and how it worked. So emulation 101. You take Intel instructions like that one there, you turn them into ARM instructions like those three there, and you just do this for every single instruction that you find. How hard can this be? We've got this entire room to talk about doing this. One Intel instruction may become several, because Intel instructions are often more complex than ARM ones. If you're not familiar with assembly code, the square brackets here are memory loads or memory stores. In this case, they're all loads, but they could also be stores. I mentioned this because memory is complicated. Memory is what makes this more complex than it might look. Here are some of the things that I forgot to mention on the first slide. Let's start with memory ordering. If you have several threads that are all trying to work with memory at the same time, you can do cross-thread communication through memory, which on Intel often works. Intel gives you a very nice memory ordering properties, so you don't need memory barriers all that often. Whereas ARM is whatever, if you want to do cross-thread memory stuff, you will want some barriers in there to make it work. If you are trying to emulate the Intel code on ARM, you need to insert extra barriers that weren't there, otherwise we're going to introduce some concurrency bugs that weren't there. The annoying part here is that most memory operations aren't doing cross-thread stuff, but if you're doing an emulator, you don't know which instructions need the barriers and which ones don't, so you have to throw in the expensive memory barriers for almost every load and store, which is going to slow you right down. This middle question mark is saying, so memory is not just a big array of bytes. Memory is working out into pages, and those pages can have protections on them and other stuff, and all they make mapped to a PCIe device rather than going to RAM. You've got a question like, do you emulate an MMU and a bunch of things on it, or do you just pass it off to the host and let the host do whatever it would do? The final question mark here is flags. If that doesn't yet mean anything to you, that's good for you. Where can I get to the flag next? Because flags are a pain. Most Intel instructions, when you run them, they'll give you the main result that you're trying to get. They will also give you this array of six flags. Meanwhile, on ARM, some instructions will give you flags, and those that do only give you an array of four flags. I'm not a mathematician, but four is less than six, right? We've got a slight problem here. The question is, can we emulate the two that we don't have? Let's just run through all the flags. This could be a quick summary of what they are. We've got Z or ZF, just telling you whether the actual result, your main computation was zero or not. SF or N is telling you whether it was negative or not. Then we get to PF. Now, PF is great. Intel added PF in 1972 to give you the parity of the low eight bits of whatever it is that you were actually doing the computation of. Because back in 1972, you wanted a one bit checksum for doing modems and stuff. Intel being Intel, they've kept it ever since. You can emulate this thing on ARM. You just need to do a pop count of the low eight bits of whatever it is that you computed. If you know ARM assembly, you'll be like, wait a minute, there is no pop count instruction for general purpose rotors. We'll just gloss over that one. Then you've got the overflow flag or OF or V. That tells you whether any overflow happened during your computation. Useful for doing checks. Arithmetic and stuff. Then we've got CF and C, which is an extra carry bit in or out of your addition or subtraction. Fun point here is that there are two possible meanings for this flag in subtraction. And guess what? Intel choose one meaning, ARM choose the other meaning. If you're trying to emulate one on the other, you often have to flip the value of this flag to make them match up. Thankfully for this, ARM in ARM V8.4 added an instruction called CFINV, which for flipping the value of that flag added to make doing this kind of emulation easier. The final flag ARM doesn't have is AF on the right there. AF. AF is if you're doing binary coded decimal arithmetic. If you've never done any of that, good. Again, good for you. Intel thought back when they made these chips back in the 70s, BCD was a thing that people did. To make it fast, they added this extra flag that gives you the carry bit out of the low four bits of your computation because BCD uses groups of four bits. You can emulate the AF flag if you need to. We're doing a bunch of extra work to compute these things that we'd rather not do. A good emulator will try and work out when it doesn't have to compute anything at all or if it can defer the problem and hope that you don't actually need the answers at all. If you do have to compute them, there's extra work to do here which will slow you down. Start flags quickly. Next up, there are a bunch of existing solutions for doing emulation of Intel on ARM. Q emu we've heard quite a lot about here. With two flavors of Q emu, there's system mode and user mode, which boils down to in system mode, which will emulate an MMU and a bunch of devices on it, whereas user mode won't, so it's pushed off to the host. Therefore, Q emu user is much faster, but can't emulate as many things. There are a bunch of other open source solutions in the middle here. Starting with Justine Tini's Blink, which if you've not seen it, is part of a portable executable project for emulating Intel on anything. Her take is like, you know, we don't need a JVM with a portable bytecode that's used Intel code as a portable... Anyway, it does actually work. There's FeX emu that I'm not overly familiar with, but I think they're trying to be like Q emu user, but faster by only doing certain emulation of certain things on certain other things. I basically only Intel on ARM, whereas Q emu does everything on everything. Box 64 I wanted to mention because they pull this Q trick of saying, we will spot when you're trying to emulate a library that we've heard of, they're like, yeah, that's libc, that's SDL. I'm like, well, we won't emulate it. We'll just like, swap it out for our native version of it, which makes it faster because you don't have to emulate as much code. Obviously, the other big one that's not open source is Apple's Rosetta 2, which cheats by solving things in hardware. So, you know, this slide again, yeah, Apple solved this problem in hardware, this problem in hardware, this problem in hardware. So, you know, they cheat by adding extra hardware to their chips, and that makes that emulation extremely fast. Good for them, less good for other people. So, Apple can make a very appealing pitch to their developers, which is, you know, you can keep on with your kind of Intel code, and it'll like run fast on our custom hardware still. Or you can import it to ARM code, and it'll run even faster. I know Apple will port all of their kind of first party code, and the programmers in their ecosystem will do what they are told. Apple says you port your code, they will put their code. You know, the trade-off of working in a kind of Apple type system. Meanwhile, Microsoft have a far harder time. Like, you can target Intel, but it'll be dog-snow. Okay, not good. Or you can port your code to ARM, but you can't if you've got something like closed source library or plugins as part of your program. And this being that Microsoft's ecosystem, of course there are closed source libraries or plugins. So, yeah, Microsoft are in a really hard place here. And when I say slow, I mean slow. So, to give you a kind of idea here, I took the AllureJet benchmark suite, and I ran it on this Mac here as ARM code, 33 seconds. Fine, fine. Compiled it as Intel code, ran it under, was that a 2? 44 seconds. I mean, not great, but it's not a massive slow down. You can live with that. I ran a Windows VM on this thing, and the ARM version then took 37 seconds, which is a little bit slower. I'm not sure whether that's part of the VM or part of Windows slowing it down, or because Windows is running with 4K pages rather than 16K pages, same kind of ballpark. Then take the Intel version and run it under Windows' emulation, 106 seconds. Yeah, this is not good, not good. So, you know, you are someone at Microsoft, okay, so option one, I emulate the Intel code, it's too slow. Option two, I port it to ARM, is possibly impossible. At this point, I like to imagine some mad scientist at Microsoft, like, so, can we take two bad options, blend them together, and get a good option. Which, when you put it like that, seems insausable, but it turns out to actually work, surprisingly. So, that gets to part two. What is ARM64 EC? I, this crazy idea to get out of this error or to spot. Which is, let's let you port part of your application to ARM code. So, you know, if there are any kind of Intel bits that you can't port, because they're plug-ins or closed source, sure, leave them as like Intel code, but the stuff that you can port, you should port, and you can mix them all up, like, together, and allow you to like, cheaply interrupt between the two parts. And, you know, this is ARM64 EC. The ARM code is compactable with the emulated Intel code in a way that should hopefully work. Hopefully. So, that's the kind of thing. Big plan. But what does this mean? Like, how do we actually do this? Like, we're going to have to share the virtual address space between the Intel parts and the ARM parts, okay? We're going to need to share data structures out between the Intel parts and the ARM parts, okay? We're going to need to share call stacks between the two, fair enough. We're going to make things a little bit simpler by saying we can only kind of switch between Intel and ARM when you make a function call or you return from a function call. Or when you throw from a function and catch it higher up, that's, you know, painful. We're going to have to adjust how we do function calls a little bit to make this work, but ideally not too much. So, we're going to kind of delve into each of these points in turn. A shared virtual address space means, you know, you have all of your, you know, kind of address space and there's X people code in there and you have to know for any piece of X people code, is it ARM code or is it Intel code? So, we need an extra bit on every page to tell you kind of which one it is. I mentioned kind of doing cross thread communication earlier. Obviously, you know, our address space can have several threads in it. We're all trying to kind of talk to each other and any kind of Intel code running under, emulation still needs all of those extra barriers to be put in by the emulator, which will, you know, keep it slowed down but keep it correct. Meanwhile, any ARM code, Microsoft thought, just, you know, let the programmer that's doing the port put in the barriers where they have to be, which, you know, solves that problem at the cost of, you know, the programmer has to actually think. You can't just like recompile and not change your code and hope it works. So, all of that is kind of fine. Let me go to shared data structure layouts. Now, this starts off looking fairly simple. We say, you know, this is some kind of data structure. Let's make it compatible between Intel and ARM. Obviously, we can't change the Intel code, the whole point is we're kind of running free existing Intel code under emulation. So, we have to, you know, in ARM 64 EC mode, all of these types need the same size and alignment as ARM Intel. So, you know, longs are four bytes because windows doubles are eight bytes, pointers are eight bytes fine, function pointers, again, eight bytes. And this being why we needed an extra bit on every page to tell you whether it's Intel code or ARM code, because you might think, just like put it in the function pointer, but like there's no space. You have to make them one bit bigger to tell you whether they are Intel ARM pointers, which we can't do. So, we have to put that bit on the per page. But, you know, this all looks fine so far. Things get more interesting, though. So, if you're a C programmer, you'll know about set jump and long jump, which are C's equivalence to throw and catch. And there's this structure called jump buff that kind of tells you when I catch, this is the CPU state to kind of go back to when I do my catch. And you can pass the jump buff, you know, around, you know, you'll set it over here and use it over there. And in particular, in ARM 64 EC, you can do a long jump from Intel code to ARM code or from ARM code to Intel code. So, this jump buff guy has to be kind of compatible between the two. As I said, jump buff contains the CPU state that you want to go back to. But, like, Intel CPUs and ARM CPUs have different amounts of CPU state. So, that's going to be fun. To make it even worse, there's this Windows structure called context in all caps that contains the entire CPU state for a particular thread. But, again, you can pass around and do things with. And, yeah, so that guy has to be the same size as on ARM 64 EC as it is on Intel, despite there being a different amount of CPU state. So, this is starting to look a little hairy. So, what is all of the CPU state that we have to fit in to make these data structures compatible? So, we want a quick table here of, like, the user-visible CPU state on Intel and on ARM. ARM in one column, Intel in the other. I'm going to go through row by row to kind of go through them quickly. To, like, general purpose registers to start with, Intel has 16 of them, ARM has 32. You will notice we can't fit 32 into 16. This is going to be a slight problem. Next row is not so bad. We've got a bunch of, like, weird kind of edge cases. We've got RIP and PC. They're, like, the same thing. PSA and PSA because they're the same thing. The two FP things on ARM, we can fit into MX, CSR on Intel, but that much is fine. We've got the spare GS thing, which we'll come back to later. Next row is our floating point or vector registers. Again, Intel has 16, ARM has 32. 32 is more than 16. So, again, problem there. The bright sparks in the audience might say doesn't modern Intel with, like, AVEX 2 and AVEX 512 have far more registers than they're far larger. Yes, but ambient laters can't use AVEX or AVEX 2 or AVEX 512 because of patents. So, we're stuck with the kind of old, kind of, 16 of them and not only 128 bits wide. This final row is interesting because Intel way back added the AVEX 87 stack, which is 880-bit floating point registers. ARM has no such thing because this is, like, this old weird, like, legacy thing, but this is actually really good for us. So, our question is how do we fit all of the ARM column into the Intel column? So, let's start with the floating point registers and we'll say, okay, let's pretend ARM only has 16 of them. Problem solved, right? If you're writing ARM 64 EC code, you cannot use the high 16 of these guys. I mean, it'll come in a performance cost, but it'll make things work. The other way we had problem was with this first row, where we'd like to be not quite as extreme as throwing away half of them. So, we've got, like, 16 that we can fit over here. One can fit in GS and then 10 can fit down there. So, 16 plus 1 plus 10 means we can fit 27 of these guys in here somewhere. It works, it works. But we are still down 5. So, they'll become 5 general purpose registers that you cannot use. So, Microsoft said, okay, you just can't use X13, X14, X23, X24 or X28 in addition to the 16 factor things that you cannot use. So, this is the cost of making your data structures compatible between the two and it seems like a fairly high cost, but, you know, such is life. Moving on, we are sharing our call stacks between Intel and ARM. If you're familiar with Intel and ARM, your first point will be, so, doesn't like ARM put the return address in a register, whereas Intel puts it on the stack. Yes, we're going to have to fix that one up. The problem that you might not have noticed was that ARM requires your stack points to be 16-byte aligned when you use it for a load or a store. Intel merely recommends this very strongly. But doesn't actually check for it. So, you can very happily run with an only eight-byte aligned stack for a very long time and not notice that you've done anything wrong because it doesn't actually check for it. So, we're going to have to fix that one up, that one up too. So, a bit of work required to make these things work, but, you know, we can understand what that work is. And then we get to the actual meter things of, like, how do we switch between these two modes? We've made these things of compatible-ish or we've understood how to make them compatible, but how do we actually switch between Intel and ARM? So, if you're at use for assembly, you'll know what a calling convention is, which is, like, when we make a function call, where do we put the arguments for that call? Which registers contain when or what do you put on the stack when? And then, like, you can read these, like, long docs from ARM or from, like, other people about how to do this. And then there's a set of these rules for ARM, a set of these rules for Intel. We don't want to change those rules too much because, like, they mostly work, but they're not the same rules. You have to, like, put things in different places between Intel and ARM. So, we have to do some work to kind of fix that one up. And the work that you have to do will depend on the types of the arguments and the type of the return value of your function. So, we're going to need some kind of code for doing this work, and this code has to live on the ARM side of things because, again, we're trying to run Intel code that doesn't know it's running under emulation, so we can't change it. We can't add extra stuff in there. We have to add the extra stuff on the ARM side, which means if you're writing assembly code in ARM 64 it's like, how do I do a function call? This is how you do a function call. Step one, as you would normally put the arguments where they would be for a normal ARM call, and then, like, so, am I calling Intel or am I calling ARM? On the left-hand side, we've got the am I calling ARM column, and it, again, works like a normal ARM call other than this mystery box about exit points in X10 that I'm going to gloss over from now and come back to later. But other than that box, the left-hand column is a fairly normal function call on ARM. You put the results where they're meant to be, you call the function, you get the results back from where they are normally meant to be. The weird case is the Intel case on the right here, where we reduce some other things. Where we put the function that we want to call in the X9 register, has to be X9, and then we call an exit function. You're going to say, Peter, what is an exit function? And I will get to that in just a bit, but I want to address a different point first, which is this code has a branch in it, right? Everyone prefers straight line code to branchy code. But, like, we can get rid of these branches, mostly. You know, we'll have to have to do, like, both of these two steps, and push them up there, and then combine both of the calls. Because, like, this row, we're going to do a function call. We just don't know, like, where to yet. So, like, some kind of, like, conditional marks on where we want to call. And then we can make this whole lot straight line code. At which point, it'll look like this. You know, first box is the same. Second box is, like, we've pulled up both of the previous steps and just done both of them. This middle step is calling this magical mystery function from Microsoft. And then you do a call to somewhere. And then this last box is the same as previously. And if you're wondering, what does this magical mystery function from Microsoft do? You know, it turns this side back into this side. I mean, so, if you're reading assembly code, this is what you will see, but this is what it does. And now, I'll get to the previous point of so, what are these exit functions? So, they kind of fill the gap in. They kind of, the extra bits you have to do to transition off to arm mode. Which is we have to take the arguments that we carefully put in their arm places, take them out of their arm places, and put them into where they should be for the Intel style call. Which, you know, a bunch of work, but it's fine. And then ensure that the function that we want to call is still in X9. And then we call the next magical mystery function from Microsoft. And we have to do it in a special way. We have to put the address of this function in X16 and then call X16. Which is going to seem weird, but we're going to have to see why in a bit. And once the magical mystery function kind of comes back, we take the results from where they would live in Intel world, pull them out of that world, and put them where they would be for an arm world, and then we return as normal. So, okay. Next up, let's look at this magical mystery function. Which is this guy. So, first box in the top left. I mentioned that ARM puts return address on the register, whereas Intel puts it on the stack. This first box is fixing up that problem. Then the rest of the left-hand column is your kind of usual loop for emulating a CPU. You know, we get the next instruction. We do it somehow. Then we move to the next instruction and we do that one. In practice, there's going to be a far more complex logic in here. So, you know, like optimize stuff. And you're like, you know, jit compiler or AOT compiler or all sorts of clever stuff in there. But as far as we're concerned, as well as what it does, this kind of describes what it does. At some point, it'll say, wait a minute. You're now asking me to go back to ARM mode, because I've found code that's no longer like Intel code. We are doing some kind of mode switch. Now, I said earlier, mode switches, we've said they're only going to happen at function call or function return. Oh, go. If we've now gone from Intel's ARM, this is either a call or a return. And how do we know which? And the cheeky part there is that we look at the four bytes just before where we're going to start running and say, is this a call X16 instruction? Why is that the question? Because we have to call this magical mystery function as a call X16, which means if we just found that, it means we've just come back from the call that we were doing. We are in a function return type thing. And we set the return pointer to the code we want to run to, and we go to it. And this final column means that we're doing a function call, because the four bytes before where we're going to are anything else. And then we need to do, again, the opposite problem of where your return address wants to be. Is it on the stack or is it in a register? So we fix up that problem. And then we've set X4 to be the stack pointer. Why do we do that? Because the next step is to say we have to forcibly realign the stack pointer. Because remember that point where I said like, Intel code doesn't care about your alignment for as ARM does, this is where we fix that problem. And then we tail call X9's alternative entry point. Remember, X9 being the sting that we've inferred is a function tool that we're trying to make. So we do a call to almost that function. Again, you're going to say, Peter, what are these alternative entry points? So that's next. So every ARM function that could be called from Intel needs a so-called alternative entry point for handling when it is called from Intel. And it does all of the gubbins that have to be done to make this transition work. The only question is how do you find this alternative entry point, which is you put the offset of it in the four bytes before the start of the function, which is handy because we already had to read those four bytes to check whether they were that guy. So if they're not that guy, then they are the offset of this guy. And what is in one of these guys? Ignore the right hand column for now and look at just the left hand column, which point in the left hand column is mostly the opposite of what we saw earlier. We have to take the arguments from where they are in Intel and pull them out of there, put them where they should be for ARM land, call the real part of the function, and then take the results out of there and put them where they should be for Intel, and then call the next magical mystery function. The only interesting part here is this first box where we're saying if there are arguments that come off the stack, we can't read them from the stack point, we have to read them from x4. Why is this because of this forcibly realigning sp? You can no longer read your arguments from there because we might have changed it to realign it, but x4 tells you where they used to be. So that's fine. The interesting point is that the logic on the slide only depends upon the types of the arguments and the types of the return. It doesn't actually care about what the function actually does, and therefore you can kind of share these guys between multiple functions if those functions have the same type, which is good for like, you know, code sharing keeps your memory usage down, your iCash is happier, whatnot. But if you want to, you could write one of these per function, which point in the right hand side becomes an option that you could do, and then you can kind of skip the kind of calling in the other bits, just put the copy of your function in there, if you so wish. Okay, so next magical mystery point function is this guy. Don't worry, you've seen most of this side previously. It's all the same side except for the first box in the top left, for which you're going to ask, so what does this box in the top left do? What is the value of lr that we have up there? And if you trace all the stuff through, you'll see it's the same lr that we popped over on this side, which was the return address that we popped off the stack because we think the Intel code just made a function call, at which point what we're putting back in the top left is the return address to go back into Intel mode. So it all kind of works out. As you've seen this slide previously, at which point we have run out of magical mystery functions, so you know, there's a hard part over, that all kind of works out. So that tells you roughly very quickly what arm64 EC is, what the kind of code for it will look like, so next up, the kind of the lured part of how did I make lured work with this thing. So if you know lured, it's written in a mixture of assembly and C, and notably the interpreter is several thousand lines of assembly code, which is, you know, fun. So porting that code, that assembly code to arm64 EC, means that we can no longer use the versus that we said we couldn't use, they don't fit in the context structures. So we lose v16 to v31. That's fine, didn't use them to start with. x13, x14, again didn't use them to start with, not a problem. Unfortunately we did use x23 and x24 for various things, but because of what they were used for, they could be reworked to not require them with some almost zero cost tricks, so that wasn't too much of a pain. Losing x28, more annoying, that kind of required extra loads and stores to kind of split. In this regard, the jic and pilot was actually easier to port than the interpreter, because the jic and pilot could already just not use certain things, so you just had to add some things onto the list of what it can't use, and like, you know, it'll then just not use them. Again, there will be some kind of like perf cost to not using them, but it wasn't hard on the kind of porting side. Next up is handling these mode switches. So the C compiler will do most of the work for all of the C parts of Lerget, but again, it won't handle the assembly parts. So there are kind of three parts, therefore, that it doesn't really handle. One is the interpreter opcode for calling Ler API C functions. That one's fairly simple, like the, it's only one place, it can only call one type of function, and the type of that function is super simple. So that one's fine. Harder is the ffi. So if you're not familiar with the ffi, it's the Lerget's foreign function interface, and it lets Ler code call C functions of any type. Whatever type you want, it'll call it for you, and it'll just make it work. And you can also jit compile your ffi calls in most cases. I say with simple types, like most types are simple, so you can jit compile most of them. We also support ffi callbacks, but you can take a Ler function and make it into a C function, and then C code can call you. So again, it's like Intel code is trying to call your kind of ARM code that you created from your Ler code. You have to make that one work. That one's actually not too bad. 10, great, thank you. So the hard part is these two, just because they're kind of arbitrary types of function. So this is what I made Lerget do for interpreted ffi calls. So the good thing about ffi calls is that they are one shot calls. You give the ffi a pointer to call, and the type to use it to call it, and it'll go and do it, and it'll do it once. Because you're doing it once, you can look at the thing that you're trying to call it like so. Is this ARM code or is it Intel code, and just do the right thing for whatever it is that you're trying to call? Which gives you this nice, simple diagram. There is a slight problem with this diagram, though. So this is what we're meant to do. This is our slide from previously, and this is what we're actually doing. The right-hand side is just like we've inlined the exit tank, it's all fine. The left-hand side has a slight problem. I skipped over this box a while ago, and it's like we'll just forget about that box. And you'll notice that it's missing on this side, which will mean that certain things don't work. And the question is what doesn't work? So this is why I have to now tell you why this weird box is there on the left about putting a thing in X10 when there's no obvious reason why you need to do it. So let's answer that question, which is if we are making a function call and we are ARM code, and we might call Intel code, then we will need an exit tank. Hopefully we've now covered what those do and why you need them, and things of that mind, etc. If you want to do a function call, you need an exit tank, and to know which one to use, you have to know the type of the function that you want to call. Now, there's a particular subset of functions that don't know the type of the thing that they want to call. You might say that that's kind of weird, but let's just run with it for a while. And furthermore, these functions don't know their own type, also weird, but what they do know is that their own type matches the type of the thing that they want to call. Now this may sound like a somewhat contrived set of properties, but it does actually crop up enough in practice that it's worth caring about. So to let these weird typeless functions that don't know their own type and don't know what they're calling, to make them work, we give them an exit tank in X10. So if they are ARM code, they can just like, you know, run and say, well, whoever calls us put the appropriate thing in X10, and that will let us do the call that we want to do. So that means that if we end up calling one of those functions, and it then wants to call an Intel function, then this isn't actually going to work, but in practice it's actually fine. It's not yet been a problem. It could be fixed, but it's going to be like a pain to fix. You might ask, why is it going to be a pain to fix? That's because the FRI can call any type of function. So we can't just like preprepare an appropriate function for every single type of function. That's going to be way too many functions. So we have to jit compile the function that we want to use. And I mean, like really, like, can we just like not do that? Yeah, I just rather not do that yet. So I've skipped it. But it works. So, you know, great. That was interpreted, FRI calls. Then we've got jit compiled FRI calls. They're different because you will jit compile your call once, but then run it multiple times. So if we are jit compiling a call through a function pointer, we don't know whether that function pointer will be Intel or will be ARM. So we have to kind of do what we're meant to do more closely or almost do what we're meant to do. So, you know, we prepare the arguments as if it were a ARM call. If it ends up going to Intel, then we'll use an exit function to fix it up. We do the prep work that we went to do for the magical mystery function. All but again, I didn't want to jit compile an exit function for every possible type of thing that we might call. Because like, we're already jit compiling a function. We don't want to jit compile a function in addition to that at the same time. It just gets kind of hairy. So again, I cheated a bit and said, well, let's just write one function. They can handle like every case that can get jit compiled and just pass it the signature that it has to kind of pretend to be and just put that in some other register. And again, this will work fine in practice unless we hit the case of calling one of these typos functions that doesn't know their own type and it wants to call Intel and it happens to trash X15 that I've used to like, stash this X foot piece of state in. So again, not quite following the rules, but again, it works fine in practice. And then the slide that you've possibly all been waiting for, like, does this whole thing work? So you'll recall the first two lines from previous day. We said like, native ARM code ran in 37 seconds. The Intel code running under relation to 106, whereas the R54 you see code takes 38, which is pretty good, right? So we're kind of saying here, this is, it's native ARM code, so it should be close to 37 seconds, but it's making a combination such that it could call Intel code as and when it needs to. And making those combinations will slow you down by a few percentage points. But you know, you're in a much better place than you would otherwise be. And yeah, this crazy idea of Microsoft actually works. I can do one more slide or questions. Do you want to do? One more slide. Okay, great. So problems you didn't know that you had. Yeah, Linux has LD preload, which if you've used, you know, I want to let you know, change the malloc that I call or like, make F sync not slow. LD preload, great. Mac OS has LD, the old insert libraries, same thing, not quite the same details, but like the same thing. Windows doesn't have such a thing. It has ad hoc machine code patching. Yeah. And as a bonus point, Microsoft research used to sell a product called detours for doing this. Possibly like Microsoft research is only consumer facing product. Unsure. They made that open source on GitHub in like 2016. So you can go and find detours on GitHub and it will do 0.3. So you know how code lying around in your Intel code that expects to be able to go into other functions and patch them up. So to make this work, we have to take our functions and wrap them in a small Intel shell. So if you look at the shell, you're like, yeah, that's in the Intel code. I'll just patch that for you. And that's fun, right? So one of these magical mystery functions can kind of spot these shells and kind of skip right over them. But yeah, those shells are going to be here to make this thing work. That shouldn't be a problem in the first place, but it is because Windows doesn't have any of these systems. Bonus funds. Let's get back to here when we don't have to worry about the bonus problems. Okay. Great. Thank you, Peter. We have time. All right, let's do some questions. I'm going to start with one from online because otherwise we won't forget it. Can Intel code call ARM code? Oh, yes. Quick yes. Yes. Hands. This is the loop. Am I now trying to call ARM code? No. So I'm going to call Intel code. No, I'm trying to call ARM code. So we go over here and we go through the stack for calling ARM code. Yep, it all works. All right, one more time, hands, because I wasn't paying attention. I'll start here. How do you decide which code you can compile to ARM and which parts of the code you cannot and have to leave as Intel? So for the Luiget case, it's fairly simple because there's already an ARM version of Luiget. If you're going to write your own program, the advice is start with the hot parts and port those first. If that works, then you can slowly port more and more. I can get incremental speed improvements after you're porting more and more code. Over. Next question. Close by. Okay. Hi. Very nice presentation. Thank you. Hello. Okay. Yeah. Thank you very much. It was a very nice presentation. I was just curious what your experience is with the tooling support for these. What support? Like tooling support for these AVI, like the bloggers, compilers, what the support is like, if it's easy to use or. So yeah, the Microsoft C compiler can handle all of this fine. I think clang in LVM, kind of getting a few patches solely, but I'm going to be there for a while. The Visual Studio debugger for this stuff is great. You can single step through from ARM code to Intel code. I like not even notice that you've done a mode switch, which was kind of scary. Like, okay, single step, single step, single step, wait, what? I'm now in like ARM code. Okay, fine. So yeah, the Microsoft tooling is very good. The open source tooling not yet, not yet really there. So what I don't, maybe I've missed it, but what I don't quite understand is what I see here is the ARM64 AVI has been changed to match the Intel AVI a little bit more, right, to make this work. Yep. So how does that work when calling ARM64 Windows API functions? Do they have ARM64 EC versions of all of them? Yep. Wow. Yep. Yes, I have another question. It's a bit related to the question that was just asked about tool sense. Do you know other open source tool sense that support ARM64 EC, like GCC or maybe other GIT compilers? Yeah, that's my first question. Yeah, I've seen some patches land in Y and in Clang and LVM, but I kind of, I suspect they're all kind of starting to do things rather than like full support. Okay, another question and maybe, maybe I'm not sure I understood, but so you have LuaGIT users that want to call, do FFI basically with X64 code. So that's basically why you implemented the... Yeah, yes, most of your program is in Lua. Thank you. Any more questions? Oh, yeah, of course. Just, I think I didn't get, so you reduced the number of ARM registers, but wouldn't it possible to spill them to memory when you do the mode switch? Here's my cutout. So I'm going to run around. Here you go. Yeah, so it's, you can't spill them because you don't have any way to spill them to. Like if it was only the operating system that did mode switches between threads, you'd be fine. But you know, you can call such jump and long jump and there's like, there's not space in the jump buff to put the extra things. Or if you're really adventurous, you can do kind of user space scheduling in Windows. You know, you can call suspend thread and then like resume thread and like move your contacts in between threads. And you know, you could have Intel threads doing this onto your ARM threads. The Intel threads don't know that they're doing this to ARM threads. So you don't have any extra space to put the ARM states because they didn't know that they'd need this extra space. Yeah, I'm going to be running. We have somebody all the way in the back who's been waiting for a long time. Sorry, I didn't see you. I'm going to have to run back. I said a question to you. How do you deal with the red zone? I was like, why is he not answering? So short answer, Windows doesn't have a red zone in either Intel or ARM. So that's mostly fine. There was a related concept of home space for the first four integer registers in a kind of Intel call. And yeah, you have to handle that. So when you're doing your marshalling and remarshalling of arguments, you need to leave space for the home space as you would for a normal Intel call. So yeah, there is no red zone, but the closest equivalent thing, yes, you have to handle. Are there more questions? Are we? Oh, great. How long did this take you to figure out? Probably not very long. I mean, the documentation is pretty good on the Microsoft side. So possibly a week or two, probably. One more over there. So is there any way to call like regular, let's call them closed source ARM 64 Windows components, or is it complete separation? Completely separate. Any more questions? Oh, were you going to elaborate on the answer? Of course, yeah. I thought that was a really short answer. I'm just trying to save myself from running. Yeah, completely separate. So yeah, any kind of ARM only DLLs you can't call into, you have to have these special ARM 64 EC DLLs. Thankfully Microsoft have already done that for all of the kind of system libraries. So anything from Microsoft you can already call. But yeah, other code has to be in this weird mode to make it work. Any more questions? Yeah. Really making me work this year. Where were you? I was wondering, like, is there already examples of software that uses it because you can find it quite easily because the executable is like a different type that it uses? Because I, is there any like software, I haven't heard of this feature before, but like so far already using this major things? Yeah, so the person that opened the issue on the Lujo project is apparently using this thing. I mean, I'm told that most of like Microsoft like Office and similar are running in this mode so that you can have your Intel type plugins work. But yeah, apparently there's a user for this stuff of the Lujo thing and using it. The last question or was that the last question? Can we pass it? Sorry, what did the EC stand for? Emulator compatible. Am I stealing it from you? Emulator compatibility. That's what it stands for. If that was the last question, then let's thank Peter one more time. So am I still, yeah, with that, that closes up the Emulator Development Room this year. I want to thank you all for coming.
Opening Energy: Reimagining this Ecosystem through Open Source devroom
So welcome to the Energy Dev Room. We're all happy that you're here. For those who are speaking today, thank you for generating the content for the talks today. It was really exciting to see the proposals that came in. Just a couple, I guess like housekeeping rules of housekeeping. If there's an empty seat in the middle to make space for people who might be coming in, we're going to ask you to squeeze to the middle. It would be great if you could start already right now because I'm seeing some empty seats in the middle. We see empty seats, yeah. Squeeze together so there's people coming in so they can have somewhere to sit. Thank you. Anything else we need? Should we introduce anyone? Yeah. Who's going to introduce the volunteers? Yeah. Yeah, so the organizers for the room. I'm Rachel Tipton. I work for Open Climate Fix. We have Boris Dali from RTE, RTU. Anna. Yeah, do you want to introduce yourself there? Hi, I'm Anna. Yeah, one of the organizers. We have Dan from the Linux Foundation Energy. Kai from Everest Pionics. Nico, who's been managing all the things the past days. Thank you. Yeah, thank you all. Yeah, and also if there's speakers who have questions about the setup, you can in between the talks if you're a little worried about how you're going to get your computer setup. Feel free to approach us in the blue shirts or any of the other managers will be trying to help you. Okay. And by the way, there's been a small organizational change. The first and the second talk have been switched around, so don't be confused. Yeah. And have fun. Thank you for coming. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you.
EVerest: One stack to charge them all?
Yeah, I'm just going to talk a really quick presentation about Everest. First a few words about myself. My name is Kai. I have a background in computer science and robotics and I've been working at Pyonix on this Everest project since early 2021. So what's Everest? It's a complete software stack for electric vehicle chargers which is running on embedded Linux. It's released under the Apache 2.0 license. And the aim is to support many different hardware platforms and you can also build your own. Yeah, it comes with a lot of different modules already. So, you know, board support drivers for AC chargers, for DC chargers. It's already prepared for or comes with high level communication support. So we have Slack implemented, the Deanspec 7121 and ISO 508-2 and dash 20. There's OCBP 201 and 1.6 support with drivers for power meters for DC power supplies and so on. Yeah, the project is primarily written in C++17. There's also language support for JavaScript and Python and relatively recently we also introduced support for writing your own modules in Rust. Yeah, this is a, hopefully you can read the slide but it doesn't really matter that much. I'm just going to talk about it a little bit like the timeline, how this project came to be. So the first ideas on, you know, how to improve the EV charging ecosystem began in like the end of 2020. The company Pyonix, which started this project, was then founded in early 2021. And about a year later, Everest was announced as the latest Linux Foundation energy project with the source code being published in January 2022. And we had ChargeBite joined the technical steering committee and they also started integrating it into their charge controllers. In the beginning of 2023, we had different manufacturers of charging controllers and suppliers of chips and stuff like that launched several dev kits that are Everest enabled. And in October, we held our first little conference with like 100 people, the aptly named Everest Summit. It's always a bit of a mountaineering pun going on with some of the names. Yeah, and pretty much the same time we had the US Joint Office of Energy and Transportation as well as Charger Manufacturer Quello join our technical steering committee. And yeah, that leaves us pretty much here at FOSDEM 2024 and lots of exciting things basically planned in 2024 as well. And yeah, this is a slide basically showing a lot of the like ecosystem around it already. Like we have involvement from like academia from like enthusiasts just wanting to work on this, but also like charging station manufacturers, component suppliers and like standardization bodies as well. Yeah, then looking at 2023, that was basically the year where the project kind of like took off I would say. I held a short talk at FOSDEM last year in February and you can see like the stream of contributions basically increasing over the whole year, which was pretty cool. Lots of like pull requests to review lots of things to merge and a lot of like community engagement, which brings its own challenges with it. So it was a bit of a fast growing community. Like yeah, in 2023, we basically only had like a mailing list and at some point it was basically unmanageable because of all the traffic. So we kind of thought about how we want to like tackle this, how to make this, you know, sustainable for the future. So we thought about like moving to like a more like chat based solution to the Sulep chat and you can kind of see like the amount of messages kind of going down on the mailing list at the same time as you know, our active users on the chat system got up. So I think this is on a good track and we'll just have to see how this works out over the next couple of months. And yeah, with this introduction of the chat system, we kind of also created a new organizational structure to basically better engage with the community and manage this, you know, growth. So we introduced different working groups. One of them is, you know, focus on car communication. So I think ISO 1511A, Tadeimo and things like that. Another working group that I'm very active in is the cloud communication, which is mainly focusing on OCPP at the moment. Then there's one kind of talking about everything that's related to like the core of the Everest project itself, like builds tools and like the foundation of it, which has a bit of overlap with the CI and testing working group as well. And for everything there is not really a place for this general and Q&A working group. And yeah, what I find really interesting is that it's kind of like a multimodal approach. So we try to have like chat streams where people can ask questions and engage in like a text-based way, but also have like regular meetings, like video calls where people can also ask questions. And this seems to work pretty well. Yeah, let's talk quickly about some milestones in 2023. We had like set out a goal for monthly source code releases. And I think we more or less hit that goal. We had like 10 monthly source code releases in a year. We also just released the January 2024 release. Based on those source code releases, we also provide a Yachto layer for Kirkstone. And we're also kind of thinking about maybe a new release strategy going forward. So maybe doing releases every two or three months and focusing more on like stability of these releases. But this is still a bit up for debate at the moment. Some of the technical milestones of 2023, we worked pretty hard on OCPP 201. So the core and advanced security profiles of that, they are pretty much almost done. And some parties are already going into certification based on that code. And in general, there was very active development on OCPP in the last year, as well as OCPP 1.6, we also continuously improved that. And yeah, on the car side things, car communication side, we now have a pretty well-tested Dean's back as well as ISO 1511.8-2 implementation, including plug and charge. And we had the first successful charging sessions with ISO 1511.8-20DC. And yeah, to make all of this work pretty well, we tried to attend lots of different testing events. So we intended to OCA OCPP plug fists in Arnhem, as well as three different like Charin test intervals, which are focused on testing interoperability with ISO 1511.8. Some of you might remember this, like last year I talked about the open hardware that we also launched, like the end of 2022, early 2023, this year, Jaggen-Yeti board released under the CERN Open Hardware License Version 2. But I'm not going to go into any detail here. So if you're interested in that, there's like two talks I gave last year about this hardware and you can basically find everything that you need on the GitHub page as well. Just another cool thing, we built with this hardware, it's like a DIY DC charger. So we basically plugged this with a wiring diagram, very similar to this one that you can also find on GitHub together and use basically our AC controller hardware to drive a functioning DC charger. Another cool thing that we've been working on last year was, is this what we call the micro megawatt charger. This is a handheld DC charger powered by Everest. And what's pretty cool about this is it's a functioning handheld DC charger. So it started out with just an early prototype in early 2023, still in a box with cables and everything and basically ended up in something that fits inside of the box just for delivery. And what's cool about this is given that it's a functioning DC charger that is battery powered, you actually have voltage on the DC pins so you can plug it into a car and you go basically through the whole charging sequence with the car. Not just protocol testing but you can actually go to the power delivery and then most cars basically say, okay, I can't do much with one watt, I just stop. So why do we do this? It's pretty cool to just walk around on like these testables but also like on a normal parking lot, you know, with consent of the owners to just plug this into the car and generate log files and packet dumps and things like that. And we also try to publish these on GitHub after we recreated them. Then we worked a little bit on EV simulations so we got a small children's electric quad outfitted it with a CCS port, runs a hacked up Everest simulation on it, EV simulation on it. And I think it's one of the only children's electric quads that can charge on a commercial DC fast charger. And we have some more plans with that in 2024 so we want to have this EV simulation like natively in C++, include like an EV manager in there and basically extend it with ISO 15118-20 support. And there's a little bit of work going on on Tademo as well in the moment. And yeah, this brings me to the roadmap for 2024 and like no particular order. Like I just mentioned, the native EV simulation, we want to complete our OCBP201 implementation and start integrating OCBP 2.1 once the spec has been released. And there's going to be a lot of work going on on ISO 15118-20 so we have a C++ based XC parser and a parser generator in the works. We want to also include plug and charge there work on the AC unidirectional as well as bidirectional power transfer. And there's also a first lecture demo prototype for the charger side in the works. And yeah, if this sounded interesting for you, yeah, here so you can get involved. Basically you can find documentation and how to get engaged with the project, you know, like the mailing list, the group chats and things like that on everest.github.io. If you just want to look at the code, it's on github.github.com. And you can also find the open hardware under those two links. And yeah, I'm looking forward to your engagement, maybe contributions and thank you very much. We have about three minutes if anyone has any questions. Yes, I have information about to see the first is recuperation of energy when you are going down and the street is going down is the first question also the deceleration motor deceleration with trucks also in the system. And the other question is about this hardware or software for bicycle with electric assistance. Okay, I think the first two is mostly like on the EV side of things like proper EV side of things. And I mean, we are mostly focused on EV charges. But for bicycles, like, I think there's some work going on in some standardization bodies at the moment to basically specify like charging for small bikes for small like electric assisted bikes as well as, yeah, how do you call these things like the little motorcycles that are electric for the scooters and stuff like that. Doesn't look like it. So how much has the open hardware helped with the project in terms of contributions or say vendor adoption? I think it's really hard to quantify because people can just look at the designs and basically build stuff with it. Like our company, we had some orders for like finished kits of these things and I think we sold quite a few of those. But yeah, I think it helped. But it's more like, you know, we see it as like a dev kit that people can just play around with. And it's really not that complicated. I mean, especially like it's an AC charger. So you need some relays, you need a way to drive these relays. It's like a power meter on there, but usually if you want to build something for yourself, you don't need that. And then the high level communication board, this needs like a modem, like a power line communication modem to talk with the car. But only if you want to do Ice with an 11-8. If you don't want to do this, you can just leave this out as well and build something really, really simple. But for starting to hack around with Everest and all of these more advanced things, I think it helped. And there's definitely some interest there. Thank you very much.
Using FlexMeasures to build a climate tech startup, in 15 minutes
Welcome. Thanks for having me. My talk was actually about one o'clock this afternoon, but I'll jump in now. This is the right – am I too loud? It's fine. Okay. Well, I am Nicholas from Germany living in Amsterdam. I'm co-founder of CITA, Energy Flexibility, and we co-founded the FlexMeasures project. I will briefly talk about the FlexMeasures project. Last time at Boston, we also had to talk about some specifics. I like to introduce a project with some specific applications. So last year, we talked about our Vehicle to Grid implementation, where we use flex measures and home assistant. And today also, I'll go more on the developer perspective as a developer, you would actually work with flex measures. I only have 15 minutes, so I will fly over it a bit. Don't worry. I mean, let's not read every line of code. It's just to give you an impression. How would it be like? With flex measures as an introduction, we have been focusing on behind-the-meter optimization. So that's these other things you find behind-the-meter. So there's enough complexity to run an optimization and find the best running times for the things that are flexible here, which are usually EV charging, batteries. And today, we talk about hot water storage. These things are not exactly behind-the-meter, but they matter as well. In Netherlands, we have congestion on the grid that influences the optimization of what you're doing. It's a constraint and dynamic energy prices. So then, it becomes quite interesting as a problem. Right. So very briefly, flex measures is a platform that takes in a lot of data, like meter data or prices, all these things. And it gives you the best timing for your flexible assets as a very simplified picture of what it is. We have used it in a couple of areas, like I mentioned, bi-directional charging, in industry, in water sanitation, and now we're working on smart heating as well. Here's a little look on our dynamic visualization of what flex measures knows at any given time. So this is from the VEP UI of flex measures. You can replay what happened, what data flex measures knew, and what forecast it knew. But I want to spend 10 minutes, have this very brief tour. What if you were an energy startup? Let's say you work with smart heating, and you want to have the smart scheduling for your e-boiler, as an example. So these are things you would like to do. I will go through each of those. And I'll touch upon a couple of ways to interact with flex measures. You're writing your own flex measures, plug-in. There's a Python client, there's a command line interface, of course, there's an API. And I'll just, while I go through this list, everything will be touched for illustration, what are the things you can do. The brief picture would be that there's a house where there's the e-boiler, so your energy asset, with temperature readings. There's a flex measure server over here in the cloud. And all of these things are going to happen. So there's a little bit of an architecture diagram, but what we'll try to touch here. So the flex measures client will send temperature, it will ask the server to compute a schedule for the boiler. There's a data platform where we can get the prices. We'll have a cron tab because we will have to do some stuff just regularly. And let's keep that in mind. So this is the very first step. You don't have to read everything, but I'm just showing that we provide a cookie cutter template so you can quickly get up to speed, have your own code structure. So you choose a name and a description and you say, yeah, please give me the API blueprint. Blueprint is a word from the Flask system because flex measures is a Flask application. And you get some kind of boilerplate like this. And that's a boiler. This is the one endpoint we're doing here. What if we want to create a new customer for this project? This is a lot of code. This is basically the endpoint we wrote as an example. I'm not going to read everything. Basically, this is how you plug it in. It's going to be plugged in flex measures and available as an endpoint. We're creating a user and an account. And maybe this is the most interesting. So this is basically your business objects. I will go a little deeper here. This is the same code roughly. So we're creating the boiler as an asset. We're creating a couple of sensors. Here's two examples a bit bigger where we really define, we tell flex measures how to handle this. What kind of units are we handling and the event resolution and so that flex measures know what to do with them when data arrives. Schedules have to be made. And then if that happened, if somebody called this endpoint and your account was made and you would end up in the flex measures UI, you can see them here. Next step, let's say we measure the temperature locally. You have your own sensor and you want the temperature data to end up in flex measures as well. Then here's a small example how to use the flex measures client. Basically, it provides you with some nice code to work with more easily, but it actually uses the flex measures API in the background. For fun, we actually had the temperature reading in Fahrenheit, which we say when we send it to flex measures, the data is actually to be stored in Celsius and will automatically get it right. So this is where a lot of work goes, as you can imagine. But otherwise, this is just sending this reading. There's not much more. You'll do this regularly from your local script that runs on your Raspberry Pi, whatever you're doing there locally. One more step. So there's some external information we need. Temperature is a local reading from your local asset. Prices are a good example of information from some other third parties that just has to also be collected in flex measures. One other example is weather forecasts. In this example, I'm showing that we actually wrote a plugin for that. So we're cloning this plugin we wrote. NSEU is the organization of European transmission system operators, and they provide a data platform so you can get various things like prices, but also just a head allocations for all the transmission zones. And so we say we want the Dutch transmission zone. Please give me the prices for that. I'll talk and we configure everything. And actually then this is the command. So through flex measures CLI, this plugin has registered a group of commands, for instance, to import a head prices. Also, all of this is public how we wrote the plugin. So if you call this regularly, let's say one time per day, you'll have the next day head prices always in your system. Small visualization of one day of prices in the flex measures CLI. Excuse me. Okay, now I'm not sure how much time do I have. Eight minutes. All right, that's not too bad. But the main part now is you want to actually tell flex measures to give you an optimized schedule for your boiler. And here I'll show, I could do that via the flex measures client as well, but I'll just show how to use the API directly. This is not so interesting, of course, you have to have an authentication token. But I have to spend a bit more time here. A lot of time we spend when we made flex measures is how you configure the problem. How do you tell flex measures the constraints of the problem in the back flex measures will actually take your information about your setup and your problem. Basically, you could call that business rules, and really translate that dynamically into a linear program. So flex measures contains, I think three different algorithms, basically, we have one that's focusing on storage based problems. And that's what we also use for heat, heat batteries, we call them. We have one for, if you just want to allocate processes. But it's a very important part for developing a new application that you can tell the flex measures server, this is how I want you to treat this problem. Here's the constraint you don't know about, or here's a local thing you don't know about. And that's where we're working on two things, the flex model and flex context. So flex context would be, well, these are the prices that are relevant. We also have a project where we don't use prices, but we use the CO2 signal, the CO2 content of the grid that is anticipated. But the flex model is a bit more detailed. So this is not all the things you can do. But basically, wishing, well, the state of charge of this heat battery is this many kilowatt hours. So that's local knowledge you have. Here's some constraints. I can't go under this. We don't want to go under this. And also, here's a target for you. In the morning, I need to have this much energy content in my battery. I think this could also be a percentage. We're pretty flexible there. Some other constraints. You can see how these translate actually into constraints of a problem. And then you call our API to say, well, for this, the fill rate that I want a schedule for that, please start. And that will actually trigger a scheduling job. And then flex motors will usually pass this on to a worker. So we, in our implementations, we have a web worker and computation workers that will handle those. And then you can call this, get endpoint to check if your computation is ready. It will usually not be ready after three seconds, but soon after. And then, yeah, you get your values here. So then you can implement these settings locally. You can, let's say you ask for a schedule for 12 hours, then your local gateway has the plan for 12 hours. If there's anything that changes on the ground, you just ask for a new one. You'll update as we go. So that's general behavior. I'm almost done with, with a, you know, two of the force here. One thing we want to maybe do is in flex measures have a nice dashboard that has the most crucial data on top of each other for some inspection. And then, well, you can actually put that on the boiler asset. And then you, in flex measures, you have these nicely stacked, right? You want to see what you've been using for optimization on top. Although this comes from a different asset. This is something for everybody. All the assets can use this. And we use, as you remember, we had like four sensors or so that are relevant, but we just decided these two other ones we want to see. So we can easily see that in a period of low prices, flex measures has tried to, you know, fill the, fill the boiler at those times. Some signal here. I'll skip over this a bit because, yeah, I originally had a 25 minutes idea about this. Just as very quickly, we also noticed it's very important to also do some reporting. In flex measures, give some logic about that, that you combine some sensor data so you get the outputs of what happened, for instance, like costs. You know, that's very important. Sorry. And that can become a C like a minus well that you regularly say, okay, now the day has happened, we optimize as we could. Let's calculate how much energy costs we had here. So combine just the prices and the fill rate, which happened. But we also saw already that's that's many more interesting computations that people want. So this is a very simple multiplication. But we've made a pretty complicated architecture so you can actually have a lot of bring a couple sensors together for a new result that even can be used further in your next optimization or so. It's a very flexible system we've built here. And this is the project website. From there, you'll find the the GitHub, you find the read the docs, you'll find more information like I was interviewed for Python podcast where maybe I go into more detail. The mailing list contact, everything's there. You can also just write me directly, of course, if you're interested in doing something yourself and joining our TSC, the technical steering committee, everybody's welcome. And that's it. Yeah, there's lots of things to do, of course, I've touched upon a couple things, applications like vehicle to grid or smart heating and industry. But the roadway is still, of course, filled. There's so so much things in the energy behind the meter and a bit above to optimize. Thanks. We have time for question, then. If someone wants to ask one question, you said that you create a linear program. And what solver do you use to solve this program? What kind of solver? Yeah, we have we work with two solvers now. You could, of course, also use Cplex, but we've used two open source ones. All right, now they don't come to my head. Sorry. Hi. Yeah, we switched to that one. And we had a different one before that are both possible. So you can just those are shipped with a Docker image even so you can just configure that which one you want to use. But you can also we use pyomo as a representation for the problem. So everything that works with pyomo, which is you can use that as well. Thank you so much.
OwnTech Project: An open-source generic reprogrammable technology suite for reimagining the energy ecosystem
So, okay, we got the hard task to be the first one to speak, and so we failed. My name is Jean-Linéie. I'm the CEO and co-founder of OnTech, and today with Louis, we will discuss what we've done so far and what we are trying to achieve. So, we wanted to have a bit of a general introduction of how we see the energy and how it could become more and more open source over the years. The idea is that we see it as a pyramid with the bases being the power hardware and then having levels of sensors, real-time algorithms, industrial informatics, higher level in terms of communication, how we dispatch information from these devices on the field, what protocol we'll use, how we dispatch the energy among different power hardware, and then there is the highest level, which is like simulation, optimization, and modeling, forecasting, and so on. Today it's really exciting because if we look at what is it all about in this session, we have like plenty of amazing projects that are filling these pyramids, and it's really interesting because eventually we can reach that point where we have like the whole chains from the power hardware to the modeling, to the forecasting, to the optimization, through all the complexity as well of communication and protocols and so on. An interesting thing to note is that like the time constraints in the power hardware is not necessarily the same as the one for modeling and simulation for grid, for instance. So, the complexity associated with these things makes the informatics different, it's different fields between like the embedded world to the HPC and the modelization and optimization world. So, there is like an inherent complexity in the energy domain that is really interesting as a technical asset and thing to explore. And this is why I'm really excited today is that like in this session we are combining simulation, communication, hardware, and so it seems that we have already all the bricks and maybe tomorrow we'll build the pyramids. So we the energy people have the power to change the world and so I'm really excited about that. We'll let the floor of speech to Luis now. Thank you, Jean. And this pyramid is built with different bricks and these different bricks are hardware softer like Jean just said and hardware usually is hard until it isn't anymore, until somebody comes along and bundles the hardware somewhere and makes it ergonomic, makes it easy to use. That's what Arduino has done, that's what Raspberry Pi has done, Microbit has done it as well and they have inspired us to do that for power hardware. And that's what we have achieved. We have, there's a box there with one of our circuits and I'll pass it around a little bit later. We propose a community based compact versatile open source and low cost technology for learning and prototyping power electronics. That's the goal, that's what we want to achieve. The idea here is to create a technological sandbox just like Raspberry Pi, just like Arduino have something that is standardized, that is simple to use, that can be used by academia for teaching, can be used by industry for fast prototyping or for using in other applications for makers and fab lads to make fun stuff and burn it. And this is the place where we hope to foster new ideas and come up with new talents, people who are willing to build electric bicycles, people who want to build a microgrid, who want to understand how it works and put together the bricks and build the hardware upon which they can test their forecasting algorithms or test in their models. Now, how does, starting to get a little bit under the hood, how does power hardware work? If we look at it from the perspective of a functional analysis, the power is really the red arrow in the corner. And to get that arrow to work as we want, we have all these different arrows in the middle. And if we take a top down approach where we come, we did a forecasting, we did a simulation which allowed us to do a forecasting, which allowed us to do, to calculate an energy management strategy which we then send via dispatch through a protocol all the way to the target. And when it gets to the target, it gets here through the communication backdoor or frontdoor. And that goes into the industry informatics and the control systems which are operating in real time, locked into this micro or nanosecond level loop. It also receives measurements from its own embedded sensors, but these are not normal sensors that we come in interrogate via Laura once a week. Or these are sensors which are sending information at a one megahertz bandwidth which you are sampling at 50 microseconds or sampling at a very, very precise moment as well. These combined the control with the algorithms that are in here, they create the low level electric signals which then go there and trigger the power electronics for them to work the way you want them to. And then the loop is closed and the thing works. There's a little fiddler secret in the middle, never forget it, the energy has to come from somewhere so sometimes if that little fiddler secret fails, the whole thing stops. So everything kind of stands on the choice of the little component that you made when you put that little fiddler secret there somewhere. And what we did is that we got all the stuff, we put it into a board and you have all the different blocks which are somewhere bundled there together. But you don't have to understand to that level of complexity unless you want to. You see the communication coming in and the power going out, that's it. And that's the idea. We have two products. We have one which is a power product, the twist board which uses the second product, a passage to talk about. The twist board is a module which we can then either rack up together so we pick several twists, we put them together and that allows us to handle more power since they synchronize and communicate with each other. It's a linear progression. The more twists we put together, the more power we can handle. And we created a communication bus at the low level which can talk in CAN, can talk in NRS 485 so we can talk at the millisecond, we can talk at the microsecond and we can talk at the nanosecond with analog. So we have different bandwidths which we can dispatch with through different communication methods and protocols. And we have the spin board which I'll let Jean present you. So eventually in order to control power hardware so fast you need like some special embedded microcontroller. And this microcontroller has some real time constraints to it. So it's not a regular Arduino or Raspberry Pi that will do the job. If you want to have good performances you need really precise timers, really special communication peripherals. And so eventually we came up with designing our own board which is like the spin board. The spin board is both a piece of hardware that looks a bit like an Arduino Nano or a Raspberry Pi Pico. And this thing has tremendous resolution for its PWM signals, so the driving signals that will eventually drive the power stage, but also a really flexible acquisition of signals. So it will connect with the analog signals on the board. Eventually microcontrollers are great only if they work together with great ergonomics and coding a microcontroller can become either a nightmare, either a piece of cake, depending on what is the software and the ideas that you use to do so. So we wanted to comply with the maker movement mindset where you basically take a piece of microcontroller, you plug it with USB to your computer and you start coding in seconds and minutes. You don't have to install all the tool chain and so on and everything is done by the ID itself, so without the complexity to set and so on. In order to do so we use platform IO together with visual studio code so it's a really seamless experience for the developer. But also we have IO level of development that is possible for MATLAB for simulation people that want to deploy some control loops and control those directly in the target. They can do so through an IO level of graphical coding, let's say. And those are there is something from the Linux foundation, the FliarHartus that is providing a framework on top of which we've built APIs. So these APIs are calls that are basically making things seamless for the user so that you don't have to go through the hassle of the 2,000 pages of the microcontroller in order to program the power hardware. You have like high level functions that relates to the power world, so okay, what is the duty cycle, what signals I want on that MOSFET or directly related to the application. I want to increase the voltage, I want to decrease the voltage so I can go in my level of complexity in the language I talk daily and I don't have to go through documentation and things like that. So we have different APIs. One is the microcontroller API if you want to develop your own hardware, your own power hardware and control it through the spin board, you can do so. Or you can directly call another API that is built for the power hardware that we provide as well with the spin module. So this way you can call functions and not signals. And then there is a communication API, how to synchronize things with the surrounding world and task APIs to say okay, I want to dedicate that amount of time to do this calculation and that amount of time to do communication or higher level housekeeping stuff. And then there is a user code that is basically your main as in a not doing no experience, let's say. So this is the pinout. Of course everything is open source, so the hardware itself is a CERN HL license based. The idea here is to push people to share back the modification so that we can move on with a better and better hardware of the time. Of course all the documentation is Creative Commons, all the interfaces and the graphical stuff is GPL. And we have like a dataware and something that you can plug and see like the data live like if you were having a kind of a low bandwidth oscilloscope just by plugging your USB cable and gathering your data from the device directly. In order to make that thing happen, like we've created a foundation that is under the Aegis of the Seniors Foundation, so it's a National Council of Research in France that has put a ton of effort into making this thing a reality. So we got a lot of support from a public lab in France and this is where it comes from. And so the foundation is holding the IP. So if you want to contribute to this project, everything will be under a dedicated foundation that has strict rules to enforce the open source vibes of the project forever, forever one. And then there is a startup that is basically providing the hardware because if you want to develop things, you need someone to be able to provide the hardware to go fast basically. And yeah, so on the foundation side, we create tutorials, content, MOOCs and we make that thing available online. So we create an online space for that. We also coordinate a small embryo of community at the moment, but we hope that it will be more vivid and for some international collaboration around these fields of power for energy. And also we are starting to organize training sessions and events to answer local needs and the idea is to spread and to make things decentralized in a way that everyone can tackle its needs of energy with this kind of Arduino for energy thing. So to give an example of the first use case, at the moment we are working on a use case with a fully open source e-bike and in this e-bike you have inverters, you have battery chargers, BMS system in order to monitor all the cells of the battery, converter as well for the PV panel on the roof. And so we are collaborating with our great open source hardware projects such as Libre and Vosola and we are aiming at replacing all these closed source pieces of converters inside of this e-bike and make it fully open source A to Z from the smallest piece of electronics to the frame to the bike itself. So yes, that's it for me and hopefully Luis will be able to make a demo in five minutes. Yeah, maybe we can combine with a question. And how much? Sorry, can we buy the boards and how much? We started producing so we have our own pick and place machine so everything is made in France at the moment, assembled in France. So we have started assembly, we have shipped our first eight boards to a university in France for students. They haven't destroyed the boards yet so it's a good sign. And so we have pre-orders at the moment and so to give an insight of the price at the moment the power module is 300 euros and the microcontroller is 45, 49 euros. Can it be used in a full tour around the architecture? So yeah, to answer that really fast maybe I will come up back to that slide. So one of the strengths of the modular approach is that we've put a lot of effort into making different modules, being able to share power loads and share communication. And it's a good thing for fault tolerance because if you go modular, if a module fail you can think of clever ways of replacing the faulty modules with another module. Yeah, just one, an application is a complete autonomous, energy autonomous for electricity home with wind power, a little solar panels for the voltaic, also bicycle with electric assistance and so on. So also for with low tension DC for computers and something else and high tensions AC and also taking account of the day of the day, the battery charging with lead, battery charging with lithium or lithium, I don't know, something like that. So definitely off grid applications are key and also for energy independence and so on. At the moment the module that we've developed is DC based, so it's DC to DC. It has a really wide range of operation between 90 volts down to 10 volts also. So it complies with all battery technologies from 12 volt batteries, 24 volt batteries, 48 and 90, like 86 volt batteries. So it covers a range of battery application let's say. In the future our goal is this kind of grid applications and home and energy independence is to go for a microinverter basically and this will be made by combining different modules. So this one is a DC module and then we'll add an AC connection on top in order to cover this off grid applications and energy independence. There is one in the back. Could you create also some BMS or open source? So we haven't developed a BMS but that is already covered by the hardware from Libre Solar I think. Hello, it's a bit of an implementation question. So you are using Canvas, so you are using Canvas for now. Maybe it's because the automotive world is using it. I was wondering if you were thinking about moving to something like T1S. I'm not sure you're familiar with that. It's kind of Ethernet but with the CAN topology, so multi-drop. So it's really nice and kind of microcontroller friendly and IP based by thanks to Ethernet. So I was wondering if you were kind of thinking about it. So yeah, we thought about it because Ethernet is as great features but it tends to be costly. The idea is to go like to lower a bit down the cost of the overall communication architecture and so on. Yet we are making things like to the biggest extent modular in a way that if you want to plug a different way of communication you can do so. You can access all the pins of the microcontroller that we have. Maybe in the future we will make it think we will support different microcontrollers as well that will have more features and more peripherals but at the moment it's not planned. We have like two different things. CAN is for housekeeping and sending average data. And RS485 is for superfast communication. So we go at 20 megabits with RS485. So it's a bit uncommon but it permits to have like one cycle of control communication with different modules. So they can share one reference and a set point but also measurements among multiple modules still at 10 kHz control frequency for instance. No? Sorry, no demo. Can I convey it for you guys? We are here the whole day but the thing just crashed of course. Of course it did. Demo effect. Demo effect but I would like to just share something with you though. I can hear you online. But I would like to share something with you. Can we get into a... Yes. Yes. So we do have a... We do have a GitHub and what I wanted to show you is that on our own tech foundation GitHub there are sample codes, the examples of the data that we have. And in the example repository we have multiple different examples of how do we use the twist board at different applications, DC to DC, microgrid, AC. What I wanted to show you, the demo that I failed miserably to achieve, was the microgrid. So what is that supposed to look like if we get the peer to peer AC microgrid? We have the documentation. How to connect the boards together. And the communication that goes here. And these two boards then will work together to share power. In this case it's a peer to peer exchange. So one board is drawing power while the other is supplying. And this is actually data from the board itself. That means that we can ask the power converter to sample very quickly data. And keep in its memory and then we can retrieve it later. So we can do this kind of test where we get... Every point is about five microseconds apart so we can get a lot of resolution and see what's going on. It's offline because we do it after the work but it still works like that. And for the DC-DC side, same thing, we had the DC... Comes up. Okay. We have the different structures and different examples. They are there. So we invite you to go there, take a look at our Github. Take a look at the spin board. It's there. It's in Kaikaa. The twist board as well. And if you want to talk with us during the day, I have everything that I would need normally for a demo. And we can just sit down and do it together.
Enhancing OCPP with E2E-Security and Binary Data Streams for a more Secure Energy Ecosystem
Okay, welcome to my talk. We already heard a lot about the fascinating domain of immobility from the guys from Everest. While Everest is on the charging station side, this project is more what happens behind a single charging station. So all this back-end stuff up to the energy providers, distribution network operators and so on. And why is this important? Because in the energy domain we have a lot of safety and security regulations coming from the government. We have somehow to comply with it because in immobility it's not only important that you have IT security, you also need to provide energy safety because everything is connected via the grid. And so when too many people behave badly, we will have the next blackout. So nothing changed since last year, so I skip this because I only have 50 minutes. In the past, the immobility was quite simple. We had charging stations on one side and a back-end on the other side. And they more or less had been communicating. In the last couple of years we are HTTP web socket communication. So this is the client, this is the server, everything is fine. But now, the situation changed a little bit. We have no longer a single charging station somewhere at the street. We have normally multiple charging stations at one location. So it's quite useful to have some middle box which combines the communication. So that you save money when you want to communicate with the back-end. This is nothing new. There's a lot of vendors implementing it. There are even specialized vendors for this. It's already in the OCPP standard, but it's not really in greater detail. It's just mentioned that you could do it. We want to dig deeper into this problem and see what we need to realize in this. Next thing, when you have this middle box, it's very natural that you add additional stuff to this. So you not only want to combine the communication channel, you also have specialized energy meters which are now located at the grid connection point. So monitoring the grid connection point, the idea behind this is that you can do a local load management because you have only a limited capacity on your grid connection but want to share it between the charging stations and somebody must be in charge of how to share this energy. There are other projects who do the calculation for this, but this is the communication part. And here, for the first time, if you're a German, people, you know, this fascinating world of smart meter gateways, which is more or less specialized hardware from the federal industry for security in Germany, which regulate this area because energy, as I mentioned, is a safety critical infrastructure. So they try somehow to improve the situation that most vendors don't care that much about security and safety. The first problem would have, because I said we come from a very simplified view on this problem, where the connection from the charging station to a back end. So because of limitations of the OCPP protocol, we at the moment duplicate every connection between this charging or communication aggregation box and the back end. This is not only perhaps a design flaw, which nobody cares about, it's also starting to become more and more a security problem, because the only security we have is HTTPS, so transport layer security and transport layer security and in this box, here you have another transport layer security, so you have a split communication channel. So your IT security is no longer given, because this could be a man in the middle. It's getting even worse, because now we have specialized companies who sit in the middle between your charging stations or your aggregation box and your back ends or even multiple back ends and want to do analytics for you, because normally the charging station management operators or vendors just manage charging stations. They are not that much into analytics. So very often they even those sit in the middle and then you realize, okay, now the problem is getting more and more complicated, because people who only are interested in Excel sheets sit in the middle of your critical infrastructure and maybe they not only analyze what you're sending, possibly here could be sitting Mr. Putin and send commands back, because you have no chance to stop him. So the first thing what we want to have and this is also nothing really new, we want to share this web socket connections. For this we need to adapt a little bit the OCPP protocol. There's already an internal draft how you could do this, but when you look closer at this draft, so internal means internal in the open charging alliance, which is the organization managing the OCPP protocol. You see there perhaps a couple of drawbacks. The first thing what we obviously need, we need to add some additional looting information, so that we know we are sending from this box to that box and my idea is or my proposal is we can do a lot of interesting things if we copy this good old concept of the record route taken, which is also an IP version 4 optional option and so we can implement this much more user friendly. Next thing is in the OCPP internal draft we have more or less source routing, so the sender includes the path through the network into the request. This is well known, it's a valid way to do it, but it has also a lot of limitations because when network is changing very often you have a scalability problems. So it's much more logical to use a normal routing table in every box. You can use this typically outdoor learning what you know from other net switches, which also on learn which communication partners on which port and implement it more easily. Now, but it's getting more and more worse because we are in a modern world. A charging and stage management system today is no longer a monolithic thing on a notebook somewhere in the Netherlands. It's a highly complex system of microservices and these microservices are even from different operators. So we have very very often complex systems where the asset management, so which charging station is located where and coming from which vendor is within an SAP database, then you have another database for all this real time energy measurements and so on and so on and so on. So you realize okay now we have a bit of a problem because we have a critical infrastructure, but in the back end we have a multitude of loosely coupled systems without much of security. So the traditional also PP security model is also no longer sufficient here. For this very simple, it would be nice to have digital signatures. Again, there's an internal traffic and the open charging alliance, but this had signatures on the transport part of also PP. So it's limited to also PP, but it would be much more interesting to have it on the other PP messages itself, because then we can send end to end messages and end to end means in this case, from the EV to the energy distribution grid operator or to the EMP or to the smartphone of the driver and so on and so on. We will later see a lot of use cases how to make use of it. What do you, when you want to have signatures, the next problem is you read. As usual, you reduce the complex problems onto a key management problem. So you need something like signature policies to define who, which signature is valid, which signature should I use, which signature should I verify. When you have this signature is implemented, you can extend it to user roles because at the moment everything in also PP is more or less one user. You have no differentiation of this communication partner is only allowed to set energy commands. The other one can also change communication parameters or whatever. This can be implemented using the signatures. And last but not least, at the moment, also PP is only using the text frames of HTTP web sockets. But there are a lot of useful use cases for binary streams, especially when you look at firmware updates or look for downloads because this is at the moment external HTTP requests. And this makes your network security more complicated. So when you would integrate it into the also PP protocol, you could close down your network and only allow also PP communication and improve the security. So nice, all this little details, but what are the real use cases for this? So in Germany, we have since the first January of this year, there's a nice new law that your energy provider can send you messages. We are a highly regulated infrastructure to reduce the amount of energy you're using because we want to renew a bit of energy and so on and so on and so on. But it's external additional hardware. Why not use the existing infrastructure for this? The reason? Because it's not secure and safe enough at the moment. Would it be secure and safe enough if we could perhaps talk to these guys and say, okay, look, we have now improved our infrastructure why don't we remove this additional hardware? The same is in the same law, we have a possibility that an energy provider can get your measurements. This is again a regulated use case. We would do this with our normal also PP infrastructure with charging tariffs, charging tariffs coming from immobility providers or someone else. They should also be assigned secure data which is immutable and then used in OCPP. Good part is that in the upcoming or in the next version of OCPP, there will be some support for tariffs, not yet end-to-end signed tariffs, but at least half the way. There is this interesting use case where you want to pay for your charging but in an anonymous way, so you don't have an account somewhere, but you pay with your smartphone. In the regulation, they are talking about QR codes. Wouldn't it be interesting that you use this QR code to get something like a direct communication channel to this charging station over all this complicated infrastructure, but it's secure so that you have something like a remote control because nobody stands even not even for 20 minutes in front of the charging station just to look for tapening. They want to have it on their phones, but for this you need a secure channel. The same idea, but for another user group, is the charging station operators on the energy people also often don't know what's really going on on this charging station because it's very limited for the consent over the wire at the moment. They use a lot of AI to invent what might be happening on the charging station, but in reality it would be much nicer if we would have something like this digital twin idea just send everything what is important somewhere where it can be analyzed, but again we have no secure infrastructure in the middle because every shitty marketing company could manipulate our data. Better German calibration law, that is my favorite topic, but we had this already last year. We have national contact points who want to collect all this data and statistics about your charging station infrastructure, how good or not good it is. No security, no privacy at the moment. The same problem as usual. The really biggest problem is this is on the street. Yes, more or less last slide. This is on the street. There's also on the street, so no physical access security here. So even when we have encryption signatures, we cannot be sure if not somebody sending us a lot of crap. Okay, it's a bit harder to manipulate a lot of charging stations on the street, but if you're put in, probably you would try it anyway. So how or what can we do to analyze it here if this is a valid request or a valid information or not? And I try to my best to get it into the OPP standard, but also at the Open Charging Alliance, we have a normal problem. There are many leaching companies and not so much real contributing companies. So if you find this use case is interesting, if you think this is interesting for you, for your company, for whatever, feel free to contribute to this project, feel free to become a member of the Open Charging Alliance and help us to get it out on the street. Thank you so much for your presentation.
CitrineOS
Hello? All right. I'm going to start my talk. My name is Christian. I'm a software developer at a company called S44. We make software for ChargePoint operators and mobility service providers. So basically the cloud space of the EV stuff. All right. And today I'm going to talk about OCBP implementation and the clicker works. It worked a second ago. All right. So if you take a look around at chargers and charging networks, what you'll find often is a broken charger, a charger with a black screen, and especially payment terminals saying oops. I found a study from 2022 that said in the US that 75% or less than 75% of the chargers weren't working or when the users came up, they couldn't get a charge started. So now governments have gotten involved, right? There is uptime guarantees in the UK and the US Neve funding also relies on uptime guarantees. If I remember correctly, I think the off here also has uptime guarantee, but I'm not 100%. And so the most recent thing I found for the US and the company I work at is mainly US based. So that's why there's a little focus there is that in 2023 broken chargers was like the major concern for users to use public infrastructure for charging. And then maybe most importantly, Reddit users are super unhappy. I think some subreddits even banned talking about broken chargers because they were really annoying. And I'm going to click. All right. So one thing that we found or our thoughts on why this happens is a lot of proprietary implementations. So you can see Dolly's interpretation of OCBP proprietary stuff. So if you're not Tesla, which owns the entire vertical, right, they know what's happening at the charging station in the car and in the cloud. Then what do you do? Well, what happens right now is there is a bunch of different vendors. Wherever you sneeze in the EV charging cloud stuff, there's a different vendor. And most of them don't really share what's happening under the hood, which results in, well, a bunch of uncalled for behavior unknown what's about to happen, especially later in the field when it's a user interacting with something and you don't have known input. Then of course, we have OCBP 1.6, which leaves up a lot of stuff to the imagination on when which message should be sent. And then maybe the CSMS thinks, well, I'm expecting ID token now, but get some other message. But one thing that I think is one of the most big problems with OCBP 1.6 is around monitoring. So right now, each hardware vendor builds in their own obscure monitoring messages. And if you want to integrate with like five different hardware vendors, well, then you have to work out how to understand all five different messages. And that basically mean the same thing. Well, that leads to broken parts in the fields and no one knowing about them, which then leads to Reddit users being angry because the charging station has been broken for like a week. No one really noticed. Thanks. All right. So what can we do to improve the state? Well, OCBP 2.01, I think is already a huge step in the right direction. You can see Dolly thinks as well. OCBP 2.01 winning strongly. One thing that I really like about OCBP 2.01 is it has a lot of use cases and it's super structured and you can build your test cases on them. And then of course, there's much more monitoring around the device model that helps in identifying, oh, there's something about to go wrong with the charger instead of just it's broken. But that still doesn't help with transparency. So if everyone just reinvents the wheel once again, just like with 1.6, well, you're still going to run into different interpretations. So we think there should be something that's open source, that's transparent, that you know what's happening under the hood. And we hope that with something like that, there is better cross compatibility between different vendors and the CSMSs can easily integrate with a bunch of different hardware vendors. And next one. All right. So we looked around. We didn't find something that we were super happy with. So we came up with the project Citrine open source. It's written in TypeScript. I know in this room that might not be the most popular choice, but on the internet it is. So that's why we went with it. It runs on Node. We have a API based modular architecture. So similar to what Achim was saying, there's some microservices and you can set it up that, for instance, transactions is super scalable, but maybe provisioning is maybe not as needed. It's released under the Apache 2 license. And most recently it's been adopted by the Linux Foundation Energy. And it's in their hands now. Yeah. So in general, we think OCPP shouldn't be like something that everyone works on once again and again, but like a stable cornerstone that you can adopt, that you can drop into what you want, where you need it. Because the messages are there, the protocol is really specified and redoing the same thing. Well, I can spend my time better. So taking a quick look at what we envision for the system architecture and how it works right now, going from the left to the right. So charging stations connect via WebSockets to the central system that helps us with scalability. You can have a bunch of different instances of the central system that manage the individual chargers. Then we publish on a message broker. What was important to us is to have our underlying technology kind of agnostic. So you can set up Kafka, you can set up PubSub, whatever you want. Just like with memory cache. So you can use your address in memory cache. At least that's what we've implemented for now. And then you can adapt whatever interface you want. And for relational databases right now, we have it hooked up to PostgreSQL, but you can set up whatever relational database you want. Then comes down here, the maybe more interesting part. So we have our modules. And like I mentioned, transactions is a big one. Most of the bandwidth goes there. So we set up the modules based on how much we think they're used. One second, one back. One thing I forgot to mention is we use Fastify as the web framework to interact with our setup. All right. So looking one step further under the hood, we have a JSON schema generation JavaScript that we take the set up, the part three of the OZP spec and use that to validate all incoming and outgoing messages. And we generate our TypeScript interfaces out of that. Then to run, for the implementation of the modules, we work a lot with decorators and metadata on which decorator is used for which message. And that's how we route the messages within the modules. And then one thing that I think is quite nice is that we have some open API documentation that's generated. And you can easily try out some OZP messages from the REST API. So you can either use the API generation, click try, or use postman and just straight up send OZP messages that then get forwarded to the charger. And our system does the interaction with the charger for it. All right. So then looking up and looking at a UI, so right now we've hooked it up to Directis, which is an open source project that gives you some nice UI on top of a relational database that helps with keeping it simple. But you can go crazy on it. You can build your own flows in Directis and do whatever complex things you want. For now, we have it set up so that we have a little testing set up with our app that we whipped up to try charging. Yeah. All right. So where are we at right now? So a few days ago, we released the 1.0 version that goes through the OCTP protocol's testing cases of core and advanced security. We're quite happy that that's working. It's been working for a while, but we only got to release recently. Then right now we're under development is the advanced device management and advanced UI. We also have a few other people that we're talking to about integrating some payment and just general, we've generated quite some buzz with people that would like to add some modules or add just on functionality. And so moving forward from there, we're looking to ISO 1511.8 support. And hopefully in July, that's what we anticipate is that we have the full OCP 201 implemented. And then for the future, of course, similar to what Ahim was saying, well, you can build on your BI tools or whatnot. And we hope that this is a nice interface for innovation on top of and not that you have to hook yourself as a machine in the middle or something similar. And I'm really happy that so many people were interested in this topic. So maybe you also want to contribute. We're fairly fresh. You can find us on GitHub. The top right is QR code to our Citroen OS core GitHub page. The first technical steering committee will happen on March 14th. So get involved, join, bring ideas. And we have a Discord server. So drop by and ask questions. Sometimes we're fast. Sometimes we're slow, depending on our workload in responding. All right. Does anyone have questions? One simple question. We all know every vendor does its own shit. On the other hand, you generate everything from the JSON schema. So how do you implement extensibility? When a message or an unknown message comes in, do you drop it or can you handle it in a smarter way knowing, okay, it's coming from this vendor and therefore I should interpret it somehow? So right now I believe we drop it. Our major taste has been the Everest. And they send normal messages. Am I in the wrong spot? All right. And for the detail on how it will be handled in the future, I'll get back to you on Discord for that. I got to check with a few people on what's happening, what's going to happen there. So you said you can make an API call and you send the, for example, start charging message to the charger. So do you use like then you get the API call, you use Kafka or something and then from Kafka it goes to the charging station? Okay, that's very cool. I'm also doing that. Yeah, exactly. I've seen implementations where they are just white. I've seen implementations where they are just white like a flag into a database that is like very, very important time. And I think that's very ugly. And I think like message brokers are very elegant solution. Yep, we agree. Okay. With message brokers and 15118 you have very strict timing. How do you ensure that your message brokers not too slow? I got a, I got a pun that one. I'm too nervous for that right now. I'm sorry.
Power Grid Model: Open source high performance power systems analysis
Hello everyone. My name is Natesh. I work as a scientific software engineer at Al Yander. I'm also a developer at the Power Grid model project on which I'm going to give a talk now. So it is a high performance distribution grid power system analysis library. Yeah. And the next slide. Oh. Oh. Yes. So in this presentation I'm going to mention why do we need this project? How did we come across building this? So what does the library do? And how does it perform compared to other solutions that are already available in the space? And how do we use this within Al Yander which is Dutch DSO for its own products and applications? There's also some talk about open source since we are open source and we would like new contributors as well. In a traditional way up until a few years ago at least the power system analysis used to happen within the DSOs, DSOs in this way. The electrical engineers would usually have some data files where they run the calculation in a GUI focused software where we have built-in presets for running the calculation and we get only certain results and then we make decisions on whether to add a new transformer, add a new cable and such components within the grid or not. If the grid can handle more solar panels, if the grid can handle more EVs or not was done using this way. But now with the new smart meters and EVs and renewable energy we have to do a lot more and for that we have to have all of the data of the smart meters which is in a really huge volume in a database where also lies our topology and electrical parameters and then we cannot just use a custom, we cannot just use a preset of the calculation method. So we have to have some customization available over there and then we have to do the calculations in the cloud because these calculations are in the set of millions now because we are trying to simulate the entire year for example of time series and the volume increases a lot. So why did we decide to make this and what makes a good power system analysis library? So around I think 2018 also Alianthar faced a problem where we were not able to do this using any of the open source software or the commercial software. We faced these pain points actually and then we decided to make the library which are focused on around them. So we needed a well-defined software API. That's because we want this calculation library to be part of a really bigger application which does a lot of things apart from just calculations and we also wanted this library to be cross-platform and scalable so that we can use it within the cloud. And of course since the volume is in millions, high performance and parallelization was needed otherwise you might have to wait for a month or so to get results which is not adequate and if it's in cloud it costs you money as well. That was in 2018 by the way and after that at that point our power grid model was in our source within Alianthar. We had some applications in 2021 then we made it open source at around 22 and we do have a lot of applications now which I'll cover soon enough. What the library does? So it does some calculations especially the power flow calculations, state estimation and short circuit calculations for both single phase and three phase grids. We have many algorithms with which we can do this and these sum up the calculation functionalities in a really short way. We have a huge focus on the software side of the library because of the pain points that we did mention before. So we have a native shared memory multi-threading and that enables us to do the parallelization for batches and in as many cores as possible when we do deploy it in the cloud and yes the implementation is in C++ and the API for the users is in Python if they wish to use it and it's well documented, it's quite stable and then we have the binaries available in PyPy and Anaconda for the Kondaforge and we have support for Windows, Linux and Mac OS all three of them. And since making this library is not just enough we have to show that these calculations are actually correct as well and for that we have done the validation of the library against some theoretical hand calculations at the start. Then Vision and Chaya which are commercial software and also PowerFactory, we validated the library against them and PandaPower which is another open source library. So we validated against these software and then we use them as a reference for each new revision of Power Grid model. So it's part of our CI pipeline if any of the new features do not comply with it, it won't, yeah that should not be worse than. How does it perform compared to other libraries because yeah there are a lot of libraries within this domain. We have some more presentations now as well about them and each one has its own specific plus point and the plus point of Power Grid model is its performance. For the performance benchmark the link is in the presentation if you wish to do the benchmark yourself. We try to compare it with PandaPower and OpenDSS to get an idea of how it performs and we found that the performance in case of PandaPower is almost 20 times of their calculation which was a huge boost and will really help in doing these calculations much faster. So these were the symmetrical calculations and the asymmetrical calculations is where Power Grid model shines as well because we as a distribution, I mean when it started as a distribution analysis library within Alieander this was really needed at that point. So the Newton-Raphson for PandaPower is around 100 times and with OpenDSS we have to compare it with iterative current that was four times faster than that library. We have data conversions as well because we don't have the best data model to store it and hence we have we have conversions to SIM and other softwares that are used for power system analysis. SIM because we can then integrate with other applications throughout this ecosystem. And we currently use it within 10 plus applications within Alieander so it's a mature project at a production grid and yeah there are many applications grid planning, automatic network design, automatic network design, monitoring asset allocation and congestion management. Since I do have some time within automatic network design for example we try to forecast what the effect of the grid based on the EV growth will happen in the coming 30-40 years, EV growth and the solar panels and based on that we simulate this and then we identify the bottleneck, add the cable, run the simulation again and in this automatic way we design the whole network. That's what this application does. There are actually multiple congestion management applications as well. So one is the active one with which we do real-time congestion management. We take in the measurements from the previous 48 hours and predict if there's going to be a congestion in the coming 48 hours based on any plan maintenance if there is any and the other type of congestion management that we also do not present here. It's on the assessing the measurements of the entire year of this past year and then what would be the congestion in the coming year and based on that we might offer new contracts to our customers because the grid in Netherlands is highly congested right now. We have a lot of people waiting for new connections but we can't add them and hence power grid model really helps in making all of these calculations. For the open source you can just use the library and provide feedback. That's a great contribution in itself. Report any bugs as well. That's really helpful too and you can also do the validation for the library with any test cases of the 80 cases that I mentioned. You can provide more and validate the library. You can improve if you have an idea for a new way to make the API. You can suggest that too or you can also add new algorithms and make the code more efficient in the C++ code. That's also possible. We have a list of good first issues within the repository too if you wish to have a look. We have a few partners. There are DSOs, TSO research institutions, universities and other open source projects as well. The DSOs do use them. Aliantha does have those products as well as an access and study and are also trying to add to their operations. That's all from me. Do we have any questions? Hello. Thank you so much. This looks really, really, really cool. I have one question. If I am running a project, hello Chris Adams, Green Web Foundation. If I have a new project on to build a big solar farm or put a 100 megawatt data center somewhere, can I use this to model how I might integrate with your grid to say this is why you should let me build here or possibly this is what it's going to be the implication if we keep growing at this space. Yes, definitely. We do some calculations on our side. We would be able to, I mean, like Aliantha does it on its side if it can integrate the customer. On the side of the producer, the producer does it so it can identify if it's profitable to make this investment or not. What would be the ROI in the coming years based on what the grid looks like? That's definitely what the producers still do and they do use the model over there. Hi, Peter Dutfield from Open Climate Fix. Thank you for the talk. How did you say some other TSOs have used this? Have you had any feedback from them and how they found it? Well, I said that they are active partners so they did not actually use it. They are TANET and RTE as well. I'm trying to look if they could use this model. But some of the core features of TSOs, we do need to add them as well. That's one of the requirements from the TSO side. Once that happens, TSOs would use it as well. But the focus is primarily on the distribution system analysis side. In Germany, we have this TSO tells you please reduce your consumption. Can I use your project for this calculation? Is it fine enough or is your project just a scope of the complete DSO or a larger part of the grid? Can I use it for a single grid collection point or just for larger parts? Let me think if I got the question correctly. If you do have a single connection point and you wish to use the library, then the motivation would not be so that what would be the transition somewhere. But if it would be a profitable thing for you, right? Did I get it right? No, the DSO uses your library to calculate that tomorrow there's not enough energy. So he wants to tell some customers please we to use your consumption tomorrow. Is the library able to calculate this for single connection and grid connection points? So that I can really can say you and you and you have to reduce tomorrow? Or does it just calculate a very yes, is it just for a large part or also for a very narrow part of the grid? Now I understood. Nice point. The library does not do that. It just calculates the yeah, the power flow results, the voltages, the powers. One of the applications that I did mention about the active congestion management, we tell the customers to reduce their generations. We have certain contracts within Alieander to do that, but it's not part of the power grid model. Yes.
GridSuite and PowSyBl: an Open Source approach to develop advanced tools for grid analysis and simulation of power systems.
Okay. So, hello everyone. So, I'm Jean-Vatiste and Geoffroy is here. So, we are at the software development department from RT. And RT, was it RT? So, we will give some elements of context. So, RT is the French TSO, so transmission system operator. We handle from 20 kV to 400 kV, so it's the high voltage. And we must provide electricity 24-7 for all the costumers and all the inhabitants in France and of course in Europe because we have to cooperate. And the particularity is that we are asset owner of the grid, which means that we are responsible to invest and make sure that the equipment will be okay to work to complete our mission as a TSO. And we are also responsible to adapt the structure to make sure that we will ease the transition energy. So, we need some interconnections and we also try to adapt the grid to connect for example offshore wind generators. So, we have many, many challenges in the fast changing world. So, we have of course new energy mix with big goal around neutral carbon neutrality. Sorry, for 2050. So, it's a big challenge. And we have also some codes, some regulation that make drastic change and we must adapt to that. So, it's a more package where we have a lot of work to do in Europe. And for that, I will read the sentence because it's very important. Today... Okay, so... Today... So... Oh, it's okay. So, today's need is what is very important to understand is that today's need is not to build a tool that answers present needs, but to build a tool that is capable to integrate quickly and efficiently tomorrow's needs. And if you have the idea of the tender to create... the way to create new tools, sometimes you make specification, then you made a tender process, then you ask a vendor to develop, and this cycle is maybe like four years. The problem is that we don't know what we will ask to do in five years because everything is changing very fast. So, what is the strategy for it to answer those issue is to use open source. Andrew Froy will present you two tools that are based on open source. So, Possible, which is one of the first projects that was started, the Linux Foundation Energy Initiative, and then what we can do, what we can build on top of Possible. So, I'll let the floor to Joffre to present in detail the tools. So, hello everyone. So, the first project, Possible. So, Possible, it means Power System Blocks. Blocks, so this is a software component that we have as a foundation of many other applications, especially at RT, so we have something like 15 projects that are based on, as a list, a few components that has been developed in Possible. So, what is the content of Possible? What is it? So, this is many things, but it's the first way to model the Power Grid. So, we have a data model that allows us to build, to have a green model and to use it to make, for example, some evolution, some change for this grid model and to study what will be the impact. We also have some components for visualization of the grid that will be integrated in some higher level application. Also, what is very important is to be able to feed this data model with some data. So, for that, we have some converter coming and to standard data format. So, the most used one and the most famous one is the SIM, the SIM data model. So, we have premium support for SIM converter, SIM data model support. And also, what is very important is to have some interoperability with commercial tools. For example, the one, there's two very widely used commercial tools which has a PCC from Siemens and also PowerFactory. So, we are able to import some data into our data model from these tools. And also, we also have some converter from academic data format, for example, MATPOWER, which is widely used for research and science. And with this data model, we are able to run some analysis functions, for example, powerful calculations, security analysis. So, security analysis, for example, is a nice function that allows to test what will be the impact of some contingency. For example, we have a line loss, an outage on the grid, and we want to see what is really the impact of this outage on the flow, on the voltage to see if we have some trouble. We also have some scientific analysis, short-circuit calculation, which is also very important, and also dynamic simulation. So, time domain simulation. This is why we are integrated with another project of the Linux Foundation Energy, which is Dainaw. So, this is mostly written in Java. And this has been designed to be as light as possible. There is no dependency with complex framework or anything that takes decision to how you are going to use it in a higher level application. So, GridShoot. GridShoot is an example of a tool that is built on top of the component of possible, and that allows people to make some grid study. And very different studies. It starts from a real-time study, for example, security analysis, to a long-term development study. For example, we can with this tool, study what will be the impact of a connection of a new renewable generation power plant on the grid, and to assess that everything is fine if we connect this generation in a specific place on the grid. So, this is a tool that has been moving to production very recently. So, at the end of last year, since a few weeks, and we still have some very early users. And what we plan to have is 400 users in the coming two years. And so, this tool will replace an existing tool, which is at 30, since 15 years. So, we have a team of more than 20 developers, and it is a growing team. From the technical part, the technical stack that we are using is for sure 100% open source. So, this is a micro-service-based application, very scalable application, based on Java, Spring Boot. We have everything based on REST API and also asynchronous messaging with RabbitMQ. On the storage part, most of the micro-services are based on a PostgreSQL and Elasticsearch. As it's quite difficult to manage such a distributed application with a lot of micro-services, we have everything is deployed using Kubernetes cluster. And on the front-end part, so this is the web application using our RAC.js. And also, we use a little bit of WebGL for high-performance representation of the grid. So, an important issue that we had with this is that we have some Java component, which is very convenient to integrate it into, I would say, a classical enterprise application where often the backend is based on the Java ecosystem with Spring, Quarkus, or some kind of framework. So, this is fine, but what we had is also needed to use these components for high-performance also for research and data science community. And most of the people from the data science communities are on the Python ecosystem. So, the question was for us how to use the same piece of code in these two ecosystems and how to share the code in Python and Java. So, what we have done is to use another fantastic open source tool, which is the Gural VM. And Gural VM, this is done by Oracle. And this is several things, but we are using a component with native image that allows to compile Java code into native code. And thanks to this, we are able to build a C library for everything that we have impossible. And with this library, we can build a classical Python extension module based on the C library. So, some useful links. So, for sure, there is a GitHub repository for both project possible and Gridshoot. Maybe I can focus on the Slack channel, the two Slack channels. So, see where is the place where we answer questions and we discuss with the community. And also, there is an online demo of the application Gridshoot. So, if you want to test it, you can do it. So, we have an instance of Gridshoot that is deployed in the cloud. And you can connect to this one just using, for example, your GitHub account. Also, there is a YouTube video if you want to show a live demo. So, this is a screen of the application. What you can see here, just to explain what is it. So, on the left side, we have the data manager. So, starting from a case, from an initial, a green model, we have a way to create some tree of variant, of modification that allows us to test different changes in the network. And for all these variants, we can run some calculations and analysis and then compare what is the best and what is the best one for us. Then you can see on the right side that we have some way to represent, to display the grid. So, this is a full representation of the French high voltage grid. We have some substation diagram like here. We have some what you call the network diagram, which is part of the grid that is shown in the bus, in the nodal view. And then we can run some calculations. We also have some table to see the data in a tabular form. We have some specific user interface to show the results, etc. So, this is one for the presentation if you have any questions. Or if you want a demo of this tool, we can do it after the presentation. If you are interested to have a more detailed view of this tool. I was just wondering what format your network data is in. And whether you could, for example, take in the open-street map network data and try analysis on it. It's not complete, but can we do that? So, this is not OpenStreetMap, this is MapBox. But we can change the type provider to use whatever you want. So, here we have used a very light type representation of the grid just to have a better view of the grid. But we can use the OpenStreetMap. How do you make the link between the grid and the end-consumption items or human on the grid? Do you go to the machine-to-machine communication system so that the people stop consuming? Or do you make advanced polls in order to know the consumptions within one hour, within one month? So, this tool is a bit of a snapshot of the grid. So, this is done by some other tool which are before this one. So, we have the SkyDal, for example, which are doing the acquisition of the measurement. And that has a database of the grid model. And from this, we have some snapshots and then it can go into this tool. Okay? I don't know if I answered the question. How do you handle the stress when, for example, the grid is about to fold? Do you cover any cases with humans at the end of the grid? I don't know. Anyway. Okay. So, we will answer the question later. Thank you.
LFEnergy SEAPATH - Easier Operations in Electrical Substations through Digital Twin Empowerment
So, hello everyone. I am Paul Le Guin de Carnaison. I am an embedded software engineer at Sauvard Fair Linux. And today I'm going to speak a little bit about the C-PASS project and how we brought in. My company Sauvard Fair Linux. We are based both in Montreal and in France. And we are experts in embedded software and free open source engineering. And we've been working on the C-PASS project since the last couple of years. So what is the C-PASS project? C-PASS means software enable automation, platform and artifacts for the line. What is it? We are in a context of energy transition, as you all know. And there is a lot of constraints with this new energy. And the main constraint is that we have a multiplication of distributed controls. We have more and more power stations. And so we have an increase of the data management's need into this power station. And so the idea is how can we bring some free and open source into this power station. And this is where C-PASS is here. So to remind a quick reminder on the aim of C-PASS, the goal of C-PASS is to develop a reference design with an industrial grade on open source and real-time platform. C-PASS allows us to also virtualize platform and inside this virtualize platform we can run automation platform for our power station. And so we can share multi-application provider and this combines performance and safety. For 10 minutes presentation, I can present in deeper the C-PASS project, but my colleague Erwan already did it last year at 4.10.25. So if you're interested, you can see his presentation. So the main idea of this presentation is how did we bring some functional tests to the C-PASS project. And for this, I want to take a simple case 2D. So here is the power lines you can see in the campaign. And you have a tree after the storm that falls on your power lines and there is two lines that touch each other. And this is a big issue in your electricity system. And so you have systems that must cut the current very quickly to avoid any other on people or on the infrastructure. And so how can you have all this safety equipment with C-PASS? So I have a very simple representation of all of this is working. We have first a protection algorithm that makes a decision if there is or not a situation where there is another or not. And this algorithm is running inside a virtual machine and this is where we have the C-PASS project because this is running inside a C-PASS cluster, inside a hypervisor, et cetera. And we have on the opposite side an hardware which is doing the monitoring of our architecture. And the communication between the C-PASS cluster and this hardware is done with a protocol. And you know it, it's the ISC 61850 and this is a protocol which is based on TCP and it generates packets that we call sample value and this is the communication between the C-PASS cluster and our hardware. And so why did we need functional tests? C-PASS as you see is designed to work on a very critical infrastructure which is a power distribution and if we are some issue on the power distribution there is a need of the protection of the people and the protection of the infrastructure because we have electricity as that. And so in case of failure the safety protection must react as soon as possible. And so this is why we need a very, very, very low latency transit of this sample value that transit into the C-PASS cluster. And the last thing is that your power distribution in your country is running every time you have electricity in your home every time and so we are context of 24 hours and seven a day context. And so we have to ensure that this latency as low as possible every time. And so we are in a deterministic system where determinism is the primary goal. And so we have a big infrastructure, we have expensive items and so maybe you are wondering how can we in our labs at our desktop or can we simulate all of this chain simply. So this is the work we've been working on. And so I represent here a very simple scheme about how can we reproduce this protection chain in our lab. The first piece is what we call the Publisher Machine and the goal of the Publisher Machine is to generate the IEC 61850 sample values. And then we have the C-PASS cluster and the C-PASS cluster is composed of two parts. The first is the IPI Vizer which are running virtual machine and we have the virtual machines which runs all of the software which are an SV client receiver that will proceed the sample values that have been sent by the Publisher and a protection algorithm which takes the decision based on this sample value if we are an issue or not. I took here a presentation with two IPI Vizer and three VM but it could be a totally different architecture. So what tool did we use to do that? First on the Publisher Machine we use the pick up format. This is a very ideal of a format because we can reproduce some TCP traffic generation and for example we can reproduce what could be happen on electricity infrastructure for example a 50 hertz electrical signal and then we replace them with some tools. Here I use a TCP replay to send this packet with the spacing we want. We can use some PTP packets to synchronize all of this but keep in mind that it's not a recroid, it's not an obligation to use. PTP is only used on C-PASS when you wish to use some C-PASS features such as VM synchronization and VM migration but this is not an obligation to be used. And then on the C-PASS cluster we have first the IPI Vizer side and we have to have very low latency. First we have some CPU core isolation so we have done some work to dedicate some core only for the Linux system that is running on the IPI Vizer and isolate some core only for the VM which are running and do also some IEQ and processor isolation inside the Linux kernel to be sure we have some priority for some application etc. We also did some biosoptimization which is very hardware dependent but there is a lot to do. There is a thing like the multi-EPF reading feature that are very bad for determinism and you have to disable this kind of feature. And then on the virtual machines we have also it's kind of the same work that we did on the IPI Vizer side which is all of the isolation and CPU and IEQ etc. We use what is called the PTI path view and this is a very interesting feature because it allows us to directly inside the VM to take the packets which are received on the network interface of the IPI Vizer and this brought some good performance. And finally we can also use some SAIOV that can be used if you have multiple virtual machines but keep in mind that even if you have better results this is an optional feature. Thank you for your attention and please let me know if you have a question and I will answer to it. Thank you. Have you got any examples of real world adoption of this? Let's say Karin A. Ronan. So you are asking if we have concrete implementation of the C-PASS project. Currently no. We don't have any concrete implementation because C-PASS is currently near in early adoption but we have a good opportunity to be in the future. But currently it is not in production if it's your question but it's the goal of the C-PASS project. If you don't have a ground master clock what is the source of time? Matthew I will let maybe answer this question about PTP. In production we have to have a ground master clock but for testing we use just Linux PTP as a PTP clock. Thank you.
OpenSTEF: Opensource Short Term Energy Forecasting
Hi everyone. Thank you for having some patience with me. Computers are not my strong suit, although I am in IT. So my name is Sunita Rijder. I am the Community Manager of Open Staff. And I work at Alieander. So let's get into a little bit of background. So Alieander is a distributed grid operator. So we are responsible for the distribution of energy in both electricity and gas in about a third of the Netherlands. So I think we all know these kinds of gas. So this is energy consumption on some place in the Netherlands. However, we have no idea what's going to happen in the future. Well, this is where Open Staff comes in. Open Staff stands for Open Short-term Energy Forecast. So instead of our question mark, we actually know what's going to happen. So after this very short introduction, let's tell, let's talk about what I'm going to talk about today. So first of all, I'll start with the challenges on the grid and why we actually need Open Staff. Then I'll talk about Open Staff, of course. And finally, I really want to discuss our recent developments and collaborations. So the challenges on the grid. So when everything was still good and easy on the electricity grid, it looks like this. So on the left, you see one big producer, just one direction energy flow, and then we have our consumers. Fairly easy. However, due to the energy transition, I think you're all aware, it looks like this. So very chaotic. So on the production side, we have a distributed production due to our solar and wind, both on the mid and low voltage, but also at our consumers. And on the consumption side, we have the issue that our consumption has exponentially increased. We had a lot about EV charging over here. Well, those electrical vehicles need electricity through the grid. And this is where our capacity issues start. So this is a map of the Netherlands. And I think you can all guess that red is bad. So on the red parts, we actually have no capacity available. So let's say that you want to start a company of one of these areas, we cannot connect you. So you get no power from us because we just simply have none to give. But of course, we're all very smart and over the people. So we have some solutions. So one of these solutions is actually to shave the peak. If we expect grid limitations to be surpassed. So on this left image, you see a forecast on the load on, for example, a transformer. We see a very clear peak. And this is where our grid limitations are surpassed. So our solution is just to say shave the peak. So for example, if this is production, we just ask one of our solar farms to just shut off for a little while. Of course, they can money for this, but that's something else. And then this is the result. So our grid limitations are not surpassed and nothing breaks. So great. But then able to do this, we do need to know just left image. So we actually need accurate forecasts. And this is where we have open step. So again, open step stands for open short term energy forecasting. And let me a little bit explain a little bit more about it. So first of all, what the hell is it? Well, it's a complete software stack to forecast the load on the electricity grid. But it's energy forecast. So it could also do it for heat. And it's automated machine learning pipelines. So it's a step by step process, which is automated to make a forecast. So in these dark blue boxes is all everything that open stuff can do. And I'll talk a little bit more about it. So what does the software look like? So first of all, you need a database. This is one that you have to make yourself, of course. But we do have open step DBC, open step database connector. And this is able to get all of your data from your database. And then we get into open step already talked about pipelines. So of course, these are in the software overview. And these are part of the tasks orchestration. Then we have data preprocessing, which includes data validation. So for example, if you see a little flat line, as we're able to cancel these out of your input data, and there was something very interesting, feature engineering. So in this feature engineering, we're, for example, able to calculate the wind speed at the height of a windmill from the wind speed on the ground. And we're also able to calculate the lag load for one time stamp. And then of course, we're machine learning pipelines. So we have some machine learning in there. So we're using open source models such as XG boost to make our machine learning models. So we're able to train, optimize hyper parameters, of course, make a forecast. And we're also able to make a split forecast to our Dazzles model. And finally, we are able to evaluate our forecasts, store our model, and do some post processing. So let's look into the methodology on a really high level. So on the left, we have our target load. This is where we actually want to forecast. Then we have some external predictors. So we have our weather forecast, market prices, and typical profiles of companies and households. From these external predictors, we can actually calculate our derived features. So this is the feature engineering I just talked about. So we're able to calculate lag loads for each time stamp, but also to have some derived weather features, such as, for example, the wind speed at the height of a windmill. And for the more calendar info, it really matters if you're are forecasting on a Sunday or Christmas compared to a Monday. And then we can train a single model for all our lead times. So here you can see what the data, for example, looks like. So if a daytime with increments of 15 minutes, our targets, and external predictors, you can also see here that we have the Dutch energy prices in there. So if you have multiple training horizons, we just simply do pick late our data and use this for our training horizons. And then if there are questions about it, please ask me in the break, but I don't have time to go into this in 15 minutes. We can with our trained model now actually make this forecast. And of course, we want it to look nice. So we have this beautiful Grafana dashboard, which actually summarizes all of the information that you need for your forecast. So let's look into it. First and foremost, our forecast. So the red line on the left is actually the low that has been historically measured. And then we see here the yellow lines is our forecast. Well, now you see that there are a lot of yellow lines. What do those mean? Well, those are actually the quantiles. So you have actually a certainty in your forecast. And this can be actually useful if we're a certain location. You're quite sure what your forecast is going to be. You can go into another quantile. Then if you have a location where you have a lot of factors that you actually don't know anything about. And also very nice our feature importance plot. So here in the feature importance plot, we can see our lag loads and some other features. And this is actually nice. So you can see for every location, which features are important for your forecast. So for example, here we see radiation. I don't think it's readable for you, but it says radiation. So you know that there are quite some solar parks or solar panels behind, for example, your substation. Wind speeds nowhere to be seen. So probably no windmills in that area. So this was really short about open staff. Let me see how much time I have left. Six minutes. Perfect. Okay. So community and upcoming events. One of the main things that has really changed in open staff this last year is our community. So before it was just Alliander who actually created it together with RTE, working on open staff. And now it looks like this. So let me go over every company really quick. So Alliander, that's where I'm from, talked about that enough. RT actually working on open staff for quite a while and they're actually ready to implement it very soon. RT International just joined us this year. They have a very nice proof of concept and they're going to work on it further. Fidel has actually been using open staff quite a long. I've heard some terms, leeches this today. Well, that was a feed on up to actually a month ago. So we contacted them and they were like, oh yeah, we found some bugs. We fixed it. We can implement this. So they actually joined our community as of this year. Sigelman still working on a proof of concept and seeing if they want to replace their own forecasting model with open staff. And Shell is working on open staff DBC and seeing if they can use their method of data important. Now, I hope everyone feels like they want to try open staff. Well, you're in luck because we are organizing a workshop. So on the Friday, the first of March from two to four, we were organizing a workshop. And I would like everyone who's interested to join. So you'll get a better introduction to open staff and also a little bit more of the technical details. It will be virtual. And you will get really a hands on experience. So you get some example notebooks from us where you have to make your own exercises and you can actually make your own forecast with open staff and see how easy it is. If you want to sign up, just scan the QR card over here. And it will be very nice. I also have it on the next slide for people who are too slow. So want to know more about open staff, maybe even before you sign up for the workshop, we of course have our GitHub website documentation, etc. You're only one command away from using open staff. And if there's anything you want to ask or give some comments or anything, you can just send me an email or send me a message on LinkedIn. So thank you for your time and I welcome any questions. Who's running the microphone? I'll try to do my best. Please feel free to guess to find the best path. Hello. First of all, thank you so much. This was very interesting. And I have no experience, I have never heard of open staff before reading on the FOSTA website. I have one question about the data collection. Do you provide like some examples or standards on how and where to fetch data because the data source is very, I tried, I looked. So very good question, I think this is something that a community indeed struggles with. So for the Netherlands, we actually do have those sources because we are using them ourselves for other countries who are working on it to see if we can find some open data for everyone. But if you're interested, you can always send me an email and I'll see what we have. Yeah, great. Hi, it's Miné. I'm from Red Hat. So obviously I will ask the question about scaling this, right? How will you standardize and scale this because it's a project. It sounds super interesting. But how are we going to scale this to 49,000 substations or millions of smart meters at home? Very good question. This is actually something we're working on right now. So we are actually employing our open step stack on Dexter probably anytime soon and seeing if you can actually scale from that. Currently we have it scaled up at I think 100 substations. And if you're curious how we have a reference implementation on our GitHub and you can see all the information there on how we deploy this. Thanks. Yeah, yeah, sure. I have a question about the data sources. Is there any thought given to adding geographical information systems data into the system for forecasting models? Because especially stuff like wind and solar radiation also not just depend on the time of day and the wind speeds, but the location itself. Great question. Yeah, actually for our system, it just connects to the closest K and MI. So that's the Royal Dutch Weather Organization. So it's able to find the closest station to where you actually want to forecast. So it definitely takes a location to account. We have a prediction job class where you can put in all of the information for your forecast and in there you also put the latitude and longitude of your location. So it does take that into account. Question over there. Thanks for the question about the geographic data because I was thinking about an approach of just using cheap raspberry weather stations in Austria and distributing them across some locations to fetch the data because I have the Google Weather API and the Open Weather API or whatever as comparison values. And for the geographic thing, thanks for the question. How would you connect that? Like is this a plan of open stuff? Did I miss this? Yeah, thanks for the kind of difficult question because I don't know the answer. So I'll ask my colleagues who actually made this part of the open stuff and I'll get back to you if we connect afterwards. So then you'll know. But it's very interesting to do with the Raspberry Price things. Thanks.
Unleash the Power of Flexibility with Shapeshifter: A Universal Flex Trading Protocol
Alright, let's start then. Me and Tom here are, well, we've come here to tell you about Shapeshifter, a project we've both been working on. For me it's been, I think the first time I got in contact with it was five years ago, I think for Tom it's one or two. Well, this is us. I work as a freelancer. I actually used to work at Alieander, where we just saw the presentation from and I'm freelancing right now. Tom, just as me, is a member of the TSC, the Technical Steering Committee of this project. We're here to tell you something about it. That's some technical issues. This is not great. Anyway, the problem we're trying to, I'm going to talk to you about the problem we're trying to solve. Yeah, what Shapeshifter is, how does it work, and also how to use it. I do hope the screen will get back to us. Right, so it's funny actually, part of the slide you saw just now. It's actually fun to see the trend here because just two years ago we had some areas where there were issues with congestion on our grid and right now it's almost impossible to get anything going. Yeah, really, to get your data sender up or get your solar panels up. It's just almost impossible. Where the previous presentation was about forecasting, we're actually on the other end of things with the trends that she talked about, well, production and consumption is getting more simultaneous, it's getting less centralized, a lot of electrification. This means we're slowly getting into issues. So how does it look like if you draw it in a graph? This used to be our case. Well, quite simple. We expected use up until the green line, capacity was black and actually the grid companies oversold their capacity a little bit because, well, it's not usually used for the whole thing. Really annoying. Sorry for this. Pull out the clicker, maybe that's the problem. Yeah, it strangely shows up on the second screen there, but apparently the beamer's not very happy with it. Alright, almost back. So it still wasn't that bad yet. I'm gonna stand on this side so I can at least click on the laptop. At some point, it got so bad, our grid capacity is reached and well, if this happens, two things might happen. Either your transformer is sort of overloaded, but it will keep working. If you do it too often, what then happens is it will blow up or it will really shorten its lifespan by a lot. So we try to prevent it from happening. As was, yeah, we can also fix this problem by making sure it doesn't go over this peak anymore. How we do this, I'm just gonna try and explain it without the screen. There's some major users on the net, let's say above 1 megawatt for their connection, that actually have some influence on when their energy is used. So if they, yeah, let's say a giant freezer or they've got a battery somewhere stored, to them it doesn't really matter in case of the freezer, for example, the temperature needs to be between minus 25 and minus 20. So this means that if you don't use power at the peak, you can make sure that it's fixed. We're gonna try a different approach for the screen. Give them two minutes maybe. So really any major company in the beginning, it used to be voluntary, but at some point, yeah, problems are getting bigger and bigger. So we're getting into a state where it's actually allowed by Dutch law to force certain companies to comply with these rules to make sure that we're not having blackouts and that the grid is saved. We've got a second screen, we're back. All right, thank you. It was kind of stressful. So actually the previous presentation, the open step, this was really about creating insight. Where's your problem? When is your problem? How big is your problem? That only, so the insight is just creating the graphs. You can identify problems by having your experts at your company determine, okay, at which points, how much load can our transformers handle? How much can our cables handle? So once you have that, you can, okay, so these are risky areas because we see in our forecasts or in our measurements that we're nearing our limits in certain cases. Third one is really an interesting one. You can choose a solution because in the grids there's, well, not hundreds, but at least tens of possibilities to alleviate the pressure on your grid. And shape shifter, which we're talking about here is one of them, which is in the market based direction of solutions. Tom will tell a bit more about the contents of shape shifter itself. Of course, you need to activate a solution once you've chosen it. And last part is actually kind of interesting too. If companies are saying they've decreased their use, how do you actually determine that that's true? Because usually power use is not stable anyway. So how do you know they've kept their promises? Well, that's where shape shifter comes in. You know, mind if I talk about this one. It was founded in 2014, well, by a couple of grid operators, but also IT companies and consultants, because it was a very common problem in Europe altogether, but it was started in the Netherlands because we've got a very, very stable power grid and we like to keep it that way. So that's where the need came from. It's actually put into practice by the Dutch DSOs. I worked on the first pilot project where we actually made sure that we didn't have to put in a temporary cable, which would have cost 2 million euros for just three years, because that was when the new transformer would have been in place. So it was a very nice project to work on. The USEV standard, that's the Universal Smart Energy Framework, has been updated a couple of times. In the meantime, it was relabeled to universal flex trading protocol, but that conflicted with UFTP. FTP, that was kind of annoying. So it was renamed to shape shifter and hopefully we'll keep it that way for a little bit. And I think now it's Tom's turn to talk about what it does, or should I still take this one? I'll go into this one again. So in the grid, there's a lot of roles defined and the Universal Smart Energy Framework actually defines all these roles. The most important ones here for a shape shifter are the DSOs, aggregators and prosumers. Aggregators are companies that usually provide some form of IT where they can trade on behalf of companies, well, prosumers or consumers or companies that actually have the physical flexibility, but the aggregator will be the party that's participating in the flex trading protocol. It's possible that those two are the same. It's not necessary that roles are separated. It might be that a prosumer actually takes on the aggregator role because they're big enough to just be a party in this flex trading protocol. The DSOs and TSOs are the parties that really send out the request, the flex request that Tom will probably talk about into the market. All the aggregators will have time to respond to the request. Okay, apparently tomorrow at 6 p.m. until 7 p.m. there's congestion on the production side. So please, if you can use extra power during that time in a certain region, we will actually pay you for it. That's an example situation that you can handle with shape shifter. So some typical flex applications. Solar parks talked about them, wind parks, freezers. I think that's a very interesting one. I'm not even sure if it has been done before to actually use a giant freezer as an energy storage for grid stabilization, so I thought it was a very cool use case. Farms with solar, close to the solar parts, of course. Steel mills as they are major energy users and not necessarily time specific in when they do so. And really, in most of our early projects, we talked with greenhouses a lot because they have both large electricity connections and gas connections. So they can switch really easily or can even provide power back to the grid if they need to. So they are very nice from a flexibility perspective. Then I'll hand it over to my colleague Tom. Yeah, so in the time remaining, the shape shifter project, it consists of a specification which is published on GitHub. So you can scan the QR if you want. The specification is the one hand of the project. The other part is the XML schemas which are defined along with it. We are organized using a technical steering committee. We are part of the committee and also a couple of members from the UK DSOs and also from the Dutch DSOs. There are also two shape shifter implementations already. There is a Java library which is in use by Alieander and also at Gopax, which is the congestion platform we are using in the Netherlands. And there is a Python implementation which is used by another DSO. Everything is published under the Apache 2 license. We are currently focusing on improving our processes, our quality control, etc., to meet the open SSF best practices. And we are part of the LF Energy initiative. I will skip this for now. So we have a couple of implementations already that are using shape shifter. There was a demonstration project in the UK called Fusion. Good result from that. As I said, Gopax in the Netherlands. There are also some congestion service providers which are also implementing shape shifter on their end. So those are the CSP aggregator type of parties. And of course the grid operators themselves are using it to facilitate trading of flexibility using shape shifter. Skip this for now. Just one simple example of what the protocol looks like. It is an XML based protocol where the DSO can indicate what flexibility is required on the prosumer side on which the prosumer can reply with one or more offers that will be able to offer some flexibility to the grid operator. And the grid operator can then in turn reply to this message and say, okay, I want to use your flexibility during these hours to solve a specific congestion problem. One minute left. Current challenges we have with the project. We are trying to get as much of the prosumers involved as we can. So one of the challenges that we have to really keep a low entry barrier for these parties to connect using shape shifter. So one of the things is security is of course a very important topic. How can we keep it secure but also keep the barrier really low for entering into this type of integrations. And the other thing is we need more contributors. So that's pretty much a story for every open source project. But yeah, there are different ways you contribute. So if you find this interesting, please take a look at our GitHub and see if you can improve one of these items for example. Yeah, that's pretty much it. Any questions? I see one in the back. Oh, there's no time. Sorry. Yeah, just come to the front and then take your question.
OpenSCD: Everything Everywhere All at Once
So, yes, you can hear me. Hello, everybody. And welcome to my talk, OpenSCD, everything, everywhere, all at once. So my name is Tamar Schuss. I'm a software engineer and the lead of the domain development at Sprint 1. We are a software company in Stuttgart in the southwest Germany. And I would like to talk about... Hello, hello. Yes. Okay. First of all, what is OpenSCD? Just to give you context, a brief introduction to the history, how it came to be, and where is it today. Then what I would like to do, the goal is the talk, is that I would like to talk about the challenges we have as a community and which approach we took and which approach we are thinking about. So I'm just going to talk about the technical approaches today. So let's start. What is OpenSCD? It's an open substation communication designer. We're going to get into it later, what that really means. And it's also an IEC 61-850 tool. I don't know if everybody knows it, but I'm going to explain in a few sentences what it is. It's a progressive web app. So it's browser-based, and it's also a platform. So we think about it as a platform rather than an app. Okay. So just again for a quick context, probably everybody knows for the... Or for later, for the recording. This is a substation, an electrical substation, and it converts high voltage to low voltage and vice versa. And IEDs, so intelligent electronic devices, monitor and control the substation, so everything works as it should be. And the... Before I mentioned IEC 61-850 is a communication standard that describes or specifies how these devices should communicate or how you should design a communication between these devices. So OpenSD does something with this, and how it came to be is that at Omicron, one of our good friend Jacob Fogelsang created a Java app first because he wanted to help with the colleagues and his team to create multi-vendor projects because every vendor had its own tool and they interpreted everything a bit differently about the standard. So they try to... Our Jacob tried to create something where you can agree on a software level, so not just a specification but also agree on the implementation. Later on, Christian and Dinka joined the team and they restarted the project as a progressive web app because they saw just how hard it would have been to deploy and to just distribute a Java app to everybody. So they see the web platform as a nice way to distribute the software. Then the project started to grow. Alieander RTE joined Transpower from New Zealand and TransNet BW also from South West Germany and we joined with them and we had to create a few plugins for them. And now I'd like to think that we are at the scaling phase. Just last year, a colleague from Alieander, Pascal Wilbrank and I took over the maintenance of OpenSCD and just last week we have been accepted to LF Energy. So we are very happy about it and we are looking for the onboarding process and get to know all the other projects too. So these scaling of course are the scaling problems I think everybody has. So we have more interest in usage and more usage and more interest in the project and we face a few challenges. First to get back to the title, so everything. What we see is if a tool doesn't really provide all the tooling to design substations, then people are going to just use under ones. And we are right back there where we were at the beginning, that the tool is differently maybe interpreted as specification and then these designs, so these files are not going to be as much exchangeable as we would like to have. So what we see is that in order to be successful, we need to provide all the tooling, all the features that the users need. The problem with that is also that we have to provide it everywhere, otherwise a standard couldn't really work. It's already too bad that this high EC standard is not accessible for everybody for free, but even if the software that uses it isn't accessible to everybody, then it's never going to work. So at least we are trying to change that what we can. So we would like to really make it available for everybody. So all at once means, as you may know, in a multi-stakeholder project where everybody has its own deadlines, roadmaps and timelines, everybody tries to basically prioritize their own needs over the others because it makes sense. So this is also what we are facing with all the TSOs that everybody just has a different need. So our approach may be, so not every problem is solvable of course with technical solutions. We try to do, we try everything out, but today I would like to talk about just the technical ones because otherwise we would be here wrong. One is Web standard, it's really important to use them. We depend on them for the flexibility and performance and of course the long-term maintainability. Then the plugins, we have a plugin system. I'll get into these topics deeper in a bit. The plugin system just helps you customize for every use case you would like. And also the distribution, it's just one step further that you can have your own version from the whole system. So Web standards, how does it help us? So as I mentioned, it's a progressive web app. OpenSD is browser-based and what we need also is an offline usage capability because not every engineer has internet connection at sites or they would like to browser or design the digital substations on the go. So this is a really big point for us. And also that again as mentioned, installing an app, it's not really possible. It begins the prices and TSOs because the IT just doesn't like to install apps. So providing it in the browser, it's a nice way if you have internet connection to have it. Because it's a progressive web app, you have to visit it once and then you have it and you can use it at any time. The next one is custom elements as a web standard. So we use it for the plugin system and for a few other things. Why is it important for us? Because again it's a standard and if you can compile to custom elements or web custom elements, then we are fine, then you can create your own plugins. That leads to technology independence because we don't really mind what you are using. For example, OpenSCD is mainly lit JS based but we have for example, Swed plugins at Sprint I and so we created for Transnet BW, Swed based plugins and we just compiled it to custom elements and everything works fine. So this is also really nice to broaden our perspective and broaden our, let's say the developer team because no one company or the companies doesn't have to stick to one technology. Every company can pick their own or what they are best of it or what they have knowledge of and they can just use it. I'm going to show in a bit also how easy it is. So let's dive into the plugin system. This is OpenSCD and almost everything is a plugin. The menu points for example are all plugins and as is the example, the Open project opens by default, it opens a file locally from your PC. But for example, in our let's say sister project in Compass, it's also an LF Energy project, they re-implemented it and they have re-implemented the Open project plugin so that it opens files from the server. You can do this with everything else. So of course saving makes sense too. Then the next one is the editor plugin. So this is basically the main content that you see in the middle and also in the tab bar on the top where you can switch between the plugins. And the editor plugins are the plugins that can really manage the... Yes? Oh yeah. Yeah. Thanks. So editor plugins can really manage and modify the design. And what you don't see is the good thing, it's a validator plugins. We have by default the standard XML validators for the standard but you can of course, everywhere if you want, you can create validators that check for some semantic meaning. That means if you have for example a naming convention at the company, you can create a validator for it and then it's going to tell you if a naming of a device is not correct. Right. So how can you create plugins? It's really simple I think. It's just an unregistered custom element. So that means, if you can see it hopefully, it's just the standard way of creating custom balance. That's everything we need. We don't really need anything much because this we can load and use. And basically in this function, you can see almost everything we need. At the top highlighted, you can see this is a... Okay, maybe it's too small but in the top, we create a custom plugin tag. So a custom HTML tag name for every plugin. This is just to make sure that no plugin are collide. And we do this by hashing just the source. So you can have as many instances of your plugin as you want if it's necessary, just only the source or the source URL has to be different. In the next step, we just load the custom element, the JavaScript file and define the custom element with already generated tag name. And then render the tag, render the element, put in the HTML and the DOM and give it a few props, a few attributes so it has something to do. And the result is going to look something like this. Where you have open SCD and inside it, we have this plugin with random or a hashed generated HTML tag. So this is again an other example. But this is one of the plugins we created on the left. It's just a small weld component that wraps around another component. And on the left, we have this relatively small wrapper custom element that the main thing it does is here just basically deploys or starts this weld component. And why is it pretty good for this use case because it doesn't really have a runtime. So even if you have sub-weld, so to say, in every plugin, you are not going to have anything too big because it just really compiles down to basic JavaScript. In case it would be also possible something similar with React because React also bootstrap similar like this. And the only thing is that then with every plugin you would load React. Actually as the whole library with it, which sounds like a problem, but to be honest, once you load it, the plugin then it is cached and you're not going to load it every time. So even it would be good with React. So the last thing I think, the distributions, what's one of our solutions we are trying out currently is for example, it's what already working is that you can already deploy Open SID. So you can just take as it is today and deploy it on your own infrastructure. It is just a web app. So it's pretty easy. And it's yours. The other one is Eddence. We are currently working on to provide building blocks so you don't have to use everything. You can use just some of it. And it's easier to recreate and modify. For example, the plugin system, there is a history system where you can undo and redo your actions and also saving the project and editing. So these everything you could replace yourself and make it like for example that the editing doesn't happen in the browser but it gets sent to the servers, to the backend and then everything happens there. So this is what we are working on to increase again the flexibility. So again for the Compass project it's necessary to create new Eddence or a new, right now they are, they use a fork of Open SID but it's not the best solution so we would like to provide rather building blocks where you can put together your own platform. And what we saw is creating your own plugins. You can do it today at any time and the nice thing is that you can load the plugins from your local PC so you can have your PC and nobody can access to it but of course it's not the nicest thing to do so you can deploy it anywhere and you can install it in every distribution. So we already have a few distributions and we already have few plugins that we use everywhere even and it's not developed from the same teams. So it's always a nice way to use the work of others. Yes, so I was a bit quicker than I thought. Maybe we have a few questions but if you want to get in contact we have of course the Open SID organization at GitHub. We are on the Elephant Energy selection. We have a website and you can try out Open SID at opensid.github.io. Thank you. Is there a question in the room? The United States we have plenty of time, we have 10 minutes so if you want to ask a question of course to Tamas but to other speakers are still in the room feel free to ask for an energy question here. But of course priority to Open SID. It's post break, that's why. Let's break. Everyone's a bit tired. We can understand perfectly. We can ask questions to the audience. Oh, that's nice. Let's jump. Okay, IAC 61A50 right? You said that. Does it, apart from explaining, does everybody have any experience with that? Is that something you know of or not? Raise your hand. Okay. So who works in energy industry? Okay, about half I guess. Are you doing something with energy at home? Home automation maybe? Okay, yes of course you said. Okay, yeah. What if we teach energy that comes where? It's not industry. Yeah? My teaching. Teaching, teaching. Oh, education. Education. Higher education, primary schools, I don't know. Yeah, so much things. We had the own tech of course also. Has anybody thought of a question now? Ah, there's a question. Great. You're a hero. I think it's a follow on from your comments. I'm coming from a telecoms industry. I think I see a similar problem. The community is not big enough. The energy community is not big enough to sustain these types of projects. I think telecoms industry sees the same. So is there a way that the projects can be widened to get that their scope is even bigger so there's a much bigger chance of getting a more sustainable community? Do you have an opinion on that? Yeah. For sure. So as I mentioned in the beginning, so like it's what we try to do is what I talked about today is a technical approach. One was that basically having a desktop app is not going to cut it. So you need some new solution like the web platform where you can really distribute your software to everywhere. And also LF Energy and Alian there already does a great job with supporting the open source communities and using their project. It's already really big. How else you can, I think it's really hard to get amateurs so to say or hobbies I would say to the projects because the features we develop is not for a week and not for a week. The results are really long term. So until we really reach it, it could be years even in the energy industry and in the telecommunication I think it's similar. So the technology, how the technology moves in these industries is quite slow so to say or slower and maybe interesting. So how you can maintain such a communities I think you have to get through the chasm. If enough people get to use the project, for example, I think we are either there just before the chasm because Alian there uses it, Erty uses it and TrustNet BW uses it. If you can get a few other TSOs on board, then probably we're going to get over the chasm and the rest of the TSO is going to see this is a nice project and they want to maybe get involved too. So that's one thing how you can maybe grow the project and how to maintain the project is of course through foundations and through the companies. I don't see indeed this industry because it's so specialized and because of the closed source or the closed nature of this standard doesn't make it easier. I'm Dan Brown from Linux Foundation Energy and you're exactly right. There are so many parallels between networking and energy. I would say networking in telecom is like ten years ahead actually of where energy is right now, believe it or not. Where things, you know, ten years ago nothing was software defined and that sort of thing in the telecom space and now it largely is. So we need to go through exactly that same transition. I'm not saying telecom is perfect by any means and there definitely are not enough people in energy. So it's a matter of getting all of these traditional old school suppliers on board as well, the vendors who have been selling proprietary black box systems to the energy industry to utilities for years. They need to basically stop doing that and come to it with an open source approach and so they need to bring in the resources but we also need universities, we need researchers, we need government, we need the utilities themselves. So it's really a matter of community building and scaling and it's, you know, not an easy task by any means but that's why we're here in hopes that some of you who may not currently be, you may be developers in other vertical markets or in horizontal industries, horizontal technology areas who may find this interesting and be inspired and be inspired to, you know, come and join and start contributing to these sorts of projects. There's not, you know, an easy solution unfortunately but we're just doing everything that we can to keep building capacity. How the IAC61-850 market share, in terms of number of items, what part of the market of the substation does it represent? Meaning on one red electronic site that's deployed, how many are compatible with this protocol? So I'm not the best person to answer it, I'm not an electrical engineer, right? I'm not sure. So far what I get is that they are capable of it, so the IEDs, the Intelligent Electronic Devices are capable of it. I'm pretty sure, at least in you, so I haven't heard that they wouldn't be, so yes. Any other questions? Maybe to complete what you asked about. I think that the two last days some of us were in the Policy Summit for a European Commission. It was organized by the European Commission and we thought that it's very important to make a big announcement on energy and open source opportunity because we all rely, our future relies on energy, of course, our business, everything is relying on the energy. So if we can have fundings and if we organize through foundation to coordinate the effort and not to make efforts there and there and there, I think we will find a great path to have more and more contributors. Yes, you have a question. Can you please give the mic? I just want to compliment, sorry if I stopped abruptly, I'm sorry. I think that there's a very, in my experience, in my research in software-defined paratronics is that software-defined energy is much harder to achieve than software-defined data and signal because of the fact that there's a lot of current, there's a lot of power, there's a lot of issues with that and different use cases require different types of converters and all that. So there is, for me, one of the hurdles that we have as a community is that we need more open hardware as well. I mean, let's try to do some, you know, no code with no computer. It's not possible. We need to, if we want to do software, we need a computer. And if you want to do power, you need a power converter. It kind of goes with the, we abstract the hardware because eventually we want to, but there is a lack of hardware and I think that's a big frame, that's a very big break on the process because hardware is not only hard but it's difficult to abstract as well. We're going to get there. Thank you.
Power to the People - Technology for Access to Energy
Okay, I think we can start. Hi, I'm Vivien Barnier and this is Martin Jager. We will talk about Power to the People technology for energy access, or access to energy. We have seen a lot of great presentation today, very technical, very detailed. This presentation will start slightly differently. It goes also like this general discussion we had shortly about community and how to grow the community for energy projects. I want particularly to talk to you guys about access to energy and how we can use open source and which potential open source has for this completely under-explored area for open source, because a lot of these great projects are particularly targeting industrialized geographies, let's say, and their realities. But there's also another reality which looks like that. As you can see, in the north and hemisphere, like the global north, a lot of lights in the night. While you see, there's a lot of people living here and here and also here. But there's much less light. That's because there is no electricity. So if it's the energy death room, we should also think about how to leverage open source, both software hardware and community to help electrify these areas. And if we look particularly on how electrification has gone in the recent years, you can see there's been some progress in reducing people without electricity. So we started in over a billion in 2012. But then in the recent years, this improvement of access to electricity stopped and we have even a back trend. Any idea why? Not exactly. Not only Bangladesh as well and also on the African continent. There's a bit of progress. And now it's not getting cheaper anymore? Exactly. General population growth. The speed of electrification is the same. Just the population growth has outpaced the speed we are electrifying. So we have now every year more people with electricity than less. So we need to speed up the process of electrification in general. Now electrification of areas which haven't been electrified in the past brings particular challenges. Which is you have extremely low income customers. You have extremely remote areas and don't think about what is remote in Belgium or in Europe or somewhere. It's completely different. You have unknown future demand patterns and forget about AI machine learning. It's just not going to work. You don't have data. It's people that never use electricity. So you have no clue and you can't use these like digitalized methods to predict what will be the demand. Data connectivity issues. So if you want to communicate with the assets, extreme weather conditions and regulatory uncertainty. Because a lot of countries which are not that stable in the political situation. So you also don't know if they are coming the main good. If they are not coming the main good, what technology will they need and so on. So you need extremely low resilient technology to achieve universal electricity because you have to respond to all these particular challenges. And now you have NGOs, you have private companies, you have international companies. Large utilities are trying to go into this market. Non-profits, corporatives, communities, even agribusiness. Because they are already present in these areas. Often are going into energy ventures. And all these companies or stakeholders or NGOs are developing technology. Software, hardware and also business models. And very often they come up with exactly the same thing. Almost the same thing. So they are constantly reinventing the wheel. Because it's a very decent sector with a lot of players. So perfect playground for open source I would say. Because we have to overcome this constant reinvention of the wheel and it's decent. So there is a lot of possibilities to still shape the sector. Which in industrialized countries often is a bit more tricky because there are more established industries, more established players, bigger players and so on. So what we do at an access is we promote and support open source development and adoption for energy access. Very particularly for technologies that are meant to provide electricity to people who have not been electrified in the past. Particularly to generate an equitable ecosystem. Where more local companies, particularly domestic companies can participate. They can compete against large utilities like the ANGIS of this world. And so on like trying to grab this market. And have this adaptable and resilient infrastructure that we need to electrify. And to say what that means. A couple of projects that we have funded and supported in the development in the past. They range from software, hardware and business models. I will not go into details for the sake of time. And we want to speak particularly about one in particular today. Which is an open hardware project. Which is the open source battery management system. The Libre Solar, BMS C1. Which has been developed by Libre Solar. And which Martin will now provide you more insights on. Yeah, thank you very much. Yeah, so I'm from Libre Solar. Which is mostly an open source hardware and firmware project. But we're also a very small company. And doing some consultancy work around the open source hardware development. And yeah, we've developed these battery management systems. With a particular focus on energy access in a project together with the NXS foundation. And I will explain a little bit what a battery management system is at the very beginning. And talk about some of the technical aspects. As well as the community. And how we are interacting with other people interested to join our movement. So to say. Yeah, a battery management system is part of almost any modern electrical equipment. That's battery powered because all those batteries, those systems use lithium ion batteries. And yeah, in the energy sector, energy access sector in the past. There was still a huge use of lead acid batteries. Which have some issues with environmental damages. And which also don't last that long. And nowadays lithium ion batteries are getting cheaper. And so also. Okay, so I covered the antenna apparently. Yeah, so. Yeah, nowadays also in the energy access sector, which is very cost sensitive, as you can imagine, we also. Yeah, using lithium ion batteries more and more. And so we need a battery management system that takes care of the safety of those battery systems. So that's basically what it does. We have a pack of lots of different cells, which in series connected in series. We measure each single cell voltage, make sure that they are well balanced. We measure the current. And if something goes wrong, we have the safety measures like a fuse and a switch, which open the battery. And so that you don't get an overheating battery that could potentially even explode. And of course it's a safety critical component. So you have to take care to develop it right. And yeah, open source is a really good method for collaborating and not reinventing this wheel, which could be a costly process. So yeah, this is the hardware board that we developed. It can be divided into two parts largely. This is the power part, which does the switching and the current measurement. And you see those pretty large connectors. We can handle up to 100 amps, which is not that common for most people used to Arduino and so on. Of course, for the own take guys who had the presentation before, it's also going in that direction. Yeah, and that's a little bit challenging to put the microcontroller and the power parts onto the same PCB. But we decided to go with that route because we really have to make it as cheap as possible, but still handle this amount of power. And with these 100 amps and up to 48 volts, you can provide one huge in that sector AC inverter with enough electricity, like 3, 4, 5 kilowatts almost. And that's sufficient to build a small AC mini grid in such an area. There are also so-called solar home systems, which are really tiny, like 150 watts, 50 to 150 watts are really tiny systems. This BMS is not targeted at those systems, but more at the slightly larger systems. But you can also use the technology for light electric vehicles and other things, especially because the firmware is open and can be adapted easily. For in terms of communications, we have the CAN bus, which is the automotive and industrial state of the art protocol used for batteries as well. But we also have Wi-Fi and Bluetooth low energy, so that would be more for the non-control part. So we can talk to inverters, for example, with the CAN bus, but you can also have a smart phone app and connect to your battery. Where's the antenna? Okay. All right. So for an open source product, we think it's also essential to use only open source tools. And so that's why we decided to go with the Kikead, obviously. And yeah, so since a few years, I think Kikead is really a completely professional, great PCB design software. And it has very nice plugins, which you can use to automate some processes. And we're always trying to somehow get towards a similar experience than you have with software development for the hardware development. Because if you get a pull request and you get a binary, then you try to understand what's happening, what did the person change. That's difficult. So we use some community developed tools to create diffs in PDF format. So you can see what the pull request changes. And that makes it easier to collaborate also on hardware development, even though it's still not as easy as with software. Yeah, also we generate an interactive HTML bill of materials where you can see all the part placement. So if you want to assemble it manually or you want to fix something, then it makes it easier to understand and find the parts. Yeah, all those are community developed plugins. And for the firmware side, the software that's running on the BMS, we are using the Zephyr real time operating system. And I can really recommend anyone who's into embedded development and maybe didn't use R-Tosys before, really try it out. It's maintained by the Linux Foundation, so really fully open source. And it has extremely great features. So you can use it for almost any architecture. And you can also switch between different microcontrollers because it's very well abstracted. And that's something we experienced during our development. So we started with an SDM32 microcontroller. And then a chip crisis came and we couldn't get any SDM32 microcontrollers anymore. So we thought, hmm, what are we going to do? And we replaced the microcontroller with an ESP32 C3 microcontroller. And it was really just a matter of changing a few board configuration files. And then all the things like UART communications and so on worked out of the box, almost out of the box, I have to say, because there were some bugs in some drivers, which we, of course, upstreamed and fixed. But this is now fixed because that microcontroller was still pretty young. But yeah, that's really a huge advantage. And so if someone comes and needs this battery management system, but needs a different communication protocol or slightly different hardware configuration, a different microcontroller for whatever reason, it's almost a matter of a day to get it ported to a different board. Yeah, as mentioned already, we have lots of communication stacks in DeFa already working out of the box. For this energy access market, the GSM communication is very important because most of those batteries or off-grid systems need remote communication through GSM. But we can also use Bluetooth Low Energy and CAN and Modbus for the more local communication. Yeah, there are also some Zephyr folks here at FOSSTEM. We don't have a dedicated DEF room, but yeah, so if you want to learn more about Zephyr, I'm also an active contributor and maintainer. Just let me know afterwards and then we can talk about it a bit more. For the communications, we are using a protocol, so to say, that's called Thinkset, which you can think of as an API, a REST API, but for microcontrollers. So you can use it over the serial interface, you can use it over WebSocket remotely, you can use it over CAN bus, over Bluetooth Low Energy. We are using exactly the same upper layer of the communication protocol for all these layers, which makes it really easy to integrate it also into other projects. It's not meant to be just for battery management systems and it's self-explanatory. So you can see an example here. You can send a request. This would request the battery parts from the battery with a question mark for a GET request and then you get the data as JSON, including the units. Whether they are read-only or write-only values and so on. It's quite versatile and we are making good experiences with that in case you're interested. Here are the links. The presentation is also uploaded online, by the way. Now, coming to one challenging part of open-source hardware development, which is the manufacturing or production. So, often you can order electronics hardware on JLCPCB, but for these power electronics boards it's a bit more difficult. Sometimes you don't get the board specifications you need, because you need thicker copper layers than with boards where you just send out some data or communicate on the PCB. And also we have some quite heavy connectors on the boards, which need special soldering processes and so on. So far we haven't ordered anything with the Chinese manufacturers yet, but mainly ordered boards locally in Germany. And we're in contact with some companies in Nigeria who are also trying to produce it, or they will be able to produce it in the future, so that we want to basically also break this chain of developing something in Europe, producing it in China and shipping it to Africa. So, yeah, the idea is that it can be produced locally. And thanks to these features in KiCat, you can also easily, kind of easily solder it by hand. So this is an image of the first prototype in our FabLab in Hamburg, where we have a manual pick and place machine, but you can also use tweezers. And you need a pizza oven and a small reflow oven, and then you can do it yourself. That being said, we have ordered some PCBs. If you want to participate in the project, they will be ready soonish in one month, maybe. And you can also get in contact with us. Our plan is, of course, in the future to be able to provide boards for easily and for low price, so that everyone can participate, but it's also regulatory issues, or that's on our list to improve on that situation as well. Right, so final slide, almost. So how's the situation with adopters? We have had about 10 companies who started using the product during our project, and more than five companies who are actually starting to use it in their products and in the field. Also some companies from Europe picked it up and provided really valuable feedback and pull requests. We also tried to really start from the very beginning, from requirement specification, towards the final design of the PCB, to have everything on GitHub. So the specification was on GitHub, everyone could interact with us. And yeah, we've also got a community forum, one is on the AdNXS foundation website and in the LibreSolar forum. But I usually prefer just using GitHub issues and pull requests for communication, so you have it in the right place and people will find it. Yeah, so if you're interested in developing such systems or using them, just join us on our journey to bring power to the people. And here are some resources, some websites where you can find all the hardware designs, firmware designs and the community. And now, yeah, we're open to hear your questions. So first, thank you very much for your presentation and for your great work in the open source community. I had several few questions. First, I saw that you had a passive balancing board on your BMS. I was wondering if you are also offering or thinking about active balancing. Yeah, good question. We really tried to use active balancing at the very beginning and had this on our requirements list, but finally dropped it because of cost reasons. So there are some linear technologies chips and Texas Instruments chips, which cost $5 each for six cells. And they are really expensive. You have lots of passive components as well that you need. So that was the main reason that we are just going with passive balancing at the moment. There are some Chinese chips we couldn't get anywhere from a reliable distributor and we couldn't get any data sheets and so on. So that was also not an option. If you have an idea how to implement a cheap active balancing system, let me know. There was also already some contributors that wanted to do active balancing. Okay. Did you already have some cases in which you had to handle the balancing between lithium cell storage and other kind of storages such as compressed air or concrete and fibers or the storage system than lithium cell? No. So far it was only lithium cells of the different chemistries like iron, lithium iron phosphate, lithium nickel manganese cobalt oxide. We were in discussions with some people developing redox flow batteries where the voltage is smaller so you could potentially monitor multiple cells at once and get an average kind of. But yeah, so far these technologies you mentioned we haven't tried. But if the voltage range matches then it should also be possible to do the monitoring at least for those cells. I have a couple of questions. I don't know if I can ask multiple questions or just one. Okay. I can also stay outside the room after the session. Okay. You talk about network and notably having GSM but you don't talk about security feature for software and remote control. So I don't know, is there any, how is it handled or anything? Yeah, so we are using TLS certificates for the communications. And currently, yeah, that one of the problems with security is that with TLS you have the handshake mechanism every time you restart sending something out via the modem and in these areas you usually have SIM cards with roaming and they are rather expensive for high data rates. So you have like six megabytes per month only. And if you do a TLS handshake every time then that gets tricky so you have to reduce your data rate. We also thought about other things like co-op where you don't have the handshake but it's tricky but we're still implementing security so we're not making any compromises on that. Thanks. Originally it was the same question as the first one but I can switch to the next one. Did you consider to have a bit more modularity like on the other models you had? Those were somehow split up because on a bicycle for example 36 volts I wouldn't need 100 amps so that would be a bit much. And I think for everyone afterwards, we'll come out together and we could chat about the Vigotech project, Ointech mentioned. Yeah, so you have a certain degree of modularity which you can implement through leaving out some components. So you could just have less MOSFETs or take cheaper and smaller connectors and then reduce the amount of time you have to do that. That's possible in that direction but not in the other direction. So we designed it for 100 amps which is almost the limit of what you can get with the power and control PCB on the same control part on the same PCB because the chips have small pin pitches and if you need a huge copper then you get into trouble with that. So, yeah. And as you mentioned the bicycle, so this is really designed for energy access as we said and for high power appliances in energy access. So like milling machines or stuff like, or a tri-cycle where actually then you possibly get to the 100 amps that you will need and not for like an e-bike, like one person e-bike. Last question. We have a certain number of people who are using the same power supply and the same power supply to e-bike, like one person e-bike. Last question. We have 13 seconds. It's precise. I'll be quick. So, as you know there isn't a standard protocol for talking to BMSes from all, you know, because you're going to be talking to standard inverters in a typical system assuming there's some solar input and the problem is that most of those aren't open yet. So, they expect that there's a whole series of battery protocols. And so, currently if you buy a generic inverter, it speaks 15 different battery impersonation modes. I'm just wondering what you've done some of that or if anybody's thinking of a sensible standard for this madness. Five seconds, Enswer. Okay, yeah. We thought about that, but that's all only software. So, the ideas that we have are RS485 and CAN bus, so you can implement any special kind of battery protocol. We have CAN open readymade out of Zephyr, so that's already something, yeah, a higher level complicated stack that's already pre-built in. But, yeah, anything, any very special thing would have to be implemented. Thank you so much. Thank you very much.
Sharing the operational cost of Europe's electricity grid: optimization and transparency through open source
Hello everyone, I'm Peter Mitri, I'm a software developer at RTE, the French TSO. So today I'm going to speak to you about two open source tools to software that help us optimize and share the operational cost of the European grid. The first part of the presentation I will focus on optimization. So I will talk about what we call the regional operational security coordination and remedial action optimization. In this part I will introduce the open source software which is called OpenROW. In the second part I will talk about cost sharing through flow decomposition and in this part I will talk about the open source software which is called flow decomposition. I try to keep as much time as possible in the end for questions. So yeah, I hope you have some questions. Great. So let's talk about first of all why we need to optimize the grid. So I understood that many of you work in the energy sector but some don't. So we talked a lot about congestion management in the previous presentation. So here I'm going to try to set the scene and explain what a congestion is. So as you may know electrical equipments in the grid have physical limits. Outside of these limits the equipment is not safe to operate. So for example a power line which transports electricity from point A to point B has a thermal limit. If we exceed this limit, if we transport too much power on this line, the line may heat up, it may deform, it may even catch fire and of course it's pretty dangerous. So to help set the scene, imagine here that you have a small grid or a small part of the grid which is represented in three nodes. So the nodes would be like sites where consumers and producers are connected to the network. And between these nodes you have power lines which are in black here. And let's imagine that you have most of power production on the left side and most of power consumption on the right side. So most of the power will flow from the left to the right. Let's say that we have a consumption increase on the node here to the right. Then of course the flow will increase from the left to the right and depending on the network's topology it may very well be asymmetrical. So we may have more increase of the flow on the bottom part here. And we may find that the flow, the new flow that is on the line here exceeds its limit. So this is what we call a congestion. Of course there's not just the question of consumption and production, there is also some accidents that can happen in the grid and that can lead to congestions. So here you have an example. If we lose the line that transports electricity from here to here, then most of the power will flow through this line and this can lead to congestion on the upper line. As a TSO, RTE has the responsibility to be robust to all eventual incidents on the network. So we have to do something about these congestions. So what can we do? Fortunately we have what we call remedial actions. So these are actions on the network that can serve one of two purposes. The first purpose would be to redirect the flows on the lines. So for those of you who work in the electricity sector, you may know them as topological actions, HVDC actions or phase shift transformers. I'll talk about them in an example in the slide that follows this one. There's also another type of remedial actions which acts on the injections. We call that either redispatching or counter trading. These are actions that will change the power production plan of the producers. In general, the first part of remedial actions which redirect the flows are called non-costly because the only cost to operate them is the aging of the equipment. The TSO has power over these remedial actions. And the second type of remedial actions is costly because when we ask consumers or producers to change their injections, we pay them for their service. So to help set the scene, this is an example of non-costly remedial actions. So here in the example above, we have the base case where no remedial actions applied. So let's say that you have a congestion in the line here. One first type of remedial actions is the topological action. So let's say that you can split this node here into two nodes. This will make the power flow equal on both lines, this one and this one. And then it will relieve this line here and then we would have relieved the overload or the congestion on the network. Another type of remedial action is the phase shift transformer. So let's say that we equip the line here with a phase shift transformer. This kind of equipment is able to shift the phase of the current on the line and so act on the active power flow and so it can relieve the congestion on the line. The second type, in the second family of remedial actions, which are costly remedial actions, this is maybe actually easier to understand. What we can easily do is to call a producer which is on this node, a power plant, and ask them to decrease their production and ask a power plant that is here to increase their production. So naturally this makes the power production closer to the consumption site and it reduces the overall flows on the network and by consequence it relieves the congestion on the line. The key difference here is that power plants 1 and 2 get paid for their balancing service. The fact is that Europe's electricity grid is highly matched, interconnected and synchronous. So for example if you have an incident in France it is instantly measured in Romania. Thus the security of the network is no longer a national one, it's a European one, it's a global one. So TSOs have to conduct coordinated computations to ensure that the European network is secure. This is why the Acer, the Agency for Cooperation of Energy Regulators, imposes on TSOs to conduct what we call the regional operational security coordination. So in this process TSOs must choose the best remedial actions on the European scale to implement in the network in order to ensure that it is secure. Of course it's a large escape problem so we can hardly do it by hand. That's why we need an automatic tool which is called the RAU or the remedial action optimizer. The RAU will have to choose the most optimal remedial actions in a given perimeter and it also has to do so by minimizing costs that are imposed by cost remedial actions. So using an open source RAU has many benefits. First of all transparency because we are in a European perimeter. So what better way to be transparent about what the RAU does and which cost remedial actions it selects than to put its code in open source. Given of course that it's well documented. It also serves the purpose of coordination because this way when we put a tool in open source different TSOs from different countries and different vendors from different countries can cooperate more easily. It also serves robustness, interoperability, it also serves reusability and time to market because when a tool is used in many business contexts it becomes more versatile, it becomes more robust and it becomes quicker to deploy. At RTE we have developed an open source remedial action optimizer called the possible open RAU. So for those of you who may be know it it was called FARAO in the past. The journey started in 2019 but two weeks ago we made the move to possible open RAU and we did this because we wanted to join the Linux Foundation energy adventure because LFE provides a clear governance for which all contributors accept to abide and it also provides a clear methodology to work more efficiently and in better intelligence. Open RAU is actually used internally at RTE but also in many European processes. So I talked about regional operations security coordination or ROSC. Open RAU is being implemented for the SWE region here which covers France, Spain and Portugal. It is already in operation for another process which is called capacity calculation on the Italy North region and on the Co region which is actually the largest region in Europe to conduct the coordinated computations. It covers around a dozen countries. A few words about what our RAU can do. So it's an optimizer so of course it has to have an objective function. It can either minimize the worst congestion or remove all congestion in the network. About congestion we can model flow congestion and we can optimize flow congestion. So this is the example I talked about in the previous slides. We can also model voltage magnitude constraints and voltage angle constraints but for now the RAU cannot optimize them. It can only monitor them. For immediate actions we can optimize phase shift transformers in a given range. So the RAU if you give it a range of possible tap positions for the phase shift transformer it will choose the most optimal one that reduces congestions over the whole network. You can optimize an HVDC set point so it can change the set point of the HVDC to reduce constraints. It can also choose to activate or not activate some topological actions. For example closing a switch or opening a switch. It can optimize a subset of redispatching remedial actions so actually a redispatching remedial actions are pretty complex and actually an open RAU would just have a subset with strong limitations. Also it can optimize a subset of shunt compensator actions and it can for now only model counter trading remedial actions but we do not support optimizing them in the RAU. So of course like I said open RAU is used in the multiplicity of business context so it is very versatile. It has a lot of ways you can use it by changing the input data or by changing its parameters so if you need more information you can look on our website for all the ways it can be used. Under the hood the open RAU software is licensed under Mozilla public license 2.0. It's hosted on GitHub and the code is written in Java 17 so we use JUnit for unit testing of course we use Mavin for dependency management. We monitor the quality of the code on Sonoma cloud and we're pretty happy with our figures. We publish the code on OSS Sonotype and we rely closely on the possible library to be able to model the network and to simulate it in particular to use sensitivity computations and load flow computations. We also this specificity of the RAU we also use Google OR tools. I don't know if you know it but it's an open source modelization library for linear problems developed in open source by Google and through it we can support a multiplicity of linear solvers. For now for example we have skip which is an open source solver also CBC but also we can support express GROB Cplex which are commercial ones. As a side note we tested that open RAU is compatible with Docker Jenkins Kubernetes and Cucumber testing. So in conclusion I'd be more than happy for you to participate in our RAU adventure either by using it and giving feedback or by contributing to the project. So the best way to join the adventure would be to join the possible Slack team and then to join the RAU channel. And there is also a quick tutorial on Java if you want to play around with the RAU on our website. And if you want to know what the future of the RAU looks like the roadmap is updated once per month and it is discussed during the possible TSC which you are free to join. I'm moving on to the next subject which is decomposition and cost sharing. So I'm going to set the scene with a small example here. Imagine that you have three zones. Let's say there are three countries A, B and C. Imagine that you have big bow production in the north of A and big power consumption in the south of A. Then naturally you'd expect the power to flow from north to south so from producer to consumer but in reality it's not so simple. Any part of this commercial exchange, the power that is sold to the consumer, only part of it will transit through the internal lines of zone A and the other part will go through zone B then to zone C and then to zone A to the consumer. So of course the consumer got the power they needed but some of the power went through zones B and C. We call these loop flows or polluting flows. So the commercial exchange is simply the sum of internal flows plus loop flows. And we say that they are polluting because they transit through zones in which they are not consumed. So as you can imagine more loop flows in the polluted zone means more loads on the zone's internal grid. It means eventually more remedial actions to implement possibly costly and this leads to more costs for dispatching and counter trading. So in the core region alone we have up to 3.7 billion euros per year of dispatching and counter trading. And of course loop flows are a reality. They are a consequence of the topology of the network. We can do nothing about them. We cannot eliminate them. However we can compute them and we can better share costs when we know where they come from. So the Acer again the European regulator defined a clear methodology of computing loop flows in the core region and this methodology is followed by a methodology to better share costs between TSOs. Of course using an open source tool has all benefits here and most of all transparency because when you talk about sharing costs we talk about TSOs having to share the bills and being transparent is very important. At RT we developed a tool which is called possible flow decomposition. It follows the Acer methodology so you have the documentation for it here. And it has both a Java and a Python API. Under the hood it's almost the same as the row so MPL 2.0. It's developed in Java. It uses Mavin. It's hosted on GitHub. It uses a lot of computations thanks to possible for load flow computations. And most importantly it's already supported in our PyPossible API. That's it from me. Do you have any questions? So maybe I wasn't paying enough attention. Can you, so the purpose of your system is to allow you if something happens like whatever that thing on the Pyrenees a couple of years ago for the whole system to react appropriately. But you were showing that you're doing subsets to the computations. I didn't understand in an emergency. Presumably everyone needs to do something right at the same time. The whole network. However far the effect propagates. So what was happening there? What happens in an emergency versus whatever you were showing on the screen with doing computations of various regions? This is not really an engine that is supposed to help decision making in real time. It's supposed to be used as an optimizer for the grid. For example, in the regional operations security coordination, TSOs have like a photo of the grid in the day ahead. So 24 hours before real time. We merge the whole grid models of different TSOs. We conduct load flows and then we see if there are any congestions. If there are any congestions, then we run a remedial action optimization. The optimizer with us, okay, I found these non-costly remedial actions and these costly remedial actions that will make the network secure. 24 hours ahead. 24 hours ahead. 24 hours during the day, but it's not supposed to tell the operator which remedial action to choose. This is another, this is really apart from balancing. And if you, if we go back to the example where I showed balancing, something that resembles balancing, what we should do here is every time we change production somewhere, we have to, so if we decrease the production here, we have to increase the production here because the TSOs do, anyway, when we handle congestions, we cannot change the balance of the network. So the balance between demand and offer is handled in another process. Hello, I have a question about how much resolution you need to see into each of the grids in order to actually make some of this. Could you talk a little bit about the visibility that's required at the TSO level or beneath it, for example? Depends on the process. So in the regional security coordination, we look at high level voltage, so 200 kilovolts and 400 kilovolts. And basically all big production hubs are on this voltage level, but this is a really generic remedial action optimizer, so we can generalize it to whichever resolution we need. Any other questions? Is there some ideas to change the software for real-time congestion management, like for DSOs or for other systems? Yes, some experimentation is underway for balancing in order to be able to find creative remedial actions in real-time. So for now it's not an operation, but it's being experimented. So my question is about impact. Have you noticed that over European TSOs are using your software as well? Is that the goal in the end to share among different TSOs as the Europeans can? For now we are the only TSO using the RAL internally. However, here it is Coriso, which is the computation coordinator that is using OpenRAL for these three regions. And also the idea of joining the possible project is to be able to develop a Python API pretty quickly and to be able to have more users in different TSOs. What kind of algorithm is used in OpenRAL? We have an optimization algorithm, so a linear optimization algorithm. I have a few slides in the appendix for this. We can talk about it later if you want. But basically it's a search tree in which we optimize the topological actions and inside after every topological optimization we run a linear program to optimize linear remedial actions. These are remedial actions that have a linear effect on flows, for example, PSTs and HVDCs. How do you test it? How do you ensure that there isn't a bug that affects all OpenRAL instances running simultaneously? With this, if it answers your question. We have a lot of input files and expected output files. And with this stack, with Docker, Jenkins and Cucumber, Cucumber is a framework for functional testing. So you write scenarios in a Gherkin language. You say, for example, given this input file for the RAL, then I expect that there is no congestion at the end and that this remedial action is activated. You write it in a very natural language. And of course there is code to run these things. And then we put that in a Docker and in Jenkins and we run this every night upon almost 500 scenarios. And every night we are sure that our main branch on GitHub is still solid.
Quartz Solar OS: Building an open source AI solar forecast for everyone
Welcome. I'm Rachel Tipton and this is my colleague, Zach Watts. Can everybody hear me in the back? I'm not used to talking in a microphone. Yes, it's good, okay. So today we're going to be presenting Open Quartz, building an open source AI solar forecast for everyone. I'm a full stack developer. I work for Open Climate Fix. I'm going to introduce myself and then Zach will introduce himself. I'm a career change developer. So before working in climate tech, I was teaching English in France. I got a little bit tired of teaching 18 year old French students the present perfect. So I decided to, the French people, huh? Because it isn't. Yeah, because it isn't. It's not perfect. And I'm not a perfectionist. So I decided that I was going to channel my love for languages into learning code languages. I completed boot camp about a year and a half ago and that's why I landed. This is why I'm here. I landed in climate tech and I'm quite happy. Zach. Thank you, Rachel. Yeah, I'm Zach Watts. If anyone's noticed, my last name is Watts. So I think I was destined to work in power of an energy of some sort. I recently finished my masters in physics two years ago where I was trying to make cells dance using acoustic sound waves. And then I kind of fell in love a bit with AI and then joined open climate fix about a year ago where I do some of our machine learning implementation and data science. All right. So what to expect from our talk? I'll introduce open climate fix. We'll talk a little bit about why solar forecasting is important to balancing a power grid and some of the use cases that we use it for. We have a live solar forecasting service called court solar and derived from that is the open source court solar model that we'll be talking about and Zach's going to present that today because he's worked on that model. And then hopefully we'll have time for questions at the end. And this is a sneak preview of the code that we'll have you guys run at the end of the presentation. And we're hoping that the demo works, but we'll see. Open climate fix was founded in 2019. We're a London based company. I'm based in the north of France, so getting to be in Brussels is kind of more my home territory. This photo is from the sustainable work ventures office in London where we work. We're a nonprofit product lab developing open source solutions to decarbonize the power grid and generating solar forecast is part of that work. All right. So we see ourselves as like a, I'd say like a middle man or like the traverse between ML researchers and the energy industry. So we want to make our data available to researchers and we want to make the research ML researchers are doing available to the energy industry. And how do we do that? So all of our code is available on GitHub. We also have models and data sets that are available on hugging face. Does everybody know what hugging face is? I'm assuming this crowd does know. Yes. Okay. We know what this is. So a lot of the data sets are from NWP data or numerical weather predictions. And up to date, we have 500 people who have signed up to download those data sets. So we like to say that we're making an impact in that way. We also make available the EU met set data that we collect from where connected to a live service of like we get data from the satellite itself while we're generating our forecasts. And then we're actually putting that data into the ZAR file format and making that available to ML researchers. And that data has been downloaded 16,000 times so far from the Google public data sets site. So that's a way in which we're having an impact. The data has also been used to forecast rain, like to do rain predictions in Sweden, storm evolution in Taiwan. So it's been used for a lot of different purposes. And most recently, there was like a graduate paper that was published on, I think it was like day ahead PD forecasting. All right. So moving on to why solar forecasting important. The weather is unpredictable. The sun doesn't always shine. The wind doesn't always blow. If any of you have listened to a podcast on decarbonization, you've probably heard that phrase before. So moving into the future, our basically our power generation is going to be dependent on weather dependent energy sources like solar and wind. So in this chart, you can see by 2050, about 75% of the world's primary energy source is going to be based off of renewable resources. And then the resources at the bottom are gas and coal. These are what are called dispatchable resources. So you can you can burn X amount of coal and get X amount of electricity, you'll burn X amount of gas and get X amount of electricity. This is a basic concept that I'm presenting. But it's important to think about, because you don't have that predictability with solar or with wind. And that's where our predictions come in. So does anybody know what this is, this image that is on the screen? I'm sure there's somebody who knows more about it than I do. Peter, would you? No. Huh? Somebody else? Anybody? Yeah, it's a gas powered turbine. Thank you. So it's again. This is a gas powered turbine. I'm using it to introduce the idea of spinning reserves. So a power grid is, as we've seen, there's a lot of calculations, as well as a lot of, it's complex to balance a power grid. And so what we're doing with our work is, we're helping power grid operators balance the power grid by providing them with a PV solar forecast that indicates how much solar energy is going to be on the grid. If they don't have that energy, what ends up happening is they have something called spinning reserves that they keep running. And that spinning reserve is running at 50% capacity. And so it's running at 50% efficiency. And so you're actually burning fossil fuels just to ensure that there is electricity that could be generated to be on the grid. If you don't know how much solar energy is going to be on the grid, it makes it more likely that you're going to have a greater amount of spinning reserves that are functioning or running at a given time. So I'm just introducing this to explain how our solar forecasts are actually decreasing carbon emissions currently with our work with national grid. So our main solar forecast is a national forecast that's run for national grid ESO, which is the electricity system operator in the UK. This is a picture of the control room. If you've never seen the picture of a control room, this is what the national grid control room looks like. And our national forecast is in operation in the national grid control room. So this is what a solar forecast looks like. You have the dotted line here. So the dotted line, that's your forecast. And then the solid line behind where it says 1130 is basically the history of the forecast itself. And I'm just using this to show you the information that national grid is given. And then they're able to make balancing decisions based off of this information. So if they see that there's 3.5 gigawatts of energy that is guaranteed to be on the grid, then they can reduce the spinning reserves that are running at that time. And therefore, decreased balancing costs for themselves. And then they also are diminishing carbon emissions at the same time. The other model that we have in production is a sites model. And this is what the open courts model is based off of. And so this is a model that's not necessarily generating a solar forecast for the power grid itself or for an entire country, but it could be like for a solar farm or like a smart home operator. And Zach is going to tell us how it all works. Great. Thank you very much, Rachel. So as said, we've taken a lot of the information that we've learned from building these kind of larger, more complex models and distilled this down into a site model. But essentially what we're doing when we're trying to do a forecasting problem in general, we want to start by providing as much information as we can about the problem we're trying to solve. So we start that by providing a diverse set of solar historic generation data. That just means we can capture all sorts of different types of conditions that might occur across a different location. We then provide multiple numerical weather predictions. These are forecasts made by large supercomputers of different countries, forecasting things such as cloud cover, temperature, rain, irradiance. And not all of these numerical weather predictions are equal. Some of them have slightly different biases. So we try to incorporate as many as possible to try and capture that information. We also utilize satellite imagery. As Rachel said earlier, we've made that data set public on Google data sets. That's really useful for helping with kind of near term cloud formation, not only that the satellite imagery, because it's a satellite up in space, it can take a picture every five, 10, 15 minutes. So you have a higher resolution of data going into the model, whereas the numerical weather predictions, they're run on quite resource intensive, quite slow to run supercomputers with much lower resolutions. We also then provide some topographic data about the terrain in which we're forecasting. And we feed all of this data into machine learning model. And if you've dealt with any data on this kind of order of magnitude of 60 terabytes of satellite imagery, you would know some of the pains in creating batches and the slow processing times involved there. And out of this, we're able to create a national, a regional, and an individual sites level forecast, which I'll be talking about today. So as we said earlier, we've been doing some work with the National Grid ESO, which we started a couple years ago. They were our first pilot project with our forecasts. And we managed to generate a forecast, which was three times better than their existing in-house forecast. So that gives you a key to kind of the bar that was set when we kind of started this, trying to getting an error, which is three times better. And this chart we can see to the right here, this is from one of our latest models, which we call PVNet2. And you're looking at mean absolute error as a percentage per forecast horizon. Now I've used this to demonstrate the value of using satellite imagery combined with these numerical weather predictions. The light blue line that you can see here is if we train the model just using the satellite imagery, you can see it's quite good early on, but the error relative increases quite a bit. Whereas just using the NWPs, which is this dark green line here, very kind of horizontal consistent error. And then by combining the two data sources, we get this, what I find a quite satisfying convergence where the models learn to take the information it needs from both data sources. So moving on to our site level forecast, just curious here, if you have solar panels, could you just raise your hands now? All right, now keep your hand raised if you also have a battery pack in your house. Now, are any of you using solar forecasts in any way at all at the moment? You are, nice. So this is where we see the kind of site level forecast that we've generated to be open source being really useful. There's a bit of a shift going on in the past couple years as consumers and kind of home households are realizing that there's these technologies available that can help them optimize their energy consumption. And it's not just the consumers as well, it's the smart home operators who are looking to participate in these energy flexibility markets. Now, as we've heard, there's been lots of really great presentations today about how to manage a grid. The grid, the electricity grids really need a lot of more infrastructure that needs to be built on to the grids to meet electricity demands going forward. And one way they're trying to tackle this is by increasing flexibility through things like smart home management. So one way this could possibly be used is when a smart home operator has access to many, many households, they can incentivize households to turn up electricity or turn down during different times. And this provides a flexibility to the grid. Now, from a consumer perspective, you might have an electric vehicle and you might want to charge your EV at times when you know you've got the lowest cost to you, which is when you'd have solar generation. So you can look at a forecast and say, I want to drive my EV tomorrow. I can look at my solar forecast and be like, well, it's really sunny today and really cloudy tomorrow. So I'm going to charge my car up fully today and then I can drive it tomorrow and it'll be lowest cost to me. So we see this being used by smart home operators. We're already speaking to a few startups in this space who are trying to integrate this into this smart home optimization sort of systems. Experts in battery optimization, research and academics, and just general hobbyists who might want to incorporate solar forecasts into their current situations. So to create this model, we've used a data set of over a thousand household UK sites, which can see on the right here. And we've trained quite a simple model, just a gradient boosted tree, which essentially tries to separate out the data into different buckets. This is quite a crude example, but say the clouds are less than 25%, you might predict 100% PV. If not, then try and create another branch that will then split the data up further. And what we're able to do by using kind of a wide range of different sites spread out all across the UK is forecast anywhere in the UK. So we can now plug in our specific latitude and longitude information about the site we want to forecast for and forecast for anywhere and hypothetically globally as well, depending on what data we have available. So this brings us to open courts, which is the open source solar forecast we're presenting here today. This uses open NWPs. Now there's two primary open ones. There's a few, but the GFS, which is the American global forecasting system, and the ICON, which is created by the German weather service DWD, and is widely regarded as the most accurate free to use weather service. So we take things such as cloud cover, temperature, visibility, and we pull this data from open Meteo, and we're using our pre-trained model that we previously showed. And by doing this, we're able to create a forecast up to 48 hours ahead at a 15-minute resolution and do all this in four lines of code. And we're able to get a pretty good error doing this. In comparison to some of our other models, which use slightly more up-to-date information, the error is not too much worse. Now you might notice that there isn't satellite imagery involved here, and that's because this model itself is something that you can run on your computer using our pre-trained model and by pulling the data yourself in just a couple of lines of code. Now when you involve satellite imagery, you need licenses and stuff to have that data live. The repository, the data storage that we keep has a two-day lag, I think, on live real-time data. So we were going to do a demo, but we've had to do a last-minute swap of computers. So instead, I'm just going to talk through this with everyone. But if you do want to do the demo, you can follow along. So if you head over to our GitHub repository, which is GitHub-OpenClimateFix, I've pinned the repo, open-source-court-solar-forecast, so you won't have to type in that mouthful of the name of a repo. And if you head to the example folder, there is an example notebook you can follow through, which will lead you to creating a solar forecast. But essentially, all you need to do is pip install-court-solar-forecast that we have here. And then once you have that installed, these are the four lines of code we tempted you with at the beginning, but essentially, you want to first import the function, which we'll be using to run the forecast. Next, we import this PVSight class that we use. We then want to create the class. So in this case, we're going to specify the latitude and the longitude of the specific house or site that we want to forecast for, and then the capacity of our solar panels. Next, we just run that, we use that run forecast function, passing in our site as an object, and then specifying a time in which we want the forecast to start from. So using this time here, it would create a forecast starting at midnight on that night, going out 48 hours from that point onwards. And what does the results look like? Well, we get a nice, so this is where I click demo done, and it would nice graph and smooth, but we get the best results out of this anyway. So we get our solar forecast, which looks, as we might expect, kind of peaking around midday. There's some bumps in the road here. This could be due to some clouds that are coming over or storm. And we've got our forecast from midnight out to 48 hours ahead. So hypothetically speaking, with the demo running, I could have shown you what it looked like exactly at this location here today, looking out for the next two days, and we could have seen today. But running it on my computer, it didn't look too great. And that's kind of reflective on if you look outside the window today, it's a bit cloudy, and not the nicest. So I'm going to pass back to Rachel now to talk about the robot. All right. So moving forward, the idea for the open, the Quartz open source forecast is that other people can use it. You could potentially input different types of data, so different NWP data could be input or PV data. And also just anybody who wants to do a bit of ML experimentation, this would be a place to start with that. As a company, we're looking to build our community as an open source company. It's something that we're kind of trying to put in place. So if people use the model, hook it up to an API or a database, and actually start generating a regular forecast for themselves, we'd love to know about it. So I don't know if we have any time left for questions, but yeah. Too many questions. The prediction, does it assume like you can specify the capacity, but can you specify things like south facing versus east west facing, that kind of stuff? And how does this contrast with forecast.solar, which provides for home users a similar API? Sure. Thank you very much for the question. So in providing features like tilt and orientation, that's something that we have built into the model and needs a little bit of a tweak to get it working. So originally, with this model, it was based off a model that we have in production, which we run for a thousand household sites in the UK. And we found that the tilt and orientation data that is generally provided is not always that good and that accurate, because oftentimes with a solar installation, the builder might have noted it down, but it's not that accurate. And when we ran experiments, hard coding the tilt and orientation, versus letting a user kind of specify exactly, we got slightly better results if we assumed it was a perfectly south facing and at 30 degrees. But that is something that is a little tweak and is I think one of our kind of issues to work on. And your next question about using kind of another provider, what was it again, the name of the forecast.solar. So I think what differentiates what we're doing compared to other people, this is something that you can run like locally on your computer and do it yourself. And we're also forecasting generation. I think a lot of these other APIs, they're forecasting things like solar irradiance, and then it's down to the user to basically interpret that irradiance value into a generation value. Maybe forecast or solar is different, but I think that's what we do, maybe slightly different if that makes sense. How do you handle so this issue of the solar, long term solar weather and recent critical events quite like volcanoes or dust balls, which can affect the yield for the solar partners? Yeah, so things like volcanic eruptions, they definitely do affect the solar. And a lot of time, I think that information generally is helped out. So the numerical weather predictions that we use, sometimes they tried to capture in that information. I did see some research papers on how they actually don't capture in things like volcanic eruptions. And the researcher, I suppose, who was saying, we need to improve these models to capture things like that. One other data set that we're looking to incorporate is aerosol data sets. So that does include information like that. And is something which I think we're doing with some of our other models. And at some point, I guess, we'd like to do with this model as well, which should help to capture extra information like that. Hi, thanks for the talk. So I wanted to ask, what is the geographic extent of this? You're using models which might cover more than, say, Eukaryurop. Or if it's confined to Eukaryurop, do you have plans to expand it to a wider region in the future? Thanks. Hi, thanks for the question. So this model in particular, because it's sort of dependent on the weather data that you have available. So we're using IKON's global weather forecast. That essentially means that this model can be used anywhere in the world, because that forecast is a global forecast. The only issue you might encounter is because the training set that we've used is just for the UK, there might be some sort of bias towards the UK household sites that we've not really looked into yet. So I think one of the things that we do want to do is to create maybe a more robust global model is to have a PV data set, which does cover the whole world. But I think so this is very recently, we've pushed this out. And since we've done that, there was someone reaching out to us from Indonesia who was testing it out there. I think they got it working. So it does have global coverage. Some of our other models, which we provide as a product and service, they are quite specific to the UK, but we're expanding out to India at the moment and some other European regions. And that's mainly down to the satellite imagery data that we have access to, because we're using the European geostationary one. So it's easier for us to build on that, how it is at the moment. Thank you, everyone.
Can open source development drive energy transition? PyPSA-Earth experience
So, we have stopped somewhere between regional and global perspective. Let's go global. The energy transition implies that thousands power systems around the world should be transformed with a pace which has never been seen before. And while we know what should look like the picture on the global level, that is still a question how should it be translated into regional levels. And what is special about this global scale energy planning problem is that we should plan decades ahead under deep uncertainties. And basically, we have quite an experience of energy policy failures. There have been quite a few cases when energy policy measures looked quite reasonable in advance but have resulted in failures, didn't lead to results which have been expected and these programs should be stopped. And that is why actually we need large scale energy modeling. We can replace this painful experience of rail war cultures by playing with the modeling, with energy models. And obvious advantages of open source, of open modeling and open data for energy planning has lead to a rapid increase in interest towards open energy modeling. And currently we have dozens of open energy models. We have a lot of open data sets relevant for energy modeling. But the picture is very incomplete and very patchy and there are regions in the world where we do not have even a net zero plan not to say open net zero plan. And that is exactly the gap which we are addressing as independent research initiative. PIPESIMEDS Earth aims to provide every part of the world with the open and reproducible and accessible energy systems model. What we are doing can be divided by three blocks. First of all we are doing open coding indeed. We are working with open data and we support open energy modeling community. So just a reminder about energy systems model. There are I would say power engineering models that is tools which we mainly have discussed today and there are also academic integrated assessment models. Academic integrated assessment models relate to the whole world and model global scale large scale interconnections between economics, environment and energy. And energy systems model that is kind of tool which translates these results of global assessment into plan of actions on the regional scale and obviously energy systems model should contain should reproduce in realistic way behavior of power systems. So that is what our workflow what our architecture look like. We have data block, we have modeling block and we have optimization block. Processing is orchestrated by snake mate and well probably the most trying part of the whole picture is work with data. There are different groups of data which effort operation of power system and there is also quite trivial but very impactful moment which relates directly to open data licensing. Basically we have starters data kit which we provide with the model to facilitate starting with modeling and I think the most frequent how to start request is about loading this starter kids data and many troubles by created by the fact that some open light some licenses of open data set do not allow redistribution or hosting of the data. So for some data we can collect data set and transform it in the form which is needed for energy system model to run while for others we do not have right to redistribute and have provide data to sources and connect them with the scripts to clean data and to prepare them to format which can be used by energy modeling and that is exactly chain of the whole link which breaks most frequently. So just open data in action. Environmental and climate that is part of the data workflow where we are truly grateful to open science community and to geophysical community. Basically that is the most unproblematic part of the whole workflow and we have package which translates geophysics to energy related parameters and basically that's it. Mainly it just works but as for electricity demand here the biggest problem is data availability. Indeed well what we need are hourly demand profile for every country of the world at least at national at aggregated national level. Indeed they data exist but they are not openly available and so we have a model we have machine learning model which has reproduced synthetic lot profiles but we would be very happy to improve flexibility and geographic coverage of this approach and access to the data to original lot profiles that is bottleneck currently for this group. Another important part, another part which is crucial if you are interested to model a power system of some arbitrary country is data on power infrastructure especially on grid and here we have used open street map database and developed a dedicated package which extracts power futures and allows to prepare model of grid topology. A part of that we have packages from pipes ecosystem which provides data on power plants on installed generation and a data set which collects and curates data on technology costs including forecasts for technology of technologies development. So and that is what modeling workflow look like. We take preprocessed data for power infrastructure and simplified topology preserving electrical properties of original power grid then cluster it to make the problem tractable and the next point is the most challengeable from the perspective of open source because open solvers are still overplayed by commercial solutions and here there is some room for improvement and we are collaborating with developers of open solver to improve the situation and now once the workflow has been established we had to be to ensure that it is actually possible to apply our model for every country in the world in the most literal sense. It has it took about almost a year work to introduce all the necessary fixes which account for different special futures and now it is done that is linked to it's another report which contains schemes for power systems of every country of the world of 193 United States country and we also have the code the source code which we have used to produce these schemes as images and if you are interested in model any country of the world please feel free to do that. Now let's look what actually can we obtain if we apply this approach that is net zero study for Nigeria which we have used in course of development the model which we have used as kind of proof of concept and the lessons which we have learned the most interesting output of this study has been that well net zero power system for Nigeria can be actually a little cheaper as compared to state of as compared with status quo indeed we haven't included we haven't accounted properly for uncertainties which exist for energy demand for Nigeria and this work should be certainly continued and applied to every country of the African continent but that is what does it look like that is which well that may be helpful to shift a paradigm and that is actually what is it all about and that is a study which has been done by in collaboration of pipe summits earth and open energy transition and a German think tank agor energy vendor they have considered Kazakhstan power system and the question is if it is feasible to implement solar in and wind faster as compared to with the current Kazakhstan current national development plans and the results are quite encouraging and being currently discussed on policy level and that is output of a master study for Saudi Arabia and that is a country where 99% of energy mix relays on fossil fuels a study which an author has done using pipes earth has shown that wind and solar actually can have quite well quite a place in power system of Saudi Arabia and it isn't so expensive as it could be expected that is the case when data accessibility data availability is a big issue so this this results are quite preliminary because more advanced optimization methods are needed to account for this uncertainty and also account for all pathway all transformation pathway but what is important what is an effect what is an impact of this study is translating conversations translating discussion about possible futures for fossil fuel relate countries from purely hypothec hypothec level to a level of numbers and that is a case for Bolivia case when South America when networks of South America are considered and that is region where data of open street map data have not so good quality so it has been needed to introduce quite some tricks to restore topology and the resulted model has been successfully validated for energy for dispatch on the national level so it works even if you don't have data of such excellent quality in open street map and that is a case for Malaysia we have considered decrobanization of industry and in Malaysia the local the local feature is well renewable sources renewable potential is not so excellent so we have shown that it is basically possible to decrobonize one bright branch of energy sector but if we would speak about the whole national economics it looks like it makes absolute sense to include into modeling into discussion not only traditional on wind off wind and photovoltaic also something more exotic like floating solar or probably to consider cross-country interconnections so and last but not least community is essential part of the whole story we have different channel of communications and we are very interested that is essential for now for us to build global community as we have seen there are some countries of the world where there are still a lot of modeling evidence available where efforts of researchers and developers are still focused but energy transition is a global thing and if we wanted to work we need to provide tool we need to involve people around the whole world and we can unfortunately confirm that there is definitely a gap geographic geographic gap in free and open source software community Tobias has talked about during free about that during previous first day and now I think we have some understanding about reasons which which are behind this gap and that is basically quite quite simple people around people in different regions just have different patterns of communication and that should be accounted for if you want to build inclusive community and another part of the story is that many things which we take for granted like education or even stable internet connection cannot be taken for granted in too many parts of the world so but the good news is that actually those problem which cannot be solved alone can be perfectly can be solved if we join efforts and well we are doing it we are solving them we still have a lot to do there are research tasks there are validation tasks because we can build power system model for every country of the world but it would be nice to understand how close are we to reality what are errors what are modeling errors for each of the components for power grids model for installed capacity how far we are from reality in demand profiles and that validation task it is huge if you're interested to join please feel absolutely free we would be happy to accommodate you and another big task is to increase usability in particular condo environment and version conflicts inside all our pytonic soul that is still a big questions and we have we would be very happy to improve it somehow and another part relates to capacity building relates to improvement of documentation and to spreading the world spreading knowledge so again we are very happy to accommodate any suggestions and we are inviting contributions if you are interested please do not hesitate to ping us using any of our channel our communication channel so just a reminder that energy transition is a global thing and can be tackled effectively only together thank you very much and I am very happy to take your questions what's the role of Earth observation for this models do you use satellite data to track transit lines or look you look for wind turbines or solar cells or is in this data set you use is just you use official data sets for your modeling thank you we do not use satellite observations directly but we are using for power grid we are using open-street map data only while it would be great to supplement them by satellite images we had some stream which have been focused on addition and on adding actually satellite data to open-street map but the this team currently is not very active so that is perfectly that would be perfect that would work definitely but we just don't have capacity to do that right now also we would be happy to revive it and as for installed capacities we are using fusion we are using merging of a number open data sets on power plants I am not sure if satellite observations have been used of in any of this data sets but at least we don't make yet satellite processing ourselves and we do not use them directly also I agree with you that it would be very interesting idea and it would be also perfect academic topic there are some countries which really don't want to collaborate I don't know quite like North Korea or other other countries where we don't get any data well to answer that directly we have data for Northern Korea but we would be very careful about using them because when you are applying if you are modeling especially well specific countries I would be very much concerned about safety people who are affiliated with those countries and that also goes for China for example because for China there are some local regulations which basically forbid going into too much details of power system for people who are not approved by national government so I would be very careful about delicate areas of the world but technically yes it is possible so my feeling is that correct approach would be to try build collaboration in more or less safe way providing tools to people who are safe using these tools for example if there is there is some group in China who is approved as approved by national authorities as experts in power systems as people whom whom they trust then we may provide tool and support them in using it in the right way also I agree that it is complex question and it may go a little bit complicated first let me remark agor energy vendors a very good name in Germany so Congress and getting them using that and my question now are you also doing storage like water reservoirs or millions of batteries distributed well I agree that storage is one of the key question when we are speaking about energy transition and we include a number of different storage technologies and currently that is one of the key points of energy transition we are able to capture them and if you're interested please feel free to investigate the details we would be very happy very happy to obtain your feedback and suggestions and contributions if you see that something can be improved actually we have a huge poor request which should provide interface to a big list of different storage technologies and it would be perfect if you could revive it I was just interested I've got a friend who I'm doing who's a research who's doing geothermal in Nigeria and you have to yeah well he's doing it on me do you have geothermal resources in there as well okay all right good thank you yes we have geothermal and we have quite recent request from from Kenya where people are interested to include geothermal in a more sophisticated way thank you
Carbon measurement and energy attribution for processes and hardware devices in the Linux kernel
All right, everyone. I hope the mic is working. It's great to be here. This is my first fast time, by the way. And I'm very happy to talk to you all about current measurement and energy attribution for processes and hardware devices in Linux. My name's Aditya, but you can call me Adi. That's the first three letters. I'm a grad student. And yeah, that's my contact. I'm always very happy to talk to people during, before, and after my talk. Please reach out. Please. I would love to hear from you. So a bit of background. I'm a graduate student at ETHEHA Zirik in Switzerland. And I do research at the intersection of computer architecture and operating systems. I love this stuff very much. Great. What do we want to talk about? Let's get a bit of a brief background to bring everyone on the same page. Now, when we talk about energy sources and computing systems, you can have a bunch of options. You can have direct inputs from DC, from USB. You can have battery power systems. And if you're really exotic, you can even have energy harvesting devices. Okay? Now, we want to use the maximum. No, I'm sorry. We want to use the minimum amount of energy to perform our task. Why do we want to use the minimum amount of energy? Because energy consumption correlates with battery capacity. And battery capacity is a major, a significant design constraint for your consumers. Okay? All of us have cell phones. We have the recent buzz around Apple Vision Pro, AR devices. These devices are significantly restricted by the amount of battery capacity. So we want to minimize the energy that we use to get the job done. Okay? Now, what is the problem here? What do we want to solve? Let's flesh it out. Energy consumption is defined as power times latency. Power is determined by your hardware. Okay? Latency is determined by your software. Okay? Now, how do we measure this? How do we get this data? Programmers often measure latency using well-established tools. I'm guessing many of you would be familiar with Linux Perf, or you would have timed your own software using wall clock time using CPU clock cycles, right? Now, these are well-established metrics and well-established tools to quantify your latency. What if I ask you, do you know of any tools to calculate your application's energy? What comes to your mind when I pose this question to you? How would you calculate your application's energy consumption? You would say, okay, Ali, I know. This is very simple, right? Energy is power times latency, right? We just talked about this. I'll get the power from the CPU. My CPU has this magical interface called Rappel, which stands for running average power limit. I'm going to get the value and they, oh, voila, my CPU says 15 watts right now. Great. Then I'll time my application and my application turns out to be, let's say, five milliseconds. Okay? And we put these values into this formula and great, we have 75 mJ of energy consumption. Stop done. Let's go home. Unfortunately, this is too simplistic. Let's try to dive into what we missed here. This does not reflect the ground reality, okay? And now I'm going to deconstruct what happened here and what we missed. The first step, we saw the power was 15 watts. And unfortunately, this model assumes a linear power draw over time. That is not the case. If you actually look at the system, this is what it looks like. You have these values and you have these peaks. And if you measure your power at the wrong time, you will end up with a significantly different number than what you should have. On the x-axis, you have time value. On the y-axis, you have the power for CPU. And power consumption is not linear. So this assumption that we have a linear power draw is incorrect. Second, we got the power value from Rappel. Remember running average power limit? It turns out that Rappel is only available on Intel or sometimes on AMD. ARM, for example, has a very, very different interface to report power. So I would love to share a story. I was doing energy profiling on a server-class system back in university and I said, oh, I've built this great infrastructure on my Intel platform, right? Let me just use it to run on ARM and see what happens. And the moment I ran it on ARM, Linux Perf said, I'm sorry, I don't recognize the CPU. I can't give you any numbers. And it just crashed, okay? So all of these interfaces are really different and you need a significant amount of engineering to make sense of it across different platforms. Second limitation is that we do not have uniform interfaces or the formats to measure power reliably. All right. Let's try to go deeper. Let's try to get more into the closer to the ground truth. Our model got the power value from the CPU. What about the other devices? I'm right now broadcasting from this device and oh my God. I'm sorry for this. I hope not. Give me a sec. Beautiful. Beautiful. Okay, okay. So back to the presentation. We were talking about the impact of devices like the screen, the memory, the network cards, right? We don't know how to quantify them. So we did a lot of experiments and it turns out that these devices very often dominate your power consumption and our findings are also correlated from some similar observations at Google. So what Google did was they were trying to optimize their data centers and did a huge amount of profiling on their server class CPUs. Server class CPUs are the heaviest CPUs that you can get in the market. And it turns out that they observed that DRAM is dominating their power because DRAM is burning power all the time. CPU turns on and off, but the DRAM you cannot turn off. Remember, it is volatile. So you need to break out of this mindset that CPU is the end all be all. Okay. So let me try to summarize everything. We are inaccurately calculating only a fraction of the system's actual energy consumption. Okay. And I would love to put this in a take of a code for you. This is not from me, but I like this very much. We cannot improve what we cannot measure. So we first need to understand how to measure energy correctly. And that's what my project is all about. That's what I love to do. What is the goal of what I'm trying to do here? My goal is to develop a framework to accurately and reliably measure the energy consumption of processes in the kernel. All right. All of us can get this data. What is the use for this? Because data without use is it does not get used. Okay. Once we have this data, we want to report it to the end users in an easy to understand format. Right. End users should be able to make sense of the number. Right. What does this number mean for me? We wanted to report it to the programmers which improve their action ability, which enable them to move their code up and down to change their code to move the numbers. Right. And we want to report it to system designers to enable them to iterate much faster over low energy designs over low carbon designs. Okay. So let's try to dive deeper. What do we mean by a framework? What are we trying to do? Let's flesh it out. A framework comprises models and tools. Let's break down these two words. A power model is how we think about a device. When I say that I want to measure power, a power model is the mental model that I have that I will use to get the value. Okay. And it turns out that these power models are often very poorly understood for a number of devices. For example, DRAM power models are often not available to the public. They're not available to academia. They're, let's say, a proprietary trade secret. Don't quote me on that. And okay, once we have these power models, we can build tools which accurately calculate power based on these models. A tool that I would like to mention would be the NVIDIA SMI utility. It allows you to calculate the power of a GPU using this tool. It's a good tool. And okay, so let's pull it all in. What I would like you to take away is that we need accurate models, first and foremost accurate models, and second, reliable tools to calculate the energy consumption correctly. So we defined our problem and we defined our goal post where we want to go. And now let's see how are we going to get from point A to point B. Great. So before I dive into what is the mechanism, I would like to bring to knowledge what has been done before. All of us have been here for the entire day, right? We love energy and we love efficiency. If this is such an important problem, why didn't people solve it before? People did. People did try to solve it before and I'm going to describe to you right now what they did before and why that is insufficient, why we need to do better. Okay, on the screen you can see a screenshot from a tool from Intel that is known as PowerTop. And you can see the first column here which reports power estimate. And on the right side you have the description of the particular device, interrupt, process for which this power estimate is calculated. Now, what are the challenges? Well, first of all, I believe in energy. It turns out that power is a discrete time event. What do we mean by discrete time event? Let's try to break this down. If you have a graph, a power is a single point on that graph. Energy is the area under that graph, okay? We want to calculate energy because energy is what correlates to your battery drain. Your battery supplies you energy. Power is just one particular instance in that time. Second, PowerTop has a vendor-specific implementation. I hope that is clear. Third, what is the actionability? So I just showed you this data. I just showed you the screenshot. It says, oh, my display backlight is taking 350 milliwatts. Great. This particular process is consuming 292 milliwatts. Okay, fine. The question that comes to mind is what is the use for me? What is the actionability for the programmer for this data? How does the programmer change the code to move this number? And I don't know. How do I fix that? How do I fix something that I don't know how to fix? And that is a gap that I would like to bridge, right? So let me dive into the guts of the system. This is a system design. On the screen you can see an elementary flow chart which summarizes the system at a very high level. And this is a regression-based system. A regression-based system has two inputs. You have the parameters and you have the inputs to the parameters. First, we calculate the parameters and then we calculate the inputs to the parameters. Great, we have time. I will go into details now. Please bear with me. Okay, let's first look into the parameters. How do we determine the regression model's parameters? There's an algorithm for this. First of all, we turn off everything. We turn off everything that we can turn off in the system. We measure the baseline draw. This is what we refer to as the minimizing the system load. Then we pick each device one by one. We isolate the impact of the device on the baseline load. And we measure the drain over multiple times. So we turn on a single time. Let's say that I turn off everything and then I turn on just the screen. Okay? And I measure the difference between these two values. The difference is the impact of the screen on my baseline. And then I also do one thing. I sweep the screen. So I change the brightness of the screen from minimum to the maximum because obviously the minimum brightness is going to have different power than the maximum brightness, right? I hope this makes sense to me. Are you guys still with me? Okay, so this was just an example. But what we're trying to do is we're trying to quantify the impact of each device on the baseline. Now, I would love to give a metaphor to help explain this better. Imagine that you have a water tank. And in this water tank there's one single input and there are 10,000 tiny outputs. And the problem that you're trying to solve is what is the rate for each of the output pipes? You cannot measure it directly. So you have these 10,000 outputs which go on their own anytime. They can go off and you don't have levers to control them. What you're trying to figure out is what is the drain rate for each of the output pipes? That is essentially the problem that you're trying to solve. So what you do is you turn off all the outputs, okay? You turn off all the outputs and you turn on one single output and then you see the difference in the tank level before and after turning it off, okay? And that is essentially what we call as an isolation or well in academic terms it's also sometimes known as an ablation study. But we try to isolate the device and measure the impact. Next, we repeat this process for all the pipes in the system and we try to get a reasonable estimate of what is the impact of each pipe. Great. So that was the first step, the device-specific measurements. The second step would be the kernel process accounting step. This would be the inputs to the regression model. So we have the parameters that we got from this step and now we need the inputs. Now how do we determine these inputs? Sorry, did I hear a question? Okay, great. Right, how do we determine the inputs? What we do is we isolate the impact of each process. So we identify how much time the process used for the CPU. We identify how much was the network activity, the screen wakeups, file handles, memory usage and we put all of these numbers together into the model. And this is what gives us a predicted energy consumption value for that process. Okay, so what are the challenges? This seems very simple. This seems, okay, you've done this work but what did you not tell us? Here comes the part that I did not tell you. First part, estimated value. This is not the reality. It is really hard to find out the reality. And there's a very famous line in machine learning community. It's known as all models are wrong but some are useful. So my goal here is to build a useful model that I hope is less wrong. I would love to make it perfect but unfortunately we cannot make it perfect. But yes, I would love to make a useful model first. Second, there's a bit of a cash 22 situation here if you observe that. What is the cash 22? I am running a measurement process. There's a process that is doing measurement on my system. Okay, that is also going to create a load. So there's going to be a skew in the values that I get because of my measurement. Okay, and the more accurate I want it to be, the more skew it is going to create. So we want to understand what is the right amount of accuracy that we can use to also be useful while also minimizing the bias. So this is very challenging, right? Because this is different for every system. And that's a problem that I'm almost struggling to solve. I would love to get your inputs if you have. Great, next challenge. There are millions of devices out there and these millions of devices have billions of ICs inside them. Very often we don't even have the data sheets for these ICs to correlate the values that we see. The estimates that we get can range across two to three orders of magnitude. One device can say, oh, I use one microjoule and the second one can say, oh, I use 10 milliwatts and those numbers don't make sense. Those numbers really blow you away. So how do we maintain our sanity in the face of the variance that we see here? And one more challenge would be that, assume that you can say, oh, let the users supply this data, let me get the data and then build a centralized farm of this data and then try to make sense of it. Should the users share this data? Would the users share their device users' data to you and allow you to put it on a centralized server? Who will own that data? Because there's enormous value in it. So this is, I would love to get your inputs on. One more challenge here would be the validation. So we got a value that we estimated. How do we make sure this value is as close as possible to the ground truth? In an ideal world, I would have infinite money and I would go to every computer in this world and take a probe and put it next to their CPU and say, oh, this says 17.5 watts and my tool says 17.5 watts. Great job. Let's go. I cannot do that because I don't have that much time. Okay, so we want to minimize the difference from the ground truth and what we see in the tool. There's a significant challenge in making sure that what we see is what is the reality. Right? Remember, there's accuracy and there's precision and there's correctness. And all of these trifecta come together and make this a very difficult tool to get right. But still, I believe it's going to be great. I'm very happy to work on it. Great. So once we have the energy consumption, how do we link it to the carbon emissions? We just saw that we can calculate energy consumption using power time flatancy. The carbon footprint can be calculated by multiplying this number by the composition of the energy. Where did the energy come from that you used to power the device that you were running? And this composition depends on multiple factors. It can include the geography. It can include the time of availability. It can include the cost of generation of that energy. Right? So fortunately, there are good tools and libraries out there which can simplify this problem for you. So energy composition is, let's say, something that I believe people will solve faster than I can solve this one. That is why I would love to focus on this one. Great. All done. Let's get back to the good stuff. How is this going to look like? How is this going to make your life better? If you're an end user, I would love to ship to you an application like this, an application which tells you how much energy your inkscape usage consumed, how much energy your screen was dissipating. So as an end user, you can remember to turn off inkscape when you're not using it. Or you can figure out, oh, I need to deliver a presentation to so many people in five minutes. I'd better save my battery. Otherwise, I'll be in deep trouble. So it's for those use cases when you want to maximize your battery life as an end user. As a programmer, if you want to expose an API that enables programmers to take action, if you want to indicate the devices and the code regions which consume the maximum amount of power in the code and enable the programmers to change it, to modify it, to fix it. So actionability is the primary concern for programmers. In an ideal world, I would love to have direct suggestions in the IDE that tell the programmer, oh, this code is not, this code is going to burn this much carbon. You'd better change it. And for the system designers, what we want to do is we want to enable them to iterate our designs faster. We want them to enable this, we want them, we want to enable system designers to discover designs which are really low on energy, which are really high on performance, which are really high on carbon efficiency. So there's typically a design space that designers explore. And we want to enable them to explore the design space faster. That would be the end goal from this tool. Great. So what is the takeaway from this talk? If there's two things that I would love for you to take away, that if you forget everything else, okay, just remember these two things and I'd be very happy. First, we cannot improve what we cannot measure. We must measure correctly, okay, to improve things. And second, we need to break out of the CPU mindset, okay, non-CPU system components can dominate your power. Please remember that. Please remember these two things. And the next time I see you, please come say hi and I'll, I'll buy you lunch. Okay. Great. Thank you very much for listening to me. It's great to be here. It's great to talk to you. Please be in touch. Please reach out and oh, boy, we're out of time, but I'm very, very happy to get your questions. Come talk to me. There's still like two minutes for questions. So if there are any questions, please. Go for it. There's one in the back. Yes. So. So hello and thank you for this presentation. I hope you're not going to hate me for this question because I'm a primary infrastructure guy. And one thing I was always concerned about is redundancy, like a scale twice. So if one dies, is this part of your thinking and scope? Or does the question make sense? I'm really sorry. I don't fully understand what you mean by redundancy. I mean, sure, I understand, but redundancy is trying to solve the problem of fault tolerance. Okay. It's not trying to solve the problem of efficiency. I'm trying to solve the problem of efficiency. So redundancy is an orthogonal concern to mine. Does that, does that mean it makes sense? Yeah. Thank you. But thank you for the question. I really appreciate questions. Yes. Did you try to monitor the hover head of monitoring the energy consumption? Yes, that's a great question. No, we did not. On one side, I'm afraid it's going to be huge. On one side, I don't know. It's like an infinite recursion, you know, like how can I measure the impact of my tool itself? Like the tool is what measures the impact. But how do I measure the impact of the tool? I don't know. I hope, I hope that, I would love to believe that. That's what I, yes, that's what I want to believe. Yes, please. Thank you very much. It was great to be here. Thank you.
Advanced Linux Power Management Evaluation using Perf
So, hello. Let's start. I think 50 minutes and all, so I will hurry up here. And the previous slides and presentations, we saw the overall picture somehow, the crit thing. And the last slide, we dig into one system somehow. And in this slide, we also want to dig a little bit more in the details, how to analyze the power consumption. What we saw also in the previous presentation, there was an... Sure, sure, sorry. What we saw in the last presentation was that we saw the power consumption of one system, a little bit similar to power supply that we saw this task consumes and what's whatever. But the question what you often have is after this data, how you can optimize your load for your server, for your embedded product somehow. What are the causes why the application runs too often, the system runs too often, cannot go into deep states, peace states and such things, right? At the end, it's the hardware that consumes the power and you can save power if you put things into deep sleep states or consumes the frequency somehow. And this is really important to save energy. And what we did in the past was writing scripts to optimize your workload and get things what are the causes that an application runs too often, right? It runs often, cannot go into deep sleep states. This is important to do the power optimization. And what I provide in the next couple of slides here is an application that helps you to optimize your workload and makes this visible somehow. So what we are talking about is a perf script and extension to perf. So it's, if it is mainline, it's not yet mainline. I will send this script to Analo in the mailing list and hopefully it gets merged quickly somehow. But when it's merged then it's really usable, easy usable. It's just an up-get install and everything works out. And also for Yachter and Buildroot, it's really easy to use these things after. It's also important that it can be used in embedded systems and everywhere. How does it work? It's just a record call where you record your workload with the workload separator, like every time. And here I record for 60 for one minute, workload on all the CPUs. And then you record everything one minute fine. And then you start with the report, the power analyzer, and it's have different modes. Because I have just 10 minutes, I show one mode here, but there are different modes for different optimization, analysis, and things. So what are the modes? There are several modes that can be activated and used. And you just activate or use the mode you want to focus and dig into the details, right? This is how things work. And what's also important, every mode has different trace points in the kernel. So usually you record only the trace points you require for the particular analyzer. Because if you record everything, every trace point, you get a lot of huge data and things. So normally you limit the data. How does it work? So there's the per script. As always, you can write a recorded data, as we saw for one minute here. And it just records all the trace points that are required. But on the other hand, you can also record the data that are required for your analysis. This is documented, what the trace points are required. Then you have the data. And then you start the script, the reports, and here outputs all the analyzers for this. You start it here with a timer. So what are the timer events somehow. And then because there's a lot of data coming out of this, you usually can use this data already and see, here's something that's not working well, too much timer interaction, for example. But what is it also can do is some post-processing and to create data graphs somehow or filter things afterwards because it's a lot of data. And here, just a showcase, this is one image that's created. You see the time and you see a workload. It's a logarithmic scale here. How much time, timers are working. Timers are one course that triggers from a deep C state to an active state to a zero state somehow. Timers are not that good. Often you see, if you begin analyzing things on your desktop CZ, you see here, I think this is the kitty, my terminal I use here. It has just wake-ups all the time. Why are there wake-ups here? And then you see often some buggy applications, clip-out things. They are constantly triggering your system and this prevents to going into a deep C state. This is the causes that prevents this. So it's really important. And here you see a workload I started and you see all the timers that are correlated with starting a workload. Here you see a lot of kernel timers and then you can start optimizing things. This is just a focus for the timer events, but there are a lot of other events as well. This is other sub-sequence analyzes also just for the timer events. You see here for a tick-less system, normally if there is no load, kernel can really go in a deep sleep state. And then it shuts down the timer tick altogether. But does it really stop the timer tick? You will see it here in these images and you can analyze things and optimize things. What are the kernel timers that trigger your systems? If you look at the graphs a little bit, the resolution is not that good, but you see that there are timer ticks all the time, and the network interrupts, timers are working here and you can optimize this. If you see this and you know what's happened, what we see here in this graph is the timers that are working for each particular task. So you can optimize for your task as well. How many timers are there? I often see in the production environment that the timer has done all the time somehow and not correlated. What you can also do there are system calls for the granularity that the timer can optimize things. For example, the kernel which the introduction of the HR timers, the resolution timers, you can align timers so that timers are not really spaced there and they're exactly triggered at a particular moment in time, which is a simple system knob. You can also say, oh no, it's not so important that the timer is triggered at this time so that the kernel aligns timers at a particular time and allows a deeper sleep state again, something. This knowledge can be combined with the knowledge of this, what you see here, for example. Where are the timers? Right, CPU 0 is somehow special. There are the timers. Can you move, for example, tasks to CPU 1 so that this other CPU cores can go in a deeper sleep state, for example, right? All this is important to do an optimization there. There are some general options. Some others are not always required. This can be turned on with this particular flag. There's CPU, often you want analysis on a particular CPU so you can limit the data. And there's a file out option so if you want to do a post-processing, as we saw in the images, somehow the data is not put it on the standard out, so it's put it on the file and you can use this there. And the data is also written in a day and sanitized that you can trust through use partners here to read the CVS data. And for the post-processing, it's really easy. But there are multiple modules there provided. This is just a sneak peek on the timer module, but there are a lot of other modules as well. You can use them later on. But to the time limit, I just highlighted this timer module. But one last sneak peek here, for example, is the governor. The governor is the component within the kernel to do the processing and commanding of disease deep-stakes. This is the governor. You can select a different governor. It's normally the menu governor. There are other governors as well. And here what you see, how often is which C state is commanded here? And what is also analyzed is, was this good or not? Because the kernel doing a guess working, right? So here the things are the next time in 10 milliseconds, there's a workload because the timer will trigger. So it puts a processor in a particular C state. But was this the right decision or sleep is too narrow, too shallow? And so this is also important somehow. And here you can debug the governor. A student of mine also discovered a bug for the AMD stuff. It's for one particular C1 state. It's switched all the time to the wrong state. But I think this will be released in the next couple of weeks somehow. So it's really also important for you. If you see, does the governor does the right job here? This is visible with another analysis, but there are multiple other post processing steps. And yeah, that's all. I hope this will be integrated in the mainline next couple of weeks. But if you want, you can use this kernel tree and this particular branch to use this. It's just a perf script, really easy also to use out of the tree. And this post processing scripts cannot be shipped with the kernel. That's not how the kernel somehow works. This Python scripts and there will be always available here based on this. And at the end, good documented, hopefully somehow. So yeah, that's all questions. Yeah, perfect. Questions. I'm always getting a question. Process of coverage, just x86. What's the coverage of you got? I mean, now look, I've got an M1 Apple thing. Would I be able to run it off there if I run Linux on that hardware? Yeah, this script will work on ARM x86 for Intel and AMD. There are differences in the P state tracking because P state tracking is the introduction of Skylake and HWP with hardware tracing. So it's will be not visible, but it will be visible on ARM CPUs. For example, some as a sample work, some will not work, but it's just Linux and all the major. And some are more software, the analysts of scheduling events. Somehow it will always run, but more hardware like analysis will not work somehow. But yeah. Just a follow up for previous question. Will it work for like, Graviton, all this kind of cloud proprietary processors? Yeah. It would generally run there. If it at least Linux ARM and the processor and it will just be the same. So no difference there. Yeah. So I think it's going to be a good idea to run it on the same hardware. Yeah. And there. If it at least Linux ARM and the processor and it will just be the same. So no difference there. Another question. If not later on we can install a script at your PC and test it. Hi Aaron. That's just a follow up on the previous question. There's actually an extra library, LibOpen CSD, which gives you a whole lot of extra stuff on most ARM cores, but not necessarily apples and Amazon's ARM cores, but any that actually come from standard designs. So a Turing design here. One goal was that it runs everywhere somehow, right? It must be general. And I don't, we don't skip going into the EPPF world. So there are advantages to do things in the kernel to aggregation in the kernel. So but this has sometimes problems with on specific ARM and PSOX and embedded products. So the design was really that runs everywhere. It's easy to use and generally available somehow. Somehow EPPF working with EPPF things to in the kernel and process unwanted data there out has some advantages, right? But you need a tool train then on an embedded product. So it's not that great somehow. And this everything I told you was somehow the idea on the design somehow or extra library. Keep it a bit of minimal stuff which works everywhere somehow. If you want to do more and often you want to do more if you analyze your particular task, how is the scheduling behavior you need more and you need more custom scripting as well somehow. But this is not here. I think it's a lot of data already there. Easy available somehow. But if you want to do more, you need more scripting and things like that and libraries you want to use. Sure. It's a compromise. Maybe a question for me. Can you give us a few insights about the community? How many developers, how many people contribute? Currently I'm the main developer. But at the end it's just in the Python script so it's not really the rocket science. And there are students also working on this, help things and looking at the details. But yeah, it's not that magic somehow. It's just keeping things putting together and make them easy usable. The trace points and Steven Roslatt and all the things, the infrastructure that the kernel provides are the main drivers. That this is possible, right? So just in script. Thank you so much.
How can Open-Source help the Wind Power industry?
Hi everyone. Everybody can hear me well from the back. Okay. Good. So, to introduce, I work for ZF Windpower is not any company doing software. What we are doing are turbines. Okay. We produce pieces that go to produce wind power actually. But I'm part of the digital team. Why do we have digitalization? We'll talk about later on. Okay. Like that. But to start, let's tell a bit about my story with wind. I didn't start just working with turbines. My love for wind dates back much earlier when I used to live in the beautiful Marseille. Marseille, nice town for who has been. Less than six hours from Brussels and it has a great resource, wind. At the time, I used to do sailing. You come to the Vierpart, city center of Marseille, have your pasties and you will see beautiful boats. You go out, the sea is nice. What has this to do with the energy transition? Well, historically, Marseille, the main industry was fossil fuels, oil. Okay. We have a place called Fossilmer. Great place to sail. It has great wind and one of the best winds in Europe. Just in front now of the oil factories, what do we have? The latest technology of turbines, floating turbines, means devices that you can put on the sea. Very deep, very recent technology. Those are, if not the most modern, among the most modern of France. The power is 8.4 megabatts and now they are three. They can already power a small city like Martig. They're just close to it for who is familiar with the area. So from the love for wind, from sailing, now I can see that this gets to the energy transition. Right? Is it only Marseille? No. What happens in Marseille does not stay in Marseille. Valid also for energy. These are graphs that I created myself from data from Kaggle. So open source data. What do you see there? Well, you can see that around the world we have some big production of renewables in general. And I guess there will be somebody here who has been or is from South America. South America is a place where you have strong input from hydropower. But at least to stay here in Europe, wind here is a big deal. And it's increasing. If we look at countries like Denmark, we are already almost at half of the energy production. National energy production of wind. It's a combination of good wind, because up north is really good, and of politics of wheeling it. I told about Marseille, there is good wind, but France is not even close to the measure. That means wind is definitely one energy resource that we will use more and more. Very important. And produces at a big scale. We have already now in Marseille 25 megawatts. It's huge, only with three turbines. All good? All great? Well, we have some problems in general with big installations. Things can go very wrong. Very wrong. And this is not just a matter of changing a small component. Let's say that a turbine, like even in land, not as big as the one on Marseille, gets faulty. It has to be stopped. What happens? Notification processing. It has to go, and we need to tell someone. Two days, then there will be an inspection time. I get the team to inspect the fault. Two weeks. And if I have locally the component, the replacement, will be six weeks to replace it, but maybe much longer. Component may be on the other side of the world, even. We don't know. And then the repairs, a couple more weeks. So for a turbine, which is 3.5 megawatts, not the latest technology like I show in Marseille, the whole intervention, if you are lucky, let's say 10 weeks, is lots of money, 125k, at least. A lot. How do we tackle this problem? By forecasting and by optimization of spare parts, so getting the spare parts already in-house, and to get it start as quick as possible, faster return to operation. And this is done by treating data. What do we do? Ideally, we monitor and predict. So there is an alert. It has to go to the cloud. I have to classify the failure already in the cloud. I had to know what is about. I have to prescribe a solution. I have to find the spare parts. And I have to forecast when I need to apply my solution. What, where, when. That means alert, data are collected and analyzed, and compared to historical data. The graph that you see there is like a exponential curve. I don't get into the Bible modeling. Here is not the moment of mathematics, but is a model that predicts well when cumulative failures will occur, subjected to a certain type of failure. So that's the type. I don't, as I said, don't get to technical. This is not the moment, but we can discuss it later on. How do we do that? Well, wind turbine data come, production data come. So this is from us, from people who produce. We get it to the cloud and we do prescriptive maintenance. What means prescriptive? Well, let's see quickly. Reactive. I fix the failure. Okay. I have a puncture on my bike. I change. Preventive. I do it regularly, like changing the oil in the car. That's what is typically done also for bigger installation. Predictive. That's what I talked about just now. Prescript is AI. Okay. Familiar with that. So AI tells you this is going to happen. Please do this. Data analysis with open source software allows more and more sophisticated maintenance. We have just been talking so far about AI, power by Python and so on. We have seen very good demonstrations earlier on. And what is our digitalization tech stack? So to get more specific, I already introduced Python. Then pandas to treat data frames, to treat data, at least to a small scale. Lifelines is a package that implements ViBool and is open source. Myself, I had already my own version where I added some modules. Docker, Git, of course, we work as a team. And it goes to Azure DevOps. Notice, I told about the cloud. I told about DevOps. DevOps is not just a technology, it's a way of working. That's very important. The technology allows to work us in an agile manner. And that's how we get results. So what we want as a result, reduced on time. We want to shorten. Instead of six weeks, let's get less or 10 weeks. Reduce the cost. So we need to have a proper stock level. And for that, we need predictions. Reduce template maintenance. We don't want the scenario where we have to call the technician. And the technician has to come from the other side of the world. And to avoid, of course, the consequential damage by addressing recurring failures. Okay, if I see that a certain bearing is always failing, let's address that. So, why? How do I know that I'm not just talking hot air? Okay. This all sounds great. Okay, you use AI, you use open source. Does it work? Well, what did we do? Then as producers, as manufacturers, pardon. We went to one of the customers and we proposed a pilot project. We apply our techniques. Let's see. The results. Okay. How did it go? Well, 50% less of alert processing. All these alerts. So, unplanned field inspections, 60% less. We strongly reduced the lead time to repair because we could forecast and we had the right parts in stock. And in general, the annual energy production got up to 0.5% of all the park with also with all the turbines. This means lots of money. Of course, for corporate reasons, I cannot tell too much, but you can figure out. To conclude, okay? Without, then we can go more technical if you like, guys. But take away messages are fragmented value chain affects very badly the wind energy efficiency. Don't have value chain, which is all dispersed. Okay. In general. But data insights and very good communication from the data. Has great benefits. We reduced the alert processing effort. We have prescriptive maintenance, which allows us to decrease the time to repair and we increase the overall efficiency. So, the annual energy production is one of the main KPIs for the public. So, we have a lot of data. So, the annual energy production is one of the main KPIs for wind power, but in general for any energy source. All these could be achieved with open source software. Okay. I sort of saw the full stack. Finally, it was the devos practice, not only the software that allows us the success of the pilot project. And I guess here we have a lot of people who are familiar with devos. Now, I guess I still have a couple of minutes for questions. Anyone? See. Hi there. I used to work for Siemens wind power and they had a predictive maintenance team. I'm just wondering, have you found any other companies as you've put open source kind of using your tools? Well, I'm not dealing directly with the customers. In general, with the customers, we just propose our solutions and we exchange the data. For example, if Siemens Gamesa has failures, we communicate the failures to the dev and we can suggest a stock amount for a certain component, for a certain turbines. But it's not that we are like the software company going to sell that. You see, it's more like the normal customer relationship when you sell parts. But how can we make, let's say, predictions? How we can interact? How we can serve better the customers even as a company? Good analysis of data and for that we use open source.
Energy optimisation: smart home meets smart district
Good afternoon. My name is Rik Barillot. I've been a core member of Open Remote for a bit more than… louder? Okay, sure. A core member of Open Remote for a bit more than 12 years now. I'm not the person who was supposed to give this talk, so I'll do my best to work it through. Don't hesitate to come back afterwards, and I can point you to some of my colleagues that worked on those projects. A bit away? Yeah. Okay. Okay, I'll do my best. And speak louder? Okay. Okay, so Open Remote, it's a 100% open source IoT platform, so it would do whatever you expect from an IoT platform. Back to the devices, have some logic, and user interfaces. We'll come back to that a bit later. So open source, fully free, available on GitHub, and a community throughout the world that's pretty active. But also some projects that we work on with some companies. That's mainly what the core team does when I said professional. It's working on those projects. Also they have projects that are in home security or smart cities, typical IoT projects in more exotic things like smart clothing, architecture, and of course a lot of projects in the energy domain, energy management, but also some link to other aspects of energy. And we'll go into a bit more detail in the Nottingham city project a bit later. So looking at Open Remote, what is it? It's mainly a middleware developed in Java. It has a database that is both for the configuration of the system and for the state of the system. So the current values of your sensors, but also all the historical data. It has quite a few connections using standard protocols, so you can connect to gateways or to data feed. We'll see that later. Awesome property hardware. It has a set of user interfaces. You have standard more management user interfaces where you can configure the system or see the values or trigger some actuators. You get Insight, which is a dashboarding kind of application. But we also have a set of web components, freely available that you can use to build your own custom application for a given project. And so you have an application that you can access through a browser, or you can embed it into a mobile app, what we call the consoles. And you can also connect to other systems like Grafana, Power BI, if you want to have extra features. Then you have, of course, a mechanism for the logic. We support different type of rules engines, simple through the UIs like IFTTF. So if then that or more advanced features like Groovy scripting. So if you want to go really deep. There is a set of default services, so building blocks that you can use, for instance, to push notification to the mobile phones or to place devices on a map or to implement optimization services, what we'll talk about in a minute. And this is, of course, built with security in mind. So there is a strong identification, authentication and authorization layer in the system. So coming to energy optimization. We'll talk about two things. As we say, what we call smart home, but it can very well be a smart office or even an office complex. Basically, it's the concept of an island behind a meter. And you have kind of a sole proprietor of the island. And then when you move to the smart district, it's a composition of many islands behind one transformer. The problems are a bit different, but the system is the same. So if you look at the system, yeah, whatever, I'll do this. You have your renewable energy, so solar and wind. You have the grid, both import and export. You have a battery with charge discharge, and you have your load, your consumers, but can also sometimes feed in energy back into the system. Some electric vehicles can do that. So the goal for the smart home is to optimize either based on the cost, so you want to pay the least amount, or on the environmental footprint, so you want to be green as much as possible. The data that we have to do that is for the renewable energy, we are going to estimate the consumption based on the peak characteristics of the installation, so how much your solar power can produce, solar panel can produce, and on weather data, so we can take the estimate of that. For the grid, we have dynamic tariffs, so people can, for instance, have contracts where they pay a different tariff by the hour or by the quarter even, and so we have the data to know those costs, but there is also a carbon cost associated with the type of energy that is produced. The battery, it's a charge discharge, but there is also a cost, so a levelized cost of storage, so for instance, if your battery costs 1,000 euro, and it can do 1,000 charge discharge, every charge is charged cycle is 1 euro, so you need to take that into account when optimizing, and so for the loads, we have the path consumption, and we do a weighted exponential average to predict the future consumption on that. So now what we are trying to optimize, as I said, is minimizing the cost of the carbon exhaust based on all this data. And so the system will control what we call the flexible load, so depending on this data, it can decide when to charge or discharge the battery, it can decide when to charge or discharge potentially the electric vehicles, or it can decide to control heavy loads, like heat pumps where you have a bit of freedom and when you can power them up or the temperature set point, things like that. And this can be automatic of course, but it could also be simply manual by pushing information to end user through the UI. When you move to the smart district or the collection of island behind the transformer, you have a slightly different problem, which is the transformer that is between your district and the grid, which has a peak capacity, and so what you want to make sure is that you stay under the capacity of the transformer, both for import and for export. So when there is a real high production of renewable, you don't want to surcharge the grid. So the data that we have is basically the same for the battery, for the renewable and for the loads. In addition, we have real time peak power, not peak net power of the transformer, so we know how much the transformer is currently taking in and out. And we also can then adjust the optimization algorithm with a fake kind of tariff. So if we know that we need to change the consumption on the transformer, we can like fake how much the electricity would cost so that the optimization algorithm would steer one way or another. And so we keep doing the optimization at each individual island, but we want to push for the global optimization so that the grid stays or the transformer, the grid stays under control. And so one additional problem comes now with the fact that you have many households, for instance, in a district, which can have their own technology. So it's quite complex to control them, to automate them at all. So one way, and we're exploring that, is interfacing with more home automation systems, like Open Hub or Home Assistant, for instance. Another way is to manually impact. And so what we can do is send personal challenges to every household where the people can earn points, which basically earns them money if they play nice within the whole ecosystem. And there is a lot we can do is we also have shared flexible loads. So for instance, in a district, you can have the shared charging station for the electric vehicles. And then we can control and, for instance, diminish the available power so that we can also keep the grid under control. So that is the general idea. That is what we are aiming for. There are several pilot projects that are starting to implement that. So this is the global idea. One of them is the Nottingham City Council. The idea, it's a smart home, but really it's more smart, well, we could say office complex. The idea is to control the charging of all the vehicles, electric vehicles that are used by the City Council at Nottingham. And so what it means is you can control a global static battery plus the charging of all the vehicles to save money. You can also control or you want to have your vehicle charged at least to some level because you want to use it in the end. And you also want to prevent surpassing the limit, the power limit that you have for the whole district. Oh, sorry, Council. And so what you see on the right is the dashboard interface that we have in Open Remote that can show you the different location of the vehicles. So we can track that anonymously, but we can track the different vehicles and the global power that is currently used by charging of this vehicle. If we now move to the smart district, this is a project that is currently starting in Amsterdam, where we have a community of about 500 households that are part of this project. One thing is each household can control their consumption by we interface with the meter and they can see a real time information about the power they're consuming through the mobile app so they can adapt their own consumption. We have the challenges that I talked about so they see how the whole district is doing and their proposed challenges so that they can play nice within the neighborhood and by doing so earn money. And we can also, as we said, limit the if there is really an emergency, we can control the heavy loads that are shared for the district to make sure that we don't go above the limits of the transformer. So it looks a bit like that and these are design of slides so there are some inconsistency in the wording, but globally every participant will see his own consumption with a bit of a history on how the district is doing. And the green dots around the indication are a global indication of how the district is doing. So it's really gamification there. Now you see that at some point the neighborhood might be reaching the limit so we are reaching the limit at the transformer level and so we will propose to the person in each household a challenge saying well for the next hour you need to keep your consumption below this level. If the person accepts then for the duration of the challenge they will see their own consumption, see the limit, how they are doing against it and how many points they will collect. And so they also receive tips, say well potentially if you want to keep your consumption under the limit maybe charge your car a bit later or set the temperature a bit lower, something like that. When the challenge is done they see how many points they have collected and then they of course can see a summary of all the challenges they have completed, how many points they have earned, etc. This is the view from the manager so we can see different meters that are all connected to the system. At this stage as it's pilot project they have 50 meters connected, the project just started, the target is to have 150 by the end of the month February and with 150 this should be enough to already influence the whole behavior of the district. So with 150 connected meters we should be able to have an impact really on how the district and the impact on the transformer. And so this is here the dashboard where you see a summary, the small diagram I showed with the consumption and the load on the transformer, how we are doing compared to the peak performance of the transformer, a historical graph and things like that. So thank you, these were the two projects that are currently running on energy management at this stage, there have been others. You can find the open remote platform in the GitHub repo, there is also the forum where the community is active and other information. Thank you very much.
A journey accross the environmental materiality of digital services
Hi. So in this talk, we'd like to take you on a journey across the environmental materiality of digital services. So the speakers in front of you, here's David. My name is Benoit. We are contributors to an NGO called Boavista that we'll present briefly later. We also are colleagues in a small company called Hublot working on ICT and environmental impacts. Regarding Boavista, so the NGO we work for and this is the work of this NGO we present to you today. This is an NGO based in France that gathers more than 250 members now, private companies, public organizations, universities, researchers, freelancers and so on. And the goal of the organization is to provide public and open methods, data, tools and knowledge about environmental impacts of ICT and its assessment. And of course we try to provide a useful open source, open data and open science stuff. Thank you Benoit. So today's objective will be to see how can we get from digital service to its environmental materiality. Environmental materiality is another way of seeing its environmental impact and it includes not only its carbon emission but also all of the other pollution and its usage of renewable and non-renewable resources. To do this we need to follow a process which is called environmental accounting. And at Boavista we have chosen to do it with an open source approach. What is very difficult when you're doing accounting, environmental accounting in the context of ICT is that you must take into account all the value chain of your digital service including the end user equipment, network, data centers, so all of the infrastructure that your service is using. But you also need to take into account another dimension which is the lifecycle phases. So you don't want to only include the use phase impact of the use phase but also the impact of manufacturing the equipment that your service is running on, transporting those equipment to their place of usage using them and also the end of life of the equipment. Today we won't be able to dig in all of the dimension so you'll see on the slide what we're going to focus but Boavista is working on all of the dimension here. It's still me. So why have we decided to do open source? So we're out for them, I think everyone here is convinced that we should do all of the data and development with an open source process. But when we talk about environmental accounting, it's more specifically important to follow an open approach. First, because we believe it's a democratic necessity. Environmental figures are often used to justify political orientation. For instance, the Green New Deal is full of environmental figures and we believe that citizens should be able to criticize, audit and criticize the figures that are being used to make political orientation. Also, environmental figures and environmental accounting are used to label product and services. I think you might have seen some data centers who said that they are greener than greener. But to say this, you need to rely on figures, environmental figures and often those claims are not based on open approaches and figures, which is for us a problem since consumers cannot audit and criticize the figure. There is also a very more straightforward argument because today environmental accounting in the context of ICT is very immature. So the data that we use, the data that we report are of very bad quality. To illustrate this, we've done some work. We normalize the carbon impact of manufacturing one inch of a lead panel, so lead screen, and this is the impact for manufacturing one inch, the carbon footprint. And you see from the five data sources that you have here, we have a magnitude of 10 between the lowest impact and the highest impact. We could think that HP has a way better environmental friendly process than Dell, but this is not the case. At least we cannot, this is not the justification for this difference. This difference is, there is this difference because all of those providers are not using the same data sources, the same hypothesis, and the same method. And because all of those are not open, we are not able to explain you why there is those difference. So open source should be a way, if all of those figures were based on open source approaches, we could try to normalize those impacts, compare the provider once they get another, and explain why different providers have different impacts. So let's first focus on the energy footprint. So I guess the energy footprint is the part of the ICT footprint we mostly think about when we work in ICT. That's easier to get a grasp on it. But as David said, it's still, when we look at energy in ICT, it's still only one part of the impact. So it's really about the usage phase. It doesn't cover the rest, which is, which can be way, way greater impact than just the usage phase. That's also true for data centers. In what I will present to you today, most of the information are accurate for data centers. Some of them may be applied to end user equipment, but we didn't include specific information on network equipment. So we are going to include specific information on network equipment for technical reasons and also because it's hard to get data on that part. So first, a little bit of context regarding data centers. I don't know if you've seen the latest figures from the EA. EA is International Energy Agency, and it, let's say, it's a rather conservative organism so far regarding ICT and their own impact figures. But their latest figures is quite enlightening because we can see that in 2022 we were around 400 terawatt hour of energy consumed by data centers, which is the double from what they previously said for 2020, which is a bit strange. And also that their projection for 2026 or in two years says that it will double again. So around 800 terawatt hours. Part of it is because of AI, but not only. You guessed it. So this is the context. What we can say here at least is that we are really in hyper growth trend and not the opposites. That's not what we have seen in some medias like data centers and energy consumption is flat. That's not the case. Then what's the issue here actually? What do you want to look at? It's not just about the energy consumption, of course. I think I won't teach anything to anyone in this room when I say that energy consumption means that we at some point consume oil, gas, and coal or other energy sources. This will emit greenhouse gas emissions, of course. But we will also consume water in the process. We will consume water if we take into account the cooling of the data center. And we will consume minerals and metals and other resources. Not all the resources that we can account for are listed on the draw. But there are 16 environmental criteria that we take into account in Boa Vista tools. So what do you have in your position to work on your own energy consumption of your services? So we have talked during the day in this room about paraffin power top. There are other options as well. Of course there are physical measurement devices. So smart PDUs, ID rack or ID low administration cards if you have them on your server. What matters in general? This is one way. The other way is software evaluation. So those are the options that I've listed on the top. All of them are open source solutions. If you are, let's say in a bare metal server context, you might choose power API, paraff, power top, SkaFound. If you are in, let's say more in a development phase of software, you could use power Jula. If you are in a Kubernetes context, Kepler or SkaFound may help you. And if you are in a machine learning context, code carbon could be of good help. And these are some examples. What's behind the scene is actually interfaces that have been mentioned previously in the day. So NVIDIA, SMI for getting the energy consumption of GPUs, RapeL for Intel or AMD, X86 CPUs. And the third approach is modeling. So we could classify this as measurement. This is more about modelization. And some of those tools also use modeling, then don't necessarily use only measurement with those interfaces. And the Bois API is also part of it because it does modelize energy consumption and answers about what's the carbon composition of the electricity. What, if I take the words from the previous presentation. But when we say that we have to precise something is that both hardware and software measurement tools have their limits. If you take the wider purple and pink squares, they represent what perimeter, a physical device will be able to measure. So the whole machine actually, but you won't be able to zoom on what's the footprint of a software or a given component. On the other side, if you look at the yellow and green screens, not so green, but the smaller squares here, we see this is the perimeter that RapeL is able to measure. So a CPU, if there is an integrated GPU memory can be measured for GPUs, SMIs. In some cases you may have a broader perimeter in RapeL, but this is for recent machines only. So we have an issue here because we are in a trade-off between completeness of the evaluation and precision and the ability to zoom in on the footprint of one software, for example. And so how could we fix that situation? In Bovista we are launching a project called EnerGista, which is basically a collaborative science. This is a collaborative database that we open and we propose voluntary organizations, individuals to share with an open source agent energy data and data about the hardware of the machine that has been measured. Which will help us to do statistics and then at some point produce better models that will help us improve software evaluation for power evaluation. Thank you Benoit. So from the beginning of the presentation we've told you that the use phase and the energy consumption is not the only thing that should take into account when you want to account for the materiality of your service. And this is where the life cycle approach comes in. So a life cycle approach will try to take into account all of the phase of the life cycle of your service, but also all the impacts, well, most of the impacts of your criteria. So not only carbon footprint, but depletion of minerals and usage of water, for instance. We're going to focus here on how can you identify the environmental impact of manufacturing a server. So it will be mostly in this area. But at Bovista we try to have a comprehensive approach by identifying the impact of all the phase from all the value chain. So this is a very, very partial and simplified model on how can you get the environmental materiality of a server for a specific service. So the first step that we do when we do environmental accounting is we try to identify what is the technical infrastructure that hosts the service. And this is often the most difficult part, because for instance if you take a function as a service that runs on AWS, it's very hard to know what is the specific consumption of resources and what is the technical material that your function is running on. But we need this data to know and to understand what specific component is used and what is the impact of those components that we should allocate to the service. So this is sometimes like archaeology when we need to dig and try to make some hypotheses to know how do we get from a service to its technical layer. But once we have the technical layer, we need to go to the raw material, because this is where the impact comes from. So we try to map all the processes that needs to be completed to assemble and build a manufacturer server. In a simplified way, we could say that a server is an assembly of plastic for the casing and packet and components. So CPU, RAM, Graphic Card Card and so on. And a component has many processes, but the most impactful process is the packing process. When you pack the dye, it's part of the components that is engraved where you have the semiconductors. And for this, you need to have metal. And for having the dye, you need to engrave a silicaweather. And so as you can see, the process of engraving consumes a lot of water. And also you need metals to, of course, produce a silicaweather. Across all of these processes, there is the use of energy, which also will use raw material, which will cause the pollution and resource depletions. So of course, you don't want, each time you want to assess a service, we are not going to draw this map and go until the usage of coal, oil and so on. So what we do is we factorize the processes and we make them easier to access through the different tools we are building at Boavista. One main tool that we have is Boavista API, which is an API that can make a translation between the ICT world with IT people and the environmental impact. So you give to the API a technical configuration. It can describe a digital service, an equipment, a component. And the API will give you back environmental impact, not only on global warming, so not only on the carbon footprint. But for instance, on other impacts that has primary energy that you should know if you know a little bit about energy and abiotic depletion potential, which is a criteria that assess the removal of non-renewable resources. So this includes minerals and fossil resources. Around the API, we built, so our architecture is in microservices, so the API is a central microservice. But we have other tools, such as Cloud Scanner, which will scan an AWS account and try to assess with the API the impact of the AWS account. And we have also a pedagogical front end, which is called DataVista, which is based on the API, but it's just a nice layer on top of the API for people who doesn't want to manipulate API. So for instance, here is a way to assess the impact of a server. And you see you can configure a server. For instance, let's say that I have one CPU. Demo effect, where it's just, okay, I put an L. I can also change the location where I use a server. So this will chase the carbon footprint of the electricity where the server is running. So I invite you to play with this tool and see a little bit what is the main cause of the impact, both from the manufacturer and the use face. And also the manufacturer impact, you can have it by component. So it's also interesting to see which component is most impactful. There is also other features, which are also in the API. You can assess the impact of your cloud usage, for instance, or of end user devices, but we haven't introduced those during the talk. Yeah. The API is you can scan the QR code and this will get you to the repository of the BoaVista API. We wanted to open up this talk. So we've begun by talking about energy. Then we took a broader approach with the life cycle approach, life cycle assessment approach. And we wanted to open up with an even more systemic approach, which I call systemic footprint, but it could be also called a consequential approach. Yeah. From the beginning of the presentation, we'll talk, we've talked about the direct impact of digital service. So it means the impact of the value chain of the service. But maybe sometimes the most impactful part of digital service, it's not just direct footprint, but it is the indirect externalities, environmental externalities that is brought by the fact of deploying your service. Your service, you're doing your service for some usages and you need to be careful on why your service is used and how your service is used, because this might be, this service might be used to make environmental harm. So when you want to understand what is the consequences of launching your service, you need to take another approach, which is a causal approach, and trying to map the different causes and consequences that are, that follow the introduction of your software. For instance, if you take a cloud provider, cloud are known to be often more mutualized and more optimized in time of energy, energy usage and carbon footprint. But since we have consumed, since cloud is very easily accessed, we are consuming way more compute resources than we did before. So this is what we call the rebound effect. And this is something that we can get from a basic life cycle analysis. We need to have a more systemic approach to, to understand all of those social transformation that is brought by ICT. And I think we're done in. Thank you for your attention. We have some minutes left for questions. Yes, it was very interesting. But so the problem is that everybody must know this kind of thing for in collaboration to climate, environment and so on. And because there is no studies of Greenpeace about this kind of thing, about energy provider. Yes, in Belgium, but so this kind of thing is very difficult of because so I know that the three said what is French? I don't know in English. So I'm a Amazon Web Service. Yes, so and this, this kind of thing is very, very, very important to their data centers, how it's take off energy, its, its harm effect on the river or something like that. So, so all this kind of thing for the construction of, of a computer and so on. I would like to have that is an, an, an Greenpeace barometers of this kind of thing everywhere. So because it's very important for our future. So also when dissipate energy in a river, a separation of energy and so on. So your remark is about awareness, I think. So there, there, I think there is no report from Greenpeace, but there is a report from WWF at least. And I think the, the main purpose of Boa Vista and the tools that we're building is not efficiency, but it's more making people aware of those problems and taking action because I think, and I think we can, we can talk for both of us. We think that having more IT people engage is one way to, to fight against those, this, this, the impact of IT. Hello, thank you very much for this. When you were presenting the server impact thing, I have a technical question. There was a discussion about jewels and primary energy as opposed to something that we might use like kilowatt hours, which is quite common. Could you maybe talk a little bit about why you chose that rather than a figure that we see used in lots of other places? Because that is something that I found a little bit difficult to understand when I first looked at it. So primary energy versus secondary energy. If you could explain some of that and explain the decision to choose one versus what hours, for example, instead of jewels. Yeah. You want to answer? Why do we express primary energy in jewels? Yeah, what I can say, but I don't know if it's an accurate answer, but in practical terms, jewels are very used for very precise measurement purposes. Most of the time when we talk about big figures, we are more about what hours, kilowatt hours, megawatt hours and so on. What's its power? So it's not expressed on a timeframe. That has been said in a previous talk. I don't know if it clarifies or... Yeah. Oh, okay. Actually, I understand the confusion. Primary energy is an impact criteria. Secondary energy is a flow. So it's not considered a final impact. We use... If you see here, let's say, we can model the secondary energy, the power usage here, in what? And we use it to compute the usage impacts for the difference impact criteria. Primary energy is how do you deplete earth from primary energy? Does it? Maybe you're time for one. Maybe you can do both. So the question is, because of some countries now don't want any more of the rubbish servers from our countries, did the data centers change the policy in terms of management, for example, for the storage system? At Google, they used to break the hardware into small pieces, not even recycling them at all. And where there are changes recently for the spare parts management? Because of the fact that countries don't want to make the recycling any more offshore countries. Actually, that's a very complicated talk.
Power profiling my entire house with the Firefox Profiler
Thanks for coming so late. I'm Florian Kez. I work for Mozilla as a performance engineer. You might have been here last year when I was talking about the work I do. So as a performance engineer, my work is to understand how much Firefox uses power and what we can do to reduce it. So I was explaining last year how we developed power profiling tooling. That was the cover slide. For example, I was explaining that we have power profiling tools that let us understand how much power is used by things so small as just blinking the cursor in the address bar. So this is what I was presenting last year. And if you want to hear more on this topic, I will be doing a similar presentation updated and extended tomorrow in the main track. So today, I will be sharing a different story. It will be more a story, actually, because it's late and I want this presentation to be easy to follow, maybe a bit entertaining if I can. So first a story about why I worked on power profiling the entire house and then technical details and then lots of examples because those are the most interesting. So the story. So there was first time in February and in April, we had a new member in our family that I was very happy to welcome and that completely changed our life, of course. Two days before she was born, I installed this on the wall. One of the reasons why I installed this is solar panels, it's not obvious. I wanted most of her energy used to be renewable. And I tried before to have solar panels on the roof of our house and it turned out to be extremely difficult, which means we failed to get around those. Reasons were mostly there were chimneys on the south side of the house that were making massive shadows on the roof. Lots of other issues with the roof. Basically, all the companies who came, they never gave us a code. So we couldn't get panels on the roof. So install this and I was wondering, can this power of a bottle warmer that we will use for the milk we give to the baby? I work from home. I work on energy efficiency all the time. Will this power my home office? So I had questions and it answers. So how could I answer those questions? I installed the power meter that you see here inside the electric switchboard of the house. So it's communicating with RFI, I'm measuring three different things. The link with the grid. So seeing if we are importing or exporting electricity. It's measuring specifically the solar panels I had put on the wall. And it's also measuring my home office. So that I could answer the questions. Of course, I very quickly came up with more questions. So I was also wondering about the washing machine, the freezer, and a few other things in the house. So this is what the thing quickly looked like. So a bunch of things in here. I made the thing in the first place so I could make a mess of it if I wanted. So now we are measuring also the link to upstairs because there's a second panel upstairs. The freezer, the boiler, the washing machine, those kind of things. And also, I needed to answer the questions. So we put a smart plug on the bottle warmer to be able to figure out what was going on there. So now let's go into technical details. What am I doing over all this? How can I get relevant information? So first I need to correct and start the data. I have a constraint. I have nothing in the cloud. Because it's very personal, sensitive data. All the parameters are connected through Wi-Fi. But with parental control, they have no internet access. They all send data through MQTT. They send one piece of data every second. And there's an Ubuntu virtual machine somewhere in the house that hosts an MQTT server. And with trivial scripts, logs everything to disk. So that part is pretty simple. Then second part, I need to visualize the data. Because if I just have massive log files, I do nothing with it. And this is where the Firefox provider part comes in. A tool I was very familiar with because the power profiling part I made the previous year. I have on the Ubuntu virtual machine a trivial script that converts the data from the files on disk to a JSON file the profiler can understand. And the profiles contain mainly two things, power counters and markers. So this is what it looks like if you're not familiar with a profiler UI. You might not be. I will explain very briefly. So there's a time axis here. The top part here is what we call the timeline. Everything is against time. The values thing I said I'm metering, you can see them here. You see the shape of the chart for each of those. And markers, they are here. And they can give us more specific details about specific things that the script thought was interesting. And you can see here that for example, so BIM is the brand of the wall. You can see that typically produces more in the middle of the day. You can see that when the cloud is less interesting, many other things will go into more details later about what we see there. So one thing I wanted to mention here was the date, which is the most important, sorry, the date is the most important thing here. We were three weeks in after we got the baby. And this is what I spent most of my days doing. And actually most of the nights too. And how this works. Usually when people get a baby, they say they have no time left. I actually had the exact opposite. I ended up suddenly having plenty of spare time at night because she was waking up so often that we couldn't sleep. So we were taking turns. And half of the night I was up. And she would wake up, want to have some milk and then sleep a few minutes later. So I had plenty of hacking sessions that were somewhere between 10 minutes and three hours. Unpredictable. But I had multiple weeks of having those sessions at night, which was why the code is maybe a bit messy because I had to do it in small chunks. But it worked pretty well. Otherwise I would have had no time to do any of this hobby project on the side. Also the generous parental leave at Mozilla helped a lot because that meant I had lots of those weeks where I could stay up at night and do those kind of things. And then more seriously, generating a JSON file that the profiler can understand was really simple. Maybe because I work with a profiler a lot, but still I think most people could get it done and get something that works relatively quickly. And also I don't have to host any web UI or anything because I can just generate URLs like this with the URL to where I generate the JSON file. And that's all I have to handle. I don't have to take care about anything in the UI. Then there's the stuff that didn't work as well. The profiler was made to profile Firefox. Typically we were having profiling sessions over a few seconds. I accidentally had profiles that were an entire day. So stuff didn't work so well in terms of units, for example. So I did put some good requests to add minutes and then hours. And then a few weeks later, days also. Changing the units, if you remember the screenshot I gave of profiling the cursor blinking in the address bar, we were talking about milli-watt-hour, micro-watt-hour. I wanted to see kilowatt-hours because numbers with many zeros were not so fun. Performance also, showing a profile that contains data for an entire day. It was not that bad, but it took maybe five seconds to display. I fixed it. And another thing that was a lot more important when profiling the house and that is completely irrelevant when profiling Firefox is knowing when something happened. In Firefox, typically we want to know how long something took. Here I mostly wanted to know at which time of the day something happened when we were starting to consume more power. So I also had to tweak that a little bit. It's also nicer when using the Firefox use case, but it was a lot more important for profiling the house. Colors, but it was just nicer. Everything was gray in terms of power in Firefox because there were a few attracts. Now let's go into examples. Doing laundry. Washing machine dryer. So washing, it consumes a lot of power twice. And this is most likely when heating the water. And then there's what? Okay. Whatever. I also wondered why it's doing it twice here. I think I saw it doing it new once a few times. So it depends on the program. Actually, I would like to profile the values programs. And if we zoom into this part that looks interesting, but we don't see because of a big thing here, we see there are lots of patterns here that are probably good enough to figure out what the machine was actually doing. And then the dryer, and it turns out it uses less power than the washing, even though it takes longer. And this is probably because we took the most efficient dryer we could find with a heat pump. And I also profiled my mother's dryer and it uses seven times more power than mine. Typical day at the office, home office. And this is why I don't want this data to be in the cloud. And I don't want my manager to have access to this data. We can say exactly at which second I return to my desk throughout the entire day. And you can see that there are typical days like this with small breaks in the middle. You can see the shape here is different. And then there are days like this one. And the main difference here is when you see that it's high first and then decreases, it means my battery was not full. So that means I probably worked from somewhere else than my office. So here, here and here, I clearly worked somewhere else from my office. And then the last one is on Sunday. So on Sunday, the only thing that remains power down is the modem, which is also useful for Wi-Fi and the rest of the house. But maybe before working, I should have started with breakfast. So this is micro-oven from the 90s, generated from my grandmother. And two things we typically do in the morning is unfreezing bread and heating milk. And I was surprised by the patterns there. The surprise is I was thinking that when in the infreasing mode, we would use significantly less power. And that's actually correct. But the problem is it's heating at the maximum power for a few seconds, then nothing for a little while. And every 30 seconds, it's heating for seven seconds, which means that if I'm hoping to use solar panels, and it's in the morning, and they are not at their peak production, I'm basically buying all the power from the grid, even though the average power is only 300 watts. And that's the kind of stuff we see when power profiling with a high rate sampling, but I would not see if I was looking at that every hour. And heating milk is what you would expect, almost a rectangle. So now, time for a quiz to ensure you are still awake. In your opinion, what uses most power here? Is it the massive chest freezer we've got that's full of milk? Is it the internet modem? Who thinks the freezer? Raise your hand. Who thinks the modem? So let's provide it to figure out. So, of course, very different shape. The modem is using the segment of power almost the entire day with very tiny variations. And the freezer, there's a spike at the beginning for a few seconds. And then it's stable for a few minutes, and then stops entirely, and then starts again. Modem, 27 watts all day long. It also runs the virtual machine that does all of us power profiling. So the answer is you are all right. They used exactly the segment of power during the entire day. So back to the initial question about warming the milk for the baby. So there's this milk pump, and then there's the bottle warmer. How much do each of us consume? You can just see the number. I don't think I'm going to read them out loud. Something that we quickly realized when looking at those profiles that was interesting is we see the timing, same as figuring out when I'm working or when I'm not working. And I'm not sure if you had a baby recently and had this experience, but you have lots of constraints about how long you can keep things. So milk that has just been pumped and kept at room temperature you can use for four hours. If it has gone in the fridge and you are heating it, you can use it for two hours. So to be able to know if the bottle of milk in front of you is usable, when suddenly the baby wakes up and you don't know when it just slipped last time because you were not in charge of that time, usually it's a mess. And we can make use of this data, and we did. And that's actually what we used the power metering data the most for is figuring out if the bottle of milk in front of us is usable. And we figured this out. The reason why we figured this out is only because we could see on the chart that actually it's very easy to detect the pattern. So it's time for a summer break. We visited my parents and they recently had those nice solar panels installed on the roof of their kitchen and it came with a gateway that's sending the data to the manufacturer or the gateway who's collecting a lot of data. I'm not too happy about that but it was not my decision. So it's sending one data point every 15 minutes which is good enough to figure out how much electricity was imported or exported on that day. Use this to figure out what you're actually doing with your electricity. And I noticed during one night of taking care of a baby that actually we can get one data point every second if we query a local HTTP API. So I did. Put a Raspberry Pi in there. Of course we can get profiles. So now let's see what they look like. That's what I saw at my parents' house and one thing quickly caught my attention. So it's a free-phase system because of a large heat pump I will go into it later. This thing looks strange. There's high power use here and it's throughout the day. And the only thing that could be using as much power is this thing. And it's supposed to be using power of peak hours because the price of electricity is not the same in France at night or during the day. And after investigating a little bit we realized that there was this switch here that was in the wrong position that was forcing the thing to be on all the time. So it was pouring on whenever someone was using water. And we changed the switch and now it's eating only around midnight and then a little bit around 7 a.m. and then it stops the rest of the day. And that probably saved quite a bit of money. I said there's a large heat pump so now we are no longer in the summer. I forgot to say something. The heat pump here has a large accumulator also. And when we look at the power use pattern we see the heat pump that's pumping and using a lot of power on all the free phases six times a day. And then there's the circulator here that's going throughout the day. So we actually can understand how things work. And we can see also how the power from the solar panels was used. Back at home some magic happened. I said we couldn't have solar panels on our roof but we had a baby which means that we returned home and after returning home there was a midwife who came to visit to check everything was right. And on the car that she used to visit us there were ads for company putting solar panels on roofs that was owned by her husband who's very proud of figuring out solutions to all the desperate cases where there's nothing possible and who came gave us a code that was very reasonable on a couple months later. The baby solved all problems that we were not able to solve for two years. So now we have real solar panels on the roof but that's enough about this part of the story. Fast forward December and it's time for another baby picture. She's grown up quite a bit. She's really into trees. Whenever she's crying and we don't know why we show her a tree and she's super happy. So we had to get her a nice Christmas tree for our first Christmas. And it's time for another quiz. In what you see in this picture what's using the most power. So obviously there's the Christmas tree here. The Christmas tree turns itself on at sunset and turns off at midnight. Then you might not have seen but we have the solar panels here and they produce power during the day. They use power during the night for some reason. So what's using the most power in your opinion. Who thinks the Christmas tree? Who thinks the solar panels? Okay let's provide it. So the Christmas tree uses 10 watts for a few hours here and the solar panels about five during the end of the day and the beginning of the next day. And if we look at the numbers Christmas tree 64 solar panels at night 67. That was a surprise to me but yeah you couldn't be surprised twice by my quiz I guess. But they did produce a lot more power so it's still worth having them. And I think we still have a minute or two so I have a few more things I can share. I have more power matters that are funnier and the interesting thing about this one is it can give me data at a 50 hertz something rate which is the frequency of the oscillating AC power. And I forgot this profile at home on a computer that's not connected to the internet but the profile was fun because we can see what happens whenever the rotation direction changes. We can see that there's a break in power used for a few milliseconds and then it uses more power when the motor restarts. So all those details we can see and expose with fast sampling and power profiling and it's pretty nice to see. And then USB power meters those are interesting if you want to look at the energy used by any random USB thing or anything that charges through USB. And there are quite a few in this picture all of those are reverse engineered to make compatible with profiler and that's part of the topic for another talk that I will be giving tomorrow but this is kind of how I worked with those. So reverse engineering a bit and then putting a load here USB light that I knew what it would look like. The code is in here if you want to play with it. So I will explain why this is useful for profiling Firefox and Android and even Firefox and laptops tomorrow in the main track. Now let's see the things that were not working so well or that I still need to look into. All the profiles I shared were looking good. I selected them. Some don't look that good. So this is a profile of a boiler. I said we profile the boiler so this is just it's a gas boiler so it's not most of the energy used but still during winter it uses a lot of electricity to just circulate water so that the hot water it's producing is going through the house. And then the Wi-Fi is not so good. It's especially terrible in our house because there's a lot of concrete with metal in it almost everywhere. Despite putting multiple repeaters it's still not so great. And someday I still have missing data like this and profiles that are almost garbage. And it could lead to incorrect conclusions because the shape here is just clearly wrong. So if we can, wired network is probably better. It's not really possible to put those wires exactly everywhere like on smart sockets or things like that. I think the best solution if I have time would be to change the firmware in those devices for an open source one and ensure that they store the data until they receive an act from the server that the data has been received and include timestamps in the data. So probably a project from next time I have many nights without sleep. I would really like to clean up this code so that all of us could play with it easily. It's not very complicated but if we don't duplicate it, that's much better. So the code for power profiling with USB meters I cleaned up enough because it was part of my work and I put it in a easily accessible repository. The code to do profiles that are nice from on-phase gateways I would like to do soon. And the rest, it's a bit of a mess because it's a mix of my code and configuration data with the same files because like you know 10 minute hacking sessions. And I would also like to blog some of our profiles of appliances and devices that I tested because I think there's quite a few surprises we could have when looking at devices. Some don't really behave like we would expect. And as a conclusion I would say sampling at a high rate is useful to understand how things work just because we are often curious. I definitely am. It's also useful to find and fix bugs like the water header thing at my parents that was wasting a lot of power on costing money. And if we want to optimize consumption from the power that's generated based photovoltaic panels, it's better to have an idea of how much we will consume. Like especially unfreezing bread like I was sharing is probably not a good candidate for using energy from solar panels. And that's all I wanted to share for today. Thanks for your attention. Could you match the power used by your workstation with the solar panels in the end? Oh I forgot to say but I could totally use the power from the solar panels for my home office because it was clearly enough and I'm mostly working during days. And I could actually decide that when we have a lot of power from the solar panels maybe it's time to compile for your folks that will use a lot more power. But actually the one thing that uses the most power as we have seen in my profiles from the home office is whenever I decided to use the computer without being plugged and then plug it back in because then it charges and that's where the power used is the biggest. The other thing that contributes a lot of power use of my office is screens. I have two external screens and surprisingly the 27 inches screen and the 20 inches screen they have almost the same power use. So if I use only one I could turn off the second one and they will also save significant power. The profiling your stuff is often called NILM non-intrusive load monitoring so if you go and look up there there are databases you can contribute to. The end phase be careful if you're running on version three and you're using production.jose and it all goes away and it's all behind a power paywall and horrible don't upgrade. And things like water so microwaves yes are just on off so those are hard to do so you should run them when it's sunny. And washing machines right so normally washing machine is on heating the water at the beginning and then that's it you know there's mechanical effort which you could see on yours. Dishwashers are usually at least two because you get the main wash and then a hot rinse. So washing machine with two is weird. So I'm not sure there was a question in this or if it was just comments but about the versioning of the young phase gateway. The young phase gateway we've got at home is not collecting data about our power use. The on-phase gateway we've got at home is not collecting data about our power use so I put my own power meter behind it and the reported data about how much power is used by the on-phase system at night is dramatically different and my parents profile on in mine because in my parents profile is the data reported by the on-phase gateway and it's counting only the power used by the micro inverters that are on the panel and it's around one watt and mine is also counting the power used by the gateway itself and now we are on five. So time's up thank you so much. And you can see the presentation tomorrow if you want more details about Firefox for approfiling. Thank you so much.
Closing Energy: Reimagining this Ecosystem through Open Source devroom
Just a few words. If you want to just maybe think it won't work. Yeah, we'd like to take just a couple moments to close off the deaf room. Thank you all for being here. This was our second year we had the energy deaf room. We started, maybe we can put this off. Well, we started with half a day last year and we extended it virtually. Yeah, you were attending the virtual room last year. Yeah, and we kind of had a feeling that there's going to be more attendance if we have a full day. And I think this demonstrated it pretty well. Like, from the beginning in the morning until like basically now it was only like a few sessions maybe not completely full. So thank you very much for sticking around and I hope to see you next year. And maybe if you want to volunteer, you know, please send us an email or hit us up somewhere else.
BEAM me up, Scotty
Very short presentation about me, what is BIM for everyone that doesn't know it, of course. Thank you, David. Thank you. So thank you all for coming. As David said, this is just a brief introduction on what most of us, I think, already know the BIM and what's around there. I'll try to speak louder because I don't think that those things are working, but the camera is okay. So obviously, this one doesn't work no more. Come on. Oh, spoiler. Okay. So what is BIM? BIM, as we know, is the virtual machine that powers the old Erlang, Elixir, Glim, etc., etc. ecosystem. It was originally built in the late 70s to early 80s, and it runs in production ever since. Very nice. It's almost 40 years of running in production. Okay. The focus of the virtual machine, since it's very conception, was on concurrency and distribution in a moment where nobody was considering it. Actually, people are still inventing the wheel of concurrency to these days. Let's think of ACCA, for example, or things like that. An interesting statistics about the BIM is that according to Cisco, about 90 percent of internet traffic goes at least one time through BIM and Erlang node. BIM is being used in production successfully by WhatsApp, Discord, and such, to handle a very large amount of messages per second, like more than one million. Obviously, since we're all here and we are cool kids, we do love doing things cool, like making programming languages. So the first of all is obviously Erlang, the one that started it all, the old-wise guy, and that's an example of Erlang. It's very inspired in its syntax by Prologue. It is a functional language, dynamically typed. There are some very nice things like pattern matching, binary pattern matching. Actually, this is an example of the Ranch library, which is a framework for handling TCP connections. As you can see, there are things like tuples, lists, etc., etc., supervisors. I'll talk about this later on. And it's an example of a behavior, of an application. I'll talk about this later on, too. There are macros in Erlang. There are preposers based. And this leads to Elixir, another language in which macros are first class, first of all. And Elixir is much more recent. Some say it triggered a renaissance of the beam, of interest in the beam, because of its ruby-like syntax, which is, in my opinion, nicer than Prologue, based on one. But I don't say so for everyone. This is an example of another library, and it's under for connections to a kind of database. Also, Elixir is functional, just like Erlang, and it's dynamically typed for now. Because there's some work on bringing types to the Elixir compiler. Being a beam language, it also has pattern matching, also binary pattern matching, and macros, as I said, are first class. And then there's the cool new kid, Glim. I see many people rejoicing. The syntax, it's a strange mix between an ML language and a beam language. Its syntax is a bit inspired by Rust, but the most important thing is that it's statically typed, rather than dynamically. And it also has a philosophy, being a statically type of handling some errors before they happen, rather than just doing it later on. And it can also compile to JavaScript. As I say, it is the cool new kid, so it has to compile to JavaScript. But then, as I said, there are a number of other friends of the beam. Least flavored Erlang, if you really like parenthesis. Pure Erl, if you're an Askel guy who wants to run on Erlang for some reason. Then if you're more Askel-y than Askel's guy, and you want to try dependent types, you can also compile Idris2 to Erlang, if you want. And then just a bunch of languages that the great wise man Peter Weirding did. So, Braggful, which is a PHP compiling to Erlang, Lua compiling to Erlang, and all things like that. He just loves making beam languages. And he's a very nice guy. So, why is people still continuing to build languages on the beam? Because the beam has some kind of superpowers built in there. Actually, let me interject for a moment. What I'm referring to as beam is actually beam OTP, or as I have written, he started calling beam plus OTP. What's the difference? So, beam is just the bytecode VM that runs the core bytecode code. And it's register-based. Then there's a runtime system called ERTS, the Erlang runtime system. That handles how to make this binary code run on the beam. Yeah, that's true. So, we have concepts, processes, synchronization, because everything in the beam is a synchronous. And ports, ETS tables, which are used for storage of persistent data and such. It also takes care of scheduling the processes on the beam in a preemptive fashion. And it's usually mixed, confused with the beam. So, I keep referring to beam plus ERTS as beam, as everyone is doing. Then there's another part, very important, in the Erlang ecosystem, that's OTP. It's a framework that provides you with some battle-tested abstractions. I think many of us use supervisors, gen servers, state machines, and things like that. Supervisors handle what happens if a process fails. Gen servers are an abstraction on the concept of a server. State machines are abstractions on the concept of state machines, obviously. The great part of using the Erlang ecosystem is that all of these three components provide one alpha of greatness. So, we have three albs of greatness in total. So, what are the superpowers, or beam plus OTP? As I've recently started talking, calling it. First of all, concurrency. As I said before, the main unit of concurrency is the process. The process is just a piece of sequential code that's running, having his own memory, his own it, and sharing nothing with all other processes. What's the point of it? Handling failure in just one simple and easy way, located just in the process. Meaning that when you have a crash in your process, only that process will crash and not the whole application. Then, a consequence of that is that garbage collection runs on a per process basis. So, you don't need like in Java to stop everything just to run the garbage collector, you can run it synchronously. And finally, since we do share nothing, the only way process is ever to communicate with each other is by sending messages or signals actually. This allows the beam to scale seamlessly from a single node setup to a multi-node setup. You just rather than sending it to a local process, you send a message to a process in a different beam node. Then, this leads to the let it crash philosophy of beam and OTP. Since every part in the end is just a tree of supervisors, a supervision tree. You can also obviously propagate failure between processes so that you can handle doing the right thing depending on what process is crashed. But why in this world would you need supervisors? Since we have Kubernetes that almost seem to do the right and the same thing. Restarting our pod if something is not working. Well, the main idea is that when if a network connection goes down, you don't want to restart your application. So, those are two different layers. OTP focuses on the application layer, handling failures in your domain. And then Kubernetes focuses on the larger aspect of orchestration. So, you don't want to crash, as I said before, if a network question is going down. You just need to handle that in your application and you restart the pod or your deployment just in very specific cases or very tragic cases. That's where Kubernetes comes into the rescue. Being application-based, the supervision trees are obviously also more granular. So, that you can define a different strategy rather than just turning it on when the process crashed. So, it provides for your application a more flexible way to handle crashes. Kubernetes does not. It just handles networking, bring up and down pods, and that's it. Those are different level orchestrating containers is a different level than handling failures in your application. Next superpower of being a plus OTP is immutability. It didn't seem so when it was built, but now we see its value together with a share nothing concurrency. Because having no shared memory means that the processes cannot change the state of under process unexpectedly. And that's the reason why we also have immutability of data structures. This also leads to a referential transparency. Even if the beam allows you to have some kind of side effects, for example, logging or networking, it's just a pragmatic way of handling it rather than setting up a whole monad stack, for example. Then, final superpower distribution. As I said before, from the point of view of an application, it's the same to pass messages to a local node, a node in the same virtual machine or in another node that is running maybe in another part of the world. This allows you to distribute the work efficiently, seamlessly as a programmer. And this is made by a built-in protocol to discover nodes. And you can scale horizontally however you want. The code that you wrote will just work most of the time. Finally, since those things run in production ever since, there's a huge interest in observability and debuggability. The most easy thing to do is connect to your live application and start an interactive shell just to see what's going wrong. You can trace. There's an interesting tool called tracing later on, and access all the information of a process and so on and so forth. Tracing and profiling, as I said, are built-in into the machine. They are battle tested by running in production for years. If anybody uses ReContrace, it's very nice, but can anyone give me a match spec that works? I never found a way to make it work. And let's not forget hot-code reloading. It's really interesting. You can change your code while it is running so that, for example, if you have detected a bug in your live application, you can just fix it on the fly and your application will just work with the new fix. Then there are other interesting aspects. The first one, pattern matching, is more interesting from one who is writing programming languages. The BIM makes it really easy to write functional languages with pattern matching, also at binary level. Then BIM is already container aware since a number of years, I think, allowing you to use C-groups and Kubernetes, obviously, seamlessly, not what Java does, for example. It makes it easy to have foreign function codes. You just write a C-node that seems to be an orang node and just communicate with it. Or you can use the FFI built-in. What's the future for the BIM? Obviously, the first part is being more and more largely adopted. And then there are some interesting research level developments on bringing gradual typing to some languages like Elixir and Erlang. Also, Elixir, but a number of languages, are doing a pretty huge job of developing numerical computing and AI on BIM nodes in order to distribute those calculations in an easier way. Obviously, there's also work on embedded systems, like bringing, for example, a small instance of the BIM vehicle machine on microcontrollers. There's a work on that, especially at UMBM. Then, the main challenge is obviously becoming a wider and wider adopted choice for the backend. And in that sense, having all these people here is a very good sign. Thank you for being there. And if you have any questions, feel free to answer. I mean, feel free to ask. Thank you.
Gleam: Past, present, future!
Yes, good now it is Hello everybody, this is fantastically exciting Look how many people there are, I thought there was going to be like five of us having a lovely time But no, there's far too many of us, great I'm so excited, I'm going to take a photo Just so I can prove this happened Does everybody smile? Wonderful, thank you so much I'm ecstatic, so hello, I'm Louis, I'm the creator of the Gleam programming language If you want to talk to me, do so here I'm here to talk about Gleam, which is, you've just seen a new programming language for the Erlang virtual machine And it feels like we've had milestone, because the language has really matured especially over the last year And so I want to have a little bit of an indulgent look into the past, sort of where did it come from and how did it get started A little bit of a celebration of where we are now and look at some really cool projects And then I want to look into the future and say, what's coming next for Gleam, what can we bring to the beam? So, this slide is ever so slightly irrelevant after that last talk But I just want to stop by saying, what is Gleam, to get everybody on the same page It is a new functional programming language, it doesn't look like Ruby, it doesn't look like Prolog It kind of looks like Rust, C, JavaScript, that sort of thing perhaps And it runs on the Erlang VM, so it is a sibling of Elixir and Erlang But it is a bit different in that it is statically typed, unlike the other two dynamically typed languages, and most of the other ones And that means that it brings a new style of programming to the beam, and hopefully can draw in more beamy people It aims to be very small and consistent, and the point of that is that we want to make it as easy as possible to read code We want it as easy as possible to learn the language and to get productive with it And productivity is not just about having a good language, these days you often have really good tooling Gone are the days when you can just give someone a compiler and then you say, ok well everything else is up to you, you figure it out So we also have a really nice build tool that comes with a formatter and package management and a language server and all those sort of things you probably expect And also it can pass a JavaScript, which is probably less exciting to this room than most But maybe you don't have to write JavaScript if you do in your front-end, so maybe that's a cool thing So first up, the past, what did I mean? Yes good, ok, the past, the past, how did we get here? So this is a history of Gleam according to GitHub, and in the very beginning there was a little tiny little blip of activity and then nothing For absolutely ages, so what was that? What was the very first Gleam? It was this, this hideous thing, this is the very first Gleam syntax People keep saying that the first Gleam syntax was like a Haskell rip-off, it was not, it was this You see it's sort of C-style, it's got braces, but it has the Erlang thing of multiple function clauses, so your top level flow control is done that way And it looks like nothing and nobody's familiar with it and nobody really likes it And it has this perhaps cool idea of having the tests for functions actually be part of the function So maybe you could show that in documentation, and I thought this was great, this is the thing that the language is going to be all about But it looks kind of rubbish to me now, because you can't do any test setup, the only thing you can really do is give an input and an output Well maybe that's good if you're reversing a list, but other than that what can you really do with it? What else can it do? Nothing, it didn't have a type system, it didn't really have a design, I wasn't working in any direction You could return strings and maybe call a function, but that's kind of about it And it was just a really bad layer on top of Erlang, which asked the question, why? Why did this exist? Well it's kind of like today, I wanted to do a conference talk So there I am, looking at younger, Elixir London 2017 And I did a talk on how to write a compiler, how to write a compiler that targeted the Beam virtual machine And it went really well, people liked the talk, I got to hang out with loads of my peers and then I took that project and threw it away And I didn't think about it ever again, sort of Because during this empty period where no work was being done really, I was doing my job, doing open source stuff And I kept thinking, I kept thinking back to that project and wondering, is there actually a point in making another Beam language? And this was spurred on further because I was writing all these really wonderful languages And every time I was writing one of them, I was thinking, oh I really wish I was using one of the other ones Every time I'm writing Elms, it's really difficult to do this IO thing in Elm I wish I was, oh there's no concurrency, I wish I was using Elixir, or I was writing JavaScript I really wish I had Rusts tooling And I sort of figured, maybe it's possible to take all the things I like from all of these languages and merge them into one Because I've sort of accepted the language that I wanted to be writing didn't exist, I felt like I tried them all at this point And so can I make that thing that brings it all together? And so after about a year and a half, the start-up I was working for was bought and trashed And suddenly I had a lot of free time on my hands And so I thought, this is the perfect time to resurrect this project So I remade the whole thing and this is the syntax people keep telling me is the very first Gloom syntax, but it's not It looks a little bit more like OCaml with bits of Elixir mixed in I think And this is in February 2018, okay, so maybe like a year and a half after that previous one And so I kept working on it an awful lot and then fast forward a year and a bit later to April 2019 We've sadly scrapped all of the nice ML syntax and we've got a much more sort of JavaScript syntax And this is version 0.1 which I'm really excited about because it did something You could use it to write some small program whatsoever, which is really cool And it started to look a lot more like Moddingling Fast forward another half year We basically got the syntax as it is today, we did a little bit more, but that's kind of it You notice the differences here are, we've got one of those little pipes And if you look between the IO and the print line, we've got rid of that colon So that's the last of the little Erlang things, sorry Erlang fans What else happened? We used our first class modules as a feature that people love People absolutely love first class modules, that's something that you find in OCaml And really we do it a lot in Elixir and Erlang as well Because if you think about when we pass around an atom that is a reference to the module Well that's a first class module, we're passing it around, we don't have module functors But we do use them an awful lot in our APIs Good, I am actually on the right side And we also have row type records, which is a really cool way of A really cool type system feature that enables you to do these really interesting sorts of polymorphism With objects and variants that sort of looks like interfaces in an OO land But doesn't have that same sort of subtype thing, so these are two fantastically cool features And we also had a more complicated way of declaring types and data structures That was much more akin to what you find in Haskell So we got rid of all these really cool things and replaced them with a string concatenation operator The ability to use callbacks in a slightly nicer way and the ability to give names to arguments So we've swapped really sexy awesome functional programming stuff for things that are actually quite useful But not very exciting And this has kind of been the whole journey of Glim, this has been It's very easy when making something to get excited and distracted by all these things And it could be, we could do this, we could do that But what is actually the most useful thing? And it turns out just removing things and honing in on that call, that most useful, that most productive thing Is the most, hopefully, is the best thing to do And I think we've got a really nice place because of that One thing that we have added that is quite big actually is that JavaScript compilation So that wasn't in there originally, that sort of exploded after Which does make the ecosystem more complicated, but the language not so much We also got a build tool, as I mentioned earlier The idea is to have a really good batteries included one Originally we were using Rebar 3, which is the Erlang build tool And it's really good and it worked quite well for us But you could tell that we were using a tool that wasn't made for us The user experience wasn't as good as I wanted it to be And I didn't just want to match Erlang's developer experience Or even Elixir's developer experience I wanted to even best it in some fashions And I've been writing a lot of Rust and Go And they've got some really amazing tool And I thought, wow, let's take all this goodness that you find in these other ecosystems And let's pull them into the Beam ecosystem, make it grow even better We've integrated with the Hex package management We're all beamers together, it doesn't really matter what language you're writing We want to be able to all share the same code And all depend upon each other's projects and share and give back So we've integrated with Hex So rather than just having a few hundred packages written in Gleam We've also got the 20,000 packages that are written in Elixir and Erlang as well And then we've got a code formatter and a language server And lots of goodies like that So I said there's 20,000, a bit more than that packages on Hex On the package manager And about 200, a bit more than 200 of them are Gleam That makes it extremely difficult to find anything written in Gleam If you want to make a Gleam project So after a while we made the Gleam package index And what that is, that's a little window Just a little view that looks into Hex And allows you to see just the ones that are Gleam So if you want to find a library for HTML in this case You can type in HTML I didn't, I didn't, you're making, we'll talk about that later Anyway, and it will give you elicit packages that have the word HTML In the description or the name That somebody's library does not have HTML in the name or the description And then if you find something suitable you can use that in your project And if you don't then you can then make a decision about Whether you want to perhaps make something new Or if you want to pull something in from the wider ecosystem Internet points, everybody loves internet points So I know that Stiles on GitHub mean absolutely nothing But it's been really uplifting and really wonderful And I feel like a really good sign that loads of people have Have taken that two seconds to say, yeah this seems right This is kind of cool I've been doing this for an awful long time And I think I probably would have stopped by now If it wasn't for loads of lovely people Sharing their support in some small way Whether it be a Stiles on GitHub or a kind message on Discord Or absolutely loads of you turning up into this room today And so it's been absolutely lovely to see that line go up and up and up And I find it wild, I've plotted it here against two quite similar languages Microsoft's F-Sharp and O'Cammill And at some point in the last year or two We've ever taken both of them in terms of number of stars Which is absolutely incredible I'm really excited about ML types And also people really love the Beam I think So this is a really good sign for the future of the Beam What else have we got? Has anyone heard of Exism? Anyone a fan? Fantastic, for those who haven't, it's your lucky day This is a really wonderful website and project Where you can go to learn new programming languages And they've got tens and tens and tens of different languages on there And for a few years we've had a Gleam track And they give you an exercise, some instructions Maybe some hints, and then they give you a series of tests And you can solve it there in your browser Or you can use the command line and download it and use your favourite editor And then when you're happy with your solution You can submit it off and they do a bunch of automatic grading So they've gone some tests and they might do a bit of static analysis Like oh you've done this, maybe you didn't want to do that And then if you're feeling super brave, which is where the real value comes from You can submit it to get some mentoring from an experienced programmer There's loads of lovely people who are just sitting there Helping strangers improve their Erlang or Java or Gleam or whatever There's a really wonderful project And last year, with some help from the wonderful Erlang Ecosystem Foundation Who sponsored this work We went from not just having a set of challenges That you can use to practice your Gleam, but an entire course So you can start by not really knowing any Gleam And by going through this whole thing You can be taught individually all the different concepts And so they give you a concept Then they give you a little challenge that's focused on just that concept And then they will unlock all of the exercises that they think you should be able to do now Using those skills So it's a really fantastic resource and it's absolutely amazing that it's free So do check it out And it's been really well People have really taken to it this course So you can see in the middle Can you see where we launched the new syllabus? Suddenly the uptake went absolutely skyward, which is fantastic And this is not the number of people on the course There are about a thousand, just under a thousand This is how many solutions people are submitting So this is actually the activity It's absolutely wonderful 30,000 submissions Which is a lot of learning, a lot of wasted time Who knows? So Exism is really cool And I really like that idea of being taught The individual concepts in a way that enables you to get somewhere And become productive And off the back of that And also inspired by the wonderful tutorial that Go has So we decided to take that idea of teaching the Breaking the language into concepts And teaching them in an incremental fashion Where each concept builds upon the last one And distilled it, minus all the exercises Into a sort of whistle-stopped tour of Gleam So if you go to the Gleam website today And at the very top there's that hero image That's got the tagline saying Gleam is great tour I don't think it says exactly that, but you get the idea And there's a big button that says Get started or try it or something like that And if you do that it will point you straight It will point you straight onto that first lesson And you can go from This looks kind of interesting Maybe I'll try it to Oh wow, I'm writing and learning Gleam All in your browser without having to Work at how to install Erlang And realizing that App has an Alte Deck package So you can't actually install it properly And oh how do I install Rebar And how do I do these things No, you just go straight in and you can start learning So hopefully people from other ecosystems Or people who are writing election Erlang Can turn up and go Oh I want to give this Gleam thing a try And then very quickly Get whisked into being a Gleam They can be hooked They can start working on the beam And this comes because A, the compiler is written in Rust So if you have Rust you can compile to WebAssembly WebAssembly is a very cool project And we can also compile to JavaScript So if you have those two things together You can run the compiler inside the browser And you can also execute the code inside the browser So we don't have to run any servers So even I can afford this And we don't have to worry about any security stuff Everything is just on the person's computer And it also means it's super fast You can get your feedback immediately So Gleam present, I'm going a bit slow I'm going to speed up a bit Where are we now? I want to look at some projects in the community That are really cool My original version of this The talk ended up being about an hour and a half long So I've had to go loads out So if you're not mentioned, very sorry First thing I want to say is that The Gleam Discord is wonderful I'm super lucky to have loads of lovely people Hang out there I can see some of them here today And there's just people helping each other And sharing cool projects And talking about the news Or talking about coffee or keyboards Or anything really It's a really lovely place to Either get help or to talk to people So do join The community is absolutely wonderful And delightful And I'm super lucky to have Working with them be my job these days So thank you so much everyone But now on to the things they've made The first thing I want to talk and boast about Is MIST And MIST is a pure Gleam HB1.1 server That sports HBHBS It has web sockets I believe Server-centered events are coming in the next version And they're working on HB2 So the cool thing about this Is that it doesn't wrap an Erlang web server It is pure Gleam And it doesn't even use Erlang's OTP It uses Gleam's OTP It's an entirely new implementation And what's really cool Is that it's not just proving that you can use Gleam To make sophisticated things You know, implementing a fast HBHBS server Is quite challenging But you can also get really good performance Out of the ever end So here we've got a bunch of different web servers Graphed The ones at the top are MIST and Bandit Bandit being Elixir's new one Bandit has had a new version Since this benchmark was done So I think it's actually slightly faster now But they're about the same You'll notice we're even beating Go And everyone talks about how Go is super fast But no, we in the Erlang world can do better And we're obviously beating JavaScript But the thing I think is really cool Is that we are really beating Cowboy We are really building the one That we as the community have said This is the best fastest web server It shows that we have further that we can do And it shows that Gleam can be just as performance As Erlang So this really proves the language I think So I mentioned OTP Gleam has gone a different way for OTP Then shared out to Fred and his squid there Gleam has gone a different way with OTP With most of the other languages So Elixir and PureRail and other languages If they want to use OTP They put a very thin layer on top of Erlang OTP Well, Gleam doesn't do that Instead, Gleam takes the core concurrency primitives That you get from the Erlang runtime system And has made type safe versions of all of those And it's the same things like link, sport, monitor Send, receive And then it looks at the protocols that are implemented OTP says you've got to implement certain messages Like system messages And there's certain ways of sending Of doing synchronous requests and all that sort of stuff And we've implemented those same things From the ground up in a type safe way And what's really cool is that we've discovered it's possible For a long time people have said You can't have typed OTP Well, if you get the same If you get that same core primitives that you get inside Erlang You can build the same thing from the ground up So that's been really cool And the fact that it's been used to make miss Shows that it can work And it could be practical and useful in performance So it's all very good having a web server But you kind of need a... Probably need a web framework Unless you want to spend all your time Writing a parser for multi-part form bodies So we have Wisp Wisp is a really lovely little framework I can call it lovely because I made it So if you want to do a web thing That's a good place to start Databases are pretty handy as well We've got bindings for these sort And probably some others that I haven't found The first two, Postgres and SQLite They wrap Erlang projects All the SQLite one can even work on JavaScript If you're using Deno But the bottom two, they're really cool Because they're, again, written in pure Glean Using Glean OTP Now, this is a really cool one This isn't quite so beamy But so Glean can compile to JavaScript Okay, so how do I do a front-end in Glean? I don't want to be writing all this JavaScript For my Beam application If I can avoid it So Lustre is this really lovely library That's sort of quite similar to Elm Or perhaps some React State management systems That gives you a way to make a declarative DOM And then all you need to do is talk about What messages you're going to emit And then how you update the state Every time one of those messages come in And as an Erlanger, I look at this And I see a GenServer I think that the Elm architecture Is basically exactly the same as an Erlang GenServer Instead of calling it call, we're calling it Handlework, we're calling it updates And then we've got this HTML thing on the side Which I don't, who knows But what about live view? People like live view, right? That's the hotness at the moment So live view, in case you don't know Which I find that you almost certainly Do know in this room That's when you have that same sort of idea You get a declarative DOM that is on your front end But all your state updating Where you hold everything is on your back end And then they talk to each other over WebSockets And this results in a really lovely develop experience And you can do all sort of things that you can't Practically do if all the state is on the front end Well, us too can do that as well That last component I showed you There's nothing that says that has to run on the front end It could also run on the back end Just rendering it to HTML Or you could put it on both So you could just by saying Hey, start an actor with this And then here's WebSockets You can have live view with Luster And what's really cool is that you can now pick Which parts of your application is going to use Which architecture? You know, there's a criticism of live view That it means that certain actions That should be really snappy are quite slow And if you lose network connectivity Your whole application stops working Well, then maybe put those bits About making it be resilient to network failures Put those on the clients You can pick exactly what you want So we've got loads of servers and clients And API clients and middleware That are all part of this wider HDP ecosystem And one of the things that's really cool about this Is that there is a Gleam core library Called Gleam HDP that defines a few types for Requests and responses and headers and all these things And so all of these libraries Even though they've been made independently By different people, they can all work together They all share the same primitives And you can say, well I want that API client With that HDP client on the front end And that HDP client on the back end And I'm going to handle it with that server in my tests Fantastic, and it all just nits together Enough about Web There's lots of other cool places we can run code One of them, we probably will do an awful lot Is on the command line And there's this really lovely project called T-Shop Where you can It's a similar sort of Elm updated type thing But rather than being events coming from a DOM It's events coming from a terminal So you can make these really lovely interactive Tuis in Gleam Sadly at the moment you can't run this code on the Beam Because there's a few There's a few quirks of how The Beam handles standard input But hopefully we can make a proposal to The OTP team and they can expose A couple of functions that you can't get to And then we can have exactly the same thing In Elixir and Erlang and all sorts of other languages as well And because I've showed lots of libraries Let's look at an application I think this is really cool This is, I'm going to butcher the name Electrophonie, maybe Which is a music streaming app Similar to Spotify or such And it is written in Gleam Part of JavaScript using Luster And then because we've got this really excellent FFI So we can call into So we can call into other languages And we can use all these web APIs And do things like use the media keys Be on the lock screen of a phone Be in that little bit of the top of your computer Where the music thing is I don't know what it's called And, yeah, the ecosystem is really growing I think there's a name for that kind of curve I'm not sure what it is But we are now 1.2% of hex Which is a tiny number, but bear in mind We're not at version one yet And Elixir's been at version one for 10 years Something like that I think that's really impressive And I really hope that that is going to keep going So, where are we going? What comes next? So, Gleam isn't done A lot of things are very mature, but there are still things to work on And the thing I really want to focus on for the next year is the language server So, what is a language server? Just to make sure everybody's on the same page So, traditionally, if you are making a text editor, an IDE And you want to support a language or a plugin So that they can support a language You need to then work at how to learn all those things about the code layer Oh, how do I know if there's an error? How do I know what I can auto-complete with? How do I know what snippets would expand? How do I know what refactoring is I can do? You'd have to individually implement all those things But some clever clogs, I think at Microsoft, came up with this idea of We're going to have a language server, we're going to define a protocol That all the editors can speak and all these backends can speak And all you need to do is implement the protocol And then suddenly we can have one brain of an editor And that can talk to Elix and Vim and Emacs and VS Code and Zed And all these other cool ones And so we've got one of those Built into the binary that you get when you download Glim Excuse me And it works, but it doesn't work as well as I wanted to It's definitely the least mature part of the whole GLEE ecosystem And a big part of that is my fault I've been developing it entirely on Visual Studio Code And it means the protocol is a little ambiguous in places In a way that I find quite irritating but apparently is fine So all of the editors do slightly different things when you give it certain data So we need to spend more time working on the other editors And making sure that it's rock solid and works exactly the same And all the other ones And I switched a knee over them now so it's not going to be a problem anymore So first step, we're going to get it all working super reliably for everybody And then we're going to flesh it out to have everything We want to have the same experience that you're going to get with Rust Analyzer Or maybe even try and get close to what a JetBrains IDE might give you We want it to be a really excellent experience of all these different things Find references, renaming things, all sorts of refactorings And also code generators, I think are really cool There's loads of bits of trivial code that we bash out every single day about thinking about Well, if it's that easy, just press a button and have the tooling spit out for you And then you can choose to edit it in whatever way you want So breaking changes Over the last year we've had an awful lot of breaking changes Because there was a design and then suddenly a bunch of ULOT turned up and now we had users And then we realised that, oh, actually that original thing that I made up five years ago While I was sitting in my room wasn't the best idea There are problems, so we've made a load of breaking changes in order to refine them What breaking changes are coming next? Hopefully nothing, I think we're there I think we basically have the language to work exactly as it should Which is wonderful And that kind of begs the question Does that mean we can work towards a version one? Yeah, we're working towards a version one So what does that mean? When we get there, what's going to be the points of version one? And I think there are two pillars to this The first one is productivity for people who are using Gleam So that's going to be no breaking changes You can't build on top of foundation that's constantly changing on you We won't have no language bloat I'd be really proud of how we've really honed in on what makes Gleam good And by having a very small, concise, consistent surface area It makes it easy to work with And I want to keep that property I think it's very tempting for languages to hit version one and then go Oh, maybe we need this feature Or maybe we need typeplosses, maybe we need these things No, we're going to keep it super focused And it's going to say exactly that same language that you really love Or don't, you know, whatever it is, it's not going to change And we're going to keep working on improving the developer experience So more tooling, keep improving that If there's something that's annoying to do that everyone has to do Let's make a library for that You know, just keep solving those problems And document everything You know, we want to have cookbooks and guides and tutorials and examples And just make it really easy for you to go How do I do this in Gleam? Oh, look, it says here, here's how I do it Now I can get on And the next thing is sustainability I am not Microsoft I do not have 50 developers working on this I have me And some lovely people who are very kind enough to agree to join the core team Which means they're just called the core team and they do free work for me It's fantastic Thank you very much So we want to make sure that every bit of work that we're doing is as impactful as possible You know, it needs to be... Everything needs to be meaningful And if we can't justify it as being impactful for a large amount of people We just shouldn't do it We've got to make sure everything is efficient as possible Not just in the code, but in our practices as well We're going to document everything internal We're doing really well with this But I think we can do even better I would like people to go, oh, there's something... There's a quirk with the build tool I think this is a bug Okay, I'm going to look inside and see what it is And then just see loads of comments, loads of docs And then they can hopefully work out, oh, that's doing this, that's doing that I can make a contribution to this And the last two things are about funding the project So I work on this full-time And I work on this full-time thanks to GitHub sponsors primarily I really want to... So here, charted in the pink That's how much income we have for the project I'm super happy that it stayed super stable And up there in blue, that is the median For a lead developer in London, which is the city I live I really want to get that up to the blue line For obvious reasons But I'd like to do further than that I'd really like if we could afford to have like one... Two pizza team, is that too old? I want to have that core development team To be able to afford to work on this thing That I think is useful and important and productive And be able to work full-time And be rewarded appropriately It shouldn't be charity, I think, for these people They're doing this really useful work for the ecosystem And then if that stable foundation is there That means other people feel more confident Building their businesses and their projects and so on top of that So if you want to help out, do join the... Do start sponsoring or get your employer to So about half of that previous income comes from one place And that's from Fly, that's our big corporate sponsor They're the really wonderful deployment platform And the other half comes from people donating like five, ten, twenty dollars And they're both wonderful But it means there's quite a lot of... There's quite a lot of weights on one organisation I'd really like to spread that out So if we could have a bunch of smaller corporate sponsors I think that would be much better for the long-term health of the project And if you've got ideas for other things we can do So I know Elixir has a sort of quasi-support thing That you can sign up for If you've got some other ideas, get in touch with me I'd love to hear what your thoughts are So when is Glean version one? How much more have you got to do? Well the answer is now We're there, like we're completely ready And depending on how much you lot distract me For the next few days I hope to get a release candidate out today, tomorrow At some point in the immediate future So this is a really exciting time Good, so questions? Any questions? Thank you very much for creating Glean Could you elaborate more on what happens when we keep target minus JS? When we're targeting to JavaScript Repeat the question Yes, so the question is Can I explain what happens when we target compile to JavaScript? Okay, so we compile to... What can I say about it? So we compile to JavaScript source codes We don't add a runtime We keep very close to JavaScript So like your scripts end up being very small Suitable for use at a browser But because we don't have a runtime It means we don't have an implementation Of say like the Erlang concurrency inside JavaScript So you'll be using a different concurrency pattern If you're using Glean JavaScript If you're using Glean Erlang And that means there's certain incompatibilities Between the Erlang and JavaScript target You can't write a library that easily abstracts over both If it does file IO for it Well, that's a bad example If it does like HTTP requests, for example But it means, you know, that's the trade-off But then it means you can work very well with the Erlang... Sorry, with the JavaScript world We can run Glean in browser through... So again, sorry? Can we run Glean in browser through WebAssembly? Can you run Glean in browser through WebAssembly? No, but that's something we want to explore in future Not because we particularly want to do WebAssembly Sorry, not because we want to do it in the browser Because we could already do that with JavaScript But there's loads of other places you can use WebAssembly And I wanted to talk about this, but I didn't have enough time I think it would be really exciting If we had a good way of executing Glean inside the compiler Because there's loads of optimizations we can do We could start looking at certain kinds of like code generation Meta programming stuff that you can do in Alexa, for example You can't do in Glean But we can't do that because we don't have a copy of the beam This massive thing inside the Glean compiler So if we had like a little VM, maybe we could do that And WebAssembly is a really good little VM for this whole thing Thank you Any other questions? Yeah, I do have a question that you might use to point to I think it's a great question I think it was in the last year during the episode of code And it's a really great project But as you know, I think one of the main parts that draw me to the language Was the vibrant, pink color Is there a story behind it? Is it the color? Why is Glean pink? Great question Great question This was, what is this handle, K-Tec I think is And he just threw this idea it should be pink And I was like, oh really, why? That's really odd And I liked it because it's different You know, you see this pink and you don't go You know, if you see a blue you're like, is that TypeScript? Is it Python? You know, it's visually very different And the other thing is, I think it's quite friendly And hopefully it's welcoming to different people I hope that if someone sees a bright pink thing they go Oh that's cool, you know, maybe there's not going to be And it also says like, you know, be nice to each other No Nazis on the website, you know I'm hoping people will see that and get an idea of what we're about We're about being supportive and friendly And looking after each other So it's, look different and hopefully say something About the kind of vibe we want inside the community Well you work, thank you I mean, it's probably the best thing about Glean I think Currently you target both Glean and then Javister What do you plan to do to introduce other targets like WebAssembly? So I don't like to look at targets as well as, you know I think there's a problem and when people make in languages It's very easy for them to do things that are cool For a language maker to do So for example, it would be cool if I could target WebAssembly It would be cool if I had type classes I don't want to do it for those reasons I want to drive changes by them being impactful to the community And as I said with WebAssembly That can be a nice VM that you can embed in the compiler To enable compile time code execution You could use that to do like Glean script So you could just have just the binary on your server And you can use that to execute tiny little scripts When you don't want to have like a whole virtual machine installed On that computer for example All sorts of little things like that And so I think there is a good Argument in favour of having WebAssembly And so it's something and I would quite enjoy it So I'd like to explore it in future But it's not as high priority as like getting the language server Working really well, getting the documentation fantastic Making sure we've got like a really lovely Like Elixir Phoenix like experience To do web development in Glean So I would like it Maybe one day, don't hold your breath When you do message passing in Glean Does the messages support function cloners as well And if so, how does your type system handle it? As in you're asking When you're doing type OTP Can you send a function to another process? Yeah, function cloners Yes, okay, so So, it's quite tricky You can't, how much context do I give this Because I've thought about this for years And it's quite hard Yes, we can, you can pass any data to another function The key difference between message passing In typed Glean OTP and Erlang OTP Is that you need to have more than just a PID To send a message or something If you've used languages that have channels So for example, Go or Rust You don't just have like the handle for the thread And pass a message to it You've got to have a channel And you send a message via the channel So it's the same idea So we have this idea of a subject We don't call it a channel Because it would be confusing Because it still goes to a process inbox You can't give a channel to a different process And they start pulling from it And every channel is the thing that's typed Not the PID It looks like you should be able to do the PID But then you suddenly realize If you build from the ground up You can't implement synchronous You can't implement call, synchronous message passing If you have typed PIDs Because the type of the return Doesn't match the type of the PID So you need to have something more flexible So we have this thing And if you look under the hood in Erlang OTP They have the same abstraction We've got 14 seconds left And it's used to implement GenServit.co So they have this from thing So it's the same as the from field in GenServit.co That's the thing that you send messages around with I have three seconds left Thank you very much everybody Thank you very much
Property based testing in Elixir
That's not helpful at all. Okay. Now turn from yellow, greenish to green, greenish. Okay. Okay, cool. So let's write a unit test for very simple use case in which we want to add two to number together. And it would look something like this. So we usually when I write tests, I try to come up with at least three cases. So positive one which tests the happy path, one that actually tests the opposites and then try to find or think of edge cases in which my software could actually fail. So this is such an example in which we assert that two plus two is four, that two plus two is not equal to five, and we also try to find some edge cases like if one combined types or do something other funky stuff that my software still works. So if you look at that example, you can understand why I think writing tests can be pretty boring. So that's my conclusion, testing can be boring. Then if we look at it in other aspect of writing unit tests, what if our software project grows? If we have end features, then we have some linear amount of tests accompanied to that. But what if we then start to combine features? So function A and function B, we have to test combinations like pairs of those functions as well. Then the amount of tests will grow quadratically. But then if we're going further and we combine even more features, at a certain point that growth makes it really hard to scale to write to go further. So testing I think can be hard, at least if you want to do it properly. Like if you really want to make sure that you have for confidence your code you want to have as much cases covered in those tests. If you approach it that way, then testing can be hard. So how can we fix this? Well, some people they come up with property-based testing. A summary of it is instead of us humans writing examples, let's define properties of our code and let the computer come up with cases. So that's the folks at a company called Qwik. They came up with this idea around 2000s, and they build a project and the company around those ideas. They've also added some more features to it as well. But the general idea of property-based testing is that we define properties instead of examples for our tests. So let's have a look and a comparison how we could do that. So let's say that we write a test for string reversal. So we take some string and we have a function that reverses the order of those characters and how we would unit test for such a case look like. It would be something like this. So a raise of hands, if you write test like this, who feels confident that this test are actually covering all cases of our function? No hands raised, nobody feels confident. One maybe. Yeah, everybody is like you feeling anxious, right? You're not fully convinced about this test. You could probably write it in a different way. But if you would translate these things in properties, so let's take a pause and think if you would try to express that behavior, that functionality in properties, how would you do it? Like the contest numbers, special characters and so on. You would come up with examples of special characters, numbers, these kind of things. So examples, right? Basically, examples of edge cases like weird input. But that's not how you would, for example, define your software as a property. Those are again examples, clear use cases, but they're not properties of our code, right? One property is that the length of the string in input is the same length of the string that you get out. Yeah, that's a good one. So if we reverse the string, in both cases, the length of the string should stay the same. That's the property, right? So another one, still readable. Another one would be if we reverse the string twice, then we should come back with the original one. And this is how you would write it down in a property-based test. So we define a property-reversed string twice returns the original. And we actually tell, on this second line, we tell from all the possible string inputs. So we ask the library to come up with any strings. If we reverse that string twice, it should come back with the original, right? And if we run this, then the library will generate about 100 cases for us. And in doing that, try to prove that it's property holds for our code. So other examples, if we reverse the list, then the first item becomes the last one and the last item becomes the first one. If we have a palindrome and we reverse it, it will stay the same. So palindromes are strings which, when we reverse, they return the same string as well. And like you said, the amount of items, so this applies to any kind of list or string that we're reversing. If we reverse the amount of items, it stays consistent. It's not like some things disappear magically. So if we, and the funny thing is, if we try to write a property again, so I don't know if anybody noticed, but in the previous example, I specified that I want to generate a list or generate examples of strings, but which only contain ASCII characters. But if we do the funky stuff, the funky characters part, so we say, well, generate any string from the UTF-8 set, what will actually our library tell when we run that? And then it finds an edge case. So there are unicode characters that apply to previous characters as well. So when we reverse them, you don't get back the original anymore. And these are kind of edge cases which we, as humans, probably couldn't come up with. Well, you do know that it exists, but if I ask you like now, within these five, ten minutes to actually write this example, you actually wouldn't be able to do that. And it actually, normally runs about 100 cases, but even after eight cases, it found this example. So that's great. It found an edge case. And the other thing that is not shown in the example, but if you write a property and it finds a case for which the test fail or the property fails, property-based testing tools are also able to shrink down the case. So it does a binary search. So if I have a list of numbers and our test fail, then it tries to, then it tries the half of items from that list. Sorry. And if it still fails, then it goes on and on until it finds the minimal set of input under which our property doesn't hold anymore. So let's talk about some use cases. Where has this kind of tooling been used? So Volvo, at a certain point, wanted to have third-party parts to be replaced by other companies. So they came up with specifications in which these components should interact with one another. So they wrote a specification about 3,000 pages long. They had about six vendors come in to test this, their specification, combined they had a million lines of code written. And when they used property-based testing to actually test these six vendors' implementations of the specification, they even found about 200 issues. 100 of them were actually in the specification itself, and 100 of them were in the combination of those parts. Because a car consists of several parts. So it could take component A from vendor A and some other component from another vendor, and they tested components in isolation but never together. So the combination of these components actually yields some errors as well. Clana is a financial system, and at a certain point they had a problem, which occurred only once several weeks, and they had kind of a hint because it came up with when the generating files and it were over 1 gigabyte big. And they spent six weeks full time investigating this issue, and they couldn't find the source. Like they could stumble upon it, they could in some cases, trigger it, but it was actually impossible to find out how and where it came from. And it took them three days or less than three days in total to come up with a model to write the properties. Less than a day to run the properties until they actually stumbled upon race condition in which that's error occurred actually. So those are two kind of big examples. What are the other occasions in which we could actually use property based testing? One obvious one is if we have symmetrical functions, so if you serialize and deserialize something, those are opposite functions, you could easily property based test them. If you need to have some other method, if you have functionality that needs to have some kind of mathematical proof, it's good for comparing systems. So I had to, in one case, I had to rewrite the system in another language, and then it's nice to have the old system and the new system and test them against one another. And I haven't really mentioned it during this talk, but the tool that QVIC has built also has special conditions to test concurrency items. So if you have a system like the Cliana financial system, you're going to test what happens if five people do some transactions simultaneously. So conclusion. Property based testing can generate all kinds of test cases for us. Very often also edge cases that we as humans don't think about because it tries to spectrum of all inputs that you specified and find very weird items. Because of the shrinking, it also helps narrow down to diagnose what the actual culprit of the error is. It helps reduce complexity. And like I said, because you have to think about properties instead of examples, it actually makes you think differently about your tests. It makes you think more philosophically. And I think that in itself is already an advantage in learning property based testing. So I think we're out of time. So a small thank you for SliceGo. So if you think this is a nice presentation, I pulled it from their website. And I also have to contribute back and mention them. I also want to thank you all for attending this early and for listening. And for the organizers, of course, keep forgetting us. So if you think, well, this was nice introduction. It sparked my interest. How can I continue learning this? There's a good book on a website called propertesting.com. If you're not using Alexa or any of the other languages, there are also libraries in other languages like Python has hypothesis, for example. Look it up on either property based testing or generative testing. Some communities call it differently. And if you think, well, how should I think in writing properties? Then John Hughes, one of the founders of QVIC, also has a good talk in which he talks about how do you come up with these properties? How do you think in this way? So I don't know if we do have time for questions.
Gleam in the machine: phantom types and the builder pattern.
And as we mentioned, this is, let's talk about using types in GLEAM, but also these ideas translate to most other statically typed languages and hopefully also this gradual type experiment is happening in Elixir. So before I get into it, maybe a little bit about me and why I'm slightly qualified to talk about this. I'm a front-end developer at a consultancy called Data2Impact. I do sort of ALM and TypeScript and React there. I've been doing ALM, which is a statically typed functional programming language for building web apps for about six years now. I also do developer relations-y kind of things at an open source company called XYFlow. There we're doing React again and Svelte. And I'm also doing way too many things to do with GLEAM these days. Louis insists that I'm a member of the core team despite having never contributed anything to the compiler. I also maintain this front-end framework called Luster and an ecosystem of packages around that and spending all of my time in the GLEAM Discord is another full-time job. The title of this talk is Phantom Types and the Builder Pattern. So I suppose that begs the question, what are Phantom Types? And to explain that, we need to do a little detour on generics and what they look like in GLEAM. As a kind of dummy example, we have a list. This is a type, this is how you define custom types in GLEAM. So we have a list type and two variants. One is a head, which contains an integer value and the rest of the list. And then there is a tail, which is the end of the list or an empty list. But right now, this type only contains ints. So if we want a list of strings or a list of structs or anything. To do that, we need generics or type parameters. And so that's what that looks like in GLEAM. So we say list A, we have a type parameter called A and that head has a value that could be any of these A's and then the rest is another list of A's. So what the heck does this actually have to do with Phantom Types? Well, if we just take a moment to look at this type again, I want to home in on two specific bits. Well, okay, three apparently because I've left that black, but we'll ignore that. The list A type definition and this tail constructor. So we're saying this list is generic over A, but our tail constructor has no data attached to it. So Phantom Types don't exist at runtime. We don't have any data that associates with this type. And to exemplify that, here is a tiny trivial example. We're defining these two variables X and Y and we're telling GLEAM tree X as a list of ints and treat Y as a list of strings. But they're both this tail, right? And so maybe intuitively, we should expect to be able to test equality on this and get back through. But actually if we do this in GLEAM, we get a type error. This is a compile error that tells us GLEAM was expecting Y to be a list of ints, but it got a list of strings. That example was maybe not compelling and probably your gut reaction was, well, that's kind of rubbish because I knew that those two things were the same. So what are they actually good for? The first example is IDs or really anything where we have a shared runtime representation that we want to be able to distinguish between things at the type level. So this is kind of a common way to do IDs in GLEAM. We have what's called an opaque type, which means that the module that this type is defined in can construct IDs, but outside modules can't see into the type. So we expose this fromID function that just takes an integer and wraps it up. And then we might do something like this, right? Like we have a post ID and a user ID and we have some sort of database call or maybe an API call or something to upvote a post. But there's actually a bug in this code. Can anyone spot it? This is the definition of our imaginary upvote function and the first parameter is actually a post ID. And we've got the order the wrong way round because we're just using this ID type for both arguments. And so like as a user, we've made an error here that our program will never catch. We might be a month into production before we've realized that user two has upvoted a whole bunch of posts by accident. So there's a few different ways we can solve this, but one is with phantom types. So we have this kind type parameter now on our ID. We're not using it. That's what makes it phantom. And we keep the fromInt constructor the same. And then we let callers essentially refine what an ID type means. So we have say a user and a post type somewhere in our code base. And now we can make our upvote function type safe in the sense that although these two things have the same representation, we've told the type system that the first argument has to be a post ID. And so if we take that example again, now post ID is again an ID for posts and user ID is an ID for users. We've got the arguments the wrong way round this time still. But now we get a compile error, right? And we actually get quite helpful error as well. It was expecting a post ID, but we've got a user ID. But also we can take advantage of the fact that the representation of this type is the same no matter what kind. And so we can still write functions like say a debugging one that stringifies any ID. It's generic over this kind. We don't care about that. We just want to pull out the value and turn it into a string. The next example is validation. I'm going to show you some password validation, which is super don't do this actually in real life, but it's good enough for demonstration. This is maybe the naive way that we would handle password validation. Maybe we want to create a user in our database or something. And so we have this password type alias for a string and we would define say a function is valid that will check if a given password is valid according to some rules. Maybe it's long enough, whatever has all the funky characters. And then we have a create user function that would take a username and a password and then return a result, right? And so if the password is valid, we get back an OK user. And if the password is too short or doesn't have a capital letter or whatever, we get back an error. And this is fine, but I can already start spotting some problems with code like this. So for one, we have a similar problem from that previous example where there's nothing that actually distinguishes a password from a username. So password is a type alias here. It's just another name for a string type. And so we could accidentally swap the arguments around in our code and get a problem there. And also this create user function, this is probably going to be used in our business logic somewhere like kind of deep in our code. And now we're having to deal with errors and a way to surface those errors to our users. So here's a perhaps different way to formulate this with phantom types. So again, we have this type parameter now on our password type, this validation type parameter. And then we define these two sort of dummy types. They have no runtime representation, so there's no shape to them. We have one for invalid and we have one for valid. And we have a from string function that takes whatever password someone's typed in and it gives us back an invalid password. And with that, we can start refining our API so we can write a validate function that takes only invalid passwords because we don't need to check once we have a valid one. And it returns a result like before, you know, is the password okay? If it is, then we get back a valid password. If it's incorrect, we get back a reason and you know an error. Or we could write this kind of function to suggest a new password to a user if they provided an invalid one. And then it's probably a logic bug in our program if we ever started suggesting passwords to our users if they were already good enough. And finally, we can rewrite that create user function to do away with the result handling now. We say this only takes in valid passwords and so we only ever need to get back a user that we've created. At this point in our code, we've already asserted that the password is valid. We don't have to check again. We don't have to handle this in our business logic. We can handle the results much, much closer to the boundary of our program. So the takeaway there is that phantom types restrict our APIs so that we can focus just on the happy path. So the talk is about phantom types and the builder pattern. So I guess now is a good time to touch on what the builder pattern is. I'm a front-end developer so if I'm configuring a button or something, often there's a lot of optional properties. So we have this button config here. The label is required but then we have all these kind of options to change how it looks and works so we can maybe have an icon on the button. We can change the button's color, maybe some of its styles. So the builder pattern really lets us define this config and then we define, say, a constructor that handles all of the required stuff, in this case just the label, and then we kind of none out all the optional things. And then we provide individual functions that take that config and sort of add one bit of the config to it over time. So we have this with color builder which takes the config and then adds a color to it. We have one for an icon and so on. And then you end up with an API that's a bit like this. So we create a new button with the label Wibble and we add a sparkles icon to it or another formulation. We have a wobble button that is an error but it's also disabled. And this stuff then sort of scales. You can configure this sort of infinitely over time. But again, there's actually a bug in this code. Can anyone spot it? We've defined what the icon on the button is twice. And this is almost definitely a logic bug in our code. We said that we wanted a confetti icon and then later on down the line we replaced it with sparkles. And so it turns out the phantom types can help us here too. So I add a phantom type parameter to our config. This has icon parameter. And again, this is the same pattern rate. We have these two dummy types. They don't have any runtime representation. One for having no icon in our config and one for having an icon. And then we change our constructor function slightly. So now our button config that we get back when we create a new button has specifically no icon. And then we update our with icon builder to take only configs for buttons that don't have an icon and return a button config that asserts that there is an icon present. So if we look at that code example again, now like before we get a type error. We get a type error on the last step in the pipe that says, hey, this button config I expected to have no icon, but you must have given me one in the past. This is not just maybe an academic exercise, but this is also useful in the real world. I was going to show some code examples and then I realized it's pretty much exactly whatever I really showed. So I'm not going to do that, but I will just explain. I use the phantom types and the builder pattern for some code that I do in Lustre. Lustre is this front-end framework that I write for Gleam, which has components that can run both in the server or in the browser. The type of messages that the runtime can receive are slightly different there. And to avoid basically duplicating, say, 90% of the code base, I have a phantom type on the message that the runtime receives and I can say, hey, server components only receive this sort of subset of messages. Browser components only receive, say, these one or two types. And that's really, really useful. I also use these in a static site generator that I maintain for Lustre. And there we're using this builder pattern. So you build up your roots of your static type. And the config there has fields for, like, does this config have at least one static root? So if I run build, am I going to get a site out? And I use the phantom builder type pattern there to make sure that the user, when they construct their static site, they have at least one static root. And so to recap, the takeaway here is that phantom types don't exist at runtime, but we can use them to restrict APIs that we write so that we spend more time focused on the happy path rather than, say, error handling or duplicating our code for very, very similar scenarios. Thank you for listening. I think we've got a bit of time for questions now. We have some time for some questions. Anyone else have some? Any questions? Yeah, go for it. Is it possible to scale this phantom types to, like, multiple properties? Because in the example, we are not sure that there is this icon. Would it be possible to extend to this label and so on? Yeah. So the question is, can you extend this pattern for, say, like, multiple facts or witnesses, right? Can you have multiple phantom type parameters? The answer is absolutely yes. And that's exactly what I do in this static type builder. In Gleam, those would have to be individual type parameters for each sort of extra thing that you want to track. But if you were using a language like TypeScript or probably when Elixir gets their gradual type system, there you can use, say, a record. And you can use fields. You can grow the number of things that you're tracking just on the fields of the record. So you have, like, one phantom type that is a record. And all of the things that you care about would be labels of that record. Does that make sense? Any other questions? Okay. Thank you again. Thank you very much. Thank you. Thank you very much.
gen_statem Unveiled: A Theoretical Exploration of State Machines
especially state machines and how they are handled in Ireland and also from a theoretical point of view. So, it's up to you. Thank you. All right. Yes, he said, like, I'm relatively young but I know a school guy, so I code in V-man and use Ireland. So, this went too fast already. I work in Erlang Solutions. We do like Erlang stuff, so concurrency, scalability, the useful things that most of you would be hopefully familiar and we also contribute a lot to open source. This talk is going to be about state machines, as you heard. First, a question of protocols. What are protocols? I wanted to make a survey and ask you and so on, but we have limited time, so I'm going to answer the question already. System of rules. A few examples. Okay, I need to point here for this to work. Protocol defines the system of rules for syntax, semantics of the project, the program that you want to write. Some examples, the usual ones are TCP for network communication, is connection oriented, stream oriented, messages are ordered and they are acknowledged. Another common example, TLS for privacy, integrity and authenticity, encryption, very important. I hope that everybody has HTTPS enabling the browsers by default. Some other examples are file formats or markup languages. Parsers for them can also be implemented as state machines. The two classic examples, XML and JSON. XML is particularly interesting to me because I work in XMPP messaging server written in Erlang, of course. If you saw our talk in CodeBeam, for those that are following CodeBeam, Pablo and me, we talked about the state machine re-implementation in Mongo's IM. This is a bit of a continuation to that. Some more complex protocols can be implemented as state machines like HTTP and as I mentioned, XMPP, which is my specialty, which is extensible, that's the X and my favorite part of the whole thing, it's an instant messaging protocol that also has presences, the green bubble, whether your friend is connected or not and it also does contact list maintenance on the core protocol, 500 extensions and build your own. This is the state machine diagram for the protocol. Much like a flow chart on steroids, I really like that analogy. With the state machines, we are like the usual thing, how you think about state machines, you draw the state with some arrows, the arrows have tags about how you transition to the next state. Finest state machines give you a way to visualize a system that can be very complex. Why state machines? State machines can be seen as a model. We want to model the behavior of protocol that can be very complex like TLS or HTTP, most of you will be familiar, XMPP, my specialty. Let's talk a little bit quickly about state machines in particular. A few formalities. I studied mathematics in university, I'm excited by these weird symbols, but some people can find them off-putting, so I will try to make it pleasant. A few terminologies, we define an alphabet, terminologies, you use Greek symbols, mathematicians, which are the input symbols, zeros and ones, or ASCII characters, UTF-8, or complex symbols treated as a single element, half, and you can do equivalences. One of the weakest ones is the regular grammars, it's how you do regexes. A regex, this thing that right ones are never read, but very powerful, is theoretically equivalent to a state machine. Again, this is jumping too fast. Something a little bit more powerful is the partial automata, I'm not going to focus on this one too much, use a key difference, then nothing else parsed, now it has one more thing, it's the same thing before, plus a stack, and the stack behaves as you would expect. The function that used to take the state and the input symbol also takes the stack and the output of the function is whether you pop something from the stack or you push something on the stack. It's safe to consume a string that you give to this PDA as it arrives to one of the final states with an empty stack. There are equivalent definitions, not all definitions require the empty stack, but I choose that one. They are equivalent to context-free grammars, parsers, but not compilers. Why a compiler? So in tree, the thing about being context-free is that it doesn't remember symbols that were defined before. So for a compiler, for example, the usual regex compiler for C that needs to remember the definition when you say int e and then you use e later below, parser doesn't remember that, you need symbol tables, parser only builds the syntax tree. And the fancy one, the computer, theoretically, Turing machines, which is again the same thing, but nothing else is supplanted by a tape that is infinite. It is equivalent whether it's finite in one side and infinite in the other, all of those are equivalents, whether it has two tapes is also equivalent, will arrive to that. The function takes the tape and the action go one to the left and write something, go one to the right and write something. Very similar, a Turing machine is said to consume a string when the next step is undefined. When it holds, you have all heard of the holding problem. There is no way to know whether a Turing machine will hold. That is important. They are equivalent to interested grammars, compilers in the Chomsky hierarchy that are like four levels. The three things that I describe are zero, one and four, there is something in the level three that is not directly useful for the moment. So I skip that. So how do they compare? This goes very fast sometimes. So that's the power that they have. A Turing machine can do all the others. PDA can do the one over there. So that's the power that they can do. They contain the power of each other. Two FSMs running together has still the same theoretical power, the same thing that a PDA with a finite buffer or a PDA with a finite state machine is still as powerful as one PDA. Turing machines, whether it's multi-tape, tape one banded on one side, they are all equivalent again. A Turing machine doesn't get more powerful by giving it 100 tapes. It gets maybe more performant theoretically, but the problems that it can solve are all the same. And a PDA with two stacks is really a Turing machine when you know you can just go in both directions. So when you give the PDA two stacks, you build a Turing machine. So conceptually, finite state machines can keep track of one thing, the state. The push-down automata can keep track of two things, the state and the top of the stack. And a Turing machine can keep track of infinite things. When I was going through the mathematics and I came to this conclusion, I found this funny for a completely unrelated reason. In the European languages, I mean to human languages, used to have the concept of dual as something different to singular and plural. The function that it computes depends on one thing to things or an infinite number of them. The function that was defined before. So in the European languages, as I said, they had this special concept of the dual. And I found it very funny how informal human languages used to have such a thing as a dual, as a different grammatical category than one and infinite. When you build the declinations, they had a different thing. Why do I know this strange thing about languages? Because I live in Poland. So Slavic languages have some remnants of that dual concept. So there is this famous joke of in Polish you have like 100 ways to declinate number two. And you have more ways to declinate number two than you have number three because of that all dual. So two is special. I live in Poland, but I'm not Polish. It's challenging. So do FSMs produce output? Let's go move slowly to what is useful here. We can define finite state transducers, which same thing than before and then nothing else is supplanted by another output alphabet. The function takes the state and the input and decides the next state and a symbol for the output. It's a to consume a string the same and they are also equivalent to regular grammars. When it comes to the problems they can solve, again, they're all equivalent. You get fancier tools, but there are properties that are going to be all the same. You will see in a second there are many, but let's focus on two ways of defining transducers, the milley machines and Moore machines, whether the output, I have a laser, yes, whether the output symbol depends on the input and the previous state or only on the previous state. There is a way to define Moore machines from a milley machine, but not the other way around, so milley has a bit more powerful. Now something a bit more useful, how do they compare? They are still the same than the FSM machines, but this can be composed. We are getting into a bit of engineering. We are almost there. Not that much. This is a thing, laser. Yes, oh god. Come on, sometimes. So given three sets of states, three alphabets, one machine goes from one state and one alphabet to the next state and the other alphabet. The second machine uses the same the output of the previous as its input, so you can define the composition as a state machine that takes the first alphabet and the first set of inputs and gives you the third alphabet and set of inputs. Composition, cool. Why? Because you can implement all these things as state machines and the output of one is the input of the next. So my stack on XMPP, you can implement TCP as a state machine. Have you heard of the Erlang socket, the new socket? It's implemented in TCP on top of gain state them. If you go to the source code. So I have the output of one state them, throwing into the output of the next state them. TLS is also implemented as a gain state them, throwing output to my thing, to the XML parser that throws its output to the XMPP protocol. So we are composing things. One last theoretical thing. The unions of FSMs that is uniting all the states and strings, it's also an FSM intersection, so the states and its input symbols in common gives you a very small FSM. It's also an FSM reversing, still an FSM, empty, so no states and no input is also an FSM that when you do union and concatenation with another FSM does nothing and homomorphism, so a function that transforms alphabets and states into other alphabets and states preserves the structure of an FSM. So FSMs are a semi-ring. This is an algebraic structure. Why is it useful to have search algebras? To prove things that you cannot prove with Turing machines because they do not form an algebra. So now let's do something engineering, state them. So as I said before, it's a Melly machine. It gets the input and the alphabets, it produces the states and alphabets, it produces the next, you follow, I hope. We can consider that the input are the messages in the mailbox and the output symbols are side effects, like for example sending messages to another mailbox. Gain state them. I'm a big fan. I love it, but I know that people sometimes don't use it because maybe it's confusing or I don't know, complicated. So I'm going to try to explain one thing that is very useful here. An extended mailbox. This is a discussion that the OTP team, when they put the pull request for gain state them, there is a big discussion with over a thousand messages that was probably forgotten, but when they discovered gain state them and I liked it, I went to the source and I read that super long thing. And there are useful things said there. A way to visualize a gain state them. Imagine that it has one queue, that is something more than the process mailbox, with like three pointers. The head pointing at the oldest event and the tail pointing at the youngest and current is where I am now. You can move where current is with some of the actions that gain state them gives you, for example postponing an event. Postponing an event means that current moves to the next, but the event is not forgotten. There is a different action that will put current again in the head. Not postponing and you consume it is removed from the queue. When the state changes, current goes again to head. Next event inserts things where current is and not at the tail. And timeouts inserts things at the tail. So the engine, the gain state them implementation allows you to extend the inputs that your formal state machine is going to get. How does it work? Imagine that we are here, we have event one and we decide to postpone it. What happens? It's still on the mailbox. We just are now going to deal with event two. Now we decide to do some stuff and then go to the next state. So that has been processed and current because we changed the state goes back to the previous. Now we are again going to handle event one and this time we decide to not change the state, but we generate new inputs as if this process has received a message. But this event A, which is ad hoc, we just created it, is inserted where current is. So it's the next event that we are going to handle. We can decide to postpone it. Now we are going to handle event three. With event three we do some stuff, but we don't generate events. Imagine that there is middle code here doing. So event three has been dealt with. Now you go to event four and you decide to postpone event four, but also insert and event B. So event four goes behind, you insert and event B, you get the idea. So the engine gives you a way to extend the process queue. What am I doing with time? Oh, one more important power. I'm not going to have time for everything. One more useful power of the state machines. Managing accidental complexity. There is a talk that I want to recommend. It's quite an old one, maybe something like 10 or 15 years ago by Ulf Rieger, where he was complaining about some limitations of GANFSM, but even GAN server that we all use. Very useful talk and I have one tiny answer to that with the new GAN state that didn't exist back then. Typical state on, off, but you can imagine that you're switching a light, but your switch talks to a through a cable protocol to the light machine. So when the user says on, this is a GAN server, you say and the state is off, you send a request to on, you wait for the answer on, it's on, vice versa, relatively intuitive code. Now imagine that that request through the cable protocol was not synchronous and imagine that the switches cannot block. It needs to do other stuff. So you send an asynchronous request to the light, hey, turn on yourself and continue doing other things, but then the user sends more off and on. What do you decide to do here? It's not part of the protocol. The events are now asynchronous and out of order. There is no global ordering. So there are some questions like you need to choose what to do. Sometimes this, this is the, so we can use a state machine. They use all the way. The name of the function is the name of the state and you can postpone things if you are already running a request, you postpone it and when if the user press on like a hundred times, by the time they like says on, then you have changed the state and you're going to handle all those. It's already on, so just do nothing. But the code is terribly symmetric. It feels repetitive. So problems, there is no ordering when things are asynchronous. Tying yourself to the ordering of events leads to accidental complexity. This is the point of Ulfiger when the order changes, the whole implementation changes. It grows relative to the number of states. This is super simple. It's a like that goes on and off. But imagine complicated protocols and for example a middle layer between a very talkative protocol and a like one and code reuse. So I really like the handle event way of handling things. It's a single function callback that gets a simple the state and the data. By the way, it's very confusing because we are used to the state of the process for the server thing. But in the state, the state is the state machine state. So the other thing where you save like, I don't know, the socket, for example, is called data. So just confusing terminology. This, you can just pattern match whether you're in the same state and the previous function that was terribly repetitive is now in a single function head. This is, I believe, a way to answer to the problem that Ulf raised and now I'm exactly on time. One more slide. A way to answer to that problem and in a way that you can reuse code, that you can decide the order of events because you can postpone things and you can also insert things. Quickly here, why I use on the XMPP, we had like this implementation. There is only one thing that I really like here. The composing. As I said before, you have the TCP state machines that go to TLS that goes to XML that goes to messaging. So if we want to implement this on a single process, this can be, for example, this is a simplification on my data. I have a parser and the crypto library that when I get that TCP payload, this is how we do it in Mongoose. I am not TCP, TCP we just use TCP to complicate it. So it's a separate process. But crypto and XML parsers, we implemented on the spot. There is a C code that the parsers, part of the XML, for example, it gives you a new parser with a buffer and the XML structure that then you can use to generate the events that my protocol cares about, the XML payloads. That's one use case that we have. That's me. You can find me by that picture in all the places. Those are some of the projects I work in and I was going to say questions, but we are one minute late. Thank you.
Guess Less with Erlang Doctor
Okay, I... Yeah, it switched off to like... By itself. I didn't touch it. So yeah, when you debug, for example, your code, when you're trying to find out why you have a strange error or something like that, you can use our long tracing. And it's very powerful, as we said before. And for example, you can use tools like DBG or Recon that are using error tracing underneath. And the first step is to choose which functions you want to trace, actually, because you don't trace everything. Although you can trace what you want, you cannot trace everything at once. So you choose like, I want this function to be traced for this bunch of functions. And then, when you call these functions, you get your traces being printed out. So you get the information of this function is called, these are the arguments, return values, things like that. You can get it to console, you can get it to a file, and you can also send it to a network, and that's what I have been doing for many, many years. I was just setting up, for example, I said many years, yeah, for 15 years, I think, with Erlang. So I was just setting up a special node that was like collecting traces for all the other nodes. So you can also send them to the network. And, well, afterwards, you either read the traces that you collected, or you can also search them, grab them, parse them, do some other operations on them if you want. But these are just text logs, let's say, mostly. And the problem is that very often you have to repeat the whole process. That's because you've traced one function, but you found out that maybe the problem is in another function, maybe in a completely different module, and so on and so on. So you do it, so you repeat, repeat, and that might be kind of a problem. So this doesn't scale well. And what I mean by that is if you try to trace a lot of functions, well, I found out that at least for me, when I get like 100 to 1000 traces, it becomes difficult to read, like for a human to read that amount of information. Okay, but you can search, for example. And this also has a limit. So, of course, this is just a rough estimate, let's say, but for me, usually, when I have like 10 to 100,000 traces, then it becomes difficult, because even my system can slow down, IO can slow down, and actually it's quite surprising, but sending traces to a file or to a network, it's actually quite expensive. And it can slow down the system quite a lot. And it's a heavy operation, so sometimes I had traces accumulating for three minutes after I finished traces or something like that, and the messages were still in the queue, still being processed. Yeah, so this doesn't scale that well. Okay, so let's sum up. Choosing the function to trace is kind of a guesswork. Not always, of course, sometimes we know precisely, but most often I don't know. I know kind of what I'm looking for, but not exactly, and that's the problem, but I need to somehow know the function exactly here, to choose it to be traced. So, possibly many iterations of the process. This is, for me, this is like ad hoc logging. This is very much like logging, but I don't need to have a log statement in my code. I just choose dynamically, right now, I want to do these functions to be logged. And what if the trace behaviors have tests that fail every 20 runs, for example, do I need to repeat this 20 times? So what? That's the problem, right? And answer to some of those issues is Erlang Doctor, at least for me, and for the people who've used that. So what's the difference? So you set up the tracing for entire application, not always. Sometimes it's not possible. Sometimes you have to trace individual modules, but usually you can start with just one entire application. You capture traces, store them in the ETS table, and you clear the traces afterwards. And you can repeat this process instead of repeating everything, because you've collected enough stuff to query and to find out about different functions, for example. To query, oh, this was this function called, maybe another one, and so on and so on. You can do this. And of course, rarely you have to repeat this, but for me it's only when I, for example, trace the wrong application, because the problem is not in my code, it's in a library that I've used. Then I need to trace another Erlang application, right? But it doesn't occur that often. This scales much better. What are the limits? So on my laptop, for example, querying becomes slow at about 10 million traces collected, which is quite a lot, but it's like tracing a function, a system under heavy load, for example. And of course it depends on the size of individual traces, because you can have big arguments being passed or something like that. Yeah. System memory becomes the limit at 4 million at about 50 million traces, but sometimes it's 10 million, sometimes it's 100 million, it depends. But basically when you have a few million traces, it's probably too much. So there is a limit, of course. So to sum up, very few iterations of the whole process, usually one. This is for me like ad hoc instrumentation instead of ad hoc logging, because you're gathering structured information in the ETS table. I will show you details in a moment. And use cases, for me there are many use cases. For example, debugging, system exploration. I often use it to just learn about the system. I just run the system, do it with the usual stuff when I'm tracing the whole application, and I'm just querying what the system did, actually, with the traces. And you can also do some lightweight profiling without the need to set up the profiler for a particular function. Yeah. So let's go to the Erlang doctor itself. How to get it from GitHub for Erlang or for Elixir? For Elixir it's called Xdoctor, which looks like a former doctor, but it's just a bit funny. Yeah, so here is a package of Xdox for both of them. And yeah. So how to run it? Three options. The first one that I'm using sometimes when doing firefighting, this is when you don't have it in your shell, but you want it right now, like in a system that's misbehaving or something. In both tools there are snippets that just download it from the Internet and compile it and run it, which works in this particular case. It's probably the best option if you just want it right now. And yeah, all you need is to have the access to the Internet, which is usually the case. The second option, which I'm using always in development, is just that you set it up in your .erlang or .iex.exe file. So that it's always available whenever you start any Erlang or Elixir shell, be it in your project or wherever. And third option, packaging. You can always include it in your application, in your software, if you think it's that useful. Okay, so let's move on. Let's start. Examples are in Erlang, but they are also available for Elixir in the docs. You can find them. The first thing to do is to start. It needs a GenServer, so it just starts at GenServer. And a few other examples how you can start them. You can choose a different ETS table. You can just have multiple ones if you want, switch between them. You can limit the size of a table. Very useful, like in a production environment, if you need to do some tracing, you just set it to like 1,000 or something. Just the table will never grow bigger, so you will never consume all memory. And yeah, there is also a start link. Okay, so let's set up tracing. I'm just tracing an example module. It's a test suite, but it contains functions that we can trace. It's good. So yeah, I'm just starting that tracer. And I can also trace a specific function, like provide a module function argument, whole application, multiple applications. And add a bit more. You can trace messages. You can trace specific processes and so on. There are a few more options. Capturing the traces. Okay, so let's call a function from the trace module. I'm calling just a sleepy factorial of 3. It's a function that calculates a factorial and just sleeps for 1 millisecond between each step, right? It will have some time difference. That's it. Yeah, very simple. And yeah, I'm just... Okay, now we can stop tracing. It's a good habit because you don't accumulate traces when you don't want them anymore. And now what can you do? Because we've accumulated traces, what can you do with them? So let's read the record definition. By the way, I'm using records because they are very performant. And even maps were giving me five times worse performance for some operations. So yeah, I'm using records. Yeah, so let's get all the traces. So I got all the traces and I don't want to talk about everything. Let's talk about the arguments. So these are the arguments and these are the return values, okay? For calls, for returns. And I will just introduce the other fields as we go. Arguments are in brackets. So now trace selection. You can do a select. It's a fancy way of doing this ETS select with ETS from 2MS. And let's get all function calls. And for each argument, let's just get this argument. So I'm getting a list of arguments. And of course, this is a recursive way of calculating factorial. So it's 3, 2, 1, 0. And there is also select slash 2. And this one takes like any term and looks for that term. So here it found, for example, it has an argument. Here it found it as a return value. But there is more. It can be hidden inside any lists, maps, tuples. So it will look recursively inside your data structures to find out anything you're looking for. So for example, you can look for an error message, even if it's called unknown error, which occurred to me once. And I just found this unknown error. I just put unknown error here and I just found the function that that causes, right, instantly. Okay, there is also filter. It's similar to select. But here you can pass any function. It's a bit slower, but it has more features simply. So you can, for example, assign a result to a search, to a variable, and then you can search in that list again. Oops, sorry. Then you can search in that list again. So you can narrow down your search. You got like two traces. Now you search in that two traces, but only for calls and you get only one. So another way to query it. And the tracebacks are very important for me because I want to, like, know the source where this originated, this particular, for example, function call. So here I'm just looking for any return value of one. And the sleepy factor of one actually matches it and it returned one. So this is returned, the traceback of this one. The call itself is first. It's right here. Sleepy factor of one and the rest is just the traceback. And the sleepy factor of zero returned also one, but it skipped because of some skipping logic. Yeah, it's details, but it helps you, like, limit the output that you get. Actually, you can disable that and you can get, like, all the traces with output all. Then you get, then you have no skipping of traceback that include another tracebacks. And you can limit the number of much traces. You can reverse call order. You can search in a different table, in a list, for example. And you can get only the first traceback if you want, very useful, just shortcut, let's say. And you can also get it for a particular record or just an index of a trace because there has just this auto-incremented indexes. Yeah, and similar to a traceback, you have ranges and ranges look inside. So traceback is like what's the source and ranges give you, like, all the traces starting with a function call until it's returned. Everything in between, from one process. And, yeah, so for example, here we are really looking for any traces that are just for function calls that have one as an argument and you get a range of traces from the call until the return. A range options, you can get, like, limit call depth is quite interesting and very useful because by having one you just get a call and return, which is very useful. And searching in a list of traces is also possible, getting only the first range if there are many, also possible. And getting the range for a specific trace. So quite a lot of options. I've just, you know, added, I've been adding and adding over a few years of development of these two. So they're all quite useful. Utilities, two simple utilities they wanted to talk about. One is to just look up a trace. Nothing fancy here, ETS lookup, does it, right? But then you can execute the trace, which is quite useful for me. So if this was a function call, I can just execute it right now, again. This is just, for example, let's say I fix the bug and then instead of writing some long code, I can just execute a trace and see if the result is the same or different. Or I can trace again, right? I can start the traces and trace again. Okay, now a bit of profiling. So I find this lightweight profiling very useful because it doesn't put as much stress on the system as F-Prof, for example, the Erlang profiler. And it's like instantly available. I don't have to prepare for it in any way. So call start, it's statistics aggregated by this function. So I'm aggregating everything under the total atom. So I'm getting like four calls and this accumulated time and this own time. These are equal because I'm just accumulating everything. But if I aggregated by function argument, you can see that there was this one call with each of the arguments. And this call took the longest time, but actually it's accumulated time because its own time was the shortest section, right? You can also do filtering here. So you can say when N is smaller than 3 and we just skipped one of them. So you can do that and you can sort them. Yes, you can sort them and print them as a table. We had some just nice utilities to have. And the last feature I wanted to talk about is function call 3 statistics. I called it like that because let's say we have a function that calculates the Fibonacci sequence in a suboptimal way, you probably all know that it's suboptimal. It branches a lot and let's clean up the traces. Trace again, call fib of 4 which returns 3, which is the correct value and stop tracing. So we now have different traces in the table and let's do it. Let's just call this function with default arguments. So it says that there is a call 3, I mean by that function calls, returns, everything inside repeated twice because there is this number 2 and it took 10 microseconds that there is no sleep in this example. So it took 10 microseconds in total. And this is how the function call 3 looks like. So you can see that, yeah, indeed, it repeated twice. So this can help you find out redundant code. Yeah. Okay, so this function also has some options but I don't have time to talk about them. You can just customize them a bit. And table manipulation, you can get the current table, dump it to a file and load it on a different Erlang node. And then you can continue analysis on a different Erlang node. And that's all I wanted to talk about. And that's me on a mountain bike. Thank you.
Implementing UDP protocols in Elixir
There we go. I don't know about fun. We'll see. So thank you for the introduction. Hello, everyone. Nice to see you. Pretty happy that I can see so many Elixir developers on one place. I come from Croatia, where I think there is maybe 10 of us, and it's really hard to connect. So it's really nice to see that the community is growing worldwide. My name is Andre. I've been a developer for 11 years. I've been doing Elixir for the last three, two and a half-ish years. Previously, I've been a JavaScript developer. I decided that that's not going to fly anymore, that I need to have a life. So I switched to Elixir, and things have been going great. I'm a licensed accountant, building my own accounting software with Phoenix Liveview. It's going great. You should use it, Abandon React, use Liveview. It works for everything. I'm a vice president of the Croatian Association for Open Systems and Internet. I had to read that one out. We are very active in an open-source Croatian community. We organize an event. Please come talk to me after. I have some t-shirts. I have some stickers. And in the last slide, there is a coupon code for something percent off of the ticket. We're also going to try to have Sasha Urech this year there to come and talk. So if you want to mingle, please do come along. And I'm a member and the co-organizer of the conference. Let's start. So this is our plan today. We're going to go through the problem. As we're solving, we're going to fix it, and then we're going to talk why you should never do that things in Elixir or generally ever. So everything started. As things start, I was browsing Hacker News, and there were people doing a PP measuring context about their uptimes and who had the larger uptime. I update my servers, so I have a low uptime, but I wanted to be a part of that. And the idea was born. I want a fake uptime. How do you fake uptime? I didn't Google it. NTP server. I will just make the server ping for time, and every time it asks for time, I'll just downgrade the time, and hopefully it'll catch, and I will have a huge uptime. That's not how it works. You need to basically fake a kernel call. So maybe next year, I'm going to implement kernel modules in Elixir. It is possible, though. It is possible. I have a working proof of concept. You do, however, need a lower level language to call. I like solving these kinds of problems. So it's been super fun. Ever since I first applied at TopTel for a job, they gave me a task to implement the DNS server in JS. That's been super fun. So I wanted to see how it's done in Elixir. Spoiler alert, it's a lot easier and simpler and maintainable. And it's a cool topic to write about and talk about. This is a previous blog post. It's been featured in a couple of newsletters, which I'm still not sure why people like this, but maybe, hey, you share the same affinities as me. So the hardest thing was to discover the protocol. What's NTP? For those of you who might not know, it's a network time protocol. It is a terrible, oftenly abused protocol. Some of you might remember a couple of years back. It was used for a widespread DDoS attack. But it's one of the easiest ones to implement, so it turned out. So the first thing we need to do is we need to learn about the protocol. Now you can go and read RFC, but that's boring. I don't Google stuff, as it's obvious right now. So what I did, I just installed the TCP dump, NTP update in a virtual machine. I started up TCP dump and just updated the time. And I got the pick up file out. Now let's see how we read the pick up file. This is a very important part in implementing a protocol. So bear with me. This is how the packet looks like. If you love your life, unlike me, you'll probably use Wireshark. So it looks better. It's easier to browse, but this was also more fun. So let's get right into it. The first part, I've put up dots we should ignore it. That's the header of the UDP packet. This is something that's getting handled with us. If you remember the layers of TCP, the same thing applies to UDP as well. And the first parts are the hardware part, and then there is the network part. We can all ignore that, that will be handled for us by Elixir. So I just put dots in order not to confuse us. Everything else is super confusing anyways. So the first part, I'm just going to do this once. This is how you read the bytes. So this E3, you should convert to binary. And here we have something that we call flags. Here we have three flags. Those flags is something that you define as a protocol developer, or as whatever is the name of people defining protocols. So it's composed, this flag, this is three flags. So the first one is the leap here indicator. This is the NTP version and the packet mode. Now this packet mode is really important and you should remember that. So 011 is the client. So this is us asking for the time. And we will be making a server that responds. So you need to remember to flip this packet for shadowing. The next one is the clock stratum. Not important, we can ignore that. The pooling interval, this is how many packets will be sent, which is kind of telling that this kind of starts looking like a TCP, but whatever, you need to expect three. Not important, you can ignore that as well. And now we have a couple of flags. So we have the clock precision, the delay and dispersion. This is, since this is a VM, it doesn't look that all interesting. It's basically all zeros because it's lying about the time anyways. If you ran this on your computers and real laptops, you will see more details here. I've written some copious notes on that. So later go to this link and you will find a lot of links to the RFC and documentation if you really want to get into it. So let's continue. Now comes the more interesting parts, says he and there's all zeros. The first one is the reference ID. This is very important. This is how your server knows who asks for it. You can ignore it. The next part is the reference timestamp. So this is the timestamp we send to get actual time. Why it's zeros here? Because this is a client. The client does not need to tell the server it's time. You can however, the protocol allows it, but you don't need to. Next up is the origin timestamp. This is basically when the call was made or when the call was responded to. Again, you can send it as a client. You don't need to as a server, another story. The next part is the receive timestamp. This is actually when did we receive. This is very important for the server to send along because the client will then do some math between those three timestamps to get the actual timestamp. And this is the transmit timestamp. We need to return this as it is. Also foreshadowing, it took me way too much time to understand why I need to do this. It would be a lot easier if I read the RFC. Okay, so this was me at the point. After, this was after a couple of hours of things just not working, not understanding. I went through the documentation. Things started becoming a little more easier to understand. So let's get back to it. This is what we need. We need the reference ID and we need the transmit timestamp. Reference ID, again, repeating, this is very important to understand, is how we know to who to return the call to because you should keep some kind of a hash map or ETS tail or whatever. And the receive, the transmit timestamp you need to return. Otherwise, the client will return an error that it does not know how to calculate time because it diverged too much. So that's the main point. Okay, and this is where the fun part starts. If you ever did this with JavaScript, I should have included that slide. It's a lot of splitting on binary streams. It's extremely hard to read. And with Elixir, this is all that there is to it. So it's just pattern matching. And this was mind blowing to me. I know that Sasha has the whole chapter in his book about pattern matching on binary, I ignored it, I didn't kind of get it, but this is amazing. So what this does? I've decided to ignore the first part. Where is the mouse? I decided to ignore the first part. We don't need it. And I just ignore 12 bytes, right? Then we store the next four as the ID. Then we can ignore the rest 24. And then we just store the origin timestamp, the 8-1. That's all you need to do. When you receive a request, this is how you pattern match. Now imagine if you're developing your own protocol. This means that in a single line, you can basically parse the whole request that came to your server. Amazing. To me, this was mind blowing. I don't see too many mind blows. You either know this or don't think it's that interesting. But okay. So, and now we need to compile our response. We'll come to actual Elixir code in a second, don't worry. So, as you can see, we start with the receive timestamp as is. And now I've cheated here and you will see that I replied with the same two timestamps because I don't care about precision for this exercise. At this point, it was to me, it started becoming very obvious that whatever I returned, the uptime is not changing. So the level of this being fun started degrading rapidly. The only thing that was keeping me alive and to finish this little project was maybe I can get a blog post out of this. I don't know. Maybe some clout. So I stopped kind of caring. But yes, you should basically pull data from like a real clock source, like an atomic clock, a GPS clock, which is also kind of a fun thing to do because you can get an AliExpress those very cheap GPS USB modules, which are very easy to talk to and you can pull actual time from there. And it gives you the clock stratum. It gives you the precision and you can just dump it into it. So it is actually viable to create such a server in a language that I don't think it's best used for. But there we go. So we need to set again the reference, the origin and the receive timestamps now. The ID you will see later is basically the ID you respond with. Good practice started becoming just use your public IP as an ID because many NTP servers don't change their IPs at all. So it is a good idea to use it. It's not in the protocol, however. So you can do whatever you want. You can put Andre school there. And now let's create the actual UDP server in Elixir. By the way, it needs to run the port 123. I thought this was a joke. No, it's not. Pretty cool. So this can be done a lot better as I've learned later. But for posterity's sake, I decided to just continue on. If you use active false in this first line, you don't need to do a whole bunch of slides after. When I rewrote the slides after a meetup in Zagreb and I used active false, the whole presentation was kind of lacking for content. So I just left it in and decided, yes, we will do slides. And that's it. So let's use it. It's extremely simple to use. This will receive only one packet. If you put it active false, active true, it will receive many packets, whatever. So you need to open up a gen UDP server. You have the socket. What you then do is you can just pattern match on what you receive. We are mostly interested in the first part, like a real developer. I just ignore the errors and the closed states, whatever. Nobody cares about those anyways. And with the gen UDP send, you can send whatever. You can send strings even. That's it. This was also pretty mind blowing as an XJS developer. Just those condensed free lines in total is the whole UDP server. What? Just like a little side note, when I started learning Elixir, I was then, after JavaScript, I was a Python developer and a friend got me into learning Elixir. And one of my first revelations was, why was I writing so much code? I don't get it. So yes, this is all that there is to it. Let's create just like a simple server out of it. The first approach is, yes, let's just imitate something. So we have the init. And we'll just have a loop here. I forgot to call the actual loop, so don't forget to call the actual loop. So what this will do, just in a recursive fashion, every time it receives a packet, it will call itself again. That's it. To astound it a few here, this might look like a gen server, maybe. Yes, well, like everything in my life, I reinvented hot water, how we say in Croatian. So yes, let's make it into an actual gen server. So it's pretty simple. You can just use continue loop and have a handle continue. That's it. And don't forget to also include this part here, so you can actually start the gen server. In your supervision tree, in your application, for newbies in Elixir on the blog post, there is also a huge part how to start a new Elixir project that's not a Phoenix with the supervisor tree already set up. So you can go look at that. But if you have it set up, all you need to do is just include this child here and it'll get started and start working. And all that is super cool and fun, but where is the actual protocol? Where is the actual meat of the presentation? Well, it's also in the spirit of Elixir extremely simple. So we have one function here called generate NTP response. And you remember that pattern matching that I was raving on about? Well, I ignore everything, basically. I started and I just used the origin timestamp so I can do it in the function head as simple as this. We take the system time, which is the most precise way of getting time in the world. So everybody knows that. We store that as a receive timestamp. Doing a little bit of, thank you. I'll speed up. So I'm doing a little bit of code here to make it a bit more easier to read. Remember this sigil. If I'll have time, I'll show you how to implement your own sigils. Never do that again as well. But yes, this is way easier to read. So you just set up the header. Trust me, this is how you do it. Don't read too much into it. The ID is the private IP of that VM. And now comes the fun part. You can concatenate bit strings like anything else in the lixir. So this is how you do it. This NTP constant is really important. I have no idea what it is. I read the documentation. This is how you convert time vice versa. Why they need it? I don't understand. I'm very bad at math. So if somebody else knows it, please come and talk to me. Enlighten me. And again, you just compile the end result in a bit string. You return that instead of that hello world before that's it. So you can just generate NTP response from the request. You get the packet. You reply with it. Ta-da. There we go. All of this code, if you want to play around. And if you found mistakes, then you can go there and complain and do pull requests for a project that doesn't matter. But I would be more than happy to see what you guys think and what you feel about this. So come and see it. I still have a bit of more time. So let's implement the custom sigil and show how, again, how I should read more documentation. So let's create the sigil with tilde b, or whatever it was, quickly line b. It's pretty easy to do as well. You just create a module. You name it sigil and underscore and whatever letter you want that there to be. Now you can even do uppercase letter sites, I think, from the newest version of Elixir. And you can do even more letters than one. So that's cool. And you can just import it as is, and then it's available in that module. Now let's implement the actual parsing. So this is a string. And now here comes the me over complicating stuff. So I uppercase everything. I split everything. Then I do some mapping to get rid of multiple lines and whatever. And then reject empty spaces and then join everything. And I decode it. Well, I blame, blame, blame. You can just replace with a reg accent. Just decode it. But this looks way more fun anyways. Like everybody likes complications. So yes. That's it. A lot of the things I covered here I learned from this amazing book right over here from my friend Sasha Yurich. He gave us the 35% off code for the new version of the book. Go buy it. Even if you have the old versions support him. I think he is the foundation of learning Elixir in today's world. I've had so much fun with this book. Didn't reference it at all while I was doing this. I should have, but never mind. And also here is the coupon code 20% off. Here is the coupon code for the conference where we will try to also bring everybody who does Elixir there. So if you want to do, if you want to come over and hang out, or if you have open source projects in Elixir, do come talk about it with me. I have a Gantt T-shirt. I have some stickers here. Call for Spickle Flyers. And that's it. Thank you very much. You're welcome. Thank you. You're welcome.
Evolve your (web)app while it is running
Thank you. So I want to evolve my application. I'm going to use Gleam and Erlang. The organization asked for slides, but this is a luster application. So I can't give you the slides. My name is Kiro van Gelder. I'm a freelancer from my own company. I've been using software for up to 30 years since I don't know how many platforms, languages, environments. And I happen to like to beam Erlang, Gleam, etc. So that's why I picked it. That's not only why I picked it. So recently I was at Langdev, model-driven stuff, all kinds of things that people showed us. One talk that spiked my interest was about, okay, well, we have a game. We have a description of a game. And while we're running the game, there's people interacting with the game. Well, there's things we don't quite like, so we're going to change the rules. And then we have to keep, the game has to keep running. So, yeah, well, okay, that looked awesome. So I thought I can build something like that, but I'm going to do it on Beam. Beam has these superpowers, all the reloading and things are, it seems really suited for it, so why not? And then the other thing I thought, well, if I want to do that with a game, and I have to build some kind of infrastructure and things, let's start with a simple game. So I picked out Holland's Ganserbord. You have a little goose spawn, and you have to reach the very middle spot. And if you look at it from the model and the rules, there's actually quite a few of interesting exceptions for all the kinds of places that you can run around on. And land on a goose, you have to move twice as far. You can skip it during, if you're in prison, well, you can only get released when someone else releases you. It's kind of special. So what will I talk about? Tiny introduction on model-driven development. I'm going to tell you why I do the modeling game. I'm going to explain a little bit about dynamic reloading as the Beam provides it for those who might not know it. I'm going to tell you why the game itself, the instance will be running in core Erlang, and that will be demos all along. Model-driven development. You want to have some model. It needs to have a very precise description, and at some point, me as an editor, sometimes in a computer editor, will make that description. And from that description, we generate an instance that is running what we described. Why would you want to do that? Instances sometimes are, well, they're compiled targets, and they know the things that they have to do, but they don't know things about themselves. A model is a description that does know it. A very nice example that was given on that Langdef was about Dutch income tax. This is described in laws. Computers do not interpret laws, but they went through the effort to take the law, adjust it a little bit so they did it together with lawyers to make sure you got something that a computer could interpret. So laws might have ambiguities or vagueness things in there. Computers don't. Well, then they had these more precise versions of the law, and they, from that model, they generated Java code. Dutch income tax is running on Java code at the moment. Additional things you can do once you have it in the model is you can start reasoning about it. So by now, I understand that even if they want to introduce a new law, they can, you can just plug it in the system and see is there any contradiction here. If so, please adjust this law proposal. Often you'll have a domain specific language like the adjusted law. For my conserboard, you could have stuff like this. Often people will want to edit these DSLs by hand, but I'm very much interested in small deltas. If I have a running game and I make a big modification in here, I don't have a game. So I need to make sure I do a small step, a small delta. So I have to, with what I do, I have to restrict the possible edits that can be done, which I call deltas. How does that look? Building a system called Eagle. It's running on a server. That is, I have a browser. That's my client. From the client, I do the editing of the model. From there, I generate my instance, my game, and then from the same or a different browser, I view the game and I play the game. Why would I want to do a modeling GLEAM? The model should be as precise as possible as I just explained. GLEAM gives us types. Type safety is more than non-type safety. So my name is GLEAM. Other benefit, a superpower of GLEAM, you can compile it both to Erlang and to JavaScript. So I can, if I have my model and I can somehow transmit it, I can just use the exact same code on both sides of my client server and I know that it's the same thing. It saves me work. What does it look like? I have my model here. So in my LusterUp, I now have an iframe. This iframe is the model client that I showed in the previous picture. It's an iframe, so it's a web page, but as you might guess, that web page again is a LusterUp. And I want to grow the Hanzer board from as minimal a thing as possible in small deltas. So let's see where we can start. My model is very simple. It's a bit too simple here, so I need a bit more of a model. I can make a list of an int. I could even make a list of a list of an int and it's represented here. And I not only want to describe that I have some type, it also needs to have some default value. In addition to that, we have a cell. My Hanzer board isn't quite finished yet. Sorry. I have a plain cell starting to finish and I might even use the goose. And right now what I also have then is a game type. It has a board which has a list of cells and it has pawns which are a list of numbers indexing where the pawns are. It's a choice. There are other choices. And the default of my Hanzer board then has one cell at the moment which is plain, which is a bit ridiculous. So let's add a few more cells and say we're going to start at some of these things. And of course the default that I gave to the int turned out to be a bit awkward because I wanted it to be zero. So I mess a bit with my model. In very small steps, I modify my description to get hopefully to a better place. Dynamic reload. Erlang and beam provides us the tools to do these timing upload. This is a loop. Usually there's a process running on the beam that is executing this loop. It had been started by another process. And generally it receives messages, handles them, sends back some results. That's what the from exclamation mark is. It sends back some result. Could be errors. Could be part of the new state. And then it loops. And another possible message could be well just stop and then you don't loop. All there is to it. What Erlang and beam also provide is a way to reload, sorry, to load new code into your, into the beam for a specific module. That's the code module in the Erlang kernel package. You call it binary here in the middle. Target is the target module. The thing in square brackets is the file name that it would be coming from. We don't care. And the object code is compiled core Erlang code. At that point when you do that, you have your old version of the code in the beam and you have a new version of the code in the beam. But the existing processes, my game, my Hansbord is still running the old stuff. So now I'm going to send a message upgrade to my Hansbord process. And roughly like that this is the same loop as before but now I have the relevant upgrade part. Instead of just looping, which would loop in my old code, I have to say format this module for my target module, my Hansbord module, explicitly specify the module with the loop and then it's guaranteed to use the new version of the code. And then I can happily play along my game with upgraded code. Now why do we do that in Erlang? The difference between the local loop and the exported loop is something that Gleam does not know. So I can't do it. Why did I pick core Erlang? Because we can do this all in memory. No need to use file systems and other things. There is an Erlang C-E-R-L library that can generate these things. I wrapped it for Gleam. That's called Gencore Erlang and you can find it on the hex. So let's start a game. I already made the type properly. So I'm ready to create a game. If I connect to the server it won't do anything here, but if I create an instance it's now and it will connect to the game. So as you can see, it picked up the start plain plain from the definitions that I had at the left. It also picked up a move button. The rules, there are some implicit rules that I did not edit things about. I did not tell anything about. I need to be able to do something in my game. So there's a hard code that moves. I will just move the pawn one forward. And there's also a check for the win condition. So where the pawn is, it has to check on the board whether that's a finished location. And that rule is going on continuously. I have not made nice deltas and things for that in the UI and when at some point I will. But these things are running in the background in the instance. Now at least I'll show you the move. If I move, my pawn moves one forward. So yay, getting closer to the finish in my GUNS board. All right. So a little bit more about the bits and pieces that are happening. There's the Asian communication from the browser to the server for the model. Whenever there is a delta that's made, we just recreate the entire instance module in Core Erlang, reloaded and upgraded. And the instance just keeps talking to it. It doesn't even notice it from the client and also talks in terms of chasing with it. State and rules. The initial state is something that should be adopted, adjusted and rules are... So yeah, that's right. The rule is something that is in the model. I talk about a conceptual thing there. My instance doesn't know the rule. It just has code. And my client doesn't know what the rule is either. It just knows whether it can do something or cannot do something. So what is important that I wanted to make as two kind of changes. Some changes that when I do that, the instance will in the end see it. So I'm going to change the board. And that when I change the board it's going to compile the change into the Core Erlang, reload it, do some small migration and then also pass that information to our client. So, here we have it again. If I turn this into a Goose for instance, it becomes a Goose. So that was one recompilation of Erlang in the back button. Now I make another plane. I can add another one if I want to. And at some point I'm going to have to reach the finish here. So let's make that. So yeah, that was three, four recompilations of things in memory and moving on. Involving. But another one thing that I might want to change, because I can also create multiple instances, is my starting state. And that would mean that the only thing that happens if I change that from my client, it changes my model, but nothing else, unless I don't start a new instance, nothing happens with it. And that looks like this. So I have a game one and a game two. Game two hasn't been started yet. And if I change this one from zero to one, now I start at position one. Then we notice that in game one nothing happened. But if I start a new instance, then this one will now start at position one instead of position zero. And just to show that even though it shows two, it didn't change to one, but didn't show it. If I move, it will go to three. Well, and where did I put the finish today? On start, zero, one, three, on number four. I'll just move to four. There. I finished my game. I won. So things I want to do in the future. I'm very much interested in what kind of deltas are usable, sensible. You saw me adding cells to a board. You saw me change the type of the board. Okay. What if I remove a cell from the board? Yeah. What if the pawn is on there? It quickly becomes like, you could think of a couple of solutions for when you remove the cell from the board. You move the pawn to the previous or the next cell, or you remove it, or you put it on cell zero. But why would, why, how do you pick one? Depends on your application. So I really want to look more into that. Another thing is that I don't think the Hanselboard UI that I had looks very nice. It would be much better if it looked like the second slide that I showed. But if I do that, then the client really knows about Hanselboard. But what if I want to make a slightly different game? So, okay. And what if my Hanselboard knows about most of the things I do, but I add that labyrinth thing? Now I want to render a nice labyrinth. Okay. Can I make something that knows its Hanselboard, but can also adjust to changes that I make in the model that expands on what was already there that they didn't know about in the start? And obviously, it needs to be multiplayer because playing in my own is, I want you all to play with me. All right. The code that you saw in the iframes is in the top link. While reasoning about this, I also wrote a little, a start of a Gleam library that generates Gleam code, which is the second link, the one that generates Core Erlang. It's the third link, my own web page, the fourth. And if you want to know what the Dutch income text is looking like, it's all in there. Thank you. We have time for a couple of questions. Anyone? Any? Okay. Thank you. That was really quite amazing technology there. I was wondering, do you have thoughts on like when you might decide to apply these sorts of techniques to a problem? When it would be a good question. Yeah. Okay. Question is when this kind of solution would apply to a problem. It's, I might not be the best person to answer this because I'm somewhat new to model driven. It helps when you, when that description is going to give you something, whether that is checking that something is coherent or correct. When, yeah, the model should give you something. When not to do it, well, if you just want to play Hansenboard, just make a Hansenboard server and a Hansenboard client because it's much faster. Much quicker to build. So it's, it's, it is an investment. It's quite an investment. It's not just, not 10% extra. It's a factor much, possibly 5, 10 extra to make sure that you can really do that kind of stuff. Other question? It was the same question. Because it's fun. Okay. Yeah, that was really, really cool. Really interesting. When you changed like the initial state, you showed like the running client didn't update, right? Like it didn't, it didn't update the client because all these updates are triggered through messages. Is it possible that you could replay message history from the beginning with a running client? So like if I updated the initial state of a running process, could I then replay all of the messages it's received to change propagates? There's, there's, there's two answers to that, I guess. Yeah. Is it possible to replay all the events that happened both to the model and the instance or just to the instance game? Just to the instance. Just to the instance. At the moment, no. Would you? Would be, would be interesting. One way in which at least the part of the answer would be yes is if you look at the model when I, when I say please change this in this way with this delta, the server will respond by just giving you back, yep, I applied this delta, now you do too. Okay, cool. Any other questions? Okay, thank you then. Thank you. Thank you.
Type-safe Queries with Gleam & GraphQL
Hi everybody, I know you've heard it now, I think about four times about how the type safety and gleam is perfect. But guess what, we're doing it again for a fifth time in case you haven't heard it enough. But before we do that, I'll just introduce myself. My name is Harry Besto, I'm currently a student studying in the UK and looking to move on to university next year. I spoke here last year about gleam again and sort of helped start it off last year and at the moment I currently work with Felicia's Ventures on their research team. But while we get started, I know GraphQL isn't gleam, it isn't Erlang and it isn't Elixir, it's none of this stuff maybe you're here to see, but it is the perfect match to go with gleam as they both care about type safety and they care about how everything works correctly. In case people aren't familiar with GraphQL, I thought I'd just do a quick introduction to that so everybody can be on the same page for the rest of this. Here's an example type in GraphQL, it's a presentation, I thought that might be a fitting for today and we'll just imagine for a second that it's super simple, you only have a title, a set of speakers, in case you were lucky enough to have a friend, I don't. The amount of people who attend, whether it's one of the keynote ones and any speaker notes that people decide to give afterwards. So we'll break this down one by one really quick. The title's a string and it's not nullable. You have to have this as a string. The next one is an array of speakers where each speaker has to be there as well, that's what the estimation mark means. Integer that has to be there as well, a boolean that doesn't have to be there, it could be null, true or false, and then finally notes that also doesn't have to be there, just as a rough sort of guide for how GraphQL works. There's so much more to GraphQL that isn't actually fitting for this dev room and would go into so much more detail, so I think we'll just stick with this for now and we'll go from there. Before you end to using the two together, I'll just do the normal sort of introduction that everybody else has done to Gleam. It has type safe structs, the power of Erlang and JavaScript, and also a lovely friendly community who a lot of actually shown up this time here with the rest of their talks to persuade you as well to use it. So let's get to combining the two together. Here's what looks like very complicated Gleam code if you've never seen it before for a GraphQL request as well as sort of the request itself and then an object inside of it. We'll take for example that it could be a mutation or a query, which isn't too important for this, but it's just there anyway, and that you're requesting a list of objects potentially. Each object has a name, a set of arguments, and a set of fields that you're requesting where the field could be something as simple as just taking the name out of the presentation or as requesting subfields of the speakers that we saw earlier. Here's an example query that we're going to now use for the rest of this. The rest of the presentation is built around this query, so we're trying to query this presentation itself. We want the notes that I've supposedly written as well as my name and my email. This is what that looks like in Gleam, which looks absolutely awful for you guys. So we'll remove some of the stuff that isn't particularly important and we get left with this, which is just a simple function that would in reality take in the query as a string, actually pass it, and then would return something similar to this where we're saying it's not a mutation, we're requesting the presentation and the set of fields that were there before, and then the argument is that the title has to be that. But how on earth are we even going to use this? Everybody knows that GraphQL is normally queried over HTTP as its sort of baseline. Sometimes there's web sockets involved with subscriptions, and sometimes people go and do something a bit interesting with it. But for this example, we're going to use Wisp, which was actually written by Louis, and it I believe one of the only sort of higher level frameworks that Gleam has at the moment for HTTP. So I'll do a quick intro to this just to keep everything so everybody knows what they're doing. Wisp has some really nice functions in it for configuring logging, for getting the secret key for cookies and other hashing algorithms, and it builds on top of Mist, which is built by another one of the Gleam community, and just uses core underlying fundamentals for that. At the bottom, we're just telling it to sleep forever. So let's go through line by line and see what each thing does. This is super simple. It configures sane defaults, so info logging and all that sort of stuff, so you don't have to go over and do all of the annoying Erlang stuff. Then we generate a random string for this example. This isn't great. I guess you will have heard about it with Phoenix, Ruby on Rails, Laval, any of the other frameworks. You need to actually set a secret in real life, but for now we'll just ignore this as it's not particularly important. Then finally, we set up our handler, which is going to use this handle request function, which we'll get onto in a minute, as well as the secret, and then saying it's going to run on port 8000. Finally, the process does sleep forever is something I don't think you'd be used to in Erlang, but it's quite common in Gleam now, which is that you don't want the process just to terminate itself at the end. You want it to stay alive with the HTTP process running in the background. Let's get into the router. There's four imports we need just to start off, which is we need the request and response from Wisp. We need a string builder to actually send some stuff back. We need GraphQL Web, which is just sort of a, is nothing too big. It's just some boilerplate that's included with Wisp, and then also Gleam's HTTP for post for some filtering. Handle request is as simple as this. We're given a request. We have to return a response. We can do whatever we want to it, and as long as that we get the response out of the end, this, yeah, Wisp will handle returning it. Sorry. So in this case, we're using Gleam's powerful pattern matching to match on the tuple of the method and the path segments. That way, if you wanted to, you could have a get request to get info or health or sort of the UI as well be served. But in this case, we're just going to say everything else isn't found except for a post request to slash GraphQL. And then at the top, we have a use statement with web.middleware, which is something fairly new in Gleam but not super new. It's an abstraction for putting, for calling a function with a parameter passed into it. Before this, Gleam code was sort of nested functions on nested functions, which maybe wasn't the nicest to look at. This is sort of simplified that for quite a few things and is now used across everything from the standard library to libraries themselves to people's code. We have this GraphQL request function, which we actually need to do something with, but for now, we're just returning the string GraphQL response and sending that with 200 status code. So now if you were to send a post request to HTTP localhost 8000 slash GraphQL, you'd just get GraphQL response sent back to you. But you're thinking, that isn't what you came here to hear about, and it's absolutely useless. I agree. So let's go on to actually handling a GraphQL request and sending back some actually useful data. When we get the request in from GraphQL, it has the JSON structure of the query, the operation name, and then variables. The variables and operation name aren't important for this, but for a fully featured implementation they would be. Let's say you had a query that had loads of different requests inside of it, operations. You can then specify afterwards which one you want to use, and the variables that you pass stuff into those operations after the fact. Here we're using Gleam's decoder to decode into three, well, to decode three things into a custom type. The three doesn't mean anything like a tuple or an array like you might be used to. All it means is that the constructor of GraphQL full request has three arguments. So we're then saying the field query should be decoded as a string and put into the first field in the constructor. This is entirely type safe. I don't have anything on my slides about it, but let's say I jumbled up the order of these. It wouldn't match what the constructor should be, and as such you'd get a compile time error rather than ending up at the end of this talk going, why is my query looking like an object of variables? So I guess that's one point where Gleam's type safety comes back into being useful. The other part is that we can do something like this using the use statement again to say that the body has to be JSON. It's a case of we could handle all of this ourselves, but Wisp has a function for it. It requires that the content type's application JSON takes out the body and decodes it into a dynamic which goes back into Gleam's dynamic system, which allows you to break out of that box of type safety when you can't trust what you're getting in or you want to send something out in a way that's maybe less structured. When we want to then decode it, we have to, if we want to work with it, we can't just work with it in that dynamic form. It's not how Gleam wants to work, and it's going to make you could, but it would make your life so much harder than it needs to be. In real life, you should also be using result.try, handling this nicely and bubbling errors up, maybe with how Hailey spoke about where you could have your phantom types so you don't have to bubble it all the way back up. But for this short demonstration, I'm just unwrapping it, which will panic if somebody sends the wrong data. I'm unwrapping it with what is a bog standard QL request, which is underscore underscore type name, which just would in this case return query to you. It gives you nothing useful, but it means that the program isn't going to crash. Now that we have our body and we have the query inside of it, which we looked at earlier, we need to do something with it. So let's send it to that passing function that we saw at the very start and turn it from just being a generic string into something that's actually tangible and could be used by Gleam. But once we have that, how are we even going to resolve it? You have your query, you now have it as a Gleam type, but what does that even mean? You need to somehow get all of that speaker information back in a way that keeps your type safety, but also can be with the flexibility that GraphQL provides to people. So let's think about resolvers for a minute. We could go down this approach of having a type of the resolver where it has its key, for example, speaker.notes or just speaker, where you then have granularity over how far down you want to resolve each time. You want to resolve the whole object with one function or do you want an individual function for resolving each field? And then you have that function there, which gets the request as well as the variables and just returns a dynamic value, which can then be sent back to the client as Jason in this case. Being dynamic there, it will be much nicer to use a generic, but in the process of writing this talk, I couldn't figure out a nice way to allow you to have loads of different generics at once, which I think is something that is going to be worked on in the future maybe so that you can have a collection of generics that also are maybe like an interface type thing. Here's an example of how this could look, forgetting the presentation's notes, but there is one key problem that comes with this. Let's say you then have a list of these resolvers. How on earth do I find presentation.notes in a time that's actually suitable? The bigger notation for this is going to be o of n at best, or o of n at worst maybe, because you're going to have to go through all of them checking each one. So maps or lists, a hash map is going to be a thousand times better, maybe not actually a thousand times, don't quote me on that, but it is going to be significantly better than using this sort of list and sorting method. So let's switch it up and say that the resolver now isn't this custom type with a constructor and everything, it's now just a function that takes in a request, has those arguments passed in and still returns the dynamic. And now we can use this in a much simpler way, where we have a simple function that has resolve where it takes in a prefix, which will make a lot more sense in a minute. The HTTP request, the object itself we're trying to resolve, so we took the string, we pass it into this object and now we're going to resolve it. And then a dictionary of all the resolvers that were created when you wrote it. So you'd have your dictionary where you put in each of your resolvers, so you could have presentation.speakers, which resolves that whole array, as well as presentation.notes and any of the other fields that were there. Or you could even just have one resolver that resolves the entire presentation. Example of this is here, where you have your dictionary. We're just creating the dictionary from a list for simplicity, but this is the same as doing dictionary new and then inserting an element with that key and that value. Now that we have it, we can take our prefix, prefix it to the object.name and then try and see if we have that as a resolver. Of course you're thinking, how on earth is this going to work? Glean doesn't have an if statement, but we have something just as good, which is a case statement. We're going to say, if that resolver exists, try and just resolve using that. If it doesn't exist, we're going to check that there's fields. If there aren't fields, then we need to somehow handle this much nicer at the moment. It just returns that there was no resolver set. Then in reality, it should be erroring out properly and returning an error in the GraphQL standard way. And if there are fields, we then map each one, attempt to resolve it, and then we have this function at the bottom called combine results, which I'll come back to in a second. But you can see that the prefix actually ends up being the prefix that was passed in before and then the object's name then a dot. So as you go further down and into it, you get the dots. You still get the objects that were before added in. And then you can have your granularity that I've been speaking about of resolvers. The combine results just simply has the list of them piped in. It then folds the list into a dictionary and then it does dynamic.from to just turn that into that dynamic value to be returned at the end. So finally, I know it's been a lot, but let's finally put it all together into something actually useful. First, we need to resolve multiple things. When we got our request in, we could be requesting the presentation and a speaker separately and maybe an event separately. So we need to take each of those objects, go through the resolution process for them and get the values. So to do that, we call that function. Then we do, we say the responses after we've resolved them all, passing in the HTTP request, the past query, and your set of resolvers. And then this is what you finally end up with, the basic GraphQL request function in the start that just returned a string. Now it requires the JSON, unwraps the body, passes it, resolves everything, and then finally sends it as a HTML response, which probably should have been changed to JSON response. Small details there, and sends that back to the client. So yay, you might think we're finally finished. But in reality, we're actually missing so much from GraphQL that I wouldn't have even had time to discuss or make for this talk, such as proper error handling. When you return data from GraphQL, you should return data and errors separately. We don't even have the concept of an error here. Mutations at the moment, a mutation in the query, I just treated as the exact same thing. We should really have them as separate things that are resolved differently and handled differently. Subscriptions as well, that's using web sockets, and that's a whole nother layer of GraphQL that people, there's fairly, I guess, divisive opinions on how it should be implemented. So I didn't even touch that. But you might be thinking, overall, that's an absolute ton of work if you want to implement it yourself. So the goal with all of this is, let's make a package out of it. We'll have a green GraphQL package that you can just plug in, pass your resolvers, and it will pass and manage all of the query inside for you, similar to what Elixir has for Phoenix, as well as what Laval and the other major frameworks have, which makes it easier, which makes it super easy to use GraphQL and sometimes even easier than REST itself as it handles most of that abstraction for you. So thank you so much for listening as I ran through that really quickly. And if anybody has any questions, I think I have just enough time for a couple. Any questions? Thank you very much. So I know that GraphQL also has a schema, and I'm guessing from that API that you would write your schema alongside that. Do you have any thoughts on maybe generating some of that blue code from the schema, or maybe just the schema from the blue code? Yeah, when I first started, I thought, oh, the question was before I go, GraphQL has schemas, so is there any way for us to generate that code, either the code that I showed today or even the schema itself from Glean? And the answer to that is I looked at it when I started, and you could generate the schema from Glean and as such interpret some of this code, but long term I think the best bet would be sort of code generation similar to how it's done in JavaScript or TypeScript, where you either write your schema and then it generates some of the other Glean for you as well as a .graphql file, or you pass in a .graphql file to sort of a CLI, and it then spits out all of the boilerplate Glean you need. But yeah, schema validation is something else that sort of, I didn't really have time to do for this. Any other questions? Okay, thank you then.
MicroBlocks
I think we've got a lot to do, so we might as well get started. I'm John Maloney, and I'm going to be talking about micro clocks live coding the real world. You'll get to see a lot of real world stuff, including maybe real world bugs. I'm being joined today by Kathy Giori, who will be helping with the presentation. I'm the chair of the Runeisu, and there in the back of the room we have Peter, who's organized this day of the room. And where's Bernat? Bernat is another member of the MicroBlocks team. The next room is Yensu, who was sort of one of the founding members of MicroBlocks at the very beginning. So the goal of MicroBlocks is to make it the most intuitive and engaging tool for physical computing imaginable, and create a global community of learners and educators around it, which here we have a global community. A sort of a sub-goal is to inspire a wide range of learners, especially those who do not initially see themselves as potential technologists. So this is really important. We want to invite people in to STEM, not just the people who are excited about it, but get new people in. This is the kind of testimonial we like. I won't read the whole thing, but this is the story of a teacher who started teaching some girls STEM, and they initially were very unexcited about it, but then they started doing stuff with the micro-buildings of art and music and stuff. They got more excited, so they signed up for a second trimester. And the third trimester, not only were they so excited, they brought their friends in. And this is, I just love hearing stories like this, and it's not unusual to hear about this. So what makes MicroBlocks special, and how does it differ from other languages for physical computing? What makes it especially easy for beginners? So the first kind of obvious thing that you see with MicroBlocks is that it's a blocks language. So just as a comparison, if you're getting started with Arduino, you have to type this, and you just want to make a little blip program. It's 140 characters. There's a lot of unusual characters like curly braces and semicolons and stuff. And if you get almost any of those wrong, it won't compile, it won't do anything at all except give you error messages. In contrast with MicroBlocks, there's the same program, 6 blocks. You can't get anything wrong, basically, except maybe to have both of those set LEDs the same so that nothing happens. The LED might turn on, but not blink. So this is really easy to get started. And the neat thing about it is that once you get started, you can start playing with stuff. So you've got a working program, you start saying, well, I want to make it live faster or slower. I can change those milliseconds, et cetera, et cetera. So I think that's one of the biggest differences with MicroBlocks, but that's the same with MakeCode. Let's talk about another difference. This one is hard to explain without some pictures, so I made some pictures. So supposing you've got a microcontroller like this microbit, and you've got a laptop and you want to write a program for the microbit. Well, obviously, you can't type on microbits. You're going to use your laptop to write the program. And so the sort of standard way to do it in the old days was you write a program on the laptop, you compile it, download it onto the microbit, and now it runs on the microbit. A little gear show where the program is actually running. So that's sort of the Arduino type world. But the problem with that is that it's not very interactive. It's not live. You can't just change something and quickly see what happens. You have to go through this kind of compiled download cycle every time. You lose track of what you're thinking. So various people came up with this idea of doing a tethered system. So I think there was Scratch for our Arduino, and then Berlatt did Snap for our Arduino. And then there's also a microbit plugin for the regular Scratch. So the idea here is that you have a program running in whatever language it is, Snap or Scratch. And that is driving the microcontroller, which is acting as a sort of peripheral. So this is great because Scratch and Snap are super live, so you can make changes and they happen right away. But the problem is you constantly have to keep your laptop next to tethered to the microcontroller because the program, as shown by the ears, is actually running on the laptop. So the microcontroller is not autonomous. So with MicroLux, we try to combine the best of both of those worlds. So while you're programming it, it's tethered, and you can make changes and you see them right away. So it's live. And tomorrow, Berlatt is actually going to do a live coding music presentation where he just writes all the coded blocks and the music just keeps happening as he's programming. Quite amazing. But in the case of MicroLux, instead of the program running on the laptop, it actually does run on the microcontroller. And incrementally, as you work, it's downloading your scripts onto the microcontroller, bringing them into Flash memory. So that means that once you're done programming, you just unplug untether and plug in a battery and you can take this with you. You could build it into a hat or a Halloween costume or I know somebody who mounted one on a skateboard and used it as a skateboard speedometer, like he was actually measuring acceleration with it. So anyway, that's probably the biggest difference between MicroLux and other languages, is this combination of liveness and autonomy. But I think really the best way to understand MicroLux is to actually see it in action. So I'm going to do a lot of demos so you can kind of get a feel for it. So here's the MicroLux IDE. We're going to be reloaded. We have standalone versions, but the easiest thing is to just start it up in the Chrome browser. It has to be the Chrome browser because Chrome is the only one that supports Web Serial and Web Bluetooth, which are the ways that we connect to the board. So I'm going to pull out this little handy camera view so you can see what I'm doing. So I'm going to plug the micro bit into this board. And the first thing I'm going to do is connect to the board. So I have to sort of select the board. And now when it's connected, as soon as it connects, it realizes, oh, this is a micro bit. So it loads some libraries like this display library and stuff. And you can also tell it's connected because of that green circle behind the USB icon. So anybody who's used Scratch, this will just seem second nature, like why wouldn't it work this way? But you can just drag out a block, click on it, and see what it does. So it made a smiley face. And you don't even have to drag the block out. You can just click it right in the palette so this clear takes the smiley face away. So plot, OK, that sort of plots. Three, three must be the middle. Unplot, unplot stat. I can display a character. I can scroll some text. I can stop scrolling the text. So this is important. The idea is that just by clicking on blocks, you can discover what they do. You don't have to read a manual. You don't have to have somebody tell you or do sort of a whole bunch of tutorials. You basically can sort of see what's there, try it out, and learn how it works. Well, let's actually start writing a little program. So I'm going to take this display block, and the first program I'm going to write is just, I want it to display the smiley face when I press the A button. Microbits, for those of you who don't use microbits, have two buttons on them, A and B. So the A button on the microbit has nothing to do with the A key on the keyboard. So anyway, there, I'll click that, and we get a smiley face. And I could make a second program, and I'll say the B button is pressed. Just maybe make the display clear. So we've got clear, smiley clear. And so we've already got a little program in. For beginner, this is actually pretty magic. I mean, they've got something that they can, in fact, I can just unplug this and take this battery and plug it in. And ignore a few little flashes of them. As soon as I turn this on, oh, it's on. It's running. So there's the smiley, there's the non-smiley. Well, let's say I wrote this program and I forgot I didn't save it. So I'll clear this slate, I'll save the new program here. That's no problem with MicroLocks because you can actually read the program back from the board. Another pretty unique feature. So what I say is I say open from board, it says, OK, plug in the board. And what it's going to do is it's going to read the code from the board. I still have to say connect. Oh, right. The actual action is going to happen in the ID. So it's going to read the code from the board and decompile it and reconstruct the script. So it's not actually, it didn't actually save the source code on the board. It saved the compiled form, but we have a decompiler so we can get back to the original, almost exactly the original. So one thing you'll notice is the position of these, they were sort of reversed, like the A was on the left and the B was on the right before or maybe below. So it doesn't remember positions, it doesn't record comments because those are not part of the compiled code. But it gets all the logic back for you, which is what you really want if you've forgotten to save your program and you only have it on the board. All right, well, so far we have been using just the stuff a bit. Oh, I forgot a very important thing. The micro bit also has a bunch of sensors. So for example, it has an accelerometer that has three axes. So there's this tilt, we sort of show it as tilt. So the x-axis is kind of tilting left and right like this. And so you can see that it's sort of negative in this direction, positive in this direction. It's a little tedious to keep typing that though. So what I'm going to do is I'm going to use this save block to save it and then I'm going to put that in a forever loop. So it's going to save it over and over and it's actually going really, really fast. So I'll slow it down a little bit by putting this block in, maybe change that to a 100. Okay, so now we've got this interactive thing and we can see more quickly, negative, positive. And you get a feel for what that sensor is doing. But I just shook it and I saw some 200s but they went by really fast. So let me get a sort of a time recording of this. So I'm going to take the same tilt block. I'll just take these out. And now I'm graphing it. So I'm going to open this graph here. We'll see that if I tilt it this way it goes down to sort of minus 100-ish. Tilted this way it goes up to positive 100 and in the middle it's about zero. I shake it. I shake it really hard. It goes even more than 100. So this is a really cool way to get an intuitive feel for different kind of inputs, sensor inputs. Okay, so so far I've been showing you stuff that was built into the micro bit. And I have to say I think the BBC folks that originally designed the micro bit were brilliant because you can do a whole lot with everything that's built into the micro bit and never have to connect anything external. But if you do want to connect stuff that's external there's ways to do it. I have here, this is called a ring bit. And it's a little extension board that costs about $7 US, probably a little less than euros. It has a battery pack built in so you don't need this battery pack. And more importantly it has several sets of pins that you can plug things into. So let me plug this guy in. And I'm going to use with it, this is what's called a NeoPixel strip. So it's a strip of 10 RGB LEDs. And I can just plug it into, in zero here. Oh, wow. I forgot I already had a program on this board. Actually let's see what it is. Yeah, exactly, let's open from board. I didn't plan that actually. That would have made it cool if I had planned it but I didn't actually plan it. Okay, so we have here, we have one script here which is, I'm using this NeoPixel library. So when I run this it initializes all of these to a different, well it's actually got two things going on. Let me take it apart a little bit. So one thing is that this little for loop that says for 1 to 10, or I in 10, but sort of implied that it starts at 1, you can change the range if you want. It's going to set each NeoPixel i to a random color. So here we've introduced a for loop in a way that's very visual. I'll hold this up higher. And then all I did was I put this in a forever loop and there's another little block here that says, oh it should actually say pin zero. So it's now running this forever so it's kind of randomly showing a bunch of new pixels. While that's happening, and let's see if we can, yeah you can sort of see from the glow on the table that this is still running. While that's happening I can also run this script here which is going to do a little animation of the face on the display. So you can see it's actually doing these two totally different things at the same time. So it's doing this which is just alternating between two different faces. That's one idea. And then the other idea is this thing about animating the NeoPixels. And each of those is separate and it has a separate script. And that's another thing that's really nice about Viproblox is that it has concurrency. So you can have up to 10 things happening at once. It's got a limit to how many stacks you can have because of RAM limits. But you can do 10 things at once which is more than enough usually. Okay so the next thing I guess I wanted to show was that Viproblox runs on a lot of different boards. I'm not quite to that set of boards yet. But I'll show you a couple of interesting things. This is actually a micro bit plugged into a board called the Pipobris board. And Yashir over there is the owner of the company that created this board, maybe the mastermind behind it. But this has a whole bunch of peripherals that are sort of hardwired into the circuit board. But they can be broken apart so you can make something like, you can come up and look at it later, this little robot that Peter made with all the different components. So you can learn how the components work without having to plug in any wires. And then you can disassemble it, break the board apart and build the components into something if you want. So this is a new thing, this is on a Kickstarter right now, it's not yet available but will be soon. This is another freaking board, this is the Calliope Mini V3. So there's been a Calliope Mini V1 and V2. This is the V3 which has got more things like a microphone and a faster processor. You can also do Bluetooth and motor controller and so forth. It's also got a couple of Jack-Doc boards which I don't know how to use yet. But MicroBlocks already supports this board. It's still pretty new because it was only announced in November. But I guess they're already out on the market. Here's another one, this one I'll show you with the camera. This device is made to run, is a scientific instrument, it's really sold for doing science. It doesn't have much in the way of output devices except for three multicolored LEDs. But it's packed with sensors. So the idea, it's small enough and it has built-in batteries so the idea is that you could use it to instrument something like a weather balloon. Or a garden or something like that. So I have a little program on it that one of the sensors is a gesture sensor. I wave my finger over it, it just changes color. So left and right is green and blue and up and down is sort of a purple and red. So this is kind of a cool board. And the nice thing is the guy who created this thing has been building a whole set of curriculum around it. Both for computer science but also for science, physics and yeah. What is the name of the board? Oh it's called the Data Bot and the guy who created this is Robert Grover. So it's got an ASP32 inside. Yeah. Another board that's sort of in the beam created as we speak is, so this is a prototype of something called the MakerPort by a guy named Roger Wagner. He's got his name right on it. The sort of interesting thing about this is it doesn't have any, you can plug a display in, but it has a built-in MP3 player. That's what that little speaker drill will do. And what I really like about it is that it has this extension port for touch sensors. So plug that in and I'll try to get all of them in the camera for you. So we've got this ribbon cable coming out of here that goes to these 12 sort of pins of, yeah, 12 wires which I've stuck little pieces of bare wire in and then connected to these foil strips. And I can, I leave, I just powered this up. I have a program in here already that sends MIDI to the computer. Oh you've got it in there. Yeah. So let me just start a program called SimpleSynth. So SimpleSynth is a program that is, it's a free program that can receive MIDI from. So over this serial cables, not only is the MakerPort being powered by the serial cable, but it's sending MIDI commands over the serial cable. So if I turn off my volume here, we should be able to... So we've got a little piano and it's actually polyphonic because it's sending MIDI commands and the MIDI sends polyphonic. So you can do... Oops. That's not how I'm going to make the keys part of the polyphonic. Okay, so... Did you want to just show more boards? Kathy, Kathy is our hardware geek. She's always trying to get me to make a port micro box to more different boards. My problem. By the image in the middle? Yeah, yeah. Yeah. So this is a sort of a micro bit light board, but it actually has a Raspberry Pi Pico chip in it. This is the Raspberry Pi Pico... Wireless. Wireless. Without any pins. This one has pins. That's a micro bit, too. These two round things are circuit playground express two different versions with different chips. This is called a clue. It's also from Adafruit and it has a TFT display. You'll see that one running over there in a second. And here's even more boards. This one, unfortunately, has a lovely big screen, but you can't buy it anymore because the company went out of business. But it's actually a touchscreen. And this one is something from a company called M5 Stack and they made a wristwatchable kind of thing on a wristband. It's this OTT Go. This has a little display on it. This has a sort of matrix 5x5 matrix, but they're actually RDB LEDs instead of just single color. This one is a Chinese company. It has a few little power problems, so we're not so keen on it anymore. This has a built-in battery and you'll see one of those running over there, but a nice display of buttons. This one has 5x5 LEDs and the whole top is a button. So lots and lots of options. Microblocks actually supports, well, built into Microblocks itself, you can install something like 16 different, firmware for 16 different boards. So those are the ones we expect educators to be using, but then there's something like 5 dozen more boards that you can build from, you know, if you download from our Git repository and just type a command. You can build and install on some of these other boards like the Mbits and the T2Go board and so forth. So there's a lot of options out there. It's kind of a will-earing for an educator, so we kind of try to limit it like, you know, the micro-bit is a pretty good starting point if you've never done any microcontroller stuff. I wanted to show you also this. This is something called a Q-POD, the camera. It's a little robot that has a micro-bit on the front of it, and it also has some batteries to little motors and it's got a distance sensor. So I'm going to turn this guy on and, oh, it's got a program on it already. So the, let me turn it back on. It's a little hard to, I'll take this off for a moment. When I first turned the micro-bit on, it makes this three-letter code which was O-B-Q. That code is actually a code that will help you identify the board when you try to connect to it with Bluetooth. So that's what I'm going to try to do right now. So I will, actually, maybe I will open from board from this one too. No, that's going to be, it's going to be slow. I'll just connect via Bluetooth. And, oh my gosh, there's a lot of micro-bits in here. Is it that one? Yeah. And, let me just see it. Ah, right. We need to install a library here called the Q-POD library. And now I can, oops, there are loads that stop the wheels. Now I can put it down. So what I wanted to do here was I wanted to play a little bit with the distance sensor. So it's got this Q-POD distance thing, which is, as with all of our sort of reporter blocks, you can just click on it to see what it says. So as I move my hand closer to it, we should get lower numbers. And what I'd like to do is, actually, I think what I'd like to do at this point is load a program that I made already. Yeah, to save time. So the idea is to have a sort of a robot that senses when it's getting close to something and stops. So here's the beginning of that program. So it's got these headlights. We'll set them to green. And then we're kind of continuously going to get the distance and do things with it. We'll graph it, so I guess I could open the graph here. And then we'll set the headlights to red if the distance is below some threshold. So you know what? It doesn't bounce, ultrasound doesn't bounce off hands very well. So I'm using my phone. We don't see the screen. That's getting full. I just know not to move that. Oh, I see. We're going to open it. There's that, too. Well, I didn't want to completely cover the script. So you'll see that when it gets below a certain threshold, and you can see in the graph, you can see that changing. In fact, let's increase the scale of the graph a little bit. You can see that when it falls below about 25, the headlights turn red and it beeps. So now I'm going to sort of extend that by putting in some actual control. So I'm going to say when the headlights turn red, I want it to stop the wheels if they were going. And now the magic bit is I'm going to drop this block in that says if you don't see, if you're not close, then turn the wheels on. So we've got a self-derving car. Oh, and it also, it does the thing that a lot of real cars do that as they get closer to an obstacle, it speeds up. So it's pretty cool that you can write that program yourself, and it isn't very hard to understand. It's basically just the time for, the length of time it's playing the note is proportional to the distance. So it's a longer time between notes. If it's a bigger distance, and the notes, the time between notes gets shorter, the shorter distance. At this point, I think I want to show the grand set of things. So if we can hand that, the end of that way to him. Yeah, no, it's not recognizing the webcam. Did I plug it in the right thing? Yeah, I have that. I feel like the GCC is not. Oh, let's try that. Great. So, I guess the basic idea is that there's a lot of different things that are running microblocks. What Kathy's holding the photo of right now is a glove that has a color sensor. And then if you hold it over the right color, yeah. The LEDs are supposed to change that color. Actually, they are. Red. Black. I need some more color. Blue. Getting the blue. There we go. Detects color. Okay, and then... Oh, one of the points I wanted to make, can you show that the clue and the... What we try to do is we try to, when there are similar things, features between boards, like if they have a display, it might not be a 5x5 display, it might be a TFT color display. We try to sort of make it so that the program that ran on a micro bit could run on that. So we'll simulate the micro bit display on these boards. This is a CIDLAV-81 and an 8-of-3 clue. This is a Kaliope Mini V3. This is an atom, M5 stacked atom. This is a electrix Pico... Pico. Yup. And this is... This is, yeah, this is, yeah, Sierra's new Pico Bricks board with a micro bit controlling it. And we're using the micro bit display. The old Pico Bricks with the raspberry pi. Yup. So let me have my phone back. So now I want to also show you, I just remembered one more thing I wanted to show you. You may wonder about this backpack that Kathy will hold on to. So this backpack can be remotely controlled. So it's running a program that responds to Bluetooth. And the controlling program is a program called Octa Studio. Has anybody heard of Octa Studio? A couple of people. Octa Studio was created by the Lifeline Kinegar and Media Lab. Same group that created Scratch. And you might think of it as like, here's Scratch for tiny screens. So let's see if I can get my Ikevo camera back. Oh. See if you can see this. Click. Please. Oh. Alright, well, it's not, yeah. What's your... Well, you can hold this in front of your... So you won't be able to see the details exactly, but... Oh, yeah, there's a thought. So this is... This is sort of what Octa Studio looks like. You can see that it's got sort of blocks like in Scratch. And they can be... I can do this, drag around and drop. There's a palette at the bottom of the screen. So I wrote a program here, and I'm going to make it go full screen. And... And run it. Hard to do it in a mirror. So this is a program written in Octa Studio, which is a very, very simple program. They call it beaming. It sends... Oh, a Bluetooth. It sends one of these five shapes. And that backpack is supposed to be receiving it. So we'll see if that really works. Let's try Circle. And... Mirror image. Star. Triangle. And Heart. So if you go to the Octa Studio website, it's just octa studio.word. And you want to download this. We'll just leave this backpack up here while Kathy is talking. And you can actually send it messages yourself and have it changed. So Kathy, do you want your laptop? Yeah. But I could just start with some of the demo stuff. So for example, I have Octa Studio set up on the library for this little robot. So Octa Studio is something... Is it a controller? Yeah, anyone can control it. So that's kind of the tricky part is that Octa Studio right now, we don't have a way to test it certainly. Yeah, there's no channels. So this is my controller basically for this guy. But if I say stop, then it will stop. And then if I set it down, I can say go forward, go back, go left, go right, and stop. And you can hear I have little beads as well that go with the program. So each one has like send a beam shape and make a noise. So it's really trivial to do something like that. Now, so that's Octa Studio control. Then I have these little robots that I just used yet another micro bit. These ones I call my dancing robots. And what I do with this is I use button A to go forward. I use button B to go back. And then I use the tilt sensor with both buttons to go left or to go right. And so probably... So... Alright, so this is a really fun way to use just robots. And then I will... We'll have a little bit of time to play around with those. And then right before I came to Fostum, like the night before my flight, I bought this new robot from Electric Freecs. It's an X-Go Q-Frot. And yesterday we downloaded a library contributed by one of our Chinese community members. The library is all in Chinese. So we use our little playcoop to translate the commands. And then we didn't know the serial pins, or the communication with serial trying to fit with the two pins. So we got those pins this morning and then like right as we're setting up here, I'm able for the first time to command the robot. And so I have a couple buttons where... This is where I should... Where I press button A and this is going to make it P. And then I programmed it so that button B will make it sit. And then it has like dozens, you know, there's like 30, 40 pre-set moves that it will do. Or you can just actually actuate all the servo's and controls and make your own things, you know. So it has dance moves, it has all these moves. I literally just got it working, so that's why we don't have much going. But I will use the... The micro bit has a microphone, so the other part I don't have working yet, but I can use voice commands with like high syllables just using our clapper library. You can take the intensity of the syllables and then you use commands with different numbers of syllables and you could actually voice command to it. So I could say sit, sit. And then I say go forward like three syllables. And I just detect those three beats and you can actually command it with your voice. Or you could command it with Octa Studio or you could command it with another micro bit. I mean there's just numerous ways that you can do these things. And now with the cool thing with Connect over VLE is you could connect up to this micro bit and you could change things on the fly. You could like open from board, see my program, you could share programs with each other by just connecting and opening the board. I mean it's fabulous what we can do now with micro blocks and wireless communications in the state of 15. So when we're going to have a workshop directly after this presentation, but I'm going to do a very quick spin. We had created a presentation for the SNAP conference last year and Snowpixels, the only thing that I started with, they died Bob Martin of microchip. And there are seven LEDs that we have college students on the top row actually soldered these things together. I didn't bring Snowpixels, I should, but every year they have Christmas in the park and it's like fifth graders, you know that age. So they get to build their own patterns, create patterns and have them on Christmas trees. And we've done this except COVID had a little bit of break. And then I made a little box for my grandkids so they could push buttons inside the box. You can see I just have a couple of natural and zero buttons and LEDs. And then I had a smart Christmas tree and pull it together with the Raspberry Pi. This is running a web things project that I worked on as a Godzilla. And I just thought you might want to see this is this year, little auto, and it's just exploring, you know, pushing the button. So he's kind of like not knowing exactly what's going on. But Emmy had done it last year, so she already kind of knew the drill. So she was already going through the boxes, the buttons like one at a time and exploring each one that works. But then because I could talk to the Raspberry Pi with my smartphone and when she was looking down and changing the buttons, then I was also changing the patterns. She's like, what? You know, it's really fun to get them. Anyway, it was very, very fun to do. And you know, there's bromance in the back here. I taught this like group of real engineers how to program these types of things. And then I'm part of this TechLimit organization, TechLimit program, which is a volunteer by the mentors funded by the US Department of State. We get women coming from 21 different countries in Africa, Central Asia and the Middle East. And they come and I load them up with this stuff because we really want micro blocks to be global. So here they are like in my house and I'm training this and next month I'm going to Nigeria to train this. I've been on two delegation trips, Uzbekistan. And then last year I went again to Uzbekistan, Kyrgyzstan, Kazakhstan, teaching at the American Corners. And then I left, I leave all the hardware behind. Yasser was fabulous to donate like these Pico Bricks kits. It's like this board only has the Raspberry Pi Pico. So there are 40 of these kits in Central Asia now that I was fabulous enough to donate to leave behind and they started using them. Then I had a gal from Morocco and she's already written her own little book to train the kids. And then she got a grant, she got the money now to teach the kids in the mountains where the earthquakes were. They don't have laptops up there. So she's handing her first Lego League, creating this project that teaches basic coding that all the kids up in the mountains will have is they'll have the micro bit and a robot. And they'll be able to use coding cards and button presses and stuff like that to learn coding. So even without the micro box part. And then I've been part of this summer camp, the Society of Women Engineers in the Bay Area. It has a summer camp every year that I've been doing since 2017 and we, again, teaching them. And most of these girls underprivileged starting their incoming freshmen and they haven't learned to code yet. And so micro box is a fabulous way of basically getting them into coding, getting them confident about coding before they go into high school. Because you don't want to be afraid of it, you want to have. And the experience of micro box, they did this class library and the commands, it's absolutely fabulous. And then I've roped in some of the tech women and some other friends. I did some other summer camps. And then we have a big following in China. It's fabulous how micro boxes pick up in China. So we here at Wu is one of the guys, they translate the entire website and host it in China. So if you go to microboxfund.cn, you'll find the entire website in Chinese with all the resources. They do competitions in the summer, this dot pack. I mean, they're full of hardware over there. So they've done numerous different things in competitions. Then Turkit, I'm going to let him speak for himself because he does this stuff in Turkey. And then John has done a whole bunch of stuff. I'm not going to skip through the videos. Ben Nott here in Sidi Lev has done fabulous work. So, and then Pierre, who's taking him back to the room and to the dev room and stuff. He's done fabulous work with Kododojo in the Netherlands. And he's the one who got us inspired to come to Boston for the first time last year. And we hope to have a set up booth over there in the booth area to have one of his donuts too. So anyway, I'm going to hand it off to Turkit. One slide back, can't be everybody wants to use. One slide back, that one? No, the one with the QR codes. Oh, this one, yeah. So we have a Discord server that you can join us and talk to us in real time. The website, microblocks.com, fun of course. And those need the right thumbnail to know more. Oh, that's something I wanted to know more. I think I can go to the student on the next slide. Okay, go to the microblocks.com website and then get the correct, this is the old one we had in the kitchen. And together with the other company, Robots.com, which John already mentioned, they did this product called Pickle Breaks. And Pickle Breaks was a precursor, Pickle Breaks with the Raspberry Pi version, was last year and they won all kinds of awards. And this year, we want to duplicate the same success with the microbeach version of it. And basically the idea behind Pickle Breaks is that anybody who has put kids together for projects knows that when you have like 10-some sensors, you can display motor controllers and stuff like that. And then you have this, it's still details of it, whether it's analog or digital or 3 volts or 5 volts or where's the ground, where's the power, all that stuff. You want to put something big like this together, it's almost impossible to get it right. So the idea behind the Pickle Breaks board is that everything is already pre-mounted and the circuit board that is on, why is everything together? Because you don't have any cables. All you have to do is provide a processor on it, which is in this case, you have a microbeach here, that you can stick in one or two into it. And then we need some power, so we get a little thing like a battery. The cameras work well. Okay, cool. I'm going to make the connection and then put it under the camera so you can see it working here, okay? So I have a servo motor here just to demonstrate that the motor portion is doing something. I plug the servo into the monitor servo ports. Lost my little rabbit here. Okay. Okay, and now, I have a battery pack with E-myon and then we're just going to plug that to the back of the board so they get some juice. And we should have the demo starting to run. Let's take this one. Is the battery? Turn it on, turn it on. Okay, so in the beginning, we have a little animation happening on the old end screen that you're going to see. And then all of the 10 sensors, because as John mentioned, microblocks is multitasking real-time. All the sensors are active that you see here. The one we don't work with in this demo is the pier sensor, which detects motion because it makes a lot of crazy things. And then the infrared sensor here is very timing sensitive because it has to decode the codes. So those two are releted out of the demo, but everything else you see will be operational and just show it, walk it through it. So I'm turning on the battery here and we start off with a little demonstration. It starts with a picogreggs OLED display, microblocks and a picogreggs robot come together and then we see the display. It's kind of a really blurry display. Okay, can you fix it? Wow. It actually got worse when you got closer to it. Yeah, okay, cool. So basically what I have here is I'm showing you a button status, which is this... Oops, now we lost it. Okay, so there's a button on the picogreggs button here. When you press that, it detects it and it puts the little LED on top. We have put that as your mirror is being measured. The new pixels are going on. And what I've done is to show you that things are working real time. I programmed the A button of the microband and the D button to make the new pixels moving in the direction. So if I just go in here and I just press A, you'll see them cycle to the left. And I do it this way and they go to the right. Everything else you see displaying here. We have really interesting sensors on this board. One of them is this gesture sensor here. Gesture sensor detects three different things mainly. It detects that your hand is moving over to sensor left or right or up and down. Or you can detect the proximity, which is how far your hand is this way from the sensor. So doing this over here, if I go near to the sensor, you'll see on the display here, you should say near and far. So you should say near, far. And then if I move my hand this way, it says right, left, up and down. And then you have a light sensor here which detects the ambient light. So if I put my finger on it, it gets dark and you see the LDR numbers go low and high. And then the other neat sensor is this sensor over here which is a touch sensor. And it has two buttons, four arrow keys and all the keyboard, zero key, notes, not it. So that's also active. And if I take my, I've made it so that if I put the pot above 512 value, it becomes active. So just simple tone. Do you hear? And as I'm pressing the keys here, we also display over there under the key what notes it is playing or detecting and things like that. So the other thing we have, I programmed this for the servo motor, which is the one with the rabbit head on it. Let's see if I bring that in the picture. Okay. I'm going to, if I press both the A and B buttons on the micro bit, it's going to make it rotate left and right two times. So I'm doing that. So that's the little motor action. And the other thing we have is this relay here, this black relay on the bottom of the board. So it's actually a AC switching circuit. And when you press the button on the micro grids, I programmed it to come on and off. And I don't have anything connected to it, but the fact that you can press it and the light comes on means if you plug anything in electrical here, like a light or something, you know, a device or gas if you have it on, you can put, putting a, turning the button on or off, you can have this thing drive your appliances at home if you want to do that kind of stuff. So that's like even beyond the realm of playing around with these little gadgets, you can actually control stuff in your house, a toaster or a coffee machine or something. You know, you could say at 6 o'clock in the morning, turn it on and the coffee machine comes on and it's the coffee that's ready for you. That's basically all the different features we have. One other thing we want to show you is all the documentation and stuff that goes with it. We have a wiki site and we have documented the entire board and all the modules, as well as in the library, the entire set of blocks that are used to program the ecobrips is fully documented. And for each and every block that controls, that's in the library, we provide a small demo code. And the demo codes are on our wiki page, but the neat thing is if you have like, MicroBlocks ID as John was showing you before when he was programming it, if you have that open and you open the wiki and you mouse-drag one of the demo codes onto the ID, it'll be instantly running. So you don't even have to write the code. These are the module documentation for each one of them, for example. And the micro-bit connected, how to load firmware, how to connect the battery, how to power up in two, three different ways the whole board. And then the demo programs, how to download it. And that's a better shot of the screen picture there, so you did see it earlier. And then we can look at the library blocks too. Okay, so this is the page with the library blocks. And as you can see, you know, all these different blocks that we use to control it, that's on the wiki page. And if you, for example, we touch the temperature block with the mouse, and when you click it, it will go to the expanded description of that particular block. And you have the, it shows you where on the board the sensor is, and the actual sensor is pictured, what block you're using it with. And then the demo program that's written, and it's sitting in the wiki. So can you open like a, just to show it to the drag and drop, you know, that picture onto it. You can just drag it from there to here. Yeah, okay, that's perfect. So John will take this code from here and drag and drop it onto the micro blocks idea so you can see how easy it is. You can just start playing with it. Micro blocks right here, we've got this code, and just drag it over here and drop it. And there you go, it's in the idea. And now if you have your board plugged in with Bluetooth, with cable or whatever, and you hit run, and off you go. And you can switch these codes, switch these programs. The programs, all the blocks work by themselves, you don't really need a program to run the blocks. You can just click on them and they will do whatever it is they do. But we wrote little programs to demonstrate because it's a little easier and it has descriptive things. And sometimes it combines one or two functions to make it more interesting than just a sensor value. It's not playing into it. No, I'm just saying that in order to get a script, a picture with your code in the picture, you just save a picture of this script and in the PNG's metadata it has a project blocks program. So then you just drag a picture into the scripting area and it has your code. And this new PicoBridge board with the micro bit is a Kickstarter right now. It started about a week ago, a few days ago, and it will be there, how long is it? A month or so? About a month. Yeah, about a month is going to be Kickstarter. So if you're interested, you can go in there and examine it and maybe do a pre-order and stuff. And not only you get this board, but after the show come back and see, for example, we have a car key here, a robot car here with tracks. So that's running over the components of the sensor. Is that it? Remote control? Yeah, it could be. Yeah, there's a remote control too. So this is just one of the projects. There's like 25 or 30 different projects that you can build using these nicely laser cut wood parts. And you put them together in kits and then you take any one of the sensor components or motor components from the board and build it into the project. And off you go. This is the Raspberry Pi PC version of it, but you can see through basically these little parts, you just like snap and break the parts. And that's what you break the bricks off. When you first buy the board, it's wirelessly connected through these little strips. I mean, it's wired. You know, it's like the PC board is actually connected. And you break them apart, then you have to connect the cable. There's grow cables that go between the main board and these things. So I think we should wrap up at this point because it's slightly over time. I was going to ask you a question. Yeah, of course. You mentioned Chinese, but generally language support. Right. So we have a bunch of languages and you can switch the blocks to different languages. We also have on our website, we have a Learn page and we have some of our documentation in different languages. I don't know if we have any. If you're on the wiki pages, let's say on our Chrome browser or whatever, if you do a right click translate, it really translates beautifully into any language that you want. And therefore we didn't bother going to writing separate versions of all the different things because Google Translate is pretty good these days or whatever version that you're going to use. Okay, thank you. Thank you. Thank you. Thank you. So I think next up in here is actually a workshop on Microsoft's Will this take?
MIT App Inventor
All right, so welcome everyone. Good morning. I'm happy to be here. It's 3 AM back home. So I'm really looking forward to talking to you about MIT app inventor. And I'm being joined. My name is Evan Patton. I'm the lead software engineer for this project at MIT. And I'm being joined by Diego Bishwas, who are contributors to our open source project. And they're going to share a little bit of their experiences working with MIT app inventor. And just to get started, for those of you who don't know what MIT app inventor is, it's a web-based platform for building your own mobile phone applications. So you have a kind of drag and drop. What you see is what you get interface for designing your app. Then we have a block environment for coding your application. So in this particular example, we're coding up a Bluetooth connection to collect some data from a microbit. And then you can see that app running on your Android or iOS device. As we see here, we get some data that we can then graph on our phone. So app inventor has been around for a number of years. It was started at Google in 2008. And then it was moved to MIT in open source under the Apache license in 2011. We have almost 21 million users. And probably by the end of this month, we will hit over 100 million projects that people have created with app inventor. And at our peak usage, which is typically in May, we have about 1.4 million users a month. So contributions to our project go a long way. And the mission of app inventor is what we call computational action. So the idea is that you can really use a tool to build a meaningful application for solving a problem in your community or in your family life or in your country even. And we've got a number of examples of what that looks like. So back in 2014, there was this program called the Innovative App Challenge. And some students built an app to help their blind friend here navigate his school in Texas. And they won this competition. And they even got to meet President Obama at the time. And this was an app that they built with app inventor to try to improve somebody else's life. This is a similar example from Kentucky, where this young lady here, her grandfather, was suffering from Alzheimer's and was forgetting to take his medication. So she built this app called Farm Alarm to help him remember to take his medications. And you can see an example of what that looks like on the right there. And it's not just obviously restricted to the United States. We're used worldwide. These are the Darvi girls in India. And so they made an app that helped them to schedule time to go and fetch water at the community well to try to make it easier to schedule their lives around this very important thing that they need to do for their families. And then similarly, a little bit closer to where we are in Moldova, a group of girls made an app to help track water quality. And so they were able to collect a bunch of data points and visualize them on a map. And you could say, the quality of this water is high, medium, or low grade. So that way, you would avoid going to places where you wouldn't be able to get clean water. And so one of the reasons why we can do this is because one, people in the community, volunteers, have gone ahead and helped translate App In Mentor into a number of different languages. We're up to 20 languages now. And maybe you would be willing to help us make that even higher. And so you can go into App In Mentor and you can code an app in your own language because the entire interface is internationalized. We have a website up in the top right there, AppInMentor.mit.edu, where you can go and access a number of tutorials. And we've got them in a number of different themes. So obviously, there's a lot of beginner stuff. But we're doing a lot of advanced things, too, where you can learn more about data science and visualization, do some work with internet of things and connecting to nearby devices like Arduino and Microbit. And we have a whole suite of different examples of using artificial intelligence, including things like chat. GPT. And in the case of using the Android version, we also have a number of different extensions that you can use, or you can develop your own, again, because App In Mentor is open source. It makes it very easy to extend it. And we have, at the end, a link to the slideshow. So you should be able to access all this content as well. As I mentioned, we are open source under the Apache license. You can get to us on GitHub at mit-cml. And we've had over 180 contributors to the project over the last decade. Now, I'm going to hand it over to Vishwas. He's going to give you a little example of how App In Mentor works. So you can see it. And if you like or you're available, I don't know where's Peter. Do we have any more slots available for the workshop? Do you know? Five. Five? OK, so if you would like to learn more and actually try building your own app, we have a workshop that's happening at 12.30 in J Building. Do you need an instance for children between 7 and 7? Yes. 7 to 17. Right, OK. Great. So now, thank you, Evan, for an introduction to App In Mentor. And let's actually go and dive into it. No. Is that good? Yeah, OK. So you can follow along by going to code.appinventor.mit.edu. You don't have to have a Google account for anything. You can just log in and start doubling with Android apps. The example I'll get started with is Hello, Per, because it's trivial to follow along with. And it's also a fun app. The premise is simple. It's a picture of a cat. And when you click on it, it says meow. And we use this example a lot of the time because it's easy to get started with. And as I said earlier, it's a fun little app to play with. So as an intro to the interface, on the left, you see there's several components. And components are things you can drag and drop into the app. So let's say if you want to do that, you can do that. Let's say if I want to add a button, you can drag and drop it here. I can get rid of it from here. And so on. And there's loads of components, stuff you can see on the screen, stuff you use to put together things, multiple things that you can see on the screen, stuff to access like the camera or play sounds, translate text. If you're making a game, there's components to do that as well. And then you can see how there's a variety of components to play around with. If you have a Lego Mindstorm, there's components for Mindstorms and a few other ones that Evan will talk about in a bit. And on the right, well, let's get to the main part. That's the phone itself. The great thing about app inventories, you can see what your app's going to look like. And you can move things around. You can add stuff. You can edit stuff. So this is a button with a picture of a cat as its image. As you can see here, I can set it to none. And it goes back to nothing. And I can set it to the cat again. So it's instant. And it's really easy to follow along with. Another great thing about app inventor is to the left, you can see something called the emulator open. If I connect to that, it might take a moment. But let's be patient. Right, so while it's booting up, the emulator is great because it lets you view your app on your phone. Well, it's on the laptop now. But you can also install the app inventor companion from the Play Store or the App Store. And you will see your app live. And any changes you make will be reflected in real time, which is great for prototyping. There you go. Now, if I add something else, it will drop here. And I can interact with it like I do any other app. Now that's how you design the app. Let's try and get some functionality in. So we want the cat to meow when the button is pressed. And the way we do that is by using blocks. Now, at the top right, you see there's two parts to App Inventor. One is the designer, where you design stuff. And the blocks area is where you actually build the functionality of the app. So HelloPur is very simple. Earlier, we saw that the cat was basically a button. So when the button was clicked, you call Play on the sound. And what is sound, though? You'll see there's nothing called sound visible on the phone screen. But below, there is a non-visible component called sound. And that's a non-visible component is exactly what it sounds like. It doesn't show up on the screen, but it lets you do things that are important, like playing a sound. Now, what sound do I want to play? I can set it from the source, which is meow. And if I want to know what meow is, there's the media panel. And I can click on meow. I don't know if this is going to show up, but I can preview it. OK, plays on here. Great. So meow is the sound that we want to play. And OK, there you go. And when the button was clicked, just say, OK, I want to play the sound. And also, vibrate the phone for half a second, so 500 milliseconds. And if I do that, OK, it says meow on here. I'll turn it up. Sure. Let me also hold my mic here. There you go. All right. Let's also maybe try and make a change to the app while we're still here. I will use the Notifier component just to show notifications. And what I want to say is when the button was clicked, I want to show the text meow as well. And that's also fairly straightforward. All I do is so you can see all the blocks available by clicking here. There's lots to work with. Yeah, you can pretty much do anything you want in your app by using a combination of these blocks. It's just like building with Lego. You start real small, and then add stuff on to it and make your whole app. For now, all I want to say is I want to show an alert. And we'll add this to the bottom. So after the sound has played and the phone has vibrated, I want to show an alert. And I can go to the text category, and I will grab an empty text box and connect it here. So you can see how it all joins together nicely. And I want to say meow. If I try it out, there you go. So as you can see, it's really easy to get started. It's really easy to start making changes. And overall, at least in my opinion, it's great to actually get into how Android apps are made and make one for yourself. All right. OK, thank you very much, Voswas. So I briefly wanted to talk about the source organization of app inventors, considering that many of you might be willing to contribute to the project. And therefore, it's helpful to kind of understand how the project is put together. So as Voswas mentioned, there are a lot of components in the system. We have 96 different things you can use for app inventor spread across 13 different categories. All of these are written in Java, and they link to the native Android libraries. This is not some sort of emulated environment. You're actually running real code on a real phone when you're using it on your own device. And as we mentioned before, you have extensions where if we haven't implemented something, you can implement it yourself and add it into app inventor without even having to build the whole source tree. We have iOS, which is a more recent addition to the project. We open sourced it last year. And currently, of the 96 components, 63 have been implemented for iOS. We have, in the past, had Google Summer of Code projects to work on this. We're hoping to have more of them again this year. So if you're interested in programs like Google Summer of Code, feel free to come after and talk to me about that. And all of these, again, because it's native code, are primarily written in Swift. I'm going to skip this. Obviously, we saw how components work in the demo. We use Blockly. And I believe we have Christopher Allen here. He's going to speak about Blockly a little bit later in the day. But Blockly is this wonderful project that allows you to build your own environments for coding using a block language. And so you provide the back end. I think they have five languages they currently support. We have our own language that we've developed for App Inventor that we generate. And this basically is responsible for creating, taking your blocks, turning it into code that then executes on the device. And then we use App Engine to host the front end. So like the editor you saw when it's running in the browser, the client is written primarily in Java using GWT. And then the back end is talking to a bunch of different services, including what we call the Build Server. And the Build Server is responsible for taking your application and actually converting it into a real running app. And we saw the companion version where Visuals is able to make changes in real time and see how that works. But when he closes the emulator, he loses that work. Well, he loses the running version of the work. Obviously, the project's still stored on the server. But you can compile your application as an APK. Or it's not quite released yet, but we have a beta version that allows you to build for iOS as well. And then you can actually bundle your app and then run it on your phone, have it there. I've actually got a copy of HelloPer that we just did on my iPhone, for example. And so that allows you to kind of take your project with you. And we run 21 Build Servers. And we can support 168 simultaneous builds in the cloud for supporting app and metric users. Now, the project structure is fairly complicated. You saw a lot of these things in the examples already. Assets, where you have your different media files. And additionally, if you have extensions, you've got extensions or considered assets of the project. You've got your source files, which are your blocks and your design, and then project properties. And so internally, App Inventor represents us all using a zip file. You can poke around and try it out if you want to learn more about the internal structure. But it's very helpful if you are working on the App Inventor code to know how the files are laid out so you can edit them and try things out at the source level. Now, I put into slides here. Is anyone interested? How many people here are contributing to open source projects already, or are interested in contributing to open source projects? All right, we've got quite a few. So if you're interested in contributing to App Inventor, we do have some example instructions for how to do that. Obviously, because this is a project that's building for multiple platforms. We've got web. We've got Android. We've got iOS. It's a little bit more complicated than a regular process where maybe you're just building for web. And so I do have this here. I'm not going to go into much detail on it, just because it's got quite a few steps. But as I said before, we'll have a link to the slides at the end so you can go back and reference these. So let's talk about contributing to the project, which is really why we do open source. We want to be able to contribute to these projects and make them better. So one way you can contribute is you can help us with translations. So as I mentioned earlier, we support 20 languages already. But these are all contributed by volunteers. And you can sign up. We use WebLate. So weblate.appinventor.mit.edu. And then you can email my colleague Susan, who will set you up with a specific language. You just need to let her know your username and which language you plan to contribute to. The second way you can contribute to App Inventor is through GitHub. So we actually have, again, because we have this complicated build process, we have two different branches we do development on. So the first one is called UCR. UCR represents all of the changes to components that are going to run on the phone. So that's if you're making changes to the Android piece of the software, or changes to the iOS portion of the software. And the reason we do this is because when we do a release of a new component, we have to put it through the app store process. We have to put it into the Play Store process, which can sometimes take a couple of days. And so we tend to group the component releases into larger chunks to deal with the approval process there. All of the changes go to master. And then we build from master pretty regularly and deploy usually once a month. Or if there's immediate bug fixes that need to go out, we'll do it as frequently as needed. And then when we do a component release, UCR gets merged into master, and then we release the whole bundle altogether. We have issues that are marked help wanted on GitHub. So if you are new to contributing to open source, we have some smaller issues that are easier to sink your teeth into and get started. And we tag everything, which branch to start from and all of that. And we will help you if you have any problems making pull requests. And I mentioned before, and I want to invite Vishwas and Diego to really talk about their experiences with Google Summer of Code. We have participated in Google Summer of Code for a number of years now. I don't know the exact number, but I think it's six or seven. We are currently waiting on the 2024 decision, but we're hopeful. And if you'd like to learn more about contributing through Google Summer of Code, we have a community site where people can ask questions about App Inventor and there's a whole section dedicated to folks interested in Google Summer of Code. So we definitely would encourage you if you are eligible for Google Summer of Code to consider applying for App Inventor. And then Vishwas, your first one. Thank you. So a bit of background about me before I go into why I did Google Summer of Code for App Inventor. I started dabbling with App Inventor when I was 12, 13, I think, just starting small with probably the Hello Pro App as well. And yes, slowly got into making extensions for it in Java and eventually got around to actually making changes to the code base itself. So it's been a long journey. But I think over the years, I've come to be able to mess around with the code base enough to actually make a few meaningful changes, at least. And one of those was my Google Summer of Code project, which was in 2021, which was to change some of the internals of App Inventor to make it more modular. So at the moment, if I go back to the demo, you'll see that, well, it's not really that visible. But the whole thing is a monolith. It's one massive bundle that's generated. And as it might be evident, it's quite hard to maintain. It's hard to make changes to. And it's also not really that responsive. So it doesn't work great on mobile. So my project was to modernize it a little, make it look a bit prettier as well, I think, bring it to this decade. And also make it such that it's going to be easy moving forward to make further changes to the UI. And that was back in 2021. And in 2023, well, last year, I also mentored a project that was to add a new UI to collate all project-related properties into a dialog. So if I go back to App Inventor here, this should be merged, I think. So there's some properties. So let's say buttons have properties like font size and background color. But what about properties of the app itself? And initially, they were in screen one, which in hindsight probably doesn't make a lot of sense. But now they have been isolated into a separate dialog. So this was work that Aaron Modi did as part of his Google Summer Code project, which I mentored last year. I think what I want to get across is really easy to start contributing and really easy to sink your teeth into whatever you fancy. And go from there and become one of the main contributors to the project. So the barrier to entry is incredibly low. And I encourage all of you to start maybe just look at the source code if you can, and then go from there. I would also like to invite Diego. He's also done a Google Summer Code project and also mentored projects himself. So yeah. Thank you. This was a good morning, everyone. So my journey is kind of similar to this was I started programming with App Inventor back in 2014 when I had an idea but I didn't know how to code it. So one of my university, high school professors told me, oh, you can use App Inventor and build the app without coding. And that's how I got into App Inventor. Then 2015, 2016, I started contributing to the project. And in 2020, I was also a Google Summer Code student when I was developing a new format to export the app, which is now available for App Inventor. So when we are building the app, we have, sorry, yeah. Oh, it's not? OK. Yeah. So when we are exporting the apps, we usually have the typical Android APK file, which is the standard one. But then Google added the Android App Bundle distribution format for Google Play Store. And I have basically implemented these new formats so people can distribute the apps through this new format into the Play Store, which is usually more optimized for the Google Play Store. Thanks, Evan. Basically, just to give another view of the difference in a standard APK file, basically all the assets, all the kind of languages that architectures are bundled into the same file with an AAB file. Instead, the Google Play Store will automatically pick the ones that are needed for the mobile phone. And this overall size is going to be way, way smaller. It's basically Google chooses which assets to pack and distribute to your specific mobile phone. And as part of this project, I was also refactoring the computer, because it was also a monolith of 3,000 lines of code. So just to give another view of the changes, this is how we basically invoked the compiler before the best passing tens of parameters into the same constructor, which is kind of risky sometimes if we mess up the order. So this is the final solution. It may seem a bit more complex, a bit more step through. But in the end, it's easier to understand, because all the parameters are properly named. And thinking about iOS, the compiler now has a step-based process. So for the iOS later on, when it's going to be available, the build server, this same format of adding new steps will be conditionally added for the iOS build procedure. And last year, Zimas Beach was also a mentor. In MIT, I was mentoring Drubstree Vastava, where he was implementing some versions for the Android components that were not available for the iOS. So that's why last year, because it was open source, we were able to include this iOS support for our Google Summer of Code project. And as Alishia mentioned in Google Summer of Code and Google Blockly, so we had a student in 2020, Becca Westberg, that was starting to contribute to a pin mentor, adding some enumeration blocks for a pin mentor. And now Becca is basically working and contributing very actively to the Google Blockly project. And now, Evan can give more overview on the teaching and mentoring from App Inventor. Yes, so we've talked a lot about App Inventor in terms of the technical side of things, but I did want to offer as well that there is a lot that's available for you if you're an educator or maybe a parent who's interested in starting an after school club or even students. I mean, we've had a number of students who have come to MIT as undergraduates and will come work on App Inventor. And they say, oh yeah, I started a high school club because we didn't have programming classes and I wanted a way of learning to program. And my friends, we wanted to learn together. And we found App Inventor and it was great. So we have on our website a number of curricula, both at the middle school and high school level. So kind of looking at ages maybe kind of 12 and up or 11 and up that are available. It's mostly in English, but we do have some in Spanish. And of course, again, we welcome very greatly contributors who would be interested in helping us make that material available in more languages. There are a number of programs out there. So many of the examples that I talked about earlier came from a program called Technivation, which is a program that helps mentor young women to have them work in groups. They learn how to do things like develop a business plan and build an application to try and address a need in the community. And they have a big competition, a bunch of regional competitions and there's a worldwide competition where the winners go, I think usually to London, and they present their app ideas. And so that's a wonderful way if you would like to mentor those types of programs, I can get you in touch with people. I know folks like Peter have worked with groups like Coder Dojo to try to make these programs available for students in Europe. And there are many others in all parts of the world. And we have an app and mentor foundation now which runs a seasonal apathon, we call it, where you build your own app and compete. And these are happening all around the world now. OK, let's continue. So the last thing I want to kind of leave you with is a vision of the future and some of the work that we're doing. Obviously, we're producing this project that is used by millions of people, but we're also a research group and we're trying to think about what does the future look like in terms of computing and in terms of programming. And so we have this new project we're working on, built on app and mentor called Aptly. And the idea behind Aptly is we're trying to take the ease of developing an app with app and mentor and combine it with the kind of explosion of large language models that we've seen over the last two years. And so the idea is that you can provide a description of an application in natural language and then it uses GPT-4 to generate a textual version of the program. And we are then able to parse that into an app and mentor project. And so in our mission of trying to encourage computational action, our hope here is that maybe you don't know how to program yet, but you still have a really good idea of how you can solve a problem in your community. So could you just describe the solution to the computer and see if it could make something that would work for you? And then you can actually, this is just the first step in the process. You can collaboratively edit the application with the AI, make changes, and then instruct it based on your changes, other things you want it to do, and so on and so forth. And if you're interested in learning about this, there's a link there, appinventor.mit.edu slash aptly. We've got videos and other things of how this works. And we're continuing to do research on this. So when we first announced this about two years ago now, I think, Peter asked if it only worked in English. And I said to him, I don't know, to be honest with you. And so he gave me this example in Dutch. And this is the app that it made on the right-hand side. And of course, we thought this was really interesting because obviously, many people are used to writing programs in English. And in fact, one of the times I tried making an app with aptly, GPT came back and said, well, you can't make an app in Italian because Italian is not for programming languages. English is for programming languages, which I found to be a bit funny. And then I kept instructing it to do it, and eventually I did it. But I did think that, but it's not true. Obviously, I don't understand Dutch. But the LLM certainly seems to have some understanding of it. I was doing my particular test in Italian when I did it. And it kept telling me in Italian it couldn't do it until I complained enough. But it can take these things. So you don't even have to know English anymore, potentially, to develop an app. The system can do it for you because it has these latent understandings about what it means to build a program. And it's seen examples of other languages, so it can figure out the mapping of that. I have noticed that it doesn't get it quite right. There is a bug in this program, because this is saying it wants an index, and selection index would be the appropriate property here. But we'll gloss over the fact that it made a semantic error. That's OK. And then the other thing we've been trying to do is trying to teach it about concepts like games. So I gave it these two prompts to try to start making a little game world. So I wanted a 2D platformer. So you see I got some platforms at different levels. And then I said, OK, well, maybe I want my main character to be this little lizard with a space helmet on. And so you can see him up here on the top left. And so it's really quite fun to play around with. And we're continuing to do research on this. So it hasn't been open source, but we're looking to open source it soon and try to get more people playing around with these types of tools and trying to understand really how we can take these large language models and combine them with something that's as easy to use as App Mentor to make some really interesting programs. And so with that, I just would like to say thank you all for coming here bright and early on a Sunday morning to listen to us talk about App Mentor. We have a number of different links, as I mentioned. You can access these slides at fosdm24.appmentormit.edu. And then, of course, you have access to the sources and all of the resources that we've made available worldwide for this project. And if you have any other questions, you can also feel free to contact me by email or by email. Or we can talk for a little bit now if you have any questions from the audience. Yes. Yes. They can make the app any language. How technical does it have to be? Or how do you express the app to use like a few terms somehow? Or do you need to know a little bit about what it has been or what it's known as? It varies. Obviously, the more technical you get, the better it tends to be in terms of what it does. But one of the fun things that we've been really playing around with is kind of giving it things that are purposefully ambiguous to see how it responds. So one of the early examples I would do is I would start with Hello, Per, which we saw earlier. And then I would instruct it, when I click on the cat, make it bigger. Right? There are lots of different ways you could interpret that statement. And in fact, if we run this multiple times, we get different code outputs. So in one example, it would add 50 pixels to the width and the height. In a different one, it would multiply it by two and things like that. But it does seem to associate, OK, it knows cat and it can see that there's a button that is associated with the thing called kitty.png. And I imagine that's how it figures it out. Because we also have a version where we have both cats and dogs and then try to get it to understand the difference between cats and dogs. And then, of course, bigger, it seems to have many different interpretations of what bigger could be. Another fun one that I've been doing, which is actually really cool, and I don't know how it does this, I make an app where I have it with the add and remove button to add some content to a list. And then I say, make the colors of the buttons match their purpose. Right? And because one button is called add and one button is called remove, it makes the add one green and the remove one red. And then I say, well, this is not colorblind friendly. Make the colors be more colorblind friendly and then adjust the colors, which is particularly interesting because the entire interface is text based. Right? So there's no color here, but it seems to have some notion about colorblindness and how to change things to make things more colorblind friendly. And so then it comes up with more muted colors or changes, you know, some properties of the colors. And it's really quite fascinating to see. Actually, I wasn't planning to do a live demo of this, but let's try it. Oh, there's another question. Sorry. No, in the chat. Oh, in the chat. Someone can ask a question. Why do we have to create a Google Cloud to use MIT App Inventor? So that's why I was suggesting, so if you use the version of code, sorry, we'll go back to the app thing in a second, but I do want to answer this person's question. So in code.appInventor, when you come to the server, I got a log out. Yes, yes, yes. When you come to the server for code, there will be this continue without an account button. And if you click on that, you can bypass the need for the Google login. And what it does instead is it generates a random forward code. And you just save that. That's basically your password to access your account. And so you write that forward code down. And then when you come back, we just saw it. There's a set of four boxes where you put those words back in. So you don't need to have a Google account anymore to use this particular service. There are other services we offer where you do need a Google account. And right now, that's just for identity purposes. So we have a way of linking your projects to you. In this particular case, if you use this anonymous thing and you forget those four words, we cannot help you because we have no way of identifying you or your account. And so that's really important to save those four words if you use this anonymous version. But yes, there is a way of using app Inventor without the Google account. So that's how you do that. Yes. One other question about phone generation again. Assuming that say you know what you're doing and you're just using it as a help, is it possible to focus on parts of the generated app and actually enhance it in one fashion or another without regaining the company? Yes. So this is our hello per example. So let me just open that up so we have a good starting point. Oh, yeah. You can make it edit the pictures and stuff. I forgot to mention that. So here's our example. Yeah, just the hello per. And then there's this little pencil icon down here in the bottom right. And that brings up the editing thing. So I'm just going to come in here and say, let's do the example I said earlier. This is now a picture of a dog. Let's say when I click the dog, make it bigger. And the next version also has a microphone input, so I wouldn't have to be typing that. And so, again, we can watch. App Inventor also because we do a lot of logging in the console. So if you want to see a fun way of how it all works. So now, so see, here's something that's done that's interesting. So I thought that it could create two button handlers. Now, people who have used App Inventor know that this is a no-no. So I'm just going to combine those two. But this is part of the interesting part of working with the AI is it doesn't always get things right. So I've gone ahead and I've taken it. In this case, it's decided that making something bigger implies adding 10 pixels to each of the dimensions. But it's come up with that code. And then we're able to take the textual representation of that code and turn it into the block representation so that you can see how it works. I think it also made the button smaller in the designer compared to previously. Let me move this over to the side. And then, of course, so we could say, what do we want the button to be a picture of? It could be, I wonder if it knows about FOSDEM. Could we do something about FOSDEM? Okay. Let's see if it knows about FOSDEM. That could be fun. This is also one of the interesting things about this technology, right? And I'm sure people here have played with GPT-4, any of those kind of generative AI systems. So you know that occasionally it doesn't kind of either come up with what you want or it does something completely random. In other cases, it's sometimes very good. In this case, it might not have done anything. That happens sometimes too. Yeah, it was upset about something, so it rejected my request. Again, this is research-level stuff, right? This is not a product yet. That's one of the reasons why we haven't released it open source. We're still working out some of the kinks. And one of those kinks is that it doesn't have much in the way of error handling. Christopher. You mentioned that LLM is generating some language which you then personally just speak slightly more about that if we... Sure. So when we first started this project, it was the LLM at the time that we were using was the open AI codex model, which no longer exists. And when we were designing the language, we had a couple constraints on it that we wanted. We wanted it to be a one-to-one correspondent with App Inventor. And basically, for those of you who maybe don't have a computer sensor, the idea here is that for every single block or function or whatever in App Inventor, we wanted it to be an operation in the language and vice versa. So we didn't want, for example, to take JavaScript and then have to worry about Lambda functions because App Inventor has no concept of a Lambda function. So we... But we needed something where we wouldn't have to do a ton of training. So we started from Python. We said, we'll make it look like Python. Obviously, it will have had training on a lot of different... Oh. Yeah. Oh, yeah. So... Oops. Yeah, so if we... Oh, come on. Sorry, it's being a little bit finicky. The computer's being a little finicky. So it's got to construct something. There's no new keyword. You just use the class name. It's got named arguments. It has... This is some of the differences, though. There's no def keyword like there is in Python. We say, for event handlers, they begin when? Like they do, at least in the English version. And then you have the event name and the... Sorry, the component name and the event name. This one doesn't have procedures, but procedures are defined with the two keyword, and globals are created with initialize. And then there are relationships like in App Inventor, every component other than the screen has a parent. So translator belongs to screen one. If you had an arrangement, you would have the thing belongs inside of its arrangement. And so... And there's, as I said, there were two reasons... There were a couple reasons for this. One was we figured it seemed a lot of Python. Therefore, it should recognize a lot of these types of things. And two, we wanted something that didn't have a lot of extra syntax, because one, we thought it would be difficult to get it right. Our original proposal was, could we generate the JSON that underpins the design, and the XML that underpins the blocks. But XML is very verbose, and you have to keep it balanced, otherwise it doesn't work. JSON, in terms of verbosity, obviously, if you got parentheses wrong, we'd be in trouble. But that was that. And then the last thing was, because we're paying for it, we wanted to have as few tokens as possible, because you pay per token. So something like Java or C, if you've got a bunch of curly braces and semicolons all over the place, you're going to be paying for that, and you're going to have something that wasn't going to be. We could talk about things like color, and how we have to pay for all the extra use that are involved. But yeah, so that would be, we had a bunch of reasons for why we wanted to design the language this way. Obviously, these tools continue to get better. You can do things like fine-tuning, which we're still exploring, and maybe there might be even better representations of the language when you start fine-tuning. But yeah, obviously, this is a very rich area for continued work, and we're really hoping to try to get people really excited about LLMs and kind of open source. We're really hoping that more of these, people are releasing weights and things like that for various language models, so we're hoping that eventually we'll get to the point where more people can do these types of things. Do you have to train it with a language, or could you just kind of prompt it with a manual? Yeah, so what we do is we just do some prompt engineering. So we have about 20 to 30 examples on the server, and then what we do is we compute the embeddings of those, and I think the embedding space is like 1,024 dimensions. And then we take your prompt, and we compare its embedding to the embeddings of all the example programs, rank them based on distance, and then we take the top 10 and give those as examples. So it'll see something like, oh, put five things in a list, and then here's a list, and it'll see, here's create a button and do the button. And then what will happen is it kind of stitches together some of these things. I don't have a slide in this talk, but actually this picture will do. So this was the translation app, and in our initial test of this, we only gave it three example programs. We gave it a program where if you clicked a button, it would translate Hello World into French, I think. So there's no user input other than clicking the button. We gave it an example of a little app where you could add ingredients to a pizza, and so that gave it an example of how list views work, and then we gave it, actually it's not in this one, we also had one where if you click a button, it speaks. And so what we wanted to do then is we gave it those three examples and a prompt like this, and it stitched together the whole program from start to finish because it saw examples of each of these different functions, and again for reasons I do not understand, and I don't think any of us really understand, it was able to eventually figure out, okay, well the first thing I have to do is do the speech recognition, and then the second thing I have to do is call the translator, and then the last thing, which again is not in this example, there was a text-to-speech component, and it said, okay, well now speak the result of the translation, and it never saw a program, at least in this language, that did all of those things together, it saw bits and pieces of those and what they looked like, but it was able to predict the tokens in such a way that it could make the final app, and so that's really one of the interesting things, it's somehow in some cases, not in all cases, but in some cases it's able to make these combinations that do useful things, and then sometimes it fails, but that's life. Can the source run without Google App Engine? I believe it can now, Google did the latest versions of App Engine, the runtime is open source as of App Engine 2.0, that's the library we currently use, I don't know exactly how to run it outside of App Engine, but supposedly yes, in the new App Engine 2.0, that's all open source, and I believe it's available on GitHub, or if not, it's on Google's system, but there is also a pull request you could look at where the backend has been replaced with Postgres, and so you can, all the data objects and things are stored in Postgres, so if you do want to run it without the App Engine piece, you should be able to do that, so I would definitely encourage you to take a look at that. Any other questions? Oh, yes. Maybe a bit of attention, but this is a great tool for learning official design programming, you don't do the affecting, but how about the rest of the lifecycle? Are there any instructions or ways of teaching those, like version control, Q&A, distribution? We do have some of that, so we don't have a formal... Yeah, there we are. We don't have a formal version control system right now, but one thing you can do is we have a checkpoint feature, which essentially you can think of as making a commit, so when I click on this, it'll say, okay, I'm going to create a checkpoint called hello per checkpoint one, and if I keep doing it, it'll do checkpoint two, checkpoint three, checkpoint four, and so you can use that as sort of a way of effectively creating a commit, you can continue making changes, and then when we go back into the project list, you'll have all of your checkpoints for your application. Now, granted, that's probably not the ideal way of doing version control for something like this, but in the end, each of these projects is essentially a zip file with that structure we showed in the slides previously, and so you could, of course, pull it out and work with it that way. Now, the other thing we've been doing, and I didn't really talk much about the details of Apply, because that's a separate hour-long talk, but it uses a version of App Inventor that we've developed that supports real-time collaboration, so you get a multiple people editing, and there we have the entire event stream, so you can look at each individual change and kind of be able to say, well, we're going to take these edits out or do something like that, and I don't think this particular release has it. It's coming in the next one where we have Undo Reader Support, so if the AI does something you don't want it to do, you can undo it, so that way you don't lose valuable work because the AI went a little bonkers. In terms of distribution, we do have some documentation on things like managing your key store and other signing-related things, and for the iOS build server, we do have a bunch of documentation with screenshots of how you go through the process in App Store Connect and in the Apple Developer Portal to set up all of the signing and everything you need to do the iOS release. Now, obviously, this is, again, an area where more help in terms of if you think there's documentation missing on this front, the documentation is all in GitHub in the repository, so you're welcome to contribute, and it's written in markdowns, so it should be fairly straightforward. Quality Assurance. Yeah, we don't have a lot of that. I'll be perfectly honest. One thing that people have asked us before, and actually it might be even better for teachers in some ways, but having sort of a unit testing type framework for App Inventor, so that you could say, I expect that if I click this button, some output will happen, because right now for a teacher, when you get a class of 20 students, you get a grade at their homework, you get a load of each project, try it out, make sure it works correctly, grade it, and so on. And so automated ways of kind of doing something that would be helpful, but that's not something that's currently on our roadmap. Ah, okay. Well, I think the answer is still the same there, which is essentially we don't have stuff for that for students, but of course there's no reason people couldn't build something around that, like maybe, I don't know if the code or dojo folks would like to have material around doing quality assurance of applications, but that's certainly something they could develop. I mean, we obviously give the platform, and lots of people have built material around it. There's no restrictions on other people doing that, so, and all of the stuff on our website too is CC4 by SA, so anyone can take our material and reuse it. Yes. I have three questions, which I realize are probably all the same question. Okay. So one of the reasons why you chose when you wanted to support iOS to use Swift a lot of them, is to think of something like Kotlin that might be cross-platform. Okay. The second question was, you said that iOS support was coming soon to build servers, but how does it work without that at the moment, because it seems like you have some support for iOS. Yeah. And then the third question was, like, do you need the companion app and you not just build, like, APK files? Yes. Okay. So maybe I'll address the last one first, and then I'll go to the other two. So yes, you could build the APKs and not use the companion app at all. The nice thing about the companion app is that you see the changes in real time. So building an APK because it has to package it up, actually go to a build server, run through the Android build process, and then return the binary back to you, often can take a minute or two. Right? It's the difference between, like, people, like the old sort of compiled test release cycle versus reloading JavaScript in your browser type of thing. So the nice thing about using the companion is it saves you a lot of time. Yes. Yes. So internally, App Inventor uses Scheme as its programming language. So all the blocks and everything eventually output Scheme code, and then there's a Scheme interpreter running either on the Android or the iOS device that then runs the code and, you know, does whatever the behavior is. Which kind of, I guess, goes to the second question you had about why not rewrite everything in a completely brand new language. So what we did is we essentially made a Scheme interpreter for iOS, and then the question was, well, how do we get the components there? And for that, we just decided, well, we'll use the native language because obviously Apple is going to continue to support Swift, it seems, for the definite future. So let's just invest in the native platform, just like we use, well, we'd have to rewrite all of our stuff from Java into Kotlin. I believe there are tools that maybe help with that, but I haven't done a lot of investigation there. And so, yeah, we just said, well, let's just implement the Swift version of everything and keep it as close to the Apple examples as possible. So I guess the, oh, and then the build server. Yeah, so you can today go to the App Store and install the iOS companion app, and you can try your apps right now, right? So I could open that HelloPer app and load it on my phone, actually, let's go ahead and do that. Are we at time? Okay, yeah. Oh, we can do it offline, but essentially the idea is, can I take 30 seconds, Peter? So I scan the QR code and I go to my app here. Yes, I know my app is out of date. It's actually out of date because it's a newer version, and so it doesn't recognize the version number. And so it uses WebRTC to establish a connection and send the assets over, and then I just get the app running on my phone, and if I turn my volume up, of course, it doesn't know how to create audio assets. It only knows how to create visual assets. So it's still meows, but yeah, that's the idea. And so as we make changes, as Vishwas was showing, I would see it on my iOS app. Of course, if I close the app, then I lose the connection and I lose my progress, but with the build server, now I've got my regular hello per app, and then I can do it, and this one is looping infinitely, so I'm just going to stop it. That was for a test somebody asked about running audio in the background, and so I needed to make sure it would run for a very long time. But yeah, that's the idea. So packaging the app allows it to be persistent on your device, and of course you can put it into the app store and all that stuff. So with that, thank you very much. I'm happy to stay longer and answer other questions, but I will...
CoderDojo
Hello. Welcome. I'm going to start off this talk in a minute, so I'll just give you a second to finish up your chats and everything else. But while we're getting started, it would be great if you could reach out to the person sitting closest to you and just say hi and get their name off them. Because Kododojo is all about connecting with people, being social, and collaborating. So I want to encourage a bit of that collaboration in the session today. And if you're online in the chat, you can say hi to each other and we'll get a bit of Kododojo energy in the session. So first off, if we can get some hands raised, who has heard of Kododojo before? Very popular, great. Great to have so many people who've already heard of Kododojo here. And welcome to everyone who hasn't heard of Kododojo before. We're delighted to show you a bit about what it is and explain about it in this next one-hour session. So Kododojo is a volunteer-led movement of coding clubs for young people aged 7 to 17. We call the clubs dojo. Dojo means temple of learning in Japanese. And the idea from the co-founders was that Kododojo was a place for young people to be able to come and learn about coding in a free and open environment. So I'm going to play a little video for you now where we're going to hear from some of the volunteers involved in Kododojo. And I hope. Kododojo is a free volunteer-led coding club where our children can just come in and basically they can learn about anything in technology, whether it be games or web development. If you're not that technical, that's not a problem because there are usually a lot of technical people that can help you start. It's actually surprising how many friends you make through doing coding. You can get into groups and work with each other. It's actually really fun. It's really fun making all the projects and then like it's really cool seeing that you can actually learn things and it's really fun. Great. So now that you've got a brief overview of what Kododojo is, I wanted to highlight the history of Kododojo because it's really important to the itas and philosophy of where it came from. So it began in Ireland in 2011. A young student, James Welton, was 17 years old. He was a self-taught coder. He taught himself to code at home on his own and one of the things that he found was that it was a bit isolating. It was quite lonely. Not many other people in his area were as interested in coding or knew that much about it and that's something that he wanted to change. And he gained a little bit of notoriety because he hacked the iPod Nano and then other students in his school were interested and they said how did you do that? They wanted to learn how to build websites and how to code like he was able to. So he set up a coding club in his school and Bill Lau, who is an entrepreneur and a philanthropist, he heard about this idea from James and he thought it was great. At the time there was no coding being taught in schools. There was a big demand for software engineers but there was no opportunities developing those skills for young people in Ireland but also internationally as well. And together the pair founded Kododojo. Now Kododojo, you can see the influence there of a 17 year old boy on the name. They wanted something cool and edgy. James's other idea I think was super fun Saturday morning coding class. So we're kind of happy that Kododojo was the name that they went with in the end. And yeah, it was really successful. They started this club off in Cork, which is in the southern tip of Ireland and within a few months people were traveling for hours to come to this free coding club. There was nothing else in the area or even in the wider island for people to attend. So parents were bringing their children two hours to come to the coding club and two hours back. And because of that they seen how important it was to people and how much of an impact it could have. They decided to make the model open source and that encouraged growth across Ireland, then in the UK, then in the US and now globally. And this is the mission of Kododojo and the reason why clubs run right around the world and how it's been so successful even though it's a volunteer led movement. So a world where every child has the opportunity to learn and create with technology in a safe and social environment. And that's something that's still very important and true to the movement today. I wanted to highlight some of the key ethasses and the key philosophies of Kododojo. This one is One Rule, Be Cool. I've seen even the Kododojo Belgium volunteers. They have adapted a version of this as well. And this is a poster from the very first dojo where they had made up their own basically version of this philosophy. And we've since adapted it to a more modern poster. But the idea is to encourage young people to help and support each other, to share, to encourage and all those things are cool. And then things like bullying, lying or being disruptive or uncool. So that's the key rule in Kododojo. And then these are some of our other philosophies. So being inclusive and free. The idea that Kododojos are free is integral to the movement. So that young people, regardless of their background, can participate. They can come to sessions. There's no pressure on parents, even for donations or anything like that. The idea is it's open to anyone. Young people can give it a try. It also makes it a lot more accessible for people that might think it's not something for them. The idea that the sessions are free means that they'll give it a go even if they think it might be for them. Informal and fun, dojos should have a social atmosphere. So there's a lot of mixing, meeting each other, connecting with other people, sharing what each other are working on. And usually they can be a bit loud, but that's okay as well. And then open source. So the movement is open and the members are encouraged to share their resources and tools that they use and their ideas. So volunteers often develop resources and then they share them with other dojos. We have a lot of resources on our side and we encourage people to adapt and use them, how they see fit in their clubs and share their learnings. Because there's, now there's 1,700 dojos around the world and some of those have been running for over 10 years. So there's so much learning in the community already and we want to make sure that that is shared. Tying in with that, we have collaboration and teamwork. So we encourage young people to work together in teams, particularly when they're developing bigger projects and change making. So young people are encouraged to think about the social problems in their area or think about things that they can use their skills to improve. And this is a really important thing that the co-founders really focused on in terms of using your skills and using your coding skills for good. And then this is a poster of from dojos that we encourage young people to think about, to ask three than me. So ask yourself, think about the problem in front of you. This is when they come across a challenger, an issue or a bug in their code. Look over the code, see if they can debug it themselves. Ask a search engine, so check on Google if there's any solution to the issue they've come across. Ask your peers, so the other young people in the club and then ask a mentor. And the idea of this is really to encourage independence, encourage them to come up with solutions to the problems themselves. And also through that, asking their peers if one of the other ninjas in the club has come across this issue before, they can help each other out. And that also helps build confidence amongst the young people as well. Great. And I just wanted to highlight this quote because creativity is really important in Coder dojo. It's not just about coding. There's so many other skills that they develop, teamwork, presentation, and also creativity, because they're being encouraged to create things with code about the things that they care about. And that allows them to really get creative. This isn't a school type environment where they're being sat down and they're told exactly what to do. The idea is it's open and that really encourages them to expand how they think and also, yeah, to think about things a bit differently as well, which sometimes can be a bit challenging for young people who've been very, got very accustomed to a structured educational style, but it's something that we really encourage to really bring them out of themselves and allow them to explore things with technology. So our global community, there's over 1,700 dojos across 94 countries. And this is just some of the stats on the amount of young people being regularly impacted through Coder dojo clubs. And that's really thanks to the volunteers that are involved in Coder dojo. All these clubs are volunteer led and they really are the backbone of Coder dojo. If there was no people volunteering their time, there will be no clubs. There will be no Coder dojo. So I just want to highlight that stat for you all here today. And as mentioned, we have an open source model. You can see here how different clubs and communities have created and adapted their own logo. Here's the Belgium one over here. Here's Coder dojo Netherlands. And the map also gives you an idea of the spread of dojos right around the world. But the idea that clubs can really take on the model and run it how they see fish, they can adapt the logo. But also, once they follow the charter of being inclusive, being free, it's kind of like one volunteer described it as like a public park. Once you abide by the rules of the park, you can do whatever you want in the park. You can play soccer in the park, you can do yoga in the park, and Coder dojos a lot like that. Once you abide by the Coder dojo charter, you can adapt the sessions depending on the skills of the mentors and the young people and their interests. And also the circumstances of where the clubs are running in as well. So I was talking there about volunteers and how they're so important to Coder dojo. So I'm going to play this little video now just to give you an idea of why volunteers help out at Coder dojo. I went along to become a mentor because like a lot of mentors, my kids were interested at that point in becoming, I don't know, coders. They were interested in finding out more about coding. I'm involved with 11 which is a Bulgarian accelerator program. And they said, oh, you know, there is Coder dojo do want to help. And I always wanted to help to teach something. There wasn't one in the area at the time. I kept looking on Google, see what was starting that I could maybe volunteer in or something, but there wasn't one. So six months later I decided to start it. It means to me more than anything, being able to help kids to see possibilities that they have never and probably would never see otherwise. When I see them do things on their own, getting their ideas out there and seeing creativity which is spur out in the moment really gives me so much enjoyment. It's like a journey in the sense that, you know, you feel that you're going somewhere. It's also just a community and it's fun and exciting and every time you meet people from other dojos, you learn something new. Do you need to be an IT person? Do you need to be technically skilled or technically savvy? The answer, absolutely not. There are usually a lot of technical people that can help you start. Just go there, just get involved, see the vibe and see if it's something that might fit you. When people ask us, like, I'm not sure if I should get involved, we always send them to the nearest dojo they have and just go in and have a look and once they see how a dojo really runs, mostly they'll be addicted straight away, just like me were actually. We have a motto in Kododojo or, I don't know if it's called a motto, which is be cool and that's something that we say to the kids that turn up. I would have a similar motto for mentors, which would be just turn up. Great. So we did a survey a few years back with over 500 volunteers from right around the world and we asked them why do you volunteer and why do you keep coming back every session to volunteer and these were some of the reasons that they gave us, these were the top ones. So to be part of a global movement, to learn team building leadership and mentoring skills, they found it was a great way to give back to their local community. They also really enjoyed the fact that they got to meet like-minded people who were also passionate about technology and giving this creative opportunity to young people. They found it was a great way to share their experience and knowledge so they have all these technical skills and experiences and it's a great way to be able to impart those with others and they also felt empowered by helping young people develop skills for the future. So in terms of options for people here today who haven't heard of Kododojo, this is your first time hearing about it, you might be wondering how can I get involved, what's a good way for me to connect with this global movement, see is there any clubs in my area, that kind of thing. Well you have two options, so you can volunteer at a club that already exists, so you can go to the Kododojo website, see on our search if there is a dojo in your area and ask to volunteer with that dojo, they'd be delighted to have you. And then the other option is if there isn't a club already in your local area, you might decide today to start one up. You're here with, you've connected with a few other people around you, you've all got technical skills, so you might decide that starting a club in your local area is what you're interested in. So there are your options. So I have a little poll here, if people can jump on with their mobiles, I'll make the QR code a bit bigger. So I have a feeling I know what way this poll is going to go, but let's test it out. I was going to do a hands up, but I was like who's going to say they're not a Kodor if I get people to put their hands up. Are people able to connect if there's any issues, we can do a hands up. Oh we have a vote, that means it's working. Okay, who's going to win? Oh no. Thank you. See it's already helping each other out. I love it. It's great. Okay, great. And if there's anyone joining from online, you can use the chat feel free. You can also join using the link if you can see the link on the slides. But so this gives you an idea. There's just a, there's a mix of people, some people with coding skills, some people with, have done some coding, but maybe don't feel like they're a skill coder. Well, the important thing is there is a role for you in Kodor Dojo, regardless of your skill level. So if you are a skilled coder, you're in very high demand, you can develop additional skills in the Dojo, you can learn about how to use the skills you have and how to communicate those with the younger audience. It can be really rewarding to be able to share those skills that you have. And also you get to learn about other maybe coding languages or ways of approaching things and even about project building and things like that. If you've done some coding, this is also hugely useful to us in Kodor Dojo. Volunteers can help beginners and they can also learn alongside young people. There's also a lot of tools and resources that even experienced coders mightn't be as experienced with in terms of supporting Kodor Dojo clubs, including Scratch. So there's lots of things you can also learn about there and not a coder. So I want you to know whoever you were, person who voted that said they were not a coder, that you are very important in Kodor Dojo clubs as well, because we need people and volunteers with a wide array of tasks and experiences and being able to help the Dojo in lots of different ways. So there's going to be ways and roles where we need people to help out, even organizing volunteers, preparing the space, checking people in, helping out with promoting the club and outreach as well. So here are four of the primary roles that we have in Kodor Dojo clubs. So we have champions or co-champions. These are the people who are passionate about Kodor Dojo and want to start a club in their area. So they're people who are organized, they're passionate and they might have some experience managing groups of people or managing time. I remember a volunteer said to us before, or a champion actually told us before that if you can organize a child's birthday party, you can organize a Kodor Dojo, which I thought was quite a nice way to think about it. They can have technical skills, it's not necessary at all. We also have technical mentors, so these would usually be volunteers with technical skills. They are encouraged to be helpful, they're enthusiastic and they have knowledge in their specific areas. Then we have guidance mentors, so these can have youth experience, they might be parents, librarians, a wide range of roles, youth workers, and also they are enthusiastic about it and interested in helping young people. And then we also have support volunteers who have practical skills and experience, they might have some administration skills and prefer to help outside of the session, so they're less mentoring young people and more helping out in other roles to do with the dojo. And I really want to highlight this, roles are very flexible. I give you those examples there just to highlight the kinds of roles that people can have, but there's so much overlap. People who are champions often start helping mentoring in technical areas, support volunteers often jump in and mentor, providing encouragement to young people, so there is so much overlap in these roles. And they can adapt over time depending on your skills and interests as well. If you started out as a support volunteer and then you want to get more involved in mentoring, all those options are open as well. And then in terms of time commitment, because this is something we get asked about a lot, it depends on your role. If you're the champion, you'll be volunteering more of your time because you'll be organising the events and if you, it also depends on the session regularity as well. So some people run weekly sessions, some run them once every two weeks or once a month. So depending on how often the sessions are run, you might be asked for a greater time commitment, but roughly it's around three to four hours per month. Here are just some quotes from volunteers who've been involved in Kododojo around and highlighting that their skill sets and growth has been helped through volunteering and also that they're getting to make a real difference in the lives of those involved, which I thought was a lovely, a lovely note. Okay, now I want to show you what a dojo looks like in action. So I've given you a brief overview of what Kododojo is. I've also highlighted how you can get involved as a volunteer, but you might be wondering what the clubs look like. So this is an example. Great, so that kind of highlighted the key areas when running a dojo, but as I said before, all dojos are different. They might adapt depending on when they're running and also the types of events at different times of the year as well. But usually clubs start off with an icebreaker or some way of encouraging young people to connect with each other and get into the sessions. This is particularly useful if you're running an evening session, if young people have been at school all day and you want to kind of get them up out of their chairs and moving around before they start coding. These can be really helpful. And then we get into the coding itself. So I have another poll here for you guys. So what technologies do you think are used in Kododojo clubs? So this is taken from last year's annual survey where we asked Kododojo clubs right around the world and we asked them what are the most popular technologies that they use in their sessions. So you can hear an example of some of them. There was much more on the list, but I just highlighted some of the top ones. Interesting. Yeah. Yeah, which ones do you think is the most popular? The competition between Microgrid and Raspberry Pi. Oh, Scratch is jumping up. How many people think of its unity? You're happy. Okay. I think we've got most people's votes there. I'll do a little drum roll. And then this is what the results actually are in terms of this is the proportion of clubs using each of those. It's going to keep doing that jump and just to keep you on your toes. Don't worry. It's meant to do that. If you thought it was a glitch, don't worry. So Scratch is up there with almost over 90% of clubs using Scratch in their sessions, followed by Microbit, Python, HTML, CSS, JavaScript, then Unity, Raspberry Pi, and then App Inventor. So you see, and this isn't all of the languages or technologies. I've just tried to do a little summary for you all, but just to give you an idea, some of the technologies used and how they're spread across dojos. And bear in mind, this is from our annual service. So there's lots of other technologies and clubs doing different things as well. Great. So I just wanted to give a brief overview of Scratch, because I thought people might have heard of it, but it's a block-based programming language created at MIT, and it's really designed to be beginner-friendly. So young people don't need to worry about writing things with text, because it's a block-based coding language, and it's really helpful, particularly for younger children, those who are starting at Kododojo are seven years of age, and if they have no experience typing or even their reading and writing skills mightn't be as developed, this is a really great way to be able to give them an opportunity to create something quickly with code without them worrying about typos and those kind of errors, and it really allows them to experiment and try out different things without worrying too much about breaking any very valuable hardware or anything like that as well. So here are some of the resources available to Kododojos that are involved in Kododojo, so we have a how-to-mentor course, this is really useful, even if people have technical skills, they mightn't be sure how to mentor the kinds of approaches you can take, and even for group management, that kind of thing. So there's lots of guidance in there of different approaches you can use, and that's available on our website. We have a champions handbook for those starting a club, which goes through all the steps, all the organizational resources, and all the things you'd need to know for starting up a club. We have a help desk, which has hundreds of articles on all things from fundraising to safeguarding, everything's in there. We run regular community calls, so these are opportunities to connect with volunteers right around the world and learn something together with those. We have a Slack instance, so you can also, if you have any queries, you can write to other volunteers around the world and get their feedback. People also share resources that they've created for their dojos in there, or even different tools that they've developed, and things like that as well. Then you have the Kododojo team, which includes me, hello, and there's also, I have colleagues for different areas around the world, so yeah, get in touch with us if you have any questions. Then we have the Kododojo platform, so here's a little screenshot, but it's an event management tool, and also for volunteer management and communications as well. That's a free platform we provide to allow champions to create events for young people, help volunteers connect with dojos, so a really useful resource. I wanted to highlight some of the projects that we have available for young people. You can scan this, oh no, it's that slide thing going to get in the way. There we go. You can scan this QR code if you want to see what our project site is like. Basically, we have step-by-step projects for young people that they can use, but these are completely free for anyone right around the world to use, and you're interested in learning Unity, Python, Web Designer, any of the projects on there, feel free to use it as well. We've developed these new paths to help young people develop their skills and not only develop their coding skills, but also develop their independence over time. The format of our projects is in this three-to-one make format, so the first three projects are Explorer projects where they learn a particular skill or coding concepts. In the second two, they take the skills that they've learned in those first three projects and then they get a bit more creative, so they're able to adapt the projects a bit more, add their own design choices to it, and then in the final project, they're just given a brief for the project, and then they use all the skills that they've developed over the five projects to create something completely original, so two children could do that, invent project, and they'd end up with two completely different books made with code. That's really to encourage young people to explore the technology, get to know it, and also develop their skills so that they're more comfortable with less structure as well, so it eases them into it. All our projects are also available in a lot of different languages here. The top languages we have projects translated into, but we also have them translated into many more as well. Some of the useful features on our new project paths include these, so there's digital badges. When young people complete projects, there's info cards to explain different concepts to them. You can upgrade your project, so this is really helpful if you have a young person that's kind of racing ahead of the group or anything like that, there's additional things that they can add on to their project as well. We also have debugging tips, so the most common urge that young people come across, and we give them tips on how they can debug the projects themselves. We have quizzes that check that they understand the different concepts that they're learning through the projects as well as box-outs with more info. And then I also want to highlight another opportunity that young people can participate in, in a dojo. We're located on the International Space Station. In the Mission Zero Challenge for Beginners, you can collect environmental measurements and create your own pixel art image for the astronauts on board. In the more advanced Mission Space Lab, you're challenged to program and experiment. This year, calculate as accurately as possible the speed of the ISS using AstroPy's camera and sensors. All valid projects will receive a participation certificate and will have their programs run on the ISS. Mission Space Lab teams will join a webinar with an ESA astronaut. Great. So I just wanted to highlight, so that AstroPy mission is available to any young person in an ESA member state. A lot of our dojos do it. It's a great way to introduce young people to Python, and they get the added bonus of being able to have their code run on the International Space Station, which is amazing. So I also want to highlight this very last aspect of our dojo session. So we covered ice breakers or unplugged activities at the start. We did the main coding section that they do at the event. And then towards the end of the session, when things are wrapping up, we encourage young people to showcase what they've made during the event. So this doesn't need to be big. It can just be young people all gathering around a laptop and talking about what they've made, highlighting what they've learned, if they overcame any specific challenges, and really discussing about the kind of projects they've made so they can inspire other young people in the group as well. And tying into that, years and years ago, Kododojo volunteers, as they were encouraging young people to showcase in their individual dojos, they came together and said, I want to be great if we could share the projects that we've made across dojos, across Ireland and around the world. And so they developed this thing called Coolest Projects, and now it's grown exponentially from there. So we have an online showcase where I think last year there was five or six thousand projects entered into it from right around the world, and they include Scratch Projects, Web Development, Games, Hardware, Advanced Programming, and I'm missing one but I can't think what it is. Anyway, but it's a great opportunity and there's in-person events as well, so we encourage young people to develop those skills, sharing their project ideas and what they've created with each other. So we have events in Ireland and the UK, but also Kododojo Belgium are running Coolest Projects Belgium in April, and here are some of the other ones we have in Ghana, Serbia, and Sri Lanka as well. And there's more dates to be announced, those are just the ones we have at the minute, and then we have our online showcase as well. And any young person, even if they're not involved in a Kododojo, can participate in Coolest Projects. So if there is a young person you know who's interested in making things with technology, be sure to let them know about that as well. And finally, I wanted to just highlight the steps for starting a dojo. If you attended today's session and you're interested in Kododojo and how you could start a club in your local area, there's five simple steps to follow. So becoming a champion, this is just you saying to yourself, I'm going to do this, I'm going to start a club in my local area. You can come to our website, you can register your interest and we'll send you on some more information. You can gather your team, so if there's some of your friends that are with you today, or if you have colleagues, local people in your area that are interested in encouraging young people to Kododojo as well, there could be staff in the venue that you're going to run it in, they can help you out as well. Finding a venue, it could be a community centre, your local library, it could also be your office if you run a company in the local area that you can have one evening a week for example to run a dojo session, then plan your dojo. So chat with your team that you've gathered, what kind of things you can do during the session and also use our website and resources as well. And then promote your dojo, so you might set up a social media channel, but it also might be just reaching out to your local school, reaching out to your local library and encourage young people to attend. So that's been me, I'm Nula McHale, thank you very much for listening to the talk. If you're interested in registering a Kododojo club, please scan the QR code and you can also learn more at Kododojo.com. I'm just going to check the chat now in case there was any questions. No, there doesn't seem to be any questions. Is there any questions in the group? Yes, go for it. How to start? How to start? I just give you a beautiful diagram. So the easiest thing actually is to go to our website and if you scan here and fill in that form, we'll email you more information on resources that are helpful and things like that. But once you're interested, it's great to hear that you're interested because that is the key thing, being passionate about helping young people learn to code. And we also have volunteers here from Belgium and Peter, a champion who's already started and running a club who I'm sure would be happy to chat about his experience as well. Yes. I work with students with special education. Do you get any support from the department like this? Yes. Do you get any support from the department? Yes, so during the pandemic, some clubs ran online and we did. There's some clubs still doing a hybrid model. I know for some young people with special educational needs, in particular autism and things like that, they actually find that these clubs really useful. It also helps them come out of themselves a lot because sometimes technology can be a special topic of interest and so it's a way for them to use their skills, use and focus on something they're really interested. Meet other young people that also have that interest and then they're allowed to code. They might code with their headphones on or however they prefer to work. Often you can have the setup of the coding session changed or adapted. We have an accessibility guide which talks to champions and volunteers about how they can adapt sessions to differing needs of young people to help make reasonable adjustments to the room and things like that. So there definitely is a place in the space and a lot of young people really benefit from being able to use technology and coding as a way to encourage them to connect with their peers as well. I have a question a lot. I think I've done a lot of special needs. How can it be that in your dojo we have more than one room so it has to be quiet. We can have a separate room setup. We can make rooms of children with the same needs. Yeah. No, that doesn't mean anything for me. And he's just working and after that he's presenting his project and then he's done. He's seen him again in another two weeks. And that's it. So you can ask yourself why he doesn't come. There is something that makes him come to us. Even if he doesn't need us. Yeah, sometimes it's less about needing a mentor and just having a space where they feel comfortable and be able to do what they're interested in doing in that environment. So yeah, that's a great example. On the privileged environments is there an arrangement to co-ordra dojo to obtain hardware or dash 3 at the discount or something like that? So a lot of clubs, what they do is they work with companies in their area to get like pro bono or donations of refurbished laptops, things like that. Some dojos will have like five laptops held to the side and when people are booking tickets they can reserve a laptop if they need one, things like that. But there is opportunities for like grants and things like that, but not specifically like across all clubs everywhere because the needs are so different to specific dojos. Yep, sorry, with the cap. No, I got you. No, you first. Yeah, yeah, yeah. So yeah, like I was saying, clubs are completely adaptable to the interests and also the skills of volunteers. So if they're interested in talking about UX design or things like that or like project management tools and highlighting and young people really like that, particularly slightly older teenagers, they're really interested in the tools and the languages that real professional coders are using. So you being able to show them stuff on Figma and how they can create with it. And Figma is a really good example because it's such a usable tool that young people could see how user design works and how they can even like do things like testing stuff out with other young people in the dojo, getting their feedback, being able to adapt it. That's all user experience and learning from their users needs. So I think yeah, that's a great thing you could bring to a club, definitely. Yeah, sorry. So where is the funding platform? Yes, great question. So Kododojo is part of a merger with the Raspberry Pi Foundation. So the Raspberry Pi Foundation kind of sits over Kododojo and they fund and support the work for creating resources, keeping and maintaining the platform, all that. So yeah, that's where the Raspberry Pi Foundation. Great question. Oh yeah. So all of our resources are on the website, but we have on each of the like top of the projects paths, there's like short links. So in a session you can be like go to the short link to pull up a path. Is that what you mean? Yeah, yeah, yeah. Yeah, there is guidelines in the projects. One thing I will say is we have our own code editor so young people can use that code editor, it can be opened in the browser. They can create an account if they want, but they don't have to. So that's a tool that we provide for young people to be able to code straight away. Yeah, there's a Raspberry Pi code editor that we've developed. Yeah, go for it. Yeah, yeah, yeah. No, definitely. So all our project resources are translated into different languages, including Arabic, but the code editor isn't available in Arabic at this time, but it is a great question and it is something but you want to. Yeah, yeah, yeah. No, and definitely and Hedy is something that is like more than welcome for like Kododojo clubs to use. We're completely software and hardware agnostic. So if you see a tool that you think is useful for young people in your community, feel free to use it in a Kododojo. Great. Okay, I'll, I don't know if I've, okay, I'll take you up. Go for it. Yeah, I was curious. Are the dojos sort of typically bring your own device for most of the kids? Yes, yeah, so it is. Yeah, bring your own device predominantly. Yeah. I was curious about was just about how do you decide what to teach, do you have a bunch of resources on your website? Clearly, is that sort of the direction they're encouraged to go or is it a bit more up to them? It's really up to them. Some I think it's roughly a third of clubs rarely use our resources. So it's really is up to them. But a lot of people use it because they're volunteers, they're volunteering their time already. And the idea is we're providing these resources so they have something to go to, they have things to fall back on. We have some like printable resources that are helpful for young people that might struggle to have like two tabs open at once. But yeah, the majority of them would use the projects on the project site. Yeah. Question? I don't exactly the same. Do we have to provide them a computer to help them? But you said that, you know... Yeah, no, no, no, you want me. You have some guidance on how to, let's say, run the first session? Yes, yeah, yeah, yeah. I would like to start something that is in my area, but I think how we can start yet. I know there is a form I already said. Yeah, great. But yeah, no, I think that's definitely it's useful to hear the kind of questions people have when they're wondering about starting. But yeah, we have guidance and tips and examples of icebreakers and unplugged activities you can do with young people. We have a website filled with projects and also we have guidance on which projects and which languages can be useful for different age groups, just to give you some guidance on that if you weren't sure. And then also tips on how to encourage showcasing among Shire children or things like that and giving young people the opportunity to do that. Great. Gio. What's the plan for the foundation? Ooh, great question. Yeah, we've lots of plans. We're developing, I know, we're developing out. The platform is the key thing that we're working on at the minute. Previously, the last year, we've been winding down the old platform we've had. We've been able to release a new platform with event management. We're just launching all the email systems within that, which are going to be translated into the key languages that our volunteer communities use, which is amazing as well to be able to do those translations. And that's actually if anyone is interested in translating, we have all our resources that are translated. It's translated by volunteers as well. We have two members of staff that support the volunteers, but the majority of the the translations of our projects of the platform of our website is all done by volunteers. So if that's something you're interested in, do check out the website and offer to support that way because it is so useful, particularly for young people to be able to learn coding in their own language. Great. Okay. I think that's it. Is that okay?
Snap!
All right, good morning everybody. Welcome to the session on SNAP. My name is Jens, and with me is my colleague and friend Yadka. We're here to present SNAP, build your own blocks. But it's not just Yadka and me who are working on SNAP. We're just the ones presenting it today. There's a larger community and more people involved. There's Benat Ramagosa down there. There's Simon Walters from the SNAP community here. There's John Maloney, Turgut Gunezu. The whole blocks gang is here. And all of those have been more or less also involved in the development of SNAP. I've brought also Olanzah, who's very involved in SNAP, our mascot. I keep saying this for people who might not have heard it before. Every once or so often we get somebody who says, take down SNAP. It's a copy of Scratch. And so I just want to point out, yes, you know this is also a mascot on Scratch. It's called Gobo in Scratch. And we're friends with the Scratch team. And we're allowed to use Gobo. And our friend and collaborator, co-author Brian Harvey, mutated Gobo with this funny haircut that, of course, everybody in this room knows about. It's a lambda. So SNAP is Scratch on lambda. This is what it's about. It's about building your own blocks. And so what Yatka and I are planning for today is kind of not so much a talk as a demo. It's a walk. It's a lab visit. And we'd like to invite you to a visit of our lab. So when we say SNAP, the title is Build Your Own Blocks, it used to be called BYOB for Build Your Own Blocks, until some parents in the US who didn't have a sense of humor decided that it also means something else, like bring your own bottle. Usually there's alcohol in it. And the implication that there is alcohol might entice children to try it. So we had to give it a different name. So this is why it's called SNAP. We're still calling it Build Your Own Blocks. But to us, and what we'd like to show you today, it is as much about learning to code or learning about technology than it is about building one's own mind, about learning something about society, about the surroundings, about the environment. And we'd like to kind of give you a little overview of not just our technology, but also our pedagogy. So just very quickly, this is SNAP. How many of you know SNAP or have used SNAP? OK, about one third. So just very quickly, I'm just going to show you so in SNAP. One thing that we have, when you open it, this is kind of pure vanilla SNAP. You can, here's a block, you can move something. You can also kind of stick blocks together like these puzzle pieces. There are different kind of categories with control structures. You have kind of these control structures. This is a repetition. So when you have kind of four times this, and you click on this, everything is live. It moves. You can see what it does. You can, you have a drawing pen. You can get the pen down, pen up. So now you can draw a square. And always the question is how you get rid of it. So there's a clear block. You can click on these things. You don't really have to think that much ahead to do something. And it's called build your own blocks, because you can build your own blocks. So you can make a new block. I want to make a block in the motion category. It's a square block, and it has as input a size. It's really hard to build a block with one block. It's really hard to build a block with one hand. And so I'm getting this editor. I can say the size should be an input. It should be kind of a number. So now I can say, oh, this part is this thing that makes a square. I can drag this in here, and the size is an input. And I can say the size is how much it should move. So now I've made my own block. Wait, did I make my own block? Or did it? Oh, look at it. It didn't work. OK, see, this is the fun of doing it with one hand. OK, you guys, this is a collaborative coding environment. Let's try this again. Oh, I need to even make size and input again. So as we were just hanging out with a bunch of teachers is that the repetition is the mother of all learning. So now I have the square block. Let's try to make a square of 100 again. It works. And then we can do things like, now we can build around this. We can say, OK, let's do, let's clear this, and let's maybe say, OK, i times 10. And we also want to turn a little bit, like maybe 36. And we can kind of build things. This is kind of the technology part that isn't really new. It's great. We love it. But the thing is that we want to build blocks, but we also want to build ourselves. We want to interact with our surroundings. And so we want to look at data. We want to look at the world. And through programming, as I think Mitchell says it, learn to program and program to learn. But not just learn about technology, but learn about something else. So here's a little example of that a while ago. John and I were also working on and we're thinking about how we could express things that are really important. And here's a little puzzle. You can make an educator. You can kind of make a project and give it to your students. And in this project, I've downloaded a bunch of data from the internet, from GitHub. This is the population data of 195 countries over a span from the year 1800 to today. It's the population development. It's a lot of data. I've also downloaded the life expectancy data. And for some periods, there isn't one. But it's the life expectancy in each of these countries. And also, the gross domestic product adjusted for inflation and broken down per person living in that country. And that might tell us something about life conditions and how life conditions in countries have evolved over time. And some of you who can read it probably already know this. There's the GapMinder project by Hans Rosling. Who knows this? GapMinder. And so we thought, it's great to work with this. It's great to look at it. And we thought it's also really great to use coding, not just to use it, but to interact with the stator yourself. And so what we like to give people is an imperfect project that has something like here. It's something that already has a slider built in. And you can configure that slider, kind of give it some numbers. And then it emits a broadcast, like the slider changed. And now we could say, OK, in this project, like when this receives the slider changed, me, we might want to hide it. And we might want to go to some coordinate, like go a little bit to the left and down. And we might want to set a color to something light and write the value of the slider. So we're going to query the value of the slider. But it should be big. Let's make it big, probably at some point, before we also want to clear it. So now, as I move the slider, I'm getting a readout of the numbers. But I want it to be the numbers from 1800 until now. So I probably should have some kind of formula in here where I say it's not the value of the slider, but the value of the slider. Plus 1799. That's about as much math as I'm allowed to do. OK, so now I have a sort of a timeline, an interactive timeline. And now I want to kind of map all of this data and sort of scatter it in there. So I'm going to make a new, I'm going to draw a little bit of data. So I'm going to draw actually a new object. I'm going to use a vector pen and draw something like a, oops. Sorry. Draw just a little dot, which is about my drawing things. I'm going to call this a country. And a country is going to, I'm going to add another variable. That's just a name. So it's sort of a, OK. And actually, I don't need to show that. And so when the green flag is pressed, I can just set the name to nothing. And actually, what I want to do is now I want to do something for all the countries. So I can take out, so if I look at all the countries, they're all in the first column. So I could say, OK, this is the first item of the columns of this table. And as you can see, it's always live. So these are all the countries. So I can loop through them. We could say for all of these, I want to make a new clone. So I'm telling a new clone. I just want to assign the name to whatever is in that list. And I want to do this really fast. I don't want to wait for it to take long. And when I'm done, I want to do something. I want to broadcast that something has changed. So I'm going to broadcast the slider change. And now what I want to do is that when each of these clones receives the slider change, it should align itself to the data. And here's something that we're saying, build your own block. So here's a block that we made that you can already use. So we give this block that arranges a record. So the record here is indicated by the name. And we're selecting in these records, we're selecting the year indicated, which is dependent on our slider. It's the value of the slider. And so we want to map the wealth, kind of, you know, the money. We want to map that to the x-axis on a logarithmic scale. The life expectancy, we want to map to the y-axis. And the size of the bubble, we want to map to the population. And let's actually try this. Do you think this could work? OK. Wait, I wanted to do one more thing. I wanted to make them a little transparent. So we could set the ghost effect to something like 60. So we see the ones that are underneath. Let's try this again. Yeah. OK. So here's a map of the world of 200 countries in the year 1800. And now we can see some interesting things. As we move the slider, we can see how the countries develop. We can see how life expectancy rises, how kind of things are distributed. Now, it would be fun to see which country is which. So we could also add some other interactive elements, like we could say, when I am, mouse entered, we could say, you know, say my name. Say my name. Say my name. OK. Say my name. When I'm, again, when the mouse goes out, like when the mouse is departed, just stop saying whatever it's saying. Say nothing. Does it work? Yeah. We don't have to restart it. So this is China, the big thing. This is the US. And now we can do interesting things. Like now we know this big blob is China. We can see what happened to China. Whoops. Here there's a famine in the early 60s. Or we can see, we can look for other spectacular things like right at the beginning, like here in 1880. And there's a problem in Tunisia, the last case of the Black Plague. And with all this data, it's already interesting now to use this and to use Google and to find out what happened at certain periods of time. For example, we can go, let's go to the early 20th century here. 1904, the first genocide committed by Germans in Namibia of the 20th century. We can see how, oops, here you probably know this. It's not World War I. It's a drop in life expectancy almost everywhere except in Denmark. Nobody knew that before COVID. Nobody knew that before COVID. Now everybody knows it. It's right. That's the Spanish flu, which wasn't Spanish, but it's the flu epidemic. And so this is what we're talking about. It is as much about looking at data, finding out how to work with tables, how to model things, but also about really discovering things that are fun to discover with a computer. This is much more fun to discover with a computer than with a textbook. And it's also way more interactive. And even the learning can be more self-directed because now I want to know what happens at this bubble at this time. And so it's about building blocks, but also building knowledge very much in the constructivist, constructionist way. And so one thing that's really cooking that we're working on right now for the next version is, so with these abstractions, we're building up. We're sticking together blocks to build up abstractions that let us do more awesome things with less blocks. Now the question always comes, especially conferences like this, but at the core, at the bottom, there's got to be some real language. And the real language has to be text-based because everything is text-based. And even artificial intelligence uses large language models, so obviously there must be something to textual language that is very powerful. And it is. But here's something that we're working on. And we've had several projects. John and I were working on a project where we tried to make a, and we actually went pretty far to build a block-space language that was completely written in itself. Snap is not completely written in itself, but we're trying this. Look at this. This is something that's in the development version. Here's a new thing we're working on. It says blocks all the way. And if I click blocks all the way, now I can look at all these blocks that are called primitives that are actually written in JavaScript, but I can edit them. I can edit them and I see blocks, how they could be written in Snap itself. And here's a block that is a primitive block, which means we're actually, we're calling up a native JavaScript function. But we could turn this off, and then it would run this, and it would totally do the right thing. And if you actually look at this code, you might be astonished because this is probably not what you were expecting. It has sort of a NumPy APL-ish way to deal with coordinates in a way that uses vectors. And it's really fun to check out how things work. Like, for example, here's the glide block. So there's glide written in Snap itself. There is, of course, if an edge bounds, and this is way more complicated. But we can even go to other things, and like, obviously, the control category is very interesting. Now, we can look how forever loop is done. If you look at forever loop, it's implemented recursively. It's a higher order function that calls the function it is given, and it calls it recursively. But, or here's how repeat until would be, and repeat until is interesting, because it also uses if. Now, what about if? So here's how if is done, or things like not. And if you look at if, if again uses if else. So it's almost sort of like a not really infinite, but kind of it goes not all the way down to the metal, but it goes down some more subterranean stories. And what we're hoping to achieve with that is to let kids and learners explore the system, and find out it's not about one language versus the other, but it's about how to express your ideas and how to do these. So let's actually try to really find something, which is we're at a hackers conference. So what if we said, you know, move? Let's break move and say, okay, I don't want to go to something, I don't want to use the primitive. Actually, I want to see how it moves. I want to glide something like not so long. And so what I'm going to do is I'm going to run the glide block. I'm going to leave the coordinates empty, which turns it turns them into implicit parameters. I'm going to put in this vector in here. So now I've really fine move. And now if I run this, I can see that I changed the way how move works. And I've now made it such that I'm using glide instead of going to. And this is going to be fun because it sort of gives you agency to even change the way snap works because all of this is now editable. All of this is really a system that is malleable. So this is what it's about. Build your own blocks. I'm now going to hand over to Yatka to tell you some more about our kind of the pedagogy and the kinds of things we're working on. Yeah, thanks. So we do this thing together with people at UC Berkeley. And it's also a lot about education. So it's used at universities around the globe and also in a lot of schools by now. And if you have a school that you want to use, snap, feel free to reach out. We're always looking for more collaborators. And as you already saw, you can program snap in different ways. And that's also something that's really important to us that we can. So there's not one way to the solution, but there's several ways. And we also accommodate the boring ways, like using all the four loops that kids are required to learn in school. But we also want to elevate the mind to new ideas about programming, like the stuff that Jens has just been showing you. And I just wanted to give a short, can I close that? Oh, I'll go over here. Yeah, great. So, no, not right now. Thanks. So I wanted to do a short, oh, have we tried whether we can record something on the thing? No, but we'll try. OK, we'll try. I record a sound and see whether it works. Hello. OK, let's try to play it. OK, it doesn't. Good. Ah, OK, it comes out of here. Yeah, OK. Hello. So we can record sounds. This is something that's also really important to us that we can extend projects beyond the working with numbers. So I personally am not a developer. Like, I didn't study programming. I am a developer now, but I studied biology. And until I was 25, I thought programming is, can I say that shit and boring? Then I tried it out and I used Snap for it and it is actually fun. I mean, you guys all know that, but it's awesome. So I'm not that much of a math person. So I really love that you can extend Snap using media, using data. And now I want to show something using this recording. So you can access the samples of a sound file that I recorded by using this block from the sound category. So you can see we have different categories here with differently colored blocks that helps to structure the programs and also to read the code. And if I click on that, it's also a list with 51,840 samples. At the beginning, I didn't say anything, so it's all zeros. But then when we move down, we see that it's negative values, but very small negative values. So the samples are the amount that a membrane swings either to the left or the right. In our ear, for example, or in a speaker or something like that. And now I could try to modify that sound. So let me grab one more recording that we can always play it. And I could do it the traditional way using the for loop that I've just mentioned. Because as Yen said, in German schools, kids are required to learn loops. And this is really important. So for I from, we start lists with one. So let me start at one for I from one to the length of the samples of my recording. I want to do something. And the something I want to do is let's make it louder, maybe. Let's try this. So I want to create a new sound. And I call this new sound. And I set the new sound to an empty list. Okay, now maybe I do want to. So I set the sound to an empty list. And now I want to add stuff to my sound. And what I want to add is the value that I had before. So item I of the samples of the recording. And I want to multiply them with the factor. So increasing the number that's in the in the sample makes the sound louder times. So let's try times five maybe and hope I don't fry anything. And then I want to add that to my new sound. When I do it like that, it's pretty slow. So as you see, this runs rather slowly. So we have this, what we call the warp block that just speeds up things. So let me wrap this around here. And now I can have my new sound. And I hope that it's louder than the one before. So this was the one before. Hello. Can you even hear that? Hello. Okay, now let's try the louder one. Hello. Definitely louder. So we can use that way to change media. But we also want to support, as I mentioned, other ways of thinking. So what we have down here, and this is where the lambda that Alonso has, comes into play is the map block. So this is a higher order function, a function that takes another function as an input. And we represent this function with these gray rings. This is like a lot of the rings, one ring to rule them all metaphor. So this is what gives the power to one of the powers to snap. And we can use data here in the second input slot. So what I could use is again, let me just duplicate that, the samples of my sound. And now I can add another function down here. Let's do the apps function maybe. So this gives back the absolute value. And now I can play that sound as well. Hello. So you can hear that I sound more like when I don't have air through my nose. So you can make sound effects like that. And then the last way is the one that Jens already mentioned, and I didn't come up with the third effect. So I'm just going to use apps again. You can just drop lists like vectors into functions directly. So I could just drop this one in here to create the exact same effect. You could use floor. I could use floor. What? Let's try that. Beautiful. Okay, so this is just, I think this is just negative ones, zeros and ones. So with only three different values, you can still kind of understand the sound. Because what I said, because the pattern is represented by the way the samples are arranged. And not by the actual values in the samples. Okay, and you can also do that with more complex data. So we can also access the webcam from Snap. And I can try the same thing with the camera. Maybe I can try the same thing with the camera. Maybe we'll just unplug this real quick and I take a picture. Okay, we broke the webcam. So let's leave it at that. So let's go over and I tell you you can do it with graphic effects. And if you want, we have a workshop later come by and I'll show you how it works with graphic effects or photos. Okay, and then stuff that we are currently working on is AI. Since this is the big thing and the schools wanted, universities wanted. So we had to come up with resources that they could use to teach artificial intelligence. And I wanted to show you one that we've developed last year and it's available in German and English. You can download it from the internet. And it's a detailed walkthrough guide on how you can use that in a classroom setting. And how to program the whole thing, some ideas on what you could do with it. And we call it grant gestures. So it's a simple gesture recognizer program based on the $1 gesture recognizer. I don't know whether you know that, but it's a prototyping gesture recognizer for Unistro gestures, things that you can draw in one line. And I already prepared something like a TV cook. So I have this project here already and this is a simple drawing program. When I click on the stage, so we call this window here the stage, then it'll broadcast the word sketch. And when I receive the word sketch here, I'm reacting to that and I can actually draw something. Yeah, since some of you are sitting really far away, let's just increase the pen size a bit. Yeah, so I can draw stuff here. And what I already also prepared is I'm storing values in a variable. What you can see here is I have the sketch variable that I can also show. It's 164 points and it's the position that my sprite, the object that I'm programming went to. And now I also have this examples variable here where I already stored a few examples and this is always a path and a word that's attached to this path, so basically a label for that path. And now I want to create a few more things and then we're going to animate them. Can you hold the microphone again, please? Thanks. So to create an animation, let's start with the animation. We gave you a block here that we call animate and here you can also see again one of the awesome things in Snap. You can make your own control structures. So this is a C-shaped block like control structures, some control structures look in Snap. And this is a custom block that runs actions that you can put here into the C-shaped input slot and we made this block. And what it does is it takes what I've drawn and puts it on as a costume. So the costume is an image that the sprite is wearing and then it does something. So when I draw a heart, I want it to have a heartbeat. So increase the size a bit and then decrease the size in one step. So increasing the size, I do with the change size by 10 block and I do that 10 times and then I reset the stage. And since hearts are bumping like twice, I want to do this two times. And if I put that into the animate block, you see that I can do that with the drawing that I just made. So I can draw something and it takes this actual drawing and does something with it. And now I want to trigger this reaction whenever I receive the message heart. Okay, this is what's supposed to happen, but now I need to identify this heart. So we want to find out how these paths work and to see that I can render what I've drawn. So this is also a block we prepared and I can just put my sketch in here and I can render that and I see that the points through my path are not very well distributed. So I draw really slowly at the beginning and the end and then I was really fast here. So to really make them comparable, it's important that we have, we normalize them. So we have this resample block here and now I could resample my sketch to 64 points. And this evenly distributes the path between my, like the points on the path. Yeah, so now I can use that to train my algorithm here a bit more on my program a bit more. So here this is what I've drawn was a heart. So let me draw a new one and I can now add that to my examples. Let's add another one. Okay, and now I need to recognize this heart in all my examples. We also prepared a block for that, but you could build it yourself if you wanted to. This is the recognized block. And this recognized block looks for the smallest difference between two shapes. Like it's measuring the distance between the first point in the first path and the first point in the second path. And then adds up all the differences between the points and reports the one that has like the label of the one that has the smallest difference. And we can just use that as an input to the broadcast block. So let me just show this real quick. I'm recognizing my resampled sketch which is the heart that I've just drawn in all my examples. And since this is the heart that I've just stored, it should report back the heart which it does. I can try another thing. So this is also the heart. I also wrote down FOSSTEM. So let me try to write FOSSTEM. Writing is really hard on a touchpad. So this reports FOSSTEM. Great, seems to work. And now I can just broadcast this thing. And we want kids, people who use that to tell stories with it. So this is supposed to be an interactive storytelling project. And the story that I just came up with was, how did it work again? Let me check. Ah, yeah, okay. The weather in Brussels is not really nice. It's raining all day. And I'm sitting in dark buildings all day. But still, I love to be with you at FOSSTEM. Okay. So this is a resource that we've been working with again to also inspire people who might not be the traditional audiences for programming. But it's also pretty cool if you are a programmer and love math, you can still do stuff with it. And now I would hand over to Jens to tell you what we're venturing for next. Thank you, Jette Gah. I also love to be at FOSSTEM. This is so cool. So you might see, you know, this isn't really about an algorithm about using AI, about using a large language model. What we're trying to do is to sort of at least lift the lid a little bit to let you see a little bit underneath the hood. So for us, it's not about upskilling youth to be employable, but it's about bringing across a sense of awe and wonder about what you can do and maybe letting you reflect about things. So now, since it's been mentioned for two years, generative AI is this big thing with chat, GPT being everywhere. And it really boils down to, as we've seen before, language, even textual language being the basis for everything. And we thought, well, yeah, that's nice, but it's, we love language, but we also love structure. And so one thing that we've tried to come up is with a activity that is more on the basis of these language projects, which really is a next token prediction system. And that might lead us up to experiencing and learning something about the really generality of AI. So it's all inspired by a wonderful little project. I have to kind of give them credit by Michael Hilsche from Peha, Schweiz, who, this wonderful project, Zykia GPT, you all have to look at it. And so this is what we're trying to do is now build something like chat GPT ourselves and on little data. So we don't actually have to use chat GPT. So here's something I scourged the internet for 30 fairy tales of the brother's grim. And here's the English version of these 30 fairy tales. It's not a huge corpus, but 30 fairy tales. And it's just a text. And in order to work with this and turn this into an AI, I need to split these 30 fairy tales into a list of words. So now I've got a list of words like 58,000 words. And so sevens, waybians were once together. OK, so just a list of words doesn't give us a lot. So in order to use this in a language AI, we have to do some statistical analysis. And the way we do statistical analysis is by we're grouping these words by their sequences. They're called like pairs or triples or bigrams, trigrams, tetragrams. So, you know, it's build your own blocks. So I'm going to, can you please hold again, I'm going to make a category. Wow. I'm going to make a category that's called, be called generative AI. And I want to build this one function that I'm using. And it's the, it's going to be a function. It's going to be the n grams of a corpus. And n is going to be an input. It's going to be a number like two or three or five. And the corpus is what's going in the language thing. It's a list of words. So what I want to do is I want to get the numbers from zero to the length of the corpus to go through all of this. But I don't need the full length. I can decrease it by the n that I'm looking for minus one. Now I want to take this as an input to map. So I'm taking these numbers and for every of these numbers. So when I have a list, list of the numbers one to ten, item one is the number one. But if I put in a list of, of, of items to check, like three or six, I'm getting a list of, of individual items in there. So I can slice my input. So what I want to do is I want to get the item, whichever the number is, the numbers, whichever the numbers is, two n minus one of my corpus. This is what I want to do. So this is the function, the n-gram function. Let's actually try this. Here's the n-grams, like let's, let's go to the bigrams of my 30 fairy tales. Click on this. See, oh, seven Swabians. Swabians were, were once. You get it? It's kind of broken up. I can also do this for four. So now I'm getting seven. Swabians were once. Swabians were once together. You get the idea, right? So, so, so this breaks up the corpus into these little sequences of words. Now I want to do a statistical language model. So to do a model, I'm going to make a variable called model. And what the model should do is the model should be, let's actually hide this, should be several variants of this, not just bigrams or trigrams, but let's go all the way to five. So I'm going to say the model is going to be another map block. Map the n-grams. I'm going to leave this blank over the numbers from one to five. So let's actually just run this once. So now let's look at this model. Now this model is a kind of weird looking table. If I format this a little bit different, you can see it's a five item list. The first item is like though. Unigrams, bigrams, trigrams. Five is great because tetragrams. That sounds like diabolic pentagrams, right? Pentagrams. So it's a sort of a cascade of a model. And I can also try this like the item one of the model, you know, is a list with one, the item three of the models list with three. Okay. So this is really the heart of a statistical language model. Because now, for example, let's take a list. Let's take a list of words. Let's say the king's daughter. So we want to see what's a good way to find out how to complete a sentence that starts with the king's daughter. So we could look for, you know, in all these, in these four grams, whether anything starts with the first words like, oh, the king's daughter and then find out which words come. Let's actually do this. So this is keep in my model of four, something that is equal to the king's daughter. And we want to find out the items with the numbers from one to four. Okay. Let's, let's try this. One, two, three. You see. So we see, oh, there's a bunch of sentences. The king's daughter came is began, love said again came. So you see, there are several times sometimes the same thing in there. So we could say, okay, we could just take, you know, the last item of a random element of these things that we get to complete that sentence. So we can say the king's daughter came, the king's daughter laugh. And this is basically finding next token, something that has been used in that context before. So I've made a little block for that, which I'm going to import that does that. It's the next token block. It's literally just that. And now we could build something like chat to PTRself. So we could say, okay, when the green flag is clicked, ask. Enter the beginning of a tail and wait. And then when the user enters said, we're setting, oh, we need to make a new variable. That is going to be a tail. And we're going to set the tail to what the user entered, which is the answer. And we're going to split that by the word. And then what we want to do is we want to take the next token in that tail based on the model. And we want to add that to the tail. And then we want to say the thing, right? So we want to say the text of the tail, that just so we don't see a list, but we see a nice text. We want to say that. And we could say, okay, how about we do this when I receive next. And here we broadcast next. And then whenever the user presses space key, we also broadcast next. Okay. So, well, let's try this. Does it work? Enter the beginning of a new tail. Once upon, ah, once upon a time. So now whenever I press space, it is continuing this and it creates fairy tales. Sometimes it'll stick to one fairy tale pretty long, but since it only has a context of about five words, it keeps forgetting which fairy tale it's in and then just finds something else that is plausible linguistically, but maybe not from the story. And so you all know, right? This is not how chatGPT really works, but it's a statistical language model and the similarities are actually striking. This is what they call a Markov text generator, Markov chain text generator. And it has like a GPT model, a transformer model is just this except a little bit more complicated. It has a longer context and it has some neural networks that makes itself aware, so it kind of doesn't just take the last n words, but kind of knows, has a little more memory and distinguishes, which things are more important, but it's literally just making up stuff. It doesn't have any idea about the language it is written in. So when, for example, I take 30 fairy tales in German and instead of these 30 fairy tales in English, I'm doing the model in German and now, yeah, I'm just saying, you know, I don't care. So now it's going to speak German, right? Because it's been trained on these German fairy tales. And so it's not about language, it's about statistics. And so again, we could think that this is all about language, but at the core, it's about finding things that correlate with other things. So we thought, well, wouldn't it be nice if this were also a good pedagogical model for an AGI, for something that is more general, not really super intense, but more general. So there's lots of sequential data. One thing that I think every hacker loves, I don't know, is music. So here's, I transcribed 20 songs. And these songs aren't just words, they're notes. And so the notes are, you know, there's a pitch, a mini-pitch, and there's a duration, how many beats. So what if instead of these words, we just took these notes and shopped them up into a bunch of n-grams and, well, okay, so I prepared this little script that does that. It's the music improvisation script. So this takes the 20 songs, chops it up, uses the exact same blocks, like the n-grams, and the next token block. The exact same thing. And remember, the data is differently dimensioned. It's not a single list. It's a multi-dimensional list. I'll try whether it does something. So you sort of see, it's like me whistling, you know, ah, there's this, oh no, there's this other thing. Let's kind of go and do some funny associations. But it's already beyond language. It's already generalizing the principle of finding the next meaningful token. So we thought, okay, wow, this is nice. So what about pictures? Could it work with pictures? I mean, pictures aren't sequences. They're a plane with multi-dimensional. But you know what's a sequence. Sketching something is a sequence. So we thought, why don't we take the sketching thing? But instead of mapping coordinates, we remember for each line segment the direction it went to. So let me again kind of write something like an honor of, so, hello. So now I have the sketch in this model. And let's try to find out if we do the exact same thing that we did with the music by now doing a sketch program, training it on just this little data. Oh, it tries to imitate my handwriting and does something that sort of looks like I'd written it. And there's, it's going to be different kind of the next time. And it's sort of fun. It is finding some meaningful next tokens. And at this time we were really, really having fun. And we thought, well, what if we don't just write something? What if we, like Bernat's idea was, what if we, for example, draw something that already kind of makes sense? Like, let's draw something that looks like a tower, something that looks like a castle. Whatever, a roof and a moat. So here's a little thing that I drew. Let's try to find out how we can do a skyline. Isn't that cool? And that's almost, you know, a glimpse of, it doesn't matter about language. The secret is to seek when you're in a building. You can't get out of the building. You can't get out of the building. You can't get out of the building. You can't get out of the building. You can't get out of the building. You can't get out of the building. You can't get out of the building. You can't get out of the building. You can't get out of the building. You can't get out of the building. You can't get out of the building. No, you can't get out of the building. You can't get out of the building. like you want to edit it. How do you visualize what change? This is a list of numbers. What do you want to do? Do you think the delta between the change of your graph and the numbers? How do we do the software engineering spiel? We don't teach children about version control. We teach children powerful ideas. There is a version that is called SMURGE. We are mixing up the graphics towards refining the code. Do you have examples of the SNAP? We have the gesture recognizer including all the materials you need for school. We are working on the SNAP GPT thing that will be published in a few weeks. If you want, you can search for projects like the grand gestures. Here is the project. Go to snap.berkeley.edu and use the search bar. I am not aware of any other block space programming language offering the notion of procedure as data. We have data. We are more like researchers. We do not teach kids lambda. We use lambda to build blocks that kids can use. In higher classes, there is a curriculum that is used as an advanced placement course. We will use higher order functions totally. We have a high school version of that. They have a middle school curriculum for seventh graders. Thank you everybody. See you next year.
Youth Hacking 4 Freedom
Hi, I'm Bonnie Merring from the Free Software Foundation Europe and I have five minutes to present you packing for freedom to you. This is due to me because I'm a very last late approach and yeah, it's a bit of my fault, but I'm very happy that the Boston people made this possible for me to present you packing for freedom coding competition to you. So what's it all about? You packing for freedom is a programming competition for teenagers that will come to their eligibility criteria in a second. Basically you develop your own project in a period of six months and in those periods you work on this project by yourself or in a team. You can always ask for help and they have no limitations to the project ideas. We know this is quite a challenge. And we are always happy to help with coming up with a project idea and you can always approach us during this time. The only requirement is that it needs to be free software in the end. So you need to publish it on their free software license. We always help with this as well. So if you have questions about licensing, you can always approach us. And yeah, that's basically the idea of you packing for freedom. Now I come to the eligibility criteria and who can actually participate. I think not all of you are in this age area but you might know people who are or teachers so feel free to share the information. People between 14 and 18 can participate from Europe. Not just you. You don't need citizenship. You need to live here. So also refugees are welcome to participate. Sadly, we cannot help with hardware in this case but maybe the local Hexpace can help out. And you don't, as I already said, you don't need to have the European citizenship. So be free to share this information as I now see a bit of older phrases than 14. And this is where they can register and you find all this information also on ufackingforfreedom.org. And now about the programming period. It already started for this year but we will have the next year as well and we are still open for late registration. So people can actually participate or join us during the whole time of the coding period. Then they just don't have that much time as the others. So and we know that. We take this a bit into account. Yeah, and there are already around 90 people who have joined up for this year's coding competition. So usually some of them drop out during the six months period. And yeah, this is how the period looks like. And during this time, we have monthly calls. So we try to give them some input about free software, about how you actually work on projects. There's an expert from our Shuri who tells them more about the whole topic of Git, for example, which is quite a lot in procession or how to actually do licensing. So we are always there. We are approachable. And in those monthly calls, the teenagers also have the possibility to actually talk to each other in breakout rooms, to chat with each other and to connect to other young people from all over Europe who are actually also interested in programming and free software. I heard from some participants from the last two rounds that this is really an amazing thing because they are actually building friendships all across Europe. And then they visit each other and they form teams and they collaborate. So it's really the spirit of free software in it. And yeah, that's one of the things that we offer monthly calls. Every month we have this and we have, of course, also a lot of other communication channels where they can talk with each other. And yeah, as I already mentioned, we help with legal and licensing questions, general question motivation. We don't help with the programming, but they are always like it's free software. So they, in the spirit of free software, it's helpful for them to ask others and they are welcome to get help if they need it from their teachers, from some other resources. So it's not about they working on their own and if they are stuck, they are stuck. So we try to help as much as we can. And it's mostly about the fun. It's mostly about this and this is very important for us and that's what we try to keep alive with the monthly meetings. Now for the prices. So this is really important for the youngsters and I know that. So as much as there is fun involved and as much there is programming involved, we also have some pretty much nice prices as I would say. The first place can get 4000 euros, the second place 2000 and third to sixth place still get 1000 euros. There are three special prices that we award for example for people from a war zone, people from who are really young and who participate. Yeah and that's how the sherry then gives out the prices and evaluates project after the programming period and that's what I get. And yeah, you can follow us on social media. That's basically it and I'm pretty much on time. I have one minute for questions if there is any question. Here's again the registration link and yeah. No questions? Perfect. I'm at the FSFE booth around there so if you want to approach me there and get more information, get stickers, get posters, come to building K level 2. Thank you very much.
Live coding music with MicroBlocks and microcontrollers.
So, hi. Welcome to live coding music with microblocks and microcontrollers. This is Bernardo Mavosa. I come from Barcelona. And I work as a Snap and Microblocks Developer at SAP together in the same company with Jan Sanyadga and together in the same project with a lot of people here in the room, like John who is the creator of microblocks and so many other things, like Turgut, like Cathy, and of course Jan Sanyadga who also contributed to microblocks. For the purpose of this talk, I never mentioned this, but for the purpose of this talk I will mention I am an ex musician. I used to play in a reggae band for a bunch of years and recently I got into making music again with microcontrollers. So, I guess that makes me a musician again, but I don't know, you'll have to judge after this. And you can find me here. So that's who I am. And next I guess I should talk about what microblocks is in case you didn't attend. John, Cathy and Turgut spoke yesterday. So microblocks is a, well, first of all you can find all about it here, but I'll brief you in. So it's visual blocks in the same way that Scratch and Snap are. And I mentioned these two languages because these are the closest relatives. The blocks are one thing, what's behind the blocks is a different thing. The blocks is how we present the code. But what's most important about these three languages is that they are live, they are parallel. And that's usually not mentioned when we talk about blog space programming. We put them in the same back. But that would be like putting all text-based languages in the same back, right? They're languages that are interpreted, others that are compiled. So yeah, it's like Scratch and Snap, but for microcontrollers. These are small computer-like devices. It's a live language. It's an autonomous language that means that the code runs actually inside of the microcontroller and you can unplug it at any time and the code will keep on running. And it's a parallel language, meaning you can have multiple threats running at the same time inside of the microcontroller. And it's a portable language, meaning that you can run the exact same code in several different microcontrollers, even of different architectures. We even have versions of what we call the MicroLogs Virtual Machine running on the browser. So it's a very portable piece of code. And since the talk is about live coding music with microcontrollers, I thought I should talk a little bit about what live coding is in the context of making music. Live coding is programming things in real time, right? Well in the case of music, music and digital media, because there are people who make visuals live with programming, it's programming digital media live in real time for an audience, like you. So you may ask, okay, but how does one make music with a microcontroller? And the answer is, well, it's easy. You just generate signals, right? A microcontroller can generate square waves, can generate sine waves, and your answer obviously will be, yeah, okay. But what about music? Like it's 2024. Like what about actual music I want to dance to? And yeah, so the idea we had in that regard is let's use devices that already know how to make music and let's interface with them, okay? So reducing media for that. And without further ado, I will try to make some music for you today. I wanted to explain what I was doing as I was making the music. I'm not sure this is going to work, because I have a pretty sketchy setup where I connected my synthesizer to this thing here that's beaming audio to the back there, and then I don't know if I'll be able to do the two things at the same time. And I don't know if I'll be able to hold my microphone either. I'm really good at it. Maybe, yeah. Let me try if it works first. That's a bit too loud, is it? Is it good? Yeah, okay. Okay, so the first thing you need when you make music is obviously a little bit of a drum bit. So what I'm going to do is forever I'm going to wait until the next whole note. I apologize a bit for those of you who don't have any notions about music, but this will kind of make sense. I will wait until the next... Oh yeah. Is that good? I won't be able to fit as much music in it. And then I'll play some drum. Let's say I'll play a bass drum. And for how long? Well, for a quarter note. And we'll keep it a little bit. There we go. That's not enough for dancing. I mean, you have to be... Then at the next half note, I'm going to play a snare. Okay, so what do we need next? We need some bass. So we'll pretty much do the same thing. After every whole note, we'll arpeggiate... Let's say we'll arpeggiate a major chord, because I'm feeling happy. And we'll do it at... Yeah, we'll do it at the default configuration. And of course, this is going to sound not like I wanted to. No. So what I'll do is I'll select a different instrument for this MIDI channel. MIDI is organized in channels. And I'll select the bass number one, for example. Let's see what this sounds like. This is a bass, but it's a bit too high-pitched. So let's bring it down a bunch of octaves. Much better. But of course, it's got lower pitched. It also got lower volume, so I'll make it loud, as bass should always be. No, but bass is indispensable. Let's get a different bass. These are all the same bass. Okay, so we'll leave it at that. What else do we need? Well, we can already dance a little bit to this, but it'd be interesting to, you know, get some chords going on. So we'll basically use the same thing. And instead of arpeggiating, we're going to play... Let's say we'll use the major chord on channel two, where we'll obviously select a different instrument. Yeah, it should be good. Let's see how it sounds. Okay. You have to understand this is my set of instruments that I collected from different synthesizers and sound funds. So what I call a piano is not what you might think of as a piano, but this is my favorite piano. And I want this to play for a quarter note again. Not as loud. Piano is in the background. That's okay, but a little boring. Let's go three-step sounds up. Okay, so you can see a little bit how this works. I usually faster explain it at this point. It's starting to become a little boring. So... Can I start the voice down? It's a little bit of a volume up there. So at this point it starts to become a little boring. And you can choose two things when you're doing live coding and then it becomes boring. One is you will raise it and start the sound. The other way is to start to shuffle things around. And the other way is to add the audio stuff that was going on. No bass lines. I will add one more thing and then I'll start with what we have on screen. I'm sorry, but I will have to make it a little bit smarter. Because I don't want to be a little bit more boring. And what I'll do is I'll make another... ... ... ... Can you not hear? We can change the order of our bass line or we can... ... ... ... ... Okay, thanks to top it off. I guess that will be the end of the improvisation. We will improvise over a steel that might work. A major steel. And we will do one channel with a steel and we will decide what we want. We will put a new steel in there. And I'll do that in the next video. Not every... Okay, as you can see the soundfont that I'm using is kind of a retro one. But the next thing I'll do is using the external synthesizer. I can change the soundfont. Any point. And of course I'll also set up my instruments again after the chorus. ... As a disclaimer, okay, thank you. As a disclaimer, I had no idea what I was going to play today before I jumped on it. That's... I don't know, maybe a temerity. But that's how some people like to do it. Other people will prepare a session and have ideas in their mind. Okay, to be honest, I tried to prepare, but I realized I didn't have any way to output sound from my synthesizer. So I was in my hotel room and I realized I couldn't prepare. So... But I wanted to do it like that. That's pretty much it for this part. Oh, I forgot to say it's like it's 2024. So if you like this and if you'd like to try it out yourselves, I wrote a little activity in our Learn website where... If I can find it, yeah. Where I explain in quite a level of detail how this works, what kind of device you need, both in terms of what microcontroller and what synthesizers. If you already have some old synthesizer keyboard that has a MIDI input, that's the device you should use before buying anything. There are things that are quite cheap. You don't have to get a 2000 euro synthesizer. You can get this 80 euro little board hacked together by this independent hacker somewhere that sells them on their website. And I explained how to find these weird devices in here. There's of course a connection that you will have to do from your microcontroller to your synthesizer to get it to receive MIDI and make music. And I explained what MIDI is all about. I explained what the blogs in our three music production libraries are all about. You may have noticed we have these three libraries here, MIDI, Rhythm and Scales and Chords. Basically the MIDI one is the low level one, the one that talks to the synthesizer and abstracts all the MIDI messages and so on. So it knows how to produce MIDI messages that ask a synthesizer to make this note for that channel and change this instrument. There's a lot more MIDI things that you can do here. Rhythm is how we are abstracting the notion of dividing time into equal measures and fractions of those measures and also synchronizing different threads that are doing that. So this and this can sound constantly synchronized. Because otherwise if you didn't use this, if you just used a weight block, which might be your first approach... Thank you. So your first approach might be, do you actually need this weight until next thing? And you may want to do it like this. But can you feel how it's not in sync? And it will get worse. It gets worse in time. So we certainly need a way to synchronize all these scripts very tightly. Because the human ear is very, very good at detecting even slight deviations in tempo. And well, we explain how to make a simple song and how to make a simple trump pattern, I guess I can. We're talking about melodies and chords. Can I lower the volume globally? No, I can't. Okay, how to make arpeggios, etc. etc. And we end up with something similar to what you've heard today, which is a trump pattern, a bass line. Oh, where is it? Here it is. So by the end of this lesson you'll be producing this piece of music with very few blocks. One thing that I haven't used this time is random. Random is really great at adding variation to your music to make it sound less monotone. And how am I with time? I'm way too early. Can you show Greg the music in the mic? Yes. No. If you have a normal operating system slash a window system at all, you can drag this stuff into your micro blocks window. And it will load the blocks that are embedded in the image because we are serializing them into the PNG metadata. But I use weird window manager that doesn't know what windows are. So I can't show you that. Then how can you help? One of the things that we need most help with is translations. And translations, we welcome one-time things like, hey, this is my language, I will translate your idea. I didn't show that, but the idea is translated into several languages that includes the blocks. So if I select a different language, now all the code is in that language. So that's great for localization. And we welcome this one-time translations, but we welcome even more some kind of commitment as to, I will be the maintainer for this language. And when you make a change, I will also update the language file. Translations are pretty easy to do on micro blocks. And if anyone is interested, I can show later how that works. We also need help with writing activities. We have this learn page that I showed before. And in here we have a bunch of activities. You can filter by language, by the micro controller board that you have, by components. Let's say you are running some class and you want to teach about, I don't know, how servers work. So here's the activities that involve servers. You can filter by level, topics, et cetera. We welcome people to write their own activities or to translate activities that we already have into your languages. We also welcome donations. Micro blocks is a project under the wing of the software freedom conservancy. And even though we don't need that money ourselves to live because we have jobs, we use that money for things like buying hardware for workshops. We use that money for things like paying a web developer to make nice websites. This is obviously not done by an engineer. I hope you can tell. Money is always good, especially if it's recurring. And lastly, you can help by using it and giving us feedback and spreading the word. Any questions? I know I'm very early. I expected to take much longer. Do you have a few examples that you could share with people? I don't know if it's good. So can you hold? I don't know if I have it in this computer actually. But, but, but... Okay, here's a sample. It's a very short sample. I think it's just one minute of a bunch of snippets of songs that I made a while ago. And I will stop them and explain a little bit what they're about. So this was the first one that I made when I started playing with MIDI and realized that this thing was actually possible, because I didn't know if it was possible. This whole thing came as a request from two members of the live coding community in Barcelona and a good friend of mine who is a professor at a local university. They came to me asking if we could do live coding with our languages with Snap or MicroBlocks. And I started working on... I thought, you know, music needs to be really tied. These are educational languages. We don't even care about things being in sync or being, you know, very time sensitive. So I thought, I said, I don't think we can make music, but we can definitely make visualizations. And I started working on that with Snap, and we actually can make very nice visualizations with Snap. And then I thought, well, microcontrollers are actually very tight time-wise. They are very fast machines and they have nothing to do. It's not like an operating system where things are going on all over the place. In a microcontroller, there's just, you know, the code that you put in it. So it should be possible. And then I started experimenting and that seemed to work. And recently I've started experimenting, trying to do the same thing with Snap, and I realized it's actually pretty tight and it might be possible to do it too. But that's not for this talk. And so the first thing I did, remember I said I used to be a reggae musician, was to make some reggae music. And that was sort of my... Then I tried to make some jazzy tunes with different time signatures. This is sort of a Latin rock, three-four time signature as well. Then this is... You tell me. Then I actually explored the signal generation thing that I was talking about in the beginning. And mixing it... I started experimenting mixing the sounds that you can generate with a microcontroller by itself. Like you plug in wires into the digital pins of your thing and you mix them together with a passive mixer into a 3.5 jack, like an audio output. And see what you can make with that. And also mix that with the sound of an actual synthesizer and see what comes out of it. And this is an example of... Not this. So this is a mixture of... This is my attempt at free jazz. This is nice. I think I have to show that. I don't have it here. So this is a synthesizer that has the chip of my first computer inside. It's a YM... Someone here knows the number, I'm sure. It's the sound chip of some computers like the Amstrad CPC and the MSX in my case. And someone has made a synthesizer that uses that very chip, the actual chip, not an emulation or anything. You open it and you talk and it's that old chip in there. And it sounds like this. This is only chip tunes using only the sounds that microcontroller can generate. Sorry. So the whole thing, the drum pattern, the bass line, the chords, everything is generated just with the microcontroller. These are square waves and noise that is attenuated to make things like snare drums and hi-hats and things like that. Again, this is only generated with a microcontroller. I don't know what this is. And that's it. So you're saying that section was just the microcontroller not sending it out to me? Just the microcontroller. So, I think that's the only chip that I can use to make a microcontroller. So you're saying that section was just the microcontroller not sending it out to me? Just the microcontroller. So, yeah, if you're interested in that, I didn't show that because I thought it, well, I didn't bring my passive mixer either. But, oh, sorry, let me go back to a language someone else understands. There's the chip tune library here. And I can't show you how that sounds because I don't have the pins connected. But this is going to generate sounds just with your microcontroller just by generating signals. And it's actually pretty interesting to see how this works. If you look at this, it's actually just setting a digital pin on and off at a particular interval and in a particular pattern. And that generates a kick drum that's very loud and like it sounds like a hardcore techno kick drum. Tom's are generated in the same way. And then on boards that have a DAC that's a digital analog converter, you can generate noise. And by attenuating that noise, you can generate things like snare drums or hi-hats. And the way you do that, they look very much the same. You just fade some noise for a couple of milliseconds, right? So you generate noise. Noise is just right. If you attenuate that in a certain pattern, you get... And that can be a hi-hat or can be a snare depending on a bunch of parameters, right? And that's how they used to make music for these old 8-bit computers back in the day. So it's really nice that... As Jan and Yalga were saying before, we're doing this also to explore things that we don't exactly understand. And I've had so much fun understanding how my old computer generated all this magic music. And there's also the square wave chords library that you can use. That will play a chord. You need to connect three different pins because a chord is a minimum of three notes at the same time. And this will generate different notes in each pin in sync. Okay. So any more questions? So the question is whether I looked into also receiving MIDI data into the microcontroller. No, I haven't looked into it. But we've talked about it with John several times. It would be really interesting to do the opposite. Today I'm using an Arduino Dway just because it has a lot of memory. And so I can... How do I explain this? So micro-box is constantly synchronizing the code that you see on the screen with the code that's inside of the board, right? So what you're seeing here is a window to what the board has inside, right? That synchronization takes a little bit of time, of course. And on boards that have a slower memory, this can be noticed when you're making music. And this board has a very, very fast memory. And also it has a lot of memory, so you won't run out of memory while you're life-coding and changing the code constantly. And so you won't force the VM to reorganize the memory every, I don't know, N-cycles, right? So you can life-code for two hours and you won't get any glitches. But any microcontroller will do. Yes, yes, the memory is a little bit slow, so you'll get slight glitches. But yeah, but any will do. So, software synthesizers? Yes, you can use software synthesizers. The question was whether you can use software synthesizers. The answer is yes. John implemented Web... sorry, what maybe not... USB MIDI for microcontrollers that support USB host, is that right? Yeah, yeah. So, in the MIDI library, there's a block that you usually don't need to use if you're using hardware MIDI... sorry, hardware serial and regular, like serial MIDI, that lets you choose on which pin to emit your MIDI, and it also lets you select USB. And it actually uses the same USB line to program the board and send back MIDI to the computer. And that works really well. It has to be a microcontroller that has a USB... Yeah. I think I'm hinting at that in the article that I linked to. I would make one question about the aspect of your... Do we do the previewing? Yes, sure. So, it's at learn.microblocks.fun, and it's currently... since it's the last one that we wrote, it's currently the first one. And it's going to be the first one for a while, because it takes time to write these things. Okay. So, the question is, I mentioned something about sound fonts. Sound fonts are a format for someone that knows more about this, is going to correct me. But they are a format for virtual instruments and virtual synthesizers. And they consist of a collection of samples. Those are like sound files. And then a collection of transformations that you do to these samples, and those in turn get turned into instruments. So, for example, you might record this and turn that into a drum kick, right? And sound fonts are a format that's really old, but has been used forever. It works great. It's a format that many virtual synthesizers, software synthesizers use. And what I'm doing here is I'm using a board that was developed by... I forgot the name. But it uses sound fonts. So, it uses a Raspberry Pi compute module inside. And you can load sound fonts into it. And of course, in sound fonts are an open format. You can edit them and modify your instruments and create your own. So, you would see a Raspberry Pi left a software synthesizer inside the sound? Yes. Yeah. But it's a nicely packaged one. So, it's actually a synthesizer that has the Raspberry Pi compute module as the power house. But it's actually a standalone synthesizer that you can plug your MIDI in and you have your sound output. What microcontroller did you program in the download? Didn't I say it was a duet? An Arduino duet. So, how many MIDI synthesizers have you recently bought? Too many. Yeah. I have to give you a word of warning. If you try this, it's very addictive. Especially in this day and age where you can find very cheap synthesizers and you're like, come on, it's really cheap. I can afford another one. So, I think I have eight now. And one is actually flying from the UK to Barcelona as we speak. Yeah, it's very addictive. So, the first sound that we've got is the drone that's just on the plane. Yep. So, that actually comes from the synthesizer itself. Can you hold? Let me connect this back. So, general MIDI is a very funny protocol. So, I think someone sat down at some point and decided which 127 instruments were the most representative of the instruments that humans use all over the world. And they decided that they needed eight slots reserved to sound effects. And that these sound effects would be gunshots, waves, birds. So, I don't know how that happened. I think it's a good thing of eight instruments that I would like to have instead of those. But at least the birds are nice. And the way you play those is you select instrument sound effects. I forgot which one it is. Is this on? Oh, of course I have my... So, and the answer is also each synthesizer will either adhere to general MIDI or not. So, some will play different things where you expect them not to be. So, this is like a fret sound. Sound effect number two is like a flute sound. There's the waves. There's the birds. And I believe... I was not kidding. There's gunshots. They make us nice percussion. So, how about triggering samples? Yes. I think some samplers will work with MIDI notes. So, that will work. Also, you can trigger sequencers with this thing. We have to start playing and stop playing and continue playing blocks. But if there's something that you are missing in the library, these are all built with microblocks themselves. It's all built with blocks. So, for instance, say your sampler uses a special MIDI message that's not standard. You could look at how this block is made. You show the block definition. And you see it's actually using two different instances of this set MIDI note. It sets it on, it waits for the duration, minus the time it takes for the message to travel, and then it sets it off. So, how does this work? Let's look at the block definition. So, it depends. If you're sending me a list, then I'll send the list of notes all at once, so you can make chords. Otherwise, I'll send a MIDI command. And how does the MIDI command work? Well, the MIDI command is sending stuff over the serial port. And if you know the format of your special MIDI message, you can just send stuff over the serial port, and that will work. Okay. If that's all, thank you very much. And you can follow the project. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. You can follow the project and myself in MasterDun. How did that happen?
Hedy
So, hello everybody. My name is Jesus Pelai. I am a developer in the HEDI programming language. I also a teacher in the University of Carabobo, which is located in Venezuela, where you come from. Here is also Pink there. She teaches with HEDI, so she will also be answering your questions. And today we will be talking a little bit about the HEDI programming language, what it is, why it is useful over its alternatives, at least for user cases, and also how you might get involved with the project. So let's begin. Okay. Okay. So, what is HEDI? And I will summarize it really fast for those of you who are also jet lag like me. And these are the three core concepts of HEDI. So HEDI is a gradual, multilingual, textual programming language built for teaching. So this is what our entire deal is about. One outside asks you what is HEDI? You can tell them that it is gradual, multilingual, and you can use it in the classroom. And it's a bit of a mouthful, but I will be explaining each one of these concepts along the talk. But first, let's talk a little bit about our misconceptions when it comes to teaching programming. Because us as programmers, we don't really are that good when we want to teach programming, because we have this series of misunderstanding how people understand things. Another thing is that compilers are friends. And that is a lie made by the compiler PR group. The compiler is not your friend. It is a tool, and especially not friend for a kid. Because the compilers are made for professionals. And the error messages are very much tailored for adults who are working with the language. So when a kid sees an error message that is meant for an engineer, it is really intimidating. And also this second issue is also a sense from the fact that programming languages are made for adults and engineers. And people think often that that is not really a problem, that syntax is not really an issue. And one of the core things of HEDI is that syntax really poses a barrier for entry-level barrier for novices. 50% of, there's a study that says that 50% of programs submitted by novices have syntax errors. And 75% of programs submitted by the weaker students have errors. So this is not a kid that is having fun reading the error messages and fixing them. It's a kid that is trying stuff around, trying to make the program work. So it's really a really tall barrier for someone to get into programming when you have the syntax right away. And the other misconception is that you mostly learn alone. We often, when we think about learning programming, we think about someone who is sitting in front of a computer alone in the room. And that was the case 20, 30, even 10 years ago. But today there are many schools that teach programming. So now we also have to think about programming as a discipline that you can learn in the classroom. And these were some of the misconceptions that the creator of HEDI, Felina Hermans, had. When she was asked to teach a class of high school programming. Felina, she is a researcher from the Netherlands. She has a PhD in computer science. But she was teaching high schoolers as a Saturday activity. And now she thought, okay, fine, I know programming. I can teach programming. And for that she used Scratch. And Scratch is a great tool. It's a great language. Scratch is a visual-based language where for programming you only drag stuff around. And by dragging stuff around you have a program. For example, this one right here makes the cat move around and play meow-meow sounds. And kids, they love making the cat do meow sounds. However, as time went on, the kids grew up and they said to Felina, we like Scratch, but we don't like a toy anymore. You know, this is like a toy. Drag stuff around is like making a puzzle. We want real programming languages. They want that programmers use the ones that can get me a job. Because these were students from a technical school. So they were very much interested in learning programming as a way to get jobs in the future. And many people learn programming this way. Not everyone learns programming now because of a passion of computing, but because it is a really viable way to get out of poverty, to get a job. So now she said, yeah, that's fine. I can do that. Let's teach you Python. And Python is also a great language. It is certainly easier, the syntax is certainly easier than the likes of Java, C or C++. But still, it's very much a language made for professionals. Error messages are for professionals. And the syntax is a barrier right away. And you can see here, if I'm trying to teach a classroom how to write this, I would say, yeah, kids, today we are learning to print some text. And for some students, that is not really that exciting to just print text on a screen. But you got along. We are going to print text. And to do that, you write print, and then you write a parenthesis. To write a parenthesis, you have to do shift eight in the keyboard. And then a quotation mark. Yeah, the quotation mark is right beside the answer. So you have to do shift that, keep it pressed, write the text, and again. So as you can see, the kids, at this level, they might not have the proficiency in typing or proficiency knowing how to type stuff or clicking and selecting text. So that is also a barrier for them. But let's assume that the kids do it. And then you have text on a screen. Awesome. But what about some kid makes an error? For example, this kid, they wrote an uppercase P. And for them, this will look pretty fine. Perhaps this kid does not know that their rates quickly line means that they have an error. So they try to run it. And they are faced with this. A really uglier message, you know? It says, trace back. Most recent call last, a file and a path. And then in the end, it says, name error, name print is not defined, did you mean print? And a frack it, this is really intimidating stuff. The kids probably didn't even read the complete error message. They would say, teacher, teacher, what is happening here? And if they read it, and if they know English, it's another barrier. Because Python is an English-based language. So if the kid does not know English, there is no clue that they will understand this. But you know, they read it, they fix it, it's fine. But now you also have this mistake. Now this kid switched around their parentheses and the quotation mark. And now it says, syntax error. parentheses was never closed. And they might say, yeah, I closed it. And what the heck, it's a syntax error. Now, I have this program. And it looks perfect, except for the squiggly line that it tells us, adults, that there is an error, but the kid doesn't see it. And now they try to run it. And it says, indentation error, unexpected indent. Teacher, what is an indent? Now you don't want to be explaining all of these to a kid. This is just the first class, remember. They are just learning to print text on a screen. They are faced with complicated error messages and syntax element that has to be placed exactly as the programming line was expected. Otherwise, you don't have anything. And what an ugly error message. So it was also a problem when she tried to teach concepts. Because for example, if I show these to my students, I want them to understand the underlying concept of repetition. I want them to know that these right here will say, you will print 0, 1, 2, and 3. But they are not seeing that. They are seeing that they have to put a column and brackets and spaces and put it in such a way that it's perfect for the computer to understand. And the kids, they think, why is not the computer as smart to understand while I'm throwing at it? You know, there is artificial intelligence. They can write stuff. Why can't it understand a simple program that I'm making? So we see here that syntax is creating cognitive overload. So we humans have a limited space in our minds to work. We have a small amount of short-term memory. And when you are learning a programming language, your short-term memory is filled with the symbols. So you've got to have a short-term memory, so a spot for where the variable is, and another spot for where the bracket is, and the name of the programming, and the name of the function. So it really is complicated for the kids to understand syntax. And on top of that, understand the programming concept that you are trying to teach. But now, this is not only a programmer, a problem that us programmers face. Because there is also other disciplines that are very hard to understand. We have language and mathematics, and those are famously hard disciplines. But we don't teach them right away. You don't expect a kid to understand the Riemann hypothesis. We do it step by step. For example, if I'm teaching language to a kid, and the kid writes these letters, I don't say to the kid, kid, this is a really ugly letter. This is an ugly A. This doesn't make any sense. No, you tell them, this is really good. You wrote the A, you wrote the I, the N, and you are beginning to understand that. And then you tell them that the words, the vowels, and the consonants, they make words. And then you put more complicated stuff on top of that. And then you have that, you can have uppercase letters and punctuation marks. And then you end up with an understandable sentence. And now the rules change rapidly. You didn't start with the whole thing all at once. You were changing the rules little by little, little by little, so the kid could understand all the steps. Now, changing the rules creates also a little bit of cognitive overload. But that very much makes up for the fact that they are understanding each one of the concepts individually to then learn the other one. And they were learning on top of those foundations. And this is also the case for mathematics. If I'm trying to teach a kid to subtract numbers, I can tell them that you can't subtract a greater number from a lesser number because they don't know that negative numbers exist. So I can tell them 5 minus 3 equals 2 and 3 minus 5 equals 2, but then I change it. And I tell the kid, actually, 3 minus 5 equals minus 2. And now that you just changed the algorithm of subtraction a little bit, they didn't have to take all in advance that there are negative numbers and positive numbers. And the same happens for division. You're first telling that 8 divided by 3 is 2, reminder 2. And then you introduce the concept of fractions. And now it changed a little bit. And now 8 divided by 3 is 2 and 2-third. And the same happens for, and also the same 8 divided by 3 is 2.6666. So as you see, the same operation for the kid had three forms, and it became more complex as time went on. Now this is the idea. We can do that for programming too. And that is how Heady was born. We have a gradual programming language. And we can see here an example of Heady. So as you see, in the level 1, we have a really simple language. You can only print and ask input from the user. But you see here that there are no syntactic elements attached to it. So in its simple form, print just, print some text, print hello. I can ask input from the user, but I don't have to send a variable or do an equal sign off. Just ask what's your name. And the echo command will repeat the information just given to us by the user. So you just use that. And this level consists of six commands. Distribute for text 2 for the total module of Python, so you can also make drawings. So it's not as boring as just printing text. And you can also play some music. We just merged that. And so you can also now do some simple music notes in this level. And now you move on, because that is a really simple language. And I put some stuff to it. And now in level 4, I already introduced quotation marks and also variables. So you see here that the program changed a little bit. So now I have that if I want input, I store it in a variable. So now his name is ask what is your name. And I can concatenate that in the print command. And then in level 18, you end up using a subtractively valid subset of Python. So in the end, you will just be premming in Python. And you can move on from Heady to Python. And I start using that ecosystem. So you can use Piberix or anything else that you like. But you don't have to explain all of these syntactic elements right away, because you do it slowly. And also the levels, they offer you an opportunity for the kids to all be in the same place. Because you don't have many different things going on with kids using 4 and others using other stuff. Now, also it's important to note that the syntax of level 1 is not valid in level 4, for example. So when you move on from level to level, you are basically switching to another programming language. So in essence, we have 18 different programming languages which are built on top of each other. And each one has its own parser and its own syntax. And these are our design goals. So we have these six goals here that we use when we are designing a level. So the first one is that the concepts are offered at least three times. So in the example of print, the first way is you do a print text, then you print variables, and then you add quotation marks and everything. The second one is the concept is introduced as simple as possible. For example, in the repeat command, we first introduce it with repeat and a single sentence in a single line. So you don't have to deal with blocks or multiple commands at the same time. It's just repeat a single instruction. And then only one aspect changes at the time. So if I introduce the repeat command in level 7, I do it in a single line. So the next change is just do it and now we have blocks. And in the next one, in the level 9, we introduce nested blocks. So you see that we try to change as little as possible to make the levels feel like a gentle progression and not so steep. And also the syntactic elements are deferred to the latest moment possible. An example was the quotation mark. They are introduced in level 4 because you can't do more complex programs if you don't introduce them. And this also poses a problem for the students. Some kids, they really struggle with the introduction of the quotation marks. But we cannot do it any later because it would mean that we cannot make more complex programs. And the concepts are interleaved, which means that if we change something in one level, we don't change it in the next level and the next two levels. So concepts can be really taken for the students, by the students. And then it's always possible to create meaningful programs because in the end, this has to be fun otherwise the kid won't like to use it. OK, so that is for Grad-Walt. But that is not all what we are all about. We also are multilingual. And we ran a user study with Dutch kids. It was composed for 21 kids, I think, in an online class during the pandemic. And they told us that they liked Heady. They liked that the error messages were easier than Python. They liked that it offered a state-by-state guide. So they told us we would like to program in Dutch. We just don't want that the interface is Dutch. We want the keywords to be in Dutch. And it was a bit of a surprise because Dutch kids, they know English, but even then they want to program in their own language. And that is more so the case for Arab kids and Spanish-speaking kids, because, for example, for an Arab kid, I don't only have to teach them what is a program. I have to teach them what is a P, what is an R, what is an I, because they don't use the same script as us. On top of understanding syntax, they also have to understand the Latin script. So it's very hard for them. And on top of that, they have to change the tweet from the Arab keyboard to the English keyword and then back. So it is a mess. It's kind of a mess for them. So we did it. And now Heady is available in 49 languages. I made these slides like two weeks ago. Now we have 49 languages. So as you can see here, we have Spanish, we have Arab there, we have Japanese, and we also have Dutch. Dutch or German? That's German. German. So, and this is kind of hard. And I will talk a little bit about, it is a compromise for us to be able to support the 49 languages and growing. Because each one, you know, some languages have their own quirks. So Arab has something called Tatwill, which is like a valid space. So when you add a space, you can add it any place in the world. And we needed to highlight, because it's the same world. It's just have like that space in there. So when the language has these quirks, we tell the translator, translate the program as close as you can, as many languages as you can. And we will try to solve the problems that arise in the grammar. And there are some cases where we can't solve it because of, you know, limitations with passing technology. For example, in Chinese, the words for plural and singular are the same symbol. So if I say for animal in animals, they place the same symbol in both of those places. So there is no way to differentiate the variables. That is a problem that we really cannot solve. But there are some others like Tatwill in Arab. And also some key words have spaces in them. We can solve that. And there are some conquers of French has also valid spaces in the end of the world. So yeah, now let's do a demo. So this is, hey, this is level seven. Let's go back to level one. So this is the interface. Here we can see that we have several elements. We have the editor where kids can write the programs, the output window, and this top part is where they can read the assignment. So these are sort of like exercises that they can make. So this one we have to build them. So these are provided by us and are translated by the community. What if I want to change to another language? Now I say that I want to switch to Spanish. And you see here that it's with the interface, but the keywords are seen in English. See if this works. And you see now that the keywords now are translated. Which only translates, the keyword doesn't translate the program because for that we would need to use like an automatic technology for that. And that is kind of hard to integrate and also poses its own problems. But you can switch the keywords. And what about bilingual kids? What about I know English and also know Spanish? So I can mix in Spanish and English keywords. And you can only miss your own language and English. You cannot mix any combination up to languages. But this is really useful for kids, for example, Latino kids in the USA where they know Spanish and they know a little bit of language and a little bit of the other. Okay, cool. Now that's it for multilingual. But we are not only that. We also build for teaching. So one of the main aspects of Heady is that it's not only made for learning programming but it's also a system for teaching programming. So we are the primary language, but we also are the system that is built around it, learning and teaching system. And also this is very good for a teacher because the levels offer a state by state guy and they are already like a lesson plan. So they don't need to make a lesson plan from the beginning. So that makes it a little easier to teach. So let's compare that to teaching scratch. So this is scratch and you are when you open it, we are met with like a black canvas. So the kids can do whatever they want. But that also poses a problem for the teacher. Because what can my students build? Some of them will not have no idea. Others will have many ideas that they will want to build. So now you also have to either get lessons from the internet or build your own or you don't really know whether those will work. So you make your lesson. And now there will be a kid who will tell you, hey teacher, what do I do now? I don't know what next to do. And this is fine. They are all in their place. There are other five kids calling you but you go to them and you help them. But there is this kid who ignored your lesson. I build this. Can you help me debug it? And now I have to understand this program and help this kid debug it and there are other five kids around me calling me teacher, teacher, I don't know what to do. So this is hard for a teacher to manage because there is too much freedom and too much variation between the levels of proficiency in the students. So what we do, what we can do is that in Heady you can build your class. You can make a class. You have a student account and the students can submit their own programs in a class. So you don't have to download or upload any sort of files. So that is done automatically by your system. You can have quizzes so you can block the next level if the kid does not have the necessary grade. And you can also block it by date. So you don't have kids at a kid in level two and a kid in level 18 with the aligned levels of proficiency. So you have all of them right in the middle. So the slower ones will pick up the speed of the faster ones and the faster ones will not have, you know, will not build such complex programs because they will be limited by the level. And all of this is customizable. You can make your own lessons also by building your own adventures. So it is possible to just ignore what we've built or build your own adventures on top of that. Some teachers, they like to have an adventure that is like a project. So the project starts in level one and it changes from level to level and in the end you have to make a really big program. And that was they built it bit by bit in all of the levels. Now let's start a little bit about how does this work? What is our architecture? An architecture is a really simple client to server architecture. In the server we transpiled the programs using LARC. And LARC is a library that generates parser. You write your parsing, your syntax grammar using eb and f and then it generates the parser for that. And to that, and then we build the Python program in the server. We send it back to the client and it's called which is a library that executes Python in the browser and that is done by translating the Python program to a JavaScript that they can understand and then execute it. But what are some architectural challenges? So A is an editor that didn't work well with right to left languages. So we were having problems with Hebrew and Arab and I also think that's how it is. And so it was really a challenge. So we have, changing this wasn't easy. But we did it and it was hard. So as I told you, making a language that is available in many languages, it is very challenging because you have to debug in languages that you don't understand. You have to look like an Arab program and you don't understand Arab. So it's very much a decision by the team to make this program, this language more accessible to everyone because it's not really about this social justice problem but it's also an economics problem. Like it will be so much more easier for the kids to program in their own languages. For Arab people, they don't use our numbers. They have their own numerical system with other symbols. And they can use those in heavy. So it's really like programming at home. They don't have to learn everything else from other languages. And for this, to support these many languages, we use parser generators because building our own parser for 18 levels, 47 languages, would be really hard. So we generate them using LARC and that one is laser, the one we used in the front end. So we have to maintain 18 different grammars times 47 languages. So that is a lot. And we have some tricks for that. For example, we don't build the entire grammar all at once. So we only change bits that change from level to level. So if level 4, I'm just introducing quotation marks, I only change the print and ask commands. So it becomes a little bit easier. And we have to make some creative way to do that. So we made a grammar merger and also the same for LARC, for laser. So you see, it's complicated because both of these use different parsing algorithms. So you have to take that count for that. Okay, now let's talk a little bit about me. I am, as you see, I am Venezuelan. So what is a Venezuelan doing here giving this talk? And I became involved with Haiti in 2021. And by that point, I was an undergrad student trying to get involved in an open source project to understand a little bit more about professional projects and Python, et cetera. And I heard about Haiti because I followed Felina on Twitter. So I texted her. Hey, I want to get involved with Haiti. It could be possible. And she had a meeting with me on Google Meet. And that was really impressive for me, for a researcher from the Netherlands to be meeting, you know, an undergrad student from around the world. That really meant a lot for me. And then I became more and more involved with the project. And now she offered me the opportunity to work full-time at Haiti and then give this talk. So to me, the open source has not only been, you know, about building a great product, it has been life-changing for me. And this can be, you know, the case for other people because, for example, my students, they, I teach now in the university where I studied, and they told me, teacher, I want to get a job. I want to support my family. And now I'm telling them that come here, help us with Haiti with open source, and then will you have something in your resume? And this is very important for us in Haiti that not only our product is accessible to everyone, but also our system are accessible to everyone. So if you are a new time programmer, if you are learning Python and you are not as proficient in professional systems, we try to help you and make it as simple as possible for you. So you can boot Haiti really quickly in your computer and there won't be a problem. You don't have to deal with complicated stuff. If you only know Python, you can help us a lot in the back end, so you don't even have to know JavaScript or HTML. And this is also something that we embrace a lot when we're doing changes to our system. When we try to improve our systems, we think would this be accessible for students to build with Haiti? Because students have helped us a lot. They have built some of the features that you can see in Haiti, and then we polish them. And so we welcome everyone. We welcome novice programmers and we welcome professional programmers. So if you can get involved with us, you will very much welcome. So you can join us on our Discord and GitHub and you can try at Haiti.org. Thank you very much. Now we are open for questions. Yes? Just a quick question. You mentioned that one of the problems is syntax and errors. And I just wanted to know what language is used for showing errors in the language. And how you will not define but understand in which language should you show errors? Oh, right. So the user sets their own language. For example, I have Spanish here. And if I make a mistake, you can see here that the error message is shown in Spanish. And the same if I go to Arab. The error message is in Arab. So it does know which language to use for the user, for the error messages. And this is also the problem with the student itself. So because all of the expensive ideas that we use like day to day are so over complicated. It's profiling, the borders, etc. etc. Okay. There should be just like a small tool for kids just with a window. And as you show in the wrap, I made it directly on the computer. So you're talking about tools for the student. Yeah, like ideas that could help you. Oh, okay. Right. So they can, for HEDI or IDE, is there in the page? But they have included some tools. For example, we do have HEDI bogey. So you can debug HEDI programs and execute them step by step. And you can also know which variables are defined in the program. So we do have some light IDE features that can help the students. We do not want to make it very complicated. For example, you could also show like auto-completion. But that would mean that it would crowd the page a little bit and it would be more complicated and more stuff for the kid to take in. So we try to make it also as simple as possible for them. Yes? Have you thought about interfacing the IDE and HEDI with external algorithm like microcontroller or GPIOs or LAPs? Which is more, kids can take their hands and stuff. Yes, that is actually something that is in the works. We have a student right now working on interfacing HEDI. Which I think is microblocks. Microbits. Microbits, yeah, microbits. Yeah, it's starting to move from LED 38 into micro. I don't know how you do it. I am not personally involved with that student. So I don't know exactly who he is doing it. But we know that that is in the works. So it will be coming shortly. Also maybe I can step in here as a teacher. Yeah. Is the other microphone still there? Yes. So it's a question we often get when we show this. Is it boring for the kids only text? And that is exactly what I thought before I was teaching with it. But I wanted to teach them anyway because I wanted to do pie bricks with them. Lego robots with Python. So I needed to learn in Python first. So I thought maybe HEDI is the way to go then. And then I found out that in the classroom I teach nine and ten year olds. I actually enjoy doing these text things over and over and over and over again. And also at the end of my lessons I always have the portion that they can show their work to me. And often six or seven kids want to show exactly the same things. And they're still proud of it even if the ones ahead of them are exactly the same. So for kids it isn't as boring as I thought it would be. And also the turtle they love obviously. I'm not saying it is boring but maybe HEDI could be an alternative for a system like you're doing. Because programming is impossible. If you could do similar things but using a simple language that would be very useful for the kids. That's my idea. Because I have a kid and I'm facing this problem with her. You can go two ways there I guess because you can use the micro blocks that Peter can tell you everything about. And also the micro bits for instance. They have their own Python in their own websites. If you go to python.microbit.org I think then you can do Python as well. But there you do have the full syntax and stuff. So if they know all the concepts and the syntax before diving into that it's probably more successful. And there is somebody working on that. Yeah I think the idea with the micro bits within HEDI is that you can do a print and then it will come on the pixel screen on the micro bit for instance. Yeah probably the way to go is to... Probably is that what we do often when we do these connections to Python libraries is we have the HEDI code and it will get translated to Python. And it will probably be Python brick. And it will be executed. Any other questions? Yes. Very nice presentation. Thank you. How much does the children spend from level one to level 18 usually? How long did it take to get from level one? It does depend on age and prior knowledge obviously. At the nine year olds I teach I have them an hour a week and I do two or three weeks for one level. But the quicker kids could do way quicker than that but I want everybody to come along so I take it slowly. And the second question is is there an offline version? No. There is no offline version of HEDI. To use HEDI you need to be connected to the internet. Okay. Because I'm also a teacher and I have problems in bigger rooms with many children. The Wi-Fi is terrible. Yeah. Yeah just check the GitHub. You have an offline version. It's already a works. Yeah there is an offline version but this was like a person who took it as a personal project. So it doesn't work very well and it's not updated to the latest version of HEDI. Because you are using parcel generators so every time you change the set of keywords you have from Python every time you update you have to regenerate the parcel. Is this a real programming language or just a transpilation? You have a set of keywords and then it's Python. So you execute it there and you get the result and show it in the screen. Yeah. So what we do is we cache the parsers. So we have the grammars and we generate the parsers using lark. And this is like a Python object and we cache that Python object in the server. So when the user for example someone from the common used languages tried to use that they didn't have to wait and it would not overload the server. But if you are changing to a lesser used languages, yeah we will generate the grammar for that level. The total grammar and then generate the parcel and that will be in the memory. Is it automatic or do you have to rebuild the project? No this is all automatic. Lark takes the grammar file and automatically from the server. So we have the Spanish grammar, the Arabic grammar, the English grammar all of them in the server. But you have to change the grammar file of lark? Yes we have of course. When we... Did you build? Yes when we build. So you will have the new version of the parcel. Of the parcel, yeah. And do you cover... I think you cover just like the... Let's say the... Let's print some basic keywords. But if it is about advanced algorithm it's not part of the thing. It is a full premise. At level 7 you already have repeat and conditionals. So you can do some somewhat complex algorithms. Of course you don't have interface to archives or anything. So you are limited in what you can do for the interface to the outside. But it's a fully functional programming language. But there is still a subset of Python. There is a subset of Python in the end, yeah. Yeah, of course. It's just a subset. And for example you said from level 1 to level 18 let's take Arabic. At the beginning you will start learning in... The kid will start learning in Arabic, you know what I mean? Yeah. In level 18 it's still Arabic. No more than Arabic. But later you have to switch to Python to write a fully correct, a scientifically correct program. How do you do that? So that transition we haven't yet gotten to the point of designing content for that transition. So you end up in level 18 programming in Python in your own language. So then it is up to the teacher or the student to then learn Python using the English keywords. But you already have some part of the syntax and understand how some of it works. So what they have to do is it's a little bit less hard than in the first lesson in the beginning. And also obviously we're hoping that other programming languages will pick up on the whole multilingual thing and do maybe Arabic versions of Python for you, for the real Python thing. It's a long way to go though. So still I think for people not in the English language part of the world it's hard to get into programming in almost all languages based on English. So we're hoping to help. The other languages is not just about the language itself. No. It's about the computational thing we have. Yeah, absolutely. Everything is built in English, all the meanings and all that. Yeah. It is designed for professionals. It's easy to have an RTL algorithm. I'm not that sufficient always. There are a lot of problems to create something in a language other than English or whatever from the left to right. But when you go to that part you would always face problems even for the smallest. Yeah, and that's why Hedy is trying to do it for right to left as well. So when you're learning to code you can learn the concepts of doing loops and variables and all those concepts that are most of us could probably dream about. But kids it's really hard to learn and new people to coding as well. You can learn the concepts without learning the Latin alphabet or having to switch your keyboard because you want your output to be an Arabic, probably if you're an Arabic. And you can't type Latin and Arabic on the same keyboard. So we try to make it so that you can learn programming apart from learning English and the rest of it. Yeah. You have a question? Can we use the turtle library? Yes, we have embedded turtle functionality. It's not the complete library I think. No. I don't think you can do fills at the moment but you can do the turtle and colors and stuff. Yeah. Yes? You can do inputs in the keyboard. You can do the students can do input, output in the screen. Move the students from the screen object. No. No. No, there is no long term storage. So you can only do the output is to the output screen. So open it's printing, asking for input is directly from the keyboard, from the user. So there is no connection to files. No pixels as well. You can have the output read aloud by the building text to speech thing. And you can also download the turtle drawings. That's the only thing you can download. Yes? Why did you choose to use Lex and Nap? Not another modern part. Oh, it's not. It's called laser. No, it's from, it's not Jack. Larek is a Python library. And the other one is from code, the part of the code mirror which is in editor. So laser is purposely built to work with that code editor. So we're not using Jack. No. So it's a little bit more modern than those. Yes? Yeah, I need to follow up with the offline question. Can you self-host it? You want to run it locally? Yeah, you can self-host it. Okay, and another thing, yeah, it's really nice this approach to incrementally teach kids program. Is it also an approach to incrementally at the same time, teach the kids to read technical documents? Because at the end they're going to face this challenge of reading technical documents. So if you could teach them to do it on the ground, then there are... So our exercises and adventures are very much tailored to be, you know, fun and tailored for kids around, you know, 12, 13. So we don't teach them to read complex stuff. It's more like we want them to understand the programming aspect of it before moving on to other endeavors. Yeah, of course, but incrementally teach how to approach the things of information. No, it's the same approach across the levels. That's a good point, though. Yeah. Because the question that they ask about the warnings, I suppose that if you're a level one, you get a simple warning, but when you're a level 17, you get that more technical warning message, for example. So the error messages change a bit because we are adding more complexity to the language. So the error messages do get a little bit more complex, but they want that, for example, if you make an error in level one, that is the same as in level seven, it will be the same error message. I think it would be a great opportunity for students working on the project or anybody working on the project to help on, say, graduating from Haiti to real world Python and introduce a constant just like that and also more concepts such as using a code editor, instead of doing it on a web page, stuff like that, could be a level 18 or level 19 or something, extension. The Haiti project is only two or three years old. Yeah, we are very fast moving still with new features and stuff, like the music notes that are in there for about a week and only available in English because the translators haven't come round to translating it yet. So what's not now maybe next year, you never know. Yeah, you are very much welcome to come join us in our meetings because we have public meetings. We put them in the Discord so you can come join us and give us this idea so we can begin working on them and consider them. Yes? Can you talk about at the beginning as a preliminary exercise about typewriting? Kids do not even know how to do an app case. Yeah. App case later. Just to help them understand how to do that. Shall I answer that? There's not anything in there particularly in Haiti, but as a teacher, do that bit before I start them on Haiti. So I teach them how to make a password, why it's so important to have a password and keep a secret, stuff like that, and how to type your name with capital letters and accent letters if we have them a lot in the Netherlands. So I do that before I start them off in Haiti, but not for too long because once they are in, they are really eager to learn that kind of stuff because they really, really need it. And kids want to learn the stuff they need to do what they want. So if they need to type the quotation marks, they'll just ask me or the person next to them, and within 10 minutes everybody knows how to do it. And they don't forget because they try it, they have to use it all the time. So it works both ways I guess. You have your finger for a while. I can see how it's analogous to learn to read, to write, to learn. But what happens after level 18? Have you done any studies to see how impact what it is? Here you subjected to your learning from, as I understand it, from stage 1 to 18 in any of those four languages, but then to progress to, I don't know, hopefully your professional development, suddenly you have to learn everything else that you haven't learned, and you have to learn different languages and so on. So how useful it is beyond teaching, programming, and Haiti? So when you finish level 18, you already know, you know, Haiti in your language, or Python, you know, because it's level 18, but you already know the programming concepts, which is the important thing that we are trying to teach, because we are not trying to teach the syntax for that, because the syntax will change with every programming language. So we want the kids to know what it is, what is a loop, what is a conditional. So if they move on from Haiti to, you know, Python, or some other language that is English based, it will help them a great deal, because they already know the concepts behind those, behind that syntax. So the transition is easier. It's easier, but is it easier now? Julia, do you know if Filina did any research on that? There has been some research on this, yes. I do know that it's still, like, going from a level in Haiti to the next level is a smaller step, than going from level 18 to real world Python. So, yeah, they are still thinking about how can we make this a smaller step, so that it really feels like going from Haiti to Python is just a smaller step as between the levels. But it does help a lot, of course, because if you have no programming experience and you come through Haiti, then it's really smaller to go to Python than if you go to Python without anything in between. So, what I'm asking, I'm not questioning whether it's easier, or I'm asking if there is any evidence that it's easier enough for people to make the transition. Do you know? Do they make the transition or do they... Yeah, I do know that at the school actually, Filina, aside from being a researcher, is also a high school teacher, so one day she's teaching in the Netherlands, mostly the 12 year olds around that age, and she has students in her classroom of 12 year olds who go to Python after being done with Haiti. So, she does use it herself, and I've been observing in that class myself. Yeah, and it's working fine. Yeah, so I know, I mean, it's very anecdotal, so I don't know about big study with lots of kids, but I have seen a classroom where there were 12 year olds who switched from Haiti to Python and were enjoying themselves. So, yeah, it is possible at least. Also, if you'd like to hear Filina's answer, I encourage you to come into the Discord and ask the question there as well. She's on there every day, all day, so it seems to me anyway. So, she'll definitely answer the question if she's done any research, or maybe even put a student on it to research it, because it's a really, really good question. I know there is a paper about multilingual programming that she published a while back, but I don't think that it's specifically about Haiti, much more about the concept. So, I don't know if that will answer your question, that paper. Yes? Related to the previous question, is there an option in Haiti to show the program you wrote in the actual Python to make that process? No. There is no built-in functionality for that in the interface, but if you know a little bit your way in the browser, you can actually see what was transpired. Let me put this back in English. And maybe to add to this, I don't know if it's clear, but if you have code in maybe level 18, you can just copy-paste it into Python. It's actually a code. There's still a program about the English from Morocco. Well, if it's in English? Yeah. Assume that your program in Haiti in English, then you can copy-paste your level 18 code into Python, and then it's exactly the same. So, level 18, Haiti in English is actually Python same. It's a subset of Python. Remember that you can also mix in English and your own language. So, you can also do that transition within Haiti. So, you can mix English keywords and keywords in your own language right there in the editor. So, it makes the transition a little bit easier. Yes? How do you handle the zero word? How do I handle what? Zero word, variable that could not have some names. Like, you cannot have a variable called print. Ah, okay. When you translate it from one language to another one, the variable change is the zero word. Yeah, that depends on the specific variables. So, there are some keywords that we can parse in context. If your name or variable, for example, I don't know, color I think once you can make it. So, it depends. If it's a variable that you cannot assign, it will tell you that you are misusing the command or something like that. Otherwise, it will work normally.
Google Blockly
Hello, my name is Christopher Allen. I have the great privilege of being one of the approximately five engineers who work on Blockly at Google. We are part of engineering education and in particularly we're the kids coding team. I'm based in London. I have colleagues primarily in the US but also one in Zurich. So this talk I'm going to be talking a little bit about what Blockly is and what it's used for. I'm guessing a number of you are probably familiar with Blockly. Who has heard of Blockly before? Who has played with something containing Blockly before? Anybody written something using Blockly as a component of the thing? Okay, yeah, good mix of those. Great. So there will be a little introduction for the people who don't know what Blockly is. We're also going to talk about how to put Blockly into your app. So I shall do a little demonstration of what it looks like to build an app using Blockly. I'm going to talk a little bit about the architecture and internals of Blockly. Just a very high level overview but some of the structure of the code and some of the major pieces of how it works under the covers. I'm going to be talking a little bit about some of the things we have learned in the last, well actually now more than a decade, I think nearly 13 or 14 years that the team has been working on Blockly. I have fortunately had the great privilege of learning most of it second hand as I've only been on the team for about two and a half years. It's much better to learn for other people's mistakes. And finally hopefully we'll have some time for Q&A at the end. So what is Blockly? Blockly is an open source client side library for creating block based visual programming languages. That's a lot of words. So first of all I assume if you're here at Fosdame that you probably know what open source is. By client side what we mean is JavaScript. In fact in these days the actual code base is written in TypeScript but it compiles to JavaScript, it runs in the web browser. And block based visual programming languages just mean we're going to write software but our source code is going to look like a bunch of puzzle pieces connected together. So here is a picture of a Blockly workspace with a little program that consists of a function that computes Fibonacci numbers on the right and a piece of code on the left which we'll call that to compute the first five Fibonacci numbers and print them out neatly formatted. Now let's have a look at that in action. We can have a look at this very program here. You can see with Blockly that we obviously have these various blocks and I can pick them up and move them around. I can edit the text if I want to change. And then over here we can compile this code to any of five different programming languages. We support JavaScript, Python, PHP, Lua and Dart. Now this is a little playground for Blockly. You'll note that one of the things that it does not do is run the code. Blockly is the editor and I guess you could say the compiler but it is not the runtime, it is not the app that uses Blockly. So let's have a look at an app that uses Blockly. This is Blockly Games. The website is Blockly.games and it consists of a series of puzzles. This is a maze. The first level of maze is very easy. I just need to move forward twice. And I have succeeded in solving the maze. It shows me the JavaScript code that was generated. If we skip ahead to a slightly more interesting level, here I can create a loop and I'm going to add a conditional and I'm going to check to see if there is a path to the left and if it is I will turn left and if not I will move forward. And now when I run the program hopefully Pegman will make it all the way around the maze to the end. And so here Blockly is being used to display and edit the programs and then the game itself displays the maze and the various controls and it calls back into Blockly to highlight the blocks as the program runs. Great. All right. So a little bit of history. Blockly originated with a project called App Inventor. Now my slide says that it started in 2010 but I saw a presentation earlier this morning from the App Inventor team and I think they said it started in 2008 and I suspect they are probably right about that. It was a web app to create Android mobile phone apps. It still is in fact. It was originally sort of started by Hal Abelson and Mark Friedman. Hal Abelson some of you will probably know as one of the co-authors of the structure and interpretation of computer programs. A very well known computer science textbook from MIT. It was a web app and in its earliest version the front end the block editor was written in Java which was an interesting choice at the time because Java was kind of already on the way out. A hot Java browser with built-in JavaScript was very much a thing of the 90s and by 2010 if you wanted to run Java in your browser you had to install a bunch of plugins and fiddle around and it was not that easy to do. But you know it worked. Here you can see a screenshot from some very early version of App Inventor. It looks a lot less pretty than the current version does if you were here for the talk this morning. There were some issues with the Java editor. This was the kind of rendering bug that would be encountered and getting things to work well. Other than you know once you got the Java applet running was quite challenging. So kind of recognizing that the writing was on the wall for Java in the browser and that JavaScript was the way forward. My colleague Neil Fraser joined the team with and was assigned the task of rewriting the front end in JavaScript. He tried a bunch of different technologies to do the actual rendering. He used Canvas and SVG, wrote initial prototypes in both of those, decided that SVG was the way to go. So ever since Blockly has been an app written in JavaScript that produces SVG to display the blocks in the browser and on the right you can see a screenshot of a very early version of Blockly. The amusing thing about this is nobody knows where the trash can icon came from. So we were obliged to replace it with one that we knew what the copyright status was. Yeah, things happen. Of course all of you know the ghoul is infamous for cancelling projects and App Inventor was definitely one of those. It was cancelled but luckily it was open sourced and it was adopted by MIT and they have continued to develop it. Neil continued to work on the JavaScript block editor. It had not actually made it into App Inventor by this point and they had taken the Java front end and thrown it away and I think initially did their own thing completely. But Neil plowed on because he could see that this was going to be a useful piece of technology. Unfortunately even at Google you can only work on a cancelled project for so long before somebody tells you you should be doing something more useful with your time. But my friend Neil, he's a stubborn man and he decided he would go on vacation. But his vacations don't look like that. They tend to look more like this. The great thing about coming into the office when you're on vacation is nobody can tell you that you're not supposed to be working on the cancelled project. And after some weeks of vacation he had quite a lot of leaves stored up. He did a demo of the first version of the JavaScript block editor. The project was uncanceled. He got his vacation refunded. His boss told him he should take a real vacation this time and he was assigned to work on the project full time. The next year 2012 Blockly was officially released. It got this name. Some of you may know that Google Docs started off originally as something known as rightly and Google has had any number of similarly named projects inside and Blockly made it out with that name because of that. It debuted at the Bay Area Maker Faire and I don't know if any of you have been to Maker Faire. I had the chance to go in 2018 and there was an entire hall full of educational things that people had made and I was completely boggled at the number of them that had some kind of block based programming language as an interface. It was quite fun to go around with Neil and guess which ones were actually Blockly and which ones only looked like Blockly and not even he could always guess which was which. Blockly has been, because it wasn't specifically made for App Inventor, it was made as a general purpose library. It was quickly adopted by lots and lots of other groups. So 2014 we had Code.org releasing a version of Flappy Bird where you write your own Flappy Bird game with blocks. In 2014 President Obama with the first US president to code using Blockly, you can see his first program was simple and the interesting bit of the story is our partnership with Scratch. So I'm sure most of you have heard of Scratch or seen Scratch. Version 1.0 of Scratch was as I understand a PC app, probably for Windows, I'm not certain about that. In 2013 they released the first web version using Flash. It had taken them several years from when they made the decision to use Flash. Flash was like the obvious technology for doing this kind of interactive thing in the browser. By the time they had the first version ready and released it, Flash was very much on its way out. So they began thinking about what they would do and Neil and some other folks at Google who were involved with Blockly also started thinking about what to do and it seems that both Neil and some of the product people at Google reached out to the Scratch team independently and had conversations, engineer to engineer and planners to planners and everybody came to the conclusion that using Blockly as the next version of Scratch might be a good idea. This required a lot of work on Blockly's part to support the Scratch UI and on Scratch's part to redesign their app around Blockly. But finally in 2018, it was announced in 2016, in 2018 it came out in data test and in 2019 Scratch 3.0 was released and it has been Blockly under the covers since then and that is certainly in terms of usage as far as I know our biggest partner. We do however have lots and lots of other partners including several people in this room including all of these companies and many, many more and I guess we also have people using Blockly who have no access to computers. This photo was taken at a school in Malaysia, no computers but they were using cutouts, laminated cutouts of Blockly blocks to teach their kids to program or at least to be able to do thinking about computational thinking and algorithms and so on. So I think the question might be to put to you guys what are you guys building with Blockly? If you have built something with Blockly, I would love to stand up and just give us maybe a one sentence description of what you have made. So I'm one of the maintainers of the open app home automation system where we're writing rules and some of these rules are usually done in Python or JavaScript or we're using Blockly to actually make the learning curve much easier so even kids can create their rules, their home automation rules with a Blockly. So we're using the standard blocks and I implement like 100 more blocks that make our home automation very easy. Nice. I'm adding support for Go, the Go language to Blockly, especially for TinyGo so we can compile it to microcontroller. Oh, okay, that's cool. Thank you. A block-based editor to write automations using Google Home. So Google recently released the Google Home scripting language so it abstracts over that, gives you a set of blocks to work with so conditionals and actions that you can run and yet block-based editor for that. Yeah, cool. Oh, and yeah, two more. Apologies to those of you on the livestream who can't see. I'm actually building an XML editor with Blockly so in this case we are actually going in the other way around where we generate blocks out of the XML so you can actually just load an XML file and it's going to become blocks and then somebody that doesn't know XML is not technical can actually work with that XML file and modify it. That's cool. Yeah, thanks. We're a boot camp. We teach full stack HTML, CSS JavaScript. After we teach HTML and CSS, we use Blockly to have people make event handlers and that kind of thing. It generates JavaScript and so they can make their pages interactive before they have to learn syntax. Then they can look at the exported JavaScript and learn how it works and what it's doing. Nice. I use Blockly within Node-RED so within Node-RED you write a function node in JavaScript when you run out. No, I want to use Blockly so I'm heavily involved with the person who's written the Blockly interface inside of Node-RED. Excellent. That's my house. So yes, this is lovely but I think this gives such a great picture of the variety of different things that people use Blockly for from clearly educational focus things to utility scripting languages for other applications and sort of more businessy applications. I think that's great. All right. How to use Blockly. Those of you who already built something with Blockly, this will be old hat to you but for those of you who have never seen an app built with Blockly, I am going to show you very quickly how one might go about doing that. First of all, just a little bit of terminology so you can see here this, the big rectangle with all the stuff in it is what we call the workspace and on the left of the workspace you can see the toolbox which consists of a set of categories, logic, loops, math and so on and then a fly out that contains the blocks that users can drag onto the workspace. On the right you can see a bunch of blocks and on the far right you can see some zoom controls, a trash can, of course there are scroll bars, a grid, you can see a tool tip, just the usual sorts of things. All right, we'll see how this goes. So, oh yes, I was practicing earlier. So, on our developer site there's a page called Get Blockly which gives a number of ways of obtaining it. We publish Blockly as an NPM package and we also publish a variety of little utility packages for it, one of which includes a script that will create an app for you and so rather than making you sit as I take half an hour to type out a bunch of HTML and CSS and JavaScript we will just start with the little sample app. So, we'll take a moment to install some packages. It'll end up creating a directory structure for us that contains a main HTML file, a main JavaScript file, some other files on the side and it also creates a webpack configuration so that we can do an NPM start. This will fire up webpack, it will build the application, start a local server and open it in the web browser. So, here is our sample app. I'm just going to show you how it works. So, again we have our kind of categories here. I'm going to take a repeat block. I'm going to have it repeatedly put some text on the screen and you can see over here the JavaScript that is generated by the blocks appears here and underneath you can see the output of executing that code. We can change the color of the text and the contents of the text. It's very straightforward. Let's have a look at the files that it's created. I will try to make sure that these windows are big enough that you can all see if I can remember the right key combination. Yeah, here we go. Great. So, everything interesting is in the source directory. The app has a very simple HTML file, which basically just contains a few divs, one for the whole page, one for the output pane at the bottom left, one for the generated code at the top left and then the main blockly div which ends up being on the right. There is some CSS that puts all these things in the right place. This is a point where I have to admit I am not actually a front-end engineer despite working on a front-end library. So, I'm not going to try and explain how that works. I'm sure many of you here know far better than I do. The interesting parts of this app from my point of view are all in the main JavaScript file here. So, this was created for us using a template, but it's reasonably straightforward. The first bit of the code at the top just imports the various libraries and other files from the local directory that are going to be needed. This app contains some custom blocks and a custom generator for those blocks. The generator is the part of the blockly that actually produces the JavaScript code to be run. So, we register those custom pieces. Then we find the three divs in the HTML page. The one that is most interesting to us is BlocklyDiv. This line here is where Blockly is actually turned on, fired up on our page. It has some options. At the moment, the option we're passing in is Toolbox. I'll go and show you the contents of the Toolbox in a moment. It's loaded from another file in the same directory, but we can add some other options. For example, Blockly supports a number of different renderers. I can change the renderer that's used just by adding an extra option here. When I save the file, you can see Webpack recompiles it, reloads the browser, and now we have this much more scratch-style appearance. We can also add a grid. Those dots are pretty small. There are actually little crosses. Let's make them a little bit bigger. Length 5. That's a little easier to see. I could change the color. Both Neil and myself are Canadian, so we use Canadian spellings in Blockly. I really like blue, so we'll go with blue. Now we have blue grid. The grid you can see is useful for lining things up, but our blocks are not currently snapping to it, but we can turn on an option for that. I should know my JavaScript syntax by now. Now you'll see the blocks will snap to the grid each time I move them to keep them nice and neatly organized. At the moment, you can see in the tool box, you can see a number of blocks. Most of these blocks come from a standard library of blocks that we created, but a few of them, for example, this add text block is a custom block for this particular application. You can have a look at that here. Here is a piece of code which defines the add text block and adds it to the dictionary of blocks that Blockly maintains. We have lots of documentation about the format here, but as you can see, it's basically a little bit of JSON with a set of options that control the shape of the block and the fields and inputs it has. If we look at the tool box, here we have another bit of basically JSON that lists all the blocks that appear on the tool box, and in some cases, if they have some extra blocks attached to them. For example, here we have the repeat block, which is in logic. You can see that by default comes with a shadow block with a number here, so you don't need to attach a block there just to put a number in. That's given here with the default value of 10. Suppose that I'm making an app and I decide that I don't need to have variables and functions in this app because it's a very simple app that's going to be targeted maybe at young users or that just don't need that functionality. I can go down and find the variables and procedures sections of the tool box and I can delete them and you can see they disappear from the app. It's reasonably straightforward to control what Blockly looks like, what blocks it provides, and the things scroll options, the availability of the trash can, and so on. Blockly architecture and internals. For those of you who are of a non-technical background, thank you for bearing with me so far. We're going to talk a little bit more. This part of the talk is going to be more technical, but it's going to have more pretty pictures hopefully, so maybe that's something. First of all, a little bit of terminology. This is a stack of blocks, which is basically just any arrangement of blocks that are connected together. You can see that blocks have a number of connections, so the little bumps at the top and bottom are previous and next connections for statements. On the right, you can see a value input where this block can take a value from another block, and a statement input where you can connect a bunch of other statement blocks to it that it will execute, in this case, while it's repeating something. On the bottom, you can see a block that has an output connection on the left, and we describe it as having a dummy input, because the contents of the block is basically, it's a bunch of stuff, but all that stuff has to be contained in a set of inputs. The top block has two inputs, which are labeled, though the bottom block doesn't need any actual input, so it has what we call a dummy input, which is basically just, it's an input, but not really an input, and it's just to put other stuff inside, in this case, to put a color picker. The stuff you put inside inputs are called fields, and here's a few different kinds of fields. For example, on the left, we have the labels repeat and do. We have a drop-down field for while. The next block has a smaller block inside it, and that smaller block has a number field on it. On the right hand side of that block, you can see over here, this block here is what we call a shadow block, so it's a space you can put a block, and until you put a block there, it acts as if it has a number block in it, whereas this input here has an actual number block in it. Here we have an image input. In this case, it's just a static picture of a paragraph symbol, and we have a multi-line text input, and finally over here, we have a color picker, which when you click on it, pops up a little grid of colors that you can choose from. So those are fields, and you can see a number of them here, and you can imagine that creating an app, you might want to be able to create some more kinds of fields. You can see the architecture, and I should admit this is emitting a tremendous amount of detail, but the high-level architecture is that we have classes for workspace, for block, for input, and for field, and you can see that the input has, the input class has three subclasses, one for those dummy inputs, one for value inputs, and one for statement inputs. The field then has a number of subclasses, labels, numbers, text input, drop down, and so on, and you can see that this, you would want to be able to add more of these, and to facilitate developers adding more, the inputs and fields come from something called the registry, which is basically a little internal database running inside Blockly that allows the developers to register additional classes to provide different kinds of fields, or potentially different kinds of inputs. On the right, we have a little set of classes that are the Blockly events, so there is an abstract base class, and then for example, for events that have to do with blocks, there is a block base class, and then another series of block create, block move, block drag, block delete, and so on, and similarly for inputs and fields, and the connections that connect the inputs and blocks together. A large hierarchy of events that are generated, you can add event listeners to listen to these, to different events, and you can also register more events so that your custom fields, for example, could generate custom events if you wanted them to, and whenever Blockly needs to create a field, it will look up the field in the registry, and then the field can then look up the events that it needs to create, and so on, so it's all relatively extensible, it's quite easy to add additional kinds of things. On the left, there's a little bit of stuff here, so workspace and block are the kind of the abstract models of a workspace and a block, and then there's versions of those, there are subclasses of those for the ones that are actually rendered on the screen, which are block SVG and workspace SVG, and those communicate with a bunch of code that I don't want to get into called the rendering subsystem that is responsible for like producing the SVG that is then fed to the browser to make it draw the pretty pictures of blocks. So that's basically the front end of Blockly, the back end of Blockly as it were is the code generation subsystem, so we have a base class code generator which provides mostly utility functions for code generation and things like indentation and so on. There is then a subclass for each of the languages we support that provides things like functions to correctly quote text in that language, and for each of those subclasses we also provide an instance of that, in this case the lowercase j javascript generator that has a little dictionary of generator functions, so for each different block type, so controls if is the if block, there is a generator function for that, and so for each of the different blocks there's a there's a function written that takes as its input a block instance and some extra information for example a reference to the generator itself and it returns a piece of source code and for blocks that have inputs of various kinds they will then call the generator function for those inputs and take the resulting bits of text from those and combine them together to produce the output text. All right, lessons from a decade of Blockly. So there's a lot that I could say on almost any of these topics but I think the thing which I am certainly very conscious of and I know Neil is is that dependencies are very expensive. At the moment if you install Blockly it has only one dependency, the npm package has only one direct dependency, and that is on js dom and js dom is only used if you are running Blockly headlessly in no js because we need some DOM functions to parse XML. So this makes Blockly relatively dependency free but it didn't used to be like that. In the early days of Blockly we made use of the closure library, I don't know how many of you are familiar with this but it's a JavaScript UI widget library which also provides some JavaScript utility functions and it was very convenient to use, it took care of a lot of things like oh you need a date picker, well it provides a date picker and it also had utility functions that were useful but the closure library is kind of big and that meant that Blockly was kind of big and tended to load slowly especially if you had a small internet connection and some of the people that we are most interested in reaching are people who don't necessarily have that great an internet connection and if you are in some place where you are dependent on dial up or a low band with mobile phone connection we want you to be able to use Blockly and obviously if it takes a minute or two to load a page containing Blockly that is not going to make you very happy, it's not going to make you a returning customer of that app. So the Blockly team wanted to reduce the size of Blockly, now the piece of advice that was given was basically to go all in on closure, the closure library if you are going to use it it has a certain cost in size so you might as well use as much of it as you can and remove any other code that you have in your app that does the same thing that you could use a closure library for that seems like good advice and that was the direction that Blockly went for a while but it became clear that this wasn't working very well, Blockly was still pretty big and the closure library is designed for HTML and CSS whereas Blockly is mostly the front end is mostly SVG and using HTML and CSS based widgets in an SVG app turns out to be not very much fun and not particularly easy so eventually the decision was made to get rid of closure and it actually took several years to remove all of the bits of code in Blockly that called into the closure library, the few remaining bits that were actually useful basically just got copied into the Blockly source code and that dependency was eventually deleted but in the meantime there had been the cost of keeping it up to date and so on so it was definitely worth removing it because now we own all of that code ourselves and we don't have to worry about it breaking because somebody has decided to push an update but it took a long time and a lot of work to get there so I think the team is always thinking very carefully when opportunities come along to use new bits of tooling it can be very tempting some new application or utility or library that looks like it would make your life easier and be super convenient but it has costs another example of this is Blockly games until last week was running on Google App Engine it was using a version of the API which has been deprecated for about five years the amount of effort that it was going to take to update to the current version of the API was enough that we decided that it was going to be easier to self-host it so it still runs in Google Cloud but it runs in Google App Engine on basically a you know a bear server where we supply all of the software infrastructure that we previously used to be using that was provided by by App Engine so and hopefully that will considerably reduce the amount of time we spend dealing with App Engine updates in future so progress is great but keeping up is hard when Blockly was started around 2010 the technology that was used in use at Google was ECMAScript 5.1 which is to say an old version of JavaScript the closure type system I'm not sure if any of you are familiar with Google's closure compiler but it provided type checking for JavaScript very early on it was one of the first major tools to provide type checking for a JavaScript and when you build web apps the size that Google does that's extremely handy we also used what I might optimistically describe as a module system that was based around a function in a library called Google provide it wasn't really a module system it was basically just namespace objects if any of you who's used the TypeScript namespace feature it's basically exactly the same as that you can create an object and put some properties on it and put some more objects on those properties and you basically build a little tree and then you tuck your code away somewhere on that tree that's not going to conflict with anybody else's code and each file in your project puts its code on a different part of that tree there is still only one scope so if you declare a global variable by accident everybody gets to see it but it was a pretty pretty useful system at the time in 2012 TypeScript came along with a different slightly different and slightly incompatible but similar approach to JavaScript typing in 2014 Google promulgated a new module system called google.module which I think I would describe as being very similar to common js modules the syntax is a little bit different but the semantics are pretty similar to common js modules you have a local scope in each file so you can declare global variables in a file and they stay within that file life is happy 2015 of course we saw es6 and the introduction of ECMAScript modules with their own syntax and semantics which were quite different from common js modules and 2016 Google started to adopt TypeScript for our main internal code base guess what tech google Blockly was using in 2021 when I joined the project no no we were using ECMAScript 5.1 and the closure type system and google provide because you know the the Blockly code base is not enormous in the grand scheme of things google scale it's a pretty small bit of code but it's still hundreds and hundreds of files and when you have the choice between adding useful features to your project or fixing bugs that are impacting your users or spending time on refactoring your code to use a different language dialect and type system and module system that latter task doesn't seem like a particularly good use of your time the problem of course is with an open source project if you want to get people to contribute to your project it's nice to be using the same kind of tech that they're using in their projects and just just show hands here who's used the closure type system yeah three people have put up their hands that's about what I expected so yeah it's hard to find people who are on work on a project that is essentially written in something like a foreign language or at least a foreign dialect of a language that you know so we decided that we needed to update but the the cost of that update was substantial so the migration process was arduous we started by doing an almost entirely manual migration from the google provide module system to the google module actually really modules system that required a lot of restructuring code because google provide you didn't have any kind of notion of exports like just you you created a bunch of properties on an object and that object was visible to any of the other code in the system and so they could use those data properties or call those functions we had to think about what each file needed to actually export that other files within blockly were going to need and what things in that file were private and needed to be kept within the file it was it was quite a lot of work and we had to move a lot of code around and that ended up breaking creating a lot of breaking changes in blockly which was not great but we didn't know any way around that at the time we then were able to do a migration from es 5.1 to es 6 this was a mix of manual work but quite a lot of help from various bits of tool various bits of tooling to do that migration so starting to use class constructors for example or class syntax then the main final migration from the closure type system in google module to type script and es modules that was actually done using internal tooling at google i i probably can't talk a great deal about that but i know that it was briefly it was briefly made available as a now deprecated project on github whose name i whose public name i don't remember but it was used by like one or two other companies that had very heavily closure oriented code bases to do a typescript migration and they were fairly successful with that but it wasn't a useful general purpose tool so that has not not really been that useful to anybody else but we we we pushed through we fixed a lot of type errors that were left over after the automated tooling dumped a bunch of files into the repository and we eventually got it running we got it passed through typescript compilers type checking we figured out what we needed to do to get closure compiler to correctly ingest the code that was generated by the typescript compiler and life was good and now we have blockly in written in typescript with standard typescript yes modules code base that will be much more familiar to potential contributors it took a long time for us to get up to speed on these new technologies we made a number of regrettable mistakes we we we completely rearranged a lot of the public api of blockly which in hindsight we probably didn't actually need to do but we just didn't know at the time any better way to do it because we were mostly novices to the technology we're using we had only one person on the team who had really used typescript before we undertook this migration and even even kind of expertise on closure compiler there were only a few of us on the team who had used it extensively and kind of understood what it was doing and how to interpret some of its more obscure error messages the the blocks the library blocks that we provide and the code generators those were migrated a bit later that migration was actually done a bit more manually and using some of the wisdom we had gained by the migration of the main part of the code base and i would say largely that went a bit more smoothly i don't know that we could have done the main part of the code base entirely manually it took us basically a whole year's work from the whole then roughly six person team the the blocks and generators were a much smaller piece but it was satisfying to to do those and feel like i was you know doing it the way we should have done it the first time on the main part of the code base but yeah especially in the world's javascript technology changes it changes incredibly quickly and you're going to need to invest time at some point to keep your code base looking like something that a new developer will recognize architecture is always easier in hindsight so i guess there's a there's a few parts of this so part of it is when blocky began being written javascript was still a relatively new language that was used it was it was certainly we had gmail and google maps and javascript was being used to write web apps but web apps were still a relatively new thing and i think there were many different ways of writing code in javascript javascript is an object-oriented programming language but it didn't originally have class syntax and so a lot of people explored quite different ways to build objects it's not unusual for javascript code to copy properties from object to object to use mixins and things like that and there was a lot of stuff like that within the blocky code base blocks in javascript in blocky you can see there's a block class and the way different shaped blocks were created was basically to have a mixin object with some properties on it some methods and data you you copy those onto a new fresh block instance and you call an init method and it then kind of sets up the block so each block is an instance of the block class but it has some random extra properties on it and that was fine in the early days of javascript but that is not a type model that works particularly well with class syntax or especially with type script type checking there are some other interesting wrinkles here we have a workspace svg and block svg classes they were essentially an effort to separate the the model from the view but it's it's one of these cases where instead of having a view the model is the view and this then creates some problems we might want to turn our block definitions into actual subclasses of block but now you need a now you need to be able to subclass either block or block svg depending on whether Block v is running with a UI or headlessly so fixing that yeah these guys here they're they're they're difficult we we will probably continue working on this bit of the Block v architecture for some time but let's just say it's a lot clearer to us now how we should have written this the model the view and the controller should have been separate pieces there should have been less inheritance between them it should have been a has a rather than is a relationship and so on but you know nearly nearly 15 years experience you you learn some things finally no technical decisions are made in a vacuum Block v is a library that is used by hundreds thousands of other projects we don't we don't even know exactly we we have some metrics that tell us that we know roughly as a minimum about 40 million people a year use Block v to learn to code but we have no idea how many people use Block v as part of some you know corporate internal financial reporting system or a robot controller or any of the rest of us so we need when we're making changes to Block v we need to do so in a way that is going to be as as convenient as possible for developers to upgrade unfortunately if you never make any breaking changes to your code base you end up in a situation where you're eventually not going to be able to make any useful changes to it either so we look very carefully at breaking changes we try really hard not to break the Block v API and behavior but if we if we break it so the developer has to update their code we're okay with that but we do not break Block v in a way that would prevent the developer from being able to update their app and load saved programs what we do not want is somebody who's written a program in Block v being unable to load that program into Block v because of some change that we've made to Block v itself so some breaking changes for the developers hopefully no breaking changes when it comes to loading programs because we would like well it is still the case that the very first Block v programs that were ever written there it you know all the ones from Block v games that kids have saved those all still load and work today so and finally Block v's unexpected killer feature we only let the developers choose the hue of blocks people like to pick colors for things it we basically in Block v you can choose the hue and Block v chooses the saturation and value for the different parts of the block developers often complain about this and you can work around it if you really want to but the big advantage of doing it this way is it turned out that if you only let the developers choose the hue they really couldn't choose a set of colors that looked awful and the unexpected benefit was that Block v basically just looked better than any of the other Block front-end libraries out there at the time because because it prevented developers from making bad color choices so we kind of won on that one sort of by accident so yeah sometimes less is more I guess all right we have a few minutes left for Q&A I am happy to answer almost any question I can I did mention that I am not much of a front-end developer most of my work on Block v has been involved in migration and tooling and things like that but I am happy to answer any questions I possibly can with that mind stick up your hand I will pass you the mic if you're not too far away just so people who are watching on the live stream can hear could you tell us a bit more about Block v's own type system so each block can have a type so what kind of types are supported yeah so Block v's type system is very simple basically each connection can provide that to say an input or an output connection from a block can provide a list of strings those strings might be things like string number boolean color something like that and you can only connect two blocks together if there's at least one type in common between an output and an input and that is basically Block v's type system now there has been some work done by my colleague Becca on something called a nominal connection checker which is basically designed to provide a much more sophisticated kind of type checking that was she did that work experimentally actually before she joined Block v team so if you if you search for Google Block v nominal connection checker you can find some of the work that she did on that we are definitely interested in providing a bit more sophistication in terms of typing but it's not a huge priority for us because for the kind of educational market that we are there is our main target of our work this very simple type system that we have is generally good enough to make it easy for kids to not produce invalid programs which is our main goal thank you so one in the middle or I was just gonna I just wanted to build on that so for App Inventor we also have a system we extended Block v to take a function so as long as the function returns true then you can connect two blocks and that gives you basically max the developer of the app can figure out what type checking they want to do yeah thanks for the talk really interesting and one of the downsides of teaching coding through or one of the perceived downsides is that when you graduate beyond using blocks to using real code it's a one-way street right you take your code and you generate some code and it maybe is not the same shape of code that you would write if you're writing code from scratch and it's also a one-way street you can't go back yes do you know if there's been much research into going the other way like bringing existing logic into Block v and on kind of bridging this gap more smoothly on like improving the quality of the generated code to feel more like a human would write yeah so this is this is something that we're definitely very interested in my my colleague Neil is actually doing some work at the moment on basically pre-work on something like that the idea is okay so the fundamental problem with the kind of general code to blocks is that within Block v even in our standard block library we we don't have blocks that represent all of the different things you can do in any of the five languages that we support so there are lots of programming constructs which we just can't represent in Block v so you would need to have a set of blocks that could represent everything that the language can do in order for that to work in the general case even in the case the specific case that the program only uses constructs from within the set of things that you can do with Block v blocks it is not that straightforward to figure out how to how to do the reverse uh so Neil has a sort of a little research project that he's working on that will basically try to evolve a Block v program to be as close as possible to a given input text so one of the things that he's worked on so far is making the generators work a lot faster so that you can basically like randomly permute a Block v program like thousands of times and try to like you know tweak it in the direction that it looks more and more like a piece of code but yeah the fully general case would require a very comprehensive set of blocks and some fairly careful thought put into how to make that mapping work uh I will take your question but actually I just wanted to mention one thing which is I don't know if you guys are aware but there are some other projects that have looked at alternative approaches to the kind of blocks to coding transition one that I'm aware of is called pencil code which is a it's basically like a blocks editor but the blocks look a little bit more like source code and you can kind of click a button and the blocks disappear and just leave behind the source code and so then you can write code and you can then like click the button and then the blocks come back so that's kind of a clever that's kind of clever system so that that might be that might be a route that one could that could take again there was lots of things we would love to add to Block v uh alas we are a small group with a lot of responsibility and have to focus on you know fixing the bugs for one thing but yeah uh yes I have two small questions first of all why is google I appreciate that google is doing this because we all um use it and I don't appreciate a lot so why is google actually developing it what does it bring to google that's the one question and the second one I'll just add it quickly what will do with future of Block v so what do you have in mind on doing it okay so the first question but what why does why does google pay me to work on this I to be honest the the real answer that is is several uh several pay grades above where I sit on the org chart uh I don't really know for sure but I do know that we work with very closely with a part of the organization called education for social impact uh which as of uh as of this basically as of this weekend has moved to become part of google.org so it has always been uh I don't want to say exactly a charitable but a very intentional uh corporate citizenship outreach kind of uh department they run a lot of coding learn to code digital skills programs uh often targeting disadvantaged people marginal look at marginalized groups and uh we they are they are within google they are effectively our our customer um google uses Blockly as part of the um I forgot we have a learn to code program at cs first uh that is based on scratch which in turn is based on Blockly so the work that we do on Blockly eventually makes it around that loop so it becomes part of the cs first uh product that that google runs and and then just generally uh despite the current uh despite the current to belt tightening and so on at google google I think does continue to recognize that there is a considerable value to uh things like cs education uh for one thing uh we may not be hiring at the moment but for almost the entire company's history uh google has struggled to find enough highly qualified engineers so anything involved in providing more potential uh potential staff for the company has been a an important initiative uh hopefully we'll go back to hiring otherwise maybe that will be less useful as a justification uh as to ours the uh future of uh Blockly um we have a bit of an internal roadmap uh I find myself standing up here having had a long weekend of thinking about all kinds of other amazing projects and I cannot remember what we talked about in our team meeting even just a couple of weeks ago um but we are uh one of our major projects at the moment is to try to get our major partners up to date using latest version of Blockly because there have been a lot of bug fixes and new features added since there uh many of them have forked Blockly and are using older versions and we are gradually getting them back onto the main line version of Blockly uh after our big migration and uh in addition to that work we uh we have a number of projects adding features that have been requested one feature that was recently added was a set of APIs that allow you to create uh create procedure or function blocks in one workspace and call them from code in another workspace uh it can also be used to like uh have two workspaces that are kept in sync and we did a little demo within one browser but you could very easily ship the event data between two different browsers so you could have two people coding together on the same program in on two different computers um so we have a we have a we have a you know essentially a list of feature requests along those lines adding features that are useful for Blockly uh and generally we try to add them a fairly generalized way so that they can be used for a number of different purposes and that that one is an example of it. Did you have set outside contributions? Do you have set outside? Yes we do we do we are we are an open source project uh we get a lot of pull requests uh come our way uh you can see actually here we have two different github repositories the Blockly repository is the main Blockly library and we welcome pull requests there but that is maybe not the easiest place for less experienced developers to start the Blockly codebase is it's a little complicated and the interesting bits are quite complicated but Blockly samples is a repository containing a large number of plugins for Blockly basically additional stuff that you can plunk into the registry to add new kinds of fields and new features to Blockly and that is a much more tractable place we had a number of plugins that have been contributed to that repository and we get lots and lots of people doing bug fixes we have a good first issue label so if you are new to the project and maybe potentially even relatively new to contributing to github there are any number of little little bugs that are tagged that way that might just be like might just be fixing some documentation or you know uh you know correcting some typos or things like that uh and there's definitely lots of feature requests we quite often get feature requests from people on the forum they're saying oh how do I do this with Blockly and I'm like well we don't currently offer a way of doing that but here's a sketch of what you could do and if you can get it working please turn it into a plugin and contribute it back to the Blockly repository so other people can use it and uh of course it doesn't always happen but a surprising number of plugins have happened because somebody wanted something and we're like well we don't have time to build that but here's here's a rough idea of how to make it work go for it and and they've they've succeeded with it so yeah we uh we absolutely welcome contributions uh and in addition to external contributors we're very lucky to have a small group of other Googlers who work on Blockly as a 20 as a 20 project so yeah so that uh there's you're kind of in in a way as an external contributor you're sort of in good company because there's other external contributors within google as well we try to make life good for all of those people if we can. Any final questions? All right well thank you very much just before you go if you were one of the people who stuck your hand up earlier to tell me what you were building in Blockly please please come up here after the talk but yes so that's what you thank you so much for coming and uh I look forward to seeing what you guys make with Blockly
ZIMjs 2D PWA apps into 3D+VR with ThreeJS
Hello. Hello, Europe. Welcome to FOSDEM 2024. I am Carlo Roestel and I want to explain ZIM. ZIM JavaScript is a new way of coding creativity for kids who wants to type and not drag blocks. We can do different things. I will explain you the version 014, the version 015 and we launched this January two weeks ago, number 16. So let's enjoy the new tablet apps creating online. ZIM is created by Dan Zenn. You see a little man at the left side. He's from Canada and his name is Dan Zenn, but his alias is Dr. Abstract. And I am Carlo from Belgium and I created my own signature ZIM Salabim, the magic of code. Let's start. So this year ZIM already exists 10 years. So if you go to ZIM's history, you can see different versions. You can click on. So we now launched speech. Everybody is talking about AI, but we can now show you Happy New Year. It's okay, February, but still we can say Happy New Year to everybody. When we launch an app, here you have the possibility to see the promo for the app, the full version at the right side, the editor to see the code and to view the app in the online editor. So I'm going to give you a promo in the full version. So click on the button of the computer. You see an animation of the hand. It's a GIF. You can import into an HTML canvas and the computer should talk. If you click on it. So here it's a computer voice speaking, but you can choose wherever sound of the computer is active. Speak. This is Dutch. The language Dutch. I speak Dutch. So we are 10 years old. So we celebrate with this FOSDEM edition. I'm happy I can give you some information. I was here a year before also. But we now have the new ZIM forum. A little applause because we always work into the Slack version. So if you have problems or questions for ZIM, please contact ZIM forum. There is a big team of developers helping you if you have a problem to make apps. It's a pleasure to meet us there. So how did I start with ZIM? It was Corona three years ago and I'm a teacher and I was looking for my wife who is a teacher for little people, so toddlers in school. She was making videos. And the videos, they are not interactive. So I want to make apps because I'm a graphicist. I did not have lessons into JavaScript. So I found a man. I found Frank. I found Ken Academy. But suddenly Dan Zan came up and he is the inventor of ZIM. His little crazy man, but he loves a lot of inspiring people all over the world. You can see a little introduction. This movie. I was like a little bit of that myself. Now a two time Canadian New Media Awards winner. The second time was for educator of the year. I'm a professor of interactive media at Sheridan College in Canada. The first time was the programmer of the year for my website called Dan Zan because I'm also inventor of Dan Zan. Dan Zan is a large site. He's been going since the 90s. Many games and gadgets. Tools. There's a tool called Appartica, which was the number one appart making tool in Google for 20 years. Until Flash Die. And most of the Dan Zan site was made with Flash or even directed before that. And so it's now treated as a museum. I'm well known in the physical world for dancing and fashion parties. I had a happy celebration. Here are some of the things people have to say about me. These can all be found at the front of the creativity framework. Okay, so many people are telling about him and the reviews of the website. And I'll show you some reviews. So kids want to be happy and they want to make apps. So that was my first idea. And Dan Zan helped me to create simple active apps to go with the robot up and down, for example. It's a D-pad. And you can, for example, make dialogues and the kids can animate robots and aliens, for example. So if you click on the aliens, you can move them around in space or you can move the robot left and right. So this is what kids like a lot. They want to be satisfied with 10 minutes lesson, maybe about coding. And they can make this. We have assets for children made into ZIM kits slate. That's a new way of programming for kids. So also we have a hero sensor inside the RG. Here's him. Mobile device and we tilt the device. Also the robot can tilt left or right. And in 2023 we have results and no time is set. Frank Loss. So Frank Loss is somebody in the known physical world for game making with JavaScript. He made space inventors. And I was asking Dan, can we make also retro games? It's fun for kids also. But with the knowledge Dan gave me, I could use the information of Frank Loss and he is talking about ZIM that it's really fun. And fun is a factor for me to start exploring ZIM. Even videos in five minutes you can explore. So for example Angry Birds, if you want that, making Angry Birds a very known game. You have articles on dev.2 Angry Birds in 15 minutes. You have little videos where he's explaining how to code. So this is the intro. You see, he's really doing cool things for children. We look to other libraries also for examples and then maybe show you my example. I made on my website ZIM Salabim. You can find it there. And then you can drop and drag. So we have physics in ZIM. You can do whatever you want in ZIM. Just your imagination and you can make it. All for free and with a little help of others on the forum. It's nice to know. So for example the reviews. Thank you to ZIM. Develops. Time was quite short and a very smooth ride along a quick learning curve. So all around the world we are now. When you are a beginner intermediate or you're an expert, no problem. You can join. This is a survey so we can collect the people who are meeting us and the cat is Omi. We have given the name as Scratch has a cat. We also have our cat, the grown up, you know, the big one. Because we have to type a little introduction by then. Okay, so this is the framework ZIM slate. We a little change it. One year back it looks like this, but the main idea stays on the left side. You have the view of the app. Other outside you can code. For example, if color is white, you can use that to color the background in white. Okay, you see Harry Potter is teaching you because it's magic code, you know. Okay. We have. What is bubbling at ZIM bubbling is like what are we doing with ZIM. I'm Carl and I just had the idea to make kids happy. So I put a lot of effort each night because I also I have kids of one of five years and one of one year and eight months. So I need to measure time settings with my wife and it's very difficult because I believe in the product. It's free. It's open source and why shouldn't I invest in this beautiful website because I want to spread it to the world, you know. But I'm not alone. For example, I see games. It's from UK. They make a lot of kids apps. I'll show you. This is a website of I see games. You have math or English you can learn. You have even Dutch examples of apps. You're learning tools. This is, for example, a beautiful one. You can expand. So you have to learn the numbers. So you have to push 123. So the kids have to follow just with the finger. It's not doing anything, but the animation is made in ZIM. So I clicked the one. It's okay. There comes a star. Here we use physics so you can drag the wooden block and you have to make a bridge. So the children has to count because the one is before two and then we have the three after the two. And then we already have to running. So this is a sprite running, you know, this makes us fun for kids. So that's the idea after ZIM, but also top marks, for example, has a lot of examples for kids. If you want to be inspired as the teacher, this is like all made in ZIM, you know. So and also close up makes a lot of games. I'll show you a little further. So here we have the ZIM kids golden bar at the right side. If you go on the red side, you can click kids and then you can make the apps. But we do also VR for kids. So and we do AI. It can be imported in AI also. So here you are at the right side. You have a virtual environment. So I'll show you an example in some minutes. ZIM has also a Facebook where you can find some information. So the latest apps or post there. This is a clap lap from Israel who's making apps in ZIM. So this is the mediums of users have already played and use the products we developed this year. In the past, again, they want to take us about two months of work. Now takes us only two weeks of work. Working with ZIM is very flexible and makes it possible to work with dozens of components developed in the past. And also to develop and expand our own components and code elements that optimize the work. One of the biggest advantages of ZIM is the connection and support done and healthy. The availability and possibility of the web kind of ZIM developers and Slack or Git allows us to rely on ZIM as our main platform for development. Any question, buy or request is immediately a sensitive tool and we are very quick and we and our customers know that there is something to trust. Unlike other platforms, ZIM is constantly being updated. And every few weeks new features and innovations are released that optimize the work. Working with ZIM is easy and fast. Images, videos, samples, vector files and much more work naturally with ZIM. ZIM allows us to be present in the design and focus finishing our developments and games. The number one reason to work with ZIM is coding with ZIM is just fun. So thank you ZIM and all the team developers in ZIM. ZIM is here to stay. Okay, so I hope it's little advertisement but they also have the possibility made to tap lab to create your own game with an environment. So three games are free and the other of us you have to pay for. And I'll go to the next slide. So also the ICT games. It's not about why another ZIM is in the S when I'm in the game. And why I'm in the game. So I started 16 in the year 2000 and I wanted to make educational games to support my children from the past. On the learning objective of doing that thing and we get home. I wanted it to be free so that all children had the equal opportunity. And I originally made games in Clash and then about six years ago I was with an alternative and then ZIM. ZIM.JS which is just brilliant. So hopefully we'll see some things. So you can turn things and through this space. It's cute. And there's a mask to make it to chop off the edges, make it look like it's a mask. We've got randomly sorted shapes and words and then you can drag. And if you look at the code with this, then it breaks it up. See the grid there, climate and outfits. They describe how it works. It makes sense of what's happening. Also, I really like the documentation. It's not clear. The documentation. So we saw rectangles. We just found rectangles. I can type in something. I have a new rectangle again. So remember the themes of light in that game over there. This is all this section. This is all rectangles. These are all the powers and variables you need to make a rectangle. So you'll see. We have. We have explained the ICT man. The library has a lot of components made by ZIM and eco-appointments. You can use, for example, to make games for kids to connect lines. So many kids want to learn the numbers and connect them. And then you have an image popping up. So that was a question of me and then implemented it. So I was happy I could make games like that. So for my wife, this is the website. We have Dragon Games and matching games together. So you can visit it if you want. The slideshow is online. But now we will talk a little more about the environment where kids can make games. So it's a scene. So you have to choose for a background and some pictures. And you have to combine some pictures like the eyes on a gem. And you can then make your own creativity. So first of all, the names for a picture. We shortened it to pick. So you have a new pick. New out is audio. A video is feed. A hif is an animation. A hif is a hif. And a sfiji is a scalable vector graphic in a purple box. You have seen it. So Pragma is a little woman in the middle is the daughter of Dan Zan. And she's also helping kids programming because also girls can code of course. So it's good because also love it a lot. If you want to know how, for example, you need to make a new pick. We also have at the right side a help button. And then you see a video. So if you click the video, it then is explaining how you can work with the assets. A little explanation. So that's a little tricky. But let's try that again. Test. So no sound is playing because the user is supposed to interact with the act first. Here's what they do. Start. And now the back of the sound plays. Yay. So that's a little tricky. But. Okay. You can get to kids at kids.zinjs.org. So come on in and I'm going to show you how you can make a scene. Scroll down here. Make scenes. If you want to learn about coding, then you should be taking a look at all of these parts and trying out the handy tutorials here to work with you. But you can also make a scene in Slate. And Slate is this link right here. Or you can click. And that's a place where we can type any code that we want. Of course it needs to work. And then we can test our code over here. You may see it's starting with this demo. And then we can test our code over here. And then we can test our code over here. You may see it's starting with this demo right here. A fragment. And there's the code for that. But we want to make our own scene. And we'll do that over here. And test right now. We got nothing in it. Alright. We can add some assets. And then available up here. And if you want help on how to do that. Well, this is a help video. So hopefully this will give you help. But you can press help there. And what that does is it jumps you down to this help section right here. Where it tells you how to use the asset buttons. And it says see the video. There's a bunch of examples as well. We can add a background color. And that just adds a color. That's not really using assets. But here's adding a background image. Such as a beach. And there's adding images. Like a butterfly. We'll do that too. There's a code to be able to do that. We'll talk about that. Here we can put things into containers. And move them all together as well. So we'll take a look at that. Here's adding sounds. So that when we press on above, it plays a sound. And then here's adding a background sound. Which is a little bit tricky. Because we're really supposed to click on something in the app. Something in the app. Something in the app. Before we can play a sound. It's a sort of a polite thing that the browsers make it do. Because otherwise people would go to a page and sound would be playing right away. So that makes it a little tricky to add the code for a toggle button. As you can see. Or the start button. It's not too bad. We can actually test just fine. Nice and easily with the background sound. But that will only work because we're pressing the buttons here. If you did. In a full mode. Don't worry too much about that. Let's start before we go into the assets. Let's just show you how to change the color of this. The way you do that is you go frame.color is equal to something. Such as yellow. We have a bunch of colors. Most of the playing colors there are green. For instance, will work. And we get test. So there it is. This outer color here didn't change. It's still black. We can change that color by going frame.outer.color. Is equal to blue. Oops. That's cool. And we can test that. We may have done some other colors down. In the example. There you go. There's also a way. If you take a look at the demo. You see how that's called a gradient. And if you take a look at the code there. We made a new page. And we told it the stage would. And the same type. And we told it two different colors. We added that to the stage. So if you did that. Copy this. Copy. And paste it in here. And do a test. There it is. It's going from pink to blue. And our outer color happens to be blue. Okay. So that's some ways that you could. Change the other color at that point. There we go. You can see how I think of it. That's some ways that you could start off. Without the background image. And just start building things here. For instance. A new rectangle. Or a new circle. And we'll make it 100. Radius and red. And dot center. And dot drag. So this is you making an app. It just doesn't have any pictures. But it allows you to think. Think what if we want. Start making a game. It just has these shapes. Okay. Now let's go to assets. And see how that works. We'll clear that. And test. So we're back to nothing here. We're going to go and add a background image. So I'll click on backies. And then we'll go to the background image. And here's a bunch of different backings. We've collected for you. And thank you to back easy. We've got these right. And thank you to our co-ops. You help process all these. Okay. Let's choose the beach. Beach one. So you put on the beach picture. Or the check box. You want that check gone. And then hit save. What that does is it says. Okay. So we're back. And we can access that by going asset. Singulate. Asset. We're using an asset. And then you put the name that is right here. Beach zero one. Inside of little quotes. That turns it into a string. If you don't use a string, it will break. And then we will talk center that. So there's the asset. Asset by the way is a word for images and sounds. And we have images here. And also sounds. One day we may add sprints. So there is our background image. Now you can go to phone mode here. You'll see the phone mode makes it a bit longer. And you can see that the image doesn't really fit perfectly in the stage. And if we're not intended to. It's not high enough. So that's because the asset is just a little bit too big. And you can see that the image doesn't fit perfectly in the stage. And if we're not intended to. It's not high enough. So that's because the asset is just a little bit too big. Or for this. Which means we'll need to adjust. We can do that with scale.scot. So we put in a.scot. And if we make it twice as big. And save it. Or get test. There it is. And that fits. It's a little bit too big. As a matter of fact, if you want, we could drag it and sort of see what we mean. See? There's some. Maybe it's just fine. Maybe we like it like that. But we sort of have to guess at the scale a little bit. Or we could make it this scale. And then move it over. For instance, we could center. And then m o b e for move. And then we want to move it this way. This way. Then it has to move negative. It's negative x. So we could move it minus 300. Hey, there we go. So if we moved it to zero. There's what it looks like. You can see. But we're going to move it 300 pixels. Over in the negative direction. Okay? So that would look like this. Negative 300. And hey, that looks pretty good. I like that. But there's an easier way perhaps to just pick the picture. Or make the picture. Fill. And let's show you how we can do those two things. Moving it. What we'll do is we'll use scale. To. So if we use scale to. That will scale to the stage. And I say this. And now it's scaled to fit in the stage. Like that. If we made it at a bone. It fits to the top this time. And if we made a portrait. There it is. Fitting in there. That's okay. Well not really. Because it leaves this stuff on the top and the bottom. That's fit. So we want the sweepy bracket. Tight. Pulling. Fill. So we'll see what we've done there. We've used the sweepy bracket. To go to the tight parameter. Of. Fill. Hey look. Now it fills the stage. Fill in the stage at a bone. And you know the portrait. It fills it. So it fills it. No matter what you choose. You can still move it afterwards. But that will probably do us for now. It also may. Okay. A little introduction. To the slate. We have made several changes now. To import 3D into the editor. And a slate so kids can make 3D apps. Everything is explained by new teaching videos. And we have made for teachers the editor. Yes the editor. You can save for free all your zaps you're making. Otherwise you have to download it. Save it on a place you don't know. So now you can make zaps. Zim apps. Is a combination of zaps. The name. And then you can make lists of zaps and share them with your students. You can also find some demos. And all that is for free. So no advertisements. So this is the other one. The newer one. How you can explore. Zim with demos. And here you can have a look. To the. Editor. So. You can also find some demos. And all that is for free. So no advertisements. So this is the other one.operate seeds and make sure you錢. Come on. Got a start. Good news. All 2018. Slowing. So we have. the search here, like in DITS. And as soon as we do all the collapse, all of that means the results are, but this collapse is also the opening of my collapse, my result part. So there's how many we have in so far, 30 seconds of them. And the fact is that the current one that I have to put in here. So if we take a look at the dianobesan, oh, there's dianobesan, the bactezan, the grids, the guides, the do-wees, and passing parameters or object pages, the variables, and so on. And you go into the slide. So those are the dits. And in other words, great, great, we can just hide it that way. But we do have as well, and this one's called list. So once again, I'm logged in, which allows me to keep track of the files. And then here, what we're going to do is copy the bits. And I'm going to copy it to my own list. And I will call it bits. And note that they're all the names of the bits. And you work nice and happy with them. And it's saved. Now under save, here, see there's the files. List, OK, so that's right. That's not a list. So save is anything that I happen to get plus sign on. And then save plus my own list. So here under save, we have my own list. So you would have your own list. But we want to look into this. And then I can have these are my list. And they are my art, my prepassion, my basics, and here's my new list. Wow, take a look at that. So I have. OK, so you understand the idea, making lists of all your zeps. You can share them with the world. And we can go then for children much more into depth to learn code, of course. So we're building, building, building. And we hope everybody understands how you can manage and share your zaps, your ZIM apps, your little code to let children be creative. For example, this is a blob. A blob is a shape that is closed with a kbizier curve. And you need to change the parameters, the lines, to make all the form of shape. And this is needed to make, for example, the kids app connector. So you have to make a path or a blob that is closed. And when you have the blob, you can make a game of it. OK? So I also was thinking if kids can make games. If I have to make all the games for the kids, it's a lot of work. So I made a tool. Kids can import their own image. And then they can change the blob themselves. So you have like this basic blob. Hide the numbers. I can say I want to edit a blob. And I can just drag around the cat, the blob. Otherwise, you have to program it and drag and drop with big buttons. It's very easy. And then you have saved the new blob. And you can have the points to code, to use it into your game. So that is very important for children that they understand what is a form, of course, of a shape, to explore a code. Also, beats are possible. Beats, for example, stars. You can make the Europe flag. We are in Fosdame. It's Europe. So 12 stars need to be on the circle. And then you can add beats. Also, emojis. We made a tool to get kids creative with emojis. So what is the code for an emoji? They all send emojis on the phone. But what is the code after an emoji? So we are going to the depth for the kids and then, oh, wow, we can make emojis. So for example, the code for an emoji is new emoji. We have the code and then 200 pixels width and then in the middle of our game. So if you go to the site, the emoji, for example, this is a tool. You have like, ah, I want to do a tennis ping pong game. I want this, I can do copy and I can import it into Zimkits. Or I want another one. You can also find one by your own operating system. So for example, I'm happy. I'm in love with Fosdame. So I want to use the code of this emoji. So this is why kids are loving Zimkits a lot because there can be real coders. They can see the code behind. Also, we have panels. You can close, open and close. For example, emojis in the list for teachers, kindergarten. For example, then they can talk about what is this, what is that. So in a little time, you have pictures because they are emojis. And if they want to not anymore on the screen, you can remove them. So you can open close. You can move around. So you can make whatever you want and let children talk with, what is this? Oh, my daddy likes this a lot. I don't know what it is. But I think here everybody knows. I also have a little thirsty. I'm a little thirsty. So I will drink something afterwards because I'm happy and in love with Zimkits. But also the indicator has love emojis. So indicators in games, you want to know how many lives you still have after hitting something. And then now you can use emojis. So another tool we launched Zimkits is the emitter. So Zimkits has lessons. But the emitter wasn't yet possible to show to people. So this is the confetti. You can choose whatever you want. You can make fire if you want other colors. You emit something. And the code you can create by the button. And this code, you just copy it and paste it into the editor. So you paste it. Zimkits is slate. And we just do this way. And then we test it. And we have the code. You see? What I also explained to kids is how color works. Color is a big problem for children how RGB is working. So I made a tool for kids so they can change the color. And they understand what is alpha, what is looking through, what is hue, what is black, white, how does it work. So they can learn a lot of the techniques about color. Hexadism, all the thinking. This is all possible in Zimkits. Sliders. It's unbelievable. So these are tools I really need as a teacher to explain what is coding. Also, if you have particles for stars or a snow simulating fire, we have a website. And Zimkits has different levels for animating objects, of course. I'll show you a little example. So we go to the parts, events. Here you have a triangle. And you click on the triangle, and it goes to the right side. No problem at all. I go to the code, and kids can learn the code. This is a triangle. It's rotated. It has a cursor, the hand. And it has a position at the left center. So at the left side, in the middle, it's positioned some pixels on the right side. So that's the start position. But if you go to the level two, they learn about animation with the spacebar of the keyboard. So that is also possible. But here, you can click, and the circle is moving. But what is also possible, the other side, you can go with the arrows up and down. And when it's hitting, you get confetti. So this is why we needed the confetti, the emitter, to show children what is possible. Because when they are fighting in a game online, they also have to see something when they die. So this is also a possibility. Confetti, when we shoot something in the air, it goes to another place. And this is a little faster. But we have also the sound with it. We have a counter of the time. It goes down. And then, oh, we have 10 seconds left. We need some more points. Come on, come on. And then when it's finished, we need to see what do you think, what do you see? Game over. So this is a little animation for children. How can we make a game? If you want more complex, then we have sprites. We can really animate like in a real game. So this is the emitter, why it's used. You can also catch beats. So an emitter I used for is my own name. I learned kids of 12, 30 years how to make their own name. And a little game for the name is please raise all the cookies. And when you're happy, all cookies are coming. These are pictures into the emitter. So you can do whatever you want in the emitter. And if you want to play again, you can launch the button to play it again. So creativity is all yours, of course. Walking onto a path is also something why we use the path, of course. Here you have the path. And on the horse, there's center class. It's a typical Flemish ritual. So he comes on the roofs of the houses to bring presents to the children. So he's walking in Antwerp above the roof tops. And you can stop and do help. What do you have to do? So there's also a path. But it's coming back when it's at the end. OK. So also I made a game for Saint Martin. I live in Bevereux, next to Antwerp. And there is like a candy man. And the candy man is there at night. And he has to throw some candy. So you have to hit the little people to get them candy. So if they are happy, yay. And this is creativity. You can put your own pictures into the background on characters. You even can now play multi-pong. If you are with three screens, so two phones and one laptop, you can play pong along with your phone. So I would say test it sometimes. But I cannot show it now because I have no time anymore. But it's also working. OK. We have also animating. So if I see something on a classroom, I want to make it myself. So animating is with sprites. This is Dr. Abstract. He has colons and rows. And sprite sheets, we call it. And then we use it also for games. So for example, this one, we have three layers. And we can shoot. So that's what kids also like, of course. And they want playing games. But I'm not a shooter man. So I make it with vortex. Vortex is a little brave or another way of thinking. So you can remove somebody standing over there or something. But there is nothing to do. It's suddenly walking and jumping. But you can shoot. So another ID. What do I have here? We have also a video. Mr. Sander is somebody from Holland. He makes a lot of videos. I was asking him, can I do something with your videos? So he's asking, where do we hear the letter B or something? So on the left side, the different images has to come. And then you have to click left or right. What is the answer? So this is only the first step that I've shown you. But we have also Mr. Sander in the 3D space. So I want to show 3D. Why? Because in September, Zim launched the possibility to go forward into a 3D space. So everything is 2D, but you walk around. And you can go with your mouse left to right. And this is very amazing for kids. So they want a game. And they can even play videos in 2D space. All made by Dan Zain. So this is also about the science day in Belgium. This is Mr. Sander again. So this is playing. We are moving. So you can pause. And you can go forward. And you can do several interactions with the several Zim apps. So everything is a ZAP. And you can place it wherever you want. This is the book. We have an interactive book also in Zim. So we can go left and right by touching it. This is unbelievable. We can do it in 3D. Interactive books. So if I put the little people there, I go back. It's still on ZAP. So we can make books. We can path. We can go like here. We have a circle. We make it bigger. Other right side. We can animate everything in 3D space now. So this is a big step forward for people who are 3D makers. This is another example. We can even use the VR glasses. So down you have a button, not support, because on a website, a computer, Chromebook. But if you have the glasses on, you have to push the button below to go. And then you can go forward, backwards. You can then animate things, make it bigger, make the solution. So this is very amazing for me. So this is for the high school in Belgium. I made an example. Thomas Moore in Mechelen. So this is a big step forward for us. The game then made. It's also on the right side. But it's not with animated panels. It's only with cylinders. Also, we have a game module. So you have to import some modules. And then you can make a talking people. We have sensors, ZimCam, for making games interactive, of playing music. You can wave to your cam, and it's interactive. So this is the magic of code. I put the wrong button. So if I go to the number 5, demo 5, it's very cool. We have like a cursor. But now your cursor is your hand. You move your hand onto the red button circle. And it's activated, and you can place it somewhere else. So this is what I want to say about the magic of Zim. Zim can do everything. I didn't push the button just with my hands. I can move along. So you can make an Ufu landing on the world and giving everybody some happiness, love. But the next thing I want to show is speech. Speech is launched two weeks ago. And I did some pioneering work for it. So this is a little man, a boy. And of course, I want to talk to him. So I have a red button. Hello, everybody. How are you doing? And he's typing what I'm saying. So this is a big applause for Denzen in implementing it. So in Canada, thank you very much. All over the world, speech technology is very huge. And nobody else I've seen implementing it into apps. And we can do it. So also, for example, a robot, Basin Adrian, is from the 80s. I was a lover of Basin Adrian from Holland. And I made a robot because I was a huge fan. So I show you. So I made the finger at the left side so you know what you have to do. Hello, I am Robin, the robot. And then you have to, for the next sentences, hello, I am Robin, the robot. Again, hello, I am Robin, the robot. Yes, robot, of course. And then you can play it. And Robin is Robin in the language. And Robin is Robin in the language. So the language you can choose the language you want. Hello, I am Robin, the robot, of Fosdam. Hello, I am Robin, the robot, of Fosdam. So, even my name. Thank you for listening to Carl Roosale. Thank you for listening to Carl Roosale. Thank you very much. So I don't know what to say anymore. This is what I was liking a lot. But the problem, at least, is the children can talk every word. So what do they want to say? Words they don't want to hear as a teacher. So what did I offend? A list where children can only talk the words of the list. So he has the center class. He comes. He wants to bring presents. What are the presents that he brings about a Vliegtheug? So it's a plane, a bear. And if you talk to the circle, Vliegtheug, you get the word and you see the image. But you cannot talk any word. He only listens to the words I'm talking. That's extra magical. So I'm very excited about him. And we even have the controllers possible. So you can play with your controllers, also games. So if you go to heart 2, you can go to mouse for now. But if you have a Bluetooth connection or a controller, you can collect all the buttons or the circles. And when you are finished, you get love. OK, so I want to inspire my kids of five and one year and eight months. So I don't know anymore what more I can expect from the end end. But if you have questions, ask him. He made already the Groovity app. It's a PWA app. PWA apps or progressive web apps. Progressive web apps can be installed on your phone without internet. It's working. So we have like Groovity with many pages. You know you have seen apps with one page. Of course, we want to have children of the kindergarten with multiple games at once. So these are all stages after each other. You can play around. It comes back. It becomes bigger. I am on Placement in Belgium. It's a big website where you can post all the things about internet, you know. And it's very liked by the people, the teachers in Belgium. So now they want to make it theirself. And that's why I'm making Slate better and better. Also, the physics. We have a ball, for example, to jump. This is a basic example of sim. So you have to push the ball. And it can go outside or sim. So there are no limits for the border. So you have to push and then it comes. So now it's cool. We can add sound to it because we have the speech and the sim. So the kids know it's a number one. It's a number two. He can learn counting to 10, for example. This is what I show you. The explosions already. Sim coding with Slate. So these are all the websites you can visit. And also on TikTok, on YouTube, of Instagram, we are everywhere on all social media. Just take the time to explore Sim because it's much. But if you have grown up and about the blocky and all the blocks dropping, just go and typing. That's my thing I want to say to everybody. And last but not least, in March, we have Flanders Technology and Innovation. It's a festival about what we invent in Belgium. We invented the Jeepig compression by a woman from the university in Limburg, the province. And I made a little VR glass. You can see a video playing while we are turning. We are looking for our own way. We are looking for our own way. But above all, we are doing it for the sake of it. Like one of the first windmills, which we just saw here, in Zürich, there is a smartphone that is a little bit connected to an image in Leuven. And each one is only available through the game in Limburg. Strattoch or mitroscopic levels are available by the name of Strattoch. Red means order beginning or flight to the next if we want to make it better. Really, Flanders is working on it. But I'm sure even smarter if we make our own screenings for the big block if we can speak to each other. Look at the future. Or do we need to calm down? The future. Because that's exactly what we want with Flanders Technology in a vision. Do. Do it. Do it. Here in Flanders and all over the world. Yes, even this world. To stimulate, inspire and support talents. The problem of tomorrow is now solved. So, please join us. Thank you for coming to see how the world changes. And above all, how it started in Flanders Technology in a vision. So you see the head of the person. This is Flanders. By night, the lights on Brussels Antwerp, the coast. If you know, we have two parts in Belgium, Boulogne and Flanders of course. And if you have this in a Zapp, so you saw a website, I made also the Zapp. You can go to the full mode. You click on the QR code, you can do it yourself. But here I put it into a newer glass. The quest to reach with an image. Problems are stuck. That's what we want. Experiment, problems. So you see, you take a 360 degree picture and you can put whatever you want into the space with Zipp. Thank you very much. APPLAUSE See you online or somewhere else. Any questions? We have, I think, one minute for questions. Yes, or any other questions, please. You are all amazed about what Zipp can do.
From phone hardware to mobile Linux
So thank you all for coming to the Fossil Mobile Dev Room. Happy to see that the room is quite full. Somebody sees the sign that the room is full. Please turn it on the outside. I have no idea how to do that. And Luca is here. He just made a master done post about standing on the stage. And yeah, I guess we are all excited to see Luca's talk from phone hardware to mobile Linux. So please give a round of applause for Luca. Hi, thank you all for coming. Yeah, this is my talk from phone hardware to mobile Linux at Fossil 2024. So about me, my name is Luca Weiss. I am an Android platform engineer at Fairphone. But I'm also an open source maintainer and contributor for projects like PostMarketers, which is a Linux distribution for mobile phones. OpenRazer and open source driver for Razer products, so keyboards and mice, and a bunch of Linux kind of stuff. And you can follow me on Masuda if you want. So what is this presentation about? So in order for me to really understand how to port Linux to a phone, I need some understanding of how the hardware works, because otherwise it's really obscure what it's doing, et cetera. Here I'm trying to go from the PCB level using Schematics, which are public for F4 and F5. Since it worked, that's quite useful for me. Also trying to say how Linux communicates with the different chips on the SSE, which protocols they use together. And even though you might not have Schematics for the device you are working on, generally these phone Schematics are quite similar, because they are based on some reference design. So even though it might be a bit different, the same concepts generally apply. So let's look a bit at the printed circuit board. Also kind of on the left side, you can see, I cannot really show you, pointer nicely, you can see a kind of a phone PCB. This is from the F2, which I found. The picture with the red arrow, or the chip with the red arrow, the big one, is both the SSE, but also the memory, like the RAM. On this device, on newer devices, the RAM is combined with also the, for example, the UFS storage, the internal storage. This is put on top, called package and package, so they are actually both chips on top of each other. It saves PCB space, so you can make it a smaller PCB, and also saves some signaling, or makes better connectivity between the two chips. And the chips itself, they are kind of these big array of pins. It's called ball grid array, because they are little solder balls. And these are like a thousand pins or something, and they all connect to the PCB, and that's kind of where everything's routed to. Yes. So then if we go kind of inside the SSE, so this is the one chip that you saw. Here you can see a so-called decap chip. Here the Snapdragon 845. And you can see only kind of the green area here, is actually the software Linux is running on. These are the, in this example, 8 ARM cores, and they are running kind of the Linux you interact with. But as you can see, there's a lot of other stuff on the SSE. Stuff like the modem, is a separate chip. There's a GPU, there's DSPs, et cetera. There's a lot of other things also running on the SSE. Most of them are booted up at runtime, so Linux is actually loading the firmware into RAM, and then you're booting these chips using some proprietary firmware. So, yeah, the firmware there is not for Linux, but for the separate chips on the device. Unfortunately for most devices, at least for the hacker community, unfortunately, it is, you cannot replace this firmware, because it's signed using private keys that are, where the public key of that key is burned into the SSE, so you cannot change this firmware. So only the manufacturer can change firmware, but in this way, kind of the manufacturer can also ensure that only trusted firmware is running on a device this way. So, yeah, all of these firmers, they're basically just executable files that get loaded. In this example here, for the ADSP firmware, it's just in the hexagon architecture, so it's just a different CPU architecture. But modem chips is also, I think, there's some firmware for different CPU architectures. I think 10-silic or something, I've heard, we just used for the video firmware. So yeah, now we go a bit into the software side. So kind of you saw that there's a lot of stuff on the SSE, but we also need some way to address this, because otherwise we can just run CPU instructions and they don't do much as in communicating with anything else. And so MMIO gets used for this. It's called memory mapped IO. This is kind of, I believe, what most or all ARM chips use, probably also on a lot of other CPU architectures. And you can kind of imagine this, so I guess you're probably all familiar with RAM, so you have some address space and you can write to it and you can read from it again. But kind of with MMIO, on different addresses, you can talk with different functionality. So for example, on some address, for example here on the address 0x100000, there's the GCC, the global clock controller. So you write to some bits there and then you activate some clock on SSE. Yeah, yeah, so just by writing to some addresses and reading from some addresses, you can kind of talk with the rest of the SSE. On the right side, you can see the same representation in the device tree. So this is the thing that Linux uses to let the kernel know, or not just Linux, but to let the kernel know kind of where it can talk with the different components and which drivers it can use for this. So you have a bunch of different nodes for the different addresses which kind of defines like what component is present at this address. And this needs to be statistically defined in the device tree, which is kind of similar to the ACPI firmware on x86. So there it's just kind of a bit differently included for the kernel, but it's included with the kind of the bias that you get on the phone. Yeah, just to clarify the picture, like the colorful picture on the left side, just for illustration, the other two values should match kind of. Yeah, in the PROC IOM, I'm output you can see, so where you can see the different memory regions. You can also see at address 8 with a lot of series to 2.7 FFF. It's kind of the system RAM, so it's actually the address space for the 8 gigabytes of RAM that are on the device. So if you're right there, you can read the same value back just as you would expect from a RAM. Don't ask me how this MMIO stuff works. I attended the Free Silicon conference last year. They know kind of how this stuff works. I have no clue. So talk to them if you want to know more. So you saw two slides before with the BGA, so the many pins on the device or on the SOC. Even though it's a lot of pins, so it's over a thousand pins or something on kind of modern or, yeah, modern SOCs. Of those, most are used for ground or power, so really just not usable for anything like more than ground and power. So on the chipset for the FF5, you have 176 kind of GPOs or useful pins, which you can use for do-do stuff. Even though it might sound like a lot, it's really not much. So if you have, for example, eight I2C buses and eight HPI buses and eight UR buses on the SOC with separate pins each, you already have 80 pins used. And you still want to do a lot more with an SOC. So there's normally a lot of different functionality behind each separate pin on the SOC, which you can switch to in software. This enables flexibility from both the SOC manufacturer, because they can produce one SOC that can be used for a lot of different use cases, but also on the other side for customers buying this chip, they can use the same chip for also a lot of different use cases. So they can, for example, if they actually need eight I2C buses, they can actually use them and kind of sacrifice some other functionality by switching to it in software. So this is called pin control in Linux, also called TLMM for Qualcomm devices. So you can kind of switch between different functions. So for example, between I2C, SPI and UR, you can switch to a different function. And you get behind the pin, I would say, there's a hardware controller that can actually talk this protocol natively and can do it very power efficiently. On Qualcomm devices, there's up to nine different functions per pin, but most pins really have something like two to four functions, and this is just a part of the hardware that you get. In theory, for most pins, you could actually kind of bit bank a protocol on there. So for example, using a software driver for I2C can just have a loop that sets the pins low and high all the time. This generally works for I2C. It's also used in some devices, but it is prone to bad timing, for example, or it's also just using CPU cycles to do something that's really something dedicated hardware component can do way more efficiently. On the right side, I grabbed this illustration from the Raspberry Pi documentation. It kind of shows that on the one GPIO that you see, and there's a lot of different functionality behind it. So for example, in software, you can enable a pull-up regulator, which is useful for I2C buses, or pull-down, or a lot of different interrupt detection mechanisms. So you can just connect a GPIO to an interrupt line from a different chip and then get an interrupt in there. And the I2C can tell you very efficiently how this works. The bottom left is an excerpt from the device tree. So we can see here, for example, the GPIO155 is just configured to be in the function GPIO, so just normal high-low. So drive strength of two milliampers and bias pull-ups. Bias pull-up here means, for example, the pull-up regulator gets enabled, and you don't have to do this in hardware. You can actually use the one built into the SoC. For the second example, there's the I2C buses. So for example, for GPIO24 and 25 on this SoC, for the function called QUP06, which is Qualcomm thing, Qualcomm talk, you can get an I2C bus there. And you need to configure this in software because otherwise you just kind of get the default pin controller set, which is either set by the boot load or just kind of the default GPIO. So if you want to talk SPI, you need to set this correctly. Otherwise, the communication from inside the SoC doesn't go anywhere. So kind of now we know how the pins on SoC works. So here's an example of, for example, how the speaker amplifier is connected to the SoC. So there's a physical connection between the SoC on GPIO8 and 9. Nothing with microphones falling down. Sorry. Try to fix it again. OK. Perfect. Yeah. So, and this is connected to the SCL and SDA pins. So for I2C on this AW88261 chip, which is just a speaker amplifier. This is used for the speaker connection, so how you output audio on this device. It is used for controlling this amplifier, so kind of telling it some information that it needs to know to be able to properly amplify the audio that's coming. For the audio data, we'll be on the next slide. So here, also for GPIO8 and 9, you can see in the pin control driver, which is the first top screen on the left, or with the black background. Here you can see from the pin control driver, that's for GPIO8 and 9, there's a separate function defined. So then in the device tree, you can use this function. So QUP02, which is, again, Qualcomm Dock for I2C here. And on the other side, you can see on the I2C pin, or on the I2C bus, there's this on this AWINNIC chip defined. So the chip on the hardware has its own node in the device tree, which has, for example, here the I2C address defined, which you can find in the datasheet of the device, or if you're plotting downstream Linux, then you find this in the downstream device tree, which is probably most commonly what you use. And also, kind of knowing that GPIO8 and 9 here is the thing called I2C2 in this example. This is also, if you work from just the hardware design and datasheets, and you find in the datasheet, but these are most of the time confidential, so you probably just look at the downstream sources if you're plotting main Linux for it. So the speaker gets controlled via I2C, which is a very common protocol for the actual sound data that you send from the SoC to the amplifier. So it gets to the speaker. You actually use a protocol called I2S, Inter-IC Sound, sometimes also known as M-I2S, which is multiple I2S, so where you have multiple channels. And you can see here for GPIO150 to 153 are connected to the I2S pins on the amplifier chip, and this is how the audio data then gets actually transmitted between the two different chips. Again, you need to configure the pins correctly, so you need to select the I2S functions on these GPIO150 to 153. The example might be a bit confusing because actually in the device tree you set GPIO6, 7, 8 and 9. This is a bit also because it's not part of the same ping control thing. It's a bit complicated. Some probably going to skip this. But you also need to, because you've set the I2S output correctly, you still need to send the audio data to the correct I2S bus because the SoC has, I guess, five I2S buses. In this case the Qwinary one is used, so primary, secondary, quaternary, etc. And Qwinary, so the fifth one. You need to configure this, and then you can really set the sound device tree configuration to send data from the, here in this, for example, it's called Q6AFE die from the Qwinary basically to the amplifier. And then Linux knows that it actually should use the Qwinary connection, so GPIO150 to 153, it's sending it to the amplifier, and then Linux also knows that it needs to configure the amplifier correctly to be able to output the audio correctly to the speaker. Here also the I2S interface is actually handled by Linux itself, but by the ADSP, which is the audio digital signal processor. It's also a thing on Qualcomm, probably also other SoCs, where the main SoC can go into a lower power state if it's just playing audio because then the separate chip, which is way more efficient, can do this correctly. For microphone it looks completely different because why have the same thing for both audio output and input. Here it's going via the WCD9385 chip, which is a Qualcomm chip, which is kind of an audio collector. It's connected via sound wire, which is a completely different protocol. It's a new standard from around 2014. In comparison, I2S from the previous slide is from 1986, so it's quite an old standard. It's also a slim bus used on some SoCs, or actually still on the same SoC, but for different use cases again, which is standard from 2007. In this case, again, as you can imagine, there will be some connection path between the SoC and the amplifier chip, and then from the audio codec to the microphone to get microphone input. You can see here is a bunch of pins, cheaper, 144 to 158, and sorry, 149, and then 158 also. They're connected there. It's a protocol's sound wire, and the other will be just a kind of a physical connection between the audio codec and the microphone where it can get the data there. This WCD chip is used for three different microphones that are on the phone itself, and also for USB-C audio, where you can actually get an analog audio signal via the USB-C connector out, and here both output and input is handled via this chip. Yeah, so the microphone also needs to be configured. It works a bit differently here because it's a sound wire device, so it's different. But you have these two sound wire devices on the left side. For some reason, they have one for receiving and one for transmitting the data, and then you put them together on the right side as the kind of, again, you have a node for the WCD audio codec because it's a chip on the SoC that you have. From this, you reference it, and then you also, again, need to configure the audio path from the, or basically the other way around from the audio codec, so from the WCD chip back to the SoC so it can receive the microphone data. USB-C is quite a cool thing. In the past, micro USB only had kind of the functionality for power, so to charge the phone data, to transmit a bit of data, some pictures behind it, back and forth, and also OTG functionality, to, for example, plug in a keyboard, but it was not really used for much. USB-C can do a lot more. On the FFVV, you can use the connector for USB 2.0, so USB high speed, USB 3.0, which is called super speed, analog audio out via this thing called the audio adapter accessory mode, which is a standard part of USB Type-C, and also display port out via the display port alternate mode. So yeah, on the 24 pins, you can do a lot of different things, but for actually doing this, you don't just need the SoC and the USB-C connector, but also some other components that actually switch to different components based on kind of what the user plugs in. So for example, for actually getting analog audio data out of the USB-C connector, so you can just connect USB-C headphones that don't have a, where the data is not transmitted digitally, but just like analog, just like with a headphone jack connector. So actually the analog signals, they go over the USB 2.0 data pins, so kind of the D plus and the D minus for the right channel and left channel, and also the two sideband use pairs, which is a part of the USB-C connector. They are used for microphone and ground, or ground and microphone, depending which headphones you are in, because there's two different standards. And in order that audio data here flows again, you actually need to configure the different chips to handle the data correctly. So I've tried to kind of put in a reasonably understandable format. On the top, you can see the WCD codecs, or the sound wire codec that went before. It has the headphone jack output and microphone input. This is routed to a chip called the OCP 960096011, and this is some switching between USB and audio, depending on the use case, and this send this router to the USB-C connector, so you can actually get the data out there. In order to do this, kind of Linux needs to tell this OCP chip, this audio switch, to actually switch between either USB or the analog audio, depending on what you plug in. And this is here called the mode switch in Linux. So in Linux, in the device stream, you need to connect them together with some USB-C nodes. There's documentation for all of this, which might not be the best, but if you understand it some fun, you may be understanding. For this report alternate mode, it's way more complicated, because you can put, there's a lot of different combinations that you can put on the USB-C connector, so some of the super speed pins you can reuse for this report, either just one line or two lines or four lines, and for four lines, you actually cannot do USB 3.0 anymore, because all of the USB super speed lines are used for this report, but you can still use USB 2.0 for this. So there's a lot of different combinations that you can use. The DisplayPort aux channel, which is used for some communication also, runs over some different wires, which also needs to be switched correctly, depending on which orientation you plug in the DisplayPort connector, or the USB-C connector, basically. And everything needs to be configured, and if something is not configured correctly, no data will flow, so then you have a bad time, because you don't have DisplayPort out. So kind of device tree, because this talk has been about hardware and device tree, device tree actually represents hardware. It doesn't represent software concepts, you're not telling Linux to do something, but you're providing Linux with the information that the hardware is made in that way, and then Linux can use this information to actually do the correct things. So keep this in mind when writing device tree or the bindings or the commit messages. Don't write about, hey, this is telling Linux to do this, but you're actually telling, like, you're describing the hardware. Yeah, if you're adding a new chip, all of the power supplies and GPOs and everything should be included, because, yeah, it represents the hardware. Even if Linux doesn't use them yet, you should include them in the bindings, because the bindings represent the hardware. Device tree is also operating system independent, which is really cool, because many, many years ago you wrote kind of really Linux specific codes to make the hardware work, but device tree is operating system independent, so you can use the same device tree, for example, UBoot, which is Caleb's talk, which is coming up in a few minutes, but also, for example, for FreeBSD or whatever. They should be able to use them, because it represents the hardware, it doesn't represent Linux specific things. So, yeah, so, what do you want to say? Right. It's a great ending. Yeah, don't put any specific things in there. They should work if the other operating systems have driver for it, of course. And that's it. Thank you. Thank you very much, Luca. We have, like, two minutes for questions, so if somebody has a question, please ask now. I will give you the mic. Thank you for the presentation. Can you tell us if you pick the Fairphone 5 now, or maybe the 4-1, like, and you actually rent a vanilla Linux channel on it, what can actually work? So you spoke about headphones, mic, things like this. What's the actual state of the Fairphone on vanilla Linux? And also, next question, obviously, is, like, any plans to make it even more, like, open in the future? So, right now, like, completely vanilla Linux. A good chunk of stuff is working, but some stuff like displays are not working yet. But I have some patches which, like, for displays, still depends on some hack that actually Marainas is working on, hopefully. Yeah, you can use display, GPU, DisplayPort out over USB-C. You can use Modem. You can make inferior phone calls without audio, because audio is not working yet. So basically, the big things that are not working yet is audio is not working yet, camera is not working yet, and probably quite a few more things that you want from a phone. But kind of, you can use it as a kind of media machine if you connect Bluetooth headphones, you can stream via Wi-Fi or 4G or 5G and everything else. One more question? Yeah. Or maybe a quick plan for the future. I definitely plan on working more on this, but that's also not my full-time job to working this. I have one day a week currently for this. So I can definitely get way more done than without more time. But yeah, we will see. Could you please describe in a few words how to debug device 3? The second? How do you debug device 3? I mean, to some degree, I'm not supposed to debug device 3, because it's just a data-formatted space-based hardware. Of course, when bringing up hardware, you need to figure out what values to put in there. This is most in combination with the Linux driver then, because your device is just a data format to describe the hardware. So you need something to kind of interpret it and use it that way. So most of the time, you are debugging Linux drivers and probably fixing Linux drivers and then at the same time, fixing your device 3. In the perfect words, you would just write it once and it would be done and then fix the operating system maybe better. So that's the time we have. Another round of applause for Luca, please. And make sure to check out the stands we have for Linux on mobile in building AW. And now we have five minutes of break. Don't forget the mic. Yeah. And yeah. Thanks. Bye. Bye. Bye. Bye. Bye. Bye. Bye.
U-Boot for modern Qualcomm phones
Our next talk is a new boot for modern Qualcomm phones as Konfi tells me on my Linux phone. Give a big round of applause for Caleb. Well, hopefully that stays. Okay, there's no way to write. Okay, that'll do. Hey, everyone. Lots of people here. Yeah, so this is Uboot for modern Qualcomm phones. This is kind of about the issues that we face in running Linux on modern Android devices and how we can solve them to make it easier for distros and users. So a bit about me. Hey, Caleb. I've been working on and using Free Software since about 2018. I'm a kernel engineer at Linaire on the Qualcomm ecosystem team where I hack on pretty much anything, just not user space. And I'm especially interested in things that improve the user experience. Outside of work, I'm a member of the post-micro sql team. Yeah, you might recognize me from my work on the 1.6 and Snapdragon 845 devices. And otherwise, I'm just plotting new ways to keep your devices out of landfill. And I'm a maintainer of Qualcomm platform support in Uboot. If anybody else is interested, you can follow me on masterband here. So in this talk, we'll talk about why Linux on Android phones kind of sucks. We'll talk about the magic of UFI, how Uboot works as a UFI boot loader, and specifically how it works on Qualcomm devices. I'll talk about the state of Qualcomm support in Uboot now and have a quick demo. We'll go over the absolute status and like 10,000 foot overview of how to support a new Qualcomm SOC. So let's play a little game. I'm going to show you some boot loaders and you guys have to tell me which one is the odd one out. So who recognizes this? Not that many hands. This is the grub-lugger, apparently. It has another alleged boot loader. This is actually the official system debut lugger and not one that I made up. Maybe you recognize this, Tiana Core or this lugger. If you've ever used QMU with UFI, of course, it has another classic boot loader. How about this for a boot loader? Is it? Not a boot loader? Some people would definitely make the differ. Yeah, so I guess the point is really that boot loader is a bit of an overloaded term. It can mean a lot of different things. So for the sake of this SOC, it means... We're going to go with this definition. It's a software responsible for loading the kernel in our MFS and device tree and then jumping to it. Sometimes different bits of software do different parts of this process, but that's okay. So, booting an Android. The devices have this thing called ABL. It's the Android boot loader. It works like this. You have this boot image, which is flashed to a boot partition. It contains the kernel, device tree, and the inner MFS all packaged up together. Then you have this DTBO thing. The idea is that your device tree that's in the boot image is generic. It's for the platform, for the SOC. Then in your DTBO, you can have all your board-specific quirks and features like the display panel and things like that. So the process is you load the boot image, find a matching DTB, because there can be multiple in the boot image. Apply the right overlay from the DTBO, and then you boot. Easy, right? Well, there's no multi-boot support. You just get the one kernel. There is an AB update rollback feature, which can't be disabled. It's a lot of work to integrate. In fact, there are five different versions of this boot image format that we need to support. The DTBO partition obviously does not work with mainline device trees, because they're just not designed this way. So we need to either erase it, or in some cases build a special empty one to trick the boot loader into not crashing. There is no feedback if any parts of this process go wrong. You get dumped to fast boot, or, well, if you're lucky. And, yeah, overall, it's not a whole lot of fun. In fact, there's a few very horrific things hiding in the shadows. You may be familiar with this error message if you've ever booted Linux on Qualcomm platform. Qualcomm's boot loader loads the kernel in a mislead address. Here is, in fact, the command that's run as part of the kernel update process on PostMarketOS and other distros that need to support this platform. This is the most generic form. We need all of these device-specific properties, and it's not super great. Sometimes we need to append strings to the end of the boot image. And here is an example of what happens if you erase the DTBO partition on one device. That's an abstract trace from the boot loader. So we can do better, right? Well, here's how booting with UFI works. You have a normal Fat32 or AXE2 partition. The AFI specification actually says nothing about Fat32. So for those of you who really are not a fan of the Microsoft ways, you can get away with the AXE2, I think. You install this file bootAA64.EFI, or if you're on an AXE86 platform, it's like boot AXE86. And then you have your kernel and RAM disk and boot loader configuration wherever you want. The big difference here is that rather than having to just pack everything up in this special format, you get to run your own code. So you can do whatever you need to do. And in fact, this gives the distro full control over the boot loader so you can have your fullbacks or you can have multiple different kernels installed. You're used to this, right? So if you're using this EFI on your laptops and desktops, it's pretty great, as I'm sure you all know. However, many vendors still get things wrong in the AXE86 space and on ARM. And in UBoot, at least, we have some limitations that you can't really adjust the boot loader at runtime. For the purposes of this talk, we kind of gloss over that. But it's definitely an obvious winner if done right. And you're never going to get it right, but there's a great workaround to that, which is to just release the source code and let people fix the things that you get wrong. So we've compared the Android boot process to UFI. Of course, we're also noting that UFI is capable of all of the same secure boot features. So I have a question for you guys. Which boot loader do you think Qualcomm shipped on the automotive and IoT platforms? Is it the Android one or the UFI one? I heard AVL. Yeah, indeed. If you're running embedded Linux on IoT or on automotive even, you're not booting Android, but you still have to deal with this whole Android boot loader. So the answer to all our problems, UBoot. UBoot is a very cool open source boot loader, GPL2 licensed. I'm sure many of you are already familiar with it. It supports a whole lot of different devices and architectures. It has compatibility with the Linux driver model. So, reporting drivers from Linux is pretty straightforward. It uses device tree, which is pretty fantastic. And it can be adjusted to do anything. It's not just the UFI boot loader, but it does pass the system-ready base boot requirements. And it hasn't always had great Qualcomm support, but thankfully that's changing. So, to boot UBoot on a Qualcomm platform, on a phone specifically, we actually can't replace the Android boot loader, which sucks. It's hashed and the hash is signed by a private key. The public key is bund into the SOS. So, like Luca mentioned in his talk regarding firmware, the same is also true for the boot loader. Could we exploit this? Unfortunately, this isn't going to be one of those talks. I'm sure, well, I hope that one day we get there, but for now we can chain load, which is almost as good, right? And we can rely on ABL to give us a bunch of pretty useful information. So, you can build UBoot with this configuration, which is exactly what you might think. It prepends the Linux kernel image header to the UBoot image, which just lets us smuggle UBoot past ABL. ABL thinks it's just booting Linux, but it's actually not. And that's pretty much all this required. Well, then you wrap it up in the boot image, and now we're booting UBoot, basically. So, the state of Qualcomm support is provided in MacSnapDragon, which has been around for a really long time. To the original authors, I don't really know why they decided to name it Snapdragon and not Qualcomm, but whatever. As of today, not all of this is upstream, but work in progress. We have support for all of these platforms, the latest being my colleague, Neil, who recently got UBooting on SM8350, which is one-generation old. So that's all of the 2023 flagship phones, essentially. We're pretty much compatible with Linux advisory. So, if you have a phone that's supported by upstream Linux and basic SoC drivers are in UBoot, you can just take the advisory blog from Linux, combine it with UBoot, and you can probably boot. Some devices need a few additional things, which we keep track of in a separate advisory file, which UBoot automatically includes for us. And for things like USB, where we currently only support high-speed, we can fix this up at runtime. So, there's no hard advisory modifications required. There's also no board-specific code, which if you're familiar with UBoot development, you're probably very familiar with the board-specific code. We don't want any of that. We want to dynamically arch everything, read the memory maps from the advisory, and have one build target for everything. Support for USB, UFS storage, and MNC is headed upstream, and a whole lot more buttons, capsule updates, it's all getting there. So, where can you run this? Today, if you fancy shelling out, you can get yourself one of these fantastic IoT development platforms on the low-end for just $199. You can buy yourself an RB1 with a quad-core 2 GHz A53 SoC, one or two gigabytes of RAM, and eight or 16 gigs of MNC. Now, that is a steal, I gotta tell you. You can go a bit higher, you can go for the RB2. You get a whole eight cores, still a 2 GHz though, and still just two gigs of RAM. On the mid-end, there's the SCMA45 RB3. You'll learn more about these boards in Neil's talk in just a sec. You cannot actually buy this one anymore. So, yeah, sorry, I guess you'll have to go all the way up to the RB5. Only $550, right? Well, I'll let you in on a secret. You can buy yourself a OnePlus 8 for a lot less than that, and it has the same SoC in it. Now, if you already own a Snapdragon 845 device, like the OnePlus 6, you can head here and you can download a UBoot release right now. There is some work in progress post-marketOS support, which is not much yet, but it's on its way. We're booting with UFI and SystemDoot. Down the line, we will also have support for providing UBoot updates with LVFS and FirmwareUpd. So, you'll just be able to receive updates no matter what disk review you're running. And I can demonstrate this for you right now. Hopefully. Hopefully. Let's see. Yeah, okay. So, this is a OnePlus 6. This is my sort of daily driver device at the minute. And here we can see UBoot. So, we have this boot menu at the minute. And there's a few options here which you may or may not be able to read. The important ones, especially if you're interested in playing around with this. If you boot up with the USB cable attached, then you can choose this USB Serial Console Gadget option. And then you'll get a Serial Console on your PC from the USB port and the device. So, you can think around with UBoot. And yeah, I guess we'll boot into Linux. This is booting with SystemDoot. I want to buy a bit faster. And I got the AFI sub and then into Linux. Yeah, so there we go. This is kind of hopefully the future of Linux on Qualcomm devices. We can now provide updates as you would expect on your laptop or desktop. We don't need to mess around with a bunch of disk integration for the Android boot image. And everything just kind of works. So, yeah, in terms of upstreaming, there's quite a few patches on the mainliners right now. With those, we'll have USB support for the Snapdragon 845 and UFS I'm hoping to send off soon. There's a separate effort by a colleague of mine, Summit. He's working on pulling in all of the device trees from Linux, which are actually kept, married in a separate repository called Device Rebasing. So, the idea is to add this as a subtree in UBoot to allow us to keep it in sync. And this way, you'll be able to just clone UBoot, build it, and if your device is already in Linux and is already, like I said, basic support for the SoC, then provided you have a frame buffer support from ABL, then you can just boot UBoot and probably have UFS working. Yeah, if you're interested in understanding more about the UBoot flow, this is great talk by Simon Glass, who's a UBoot maintainer called Recent Advancements in UBoot. He gave it late last year. But, yeah, I really like the goal is if you have a fairly standard device, it should just work with the Linux advisory. In the future, I really want to get support for handling display panel variants. So, for example, the Pocophone F1, we need two different devices at the minute. And, yeah, we can do cool stuff, I think. Adding a new SoC in case you wanted to is very simple. You need clock and ping control. These can be essentially stub drivers, copy one that exists. You're kind of good to go. You may need to copy the UFS file data from the Linux driver. You can literally copy paste it into UBoot. You may need a compatible string for the PMIC and the SMME driver, and that's it. In fact, it took me about two hours following this process to reach that point. On this device I have here, which I guess I can show off for you. This is actually the Fairphone 5. And, well, it's... Here we are. It can also be from internal storage. So, yeah, this is fairly sure for it. Still a lot of out-of-tree patches, but definitely feel free to give it a go. And let me know how you get on. Yeah, that's all. Thanks. Amazing. Does anyone have any questions for Caleb? Yeah, here. On what partition, on what Android partition does UBoot live on? UBoot is flashed to the boot partition. Boot partition. Yeah, so on devices that don't have secure boot, where there's no signature verification going on, you can replace the stock boot loaded with UBoot. Second quick question. How often do you get on your phone? Call it from crash-dump. Unlike the OnePlus 6, it seems to depend on device and region and other things. It happens to me one in maybe 10 boots or less. Some people have issues more often. Yeah, it's going to be a fun one to debug. Hi. Thanks. Great talk. So, when you had on the slide earlier on that the, you know, board-specific BTBs have been added to UBoot. Does that mean that they are taken from ABL? No. So the Qualcomm's bootloader ABL is, they actually have their own very non-standard UFI implementation. And ABL is just an EFI app, but unfortunately none of that. They shut down all the UFI services before jumping to Linux and we can't modify it. But no, it doesn't use the device tree. It's based on ADK2. So we're pulling device trees from Linux. Okay. Thanks. Any more questions? Oh, yeah. One sec. Yeah, maybe it passed the mic. So since you have more standard bootloader now, are you able to do dual-boots, stuff like that? Yeah. Provided that you can share the AFI system partition. Or in fact, you can have multiple. I haven't validated this and other people are quite worried about it, but I'm fairly certain that the block device that has the user data partition on it, on the OnePlus 6, it's usually DevSCA. I'm fairly sure that we could completely format that and it would still boot, but I haven't tested this. And yeah, I wouldn't necessarily recommend it. But yeah, you can certainly remove the user data partition and create multiple for different root file systems. You can then format the OP2 partition as an ESP. You can set the right GUID type code, make it, and it'll work. Hopefully one day we can do a boot on Raiden. Yeah, well, the only thing on Android is a whole different beast. So I figured that the user data partition does not have overlapping directories with the Linux root file system. So I managed to dump a whole Linux root file system on the user data partition and tell Linux to use that at rootiface. Yeah. And it boots and it works. So you can have it without impacting any Android functionality. Yes, you can. This is how selfish OS boots. And there is also support for this in post-wacadoss, I believe, where if you put it in a special path, it will detect it. Do we have another question? We have one more minute. Yeah. So do you guys end up replacing LK2 or however you're supposed to pronounce that for the devices that use that? Yeah, for sure. I mean, so there's already support for some of the older platforms. It would be doable to add them. However, LK2 does a lot to support the downstream device trees and lots of fix up stuff. So currently the kind of status for the LK2 maintainers is it's not what the effort for them to move all the devices. But if you own one of those devices and work on it, you can definitely get it supported. Okay. So if I had support to this, I could replace LK2 for my class. Yes. For sure. There might be some other weird quacks for some of the 32 bit devices. Please give another round of applause for Kaleb. And again, five minute break and then we have the next talk. Check out the stand and bidding AW. Nice.
Mainline Linux on Qualcomm SoCs, are we here now ?
Thanks. Welcome to my talk. So my first time at FOSDEMM, so questions, twists. So I will do a summary of where we are now about Qualcomm SOC supporting in Mainline because I think it's time to do a point. So I'm part of the Linao Qualcomm Longing Team. I joined one year and a half ago. So my many daily work is actually platform Qualcomm support. I'm maintainer with Calib, the commentator of UBoot Best Sports and I bring new platforms upstream in Linux. I also maintain and develop other piece of Linux, namely the MLG SOCs, DLM bridge, DLM panels. And I'm working only on upstream Linux and UBoot for the last few years. And so I have a lot of patches upstream in UBoot and Linux. So I'm part of Linao. Linao has been founded to actually enable and make Linux and any software on ARM work better. So we're basically helping vendors and product makers actually make better products to work on ARM. So we have plenty of services to help the whole stack of software running better on ARM. And open source is at the heart of Linao. We mainly work on open source software mainly. So Qualcomm joined Linao 10 years ago because they wanted to have better open source support at the time, which was minimal. So they joined to support Linux, but it quickly collaborated plenty of other places and so far so good. Even happy Linao and Qualcomm is happy with the situation. So in the last 10 years, Linao and Qualcomm pushed a lot of really huge features for ARM ecosystem in Linux, namely the power framework that the energy hour schedule really changed the way Linux schedules correctly on cores. With help and Qualcomm participated in the standard and software structure on our servers, the Dragon boards are the reference today to test Android, ASP for example. We have called Linao, which is the principal kit storage for Qualcomm and for Linao engineers. And namely for the last three years, we were pushing the flagship mobile platforms. So this year, the last three months, I pushed the standard on Gen3 upstream and it was 98% two months after the announcement, which is pretty cool. So the agenda, where we came from 10 years ago, where we are now and the two supported devices, a demo and what's remaining. So we were 10 years ago. So 10 years ago, Qualcomm and vendors using Qualcomm SOS ships with kernels with more like three million change. So it was basically a separate kernel in a kernel. So this was a problem, but it's a hard to solve problem. How do you integrate, how do you upstream so much change in main Linux? That's why Linao started the learning team to fix this. And this is a graph I made to show on the last years, the last 10 years, how Qualcomm managed their downstream kernels. So initially, they used the long term kernel very long time and they kept accumulating new SOS support over time. So each time frame and for the last four years, the company changed the strategy and they stopped adding new code and simply changing existing code. I think the reason is first Android strategy with GKI and because the main line Linux has enough support and has the principal architecture missing over time. So this is what I posted nine years ago, eight years ago, which was true. I mean, it was mostly nonexistent. It was the only SOS inverter that was not upstreaming almost nothing. So hopefully it changed. So Linao worked on Qualcomm specific features in the last 10 years. So the biggest feature was the remote proc to handle DSP because before Qualcomm had a complex custom solution to speak to DSPs, which was like two million line of code only to speak to DSPs. And the biggest work we did was to implement it correctly upstream. So we have now a fully integrated way to speak to DSPs and it works really, really well. The other big feature of Qualcomm SOS is Interconnect because the Qualcomm SOS are very complex and you can fine tune any data path in the SOS and you can change the bandwidth. You can change the performance of any data path. So it was a huge feature. It took a very long time to upstream it. All the Venus video decoder was complex because it needed the other support before. The DSP audio support also needed the proper DSP handling. The DM driver is a huge beast because the graphics display engine is really complex and supports a lot of features. Lately the sound wire support was upstreamed for Qualcomm and other platforms and we worked on plenty of very time consuming subjects but tiny. But all of these are needed to actually boot a platform. So this is a graph of the upstream contribution. You can see it was quite a blow but all these features are really complex to upstream because they are either Qualcomm specific or very complex but it doesn't fit in any framework. So it took like seven or eight years to actually push the base features to actually be able to boot a high-end device. And the last four years because we had a complex support for all the small and very important features we are able to finally boot high-end and commercial forms on it. So we had a lot of contribution from Linao, Qualcomm and also the community. This explains a huge peak in the last two years. So this is a graph of the supported board of the time. So 10 years ago we had only two boards in 2D keyboards and now we have a lot of 300 boards which is huge. And most of them are community boards and non-reference of the base boards. These are the new supported boards of our time. So for each release the number of supported boards were added. So you can see in the last 10 release there's a huge amount of new boards added which is great and the community actually helps a lot in this case. So for the boards, supported boards, like Caleb says the historical dragon boards were the first really publicly available boards in the SBC form factor and they really helped starting the mainline development. And while they were like low-end SOCs, we supported a lot of features. Those support camera and very high-end features so it helped develop the baseline support to actually enable high-end SOCs at the end. So like Caleb said, these are the robotic boards. They are quite expensive and it's the current Qualcomm offering in the IoT world and they aim to support them fully upstream and which is quite, it's quite each board as a mid-end and low-end. So it's quite diverse and it helps supporting all the new features. So there is commercial phones which are running very, very well. So you won't expect all the features for daily usage. So you don't have haptics, you don't have camera but they work fine and you can boot and actually use it with Wi-Fi, Bluetooth and storage. You can have a few tablet convertibles running on Linux, mainline like the Lenovo Yiga 640 and these are the Qualcomm high-end reference devices. So those are the devices we use daily use to upstream high-end platforms. So this one is a one-year-old platform, this one is a two-year-old platform and this one is actually running this presentation. And those are the specific Qualcomm reference devices with test points used by Qualcomm engineers to actually upstream develop Android and we upstream mainline Linux support with them. So as I said, I was upstreaming the Gen3 support, the latest Qualcomm high-end SOC which the Samsung phones were announced two weeks ago and in 6.1 RC1 that was announced like two days before announcing the Samsung phones, we had already a display, UFS, PCI, USB, thermal, CPU fray, QSPC, suspend-resume and crypto-working on mainline Linux, check out Linux master and it was working, it works. And in the meantime, we developed audio, display portal mode, DSP, full DSP, modem, compute and audio, USB PD and charger and GPU is the last remaining one and I won't talk about it. So the flagship device you could use today is the Lenovo X14S. It's actually the best platform to use Qualcomm devices. It's really powerful and you can use it daily. My colleagues are actually developing mainline Linux on this platform. It supports, my colleague can use it about eight hours time working and you have almost all everything supported. So this is an example what is supported. You have a JPEU working storage keyboard, thermal, USB, suspend-resume, audio and you can boot over EFI. So you can, but obviously they're still working process like every software stuff. So the most important is the camera. Camera doesn't work. It's complex due to the sensor putting raw data and Qualcomm not wanting to upstream the stuff. So it's in working progress. We have something working. It's not public. We are working on it. There's plenty of other small features are missing like the embedded controller, the power measurement. Power measurement is infinite. It will never be perfect. So we're gaining amps every release. So it's a constant work. There's always some small, wiffy and Bluetooth issues. Audio needs active speaker protection. This is a big, modern feature. All the new, modern audio needs active speaker protection because it's not no more included in the codecs. And some stuff are still missing like the fingerprint or VDOC shale action. But we aim to support all this in the short frame. So today, if you want to test Linux, many Linux on the expertise, you can use Fedora, Ambian, Nubuntu or Debian. We about changing anything. It will install directly and boot and you could use it daily. So this is a great, great advancement. So demo time. I mean no need for a demo because I'm running it. I'm running the procession on a Qualcomm device. So yeah, for example, this is 8550. You can play a video, for example. It works fine. You can switch. I'm still in full screen. So you can see everything is fine. The video is still running. So the GPU works very fine. Up. Demo effect. Okay. So... So to show it's really usable. You have Wi-Fi and Bluetooth working in GPU and this platform is one year old. But I got hardware like two weeks ago. So it was great. And the support for the board is actually on the... It's made by the Qualcomm ARM maintainer. So it should be part for 6.9. So globally, what's remaining to support, properly support the Qualcomm SOCs, power optimization. It's a long term, nearly infinite work because the Qualcomm is complex and we still need to gain every time. So performance. Performance, like I said, each data path can be optimized. And it's also a long, long journey to support power and performance and manipulation. There are still missing some advanced graphic features, mainly for non-phones and non-laptops, like HDR and multi-plane and so on. Video with decoding accelerator is working progress. We're working with Qualcomm on it. Camera support is a big feature. Audio support, we still need to support DisplayPort audio. To support audio over the HDMI or the DisplayPort. Speaker protection, the sensor hub for the phones, feedback and the vibrator and the new platform because each year we have between two and three platforms to support either in computers or phones or IoT. Otherwise, it's keeping us working a lot. So we need help of the community because we need testing and we need to support more devices. Thanks to the community, we have the largest ARM 64 change in the last years. Every single release, we have a top change because it's really actively changing. We are really supporting mainstream devices, the phones, laptops, modems, accessories, converters. And we are working on new books. Qualcomm is porting new devices. It will simplify installing new distributions. And if you want to know the status of each SOC, you can go to this address in our GitHub IOMSM. It will give you a nice overview of the support. So like the last line is the standard on Gen 3. So all the yellow lines will be green in four weeks now. So for example, it's really kind of cool. We simply describe each feature with and it generates automatically a website. So it's really cool. So thank you for listening. I was happy to present the state of Qualcomm SST port and demo it in live. And it works fine. So no demo effect. Thank you very much. Very nice. Does anyone have any questions here? Yeah, hi. When can we expect Qualcomm to start upstream in support for the Linux that runs on the modems? On the modems? I have no idea. I'm sorry. Another question? Thank you. The question is actually first, is Linao or Qualcomm considering doing any upstreaming for legacy platforms? For what? For legacy platforms, for early edge upsets. We do it daily. Okay. So this is also happening? Okay. Yeah, we continue adding features for all platforms daily and the community helps a lot and we are testing it. And in fact, Qualcomm is pretty consistent in the firmware interfaces and APIs and registers. So we in fact support all devices quite constantly. And then the other thing you mentioned, specifically on camera, there's a lot of work on Android. Second, you have a lot of out of three drivers. Instead of a platform, Qualcomm, they actually get everything supported directly in upstream Linux. I hope so. And the question here, one second. Hello. Very nice talk. Any plans for Spectra ISP? So yeah, it's the same question. I don't know. It's not in our hands. Another one? Okay. I'll pass the mic. You talked about many distros already working. If we had, for example, a root FS from another distro, the boot loader situation, is it the same as in mobile phones and their SoCs? Or can we just expect to boot from UEFI or similar? So for the laptop, they have a functional UEFI shipped with the laptop. So there's no need for you boot. I think it's not perfect, but it works fine. So you can directly install Fedora in UEFI on the laptop when you open it. So it works. Thank you. So you mentioned something about video decoding. How will exactly that work? Will there be a VA API driver or will it use something else? Today, there is already a Venus v4l2 driver for the old platforms. And we are working to support the new platforms using v4l2. So Qualcomm wants to push support for the platform. So we need to find a way to merge it and make it more prettier. But yeah, v4l2. Okay. Thanks. Another question, anyone? Yeah, yeah. Hey, thanks for the talk. I had a question about availability of certain documents required to write a lot of the drivers. Is Qualcomm making those documents available to the public? So no, it's not the industry. They don't want to document the hardware publicly. So for regular people that want to help, it would be like reverse engineering or? Yeah, code. Okay. I mean, cool. I've implemented all the MLG support using code only. Almost no documentation. So it's hard. So we need documentation for more complex features. But most of the future, we use code, even us. Because documentation, you have registers, but you don't have the state machine. You don't have the behavior of the hardware. So. Okay, we could fit in another question if there is any. Otherwise, yeah. Okay. Yeah. I'd actually like to continue the question that the teacher's raised. So how is it working then? So you signed an NDA with Qualcomm, get the doc. Docs can write the code, but you're not allowed to document it. Yeah, speak about it. That's how it works. Yeah. Yeah. Gotcha. Please give another round of applause for our speaker. And it was really all running from this device here, the board. No laptop. Yeah.
VoLTE for FOSS
I'm not sure if you should attach the mic. Maybe. One of you should attach the mic. Maybe. Okay. Do you want this one or me and Mike? Me and Mike. I'll give you this mic. How can I use it? So put this in your pocket and this one you attach to your... Yeah. Is it correct? Because I'm bad with mics. Yeah, that should work. Can you put it up here? Maybe like this? No. No. I'll just speak and test it. Okay. Let's try it once. Let's try it once. It shows here. Yeah, it shows there but not... It's here. I have a green light. Are you going to... How do you say that? I will do my... Alright, next up we have a talk called Volte for Force. It's over LTE. I'm pretty sure we all want to see that on Linux Mobile. And this talk was originally supposed to be given by Marius from UbiPort. But Marius is not here today so please give it up for Nikita and Ivan instead. So, hello everyone. I'm Nikita Ukhenko from the UbiPort community and also looking for Yula. People mostly know me from Telegram as NotKid. And since Marius is not here, we cannot be here to the replacement but we try to say what we have learned so far when trying to make the evolves in the same work on Ubuntu Touch and other mobile industries. Currently it's still HMS but I hope that we can get more people involved into this and if you want you can stay here to discuss on how we can implement the Volte 3 on more Linux distros and what can we do together in our understanding. Can you turn up the volume like... You put the mic up or something. Yep. Okay, is it much better now? Yeah. Great. And the chat is not yet so can you raise it up more into the... Oh, yeah, that's a delay. Okay, good. Just go on, yeah. Yeah. So, I expect people in this room are familiar with voice over LTE and what is it for but briefly just a communication standard for voice calls over LTE networks. And there are similar standards for voice over Wi-Fi and voice over 5G networks which is called VOR NR basically. The main reason that we have to worry about the things is that GSM and LTE networks are now becoming scarce resource and if you want to make calls from our mobile Linux distros we need to implement this VOR LTE at some point. And if you had voice over Wi-Fi it allows other cool things like when you're in a roaming you can try to connect to your mobile operator endpoint and make local calls to your home country at local prices, not roaming prices. So, let's start from how it currently works on Android. There's a picture with the TelephoneLinux website but the point here is that there is a modern firmware. On top of it there is a modern interface library or a library that is used by RIL which stands for Radio Interface Layer on Android. On top of it it provides an HDDL server which implements HDDL radio interface and on the recent Android versions it became ideal instead but that's not what we care about at the moment. So, the frameworks part are the ones that implement the communication with HDDL radio interface and there are vendor specific IMS parts which plug into IMS Android interfaces but the vendor implementation is closed source and unfortunately device specific as well or it is chip specific. When we go to Ubuntu Touch we keep those four bottom layers but we don't have the frameworks anymore. And here the problem comes that the IMS parts are provided by the frameworks on Algam and we have instead of frameworks we have Ophono that talks to Radio Interface and Ophono is talked to by Telepathy, Ophono or other layers on TelepathyCondistro. On SelfieShow as we use TelepathyRing but in just implementation details. So, if we don't have the IMS part of frameworks what can we do now? From here we have a motif option. First we can reimplement the Android frameworks part of IMS so it can still talk to vendor interface and that's how it's currently done on SelfieShow as for Xperia 10.2 and 10.3. It's also been tested to work for other Qualcomm devices but unfortunately the plugin by YOLA is currently closed source as it relies on Qualcomm specific headers and I think YOLA is afraid of some legal stuff that's going to happen if it's publicly released. On Obuntu Touch we've been trying to use the same plugin around here but the problem is that the Qualcomm IMS part is black box and sometimes it works and sometimes it doesn't for no reason. It's quite hard to actually understand what's happening because basically what's the more the Ophono part sending is just asking the model, hey can you connect to the IMS for me and the model just answers yes I can and that's it. So you don't know when it's connecting, why it's connecting, how it's connecting and it's a complete black box. So, as you see in the picture we can try to write an IMS plugin and we plug it between the radio interface and Ophono or some other telephony layer. It works but it's device specific and it's a bit of a pain. I've been trying to write a similar plugin for many of the devices now and the idea of it is very simple like you tell the model, okay please enable IMS, here is the IMS IP and connect to it, copy and config from some pass but whether it works or not it's a bit of luck. Dependent on your career. The positive of this approach is actually that you don't need a 4G network knowledge and so you don't need voice over utility knowledge but it's a black box and if it works, if it works you don't know why. Yeah, so second option we have, it's very similar to the first I mentioned but it's maybe interesting for mainland people and that's why I'm mentioning it. So we can ignore completely the ICTL Android parts and you just write a library of a driver that talks to modem firmware directly. That's how it currently works on Pinephone actually because on Pinephone the Qualcomm modem is a separate USB modem and you can tell the modem via QMI to enable IMS and voice over utility. So it's the same black box approach but on a bit lower level and I don't think we will use it on Halium distros because it will cause MS if Android and direct modem interface communication done at the same time. But that's possible, yeah. This approach requires at least a little bit understanding of the network stack. And you need to also know your modem firmware protocol. On Qualcomm it's QMI as mainland people probably know and on media it's interestingly 80 commons but of course it modifies it up 80 commons. And then the most annoying approach but also maybe the most interesting is that we can attempt the modem to set up a data connection to MSAPN and interface with mobile service operator services on network transport layer and it becomes a real mess of senders and protocols. Basically that's the end goal but we wanted to show you how the voice over utility stack looks like and that's the picture. It's not only the voice over utility, here's the full 4G stack how it looks like and the voice over utility in the network is going over just a second. So this is the TCP IP network stack. This is the transport protocol used for the 4G network and the voice over utility network and then we're going for the stack which we showed you previously. So our end goal is trying to implement this on software so it would be open source for every distro to use but as you can imagine it's quite a challenge. It can also allow for some interesting things to do like there is a project to perform the SIM card authentication and set up encrypted AKV2 IPsec tunnel to the mobile operator endpoint from your laptop. So it makes it a bit easier to debug if you can just use the phone for authentication and then you set up the voice over Wi-Fi connection from your host PC. And there are multiple projects who try to implement the open source telephony. The most prominent one is currently DoaBungo for IMS services but sadly it has been un-maintained for the last five years or so. It's in a working shape but you'll probably need to figure out a lot of throw-angels later on. However, courtesy of Mohammed who also wanted to make it here to Brussels but he was refused a Belgian video suddenly. We have a screenshot of DoaBungo connecting to the mobile operator endpoint via APsec tunnel for voice over Wi-Fi and it tried to receive a call while it couldn't because audio part wasn't really working. But at least it could receive an SMS. Here are those symbols you see because SMS uses ETF16 text encoding and the console is ETF8. It did receive something and that's where I am currently. If to summarize the set of part is that we have voice over LTE working with device-specific... using Android radio interface on selfish OS and we won't touch Android on this selfish OS plugin but only for specific welcome devices. We have something cooking for MediaTek and we tried the third option for implementing and we are facing with IMS services in software. Both of them are possible but we are not there yet. Since Marius is not here he cannot speak about all the operator wordnesses he encountered over the road but we are open for discussion and if there are other mobile projects who want to get voice over LTE working it would be nice to see how we could collaborate. Do we have any questions? Maybe in the chat room? That's a question. Can you pass the mic? Okay, so you are going to touch. I get that right. Who is developing a funnel? I was wondering who is pushing these kinds of efforts forward. It's interesting question. Even Ophono is not... We have multiple folks of Ophono. The upstream one is... I don't know how other developers of upstream one. It was sponsored by Intel at some point, by Migo, but they have some community maintenance for Ophono. But the Ophono version we are using for is developed by YOLO for selfish OS and it's heavily forked from the upstream Ophono sadly, but they have been enforced by Adam Peek to bring latest Ophono changes back into selfish OS fork. So it's closer to upstream Ophono and it can be used for pine for selfish OS. And the Ophono binder plugin I've been talking about has been developed by Slava Monich also inside YOLO. Is there any cooperation with YOLO or are you just taking their stuff and developing it forward? I have a style of fish, so I'm interested in it from the user perspective. It never worked for me, by the way. Obviously, the stuff on the fork is open source and when it's possible we try to make upstream MMRs, but the code base is quite diverse, so currently we are taking from Ophono and we will be using it. So I thought that's a question in the chat. So somebody asked, on the Librim I learned that it can be very carrier-specific, whether voice over AT works or not, and it carries white or black list specific modems. Is there anything we can do in this regard, like spoofing modems? So there are multiple carrier-level specific things. First, each one has vendor-specific configs provided by its vendor. For example, on Qualcomm you have vendor-firmware MNT partition, it has image-modem subfolder, and for many carriers there is carrier-specific, modem-firmware configuration, and it's very much assigned black box, we cannot do anything about it. Of course, we can try to load configuration from a different carrier and whatnot, but as Alan would say from Sony, do not do this, you will break the carrier's network. So I think the detection of modem on carrier-level is mainly done by few parts. First is the EMI of your phone, which you cannot spoof in most cases, and there is also user-agent. The user-agent part when connecting to the... I'm a service on the network stack level, of course it can be spoofed. Okay, thanks. Any more questions from RUM? Yes, at the very back. One sec. So a bit related to this, are you encountering any pushback from carriers, because you could potentially be messing with our stack? I guess we are too small for carriers to care about us, unless we break something to bed, so not at the moment at least, but on us. Do we have another question in the room? Yeah. Hi, just by chance I was on the schedule for later today. There's some talk, ProvideVol.te using OpenZips. Have you ever heard of OpenSips? And is that interesting for us? Yeah, I haven't heard of them, but it would be nice to check the talk and see if it can be run on Linux. Can I get some? Just to expand on the previous question a little bit, in order to not have problems with the carriers, I'm also trying to set up just a 4G network with a software-defined radio, a private one, so we can test whatever we like without breaking stuff, you know. Okay, maybe from the chat again. Are there any plans to upstream Ophono changes to kernel.org or Ophono version? I don't know, I cannot speak for Ophono developers at the other side. Okay. Is there another question? Okay. Yeah, I guess then we close it. Give another round of applause, please. Don't forget the mic. The mic? Yeah. Would you buy the pack already, the mic? Yeah, I'll take it. We talked in video chat at some point from SusmoCom. Yeah, yeah, yeah, yeah. It's good, nice. Yeah, so this is also your, how do you say that, your field. Did you get any further with it or? No, not really, no. I think it's a little bit with everyone that is like, we married that guy who's working, but couldn't figure out why if some last work, some doesn't work. Yeah, so I don't know, we just need people to like to have a different approach with everybody. Is this your bag? No, it's the Coulance here, so I like to... Are you also late for the mobile mic? Yeah, I guess I'll make you check it out. Thank you. I think the refuges are here. I got an avid setup here. What is the score on my... Here's the HMI and here's the... Just be safe, whatever you need. Just be... Okay. I mean, it's also small. Nothing that much specific. Looks really nowhere. And you have some more minutes for the end of the 30, so it's like 9 minutes. Okay. I have so many problems with the inters, because I would say like... I mean, you're gonna do a little bit of this, but I'm like, let's say like... Okay, er... Check the camera. I'm like, fuck, I can't say my name. I'm like, seriously. I think the ref I have is the one I was saying about. I was like, I should be shooting myself more often. I don't know. I don't know. I'm gonna get a step-by-step. Sorry. I'm sorry. So I had to write something. But they felt like, you know, I should have written something. I could have thought of something. Otherwise, I could have done that. I don't want to. I don't want to. I was thinking the very smallest bit. I'm not sure. I'm not sure. I don't know. I'm just wondering. I still have a few questions. Have you talked to me before about the Dragon Messenger? No. I'm just trying to set it. What was your voice before? I was just wondering. I was just wondering. I was just wondering. I was just wondering. I was just wondering. I was just wondering. I was just wondering. I was just wondering. I was just wondering. I was just wondering. I was just wondering.
Universal Serial Bug - a tale of spontaneous modem resets
Okay, thank you all for coming. The next talk is anniversary serial bug, a tale of spontaneous modem research from Sebastian. Give a big round of applause please. So hi, I'm Sebastian Krzyszkowiak, I'm also known as DOS, and I have many hobbies, I make games for example, maybe you have played any match for, but other hobbies there is also mobile GNU Linux, which started many years ago when I got an open Mokoni of Rane, and eventually I had been contracted by Poison to work on the Libre M5 phone, which is this chunky boy here, and within this device there is a USB connected cellular modem on an M2 card, the worldmobile BM818, and this is the main character of our talk today, because we had a problem with it, and the problem manifested itself in this way that sometimes, occasionally, seemingly at random, the modem would just disappear from the bus, it would be just as if it was unplugged from the socket, and it would come back right away, even though it did come back, it was still pretty disruptive, because the network interface would go down, the audio routing would be turned down if you were doing the call, so this wasn't really great, the modem wasn't crashing, it wasn't rebooting, because it maintained its state, at least some of its state, but it was just as if you would pull the plug and plug it back in very quickly, without with some external power connected to the modem, and there were also other instances where the modem wouldn't come back, or when the whole USB bus would actually die together with it, however we won't be talking about those turned out, even though they were like wars, they turned out to be connected, but separate issues that weren't as technically interesting as those resets turned out to be. So this talk will be some kind of debugging case study, and I would just like to talk about how we identified the issue, how we debugged it, and worked around in the end, and at the start I would like to note that this is not some groundbreaking research, this is not a new discovery, because it turns out that this was known for ages already, but I think it's not a common knowledge still, and it turns out that it can still bite, so I thought that this would be an interesting thing to talk about and to share. So in order to understand what's going on, I'll quickly go through the topology of the devices on the LibreM5, so we have two M2 slots inside, one of them is the cellular modem, and the second one is the Wi-Fi and Bluetooth card, and there are two USB controllers, one of them goes to the USB C port, and it all swaps, and the other one is connected to the internal USB hub, and therefore it works as host only, and the internal hub is USB 2642, which has three downstream ports, one of them is internal, as this hub has a microSD reader built-in, and the other one, the one that we will be interested in today, is the modem's port that goes to its M2 slot, and there's also the third port that goes to the Wi-Fi M2 slot, however none of the cards that we use on this phone actually use USB, they use different interfaces, so this third port effectively remains unused. So universal serial bus, I'm just going to assume that everyone here knows what USB is, we all used it, so I won't read Wikipedia definitions, I will however go through some of the properties of USB, either to remind you or to make you aware of how this works on the wire, so the first thing, that the devices can be suspended, this is a power management thing, you can put a USB device to sleep, theoretically all of them can be put to sleep, not all of them react well to that, the specification says that they should, but yeah, reality is different, and there are two ways in which you can suspend a device, you can either selectively suspend a single port or put the whole bus into so-called global suspend, and another thing is that no device on the bus speaks unsolicited, every communication is actually handled by the host, it's the host that keeps pulling each of the devices for whether it has something to say or not, and then the device only responds to what the host is asking it for. There is one exception, when a device is suspended, it can actually signal that it wants to be woken up, but that's the only thing that it can signal on its own. One interesting thing is that, I think not everyone is aware of that, that all USB hubs are technically smart, they are on their own, a proper USB device that you can talk to, that you can send comments, and that can respond and send some status. The features that you can control this way vary, so not every hub will, for instance, provide power switching control, however, this is exactly how suspend is implemented, you send a message to the hub and the hub inter-president does it. Internally, when it's on the wire, USB works with two wires that form a differential pair, and you can have, on two wires you can have four states, however, one of them is illegal in the specification, the USB doesn't use it, so we are down to three states, they are called J, K, those two are when one of the wires is high and the other is low, and there is SE0, which is when both of the wires are low. There are some differences between various speed modes between USB 1 and USB 2, we won't be going into newer versions as they are different, and the modern here uses USB 2. However, the differences between USB 1 and 2 are small, the old states are similar, they use different voltages, but logically it's basically the same thing. So, let's go back to the bug. At some point we have noticed that those modern resets are somewhat dependent on movement or signal strength, the easy way to trigger them was, for instance, to ride a train, you could often see the cellular connection icon just disappearing for a moment, or when you don't notice some file, it maybe could drop out, and that was pretty annoying. And also, sometimes in some places it basically never happened, like at my desk when I worked on it, and it quite often happened in my bedroom, for example, where overnight I would wake up to a bunch of resets happening overnight. So, in order to look at those issues, we have to check some logs, I have showed them earlier, but that's not enough, and Linux has this pretty useful feature called dynamic debug, and pretty much all the channels called drivers are sprinkled with debug print messages, however, they are by default compiled out for performance reasons, however, you don't need to recompile the kernel to put them back in. They can be dynamically patched in, and this is how you can do it. Using this interface, this invocation tells the kernel to re-enable all the print statements from C files, from drivers, USB core, directory in the kernel registry. So, we did that, and this told us a bit more. It turns out that this is an example of such a reset happening, and it turns out it happens when the device wants to wake itself up from USB suspend, and here we can see the status given by the hub to its ports. The port one is the microSD reader, and we can see that there is 0507, which means that five that is connected and enumerated properly, seven is that it's suspended, and change zero means that nothing changes, and port two is the modem, and here we can see that it's different. Zero one means that it is connected, however, it didn't actually went through the whole process of connecting, so something happened there. Zero one means that it's not suspended, and change five tells us that it both changed its suspend status and connection status, so it's just like it would be from the plug, and put quickly back in at this point. To compare it, this is an example of when things go right. After the port has been resumed, we can see that status is 0503, which is different from the port one, because port one is still suspended, and port two is already working up, so there's three at the end, and change four tells us that only the suspend status has changed, so this is how it looks when it works fine. That told us something but not much. There is another feature called USB MON, which can be used to sniff on the traffic on the USB bus, and can be used with such common tools like Wireshark, however, it still didn't tell us anything new, and it's just like if the device was disconnected and put back in, so not very useful at this level. We have to take a few steps back, and the first LIME 5 fonts shipped to the first customers in December 2019, and the issue about those resets was filled actually by myself in August 2020, so there was plenty of time to notice this issue, and it hasn't been noticed earlier, so it's safe to assume that it wasn't there initially, and just came up later. So looking at what was the state that those first fonts have shipped in, the USB power management was already enabled with the software that was running on them, however, it turned out that the SD card reader, the driver for it, kept the USB hub active all the time. It was basically pulling it for media change status, and that's why it never suspended, so the whole USB hub was kept active, and it was fixed in August 2020, and there is a somewhat lengthy thread on the Linux kernel mailing list that you can follow if you're interested in that, and there was also another thing, at some point I have noticed that modern manager pulls the modern for signal strength every 30 seconds, and I wanted to change that because that's not very nice on the battery, and to make it start listening to the messages coming from the modern whenever signal strength changes instead, and I got it working the first time in the context of LiveM5 in August 2020, and later I noticed that with this change the resets popped up more often, without this change they were still there once the hub started suspending, but not as often as with this patch. So now we know that this is related to power management, and it turns out that disabling the suspend may be the issue going away, so yay! However, doing so it's almost half a watt, so not so yay, and basically this was the main reason behind a poor reputation of battery life on those devices when they first shipped, so power management is essential and it must be kept on, we just have to find a way to solve it without disabling suspend, and there was one vital observation, I think Elia said that she observed it first that the issue only ever happens if the hub has just been suspended, never if the hub sleeps for some time already, and then the modem wants to wake up, it's always the hub goes into suspend and right away the modem wants to wake itself up, and things go wrong, so this starts to smell like some kind of race condition, so what we do with race conditions, we start playing with some timeouts, if not in hopes to fix it then maybe to make it happen all the time, just to learn something about what's going on, Martin Keplinger was earlier working on that other issue that made the modem not come back, he had some progress on that, but however he didn't really make progress on this one, when I took it over I started, I based on his work and to figure out what's going on with the kernel code in USB and started changing some timeouts, eventually I figured out that this isn't going to help because at this point where was the earliest possible point where we could query the hub for its status, it was already telling us that something wrong happened, so this didn't really help, and I think that really helped was finding out that you could reproduce it by pinging the phone, if you pinged it over the network interface, cellular network interface and set the packet interval, just right, I think it was about, a second above two seconds, you could actually make the modem reset this way, so this helped to investigate it, and at some point I also started playing with an USB M2 adapter to pull the modem from the phone and put it into other kinds of USB sockets in other devices, the idea was to identify whether it was the hub or the SOC or the modem itself that caused troubles, and I found out that with that listed kernel modules for the modem and sleep timeouts all set to zero, I could make it into some kind of reset loop, it would basically reset every second or two and keep resetting, and at some point I noticed that when it was plugged to some USB hubs, I got it pretty much all the hubs I had in my house, some pretty ancient ones as well, and with some of them it never reset it, I couldn't make it reset with some of the hubs with others, it was pretty easy, and whenever it was connected to the house directly with no hub in between, it always worked, it never reset it, it even applied to this port on the Libre 5 itself, when it was plugged to the USB C port, the resets were never there, so there was time to start to read some specs to find out what's going on or should be going on, and it turns out that the USB device enters the suspense state after three milliseconds of no activity seen on the bus, and this can happen in two ways, you can send a message to the hub to enable port suspense feature, and this is how the hub stops sending frames to that port anymore so it doesn't see activity, and it suspends itself, or you can stop any communication on the bus, we just call it global suspense, and then all the devices on that bus see no activity and go into suspense, and when the device detects that the data lines have been in the idle state for at least three milliseconds, and high speed idle state is SE0, it must revert to the full speed configuration, which is J, so D plus high if I remember correctly, and then it must sample the state of the line, so it checks what hub has or host has asserted, and if it's full speed J, then it continues with the suspense process, it is required because SE0 is also a reset signal, if at this point it would stay in SE0, it means that this is the default state that the bus is put in and the device must reset, but if it's J then it means that this is a suspense has been requested, so the device then asserts J itself and stays in J, and we now know how a suspense works and how about the resume, the host can resume the port at any time, it must send the resume signaling, which is K, for at least 20 milliseconds, and after resuming the bus the host must resume communication within three milliseconds, because otherwise the device would go into suspense again, and what if it's the device that wants to wake itself up, it cannot wake itself up after being put into suspense for at least five milliseconds, and then it can, and it must hold the resume signaling, which is still K, for at least one millisecond, but for no more than 15 milliseconds, and the controlling hub, which is the hub that actually handles the resumes, suspended as there might be more on the industry, must re-broadcast that to upstream within one millisecond and ensures that it is signaled for at least 20 milliseconds, so it kind of takes over that signaling. So now it was time to get dirty, fortunately I didn't have to do that myself, Eric Kazmenko, who is the hardware guy at Poism did it for me, and soldered some wires and put a differential probe to it in order to sniff what's going on electrically on the wires, so this could be then seen on an oscilloscope and recorded, and this is an example of what's going on. We can see here at the beginning some kind of high-speed communication, as it's a lower voltage than full speed, at this point we can see that the modem went into suspense, this is the J state, for some time, and then here we can see the K state, which means that it was either resumed by host or it wanted to wake itself up, and it happened, cycled this way for some time, and eventually something went wrong here, so to zoom it up, what happened here is that there was some kind of high-speed communication, it stopped for three milliseconds, at which point the modem went into suspense, and there was a J signal for another three milliseconds, then it went into K state, we can assume that the modem wanted to wake itself up, and it lasted about 20 milliseconds, but then the bus went into SE0 and communication did not resume, it stayed at zero, at which point after another three milliseconds the modem just suspended itself again, so this is somewhat informative but still not enough. My hypothesis at this point was that the specification requires a great period of five milliseconds before sending a remote wake-up request, but I wasn't quite sure whether the wording isn't ambiguous, because it says that it needs to stay continuously in the idle state for five milliseconds, but if we check here, we have two idle states, there is high-speed idle state for three milliseconds, and full-speed idle state for another three milliseconds, so when is this point where it starts? However, there is also a side of English description, there is also a bit more formal state machine description in the specification, and after deciphering that it turns out that both of these idle states actually counted as one continuous idle state, so this probably wasn't it. So we go back to getting dirty, and this time instead of just sniffing what's going on between the modem and the app, we also sniffed what's going on between the app and the fonts processor, at the same time, which required quite interesting contraptions to be made, but it worked, and we got some data, and this is an example of things going wrong, and we can see some USB micro-fames here, so host polling the devices, and then some communication actual, and then nothing for three milliseconds on the modem port, on the bottom we can see the part between the app and the SOC, and there the micro-fames continue, and the modem goes into suspend, and after I think here it was too many seconds, it wants to wake itself up, so with assets K, and the app takes over, then 20 milliseconds later it stops, but what happens here at the bottom, and the micro-fames continue when the modem is suspended, and when it wakes it's up, starts to wake itself up, the communication still happens, until this point, then it stops, this is the point where the app has been suspended by the host, and then after three milliseconds the app went into suspend process by itself, and what happens here is that at this point, at this exact point the app started to wake itself up, however at this point also it should start sending frames to the modem, start forwarding frames from the host to the modem, but the app itself was waking up, so there was no data to transmit, so it all fell apart at this point, and I started looking closely into the specification, and following the state machine, and I couldn't really figure out what the app was exactly supposed to do in this case, when the upstream-facing port went into a suspending state while a downstream-facing port was already in the resuming state, and I wasn't sure whether it was my misunderstanding or whatever, what was, at this point in time the host has no way to know that the downstream-facing port is already attempting to wake itself up, if here we would query the status of the port, it would say that it's still suspended, there was no indication, and that's actually how it works in the spec, so that information only becomes available when the port already finishes resuming, so now I knew what was going on, and I had the knowledge what to put into the search browser, and I found this email from many years ago from Alan Sten, who is a guru of USB and power management subsystems in Linux, and he stated that the USB to spec does not take into account that possibility, so Alan basically validated my suspicion, yes, before I made that suspicion, so at this point I could safely assume that my suspicion was true, and what's worse, that mail ended with, I don't know what we should do, suggestions, anybody, there were some replies but it didn't really went anywhere, and however that mail pointed to an IRATA, and IRATA said that there is a very unlikely possibility of a race condition, and this issue can be avoided if system software suspense are downstream-facing hub ports before suspending a hub, I completely forgot to check IRATAs, at this point this was the first time I seen it, and I was so happy that this was the first time I seen it, because what the hell, this recommendation suspending the port before suspending a hub is exactly what makes this issue happen, and Alan Sten said, so himself in his mail that this IRATA is completely bonkers, so I'm so glad I didn't see it because I would be so confused, so there were around what I did to actually stop, prevent it from happening, so I have added a port query in the USB subsystem in the kernel, which when it was enabled the port was never actually suspended selectively, Linux only pretended to suspend it, but didn't actually send the command to the hub, since this would cause troubles as if we just pretend that the device is suspended, we stopped pulling it for more information, but the device isn't actually suspended, so it can't wake itself up, so to prevent that from happening, we keep such quick port active whenever any sibling port is active as well, and when the hub gets resumed, all ports marked with this quick are also resumed as well, and this lets us rely on global suspense when we just stop sending any communication, and all the devices suspend at the same time, preventing this race condition from happening, and this works well with the topology on the LibreM5, but wakes apart on different topologies, if we added another device, for example on this third port that also wanted to use remote wake up, it wouldn't work, there's the code, so what can we do now? This hack isn't really a sweet table for mainlining, it's really a bad hack, so for now it stays in our downstream tree, however I believe there is a way to do it in a way that could be potentially upstreamed, it wouldn't be the default, I'm pretty sure, because this it would be quite inefficient, but I think it should be possible to have this as an option if you have such devices that are resettled in this way, that you could actually have them work reliably and wouldn't have to disable power management completely, and to do so we would have to ensure that no downstream wake up capable port is suspended while the hub goes into suspend, and there's also another thing that made me implement it as this hack instead of a proper solution first, is that while the proper solution is less efficient, this hack actually gives us some efficiency because we can skip suspending each device one by one, we just suspend them all at once and it takes less time, so this lets us make the modem go to sleep more often saving more battery, and so that's basically it, I'm available for consulting so I can turn your money into code if you're interested to have something done in mobile gaming space, and if you have some questions like my reviewer had here you can ask them now, thank you. Great. You already have a question here? Oh, you mentioned the influence of the modem manager on this effect, can this be explained with your findings? Yes, this is because when the modem manager is polling every 30 seconds, it's the host that initiates the communication, but if we switch to unsolicited messages from the modem, then it's the modem that actually initiated, so it wakes itself up more as opposed to the host waking itself up when this issue never happens. Hello, thanks for your presentation, how many man hours went into this bug fix? Oh, I don't really know, it took many false starts let's say and red herrings, so this is obviously just a chunk of it because I had to feed it into the presentation, but yeah there were many approaches that when we were really in the dark at the beginning, I didn't know anything about how USB works, initially I had landed from scratch, so it took some time. Hi, quick question for you, actually two questions. The first one is, is the USB the ideal way to connect the modem or is there a better protocol that we could be using in the future in another design? It depends what you have available, perhaps on the LibreM5 we could theoretically use PCI Express, however PCI Express would be at least on this SOC would be much more power-hungry than USB and USB makes it easy to find such devices that you can actually have on a replaceable card that you can put into the phone pretty much off the shelf, so the options are quite limited in this place. And second question actually on that, when it comes to adding a different modem, this isn't a modem issue, obviously it didn't come down to which modem you were using, but are you guys looking at releasing a gemalto modem because that would be pretty cool? I'm not really a person that has anything like any power in this regard, so I can really say much about it. We have a question from the Metrix channel. When will it be fixed upstream, hopefully? Hopefully soon. Making this presentation, submitting it here, was actually a way to force myself into going through this again because after getting this hack done, I just wanted to take a break from all this USB stuff, so maybe soon, maybe not, we'll see. I think it should be pretty simple, in fact. We'll see what the maintainers will say, whether they will be happy to take such a quick approach or maybe they'll have another idea. We'll see. Are there any proper solutions to this problem, like in the USB specification, for example, are there any hubs that don't have that issue? So the specification of USB 2 never fixes it. USB 3 works in a completely different way and there are also supplemental low power modes in USB 2 that could be used and that also don't have this feature, but you have to have a device that supports those modes and we don't. So we can say that it's fixed because it's all completely different in USB 3 and higher. And for USB 2 devices, it's all up to the hub and how it's implemented. If it's implemented to the word of the spec, there's a high probability that it will have this issue, but some hubs are like, specs gives you some time to do things. You can do it like the minimum and maximum time and some hubs are faster and then you may not see this issue happening with them. So yeah, at this point with USB 2 devices, it's probably up to your luck with what components you are using. I'm working on open source USB debugging tools, sniffers, software, so I'll be interested in talking to you about capturing this as a test case to make sure that we're able to spot this happening on the wire in future. Okay. Very nice. Yeah, first another from the chat apparently, then a further to you. Is there known other mobile devices that suffer this issue? I relate some aspects of the bug on Pinebook Pro Wi-Fi. Honestly, I have no idea. This was the first time I experienced this issue and had to basically go through what I told you today. So I don't know. This was known for years. The email was 12 years ago and Alan Stem has said that this came up in testing. So obviously this came up somewhere, but where it was and which devices were affected, I have no idea. So you mentioned the other USB bug you were facing where the whole bus died. Did you fix that as well? And can you say like two sentences about that? Is there once again? The other bug you mentioned in the beginning where the whole USB stack died and the modem didn't come back. Did you fix that as well? And can you say maybe two sentences about what's the possible? Basically that one was pretty boring. It ended up to be a missing queer queen, the host driver that was already implemented, but wasn't enabled in the device tree. And at some point, actually, NXP has enabled that for all IMX8 and board. So this is fixed now May 9. So please give another round of applause. Thank you.
The Linux Phone App Ecosystem
Okay, our next speaker is Peter from linmob.net and linuxmartphones.org and he's talking about the Linux phone app ecosystem. Please have a round of applause. Hello everybody. So I hope everybody can hear me and yeah, this is my first talk and I'm really glad to be here. It's amazing that this conference is running every year volunteer based and that we have another room this year to have all these great mobile Linux talks. Now we'll have one that's maybe less great but I don't know. So I think I need to hold that. So this is an important thing. You can use those devices, Qualcomm SOCs or the little five and whatnot and you want to touch on all of it but it does have no apps, right? So in theory this could be so simple. You just install Waderoid on your distribution, simple and then you install asteroid, free software apps and then maybe you need some proprietary stuff so you can do that and you have all the apps. Well, you know, I've done that with Linux, it was in the past and so on and microg is amazing and whatnot but there's always issues and especially with virtualized Android and so yeah. There are good approaches and worse but I think I would rather go with native if possible so this talk is only about native apps, whatever that means. But not so fast, let's have a brief agenda. Who am I? Some dumb puns maybe. What's not in this talk? I don't have a slide for that because why? And then apps on Safe for Sure has absolutely a bunch of touch and the new contenders so what I do with the links on apps.org or what others and I do with the links on apps.org because I don't develop any apps as other people and I don't add all the apps. Can't do it. And then highlights we have, gaps and challenges and Q&A maybe. So motivation. We already heard of three major projects, realms maybe mentioned like with Safe for Sure has and you want to touch and all these new Linux distributions that's born up that we'll get into later and I think this is a small space in terms of market share but to solve that it's heavily fragmented. So maybe there's something to learn. Maybe another platform project, whatever you call it, product does something different and that's great and maybe others can learn from that. And then I wanted to spend some time with you, want to touch and Safe for Sure has after a while but yeah I don't know, broke happened so yeah that part is going to be rather thin. So then I had some assumptions at first so surely stuff like email that's easy, document protocols, well maybe quite complicated but it's there, metrics, it's there, XMPP, just do it and then stuff that has free APIs also yeah you know people will do it and then everything that has an API even if paid should also be doable. So yeah, let's get into it. So Safe for Sure has. When, oh I forgot the introduction part, shit. So yeah this is my website, it's lin.net that stands for linear mobster's network, no actually not. So this is a logo, you may know it from YouTube and this is the current homepage, weekly updates, a lot of work and now how it started because I think that part matters a bit. So it started in 2007 and even back then we had plenty of Linux mobile projects, community and others coming over from the handheld age to the smartphone age, handhelds.org, linux.go.org, I don't know if anybody was in those IRC rooms at the time if you are in your year, great. That's real stamina, what would you call that? So I somehow stayed around, well I left briefly because in 2011 we had like a major two things killed by CEO, so what happened with Nokia and what happened at HP, new CEO and then boom mobile linux, look promising, died, also open moco. Now to get into this talk, while I was doing a blog and totally into the Aztec in 2020 with a pine cone and oh my god what can you do, this thing only lasts five hours but hey I want to use it so is there a list of apps from this, forked it, eventually turned it into this because the previous implementation would no longer work on those low Linux phones and it's still pretty bad, I think there's an issue tracker on Framigate and we'll get to that later maybe but yeah so improvements of Alka I say but there's a lot that has been learned and I think it can be helpful so we skip that so say a fish, like we just said Elop killed the Anain and Harmaton Nokia and from the ashes raised YOLLA and they introduced the YOLLA phone into 2013 and it's quite modern so it's BTRFS well yeah who cares file systems, Wayland system the 2013, Wayland really and then there's a going, troubled surely don't make any more on devices you can buy a license bring you on Sony device and they've got something that's quite interested for those that need those proprietary bits to close the gap that's Android app support not a topic of this talk so what do they have so there are multiple interfaces to get software so there's the YOLLA star requires the YOLLA account no for pay apps has no web interface so I did not count those apps maybe there's an API or something we didn't look into it but yeah it looks quite nice and that's not the only source of software that's well organized there there's also open repost on that now that one is really old if you go on to open repost on that you will see that it lists one app for the LibreM 5 or for POS but it also has many apps for the N900 which I think many people still have fond memories of and the N9 and there's even some development still for the N9 so people are still using that thing today yeah it has a Storm in frontend for Safefish has also no for pay apps it like I said lists up for the projects and it has approximately 1800 apps and counting listed for Safefish but I don't think that with the transition from arm 32 to 64 bit and the long history of four major releases that Safefish apps you will be as you will be able to use all of them now this is what it looks like a little bit less entertaining than the YOLO Star but also I think quite fun and then there's a newest contender of course because more options better and that's Chum it since recently has a web front end it also has no for pay apps it has 170 apps listed for Safefish and it includes and this is for me a total highlight because it's this cross project collaboration I'm talking about it includes some Kyrgyz apps by packaging a modern version of Qt because Safefish uses silica for its widgets and it's stuck on Qt 5.6 forgot to tell you that earlier I mean who wants to talk about those sites that aren't so nice and shiny but people made it work and you can run like cast Angerfisher web browser which is nice because sometimes you may want a Chromium web browser because the real web browser in Safefish or as is Gekko based which is also really unique and there will be a talk about that later on so yeah highlights I did a little impromptu poll on Masterdun I wanted to do something better but these are the highlights of Safefish OS so if you're using Safefish OS and you haven't installed those apps I mean what are you doing just take out your Safefish phone and then install them and maybe enter your security code yeah and then you can do this nice multitask gesturing thing I will not go into demoing apps on Safefish OS I did that for YouTube and I failed miserably people were making fun of me doing that again so yeah there's a lot Safefish OS connect by the way integrate with KDE connect so if that's not obvious and then we have even had contract so if you were like me having a relative that was in deep danger that was something to appreciate at the time I mean now no more tracking why would we so yeah then next one just at Safefish now let's go for a bunch of touch it's about as old if not older envisioned in 2011 this is nice quarter on there so it was in 2011 that it was announced and it would Ubuntu would support smartphones tablets smart TV smart screens smart watches head units whatnot everything maybe peak Ubuntu I don't know and then they I left out the prepaid crowd for everybody else about that one and then had the first commercial device in 2015 February 2015 so like nine years ago by now my man time flies and they'd used mirror which these days is a way than compositor but then wasn't upstart because yeah and unit 8 their own convergent thing unity hate is amazing it's now you know Mary thankfully because canonical eventually would drop that all that great effort because didn't have market success so another death by CEO if you will but it was picked up by the community and could be picked up by the community because it was completely open source so maybe that's one of those lessons so only trust projects that are completely open source because then it doesn't matter if they go under and yeah you be porters doing great drop the latest release was just a few days ago and the store situation is also pretty simple as the open store it has a web interface so you can browse it without having even to touch device and get an idea of what would be available even as ratings so can look into is this actually working and it has more apps for the older one than for the new one so really I think that those numbers you know with 210 whether it's about 610 I think it's actually 217 215 by now but yeah who cares about the exact number that really should improve the open store has one neat feature I wanted to put a screenshot of that into my slides but who has the time so when you install an app on the open store it basically sometimes if that's specified next you for donation to the developer and I think if I remember correctly it may do that later on again and I know nagging nobody like likes to be necked me neither and nobody wants to feel bad because they don't have the time to fill out the details and do that stuff that you need to do that donation because it's also complicated because payment but I think that's a nice idea because you know giving something back and not just feedback does not work for me fail I don't know this is garbage you know maybe communicate friendly that might help and maybe donate if there's a way to keep this going you know we need to do that and then of course other ways to install apps so you can do you want to contain over 604 this was totally uninteresting for you know all those new apps we get to later because well in 1604 you want to 604 not much was mobile friendly in the GT cable they can type that and neither in KDE land really and then with 20 or 4 it's a little bit better but you need to bring your own environment variable variables and then there's new development only work on some devices snaps are you can install snaps on you want to touch now snaps are known to be controversial but on a system like you want to touch which is also in a way immutable air quotes and was very early with that so that's another thing that's great I think it's nice to have another option to distribute software more widely and if snaps what's been added first got a sticker on my little tablet here that's just what I would have preferred but it's good to have really nice and well you need to bring your own and worse to make it scale properly but wouldn't be fun otherwise highlights you must know so if you do a poll on master then apparently people favor message on clients it's weird and Weber a tool for web app creation generally you want to touch has a bunch of web apps which is great they have a way to do those other projects should do that too because it's maybe is relatively simple way to make a service seem available from an app store because people don't think that there's web browser that they could use then Deco great email client well might use some work to get GPG award but I mean come on it's an email client didn't have that when it was on the canonical throne that was fun when I first tried you want to touch it was like what the fuck because the only email client that shipped was a Gmail client again whatever past memories and then you nav for navigation and then there are more some of those really should be brought over just some highlights I think you can read those yourself so fluffy flesh had flutters interesting because they did not ship GTK and flutter in that click package as far as I could concern they made it a web app so they flutter can do web apps and then they went that way so also interesting hack would like to see more of that and then there's an app for scooter for scooters you know those urban mobility shit supporting two services really great I don't know whether it works didn't try be friendly if you try and have bugs Tesla app don't have a Tesla no idea Nostar nobody needs Nostar but they have a client and it works for me because I try to go there with my blog and whatnot but and then of course it's body for a premium client because like assumptions earlier it's body for premium IP I works good so and then gaps briefly for this metrics apps maybe so yeah not not really happy with that situation it's interesting the element adaptation is something like a hack some CSS hacks on top of element desktop nice approach but of course something like that is prone to break it you basically patch the more moving target how to do just ask all the Android custom ROMs and then XMPP of course and desktop Firefox we want to touch that's one for the poll yeah that would be great now new contenders and that's the area I'm competent about which why I spend so much time on other shit to not talk about it too much so up top you see the UIs for or also mom shell I could have put another logo there plasma mobile and then as a joke because I'm not going to talk about thermal apps sorry as XMO it's it's awesome I use it on my pine phone pro and then distributions you know dang next post macros mobian fedora there's a mobility stick then that fun icon anybody know what the icon on the right is any hands yes no it's open mandrieva made made one image for the pine phone but I had to put it here we are right now as some rolling of but you want to mobile next OS nice to have that too and of course open zoos the lizards are here too so yeah and then of course how did that get started it's all history 2017 maybe 2020 live in five pine phone what's a project based on desktop distributions like we saw I've got two times there being in that list and many eyes plasma mobile with kirgami for apps for with first lipendi a widget library to make GTK apps more adaptive and then these days lip of data for GTK for which really made us rain to go so that's more of a success than I would have hoped for as a spectator on the sidelines really impressive and then the downsides well no proper app store solution ish hands links for naps for dog org you know I was really hoping that we wouldn't need that by now because you don't want to maintain a website that lists like I don't know 500 maybe including games these days apps and has to you have to check those and then does it still work oh no I don't know who has the time so these are all the fun UI frameworks that I used in apps listed on Linux for network most of these don't really matter and I already mentioned the ones that do really matter except maybe flutter because that's going somewhere well but we will touch the later this is just as an overview so there are plenty apps with Kirigami it's like hundred and forty naps listed so plus my mobile is going rather strong there no side goes a little bit stronger up top with a lipendi I mean I could also call the account GTK for and GTK 3 but some of those don't really super fit very well you know only with foch scale to fit heck and whatnot if you've been in that arena you've seen that rodeo it works and it's great to have it but it shouldn't be there so yeah the panty 66 lip advisor 156 used to be more in the panty camp stuff is moving over which is I think good to see I don't know why I've got one you bunch of components at there yeah I think it was future five before it was an open store and then programming languages well I think everybody here in here is more competent to judge this than me I can do a little bit Python and some CSS and HTML and whatnot and barely do JavaScript but so it comes with the with the toolkits right there are also some things that I did not know before I started this list I didn't know that there were GTK apps made with C++ I always assumed that was all cute but yeah you learn so looking at the interfaces you can use to browse software here's one that's really nice these days it's no software see that fun little thing there that says adaptive yeah that's great that's metadata if that were everywhere I could stop working on Linux phone apps boy would I love that so but we're not there yet so yeah it's tough so can show that and then there's even a fork that only lists the apps that are indicated to be adaptive you know you can always write anything in metadata nobody checks so you could claim your app is super adaptive and it's not but then you will get that feedback so don't do that and also don't do it because otherwise I really can't retire that website at any time yeah so and then discover well it doesn't show adaptiveness but the thing is if something is kirigami most of the time it should work except a few few things that don't but you don't need everything in your phone right then there are of course also some cute widget apps that also work only barely and if you're lucky yeah and then metadata it's beautiful so my day job isn't publishing and in publishing we still love xml and abstream metadata is also xml and so this is a common specification that has been extended over the years I think that started I don't know decades ago or maybe but it's definitely more than one decade at this point and there I have some links on the site on a blog post and form fractures how to specify those before that there was an intern specification by purism and you can put your licensing in there you can put description release loads you know go crazy and the good thing is except for the release notes if I execute a script I can pull all those nice informations into linux phone apps are no ain't that great so yeah if you are developing an app please add metadata maybe there's a meta info creator that makes that relatively easy I know it's some extra sure and it sucks and nobody has the time but it's I think it's really useful for people and if you maybe want to contribute and run through the code forges and find apps that don't have metadata and make merge or pull requests adding that metadata go for it thank you yeah but with that express the metadata sorry about my excitement for xml nobody likes that anymore I know highlights for apps I don't think I need to iterate through the app just highlight itinerary it's really a better travel companion than the app by Deutsche Bahn for example which I know very much unfortunately because it's generally not only taste you about delays but it also tells you how to get from that one platform you have to start changing trains on to the other platform so you can see that because it's not always that numbers that are next to each other are on the same platform and that matters if things are delayed once again and then angle fish nice mobile web browser also on SafeWishers like I mentioned and then pure maps pure maps again we had that before could also have been on the Ubuntu touch list pure maps well everywhere oh I forgot cast so sorry cast is it's also great it's really feature rich does chapter markers I like podcasts sorry and then highlights in the norm side well chats and calls because you know sms mms calls who wants to get phone calls but yeah people do and if all your stack works it works even as a yeah very worst client again that's from the poll and also it's really nice 10 gram that's little thing for web apps you can also use it on your desktop all of these apps are also available on your desktop so if you don't have Linux phone you can also use all of these apps on the past two slides and they are also great on desktop because adaptive apps aim to be great anywhere and I think these listed here all succeeded that and then of course communication railway like maybe maybe I trouble the trains too much I don't know can you travel to the trains too much no idea and then spot Spotify premium again API magic and then flat sale because helps sometimes and then other highlights so these are apps that are on kiri gami and I've put two matrix clients and they may be I use too much metrics yeah and I must use too much metrics so one is using cute quick components to Nico and the other one is using flutter so special one apps that run anywhere on mobile Linux we had no pure maps maps navigation whatnot and maze fish smart watches and stuff is that and then kaitan that's an xmpp client and yeah it's only in ubuntu touch 64 that's why the asterisk is there but otherwise looks like building cute apps that are cross-platform is possible another special apps that run everywhere including legacy platforms so iOS and Android well see next talk flutter maybe I don't know we're really interested looking forward to that and then current gaps so what if you are have time and want to start here's the list we already saw that some of these things are solved somewhere I think you're going to touch also as a cryptocurrency wallet if you need that I don't know maybe you do and then of course what's that yeah tough and then more current gaps that I found elsewhere attention grabbing social media I think we need Instagram and TikTok to make that mainstream and we need Facebook for the Grand Parents and we need office clients to edit fucked word documents and shit and axler well you need that there are some approaches by the way that's one kt app and then yeah so gaps this brings me to packaging um aside from metadata you know releasing an app helps I'm not explicitly said stating that I'm looking at k delta chat in this very moment but I am so that's nice app it works delta chat is encrypted chat via email protocols nice but no release so not package anywhere aside from a u r and xOS yeah and also I mean maybe maybe flat up so in my little impromptu poll one answer was and that made me really so yeah this app seems great I'm looking very much forward to it land in db and stable and I'm like oh god this person is patient should learn something from them crazy yeah so please if you maintain an app maybe do that toggle thing release it at some point you know don't release it while it doesn't work won't help anyone but maybe release it once it when it won't when it barely works because it works barely but works then of course flatter apps build only for x86 64 linux electrode apps build only for 886 64 linux what the fuck signal and then generally apps build only for x86 64 linux you know aside from doing this mobile linux phone thing I've been running arm laptops for years and it's I mean now with fast arm laptops it's less of a problem you can compile shit but oh god imagine the pine book and then compiling a big electron app I mean you can't do that but boy that's like waiting for stuff landing in db and stable yeah then future challenges things get worse actually more and more services disappear behind apps and they in apps that are you know on the android side require play service often and thus don't easily work in bed right and that's a deal for public and private services so I think this is some german examples who cares but yeah we need virtualized android maybe we need to reverse engineer other things or we need to push government well governments I mean we're in brussels here double capital Belgium and the EU and NATO they're not state whatever but yeah so technical solution obvious one is the web and then of course what would I like to see more cross project collaboration in the app space I think I stress that enough but I've made it it's stress it enough to access to non-distribute sources easier and distributions and now that's controversial like enabling flat up from the get go and maybe even the snap store oh god people with throats brings at me and then donation bagging and other app install things maybe a future for software thingies and then a bug tracker like mozilla's platform tilt if you don't know they list this stuff whether disadvantage by last large companies also goes into that political avenue and help with linux phone apps or or so yeah yeah I want to make it a progressive web app I want to make search and filtering better but yeah who has time so conclusions I hope this wasn't too overwhelming or boring there may be more apps than you'd think regarding initial assumptions I think honestly despite trying to prove it people are just scratching their itch and that is perfectly fine so thank you this is the stuff where can reach me and where the next four minutes are always and if you want to contribute from agate it has issues with sign in so send just send that page to the mailing list and that last link is a really cursed really bad my skills at web development level thing that helps to create things time for questions thank you very much Peter any questions from the audience ha successfully over they're all taking it in still bored them to tears I'd ask the question oh it's actually not a question it's a statement this is David but no I just wanted to I wanted to thank you for taking all your time and preparing the weekly post as a user of mobile linux not so much a coder it has been huge to get me in the community to keep me in the community keep me up to speed with everything that's happening I realize that one person can't always do it but I just want to say thank you thank you that helps keep going another question or statement yeah in the back we'll take a second so I too want to have a Linux phone so can you please tell me how much time it's suffering do I have to you have to give to to achieve that goal depends on your approach so I think it's impossible to answer without knowing your specific use cases and the services you want to use and how much pain you're willing to go through or whether you're going to be like well you know wait right fine and also it depends on which hardware you choose but to go to hardware choices we need first to establish which distribution you go on and then go down some huge decision tree maybe that's a talk for next foster I have a pine phone but it's lying on my desk for I'm so it's catching us like most of those yeah I've got one of those too so many two of those yeah so pine phone of course since I've been paid by post marketer has no post marketer is amazing Mobian is also amazing think those safe choices and then try to solve your issues one thing at a time but if you have issues with your carrier and reliability and stuff then yeah it's get tough so maybe maybe different device maybe different carrier it's it's complicated okay I keep on dreaming do that a question from the matrix what do you think of box 64 I think we can use this to run some of our x86 64 programs as a current worker on until we have a 64 version of the binaries I think in some cases this is definitely useful and I think people love that for proprietary games mainly um with some electron apps you can actually use an arm 64 electron runtime and then run that so it's not always necessary to go that route but I mean why not so I personally haven't played with that because I am too thick to understand the instructions and don't have the time but yeah box 64 also great just emulate shit works all right another question yeah there's one okay please pass on Mike hi once again I echo the comment thank you very much for your weekly lim mob log of everything that's going on in linux mobile but my question is I well I think it's about purism about a year ago talked about a payment mechanism for developers I think maybe it's like a theory of it but I don't know if there's any you know anything about that about how that might be changing the landscape of linux mobile apps well I think it would be very good to have some thing like that and they are in a place to do that as a business they've got an easier route to that than all these non-profits um but I haven't don't have any news so I very much look forward to something like that but as far as I know it's not happened yet thanks please give another round of applause
Flutter - about the nightmare of cross platform development targetting Linux mobile
Okay, next up we have Brage, she's going to talk about Flutter apps. Please give a round of applause. Hello, yes, I'm going to talk about Flutter, but not like about the fancy ecosystem as we were just introduced in the previous talk, but I'm going to talk about development and rather about the nightmare of development targeting Linux mobile. Because from the perspective of app developers, there's still much work to do until we can properly target Linux mobile as with cross-platform software. Who am I? My name is Brage. I do Flutter since it was publicly released in 2018 and I work in healthcare actually, so my work has nothing to do with what I'm presenting here, but I find it interesting topic anyway. I use ARM by the way, that's why I talk about Linux mobile. Even the talk is held on ARM, maybe people recognize the laptop here. You can reach me via matrix since I do matrix for work, so when you have any questions at break, colon, and that leaks, I am from France. Back to topic, why would we like to use Flutter? We had a fancy overview about the Linux native ecosystem, about GDK progress, about KDE targeting Linux mobile. Why Flutter? Because Flutter is a decent cross-platform ecosystem. Unlike if I develop a GDK app, I do not uniquely target Linux, but I target giant ecosystem consisting of iOS, Android, desktops, maybe even web, and I can potentially also target Linux mobile. It has a fancy plug-in system for developers to access native resources, so we are not bound for example to web browser APIs, as we know it from many JavaScript based cross-platform toolkits. We have an amazing UI toolkit and that's what they love Flutter for. You have animations, the 2024 style, and it's fun to use. It renders perfectly, it renders 128 frames per second on high-end devices unless you have some vendors doing weird things and then it won't work. And it's no JS, no XML, so we have design as part of the code, so no external design resources which makes it quite fancy to refactor to use it for development. Yeah, Flutter, but let's talk about Linux and especially Linux mobile. We will talk about both in this talk, but the goal is finally what are the issues about Linux mobile. We have a giant ecosystem, I already told, like there are 10,000 apps in the Google Play store made with Flutter, a bit less in the Apple App Store, but we have a giant ecosystem and all these ecosystems of Flutter apps could target Linux and Linux mobile too. They are optimized for mobile views, they're actually handy to use on Linux. We just need to make it happen. And we have big players into it, namely Canonical and Google. I know they are very popular here, but they use Flutter, especially on Linux and push it. Unfortunately, that's a problem too, that they are the ones pushing it, not the community we will see that later. Yeah, like what are the key points in targeting Linux mobile and Linux in general? The first is like, okay, if I have the application, it should not have runtime issues, it should be usable on the mobile screen, it should have functional interaction with the user. The second from the developer perspective is I should be able to debug the app. I should be able to compile the app for my Linux phone, there we get to a big problem. And the third thing is redistribution. I first of all need to redistribute Flutter in order to have a package system which can target Linux distributions with dependency trees, with Flutter as a build dependency. The second thing is I need to package my Flutter app for Linux distributions. It sounds easy, but it can be hell. This is the first thing we are going to talk about because that's the most complicated when talking about Flutter. Afterwards debugging and runtime, I will give you a brief showcase of Flutter on Linux. Yeah, Flutter redistribution consists of two parts. We need to build the Flutter tool chain, so everything we need to develop and we need to package it in a way we can use it on Linux distributions in order to have it as dependency. Yeah, let's look at packaging because that's easier to understand at that point. If we follow the instructions on docs.flutter.dev.slashgettingstarted how to install Flutter, we simply clone a Git repository. I mean, that sounds amazing. It's just a Git repository. It should be packageable. You download that Git repository or you clone that Git repository, you execute Flutter for the first term and you see that. We're downloading lots of things. First of all, we are downloading the Dart SDK. We could use that one as system dependency, but that's difficult. But then we continue downloading. Let's look where are we downloading to? I mean, should be a user directory or something like decent location, which is user configurable. Yeah, no, no, no. We download all the stuff. We download to the installation directory. Now imagine how it is like with packaging stuff for Linux distributions. It's a bad idea if your runtime has hard coded to download stuff into the installation folder. That's a bit annoying. But that's something you can work around with patches to apply. Yeah, step by step. What is it downloading? Like you download the Flutter source, blah, blah, blah. You execute Flutter for the first time at loop and it's downloading the Dart SDK. So Dart is the underlying programming language Flutter is using. And yeah, afterwards, it's creating the snapshot from the Flutter tool. So it's compiling the Flutter tool written in Dart in order to have an executable of the Flutter tool itself. Then this compiled Flutter tool, remember, you clone source and it's first compiling stuff. Then we download the engine, the Flutter engine, and dozens of platform dependencies. And they keep changing each and every release. Good luck capturing that. So what do we have? We have fonts, we have Android stuff. If I use the Flutter tool to target Android development, I have different compilers all per architecture, compiled with native APIs. I have the Web SDK for target web. I need to download Skiya, CanvasKit in order to render in the web. All this is going to be downloaded. Generic Flutter SDK tools, platform compilers for Windows, Linux, Mac OS, FrontRenderer, for example, the GTK and better on Linux. And then I'm mostly done. Let's look at where these downloads come from in order to capture them and in order to improve that. Get a release now, now, it would be too easy. Some package registry, like, I mean, that could be a hosted Nexus or something. Better Chromium CI, the build system of Google for their open source proprietary components. They build all these components you need at runtime in order to save time while executing, I don't know. And it's built in Chromium CI and then downloaded at runtime. So you need to capture that somehow. You cannot know what's happening in this Chromium CI. No one knows. It's just we download blocks from proprietary storage and this is not very open source of you. It's held to package. It's held to work with that. But back to the topic, how can we package that? Now that we know where all this stuff is coming from, we could take all this stuff from Chromium CI. I mean, it's the easiest approach. I just want to have Flutter function. I want to develop my apps. Let's just package that stuff we get from Chromium CI. We could pre-cache it at prepared time of the packaging process. So download all these dependencies, create the snapshot and so on. And then just have it packaged in the distribution package with ship. Other option would be, and I won't give a definite answer on it. It's just prospect. You could also patch Flutter to make this user configurable. I made a merge request for that like two years ago. It was rejected because the Flutter authors did not see any use case. It's obviously a perfect idea to download stuff to the installation directory. Yeah. But even better, we could build them ourselves. Because actually, when I talk about Floss and mobile devices, I do not want stuff dropping out of this Chromium CI. I have no clue about what's happening in. Yeah. Building Flutter next topic. I don't know. Has anyone of you already built Flutter? Like the Flutter Engine, the Flutter tool? I guess a couple of people here. I guess you had fun. At this point, very special thanks to Lauren. Amazing work on patching Flutter to be able to make Flutter buildable with more or less less-vendored tool chain. Amazing work. So the next few slides are going to present actually the work done by Lauren. Yeah. Issues on Flutter Floss builds. Like you have, first of all, vendor packages. Like everything you could use Assistant to Sensey is being vended from some random upstream source from Google. We do not want that. Yeah. It's coming from Chromium CI, by the way. Also, the Flutter sources themselves are written in a way it's not muscle-compatible, existing patches, adding muscle support to the Flutter Engine were so far always rejected. Same applies to existing patches making it compile on BSD. Those are not that functional yet, but there were clear statements. There's no interest in adding support to that. There's no use case in it. So the Flutter team is not willing to accept these patches, this work done there, which is super sad in my opinion. Yeah. So the tool chain to build Flutter itself, it's basically a G-client solution. So you get the fancy repo, Depot tools from Google and download the solutions, and it's downloading lots of stuff from Chromium CI. This is a screenshot, can you see it here, from the Alpine package build files for building Flutter. You have, I don't know how many are, it's 15 patches only to make Flutter compile. There, you have some patches affecting the Engine, so for building the Engine, and some for runtime for the Flutter tool, and in both cases it's giant overhead just to package this simple tool. Yeah, it's sad. Yeah. Upstream work, nah, so far not wanted. It's not appreciated. There was upstream work until all patches were rejected, like it's already known for a while. So far all aims to improve that were rejected, and that's why there's unfortunately lots of downstream work going on. Yeah, mostly rejected. There we are. So in order to build Flutter on using a Floss tool chain only, you first need to patch the build route in order to have the function environment to build the Flutter Engine itself. First of all, things like, hey, use my local Python. I do not need your Vendor Python. Use local libraries and stuff. By default everything is Vendor. Afterwards, you need to patch the Engine to, for example, work or functionally work on muscle. This is though not required if you target G-Lib C devices, but the post market OIS people and Alpine people in this room, maybe the Void Linux people might be happy about that. And there are the patches are pretty similar to target BSD because Flutter has lots of stuff hard coded to function on Linux only, though it could at many places work on BSDs too. I'm talking about BSD because I love using BSD actually, and I'm sad Flutter doesn't work there yet. And afterwards, if you got to patch the Engine, you still need to patch the Flutter tool. Like we were talking about that. These artifacts, we do not want to download the Dart SDK. I want to use the Dart language installed by my distribution package manager rather than like some pre-compiled stuff. At the moment, it's usually, for example, Alpine has the Dart muscle port packaged there in order to work around that. So there's no canonical way yet. There's no clean way yet, though there is work ongoing that. And yeah, so that was the brief overview. I mean, I need to hurry. The talk is way too short to dive deeper into it. Like the second thing is debugging and cross-compiling. If I have a Linux mobile device, it's usually another CPU architecture compared to my host machine. Though host machines with ARM CPUs are involving now, like most people still use AMD64 devices, and that's why in most cases for debugging Linux mobile app targeting like this device, they need to be cross-compiled. And that's the moment where I wished Flutter was go because go is fancy and cross-compiling and Dart is like, oof, it's crappy. But wait a second. There are these fancy arguments existing for the Flutter tool, like target platform and like target sysroute where you can like specify a sysroute of, for example, R64 installation. Let's try that. That's the reply you get. I mean, nice that you added these parameters, but that's not exactly what I expected if it's shipped. So yeah, you see, there we have the aim of the upstream team to make it support, but it's too slow. There are other solutions making it better yet, and now I'm going away from the upstream, presenting some possibilities like to get Flutter to debug and to cross-compile on your ARM device, on your Raspberry Pi, on your watch and whatsoever. At that point, I can also recommend the embedded Linux talks on Flutter taking place in this system. They are driving deeper into the solutions I will present. Yeah, the shark is very confused by this output. Yeah, if I just want to compile, I could also just use KMU and like compile if it's functional for release builds, compile the stuff on my host tank. I could use KMU, use a static binary. I have my ARM binary. Okay, it's compiled. I could ship it, but I actually want a debugging session where I can use the fancy Flutter features like HotRestart, HotReload, where I just do Flutter run, show my beryllium instead of building it locally, pushing it, debugging it, not debugging it, checking whether it works, manually checking some outputs. Compiling is not debugging. That's a huge difference in it. Yeah, for cross-compiling and debugging, there's no canonical way yet to do that. You can compile Flutter cross-platform using KMU static binary. Thanks, but that's crappy. We actually don't want to do that. You could also just have your standalone ARM64 build server. That's what I do. I have ARM64 CI devices at home with which I build all the Flutter stuff I build in order to have test builds targeting, for example, Debian on mobile. Or you use custom devices. Flutter supports custom devices, which means you have configuration files. You tell the engine, the Flutter tool at runtime to use or to run on device configurations actually not supported. And there you have projects dropping in there. You have Flutter in Linux, embedded Linux developed by Sony. It's the Flutter embedded devices. Okay, that's duplicated, but yeah. It's basically a wrap around the Flutter tool, which enables you to run on ARM devices also remotely and you have Flutter Py also uses the custom devices API in order to target remote devices on Linux. But again, there is no build in way. There are these fancy projects enabling us to do that, but there's no Flutter build in way and that's sad. Yeah. As of now, it's easier. I have a full Linux installation on here. It's easier if I have my Flutter development environment installed on the device and SSH on the device and debug on there because that's way more functional than using the typical stuff you know from the second phone Android here. I just plug in the device and debug. That's not the state of debugging here. It's rather easy to develop on the target device itself if you have a decently powered CPU and like a desktop Linux distribution there or like can do it by SSH, that's way more convenient. And you should hopefully see an image. No, that's a joke. I have prepared a short showcase for you. It was number seven. Yeah. That's like showcase of Flutter. In a few moments, you will see me opening a Flutter. I recalled it while traveling here. That's why it's a bit blurry. Like that's an example of a Flutter app. Like you see animation rendering is pretty decent. Animation is crappy because it requires upstream patches in order to have defaultly handle Linux touch events as touch events and not as pointer events. There it's getting crappy but from the UI part, Flutter is fancy. And for example, like some Flutter apps ship these patches like to get scrolling to work. Most others do not. Some vendors ship patches. For example, Alpine again has patches to include a scroll behavior treating Linux touch mouse input as scroll drag enabled input. I think it's broken. I know it's broken since the last few releases but I think that's because the patch must be adjusted. Originally Alpine had a patch. It's no longer functional but it had a patch for it. And one could adjust that patch to still function. And like short summary, the first point is the touch is considered as mouse. That's why if you swipe it selects instead of scrolling. Scaling is sometimes an issue but that's an issue everywhere in Linux mobile. These devices have full HD or even higher resolution so everything is scaled dozens of times. You saw a GTK header bar which is pretty annoying. I do not want to see your header bar but that's again a GTK issue, not an issue of Flutter. And multi-window is pretty crappy because if I start a new instance you run into issues with any database connection you have open if you use local databases and you mess up your applications. Though you run into those issues in Android 2 but on Android it's handled way better because by default it does not start at two instances of your app. And yeah, that's state of the art. It's crappy but there is momentum. There is work going on. If you use all the patches, all the tool chains around Flutter, if you actually use them to target Linux mobile you can target Linux mobile in a pretty decent way. And I hope it's going on. Some work is going on upstream. Unfortunately most of the work is going on downstream which is pretty sad. That's not very open source of Google. But I mean it's Google. Yeah, so let's get Linux mobile ready as a cross-platform target and that was my talk. Awesome. Does anyone have questions? Yeah? You talked about the upstream not wanting to support muscle. But doesn't Android already have a libc other than glibc and do they even support that? If we look at Flutter we are talking about a completely different target of Android and Linux. And the Flutter Linux engine does not support anything apart from glibc and upstream. Of course it supports Android. That was what it was initially developed for but there it's another completely different components of the engine. And yet they compile with Android libc. Forgot the name. Yeah, by Jonik. Any more questions? Yeah? Martin. Your demo video showed a Flutter application running pretty smoothly. What defies what? Sorry? What defies your demo video running? That was a few years old smart from Shomai. It's a Shomai Pocophone F1 running Debian. No, how is it called? Mobian. Ah, okay. So, Friedrino. Yeah. Okay, thank you. If you tried on the Pine phone for example you won't have that experience because the GL driver is broken. That's exactly what I saw the last video. I often have that in my issue list, believe me. Any more questions? Yeah, that's one. So it seems like quite a pain to get Flutter to build and compile and get it all the way an app running on a Linux phone. Is it worth it? Is there really nothing better to get an app running on a Linux phone? As of now I consider Flutter as pretty liable for targeting Linux mobile because you have this giant ecosystem of existing Flutter apps. You have thousands of them which could theoretically run on Linux mobile but simply do not target it yet. You have 10,000 proprietary apps in Play Store. Okay, we do not want to have them. We have dozens of apps in Android all by the end. All of them could run on Linux if we made it easier. And all those patches are usually not patches. I as an app developer need to apply to my projects. Okay, I need to apply some patches too. Are the vendors shipping my app? But it's usually the vendors or the distributors shipping the distribution package to ship Flutter. I can easily build the Debian package for Flutter app. But if I want to do it the fancy open source way, if I want to use Flutter as a build dependency shipped from my package manager, then it's difficult. But I have the vision of getting there one day where I do not need to install, use my local Flutter installation with Flutter.dev slash getting started. But using Vendor Flutter, Vendor in the upstream of my Linux solution. And then it's harder but it's not the work done by the end developers. So I think it's worth it because it's only the distributors who need to do most of this work. Okay, thank you. Questions? Okay, in the back, one second. Thank you. So not related to Flutter, but if you said that's so painful to get something upstream from an open source perspective, how difficult would be or what would be the challenges, for example, to say, okay. As a community, we fork Flutter and we start supporting this fork because the maintainers don't want these patches on the official one. And we as open source citizens, we adopt this fork. How difficult would be that culturally? Well, forking Flutter entirely would be pretty complicated because Flutter is a rapidly moving ecosystem. There are many patches in the upstream and that could always break your fork with the giant company standing behind pushing Flutter development. So you have on the one side this giant company, namely Google, working on Flutter with a giant community and you would need to maintain your fork of the entire Flutter system on your own. What I consider as more realistic is patching the build route and like single components of the Flutter ecosystem, you could use as drop in dependency when shipping Flutter as a Linux distribution, for example, that would be way easier and also that's where currently see the Flutter floss Linux mobile ecosystem moving towards. So this work is more or less being done, but it's at the beginning stage. But I would not consider like forking Flutter entirely as a new framework. Hey, with this one you can target Linux mobile too because then you would lose all the big players already having their apps and continuing using Flutter. Thanks. Please give another round of applause.
5G in ModemManager
All right, thank you all for coming. So next up we have a very exciting topic, 5G and Modem Manager. Have a round of applause for Alexander. So let's talk about Modem Manager. Let me know if you don't see me because I'm not sure if this is going to work very well. I read about me first. I think I'm going to keep it like this. I have been the Modem Manager, Maintainer and Developer for the past 12 years and I've also been involved in developing and maintaining the two libraries that we use to communicate with Modems, which are the QMI for the QMI protocol and the MBIM protocol. I'm not working at the Google Chrome OS team since two years ago. And this talk is going to be about not only how we're going to add 5G support properly in Modem Manager, hopefully, but also how we added 4G and which are the issues that we had when we added 4G and how we are going to overcome these same kind of issues when developing 5G support. So we will look at what went well and what didn't go that well with 4G support. So before I joined the Modem Manager project, there was already support for 4G in the sense that you could connect the Modem, it was using 4G and the Modem will tell you, hey, I'm using 4G and then we will expose it and that's about it. So we were treating 4G yesterday as a different mode, we had 2G, 3G and now we have 4G, nothing else. So when I joined the project, I started to review the 1.0 API suggestion that were in the main list and the major focus at that time was to support multi-mode devices. So at that time we had two separate families of Modems, we had 3G, 3G-Bb-Modems, GSM, MGS, LTE, then you had another family which were 3G, VB2, you had CDMA, EVDO Modems for 2G and 3G. So 3G, VB2 had its own standard of 4G they digitized and then they started to use LTE as the standard 3G, VB2 Modems as well. So we had these strange 3G, VB2 and 3G, VB, Modems, multiple modems that had to be managed kind of in the same way, but it was very different in nature. Like 3GPP modems require a SIM card, 3GPP2 modems, most of them require to have some kind of activation with the network to bind your user account to the device itself, and it was a manual activation, automatic activation depending on the carrier. So there were many different things. And managing these new multi-modem devices, we thought this was the most important thing, but it wasn't, because 3GPP2 no longer exists. So can anyone tell me which main feature of 4G we missed? Because we didn't think of it. What's the mind here? No. Much more important than that. Actually related sometimes. So what we missed is the idea that when you attach to the network in 4G, you are actually creating a data network interface between the modem and the network, even if the host hasn't seen it yet. So you actually get an IP address, full data setup, communication between the modem and the network in the user plane, but the host knows nothing about that. And why did not we catch that? Because most operators didn't really care about that. They would allow you to send a blank APN during the touch, and then that was fine for them. They would tell you back, which is the APN that you are using. That was one approach. The other approach was that the settings used for data connection were actually going to be the same ones as used for attach. So you actually, when you kind of connect, you're actually configuring profile number one, which is the one used for attaching Qualcomm. There were lots of assumptions happening at the same time. There was also no consolidated approach to define these settings in non-protocols. The NBIM 1.0 spec did not have a way to specify attach settings. And many of the APIs that we developed at that time were based on looking at what NBIM capable modems were doing. So there's a use case where this does not work, which is when the settings are different. And so in 1.10, we added the support to explicitly specify attach settings. This is the case of Verizon, for example, where they have one specific attach APN and one specific date APN. So now we were able to say to the network, okay, we want these specific settings for registration, and then the network will tell us, then, yeah, you could have those, or you could have like a subset of those. You may ask for v4v6 and then only get back physics. So that's a very, very common thing that may happen. And this was added very late in 1.10, like many years after the 1.0 device API was introduced. Another thing that we missed in 1.0 was the support for profile management. So right now, up until that moment, the way you connect the modem, you specify all the settings that you want in the connection attempt. And in 1.18, we added the support to say, we already have a set of profiles, maybe even provided by the modem itself, because when you insert the sim card in the modem, the modem itself will build, not build, but with some carrier-specific settings, with some predefined profiles. This is very common in US carriers. So you insert the Verizon sim card, the modem boots, with already profiles defined as the way Verizon wants them. And then in that case, you can just say, connect profile three, and that's about it. So we did miss that. We missed some other things, which are maybe not as important as that one. Where did we do it? So the first API that we defined for 1.0 had multiple PDN connections in mind from the very beginning. Even if we did not support them in the same way as it's implemented now, at that time, we had modems that would expose two network interfaces at the same time, physical network interfaces that we could choose, okay, please connect this one to this APN, please connect this other one to this other APN. The multi-PDN support that we have right now is based on multiplexing, so we can have one single physical network interface, but then we can say, okay, I'm going to connect three different PDN connections, I'm going to create three different virtual network interfaces. And then the host can assign different data flows to each of these PDNs separately, because you have three different network interfaces, so you can do all the routing logic in the host itself. And this very same support was used to support Qualcomm SOC boards with the IPA driver, for example, which requires multiplexing by default. Now, where are we right now with the 5G support in Modem Manager? The picture is very similar as what we had before 1.0 for 4G. We just have the way to say that we are using 5G. We can say that we are using 5G SA networks if we only expose 5G access technology, but we also have the way to say that we are using NSA, so we are registered in 4G and we will use a 5G extra-barre when the bandwidth requirements happen. And that's about it. We don't have any other 5G-specific feature for now. What are we missing? So I'm not going to talk about 5G-specific features that apply, for example, in the radio interface because Modem Manager does not really care about any of those. We only want to be able to support things that the host is aware of, and that is completely hidden to the host. So one of the things that we are going to try to support is 5G Slicing, which is this important word that if you investigate about 5G, it's everywhere. So in 4G networks, there is no clear separation between different types of UEs. A UE is the combination of host and Modem. And so in 4G networks, you don't have any differentiation between different UEs. They are all treated in the same way. And in 5G, they do define specific types of UEs with different quality of service requirements. So you may have a UE that wants to have a bigger bandwidth. You may have a UE that wants to have an extreme low latency. You may have UEs that may send data to the network once a day or twice a day, but they need to be spread across a very big area. So in order to support all these different kinds of UEs, 5G introduces the concept of slicing. And so with the slicing, you have one single physical network, but then it can be logically divided into separate virtual networks. Each of them with its own quality of service requirements. And the separation, this is very important, goes up to the base station, which is something that 4G did not have. So imagine this UEs case. We have thousands of people here and for them, all of them with a phone, and all of them trying to get access to the network. There's congestion, there's a lot of radio interference between older devices. With 5G, what you gain is that you could have a phone using that slice that has a specific base station only for that slice. And so you get priority access to the network through this slice. And this may happen even with the same PDN. So you have one single APN that you want to connect to to the internet. You may have different paths from your host to connect to that same APN, based on the quality of service requirements that you have. Now 5G, as I said, is a logical partition of the physical network. And they are defined, they are specified or named by something called single NSS AI. It's a really bad name, I think. And so how are we going to support this in model manager? There are two main things that we need to support. One is during the registration, we want to specify which is the slice we want to connect to. At the time of registration, and we can't do that. And then you may ask for multiple slices, the network will give you back, okay, you are allowed to use this, you are not allowed to use this one, and you also have available this other one. So this is one simple way of binding, for example, all the traffic of the system to a single connection, to a single slice. This is the case that I told you before, single, this is a UI connected to two different slides separately, or both of them going to the same internet APN, and they use completely different virtual network connections in the operator side with different QoS settings. The complex way of using URSP rules is by using, the complex way of using five-year slices is by using URSP rules, in the way that the operator will tell you which is the way that you need to route the traffic through that network. So they will give you rules, the UI receives the rules, in this case the modem will push the rules to the host, and then the host needs to make all these separate traffic differentiation and move one data flow from one slice and one data flow for another slice. The UI should not be capable of deciding by itself which slice to use, so because this is mandated by the network, and so if you try to use a slice that you are not supposed to use, they may kick you out. So that's a way that the network has to control the access to the high privileged slices. Any modem manager that supports the slicing will look very much like a multi-PDN connection. We will have virtual network interfaces created for each slice, and that is about it. There are all the 5G features that we could consider, but I'm going to name them here only. So non-GPB access support that's basically accessing the operator network through Wi-Fi, for example, you can authenticate to the network through Wi-Fi, and then you also have non-IP based 5G connectivity. If you have a network connection between machines using different protocols, you could virtually create a 5G network connection between them without using the IP protocol. Now, how it's going to look like for the next 10 years? I think we need to focus on what went right and try to avoid the mistakes that we made in 1.0, but we also know the limitations because everything changes, and what is important now may not be important at all in 10 years. So the planning needs to be done carefully, and actually made it in a way that if in the future you need to change cores, then you can do it more or less easily. The first thing we should be doing is remove legacy features. A lot of the structure in the modern manager code base is based on this logic of having 3GPP devices as a separate type of devices. We can remove all that. Same for the ports, plain old telephony system, like these dial-up models. We said we would implement them 13 years ago, and we did not do anything. I think it's time to say that we're not going to do it. We had enough time to try to do it. And then obviously, all the plugins from modems that are very old, we can't remove them. There is no point in having them anymore. The focus should be on 4G and 5G modems, and on PCI and USB modems that expose the network interface. So we acknowledge that there are other types of modems, that is serial modems or USB modems that don't expose the network interface, and you can only do 80 plus PPP connections. Those would still be supported, but let's say like in live support only, like bare minimum data connections set up, and not thinking about adding many features to those as well. For example, not thinking about trying to add 5G S-Lacing in those devices. It wouldn't make much sense. We may want to have a new API. This API that we are using right now has been mostly untouched. We didn't break API into more than 12 years. I think it's time to do some some breakage. As I said before, remove interfaces that we don't want, and probably not the same process as we did for 1.0. In 1.0, we spent, I spent one year and a half with my branch, until it was mostly ready to be launched. I mean, I want to change that. That cannot happen again. I don't have as much time as I had that time. So the idea would be to do it progressively and start to add new APIs, at least the basic ones, and so on. We will have registration settings as a first-class citizen in the APIs. We no longer treat them as something automatic, which is what we do right now. We now want to configure 4G-attached LTE settings. We want to configure 5G registration slide settings and 7L-over common settings that you may have in the modern, like the manual versus automatic settings. All those should go in its own separate API with the idea that in the future we may have more. So it should be open to updates in the future. Regarding connection management, I think it's time to use profile-based connection management as default whenever possible. There are many reasons for this, especially when you use carrier settings, where the modern gives you all the settings that you need to use. There's no point in trying to add new settings on top of those when you already have them. So using profile management is the way to go there, and enable multiplexed connection by default. So as I said, the primary modems to use would be the ones that expose the network interface. All those allow you to do multiplexing. Most of them allow you to do multiplexing. So we should enable that by default. This is one of the main things that I would like to change as well. So right now when you have a modern detected by modern manager, and it happens to have voice support, even if you're in a laptop and does not have any audio connectivity, modern manager will try to configure voice-related stuff, call waiting status, all that. It doesn't make any sense to do that if you know that you're not going to use it. So let's move that to separate interfaces as they are right now, but as a way that you can actively enable that. And if there's any application with the intent of using voice capabilities, you can, hey, please, not a manager, enable voice capabilities in the modern. Then we will enable all the URCs, all the unsolicited message support, and everything that needs to be done to support voice, for example. Oh, no, that's another one. Yeah. This is extended to each list. So things that I would love to have, even if they are extremely difficult. So we have QMI proxy, NBIM proxy. Why not have an 80 proxy? Other programs can use 80 commands through modern manager, through the proxy, to do other stuff that does not interfere with the modern manager own control. So if you could have that, it will allow many applications to use 80 commands as well. Then we could move our GNSS location out of modern manager completely as a separate team. There's no reason for modern manager to have all this support for configuring AGPS and injecting extra files to the GNSS module. We do that because the modern has that. But if we have the proxies in place, there would be no reason not to do it out of modern manager. And, yeah, draft maybe for binary parsing of messages and all that. That was something that was already investigated. And that is all I have to say. Thank you very much for this great talk. Do we have any questions in the audience? Yeah. Thanks for the good talk. I was wondering how do you test all this? So what is your CI? So in Chrome OS, we have a lot of automatic testing for the moderns that we use. So I do rely a lot about that. Like when I joined Google, I found that there were a lot of information metrics about crashes and things, back traces. I was like, I need to fix all this. But I do rely also on my own testing. I do have a home network, a home LTE network with SLSLT, open 5G, as I have my own SIM cards. And that allows me to do a lot of testing that otherwise I would not be able to do. Because all the slicing stuff that also is very core network dependent? Yes. So you might run into problems. Oh, yeah. I know many operators are doing pilots and like private pilots. I do some open ones. I think also in the US, T-Mobile is doing it. But for example, for a 5G slicing, I think that my home network is enough for this kind of testing. Thanks. Next question from the back. Hi. I'm debugging voice calls from my device. And from what the manager I see messages like gained, audio, lost audio. And I have no idea what happens after that. And whenever I try to... So how do you use 80 commands to control the modern? No, it does it by itself. When I'm trying to get to the bottom of what's going on in the code, I only see interfaces behind interfaces behind interfaces. But where can I find the actual code that makes the audio? Like where should I look? Is there a problem? So modern manager is only in charge of starting the code and hanging up the code. That's all accepting an incoming call. Nothing audio related. I mean, modern manager knows absolutely nothing. About the audio path. You know, who is responsible for getting the audio? It depends on the platform, of course. So if you're using Libre and Firephone or something like that, then you may need to talk to them. Thank you. Thanks. There was a question from the matrix apparently. I'm rushing to the matrix. So somebody's asking, can we anticipate 6G features such as sharing machine learning data for connection optimization? I have no idea about any of that. I'm still in 5G. Maybe in 10 years we will talk about. Same talk for 6 years. You talked about REST for the protocol parsing and how there's already been experiments. And it's on your wish list. So I assume those experiments are somewhat successful. Can you talk any more about what those experiments are? So not much. I mean, it's useful. I think it's very useful. And I still keep finding bugs. For example, in the 3GPP PTU parsing, which we wrote 10 years ago. And there are still bugs there. Nasty memory related bugs. So REST is very promising in that regard. Cool. Thanks. One more question. And I'm back. Thanks for the talk. So the question regarding the AT proxy. With all the possible vendor crap, etc. So how do you plan to define if the comment is going to interfere with model manager or not? So is it going to be a low by default or is it going to be for be by default? So that's why we don't have a proxy yet. That's the main reason. Especially because model manager handles a lot of crap that manufacturers push in the AT port. So the idea would be to, in the same way that model manager disables a lot of URCs that knows that may happen, the proxy could do the same. And so we could still need to work with known URCs as they happen in the world. But I hope that manufacturers will start to use other things than AT at some point in 20 years. Give a round of applause for Alexander.
Droidian - Bridging the gap between various platforms with convergence
On ways. So. So thank you all for coming. The next talk is about Droidian from Bardia. Please give a big round of applause. Good afternoon, everyone. And welcome. My name is Bardia, as you've heard. And if you've been following our project, know me as Fake Shell in the community. And I'm one of the core devs of the Droidian project. And if you have any interest in embedded systems, mobile devices, that's why we're here, obviously. You might be particularly interested. So today, our topic of discussion is going to be Droidian and what we're doing, how everything works, and how everything goes together, and why the whole project even works. No, I'm sure. Like I said, I'm prepared for that. OK. So who are we? Well, we're a number of fos and privacy enthusiasts committed in building a free and open source project and operating system that is user friendly and open that can be utilized in different environments, such as phones, maybe even single board computers, et cetera. Tablets, different things. So Droidian is, as the name states, based on Debian. We take the core of Debian and add our own repository on top of it and add our own so-called finishing touches. So Droidian utilizes a number of different projects. Should I go down like this? OK. That's too far. OK, I messed it up. So Droidian utilizes a number of different projects. Some of the more well-known projects are Hallyum. We use lip hybris and Gbinder from Joly. We use the stack from GNOME, as you guys may know, Fosh. And we currently have a selection of devices supported in our official CI or build system. And I think it should be over 20. We haven't updated that device page, so it's not exactly up to date. It should be 25 or 26. So devices vary pretty largely from different manufacturers. Different release states. We have the OnePlus 3 from 2016. We have the Pixel 3a, the FX-Stack phones, the Galaxy S9, Lenovo Think phone. Like, the list goes on and on. So the barrier of entry for getting into Droidian and development and porting is fairly low, because there's already a number of devices that do exist. And they mostly cover most of the possible cases in the Android space. So for Droidian, one of the main things that people who just get to the project need to know about is our porting guide. So the porting guide is mostly split into three sections. It's the kernel compilation guide. It's the Routefest debugging and Routefest creation. Kernel compilation is going to be the initial testing and compiling, changing a few parameters in the kernel, and packaging it to get a Debian output, because we need Debian packages to do over-the-air app kernel updates. We have Routefest debugging, which occurs after the phone actually boots into the Droidian root file system. And last but not least is Routefest creation, because we obviously need to somehow get built for each device. So how do we actually get from Android to Linux, or so what we call Linux? So on Android, there's usually the bootloader, LK, loading the kernel, kernel loading the RAM disk. And the RAM disk does everything to start up the inner process of the system partition to actually start the system. And then system mounts a bunch of stuff, mounts product now to vendor, and a bunch of other garbage. So on Droidian, we take the same kernel that there was on Android, and we change the RAM disk. We have a modified fork of the Hallym RAM disk, which the Hallym project and UBPORs used to maintain. Now, in our fork, we have some support for a bunch of stuff that we use that is not in the upstream Hallym RAM disk. The Hallym RAM disk mounts the user data partition, which is where Droidian actually resides. We don't use system, which is kind of a basis base, but it is what it is. It mounts user data. It does a bunch of Android bootloader stuff to get everything up and running. And it starts in it, which is system D, obviously. So now system D starts, and system D starts up all the usual services. We have system D time sink, the system D resolve, and all the other stuff. But then we have our own services from system D. We have a service that starts a very small container that runs Android. And that Android starts and mounts a bunch of partitions, Android partitions, modem, and everything that the firmware and the drivers need. And the vendor script starts, the system GSI script starts, and we get all the drivers loaded, all the firmware loaded, and a bunch of interfaces start from Android. Now, then we have the usual file system of Debian. There's the user interface. There's like, file feedback, the end-to-rest. So from the Android services, we have hardware composer, which we use for compositing to the screen. We have audio flinger. Well, not exactly audio flinger. It's Droid media, but ignore that. We have Droid media for audio and camera. We have the radio interface layer for us name states radio. And a bunch of other services, lip, perf, manager for power, NXP, NFC, et cetera. So all the communication that we do from the Linux side of things to the Android side of things is done through Google's binder pipeline, or the binder IPC. And we'll explain how we actually use the binder IPC, how we actually communicate with it directly to the interfaces. So from the Linux services, everything looks familiar kind of. There's Fosh, obviously. There's feedback for feedback. There's Ophono, kind of ancient. And because nothing in the modern Linux stack can actually talk to Ophono, we have Ophono 2MM, which kind of exposes modern manager interfaces as a drop in replacement through Ophono. It's kind of a hack, but we don't talk about that. Yeah. So we have Joid.DNFPD. It's a fork of Sailfish community FPD, which is used for fingerprints. We have Call Audio.D as usual for Call Audio. Again, we have custom backends because Android. And Pulse Audio, again, ancient, but Android. And a bunch of other services. NFC and GeoClue, again, needs its own backend. But we're going to talk about these later. So most of the components that we have are not directly used by the user. So for camera, which is for Joid media, it's abstracted. And users just see the Joid.DNFPD camera app. For modem via Ophono, but users just see kind of a modem manager sort of imposter. For fingerprints, this part is completely customized for Joid.DNFPD. We just forked the settings. I haven't had it, everything. For battery management, Batman, very funny name. That does the work for battery management. I started that project as a shell script. It was a mistake. So Batman does a bunch of funny stuff, turns off CPU cores, sets governors, sets power save, whatever. It doesn't watch nonsense. And then we have Fosh, which is the user interface. Again, we maintain our own forcofosh because sometimes stuff happens, stuff breaks. We kind of have to maintain our own. We have bad experiences. We don't talk about those ones either. We don't say that in public. Joid.DNFPD needs to have a good image. Then we have the encryption service. Again, a custom tab and settings which uses Lux and LVM2. And the unlocker, which was, I think, initially developed for post market OS. We added a mini UI back end through LVGL. Again, custom back ends Android. I mean, it's the usual. So now how does everything actually go together? So as we mentioned, we have a bunch of custom back ends. We have a bunch of custom plugins. We have the Qt5 camera plugin from the days of, I think, Canonical, which developed it. There's the Ofona Binder plugin, which was developed by Joela, nice of them. There's a bunch of Pulse Audio modules that allow us to talk to the audio hell, like Droid Media itself, not exactly audio. And get audio through the hardware working, microphone, speakers, everything. We have GSTDroid. Again, talks to Droid Media to give us a nice and shiny Gstreamer pipeline that we can use for camera. And well, that's pretty much it. For back ends, because not for everything, we can add plugins, not all different pieces of software accept plugins. So we kind of had to hard fork a bunch of stuff. Some of them are not that frequently updated. So that was good luck for us. But GeoClue is barely updated. So we just added the Hypers back end, slap it in, which just works. We have the W-Root's Harvest Composer back end. I don't even know who started that. I know a bunch of people are involved in that. It's a mess. We have the Color-UD back end, which routes a bunch of stuff through hard-coded values. What if it works? And the Feedback-T back end, which talks to the Android vibrator how through IDEL and HIDL and gets the job done. It's not beautiful, but it works. And for MinUI, as we mentioned for Unlocker, we added a MinUI back end to LVGL itself, so it can draw to the screen without GPU acceleration, of course. Who needs GPU acceleration in the RAM disk? Anyways. So for Woot Animation, I think all this was used by Muff. We also have a MinUI back end for PlayMuff. I think it started life as the MoUI back end from JoLa. I don't remember. So to actually talk to the Android services, there's two main pieces that are doing the job for us. One's Lepipus and one's Gbinder. They allow us to craft. I mean, the Pibus has a bunch of compatibility layers and Gbinder that gives us a way to craft transactions and send it to the Android interfaces. And the whole system, how the whole thing works, pretty much ends there. Stuff's maybe hacky at times, I'm going to admit. But it works because we use pre-built vendor services and a bunch of stuff that was provided by the vendor itself. Stuff works for now. Maybe futures too. I'm joking. Like, stuff actually does work. So what is next for JoDian? Because the services work and the system itself starts up, everything works for the most part. But in reality, one of the main issues of the whole Linux ecosystem is app support. You don't have apps, must be honest. And no one wants to develop any either. No other big companies do. So I guess start integrating Beidjoy better into the system. Getting like zero startup time on Beidjoy, maybe developing something that replaces Beidjoy, again a drop in replacement. And clean up all the garbage that we added. We have a lot of garbage. So it's not pretty. We definitely have to go through everything. At least I do. I'm not a good programmer. We have to go refactor a lot of code, clean up a lot of code, see what we have to do. And possibly actually add some new features. So some of the actual features that I had in mind that I have been working on was wireless displays, which has to go through a pipe wire of using old version of pulse audio. So it's kind of tough. So I don't want to do a drop in the basement of pipe wire. I'm kind of tired of hacks. So we kind of have to fix up pulse audio to actually get pipe wire working. Then we can get pulse audio working because there's like an XTG portal for it. So that's one of the stuff in my to-do list that I actually have some work put in. Face unlock was something that I've been working on for the past two months. We can get face detection working through G-streamer. And G-streamer will actually move as you move your face along. I'm going to admit it's like 3 FPS. But it does detect. And the rest of the work can be done with OpenCV because not all Android devices do have the sensor to do it in hardware. So that has been in my to-do list. I've been working on it. Maybe we can help out other open source projects if they like face unlock, maybe. And two other very annoying features that are kind of deal breakers for others is once MMS. MMS, we don't have MMS. I tried many times. I couldn't get it working. MMS is very important. RCA is more important. But MMS also, at least in Canada and the US where I live, Android users are always using MMS to talk to the iOS guys. So MMS is very important. Dual SIM is very important as a deal breaker for many. And we have to work on dual SIM. That is a very big priority for me also. We've seen many users who actually looked at Android and they were like, oh, yeah, this is great. But you guys don't have dual SIM. So I'm out of here. That's not exactly the nicest. And besides all that, we still do have to work on app support for Linux and the ecosystem. With LibitVita and GTK4 becoming very mature and things working out, I have been at the very least working on porting all the old GTK3 applications that I've been using to GTK4 and LibitVita. Not exactly joy-dien specific, but it will benefit everyone. So that's something. A lot of applications are very slow. Settings app, as we all know, is very slow for the GNOME settings app. Much of the stuff is not threaded. Everything is running in a single thread. It's just horrible. A lot of code we have, I mean, well, I do have, that will soon possibly become a PR for many different projects, making many things threaded. We at joy-dien have a big PR to optimize GTK4. Speeding everything up, we've had a user who was working on a Blackberry, and he was seeing 70%, 80% performance improvement on his on GTK4. Because apparently there's a lot of issues in GTK4. Who could have thought? And the very last issue is that we don't, as the joy-dien people, we don't allow community devices in our build system. So if one of us, Core Devs, has a device, it can be made an official device. So like, be added to the build system, get stable builds and nightly builds. But we kind of don't have that for other people putting devices. So you should probably look into having a way to allow community people to port their phones and have them in our build system. I know many community porters have worked on devices, and they saw that, oh, they couldn't add it. So they just gave up. And the most important thing, documentation. And that's something I have to do, because none of the code I wrote has documentation. We have to do a lot of documentation. We don't like, at least the stuff that I worked on basically has nothing. I just worked on it. I slapped it on. I was like, yeah, it works, whatever. That one has to be worked on a lot. And that is at least my to-do list for now. No. Don't go down. Don't go down. Don't go down. Don't go down. OK. OK. OK. So if you want to contribute to Joyden via our device page, via our website, via our telegram channel, which also sync to our matrix, I think you can also find the matrix group for Joyden project. I don't use matrix much. But apparently, if you have a group that has a bunch of channels in it, I don't know. So you can find us there as well. And one kind of announcement that I have is we have been working towards getting phones with Joyden as the first pre-installed on phones. What a weird sentence. We have been working with an ODM to get Joyden phones, or so-called phones, with a Joyden-based system installed on them. And have that be sold to have kind of as a way that 0.64 does it. But it's like, yeah, we as Joyden developers are doing it. So we understand the system and we understand the hardware. So it's going to be much easier to develop on, because we also understand the system itself. So you might want to look out for that. Few relapses, not very labs. Few relapses, please. And possibly the bigger news of this sort of project of getting Joyden-based phones will be coming out in a few months. But you can be on the lookout for it. We have a website at the moment, kind of not exactly the best. Still being worked on. We have a survey asking users, if they wanted to have a phone with a Joyden-based system, what would they want? What specs would they want? What would they want the devs to be focusing on, et cetera? So you can expect a Linux-based phone sold on the market in a few months. Thank you. Thank you very much for the great talk. I know we have a lot of questions in the matrix, so I'm going to pass it on. So the highest upvoted question right now is, do you have any plans to switching to motor manager from Ophono? OK. So I have looked into this. I'm going to be 100% honest with you. I have looked into this. I am by no means a professional. And when I tried getting this working, I could never get a motor manager kind of back end to register a command over the binder IPC, the G-binder. Again, I am by no means a professional. And this is probably doable. And it will be a huge step forward, which will make the whole modem stack a lot better. It doesn't have to go through this, this, and this, and this, and this, a thousand things, then user sees some and gets it working. So yes, it will be great. I spent some time. I couldn't get it working. But it is in my to-do. One question. You mentioned that you implemented a WL roots back end for, I guess, to get fresh running. Is there any plans? For example, I currently use postmarker S on my phones. This is actually running in mainline kernel. So I guess it's a little bit of a different situation. But for example, different other Linux mobile UIs, like Nome Shell, just the Nome Shell branch for mobile, stuff like Plasma Mobile, SXMO. Is there a project to get those running on Droidian as well? Or is the only focus at this point? So at the moment, I actually understand the question. And we have a lot of questions like this, like getting different UIs running. So each UI that uses an underlying graphics library needs its own back end, obviously, because we have to use Harper Composer. And I know that there's like VeyFire that uses W roots. So that one works fine. There's a bunch of other W roots that works fine. But as an example, Plasma uses Kavein. There was what they used to be a Kavein back end for Harper Composer. And it's pretty old, or it's really old. And someone has to revive that to get it running. I currently don't have the time. I have a full-time job, and I'm a student. I'm kind of already under a lot of pressure. So for GNOME, which uses mutter, well, that's a beast by itself. Because Kavein and W roots are modular, somewhat. But mutter is the opposite. So the code for the RM back end, or frame by frame, or whatever, everything is baked in so hard that it's a very tough task actually adding a new back end. And let alone maintaining it. Because no one's going to accept any of our back ends upstream. Because no one can test it other than us. So if someone spends a time sure, but for GNOME shell with mutter, I really doubt it. Because it mutter itself. I might piss a lot of GNOME people off. I use GNOME myself. Mutter is a mess, at least when I looked at it six months ago. Thank you. How does Droidian support standard Debian, like Bookwam, Bozi, Deb files for RM64 targets? Well, yeah. You can run the packages. Right now, Droidian is based on Debian Trixie, the testing branch. We also have a branch for stable. Well, we have a snapshot for stable that you can use. It doesn't have many of the new features, that is based on the Bookwam. But any repository you add, any dead repository you add, if the packages are built for RM64 or the architecture is marked as all, like Python packages and stuff, everything will work. Flatpacks work, Snap packages work. If app images built for RM64, app images work, it's just like a computer. Thank you. Thanks. Maybe another? Yeah. OK, you. And then another question from Matrix. Thanks. Just a quick question about the strategy, because you mentioned that all these hacks you've built around to get it working. So my initial understanding was that you built Droidian to foster the development of these apps for Fosh, for instance. But now you're trying to also have a phone delivered with it. So does this really make sense to have a device running these, let's say, many hacks from the start? Well, yeah, that's a very good question. Well, we're trying to eliminate every single thing that we think is like a big hack. But it really depends on what you consider as a hack. Is libhybers a hack to you? Then the whole system is built on nothing. But to my eyes, I kind of have a different look to it. And in my opinion, we can slowly get rid of most of the hacks. Again, we have custom backends? Fair enough. But I don't see there as a hack. But in my opinion, a lot of those can be cleaned and can be made ready to be shipped on a phone sold to customers. So it's not that far gone that I would consider a waste of time. I would consider working on it a waste of time. I still think that it is very doable getting it done. Give a big round of applause again. Please, thank you.
Genode on the PinePhone on track to real-world usability
All right. Next up we have G-Note on the platform on the track to real-world usability. Have a big round of applause for Norman. Thank you very much for the chance to be the second time at this developer room. I was here one year ago introducing G-Note based phone, and now I will give you an update what happened in the meanwhile. So I had very little preparation here, but I wanted to show some demo source. If something breaks, please feel with me. I hope it will run smoothly. So first, to give you some background. Microphone is a bit. I will try. So the background of what we are doing to just recap. Like in 2003, my best friend Christian and me, we had a dream about a truly trustworthy operating system. We are somehow new from academia, certain puzzle pieces that should lead us there, but those puzzle pieces, each of those seem to belong to a different puzzle each. So that was quite difficult to align them. Can you hear me still? Okay. So it took us a few years to bring them into alignment to build the first prototype back then. Once we saw how all this could work, we were quite motivated to bring it to the real world and bootstrapped a company in Dresden by ourselves doing contract work. And with the idea to do our licenses technology. Fast forward ten years, we kept on working on this. And during this time, we had grown a small team of ten people. And at this time, we were able to move our work towards our genoed on our laptops. This was a bit milestone. And now a few years later, we did the first baby steps to also bring it on a mobile phone. So on the PC, it looks like this. So this is this good OS. That's an operating system built on top of the genoed system. And this is actually also running on this machine right now here. So that's basically used day to day on our developed machines. On the phone, what I presented one year ago, is this system running on the pine phone. The basic idea is that there is a part of the phone that has kind of fixed functionality like a feature phone you can think, or like a boot loader, something that is really fixed. And then there is a user defined part of the phone where the user can install software into and switch it in and out. And I will just give you a quick tour through the user interface. Let me just log into ARM. This type into my Linux VM over here. And let's see if you can get some kind of video running. Yeah, this one. So basically, here you see the phone UI. And the basic UI divides the phone into five categories over here. Is it doing something or why? Yeah. And the device category basically gives you control over physical things, physical aspects like the brightness or volume, or control over the microphone like a kill switch for the microphone, or some power settings like how you want to operate the phone. And you can see here when I modify the brightness, it has immediate effect. Then there is a second section that is all related to telephony. And you see here also the user has complete control over even lower level aspects like the powering of the phone. So here the power lines to this phone is really controlled by the user. So now, for example, when you switch it on, the modem is booted. And now we interact with the SIM card and the user can type in some PIN to get access to the network. And now we can receive calls or initiate calls. And we can also initiate a mobile data connection, which I'm going to do now. So basically, switching on the option to use your mobile data. So you can see that's also the option to use Wi-Fi. And now you see the three dots over here. They are basically telling that there is a currently setting up the connection. And once this is done, we see an IP address appearing here. And this means we have a data connectivity. And with this data connectivity, we can actually now do interesting things like install photo software. And the image when you just install it comes with a few example systems, I would say. So these are basically systems for the other side of the device. You can switch these two sides using this gesture here on the left side. And this is, for example, a very minimalistic example of an interactive application as a subsystem running on this user defined side of the device. And there are a few other examples, like, for example, this small oscilloscope that just shows microphone data. And you can see, basically, when switching to the other side, nothing is really visible there. That's because the microphone is still not enabled. So the user must enable the microphone first. And then you can see that the application, the user defined side, can observe the audio data from the mic. There are a bunch of other examples. I think the most interesting one is a web browser that we ported to the system. So this is based on the Chromium Engine and the Morph browser specifically. So in order to bring this to our OS, we also had to port, for example, the Uwuntu Touch UI toolkit, or nowadays called Lumiri UI toolkit, also enable the GPU and things like that. And you see here that the browser is running. It's not super smooth, but you have to keep in mind, it's a time phone that we are running on. But it's actually usable, and you can browse those websites and use these kind of modern JavaScript-based sites also. I think visiting GitHub also is possible. Yeah, this was basically what I could show you last year. So just to get to the set of this point, and now, okay, you see here, the number of controls that you may know from the Uwuntu Touch UI project. Okay, so what do I want to cover this time? So shortly after my talk here at FOSSTEM, we published a first image for the community to try out and to get user feedback. And then once you get user feedback, of course the problem you have to incorporate a somehow, you have to do something with this. And then you want to give the user some new version, the user needs to install it. And so how can this interplay work to be enjoyable for both sides? Then I want to also talk a bit about our first wishes by the users, and then going forward to speak about how to bring software on the device. So first when speaking about user feedback, you have this loop where the developer installs it. The system originally on the SD card following the instructions from the website gives feedback to the developer, improves the image, publishes it, and then the user installs into a new version and gives feedback. And you have this loop. And now the question is how fast can this loop happen and how frictionless can this be? Friction comes in at these two places. So for example, when the user wants to install a new version, the question is can the user trust the new version? It downloading something from the internet. How much work is it to install a new version? So if this is like a real big operating system that you have to upgrade, it's really a lot of effort and also a risk. What happens when you have a regression and you want to roll back to the previous version, for example? And on the developer side, it comes down to basically labor. So the developer has to put thoughts and work into improving the images and then also building it, publishing it, hosting the images for the users. So these are the costs at the developer side. That's just how it disturbs you somehow. This kind of rigging is okay. So we tried to look at this cycle in a kind of holistic way. So you see here that the developer cycle can come down to about five to 20 seconds. So for the image that I will show you in a minute, the cycle for iterating over this UI to things was about five seconds to 20 seconds, depending on whether I could start this whole thing on my Linux system or on the Python via fastboot. And then the publishing of the new version takes about three minutes or I do a full release, 30 minutes. So this is all done from my laptop, so I don't need any special hardware for that. Out of this complete process comes a really small image on that source, about 16 megabytes. And the user can basically install this. It's signed by the developers, so there is some kind of integrity protection there. And the nicest thing is that the installation is very simple and very transparent. It's basically replacing just one file in the boot directory. And the user can instantly roll back to another version if some regression occurs. So let's now try to give this as a demo. So to do that, I first have to, let me see. I first have to start a USB webcam over here. Let me see if this works. Okay. Okay, here's the webcam. Okay, can you see this? Okay, okay. So I will switch down. So now there are a few risks involved because the update will run over the air. So I hope that I get some kind of connectivity over here. So now you have seen the boot of the image from SD card. So it's quite quick to come up, but it's also a small image. So let's try to connect to the Wi-Fi. Okay, I think I tested this first and do a stack. Let me see if I get an IP address. Ah, I got an IP. Okay. So now you can basically go to the software dialogue over here. There's this tab over there, update. And one concept is that we can basically select different software sources, which are basically URLs. So this is, for example, the software source of my company, JunoLabs. But I can also select other sources like my colleagues or this guy here, and Feske, that's me. And I can basically check what Nfeske has to offer. So now there are some metadata is downloaded. And I see, okay, there are different images offered by Nfeske myself. And I can basically get some information about these different images. This was the last real release, and there's a new image that's the FOSTA edition. So let's try to download this one. And, yeah, I have luck that the wireless connection works well. So you can see now the progress of downloading. I can actually see it. And you can see with these buttons, you can also download the other versions. So you can have any number of versions downloaded and also from different software sources and keep them on your system as well. So we are almost there. Okay, so now the integrity is checked using openPGB signatures. So everything went smooth. And now I can install this image to my system, which is basically copying one file to the boot directory. And it says, okay, reboot to activate. So let's do this. So I go to the device section and I say, okay, reboot. And I have to confirm it. And now I'm doing a hardware reset and pressing fingers, crossing fingers. Sometimes the boot load, ah, now it's actually working. And now for anyone of you who also grew up in the 80s using Atari 800 computers, you may recognize the fonts and the color scheme. So these are inspired by my childhood. But what you see here is really a custom image. So I hacked this together in the last week. We had a lot of fun with these kind of graphics. It's basically the same functionality as the regular script image. But you can now see that the appearance has changed completely. So it's a completely different image. And using the update feature, I could now also go to another place and switch, for example, install another version and switch back to the earlier version. Okay. Let's continue. Okay. So the first response we got from the community of users was a question about the power. So the pine phone is quite well known for not being very long lasting when it comes to the battery. So people found it quite unacceptable that we left the screen on all the time. So they asked, how about implementing some kind of a screen saver to save the energy. So that was the first thing that we considered. And I will just give you a brief tour of how this normally works. So when you speak about power management on the pine phone, you have on the bottom of this this power management chip, which is in control over the actually voltages, power ramps, battery, the power button is attached there. So these are the lowest level kind of electric concerns. And once the pine phone switches, it's switched on, and the communication processor starts up together with a kind of companionship, which is the system control processor. And these are completely separate. So this is an ARM processor. This is a small microcontroller. It's based on RISC 1000 CPU core over here. And the first thing that happens at boot is that the ARM trusted firmware is started, and this loads the firmware into the system control processor. So this is also an open source of the nowadays, which is pretty cool. And this firmware is basically meant to interact with this power management chip over here. And this can also run when the application processor gets switched off if you want to save power. And then the Linux kernel is started, and you have these bunch of drivers. One driver talks over these kind of devices, mailbox device and chat memory through the device to the firmware. You can give commands, for example, for suspend or resume. And then you have drivers for the display, for the touch input and so on, all as kernel drivers. And then on top of that, you have the user space that uses these kernel services, like the input driver services, like kernel mode settings, things like that. And on top of that, you have the applications. That's the traditional architecture that you may know. And with G-Node, we can do a bit more flexible, so the picture looks like this. What's the same is that we have the startup, we have this ARM trusted firmware, but this time it loads a custom firmware, which is basically a small force interpreter. You have to know that the execution environment over here is just about 16K, so it's really small. So we put a small force interpreter here, but left it basically like a hull. It has no predefined functionality, it's just an open-ended force interpreter. And then the system be woods a small microkernel, and on top of that, now things get upside down. Because here we have the GUI server directly running on top of the microkernel with no dependencies underneath. So it can run without any driver running. And the drivers, they come in later, they connect to the GUI server as a client. So now we have put this upside down. And so you have the applications that talk to the GUI server using the SCUI interface over here. You have a display server that talks to the CAPTCHA service, which is the same service as you would use for screen shots and things like that. And the input driver, the touch screen driver, talks to the event service for injecting input events. Then there is this platform driver over here, and this guy has the job to arbitrate the access to the physical device resources, like interrupts, memory mapped I.O. and things like that. And so then, for example, this display driver comes up to the platform driver, asking for a platform session. The platform driver turns on the right power, watches and the right clocks, and the driver can do its work. And then you have this power driver here. This uses this interface over here and can send force commands to the force interpreter and can basically extend from there at runtime, which is quite flexible. So when we started this system, initially it grew about two and a half watts, which is quite a lot. And now, when it's going to sleep, five minutes, oh, okay, I have to hurry up. This is basically different. You see this difference? We just removed two components and that's it. And the power draw goes up down to just one watt, thereby also tweaking some voltages. Okay, live sleeping demo. I don't know if this I should really show because of the time constraints, but let's do this one quickly. So now it's sleeping, you see. So I will also have to connect the console here, the Pico.com. Oh, no. Okay, I will skip this small demo. I wanted to show you how the drivers come up, but it's probably, even time constraints, I will just skip this. It's a bit sad, but yeah. Okay, here we were. Okay, the last point I wanted to talk about is the question of extending the system. So we identified this whole bunch of work items that the developer typically has in front of him. And we also touched this at some parts of the previous talks, like the flat out talk was quite interesting in this respect. You have a bunch of different toolings and different build systems and so on to consider and all these different steps. And this is quite complicated. So and when targeting a system, the developer is confronted with all of this. And so we came up with a tool that's called Goa. It's called Goa because it's gold, but reached a little bit sooner. To basically assist the developer with these steps. And I will just show you an example how this Goa tool can be used. Using my Linux VM over here, I go to my Goa playground directory where I'm just playing around. And so I wanted to, for example, port the Atari 800 emulator to this phone. So I can do a Goa run and this basically runs this emulator here now on Linux. And I can basically see when I do a PS that these genot components that you see in the background with this Atari basic running here. They are actually Linux processes over here. So this is basically the Linux version of genot running on inside the VM here. And there is a nice demo I wanted to show you for the Atari 8 bit. So I can basically make a modification here in this runtime file. So I will, for example, at this argument here to the emulator, do a Goa run again. And you see here that this is now running a small graphics demo here for that's quite famous on the Atari 8 bit. And you see now the cycle is really fast. Okay, correct. Then when I want to try out this on a real machine, I can say I want to target this script system over here. So my host system, I start the test environment over here. And I can say I want to target script and I want to give the information where this script server, where this should be run. Okay, thank you. Still not going. Time's up. Time's up. Okay, so I will stop over here and invite you to catch up with my colleague Johannes who will give a talk at the microcunnel death room later at 6.30. And I will be there as well. So if you want to get in touch and see where we should go. Thank you.
Wayland's input-method is broken and it's my fault
Okay, please welcome Dorota, who will talk to us about Wayland's input method. Thank you. Yeah, this is really embarrassing. You have already seen my mistake when I was trying to set it up. Well, yeah, I have another confession to make. Yes, it is completely my fault that Wayland input method is broken. But I hope that after I explain it, you will judge me fairly. And me is Dorota. So, if you recognize this device that I try to use, it's Deliberant 5. And this is the reason why I'm even involved in this. And a couple of years ago, when the project was starting, I got hired to work on the input method to make sure that people can actually enter text on the phone. And there was a phone that you can't really text people with. It's not quite great of a phone. And we have already a solution. As you can see, this phone does not have a keyboard. So, this issue is not a new issue. So, people have already thought about it. The input method is a way to make sure that text gets entered on the computer without dependency on a keyboard. So, you can see some examples here. In non-Latin scripts, they use a keyboard but not in the direct way. You press the buttons, but something else comes out. Text recognition, like handwriting recognition, doesn't need a keyboard at all. And finally, this is what is useful for my case, making sure that you can type on a phone, can use an on-screen keyboard. But how do you plug it into the actual computer? You can use Wayland for it. Wayland is a graphical system, not quite. Wayland is actually a set of protocols. A set of protocols for connecting the user to applications. Applications, and user is represented by input and output devices, because that's really how the user looks like from the computer's perspective. And considering that I wanted an input method, it kind of fits to replace some of the keyboard's typical competences. So, maybe that's the way to do it. And it seems that people already agreed with me on that. There were existing protocols in Wayland for inputting text. And you might think that this is where I messed up. I tried to implement them wrong. In reality, it's much worse than that. Those protocols were not adequate. They were basically prototypes. So, I took it to myself to make new versions of them. And this is where I really messed up. I hope you forgive me. Let everyone else judge me after you see the presentation. It gets worse. Yeah, SqeakBor is an implementation that I used for it. I made some mistakes. One of the mistakes was sticking to keyboard roots. So, there are some operations that you want to do as a user that you expect from your mobile phone to be able to do that are not inputting text, but you live in kind of the same area. Like going to the next text field or submitting a form. You have this big enter button on your keyboard, right? Or like navigating text, moving the cursor left and right. This is not inputting text. This is something else. What is it? So, I thought, yeah, I have to do it. And you can see this is a screenshot from keyboard. And the key is actually there. It does something because I decided, well, we already have a keyboard emulation thing for other reasons. So, maybe we can just reuse it and emulate the keyboard when it's not about text. And that was at that end. But why? When you are using an input method, you are concerned with text. You want to submit text to other applications, so you submit text directly. But when it comes to keyboards, you don't design a keyboard protocol about text. You design a keyboard protocol around key events. Press key, release key. This is what the application gets. It's not so straightforward, but actually it's even worse. Instead of submitting buttons, you submit numbers. There is no preset relationship between numbers and letters. So, you have to decode them. And because you have many different keyboard kinds or layouts, see, this is a very tiny keyboard. It doesn't look anything like an overall one. This keyboard can input two different kinds of characters depending on which layout it is. This gets complicated because those layouts called key maps, they can be different. And despite that, in Wayland, even if you have no matter how many keyboards you have connected to your actual computer, virtual keyboards, like emulated keyboards or not, they all get presented as a single one to the application, which leads to problems. Because how is the application going to determine which keyboard you actually used and what character did you actually want when it only sees the number? Okay, you can send the key map, the table of relationships at every key press or whenever the user switches. If you press the smashed both keyboards at the same time, it's, oh, God, that's getting complicated. How do you disentangle all those modifiers, all the key presses? Basically, the Wayland protocols system is meant to care about corner cases. It's meant to work for everyone and not just for my cases on the phone. So even if you might never type from two keyboards on the phone, someone else might be interested in doing that. So basically, Wayland maintainers told me that this is not going to work. We are never going to accept it in this way. So yeah, what do I do? I want my software to gain wide adoption, but a protocol which doesn't work for intended use case without the supporting protocol, the supporting protocol being the keyboard emulation, without keyboard emulation, my input method was kind of useless for typing on the phone. So, well, I ran out into it without knowing what I was getting into. And this is my mistake. I hope you will forgive me. And this is not the only mistake I made. Oh, wait, maybe you can forgive me knowing that I actually know the solution. This is one of those cases where I know the solution. I think that what will work is a new actions protocol which combines all those things that are not relevant to the text input but are still useful. So maybe we can do that. I actually started some work on it. I'll talk later about it, but this is not the only mistake I made. I made so many mistakes. So, digitalization is a difficult problem. So, I have a question to you. If an application that I want to type into, if it has a small lag and I keep typing, what should happen? If I type two separate words, what should appear in the text input afterwards after the application comes out from the lag? Come on, come on. Yeah, you should get two separate words. In my protocols, you cannot have it this way. This is a bad thing because when the application starts lagging, you drop events and it can lead to stupid results. But the reason synchronization is this way is I wanted to make sure that the way we even have synchronization, the input method must be aware of what is inside the text field for the trivial reason of having completion suggestions. And I designed the protocol in such a way that every time the input method sends an event asking the application to put something in, it also sends the identifier of the state that it thinks the application is in. So, it always says, please do this considering that you are in that state. And when it sends two events, one after another, it only knows about the first state. So, by the time the second event is sent, the state has been changed by the first event. So, the second event is going to get ignored because I wrote this in the protocol. Please ignore this event. Yeah, so this is awkward. But I did it for a reason. So, the reason is kind of complicated and many people would say it's theoretical. Why do you bother? Maybe just accept it because this is what's going to happen. But I thought that maybe there are some reasons to let the application change its state from another source. Like someone who doesn't like me would say, oh yeah, you're going to type on your on-screen keyboard and on the real keyboard at the same time and modify the same text. Well, that's stupid. But I was thinking actually before this presentation, I have really weak examples here. This is my example that maybe the application can do auto-completion in a smart way, in a better way that the input method would not be able to. So, the input method should not get in the way if it conflicts. But there is a better way. If there are multiple people editing the same text document, I'm sure you have seen a system state that, then the text field is going to change a state independently of the input method. So, there is a need to track the state accurately and maybe drop events. I don't know. That's the problem. It's actually quite hard and I have no solution to that. So, yeah, maybe knowing that it's a difficult problem will help you forgive me for that mistake. Right. But none of this is going to get fixed unless someone picks it up. If you can see, my last comment was four years ago and the last time it was actually accepted was two years ago. That means that after I was moved from this project to another, the development will slow down a lot. So, there is a need for someone to pick it up. And if you want to pick it up, you are free to contact me and we can chat about those problems and maybe some other problems because I'm sure those are not the only bugs that I left there. So, you can catch me on my blog and on MasterDone as well as Matrix. And if you don't mind mistakes once in a while, I am also looking for work on free software. Thank you very much. Thank you. Thank you. Do we have questions? Yeah. Hi. Has there been any interest by any companies or industry about financing this kind of work? And have you heard from Qt in any way because they drive their own protocols? Right. I know they are forced by some of their customers to a little bit look into upstream input method protocols. But I think they just looked a little bit and then decided, nah, not interested. We have our own stuff. Oh, I didn't know about that. We need to talk more about it. But I have applied to... I'm a little bit angry about the situation. Understandable. Yeah, we have worked with Roman on this problem before. I have talked to Gnom and I have talked to NLNet about this a little bit. So... You talked also to NLNet. Yeah. I still haven't gotten proper answers, but we'll see where it leads in the next months. Hi, Lurota. Hello. So I just wanted to comment quickly that it's not only your fault. I am one of several Wayland developers in this room who conspired to attend, not only because we like you and thought the talk was interesting, but because we also felt blame for what happened and saw the title of your talk and thought we ought to be there to accept some of the blame. So you did good. You did great. Thank you. Someone already forgave me. More questions? Maybe you can talk with Alfa as well. Maybe I can do some connections. Yeah. Any more questions? Hello. So input method is clearly difficult, right? There are so many different layers and there are so many different considerations that need to go into that before you can create a sort of standard protocol that everybody can kind of agree on, especially on some mobile devices where we might not all agree on or completely understand everything that goes into that. Having seen that firsthand, how do you think, what's the correct approach for not just input method, but any other, because Wayland is trying to do a lot of different protocols like this, how should we try to approach developing these sorts of protocols from scratch? Because it's a big task. There's lots of different users, right? Like some people try committees and we have problems with committees sometimes. Some people try to just do it on their own and then that can cause problems. So the reason that this succeeded even in a limited way was that I had a company which was really invested in a product and they wanted to roll it out. So basically you need some person who cares and that was me for some time. And the other way I found like experimenting with Wayland protocols is kind of difficult. There is WR routes but I found that even though I attempt to do my best and there are viewers who review my contributions or also try to do their best, there are still bugs months and months later. So I would think that maybe we need to think about the easiest way to prototype protocols. I don't know. That's on one side and yeah. There was for the specific case of Wayland input method protocol, there is also competition that does not go through Wayland. So there's iBus for example and I think KDE or GNOME are also kind of investing in iBus a little bit instead of this. I can't blame them. Maybe it's easier to skip Wayland for some protocols. Of course you lose the security guarantees that Wayland provides but sometimes for prototyping at least it might be a good option but is it really useful if your target is Wayland? I'm not sure. Sorry. Given that a Surface Flinger and Android's input system are part of AOSP, have you looked at that to see how much that differs or does that model kind of, is that too different from what Wayland does to be inspired by it or to see what approach they take? Yeah. Similar to QTs in built IAM keyboard that talks directly to the application and can be bundled into it. Actually I don't know much about Android and Surface Flinger but I think one of the different mistakes that they made was that I did not look at other systems enough so maybe they already have solutions for some of that. Thanks a lot. Final round of applause for Dora Tante. Thank you.
Why not run OpenCL-accelerated LLM on your phone?
We are welcoming David who will talk to us about training OpenSeal Accelerators LLNs on your phones. Hello everyone, thank you for coming for the talk. I will start who I am. I'm David Heidelberg. I work on Mesa3D as a developer and currently I'm contracted by Colobora to work on CI. In free time I work on mobile embedded Linux and Alpine and Debian and PostmarketOS contributor. Let's go for the topic. So content, just quick introduction, what these things are about, how it looked before OpenSeal and these standards we have these days. What we have now, how it works, I will show you an example with running PostmarketOS with OpenSeal and what we can expect in the future. So first, who ever heard about OpenSeal here? Like most of you. So OpenSeal allows you to run your C or C++ code on CPU or GPU or specific DSP or FPGA or whatever you can find. This is one thing. Second thing, Mesa3D is part of GPU drivers. You have all the Linux systems or Linux computers and it also works on Android and Windows. I think Haiku or something exotic. And last thing which I use for my talk is TinyGrad framework which allow you to run stuff like GPT or stable diffusion or other interesting projects which you can run usually on GPU. So how it looked before the compute we know right now like QDA, ROKM, OpenCL. So you could do the computation or CPU but CPU has high overhead because it's meant to run classical computer programs and not highly parallelized software. Before you can use OpenGL and with OpenGL you could squash computations into OpenGL workloads but it was a big hack and work around and some scientists did it but not widely used between people. So currently we have options. We have already CPUs with multiple cores and multiple threads but the overhead is still here. We have GPUs which are much faster and much more easily parallelized but they still have some overhead and then we have smaller units like NPUs which you can find in new phones or new devices, new dev boards and these are optimized to run machine learning or AI workloads. But usually in the hardware you get like Linux phones you still don't have these accelerators in the place. For new phones and new devices you have them already. So probably all of who of you used OpenCL or any acceleration to run something? Okay, much less than the knowledge of OpenCL. So you can do all the stuff with this image processing and these days language models are most popular I would say. So this is motivation and what we have like in open source world. In general you can use multiple technologies but I will talk mainly about OpenCL because OpenCL gives you one thing which nothing else can. You can have CUDA or ROKM but these are usually vendor specific. So if you are going to write let's say software for your phone and then like you want something which will run everywhere and you cannot achieve that with CUDA at all because it's proprietary and closed. You can eventually do that with ROKM but it's only AMD for AMD cards. And so only alternative which remains is OpenCL. And in 2012 we had Clover implementation in Mesa and I was very excited about that and I was like wow now everything will be faster maybe it's associated with me running Gen2 back then and like wanting everything be compiled and much faster than on the BN or anything else. But for me it didn't work. I lost interest pretty quickly because to be honest with Clover nothing was working for me. So I gave up but in 2022 there came a rustical implementation to Mesa 3D and it supported latest OpenCL standard. It gained support for Intel and multiple drivers. You can see these supported drivers are Intel, AMD, Mali which is for example present on Pineform Pro and of course you can run on CPU or you can run over Vulkan with Zinc. There is work in progress for multiple drivers including Asahi, Qualcomm Adreno which is for example present in OnePlus 6 which is pretty popular Linux phone and which I also own and Vivante which is used in for example LibreM 5 and Raspberry Pi everyone knows I assume. So here is example how it looks like if you run a LLM model on OnePlus 6. It runs. It's not that slow. It's not that fast but on other hand it runs which is amazing. This is GPT2 as you can see from the answer it's not very clever but on other hand it runs and this is general purpose model. So currently I tried run some GPT3 based models some minimal ones but it's not there yet. But GPT2 works for me and so it means for every app you developing or you thinking about you can consider using some smaller models and already run it for example on the phones. And what is interesting is performance of course because if you run the model without GPU so you run on CPU you get load on 8 cores 100% to 4 week cores for strong cores get fully utilized but if you switch to open CL you get hardly 2 cores at 20% of usage and it's much faster as you can see. So and we talking about small phone. Here is the slide if anyone wants to try and you have post market OS on your phone so you can apply these simple steps to get in the place where it will be running. I guess you will not try right now but after you see the slides you can click through it and install the stuff. And in general where to open CL and compute heading. You can do a lot it's kind of widely supported already there is a lot of progress done on Rusty CL on Linux. So I believe in one year you can assume on every device you will be able to run some CL. So you can use it for your workloads for example what I heard just today before the talk that lib camera which is used for processing input from Linux cameras on phones considering using open CL workloads for processing. So it will get popular soon I hope so. And so what I'm thinking is the most interesting part you can start relying on it because for today applications like blender or GIMP are able to use open CL but it's not something you can count on. But that's hopefully change very soon. And another thing to talk about is what's going to happen next because open CL 3 is pretty good specification but if you compare to CUDA from NVIDIA it's really lacking. Best we have right now but it's not that amazing. It's amazing enough to be fast to provide a float which is good quality but it's not that great compete with CUDA. So I recommend for example David early talk about future of open CL and about some standardization which would fit all vendors so it wouldn't end up like one vendor trying to push his technology and this allow other vendors to contribute or use it. So something like open CL 4 we will see what happen. So this is eventually future is a little bit unclear but so far the open CL looks pretty good. Clover implementation and Mesa which was from 2012 is pretty dead because no one use it, no one maintains it. We dropped it from CI approximately half year ago so it's not counted it's just waiting for deletion. And last thing I want to point out is like even the low power device as a phone can provide some nice acceleration with open CL and it's really visible difference. So few credits Carol Herbst for bringing Rustical alive because that was very nice project and he was integrating Rust based software into Mesa which is C and C++ based. So that's pretty challenging task. Rob Clark for working of free Drano because are there are here any one plus six users? Okay right so you know the GPU works pretty well on these devices so this is lot of his work and Dimitri Baryshko. I hope I read the name right. Good for preparing to merge request for free Drano on Rustical and it's not merged yet but it's pretty close needs some polishing and for the tiny grad and GPT2 it works well enough and of course many others who contributed. So thank you for your attention and fire the questions. Thanks for the talk question regarding your comparison of the workload. Did you check the load on the GPU when you were running this model? So when you compare the eight cores versus GPU? I haven't checked the workload on GPU but I assume it was pretty high but yeah it's what I forgot to mention is it's not yet optimized and no one tried to profile it to give a good performance and tiny grad project is also meant mostly for the powerful AMD GPUs or let's say NVIDIA even Intel is not that popular. So it means these results which you've seen are still highly unoptimized software so it has probably lot of chances for improvements and better performance. Thank you. Hi thank you. Since newer for example Qualcomm SoC also have an actual NPU in the SoC are you aware if it's possible to run OpenCL on this and what could be missing and if anybody is working on this or something? Yes the new hardware has NPUs and for example one of former colleagues Tomoe Wissoso working on etnavif acceleration for NPUs so recently in Mesa was integrated part of the Teflon framework which allows interaction between etnavif which is vivante GPUs like in it's newer generation than this in LibreM 5 and you can run TensorFlow networks on NPU directly but so far only one vendor at vivante and one device is supported I think. How's the RAM usage? I think the model I used is like 500 megabytes so it's pretty nice like nothing serious the phone has 8 gigabytes so anyway for the RAM usage it would be more interesting with GPT-3 models but I think on the phones the language models are useful only if they are specialized and cut down to appropriate power which the phone offers because of course you cannot run full Lama GPT model which requires 8 gigabytes of RAM or VRAM on phone which has 8 gigabytes of RAM. Do we have any more questions? That's it then. Another big round of applause for David. Next talk is in 15 minutes.
The Journey to Ubuntu Touch 20.04 on PINE64
Hello everyone, thank you for coming and thank you for all the live streamers. A little bit about me, I'm a college student living in the US and I've been doing a lot of tech tinkering on open source stuff since I was little and there's been a lot of that experimentation in my house. Ubuntu has also been a very common operating system in our house just as much as Windows or Mac OS and so I have a particular affinity to it. On top of that the ASAHI Linux project that came out in 2022 sparked up an interest in me and reminded me that what my mobile devices were capable of running on their chips and so at the beginning I was running virtual machine Ubuntu images on my iOS devices but that wasn't native, that was virtual machines so I wanted a native Linux first device that was also affordable and accessible and that is where Pine 64 particularly stands out. And another important fact is that Orin actually means Pine so I've had a particular connection to them and affinity with them and dedication to their work. And so what makes Ubuntu Touch on Pine 64 different from most devices is split in two ways. One, Pine 64's devices are not like most Ubuntu Touch devices and that is that like many of the other talks earlier today have mentioned Ubuntu Touch runs on Hallyum kernels as opposed to mainline kernels which means that there's a lot of extra components that are thrown in the middle to do some abstraction to get a lot of the sensors and modem and such working. However on Pine 64 devices we don't have to use that, we have to use instead our own middleware often and also Ubuntu Touch is different than a lot of mobile Linux distributions because almost of those distributions allow you complete control over your operating system with a read write and file system and updates as they come. Ubuntu Touch does a read only file system to provide an immutability layer as well as over the air updates so updates happen in big chunks at once rather than individual packages as they come. So these pieces in particular have made adapting Pine 64 devices for Ubuntu Touch a challenge but a welcome one. So some background starts with the original 16.04 port came at a pivotal time for both Ubiports and Pine 64. For starters there was ongoing work to move to 18.04 from 16.04 although that work was later abandoned in favor of focusing on the jump to 20.04 as the project was focusing mainly on migrating away from legacy tools like Upstart when Canonical was developing the project and towards a system D based stack which the Ubiports team has done a great job with. They also announced around this time the renaming of Unity 8 to Lumiri which is still an ongoing process and involved not just the changing of a name in one place but in every single bit of code which has provided some incompatibilities as we will find out later on. The original PinePhone Community Edition came with Ubuntu Touch as well as the original Pine tab and when both of these were developed they were done primarily by one guy Dalton Durst who did a lot of work for not only these ports but also for the entirety of the Ubiports team and so he was handling a lot of internal infrastructure which meant that when the team was working on the eventual switch to 20.04 the Pine 64 port had to be pushed aside in favor of a lot of other stuff that Dalton was working on. And then another pivotal moment came in 2022 when first Dalton had left the development team to go work on other projects which left the PinePhone port completely abandoned at that point and Pine 64 also came out with the PinePhone Pro Explorer Edition which was around the time when I started getting interested in the device but notably the device didn't have an Ubuntu Touch port which means that I had to make that. And so my process with this port originally began with looking at some of the other builder scripts that were around. Notably there's one that is linked on the wiki called the DPA image builder that taught me a lot about how the structure of the images are compiled which allowed me to create this chart here and what's important about the PinePhone Pro is that the bootloader is separated onto a separate SPI chip rather than within the images themselves which meant I didn't have to pack those anymore which is a great benefit. We can also use particularly tow boot as our bootloader which allows us to dual boot using the volume keys or even switch into mass storage mode to flash directly to the device from any other machine. But as I quickly found out most of the fun was in the kernel and it didn't work immediately when I booted it because at the time the PinePhone Pro device tree files were not in the kernel yet and so I had to pull them from downstream. Particularly a lot of my kernel work has reflected Medjy's work and it was looking at his work that helped me figure out how to get those device trees in. Once I passed that process I had a booting and boot-to-image but this was not a distributable boot-to-image it was built manually and was heavy. So I had to switch to making a port for a boot-to-touch. It uses a very similar process but slightly different rather than reboot strapping from scratch. We actually pull a CD image from Ubuntu server and then use a program called Devo's which can open a Docker or Podman container and build on top of that CD image to create our final distributable images. And last year at FOSSTEM I wasn't here but an early stage of my PinePhone Pro port was shown off at the FOSSTEM stand and this year I now have four devices, the PinePhone, the PinePhone Pro, the PineTab and the PineTab 2 all running on a much stabbler version of the port. So once I got the PinePhone Pro ported it was time to move on to the PinePhone which was still stuck behind on 1604 and I didn't have the PinePhone myself but I could do some research in the meantime and so I found out actually that there was no reason why I couldn't include both architectures for the devices inside of my kernel image which I also learned from Meji's stream and once I had a unified kernel I also found out that we could use tow boot on the PinePhone as well which once again split out that necessity of having to pack the bootloader into our images and I asked someone to try it out on their device and sure enough it worked which was wonderful which meant we had both the PinePhone and the PinePhone Pro up within just like two weeks of each other. Shortly after that the PineTab 2 pre-orders went live and at this point I was looking to make another port and the UB ports team actually reached out to me and said do you want us to send that to you so that you can make the port nice, happily obliged and they also sent me one of the original PinePhone to maintain at this time and then the PineTab 2's port was very similar to the other ones and I had most of the hang of it by this point but it was too early for a tow boot port to be out yet so we had to use the UBOOT binaries which meant I had to go back to learning how to pack that into the image properly but luckily besides the bootloader the rest of the process was essentially the same and then after we had the PineTab 2 port another community member reached out to me and said hey I see that you have these other three devices ported up and I've got an original PineTab sitting in my drawer not doing anything would you like me to send it to you so that you can create a port for that as well and once again I said of course and unfortunately tow boot doesn't work on the PineTab either because the run for how many PineTabs actually came out was quite limited so the main maintainer of tow boot never got his hands on the device to create that port so we used the PineTab 2's process again and just packed the bootloader back into the images and that had two congruent sides, a PinePhone set of images without the bootloader in it and then a PineTab set of images with the bootloader in it. Notably the PineTab and PineTab 2 do use different bootloaders because they have different architectures so there are individual images for each of those devices. I was also warned about using kernel versions greater than 6.1 on the PineTab because apparently it would cause a kernel panic and an infinite reboot. I found that this was partially true but it was a very easy problem to solve all I needed to do was move a module from internal to external which allows it to run after the DRM system that it was relying on to run and then it never has that kernel panic because it never starts before it's supposed to. As I stated previously though a ported device doesn't mean all of its features are working so there were a lot of software component hurdles that I had to get over to get to the state that we were in today. Two of the biggest ones have been rotation and modem both of which were due to the niche circumstances of trying to conform to Ubuntu touches, Hallym software stack. So in particular we have the split of what most Pine64 distributions use versus what Ubuntu touch uses for starters modem manager versus ofono which has also been mentioned in a few talks earlier. Modem manager generally has a lot better stability with the EG25 modem that the PinePhone and PinePhone Pro use but with several scripts we were able to get ofono in a similarly stable state. Another of those components was the difference between IO sensor proxy and sensor FW. Sailfish OS also uses sensor FW and we also use the ofono sailfish port but the thing is with sensor FW compared to sensor proxy is that you have to write your own configuration files for your devices and it also has to use a second adapter in order to properly read from the IO buses. And so you can see here on these charts that both ofono and modem manager can use EG25 manager which handles with the powering and a lot of the sending data between the modem and that was how we were able to get a much more stable modem version on 2004 than compared to 1604. And with the sensor files even after all of those patches were properly put in and all of our sensors were reading correctly rotation still wasn't working and this was maybe my biggest frustration for eight months. And then one day I decided to look in the log files and I noticed that the display was being enumerated as unknown rather than DSI which in some places it says that correctly but in other places it doesn't so sure enough once I had fixed that enumeration in all of the places where it properly had to be rotation was working. And the other big group of struggles was read only images and recovery images both of which use a special init ram FS script and so these two components help provide that those OTA images the read only images provide a level of immutability so that a user can wipe the system into a reset state and rather than having to re-flash the whole image and it also protects the system from too much destruction but there's also the recovery scripts which allow the device to switch into that updating modes that it can install those OTA updates as opposed to installing the updates for individual packages live like most Linux distributions do. So while the 20.04 pine 64 images currently release with image files most Ubuntu touch images ship their updates through tar balls which is where we are moving towards and the recovery image is what we need for that final component to get the tar balls working and recently we did succeed in getting those read only images working and now we can copy much more of the deploy style of many of the other Ubuntu touch images and then looking forward we have a lot of different types of images that we can use. We are moving towards 20.04 on the entirety of the distribution which will likely be around when these recovery and over the air images will also be available but this rebase is going to be a welcome one for us because most of the components that we back ported into 20.04 for the Pinephone Pro and PineTab 2 will be already upstream in 20.04 so we don't have to carry that in our repositories anymore. Outside of Ubuntu touch we are also working closely with the Lumiere team that is working outside of regular Ubuntu as well as on Debian and so we are hoping that some of the changes like the enumeration to those displays can help fix some of those issues on Debian with rotation for example and right now our ports is the closest thing that Lumiere has to stability on mainline but we are hoping to get that expanded to a more generic set of devices in the near future and that's about it. Thank you. We have some demos of the devices available at the Foss on Mobile stand in Building AW so feel free to check those out afterwards. Great, first question. You talked about the PineTab 2 versions of that, the Dev1 and the early Adopter one, is it fixed for both? Yes. Thank you. Thank you, very interesting. Having heard some of the talks today in this Dev room makes me feel like this is the early days of ARM system boards or even worse like the those days where every game had to ship 36 audio drivers. Do you envision a future where we have a sort of standard platform like UEFI on PC and ARM? I would hope so. I think that the ASAHI Linux project is certainly a push towards that and I'm hoping that other companies can follow suit. Hello. Great talk. Is it technically possible that the, you mentioned that the PinePhone images are the same image for the two different Pine phones? Would it be possible that there be non Pine phones in the same image if they didn't require bootloader or is there a specific reason why they only work on Pine devices? The only reason right now is the kernel. Otherwise we absolutely can boot those images that don't include the bootloader on plenty of other devices. How did you find out to put the, was it from internal to external, the kernel module? Was it that? I was looking in the device tree files and I noticed a mention of the display driver in there, but it looked like there were actually a duplication of those mentions. And so when I went and ticked off one of those modules from Y to M on the displays, it worked and that's all it needed. And then in the kernel logs it also said that that display driver was trying to start before DRM was available. A question from the matrix. I've heard this question before today, but yeah, the question is, any plans on migrating to Modem Manager? I saw that question earlier and I would also hope so, but I don't think that actually is viable right now because that would mean the whole, wouldn't you touch stack would have to move to Modem Manager and so we instead have to rely on what the rest of the distribution is using, which right now is Ophano. It's another question. According to the picture, recovery was dropped in the 2004 layout. Was recovery functionality integrated into boot in the DRMFS? So it wasn't dropped, it's just not available yet. It's still a work in progress. I do not necessarily have a question, but I have a quick addition to the person that asked about the standardized boot format, about the DOS games. I think it was that guy. People are moving towards U-boot and chain loading U-boot on other devices and making repartitioning possible. So in the end it would look the same as I and also the pine phone that you developed. So that was a quick addition. Thanks. A follow up question. You meant kernel options before compiling with Y and M or okay. Say it again. Did you mean kernel options Y and M? Yes, yes, in the DevConfig. Thanks. Could you name a single thing that would make the porting to another device easier? What was the hardest thing? What would make your life easier if you would have to port to a new device? If the boot loader was figured out for me, then it would make it really easy. Because as I mentioned with the pine phone and pine phone pro images, it's really just the kernel at that point. It's not hard to figure out what kernel modules you need to get a certain device to boot. Maybe one more generic question. What's the current status regarding the full disk encryption in UB ports? Say it again. The full disk encryption status in UB ports. I actually don't know that. Does anyone, Alfred? Yeah, passing on to Alfred. Yeah, thank you. So it's probably not going to be so first of all, there is no home encryption whatsoever right now. But unless manually set up with scripts, so in which case you can do that yourselves. We shouldn't provide any default, but we want to provide a default. And that's probably not going to be lux based encryption, but rather file, directly file based with X4 and F2FS based solutions. Because the Android devices, they have Android partitioning schemes, they have various differences where it makes no sense to do full disk encryption in that way that we used to from the desktop. And with it being on the user data, we can ensure that selective things inside of the user data are encrypted, like the home directory of the main user of the device. In which case we can unlock it with the same on-screen keyboard that the Lumiri desktop uses without having to basically add the on-screen keyboard to the inner-dramf s early up in the boot so that they don't look different, that they're using it like that they look cohesive, that they work with similar technologies so that it's one completely fitting thing that does it all for you. So in this case, full disk encryption probably not, but file based encryption or file system based encryption more likely. There have been experiments with that and they were successful. How did you feel when you first successfully booted up Ubuntu Touch on the pine phone? It was an awesome feeling, but as I mentioned, I have been tech-tinkering for a long time so it was also a very familiar feeling of, oh yeah, I got it working. Thank you.
Towards a bright future with Mobian?
Thank you all and thank you all for attending this talk. So yeah, I'll be talking about how we can improve our future as mobile Linux users, especially with Mobian, but this all applies to other similar projects such as Postmarka 2S and so on. So first question you might have is, who is this guy? So basically I'm working as Senior Software Engineer at Colabora. I'm dealing mostly with building and maintaining custom distributions for embedded systems, so kind of related to what I do with Mobian. I've been a long time Floss Introduce and I've been a DBN developer for a few years. And back in 2020, so just the last first damn before the pandemic, basically, I got my hands on a pine phone and this prompted me to work on that, work on mobile Linux in general and start and still continue working on the Mobian project. So what's actually Mobian? It's a DBN derivative or in the DBN jargon we call that a blend, which targets mobile devices such as smartphones and tablets. It has a separate package repository and provides ready to use disk images you can flash on a few devices. It's actually a very small overlay on top of the DBN and we only provide currently 25 source packages in our repository compared to the vastly greater number which is in the DBN, which means that technically of all the packages you have access from a Mobian device, actually more than 99.9% of that is pure DBN. And so we have a few packages with downstream patches which can't be upstream at the present time. Half of those are kernels, a few others are user space applications, which we're working on dropping those patches and trying to find upstream friendly solutions. We have also a few packages which are basically workarounds because the feature does not exist in the upstream world, not yet at least one of those being for example, Millipixels, which is the camera application for the Libram 5. Once the Libram 5 gets supported by either or both megapixels and Lib Camera, we can basically just drop this package and rely on upstream applications. And finally we have six Mobian specific packages which are to be rewrote to be included in the DBN itself so we can lower the impact of Mobian and the footprint of Mobian. So we hope that we can get below 10 packages by the end of next year. We'll see if we make it, but that's our end goal for now. So latest developments, what happened the past year? We had the first table release. We just did the whole quotes around stable. It's basically that we released Mobian Bookworm at the same time as the DBN Bookworm was released. So that's our stable release. It doesn't mean it's bug free. It just means that we don't do huge upgrades and only targeted fixes. So the system stays stable and keeps working as it works currently even after software updates. So it was released in June last year. We have a few supported devices out of the box which are several Linux first devices, the PinePhone, the PinePhone Pro, the Librem 5 also. We support a few Android-based devices thanks to the work of the community, especially on the SDM845 kernel support. So we support the OnePlus 660 and the Pocophone F1. And we also provide X86 images for desktop PCs or X86 tablets such as the Microsoft Surface Pro and Go. We provide a single desktop environment in this release which is Posh. And we provide also up to today's 6.1 kernels. So the 6.1 kernel is not the latest but the former one LTS branch, meaning it's supported up until 2026 if my memory is good. And we have a script in CI which is run daily and automatically rebases all the kernel packages we have on 6.1 on the latest point release. So basically when there's a security update, usually the day after or the same day, the kernel is up to date in the Bookworm Update's repo which is basically our staging repo for the stable release. There are however a few things we wanted to include in this release that couldn't make it. First one is universal images. The plan here would be to have a single kernel package for all supported devices. It's working quite well for SDM 845 devices because they share already a single kernel and the people working on those devices all put their patches into the same repository. But for pine 64 devices for example which is based on all winner A64, rack chip, different chips also. It turns out that making a single kernel package out of those proved to be trickier than we anticipated and so we basically dropped this effort at some point and focused on having just per device kernels, at least for this release. So we couldn't make universal images obviously. We didn't find the time also to improve the hardware support of upstream. We still carry lots of patches across for all the devices I mentioned. It must be a total of 800 to 1000 downstream patches in the kernels only. So that's quite a significant amount. We'd like to get them upstream but we all had dead jobs and for now every day is still 24 hours only. So we have to make choice. Also we wanted to switch to the latest LTS kernel which is now 6.6 and finally realized that we couldn't because we didn't have any time, any resources to spend on that. So that means that Bookworm is stuck forever on 6.1 which is not too bad because the life cycle of Bookworm will end in about a year and a half and until then this kernel will still receive security updates and bug fixes. So as long as Bookworm lives, the kernel lives along with it and we can get up to date and avoid security holes anyway. However, the next release which I'm about to talk is Trixie and is already on 6.6. So what about the recent developments? We try still to unify our disk images slowly. Instead of aiming for a single image for all devices, we're taking a step along this path and try to just ship one image per kernel. Until now we have one image for the PinePhone, one image for the PineTab, another one for the PinePhone Pro and the PineTab 2 and so on because some of those devices require hardly specific tweaks to be included with configuration strips, Udev rules and so on. And so we came to a point where actually most of these tweaks weren't needed anymore because upstream had picked up and had the necessary features for those devices. So we could envision having instead of having one image per device, having one image per kernel. And so we have our kernels per architecture basically, per sub architecture really. We have one for the old winner, A64 devices. We have one for the Rockchip-based devices which are the PinePhone Pro and the PineTab 2. Two different socks from Rockchip but still we can use the same tree and so on. It was already working well on the SDM845 devices but we took this step a few weeks ago and so it quite reduced the number of images we were doing. Regarding Qualcomm-based images we had until now one image for the SDM845 devices and another one for the SMS225 which is basically the PhanPhone 4 because we used to maintain different kernels for all of those. This is going to change and actually already changed recently because we pretty much imported all the patches we needed into a single kernel for all Qualcomm devices we support. There are not many of those which is why we are managing to do that but for now we have a single kernel which handles all the SDM845 devices, 1 plus 6 and so on, the PhanPhone 4 which has a different chip and also the PhanPhone 5 which has another different chip. And so we have a single image for all Qualcomm devices and we just use a simple config file at build time to generate the boot image for the device because although the root file system are identical the boot images are really device specific because they need to have the device tree appended and the specific RAM disk and so on. But other than this boot image generation everything is handled at runtime using JoyJuicer which fetches the binary firmware from the Android vendor partition because those devices ship with Android first and so the firmware are already present on the device. This makes things a bit easier for us because we don't have to care about the firmware license, we don't distribute it, it's at runtime fetched from data which is already available on the device. And there's also a small package with QCOM 4 new tools which basically just includes a few scripts and configuration for which are basically the same on all Qualcomm based devices we support. We're also adding in the process a simpler way to add new device support at least if it's Qualcomm based. The thing is until now we needed to have a kernel package in the Mobian repo and a few specific tricks in the image build process. We created a new target for these build scripts and build recipes basically which is QCOM 4WIP, it's kind of a dummy device but the thing is you can separately build or rather cross compile your downstream kernel using the bin depth package make target which is supported by the upstream Linux so you don't have anything specific to do there. It generates a Debian package which you can drop into the Mobian recipes folder, edit some config file, run the build script and it will provide you with a root FS image and a boot image tailored for your device. Then you can flash it using fastboot and hopefully celebrate that your device can run Mobian. This is almost never that easy but the thing is we're moving the complexity from knowing the internals of the build system to just debugging the kernel booting on your device. So there's nothing Mobian specific in that, it's just general debugging and we basically made it sure it was as simple as it could be from the Mobian side. And we also have a small first-dem-present in the sense that Mobian now provides, it's been a week since the first images were published but we now provide plasma mobile images. It actually started over a year ago and the goal was to from the start have everything in Debian itself rather than carry downstream packages in Mobian. And so Marco, one of the Mobian developers, worked on that for more than a year and he managed to get all the needed packages in Debian itself including the MetaplasmaMobile Meta package which you just have to install, apt install PlasmaMobileFool for example and it will put in all the packages you need and from there we could build our Mobian image. So that's basically what happened over the last year. Now what's next? We're taking, trying to take a step further towards universal image. So I've talked about the kernel issue, unifying all patches into a single kernel but actually there's all this little device specific tweaks I mentioned earlier which have to be handled and until now we have per device packages so that means one new package to have in the repo for each new device we want to support. This is an approach that doesn't scale at all. I mean it works fine if you manage 10 devices. If you aim for tens or maybe let's hope for hundreds of devices it's just too much work for a small team. So the idea here is to have a runtime service which will identify the device it runs on using the device tree compatible property for example or the ACPI, DMI, vendor, manufacturer and so on strings on x86. Select or generate the needed config files, put them into a runtime generated Debian package and install it on the device with the ability to place triggers on that so that when one specific config file is modified by another package this tweaks package is regenerated, rebuilt and updated as well. So that's something we hope to achieve this year as well as getting closer to Pure Blend. This is a specific class of Debian derivatives and it involves having all the packages into the Debian repository. So this is our next step once we have a working runtime tweaks management but basically this would mean having all our meta packages, tweaks packages and so on into Debian itself so we can just install everything Mobian from the Debian repository. Not all hardware features will work unless you use the Mobian provided kernels of course so Mobian will stay relevant for some time at least and we'll also be still able to generate ready to use images which will make things easier for users rather than having to build themselves from the Debian packages. Another big topic is also the call audio management. A few years back we created call audio which is a demon monitoring the phone call status and switching audio profiles and routing on the go depending on the situation. This was in a post-sodial world and back then post-sodial didn't really bother with such things the only automatic switching it did was when you plug the headphones and so on and we made sure that call audio did disable that on the post-sodial side. But now we are living in a pipe wire world and with pipe wire comes a session manager which by default is wire plumber and the session manager is meant to do just that switch audio profiles switch the routing to match the current situation. And so call audio raises with wire plumber most of the time it often loses so this means that you're having a phone call and actually you don't hear anything in the phone earpiece because wire plumber did the switching right after call audio instructed pipe wire to do so. So there's clearly a conflict there and the goal here is to make call audio basically a part of wire plumber itself. This needs some work in pipe wire to make it aware of the modem and to monitor the phone call stages but we hope to submit an initial RFC implementation at some point this year. No problem obviously. And finally we plan for a few other minor improvements. I mean most of the project development process and infrastructure is under documented as it is most often the case. So we have very user centric documentation written by users but we are very few developers and we didn't take the time to do so. So we'd like to improve that because basically a significant portion of the project has a best factor of one which is me basically. So I try to change that and make sure we have backup solutions and we get more welcoming to other contributors. And finally we'd like also to keep working on upstream device improvements. The Pinefone Pro has a few low hanging fruits. We can upstream probably easily. The support for the Pinefone 2 is being merged upstream as we speak. It now has a working Wi-Fi device. We'll have to look if it can be upstream as well. We hope to support also the Pinefone V or Pad 9.5 depending on how you see it which would be the first week 5 device supported in Mobian. And we also welcome obviously contributions to support more devices to help us with documentation and to basically help us make the mobile future brighter for all of us Linux mobile users. So here are a few links. I put the slides. Thank you very much. Questions? Hi. So I was profoundly disappointed to read your blog post in October about the Travals with the Pinefone kernel and the fact that essentially all of the work that had gone into the Pinefone kernel in Meggy's kernel tree was not being upstream. Which I presumed was the case really since the Pinefone had come along. So I was just kind of wondering what had happened if anything had changed on that front if Meggy was upstreaming patches now or anyone else and kind of what the situation was with that. For the original Pinefone the current situation is that someone in Mobian stepped up to maintain and update this kernel. He also started upstreaming a few patches and is monitoring the kernel mailing list and working with upstream to improve the situation over time. So there's lots of work to be done. I know there's also another person which has started working on the driver for the Wi-Fi chip which for now it was downstream real tech full of crap basically and nothing close to being upstream able. The new driver will be hopefully upstreamed and so that's already one big pain in the ass which will be removed. So now there's a bit more hope for the original Pinefone and if things continue that way then it will probably be great. So a question from the matrix. Is there any plan to port the EG25 manager to LibGP-IoD 2.0? Right, yeah EG24 manager is a very specific piece of software for the modem found in the Pinefone and Pinefone Pro. It's using GPIOs through LibGP-IoD and there's a new release which changed the API completely. The thing is for now LibGP-IoD isn't packaged in Debian at least the version 2. Version 1 is in Debian so yeah for now I don't have any definite plan. The plan being once version 2 is in Debian then we go with it but before I'm not sure I have the time to deal with all of this. But much requests are welcome as always. Yeah so a question regarding your tweaks approach. So why do you want to build if I understood this correctly? The tweaks on the device package them there and then install this package instead of having just one package that carries all the tweaks. The thing is we will have one package carrying all the tweaks but those tweaks can conflict with each other. You can have conflicting configurations for OGO for example and depending on the device you have to select the right one. You have also devices which can't suspend because otherwise they don't resume and other devices which can do that. So you have to select the appropriate tweaks and the idea of creating a Debian package is that the packaging system is aware of those files. If you have some files and the user shares something then it won't overwrite them with a file from another package. If we don't do a package on the device and install it then if we just move files around the packaging system will not be aware of those and if at some point one Debian package ships a file with the exact same name then it will break. So that's the idea. Alright please give another background of applause for Anu. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you.
Daily blogging embedded Gecko development
Next up, we have David from Sailfish OS talking about daily blogging, embedded gecko development. Have a big round of applause, please. So thanks very much for having me here in the FOSA mobile devices dev room. It's great to be here in front of everyone. So yeah, so I'm David Llewellyn Jones and I'm going to talk about some of my experiences, essentially upgrading the browser for Sailfish OS from one version to the next version of the gecko engine. And I'll also talk about my experiences. So one of the things that I did as part of that was I did a daily blog about my experiences performing that development. And so I'll talk a little bit about that as well. So first of all, a little bit about me. So until a year ago, I was a software engineer at YOLA. And I know it was exactly a year ago because my last day was at FOSDEM when I was at YOLA. So that was my last day of working there and then I started on the Monday at my next job. And now I am working for an organization called the Alan Turing Institute as a research data scientist. But it means that I have some background with Sailfish OS and the development of Sailfish OS. So for those of you that don't know, YOLA is the company that develops Sailfish OS, which is a Linux variant that runs on mobile devices, on phones. I'll explain a little bit more about Sailfish OS in a second. But now I'm a community member. I still use Sailfish OS daily as my daily driver, as my main phone. And so it's nice to have the opportunity to develop for it as a member of the community. And I put up my avatar there because I guess on the internet I'm known as Flypig and my website is called flypig.co.uk and that's where I'm doing my daily blogging. And there will be a little bit of relevance for that later. So Sailfish OS, like I said, is a Linux distribution. That runs on smartphones. It was first released in 2012 on the YOLA 1 phone. That was a big release at the time. And it is a busy box and G-Lib C-based Linux OS, so it's not an Android derivative. It runs on libhybris and also natively on things like the Pine phone. YOLA is the company that develops the operating system. It is free for there. So there are ports of it onto other devices that are community ports. But there is also a commercial layer that you can pay for to get Android app support to run Android device, Android apps on top of it as well. When I was at YOLA I worked on a whole bunch of different things including the sync support. I worked on some user interface stuff. I worked on the email client, a whole bunch of different things. One of the things that I worked on was the Gecko rendering engine which is used for the browser. So that's something that I gained a bit of experience with. And that's called Selfish Browser and that is essentially what I'm going to be talking about today. So I mentioned that Selfish Browser uses the Gecko engine for its rendering. And I'm sure a lot of you will be familiar with Gecko, but again for those of you that aren't, Gecko is Mozilla's rendering engine that is used for Firefox. So it is a rendering engine that is underneath Firefox and it is a little bit like these other rendering engines that I've listed there like Blink which is the rendering engine that is developed essentially by Google and that is used in Chrome. It's used in, I guess it's used in Edge. It's used in a bunch of other browsers. Brave, Vivaldi browsers like that. It's quite widely used. Webkit which is the one that Apple forked from the KDE Webkit browser which is primarily I guess used in Safari. And I think Epiphany, no browser uses it as well. I've also put Netscape up on there. Have any of you used Netscape? Yeah, okay. Wow, that's impressive. Okay, that's really good. So I mean I remember Netscape from the, sorry. Netscape. Okay, Netscape. Sorry, so I said Netscape. Have any of you used Netsurf? I beg your pardon, that's a different question. Yeah, okay, there's still a bunch of you that have used Netsurf. So I remember Netsurf from its days on Risco S. So it's actually a very, very, had a long history, but it's a nice engine as well. So Gecko is like one of those engines, but it is used for Firefox. And it was released in 2000, as it says there. Netscape 6 was when it was released. So it's had quite a long history. And it has also had quite a long history on selfish OS. So it's quite unusual for mobile browsers to be using Gecko. I think selfish OS might be unique amongst the mobile delivery. The mobile derivatives in using Gecko. And part of the reason for that is because it's not really set up for embedded development, but I'll come back to that. The history is that it's not a fact of when selfish OS was essentially a development of MIMO, which was the Nokia operating system that ran on the N900. And it had a browser called MicroB that was Gecko based. And at that point in time, WebKit wasn't really as developed as it is now. So Gecko was perhaps a more natural choice. And essentially there is this thing called embedLight, which is the Gecko method of embedding a browser into other applications, which is used in selfish OS to allow this to happen. And I'll talk a little bit about embedLight later. But it essentially allows you to run Gecko, the rendering engine, as either a thread or a process in another app. So WebKit is well set up for this. That's why you have things like Electron, which are using WebKit as the back end. But Gecko is not so well set up for this. And it's not particularly supported by Mozilla. In fact, I have this quote here. So this is a quote from a guy called Chris Lord. It's an old quote from 2006. At that point in time, Chris Lord was working for Mozilla. So he's not anymore, as I understand it, but this is from his blog, and this was from when he was. And he says, Gecko has limited embedding capability that's not well documented, not well maintained, not heavily invested in. And at various points in its history, there have been embedding APIs and capabilities, but they've either been dropped or left to bit rot. And it mentions IPCLight. IPCLight is another name for embedLight. So it's the same thing. So as you can see, essentially, it's something that Mozilla doesn't really support. And that has, over time, caused increasing difficulty in making sure that this browser is working on selfish OS. I want to talk a little bit about the structure, because it's actually quite interesting. So working with Gecko, so as I said, I have some history from YOLO working with Gecko, but it's actually a really nice set of technologies to work with. In selfish OS, there is the structure where it works. Gecko is at the bottom. Zool runner, XUL runner, is the library that is built for embedding from the Gecko source. And on top of that, we have this layer called cuteMozEmbed, which was also at one point in time in Mozilla technology, but is now essentially maintained by YOLO, I think that's fair to say. And it provides a cute widget that embeds the browser inside it. That's all. And then on top of that, you have embedLightComponents, which is a set of JavaScript shims that run inside the Gecko JavaScript engine and provide support for the user interface. And then finally, you have the selfish browser executable, which is actually the app that you run, and it's the user interface that you see when you run the app. So there's quite a stack there. And it's built of a whole bunch of different technologies. So Gecko itself is made up of primarily C++ with a bunch of Rust. It's gradually converting to Rust. A big chunk of JavaScript as well in there, and also this small IPC layer, which provides inter-process communication. CuteMozEmbed is, I think, all C++, and embedLightComponents is all JavaScript. And then on top of that, selfish browser is a mixture of C++ and QML. So it's a cute-based browser. So you can see there's a whole bunch of really interesting technologies that are being used for it. So for me, it's a really nice thing to be working with. But as I said, there is a problem with Gecko, which is essentially providing this route for upgrade of the browser. So there is this upgrade problem that means that it's actually quite hard to upgrade the browser from one version to the next. So to give an example, when I was working at YOLA last year and before that, so in February 2021, we upgraded it to ESR60. So ESR is Mozilla Terminology for Extended Service Release. I think I have that right. I think it's not support, but I always say it wrong. I think it's extended service release. So it's essentially releases that have bug fixes and security updates for 12 cycles, so 12 normal Firefox releases or thereabout, so I understand it. And YOLA tends to upgrade to the next ESR release because that provides longer-term support, and it means that you don't have to keep up with all of the versions all the time, which is very difficult. But it takes a long time to do these jumps. So in February 2021, it was upgraded to 60 in March 2022. A year later, it was up for doing it again. And because I knew it was going to be quite a big job, I also felt like I didn't really want to do it alone, so that's one of the reasons why I decided to blog about it. I would decide to do this daily blog that would not only allow me to draw people in, but also keep the community informed about where things were at. So I'll talk a little bit about this dev diary. So what is a dev diary? I'm sure you probably already know, but let me give you a quote. So this is a quote from Jay Wilson, who is a person on the Internet. And he says, A developer diary is where you write about developer things. It can be the record of what you worked on last, some decisions made by stakeholders in the product, or a neat way to accomplish this task. It really comes down to what you want to put into it. Okay, so essentially it can be anything, but my experience with dev diaries was with game development, because I used to follow, I think originally it was prison architects that had these video diaries. They were a crowdsource funded game, and I really enjoyed them. I thought they were fantastic. But they worked really well for game developers, because they're actually, to a larger centre, marketing tool, because they have a lot of really great assets that they can share whilst they're doing the development of graphics, things like that, screenshots. So they worked really well as developer diaries. But for me, it was more about kind of capturing the process and making me think about the process and motivating me to try and actually carry on working on it, which I think is what I put on this slide, hopefully. So yeah, an aid memoir. So I found it useful to write about this stuff, so that I could refer back to it. So if I write what I've done, it means essentially I actually have this massive text file where I've written all of this stuff, and I can just search it to find out what particular changes. I made a wish particular point. My memory is not always so great, so it's very helpful for me to do that. It helped me to structure my thoughts in terms of architecting what changes I was making. It made me kind of think through it in a logical and consistent way. And it also motivated me to work on it on a daily basis, which was actually a really big deal for me, because otherwise I would drift. So that was actually really important. But there are some downsides as well. Ah, am I coming onto downsides? Okay, before I get to the downsides, it was also a way to involve the community in the process. So like I said, it was... So these browser upgrades are something that the community rightly sees as important. When you have an alternative operating system on your phone, you can't run... I appreciate I'm telling you all stuff that you already know, but you can't necessarily run all of the Android apps that you might want to be able to run. And in that case, having a good web browser is really important, because that provides you access to the services that you need. So an update browser is really important. And so keeping users aware of how things were developing was much nicer than just like after a year, saying, you know, here's a new version. It was actually much nicer, I thought, to see that process through. And also to maybe get other people to come in and help with some stuff. But as I said, there are some downsides as well. So for example, I find that taking... writing and coding at the same time takes about twice as long as just coding. You know, I think that's probably not too bad. That's probably what I'd expect. Like I said, there are benefits to keeping a record. Daily updates are a motivation, but they're also an incredible bind. So when you get home late from work and you've had a really long day and you know you've got to get a blog post out, you still have to get it out, right? That's the idea. So there are certain times, so I feel like the balance is in favour of... it makes me motivated in a positive way. There have been a few times I've been a bit grumpy about it. But overall, I think it's been a very positive experience for that. And what I found was that it works really well in certain parts of the coding process. So it works well when you've got small tasks that can split me split up daily. If you have a particularly long task that's going to take five days of just pure thought to work through the architecting of it, it's a bit of a disaster because you're writing the same stuff in cycles over and over again. So I guess those were like learning things that I had as I went along. And also, blogging, daily dev-diring, I guess, is also really good if you have visual assets to share. So I was mentioning about game development there, it's really good for stuff. But if you have graphics that you can share, that's also really helpful. All right, so very briefly, this is the timeline that I went through. So the reason I was telling you about this is about dev-diring is because I recommend it as a thing to go through if you're developing something. I think it's really helpful. This is the timeline. So I've now been doing it for 149 days. I took two weeks off, I think, in August or thereabouts, and I've taken two weeks off over FOSDN because otherwise my head would have exploded. But otherwise, I've been doing it daily for this long. And you can see that it took 45 days just to get the build working. And that was a real problem because you're developing completely blind, right? You're just making changes and you just don't know whether they're going to work until it builds. And then it took another five days before it would actually execute, and then a whole bunch more days it took up to 83 days before the rendering was working, which was quite an experience, I have to say. Getting it working was quite nice. And then after that, changes came a bit quicker, and I'm now back to kind of figuring things out at getting sites to render. So, you know, that's been quite an interesting experience. But the community response, I have to say, has been utterly overwhelming, and I've been really astonished by it. So people have just been incredibly generous and incredibly nice about all this stuff in a way that I, you know, these diaries are really dull, I can assure you, and yet people still read them and they still comment about them. And I sometimes feel like I joined the wrong internet. You know, everyone in the media seems to say that the internet is full of trolls and full of scammers and full of, you know, foreign agents that are trying to swindle you in some way or another. My experience is not that. You know, I'm sure those people are out there. I'm not saying they don't exist, but for me, I've just had positive experiences all the way. So that's been really nice. All right. And one thing in particular is that there is one community member called Thig, and he routinely generates art for me to put in the blog, which I find really, which is just wonderful. This is one of the images that he created. And yeah, so I just use them, and I think it's really cute. Although I feel like it's not a fair fight. I'm not sure. I feel like the Fox and the Gecko team me up on me there. But anyway, so a shout out to him. I think that's really good stuff. And also to Ulva. Have I got time to do a quick demo? Yep. Okay. Okay. I'm getting nodding. That's good. So very quick demo. So it does work. There in my slide. You don't want to see those. So let me run it. So what I discovered when I just try this a second ago, so it will crash probably, but was that the network connectivity down here is really poor. But as you can see, it does render, and it's pretty okay. So this site is full of SVGs, and yet they're working okay. One thing I'm particularly happy about is the fact that WebGL is working. So hopefully, there we go. So this is a little bit of shader code that's running in the browser. So I appreciate the applause, but I have to say it just worked. I just did the upgrade and it worked. And let me run this. So there's a bunch of stuff that's still turned off. So this is not up as high. So for example, WebRTC is off, is disabled, and there's a bunch of other stuff to get it to work. So hopefully this number will go higher, but it's still looking okay. So just finally, that's the demo. I just want to say, closing thoughts. So writing Dev.io I found was a really good experience. I recommend it. If you're going to do it, bear in mind it's literature, not code. Just write your thoughts. Work a couple of days in advance so that you've got a bit of a buffer, and stick to a really strict cadence. And write about what you think your tomorrow self is going to want to know about what you're doing today. And then finally, I just want to say, like I said, I had this overwhelmingly positive experience from the community, so I just wanted to thank all the people that have been reading the post. This slide took a long time to prepare, I have to tell you. But like I say, it's been really overwhelming, and I really appreciate it. And if I have forgotten you from this slide, I really do apologize. I tried my best. Thanks very much. APPLAUSE Great talk. Let's do the questions. I'm sorry, this is not related to blogging. Is it really using WebKit much easier than using Gekko? So essentially, should selfish be using WebKit? Is your question? Yeah, I mean, it's a very natural question. It's a very natural question. So what I would say is that my understanding is that it has been thought about, and the decision is that it would be more effort to do that. So all of those layers on top of the rendering engine that I showed, they are doing stuff that is selfish-y, right? And in the user interface, they are providing links between the two. There is a lot of interaction there, which is one of the reasons why this upgrade is so hard. But re-implementing all of those things for WebKit would actually be a lot of work as well. Because I've heard some complaints from other projects, I think some browser-based browser has moved from Gekko to WebKit. Right, yeah. So like I said, WebKit is set up for embedding, so it is a natural choice, I think. But I have to say, I'm actually very proud of the fact that selfish still uses Gekko. I guess most people don't care about the rendering engine, but I kind of think that is one of its attractive properties. But it is a very fair point. Since you are not working for Joll anymore, are they also somewhat involved in this process? I mean, I guess they are interested in what you are doing, and I'm guessing they will use this. Sure, well. But what is kind of to say to Stereo? That's a really good question. So I would say there are some Jolla people up on this slide. And so I still have contact with the people I worked with previously, and so this has not been done without their input. Yeah, I mean, they've been super helpful. People from Jolla have been helping me all the way through, and I don't mean just encouragement, they've been making co-changes as well to help. But I haven't had a discussion with them about what will happen eventually. So maybe eventually they'll say, actually, we don't want it with. Let's see. But yeah, so yeah, I have a lot of really good support from them. Yeah, hello. So first of all, I'm one of those Jolla people, but I'm not involved in it, but I'm going to say we follow it. And it's great that it does the work, so it's really awesome. And I forgot to speak in this microphone. And one thing I want to say, or I'm not sure if it's really a question even, I think you asked, should we use this WebKit? You have to first think WebKit or Blink. There's a big difference, and I think there are big advantages of using what's embedded, a converter versus using Blink, even, because the engine is faster, in my opinion, and doesn't use so much memory. And I was thinking, couldn't also other users contribute to it, they're not necessarily using save features, because the whole API stack, except the service browser, is not really service specific either. Is this a point? Yeah, no, I mean, that is a really good point. So in theory, it is not selfish specific. It is an embeddable version of Gecko. So I think that's true. So I have to be honest, I wouldn't know where to go to, though, to find those users who would be interested in doing that. I'm not saying they're not out there. I'm just saying I lack the experience to know, I think, it's part of the problem. My community is the selfish community, so that's what I know about. Yeah, it's a really good point, though. Yeah, I know. I just suggest, because in my opinion, healthy Web is important, and having just any variation of WebKit on it is just not healthy, in my opinion, and just try it. It's not so far off, I would say, if people take in the effort, of course. Yeah, interesting point. Thanks. Any more questions? Yeah, come around. Yeah, hi. Have you talked to Mozilla people, if they are happy to accept some of your patches even behind the flag, just to help you out and to elaborate the process? So that's a good question. I have to ask. Okay, so Bjorn has asked. I have asked about it because I have a feeling it's kind of weird that they talk about mobile, but then they just read Android, because, well, they have no iOS, because they just ask. But I don't think there's entirely interest in this idea, specifically, but just interest in mobile. I'm not sure. They say they go where the users are, so go somewhere else, and write their name. Right, so I guess it's something that we should probably push harder on, in that case. It looks like there is... And someone is not even able to know about it anymore. Yeah. Yeah, it's probably something that we could try harder to do, actually. It's a good idea. I think the biggest problem is that we are behind head, so we're not... I think we would have to push patches that we couldn't then test. Essentially. Yeah, well, my personal opinion is that Mozilla developers would love to help, because Jekyll Aging is not very popular, especially compared to other engines. So having somebody to use it, it's a good thing for them. Actually, yeah, but it would be nice to try. Give a round of applause for David. Thank you so much. Thank you.
PineTime: A Programmer's Toy and Beyond
All right, so the last tag of the deaf room is about another PIN 6040 device. We heard a lot about them today, but not much on the PIN time yet. So give a round of applause to Joseph. Thank you, wow, so many people here. So I'm Joseph and you probably ask why you should listen to me. I'm from Bernou. I'm C-Developer. I code in Bash. Yeah, Bernou is a nice city by the way. We have like those big statues and so on. In Bernou we have that conf, which is great open source conference. I would like to invite you there. Code for papers is still open. I'm organizing open-out conference in Czech Republic, which is in Czech, so not very interesting for you. I'm writing NEMU mobile, which is C++ project. I fly with microlight airplane and I did some adjustment. I'm doing adjustment of Amazfish, which is company on app to PIN time. I have a few patches there, so I hope I understand the topic a little bit. So that's about me. What's about you? Well, yet another thing. I wasn't sure about if I'm able to arrive because my back hurts and my doctor told me to do some exercise, so let's do exercise with me. So I'm C-Developer. Please write your hand. Okay, many C-Developers. I'm not C++ developer. Is anyone C++ developer? Okay. Do you own PIN time? Who have PIN time? Oh, so much. Right. What next? So do you try to code for PIN time? I'm trying to. Wow, so many people. Okay, so I will try to adjust details for you. And as I said, or I didn't said that I'm working for Gracor Text, so I'm not here on behalf of PIN64. And my aim was to recommend you some, or provide you some details about PIN64 PIN time. Because you may consider buying some smartwatch. And my aim was also to tell you about other options. So I selected those few devices, Bangal JAS, SQFMI. If you remember, BigBerry SQFMI does smartwatch as well. I recently started with Asteroid OS, which is great. And the main difference is that it is like full-blown Linux and very powerful device if you compare it with PIN time. Okay, you probably know the specifications. The CPU is 64 MHz. You have 64 kilobytes of memory and small storage. You have 240 to 440 screen with touch panel. You have accelerometer from Bosch. You have heart rate monitor, vibration motor, Bluetooth and battery. That's it. You have, or there are two editions. I have sealed edition of smart PIN time. And it means you are not able to connect to serial line. You have to debug everything remotely, which is not very comfortable for coding. You can buy also DevKit, which has those things. Okay. What about software? I was speaking about hardware. This is about software also. On the left side, we have some operating system for PIN time. On the right side, we have some companion apps. If you want comfortable or nice user experience, you probably want both. There is infinite time, which is written in C++. There is VaspOS, which is a micro Python, Riot OS, which is in Rust, if I remember correctly. I was focusing mostly on infinite time, which is like most advanced project. It's focusing on good user experience, and it's not like so much debugging and so on. On the other hand, we need some companion app. If you are using Android, you will choose probably Gadget Bridge, which is great. But I'm using Ubuntu Touch. What's now? When I started with PIN time, there wasn't any companion application for Ubuntu Touch. I've seen AmazFish, which was suitable for porting. I started with port, and I'm working on that right now. I'm fixing bugs there. There are also other apps for iPhone or post market OS for POPFOS, like GTK-based application. We have many options. I choose those to infinite time and AmazFish. What is infinite time? The vision of the project is to keep freedom and privacy. They are explicitly saying that they don't want health tracking features. But what actually it provides? We have some smartwatch and features. I would like to hear what features you would like to have on smartwatch. Please try to shout on me. Sorry, time. That's a very good feature. PIN time and infinite time can do the synchronization of time with your companion app. So you have synchronized time. Some other ideas? I'm sorry? Stopwatch. That's great. It's there. Tracking other sport activities. Tracking other sport activities. I decided to split it into three categories. This goes into health monitoring. They explicitly said that they don't want to do that. But it's a good feature. With age, I'm also wanting some health tracking and sport tracking activities on my smartwatch. GPS monitoring. Not for sport but for north. Okay, so some navigation applications. There is a feature that you have pure maps. The pure maps sends, or you set some targets, some navigation. And the pure maps sends the instruction to D-Bus, to Amazfish. And Amazfish sends those data to PIN time, to infinite time. And it should work nicely. But when I was developing that, I had broken legs, so I haven't the opportunity to test it. And after that, I didn't test it. Sorry. But I hope it works. Okay, if we are speaking about notifications, you probably want to be notified about phone call. And you probably want to refuse the phone call from the device. Testing works. It has some limitations. Like the infinite time starts like five rounds of vibrations. And after that, it stops vibrations, also notification or the vibration stops even the phone call wasn't received or rejected, actually. But otherwise, it worked nicely. Speaking about other notifications, for example, for emails or for instant messaging, the main issue is that the notifications are limited to 100 bytes, which is basically content of the whole screen. Like one screen, but the message can be longer. For sure. So there is this limitation because of lack of memory on the device. And when I started with infinite time, there were other issues and I will show them later. There are also other features such as alarms, but I would like to see a remotely controlled alarm and multiple alarms, but this is not possible. Similar for calendars. The Amasfish allows to send notification about event in calendar, but only when you are connected to Bluetooth, by Bluetooth to device. When the connection is lost, you don't have information about event, which is not the feature you want. And then you have some kind of other activities like remote control of music and navigation and weather forecast. By the way, with the new release, which was, I don't know, one month ago, there is a new weather forecast implemented in infinite time, new protocol. And with most recent patches, which are in master, but wasn't released yet, you can have the weather forecast also on other watch faces. So right now it is available only on pine time style watch face. Okay, let's continue. This is how the architecture of system look like. So thanks for picture. The important part is that it's based on RTOS and there is some abstraction layer which aims to provide some abstraction for device. And you can have driver separated from the controller which allows you porting to other devices. Okay. I want to develop that, so I'm user. I have some idea what I want. And then I would like to try my idea. There is something called simulator, which allows me to prototype the idea. It looks like this. You can even record the screen. So this is great. So I've added that into request. So my idea was that there is missing some notification or some status of the alarm. So I don't know if alarm is set or not. So I have added the icon into status bar and recorded the thing and submitted pull request. It's really easy. I hope I... Okay, it's not good idea. So I wanted to show simulator a little bit, but it's really easy to compile and it's not possible to share the screen. So maybe at the end if you have time. The simulator allows you to adjust also steps and the battery level and so on. So you can tune the UI to have good user experience. You have tuned gestures and so on. So it is really great and you don't need the device itself to test your application. The simulator itself has sub repository with infinite time. So you are actually modifying the code which can be run on the device. I have... When I started with Amazfish, I had troubles with notifications. So my colleague Pavel Kerek wrote me a message which says, ''Cześć Cinać Králu'' and well, due to limited memory, there isn't font for Czech language. So it was looking like that so some characters was missing and I wasn't able to read it. Luckily there is something we call transliteration and it allows to substitute some characters to some similar characters and we are able to read that normally. So I have added that option to Amazfish and I was pretty happy. Then you can see that there are some glitches. So there is a notification, somebody and title. So I have noticed that Amazfish is sending those incorrectly. So somebody should be in the header. So I have one more line for message itself. Next you can see that the teleports string should be name of application. It shouldn't be the name of the debus source. So this is another thing I was fixing. Then I was looking how to fix phone calls so it started to work. By the way my son's name is Adam. So one character is missing so I noticed that... I noticed that infinite time doesn't process the alert notification service properly. There should be A. So there is a fix in Amazfish with one byte to have proper name. And there is a plural request in infinite time which removes the byte from processing if not should be there. If you read abstract of this talk I wrote something like that the community around can benefit from those changes. So I was making the changes for Wound2Touch but Amazfish still works on Sailfish OS. So all Sailfish OS can benefit from those changes as well. So I like that feature of open source. Okay here are some more examples I have done. So one of thing was pairing. And I had issue also with the universal components. So one issue was that icons had bad dimensions and there was some attached page which wasn't inserted properly. So you was not able to click back on the pairing page which was really uncomfortable. I also fixed some retrieving of information about number of steps on device. I prepared sampler of the activity stream so the pine time is not storing your fitness activities. So number of steps each minute and heart rate each minute. But it can just stream data to company in application. So I am sampling data and in the end I can paint such histograms. If you look carefully you can see the time is like to 3 p.m. or something like that. So it is just few samples from whole day. I have also added some charts for battery so you can see that your battery can last for 7 days roughly. And also other things. Yeah let's skip that. For sure if you are working with microcontroller you want to check battery life. And you need to be careful to keep your consumption low. I have yet one slide about summary what MSFish does. Well from point of view that something does infinite time something does MSFish. Like they work together. By the way that picture is like there were some kind of silly joke or how to say that. So when I was kid so if you say something like I want to be millionaire the answer was I want to watch with fountain. So this one is it. So there are many features some of them are limited like synchronization of activities. Synchronization of calendar are not there because infinite time doesn't do that right now. There are a lot of poor requests and the main thing is have bottleneck in reviews. So it is a little bit demotivating to make something and then wait without response. But I hope it will be better in future. And yes it's community project so there is a lot of forks with a lot of ideas. I have here pictures of some forks from Joachim. By the way there is a rumor that Joachim he made some next generation of the of the pine time. So there is more is it will be pine time too. So twice faster CPU same Nordic semiconductor twice more memory. But I'm not sure we will see. And you know it's open source. There are many forks many many changes. And it's developing so I hope it will be it will be great. Or it is already actually but it will be better in the time. So I hope all of you will are motivated to contribute to Amaz fish to to pine time. And I'm looking forward to your contribution. So I think it's time for question right. Actually maybe one question or so but we don't have so much time. So maybe one one good question anyone. Otherwise give up a big I add a S1 actually but still give up a ground of applause first. Thank you if anybody has better question than mine please go ahead. Okay just as more curiosity because you mentioned multiple alarms. Yes I've never owned them smartwatch that supported more than one alarm. Why in your opinion is that are there some kind of constraints. Does nobody care about multiple alarms only being in the edge case. Well I I was looking for features which are interesting for for me this for example find my phone feature which I tried to implement. And when I was looking for those features I was looking into code of Amaz fish and there was that feature. So some smartwatch can do that but I don't know which one. We'll find out somehow thank you. All right another round of applause. Thank you very much. And I will ask you for another round of applause for organizers of that room. Thank you. Yeah I didn't do it alone also with Arnaud and all the others that's head out and yeah thank you all for coming to the deaf room. Make sure to check out the Linux or mobile stands at AW tomorrow if you haven't been there already and have fun. Thank you.
The State of OpenJDK
So, with that, we hand it over to our illustrious first speaker, Dalibor Topic. Yeah, hello, I'm Balkir for myself as well. Let's see, the microphone that works, works, yeah. So, my name is Dalibor Topic, I work at Oracle. And I don't need a mic. I'm Static-y. Oh, yeah, I'm like metal. Metal, I'm metal. Okay, better? No, that's just my voice. Right, so I work at Oracle as a product manager on OpenJDK. I also tend to wear many other hats in the OpenJDK community, like that of a JDK Updates maintainer. So, are they giving the presentation following right after this one on how the JDK Updates have evolved across the past decade? But for this one, let's get started in the interest of time. Four years ago, our story took a little turn. Mark Reinhold gave this talk, or version of that from four years ago, on the state of the OpenJDK in the free Java DevourMid foster. While Mark couldn't join us this week, he graciously offered me to actually present the talk as I would give it after four years later. So, my version has now to cover not just one year as the tradition demands, but four years of development of OpenJDK. For better or worse, I think for better, I don't get four times as much time. So, I just have 20 minutes, so it would be very compressed. And I leave the fun parts out for other people to cover in this room. But almost 34 years ago, the World Health Organization declared the novel coronavirus outbreak a public health emergency of international concern. Their highest level of alarm. And while this development, at least for me at the time, and probably to most of you might not have seen very concerning while we were here at Brussels, well, it turned out that in the following weeks and months and even years, we had a public and our response to it disrupted both the regular life and the supply chains. And for us, for the OpenJDK thing in particular, it came at an in-between moment because we're in the middle of making JDK 14. Back in September 2019, we had planned to release JDK 14 on March 17th of 2020. And so, boom, all of a sudden everything changes. Because we actually planned to have release cannabis in February and then release in March. And as you may well remember, March 2020 was quite something. Man, that was something. It was quite disruptive for many people in many ways. But what wasn't surprising was that we actually delivered JDK 14 as predicted in September right on time. Just as we planned on March 17th, just like every JDK release since, we had switched over to this new release cadence of having a release that's predictable and time-driven every six months. That switch was performed two years prior to that, back in 2018, with JDK 10. And then it led to much better results than the way the JDK development used to be prior to that with longer development cycles that are less predictable and tended to deliver less, let's say, reliable results in terms of actual release dates. And of course you can now say, well, sure, of course you did release on time because what was left to do was just to polish your release candidates to perfection, as you always do. How hard could it possibly be? Well, it's not so easy. So we actually, in our nine-month development cycle of the JDK releases, we leave three months to the polishing part to ramp down and actually fix the release down, whereas the first six months we devoted to developing features and all the new bug fixes that come in. And so we leave ourselves enough time to actually get to a really high degree of quality. So while there's probably a kernel of truth to saying there wasn't that much left to do, there is a lot more to it because as we look at how we delivered the rest of the remaining four years releases, they also all came in on time during the ups and downs of the pandemic and supply chains and what else was going on in the world. So there's actually quite more to it than just the fortunes favor of the bold or the brilliant or high-end putters. So as my colleague Brian Getz said in his keynote from last year, the bad old days, as we now call them, but they were the good old days back when we were in the middle of them, they had a very core-strained feature box release model. So it took us two years, two-four years between releases. And just this little fact tells you how flexible that was because that's the double the time sometimes, depending how bad things go. We had frequent delays. There used to be like a running joke sometimes in papers about trade releases. And we had no predictability. The release management process had to be very heavyweight to accumulate for that. And we had a lot of temptation because of that, the release is shipping every few years to push in changes that just had to be there. And so that lead to tweaking of the arms to get things in sometimes. And also because users wanted these features, not just in latest release, but on the releases they were using, to tweak the arms of the backport steams release managers to get them backported. So we had these huge releases that came out and kept getting major feature updates every six months, not just backfixes, but really big huge features like new garbage collectors and stuff. So that's a lot of work, but it wasn't really working too well for anybody. It wasn't working well for us because it had these massive overheads. It wasn't working too well for the developers and the community out there either because Java developers were getting frustrated by the pace of Java's development and by the time it took to go from one release to the next to get the major features, while all around them the world was exploding with excitement around groovy scala. All right, yeah. Pearl five, come on. Things happened back about ten years ago, so it was an exciting time for Java. And so the excessive backports also ended up being a problem for the backporting project because we ended up having to do a lot of work to get new features and to shuffle things around. Whereas with the new model we changed things a bit and decided to just focus on backporting what really matters, which is security fixes, grid fixes, and the rest will leave up to people who really want to do the work. So to do that we first inquired to see what actually we wanted, what we need for developers. They tend to want the latest and greatest features. They tend to want the language changes, the VM changes, the performance fixes, all the good stuff. Whereas our many Java users tend to be conservative business users. So they tend to favor and appreciate stability and security. And they tend to want to have their investment they've been to the platform two, three, five years ago pay off and keep paying off for a few more years as they had planned. And so it's hard to make these very different goals, hard to fulfill at the same time. So the old model basically didn't make either developers too happy or the enterprises too happy. The new model we adopted with these kids is a tip and tail model as George and Brian and others on our team like to call it. Where all the features that we develop are integrated on the tip of the development. So they're done where development happens, right, where the action is. And they're integrated only when they're ready and only when they're ready. So we resisted the temptation to pull something in just because it's got to get out there, man. We pull it only in when it's actually fully baked and we believe it's actually done and good enough. And to actually get to a point where it's good enough sooner, we introduced the concept of feature previews for example. Where a feature can then get gradually improved and refined to its actually fully part of the platform. And so that takes a lot of the pressure out from getting code and features into JDK. Because if you miss one train, there is another one coming six months afterwards. And if your feature isn't completely perfect, you want to have, use a feedback, you may get a preview, you get to use a preview, a preview. And you do this a couple of times if you have to. And this takes what used to be a singular point where you had to make decisions. Like this is where the release gets done. Break this away into actually a matrix across time. So you have plenty of time and space. You remove the overhead of managing the open source release part and can focus on development. And that's very important because that's where really the fun is. On the other hand, the tail part is that of course users really like and enjoy having stable Java releases to work with. Because one of the defining features of the platform is its stability, reliability and predictability. So we at Oracle offer commercial support for long term support releases. And we're not the only ones. So the many others in the community. And so this model basically allows both development to happen very quickly, but also allows commercial users, for example, to actually invest into the platform. And then get security updates and critical fixes and sometimes even more. But typically that's the focus of the platform. And when we did this six years ago, there was a lot of skepticism. Both internally, so at Oracle, also internally in the wider OpenJT community, because it sounded like a lot more work actually to release every six months. I mean, we knew how releases were. They were long and painful and never shipped on time. Right. So that was weird. For example, we feared at Oracle, we would end up having releases with nothing to do with the ship. Will we actually have marquee features every six months? Is that going to matter? Well, change changes hard. We fear change. But also in the broader ecosystem, people were skeptical. So you're doing a release every six months. I mean, are they going to be any good? Isn't Java 8 what everything everybody ever will need? Why do more Java releases at all? I mean, so isn't like, are we going to take over the world? Yeah, so that didn't quite happen. But nevertheless, so change is hard. And so people were really wondering about being overwhelmed and worried. And the one way how you actually fix this too, well, do the cultural change and show that it works. And so we had to change our mentality. We developed the JDK to switch our mindsets, to basically go away from the way things were done before, to learning to embrace this change and not to panic about missing a train. It's okay. There is an excellent. It's almost like the trains in Belgium, which seem to run fine, contrast to where I come from. There's always an excellent. We don't try to overdo the tail. We don't try to backport everything to the JDK updates project. We just backport what we believe needs to be backported. So just the specific fixes required for security and stability, maybe platform support, maybe every now and then there is something weird like new character sets or some data time updates that have to be done. But that's roughly it. We don't try to make Java 8 in open JDK be the same as Java 21 because it's not. It's not going to be. People who use Java 8 don't care about it being Java 21. And so this lets us, to philosophy, of breaking down this big, big feature, the small, tiny elements that can be delivered in six months increments and removes a lot of the release management process to a point where it's basically invisible. And things just flow along as they should be. And so this has worked out better than we hoped. The time for releases was slashed. The effort it took as well. The amount of work that had to be done on backports went significantly down as well. In particular, one thing that we embraced was the change of leadership in backports. Because back in the day, when it started doing JDK updates in 2011, we used to do those for many, many years with the Oracle Open JDK maintainer team. We delivered Java 6 updates for, I think, six years, Java 7 for five years. I think eight, also for four or five years, I'm not sure. And so once we move to this faster release cadence, we're clear to us, we can't just keep doing like five years of updates on every six months. That's just impossible. It makes no sense. Nobody's going to embrace every one of those all the time. So what we did instead was to move to a model of actually transitioning responsibility in the community for the updates. Where the Oracle maintainer team now takes care of the updates for the first six months. But then we passed on the responsibility of those in our community who are capable and willing of maintaining release further. Like Andrew over there did for eight and eleven. It still does. So if you want to hear painful stories asking later, and others like Severing will do for 21. And this is also a painful thing to do in the community because giving our responsibility requires, you know, courage. But we're fortunate enough to have been working as a community together for a long time now that we've built up the trust actually in processes to make this work well. And I think they do. It also lets us deliver on the Java itself in JDK itself more features faster, which is great. It makes developers happy. And this of course, it makes all our customers happy. JDK developers are happier because they have to spend so much time in their management. And so everybody benefits. And so we keep delivering features. In 21, as my colleague Aurelio Garcia de Ribeiro from Port of Management Team, says we have four buckets that the features happen into like language and libraries, performance and housekeeping and stuff. There is something for everyone, right? In the prior JDK release, there is even a bit more stuff that came in through this model. And if you compare with 11, there's actually a ton of stuff that happened in the past, in the past two LTS releases that is coming to the platform through this very, very quick development model. And all together, like, you know, the time between 11 and 21 isn't that much longer than like an old release would have been overall. So this model not just works well in delivering stuff, it delivers a mass of things. And it delivers so predictably and regularly. And that's what makes it great. Then another thing that's important, four years ago, when we were here, we had the open JDK Commitus Workshop in Brussels where we talked about our tooling. We had just been going through the progress to update our tooling for JDK development to move the social management from the material systems that we used and hosted over to GitHub. That's now Microsoft hosts and pretty much everybody uses. And that, of course, was also a change that required a bit of courage and also provided opportunity to define tooling that makes our jobs easier and that helps us manage the JDK coding changes coming in and keeps helping us. And we keep updating this tooling to help us with backpours and so on. And so this is a powerful thing. It's not just about the mindset, it's about embracing the tooling to actually change the things you do under the hood and to use the power of computers to make working on the computers easier. And so these days, Open JDK on GitHub is a pretty massive project. It has around, I think, 120 repositories, around 500 people participating. And most importantly, the past four years have been the four years with the most changes going into Open JDK ever in four-year terms, which is like one US presidency, just saying. So 2020, Trump was president, right? So it's also the change that led to pretty much most Open JDK contributors coming into the community because, of course, joining us through GitHub has been much easier than learning an unfamiliar code system and trying to work around the industries of a very different contribution path for most of the other open source projects. So that has helped us move Open JDK forward as fast as it was. So what's the actual change that happened in the past four years then overall? Well, we've got faster and better at delivering the JDK. Java has been back in 2020 before the pandemic, one of the top choices for developers for programming up there in the upper top right corner. And in the last survey from Red Monk for programming languages rankings, it has remained in the top right hand corner as one of the most popular choices for developers on GitHub and on Stack Overflow. I'm not so much so sure about Stack Overflow anymore because I think that's mostly a training site for generative AI these days. But, you know, the GitHub numbers I think are still fairly valid and the GitHub service also very out of cover. Java is very popular and frankly rightly so because we keep as a community making it better and moving forward. And just to, as a point of that, in our JDK release blocks that Sherat, one of my colleagues writes, we have this wonderful slides of the contributions into Open JDK tip from since JDK 11 to 21. So it's like 10 JDK releases, like five years development. And of course you can see there is a big red block of Oracle as Java stewards. We do a lot, we have to do a lot, right? But it's not just Oracle. There's a lot of other parties participating and moving Java forward together. And one aspect I really want to emphasize here is that this little blue block over there, that's the independent developers. That are not affiliated with Oracle, or Red Hat, or SAP. So those are people who don't work for any of these big companies. And they have, thanks to the move to GitHub, among other things, and now move so far forward in the contribution queue to actually make up, I think, the fourth largest block of contributors. And that's great. I mean, that tells you that we're doing something right in this community. So thank you to all of you who have actually contributed in the past. And hopefully we'll see a few of you join us and do interesting things in the next year. And speaking of interesting things, we have a bunch of projects going on in the community that are active, like Loom, with Ellen, talking about later, like ZGC, a garbage collector, Panama, Amber, for language changes, LIDEN, where we work on basically time shifting and time machines and other wonderful things for Java technology. The Valhalla to make the JVM much faster and simpler, and Babylon to make the Java more suitable for other weird deployment targets like GPUs. So we do a lot of weird things, fascinating things. And in particular, these are things where your feedback is very welcome, where your testing effort is very welcome, where your ideas are welcome. We can actually make a difference. It's much easier to make a difference than, say, if you were trying to contribute to a JDK Update project, which is a bit of a meat grinder. I'll tell you that in a few minutes and why it is. But nevertheless, this is where you can really do stuff and have fun doing some. And to talk about these projects, in this very dev room, we'll have today myself and Jonathan. Jonathan Dowland. Yeah. Over there, talking about JDK Updates at 11.05 and at 16.40. We'll have Roman Kenke over here talking about Lily Puth at 11.55. We'll have Maurizio Cimadabare talking about Project Panama. We'll have Alan Bateman talking about Project Loom. We'll have Phil, we just saw on the stage over there, talking about Project Wayfield along with Alexei and Niels. And we'll have Johannes over there talking about some of the work that's going on in the House spot on Save Points. So you have a chance to meet many of the people who work on the JDK here. If you haven't met him before, look around and ask around. You may be holding up your Chippewa review, as was in my case. Otherwise, enjoy the day, I would say. And thank you for coming and thank you for being a part of this community. So this gives us five minutes to shuffle rooms if you want to, and then we move on with the next one.
A decade of JDK Updates in OpenJDK
Thank you very much. Thank you very much. And now let's get going with the next one. Decade of JDK updates in OpenJDK. Technically it's more than a decade now because we started doing JDK updates in OpenJDK a long, long time ago. But we'll just stick with the title. Don't take anything I say here as, I don't know, serious investment device or anything like that. Just saying I'm talking about code. So we have 10 years in less than 30 minutes. Let's go with the prehistory. We started doing updates in OpenJDK with the OpenJDK 6 project back in 2007. So, Sun, microsystems, some of you may know it. OpenSourced Java under OpenJDK back in 2006 for the first parts. That was the compiler and hotspot. And then delivered the rest in 2007 in May. But the rest at the time, well, let me show you. It was announced to OpenSource in November of 2006. But then we released Java 6 in December of 2006. And so, like I said, Sun opened Source the compiler and hotspot. But there is a lot more to the JDK than the compiler and hotspot. There's like all the libraries and stuff like that. And so that wasn't open source until May. And so Java 6 was released. But it wasn't open source yet. Instead, what Sun had released at the time was the in-development version of Java 7. And of course, it had changes that were different from Java 6 because it's going on for a while developing. And so, we looked at that. And in particular, Joe Darcy from Oracle looked at that and said, this is a problem, let's fix it. So, he took the code from OpenJDK 7, the development version, and cut it down again, pare down to the Java 6 specification, and created the OpenJDK 6 project in 2007 that was based off OpenJDK 7, which leads us to this wonderful little crazy diagram of how OpenJDK 7 was born. You know, Java 6, there is builds and stuff, and then Java 7 development starts, and then 10 builds in, OpenJDK 7 happens, and then again, like, builds happen and happen and happen, like 20 of them and all of a sudden. Oh, yeah, OpenJDK 6, let's do that. That's a good idea. And so, we did that, and we did that quite successfully. For many, many years, Joe Darcy was the project lead for OpenJDK 6. He then got adopted by distributions, went into Debian, Gen2, Fedora, Red Hat, Oracle, you know, I guess, whatever else is there. And so, that was the first time we actually started doing updates for OpenJDK, for release in the community led by Joe Darcy, who's been from Oracle, for several years until 2011. At that time, we then started to, we then shipped JDK 11, and so Joe then moved on to do other things. Kelly O'Hare from Oracle took over for a few more years, and then in 2013, I think, that was the time when Java 6 reached the end of public updates. And so, at that time, it was also the end of the road for the Oracle maintainers of OpenJDK 6 in the OpenJDK community, because once we were done publishing public updates of the Oracle J-Ri, we were also done, of course, contributing to the open source version of the Java 6 documentation. And so, what happens next? Well, people who were using OpenJDK 6 were wondering what's going to happen. Will you be, I don't know, making us all pay tons of money to get Open Source fixes? Well, what we did was actually to create a process to have transitions between maintainers in the open source projects in OpenJDK. So, just as we transitioned from Joe to Kelly in 2011, once Kelly stepped down, once we were done contributing to OpenJDK 6, we then asked in the community, hey, so we're good, we've achieved what we set out to do. Is there somebody else in the community who has an interest and capability, very important bit, to maintain OpenJDK 6 project further? And as often, somebody appeared over there, that's Andrew, and said, yes, let me do it. And so, we worked with Andrew a bit, and it took over for a couple of years, and did this job until 2016, and then Andrew was also done maintaining OpenJDK 6 as the lead maintainer as Project Lead, and said, all right, transition time. Is there anybody else who's willing to take over OpenJDK 6 updates and maintain them going forward? And then again, somebody else popped up, it was Andrew Brign. I don't think he's here from Azul, and he kept OpenJDK 6 alive for, I think, three or even, about three more years. And so, that means beyond the lifespan of OpenJDK 6, and it was initially sort of planned, it got extended by the community for a couple more years, actually, going through different maintainer changes along the way. And this is something that I think is quite unique to how OpenJDK does things. In most open source communities, changing maintainers is something that only happens when you get hit by a bus, or burn out, or whatever you do. We try to plan for these things, because sharing this possibility makes people grow. And so, putting people in these leadership roles, actually, it gives them opportunity to just grow personally, but also to actually do things differently, and maybe even better along the way. And so, to show you a bit of that, when Java 7 came out, we decided to actually make OpenJDK the reference limitation for Java. The platform. So, the specification has a reference limitation, and before Java 6, there was the Oracle Java 6 limitation. For Java 7, that's when we started doing the representation as the OpenJDK reference binary. So, that was good, because it meant that the reference limitation and the ship is OpenJDK 7 was the same thing. It also meant that, as with 6, with its complicated history of pushing things back and forward while they're different, and with all these costs that come with that, we could actually start from scratch with 7, and just do the updates of Oracle Java 7 and OpenJDK simultaneously from the same code base. So, we could actually not just move to OpenJDK as the official one-time reference limitation for Java 7. We could actually have a JDK 7 updates project where we developed the JDK 7 updates in the Open, in the OpenJDK community, and then use those as the basis to deliver the Oracle JDK updates in our case, or I don't know, Fedora updates or whatever else people did with that. So, that was a project I started in 2011. So, basically, I took over from Joe, and we did a few things differently. So, one of the things we did was we created a new project, which was different from the JDK release project itself, to be able to develop updates along with their own processes, because update releases are quite different from a JDK release. Back in the days, as I said before, two to four years for JDK release, big, large projects that have actually culminated in one deliverable. JDK updates more of a meat grinder. So, we have a three-monthly release cycle, basically. We have four CPU updates, critical patch updates per year. So, that's quite different from releasing like once every few years. This requires a very different process to accomplish. And so, that's what we then did, which we changed things quite a bit to make them work both for OpenJDK releases in the community, but also on the Oracle side of things, so that the Oracle JDK deliver, deliver could actually benefit from this process, from this code, from contributions, and actually keep contributing back to this model. And so, we did this, again, for a couple of years. So, I was the project lead for OpenJDK 7 updates for four years, and we introduced the concept of a maintainer team, so it wasn't just me doing stuff. It was a team of four or five people, including Sean Coffey, who is somewhere around here, and Rob McKenna from Oracle. And we did this for four years, and then again, JDK 7 from Oracle came to its end of public updates, as expected. And we said, guys, we're done here. It's just somebody else who would be willing to pick up the work on the updates we've done with the processes we've defined, and to take this further, if they wish so. And then somebody was capable of doing that, who has the community scale license and so on and so on, experience delivering JDK updates. And again, Andrew Haley appeared. So, he wasn't completely, I guess, wasted from maintaining OpenJDK 6, but he embraced the challenge to maintain OpenJDK 7 for a few more years. So, he did that. Whereas, in 2020, Andrew was then done and also passed on the torch. The same thing, as on the main list. And Andrew Brigham stepped up again and did OpenJDK 7 updates for a bit longer. And I put the final act to both OpenJDK 6 and OpenJDK 7 updates by doing the votes to actually archive the projects last year. So, there won't be any more updates to those, because they were basically at that point completely done. But it shows that we have a model that actually works in distributing responsibility for updates development, and that actually works in delivering something that's very rare across most open source communities. Long term maintained versions. So, and what we do at Oracle, when we were done with 7, we focused on 8, of course, on 8 updates. So, whenever a new project is created, there is basically a new update train just coming along. And as I mentioned, Sean was one of the maintainers for OpenJDK 7 updates. For OpenJDK 8 updates, Sean become the project lead. So, again, new possibilities, new set of eyes, look at the processes, new set of improvements to make to the process. And he did. And so, 8 was, as many of you probably know, a pretty big and significant release. Not just because of LAMDUS, but also because a lot of people used 8, and probably still use 8. And we kept adding features in the update project to 8. So, in 8 Update 20, we added a bunch of features in the VM, and the new port, 8 Update 40, we added a whole slew of VM features and a bunch of other stuff, Java FX stuff, whatever. Web views in 8 Update 60. Every six months, we used to pull in a lot of stuff in the JIDK8 updates project. And this is the thing that I mentioned before. That's a lot of work, and maybe it's too much work, because it leads to something that looks a bit like this. Like a beached whale a bit of a release. So, you start with the GA, and then you have this huge hump of features. You just, just one more feature to add. Just one little more feature to add. This is one more thing. And then eventually, after a few years, you know, the whale gets thinner, and at the end, just before you leave, there's just all these other leftover fixes you want to throw in as the big tail rips down. And, you know, that can work, but it's very, very stressful over time. And so, as you see here, every six months, in features, that's probably a good idea. We should do it for real releases, right? So, that's what we did for GA's. And for 8, we all did the same thing. We transitioned. Sean was a project lead for a couple of years. And then, 2019, Oracle's Java 8 went through the end of the public office process. And again, Sean asked on the list, again, is it Andrew Haley? Yes, it is. Andrew Haley stepped up again and became the open dedicated update project lead. So, that's how it carried on continued, right, for a while. And now we get to the modern times, which is a wonderful movie with Charlie Chaplin, I think. But also, what we do in the J.D.C. Update Project. So, with the modern times, as I said, we have now released every six months, so we can't really afford to maintain each one of those as the Oracle updates team forever. Because there is no such thing as the end of the public updates anymore, anyway. So, what we do, and Rob McKenna is now the project lead for the Update Project, we have a single project that just does all the updates for J.D.C. Versions after 9. And in this project, there are repositories for each different update train, like J.D.C. 11U or J.D.C. 21U or J.D.C. 17U. And these repositories have repository leads. And so, at Oracle, we lead the development of the updates for the first six months, and then we predictably step down, and as somebody else to carry the maintenance burden forward, if they so wish to do, and are capable of doing so. And so, the funny thing is, of course, that's not always the case, right? So, for 9, of course, Rob was a project lead for six months, and then nobody stepped up, obviously. Same for 10, and then for 11, hey, Andrew Haley stepped up, and became the 11 Updates repository lead, and has been so ever since, I think. Then the same for 12, nobody stepped up. For 13, then, not an LTS release, but an Azure MTS release, and the Uranus Terenco stepped up, and was the repository lead for a while, and then, as we'll stop doing that, and that was the end of J.D.C. 13 Updates, that's also archived. The same for 15, but you also stepped up, and then for 17, the LTS release Andrew stepped up, and still is active, I think. It's not correct when you said it in I believe. Oh, yeah, sorry. I'll fix it, we'll fix it in the post. Yeah, this lasts from like two years ago, sorry. I think it's more real, so we'll do 25 for those. And then for 21, 17 over there is the repository lead. So we have this model that's, I think, quite unique in the open source world, but I think works really, really well, because it's on one hand predictable, you know, when the subcontainer steps down. On the other hand, it allows others to actually step in, and continue to maintain a release for as long as they actually feel it's necessary for them. And that's not easy to pull off. And as part of that, we also have people who have been working to improve the documentation on continuity of the updates. So for example, in our wiki, we have some nice explanations of what it takes to actually get you fixed into an updates release train. And one of the funny things, and why I like to talk about process so much, is that if you set up the playing field right, good things just follow. And setting the playing field right here has been to actually process-wise get the fixes, not to happen in the updates project first, but to get them to happen on the feature release first in the JDK tree itself, where the action is. And only when they come in there, ready for that, to backport them to the corresponding update release that they're relevant for. And this avoids a whole lot of complications around, for example, only having fixes coming in to say 11, but then as we'll discover, you need them on 17 later. It just means you take what goes into 22 these days, and then pull it back to 21, 17, 11, or 8 as applicable. And that looks a bit complex in ASCII. That's because it is. And like I said, because we have a release every three months, the JDK update project is not the project you usually start out with, if you try to come up with the open JDK. It's a meat grinder. This is basically where the sausage gets made. And sausage making is never pleasant to watch. But if you do it right, the results can be very tasty. Even vegetarian sausages, so no worries. So, and just as an example how we do it, we go to GitHub, you can see the pull requests for the updates projects, and then they'll have corresponding tags on them that explain where they go. For example, this one has in its title the bug ID. If you go to the bug ID, you'll find in JIRA and open JDK, it has a bunch of labels to it. And the labels actually say for which releases this particular project, this particular fix is destined. So for example, this one was supposed to go into 17, into 8, and into 11. It was requested, and then the maintainer said, yes. He also said no when the field changes viscose. So with that, I'll say thank you for helping us make JDK better and working updates. I would encourage you to come and join us with the open JDK and do stuff with us. For that, I'll leave you to the other talks. Thank you.
Exploring Quarkus Native: Choices and Implementation
Hello everyone, I'm Fivo Zakat and today I will talk about Quarkus native and some choices it makes and how it implements them. So how many of you are familiar with Quarkus? Know what Quarkus is? Well, less than I expected. Okay, so what it is? It's a Java framework. Well, it's an open source stack for building Java web apps, so it's a Java framework that aims to bring developer joy, it's Kubernetes native, brings the best of breed of libraries and standards and supports both imperative and reactive code. And that stopped working. So what does typically a framework do when you use it? Well, usually you write your Java application using the framework, then you package it, you save it wherever you want to deploy it and you start the application. And what it does, it will load configuration files, perform some annotations, perform some annotation processing, create some metadata graphs or whatever is needed and eventually run the application. So what Quarkus does to improve that situation is that it moves part of this configuration to build time, so you only run once the configuration and setup of your application and then when you deploy your application, it starts up faster and you don't have to repeat all this process. One benefit of this Quarkus feature is that it allows you to also go native. So instead of deploying on the JVM, you can deploy a native binary. So why would someone want to go native? We have put so much effort on making the JVM very mature, very stable, very high performance, et cetera, so why would someone want to go native? Without going in too much detail, I will list some of the pros and cons of going native. So first we will start with the pros. One of the major advantages of going native is that you get faster startup because you don't have a JVM that needs to start up, load classes, do classification, warm up, stuff like this, you get faster startup. You also get close to peak performance right from the beginning because you don't do just in time compilation, everything is ahead of time compile and that gives you close to your peak performance right from the beginning. You get a smaller standalone binary. Hint here, I'm comparing with shaping your application with the JVM. Otherwise the JAR file is smaller than the binary. And you also get smaller memory footprint when running your application because you don't have to keep all this data that the JVM keeps to track internal things. And another benefit is that if you launch the same application multiple times on the same host, they can share the heap as a copy and write memory segment. Now what are the disadvantages? First of all, you get slower development cycle. Compiling to native takes more than it takes to compiling to a JAR file. So we suggest that you develop on JVM, debug on the JVM and only when you are happy with your application then move to native because that takes some time. You also get lower peak performance because when you run binary, you don't get just in time compilation. So the compiler doesn't have the benefit to profile your code and to do better optimizations. It also can perform very aggressive optimizations relying on the deoptimizer to fall back to a slower version if something doesn't go as assumed during compilation time. Another issue is that security patches require recompilation. So even if a third-party library is vulnerable, you can just update the JAR file of that third-party library and don't recompile your code. You have to rebuild your code because parts of that third-party library might be empty in your application. So you have to recompile. Your application is also not portable. You lose the right ones run anywhere, principle. So because you are generating a binary file, it will only work on the target platform that you compile for. And last but not least, it lacks behind in terms of tooling support. So debugging is not as simple as in the JVM world. And the same goes for observability. That doesn't work. Okay. Now that we have seen that there are some benefits in using native code, let's see how it works. Quarkus uses GraVM and particularly GraVM's native image to generate the binary code from Java code. And how this works is that GraVM will take as input your Java application classes, the JDK classes, and the substrate VM classes. The substrate VM is a thin runtime layer that allows your application to run on bare metal. So it takes care of some of the system things going on. Then it performs a static analysis and this will allow it to perform dead code elimination. So it essentially doesn't compile any code that you don't need. If your application doesn't reference some part of your class path or your dependencies, it won't go in the binary. So it creates a graph like this where your Java applications reference some JDK classes and the JDK classes reference some substrate VM classes and it will eventually compile it to a native binary. However, GraVM comes with some limitations. There are things that are not supported and there are things that are supported but need manual configuration. And some of the not supported parts are currently working progress. I don't have enough time to go through this. So how does Quarkus offer, what does Quarkus offer on top of that? So GraVM takes Java and produces native code. So where does Quarkus native come into play? Because of the limitations I mentioned earlier, developing native applications for GraVM's native image might be painful and that's where Quarkus comes into play. It aims to help Java developers write their application and compile it to native without having to handle all the extra things that GraVM native image requires. First Quarkus will drive all the gathering of the metadata that the GraVM needs. So what's reflectively accessed, how many JNI interfaces are used, what are the resources we want to include our binary and stuff like this. Another benefit is that most of the ecosystem, so anything that comes with Quarkus is already supported for native image compilation. So if you want to use a library that's already supported by Quarkus, you don't have to do anything special, you just put it as a dependency to your application and it should work with native as well. It minimizes the dependencies because Quarkus already does a dependency analysis before going to native, so that allows you to pass less things to the class path and it helps the static analysis do the dead code elimination. Furthermore Quarkus through annotations, APIs and some configuration properties allow you to further find the configuration of your application for native. So some might think that that's not the only framework that does that, right? So why Quarkus? Quarkus takes an opinionated approach and it's different than the other frameworks in that it will try and build time initialize all the classes, while by default, Graph VMs native image runtime initializes the classes. And this might create some issues, so Quarkus will take care of reinitializing anything that's necessary like random seeds or some platform specific values and it will also reset fields that we don't need at runtime. It also doesn't allow incomplete class paths, so when you build everything needs to be on the class path, otherwise the build will fail and this ensures that you won't get any unexpected no class defound exceptions at runtime. And class, it uses Mandrel instead of the upstream Graph VM community addition, which is based on the Eclipse Temuring Open JDK build instead of the Laps JDK build and it's specifically tailored to Quarkus and maintained by Red Hat. So how does this really work under the covers? First of all, the Quarkus will take care of generating the Graph native image json configuration files. It will perform code substitutions wherever necessary. Code substitutions allow us to go and patch third-party libraries or even the JDK itself. So if we don't like there something or if something is not compatible with native compilation, we can adapt it. It will generate some byte code that is responsible for configuring things and it will change the defaults for Graph VM native image and it will also allow the user to pass additional parameters. So for the json configuration part, it generates these five files, one for JNI, for proxy classes, for reflective accesses, resources and serialization. These are the generation of these files is handled by the classes here. So it's native image reflective configs, let's say. And it decides what to put in these json files based on the build items that exist in your application. In Quarkus, you can define the build pipeline using these build items. And earlier I mentioned substitutions. Substitutions are heavily used in Quarkus because they assist in dead code elimination and they also make sure that things that are not supported in native code are not reachable and it will throw some appropriate exceptions for that. So Quarkus performs 303 method substitutions and 32 field recommendations in a total of 208 classes. This means that you don't have to do any of these on your own. They are already handled by Quarkus and this is only on Quarkus core. If you go and use some Quarkus extension, it performs its own substitutions and stuff like this. To see an example here, here we substitute the method allocate buffer in this class and we only do that when ZSTD is absent from the class path. And what we substitute the method with is a throw of an exception that this operation is unsupported. So if you compile your code to native and it invokes this method while the ZSTD library is not available, you will get this exception. And this is how we recompute fields. So here in Bouncy Castle's easy point, we go and reset the test random field because this is a secure random class and we don't want it to be preceded and pre-initialized in the native image. But whenever we restart the application, we get different random numbers. We can similarly change the value of a field by reinitializing from an alias. That means that we can pass whatever value we want not just reset it to null. Here we change the field unavailability cause to put a Quarkus specific exception in there. And we also substitute the method is available to return false to show that OpenSSL is not supported in this specific case. Regarding features generation, this is handled by the native image features step class and it will use Quarkus Gizmo to generate bytecode. And this bytecode is used to invoke Grail VMs APIs to perform stuff that cannot be done through the json configuration. So here is a part of the native image features that we generate. And what it essentially does is that it invokes first it gets the method descriptor for the runtime class initialization.initialize at build time method. And it will invoke this method passing it a string array with the empty string. This instructs Grail VM to build time initialize everything, which is different than what it does by default. And we can also parameterize the options that are passed to the native image build. And we do that in the native image build step. And here we see part of it. And what it does is that it always enables allow fold methods, which is off by default. It makes our application headless by default. It doesn't allow the creation of fallback images because fallback images are essentially JVM lancers. So you don't get the native application that you asked for. And we also always ask it to link at build time. And that concludes the talk. I would like to acknowledge that Quarkus participates in the IROEU funded project. And I'm ready to take questions, if any. Any questions in the chat? Yeah, the custom class loader is a bit tricky because Quarkus. The question was whether Quarkus also supports the standard JDK instead of Grail VM JDK. So this is the first part of the question. And the answer to that is yes. This is Quarkus native and this is optional. This is only if you want to go native. If you want to stay on the JVM path, you can use any JDK and it will work just fine. Now to the second question about custom class loaders. Although I'm not very familiar with that, I think that this might be a bit tricky because Quarkus already uses custom class loaders. So you have to make sure that they are somehow compatible. I couldn't hear the question, so. Okay, you find out a library and you wonder whether you can use it or not. Okay, if the library is supported by Quarkus itself, you will find it listed in the Quarkus supported libraries or in a Quarkus extension that supports this library. In that case, everything should work out of the box and you don't need to do anything. In the case that your library is not supported by Quarkus Core or any of the Quarkus extensions, then you need to use some of the trickings that Quarkus does to make it work. And Quarkus gives you some APIs and annotations that may assist you. Let's see that. There is a website like supported libraries that I can go to and have a look. I think if you go to code.quarkus.io, then you can see a list of supported extensions in libraries. Do we have time to get some more questions? One more question. Sorry. I was wondering if Worker's Native works with GNI-based providers, sorry, the provider interface, not GNI. The foreign API? No, no, sorry, like classes discovery when you want to load a specific service, SPI, that's the name, sorry, the service provider interface. I think I don't know. Okay, thank you. Okay, for the rest of the questions, please feel free to approach me on the break. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you.
Project Lilliput - Compact Object Headers
Hello? Yes. Ah, okay, let's do it again. Hello, hello friends. I'm Roman. So let's start with an overview. I'm scratchy too, right? Okay. A little overview and motivation why we are doing all this. So what is Project Lili put? Let's start at the beginning. It's an open JDK project. It's a community project. We get contributions by Red Hat, Oracle, SAP, Huawei, Alibaba, Amazon, that's me. The goal of that project is to reduce memory footprint of Java object headers. We have a lot of applications in the Java heap. It's a side effect of that goal. We will also see potential CPU and latency improvements. Specifically, we want to achieve these heap memory reductions by reducing the size of Java object headers. To illustrate this, let's consider a hypothetical very long button, I guess. So let's look at this hypothetical Java heap. Each of these squares represents a Java object with different sizes. And in such a heap, you would have like everything that is yellow here is object headers. This is really just metadata for the VM, like class information stuff, I will get into this. And this means we waste quite a lot of memory on just this metadata in the heap. With Project Lili put, the goal is, so if we consider each of these half of those squares, one word, we want to reduce these cut in half, might end up looking like this, with much less metadata for each object, and we free up a lot of space there at the bottom. So this is the savings. Another way to look at this is to see the breakdown of metadata versus actual payload. And I did some statistics when I started this project, so around 20% of live data on the heap is actually only metadata in the object headers. This depends a lot on the workloads. Some workloads, they have much more of this. Some workloads, they have much less ratio here, but this is kind of an average. So this is the current situation where we have headers that are 12 bytes large. With Project Lili put, in the first step, we want to go to 8 byte headers, which means that the ratio of metadata here is only like 13% on average. My very again, average savings of around 7%, 10% up to maybe 30% or even a little bit more percent here. The long term goal with Lili put is to have headers that are only 4 bytes large, and if you achieve this, then only 6% of live data is object headers and savings up to like 50% compared to the current situation. Yeah. We did that already. We did most of the work already and we run some services at Amazon already with Lili put, and this is the heap usage of the service after GC. When Lili put was deployed, it dropped by about 30%. I must admit that this is a bit of a sweet spot case. Not all workloads look like this, but it shows quite well how this works. As a side effect of this, the CPU utilization also dropped by about 25% here when Lili put got deployed. Is this still on? Okay. This was quite significant. This has the effect on latency too, latency drops by about 30%. Some percent. So, motivation part. Helpfully, Alexi did the GLL thing. So, if you have a workload and want to know how would it look like when running Lili put, you can generate a heap dump with your workload and then run it through GLL tool, which is here, and you can get some nice estimates of your heap utilization with different configurations, including Lili put. To tell you, you would say, I don't know, such and such percent with Lili put, and you can even do the crystal ball and look into the future and tell you about Lili put too. That's because the heap dumps don't actually include the metadata. The GLL tool needs to do a lot of calculations and estimations and get up and come up with something, but it's pretty useful. So, yeah. At the end of the day, what does it do? It reduces your hardware or cloud cost. You can save on instance cost here. Or you can do the same hardware and drive more load on it. You can help reduce your energy bills and save your tool for the climate, but this is all good. So, yeah. Let's look into what's going on there. The object headers. This is a breakdown of the current situation. So, we have a one word at the beginning of each object, which we call the mark words for historical reasons. It's really just a header that contains stuff. And we have 64 bits and two bits here. They are tag bits or lock bits. They indicate what the rest of those bits mean. Usually, they mean we have GCH bits here, four bits of GCH 0215. And then we have hash code bits. When you call the identity hash code of an object, this will generate some number and stick it here. And then you have some unused bits up here. Okay. And the next word, the second word of each object contains the class pointer. The class, so-called class pointer, points to the internal structures in hotspot that tells you all about the class of this object. Usually, what you get with modern JVMs is the compressed class pointer, which is 32 bits. I don't want to talk about the other one. Thomas is going to talk about a lot of details, how to deal with that. But this means that the first two words of each object are taken by metadata. For many objects, we can start sticking actual payload in these other 32 bits here. For arrays, we can put the area length there. But what we want to do is you can probably see this already. We have the class pointer here. We have free space here. We can probably stick it in there and then see how to deal with the other bits there. Problem is, what we also have is the so-called displaced headers or overloaded headers. This happens when locking happens or the GC does some things. And then the tag material indicates what the rest of those bits mean. And usually, it's a native pointer to some structure. And this means that those bits up here, they are not actually always unused. They are only sometimes unused. And this is a major problem. I get into how we want to deal with this. We still have to compress class pointers there. As an example, when this happens is when we have a stack-locked object, then the tag width is 0, 0. The rest of it is appointed onto the stack. Then we have objects that are locked by an object monitor. Then we have those tag widths. And this is appointed to some object monitor structure that lives somewhere. Or the GC can use this for storing forwarding information. And it points to then we have this tag with here, 0, 0, 1, 1. And it points to the forwarded object in somewhere else in the heap. This may look like this. In the example of an object monitor, we have exactly the situation that I just talked about. If that happens, where's the original mark? We'll say we have some hash code here. And so we need to preserve it somewhere. Its answer is usually it's in the, for example, in the object monitor case. It's displaced in the beginning of the object monitor structure. This is fiddly stuff. So the plan is to get here for the first step that we are currently working on. So we would still have to tag width here. Or maybe not. Then we have another bit here that says about self-focusing. Then we have H widths here, some unused stuff. And the identity hash code and a few bits for the class pointer. If you count this correctly, those are not 32 bits like we had before. Those are fewer bits. Thomas is going to talk about how we do this. The long-term plan for lilyput, we call this lilyput 2 already, is squeeze everything into 32 bits. We still have those tag widths here. It says forwarding point that you see H. It's unchanged. The hash code will fit into two bits. This is going to be a very bad hash code. No, no, no. We will leave it. And you still have the class pointer up there. What's the problem with all this? So in the current situation, we have two words. In the first word, it rarely carries any interesting information. So it might carry hash code information, but it's only, usually happens for few objects. It may carry locking information, but only very few objects ever get locked on the heap. So we basically waste this word on stuff that is rarely used. And in the second word, we have this class pointer, which is very crucial information for any object, because we need this for all sorts of things. If we don't have this, then this object doesn't have a type identity, which is, yeah, we may not do this. So this is part of the problem. In the new world, this class pointer is part of the header, which means that suddenly this header has crucial information that we must not lose. And we always need for things like figuring out the object size or something like this. So we must never lose that class pointer there. And also, this header displacement that I talked about earlier means that how do we access this information when we need to follow through the actual mark word? Those are the three problems. How to fit everything into fewer bits, how to safely access this mark word, and how to avoid losing the class pointer to begin with. I don't have all that much time, so I can only scratch the surface of these problems. I will give a very high level overview of this. The first is locking. When locking happens, then, well, let's look at, for example, stack locking is the most lightweightest locking in hotspot. It's really just a compare and swap thing on the object header. Oops, this was too fast. Let's go back a little bit. So it coordinates threads by comparing and swapping on the object header. I said this. It doesn't, as soon as contention happens, we inflate this stack lock to an object monitor. It doesn't support weight notify. It doesn't support JNI. And any of this happens, we inflate this to what we call a full object monitor. This is why we have those two different locking modes. The way that works now is when that happens, the JVM does a compare and swap on the object header, and it basically exchanges the current header with a pointer to the stack, and it sticks the original mark word somewhere on the stack. And this is how we can basically find the original mark word and restore it when we unlock. Then we do the opposite thing and swap it around. It does answer the question, is this thread locked by object E, by this object? It cannot answer the question, which thread is currently locking this object because it's not really necessary. In little bit, the way we solved that is, and I have only one slide of this, which is quite amazing because I basically rewrote the locking implementation there for this. But what we do there is instead of putting the original mark word on the stack and putting a pointer on this location in the mark word, we basically turn this around and say, OK, let's have a small lock stack, what would we call it, lock stack, and we basically only push the object on that lock stack, and then we still do a compare and swap to flip a single bit here, but nothing else. The rest of the mark word stays intact. It still answers the same question, is this thread T locking this object and nothing else? So this is the very scratch-to-service overview of this stack locking. I cannot go deeper because time is running out. I need to hand over to you at this point. Yeah, let's look at the monitor locking. It's a very similar situation. All engineers are currently working on a solution for that. The basic idea is to, instead of doing the mapping from an object to the object monitor through the header, we have a side table and do the mapping there, and then this means that we don't have this header displacement anymore. Also, only scratching the surface on GC forwarding. Some GCs need to store forwarding information in the object header. This is used by those full GCs here, and zero GC, G1 GC, and Shenandoah, they use the header to store forwarding information, and this means that when that happens, we would lose this class pointer and this crucial information because this is why this is a big problem here. For normal operation, it is not such a big deal because there we have copying GCs where we first create a copy of the object and then stick the forwarding pointer only in the old copy, which means that in the new copy we still have all this information, and we can follow through to this new copy to get to the class pointer. The full GC, because it has no space to copy the object, it slides the objects down to the bottom, and in the process of this, would lose this class information. The parallel GC has a different solution here based on the idea, let's not store this forwarding information in the header, let's always re-computers in some clever ways, and this is basically the plan for what we want to do for the other GCs too. So that GC doesn't have this problem because they don't do full GCs. And with that, well, we have a JAP-450, we have this new lightweight locking, it's already in Jadikate 21, the flag to enable this, when you grab a build from Alexi or some other place, is this flag, and I think with this, I'm handing over to Thomas, he wants to talk about the class pointers. Oh, you mentioned it. Oh, okay. Does this work? Oh, it works, wonderful. Yeah, okay, let me switch it. Class pointers, and... Wait. I'm gonna start. Okay, I'm Thomas, I work at Red Hat, and I'm going to talk about what we do for class pointers in Lilliput, and this will be a very quick dive because I only have 10 minutes, and this is a bit of a challenge. This is just one, but one of... Oh, this is cool. This is just one of many moving parts in the Lilliput project. Lilliput is really a real community project in the sense that many companies contribute, and this is the part we decided to tackle at Red Hat. And with that, let's start, time's ticking. So class pointer, the class pointer currently in the Lilliput project is 32 bits, and that's way too much, we need it to be smaller. It takes up like half of the whole space of the 64-bit Lilliput object header. And so, okay, that's pretty self-explanatory. So some background first. When we load up a class, a Java class, we build up a whole range of companion data structures in native memory, and the centerpiece of this group, and kind of the big boy among them, is the class structure written with K. It's a variable-sized structure ranging from 400 bytes to, I don't know, we saw five megabyte monsters at Amazon, but that's rare. And every object's header refers to the class structure of its class. That's why the shape of this reference really matters, because for one, obviously, footprint, and because also we need to be able to dereference it very quickly. Going from object to class is a hot path. So we could just store the native class pointer in the object header, but we usually don't do this, because it's 64-bit way too big. So since a long time ago, we already employ this optimization where we split the native pointer into two parts, a 32-bit offset that refers to a 64-bit runtime constant base, and we only store the offset, and that's what we call the narrow class pointer. And that trick only works if we are able to confine all class structures within a four-gigabyte range, obviously. And that's exactly what the class base is for. That's the only reason it exists. And we also have CDS. So CDS contains class data sharing. CDS archives contain pre-baked class structures. And so what we do usually is we map the CDS archive very close to the class base, such that both are engulfed by the class encoding range, and every class structure in either region can be reached by a narrow class pointer. Okay, and so decoding is basically just an addition. I'm a bit simplifying because of time reasons. And the nice thing is from the point of the JIT, the encoding base is a runtime constant. It's obviously not a constant because we determine it when the VM start-up. It's subject to ASLR and some such, but we can encode it, the JIT can encode it as a 64-bit immediate. That is nice, so we already save a load. And then we have a ton of optimizations that basically all depend on the base looking good. Whatever looking good means for the specific CPU, such that we can load it with just one move. I won't get into details. One simple example is if we manage to place the class base below four gigabytes, then we can set the base to zero. This is what we call unskilled encoding. And now basically every narrow class pointer is the class pointer, so we don't have to do anything. And we are now also very good at it. The one effect of the Lilliput work is that unless the address base is really populated or unless the operating system just flatly refuses to do this, this is likely to happen. Okay, for Lilliput, 32-bit is still too much, so we shrink it. There are some side goals. We still want to be able to address enough classes, whatever enough means, that is a complicated question. And we also want to basically keep using metaspaces and CDS, because both give us a ton of features we would need to reinvent otherwise. And we also decided to, for now, keep the class base layout as it is. Now, as a kind of a bar, we can load about five million classes, give or take. The class base is artificially kept at three gig. And I personally believe if you manage to load five million classes, you are either very patient or probably not really aware of doing it, because it's a leak. And I think this number is really high. Question still means is how many classes do we actually need to address. And this is a very complicated question, and I don't have much time. Therefore, we kind of decided to sidestep this question. We said, okay, we don't reduce the class encoding range, it's still four gigabyte. We leave it at that. And we say anything in multi-million range is probably fine. We can also, there's still room to reduce it, but we don't for now. What we do instead is we increase, if we manage to store all class structures at certain aligned addresses, then we can use this alignment shadow to save other stuff. And that's what we do. We decided on a 10-bit alignment because one kilobyte statistics. Even though class is very precise, it's the vast majority of class structures is below one kilobyte, usually larger than 512 byte. It's a bell distribution, so it's really much larger, outliers exist, but are really rare. And 10-bit give us 22-bit class pointers for now, and let us address three million classes. And that's still way enough, I think. So now something interesting happens, where before we used to store class structures in the class space back to back, we now store them at aligned boundaries. And this means that the narrow class pointer now is more like a class ID, because the class label morphs into a table of one kilobyte-sized slots, where every class structure occupies one or very rarely multiple slots. And we use the value range of narrow class much more efficiently, because every value now means basically another class. So it's a hypothetical 16-bit class pointer can address 65,000 classes. And that's good. Obviously, that hurts. We have alignment raised now. And for time reasons, I can't go into this. The gist of this slide is that we made metaspace very good at allocating at aligned boundaries without footprint loss, so we don't pay for that. And we still retain allocation performance as what we pour. This actually did cost quite a bit of work. Some supporting statistic, I skipped that. This is Mark Berthley out now. But before we had a 30-bit class pointer, we now reduced it to 22-bit that allowed us to inflate the hash back to its former 31-bit size. The story behind this is when we started with Lilliput, we had to reduce iHash to 25-bits, which of course has negative consequences for applications with large datasets, hash collisions, and so on. And the nice thing is we now have four free-bits. I'm sure we find a use for them. And there are some kinks still to be ironed out. Very quickly, hyperaligning class structures, any structure to sizes larger than cache line size may have detrimental effects to cache efficiency. I will need to look into this. We have mitigations planned if this should happen. There's also 30-bit. We are now in the weird situation that 30-bit class pointers on 30-bit platforms are larger than a 64-bit. And this is, we can deal with this. I kind of hope that 30-bit is going away. Before I have to do this. Future, 16-bit is possible if we just naively reduce narrow class pointer to 16-bit. This would be a severe reduction in the number of loadable classes and probably not acceptable. What we can do, however, is we can switch to a model where we have maybe a variable size tether where objects of the first 65,000 classes would benefit from lilyput and have lilyput header objects beyond that would get the narrow class pointer appended to the mark word. This, of course, is a lot more complex than what we do nowadays. But this is possible, maybe we have to do this. Should we ever do 30-bit lilyput headers as we have planned? This is basically not my idea. This is John Rose's idea. Okay, as a summary, where we are now, 10-bits we freed, restored IH to 31-bit, which is nice, that had been a blemish in the current lilyput implementation. And at the cost of a reduction of the number of addressable classes from five to three million, still a completely fantastic number. And the decoding is now more complex because now in addition to the addition, we also need to shift. There are some side effects trickling down. How are we time-wise? Oh, okay. Some side effects trickling down. We improved class pointer setup for the stock JVM. These are improvements that are already rolling out with JK22. And yeah, that basically was it. Thank you very much. And... Thank you. We take questions. Hello, Ren. Thank you. Thank you. Very nice job. Just a question. Is it worth to use some compression maybe to store these maps? Oh, sorry. Is there any gain possible to use some compression about these external maps or things like that? Maybe objects are not moving too much, or maybe you can compress these address or hash maps? I don't know if it's worth because it... analyze the pointers that may be big, then you have large, let's say, remember set of pointers somewhere or flags and maybe you have some compression here, may save and, but of course it's a trade off of performance between memories, so, depend on the use case. Thank you. So the question was, do we have any plans to like, optimize or address hash map implementations? You have to get more compact structure that we can delete from such hash maps. Ah, okay. Oh, this is not part of the scope of the report, but there may be other efforts. Yeah, but I don't know. Thank you. Well, you can go. Okay, thanks. I answer this quickly. The question was, when can we expect to see this in available release? Questions, I don't know yet. We have most of the stuff lined up already. I'm saying 24, but don't sue me on that. Can you just upstream this a bit tricky? Yeah. It's a whole separate effort. Yeah. I had a question whether it was possible, well, or was there any plan to actually make those tisings configurable? My thinking is that I would imagine maybe it's naive, but in some applications I worked on, classes are really, really simple and you don't use many of them. So you would benefit from having extremely small, you know, address to, like a small address space, basically, and smaller points or as even. I didn't quite, what's the question? The question is, is it possible to have configuration ergonomics for different sizes of class pointers or is it fixed by the JVM? Oh, possibly we debated that at some point in time. There are some advantages of keeping them constant because you get more efficient. But yeah, I'm not even sure we planned this as a development switch. Maybe it's undecided yet. Okay. No more questions? Okay then, thank you. Thank you.
An in-depth look at JFR in GraalVM and how it compares to JFR in OpenJDK
Hi everyone, my name is Robert Swenaga and I work at Red Hat. Today I'll be talking a little bit about JDK Flight Recorder in Gravium Native Image. And from now on we'll just refer to JDK Flight Recorder as JFR. So as a high level breakdown, I broke in this presentation to two sections. The first section is a high level overview of JFR Native Image and then we'll go into a low level deep dive of JFR Native Image and talk about some comparisons between substrate VM and hotspot. And I want to make note that even if you're not interested in Gravium Native Image at all, you may still be interested in the second half of this presentation because the details of JFR are going to be talking about there extend beyond just native image and also apply to hotspot more generally as well. Okay, so as a very quick refresher, JFR is an event-based monitoring and profiling tool. It's built directly into the JDK and it can give you some really valuable insights into what your application is doing both at a high level and also at the VM level. Okay, so Phoebus already talked about this a little bit, but Gravium Native Image is essentially a technology that allows you to convert your Java applications into binary executables. The appeal of this is you get much faster startup and use less resources and a big reason for that is you don't have to warm up your traditional JVM alongside your application code. And how it works is you compile your Java application to bytecode like you normally would and then you run the native image tool to convert that bytecode into your executable which you can later run. So why is JFR different to native image than in OpenJDK? The reasoning behind this is that a native image executable doesn't require a traditional JVM to run, however it still requires certain runtime components that your Java code expects such as GC and synchronization constructs like monitors, for example, and what's providing that in native images is something called substrate VM, which you can think of as sort of like a scoped down replacement for a hotspot. So it does a lot of the things that your Java code requires, but strips out a lot of the dynamic stuff that hotspot does that we don't really need in this environment. And the key here is that since a lot of the JFR code is embedded within hotspots, when we transfer it over to native image, we're using substrate VM so it has to be re-implemented in that VM instead. So that involves everything from the low-level JFR event instrumentation to the actual infrastructure that varies that JFR data from the point of instrumentation to the point where it's later consumed by a user. Yeah, so in terms of the current state of JFR support in native image, you can do things such as starting and stopping recording from the command line or from within your application code via the recording API. Several events are implemented, especially at the VM level. We have events for threads, monitors, allocations, you see, save points, etc. You can dump, snap, shot, to disk and inspect them with tools such as visual VM or JDK mission control as you normally would. The custom event API is also working, so you can create your own custom application level events. Stack traces and CPU profiling are also possible. Event streaming has recently been added as well. You can also even connect via remote GMX to flight recorder MaxBean, which practically means you can do things like from within the JMCUI, interact with JFR recordings that way, start them and manage them on the fly. How you might first interact with JFR in native image is at build time, you specify that you want the enable monitoring flag, specify you want JFR specifically, and that builds the JFR components into your executable. So then at runtime you can use the normal start recording, start flight recording option and pass all of the normal parameters that you would require, such as specifying a file name to dump the recording to or a duration, etc. There are still quite a few limitations to JFR native image. So not all events are implemented yet. It's an ongoing effort to keep up with open JDK in that area. Specifically, events related to bytecode instrumentation are not yet supported and of course some new JDK events we're trying to keep pace with that as well. Event streaming doesn't yet support stack traces, so that's one limitation of that. And we have a couple things that are in the review pipeline as well and are not yet supported in any release. That said, we've reached the deep dive, which is going to take up the majority of the presentation. And yeah, let's take a deep breath. So this road map essentially represents a very high level zoomed out view of the flow of JFR data through the system. And from now on each slide is going to contain this road map and the highlighted part will indicate the part that we're currently talking about just for convenience and easy reference. So firstly, the point of instrumentation. These are various points where JFR events are made at either an application level code or a VM level. And the screenshot on the slide is just from JDK Mission Control. I'm just using it to show some content that an event may contain. You can see there's a bunch of fields and corresponding values. And this is just one example. It'll vary by event. And you can think of JFR events as the primary thing that we're concerned with really. And the rest of the slides going forth are basically just piping to get that JFR data from the point of instrumentation to the chunk file where it can be consumed later. So yeah, speaking of chunk files, we're jumping all the way to the end of the road map. So chunk files are essentially the resting place of the JFR data as far as we're concerned for this presentation. And they must contain basically the same information, the same format regardless of whether OpenJDK or native images generating them. And they can be dumped to snapshots, the JFR snapshot which is the .JFR file format. And that's usually how people are going to interact with them via JMC or Visual VM or the JFR command line tool. Yeah, so chunk files are self-contained and they have four distinct sections. You can see in the diagram here header which contains pointers and other metadata. There is the event data section which contains the core JFR event data. Then there's the metadata section which describes the format and layout of the events in the event data section. And then we have the constant pools which contain constants which are referenced from the event data section. So the constants, in order to reduce the size of JFR data, we use a referencing ID scheme to increase compactness. And how this works is entries in the event data section of the chunk file will use unique IDs to reference into the constant pool section of the chunk file. And this helps with deduplicating the actual constants that are used by the JFR events. So in this slide you can see there's an example of one event entry which uses the unique ID 12 which is then going to be used to index the thread constant pool and reference the actual thread data residing there. So all this increases the compactness of the JFR data and what that does is it reduces overhead when dealing with it while it's in flight and when writing it to disk. It reduces the overall chunk file size as well. However the downside of this increased compactness in this referencing ID scheme is that we have a tight coupling of the event data and the constant pool data so that if they're ever separated and not found in the same self-contained chunk file then we can't decode the event data section and it's basically unreadable. So that's when down side. Right, so now that we've talked about the very beginning and the end of the road map we'll jump and fill in the middle. So now, so after event emission the JFR data splits. So the event data, the core event data goes to the JFR thread local buffers while the constant data goes to the constant pools. And in both hotspot and substrate VM the JFR thread local buffers essentially have the same purpose and same structure. So their structure in a segment way that allows for concurrent rating and reading of data and there are various pointers which define the sections. So there's the rate position pointer which basically determines where new data is written into the buffer. So when the event rate is in progress that's the pointer that's going to be in use. Then there's the committed position pointer which represents the end of the committed data section. And the committed data section is data that has been fully written so it's not an in-progress rate but it hasn't migrated anywhere else yet. The flush data section is essentially committed data that has been migrated somewhere else so it can be overridden at the earliest convenience. Eventually the buffers will fill up with committed data and will have to be flushed elsewhere and at that point all the pointers reset back to the start position. Hotspot is a little bit different in that it uses buffer pools to recycle buffers. So there's a live list and a free list and when a new thread requires a T.O.B. from JFR one will be taken off of the free list and put on the live list and vice versa when that thread goes away. But in such a threat we have it a little bit simpler. We just allocate a native memory, a thread local buffer when it's required and when the thread goes away we destroy that memory. So we don't really have to manage access to these buffer pools and maintain them. Right, in the case of virtual threads, multiple virtual threads may share the same thread local buffer of the carrier thread and that's not really an issue because each one has exclusive access at any point in time and the JFR data is eventually going to the same place anyways. Right, so after the thread local buffers fill up they are migrated, the data is migrated to a set of global buffers and the global buffers essentially act as a capacity for overflow storage and it's more efficient than increasing the size of all the thread local buffers because not all threads will be equally as busy with respect to JFR events. Right, so constant pools. Previously we mentioned how constant pools use a referencing ID scheme to reduce the size of JFR data and they do this essentially works by deduplicating constants. In a hotspot the deduplication works, one way the deduplication works is by using JFR specific bits and the metaspace data for certain constant types such as class with a K and also methods. So these JFR specific bits act essentially as Boolean toggles so when an event data reference from in a JFR local buffer somewhere references a constant that bit in that constant is flipped to indicate that it's referenced somewhere that way when it's time to actually persist the constants to disk we only have to persist the ones that are actually referenced not all of them. Additionally if multiple events reference the same constant that bit is only flipped once and that's only used to be written once so that's where the deduplication happens. There are some constant types such as stack traces that don't have metaspace data and those cases a lookup table is instead used for the deduplication and tracking and an interesting thing is in substrate VM native image there is no metaspace at all so we have to rely on the lookup table approach for all the various constant types. Right, so after enough JFR data has been generated a chunk rotation must be requested and what this is is essentially the way that JFR data is continually persisted to disk. The current chunk file and disk that's open is sealed and then a new chunk file is opened and in that process all the in-flight and memory data is flushed to that chunk file before it's sealed and the thread that's performing this the chunk rotation must flush the thread local buffers of other threads and to do that safely we have to request a save point. So the order of operations at a chunk rotation save point is as follows on the slide I want to make note that it's pretty similar in open JDK as it is in substrate VM and the space between chunk rotation save points the recording time between is called an epic and you can see in the green save point box that that's where we're actually flushing the JFR buffers both local and global to disk but the most interesting thing here is that we're writing the constant pool to disk outside of the save points when we're already starting epic 2 so what that means is we'll we're simultaneously writing the constants from epic 1 to disk while recording constants for relative to epic 2 so they're kind of mingling inside the constant pools so we need to keep them isolated however because we want to avoid writing constants perspective to epic 2 to disk into chunk file for epic 1 otherwise we'll have that mismatch and we won't be able to decode the data for constant for epic 2 the same issue that I explained a few slides back so how we do this is we tag each constant according to the respective epic to keep them isolated and essentially overall the more of the story is it allows us to reduce save point pause time by writing these constant pools outside of the save point and another way we actually reduce save point pause time is by having a dedicated JFR thread flush the global buffers to disk periodically throughout the epic time so it's not actually happening in the save points so there's less work to actually be done when we are stopping the worlds to flush the buffers to disk right um um one related note on save pointing is the question of can a chunk rotation save point interrupts concurrent event emission that may be happening in other threads so we have a scenario here where the save point actually and save points and epic transition actually interrupts the event emission and separates the constant data and the event data into different epics and different chunk files and then it will be unreadable then so that's a scenario that is in question right now um and in j in open JDK in hotspot the JFR code is written in C++ it's native code so it can actually be interrupted for a save point so it's not really an issue at all however in substrate VM it's Java on Java and the VM code is written in Java so the JFR stuff is Java code and potentially could save point at a very inopportune moment so how do we prevent that stuff from happening in substrate VM um how it's done is we have this annotation called an interruptible and what that does is that build time prevents the insertion of save point checks so that the code that's in the annotated with an interruptible annotation doesn't actually save point at all so you find that a lot of the JFR code is sprinkled with this annotation all over the place in the VM especially dealing with buffers and constant pools and event writes but this has pretty big consequences for the implementation itself because un-interruptible code that can't save point can only call other un-interruptible code that can't save point which means a lot of the JDK code that's written in Java is off limits so we can't use things like the normal hash tables, re-entrant locks, etc. we have to kind of like roll our own versions of that which are un-interruptible another thing is we can't even use manage memory on the Java heap because that can induce a garbage collection which requires save point and that's not un-interruptible so we have to use unmanaged native memory in order to craft room data structures to deal with a lot of these things so it's a little bit of work dealing with that and the last thing I want to talk about and the last difference I want to mention between JFR and substrate VM and hotspot is related to how JFR interfaces from the the Java level JFR code to the VM level JFR code and in open JDK it happens in the JVM class here you can see on the left side of sorry the right side of the slide and these are basically the points where the Java level JFR code and the JDK calls down to hotspot at the VM level using JNI so we reuse that code in native image we reuse that Java level JFR code from the JDK but there's no underlying hotspot implementation to call into so how do we resolve that mismatch what we use is we use substitutions which Feeb has talked about a little bit but I'll mention again but essentially what it does is allows us at build time to specify redirects from these Java methods to our own implementation the JFR VM level code so on the right side you can see mark chunk final is highlighted and that corresponds to the Java level code on the left sorry on the right I keep getting mixed up on the right side of the of the slide so we can see that we're actually grabbing that and then redirecting it to our own substrate VM base implementation of that code so that's how we kind of resolve that mismatch um yeah with that said um that basically concludes my presentation if you're interested there are further links for for more reading there's some documentation and some blog posts as well and you can always approach me as outside as well if you have more questions um yeah how good you are for time Chris okay if there's any questions I'm happy to answer them now you just did such a good job explaining it thanks yeah on on substrate VM is there did you measure impagant time to save point because if is it uninterruptible you know this uninterruptible trade oh time to save points yeah yeah I could imagine yeah um I'm not really sure of the exact figures I can't really give you a number but um I I know what you're saying it it would potentially an issue I haven't not really aware of it um but yeah that that's definitely a concern um but it's not just the jfr code that's marked as interruptible a lot of the gc code as well a lot of the low-level operations they they must also be uninterruptible so it's not just jfr yeah understood thanks yeah actually to tag on to that a lot of jfr code is really just instrumenting other low-level code which is already an uninterruptible so it's like collateral damage it's not really an issue to add a little bit more on to code that's already an intructible such as uh jfr gc event handling and uh slow path allocation stuff that's already you can't save point there anyways thank you okay okay uh thank you for listening
Foreign Function & Memory API
Can you hear me? I think so. It's working but not in a loud kind of way. Anyway, I have a loud voice, so that's not a problem. So I'm happy to be here. I was here four years ago with everything that happened, and I gave a talk on foreign memory API, and it was an incubating API in Java 14, I think. So I'm happy to be here now to talk about the foreign function of memory API, which is a finalized API in the upcoming Java 22 release. So why did we do this API? The main reason is that the landscape around Java application is changing rapidly. With the rise of machine learning, it's Java developers often need to do tasks that they necessarily didn't have to do before, such as talking to highly optimized linear algebra library that are not written in Java, they are written in C, C++, or for trans sometimes even. And the only way to reach to those libraries sometime is just to reach into native code directly. So these libraries will not be ported in Java, most of the time because they keep changing. So a new library pops up nearly every month with a new kind of idea in order to do offloading of computation to the GPU. So how do we talk to native libraries in Java? We do that with JNI. How many of you have used JNI in this room? OK, fair number. So good audience. With JNI, you can declare native methods. Native methods are like absurd methods in the sense that they don't have a Java method body, but they have a body that is defined somewhere else in a C file or a C++ file. And it can be C, C++, even assembly if you like to play with it a little bit. JNI is flexible, but it has a little bit of issues in the sense that it's what we call a native first programming model. So it pretty much focuses on giving you access to Java functionalities from the native side of the fence. So when you write in JNI, you realize that quickly you are basically shifting all your computation logic from the Java world to the native world in order to minimize the number of transitions back and forth. And that can be a problem. There's also no, I guess, idiomatic way to pass data to JNI. Yes, you can pass objects, but that has an overhead. Sometimes a lot of developers end up passing logs as pointer, as opaque pointer that are stored in some Java objects. And that kind of works. So the problem with native function, as I said, they never exist in isolation. They always have to manipulate some data. And this data is often off heap, of course. And there are not very many libraries in the JDK that allows us to do off-heap memory access. One of them is the DirectBuffer API. So probably you are familiar with DirectBuffers. They can be passed to native methods. And there are some JNI functions that allows us to, for example, get the pointer that is backing a DirectBuffer. So that the JNI code can manipulate the buffer directly. One of the issues with DirectBuffer, perhaps the main one, is that there is no deterministic way to free or unmap a byte buffer. So if you are done using your off-heap memory, you basically have to wait for the garbage collector to determine that the byte buffer is no longer reachable from your application. And that can have a latency cost. There is also a problem in the addressing space. The byte buffer API was born in 1.4 time. So quite a few years ago. And we only use ints as offsets there, which means the maximum addressable space is 2 gigabytes. With minus 1, yes. With the advent of persistent memory, these limits are starting to be a little bit tighter. Also, there are not many addressing options provided by Buffer. Either we go on the relative addressing scheme where basically we say, put in, put in, put in, then we rely on a mutable index on the byte buffer to keep track of where we want to store the bytes. But that's low because we have to mutate some state and then situ optimization have a little bit more trouble coping with that. Or we go fully explicit. And so we put offsets everywhere in our code. And that makes our code a little bit more brittle. So this is what happens when you want to access an IT library. You have a client. You have an IT library. You have some JNI Goop in the middle. What's inside the JNI Goop? Well, a little bit of everything. There are some native method declarations in the Java code. Then if you compile this code using Java C dash H, you have to use a special option, which will generate the site headers file that you need in order to implement your C JNI function. So you go over to C, you implement your JNI function. You compile that function, the C file, which is your client compiler of choice. You get back a shim DLL. This DLL is not the library that you wanted to talk to in the first place. This is just some extra glue code that you need in order to get to the library that you want. So now you have two native libraries, the one you want to talk to and the JNI DLL. And that's a little bit suboptimal. So what we need instead is a Java first programming model. So something that allows us to reach into native functions directly only used in Java code. We also need, since we want to model off-it memory in a more sane way, we need a replacement for the by-buffer API, something that is more targeted at the use cases that FFI has. So we want deterministic allocation. We want bigger addressing space. We want better ways to describe struct layouts so that we can access memory more easily. And also we want to tie everything together. So we want to define tools that allows us to automatically generate bindings for native library in one shot. And we'll see a little bit about that later. Ultimately, our goal is not to replace existing frameworks, such as JNA, JNR, for example. I think Charlie is going to talk about that maybe later. But to help some of those frameworks to overcome the workarounds that they have to keep doing all over again, because they don't have a proper API to deal with pointers. They don't have a proper API to free pointers when they are no longer used. And so hopefully some of this stuff is going to come handy in those cases, too. So Panama is not just about the foreign function memory API. Of course, that's a huge part of Panama. But Panama also contains the vector API, which is an end API to access SIMD computation from Java code directly. But there's also Babylon, a project that recently sprung up, which allows us to see what's inside the body of a Java method with a nice IR that can be introspective using Java. So what can you do with Babylon? For example, you can take a Java method that contains a loop, for example. And you can inspect that loop. You can turn it into a GPU kernel. And then you can use FFM to dispatch that kernel using CUDA to the GPU. So Babylon and FFM kind of comes together and provides us a better and more robust solution in order to do on GPU computing. The main instruction when it comes to accessing memory is called memory segment. That gives us access to a contiguous region of memory. There are, of course, two kind of memory segments. This is similar to white buffer. There are heap segments that are backed by on-heap memory. And native segments that are backed by off-heap memory. All segments have a size. So if you try to access a segment out of bounds, you get an error. They have a lifetime, which means they are alive. But then after you free them, they are no longer alive. So if you try to access them when they are no longer alive, you get an exception. And some segments may also have confinement. So they may start in a thread, and they can only be accessed in the same thread where they started from. How do we use segments? Well, it's not too difficult. It's very similar to white buffer. You can almost see the translation, the mechanical translation from the white buffer API to memory segments. Let's say that we want to model a point that has fields x and y. So what we have to do, we have to allocate a segment. We do that using an arena. We will see a little bit later what an arena is. Let's just go with me for a minute. We have to allocate a segment in 16 bytes, because the coordinates are 8 byte each. And then we put double values into each coordinates, one at offset s0 and another at offset 8. And that's how we populate the memory of that particular segment. So one of the issues that we have in this code, of course, is that we are using an automatic arena. An automatic arena is essentially providing an automatic, the allocation scheme, which is similar to the one that is used by the white buffer API. So we are not going to get any advantage here. But we can do one better. In fact, this is actually where we spend the most time designing the memory API. Java, as you all know, is based on the very idea of having automatic memory management, which means you only care about allocating objects. The garbage collector will sit behind your back and automatically recycle memory when no longer used. This is based on this concept of computing which objects are reachable at any given point in time. Problems with this approach is that computing the reachability graph, so which objects are reachable at any given point in time, is a very expensive operation. And you can find that garbage collectors, especially the one of the latest generation, the low latency garbage collectors, they don't want to materialize the reachability graph as often. So if you try, for example, to allocate a lot of the red buffer using ZGC, you will see that there's a lot more time before the, by buffer, is collected compared to having something else where you can actually thermistically release the memory. So that's a problem. Another problem is that the garbage collector doesn't have knowledge about the off-heap memory region that can be attached to the red buffer. The only thing the garbage collector sees is a very small instance, a very small by buffer instance that is like, I don't know, 16 bytes or something more. But it doesn't seem that maybe there are four gigabytes of off-heap memory attached to that. So there's no way to prioritize that collection. And also, garbage collector only can keep track of an object as long as if it's used from a Java application. So if that by buffer escapes to native code, then it's up to the developer to keep that object alive across the native code boundary. So you have to start playing with reachability fences, and your code suddenly doesn't look as good anymore. So what we need is a new way to think about managing memory resources explicitly. And that's challenging because we are sitting on top of a language that made its success on the very idea of basically never worry about releasing memory ever, because the garbage collector will do it for you. So what we introduced was an abstraction called arena. And arena models the life cycle of one or more memory segment. All the memory segments are allocated with the same arena at the same lifetime. So we call this a lifetime-centric approach, because first you have to think about what is the lifetime of the memory that you want to work with. Then you create an arena that embodies that lifetime, and then you start allocating memory. There are many kinds of arena. Of course, there is the silly global arena that you can use. And basically, whatever you allocate, it stays alive. It's never collected. There's the automatic arena, which we saw before, which basically gives us an automatic memory management scheme, which is similar to the buffer. But then there are the more interesting confined and shared arenas. These are arenas that support the autoclosable interface. So if you call close on that arena, all the memory that has been allocated with that arena will basically just go away deterministically. We don't need to wait for the garbage collector to do that. There are strong safety guarantees. Regardless of whether you are in the confined case or in the shared case, it's not possible for you to access a segment after it has been freed. And in the shared case, we had to do a lot of JVM black magic in order to make this work. Because of course, you can think, well, we just put a lock. Whenever you access a memory segment, we'll check whether the segment is still alive using an expressive operation. And then you realize that memory access is 10x slower than before. So what we did instead is, with the help of the GC team, we relied on some safe pointing mechanism to make sure that it is never possible to close a segment while there is any other thread that is trying to access the same segment. That works very well. Of course, it's a little bit more expensive if you need to close shared arena very frequently. But hopefully, you won't need to do that. So what we are trying to do here is to find an epibalance between the flexibility of C automatic memory management, sorry, the thermistic memory management, where you have to do free and maloc explicitly. That's very flexible, but it's also very unsafe, because you can have use after free, you can have memory leak, or the extreme safety of Rust, which comes at the expense of some flexibility when you try to code. Because if you want to do, for example, secret data structures in Rust, like a link list, it becomes very, very, very difficult. So Java is trying to sit in the middle. And I think we've done a good job doing this. So how do you work with explicit arenas? It's basically the same as with automatic arenas. The only difference here is that now we are using a try with resource statement. So we create the arena inside the try with resource block. We do the allocation. We populate the point struct. And then when we close the brace, all the memory goes away. So this is much better than the direct buffer counterpart, especially if you need to frequently allocate off-heap data structures, because we no longer put load on the garbage collector just to clean up the off-heap memory. So one thing that we need to still improve on this API is how do we access the fields of the struct that we want to operate with? In the example that I showed previously, we had to say, well, I want to access off-heap zero. I want to access off-heap eight, because we knew these were the offset where my fields are. But what if we could just declare what is the layout of the struct that I want to work with? What if we can translate the struct point 2D definition that we have in C into a Java object that models the same layout? Then we can start asking interesting questions, such as what is the layout of the field x or y. Give me avarendo for accessing the x field. And that is exactly what we are doing here. So instead of just relegating the definition of point 2D in a comment, we actually define the layout of the point struct as an object, as a Java object. And then we use this object to derive the two varendals, one for accessing the x field and one for accessing the y field. Then inside the try with our sources, we can just use the varendal to access the fields. We don't have to specify the offset eight for the field y, for example, because the varendal will encode all the offset computation automatically. At the same time, look in the allocation expression, the very first inside the try with our source block, we can see that we are just using passing the layout to the location routine. And the layout, of course, knows what is the size of the block that we want to allocate. So switching gears a little bit, let's start talking about FFI. The main abstraction in FFI is called native linker. This is an object that essentially embeds the calling convention of the platform in which the JVM runs. It provides two capabilities. The first is it allows us to derive a method end all that targets a native function. So we can basically describe the native function we want to call, get a method end all, and just call it from Java. The second capability is kind of the reverse of that. So we have a method end all that describes on Java computation. We want to turn it into a function pointer, so a memory segment, that then we can pass back to native code. In this approach is inspired to, for example, Python C types or lib FFI. These are kind of the main inspiration. So we want to be able to describe a function from Java, so then we can call it directly. It all builds on the abstraction that we've seen so far. So we use layouts to describe the signature of C functions. We use memory segment to pass addresses or structs. And we use our in-apps to model life cycles of upcalls and to model the life cycles also of loaded libraries. So when we want to call a native function, so here I define a function distance that take a point, returns the distance of the point from the origin. Actually, doing that in C is a little bit more convoluted than it looks like, because it essentially depends on the platform we are on. So if we are on Linux, we will have to look at some rules that are called the CSV calling convention. And that tells us that, for example, structs that are as big as the point to destruct that we have here can have their fields pass in registers. So the only thing that we need to do when calling the distance function is to load the first floating point register with the value 3, the second floating point register with the value 4, then we just jump on the function. But if you are on Windows, even if you are on X64, but on Windows, there is a completely different set of calling convention, which actually tells us that any struct that is bigger than 64 bit, such as our struct here, will be passed in memory instead, which means the struct has to be spilled on the stack, a pointer to the stack has to be stored in the RCX register, and then we jump to the function. So same function, same architecture, because X64, completely very different set of assembly instruction that needs to be generated in order to act as a trampoline from Java code, for example, to C code. So that's why it's important that we are able to describe the signature of a C function to the linker, because the linker then will inspect the signature of the C function and will determine what is the exact set of instruction that we need in order to go from the Java code to the native code underneath. And so how do we do this? Well, when we call the down call end on the native linker, we will pass, of course, the address of the function that we want to call. This is obtained using a symbol lookup, which we won't have time to investigate in further detail. But it will basically give us the address of where the distance address function lives. And then we provide a function descriptor. This function descriptor is nothing but a set of layouts, one for the return type and one for the argument. In this case, we know that the return type is double. So we use a double layout. And the argument is actually the point to this track that we defined before. So that same layout can now be reused in order to describe the signature of the function. Then inside our try with the source, we populate the point as before. And then we can call the method end. So we just pass the point memory segment to the method end that we obtain. And that means that we will be able to pass the point by value to the C function. And nothing else needs to be done, because the linker will figure out exactly what set of machine instruction to generate in order to go there. So of course, when we talk about native function, we always have to keep safety in the back of our mind, right? Because whenever we go into native, the operation is fundamentally unsafe. We could, for example, make a mistake in describing the signature of our target C function, which means the assembly step that we have is not correct for calling that particular function. We may cause all sorts of issues. The foreign code may attempt to free memory that has already been freed from Java code. Or we may get a pointer from native code. We may try to resize the pointer, but we got the size wrong. And so we are suddenly trying to access memory that is not there. So in the FFM API, there is a concept that is called restricted method. So there are some methods in the FFM API that are not directly available all the time. They are part of the Java API. So if you go in the Java doc, you can see them. But they are restricted, and you need to use an extra command line flag if you want to use them without warnings. So for now, basically, if you try to use a restricted method, such as the method for creating a down call method endo, you will only get a warning. But in the future, we plan to turn this warning into an error. And in that case, you will have to use a new option that is called dash, dash enable native access that will grant a subset of the models of your application or the all unnamed model if you are using the class path, access to restricted methods. This is a part of a bigger plan to move Java on a more solid foundation, one that allows us to provide integrity by default. So Java in its default configuration should always preserve integrity, which means it shouldn't be possible for native code to mess up with invariants, such as, for example, mutating final fields and things like that. So this is the workflow using FFM API when we want to access a native library. So we still have something in the middle between us and the native library that we want to call. This time, though, the stuff we have in the middle is just Java objects. We have memory layout, varendals, method endals, function descriptors. But here's an idea. What if we could generate all this stuff mechanically using a tool? And that's exactly what the JXR tool does. So let's say that we want to call the QSAR function, which is actually a tricky function because it has a function pointer that allows us to sort the contents to compare elements of an array. So it uses a function pointer type def. So if you want to model this using plain FFM, it's going to take you a little bit of setup code in order to create the app call stub and the method endals that are required to call this. But if you give all this header to JXR, so we could just start with pointing it at the header or the standard library header where this is defined, then we basically just get a bunch of static declaration that we can use to call QSAR. So if I do all this, the only thing I have to do from my code is first to create the function pointer. And this is possible with a factor that has been generated by JSTRAP that allows me to pass a lambda expression. And the lambda expression will be turned into a function pointer that is stored inside a memory segment. And then I can pass to the QSAR function. And the QSAR function is not a method endal anymore. It's a nice static wrapper around the method endal. So it's much better to use from the developer perspective because using method endal can sometimes be tricky with the fact that we can pass the wrong type and then it gets lower and things like that. So in comparison, this is the code that you have to write if you wanted to do this using JNI. So there's Java code with native methods. There's another file that is generated by Java C. And then there's quite a bit of C implementation in order to do QSAR. And it actually took us a few attempts in order to get to the best optimal implementation because our first attempt wasn't very good. It can actually get quite tricky. And even better, if you look at the performances, the plain FFM-based approach is roughly 2x, 3x faster than the JNI approach, every optimized JNI approach. And that's because a colleague of mine, Neal Verne, has put a lot of effort in trying to optimize, especially the up-call path. So when you want to call a Java function from native code, there was a lot of performance left on the table from JNI. And we were able to greatly improve the performances there. For regular calls, you probably won't see much difference. So FFM is more or less on par with JNI. But as soon as your native call is starting to up-call back into Java, you're going to see massive differences. So wrapping up, FFM provides a safe and efficient way to access memory. We have deterministic location. We have layouts to describe structs. And so it gives us ability to describe the content of the memory that we want to work with and then get varendals to access that memory in a much more robust way. Then we have an API to access native function directly from Java. So no need to write JNI code. That means that your deployment gets simpler, because you don't have that shim DLL going around that you need to distribute along with your application. And together, the foreign linker and memory segment and layouts provide the foundation of a new interrupt story for Java that is based on a tool called JSTRACT, which allows us to target native library directly. One thing that emerged while we were working on FFM is that there was quite a lot of number of use cases that we didn't anticipate at first. Since FFM is a fairly low level library, it allows very easily for other languages that are built on top of the VM, such as Scala, Closure, or even Ruby to use the FFM layer to then target native function. That was very expensive to do with JNI, because it meant that the other language sitting on top of the VM needed to spin some JNI code in order to be able to do that or maybe uses a library like libffi. But with FFM, this is possible directly out of the box. And I think that's a good improvement. We have been incubating and previewing, of course, for a long time, since JDK14, essentially, so that allowed us to get a lot of feedback from Apache Lucene, Netty, Tomcat. And I think today they are in production with some of this stuff. So I think if you run Lucene with Java 21, you are getting a code path that uses FFM under the hood. And I think that helped them to get rid of some of the issues where they had to use unsafe in order to free the memory that was mapped because otherwise waiting for the garbage collector could lead to other issues. We also are being used by Tornado VM. So in that case, it's an interesting case where memory segments are used to model memory that is inside the GPU. So they are using memory segment in a very creative way there and a bunch of other projects as chime in as well. So for us, it was a very successful experience of using preview features because it allows us to gather a lot of feedback. Not necessarily, we have a lot of knowledge on these topics within the JDK team. So it was good for us to put something out. And then here, our people were using some of this stuff and make it better. That's the end of my talk. These are some of the links. I hope that you are going to try FFM in 22. You can subscribe to the mailing list and send us feedback. There is a link to the JSTRAC tool. So there are binary snapshots available. So you can grab the latest one and start extracting your library of choice and play with it a little bit. And then a link to the repos. But that's mostly it. Thank you very much. The first question. Questions? Who is FFM-focused from Canadian technologies? I think it's pick. Yeah, basically what is the difference between these and Kotlin native, since Kotlin native can provide access to off-if memory and native function as well. I think they are very similar. One of the things that I think Kotlin native cannot do because it's still sitting on top of the VM, and it has to play by the rules of the existing libraries, is that it cannot have a solution for releasing memory safely. So I believe that Kotlin native is going to use some, I mean it's going to say at some point, oh if you use pointer your code is going to be unsafe, and you try to free a pointer then all bets are off. So this is the main, without solution if you use memory segments, you can close an arena and your code will never crash. You may get an exception. This is the same thing as the same thing. Yeah, but you know the APIs I've seen so far, there is always a whole, like if you use them correctly it works, but there are ways to use them for multiple threads where it's not working, unless you go deeper at the VM level of course, which of course Kotlin native cannot do. Up here, Mereh, Siou. Go on, question. Mereh, Siou, here. Do you know how many platform specific hacks need to be done, like if I want to use one code on like ARM macOS and Linux risk v or something, or is it all fully one code for all platforms? So in terms of JStrack the model, sorry the question was, do our platform specific is all this? Do we need to worry about differences between platforms? The answer is yes, in the sense that the JStrack tool is going to give you a binding for the platform that you are running on. Now, this sounds scary. In practice, for example, if you work with a high level library such as Lib Clang for example, we have a single run of JStrack, and then we reuse it across all the platforms and it works fine, because that library is defined in a way that is portable. If you work with system libraries, of course you are going to have a lot less luck, and that system library is only going to work on one platform, and all the platforms will need to do something else. Yes, can you tell us about the memory footprint compared to JNI? Memory footprint compared to JNI. So of course if you use memory segment, there is a little bit of footprint because you have an object that embeds an address, so you don't have a long. But our plan is to make all these memory segments scalarizable because the implementation is completely hidden. You only have a sealed interface in the API, which means all these interfaces are going to be implemented by value classes when Valhalla comes, which means if you bring up a memory segment, you wrap a memory segment around an address, you are not going to pay anything allocation-wise. For now, there is a little bit of cost in the cases where the VM cannot figure out with escape analysis the allocation, but in the future we plan for this to completely disappear. Yeah, okay. Sorry.
Ruby on the Modern JVM: Fibers, FFI, and More
Our next speaker is the esteemed and very famous Charlie Nutter, so let's give him a round of applause. Alright, microphone working. Can you hear me okay back there? Alright, great. I got a lot to cover. This is going to be a retrospective of all the problems that we've had trying to get Ruby onto the JVM. And then a little status report along the way about how we're doing on making the JVM catch up with those needs. Charles Nutter, that's me. There's my contact information. Been working at Red Hat now for I think 12 years. Before that worked for Ingeniard. It was a Ruby software as a service company. And then I was at Sun for the last three years as well. So I probably won't have time for interactive QA, but if you contact me online or post something in the Matrix channel, I will definitely get to it. I want to answer all the questions. Okay, so a little brief review of JRuby here. Ruby for the JVM, not too surprising there. It runs on Java 8 currently, but because of all the cool stuff and because we've ridden the Java 8 horse into the ground, we are going to be 17 or 21 minimum next release, which should be this year. In development for a long time, running Rails since 2006 and probably 2008, we started having production users. And we're the only alternative Ruby that's really had production users during that time. There's been a few other experiments, but nothing's ever really taken as well as JRuby. Maybe the most successful off-platform language brought to the JVM, Jython and Rhino Nazhorn might give us a run for our money, but given the maintenance state of those libraries, I think we're probably currently the most successful and most widely used JVM language that never was envisioned for this platform. So we've been chasing Rails all the time. That's kind of the gold standard for whether we can say we're a Ruby implementation or not. And after about two years of good work, we managed to get Rails working back then. Running Rails tests, running CRuby's tests, running all of the different libraries, suites, as much as possible. Compliance testing for Ruby has improved over the years, but we pretty much just run everything to try and make sure that we really are compatible. And very quickly, we ran into some serious challenges trying to bring a language like Ruby to the JVM and make it also usable and perform well. This is the quick summary. These are all areas I'm going to cover during this talk, so we will just blow right through here. These challenges help us grow both as a platform and as a community. They open up new worlds to Java developers, to JVM developers. They open up the potential of bringing new and unusual languages to the platform. It opens up the entire world of native libraries, native features that are out there that we don't necessarily have on the JVM. So we really need to focus on what are these challenges bringing a language like Ruby to the JVM and how can we make the JVM better to support languages like this in the future? So we'll start with strings and regular expressions. Excuse me for a moment. Okay. So one of the first things we ran into, JRuby's strings were just based on Java strings and we used Java's regular expressions. And at the time, regular expressions were being used in very unusual ways in the Ruby world. We ran into a case in an early version of Rails where they were using regular expression matching to parse HTTP requests that came in and look for, say, a mime header for an image and pull the image out. So you'd end up with a regular expression operating against a very large piece of data. And the built-in Java regular expression engine is implemented in such a way that for certain types of expressions like an alternation like this, it actually will recurse and recurse and recurse. And then very easily you can blow the stack out by feeding it too much data, just giving it too much data. To process, we'll blow it up. So we had to find other options. JRegix was an early one that worked against Java strings and we ran with that for quite a while. But eventually it came to be that the Java string itself was insufficient for us to represent Ruby's string behavior. Here's what that exception looks like. Very simple match here. It's just 10,000 of the A character followed by a B character with that same regular expression. It'll blow up on every version of JVM that's out there or anything based on OpenJDK classes. And I believe this is still an issue. So as we went forward and had to have a more robust, more robust regular expression engine that would work with a more custom type string on JRuby that matched CRuby's behavior, we ported over, or a contributor to JRuby, ported over Ruby's regular expression engine. So Oniguruma is the C library that Ruby uses for regular expression matching and ours is Joanie. It's a byte code based register machine, so there's no stack issues. It doesn't deepen the stack at all. It matches against byte arrays. And that'll be clear in a moment here why we need that. It also can do byte array matching with pluggable encodings, so regardless of what encoding those characters are in, and potentially if you want to use a different grammar for regular expression. This library was ported to characters and used by Nazhorn to do JavaScript regular expressions. They had the same sort of problems, and so they used our library but made it specific to JavaScript. So you see that I'm matching against byte arrays here and I said that strings were insufficient. Well the problem is that Ruby's string is not just one encoding, it's not just a blob of characters, it is represented as a byte array with an encoding. So within the system you can have many strings that all have different encodings and it all needs to be negotiated together when you combine them or use them against each other. So we had to follow suit essentially. We had to make a new string class for JRuby that used bytes, used a byte array, represented all the encodings, port all of the encoding logic over and the transcoding logic, which was a major piece of work. And essentially we have our own string now and we've had this for over a decade because Java strings just could not emulate all of the behaviors we needed for Ruby. This does complicate interop with Java but there are improvements coming there. So J-codings is the encoding library that we use. This provides the full set of encodings similar to what's in CRuby and the transcoding from any encoding to another encoding, which is used internally when we have two different strings come together and need to negotiate that. So where do we stand on the JVM today? Well rather than just having a character array inside strings, we do actually have a similar model now where there's a byte array but only two encodings are allowed inside that byte array. ISO 88591, which is essentially just the 128 bits of ASCII, or UTF-16, the old standard character. So this does lower our cost going to and from Ruby and Java when we are just using an ASCII range of characters, but UTF-8 would be nice to have there because most Ruby strings are going to be UTF-8 probably with at least one multi-byte character in there. So that has to all be copied over a lot more, a lot less efficiently. And Java Util Rejects does still blow the stack. I would love to see it get replaced at some point, but I don't know if there's any work being done to do that. Okay, so the next area that we ran into was that we have a nice runtime, but the performance wasn't there. We needed to be able to generate JVM bytecode from Ruby code and have it optimize like regular Java. So the interpreter was good. It was similar to Ruby 1.8 before they moved to their own bytecode runtime. It was very difficult for the JVM to optimize. We could walk through this stuff quickly and it was very easy to write as an interpreter, but you had a lot of polymorphic calls within that AST. It never really could see the optimization path through there. So we had to write a JIT. The reason that we did not just immediately start compiling all Ruby code into bytecode is because, for example, the Rails library will load into memory thousands of classes, tens of, or tens of thousands of methods, or tens of thousands of methods. That's a massive load for us to put onto the JVM when only a few hundred or a few thousand of those are ever going to be called. It also was slower for us to go straight to bytecode because the bytecode would end up being interpreted by the JVM's interpreter, which actually turned out to be slower than our interpreter after it JITs. So it made more sense for us to leave it in our interpreter until we saw it really needed JVM bytecode, and there we ended up with basically the first mixed-mode JIT runtime on top of the JVM. Later on, we did move to a more modern compiler design. We had a compiler engineer, Sabu Sastri, come in and help, and he basically helped us move a lot of the little P-Pole optimizations I was doing in my JIT up to a more modern compiler architecture. So this simplified the JIT, simplified what I had to write as far as emitting bytecode, which then let me explore performance a lot more in other ways. And then of course, as we move forward, we got invokeDynamic in Java 7. It's been steadily improving since then. It's used incredibly heavily in JRuby. If you take the bytecode of our JIT output from Ruby code, it's pretty much just stack moves and invokeDynamics for almost everything that we do. We will access local variables normally, but everything else has to have some little dynamic aspect as part of Ruby. So we use it very heavily and probably more heavily than almost any other project on the JVM. This is invokeDynamic performance over time from Java 8 up to 17. Really happy to see the performance improvements every release. It gets a little bit better. Looking at what we're doing with a more numeric algorithm, we get a bigger boost out of it. With something that's just walking a lot of objects, we're already kind of close to where Java would be on just walking an object graph, but still seeing that we do get some improvements from running invokeDynamic, making that more direct. Really cool is when we plug in a different JIT compiler here. So this is now using invokeDynamic on the Grawl JIT. And for a numeric algorithm where we're creating tons of numeric objects, we really see the impact of partial escape analysis helping us. And this is now really starting to get to the point of Java level performance for a numeric algorithm. This is the cases where it really helps. But over time, we have not seen that Grawl is generally faster and we don't generally recommend it unless you have something numeric or something that's doing a massive amount of allocation of temporary objects. So where are we today? One of the problems that we have generating individual methods or compiling at the runtime is ideally we want that compiled method to go away if the class goes away, or if it's a one-off generated method that eventually doesn't get used. So it's a class per method, and the only way to make those garbage collectible is a class loader per class per method. So every method that we JIT into the system has both a class surrounding it and an entire class loader just to work within the confines of the garbage collector. There's no other way to make garbage collectible classes right now on the JVM. There is anonymous class loader, but it's a hidden class, and we don't try to access that right now. Indie is clearly working very well. We're going to be doing more advanced call sites where we will have special case code along one fast path and then a slower dynamic path if it turns out it's not the logic we expected. It is a tricky API to use, but we have a lot of tooling that we've built around it. I've got some links to older talks of mine that go into detail on that. Okay, I think we're doing pretty good on time here. I know I talk fast. Come back to the video and do it at like half speed, and then maybe you'll catch everything that I'm trying to say here. So the next big area that we ran into was native interop. The sea ruby world really lives in a POSIX native sea environment. It's almost a DSL for writing POSIX code, really. And originally that's kind of what Mott's the creator wanted. He wanted something where he could write C, but essentially with a nice API, a nice language on top of it. So they are very heavily using JNI-like extensions to the runtime for most of their native access. This is clearly way too invasive for JRuby. It calls into internals of their object structures. It has direct access to the heap, direct access to garbage collector endpoints. Nothing that we can emulate efficiently in JNI, and we have tried. So we ended up pushing people more towards using programmatic access, like Project Panama, like libffi, rather than writing C extensions for C ruby to wrap a library. Let's just wrap the library by writing a little bit of ruby code. And so started out with the Java native runtime. It's basically our API for calling from Java down into native code and native memory. And then on top of that, porting the ruby ffi layer over with some invoke dynamic magic, try and make that all as clean and fast as possible. Java native runtime is actually a set of projects. Up at the top, jffi is the wrapper around libffi. That's where we ship about 20 different binaries in the jar for all the base platforms that we support. Libffi is in there, and we're just using standard libffi with some extra wrapper logic around it. JNR-fffi is kind of the baseline user API. If you're familiar with JNA, this is that level, where you say, I need a struct that's laid out like this. I need a function that takes these arguments, make these calls, allocate this memory. Then above that, we realize there were a lot of functions and a lot of behaviors that people were going to be rebinding over and over if we didn't provide them. So we have JNR-possicks, which is a slowly growing corpus of standard posix functions bound on top of JNR-fffi. So you can go in there and you can call things like posixpon or open a file or do native IO. You can even call fork, and it's a lot of fun to see what happens when you do that. JNR-enxio, extended native cross-platform IO, builds on JNR-possicks and provides an NIO channel that is all native down calls. So where we can't get selectable standard IO on the JVM, we can't get selectable sub-process channels, we can use JNR-enxio to have actual interactive control over standard input. Standard IO and sub-processes. You can actually use JRuby to spin up a VIM instance and it will have full console control and work properly. Basically impossible to do with the standard process builder stuff on Java. Unix socket, not too surprising, just wraps this other stuff with the Unix socket calls. And then JNR-process, like I mentioned, we have our own selectable channels for processes. You can use this as a Maven library, you pull it in and you'll have the same API as process builder but you'll get channels, selectable channels out of it instead of streams. So it's available right now for that. This is a little bit of what Ruby FFI looks like. Pretty straightforward, we're setting up a structure with particular widths of fields, attaching a function, get time of day, and then we can call it directly. Under the covers, this all uses JNR and ideally inlines as much as possible up to the native down call. So today, native interop on the JVM. Of course, we have Panama coming along, so the talk before me, Mauricio's talk, that's where all the information is about where things are going and we're really excited about that. I actually wrote the original JEP for Panama, which has now been walked away from many times, but we've been needing this for over a decade now and had to make our own but don't want to maintain it anymore. JNR is pretty much the fastest way outside of Panama to do these native down calls. In some cases, actually beating JNI because there's extensions to generate a little JNI function in memory using assembly that can cut out some of that overhead rather than just doing pure programmatic calling through lib.ffi. Jextract from Panama is coming along. We're also hoping that we can use that at runtime as a library to and access those data structures internally to generate Ruby.ffi code. This would be kind of the last mile for getting Rubyists to switch from writing C extensions to using FFI. If we could generate the Ruby.ffi code the same way we do the Panama code, there'd be nothing to stop them at that point. There is back-end work happening right now on JNR to integrate it with Panama. Michelle at Oracle is working on that and I'm hoping that we'll see something in the next couple of weeks. A little more review of some of these ideas. If we have Jextract that can generate Java code, we should be able to use Jextract to also generate Ruby.ffi code. That'll be the next big fun toy to play with is of Java 22. We also use the existing SQLite JDBC driver. Rubyists like to use SQLite for local development. But it's going through a JNI back-end. You have to make sure it's available for the platform that you're on. They are also playing with Panama behind the scenes. Early numbers look like two-ish times faster than the JNI wrapper around SQLite that they have. So this is coming along. We also are integrating a new Ruby parser called Prism, which is a simple C library that we all the implementations can share so that we are using the same Ruby parser. That we will integrate through Panama as well. And use Panama to make it much faster for us to downcall into this library, get our AST back out, and then proceed. Interestingly, we're also exploring using Prism as a Wasm-compiled library running on the chicory Wasm implementation on top of the JVM so that we can parse Ruby code using a native library even if we're not on a platform it's compiled for. And that's amazing that it works. All right. Moving along here. So lightweight threading is the next big one. Around Ruby 1.9, they introduced fibers, a coroutine-like concept, a micro thread concept. You would still have your native threads there, but they can bounce around to different fibers at any given time. And you get little structured concurrency, structured use of fibers, allows you to do multiple tasks in the same thread. There's also been a push toward structured concurrency in the Ruby world now, where fibers can wait on I.O. or make a blocking call on I.O. The runtime will see that and schedule another fiber to run in its place while it's waiting for that. So you can easily handle tens of thousands, hundreds of thousands of concurrent connections, for example, without blocking that many threads or having to write your own select loop and what not. So fibers on JRuby, without a coroutine API at the JVM level, of course, we've had to use native threads. And that clearly only scales up to a certain number of threads. With the structured concurrency example, we could have potentially thousands of fibers in the system, and it's almost impossible for us to support that with full, heavy native threads all along the way. Ruby also primarily uses internal iteration. Collections just have to implement an each method of basically a for each. And all collections in the system then expect you to pass a block of code into it. Well, how do you turn internal iteration into external iteration? You have to use a coroutine that can yield values back out while staying inside that loop. So now we've got that potential for all sorts of fibers, hundreds of thousands of fibers all over the system, just because we're iterating collections with an external iterator. I'm going to kind of blow through this because the next talk will cover fibers a bit more. The example here of handling requests on a thread. We've got a thread, a request comes in. Now it's waiting for more information, the thread's not being used. Finally we get more data, we can proceed with the rest of our request handling. With fibers, of course, we can use multiple different fibers handling different connections on the same native thread. So the request comes in, this fiber's waiting on IO. Well let's spin up another fiber that can handle the next request that comes in. And they can multiplex use of that same thread. This is what we're starting to see more and more in Ruby, and this is where it will be critical for us to have lightweight fibers, lightweight coroutines on J Ruby. Okay, so here is a little benchmark, a little example of trying to test how long it takes to spin up 100,000 fibers and run them all to completion. So they are 100,000 live fibers in the system at any given time on this benchmark. And of course as you would expect this doesn't work. We can't spin up 100,000 native threads, and it just crashes in horrific ways. I'd love to see this crash in less horrific ways, but ideally we just move away from this problem altogether. And that's where we get project loom. So JVM Today, as of 21, we now have an official API for lightweight coroutines for essentially fibers that maps almost perfectly to what we need in the Ruby world. And we've already got this integrated. We integrated it a year ago actually, and have only made minor changes along the way. I'd like to show this just to demonstrate how much work we had to do to switch from our built in native fiber, native thread fibers to the virtual thread fibers. I was shocked that this was all it took, and suddenly this benchmark actually could run. It could actually spin up all of those fibers and run them to completion. So amazing work on the loom side, and very happy with the results. Once wise, so here I drop it down to 10,000 so that I can actually try and get the threaded version to work. Clearly we're getting significant gains on passing, context switching between different fibers, because loom is just better at that, and there's a much lighter weight process for going from one fiber to another on the same thread. Not quite as fast as C Ruby. I suspect this is probably due to us relying on a very general purpose scheduler for the virtual threads behind the scenes, where we really just want to say, this fiber's done, now run this one, rather than unblock that fiber and wait for the scheduler to pick it up. I think we can make up most of this overhead. Similarly on M1, I don't know if this is general to arm or not, but this is the performance results we have. Could not get 10,000 to go on M1. I got to drop it down to like 2,000 or 3,000. The impact is a bit more here, but again I'm hoping that as loom evolves, as we use it better, we'll see improvements. Five minutes for the last section here. The classic problem with J Ruby is still startup time. If we did not have startup time, we probably would have won the Ruby war a long time ago. It's the number one, two, and three complaint about J Ruby is how much longer it takes to start up. The JBM is just not designed to start up quickly. Most of the core JD code starts in the interpreter. It takes a long time for that to optimize, and then your application can start getting fast. We make it worse because we interpret Ruby code, and then every once in a while we'll just throw more byte code at the JVM, like okay, now this call site's actually bound to a byte code method, not an interpreter, and we're just confusing the hell out of it all the time. This is one of the reasons we actually do lazy compilation to byte code, because we want to reduce the amount of overhead we force onto the JIT at the JVM level. Walk through J Ruby's architecture here quick. We have our Ruby parser, gives us our Ruby AST, we compile into our intermediate representation, interpret that for a while, and here's where it becomes mixed mode. Then eventually we will generate byte code for those methods, and then hopefully the rest of it all works and optimizes to native code. One of the early ways that we've tried to improve startup time is basically to turn most of that off, rather than turning anything into byte code, rather than even running the C2, the fast JIT in hotspot. We turn only to C1, we use the simple JIT in the JVM, and we only use our interpreter. This improves our startup time by about 2x. By far the best thing we've had so far. Now, another way that could be potentially a way to fix this would be ahead of time compilation. Of course, GrawlVM solves this very nicely for that world, but it completely disables all of the native things that we want. General purpose, invoke dynamic, and method handles just simply, essentially doesn't work. Then beyond that, we would have to pre-compile all of our code to byte code. We'd have to link it in some way that it could ahead of time compile the native. This is just not going to work for us. We're hoping that Layden will actually pick up here with a ahead of time option that can also do some dynamic stuff at runtime. Where are we today? The solutions we're looking at in the short term are mostly surrounding the checkpointing features. Checkpoint and restore in user space, the CRIU API on Linux allows us to run JRuby to a certain point, like just after startup, and then save off a copy of it that we can start quickly with later on. This is being standardized in Project Crack, an unfortunate name, but a lovely project. This is working pretty well with JRuby right now, just experimenting with it. We are still hoping that Layden with some ahead of time compilation that still enables the rest of JVM features will be our ultimate solution. You can see here, this is CRuby on the left side just doing a baseline startup. JRuby's baseline startup without our dash dash dev flag, which turns off all of the optimization. The dev flag here, not quite 2x, but giving us a good boost. Crack of course, significantly faster than all of those. We've actually gotten to a point in execution where we can start running Ruby code now, starting to get competitive with CRuby, which was essentially designed for fast startup. Same example generating a Rails app, again, getting very close to where CRuby sits on these numbers. So, wrapping up in the last minute here, JRuby is a test bed for all of these crazy JVM things that we're doing. We're pushing all of these edges. So whether you care about Ruby or not, we are the best invoke dynamic torture test. We're going to be hitting Panama extremely hard as it gets integrated into the system. All of the structural threading will be massively exercised by all of the structured concurrency stuff coming on the Ruby side. So if you're interested in helping us integrate any of these features, or if you're an implementer interested in testing these features at scale, JRuby is definitely something you should look at. This is more background. I'll let you take a quick picture of this if you want. These are talks I've done in the past that basically cover all of my many complaints about the JVM. That list of complaints gets smaller and smaller every year, thankfully.
Virtual Thread’s Next Steps
Okay, so apologies for those who are expecting Mark Rhino to be speaking today. He couldn't make it, so I'm the stand-in for today. So you've got a stand-in speaker and a stand-in topic. So the topic is, we're actually going to talk about virtual threads, which is part of Project Loom. Charlie mentioned a bit of it in his previous talk, and we're going to talk about what we're working on in this area. So Project Loom, I'm not going to go through all of the project. There's a lot of material out there that you can actually search, and most of it is actually pretty good. At a high level, what Project Loom in OpenJDK is about is really about an upgrade to Java's concurrency model. It's about enabling much more higher-scale server applications with very easy to write code. So I'm not going to go through all of that, as I said, but the main thing I'm going to talk about is virtual threads, which is Charlie called them lightweight threads in the previous slide. It's all about having much more lighter weight type of threads for execution. So think about replacing tasks with thread, thread per task models, and thread per connection, that kind of thing. There's other things that we're working on, structuring currency and some other features, they're topics for other talks. So we've been working on this feature for a long time. This is one of these features that requires jacking up the house, replacing the foundations, putting the house back down without actually breaking anything. I think actually we've actually been mostly successful on that. It went through a couple of preview releases within 19 and 20. We made it a permanent feature in 21, being well received by the ecosystem in general. If you actually look at all of the frameworks and libraries out there, they've actually got something working with virtual threads in a very short period of time. That's one of the nice things about having things in preview for a couple of releases is it allows these frameworks to try things out and actually find issues. The big thing about it and why there's such an interest in it is because it allows applications and developers to move away from the world of scalability that they had before, where they had to go to async and reactive, which actually is incompatible with a lot of the things in the Java platform, particularly around things like debugging and just being able to get your right mental model of actually what code is executing. Overall, we're in a pretty good shape. Performance is actually in pretty good shape. Reliability is pretty good. There's a couple of rough areas around performance that we need to work on. We talk about that another time. What I do want to talk about is the other 90%. This is one of the things about these big features, is you get the first 90% done and then you've got to figure out how to get the other 90% done. We do have some quality of implementation issues. One of the compromises that was made in order to actually get this feature in is that we didn't do a nice job on Java monitors. I want to talk a bit about that today because that is by far the top pain point, the second top pain point, and the third top pain point that people actually trying out virtual threads for the last couple of years is running into. I'm guessing that anyone that actually has tried virtual threads, I've read some of the articles. People always have these gotchas type of sections in their blogs and articles and it's all about pinning. I want to talk a bit about that today. I also want to talk about a few things that we're doing around expanding the set of libraries that actually work well with virtual threads. There's other projects that we're actually working on, as I said, that's for another day. In order to actually understand some of the slides that I'm going through and the material that I'm going to talk about is, you have to have some little bit of understanding as to how virtual threads are actually implemented. There's an underlying concept in the hotspot VM which is support for delimited continuations. It's not something that's exposed generally, it's just something that's there for the underlying construct that virtual machines is actually built on. What happens is that a virtual thread essentially wraps one of these continuations so that when a thread needs to block because of a lock or an IO operation or something like that, that translates to the continuation yielding and then when it actually continues again, it's like the thread continues execution again. In order to be able to actually do a threading library in Java, you actually need to combine it with some kind of a scheduler. The scheduler that we're actually combining for now is the fork join thread pool that's been in Java for quite some time and that's actually a work stealing scheduler. We're using it in very, very different ways than it's used for things like parallel streams and we're using more kind of in FIFO mode but you get the obvious kind of default parallelism which is based on the number of cores although it is a little bit dynamic and I'll get into that in a few minutes. So mental model to think about anyway is that you've got this sort of magic with continuations, we're combining with a scheduler and that scheduler is managing a set of threads. The mental model to think about is walking around with a little kid on your shoulders. The platform thread or the carrier thread is carrying the little child around which is the virtual thread on the shoulders. When the child wants to stop and do something, you take them off, drop them down, some other adult comes along, the other guardian picks them up and moves them on their shoulders. So that's essentially sort of what to think about with virtual threads. Okay, so in order to talk through some of these slides, I'm going to use the kind of the same sort of layout in all of these slides and so what I'm going to have is, is there's going to be a little bit of code example on the left and then I'm going to show some stacks on the right side. They're color coded and I'll show you these just to give you an idea what's actually going to happen. The thread that is going to be the subject of this talk, we're actually just going to give it an ID, it's number 22. We've doubled down on thread IDs, quite a bit in the platform the last couple of years. So what we're going to do here is we're going to have a bit of code that's executed by virtual thread number 22. In this case, for the first example, we're actually going to use the same before. Just think about a semaphore as something that has a number of permits. You actually acquire a permit and, and when you're, and run some critical code and then when you're done with the permits, you actually return it back to the semaphore with a release. So typically the typical idea that you actually use is, is, is acquire and then release at the end and use it to try resources concept. Okay, so very, very simple. So let's see what actually happens when a virtual thread goes and executes this code. So the arrow here, just think that this is, this thing of the red arrow is kind of like a program point, program counter. That's kind of where we are. On the right side now we see our first sort of stack traces here. There's actually two threads in the picture here. What you actually see at the top, and by the way, the stacks are in this, in, in, in, in all of these slides grow from the top to top down. So the, the, the, the orange brownie frames that you see at the top there, they're actually the fork join pool thread that's, that's actually carrying the virtual thread. And the greens, the green frames that you see there are the virtual thread. And so I've just merged them there together because that's actually what you have on the native thread underneath the, the covers. So when we get to do, to doing a semaphore acquire, what we see is, is, is, is in green here. If you look down right at the bottom, the, the, the, the frame there is actually semaphore.acquire. So we're about to call this guy and we'll actually see what actually happens when we call semaphore acquire. Okay, so semaphore acquire gets to this point here. Now every, every good movie needs a villain and I need a villain here. So in this case, let us assume that the, the villain is Andrew and he actually has the permit for this summer for. So what's going to happen is this virtual thread has to go and park because it can't acquire a semaphore. So what we see here is, is we're going down through the Javiutil concurrent code to try to acquire the semaphore. There's no permits available. So the thread has to park. What actually happens is, it bottoms out at the bottom trying to do a park which will actually yield the continuation. And there, this is when the, this is, this is where the magic occurs is, is the, the, the, the, the, the thread is now parking and, and magically its frames get removed from the native thread stack. And the worker thread or the fork joint thread is able to go and do other work. So that's actually what happens sort of at a high level with, with, with, with virtual threads. Now Andrew is finished with his, the, the, the semaphore and is doing a release. So he's returning back the permit back to the, to the semaphore. So now we're actually going to look at what happens here now when the, when the, the, the virtual thread is actually going to continue. So remember a virtual thread is waiting to acquire the semaphore. Andrew is doing the release and that goes and triggers the, the, the, the, the, goes through the Javu-Tilkin current code and it'll bottom out then doing an unpark of the, of the virtual thread that's actually waiting, which will do the continue. What, and, and what it's actually really going to do is actually schedule the task that's associated with the virtual thread back to the scheduler so that it actually can continue again. Very, very, very kind of straightforward. It's just submitting the, the, the virtual thread back to the, to, to the scheduler so it can continue. So back to our slide again. So what happens is, is, is, is, we'll assume the scheduler has now started executing this code again. It'll return back from, from, from, from the park that we were on earlier on and magically we, we, the frames start popping off and we go into our tri block and we are done. So that's sort of thing, how, how, how things actually work with parking and unparking and how they actually integrate with the scheduler. Now we get into the sort of the, the problem areas and where we have the pain points with, with virtual threads today. So I'm going to go through two, two, two, two scenarios. One of them is, is parking while holding a monitor. I mentioned about all these, these blogs that have these gotchas at the end. This is essentially what they're actually trying to show you in, in, in, in these blogs. So we've taken the same example but we've actually put it into a synchronized block. So the synchronized block here is, is, so that's kind of, think of that as a monitor enter, here's a monitor exit. Same thing if this was a, a synchronized method. The code that we had in the previous, the previous section is, is, is exactly the same. So what happens is, is we're going to do the, going to do the, the choir here. Andrew R. Villan again is actually holding the, the, the, the permit for the semaphore. So this virtual thread has to, has, has to go and park. So what happens this time is we're going to try to park but we're actually holding a monitor. So, so this yield down the bottom fails. Why does it fail? Well, we, we get into why it actually fails in, in, in a second. But something, we're not able to actually go and release the, the, the, the, the, the, the, the, the, the, the, the, the carrier thread to do other work here. This is actually why we have, you get performance issues and, and why we say that monitors lead to a quality implementation issue is because of this, this, this issue here. Now what actually happens in this particular case is, is that instead of actually failing, it actually falls back to actually to park on the carrier and the, the, the, the, the semaphore in this case works exactly the same as, as, as, as this would if, if, if you were able to unmount, we're just not able to unmount. We can't let the, the, the carrier go away and, and, and, and do other work. And right, why, why do we have this problem? And so we have this problem because of the way that monitors are implemented in, in the, in, in the Java VM. There's different, there's different kind of locking modes. A lot of this is sort of beyond where, where I typically work. But in, in, in Roman's talk earlier on, he actually, actually shows some of this where, where, the fast locking type is essentially, is essentially putting a pointer into the object back into a lock record that's actually in the, in the stack. Oh, if we, that, then we can't actually start removing frames that, when, when, when we, when we unmount. There's also these inflated cases where, where, you're actually building up a waiter list of who's actually waiting for the monitor. And what, what goes on to that waste list is, it's actually, it's actually the, the VM's internal Java thread, which is, is essentially the, the carrier thread in this case. So these are the reasons why, at least in this particular locking mode, that you cannot release the, the, the carrier at this point. And so there's, there's, there's, there's magic in the implementation to actually to track monitor usage to prevent this happening. There's another locking mode, which, which is, which is the, the, the, the newer one, the lightweight one where there's a, a, a per, per, per thread little stat lock. And that's got issues as well, because that's actually associated with the carrier thread, not with the virtual thread. So what we do about this, so there is a, there, there's a sort of larger, longer term effort that has, has kind of been underway. I think I saw Robin Ian here at one point. He actually started this work to actually completely re, re, reexamine and do a new implementation of Java monitors. Do it in Java, rather than actually in the VM. Because a lot of legacy code and a lot of, there's, there's a lot of history there. Now that is a longer term effort. There's a lot of unknowns, there's a lot of exploration. Both, we needed a plan B and, and, and plan B involved a hero and the hero in this case is, is, is, is, is, is Patricio in, in, in the hot spot team decided to go and have a go at actually trying to do a plan B, which is change the existing object, object monitor implementation to work with virtual threads. So, so what he actually did was, was, he's, he's, he's come up with a, well, there's, there's, there's several steps in this and this is, this is, this is by the way, this work is all actually in the, in the, in the Loom repo. So what he actually does is, is for the, for the StackLock cases, is he just, just, just inflates and then for the inflated state, he's actually for the moment has, has the VM actually doing a, a, a, a, a, a StackWalk to actually replace the owner so that it's actually not the, the, the, the Java thread, it gets, it, it, it, so the, the VM's view of the thread is actually the, the, the Java thread. And so that's a little bit expensive, but it actually does work. And, and here's a solution for the lightweight, locking mode as well, where the, the, the lock stack actually moves at, at, at, the, the, the mount and the unmount time. There's other work that's actually going on in parallel, some of the work they're calling Fillmore is actually doing about changing the, the, the, the, the lock ownership to be the thread ID. Once that work actually comes in, that means that you actually can eliminate the StackWalk, you eliminate all the GC overhead, you eliminate all of the actual overhead there of, of, of this, which is actually quite nice. So a lot of pieces from a lot of people are coming, coming, coming together, which, which is nice. So this is working in the, in the Loom repo for the moment. So we'll go back to our, our, our, our slides again and what would happen with, with that example, if we actually run it with the, some of the bills that we have from, from the Loom repo today. So when we do, we actually do our acquire, we bottom out again at the, at the, at the yield as, as before, but this time it actually succeeds. The, we release the carriage, go off and do other work, all very positive. So that is good. So that's one of the pain points with pinning and it'll be wonderful actually to, to, to, to, to, to get that in. Second scenario then is, is the contended monitor case. And that one is, is, is, is you have an example like this. In this case, this is, I've got rid of the, the, the, the, the, the, the, the semaphore from this example. And, but I'm actually going to block here. So we assume again now that Andrew is actually holding a lock this time, rather than the semaphore. And, and, and here we have our virtual thread number 22 is going to attempt to do a monitor enter at, at this point. And what happens today is, is it actually just, it actually blocks when at that monitor, at that, at that, essentially at that, at that, that monitor enter. So what's going on here is, is, is, is a contended monitor is, is, is actually a call into the runtime. It's essentially parking in the runtime. And, and, this is something that has to be, we have to remove that and essentially pop all those frames in order to be able to actually, to, to, to, to deal with this. So the way things are actually working at the moment is, is, is what, what, what Patricia has come up with is, is, is that essentially allows the VM to do a yield while it's from, from, from in the runtime. So normally we actually would do these yield from, from, from the, from the Java site. So it copies the frames off the, off, off, off the stack into the heap, just like we would with, with, with, with, with a, with a normal and, and, and, we would do it a normal yield and, and freeze. And what it does is, is it actually puts the, the virtual thread onto the wait list for the, for, for, for the, for the lock. And at a high level, it's as if we're actually doing a yield at that point. There's a bit of, there's a bit of magic that goes on with where it actually has to, has to return back as, as if it hold the, hold the lock and then you've got to run the, the, the stub that actually goes and actually does some fix up some, and there's, there's VM magic that actually happens. But essentially you're turning back to Java in a, kind of a blocking mode and then we can actually fix up the state and, and, and move it to its block mode so that the, the thread is actually blocked. So what I've put here is, is just to give you, just to kind of, you can visualize how this works. We do our monitor enter, which this could be a, a, a synchronized method as well. And it's as if we're actually calling into yield at that point. So that's actually very, very nice. That actually can work. So when Andrew releases the, the lock, then we, we just continue on and we're, and when we do that, what happens is, is Andrew releases the lock. We've now got no, no, wait for virtual thread 22. How do we actually get, how do we actually integrate back into the scheduler to get that to, to, to run again? And the way it actually works is it actually just moves the, the thread into, into, into a, into a list and, and unblocker thread will, will actually unblock it. This, so for those that have, understand how, how reference processing work in the GC, this is essentially kind of like another reference handler. It's the, it's, it's, it's, it's, it's, it's essentially queuing up, queuing up objects that are, get handled by, by, by Java thread. So the unblocker then just, it, it, it just snapshots the, however that, that, that list and, and then it just wakes up those threads, which puts it back into the, to the scheduler's queue. So that's all very nice. So when our example here is, is that Andrew has released the lock and we, we, we, we, we, we queue up that virtual thread to actually continue and it gets scheduled and it's a continuous execution inside the synchronized block. Very, very nice. So okay, high level question then, does this actually solve all of our pinning issues? And the answer is no. There's always work to do. We've got, we've got, there's, there's issues with native frames. You can't, you can't actually yield with the native frame on the continuation stack. That's not a problem. I think that, that we, we, we, we will ever, ever, ever address this, a lot of unsolvable issues there. There are other things that go along with monitors. There's object weight and it's important that we actually make progress on, on, on that one. There's ideas on how to, how to improve that one. And then there's the other more difficult one, which is class initializers. And, because class initializers is, is, will require more surgery in the VM in order to be able to address it. So these are things that are not addressed in what's in the Loom repo now, but are things that have to be addressed over the next while in order to eliminate these, these, these problems. Okay, moving on. I'm going to talk about IO now because there's a whole other set of this that, that, that, that, that goes along with, with, with, with IO. So let's talk a bit about SOCAs first because networking is, is, is actually, is, is, is actually straightforward. So here we have our virtual thread is actually going to try, attempt to make, establish a TCP connection. This case is Fosn port 443. So it's the SSL port. Okay, what happens there? So same, same, same diagram as before. We've got our carrier at the top and then we've got our green, green frames for the virtual thread. We're doing a socket, we're in the socket constructor, which actually initiates the connect. What does the connect actually do when you're on a virtual thread? It's actually going to initiate the connection, arm the file descriptor and then do a yield and to, to, so, so that the, so that the, the, the, the carrier can be released to actually go and do other work. So this is what our stack would actually look like. It goes down through the, the, the, the IO code and, and IO code and, and does, does, does, does our, does our yield. Carrier gets released to go and do other work. We're all good. What we have then in the, in the background is, is, is the way things work for the moment is there is this thing called a polar thread. The polar thread interacts with whatever the IO mechanism is on, on the, on the platform. There's implementations for, for, for, for E-Pole and, and KQ. There's one that integrates with the, the Windows WinSock driver. And what it does is it's just listening for events. When there's events from the operating system to tell you that the, that these are already events, then it just unblocks the corresponding virtual thread by, well, what it does is it actually just, it, it, it unparks it and just queues its task to the schedule and things that work. That little diagram over there, did you see that it was spinning? I'm going to reuse that in a couple of slides just to make, essentially all that's doing is listening for events on parking throughout, listening for events on parking throughout, listening for events and so on. That's all it does. And so, but back to this, back, back, back, back to our example here is, is we're, we're trying to establish a connection. The connection is now established and because we've done a wake up, we pop all the frames, we're, we're, we're, we're gone past the socket and constructor now and we're, we're, we're all good. So this is kind of the way things work today in, in, in JDK 21. You've got this, you've got this thread that's picking up IO events. It's queuing up and the corresponding virtual threads to, to, to, to the schedule, in this case I should, I did, depict it as a, as, as a box full of carrier threads. That's actually not all that efficient because a lot, there's a number of issues with this one. You actually start scaling things, scaling things up and particularly is, is, is that you've got a, you've got a number of carrier threads that correspond to the number cores and then you've got these other polar threads that are actually trying to compete for CPU cycles. You've also got the issue where, where you're picking up IO events on one thread but it's actually going to end up being processed on a different thread. So this is the, the, the, the, there's room for, for, for efficiency on, on this. So one of the things we actually have done in, in, in, in, in JDK 22 is we magically move these polar threads at least on some platforms from being platform threads to being virtual threads. Now the implications for that is they actually integrate then with the schedule. You're only picking up, you're only picking up IO events and queuing up virtual threads to on-park when there's actually cycles to actually to go and do that. In addition, because of the way this polar thread is actually written is it will actually, most of the time, continue the IO operation on the same thread that picked up the event. So you avoid having to actually go and dealing with events to, going, going, going, going between threads. When you actually scale it up, this is actually kind of what it actually goes and looks like is there's actually in, in polar threads that are virtual threads are actually, and, and then there's this, there's this other gutter guy in the background which is when it has to wake up polar threads, when there's nothing else for them to, to, to go, to go and do. And this is actually quite nice. There's, there's a nice paper written by them. There's a, there's, there's, there's a team in the University of Waterloo that actually work in, in, in the sort of similar area on their own library of, of for lightweight threads in C++. And they've got an initial paper which deals with all the IO strategies. And this is kind of one of the IO strategies that, that they, they're, they're also using by default. So, so that's Martin, and Karinson's team in, in Waterloo. It's the paper is, is, have your cake and eat it, which is a great title for, for, for, for a paper. So, so this is actually turns out, and some, and some benchmarks turns out to be actually quite, very, very profitable. And because, and so this, this, this is just a, some random benchmark that's actually sending, sending a 1K request and getting a 16K response. It's, it's on the loop back. There's, there's a client and, and, and, and, and a server thread for each one of these. So there's a lot of parking actually going on. There's a lot of IO between them. But the nice thing is, is we actually see improvements, significant improvements on, on, on, on, on, on, on arranging systems, which is actually quite good. Okay. Mo, moving on a lot, I want to talk about a bit of file IO, because this is where we actually put on our sad face. Because it's, it's not as, not as, not as good a story. So the example I'm actually using on this one is, is, is, is, is, I'm, I'm opening a file. So this is actually, this is code executing on a virtual thread. There's, I'm, I'm doing a file open here and I'm doing a file read. So two, two, two file operations. Down on the right, I've got just, I've just showing the, the, the box with the, let's assume this is a four core system. This four carrier thread is actually sitting there. So what actually happens today, and it's, it's, it's a bit lame, but I'm just explaining with the way it actually works today, is when you attempt to do a file IO operation that may actually consume the thread, it temporarily increases the parallelism, which will trigger an additional worker thread to be available to do other work. So if you've ever seen fork joint pool manage blocker, this is essentially the same thing which your actual compensation that actually can happen in managed blocking operations. So you, you do your file open, your thread is actually not, unavailable to do other work. And then when, when, when the file IO operation completes, you decrement the parallelism again, and what happens is, is then is that the, the, the number of worker threads that are available reverts back to where it actually was. These additional extra worker threads that might get created for these, they might hang around for a little while, but they will eventually, they, they, they, they, they, they will eventually terminate once the system is acquiescent. So same thing, what happens when we had to, to do a file IO, or sorry, we do a read, same kind of thing, increase the parallelism, and we do our read, and the once, once the read is actually complete, we will decrement the parallelism again. So this is all kind of lame, and you say, well, why haven't we done any better on this? So this is one of the things that we actually have been playing around with for, for quite some time. There are asynchronous IO interfaces on different operating systems. I'll just talk about what, for IO, you ring today. I'm just talking about in the context of file, we've also looked at it in the context of, of, of, of Socus as well. So in the Linux operating system, there is, there's, there's a completely different type of, of interface to the operating system, which is, which is supports asynchronous IO operations, is essentially, essentially allows the trap into the kernel. And so you've actually, you, you, you can actually queue up submissions in, in, in, in memory, and any events that are associated with those when they're complete are actually queued up in a ring. That's also, so it's successful from both the, the kernel and, and, and user space. So there's, there's a lot of issues and a lot of complications trying to interface this to the, to, to, to the JDK to be able to support a lot of the, the, the libraries. So what we have been doing, and this is in the, in the open JDK sandboxes, we have an, we have a low level API, which sits on top of the work that Maurizio was talking about in the previous slide, a previous presentation with an FFM. And that will provide the lower level access to the, the, the submission and the, the, the, the, the completion rings. And it, it, it, it, it, it hides a lot of that, that, that, that kind of detail. Now, in order to be able to work through some virtual threads, we actually have to do, we have to do a bit more replacing of foundations around the place. So there's a lot of prototype re-implementation of a lot of the Java, Java I.O. classes. And to be able to work with this as well. And that actually allows the, the problem of file I.O. to be actually reduced down very significantly. None of this, none of the completed bits for this are in the Loom repo yet, but we will, we will get there eventually. There's a bunch of design choices that, that have to be worked out when you're actually interfacing with something like I.O. U-ring because as to which threads can actually access the, the, it's, I.O. U-ring is not kind of designed for, for multiple threads to be, to be accessing a ring at the same time. So what we will end up doing is, is, is, is, is, is, is having essentially multiple I.O. ring instances and one per, per carrier essentially. And that actually fixes the completion, the, the submission side. The completion side is a little bit more complicated and there's a number of design choices around that. So the main message here is, is that there's a lot going on in this area as well because all of that areas of the libraries have to, have, have to, have to play cleanly with them, with virtual threads as well. Okay, so I'm not going to go through all of the other things that are going on, but I'll just talk, just, just, just, just mention a few of them. So Professor Doug Lee who's the, the, the sort of the, the, the world expert on concurrency is doing quite a bit of work at the, exploration into fork joint pool at the moment. That is for scenarios where you have a smaller number of cores, because a lot of, a lot of container and, and, and the cloudy type systems you're running with two or you're running with four cores. And we've, we've observed in many scenarios where you get underutilization. And a lot of of that relates to just the time it actually takes to work for, well worker threads are parking and then they have to be unparked in order to actually to actually to do work. So he's exploring a number of things on that and we're trying to come up with good benchmarks to be able to actually to measure these kind of things but this should, if this turns out to be profitable then it'll help some of these scenarios where we appear not to be having full utilization on smaller systems. JVMTI makes me scream. It's a very very invasive API and it's very much challenged by features in the Java platform where you move them out of the VM into Java. So having a native tool interface where a lot of the runtime is actually in Java rather than the VM is a challenge. We have a very good story for JVMTI and virtual threads and debugging and for many other types of tool agents but it's not working well for profiler type tools that want to use JVMTI and because there's a there's it's having to coordinate with a lot of code that's actually executing in Java. So there's a lot of work going on there to try to solve some of the problems and it's one where there's been some progress made but it's one that's going to take more time. And there's other efforts that I'm not going to talk about today which is about scope values which is essentially allows us to communicate something to a remote callee without having parameters in the call frames. Andrew Haley is actually leading that effort in Project Loom. Then we have the other big area which is structured concurrency which is all about being able to coordinate multiple threads that are decomposed running some operation is be able to actually to deal with them as a single unit of work. So there's a lot of interesting things going on there that API is currently in preview and we will have to do another preview of this for the next release. So these are other kind of efforts that are actually going on in this project at the moment. So that's kind of it. I think I've actually made it in with few minutes to spare and this is sort of links to the current JEPs that we have, the repository. When I was talking about the work on the monitors and some of the other changes around fork joint pool is they're accumulating in the Loom repo now. And yeah, okay. You'll have to hand out microphones. Okay. Hello. Hello. Hello. Test. Yes. Hello. I have to take questions here first. I think. I had a question on, can you hear me at all? Okay. So you mentioned on the, from network IO, you said you had a good solution, right? And then you said for file IO things are much trickier. But what? I'm sorry. I'm sorry. Thank you. Yeah, I'm sorry. Would you mind repeating it? Yes. So I was saying, so you mentioned network IO and you said, yeah, we have a pretty good solution here, but for file IO things are much trickier. But what's the fundamental issue that save a solution from network IO cannot be used for file IO too? Okay. So the question is, is, is why is the solution for, why can't the solution for network IO be used for file IO? And that is because there isn't the equipment sort of readiness API that you get for file, for non-blocking IO. So the reverse works, but not, what we're able to do at the moment is we're able to actually to map onto multiple different 20 years of scalable IO mechanisms for networking IO. There isn't really the equivalent for file IO. Okay. I see. Thank you. So simple question. When, when you solve these problems and get a good implementation and it's all in the next version of Java and you've, you've solved all of these problems, what do you think is going to be the impact? What's, what, what, what are you aiming for? Okay. So, so, so the, the, the ultimate goal is this for Java developers to be able to write code that, that reads exactly how it actually executes. So we need, we want to avoid having complicated, hard to, hard to read, hard to debug code that you get to, that the people are actually forced to write today with a synchronous IO are, are, are reactive. That's sort of the, the ultimate goal with this. So get the scalability with, with, with very obvious easy to read code. And at the same time as harmonious with all the other parts of the platform, such as debugging and profiling and so on. Because one of the things you actually lose today when you start going down the async type route is you, you lose so much of the, of the, of the tooling, you lose your debugging, you lose your profiling. We want to bring all of that back so that everything just works. What we have today in 21 is actually, is actually, is actually pretty good. You have a large range of applications can actually be developed and, and scale very, very well. But there's, there's, there's quite a lot of other amount of code that we want to get working with, well with virtual threads too. First, thank you so much. I think like suffering of millions will end with this holy light. My question is, when you look at the mitigation techniques that are recommended against the shortcoming of monitors, one of, like the primary thing is you're just recommending people to replace those with locks. Just use a lock. Okay. That's a, okay. That's a good observation. So in JEP 444, which is the, the, the Java enhancement proposal that introduced virtual threads as a permanent feature is, it, it, it suggests that if you're running into issues with them, with, with pinning with object monitors, you can just replace them with Java till concurrent locks. And that was very much kind of short term, short term advice in order to actually take to avoid the quality of implementation issue. But we never, we never said we'd never fix this problem. It was, it's, it's, it was, it was, it was a tactical decision to, to make the feature permanent without addressing the, the monitors issue. No, I very well understand the solution. It was just felt to me like it sounds like something machine should be doing. Like why doesn't the VM replace it with a lock on behalf of me? Right. So this, so, so, so what you're asking is, is why don't the VM magically replace it? There's a lot of issues with, with, with, with, with doing something like that. So. Okay. Thank you. Okay. Oh, there's one other. So with all the work going into addressing issues with monitors and pinning will constructs living inside Java, you till concurrent also benefit from that work or is that isolated from each other? Okay. So Java till concurrent does not base itself on, on, on, on, does, does not base itself on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on, on,
The Challenges of Running the Fuzion Language Natively on the OpenJDK
Okay, people, we are ready for the next talk. Please listen up, quiet down, and get ready for the next talk. Thank you. Okay, thank you, Andrew. So I'm going to talk about the fusion language and a bit more concrete about how we are running this on the open JDK. It's basically the problem of mapping a functional language to efficient data bytecode. The background of me, I'm basically doing compilers all of my life. Don't go into details, but what is important right now is I'm working at Tokiwa software in a team of three at the years where we together develop the fusion language and its implementation. A quick overview of this talk. I'm going to give a very short intro into the fusion language, and the rest I will show you through examples in the code. So I'm going to go into mostly different types and how they are realized on the open JDK. We're talking about tech union types, about product types that have values and managers, about type parameters, how to do margins and dimensions. I'm talking a bit about the class of the class. So I start. You can't hear me in the mic. Can't hear you? Yeah, yeah. We're not getting anything. Is it all plugged in properly? Turned on. Is it on? No, it's on. Okay. I'm sorry. Okay. Sorry for those online who missed that. Okay. I will start with a quick intro to the fusion language. Fusion is based on simplicity. It's all based on one single concept, which is a feature which takes the role of Java classes, Java methods, functions in other languages. Fusion is aiming at safety critical systems. We see more and more systems becoming safety critical. And in that area, we see that tools can make the developers' life much easier. Short the language, fusion is statically typed. It is polymorphic in different ways. It has union types, parametric types, and object-oriented style inheritance. And it is pure using effects to model side effects. The fusion tool chain looks like this. We start actually with fusion source files that go through a front end, compiled into fusion modules that are then processed by a middle end and by a static analyzer into a fusion application represented in the intermediate representation. And that is then the input for different back ends. And in this talk, I'm going to go into details of the JBM back end that then transfers this intermediate representation into bytecode in Java class files that are run on a JBM. The first aspect I want to focus on is tagged union types. I'll explain immediately what that is. As an example, I take an oven. I implement an oven in fusion, and an oven has an input, has a setting that oven can either be on, either be off, or it can be on with a temperature setting given in degrees centigrade or Fahrenheit. So there's three options in that union type. And off is just a unit type. It's just a value with no other state. While the temperature settings is either a setting that gives a centigrade or a temperature as an integer or a Fahrenheit temperature. That's a float in this case. And within the oven, we can then use a union type value and match on this or match on the setting and do different things depending on whether it is off or it is a temperature giving in centigrade or Fahrenheit. Now, I have to make some space here. Now, when we compile this to Java, it gives Java source, not to Java bytecode, to explain what we do that. We actually compile such an tagged union type into several fields. First, we have a tag field. That's why it's called tagged union types. That decides, do we actually have an off value? Oops, I have to make space again. Do we have a temperature or, and what kind of temperatures? And in case it is a centigrade temperature, we need another field to store that value or for Fahrenheit. So we have basically several fields to store a tagged union type value. I'll drive this example a bit further now. This is the most generic case. We have the tag and the different kind of values. And during the talk, I will go more and more specific until I reach the point where the oven literally disappears into the void. So the next step towards that is if we do a bit more an object oriented approach, we can use a temperature that is actually a reference to an object that provides a inner feature to get the temperature as a Celsius value. So it's a union type of off or the temperature and the matching changes accordingly. Now, how this could be implemented is we would have a tag that now decides whether it is off or it contains a temperature. But this is not what somebody would do in Java typically. This is a typical case where a null pointer or null reference would be used. So this is also what our back end does. It just uses one pointer value in that case and uses the null value to represent the state off. So that's the so-called nullable implementation of the tagged union type. Going further, having a more complex example, now we extend the parameter to also have a clean mode and to maybe have some error state, which essentially means we have four different possibilities and we need the temperature and the error state. But of course, there's never the case that the temperature and the error state are both used simultaneously. So we can join them into a single object value because only one of them is used. Now, the tag field decides on which of these four states we have, but actually we could use specific values for the object reference to represent the off and the clean states such that this all collapses into a single reference field for all four states. This is also what our back end does in this case. So such a tagged union collapses into a single field. Getting even simpler if you have a very simple oven that doesn't allow any temperature setting. It just has the modes on, off, clean, or error. That is basically a classic enumeration. Internally, this is just an integer, so we only have an integer type for that. If we have an even simpler oven that could just be on or off, there's only two states. So that falls down into a simple Boolean type and is compiled into a Boolean value by the Java back end. We can go further if we have now an application. We have a static analyzer for our application. And if the application actually never uses an oven that is on, that value can actually be determined to be a void type value. Avoid infusion is not like in Java. It's much more like a never type, so the result of a call to exit or so, that really never occurs at runtime. So if you have that, we don't need any data to store because all ovens in that application are always off. Can even go further if you have an application that doesn't even produce any ovens that are off. So maybe an application that only uses a list of ovens and that list is always empty. So both are never used, so we don't even need the oven anymore because this all can be thrown out. So that much to take union types. Next point is product types and fusion has value semantics while Java has reference semantics. A small example, I want to do a small graphics example here. I'll start with a very simple product type of point, which is just a combination of two coordinates, x and y, two integers. And I pass these points to a draw function that should draw these. I won't go into details there, but I just show you a bit of the syntax, how fusion uses effects, in this case, a graphics effect, to document that this function actually has a side effect. It requires an environment where graphics is available to actually draw this. Now we create a point, store it in a local variable p1, and pass that to the draw function. Now the question is how do we do this passing of the argument? How do we represent this point? We have value type semantic in fusion. So what we do is actually split this up into two fields, two parameters here for the draw function that are passed separately in a call. Similarly, when we create a new point, that point is split up into two local variables, or two fields that are then assigned separately. And finally, when we make this call, we pass on a call these two values individually. That works nice, is performant. Problematic in the JVM backend is the case of returning a value product type with value semantics. So here we have a shear function added to our feature point that creates a new point that is returned as a result. So Java cannot return multiple values, so what can we do instead? I need more space for that. And I've looked into a number of possibilities how we can return such a product type in Java. The first baseline in that analysis I looked at how, at inlining. If you would just inline the call, returning a value is just assigning between local variables. So we can use that, but of course that doesn't work in the general case because inlining will blow up, will not work for recursion and many restrictions. That's why I put such a flash behind that. That is not a solution for all our cases, but it gives a baseline for comparison. The typical way in Java to do that is that the call method would return a newly allocated temporary container object that just contains two integers. We could also do the other way around. We could have the caller preallocate or preallocate an object and pass a reference to that object to receive the results. The fourth solution I looked into was we could use static fields. So when returning two integers, we're returning a point, we just have two static fields, X and Y, and we store the value in there and the caller then retrieves them for that. I put a flash there as well because that is not thread safe. It doesn't work when we have more than one thread. What would we thread safe would be using thread local variables to return this value? Or if we would put these two static fields into our own thread instance. If our threads have a specific instance of that that could have fields, then I'll use for returning values thread locally. I've analyzed these different possibilities using the Java micro benchmark Harness, AMH. Actually, to my surprise, the allocation of a temporary object that is returned was even faster than the inlining version that I analyzed. But unfortunately in AMH, I couldn't analyze the last case of using my own thread type and fields in the thread structure. So I added my own ad hoc benchmarking framework to do the same measurements and I did a fairly different results but basically the same, but I also cover the last case. Now, I exclude those cases where I said that they are not generally applicable, so the inlining and the static fields. Of course, we can't use that in our implementation. Next, using thread local fields, thread local variables, which are relatively expensive, so kicking that out as well, we are left with allocating temporary objects and relying on the jit to optimize this because the jit does very well, but I don't know what the heuristics are behind there and whether we can actually rely on that. So for now, we're using thread local variables to return structured data on a call. So we're moving forward to seeing Project Valhalla coming into life because Project Valhalla will introduce value types and will use type descriptors for so-called Q types that provide value semantics for method arguments and method results, which is exactly what we need here. What I don't see from the current state of Valhalla is whether you actually that when you return a Q type, you actually don't have any allocations. So I would like to best have the guarantee to have value semantics and no allocation on a return. So next, type parameters. Generics would be the counterpart in Java. Here's a small fusion example how type parameters can be used. This is a function that calculates the arithmetic mean of free values that could be of an arbitrary numeric type t, and it just sums up those three values and divides them by the value of three, but it has to first create a value of three in that given numeric type. That could be something like a complex or a double or whatever is fed in there. And now we can call this with integers or with floating point values. Java's implementation of generics uses type erasure, so there's no type information at runtime, but a fusion uses a so-called monomorphization. That means we have specialized versions of the code for every type that is in use. What that means is that our back end, in this case, creates for every actual type that is used for a generic function, a specific implementation for that type that has all the types stripped to the actual type parameter. So that's quite straightforward. Yeah, inheritance, fusion has multiple inheritance. The question is how to implement that. The ways we've looked at is actually putting some type identifier into our instances and then doing some kind of table look up and invoke static to call this. We looked into how invoke dynamic could help us. Unfortunately, fusion is so static, it doesn't help us much at all. And finally, the invoke interface is actually the most useful solution for us because that supports multiple inheritance. So in effect, what our back end does is in most cases, our dynamic binding just binds to one possible target, so we can just use an invoke static. And only in few cases we actually see that there are possible dynamic targets and then we compile them to an invoke interface and we have specific interfaces actually defined for every single function that is called this dynamic binding. So we have a case where the classes that we generate could actually implement really, really many interfaces and we have to see how that scales with bigger applications. So coming towards the end, class file verifier, not much to say that, but the class file verifier helped a lot comparing the development of the JVM back end to the C-back end. Did we do that before? Because we saw so many errors much, much earlier than we would see in the C world. So the status of fusion at the moment is the language definition is getting more and more stable, base library, there's still a lot of work in progress. We have a JVM and a C-back end. Then we have basic static analysis tools available. And if you came to see the dogs, this is Fedex and Shadow who disturbed me working on that during the last year. This is where you find more information on that. Follow us on Twitter, give us our give us styles in GitHub. Stay informed. Thank you. Any questions? Hi. Hi, so you mentioned monomorphization and you had this example with this function that takes t which is numeric And then you generated I think 10 different versions for all the numeric types But then if you have like three type parameters, which are all numeric do you generate a thousand different versions? If there's three type parameters, there will be one version generated for each combination of these three type parameters Actually is used in the application. So it's used in the application. So it's like kind of a closed world Yeah, so so we have a static analyzer over the whole application Okay, so you don't have incremental or separate compilation It's static at compilation. So we've been very compiled. We look at the whole application We don't have any dynamic adding of more code. So we don't luckily don't have the problem of having to be open to add More there, we know exactly what will be in there And do you have a Java FFI with the JVM backend? Do we have a Java foreign function interface? Can you call into Java? At the moment not we are we are looking forward to using your Panama there as we've learned Because that would be a big step for us to also helping on the C interface even if the C backend We don't have any FFI force calling C code at this point Okay, thank you Do you have your mind made up on in terms of the Approach to concurrency you want to take I mean on the JVM virtual threads could be an option But at the same time if you have a C backend that could be really expensive to implement on your own We do have a very simple concurrency support right now, but it's basically limited to starting threads But there's no not much synchronization or anything going on Our current idea is that when when we do something concurrent That we want to use the static and Analyzer as much as possible to to set it we prove that there's no race conditions and that the code is safe The question is what channels do we want to provide? to actually allow interfed communication and all of that and We are still looking into possibilities there are many many things we could do it's not decided yet, so Thank you Maybe
OpenJDK Project Wakefield : The Wayland Desktop for JDK on Linux
Hello. Hi. Everyone, please take a seat. We're about to, well, we are starting. This session is OpenJDK Project Weight Field, the Whalen desktop for JDK on Linux. It's going to be a bit of a show here because we have three of us, so we're going to swap over. There's only two mics. We'll do our best. Very quick intro and then I'll turn it over. So I'm Phil Race. I work at Oracle. I'm the lead for the client libraries on OpenJDK. Next to me is Alexei Ushakov. He actually used to work in the Java 2D group at Sun a long time ago, but these days he works at JetBrains. And there's Nils DeGraf, who's a product manager or product owner, product, whatever, at Red Hat in the desktop group. And so he is going to do our first session. I'll hand it over to Nils. Okay. That's good. Maybe I should. You should. I should. This is going to be very interesting to do. Yes. I did not sign up for some dancing, but it's fine. The first, quickly about structure, we're first going to quickly explain what Whalen is and then explain how OpenJDK, then it tries to work on it and the whole weight field project. And then finally we have an actual demo and some explanation by Alexei. So quickly for Whalen, because we don't have that much time. So a bit closer. Okay. That should be better. First of all, what is Whalen? What is X11, for example, that it tries to replace? So it's about displaying. So rendering things into a, let's say, a matrix of pixels, something called, which sometimes called a frame buffer. And you usually try to get that on a stream, but maybe also try to stream that somewhere over the internet. And why do we need something fancy around that? It's because once you have multiple applications trying to render something, you want to have something like some kind of decisions around, okay, do we put them next to each other on top of each other or that kind of thing? So basically window management or tiling or whatever you want. And that's where a display server comes in. And you talk a display protocol between the apps and a display server. And it's usually also very related to input management, because, you know, if you have typing something, you want to go to your browser, for example, and you don't want to type your browser in some key logger application or something. So quickly about X11, which is, let's say, the old thing. So X11 started from the white. So we have the X11 server, which you start up normally using something like startX, which is going to listen to a socket, usually based on the display name, usually something like zero. And then each of your applications, I mean, imagine those two top applications being your browser or your file manager. And then they will actually say they will also connect to that socket. Sometimes if you have to do a display equals variable for the environment, and that's when which kind of socket it will try to listen to. So it will define the server. Now you're going to say something like, OK, now please X11 server, can you create me a new window? So X create window or something with this window, with this width and height. And then you can do your fancy things that you normally would expect to be able to do from a window manager. Now, that whole logic of should we be doing tiling or should we be doing overlapping windows and so on. That's usually what another X11 client, which is then usually the window manager comes in. And that actually helps in setting all the logic, let's say. So that's how the usual thing in X11 goes. Now X11, a very oversimplified. X11 is old. It's from the 80s. Now being old is not necessarily a problem. But it is older than, for example, Java and Linux. But one of the things is that it made a lot of assumptions that don't necessarily hold anymore and that are baked into the core protocol. For example, it talks about a root window, which we, once you have multiple monitors, you can still try to have like one big frame buffer that spans all of those monitors. But if you have multiple TPI, you can do trouble. Once you have GPUs get complexer and complexer, there's overlay planes and you want to do fancy stuff with that for performance reasons and battery reasons. There's security. X11 allows you to screen share anything and do input sniffing and snooping and spoofing without any kind of consent of notification. There's a dev room about Linux on mobile. I do not want a device that actually could do all of that with my private data. And there's also new UX paradigms like HDR, tiling window monitors and so on that actually make it be a bit harder, especially HDR. It's very hard to do, for example, in X11. So at some point people got together to create a new display protocol, which is that Wayland. So it's very minimal, but really, really minimal. It really tries to make sure it does not fall into the trap of making assumptions again. It goes to just say, okay, we have clients that want to send something rendered to a server, a compositor, let's say, and then we can then extend things over. So it doesn't even have a concept of top level windows, for example. You actually need an extension for that. It's called the XZG shell. If you ever want to look it up, it's very fancy. And some things you just don't want to have that in the display protocol anymore. For example, screen sharing is also related to videos. So we said, okay, let's try to put that out of the protocol, the core protocol and something with portals. But I will explain what portals are later. So what does a typical Wayland session look like? It's again, we start from the right. So we have the Wayland compositor, which you start. That's going to be, for example, with GNOME. It's going to be GNOME shell with KDE Billy Quinn. This way it will be something else. And then you will start a Wayland socket, which you can again listen to. You will talk to Wayland protocol. And a Wayland client will say, okay, please create me a surface. And then using that extension, for example, the protocol extension using XZG shell, you will then have something where you can say, I want to create a top level window and I want to do it this size. You can do positioning Wayland. Always fun. There's a lot of reasons for that. And for example, another Wayland client can be X Wayland, which is its own X11 server. So actually, inside your Wayland session, let's say, you can actually also then create multiple X11 clients, which will talk the whole X11 protocol and X Wayland to the translation to Wayland itself in the best way possible. And I did lie a little bit that this is not, or I did say that there's not everything yet. So there are some things that we don't want to put into a display protocol anymore. We want to do that in the portals. So we did something that that's new with Flatpak and snaps and all these containerization fancy methods. We want to make sure that there's some kind of permission thing, let's say that allows you, for example, I want to do screen sharing, let the user choose if that's okay or not. And then the deepest interface within a Flatpak, you can access that. And then, for example, go to Pipe Wire and other components which do not necessarily need to live in the compositor. You will go to those and then we can go to the next step, which is how portals can be maybe implemented, or, you know, how this can be used from within Wakefield. And I think that's going to be the part where Philip is going to come in. Okay. So Neil's described what Wayland is. And today, what we now have is a project to be able to support that Wayland compositing desktop from openJDK. And what's it all about really? Well, JDK is clearly a platform. It's not just an application and it abstracts away that underlying platform. So we're not going to be exposing anything about Wayland. Today, on Linux, it happens to be an X11 client. It's basically an X application at a, you know, crude level. But to be able to support that Wayland desktop, we need to make some changes in JDK. Some of the policies that Wayland has that Neil's touched on around security and things like that mean that the things that just are supposed to work in JDK won't work, even if we're trying to be that X going by that X Wayland product client that Neil showed on his diagram, right, which is what we call X compatibility mode. So that's the Wayland's compatibility mode for X11 applications. And although we don't even work in that today, even if we make it, even if we start to work in that, is that really the long term solution that we want, what we really would like is to maybe, you know, be a full on Wayland native client. So Open JDK project Wakefield, there's a URL there, has started a couple of years ago. And there are people from Red Hat, JetBrains, and Oracle who are working on this. We have a repo at the standard kind of in the standard format for Open JDK project repos. And what are our goals? Well, first off, we're going to try to support that compatibility mode properly. So we'll have a fully concomit, conformant JDK, and everything works as well as it can do, as well as it should do on the Wayland, when you're running on a Wayland desktop, but we'll be talking the X11 protocol. You know, we don't talk about it here, but, you know, most people will see that these days, if you log into a, you know, Linux desktop, they, and if you pay enough attention, there's an opportunity to switch between pure X.org and the Wayland desktop, which supports that compatibility mode. And right now, JDK only supports the X.org. So the longer term goal, as I just touched on, is that we want to be able to support native Wayland. So the X11 compatibility mode is some touch things that I'm going to touch on, but the much bigger thing is that native Wayland port. And there's a list of, there's a list here I won't read out of the different kinds of things that we need to deal with in making all of this work. And some of what we need to do for the native Wayland port is really just starting to emerge in the latest versions of GNOME. So it's not, this doesn't, this work is not intended for some, you know, older versions of Linux. This is something that you'll want to, or have to use on upcoming versions of Linux. And yeah, the policies of security that Wayland enforces, I think is the right word, are going to be some of the drivers for the things that we need to change. For example, the issues include, one of the most important things is we have an API that lets you capture screen pixels. And capturing the screen is, you know, Neil's touch on is something that Wayland very early on, I think, and is very clear about, doesn't like people to be able to do for privacy reasons. But AWT has expected to be able to do that forever. We expect to be able to do things for, like move the mouse, grab the pointer. We want to put our windows where we want to put our windows. Wayland will say, no, that's kind of our job. And you can't actually find out where windows are on the screen. Also in the X-Wayland compositing, in the X-Wayland mode, it's, there's the high DPI support is not complete. So in some of those things that I described above, Soundlight, they're things that you'd only need maybe for a test framework, but these are actually part of our standard APIs. And these, not many applications use them, but we have to be able to support it. And there's a bunch of bug level fixes that we've found we need to do as well. And we actually, you know, as the project went on, we actually found some bugs that were on the Wayland side as well. And there's a whole bunch of, all of these things that are described in detail at that URL, which I obviously, I'm not going to read out for you. So where are we now? JDK 21 pretty much did all of the work. We got the spec changes in that we needed. And there is a complete new implementation of the robot screen capture that was done almost entirely, well actually entirely by Alexander Zaginsev. It's using the screencast portal and pipe wire. And it, so the first time somebody tries to do a screen capture, there's a manual process of saying, yes, that's fine. And then after that, it's okay forever in theory. And there are some follow-up fixes going into JDK 22. We basically, if you have a desktop with GNOME 42, we should be good. And that will probably mean that we'll be able to, vendors will be able to officially support running on the Wayland desktop. In this compatibility mode with 22, which should ship in a month. And that's when we shift real focus to the pure Wayland toolkit. But there's, you know, what's involved a bit more about that. Complete new toolkit spans all of AWT in the window management and the rendering side. So all of these things here, creating windows, configuring windows, the management of everything, integration of the desktop, how you render to the surface. We can't use X11, open GL, sorry, XGL really. Or X render, which we, an X11 and X render is the default way we do everything on Linux today. Desktop integration, all of these different things I'm listening here need to be redone. When I was trying to describe it to somebody who's sort of more of a VM person, it's like, well, we need a new JIT and we need a new GC. And so that's the kind of scope of the work. So how would you do this? Well, I mean, a lot of GTK4 makes it fairly easy for a lot of applications to just, you know, port over because it deals with all of that, hides it from you. And then, you know, you would, Wayland, one of the things, it doesn't have a window manager, so it's client side decoration. It's all client side. And GTK would do that for you. And everything, I think, there, it sounds like it'd be easy to get a lot of things up and running, but it brings in a lot of baggage. If you do an LEDD or something on a running GTK4 process, you'll be paging through it for a while. And, but really, the, one of the problems was it's just really hard to get the level of management of when you render to the surface in the way that we need to with GTK4. There's more work with using the low level LibWayland, which is basically the equivalent of LibX11. But, you know, we've generally, when we've tried to do something with, in the JDK, like the most, the last example was we were doing a pipeline for Mac OS for their new rendering engine. And they have, like, a high level toolkit that's intended for applications, but we needed to use the lower level one. And it just sort of works out that way, seems, every time. But anyway, there's some new challenges that Wayland brings that, native Wayland brings that aren't necessarily there, that aren't there in the X compatibility mode. We need a new library Lib, EI, or A, however people pronounce it. That's just prototyping in GNOME 45. Well, it's the API, I believe, is final, but, you know, I think it's the first time it's been out there. That inability to layout windows that I touched on, it has some oddities like splash screens come up in the top left-hand corner of the screen. And, you know, that's not great from my perspective. So there is a, so there is a, already a toolkit in process and Alexi is actually going to be showing you that right now. And it's called WLToolkit. So, hand over to Alexi. Okay. Okay. Yeah. We use a separate thread for event handling in our prototype called WLToolkit. And actual handlers of Wayland events are dispatched on event dispatch thread. So, on Wayland rendering happens on client side. So, the client need to notify the compositor about the region that should be updated. So, we need to have a complete frame content and then submit it to the Wayland. That's required some changes in AWT and swing code to make sure that all the content is ready for submitting. Also, we use software rendering for our toolkit. Actually, software rendering was the only option in early ages of Java platform for swing applications. But since then, the situation has changed and now Java platform provides hardware acceleration for all major desktop platforms. Surprisingly, in current WLToolkit implementation, we have sometimes better performance than in X2 kit in X11 compatibility mode. So, for example, if you know a swing mark benchmark, it shows about 40% better performance than in X2 kit. So, it's quite enough for desktop applications. So, we can use it now. However, there are some areas where we still have lower performance. For example, in current implementation, we have about three times worse performance comparing with the hardware-executed X2 kit. So, yeah, of course, modern desktop applications need rich text capabilities, including our IDs. So, we're going to work on this and improve the performance of front rendering. Our current plan is to use Vulkan as a hardware acceleration with our WLToolkit. And let's see our demos. So, let's try to run them. It's quite unexpected resolution here. Here, you can see a special aw.toolkit.name property, a standard property that is set to WLToolkit to enable this toolkit for us. And that's run, as swing said. Oh, yeah. It looks like it fit the screen. Okay. You may notice that here we have unusual controls, actually, if you'll mention that. It's because Wayland, in this core part, doesn't provide client-side decorations. So, these controls were rendered by ours, so in WLToolkit. Okay. And let's see how it works, actually. So, here is frames. So, buttons. So, we have some animated curve here. Some combo boxes, dialogues. Okay. And some checkboxes working. Some more dialogues here. So, progress bar demo. Scrollable image here in scroll pane. And sliders. Yeah, here we have split pane. Tabot pane with some animated images here. Tables. Yeah, they work quite well. And tooltips. Yeah. And tree with expanded nodes. So, as you can see, all the controls are properly rendered here. And then I would like to show one more standard demo, at least demos bundled with platform for many years to show. Swing UI from work capabilities. It's Java 2D demo. It shows some advanced graphics. So, we can see here curve rendering. Actually, it's not full speed, so we can reduce the delay and see how it works actually quite well. Yeah. So, we have many different things here. Some clipping, color conversions, compositing rules, font rendering, image ops. So, it's some conversions for images. Some stroking capabilities. Here's a mix it. Yeah, demo with different primitives, paint modes. We have also here path rendering and transformations. So, as you can see, performance is quite acceptable. Now, we'll try because of the resolution to launch real-world application. Community version of IntelliJ IDEA. Yeah. Probably, it would be... Wow. Yes, it works. Yeah. And here we also see that we use the same property at little kid.name, that in special property file that we use for IDs. Yeah, here we can see actual well and implementation here. So, it's the constructor. It's quite difficult to see something. And here is the separate thread that I mentioned that handles well and events. Okay, that's it. Thanks for listening. Any questions if we have time? We have a minute. So, any questions? So, what is missing for... I'm repeating the question. You showed IntelliJ working, so what is missing? It was in details. We have Piki users and there are some stuff that in some corner cases that it's not working well, but it's generally workable. So, we have some users who gave us feedback about the quality. So, yeah. But we still need to polish it. Yeah, that... If it wasn't completely clear, right, you can try this for yourself. The project that I actually... Didn't you have a slide showing the branch? So, you just can check out that branch from the Wakefield repository and build it yourself and try it. Yep. Yes, over there. It does it work with JavaFX? No, this is at this point, this is implementing within JD K. JavaFX is like a separate toolkit entirely and we have to repeat this exercise for that. Unfortunately. Yes, over there. Sorry, feedback. Did we fix bugs in Whalen? No, we reported them to... We had friends at Red Hat. So, we've had some calls and some of the developers, a couple of the developers who kind of work on the desktop and even on Whalen directly there and will say, yeah, I file a bug here or yeah, I think it is. So, we've reported bugs and they've been fixed. Yes. Yes. Any plans to support fractional scaling? Any plans to support fractional scaling? My recollection is that Whalen itself fundamentally decided not to support fractional scaling. There's an extension? There's a... Of course, there's an extension. So, yeah, there's a protocol extension to do fractional scaling and if the WR2 gets wants to implement that, it can do that normally. But it should work. It should definitely already with the native Whalen mode, it should already be better with then the whole blurriness sometimes gets an excapable. Yeah. The one thing about that though is just generally, I mean, with fractional scaling, we don't have to deal with fractional scaling on Mac because that's always multiple. With the Windows look and feel, we are still sorting out bugs trying to make that work. So, we would undoubtedly find a whole bunch of new bugs with the GTK look and feel when we started doing that. So, it's not just a simple matter of saying, oh, yes, you know, there'll be a mess to be sorted out that's separate from the Whalen project. I think we're probably out of time. Yeah.
Zeroing and the semantic gap between host and guest
Hello. I want to start. Hi, everybody. So my name is Foyker Simonis. Hi, guys and girls. So my talk, my slides and my examples are on GitHub. I will show this link one more time at the end of the talk, so you can take a picture if you want. I'm currently having some fun at the Amazon Coroeta team working on OpenJK and I did the same along for quite some time in the past at the sub-JVM and submachine team. Today I want to tell you some details about running Java in containers and in different containers. One is Cue and the other is Firecracker. So what is Cue? Cue is checkpoint and restoring user space. That's functionality in Linux which allows you to serialize a whole process tree to the file system, basically to an image or a set of images and then it can be later restored from this image and run at the same state where it was checkpointed. It only saves anonymous pages of the process so it's quite efficient. It doesn't save the shared pages. We will see what impact that has. And correct, that's a coordinated restore and checkpoint. It was mentioned before in several talks. That's a project in the OpenJK which has basically two goals. One is to create a user land checkpoint and restore notification API which allows it applications to react and take actions based on a snapshot or restore event. So before the snapshot they can do certain things like zero out, secret memory or stuff like that and then to restore for example they can restore network connections which they tear down at snapshot, things like that. And gaining quite some traction in the community, the new versions of the popular frameworks like Spring, Micronaut, Quarkus or even AWS Lambda, they support this API so if you write applications or code for this frameworks you can already use this API. The second part of the goal of the correct project is to make the JDK itself SNAP safe. So the JDM as well as the JDK. This means that it uses this notification API for example in the JDK classes to take the actions I just talked about. And this is sometimes useful or even required to make, to react appropriately not only on checkpoint, on restore but also on clone because once you've checkpointed an application you cannot only restore it, you can basically restore it many times which I call cloning and then it's important for example if you have UUIDs or secrets to as I said to either wipe them out or reset them or recreate them. And Quark is using CRIU as a tool to do the actual checkpoint and restore process but as I said the API can be used without CRIU itself and we will see how that can be used with Firecracker for example. So let's dive into a quick demo. So I will use Pet Clinic as an example here. Oh this is the wrong window. So this is for CRIU. So I just start Java with some default settings which I pick up from Java options. It's basically Java 20 or 22 I think running with 512 max of memory, running the REST version of Spring Pet Clinic. And it takes about 10 seconds to initialize and then I use URL to access it just to make sure that it works. And yes you see it works, it really works. Now we use PMAP to look at RSS of the Java process. It's about almost 450 megabytes as you can see. And we can now use CRIU to dump this process. Oh I think it's hard to see. Yeah I will scroll it up. I just start to dump. So this was just the command line here to dump the Java process into a directory. And once we've done that we can take a look to see how much memory that used. And you see that the image itself is smaller than the footprint of the process itself. That's because what I said the image only contains the private and the dirty pages of the process, the shared pages from the mapped files for example. And we can now restore this process. So we use CRIU restore from the same directory and it works like in about 200 milliseconds. And if we use PMAP again you see that it uses, after we store it uses less memory, about 20 megabytes less than before. So why is that the case? Again that's because of shared pages. This is the diff of the whole PMAP output for the initial process before it was snapshotted and after we store. And you see the basic difference here is that for a lot of libraries like system libraries, LAP-NSS for example, we used 140 kilobytes for startup but this memory, these pages are not required anymore after we restore the process. So CRIU has still recorded that the process can access this memory but until it doesn't touch these pages they won't be paged in. So that's why after we store the process uses less memory which is a nice side effect. Okay so what other possibilities do we have? We start the application once again and it always takes about 10 seconds. So it works again. Now there is a feature called Trim-NATIFIP which was introduced by Thomas, my former colleague Thomas Stüffer, which basically frees the Maloch, the G-Lipsy Maloch buffers. And this can have quite a significant impact on the footprint of the process. So we see that the G-Lipsy Maloch cache used about 60 megabytes. And if we run now P-Map again we see that the RSS is much slower now, much slower now, just about 450 megabytes. And we can now... I also experimented with the new option which zeroes part of the heap. So it basically does a system GC and all the unused parts of the heap will be zeroed. If we do that and look at the memory footprint of the process we will see that the memory footprint got bigger because now parts of the heap which weren't mapped before get paged into the memory but they contain only zeros. And I have a pull request for the QIO project which such as QIO can recognize zero pages and ignore them while dumping. If we check point now with this zero option, it's basically the same like before. We just used the skip zero flag which is not standard until now but I hope it will be integrated soon. And if we take a look at image size we see that the image size now gets considerably smaller. So it's just 200 megabytes because all the pages which contain just zero bytes are replaced by reference to the kernel zero page. So basically it's a sparse file, the image file. And when we restart the process the memory footprint will be smaller as well. So we restart now from the new directory and when we take a look at the P-map output you see again it's just 270 megabytes. This is a little cumbersome so why not using the crack release itself. And the good thing is that crack basically does all what I've showed you what you basically can manually do with a normal JDK and with a normal QIO release. This is basically built into a crack version of crack build of the open JDK. We use the option crack check point two and give it a directory. So we run the application and then once it initializes we see it works and then instead of using the QIO command directly we can use a J command to check point. So I scroll it up here so we just scroll J command with the PID of the pet cleaning application and we execute the JCMD JDK checkpoint and that killed the checkpoint and also killed the process. And we can now restart that again with the help of Java by using the second crack option which is crack restore from and then give it the directory where the file was saved to. And this takes just a few milliseconds again and we see it works and again the memory footprint is like before it's like 280 after the first restore so it's considerably smaller because the heap was shrink. So what crack is actually doing it's not zeroing the memory but it's unmapping all the unused parts of the heap and I also recently added the feature to call into Tomas Trim native memory functionality to also free the JLPC buffers. So to summarize like in for a spring pet clinic application it has about memory footprint of a good 500 megabytes and after restore it's a little smaller because it doesn't have to restore all the shared pages. Image size is about 500 megabytes. If we zero out all the heap unused heap and use the skip zero flag of crew the RSS goes up just before the checkpoint but instead we get a much smaller image size and also a smaller footprint when we restore. And that's the same with crack because it basically it doesn't zero but it unmapped the memory and it has the same effect so it would wonder why do we need the zeroing at all then and not just use crack so I hope that will get clear in my next example. So for the next example I will use firecracker which is a KVM based virtualization. It's basically QMU on steroids it's a stripped down version of a virtualizer it has only a restricted set of network block device network device. It's rest base configuration it's written in Rust and it's open source under Apache 2 license and if you ever used AWS Lambda or Fargate for example that's the technology which drives this offering so every Lambda function is running in its own KVM based firecracker container. This is a diagram of how it works but I think we don't have time to go into the details today. It said I want to show you how this works practically. So I have another prepared another window here. So I use a script which which basically starts firecracker and inside firecracker it then starts again the pet cleaning application. If you take a close look this basically boots its own kernel which is 6.0 this here Linux 6.1.7 so it boots its own kernel in a virtual machine and then inside the kernel it starts firecracker and now if you see a rail this virtual machine has its own network address so we cannot use localised anymore. So we have to use the IP address of the virtual machine running on our host system but apart from them it works exactly the same and we have to look for two footprints now. We want to know we have to look at the footprint of the firecracker VM itself which is about 670 megabytes. Slightly bigger than that of the whole process and we can also look at the size of the JVM inside the guest and we see the JVM size inside the firecracker guest is about the same like when you run it locally which is basically clear. And we can now snapshot the whole firecracker container. Again that takes just a few seconds to this directory and if we want to see how big it is it's like 670 megabytes about the size of the whole firecracker container had in memory when it ran. So just to demonstrate how it works now we can restore from the snapshot. This again basically spins up the whole virtual machine in about 200 milliseconds and we can check how much memory it takes and you see it takes very few memory because it takes only 570 megabytes. Because it only pages in the pages which are really required to start the virtual machine. Crew paged in all the pages from its page file into the newly created process. Whereas firecracker does this lazily that's why initially it needs so few memory. The funny thing is that if you look inside the container by SSH and to the container and to a P map the Java process within the firecracker container still basically needs 500 megabytes but the VM itself only has paged in like 50 megabytes of memory. And what we can do now is, yeah we wanted to see, we already saw how big the, yeah we are sorry, we just do a request. So you see it's still working after we store the network devices are restored interfaces and it works and if we look at the image size of firecracker after the restore you see it gets bigger like 270 megabytes which corresponds mostly to what the crew process used. So that's actually the crew restore Java process so that's about this 270 megabytes seems to be required in order to process this request in pet clinic. So now how can we get this smaller the image size of the firecracker container because 690 megabytes is quite big. So again we run firecracker and you can see that I started a starter firecracker process with the checkpoint option so I with the crack checkpoint option so I can actually use the J command version now to check point. So we have the Java process in the KVM guest. Again we use SSH to SSH into the firecracker container and then inside the container we execute J command to checkpoint. This is a special version of checkpoint where we doesn't make sense to use crew within the firecracker container because anyway we will snapshot the whole firecracker container so instead we just use the special version of crack which only executes all the callbacks and thus all the optimizations but doesn't really call crew. So that's where we have to restart it inside and when we look at the memory. Inside the the Docker container we see that it's about 290 so it was it went down the SS but unfortunately the container it's like firecracker process itself still uses that much memory and if we. If we snapshot it. That that works. But. Let's take a look at the size. It's still 600 megabytes so that's why I called why I choose the title like that that's what I call the semantic. Gap between the guest and the host like even if I free memory in the guest container the host colonel does not cannot know that these pages are not used anymore by the guest system and they are still dirty from his perspective so if I if I. Snapshot. The container the whole VM it's it has to save them to this which makes it inefficient. So there are different possibilities to cope with this. One is to use. The the trim native image and the the zero heap options are showed you before because then firecracker has the chance to wipe out these pages from the image which make the image. Sight smaller. So I have I've summarized this here in this table so initially the firecracker process needs about six or almost 700 megabytes of RSS the JVM inside like before 500. Snapshot is about 600 megabytes after we store 50 megabytes and after the first request again 266. If you run this crack and do the checkpoint we can minimize the memory size within the VM to about 290 but the but the image size that the snapshot size itself stays at at at 600. If we do the trim native and zero unused heap the the memory consumption of the of the of the of the virtual machine goes up because again we we touch all the pages in order to zero them. But we get a much smaller image size because now the virtual machine manager again replaces these pages by the kernel zero page so we get a much smaller image and faster start up time. There's another possibility and that's called in it of in it on three that's a kernel option so the kernel usually when you give when you am a page and give it back to the kernel. The kernel doesn't do anything with this memory the kernel zeros the memory when you allocate it when you am up on your page the kernel will give you a zero page a page only containing zeros. But there is an option called in it on three which does this in the other other way around so it's and it's. Just for example in security critical application where people want to make sure that once they release memory this memory is immediately zeroed out. The thing with this is that the initial memory size of the container goes up because when the kernel boots up it was zeros the whole memory so it touches all pages so the footprint of the of the. Firecracker process is like one gigabyte which is what I gave him for the for the guest on the other side. When we snapshot this. We we get down to a four hundred twenty megabytes which is already quite nice. Last feature which I wanted to mention briefly is ballooning that's a special device inside the guest which can allocate memory can sing of it like a file cache. And then it has a means to communicate is back to the KVM manager and the host and tell the host that the host can now reuse this part of the memory so with this. If we inflate the balloon we can decrease the the footprint of the whole virtual machine but unfortunately the snapshot again gets bigger because from the. Host site this page is still look tainted so we have to combine ballooning within it on three then we get like all the benefits small very small footprint of the running KVM process and the smallest image size so with that I came to the end of my talk there are some references here. And I linked to the to the examples I showed you and this is where you find the presentation so thanks a lot. Thank you.
Bespoke containers with Jlink and OpenShift
Alright, good afternoon. My name is Jonathan Dowland. I'm a principal software engineer at Red Hat. I work on OpenJDK, in particular on containers. And I'm going to present some work for you today, which is a project team myself and Josh and Jay, who sadly can't be here today. I'm going to present some work we've been doing to look at integrating Java module technology with OpenShift. So I suspect you're all very familiar with Java modules, introduced to the JDK in 9 with Project Jigsaw, and perhaps less aware of what Red Hat OpenShift is. So OpenShift is Red Hat's enterprise distribution of Kubernetes, which is the de facto standard tool for doing container orchestration. So talking about containers. Red Hat, obviously probably heard of REL, Red Hat Enterprise Linux. Another project we have is the universal base image. And it's based on REL, and so it has some similar design principles, such as focus on quality and suitability for the enterprise. But it's different in the sense that it's available under the terms of a different end user license agreement. There's a short link at the top there to the full gory legal details. But unlike REL, you can, anybody without any kind of relationship to Red Hat can access UBI images, pull, push them and build upon them and distribute derivatives. So effectively it's a free image, and it's been designed to be useful as a base for any kind of containerized application. There's three flavors of the main UBI images. So there's UBI as it's called, and then there's minimal and then there's micro, and they're decreasing order size. So the full UBI image is about 200 megabytes uncompressed, and the minimal one's about 90, and the micro one, which is really small, has almost nothing in it, is about 20 megabytes, 20 megabytes of nothing somehow. You can get these, I mean those particular UBI containers are available, widely available, places like Docker Hub, but most or many Red Hat containers are not available from Docker Hub, and I'm not sure if you can see it in the room because of the positions of the tables, but the Red Hat ecosystem catalog is the place to go for Red Hat containers. So you can browse all of the UBI ones there. And the open JDK containers then, the thing I work on, they're part of the UBI family, so they're available under the same user license agreement, and you can pull and pull them freely without needing to be a customer or to have a developer subscription. I haven't included the full matrix of JDK containers we have now because it's grown too big for a slide. I think we have about 16 at last count, but the general URI form for the containers is there's two variants. There's build of images which contain the full JDK and developer tooling, the Java compiler and Maven, and that's the top one. And the second flavor we have is the runtime variants, which effectively suffix runtime, you get that. That has a slightly smaller subset of the JDK and doesn't include build tooling. I've included a link, which is probably at a better height now, to some, this is a page really, but we have a Gitub auto-generated documentation for all of the containers that we've shipped in recent times, so that's the place to go to see what's available. It has jumping off points to the ecosystem catalog, but it also includes information on how to actually configure the containers and to tailor them to your needs. So OpenShift has value add on top of Kubernetes. One of the concept ads is something called a build config and an instance of that is source to image. What OpenShift source to image is a process which allows you to define a workflow to consume an application source and produce a running deployable container automatically and automate the put-back process. By running a command like that, that's OC is the main OpenShift command line tool. You can create a whole load of objects inside OpenShift, which are interconnected to each other with triggers, so that in this particular example, if your application source URI is changes, if you push and you commit, or the base image upon which you're using for this process changes, then the workflow will pick that up and event up and automatically rerun and build a new deployment application, a deployment container. It's quite a simple workflow and the output image from this process is layered on top of the input image, which is the builder. The problem with that is the builder image is pretty big and customers want small containers and size of side, it also has stuff in you don't want necessarily in a runtime context, so the compiler, Maven, etc. So people with strong security concerns or with audit requirements want something else. The current state of the art in OpenShift for achieving that is multi-stage pipelines. You can chain these things together. The top part of this diagram is from the previous slide and you can chain that into a second build. In this case, the second build is using a different build conflict strategy called Docker strategy and basically it's a Docker file, which you may be familiar with. In this case, the output of the first stage is an intermediate image, which is used as one of the ingredients for the second stage and what we do basically, there we are. YAML, if you've ever dealt with Kubernetes, OpenShift is exactly the same, YAML, lots of YAML. The key piece here for that second stage is a Docker file, which in this case is embedded deep in the middle of YAML territory and it's a reasonably simple one. That's the state of the art today. I'm using Qwakus, I think you mentioned it on the side. For what follows, I've used Qwakus mostly for my experiments and for the examples. How big is it when you finish that process? It's pretty big. Unfortunately, the savings over the straightforward S2I process these days is pretty small, about 5%, not very good. If you look at the pie chart there, the thinnest slice on the slide is the application itself. The cost of doing business with this is quite high. The largest slice of the pie is the JDK itself and therefore that's the place we're focusing on trying to make space and size reductions. You can see another of the second biggest slice is the minimal base image itself. The open JDK container images based upon the UBI minimal image and that's about a quarter of the final payload for the application container here. First focus then is on trying to shrink the JDK and the second focus is to take a look at this base image. Our approach, what we're exploring is effectively the same basic shape of the workflow as before except we're going to extend the two build phases. The first extension is we're going to add at the application build stage, a post build, an analysis of the application using J-Link or J-Dep to determine which Java modules it uses. I should probably stress that the application itself does not need to use Java modules for this to function. We're looking at the Java modules that the application touches within the JDK. Then we use J-Link to strip the RedTap Open JDK that's provided in the container and create a bespoke VM which we then stash in the intermediate image. The second stage, the cherry picking stage then, we extend what we did before where we copied the application jar over into the runtime image and we additionally copy over the stripped JVM and a run script, basically the shell script entry point for the container which does some of the configuration at runtime and a small number of system dependencies we need to make the whole thing work. The reason we need to do these additional cherry picking things is because we've also switched out the image which we're layering on top of and we're now able to target the UBI micro image. I'm going to attempt something approximating a demo here. Let's have a look. If I was super brave, I would fire up an open shift cluster and give you a full blown web based exploration of that going on. I'm not that brave and subject to the constraints of operating on a laptop and FOSDEM Wi-Fi. I hope they'll forgive me for that. I mulled over exactly what I should show you and I could run through some of those build stages in isolation because in development we could do each of those stages separately. I figured perhaps what I'll do is I'll just show you the end stage if you like. What I have on this machine is a set of containers that have been built already. Scrolling off the bottom, let me fix that. Give the terminal a bit more read and state. I've got three containers here, container images on my machine. The first one running through the normal one stage S2I process has resulted, I've highlighted on the other window, perils of multi-mono. Here we are. So the plain S2I image then, this is a quarkus quick start and it ended up being 421 megs according to the podman. The multi-stage image, the current state of the art was a little bit smaller at 384. Actually that's better than the slides. Then the final image which has gone through our proof of concept jailing integration is down to 146. Let's run it. Run it, plonk, so that starts the app. Am I typing the right numbers here? Let's find out. I'm unable to connect, obviously not. Let me just borrow that window, fix it and put it back. Okay, there we go. Right, yeah, there you go, so the app works. So perhaps not the most exciting time you've seen today. That one, there we go. So yeah, in slightly more detail in the first phase when we extend the build process, this is opt in. We won't be doing it necessarily for every container build, so you have to enable it with an environment variable. As I said earlier, the general gist of it is run JDEPS and then JLink and there's an awful lot of pre- and processing going on to make that work. We're at the stage of this project where we're exploring a wider variety of applications to find all the edge cases. We have to add some modules that aren't picked up for whatever reason. We have to do some path fudging around when the class path is a bit unusual, et cetera. I've got a link to the source later if anyone really wants the gory details. So after this first stage, the intermediate image is pretty large. This is the second stage, the Docker file. Do not attempt to read this slide. The takeaway really is that it's grown more complicated than it was before. It was one or two lines before and we're doing quite a lot more work now. But yeah, there's four distinct things we copy in. We copy in the application, the stripped JVM, a run script and some system dependencies. At the moment, that's just grep and orc actually. But that might grow as we expand this. The results are pretty good. We're exploring a range of apps. This is not the best result. I've had all the worst. I've tried to be fair. It's about 43% the size of doing the multi-stage build. We've thrown away close to 70%. We're very happy with that. The new JVM is half the size of the older and the other significant saving is switching out to the micro-base image. We're happy with this and we're going to pursue it. A few bits and there are caveats. We've got to determine how, whether we're already serious blockers for real-world applications, more complicated applications than just quick starts. We've got some fun with JDK 11. At the moment, it grows the image. You get something twice as big instead of half the size. We know why and that's in trying to be fixed. Missing features. The reason it's getting smaller is because we're throwing stuff away. If you want FIPS support, if you're going to do stuff with time zone information or locales or you want debugging tools, that all needs to be added back in. We're trying to figure out a way that that would be practical for customers to actually do. The whole thing, our development works all in the open. If you want to do it, you can go to that address there and see all the gory details. It's on the bottom of that slide too. That's it. Thank you. I've got five minutes. I can't start the next one early. We've got five minutes of questions if anyone would like to. I don't know. I think the recording schedule would be broken. Any questions? Thank you. Sorry. Thanks. Sorry, I just want to hear me. Your previous slide there, you just talked a couple of things. You listed down TZD, locales, debugging information. One thing just to know is the TZD is actually in the base module. You don't get a choice. That will always be there. With your locales, there's actually a jailing plugin that actually allows you to select the locales. Maybe on your next step is something like that. By default, you just get US English. You can actually with the dash include locales, you can actually list out the locales that you want. That plugin will actually just take the resource data for those specific locales. That actually might be useful for you. More generally, the reason that locales are an issue for you is that it's what's called a service module. It's a service provider. There's nothing that directly depends on it. When you run JGEPs, you basically just sort of work out. That's basically doing static analysis to actually tell you what references there are. You never see static reference to something that's in a service provider module. Security providers is something you could actually list down there as well. J&DI providers is actually a bunch of those in the JDK that you never actually see a static reference to. Okay, thank you. That is really useful. I think we've made things... One of the modifications we do to the JDK in RHEL is to use the system time zone data. I think actually we've introduced this problem which you probably wouldn't have upstream. The information about the system data module is very useful. Thank you very much. Is there one right at the back? So, nice presentation, by the way. My question is, this seems to be optimizing for disk size. What about memory usage and stuff like that? Yes, so that's true. Size was the driver. I don't think it should make an appreciable difference to memory usage. I don't believe the Java will page in the modules it's not actually using or they'll page that out. I don't know, actually. We'll have to do some measurements. It's not been a driver for the project, but I wouldn't expect this to make significant gains in memory usage. Do you foresee that this could have a side effect on the memory side of things or not? Not loading stuff obviously consumes less memory. The fact that you don't load some stuff might make the system slower or something like that. I think... I don't know. We could add some measurements to our testing matrix, I think, and see what happens. For memory usage, the focus... The driver for that exploration is not in and Red Hat seems to have been towards Quarkus and Native Image. We've stuck to OpenJDK and the JVM for this work. We haven't really looked at memory, so it would be interesting to see. Thank you. Maybe I could just add to your reply just to that question. One of the side effects of having fewer modules in the runtime image is that you're actually memory mapping, you're mapping a smaller... It's called the G-image file. What actually happens is that that's actually completely memory mapped. You're not going to touch all the pages, so you may actually get some positive memory footprint benefits just because you're only going to have a small number of modules in the target image. It may help there. Great. Thank you. Cool. Okay. Thank you very much. Right. Next one. Thank you.
A beginner's guide to Backports
I bet it gets started because this one runs is really short on time. Okay, right, hello, it's still me. This is an unfortunately short talk. I'm going to be a bit straight for time, so I'm just going to run straight into it. So yeah, I work for OpenJDK. I work for Red Hat OpenJDK. I've been working on containers since I started at Red Hat, but since I joined the OpenJDK team, I try to diversify a little bit. So I started working on backpots about four years ago, but it's always been an issue as the kind of thing I did around the side. So I'm perennially a beginner and I still have to do myself a beginner. Loads of people in the audience here are significantly more experienced with backpots than me, and I hope if we have questions, they will help pitch in with answers. Okay, a little bit of preamble then. I'm not specifically talking about starting OpenJDK development itself, although there is some significant overlap. Why do it? There is a clear commercial reason, which is why so many companies are involved, but I'd argue there's another good reason is that it's fun. The kind of problems that you need to solve with backpots are a little bit different to regular development, so it's quite interesting, I think. There are a lot of documentation in OpenJDK community. Some's out of date. Some's a bit misleading. These two, URIs in particular, are not. These are fantastic. The top one is the quick cheat sheet thing, and the second one is your reference guide. It would be futile for me to try and rehash them here. I wouldn't do a better job. So instead, I'm going to try and focus on a couple of things that come up when you're just starting, and also some tips and tricks that I've picked up along the way. The focus really should be that doing backpots is a community thing, and you're joining a community, and certainly you will require sponsorship, reviews, and approvals. So it's important to recognize the community aspect of it and to build relationships with people. Okay, so the very, very, very first thing you need to do is to sign or be covered by the Oracle contributor agreement, because, well, that's really, really important. It's important for legal reasons. Your organization, if you work for a company that's doing this, then you might be covered by a sort of umbrella that's been signed, but check. It's important for legal reasons, but it's also important practically, because a whole load of automation won't happen for you until that's happened. So once you've signed that, you've got access to OpenJDK bot that operates on GitHub. It'll ignore you until that's done, so that's important. And the very, very first thing you should do is clone JDK, try building it, make sure it runs, and start forking GitHub repositories. Building it then, again, this is actually pretty well documented in the source tree. Building and testing pages in particular. A couple of quick tips. If you're building on a laptop, so that's saving up for a desktop, but turn on Ccash, it makes a huge difference in that sort of environment. You're going to require a boot JDK, which is a version or two one way or the other of what you're building. So you're going to start stashing those around. I download them from Adoptium, other vendors are available. I very much encourage that you run tests for backboards. The tier one test suite requires the Google test framework in the hotspot area, so download that too. And you need the job test tool, JTreg, which again, you might require more than one version of that, but it will be hit cross-up bridge when you come to it. Finally, unless you're actually working on fixing this, sometimes you will find that trying to build an older tree with a modern compiler will throw warnings which are promoted to errors by default. So if you're not trying to fix that and it's just getting in the way of what you're doing how to do, you can disable warnings as errors by passing that to the configure script. I think that even innate you these days, I think that's the way to do it. So as I said earlier, fork all the things. So go find the open JDK trees and all the backboards trees, create forks on GitHub, and then go into the settings and turn on actions for them. Do that first. Okay. To produce this slide, I spent quite a lot of time confused, so I think it's fair to say this can be confusing. You have the notion of the open JDK community as the sort of outer layer of abstraction, and your relationship with that is of starting as a contributor once you sign the OCA. Do that because you need to do that. The next thing then, you've got the notion of projects and you've got the notion of groups and they're distinct. The updates project, I mean, Dalibor covered this this morning. The updates project is the one that covers virtually all of the backport's work these days with the exception of 8U. Once you've done a couple of backports and found sponsors, you will be able to apply to become an author in a project, which is very, very useful and gets you an account in the open JDK infrastructure. To do that, you email the project lead for the project is Rob McKenna. For 8U, it's Andrew Haley. If you look at the commits in the Java trees, you'll see that they all have a very common format. The first line of the commit message starts with the bug number and that points into the Java bug system. This is very, very useful database of information about bugs. And when you're doing backports, you'll spend a lot of time tracing things in there. But the problem is, until you've acquired an open JDK account, you're limited in terms of how you can interact with it. You can't write to it, so you can't comment, write labels and things like that, but you're also limited in terms of what queries you can run. So thank you, Alexis Shippoliv for the JDK backports monitor project. This is Java program which basically constructs complicated JIRA queries and performs them and writes reports in a variety of different formats. Alexis not only created this, he also hosts an instance of it and publishes reports regularly to his own web server. So thank you, Alexis. If you make heavy use of this, and you work for an institution, I would strongly suggest you try and run it yourself. It might iron out some bugs. And also, if Alexis turns his website off, you will not be in hot water. So how to find a good backport bug? Here's an example of a parity report from Alexis' Backport Monitor. This is showing the first section, towards the top of the document here. The first section is a list of bugs which have been addressed in Oracle's build of 2103 and have not yet been fixed in OpenJDK updates project. So signals that you should avoid a bug, as a beginner certainly, is someone already working on it? Has someone filed a request for review? Has someone opened a pull request that addresses the issue already? Or has a company, I think A for Amazon in this case, flagged a bug as of interest to them? So eyeballs are already on these bugs, so to start out, probably look somewhere else. And then these are signs that a bug might be suitable. Open-sourced tests that are previously closed, test-only patches. Those are pretty attractive because the backport projects are happy to accept test-only patches. It's very low risk to the product. It increases the quality of testing. And they're generally quite easy to backport, sometimes trivial. So that's a good sign. And finally, it doesn't show up well on a projector. Other flags then, depending on who you are, what your interests are. So platform-specific bugs, Windows bugs, Mac bugs, port-specific bugs, IOT64, etc., will get less attention than more general bugs. I mean, almost nobody's doing Mac or Spark ports. So if that's an area that you can specialise in, that might be something worth looking for as well. Okay. I've got a couple of tips here about Git. So nowadays, I think all the JDK trees are in Git, or at least all the ones I deal with, which is nice. And time invested's perfecting Git, mastering Git, is worthwhile. There is actually a talk, I think, tomorrow in a different room from somebody who's going to give some super tips on Git. So I recommend that if you can fit it in. So the first stage then, this is a prerequisite for what's coming in my slides, the trees have grown really big, right? So if you clone them all, just the object storage for Git has grown to 200 megs-ish for 8U. By the time we get to 21U, it's 1.2 gigabytes. And if you've got all of those cloned, disk space is cheap, but that's network time, etc. If you stick all of those Git repositories as remotes in a local, one local repository, then you can deduplicate an awful lot of that storage. I think it last time I measured it, it's about a gig and a half for all of those. But that's not the main reason to do it, actually. I would say the main reason to do that is because it makes comparing objects between JDK trees really easy. You've always got access to any object in any of those trees, from any of those work trees, so you can look at the state of a particular patch as it was in mainline if you're in a backport's tree, etc. Without having to ferry patches around or export things. My favorite Git trick that I learned recently was the notion of Git work trees. When you make a clone, you by default get one work tree, it's the checkout, right? So the bit outside of the .git directory. You might be familiar with repositories which have zero work trees, that's bare repositories, such as if you were hosting them on a web server. But you can actually have more than one work tree for a single Git repository. The syntax is something like that. It creates a new sibling directory adjacent to your clone, tracking a particular branch from one of the remotes. That means you can have a work tree for each of the backport trees with only one common Git storage. Again, the reason that's really useful is if you're somewhere over in the 17U directory, you can still access and compare against objects in any of the other trees, just as if you were in a, it's essentially as good as being in a real Git clone. That's really handy. I blogged about it when I first discovered this, and that's somewhere at the bottom of the slide there, trying to talk about how it's useful for this stuff. Path changes then. So, files move around a lot between JDK versions, especially after 8U. What we used to have to do is use a corn shell script to basically take a patch and then mess the paths in the patch around to match your target tree, which was tricky. What you can do nowadays with the current new workflows is the opposite. If you're trying to backport a patch to 8U, for example, and the paths are all wrong, what I suggest you do is prior to trying to backport the patch, move the path files as they exist in 8U, in your local clone, to the patch that matches the source, the path from the patch you're trying to backport, which is a little counterintuitive. Commit that, but then that means when you pick the patch, you don't have any path conflicts, and you just have all the other problems to deal with instead. Once that's done, you can then revert the original commit, and the final delta is as if the commits, the fixups were all in the right place. You can either squash that yourself, rebase locally, or push it to get, and the gamebook and JDK bots would do that anyway when the commit was eventually accepted and integrated. I must have like seconds left, I guess. Just in passing, I use this all the time. You can type, I use this as a shorthand to open every file matching a particular regular expression in my editor in one go, and then I can, for quick substitutions, across a full tree. One last one then, a UI thing. I'm a complete holdout for Wayland, so that was a fascinating talk, but I'm still on X. When you run some of the tests for client stuff, which we're using AWT robot, and it moves to the mouse pointer or keyboard input, that can be a little confusing if you've paged off to do an email while you're waiting for the test to run, but also it can fail sometimes in esoteric environments. This is a real example of a backport I was doing last week with five new tests and all swing ones, four of them passed and one failed for me for who knows why. The solution in the next environment at least is a tool called Zephyr. It's just an X server in an X client. Fire that up, you get a little mini X window, run a terminal inside it, try running the tests in that. It isolates all of the mouse and keyboard operations from your main desktop, and in that circumstance all the tests pass and I have confidence they're okay. See how I'm closing? I have to say, as I said, it's fun and it's a community thing, but I have to reiterate one line from the excellent cheat sheet you were all earlier. You're quite close to the customer and there are a few safety nets, so be careful. The reviewers are fairly overburdened. They do a fantastic job, but you take some time over the patches and make sure that they update their high quality for everyone. That's it. Thank you. Yeah, have we got any questions? Hi. I'll send you a mic. Oh, okay. Sorry, just one little thing. I just came in at the end, you were talking about fixing up patches and stuff. Yeah. So when we completely relay out the source code for JDK9, those scripts that were actually put in the bin directory that actually will transfer patches between the old and new layout, you may or may not have found them. I can't remember the name of them either. Unshuffle patch. Unshuffle patch. I don't know whether they're useful, but I just let him mention them. Well, so Andrew Hughes maintained the 9TW, well, he maintained that script past 9U's life a little bit to further facilitate that, but I think it might be dead now. There was somebody over here, yeah. I'm going to say if there are anything that you can do about it, I can't remember the name of the either... Unshuffle patch. I'm going to say if there are anything that you can do about it. Yeah. Thank you very much. Yeah, I had a question about testing my back port. How can I make sure that I sufficiently test my back port if, say, for example, I'm porting a fix that doesn't have a regression test? Okay. That's interesting. So I think first thing is that unless you're really sure you don't need to run the tier one tests anyway, but flag it with your sponsor or reviewer, because it might be that they need more test coverage upstream in a newer tree anyway, and that should be back ported with it. Well, you might want to go and look at the justification that was given for there being no regression test when it was done in mainline and say and read about questions the reviewers asked then and what the actual fixer said. I mean, in theory, any fix that doesn't have a regression test needs to have some keyword attached, which explains why. If it's a dot fix, it's pretty obvious. If it's test itself, it's pretty obvious. Most other things, there can be like no-reg build. It's just fixing a build problem, things like that. So go look for the clues as to what to do in the mainline fix. You can at least run all the tests from the subcomponent of where that fix was done. So that's what I usually do. Because tier one is not usually running all the subcomponents tests there, but if you do a single chain in one area, then you just run all the tests there. Yeah, that GitHub actions things runs in a very limited set of tests, very, very limited. Thank you.
Cryostat: JFR in the cloud
Hello everyone. My name is Chris Ma and I am the manager of the Java monitoring team at Red Hat. Back in 2020, if you had come to this Java Dev Room here at Fosdom, my colleagues talked about a project called ContainerJFR. And that today is, Cryostat is the direct evolution of that project. Today I want to share with you guys the progress of that, the project. Before I jump into Cryostat, I just wanted to share a screenshot that depicts a demo of a sample pet clinic application that you've probably all seen before. And for the purposes of this demo, that pet clinic application is deployed in the cloud. So here's that pet clinic application. It does everything that you would expect a sample pet application to do. Now let's just say as a developer, this application has performance issues such as having abnormally high resource consumption or abnormally high response latency. What would you do? Now I kind of spoiled it with my title with Java Flight Recorder, but JFR is the first thing that pops into my mind. I know my colleague Robert talked a little bit about JFR and his early presentation, but as a quick refresher, for those who don't know what JFR Flight Recorder is, JFR is a profiling and event collection framework for Java. It gathers low level application behavior such as garbage collection metrics, memory allocation, and much more. It is built into the JDK, is low overhead, and therefore is suited for production environments. You can also use the low overhead recording infrastructure for your own event types. It pairs well with JDK, JMC, which is JD Commission Control, which is an analysis tool used to visualize JFR recordings in flame graphs, histograms, and many other visualizations. So going back to the pet clinic application, how do I use Java Flight Recorder with this application? Well, you might think I'll go into my trusty terminal, start JCommand, but then await. This pet clinic application is actually deployed in the cloud. And so herein lies a number of problems. I don't necessarily have access to the root terminal. Where and how do I access the JFR outputs files? I don't have any place to store them. What if I have multiple applications that I want to profile with JFR? Using JFR locally, I can only start it on one application at a time. And then lastly, what if my application shuts down unexpectedly? There is no records on spontaneous shutdown, and all of the JFR data would be wiped. So this is where I would introduce CrowdStat, and how can CrowdStat help? So CrowdStat can be added to help manage JFR recordings in the form of a web view. It is designed to work with Kubernetes pods, and it used Kubernetes persistent volumes to backup recordings. It also has functionality to capture JFR recordings if an application shuts down unexpectedly. Some additional benefits is that we can also define rules to automatically create recordings for matching JVMs, view time series metrics in Grafana dashboards. You can also export the JFR recordings for more detailed analysis in JMC, and then it can be used for applications also developed on frameworks such as Quarkis and Spring Boot to name a few. So I have a picture here of the web console view when I first start CrowdStat. And in this example, as you can see on the left-hand panel, is starting at the topology view. And this topology view gives you a sort of a bird's-eye view of all the applications CrowdStat can connect to. Starting from the bottom, you'll see that it is using the Kubernetes API. So it also works for Podman. So you would see Podman in that, I guess, configuration. The second, my apps, is the namespace that I have defined CrowdStats to be able to connect with applications within that namespace. And then you can see the Spring PetClinic acts application there. The reason why in this topology view you see that there's two PetClinic nodes is because I've scaled up the number of PetClinic pods just to give you a flavor of the ability in scale deployments that CrowdStat would be able to also access the multiple replicas. In the top view, you'll see there's some filtering. So in a large, maybe large scale deployment, you might want to narrow in on certain applications. These filtering capabilities allow you to specifically hone in on our target application. One thing I would like to note is that you might be wondering how I sort of ended up at this page and to sort of set up some context as to how I was able to set up CrowdStats to work with these PetClinic applications. I have a couple steps or a couple configuration that needed to happen in the background to get a demo or this setup working. So one of the things I did was I installed the CrowdStat operator on Operator Hub and installing the operator basically is a way of automating the CrowdStat deployment. Another notable sort of configuration responsibility of the operator is that it can help define the namespace boundaries in which CrowdStat can be used. So in my particular example, I configured it in sort of the most basic simple case, CrowdStat in a single namespace. CrowdStat does have the capabilities for multi namespace support. So if you have applications that span across multiple namespaces, CrowdStat can be used in those scenarios as well. And lastly, in order for CrowdStat to sort of communicate with the target applications, there are two methods of connectivity. And the first, I guess, methodology is using RemoteJMX. This requires setting up various JMX environment variables, also potentially configuring TLS or SSL, and then also exposing a port on the route of the Petcom link service. The other option is the CrowdStat agent. And so this is the recommended approach. This would involve basically, we provide a CrowdStat agent jar that you would build with your container image, and that would allow sort of auto discovery of that target applications. So what we're going to do is we're going to do a little bit of a demo. One of the next things that I wanted to show that's CrowdStat is capable of is, of course, the sort of bread and butter of CrowdStat, which is creating, creating JFR, well, starting JFR recordings. And so this is a sort of a simple simplistic UI of how you would define the creating that JFR recording. You set the duration. You can set the template. And this is synonymous to a JFR configuration file that you would use when using JFR locally. Out of the box, it comes with the standard configuration files, such as profiling it continuous. But you can also define your own sort of custom JFR configuration files if there are specific events that you want to look for. Lastly, at the bottom, you'll see there's show metadata options, and I haven't actually sort of expanded that a little bit. There is the ability, we have added the ability to add labels to your recordings, and this just allows users, you know, if you have a lot of recordings, you can sort of annotate or add metadata to these recordings so that you can search for them quicker. Once you create a recording, so when the recording is created, you can see that I've created a recording one. It's put duration for 200 seconds. It's still running. What we've added here is the automated analysis page from JMC. So if you've used JMC before, we basically took sort of the guts of that automated analysis, like its library, and we've basically just added it here for simpler access so that you don't have to switch between, you don't have to necessarily switch between JMC and Criostat. You also see archived recordings in the top, and that's basically if you create an active recording and you want to put it into storage, that gives you that ability to do so. So, as part of gathering the JFR recordings, we do have some, I guess, dashboards to have a quick look at some of the data coming out from the JVM. Now what I would say is that the data that you're seeing here is from the MBIN. These are MB metrics and not actual data from the JFR recording. And so these can actually be, I guess, received in real time or visualized in real time because the JVM is already, we're just basically visualizing this data. I did mention that there are some Grafana dashboards that are also available. And so here's an example of a, I guess, a JFR snapshot with, I guess, certain metrics. We have some canned templates that, on Grafana, that you can, when you create a JFR recording, you can open this and you can see data specific to that, those sort of buckets. Similarly, you can a little bit more detailed information about heaps. Heap memory, other threads as well, are available within the Grafana dashboards. One quick thing I want to highlight here is going back to the Polys view. As you can see, you can do bulk actions, meaning if you have multiple, if you select multiple applications, you can start recordings for multiple pods at the same time rather than doing them individually. That kind of simplifies things a little bit. This is the page for automated rules. It follows the same mindset of bulk actions, but instead you can sort of query the target application that you want to start the recording for. This just allows a little bit more granularity when searching for a target JVM. To kind of just wrap up where we're at with cryostat and what's next, there's a couple things that we want to sort of focus on. One of them is smart triggers. Smart triggers, to give you some context about that, smart triggers is a way of dynamically starting JFR recordings. This would be done through the Java agent or the cryostat agent. Right now we do have a preview feature where you can specify the metric and the metric value on when you want the JFR recording to start and then using what kind of JFR configuration file. What we would like to do with this feature is ultimately maybe make it more user-friendly so that you don't have to add, say, arguments to the Java agent to start this. The second thing was just cryostat agent auto configuration. We would like to have the operator streamlined in the installation for the installation process with the cryostat agent. I'd say those are the two main things to look out for. We have some upstream sites. If you want to ask some questions or have any feedback for us, we're always listening and that concludes my presentation.
Inner Workings of Safepoints
Hi, I'm Jonas Stechberger. I'll be talking about the inner workings of safe points. Essentially, what are safe points? You essentially have this nice little VM and you stop it. And as you saw, we have local safe points that only stop one VM. One VM, the other one, just float across. And the vicinity and the other stretch has worked. So the local safe points, they are quite cool because we stop a single thread. The thread doesn't modify its stack anymore. And it's nice to do things like garbage collection that just wants to operate on the thread stack. And we also have global safe points. They are also quite interesting. So essentially, you'll stop all the threads. That's cool when you want to do code deoptimizations and other stuff. That's what we're talking about, safe points. Safe points give either the state of the whole VM a guarantee to be stable or the state of the whole thread is guaranteed to be stable. And they are like one of the building blocks of the JVM. And they get even more important. I'll link their chat below because newer garbage collectors like concurrent etc. see use safe points, especially the thread local safe points to do work concurrently and to ensure that we can do as much work alongside concurrently every now and then at returns of methods instead of having a stop of all garbage collector as we have before. So essentially, when do we ask to be in a safe point? So when do we check he should be going to a safe point? We take here a typical method. It's just multi-blast. And so what we see here, we go into a safe point either when we return from a function, which is pretty neat. Or when we are at the back edge of a non-counted loop. So here in this example, we check at a safe point every time here when we're here or when we are at the end of the function, but beware of inlining. And that's a problem here. When we inline a function, then we don't have a safe point at the end of this inline function because the function doesn't exist anymore for the JVM. And then of course, we sometimes have loops. Some of you have written such for loops. And the problem is some years before, this didn't have a safe point anyway inside. And especially if B got really large, it meant that the JVM was taking quite a lot of time to reach the next safe point. So people started to do loop strip mining. And the idea is essentially you split this loop that we had before into an outer loop that usually iterates over increments like the value by a thousand typically, and an inner loop. And we have a safe point here. And that's quite interesting. That's quite good. It's called loop strip mining. And this reduced the latency in your JVM quite nicely. I'm Johannes Pechberger. I usually talk about profiles and sometimes about safe points. I work with an amazing team of talented engineers at the submachine team. We're the third biggest contributor to this little OpenShade. DK project that you might have heard about. And I sometimes fix bugs in template integrators. So template integrators like the thing when people say, oh, Java code is integrated in the lowest level of trig compilation, then they talk about template integrator because the template integrator turned O2 not produce safe point checks at the return of functions. And that's not great when you use this fact. So I sometimes fix bugs in the OpenShade. OK. And sometimes people call this work. And their back part of this fix us to all older LTS releases. What I'm not talking about is I'm not talking about how safe points suck. Because they sometimes suck, especially when you have profiles that are safe from bias. So they only sample your threat out safe point borders. As Nidson Barker says, safe point bias profiles are tricky, sneaky, filthy, all of the above. And if you want to know more about safe points and why they suck in profiling, ask this guy in the front. He knows a little bit about it. But more on the real topic of my talk. It's on implementation because we're all here to see some C code. Yay. In the beginning, I want to tell you how they code work. So essentially what we want to know, so essentially how safe points work, we have to insert these checks somewhere. And then the JVM goes into a safe point handler. And that's all amazing stuff like doing garbage collection work. Or hopefully in the future doing some safe point based deck walking to make profiling a little bit easier and to make profiling a little bit faster. But essentially what we could do in the search code, we could ask at every return, at every point where we run in search of safe point polygester, if threat is at safe point, please call safe point process. The thing is, this is like, probably this would compile in your check-in if you include some errors. And then in the last part, we did nothing. And it's quite cool. And it's slow, of course, because we have this branch everywhere. But the cool thing is, for once, that's pretty rare, this occasion. And so we can do some tricks. But what's in integrated mode, how it looks like, and here with the C++ code, how it actually looks like, that your template interpreter generates some code that calls test instruction that essentially sets a conditional bit. And it just tests that the polyg word, that the poll bit, is not zero. So that's just a simple check. That's how it's implemented in template interpreter. So we could essentially just implement this. And it's implemented this way. But we see that the first thing is, use pretty rarely. Because usually, we're not going to be at safe point. Because if we would get into this at every return, at every loopback edge, we would be just like being at safe points. And we wouldn't do some work with our JBMs. So usually, at safe point fails. And this is cool, because we now know that we can essentially make this a slow path. So it doesn't matter that much how fast it is if we get a fast path really fast. And one idea here is we could just read from a pointer. Because reading from a pointer in the fast path is quite fast, especially if there's a pointer that's in the cache or somewhere. It's nice. And the thing is, when we read from a pointer, there are two options. We could either read some data. Good. It's just a simple morph instruction. Or we can get a segmentation fault. And that's one of the things that JBM does. It uses segmentation faults for its own advantage. Because segmentation faults, yeah, they are somewhat expensive, but a fast path is really, really fast. So the idea is here in our method where we insert a check. We just access a so-called polling page. Because when we disable the save point, the fast path is yeah, it points to a good page, this pointer. But when we enable it, it points to the bad page. And we have a segmentation fault. And then essentially what segmentation fault does, it looks and looks, hey, did we want to access a save point, such a save point page? And I'm like, cool, we're probably at a save point. And that's one of the reasons why when you disable save points, when you do save point handling stuff around your JBM and want to be sneakily, so capturing save points, you got a lot of them. That doesn't help with debugging when you're out in GUB, when you're out in GDB, because you get a lot of save points. But anyways, before we go further to look into C++ code, I was told by someone in the audience that people like cute cat images. And the thing is, working with the OpenJK is interesting. But sometimes you have to sometimes calm down, take a cat, stroke it, have a nice time. And then you go back to learning how the OpenJK works. So I learned this because I wanted to fix a bug. So essentially how save points are initialized, we have here a bad page and a good page. And then we use the magic method protect memory. Essentially what it does, we call nprotect. And thereby we make the bad page neither readable nor writable. And the good page, we make it just readable. So we don't need to make it writable. So essentially we use the memory management unit of our CPU to implement save points. And that's pretty nice. So how, for example, C1 implements save points is quite, quite simple with this. It just accesses the pointer like it just accesses the value that's at this address. That's our page. So it's really just a single address. And that's a single moth, which is nice. And there's, of course, the question, how could we arm these? So what do we do here? And essentially what we do, we, for one, set the polling page here when you arm it to like the arm value or like the bad page. And of course, we're not doing any of this segmentation fall trick in template interpreters. So we're also setting the polling word so that the template interpreter can also check it. And what we do when we do a global save point poll, we essentially do this for every stretch. That's pretty simple here. And of course, sometimes we want drag save points because they can get quite annoying. And if some of you saw this 1 billion row change, then you probably saw that some of the winning contenders did this able save points because they can get quite annoying. But usually they aren't. So essentially there are a few ways you can, for one, use Jvi events. And I've built a website called Jvi Events Collection, where you can see all Jvi events available, also the events for trial and for all the JV case here. And you see here that there is a save point begin event, and you also have a save point end event. So you can check which save points are created. And also you can just pass x lock save point. You get lots of output. And I did this for like a Renaissance benchmark. And this is like the distribution that I get. And essentially most of the save points are in this case related to G1 because G1 was my selected garbage collector. If you want to learn more about me on my team, just go to this link. I was Johannes Pechberger here telling you a bit about the inner workings of save points. I hope you learn a bit. You can find me on Twitter on GitHub on my team at that machine.doctor. Oh, that was all from me. Yes, of course, Rief. Four precious minutes. Any questions here from the keynotes or any corrections of the OK developers? Can I ask a choker? So the question was before Java 5, how did it work? Any of my colleagues that were present at this time in the training came? Any of the OpenTree care developers here, any ideas? I don't. I only started two years ago. No problem. If these people don't know, then nobody knows. But if you have some ideas, come to Forstner next year and tell people about it. Yes, history lens. No, other questions? None? Good. Then, what's the pleasure of talking to you? And if you want to learn a bit more about Python, I'm tomorrow at 4 PM in the Python Dev Room, telling you about Python monitoring. And that's all from me. Thank you.
Java… to unlock GPU acceleration for Polyglot Language Runtimes
Okay, can you hear me? Excellent. Thank you. So it's a pleasure to be here. I'm on goal this amazing speakers today. So I'm Thanos. I'm a search fellow at the University of Manchester. I'm part of the Tornado VM team. And today I will talk about polyglot language implementations, which enable programming languages like Ruby, Python, and to run on top of the JVM, along with Java, of course. And I will try to make a step forward and show you how they can harness GPU acceleration from the JVM. I'll start a little bit with the polyglot programming, which has been here for many years, but in a sense it has been reignited by the advent of the Truffle framework from Graal VM. And in a sense it enables multiple programming languages to run on top of the JVM and interoperate. So that means that one Java class file can invoke a Python function and the Python program can invoke a Java method. Well, this is very interesting. It comes with many advantages. But what about GPU programming? Well, GPUs from Java. Well, this is not a thing yet. That's why we have been motivated at the University of Manchester and we have done all this research in the past eight years and we have created Tornado VM. Here is a link to the resources of Tornado VM with all the presentations that explain the programming model. Because my goal today is not to go very deep, to dive into the Tornado VM very deep, but to present the interoperability with the other programming languages and how they can use GPU acceleration from the JVM. So Tornado VM is an open source plug-in to existing JDK distributions. It is compatible with JDK 21, as you will see later. And it has some very cool features. So it has a platform agnostic API. So developers, they don't need to know GPU programming, FPGA programming. It comes with an optimizing compiler. So we extend GRAL with new phases that they can take Java methods and compile them to GPU code. We have a feature of dynamic reconfiguration at runtime, which means that the method execution can be migrated from a GPU back to the JVM and then go to the FPGA if it is appropriate. And with the latest release 1.0, we have enabled support for off-heap data types. So data can be allocated off-heap with a foreign function and memory API. And this is the API that Mauricio described earlier today. So feel free to follow Tornado VM in Twitter to engage with the website and of course to fork and try our examples which are open sourcing GitHub. So I spoke a little bit about off-heap data types, so I'll give an introduction, an example, because I'm not going to dive very into the API. So here we see two snapshots of code. On the left side, we see a main method that contains the allocation of float array by using primitive types, but is allocated as an object, in a sense, on-heap. So to migrate from such an allocation to the new allocation API that's exposed by the Tornado API, we have created the float array object that inside it can allocate memory by using the memory segment of the foreign function API. And it will allocate this memory off-heap. So this memory segment could be used directly from the GPU without the need to worry about GC collections and this stuff. And the cool part is that even if you don't use GPU programming, even if you don't want to execute on GPUs, you can still use this API to allocate memory off-heap. And here is a link that explains more. I hope it's visual from your side. If not, you will find my slides online in the Fosdome webpage. So the motivation for today is that Graal VM enables interoperability between programming languages like Ruby, JavaScript, and other programming languages. And Tornado VM enables hardware acceleration for Java. So what if we can combine them together and harness GPU acceleration from all these programming languages that are running on top of Trafl? Let's have a dive into the tech flow. So in this slide, I present a software stack from Graal VM for Trafl. So on the top, we see the Trafl framework and many implementations of polyglot runtimes like Graalpy, Graal.js, Trafl Ruby. And others because Trafl enables also programming language implementers to create their own programming languages by using the Java API. So I have grouped Python, Ruby, JavaScript, and Node.js in this side of the slide. And then beneath them, there is the Graal VM Zit compiler, so an optimizing compiler from Graal. So Java is also running on top of the JVM, of course. And all these languages, they start in the interpreted mode, and once they reach a hot state, then the optimizing compiler kicks in. And the cool part with such a polyglot implementation that enables polyglot programming is that there is, for the compiler enthusiasts, there is one Graal IR. So the nodes, at runtime, they are rewritten. That means that it can adjust. So if we kick in a Python function, then the node can be rewritten, and the Graal compiler will take a shape and will emit at the assembly code that will run on the CPU. So this solution offers the interoperability and offers the execution among different CPU instruction set architectures. But what if we have this heterogeneous hardware, like GPUs, FPGAs, which are available in some systems and servers? Well, then we'll have Tornado VM that enables Java methods to be compiled for GPUs, FPGAs, etc. Tornado VM has its own JIT compiler, which is an extension, a superset, I would say, of Graal, the Graal compiler, that it is enhanced with new phases in the compiler to automatically specialize the code from a method for GPU acceleration and FPG acceleration. So at the backbone of the compiler, we have three backends at the moment. We have OpenCL backend, CUDA, and SPV. And such a solution would enable many things. So if you want to learn more about the APIs, you can scan this QR code. And the code that is implemented with Tornado VM, it can harness besides the off-hip data types, it can also harness the execution with a Tornado VM profiler. If you want to learn more about the characteristics of your application, you can see how many data will be copying in the GPU memory, how expensive is the IEO maybe, because this could be very critical for the performance of the system. And you can customize even how many, how the data transfers will be performed. Because, for example, if you have a method that consumes redoneally data, then maybe you need to copy the data once, instead of copying the data every time you execute the kernel. Okay, so let's jump to the deployment. As I said, Tornado VM is compatible with different JDK distributions, so it's not a JVM, it is a plugin for JDK distributions. So it can be seen as a library, in a sense, because it offers an API in Java. And it is compatible with all these distributions. And on the other side, we have the compiler backends that makes it compatible with different heterogeneous hardware accelerators. We can emit vectorized code for multi-core CPU execution through OpenCL. We can run with different GPUs and FPGAs. In this particular talk, I will focus on GraVM, because we want to leverage polyglot, and NVIDIA GPUs, because I have created Docker images that they run on the NVIDIA GPUs. Now, regarding the GraVM deployment, I will focus in this slide in GraL Python, which is one implementation of polyglot runtime. This is shipped in two different standalone versions, releases. So we have the native standalone, which comes with the native image. And then we have the JVM standalone that enables the execution of Python programs on top of the JVM, and it has also the JVM compiler. The version that we tested is the 23.1, because tornado VM is compatible with this version of GraL. And here you can see that we have downloaded the community, and that's JVM. So we have the JVM standalone version downloaded. Well, we need the JVM standalone, because we want to run with tornado VM, and tornado VM will extend the GraL VM compiler. So this is the reason. The problem is that we tried it, and the JVM distribution is shipped with the JVM standalone, with a compiler built that it is built with libgral. So this comes with not many compiler modules, and that breaks the consistency for tornado VM. When we tried it. And this is because they wanted the image, the footprint to be lower, which makes sense, but it broke the compatibility with tornado VM. The good part on this story is that GraL is very active. The GraL community is very active in Slack workspace, so we managed to figure out what was the problem. On the bad side is that the solution was to build a GraL Pi and GraL VM from source, which was quite painful. And in order to avoid this pain for anyone who wants to try this work, we decided to build a Docker image that has inside GraL Pi, tornado VM, and we have also added the NVIDIA driver. So if you have a Linux machine or any machine that has an NVIDIA GPU, and you have also the NVIDIA container toolkit in this machine, then you will be able to run this image. The Docker file, the image is open source in GitHub. And on the other side, you can see the QR code that has the acceleration library. So the code that we have implemented in the examples module of tornado VM for the computation part that we will upload on the GPU, like K-means, matrix multiplication, and those are the examples. But there are also other compute examples that we have in the GitHub. And you can also pull the Docker image from Docker Hub. So we will jump into the examples. So as you see here, we have the Python and Java with tornado VM. So we have the Python program that imports Java, and then it loads the class from the compute examples class of the tornado VM repository. And then we have in this Java class that we have loaded, we have two methods that can be accessed by the Python program. The first one is the set inputs that set the actual data points and the number of packets that will be used for K-means. And the second one is the run with GPU. So this will invoke the actual GPU compilation for GPUs and the GPU execution. And on the other side, we have the Java tornado VM, where we use Java and the tornado VM API to create these parallel implementations of K-means. In this slide, you see, well, the steps, how to clone the repository that contains this Python program. And we see also the Python program, the K-means.py. So we see here beneath that we have the invocation of the actual method functions, Java methods, sorry. And here is the link for the Java implementation of K-means. And now if we jump into the Java part, which contains the computation that will be offloaded on the GPU. No, before we jump to the computation, we have the set inputs and I wanted to make a connection to reflect on the off-heap data types. So with these two, with a new vector float, this is an API type that is exposed by tornado VM and can allocate data vector types off-heap. And then we'll have the create matrix of clusters that does perform some initialization of the objects and also allocate some other data, like the clusters, which are going to be allocated off-heap as well. And now we are ready to move into the actual computation part. So on the left side, you see the run with Java implementation of this method. And on the right side, you see the accelerated one with the tornado VM API. So as we see here, the actual computation has been in this method, has been performed by this method. So they assign clusters. And the corresponding one on the right side, that is the tornado VM implementation, is this one. So in this one, I would like to focus on two parts. So you can see the task graph implementation. Task graph is an object exposed by the tornado VM API. In a sense, task graph enables you to define what code will go to the GPU. So what's going to be the actual computation and what data should be used on the GPU. So the input data and the output data. So in a sense, the task graph enables programmers to define what is going to go to the GPU for execution. And the second API, once we have done this, as you can see here, we can define also the data transfer mode, how often we want data input, input data or output data to be copied back and forth from the GPU. And once we have defined that, we can move to the second part, which is the execution plan. So the execution plan is another object that enables programmers to define how the execution will take place. So it could be, for example, with the profiler enabled, without the profiler enabled, with a custom grid size, which is defined by the programmer. And once we have defined how the execution will be done, will be performed, we are able to execute the actual task graph. So with execution.execute, it is this part that enables the actual execution of the code and the GIT compilation. So the second time that we will execute the assigned clusters, well, this is going to be the second time that we invoke the actual execute of the execution plan. And the second time that we will invoke the execution plan, the execution of the execution plan, this is going to be the time that the code will not be GIT because it is already GIT. So the code, the OpenCL code or the CUDA code will be all retrieved from the code cache of Tornado VM. So now we can move to the actual example to run. I have recorded a video that enables the execution of K-Means and MathExfoom.liblication because on my MacBook, I don't have an NVIDIA GPU. So we will fork the actual repository with examples. And now that we have forked, we will go inside, we check out the FOSDEM branch. And this is the Python code that we saw earlier. So it has these three. First, we load the class, and then we are able to invoke the Java code from Python. And here we will run, first, the Java implementation and then the GPU accelerated implementation. We can also pull the Docker image that we have created. And here in the repository, we have a launcher script that enables to run. So at first, we will try the Tornado devices to query how many NVIDIA GPUs exist in the system. And here it is the 2000 GPU that exists in my machine at home. And once we have done this, we will run with Truffle, the Python program. So Tornado Truffle, the Truffle flag and Python, will be able to run the actual Python program. And we will see here that at first, it will bring Hello World from Python. And then we run the Java implementation, which is a sequential, that I'm with Java. And then they run with GPU method. And as we see here, they take the first one, one second, and the second one, 140 milliseconds. So here we will try the same example, but with the thread info, which will enable the printing of the actual threads that have been used on the GPU. So as we see here, we have the number of data points that we passed with the set input. It has been the number of the global thread size that is uploaded on the GPU. And now we move to the second example, which is the matrix multiplication with Tornado VM. So in this example, we run five times the matrix multiplication. So we see here the execution time of matrix multiplication on the GPU. So the first time it was half second, and then it has moved to three milliseconds. This is because the first execution, it involves also the GIT compilation, which is expensive. Then the second time, third time, the execution time has been saturated because it is the actual launching of the code. Okay, I have showed you example of Python with Gralpy, but this is not the only one. We have also the key images for the other programming languages for JavaScript, Ruby, and you can find more details in those links where we have a blog post. And we explain also the polyglot programming from Tornado VM. So now we will try to find the other examples so now I will jump to the summary of my talk. So as key takeaways, I would like to emphasize that GralVM and Traffl enable Java interoperability with other programming languages that run on top of the JVM. Tornado VM afflows Java methods on GPUs, FPGAs, and multicore CPUs, so you can create parallel implementations. And that Tornado VM offers a Java API, so programmers, they don't need to know GPU programming. It is a Java API, a Java way to express parallelism. And we have also new off-hip data types. So finally, yes, it is possible to create high-performing implementations of code for data science libraries in Java, and reuse them by other programming languages. This is a slide that summarizes everyone who has contributed as a research staff for students at the University of Manchester, and these images are from our campus. And this is a surprise that it was taken and it was not raining. So I would like to invite you to join our community, follow us in GitHub, join us in the Tornado VM Slack space if you have questions, or if you want to interact with a team for discussions, and also to try our examples in GitHub. And in my last slide, I would like to acknowledge all these research funds that have supported their work at Tornado VM, like Elegant and Crip, Tango, Iro and InCode. So with that, I conclude my talk, and I think we have time for one or two questions. Okay, I've got the mic here, but first, I lived in Manchester for five years, and it doesn't always rain. Just mostly. Just mostly. Thanks for a great talk. Like one of the first pictures you had showed Tornado VM in parallel to the GrowlJIT using the JVMCI. So do you interact directly with JVMCI for generating code? Correct, yes. So the JVMCI enables other JIT compilers to be hooked in the JVM, and that's how we run, because we extend. So do you work with the standard JVMCI in upstream or open JDK, or you need the lab JDK with the latest JVMCI changes? Because the GrowlJIT compiler, as far as I know, requires the lab JDK with latest changes. We work with the standard JVMCI, yes. Thank you. Thank you. So when you write the kernel code in Java, then is it usually high-level code that you write, or do you try to write optimized code in Java? Like usually when you write, let's say, Qtacode, then you try to write a very specialized, use warp intrinsics and that kind of stuff. Is that something that is like in scope for turn out of VM, or not so much? No, that's a great question. Well, to answer this question, we do both. So we have two APIs. One is created for Java programmers. We will have, let's say, a computation that has four loops. So this is something that you can paralyze if you don't have data dependency. So we expose an annotation in this case, similar to OpenMP. So you can do add parallel in the four loop in order to give a hint to the compiler that this can run in parallel and will create parallel implementations in OpenCL or CUDA. And the second part is that if you are familiar with OpenCL and CUDA and you want to have access to low-level intrinsics, like, for example, use barriers or local memory, allocate local memory, then we'll have a second API, which is called kernel API. And with that, you can pretty much access every interesting that exists in OpenCL and CUDA programming from Java. So personally, I have used the second API to port existing OpenCL kernels in Java with Tonedo.
How to bring up GCC for your new chip
Okay ladies, if you'd like to get yourself settled down, because of the way we're running this room back to back, my talk has already started and I haven't got much to cover but we'll do what we can. So, yes, so that's everything that makes up the GNU toolchain. I'm going to go through some of these slides very fast because it's reference material so you can go back and look at the video afterwards if you want to check something. This is only going to look at GCC so I'm not going to worry about the assembler or any of the other stuff, I'm just going to look at the compiler and how you add something to a new chip. So how you get the back end up and running, where you can get more information, what the key things you need to do are and what I hope is at the end is that at the end you won't be able to write a new compiler but you'll know where to get started. So first of all, source of information, there's loads of theory behind compilers, there's an excellent beginner's textbook there, you can still buy it second hand, I believe someone bought one for a penny on Amazon, second hand, so and I've been recommending, I haven't used the second one but it has, I strongly recommend it by someone else there and this is the Bible. If you've got a lot of money you can buy the one on the left, if you haven't got so much the one on the right is still rightly available. But this is what we're going to worry about today, the GCC internals manual, everything you need to know is there, some of it's out of date but it's generally a pretty good document and it's online so you can just go and get that. So, we've got a new chip, our new chip, this is an entirely fictional architecture, it's taken from my textbook I showed earlier and it's a simple byte stream architecture used for just as a target you can compile to for demonstrating how to write a compiler. So, we've got arithmetic, we've got logic, we've got shifts, we've got the ability to store and load and we've got some branching and a branch and link so we can do sub-routine calls. And there's all the details of it but we'll come back to it. So, getting started, first of all you need GCC so you can clone it and there's a mirror on GitHub as well. You've seen this from Dave, here's the structure and the bit we're going to be concerned about is within the GCC primarily the config because that's where you put the configuration for the new back-end architecture. So, we're going to, there's one, there's one for RISC-5, there's dozens of them there, we're going to add one for VAM. So, if you were to look in RISC-5 you'd find these four key files, there's loads more in the RISC-5 directly but you have a .h file which is where you define a lot of parameters that says what my back-end looks like, you know, how big's a char, how big's an int and so forth. You have RISC-C which is where you put C code and it's really helper code to get you off the ground. You don't need, you need hardly anything in .c to get started. The big one and where we'll spend quite a lot of time is the machine description. It's a, it's the thing that describes what your architecture looks like and GCC will then pick that up and use that to be able to compile to your target. Okay, and it's written in a, nominally in a dialect of LISP called scheme. Okay, and lastly there's a file called .opt and you don't actually even have to have .opt but it's where you've got target specific options and our architecture, we're going to give it an option that says you can have soft multiplication where you do multiplication in software or you can have hard multiplication where you actually generate multiplication instructions. So first of all, we need to see how do we configure GCC for my new target. Well first of all, we actually need to go into the whole auto-conf system and actually add it in there. So at the top level in the repository you'll find a file called config.sub. Now that is actually pulled in from a separate project. Okay, so if you're doing this properly you would go to the project listed there and you'd make your change there. But I'm just going to hack it today and I'm just going to add a line in the, if you look in there you'll say case dollar CPU where all the CPUs are there and I'm just going to add VAM, our architecture. So now the auto-make system will understand about VAM and then inside the GCC sub-directory, so the GCC proper sub-directory, there's config.GCC and that's where you put all the GCC specific configurations. Okay, now our full name of our architecture is probably our compiler will be VAM-unknown-elf-GCC because we'll put the full triple in front. So VAM whatever you like, ELF will match that. So if you go and say I want to configure for that target, what do I define? And there's a whole load of variables you can set to tell your target what goes in there. The thing is you don't really need to put anything because it'll know there'll be a, if my target's VAM you must have a VAM.C, a VAM.H, a VAM.CC, a VAM.H and a VAM.MD and maybe a VAM.Opt. I'm going to say actually I want one other because this is bare metal. I'm going to take the standard ELFOS file for bare metal operating system file and add it to that and that's the target machine list of files that make up that architecture. So that is all I need to do to make GCC to know that. And now I can say go and configure GCC and this, you'll see it's a bit like Dave did, but this time my target is going to be VAM-unknown-elf and it will configure for that. I'm going to do, I'm going to put it in pre, I'm going to, when we've finished it, it'll get loaded in, it'll get installed in OptVAM. We'll do it without headers just to keep it simple. We'll just do the C language and as Dave said earlier disable the bootstrap, just the stage one which is on a plain C compiler and there's loads more options there and we'll come back to that later. And then I can just say make all hyphen GCC and lots and lots happens and then it will complain and say ah, but I can't find VAM.MD, the machine description, okay? Because I didn't actually create a machine description. I just told it that was here's my machine. So we're going to have to do something about that. So let's start adding those files in. Let's start with the header file and so let's create our configuration directory. So we're going to the source directory, we're coming out of our build directory, going to the source directory, create a sub directory within GCC config for VAM for our architecture and I'm just going to create empty files, VAM.CC, VAM.H, VAM.MD and VAM.Opt. Come back into our build directory and make all GCC again, lots more happens and then I get an error message. It says ah, in somewhere deep inside the GCC world I haven't found a definition of first pseudo register and maybe you meant first virtual register. And that's actually one of the variables that I have to define in .h. So in .h there's a whole load of macros I've got to define that I will need for that. Okay, so here's an example, so in VAM.h we've got some things. You define target CPU but those are the built-ins I want to appear. You know that when you compile for a particular architecture in GCC there are some predefined macros there including one that tells you what your architecture is. So we want underscore VAM in capitals and lower case actually defined so if you're writing code you can put hash if def VAM, if def underscore VAM and put your VAM specific code there. And there's a couple of asserts there, I'll assert the CPU's VAM and the machine is VAM. Okay, where does it go, what goes in the header file? There's a whole section on this on the internals manual. You'll be here till 2057 if you try and put all of those in. Easy approach do what we all do is copy an existing architecture and hack it around for you. Open risk is a really good one. It's quite small and Stafford Horn knows what he's doing so it's a good starting point and it's what I used. Okay, and associated implementation codes in VAM.cc and it's things like data storage, data types, register model, the ABI implementation, all the constants that will define that. Okay, so here we are, here's my storage layout, you know all the number of bits that go in everything, what boundaries I'm aligning on, the sizes of all my data types, what the ABI looks like, so I've got a comment to say what it does and then I define the first pseudo register, so I've got a total of 33 real registers and then anything else would be a pseudo register and I'm not going to go into pseudo registers, because I've got my 32 real registers, general registers and I've got my status register. I don't have the program counter as a register because it's not actually exposed in my architecture, I have nowhere treating it as a real register, it's just something behind the scenes. And I've got names for all my registers and some of those have fixed purposes, so r0 is always tied to 0, r1 is the stack pointer, so I've got an array telling me which of those have got predefined uses and the last one is the status register, that's got a predefined use. And then what are good ways to allocate this, so when GCC needs to use a register, what's a good one to choose, so I don't actually end up choosing one, I have to then worry about restoring and everything. And so I can give that in a priority order of what order do I want you to allocate registers in. We talk about register classes, now this is very simple because we haven't got many registers, normally you would separate out your integer registers from your floating point registers and then you can tell GCC to do different things depending whether you're doing floating point or integer. In our case it's only an integer machine anyway, so we've just got general regs and we've got one class for the status register which is the flag regs. You always have a no regs class which is no registers and all regs class which is all registers and you define the last thing in that enum because it tells you the size of the enum is limit regs classes. And then from that we can define a macro called n regs classes and we can define the names of these which are just the text strings. And lastly then we say for each of those classes we're going to have to give you 33 bits to tell you for each of those classes which bits are there. So for the no regs none of them are set, for the general registers all the bits are except the 33rd bit and it's the bottom low bits on the left and the top low bits on the right and then the status register is register 33 so it just has one bit set in the other bit there and then all regs has all the bits set. Okay and you've got a macro to tell you which regs, you've given a register number which register class are you in and there's loads more in there and you can read through it there and see what happens. So we say make all GCC and even more happens and then it complains that it can't see SP regnum. Now you think ah didn't I define a stat pointer, I did but I decided something else because the point is this is not SP regnum as known by a header, this is SP regnum from the machine description. Okay so some of these things are actually not defined in the header, they're defined in the machine description. So if we look how code generation works in GCC it's generic okay it's a pattern matching compiler, it looks for patterns and replaces them by new patterns. Okay so it's how it does code generation, it's actually also how it does optimization and what we have to do is give it all these pattern templates in order to be generated and that is what the machine description is and actually when we come to optimization replacing patterns by better patterns is what you do. So we heard from Dave the different types you've got generic then Gimple then RTL and we're really worrying about how you get down to the RTL level. Okay side note here GCC has its own name for type systems so they're everything from quarter inch to eight bits up to double inch and tetra inch and double float with and so they're known as QI or HI and so forth and you can have unsigned variance of those just when those will come all the way through so when you see those they just sizes of things. So how do you get Gimple to go down to RTL okay which you can then code generate from okay you we probably had a set of standard patterns okay and all you're going to do in the machine description is tell him given add QI 3 that's add quarter inch to three arguments two source arguments and destination and they're mostly three address code like that so add two quarter inch and so forth. There's a whole set of these to define you define all those okay and then GCC has all the patterns and it will generate code for your machine okay so quite a lot of these have to be defined but some of them don't need it you know you don't need atomic and vector patterns if you're not going to atomic if you haven't got atomic ops or if you're not a vector machine okay so I say when we build the compiler it's parsed and all that scheme description of these patterns will be turned into C which is then built it then compiled and put in your GCC compiler and there's a whole huge chapter on this in the internals manual machine descriptions but we will do the same thing is we will copy an existing machine description and hack it so I've this we will take OR1K again okay so let's have a look I want to just describe machine description I'm taking them these from risk5.md just because I want to show a lot of ideas here quickly and they're richer in in the risk5 one than in my simple one okay so at the heart of it is define instant define instant which is the semantics of a pattern this architecture supports the name can be anything but obviously we're worried about the predefined ones and add SI3 is one of them okay and that's how GCC can learn RTL using that name okay so the first thing you see is match op-rand that's telling you how to match the first op-rand and then the second op-rand and you see there's match op-rand size of it single integer number of that op-rand so we've got 0 1 and 2 and then a bit about what it is okay so register op-rand says I can be any register it's it's an allow or deny gating function okay and you can write your own predicates as well but the whole load of standard ones okay and then we have constraints on that now the constraint is not much here equals r comma r and that's saying I'm giving you two scenarios and they actually have both happened to be r in this case but we'll explain why that is so it can and the equals means I'm writing to it so I'm either writing to a register or I'm writing to a register okay now the reason that matters is these pairs go together so the second op-rand is a register a register the third opera the first opera and then op-rand 2 is register or I for immediate and you have to read those if you as though we're there in columns so we're looking at one scenario where first op-rand is a writable register and the other two operands are registers and we're looking at the second scenario where the destination is a register the first operands register but the second operand is an it's just so if you think of them in columns that's how to think of them okay and yeah so the next line which is just empty here that's often for a global predicate okay and that could be where you put one of your flags so you may have to find a predicate like is this soft multiplication in which case I can't generate a multiply okay and just empty means true just always do this and then the code generation template it's just a C fragment and in this case so you say if it's a 32 if it's a 64 bit architecture then generate the string add word blah blah blah if it's a 32 bit architecture then you it's just a generic add instruction okay and the percent elements there percent nor percent 1% to refer to operand nor operand 1 operand 2 okay and at the end you can add some attributes we're not going to worry about attributes in BAM attributes are useful because they're where tagging the instance and sometimes you can have code generation options and opera optimizations that can take advantage of them okay so let's look at what we did for BAM first of all you define some constants that's where sp reg none the numbers of the key registers is defined okay and then we've got a very simple instance it's called no op and it doesn't have anything to match really it's just constant zero and it generates the text string it generates for code generation is just not here's a more comfortable and add si 3 you've seen that bit before we've only got one sort of ad okay the first operand is destination register the second operand is a register and because VAM is a two-address machine okay so add a B means add a to B and put the result in B we actually have to say the destination you see I've constrained it to be zero that means it's got to be the same as opera and zero which is the destination okay and I've got the same for sub i and the template to generate the code okay so the standard names the standard MD patterns machine descriptions and output statements how you do the assembly language templates and you've got some useful files in there and I say the open-risk one is a good example that's pretty simple okay so what about the option file VAM.opt there's a whole spec on this and we're going to allow it to have hard division soft division hard mode whether or not you generate multiply and divide instructions and they have a fairly simple pattern of explaining what it is and a bit of descriptive text okay okay putting it all together so we do make all GCC and almost everything almost everything happens and away it goes and it blew up cannot stat 10 permit 10.cc you know I have no idea what this means it's in deep in the bowels it's journey mitt so what do we do about this I asked for help and so thank you to match a Rizzicky who came up and said there's a trick you can tell it to emit fewer partitions it might be a bug and so I tried with emitting five partitions and it all worked fine okay and actually I ended up with a GCC because X GCC is what the GCC within the build tree is called and it ran it and it ran itself test it said let's check if the compiler is any good and then I got an internal compiler error because I haven't actually finished doing my compiler so VAM.md is missing some patterns and it's essentially blown up because it couldn't work out how to find a pattern to get the code down there for one of the test cases but I do actually have a working compiler well I have a working compiler in the sense I've got a compiler I can run it will crash whenever it compiles things but that's actually that's actually quite an achievement so now I need to just debug it okay but I have actually got a GCC build so I'm Dave covered this how to dump stuff we are so you didn't know you just mentioned so you can dump all the different intermediate codes but what Dave did cover was the wrapper option and the wrapper option is your friend that's where you can go inside we've talked about the wrapper option and how it puts things here actually you can do the same sort of thing as you can do gdb args and then I just copied that error message I got with the internal compiler error and now I can run under there and I can run it and I can generate my internal compiler error under gdb but I now have the ability now to do to debug it okay self-test even better so there we are and this is why we work as a community because we are so make self-test type in gdb we'll do all this magic for you okay so there was a bit of smoke mirrors in there I created a minimal vam.cc guess what I copied it from there was a bug in vam.op.ul's I had to hand create that in the hack round that and that I think is a bug I had to create vam.com.cc and I'm not quite sure why I had to do that but everyone seems to do it except open risk and I had to make it and I just took the template one I used it I added vam to the documentation that's a good thing I also compiled with enable maintainer mode which is used to regenerate some files I'm not that was when I was trying to fix the url's problem I'm not sure I actually needed to do that okay but that's what I did to get there so what next and the reason this is rather rushed is it's part of our three-month graduate training course this stuff was put together by my colleague Max and Blinoff a few years ago it's a five-day part of the course um for eight hours a day with exercises and so I've compressed it into 25 minutes um but hopefully it gives you just a little bit of a touch on how you can get started and there's enough hooks in there that you'll get off the ground and if you get stuck ask for help we're a friendly bunch I have an ambition one day I'm going to create a full public tutorial on GCC that's probably my retirement project but in the meantime everything I've just shown you is on github thank you okay I've got I've got two minutes for questions yeah are there any ready-made um CPUs that are a bit weird like um big guitars that we can use and play around for fun yeah so the question is are there any ready-made ones there are loads I mean there are what 50 or 60 backends for GCC and some of them are really weird and some of them very normal I would look at open risk because it's relatively recently done it's well done it's quite small because great excellent so the comment was about working on power isa power power isa and adding the scalable vector um functionality into the back end please join in ask for help scalable vectors are the flavor of the month at the moment so you said that we have to add the architectural specific stuff in the machine description I was wondering if there is a minimum set of touring to complete that you say that you do the assignment the addition and this yes question for the audience then we start the question is yeah our time's up is what is the minimum set in the patterns I don't know but if someone could tell me I couldn't find that thank you thank you
My experience as a first time contributor to GCC's LTO
So, it's my pleasure to introduce Rishi Rai. He's one of the GSOC students from last year and worked on GCC LTO. Welcome here. He actually came here and got some travel funding from the new toolchain fund. So, that's also a thing the community can offer to students. Not all times and so on, there's conditions but it worked in this time. Okay. Hello everyone. I am Rishi Raj and today I'll be presenting my GSOC work on GCC's LTO and along with I will share my experience with first time contribution to GCC. So, basically this is little about me. I am under garaged to end from India and I love terminal. Because I am lazy and it helps me to automate a lot of tasks. And also I keep changing my distro about each two, three months and this I think this is very problematic because a lot I spent a lot of time in that. Apart from that in my free time I love to read fiction, travel and I play badminton a lot. So, moving on. So, before discussing my project, let's discuss some of the things related to my project. So, first let's discuss about the LTO. So, traditionally what compiler used to do is like to for optimization it is done at the compile time. So, each of the compile compilation unit is optimized independent of other compilation unit. But when we think about a real world project we are missing a lot of context because in real world project there are several of there are several compilation unit which are related to each other. And if we can like and if we can find some context between these compilation unit then we can optimize more. So, in LTO what we do is instead of optimizing along with optimizing at compile time we also compile at linking time. So, linker is aware of all the translation unit so there is obviously more optimization. Only downside being a longer compile time and more usage of memory during compilation. In GCC to enable LTO you can either use minus FLTO flag or minus FAT LTO objects flag. So, basically the second flag is to save GIMPAL bytecode along with LTO IL in the LTO object file. Now, let's discuss the basic structure of a ELF file. ELF file is basically used for storing binary libraries and executables on Linux and Unix based system. So, here you can see that there is a ELF header. ELF header basically contains the metadata about the ELF file to identify the ELF file on which system it was produced etc. Apart from this there is a program header table which is not relevant to my talk because it's used for executable and I am going to talk about the object files. So, our main interest is section header table. Basically it contains the different information about different section along with their names. So, in LTO this is the typical structure of ELF files. So, you can see these sections prefixed with .gnu.ltu prefix. So, these sections are produced by LTO streamer. So, now let's discuss the role of an assembler in producing LTO object file. So, right now we discussed that these all section with .gnu.ltu prefix are produced by LTO streamer. And these all are already in the binary format. So, assembler doesn't touch this like basically the assembler take this as an input and output it the object file. The only thing which assembler does is it produces .sim tab which contains the symbols .strtab which contains the string representation of section names and .strtab which contains the symbol name in the string representation. Apart from that if you are compiling with debugging flag then it also produce various debug symbols along with their relocations. So, assembler basically performs two functions. First is to generate this three section .sim tab .strtab .strtab and if debugging option is enabled then it produces these debug sections. So, now my project was to bypass assembler. So, first let's discuss why we wanted to bypass assembler. So, we already talked about like what was the function of a assembler in producing LTO object file. It was basically two things and those two things were not very complicated it can be actually done by compiler. So, what benefit we will get if we produce that from compiler. So, like we already noted the fact that a lot of sections with the GNU LTO prefix were produced by the LTO streamer. And compiler just have to scramble through that take that input and produce as output in the object file. So, there was a lot of I over it. So, here you can see the result. So, with bypass we are taking like 14 seconds to compile a single compilation unit file and with default part it was taking 21 seconds. So, only for a single compilation unit there was a difference of 7 seconds. So, if we like we are currently testing it on the real world project like Firefox and all and we will publish the result. And we hope to see a drastic improvements. So, to implement the project we along with my mentor divided into two parts. The first part was to extend the libivirty. Basically libivirty is a helper module which have various function to write and read from an object file. So, we extended the libivirty to output symbol table and a string table. So, the first task which a similar was performing was done by this. The second task was to output various debug sections and symbols along with their relocation directly. So, this was little bit complicated for me. The first thing was that GCC there was not sufficient documentation of GCC LTO debug architecture on the GCC website. So, I got in contact with Richard who is the who have implemented the GCC debug architecture and he helped me through it. And also the documentation of Dwarf debugging format was a little bit intimidating as a beginner, but as we proceeded further with help of my mentor I was successfully able to navigate that. And as of now we have like I have successfully implemented the support for ELF file format and it can be found in the devil by bypass ASM branch of the GCC repository. So, you can go there and see that and if you have some comments you can reach back to me. For now we are only supporting x86 target for relocations and we are generally we are eventually planning to roll out to other architectures too. And we will also provide support for other object files other than ELF too. So, now let's talk a little bit about how I got this opportunity to work in open source with GCC. So, basically I tried getting into open source before a few times but it was a bit intimidating for me because due to the large code base and all those things. So, when I like one of my friend recommended me to try for G-SOC and when I was going through various projects then this project interested me and thanks to DOH his newbies guide was very helpful me to introducing to GCC. And then I applied and eventually got selected. So, basically G-SOC is an opportunity for first new contributors to get started with open source and applications for this year is opening on 18 March. If any if anyone of you are interested then you can sure apply. I sought out the people who made it possible for my contribution first Google for organizing G-SOC which provided me a good platform to launch me to the open source world. My mentors Jan and Martin for helping and guiding throughout the summers to complete the project. Thomas and Dev for securing the sponsorship and also for my visa process. Thanks a lot. Or also to the GNU tool transfer which sponsored these attendance. Thanks for your attention. These are my socials. If you want you can connect me here. Any plans for the non-LTO case? So, you put in a regular assembly like machine called. But it will be repetition of a similar only right in non-LTO case. So, the question is repeat the question that we have on the video. So, he is asking if we have any plans to extend it for non-LTO case. So, in non-LTO case one need to be simply repetition like we will be building another assembler right. We will have to anyways like assemble all of those stuff. Can you please repeat? It is actually not done. So, we should be extend as we need it for the forward. Any more questions? Yeah, you said that you need more relocation right for the fire business. Is it just like relocation for the port? Or do you really need all kinds of relocation? So, he is asking that we like I was telling that we need more relocation. So, is it like relocation for pointers or relocation we need more relocation. So, basically for every architecture the relocation structure is different right. So, for each debug section we need to output a corresponding relocation so that during the late debug phase the linker could identify that debug section and link it. So, for each architecture we need to change that structure and output corresponding relocation. Basically the address and the add-in will be different in the different format. Thank you. How about this work in case other debugging formats get sad that they need to be in LTO. Because I see that you did some of the stuff in talk to us right. Yeah. Can you please repeat like I was? Well, I think that the LTO streamer part of the stuff that it dumps, UY pass, right, directly is basically how I mean how attached you work to the door for. Oh, okay.
Unicode Support for GCC Rust
Okay, so unique code support for GCC Rust. Okay, it's here. Sorry, wait a moment please. Okay, let me start my presentation, Unicode Support for GCC Rust frontend. And here's today's outline. First, I'll explain about my project and then how Unicode can be used in Rust. And then I'll explain about how we implement Unicode support in GCC Rust. And then I briefly explain about, introduce about two mongering schemes in Rust and then summary. So first, let me introduce a little bit about myself. My name is Raiki Tamura, and I'm an undergraduate student at Kyoto University in Japan. And I participated at Google Summer of Code 2023, and I worked with GCC organization. And other program and my main interests are compilers and low level program such as emulators. So next, I'll explain about my project. I worked on Unicode support for GCC Rust as a Google Summer of Code 2023 project last summer. And Google Summer of Code is a global online mentoring program where students work with open source organization, and they write some code and contribute to the organization. And now, I'm continuously working on supporting the new Rust manga in GCC Rust. So next, Unicode in Rust. You can use Unicode characters in Rust program, and first, you can know as key new lines and white spaces in Rust program. And next, you can create name attribute to specify the name of your Rust program and the values of this attribute accepts Unicode alphabetic and numeric characters and which includes also known as key characters such as alphabet from various languages. And last, you can use more known as key characters for identity. For example, below, you can find Germany characters and Japanese characters and Korean characters. And you can also use all varieties identifiers. Next, let's more deeply dig into Rust identifiers. Rust adapt the syntax of identifiers defining Uax31 which is a part of Unicode standard. And Uax31 is also adapted by ECMAScript, that is JavaScript, C++, Python, and other many languages. So the syntax of this Uax31 identifier are shown below, and this is something like a generalization of typical Novoski identifiers in programming languages. And in Rust, after identifiers being talked about, their normalization to special form is called normalization form C, shortly NFC, so that compiler can compare the same identifiers but with different encodings. So identifiers are normalized to some normalization form. So next, implementation. So before my project starts, there are already other front-end, GCC front-end, which supports Unicode. For example, libcpp is a C-pre-processing DCC, which implements lexer. And also, C++, as you remember, C++ adapts the same syntax of Unicode identifiers as Rust. So I took a look at it first, and next are also go front-end, go language supports Unicode, but go adapts different syntax for identifiers, so... but I read the implementation of DCC go. So my implementation is divided into three parts. First part is lexer part. In the first part, we modify the lexer to accept Unicode characters, and second part is the great-name attribute part. We added validation for values of the great-name attribute. And we checked if the values of the great-name only has Unicode alphabetical numeric characters, and last, the mangler part. We modify the mangler to handle Unicode identifiers. So the first part, the lexer part, in order to look up character properties, for example, identifiers... So we have to tell the compiler which characters can be used as identifiers, and which characters cannot be used for identifiers. So we have to tell the compiler some Unicode properties. So in order to look up such Unicode properties, we'll use some functions already in libcpp, and for other missing properties, we generated a header file from Unicode data files. Unicode data files are distributed by Unicode.org at the Unicode official site. And to achieve this, we wrote a Python script, and we pass... which passes Unicode data files and then generate C++ header files. So this is the part of the generated header file, which contains such boring table. So the next part is the great-name implementing, great-name attribute. So this is quite a simple part because all we have to do is to use a generated header file and add it to validation to the values of the identifiers. And the last part is the mangler part. First of all, we have to modify the Rust default mangler to handle Unicode characters. So the default mangler is called legacy, and legacy mangling schemes escapes non-asky characters as their code point. And also, we have to implement a new Rust mangler scheme, which is called v0. And in v0, identifiers are encoded as punicode, which is used in WebBrowser or something like that. And implementing v0.mangler to DCCRS is now still in progress. So here, I briefly explain about mangling schemes. There are two mangling schemes in Rust, legacy and v0. You can pass options to switch this mangler scheme. So in Rust C, you can use C symbol mangling version option, and in case of DCC, you can use f Rust mangling option. And v0 was introduced to Rust C on 2019, and it is used in the Rust for Linux project for some reason. I'll explain it later. So implementing v0 is so important for DCCRS project. So here, let me compare the two mangling schemes. In legacy mangling scheme, legacy symbols start with underscore z prefix, which conflict Italian ABI, which is used in C++. And v0 uses underscore r, which is unique to Rust. And next, used characters are different in legacy mangling schemes. Mangled symbol uses ASCII alphabet and ASCII number, and also uses Dara signs and period. And on the other hand, in v0, mangled symbol uses ASCII alphabet and number and underscore. So speaking of Dara signs, Dara signs are vendor-specific characters in mangled symbols. So typically, it is preferable that we avoid using these symbols. So next, type information. Basically, legacy symbols doesn't contain any type information. On the other hand, v0 has rich type information such as generic types and inherent implementation in Rust, or tried implementation in Rust, and more. And for example, these are contents namespaces like modules in Rust. And Rust, speaking of Unicode identifiers, you know, as you remember, legacy escapes Unicode characters as code points. On the other hand, v0 uses punicode to encode Unicode identifiers. So let's look at a simple example of a two-mangling scheme. If you define this function in Rust, you can see two-mangled symbols. And highlighted part is corresponding to the name of the function. And you can find Dara signs in legacy mangled scheme, which is vendor-specific characters. And you can also find a invisible symbol. You can also find a punicode encoded symbols. This is a hero part of this slide. So in summary, as a result of GSOC-223, GCCRS supports Unicode. And Rust compiler uses Unicode normalization and punicode encoding, and implementing the new-visor-mangled to GCCRS is now in progress. Thank you for free software foundation for supporting me to attend this conference. And I would like also thank to my mentors, Philippa and Arthur and other GCCRS team and another GSOC student, Mahat. And that's all. Thank you for listening. Thank you. So we have a few minutes to questions. Four minutes for questions. If people have questions. Hi. You showed the example for like the new and the old mangling. And how I can say, I can't understand how you would look at that and know if it's Unicode or if it's someone actually wrote U7 blah blah blah blah as a name of function. So how can you identify that? Oh, okay. So if I understand your question correctly, are there questions that you cannot find part corresponding to the name of function, right? Yeah, I cannot look at this and see, right, from U7 onwards, this is Unicode encoding and not something that was actually written by the user. Sorry, one more please. It's just like, imagine some user decided to write a function and they decided to call the function U7 KED blah blah blah blah. How can I tell if this encoding is saved, the user wrote that exactly, or if they used the Unicode? Okay, so your question is that you cannot tell this symbol is encoded using Unicode or low ASCII data. Oh, if it is verbating. Okay, so in visual symbol, you can find the first U character, right? So this means Unicode encoding and then you can find 7, which is a length of character string from K to H, which is an encoded part. So, yeah, thank you. One minute left for questions. Yeah, so my question is how much effort would it be for you now, after you know all this, to improve the existing lecture to also accept Unicode? Is it just 3 for things and an HR or? Are you speaking now? Current situation? Yeah, now I would add to an existing lecture the option to pass Unicode. How much work would it be? I think many developers don't use NoaSki Identifier, but what is so wide so many developers want to use NoaSki Identifier. So in terms of that, I think it is meaningful. Yeah, how much work would it be just if you edit now? Sorry? Would it take you a week or weeks? Maybe three days or so. Okay, all right. Time's up. Thank you.
What can Compiler-Explorer do for GCC
In parallel, let's get started here. Up next we have Mark Boudiaz, if I pronounce it. Yeah, that's mostly correct. Yeah, another time. Even for French people it's complicated. He also worked on the GCC RAS front end for a bit, and I think why are that got involved in the got called compiler explorer. And now he's telling us what the compiler explorer can do for GCC developers. Yeah, thank you. So my name is Mark. I'm a compiler engineer at Hidakor, and today we'll talk about compiler explorer in the context of GCC. So what's compiler explorer? So for people who may not know what the compiler explorer is, it's a website where you can enter a program in some language, for example C on the left, and you can pick one or more compilers and get the corresponding assembly. So that's the very basic usage. Compiler explorer was created mostly 10 years ago by Matt Godbold. So that's why you may know the website as Godbold, because he was hosting it on his own domain, and it stuck. So now people are referring it as Godbold. We are now a team of eight people. We host like 2,000 compilers, support more like 60 languages. We have around four million compilation jobs per week, and thanks to our sponsors and patreons and stuff like that, we are able to pay the bill every month of around $2,000. In the interest of time, I will only showcase a very small subset of what the website can do. If maybe you should go and check out by yourself and experiment and see if there's something that you can find useful on the website. And at the end, I will answer questions and maybe get feedback or future ideas. So basic use case. So I'll try with the live site if it works. It's not too slow. Okay. So let's say you have a CC file, then you can add a compiler like this. So by default, it's TCC. You can see that the assembly is color coded too much with the user source code on the left. You can also execute the code. So for example, here, you can see that the printf is displayed on the bottom. You can also ask to stop after the assembler and get instead the option view of the object file. So you can see here that you still have the relocation in the file. Or you can also ask for a full link for the program. Yeah, still. And you can see that the relocations are gone and it's resolved. The last thing that I wanted to show is that you can share this by clicking on share. You get a link. And if you send this link to someone and they open it, they will get the exact same setup and layout. So it's very useful to share some code, bugs and stuff like that. The next use case is if you need, for example, multiple files. So that's the case, for example, in Ada where you have to have different files for the package. For example, the full package is in the two files named foo.adb and ads. And we have a main unit called example. So this unit is using the foo package you can see here. And you should see I'm also using an input file called input. So you can also put like text file in it if you need that. And then you can add as before a compiler. So it's not compiling because I need Ada22 and you get the same features as before. So you can execute, get the object files. You can share the session. Everything works as before. So that's the very basic use cases. We support many more features. You can build your program using CMake. We have GPU support so you can execute code on actual GPUs. You can see both the target and the host view of the code. We have deep views for assembly so you can compare the output of different compilers or the same compilers with different options. We support libraries, environments. There is documentation for some ISA and many more. So please try it yourself and experiment. Now the first feature that can be useful for compiler development is the conformance view. So for example if you have a bug report, so in this case it's from the GCC bugzilla. It's an internal compiler error. You can use the conformance view to find when it started regressing. So you add a conformance view and from there you can add some compilers. So GCC, PX86, for example trunk. So you can see this is red so there's an error. If you hover on the right you can see the backtrace so it's an internal compiler error. So from there you can just duplicate and check with a different compiler. So GCC 13, so still failing. And you can do that for all the compilers. So I won't do this now because it's short of time. But... Okay. I will skip this one and just use... So this is local so there's only a subset of compilers but it's fast. And you can see that you see quickly where the problem started, so around the 13 release. And the nice part is if you want to modify the code and see if it changes anything, the view will update itself so you can play around and see if you can have better ideas or things like that. And again you can share the session and send it to anyone. Something I used during my day job where I need to test against different compilers or targets or language. I create empty templates meaning that I simply create the conformance view with the compilers. I'm interested in for the given target and the language and I leave the code mostly empty. And whenever I need to test something against C++ for x85 targets, I click the link, the share link. This opens up. I copy paste the code and I directly have the result. I don't have to add the compilers by hand every time. So that's it for the conformance view. Very recently, Jeremy in the team added the support for the GIMPAL. So it means that now you can use GIMPAL as any other language in the compiler explorer. So maybe that's useful for some of you. You can just copy paste and use any GCC starting from the nine release. We also have support for the dumps Dave and Jeremy talked about previously. So this is C. I can add the compiler. Then you can add GCC tree RTL. And from there, you have access to all the dumps that GCC emits. Like this. If you need, you can filter between the tree, IPR, RTL. And you have access to all the options that you would have from the command line. And again, if you change something like the optimization, the view should refresh itself. So believe me, it should work. And that's for the most used dumps. But if you have debug dumps from frontends, for example, I've added the one for the ADA. We cannot support you. Simply have to ask. And maybe we can guide you or we can do this ourselves. So just ask and we'll be happy to help. Something else we offer are the nightly compilers. For GCC, we build a subset of supported targets from the GCC master branch. We also build from different repositories. For example, the core ball or the Rust one for VikiTub. We can build the topic branches if you have some that you would like to see on the public website. Or we can build more complex stuff like the Rusty, Codgen GCC where you need to take Rusty, build GCC and stuff like that and package and publish it on the website. So again, ask and maybe we can help. We provide an API where you have access to the basic feature, mostly compile and execute. So you can use that from Shell Script to do tests or you can embed this in application, plug-in, IDE. For example, this is a screenshot from the tool I've done for work using... I can run against different compiler using filters from the command line so I find it very useful. So maybe this could be for some help for you. And the last thing I wanted to mention is how easy it is to create a local instance, private instance. It's mostly heat clone, make and it will do some NPM magic for you. And this will bind to local host so that's fine. You can use it yourself but if you want to do that for a team, multi-user, please, please, you need to take extra care because this is basically a remote execution as a service. So you are from the web browser asking people to enter code and click execute and do everything. So for yourself, easy, for multi-user, not so easy. And we have ideas of new features we would like to have in the context of GCC. For example, for Clang, we have a nice view where you have all the optimizer passes and you can see how each pass is modifying the IR and with a nice div view. So it would be nice to have the same thing for GCC. Maybe a better div view where you can do divs on the RTL directly. Someone has for more Windows Compiler so maybe you have other ideas. So this is the end. So again, that's only a very subset of features. So go and experiment by yourself. We accept any kind of contribution called feature request, anything. So thank you and I'll be happy to answer. So one question. So one question. There was one question. How do you manage security? I don't. We have people working on this, mostly Matt Partouf and Austin. They are doing very complex stuff. I don't understand because that's really not my domain. But everything is sandboxed. The nodes where you are executing are mostly empty. So even if you exit the sandbox, there's nothing to steal. And if you crash the machine, we just reboot a new one. So that's as far as I can give any details. But you can contact them directly. They'll be happy to answer that. Okay. Thank you. Thank you.
Unlocking Secret Analysis in GCC Static Analyzer
So I don't know if it's okay. So hi everyone. So as I was introduced, my name is Pierre. I'm a PhD student in France. And the subject of my thesis is to work on given subclass of the number eighties on cryptographic code. So be reassured, I won't talk about cryptography here. But the static analyzers seem to us as a good place to implement our analysis. So first I'll go through maybe a bit of an overview of the static analyzers. I don't know if any of you have followed up on the analyzer. Well, David, of course. Okay, so then I'll go through, well, a bit of our journey to our development of our analysis in the analyzer. And then I'll present some remaining issues that I think would be interesting to discuss within, well, with the community, I mean, because I think that some of the issues could be addressed directly within the analyzer. So first, how the analyzer is working. So in case you never used it, so it's a really dumb code here. So we just allocate a pointer and use it, well, free it and then use it. So this is the kind of code that the analyzer can detect as a problem, so statically at compile time. So when you call it with just dash F analyzer, this will give you a really nice output on the standard output with all the paths leading to the problem. So a bit of an overview. So it was introduced in just ten. So 2020, there is an API available to develop an out of plugin. So that's really interesting because you do not need to, well, rebuild all of GCC to, well, try the analyzer. And there is a symbolic execution in Giant inside it. So it's neither sound nor complete, as David said to me. But it's something really nice because as implementing your own analysis, you do not need to care about pass visibility because the analyzer will do it for you. And so how it works internally, so basically the analyzer will handle a state machine for given variables for you. So you do not really need to handle any data structure for this. The analyzer will do it for you. You just have to care about the transition of your variables. And there is, as you saw, a really nice reporting system. And since not long ago, there is also the option to output syrif file. So it's really nice standard pushed by Microsoft, I think. Yeah. And on the committee. Oh, okay. That's it. And well, thank you, David, for this. And also answering my email because it was my first experience with GCC. So it was a bit painful to go through GCC and the analyzer at the same time as a newbie. So how does it work internally a bit more? So usually variables are represented with both left value and right values. So for example, on the first line, you got an integer x with the value 42. So within the analyzer, there is a data structure known as store, which will give, well, keep the difference value for your variables at a given program point. So after the analyzing, after I've analyzed the second line, within the analyzer, you'll have this kind of state. So you'll have the region, which is the left value for the analyzer, x having the symbolic value. So the right value 42 and pointer having the symbolic value of the address of x. And so to implement the state machine, it's pretty easy. You just need to inherit from the state machine class. So as I said, the static analyzer will handle a map of s value to state for you. Something nice for reporting is that a given symbolic value can have its states having an origin. So it could be another s value, allowing the analyzer to rebuild the path to the problem you're triggering. So you can also have really complicated logic behind your states if you need to. So there is a class state you can inherit from too. And that's all you need to start to play with the analyzer. So our journey to implement a secret analysis in the static analyzer. So we needed to implement a taint analysis first. So taint analysis usually come from, well, historically come from the user input validation. So some of you might know it. So basically behind it, there's four core IDs. So there's the source. So where does the taint come from? The propagator. So it's the taint propagation through different variables. And the sync. So this will be operations triggering issues. And filter is how you can destroy the taint for a given variable. So if you apply it to my problem, it will become, well, the source is the secret. So for example, a cryptographic key, sorry, or a password, or anything at all. The propagator will be, well, how you propagate that secret dependent notion. The sync will be, in my case, well, condition. So if you have a secret dependent variables used in a condition or a memory access or non-constant CPU operations, you'll have some vulnerabilities on your code leading potentially to a side channel problem. And so filter will be, in our case, for example, the tainty ratio. So a call to B0 on an allocated area of memory, for example. So our submission is pretty simple. Every variable is in the start state. And in our case, if, so the original secret will be tainted by attributes. So for now, the developer has to give an attribute to the variable. And the variable will be tainted. But you can also have other variables being tainted if it is dependent by its initialization, sorry, by, well, another tainted variable. And the sync came here. So as soon as there is tainted variable used in a sync operation, so for example, the condition, we emit a warning through the analyzer. So our first try will just check if it was working correctly with a pre-intuitive one. So it's just a secret. You take attributes and use it in a condition. And we expected the warning to be emitted. So it was working perfectly. And now we wanted to check that the propagation was doing well too, and that was working too. But when we looked into the hood, we noticed that, well, actually, it was not variables who were tracked. It was their symbolic value. So for example, here, you had secret with the symbolic value of 42 and y with 142. And in the state map, so this is the map, the data structure keeping track of the state for your variables. Well, it's the symbolic value of 42 and 142, which are painted. So we came with a problem example, which is like this. And there was a false warning emitted here, because as it's 42, which is tracked and not x, here y was implicated tracked as tainted. So that's the minimal example representation of the data structure of the analyzer. So we needed to modify the state machine state map class to be able to not only track for symbolic values, which is really nice to do well pointer aliasing issues. So we wanted to also be able to track for origin. So that allows us to not implicitly track the value of y here anymore. So at the same time, we needed to modify the notion of origin for a trait, well, for a state, because in that case, we want to be able that when pointer is dereference to be able to know that it points to a tracked data. So here it's y. The address of y is here. So within the symbolic, well, the state machine state map, you'll have something like this. So this was allowed the analyzer to rebuild the pass. So basically, a secret, having the origin, is the kind of the original secret. You'll have y depending of secret, and the address of y is tainted, because y is tainted. So our modifications. We modified mainly the state machine state map logic, also a bit of class URL handling out of three API user, and also diagnostic related code. So we believe it could be nice to have those changes merged in the analyzer, because yeah, we are working on that, because it will allow the analyzer to receive a wider set of analysis, because for now, it's really nice to do some pointer related analysis. But if you want to get in touch with integer, float, or anything, it can mess up your analysis. So there are some remaining issues still, because when you're in the frontier, for example, of, well, scalar array and pointer, well, it can be a mess, because you do not want to track, for example, the third element of an array, so here, which is secret, and alias by its pointer. So the value of pointer is pointing to the third element, and you use it to access it. But you do not use the pointer to taint the region behind it. So to do it, we did a bit of trick. So at the same time, we are tainting the region t of 2, because we cannot just taint the symbolic value of t of 2, because otherwise we will taint 42. We have to also taint t plus 2, so the symbolic value behind it, so the address of the third element of t. So within the data, you have something like this. So what's interesting us is that value, well, not really that value, but the region containing that value. So we do not want to track the right value of that element. We do want to track its left value, and also its symbolic value, well, the symbolic value of its address. And so for now, we are chatting about it, because we do it in our analysis, but maybe it could be done directly within the analyzer. I do not have any solution yet, but I think it would be nice to discuss it. Another problem is regarding inter-procedural analysis. For example, here you have the local variable in the main function, which is secret, and when you give it to f, well, secret does not exist anymore. So you cannot really have, well, for now, or at least I didn't find the APIs within the analyzer to do it. You cannot really look at the value, well, of the taint of secret when you are in the context of the f function. So that's because the analyzer have data structure to represent a frame for functions, and so there's a notion of local values. So when you're within the context of the f function and you ask for, well, the state of the secret variable, well, for now, you just have a crash, because it's not a local variable. But we could discuss that. Yeah. So a bit of takeaways. So the modification we did are, well, now we can track state region for only inter-procedural analysis. So there are still issues remaining. So as I said, when you're in the frontier of region, when you want to track a region, so left values and one, well, right values, so scalar arrays and pointer frontier, and also inter-procedural analysis, as I said. So, yeah, that's it for me. Thanks for listening. Feel free to reach out if you need to. Sorry, you don't have the question. Yeah, that's the question, I think, though. Thank you. We have some questions. Yeah, so we have, like, questions and so forth. Well, do not force the question. If you do not have any questions, that's OK. I have many questions. I bet. Sorry, I should. We have 13 minutes. OK. So, you're the one. Yep. So, why don't try and implement the input thing into the privacy ecosystem, for example, instead of GCC? OK, so the question is, why not to implement the analysis directly into a more formal tool, such as privacy, for example? The idea is that there's already a lot of tools doing that particular issue, the kind of analysis, but a lot of them are kind of hard to use. You cannot really set them up easily. So, the idea behind it is to develop a tool which has no ambition to be neither sound nor complex, just usable, easily plug-in, plug-outs. So, the idea at the end would be for the developer to not touch his code base at all. So, that's work we need to do yet. But, yeah, the idea is just to try to think with the user and hopefully not making our tool a pain to use. Yeah, so that's it. Any more questions? Yeah. Is there a way to support, like, other languages, like if the static analyzer, because I saw only examples in C code. If, for example, the static analyzer was building a way that you could extend it to, for example, C++ or other methods. Yeah, so the question is about if there are any other language support. So, as we are in the, well, Gimbalist representation, basically, we are targeting C code only for now because, well, the subset is way easier to manage. And the idea would be to then rely on, well, the analyzer because I don't think the C++ support is... Yeah, I mean, as you say, the analyzer runs on, like, Gimbalist, I say, in terms of intermediate representation. And so, in theory, it handles everything that we have a for, but in practice, it handles everything that I've implemented. And in particular, exception handling support isn't implemented yet. I'm interested in doing a separate code project. Or, if you want to start a new project, that would be a wonderful thing for someone to work on. I'm focusing on C. But, I mean, people have used it for, I think, as a concept of running, checking unsafe rust code and CPL. And in theory, anything that GCC compiles, you can analyze. It just might look rubbish. They're probably rewards, yes. So, that also was a question. Hi. Hi. How far in the game can you analyze things? For instance, can you analyze certain bits of, I mean, like, not things, other bits? I think it could be possible, but you have to modify the analyzer itself. So, from out of three, plugging as we are trying to do it, I'm not sure you can do it, really. But, there is data structure within the analyzer to do it. Yeah. Within the analyzer, essentially, we build a directed graph of program point, program state pairs. And the program state, there's a store thing, which basically models the state of memory. And that, in theory, tracks the Pupit level. So, for example, within this frame, within this local, within that frame, within bit 17 of that local, is bound to this symbolic value. How well it works. Yeah. And how well it works with the state machine as well. I don't know. Yeah. So, for now, no. Yeah. Yeah. You have a question? Yeah. I think, with the exact amount of pointers, you're already showing that probably you're going to run into a decidable problem to have perfect answers. So, you always have to make a trade-off between false positive and true positives. Do you have any guiding principles on how to decide? We want to, yeah. Just repeat. Yeah. Okay. So, the question is about, well, basically, completed or soundness behind it. So, we do not aim to be neither sound or complete. But we want to go to soundness more. So, it means all problems are true problems, basically. Well, yeah. Follow-up question. Yeah. There are a number of other static analyzers. Yeah. I wonder, like, are they all aiming for rough, the same balance between those two? Well, it depends of the different projects. We were, well, we went for the GCC static analyzer because, well, on the research part, there was no work on GCC. And we thought it would be interesting to, well, first understand, well, why? So, is there a particular reason? Is it because how the compiler is working? Or is it because it's, well, I don't know, because people are just used to go to LLVM, for example. Yeah. Yeah. I mean, from the general analyzer point of view, I guess I'm doing it from an extreme programming, pragmatic kind of, I have a bank of open source, pre-stuffware test projects. I turn on the analyzer, I see what it spits out and decide, does that suck or is it useful? And it's even accordingly. And it's not very formal. Yeah. Hopefully it's useful. In some ways, think of it as a glorified compiler warning. Like, it's like, I really want to spend a lot of compile time to get a bit of a deeper, more involved warning than that. It's not going to prove that your program is correct. Yeah. You still have time to analyze it. Yeah. Let's hope it's fine. You can do that. Yeah. Well, that's good. I think that's a good question. You mentioned the limit when it comes to procedural analysis with your static analyzer. Especially in use case like yours, where you try to track the access to secret, applying memory locations to secret states, how would you handle inter-translation unit-based analyzers because I could probably access this location from a completely different file if I wanted to? Yeah, probably. But the thing, oh, yeah, yeah, sorry. So the question is if we could not handle the problem of the interpersonal analysis problem within another pass, if I, yeah. So it could be done. But then we would have to get out of the analyzer or maybe add some other logic at some points. But the thing which is nice with an analyzer is that in our implementation, we do not need to care about symbolic execution, well, symbolic evaluation of variables. And, well, everything is taken care of. Yeah. I specifically had it run at the point in GCSE in the integrates with LTO. Oh. So you can actually do link time analysis. Yeah. Well, in theory, one thing is the link time analysis. Yeah. In practice, if you do it on either, I have a few test cases in the test suite. If you try doing it on anything non-trivial, it will explode right now. Yeah. Sorry. Yeah. But in theory, you will find all the order n squared. Yeah. And just to clarify, I'm not a developer of the analyzer. I'm just a user kind of the analyzer. So I'm, well, I did modify it, but so far it's just a local modification. And hopefully it will get merged at some point. Yeah. That would be nice, but we have to handle the inter-processor problem first. Yeah. Yeah. Yeah. Yeah. If you enter two links, like LTO, if you enter the integrates in LTO, you will get a lot of top positions from the for it. Doesn't that deal with some of it? Yeah. Maybe you should repeat the question. Yeah. The question is about, doesn't the analyzer come a bit late in the passes, because it could be run after a lot of optimization? Well. And the answer is yes. Yeah. The answer is yes. In the analyzer. Sorry. Yeah. I think that's a good question. Yeah. I mean, as I said, I chose to run that late in order to try and piggyback LTO. And unfortunately it means in theory, some optimizations have run, and very similar optimizations have assumed there's no undefined behavior. And, well, actually, what does it know about undefined behavior? So that's one area where they can be caused positives. Potentially we could move earlier. It would be a lot of work, because I don't know, a technical bet. Yeah. And how many times are? Yeah. End times up. One last question. One last question. Yeah. What is it? Yeah. Yeah. Sorry. In practice, that just means that if I care, then I want to analyze one time, so zero and one time, I would go three right. Yes. I'm not sure to. So I've had of running 10 analyzers. I now just run GCC twice, once with zero, analyze, and one with go three analyzers. Yeah. That's OK. So the question is, instead of running several analyzer, just have to run the static analyzer of GCC and then run the different optimization enabled. So is it the question? It's more than that. Yeah, I'm more than that. OK, thanks. I was, yeah. All right. Thank you again. Ends up. Thanks. Thanks.
Yacking about Bison
Okay, then let's get started. So we have one on stage James Loden, yet again about Bitcoin. Some of you will remember from your compiler classes maybe, or from other words. He's one of the team working on a Kobo front end in GCC. And now you may think, okay, Kobo is what my grandfather used maybe or something. But Kobo is still alive. Did it have a new standard release recently? Yeah, 23. There you go. And you use it. Diversity in programming languages. You use it every day indirectly probably for financial transactions. You may well have used it this morning. I tend to think that the three years I've spent on this is a lot of work. But I've learned today to appreciate how much other work has gone in to the thing that we are contributing to. We set out some years ago to add to we decided it's time for a real free Kobo compiler. And I proposed we should add it to GCC and that is what led to this presentation. I'm going to talk to you about why that is a good idea and why I had no idea what I was getting into and what I've learned in the process. All of the large firms, I would say, in the 1970s and 80s wrote their Books and Records software in Kobo and that's how we run our financial businesses everywhere in the world yet now. The chances are that your ATM transaction yesterday went through a Kobo program that was written in 1980 or so. There are estimated to be billions of lines in use still today. And there was a period some years ago in the 90s when some Kobo applications, for example, were moved to Java because that was the new thing. Those were the easy programs. The hard ones are one mass of spaghetti that you don't even want to know what it looks like. And the only way you can move it to a new system is by taking that source code and making it run on a different computer. You're never going to re-engineer it. If it were, I wouldn't be here. The banks or any of these institutions spend a fortune running their proprietary systems on emulations provided by vendors for machines they haven't made for decades. That costs a lot of money. I just want 10% of that money. So this is an ongoing thing. We're working now actually with someone else to go into more detail later. This idea is you just take the code as you find it, however it may be, you compile it and you run it into a different machine. So we are targeting ISO as a standard because you have to go somewhere plus whatever additions are needed by some particular house to build what they need for their purposes. It's so old, Bacchus now hasn't started yet. So yes, we have a grammar. No, it was not defined for anything like an LALR machine. So it is an all-encompassing thing, but there is no place to go to get a library, a standard library. When COPL was defined originally there were no functions, there was no recursion, there were no local variables. Your program had stuff, they worked on it and produced things. It's also a gigantic language because it takes all of the problems that you have as an application programmer and puts it in the compiler. So if you want to convert something there's no printf string. You just move your variable from one thing to another, that's a compiler job. It's very fast because if you aren't doing anything in particular to check for say runtime errors, right, like length of the variable or something, that's the default behavior because 1957, we run as quick as C at the bottom. There are other features that can slow it down, but that's to take care of things that you would like to take care of, like that I write to a place that doesn't belong to me. Somebody suggested that C++ is an easy language to compile and I think there's reason to believe that's true. This is a big language. And that's not the biggest one. I checked Ganooko Bowl recently. Their sizes are about twice that now, those numbers in terms of terminals and rules inside the grammar. And this is just one of the verbs. You don't just name the things past the arguments, you say by which way you want to do it. And if something goes wrong, you have a way to capture that too. That's also handled inside the system. It's not try and catch, but it's another such exception system. Yeah, it's work. So, yeah, I've been doing this for a while. That's the price tag on my version of C programming language. I entered into this saying, okay, we're going to write a compiler. And I understood at the beginning it was just the front end. It turns out that my work is chapter four, section four. Everything else is the compiler. And I had to, I entered into this not really knowing, oh, 15 minutes, okay, that's good. Not really knowing what this was going to mean. And so, and I had never written such a thing as I'm working on now. I had used Bison for smaller tasks. But I didn't really, you know how it is, right? You start out trying to figure out how to do something. That's where that's the way I worked on it. And the there's some distance between knowing what you have to do, reading about how it's done, and then actually getting there. And this is the sort of answer I had to find my way through in order to learn what we were doing. And there's plenty and how to say this, there is no Royal Road, meaning to say, no one will tell you how to do this. All you can do is pick up the pieces and plow through it. I recommend as a life proposition that you begin every problem by knowing the complete domain, the problem domain perfectly, and the tools you're going to need perfectly. That I think you can guarantee will lead to success. In my case, I substituted what most programmers have, which is 90% of programmers think that they're above average. So why not just do that? There is a relatively small number of people in the in the Bison world. It's not a it was not has not been easy, I don't think as a person using that project to find people who can help me understand how to solve the problems I'm having. So that that's just a qualifier for like what's life like in the in the world of a guy who's writing a parser. Also, I had to learn that the Bison and Flex, the Lexar and the parser. It's really an odd thing in our world where you've got two projects that I don't know to what degree they talk to each other. They communicate and cooperate. It's not evident from what I've seen that there's very much communication like that. And but they share global variables. They share functions. They talk about each other a little bit. But they don't say the same things. So that was a that's a stumbling block that I have no idea how to solve. But I think it's a it's a gap in in the world that we work in. When you are writing your parser, there are two levels you're working in or maybe more. You are defining the metadata for your for your language in C. And then you're using C to tell Bob Dubner to generate the code for that stuff, please. So it was not clear when I began. And it's still a little bit fuzzy for me where those definitions have to lie and why it's sometimes difficult to understand why Bison doesn't understand what we're talking about. But I have found that we were able to solve almost all the problems in in the Kobo grammar, just using good old precedence, that is to say, you're always defining one thing in terms of another. If you are, you know, you could it's a little bit like a make file, right? It's work your way up to the left. And then what do I spend my do my day, my day I spend tracing, I just looking at the results. What did the machine do? We move from state here to state there, we have this, we're looking for that. It's not there. Oh, that's the error. Okay. I also discovered that if you read these books at the beginning, they say, Oh, well, we have identifiers, how do we change how we know how do we distinguish a function name from a variable name? Well, in my case, the function names are all defined. They're all statements, it's call, it's, it's read, it's, it's inspect. So I don't have that I know the names. And I know all my variable names, because Kobal has four sections and one of them is the data section of us where you put your variable names. So those are all. So I didn't have the problem of here's a string, I'll pass that off to the parser and let him figure out like I could support them out and have different kinds of tokens for different individual pieces. And that's this magic in that that's what you want to do. The more types you got, the less you're going to have to worry. Bison itself is a complex beast. It's not clear to me who's at the helm. There are a lot of pieces being added in different ways to do different things. And they all look very interesting. I just don't know which ones I want. I learned that there were some things that were quite useful. And, and then not. So if you have an optional term in the grammar, and you, and that thing, you know, obviously can be substituted for the thing that could be there, then, then you can use the precedent thing to to convert the empty version of it to the same presidents as the thing that is that would be there if it was present. That worked great. If you have a conflict and you simply say, Oh, well, that rule needs higher presidents than the other one. Very often that won't work. Somebody in this room might be able to tell me why. But I that has has been a dead end more than once. There were a lot of work went into Bison counter examples. It's something that I think someone spent a lot of time on. I myself have tried that feature several times hoping for a magic solution, because I use those need those frequently. But eventually I've come to find that, at least for myself, it doesn't help me that much. It's it you all it does is produce for you the path that you can actually trace through yourself back through the report. If you look at the state machine. So yes, it could be resolved different ways. Sometimes it helps me to think, doesn't matter. It's okay, we can take this shift. We don't have to worry about the reduce. I did try to run the graph on my grammar. I never saw it never came back. It did. I gave it 24 hours, but I never got an answer. So I'm not sure why that features there. I guess as a tutorial thing, if you had a seven line grammar and you want to understand how things operate. But I had to went my way through all of the different things that it could do this feature set that I was talking about is and I don't know the answer. Why I only know the answer I found. And I think that's true when you're in a complex environment, you just pick the pieces that work for you. And that's how you land there. So Tim toad, you will remember that from Pearl, there's more than one way to do it. There you can put options into the grammar itself or you can put options on the command line. I chose the to think that you could probably run the same source file with different options. So why put them in the source file put them on the command line so you can choose it. That's that's the route I went down. But there are pure parsers, you can push them, you can use the gc++ interfaces, you've got the general something. Parcer, you've got different versions of parsers that can be produced from the same text file. And you have YAC emulation. And so I had to like, decide what not to use. And I think what we want to do is fairly vanilla. There is a really cool feature though of the way they've separated out the pieces of C code that have to go into the grammar. And they don't do a great job if you ask me of describing how that works. But these these things of separating out what the early part of the the metaflight metadata needs in order to describe the data as versus the way we're going to generate the code. That requires and provides is very, very useful. And locations, you can't write a debugger if you don't have locations. So I would recommend if perhaps I'll send this to the bison folks that you you separate out your code that's going to the pieces that you need into different files and use those brackets to to to tell by some what the way they belong. It's really handy actually if you got enough C code in your in your YAC file to put it in some include somewhere because your editor will make you a lot happier. If you use these two, if you use a if you've got a printer type for every element, every semantic type in your union, then you get nice outputs like this. And there ought to be a rule that says, or there'll even ought to be a warning that says, Hey, you define a type where's the string for it? How come I can't see what this looks like? You're going to want it. And that's what you see here is a reduction for the varying part of the perform a part of the one of the four types of loops. And we know we know that we've got some keywords we've got varying we have from we have we were it's coming from a numeric display that's a bunch of digits in a row that we can use like a number. And and it's got a name. And this was tricky because names don't usually have hyphens in them when you get to the assembler. And there's a literal so that that was a that was probably a three or something. And boom, and so I'm able to look at this trace and refer right back to the source code and see the pieces that are are being operated on. Do not open up your debugger on the outputted on the on the object file that is compiled from the seek that was generated by the bison. You just won't like it. You don't run GDB on make. You don't run GDB on on that either. It's just awful. The what you do want to know is what the rules mean. You're using a system that is declarative. So you have to think about what the rule says and how they operate. That's the way you get to to the answer. And I think I'm just about done except to ask you to help me solve my problem. What you see here is what you would like to write in a lot of languages, right? The second thing is much more easily typed as the first thing, but it easily parsed. I'm on version eight of my my test for that one because when you get to see you don't know if C belongs to be as means it's relating back to a or whether C is going to be followed by a relational operator and be just an ordinary expansion. So you hit that spot and you think what I need is L. R. Two, but I don't have it. So but maybe there's someone here who does know. So that's why I'm here and I'm done. Thank you very much. I'm glad you asked that question. The man asked why didn't I write a handwritten parser. And the reason is because bison has saved my bacon. The the I don't know how to do that. I do know this many times every week, not every day, but every week. The the the bison output tells me I have an ambiguity in the grammar. I'm not that can't be parsed. It finds the mistakes that I would be putting in freely if I was writing it by hand. Any more questions? Or suggestions for this issue? I'm I'm here all day.
Can the mold linker be /usr/bin/ld?
So up next is 3. I hope that's reasonably correct. Yeah, that sounds right. Of linker thing, let's just say that. Yeah. Now talking about whether the mold linker can actually be used as a system linker. Yes. So thank you for coming to this talk. My name is Rui Uyama. So I'm the creator of the mold linker as well as the LLVM linker. So I wonder if you guys are using my linker. So raise your hand if you are using mold linker. And what about LLVD? OK, maybe almost everyone is using my linker. So it makes me very comfortable to be here. Anyways, so the mold linker is my latest attempt to create the best linker for developers. And that really matters because in most compilations and build times, linker dominates, especially if you are doing a quick edit, debug, compile cycle, because you edit a single file, build a thing. The compiler finishes pretty soon because it compiles just a single file. But the entire executables need to be built from scratch. So the link time matters. So I've been developing the mold linker since September 2020. So it's been almost three years under a little. So it's relatively new. So it's available under the MIT license now. It's been under a different license because I was trying to commercialize it. But it turns out that it didn't work out. So I decided to go with the published license. And the main purpose is to offer the fastest linker to that developer. So it's order of magnitude faster than the new linker. And it's also faster than my previous one, LLVD, as well as the new gold linker. So as a rough to give you an idea, the on a decent multi-core machine, mold can output one gigabyte output per second. So if your executable is two gigabytes, and then it takes two seconds on your machine. And that's pretty fast. But the modern executables are gigantic as well. So for example, if you build LLVM with debug info, the output would be like one and a half gigabyte. But it can be built in one and a half seconds. And the mold linker supports almost all major targets, except MIPS. And the reason is because MIPS, ABI, has diverged too much from the other ABI's. The fact is that the other ABI's have evolved since 2000. But the MIPS ABI has stagnated since the collapse of SGI, because SGI was a de facto player in that field to set the standard. And then no one has since then made any effort to improve the ABI. So MIPS has diverged. So at this point, I'm not sure if we want to work, continue working on MIPS support, because it seems like no one is really making a serious effort to refresh the architecture. But anyways, it supports a lot of architecture, even including long arch, which is a newcomer in this field. And despite being pretty new, I think that the linker is production ready. And I think that many people are actually using for production use. I will talk about that later, how I tested the linker. So from the developer's perspective, so this slide explains what is the model linker from the developer's perspective. So it's written in C++, specifically with C++ 20 features, and with Intel TVB as a 3D library. And the one thing that you would notice immediately if you take a look at the source code of model linker is that almost all functions and the data structures are templates rather than just plain functions or structures. And the templates are specialized for each target. So for example, if you, so we have, and the source code quality, and ideally have readable source code. So I put a lot of efforts to make it readable. So this is an example of how you write target specific code in mold. So it uses if constexpr in the source code. So if you are not familiar with C++ 20, this is a feature, this is a new feature. And the beauty of this feature is that if constexpr is evaluated at compile time rather than runtime, so this if constexpr expression will be compiled to nothing. If this function will not be specialized for PowerPC 64, V1. So if as long as you got your new code in this way, your new code cannot do anything harmful for other targets. And it cannot be, it cannot slow down other targets. So this is another example how we use C++ 20 feature in mold. So this is a data structure representing on this format of relocations. But there are many types of relocations because we at least have big Indian, little Indian 32 and a 64 bit version. So in combination we have already four different versions. And the beauty of C++ 20 is that you can use a require your crowds after the template keyword to specify what kind of type parameters that you wanna specialize for. So in this case, this data structure is specialized for middle Indian and real way of which is very technical stuff. But we have two different versions of relocation data structures. And below the definition, we have different versions of data structures of the same name. And we even have completely different version of data structure for specifically for Spark 64. Because Spark 64 has this weird field that doesn't exist in any other architecture. So, but we can just define this data structure only for Spark 64. And as long as you guard G code that access this field with if course expert, then your code will not be cause GM, you know, you are using the missing field of the data structure. So this is a very beautiful way to compile your code to a specific target. So, it's not loading. Okay, so this is a machine description of the of G some specific target. In this case, it's a machine description for x86 64. So we have bunch of constexpr static variables as a parameter. And it defines, you know, that whether it's a middle Indian architecture or big Indian architecture or it's 32 bit or 64 bit. And basically you, so if you wanna put the mold link to new target, then you define this kind of data structure where basically copy and paste. And then make the modification as you needed. And then it's just as simple as that. And since this is G's fields are compile time constant so the compiler knows what the value is at the compile time so they can optimize code based on these values instead of, you know, that dispatching at runtime. So this is a comparison of the number of lines that you need to put more linker to the new target. So on the left hand side, we have code. So it is not a really precise comparison because lines of code is not a direct indicator about how easy or how hard it is to put linker to the new target. But it gives you enough idea about the scale of you, about the amount of work that you have to do. So apparently for gold, you have to write tens of thousands of lines of code for each target. But the reality is most code in the target specific code for gold are just a copy paste. So for example, if you wanna put new gold to like spark or long arch or whatever, then you would start copying the entire file as long arch dot cc or whatever and then it make the modification. So you have a lot of copies of code and that's not a really good way to, you know, put that thing to the new code. And on the other hand, we have very little code in mode to put to the new architecture. So we have a few, we have some amount of code outside of these files for target specific architect code but overall the amount of code is very, very small, like only a few hundred lines of code. So testing, testing is the most important and the difficult part of writing the linker because as you know that if you write a simple linker it's not really hard because it's just a program that takes object files and combines them into a single executable or shared object file. But the thing is there are so many edge cases and because there are like hundreds of thousands of programs that uses the linker, essentially every program uses the linker. So every corner case will be, there is some use case of corner cases out there. So testing is very hard. So we have two tests of how to say the mode to ensure that you, I will be finding a bug before you will notice in the production use case. So the first test is shell script based test which is a very simple test. I have a slide, slide for this. So this is just a test case for the very simple test case. So we actually compile code and try to link the object file with mode and then actually execute it on the machine. And as you can see that if you have a cross compiler and the QMU, you can test that this test for other architecture that's different from the one that you are running on. So for example, you can test Spark 64 on x86 machine. But apparently this test is not enough for real use cases, right? So the other test that I was doing, I'm doing is to try to build all gentry packages in a business mode in a Docker container to find any bugs. And the beauty of using gentry is that with gentry, you can use the exact same command to build any package. And it can also run the unit test that comes with the package. So it's very easy to wait to test whether you can build the program and the build program will work or not. So I did that and it takes a few days on the 6C4 core machine. But it works. But the thing is it is sometimes extremely hard to debug the stuff when something goes wrong. But somehow I managed to fix all bugs that I found this way. Well, yeah, it was a fantastic experience to fix all the jits bugs. But my point is that it is very important to fix all bugs before you would notice in the world. Because if mold didn't work out of the box for your project, the next thing you would do is just switch back to the original linker and you will never try it again with the mold linker, right? So why mold is so fast? Well, so we use multiseletting, multiselet parallelization from the beginning. So that's essentially why mold is so fast. But the other thing is that mold is simply faster than the other linkers with single-slated case is sometimes because we are using optimized data structures and code. Actually, the data structure is more important than code. As Rob Pike once said that you would write code around data structures and not to other ways. So designing the right data structure is important to make faster program. So here is, I think, a good visualization of how good mold linker is to use multi-core all-G cores available on the machine. So on the left-hand side, LLD fails to use all-G cores, but the mold finishes very quickly with all-G cores. So why, but the question is, would be why do we want another linker even though we have LLD? So my answer is, so LLD is not known, first of all. And the other thing is that LLD does not stop or support GCC LTO. So LLD is actually tightly coupled to a specific version of LLVM. So LLD, for example, version 15 can do LTO only for LLVM 15. So it of course cannot handle any GCC LTO object files. So if you wanna do LTO with no faster linker, then mold is the only viable option. So what about Gnu Gold? I think the problem with Gnu Gold is the lack of clear ownership. So it looks like it's not really maintained well anymore. And the original creator of Gnu Gold, which is Google, has lost the interest of keep maintaining it because they are now switched to LLD. So I think the future of Gnu Gold is not clear. So and the gold is not as fast as my linker too. So can we improve Gnu LLD so that Gnu LLD gets as fast as my linker? My answer is no. I think that it's almost impossible to make the thing faster unless you rewrite everything from scratch. And if you rewrite from scratch, that would be the same thing as I did. So and in my opinion, the source code of Gnu LLD is not very easy to read. It's like the source code was written more than 30 years ago and it's been maintained since then. But people are still adding new features to Gnu LLD first and then put to other linkers because what they are actually using is the other linkers. But I think that the situation is silly because people do not really use Gnu LLD anymore for their real world project. So I think that it needs changing. And my question is do we wanna stay with Gnu LLD, the current Gnu LLD forever? My answer would be I don't think so since we have a good replacement. So if I can, I'm open to donate more to Gnu project so that we can call it a Gnu mold if that accelerates that option. It's not something that I can only decide but because it means a lot but I'm open to that option if it makes sense. So the death missing piece to use mold as the standard linker is the kernels and the embedded programming support. So user and the programs are mostly fine. Well, if you install more as a system linker you wouldn't notice any difference other than speed. But the kernels and the embedded programs needs more special care about memory layout because hardware for example, enforces you to put some data structure or code at a very specific location of the memory. And if you are programming against MMU this computer then you wanna layer as the hardware memory is. So that kind of stuff is usually handled by linker script as you know. But the linker script in my opinion has many issues. The first thing is that it doesn't have any formal specification of the language. It only has the manual and we implement to, so other linkers are trying to mimic the behavior of Gnu LD but it of course causes compatibility issues. And the other thing is that the linker script predates elf file format. So not all linker script command can translate directly to elf terminology and it causes more confusion than necessary. So, and I think that it is almost impossible to add a linker script support without slowing down the linker. So I think that we need something better. So this is my current approach to support embedded programming and counter support. So I added a very simple command line option which is called section order. And that specifies them how to layer the things. So, and I think that this option alone can satisfy like more than 90% of the usage but I'm pretty sure that that doesn't cover all the usage of linker script. So I need a help from you guys. So because especially in embedded programming world, their programs are not open source and they are not available on GitHub and they tend to be in house program. So I don't know what the real usage is for embedded programs. If you can tell me that I wanna do this with the mold linker, then I can implement that for you. So I would appreciate it if you give me a hint. All right. So this is the end of my slides. Thank you very much. So you mentioned that it's possible to do link time optimization, like as a feedback in the GCC, but in general, is it also possible, how easy is it to do link time optimization inside the linker, like is it possible for the linker to disassemble some instruction and try to put something else there? Okay, so the question is how easy it is to do something like link time optimization but not quite there. So I don't know if I correctly understand your question, but it's... It's basically optimizations during the linking. Yeah, of course, but the thing is... It's not by the compiler, it's all LTO, but it's not by the compiler. So the way how LTO works in the linker is compiler emits. So from the user's perspective, all you have to do is to add hyphen FLTO to the command line option to compiler and the linker, and everything works automatically. But behind the scenes, the compiler emits intermediate code instead of the actual machine code to the object file, and then the linker recognizes that intermediate code. And then it calls the compiler back end to compile all things once to the single object file, and then the link continues as if that gigantic single object file were passed to the linker. So in that sense, you can do anything with the intermediate file inside the compiler back end because the linker doesn't really care what is going on behind the scenes. So, well, does that answer your question? Yeah, so you said that you tested more against being all of the factors in gender Linux. How long did that take? How long does one count take? So how does it take to test all gender packages against more the linker? And it takes, if I remember correctly, three, four days on my 64-core machines, 64-core machine with 200 gigabytes, 256 gigabytes memory. And yeah, it's a very long time, but it's definitely doable on a beefy single machine. One target? Only for x86-64 because in order to cross-compile everything to different architectures and run-g test, you have to do that on QMU, which slows down like 100 times than the real performance on the computer. Yeah. Yeah, I can't. Yeah, sorry. What kind of mistakes did you make in LLD that you're fixing in mode? And are there any mistakes in mode that you think are interesting? So the question is what mistakes did I do in LLD that I fixed in LLD? And did I make any other mistakes in mode? That's a good question. The first thing is the relocation processing in LLD wasn't as good as mode. So it's complicated. It's hard to maintain, and it's slower than mode. So I fixed it. And the other thing is that LLD uses templates to support L6432, big-endian, little-endian, but it's just four instances. So it doesn't instantiate for each target. So you cannot use the technique that I used for Spark 6c4, that I showed you on the slide, for example. And did I make any mistake in mode? Maybe not. I am pretty satisfied with the quality of mode. I think that I really made... I'm personally enthusiastic about the code of the readability. So I tried to make the source code as readable as just like a book. And I don't know if I could achieve that goal, but the point is that, well, yeah, it's definitely readable. One last question. Are there any plans to ever support any order of that file that helps? Oh, so the question is, can you support other file formats? No, I'm planning to ever do that. Oh, do I have an plan to support other than LLD? Well, I did for macOS, which is a Unix-like environment, but it uses a different file format, which is called macOS. Yeah, but the thing is, and I succeeded to create a faster linker for macOS, which is much, much faster than the upload linker. But the thing is, last year in September, they released Xcode 14 with their own new linker. So there wasn't going on efforts within Apple that I wasn't aware of. And then their new linker is as fast as mine. Maybe they wrote my source code as well, because it's available online. But also, GTIB3, then? Oh, my linker is now available under the GMIT license. So it's, yeah. So maybe you only heard Apple. Well, Apple haven't released their source code yet. So, okay, we have to stop. So thank you again.
Build Distribution for Maintaining the Famous GCC 4.7
That's great to see so many people interested in GCC from more than 10 years ago. Okay, let's get started. So we are taking a step in time by more than 10 years, I think. Yes, almost, yeah. Okay, so Oliver Reiche is going to talk about the maintenance of GCC 4.7 and the reason for that is that GCC 4.7 has a special property which I'm sure you will talk about quickly. Exactly. So, hello everybody, my name is Oliver Reiche. I'm working for the Huawei Research Center in Munich. And yeah, I would like to talk about a built distribution for maintaining the famous GCC 4.7. So, and I would like to start with dissecting the title a little bit. So, first of all, what is the famous GCC 4.7 and what is it famous for? And then also talk a little bit about what do we mean by the term built distribution? Then I will show a little bit of patches that we applied to that GCC version and show a little bit about a bootstrap process before I wrap up the talk. All right, so GCC 4.7. Well, there is a movement that's called bootstrapable builds and this movement strives for building all software from source. And of course, you have to start somewhere. So, in practice, you usually start with a minimal set of binaries that you need to start the bootstrap process. And then at some point you bootstrap your C compiler and at some point you want now to bootstrap your C++ compiler and then you might ask yourself, how do I build a C++ compiler without a C++ compiler? Because most modern C++ compilers are actually written in C++. So, this is exactly where GCC 4.7 comes into play. It's a key role for the bootstrapable builds movement because it's the last GCC version that can only be compiled, that can be compiled with only a C compiler. So, if you want to enter the realm of C++ and everything that is beyond in this bootstrapping process, you will need this version of GCC. So, and it's also about software preservation because, yeah, it's a quite old code base. It does not run out of the box with modern compilers. It does not run out of the box of modern systems. Modern systems and modern compilers use by default usually the C11 standard. Also, this code base has some issues with that and GCC 4.7 does not build reproducibly in all scenarios. I will come to that a little later. So, the next thing is from the title, build distribution. I mean, this is like a very fuzzy task that we term that we invented. So, what do we mean by that? So, we have actually a project that's called Bootstrapable Toolchain. There's a little bit of advertisement here on the right side. You can build this project using our very own open source build system that's called JustBuild. And if you use this project, you can Bootstrap the latest compilers and latest build tools with it. And all you need, of course, our build system and reduced binary seed. We need the core utility being installed. We need a POSIX compliant shell and some C compiler with a working C standard library. So, even the tiny CC will work. And what we do is all of those two chains here are actually built from source. So, we didn't reinvent the wheel. We used the existing build descriptions for GCC Make or CMake for Clang. And our build system basically takes care of orchestrating the build and calling those foreign build systems. And yeah, you might have noticed, Make and CMake are not part of our initial binary seed. So, we have to Bootstrap those first. This is also what our build system takes care of in this project. And so, what we do basically is we do on-demand Bootstrap of all the necessary tools during this process to make sure that we have everything that we need in the next steps to do Bootstrapping of the next tool chains. And by doing so, we basically unfold the minimal Linux distribution on the fly that is barely enough to just build the tool chains that we are actually interested in. And yeah, this minimal Linux distribution is what we're referring to as the build distribution. All right, next I would like to talk a little bit about what patches did we apply to patch up GCC 4.7. Well, most of them are actually maintenance patches and backports. So, from newer GCC versions, so in the square brackets you see the GCC versions where we backported those commits from. So, in the PDF those are clickable links, brings you directly to GitHub. And yeah, just to mention a few, so the largest commit was the general Muzzle support. And yeah, this is just an example here. Of course, the commit is much longer. This introduced the entire macro infrastructure that is actually necessary for GCC to work with Muzzle. Another interesting commit was the actual linker support for Muzzle. So, it adds this magic string here which is the hard coded path where GCC expects the program interpreter to be located. But much more interestingly though is how did we patch up reproducibility for GCC 4.7. Well, if you use our build system or any other modern build system as a build orchestrator, they usually build in isolation. So, all of the stuff that runs in the action, so the make command, the make binary, everything that is needed to get the job done, is actually located in an isolated directory. It could be a temporary directory at a seemingly random path. It could also be located in the user's home directory. And there's a problem, for instance, yeah, those two binaries you heard about it today already, and CC1, the C compiler, and CC1 plus the C plus plus compiler, they contain checksums. And those checksums are computed from many things, and parts of that are the path of the linker that was actually used. And because we built in isolation, the linker is also located in this temporary isolated directory, and that path is seemingly random and finds its way in the final checksum. And the other problem is that the relevant object files for linking those binaries are also hashed to compute this checksum. And well, the object files contain debug information and therefore also contain somehow the build directory. So, we needed to patch that as well in order to compute a reproducible checksum that is independent of the build directory. So, which is actually fairly simple. So, we just made sure that the linker, we know the linker, we control the linker, so it's actually not necessary to hash the full path. So, we just stripped the path by some constants and replace it with some constant string. And of course, we copy the objects that are relevant during the process to some temporary directories, stripped them from any debug information using strip for target, of course, and then use those hashed those to compute the final checksum. So, at the end what we get, we still have a meaningful checksum that somehow represents how those binaries were built, while still being reproducible in the sense of being independent of the build directory. And all of those patches that I just showed will then be automatically applied during our bootstrap process. So, what is the process? How does it look like? So, we have actually multiple stages to, until we end up with the modern compilers that we actually want to build, because of time limitations I will only go into details of the very first stage. So, we start off with just having core utils, a shell and some C compiler. And the very first thing that we do is we bootstrap certain parts of busybox, because it includes very important tools that the auto tools and the auto config scripts later will need. And we only restrict to those very specific parts. So, grab find, say it for instance, and of course we need patch for patching GCC later. And with those tools at hand now, we can now bootstrap make. Make can be built with make, of course, but they also have a bootstrap path. Luckily for us, there's a shell script and with a little bit of magic, we end up getting the make binary and now we have make build system available. And then together with those tools and the make build system, we can bootstrap the archiver from the bin utils sources and then we also have an archiver available for producing static libraries. Okay, now we can do the first real build. So, we can build with those at hand, we can now build latest bin utils, the normal way it's meant to be built, configure and make, and then we can patch GCC and build GCC. If you're interested in running this on your machine, it should work on any x86-64-bit Linux system. You only have to install just build, clone this project and run this command. It should give you a working GCC 4.7 installation. Okay, so let me wrap up the talk. So, we tested that on many systems. It should work on any x86-64-bit Linux system. We also tried to test it on very different systems like NixOS, where actually everything is located at some custom path. We also tried very reduced images that only contain a tiny cc and a muslin libc. And with our project and together with our own build system, you can easily integrate, if you have a C++ project and use our build system, you can easily import this tool chain into your project. And then you can make the tool chain a committed dependency to your project, which has several advantages. Of course, it's easier to set up for the user. He doesn't need to have a certain C++ compiler installed. You can just clone your project, run build, and then the first thing that happens is the tool chain is being built. And don't worry about compile times. Of course, bootstrapping the tool chain takes a while, but this only needs to be done once. So the next time you build the tool chain is like a static part of your dependency chain that doesn't change, so it will come from cache. Of course, if the tool chain is committed to your project's history, also git bisects are easier. And we can even show, if you do it right, that you can predict the binary hashes of the binaries that your project produces. Because you have a very confined tool chain, you know exactly what the output should be. If you use the Moodle, Lib C, stripping, everything using static linking. We have a demo application showcasing that. We can predict binary hashes for this project that should run on every x86-64-bit lingo system. All right. Last thing, I would like to encourage everyone who's interested in that to just install just build and try those commands yourself. It will take about 30 minutes. If it doesn't work on your machine, please let us know, because this is super valuable information for us to make this process even more stable. All right. That's all. Thank you very much. Thank you. And we will allow for maybe three minutes of Q&A because we started late. And actually, I want to start with one question from the Matrix online channel because give them a chance to answer some of the questions. So Ismail Luceno asks if there is any collaboration with OpenBSD because they have been maintaining their own fork also of GCC 4.7, I guess, because of the C++. Okay. No, there's no collaboration. So the question was, is there any collaboration with OpenBSD? Yeah, very good. Okay. Because they maintain their own fork of GCC 4.7. No, there is not. This is actually a good question. I haven't heard about that before. So this is already valuable input for us. Okay. Got a question? Is this partly in the timing for things like bootstrapping for trusting trust at a time? What were your tries to avoid the possibility of your compiler to be supported? I didn't recognize it. Trusting? Trusting trusts, can you enrich your model to remember and then you should think of where you could support the compiler, you could insert an actor, compile it in such a way that you compile a source code and then recompile itself as high an actor and define the ring, but not present in the source code? Okay. Okay. It's pretty hard to repeat that question for me. I may be just in paraphrase. Yeah. Okay. So to ensure the question was whether this is security related. Yes, in some extent it is security related. So one idea is to have the possibility, if you build reproducibly in a way that you can say, okay, this source code compiles to this binary and will have this hash, pretty much independently of the system you're building on, of course there are some restrictions, that gives the opportunity to say, well, we can basically prove that this binary originates from that source code and that source code alone. That is actually also one of the motivations. Yes. It looks like typed up. How we got? One more question. One more. Do we have the next speaker in the room? Our guy finding. So, yeah. I was surprised that it's machine dependent. I wonder why different architectures aren't easily done. So the question was why it is machine dependent and different architectures weren't done. The reason is just we were focusing on x86, 64-bit Linux because it's the most widespread right now. And it's also quite of work to patching GCC up to make that happen. So we basically just not had the time to look into other architectures. But we already have it on our to-do list. We want to at least support ARM 64-bit. And then let's see where we come from there. All right. I guess we have to go. Yeah, then. So at the end of this process you get C++ compiler, but it is an older C++ compiler. Yes. I was just wondering like how many stepping stones are there to get to the latest? All right. So the question was that after the bootstrap process of stage zero, we just have GCC 4.7. This is a quite old compiler and what other steps are necessary to reach modern compilers. This is a very good question. Yes. So modern compilers usually need C++ 11 support. GCC 4.7 does not have that. And so the next stage, so stage one is actually bootstrapping GCC 10.2, which is to my knowledge the first one almost completely supporting C++ 11. 4.8? Is it all right? Okay. So current GCC can still be bootstrapped with GCC 4.8. Oh, okay. Okay. But not that clear. Okay, but we definitely need one more step. And yeah, we got that covered currently with GCC 10.2 is the step stage one. And then from there we can go on. So don't need more than one step after 10 years. So is that okay? Yeah, exactly. Yeah. And I guess the advantage of picking a later GCC version is that we don't have as much patching for new back ends and configurations, and stuff like that because that's then all. And it looks shiny new. And it's still maintained, GCC 10.2. Yeah, and you would help these main things. You're the next. Okay. I hope that'll be the end. I'm sorry. Okay. Thank you, Arvid. Thank you, Arvid. Could you help me with this? Yeah. I'll cut it off. I'll see you there. Thanks. Thanks.
Sega Dreamcast Homebrew with GCC
Okay, cool, cool. Okay, so up next is Falko Gurgis telling us I'm sure an entertaining story about Sega Dreamcast, how did you get this idea? I have an entertaining story about Sega Dreamcast Homebrew with GCC. That's true. Not this standard thing you would do. You ready? Alright, so I'm talking today on behalf of the Sega Dreamcast community. I'm actually a developer on the Homebrew independent SDK, SDK called Callisty iOS. And we're talking about how basically... Yeah, no problem. We good? Okay, yeah. So basically this entire Homebrew community is powered by GCC. And I'm just showing you kind of what the kind of stuff that being part of the GCC ecosystem is allowing us to do. So first of all, what is the Sega Dreamcast? Maybe some of you don't know because it had two years in the limelight. It was released in 1999 and it was only commercially viable until 2001. Despite that fact it had a substantial effect on the gaming industry. It left a huge legacy and it competed directly with the PlayStation 2. A little bit less the GameCube and Xbox because it didn't last that long. And then a little bit about it, it had Hitachi, which...SH4 CPU, which is now owned by Renaissance. An imagination PowerVR 2 GPU, which was the predecessor to what eventually got used in the iPhone. So that same technology actually for the GPU went on to do quite a lot of fancy stuff. And then there's a little bit extra about it. But the key thing here is the Hitachi SH4 CPU. And that's what has made our destinies intertwined with GCC. Because GCC is the only compiler that supports the Super-H architecture. So why the Sega Dreamcast? What's the big deal? So I think there's a lot of strong arguments for doing it. I think in like an era where people are into Raspberry Pi programming and embedded systems, it offers a really good middle ground between high performance because it's good at graphics, it's good at floating point operations and embedded programming. We have a lot of established tools that are really good. As you'll see we have really modern compiler support. We have a lot of language support. Thanks to Matt Godbolt we have SH4 and compiler Explorer. So you can actually check, look at what the disassembly of your Sega Dreamcast code looks like to make sure it's optimized. And as a beginner you can treat it like just kind of a weak PC using cross-platform APIs. Or as you mature in advance you can go down to the hardware level and optimize for it. There's also a lot of cool toys and peripherals. There's light guns, Samba De Amigo Maracas. There's the visual memory unit. And visual memory unit itself, the little VMU, has its own little homebrew scene. So as I was saying we have a pretty decent community and because our independent SDK uses no Sega code we're actually able to release our homebrew commercially and sell it online through retail stores and stuff like that. So this is how many we've released each year commercially and there's just a collage of different commercial games. So as you can see you're not going to get rich on Dreamcast, but you know if you're making a PC game within that spec range maybe you should check it out. So this is a little bit about Callisty iOS before I get really deep in some code stuff. This is a little bit about the architecture. So it's Callisty iOS, it's like a big SDK but it also is like an operating system. We have a kernel. We integrate with NewLib 440, which as far as I know is the latest one that's out there. That's where we do file I O, date time, malloc. We have a really cool virtual file system which abstracts the way the PC CD-ROM. You can stream from your PC. You can use the new SD card readers. Networking, we even have IPv6 on this thing. We have examples. We have add-ons and ports for OpenGL, OpenAL. The tool chains as you'll see we have GCC 1321, latest Benutils, GDB going on it. We're trying to take this retro game console and let you use the latest and greatest versions of the languages of your choice on it. That's kind of a little bit of what we're going to touch upon. This is a little bit about my Dreamcast. It's not going to go into too much detail but as you can see it's like a car. You can totally spend all your money on it if you want and go to town on it. You don't need to do any of this though to develop for it. That's another big point of it is as long as you can burn a CD-ROM, 90% of the Dreamcasts out there can boot your homebrew game as long as it's burned a certain way. That's part of why the homebrew scene became so big. The first thing we're going to look at is C23. We wanted C23 on the thing. What did it take to get there? It didn't take as much as we thought. One of the first things that we had to do was support Atomics from C11 so that you can say atomic int, atomic bool, and have, since we have a preemptive multi-threading scheduler, you want to be able to have atomic variables that aren't interrupted and stuff. Unfortunately the SH4 is old so there's no hardware support for Atomics. But since it's single core it's not a big deal. You just disable interrupts around it, you load or store your value, and then you enable interrupts afterwards. So this is actually offered by the compiler, the SH compiler, as the soft imask model. What it did not offer is 64-bit and generic Atomics. So we had to implement that and there's the C code for, it's kind of an ugly C macro to do it, but you can basically see. We just disable the IRQ, we load or store a type, and then we enable it later. And that's the basis of our Atomic model. So if the scheduler can't get interrupted when you're accessing Atomic, then it's Atomic. Then we validated the Atomics. You'll see a bunch of the output there is from my Dreamcast. So we have a bunch of tasks we ran through, a bunch of different Atomics, an Atomic buffer, and yeah, the Atomics work now on the Dreamcast. It's pretty nice. Something that was much harder was adding thread local storage support. So in C and C++ there's a thread local keyword, and there's a lot of stuff you have to do for that. It's a delicate interplay between the compiler and the operating system. On the operating system end, don't worry if this code is a little dense, that's the whole point, this was actually a pain, and that's code just there to show you what a pain it was. You have to allocate with every thread, you have to allocate extra block for thread local storage with the T data and T BSS segments for thread local, and then you have to swap every time you swap context, you have to swap the thread pointer to point to a new thread chunk. So we did that, and this is some of the validation tests for it. What actually makes it hard is you can align your TLS storage arbitrarily, so we had to compensate for arbitrary alignment, that was all the extra logic that was more than just a malloc with a fixed size, you have to also align those segments. So yeah, now TLS works on the Dreamcast. And then that was pretty much it, we got C23, we have no pointer, auto, type of, all the cool stuff that C23 added. VAopt is now in C23, Align As, Static, Constexpr, Compound Literals, one of my new favorite things to use right there. This is just me throwing in a bunch of C23 with a Breakpoint API. Oh, Binary Literals, pretty nice, C23 edition. This is a little video, uh-oh, was a little video, it's not working. Okay, well, cool. Maybe after you can check out my Twitter, all the videos are on there in case they don't work. So C plus plus 20 and 23 is up next. What we got for free, we actually got a whole lot for free, it's kind of cool. Concepts, constraints, modules are not fully supported by GCC yet, but hey, everything that was supported worked fine for SuperH, we were pretty shocked. Ranges, look at that crazy range stuff that we can do with C plus plus 23 on the Dreamcast. Pretty sweet, standard format, and this thing, a static, variadic, multi-dimensional, overloaded subscript operator. You can do that on your Dreamcast now, it works. That was pretty awesome. What we had to earn with this, standard async did not just work for us because our kernel had a serious bug with once in it that nothing had exercised that code path with the ferocity that modern C plus plus did, and we found a race condition there. Standard random device took a little bit of work, I'm going to get into that. Standard file system is not quite supported. Yeah, that's a sore point for me right now, we're working on that, that's our fault. We're not propagating error now properly with NewLib, working on that. Standard time zone, well Dreamcast doesn't really have a time zone, so not much we can do about that, although I will say we gracefully don't support it, so it's not a big deal. Stack trace is one that doesn't look like there's much we can do about that. Yeah, C plus plus 20 stack trace, I got the library compiling for it, but it looks like deep within the library where it's trying to look for the binary path for reflecting over the L for executable to unwind the stack and look up the symbols there's just not really any way for us to tell it where to look over the network for a Dreamcast, so yeah, there's no stack trace right now. Maybe we can hack something up for that later. Standard random device, it actually works fine, so you can do all this crazy random stuff. This is the NewLib hook where we actually hooked into, we supplied the entropy from a bunch of uninitialized RAM, and that's what the entropy is coming from uninitialized RAM, which goes to standard random device, and then this is just a uniform distribution getting generated on the Sega Dreamcast and showing, you know, looks pretty uniform. Yeah, C plus plus concurrency meets the Dreamcast, this is pretty exciting. Yeah, there's a bunch of interesting C plus plus 20 stuff there, so I made a huge test thing that we're running on Dreamcast, which is just, it is running a bunch of standard, it's generating a bunch of standard async threads, testing everything from semaphores, latches, share locks, condition variables, barriers, and everything. And at this point, I guess I can't show it because the video is not loading, but it would just be like a big printf printout showing that all the tests are passing. So, yeah, as far as I know, including code routines, everything from the support for GCC up to C plus plus 23 is working fine on the Sega Dreamcast because you definitely need that level of concurrency to work with this machine here. Alright, let's see. Yeah, I had another little video that's not, I don't know why they're not loading, but they're all on my Twitter. Alright, Objective C, there's a little more to this because there's a couple reasons. GCC, it looks like, doesn't quite support the latest version of Objective C. Objective C, too, I guess that's because Apple didn't want to fund it anymore, I'm not sure. So, we had to make do with what we had. It looks like Objective C might be a little broken right now for cross compilation. We had to patch a build script right now to get to cross compile for bare metal. It was failing a compilation stage and we just basically commented it out in one of the config files and then it worked. We were able to build LibObjective C. The problem with, you know, building plain Objective C is, LibObjective C is a C library that lets you access all of the object-oriented features of Objective C. It's not very pretty, it's not very idiomatic of Objective C, but that's the raw runtime. In order to do anything useful with Objective C or do anything that you normally associate with Objective C, you need the foundation standard library, which is typically associated with Apple. That's where your NS string, NS object, all that comes from. Luckily, GNUSTEP has an open source implementation of that. So, we tried to port that to the Sega Dreamcast to give you, you know, this very big, nice Apple API that you definitely want for your Sega Dreamcast homebrew. Oops, that went pretty well. So now you can, you've got data structures, you've got auto-release pool, you've got NS string, NS log, all that kind of stuff on the Sega Dreamcast. You know, that's just basically some Hello World stuff doing that from Objective C and that's the Dreamcast output. Now, what gets a little more interesting is the concurrency model for Objective C is actually pretty cool. We support the NS run loop, which has NS timer, which lets you schedule periodic timers. They're used for, like, GUI updating, you can use it for game engine logic. And then we're firing NS notification events asynchronously from that event loop. And the video was really just showing, like, a bunch of events firing asynchronously on a Dreamcast. I don't know why it's not working. But anyway, so you've got the Objective C concurrency model as well. And for the record, if you need Objective C++23 to get everything, that works too. If you want to mix both of them. Okay, so then we tried to get D on the Dreamcast. This was not done by me, this was done by someone who goes by Luna on the Luna Fox girl on Twitter. Thankfully, she helped us because I didn't know really much of anything about D. She did a great job. What was involved with bringing D to Dreamcast? Well, we used the GDC front end for GCC. We cross-compiled it for SHL. She wrote a custom run time to do some of the stuff that the D run time does, which I'm a little sketchy on. But I believe it's stuff like lifetime management, like allocation, deallocation, entry point. She did not use the garbage collector, not because it won't work on the Dreamcast, because we run Lua and it's fine, but she wanted manual lifetime management. And at this point, we did not try to do libfobos for the standard library. We actually are just binding to libc for that kind of stuff. And then that's kind of a folder view of what the project looks like. It's called DKOS, which is the D bindings for what we did. And as you can see, I was worried that a bunch of the low-level stuff we were doing in C and Callisty iOS would have to change. Like, hey, can you bind to inline assembly? Like, what are you going to do about the C macros? And actually, a D is quite capable. And here's some of the crazy stuff that she either rewrote or bound to from D. So there's inline assembly. It can handle flexible array members, inline functions, macros, versioned enumerations. I started getting a little jealous there as a C and C++ programmer, actually. It's really good stuff. So yeah, D meets the Dreamcast. So here's some fairly idiomatic looking D that had a video there that all it was doing was basically changing the color of the... It was animating the background color with the PowerVR on the Dreamcast, the frame buffer, and printing some stuff to standard out. And it worked great. And let's see. Here was one more video, which was a bunch of animated cubes showing 3D accelerated graphics with the D language. That's on her Twitter, actually. And then finally, everyone had been asking the entire time we were doing this on Twitter, hey, what about Rust? Hey, what about Rust? And we're just like, hey, man, I don't know what to tell you. LLVM doesn't support SH, SH4, like, take it up with them. And then GCCRS came along and happened. We weren't having any luck with Rust C at the time. We couldn't get it cross-compiling properly for SHL. So we started playing with GCCRS, even though it's like very, very new in its infancy. I mean, we were seeing like four loops being added almost in real time, you know, like we pulled down like, oh, you can use a loop now. It was pretty cool, you know? So this is not stuff that necessarily is ready to be played with, but we don't care. It's what we do here. There's no bar checker yet, so you'll notice everything is just unsafe, but it's still fun and it's still Rust. So this is, oh man, the video's not there. It's a rotating cube that is driven predominantly by Rust. It's unsafe, as you'll see. The main control flow is Rust. The OpenGL API is calling in to C for that. And then there's a mystery third language that you're about to see, which we implemented miscellaneous support utility functions for things that GCCRS wasn't able to cope with just yet. So, all right, we're going to go into that demo here. All right, so on the left we have the Rust, which is calling in to C. On the right we have the utility functions, which are Fortran. So we had C, Rust, and Fortran, all on the Dreamcast. And yeah, here was the rotating cube. So, yeah, I would say we inherited quite a good deal from the GCC ecosystem. And yeah, may your homebrew be powerful and good and fast, and yeah, that's it for us. I just want to say thank you to everyone who contributes to GCC, and to our GCC in general for supporting us, for supporting the SH backend. If you're interested in looking into any of this stuff, that is a link to our Wiki page, which is everything on how to set this up. You can do it from Windows, Mac, Linux. It's mostly just running a script that works in any POSIX environment that sets up the cross compiler. And I wanted to say that we are just one community that's powered by GCC and is modern. I'm friends with the guys who do the PSPSDK, the Sega Saturn stuff, Lib Dragon for Nintendo 64, SGDK, Sega Genesis, and the Vita SDK. I can tell you right now we're all using GCC. So, yeah, you guys are, there's a lot of people out there who owe you a lot. And if you like this kind of stuff and are interested in hearing more, you can follow me on X or Twitter or GitHub. And that's it. Any questions? Over there you have actually sitting, or one of the Fortran main tables. Really? Oh, it's awesome. We have a couple more in the room shortly. Oh, yeah, but... No, no, no, that's... Oh, I'm sorry. Our application at the moment is targeting the Lib Rome, basically. Oh, yeah, yeah. Which is a good library, but why that over Callistae OS? Because our app has been targeting the Dreamcast for that last 12, 15 years. It's developed by Marx, I think, in North Marx. Oh, my gosh. Oh, okay. Yeah, I know. Yeah, yeah. Accesses. I was wondering, it's fairly easy to install your GCC chain because trying to patch up GCC to a bit of Lib Rome character is a pain. Oh, you should totally use our tool chain. Yeah, our tool chain should definitely work. And our scripts, there's so many people in the Dreamcast community that by now they're pretty battle tested. Like, people will want it for Mac, Ubuntu, every flavor of Linux, Windows with StigWin versus Windows with WSL. You know, ours is pretty solid at this point. You should definitely check it out, actually. I'll definitely be trying to pull it. It's pretty nice, yeah, yeah. But, oh, that's really cool, though. Very nice. Anyone else? Yeah. Which version of NGL are you supporting? All right, so the latest you can get on the Sega Dreamcast is 1.1 because we don't have any shaders. It's all fixed function. But I will say, it's one of the most epic end- end-late-stage kind of GPUs that's fixed function. We have a lot of the stuff that went into shaders in hardware. Like, we have hardware accelerated bump mapping. We have some things called modifier volumes, which are really cool that you can use for cheap shadows and stuff like that. So there's a lot of cool stuff you can play with, despite being an OpenGL 1.1. You guys ever heard of Raylib? Yeah, we actually just got a port of Raylib that sits on top of GL 1.1. So it's really cool being in the Raylib community right now. And like, someone makes a game for like PC and you're like, hey, sick out your game on my Dreamcast. It looks pretty good. And they're like, what's a Dreamcast? But yeah. Anyone else? Yeah. Well, as you know, I'm the SuperH Conal maintainer. I do know that. My hero, man. There's actually the SuperH backend and GCC is actually still in a questionable state. Oleg Endo is watching this. So yeah. Well, he hasn't been doing on it so much recently. Yeah, yeah, yeah. There used to be two people working on it. So if I'm seeing now there's so many people like working on SuperH, it would be nice if like kind of people came to the Debian community or like there's also a Linux SH RSE channel on my barrel. Because like doing this all alone, like what I'm doing in Debian is quite a burden. So there's like some people that would like to help like also improve GCC. Absolutely. So the Linux kernel almost dropped the SuperH architecture and he saved his life. So yeah, we owe this man a great debt. And yeah, I meant to reach out. Definitely. Anyone else? Oh, yeah. I wanted to ask, as I know the Dreamcast had sort of support for Windows C for Dreamcast. Yes. And was there any plan or something about that? Because I remember the vertical system run on the Windows C platform and if I'm not 100% mistaken, GCC might have a Windows C target. I'm not sure about that because when the Dreamcast was released, there were two SDKs. You could use the one that was Windows CE, which a lot of games used and it was very impressive. It supported a lot of the Windows kernel and there was one that was pure Sega. But the thing is we try to distance ourselves from that because those are official proprietary SDKs. They're not independently developed so you can't really sell your home brew with that stuff. So I don't know too much about that, to be honest with you. Yeah, sorry. Anyone else? Yeah. We have actually, there's a giant chart on that wiki page that I linked to. There's a giant chart going back to like GCC4, looking at, we are running one of our polygon benchmarks, looking at performance versus binary size versus a few other variables and it's kind of interesting how it's varied across versions of GCC. Definitely GCC 13 is not the best or the worst and it's not like a linear trend either. But yeah, you can definitely take a look at that and that's a very good question too. Yeah, please. If anyone wants to port anything else, we are very interested. Okay, thank you. Thank you.
The secret life of a goroutine
It's time for our first actual talk of the day, which is by a very frequent speaker who I didn't have to look up the introduction of, because every time I look at his talk, it's like, wow, I learned something very deep about Go. So, small applause. Okay, just... Hello, everybody. Well, I'm going to talk about the secret life of a Go routine. This comes from my interest about how Go works internally, and I was investigating how the Go routine works internally. So, when I started investigating it, my idea of how Go routines were created and all that stuff was something like this. A caring mother with a baby in her arms, taking care of that beautiful, full of joy baby. It wasn't like that, okay? I started digging into the code and I realized that it's more like this. And necromancer racing the deads. I was like, why? There's a reason for that. But before that, I'm going to talk about something more general, that is the Go scheduler. For understanding how the Go routine works, we need to understand how the scheduler works and how it is shaped. So, let's start with the different pieces of the Go scheduler. One of them is the P extract that is the representation of a virtual CPU. Whenever you say Go Max Prox, what you are saying is the number of pieces that the scheduler has. And a processor, as I said, is a virtual representation of the CPU. It can have a status that can be either running, c-scrolling, or g-stop. It has associated the current M. We are going to see what an M is in a moment. Then it has, each processor has a queue of Go routines that needs to be executed. And a list of free Go routines. We are going to see what free Go routines are later. And, of course, other metadata. This is a very shallow explanation of the scheduler. This is an over simplification. Of course, it's more complex than that. But, well, a lot of other metadata inside the PS track. Let's talk about the M. The M is the self-representation of an operating system thread. It's what is executing your code in the CPU. And it has associated normally the current Go routine that is running in this M, in this machine. And the current processor that is associated to this M, that can be null, actually. There are some cases where the M is not associated to a processor. But, in general, they are associated. And other metadata. Let's talk about, let me, let's talk about the scheduler itself. On top of all these M's and P's, there's a struct that is called a schedule. That is, it has all the, it has a list of all the, all the idle M's, all the M's that are not doing any work, all the idle P's, processors that have not, that are not doing any work. All the, at least of global runnable Go routines, a queue of work that is not associated to any specific processor for now. And a list of global free Go routines. Okay. And the start of our show, the Go routine. There's a struct that is called GStrug. That struct is, represents a Go routine. And a Go routine is composed by, in a lot of the stuff, but mainly you have a stack that is a two kilobytes chunk of memory. The program counter that is similar to the program counter in a thread that is pointing to the next, well, to the current instruction that is executing. The status of the Go routine that can be running, waiting, runnable. There's a lot of different statuses. The current M that is associated to this Go routine is being executed right now. And the wait reason. The wait reason is if the Go routine is waiting, they have to be waiting for something. They have to be a reason for waiting. And that's the way reason. There's a lot of other metadata. But let's take a look at the whole picture. As I said, we have the scheduler at the top left with a list of free Go routines, a list of runnable Go routines, a list of either processors, either machines. And we have running processors with running Go routines associated with machines and all that stuff. Also, another interesting thing is that at global level in the runtime, as global variables, we have a list of all the M's, a list of all the P's, and a list of all the Go routines. That really are three global variables in the runtime. Okay, but how Go routines are created? This is where the necromancer raising the dead's metaphor comes into place. Because whenever you create a Go routine with just Peggy's, you create a spawn a new Go routine and start running things on that. But that's not what is happening. There's two ways of creating a Go routine. One option is to create it from scratch and the other option is to reuse all Go routine that is no longer working. So this is what is happening. Whenever a Go routine finish, it's changed the state to dead. So all that free Go routines, actually they are dead Go routines. So whenever you need a new Go routine, you can reuse one of them. Or the other option, if there's no free Go routine or dead Go routine to reuse, you create a new Go routine full of life, you kill it, and then you raise that from the dead. So that's the process. And actually that is how it works in the source code. It was shocking for me and it was a funny way of representing this. So let's see an example of that. Imagine that I have this Go routine here that wants to create a new Go routine. What it's going to do is pick one of the free Go routines in the free list and raise that from the dead, convert that into a runnable, put that in the queue of the runnable Go routines of the processor, and call the scheduler and the scheduler is going to, well, and the scheduler is going to eventually execute that Go routine. Another option is this Go routine here wants to run a new Go routine, spawn a new Go routine, but there's nothing in the free list of the processor. So it's going to go to the global free list of the scheduler and pick a chunk of them, move them to the processor, and then pick one of them and raise that from the dead and add it to the queue. And finally you have the option of this one is it wants to create a new Go routine, but there's nothing in the global queue. So what it's going to do is create a new Go routine. It's going to kill it and then it's going to raise that from the dead and put in the queue and all that stuff. So that's how Go routines are created. Let's see how Go routines, how is the life of a Go routine. A Go routine can go through a lot of different states, can go to runable to running, from running to waiting, from waiting to runable, from running to preempted, from preempted to waiting. There's a lot of stuff. Let's see how, let's see all these transitions one by one. From runable to running. That happens when you for example have a Go routine have finished the job or a Go routine start waiting for something. So it's going to call the scheduler. So the scheduler is going to try to find another Go routine to execute. The first thing that is going to do is try to find a Go routine in the local processor, in the runable list of the local processor. If there's nothing, it's going to go to the global runable queue and it's going to take some of that, it's going to move that work into the processor, it's going to schedule one of that Go routines to be executed. Then if there's nothing in the global queue, it's going to go to the net pool. The net pool is this system that allows Go to do IO work in an efficient way. And what it does is do the IO work and whenever it's finished, it gets the Go routine runable again. But sometimes what we do is we need to find work to do. So we go to the net pool and check if something is already done and start executing that. If there's nothing in the net pool, we are going to steal work from other processors. And if not, we are going to help the garbage collector in the marked face. Well, once we have found a Go routine in all the process, we are going to mark that as running and we are going to assign the machine, the operating system thread to that Go routine. We are going to mark that as running and we are going to start executing the code. Another option is running, well, another change is running to waiting. One of the interesting part of this is it's exemplifies how Go routines are cooperative entities. So they cooperate to give you the sensation of concurrency. So the Go routine, when the Go routine needs to wait for something, is the own Go routine who parks itself. Whenever I have to write to a channel, for example, if the channel is not buffered and I have to wait for something, what I'm going to do as a Go routine is park myself, stop myself, check my state to waiting, set the wait reason, detach myself from the operating system thread and run the scheduler. It's the Go routine that is marking itself as waiting, the one that is calling the scheduler to schedule the new Go routine. So the scheduler is going to find another task and it's going to start running that. So what are the reasons why we can wait? If you go to the Go source code, and actually there's in the bottom right corner, I usually put some references to the Go source code, but well, if you go to that point in the Go source code, you are going to see the wait reasons and that's the least of all the wait reasons. There's no more, there's no less. That's all the wait reasons. Don't pay too much attention to that. I'm going to summarize that. If you want to take a look, you can go. But the summary is you have GC reasons, garbage collector reasons, mutex reasons, semaphore reasons, channel reasons, sleep reasons, and other reasons. That's mainly why the garbage, why the Go routines waits for something. Okay, from running to Cisco and to running or runable again. Well, the Cisco is an interesting part. The Cisco is basically calling the operating system to do something and that can be fast or can be slow. And for some Cisco, it's kind of obvious, but for some Cisco, it's not so obvious. So what it does is whenever you enter in a Cisco, whenever you try to execute a Cisco, it's going to detach from the processor and it's going to detect if the Cisco is slow or fast. And if it's a fast Cisco, it's going to finish the Cisco and go back directly to running. But if the Cisco is slow, it's going to just stay in Cisco state and it's going to detach the processor. Well, it's going to keep the processor detached so the processor can select another Go routine to execute and it's going to finish the Cisco eventually and whenever it finish, it's going to move the Go routine to runable again and then queue that in a processor and all that stuff. The other thing that is interesting is the copy stack status. Whenever a Go routine needs to grow the stack because it needs more space for the function parameters or for the local variables of the function execution, it's passed through this process that it's going to move from running to copy stack. It's going to reserve the double of the current stack size in memory, copy over all the information from one place to another and change the pointers and then it's going to move back from copy stack to running again. From waiting to runable, this is a very interesting case because, again, as I said, Go routines are cooperative. So normally, a Go routine, it's changed from waiting to runable whenever other Go routine calls go ready. Whenever other Go routines say to my Go routine that it's ready to keep executing, we are going to see examples of that later. So whenever Go ready is called, for example, if a Go routine is sending something to a channel and some other Go routine is waiting, it's going to wake up that Go routine, it's going to mark us ready that Go routine. Then it's going to mark us ready, it's going to add that to the queue of the processor and try to get a processor to execute that. Another way is when you reactivate a list of Go routines that happens, for example, when the garbage collector have to reactivate some of the Go routines and then the garbage collector are waiting for the garbage collector phase, for the mark phase, and when that's finished, it's going to wake up a list of Go routines. Another case, it's when there's a case where it doesn't need to wait. Imagine that you say, hey, I'm going to wait for X, but that X is already fulfilled, so I'm going to go back to runable directly. Another thing is when you are trying to find a Go routine to execute the scheduler, you check the scheduler, sorry, you check the net pool, and the net pool sometimes has these Go routines that in theory they are waiting, but the data is already there or the job is already done. So it just moved that app from waiting to runable. Okay, from running to preempt to waiting or runable. Go has a preemptive garbage collector, has a preemptive runtime, and what it does is when a Go routine is executing for too much time, the system monitor is going to detect that and it's going to send a signal to the operating system thread that is executing the Go routine. That signal is going to mark the Go routine as preempt, so it's going to be moved from running to preempt, and eventually the Go routine itself is going to find the time for moving from preempt to waiting. And after the next garbage collector scan, it's going to move from waiting to runable again. So again, this is the whole life cycle, runable, running, syscall, waiting, preempt, govistak. Now all these states should be more obvious or more clear to everybody. There are some other kind of similar states of parallel states related to garbage collector. This is again a bit of a simplification, but this is in general what is the kind of state that you have in the Go routines. So let's see some examples. Imagine that you have a channel and you want to send data to that channel. The channel is not buffered, and there's nobody else waiting for that. So I try to send the data and because nobody's waiting, I'm going to need to wait for that. So I'm going to park myself, the Go routine is going to park itself, it's going to add itself to a list of Go routines that is inside the extract of the channel, and it's going to wait there. So it's there, it's waiting, and eventually another Go routine comes to read from the channel. What it's going to do is go there, read the data directly from the memory of the other Go routine, and then when it has the data, it's going to call Go ready on that Go routine saying this Go routine is already prepared to keep going. It's going to, and that's going to end in this state, and eventually the scheduler is going to select that Go routine to be run and everything is going to keep going. Yeah, this is the whole picture, trying to send the data, waiting inside the channel, getting the data from the other side, and the other Go routine is the one that is responsible of waking up the Go routine that was waiting in the channel. Let's see another example. Let's talk about the wake groups. For example, I can create a wake group and add three in this case. This is a very common pattern. And then I just found three Go routines that are going to do certain work in parallel. Then I'm going to wait at that point, maybe one Go routine is already running, maybe not, doesn't matter. So I call wait, so I'm now waiting. The Go routines keep going, maybe some of them are executed, maybe some of them have finished already, doesn't matter. Some of them finish and are there. And the last one, the last one is going to call done, the last done, and it's going to see that, hey, the wake group is already zero, so I'm going to call ready on the list of Go routines that are waiting for this wake group. So that end up with this situation where that's a runnable Go routine that is going to eventually be executed by the, well, that is going to be a schedule by the scheduler, and that's it. Again, the whole picture here. Okay, let's talk about how Go routines die. There's a Go routine normally dies when it finished the work. Basically, whenever there's nothing else to execute, it's going to change the state to that, it's going to set most of the data to the zero value, it's going to disconnect the Go routine from the end, add the Go routine to the free list of the processor, the dead Go routine to the free list of the processor, and call the scheduler to find anything else to execute. So, yeah, the whole life of the Go routine. Again, if you see this is the scenario where the Go routines are doing things. If I did my job correctly, you now should understand this better. And also this should sound familiar to. So let me finish with a couple things. One of them is I want to thanks Laura Pareja, the one that did all the illustrations for this talk. All the illustrations are creative common by. And you can see the webpage of Laura Pareja. So you can reuse it that do whatever you want with all that images. Also, I want to, I have a gift from MatterMos that is my company, they're the company that I work for. I have some stickers. I going to left out the stickers there, like Margie said. So that's exactly right there. So feel free to pick as many as you want. But I don't know if, well, I also have some pins too, but they are going to fly probably. Another thing is what is missing. I haven't talked about certain things because in the sake of simplicity, I try to avoid getting too much into the details. One of the things that I removed from the equation and have a lot to do with Go routines is garbage collector. I ignore the garbage collector entirely and it's a big chunk of how the scheduler interacts and how the Go routines are moving from one stage to another and all that stuff. The net pool, I mentioned the net pool, but I haven't entered into the details. There's very good talks about the garbage collector and the net pool out there. I know SIGO. Also, SIGO have certain implications with the Go routines also, but I have ignored them. The mark assist phase that is kind of important is a relevant part of things that Go routine does, assisting the garbage collector in the mark phase. This is the monitor that I have mentioned, but I haven't talked in detail about that. But again, there's talks around system monitor out there. One of the main references is the Go source code. I totally recommend you to go there and explore it. There's an illustrated text of Go runtime scheduler that is a YouTube video there. There's a series of posts from Argonel Labs about the Go scheduler. It's from 2018, so it's not super up today, but the general patterns are still there. Well, I hope this talk, after this talk, you have a better understanding of how the Go routines work, how the Go routines change from one state to another and all that stuff. But I want, what is more important to me, I want to encourage you to go there and explore the Go source code because it's a great source of information. There's a lot of super cool stuff there. And well, and depending on a combination of your passion about learning and your taste in movies, this can be more exciting than a zombie movie. So thank you. If you want to keep in touch with me, feel free to contact me. And the other thing, if you want to have a follow up session, then try this. If you want to have a follow up session, asking questions or whatever, feel free to join there. If you're leaving. Thank you.
You're already running my code in production: My simple journey to becoming a Go contributor.
And I will now like to introduce our next speaker to you. I would say he needs no introduction because you're already running his code. But he might need an introduction. This is a new... Sorry, could I have some silence in the room, please? Thank you. You're already running his code and he's telling a story of which I am, for some reason, after running the Go Dev Room for five years. Still I'm curious about, because I haven't contributed the Go project yet. And he has. I'm jealous of him. So round of applause for a Go contributor. Thank you. Can you hear me okay? Is the microphone on a good spot? Yep. So quick show of hands. Who here is a Go contributor? Is contributed to the standard library, the compiler. I see one, two, three, four, shows hands, five. Who here would like to be, like Marcia, who would like to be a Go contributor? There's a lot more hands. Who of you who wants to be is afraid to become a Go contributor? Who thinks it's intimidating or complicated or you just don't know enough about Go routine scheduling or something like that? Okay. This talk is for you folks who have your hands up right now. So my goals for the talk... Oh, first of my agenda. I'm going to talk about goals, who I am, and I'm going to tell my story of how I became a Go contributor and talk a little bit about how you can too. So that's my goal. My goals today, tell my story. And ultimately to encourage you to be less intimidated about becoming a Go contributor. My non-goals are to be exhaustive. I'm not going to do a deep dive into how the proposals work or how Garrett works or all the technical stuff. And I'm not going to show you a lot of code. There's a little bit of code, but you don't even have to be a Go developer to understand the code I'm going to show you. Who am I? I'm a Go contributor, technically. I'm a fractional Gofer. Fractional CTOs are all the rage these days. I'm not that. I'm a fractional Gofer. I work for different clients. You can hire me if you want some help with your Go. I also do Go mentoring and career mentoring, hire me. I'm also the co-organizer of the Go Amsterdam meetup. And I'm a podcast host and YouTuber. I hit that word, but I put videos on YouTube, so I am one. So some of you may know me through the Cup of Go podcasting. Listeners here in the room today? All right. A couple. I hope there's a lot more after this. I have stickers, by the way. They'll be over there. If you like Brewster, our little Gofer mascot for the Cup of Go podcast, get a sticker for your laptop a little bit later. So how did I become a contributor? Well, first I needed an idea. So long ago, I wrote this public open source library called Kivik. It's for CouchDB. It's sort of like database SQL, but for CouchDB. So if you wanted to be document store stuff. And I had a request from a user of my library. They were trying to send a regular expression as JSON to CouchDB because it's a JSON store. And it was just submitting an empty object rather than meaningful data. So they said, hey, could you make your library do this thing the right way and send a regular expression string? It's like, that's a really great request, but I don't feel like it's my library's responsible to do that. That should go in the standard library. So I created a request, which we'll talk about. But first, here's the problem they were explaining. So here's the code. I think this is the only slide in the presentation of code. So imagine you have this regular expression, food question mark. So it would match fo or foo, pretty simple. And you call JSON Marshall on something that contains that. This is the output you would get. Not very useful. This is the output the user of my library wanted and what I thought made sense. So I created a proposal on the Go issue tracker on GitHub. Now this is a great point to mention that there is a process, a proposal process. Some of you are probably familiar. If you listen to the Go podcast I just mentioned a couple of Go, we talk about proposals fairly frequently and we talk about, oh, this one's in the accept phase or this one's been declined or this one is possibly accepted and so on. That's all relates to this. Now this is a very simple proposal, so it didn't need the design doc, which some do, like generics had a design doc, actually multiples of design docs in the end. So this is a very simple proposal. I mean, I just explained it to you. I don't need a design doc to explain what I just explained on the last slide. So this didn't need that. So I just created a little, you can see there, that's the entire issue there, right? That's what I wanted. I showed the code that I just showed you. I showed the current behavior, the expected behavior and a little bit of conversation about my reasoning. And so that happened in 2021, May 13, if I can read that correctly. And then that kicked off this proposal process or a truncated miniature version of it anyway. So we had some discussion. One of the first comments came from Daniel Marti, who said, this would also be useful for this other thing and tagged Joe Sy, who was working on another issue that it would be relevant to. I don't know who this person's name, I didn't look it up, but they said, losing the options feels like a deal breaker. What that was referring to, there's actually two flags you can put on a regular expression in the Go library. You can say it's a POSIX regular expression and you can say it's, is it longest match? So at the end of two Boolean flags you can set on a regular expression and those are not expressed when you call the dot string method on the regular expression. So those flags would be lost. And so this person said that feels like a deal breaker. And there were some other comments too, but ultimately Russ Cox came in and said on June 9, so this is almost two months later, said it looks like this is probably going to be declined based on the fact that it would be a lossy expression of the regular expression. That was sad. Not really sad because this isn't a feature I wanted, I just was kind of excited to see a feature I proposed, you know, get through the process. And then Roger Pepe, I think is his name, came in and said, I think it would be fine if we went ahead and did this. You know, just use the equivalent of string, it's already lossy, why don't we just go with that and so on, gave his reasoning. And so this is just a month later now, we're into July 2021, Russ says, so this is the current idea, we're going to have Marshall and un-Marshall do exactly the same thing that string does, blah, blah, blah, and then it looks like it's going to be likely accept now. So, cool. Happy about that. Fingers crossed, let's see if it really becomes accepted. A week later, no change in consensus, so it became accepted, yay. So who's going to do the work? Sadly, just having your proposal accepted and go doesn't mean it's done, someone has to actually do the work. Now this isn't a lot of work, in fact Russ said, even before it was accepted, I'll do the implementation and see if I come up with anything surprising. I don't know if he ever did, if he did he never mentioned it on the issue tracker. If I ever had the chance to interview him, I'm going to ask him, did you ever do that thing? So I said, January, this is six months after it was accepted, I said I'm interested in working on this and nobody really responded except somebody gave me a heart and I thought I felt good, but. And then three months later, four months later, Joe Sy says, hey are you going to do this, Russ? I can actually use it now. And Cricket's from Russ, he's a busy guy, no shame on him, but you know, so more weight eating ensues. So I decided I was going to go ahead and do it and I decided to, I don't remember exactly when, we'll see the dates in a few moments, but so I decided to go ahead and do the code. Now this is a good time to talk about the contribution guide. This is probably the part, at least I felt, was the scariest part of contributing to go, so I'm not going to talk in detail about it, but the TLDR is you have to create a Google account, you probably already have one unless you're intentional about not having one for security or ethical reasons or whatever. If you want to contribute to go, you have to have one, I'm sorry to say, so if you're avoiding that bandwagon for ethical reasons, maybe go, contribution isn't for you, I understand your reasons, but you have to have a Google account, you have to set it up a Garrett account with a Google account. What's Garrett? Who's used Garrett, I'm curious? Who doesn't even know what the word means? All right. So think of like GitHub except an open source version of GitHub from 1992, that's what it looks like, but it's really powerful in ways that I can't really comprehend or explain because I haven't used it that much, but it's not bad, so don't be afraid of it, but they use Garrett for that. Now actually I lied a little bit, they do use Garrett for that, but you can do this through GitHub also, and I've not done that process, but if you're really afraid of Garrett and you can't read the documentation and follow the instructions, you can also use it, create a GitHub pull request, so that's an option open to you if you're really afraid of this, but don't be, it's not that bad. So 11 months later I finally wrote the code, I created my Google account and all that stuff and the Garrett account and I wrote the code, this is my change, this is what I added to the standard library, plus some tests and a couple other metadata things. It's like 20 lines of code if you count the comments in the blank space, the blank lines, that's not a big deal. I was really hurt though that Marcia didn't mention this in the Go 121 changes because I know it just barely threw under your radar. I actually got this yesterday evening, you're going to find it. Yes, yes, okay. And you knew I was going to talk about it, so why mention it twice? So really simple, I guess I lied, there's two slides of code, but it calls the string method and turns it into a byte slice, that's all it does to Marshall, to Marshall, your regular question, and then to un-Marshall it, it does the same thing in reverse with an extra error check, super simple code. So I pushed that up and then I, this is a screenshot of Garrett by the way, like I said, 1992 GitHub, that's what it looks like. And I got some code review. And then it was time for some humility. I kind of pride myself in writing tests and writing good tests, I usually write them before my code, first comment, make sure the test pass. I failed to, I mean I tested my code but I didn't run the entire test suite, which takes 10 minutes or something on my machine, and it was failing. The reason it was failing is because I failed to add some metadata about public API changes, it wasn't a big deal, it was easy to fix, but it made me feel a little bit silly for like, not writing, not running the test suite before I asked other people to waste their time reading my code. I had learned the project style, this was my original commit message, I don't see anything particularly wrong with it, but it wasn't the style that they wanted, they wanted something much shorter, they didn't want me to, they didn't want a long paragraph explaining, like they felt like, I say they, Ian felt like add these functions was enough, I didn't need a paragraph explanation, so I followed his style guide and ended up something shorter. The tests, he wanted some changes in the test, I called t.fatal, but it was a for loop, so if one test failed, the other test wouldn't run, so he wanted me to do t.error instead. Cool, makes sense. And then Godoc recently, I don't know how recently, recently in my mind because I used it before this, but they recently added these square brackets to do hyperlinks and stuff, and I didn't do that, so I needed to add that. Yeah, little nitpicky things, plus I forgot to run the test. That was kind of it. That was my thing. It got merged on March 27, so just over two years after the original, was that right? Just under two years after the original issue was opened, it got merged, and then it was in the Go 121, yay! My name's not there. It's in, it's in Git somewhere, but whatever. It still felt good. So I think I just breezed through that. I have a lot of time here. We have a time for questions here. I mean, I have a few more slides, but this is the point of my talk, really. What does it take to become a Go contributor, and what does it not take? So non-requirements are you don't need mad hacker skills. I mean, you saw the simplicity of that code I wrote. Now I've written much more complicated code, at least I like to think so, but not at the Go project. I've spoken to people who contribute to Go just by adding those square braces to Go doc. That's cool. That helps. I mean, that's valuable, right? It's not cheating. That gives me hyperlinks when I go to the Go doc for that package. I can click on a hyperlink now. That's useful. So if that's what you want to do to contribute to Go, that's all you need to do. All you need to know is how to type square brackets. You don't need to know about zombie Go routines and whatnot. You don't need deep Go knowledge. What do you need to be a Go contributor? I think the main thing I learned from this process is that for me to be a Go contributor, I need patience. I mean, a lot of that wall clock time was me not doing anything. If I had been trying and pushing the process forward, I probably could have truncated that down to maybe three or four months. But that's a long time to get 20 lines of code implemented, I think. I mean, relative to what I do at my day job anyway, where I do that 15 times a day or something. So it takes patience. But if you're willing to put in the time, you can become a Go contributor. It takes a little humility, especially when it comes to learning a new project style. I mean, I don't know if you've contributed to other open source projects before. I have. Each one has their own flavor, their own style. You need to learn that. You need to be willing to learn that and not, yeah, just put your ego on the side. That's not the point. It's just to do something useful according to the community's guidelines. And to learn some new things. Yeah, I think I'll breeze through this. Those of you who raised your hand that you were intimidated earlier, any of you feel less intimidated now? One, two, three. Okay, my talk was a success. That was my goal. If you're interested in learning other ways, one of my goals is to make Go less scary for people. That's part of the Cup of Go podcast idea where we talk about the weekly Go news. It's part of my YouTube channel, Boldly Go, if you want to watch that. If you have questions, reach out. You can find me at boldlygo.tech. That's my Go themed website. You can find all my socials and contact details there. Any questions? I don't know. Do we have, we can do questions, right? We have enough time for questions. We have time, so yeah. I will hand you the microphone. If you're too far away, you'll have to shout and he has to repeat. Hi, thanks for your talk. I want to do a Cup of Go listener. Wonderful, thanks. Shout out to the podcast. My question is, are there other ways to become a Go contributor like, you know, good first issues or stuff and get up? Other ways, other than introducing a proposal? Yes, definitely. You can find one of the existing bug fixes or proposals. So this was the first code I wrote that was implemented to Go. I had participated in the sense of filing bug reports and stuff like that previously that others then fixed. And many that had been just like closed as invalid or something that happens too. There's that humility part that comes in. But yes, there are a lot of open issues. There are some tagged as good first issues. You can find typo fixes, typo, I actually have an open CL. It's the Garrett terminology for a PR. Open for a documentation fix in a package in the center library. Things like that. There's a lot of things you can do. You don't need to file either a bug report or a feature request. You can find one that's already there. Hello, thank you for your talk. Yeah. I've tried several times during Octoberfest to do some contribution. And the big part of it was to find an easy issue to begin with. Do you have some tips for that? Not really. I mean, there is a, I believe there's a tag on GitHub on the issue tracker for like good first issue or needs help. I know there's a needs help. You could look at that. I think there's a good first issue, but I might be confused about the different project. One thing that is understandable but frustrating to me about the Go project is it's not really designed for newcomers. That's one thing I hope to help change with this. Help at least lower the mental barrier that you might have individually to doing this. But I say it's understandable because they're trying to build a professional quality, high quality language and standard library. And that requires one set of skills and guardrails around the project. Being open to all new contributors is a different one and requires very different types of open source management. So Go, I think, mostly intentionally has moved to that side of high barrier to entry for reasonably good reasons. But that is frustrating for this question. How do you find something you can do to contribute? I don't really have a great answer except look through the issue tracker and find something. In front. Become a Fotherm organizer, get fitness for free. Yeah, hello. So you had this requirement at the beginning and this sparked the problem and the solution in the library. But what did you do in the meanwhile? Because this took three years, right? So what did I do about this in the two years in this thing between issue, file, and I didn't do anything, honestly. The person using the library, I'm assuming they had their own work around. I mean, so there are work arounds for this sort of thing. Suppose that you want to, suppose this already exists. Now you're using Go 122, but you want a different version of the regular expression to be presented. You have the same problem, right? So you probably would end up wrapping the regular expression, reg x dot reg x type and put your own custom marshal on it, for example. That's probably what they were doing. I do that with time dot time or time dot duration fairly frequently depending on the application needs. So that's probably what I would do. Are there any differences in the main Go code versus like the Go X modules? Yeah, that's a good question. I haven't contributed to the X stuff, so I don't have experience to go on from there. I think it's pretty much the same process though. I do think the requirements for inclusion in the X packages are lower. So if you want to add, say something to X slices, you want to add, I don't know, change color or something, you know, some ridiculous thing there. There's lower barrier to entry to get in there because it's considered experimental. So you're like, if you want to do it in the center library, they have a high standard. Like we want to make sure that we're never going to regret doing this. In the experimental they're like, yeah, we don't know if it's a good idea, but let's try it. So in that sense it's easier, lower barrier to entry. Any last questions? Okay. I think this can mean one thing, but it was an amazing talk with not too many questions left. Round of applause everyone.
Single binary, full-stack provisioning
Check. Can you hear me? Yeah. Service announcements. If you have glasses that look like this and you don't have them anymore, I found them. I checked they are not with cameras. Hi everyone. Check. Check. Is it loud? Can you hear me in the back? The HDMI audio doesn't work so I don't have any warm-up music for you. But thank you everyone for coming. Hi. I'll start in time. No, perfect. I'm just standing here. I'm not being angry at you. Who has the lights? Is there a lights control that we can turn off? Can you put in the slides and see? In the back. Can you see the slides if not raise your hands? Perfect. You're awake now, right? We have fire alarms in this building. All right. I'm going to get going. Still two minutes. Two minutes. Oh, all right. For those who need to follow them, we have to start on time because of video recording. So that's why we are punctual every time, every day. I got an assistant just for picking out the best stickers. That's the one thing you need at follow them is an assistant just to get you stickers while you're working. It keeps us sane. The Go community is 50% for the amazing language, 20% for the amazing performance and 100% for the stickers. All right. Hi, everyone. Hello. Can I quickly introduce you before you start? I know you have much to say. Sorry. Sure. I'd like to welcome James to the stage. James got this talk because he's every single bus word on my bingo, which is full stack provisioning, DSL and single binaries. And he will explain all what that means right now. Go ahead. Hi, everyone. So I'm James. I'm going to talk pretty quickly. And so don't be afraid and turn this down. Can we turn this down slightly? Maybe. We'll see how it goes. Yeah. So I'm going to talk really fast because I want to show you some real live demos and I'm going to be basically at 30 minutes. So it's going to go really fast, but all the stuff I talk about is online. And if you have questions, I'll probably not have any time, but maybe at the end. So I work on this tool called MGMT. This is some stuff I won't show you. I'm a hacker. I write a blog called the technical blog of James. Who's seen my blog? Just raise your hand. If not, just raise my hand. Your hand. So I seem really popular. All right. Thank you. And like, do we want bad information? Do we? Come on. Louder. I don't know about you, but I really don't. This is my nope guy. You might have seen this. It gives me a chance to breathe as I'm getting my talk. I'm going to be typing, so I'm going to be sitting just out of the way. So I have this project called MGMT. Do you want to see a quick demo? What? All right. All right. Okay. Really quickly because I don't have a time to give you a full explanation, but basically I'm running it here over on the left and MGMT is running. And then over here on the right, I've just asked it to create one file. So I can cat hello world, right? This file that it has created. But the cool thing is MGMT is watching in real time. So if I remove the file and cat the file comes right back, right? It's always watching in real time, doing that kind of thing. In fact, it's so fast, I can even remove the file. It's kind of my old demo. Oops, if I could type and cat the file. And it always is that fast that it just instantly can fix the file. And you can even do crazy things like this. And basically ask it to watch as fast as you can. And it will even, you can see MGMT on the left here is spinning really fast, fixing the file. And that's kind of boring because we can do much more. This is just the base. What we really do is we have this language called an FRP. It's a reactive language that has streams of values that go into the resources. And when anything changes, it kind of rebuilds the world. And I want to show you some real things it can do. So no more abstract stuff. So I want to actually, I want to provision, I brought this little kind of fat raspberry pi kind of computer. And I'm going to show you a building and then I'm going to go back in my slides while it's provisioning and show you a little bit about the tool and how it works. Does that sound fair? All right. Thank you. So here's my demo and just just a quick thing to show you. So I'm just actually run this here first. So I'm just going to start it up here. It's not perfect. Like my type unification algorithm implementation is really slow. So you'll see it. It takes a moment. But basically what happens is suppose you had this, this code at the top printf math dot 42. It's just a function that returns the value of 42. We actually build these graphs. You can see that the values come in and the data flows along these graphs and build these big trees. And in fact, for even small programs, they'll look like this or like this, these enormous graphs and they all just track what's being sent and what's not and so on. And so I'm not going to show you this demo. I'm going to show you this demo. So let's say you want to just provision a machine from scratch. In my language that I've built called MCL, you just put the following code now. It's all that you need is a single binary that runs this, this code and has everything you need. You basically just set like your interface on your laptop, the IP of your laptop. You put a repo. So his definition for a fedora repo and what mirror to get. And then you just define the actual machine. So host name one, the MAC address and you're done. And so let's go over here. So MGMT is just about done running here. It's just starting up some stuff. It's actually, I've pre-cached a whole mirror on my laptop. So that's just like the entire fedora repo. So you have the packages there. So we're going to have to wait for the internet. And then the machine here, just so you can see what's going on. I actually have a little HDMI capture card and right now the machine is off because I haven't plugged it in yet, but it will just show you on this screen what's going on. Does that make sense? Do you want to see this? It's going to be a little bit boring to watch a machine provision, but I kind of like this kind of thing. So anyway, so this is starting up and just, I'm just going to kill these slides for a second. What I'm going to do now, so it's starting up an inside MGMT. There's resources which embed a TFTP, DHCP and all these other servers. So let's find the power part here and plug this in. So that's plugged in. And I just forgot one thing for my bag. Some notes. Yeah, so here's the machine and it's actually starting up now. You'll see here. And on the left, we're just going to see what happens. So as events come in, so you can see right away the DHCP server said, Oh, I see someone talking to me on DHCP. Can I have an IP address? I'm not touching anything. The machine is now automatically requesting the like the stage one UFI stuff. This is a pixie boot. So it's a network boot type of thing. There's a TFTP server that passes those files. And in a second, it's going to start doing that. And you can see right now it's pulling down this file. Do you recognize these two? What are those? It's basically Linux and the initial root file system boot this stuff off. And MGMT is doing that. I'm not doing anything at all. And you'll see it's going to slowly kind of boot. And just to get a little bit more information about what's going on, the script I wrote in MCL has a little bit of extra information that it hides. So, see if I can, oh, there we go. Nope. Okay, so right here, I have a little file. Let me just fix one thing. I forget which is the right one. There we go. So I just have a little thing that just shows what the state is. So the computer is set to be provisioned and the state is not provisioned, right? Because we haven't provisioned it yet. And if you see it starting up, it actually is just loaded the kickstart file. Does anyone know what kickstart files? Anyone doesn't? Raise your hand. Don't feel silly. A kickstart file is just a definition file that's built into RPM and like Red Hat, Fedora, Flavor kind of OSs that just says what should we do? Like how should I provision the disks? All that kind of stuff. And so it's doing that thing. And I can actually even zoom in here if you want to see what's going on. Oops, now it's gone. Where'd it go? There it is. Just move over. Oops. What's going on with my little media player today? Where is it? There we go. So you can kind of see what's going on. So this is actually Anaconda itself running. It's checking the storage configuration. And in a moment, it'll do some stuff, which we'll see. And then while I'm doing that, I can even show you. So this is just me doing a random W get to the web server that's built into this single Golang binary. And so if you want to see what happens, it actually is just generated. It has a template built into the script. All this stuff is just generated. And what it's going to do is it's going to do the basic install, install a few packages, whatever you want. And at the very end, there's a post hook, which is part of kickstart. So at the very end, when it's done, just before the machine reboots, it will actually run some command. And in our case, it runs a little flag, a little W get that talks back to the web server to send a secret message that says I'm done. And so that's going to just do its thing. And storage is a little slow. It's just a little low budget machine. Just to show you what's going on, I'll do, oh, here we go. So there's storage. It's running its thing. All automatic. If anyone has their laptop, by the way, that wants to be provisioned, I can set up your laptop for you. It'll wipe it, but it'll be good. And you'll have a really good fedora machine. I kind of like fedora. It works. It's very modern, upstream. Easy to build these kinds of things in. And let's scroll down here a little bit here. You can see what it's doing. So here it's building the partition. And right now, if you look over here, this is MGMT. All those packages are downloading. So if you're ever curious about exactly what packages are requested over the network when you install an OS, you have exact logs. In fact, it was requesting all these files I'd never heard of. And it's kind of interesting to see how machines work. So with this embedded TFTP, DHCP server, all these pieces, you know exactly what's going on. And you can also see weird hardware requesting weird things. So there's the standard like what a pixie boot is supposed to request over the network. And then when you look, it actually does some weird things that you might not know about. Probably because some customer had some weird setup that wanted some firmware thing. I don't know exactly. So it's installing all the files. Hi, come on in. You're missing all the fun. And yeah, let's just go look at a few slides while it's doing its thing. In fact, if you want to be really nerdy and look at some code. I never know how well this goes. Is that too small for you to see? So this in so just I don't want to give you specifics about the code because that's not important. But the entire provisioner, this entire provisioning tool with the lines and comments and whatever is like 500 lines of MCL. And so the realization I had some time back when I was doing puppet stuff was that if we had the right language, and we had the right engine and runtime all implemented in Golang, we would actually have a new way of building tools. Now I'm not here trying to convince you that we should use my provisioning tool. I think you should. I think it's better than most provisioning tools out there. But the real thing I want to teach you about is that this is a new way to build whatever tool you want. And because it has all these real time events, it's very easy to glue all sorts of useful pieces together, whether it's automating a distributed Cef cluster, whether it's single provisioning. But like for me, like how many times have you all installed a Linux distro somewhere with like the USB key? Like be honest, everyone here, right? But the idea now you download this one binary, Golang makes that super easy. You run it, you just give it a Mac address and you just plug the ethernet cable from your laptop port into the machine you're provisioning. Or if you want, you could plug it into a switch and plug in a bunch of machines. I think that's pretty easy, no? I don't know. I think so. But yeah. So it's just installing all those packages. And let's just go back here and see. So yeah, so this is all the code it takes. Now here's the cool thing. So as I was saying kind of briefly at the beginning, I'll have to look online if you want a more fleshed out definition. Variables are streams of data. And so this base class here, you basically include this code as hostname one. And then there's a hostname one variable and it has fields. So provisioned. So what actually happens is this bottom chunk of code is sitting there just waiting. It's waiting until the provisioning happens. And then when the provisioning is done, this variable, which was previously false, turns to true. And that means that this block of code will declaratively execute. So you can do these kind of stream things that are very easy. And in this little file that we're pulling here, we just pulled this file and it's going to basically change the contents from that to provisioned. This is just some screenshots for fun. It works on all sorts of sketchy old machines like this one and so on. Just a few thoughts while we're waiting for this to run. These FRP languages, these reactive languages, you probably have only seen them in the web. Like all these web UI people use these languages because you just change something and then the elements update. I don't really know much about them, but that's really where they're used almost everywhere. In the 70s, they used to actually be used or were done as a proof of concept to control helicopters. So they can be very, very fast. And for me, realizing that FRP was a solution to a totally different problem that's not UI based was exciting for me. And there's so many cool things that you might not realize. All the variables have events. So if you're debugging something, you can just point to a variable, have it be displayed in a file, and you can watch it change as your program executes. Real-time live debugging of variables. Very cool. I use this all the time. No, I don't use it all the time. I should use it all the time, but yeah. This demo has a local mirror, but you know, you could do this off the internet too. All sorts of other cool things. 500 lines of code. Just for curiosity, what provisioning tools do you all use today? Pulumi. Like all these tools, like how long did it take you to set them up? You need a DHP server, you need a TFTP server, you need an HTTP server to do pixie booting. Like I know those people that just, you know, use the cloud to provision their thing, but like if we want to really control our stack and control our provisioning, I think it's kind of important. A few things. The tool is not finished. MGMT, my tool, is not finished. The provisioning tool, which is one tool I've built with MGMT, is just one of those things. If you're good at programming and Golang, I desperately need more contributors who are smarter than me. I think I have the overall design, but some of the nitty-gritty little things to get them better and perfect. My type unification performance is really bad. It's like a, I'm sure, a very suboptimal, like, end to the power of end kind of algorithm. It's not that bad, but it feels bad. I still make, like, Golang concurrency bugs. I guess we all do, right? And I'm slowly killing one at a time, but if you're good at that, if you're good at lexing and parsing and, like, error messages to make my parser not be so hard to understand, that would be great. And writing new tools and all sorts of other cool things like that. I'm just a guy doing this because I believe in this project. I've been doing it for a while. It's kind of getting ready to be production ready. I think now it's finally really usable, but I don't know how to do this. Like, I don't want to make it proprietary. And so my next latest idea is the MGMT partner program. So you sign up for free. I have, like, a Google form, and I'm going to send you newsletters if you're interested. And if you have a company and you want to sponsor, that would be kind of really cool. And, yeah, if you want to go to this link, you can do that. Bitly MGMT-partner-program, like tell your companies, pay $100 per year, and I'll send you stuff and send you new tools, or free, if you really want. And, yeah, we'll come back to that. So let's just go back to this provisioning thing. So it's still coming along here. I'm going to just make this a little bit nicer to see. So that's, it's just finishing the package installation right now. The good news is I'll probably have another time for a live demo if you want another one. And we can see the machine is still not provisioned, still running away here. Quickly, while this is happening, does anyone have any quick questions, like really good questions? Scream it out, and I'll repeat it. I'm not sure if I see any questions, but I see that you were using just Dora for serving the CD-card. Tell me. I'm not familiar with the tool, so probably some of the documentation is something less. Since you were here, I can ask you, does it support other distros? Yeah, so at the moment, so I've built this provisioning tool. Can you repeat the question? Yeah, yep, I'm going to. So I've built this tool that just supports Fedora at the moment. And yes, you could definitely, in this MCL code, add like, if, if this, whatever. I haven't done all those if statements because, bless you, I'm just trying to test the minimally viable thing. But basically the goal is to add more distros when there's other people that want to do the work, right? I don't want to do the work for everyone else. Supposed to collaborate on it. So yeah, and actually right now here, it's just finishing off the configuring. The whole demo takes about 20 minutes. So I think we're about four or five minutes away. The last thing it does is checking all these packages. It's installing the bootloader. Let's just go over here. Look at that. Creating the users. This will ramfs. And then you'll see here in a moment, MGMT is going to wake up. And tell everyone what's going on. I was kind of a system in the original life. And I would look at machines doing this a lot. I was an early cobbler user. I don't know. Does anyone ever use cobbler? Like back in the day. And the thing is cobbler took so long to set up. I had to set up all these different pieces and get all these templates right. And it was kind of hard to do. And so after cobbler, I got into puppet. So I only use puppet. I did a lot of puppet. I have like old blog posts about all the fancy puppet stuff I was doing. And I was doing all these crazy puppet hacks. And then I realized one day that like all of these hacks should just be built into one tool, but not hacks. And then I started working on MGMT. Because the puppet folks just didn't think they were going to re-architect their core engine and language. And it was literally all of those lessons that probably happened over 10 years ago that taught me what I wanted to build. And I've been waiting to build the provisioning tool as one of the first examples of what MGMT could really do for a long time. And the real question I have, okay, so look at that. So I just ran the post-installation scripts. You can see here it got this URL, so action done, blah, blah, blah. Got flag equals true. And if you watch this here, this text file, I'm just pulling the text file. Why the text file? Because we're just declaratively putting some state into a text file. And MGMT will know when that changes. And why do we want to actually catch the state? Because when we know that it's done, MGMT is going to change a variable. The code will automatically update. And then when the machine reboots, it's not going to boot back into the provisioner again. Does that make sense? If it's working. Should be working. Maybe it's shy. Good point. Maybe it's not actually pulling that through. It's booted up. The flag should propagate. Let's see. It should boot up in a sec. So the annoying thing dealing with all these machines is they have all these weird kind of firmware, buggy things and timeouts and things like that. And here's one of them. You have to wait for quite a while for the pixie boots to timeout. Obviously this is a... So there's Fedora booting. And if we go here, in a moment, we'll be able to log into this machine. Not yet. Still loading. Still loading. Oh, I'm plugged in that work. For a sec. Did I ruin it? There we go. So logged in and the password is password. Don't tell anybody. And now we can do something destructive. Who wants to mess up this machine? True? Is this bad? Is that necessary? I don't know. We can delete DevNull. That's not good. What do you want to run? No one's adventurous. Anyways, you get the idea. So that's the provisioning tool. Yeah, do you want to see another quick demo or two? Only some of you do. Do the rest of you want to see another quick demo or two? Yeah. I'll just do a kind of a classic demo here. Oops. Just to show you, I'm going to over here start up MGMT. And the code that I'm running, is this working? I think it should be working. Just to show you what's going on, I promised about streams of stuff. I'll plug this. Can you see this OK in the back? How's that? So everything is a stream of values. So I have this function datetime.now. And it is literally the number of seconds since the epoch. And so it turns out that this is a function which happens to update every second. Because time is always moving forward. Other things like this multiplication are static. And you have all these different event streams. System load. Here's another one. System load down here. And for fun, I even added a view meter, which is actually sampling my actual laptop microphone. And so you just take all these values, combine them into this big template down here. And over here, I'm printing those out. So if I actually pull this file, see if this is working. You can see, I'm just pulling the file over here on your right. And if you look, you can see the time's changing, because it just reruns the graph every time something updates. The system load itself on my laptop, you'll see changes. And my little view meter, if I make noise, you can see it goes up. If my microphone is up loud enough. You get the idea, right? And all this is open source. You can all try this online. And the point is that you can now kind of start thinking about building your infrastructure in a way that you just describe the state. Today's state, the error scenarios, basically write a load balancer in software all very easily in this simple language. And then run it, and it will take care of things. So there's more material about this online, but that was just roughly to show you how it works. I'll kill that. Really quickly up to the top, because I got a few minutes left. So we have the core engine itself has all these resources. Those are the things that do work, like files, virtual machines, DHCP servers, so on. Those are all built in. And then there's this language, which is that FRP of values that creates graphs of resources and runs them. We run in parallel, we run in event driven. I didn't talk so much about some of the Golang nitty gritties, but Golang has been an absolutely great language for this. Because we do all this stuff very quickly in parallel, very, very fast, very easy to build this one binary with everything contained. And there's a lot of great libraries, bless you, that make it possible. So we leverage etcd, for example, which is also built in Golang and so on. We've got all sorts of different resources for managing things. Bless you. And yeah, lots of stuff about this is online already. Just quickly down at the bottom here. Future work. Who's like a really skilled or adventurous Golang programmer that wants to help out? Like if you've got skills, don't be shy. Ping me. This silly partner program thing, if you want to start using this, your company played around or just on your personal projects. Let's just recap a few things. He's just recapping his pen. That's the best joke I've got for the end here. I started a matrix channel because IRC is kind of dying and no one's there. We don't have a mailing list anymore because Red Hat used to host it and then they killed all the mailing lists. So if you are someone who hosts a reliable mailing list and wants to host our project, that would be great. There's a bunch of links online. So I'm purple idea on the internet all over places. So don't be shy. Feel free to ping me. If you're going to put in the time and you want to improve your Golang skills and so on, don't be shy. Ask me and I'll help you review your patches and so on. Yes, I'm someone's probably annoyed that I keep saying Golang. I just find it so confusing to say go all the time. It confuses my brain and when other people say it, it confuses me to hear what they're saying. So I know the language is called go. I just try to be less ambiguous. It turns out I'm not the only one, but my apologies if I've upset someone. There's a great conference in Ghent, which is right after FOSSTEM. And I'm going to be giving a slightly longer talk and a workshop, a talk on Monday workshop on Wednesday. So if you want to come by, it's free. Although we do ask you to sign up. It's a big six, seven, eight hundred person conference. So not huge like FOSSTEM, but still pretty big. And I have got some stickers if anyone would like a sticker and they're actually going to use it. So, yeah. Does anyone have a question or two? Yeah, go ahead. Scream it out. We're a lot about provisioning, but what about configuration management after something is provisioned? Absolutely. So the question is like what about config management after provisioning? And we absolutely do config management. My realization was that the standard way we talk about config management meant only this narrow thing. And I really believe that config management, the way I see it, is actually a more broad topic. And with the right tool, you can actually squeeze forward and do provisioning as part of your config management process. Do the actual configuring of the machine. So at that machine at the very end, we definitely can kick off MGMT and run itself and keep doing stuff there. That absolutely can happen. And then further, some people talk about orchestration. I don't like orchestration because orchestration is centralized. But we do what some people are calling, I think it's a lame word, but choreography. So this MCL language allows you to write distributed algorithms that run on more than one machine at the same time and they coordinate for various checkpoints and other things. It's really very cool, but definitely out of the scope for today's talk. So yeah, good question. Yeah, go ahead. I see an article about the distribution of topology. I'm not sure what's the... Yeah, the question is how does the distributed topology work? It's a longer story, but long story short, you write the MCL code. It gets pushed to every host in the cluster with EtsyD and they all run that separate algorithm, but some variables are different. So host name variable will be different per machine and so on. And they can use those slight differences to run slightly different codebaths in a way that they work together. And it's pretty cool, but I'm, you know, again, biased. Yeah, any more quick questions? I got about 30 seconds. Yes, no. Yeah, go ahead. Do you have a live version of MGMT doing post-provisioning of business or do you have a system? Yeah, so I use it personally for stuff. I run a few low-budget system-in jobs for some local businesses and stuff like that. But yeah, it's not publicly documented. It's just sort of, I use it. I've started using it very recently. But yeah, good question. And I think, so my goal right now is to have more people start using it and being early testers of real-world stuff. So if you're interested, please ping me and hopefully get you along with MGMT. Yeah? Thank you very much. Round of applause. Thank you.
Efficient Integration Testing in Go: A Case Study on Dapr
Actually, an ex-collworker of mine, we worked together on CertManager, if I recall correctly. We wrote a lot of tests there, not enough tests in my opinion, but there is never enough tests in the world. And I have to be honest, when I code and I'm not being paid for it, I do not write tests. So Josh does, and that's why he's going to talk to us about how to make your testing life way, way better. Right, that's possible Josh? Thank you very much. Cheers, Marsha. Good. So hi, Ron. Yeah, hopefully I can change Marsha's opinion on that during this talk. So I'm Josh. I work on the project DAPA, which is an open source project. I'm going to talk about that in a second. And the talk is about efficient integration testing in Go. So it's a case study on DAPA. I work on DAPA, I'm coming from a DAPA perspective, but the idea here is the kind of learnings that we have did through DAPA, you can kind of bring to your own project and make your project better, more efficient and correct and these kinds of things. So this is the agenda. Like I say, we'll talk about testing, we'll talk about DAPA a bit, the framework that I wrote for the integration testing in DAPA, and then some learnings and some gotchas and some things you can pick up for your own project. Cool. So testing. Why do we test software? Fundamentally, why do we test software? So the first thing is to prove the correctness of software. That's the main point, right? We write software, software is complex. Code is hardly readable by humans and we make mistakes and the more software you write, the harder it gets to keep track of the state and yeah, we all write bugs. But it's not necessarily the case that this is the only reason why we write tests. If it was the only reason why we write tests, we would write our test once and then once they start passing, we would delete the test file. So writing tests just for the correctness is not the only reason. Another reason is for putting guardrails in place. Implementation code changes over time and so assertions you want to make about your code behaving in a certain way, you want to kind of keep into the future. So yeah, that's why we don't want to delete our test files after we've written them. The next thing is ensuring compatibility with external APIs. So if you have external services, I'm thinking I come from like a Kubernetes world and things like this. So Kubernetes version changes, they break stuff all the time. You want to make sure that your code still behaves in the expected way when external things change. Verifying performance, performance testing, these kinds of things, making sure that not only your code is correct but it also does things in a timely manner or uses less resources than is your limit or things like this. And finally, and what we'll follow in this talk is hopefully that if you write a testing framework which is usable by humans and is efficient and is easy to read and use, then that testing framework itself can then be used as your kind of sandbox on how you can test or do experiments in your software and test features and things like this. So a really good testing framework is really important to improve your developer experience and the final thing is increasing developer velocity which is largely a big thing that we care about, right? We want to write features. So test types, if you open a textbook on testing, you'll probably see this graph somewhere. It's a very kind of classic visualization of the different types of testing. At the bottom you have a unit test, that's your test bar, that's your logic code, and it tests a variable equals another variable, really exciting stuff. And then at the very top you have things like your performance testing, your testings and things like this. And then the middle section you have your end-to-end and integration testing. The difference between these two things is semantic and depends what project you're talking about and who you're asking and things like this. Again, I'm coming from a dapper perspective. End-to-end tests for us are deploying to Kubernetes and running it in a Kubernetes environment and invoking it there. Integration testing is running binaries locally, typically, and that's where the differential takes place. Integration testing ideally runs quicker than your end-to-end testing. Kubernetes is a slow software so it's a pain in the ass to write loads of tests for an end-to-end test. So yeah, the talks about integration testing, what are integration tests? Fundamentally, this is what an integration test is, and this is true for a lot of testing as well. But fundamentally, you're setting up your system to be in a particular state that you care about. You're then asserting a particular behavior and then you are then cleaning up that system state. That is it. That is fundamentally what you're doing. As an example, again, going back to dapper, this might be executing one of the dapper services, then doing a curl, in this case, to make sure that the healthy endpoint returns a 200 or something like this, and then finally killing that process at the end. That's it. That's what an integration test is. Keep talking about dapper. That's interesting. That's not dapper. Okay. Try that again. What is dapper? Not that. Dapper is an open source project, all written in go. The tagline, the marketing headline, is that it is a set of APIs and SDKs and frameworks to make a developer more productive in a cloud-native environment. What that means fundamentally is that the project will expose a bunch of APIs for you that you typically need to write some business logic that does something interesting. They have a list of APIs here, so it gives you some state management, PubSub, Actors, and then you can back those APIs by whatever implementation that you want. It might have different concerns, so the infrateam might manage your postgres, and then to you as a developer, you're just exposed with the state support API. That's fundamentally what dapper is. What is important for this talk is that dapper is a complex software system. We have multiple services running, and they're all doing different things. We're all talking to each other. Maybe sometimes they're MTLS, sometimes it's not. Sometimes GRPC, sometimes HTTP. We have a whole set of APIs. We have a bunch of backing services that we support, whether it be postgres or some Google stuff, whatever it might be. The point here is that this is a very complex software system, which all software turns into over a longer period of time. When your software system becomes this complicated spaghetti mess, it becomes a house of cards. It will happen, and if anyone who's worked on a larger project will have first-hand experience, you make a small change, and that will have unexpected consequences or behaviors in a completely seemingly unrelated part of the system. You'll have software turns into house of cards, you don't want to make changes, and again you slow your developer velocity that we were talking about. How do we resolve this? Tests. We use integration testing. When I joined the project, there wasn't any integration tests, so it was kind of a blank slate. I could start from the very beginning of how I wanted our integration tests to look. I came with these set of design decisions. First of all, I wanted to go as the sole dependency on these integration tests. I hate make files. I think make is terrible, and I don't want that anywhere near having to invoke tests. The next thing that I wanted to do was to run a test. I wanted to do something like a test, and it would be worse, something like needing Python or God forbid having to run Docker or something like this. It just run my tests. We want them to be as close to what developers are doing in their day-to-day, because remember it's a community project, we have lots of contributors. Having go as a sole dependency was really important. They need to be quick. Time.sleepers.band, we'll talk about that later. Tests need to be portable. We basically get that for free with go, because go is very good in that it can be compiled to different architectures and operating systems and things like this, and it's designed from a portability perspective from the start, so we get that for free. It needs to be extensible. We have lots of contributors. People need to be able to write code for the integration tests as they contribute to the project, and it needs to be readable. Similar reasons. That was the design philosophy, the design decisions I came into the project with, or into the integration test with. Next was actually writing the framework itself. If we go back to our original diagram of fundamentally this is what an integration test is, the first thing we can do is turn this into go stuff. We create what I call the process, which is the thing that is managing the setup and also the cleanup, and then we have the test case, which is doing the assertions that we want on that particular test scenario. We can then put in some kind of wrapper stuff, so this is actually executable, and there's like an entry point into this kind of test case. And then we're in go, so it probably makes sense to make these interfaces. So this is what a test case is fundamentally. If you can do a setup and you can run, it will be able to be executable in the integration test suite. This is what an integration test looks like in DAPA. It's a single self-contained file, we do some registration on the test, and we'll talk about that in a second, and then we do a setup and then we do a run. You can see here in my setup that I'm creating a process, which is going to do the setup and the cleanup, and then the run bit is where I'm going to do the actual assertions. Talking about the process part, the bit that's responsible for the kind of dependency creation and cleanup. Again, similar story, it's an interface, it does a run, and it does a cleanup. Really simple, and that's the point, it needs to be simple. We'll talk about a bit in a second on why this is a great thing. This is what a process would look like. This is kind of like a no-op kind of example, not super important to read the whole thing. The whole idea is it's, again, a self-contained package. We have the new, which creates the thing with a bunch of options, using functional option style here, which isn't necessarily people's favorite. It made sense in this particular case. The kind of struct versus the kind of functional style is a bit of a hot topic. Yeah, it has a run and then it has a cleanup further down. I know very abstract, but it's clear, it's obviously very important to get your interfaces correct because you're going to live with these forever. Cool. We have a framework run. The thing that I wanted to point out here is we do a process run here, and then you can see that we're using the go test cleanup function, which is amazing because it puts things on a stack. When you create your dependencies, whether these be binaries or whatever else that we're using in our processes, it will clean them up in reverse order. You have that stack, which is the natural order for things to be executed and then cleaned up in. Cool. We have all our test cases defined. They're running various processes. Again, there might be executing binaries, writing to files, things like this. We do our assertions and then we do our cleanups. These will get put into test cases and then we have some kind of sweet runner that executes these tests. That's what it looks like. It's a for loop over a set of tests and it executes them. Simple stuff. The next thing is how does the integration sweet runner know about these tests? What we need is a case registry, which is just a very fancy way of saying that we have a global variable that has a slice of test cases. What is important here that I wanted to point out was that it was a design decision that our test cases, and I mentioned it before, that they should be self-isolated in single files. I think as a developer, when you're reading test cases and things like this and you're having to go backwards and forwards into various places to even follow what the test is doing, is not good practice and it's confusing. Again, you can run into these problems. In order to eliminate that, we went for the style of having an init function, which does the registration to that global variable, and then using the bare import and style to import our init functions up into the top-level registry. Next thing is naming, which is always hard. I think there's a thing where developers generally don't necessarily respect testing code as much as they should. They care a lot about their implementation code and make it look pretty and performant and things like this, but they don't necessarily respect their testing code as much. This leads on to the kind of mess that people don't want to add to it because it's difficult to read. Having respect to your test code is really important. Similarly, naming is generally really important. Go has good standard on how you should name things, i.e. meaning should be derived through context. If you have a HTTP package, don't call your thing HTTP server, call it server. It should be hierarchical. Similarly, derived meaning through context, package path, describe your thing. Less is more. Go is not an IDE language. It's a good language. You don't need to have really long names. Just be very specific. No under scores, things like this. The benefit of then treating our test cases to be this package hierarchy with very meaningful being purposeful names is that we can do some reflect magic that gets us a lot of benefits. So when I showed before that we're doing this kind of sweet test case registration, when we are registering a test or when we're pulling out all the tests, you don't need to read the code. But basically what we're doing is using reflect to name the test its package path plus that struct name. So before our thing was called base, so it pulls out the package path of where that base test file is plus the struct name itself. So in this particular case, this test would be test underscore integration, DAPID foo base. Why is this a cool thing to do? Because that means we can start doing reject searches over our tests. So you can imagine for example if I'm writing a feature for DAPID or trying to fix a bug, if I'm working on maybe the active subsystem or something like this or placement, I can in another time and I'll have my integration test running and I can just do a search, a reject search on all the tests that are in the project for related things. So yeah, being very specific about your naming means that you can search through them and run all the relevant tests. Again being quick, developer focus, good UX. Yeah, that's how you do rejects in Go for loop and then you filter out all the test names that don't match the rejects. Here's another example, I'm working on century related things or MTS related things, I want to run all the century tests, I can just give it a query. The next is processes. So these are the two bits down here, the kind of dependency setup and the cleanup. We've been talking a lot about the different services in DAPID, so these are obviously using the exec, we're exacting processes on the computer, using the exec package. What we've decided to do is follow the kind of UNIX philosophy of running these processes as in do one thing and do one thing really well. So the exec process does really good at exacting a binary on the computer. You can then wrap that process in another more meaningful, again being intentional about naming which has a bit more context about how that binary should be run. So for example, this century process has all the context of knows what the CLI flags and things like this gives it same defaults, exposes the options in a human readable way in order to run that binary. And then as I mentioned before, DAPID has lots of different services, it's a complex software system but following this UNIX philosophy you can do this wrapping in your processes to make more meaningful, higher level naming and interfaces for your developer. So I can talk about a Kubernetes process and it's very easy as a developer in my test suite to say run Kubernetes, whatever that might mean, under the hood that's actually like a mocked Kubernetes API server which is actually a HTTP server, yada yada yada. So yeah, having this kind of wrapped process is kind of an elegant way to handle that. Here's an example of another one, so there's an operator service, we're doing some log line stuff in here, some DAPID stuff, but these are very high order concepts of dependencies that we're creating and these are all wrapped going down. Process binaries, so I mentioned before that we want to go as the sole dependency and go is a good language and it's got a very good build caching system and what that means is that in our testing integration testing itself is we're building the binaries in the test, so one of the first things it's going to do is it's going to build all the binaries that are in the project, that's the code that's doing that. It's then going to write them to a deterministic static file location and what that means is that every time I invoke the test it's going to run that go build, but because of go builds cache magic it's not going to take any time at all, so I can completely retry my go test and it will just be quick. The other nice thing about this is that if I change my implementation code and just write go test in my integration test, it's going to pull all the changes that I've just made to the code right because it is building from source every time. So that's a neat thing with go piping. So software writes things to logs and these can typically be very noisy if you're running lots and lots and lots of tests and this is going to take up a lot of disk space potentially, it's going to write a lot of things to the screen and it makes it impossible to read the test output. If you've got oodles, like a gigabyte of test logs and you're trying to find one test failure and read the logs from what happened, it becomes impossible. So write these things to in-memory buffers and then you can do things like only write the in-memory log buffer to the screen if the test actually fails, which is the only time where you actually care about what the log line is. Then obviously you can do things like because it's in memory, you've got a reference to it, you've got a pointer to it, you can then do some assertions on what was in the log lines and test log lines that way. It's quite good for this, you can create pipes and things like this. All very idiomatic kind of go stuff that you're familiar with. Asserting eventually, so all software is eventually consistent fundamentally like computers that are any as quick as the speed of light that is as fast as they can go, they're not as fast as that. But fundamentally computers to do a thing will take some time. And so we have to wait a period of time to observe some behavior when we put it into a particular state. Just fundamentally we have to do that. However you should never use time.sleep to do this, which I think is very, it's always there and it's very easy to just be like, time.sleep three seconds or something like this, but you should never do it. Time.sleep is the nuclear option. So to kind of illustrate this, if a single test sleeps for five seconds and DAPA CI for example runs four times a day, not counting PRs or anything like this, just standardly runs every four times a day, this equates to two hours of idle CPU time a year. If we then do it more than this, so like DAPA currently has 133 integration tests, if just 10% of those tests sleep for five seconds, then that equates to more than an entire day in a year of idle CPU. Which is crazy, right? This is bad for the polar bears, bad for the environment, it's bad for our developers too, which, yeah. If your test takes ages to run, no one will want to run them and no one wants to add to them. So being very intentional about the speed of your tests is very important. The way to do this would be to do polling basically, so in Go there's the kind of testifier package that is really, really good and highly recommend using it and it has this eventually function. All of the functions in this package are like super sane and highly recommend used to use them. And yeah, computers are faster than you think they are. Stuff does not take as much as you think it does, so like HTTP calls over local hosts take like milliseconds. It doesn't confuse as fast as you think they are. So even I've got here an appalling of like every 100 milliseconds, maybe that is even too slow itself. So yeah, computers are faster than you think they are. Be more aggressive with your kind of assertions and your polling. Clean up. Tests should never leak. Having data leaking from one test case to another will invalidate your assertions just fundamentally. So it's very important that you clean up state in between test case runs. And yeah, and it's also the case that if you're not cleaning up the state in your project in between case runs, then you're going to reduce the resource utilization that each test case can do and it's going to slow down your tests. So I'm thinking, you know, if you've got database tests or something like this, you're writing a bunch of stuff to disk. What if you fill up the disk? You're not running any more tests, right? So clean up is important. To list through some of the things that could be interesting for you to use, use temporary directories, using the test package. That's really good. T.cleanup, we just spoke about that earlier. That's doing the kind of stack thing, so it does things in the kind of reverse order. Use port zero. Ideally your kernel is going to give you a free port if you ask for zero. Use in-memory stuff. Don't use the internet. Don't give stop channels into functions. And use context. Context is one of the best things in Go and always use context. Very quick to talk about operating systems. Operating systems are very weird. Use build tags where you need to do different file types and things like this depending on their operating system. Work through the pain. Use if statements. Yeah, and then finally being productive. So building a culture of integration tests in a distributed team is always a work in progress. To know unnecessarily really likes writing tests, however, if you write a really good test framework, that's going to encourage people to add to them. And if they're quick, they're easy to use, then yeah. A good testing framework should be usable as a development sandbox. So what I mean by that is if you're writing a new feature, your testing framework should be your first port of call to wanting to use that new feature. Tests are great because they're encode, which means they're reproducible, and I can execute them and I can make changes over time. And it's very clear what's going on. Just running binaries on your terminal and things like this are fine, but having it in test code makes the reproducible better. And then the more, again, the more higher order your processes are, the more productive your team will be. So don't describe things like your developer shouldn't be describing things like exec, this binary, things like this. They should always be in a high order kind of thing that they're describing. Again, it decreases the amount of code that you have to write in your test case and makes them more approachable for contributors. And that's me. Thank you, everyone. APPLAUSE Saved some time for you, but I don't know if you want some questions or leave it there. I can fit in one quick question. Otherwise, you can just grab them in the hallway. Ah, no question there. Let me run one second. Keep holding your hand up. So, quickly, why did you make your own sort of test filtering system instead of using Go's test filtering system? And secondly, why didn't you use an event hub instead of polling? Say the first one again, sorry. Why didn't you...
Effortless Bug Hunting with Differential Fuzzing
Our next speaker is Maché and he's going to talk about us, about hunting bugs and how do we hunt bugs? We do that by sending a bunch of random input into our programs or more scientifically called fuzzing. Round of applause. All right, welcome. So in the spirit of testing, let's talk about fuzzing. So I'm Maché, I'm an offensive security engineer, I've introduced the platform engineer and software engineer, I sail, climb and play board games. So what we'll talk about, we'll talk about fuzzing, we'll talk about differential fuzzing, how it differs from fuzzing and we'll talk about bugs that are in the sun in the library and how you can actually find those bugs and fix them using fuzzing. And then at the end we'll talk about fuzzing in continuous integration pipelines. So what we'll not talk about is how fuzzing works under the hood. There are excellent resources out there that we'll talk about like fuzzing engines and like other stuff. I'll link to them in the end, but this talk is not about this. Why should it occur? So there's an OSS Fuzz project, who's familiar with this? Cool. So this is a kind of a platform that gives open source projects computer resources to run Fuzz tests continuously. And there's about 1,000 projects in there and within a six or seven years it has found 10,000 vulnerabilities and 36,000 bugs. And if you do a simple math, that's 10 vulnerabilities per project and 36 bugs per project. So this seems like an F word that's worth investing in. So let's assume we have a simple function, it accepts the string, mutates it and it gives you a transform string back and it transforms letters or characters in the alphabet to a character that is fricking positions later. So you get n for a, o for b and b for c and so on and so forth. So in your regular testing, you'll come up with some inputs, you put those inputs into the function and then you make assertions based if the output is correct. You're all familiar with this probably, you can run this using your standard Go CLI. With fuzzing, the situation changes a little bit. Instead of your device input, your things you came up with, you have a random input, you put it into the function and make some assertions. It looks very similar and is supported in Go from like Go 1.18 and you can also run this using the CLI. You see some boilerplate around the test but you know, in the middle you basically have your unit test that you had before. I intentionally left the assertion blank because how the assertion stuff, if you don't know the input, right? If you run the fast test, you'll see that it tries hundreds of thousands of inputs per second in this instance and it runs indefinitely. So you can run it as long as you want. As you've seen, it's easy to create fast tests if you have unit tests in place. So there is no reason not to do it really. One thing that we haven't talked about is that it's not our magic. You still have to kind of instruct the fuzzing engine to be able to come up with inputs that make sense for your test. So you can actually reuse the inputs you use for unit tests and add them to what's called the corpus and that tells the fuzzing engine to come up with something that's similar but quite random as well. Add the inputs from your unit test. That helps a lot. I've talked about those assertions that might be pretty tricky to come up with them if you don't really know what the input is. So what you commonly see in fast tests is that they don't make any assertions. They just, the engine just checks if the function crashed, which is still very efficient because it tells you that there are some out of ground size axes, for instance. But you should and can assert an invariance of things that don't change and in our instance, for instance, there is a property to the ROT13 function that you can actually call it twice and you get the input back. And this holds true for anything that has an inverse symbol. So if you have an inverse function, you can make a simple search like this, which is called ROT13 and ROT13 and then you expect the input back. If it doesn't agree, it's, you know, the test fails. Some examples that are commonly used are encoders, decoders, marshallers and marshallers. You can just call the things, you know, decode the encoded thing and you should get the input back. There's other stuff, like if you do a SHA sum, for instance, you always expect it to return 32 bytes. But there is other technique. And what if you had two implementations of ROT13, right? Something that you wrote and then, you know, something else. And that's called differential fuzzing. So basically, you get a random input, you put it through two implementations and you see if they disagree. So, you know, think about for a moment and, like, where can we get those second implementations from? The first thing is refactoring. Let's say you have your function but, you know, it's unreadable, maybe it's not performance enough, so you're refactoring the code for whatever reason. You can save your old implementation to the site and use it to basically reference it when you refactor the codes. The second example is performance. You might have, you might maintain two implementations in the first place. For instance, you are following a specification closely and, you know, the first implementation is written very closely spec, but it might be inefficient. But the second one is heavily optimized, but it might be not quite readable. You know, you might have some straight buffers or, you know, whatever. The third option, which is really interesting, is that there is a C library that does a similar thing. And you can use C go to college. And that's what we'll explore further. So back in January last year, I saw an interesting bug report and I can go with a newsletter where there was an issue with the HTML tokenizer, basically the piece of the, or part of the experimental library that does HTML tokenization. And the thing was that it was incorrectly interpreting comments and this led to an excess attack. So what does an HTML tokenizer do? It basically takes a HTML input and it gives you the HTML token. So for this example, for instance, you have a paragraph and a text inside and an anchor afterwards. You'll get start attack of P, text, and then the text inside and tag of P and then start attack of A. This is a very well-defined process and there is an HTML specification for it. It's very high in detail. It's easy to follow. And it's a state machine which will become important later. If you look at the go implementation, though, it's not a state machine. And it's not quite easy to follow, at least for me. So I thought, you know, if there wasn't a report for it, there might be other bugs lurking around. So let's, you know, let's use that function a bit and make another one that gives you a list of tokens because the API works in a stringing way. So we'll just call the tokenizer, collect all tokens, and then return the tokens it generates. So you know, when we, let's say, start with the fuzzing, we will supply some HTML input to the corpus and then call the tokenize function without making any assertions. And there are no results. It doesn't crash. Something will be expected from, you know, from some library or from the experimental part of it. So let's try differential fuzzing, right? We'll have the, our tokenizer function that we wrote and some alternative implementation for it. And if they don't agree, we'll fail. And as you can imagine, because the, you know, C ecosystem is very mature, there probably is a library that does the same thing. So in this case, I found Legsport, which is a web browser engine that, you know, is a software library. It has no extra dependencies. It has a Poshy to Prenel license. It sounds about perfect for what we want to achieve. So don't look at this slide really. It's, you know, it's basically implementing the tokenize function that we implemented using the Nets HTML tokenizer, but using the Legsport. It's actually a lot more complicated than that, but we'll be good for our tests. So we call the tokenize and Legsport tokenize and do some equality checks and if they fail, we fail the test. And it found something. So there is some weird looking HTML codes, looks, month forms, and Legsport says that, you know, it's an ATAC, but Nets HTML library is like, oh, there's nothing in there. So let's transform this a bit and let's see what the browser thinks. So we have these agreements. Could this be a security issue? So what if we made trust decisions based on the tokenizer? And so imagine you have like some, you know, user input on your website, you accept the HTML inputs and, you know, you decide whether the staff people input is safe to display or not. And you should, by the way, you really shouldn't do this, but we'll have a S-save function that will return the Boolean, whether it's safe or not, and we'll just look for the tokens we get and only allow strong tags and nothing else, strong attacks and text tokens. So the S-save method thinks that, you know, the thing that we got from the fuzzing is safe because it thinks there's nothing in it. But the browser says otherwise. When you look at the documentation, though, there will be a security consideration section in the HTML tokenizer and it says, you know, care should be taken, especially with regard to unstressed inputs. If your use case requires a well-formed HTML, the parser should be used rather than the tokenizer. So let's implement this using the parser, right? I want to go into detail, but we use the parser here. That's also in the same library. The thing is the parser also thinks this is safe, and the reason is it uses the tokenizer underneath, so it doesn't really, you know, differentiate between the two. So we still get the XSS. So we have two things. You know, the first thing is that the documentation could be improved because it's unclear. It's tier C in the wrong direction, and second, that there is a bug in the tokenizer. So I thought, right, if there was a vulnerability report in the VRP program for the common thing, I'll do the same thing. So I submitted a VRP report. There was some back and forth. They closed my ticket. I told them to reopen it. They reopened it. And the result of that was that there was a documentation update, which is cool. And they say that in security context, if trust decisions are being made, the input must be recerealized, for instance, using render or token string. So what they are saying is that instead of doing, you know, a safe function that returns a boolean, you should actually transform the input and construct it in a way that, you know, basically sanitize this, transform the string. And there are two ways to do this. One is to use the token.stream function, which, you know, when you loop over the tokens, you can reconstruct the input or render when you use the parser. A few months pass, and there is a comment to the library. And they fix the actual bug. So, you know, handle equal signs before attributes, and they quote the spec and fix the debug that was there. So now if you call the is safe function, it returns false. That's pretty cool. But let's run the fuzzer again. I mean, you know, you get something that is very similar, and it acts the same way. So I thought, all right, I have this fuzzer. It's not pretty. You know, it has no way to reach the standard test suite. But we can, you know, learn the code base and iterate over it. So run the, you know, fix the problem, run the fastest again, and then, you know. So I prepared the patch, and you've seen I get it screened today already. It has the code review, but as Jonathan mentioned, you need a lot of patience. It's been stuck in, like, ready to submit for like three months, I think. So it still hasn't reached master, but it's close, I think. But when you run the fastest again, there are no more findings. So the takeaway from this is that fuzzing is very effective, and differential fuzzing helps write correct for code. So let's talk about what are good testing candidates. We've used it on parsers, which are pretty complex codes. You can use them to get the coders and coders, you know, marshallers, and any complex code that, you know, can be unit tested, basically. But running those tests in CI is kind of traumatic, at least in my experience, because it's not really mature enough yet, I think. And when you run the go-pest fuzzing vocation, it can only run a single fuzz test. So people have been doing a lot of hacks, like, grabbing this fuzz code, trying to find those fuzz targets, you know, sleeping, like, some pretty hacky buskers, for instance. There is also a very cool project called Cluster Fuzz Lite. It's actually a subset of OSS Fuzz that you can run in your CI. But we found some problems with it. First, it has problems with extracting and failing inputs. Like, if you have a byte array, for instance, it doesn't really translate one-to-one to what the actual input is, because you have to apply some of your own transformations over it, and it's being convenient to run locally. So we built Go-CIFuzz. And it's kind of a lightweight wrapper around go-test fuzz. And it separates multiple test targets, and it allows you to extract inputs. So if you want to give it a try, there is a link here. And, yeah, good to go. And it's basically plug-and-play, drag-and-drop. You can use it to run fastest as part of your pull request workflow, or you run it on schedule, so, like, you know, during the night, or whatever, whenever you want to run this. All right. So we've placed for it. But, yeah, if you want to, you know, say hello, there is my email address and my handle. And also, I wrote a blog post about this, but it goes more in detail about this actual finding. And there are some references. You have the, if you want to start fuzzing, there is a very excellent introduction to it in the Go documentation. There's also Goode's article on Wikipedia on how it works under the hood. And there's a link to clusterfuzzlight, the Go-CIFuzz, the blog post, and also a pretty interesting entry in this list is the second one. So there was a recent paper from Google where they use AI to actually generate the fastest. So maybe you don't really need to write them, and AI will be able to do it for you. All right. So if there are any questions, happy to answer. All right. Any questions? We still have some time. And the front. That's nice. Okay. How many minutes do you run the fuzzer in the CI because this is important, right? Because it costs money. That's true. Yeah. So, you know, it depends on the workflow. So for instance, when it's a pull request, you really don't want people waiting. We run this for like five to six minutes. It's enough time in our experience to catch like those bugs that are, you know, the edge cases that are quite common. But you can run this indefinitely during the night, and it depends on how much money you want to spend for your CI runs. Yeah. All right. Any other questions? Questions. Can you keep your hands up and I can go to the right row if you could pass us along. Have you tried to fuzz only inserting random strings or like also a combination of valid tokens in different order? Could you please bring? From what I got from the slide, if I'm not wrong, you were like inputting the data. You were like putting random strings, right? Okay. So how it works really is that you provided a starting corpus. So like your, think about your unit test inputs and then the fuzzing engine underneath takes those inputs and puts transformations on them. So every time you'll get a slightly different input. It won't be completely different, but it will be a bit more formed. So like if you saw these, the findings for instance here, right? It outputs all, well it outputs a valid HTML or almost valid HTML. So it kind of reached this conclusion based on some coverage data it found. So like it also looks at test covers. So when it runs the fastest, it kind of captures which branches of code have been covered and tries to reach the other that have been not covered. So it's kind of an interactive process where it applies transformations to the inputs. Right. There's another one. How does the engine know which part of the corpus it might change and which not so it doesn't only input like random strings as I could obtain from the random package? Could you repeat the beginning or the question? Yeah, sure. The fuzzing engine, you give it a set of example strings. How does it know which part of that it may change and so that it doesn't just put in random things? Okay. So I don't know the exact details, but I think it works that it makes a change and it looks at the coverage data. So it looks at the branches, it kind of discovers when it made the change and it will note some interesting inputs and then try those inputs. So like if the coverage increases, it will try to make more transformations similar to the one that it makes. Yeah, one more. What kind of coverage metric is it? The question is what kind of coverage metric it is. I think it's, I'm not so sure, but I think it's branch coverage based. If you run the fastest with some variable flags, you will see that there are coverage bits and I think it tells you how much coverage there is for a particular input. All right. There's one more. One second. I can probably just speak up. So the question is, there is a go cache or when you run fastest, there is a cache folder that will capture the inputs already run and the question is whether the tool will or can support this. And the question is, the answer is it doesn't right now, but it's planned. So for those that are unaware, when you run a run fast test, there is a directory that will capture all the input it has tried or the interesting ones. And when you run this again, it will start from the point, which is really handy because you will not do the same work every time or a similar work. You can start from where you left. Yeah. Thank you. Yeah, there is one more. The question is slightly tangential to this directly, but you said we provide a starting corpus and then there's transformations on that, which is run against whatever we're testing. So is there a way to optimize the starting corpus to increase the kind of test cases that are actually generated by the FuzzError? Is there a way where the starting corpus can be designed to cover as many edge cases as possible? Okay. So there are similar perspectives to this. There are corpus that you can find online in GitHub, for instance, that you can employ in your FuzzTests. Also when there's a finding, for instance, when you run the FuzzTest and you find a string, it will add it to the corpus that you have in your repo. So when you run this, there will be a directory created in your repository that's called test data. And inside that test data folder, this will be captured. And you should actually commit that folder to your repo so that every next time you run the FuzzTest, it will actually check for regressions. So yeah, I hope this answers your question. Any more? Thank you. Are there ways to customize the kind of transformations that are applied by the FuzzError? Not in the Go Native FuzzTests. So there are other tools that have been used before, and Go introduced native fuzzing. There is libfuzzer, for instance, that's very commonly used by the OSS Fuzz. And I believe if you use that, you can customize it. But the way native Go tests work is that they actually use libfuzzer, but it's not very configurable. So it's supposed to be good developer experience-wise and cover most of the needs that you need, but I don't think you can drive the transformations from it. I'm going to end the questions here.
Maintaining Go as a day job - a year later
Next speaker, I'm very happy to have him because he called him only last week and asked, could you speak at the Go Devron please? Because we had a cancelled talk, apparently immigration out to the UK is a problem post Brexit. And he said yes. And why did I call him? Because last year he also had to say no to me. And he promised me to give a talk. And he's going to give the same talk he promised to me last year, which is basically I could develop a page at Google and here's why. A round of applause for Filippo. Thank you. The good news is that this is a much better talk than I would have given a year ago. Because a year ago I had no idea if this was going to work. I have good news. Anyway, first a bit about me. I'm Filippo Valzorda. I've been the maintainer of the Go Cryptography Standard Library since 2018. And that's a job I've shared over the years with Katie Hawkman, Roland Schumacher, Damien Neal, Nicola Moreno and many others, to be clear. I was doing that at Google until 2022. So I was on the Go team. The title I had was the leader of the Go security team. And that's the team that did things, including the fuzzing support that we were talking about earlier. Until 2022. Because in 2022 I left pretty much to prove a point about proving that something was possible. And the something was, actually, let me tell you a story first. Stop me if you heard it. You're maintaining an open source project and that open source project has two kinds of maintainers. Some are volunteers. So you don't really feel like you can ask any hard commitment of them. You can tell them, well, Alice, you really told me that you were going to fix CI. Why isn't it fixed? Well, Alice will respond because you don't pay me. And what do you say to that? Nothing. Yes, exactly. Thank you for your help. And the alternative is you are employed full time at a company. And that company is a reality of capitalist society, which I say with no value judgment there. And that company has an interest in making money. And has an interest in not spending money that will not make money because that's what a company is and does. And that company will maybe start a project because it has a bunch of value for recruitment and for internal infrastructure. And it will fund some maintainers because it goes like, yeah, yeah, this is a good project and we should make it into existence. And that's great. That's the company contributing something to the world and we should appreciate that. Then the project grows. Now does the value of that project to that company grow? Not really. So the number of users double and then double and then double and then double. And they start filling rooms and conferences and that's great. And do you think the number of people working on that project fully paid by a company grow? They do not. And some people get angry at the company and they think that's misplaced because why should a company say, oh yes, this is more useful to the world so we should sink a lot more money into it. I mean, hey, if it was my money, sure. But you're a manager at a company, it's not your money, it's the shareholders' money, it's not really even a thing you can do. So now you have a problem because success of the project puts the project in a difficult position. And I said, stop me if you know this one because this is not around about go. This is not about Google. This is not about any specific case. If I had seen this only once, I wouldn't have said, well, this kind of needs a fix. This is something I've heard over and over and over again, all over the industry, all over the past 10 years, all over different companies. Because this is fundamentally an issue in alignment of incentives. So the problem is that critical open source projects that we don't have a sustainable way to maintain them. And I'm not talking about contributing back, I'm not talking about integrating a new feature or anything like that. I'm talking about the grant work of going through the issue tracker every morning and say no, no, no, no, no. Which I guess is tinged by what I do, which is cryptography, which most of the times I want to tell you know about. And if I said no to anybody in this room on the issue tracker, I'm sorry, you can find me later. But the point is that that work is really hard to fund both in the volunteerism model and in the full-time employment model. So what did I do? Well, I had this hypothesis that there was a way to fund open source maintenance like a profession. And the model goes a bit like this, going to companies and offering them retainers. Making the core focus maintenance. So not going to companies and say you'll get 100 hours of support this year. That doesn't scale. I'm not talking about, hey, do you really want that feature? I'll implement it for XK. That's actually kind of, I'm super uncomfortable with that because that means that to make money you need to add features. If you add features, you increase maintenance burden. If you increase maintenance burden, you need more money. So what do you do? You add more features. And yeah, that's a bad spiral. You don't want to be in that spiral. And there's also applying for grants, which is a totally valid way to go about it, but then you become an expert in applying for grants. And hey, I made a lot of questionable life choices. I just didn't make that one question of life choice. So that is the part where people tell me, yeah, but companies really don't want to pay for this. So really no way to go about it. That was the part that I was convinced was not true. I was convinced that companies can get something out of this because some companies even have an internal go team whose only job is being experts in go. Now that costs so much money. A fully loaded software engineer is like a giant expense. So my theory was that I could go to these companies and say, hey, do you want a resident go expert that you can ask advice to and who can liaison your issues with the team and can help you guide through the contribution process and so on. And to be clear, this is not offering governance. This is not offering support hours. This is just, would you like expertise that you don't currently have access to? And a lot of people told me, no, that's not going to work. I was kind of convinced it would work. So I quit and tried. And here I want to make a small parenthesis. This is an incredibly privileged thing to do. I was able to do this because I had the money from having worked at Google because I live in a country with a sane healthcare system. And yes, and because of my personal position and because I knew that I could actually call up a number of people and say, hey, would you give me a half an hour of your time? I want to pitch you something. And we'll get to that. Anyway, this is almost exactly a year ago, which I did not realize until I made the slides. I'm pretty happy about that. So exactly a year ago, I announced that this is what I was going to do. At this time, I had one, I think one client. So it'll take until you make it. If you want to read more about the model, and this was the 10-minute version, but if you want to know more, that's an article you can find, search engines sort of still work. So I'm sure you can find it. Now a year has passed. So this is an update talk. A lot of people ask me, OK, so how is it going? And to be fair, I haven't published much because there's a lot of work involved in doing this thing. Now, first of the good news. It's working. I have a healthy portfolio of clients. So far, only one client has turned, and not because they were unhappy about the service, but for their side reasons. And I am fairly happy with this. And this is funding me at the same level as Google was funding me, which I'm not saying as a humble brat. I'm saying it because I think it's important that we offer an alternative for maintainers that is competitive with the jobs they could get on the market. Because another thing that doesn't work is saying, well, you're an open source maintainer, so you consume, what, three boxes of ramen a day? I guess you should go to the cinema a couple of times a month. So you're going to be fine with like 10K a year, right? 15 maybe, and you can keep maintaining this? Great, yeah, sounds good. And then something happens in their life, and they go like, well, maybe I would like to make more money than that. And let's see how much money they offer me down the street to a person who can manage a large technical project with a number of stakeholders and who can coordinate contributors and who then qualifies as a senior software engineer. The answer is a lot more money. And so they stop maintaining the open source project. And so again, we don't have a sustainable model. So I think it's important to stress that what I'm going for is something that's competitive with senior software engineer salaries. And that's a part of that. Everybody usually would just kick me out of the room when I would get to because they would say, there's no way to make it a match. So it turns out, yeah. Now, lessons learned. What got me here? So a lot of it is sales. This is the bad news part of the talk. None of us wants to be doing sales. We want to be doing software maintenance. But what I usually tell maintainers on this is, so do you know dentists? Do you have a dentist friend, a dentist relative? Do you think they ever stopped for a moment and went like, you know, I really like teeth, but the whole invoicing people running a clinic, hiring somebody, paying taxes, marketing my services, nah, that's not my core skill. You know what? I'll just do it for free. Know anybody? Okay. And anybody who went like, yeah, so you know, really I'm probably not cut for being a dentist. I guess I'll just change careers. Probably somebody, but most of the dentists we do go to just do the parts of the job that are not core skills. We'll also get to the advantages dentists have over professional maintainers, but we'll get to that in a second. But a core point here is that there's a lot of sales. So it's enterprise sales. There are books about it. We usually like learning stuff, right, when it's technical. First off, this stuff is also extremely learnable. And what worked for me is funding champions inside companies and going to them and saying, hey, so this is the thing I'm doing. And those champions are usually engineering and they'll tell me, oh, that's great. Like I want that to work partially because I could see myself maybe at some point doing that. But to be clear, they don't let me spend money around here. And then you tell me, yeah, I know, but you know, the people who do and you can introduce me. And so if you can do that, I would appreciate it. And then when those people go to you, you go back to engineering like, hey, how's everything? And they ask, oh, so how's the conversation with John going? You know, he seems busy, but if you could just think it in turn, that would be great. And then you do this over and over and over and over and over again. And is it great? No. Does it work? Yeah. Yeah. Like every startup in the world just goes like this. The closings, however, the closings that did work did not take forever, did not take a year. They either closed in a month or two, or I am still trying to close them. Right now. I am thinking about one person in particular who is a friend and who I am still chasing. He got, anyway. The point is, the ones where it will happen soon or otherwise. What I am offering companies has refined a bit. Most companies want the whole thing. I started with tiers and you can get the advisory part and joining, I will join your Slack and nobody bought the lower ones. Which hey, okay, great, more money. At the same time, I feel like I am leaving something on the table by not knowing how to sell the lower tiers, but whatever. Fine. Maybe the answer is that the thing that does sell is that advisor part I was telling you about earlier. When you compare it to, look, you could hire an engineer and have them become an expert in this and have them get involved with the project so that you can go and ask them for support when you need to, or I cost a fraction of that because I have multiple clients. Also, there is not as much risk that I will just move on, you know, quit and you will have to retrain from scratch. I can't pre-train in the box. So that's the part that really sells. And finally, the ongoing, I was a little worried that am I over-comitting, I started slow because I didn't want to sell time I didn't have. Turns out my main problem is that some clients don't use enough of my time. So I'm worried that in a year they will be like, what do you do for us again? Which is not a great conversation to have. So I try to keep them engaged and remind them that hey, you can ask me questions. When I remind them, they come back with questions and then I answer them and they are happy and that's great but sometimes you have to remind them. On that something useful I do is that every time there is a release I will send a PDF being like, hey, this release happened, you should probably patch, here is my opinion on what you should patch. A lot of this is sending PDFs. If you are hoping for the new platform where you register and you get microtransactions directly to your wallet or something, no, I'm here to suggest getting very good at Microsoft Word. It doesn't have to be Microsoft Word. But yeah. And finally it's a multi-stakeholder job. What does that mean? Sometimes your client will be like, oh, it would be very nice if there was support for Foo and Cooks and Blacks and all of that in the standard library and I'm like, no, absolutely not. And that's a little throat because, and for that it's important to spread out enough that a single client is not so critical that your palms are sweating when you tell them no on something. And it's important to communicate with the rest of the team so that they know your concept of interest and I try to recuse myself from all of the proposals, for example, that I help the client's brain. So if a client really wants to propose something, I'll tell them, look, what I can do for you is help you present it in a way that makes your case. I'll help you craft the proposal. I'll tell you what you should fix. And then I'll just step back and let the proposal committee decide. That works. Anyway, a few words on where this is going next because this is what would work so far. But really I'm not in this to fund myself because if I wanted to do that I could have stayed at Google. They have great insurance plans. It's a job. It's fine. I could have stayed. So I left because I wanted to make this into a model that can be replicated. So there's two ways really to grow it. One is vertically. I can hire more people and then get them funded and that's a few more people that do this. And that's kind of unsatisfactory. Unsatisfactory, sure, might make me more money but it can't grow that much as just the Lippos thing. Which, by the way, this is happening. I've hired the maintainer of SFTP Go to maintain XCrypt SSH which didn't have a maintainer for years. And that's only possible because clients pay enough and they're interested in the SSH package and so I can go back to clients and say, hey, look at that package that you had the fork. You don't need a fork anymore. That's because you were paying me money. All right. Don't cancel, please. But the thing I'm really interested in is growing this horizontally. So getting other people to start it. And I'm not talking about starting a platform. No, no, no, I'm talking about getting other people to say, all right, maybe I can do that and try their own version of that. And here's where we get to the disclaimer part. This worked because I have a network. This worked because I had already a public profile. I want to fix that slowly and over time. What I want to do is to speak about this to enough people and to do this with enough companies that companies get a little more comfortable with it so that the sale is not anymore. So here's a whole new thing. Let me convince you it's not silly. And then let me convince you to pay me for it. You can skip the first part and get to a company who's like, oh, yeah, right, we do that with a couple of people. Maybe, yeah, that your project is also useful. Another disclaimer. This is for critical open source projects. I am targeting a very specific kind of open source project. So if this sounds like it wouldn't work for your project, there are two options. It either you might be surprised that it's actually easier than it looks, or you might be correct because it's not the right shape of project. The right shape of project is something that's critical to at least a few companies. And that doesn't mean something gigantic like go. Critical just means that how I pitch it is how long would it take to replace this if you had to? And then you let them think about it for a moment and they will go like, two, three engineer years and that's a lot of money. And then you go like, great, so you would like this to stay maintained. Excellent. Then that's the kind of project that I think you can pitch this model for. And so that's the lowering the bar to access. I like to think that I started because I have more of a public profile and more willingness to wear a button down shirt and talk fast, which is very useful for sales. But I would like this to become something that more and more people can do as it expands. So maybe the next cohort of people who do this already have a bit of network, but will have something to start from, et cetera, et cetera. Then I want to build together the tools to make this easier because I am sort of yoloing stuff like contracts and what works and what doesn't. I'm hoping to build training and contract closers that are easy to pass to Liga and say, yeah, so no, you can't put a complete IP assignment close in that contract because I don't work only for you, so no, I can't give you the rights to everything I write. We will have to have a conversation about that. So stuff like that. And maybe one day making a professional association around it so that we can even tackle things like how do you get healthcare in the U.S., which some freelancer unions give you access to. Anyway, this is how it started, how it's been going for the past year, and where I'm trying to take it from now. The goal is to give the option to maintainers to follow it as a profession, just like dentists starting at junior and ramping their way up using resources for learning and support from technical tools. Because that's the main thing that I think makes it easier for dentists to run clinics than open source maintainers to run businesses. It's that you can buy so much support as a dentist. You can buy books and trainings and CRM software and software to run the clinic that prints the invoice already with the little teeth drawn on them. And all of that is actually extremely useful. We are starting from scratch here because for some beautiful reason, which is why we're all here, really, open source didn't start as a profession. It started as hacking. It started as something we wanted to give to the world. But it's now critical infrastructure and we need some people to make a living and a profession out of it. To be clear, this is not the only way to do things. I'm not here to tell you every single one of you has to get a button down and a 24-hour briefcase and start going out there and closing some deals. No. But I want that to be an option for the projects that need it and for the people that want it. Questions please. Thank you. Hi, Filippo. I know you've been sharing this path for at least half a year or more. So how many people have followed your advice? Do you know about and how many of them you would say succeeded, which share of the followers have succeeded? So I've been actually extremely skittish to ask people to start trying this until very recently, partially because until I knew that it worked for me, I was very uncomfortable with telling anybody go for it because... So really, I've started trying to form a community around this in the last two, three months. And there are maybe half a dozen people who are aiming for this or angling for this. Some of them already were going for something similar. I had at least one conversation where I was like, ah, that, yes, ah-ha. We kept saying things and just matching notes and it was great. I think it's too soon to say how many of those are successful. I'm barely comfortable saying that I'm being successful because it's been a year and after a year, I'm a little more... People are not canceling after the third year. Great. But it's a little too soon to say how many are successful, I think. If anybody wants to try, now is the time to email me because I want to set up a bit of community. I don't know if it's going to be, I don't know, matrix rooms, a Slack, mailing with something to compare notes, share tools. So now's the time. Hi, thanks for the talk. How do you see this interaction? I imagine it being a little bit of a potential problem with open source projects that are grown from or that already have companies crystallized around them. So there's a bit of a tension there with this kind of freelance model that you're pitching. So governance in open source is complicated and it's a people's job and I think every maintainer is actually very expert in negotiation already because some of it is figuring out diverging incentives. I think this would probably be in contrast with open source projects that have a company built around offering support contracts. But that company probably already scales with the success of the project so it's not probably suffering the same way I was describing. If the users double, the potential support contracts double. Support contracts have their own problems which is why I'm not going for that but maybe I can start by cannibalizing a project that has already a scalable funding model. Projects that instead have mostly companies that are footing the bill but are not reselling support hours, they're not built around, we are the open source project X and you can pay us for the open source project X. I don't think this is actually in contrast. Case in point, I don't think anybody at Google resents me for doing this. Actually they are happy that now there's two maintainers on top of the headcount they have maintaining go. So far there's been no issue with that kind of company involvement. Thank you for your inspiring talk. Do you have any suggestions or advice for people who already went to the dark side and actually don't have means to get back to the Raman level of money for a couple of years to just get back to being a critical maintainer to start pitching projects? I am sorry I did not catch that part, there was some noise. Yeah, so if you are already at the dark side and you already cannot get back to the open source Raman level of income to get back into the critical maintainer path because you are not going to like it, your bank is your house and you are not going to like it. So what do we do now? We are already in a company somewhere, we are not doing the maintenance because we had to get out of this pass a long time ago and we would like to get it back now. Do we have any way to pitch it really? I was saying that okay you know what, ten years ago I was maintaining this thing and maybe you could hire me again though ten years I was not doing anything, it doesn't sound like something convincing. Okay so the question was if I understand it correctly how do you get back into the maintenance track if you are doing other ways of doing software development as a professional? Yes? Okay, so I think that gets to the general question of what is the ramp up to what I am describing because I just skipped, right? I went like well great, five thousand people of my newsletter, here is a new thing and great, that's not an on-ramp. So the on-ramps will be part of what we need to figure out as this matures because I think they can look like getting involved with a maintainer who is already doing this. For example if I am already doing this with Go and somebody is skilling Go and something that my clients specifically need I can hire them, they can start doing that through my contracts so they don't need their own cash and then they can use that to maybe spin off into their own thing which is really how firms work, how lawyers get started, how security consultancies got started in the 90s and early olds. If you look at that history you will see that they started as individual realities pretty much like mine based on the popularity of the individual then they grew and then they spun off pieces and then it became an industry and now you can just be a junior security analyst, right? I am hoping one on-ramp will be just starting as a junior maintainer on a team and then spinning off and that I think is one on-ramp we should definitely figure out different ones and ways to pre-invest so that you don't have to jump and say I will not make no money for six months and then hopefully I will start making a little money. We need to figure out ways to get investment up front for some projects and I think the more companies get comfortable with this and with the fact that they want this because they do. The real thing that surprised me is that companies are often like yeah, yeah, okay, yeah, great. We are uncomfortable about our open source supply chain so as companies get there hopefully we will get to a place where you can get a pre-commitment letter which is often how products on-ramp and then once you have three letters you go like okay I am ready to make the jump. If anyone needs a freelancer go maintainer for the deaf rooms I am available call me. Thank you, Roy Lebrouz.
How we almost secured our projects by writing more tests
The careful eye might have noticed something in my schedule. I put a lot of similar subjects together and because Philip was actually replaced by the speaker, this would have been three hours filled with only tests. Glad we saw where I say it from that. But let's continue into this test thingy because tests are important and many people love them and many people hate them. So Alessio has got to take us away with security by testing. All right, applause. Hello, everybody. Welcome to my talk. I give you a little introduction about myself. So who am I? My name is Alessio Griggi. I'm a software engineer at Armo Security, the company behind QtScape. My full-time job actually is to be a cat food opener for my furry friend. But jokes apart, I'm passionate about reading and taking long walks. You can find me on GitHub and Twitter with this account and the following avatar. But let's start the talk. So I will give you some introduction, some easy concept that can help you to understand better the world talk. So first question is what is the code coverage? So code coverage is a metric that we can use. It's a percentage actually, as a metric, that we can use to understand how many of our source code is covered by tests. Really or better, mostly, it is used to write when we write a unit test, but not only for this kind of test. Let's go a bit more in depth. So code coverage related to Golang. So first time it was introduced in Go version 1.2. It was more or less 10 years ago. I guess it was April 2013, if I remember well. With support for the unit test in this specific article. But the story continued after more or less 10 years. So one year ago the community introduced in Go version 1.20 a new kind of support for tests. This time it was support for the integration test. So what happened since last year that we basically sensitively increased the percentage of the coverage in our project. Of course if we were already doing integration tests. And yeah, basically in these 10 years a lot of things changed. They also implemented some nice tool in order to check the coverage rendering the profiles with an HTML page that you can check on your browser. It's really nice also to use, really helpful. But let's see another concept that is important for this talk. What is a second profile? So first of all, second is a kernel feature. And it helps you to block certain syscalls during the execution of certain program. You can define second profile as a kind of rule. So you can list all the syscalls that you want to execute or you want to block during the execution of your program. And what else? It is extensively used in the Kubernetes ecosystem. Also in Docker you can attach this security profile when you run a specific pod or container. And the container will use this second profile in order to check if all the syscalls are enabled to run. And another important thing is that in Kubernetes if you enable the second default profile feature flag you can basically use by default the default profile that is a list of deprecated, really dangerous, let's say, syscalls that you should not use during your execution. So by default you can use this profile and be quite safe more or less. But it may be better if you create your own second profile for the project that you are implementing. So the main idea that I had was to generate a security profile during the test pipeline since it is probably the best environment when we, of course, if we write a lot of tests that can help you to run all the syscalls that are included in your project. So the test environment is probably the best candidate to use in order to extract all the syscalls that are going to be executed in your project. So the idea was to generate the second profile and in case you have your project that is based on Kubernetes, you are developing something related to Kubernetes, the way was to create an init container that can inject the second profile into the node and use the security context with the second profile localhost in order to attach this security profile that you just injected into the node. And that's one example. So you have the init container that's downloaded the second profile. In this case it was just a test but you can think to provide it as an artifact on GitHub or whatever you want. And the application container can use the second profile type localhost by referring to the second profile. Okay. This was the first part of the talk. But now let's see how I try to achieve this goal. I mean, how I try to extract the syscalls from the test. So in this case we are talking about integration test and unit test. In this case you can see a kind of execution path of your project. So if you run the project you are going to have this kind of tree. So with the code coverage you can understand which part of this tree it has been executed. So you can refer as a metric about your second profile in order to understand which part is missing and how much it could be readable since it's a metric that gives you a percentage. So first thing, extracting the syscalls from the integration test. So let's say it was the easiest part. So with the integration test you can build a binary, provide some script that basically checks for expected results. And when you run the binary that you built you can use one of the tracing tool, for example strace or perfer or whatever you prefer, in order to extract the syscalls during the execution of the binary during the test. So this was the first part but let's see the other one about extracting this information from the unit test. So first of all it was a bit more complicated and I'm going to explain why. So the reason is that GoTester actually compile and run the test all at once. So you cannot do strace GoTester because otherwise you are going to catch all the syscalls that are not related to the function that you want to trace because think that we are speaking about unit test. So we are testing only specific unit, only specific functions and you want to extract the syscalls that are executed during this function. So you cannot do strace GoTester first of all and even if we build the binary, the test binary for the test we cannot neither do strace dot slash test binary because the test binary could include some noise that could be related to for example let's suppose that you have some data file that you want to run against your function and you open this file and you take this data and you put this data inside your function. So when you do this open file you are going to catch with strace also this open. So it's not really suitable. So my personal solution, let's see another step, so more or less the solution could be split all the steps. First of all we can compile the binary without running it with the GoTester. So you can do gotest dash c followed by the package that you want to build and consequently you can from this binary extract the function name just by using obj dump followed by dash dash since so you can extract the entire symbol of the function that you want to trace. So at this point let's see my personal solution. I don't know if it's the better one but it's a solution. So this project is called ARPUN. You can find this project on my github and it makes use of an eBPF. I want to clarify that I'm not an eBPF expert but understanding the technology I try to use this technology to solve this issue. So the main idea was to define a trace point with eBPF that started its execution so it's tracing about the function. When a U-probe that was previously attached to the function basically emits an event. So the U-probe informs you that the function started the execution and another probe, the U-ret probe emits another event when the function finished the execution. Another important thing to know is that this project actually is a POC, it's not a production in a great project. It's based on Go.BPF that is a part of the Iovizer BCC project. So that's the main, how does it work actually. So you can put U-probe and the U-ret probe inside your health binary at the point of the function symbol. So in this case we have main dot do something that is our example function. And the U-probe and the U-ret probe will inform you when the function starts the execution and when it finished the execution. So in the meantime the trace point knows when to trace the function. And the trace point is going to trace the function with the C-center event. So it's going to trace all the C-scales that are executed during this time. So that's an example. In the right side there's a function that's some easy things. And in the left side you have the result. So you have the right, the open-et and the other C-scales and in the end you can see also the read. Okay so all these things are really nice. I was really happy to have achieved this result. But at some point I also realized that these things were not really working. I mean not every time. And I discovered after a while why this was not working. But first let's understand how the U-ret probe works. So because we have a problem with the U-ret probe in this case. So a U-ret probe basically overrides the return address of the probed function with an address to a trampoline. The trampoline basically jumps into another kind of function that in this case is our EBPF program. But since the GoStack dynamically changes during the time due to the GetBatch collector, when the trampoline function tries to return on the stack it is not able to do this. At least not all the time. Because the stack changed and the previous address is not more useful. So possible solution, likely for us the U-ret probes can be attached to a specific offset in the health binary. So we can basically simulate a U-ret probe that informs us when the function is finished by adding a list of U-probes on the ret instruction of the function. So if the function returns three times we should place a U-probe on these three ret instructions. So we can basically simulate the U-ret probe instead of using the U-ret probe. So future improvements, so when I realized that this solution could work I tried to check on the IOWI's or Go BPF library but it was impossible to attach the U-probes at certain offset. So it was my fault actually because this library is deprecated. So future improvements are to move to another library before. So we can use for example a BPF from Cilium or the one from Aqua Security and so on. So in this case we will be able to put the U-probes to specific offset and so put them into the ret instruction of the function. So here are some references that I found on internet that helped me to understand better what was the problem, how to solve this issue. Also some special thanks to some people that really helped me during this experiment. So thank you for your attention. Well I have your attention or sleeping depending. I have two announcements. One is read the wide board, not repeating this again, lightning talks, we still have available slots. And the second one is this room is not possible without volunteers. This is a 110% volunteer conference. I get no money, I even have to pay for my own dinner tonight. Oh no that's sponsored now, thank you. But I want to make a special shout out to my dear co-organizer Eva, a proud of her past. Eva is a student in computer science, more specifically in application development. If you have internship positions at your company, you can hire her for free.
Dependency injection: a different way to structure a project
I'm going to talk about using Go. What is important when you use Go is dependency management. You cannot write a program these days without depending on something. Dylan is a co-worker of mine. We work on Cillium together. He's going to talk about anything to do with dependency management. So run of applause. Hey everyone. Thanks for coming. So dependency injection. Before we start, a little introduction. Already got one technically. My name is Dylan Reimering. I work at Isovalent on the foundations and the loader team. So we're responsible for basically doing dependent or a lot of changes that I'm going to talk about within the Cillium project. You can find my get up there. In case you find anything interesting. I don't know. You never see. You never know. So before we dive into the dependency injection, why, how it works, what it is for those who don't know, a little journey about why I'm here, why I'm talking about this and how I got here. So what is Cillium? Cillium is a CNI. So it's long, long talk short. We use EBPF to do networking in EBPF. We secure it and we make sure that you can see what's going on. And that actually involves a lot of components. So this is our nice visual about a lot of the different features and we actually have way more that wouldn't even fit on the slide. You can imagine that with a lot of components that we get quite a large application. I checked and we are currently the third most active project on the CNCF. We have, I think, so again last time I checked this is like a month ago. We have 650,000 lines of code that are not the vendor directory. So we have a big code base, a lot of things that happen, which also means that we have a lot of dependencies. So to illustrate that, I picked one of the features that I personally worked a lot on, which is called the Alto announcer. And it's a little feature in Cillium that basically makes sure that certain IP addresses are reachable in the local network via ARP. So both gratuitous ARP and responding. So we have like the big Alto announcer block there, which contains most of the business logic, but all of the other things are dependencies. So all the way to top, we have, in the white still, are our external dependencies. So we have to create ports. We set up, we get environments, configuration, standard outputs, et cetera. Those are connected to our infrastructure layer. So our infrastructure layer does all of the things that are really common in the application, logging, metrics, configuration, da, da, da, da. And then we get to the orange layer, which is our control plane. And there are abstract business logic happens. So this business logic gets go objects, and it also writes go objects. It's all pure go world, and it doesn't have to care mostly about all the things. And then we go down to our data path, where the translations happens from this perfect abstract world into the real world, which in turn often means, for our case, that we talk to the kernel via net link, ebpf, maps, raw sockets, et cetera. So we have to, but for this one, for my big component to be able to work, I basically need all of this to exist, at least in production. So I went back to 111, which is before we started working on dependency injection in Cilium, and looked at what does initialization look like at that point. So we have our main program. We could call into Cobra. This is common, hopefully. We go into our run function. It starts up three components. It initializes the environment, where we already have 50 components. Then we call something called run daemon, which has 50 components spread both before and after the new daemon. And then in our new daemon constructor, we actually create at least 150 components. I started counting, or stopped counting, sorry. So we have a lot of components, but they all have to somehow wire into each other. And at some point, the development team decided is that we are going for sort of hub-nispoke model because we had so many components. We had this big daemon, which was our hub, and it had pointers to almost all components. And then it's easy. You only have to give the daemon to everything, and then via the daemon, you can find every other component. So it was, but that becomes a real mess because when is this pointer nil, when is it not, et cetera. So I started looking into this new daemon function, like what is this about. And then you'll see a pattern. You don't have to read everything. So we initialize this before we're creating this. We must close this before we open that. This must be done before we start identity allocation. IP cache must be done after the initialization below. This must be read. You said after this happened. So we discussed some for a while. So at some point, so at this slide, I'm at sort of the first snippets that 350 about. And then I basically, I stopped. So I just scrolled down at that point. My point was made. In the last reference I found something like before, do this before, do this after was at 718. But what is perhaps interesting to note is that this top snippet is basically a sort of defer. So it talks about cleanup instead of initialization, which is also a really big thing that we have. So to summarize the problem that we were facing at this point in development. So we have a lot of dependencies, but this is just inherent to the product that we're making. Just nothing to do about that. What we can do something about and what is a lot of the source of the pain are these implicit dependencies. So we have dependencies on global variables, these very big objects or system states, which require us to use comments to tell our other developers which how our dependencies work. So our dependencies are all implicit in this state, which makes things really hard to modify. Like when I started and I created a component, it broke CI, it broke everything. I couldn't figure out why. And it turned out that I had to move it up a few hundred lines in the initialization or down in some cases to make sure that everything that I needed or implicitly dependent on was there. So it's really hard and it really destroys confidence. It's hard to shut this application down at least correctly. You can kill the application, sure, but then open files are not saved. And if you are running end-to-end tests or anything like it, then you need to make sure that all your resources are cleaned up. So the next time you start, you are not blocking other things. So this was really hard and it made it really hard to test because if I wanted to test my L2 announcer, I had to recreate all of this additional infrastructure a lot of the time, even if I had interfaces because some dependencies were still problematic or whatever. So for us, we started looking into solutions and this led us to dependency injection for a few reasons. So before I go deeper, for the ones for people that don't know, dependency injection is basically a way to instead of explicitly initializing your project, so basically having a very big main file, you define your components and you explicitly define what their dependencies are. And then you can have some component, in this case I call it a graph builder but it's basically the name of your framework that you use to actually initialize that and you hand off the job of correctly initializing your application, you hand it off to some piece of software. We know software never has problems or bugs. But in all honesty, so this is actually quite popular pattern in other languages like Java, C sharp, PHP, but we don't see it that often in Go projects. So the only thing that is required for this to work is that you always, or at least work correctly, is that you specify your dependencies explicitly, so as arguments to a constructor function. So what I would like to introduce to you is the Uber FX library, so it's made and maintained by Uber. Originally developed by Glipp, who is now actually a colleague of mine, which is why how we got into this library. It's really well battle tested and I'm going to show you how it works and what this looks like. But what's important to know is that it is an, as is the Penesy injection library. The Penesy injection libraries might not all work for your use case, it didn't for us. So we actually, if you were to look at Cilium today, we actually use our own custom flavored framework, build on dig, which is basically the underlying library under FX. But FX is if you go ahead and first try something, then FX is your starting point. And this actually solved most of, like it was made to solve a lot of the problems we had, not only for this initialization issue, but also because we have a lot of binaries in a big mono repo, so it also allows for really good reuse, which is, as far as I understand it, where Uber first started. So to explain this, I first created a very, very small application. Normally you wouldn't use dependency injection on such a small application. So we just have a simple web server. And this is, I why, for example, might have, might write this without dependency injection. So if a main, we construct everything, link everything together, call server.serve, and we're done. So this is nice and short. So when we do dependency injection, we have to be a bit more formal. So I defined a new listener, a new logger, and a new server. My listener and logger at this moment don't have any dependencies. I could give them configuration or something else, but that wouldn't fit on the slides. And the server takes both of these and constructs itself. So we defined everything, what everything needs, and then on the top left in our main, we say we create a new effects application, and we provide the listener and the logger, and we invoke the server, because if you recall, the server was, the serve function was the thing that we were interested in, that we called. In practice, what this does is the invokes are basically your entry points. So and the library will look for all dependencies of that, of that entry point. So you could, for example, create a very big graph and have multiple entry points or remove entry points depending on, call different entry points depending on, for example, commands in your, in your binary. And then it will only construct and start your dependencies that you need. So it also does a little bit of that code elimination implicitly. And then you call the run, which actually wouldn't do anything in this example. So I'm sorry, because the serve is not called. So this would start and it would construct everything, but nothing extra would actually happen. For that, FX has something called life cycles, which are really useful. So we, the last slide talked about the construct time. So when we construct our graph and then when we run it, the life cycle gets invoked. So what we can do is we give this, we say, okay, the server is now dependent on a life cycle. And within the constructor, we, we tell the life cycle, okay, I can, I have some, something while, while I'm alive, I want to do something. So I have an on stop and an on, on, on start and on stop hook. And when I start, I want to start a go routine and serve out whatever I do. And when I stop, I want to shut down, which is something that my initial program didn't even do, do a proper shutdown of the, of the HTTP server. So when, when it's, it's a little bit hard to like show that in the original example. So I threw together a very small sample that still fit on the slide, which is important here. So I have ABC and they basically all depend on each other. So it's a very deep dependency chain. And then I have this print function, which you can decipher later, but it basically, I call it in every constructor. It's both prints at that time and it prints in the life cycle hooks. So you can see what happens. And when I would were to run this program, the output would be something like this. So it says, A is constructed, B is constructed, C is constructed, because that's the order in which the, so, so we have all the dependencies there when we are, but it's just some construction. Then the start hooks are called in the exact same order as we constructed them. So if you have dependencies, for example, A opened the file and we need that file to be open because B will start calling things in this life cycle. And we know that the, that the start hook of A is always called before any of its dependencies get time to run. And then when we stop the application, we control C or something else happens, we shut down. But the nice thing is, is we automatically shut down in the exact opposite order, just like you would add the first somewhere, but it's at the application level. And this allows you to do proper shutdown, write your files away, do everything else. And you also know that you, because you depend on everything else, that you get the first chance to shut down properly and no one will call into you after that, because, in their shutdown functions, because they don't have references to you. They are not your, you depend on everything else. There's also a nice feature called groups. There are actually quite a bit of features. I couldn't touch on everything because of time constraints, but this one is nice for, for a small section of problems. And it's called a group. What you can do is, so I actually use two features. I use the effects in and effects out feature. And it basically allows you to, to return multiple dependencies from a constructor or take multiple dependencies in a nice way. So I can, for example, have a parameter structure that takes in 20 different dependencies and don't have to spell them all out separately in my arguments. And I can also return multiple things. Crucially, in my case, I can specify group names to basically route outputs from one, from an output, from, from one place to another. And in this case, I created a mox. And this mox collects all of the mox handle, mox handle objects that are there. And I have a foo and a bar and they both admit their own thing. And they are collected by, they are collected by this, by this mox which we could, could give to, to a server. And the cool thing about this is that you, you have this once. And you can then add a lot of additional, you can add a lot of additional parts to the, to your whole application. And it all collects as an array into this group. There's some caveats. I'll come to that in a bit. So under the hood, how this works, very simplified, is we have our definitions. At least effects and dig use reflection to then look at the parameters and then based on the types, it creates a directional acyclic graph. And that graph can then be walked to get the, to get that correct ordering. So there is a small bit of magic there and it's called reflection, but it's not much. Like it's quite understandable if you actually go dive into, into how something like this works. And then again, the constructors to start and stop are called in that, in that determined order by the deck. It also means that you can't have cyclical dependencies. That's, that's a no, no. So it's a good reason to remove those from your code as well. So I would like to share with you in case you want to try this, try dependency injection. Some tips, tricks and lessons we learned because there are, there's a good way to do this and there's definitely also bad ways to do this. So inject, but in moderation. So not everything has to be a component. For example, math libraries are stateless. There's no reason why you would make that a component as like a dependency in this system because you can just, you can just use them and they are pure functions, etc. So my rule of thumb is if it has states, make it, make it a dependency because then you benefit from all of the state specific things. But if you have libraries that don't use state, please don't make it harder than it has to be. And also a note of inject, but in moderation is that we saw that doing dependency injection adds a lot of boilerplate, which is worth it in very, very big applications or even moderate applications, I would say. But it's likely not for your small CLI tool or whatever. So pick, this is really a technique for medium to larger projects. When you do this, pick logical boundaries. So we, for example, we started and then made 20 cells within the same package and then no one outside the package actually ended up using those cells, which is massive amounts of complexity and overhead is just not necessary. In my experience using packages as logical boundaries for these components is the best thing to do because you can also leverage what types I export, which type, because you can export, you can provide something and not export that type, for example, and then only export an interface that matches it or whatever. So that's a really powerful combination. So and the last thing to note is that one of the other features that I wasn't able to show you because of time constraints is FX options. So FX options is really cool because it allows you to basically take multiple of these components and bundle them under a single variable. So while global variables are big no-no's when doing this, you can still use them or you can use a variable, global variable on your package to export these constructors. And the nice thing there is you can make a sort of hierarchy. So if you have a package hierarchy that's three layers deep, you can basically reflect that. So in your main application you don't have to list 200 constructors all separately. So that also really helps with readability, seeing where what is provided and so on. Provide targeted interfaces. So go idioms still apply. The smaller your interface is, the more powerful it is, the better you can swap it out. So when I provide a very small interface or when I depend on the smallest interfaces I can, and it's really easy for me to mock out in my test, create a new FX app, only provide the direct dependencies which are interfaces which I can then mock out and it makes everything really nice. So this is general advice, not for dependency injection but like it goes hand in hand. If you have dependency injection and don't do this then it takes away a lot of the benefits you would otherwise get. So it also makes it easy to rely on external, for external components to not rely on internal implementation. So when I export something or when I provide a component I always try to provide it as an interface as well. And the last thing which is more of a trick is you can actually, if you for example have a struct, that struct can implement multiple, so instead of having one interface that implements three methods I can provide it as three separate interfaces that implement three separate methods. And that way you can, you have both on the receiving and the sending side of your dependency, you have the smallest possible interface again to help with mocking out but also so if you don't use certain methods that you don't have to like write fake methods that panic if you were to call them etc. I mentioned groups and they are really powerful but go easy on them. Groups are really only ever useful if you have multiple parties that are interested in the same list of objects. So for example we have metrics, so we have a Permetheus metrics registry which collects all of the metrics to actually use them. But we also have tooling that automatically generates documentation about these metrics and I can write a very small CLI tool that basically just with one component that depends on all hooks or all metrics that we have defined in our application and I collect all of them automatically and everyone who uses who registers a new metric it automatically appears in this metrics tool. So it's really great and the same goes for our configuration HTTP elements which will also have configuration for or sometimes CLI tools which live want to interact with the same things. The alternative to using groups is to just use like a registry pattern where you say I provide a registry, it just has a registry pattern and everyone else, so if I have 20 other components I can depend on that and I can register myself during construction time. And the upside of doing that is that you can, like if you use FreeScope for example or any decent editor is that you can follow those traces back. So you can always use references to see who actually uses what. With groups it's all magic. Something everything goes into this group and it comes out but it's not clear. You can't trace that back in the code itself without having difficulties. Stay with a static graph when possible. So you can, with this FX application you can in theory like depending on configuration provide or not provide components. We have opted in Cilium to never do this because it makes it completely impossible to verify that you never have missing dependencies or other problems like circular, the references and there are certain combinations. The graphs are verified at runtime so you have to have a good CI to run everything, make sure that it works. What you can do instead is use this life cycle and so you always have the objects but then you can always choose if they do or do not subscribe to the life cycle and that way you can enable or disable certain logic if you don't want to run it at that time but always provide it. And that was it. Thank you very much. Thank you. I have time for one question. I see a hand there. I'll quickly come over and hand to the microphone. If you are exiting already do it quietly please. What can you make choose, dig and FX instead of Google OIR which is more popular for example? So like I mentioned, colleague of mine, Glyb, authored it so it was very, we were very quick to jump on that one he suggested using the library. So it's purely advertisements. Thank you. Any questions?
Putting an end to Makefiles in go projects with GoReleaser
Thank you, everyone. I see the defactmentation rules are now learned and people are following it. If you are not familiar with the defactmentation rules, go as much to decide as possible. That way we can fit more people who are entering through there. We are amazing door people. Our next speaker is Dennis and he is presenting a more controversial title than was on the initial proposal. In this case I would not have accepted the kidding. I heard the second time today that make fires are horrible. Again that is a personal opinion but maybe they are. Because this is the second time somebody mentioned this and this is the first time somebody proposed a solution for it. I have to be fair, I already use GoReleaser. Anyone else? Okay. It is no introduction anymore. Everybody knows you already. They go away. Thank you. Hi, everyone. My name is Denis Germain. I am a French music streaming service. I am also a tech blogger and if I am telling you this, all the slides, links, related stuff can be found on the blog. We are on the blog right now. You can find exactly everything. As you have guessed by now, by my beautiful accent, I am French but hopefully everyone will understand what I am going to say. So yeah, the title is a bit controversial. Okay. Putting an end to make files in Go projects with GoReleaser. Why is this talk? 15 years ago I started my career as a systems engineer and I haven't wrote a lot of code during that time. I wrote a bit of Python script and there are a lot of infrastructure as code but not compiled code. Last year I fell in love with Go. Many of you perhaps as well. But if you know site reliability engineer, you know that we are the laziest persons on earth. Like literally. We don't ever want to do anything boring. We don't want to do anything manually. We want to automate everything to keep drinking coffee instead of working. And I found out that compiled programs are boring because you have to compile them. Which is not a problem I had when I was writing Python. And I also started to contribute and to maintain open source projects, little things here and there. And I found that there were also other tasks really boring that I didn't want to do by hand. Among those tasks is cross compiling. You just don't have to compile your binary into one OS and one architecture. With Go you can do it quite easily for many architectures and many OSes but you have to do it. Building Docker images, obviously you probably want your users to be able to run this in a container. Making a bunch of packages like Debian packages, flat packs and so on. And if you create all those artifacts you will probably also want to check some and sign them so that your user can make sure that the binaries have not been altered during the downloading process. And last but not least you will probably release versions of your software if it's complex enough. And making those releases on your favorite SEM like GitLab or GitHub can be tedious as well. So, first solution. And I have to say that all those solutions are valid and work. When I was in engineering school I learned about make and make file. And it worked really great. I was using it for C programs and when you have to compile something you just create a simple make file and then you run your task and it's awesome. If you work in an industrialized environment you probably heard of Jenkins. And if you love Groovy I'm not judging. But you can automate all those tasks in various Jenkins files. If you run on some let's say modern SEMs like GitHub or GitLab you will probably use the GitLab CI or the GitHub Action to help you with this. And the last one is a bit of a joke and you probably want to understand it if you don't know. But you can do bash scripts. But what I'm going to show you today and I hope I'm going to convince you is that you can do all this and much more with a software that is called Gorillazer. Gorillazer is an open source software. It's an open source open core. Most of the features are open source but if you want more features you have to pay the maintainer for them. Today we are going to use the open source version. And it's an awesome software written in Go for Go. Okay. Now it's time to the demo. Okay. Some of you got that joke. Okay. So for this demo I have written a very super small program in GoLonger. It's a bit of a joke as well. I don't know if that's the case in Belgium or in other countries but hairdressers tend to put funny names or play on words in the name of the shop. So to give you an example that anyone will be able to understand, one may call it shop hair force one but with an H instead of the hair instead of the plane. Okay. And in France nearly all hairdressers do that. And so someone made a database of all those funny names in shops, put it in a JSON database, hosted it on the government data.gov.fr and I used it, created a program that picks randomly an entry in this database and output it. So that's what the program is doing basically. It's really silly and I like silly things. Okay. So no, that's not what I want to show you. Yeah. Is it big enough for those really? No, not big enough. Okay. Let's increase. It's going to be hard to see but let's try it like that. So what we are doing in this make file to build this program to save me some time because my time is precious. I start by doing a prepare which is going to do the go mod tidy and then I'm going to build which is calling the prepare obviously and then calls my command to build my software. I give it some environment variables. So I give it some environment variables to build it for my laptop and some LD flags because I write the version as an art coded variable. Okay. Let's try to run it. 0.0.1. Make, build. Okay. So you can see I'm really lazy because it's not that difficult but bear with me. Okay, diminutive or some. Okay. Great. Let's declare. Okay. Real funny. I'm going to do a little bit of a demo. I'm pretty sure the French speakers are enjoying it's much more than the other ones. Okay. And if I want to build a Docker image for this, for this binary, I have also made a task in my make file. The Docker file looks like this. So I use a multi-stage build. So in the first stage, in the builder stage, I'm going to download all the necessary dependencies to build my software and then I copy it in a new container containing only the binary. So the attack surface is way smaller and it's better for maintenance and size. I'm pretty sure it won't work because I have some networking issue but we'll try. Nevertheless. Okay. I forgot to add the version. I'm pretty sure you have always seen how Docker is built and it's going to fail. So I'm not going to continue here and skip to the rest of the demo. Maybe I have one here. So in there. Okay. So it's going to be a little bit more complicated. So I'm going to do a bit of, it works just like the other one. The version is displayed and then a random entry. Okay. Cool. How can we do better with Gorillazer? Let's say you don't have any idea how Gorillazer works. So there is a command for you which is called Gorillazer init. One which is going to create a default create a default Correlizer file, a default configuration file. I'm going to remove most of the stuff. Okay. Up, and then we will talk about it. Great. Okay, so what does this do? As you can see, there are before hooks that we can call before doing the actual building. So here we do the go-mode ID, just like I was doing before with my prepare in my make. And then the builds come. So I can also add environments, variables. I can tell Correlizer which architectures and OSs I want to build on. So let's say I want to build on ARM64 and AMD64 and on Linux and Mac OS as well. Okay. And I'm going to cheat a little bit to gain some precious seconds here. I'm going to have the famous LD flags that I was hard before to hard code my version variable. And as you can see, I'm not going to pass it as an un-variable as I did before, but I'm going to use an automated variable coming from Correlizer. Because what I haven't told you is that Correlizer, as this name says, is going to help us making releases. And obviously when you release software, you are going to add a new version for it. So we are going to leverage the power of Git tags. We are going to tag our code and the tag is going to give us our version. So, okay. So to make an example, I'm going to add commit. Then I'm going to tag my code with a new version being 0.0.1. I think that's correct. Okay. And then, oh no. Okay. And then I'm going to try to build my software with Correlizer. So Correlizer, there is a sub command called build, which is going only to make the build. We'll show you the rest later. And if I have, okay. So what is Correlizer doing? As you can see, it's really quick because my program is really small. But if you have some more complex program, if you try to cross compile, it's going to take obviously more time. So Correlizer detected our version with the Git tag. It ran the, the gomotid with the before hooks. And then it cross compiled our binary in on four different targets just by adding a few lines in my YAML. Okay. And just to show you that I haven't cheated, we are going to try to run it once again because I never tire of hairdresser jokes. Okay. Imaginaire artistif. Real funny. Okay. Okay. So cool. So now we have something that helps us build software, but can we do more? Obviously, yes. Or else I wouldn't have been selected to talk, to talk to you today. So I'm, I have some time left, but I'm not going to be able to show you step by step. So I'm going to cheat and take the final configuration file with all I want to show you. Just bear with me one second. I'm just going to modify one or two things. And then I'm going to say what I'm doing. Okay. So, so here I'm going to tag and push my code with one little script. Okay. So what did I change? So we were there. So we had my binary and the version. Okay. I added one section, which is called archives. Archives is going to make archives for us. And we are going to add overrides because it's possible to override the most, most functions in Gorillazer for specific arch. Obviously in Windows, you don't usually, don't always have something to open our GZ files. So for Windows, I'm going to override and put a zip instead. If you are working in regulated environments, sorry, you probably will want to have a software bill of material, S-BOM. So Gorillazer is going to be able to create SIFT S-BOMs for you. You are going to want to check some everything. And then you are going to want to be able to create Docker images. Here I use buildx to cross compile my image. So here I'm compiling it for Linux, AMD64. I'm going to push it to GitLab registry. Oh, I forgot to show you here. I'm using another variable, which is a project name, which is all automatically created by Gorillazer and it's going to help us to reuse our Gorillazer files between projects. I can use the version here as well. Okay, and last but not least, I'm going to run an announce on Mastodon to tell the world about my awesome new version of my software. Okay. And I'm not going to do it on my laptop. In fact, it's already run on GitLab because I have a GitLab CI running configured that I haven't showed you. So here the GitLab CI YAML file just runs a Docker image of Gorillazer with the correct environment, variable and secrets. And does the build for me. Okay. Okay, okay, okay. Oh, and I also modified the Docker file because I don't need the builder step, the multistake build because that's Gorillazer that does it for me as well. Okay, so now is the time. I'm really afraid that the demo failed. So I'm going to check if my job launched correctly and if it worked. Oh, it's still running. Okay, let's try to see. Let's try to see if it's running correctly. Okay. It looks promising. So the GitLab CI has detected that I pushed a new tag, the 0.0.2. It launched an image, Gorillazer. It has detected that my tag is 0.0.2 has run the Gomad tidy, has built my software in many OSes and many architectures because I have not explicitly told it which one I wanted. So by default, it does it on like eight targets. We can also see that it has created archives and Sbombs for each. It has created two Docker images. Someone has seen my student post and pushed them to the GitLab registry. And then it has uploaded all the artifacts, sorry, to the GitLab. Okay. And last but not least, we are going to check on Mastodon if something happened. But as I can see, I already have some messages telling me that it's okay. So if I click on the link, I'm redirected to my release page which has been created for me, containing the various assets, the change log with all the commits on this release. And if I try to take a look, I'm not going to open the binaries because obviously it works. If I take a look at the package registry, no, the container registry, sorry. Okay. We can see that we have the Docker images and they have been created two minutes ago so I haven't cheated. Okay, great. So the demo works. Okay. So I was a bit stressed so I spoke really fast. But what I wanted to show you today and I hope you are convinced is that GoReleaser is an awesome tool. You can do cross-compilation, you can do Docker images, you can do signature and checksoming GitLab releases, you can post on social media or other communication channels. There are many more options that I haven't even all explored. Obviously you could have done that with a make file or just in file or so on but you have to create most of the steps by yourself. And I think there's nothing wrong in leveraging the work of someone else to do it for you. That's it for me. You can leave me some comments and tell me if I was good or if you will learn something using the QR code. And thanks everyone..(audience applauding and cheering loudly.) If you have any questions, I'm not sure I will be able to answer all. I'll come hand you the microphone real quickly. No difficult questions please. Thank you. Thank you for the talk and for the tool. My question is, is there a possibility to remove in Go project from the title? I mean, is it possible to make the tool not, so Go oriented let's say? Because I see there is a potential for it to be like a structure for something bigger or generic let's say. Yes, okay. That's the word I want to say. Okay, I get it. But good stuff anyway. So you, it's really Go oriented. You can't cheat because there is an option to change the binary used to compile it but it's really hacking the tool. In fact, I use it as well for, with in conjunction of another tool which is called FINE, FINE which is a tool that helps create UIs with Go, FINE, F, Y, N, E. And FINE has its own cross builder, cross platform builder. And I try to make them work together. And it was a bit like, okay, I add some scripts and put some glue to make it work. So it's going to work much better in a fully Go environment. Any more questions? Right in front. Thanks, I think you forgot to mention the limitations of Go release. As I remember, it doesn't support in open source version that multiple projects built in one repo. So I'm sorry, I didn't hear? So basically Go release doesn't support in open source version, release of multiple binaries from one repo. Okay, I don't know the answer, but okay. Thanks for the... We can't know everything. Is that a wave for... I need a mic. I can ask my question this way. Okay, but I will have to repeat it. Excellent. So, great work on Go release there. My question is, what about using it for things that are Go based, that have a more complex compilation tool in things like C.Ga? I know that Go STL is using Go release now. Did you comment on this? Is this an idea? No, I'm not sure. Can you repeat the question? Okay. I can repeat if you want. Yeah, okay. Can you use Go release for very much complex Go project, Sdl and C.Ga? Well, I have not worked on a very complex Go project yet. But it would... Obviously there are limitations and it's going to sometimes be hard to glue things together. But for small to medium project, I think it's an awesome tool. Thank you. Another round of applause.
REST in Peace: using generics to remove REST boilerplate
Well, this is a depressing title. RIP. Rest in peace, and I hope rest means, like, restful. And not like this is the end of go, because I kind of like go. Anyway, this is going to be a very interesting talk. Rodolfo Plaza, thank you. Oh, actually I don't have my notes, so I'm going to swing it because I had to clone. Anyway, hello everybody, thanks for coming. So I'm going to present a project of mine that I created like maybe years back, a few years back now. It's called Rest in Peace, and it's about to make rest in peace, oh ho. So in 2021, Jan Lans Taylor and Robert Grissmer created a talk about how to use generics once they create implemented, actually implemented, the generics in go. And they were like, basically, I don't know if I have audio. We'll see. No. Anyway, so basically what Jan Lans Taylor is final words were, please use generics wisely. So of course, when a figure of authority asks you to use something wisely, what do you do? The total opposite. And people from CrowdStrike security company decided, I don't know if you can read in the back, but basically they were like, created a channel on the discord. It was a creative usage of generics contest submission, submit your worst implementation of generics in go. Basically, everything that Jan's told us not to do. So my, I did a thing I'm pretty proud, my, let's say to make the world the worst place, was an, I think, await in go because who needs go routines and channels anyway. Some people even did some try catch. If you're missing those good things from other languages. I got a plush because, you know, when you do something, the world like gives you something back called karma. And just for records, I listed everything that was attempted. We had like monads and stuff like that. Anyway, but out of all of this, I created something that thought was actually useful, use of generics, maybe not the one that was like supposed to be because the current implementation is not optimized for this use case. But I thought it was a nice anyway. So about me, I'm Tanguy, I'm from France. I worked 17 years in IT and I'm also CEO of HTMX. Okay, one person knows about HTMX. Anyway, so as you will see from this video, oh, we have sound now. Anyway, I'm ready to make anything for money. So I'm a freelancer. Specialized in go since 2015, worked in a normal consulting before that. I worked mostly on classic restful APIs, backends, and I've done some blockchains. And I stopped freelancing for a year, about a year to work a dagger. You should check them out. I think what they're doing is pretty cool. It's CI, CDS code. And I'm also very interested in pushing goes in more, I didn't say that, areas than just like microservices and web backends, so GUI, game engines and stuff like that. And the next talk will be about GUI. So I'll advise you to check it out. So now, can anyone recognize this? Basically, it's, yeah, thank you. Basically, it's the HTTP handler code that anybody does. You might even have like a validation step if you're fancy. And, but we do all of this code in the end, we decode the JSON and we encode the JSON for the response. But all of this is just for this line right here. For just this line, we do all of these gymnastics. So that's a lot of code. And let's say we had another handler to deal with another type, another type in our API. And now we do basically copy paste all of that previous code and change the code here, here and there. Myzermanos, whatever. Anyway, and so again, I see like a lot and a lot and a lot of duplication. And for me, like duplication, it's just, it's just something we should try to avoid as much as possible. There are some of the few rules about the rules of two or three, which I think is good. But when you create like a big API, you have more than two or three copies of that code. So you can have a solution and abstract the handler and create like a very unsafe type to basically get whatever you want in it. It can make it work, but then you call your back end, but you have to deal a lot in the back end. And then the back end will deal with a lot of type casting everywhere and it can fail many times. So you need to do a lot of hair handling. And here I put like basically what we should do to convert from one type to another and make sure it works. And it's all of this. All of this is, I don't know if I have the thing. Yeah. So basically all of this is for two types, for a structure with two fields, A and B. And for all of that, we have all of this boilerplate that basically take the dynamic code, transform it into a type safe that you can actually use in your back end. That's again a lot of code. And then the real back end can be easy once you have the right types, but we had to do all the way to here to be able to just call a simple back end. And by the way, if your back end is just that and you can make money out of it, go for it. So as I said, a lot of runtime reflect boilerplate to get back to types and potential reuse of the handler. So we had a lot of durable potential. Not so sure about it. So if finally we had a solution, thanks to Go 118, we have the generics. And so that's when this ID popped out. So the pro about generics, we have better type safety. We have better performance than empty interface. And as a wise person said, empty interface means nothing. And for this user, sorry, mis-type, for this use case, for my use case, we don't have better performance. There's an article from Vincent Marty that talk about this in depth. But somebody told me maybe it has been improved since. So maybe it's the procated. I don't know. And in general, it's more readable code for the users. And it allow more like don't repeat yourself all over your code base. So for example, without generics, we have this. So I just want to check the minimum between X and Y. And can anyone tell me what it prints? OK, not very interactive. OK. Well, actually it doesn't print. It doesn't even compile. Because mat.min accepts only float 64. So you just have to do this. I hate you, but I hate it. So, I don't know. I do it disgruntly, whatever. Not my native language, sorry. And with generics, we have this function, which is way better. It doesn't look like it, but it's way better. So the library code is not that great to read. That's for sure. But you can get used to it. And so you compare the previous one and this. Like, yeah, it's not one reads better. But the user code. On the user code, it's really way better. You don't have to cast those everywhere. So it makes for a better code base. So what about rest in peace? OK, I'm checking. OK. So rest in peace. The idea is to basically use generics to avoid all the... this HTTP boilerplate that I presented. For example, here we have some user code. We just wrap strings.toUpper in a function which is like an input, output. I don't even remember the name. But basically it takes a context, it takes a type, returns another type, and an error. And as long as your function respect this interface, you're good. And you can just wrap it, send it to the rip.handle, and you indicate the method, then you indicate you call this function. And then you have like a route options. And then you can just call curl on uppercase, and it will just put your input into uppercase. Magically you don't have to handle any HTTP about that. Library code, less readable, I will admit. So we have the type input, output, funk, which is like the function that needs to be respected to be used, then in the handle, which takes the input, takes an output, and put the method for this route, and then you can just pass it like that. But this was fun to do. It was my first experiment, but I wanted really to go a little bit further, because I do a lot of rest back-ends, so there are a lot of routes to deal with for resources. We need to create, delete, update them, et cetera. So I wanted to automate that as well. So rest in peace. So the key concept of rest services is the notion of resource. It's accessible via an URI, an action on the resource URI via HTTP methods. I mean, this is one implementation about rest. Normally it doesn't have to be HTTP, but anyway. And the current state is sent back through the same system, which in this case would be HTTP. So on the user code, I create like a user provider, like an entity provider. I pass it to here. I decided this is the path I want, and here I just take the default route options. And thanks to that, we'll see what it will give us. But so this user provider needs to implement this interface. Okay? So create, get, update, delete, list all. I will update that because list all is a little bit too much. I need to handle pagination and stuff like that. It's not there yet. But once you implement that on your code, you can just use your code. You don't have to deal with any HTTP whatsoever. And you pass it to this function. Then you have a whole slash user with all the bells and whistles, which gives you all of this. You can create the entity. You can get the entity. You can update the entity, delete, list. And I recently added fields. So now, because you can use patch, basically, to update just part of your entity, but the protocol is not defined. So you have to define your own way of doing patch. Is it like text-tips or something like that? And it's a little bit quirky. So I decided I found somebody talking about a pattern, which I liked, which is basically you take the whole path to the field, and then you can just put and get from it. And so that's how you update part of your resource. And so you have your entity and the entity provider thanks to a type inference that improves. You don't even have to pass that. You can just pass directly. You don't have to put the square brackets and put those types. So you pass the URL, the entity provider, and the route options. And here you go. You're good to go. What you get, creation of CRUD-HDP endpoints, content negotiation for many encodings. Right now we have JSON, XML, Protobuf, MessagePack, HTML, HTML forms. We have automated resource web pages that can edit the resource. Right now it's very nice UI. You see this is not my major. And a harmonious way of handling common scenarios, and because I've worked on many projects, and maybe because duplication, you do it once, and then you forgot you need to update all those boilerplates you've done, so then you forget. And then the behavior between all your handpoints are not really coherent. So that makes a good thing, I think. For example, this is just an implementation of adding a new encoding to this platform. That's the whole code to add the JSON encoding on REST in piece. So you just, I have like a facility like the RAP codec. I use the JSON new encoder from the standard live. JSON new decoder, then I define the MIME types. That's it. You're good to go. Most of the implementation are like that. So RIP is to HTTP. What an ORM is to SQL, me. But I know that how many of you hate ORM just to see if, okay, you might hate me as well. Anyway. But seriously, I hope it will help you create services more easily because I have a pain of repeating all this code all the time. So here's the QR code. Like, subscribe, click the back icon, something like that. Anyway. And here's a demo. Last time I did live code, it was awful, so now I have a video. Amazing. So I just run the server. And so all the logs that will print in yellow are from the servers. There is one that is from the logging handler that is the logging middleware. And the other one is from the backend code that we just logged for ourselves. So here I just get the list of users. There is only one called Jean. So here we see the backend. Whoops. No. No. Sorry. Anyway. We'll check on the next one. So now we're going to create a new one named Cam. Are you stopping, please? Thank you. I'm sorry. All right. Okay. Mm-mm. Did I check this video? Maybe not. Okay. So maybe it will take a little longer. I'm sorry about that. Yeah, that's karma. Are you serious? Okay. Okay. So, oh, yeah, but also there was no output. So we just saved this new user. So here is the log from the backend. And this is the middleware, the logging middleware, which is just a apache log style. Then I go, then I released again to just confirm that we have a new user in our list. Then we have, we just get one user and we decide to display as XML because why not live in the past? Because on each endpoint, you can have multiple encodings based on whatever I do content negotiation. So if the browser or whatever client asked for, and I have it, I will give it to you. That's your problem. If you want to XML anyway. So here we're going to just modify the first users and call him Philip instead of Jean. So here he is. Check is still Philip. Yeah, good. All right. Now I just want the field. So I just want the name of this entity. And so it just returns it as a JSON string in green. Now I'm going to, I don't know what I'm doing. Oh yeah, the email address got thrown in a trash because I did a full put on the entity and didn't specify the email address. So I'm going to just modify this field to modify the address. So now I do a put just on the field, on the email address field. And now I'm going to check again that it's correct. And here now we have a correct email address and correct name. Then I delete and then I will just check that we rightfully so deleted and there is only one user left in the thing. So yeah, so that's what you get. Sorry, it switched this as well. Okay. So that's what you get with just like this one line of handle entities and the whole backend implementation, of course. But yeah, I think it's pretty cool. On route option, it's something that I added since I did this talk in Golab. So now each route can have their own set of handle encodings and middlewares. So that's pretty nice because before it was like a global state. Not good. Okay. And for this to implement that, for example, we need the entity, which is just like the user struct. And we implement those two method, ID string, which return the ID as a string because our ID is an int. And the other way around, we need to convert from an int to a string. So if you have a better design, come talk to me because I'm not very satisfied with this, but it is what it is and it works pretty well. Plimitation is quite simple. And then on the entity provider, so on this example, it's just like a map, memory map, and we just, I just present you the updates. So this is just the backend you have to write. I put blogging because why not? I get my user from the memory, and then I just update it and that's it. So that's the code you have to create. So basically I tried, like with the memory map, it's like 100 lines of code for this, for all those method and implementation. I did in SQL, it was 110, something like that. So then you really reduce to just that. So for the future, oh, I have time, yeah, maybe I will have time for another thing, but I'm going to just finish that. So for the future, I would like to do nested resources, but I've heard like even Django REST API doesn't do nested resources, so maybe not. I want to add pagination, I want to add open API auto generation, so then you could generate whatever client for your system directly. I would love even more 8 OIS, I don't know how to pronounce that, but to have links and stuff, so the API is self-discoverable, I guess, even more than open API. And I would like to overly improve the API. So since last time, I did like the route options, the fields, I added protobuf during my way back from Italy, and I would love to use a log and slog, better handling and customization of HTML templates because you will see. And I would love also to generate GUI apps for that directly so you don't have to also bother that, of course. Simple GUI apps. So let me check if I can do this. Yes, yes, no, yes, yes. Okay, so yeah, this is my beautiful HTML GUI skills. So we have the user Jean. For example, we decide to, I don't know, Jean Marc. All right. And we can add a new person, see like a very well designed from the 90s, as vintage as me. Let's add Marc here. All right. Okay, we get back there. Then we have our full list and we can just delete. All this is thanks to HTMLX, which is you should check it out. It's pretty cool. Anyway, so this is the thing. I wish we could update those through by what you want, actually. And then the last demo I have is I play with my daughters a game called GoCraft, which is a implementation of, simple implementation of Minecraft in Go. And to bother them, I was like, how about I use my thing and see how it's usable and just be able to create blocks in the middle of their construction to annoy them. Or I can just delete and yeah. So for this, I'm just going to show the code at four last minute. I created a block type, which the ID will be the coordinates, XYZ. So my ID is, maybe, yeah, I guess I can still see. Yeah. Okay. So I split by X. So the ID is basically, if I show this, it's like that. I'm sorry. So I just say like the coordinates in XYZ as the ID and then I just have to marshal and then marshal this. And then in the game, I didn't implement get, I just implemented create. I get the coordinates. I create in the right format for the game. And then I update the block, dirty block. So this is really code just about the game. I'm not doing any HTTP in there. And here the delete, the same. It's just like code about this specific go game. That's it. And it works. And yeah. So if you were excited to use it, talk about it or something, go talk to me. I have a bow tie. You should recognize me. And so I would love to talk about it. If you have like a design ideas and stuff, I'm really up to it because I think we could improve it. And discuss about it, contribute anything. All right. And I want to thank the go team for the generics. Without that, it couldn't have been possible. The Ghostrasburg meetup because they had to suffer through my first iterations of crappy slides. The logo from a fellow Strasburg gopher. You for coming here, you online. I don't know where is camera. I guess there anyway. And for them in the go, they have organizer because like you're really, really top and HTML for the mean, of course. One of us.
Low code graphical apps with Go top to bottom!
We're going to continue in more creative uses of Go. Most people use Go, microservices, Kubernetes stuff, servers, whatever, but usually not user interfaces. Every year there are a few crazy people who come to talk about some crazy new front-end thingy built in Go, and I personally always like it. So I also invited Andrew this year to talk about Logoad graphical interfaces in our favorite language Python. Go! And I'm a boss. Thank you very much, Dave Matcha. So yeah, I'm going to talk to you about Logoad graphical applications. So on two levels there's not going to be much code on screen. I think I actually have less code than John's description about how to get involved in contributing earlier. So let me see how I can do it. However, what's a pretty picture, so hopefully I can keep you engaged in that way. So yeah, hi, my name is Andrew. I am a software engineer, working various startups. I've written a couple of books, and occasionally appeared on podcasts and interviews talking about graphical app development with Go. It is exciting to be here on stage at FOSDEM. I've been coming for decades. Having been an open source contributor for years as part of the Enlightenment Project, Maven, all sorts of things that potentially predate and certainly stand outside of the Go ecosystem. More recently I started the Find Project. Perhaps a few of you might have heard about this. It's a way to build native graphical applications using Go that are going to work on any device. If you've never heard of it, I'll do a quick recap. If you have, just hold on a second and I'll move on to some new stuff. I've been a Go developer for about two weeks less than I've been working on the Find Project because we had an ambition of what we were going to do, and then we figured out what language is going to work to deliver on these ambitions. Hopefully everybody agrees, it goes just a fantastic choice. How did all of that come together, and what are we building on top of it? My day job is at Fine Labs, where we're working on products and services that help businesses get more out of the type of technology that I'm presenting today. Like I said, the Find Project started in 2018, and it over that time has aimed to be the simplest way to build native graphical applications. They should look good, they should be easy to build, and they should work absolutely everywhere. Of course, easy to build is relative. We've had great feedback from people who have never coded Go or who have never built an app before, but there's plenty of people out there who feel that that's a little bit overwhelming to learn, they don't want to be a coder, they just want to build stuff. That's why I'm talking about something a little different today, about building with absolutely no code at all. But before I do, here's the recap for anybody who's not familiar with Fine. It's been running now for six years, I can't believe it's been that long, but hey, it's come a long way. We're currently ranked around sixth of all cross-platform graphical user interface toolkits by OSS Insights. That puts us up amongst some other names that you might have heard like Flutter or React Native, and of course I should probably shout out to Wales as well. They're very popular, lots of different ways to build in modern toolkits and actually in Go. So, variety out there. Last week I was really excited to realise that we have become in the top 1000 GitHub repositories of all time, out of, I don't know, 350 million or something, long tail perhaps, but it's a little bit of a milestone, very exciting to be celebrating that. We have about eight core contributors, they come and go. This year has seen a lot of new contributors coming in and as part of the Go community, it feels like a really welcoming inclusive space and we have some channels on the Go for server, we have a Discord for people to chat and there's about 2000 people across the different platforms that we are discussing on. But that is enough about the technical side and the project, which if you're interested to hear more about, there is a talk in the graphics dev room tomorrow afternoon at a similar time. I wanted to talk today about not using code to build applications. So, I'm going to introduce you to a tool called Fission. The spelling is just as peculiar as the fine project, but why not? This basic screenshot is not going to reveal too much, I'm going to step through a little bit of what is capable of it, but more how we pulled it together and how it has really been enabled by the Go built-in functionality and what we have been able to build on top of that. This is the screen that you might be greeted with, if you do load the app for the first time, it is going to help you get started building a project. So what did this set out to achieve? There's so much that we could do and probably I should have thought twice about getting into the space, I think there's 130 to 150 no code, low code platforms out there, but if you've ever tried them, they're mostly building websites or web technologies and if they're apps, they might be bundling them up into native installs, they might be targeting specific platforms or they might be reliant on technologies that I might refer to as legacy or certainly not with the same awesome modern tooling around the Go community does. So we wanted to do something new, something that was truly building native apps, like I said before, fine applications are going to compile and run on any device with a screen, at least that's the ambition, we're about 95% of the way there. So we're wanting to build native apps, but also we want to make this really easy to get started, with easy to build stuff, so as much graphical editing as possible, as you would expect from a low code platform, we started with the UI and the theming capabilities, so although the application has got a long way to go, as you might see, there's something to get started right away, it should always be building on a real code base. So if you don't like the front end or you want to work with a team of developers just loving Go at the low level, you should be able to work with them collaboratively through the Git repository, for example. The applications should compile for all platforms, but also this should run on all platforms. We're making use of our own technology, if you want to build an app for an iPhone, but you want to do it from an Android tablet, that's cool. If you want to use Windows as your development environment, but target only mobile devices, that's just grand as well. Little tweak on the bus, because you know the boss was expecting something before you get in in the morning. Of course being at FOSDAM, everything that I'm showing you today is open source, it's going to remain open at the core, but some days companies want the business add-ons, the plug-ins, so we're going to be using this as open core, but like I said, nothing that I'm showing today is proprietary or held back. The repositories are evolving and some of them are not landed in the right place yet, but I'll point you in the right direction at the end of the talk. Like I showed you at the beginning, we're going to give a UI that allows people to get started with templates to get their application running really quickly, but also you could build an application completely from scratch if you want, with the building blocks that we've provided on top of Git repository for managing the source control. But there's so much to get started with building your first project. I kind of don't want to say that. When I started it was super easy, you opened a text file, wrote a couple of lines in there and then you just ran it. I mean it felt a little bit like a script, but really good solid code. I've opened a few issues upstream with the project team about why has modules made things difficult to get into? Workspaces are amazing, but it's more metadata. We're going to have to manage that for you, but that's exactly what's going to happen. You tell us what your application is going to be called and we'll generate all of the metadata, set up the modules all for you. The metadata about the UI, about the themes, everything you're editing is going to be stored in the source control as well. So if you decide that you want to work, like I said with somebody else who's not on top of this UI, you can pick up absolutely that code and work with it. But also we want people to be able to pick up this project having worked on the code directly for a period of time. So not like a project where you can really quickly pull together a user interface for an application and then export it. It's amazing, you've got a React Native app out the other end. Nobody can read it and if you want to start working graphically on it again, you're possibly going to be starting from scratch. I don't know. Anyway, so everything is synchronized with the source control onto the file system. So we are working on a Go project. I did promise something a little bit graphical. So here you have the first slightly better looking screenshot I think. We're going to be working just now on the theme. We have a pretty crude mock-up of a smartphone device here, a generic one. The cutout is somewhere between a magical island and a place where let's face it, cameras exist and we don't need marketing about it but it's there. The UI is going to allow you to see how these applications work on mobile device, smartphone tablet or a standard window, inset inside the application. It's going to handle the scaling, the alterations that you would expect for these different types of devices. But also we need to present in light and dark mode. So you can see a toggle at the top of the color picker on the right hand side. All of this lovely information is just saved directly to JSON. We've used the standard encoding package that Go has provided to save it to the wonderfully comprehensive file that you see illustrated on the right hand side. That wasn't easy. Go made it super easy, completely built in. But then we needed to load that data into the application that you're building. We didn't want to do any weird generation of things, stuff that could get in the way of working on code like you would in a real code base. So we just store the file there and embed it into the application using Go embed. I haven't realized how easy it was to work with this. I'm going to call it new functionality because I work a few iterations behind the cutting edge because we're trying to support as many devices as possible. To be able to stream this most effectively into your application, a fine app can have its settings set to a certain theme. You just call set theme. But it doesn't really expect a JSON file. It expects some Go code, a struct. We provided this from JSON functionality in the package scene. You can see here illustrated how we can provide both light and dark alternative colors for applications. Less so well illustrated here is that you can work with fonts, icons and the different sizes. Everything that makes an application feel the way that it does. We've got a bit of a look. You can imagine how that file might have your brand identity or something stored in it. You can port that across multiple different applications. Widget editing is the other thing that I feel is actually quite an enabler on a UI like this. If you're thinking about building out your first graphical app and you're looking at fine and you want to use Go but you're not quite sure how to get started, something like this, just this one screen could provide you with the graphical editing that helps you to understand how things are put together. The functionality in the user interface here is mapping to the APIs that are available if you're looking at this as a developer. Actually, let me just go back a little bit. I'll show you a little bit more later. You can see basically here there's a section highlighted on the user interface. We've selected that and down on the right hand side it is giving you the different areas of settings that are available plus the option to insert more things into your user interface. I feel like I've said a little bit too much about JSON already. The fact is it's really super helpful. I don't like to read it. I don't know if it's the win but I'll agree with the folk that perhaps suggested that XML was a little bit cumbersome in comparison. We use it again. Actually, it is great that Go not only supports serialized using something like a map to JSON but we can because we have a stateful widget toolkit, we're able to serialize the entire state of your application the way the widgets are positioned, the containers around them and the metadata for them streamed directly to a JSON file. Again, illustrate it over there. There is also a little blank field on line four for name. A chance to put an identifier on your widget so that you can hook it into code later because this is a low code solution. We know we haven't solved all of the problems and you might want to write a little bit of Go so you can hook into that through the name which is going to be exported as a field on the application which I can show a little bit more in detail. As part of the Find project we've created a library which did start out as a project a little bit like this but now has shifted focus to helping more applications to load and save graphical state. It will also allow you to understand which widgets are available so you can iterate through. You can, at runtime, create new instances of an object based off some textual representation or just ID of the object type that you're looking to work with which, as you can imagine, pretty helpful if you're trying to generate at runtime a user interface that's normally compiled at compile time. One thing that I find really quite surprising, in fact I don't know how many people have realized this but your objects and types in Go in memory can be written out to Go code to reconstruct them as though they were source code. That's pretty cool. It's like stringer but it's go stringer. Has anybody heard of go stringer? I'm really curious about that. Right cool. So hey that's really interesting. Anything that you have in memory pretty much can be serialized as the Go code that generates it. You may need to write a little bit of code to make that fully functional yourself but we built on top of that. That means that every time you save your user interface state it's not just saving JSON but it's spitting out the Go code that will generate the application source code so that you can be working with developers but also so you can actually compile and run it. Which moves on to compiling applications. Now it goes amazing at the cross platform compilation, portability building applications for anything but there are certain requirements when it comes to building native graphical applications. Partly they want metadata around them but partly people who own certain platforms put licensing restrictions in place and require that you run on their hardware or with certain toolkits present so there's a little complexity here. The project that I've presented and will illustrate uses local developer tools so you're never beholden to anything at all. If you've got stuff installed you can build the application that you have coded and have it run on the local system and install it into your applications directory or the start menu whatever the equivalent would be on Windows at the moment. For the local system that's really quite straightforward, the tools are there. For cross compiling we've had some really great contributions to the fine project called Fine Cross from Looker and Jacob and Cedric as well so that you can with that level of simplicity build for any platform. It pulls down images with all the developer tools installed that you would need but even then you still need to have it running on a Mac to do iOS development or on a Windows box to ship off to the store so I'm not going to say this is proprietary but if your business was interested in something that just worked in the cloud there's going to be an option here that, good timing, there's going to be an option that allows you to spin up basically a pipeline in the cloud. It sends the latest version of your code and it comes back to you with the native app bundles for store deployment or ad hoc distribution and also included in that we have support for over the air self-update of application software as well. This little diagram here is something I created a while ago to try and explain to people why platform agnostic development or building with a tool chain that works on any platform makes a really big difference. If you think this would help to convince people to use Go More in your organization there's a couple of postcards over there next to the stickers and on the other side there's a couple of really sweet doodles which just show how coding nirvana can be achieved so hard level tooling, cute fields. Lastly before I actually show it in action there's a project out there called Fishos. Again you might get the theme here with the FY starting of the name which is a Linux desktop operating system that is built from the basic graphical level all the way up entirely with fine and fine applications. We're moving to all of the applications being created or editable with vision not just with source code so that you can be very well running your desktop software and go I actually think that this could be tweaked with something wrong here I can improve on it and you could go and edit the software that you're running, load it in the UI, make some modifications and then install it right back over the top of the software that you're editing before. If that sounds really interesting well we're working in that direction you can head over to fishos.com to see where things are. There is a beta ISO, stick it in a virtual machine nothing more and some of this functionality is not in the version that's there yet but keep an eye out because this is all coming very soon to a platform near you. With that I thought I might just try and show you that this works and bring up the UI editing and application of my system here. I have the bar of icons on my installed system here. I see this calculator app. It's nothing special, it's a calculator, it's going to calculate some things. Clearly there's some things in this that could be improved for some reason I think that's true. Let's actually go ahead and look at how this is. We can edit the calculator application and it's going to load it in the editor that I showed you. I was demoing this for somebody else just immediately before so it's defaulting to smartphone apologies for that. Of course we're really working in desktop software so this is the more familiar button size, text size button size that kind of thing. This C button, it doesn't seem to be quite right, it's very vague, I feel it should be a robust red warning. It is a warning, it should be a danger button. Let's really indicate that there's a problem likely to happen if you press this. You might be one of these people that thinks that clear isn't quite substantial enough so all clear or AC might be more familiar to you. We also could look at the layout of our application expanding down here on the containers. If I tap this, this tap here, this is a two item grid, I think it's this container here. I could do something a little bit bizarre and make those rows wow. That did make sense from a mathematical point because this is a new and evenly spaced grid and I just asked it to do something a little bit daft but actually the columns were just fine so we can go back in there. This application obviously it's just a quick editing in line. I want my app, I want to test it, I want to run this piece of software so I'll test, run, I'll press run. It's going to go and okay I forgot to save the file. Sorry I should have asked if anybody realized what I'd done wrong there and offered a prize but it's happened to me once before and it will happen to me again I'm sure. So there you go it has compiled the application and built it natively onto our system. That is a live native application compiled for the current system but it is just a binary where it's a single binary as any good go application would be running off our hard drive but actually what I wanted to do was commit my really significant improvement to this calculator app to my operating system and so I'm going to use this other developer button called install and it is just going to improve every day of my life now so when I go back to my calculator app over here I now have a new version of this little piece of software and I just feel like this has been a big improvement for me. Hopefully you can imagine a lot more possibilities and you see the next project that you're going to build and I would love to hear about that. Anyway let me just, oh by the way if you really like building applications you like mark down and think it's the future of all good things. This slideshow application is a fine application called Slides with a Y of course and it's just marked down. Anyhow sorry I'm pressing the wrong button aren't I? There we go. If you would like to learn more about this that I have showed you and please do check it out any feedback that you have. It's early days but we're looking for people to get involved. Beta testing.app is the homepage for everything that we're doing. It offers you links to, I feel like some surveys let us know what you think is going to be useful. Sign up to beta testing when it's available and the second link there is actually not connected to that app website. We recently completed some user interviews and got some really great feedback about where the opportunities might exist in this area. If you're at Intrigue we're running a questionnaire based follow up so the second link there would be really interesting to get your feedback. Like I said this is all open core and everything so far is fully open sourced under the BSD license. Actually it's dual licensed with GPL as well for the licensing of business add-ons later but it is all out there with a compatible license. If you would like to see the source code which I didn't tell you about but honestly is there fully available and pretty straightforward you can go to our YouTube channel. There is a video series called Creating an App Builder I think. We used to do them weekly and then moved to monthly. There are 11 videos there that take you through almost all of what you've seen demonstrated earlier and the source code is currently in the tutorials repository because we're just working on neatening up the first iteration of the actual product that I just demonstrated to you there. But the majority of the code as I said been through the videos is available in the tutorials repository. Hopefully that's been really interesting. I'd love to take questions now but also like I said there's these little weird things out there but if you're interested in building the future pick up one of these stickers and slap it onto a laptop and tell the next person how it is that Go is going to be the next future or the best, brightest future for graphical application development. Thank you very much. Did you all just realise we just saw an operating system user interface completely building Go? Yeah. Wow. I'm shocked.
Creating a multiplayer game in Go, from zero
Thank you everyone. If I can have your attention again please. Because now the fun starts. We have games, we have drones, we have music. It's finally time to party and the first party person will talk about gaming in Go. Run for plaza Francesc. Thank you very much for being here. Before we start, if you want to check the game, it's available on MazeWorlds.com. If you decide to check it out while you talk, it's fine. It's still in Earth development so it may have some bugs, it may crash, but don't worry, just play with it if you want to and have fun. That's why I did it. So a bit of myself. I'm a Go developer. I've been working in Go in back end mostly for the past nine years. Before that I was doing JavaScript, Pearl and Ruby. So nothing about gaming. So I have no knowledge when I started doing this game. Why did I do it? Why did I decide to build the game for myself? So basically a few years back I was thinking like, oh, I would like to play a game that I used to play when I was in high school. So when I was in high school, every Friday with my friends we went to a cyber coffee and we used to play games like Warcraft or anything like that. And not only Warcraft but mods about Warcraft, which nowadays some of them have become popular like League of Legends. Maybe you knew about him. So League of Legends, it came from a mod of Warcraft which is called Dota and also Dota 2 is a mod that came from Warcraft for the same one. So one of the mods that I used to play on Warcraft was called Line Toward Wars. And I really loved it, but I've never been able to find anyone who ported it from the actual implementation to something which is not reliant to anything of Warcraft because this mod is available in Warcraft 3 which is from 2007. So remember, it's on Starcraft, it's on Warcraft 3 Reforged which is the new version that they released. But you always have to have Warcraft as a platform to play any of these mods. So I said, okay, how hard can it be to do something like that? Based on that, this mod is not that difficult. So first of all, I started checking out, okay, I want to build a game and I remember hearing a friend of mine saying, okay, I built a game for my kid in Lua. So I said, oh, let's check out Lua and see what it has. And with Lua, I found Lof. This is the name of the library. This library basically is a two-day game engine and I started doing the tutorial. The first tutorial which is building a small game which I'm going to show later on. But when I was building the game, I was thinking, okay, I know what I want to build and I know that we'll need a backend. I know that the backend, I'm going to build it in Go, I'm not going to do it in Lua. And I most likely need to share things between the frontend, the client, and the backend. And I don't fancy rewriting stuff so I said, let's search if there's anything in Go which does the same. And searching in Go, I found Evitan. Or Evitan, I don't know which is the representation. So an Evitan in the description is basically a two-day game engine, the same one as Lof. So I said, oh, that's fine. Let's check the documentation and let's check how it looks like. This is the simplest, simplest Hello World in a game engine. Hello World is printing Hello World on the top right of your screen, basically. And as you can see, basically, it has, in the main, you set the size of the windows that you're going to print out. You set the title that it will have. Hello World in this case. It's just the title of the windows, not what it will be inside of it. And then you use Evitan to run the game and then you pass a stroke. The stroke has to fulfill these four, I guess, three functions. One is for interface. One is update, draw, and layout. Layout is a simple one. It's basically the size of the screen outside of where you are. So you can print a small screen on a bigger screen. Then you have update and draw, which were really interesting for me because were the same names that the library that I did the first tutorial had. So basically, those two name functions did the same thing as the ones on love. So I didn't lose my time doing something in Lua because it was the same. So basically, update where it does is every game tick, which is 60 times per second, it will call this function. And in this function, you're supposed to hear for events from the user, like clicking or with a mouse scrolling with the wheel, clicking, typing anything on the keyboard, you're supposed to hear, to listen for those events there and do something. And then on the draw function, which is called every render time of your screen, you're supposed to draw what it was that the user did if you want to do anything on that. And that's it. That's all the functionality that it gives you. Aside from all the helpers of listening for stuff, that doesn't matter as it gives you. So if you want to do anything, you are basically building on top of that. So my first games, I did at the beginning four games just for me to learn libraries and to be comfortable to then start doing my own game. So the first one was shoot the enemy. This is the one that I copied from the Lua library. And just because the a bit in library didn't have any useful at least at the time. And all the tutorials were examples. So you want to do this? There's an example. You want to do that? There's an example. It has a lot of examples, but not a tutorial that kind of drives through all the stuff that you are able to do. So the first one was this one that I copied from the Lua implementation, which is the only one we're going to play. It's this one. So you are able to move. The snake is the enemy. It has a life on top left, which is the number of life, which is 10. And me and the panda. I move the panda with A and D. And if I play space, I shoot. If it hits, that was lucky. If it hits, it moves faster, so it's harder for me to finish the game. It's not because I can spam it and just finish the game easily, but the idea is it's harder for you to finish the game with that. So that was my first game. This allowed me to know how to print something on the screen, the image, get the events from my keyboard of moving left and right to move the panda. And then let me drink a bit and I'm just going to drink. And then having something that moves automatically, the snake is moving by itself, and then shooting something when they press something, and that also is moving by itself, which is the bullet, and then detecting the collision from the bullet to the actual snake and then producing the life. So with that simple tutorial, you get all that, which is really useful. So then I decided, okay, let's make something else. I was moving out of the tutorials because that's the basics. As you saw, the interface is really simple. You have all the updates or whatever you want to do. So then I decided to make the snake. I'm not going to play this one just because the image you see is one pixel by one pixel and hitting the actual box is really awkward. So I'm just going to make a full of my software in that game. But it works. And you have the score that you have on the top left, so how many of the boxes did you get, and it gets longer, and then you can go to the edge of the screen and it will appear on the other end. So the only thing that kills you is eating yourself. So one of the normal snakes. The next one, the radio, is space and bothers. We all know how it was, so it's basically you can move the spaceship to the left to the right, you click space, and then you kill the invaders. Same thing. So nothing much more difficult than the first one, for example, this one. But then when I finished this, I said, okay, now I know a bit of the basics, but there are a few things I didn't touch yet on regards of what I want to do, which is the game that I will tell you later on what it's about. And the thing that I was missing is basically how units move. So not just moving, but the assets of the unit moving up, moving left, moving right, and so on. So you print those on the screen, so when you actually do the movements, you see something moving smoothly, not just the other movements, which are just moving an image up and down. And not just that, but seeing it move on someone else's screen. And then you have to have server side, which the client is sending information to the server, and the server is replicating this state to the other clients and yourself. So this is a small example of the first implementation that I did. That's an hour on the right side if I move to the left side. So you can see that the two clients see the same thing, and there's a server also. But it's the first implementation in which it's simple to see the things are moving around and doing this stuff. And also the map. This map that you see in the background is something that I built. It's not complex, but it's something that then allows me to build what that's the real game that I built, which I called Maze Wars in this case. So now it looks a bit better. It's a screen show I took maybe a month ago, so now it looks a bit better, but that's the basics. So what is this game about? Basically when you start the game, not when you start the app, you have to give it a name, and then you have to start the game, and then other people join, and it could be up to six people or minimum of two. In this case, it's two people. So when we are two, you are assigned one of these lines, up to six. So for example, in this case, I say I'm the one on the left. The end goal of the game is to steal all the other people's life. Everyone has at least 20 lives. When they start, everyone has 20 lives, and the way I steal those lives is by sending units. Sending units in the bottom right, you see that there's a screen which some faces, those are the units. When you click those, the units spawn on the top, gray area, and they have to pass down until they reach another gray area, which they move to the next line if there's any, or die. And also when they reach the end, they steal a life from the other player and they give it to you. That's as simple as that. How to prevent that? You have to place towers. When you place the towers, the idea is to build the maze for the units to go through the maze while the towers are attacking the units. And that's all. That's where I did it, because it's really simple. If you nail it down and reduce it to the basics, it's something which is really simple. If it were a shooter game or something like that, I would not even start it because that's another stuff. But this is really easy. So I said, okay, let's build it. And also this is basically if you ever played any kind of tower defense, which also came from Warcraft mods, this is basically a tower defense, but that you can build a path instead of having a predefined path that you have to place towers on. So my list of things, at least at that time, that I said, okay, I need this in order to do the game. I needed some assets because I need towers, I need the terrain, I need the units and so on. I need to do the map. So how is the map of the game? That's the thing that you saw before. Also, I need to be able, well, the code architecture, so how I'm going to architecture the code in order to not have a mess later on. So how are things going to be organized and how everything is going to communicate between each other? That's one of the interesting parts for me. Summoning units, so clicking the unit and summoning on the other side and then passing down, placing towers, described this, and gold income and lives, that's also easy, and multiplayer servers and the EXUI. So those are the list and we are going to go through that and potentially the struggles that I had while that. So the assets. For the assets, I was searching for something free at the time that was giving me the minimum kind of interesting assets that I needed. In this case, you need units. You can see that there are here some of the units that I saw before. On the top right, there's a cycle that is called which one I, so it had units, at least 10, it has, so it's fine. The hard part is to find towers because normally on these free assets packs, towers are not something that you were going to find because it's really specific to this game. So what I used is what you can see on the bottom right, which is like a sculpture of a monk. So that's what I use as one of the towers and there's another sculpture as a warrior, which is the other tower that I use. The asset pack is much bigger than this. It's really, really big. It's just an example that they give you as a few things that we have here. I'm going to drink a bit more of water. So yeah, then the map. How did I build the map? The map I use software which is called Tile, which is basically used for that, to build maps and to be something for your game. Basically what you do is you import your assets on your bottom right. You see that those are some of the, some of the map assets of the assets that I showed before and then you are basically using paints. Let's say this way. You click one of the assets and you can place it on one of the grid. You see that that's a grid. You can place it. You can, and basically once you have one line, in my case, I can just copy paste it down because those are all the same. So once I did one, basically it was to build the map for more play-ups, just copying them. But that's the library that I use. Once I found it, I was like, okay, that's quite easy and it works really well. Then for the code architecture, when I was building the game, when I started with Lua, the thing about hearing for actions, sending actions to somewhere and that somewhere is storing those actions and then the draw is drawing what happened before. It was really, really ringing my bells on some things I did before. On the bus, when I was working with JavaScript, I was maybe good at that, but I was in the train type of React when it came out and React and Redux and all that. When that came out, I was doing JavaScript, so I was really on tune with that. React before it was using Redux, and I don't know what it's using now, it used to use Flux, which is the library that Google and more Facebook are implemented. How it works is basically this is an HTML point of view. The view is the normal HTML that you have. When the user clicks the view, an action is triggered. This action is passed to a dispatcher, which is basically serializing all the actions in a specific order in the order they arrive. Then it's dispatching these actions to the stores. The stores are basically data maps which listen for those actions. It's a big switch case for the actions that you want to do actions on with the data that you supposedly have. Once you change your data with the action that happened, then you trigger an even like I changed. Then the view is hearing for those I change. Then if it Redux, for example, I react, you change the specific part of the view that it belongs to a data.change. That's how this works. If you change the view and you add the client that I'm doing, it's the same stuff. The only difference is that the store is not triggering I change, though the library does. It's not triggering I change because the draw is already being called every time that the screen is being refreshed. There's no need for any view to hear from any changes from the store. Basically, I look for the Flux library in Go. There were some implementations not fully implemented or not working as expected. I just ported the full implementation of Flux that is on Google, I don't Facebook for it, and I ported it on Go. If any of you want to do something like this, there's a library now that works for that. What that gave me also is a state update, which means that when something changed, I also need to notify the server that something happened if it's needed for the server to know the data. For example, if I place a tower, then I can send this action to the server. The server has the same architecture, has a dispatcher, has the stores, and then the server can apply this information. In my case, that's why I decide every four times per second, the server is getting these stores, the state that it has, it serializes the state in JSON, and then it sends this state to all the clients. The clients can see that someone plays a tower because the state is changed. How this is sent with an action, I'm using WebSocket. It sends an action via WebSocket to get the action on the client and just creates an action and push it to the dispatcher, and then everything works as expected. Yeah, that's architecture that I did. Also, one thing that is really cool that I'm not doing now, but that could be potentially useful is replayability. If you play a game, some games you can, you have the ability that after you play the game, you can see, okay, what did I do wrong, and you can just replay the game. I don't know how this is done, but the only way that I have in mind of working is something like this, because if I have the actions, I can just replay the actions in the order that happened, and I have replayability, and that's it. So as long as I'm replaying on the same moment that happened, I can replay anything at any point of time. So, yeah, if at some point I want to implement that, I just have it kind of for free. I just have to store the actions on, that's it. Now, the other interesting part, summoning units. So the first implementation was the easiest one, so just Y++, just go down, straight down, ignore everything until you reach the bottom, which is when you die, and then you steal the life, and you give it to the player. That's fine. I did this implementation just to see that everything was working. You click, you summon, and it goes on. Fine. Now, the next implementation was when I was already placing towers. So when I had tower placing, if I summon a unit, it just went straight through the towers, because it didn't know that those should be avoided. So then I implemented Drieska, just because for me, like, I have to find the shortest path between point A and point B. I'm going to do that. Well, I didn't know at the time, but Drieska is not the fastest. If it's a big graph, it takes three seconds to calculate the path of all the graphs using Drieska. So if we go to the beginning, you can see the cursor moving. Now I click, now it appears. So it's really slow. And it's something that basically you cannot play with it. It would be impossible, because also it's blocking the main thread, because it's doing that. So I was chatting with a friend of mine and said, okay, this is happening. Do you know something? And he said, yeah, you can try Aster. I don't know if you test it, but Aster is basically a passing algorithm that is an improvement over Drieska and based for passing for games. Like, well, that looks like something that is for what I want to do. So maybe you can notice the difference, but yeah, it's a bit faster than Drieska. I'm spamming, right? I'm just clicking, clicking, clicking. And you can see it's not going slow. It's not blocking anything. And it's quite good. Ignore the fact that they are going below the towers. Just follow the lines. Just ignore the assets at this time. So yeah, I did Drieska. The only thing that it's noticeable, and I think that now that I changed a few things, it was a mistake that I did. But basically, the ones that go on this side, you see it basically are doing edge shapes, which is not good for this game because the towers have a range for attack. So they're basically going attack me, attack me, attack me, attack me, attack me, and they die. The good thing would be just doing a diagonal. Let me bring my water. Right. So for that, I just tweak a bit A star. Because A star, how it works is basically you have a Drieska, the same one. But the only difference is that you have, how is it called? A priority queue. It's time that you want to calculate the distance between, if you want to, it's time you want to jump to a different node to calculate the graph on that, the path on that node. What you do is you say, okay, from this node to the end, which is the distance. And from that node to the end, which is the distance. And then you got all the distance. You push this element to a queue with the node. And then you get an element from the queue, which is the one that eats more, more closer to the destination. So basically, if it's just a straight line, it will just calculate go straight because it's always the fastest route to go straight. And that's how A star works. What I change about A star, and that's basically my potential, and now it works better in this way. But what I change is that this algorithm to calculate how close it is, I use Pythagoras because it's just to have xy and xy just make Pythagoras and go straight line the difference. But the ideal one that it has to use is Manhattan. Manhattan algorithm is basically how many jumps left and right, left and right, or pixel, pixel, pixel, pixel, pixel, no diagonals is the distance between those. And with that, it works much better. But what I did to change it is basically when I'm pushing elements to this sort of queue, this periodic queue, what I'm changing is that if the node that I'm pushing isn't the same direction that I was, I just increase a little bit. So I just decrease the priority. So it's most unlikely that he will go straightforward or the same direction. And it's more likely that he will switch directions, basically fiber thing to going on diagonals. And just doing that small change, everything when like perfectly on the diagonals or whatever. And yeah, that's what I should be for. Placing towers. Placing towers has not much logic itself, just the fact that when it clicks and how to place it on the line. The only thing that's noticeable is that the tower is not, you cannot place it anywhere on the line. You can't place it anywhere on the line. But it has like, you could see that it's basically jumping around. That is not straight. Why is that? Because when you're building the maze, it's much better to have something straight because for you, it's easier to finish like the build that you're doing than if something is one pixel below and someone is one pixel above, because then most likely you will find that you are blocking the units, which is something that you cannot do. You cannot block the unit. You cannot just build perpendicular lines and say, that's it. I finished. That's not possible. You cannot block the path of the units. So that's another topic. How to detect that the path is blocked. And how to get the path is blocked. Basically what I do is when you try to place a tower and you do the action of placing towers, I just try to create a new unit that will go through as theoretical line with that tower in it. If I don't get any path, then it's wrong. You cannot place a tower. And if I do get a path, then yes, you can place a tower. And that's for placing towers. Then I have gold and income, which is easy. You get gold by killing units. You get income by submitting units. And then every 15 seconds, all the income that you get from submitting units, from all the game, you get it as gold. And the life is basically subtracting life. The multiplayer server is something that I spoke with the architecture. But basically, if you are interested in game servers, this post here is really, really, really interesting. Because it explains you how different games that you may want to build, for example, shooter games or things like that, how do they supposed to work in a small scale? For example, shooter games, the thing that was not useful for me, it's like you are shooting on the past, you are just shooting on the present, because as everything is happening in life, the idea on the servers that you are on the past, for example, the implementation is alternative. Basically, the server is the one ruling everything through the client, sending information. The client has to override with what the server is sending. But it's also the server have, the client have a bit of predictions. The unit movement is not sent with the state. It is, but the one that is doing the movement is the client. So the server is just sending that this was the path and it didn't change and the client is just continuing with the path that it was before. And also using Flux and using WebSockets for all of that. UX UI, first, that was the implementation that I have before, which is really awful. And then I discover a written UI, which is basically bootstrap. I don't know if you heard of bootstrap for HTML. Well, it's that. You can create, you can say this container, which it has rows or it has a grid, and then all the content inside of it will be placed in different things. And you can do input and you can do buttons and so on. So it's really, really simple for these types of things. Then cross compilation, which was something that has been discussed with Go Release before. And this was fun because I never compiled with Cgo enable and this requires Cgo enable. So I managed to cross compile with Cgo enable for Linux, OS X and Windows. And using Go Release, just changing a few things on the actual code base and I have to open up a PR to go release to propose those changes because it's just two lines of changing. And you can cross compile using Xgo. It's not go, it's Xgo, which is a cross compilation tool. And also you can play on the browser with Wasm and at some point I will try to do it on Android, which is also possible. And for the future, the only thing that I, well, from the other things I have to do, the main one is improving the A star because right now it's the bottleneck. It takes 100 milliseconds to create a path. And if George is spamming it, it's awful for everyone to work with because everything goes as low. And then add more types of units, add more types of towers and much, much, much more. That's it. I hope it was interesting for everyone. Thank you. If you have questions, please ask in the hallway because we want to change speakers quickly and not wasting any more time. Thank you again and amazing work.
[Replacement Talk] Having fun with MIDI and Go
What is that time again? The time we switch our compiler from the ghost.NET compiler to the tiny.go compiler. Same language that there are friends. Let's be sure about tiny.go or our friends. Because they are the reason they supply us with this pretty go for hardware and these go for badges. That's all running tiny.go. So when I switch into the tiny.go session of the go dev room, I always forgot what I'm doing here. And I want to very special thanks to our speaker here. Because Danielle is just like, was it yesterday? What? Was it yesterday that I asked you to give this talk? Yeah, probably. It was yesterday that I called him, bad king. I have no talk at this slot. I need to have a talk. And I couldn't move schedules around because the people will be very angry at me for doing that. So what I did is I called Danielle instead. Give me what you got. You said give me a day. And apparently we are finally going to have some live music using my favorite stuff in the world, and I'm going to give you a little bit of a round of applause for Danielle. And I'm Daniel Esteban, also known as Konejo. And I'm here to talk today about having fun with MIDI and Go. So the year is 2014. We have got the room at first then. Since then we have been, we have more than 130 talks, more than 100 speakers, two virtual Kone, thanks to COVID. A dozen of lightning talk, multiple room changes because we grow bigger and bigger. Konejo is an attendee, so happy 10-year anniversary. But also five years ago at Fosden we make the first release of TinyGo. Since then we have 36 releases, more than 150 contributors. We have given multiple talks here at Fosden and another conference. We have two Not-A-Spy weather balloon launches one last year at Fosden also. And this year we have our own subtract of TinyGo because I'm the first, but then you will hear my college a.k.a. and then roam with more TinyGo surprises. So let's celebrate. Let's make a party. Woohoo! I'm going to make music with MIDI. MIDI is done for musical instrument digital interface. And it's basically each interaction with a key, a knob, a button or a slider is converted to a MIDI event. And then you need a MIDI synthesizer to understand the event and make the audio. So today we are going to make part of the MIDI controller, which is kind of the instrument. And then we will send all the MIDI data to the laptop and then just play in a wedge page called Virtual Piano that will play the music for us. So basically MIDI works like you press a button and then you send like, okay, play this note at that volume and that channel and that information. And the synthesizer, which is the Virtual Piano webpage, will process that information and really make this sound. So to make music, first we need a group or a band. I present you these two days. These are the GoFerbaches. I have one on my neck, March have another and you can see them here. They actually run a RP2040. It's this little, little tiny chip here and we are going to put Go code in the chip. It also has RGB LEDs. It has six buttons and LCD. It has a serial feature. And what is TinyGo? Well, TinyGo is not a superset of Go. It's nothing different because it's a different compiler that just puts your regular Go code into the microcontroller. Like RP2040 or Arduino or a 10C or something like that. Something very, very small. And how are we going to play music with just one simple function? It's from the MIDI package. It's MIDI.noteOn. We specify the cable. We specify the MIDI channel. We specify the note. In this case it's the B3. We want to play and the volume. So demo time. This worked like 10 minutes ago. Hope they still work. And let's just try the simple thing is press a button, make a noise. So how will this look like? Just first we configure the button to understand when it's pressed or not. And then if the button is pressed, we play a song, play a note and if we release the button, we stop. We are going to flash the code. I am here just to make the audio hearable from the laptop because we have no fancy AV setup. I am not here to play any actual music instruments behind the scenes. This is all real. We have flashed the code. We can see it's the same here. We go to our MIDI synthesizer. We select our instrument. The channel is one. And if I press the A button, it's a song. Okay. But that would be too easy. So let's make things a little bit more complicated. Let's play air drums. What I have here is an array of five distance sensors connected to another buffer batch. It's almost the same. We just configure here the distance sensor. Wait. And then if the distance is less than 90 millimeters, we play a note. In this case, we are using the channel 10, which is a special channel for MIDI, which is a drum set. So it's not an instrument per se, but different instrument. Hopefully, it's too low. Thank you. And this was only 24 hours notice. Can we make things more complicated again? In this case, the buffer batch has an accelerometer, three axis. We are going to use the data from the accelerometer and also the buttons here in this little tiny guitar. And in this case, when I move the guitar and the batch, I will change the pitch of the sound. What happened? Select channel one and I select this one. So now, I can play the music. It's okay. Okay. And finally, sorry if I'm going too fast. This was supposed to be a lightning talk. I don't know how to program, but I want to have fun too. So what could I do? Well, I introduce you to Blockly. This is a visual programming language. You just drag and drop the block. You don't need to write any code. So there is also a playground. And you have different sections with different blocks that do different things. And then we can translate those blocks to go code. So Blockly is a JavaScript library. It's 100% client-side. It's compatible with all the major browsers. And you can customize and extend it as much as you want. Blockly does not officially support TinyGo. And that's part of my hobby here that tries to make it output go code. So how it looks like, it will look like something like that. It's like the first sample, an infinite loop, like repeat while true. And then you do the button up, for example, play the note D3. So, that's it. Just to save some time, I have the demo here, but you can see, like, you just... You can drag and drop the element. You can modify them if you want to. And finally, here is the go code that should appear. Is this one? Okay. We are going to... We are going to copy. We are going to pass in a U. We are here again. Select... And now if I press the up button, it should... Sound. And we just do that with the... Sorry. We did that with the blocks only. And that's it. Thank you. Sorry. Thank you again. Thank you very much for this very, very last minute, but still awesome talk. I finally have a musical act in my room. What's next to your Taylor Swift? Does anyone know her? They're not going. They're not going. Come on, one last pass. Any questions? Yeah, for your tiny go compiler, which microprocessor popular ones do you currently support? Which one? We do support a lot, actually. I'm not sure the least. I mean, you can go to tinygo.org and you could see... No, it's not driver. I'm not sure right now, but from... Yeah, from NRF, son of the Nordic compiler chips, at Mega also. Sandy 21 and Sandy 51, RP 2040, the one used in the 10C board also, and also like ESP family kind of. Cool. Thank you. You
Smartwatch firmware... in Go? On TinyGo, small displays, and building a delightful developer experience
Okay, the setup is done. We are ready for our next talk. Next talk is by Ikea. Ikea doesn't like things easily. Why Ikea is using a MacBook, ARM-based laptop, running a complete Linux kernel? I'm miracle that this works. If anything goes wrong, we just blame Apple. Remember, if it goes wrong, we just blame Apple. Ikea doesn't like things easily because I like smartwatches. This one runs Linux, this one runs JavaScript, and these ones are proprietary ones, which I won't talk about. But I have non-running Go, and you have one. And that's what we're going to talk about. Go on our smartwatches! Well, hello everyone. So, yeah, I'm Ikea. I'm about six years ago, I was programming using MicroPython. I was programming some LED strips, of course, but I found that they were a little bit too slow, because Python is kind of slow. So, of course, I did sensible thing, and started writing a Go compiler for my controllers. Nowadays, LED strips work really well in TinyGo, because that's a project, of course. So I needed a little bit more advanced project, and that's programming this watch. As you've heard, it's a little bit more advanced, as you've heard, it's... oh, wait, where did my camera go? Well, here you can see it. It's the pine time. It's from pine 64. We also have a stand here and another building. So you can see it live, maybe. No, it can't. Oh, it does have color. Cool. Let's see. Zoom in, maybe. So, yeah, it's a watch. It shows the time. It was in digital. It has very sensors. Step-bunt is wrong, because it doesn't reset at midnight, but whatever. So, yeah, that's what I'm working on. Why this watch? Because there are basically very few other open smartwatches. I know of two or three others, so they're very rare. And this is a relatively cheap one. The hardware is pretty standard, really. A Nordic semiconductor Bluetooth chip with some flash memory in external storage. It has actually a really nice display, but I'll talk about it more in a minute. And it has a hard-wired sensor and a step counter, like normal stuff you'd expect in a smartwatch. So when I started working on it, I decided, oh, wait, you can see it. Let's see. This is supposed to be... There you go. So when I started working on it, I realized I could actually use the smartwatch as a way to improve time ago itself. Because we have a lot of projects that are basically like demos or prototypes, but not a lot of actual professional projects. So my goal was to make this like a actual watch could be like you'd buy, which means I'd have to implement a lot of features you wouldn't implement in just a talk project, such as it has to be fast, it has to be efficient, user-friendly, there have to be translations, but also things like security. So the way I started is by working on a Borg package, which is basically a hardware abstraction layer. It abstracts a lot of different... It works on a lot of different kinds of hardware, like the point time, of course, but also this batch, which is the same hardware, and a few other batches, but might have more in the future. But because it's hardware abstraction layer, it can just as well work on a desktop computer in a simulation. And on all these different kinds of hardware, it exposes the exact same API, which is actually tested in CI because it's easy to introduce differences. This makes it possible to just write for one platform and then have it work more or less for hours without too much effort. As an example, I can actually... Here's the camera again. This is the Adafruit PyPortal. Let's turn this around. It has the same... It can be a little bit tricky to get working, but it runs the same software. So for example, the touchscreen is a little bit finicky. Here you have the same clock running. So the code I wrote for the smartwatch is entirely... Well, it's written for the point time, but it's independent of the point time. There's no point time specific code in the watch code. So I think that's great. So the simulator... Let's move to the next slide somehow. Yeah, this is the simulator. It's running like in any other Go program, so you can just go run dot and you run the program. Let's see what I can quickly get this to work. So for example, here's the same watch. You can debug it here. There's sensors all emulated. There's a step counter here. It disables the screen, of course, but you can change it. So there you go. But there's another feature. Can I show this? I'll try to show this. Help. There you go. Basically, this even emulates the... Well, it uses the Bluetooth package, which works both on like the point time and on the desktop operating system. So what I can do now is I can connect the def kit. Should be connected now. But it's mostly invisible. And if I have... Sorry about this. Where's my cursor? There you go. I can increase the steps. And it should also increase the steps on the simulator. But it's apparently takes a while. Well, this is not going to work right now. But it does work. It's not right now. It most definitely works. So this is a way to be really productive because you can actually test the whole system, including things like Bluetooth, which are normally specific to the hardware. And also the accelerometer, it's not implemented, but you could just easily add settings to change it. And you can debug the same thing using like Delve or something. So that should just work. Here's the Go-Batch in simulation and a game I made. The only thing I really add... Wait, I don't have much time left. The only thing I really add added to each project is like a small configuration part, which isn't actually built on actual hyperpossess in a statement. And just set some parameters. So the simulator knows what screen size to use and even the draw speed. But you could just as well leave it out and then it uses its standard size, which happens to be the same as this batch. Oh, wait, I didn't actually show this. There you go. Sorry about that. Maybe I should have used screen memory. Anyway, so that's the developer side. And I think that's really important because debugging hardware can be quite finicky. And for example, the pintime, either you flash it over the air, or you can't really debug it in any way. Or you have like these small thin wires that are really easy to mess up. So running it on a normal computer is really helpful. So that's the developer side, which I tried to improve. Next up is the battery life. I think I'm going to skip explaining most of these, but basically there are a lot of really small things that need to be implemented to make it really efficient. For example, one is that the chip itself actually runs at a lower voltage. So it has a linear regulator inside to lower the voltage. But that's really inefficient. There's also a DC-DC converter, but it needs some external hardware. But because it needs external hardware, it's disabled by default. But it's like one line of code to enable it. So it's things like this. Most things are more complicated than this. All of these things are implemented in the board package that I mentioned before. So if any of you like to write your own firmware for this watch, you can just import the board package and it'll basically do almost everything you want without. Yeah. So that's battery life. Then the display. This was actually a lot of work. So as I said, the display is like a really nice display, a very high resolution. By default, it uses 16 bits per pixel. The only downside is that the connection to the main market controller is like 8 megabits, which is like 8.5 frames per second. So quite slow. I used a few tracks to speed this up. Yeah, maybe you've seen it before. It was part of the responsive. The first one is that I configured the display to use 12 bits per pixel instead of 16. This low is the image quality, but not really that much, especially for smartwatch, which doesn't need that high resolution. The second one is you can just update a part of the display. You don't need to update the whole display at once, because the display itself has a frame buffer in it. The frame buffer in the display is more memory than in the main market controller. That's how embedded works. Sometimes we get like that. The third one is fast scrolling. Maybe I can show you quickly. So it may not be very visible, but this scrolls with very little delay, except when the touchscreen locks up. And... No? There you go. This is actually a hardware feature. Basically you can say from this line to this line, I'd like to rotate the display by this many pixels. It works, but it's a pain to implement. Oh man, I've had so many headaches. But it worked. Anyway, so that's one of these really specific things for this place. And the fourth one is DMA. DMA, if you don't know, you can think of it as like a co-processor that does one thing, one thing well, which is copying data from one place to one other. And that's basically the only thing it does. So you can basically copy... So what you can do is prepare an area of pixels and then say to the DMA, like, hey, copy this now to the SPI bus, which is how the display is connected. And then it will start copying in the background and you can just continue rendering the next block of pixels. And this can actually really improve performance quite a lot because the CPU is fast for a market controller, but not actually that fast. That's 64 megahertz. I've implemented all four except for the last one because basically... Basically, DMA is a little bit harder to implement in a portable way. That's also Go-style. The hardware works a lot like the async await button in many other languages. You can just say, start this project and later say, just wait until it's finished. But that's not really how things are usually done in Go. So we will need to figure out a nice API for this. But it will get there. Of course, this display is quite complicated and you really don't want to write for this one display. So we have a display API, which is an interface. Many displays in the TinyGo driver's package use an interface like this. Of these, the draw bitmap method is the most important. And as you can see, it has a pixel-built image type, which is actually a generic type. This works a lot like a Go slice, except it's for pixels and in two dimensions. The main reason I did this is because the display is 12 bits per pixel, so they don't fit well on the byte binary. So a Go slice only works for, when each element falls on the byte boundary. So basically, we implemented a slice-like type for pixel buffers. I think this is actually going to be really useful in the future because there are many really weird pixel layouts in some displays. I have seen a display with nine bits per pixel that doesn't fit in the bytes or two bytes. There are all the displays with one bit pixel, like many E-Ink screens or some OLED screens. I'm not exactly sure whether this pixel-built image type is how generic they are supposed to be used, but they are very useful and they are very efficient as well. So that's the main set of methods that most displays implement. But some displays also have special support for the scrolling, which I mentioned before. Actually, the way you use it is by simply doing a type assert on the display drive and seeing whether it supports scrolling or not. That's just a regular old Go type assert. So that's display API, one level higher. Then another level higher is tinyGL, which is something I have been working on lately. It's basically a graphics library for drawing widgets and stuff. There are many existing libraries which do the same thing, like what's happening right now? Lvgl, maybe people have heard of it. It's a C library and it's very often used for embedded devices to draw widgets and stuff. It's also used by Infinitime, which is the default firmware on the pintime. Downsides are that it's written in C, which is difficult to wrap in Go, and it has a lot of macros and compiles to ARM constants, which Go really doesn't like. The only thing we have are build tags, but really you want to avoid them as possible. Another library that's often used, especially in toy projects, are some libraries from Iterfruits. They are kind of slow, though, because they draw each pixel individually, which means you have to send the x-coordinate, i-coordinate, the width and the height of the area in your corner paint, and then one pixel. That's going to be slow. It doesn't support 12 bits per pixel like I'm actually using. It doesn't really work for me either. Then there's TinyDraw and TinyFont, which are libraries used in TinyCo, that are part of TinyCo. But they work sort of similar to the one in Iterfruits. They're great libraries, but they're not super fast, so that's why I wanted to make something different. So that's why I made TinyGL. It works really well together with the board package that I made. They are independent. One of them doesn't import the other, so they're separate, but because they're an interface match, you can just really easily use one with the other. Some of the features include that everything it draws is anti-aliased, including the text circles. Actually, I do have a circle in the middle of the screen. There's my cursor. There's a circle in the middle, which is anti-aliased, and these lines are anti-aliased, which actually took me a week to implement or something. It's kind of difficult. I actually based the algorithm for the anti-aliasing in these lines on an algorithm that was originally used in, I think, the second Star Trek movie, from the 80s, for the Genesis device. It was really nice reading the paper and seeing the mention of Star Trek. But computers from back then are somewhat comparable to my controllers today, so it makes sense. I am going to write a blog post in the future about how I implemented these lines because it's way more complicated than it looks. It basically is polygon rendering because it's like a rectangle instead of just a line. Yeah, I think that's it. Some things I want to work on in the future include, I mentioned before that I think security is important. Well, I haven't actually implemented any security features yet, so that's some work in progress. For example, in Bluetooth you can change the MAC address every 15 minutes, I believe, and you can bind the devices so they can connect security the next time. But that needs some work and probably some flash storage, which is another thing I need to do. And of course translations. I plan on looking what's available for our go and seeing whether any of the existing packages would work for a tiny go. And maybe write my own, but maybe not because I'm not a dialogue experience, but languages. So that's it. Any questions? Thank you. My takeaway is hardware is very hard. Tiny go, lots of effort. Any quick questions because I think Ron needs setup. Hardware is enjoyable, though. Was there a question? Please raise your hand again because I lost you. Sorry. Keep holding your hand up. Thank you. Hello. How do you power your earrings? Sorry. Oh! Sorry, I totally forgot about them. There's a small lidsim on the back side. It's a CR1225. How long does it last? About 30 hours, I think. The most crazy thing is that I actually wrote part of it in go. And assembly. Second question is how can we get a kit, like a development kit for the watch? Sorry? Is there like a development kit that we can buy? I believe so, but you'd have to look for the Pine 64 store. There should be a development kit, so you don't end up breaking the device if you make a mistake. The Pine 64 people have a stand in the stands building. Go ask there and they will point you to the right place. Pine 64. Any other questions? Really quick. I see none. Applaus!
Go Without Wires Strikes Back
Go to an internet search of springy LED and you will only find things that I have posted. Okay Ron, I will quickly introduce you. Well Ron is the introduction, he is the guy that flies drones over my head every single year and every single year I survive. Last year we did Go Without Wires, where we launched illegal weather balloons above China. I mean above Brussels, completely illegal by the way. And this year he strikes back. Ron, why are you striking back? Go ahead. A plus! So before I get started I just want to say a big thank you to Marcia, the organizers, the staff, all of the people who work really hard to create this amazing labor of love called FOSTA. Ron, let's give it up for them. I'm missing one cable. Here we go. I know there's always a cable. Alright. At FOSTA 2021 we learned Go Without Wires and we introduced the Go Bluetooth package. Then at FOSTA 2022 we learned to go further without wires and we showed lots of interesting local area networking with Wi-Fi. Then last year at FOSTA 2023 we went to go even further without wires and in an incredible cliffhanger ending, despite the fact that I accidentally reset my long range router to factory settings right before my talk, somehow despite all of that we managed to release Tiny Global 1 the next day. It was in hack a day and all of us are still walking around so it couldn't have been that illegal, right? So yeah, it was an amazing experience. Now it's FOSTA 2024 and Go Without Wires strikes back. So I am Ron Evans, technologist for hire. I run a boutique consultant called the Hybrid Group where we create the software that makes your hardware work. And we work on some open source projects such as TinyGo. So it turns out our secret plans for Go wireless communication are no longer secret. Something to do with doing a series of talks at an open source conference where it's all streamed on video. People actually watch this stuff. It's amazing. So let's talk about space. These are all the different network types by spatial scale. At the lowest we have nano which would be like body sensors that are like injected into your bloodstream. We have body area networks which would be like your pulse oxidometer or your diabetes, blood sugar monitors. We have personal area networking which would be Bluetooth, local area networking, wide area networking. You get the idea. There's a whole universe of frequencies still to explore. And thanks to the amazing community around TinyGo we are doing exactly that. So let's go back in time and see what's been happening. We'll start with the personal area that we can go Bluetooth. So of course it runs on Linux. Actually there's a new release coming where it will run on many different versions of Linux because we got rid of the cgo dependencies. Mac OS of course. Windows, yes it runs on Windows. There are Windows people out there. And bare metal on Nordic Semiconductor microcontrollers like you saw during ICAS demo. So we now have new bare metal support that we're just introducing where we support the Bluetooth host controller interface. Host controller interface. You're probably wondering some of you, what does that mean then? So a quick overview of a typical wireless embedded device architecture. Hello, there we go. So we have an embedded device and a lot of the more professional, serious devices will have two microcontrollers, two processors, the primary one which is going to communicate with your sensors and displays, and then the secondary which is going to do your wireless communication. Because these microcontrollers generally only have a single core. So how do we do multitasking on single core? It's not easy. So many boards use this exact setup. Boards from Arduino, boards from Adafruit, boards from Seed Studio, many other boards. And most of them use the ExpressF ESP32 which is a very common little chip that supports both Wi-Fi and Bluetooth. So it's so common that there is actually firmware they put on that called NINA FW, NINA firmware created by Arduino and supported by a lot of community members. And the way that we use that is that TinyGo runs on your application microcontroller and then using this HCI protocol over the UART using the serial interface, we can communicate with the wireless microcontroller. So now we have Bluetooth support on all of these boards right out of the box today, right now, well, if you run the DevBatch. So how Bluetooth works? The too long didn't read edition. So we have central. Central's are like your notebook computer, your tablet, your mobile phone. They go scanning looking for stuff out there. Then we have peripherals. Peripherals advertise their services. I'm a heart rate monitor. I'm a printer. I'm a smart light bulb. So central connect to the peripherals and say, so what do you got? What are your services and characteristics that you offer? And the services and characteristics, some of them are well known like, you know, your heart rate monitor, et cetera. That way different companies and different manufacturers and different projects can use these same services and characteristics. So let's take a look at something. Tiny scan, our first demo. So it's a Bluetooth war scanner as a conference badge. So where's my conference badge? So I do not have this running on the Gopher badge because I need some extra pins to actually connect. But it's an A to the fruit pie badge, which is a microchip AT-SAM D51 processor. It only has 256 K of memory. Yes, 256 kilobytes of memory. So you know your typical hello world in regular go is 1.4 megabytes. And so luckily we have time to go. And then we're going to use the A to the fruit air lift Wi-Fi feather wing, which is a little daughter board, sister board, friend board for the badge that contains the ESP32 processor. So let's take a quick look. Hopefully my, I forgot to check my video. Let's see if it works. I know that some people talked about how they don't like make. I really like make. All right. So there's my camera. So let's take a look here real fast. So here's the badge here on the back. You can see the little board with the ESP32 microcontroller. So let's go back over to our code. Let's take a look at the code actually. Yes, I have internet. Amazing. Can you see this? It's kind of small. All right. So it looks just like your regular go program because it in fact is a regular go program. We're even importing like your basic gigantic memory consuming packages like font. Just a show we can. And we're using the tiny go Bluetooth package. And then we're using another package called tiny term, which is part of the tiny go ecosystem. It's just a relatively simple terminal for run on any of these displays, not as cool and anti alias though. So what we're going to do is first we're going to initialize the terminal. We're going to output some information on that. We're doing it. We're going to enable the Bluetooth adapter. Then we're going to start scanning. When we do the scanning, whenever we find a device, we're just going to print out what the device is doing. We're going to print out what the device is addressing. It's relative signal strength indicator and if it has a name and then output that. So that's all this is going to do. So let's see if this is going to work. So let's see make. Tiny scan. Let's plug in the badge. Let's actually plug in the cable because that works a lot better. Who here has forgotten to plug in a cable during? Who here has not forgotten to plug in? Who has not raised your hand or a liar? That's what we had for a while. Touché. Some people got that joke within the joke. All right. So here now we see I flash the board and it's a little blurry. Let's get some focus here. So you can see there's the MAC addresses of some devices. Anybody have a Bluetooth device they want to turn on? Let's see. I'll turn on my phone. What could go wrong with that? All right. So I have some Bluetooth stuff on my phone and I have a little advertiser that I can turn on. And you'll see here in a minute. In a minute. It's been over a minute. So they enable Bluetooth. Yes, I enable Bluetooth. All right. Well, you saw some other devices so hopefully my phone is going to work though because I'm going to need that in a minute. All right. I have to unplug my battery. Let us continue, my friends. Let us continue. We have very little to do and lots of time. Now wait. Strike that. Reverse it. All right. So we saw this very good. We saw the code. We saw a demo. All right. Cube life. So now we saw what a central can do, right? Can scan, look for stuff. So now let's see Conway's Game of Life as an LED cube with Bluetooth control. So now if you don't know about Conway's Game of Life, it's in the category of mathematics known as cellular automata. It's basically a zero player game. It plays against itself. And I'm using a package called Vita from friend and collaborator, Alessandro Cipata. And I'll give you a quick demo. It's a very cool package written in Go. I need to slow it down because it's way too fast. So this is written in TinyGo and compiled to WebAssembly and running in your web browser. But I don't have time to talk about that right now. Just suffice to say we're going to play six parallel games of life. And we're going to use Go channels to do it, to communicate between these six different universes that are all running on this same processor, which is an Adafruit Metro M4 airlift with six Hub 75 displays. So the Adafruit Metro is pretty much the same processor as the badge. It just has slightly less RAM. And it has an onboard ESP32 processor. So then the Hub 75 panel is a typical LED panel that you see on billboards and other installations. And what we're going to do, let's take a quick look at our architecture here. Our primary processor is running the code to play the games of life. Then our UART is going to communicate via Bluetooth using the built-in ESP32. Then with the SPI, SPI interface, we're going to communicate with all of these six different panels so that they can communicate with each other. Let's take a very quick look at the hardware. Wait, where did my camera go? I probably unplugged it. Okay. So this is the LifeCube, CubeLife. And I left the front panel open so that you could see it, because it's very hard to fit everything in there and the battery. But you can see here is the ESP32 chip. This is the Metro board. So let's plug this in without destroying it. This is like where I'm sawing the lady in half. Okay. So far it's plugged in. It didn't do anything. It's because we haven't flashed it yet. All right. So let's take a quick look at some of the code. We don't have time to look at all of it because there's quite a bit. But you can see we're bringing in the game package, which is part of the VDAP package. We've got our multiverse, which is a slice of parallel universes. We've got our colors defined. And then I've got two Go variables, one for the LED color, and one to track the number of frames per second. You'll see why we need those in a minute. So our main program, we initialize Bluetooth again, same as what we did before. We enable the adapter. We create our multiverse, which is out of these six universes. We connect them up. We set the first color for the cube. And then as quickly as we can, we draw the cube. We display it. We execute the universes against each other using a weight group. And then once per second, we display the stats. And then once per minute, we change the color. All right. So let's just take a very quick look at the Bluetooth side of this code, since that's what, in fact, this talk is about. So we can see here, it's basically the same as what we saw before. We have the default adapter, which is the built-in Bluetooth adapter on the board. And we have a couple of services and characteristics, which are custom ones I defined. These are not well known ones, like your heart rate. This is the Conway Game of Life cube service and characteristic. So we'll try and establish it as a standard later. So we initialize our Bluetooth. We set our connection handler. So when we connect, we're going to do something. We enable our stack. Then this is, remember, this is the advertising side. It's saying, what can I do? It's going to say, my name is the tiny go life cube. I'm going to start advertising. And my characteristic for the color here, we can see that it's got a unique identifier to say, yes, I am that color. The value, remember that LED variable? You'll see why we're going to use that in a minute. Our flags, which say we can read, write, and that's about it. And then when we write to it, we're going to change the color to the ones that were sent to it from the other side, the mobile phone in this case, and then reset the cube. And then once per minute, if you recall, we're going to, sorry, once per second, we're going to sand the number of frames per second. That way we can see how well is this thing rendering. All right. Let's flash some code. See what happens. Let's see, make cube life. So now it's flashing it, and I'm connected to the monitor, so it should show me once it reboots that, in fact, it is working on how many frames per second it's able to read. All right, let's turn on the cameras so you can actually see how cool that looks. So this is Conway's Game of Life, and each one of these panels is a separate universe, and they're using these gull routines and weight groups in order to communicate with each other. I have to be very gentle because it's been through hard journeys this cube. Somebody say Borg, go ahead, I dare you. All right. So now I'm going to use my phone, which I might be able to show at the same time. If the gods are kind, which generally they are not, but when they are, when they are, let's see. I forgot to change. Let's see. This is in 2024, and that would be make show phone. I've got a loose cable in there. Well, when in doubt, reboot the cube. All right, so when I say show phone, oh, I forgot to plug in the phone. Let's plug in. Again, you said without wires. Okay, it should be good. I tried to get this to work without wires, but it wouldn't. All right, so there's my phone. And let's see, where's the cube? Cube didn't move. I think I really do have a short in there. I shouldn't have moved it quite so roughly. Let's just flash it again real quick. Just because, you know, that always works. So you'll notice that we do not optimize for compilation time. I have a very loose cable in there. Let's just pull some stuff out and it'll be fine. You saw love coding before. Have you ever seen live connecting of wires? It's even more stressful. Well, let's just change the color anyway, because, you know. Is this a time to remind you that the emergency exits are located in front and in back of the room? Please locate your nearest exit. Well, there are a lot of Bluetooth devices in here. Let's see if we can find this one. Somewhere, somehow. Is there a search function? You know, there probably is. I just don't know how to use it. I don't know how to use a mobile phone. I just saw my watch. There it is, the tiny go life cube. All right, so let's connect to it. We'll see if it'll actually connect. Sorry about that loose cable. It makes it look really neat though. We need to combine this with the MIDI demo. Then we're getting into something. Okay, it's thinking because it has to render all the frames at the same time and it also has to communicate, so that's kind of pushing it. Plus, there's a lot of wires between me and the antenna now. I'm probably just left it out. Oh, there it is. All right, so let's go and we can see the current color is supposed to be red. And let's turn on our notifications and we can see right now it is rendering at 6.4 hex frames per second and it's able to play 0 C hex games per second. And you see those numbers changing, it's pretty real. Let's go and change the color, which let's go and make it like a really psychedelic color. 224488. Yeah, I'm feeling that, you know. All right. Why? Is this actually correct? It's fine. Yes. Really? Yes. Well, then I got cheated in the beginning. I cheated myself. All right, so let's take a quick look here at, I guess we'll make our last demo. So I was going to show you, well, maybe I can probably pull it off. I want to show you flying the drone with my badge. So I've got two different drones. One is a Bluetooth drone, so we're going to talk from the badge to the Bluetooth drone. And let's take a quick look at the cut. You've seen the hardware, it's that same badge. Like it's literally the same badge. So what we're going to do here is we're going to connect to the drone, right? We're scanning for it, we find it, and then we start sending it commands, which will basically go forward and backward based on how I press the game pad keys. All right, so let's go back to the beginning. So if we make mini drone, also let's try, let's not try that. All right, I have the badge. The badge is plugged in. I can flash it. No, I cannot. Oh, I'm sorry. There we go. Yeah, that's it. All right, no, not cube life, sorry. Make mini drone. Go mini drone, go! Now we have to actually turn on the drone. No, I connected a little camera to it, which I'm not sure if we, no, we can actually do it. We can maybe pull it off. So it's a little first person video camera that actually is analog connected to this toy parrot drone, because they won't let me have real drones anymore after that last time. That was not my fault. I mean, I don't think it was. Honestly, I'm not sure. It might have been my fault. Okay, so let's stop this video. All right, and let's plug in. Where's my camera? There's that camera. Let's plug in the drone video just because, I mean, hell, what could go wrong? Drones with cameras. All right, yeah, and that's you. All right, so now let's go and flash, did I flash the, yes, I flashed the badge. It looks like. Let's restart it. I guess we'll find out if we're going to have connectivity. It says flight badge on there in cool ASCII font. Wow, there really are a lot of devices. It's still thinking. That really is a lot of devices. Geez. Hmm. Well, when in doubt reboot because maybe it'll come up earlier in the list. Enabling Bluetooth adapter scanning. Searching for, oh, so many devices. Wow, I really, I mean, there are really a lot of devices. I didn't, I actually did not think about how many devices would be in that scan list. Got that proper analog flicker, though, doesn't it? Hmm. One last try to reboot the badge and then I'll fly the other drone because that's Wi-Fi and that will for sure work better, right? Oh, wait, I think it found it. No, wait, no. Oh my God, it's just like, there must be like 100 Bluetooth devices in here. I guess we found out that it doesn't fail. It just doesn't work. Well, let's try the other drone then real fast because, you know, we don't have a lot of time here. So I'll unplug the adapter. Let's plug in this one. Put the original camera back. That was going to be so fun, too. Well, this is actually going to be pretty fun as well. All right, so this is the Tello drone. Tello is a drone from DJI, which is one of their toy drones. Again, I told you they won't let me have real stuff anymore. I do have a small unmanned aerial systems license from the FAA, which means I'm supposed to know better. That obviously is not true. All right, and so let's see. I'm going to, this is the same code, but instead of using Bluetooth, it's using a Wi-Fi interface using our new net package, which gives us much more compatible networking support in tiny go. So this regular garden variety go code that uses networking will probably work now. Let's see if that's actually true. So we'll flash the Tello. Well, now it's working because it will flash. You get to see my cool ASCII art. Okay, now it's scanning for the drone, which, where do I have it? Oh, right here. Oh, it started. Drone started. Okay, it's ready. All right, so let's see here. What should happen? Oh, yeah, so this drone is a lot more advanced. And yeah, I can fly the round with my badge. Look, Ma. Wait. Oh, man. If I plug in the battery while it's flying, what will happen? The batteries plugged in. Let's unplug the power. Can I still fly it? Yes. No wires. So this is pretty real. Now this drone does some neat stuff. It's got tricks. Like it can do flips. They told me I should never do what I'm about to do ever again. I clearly did not understand the simple instructions. There actually is a dance function. I just didn't get it to work. But you know what? I feel like I've really pushed my limits of what's safe by a pretty... Come on, little drone. It's okay. Thank you, Ron. If it takes off by itself, I swear that's not my code. All right. Wow. Well, I went a little over. All right. Sorry about that. But it was really fun. You know, we did Wi-Fi. We have a major rewrite. I told you all that. More flight testing. We just did that. So there's so much more happening. TinyGo has over 14,000 stars on GitHub now, and we're doing 150 contributors, over 100 different boards supported, over 100 different sensors, displays, and other peripherals supported, Bluetooth, Wi-Fi, LoRa. There is so much happening. More wireless chips, more wireless protocols. So connect all the things using Go without wires. That's all we want. Thank you, Ron. Thank you.
turnip: Update on Open Source Vulkan Driver for Adreno GPUs
Hello everyone, thanks for coming here. I'm Danilo, I've been working on Tornip Driver for three years in Igelya. And I want to give a status update, what we have achieved so far, and what's coming for us. Let's start with the new hardware we support. We now support a lot of hardware. And recently we started support at 700 series, Adreno GPUs. We already merged Adreno 730 and 740, and the merge request for the most recent, Adreno GPU 750 is being on review. There are a lot of changes between Adreno generations with four mostly performance reasons. There are registers changed, and many new performance features out there. We also currently implemented only direct rendering and not tile based rendering. Adreno GPUs are a bit weird because they support two modes, tiling and direct rendering, which is the same that desktop GPUs support. But tile based rendering is still working progress for now. We also support a lot of, almost all, 600 series GPUs, but there are some variants out there we don't support. There are five sub generations of 600 series. We support all of them. So to add a new one, new variant of the GPU, we just need to change some registers there. As for our features and extensions, we now support Falcon 1.3 and a lot of extensions with it. Most interesting one for us was dynamic rendering. It's rather simple for desktop GPUs because they don't care about render passes boundaries, mostly don't care about them. But for tiled rendering for mobile GPUs, it's a big deal. We have to stitch together the render passes, sometimes even at the submission time. It could be really nasty. Like the code is bodily readable for it. And we have all extensions implemented for DxVK, D3D Proton and for Zinc supported. So it's great. While we do not claim Falcon 1.3 conformance, we do regularly test Vulkan CTS. We test a lot of game traces, we test games, but with games it feels like a vacuum all game right now because there are not a lot of real users out there. And we don't have a proper CI with game traces, like Radvid does. Another big changes we've done are in pipelines. Our GPU has some unique way of dealing with pipelines and with all the new pipeline related extensions, we have to rewrite them every time in some way. But thanks to Conor, Conorabot, our pipelines are healthy. We've done a lot of IRC optimizations, which is our backend compiler. They add up a lot with time passing. And we've done a lot of work in debug tooling because we have to reverse engineer GPU. We deal a lot of with unknown registers, unknown instructions, so we have to be able to quickly understand what's going on right there. So I want to spend some time on these debug tools we've implemented so far. I gave a more in-depth talk last XDC. You could find it at this link. So what's our debug tool? We have GPU breadcrumbs like in Google flight, graphics flight recorder. We have ability to reply common streams. We have ability to edit common streams. We can print for GPU memory. We could print from shader assembly in these common streams. And we could debug register reading of undefined state from registers. I'll describe each of these feature a bit more in the following slides. Why we even need our own GPU breadcrumbs? There is already a solution for this at Vulkan API level. It called graphics flight recorder from Google. It already could tell you where Hank occurs at which command, but there are two issues with that. It's two cores because for example, the start of the render pass could translate into like 10s or 20 bleeds at the worst case and each of them may hang. So API level tooling could be like not great at this. And what's really prompted me to create the breadcrumbs to implement breadcrumbs in our driver is debugging of unrecoverable hanks. When your computer or board just completely hangs, you cannot do anything, writes to disk doesn't come through. Like graphics flight recorder doesn't work with it. And to make it work, you need some new Vulkan extension and so on. It was much easier to deal with in the driver itself by doing all the things synchronously. And it worked rather great. But this tool is currently is not used too much due to the tooling I will talk about now. Okay, let's say you cannot even reproduce the bugs. Some bugs are random hanks occurring in different parts of the game and so on. So the easy way to reproduce them is just to record all comments submitted to the GPU and then replace them back. I mean, for most hanks and issues works great for reproducing them. There are a few caveats like it's necessary to record all buffer objects submitted and there could be a lot for some triple A game. So it works mostly for one frame or two frames. And not all issues are reproducible this way. There are some that are too finicky for this. But most of them are reproducible, so it's good enough. But it's not enough to just be able to replay the trace and see a hank in the mask. You have to have a way to narrow it down. So what we implemented is a simple way to edit the common stream. So we could decompile some submit to the GPU into very trivial packets. Like there are packet names only in comments right there besides some of them. It's really easy to do for probably any GPU and even in this form, it's very powerful because you could bisect the trace and find the exact comment which hanks even if you have like the comment. Even if it's impossible to determine from any other way how to deal with it. So you could edit some part of the packet and see if it helps. If it solves the hank, you could like deal with it as with ordinary code. What if the issue is inside the shader itself? We already could compile the shaders from assembly. So with this replay tool, we could add ability to just print some registers from the shader. And the most trivial print is good enough. So our print takes temporary registers for address and so on and registers to print. And print them. Like it increments global counter and tries to global storage and replay tool just reads from it and prints the registers. It's trivial and it was incredibly useful in reverse engineering and hardware. You get the trace from proprietary driver, you decompile it, you edit the shader to print something and you see the values and what's going on. It's incredibly useful. And the last tool in our tooling is the way to debug undefined registers, stale registers. A lot of issues are due to reading of like run value from the registers. Some state is not immediate. Even games have issues of not emitting some state and so on. A simple solution, at least for us, it was writing run values to all the registers and seeing what's breaks. And it mostly works. It's not that trivial because there are at least registers which are written at the start of command buffers and never touched again. And there are registers written in each, like in the render pass, like registers set that are set by pipelines. So we divided the registers into two categories. The ones that are set at the start of command buffer and the ones that should be stomped before each bleed and render pass. Again, there are some other caveats but it helped us quite a lot in debugging various issues when we implement new features. Let's forget about some weird registers. Okay. What are the real users of our driver at the moment? Like where you could see it. At the moment they are emulators on Android. Why? Because proprietary drivers are terrible on Android. Not due to their code but due to update policy of proprietary drivers there. They are not updated at all. So users are stuck with their terrible, many years outdated drivers. And with many issues, these drivers have many issues. They don't have necessary extensions. Like it's bad, it's really bad. And emulators need new features. They need for drivers to work. They push drivers to the limit. So if, so they, like for example, you now is able to load our driver, Chornip, and use it instead of proprietary driver. And it works rather well for them. And I remember some other emulators use the same technique to deal with issues in proprietary driver. Let's see an example. Here is some Zelda game running on Android on Adreno 650 with our driver. It's running rather great, even if it's a previous generation of Adreno. Like FPS is nice, runs correctly, it's great. So proprietary driver is a bit weird to say the least. Like maybe it works with the most recent one but it's hard to tell. Drivers are not updated. It's hard for users to update them and so on. So there are lots of issues and probably they don't test with these games. Okay, fair enough. We also don't really test these games. But the developers of at least Yuzu are willing to implement some debug tooling like recording the games, the game traces for us to easy to debug them. Because it's not that easy to launch a game without having the switch itself. Like it's not legal. Okay, earlier I said that Tornip implements all the features for DXVK and VKD3D Proton. So can we run desktop games? Yes, we can run desktop games. Here you see laptop X, X13S running cyberpunk. It runs via a lot of layers. Like you need FAC simulator to translate X64 assembly into IRM 64 assembly. You need Vine for Windows compatibility. You need VKD3D Proton and so on. There are lots of layers. So we mostly test game traces, not games themselves. We test games, but mostly traces because they are easier to deal with. But we will test games more soon. So what is the future for us? We need to support tile based rendering on 700 series because it would maybe not give a lot of performance boost for desktop games, but it would lower power consumption and help probably on Android for the games. Mark Collins, my teammate is working on it. And I hope we will see it merged soon. It would be great. And then we need to squeeze even more performance. There are lots of performance features we need to implement there. So even if we will not come to proprietary driver performance, we expect to be somewhere near it. At least we hope for this. I hope. And in the distant future, we want to implement ray tracing because at least like, 740 should be able to support Rayquery. And 750 probably could support ray tracing pipelines. I hope we implement this someday. And maybe we would be able to implement my shaders. That would be cool. Okay, another exciting development, not from us. It's not a Galeas project, but an easy way to run desktop games on Android. There is a work in progress project called Kasha. It's worked upon by one of my teammates, again, Mark Collins and some other people out there. It's an amalgamation of Vine, DXVK, VKD3D and FaxCore on Android. And I hope Jornip would have a first party support there. So it would be all bundled together and work together as one. Or you may say that people already are running desktop games on Android. Like here you see some person running Assassin's Creed on their device. Like it runs. Yes, that's true. There is project. There are several projects probably for this. It is done with Thermux. It's, I mean, I'm not sure exactly what it is. But it's even more unholy amalgamation of projects. It runs, it's really cool. But there are some performance issues, some issues with how all these moving scenes are are stuck together. But like people running games, desktop games on Android, that's super cool. Okay, that's all from me. For today, so you have some questions, suggestions. So you said you... Mike, Mike, no, okay. So you said you could use this on Android to replace the proprietary devices. Yes, you could use... So does that, okay, does it meet the root device or custom kernel? There are two cases. If you want to replace proprietary driver for the whole system, you need the root. You cannot change system libraries without root. But if you want to use a tournip for emulator, if emulator supports this, it could just load the shared library, packaged for it. So, and Google Play allows emulators to use custom drivers, they asked for it. And Google Play allowed it for this case. And the loaded driver talks to the proprietary kernel driver. Yeah, there is proprietary kernel driver, KGSL, it's a downstream driver. So we have backends for several kernel interfaces. That's right. Anyone else then? Sorry, will you recall the one with the upstream for doing all the kernel? Could you repeat the question, sorry? How would your implementation interact with the upstream kernel driver for the seven access? Do you go as fast as you can? We develop a Mesa for 700 series on MSM, on upstream. Not exactly on upstream MSM, because we have some custom changes to make it work. Not all of them are upstreamed, at least for 750 GPU. But it will be all upstream, we need it upstreamed. It would be there. But the kernel is not done by us, so we don't have much control. It's other people working on it. Okay, I guess that's all. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you.
Graphics stack updates for Raspberry Pi devices
So, as I said, thank you to attend this talk about the acceptance update of the graphics stack in the Raspberry Pi device. Thanks also to the organizers of the 4 of them and especially to the people organizing this DAPLOO. Which is great. So, let me introduce ourselves. My name is Juan and with me is my colleague, Czema Kacenova. We work in the Alia and the graphics team and we are working on the Raspberry Pi graphics stack. So, what is it to come out? It's basically cover the change that happened in the graphics stack since the release of Raspberry Pi OS Bullseye Edition was in November 2021. Up to the latest version, which is Boogworm. It was released several months ago in October 2023. So, I mean, this is for people that are not used to with, primarily with MISA. We have like five devices, Raspberry Pi devices, well, there are more, but are like variations of those devices. The Raspberry Pi 1, 2, and 3 use the GPU from Brathcom. It's called VideoCore 4. And the name of the MISA driver is called VC4. And then for the Raspberry Pi 4 and 5, they use the VideoCore 6 and 7. And the name of the driver changes like V3D for the OpenGLES. In this case, they are support for the Vulkan driver, which is called V3DV. So, what things happened? Well, probably the most exciting one is the release of the Raspberry Pi 5. This is evolution of the GPU from Raspberry Pi 4. It's an architecture, but with more benefits, how it really means like. It has like a higher clock rate, so it's faster. It supports up to eight render types. It has better support for subgroups operations, which is interesting for Vulkan. And that provides a lot of changes at the institutional level, so it allows to have more compact shaders which run faster. Drawback is that it has a bit of less register, so it suffers a bit of more pressure. And this is the support. It's integrated with V3DV-divis, and it was submitted for review almost the same day the Raspberry Pi 4 was announced. And now it's released in MSI 23.3, and that's in the current 6.8, which is required. As I said, this is more or less the evolution of the GPU, so the GPU front has the Raspberry Pi 4. So nowadays the features are more or less the same in terms of the driver implementation. So it supports the OpenGL ES 3.1 and Vulkan 1.2, and that supports a non-conformant version of OpenGL 3.1. We will see that at this moment. So from the point of view of the drivers in the MSI, the OpenGL driver, well, one of the important things was that we promote from OpenGL 2.1 to OpenGL 3.1 with some caveats, I'll explain later. I think this is quite important because at the end the Raspberry Pi is intended to be used as a desktop PC in most cases. So targeting the OpenGL desktop apps is quite interesting. So there are some applications that require OpenGL 3.0 something, and now they come on the Raspberry Pi. The upgrade from Bullside to Google allow us to expose 35 new extensions from OpenGL, OpenGL and OpenGL ES. And I was saying before, the driver is not fully compliant of 3.1 because there are some missing features in the hardware. For instance, this version requires 8 radio turrets. This is fixed in Raspberry Pi 5 but not in Raspberry Pi 4. It doesn't support 4. And then the hardware itself does seamless QMAP filtering, and the OpenGL spec requires no seamless. And then some other formats that are not supported. But all in all, probably these are not the most easy features. So we support anything else. So from a practical point of view, probably any application that uses OpenGL 3.1 will work in the Raspberry Pi. Then in the Vulkan driver, we move from Vulkan 1.0 to 1.2. So this is Vulkan 1.0, 1.1, and then 1.2, which meant exposing like 80 new extensions. It will compare both versions of the driver from Bullside to Vulkan. So there are a lot of new extensions. Some, I mean, I mentioned like extension dealing with sub-drops, as I said, which is very interesting for Vulkan. Extension dealing with geometry shaders. But I think the probably the most important work done was improving the performance. So when Vulkan 1.0 was released, the target was just having a performance driver. So we didn't spend any time on making it fast. And during this lifetime, we were working a lot on making it more performant, specifically in the shader compiler to reduce the liminal analysis and make strategies to make the shader smaller and faster. The good part is that the shader compiler for the Vulkan driver is actually shared with the OpenGL. So both the OpenGL and Vulkan share the same compiler. So all the improvements in the driver in the shader compiler also affect the OpenGL. So basically the improvements are both for Vulkan and for OpenGL driver. Another thing relevant to mention is that now Think, which is the driver that supports OpenGL over Vulkan, works with the V3D driver. So it means that you can use the Think to open up the applications. And then that's how we, well, we know Roman Stratenko was working on that in support for Android. So now you can run Android in Raspberry Pi 4 with the Vulkan driver. And now my colleague will continue with the workbook in the kernel. Okay, Sam. Well, continuing with our work on Vulkan on the Raspberry Pi, we need to implement several features that were not available in the hardware. We need to create what we call CPU jobs. That is part of the behavior that is not available in the hardware support for the GPU. So we implemented that in the Vulkan driver in the user space. But that implied, that was affecting mainly to some queries about performance counters, time-stun queries, and compute-shaded in this patch. So this caused issues because when we were submitting the different command buffers to the GPU, we need to start the driver of the GPU submissions, do the work in the user space, and then continue after having the result. So one of the improvements that we have just recently landed upstream in the kernel was these kernel CPU jobs. So we moved this operation that are already known to the kernel space. So when we are creating in the Mesa driver the submission, we are currently going to handle that so we don't stop the submission of GPU jobs. That was quite an interesting improvement in terms of performance. Because before this, there were a lot of stalls in the submission. Another feature that was quite interesting for the users at the end was to know if they were just really the GPU when they were running the different applications. It would happen for a lot of developers. I don't know if this is really working with the GPU. So we implemented GPU stats. We suppose these users' stats per process using the standard way of doing it in DRM. And we also suppose global stats. So this way some application just, if you want to know the global status of the GPU, just check that the value of the percentage of usage. Because in other case, you need to go to every process, check each process which amount of GPU has been used, and do the complete sum up. So because of using the standard interfaces, we can run application like GPU top. That is really nice because it works for several drivers. And at the end for the global stats, as we, there is no, no a common defined interface to expose that. We are currently using a CFS. So the hardware lacks some features to provide the stats as other drivers are using, in the case for something in Intel. So we are, we go to a simple approach. It is just, we put in the DRM schedule, when we submit a job to the GPU, we just get the time stop and the job ends, we get the finished time. As we are only processing one job on each queue, we have the information about how much the GPU was used. So we can show here, for example, it's on the top right of the, there is a graph with a widget that the users can check and the GPU users. And in the task manager, we already have the information about the GPU users. For example, in this screen, the main user of the GPU is Chromium, and the second one is the compositor, in our case it's Wayfire, because it's compositing all the different windows and surface that we already have there available. So, well, that's the highlights of the modifications from the kernel. And one of the main important changes that we did from Bullside to Bookrow Raspberry Pi OS was the change from the default desktop. Previously, in Bullside, we were running for the Raspberry Pi 4 devices, matter with Xserver, and it was being OK. And for the previous generation hardware, Raspberry Pi 1, 2 and 3, we were running the previous desktop we had, that it was an open-boss with Xserver. Matter was too heavy for the generation of hardware. And when we have this release of Bookrow, now all the Raspberry Pi's that we started, the public e-mails, they get a Wayland desktop using Wayfire. That was, and for Raspberry Pi 5, it was just a digital, it's the default one. For previous generation, we still maintained the open-box and Xserver, but I want to commend on this, now this is the last part of the talk. So, well, Wayfire is using OpenGL for doing the composition. It's based on WLroute's backend. We use the OpenGL, but it's quite tight for OpenGL. So, all the plugins are implemented there using the OpenGL API. One of the most important things we did in this transition from Bullside to Bookrow was that the user's experience don't change a lot. So, as we can see, Simon Long from the Raspberry Pi has a lot of effort here. So, it's difficult if you don't see the change of the background to figure out what are the differences between the previous version and the new one. That is Bullside and this is Bookrow running. So, all has been rewritten, the panel, the theme, because there are different compositors. And, well, now we go to the desktop on the previous generations of hardware. Well, we are still using the Xserver with Openbox. It's the file, the file we have. This has been the same way since Bullside, we didn't try to see it in two matters. The main cause of still using this is that we need to use sober composition. We use the CPU to render the desktop because the hardware limitations are supposed to have a memory limit that is 256 megabytes by default. The problem is that we don't have control when the GPU memory that is using the CMA, Continuous Memory Allocator, runs out. So, at the moment, we'll answer a new Chromium Brows tab that uses CMA memory. If we run out of the memory, next application that can do the following allocation could be the Xserver and it can crash or the compositor. So, the solution it has been there during all the time is on these devices, all we are using is CPU sober composition. So, GLAMOR has been off all the time and there is no hardware support. You can run a full screen application. All has been, you can enable it, but it's not the default. You can enable GLAMOR and you get hardware acceleration. But you are supposed to crash in your desktop at any moment. And there are a lot of hungry applications like the browser that can kill you if you open six tabs, you are completely frozen the desktop. So, during the previous development cycle on both sides, we wanted to make the possibility of enabling the hardware accelerated applications. So, if you want to launch your GLX-DRS, the GLX-DRS is not using a low beam pipe and it's using the driver for the hardware. So, we managed to do that. We enabled the hardware acceleration on the four applications while we were still doing sober composition for the rest of the desktop. So, in case you run out of memory, what is going to crash is just application. You are not supposed to the Xserver crashing or matter crashing or whatever application because they are not prepared for when you do a memory location, it fails. We assume that all the time it has been working. This was implemented modifying the mod-setting driving index server. We implemented the support for DRI-3 in this case, but without the need of using Glamour. How it is currently written is just Glamour enables the DRI-3. So, on crash-by-devices, we can use DRI-3. Even we don't have open GL during the composition of the general browser. There is a request for the server, but there is now too much interest in integrating that. We understand because Xserver development is stopped at the end. But we have been using that downstream for almost a year. It was a huge improvement for the users. With these changes, we avoid the problem of the GPU memory subsystem. When we were about to release Book World, the idea is that we are transitioning to Wayfire as the stock compositor. What can we do for the older generation devices? We need to rethink again how we solved the previous problems with Xserver, now with Wayfire. We need to do the software rendering composition using the CPU. We would like to allow again hardware-obscleted applications. The problem with using Wayfire to do the software composition is that Wayfire is quite tight to use OpenGL. It is using WR-ROOT's backend. As you have seen in parts of the code, mainly in the plugins, doing the different effects, we are doing calls to OpenGL API. We don't want to do that. The first thing is WR-ROOT already has a Pismand backend that is working. You can just transition and the parts of WR-ROOT that are using Wayfire, just do small changes to use the Pismand backend and it works. The next part is we need to reimplement all the parts that were tied to OpenGL in the different plugins that we are going to use in the distribution. There are some that are quite complex that we didn't need, so we didn't implement the change, to use the Pismand rendering logic. This way we managed to get all the rendering done by Wayfire to be done using CPU rendering. The problem is that if you do that and you start doing blending operations, in architecture they become really slow. Reading from the even buffer when they are doing the blending, assuming that we have synchronous memory, all the changes are flashing at the moment, it is terrible. We experimented with enabling for the buffers, we used for doing the run buffers, to use non-coherent memory. That makes that if you write on the CPU and then you put it on the display, maybe there is not coherency. So you start needing to flash in some places, you need to handle that. Some things happen funny because in 32 bits, IRM is different to 64 bits. Things that work in one place, in 32 bits you can just, I'm going to flash the memory before putting it on the display, and it works. In 62 bits, it doesn't work. At the end the flash is not doing anything in architecture. So we need to handle the synchronization but in the compositor to do that work. The difference of that change is that everything runs fast enough. The problem is that when you enable non-coherent buffers, you only want to do that for the compositor, but not the rest of the application. So that is complex because some applications don't work on the core and buffers and we are dealing with that, maybe enabling it with a parameter, creating a new IOCTL for the getting non-coherent buffers or what. So well, and for the other part, that was quite fast as we already know, the part about getting hardware-assisted applications because we already have the knowledge of doing this on the X-server, so at the end we need to handle in WOL routes to pass in the pieceband backend, to pass modifiers with the memory buffers, and it was already working. So I'm going to show this is the current working progress we have with this work. This is our Raspberry Pi 3 running the desktop. It's using the non-coherent buffers. In other case, you will see how the programs are moving. The performance is quite good. Some of the more complex things that are most expensive is the shadow calculation. You cannot imagine doing this in the CPU. Every time you scale a window, it's complex. We are seeing that these DLX-servers are using the hardware acceleration and it's not the best thing that we can do because there are possibilities of having another different plane to show that there is no display, but we are bleeding this by the compositor. We have enough time, I think. So we are going to see several plugins working. So as a conclusion, we are on the point of maybe thinking about putting this for the users, but it's still not ready. One of the things that Raspberry Pi devices have is trying to maintain all the generation of hardware because you can run the last Raspberry Pi OS with the Raspberry Pi 1 and it will work. We have already tested it with the Raspberry Pi 1 that has slower memory. Juan was doing that this and was good enough. It's comparable with the results we are having with the X-server. We are seeing Chrome running, that is using hardware acceleration. The good thing is that as we are not spending memory of the CPU, we can run more applications. But you can crash Chrome in just open, I think it's eight apps. You will crash Chrome in, but only in some cases, only one window. This is the Zoom working, this is all software composition. And I think that's all from us. We already implemented, this is the switcher that has Wi-Fi by default, we implemented with Pisman and we tried to do a more simple option, but this one was already working fine at the end. So we are maintaining the most complex part is doing the transparency and using the alpha channels in software. So, question, I think we are on time. What features do you need the CPU to actually get into the job? And are they used a lot, the applications need them, will that impact the performance? Well, our colleague, Mayra Canal, has a lot of positive planning in this and they are really out. The question is, which features in particular are needed to be done in the CPU and cannot be done in the GPU? I think I already commented, there are some things related to performance counters, mainly if you want, when you are running the CPU commands, to reset the counters. These need to be done in the GPU. No, you need to write in a resistor in particular. The other one is related to getting the time stamp, because there is no support to get the time stamp from the GPU. And the other one is indirect computer shader dispatch. Is that when you are sending several instances of a computer shader, in this case you need to send in the CPU one by one, because there is no support in the GPU. So you just submit the buffer to the kernel and the kernel is going to handle that. In the other case you were in the user space, you send one, wait, and you are going to send one by one. So, well, time's up. Thank you very much for your attendance. you
Delegated compositing utilizing Wayland protocols for Chromium on ChromeOS
So, hello everyone. My name is Maxim. I'm a browser engineer from EGALIA and today we are going to talk about the delegated compositing utilizing Willem protocols for Chromium on ChromoS. And here's our today's agenda. So first we talk about the goals and the motivations of the project, why we have Willem on ChromoS and why it's in Chromium. Then I will talk a little bit about what lacrosse is. Also I will need to cover a little bit about the Chromium display compositor to give you some of the ideas, why, how it works and why we actually needed the delegated compositing there. Then about the delegated compositing itself, the Willem protocols and a big picture of what we actually have. So Chromium and Willem on ChromoS. So there are quite a few vendors who are shipping ChromoS on their devices and as soon as the devices become, well, they are aging, right? So they are not receiving the updates. That results in having them with the old browser and so on. So in order to improve that and improve the maintainability of the devices, it was decided to split the Chrome browser from the ChromoS system itself because they are tied together. And that would also make it possible to receive them the browser updates. But how is it possible to do that? So the idea was to decouple the browser, as I said, from the operating system itself. That was called the lacrosse project. And the ChromoS itself, it has a system UI and the window manager called Ash. And yeah, Chrome was tied to that operating system. And at this point, there was also a Willem implementation already in ChromoS and it was decided to use Willem. So basically in 2015, if I'm not mistaken, the ChromoS received an own Willem's version of the implementation called Exosphere. It's currently used by ARC to run 100 apps on the ChromoS and also Crestini to run Linux apps. And in about 2016, we started to port Chromium to Willem and on Linux, you can use Chromium with Headless, X11 and Willem. So it was kind of a natural choice to employ that implementation and have a browser running them. And basically Willem is used for graphics and the wind handlings with stable protocols employed and also with some custom extensions employed. And for the high level features like file picking, Cross API is used. Well, it's basically Google's implementation called Moja IPC. This is similar to Win32 and Cocoa. But what is Lacrosse? So Lacrosse is a project to decouple the Chrome browser from the Chrome OS window manager called Ash from the System UI. So on this box, on the green box, you see the Chrome OS operating system. And on the yellow box, you can see the Lacrosse browser, which using Welland backend through the Ozone layer. The Ozone layer is basically an abstraction in the Chromium browser, which allows you to implement on backend. And as a sent on Linux, it's X11, Headless and Welland. And it's switchable in the runtime. For the ChromoS itself, it runs on the DRAM, but you can also use like X11 and run ChromoS emulator on your Linux device. So the Lacrosse is using Welland to communicate with Exo, which is in built in the Chrome OS, which actually forwards the input devices and has some graphics communication there. But there was a problem. So this split resulted in performance and resource cost. But why and how to mitigate that? To understand why it was causing a problem, we need to switch to the Chromium display compositor and understand a little bit how actually Chromium draws frames. So as you may know, Chromium has multi-architecture, multi-process architecture. So we have a GPU process or this service process. And also we have clients, which are the render process, the browser process. There's also this video client, which sends the video frames. So basically, we call them the frame things. And basically, the way how it works is that if we are talking about this GPU acceleration and the GPU rasterization, the way how it works is that, for example, if we take the render process, it prepares paint operations for the compositor frame. Then when we are preparing the final compositor frame, we submit those paint operations to ski on the GPU process. That is called the GPU rasterization. And we prepare textures. And these textures basically represent tiles if we divide the whole window to the tiles. So those represent tiles. And the compositor frames, they have references to the tiles, including some frame data like masks, filters, clipping, and other stuff. And on the right side, you can see the vService process or simply GPU process. It represents clients as surfaces. And each of the surfaces has own compositor frame. So we need to aggregate all the surfaces into a single compositor frame and do the final compositing. So this is a high-level picture, high-level overview of how it was working before the delegated compositing. So Lacrosse was aggregating the quads that would end up creating a final surface. And that final surface was, of course, represented by Zingobuffer. It was sent over Weyland to Exo. Then in the Ashcromb site, Ashcromb you can call HromoS, it was like maybe getting some other frames from other windows, I don't know, some system settings if you open that one. And it was doing the compositing once again in this step. So that resulted in double compositing and bigel resources overhead. But how to fix that? And the solution to that was to use the delegated compositing. So basically, we left the aggregation step, we created our final compositor frame, but the quads that we got, which are basically the textures, all of them must have been sent over Weyland protocol to Ash for the final compositing. And of course, I need to say, basically, this is about serializing the HromoS compositor frame, sending over a couple of IPCs through Weyland to Ash. And basically, it was at this stage, deserializing the data that it received, and it basically created, must have been creating the same kind of browser frame for the final compositing. And in order to achieve that, a couple of, well, at first I was thinking that there's actually more things we implemented, like some custom things. But in the end, it wasn't that much. So Weyland subsurface, that is standard, right? Each quad and, well, let's say we were sending quads as overlays, they were represented by own surface. Of course, Weyland buffers and explicit synchronization protocol, because we want the job to be asynchronous. And the main thing is surface augmenter, right? Because we wanted to have this data to be sent from Hromo, Hromo browser, basically, the compositor frame, with this additional information like rounder corners, clipping, also pixel precision, this is one of the important things. And we needed to make our own protocol extending the Weyland surface. Also we used, in the beginning, we used our own protocol for the single color buffers, but as soon as in the upstream, there is now, right now, a single pixel buffer protocol, we just employ that one, so that we don't need to create a real buffer. At first, when nothing was there, we were just clearing a buffer to a certain color, but that's not really efficient. Yeah, why we also needed to pass this kind of round-end corner and clipping information? And the reason to that one is basically because when Hromo sends, it thrusterizes the quads to the textures, those do not have any masks, right? So when we do the final compositing step, we apply those mask filters and so on and send them to Skiya, which does the final picture for us. And for the pixel precision, the problem is that Hromo basically works with the pixels, and as long as Weyland uses deeps, it resulted in some pixels losses. And when it was compositing the quads together, we could see some of the glitches. For that, to overcome that, we actually added some additional stuff to the surface segmenter and started to pass this information using VLFixed, basically, which allows us to use some floating wallets. It was also required to update the VP port, this destination, and some of the other stuff, like setting trust form, setting trusted damage, because when we, for example, change the order of the Weyland subsurface, this Z-order, at some point, we don't need to recalculate the damage or do we need to recalculate that? So basically, all that is managed with this additional fact. And there can be some other stuff, but I would say that was the most important one. And so this is the big picture how everything is implemented. So we have, like, on the top, Lacros viz process and the Lacros browser. So Lacros viz is basically preparing the frame with the quads and sends over the Weyland to the ashram, which then creates the same compositor frame as Lacros would have if it wasn't delegating but was compositing itself. Prepares the final frame, prepares the overlay and sends it to the DRM and that's it. You have the final frame with the system UI and the browser content as well. That's it. Questions? No, go ahead. Yes? Well, I can just repeat the question. Okay, so the question was whether the GTK and QT can also benefit from that. Do you mean the Chromium browser or you mean itself? No, just regular apps using GTK or QT. Yeah, I think so. Basically, if it's possible to have the double compositing, it is possible. We had to use some additional protocols because as long as Chromo is a really closed environment, we can do whatever is possible, whatever is convenient for us. But I think that is possible for the GTK and get this improvement of the performance as well because if the Weyland compositor can do that, why not? Yeah, basically in similar direction, but the Chromium on a regular Linux Weyland compositor, I mean, that would benefit from such features as well. I mean, there is double compositing again. So, have you looked at getting up to or generic protocols to manage that? So, now you have custom protocols, right? But for it to work on a regular Linux Uniday, yeah, a generic protocol. Do you look at doing that? So the question is basically about if Chromium Linux can benefit from the same implementation and whether we considered creating some generic protocol and upstream that. Well, if we get back to this pixel precision and the rounding corners, for the pixel precision, if the browser doesn't work in some custom scale, it's one, right? So it's fine, we don't need this kind of protocol. But for the rounded corners, well, probably we could do something like do this processing on the Chromium side, but it's not very efficient, right? Well, it should be possible, but creating a protocol and upstreaming that, it will take some quite some time. I personally did not thought about that, but it's an interesting concept for the future, of course. I mean, especially for embedded, it can also help if you, I'm guessing part of the subsurface is offered, for example, the video in the browser. If the compositor on the embedded device can then put that video on the plane, the rest is not branded, then you can benefit from these kind of things much more easily. Yes, of course. Do you delegate all the compositing and the compositor can decide what to put on the plane? Well, at least we can submit this video frame as an overlay. If I'm not mistaken, there was a, from somebody doing this, this forwarding Chromium, if I'm not mistaken, I actually saw this by the patches. I think that landed from the problem later. Yeah, probably, probably, yes. I didn't pay attention to that. I was busy with the Chromium itself. Yeah. Yes. What's the granularity of these subsurances? Like, how many would you expect to have on a regular webpage? Are we talking like almost every screen element or is it the more hard to think? Are you compared to like CIL? Well, if you just take a normal page, right? So, the question is how many subsurances we are going to have, I mean, how the page is itself like divided, whether we are going to have each, for the each element, sub-sub-surface or it's kind of done in other way. Well, basically, if you imagine a page, right, as a simple page, there are no additional textures and so on, we can split the page to the tiles, like there will be, I don't know how many, maybe six tiles, something like that. So, basically, this is how much you are going to send. But if you take like, for example, motion mark, right, there are some tests, like images tests, it can create hundreds of those textures. Then we are starting to send all of them over the pipe. But there is a limit for the IPC. So, we have to limit this, the number of quads that we are able to send. And if I'm not mistaken, it's limited right now to 50. Because after this while you, it just doesn't make sense to do any delegation. It's kind of become too expensive in terms of, I mean, there will be too many subsurances. If we could like squash this together, that would definitely help. Because it seems like it wasn't like a use case that was thought when the wheel and was designed. So, any other questions? Thank you.
Greenfield: Wayland in the browser, an update
Hello everyone, welcome. My name is Eric Dereke. I am the author of Greenfield. Previous talk we saw how Chromium can run on Wayland. Now we're going to see how Wayland can run in Chromium. So this is an update about Greenfield. I gave a previous talk in 2019 about Greenfield. So basically recap what is Greenfield? Greenfield is a Wayland compositor. I'm sure most of you already know what a compositor is. That runs entirely in your browser. You can control remote applications with it from different servers if you want. X applications as well, X Wayland applications. For that, that's also a fun fact. I actually had to port XCB library to JavaScript, to TypeScript. So I have a Python parse X protocol files and have it output JavaScript and then you can run it in your browser. Works perfectly by the way. You can also run Wayland applications directly in your browser. So actual native desktop applications compiled to WebAssembly and have them talk Wayland protocol to Greenfield inside your browser. We're going to see some demos at the end of the talk as well. It doesn't have to be WebAssembly applications. By the way, you can just write JavaScript applications. They just have to talk Wayland to have their pixels displayed on the screen and to handle input. So that's basically Greenfield in a nutshell. Why? For God's sake, why would you write this? Well, because I can, that's basically the crutch of it. No. I had basically an idea. Wouldn't it be cool if we had a desktop that you could run anywhere? So not tied to a single physical piece of hardware you had with a single interface for all your applications. So no longer bound to say, yeah, a smartphone or a desktop or whatever. And yeah, so, and be free basically to run it where you want. So you're not bound to any SaaS provider or anything. Just have your own server or servers or whatever and have the application distributed and basically access them anywhere you want, anytime you want without restrictions purely for you. I thought that would be pretty cool. And it didn't exist. So a set hell, I'll write it myself. Well, how is Wayland written? Well, in the browser, you're limited to the JavaScript and WebAssembly. So in the case of Greenfield, most of it is TypeScript that's then processed to JavaScript. And there's also a bunch of C involved, mostly existing C libraries, things like lipxkb common is compiled to WebAssembly to handle all the keyboard layouts and keyboard encodings. There's also lippixman basically to region handling. I'm not going to reinvent the wheel, I'm not going to write those things in JavaScript. So I took the existing libraries, ported them to WebAssembly and made them accessible inside the browser. The display part of Greenfield is written in WebGL to have any kind of decent performance. It's very similar to how a normal or say a native Wayland compositor would would composite using OpenGL. So in the browser case, it's WebGL. To deal with the remote applications, that was you need basically a proxy. So the native applications have to find a socket to connect to a Wayland socket to connect to and have those messages handled, have the native buffers handled and have the protocol sent to the browser. So I wrote server in TypeScript. I'm sorry guys. I needed something that was I could prototype fast and without dealing with seq faults and everything. So I did it in TypeScript and it grew a bit and now it's a bit bigger. So I guess some point in the future, I promise I will rewrite it and rust somewhere in the future, I guess. Also the performance critical parts of the whole pipeline are written in C using Gstreamer as well. So there is a tight integration there. Gstreamer, really cool project. It does basically everything I needed. Very modular. I could do all kinds of cool stuff with it. I'll show that a bit later. If there are any Gstreamer faults in the room, the only thing I missed a bit was color conversion on the GPU. So a hint that would be cool to have. At some point, I also wrote a Kubernetes implementation basically. So that basically means that you could use an entire Kubernetes cluster as your desktop. So every application would be its own part. So you could run 20 performance intensive applications, have them all run on separate machines, all running smoothly, basically giving you virtually unlimited resources on your desktop. So that's going in the direction I had in mind. Sadly not open source for the moment. Maybe someday in the future, if it's gone out of prototype phase, perhaps I can give a talk about it too. And last but not least, blood, sweat and tears. This is like a side project, a hobby project. So I was not working on it full time. So at some point, I had to sacrifice a bit of sleep to make some progress. I do not recommend, by the way. So here we have an overview, a pipeline overview, how frames are sent from your native site to the browser. So on the top part, we basically have everything that happens on the GPU. On the bottom part, we have everything that happens on the CPU. So we're going to start from left to right. On the left, we have an application that renders either on the GPU or on the CPU using GPU memory buffers or shared memory buffers on the CPU. When an application submits those pixels to the native compositor, so that's our proxy compositor that I talked about before, then those buffers are basically handled and they are color split. So we need to get these buffers to the browser and we need to compress them. Ideally, we do that using video encodings. But to do that, there are a few issues. Most color buffers that we get from the applications, they have an alpha channel. This day, we have cool virtual reality headsets that just launched. We have awesome artificial intelligence. We do not have video codecs that can handle alpha encode. So I had to split the color into an RGB and an alpha and those are actually two separate video streams. So the alpha color conversion is handled as a black and white video stream. This is done on the GPU if possible and there is a fallback to CPU encoding if it's not available. There's also the color conversion that also always happens on the GPU. And then those two frames, those two video frames are then sent using a WebSocket server to the browser. A small remark there, WebSocket server, it's TCP. We're dealing with a real-time application here. Ideally, ideally, people would use something like WebTransport which uses UDP but it's a bit still experimental. Not a lot of implementations there for the moment so we're stuck with WebSocket. The frames arrive on the browser and there we have to decode the video streams, the video frames. This is done for Firefox only, sadly, using WebWorkers where I ported FFMPEG to WebAssembly for H.264 as well. To do that, for the other browsers, I used the WebCodec standard so I can have the browser do all the artwork and have the video frame decoded. And then last but not least, there is a WebGL shader that puts the colors back together. So we have the RGB frame, we have the alpha frame, we stitch them back together and we push them to WebGL texture and paint them nicely on the screen. That was a long trip. This is all how we go from the application to the compositor. So far so good. But Wayland also has a few mechanisms to tell the application, hey, I'm done painting your pixels, you can send me the next batch if you were to please. But there is a bit of an issue here. Wayland is an asynchronous protocol but the dealing with sending a frame and the compositor telling the application, send me the next frame, that's a whole slow synchronous process. So what that means is that if this pipeline were to take no milliseconds, you still have your network latency. Say we have 50 millisecond network latency, we immediately send the frame to the compositor and then the compositor tells the application, okay, I'm done, you can send me the next one. You have about 50 milliseconds between each frame. So that gives you around 20 frames per second. That's not really acceptable. So we're going to have to be a bit smart and we're going to have to tell the application in advance, like hey, you can already send the next frame because I think the compositor is probably already done handling your previous frame. So that is basically the round trip latency problem that we have. And there are generally two mechanisms in the Wayland protocol. The first one is the one I just talked about which is the frame callbacks, which is the compositor telling the application, I'm done painting your pixels. The other one is the sync request. The sync request in the Wayland protocol is a way for an application to know when all previous requests that have been sent are done. So when an application sends a request to a compositor, those requests are queued up. And when an application sends a sync request, then the compositor will simply, once it encounters that sync request, will simply reply with done. And then the application know, oh, I got a done response. So that means all previous requests were handled. So to deal with that, as I just explained, there is the predictive frame callback. This is how we deal with the sync request issue. So this is a complex picture, a bit out of scope to go too much in detail. But what generally happens is that the proxy compositor on the native site analyzes all requests that come in. And as soon as it receives a sync request, it will look at all the previous requests and see, hey, those previous requests that I just saw, none of them is going to send a reply. So you know what, I'm just going to send the done event immediately, and I'm not going to wait for the compositor to send a done event back. And that basically circumvents the whole network around the problem. The only potential issue you can have there is that you're basically got rid of your throttling. But there is some intermediate protocol to deal with that between the compositor and the proxy on the native site. So this is explained in the picture here. On the top, we have the requests that are coming in in the compositor in a classic Willing scenario. On the top right, we see, hey, there is a sync request, and it takes a whole network round trip time before the done event is sent. And the fastest placing, we have the same scenario. We have all the requests coming in. And at the end, we have our fast sync handled by the proxy. He sees, okay, no, events are going to be sent. So I don't need to wait for the compositor to send any other events. I'm going to send the done event immediately. And that makes the whole pipeline asynchronous. It makes the whole pipeline fast enough. If you have fast encoding, fast decoding, fast enough for gaming as well, you can do 60 frames per second or more if your hardware is fast enough and your network can deal with it. So far, the remoting part. Greenfield can do more than just remote applications. As I said before, you can also run applications directly in your browser. To do that, there is a prototype Greenfield SDK. It's based on EM scripting because it's the most complete, somewhat post-excompatible SDK out there that is aimed for the browser. This works fairly well, but it has some disadvantages as well. Wayland applications are Linux applications, well, almost exclusively. And the EM scripting SDK aims to be post-excompliant. And that's not Linux compliant. So things like E-Poll are not implemented in EM scripting. So I had to add those, well, at least E-Poll in the Greenfield SDK because Wayland requires it as well. There's also the core Wayland protocol is implemented in Greenfield together with the XG protocol. That works well for desktop applications, gives you a good standard to work with most office applications. They work out of the box. In the WebAssembly implementation, there is currently only support for shared memory buffers. We can theoretically do WebGL if we were to port Mesa to WebAssembly and utilize a custom WebGL Wayland protocol. The protocol exists. It works. It's simply currently not implemented inside Mesa itself. That's some future work, I guess, to support that. But we can perfectly support WebGL. No issue. The work just here needs to be done. So how would this all look while we have a nice green diagram here to show you how this looks? We have our main page on the left that loads your composite, that loads Greenfields. Next to it, you will have an iFrame. The iFrame loads your WebAssembly application, loads it in, and then basically uses internal iFrame messages to your main thread. Talking 100% pure native Wayland protocol works fine, works great. Next to it, we have transparently also the remote applications that are running. For the Greenfield compositor, both applications are just your ordinary Greenfield applications. He sees no difference. With a small remark that the Way file descriptors are handled on the protocol is a bit different. In case of not remote applications, the file descriptor is basically a URL that's passed around and that's opened and transferred and closed whenever needed. In case of native or rather browser native applications, it's a transferable browser object that is used. Those two file descriptor discrepancies have not yet been bridged, so you cannot do copy-paste operations, for example, between a browser application and a remote application. What you can do, copy-paste between a remote application that works just fine. That's how it currently works. What would the future look like? There are lots of cool stuff that can be done, that still needs to be done. There is the issue of sound. There's currently no sound. I initially left it out because it's a bit out of scope for compositor or Wayland related stuff, but somebody also already did their master thesis about its implemented sound in Greenfield, using pipe wire and G-streamer worked really well. Prototype exists and can be implemented, I guess, pretty easily. There's also the need for a bit more Wayland protocol, so there is just only the very basic currently implemented. There is no unified file system, again a bit out of scope for Greenfield, but it doesn't really exist, so it would be nice if we had it. Imagine you run applications on different servers, they all see their own local file system and you cannot transfer files between them, so that would be nice to have. There is the port of Mesa, the WebGL, using Wayland protocol, that would be nice to have. Then last, the hardest part, also the coolest part, is the whole EM scripting posix issue. It would be nice if you would simply got rid of EM scripting and just could compile applications directly to WebAssembly and actually have a Linux micro kernel running in your browser. Somebody else who is also crazy ported Linux kernel to ASMGS. ASMGS is the predecessor of WebAssembly and that crazy person got the Linux kernel to boot, up until pit one at least. I tried it to do that myself using WebAssembly, turns out it's actually really hard to port the Linux kernel to other architecture, no shit, especially if WebAssembly is not an elf binary. The Linux kernel expects the elf binary to be used, the format to be used when it's compiled in all kinds of different places and that assumption goes through the window when you try to compile to WebAssembly. There's also a bit of documentation importing the Linux kernel to other architectures. There's a ton of documentation about the Linux kernel, but most of it is about developing drivers. But yeah, I'm not a kernel developer as well, so that might also have to do something with it. But I'd say think about possibilities you could compile an application or Linux application to WebAssembly, boot it up in your browser by simply accessing a URL and have it completely sandboxed, super secure, running inside a desktop that's running in your browser. That would be really cool I think. So how would say a Linux port look like in WebAssembly? We have a nice yellow diagram this time. It would simply load access a URL, it would load the WebAssembly application, the WebAssembly application would then link against your kernel, which is also a WebAssembly module, bit out of scope for the graphics room, but there are certain WebAssembly standards that allow you to isolate certain region memories to the application and to the kernel module and have some regions shared with them. And then we could probably leverage the Vue.tio stack and have it interact with basically the browser APIs and have your browser basically be your virtual machine. So that's probably how it would look like. For the file system some attempts have been made and for now Jusifes seems to be the most valid candidate. Interesting note here is that Jusifes uses two kinds of databases, one to store your data itself, one to store the metadata of your file system and the experiment. It was shown that the metadata database needs to be really fast, so we probably need a locally cached metadata database which then uses CRTT basically to synchronize between the instances to make it fast enough. So let's see if we can show some demos. That would be cool I guess. So in here we have a state-of-the-art green field running, super fancy as you can see. And I'm going to try a remote application, I hope the wifi holds. So this one is actually streaming remotely, Doom 3. I noticed it sometimes tends to freeze, I don't know if it's because it's an old application running on the Nvidia Wayland drivers. As you can see there is no pointer locking here, so that's still one of the protocols that needs to be implemented, but we can simply start a new application here. We have a nice 60 frames per second streaming to your browser. All works fast and fine, so I wasn't lying when I said it's fast enough for gaming. You can see it, you can simply walk around here and everything. There's also, here we have a Western demo applications compiled to a web assembly, there we go. So this is, I believe it's written using Cairo, I think my pointer is being captured by the game. So it's using Cairo to draw all everything and G-Lip as well. So this runs inside an iframe, if you were to inspect the source code here, we have the iframe here that runs the web assembly application and yeah, talks Wayland protocol, it runs entirely in your browser, nicely isolated as well. It's all transparently done. Then we have, let's see if we can get this No, it doesn't want to go. So this is your web assembly application and of course we can also run desktop applications, so I have one running locally here. I think this is a cute app, there we go. So this one is running on my desktop locally and see I can open, so this popup chooser is running in my browser, it's all Wayland, see I can't move it and you arrive some packet. See it's the fast system of my my laptop as I said before and you can run it here. So that's that. So far the demo, let's see if I can move it, yes, can go. So that was it, I hope you enjoyed. If you have any questions, I guess I have maybe some answers. I'll go from left to right. How does it work for input events? Yeah, so the question was how does it work for input events? It uses the browser's input framework. In case of pointer events, it uses the raw pointer event API if it's available. It's still quite experimental but it removes some lag. So when it does it captures the browser pointer events and then basically translates this to input events like they were coming say from from lip inputs and it's then sent using Wayland protocol to the application. So it's fast but you still have inevitably the network latency that's unavoidable, sadly. Yes, I have two questions. The first one is does it support KDE's background blur protocol? The question was does it support KDE background blur? No, it does not. Currently the only protocols implemented are the core Wayland protocol and basically the XDG desktop protocol. The second question is vertical synchronization v-sync. Yes, so the vertical v-sync is supported. So it uses the browser request frame callback API which is basically the v-sync callback that the browser offers to draw. Next question. Yeah, so the question was does it use h.265 and is it secure I guess in a legal standpoint? It uses h.264 in this case. The reason being that the web codec API is from the browsers. Most of the browsers do not support h.265, at least not at the point I implemented it. So currently it uses h.264. I currently do not use it for commercial purposes. So yeah, I hope I'm safe. I'm going to have one more question and then I'm going to stop. Yes. So obviously the way the application expects Unix socket at the wet level, do you fix this if they're compiling for WebAssembly? Did you just implement the Wayland? Yeah, so the question was if you're compiling for WebAssembly and the application is expecting a Unix socket, how do you implement it? I extended the EM script in SDK so that it basically has support for Unix sockets on the user level and then also added e-polling to that as well. So for the application it's purely Unix sockets. It's not entirely so there's only client side support for connecting to a Unix socket, not for creating a Unix socket because that is done by the SDK itself. If you have more questions, I'm available. So after the talk, come and find me. Happy to answer them. Thank you.
"Scratching an itch... by patching kmscon" or "How OpenGL can lead to systems-programming"
Okay, thank you for showing up or for staying for my talk. Scratching an itch by patching KMS con or put another words how knowing open GL can actually to systems program, believe it or not. Before I continue, I want to set the stage. I see this talk as a partly being an experience report by somebody who did a drive by contribution to KMS con. Well, yeah, that's me. And also to some degree, call for help with testing and maybe poking upstream to review my pull requests. And on top of all that, the whole thing is running on KMS con, the patch version. So it's now a graphical output and it's basically one big live demo. So fingers crossed nothing is wrong. Here, there's a quick overview of what I'm going to be talking about the next 20, 25 minutes. And so what is KMS con? For those who don't know, it is a systems console replacement meant to completely live in user space. It has a nice plugin architecture. It uses these plug-in architectures for actually implementing all its major features as shared library objects. It is unicode capable, meaning it has seroic support, Korean, Farsi, no emojis. But if you are bored and would like to contribute, I guess there might be five persons on this planet who would really welcome that. Two plugins I want to point out. One is the free type font support, which means you have vector-based fonts, open type, free type. And the other one, the most important one to me, is the rendering back end being based on OpenGLES, which actually allowed the stuff to be done to it that I wanted it to do to scratch my itch. It's kind of old. It was started 2012 by David Herrmann. Wanted to basically replace the in-curnal system console that's hiding behind config on a score YT. And it has a side of project where David extracted the whole terminal handling, sorry, the terminal handling stuff in its separate library. It's basically this Lipt. TSM is a toolkit, a neutral version of the Lipt. VTE that you might be knowing from GDK or GNOME. It was developed by David until about 2014, and then he magically probably vanished somewhere in the depths of Redhead. He didn't update his range anymore. And then from that point on, it kind of organically grew MnT to some other place, which is there at GitHub, AETF. That's the current most active fork of KMS-Con. So who am I? Usually when I do these talks, I just say, hey, I'm Xlo. I love computer graphics. Considering the circumstances, the venue, a little bit more context is in order, I think. I used to be a bit more visible and active in the open source community. Back in the day, I write Kyro Clock, Kyro Dock. I worked at Canonical. I implemented notify OSD. I always mess around with computer graphics. And you can contact me or ask questions if you don't get them in today. Later by email, the master don't thing on Twitter. And when I'm not hacking on computer graphic stuff, I like to race my motorcycles and skateboard. Enough about me. Motivation. Why I'm standing here. Why did I do what I did to KMS-Con? I have a kind of unusual monitor setup for a regular user, maybe not so unusual for a developer. I've got two 24-inch widescreen monitors, turn sideways, next to each other in portrait mode. That's not much an issue if you're running KDE agnome because they can easily rotate the output. When you have these moments where you don't want to be distracted by anything, you switch up every graphical output and just go to the systems console. That with a normal system console doesn't work very well. You see that in the picture in the corner. You sit there like that and after a minute you go like, fuck me. So I thought I can't be the only person on the planet who has this problem or use case. There's got to be something that solves my issue. I looked around. I couldn't find anything. So crap. During my research, I stumbled across KMS-Con. I said, ah, KMS-Con. Sounds interesting, intriguing. It's open source. Oh, that's good, sir. It has a pluggable rendering infrastructure and one of them is OpenGL. I know OpenGL. Rotating the output by 90 degrees with OpenGL. That's a piece of cake. I do that in the evening, maybe in the weekend best. Famous last words. It took me maybe one and a half weeks of some evenings. But that was last year, end of January, start of February. And I got it working. I said, oh, awesome. That's good topic for a lightning talk at FOSTA. Wrote the folks, I mean, two or three days before February. Ah, great idea for a lightning talk. I said, now, sorry, Merkel. Nice idea that everything's taken next year. So here we are. OK, so patch number one, the main edge to scratch. The OpenGL rendering back end that I touched, I didn't touch the rendering back ends like PIXMAN because, yeah, it technically would work, but it's not so nice. It's not so fast. I extended that with two entries in its V-table for all the methods that this plug-in provides. And one of them is the GL-Tex rotate that allows, oh, wait. Oops. So there. That allows to recalculate the aspect ratio, depending on the orientation that you want. And then it also recalculates the amount of columns and rows for the characters to fill the screen. And it does all that dynamically. Dynamically because of the third patch that I'll be talking about, but you'll see. And of course, talk is cheap. Let's do some more demos. I switch to another session. Log in again. T-Max. So I'll create two paints and make something happening in these paints. Oops. Oops. Hmm. Just as a stand-in, so there's stuff happening. And now I have a pre-configured hotkey that is super plus. And I'll rotate the output. It might make sense with this, but imagine your thing is turned around, right? So clockwise, counterclockwise. You can, of course, while you're doing that, resize the characters. Bigger. Keep rotating stuff. And make it smaller again. Oops. Ah, okay, something. Okie dokie. Come on. Hmm. So, so much for the output rotation. That was the main itch to scratch. Then I thought, well, okay, now I spent the time to learn the code base, its architecture, how stuff works. I also really like this GPM tool. If you maybe know GPM is this general pointer mouse deamon thing with the old console, it provides a text cursor, which is attached to your mouse. And then you can just select text and copy and paste stuff when you're really too lazy to type. That doesn't work with KMSCon. I don't know, fuck. I really like that. It cannot be that hard, right? So, okay, read the code again. And then, since KMSCon tries to keep its dependency over external libraries really, really slow, low, it implements its own event loop and timer system. So, you read up that. It's kind of documented in the code, but not like properly. But, yeah. So, I figured it out. I'm using the kernel input system for mice and basically implemented a new additional plug-in. Right now, I still compile it directly into the binary because I'm a bit lazy, but that needs to be extracted at its own shared library. So, it works then with mouse track points, the Bluetooth-based Apple Magic mousepad, maybe also with the first touch point for touch screens, but I don't have that hardware. So if anybody has that hardware, please download it, report to me. I'll try to make it work if it doesn't. Yeah, there's still this busy polling issue. I didn't hook that up to the event loop and the timer, so that's something I need to clean up. And the mouse plug-in is also the one that actually uses the TSM directly because the current paste is basically implemented in lib TSM. So, it allows you to select, oh, this is the start of the selection. There's the end. Copy it to a clipboard buffer, which TSM handles. You don't have to deal with that in your own application code. And then you just paste it wherever the cursor ends up being. And we can demo that too. So, let's cut and paste some file names. Okay, okay, here's the mouse pointer. So, I can just, ah, I think I found a new bug. That is interesting. Usually, that should not happen at that spot, but, yeah, pasting does work. So, I move it up to the fan above. The problem is I'm missing some recalculation when I resize the font. So, usually, if I would scale it down one more, then it should work again. But that works with regular Latin text. But, I mean, that's cheap. Okay, Scon supports Unicode. So, let's look for some, oh, that's, I think that's Greek. Yeah, I think that's Greek. You select the same way, yeah, okay. Imagine that would be correct. Correct. So, I drags there. I have no idea what type of language that is. So, yeah, it's Unicode, but it feels like a regular character. And it just works. That's the magic of K-Maskon. I really love it. I don't know why it's not shipped by default for every distribution. I don't know. It's really awesome. So, that is patch number two. The third patch is to make it a little bit more consistent with the way stuff works in a typical decimal environment, a graphical environment. I know from GNOME, when you have a driver sensor in the system and you turn the laptop, it recognizes that. And then it turns, if you allow it to be turned automatically, it turns your display. And I thought, well, if you have the sensor, well, if the graphical environment, GNOME, KDE, can read that, then, well, the system console should, too, right? I mean, be consistent. So, I sat down and looked around and implemented that. That was a bit more tricky. It was not that much work, but it was a little bit tricky. And I thought to myself, well, since the amount of external dependencies needs to be really low, I think I'm going to stick with the lowest level I can get along with, and that is the low-level D-Bus API, which even the D-Bus maintainers discourage you from using. I used it anyway. It works. I cannot show it on this laptop. I have another laptop with me, which has such a driver sensor. After the talk, either here or in the hallway, if you want to see that live come by, run into me, and I'll demo that. It's really fancy. It's like extra lead points for us when you show that to your geek friends. And then, at the end, there will be links to screencasts, where I show that for anybody who's not here or sees that as a recording in the future. Or just download the code, compile it, and use it yourself. So, the last patch, that's basically just messing around. The last patches kind of make sense. They improve existing functionality or add new functionality. Since I'm a computer graphics nut and I like to mess around with stuff, I thought, now I want to have fun. The best thing you can do with just a terminal-based text interface usually is, well, ASCII-Art. And there's this famous demo, and I think everybody knows that, this ASCII-Art stuff. It's basically a very clever way of doing very, very lower solution, ray tracing on the CPU, and then using the characters as the, let's say, brightness pixels. And then this creates the illusion of a 3D object running around. But with KMSCon, and we have openGL rendering back-end, surely we can do better these days. And so I did. So, I'll give you live GL SL shader backgrounds. This is a little bit my, I'm also a little bit active in the demo scene. I like to mess around with some demo events, and this type of scientist fields rendering is typically the stuff that you do. So you have now live shader backgrounds. And this actually, although taken to the next level, also implements an old initial to-do item from David Herrmann, because he wanted to have background images. Okay, it's animated. It's real time. Good. Okay, now, after all this, to upstream or not to upstream? Of course, to upstream, right? I mean, be a good open-source citizen. Upstream your work. You found something that you like. You improved a little bit to a scratcher edge. You put in the work to make it as clean as possible, fit with the coding style and the system software design and everything. Okay, so initially I brought my patches against the branch from David. Then I was contacted by a Debian maintainer, developer, Victor Vesterhuis, or something, Vicky, and he said, ah, no, no, don't do that. Use this fork here from somebody else. This is much more actively maintained. It has also a Mason build system. It's much nicer. Then the likelihood of your stuff being reviewed and actually moved upstream are much greater. I did that too, and then it sits there since about six months ago. Not anybody looked at it and touched it. I got two requests or questions on GitHub issues. Yeah, it's just sitting there. So this is also my hope to, let's say, mobilize the community. You say, if you really like that, if you use KMSCon, you like these features, please poke them so they review my stuff. I'm willing to put in the work to make it fit the coding style and everything as much as possible. And I've still got future ideas with truly multi-hat capable KMSCon. So it's not just mirroring the stuff. So I'm seeing the same here and here, but having actually per screen connected display one terminal session with KMSCon. And then, of course, you need the information of how other screens arranged. You could gather that information from any settings saved from KDE, Eugnome. To read that with KMSCon, you need the libraries from KDE, Eugnome to link. That's not so nice. Nice would be kind of an end-curses interface where you have then ASCII art graphics to arrange your layout of your screens. But I started actually doing that. It's not finished yet, but it would be really nice. So maybe it would be accepted upstream too. So links to the source code. Some blog posts where I talk about what worked, what didn't work, what was hard, some small milestones. Lots of screencasts where I try to spread the word a little bit, but the thing like and subscribe doesn't work because I'm always watching my stuff. So I put it there anyway. Are there any questions? Go. So just one one, I guess, with the mouse support. Yes. Were you actually like double-click to select? Is everything character is possible? Yes. So the question is to select, double-click. Yes, that does work. I implemented that, but it's a bit iffy. But usually it works. But there's some things I still need to iron out. But yes, yeah, it works. Is my phone your scene handle? Yes, basically that is my nickname forever and ever. It's not a pun on Apple systems. It's just make slow. Yeah, that is my scene. And you're part of a demo group? No, I'm a single person. I know the demo scene from forever, but I only got active about three, four years ago. And I'm not part of a demo group. I ran two times at revision and evoke into the folks from Logicoma. I'm kind of sometimes jumping on the stream from Hoffman. He's basically doing Amiga assembler stuff. And he's an awesome musician, but I'm not part of a demo group. But I love the demo scene. If you like computer graphics and you don't own a demo scene, you're seriously missing out. Can I just have a part of this as we make that? Yeah, I know. There's the gathering assembly and all that. No, wait, assembly is Finland. I still need to get to Scandinavia for sure, for demo scene. Can I ask on now? Yeah, I mean, it would be nice to make a bosomatic variation into KMSCon. Yeah. Sorry, your hand was a little bit fast. Does it support subpixel rendering for rendering the fonts? Does it support subpixel rendering for fonts? Yes. Well, whatever free type does. So it's using free type and through Pango. When you rotate the screen, then it takes into consideration or then it just broke it? It's rendering. No, it should be. It should work. If the app doesn't do the rotation, you do the rotation. I do the rotation. It's broken, maybe. I need, actually, that's good. I need to look, but I think it's not broken. But I need to look into that. That's good. Thanks a lot. Yeah. I mean, you missed the presentation, but is there any minimum version of OpenGL required? Minimum version of OpenGL required, I think. I would guess. Oh, my God. 3.3. I mean, even OpenGL 4.0 something is like 10 or, I don't know, how many years old, but certainly any system should work. But I think 3.3 should be enough. Yeah. How does this compare to using a full-screen real-time compositor, such as page, compared with a lightweight terminal emulator? How does it compare to a lightweight compositor? That's a good question. I have no idea. I think it's more lightweight. It's much more simpler. It's certainly not a desktop environment, so it's nothing like Western. You can, at some point, easily make the jump to that if you keep pushing, but no, no, it's not. It's just one OpenGL context. You run your terminal screen into that, and that's it. There you go. Thank you. So two questions of first-of-end remos. Do you implement all the remos by yourself, or it can be waited until the editors, for example, some of the console editors handle mouse events? No. It's right now in process of KMSCon. Sorry. The question is if the mouse handling could be, for example, handled or actually be read in an editor, like Emacs, I guess, or Vim or something. No. It's not like it's a KMSCon version of GPM. That would be the next step. But that would be a nice patch. I'm sorry. Yeah. Okay. The second question is, so you are handling all the rotation inside your GAO. How do you talk about handling rotation for the KMS properties? Suggestion how to handle the rotation through KMS properties. I haven't considered that. I can look into that and then see if that makes sense, or if I can do it or change my patch. Sure. Yeah. Okay. Anything else? Well, if, yeah. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you.
Flutter in Embedded
Okay, we can start. Hi everyone. Today I'm here to present Flutter in Embedded System, precisely Linux Embedded. Quick presentation of me. I'm André Rikki. I'm from Italy and I work at Amarulla Solution as Embedded Linux Consultant and Developer. My background is mostly C++ on both console and UI application using different frameworks and also recently Flutter. In this talk I'm going to present the Flutter framework from a developer point of view. So we will not go deeper in how the framework is used in the Embedded and Outworks. Outintegrate with the most common build systems such as Yokto and Buildroot. And if there is enough time I'll show a quick video of a commercial product that we developed with one of our customers. So, what is Flutter? Flutter is a UI framework developed by Google, was first released in 2015 for Android, and then was later ported to iOS, Windows, Linux and web application. The idea behind this framework is to have a single framework and codebase to create a good-looking UI application, natively complied and multi-platform. And it uses Dart as programming language and we will talk about that later. So, let's go through the advantages of Flutter. First of all, Flutter is fast because it compiles natively for ARM and Intel machine. You can expect great performance both at startup and at runtime. Also, the idea was to help the developers to achieve 60 frames per second. So, you can expect fluid and responsive UI on any kind of devices. Now, let's talk about Dart. Dart is the programming language used by Flutter. Being modern and designed precisely for UI, it comes with multiple advantages and tools that are really helpful when dealing with UI applications. First of all, is that the language is completely asynchronous, so it's quite impossible to have the application to freeze. By contrast, for example, with C++, I saw in my experience multiple times where, for example, in the UI loop, people were opening files or doing some blocking operations and performance were really bad on that. In this case, it's impossible and the architecture of Dart abstracts the complexity of a multi-threaded typical shell memory multi-threading. So, this is really, really important in my opinion. Another important point is that the language is completely no safe, so it's practically impossible to have a segmentation fold in Dart application. This is another really important point because the UI application is what the final user sees, so it's important to have it always responsive and alive. Finally, all the error management is handled through exception. So, even if there are some errors during the execution of the application, simply error exception are thrown. And even if there are not catch, the application keeps running. And we will see an example in a future slide. And all of that in an easy to learn and familiar syntax language. I came, as I said, from a C++ experience and working with Dart was really, really smooth and really easy to learn. So, Flutter is fast, Dart is great, but that's not all. Flutter and the entire framework is also productive. And with that, I mean that allow development and maintenance of application, which really is. One of the most important key is Outreload and OutreStart. Outreload is the ability to apply changes without recompiling and restart the application. As you can see in the GIF, by simply changing the code from a dark theme to a light theme, without recompiling, the changes are directly applied in the running application. This is really important. We reduce a lot the time of development and maintenance and also the stress on the developer. Also, being modern, it comes with a set of useful tools such as Stotics Analysis, Widget Introspection and also debugging and logging and much, much more. Finally, Flutter Extract is flexible. As I said, it's meant to work on a multi-platform. And you can expect the same look and feel of the application on any screen. This means that from a more commercial point of view, if you need to deploy application on embedded and also on mobile and Windows, you can expect the same look and feel. And this is really important from the final user point of view. And also, the out-of-time compilation and natively compilation of the code allow for really fast and great performance on any platform. All of that in a single programming language, of course. This is my typical setup. I use Visual Studio Code with the official Flutter extension and on the right, it's the running application. As you can see on the top left, there are the typical running tools and debugging such as Start, Stop and Step Into, but also the Autrestart. Autreload is embedded in the Flutter extension. So if I apply any changes in the code and save the file, the changes are automatically reloaded in the page. And in the bottom, there is the Debug console. As you can see, an exception is thrown because the application tries to save a file when starting. But even if there are some errors, the application is still running and works without any problem. In my opinion, this setup is really important. It's really easy to use. It's really productive. And I think that from the developer point of view, allow the maintenance and development to be really, really easy and less stressful. I'm happy to see that Flutter is becoming more and more popular in the embedded system, precisely Linux. Yesterday I saw a different talk regarding this topic. And it's important because the community is really active and huge and it's becoming more popular on Linux. So if you have any difficulties or if you are facing any problems, most of the time you will find a solution online. Then there is a huge list of packages. There is an online repository where free packages are hosted and developed by the community. They come in different types. I use the packages to visualize, for example, a lot of animation or SVG file, but there are also packages that are more into the code such as MQTT communication or file parsing. Flutter is actually developed and updated. In the last year, I had to update the Flutter version both on my laptop and on the target multiple times. So Google and the community keeps updating the framework and improving security and adding new features constantly. Finally, it is used by big tech companies, first of all Google, the creator, but also BMW and Toyota. And those companies keep the project alive contributing because Flutter is completely free and open source. It goes on a BSD3 license so you can use it without any troubles. Now, let's do a quick comparison with the most famous UI framework for embedded Linux. First of all, a VGL. The first point for me is the most important one. C is not dark. C is a really powerful language, but when it comes to UI application, it's not so easy to use, and it's really easy to mess things up. Instead, Dart is designed to work with, for UI application, and we saw all the advantages that the language come with. Also, autoload and autrastart, there is no way to achieve that in LVGL. Of course, you have to rebuild and recompile and redeploy the application every time. Instead, Flutter has this amazing feature. Also, it has more platform supported because LVGL can only run on desktop or embedded, and Flutter can also run on mobile. There are more packages available. Let's call it, also call it their Libs, as we saw in the previous slide. And of course, Flutter is a bigger community behind it. Finally, using Flutter is much, much easier to build the application and publish. We've seen a later slide, how to integrate the Flutter application inside the Yocto, and you don't have to mess with the build argument. It's all handled by the framework and the Yocto project. Instead, with LVGL, if you need to cross-compile, it can be a bit tricky. Then, Flutter versus Qt. C++ and QML versus Dart. C++ is a step up against C, but still can be quite difficult. I saw multiple times Qt application having a really bad performer issue because C++ was not optimized. And QML is designed for UI application like Dart, but first of all, it's an interpreted language. So if you start doing any kind of logic inside QML, the application will be crap. And I think that Dart is still much better for UI development. Here again, autreload and autrestart. It is possible to achieve autrestart and autreload with QML, but in my experience, I was never able to do that. Most of the time, QML is strictly connected to C++ for modeling and such, so you of course need to recompile everything. Third point, Flutter, as I said, is completely free and upersoos. Instead, Qt has commercial licensing. So if you want to use Qt in a commercial product, you probably end up buying a lot of royalty. And finally, Flutter, I think that it's rapidly improving. I mean, Qt is improving too, but the release cycle is much slower compared to Flutter, so I think that this one is also really important. So we saw a lot of advantages and good points, but it's not everything is perfect. So in my experience, one of the, let's say, tricky part of Dart is that when working in embedded Linux, you can expect coming from C or C++ to be able to do anything you want, for example, accessing the hardware directly from the U application. For example, in the product that we showcased later, I added to read proximity sensor input directly from the U application to simply turn on the display. This one was not possible because Dart doesn't allow to read directly the hardware to access directly the hardware. The structure is a bit complex to use, so long story short, I was not able to do that. But the solution that is foreign function interface, also known as language bindings. So what is possible is that we can create a C library with a public interface and then we can call those methods directly from the Dart application. This is really important because we can solve a lot of the issue related to more complex stuff from the language by using a C library. So start up the application, the library is loaded, and then I can call directly the public function. By doing so, I was able to solve the issue and read the proximity sensor input from my Dart application. Now, how to integrate Flutter in your project? Well, build route, there is the Flutter package developed and maintained by my co-worker Adam Daskat. It has done a great job on this package and is currently maintaining it. In my experience, I used Yokto for my project, so I'm a bit more into that. There is the Metaflutter hosted on GitHub that is the, let's say, official Flutter, is maintained by some guys from Toyota and the community. Integrating the Flutter inside your operating system is really, really easy. Just include the layer, add the dependency in your image, and you are pretty done. The Flutter engine and Flutter Embedder are automatically compiled and added in your system. If you need to include, obviously, you will need to include your application. You can use the Flutter Gallery Recypes as reference, pretty straightforward stuff. You just copy the Recypes, adapt the repository, and maybe if you want to adapt some build arguments, and then add as a dependency, and it's done. You don't need to mess up with any related cross-compiling stuff or any of that. On GitHub, on my page, I have a repo manifest that I've done almost a year ago, hopefully still working. For a tinkerboard, simply create the Yocto project and download all the layer needed, and then add the dependency inside the image. When you compile, you can simply put the result on SD card, and you will have the Flutter Gallery on your hardware. Now I'll show a video of the product that we developed with one of our customers. The device is an intercom device. It's running a Rockchip processor, so obviously a bit more powerful than the tinkerboard. This is the intercom device. It's able to connect to another device on the other side and have video and audio stream. Under the hood, there is a lot running, but as you can see, the application is still smooth and responsive, and the performance is really, really good. This is a custom keyboard that we developed to achieve the design that the customer wanted, and those are some lotty animations that are running with the package that I included from the Flutter Dev. The final result, in my opinion, is really good. We as developers are really happy with the result. The customer is happy with the result from the commercial point of view, so that's why I was presenting Flutter today. So, if you have any more questions, thank you everyone. Great. You mentioned this Dart is a custom language, but there's an MQTT library. How does it work? Do you have to, for each protocol, implement it in Dart again? So, Dart is not... The question is if, for example, MQTT is actually... There is a custom stuff running under the hood. Not really, because in reality, Dart can be recompiled as C++. So, simply, Dart, you can use Dart as you... It's like you are using C++ code, and MQTT actually didn't really look into it, because I downloaded the package. It was working flawlessly, so included on running, and I was really, really happy with the result. So, yeah. Thank you for the talk. I'd like to know what is the memory footprint with Flutter Engine, I mean, like Flash and ROM. Okay, the question is the memory footprint of Flutter Engine. So, let's say that one of the disadvantages of Flutter is that it's a bit more resource-less hungry compared to Qt or LVGL. I think that the Flutter Engine was like 14, 15 megabytes on the storage, and on the memory, I didn't run any kind of profiling on the hardware, but I think that it's comparable to Qt application when running. Yeah? Yachto, do you really need a big operating system underneath, or is it capable of running on, say, a pre-artore or something like this, a more lean operating system underneath? Okay, the question is that if Flutter is able to run in less powerful operating systems such as FiatOS. Oh, perhaps Bare Metal. Oh, Bare Metal. I don't think so. I think that at the moment it requires a lean operating system. Okay, thanks. I have one more question. You said about integration to a distant project based on Yachto. So, we just integrate a map of Flutter, and you said we do not have to take care about any cross-compilation. What was the work? Like, my distant project is compiled by GCC, and Flutter, I read that there is a dependency to C1. So, the question is how is the end-all the layer, and how is the cross-compilation managed? So, I think that everything is end-all by Yachto, and the meta-layer, the meta-layer. The meta-flutter is really well done. You can simply download and add the layer dependency in your Yachto project, and if you include the dependency in your image of the Flutter engine, it will automatically be compiled, because Yachto takes manage everything about that, so you don't really need to take care of that. Hi. Hi. What about Flutter, Yachto, and IRM32? Sorry? IRM32, ashtag, too. Have you had any experience? Actually, the question is if Flutter is capable to run a 32-bit platform. IRM32? No. Simply no. Do you have any port, any project on Yachto? No, at the moment, I don't think so. Yeah, the question is if I know any company that is moving from Qt to Flutter. The video before explained one of our customers was mainly using Qt for your application, and now is moving to Flutter. So I think that because of the open-source licensing, this is really tempting from a core-mesh point of view. Yeah? It's a bit of a tilt question, so for the current project, did you do the whole project in Flutter and Dart? Yeah. Okay. In your company, do you also have projects where part of the product is made in C++ or something else? And how do you, would you integrate that with Flutter? Okay, the question is if the whole project was made in Flutter or if there are also other applications running with C++. Well, in this case, the UI is run with Flutter, and there are a set of microservices that are running under hood with running C++. For example, they take care of the video and audio stream and all of that, and the application communicates with those microservices via NQTT, for example. Okay, I think there's no more questions, so thank you very much. Thank you.
Building Cross-platform GUI apps with ease (and Go) - desktop, mobile and beyond!
Thanks very much everybody for coming here to the graphics room to listen to another talk about building native compiled applications that are going to work everywhere. I had the title of the slide up and realised it didn't actually say go on the title slide so I just wanted to get that out there right away. It's very exciting to be here in the graphics dev room and to be presenting in the same place that fantastic people over the last decades have shown great new features in KD and Nome and fantastic discussions around all of that and hopefully I could bring something new and interesting to the room as well. Just out of interest to get this right I recognise some faces from the Go Dev room yesterday so maybe a show of hands if people have programmed Go at all. Wow, okay cool probably unusual for this room and anybody then that is a C developer just in case I need to go back to some common ground. Right, okay cool well thanks very much. So just a little bit about myself. Hi my name is Andrew it's really nice to meet you all. I am a software engineer have been for 20 years I think now, I stopped counting. I work a lot in startups either my own or other people's companies solving interesting technical and personal challenges, building teams all that kind of stuff and I've written some books gone on a couple of podcasts on the topic of building applications like the ones I'm going to show you today. I have a background in open source if you've seen me here before it might have been talking about the enlightenment project where I spent a lot of my time and before then the Maven project as well. I started the Find project which I'm going to present today to build graphical applications built on top of Go in 2018 and I have been a Go developer since two weeks after the project was founded. I'll tell you a little bit more about Find as we get into it but I didn't pick up a language I wanted to learn and decided needed a graphical toolkit. I had an ambition to make getting into graphical application development so much easier I knew what I wanted to achieve and then I hunted for programming language and I don't know if it's a good or a bad thing to say I wanted it to be rust I so wanted it to but I couldn't figure it out and so I picked up Go and I haven't looked back I've never felt more productive. My day job is at Fine Labs which is a company that's set up to help businesses get more out of the types of platform agnostic technology that I'm going to show and so we have products and services that could help companies working in this space. So I don't know whether or not people would think that Go is a strange choice of programming language for building graphical applications. It's certainly what the Go development team have said over the past few years although I think they're coming around now they've seen how easy it is but just to summarize the benefits for anybody that doesn't know much like Dart in the previous presentation it's going to allow you to write applications that compile natively to absolutely any device so they can pretty much run anywhere from desktops through mobiles, wasm on the web browser through into embedded devices as well. It's important to me also that there are no runtime dependencies. These pieces of software should drag and drop or install through a store in the usual manner without any need for additional steps, no runtime setup, no hidden pre-conditions required to get the applications running. We may have to do some as a developer but we take the pain so that our users get the big benefit. We're going to deliver native performance. These applications are compiled down into the same machine code at the TIN level that any piece of software built with C or other built for specific platform technology is going to offer but fundamentally I thought it was important to lower the barrier to writing graphical applications, help people to realize it's not so difficult. It's something that you can see and do and have installed on your device very very quickly indeed and Go provides the ability to do that whatever platform you're on but also the standards and the pros and the tools, the techniques in the language help to make everything easier to understand. It's good documentation, it's standard ways of writing things, unit testing built right in, all of those good things helping to promote good engineering principles. And so for me this is why it made such a good fit and it's why the fine toolkit picked up Go because we're wanting to be the simplest way possible to get people building beautiful and usable native graphical applications and to not have to think about any changes that might be necessary to get them running on any particular device. So the fine project like I said started in 2018 so that is now six years old I think possibly as of this weekend actually, complete coincidence I was not sitting in a Fosden room when I thought of the project which is same, would be a good story. It has become the most popular graphical toolkit for Go which is pretty exciting. Over the years there's been actually quite a few have started and it's nice to have choice. They have started perhaps with different technologies under the hood some are using embedded web browsers for example and others are interested in enabling more control, more power where we're focused on the simplicity and the ease of use I suppose. OSS Insight if you track them on Twitter, X, wherever they are have ranked a sixth out of all cross-platform graphical tokens which is very exciting although for some reason Qt and GTK don't seem to list in the top 10 so how they came up with the numbers I don't know but it puts us up there with others like Flutter, React Native and other names that you would have heard of and just last week I realized that we had got into GitHub's 1000 most popular repositories across the entire entirety of their code base if that's the right word which I think something like 350 million repositories and as part of the Go ecosystem we make use of the really excellent and welcoming community that they have established over there and across Slack, Discord, Matrix and in-person meetings we've got about 2,000 people that like to get together and talk about building applications offering help for people who want to get started. So let's get a couple of pictures on these slides as well. This is the fine demo application so if you're interested in checking it out you can load this right now it's in the standard repository that we have we ship a few demo applications and if you're on the Google Play Store you could download this right now onto your phone and see how it renders on a mobile device. Hint looks exactly the same except it's adjusted for the different screen sizes and of course I mean as a developer at heart Light Mode is no good to me we ship the dark mode by default as well sorry they're both in there by default it will pick the right variant of the theme depending on your user preferences. So let's get started and build an app. I'm not going to overwhelm you with complex code which is perhaps a relief to people who don't know Go or C but I'll step through what we do have. Go is known for being easy to compile across all different platforms from whatever developer device you have which is fantastic it's a good place to start but because we're going to be doing some graphics programming and we want optimized binaries that are going to use your hardware acceleration we've got to get a little bit of C in there under the hood you're never going to see it but we do need a compiler installed as well so you'll need to install Go and GCC or Clang or a compatible compiler. If you're unsure whether you've succeeded setting up a development environment we have a fine setup tool which will verify the runtime yeah the runtime it's linked from the Getting Started pages which I'll add reference later and that's just going to check that the Go compiler, the C compiler, are found and catch typical challenges around just having your path set up properly so that tools are indiscoverable and we have a tool called Fine that's going to be useful for our packaging later. So there are a couple of steps that we need to do to get started with a project nothing like if we were going to be starting with a C code base but nonetheless it's something to be aware of we need to make a directory for our code we need to initialize the Go module this is a step that was introduced recently it adds for much more powerful dependency management on the Go project it used to be that you could just open a file save it run it and you would have an application displayed and I'm trying to coax the Go team to allow that as a default for really early stage because the mantra is start with the smallest thing possible and then add to it over time so apologies we've got a couple of steps there that you need to know we're calling Go get which is going to grab the library and goes looking all of the stuff up on the web through pretty efficient caching mechanism but as you can see that's a URL it's finding the source code and that's going to download it into the module that you've just created actually it's referencing in the module and putting in a common space so you don't need to download it again for another project and then we're going to edit our Go file I'm calling it UI.go because I'm really good at naming this is the code that we're going to put in it not adventurous to live code I'm afraid I'll step you through it we have package main because every application enters through the main package we're importing two packages they're in the same namespace of find.io slash find slash v2 because this is our second major API and it's the app and the widget sub packages that we're going to be using the app package sets up the runtime pools in the appropriate drivers for the device that you're running on and then is going to bootstrap the application and the widget package we're using to add something into our window our main function again probably no prizes here forgetting that's the entry point for a Go app is creating a new find application that's invoking the driver is creating a new window from the application with hello as its title so if your device has title bars hello gets popped in there and then the one line which is basically our entire user interface says set the content of the window to a new widget a new label widget that says hello fine and then we then call show and run on the window which is a little shortcut for show my window run my application if you're not familiar part of the challenge with graphical apps is they do have to run in an event thread the operating system has specific requirements for things we just bundle it up as simply as possible so there we have I think four lines of code and a couple of import statements we can type go run full stop the period there is just say the current directory you could equally have said main.go because we only have one file and it's going to load this picture this window here says hello fine I was running this on a desktop in dark mode at the time which is why it looks that way wow yeah I can see you're really really excited about a hello world application I mean I was the first time it appeared on the screen but that was a few years ago so let's do something a little bit more interesting and show that it's still going to be easy to do something useful going to make a markdown editor for you we have built into the standard widget package and an entry widget that seems to say editor it's an entry widget that's going to take the user input we're going to use a rich text in our application to render the output part of the reason this is going to be really straightforward is that a rich text understands markdown as a source for the information to mark up a text document and a horizontal splits container for laying out our user interface I showed you widgets before but containers are sort of like a type of widget where you have multiple things in it and it has a layout that's going to describe how things should lay out on screen you don't position widgets manually you have an area and a container fills it which means that we can adapt to screen sizes orientations very easily widgets don't have to think too much about how they're placed what type of device they're running on it's actually very powerful when you're not really wanting to think too much about what system you're going to be running your application on and I'm going to hook the two together with an unchanged callback so that when the user edits their text the runtime will change it okay so I described four lines it is a little bit more than four lines of code but we have the same imports with the addition of the container package and we're starting the application and window in the same way although as you can see a very exciting new title is going to appear on our window the editor field the oh sorry the editor field can't seem to point sorry anyway you can read it's not a lot of text is a new multi-line entry which is a standard entry widget but has more than one line in it we don't need to specify how many because it will fill the space available we have a new rich text we're saying load markdown but we're loading nothing as you can imagine if you passed in a string a markdown string there it would actually render that as it was loading for the first time then the hook that I mentioned is again one line of code the unchanged on entry passes a string to whoever is interested in what changed and the parsed markdown function of our preview accepts a string because you would parse your markdown from a string so we're able to set one function to to the other one so when unchanged happens it fires parsed markdown so we can avoid signal slots string based IDs and comparisons to connect multiple widgets together and just use a single line of code instead and then the most complicated piece of code in this entire snippet is the container we're using an adaptive grid which is like a grid but it adapts to is the number of columns slash rows that it should it should have if you had a standard widget a standard grid it would have columns or rows specified and as it reaches the end it flows onto another with an adaptive grid it's going to decide whether it's columns or rows based on the space available so if we were loading this on our phone in portrait mode one will be above the other and if it's in landscape one will be to the left and one will be to the right so sorry about the sneak preview before but this is a markdown editor there we go that's better thank you thank you you're too kind as you can see this is not difficult we have the entry widget on the left we've typed some markdown into it and it has rendered on the right there is a link which you could tap and it has referenced a local image as well that's quite cool but this is cooler it's exactly the same software that has been packaged as an IPA and dropped into my iPhone simulator actually it's a dot app because it's a simulator not a real device but exactly the same so the code could be dropped onto a device as well as you can see it's also running in landscape so the arrangement is the same so there that is the application running across multiple different platforms how did we get it there so compiling for targets that aren't your current machine is a little bit more complicated but let's start with what if you are compiling locally the fine tool the helper that I mentioned before is pretty important and very helpful as many help helpers are you can get it from the URL there and that go get command is going to download it and put it into your path and then you can use it to do helpful things like package the application or install it locally as I'm sure you're aware a binary that you get out of a compiler is fantastic and efficient and you can move it around because it is portable it doesn't look good and you can't put it in your start menu so fine package is going to give you a binary with whatever metadata around it is necessary so it will inject an icon into an XE for Windows or it will put the icon and desktop file into the appropriate places on your Linux system and fine install on that second line there is doing all of that for the current system and installing it into the right place for you so user local probably for most people here or the start menu or your applications folder on Mac line three there is how do we do it for a different platform because we can't just invoke the compiler we also want to package it differently the XE is going to appear instead of whatever our native system is that is going to use local tools and so if you're familiar with cross compiling and see having a tool chain specifying the CC variable is going to be likely needed for some of these cross ports we'll come back to that in a second the fourth one there is to build an Android application on our platform to do that you just need the Android SDK installed essentially quite straightforward and relatively portable the only reason we it's a bit more complex is we need to say what the application ID is because the sandboxing and the operating systems rules say you can't just be an anonymous piece of software so we pass that in there is also metadata file that you can do if you prefer to avoid command line arguments all the time fine app.toml I'm not going to cover it but it's there and can help you save a little bit of pain but what if you don't want to manage multiple developer tool chains installing packages even if it's for your local environment you just might not want to you might not be able to so contributed to the project is fine cross from Luca Corbo and Cedric by who many of you will know and another guy Jacob on our project have pulled together a Docker based build system with a very standard command line front end so you could much like you would say fine package OS windows you could say fine cross windows and it's going to take your application bundle it up inside the Docker container put the binary back into your current directory and and exit the container so helps you to avoid all of the setup if you don't mean running Docker or podman on your local instance that's going to be super super helpful very briefly want to touch on some more interesting parts of the toolkit because it's not all about just showing dialogue sorry showing graphical elements on screen one of the hard things about making applications portable is the file system we take it for granted but we shouldn't it's not always there so we've provided dialogues to open and save files and a package that helps you to manage storage in an abstract way even more abstract actually than the recently added go file system package it doesn't assume file paths it uses URIs to uniquely identify any data source so you could have your your data remotely on a network somebody made an application to bores their steam library they connected it through the storage api and they used the file open dialogue to browse their steam library cool but why it's really cool is the picture on the right here i've asked my application to open a file i've put that onto my iphone simulator yes and it has shown me this file picking dialogue i don't know if people are familiar or not but this is what's going to come up if you have an iphone set up with an iCloud account i can pick data off the cloud or i can back out and i can go to the dropbox picker where i might have something stored so third-party applications can provide data as though they were files because we're not making the assumption that they're files and if you get further into this and you want to separate your ui from the data that you're managing internally separate state from from rendering then we have a binding package so you could pass around a string binding not have to remember that it's going to a label or you could multiplex i have some data it's going to go to two or three widgets and most of the standard widgets will provide a with data constructor so i can pass the data binding in and that's a two-way data binding everything's always going to be kept up to date so two pretty helpful things but i wish i had more time to tell you more obviously there's a full widget library or i wouldn't be shouting about hey everybody you should try this we have a dialogues library and full featured forms as well which surprisingly is one of the things that could be a little tricky to get working on a mobile app menus some more complex containers than i have shown you we have notification integration system tray for desktop and popping in and out wherever it happens to be appropriate for the device you're on and we've provided native access to apis that you might not have in go so if you need to use a library it's not available in go you can call out to that natively through a c api the go team have done a fantastic job with making that integration really easy you just essentially import c and call it with a c namespace and it works pretty much transparently again there's some complications if this is android and you want to access the end the k you need the jvm instance so we've provided some native integration that that give you the context necessary i'm not going to step through that today however there is a little bit more than that it wouldn't be a presentation in the graphics dev room if i wasn't able to say but hold on a second um we built an entire desktop system using this this api stack the presentation that i've just run through is in an app called slides it is a markdown file we support markdown rendering that is pulled together in a fine app the terminal here another fine app in fact the desktop system everything in front of us it's all rendered in fine it is go apis and very very easy to understand so there you go i feel like i've fulfilled what's necessary to to consider ourselves a serious graphical contender if you would like to learn more well i'm here i'll hang around outside if anybody wants to chat there's a lot of documentation online the like i said the project's been going for a while so you can find a lot of what we have at docs dot find i o there's also a pretty good video channel um at fine i o on youtube where you can find tutorials examples and do search for fine tutorials outside of what we have because there's translations in okay i'm not going to list them in case any of them politically insensitive but plenty of different languages for folks who find that they want to try um a platform that's this different to the standard ones available perhaps um there's a book available about fine that i wrote you don't have to buy it but it's out there um if you would like to contribute and we would really love it if some people came along and helped us to improve this project everything is on github including the documentation the websites the examples you can all find it in the organization and the main repository is simply called fine and you can find the source code to everything i've shown you today we're of course looking for sponsors but who isn't if you love it you know help out in whatever way you can appreciate your time thank you so much and i'll take any questions that you have excuse me please yes um we do a lot of complicated stuff to make oh i'm sorry of course so what's the support like for ios is it more complicated than android it is very complicated for us it is trivial for you the developer who's using the toolkit you don't need to think about it you don't need to do anything at all the tools that i've shown you will create any type of application from your code um the one proviso is that if you want to put it into an ios into an onto an ios device apple is going to insist that you own some hardware that they have produced it may or may not be possible to do it in other ways but that's the license um but no fundamentally this is this is platform agnostic the apis are all guaranteed to work absolutely everywhere um sorry the the fuzz done to fit yeah yeah i was just wondering if you could um i if you could provide a sense of are there any certain kinds of application that are both particularly good bits and maybe less good i good good bit the video in this framework yeah so are there applications that are good fit or not a good fit um for using fine i think the easiest answer is going to be that if you have a document rich piece of um content if you're helping people to browse archives of um documents and things like that you'd probably be better with a web framework honestly because i mean that's what it's built for if it's more interactive if it's graphical driven um then that's something that we're going to be much better place to do fundamentally if you want to get this out to many people we're going to um alleviate a lot of the pain of getting it out there quickly and some of the things that other toolkits might offer as built in or community add-ons might take a little bit of time to implement but you've saved a lot of time up front i wouldn't go and implement games because we don't offer the 3d acceleration as part of our api we just use it internally for the speed improvements but it has been used for such a wide variety of things we have remote desktop application screaming uh 60 frames a second um full screen so that that kind of thing that's pretty cool and can we squeeze one more in yeah um just there please uh yes you mentioned about uh using open gel as the back end under ios for instance are you going straight to metal or are you using say angle or some other solution okay so yeah what are the graphical back ends that we're utilizing um it is open gel on the desktop you're you're quite right and we are using gles on the mobile platforms ios and android i'm aware that some of these have been deprecated and they may change over time um on the desktop mac is trying to kill off um open gel they've not really said that they're going to kill gles on mobile but it's inevitable that they will want to we're looking in the future to build more back end sorry more platform specific uh engines because you know performance also it offers slightly better portability ironically if you build for everything separately internally but we've designed the api so we don't need to make those decisions it's really easy to use it's going to work and over time we're going to adapt the back ends to be more efficient or or whatever is needed by the platforms um and so we're offering we do updates every um six months four to six months so that we can keep up with the the specifics of each of those platforms so if we have to look at a different one it'll be there before you have to worry about it thank you so much everybody enjoy the rest of your day
The FIM (Fbi IMproved) Universal Image Viewer, in a Nutshell
Welcome everybody. So this is a talk which introduces the FIM image viewer and it's a pretty classical introduction talk with some new things about new features which I'm working on, which are not so very stable but half of the presentation is about that, so something which will work very soon. Good. So the classical introduction is of those happy times I was using the frame buffer a lot and I was using the FBI image viewer there and I was happy but I needed more and more use of the X-Windows system from time to time and foremost I wanted the VIM keys working in FBI because this was not there and this is very bad. So at some point I started wanting more things from FBI but FBI couldn't give it to me so not in this order but I started introducing things like new graphical front-ends, arbitrary key bindings, custom keywords or commands, shortcuts of course, in a completely different order of course but no, actually much more than this. Handling metadata of image files or searching of those metadata so incrementally adding things which one could do and also interaction with the standard inputs and standard outputs and shell scripts or scripting from the inside, so wherever. It was, you had to be a 360 degree nerdy image viewer for my purposes so it was just driven from my curiosity. So but still I maintained that this is meant to be and still is a unique tool so it's meant to show images so not to show movies, not to edit images so to do that with a possibly minimal user interface, yes this is right, for command line user interface enthusiasts, for people who like to have perhaps a few lines in the configuration file for some custom caption or perhaps some other customizable graphical feature and for perhaps people who want to use it with other utilities, all sorts. So what happened is that some kind of Kimera came out of the original FBI so I thought maybe I can call it FBI improved because of this. So with one executable you get pixel output, ASCII art output with or without colors, the second part of the talk will be also mentioning that a GTK front end, back end or front end sorry, front end is there so yeah. And what's the core of FIM? On what does FIM, on what is the basic FIM? FIM is based on a number of commands of a small language which perhaps I could spare myself if I only had taken some time to learn perhaps Gile or one of those languages so I evaluated I think the ninth rule of the Greenspan, I mean rules of Greenspan so that in every highly configurable program perhaps a LISP interpreter would have been done better so maybe that will come. Anyway, keys are always bound to commands. Commands are customizable, keys have names and you can bind them to commands. Commands live in a space of a simple language which is documented in the man page and there are also shortcuts so once you enter the command line mode with column of course you can have some shortcuts for things which are useful like go to the third picture of scale smaller by 10% or by some other precise amount. You have top auto completion of commands with a deliberate line. You don't need this always but sometimes you need it and it can be useful. I also forget the commands so therefore I have those features. I also like when I show pictures to my friends to have nice custom captions over the images and perhaps having those captions customizable in different parts of the screen and therefore there are four captions which you can customize with expando codes so this can expand to internal variables or to specified values like width, height and so on. You can make it look the way you really prefer. This is your right. I think this is important in an image viewer. I also like to use a bit metadata when I have collections of my travel pictures like to quickly show to somebody a concept from a place where I have been so then there is a simple text file format which can parse with simple commands there and perhaps some other features to make those commands perhaps more flexible. So to categorize groups of pictures or files actually and this allows quite functional forward and back search with the slash and of course the backspace key just like in VIM. What happens if you have a lot of pictures in a list in a collection? Is it also sometimes you just limit what you're showing for a time to some category, some values to some value key combination, key value combination. So for instance here you see that from five files that we have selected that we gave to the FIM we are showing just the four ones which respect this specification. So there are other limiting features like on size, on date and so on. Another recent functionality is that one that perhaps if you want to try out systematically some particular filter command that can be expressed as a UNIX command you can just start FIM that way with this particular command being specified as an argument to FIM and then systematically all of the pictures will be filtered that way before being shown to you. It can be useful if you don't want to convert all files just to having a preview but having the preview on the fly happening on temporary files. I find this also useful to me at least and the syntax is absolutely inspired by FIND the FIND minus exec feature and also just another important thing is that sometimes FIM doesn't really read all the formats you wish although it reads many of them but thanks to the help of external converters it can open many many other sorts of files also with the use of image magic as a last resort. So zip pad files, zip files and ISO files I know many of them and what's new in FIM because all of what I've shown so far it is the classical presentation with all the stuff but there's some nice stuff coming in a bit instable yet. It's about GTKB news because I'm getting old. I don't always remember of the internal commands of FIM. I wish people to find FIM more accessible, more usable perhaps and I never wished actually people to remember the commands. Why the hell? But still I wish to see myself and to allow others to quickly access to the functions which exist in FIM because I also forget them actually but yeah there are nice things. So yeah so this is the boring menu which you see now if you use minus OGTK and it's a normal menu right? So there is nothing really particular or funny here. There are not even like icons but still I cared, I wanted, I want that this menu should be, you should be able to specify this from the textual configuration file with like one line per item which seems sane to me. No XML files, nothing else. Just my custom and dangerous notation for that. And also that you should be able to customize the menus on runtime, why not? I think it is also very important. And this is actually to have like a reminder to what you can do with FIM when you need it. So the menus should be there just waiting for the moment. So back here. So those are the menus which I just had on my instance of my computer today. And let's explore the first one, the file menu. In the file menu you will find the classical open file and quit. And other things. And each one of those is like one line which is half-holded here but you can also have in your FIMRC file. So if tomorrow you want to have some specific variation of next or go to or jump to something, it's just one line in the configuration file so no recompilation, nothing is needed. Since commands are bound to keyboard keys, therefore there will be working automatic here so documentation hints to which key a specific command refers to. I mean if there is exists, for instance here, previous file in list, by the way it's b which is specified in a completely different line but this is kind of let's say detected or crossed. And this previous command actually is an alias which this alias is actually go to minus one and in the moment when the menu is being specified and in the specification of menu, it's only written that this is like slash file slash previous in list, space go to this and a few tabs so it's pretty light as a specification I would say and it should stay so. And only in that moment there is a correlation which looks that by the way, pref is go to minus one and why not to show this in the tooltips. So I find this useful. I find this to ease the path into using FIM. And you also have like perhaps things which might be useful like next in this directory. I didn't put it here but also open recursively directories. I didn't put it here because you don't do it this actually interactively you do this when you start FIM but it's there so you can put here the things you use more often or go to the next file from the last search for instance. Anyway so these are all things which I don't know if many other file image file viewers have but still this just reflect what already exists in FIM since a longer time. There's also like the categorization functionality which I introduced in the last time which I used for my vacation pictures. So once I used this very simple custom notation for categorizing files so listing the name of the file, a comment and a few like variables saying I don't know which artist a picture which artist made a particular picture then kind of also automatically such a menu can be computed during the rebuild of the menus and this can be useful to like select to short list remember the pictures of Bill and Richard before. To shorten the list to only to the files which pertain those particular keywords. Now also the usability of some functionality like interacting with internal variables should be easier perhaps with this GTK menus because yeah I said that you can specify the menus right but you can also specify not only menus which like run an action but also menus like toggle ones which let's say change between two values and this again fits like one line can be something custom for you if you want to have some particular let's say variable on which something else gets triggered because like in VIM you have like auto commands hooks so perhaps something can happen after you change a value yeah so the toggle is on two values so after there is a say some hook detects that the variable changes and the picture is for drone and it's also so that after yeah after such a thing happens yes you have also in case you have another menu which actually refers to the same variables there should be some consistency and to say the same state of the widget should be ensured so to avoid situations like the widget being an inconsistent states in different parts of the menu. Same story yeah now that there are so many variables in VIM that sometimes yeah there is a bit of confusion yeah for instance there is a variable that says yeah flip everything however as it comes and yeah so if you flip let's say the file but also flip all the files what ends up that it flips twice so you have to do special menus one for this image and one for whatever comes so this can be useful to me at least and perhaps to you as well and remember that this is just my choice of what I like so here there could be other defaults which you find better like your own values for the scaling or orientation yeah yeah moment yes so yeah let's go forward there is a window default menu which I specified and again you can specify your own you can take it away if you don't like it where I have put up with other things like since the the menu of VIM is actually it's using the font from from the linux kernel and it's just very ignorant I mean very simple pixels you can magnify the text or change the text font yeah by the menu here so just it refers to what the kernel uses and yeah and if you are to play let's say a bit more with the configuration of the gtk menus you can kind of rebuild the menus perhaps in a in more verbose ways so there are ways defaults at the moment I mean in the unstable version which you should not use yet which will be give more verbosity yes yeah so it's good that you have things which are experimental to you yourself like I don't know taking a selfie with with the camera and reading it in FIM at a given moment you you specified here in a so here we are calling the webcam alias the FIM alias code webcam which is reading from from the output of a command and it will show it here whatever output it is as long as FIM can read it the first few bytes will say which format it is and it will be shown same story as I have shown shown here there is another default here just to say something which means to read loud allowed the comments via speech synthesizer which I find also useful sometimes so you can have somebody not seeing the picture but actually giving you the sorry somebody not able to speak but telling you about the pictures you see so yeah there are also some special menus which are kind of not specified by you but just requested like the all action alias menu or the help menu which you just specify with one line but actually a lot of lines in the menu items in the menu gets in because this is like some kind of automatically generated recapitulation of what commands or aliases or key bindings or variable or variables exist so this is kind of a documentation intended feature so in the in the case of the help menu it's also a handy way to get to the to string telling you which are the options or particular usage modes of particular commands or aliases yeah so over the years I'm very grateful to for the help of the debut entertainers they always ensure a higher level of quality of what goes into code so they they help a finding problems so different maintainers in particular La Basia and so I'm helpful for whatever bug report hug or patch which comes once in a while and yeah that's that's that was this presentation and famous package so many Linux distributions and in case you wish to run it I don't know not only no but somewhere else perhaps I then can help you with it and we can do it I know that most of the people use film for for the Raspberry PI or for such systems I don't play with that it's okay if they do it I like to use it interactively and I invite you to do the same and yeah I welcome your feedback or your ideas of what's not yet there thank you time for questions first second is there just the just been used the upload anything to the gpa no what the question is whether there is a floating functionality of for doing what exactly for instance I don't know I just wanted to use so starting an image view from scratch if there was any no no no no no no if you but motivated now or in a different venue why it's useful important or whatever we can do it but at the moment no it's it's simple it's not a new project sorry is it a new project official from the new no it's not an official new project this is why it's called not new.org the website when it's post it looks very new it's I was confused maybe I maybe knew one day just for yeah perhaps yes I mean it's not going to itself because it uses so far now it's used to do a project and everything actually there are no projects there it is the reality yeah next question all right who used film so far maybe one side yeah one user who thinks we'll use film will you who really thinks we'll never use such a thing oh no yeah the honest good what's what's the typical use use case for using I use it often for showing what's the use typical use case of using film I don't track users I know that many people use it for like frames frame displays which is boring to me if it's useful to them good for them it means that they have like small computers like changing every each second picture super boring to me but they do it they're happy I'm happy but what I'm using for I use it for whatever I mean if I'm I create plots sometimes via pipe so I pipe it to film and film show me the picture I test filters graphical filters so I pipe picture I go show it to film I have my picture collection of vacation pictures I want to show how a good pizza looks like to a friend and I just look for pizza in my in my search in film and show you look this is how you can inspect visually a pizza and I don't have yet the pizza detector the plug-in for film which tells you the pizza is good just by showing it but you can easily integrate it with some OCR program if the OCR program is fine I mean or the pizza detector if you train the pizza detector fine it will show you a some kind of emotions are not yet supported but perhaps some other caption you can program to to be shown like good pizza or bad pizza and these are uses which I really support and invite you to to to have in film or I use it for aliases on on on my shell so sometimes I want to also show some kind of specific concept like somebody's being nasty to me so I start a special alias which shows something a gram picket for instance this may have communicating you're being nasty I don't know what other people do but whatever it's it's about showing a picture not manipulating it really so keeping the file not modifying the file I think belongs to film if you want to go dangerous you can compile film with the system I mean you can interact with the shell and you can you can do dangerous things with him it's many things are possible but I think the best way is to have a powerful possibly minimal because you don't have to implement all of this fun although compile in all of those functions so it should be a flexible image viewer that you can use with one over SSH over screen over T mooks over whatever I said I don't know and with different graphical outputs so I don't know other programs who do the same six is not there I'm really sorry many people use six nowadays I know that for some reason they like to encode to use more bytes than necessary for viewing pixels perhaps one day we'll have six cells six also here yeah thank you you
Productionizing Jupyter Notebooks
We're trying to get some fresh air into the room as well. Have to be creative. Okay, we'll get started with the next speaker now. Anthony is going to talk about productizing Jupiter notebooks. Please welcome the speaker. Hello, nice to meet you everyone. Yes, my name is Anton. Hello, hello, hello. Okay. So, as I was saying, I'm a software engineer programmer and I've been working on data engineering and data, disability data processing projects for the past 10 years in VMware. I'm glad that VMware decided to open source, invest in open sourcing for Stout Data Kids for the past three years. I've been focusing on maintaining and developing open source projects. Today, I'm going to talk about the challenges in productionizing Jupiter notebooks and show some possible solutions to those challenges using Stout Data Kids. So let's get started. You have been using Jupiter notebooks, if you can raise your hand. It's 20%. Hello. Okay. So, if you're a CEO, hello. Okay. A CEO at Jupiter notebooks are really versatile too. It helps a lot in terms of doing experiment, exploratory data analysis, POCs, like, things like, okay. Hello. Is it working now? Okay. So, with Jupiter, you can do a lot of things because it allows you to mix documentation with markdown, with visualizations, and also code in different languages. Still, there's quite a bit of struggle that you don't really deploy your notebooks directly in production. Most likely, you can correct me, but from my experience, you'll be redoing this using some kind of Python scripts or other type of applications of framework and not directly in the production, which is a bit of a double work often because you do experiments in notebooks and then you do actual productionizing separately. The other tool, generally, I'm going to talk about is for Stout Data Kids. So I'm going to quickly introduce it. For Stout Data Kids, it's an ETL framework, or EOT framework, depending on how you want to use it. It provides you with an SDK, which allows you to write basically steps and those steps. You can ingest data or process data. There are abstractions that make this a lot easier. So this is just Python. So you can install it with Python. And separately, there is control plane operations UI, which is the server part optional. That can be installed on top of Kubernetes. So let's now dive into Jupyter and what are the challenges to productionizing. So I've listed here five challenges. Though that's not exhaustive list, there's some of the most prevalent one in productionizing and using notebooks in production. I'm going to go over each one. I'm going to explain what I understand, the challenges, and then I'm going to show what a possible solution to that challenge is. So let's start with the first one. Reproducibility, right? So it's in this example, let's say that you have a notebook where you want to develop in three different cells. The first cell in the notebook, you set some variable to zero, then increment it, and then you print it. What would the result you would expect at the third cell? I imagine one. And that's what you would expect if this notebook is being run in production in an automated self-sufficient way. But that's not necessarily the result you do during development, right? It's quite possible that the user can execute the second cell twice. In this case, the result would be two. It's also possible that you execute the second cell, change the second cell after executing it. In this case, your result, if you deploy this job in automated pain production, wouldn't be one, as it's currently. It's possible to remove the cell. But because the notebooks run in state, again, the variable is one, and every call after it, it assumes it's one, even though if you deploy it in a scheduled automated way, it no longer be so. This means that notebooks are not really reproducible, and that you can start developing in a state that's diverging from one that will be in production. So what is that is clear trouble, because things may seem to work locally, and then when you deploy your job or notebook in production, all of a sudden things break. So what we can do, one thing, for example, we can do is with VDK, what we've thought of is the concept of, let's say that we have this kind of notebook, right? And we want to have a predictable way of which the execution is done. In VDK, we can mark certain cells as production ready and mark them in the order which they will be executed, which is top-down noise. So we have created a visual way that the first cell is executed first, the second, second, the third, third. If our state, which is on the left side, the current execution order is different than the one that's expected. We can issue warnings, and this allows also to smoke testing before deployment, your notebooks. It shows the predict exactly the order that you'd expect the cells to be executed. This is done by, there will be different ways, but for now this is done by setting a tag called VDK for each cell that you want to be running production. The order is always top-down. This solves some of, helps solve some of the problem by providing this kind of determined sequence in executing order of the cells, while it potentially detects divergence, and we'll see later it allows to test easier. The second challenge I want to speak about is code organization. Overall, in the notebooks, you expect to have quite a bit of irrelevant or maybe debugging code. That might be useful during development, but it's not something that you want to run during production in an automated scheduled manner. Like this very simple example, the first two cells import pandas and read CSV, and the third visualizes it. We can say that this is most likely relevant for your workflow, and this is something that you want during development, or if you want to share the notebook with a colleague. This is again, could be helped with VDK tags. As VDK tags, you talk only the cells that are relevant, and you think that they're going to need to be deployed in turn on schedule manner in production system. All the other cells will be completely ignored when the notebook is being executed. Like in this example, the first, second, and the third cells, which are on the right side, are covered in blue, will be expected to be executed in production, and we don't need the debugging code that's simply checking the data frame or visualizing being skipped. The third challenge I'm going to talk about is the execution model. In notebooks, generally, you have much more complex execution model. This is necessary because of the way the notebook needs to keep state and the way you want to be able to use multiple languages, but those kind of execution models are really bad for automation, or using notebooks as part of a workflow. They add a lot of extra work on top of this, and in order to execute your Python code, you need to go through a notebook server to the Ipython kernel, for example, if it's Python, and so on. Usually, the way you want things in production is much simpler. You have a Python script, and it executes on top of Python, or SQL script, and executes on top of some SQL, and that's it. You don't have a lot in the middle. Because with VDK, if you can extract exactly those kind of Python pieces and construct your Python script or SQL pieces, we can do the same thing, which is what VDK does. So when VDK executes a notebook, it basically would extract the Python and the SQL pieces and directly execute them. This would enable us to do things like reuse another notebook as a template, or almost as a function, in a similar way like this one. Let's say that we have a... This is some kind of job, or Python script, and we are going to execute another notebook, process notebook, Jupyter notebook, almost as a function with arguments and so on. You can also execute it within a workflow, and you can run automated tests, which will show how you can do that in a little bit, which is... Which goes to the fourth challenge that I want to talk about, and it is automated testing. You can see a CD. There is no doubt that automated testing overall, we were able to have a CD pipeline, it's cornerstone of having reliable software nowadays. Jupyter notebook do not really easily lend themselves to this kind of traditional testing paradigms, and also really vital if you want to push code to production, and make sure that any changes to be done, don't break out and things work as expected. In Jupyter notebook, there has been some attempts to solve this, and it has been quite a bit of a challenge. With VDK, we will be attempting as well. One of the things that you can do, because the pre-determined order that VDK tagging provides, and the fact that you can mark which cells need to be executed, and the fact that VDK skips a lot of the kernel and extra in the execution model, means that with VDK you can use command, for example, VDK run, which is provided in a Jupyter notebook with the VDK plugin, which will execute the job as it is supposed to be executed in production. One by one, each of the cells in the order that you expect them to be. Another, of course, you can also do this with CommonWide interface using CLI. Beyond that, if you end up using the control plan, which is the part that you deploy the job in production, the integration with the notebook UI would make it sure that if you create new deployment, it will prompt you to run basically end-to-end smoke test, the data job, as it's called, the notebook files, and to make sure that they execute correctly. Finally, because this notebook can be used practically as a function, that means that now you can test them using Python PyTest. You can write PyTest tests, the survivor code, VDK test, that helps with that, in which you specify your dependencies through plugins. For example, you can specify dependencies like SQL databases, it could be HTTP servers and so on, on a way sort of to mock them. For example, using PyTest HTTP server if you want to mock an HTTP API. You can verify the results after the notebook is executed. This is the link over here about different cases that can be used with PyTest and notebooks. This is pretty powerful because it allows you to actually be able to run automated tests of all your code that you want to productionize the notebook. Finally, another potential issue with notebooks is version control challenges. A notebook file tends to be this kind of JSON structure when you put it in version control, which contains all kinds of extra fields, including output and so on, which generally you want to clean. This is something that will style data kit and deployment. When you deploy data job to be in some kind of managed environment, it can strip all the necessary parts. Instead of having this kind of diff, maybe where the only relevant information really is just the source, the last three lines, you can have this diff, which is where you show run job input, right? There are actually no changes in this diff, despite what it appears. Those are the five challenges and potential solutions that I wanted to share with you. There's this kind of self-paced tutorial that's showing some of the things that I've shown, how it can be used and done, so you can actually try out your self-ings. Overall, if you'd like to discuss more about do you think those kind of challenges, reproducibility, co-organization, execution model, automated testing are really relevant for your use for notebooks? Do you think the solutions make sense? Do you see any other challenges that are important? I'll urge you to contact me. I'll be happy to talk about it. You can do this through LinkedIn. If you want, I would appreciate if you can take the survey, which would simply ask you, basically, what I ask you, do you have any comments? What did you like? What are you talking about? Do you have any other issues? And you can leave any contacts if you want to talk about more. Or you can just contact me directly through LinkedIn. And yeah, that's everything I wanted to share. Thank you very much for listening to me today. Thank you. This is going to be interesting. Okay. We have time for a couple of questions. Do we have any questions? If you can, please wait until the Q&A is done. That will be helpful. Yeah, there's a question there. Can you ask her question? Can you repeat the question? Yeah. Okay. Let me run it around. Okay. Thank you. I was wondering if I want to deploy VDK, does that replace my Jupyter Lab, my Jupyter Notebooks, or does VDK replace my existing Jupyter Notebooks server, or is it an add-on that goes with it? Okay. So does VDK replace existing Jupyter server? Is the question? No. Actually, it's Pugin, both Ipite and Pugin, and also Jupyter Pugin that provide this functionality. It basically, you can, you'll be running your Notebooks or the VDK also, they're called jobs, which is the directory of Notebook files or other scripts with VDK using either VDK run, as I showed, or the UI, or you can just run them as a Notebook. It provides on top of it some extra variables, the Pugin, or there's also programmatic way in Python to run them. But it doesn't really replace it more like co-exist on top of it. Thank you very much. It was very interesting. Doc, does VDK, say for example, I want to export my Python? For example, respecting the ordering rules in VDK, does that, if I wanted to export that from Jupyter, will that impact the Python that's produced by doing a pure Python.py export? So, you're saying that you want to export the Python which is marked with the VDK tag, for example, in a script? Yeah, it's possible. And I, assuming that the script runs with VDK run, it's supposed to be running in almost the same way in Python script. You might, VDK provides some kind of extra libraries like job input, which if you're using, you might need to just initialize yourself, but other than that, there's no reason not to work. Hi, thanks for the nice presentation. My question is what was the first requirement for you that you needed to productionize Jupyter notebooks? What was the purpose of first productionizing them instead of using the regular Python scripts? So, what is the purpose of productionizing the regular notebooks instead of using Python scripts? Well, I'm hoping that the idea is to prevent double work so that you can reuse the same basically environment that you are developing and not needing to redo the same things in Python scripts in separate environments. Make it easy also for people without needing to know a lot of pyn of internals and software developing engineering practices to productionize things. So, there's a point I think that this might break if you have some very complex applications. In this case, probably you still want to switch to using IntelliJ and Python and some kind of framework, but until some point it should be much easier for other people, I hope. Thanks for the talk, first of all. Just wanted to ask a couple of things about regarding dependencies. You know how Jupyter notebooks typically don't have any versions of specific dependencies stated at the top. Probably just says import specific packages here and there. Probably doesn't do anything about pip installing those unless you specifically put a show command at the top. So, how is it processing those dependencies? Is it like automatically interpreting it? Or we still need separate requirements or a PyProject or something else that specifies those dependencies? So, how dependencies specified in VLK? Yeah, VLK, basic VLK data chop is a directory which has a couple of special files. You can have Python, SQL files, notebook files like this one, but you also have requirements, the XT file where you specify your dependencies and it will either you install them locally or when deployed, it will automatically install them in the environment. That's how it handle it. Okay, thank you very much. Thank you.
CATS: The Climate Aware Task Scheduler
The Institute which tries to promote best practice in research software across the UK. And earlier this year we worked on this project called Cats, the Climate Aware Task Scheduler that we'd like to talk to you about today. So very simply the idea behind Cats is to try and time shift when we do our compute to the times when the carbon emissions of producing electricity are at their lowest. We are probably aiming this at smaller to mid-size HPC and HTC systems that are not 100% loaded and therefore we have that flexibility to time shift. If you've got a super busy system that's always at 100% then there's not much you can achieve by time shifting your compute. I'm sure that most of you are familiar with the very pressing need as to why this is important but carbon dioxide levels are now getting higher than we can tell they have ever been for the last 800,000 years. Before that our records are not so clear because we don't have things like ice cores readily available but it looks like this is very, very much human caused and because of burning fossil fuels and this is causing a rather dramatic and alarming increase in temperature. Very worryingly in 2023 we saw temperatures going dramatically off the scale and with a far bigger jump than we had seen in any prior year. We get to see if that trend is going to repeat last year but I don't know how many of you have been to Fozden before but for me this is the warmest Fozden I've ever been to and I think this is my fifth one. Most specifically in activation. Have I gone? No, that's going. Is that working? Yes. We don't want to set the world on fire from doing our computational work but that work is important and we would like to do it but having the most minimal impact we can while we do so. So our plan is, as I said, to time shift our compute to when electricity has the lower carbon intensity. We focus this very much on the UK because that's where at least most of us were based and where we met to come up with this idea. The UK has a very, very variable level of carbon intensity in its electricity. Some other countries are not so variable but we have quite a lot of wind power now and quite a bit of solar but it's always windy and it's not always sunny. Carbon intensity can actually vary from in some regions as low as zero grams of carbon dioxide per kilowatt hour to about 400. The average across the whole of the EU in 2022 was about 250. And we have some huge regional variations within the UK. Scotland is normally very green and very low carbon because it's got a lot of wind power, a lot of hydro, not that many people demanding it, not that much industry compared to England. And although it is interconnected into England, those interconnected are of limited capacity so all of that electricity can actually be made to other parts of the country. Conversely, the south of England has a very high population density and is very dependent on gas power still. It does have a bit of solar power because it's the sunniest part of the country and a little bit of wind but not so much as Scotland by any means. As we do have this relations connection, we have a lot of international interconnections now. A new one just came online to Denmark. Another new one came online to Norway a couple of years ago and an additional one to France. We've had another one to France since the 1980s and there's also interconnected Belgium, the Netherlands and Ireland. But again, those represent maybe 30% of typical generation capacity. Unfortunately in the UK we also have this great web API called carbonintensity.org.uk that provides us with regionalised forecasts for the next 48 hours with 30 minute time windows. And it has both JSON and XML APIs that will allow us to interrogate this data and very easily get hold of what our regional forecasts are and it will show us how things are performing both against the forecast and previous measurements. So just to give an example of some of the data we get out of this site, this is an example of a really good day in October last year where the whole country was green, which I think means under 75, no, maybe 100 grams of CO2 per kilowatt hour and about half the country is dark green, which means I think under 35. And we can see some of these regional variations. So just comparing two regions here, these are two regions where my employer has offices. So in the north Wales and Merseyside region, which is this one here, we had only grams of CO2 per kilowatt hour. In the south of England we had 92, which still for the south of England is very low. But if you look at it later, the situation had changed drastically and we now have 289 grams in the south and 235 in north Wales. So if we could have made sure our compute jobs ran on the 6th instead of the 9th, we could have reduced the amount of carbon being put into the atmosphere to run those jobs quite considerably. So just thinking about how much this might actually save us, imagine we have a fictional HPC or part of an HPC, which has 64 core AMD epic CPUs. There's 10 nodes, a 1280 cores in total. And we reckon each one of those will use about 255 watts fully loaded or 37.5 idle. So if we can bring that node down from full to idle, assuming we don't turn it off or suspend it or anything clever like that, then we're looking at around 217 watt per CPU saving. If we can timeshift from when the grid would be at 200 grams of CO2 per kilowatt hour to 50, that's 150. If we had a 12 hour job that used all those 1280 cores, that's around 50 kilowatt hours, which equates to about 7.5 kilograms of carbon dioxide, which is the same as driving a car 50 kilometers. So how many of us might consider not driving to work or not driving a 50 kilometer route because it might reduce the environmental impact? How many of our employers might also have a policy that says you shouldn't drive those kind of distances, you should take public transport? Well, why shouldn't we have the same policies for compute? If we can achieve similar savings, we should be doing that for our compute systems as well as for our travel. I'd just think about this across the wider world. There was a paper published in, I think, 2021, looking at the potential savings of doing time shifting for AI based jobs in cloud providers, just showing us the different levels of savings that could be achieved in different regions. So as you see, these vary quite dramatically from region to region, normally depending on how much renewable energy is available in that region and how bad the alternative is when they are not using renewable energy as well. Now, a lot of people might suggest, well, the grid is going to go net zero soon anyway. Is it really important to do this? And the UK has put out a set of future scenarios about how they might achieve their net zero transition. One of those scenarios, though, is where we don't actually achieve it. And it's not quite clear whether we're actually on that pathway or one of the more optimistic ones at the moment. But even if we do achieve this, and their target is by 2035, we should actually be near net zero, that's still another 11 years of us putting carbon dioxide into the atmosphere when we use electricity, when we could reduce that impact now if we do something with time shifting. So let's do something now instead of waiting 11 years for someone else to solve the problem and possibly not solve the problem in that time. There is also a financial incentive to do this. There are starting to be variable rate electricity tariffs that roughly reflect the carbon intensity because wind and solar power don't have any fuel costs, so they're much cheaper to offer. And if you've got your own electricity production like rooftop solar or your own wind farm, then there's also a saving to use that power, and it's normally better to use it than to export it and be paid normally quite poor rates for that export. So I'm now going to hand over to Appalachic who will talk more specifically about CATS and what it does. Checking, okay, seems to be working. Thanks Colin. So I've been introducing the climate error task schedule, and so CATS figures out the best time to start your HPC job, or really a job that you want to run in a laptop. The users need to submit run times, of course they need to have some idea of how long the job will take, and a postcode. So this graph shows for example that if you schedule a job to the optimal time, you can save your carbon footprint by 70%. So right now it's like a proof of concept, we haven't released version one yet. It was built in one day at these SSIs collaboration workshops. It's kind of a hackathon that we have every year. It won the first prize. So what it is is a Python script, and it targets the at scheduler, so the at scheduler on Linux and BSDs, schedules the starting time of a job. So if you want a thing to start at 6 in the morning, it's at 0600, and it will start the job then. So that's a very easy, simple way to integrate CATS into this scheduling system. So that's what we started with. So the limitation is of course that if your HPC is running at full load all the time, then it doesn't matter what you shift around because it will always be 100%. So it's really meant more for computing systems that are not on all the time, so time shifting will actually make a difference. Of course you'll need to know how long the job will take. Without that, this doesn't work at the moment. If it's an HPC, then other users might be trying to run at the same moment, so how do you do that sort of resuscitation? Currently, it only works in the UK because we're using the Carbon Intensity UK API. Of course, we open to pull requests for other APIs. If other countries have energy data, energy forecasting data, so far we have not found any free to use APIs. So if you know, please let us know. And of course, this is not the only thing you should do. You can use cooling or there's a lot of carbon, embodied carbon in the manufacturing cost of the server that things are running on. So that's probably why you have HPC refreshers because newer servers are much more efficient. So there's a lot of considerations, but in terms of what you as a user can do, I think this is a good place to start. So the way to use CATS is, it's a Python script, so you run it as a module, give it a job duration, give it a postcode. The postcode is a proxy for the location. And using this, it fetches data from the Carbon Intensity API and it calculates the optimal starting time for your job. So we return data in a format that can be passed to AT and it also gives additional data in JSON. So what the other feature of CATS is that it will try to inform the user about the carbon savings that take place when they're offsetting the time. So we do some estimates on how much carbon would be saved by using some year after specify configuration using like what kind of CPU or GPU you have, kind of power use efficiency and thermal design power. And using that, we have a simple formula that will calculate the amount of carbon that you would have saved by running it in this time that CATS wants you to run it rather than will right now. So this is a demo of CATS running. So we specify duration of 60 minutes. And this is RT1, which is redding. And it gives a time on 16th May, 1130. And it also shows you the amount of carbon saved. And that scheduled, so this is a last bit, you can run what you want as long as it takes around 60 minutes. Okay, that's it. So if you have any questions, please email us or put up an issue in our GitHub. And I skip the slide. So what we're doing is we are going to release vision on this month. We're cleaning the common line options. So right now we only support AT, but we also want to support SBAT, which is the Slurm command scheduler similar to AT. And so we'll clean on the command line options and we'll release a vision. But of course, going forward, we want to integrate with Slurm, which is the main scheduler for ACs. And the simplest method, of course, is SBAT, so the start time, the other option is green cues where you have cues that don't run 100% because again, with running 100%, you don't get the benefits of cats. So we have green cues which don't run 100% and also integrating carbon accounting. So Slurm has a plugin that allows you to look at the total power used in your job. So way to integrate carbon accounting into that would be great. That would require rewrites in C because Slurm plugins are in C. And we have got funding from the Software Sustainability Institute for a few months of developer time. So we're looking forward to making great progress and have something more polished soon. And yeah. And now, thank you. Thank you. Thank you, Kavain and Abish. We have time for some questions. Yes. Thank you for your last presentation. But my question is how often clusters are not load on 100%? Yes. I mean, that's a very good question. So I think we need to think about how these things will work moving forward. So we are talking with HBCs and I think this is more proof of concept and prototype to look at how these carbon footprints might be achieved. Obviously, if you do keep things 100% on time, then there is no point. But I think if we do move to a future where we don't move to a net zero grid soon, then we might see funders asking for carbon budgets or government putting carbon budgets just like your financial budget. And then you do want to see this. So it's more like playing the groundwork for what might come next and not for the current generation of HBCs. Any more questions? Sorry. I wasn't looking over there. Thank you very much. Couldn't you use also energy prices like energy stock prices for forecasting? Are they usually at least in Germany or lower when there's more renewable energy in the net? Thanks. Yeah, that's a good point that we'll look into. A good place to start with is to correlate, I guess, energy prices. Yeah, yeah, we'll do the care and look at carbon intensity. Thank you. Hi. I don't want to be Dell specific, but there's a web based tool from Dell called EIPT, Enterprise Infrastructure Planning Tool. You can configure servers with your particular build and you get their CO2 and power usage for various workloads, computational memory intensive and idle. And also Dell have a plugin for their IDRAX called Power Manager. You can take a rack or a group of workstations servers and turn down the energy capping dynamically. And that also works for Fujitsu and HPE servers for other brands. I wonder if you would like to extend your work into that rather than scheduling the jobs to turn down the power cap dynamically. That'd be quite possible. Thanks. Yeah, I was at a word that there was a carbon accounting plugin for Slurm already. So yes, ideally we have something that goes into Slurm and does the carbon accounting rather than having user-shadow jobs. That's definitely not optimal. So yeah, thanks. We'll look into that. Any more questions? Sorry. Are you also planning to integrate this into HD Condor? Sorry, could you repeat? Are you also planning to integrate this into HD Condor? It's another scheduling software? Not at the moment. At the moment we are focusing on Slurm, but once we have got something there we'll focus on other schedulers. Okay, thank you. Hi there. First of all, love it. So thanks very much for the talk. I'm looking forward to version one being released just to throw another thing out there that you could do. It'd be cool to have two locations and you can imagine an office having two offices in the UK and they're kind of deciding where to deploy that task. Maybe that extends even globally and you can choose where to run your task. Yeah, spatial shifting would be a nice thing to have. The problem with that is moving the data with it. You need to have the data in both locations or move the data in advance of the job starting. Anyone else? That's one over there. This is really just a comment. I made a calculation of how much commuting to work costs in terms of energy compared to what the supercomputers in that computing center are. Just to give you a perspective, usually when you go to work and go back home it's like the 100% that you can save here. So just to give in general to whenever you're doing such savings estimates, just be aware that tell your employer that just by working from home you can save that exact amount. Sorry, it's just a statement. Just a quick question on how CATS works. I don't recall if you got into detail about it, but does it use the API data for time series prediction or does it have some kind of internal calculation to do that? It tries to find the lowest carbon window in the forecast period. So if your job is going to be six hours it will try and find the lowest six hours in that period and schedule your job to that time period. Okay, but the predictions are already available. The predictions are not made by us. The predictions are available from the carbon intensity API. Thanks for the talk. I have one question. If you already saw an API which is called Wattime API, I think it does the same, but all over the world. I wasn't aware of that one, but that is very useful. We will look into that one. Hi there. One, two. You can hear me, right? Yeah. Hi, I'm Chris Adams from the Green Software Foundation. I'm curious about whether when you're doing this, if I know what jobs I did last year, is there a way for me to run any of this against, say, last year's worth of compute jobs to see what my savings could have been if I were to find something like this? Because a lot of this is forward-looking, but if I've got some data now, that will make it easier for me to make the case that this could create some meaningful savings inside my team or inside my organization. Yeah. Slurm normally has some accounting built in that logs when jobs ran, so that would give you the data that you'd need to go and do that. I think there's also a simulator available, which is something we want to look into in our next phase as to whether we can make use of that to do that kind of prediction, or not prediction analysis. I saw your talk last year, and that was actually some of the inspiration behind this. I think that's it. One more. Hi. I was wondering, we're also in the purchase of buy-in new cluster, and we have also raised this thing with the users. Turns out they don't really like this. The idea that there would be free resources in the cluster, and their job would not start immediately. Is this a cluster with people actually using the HPC clusters? So far, it's very hard. We're finding that from users, there's sort of a three-way split of some who care enough to definitely go out their way to do things, some who might use it if it was available, but aren't going to go out their way, and some who they want their science now, and they don't care what the carbon emissions are. I guess we can target the other two groups first, and the other third might get dragged out, kicking and screaming one day when carbon accounting comes in for them. Okay, can we thank the speakers again, please?
Overcoming MPI ABI incompatibility
All right. We're ready for the next talk. So Mark is going to explain us how to overcome MPI, ABI compatibility with WM for MPI. Yes, can you read me? Great. So, hi, my name is Mark. I'm working at CEA. And today I'll talk about how to overcome the MPI, ABI incompatibility using WM for MPI. Just a couple of words about CEA. Can you read me? Yes. We are a French organization. We are hosting a computer. We have big supercomputers that are used at the national and European level. And in a couple of years, we'll host one of the two exhaust scale systems in Europe. So I start with a quick introduction on the ABI MPI incompatibility issue. Why we actually care about this problem. I will try to give you an insight of what is the problem actually. And then I will show how we can overcome this issue by dynamically translating between different implementation. And I will show you some use cases before concluding. So first of all, why do we need MPI library portability? There are a few reasons that are motivating for us. The first one is to be able to work around the limitations of an MPI library. So as you may know, MPI is a norm. There are different implementations. And as all software libraries, they may have limitations. They may have bugs. So to be able to switch between libraries can be interesting to actually diagnose the source of a problem. It's interesting at the level of a user. It's also interesting at the level of managing the supercomputers or at the level of the developers, of course. It can help you also to choose the best MPI implementation because as you also may know, between the different implementation, you won't have necessarily the same algorithm to do a specific communication. And on a specific cluster, some MPI implementations may be better optimized than others. And it's interesting to be able to test easily all those things. It's also useful to enable fast and portable containers because what containers offers or claim to offer is the flexibility and portability. And it is almost true, but there is a problem in HPC that is you will have a loss of portability if you need to match the host MPI. And in almost all cases, you will need to match this host MPI because it's optimized, it's vendor optimized, and you have no other choice than doing so. And it's not so probable to have the right underlying libraries. Sometimes they're even closed in your containers. It can help you also to add flexibility to high-level languages. Some high-level languages like Julia or Python at some point depended on a specific MPI library. So if you want to use another one, you're stuck. And also sometimes you will build very complex software stack with those languages and if you want to switch the library and you need to rebuild everything, it's really, really time-consuming. And if you can switch easily, that's also interesting. And the last point is to be able to run a bidding-edge system or early access systems because in many cases on state-of-the-art systems, you will have a single vendor optimized library. So if you have your application already compiled with another implementation, you'll be stuck once again. And that's sometimes something we saw with cloud providers, some cloud providers provide a single vendor optimized MPI library and that's it. So now let's talk about the ABI compatibility in MPI and why is it a problem? So MPI, as I said earlier, is a norm and the norm actually defines a single API, which is great, but it has several ABIs. At the moment, there are at least open MPI, MPH, and all the MPH-based library or MPH-compatible implementations, MPC. And if you go back in time, there are others also that exist. And the problem is there are in general ABI incompatible because even for the simplest element of an MPI library, it won't be implemented in the same way. If you look at just a communicator, which is the very basics for MPI, within MPH it's an integer and within open MPI, it's a struct. So you have no way of going from one to the other. And it means that if you want to switch from between those libraries, you will need to recompile. And sometimes it's possible, sometimes it is not, as I said, because you could have a very complex software stack and it can take literally days to recompile everything. And sometimes you're stuck with proprietary software. Even though we are at first-dem, those software exists and we have to live with them on our clusters. So how to do that? That was the motivation to create Wayfarr MPI, which is a library that allows us to switch between MPI libraries. The idea is to catch the calls from a library and to translate them. So we have the input arguments. We will translate them from the original library to the destination ABI. We will call the function from the destination ABI. And then we will go the other way around. We will translate the output arguments and the return value from the destination ABI to the original ABI. There is just one catch. That is, in some MPI functions, you have actually a call to another MPI function. So you have to ensure that you're not already in an MPI function and to avoid to re-translate functions. Because if you do that, it will crash. So we have an assembly code selector to deal with this issue. And as you may also know, there are literally hundreds of MPI functions. So to deal with those translations, the functions that are doing the translation are actually generated. So we have templates. We have files to define the different functions and the input arguments and so on. And we generate everything to be a bit more robust. So now how to use Y4MPI. One of the great things with Y4MPI is that you have two available modes. The first one is the preload mode. And this one is quite interesting if you already have a software that is compiled. Because you can dynamically, at runtime, translate between MPI implementations. So imagine that you have a software that was compiled with MPH. At runtime, you can use OpenMPI. And we have an interface mode. So now we are using Y4MPI as a stub implementation. So we will compile the code against Y4MPI. And it is at runtime that we will choose which MPI we want to use. So this one, if you know that you will have Y4MPI at end on the cluster you will be using, it will ensure you greater portability. From an installation point of view, it's really simple. It's a basic CMake-based installation. And Y4MPI is also available through the SPAC package manager. So if you're using SPAC, just SPAC install Y4MPI and you're all done. And in practice, how you can use it. So there is at least two ways of using it. The first one is to use it directly as a wrapper. So imagine you're using slurm. So you will call slurm. In general, you will then put the binary you want to launch. And you will just add the wrapper saying that you want to go from one implementation to the other implementation. Another way of doing it is to use it transparently using environment variables. And here the main catch actually is to have the right LD preload because you will inject the Y4MPI library at runtime to catch the MPI calls and to translate them. And then if you do that, you can run directly your app with this run. Don't worry about all those variables. They are, of course, documented. And if the translation work, the only thing that will differ from a normal execution is that you will have a message saying, hey, you're using Y4MPI in the preload mode from one implementation to the other. So for more advanced Qsage, I invite you to read the documentation. We have a bunch of tutorials. So it's available online. At the moment, we have seven main tutorials. One, well, we have few basic tutorials about the installation of Y4MPI, how to translate MPI dynamically using either the preload or the interface mode. And a few examples with Python, with Gromax, which is a molecular dynamic code that is heavily used in the HPC community. And we have examples with containers. The tutorials we have at the moment with containers are using Podman. The only reason why we use Podman is that it's the easiest to install almost anywhere. You don't need to have root privileges to use it. But even though it's specific to Podman, actually, the ID stays the same. So if you have another container runtime that you like that is not Podman, you can apply what we did with Podman on your specific use case. So Y4MPI started in 2016 at CEA, and it's still in Active Development. It's, of course, open source. And it's under your license, CCLB plus BSD3, to be compliant with the French and more international law. All our developments are validated using a CI that is using well-established benchmarks, especially MPI benchmark, like the OSU benchmark, IOR, and more user-oriented benchmark as Gromax. And it's also an ongoing collaboration between CEA and the Laurence Livermore Laboratory in the US that started in 2020. And we have publications together. And last year, we did a tutorial at ISC, and we hope to host another tutorial this year at ISC in May. Regarding the support and the limitations of our library, so we're supporting X86 and ARM architecture, GNU, Linux, and BSD. It was tested recently on 3BSD. We are supporting also, of course, CN4TRAIN and 3.2MPI norm. In terms of the limitation, for it to work, you need to have a dynamic linking. If you compiled your code statically, it won't work. It's better if you can avoid a circumvent air path, because our idea is to inject the library at runtime. We are not supporting the timeout feature on BSD distribution. Because so we have, so we added few features in WeforMPA that are not defined in the MPI norm, and in particular, we have this timeout feature that allows you to set a timeout on a specific MPI code. And that is very interesting for debugging, actually. For the transition of some constants that are defining the max length of some strings, you may have truncation, but with all the tests we did so far, it was not an issue per se. And lastly, the MPIX functions that are the experimental functions implemented in the difference MPI implementation. We are dealing with those functions on a case-by-case basis. So they are all supported, but we started to support few of them. So now to give you a glimpse of what you can do with it. The first examples I wanted to show you is something that happened to us that we had a GROMACS version. So GROMACS once again is a molecular dynamic code used in HPC. This version was compiled against MPI. And on the target cluster we were using, MPI could run only on GPU, and we had an error on CPU. So this is due to the fact that in GROMACS you have a call to the MPIX query code support that is just checking if you have GPUs. But actually the way it's implemented in MPIX is that if you don't have GPUs, it crashes. In other implementation, it's not doing that. It's just telling you, no, you have no CPU, sorry, and then you can do whatever you want. But the way they did it for no in MPIX is that it crashes. So we couldn't run this code on CPU. And so we used Wi-Fi MPI to run the code to go from MPI to open MPI. And here I see the results. So using MPIX, it fails. Using Wi-Fi MPI, we have some performance. And actually we recompile the version of the code using the open MPI. And we have performances that are really similar. And the other use case I wanted to show you is with containers. So the idea is to have an MPIX-based container in which we have an OSU microbenchmark that was compiled. And we did some comparison between two AMD mail-in nodes at TGCC, which is one of our clusters at CE. And in all the things I will show you, I compare an execution using open MPI directly on the cluster with the OSU microbenchmark. And one execution using a container in which we have open MPI available that we are plugging on the open MPI of the cluster. And the last one is a container with MPI. And we are using open MPI using Wi-Fi MPI. So the first graph here is to show the in-it time between those three cases. And you see that it's very comparable. We have the same results. It takes the same time to in-it MPI with Wi-Fi MPI. Here it's bidirectional bandwidth. And it's the same. The results are very comparable. And another example is an old reduce. So here all the cores of the two nodes we used were participating to the communication. And we have very comparable latencies between those three cases. So the good point is that the overhead of Wi-Fi MPI and those tests is really minimum. So now in conclusion, for the future, the good news for HPC and the bad news for us is that there is a standardizing project for the ABI layer that started last year. And it's really great because it will greatly help all the HPC users. So there will be very likely a common CBI defined in one of the next norm. You can refer to the Haman et al paper from 2023. And we actually can reasonably hope for a convergence because nowadays there are actually two ABI that covers more than 90% of the HPC platforms that are MPI and OpenMPI. And the plan is to have a single feature ABI-only release for MPI 4.2. At some point they were talking about MPI 5, but finally in India it should be for 4.2. And they are hoping for a draft for SE 24, so more or less one year for now. There is already a prototype available in MPI. And there is also lots of ideas regarding this common ABI that are implemented in Mukatuva from Jeff Hammond. So I put in the links if you want to have a look at it. And if you want to have more info on that, you can check the MPI ABI working group also on GitHub. And the good thing with this standardizing effort is that the Wi-Fi MPI is actually seated as a reference implementation. And so to conclude, for Wi-Fi MPI in Ananshel, this library helps you to switch between different MPI libraries. So it allows greater portability and a better flexibility for HPC applications, including a containerized app or proprietary software. Its usage is mostly transparent. So in most cases we studied so far there were no significant overhead, which is also a good thing. And the library is still evolving. So in the years to come we hope to have an MPI 4 support. We would like to support the Muk ABI, which is the Jeff Hammond project that should be close to the common ABI defined in the future norm. And of course the project is open, so we are waiting for your contribution. And in conclusion, I want also to thank all the people who contributed to Wi-Fi MPI, and especially at CEA, at the Lawrence Livermore Laboratory, and at EOLN, which is a company with which we are working. Thank you all. Okay, perfect timing. Questions? Yes, thank you for the last presentation, and I have several questions. And trust me, which version do you support? So we are supporting the idea to support the Fortran API of MPI. So we have no limitation in terms of Fortran supported. But it's not the part that is the most tested. So if you have Fortran use cases, you're welcome to try it. And if it works, great. If you have any issues, open issues on GitHub, we try to be very proactive on the GitHub issues. Okay, thank you. The next question is about ILP64 ABI. So because a lot of programs compiled in ILP64 mode, I mean, for example, it's like FDFALT integer 8. So did you support this feature? We're supporting once again, if it's working, the idea is that if it's working with your MPI implementation, if you can actually do it with your MPI implementation, if you have a compatible MPI implementation, it should work. But some cases where you need to have a right, you have to have kind of the same way of compiling your library and the ABI. You probably didn't understand the problem that initially what a lot of implementation has two wrappers already. So you have initially you put in some wrapper which translates from ILP64 to LP64. And then only after this it will do some traditional calls. So do you support this feature? I'm not entirely sure. We should discuss it afterwards. Yeah, and the last question is about runtime because, for example, we have some additional parameters in MPI run and some programs like ORCA or MRCC. Did you just do internal calls of runtime, I mean just exact of MPI run? Did you do something with this feature? I mean, because, for example, Intel MPI supports some additional flags and the OpenMPI also supports some additional flags and there is different between these two implementations. So, same thing I'm not entirely sure because for quite some time we actually didn't support MPI run for this kind of question and also because on our clusters the norm is to use SRAM directly. So there were a few things with this regard that we didn't support it. So sometimes the Google started to support MPI run but I'm not entirely sure of the level of support we have on specific things you could find on one implementation versus another. Thank you. More questions? Hi, cheers for that. I wanted to ask. I want to mention some example programs and benchmarks that have performance portability here. What HPC applications are you targeting with this that require the performance portability? Because if it's a Gromax application that wants to use all of the system, it's typically that code will run for years on the system and be optimized for it anyway. So at which point do you need something that works on multiple systems that's okay and at which point do you need to have something that's very specialized so you wouldn't actually need to have a wrapper in between? I'm not sure to understand your question. So what applications use this in the real world on the clusters that you're working on? So none, example programs, what actual programs require the translation from one MPI to another? The only one in which we really had to use the wrapper and we have no other choice were commercial softwares. And especially when I talked about bleeding age system, we have some systems with BXI interconnect. And BXI interconnect is a evident interconnect. And they were supporting only open MPI. The thing is some commercial vendors distribute their softwares built against a specific MPI implementation. And they have no interest of supporting another one. And so if we wanted to be able to use for real those commercial softwares on our system, we had no other choice than using Wi-Fi MPI. That is the only case in which is really, really mandatory. For other cases, in most cases, it's actually more, it's more comfortable in some sense because it helps you to debug quickly. It helps you to test things more easily. But if you have an infinite amount of time in front of you, you can do anything. You can always recompile everything. But we don't have infinite amount of time in front of us. OK, thank you. Any more questions? I have a quick one. Have you looked at MPI trampoline? That's a very similar project to what? Yes, MPI trampoline appeared a few years ago. And the main difference between MPI trampoline and Wi-Fi MPI is that MPI trampoline allows you only what we call the interface mode in Wi-Fi MPI. So you can compile against MPI trampoline and then use another MPI implementation. But you can't bring your own code and run it directly with MPI trampoline. But yes, in the past few years, there were a few ideas similar to what we did with Wi-Fi MPI. And it's interesting because for quite some time, people didn't really care about this API incompatibility issue. And we see now that there is some interest. And that's also why at some point at the MPI forum, they decided that it was time to have a common API. So that's great to see other projects like that emerging. All right. I think that's it. Thank you very much. And if people like the project, there are some stickers here in front as well.
Extracting Mini-Apps from HPC software for Total Cost of Ownership optimized systems procurement
We want to stick to time. We've been doing very well. So we want to keep that up. The next talk is about, oh sorry, your name is on the slide. That doesn't help me. Tim, yeah, the next talk is about Tim, about extracting mini apps for HPC software. Then, thank you for having me back at Fostum. I am Tim. I'm by trade. I'm a compiler guy and I'm giving a talk to the HPC Dev Room and how that came to be. We'll be part of the talk extracting mini apps from HPC software for total cost of ownership, optimized system procurement. And I want to give a quick background about how this project came to be before I actually start to go into the technical details because this project is part of the NHR Association funded TCO, so total cost of ownership project. NHR stands for Nationales Hochleistungsregion. It's the German term for national high performance computing basically. And the NHR is an alliance of computing centers which all have different specialties, but all have a common admission process and have a so-called harmonized computing environment, which means basically that all the clusters at the different locations have very similar scheduling system, very similar file system. So if you manage to get your app work, your program working on one of the HPC systems, you probably will be able to get it working on the other systems as well. And this little blue dot is the is Darmstadt, which is where I'm from. And all those systems are procured at some point. And whenever you do hardware procurement, there are basically is basically the question, well, what hardware do you get? And the simple answer is you get the best of everything, right? So you get the most cores, the fastest RAM, the most efficient power delivering units, fastest most storage, but this is infeasible for all but the largest of HPC computing centers and even those usually struggle. So usually what you do is you go ahead and say, well, we want to get the best performance per dollar we spent. And for this, you need to figure out what's the performance is you get for the dollars. And so you usually use lin pack and stream, which are benchmarking. One is for basically stressing your floating point unit and turning your rack system into a space heater. And the other is memory. And then because you do not want to run only synthetic benchmarks, you use some of the spec HPC suite benchmark and run this and you figure out how much performance do you get basically. And recently there was this push to get more power efficient when you do HPC community. And so there the target to hit is performance per watt. And there again, you basically use lin pack. If you look at the green 500 list of HPC system, they all publish their lin pack score. But this is not actually representative of what the system will cost during its lifetime. It's usually just what's your one time investment cost in procuring the actual hardware. And what you really want to figure out, especially in the case of this distributed national high performance computing association, is the money we spend actually well invested for the use cases our users have. And this is where this total cost of ownership project came to be. Because you do not want to score your procurement only on performance, but you want to have it be a mix of different factors. Of course, you want to have the initial hard and software investment cost as part of this, but you also want to figure in cooling costs because this is one of the main cost drivers today. Put your power in and how to get the heat away that dissipate the heat that you generate. And usually want to have technical and administrative stuff for your HPC system to actually work properly. And then the last thing which is power consumption. And it's not power consumption of your idling system because that is reasonably low, but it's of the job mix you're running. And this job mix is very essential in this whole thing because the job mix is a very user dependent metric. It is what is the system actually being used for. So for example, and this is again referring back to these distributed and slightly specialized computing centers we have in HR. If you do physics simulation, your application might benefit more from faster CPUs with more core counts than compared to if you're using AI workloads where you probably can't get enough accelerator cards for your workload. And what you do is you monitor how your system is used, which is doable. For example, Slurm can do this. And then you figure out that your users are running lamps and grommets and open form your typical HPC software. And then you cannot really give that one to the vendors, can you? Because if you give a big grommet run to your hardware vendor and say, well, you have 48 hours to run this code through, then your vendor will probably not do it. And in an even more extreme case, you have this weird institute like the scientific computing institute where I'm from, which runs one weird A.out executable. They self-compiled with a custom build script. And you cannot give those to the vendors as well. But the problem is that all those HPC applications are large and complex and have different coding and software patterns. But they are the most representative thing you can get about what is actually running on your clusters. And so the idea is if you have some very big and complex HPC system like the one you see is simulated on the right, which has some kind of entry point and then does matrix multiplication and conditioning and heavy output preparation, the thing that actually spends most of the compute cycles is the one in gray. So the matrix conditioning matrix solving. And if you have a so-called mini-app, which is just the gray part and not all the other things around, you might be able to shrink this application significantly. And this mini-app approach was in one actually pioneered by Jan-Patrick Ler, which was the guy who gave the talk before me. So talk about coincidence. And the basic idea is you shrink the size of your application, but keep the computational characteristics. So the computational kernel where actually most of your compute cycles are spent stays the same. And then you just need to add some wrapper function that sets the kernel up. And then to finish, you just need to find some way to graceful terminator program because you can then have time measurements, power measurements on this little part of the actual big program. Great. And so this is why they needed the compiler guy to do this because they wanted in this total cost of ownership project the idea was to have a fully automatic extraction pipeline. And the basic idea for this pipeline was first you analyze the whole program. For this, we used the MetaCG framework. And those of you who happened to be there last year when I gave a completely different talk, we were using MetaCG as well. So it's a tool that's used at our institute quite heavily and allows you to have a representation of how do functions behave according to each other over the whole program. So you can get a whole program call graph. Once you know how all those functions relate to another, you can figure out what is the actual kernel. So where are my compute cycles spent? For this, the intention was to use Pira. The other idea is you just ask a domain expert what's the slow part of your program and they will probably tell you. So this is much more easy usually. And then the actual extraction of the kernel. And for this, we developed the Apex tool, so the app extraction tool. And it's a clang front and based compiler tool that does source code manipulation. And the basic idea is you query the so-called AST. You do not need to know how you get the AST, what an AST is. The only thing that you know an AST is a very, very condensed and information dense form and representation of a single CPP file. So if you have this CPP file on the right, which only contains the main function, you get the thing on the right. Admittedly, this is very much shortened. Where you can then find your record declaration, you find your structs, and you find your assignments, and you find your function calls. So what you then do is you can query this AST for your information to figure out how these function behave. So if you want to track the kernel, so you already know which part of the program you want to extract, you find all the functions that are used for this kernel. So your kernel might call some subroutines, you want to extract all those subroutines as well. And sadly, the AST is unable to provide us this information because we only can extract when we have the definition, so the body of the function. And as we are only limited to one CPP file with our AST information at a time, we only have to the declaration in this case, which was part of the header. So the print as function in our example is only declared, it's not defined, we have no body there. So if we have the whole body of a function, we get it as a whole text block. If we only have a definition, we remember that we need that one, and we extract it once we actually find the source file that contains the definition for this function. What we can do is we can find and extract all the used globals, because you usually rely on some kind of struct definition, you might even be using global state. And this is where the AST has the information. The whole definition of our struct S was inside the header files, and the header files are included by the preprocessor, though they are part when the AST is built. Great. So we just extract those as a text block, and then we need to find all include statements because include is the last colored thing in our example. And then we run into a little problem. Because remember how I told you that it's great that the preprocessor put the header files into our source code, well, include statements are also handled by the preprocessor. So everything that was specified in this include statement header is put physically in the AST once it's built, and we do not have the information anymore. So what we need to, and this is also true for defines and if and defs and pragmas, all those are resolved by the preprocessor, and we do not have any way to really figure this out once we get to the AST level. So we do not only need to hook into the AST, but we also need to hook into the preprocessor. Those of you who have actually worked with a preprocessor might know that the preprocessor is basically doing copy paste. So it's not context sensitive, it just takes include files, puts it where the include statement was. So we somehow need to map this context insensitive analysis results we get from our preprocessor hooks to figure out how do those relate to the context sensitive information that we get from our AST. And the only thing that those two share in common is source file locations. The preprocessor knows in which source file line it currently is when it does its copy pasting, and the AST can map back to the original source file. So what we do is if we now go to a more realistic example, this is an excerpt from the Lulish code, you do not, it's heavily shortened so there's no way to figure out what it does exactly, but we are mostly interested in the things colored yellow, which means it starts with an open MP, if open MP is available statement and then it includes the actual header and the preprocessor gives us all this information. Whenever it encounters one of these statements, we get a callback that tells us we found an if open MP, it goes from line one to line three, and then we found an include statement in line two, and then it's on us to figure out that line one to three fully encompasses the statement we found in line two, because again we're only doing text block extraction. So this is the conflicts that happen inside the preprocessor, but if we go on the preprocessor also tells us that it found an if open MP statement again in line six to 13, and seven to nine, and 15, but we also have the knowledge that there is a whole function going from line five to line 20, so we need to marry those two informations together as well, and this matching process was one of the challenges that we needed to overcome when we did the kernel extraction. So when we started this whole process, we had a very good idea how to do single translation unit C code, and we expanded on this to allow for multi-translation unit C code and C with C++ components like new and delete and classes, and we're currently working on getting codes that makes heavy use of templates working, because the problem once you come into templates is that if you think back about our analysis step, we only get information about functions and how those functions relate to another, and templates are in a compiler speaking sense not necessarily functions, they are descriptions of how functions will be generated at compile time, so our analysis is currently not offering us the information about the original template, but only about the instantiated templates as generated by the compiler, so we are currently working on getting templates to work, and if you think back about the global usage analysis we are doing, if you have complex class inheritance and polymorphism, we are currently not able to traverse all possible diamond inheritance hierarchical implementations that are possible in C++, and lastly the idea is to also allow for automatic check pointing, so the wrapper calls that need to be generated to set up the environment for the kernel to run, it is theoretically possible to fully automatically generate those wrapper calls, we just haven't looked into it, and lastly the thing we are very skeptical if we are ever able to do it is to just mini extract from every C++ code ever written ever, because there are so many things you can do in C++ that we can try to achieve this, but I am very skeptical if we ever will be able to do this, but I don't want to leave you on this kind of depressing note actually, even in a state like this where the tool cannot fully handle all templates, even in a state where the tool cannot handle most complex inheritance hierarchies, tool assisted mini abstraction can still be useful, for example if you are willing to include the templates manually, because your program won't compile with the templates, you can just copy paste them, then you can get mini abstraction to work right now, and if you are interested in doing pinpoint optimizations on your source code, you can extract only those small snippets of code that you actually intend to work on, optimize those, and then do manual optimization and reintegrate those easily, so there are uses even for a tool that is not able to handle every C++ code ever written, and if you know of any HPC code that you think has a kernel that is identifiable and maybe not using the most and the deepest inheritance hierarchies, let me know, because I am always interested in figuring out how well my tool performs on other codes, so with this, thank you for your attention, I hope it was kind of interesting and I am open for questions. Any questions? I am the author of a large sparse matrix library, do you have something similar already in your catalog or collection? So it would be interesting to apply this tool on a library, but usually when the thing we are doing is we have the whole HPC software and then you call into the library, so of course your library is probably doing the heavy lifting and therefore probably doing the kernel part of the program, but extracting this one, we can look into that, but extracting a call to a library is relatively speaking very easy, so programs whose basic structure is do some setup, call an HPC library, get input back, those are basically mini-apps in the sense that we are talking about because they are not doing most of the heavy computing themselves, but if your library has internal conditioning or matrix solving capabilities that you know of that your library struggles to do, then we are talking again, so just let me know the name of your library and I will try to look into it. Four questions? Alex was first. Hi. I was wondering your mini-apps seem to be focused on compute-intensive parts of the code, do you also construct mini-apps for storage-intensive applications or something else? So the automatic identification wire, the PIRRA tool, tries to figure out what is the compute- intensive part of the program, and this is the only kernel, so to speak, that we are able to automatically identify, but if you as a domain expert know that this is the part of our program where we are IO limited, then this is nothing PIRRA can identify, but if you say I want to extract starting from this function, our tool should in theory be able to extract the IO limited part of your code. So this is the point where you as a domain expert need to specify this is the part I want to extract, because the only thing that is very prominently identifiable is compute-limited parts of the program, but yeah, in theory it should work. And second question, I might allow, and do you have a library of mini-apps that are ready to use for others by third parties? So currently we're doing our benchmark, our benchmarks on already existing mini-apps, so we're doing mini-app extraction from mini-apps because I am profiting from the small size of those mini-apps to validate that my program actually runs, for example the Lulash example I showed is a mini-app in itself. If I remember correctly, it's a shock simulation in fluids, please don't quote me on that, but it's a great code and it's very easy to work with, so I'm using that for my evaluation, but the idea is to get it to work on larger codes, for example we're currently looking to the ISSM ice sheet and system model, which is a well ice melting simulation for large ice sheets, but yeah we're always looking for other codes, and if you have something that is IO bound then of course tell me. Thank you. Plenty of time for more questions, Chris. Sorry to be that guy, but how hard would it be to adapt this approach to Fortran? Fortran tooling in general is something that has been of interest at our institute for a long time, but the problem with Fortran tooling is that most of our knowledge is coming from the Klang front and so the C language front end, and I am not sure if Fortran, the current new Fortran LLVM front end, offers the same analysis and query capabilities as Klang, and the idea to move lower in the hierarchy towards the LLVM IR, which is more target agnostic, or language agnostic more really, is that as soon as you go down to IR it's very hard to go back to figure out what was the source code files that actually made up this IR. So yes we already have in the back of our mind that there are other languages that are used in HPC systems, and usually if I present this approach I'm getting asked, well we have some Fortran codes, we have some Python codes, how does your approach work, but we are sadly limited by the Klang front end's capabilities, so C++, Objective C, which we have never tried, I'm not saying that we are able to do Objective C, but C and C++ we have tried and are currently limited to because of our design choice. There is a Fortran front end for Klang. Yes, but this is a Fortran front end for the LLVM infrastructure. The C-Lang front end is the part of the LLVM project which takes C code and translates to the LLVM IR, so I'm not sure if we're talking about the same thing. There is a Fortran front end for LLVM, yes, but I would be very surprised if there is a way to translate Fortran code into something that Klang can understand, but I might be wrong, there is a myriad of interesting software repositories, but currently consider Miscoptic. Then you can use this small part of your code to stress the floating point unit only of linear hardware. Then this other part is very well vectorized, so we extract this one and suddenly you are able to use AVX 512 instructions. The idea is to extract every code intensive part into its own little package and then benchmark with those separate packages, so at the end you can get an idea about the whole performance of your as your individual kernel regions. This would be the approach I take. Maybe one more? In the back, okay. Great talk, thank you so much. A couple of questions, Chris. I believe there was a paper about a Fortran mini app extractor some time ago. I can dig that up and send it to you. If I remember that, if not, shoot me a message. Then the flank front end in the LLVM project currently is not compatible with Klang, so we do get different ASTs. This approach is actually working at the AST level, so if it were using the LLVM IR level and then somehow like magically map back dwarf, then it would work, but there was a different project that did that, it worked kind of okay. I would have a question about you for the complex inheritance hierarchies. Do you have any idea on how you could tackle that, approach that, represent this thing across the whole program or things so, I mean, did you have spent any time so far looking at that or did you say like okay, that's future me or future someone going to do that? So thank you for the question. So it's a mix of both. I spent some time thinking about it and decided it was for future me because I didn't assume it to be very easy, but you already mentioned the general idea that as soon as you go into more complex inheritance chains, you aren't able to extract everything from one header file per se, so you need to do the same opportunistic extraction idea that we do for functions, but now for classes, structs and all their possible inheritance parents. So this is something that we need to analyze on the whole program scale, so this is something that in the not near future, but in the foreseeable future, I intend to put as an analysis pass into the CG collector, which you probably are familiar with. So the idea is that this tool is then able to annotate all this information as metadata and then once we merge it, we get a very good, hopefully, impression of how those inheritance chains flow through the whole program. So the idea. Thanks. Okay, that's all we have time for. Thanks a lot, Tim.
PyPartMC: engineering Python-to-Fortran bindings in C++, for use in Julia and Matlab
We'll get started. Sylvester will introduce us to Piepart MC. Thank you for coming. I'm Sylvester Arrabas. I work at the AGH University in Kraków in Poland. And this is a project carried out together with a team from the University of Illinois, Urbana-Champaign in US. So Piepart MC is the highlight here. But from the perspective of this conference, probably I should read the subtitle, namely How to engineer a Python to Fortran binding in C++ for use in Julia and MATLAB and why to do it. So the package that this tool is interfacing is called Piepart MC. It's a Monte Carlo simulation package for air resolves that are, for example, floating in the air. It's an open source tool developed for more than 20 years at Urbana-Champaign. And just one line about the physics. So usually it's kind of a box model, so studying just processes without a spatial context. But it also has an option to be coupled with the Worf weather simulation for a cast. So here is the HPC context. And it simulates things like air pollution, evolution due to collisions of particles, condensation, chemical reactions, et cetera. And on the technical side, it's actually an object-oriented code base written in quite classic, using quite classic subset of Fortran, but still in very much object-oriented manner. And despite 20 years of heritage, it has a very comprehensive test suite. And I would say it could be an example of best practices in Fortran. However, its usage poses several challenges, for example, to students who intend to start off using it, for example, from a Jupyter notebook. And these challenges are related with, first of all, multiple dependencies. The need to compile it. Getting updates doesn't have really a workflow ready. The automation of simulations, analysis, et cetera, usually involves Shell. The input output is handled through multiple text files. And to analyze output from these simulations, usually one needs to actually look or use some of the Fortran code the simulation is based on. So the question that was posed when we started was how to bring together these two seemingly separate worlds. So on the right-hand side, this is the simulation package, part MC, with its Fortran code base, a bit of C code base, different dependencies. And then a perspective of a modern student, let's say, who starts with Jupyter and expects basically everything to be importable and interoperable with other libraries, scipy, numpy, et cetera. So the goals would be to lower the entry threshold for installation and usage. To ensure that the same experience is doable on different operating systems. And also to streamline the dissemination of studies based on the simulation tool, for example, for peer review with scientific journals. So the status of the project, as of now, of part MC, this Python bindings, is that we released after two years of development version one, it's on PyPy. And we also published a description of the package in the software X journal. So we are kind of ready for a rollout. And today I will talk more about the internals. And the internals start with PyBind 11. So despite we are talking about Python and Fortran, we actually, we picked PyBind 11, which is a C++ tool for developing Python packages as our backbone. So here's some highlights. The project actually is for those who are new to it, it's quite a remarkable success, I would say, with over 300 contributors on GitHub, 2,000 forks and 14,000 stars. Congratulations to PyBind 11. And it's very useful. So it fits here into the picture. So essentially we developed in C++, in C and in Fortran, so it's a triple language project, something that uses PyBind 11 and a few other components to automate building of this part of C and offering the Python package. So probably what's also worth mentioning is here that most of the work on PyPartnC was around substituting this text file input output with JSON-like Python native, let's say, or Python-like Pythonic input output layer. And as I mentioned, the original project has the object-oriented structure, so we tried to also couple Python's garbage collector with the Fortran functions that are provided for creating and deallocating objects. And there are many, many dependencies that the project has in Fortran, in C, in C++. And here, let me just mention that we picked Git submodules as a tool to pin versions of these dependencies, which is useful because the pip install command is able to grab packages from a Git repository, and this would include all the submodules with their versions. So let me now present a bit of code and how it looks from a user perspective. So this example here, please don't look particularly on the license of code, maybe just on the bulk of code, and the type of code. So here on the left, we have the Fortran Hello World for using the PartMC package, and on the right, three text files that would be the minimum to start a simplest simulation. So now this is the end result that uses the PyPartnC layer, so essentially the same can be obtained with a single file, starting with importing from this PyPartnC wrapper, and then using this kind of JSON-like notation, essentially here, list and dictionaries that are wrapped. So one achievement kind of, and one big advantage of using Python is that actually providing Python wrappers, you are catering also to Julia users, for example, here through the PyCall.jl package, essentially the same code and the same logic can be obtained for Julia users using PyPartnC. And finally, example with using Matlap, which ships with built-in Python bridge, and then which allows also to use PyPartnC to access the Fortran code from Matlap. So these three examples I've shown are actually part of our CI, so we have them in the readme file, and on CI we are executing the Julia, the Python, the Fortran, and the Matlap example, uploading the output as artifacts, and there is an assert stage that checks if the output from all these languages match. By the way, the timings here are essentially compilation and set up, so it's not that Fortran takes much shorter, the execution is always done through the Fortran code base and binary, but clearly compiling just the Fortran code is faster than setting up the Python, Julia, or Matlap environment, and how it works actually in practice when looking at the code. So here, this diagram might be not perfectly visible, but the right column is C++ layer, here is the C layer, here is Fortran layer, and here is the user code either in Julia, Matlap, or Python. And the different color here is to depict the package that we are interfacing with. So if we start with this readme code here, the user's Python code, we have set up the some import and instantiation of a single object of this arrow data class as an example, and what happens if we call it, first it goes through barely visible, I guess. So anyhow, this is the kind of outer layer for the C++ implemented Python package, and now I hope it's more visible. This is how PyBind 11, how one works with PyBind 11. So this is the C++ code where we define a module for Python, creating a Python class from C++ code looks roughly like this, with some templates defining the class that we interface how to handle memory allocation and defining particular methods. Here there is an init method, so a kind of constructor, and this constructor, when called, goes through C++ code, this arrow data class that we wrap, but quickly we need on our way to Fortran to go into what is written here up at the top, C binded signatures for the Fortran function. So they cannot take exceptions, exception handling through, across these languages is essentially undefined behavior, depending on the compiler. This is how it looks from the C++ perspective. So when we look now on the C signatures here at the top, they match to what is later defined in Fortran with the Fortran built in C binding module. So whenever you see this bind C or C underscore types, these ensure within Fortran code that we can access this code from C, and each of these routines is written for our wrapper and essentially calls quickly as a fin wrapper around the original Fortran routines that we wanted to wrap. So for example, the one below spec file read arrow data. So now we go finally to the wrapped code. This is the unmodified code that we access, and it sits in a Git submodule of the Pypartmc project. Now the fun starts when this Fortran code actually calls its input output layer, and there is like, usually a simulation takes something like 20 different text files to be read through, and these text files are nested. So what we've done is we replaced one of the components of the original Fortran package with our implementation that starts in Fortran, then goes through a C layer back to C++, which then uses JSON for Fortran. So this is a C++ library that helps get very readable C++ code for using Fortran, and this was our solution to replacing the multiple text files with what from user perspective are essentially in memory, MATLAB, Julia, or Python objects. We also have online documentation for the project generated from the source code, and as you can see here, for example, the types are hinted correctly. So despite in principle the Fortran parameter ordering is the key, we do inform Python users for the types of the arguments. So to start a summary, what we achieved with the Pypartmc wrapper is that we have a list of different types of the wrapper, and we have a single command pip installation on Windows Linux and OS X, with the exception that from Apple Silicon we are still struggling to get it done and help welcome, if any of you is a Fortran hacker who could help us produce universal binaries. We provide access to unmodified internals of the Pypartmc underlying package from Python, MATLAB, and also C++. So as a side effect by product of this goal of providing Python interface, we got also Julia MATLAB and C++ layer. Probably something that might not be obvious from the original plan, and we ended up actually using extensively is that this provides us with a nice tool for development of other Python packages because we can use part mc in test shoots to verify against the established simulation package. And also probably it's maybe a non-trivial way to use pip, but since C and Fortran are probably not the best, are not the solutions, not the technologies where you see mainstream package managers coming in or being established here, we managed to ship Fortran codes to users of Windows 6 Linux different variants of binary packages through pip. So it's essentially probably one way of thinking of the PyPy.org platform. And from the point of view of what I mentioned earlier, providing students or researchers using this package with tool to disseminate their research workflows, including input data, output data analysis workflow in a single, for example, Jupyter file for a paper peer review. And finally, PyPy.org mc allows to extend the Fortran code with some Python logic. So since we interface with, we expose the internals of the package, we can do in a simulation the time stepping can actually be done from Python. And you can add to, let's say, if you have 10 different steps of the simulation done in Fortran, you can add an 11th one that is in Python, Julia or whatever. And the final point is probably one of the key things here is that having statically linked all the dependencies, we can actually use the package on platforms such as Colab or Jupyter Hubs of various institutions by doing just pip install and importing what otherwise would require getting a lot of dependencies and a lot of compile time stuff available. Take home messages. So I wanted to kind of give you a little bit of a little bit of a little bit of a little bit of a little bit of a little bit of a little bit of a little bit of a little bit kind of underline that PyBind 11, despite being a C++ tool is actually a valuable thing for interfacing Fortran with Python. And this is linked to the fact that PyBind 11 offers CMake integration. So your C++ projects can have build automation in CMake, and CMake handles Fortran well, so this was the key thing here. The glue language role of Python is, I think, nicely exemplified here with Julia and Matlap, including CI. Static linkage of the dependencies was essential for us, for example, due to the fact that there is no standardized ABI for four different versions, even of the same compiler, have different binary incompatibilities, and this was essential to get it working on on platforms such as Colab or other Jupyter Hubs. But this prevented us from from publishing the package on KONDA due to KONDA policy of no static linkage. We've used more than 10 Git submodules for tracking our dependencies from the GitHub repo. As I mentioned, help welcome in getting the universal binaries generated with G4tran. The CI on using MATLAB is possible thanks to the MATLAB actions. So the producer of MATLAB MapWorks offers CI, GitHub actions that actually do not require any MATLAB license. So if one wants to run MATLAB code on GitHub, this is important and just wanted to thank them. And finally, a fun fact or the positive thing that actually when we submitted the paper about the project to the Software X Journal, just reporting that during the peer review, the reviewers indeed tried the code and provided us with feedback that also helped. So this was kind of positive that it did work. Let me acknowledge funding from US National Science Foundation and Polish National Science Center and thank you for your attention. Any questions? Yes, thank you for that presentation. My question was exactly did you keep in Fortran and what did you pass to Python site? So it's a race or some or just single values? So the question is about if I understand correctly what kind of data we are tackling with passing us during the simulation. So it's a the Monte Carlo simulations here are tracking particles in kind of attribute space that tracks their physical and chemical properties. So it's usually 20, 30 dimensional attribute space that is randomly sampled. So we have vectors of these particles in this attribute space. So usually this could be from thousands to hundreds of thousands of particles that each of the particle has like 30 attributes. From Python perspective, usually the user does not really use the roll data of the simulation, the state vector, just some aggregate information which is passed back to Python as enumerables that can be used with NAMPy, but we don't actually assume that it must be NAMPy. So one can use just lists if they are enough. I hope that answers. My question is just because we need some roll data from Fortran site to Python site and then it's just some two dimensional matter. Here we have some problems that we need to know where we keep the data. We are not exposing particle locations in memory. They are always returned as new objects to Python because this is it is never the state vector of the simulation. It's just a some aggregate information that characterizes it in a simpler way. So usually we have just one dimensional enumerable. For you it's much more simple. Thank you. Time for one more question. If there is one. Okay, if not we'll wrap up here because apparently there's a queue outside to get in for the next talks. Thank you. Thank you very much.
Feeding ML models with the data from the databases in real-time
All right, it's two. We'll get started. Hello, my name is Vojta Uranek. I work as a developer at Red Hat and in this short talk, I would like to discuss one approach, how you can ingest your data from into a machine learning pipeline from your databases in real time. First, this is how the machine learning pipeline can look like. And I will be speaking about where the beginning of this pipeline basically here, how to get the data from your databases into a machine learning pipeline. It can be as complex as this, but I will simplify. Just imagine you have your application running, you insert data in one or several databases, and now you want to get advantage of some machine learning or maybe AI and get your data and do some, for example, prediction. And you want to solve a question how to actually feed the data from your databases into a model, and especially when some new events are coming, how to do, for example, online predictions. It might sound simple, you just do some selects from the database, but when you have quite heavy traffic and you will run the selects in a loop, you will quickly find out that you will overload the databases with the selects. And if you add more conditions like you don't want to miss any data, you won't have consistent view of your data, you don't want to do the right and so on, you will shortly find out that it's a pretty hard problem. There are several ways how to tackle this problem, but I would suggest maybe in my opinion the best one is to change data capture. What does it mean? As the name suggests, change data capture is a process of observing some resource for a change, and if there is any change, you will extract this change and propagate it further, typically into some messaging system. What does it mean in terms of databases? It typically means to observe transaction log of the database, because transaction log of the database is typically the source of true for whole database, and all the changes which are happening to any data in the database are recorded in transaction log. So basically you will observe transaction log of your database, and if there is any change with some data you are interested in, it can be one example, whole database, whatever, you will extract this change from transaction log and send it into, for example, some messaging system. It's pretty, well, probably sounds good, but maybe you have to ask yourself if it's easy to implement. Fortunately, you don't have to solve these questions. You can use Dibizium, it's an open source for change data capture. It's pretty major. It has connectors for most of the popular databases, and it currently comes into two flavors. It originated as a Kafka Connect source connector, when you can deploy the Dibizium connectors into Kafka Connect, and they will connect to one or several databases of you are using, extract the changes, and send it to Kafka, where you can use sync connectors to do whatever you want, like, for example, updating the search index, invalidating the cache. You can, for example, use it also for replicating one database to another database, but in our case, you will probably want to push it into some data lake pool warehouse, feature store, or maybe even directly into some machine learning model. If you don't want to use Kafka for whatever reason, you can use Dibizium standalone, which is standalone process, which basically does the same, but allows you to push the events into whatever system you like, like Apache Pools, Google Pops up, and it can be even, for example, HTTP endpoint. And if you are missing any sync of for Dibizium server, it's pretty easy to implement your own. And Dibizium provides some other features, like, for example, it's capable to do snapshot of your database at any time. It can also transform the records before sending out into the messaging system and so on. So back to our problem. Unfortunately, this talk is too short to go through the whole example. On the page of this talk, for them, there is a link to our blog post when we described in details a use case when you store in the database images of handwritten digits. And whenever the image is stored there, you extract the change using Dibizium, send it to Kafka and Kafka, send it to TensorFlow, which will recognize what number is written in this picture. So it's a well-known example. And everything works in real time. As I said, we also have a full example on GitHub. But what I would like to show you here is that it's really simple. It's basically consists of deploying Dibizium and configuring it. And it's really just one page of JSON config. Here is you just provide credentials. And more interesting part is here. It's some transformation. Here is one predefined transformation where I extract only the content of newly inserted image. And because there is some caveat when you use TensorFlow with Kafka because it can interpret correctly the bytes, I will, I'm transforming the image into string, which is later on parsed in TensorFlow. But I would have to do it in TensorFlow anyway. So it's no overhead. But I can define my own transformation here. So, and it's just a couple of lines of code which just converted into string. And on TensorFlow side, it's similar, easy. It's again one page. Here I define the coding function which decodes the string. And I think it retrieves from Kafka. The most of the code is just defining Kafka endpoint. And it's about three lines to push it into the model which will recognize what is the number on the image and produce the result which you can consume further. So, as I said, if you are interested in it, please go to our website, take it up and take a look. And basically that's it. So to sum up, DBZoom is able to do a snapshot of your database and load existing data from your database into messaging system or directly into TensorFlow. And it can retrieve any change. So it can, once anything is stored into the database, it can immediately extract this change and send it further to your pipeline so you can do real-time analysis of your data. And basically it works for, you can deploy many databases and do more things with that. So that's all. If sounds interesting to you, you can try out DBZoom and please share feedback with us on Zulek, or mailing list. We have pretty large Vibran community and we will appreciate your feedback. Thank you so much for your attention. Thank you very much, Vojta. We have time for one question. Is this working? One question. Someone come up with one question. Come on. So if no one, just if you have any questions, switch pop up later on, just hit me on the corridor or elsewhere in the conference. Thanks for the presentation. My question is, is there any database that's already provided what DBZoom does by default? Any change? Could we repeat? I don't hear. Yeah, sure. Is there any data? Can you please stop moving so we can hear the question? Thanks for the presentation. My question is, is there any database that does this, what DBZoom does natively already change tracing? What do you mean natively? Because without any external tool like DBZoom, is there any database that does this already? Well, it again boils up to what means natively because we leverage typically some native features of database. For example, for Postgres, we use replication slot and we just read replication slot from Postgres. So you always need something which will need something which reads from database or from Mongo, which is from change stream. So always the database provides usually this natively, but you need something which will read it and translate it into something usable, which will parse, for example, the data you get from the replication slot from Postgres and so on. So yeah, that was the question actually. Is there any database that does this anyway without using DBZoom? But you said, I think there is no competitor then. Pardon? Is there any competitors? Yes, like is database is doing this natively already what DBZoom does? I'm not aware if there is any database which uses this. I'm aware that some databases provides this, but several of them use DBZoom under the hood as far as I know. Okay, perfect. Thank you very much, Vojta. Thank you.
HPC-oriented Large-scale Code Restructurings with Coccinelle
Hello. This talk is about collaboration with Julia from India. Before this collaboration I was alone and I was left in a room with 100,000 lines of C code which had to remain C and had to be optimized somehow. How many of you here ever had such a situation? Perhaps somebody? One, two, three, perhaps more. Good. And usually, well, you understand that this was not my code, so I had to modify it. It was the code from people, from many people who worked in bars and produced code which had a lot of loops which do number crunching. Those loops have touch numerical quantities and actually had a kind of structure that you see here. You might recognize. So the layout, the data layout here is called array of structures. So who would say, oh, I have an optimization for this case from your mind. Which is this optimization? The most obvious one. So shout. Okay. It's my role to say it. Well, to change the layout into not, sorry, arrays of structures, but structures of arrays. So this is like a transposition but brings like an improvement in performance of, let's say, two, three, four, depending a bit on many conditions. So changing the layout of the code and also the expressions from what you see here to what you see here might have such a positive effect on performance. But how to do this if you have 100,000 lines of C code which has to stay C? Do you cobble together a few scripts? Yeah, why not? But perhaps you then come at some point whining that doesn't work. So you revert to, oh, you revert perhaps to using some high level transpiler. Yes, perhaps, but it's complicated. Well, I went that way. But I chosen another transpiler which is called coxinell. Coxinell has been written with the idea of matching, so finding bugs in the C code of the Linux kernel and erasing those bugs. But I found out that coxinell, I think, was also a good tool for factoring large amounts of C code, especially in manipulating expressions within statements. So expressions, once they are matched in statements, can be also changed. So coxinell has this pattern language that you see here, which is not in terms of patterns of text. Those patterns are in terms of the abstract syntax tree, which gets built on the top of syntactical elements of the C language, which are being, which is being parsed here. And those entities, the syntactical entities of the C language are of, you have control of them. So let's look at this key slide here. So we identify, match identifiers, tree identifiers, and we also give constraints. Like the first identifier Q, which can occur in the code, has to be, has to look like PRSR. The second one has to be P. The third one can be one of those two ones. So it's like you give constraints that has to be respected. And then the red line here, the minus line, which you see here, has to be found in the code. In the arbitrarily complex expressions or statements, such a form must be found. And once such a form is being found, only then such a change occurs. So you see this change? It's also expressed in terms of what has been already found. So you can be very precise if you want. And what you see here on the right panel is an entire program without, with one line changed from what you see here, red to what you see here, blue. So this is, of course, construed example to show you that this teeny rule here can syntactically go over your arbitrary large code and do really a lot of changes here. The complicated part comes if you want to be, if your code doesn't have regularities, so it's messy. But yeah, but if it's tidy enough, you can do a lot of things. And we're working on new HPC oriented features in coxinell in this system. So what does HPC revolves around nowadays? GPUs, modern C++ and modern OpenMP. So this is what we target. And these are three slides now I will be showing you just to give you a taste of some key elements we have been putting in the last times. So if you know the CUDA from NVIDIA, API or language extensions, then you will recognize that here we have some keywords, some the Chevron, kernel, call syntax, which we support. So this is not standard C or C++. We're supporting this. So we're putting this language support into the coxinell tool to allow you perhaps to change your existing code. So to write rules on your huge code, because if it would not be huge, you would be doing this by hand, right? But if it's huge, you want a tool for power, for the factoring. And this is what we want to provide you. To, yeah, to derive perhaps a CUDA kernel code from your CPU code. If you have in mind the regularities and the changes that you need, and you can express them with this language, I like C++ 23 introduces multi-index operators with square brackets, for instance. So this is now also possible if it's of use. It can be of use, for instance, with cocos heavily. So this can help you to transition perhaps to cocos if you really want. And just to, again, I want to, these are expressions and they act on statements. So you have to imagine that this can occur in arbitrarily nested statements. So complicated thing is like the one that you have seen at the beginning of the presentation. And it's up to you to create like chains of rules, which express the logical dependencies between the things which you want to match and the things which you want to introduce in the code. So if the code is messy, it will be more difficult. If your code is tidy enough, you can do extremely powerful things like also experiments with performance. Let's say, oh, what happens if I change in my entire code base from this style of arranging things, and this this style of arranging things. Then such experiments might be enabled here. And perhaps you don't, you're not perhaps obliged into using overly complicated APIs from portable, yeah, from some vendors. Yes. So, oh, sorry. One last example slide. You can even use coxinell to declutter code. So here we are removing hand unrolled code. So such a pattern recognizes, but I mean, I have written on purpose, recognize some hand unrolled code, removes it, and just introduce one pragma from openMP, which says, hey, unroll this. And this is a standardized pragma. So it's not like GCC, unroll, Intel, unroll. No. This is finally, since a couple of years, we have this pragma. So let's say, if you know how to declutter your code, you have rules in your head, you can implement them here, and have some formal, an informal method for restructuring your code. Good. So you know what we are doing lately. So we are developing further material and use cases, language support, some small things and large things. And yeah, that's it. So this work has been supported by the Gauss Supercomputing Center project, by a collaboration between Bavarian Germany and France. And if you are tough, you will go through this tutorial, which I created a few years ago. Yeah, and you can stay tuned with our developments with Coxinell, or you can also come to Würzburg in one month to attend our short introductory tutorial into Coxinell. And if you want to read our six-page use case article, I really recommend it because it's more deductical than this 10-minute lightning talk. That's everything. Thank you very much, Michele. We have time for one question. JP in the back, again. Thank you so much. Hey, cool talk. Great, great thing. I love Coxinell. I have it right on the top of my list that I want to work with. The question is, can I express more constraints than just symbols and identifiers in these rules, like types or like lexical things? So what are the things that I can express in terms of this rule should only apply if? So for sure, context? So whatever you write as context, it gets matched. But if it comes to the rule part of those patches, which we have seen, the top part, you have seen identifier and something. Yeah, you have also type and other words for function parameters, for positions, for symbols. So there are a lot of them. And the important part is that the code is being parsed in the first place. Because once we support your language constructs, we parse your code, then you write a rule which can be, which is really the code which we have to match. So Coxinell is different parsers. It has different parsers. We parse your code first, we parse the root first, and then we look for a match. We are able to parse that. Any construct can be modified. This is the point. Any construct. So even within a template, you want to change something systematically. Yes. Okay, cool. Yes, thank you. Okay, we're doing
HPC Container Conformance
Okay, next lightning talk. So we don't have a lot of time to switch between speakers. Please take a seat. Next lightning talk as Christian, who's notorious for being very good on staying on time. I did once a very great job. I benefit from that still. Yeah. So if you see me, I talk about containers a lot. So this time I would like to give an update to the HPC container conformance project, which we started or I started last year and which got a little addition by being introduced to the OCI working or we created an OCI working group together with some other folks. So what is the problem? I mean, to just maybe call it challenges, everyone knows modules, right? If you're not, if you're new to containers and you use native code, you most likely use modules to figure out what's the best binary for your program on the current system you add. So you do module load grow max and the module system will pick the best binary for the current system you are on. So it's a runtime decision, right? So you have a bunch of software in a software share and it would just pick the best one problem or not problem. I think it's a good thing with containers. You don't want to have a lot of binaries or different variations of binaries within the container. You want one, right? So a single set of libraries and a single binary for a given problem. So what we end up doing was to create multiple containers for different systems. Let's say for the CPU like Graviton, Skylake, Zen 3, or maybe even we use a name to identify a cluster we are running on. That's fun, but problem is how do you pick the correct image? Within the container space, you have something that's called an image index, which is just a matching artifact that says, okay, you are on an arm system, you get this image. You're on an AMD system, you get this in or Intel or x86 system, you get this image. And if you are a wasm guy, then you even get another system. But the thing is that's not, that's not fine enough, right? It's very, it's very gross grained. You cannot just put like your, your, all your x86 code in this. So what we actually want is an image index that is more specific. So they can say, um, this CPU, this accelerator gives me this image. If I have this CPU, I get another one, maybe even configured with MPI in, in, in mind so that you say, like, if I have MPH, this version, I get this image. If I have open MPI, I get a different image. So you get the idea. So have a very, maybe long image index with different variations and then you pick the best image. And another thing that I didn't mention in the first slide is, uh, run times will go through the, uh, image index, the normal image index and we'll just pick the best or the first match that they get. So even if you have an image index with five different x86 images in it, the runtime will just pick the first one. It matches and off you go. And with this, of course, we, we cannot do this. We need to go through all the versions that we have, all the different specific images and then the runtime ideally picks the best image for you. Okay. So I did some hacking back in the days, right? So I changed or used an unused feature in the image index to make some identification. So I saw it. Okay. This is a broad will this and media driver and I hacked, uh, the Docker runtime to also recognize what the best image matching is for this specific platform you're on. So with this ugly heck, you were able to identify, create an image index with a lot of different images for different, um, different systems. And then you configured your runtime to search for a specific tech list, if you will, that was like hacky. And, um, I didn't intend it to be, I created a pull request for Docker, of course, what turned down, but, uh, because it's, it's, it's ugly, right? And what's ugly about it is that you need to implement it in every runtime. You need to implement it in any scheduler to make sure that it works. And this was of course bogus. So what we did, uh, as I said last year or the year before, uh, we created an HBC container conformance project to establish best practices and provide some guidance on how to build images for the HBC and how to implement the use of those images. The first thing, which is very brief, uh, is what we expect an image to behave or how it should behave. So the first one is, uh, we want, there are two types of containers, application containers, I call them and login containers. Uh, application containers is just if you have, for instance, a binary and you put the entry point to be this application, you can create an alias that just runs some program within a container without you knowing about it. So for instance, let's go release a example. You just, instead of running a binary, you point to an alias and then you run, um, this problem here is if you want to debug things, uh, and if you, it's, it's hard because the entry point is always tricky to get rid of, or at least I need to look up the Docker command or every time. The other thing is multi or a lot of HBC applications have multiple binaries you want to run. Maybe you have a pre-processor or the application and a post-processor with this case, you would have like three different images for this because the entry point is different. So that's kind of ugly for HPC is not really usable. What we actually want is a login container. So you start the container and it drops you into a bash. That's that way you can just, um, augment your, your script and just say a Docker run or a singularity run or whatever to, um, execute the GMX command. For instance, you can just run it here, uh, within a container and it just works. Um, another aspect that's very important, but everyone hopefully does it anyways is, um, that the user within the container, if you use a shop, a shared file system needs to be agnostic, right? So you cannot rely on a certain user within the container. So you might, or you should, um, make sure that the, that the container is able to run with nobody because the username will be dropped from external, uh, the user ID and group ID to have access to share file systems so that the process is owned by the user outside of the container and the container itself has no knowledge about the actual user. Um, okay. And then that's how we expect the container to behave. And I think that's common and already understood. I think I talked about, yeah, last time was annotations that was an idea of us HPC guys and girls, a simmering in our own soup and tried to come up with something to put forward. Um, that was kind of a nice exercise, but at the end, um, we jumped on the train of the image compatibility working group at the old OCI initiative. And you might ask, and hopefully a lot of you know it already, but what is the OCI? It's the open container initiative. It maintains the more the, the relevant specifications about containers. So what's the, what's the image like? How do, uh, run times interact with images and distributions and registries and so on. What is the distribution specifications or how do, uh, registry work and security stuff? So it's kind of like a body that maintains the specifications and we formed a working group together with others. Um, that's called image compatibility. So we want, as I discussed in the, in the beginning, we want to extend the manifest list or the image index to not only be able to, um, pick by platform by architecture, but extended so that you can make what I, what I said as a, as a desired state for the image index so that we can pick the right image and an optimized image for, uh, a certain application. And of course we want to express like what the image was built for, what we expect from the host, what runtime we might want to use and so on. All this cool stuff we want to incorporate in this. And why is it a better way? I mean we HPC folks, we like to, to do our own thing, right? And we are kind special and snowflakey, but this is of course a better way because we interact with the OCI community and we put it in front of them so that we can take into account other things like for instance, wasm is a thing, uh, haven't used it, but seems to be a thing and it's a runtime was in, was in the, was in the container ecosystem. And of course we also have different run times, right? We have like singularity, obtainers, saros, what have you. And, um, picking a runtime over the other is something that we are interested in. The wasm folks are interested in. Say you have a Kubernetes cluster and you have an x86 image for an application and a wasm image for an application. Maybe you want to pick one or the other like different, depending on the condition. So they want this, we want this as well. Uh, scheduling a registry, of course HPC is great, uh, but the container tech is much wider than HPC, say the least. And, uh, we want to make sure that we align with Kubernetes. We want to make sure that the registries are aligned with us and the OCI working groups have like, they have an oil machine of sanitization. So that's also very cool to do. Okay. Where are we now? Uh, we discuss around use cases and while discussing the use cases, we already brainstormed some implementation ideas and we came up with a couple of use cases or, uh, stand in, um, stand in stakeholders, let's say, um, for instance, like the first one, of course, and we are all building images. So the first one is image author. If you build a container image, you want to define this compatibility definition that we propose that we want to propose. Uh, easy. I, ideally it's implemented with an easy build of spec or geeks that, um, you don't need to do it yourself. So all the stuff you can put there and, uh, Vanessa already wrote a little tool for that. The other is of course a system admin that wants to make sure that the system that he's maybe pursuing, procuring, uh, is able to run the container. So you just go through all the competibilities and then you, you figure out what's, uh, what, what works and what not work. So that's all, uh, this good stuff. And, um, you also want to make sure that the configuration of your system is actually able to run this image. The end user just wanted to work, right? So we need to make sure that the system admin and the image author and the other stakeholders just hum together and, and conclude on a certain configuration. And that's what it wants to do. There are other use cases. I don't have time to go all of them, but, uh, we have a list of, of, of this use cases we are going through currently. Our meeting is every Monday. Uh, and if you want to join, please do. Um, I have some links. There are resources. The, the slides are available online. If you want to get in touch, we have an HPC container slack. We have an OCI slack channel. There is a HPC social slack channel as well. So if you want to have a more general overview and if you're at ISE, make sure to, uh, join our high performance container workshop. It's a tense edition. So we do it for 10 years now, which pretty cool. And we have a friends of container boat trip. So if you like to, to, um, meet container guys and girls, uh, make sure that you point your mark, your calendar at the 13th of May. Yeah. That's it. Thanks. And now the famous and I think I'm good on time. Awesome. Maybe do, do I get a sticker if I do it three times in a row on time? You, you get a beer. Oh, even better. Right. We have time for one question.
Updates and Innovations with the Apptainer Platform
All right. Well, hello everyone. Hope you all can hear me okay. My name is Forrest Burt. I'm a Solutions Architect at CIQ. I've worked with HPC for a few years now. I was a student sysadmin at a university over in the states for a while before joining CIQ a few years ago. So I've been there for about two and a half years. I've used Singularity since about 20, well, AppTainer nowadays, but I've been familiar with kind of the Singularity container ecosystem since about 2019 or so when I was originally deploying it at Darbuysi State. So I've been a big user of kind of the Singularity AppTainer ecosystem for a long time, and I've got a few updates to share with you guys about what's going on with it. A couple of years ago, let's see here we go, I gave a talk a couple of years ago at Fosdome 2022 that kind of went over what was going on with AppTainer around that time and kind of how things were switching up. The long and the short of that story is in late 2021, the Singularity project, the open source side of it, decided that they had kind of felt they were reaching a mature phase with their technology. They decided that they wanted to maybe kind of move themselves into the Linux Foundation so that they could be managed under that and cross-pollinate easier with some of the projects from over there. The vote to do this was unanimous in favor, and so they moved over into the Linux Foundation, this open source side of Singularity, and in order to change or in order to, the Linux Foundation could get a trademark on the name, they had to change the name from Singularity to AppTainer. So that's how we got to this. Over the past few years, since the rebasing as well on version 1.1, we've done a few different things that have been kind of interesting that I'm here to discuss. First off, we've started leveraging the username space. As of kernel 5.10, we have username spaces, which give us those consumable UID mappings. So we're taking advantage of that to do some of the stuff that we used to need, set UID binaries and stuff like that for. We've got new recommendations that I'm going to go really briefly into about containerized MPI and how we're doing that these days, and then I've got a little bit of information about the OROS protocol and how that is increasing compatibility between AppTainer and OCI registries. So first off, as we know, we have username spaces. These are name spaces that we can create that allow us to do UID mappings. So you can take your 1001, map that to zero, something like that. This is useful for a number of things in AppTainer and allows us to essentially do most of what we needed to do beforehand with these set UID binaries, just now using these username spaces. So this gives us the ability to do all of our file system mounting using Fuse. This gives us the ability to do, for example, like with this, this standard pattern and singularity in AppTainer that we're needing to be able to go into a container. As you can see, in this case, creating a file under my user and then being able to shell into this container using fake root, get immediately mapped to root, see that that file we created is now shown as root, create another file and then be able to come out of that and still have that file we created while mapped in the container as root, still under our user that we were outside the container. This is important so we can do things like DNF and stuff like that during container builds. So, for example, when we want to do a build, in this case, we need to be able to run DNF, stuff like that. Whereas beforehand, this type of thing, if we wanted to do root build, you had to do add the fake root options, stuff like that. Nowadays, that's all just implied and, like I said, all these things that we used to use these suede binaries for are all now being done using username spaces. So that changes up a little bit of how AppTainer works on a security level. One thing to note... One thing to note is that some of the... because I'm sure you guys have seen some of the other security holes that come around with username spaces, but one thing to note is that because users can now do unprivileged installs, which there's a helper script out of the show site to do that, because users can now do unprivileged installs of their container... of their AppTainer software in their own, basically, environments, that renders some of the system-wide controls around this stuff, like the ECL and Valid. So you do have to kind of watch out for that and disable username spaces if you don't want your users to be able to tinker around with installing their own AppTainer and doing that type of thing. New recommendations for MPI jobs. As we know, usually when we want to do MPI, we're doing something like this, MPIexec sending these MPI... or setting up these MPI processes via SSH out on these compute nodes, wiring them up via MPI. When we do this with AppTainer, this doesn't quite work the exact same way, because if we try to do it this way, if we try to run, basically, MPIexec inside one of these containers on our host node, we end up getting out via SSH to these compute nodes and within the containerized environments that we're trying to spawn on these compute nodes, we... essentially, because of how these containers are namespaced, we can no longer figure out where exactly this calling program was out on its host node. So that doesn't quite work. So normally, when we want to use AppTainer via MPI, we have to MPIexec the AppTainer process itself, and then that wires everything up correctly. This can cause problems because that's a little brittle. You then have to match the MPI on the host, the MPI in the container, or you have to bind in the MPI from the host in the container. That creates very brittle containers that are very linked to a version of MPI. So one big thing that we've been looking at kind of in the community and at CIQ over the past little bit, especially my colleague Jonathan Anderson, has been using the PMI libraries to create more compatibility between these. So instead of having to have exactly the same version of MPI between this container, between the host, or even having to have MPI on your host at all, you just compile your slurm or something like that with this PMI2 support, compile your MPI in each one of these containers with PMI2 support, and you can leverage this PMI2 library to wire up all this communication, rather than having to do this more standard model that we're familiar with. There's a lot more to this. There's a lot of information you can go look at. We've done some webinars or some blog posts that kind of have some performance benchmarks, stuff like that. So there's a lot you can go look at here. The last thing that I have to say, well, fabric adapters, you can do this basically same thing with fabric adapters. If you just switch out your user driver or your basically user space drivers that you're using to communicate with those K-mods that are communicating with that fabric adapter, with libfabric, that provides a very similar kind of compatibility layer between a lot of the fabric adapters, like what we see that we've done with PMI2 there. So essentially, we can do this similar, like I said, kind of genericizing this build rather than having to link, in this case, down to a K-mod version, and ultimately ending up originally with a container that is very linked to a specific kernel version. In this case, we can, like I said, just use libfabric and have a lot more compatibility there. One other big thing is O-RAS, OCI Registries as Storage. Docker as a database or the Docker Hub as a database is no longer cool. So we've come up with ways to kind of formalize that concept of storing generic arbitrary artifacts inside of an OCI registry. So what we're doing these days, basically, because of this protocol, a lot of the OCI Registries have gone and implemented this. It allows you to store app tainer, sifts, helm charts, things like that, all within these OCI Registries. And so with that, I'm sure we're all very familiar with app tainer. You can pull images, do verification via PGP to eliminate a whole host of different potential security problems. You can also inspect files to ensure that they have their sign correctly. You can make sure that when you actually go to build a container from somewhere that you're pulling it out from, say, pushed up to even like an OCI registry at this point, you can check these key fingerprints versus what's uploaded out on sites like openpgp.org to ensure that what you're getting from this registry is what's actually showing up on your machine. So that's one of the big advantages of app tainer is that you can do this type of signing. And like I said, this used to be kind of a little bit more limiting because you weren't able to really store these in the same type of registries as everyone else is using. But nowadays, you know, you go to AWS, Azure, GCP, Oracle, most of the major clouds, all of their artifact registries support not only storing Docker containers, but these sifts alongside them. And you can, like I said, pull them down, get the exact same type of just these workflows that you would expect where if the fingerprints don't match, the container won't build that type of thing. One thing to also note, if you're thinking, well, what if you're upstream containers that you're actually building these, you know, your base images from are contaminated? One thing that you can do is also build from your basically the flat or basically the mirrors themselves. Using something like this allows you to pull directly from like, for example, Rocky Linux's mirrors build a container out of that. So you're not relying on anyone else's containers that they've uploaded. So that's all I have. Like I said, we have seen how username spaces have replaced all the setUid type stuff. We've seen how we kind of have new ways that we can utilize these more generic libraries and MPI so we're not creating as brittle containers. And then we've seen how O-ROS allows us to integrate these SIF containers with OCI registries. So I have a lot of links in here. I have more links right here that you can look at if you're interested in getting involved. And thank you all for your attention and for continuing to use Optaner.
Automating Spark (and Pipeline) Upgrades While "Testing" in Production
Okay, that's it. Please take a seat and we'll get started. So Holden is going to talk about automating spark upgrades and also lots of testing in production. That's going to be interesting. Testing in production is the best place to test when the alternative is no tests, which it often is. Okay, cool. So let me know if you can't hear me because I'm very easily distracted and get excited and I might not notice that I'm not talking directly into the microphone, so please grab my attention if I screw up. So yeah, I'm Holden. My pronouns are she or her. It's tattooed on my wrist. Super convenient when I wake up in the morning. I'm on the Spark PMC. You can think of this as like having tenure, except it doesn't guarantee I get paid. It just guarantees that I have work to do, so it's like the shady version of tenure. And I've worked at a whole bunch of different companies, not super relevant, but I've seen a lot of mistakes made in production and I have made a lot of mistakes in production so you can learn from some of my mistakes. My employer who sent me here, Netflix, is hiring and I would be remiss if I did not mention that. They're actually finally hiring remote people after who knows how many years. I'm a co-author of a bunch of books. Some of them are related to HPC-ish stuff. I get the highest royalties on scaling Python with Ray, so I think it's a fantastic book and everyone should buy several copies with your corporate credit card. If you don't have a corporate credit card, the internet will provide. You can follow me on social media and there's lots of pictures of my dog. If you're into that stuff, there's a lot of complaining about American healthcare. If you enjoy Shaddenfreude, highly recommend it. It's great. I also do a lot of open source live streams. If you like seeing people struggle with computers, once again, it's great. You can watch me fail. The code for today's talk and a lot of my other code is on my GitHub. You can check it out. And there will be more pictures of my dog. In addition to who I am professionally, I'm trans, queer, Canadian, in America on a green card, I make great life choices. It was a great time to move to America and part of the broader leather community. I can make that joke now because I have a green card. It's slightly more difficult for them to kick me out. This is not directly related. There is no secret Canadian code modification tools. Everything we use is open source. There's no secret Canadian GitHub alternative. If you go to GitHub.ca, you don't find... Actually, I don't know what you find. Maybe you do find something cool. I'm imagining you don't. But this is something that I like mentioning because I think for us who are building big data products or machine learning things, it's super important that we look around and we see like, hey, who is on my team? And if you realize you're hanging out with only Canadians, that's fantastic. Enjoy the poutine. But maybe it's time to get some other perspectives. And if you don't know what poutine is, you're missing out. You should try it someday. Cheese curds and gravy and French fries. Best thing ever. Okay. So what is our problem? And so why do we care about automating upgrades? So fundamentally, our problem is we have unsupported versions of our big data tools and other data tools running in production. And this is a problem because when things go wrong, I get woken up. I don't like getting woken up to figure out what I did five years ago. And that's just not fun. The other option is sometimes I get woken up when I'm trying to focus. That also, sorry, not woken up, interrupted when I'm trying to focus. And this is important because we are also getting Spark 4 soon. That's super exciting, super lovely. There's going to be all kinds of new breaking API changes. And that's just going to be so much fun, right? Like, yeah. Anyways. And so I don't know about you, but I'm not looking forward to going back and trying to figure out all of the different things that I've built over the years and upgrading them, right? Like, I know I'm going to have to do it, but that is not the thing that excites me in my life, which leads into, like, why do we have these problems? Why do we have old things running in production? We have it because APIs change and code breaks. And then people are just like, you know what? I don't want to upgrade. Just keep running on the old version. It totally worked. It's fine. What could go wrong? The other one is like, this isn't fun, right? I don't know. Does anyone here wake up in the morning excited to upgrade their API usage? Yeah. Okay. So this is zero people, right? And the other possibility is, right, like, we could try and keep this old software alive, but we don't want to. So, how are we going to work around our problem? So we're going to use software, and then we're also going to have to deal a little bit with humans, right? We're going to do automated code updating. It's super fun. So much fun. If you took a compilers class, this is going to look very familiar. If you didn't take a compilers class, this is so cool. AppSection.x3s are really cool. And we're also going to do automated testing and validation and prod. So the social problem is much harder. I am completely unqualified to solve it. I work with other people who are much better at talking to humans. They did a fantastic job. They made newsletters. They tried to make the project exciting. That failed. And then they tried to make the project required. That failed. And then we set deadlines. They slipped. But for sure, totally, we're definitely going to hit our new deadline for real. Okay. And now, let's go and see how else we addressed it. So the other thing that we did is, like, hey, we have this problem that humans don't want to do a thing. What about if we made it so they didn't have to do as much work? And so that's sort of the approach that we took. We can automate a bunch of this. And the other part is, like, so we've got API changes, which we mentioned. And then the other thing that we have is testing code as a nightmare, especially code that you inherited and is called untitled underscore seven dot ipod dot notebook. I don't know what it does, let alone I can't make tests for it. It's terrible. So yeah, we have a problem. We're going to fix it with computers. Google has a lot of really lovely code mod tools that I saw while I was there. Super fantastic. This encouraged some counterproductive behavior. I don't know if any of you have used Google APIs and watched them change underneath you. So this is a double-edged sword, and we should heed the warnings before we go, like, super, super all in on this. So what are we going to do? So how are we going to move on? Basically speaking, we're not going to use regular expressions. For the most part, there's going to be a few times when regular expressions are like the simple hacky way, and we're just going to do it. For Scala, we use ScalaFix. For Python, we use something called PySparkler. For SQL, we use SQL Fluff. And for Java, we looked at it, and we were like, we don't have that many Java pipelines. Get them to update their code by hand. It's fine. We know where they work. Okay. So how do we figure out what rules to make? So we could read the release notes, but they're not very complete. We could look at the MIMA changes, and so Spark has a binary compatibility checker that it uses, but, oh, dear God, there is just so, so many things in there. Or we could do my favorite approach, which is run it in production, see what breaks, and then fix it afterwards. So we went with the YOLO approach, which is just like we're going to try migrating some things as it fails. We'll add the rules that it turns out we needed to add. So what do these rules look like? Today, we're just going to look at Scala and SQL. If you love Python, you can check out the GitHub repo. It's got some samples there. So in ScalaFix, we override this function called fix. We take an implicit semantic document that's really just the syntax tree, so that's the parsed version of the source code. And we specify the things that we're interested in looking in, and then we can write a recursive function which will match on this tree and generate a patch. And so here, we can see like, hey, do we see something that's calling the JSON reader? Because the JSON reader, certainly no one would use that ever, so they cited it was a great idea to change that API because who has JSON data? That was a joke, by the way. Everyone has JSON data. And so it turns out like, yeah, this actually happens a whole bunch. So we should write a rule for this. Do we see someone trying to read JSON data from an RDD? And if so, this is the path we're going to add. Now the really cool thing here is that we're matching on a syntax tree to produce new syntax tree. I can just say, like, swap this part of the syntax tree for this string, and then underneath the hood, Scala fixes very smart, turns it into a syntax tree. Everything's happy. I'm quite happy. I've got a bunch of sketchy hacks, and they're all inside of a function, sorry, a library called utils. So it's great. We hide all of our mistakes inside of utils because only nerds look inside of utils.Scala. Huzzah. And here you see we're recursing on the tree, and we just return nothing if we don't find any matches. SQL very similar, but the AST is a little bit fuzzier because we're using SQL Fluff, and it has to support a whole bunch of different versions of SQL, not just Spark SQL. Things are a little fuzzy. So we go ahead and we look and say, like, hey, do we see someone calling this function that we know has changed? If so, go ahead and extract out the part that we care about. And so we go ahead and we grab the third element because, God, whatever, don't worry about it. Magic number, totally fine, no mistakes. And then we go ahead and we say, like, hey, what is the type of this element? If it's a keyword and it's cast, we know we're good. The types are matching. Everything's fine. Otherwise, if it's not a keyword and the type is cast, we probably need to go ahead and change this. Because the types change. We actually need to add explicit casts into this function. And so we go ahead and we check it, and then we say, like, okay, function name, no, if it's cast, we're fine. If not, we go ahead and we produce these edits. Now unfortunately, SQL Fluff isn't quite as amazing. We can't just give it a string and have everything work. We have to produce, like, the chunk of the syntax tree. But this is still better than writing regular expressions, right? So much better. So this is totally fine. Everything's great. How do we know if it works? So there's a bunch of different things that we could do. We could try and make tests, but realistically, that's not going to happen. What we do is we do side-by-side writes and we use icebergs ability to stage commits. You can do the same thing with Delta Lake or Lake FS. They're all open source. I don't know how to do it with Delta Lake because I haven't used it, but I'm sure that you can do it. You might be saying, like, holding this sounds like you're running all of your data pipelines twice. Isn't that expensive? The answer is yes. Does it catch everything? The answer is no. But it's a hell of a lot better than just, right? We've got hope and a little bit of data, and together, are better than hope alone. So now we're going to come out and it crashed last night, but it's totally probably going to work today. Yeah, thank you. Thank you. We see I made a backup copy just in case it fails. What our demo does is it builds a regular Spark project, and it also makes a copy of it first. This is a Spark 2.4 project. Did I break it? Hello? Oh. Okay. We're back. Yay. Okay, cool. So you see here we've got everyone's favorite big data example, word count. And so, okay, this is going to go ahead and it's going to add the Scalifix plugin to our example. So we're just going to go ahead and say, like, yes, add Scalifix. And now it's going to run Scalifix, and it's going to run Scalifix with our additional rules that we created. So much fun. It's probably going to work. This is where it crashed yesterday. Everyone sent good vibes to my computer. Come on. Come on. How's that? Okay. You can see I subscribed to printlin debugging. Oh, well. And now, so it's run the first set of rules which do automated migrations, and now it's doing a second set of rules, and the second set of rules warns about things that we didn't think were important enough to create rules to automatically migrate, but we wanted developers to be aware of. And one of them is the group by key function change behavior between Spark 2 and Spark 3, because who uses group by key? Turns out everyone, very few people depended on the specific weird behavior, though. And so it's just warning, like, hey, I see you're doing this, and I applied a regular expression and I see some, like, bad words, not bad words in that ones that I use, but bad words in that, like, they're bad. Okay. And we say, like, everything's fine. It says we should review our changes, but we're not going to just, like, real developers. We're just going to hit enter and see if it works. And now it's going to go ahead and replace Spark 2.4.8 with Spark 3.3.1, and it's going to run these two pipelines side by side and compare their output. And so we will see if the demo finishes, ooh, five minutes left. Okay. We'll probably finish inside of five minutes. If it doesn't, we'll give up on the demo. That's okay. That's okay. So here we see it's running these two pipelines side by side. You can tell because Spark loves logging. And it passed. Yay. Okay. And then this, this, okay. Hmm. Okay. Well, this part didn't, and that's how you know it's a real demo, is that it failed at the final end part where it's copying the jar to a new special location, but that's, that's okay. The important part of the demo worked. So we'll call that mostly a win. And if we want, actually, yeah. Okay. I'm going to go. Oh, thank you. My lovely assistant. And so I wanted you to see that like, yes, this actually did update some code. So we go here, SRC main Scala, Spark demo project, word count dot Scala. And then we're going to go ahead and we're going to look at the regular version of this. Oh, God. Emax, come on. Now is not the time. Eight megs and constantly swapping. I can make that joke as an Emax user. Okay. So here we actually do see like it has made some small changes between the two of them. And, oh, sorry. Yeah. So here we see, for example, we have this old pattern of creating the spark context and it's been swapped for the new pattern of creating the spark context. And it's done other similar updates to the code. And the important thing is it now works. And this is fantastic. I think it's really cool. Thank you. Thank you. Hand for my assistant, please. Thank you. So I'm super stoked that the demo did not crash. Unlike last night, I switched it back to I was running a nightly build of the JVM and not surprisingly that didn't go well. Okay. So this is all cool, but like where does this fail? So this kind of fails when it comes to dependencies, right? Like we can only update the code that you've got. We don't rewrite byte code. We just rewrite source code. So if you're depending on something that doesn't support the new version of spark, it's not going to work out. The good news is for us, we got to this so late that all of our dependencies were upgraded. So there's something to be said for waiting right until the software goes end of life. Don't tell the security people. I said that. The other one that doesn't work super well with is programming language changes. In theory, that was actually the original purpose of ScalaFix. In practice, this didn't work so well for Scala 211 specifically because it's just so old. We had a bunch of Scala 211 code. So in conclusion, you should definitely check out the repo. It's here. It's spark-upgrade. It is in my personal GitHub, but a whole bunch of other people have contributed to it. They're awesome. I'm lazy. I wouldn't do all of this work myself. Thanks to my employer again for sending me here. I'm super excited that I get to hang out with a bunch of other nerds. The good news from this talk is that we haven't made a system so powerful that the spark people don't care about making breaking API changes. The bad news is we haven't made a system that's so powerful that we can't just not care about breaking API changes. The excellent news is that my dog is cute as fuck. He's here. I said that at the end of my talk just in case I'm not allowed to swear. He's really cute. His name is Professor Timbit. I miss him so, so much. Y'all are lovely, but I miss my dog. Hopefully there's time for a question, maybe. Yes? We can also do... Thank you. Thank you all. Have a couple of minutes for questions. Thank you very much for the talk. Very interesting. One general question out of curiosity. How long did it take to convert everything? Because you just showed like, I don't know how big the script was, but I can imagine just how big the repositories that you guys have. Totally. So that's a great question. It takes a really, really long time to convert everything. And we actually, internally, we have a whole bunch of different projects. One of them is a project that goes through all of the repositories because we have a whole bunch of different repositories, and it generates PRs to these projects. And that code runs daily. And it doesn't actually catch everything. So what we do is we generate the changes, and then, as I mentioned, we sort of did the YOLO run in production approach to life. So we'll look at these changes, and especially for SQL, it'll be like, hey, we do this shadow run. Does it look like it works? And if not, we actually flag it for review rather than raising the PR so that we can go back and say, hey, do I need to add a new rule, or is this a one-off special case where we'll just have a developer deal with it? So I know that's not exactly an answer, but several hours. Okay. Thanks. Any other questions? Yeah. There's one right there. No. How many rules did you end up coming up with for this migration from two to three? And do you anticipate going from three to four? What? Do you anticipate going from three to four? Oh, yeah. Okay. So two questions. I love them. I don't remember how many rules we came up with. For Scala, it wasn't a huge number, and that's because while there are a lot of breaking API changes in Scala, our usage of the APIs in Scala is more narrow, and so I'm very thankful for that. For SQL, I think we ended up with around 20, maybe between 10 and 20. And for Python, I haven't kept track, mostly because that code has been working really well, and so some of my other teammates have been working more on the Python side, so I don't remember how many rules we made there. But they're all in the GitHub. As for do we anticipate going from Spark three to four? Yes. Probably not like the same month Spark four is released. I love Spark, and we'll make Spark four available internally, but we're not going to go ahead and start pushing users to migrate to it right away. We normally wait a little bit for things to stabilize before we start doing managed migrations just because it's better for our sanity, and there's more fixes to the code base in general. Cool. We got another question. Any more questions? Okay. Cool. Hazar. Actually, hold on. You can keep talking because the next speaker is on the bus. Oh, okay. So with the next speaker is on the bus, I'm super excited, and we can go ahead and we can actually look at more of the changes that it made to the code, which I sort of skimmed over because I didn't want to eat into the next person's time. So it's kind of basic, right? But we can see here, this is the side-by-side for the Scala one, and we can actually go ahead and what we're going to do is we're going to go outside of our end-to-end, and we're going to go ahead and we're going to look at some of the other SQL rules. Oh, fancy. I don't... Okay. Oh, this is so that it's better to read. Okay. Okay. Okay. Cool. Fantastic. And we're going to go ahead. I need my lovely assistant again. Thank you. Thank you so much. Hand for my new lovely assistant. So here we see one of the things that changed between Spark 2 and Spark 3 is that previously you would be able to do just an arbitrary cast to things as integers, and even if they weren't integers, it would do kind of a fuzzy conversion. But in practice, if you wanted to parse a string as an integer rather than casting string to an integer, you should use int at. And so here we see we've got something similar. We use a lot of print debugging. It's not great. But what we do here is we return this lint result, and what it's just doing is it's taking this expression and swapping it to an int when we see a cast with a data type of int. So much fun. There's a lot more rules, but I didn't do a git pull on this because the demo barely worked, and I was just like, let's not tempt fate and do a git pull because I hadn't tested the end-to-end demo. But this is kind of cool. We've got similar updates to our format string. Super fun. Oh, right. And then char versus string types also got updated. Super fun there as well. And where was another one? I want to find it. Sorry. Then we've got, there's a rule down at the bottom. Oh, no. Okay. I guess the rule that I was looking for isn't in this version of the code. Let's go back to ScalaFix. So the other cool thing about this, sorry, doot, doot, doot. So one of the really cool things about ScalaFix, just while we're waiting, is that you can test your rules. And so, for example, like, I wrote these accumulators, and this is the old bad style of writing accumulators, and I was like, okay, let's make sure that it updates to the new good style of accumulators. And this is super convenient because I don't have to manually construct syntax trees. ScalaFix just has built-in functionality for this. And we see here what this rule does is it actually throws out a bunch of situations. And it's actually going to generate a bunch of warning messages. But there's situations where, like, this doesn't directly translate to the new API easily. So we just told users, like, hey, you need to make a change here. But we'll get it to compile, and then it'll pass the test, and it'll yell at you because you're trying to access a null. It's not perfect. Like, this is very much like a, how would I say this? This is a very mediocre rule. But in practice, we didn't find all that many people were creating accumulators with fixed values to start at. But the one that we did see was people creating accumulators that explicitly started at zero long, and so that we just converted to a long accumulator. And then the other one that I saw here was I also added some tests to make sure that, like, I had a rule which was applying itself too eagerly. So I also created a test which was just, like, make sure that this rule doesn't do anything if it's not, like, encountering the thing that I wanted it to do. So we can also make essentially negative tests for AST transformations. That's super convenient. How much time do I need to kill? How much time do I need to kill? Do we know how long the bus is going to be? Okay, cool. Okay. So we see another one, the group by key thing that I told you about. We actually had two different situations. These are ones that we could automatically rewrite, and so that's what we do here. And so here we see, like, the situation where someone was explicitly using the column name in a way which we could detect. But then we also have the situation where, like, we weren't super sure, and so these ones we did with a warning. And so we said, like, hey, this should generate a warning because we don't know for sure what's going on here. So we want to generate the warning, but in the other situations where we could do the full rewrite, we made sure that the full rewrite was able to be applied, which I think is kind of cool from sort of, like, a point of view of you don't have to get everything right, and you can, like, add these warnings in places where, like, it's worth it to let people know their code might not work, but, you know, it's not 100% required. Um... Choo-choo-choo. Cool. Let's see here. Ah... Just a quick interruption. The next speaker is going to be late. He texted us that he's still on the bus, so we're letting Holden entertain you. Oh, I got an idea. I got an idea. Hi. I'm just a speaker. What does that mean? Where am I? Oh. Yeah, I got a... I think I got another minute of something fun that I want to talk about if it's okay. So the other thing that we sort of, like, lost over was the, like, side-by-side comparison in pipeline runs, right? And so that's totally really... I think it's really neat, right? Like, because it's super important because people don't write tests at the end of the day, and that makes me sad. But we've got this pipeline comparison project, and... Oh, God. I'm just remembering how ugly this code is. Please don't judge me. This code was originally written at a conference and then made it into production, as you can tell by the fact that it's called domagic.py. Very sorry. Very sorry. So yeah, so this domagic.py does a bunch of really interesting and terrible things. And I was mentioning how we mostly don't do regular expressions, but we do a little bit. And one of the things is when you've got Spark 2 versus Spark 3 and you've got Scala or Java code, you're going to need different jars. Whereas in Python and SQL, like, we could maybe just be using the same files, or we can use the same files with a little bit of a transformation. But so for the jars, we use a really nasty, really terrible, regular expression to just kind of extract what we think the version 3 version of our jar is going to be. And then this is convenient because we can run it side by side. And then so we've got sort of different options. Here we've got it so that you can specify the input table. But I actually did a hack that I'm super proud of because I'm a bad person. Where we made this plug-in, Iceberg Spark WAP plug-in, where what we do is, oh god, we use the Iceberg listener and we output this string any time something happens to the logs. And so if anyone's touching a table while their job is running, we know what tables it's worth so we can go back and run our comparison on these two tables. We actually have some special code that goes ahead and looks at these tables before doing the comparison and says, if the user updated more than 1,000 partitions worth of data, just don't bother and tell the user they're responsible for validating their data. And if they're touching more than 1,000 tables, sorry, 1,000 partitions in a table, they should really have some reliable tests. For the people who are touching five or 100, like I get it, untitled underscore seven, it's great in production. When you're updating that much data, maybe it's not time to depend on Holden's sketchy do magic dot py. So I think this is really cool. And we're going to go back to our friend Pipeline Compare and down to our friend Table Compare. And so Table Compare is really basic. And there's actually an updated version internally that I need to bring out that does better tolerances. But we just go ahead and we compare these two tables with sort of traditional drawing, which is part of why we had this limit on the number of partitions. Because when we didn't have this limit on the number of partitions and we tried to do these comparisons with some of the pipelines that ran on all of the user data, everyone was very sad. And we took down production. I hope that part. Yeah, anyways, there was an incident and I got woken up when we did not have that. And so, yeah, all kinds of fun. But you see here the thing, the magic here is the snapshot ID, because the other thing that we output in our listener is what snapshot IDs we're writing to. Super convenient. And Iceberg allows us to read from snapshots even if they never got committed. There's a new thing in the new version of Iceberg that allows for branching that would be even better because then we would have named things rather than random git hashes. But we're not running that and it's also not supported in the really old versions of Spark. And because we want to do the migrations from the really old to the really new, I went with sort of the lowest common denominator. And that's kind of how we ended up there. Okay, that's all that I had that I thought was interesting. And I think there was someone else who had something that was interesting. Do you want to come and do your interesting bit? Thanks to Holden for filling in. Does anyone have any questions? Does anyone have any questions? That's that? Yeah, all right. First of all, thank you for the talk. I have a quick question in the summary of your talk. You also mentioned that if time permits, you might have an overview of the changes coming in Spark 4. Do you have this overview? Yeah, so if you're interested in the changes coming in Spark 4, the place to look is the Spark Jira. And there's actually like this meta tracking Jira that's in there. And you can see sort of like the things that we're planning on coming. Historically, I would say without naming names, there's a particular vendor that loves to show up at the last minute with the giant piles of code and just kind of yolo it as a nice surprise for everyone. So this Jira will give you a good idea of what's coming. But my guess is there will be a surprise that we find out about in June, just based on history. I could be wrong. Maybe everything is actually planned this time. That would be a pleasant surprise. But there's a non-zero chance that there will be something new in June too. Cool. Okay. Take it away, my friend. Or no, you don't. Oh, okay. You've got a USB key. I think my employer would be mad if I let you plug the USB key into my work laptop. I enjoy being employed. No, no. I just had more time to kill.
Semantically-driven data management solution for I/O intensive HPC workflows
So people can hurry and sit down for the next speaker please. Okay, thanks for our next talk. We have met in talking about semantically driven data management solution for IO intensive HPC workflows. Thank you. My name is Metin Chakrachalov. I work at the European Center for Medium Range Weather Forecasts Department, Forecasts and Services Department at ECMWF. I will talk about the semantically driven data management solution for IO intensive HPC workflows, which the work was funded by the EUR HPC project called IOC. It is work done by many people. So a little bit background on the ECMWF, European Center for Medium Range Weather Forecasts. It is established in 1975 by 23 member states and 12 cooperating states as an intergovernmental organization. There are three base duty stations with more than 450 people, Redding, Great Britain, in Germany, Bonn and Bologna, Italy. So ECMWF is both a research institution and 24-7 operational services, producing numerical weather predictions and other data to member states. There are two big projects that ECMWF is a key player. One is Copernicus. It is the Earth Observation component of the EU's space program. We provide climate change information, atmospheric composition information and also flooding and fire danger information. The other big initiative, EU initiative, is the Destination Earth project, which is prototyping digital twins of the Earth. So the ECMWF's production workflow looks like this. There are per day 200 million observations, collected acquisitions and fed into the Earth system model. Those observations and the output from the weather predictions are archived. Also, these data are used to generate products, which are 300 terabytes of data per day, which then accounts for 65 terabytes of data per day as products disseminated to around 350 destinations to member states and other customers. So the information system, the data is central. It provides access to data, models and workflows, and the data management is very critical for the operations. We need transformation of data into information, insights and decisions. So, semantically driven data management, we have been doing this for a long time. It means managing data based on its meaningful logical description, rather than just storing data. We also abstract the backend technologies. We also abstract where and how the data is stored from the users. So instead of, we try to avoid nested folder structures or UIDs, such as this home user projects ECMWF and blah, blah, blah, or some cryptic UIDs that doesn't make much sense to the user. Instead, we want to use meaningful, scientifically meaningful metadata to describe the data. For example, in this case, this project is ECMWF experiment number 42. The data is 224 parameter pressure and level. So for that, as part of the IOC project, we developed DAISY, data access and storage interface. So we provide, we index and identify data using its meaningful description. And for that also, that allows us to implement optimized algorithms to retrieve archive and retrieve data. And this is based on the ECMWF, ECMWF object store called FTB, which is also free and open source on GitHub. And this abstracts, we also abstract the storage technologies behind POSIX. We support POSIX, DAOS, Moto, and Ceph. And we provide interfaces and tools as well as CEC and Python APIs. So the schema, the main complexity is the schema which describes the database. And it is a collection of rules and each rule is a tree of attributes. In this example, I have a schema file and inside that I have two rules and each rule is consisting of multiple parameters. For example, here project experiment date parameter level would translate to a key project ECMWF experiment 42 and so on. The other rule is event city year and this could translate into event for stem, city is Brussels and year is 2024. So the rules are, the rules have, the rules are blueprints of the database, how to construct the database. And they have three levels and they, each level can have multiple attributes. And to make a rule, it has to be unique and complete to describe the data so that we can identify data from other data. And we also need to think about the locality where data, where different data is related to it, we would like to store them together. So each level here, we can think of the first level as directory, the second level file level, and the third level as the indexes in the file. So the locality would be increased when we go deeper in the level. The other, we can set daisy, we can set up daisy by the configuration file in YAML configuration file. We can point to the schema file to find the schema file and we can set the backend storage technology by saying file in this case is reference. We can also have different parts to the databases. We can have multiple databases. It's called roots. And we can set multiple behavior to individual roots. So aside from data, we also need key and we also have query. The keys would refer to single objects while queries can be any number of objects. In this case, key defines, identifies, and single objects on the right, I have level as a list of values, 0, 1, and 3. So it means I make a query for three different data where the differences, the levels are 0, 1, and 3. So we provide multiple interfaces, command line tools, C, C++, and Python APIs. But here I present an example for Python API because it's simplest. So I need, for storing a data by key, I need a key and data. So data can be anything in this case. I just put a string here, but it can be PNG file or PDF file or any other type of data. Then I make a key. User is met in project is IOC. Date is 2023 and city is born. And I pass this key and data to Daisy and Daisy would archive it. Then the other main feature is list, searching for data in the database. I need to make a query, in this case, user met in project IOC. And in this case, I just want two data objects for two different dates. And I pass this query to Daisy and it returns me the keys that I need for retrieving. And in the next example, I have the retrieve getting a key, getting data by a key. I make a key, user is met in project IOC and so on. And I pass this key to Daisy and retrieve the data. So it's very simple. So to sum it up, we describe data semantically instead of your IDs and nest directories. And we index and identify data by its meaningful semantic information. And this also allows us fast and efficient retrieve and search and archive algorithms. Also, we abstract where how data is stored from the user. And we make blueprints called rules. And we make keys to attach to the data and pass it to Daisy. And Daisy would store and manage the data using multiple different storage technologies. So more about Daisy, we have, we published, Daisy is free and open source. We published on GitHub. We have example C API and Python API. We also provide binary packages on GitHub for C, C++ as well as Python. We also have Python packages for Linux, RPM and the beam packages are available. We also have documentation on read to docs. And that's all. Thank you for your attention. APPLAUSE Thank you. Do you have any questions for Metin? No? Oh, there is one. Thank you. Hey, next presentation. I was wondering if you can specify the type of the values in your SEMA. You mean integer? Yeah, yeah, like that. To facilitate the queries. Yes, attributes can have types. You can set integer, date, string, they can have multiple types. Okay, thanks. Hi, thank you for your talk. I was interested in indexes because you mentioned that you index and identify data. Did some standard type of indexes or you have some format on your own, optimized for this three type of data? Yes, indexing is based on the rules. So the rules here, we have three structures which has three levels. So it has to have three levels which translates to a directory file and data. And we have in-house mechanism algorithm to that indexes that translates this three into identifying data. So is it something like gene indexes? I couldn't hear. If it is something like gene indexes? I'm not sure if I understand. Gene index? Yeah, I'm not the right person, I think, because we use the FTB which has been developed since long time. And it's a big library. I cannot answer the question because I haven't worked with that level. No problem, thank you. Thank you for the talk. I would like to know where are these keys stored because you need to query this kind of index and where are they stored for the user? Yeah, so the indexes would be stored separately but together with the data. And they would go, for example, in this case, inside the roots. So each root would be different database. And if you would look inside the root one, output one, for example, here, you would have index keys inside as well as the data, so together. Okay, so it's a file system or a database stored inside these two directories? Yeah, for POSIX, it would be directory, but for object storage, it would be not directly contained or something like that. Okay, so the way how the index is stored depends on the type of the storage you describe here. Yes, but we can also have, we have two different abstractions. One is indexing and we call it catalog and we have a bulk data. So we can have indexing catalog inside POSIX directory and bulk data on an object store. Okay, thank you. Any more questions? Is the next speaker in the room? Okay, thank you again, Metin. Thank you.
RCTab Cloud Subscription Management System
Okay, now we have David from the Alan Turing Institute who's going to tell us about RC tab. Thanks very much. Yeah, I'm David Llewellyn-Jones. I'm very happy to be here at the HPC, Big Data and Data Science Dev Room. I'm relatively new to the HPC area, so I'm still getting used to all of the terminology and everything involved. It's a very exciting area, so it's nice to be part of the community. So as it says, I'm from the Alan Turing Institute. That's the UK's National Institute of Data Science and AI, and I'll talk a little bit about what the Institute does and the relationship with RC tab. RC tab is a cloud management system that's been developed at the Institute. I'm going to talk a little bit about that, and then I'll also talk about the process that was gone through to turn it from an internal project into an open source project. I think there's some interesting things to be learned from that. So I'm part of the research computing platform teams at the Institute, and I work with various other people in the team. So Thomas Lasauskas, Instentson, SA Ben and Joe Palmer, alongside myself, are the current team, and I should say that all of the people in that list, apart from myself, are RC tab developers. I'm just an RC tab user. I use it in my day-to-day work. But there are some previous people on the team. Oscar, Giles and Pam Rochner were also developers of RC tab, so they also contributed to the work that I'm talking about here. So first of all, a little bit about the background to RC tab. So I'm part of the research computing team, as I mentioned. The research computing team essentially supports other researchers in the Institute doing research on compute systems. So that could be high-performance computing systems, or it could be cloud systems. The Institute itself doesn't run its own HPC clusters, but it does have relationships with a number of other clusters in the UK. So for example, it has relationships with Baskerville, with J2, with Archer 2 and with Dawn, which is the new Intel-based system, and who are here today, I guess, represented. Yeah, that's nice. So we basically are the interface to those other systems. And as well as those high-performance computing systems, we are also the interface to the Azure cloud that researchers are also keen to use as well. Not everything that our researchers want to do needs high-performance computing. They might want other sorts of systems. For example, they might just want to run a website, for example. And in that case, a high-performance computing system wouldn't be appropriate. Azure is something that would be more appropriate there. So yeah, the Alan Turing Institute itself is not that large, but we do have projects with over 400 researchers. So it is quite a large number of projects that we're managing. And so therefore, managing those cloud projects, the interface with the cloud, can be quite challenging. So why is the cloud useful in research? Well, researchers are very keen to use the flexibility of the cloud. They don't always need to use GPUs, but even when they do want to use GPUs, they might also want to do it on the cloud. A lot of the cloud providers are now moving into the same sort of area that the HPC systems are providing. The cloud provides a lot of flexibility. And in addition to that, it's also quite attractive for users that don't necessarily have a Linux background, that don't have a background of using a Qs, for example, to deploy their jobs. A lot of our users, a lot of our researchers are very keen to use cloud systems. However, there are downsides to that. And in particular, one of the problems that we find is that with cloud systems, the companies, Microsoft or Amazon or whoever is running the cloud service, aren't very keen to put restrictions on the costs that people can accumulate whilst using the cloud. So you can use cloud services and you can have them on your project and you can use them as much as you like. There won't be a cap. If you go over your funding limit that your project has assigned to you, you can just carry on using them. And actually preventing projects from going over their spending limits is a real challenge. So these clouds are very flexible, but they also provide a lot of scope to shoot yourself in the foot as a researcher using more resources than you should be able to use, than your project funding allows, for example. So there are certain ways you can manage that. You could, for example, assign a credit card to each of the projects, have a credit card associated with each of the projects that only has a certain amount of funding on it. And when that gets used up, then Microsoft will cut off your access to the service. But that's not really practical in a research institute that has tens, hundreds of projects. Managing that process is really hard. Azure and AWS will send you alerts when you get certain limits. They won't cut you off, but they will tell you when you get certain alerts. And so you could use that as a mechanism, for example, if you wanted to manage it in a sort of a manual way. But again, that's not really practical in a very large institute. So there really needs to be a way for project managers, people who are working, for example, in the research computing group, to manage those projects in an automated way. And that's essentially what our CTab is for. So I'll talk a little bit now about how Azure structures its resources. I'm sure many of you will be familiar with this, but in case you're not, Azure structures resources underneath subscriptions. And they're a natural way, essentially, to manage project access to Azure resources. So you can fire up virtual machines, you can fire up firewalls, you can fire up instances or other web services, for example. And they will all fall underneath a particular subscription. In addition to that, you can also put access control on those subscriptions. So only certain researchers would have access to those subscriptions, but once they've got access to the subscription, they can then fire up resources under those subscriptions. So they're a very natural way to manage projects, essentially. However, as I said, although you can put funding caps on these subscriptions, they won't get turned off when they reach their limit. So these are essentially the mechanisms that we use with our CTab to manage the subscription to manage the resources is via these subscription entities. So our CTab is a tool that has been developed in-house to essentially perform this process. So it tracks the spending that's happening on Azure for individual subscriptions, which are related to individual projects. It manages the budgets that are associated with them so that people can see how much they're using and they can track that resource usage. It will notify users of various points in the process of using that spending, but then crucially, which is the thing that we really need, it will also deactivate subscriptions when the money has been used up. So it will prevent researchers, it will prevent projects from going over their budget limits for their resources. And it will do this on a very large scale. So I'm going to talk a little bit about how it's structured. So this is the background to it. So it is open source. You can go to the GitHub repositories and get access to the code. You can deploy it yourself. And it is deployed itself on Azure, so it is an Azure service. It's split up into essentially four pieces. So there is the infrastructure repository that handles the deployment of the other services to the cloud, to the Azure instance. There's the CLI, the command line interface. The command line interface allows you to manage the subscriptions. So that is essentially the right access to the backend database that manages these things. Then we have functions, which I'll talk about a little bit more detail later, but the functions essentially are the cron-like jobs that run in the background, which manage access, which monitor resource usage, and also perform the job of turning off the resources. And then there is also an API that can be used to create a web interface for access to those resources as well. And the functions in the API are all deployed as Docker images. So if you wanted to use this yourself in your own organization, you could pull those images directly from Docker Hub and deploy them. And it's quite a straightforward process. And so until I search for it, I did not know what this image at the top meant, but that is a webhook image. All of these things are managed by GitHub webhooks essentially. So when we change the code, these things get deployed automatically to Azure. It's a very smooth process. So I'm going to show you now a little bit about how the interface looks. So this is the web interface that's provided by RCTAB. And you can see that there's essentially a list of subscriptions that you can see on the left-hand side. And each of those subscriptions has data associated with it. So it has the name. The name is pulled from Azure. The budget that's associated with it, which doesn't come from Azure, that is allocated by us using the command line interface. So we can allocate budget. And then the usage is monitored in the background. And when the remaining budget, which you can see on the far right-hand side, reaches zero, then the instance will be shut down. The subscriptions will be shut down. And all of the resources will essentially go into zero cost. And you can see that in this case here, there are a number of subscriptions that have reached zero. And they've been terminated. As you can see, they've got, I think it's a little skull and crossbones on the left-hand side to show that they've been terminated. And you can drill down into this. So you can select one of the subscriptions to find out more information. And then you get a page like this. And here, again, you can see not only the costs associated with it, but you can also associate costs with project funding and with a ticket. So we use a ticketing system so we can manage all of those things really, really cleanly and nicely. And it's not just us and the research compute team that have access to this. It's also the researchers themselves that have access to this interface. And because the access control list can be pulled from Azure into RCTAB, we can ensure that only the researchers that need access, sorry, only the researchers that have access to particular subscriptions can gain access to the information about those subscriptions. And I don't show it here, but you can drill down further. There are graphs that you can see about your usage and what your costs are and what your costs are likely to be as well. So it helps the researchers to manage their budgets that they have on Azure. So that's the web interface. The web interface is entirely read-only. So that's for looking at the information. You don't get the ability to change any of the information through this interface. If you want to change the information, you have to use the command line interface, which looks, some of it looks like this. It has quite a lot of functionality. This is the subscription functionality that I've just put the help up for here. And as you can see, I've also put a command on the bottom which shows how we allocate the additional budget. And this is essentially the tool that I'm using day to day. So I'm basically, when I get a ticket, I see how much funding is needed, is being allocated to a project. I check everything's okay, and then I run the command, and the command then allocates money to the project or removes money to the project. And we can also set end dates for projects as well. So the subscriptions would also get shut off if the end date is reached. So I'll just talk a little bit about how the functionality works in the background. It's a very high level. But you can see that we've got the websites, the CLI, that are both on the right-hand side there, the interfaces to the system. There is also email integration. So we use send grid, which is a third-party add-on to Azure to send out emails to users to tell them that their subscription has been terminated or that they're reaching nearing the end of their subscription, close to the termination period. And there's a database in the middle, which captures all the information, as you might expect. But then there are these three functions. So there's the status function, the usage function, and the controller function, which run their cron-like jobs that run in the background. So they actually run periodically. They run every hour, in fact. And the status function is monitoring the status of a subscription. It will send out emails to users to tell them that their subscription has been either activated or deactivated. The usage function measures the cost that has accumulated on a subscription. So again, it will also send out emails to users, but it then also feeds that information into the database, which is used by the controller function to ultimately shut down the subscription when the remaining usage has been used up. So all of these things are running in the background. And they're essentially providing the services that you might think that would be available on Azure already to do this, but I guess it's not in Microsoft's interest to do so. So we also have this group management structure that is used for this, as I mentioned. All of this is deployed to Azure. So the actual RCTAB infrastructure is deployed to Azure as well. And you have to manage this quite carefully in an institute like the Turing, because we have a lot of projects. Some of them have a lot of private data and those things, because in effect RCTAB has access to subscriptions, we have to make sure that those are not monitored by RCTAB. So we have a managed group that is for special case subscriptions that you can see on the far right hand side for you. And they are not managed by RCTAB. Even within RCTAB, we also have two groups. We have managed and not managed subscriptions, which are both monitored so that users can find out the information. But there are certain subscriptions that we don't want to shut down, even if they go over budget. One example of that is RCTAB itself. It's very important that RCTAB doesn't shut itself down. That would lead to bad consequences. So there are certain things that we don't want to shut down and we keep those separately as well. All right. I'll talk briefly now about the process of open sourcing. So originally RCTAB was an internal project. So we've been using it for, I guess, since 2020. So between 2020 and 2023, it was an internal project. It wasn't open source. It was essentially closed source in some private GitHub repositories. And it was switched to an open repository around April 2023. So it's now been available as open source for a little under a year. And the process of open sourcing was actually quite time consuming. So we had to go through full code audit. We wanted to remove, so again, I'm not a developer, so I was only adjacent to the process. But there was a lot of work that went into it, essentially making the code better, checking that there weren't secrets within the code, for example, which is one of the problems which happens when you've been running an internal project for a long time. And in fact, one of the motivations for keeping it closed source originally was to try and avoid any, reduce the risk essentially of potentially, if you push a secret to a repository, it might become exposed. So there was a lot of work using Gitleaks for checking for these secrets, Grep, also for the same reason, a lot of grepping to check for the stuff, and Vulture for unused code. And ultimately, I think, as it says there, 25% of the code was actually removed, 25% of it was refactored. It was quite a lot of work to do that. And there was a process of moving from one repository to another. And in fact, that meant running two deployments of RCTAB simultaneously for a period of time. The choice was used to use Pulumi as infrastructure as a code for deploying RCTAB. And that was also part of the open sourcing process, because there was a need to make the deployment a lot easier if other people were going to be able to use it. And Pulumi was chosen essentially because it's open source, because it's using Python, in contrast to examples like Terraform, which are closed source and which have their own declarative language, RCTAB itself is written in Python. Pulumi seemed like a natural choice. And it seems to work really well. So if you want to deploy RCTAB now, you literally just have to run Pulumi up, and it will deploy all of these things to Azure in a very sleek and streamlined process. So it actually works really nicely. So the Alan Turing Institute itself is, in theory, committed to open source code. One of the things that we champion that we promote is the idea that you should have reproducible research, and for that you have to have open source code in the main. So you'd reasonably ask the question, why wasn't it open source to begin with? Well, I already mentioned one of the reasons. It was to reduce risks initially. It was also because it was never really perceived as being something that would be needed by other organizations. But over time, it became clear that actually other organizations could also really benefit from that functionality. And finally, there was also the fact that this process of refactoring the code was something that was ongoing anyway. So actually converting it to open source was both a motivation for cleaning up the code, but also motivated by the fact that that was happening anyway. And ultimately, it reached a stage where it was generalizable enough and also potentially beneficial enough to other organizations that it seemed like the natural thing to do. But there wasn't really originally any good commercial reason for keeping it closed source. But in practice, that's probably something that should have been done from the start. All right. So in conclusion, well, managing costs on cloud platforms like Azure is difficult for large organizations that have a lot of projects. RCTAB provides now a very stable and battle-worn solution. We've been using it for three years with many different projects. It seems to work very well, at least for our needs. And we really expected that there would be something else out there that would do this job. It's not really the natural thing for people to want to do. But in practice, we just didn't find anything that fitted the bill. And then finally, this process of open sourcing, it was time-consuming, but ultimately, it felt like it was a very beneficial process to go through and the code has benefited as part of that. All right. Thanks very much for listening. Thanks, David. Do we have any questions? Hi. So when a project runs over the budget, you shut down the instances to reduce the cost to zero. Well, so the subscription is deactivated and the consequence is the same. Okay. And what happens with storage, for example, storage accounts like Azure Archive storage or like Blob storage? Yeah. So the storage is retained because that's generally very low cost. Okay. But they continue producing costs, right, if the storage is retained? So there is a small cost to that, yeah, exactly. Thanks. Any other questions? Hi, David. Hi. Just a question off the cuff. How tightly coupled are you to the Azure API? I'm so sorry. I couldn't hear properly. How tightly coupled is this code to the Azure API? So the question is how tightly coupled is it to the Azure API? Yeah. So that's a really good question. So at the moment... Can people be a bit quiet, please? So like I said, I'm not an RCTab developer, so take this with a level of uncertainty. My understanding is that conceptually, it's not strictly speaking tied, so the concepts will transfer over, but practically speaking, it makes heavy use of the Azure API. Yeah. So at the moment, it would be possible to convert it to something like AWS, say, but practically speaking, that's as I understand it, quite a lot of work at the present stage. Yeah. Are you using something else? Or do you think there's a practical benefit to that? Yeah. Yeah. Okay. Thanks. Thank you. Hey. So first I want to say thank you for the talk. It was kind of a little bit, I don't know, annoying that they were just coming in and just interrupting your talk. I find it was rude and I wanted to say that. And the other thing I want to say or ask is that, so you say it might be not in Microsoft best interest to have like a so delicate model of managing the cost for these kind of projects. I think that might be true. But I was wondering, you have a quite high complicated structure of like managing the budgets for projects. I think that's something, yeah, special I guess. Is it? So your question is, so you're saying that, yeah, my question is like, that it's an unusual situation. Yeah, exactly. Other universities might have the same thing, but not other places. Yeah. I mean, that's a really good, I think that's a good question. So we do have a lot of subscriptions, but I think we're probably not unique. So you mentioned universities. I think other universities will probably be in a similar situation. I think it's, and I guess it is also true that the nature of the projects we do are that they come with defined funding. So they can't, if they go over their budgets, that's a problem. And that may not be the case in a lot of other organizations. So you're right. I think it probably is a special case, but I'm sure there are, I would expect there to be other organizations that fall into this situation as well. Yeah. Okay. Thank you for answering. Yeah. Thanks very much for the question. Very quickly, the questioner said that it might not be in Microsoft's interest to help you limit your budget. That's, if you work for Amazon, you are actively rewarded to save your customers money. And I'm sure that's the same for Azure. They like to hang on to the customers and they can help you save your money. You are rewarded for that company. Okay. And I guess I do want to make clear that, yeah, Microsoft provides us a very good service, I have to say. I don't want to give the impression that I'm putting down Microsoft by saying that. I guess that was a mistake to suggest that. However, it does seem to be the case. I guess I should put it in a slightly different light. The nature of cloud is that people want access to the resources when they need them and they don't expect them to be shut down. That's not the nature of how people perceive the cloud, I guess. So in that sense, it's probably not in Microsoft's interests because that in some sense, to me, that feels like it's not the service that they're providing. They're providing access to resources and they almost make it feel like it's unlimited access to resources, even though obviously it's not in reality. But in practice, that's kind of the service they're offering. Is that, would you disagree? Sorry. Okay. Clear enough. Okay. I think we have to stop now. So let's thank David again.
Vector Search in Modern Databases
Okay, next up we have Peter telling us about vector search in modern databases. Okay. Well, hello everyone. My name is Peter Zaitsev. I was supposed to speak here together with Sergey Nikolayev, who is actually a much better expert in this space, but unfortunately he couldn't get a visa. So, guess what? You are stuck with me. And we are going to talk about the vector search. How many of you are familiar with the vector searches? Oh, well, that's a good number of hands. Some not, so that's a fun audience to have. Let me maybe start with ruining the suspense and kind of showing the highlight what is a state or a vector search in a variety of databases. And I think what is interesting happening here is what we have, A, the number of new databases started in the last few years, which is specifically focused on vector search and related applications. And then also at the same time we can see a lot of mainstream databases have added support for vector search. You can see that starting back in 2019, which is actually relatively recently and relatively quickly. I think that is very interesting because databases are often rather conservative, kind of relatively slow moving product. And what that reminds me is something what we saw with Jason. We saw databases like MongoDB came and really got a lot of developers' hearts and minds. And then later on, Jason support came to pretty much every major database out there and was also added as SQL standard. Now, what is their unfortunate mission here? What you can see is the MySQL, right? Well, you can say, well, MySQL is now owned by Oracle, which is a big, fast, slow moving corporation. Right, so they're not doing that. But there is another problem is what with MySQL, it's actually exist with a heat wave solution, which is cloud-only Oracle's MySQL variant, right? And it's, I think, not very clear to what extent it will come in the MySQL proper. At the same time, MariaDB is working on solution in the MySQL space. The planet scale announced what we are going to implement vector search. So if not in the MySQL community itself, it will come in some variants, right? And obviously, if you look at the PostgreSQL, I think that's always wonderful ecosystem, right? So something has done the process with like a multiple way of doing this stuff, and there have been number of vector search extensions created, but the PG vector seems to be one which is getting the most support and most of attention those days. Now, what is also interesting, what we can see with vector search being so hot, right, with AI, and some databases like Elastic in this case here, they are pretty much focusing, calling themselves the vector search database first, right, and the full-text search database second, right? So I think for me that was very interesting to see that change. Well, anyway, with kind of a big picture out of the way, let's talk a little bit about the vectors and vector search, right? And why suddenly this becomes kind of important those days. Now, if you, Leon, I don't know, let's say, is it like a high school algebra or something else, right? You probably know what the vectors are, right? We can think about 2D, 3D space, that's all very clear. But also if we think about that, we can think about vector as something as represent colors, right? As in a software engineer, they probably know like well, colors, this is red, green, blue, right? We often encode that in one byte each, right? And that actually can be seen as a vector in color space, right? And we also can think about similarity of colors as similarity between the different vectors. Now, if you think about the vectors, there are actually a number of different approaches how you can think about what vectors are being similar and the same, right? Like there is, like, Euclidean distance, but if you think what is the most common this case is a cosine similarity, right? For example, if you go and ask our, you know, the third leaders in the eye space, open the eye, say, hey guys, then I'm using your API, what you suggest as a distance between vectors, they would suggest you to use their cosine similarity. Now, let's talk a little bit about the history, right, and how they have been using the database and specifically in their information retrieval space, right? Vectors are actually not something which is suddenly became new, right, as you may think, a vector search database, right? If you think about the full-text search application specifically, there have been vectors used for a long time, right? For example, if you, one approach would be to look at the sparse vector and say, hey, we have a document, we can actually look at all the words we have in the dictionary and let's say state a frequency of that word in that vector, and then if you want to compute the similarity between two different documents, well, we can essentially look at cosine similarity between those two kind of massive sparse vectors, right? That was something which exists for, again, very, very long time. You can find that in, you know, Lucene, Elastic, Libraries, and so on and so forth. Now, if you think about what we use in the vector search for much more is we are looking at more of a dense vector, which are called vector embeddings, right? Which are different, right? What is interesting in those kind of sparse vectors, like also referred as a bag of words, we can actually think about the every dimension in a vector as a mean something, right? We can say, oh, this dimension means if a word, you know, cat was seen in a document and how many times, right, or a relative frequency, right? Then we compute the dense vector, right, or embeddings, what that means is we take a document, right, we learn that from the model which generates those embeddings, and then we don't really understand very well what those different bits mean, right? What we know, though, is what the similar documents should be close by, right, how this system works. Or it doesn't have to be even a document, right? Like if you think about their, let's say, image recognition process, right? I can train the system, let's say, you know, faces of a bunch of people, though I know it's kind of totally legal in Europe, but let's say imagine we are in China and we are going to do that, right? Then we can look at, you know, vector embeddings, computer of somebody face, right, and look at what is a people, people in the database it looks the most like, right? That should give us, give us the closest much. So here is also something interesting. This is embeddings which are computed for single word documents, right, using one of the open source frameworks by Glove model by the guy called Jay Alamard, right? And what is interesting in this case is what we get, you know, sort of, you know, cardinality, we see, you know, bunch of words we don't really quite know, right, what those different dimensions means, right? And it's kind of like very common in AI, right? We know those things work, but we can't really figure out how exactly this works, right? And then you can think about and rationalize about what those things could be, like for example, all of those things have time to do with humans besides water, right? And then you can see there is those, you know, you see like a straight blue line which is blue for everybody by that. They say, well, maybe that is something which is related to humans, right? Or we can see something else, right, like a king and queen, right? They also have a lot of similarity. They may say, well, you know what, AI, we don't really exactly know how, but it may have something to, you know, think about their royalty. Okay. So if you think about their vector search, right, in a nutshell, we have vectors. That typically will be dense vectors which came from the AI applications. That would be some embeddings which in some database systems, right, they are, they focus on supporting operations with those embedding systems, right? For example, postgres, PgVector, it's just say, hey guys, I don't know how you compute those vectors. That is not our problem, right? We just going to help you to operate FM. Some of the more advanced features they may, they vector databases, they also may support creating embeddings. Maybe even, you know, from the database itself, right, especially in the cloud database, being able to, you know, call open AI's API in the background to compute the, you know, embeddings for you. So anyway, let's talk more about the technology, how does that work, right? So what exactly operations do we typically see in the vector search applications, right? Well, typically we do have a bunch of vector store, we have a vector on input, right, and we are looking to find a vector which is closest to it. Right? Through the distance we want to define vector search is a cosine, cosine distance. Right? Now, if you look at this problem, if you, of course, you can just, you know, as with about everything, right, you can just scan all the data and find the closest vectors, right? And that is, you know, fantastic way you can do it, you can do it exact, but it is also very slow, right? So that means why it's not used in practice instead, using special index structures, right? This HN SW, right, that seems to be the most popular algorithm, right, which industry seems to be coalescing about. And I think what is important to note about that is compared, like, different from many other things in database, this index is not exact, right? So it gives us rather, well, good accuracy, but it does not guarantee what that always will give you the, you know, let's say the closest vector when you ask it, right? And if you are familiar with database, you can say, ooh, that looks strange. But in AI applications, how those vectors are computed, right, they are not really exact to begin with, right? So these are quite usable. Okay. So let's talk about those solutions, right, what we are really using this for, right? As I mentioned, the most common features in this case will be found in your nearest. There are some other more kind of global operations in this case, which can be used in terms of clustering the data or classification, which is supported by advanced features, advanced systems. Okay. Let me show maybe in this case a little bit of example. And here we'll use their, an interesting way, we'll use their multi-core search system, right, that's where one of the search is created for. But we'll connect to that through a MySQL protocol, right, that's what it's for. And we can see what we can go ahead and let's say, create the table, as you can see, right? We've defined in the flow of vector, in the traditional database type, but something what this engine support, right? And you would see typically vectors support some sort of vector store functionality, right? And then we can find their distance between the given vector we have as well as the vectors in the database, right? And it can give us the information, right? So what we actually had in this case is their different images, which would run through creating embedding for them. And then you can see what, what was that? Yeah, I think that was like an image of a bag, right, which was saying, hey, you know what, it's much more similar to a yellow bag compared to their, to a white bag, right? Which you can, that is a kind of pretty common, what we have. Okay. I mentioned what, when you speak about embedded computation, right? There, if you look at especially non-vector databases, not specialized databases like Postgres where we would say, hey guys, you guys can use external encoder, but they don't see that as a database problem, right? You process some information out there, your favorite find, external API, or use some local open source model, that's fine. Though there are some libraries, right, with Microsoft last time, a library, right, they are being added. And I think over time, of course, if our desire to enable developers simplicity, we'll see more of a direct support. Now, something else I think what is interesting about their embedding and the information retrieval tasks specifically. I think what is very interesting, if you look at the search applications, for years, there have been a lot of time spent about sort of like a hard coding, like in the engineering, the structure of the language, defining synonyms, defining antonyms, right, and so on and so forth, right, and so we can run our search quality. The other approach, though, is we can use the AI, right, for this approach, right, so we can look at the matching document through embeddings, right, and that really allows to avoid a lot of that thing which has to be manual processing if a good quality. But what is an interesting there, though, is this new generation AI search may not be best, and also it may not be the most effective, right, because if you think about that, if I have like a lot of a document in my database, right, billions, right, then actually the search can be relatively expensive. So one of the approaches which is used is dual approaches, right, and you can say, well, we may be getting, you know, like a top thousand documents or something through, you know, like a legacy kind of frequency-based search methods, right, and then we can use AI to rank those, right, and you can see what that's sort of like a combination that's like a last-second, like a Vespa hybrid, right, it shows better, right, on many benchmarks. Okay, well, oh, close, thank you, is very usable, both in information retrieval tasks as well as many other applications, and what we see also what the vector support in databases in relatively early stages, right, it just happened implemented in the last few years, right, I would say what the current implementation is related to basics, and we would see a lot more of the features, right, as we kind of figure out, right, what we actually want to empower us to go into the continuing innovation in this data structures to support us in the fast vector applications, right, and as well improve accuracy. Well, that's all I have, right, you had some, yes, the gentleman was about to show me zero minutes, yeah, and if there is any questions, I would be happy to try to answer. Any questions? Yes. Hi. Okay, okay, so given the fact that so you say that databases are going to transition from just supporting externally modelled embeddings to generating internally, but given the fact that we see a lot of many advanced models that generate embeddings, for example, the GPT, the GPT is a machine that basically takes entire concepts, translates very efficiently into embeddings and then outputs also, and is the standard like you provide a single external embedding generator model compared to a traditional BWAC, and then everyone can just benefit from an on par model. Well, yes, yeah, absolutely. Well, what I'm saying in this case, I think it's being supported, right, maybe in different things, right, I'm not saying, oh, you know what, you should expect postgres incorporate inside that over possible models, right, I think I think the same would be say like if, let's say, Python support those things, right, that just means it's easier to do from a Python standpoint, right, so now if you think about this case, if I want to generate embedding, even for data I have in the database, I kind of have to do that, that externally, right, what I'm saying is, well, get some, you know, fun, was us by talking externally, not creative, so that is what I'm speaking about, make sense? Thank you. Any other questions? Thanks. Thanks for the talk. As a database developer, how do you deal with very rapidly moving target of both algorithms and storage formats? In the sense, do you deprecate rapidly? Because if today you have a KNN function that has this, some certain API, and a month later, you know, ML is very rapidly moving target, you have a better KNN function, and same goes with densely, dense format. Well, yeah, I think that is a good point. Now, I think what is interesting what you're saying, right, you know, thing on one extent it's always interesting and then you have a, you know, interesting and then you have this, the industry in like an early stage, because on one extent things are changing quickly, on the other hand, often people implement something, they put it in production, it's good enough, and the fact that that kind of is a better state of art out there, that doesn't mean what they want to change, right, and that means, like at least in the database world, right, you always have to say, well, you know what, there are certain things you would love to kill, but actually some very big corporations already deployed in production and they're not freaking changing that in the next 10 years, right. So, reality what that's going to be, right, I think is that evolution on that side, but then we'll still have to have a competitive I think if you look at, I mentioned like a postgres, like a PG vector extensions, well, you support like a whole bunch of different options in this case, right, so yeah, I think that's what we'll expect. Okay, hi, yeah, can you go back to the embeddings slide where you showed the similarity between Queen and Queen, I think, the similarities between Queen and King, I think, or was like the first, Oh, this one? Yeah, yeah, yeah, yeah, yeah, you mentioned you use Ada, I think. What do you use? Chart GPT embeddings. No, so yeah, this one is actually, you know, a global model, right, that was one of the open source models, right. Ah, okay, okay, yeah. But I mean, in this case, I think that is just like an example, right, I think what you wanted to look in this case to visualize, right, I mean, when you say, hey guys, you generate those kind of embeddings, and they do not particularly mean anything, right, but can you just plot them, you know, as we plot, let's say, you know, DNA of a frog and a fish, right, can we see something maybe in this case, right. So that was the case to do here, right, to visualize how particular embedding generation model. Yeah, no, no, I find it really helpful also for, I find it really helpful also for like my future students, because like it really grasped the accents of embedding and similarities, I think. Yeah, I mean, again, like in this case, that is just to visualize to show people what that things look like, right, and my point I was trying to make is, on one extent, we cannot really state what exactly are those dimensions, what it corresponds to, right, but as a human, we can, you know, try to rationalize over, it seems to be like, you know, something. Yeah, yeah, also. You know, something there, right. Also because other, the embedding of OpenIE has like 100, no, 1000 and a half features. Well, that's right. Yes. Really difficult to like. That's right. That's right. So like in this case, that was specifically kind of cut, right, to have any less features, right. Yes, it's because, you know, 1500 or like, well, like, you know, 3000 tried for large. Yeah, that would be too many to display. Okay. Thank you. Thank you. you
NFD: Simplifying Cluster Administration through Automated Node Labels, Taints, and Annotations
with the next talk, Eduardo is going to talk about streamlining your cluster with stuff. You continue to hate containers, no? So hey, big topic switch, big data just went out the room, now Kubernetes. So this is about, this is the first of three Kubernetes talks. The first time Kubernetes is making a big room in the HPC room, so I'm happy for that. Today I'm going to be talking about something that very much since I knew Kenneth, it's been something I care of and is how to properly create labels and annotations so we can properly set the workloads where they need to go and where they don't need to be. So let's talk about that. And I want to talk with this, everything that I'm going to be talking about, just thinking about automating. We are going to be automating how we use Kubernetes clusters. But then why this conversation? This Kubernetes and containers conversation that started like six, seven years ago in the HPC ecosystem, the promise was that once we implemented containers and Kubernetes, we could kind of like forget about infrastructure and we could just like jump into a new adventure, everything was going to be abstracted from us. It was kind of like billboard going into this adventure, but it took six movies to actually complete the adventure and I feel we are still in the hobbit. We haven't jumped into the Lord of the Rings kind of conversation in this topic. Who am I? I am doing containers for HPC since singularity, so since a long time ago. Currently I am working at NVIDIA trying to make everything Kubernetes and GPUs easier and better for everyone. So basically contributing to Kubernetes and all upstream projects to make it easier for you to run GPU workloads on any cloud-native ecosystem. So this is the agenda for today. I want to be talking about the no-fisher project and everything and all the amazing things that you can do with it. So just a quick, if you scan this, it will take you to the GitHub webpage of the no-fisher discovery. This QR code is going to be along other slides, so don't worry if you miss it once. So what is NFD as I call it or the no-fisher discovery? It's basically a Kubernetes 6 project. Kubernetes 6, so the 6 word there means is a special project that is under the Kubernetes umbrella and is managed by all the CNCF protocols. It's an add-on for detecting hardware and features in your system. That's what I will be talking about. And what I want to be doing this with the Hobbit and a lot of the RIMS conversation is because I really like the topic of Smaug talking with Bilbo and it's like they are always labeling themselves. I am the clue finder, I am the web cutter. They are always setting labels to themselves to make themselves bigger just to identify who is in the room because during that conversation Bilbo is invisible to his eyes. But what is NFD? So NFD basically four components. Is the NFD master worker, the topology of data, that is a new thing, and the garbage collector. So why this split and not just a single component? Four security reasons. So the NFD master is the only container running in your Kubernetes cluster that actually has all the permissions and privileges to label the nodes to create things, annotations, and all the features that I will be explaining today. And all the three other components communicate constantly with the NFD master to tell it what to do. So the NFD worker is the main component of NFD which is basically a demon set. A demon set in Kubernetes is a container that is running everywhere basically. And this container will go to your nodes and discover everything that is available to your container. Right? So I want to be very specific here. So everything that is available to the container. So if you are not by mounting something in the container or if you are not allowing your container to see things in your system, the NFD worker is not going to see that. Everything that is on the eyes of a container and the privileges of the NFD worker container is going to be advertised to NFD master and NFD master is going to advertise that to you. The topology updater is part of an ongoing effort to create topology aware scheduling in Kubernetes. So it's a very HPC related topic. This topic is undergoing development. The topology updater now has like two years and topology aware scheduling in Kubernetes is also now like a year and a half and it's getting very stable. So Kubernetes is getting very, very closely to having more and more HPC, like deep dive HPC workloads. And the garbage collector is for removing those things that maybe were removed. And a very good example for me that I work on NVIDIA is that some people turn on the systems. They start running Kubernetes and by some reason they remove the GPUs like manually. So we had to implement a component that like every minute goes and check back if everything is actually there and like update the labels. This is how the API looks like. So the API for no feature discovery for those that know how to define an API in Kubernetes is this is a custom resource definition. This is how the API looks like. And basically every component in Kubernetes, like even from another nice space, if you give them permissions to a deep to update patch or create new custom resource definitions, node feature, as we see here, you can actually create a sidecar containers for the NFD master of the no feature discovery. As an example, from NVIDIA, we have a sidecar container that we call the GPU feature discovery. And it's basically a container that knows about GPUs, the NVIDIA GPU, right? Like it become very specific. We have examples as Intel has a similar container. AMD is working in a similar container. So basically you have a container that is very specific to whatever you need. And it will communicate back to NFD master using this API. But when I say that the NFD worker is going to see everything you have in your system, I truly mean everything. If you enable everything that NFD can expose to you, it will become at least that you will never read. Basically, you will break it seeing your Kubernetes cluster and you will never get to use it. So let's trim it down. So we have automated node labels. So why automated node labels? It became a point where we were creating so many labels, features, and we were exposing so many information to the user in their Kubernetes clusters. That it was not usable. So we defined it rules. And we defined it rules for this same topic. When a cluster is so big, it has to label itself. So when a smoke goes around saying like I am a smoke that tremendous, I am a smoke the mighty, it's the same analog, right? So we have to tell the cluster like, OK, these are the only labels I want you to know from me. And this is how it works. So it's a node feature rule. So we created an API to communicate with the NFD master to tell the NFD master, OK, for all the information that you are getting from the NFD worker, only advertise the things that reach that match these features. And the match feature API is very rich. So we support rejects. We support match expressions. Just we say like if the operator exists, if the operator is true or false, if the operator is greater than, is smaller than, equals two. So we support rejects and match expressions. And this way we can tell NFD master, OK, for all that information you are getting, just create these three or four labels that I really care about. I don't care about all that other information that you are getting. Automated things. So this is another important topic. Let's say I have a big cluster, but only in this big cluster I have a specific side of the cluster that is like 10 super expensive GPUs and I don't want anyone to go into those GPUs unless I specifically say so. But I need to identify where the GPUs are and then taint the nodes and then create the labels. And if you start scaling that to HPC, right, like you start talking about thousands of nodes, that becomes a very hard manual label. So only the wordy shall pass, right? So in NFD we have a way of creating no official rules that will create taints instead of labels. So we can, using the same no official rule API, we can tell NFD master, OK, for the information that you are getting from the NFD worker, if a NFD worker tells you that a node in the system has this and this feature, please taint that node, right? So for those that use Qnettis, you know that once a node is tainted, the node gets drained and only workloads with the node selector and the tolerance will run into that node, right? So this way we can automatically taint nodes or any infrastructure to be free from users unless we actually define so. I have a node there. OK, yeah. Before enabling taints, we have to say this because it's an ongoing discussion for security reasons. We try to implement a feature for the NFD worker to auto kind of like remove himself from the picture when the node is being drained. A lot of the security people chime in and say that that is very unsecure. So now by default, when you use the taint featuring NFD, the NFD worker also gets removed from the node. So it's kind of like it goes, it creates a label, it creates the taints and Qnettis drain process actually takes NFD out of the node itself. Extended resources. You need to inventory what you have. Like I really like this scene when his entire grocery goes out in a single night. But we have to do the same for our nodes, right? Using the node featured rule again, so this is a very rich API. We can define resources, node labels. So as we can see here down, we can create allocable and capacity. So when you define a node in Qnettis, we can actually expose resources that are consumable. This feature was actually created for something that now became, for the first time became a Debrum that is the confidential computing. So when you are using confidential computing in Qnettis or in any cluster, by that matter, you are consuming some, let's call it tokens that every time you are running a confidential container, you have to consume one of these tokens to run the container, right? And watch, you reach a point where the node is out of these tokens, no more containers can run until one of the containers is actually killed, right? So this feature was created to expose how many tokens the node had left and we did this with NFD. Why this is a very specific use case? Because the tokens that you consume when using confidential computing, they are featured in the kernel. So they are always exposed to the container. So these extended resources doesn't replace what is a device plug-in for Qnettis, right? You cannot tell NFD, like, oh, go and identify all the GPUs and create extended resources for the GPUs because NFD will create extended resources but your containers won't get the buy mounts and won't get all the other things that a device plug-in does in Qnettis, right? So it's a very big caveat here that extended resources may be only used for the things that a regular container can see, right? And with that, I wanted to run short. And that's the whole introduction to NFD. It's been running for three or four years. It's an open source project and everyone is welcome to contribute. Right now, the only ones contributing to it really are hardware vendors. So Intel, AMD, NVIDIA, Melanox, now NVIDIA as well. But yeah, every hardware vendor is contributing to it so you can expose your hardware to Qnettis. And also, not hardware vendors, right? The confidential computing community has been also contributing to NFD to expose all the new things that go into the kernel so you can consume confidential computing. Now you can expose that to Qnettis using NFD. And with that, Kenneth, questions? Thank you. What about mic technology? What about mic technology? If I reconfigure my mic layout of my NVIDIA GPUs, how quickly will it be picked up that has changed? You can actually define that in the configuration of NFD when you're deploying it via Helm. So by default, it's every 60 seconds. But you can go to one second if you want. So that's up to you, right? We have been working a lot in NFD to make it CPU not intensive. So like a full runoff detecting everything in your node takes like 0.3 milliseconds and doesn't consume more than like 10% of a single-threaded CPU. So it's a very lightweight container given everything that it does. So we try to keep it optimized. But if you want, you can take it down to one second, but by default, it's every 60 seconds. I have a feeling the previous question might have answered this, but if I wanted to get GPU operator to say label an A or a V100 and say enable MIG on V100 after, does that mean that it will say go NFD through the NVIDIA stuff, my custom stuff, then back to the NVIDIA stuff? It's not a one-shot thing. Or is there an ordering basically? Yeah, it goes in order. So when you deploy the GPU operator that you mentioned, the GPU operator deploys the GPU Fisher discovery that is a side card to NFD. So what the GPU Fisher discovery does is that it utilizes the API of NFD. If the API doesn't exist, the GPU Fisher discovery doesn't work. So that's why you first need NFD to then deploy the GPU operator. Once you deploy the GPU Fisher discovery, it will call NFD with this API and it will tell like, hey, I realize this node is a V100 or a Grace Hopper. But that is going to be on the hands of the GPU operator, so you don't have to worry about that. That's why the GPU operator deploys a side card container for that. Because it's a container that is specialized on NVIDIA hardware. The same thing will happen if you deploy some specific Intel products. Intel will also deploy a side card container that is super knowledgeable in Intel infrastructure and it will communicate with NFD for Intel Fisher. That's why I'm asking, can you have NFD to discovery from one another and then deploy NFD to the discovery of a different pattern? Oh, yeah. The node Fisher rule is you can create cycles. If you define a rule that will create a label, you can have another rule that says once this label exists, do also this. So the node Fisher rule is you can stack on rules. We actually use that for confidential containers. We first define with a Fisher rule, we create some labels that if your node supports the cryptographic confidential stuff, and once those labels exist, we deploy another Fisher rule that reads those labels and does other stuff. Oh, hi. Being swapped to Kubernetes. Good use case for this. Figurations on nodes. Everything. So NFD will expose every kernel Fisher to the point that it will expose which kernel flags were used when booting the node. So everything is. As I said, unless you configure your cure net is not to expose certain things to a container, NFD is going to read it. But yeah, we try to keep it as big as possible and leave to the cure net is system administrator to restrain what goes and what doesn't go into the container. Oh. Very clearly the 60 seconds sounds very much like an infinity band subnet manager. And in slarm you would schedule jobs that's infinity band switch. Could you think of extending this to have Kubernetes scheduling on nodes which are close to each other than the fabric? Yes. Thank you for the question. Like a two step answer. At NVIDIA we also developed what we call the network operator that is basically an operator that you deployed in a cure net is closer. If you are using infinity band or any melanoc scar, it will configure your cluster for you. And it will deploy NFD to help him like himself do all of that. So the network operator will leverage NFD to set up your infinity band and all your network configuration. Now the topology of the nodes, that's something we have been working on. But yeah, we hope to have it this year in NFD because we first need to run like an MPI. We already have a proof of concept. And the proof of concept is running an MPI job. And this MPI job is going to help us identify the topology of the cluster so we can then create labels for like a wish node disclosure to wish node and you can then use that information. We actually, like if you reach to me after this, we publish a paper with Livermore University like two years ago where we did a proof of concept of this. And it works. But it's still like, we're still discussing how to implement it. But we can define the entire topology of the cluster using NFD. Yeah, thanks for the talk. Do you have like what misuses of NFD haunts you at night? Like did you see something, someone misusing it in a way that you didn't expect? Or do you have misuse cases that you came up by yourself? Maybe? Yeah. So the project was actually start by just Red Hat and Intel. It had grown bigger. We kept getting users from many, many projects using it. So it's not only just for like Intel and NVIDIA. We have banks using it. We have people that use their clusters as for like databases and things like that. And recently on the Slack channel, we got people that are using it for even for testing, right? And they are, you have a big cluster and you want to use taints and annotations to like create mini clusters inside your cluster. You can use NFD for that. So yeah, it's not just for a specialized hardware, but we more and more have seen people using it for even software features. But yeah, it's... Oh, yeah, NFD currently is being deployed in thousands of Kubernetes clusters today. It's a very well used Kubernetes project. And being a maintainer to that is a good responsibility. I would like to ask if the labors are standardized. So I'm wondering if I have this deployment in one cluster, then okay, if I have this deployment in two different clusters, then can I move my application from one to another transparently or do I have to change everything in the deployment side? Since pretty much three years ago, the labels haven't changed. So I think it's very small, but you can read that every label is going to be fissure.node.kernet is that I O slash your fissure, right? Like CPU or system kernel. Yeah, this list is pretty much infinite. The other thing that we are also trying to keep very stable is the sidecar APIs. So when NVIDIA going back to the example of the GPO operator communicates with NFD to create the labels, the label is reading NVIDIA.com slash something. And when Intel creates a label, it's Intel.fissure.io slash. So it's like we try to keep the same label so we can guarantee users that they can move from one cluster to another. Just redeploy NFD and everything will work back again. Okay, thanks. Yeah, we try to not touch the labels. In the last release of NFD, we added a fissure where you can define your predefined prefix, but it's at your own risk, right? Any more questions? No? Okay, sign speaker. All right, time to go. Yeah, I am. I'm just showing up. All time. Taking over the entire, taking over another person's talk. Yeah. Yeah. Yeah. Yeah. Yeah. So I thought I would open this talk by putting two words on the slide that are either going to make you very angry or very anxious. Those words are cloud and HPC. So probably the question on everyone's mind is what does the future look like? I'm going to answer this question by posing a question back to you. Where is the money going? Okay, I think I can still, oh, zoom died. He died there for a second. Yeah. So I. Okay, we'll do what we can. It's some. I'm surprised. Don't have a full room. So that's. Yeah. Yes. Yeah. Yeah. Yeah. Yeah. Hello. Hello. Who's going to get left behind? We can look at a paper from read Dan and Dagar from 2023 that identified some really interesting trends. Okay. Okay. Now we have Kevin who's going to tell us more.
How the Kubernetes Community is Improving Kubernetes for HPC/AI/ML Workloads
about Kubernetes and HBC and AI. Hello everyone. Yeah, so today I'm going to be talking to you about what the Kubernetes community is doing to improve batch workloads in general. So just a brief background about who I am. I work as a senior software engineer at Red Hat. I'm a big upstream developer in Kubernetes and OpenShift. At Red Hat I focus mostly on Cryo and KubeLit now, but I also dabble where I'm also a reviewer in the job area in Kubernetes and a project I'll talk about also called JobSet. I was a maintainer of a batch project called Armada, which was for running batch jobs across multiple Kubernetes clusters. And generally I actually started my Kubernetes experience by trying to run, trying to build a platform that could run jobs on Slurm and Kubernetes. So I kind of liked the Kubernetes aspect a little bit better in some ways, but the Slurm scheduler was a lot more easier to use in Kubernetes. But I think I saw a gap in Kubernetes and I've been kind of helping try to contribute since. So just to give a little outline, I'm going to kind of give a historical perspective about Kubernetes and how it developed and why we're in this area that we are now. I will not really be talking too much about how best to get the most performance out of your cloud vendor or what other things you need to do to get Kubernetes. I'm going to be kind of focusing on the APIs that users could use in Kubernetes. So this is my couple slides of what is Kubernetes. It's pretty complicated. But generally I've noticed that when people start using Kubernetes as a library, I like to kind of think of it as sort of a react, but for distributed systems. So you're kind of using all the Kubernetes client libraries, you're using the APIs, you're composing custom resources on top of objects and exposing them to your customers. That's kind of where I've seen a lot of companies start using Kubernetes, especially when you're trying to build like a quote-unquote Kubernetes native platform. So what does that mean really for most people? Well generally I think the benefit for in this community is you have declarative API for workloads. If you're running on the cloud, failures happen, it sucks, but it does. And a lot of times your users also don't want to be told, oh yeah, you had a network failure so your job failed. Sorry, restart it. And a lot of users are pesky and they ask more and more of you as time goes on. We all know this. So and also for better or for worse, everything starts with YAML. You take it with what you want. But generally what that really means is that we have a big focus in Kubernetes on what is your API, backwards compatibility, most of the time, and also how to make it useful for people. So generally a Kubernetes cluster has not too many components, but I want to try to focus a little bit on a couple of components for this talk. So generally you have the API server which everyone talks to, CLI, whatever. NCD is your database essentially for storing all your objects in Kubernetes. The scheduler is an interesting component because it's, I think, the hardest thing for the HVC community to kind of grasp with the Kubernetes scheduler versus Slurm is Kubernetes is a scheduler focus for the node. You don't get as much fine-grained control in a slur, you get a lot more control in a slurm scheduler than you would in Kubernetes because slurm can actually target like, I don't know, sockets and everything on a node. It's much more fine-grained than Kubernetes. So I like to think of the Kubernetes scheduler as kind of a heat-seeking missile for a node. You give it hints and it just, it targets it and then your pod is on a node. So in the node, what is actually on a node? Well, there's this thing called KubeLit which talks to the container runtime and actually I will talk about that next slide. So the point of KubeLit is to actually start a pod, but I want to walk through what actually happens with a pod. Like this is, you know, step one, a user creates a pod that's a workload and it goes to the API server, the API server stores it in that CD and then the scheduler says, oh, you don't have a node specified on your pod. Okay, let me do a little scheduling loop, finding a node. And then once it's, once your pod is located on a node, KubeLit will pick it up and actually start running it and if you're running a batch job, it will run into completion. If you're running a microservices, it's just there and it keeps running. And KubeLit actually talks to a container runtime and the host. KubeLit also handles a lot of stuff with volumes. It's a pretty, it does a lot. So now you saw the pod lifecycle and I'll be honest, my first time using Kubernetes, I was like, deployment, stateful sets, this is so complicated. I'm just going to use a pod. Unfortunately, I learned pretty early on that you kind of lose a lot of the benefits of Kubernetes if you're using pods directly. Pods are stateless, so if your node goes down, you essentially lose your pod. And a lot of times if your cluster is overworked, you're actually going to lose, you, well, not overworked, but your pods will get deleted after a while. You also don't get self-healing. That is an important part of Kubernetes, even in, I think, the batch community. It just means that when you define an API, things are going to keep running and if you have, like, a job, you are going to keep retrying, is one example. The more pragmatic thing is the pod API fits the need in both microservices and the batch area, and you cannot really change it for one area, not the other. So generally, I don't recommend people using learning stuff that people like. Unicorn is actually, it's more popular in Spark community. It's trying to bring the yarn scheduler to Kubernetes by replacing or by adding a separate scheduler. And then MCAT is a project from IBM around trying to deploy arbitrary objects to multiple Kubernetes clusters and adding its own queuing. So now, what does this mean when you have all these projects? Well, you have chaos. You have Kubeflow, I'll pick on Kubeflow a little bit. I only have two machine learning frameworks, but from the last I checked, there's like six different APIs for representing a machine learning job in Kubeflow. And that means that there is a lot of APIs for running a batch job from Kubeflow. They are trying to consolidate most of them into a single one called a training operator. Still, you have a new API. You have two versions of running MPI jobs on Kubeflow. Now, it isn't as, I actually don't know if that MPI operator fits for all the use cases that people can give with MPI, but it is, as far as I know, the only public open-source way of running MPI on Kubernetes. And you also have things from Armada and Volcano that have their own representation of jobs. Well, this is honestly pretty chaotic. It's not really fun as a developer to be told, like, you know, how many, like if people want to bring a new API, can you support them? And you say no, because we don't really want to install all of Kubeflow just so you could run a PyTorch job or whatever, or install the controller. And it gets kind of complicated. So this group was founded, it's like a working group in the Kubernetes community. Batch workloads run the full gamut on Kubernetes from the scheduling all the way to the node to some representation of the batch APIs. So they actually had to form a working group to kind of coordinate, not really have to, but it's kind of a way to sort of allow you to focus multiple people on a single area and try to improve it. And some of the goals of this group are, let's make the batch API useful again. Let's allow people to actually use these APIs without having to install something like Kubeflow or Volcano to run a batch job. And also, the other one I'll talk about is queuing. Carlos over there could probably talk to you all about DRA, which is another exciting area that's happening, and that's about getting more use out of the GPUs, and that is in scope of this group, but that is actually mostly led by NVIDIA and Intel right now. And I'll be focusing on the two bullet points for the rest of this talk. So what is the job API? Well, this is generally a pretty simple way of representing a batch job, and I think that's one of the downsides of it, is that it was really focused originally on kind of simple use cases. I have an example here of computing Pi, and I'll just walk through the API so you'll see it kind of repeated again and again. So generally, Kubernetes has this concept where you define a template and you define a replica. And the job API that's called parallelism, and that just means how many pods do you want running in parallel? So the first thing that I want to talk about with this group is how many of these do you want to actually are complete before you consider my job successful? Active deadline is just how long the job takes to run, and then back off limit is retry. It's kind of how the job gets some self-healing, if you will, because it just says if the job fails for any reason, I want to retry, in this case, up to the back off limit, or the default is six. And one of the first features that this group added is a pod failure policy. It's essentially a way to kind of short-circuit the retry limit, because let's say your user has a segmentation fault and they're using a GPU. You probably don't want them to be using that resource when other people could be using it, and you probably don't want to keep retrying. And there's no limit on these retries, so someone could say 10,000 retries and kind of be on that node forever or whatever. So generally, that pod failure policy was kind of a way to short-circuit that. Now, how do we actually make the job API useful for workloads that need to talk to one another, which is pretty much most of the most interesting use cases in the HPC world? Well, this is kind of this idea of an index job, is can we provide a static name and environment variable so that the applications can actually refer to a replica of a pod and not have to worry about not being able to communicate to it and not be able to say, you know, my replica zero, that's my index zero, is always going to be this, and so then you can kind of talk to it. So you could think of this as sort of being a common way in like an MPI area where you have maybe like a rank zero pod and you have a series of workers and you probably want to make sure you have a rank zero. And that's kind of the idea of an index job. Now, I should wish I would have shown a slide here, but when you couple an index job and a headless service in Kubernetes speak, you're actually able to get all these pods to talk to one another. So when the last area is if you're trying to build queuing in Kubernetes, you kind of run into this problem where this pod lifecycle, I like to kind of joke, the way I envision this lifecycle is it's kind of like a racehorse. Once you create the pod, it's just, it's running and it's never going to stop. And effectively, the why this can take down a cluster is because if you have a million of these things running, it's just an infinite loop and it's going to kind of drain all your resources of your cluster. But you still need to know kind of how many objects are being created, but you also do not want when you create the object to start this loop. So this was kind of this idea of suspend in the Kubernetes community, adding suspend to our essential queue supports, a wide range of jobs via this use of suspend. So kubere, all the kubeflow operators, a project I'll talk about next called jobset, job, and then another project called a flux, which is, I don't know what I'm going to add, but, and so this is kind of a nice thing that queue provides. So what do you do about representing a more complicated job? Well, generally the job API is only, is, you kind of have to have the same pod definition for all of your workloads. And that may not fit for a lot of use cases. So the job set was kind of created as this way to say, can we create a representation of a single job that could have maybe different pod templates and then also have its own kind of failure and success policies. So when you run these jobs at large scale, you're going to see failures and you may want to restart some jobs in case, or maybe you don't want to restart, or you want to, and I'll talk about one interesting use case of success policies. And one of our goals is Kubernetes is kind of an implementation detail. Most people don't want to know about it if you're a researcher, you just want to know I'm running this. So we kind of want to streamline the creation of stuff like index job and headless services, because we know people want to communicate with their pods. And so at a high level, the API for a job set looks very close to a job to a pod. Instead of replicating pods, we are replicating jobs. And I didn't have it specified here, but there's a replica field under the spec, which says how many replicas of my replicated job I want to create. And then inside of the inside of a replicated job is a job template. And so this job is a PyTorch job. It creates an index job with a headless service, and then it creates a single job that has four pods. And I'll show in a little demo why this is useful. And the other area that we've actually gotten quite a bit, it's one of both Volcano and Kubeflow have implemented this in their projects, is one of the main reasons why they kind of created these projects, is what do you do if you kind of have this leader worker paradigm, where your leader, let's say, is a Redis database and your workers are talking to it, or whatever, you know, a message queue. Well, I want my workers just to finish. Like, I want to say, hey, once my workers are done, my job is successful and I don't really care about the progress of the leader. And so this is kind of one of the use cases we had in mind with this project, or not, there's a lot of them, but this was one, like, can we use something called a success policy to say, I only really care about one set of jobs completion, the rest are fodder, essentially, or not fodder, but they play an important role until the workers are done, and then they're also taken down. So, how am I doing on time? Okay, so I'll walk through the demo a little bit. So generally, with a job set, you have this controller, a job set controller manager. Right now, you can check it's running, great. And I kind of, in this demo, I tried to take the PyTorch job and kind of show the template, and then try to run it as just a normal job and kind of show you what happens. You can't communicate to the service, because if you try to create this job normally, there is no service for the communicate with, and it just automatically fails. So then, what do you do? Well, you can use job set. Woo-hoo. And so, I already created the job set, and you can kind of see that with the kube control logs, I'm actually able to, the job set is running, it's doing training, using PyTorch. And also, I created a headless service called a PyTorch that's there. And so, this allows you to kind of hide all this stuff from the user. And then, I think in the next part of the demo, I'll show the success policy. Come on. Oh, well. So, I guess, I mean, it will go on for a little bit, but does anyone have any questions? Any questions? There's a couple up there. Wait, wait, wait. Who was first? Hi. Yeah, I'm very much from the Slurrem bioinformatics snake make next-flow world. So, and we have an IT department, and they have a Kubernetes cluster, so this is very interesting talk to me. But are you thinking about these kind of workflow managers that typical researchers like that use, because I was just in a high-energy physics session, they also use snake make, and they have schedulers, of course, but somehow that also has to interface. Do you have any comments on that? So, generally, we don't want to get into the... We don't want to add another workflow engine, and there's too many of them, but what I kind of view the job set is like a single node of a DAG, and one of our goals could be this, like either this job or a job set could be added to something like Airflow or Argo workflows. It's another example to kind of be like, this is a single element that you could run, rather than having, like, Argo has their own way of representing, like, what they actually run on Kubernetes, which is, you know, fine for pods, Airflow is also pods. There are a lot of other workflow engines out there. I've actually... We took a lot of inspiration and two jobs ago for me in applying bioinformatics, some of their workflow languages, and trying to get... Trying to standardize a workflow language so we could actually run across different environments. And so I'm familiar with the area, but we're trying not to be a workflow engine for this project. Thank you for the talk. I noticed that a lot of the things you were talking about seemed to play kind of in the same field, sort of where the Slurm plays. So, I don't know, a few years down the road, do you see Slurm kind of giving way to, you know, this Kubernetes-based infrastructure, or do you think they're targeting kind of different tasks, and Slurm will always have its place? That's a really good question. I was not at KubeCon North America this year, but I heard of a company called CoreWeave that was actually collaborating with SchedumD to try and kind of provide Slurm on Kubernetes. From what I understand, kind of using the Slurm scheduler, but also allowing people to run some of the more popular Kubernetes stuff, like have Kubernetes for services or Slurm for batch. Generally, everyone is kind of converging in this area. Our motto is actually taking from inspiration of HT Condor and trying to apply that to Kubernetes. And then I know that the... Sorry, I'm pulling a blank. The University of Wisconsin, who kind of created HT Condor, they're big on trying to actually use Kubernetes for a lot of some of their infrastructure also. But, and also, we do talk pretty closely with the SchedumD folks, at least in my last role, and there is a lot of interest in trying to bring Kubernetes to Slurm. And part of it is Slurm has been around a long time, and so they had to do a lot of work to just even to get in the fact of, I want to containerize Slurm in Kubernetes. Okay, great. Now, do I want to schedule a pod, or do I want to schedule a single container? And that's kind of where I can see... That's also what's kind of challenging, and the other thing is convincing more and more people to use containers, because it's great, but it's also a pain to change everything that you want to go to a container. Okay. Any more questions? So, if I understand it correctly, you're primarily optimizing that I do not schedule 10,000 pods, and then have job sets, right? Because when I think about batch processing, I do think about, let's say, CI, and then we are running like 5,000 jobs per day, and we do this with Jenkins, which actually works super great with Kubernetes plugin, but I'm not seeing enough features on this proposal to get rid of Jenkins or any other components. I'm primarily seeing a way of not overloading the cluster with pending pods. Is that right? No, I would say the main thing is trying, if you want to say run a PyTorch job, the one option is let's create, let's use Kubeflow. Fine, that will work. But what if I don't really want to use Kubeflow? What if I have my own representation? What if I want to add my own...
Kubernetes and HPC: Bare-metal bros
Okay, this is going to be interesting. We are relying on the Wi-Fi bit here as well. So it would actually help if you turn off your Wi-Fi. I know that's a big ask. Consider that for the next half an hour. That would be really helpful. So Vanessa is live here through a video call. Give us away, Vanessa. We can... Well, can you try speaking? What's up, folks? Sorry, I'm not working. Is that working? Try again? Still what's up, son of them? Okay, that's really better. Nice. So we'll start your recording, Vanessa, and then we'll try and do live Q&A at the end. Sounds good. I have some answers for the previous Q&A, too, so we can talk a little bit about that. We can try. We can try. By the way, Vanessa is also the one who designed the HPC social logo. So you should thank her for that and take some stickers when you leave. Thank you. Thank you. All right, here comes the talk. Hi, folks. I'm Vanessa Socket, and today we're going to be talking about Kubernetes and HPC, the bare metal bros. So I thought I would open this talk by putting two words on the slide and then I'll go to the next question. So, what is the question that you guys have been asking or very anxious? Those words are cloud and HPC. So probably the question on everyone's mind is what does the future look like? I'm going to answer this question by posing a question back to you. Where is the money going? So we can look at polls from Gartner and Hyperion Research that suggests that cloud is projected to reach $40 billion by 2026 with a smaller CGR of 6.4%. So very superficially speaking, the money is going to cloud. Now, we can also then follow up on this question like, okay, that's great, but who's going to get left behind? We can look at a paper from Reed Gannon and Degar from 2023 that identified some really interesting trends. For HPC, it suggested that the way that we design our system will not be a problem because we're not going to be able to design our system will not continue to work. We cannot depend on dentered scaling and Moore's law. There's increasing rising costs for improved semiconductors. This is going to make it harder and increasingly more expensive and laborious to deploy new systems. And they define something called NREs or Non-Reoccurring Engineering Costs that we are incurring for every new system. Now, cloud, on the other hand, is leading the space of innovation. As we know, there's this massive expansion of large-scale commercial clouds. They are not depending on software vendors or hardware vendors. They're making their own stuff in-house. And guess what? They're hiring away and attracting the talent pool. And they made a really interesting analogy with temperature. They described HPC at endothermic requiring the absorption of heat for survival. And cloud is exothermic and really giving off of heat. And we know that, folks, we're not talking about heat here. We are talking about money. But to continue the heat analogy, you'll know that if you've ever been out in the snow in a cold environment, you are much more likely going to be wanting to give off heat to survive. So who gets left behind? Well, the person that needs to constantly absorb heat that's probably going to run out is the person that needs to absorb heat. And that's the reason that we're all here. It's because we need to ensure that the needs of our science are represented in this new environment. And guess what? The success of our science, the reason that we're all here, really depends on our ability to be collaborative in this space. And so this is really kind of the manifesto of Converge Computing. So if we bring them together, we get this new technology space where we have the best of both worlds. So where do we start? Well, here is how the talk is going to proceed today. We're going to start with models for convergence, talking about patterns for bringing together traditionally disparate environments. We're then going to move into strategies for convergence. So designs that I've noticed allow for easy movement between the spaces. So let's start with those models for convergence. Now, if you've looked in paper land, you've probably seen many different models. There's many different ways to take HPC and cloud and put them together. I'm going to talk about the high-level patterns and from the perspective of someone that's maybe deploying a system. So let's say that's me, and let's say I want my cloud and HPC, I'm going to take my limited set of resources and I'm going to try to split them into two steps. So I spend a ton of money and I do this, and then, I chose poorly. No one's using half my resources, and oh my god, so four years later I come back and I'm like, all right, I want cloud, X or HPC exclusive or HPC. I understand I can't have my cake and you to choose, so I am just going to choose one. We've used HPC for all these years, red and butter, this is why you've always done things. I choose HPC. Great, six months later, someone comes into my office. Are we dinosaurs? You know, everyone over there is using YAML and automation and we have this old setup and ah, so you go back in your office, you contemplate your life choice and you're like, oh right, no, it's okay, I'm not going to wait another four years. I'm going to sneak it in. So this is where you see all of these ideas, like bursting, multi-cluster, and these are generally referring to this idea of having some home base of resources and reaching out to get more. And the problem with this approach as I see it is that the complexity of these approaches often reflects the complexity of the systems. So they tend to be snowflake, they tend to be complex, and this is why there hasn't been like a single leader that has emerged in the space. So here is a different idea that's less common because it doesn't superficially kind of make sense. I want cloud or HPC, meaning I want to be able to run HPC, or cloud, or at the same time, or something together that's more converged, like what the heck am I talking about, don't I? Am I talking about, don't worry, we'll talk about it. Let's first talk about strategies for convergence. So these strategies I need to point out, these are not just about the technology, they are also about the people which is often harder. The first is common goals. In order to get two different communities working together, they have to care about the same things. You can't get around that. The second is modularity. So the degree to which your application or infrastructure can be modular, is that you can use things interchangeably and swap them, be very creative. The third is integration. This is consumption of an entire thing in another thing by way of different strategies. So let me give you some examples. For goals, the best overlap of goals I've seen is with respect to batch workloads. So a few years ago, the Kubernetes community started the batch working group, and this was because this new need to have AI ML workloads in Kubernetes. Traditionally, Kubernetes is where you run services, you keep something running. And there wasn't this concept of starting something and having it complete, but all of a sudden there was this new need, and guess what? We have been doing that in HPC land for like a couple of decades now. Modularity, a really great example, is actually with Kubernetes and Flux Framework. So you may think of Flux as just like this workload manager, but actually it's called a framework because we assemble many different components together to assemble into the workload manager known as Flux. Kubernetes is the same, different set of components, and there is going to be a creative way that we can kind of use these interchangeably. So the final example of integration, the best technologies I can provide are containers and language bindings. Container technologies are literally this vehicle to let you move between spaces, and language bindings are going to let you take it traditionally like C++ HPC project and extend it into a language that is native to the language and extend it into a language that is native to cloud. So for example, Go. Alrighty, let's get into some examples just like eggs three ways. Here are some projects that we've actually been working on at the lab. The first is Fluids. As I alluded to, this is the Flux scheduler, swapped with Coop scheduler. The next is the Flux operator, the entirety of Flux Framework implemented inside of Kubernetes. And then the namesake of this talk about air battle grows, Flux and Kubernetes working side by side. So let's start with the Flux scheduler within Kubernetes. You may be familiar with Kubernetes when you launch a job. You ask for a certain number of resources that's given to the scheduler. The scheduler says, okay, here are four pods. Have a nice day. So what we're going to do is bring in Fluents. So our C++ package, FluxSched, that is mapped with Go bindings into a custom scheduler plugin. We're going to swap it. And so you're basically going to be asking for the same amount of resources, but the scheduling is going to be done by FluxSched. How does this do? Well, we find that the workflows run three times faster. So what you're seeing here is Coop scheduler on the top, Fluents on the bottom. You see a lot of randomness with respect to how Coop scheduler places jobs. What this leads to is a pathological scheduling pattern. So anywhere you see a red box on there, that is a startup delay. And what that means in practice is though, is that although the workloads themselves run in similar times, we have a lot of outliers. We have a lot of jobs that take a really long time to get started. And so Fluents improves upon us. So Fluents is a really great example of modularity because we're taking an HPC technology and we're literally swapping it. And the modularity of the software allows for that. It's also a great example of integration. Because we have those Go bindings, we can speak the language of the cloud need of communities. Alrighty, next project, the Flex Operator. Super cool. All the gophers in Flexland are pretty cool. Alright, so the Flex Operator is implementing the entirety of Flex framework inside of Kubernetes, your own HPC cluster. This happens by way of a custom resource definition of CRD, where you basically give all the parameters that you want for your cluster, whether that's a single job or whether you want an interactive cluster. This creates what we call the mini cluster, which, you know, Flex Operator is a mini cluster, which, you know, Flux doesn't know the difference that it's running in Kubernetes versus on bare metal. There's a lead broker that's connected to several follower brokers. So here you have one pod for one physical node. The tree based overlay network within each pod or node, you have Flux that's added on the fly to your application. And the Operator is just going to basically reconcile until the state that you need for your cluster matches the actual state of the cluster. How well does it do? We added it to the best in the space last year. The MPI Operator and the Flux Operator consistently outperformed the MPI Operator we believe because of the 0MQ bootstrap. So the Flux Operator is a beautiful example of integration because we're taking the entirety of Flux framework and implementing it inside of Kubernetes. Bro, bro, bro, is it time for the bare metal bro? Yeah! Okay, so, warning. I've been saying bare metal, but nobody's going to give me bare metal. Let's be frank about that. So I was using virtual machine. We're using virtual machine as a proxy for bare metal. So just a warning. So what's different about this picture? The orange is on the outside. So we actually have Flux framework on the outside spinning up a Kubernetes cluster and notice that we actually still have compute running on bare metal alongside Kubernetes. How's that possible? Don't worry, I'll tell you. So why do we need this in the first place? As you know, also, there are increasingly more complex heterogeneous workloads that are coming to HPC. So this means not just, you know, embarrassingly parallel stuff, but also adding in services, databases, task queues. Ah! Okay, so I was... This slide is not wrong. I was going to give you an example of such a workload, and apparently this slide is giving you this warning that I'm a bad scientist and I'm not wrong, but I will point out that my example is actually a very good example that is a prototype for this kind of design. Let's talk about that. So let's say that we're running simulations. We're training examples one through N, whatever, doesn't matter, and we want to send them to a machine learning server, a specific endpoint to do the training. We then want to wait till some metric of goodness or perhaps a number of samples, and then we want to flip it around. We want to run simulations again, but we want to instead give this to our machine learning server without the actual values, then we're going to have a vector of the true values and the predictions, and we're going to see how well we did. Now, very superficially, if we match this to HPC versus Kubernetes, this is how we do it. We would expect that the simulations would run better on bare metal, and the service thing would run better in user netties or Kubernetes. This is way to be... We need to prove to ourselves first. So a lot of you are probably out there like, user net, like, Kubernetes? Like, in user things, are you nuts? I'm not nuts. There's actually something called user netties. It's a Kubernetes enhancement proposal or CUP proposal in 2022 by a very talented developer named Akihiro Sudo. Akihiro must point out won the top maintainer award for KUKON last year. He's an incredibly talented developer. If you've used any of these technologies, he's the one behind it. Hats off to Akihiro. So last year, at the beginning of the year, user netties was really a hodgepodge with kind of bash grips. It was really hard to use. So I engaged with Akihiro and we released Generation 2 of user netties in September. And guess what? It is using containerization, which is really great. It has these components that we'll go into in more detail. So what does it mean in practice? Well, it means when you're building a virtual machine, you need to have C groups version to enable. I recommend LIMA or Linux virtual machines if you're prototyping this for the first time. It also means that you need to enable these kernel modules. So very generally speaking, the RNet filter is going to allow you to apply IP tables, rules, bridge traffic. VXLan is going to allow you to connect VXLan devices on different hosts to a standalone bridge. This is important because we actually have different physical nodes. Now it's going to use RULE stocker. This isn't such a crazy idea anymore. Many clusters have podmin these days. And so what does it mean? Actually, when you bring out these VMs, it means that you're going to run a make up command that has two contexts. So both of them are going to build and start a base image that is using kind, kubernetes, and Docker with CNI plugins. And then the two contexts are the control plane and the worker. The control plane is going to install Flano, run kubernetes, and admit. This makes a joint command which is basically a token that you give to the workers, and then the togers can authenticate and join the cluster. And so that's what they do. They're just like, I'm ready to serve. All right, so we created this garbage cluster small and mighty using Overt and Ansible. It is small and mighty because each has eight cores and 30 MBs RAM and a 10-NVVD iterate. And I want to point out that we have seven nodes here because generally speaking, we're going to have six that we run things with compute on and one's going to be an admin node or control plane. Again, warning, not bare metal, you get the deal. All right, so what's in these VMs when we bring them up? We have a complete system install a flux, singularity on bare metal for reasons I'll tell you a little bit. Lamps installed on bare metal and of course user netties ready to be brought up. So once I shell into these VMs, my flux cluster is ready to go. I can do flux resource list and I can see all my nodes. And user netties, again, that administrative node is also a control plane. So we technically have six nodes to work with. And then we have a user netties. So we technically have six nodes to work with. And we can still see them with coop control get nodes. Here's what we're working with. User netties and flux running side by side the bare metal bros. All right, bro, bro, what experiments do we want to run all of them, bro? All right. So we first need to sanity check that what I said earlier about the bare metal and lamps and the simulations is actually true. We need to look at application performance between flux and user netties. So the way we're going to do that is by running a few things. We're first going to run lamps on bare metal with flux. We're then going to do the same thing but in a singularity container. And I did this just to demonstrate that you don't lose anything by using containers. Here's great. We're then going to run lamps in user netties with the flux operator. And then finally we're going to repeat cases one and two, but with user netties running in the background to look to see if there's any overhead of that. And I need to pause for a second because I know how incredibly cool this third case is. We have flux on the outside. Flux is running user netties. Within that we are launching the flux operator which is bringing up another instance of flux and inside there is where lamps is running. So folks, like I know Thanksgiving is over but this is the ultimate production. And we expect lamps to be slower in user netties because as we know it makes MPI collective calls. User netties are using something called SERP 4.NET NS which requires additional processing of packets with a tap device. I have a great paper I can share if you're interested in learning more about that. So drumroll the results as we expected the well actually maybe we didn't expect but guess what the bare metal case is the singularity container is very comparable to actual bare metal. I was very surprised by this. So user netties does not add a lot of overhead. And this is what we'd expected that guy up there running in user netties is about twice as slow as running on bare metal. So what did we learn? Well, we learned that for a setup like this the network sensitive stuff probably should be run on the HPC. But I'll point out there's opportunity for improving this in user netties. If you have experience with networking I'd like you to go over to the GitHub right now and I'm just going to wait a lot for the talk and engage with that to hear it to work on this problem. Now the next thing we want to look at is distributed machine learning specifically two cases one distributed to across six nodes and then the second on one node so the distributed case network is a variable and for the one node obviously network is not a variable. Drum roll results same thing it's about twice as fast on bare metal or twice as slow I guess on user netties. And interestingly when you look at just a single node these are really comparable so there's no issue with running something on a single node in user netties in and of itself it's really when you bring in the networking that it becomes a variable. So it's a network right well let's sanity check one more thing here's I per thing we did one bit of transfer for each node as a client to each node as a server we see bit rate and give you bits per second is between 10 and 30 for bare metal user netties with like non detectable closest here are really really terrible we can look so we can see the same patterns for transfer gigabits per second and so yes it's the network we're pretty confident for the setup it's the network. All right can we do the fun workflow now we absolutely can so guess what I actually prototyped this kind of workflow because I was really excited about it and so what we're going to do is we're going to be launching a batch job with flux batch this means the flux instance that's only by the running user it's going to scope resources using hw lock in this backshot where we can basically bring up and tear down all of user netties. We're going to take that workflow that I mentioned before we're going to map it into our star track cluster space so we're going to run simulations with lamps randomly selecting the problem sizes predict well time we're then going to bring up a machine learning server a special server I made using river a few years ago and then we're going to basically do the test cases we're going to run lamps again but we're going to leave out the actual well time and we're going to ask our models what it is and we're going to do a thousand training samples and 250 testing samples. How do we do? I put no thought into these particular models but I did three kinds of regression the Bayesian and sampling from a probability distribution didn't do super well but for the first two there's an actual kind of pattern between the predicted and the actual time and so although I put no thought into this I was really pleased with this result to see that the general prototype this idea of having bare middle simulations running alongside a service there is something here we can do science this way with actual real scientific questions and I'll point out that there are real heterogeneous workloads out in the wild and you this capability here's Moomi the massively parallel multi-stale machine learn model infrastructure and this is basically simulating biological systems the interact between proteins and plasma membrane I'll also point out that the Moomins are what it's based on the name the finished book comic book series with really cute hippos with often yellow spiky hair very awesome so this is the perfect example the bare metal rows of coexistence adopting technologies to make it possible to go to coexist and continuing to improve upon them so that for example with networking this environment can get even better so what should you remember from this talk if you take nothing else away the first is looking out for opportunities for collaboration look for that alignment of goals between spaces that's an opportunity the second is providing handles for your components so you don't have the bandwidth to look for opportunities add some go bindings to C++ projects because someone else could find you the third is engagement we need to show up at the table we need to go to working groups, conferences places that you haven't traditionally been to engage in to find these opportunities for collaboration and possibly the most important is this mindset we've had this mindset of cloud versus HPC that one has to win but they're different for so long we need to throw that away and get rid of the adversarial thinking and have a more collaborative mindset this is the vision that we have for the future for converge computing and we hope that you like to join us so thank you that's how to reach me my email and social networks and here's some interesting links for the flux and the various projects I think I will take some questions virtually now okay we can take a couple of questions it seems like the wifi is stable enough to let Vanessa answer them do we have any questions okay so Vanessa we may have to repeat a question for you we'll see how that works hi Vanessa amazing talk congrats so I was wondering if your architecture can support sidecars because one of the nightmares I had when I was trying to do something similar was that in order to get the sidecars running I had to spin up a second network stack and that created a lot of overhead so no no just one is on okay did you get the question Vanessa no I didn't hear the question at all neither did I yeah maybe that's better okay let's do it like this you'll come up front and ask it here yeah that's perfect that'd be great I can hear you great hi there hi so I was wondering if your architecture can support sidecar containers because as I was saying when I was trying to do something similar when I tried to create the sidecars I had to create a second network stack within singularity so the network overhead was amazingly high so absolutely a flux operator actually uses a sidecar container on a net container which is similar in concept to add flux on the fly as a view what's going on in Kubernetes is sort of a different thing than the networking issue so the short answer is yes to kind of add to that though I'm not sure that singularity and Kubernetes singularity as the container runtime for Kubernetes would work I have never tried that but it doesn't sound like it would work yeah it needs to be done yeah exactly hi Vanessa thank you hi it was the most fun presentation on the post then so far thank you so when you were saying that the main difference between performance between EBM and bare metal workloads was related to network was that the case also for distributed training and if that's the case were you using infini band or not so this we did not have infini band and you make a really good point that this kind of setup would need to be tested with an actually great network and that is still a very big challenge even for cloud so for example if you use AWS you can bring the elastic fiber adapter which will give you great networking performance but if you go to other clouds and I don't have to name this specifically you tend to only get really good networks when it comes to using like TPUs or GPUs the exception though is Azure which has a lot of really great HPC stuff kind of built in so absolutely you can get that setup with infini band Hi thank you for your talk I had a smile on my face the whole time thank you for having such high energy at the end of the day what was I going to say oh yeah so probably in my workloads I can reduce the network traffic by a very large margin if I can constrain certain jobs to specific nodes because then large files don't have to be moved for certain jobs to across the network is that something that you could keep in mind so if you remember the very quick machine learning experiment that we showed when we're running something on one node and you're not using the network there's no issue so if you're just running something on one node in user netties you won't have an issue in a degree to which you can reduce anything that uses network so moving data MPI etc etc you will get similar performance at least from this small prototype experiment that we've seen as you would on bare metal I have to do this because it wasn't really bare metal thanks one more question hey Vanessa that's Danny I'm gonna die my hair soon so you won't recognize me again I really liked your framing actually I thought I was going to sort of being adversarial and then I actually realized what you were saying and I really appreciated it however though regarding the adversarial framing I have some experience with for example cloud tools and cloud environments being used as platforms for vendor lock-in I think that you described especially with your converged computing kind of the way that you can push back against scientific labs aren't kind of in-depth to corporations I actually think that you kind of made a really useful example of one way to do that in your talk so again I actually was very very impressed by the way you kind of explained that I would like to know in the more general sense how can labs and potentially RSEs make use of cloud tools without getting locked in or becoming beholden again to a corporate environment and again by the way I think that you effectively did that in this talk so I'm more looking for a general kind of thought about that You're totally correct that vendor lock is an issue and when you tend to see many sort of niche APIs in different clouds and then you built your entire thing around them you do face that as an issue but the great thing about Kubernetes is that it is this open source project that is available across clouds there are subtle differences but if you make a workload that can run on Kubernetes you're going to have an easier time to move it between clouds and that's you know speaking from my lab we work on flux framework and one of our goals with flux is to make things portable not just between clouds but between cloud and HPC that's also something like user netties running actually Kubernetes on bare metal alongside HPC is so important because all of a sudden you have the same workload and it runs in all the places that is sort of like the vision we don't we want to make sure that the scientific workloads that we're running today can run in all places not just to one niche specific cloud not just one niche specific center just convergence TLDR that is very exciting and I really appreciate that response thank you so much okay that's all we have time for this workout great Vanessa I hope you agree yeah it was really fun if anyone has further questions and stuff please reach out to me I love chatting it was a pleasure chatting with you and I hope you have a great rest of your fun then thank you and the best way to reach out to Vanessa is via HPC social so don't forget to grab a sticker and you walk out please consider doing a small donation in the box as well to help cover the costs and if you're leaving please check if you see any trash around please take the trash with you bottles anything anything you clean up we don't have to clean up thanks a lot Vanessa this was great bye
Welcome to the Identity and Access Management devroom!
We are starting the second edition of the Identity and Access Management Dev Room. My name is Alexander Bakavoy, this is Iker Petroza. Formally we are the ones who organize this Dev Room. If you have any questions, anything, please talk to us. A guy in blue t-shirt will be the guy moderating a specific session. I'm not talking about him or Trevino specifically because this is a moving target. We have one t-shirt for people who will be moderating. I wanted to do a bit of history reminder. We had the first edition six years ago. It was, I think, a successful one. We got roughly the same amount of talks we will hear today. They were as much diverse and wide in topics. Also, we had quite a lot of people coming to listen to the point that at some talks we actually caught the FOSDMQ sickness. There were like 50 people in the room, there was a smaller room and hundreds of people waiting to get into the room. So, truly FOSDMQ sickness part that we enjoyed six years ago. I hope we will have enough space because this is twice the room, the first one. For this year, as you all know, this URI probably has a whole schedule. You can get access to all the things that I will just remind to speakers here that please upload slides. Using this new interface to pre-talks, but please upload them roughly like half an hour before your talk so that people who will be watching this live stream and they have some reference point and can see their... You can omit things from the slides that you want to make a surprise during the live presentation so that it's not spoiled. But it's typically good to have it uploaded. Since this is a smaller room and we don't have another mic, please, when you're talking and hearing some questions, please repeat the questions so that it's recorded in the mic. Finally, we have smaller slides, but please leave one or two minutes so that we can change to the next one and don't get the time taken from the next one. And since this is largely done by automation and volunteers, you will get an email from the video team that will give you a link with the details of your presentation recording. And you need to act on that. Preferably if you have time today or at most tomorrow so that they can re-encode the video and publish it. There will be an interface where you get to set where the talk starts and ends and maybe not in our rooms, we have one mic or maybe two. And in some other rooms they also have more mics and you can choose which audio you're taking. And once you set this off and signed off on that video, it gets re-encoded and published automatically on the schedule page. For all the people who missed your talk, they will be able to get the recording and how fast they will get it depends on you as a presenter. So, yeah, and now Iker does an overview. You forgot one thing regarding the video meeting that you need to make sure that you sound correctly. I forgot to say that yes, you need to check that you sound correctly in the video. Last year's talks. This one wasn't very good at the beginning. Yep. So, now it's to you. Okay, so my name is Iker. Well, this is my first post-dem. I hope you are enjoying it as much as I'm doing it. And, well, this Identity and Access Management Dev Room is about Identity and Access Management. We have several talks regarding passwordless. We also have multi-factor authentication, signals unknown, user federation. And, well, just in short, I hope you are enjoying it a lot. Leave space to the next one, do the next speaker to prepare everything. And we are all volunteers. So, if you find that something is not correct or just fix it or tell it to us so that we can, you know, try to fix it and have everything correct. I don't have much else to say, so thank you and have fun.
SpiceDB: mature, open source ReBAC
All right, so this is the talk on SpiceDB. Thanks everyone for showing up. So early in the morning, I'm starting to lose my voice because there's a long day yesterday of talking and meeting awesome people. This is my first FOSDOM. So who am I? My name is Jimmy Zalinski. I'm the co-founder of a company called OthZed, an OthZed-billed SpiceDB. Previously, I worked at Red Hat and CoralS. So I've been around in the container and Kubernetes ecosystem for a pretty long time, basically since the beginning. There, I'm actually a maintainer of OCI, which is the standard specification for Linux containers. And I've also started a bunch of projects in that space, notably Kubernetes operator framework and some others. This talk is entitled SpiceDB. But since FOSDOM is more of a developer community conference, I really wanted to focus less on this talk being a vendor pitch for SpiceDB, but actually more of a level set about the problems in the authorization space and the history and status quo of that. So that everyone understands what might be the best tool to solve their problems. I'm not going to try to sell you SpiceDB for all problems, because the more informed you are, the better you can pick the product that's actually going to complement your software stack and what you need. And that means there's going to be way more qualified people using SpiceDB and way more qualified people using other authorization tooling. Obviously, I'm the most jazzed about SpiceDB because I created it. So why are we all here? We're all here because there is a not-for-profit organization called OWASP, which is the Open Worldwide Application Security Project. And they kind of got started in the early 2000s. And they're famous for having this list called the Top 10. And the Top 10 is basically an enumeration of the highest risk, the highest threats for web security. And as of 2017, broken access control was number five. As of 2021, broken access control is number one. That means this is the biggest threat to the web and to all the applications running internet facing to the web. But really, the question is, how do we actually get to this point? Like, how did this happen? And how did it happen so quickly? I'm not going to point any fingers, but what I'm actually going to do is kind of dive into two different groups of stakeholders in kind of the history of authorization. There is kind of the academia and people publishing papers in this space, kind of defining concepts. And then there's the industry practitioners that are actually building the software and realizing these systems as they're actually connected to the web. I'm going to start with academia first. So on the right-hand side, you're going to see a timeline. And then on the left-hand side, there's going to be some notes. And then not for this slide, but you'll see QR codes in this corner as well. Those QR codes are going to link to the specific novel paper. So if you're interested in any of these particular concepts, you can feel free to scan the QR codes. But our history kind of of authorization is going to actually start in the 80s. And it kind of gets really kicked off with this publication of the Trusted Computer System Evaluation Criteria, which is a security practices book published by the US Department of Defense. And in it, it's outlining a lot of different security practices that are effectively a part of the military, the United States military. And in it, they kind of describe these two different access control systems, discretionary and mandatory. Now, discretionary is effectively just if you created the idea or the information, you can share it. And if you're then given access to that, you can share that. It's at your discretion. I kind of use file systems. And Google Docs is an example here. It's not a perfect one-to-one match. But if someone shares a file with you on a UNIX file system, you can copy that file if you have read access. And then you can change whatever permissions on that and share that, similarly with Google Docs. So it's at your discretion how you're going to share that information once you're given read access. Then there's mandatory access control, which is effectively a long list, an exhaustive list, of all the access for a particular thing. Most notably, people are kind of most familiar with SE Linux as the example of this. If you're unfamiliar with SE Linux, it's a way of locking down the Linux kernel. Honestly, it kind of comes with a negative connotation because mandatory access control is very verbose and very difficult to get right because you have to enumerate absolutely everything. Some people say that the three-letter agency at the US government that created this are the only people who actually know how to configure this correctly. I don't know if that's actually true or how many people use it. I know Red Hat is one of the folks that actually does promote SE Linux. But the one thing about this slide I really wanted to kind of drive home is these ideas, they're as old as the military and war itself. There's nothing novel about the 80s where these ideas got invented. But what actually happened was someone only actually ever thought to write this down in the 80s. So it took that long after using these ideas for many, many, many years. So we jump roughly 10 years, 9 years to 1992, which happens to also be the year I was born. That makes me feel relatively old. But in 1992, we get this paper published on role-based access control. And role-based access control, often called RBAC, is kind of where actually most people believe the state of the art for authorization systems is. The core idea is basically there is a group that is assigned access to a particular thing. And those groups are called roles. And then you map users into these roles. And by means of being in this role, you get access delegated to you. The kind of number one problem with RBAC is that everyone kind of defines it differently. If you build any enterprise software, you're going to talk to clients and they're going to ask you for RBAC. But the difference is if I look at two different enterprise applications, how they implement RBAC entirely differently. The only commonality is this mapping of users into groups that then have access. This is kind of going to be a recurring theme across all of these papers published in academia, anything with Starback, because they're documenting concepts, but not actually specifications that would give you an ultimately cohesive and designed and secure system. So kind of most famously, the biggest issue kind of with RBAC is that there really is no scope. If you say someone is an admin, does that mean they're an admin of the entire web app? Does that mean they're an admin of a particular resource in the app? You just don't know until you actually build it yourself. So there's not really an easy way to reason about these systems until you actually touch them. So we jump actually well into the future now into 2015. And now is when the paper on ABAC, which is attribute based access control, is written. Effectively, the idea behind ABAC is to kind of generalize on RBAC and say, the role that you're assigned, that is just one attribute that your user can have. And other attributes might be that you logged in with this IP address. Many other dynamic attributes can be assigned to you. The kind of really important thing about ABAC is it's providing this real time context. So now you can kind of write rules, like are they connecting from this country, this subnet, this time. You can delegate access at particular windows of time and kind of perform more logic on these attributes that folks have. And now we're going to take a huge digression back to 1965. So if you're unfamiliar, Multix is actually this operating system that was developed between MIT, GE, and Bell Labs. You might not remember it, but it actually inspired an operating system you're probably familiar with. Unix. So Unix is actually an attempt at making Multix concepts ported to less expensive hardware. Multix is often credited as the operating system, like the first operating system that has access control for the file system. I actually don't know if that's true, but it's often credited as that. So in Multix, you have a file system tree, so you get hierarchical structure. And then at every branch, which would be a file or a directory, you can have five different attributes assigned to that file. You get read, write, exact append. These are all kind of file operations that you'd be familiar with. But there's this fifth one that's super interesting called Trap, and that actually gives you the ability to do callbacks and to see functions. And it was initially designed so you could do file walking in user space. But kind of like the whole thing with Multix and the reason why I bring it up is because there was inheritance, there was aback, and there was user-defined functions in an authorization system. In 1965, when in academia, the ideas behind attributes were published in 2015. So there are systems using these concepts, but they maybe haven't been formalized and written down in the concrete form. And this is kind of like a huge issue with the whole space, because people are doing things, but they're not really studying how to make these systems robust with these ideas. They're kind of more just documenting these ideas ad hoc. So getting back to the normal timeline, we hit 2019. It's actually in 2007 that the term is coined relationship-based access control. And the idea behind this is actually that by establishing a chain of relationships, like Jimmy is a speaker at FOSDOM and speakers at FOSDOM have access to the FOSDOM speaker matrix chat. If you can follow these chains of relationships, you can actually go from Jimmy has access to the FOSDOM speaker room. So this term is kind of coined around then, and it's looking forward at what tech in the Web 2.0 era will look like. It's published initially while considering how Facebook, the social graph, works internally. So when you share photos on Facebook, you say, friends of friends can view this. You're literally defining it in terms of relationship to yourself. So we hit 2019, and actually that's when Google publishes a paper called Zanzibar, which is documenting an internal system at Google powered by these concepts. And the difference and the reason why I have 2019 for you back is because Google is documenting a concrete implementation of this. Unlike a lot of these other papers talking about concepts, it's talking about an application of these concepts and really giving you a framework for how to use this effectively and in a correct way across multiple products at Google. So then in 2021, SpiceDB is open source, which is also implementing the similar concepts to Zanzibar. And obviously, I'm going to get into that later. There are other models like Starbucks, but that's kind of like the primary ones that I see mostly in industry. You can dive into Wikipedia if you're interested in other ones. But now you've got kind of the industry side of things. We're leaving academia. And industry has this problem, which they go to build in a web application. And your first job is just build the MVP, the minimum viable product of your web application. So what you're going to do is do what you do with everything in a web application, which is store data in a database, probably the relational database you're using for everything else. And you're going to try to check if a user has particular access based on some data you store in the database. It might maybe going to be a role if you're inspired by RBAC. But maybe it's just a numeration of the list of users that can do a particular thing. So you may have written code that looks like this. But the problem is this falls over at some point in time, whether fundamentally you build a system that actually is just really slow, or you have to build a new system that is way faster than you ever intended it for it to be. Or you basically get users of your software that demand new functionality that is not actually possible for you to implement until you refactor your authorization code. So a great example of that is if they want recursive teams. So if you have groups of users, what if you have groups of groups? Or groups of groups of groups, right? That is something that most people cannot build, or they don't build in their initial MVP. And when you get functionality like that, you're forced to completely rewrite your authorization system. The other thing that could happen to you is your company buys another company, and they're based in a different continent. And that means all the requests for checking permissions now have to travel across an ocean if they want to be correct. That's a huge problem. And making sure that the performance is actually going to be viable, and the answers you're going to get for authorization questions are correct is a difficult problem. So you hit one of these kind of big issues, and then you kind of are forced to enter the cycle that I'm going to get into. But these numbers are kind of fudged. But the whole point is that if you take an engineer, probably with expertise in that web app, has worked on this authorization system, it's going to take them a while to implement this. It's going to be super sensitive because someone else is going to have to review it. That person is going to also have to be deeply embedded in that code base. They're going to be extraordinarily careful because any mistake that happens in this code base is going to be a CVE. It's giving access to people that shouldn't otherwise have access. So that's going to take a long time. Then you're going to do QA. You might actually have to perform a security audit before you can deploy this software because you're deploying to enterprise environments. And then you're also probably going to want to take extra time rolling out these changes into production. You probably don't want to deploy it to everyone all at once. You probably want to deploy to a minor subset just in case you find something wrong with the code. And all of this just takes time. And the problem is it's actually putting security of your software at odds with development velocity. Fundamentally, it's going to take you too long to add this functionality. And you're going to want to take shortcuts. But shortcuts are security flaws in your software. So then as rinse and repeat, you basically don't know how long until the pain is going to build up where you're forced to rewrite these authorization systems. And that is the mystery box entirely. You could finish or not even be finished rewriting your authorization system. And then all of a sudden, a new user sets some requirement for you. And you're doomed. You have to completely rewrite the thing you just thought you re-architected to be future proof. So how do we fix this? There's never ending cycle. And OAS themselves actually have recommendations for this. They say you should no longer adopt RBAC, but take concepts from A-BAC and RE-BAC. Obviously, I'm biased towards RE-BAC because I think it's a more modern approach to this. But the OAS folks also give you some high level benefits to why you would do something, like why you would adopt these new ones over RBAC. I'm going to just take this from the RE-BAC perspective. When you're doing a graph-like thing, a relationship-based system, you're forced to basically talk about individual entities. So this user, Jimmy, has access to this particular document. Because you're doing that, it has this kind of buzzword, fine-grained. You're not resolving Jimmy to a role or a group. You're actually following Jimmy directly to the document. So you're talking about individual entities in the system. So as a result, you get actually more fine-grained access. I'm not trying to generalize about any users or paint over anything. I'm actually talking about the exact objects I care about. And that means you can actually develop systems where you delegate access to a particular row in a database or a cell in a spreadsheet. And all of these systems are designed for speed because they understand they're going to have to store a lot of data to be this fine-grained. And then because your applications are only talking about the direct objects that they care about, any of the relationships in the in-between don't get written in your code. So you just ask the question, can this user perform this action on this thing? How they got access to that? And if you ever refactor or change how they get access to that, that does not live in your code base anymore. That means you can make changes to your permission system and not change a single line of code in any of your web applications. And believe me, when you do that for the first time, it is a magical feeling because you don't have to touch any code. So then there's also multi-tenancy and management ease. And this is just simplicity around modeling. And then with ABAC and REBAC systems, you're paying it forward. So our back might be really easy conceptually for you to implement at the beginning. But these systems, the ABAC and REBAC ones, they're more focused on forward thinking. If you need to make changes, like I just described, you can change REBAC designs without changing code. It may be a little bit more effort for you to get started in building and integrating with one of these systems. But by day two, if you ever need to make a change, it's going to pay dividends. So I wanted to get deeper into this Zanzibar paper I talked about earlier, which kicked off the interest in REBAC that you see today. Basically, Zanzibar is a purpose-built graph database that is very specifically optimized for one thing, which is finding a path in a graph. And by virtue of finding that path, that means that the user has access to that particular thing. It's actually one of the few good things that came out of Google+. So there's only two things that came out of Google+. There is Zanzibar internally at Google and then Google Photos. The novelty of this paper is actually that it is solving an authorization problem with a focus on distributed systems. So if you'll notice, the title of the paper is called Zanzibar Google's Consistent Global Authorization System. So it is fundamentally trying to tackle authorization as a distributed systems problem, which is not really something else has done in the past, because they kind of acknowledge that if they're going to deploy one system at Google, it needs to work across all geos in the world. And it has to be extremely, extremely reliable, and it can never be wrong. These are really difficult requirements. But the anecdote I like to use is when you're on a cloud provider like Amazon and you go to provision something like, say, an S3 bucket, you're always choosing what region. But actually, if you go to set IAM rules in a cloud provider like Amazon, you don't pick the region. That is because these systems fundamentally have to be global. And when you're designing them yourself at a particular scale, you need to think about how you're going to make your system global. And so this paper actually inspired two companies, Carta and Airbnb, to go forward and implement their own internal systems based on the ideas in this paper. None of them are truly 100%, I would say, authentic to the original paper, but rather the paper refused with the requirements of their business at the time. So I think the real superpower to Zanzibar, though, is this, which is if you go to send someone a Google Doc in Gmail and they don't have access, Gmail will pop up a box and tell you, hey, you didn't give access to this person. That fundamentally means that Gmail actually has a way to ask questions and check permissions that are built into Google Drive. So that means you could have one central source of truth for authorization data that your whole application suite can share, microservices can share. And this is incredibly powerful because not only does it allow integrations like this, but it also lets you have that central source of truth where if you need to audit something, you can just ask that one service. It's the only service you have to trust. It's the only service that you have to query if you're trying to really dig into any of this data. So you have a problem like an outage or something, an incident, and you need to understand what the access control looked like. So you might be wondering, how do I Zanzibar? So this is exactly what we set out to do. Basically, the year after the paper was published, my co-founders and I left Red Hat to found and basically build SpiceDB in the open source. There were some folks experimenting with the ideas around Reback at the time. But no one was really moving the needle towards making this a production thing that you could use in a real enterprise environment or at a real tech company. We originally prototyped the thing in Python. It was type annotated, lazily evaluated, functional Python. So it was way faster than you'd ever think Python should be, but it was not fast enough, so we ended up rewriting it and go and open sourcing that. The name is actually inspired by Dune because internally at Google, the project was actually called Project Spice because the ACLs must flow. So the timing for that has actually been really good with all the Dune resurgence in the movies, but internally at OZ, all of our software is named Dune References as homage. But if we fast forward to today, the SpiceDB community has actually gotten contributions from a lot of companies, big names like Netflix, GitHub, Google, Red Hat, and Plaid. And there are production users in small companies, startups, where it's just the co-founders, all the way up to Fortune 50 companies. But I still haven't actually told you what SpiceDB is. So SpiceDB is, as I described with Zanzibar earlier, this extremely parallel graph database. So developers basically apply a schema, just like you would for a relational database. And I've given an example schema here, kind of modeling a Google doc. And then what they do is they store data inside that database and query that data according to that schema. And it's really magic where you can actually make schema changes and not in a forward compatible way that lets you actually modify your permission systems without changing any code. So we don't actually have a SQL API, despite being a database. We give you GRPC and HTTP APIs. And effectively, like the primary interface we recommend as GRPC for latency reasons. Because authorization is in the critical path of everything, your web applications are going to do, and possibly everything at your business, you really have to make sure the stuff is fast. Thus, everything needs to be kept in memory. Everything needs to be returned in single digit milliseconds. So GRPC is actually pretty critical for that. And then in addition to the actual main server, we also expose servers for power and dev tools. So you can get auto-complete and things in your editor. But then also integration testing services. So it's Kubernetes native. Designed from the beginning, our background is all in Kubernetes. So actually, SpiceDB is self-cluster. So if you deploy just SpiceDB directly onto Kubernetes, it will discover other nodes and actually start to divide and shard up the in-memory graph that it's using to actually serve this data across them automatically. We also offer a SpiceDB operator in the open source, which will then do automated updates for SpiceDB. Notoriously, having zero downtime updates for a database is very tricky. So we just took that problem off the table for most people and just implemented it automatic for anyone using Kubernetes. So we remain true to Zanzibar's goals of consistency at scale. So we actually have pluggable data storage systems. And basically, depending on what your requirements are, say you need to deploy everywhere in the globe, you can actually store all of your raw relationship data in something like Spanner or a Cockroach DB. And then you can deploy regional deployments of SpiceDB that will exist as independent caches for those geos. But fundamentally, they're sharing all the same core data and they're consistent across those environments. If that sounds too complicated for you or you don't really need that, you're just single region shop, that's fine. We also have deep integrations of Postgres or MySQL if you just want to use something like Aurora or Amazon or ES. Obviously, then there's also memory for testing. We also have a tool called Zed. Zed is the CranLine tool. It basically manages cluster credentials as backups. It gives you a command for every single SpiceDB API. And I just kind of give an example of running kind of with debug flags permissions check. You can actually see it gives you a whole graph traversal. It shows you a tree of how you actually computed whether or not someone has access with timing data associated with all that. So you can see where things slow down. We have a web IDE. So actually, the two things you just saw, SpiceDB and Zed, we compile to WebAssembly and then run that in the browser. And then we basically build that all on top of Monaco, the engine that powers VS code. And give you a full IDE where you don't have to install any of the software I just showed you. You can just go to play.offz.com and start playing with this stuff. Run Zed against live data. You can load in test data. And what we actually do is we can generate exhaustively all of the paths available in the graph for you. So there's somewhat of a model checking happening here. So you can actually prove exhaustively that all of the ways you can traverse the graph are the ways you think they are. And that basically lets you prove that a system is correct without you deploying it into production or having someone do a extremely long security on it on your process. And then you can check this stuff into CICD. So if you make a change to the schema, you can actually guarantee that certain assertions always pass and that everything is exhaustively checked. So Zanzibar is not a silver bullet. We actually have had to extend Zanzibar in a bunch of different ways. So SpiceDB remains true to all of the core concepts that you'll find in Zanzibar. But not everyone is Google. So effectively, not everyone relies on users being represented the same way. So we are kind of more flexible with how people can model their own users. And then we kind of add on developer experience because at Google they can say, you're forced to use the software. When you're building open source software, you can't force people to use your software. You have to compel them to use your software by having a better experience than what they're currently doing. We've also added kind of contextual relationships with ABAC. So that means relationships can actually exist basically dynamically based on context that you provide at runtime. That was a joint project with Netflix. So if you're wondering how you SpiceDB, you can go to our Discord, discord.gg slash SpiceDB or check out GitHub, basically anywhere on the internet where you expect to find open source projects. SpiceDB is there. So thanks everyone. Thank you. Thank you.
Agama: Low-Code Web Identity Orchestration
you you you you you now it's proper green okay now it's a good green okay so so yeah so so when you look at this distribution oops wrong slide yeah so it's actually it's a number of components and we when you deploy the stack it's for production deployments it's Kubernetes only we do have VM distributions but they're for development only there's no way to upgrade the VM so if you're testing it VMs are fine we're publishing packages that are not available for you. Red Hat Ubuntu and Sousa but when you put it into production this is really a enterprise deployment. Just a couple of notes of like who who wants to use this type of thing this is for large scale deployments companies that have economies of scale to operate. People have custom requirements companies that need to self host who don't want to use a cloud provider or people who want to build a product. So okay well that's enough about about the background what I want to show today is actually a live demo of a gamma and this is sort of brave majority not you know we're already facing technical difficulties we'll see how it goes here. But I want to show you actually want to build an agama project and maybe customize it a little. So let's go here so I'm going to go to our. So this is a gamma lab as I mentioned a gamma lab is the developer site and so a gamma is a. So the idea with a gamma project is that it's everything you need in order to build a project that includes the code any libraries any web assets CSS HTML images so all that stuff together is what we call an agama archive. You take a gamma archive if you're Java people you might know what a war file is a war files are your whole application and you can deploy it on any. Application server that supports war files so a gamma files are sort of like war files for IDPs if you have this any IDP that supports deployment of a gamma should be able to run. The agama catalog is a catalog of projects that you can start with because it we don't want you to have to write everything from scratch the idea is is that we can create sort of a catalog of ready to go projects you can fork these and then modify them a little bit and publish them on your server. The project I want to work with today is called a gamma SMTP so I'm going to actually just go here and fork this project in my personal repo. And so a gamma lab doesn't actually store any code we're just using GitHub for everything here so we're storing you so you fork it and so now it's forked so now I'll go back to a gamma lab. And so I'll change the repository. And I'll say okay where to go. So this is this is the one I just forked so let's work with this one. And it says okay you don't have enough permissions so. We're using what's called a GitHub app I never heard of this thing but basically you can install a GitHub app in your in your project and so this enables us to have very granular permissions. So I'm going to configure and I'm going to basically have to add say okay where's the gamma this this app. Save. Yeah and so now I can go back to a gamma lab. And I should be able to switch to it now. Without that I wouldn't be able to read from the from the from the repository. Okay so now here's my project. Let's look at it in this is what we call the orchestrator it's the place that you actually write the agama code. So this this is what a gamma the the low code looks like. If you actually want to see the agama code itself you can hit generate code here and this will show you the agama source. But we don't actually want to see this. And so the idea here is that let's actually. Let's maybe make a new a new flow file so I'm going to teach you let's just say. Or dot foster dot. So. Say demo. So. Every agama flow has a unique name we call that the qualified name or QN. So when I'm evoking an agama flow using open ID connect we would always use the ACR agama to tell the IDP we're running in a gamma flow. But then we have an extra parameter called a gamma underscore flow and that and that would be the qualified name of the of the server. So here's the QN up here. Now in a gamma these are all the agama commands agamas are very concise language. So let's say I want to the first step of my flow is I want to display a form I would use RRF and here I would give the name of the template. And actually in the in a gamma lab we also have form authoring. So for example here here's the form and this is using Apache free marker. And this is kind of brave but let's try it. What if maybe I can. Drop in a new logo here. So. Let's try. That one looks beautiful way better. Okay so I just customized my form. Okay and I'll save it. So you got the idea. You can customize your forms. Now we also have a lib folder here. So the code folder is where all your agama orchestrator flows go. The lib folder is where all of your code goes so agama is low code not no code. That means that you're still going to call classes Java classes or groovy classes are both acceptable or you can drop in a jar file. And. Yeah and so when what what is a gamma file basically as we take it we make a zip file out of the code the the lib and the web folder and we have a project set that Jason that basically provides the metadata. Let's go back to the orchestrator. I want to keep going with a gamma. So and show you some of the other commands. So in addition to being able to render a form we can assign a variable you're all geeks you know what that means. We can call so in the call box we provide the class name the method name we send in arguments and we assign whatever gets you know comes back to the class name. We can get back from the method into a into the result. So that's a very important box. We also have a trigger so trigger means I want to call another agama flow sometimes you might break your agama flows into different parts. Maybe you have one we have one part that sends the email we have one part that maybe does registration so instead of building gigantic flows we can break our our orchestration into different flows and route between them. We have our FAC this is a very powerful command that allows you to redirect it stands for redirect and fetch it call back. So that means that we can send to an external open ID provider. A lot of things have an open ID interface these days. It's like you know a lot of things have a web interface well in authentication things have an open ID interface many authentication services have an open ID interface. When we call key cloak for sample we actually call it via open ID. So the ability to sort of built in you know call another IDP and get a response is very powerful. Let me actually so let's say we want to test this thing. Okay I've written my code and now I want to test it. So what I'm going to do is I'm going to download a dot gamma file. And now I'm going to ask SFDP that up to my server and deploy it. Dangerous I know but let's do it. So nope that's the wrong tab. This one yeah so let's try it. Okay now it's on my server that's good. And so there's a couple of ways to configure Johnson I'm going to show you the geeky way because you're developers and you can handle it. We have this thing called the two e and the two e basically it's an interactive command line thing it's better than I hate command lines. But you can we have a command line we have the two e in the commercial product is a web interface. But let me say so what I want to do here is upload my dot gamma file. So or rather it's already uploaded but I need to say okay this is the agama file I want to deploy. I need to fill a buster here while it's deploying so I'm just going to show you this thing sends email. So it actually uses the agama or the Johnson server to send email. You know this is one of the areas where actually so if you we want agama to support multiple IDPs we'd really love it key clout to have an agama deployment for example. If you have any IDP specific stuff what we want you to do is write an interface with all the methods. So in our server we said we have three methods that your IDP needs to do send email on board user. And so we have an implementation for Janssen that is how we send an email but if you're writing key cloak you would implement these three methods to in order to do whatever key cloak needs to do to send an email. So we want to sort of separate the IDP specific stuff. Okay I think it should be deployed now. Let's just go back and check. All server, agama, just see. Yeah so it looks good. Okay so let's actually test it out now. And so we have this really cool test RP called TARP. And so TARP it's a browser plugin which is really nice. So you put in the host name of your IDP. I think this is what is my host name actually. Moral boxer. Moral boxer.glue.info. Okay so now we're going to do dynamic client registration. So now I have a client ID. So remember when I send this request I'm going to send ACR values equals agama. And remember I need to send this extra parameter agama underscore flow and then I will go back. So I don't know the QN of the flow I want to invoke so I'm just going to go look at it. I think it's this one. And remember the QN is here so I'm just going to copy this. Copy this, go back to TARP. Paste it in. And let's trigger the flow. Oh no. I think I know what happened. I think probably my bad demo flow didn't like it. But I don't know if I have time to fix that. But basically you got the idea though is that we what we're going to do is just to summarize. I think I only have a couple of minutes left. So you build your flows here. You download them. You test them. And then when they're good we have another feature built in called publish release project. So this actually does a GitHub release. You know how GitHub has the release a section. So we have a way, an automatic way to do a release. And if you really, if you're really excited about your project and you want to share it with the community, you can publish it. That means you're submitting it to the agama explore catalog. We have a review process to make sure that it has the right documentation and license and everything else. But we're we want third parties to submit projects. So I think that's about it. How much time do I have? Okay, let me I'll take some questions. Sure. So you mentioned about identity providers being able to provide something. My company is an identity provider and authorization as a solution. And I'm not sure what you could be needing to add to support this or what our customers would get out of it. Could you explain? Very good question. Okay, so agama like Python, you know, it's it's a it's a. Okay, so the question is, is how can my company who has their own IDP use use use use agama. So a gamma is governed at the Linux foundation. And so if actually my one action item, please go to jans.io and and star this project. We need to get to a thousand stars. But a gamma is published here at the Linux foundation. And that includes the agama interpreter. We have an agama interpreter, which is in Java. So for if you have an IDP and you want to support a gamma, what you would need to do is support a way to enable your customers to upload and deploy a dot gamma archive. And then you would have to be able to interpret that. But that code is actually here for you to use under Apache to license. And also the documentation for the agama language is also here in the in the docs. So the language reference, etc. Is all here. But basically you'd have to build an agama deployment engine and a gamma interpreter sort of into your IDP. What does that get us? That gets you the ability to your customers would be able to use a gamma projects and deploy them. So they could use the developer tool right there. Gamma projects and then deploy them on your IDP. Do you have low code? So maybe nothing. But but if they have a lot of IDPs don't have low code. Also, we'd like to see commonality instead of each IDP having their own low code. We'd like to see why are we going to torture developers and make them learn five different low code ways. Let's get one low code platform for building web flows. And that way we really would really like to see this deployable on the cloud to like Amazon Cognito would be great if you could deploy your gamma project to Cognito. So we'd like to see interoperable web journeys just like war files. Yeah, any other questions? We're probably out of time right. Five minutes. Sure. So do you think the current state of the language to the SLB is finished or are there needs to be extended? It's two years old. Okay. So the question is what's the maturity of a gamma? Is it done? The we've it's about two years old. So it's still early and I wouldn't say it's finished. But we are we are using it and and and so and we have a number of pop up projects published this year. I'm going to plan to do one project per week. I have a if you follow me on LinkedIn, I have a new episodic called a gamma project of the week where I'm going to feature one project. So by the end of the year, we'll have about 50 projects. So it's still new, but it's we think it's usable now, but certainly we're going to keep improving it. So to get started, you know, you can go to I have stickers, but if you go to glue dot org slash a gamma dash lab, or just go to glue dot org, you'll find it. You can sign up for free. A gamma lab is a free developer site. So, yeah. So looking forward to seeing some of gamma projects. Thank you.
Improving Infrastructure Security Through Access Auditing
Today is Scott Bryan. He's going to talk about improving infrastructure security through access. So, you're up. Morning, everyone. So, I recently joined Red Hat and I work full time on the Adoptium Temerin JDK project. So, we use a very traditional build model with a large suite of machines. We support between 12 and 15 different platform and architecture combinations. So, it's very difficult to do just with dock containers, just single machines. So, we have a massive, massive suite of infrastructure. This doesn't work. So, we're currently undertaking a massive piece of work to secure our supply chain. So, we are looking at S-bombs, reproducible builds. But underpinning it all is a good infrastructure security strategy. And we've implemented centralized keys, rootless access, things of that nature. But how do you know all of that stuff is working? Unless you can visually see the results of all your security work. It's very difficult to prove whether it's working. So, I came in. There's no security or no strategy for verifying that any security fixes have worked. So, it's a very cut down presentation from the full length one. So, first things first for us was identifying what we wanted to get out of an auditing system. So, we want to capture, login any access attempts, anything at all where somebody was accessing a system, particularly in the build sphere. If you think about the Sol wins attack, which was a compromised Jenkins server, I believe. If your build system infrastructure is compromised, your builds and source code are potentially compromised. You build something, it's got a vulnerability in, but it checks and everything else looks valid. So, any end user sees that. The other thing we need, we wanted was an automated response and alerting. So, should somebody try to log in as root on a build system, we need to be one. That needs to be stopped straight away. And we need to be alerted that that's a thing that's happened. Come to it, why in a little while, the scale of the problem when you don't know about it is very different to when you do know the numbers involved. So, and then we want some analytics and reporting so we can, again, gauge the program and the success thereof it. Ultimately, for us, our infrastructure is all provided by a dozen different cloud providers. It's all publicly accessible. So, all of our, even our build infrastructure is open to the web. You can request access to it when you join the projects. So, again, the attack vector is significantly large. We don't have a single firewall that we can use, sneak and restrict the IP addresses. It's all publicly available. So, for us, host-based intrusion detection using Wazoo, not a tool we build, but it's open source. It's a very good tool for this use case. I would recommend you do a very similar exercise, analyze your requirements, and then have a look into the tools that are available. There's quite a few of them. Wazoo itself is a fork of OSSEC, which kind of stopped development when it became semi-paidware. Wazoo was an offshoot that is still open source, and they've continued to feature develop it. So, the scale of the problem. So, some numbers, which 24 hours across our infrastructure suite, 202, just slightly over 2 million attacks in 24 hours. It's a bit of an eye-opener. Of those, 12 are deemed by our, and the standard rule set from Wazoo is really excellent, of being serious enough to warrant concern. And you can see in 24 hours, about half a million people, people just brute-forcing the build machines to try and compromise them. I think a demo is slightly impossible without my laptop, but you can drill down into all these. You can see there are all the metrics that are available for the attack vectors, the CVEs, and you see there are also the 79,000 authentication successes. It's here on the right. What's the difference between SSH and brute-forcing? Not all machines are accessed by SSH, so they will be things like Windows, brute-force, password attacks. But Wazoo detects, again, remote services, modify registry attempts, all via RPCs and things like that. So, again, this is the, the first thing it does is give you a nice kind of visual view of how big the scope of problems are. It's why I like this tool quite so much. So, drilling down a little bit just into the authentication failures, you'll notice that Windows, by far, is the key attack platform compared to the Linux service. The numbers are hundreds of thousands times as many. And you'll see there, the top three machines are all build, Azure, Windows. It's a very popular thing for attacking and, again, get a much better kind of breakdown of the attack vectors. People trying to access restricted accounts. People trying to get valid accounts. So, although they're disabled on ours, the standard Windows administrator, the standard Windows guest accounts, although they're all disabled, everybody can guess one of the Windows or can find out one of the Windows standard accounts that, unless you've disabled it, is a very easy attack vector. And then just brute forcing things. And then, looking even deeper into just a single host. You can see down here at the bottom of the screen, you're getting the login failures, unknown user, a bad password. In theory, it's somebody just typing an IP addressing wrong. However, every single one of these attacks has been stopped with an automated response. You can go even further into blocking IP ranges, geographic ranges, so you don't even get the alerts. It's that I like the visibility. I would say only the really high priority stuff. And you'll notice once you drill down, there are actually no serious alerts. That proves it's working. So, again, you can take some knowledge in that your infrastructure is fairly secure. And then another really useful feature, and again, is you can then go into the details of each individual attack. Although you get a geographic region name, IP address, things like the target users, they've tried to brute force on our SSH-based host. There isn't a slide for this. We've extended it because Wazoo is eminently customizable. So, we also capture the SHA-256 checksum of the SSH key being used to try and attack. So, we can then determine if it's one of our valid users, because we have all our keys stored centrally and distributed centrally via Vestillion. If it's not one of our keys, we can then start blocking SSH keys at that level. But, again, we've extended it to capture that information. And Wazoo is basically an Elk stack-based tool system. It uses the logging part of it, the elastic search, and it just captures all the logs from all the systems. Again, you can customize it to capture whatever you like, your Windows system registry, whatever the Mac equivalent is, audit log, syslog, and it just harvests it all into one. Really nice, easy to query, work with. It's got the capability of doing dashboards and searches. We're still fairly new to rolling out and leveraging it for real serious stuff, but I think it's worth sharing even at this stage. And again, more extended audit information. This is from one of our dock hosts. Somebody there has logged in as Root. It's probably me. But again, you can see the kind of information you capture even on successful logins. If you're trying to find out who's doing stuff, they shouldn't. And Wazoo itself goes much further. It's got a file integrity management tool, which again, you can alert on, so you can track all the changes to key system files. It's got a SEA component, so it will check your system against the NIST databases, look for any vulnerabilities, give you the links to the CVEs, and then the potential fixes if that information's in the NIST databases. All of that in one happy place. Worth a look, and if you want some more information about how we use it, feel free to connect with me on the adoptium slack after this meeting. Whatever you need. I think we've got like a minute left, so time for one question, maybe. Say we're already using something like HashiCorp Vault. There's that lagging behind in audit capability. Let's say audit capability is something we want to elaborate on right and get ahead. Does that even give us an advantage? Is it doing everything in Vault or not? What is the wisdom and that? Okay, so that's the question is, compared to HashiCorp Vault, what does Wazoo give you? I can't see any reason why you couldn't use both. You could still use Vault for everything you're using Vault for, but what this would give you is the reporting tool on top. Would that work? Yeah, yeah. How much effort would go into it? I've never used HashiCorp Vault, so I really... But Wazoo, say you could get it to monitor your Vault, as long as Vault's putting some logs out for you to monitor, you could customize Wazoo to look at those logs, as well as your system logs, and still use the same visibility features and log harvesting. I don't see why that wouldn't work, but... So it's string matching based, right, as long as I have log performance? At the base level, yes, it's string matching and regex from log files, but that's just what it ships with by default. You can extend it to do whatever you like pretty much. If you're willing to write it. OK. Right, I think that's it. Thank you very much. APPLAUSE Thank you. Thank you. APPLAUSE Doctim is an Eclipse Foundation project for Temrin JDK. Although I ran out of paying my wages, I worked full-time on the adopting project. So... Wazoo is a third party to look into the Eclipse Foundation. I just think it's... Yep, sorry. Sorry. Well, cheers, George. I'll catch up with you later, mate. Wazoo, just a little bit. I saw it was best for our needs. OK. And all good things about being a little bit independent about working for the foundation.
Role of IGA in Access Management with Multilateral Identities
I have two affiliations, Manis with Evolvium, which is company behind open source IGA system midpoint and also I'm active in academia helping scientists across the world get together and solve their identity problems and in this talk I will kind of combine all of my experience with this rather complicated topic. So let's see with some introduction. If we're talking about multilateral identities we are meaning basically the whole scale of identities that are available to the users because the users today have a lot of identities that they just own and they can use to access systems. So it can be like identities from one's institution but it can also be social identity, identity on for example GitHub or even though states especially here in Europe are pushing some European IDs, digital wallets, some academic identities, banks and so on and so on. There's a little lot of them and all of these identities can be used somehow. Then the next item in the name of the talk was access management. So it's a component basically responsible for really giving access to people and do everything related to access. So one thing you can do is of course just type your username and password in but in principle you can use all these identities as well and that we have IGA which is identity governance administration for those who don't know this term is basically an extension of identity management and its main purpose is get the identity management rather technical stuff for administrator to people who are actually making decisions. So some managers or even support just get them in, let them manage what they supposed to manage, not have everything done by technical people when the others call them. And in this talk the identity governance system will be represented by midpoint and I will try to show you how all these pieces fit together and what can you do with the combinations. So let me introduce midpoint as well. As Zawar said it's identity governance and identity management system and because I'm here of course it's fully open source and usually suppose it's not important to say it but when you are dealing with identity management and access management areas a lot of the products there are claiming to be open source but in reality they are just open core or something even else but with midpoint we are really doing our best to make it fully open source including all the documentation guidelines for the developers whatever is needed everything is open and available to use. And the product itself is maintained by Evolveum and we have few external contributors. We would happy to have more of them but it's kind of hard. The identity management, identity governance is very complex tool contains a lot of code so it's very tough to get contributors. But luckily we have some contribution at least to some integration part something that is easier to get to. So about midpoint it's very feature rich and I would say it's really comparable to any commercial alternative. So I consider it a big success and even we are recognized by some analytics company which is really nice and what we can expect from open open source system it's really customizable is using as much standard as possible if you want to get more there is a link that you can find all the information. So let's get to access management integration because this is quite a common but I think there is a lot of potential if you are integrated IGA system of access management. So from the IGA to access management this is the more common path so the IGA because from the identity management part most less information about users and their accesses naturally IGA can provision all this profile information about users to access management and also provide data for authorization. It might be attributes, might be roles, even some combination. So this is quite natural. The other way around it's something not that heavily used and then I think there is a lot of potential in that because the access management especially when we when using some external identities have a lot of information to pass back to the IGA because and if we are talking about single organization, if you're using a password and you have no new information, if we are using these external identities usually with the identity we are getting some attributes that can be used. So if it's a state identity we at least know this person was verified by state and we have some identifier from the state that can be laid to use. For example if we are dealing with some big security incidents, if we have academic identities we can get information whenever this person is an academic employee or student and again use it for access control later. If we have social identities at least we have some social identifiers of the person that we can use for some integrations for example or we might have also other attributes like names, emails, whatever that can be used such as to make life of the person easier just to request them just use the information that we already have. Second thing that we can get from access management is access timestamp because the access management of course know when the user was accessing the system so we can get these timestamps and work with them later and I will get to this. What are the typical interfaces for the integration? There is no standard unfortunately but we have some standard common option so from the identity management part integrated anything is usually through some kind of connector basically writing custom connector to to get whatever API is that for the access management or there can be some middle layer like some let's say LDAP or Active Directory some standard database that access management can use. And to get some information back I do if there's this direct synchronization of connector identity management can read it back or if you want some like just like runtime integration you can always call some API and do some something that. Let's move to identity governance benefit. Basically if you are familiar with identity governance it probably won't be anything new but I will just repeat it. The very important one is overall visibility. If identity governance deployed because you usually deploy it within a single organization you want to be in control and have some visibility of it's happening in your organization mostly to tight your security be able to go through audits and so on. So the main feature is some kind of reporting or web pages dashboards who has access to what and why so you can visualize for example if you are using role based access control who has which role what the role is entitled to to which applications and why every person has this role. In midpoint we are using something that we call they are calling policy driven RBEC because the RBEC is very good tool it's very easy to visualize it to explain to people but you need something more in order to work with some attributes with some automated rules so we have kind of extension of RBEC and if we are thinking about this talk how we can get this multilateral identities and this data we can get through the access management to AGA I have to use it. So first one is use these attributes for example if I know the person was wedded by state come with state identity I may note this as a level of assurance attribute I have big restaurants in this identity and based on it I can give some access through RBEC classically and then I can visualize it in the standard way using dashboards and really know what the person has access to. Also when I have time stamp when the user is accessing each system for the last time again through the access management I can use it to build some policies either to remove unused accounts and to tighten security or I can naturally work with some kind of expiration renewables of account whatever I need for my particle workflow and of course AGA wants to automate all of this so using airbag automated rules some provisioning through connectors and integration in the system make sure that everything that I just said is completely automated and you don't have to worry about it. If the full automation is not enough you want some human element some kind of interaction you can have some approval processes, expirations, renewables and so on. So let's get to some interesting feature about integration all of this together and I will start with integration of basically access management to given surveys using just in time provisioning now without identity governance. What you can do and this is very nice trick you can basically create accounts on the fly because when we are using these multilateral identities which are coming already with attributes we can just pass this identity to target system and by passing it we basically authorizing the identity to access the system and the system if the system supports it can create the identity and accounts for it on the fly, use the attributes and give proper permissions within the system. What this basically here is how to how to deprovision such accounts because this creating of accounts is ideal it's very simple you can use it really on the fly but you have no way how to disable these accounts the only way how to do it is again by the end system itself to have some kind of expiration because what is important here if the person accessing losing the access they are not going to the system anymore and the system doesn't know never gets the information there's no way how to get the information and also with this we have no central visibility who has account where and why which might be tough for doing some audits or resolving security incidents you have to manually go through all the systems and get through it. So with midpoint if identity governance component in place we can basically extend this using some extra tricks so the basic premise is in midpoint we are managing entitled users so I'm not saying the user should have active account on the target service at the moment we're just saying he or she is entitled to have it and whenever the user decides to access the system again using just in time provisioning we can create the account on the fly using this entitlement at midpoint midpoint managers. Also what is nice about midpoint midpoint supports provisioning and it's really quick it can be done in real time so even though if the target system doesn't support just in time provisioning can create in accounts immediately basically access management can ping midpoint and say now it's time to provision this account midpoint checks if the user is entitled and if so triggers the provisioning so we can have just in time support even for system doesn't support it not that able. Also midpoint and is this provisioning connector can read some data back so regardless if the account was created on the service or through midpoint midpoint will get the information the account is there is active and also we can read any additional information and basically then we have full scale information for the IGA we have who has the account active who is entitled but doesn't activate the account and we can build all the policies on top of that including some expiration renewals work with last access timestamp and combine this all together so for example if the user doesn't use the system for a long time it's kind of a security risk we can deprovision but we still know the user is still entitled so for the next next usage we can we can still work with that. And now gets to this part with multi lateral identities because that's brings just another level of complexity because with multi lateral identities we are expecting that a single user can have multiple identities and we can even combine them because we can say okay one identity for example the state identity brings your account to the higher level of assurance we are know this account was vetted then you can have some social accounts saying okay this is your social ID and we can connect with some with some system with some social systems and we can integration because we know this ID because we know this ID if we have some academic scenario we know the person is a student or employee of given university or even more universities you can combine it all together. Tough part is how to correlate these identities because there is no common identifier nothing that we can automate it. In midpoint we support two way we call it smart correlation and it enables you to configure how the individual accounts could be correlated and you can you can base it on the source so if in the source you know this is the email which was verified and I'm happy to correlate with existing accounts based on this email you can set up this rule you can set up some fuzzy rule like matching on name or even even have fuzzy matching counting with typos and some something like that but this probably you want to fully automate because there is some risk if it's really the real person so you can also define what match should be like processed in an automatic way and just connect these identities together and what rules need some human interactions and there are two ways how to do it. If you want strict control it can be done by some administrator or some other delegated responsible person basically manually whatever process you need you will just see okay this is attributes you have this is the new identity you have to decide here are some potential matches decide if one of the matches is real or if you want to create a new account for this user. Second option might be to again use the access management part because basically the user in principle own all these identities and can use any of them to sign in so let's just the user sign in with first one then the second one in the same session and then we know for sure that the user's own both identities and can be connected together. Also what is nice here you can combine all these external identities like state, social, academic with local one if I have deployed IGN and usually I have within a single institution I have some local accounts managed by HR department so even combination of these local accounts and this like remote identities is possible using the same principle there's nothing really different there and what I can do with that is build some kind of unified profile so take all these attributes that I'm getting from different sources and then usually I want to build a single user profile I don't want to work with users who has six names from six sources and most of them are exactly the same the value is the same maybe sometimes if you have like your name from the social network maybe you have different biospelling because you like it or something like that so we can just gather this data and then put a formula how to build single user profile how to select which name is the one how to select which email is the one that should be used or if we need it we can just build this like one profile which is always handy to have if you don't have any special requirements but then we can also have some extension of this profile for example we can have like official email within the institution and then like a personalized preferred email and then we can decide based on the target application which one of the emails should be used which one should be provisioned and this is basically or possible with midpoint just put all the rules in how it should be processed of course the most difficult part is how to decide it because we want to have it simple so people can understand and also give an option to select for example they prefer the email or preferant address at least for the system when this this is not that important and what we can also do it thinking about these rules and how to combine all this data together how to put some organization policies in because it's really nice to have if you use this freedom to select their mail or they prefer preferred name or preferred email but sometimes we have some systems that we really want to enforce strict rules because this is something that I don't know is sent to send to authorities for for some validation or I don't know it's it's might be tied to to your payroll and you want to have real data there but then you can have like a company social network and let user give them the freedom so when constructing these rules we can combine like organization policy with some user preferences and even based on the target system decide it what results should be used and where sounds complicated it is but it's all about programming and how to put it together for your organization and again with the end goal to have fully automated processing at the end with some middle user inputs user preferences and so on it's not complete there are still some missing pieces we were we were experimenting with running some demos improving midpoint as a product to support this better but for sure it's not fully complex the biggest issue is user experience because yeah a lot of these options especially dealing with external identities when users needs to sign sign in actively work with them it's hard and this will be hard for a while but it's getting better how how how people are getting more and more used to work with work with their identities and use the identity to sign with completely other system now we've pushed for european evils and so on again people will be getting more getting used to this principle and it will get easier also the interface between access management and aga is not well defined now we are just writing custom integrations on both sides dependent on the on the needs for sure it will be better to have some like prepared interface that we can use and we can connect our product midpoint to existing access management systems that would be really really handy also life cycles of the individual identities because we are combining different identities to a single profile and also we should think about life cycle of the individual identities some of them are pretty persistent like the state ones but other like if i if i know that someone is a student probably i should verify this this this statement once in a while and i can put some policies or condition it would be nice if the protocols where we're going to support this so we can for example query each day but with other protocols like samble basically until the user didn't sign in you don't know the current state of the information so have some maybe some explorations some renewables here as well would be really nice also the whole assurance and trust model in this might be very complex again working with different sources of information which are trustworthy which are not how we can process them how we can use it what's our assurance on this information it's difficult to even decide what we want to do and we are and when we are when we are have this decision is an essential thing is how to process it we experiment with some kind of small project which we called mid privacy and it was about putting metadata to each value that we are storing at the metadata source of the identity the assurance level and also for example how we can be used within the gdpr framework so having this all tied up and again automated so we can use it for automated processing and provisioning rules would be really nice we started as an experiment just to get some feeling for it it's fully available to people but as far as i know nobody tried to put it in practice yet which is which is a bit pity and again there is a link if you want to read more about it so just to conclude and hope to leave some time for the questions it is it is really possible to combine these worlds and really tightly connect identity governance system with access management and basically unlocks new potential for new features and nowadays there is a lot of identities that people can use to sign in to our systems state banks and i'm expecting there will be more and more of them and people will get think more and more custom to use them especially with these eids on the european level so this is something that we should be prepared for and i ga even though and i think about i ga is mostly within a single institution to make sure everything is tied everything is well well ordered everything is automated it can be audited it can work very nice in this world of multilateral identities and bring these same conditions and the same the same benefits from the i ga to this world as well but having a full implementation covering order english is complex and it will take probably some time when we all get it there midpoint is kind of halfway through and that doesn't mean halfway exactly we have something now that can be used it can be experimented with but if you want to reach maximum potential it will need to improve the product as well and because everything is open source and available all the contributions are always welcome so thank you for your attention and we have a few minutes to get some questions yes the question was if we have some machine learning on our roadmap we have we are already experimenting with that not for this particular problem we decided to first start with role mining so if you are importing roles usually within the existing role that you already have how to mine some business roles out of it because if you have if you are migrating towards i ga you have a lot of roles manually manage and it's good to build some business roles that can be easily managed and we are using machine learning principles for that it could be good to use it for example for this identity matching but so far we were happy with some customers yes so this identity of management is only one side of the picture because if i have a user he might be a suicide man or whatever he's leaving traces in the application you grant access to okay so if this is now leaving the company the institute the university how do you deal with the traces do you have a mechanism like maybe scramble the username and change the username in the application so if it's reused that yeah re-usage of usernames in the target application might be a big problem yes so so the question was we have all this in place and what we what we can do when user is leaving the organization with his or her data scramble them remove them somehow something with that and this is a tough question because one part is the application itself and when you have this automated identity management identity governance system in place usually you can deprovision the data completely out of the application but then you have this central point of the identity governance and the question here how long you want to keep the data for security incidents for example and that's valid and probably you want to have them unscramble for year two depending on your policy and then you should again automate the process how to either scramble the data or just completely get rid of it i would say accept identifiers because especially if you are talking about usernames you probably don't want to reuse that or at least not within a central period of time so i would recommend to keep that yeah but it's a bit more than that like in the talk before we had the wasuo what we have some web application and some person is creating a dashboard so within the application it belongs to that person and everyone else is using it so i kind of delete it but the the creator is done so it's more complex than this yes yes so the so the comment was if the user is creating something like a dashboard in web application that others are using if the original owner leaves can we delete it or not and if you are within an organization when you have complete control over your users you need some process to pass this work to someone else and i would say you have to i process for levers in the same way you are returning your keys to your office you should also return all your digital systems or transfer this to someone else but what you can at least automate in this case if you have this like a dashboard something have a process that before the deletion will send the notification or let someone approve it and that could help to automate it okay time is up thank you and we can continue this question later
FusionIAM - a full Open Source Identity & Access Management solution
So, we're going to start our next talk. It's going to be so as to explain what we do and we'll be doing this so that I tell you fusion I am a full open source identity access management solution. Bonjour, je suis français but I will speak in English, okay? No problem. Yes. So, some words about me. So Clément Wido, I work in a French company which is called Vortex and I'm doing a lot of stuff about identity management of course because I'm here to talk about it. I'm also doing other things like music. If you want to listen to French music, open source music, it is a creative commons, you can go on my website and I'm also doing a theater and other things. Very quick about Vortex, we are a service company and we provide many solutions like collaborative tools, containers and of course identity access management and I will talk about this thing. And if you want to not play music but work on open source, you can apply on our website. So, for the topic today, I will talk about the fusion I am project, explain why we created this project and which open source component we use to try to build this big solution. So, we decided to create this with Benoit Martier which is the leader of fusion directory. I don't know if you know fusion directory product, who knows it? Okay, but many people. So, it's cool that you come here so you will know about it today and Vortex. So, we are both people working on open source product around directories and identity management. The goal was to offer a complete identity and access management solution because you know that in the propriety solutions, when you buy one, you get all the components of identity and access management. But if you are using open source tools, most of the times you only get one piece of the full picture and you need to install them and connect them. So, our opinion was to say, okay, we know that in open source, each product must do one thing and do it good. But if we want to be able to go to companies and say we are doing identity and access management, we must provide a full solution integrity. So, that's why the reflection. And we are today working on this project in Vortex with David Coutador and myself. So, who knows OW2? Okay. It's normal because it's a French consortium like Eclipse, but you know Eclipse, but you don't know OW2. So, today you know OW2. There are a lot of products inside OW2, Blumain, GLPI, and Lemanelda, et cetera. And so, we are an official project of OW2. So, one solution when you want to do a new open source software is say, okay, all that exists is mess. So, I will write everything, but of course, I have a family, so I don't have the time to write everything from scratch. So, we took all the open source projects that we know and we tried to combine them together. The one you may know is this one, OpenLDAP. Who knows OpenLDAP? Okay, yes, one. Of course, we are not the developers of the OpenLDAP software. It's something that is managed by a Siamese company and OW2, which is the leader of OpenLDAP. But we are very implied in the community and we work a lot with OpenLDAP. So, our choice for the directory server, which is clearly the base of the identity management, is OpenLDAP. And then, we put a lot of products. So, Le Mans-Haldap-NG, who knows? Ah, yes. And we have the founder of Le Mans-Haldap-NG, which is Xavier Guimard here. So, we have some of the community here in Fosdame about Le Mans-Haldap-NG. I will explain all this fusion directory. So, I said, Haldap Toolbox. Okay, LSC. Okay, it's normal because it's the products that I created. So, okay, it's normal. Okay, so, these are all community projects, open source projects. Of course, you know only this one, but you will see how we try to combine them. Our approach was to say, okay, we can be as IBM, HP, et cetera, and we can go in your company and say, we have all the components. So, access management, access manager, the directory server, the directory manager, synchronization, the connectors, and two other components, white pages, and service desk, I will present them. So, that's a typical big proprietary IIM solution, which is, okay. But we put all the open source software behind the same. Okay, so, of course, the directory server is open-end app, but we added some tools in Haldap Toolbox project to better manage open-end app, to do backups, et cetera. The directory manager is a fusion directory, connectors LSC, the access manager is Le Mans-Haldap-NG, and other tools are some part of the Haldap Toolbox project. Of course, I will present them, but I know that you know other software to do that. Typically here, the most open source tool known to be the access manager is Kiklok, who knows Kiklok? Of course, everyone knows Kiklok, but I will explain why we do this one. We have another choice to have a single sign-in product, and this is Le Mans-Haldap. And for this, of course, Evolvm midpoint, that we just saw before, is another possibility here for the directory manager, et cetera, et cetera. So, everyone can choose which technical components it will bring in identity access management. We did this choice because we are clearly developers of a lot of these components, so we can act on the roadmaps of these components, and we know how to make them work together. So, if you choose Fusion IAM, you will take the choice we have done. If you do not agree, you just can fork and replace the components if you don't like them. On a technical point of view, if you already install Kiklok and a directory server, you know that it's quite simple. All components are linked to the directory server because that's where you have your users, passwords, groups, et cetera. And here you have the connectors to be able to synchronize from a database, from an active directory, for example. So, it will go into the Haldap server. These are tools to manage the data. So, to white pages, just to display the photo, et cetera. Here is to be able to reset the passwords. Here is to create icons, et cetera. And the access manager will also be connected to the directory server to do the authentication. All these are Haldap, HaldapS flows. You have just one database used for the access manager to be able to store the configurations on all the sessions, but the other tools did not need any database. All the tools are only using the directory server. And of course, you have here the access manager. So, a user, the end user will only see the access manager part to be able to access all the data here and to access also all the components. So, some explanation on the software. The first one, everyone knows. So, like Tynitero said, it's simply the best. I hope you all have the song now in the head. It's the best Haldap server in terms of performance of standard, because the people coding on Haldap have also written a part of the RFC of the Haldap protocol. So, we're sure that this component is respecting the Haldap standard. And if you manage your Haldap by yourself, you know that you can add a lot of features with overlays, like password policy, which is very important in identity and access management to be able to expire your account, to lock your account, et cetera, et cetera. And we will see that we bring other tools to be able to manage the open Haldap password policy. And in the Haldap Toolbox project, we provide some package to be able to install open Haldap on a different distribution. You may know, are there people from Red Hat here? Okay, it's not a problem. But you may know that Red Hat has chosen to push away open Haldap from the distribution to be able to use the Red Hat directory server as the main directory server. So, if you want to install open Haldap with a package on a center-est, et cetera, you can use the Siamas package or the Haldap Toolbox package. And of course, we also provide package for Debian Ubuntu, et cetera. Okay, so the directory is okay. The directory manager, so we choose a fusion directory. It's a PHP application. It's not like PHP Haldap admin, which is a very technical tool in which you browse the tree, et cetera. Here, you have a functional view of all the objects that are in your Haldap directory. So, of course, users, groups, but you can also modelize the service icons, applications, et cetera, et cetera. So, it's a very functional view of this. And it includes administration delegation. So, you can say a people is connected to this interface, but it can only manage the people in the service, et cetera, et cetera. So, it's like the midpoint or all those software like this. It's just to offer user interface to people to read or edit and illustrate data depending on their why. The connector, no UI. It's just a command line, but it's a very powerful tool written in Java. And it talks with RISTPI. It talks with databases. It talks with Active Directory. So, we are able to easily synchronize Open Haldap and Active Directory with the store. So, very efficient. And Lema Haldap ng. So, the key clock killer. No, I know it's not, but okay. It's like key clock, but we provide an application menu. We manage all the access control. White pages is an easy way to display the data of your directory for end users to search for phone number or email address. So, these are only Haldap data. So, I created an Haldap directory with Star Wars data. And you can display them, search for the umpires, the Jedi, et cetera, et cetera. But there is no database. It's only an Haldap directory. And ServiceDef is a little tool for the support team because you're able to see first one. You can see all the password policy data from Open Haldap. If you work a little with Open Haldap password policy, you know that it's very technical to understand how all the state of the password is managed. So, here you have all the dates, et cetera. Here, and you can test the current password. Of course, you can reset the password. And you can see if the account is locked, you can unlock the account. You can see the password is expired, et cetera. So, it's very easy for a support team to know if an account is expired, it's locked to unlock it, et cetera. So, moving to the cloud, because that's how we need to work now. Why? Because before, and we still do it for customers, we have virtual machines and we deploy all the package and we configure all the package. And we say, okay, Haldap directory is here and you need to connect to this web server, et cetera, et cetera. And when you want to put the logo of the customer, you need to put the logo in every product. So, the customer say, okay, it's integrated. Okay, it's integrated. But, okay, this still works, but it's a lot of work indeed to reproduce this by every customer. You need to write that. And the cloud approach is to say we will move from package to containers, images, and we will try to configure all the images, all the containers through variables. And indeed, we saw that Haldap server is the same for each component we need to connect to the Haldap server. So, I only need one parameter, which is Haldap URL for all components. So, I configure it once and then I can have the full solution. Of course, you do cloud. Okay, it's a mess. Okay. We need to have pods. We need to have volumes, et cetera, et cetera. So, you see that what was a little easy with some bricks and some components is not easy with the cloud because you need to identify which volumes you need to run the containers. And when you split, you usually split the web application between the front end and the PHP, FPM, or the past Haldap server. But it's better because we can run all these images and we can have, so, of course, for Haldap, we have a volume for the data, a volume for the configuration. And also for the certificates, KN certificates. And so, we identified in the FusionIM project, we identified all of this and we created all of these images and volumes. So, you just need to do make, run, and it's running. We have a container registry. So, it's open source. It's available. You can just pick the images and you can run with a Docker podman or a Docker compose. So, it's very easy to test. And you can also download the Git and run the Mac, run all with a Mac file and it works. The only thing you need to do is to initialize the volumes and, of course, put some configuration for your domain, et cetera, et cetera. But you just have to do this and you will have the full stack of identity and accept management running and configured. So, it's very easy. In Vertex, we choose to create a new offer which is called with us, identity as a service. And we put FusionIM inside of our, in our cloud for our customers. So, we can run for each customer. We run one FusionIM project. And so, a customer don't have any directory, don't have anything, but he can connect all this application through SAML or OpenID Connect. He has all the applications to manage the data inside the enterprise directory which is inside the cloud. And we, of course, have a lot of RISC API. So, we have RISC API for provisioning to create accounts, to create groups, et cetera. So, you can do all this with RISC API. And we also have some RISC API here to be able to create a new OpenID client or a new SAML client. So, you can provision the users, the groups, and you can also provision through RISC API the applications. And, okay, I know I have five minutes. Yeah, I can do a demonstration. Okay. Ta-da-da! So, it's not a screenshot, okay? It's a real, it's a real interface. So, it's hosted by Vortex. It's running on OpenShift, which is Kubernetes from Red Hat. And so, this is the login form. And so, you see, it's the access manager component, so, LemoneldapNG. And inside, we plugged all the IIM components. So, this one is for configuring LemoneldapNG. So, just the administrative interface. And, okay, so you get all the parameters. And here, you have the other components. So, why? Of course, it's a demo. Yes. Yes. Yes. Okay. So, this is how we manage the users. So, what you can see is that you can work with departments and branches in the end-up directory. And we can create, so, I create, for example, a codecobain because, okay, it's 30 years ago. It dies this year. So, it's a simple account, okay? But if I want to, so, I'm an administrator, I have this view, okay? But if I want to browse the directory, I can see it. So, I'm an end user, and I want to see the information of codecobain. So, I can browse it through web pages. But this is clearly the same. You see that you can also browse the groups, so, with Brittany. And this tool is wonderful because we can dynamically use the postal address inside the end-up directory to display people on the map, okay? So, it's a nice feature for an intranet, for example, when you are all in a remote location, you just put the postal address of people and you can see them on the map. It's quite nice. And, of course, you can click on and see, okay, this one. And if you are in the support, okay, Brittany Spears has lost his password, okay? So, Brittany. Okay, I can reset the password. I'll say, okay. Baby, one more time, okay? And, okay, the password was changed. We activate the flag, so, she must change the password as the next connection. All this is managed by the open-end-up password policy. But, of course, when she will connect through all the components, the component will respect the password policy and she will be forced to change the password at next level. Okay, that's all for the demonstration. Thank you. Some questions, maybe? Yes? Can you change actually directory passwords from this? It's a feature that we did not implement. Can we use this component? This component is an adaptive box service desk to change the passwords on active directory, not yet. But, all people are saying, okay, this is wonderful, but I don't have open-end-up. I have active directory and I want to use it. So, it's in the roadmap, but it's still not available. But, for information, this one has some hooks, so you can reset the password in open-end-up and also hook it to a change at the same time in active directory. If you have both directories, open-end-up and active directory, you can use the hook to push the password on active directory. But, if you only have active directory, you can, for the moment, not use it. But, it will be the case maybe next year. Maybe next year. Yes? Do you support private ACME servers for the certificates for these web services? Sorry, private, what? ACME. ACME. Let's encrypt. Do we support ACME or let's encrypt? Of course, yes, because you just have to run it in the container, yes? How do you handle applications which cannot use OpenID or Sample? Okay. And, where do you use the host headers and authentication? Yeah, so how we manage applications that are not modern applications which use either Sample or either OpenID Connect to do single sign-in. Lemana Lab is also compatible with the CAS protocol. So, we can also use CAS, but in the cloud, we say that CAS is not secure enough to do the cloud. And, of course, we can have a component in Lemana Lab. We have a component called the Angular, which is an agent that you can install remotely on your infrastructure system and can communicate through REST with the portal in the cloud. So, you can secure it, some local application with an agent on your side and let the agent deals with the session, et cetera, through REST API in Lemana Lab. So, we can do a mixed mode between the cloud and your local applications. It's over? Last question. Last question, a very good question, so. Can we authenticate users using certificates, personal certificates? Yes, you can use. The question is, can we authenticate users with certificates? Yes, Lemana LAPNG can use certificates, Kiberos. We are compatible with second factor authentication, WebOTAN, et cetera. So, we have a lot of methods. It's like K-Club, but it's French. Time's up. Okay, thank you. Thank you.
Add user self-management, brokerage and federation to your infrastructure with Keycloak
Adding a user-sales management brokerage federation to the infrastructure with Keycloak. Keycloak has been mentioned now and then in the previous talk, it was great to hear. I'm Alexander Schwartz, I'm just Alex, I'm working at Red Hat for the Keycloak project full-time and I'm also a maintainer since last year. I've been using Keycloak for several years. When I was a back at IT consultant, we were building applications, we were using it as an identity next-to-management solution and back in the time, a lot of customers did not have Keycloak, so we brought an application in there and the custom-built one, we put Keycloak next to it to do the IAM stuff and over time, then we built our applications for them to customers. They already had Keycloak, so it was great. Two years ago, I joined Red Hat full-time working on Keycloak. What do I do at Keycloak? I'm doing a lot of performance testing, database stuff, also a bit of LDAP. Keycloak has so much to offer and when I was reading the corporate presentations, this was then stating about Federation LDAP and I thought, yeah, I could present you this slide today and this is what I will do, presenting what's already existing in Keycloak and also some of the things that will arrive in the next version of Keycloak, like the current version is 23 and the next version is Keycloak 24 and you can already download the things that are shown today in the 90 build of Keycloak. Right, so yeah, and the agenda that I brought for today is more like a journey that I saw customers going through when they entering the identity and access management space. It's like day one is seeing a sign on a school, right? I need only one password to access all my services, so that's where it all starts. Day two is, yeah, well, I need to get a bit more flexible because I have maybe one directory with users, maybe multiple directories of users that I want to integrate, lots of applications and then day three, yeah, I want to eliminate a daily churn, like reset of passwords, user self-management and that's especially where the things come in that we have in Keycloak 24 around user self-registration and declarative user profiles, what we see there. So why is Singleton on cool? I said, well, users need to remember only one password, that's, yeah, and then they authenticate only one today. In the morning, usually when they get to work and then it's, depending on how you configure it, maybe more valid for 24 hours, for 10 hours, for eight hours, that's the policy of the company and then they can access all these applications over the day with the credentials they entered. And well, usually a password might not be enough, so you have a second factor, you have one-time tokens, you may have maybe a mobile app that generates these small codes, you have file keys, web auth and all that stuff, and maybe some applications need it, other applications don't need it when you access them, maybe you want to re-authenticate during the day when you access a special application, so all those things come with KeyClick. And well, not the last thing, but usually in the middle when you deploy KeyClick to your organization, you want to theme the front-end, right? It should look like at least the colors, maybe the logo of your organization, it's to make your users feel at home. It might seem like a small thing, but it really helps the exceptions of that in an organization. So I say, even if you're deploying a single application and need an identity nexus management for it, it makes sense to deploy KeyClick for that, because you then don't need to reinvent it yourself, right? And doing user management right with all the bells and whistles is not a nice thing. So how does KeyClick work in the end? Like you have a user with maybe a mobile device, maybe with a regular device, and they log in with KeyClick, so KeyClick presents a login screen, does the handling of all the second factors that you come about with, and then the user sends from their browser a token to the services in the cloud, whatever they are, and the application can then either check the token directly by inspecting the token's cryptographic signature and the timestamp, or it will send this token, for example, to KeyClick to figure out who's that user, I want to retrieve some additional information. This is possible. You might also use that token, I don't know, when you're integrating other authorization services that then return like OPA or something like this, where they come up with is this user allowed to access this service or not. So that's the basic setup, and KeyClick, you can deploy it as a single container connected to a bunch of databases that you can choose from, be it Postgres, MySQL, Maria, Oracle, MSSQL server, usually, well, as an admin, you don't, or even as a developer, you don't have a choice, like usually an organization has chosen a database, they know well how to do backups, how to restore, how to operate them, so we give you a choice which database to connect to, and then you have KeyClick either deployed as a single whatever binary container, or you deploy it using an operator with a high availability setup to the Kubernetes of your choice, to the bare metal of your choice, that's what you do and do. And well, this is what users then usually see when you don't customize a login stream, it's a username and password, right? And once I log in, let's see if the demo goes with me, so I'm logging in here, maybe it's expired, oh it hasn't expired yet, so I get an admin screen here, so where I can set up clients, basically clients or applications, and have client scopes, users groups, so all of this and rows somewhere as well, right? I can configure all these in a web UI and it will, in a very basic installation, will just start to be in the database of KeyClick, and it will then take care of all that. So, yeah, that's a simple start, you have your application, it's secured, it's all working well, but then, yeah, you usually don't start in the green field, that's very rare, so you need to become a bit more flexible in what you're doing and to integrate with all the existing stuff that's already in your organization. So for example, there might be one LDAP, there might be many LDAPs in your organization, I think it tends like whenever there's a merger there might be other LDAPs joining, other user factories joining that you want to integrate with and there's Kerberos, so people might be already authenticated on their machines, especially in corporate environments, there might be some service around in your organization or external to it, but it only talks summer and your applications want to talk open or disconnect, so it's great to put KeyClick there in between, there might be also other OpenID Connect things, but then why would you put OpenID with KeyClick in between? Yeah, well KeyClick can train it to summer or KeyClick can also give the right tokens to the right application because maybe your this one application is on a special diet to require that or the other attributes in the right tokens and KeyClick can do that in the way this application is then finally working. You can also create your own extensions to KeyClick, so for that you need to get familiarize yourself with a bit of Java and then you can integrate custom stores, you might have, well it's called legacy usually for a good reason because maybe the old systems, the customers are known to those systems, they make money, you can't shut them off and you want to integrate KeyClick with existing user stores, you can do that, you can then connect it to a database directly, call some rest services, wherever you get these information from and make it work and also we might hear later today about SCIM integration, all that is then possible by adding extensions to KeyClick on this area. So we use everything that is already there and integrate and connect with that, so that's very, I say, essential on your day two things when you say yes KeyClick is cool, single sign on works, but then you need to integrate with a lot of stuff and yeah, KeyClick hopefully makes that a lot simpler for you. All right, so that's, yeah, some diagrams around that, so identity brokering, Kerberos, Samo, OpenID Connect, you can connect to those and yeah, we can show that in the demo shortly, well the good thing about Kerberos is you don't have, your user might not see KeyClick at all, look, well the user tries to actually see the application, the application wants to get an OpenID Connect token or some Samo token, it forwards the browser to KeyClick, KeyClick will negotiate with the browser that the user already logged in using Kerberos and then will not even show the login screen but forward directly to the application back with the right token so the user can continue, so the user will never see the login screen, so there's Kerberos, but on the other hand if on that system the user is currently on, Kerberos is not configured correctly for whatever reason, it will fall back to a login screen and you can use the regular credentials and then what we see in a second maybe use that credentials and verify these credentials against an LDOT, so it's yeah, it's like Kerberos but without the Kerberos it works the same way with the same credentials in the end, we can get all these social logins integrated, so with those then the user usually has login screen where they pick the right social login provider, they want to use to authenticate, it might not be the right thing for corporate environments, but it might be the right thing when you are integrating, well your public facing website with users coming around that they want to integrate, yeah and Federation as I said OpenLDOT is their active directory, custom user stores, you can have none of those when you want to store things only in KeyClick database, you couldn't have one of those but you can actually have multiple of those as well, so I wish or I hope for you that you have a simple environment but on the end, on the other side you can't really choose when you are, I don't know, there's another merger coming around the corner and or yeah then you might have another directory to integrate or maybe a customer has some users they want to bring there and you want to integrate as well, so yeah looking at the demo, so you can identity providers that would be OpenLD Connect, all the social logins that you want to integrate with here, they're here either custom or predefined with some defaults or some sensible defaults, user Federation I already configured LDAP here, so LDAP telling you okay this is, yeah I'm running some patchy directory server here locally on my machine because it was simple to set up, the usual LDAP I'll say, I can choose if it's a read-only writable or synchronized, all these things are here and then yeah not all OpenLDOTs are, or sorry, no not all LDOTs are the same, they need some special configuration seen here, yeah and you can configure it that it matches the organization, there's usually also some methods so there are lots of attributes in LDAP that you want to leverage either to put them into the tokens, that you want to pass on to the applications or that you want to leverage and the user into endpoint where the application can then carry those if you don't want to put them in the token, so all these things can be configured here mapped on a per realm, per LDAP connection in the needed to work, eventually you can also configure it on which application should get what kind of attribute and what kind of token, yeah but then it's the real world catching up on this, the simply can make you set up the better you'll be off but on the other hand you need to make it working with the things you have and I, well we're hoping that we got Keeklog in a way that it's not standing in your way, so let's go on to day three, a limited turn, so all these repetitive tasks that you have to do every day when it comes to users, they're well annoying for admins and also annoying for users, ideally they want to do these things themselves, they don't want to be bound to some opening hours of IT or so some things, I've shown a minute as users required actions to basically you can as an admin choose, well as an admin you might have sent out an email please enable second factor and you sent another email saying please finally enable second factor for login and then you say well now's the time I go through maybe all of my users or some of my users let's, on the next login they need to must enable the second factor no matter what, so you can do that as an admin and you're done because no one will enter your system without a second factor enabled. Also password recovery, you can add a link to the login screen we will do in a second that you can do password recovery that you send out an email, click, the user can click on a link and you will, it works with an external with an external database of key cloak but it will also work when the user's on an LDAP, it would also work when the user's on an active directory, also well this kind of bits work when you're using the password recovery mechanisms of key cloak. Also well in a corporate environment you might not want to self register for people right, so they probably need to sign a paper contract first but then on the internet on the public facing side you want the people to self register, again this is something that comes with key cloak. Also once you're registered you want to maintain the data yourself as a user maybe update your mailing address, your blog, your social handles whatever all these things should be managed by the user themselves and key cloak allows you to do that and this is something that greatly improved over the last releases in key cloak 23, you can enable it as a preview feature and we are pretty sure that we will have that in the final release of key cloak 24 enabled by default so that you can really use that in a very good and configurable way. So yeah and it's great to resolve the need for either phone calls or tickets or chats in nowadays right. So let's go back to these required actions so there are lots of them so let's maybe have a look here. So in authentication for each realm I can really decide what I want people to be required when they log in or to be checked when they log in for example one-time passwords, maybe you want to have them confirm the terms and conditions, I need updating the password, update profile, verify email address that we sent out an email with a link people can click on it. So that's for public facing registration very useful. WebAlhtim is in there, people should be able to choose their locale, we want to verify them the profile and I can enable those and maybe also have maybe some policies when and why and then on the realm settings I can, well this is basically the tab called login which configures the login screen and I say okay from now on user registration should be enabled right. For good password flow yeah I want to have a link there where I want to allow that people can reset their passwords and once I do this I can just when I sign out here now these fields have appeared so for god password link is here and I'm asked for my username and email address and I have a register button where I can register with some fields that are here required and if I then log in again and we go for the user profile, yeah there we are. So this is the configuration where I can say these are the fields that exist that should exist for both the admin to be edited in the admin UI they should these are the fields that should exist on the user self registration form and those are also the fields that are available for user self management. So basically you can think of this as a form configurator and for each of these fields I can go in and say okay this is the name to be there's an attribute name like a technical name I can reference it by later a display name well this is an automatically localized name here but you can put a simple name in here as well I have attribute groups here so we can group things on the form for each field I can decide who can edit this either a user or an admin who can view this either the user or the admin and then can put lots of validations on top of each field about the length for the username it's the minimum length of three the maximum length of 255 characters for username I can there some prohibited characters so you should use regular keyboard characters for that we also don't want to have any homographs in here basically letters looking like real letters from a Latin alphabet for example but they're actually from a very different alphabet so you could have like a user registering with a username that looks like an already existing username it might need to lead to confusion so that's a really sensible good security by default so and I can add more things here I can also add annotation and saying how should this element be formatted should it be an input type should it be a helper below and below the input size the columns I can also reorder those so it's it's basically a form builder and the form builder will be consistent in all three places user self registration user admin form and or admin form for users and user self management right so when I go here for example the block so I can I can change it here with a different display name and once I go to as an let's do that as an admin to manage my own account then I would see here okay now it's renamed and another field here and I can then choose when is this shown is it shown is it mandatory on on first login is it maybe mandatory once a month like it can have maybe a scheduled process that inserts that actions on each login once a month and then I can see here all these things are then how I configure my login flow and have this information populated by my users so yeah we saw that we saw that as well right we have this recovery we have seen the configuration how we can configure those with validations and attributes and all necessary information and again the three areas where you have the admin we on the left in the middle the registration screen and on the right the personal information the users can self manage and all this information is either stored in key cloaks local database or if you then choose to store it in an external service like LDAP it will be stored in the external service of LDAP right so that's basically almost the end so while we saw day one singer sign on is cool and it makes a lot of sense to not reinvent yeah identity and access management even for a single application database you want to get more flexible and integrate with a lot of existing security infrastructure in your organization once you are a happy user of key cloak and then day three it allows you a lot of automation around users when you really want to scale especially when you if you want to scale with lots of users signing up on the internet and if they want to manage their stuff on their own and you don't want to get calls from them or emails and stuff so I brought some links so this is the key cloak homepage please pay the visit we have some docs on there how to install it the key cloak nightly release I linked it directly so and if you go there you can download the zip file if and extract it but there's also a docker registry on kio that you can yeah have a container built ready for that with a nightly release if you're on github please give us a store there's the key cloak book second edition that been published last year so if you've been using key clock maybe two or three years ago you might know that this was based on eap and wi-fi it was now moved to quarkus so some of the things changed so it might be good to look at this second edition something that is my of my very personal goals I want to start a key clock hour of code so to get more people into contributing so I'm planning for maybe once a month maybe every two weeks I think an online session um to get people familiarized with coding how we do how do we code in key cloak how do you maybe contribute documentation how do you um yeah how do things work around key cloak and at some point we also want to bring in community to also review issues helping with triaging those helping if another community member creates a pull request maybe it also the community joins in and helps and helps to get that to a material level that we can merge it in so that would get some some weight from the shoulders of the maintainers that would be great so that's my my thing for this year I want to try out yeah so that's me so I'm around for the rest of the day so meet me here meet me in a hallway I also have some stickers of key cloak some postcards if you want to sell key to your managers or friends or colleagues so send them a foster postcard with key to it thank you very much all right we might have like two questions or something yeah and what is the best way to configure key cloak declaratively so um usually want to use the UI to figure out what's there and how it works and then you can one way is to maybe export the full realm as a JSON and then re-import it so that's like the full export full import there's also the chance that you there's a terraform hopefully open tofu compatible key cloak provision mechanism as well and there's a rest interface so you might use a rest interface to yeah use this API to configure it and there's a command line interface as well but the command line interface is basically a wrapper around the rest interface so that you can yeah configure different settings of a given client or maybe override the client with a new config that kind of way but it then depends on how you want to do things if you have the chance to um I don't know delete it and re-import it might be very helpful for test environments or if you're more bound to um incremental like database scheme immigration style of things that you really want to things like one step at a time and always in that order and maybe open tofu would take some shortcuts that might not work but you want to have maybe some migration so it's depending on what you want to do but it's the good news is it's all automatable. Just one question how key cloak can be beneficial in Linux ecosystem? So how can key cloak be beneficial in a Linux ecosystem so like if you then logging into you say um with a SSH somewhere or I haven't seen it this way but it kind of connects very well if you have like for example Kerberos around so if you have Kerberos I have them on my machine as well when I'm in a corporate environment that key cloak can leverage that okay maybe then okay to repeat it for the video so there was a talk on 20 23 on FOSTEM on password list authentication on Linux here at FOSTEM right okay one note there's a redhead SSO Ansible Collection yeah okay okay yeah so there's also redhead SSO Ansible Collection that allows you to configure key cloak right yeah the old name well key cloak is the upstream project it's a CNCF project there's also redhead SSO like the thing that you get with a subscription from redhead where you find tools that work with that as well and from end of last year there's no longer redhead SSO but redhead build of key cloak so it's going to be easier to find in the future so whenever you need so for something for key cloak it will be both the upstream project and that what of redhead offers for a subscription okay I think this time is up thank you very much
Ipa-tuura: FreeIPA connector for Keycloak
Hi guys. So now we have Francisco Tivino who will speak about the Ipatura project, which is a FIJK connector for people. That is awesome. Yes, thank you Ikea. Thank you. Thank you very much. Yes, my name is Francisco Tivino. I'm a principal software engineer at Red Hat, specializing in identity management systems. And I'm part of the Free API team. And I'm very excited to introduce you to Ipatura, a collaboration between Free API team and SSSD teams. And basically what you are allowed to see is a redesign of the system integration between identity management between key clock and Free API. And then the Alexander talk, the one before this one, it was perfect because he was explaining all the concepts, all the, he was giving a very good overview of key clock, what are all the features that are supported and specifically how to integrate key clock with other identity management systems through the user federation and identity federation and all the brokering and stuff like that. And he was missing one integration, actually. That's nice because he didn't spoil my presentation. He was missing the integration with Free API, all right, through the user federation. So, yeah, this is all about, yeah, well, this is very basic stuff that I would like to scope well the project, right? I just want to spend a few minutes talking about some of the background and some of the key aspects of this project that we have been keeping in mind as we have been undergoing this work. So as far as the background, IAM is an umbrella term. It defines processes to assess the right assets at the right time for the right reasons, right? Well, keeping an author right access and the control. So some of the common products are Microsoft Active Directory. This is for environments where Windows is dominating. So we have also identity management, which is Free API. So if you are familiar with it, it could be, yes, you can understand Free API like the open source version of Active Directory, but for post-example environment. This is Linux, right? And, yeah, because basically it relies on the same building blocks like Microsoft Active Directory, like LDAP, Kerberos, PKI, CA, DNS, right? And, yeah, another one is Key Clock, okay? It doesn't need an introduction. It's a more scope for modern applications and services to the users in general, right? And there are more of the solutions such as Octa and TrID where they are more oriented to cloud-based environments, right? So when comparing these solutions, we soon discovered that one of the main differences is like number of assumptions regarding how users and groups are controlled, okay? So, for instance, Free API is tied to user and groups like this. They are necessary for the applications running in a post-example environment, right? On the other hand, Key Clock offers authentication services to modern applications where these applications are deployed usually in the cloud environment and the identities, they completely differ from the system level ones. Meanwhile, AD, for instance, Active Directory relies on other identifiers like security identifiers or organizational units. I'm not going to talk about post-example environments because the very last talk of today is all about that. So I recommend you to watch that one. And, yeah, the key point is that sometimes you are happy with a standalone solution with all your conflicts in place, or that is not a useful case. I mean, many times you will need to integrate multiple identity management solutions so that the same user can access different operating systems as well as different cloud applications with the same set of credentials, right? So, luckily, IIM, this umbrella term I was talking about in the first slide, defines some processes like single sign-on and also identity and user federation where a user basically will be authenticated directly once, okay? And then the fact of authentication is consumed by other services for a certain amount of time, regardless how and where the application are operating, right? So that's basically it. And then when talking about this, this one, and this one, federation that Alessandro was talking about that, Yaki Clock is very well known for providing these functionalities, okay? So, yeah, the way it works is like when a user logs in, Key Clock will look into its own internal database, right? And if the user is not there, it will fetch or iterate over every user storage provider that is connected until it finds a match, okay? This is basically how it works, right? And guess what? Key Clock already supports the integration with API, yeah? As a backend to look up for authenticated identities and so on, right? This is already supported. You can do that. So, yeah, by default, well, I'm not going to spend a lot of time here because this was explained by Alessandro. So the one on the left, the one on the left, it was the one from the previous presentation. Yeah, includes LDAP and AD provider so you can federate with multiple LDAP servers in one Key Clock. And, yeah, at the same time, and the one on the right is the one that I'm going to focus on, is that Key Clock also includes SSSD plugin, right? And it provides access to multiple identities and authentication providers from free API, right? And also very nice features like failover and offline support as well. So then what is the problem then? If we support everything and we can integrate with anything? All right, let's have a look. So what are the problems that we are trying to solve? So one of the main issues is, and the most important is that we are missing feature parity between those integrations. I mean, they are really different ones. I mean, you can integrate from a Key Clock to the user federation with LDAP, with Rehabilitation Server with AD in a different manner because we support a lot of features there. And at the same time, you can integrate with ADM with free API, but there is a huge limitation there. It's that it's only a read-only interface. So Key Clock, yes, can fetch information from SSSD, that's all. It can write there. So that means that if you do changes in your user database in Key Clock, you will need to drop by the free API and do the changes as well. So this is kind of a very limiting factor, right? So yeah, another one is that if you want to integrate with SSSD, you need to deploy SSSD in the same host or container where you are running Key Clock. This is also a limiting factor, especially when talking about cloud environments and open sieve, where you usually deploy the bots in different hosts and different machines, right? So that's another limiting factor. So then, yeah, we are thinking about already designed then. We need already designed to already design this, okay? And then in this slide I have, yeah, well, this is where the IPaTura service comes into play, okay? And then these are kind of a list of requirements when redesigning something, yeah. We are thinking about to support all these things at the same time. So we need a common API for managing identities. So the requirements are to be able to read and write. This is the most important one also to authenticate users from any integration domain. At the same time, now that we are redesigning everything, we are going to try to simplify the integration. And one idea is to replace all the existing plugins by just one plugin for all of them. So you can easily configure and then you can connect with anything. Another one is the cloud-friendly, maintainable solution. Yeah, we need to get rid of this limitation about deployment. We need to get a key clock and this is in the same container. This is kind of difficult without performance impact. That's always there. And ideally we shouldn't reinvent the wheel. So we need to ideally rely on existing open source projects, okay? So then, now this is a question for you. How many of you know about the scheme? You can raise your hand. That's that kind of the whole part of the room. That's nice. So it stands for system for cross-domain identity management. Okay, this is a protocol. This protocol finds or helps with this chain of user identity data between different identity management systems. It simplifies the provisioning, the updating of attributes, also the provisioning of users, of accounts, and it helps with interoperability. Okay, so it sounds like this is what we need, right? Yeah, so the idea there is to implement a scheme server for free PA as a backend to process all the requests coming from key clock, right? So, yeah, the idea is to don't start something from scratch. So based on this protocol, I think there are kind of 10 or 10 to 15 projects already implementing this protocol. And we were paying attention to one in particular, which is Django scheme 2. And this is why it's written in Python. And the reason is because free PA is also using Python, especially the API. So it's very similar. I mean, the interconnection between this, the scheme server and free PA will be some sort of straightforward. So, okay, let's start building it. Let's start building a new service. Okay, we mentioned that it must be a cloud-finally solution. So we are targeting a container. This is a container. Okay, on the left we have key clock. On the right we have free PA. So the first thing to do to add into the container is the Django framework because there is where we have the implementation of the scheme based on an open source project. Okay, this project in the container is already exposing some sort of endpoints. Okay. What is the next requirement? It must be secure enough, you know, the Django framework implements an HTTP server, but this server is kind of for developers. It's not, I mean, it's not protected at all. So, okay, so we can include Apache. It's a well-known server. And we can enable HTTPS for production-driving environments. So, all right, we add Apache. We connect Apache through the WBSGI connector to Python, and this is from Django. Okay, now we have a secure API. Okay, what is the next thing? Yeah, it must provide a generic API. So, let's rely, yeah, the breach can, at the same time, I mean, the idea is to, this is a breach, right? And then we have connected to Key Clock already, so through the user federation storage and other identity providers, brokers, we can connect to the container. This scheme protocol will help us to translate, and so we can make another call to free API through the API. And this is how, basically, how we connect everything. And it's generic because it's based on the scheme. All right. And then, yeah, we need, I was mentioning that, we need to read and write. Okay, so we implement two interfaces to connect to free API. And then, about the performance, well, deploying a container with a start service, talking to free API, making API calls is kind of expensive. Okay, but no problem because we can rely on SSSD because SSSD is implementing a catch. Okay. And then, okay, let's include SSSD in the container, right? Let's connect through the Django, through the divorce info pipe. And this is how we can access to the user materials, identity materials, right? All right. So, looks like we are almost done, but we mentioned that it must be generic enough. So, these interfaces, the right and the right one, we can easily configure them to talk, not only to free API, but also to any Active Directory, and through LDAP and also any RojHDS Directory server, right? Okay, so this is basically the idea to unify. Okay. And what about key clock? Yes, that is support the scheme calls because, well, we have implemented a scheme server. It's pushing a generic API, but key clock, well, doesn't support the scheme calls as a client, right? So, okay, as Alessandro mentioned, there is another, well, I mentioned that there are two ways to integrate user federation with an identity management system. Elavanidian, the other one is SSD, but there is a third one. The third one is that you can implement your own user data connector key clock. So, then this is what we did, basically. We implemented, and this is another project you have in GitHub, and it's basically a custom user federation that is capable of acting as a scheme client, all right? And this is what we need in key clock to connect with the server. All right, and this is how it looks like. You go to key clock, and then you will see all these options. You will see parameters for connecting to the bridge. It's basically the server URL and the username and password, but, well, we have other authentication mechanisms, probably, but this is basically, you specify the details about the integration domain. You can choose between the type IPA, is free IPA, you can choose AD, but also a lab. So, this is just for summing up, and then if we combine both projects, then we have it. And let's say that this one, which is the server running in a container, is basically supposing a lot of endpoints. For instance, there is one, which is called domains, is kind of an administrative endpoint. It's basically when key clock tries to enroll with a scheme, is sending some details, and then a scheme is implemented on the automation to make an enrollment with any other identity mind-saving system, right? And then once this is done, key clock plug-in can simply make user calls to the user federation storage to fetch for users or write and read whatever. So, yeah, it's important to note that key clock plug-in now, it doesn't communicate directly with the databases or with the other identity management, it's kind of only talking to the scheme, which is a container. So, let's go for a quick demo. Yeah, this is, in the demo, this is what is going to happen. Okay, basically, you will see key clock, you will see free APA, and a container running in another host, and we will make an HTTP post-request to the domain's application in the bridge, and the bridge will be capable of doing all this automation, because those are steps that they were done by the administrator in the past. I mean, if an administrator today wants to enroll user federation to free APA, it's going to do a lot of automation, a lot of steps there, like file a service, the proper role with the proper privileges, and generate key tabs, and blah, blah, blah, right? So, this is fully automated now. And once the other process is done, now key clock every time is looking for a user or something, it will make a generic call to a scheme server. So, what this whole looks like is a post to the scheme server, which is in JSON format through the recipe, and this service will translate that into an APA API call, and it will make it to the domain. Okay, so, yeah, I love the live demos, but I have to admit that today a little bit cold, I have a recording, and yes, I think I want to do the real demo because I have all the infrastructure deployed in Red Hat, and then this is everything recorded, and I don't want to expose DNS names or internal IP addresses, so I have a video anyway. And so, yeah, if this works... So, no. Okay, let me see what is going on here. So, very quickly, I have a quick one, so how many minutes do we have? Okay, then I have a three-minute video. So, yeah, there we go. So, yeah, the three consoles you see on the bottom, the one on the left is key clock, the one on the middle is the bridge, and this one is free API. So, this screen is key clock, we are authenticating there, and the same in free API, all right? Then we go to the user federation so that now we see it was quick, that was super quick. It's quicker than the speed of the light. So, yeah, I wanted to show you that there is a new user federation storage there. Yeah, you see? Wait. Let's see if I can... Yeah. So, these are the ones that Alexander was talking about in the previous presentation, and this is the new one, okay? So, you will see that. All right, let's continue. So, yeah, key clock, free API. So, these are the services, because you know that the machine will configure things here automatically, okay? So, now I'm going to show you where the container is running. This is a different host, by the way. It's not tight to key clock anymore. So, that's the... If you do podman.ps, this is the container, right? For the demo. It does the proper lead network mapping. You see, it's not running HTTP. Apache is not running the host, so if you log into a container, Apache is running inside the container. There we go. Yes, I'm piping the error log just to see that there is movement here, but I'm not cutting the content, because otherwise we will see a lot of IP addresses and internal stuff. So, I'm too lazy. I don't want to go to the key clock and type all the parameters, so that I do a cool call, okay? Or, well, this is with the KAdmin. You can see all the parameters that we are configuring for integrating with free API. And once we execute that command, the container is capable, you see the activity. And now we have the user federation enrolled, okay? And you will see a new service here, which is this one, the bridge, okay? It was done automatically, so you don't need to worry about that anymore. So, now everything is set up, everything is configured, so now you can manipulate users. When we create one user, for instance, I'm trying to file a user in key clock, so right away after we click on create, the user is added to the key clock database, but it's making a call to the scheme service, and the scheme service is redirecting the user to, directing the user in the IP as well. So, it appears here, okay? Yeah, that's the user, and yeah, you know, we can do all the administrative stuff, like changing for instance the email, everything is fully replicated to the free API and the Dimachment system, based on all these cool calls that are happening to the bridge, so the user is there, you can see it from the click command as well. Yeah, the modification was currently provided. So, I guess I'm trying to delete the user now, and yeah, basically, it does a group of operations, I mean create, modify, list, and delete, yeah, the user is not there anymore, and also when you delete the scheme federation, it is unenrolling, okay? So, it goes to free API and then remove the service because it's no longer needed, okay? So, this is also fully automated. Okay, so, what you just saw here in this video, is the user provisioning, okay? We are not done yet. Let me see, because I have a bonus. If I can close the video now. So, the bonus is... Oh, now it's working. Now it's working, okay, when I don't need it. Okay, now, okay, this is a bonus. This is a working progress, all right? This is the other piece, the identity federation. You just saw in the video, the user federation, but this is the identity federation, and it is all about to expose another endpoint in the bridge, so that key clock can also make calls to the bridge, but now, not for user provisioning, or modifications, it's for authenticating, and this one is for Kerberos, okay? So, this is kind of controlling Kerberos, and then the scheme, I mean the bridge, Ipantura, is capable of translating that into an operation that we've modelled, yes, this API, using a proper kit up, and then free API will answer with the session cookie, or it will fetch from the SSD, or well, from the cache, and then it will respond back the session cookie, so that the cloud application that is running here, trying to log in into key clock, is actually authenticating in IPA, nothing key clock, okay? So, yes, and then, yeah, this final slide is about potential usage, so this can be used for synchronization of identities across different providers, as you can see. Also, we can use it to migrate all the users, because the beauty of the scheme is that you can do mapping of the attributes, so you can translate anything you have in any cloud application into something that is more powerful, like free API, and then with UIDs and UIDs that are generated automatically, I mean, it's amazing, and the good point as well about potential usage is that key clock, if we merge this in key clock, there will be a user federation that will be capable of connecting to any scheme server, it doesn't need to be this one. So now key clock can talk to a scheme as a client, right? And this service, as a scheme server that we implemented in IPA as a container, can be also used to connect with other clients, it doesn't need to be necessarily key clock, we can connect to Azure or Azure AD or any other, for instance, I don't know, anyone that is supporting the protocol. So, yeah. So, yeah, that was it. I think we have time for questions, right? More or less? One, two minutes? Okay. Yes, please. You spoke about intervention with AD, I want some idea of the client side, you would have windbind, would you be able to replace SSD with windbind and still use this solution? So the question is if we can replace windbind... Yeah, from SSD with this solution with... So, not yet, the answer is not yet, but yeah, I think we can look into it and potentially, potentially, yeah, it could be done. If we decide to prioritize that use case over the others, why not? Yeah, but not yet. What's the not yet part of it? Say again, please. Will it happen? Will it happen? Will it happen? That's a good question because we haven't done any release. This is an upstream project. We have two upstream projects. So, yeah, our intention is to make this to happen. This will help simplify a lot the key clock user integration with identity management systems and also it's very convenient now to get a deployment independent from the host so that you can get a container, this is kind of on the towards the cloud, cloud-based applications. So, yeah. So, about our plans, our plan, I mean, the key clock plugin is more or less completed. Now we are thinking about sending to key clock so that it will be emerging upstream first and then it will appear scheme client there. And the service, as soon as we finalize the Kerberos authentication redirection, then I guess we will be in good shape to make a first release in upstream, okay, and later on, if once we prioritize a lot of aspects, then, yeah, potentially yes, it will replace or, especially will replace the SSD connector we have in key clock, that for sure, okay. Okay. Thank you.
Passkey authentication - the result
So guys, I will start now the presentation. My name is Sikert Pedrosa, and I'm a senior software engineer at Red Hat. Well, today Erwin was supposed to come here and present something about garage door opening with pass keys, but apparently there's some time of curse because, well, he couldn't come. And I will present a topic that I was supposed to be presenting last year about pass keys also. So I will show you today the final results because last year, Alexander, who is there, kindly volunteered to present my talk, and now I will do a kind of learning talk very fast about the problem and the solution that we gave. So introduction. As you may be all aware, in January 2022, the US government released a memorandum where they constrained their agencies and their, the companies working from them to use telotrast architecture. So if we focus just on the topics about user authentication and authorization, we'll see that the memorandum speaks about centrally managed users, and more specifically about using multifactor authentication and passwords. On top of that, it explains that they should use single sign-on as much as possible, and they mentioned two specific protocols to achieve this. One of them is PIF, or smart cards, and the other one is Fido2. So let's speak about Fido2 a little bit, why users should be aware of this authentication method and why it's important for them. First of all, because it's passwordless, so you don't need to remember lots of passwords. You also don't need to, sorry. So you don't need to be aware when there's some type of leak in one webpage or some service that you are using, because the private key resides in this token that is here, and it never leaves it. So you will not have any problem with data reaches or any other kind of problem. On top of that, it enables a strong authentication by providing multifactor authentication. So the keys that I'm using, they usually ask for the pin, but you also have some others that ask for some fingerprint or some other kind of biometric reading. So the design is quite simple. So we have a user with Fido2 key, it goes to some computer, connects is there, and using SSSD, they will contact the ADM server and authenticate there and get a Kerberos ticket to do the single sign-on. So in this case, we are speaking out IDM server because the best integration is achieved with it because we will get the Kerberos ticket. If you are using some other type of a lab server, you will be able to authenticate, but you won't get the Kerberos ticket. So if you want to know more details, the first link is the talk that I was speaking about before that was given by Alexander here at Fosben last year. Second one happened last year also at Defcon in June, I think, if I remember correctly, it was me giving it and you will have some progress in that area. So now it's time for the demo or the demo Gorgon, who knows, because I never was able to do this demo lively. Yeah, you know, it's like that. So first of all, I'm authenticated in a SSSD client and we also have an APA server. So I will add a new user, which will be called Icar. And here, the important point is that you need to set the authentication type to Paskey. Sorry. So the first part, I guess you are aware of it, if you are IPA users, but the second one is kind of the new thing. So I will create a user like that. Okay, it already exists. So let's try another one. So that's my sister. So this is your trust. So we have created the user and now we need to register the Paskey for this user. Okay, yes, with that, I will present there. Just again. I guess I don't, oh yeah, I forgot the name. Now I need to enter the pin. And now, well, I need to touch the device. The device is already blinking, so it's kind of obvious. And you see there down below the Paskey mapping data. I will show it to you. Well, I will clear the screen and show this user. So we have user I know. And here we have the Paskey mapping data. So now I will change users because you know, if you are root, you can authenticate as any user. And I will try to authenticate as user I know. Okay, I need to set. Okay, first of all, you need to insert the Paskey and this presenter. You are prompted for the pin that you need to input. And finally, you don't see it on the screen, but the LED is blinking here on the Fido 2 device and I need to touch it. Okay, perfect. So we are here. We are using I know and as we are using a free IP server, if I saw it, okay. We have here a server ticket. So that at this point, we would be able to authenticate to any other service or application that is enrolled to this server. One thing to notice here is that, well, the key needs to be physically connected to the device where you are trying to authenticate. Okay, you cannot do it remotely with SSH or something like that. This is important because, well, I heard some people asking me this question and well, currently it's not possible at least to do the remote authentication. Okay, so some conclusions. Availability of this feature. First one is SSSD 2.9.4. You can try with the 9.2.9.0, but it has some bugs, so I would recommend you to go to this one. We also have free IPA for .11.0. And if we are speaking about specific distributions that have this software, you can use Fedora 39 or CentOS Stream9. Some reference links. So we brought three design pages, two for SSSD and one for free IPA. The first one for SSSD is about doing the local authentication and the second one is about doing the Kerbalos integration. And if you would like to test this feature on your own, I brought a Fedora Magazine article that was kindly translated by a Chinese reader. So you have there the demo and how to work with it. If you don't want to mess up with your production environment for some reason, you can use SSSD CI containers. This project is, well, it provides a set of containers that you can use to test SSSD IPA lab and things like that. The only, you will find these instructions in the GitHub page. The only thing that changes is that you need to run, well, you need to connect the Fido2 key first and then you need to run MakeupPasky instead of Makeup so that you can redirect the Fido2 device to the containers. Okay, so that was all. I think we have some time for questions, right, Tvenho? Yes, we have four minutes. Thank you. Thank you. Thank you. So the system didn't ask you to touch your device, is that some limits or was that just an implement? No, it's a feature in reality because you can have the Fido2 device connected and some applications or some malicious actor could try to sneak in your device is already connected, they could use it to perform the authentication. So if you press it, you demonstrate that it's actually you who is trying to authenticate in this device. Thank you. Can you speak louder? You indicated it will not work remote at the moment. However, would this possibly work with USB redirection, for instance, it's a fix. Yeah. So the question is, would this work with USB redirection? The answer is yes. Yeah, we would be able to do that. Question here? If we lose this key, what happens? Okay. So somebody asked a good question here, ask what happens if you lose your key? You are doomed. So my recommendation is to have at least another authentication method, or you could have two keys. That's what I have. So I have one here and I have another one at home. If I lose one, okay, I won't be able to do this demonstration here or to authenticate somewhere, but when I arrive home, I have it there and I can use it to authenticate. We cannot store somewhere the algorithms. No. No, the private key. So this uses public key algorithms, and the private key resides in the key, and as long as I know, you cannot keep it out. Yeah. So. Do you have any plans to support built-in platform authenticators like Windows Hello? So yeah. So the question is, if we have any plans to integrate Windows Solo, you said, right? No. The hardware now has the FIDO key in it. Yes. You don't need a USB to extend the hardware. Every piece of hardware now has FIDO built in. Can we just use the platform authenticator? Not yet. I will answer you. Okay. Yeah. Yeah. Not yet. What this project supports is big FIDO 2. So any effort to extend it to support a platform authenticator should be against big FIDO 2 project. And then we will inherit that one. So the question was whether we are supporting platform authentications, and the answer is no. We don't have those plans yet. So Tveniu, how much time do you have? I think yes. We have room for one more person. Okay. Yeah. So. How do you solve the pin code? Like, which action do you use to install the pin code? And do you have any pin code policy? Okay. So the question is which is the cryptographic algorithm that we use to store the pin. And the answer is I don't know. This is embedded in the FIDO 2 key. So in reality, we ask for the pin, and we realize this information to the FIDO 2 key. And it's them, the FIDO 2 key, who does all the decryption, and who, well, it signs an attestation that it's you who are doing the request and send this to the server. It's PQCSD2. Sorry, can you repeat? It's PQCSD2 normally like a key derivative algorithm which you use for the project. Thank you. All right, all right. Okay.
Post-Quantum Cryptography transition: where we are now
Okay, great. Let me introduce myself. My name is Dmitry Bedovsky. I work in Red Hat for several years and maintain the open-estation, the open-estate. I am also involved in development of open-estate, a member of the Open-estate Technical Committee. And my current work is dedicated to post-quantum transition in Red Hat. So, first, brief reminder. Why do we need post-quantum transition? There is a wide consensus that quantum computers, if they ever happen, nobody knows, nobody knows when, will break the traditional cryptography in sense that digital signature becomes forgible, decay generation becomes reversible, and so on and so forth. So, if a malicious sector records the communication now and they are still secret and confidential to the moment when the quantum computers happen, they can get your secrets. Not sure it will happen soon, but this is considered as a threat and it means that the technical community has to implement quantum resistant algorithms that will be unbreakable even with post-quantum computers. So, some words about challenges we have. First, as quantum computers happens, we can't trust the existing algorithms, as I mentioned before. Second, well, when we implement new algorithms, they are not tested for a long time enough, so we also can't trust them too. For example, in the NIST contest, one of the algorithms that was moved even to the fourth round of the contest was completely broken without any post-quantum computers. It's a pity it was a wonderful algorithm. So, currently, a lot of efforts are related to providing so-called hybrid schemes when we use both classical algorithms and post-quantum algorithms simultaneously and combine them in this or that way. It can be two different signatures. It can be some combination of calculation, but again, if one of algorithms is unbroken, the second still provides some relevant security. The second area where we can expect problems on post-quantum transition is related to key size. Well, let's compare the key size for classical algorithms. RSA, well, practical 3K bits, means 400 bytes of the key and 400 bytes of signature, right? For the deletion, one of the algorithms choose for standardization, and the digits I provide are not for the most strongest version. It's for some intermediate version of it. We will have more than one kilobyte of the key and two and a half kilobytes of the signature. So, as a key and the signature are parts of the certificate, as a certificate doesn't go alone. You have a chain. You can imagine that, well, currently you have, say, four kilobytes of a certificate chain, and switching to deletion, you get, well, 18, 20, something like that. Also, we should expect performance problems because new algorithms will, with high probability, be much slower than existing. We will have compatibility problems because, well, other implementations of algorithms will contain these or that mistakes and probably implement various versions of intermediate standards instead of final, at least at early stages. And sometimes, I am not sure, we will meet problems with middle boxes analyzing traffic, passing through them. Well, is it something known and should go forward or is something bogus and they should be stopped? Well, let me remind that when TLS-103 was in the process of standardization, people have measured and found out that something between five and 10% of TLS-103 traffic don't pass through middle boxes. And the TLS-103 protocol was significantly redesigned to better mimic TLS-102, which was already familiar to middle boxes. And, of course, when we are speaking about network, we also get traditional problems. Well, big keys doesn't fit the TCP or UDP packets. We have to deal something with, for example, DNSSEC, which is currently stateless and expects that the response from server comes in one packet. And, of course, if you send a little request and get a huge response and use UDP protocol, well, all the protocols that rely on post quantum algorithms will be a good chance to implement so-called amplification attack where you send a legitimate request to a server, but spoofing the IP address. And if you use UDP, the response, which is much bigger than the initial request, will go to the victim computer and so distributed denial of service will be implemented. Okay, now when I briefly told about the threes, let's go to something more positive. First, we have several standard bodies that are involved in the process of post quantum standardization. NIST, which organized the post quantum contest, has chosen three, four algorithms for standardization. Here are three of four links to the drafted standards. Kiber is the algorithm for key encapsulation. Deletion is the algorithms for digital signature and Falcon is Sphinx and Falcon are also algorithms for digital signature. The standards, the final version of standards are expected to happen in Q1 of this year, but did not happen yet. Then, when we have algorithms, we should specify the usage in protocol. Okay, sorry, how to switch it off? Yeah, sorry. So, IETF is the standard body which works on protocols. The work happens in, well, in almost any working group that is dedicated to cryptography that is in the so-called security area of IETF. And it was created a dedicated group named PQEAP, which will cover the protocols that currently don't have dedicated working groups, such as SSH, for example. I will briefly speak about it at the end of my presentation. And also for hardware implementation of the keys, for example, tokens and HSMs, the standards are developed by OASIS group. As far as I remember, well, several weeks ago, there were no final version of the standard. There were some drafts, but they're not public. So, despite lack of the final standards, you already are able to use Fedora for experiments. We have chosen LibuQS project. It provides an implementation of a wide set of post-quantum protocols. For Fedora, we build only those which are chosen by NISTO, the standardization, I'm sorry. If you want to play with something else, you will probably have to rebuild it yourself. And, well, LibuQS is a part of OpenQuantumSale project. And they also provide some fork of OpenSSH using post-quantum mechanism for case establishment. And what's also important, OpenSSale provider. Let me briefly remind what is OpenSSale providers. It's basically a plugin style mechanisms that allow you to add or modify functionality of OpenSSale, including providing the new cryptographic algorithms or hardware back implementations. In Fedora 39, that was released at the end of 2023, we have OpenSSale 3.1, we have LibuQS 0.8, we have OpenSSale provider 0.5.1. And we plan to update all these components in Fedora, Rochide, LibuQS, and OpenSSale provider are already updated. And we are currently finalizing the rebasing of OpenSSale to next version. I'm sorry, I am lazy and not brave enough to provide you a live demo. But well, it's quite simple. If you have a Fedora machine, you can do it yourself. You should install, okay, as provider, the first line. You can, then you should generate the K-Pair. I have chosen electric curves, but it's a matter of taste. And then you just run OpenSSale server. But now you should exactly specify what groups for K-Exchange you plan to use. So it can be done with a common line key groups. And here, if you see the group's names are in red, the names consist of two parts. One, X25519 is a classic cryptography algorithms and the second, Kiber is a post quantum stuff. The second group allowed for K-Exchange establishment also have the same structure, but uses a different parameter for classic part. And now when you have run server, yes, it's a demo server, then you can also connect to it. When you run the bad connection, I strongly recommend use the key trace. It provides in more or less human readable form the handshake process. And trust me, you will see that you use the hybrid algorithms for K-establishment. Oh, well, S-Plant and the server is sort of fun, but I don't recommend them for any sort of production use. But you already can also use such a popular web server as engines, but again, for now, I'm speaking about Fedora 39, we will have to load OKS provider in the global open state configuration or in the local copy and provide it explicitly to engines. For demo purpose, I recommend using global. It's just simple, you load the provider, you activate it, it's done by providing the section dedicated for it, and then you configure engines in a regular mode and you add a derivative, say, ACDH curve, which is more or less equal to the groups parameter that I mentioned on the previous slide. So, well, then after restarting Jinx, you have a web server that provides that uses hybrid K-change for groups and you can use URL, which is open state based, at least in Fedora. Again, you should have to specify the curves, but you will get something over quantum protected channel. Of course, it's worth mentioning that big companies, well, also have their post quantum stuff. Google Chrome allows enabling post quantum algorithms, it requires switching on special flags, and you can check that your server setup as it is done on the previous slide, will be able to communicate with a standard Google browser. Also, you can use a CURL to reach, for example, Cloudflare, the demo site, they also use the same algorithms and compatible implementation. Okay, future plans. First, we want to pack all our results to container because do it yourself demo is fine, but for a practical purposes container, it's much more convenient. Then, as I mentioned before, we are going to provide the recent version and it's work in progress in Fedora Rohide. So, you can use the post quantum algorithms also for digital signature. It's currently doesn't work for Fedora, for OpenSS 3.1. And of course, we are involved in upstream work, OpenSS cell, NSS, GNUT LS, we have identified some deficiencies and working on fixing them. And as I promised, there is an opportunity to be involved in that community work because let me speak about SSH. For several years, OpenSSH has implemented post quantum algorithms for K exchange. Unfortunately, that is not any algorithms that is chosen by NIST. There are no standards for it, neither NIST or ITF, there is work in progress on ITF level to write a specification, a formal specification with these algorithms. And there are no specification for no specification in RFC form for using the NIST chosen algorithms in OpenSSH can shape. So, OQS project has the version of OpenSSH which is currently frozen because of lack of contributes. So, if anybody desires to speed up the process of transition of SSH to quantum safe world, I think it's worth organizing some activity both in the development, in cooperation with OQS project and with writing specification of the draft for ITF. Thank you very much. Feel free to ask questions. Sure. Have you analyzed the performance difference between the classic implementation and the one with post quantum? What is the performance impact? I didn't analyze it myself, what I expect performance degradation, just because we are implementing classical algorithms for decades, and first post quantum algorithms will be imperfect by definition. Sorry, the question was about the difference, the performance difference between classical algorithms and post quantum algorithms. Sure? So everybody nowadays is using X509 for services, and you mentioned that it's difficult to trust the new algorithms, and also would be impossible to trust the old algorithms. So did you do any experiments on like dual implementations on X509, and the impact on that, because the certificate will be huge? Yes, certificate will be, so the question is, do I correctly understand? The question is how does the post quantum algorithm affect X509? Yeah, if you use it in dual combination with the old algorithms and the new algorithms. If it's used in dual combinations with old and classical algorithms. So there are several concurrent documents of combination of classic and post quantum algorithms. And yes, the certificate will inevitably be huge, which combination, no matter which combination will be chosen. There are some efforts how the impact can be reduced. For example, let's add intermediate certificates to the trust storage instead of sending them on the wire. But it definitely has its downsides, because well, increasing the size of root storage. But yes, as I mentioned, network protocols will be seriously affected by huge certificates. So just add on to this question, so that means we need more computing power for our applications? No, it means we need to reinvent the CPN DDP. Sure. If we bring these into how we usually provide a very friendly user experience in order to communicate these keys from one device to the other, we sometimes use QR codes, NFC and Bluetooth. Will that be still possible if we go to these size of certificates and keys? Will the user friendly certificates, will the user friendly way of transferring the certificates such as QR codes, Bluetooth and so on and so forth, be still suitable for post quantum keys, right? Okay, yes, for QR code because it's just linked to the URL. Yes, yes, yes, don't know about Bluetooth, sorry. How many time do I have? For minutes, sure, go ahead. Do you have any expectations on when will actually have to deal with post quantum signatures in the wild, like in our products or because of a server we're interacting with or as a client? Well, when do we have an expectation, when will it appear in real world, right? So, I have expectations but don't trust me too much. We have, there is a promise that algorithms are finalized in Q1, right? Presuming this, the ITF process even for near-finalized RFCs is about half a year, right? So, I'd say that first attempts of introducing post quantum certificates into real world will not happen before 2025, especially taking for account that real world CA needs hardware which is capable to keep post quantum keys inside and also it will take time to develop such hardware. You showed the hybrid mode of the hybrid, right? Do I use a hybrid mode of the hybrid and the classical algorithm, right? Yeah, yes. What is its security level, let's say? Is that hybrid mode also quantum safe or is that not fully quantum safe? It's quantum safe. At least the current evaluation of this hybrid mode is that it's quantum safe. As I mentioned, we did not study the post quantum algorithms enough yet. Go ahead. And how do we evaluate the quantum safety in general? Like what is the current evaluation of the current in general? Like what are the approaches to presumed that are going to be quantum safe? Which approaches are presumed to be quantum safe? Sorry, I'm not a mathematician. Yeah, right? I can say some words. I can say some words such as lattice-based cryptography, hash-based cryptography and so on and so forth. But please investigate what this word means yourself, sorry. Okay, the last question. Are the quantum save algorithms, sorry, do I correctly understand the question? Will the quantum algorithms will be resistant to all types of quantum computers? We hope. Thank you very much. Thank you. May I take the question? Yes. Okay, thank you. Thank you very much.
Beyond passwords: secure authentication with passkeys
Alright, so I had a talk yesterday, you missed it. So today I'm going to talk not about Passball, the open source password manager, which we are building with my friend here, Kevin Clayton and Shmouty. I'm going to talk about Passkeys because I have the chance to participate and be a Fido Alliance member and sit and participate to the Plenary Conference and CPSIG. You will see Fido, they love acronyms, so it means Credential Provider Special Interest Group, so because we are a Credential Provider. So what is authentication, just so that everybody knows a little bit what we are going to talk about. Authentication is something you know, or something you have, or something that you are, like biometric, or something you do. You can even have like behavior based authentication. So authentication, these days generally a combination of one or two of these factors. You know about passwords and password based authentication have a lot of issues, so you have issues of the user selecting a weak password, or basically people being able to brute force, or like phishing is a big one, and you have all sorts of other issues. Generally you can implement content measure to make sure that basically your authentication is good enough, but phishing is the one that is really hard to solve because it depends on the user. So you can solve for example password strength selection by introducing a Credential Manager, and you can prevent a little bit of the phishing with a Credential Manager, but you still have some room there. So who has set up Pasky's as a user in the room? Well quite a bit of people. Who has as a developer implemented either Authenticator or like implemented Pasky's authentication on the website? Yeah, three people. So we can see here that it's still a new topic. So what is a Pasky? You will see that Pasky's mean different things for different people, so I'm going to try to give you like a 10,000 feet view of Pasky's, and not like go too deep into the protocols and the options for the protocol or particular implementations, but just give you like a high level view of the landscape, something that I would have like when I started working on this because it's like really tentacular and there's a lot of options and a lot of different views. So Pasky's the official definition is Pasky's a password replacement. There are public key, private key pairs that are used for authentication using cryptographic signatures. So basically you have like a site that gives you something to sign and you sign it and you prove that you're you using an Authenticator. Pasky's are user credentials that are discoverable, so it's possible for the browser to know if you have a Pasky for a given website for example. And because in the browser the JavaScript is served by a website, it means that the website can also discover whether you have credentials. So these Pasky's are stored within application or security keys and they may be synced across devices. So this is the new stuff we've seen in the previous talk that we were talking about device-bound Pasky's, the Pasky's that are sitting on devices, but they are now a new class of Pasky's that can be synced across devices. So you can see this is the lay of the land like depending on like who you ask about Pasky's, they will tell you maybe they are like thinking about device-bound Pasky's, so Pasky's that are on physical devices or if you ask Google and Apple, they will talk about sync Pasky's and sync Pasky's are basically keys that can be synced across multiple devices. So you can for example have them on your laptop and on your phone or you can basically transfer them or like do an attestation using your phone while you're trying to authenticate on your laptop. And these Pasky's are supposed to be exportable and transferable, but in practice they are only transferable within a given ecosystem. So for example Apple will not let you export Pasky's to Windows for example. So it's their advertiser's being like you know interoperable, but because they are not coming from the open source world like we do, interoperable for them means different things. So they are also another class of Pasky's which is called up level Pasky's which generally have lived with the device-bound Pasky's and meaning that they can be used for other things or you have like additional properties added to them so typically you'll see them in banks. So for example like a bank application will use Pasky's to sign transactions or they will use additional signals to unlock a Pasky's something that is not there with a classic authenticator. For example they will check like you know your location or like your working hours like you can do all sorts of different signals. And you can build a custom authenticator so like you can build like the UI that you want. You don't necessarily have to follow like the OS or like the physical device design. So there are a lot of different requirements. We've seen like basically Pasky's means a lot of things. So like on one side you have like people that are working on the enterprise level. So people that want hardware keys that are like very strong and you cannot export them. And on the other side you have Google and Amazon for example they don't really care that it's very, they want the friction less experience. So they are ready to trade a little bit of security for usability. So for example when you're doing a checkout using Amazon they want you to go as fast as possible through that checkout and pay. Even if there is a security issue they will be okay to give you back your money. But if you do that with a bank they are not like having the same mindset. So on one hand you would have like Pasky's that require certifications. The banks will check is it the authenticator that I gave you to authenticate. Is that really your personal device? And on the other side you basically have like a website like Google that just want to show like okay you authenticated with the UB key but they don't really care like which one it is. It's just to present you like okay you have these Pasky's and this is the kind of authenticator that you are using. And on the enterprise side they will issue for example for like super high security setup they will issue you like a security token that is just for you. So you can see that these are some privacy implications. So for example if everyone was using like a device that is unique for each of us and we log in on each website with this device you would be able to do cross domain tracking. So you would be able to see like okay this guy logged there and then he logged there. And this is a privacy issue obviously. So on this side you want privacy. On this side you want basically no privacy because you are in an enterprise setup. So all of these are like very complex requirements and they are all fitting in the same standard. So basically like it's a little bit complicated to know like what's going on. But the common denominator is phishing resistance the fact that the Pasky is domain bound. So like you have one Pasky for Gmail, one Pasky for AWS. But you don't reuse the same Pasky twice. And it's always supporting HTTPS. They made this choice which is very wise. It's like no support for HTTP. So the Fido2 project is a project that works with the Fido Alliance. Fido Alliance contain like Google, Amazon, Visa, but also like TALES, you know people doing like security devices. And on the other hand you have like the WebOtent protocol which is managed by the W3C. So you'll see like basically people working on the Fido Alliance are also part of W3C. It's the same people, you know for example the person at Google is the same on both projects. And together this is called the Fido2 project. So on one side you have W3C that manage the WebOtent protocol. And on the other side you have Fido Alliance which manage the Ctap which is credential. Sorry, client to authenticator protocol. So basically the relying party is the website you're trying to authenticate to. It uses WebOtent over HTTPS. Then you have the client which is basically your browser and the JavaScript application that is running in it. And you have the authenticator. So authenticator can be the OS platform. It can be like a device, can be like a UB key, can be anything that is basically Fido approved. And these days it can even be credential manager. So it can even be like Bitwarden or Dashlane or OnePassword. And the interface for client to authenticator is a bit more messy than WebOtent. Basically you have like it works with Bluetooth, it works with what I call, everybody calls monkey patching. So basically if you want to integrate in the browser in JavaScript and become an authenticator, for example as a password manager, you will just hijack the JavaScript API and replace them with what you want. So that technique is called monkey patching. It's the only way for a browser extension, for example, to act as an authenticator. But you have like also proprietary protocols, for example, like when the Google Chrome browser wants to use the Google authenticator, you know, you don't know what's happening underneath. They are using their own stuff. So I hope that's clear and gives you like a high level view of what we're talking about. So there are two ceremonies. There is the attestation ceremony, which is the registration. And there is the assertion ceremony, which is the login. So there are no other operations. So for example, you cannot list what are the pass keys available for a given relying party. You cannot delete pass keys. These are not part of the protocol and need to be implemented separately. Like it's not normative. So we will see that this goes some issues. So the attestation ceremony, you have the client, which basically post a username. So this part post a WebOtten attestation option is not normative. You can do whatever you like. As long as you send a username, basically the URL doesn't matter. So it's for the relying party to decide what is the language that it wants to use. What is the URL? So recently they introduced WebOtten.wellknown file that you can place on your web server to basically say like, OK, this is my attestation URL. Then the relying party reply with the public key credential creation option, which includes the RP, which basically the ID of the relying party, the challenge that the user need to sign, and some other options. Like for example, we've talked in the past, in the previous talk, people were saying, like, do you check for user presence? Do people have to enter a pin? This is basically the moment where the relying party can say like, OK, I want to use this algorithm and I want to check the user that way. I will require user presence. I will require you to do user verification. And then the client does basically what it wants. So from that, the client called the navigator credential API. There is not a WebOtten API. Basically, we use a JavaScript API, which is the credential API, which can be used for other things, but it's used mostly for the WebOtten protocol these days. And then we basically enter a setup protocol or something else. So either like a proprietary protocol, but here I put setup for, you know, like clarity because it's the one that is the most defined. So same, you will send some data about the RP and the user, and the authenticator will assert the parameters, see if the crypto operation is supported. So it's asking to use a particular type of key. Can I create such keys? Then collect the user gesture. So like either enter or enter the pin. And then generate the credential and generate the signature. It will return the attestation statement and the authenticator data. And the client will send this information over to the relying party. The relying party will assert if the key is valid and will check the signature. Is it valid for that particular key? And it will check if the RP ID is also matching. So that basically you don't have a client that use another request from another website. So we keep the property of having a domain bound process. So the assertion ceremony, I'm not going to detail it again, but it's pretty much the same thing except you're not like giving the new public key. You're just signing with a key that is already there on the authenticator. And then what about account recovery? So obviously if you lose your device or you lose your passkey, then what do you do? So there are two types of account recovery. There is account recovery for the RP. So basically the solution to passkey is more passkeys. So it's good if you have like device bound passkeys because then you need to buy more devices. It makes a lot of sense when you're selling devices. And you can also use passwords. So generally you have a website like Amazon that will let you have a password but they will propose you to have passkeys on top. So basically you will default to passkeys but you can still use your password for account recovery and magic link. So basically passkeys are as good as the account recovery mechanism. So we're kind of back to square one. Unless you get rid of these methods for account recovery, you're not like really changing your security posture in my opinion. And on the authenticator side, it's a little bit more complicated. So I think Apple recommend you to have several Apple devices. Makes sense. And then you can also set recovery contacts. And you have custom procedures. So I give you like for example what happens on iCloud if you lose all of your devices. And it's actually possible to do a recovery for my cloud using all your devices and it's quite smart. I'm going to say the problem with we have in open source world is we don't have such a service that is ubiquitous where people have an account. So for example if we are Ubuntu and Firefox, we don't have such infrastructure to exchange such a scroll mechanism. And that's going to be a challenge I think moving forward. So how does it look like from an authenticator point of view? It's a work in progress and there's a lot of change. So like maybe like by the time I put this slide together, it's already updated. So this is an example on macOS and Chrome. And so you will see on Chrome by default Chrome will prompt you to use, when you click continue it will use the Google Authenticator. So you have the impression that you're using the OS level authenticator but you're actually using Google Authenticator. That is leveraging the API of the OS to provide an experience. But it's not the Google Authenticator, it's not the Apple Authenticator. So you can see already it's already kind of sneaky because like if you're using Chrome, they will prompt you to use the Google Authenticator. But you have some other options. For example you can use the phone or a security key. So basically you see there is more clicks if you want to use something that is not Google. And then you will scan this or press your security key and you will have the same result. So if you can even do like two device ceremony where basically you scan this QR code and then you will unlock your phone and the signature will happen and will be exchanging through Bluetooth BLI. So there is no pairing with the laptop and the phone and you will be authenticated using that mechanism. So it's possible for example to use like an authenticator on Android phone to login on Windows device using this mechanism. So if you use Firefox, you will start directly using the Apple Authenticator. And it's the same if you're using Chrome, you'll see like basically if you use that option that was there on the previous screen, then you switch authenticator. And for me like I don't expect like people that are not like knowledgeable to understand what's going on. So and it's the same like depending on like the options that are provided by DRP you may have like different mileage and different user experience. So it will be quite confusing I think for the average user. So it's the same if you want to manage your pass keys, they are like buried. It's really hard to see like how many pass keys do you have and where they are registered. Same on iOS, if you want to manage your pass keys you need to go to passwords. So you need to click on the password to see that it's a pass key. So pass keys are like okay we've solved the password problem, we can all go home, right? Like that's it, mission accomplished as George would put it. But no we still have a lot of issues to solve. Like we have like the what happened when you lose devices, especially when you don't have like a sync fabric that is common to different authenticator. And there is no real work being done on pass key management and review as we have seen. So for example in the future we've seen with the previous talk with quantum computer coming soon. We will need to roll out new algorithm, maybe we will need to change algorithm faster than we had in the past. So we will need to revoke keys. So in order to revoke keys we will need to design experience where the user understand okay this key is using an old algorithm that is not supported, you need to create a new pass keys. So we will need to have a user experience to manage pass keys that is understandable for the average Joe which is we are very, very far from there. And it's the same like for developers, I think like for developers to understand all the different options and what they mean when you're implementing as a RP. It's quite tentacular and you for example you can't follow the implementation of Google because Google does not care about user enumeration because you can already send an email to a Gmail address and it will bounce. So you already have user enumeration in place so they basically don't implement best practices to prevent user enumeration. But for you, for your use case maybe it's important. So like you can't even follow like what the big players are doing, you will need to do your research and find out. And I think this can lead to some problems down the line and we will need to do a lot more of education on like what are the security problems around pass keys. There are some other issues as you've seen like the user experience is quite fragmented and it will not be the same on different OS and different authenticators. And there is an entry barrier for authenticators so like one of the few open source projects on the Fido Alliance project it costs around 50K a year to be in the room when these things are being normed and so it's basically like a pay to play initiative. And in my opinion for something that is supposed to replace password but so ubiquitous that's an issue. So we even have like I think Firefox have a seat at the Fido Alliance but they didn't have the staff last year to be there. So I think it's an issue. And there's a lot of proprietary protocol and monkey patching happening and like we need to do much more normalization and I invite you to get involved to be interested about this because if you don't act on it basically they will make the decisions for you. That's it. Do you have five minutes? Yes? All right. Yes? The complications that you just mentioned to implement that is also true for somebody let's say a software service that wants to offer pass keys to their users. Yes. Do they also have to deal with all this complexity? Is there maybe a simpler way? Yes. So do the RP have to do their homework to understand the issues around pass keys? Yes. And they are not like super easy to get. You know like for example if you let's say you're building like a service like that authenticate people using pass keys but it's like a globalized service for your enterprise and you're doing just one pass key authentication on that domain and then you're switching to another protocol. Maybe you're using an iFrame and in the case of pass keys you may have issues with UI redressal so you need to basically take care of these issues and these are like you need to read the specs and I think maybe there will be more education and more easy resources and I think we need like all tools like for developers and like not for them to trip their feet on. And same like what kind of algorithm you should support like we've seen there is a lot of legacy website like maybe like they start creating keys with a certain algorithm but two, three, four, five years down the line when quantum computing becomes like something that you have to do because the state is telling you you have to do it. Then you know what happened with these keys you know like. He said earlier that kind of the relying party can say the use of the present for example. So how does the relying party know that the client actually should that. So. How does the client ensure that the authenticator is doing what the RP is requesting. I think it depends on the option but most of the options are basically like for example the client can ignore that the user needs to be verified but with the data that will be sent back as part of the assertion you would have. Okay, what what did the authenticator do? Did he verify the user or not? So it will be the RP to verify. Say to the client I want the user to be verified and when he has the response to say did the user get verified? Yes or no. So you say I want this but you need to check if it actually happened when you get the final assertion. Makes sense? Yeah but you have a password manager so someone says please make sure that the user is present and you sign the challenge and say oh I didn't say the user is present but you actually like because you can do that. Yes. There's no way for the website to know that the password manager is present. Yeah that's where the Fido certification comes into play. But it's like nobody wants to be like you know caught off and like say like hey look there you know so it's also like a gentleman agreement that you're gonna respect and do. But I could I could suspect like that some you know people that are like do it yourself kind of thing they can make their own authenticator that does whatever they like. But for example a bank may refuse to use an authenticator that is not Fido certified. So you know depends also on the RP. Because you assert which service are you using for the station. When the. Well often. So on your response do you know who certified that. Yes so you have information so it's like in the response do you know do you have information about the author of the cater yes so you have information which is I level which is what they call the a a good which is basically a global you you ID. That says like you're using like a UB key five but you also have like older version of this way where you have like actual certificate and you have like a signature that you know like. So this is stored in the MDS of the Fido alliance so if you are like for example a bank and you want to make sure that. You know it's not somebody pretending to be a UB key. You can actually see the certificate the route certificate and check against that assertion. One minute that doesn't make sense so you have you have two level you are one which is like the user agent kind of thing that most are using but you have another one which is like more complex that involves like cryptographic and signatures. We've seen some examples with the ecosystem and what about the system. Is there anything like that piece for private. On the unique system basically like to my knowledge not much is happening I think there is a talk with a GNOME the GNOME team was going to present like what's what's happening on the Linux site but it's way behind in terms of like Microsoft. Hello Apple. Kitchen and Google. I've not seen an open source one like but you have like credentials manager like for example. Bitwarden or dash lane that are basically can bridge the gap in an ecosystem where there's no OS level support. Thank you very much. Thank you.
Making Ansible playbooks to configure Single Sign On for popular open source applications
So, hello. This is where I get an email later from FOSDEM where should we cut the feed and start a new one. So this is where we will cut. I'm going to do a talk about making Ansible playbooks to configure Singlesign on for popular, popular open source applications. Yeah, I got that right. So, who am I? My name is Jerome Betten. Jeroen Barton, if you do it pronomously, correct in Dutch. Let's see what else there is. Oh yeah, if you're English and you want to know how you pronounce that, you just read this out loud and it should work. I do open source stuff. One of my recent projects is Chateau IT, which has a slide on its own. I wrote 12 books and I got a couple more in beta. I've got three girls and my wife had two girls so we married and there you go, five girls. I used to be a volunteer firefighter, did some scouting stuff. I do training and I hack a lot when I can. So what do I do? In this case, I do open source consultancy. One of the things I do is Chateau IT and in Chateau IT, we do training. One thing is the Bernard Prevention training so that's not so much IT, little, no IT actually, but it's for a technical crowd. And it's in Dutch but why the Chateau? Because I'm going to do that in the center of France. Anybody here from France? Okay. I have a little longevity in the center of France. We have a lot of it. Okay. XOIT, I retrained people to IT. I did that a couple of times before. I retrained a sailor into a programmer, C++ and I retrained somebody who had been selling suitcases at a warehouse for training years and he is now very, very, very happy working in the IT, starting in a help desk position. But he's happy because he's working with computers and he used to do that for his hobby and I helped him out on that. And those are, well, training for the really good IT guys and training for bosses management to explain technical stuff if they need that. But no commercial. Okay. Another project I do is LibrePlan which is a web-based project management. It's in Java. It's extremely cool. Have a look at it. And it's, I think I can safely say is about the best project planner that's out there in the world. For one minor detail, I'm about ten years behind in making a package. Full disclosure. But I'm going to do that next month or the month after that. So it will happen this time. Apologies. Is there somebody here from England? UK? Oh, that's good. Oh, there is. Apologies in that case because I speak English words but I don't speak English. Okay. Because I once found this translation guide that says when the British say with all due respect and what they mean is I think you're wrong but what I think, oh, he respects me. Oh, that's good. We've got a nice vibe going on. Because we take in Dutch if it's English and any other language, we take things, everything literally. Which sometimes lead to awkward situations. Okay. The project is short. I was hired to be a general manager at an IT company. I learned two things. One, I shouldn't be a general manager of an IT company. That's somebody else's job. Should be. But it was an IT company without IT landscape. There was nothing there. Well, almost nothing. Which is weird, especially for an IT company. So I started with the Proxmox Virtualization Server. I installed free by LDOP. I installed some applications. You know, XWiki for the wiki. If somebody doesn't have a wiki, look at XWiki because I think it's the best. Zebix Monitoring, Jenkins, SnacksCloud, GitLab, Odo. Odo is ERP from Belgium. If you haven't heard of it, look at it. It's awesome. CMD-Build is a completely customizable configuration management database. So if you haven't heard of it because not a lot of people have, look at that as well. Because you can really tune the data model of your own business into CMD-Build. And then put data into it with rest interfaces and connect it to other systems. Okay. And I got a question. How to upgrade this landscape? So I did this with LDOP authentication and somebody said, yeah, is there any way we can make this better? I said, well, you could switch to single sign-on. That would be an improvement over every single time entering your credentials. Oh, yeah, sure. Go ahead. Make it happen. So I was sponsored heavily by the owner of this company. By the way, it was Onstein in Breda in the Netherlands. And they are an Odo consultancy club. So, showing the Netherlands, thinking about ERP implementation. Have a look at Onstein. And they will gladly help you out. So I promote to make everything single sign-on. And that's where our venture starts. Basic lingo is that you need to know for single sign-on is the application that uses it is SP, the service provider, because it provides a service. And the application that does the identification is the identity provider, IDP. So you have to know SP and you have to know IDP. If you understand that, we're almost there. In our case, I used Keycloak, which is a very popular open source project with very nice features. It's a Red Hat project. And they used to have Red Hat identity server, which was based on Keycloak. And Keycloak became so popular, so now it's Red Hat's packaged version of Keycloak. Because Keycloak is more popular, which makes sense. ACS is Assertion Consumer Service URL. That's the service provider sign-in URL. So any application has an ACS, if you can single sign on to it. And you need to know that URL because you need to configure that at the IDP, the identity provider. Of course, you need Ansible, the. And who doesn't know what Ansible is? Okay, so Ansible, and I'm always getting slapped in the face. Oh, you know. Oh, you know. Okay, okay. Maybe somebody is watching. Maybe somebody who is watching. Okay, so Ansible. And I'm being slapped if I say this, but I keep saying it. It's a sort of a recipe language because I shouldn't use recipes because recipes term is claimed by. Chef. Thank you. But for the layman, and I'm a dumb guy, so I use the word recipes. Because you simply say, oh, now I want this computer to have this package installed. Now I want this configuration file to have that line added or deleted or changed. So you have all those very easy to read steps, and you make a recipe that you can send to a server, and then those steps are being done on that server. And it uses some JSON stuff, and anybody here who's a fanboy of JavaScript? Okay, because I think the JavaScript is a bad language, but JSON is very cool. Oh, that sounds like somebody who differs. Okay, basic single-cell no-process flow. The user clicks login on an application, application X. The browser of the user is then redirected to the identity provider, in our case Keycloak. The user is presented with a login widget by Keycloak. User logs in successfully, hopefully, or gets an error in the night message. If not yet 2FA configured, but set as mandatory, he gets 2FA setup dialog, all done in Keycloak. So it's very easy to enable 2FA for users. So you should, because that's today almost mandatory in a secure environment. Then the browser of the user is redirected back to the service provider with some credentials proving he has successfully logged in, and you get the application and you can do your stuff. Now, if you then, every other application is also redirected through Keycloak, but he says, oh, but you already have logged in, and you're immediately transferred back to your application. So it's single-signal. Any questions so far? Okay, cool. That's what I just said. The setup, user IDs are held in this case in free by LDOP, Keycloak for Web as a zone. Syncs with free EPA, so the LDOP is synchronized with Keycloak. Well, Keycloak gets the request and then sends it to LDOP for verification, gets back the signal, yeah, everything is okay, and then continues with the identity process. And the Keycloak server has a client definition for every application. So you can also read here client configuration, sorry, application configuration. But in Keycloak lingo, it's a client definition. And so it's not every user that logs in is not a client. No, the application is the client. So I started with this. The first application, X-Wiki, configuring it, writing Ansible, 20 minutes done. And I thought, what could possibly go wrong, right? Because, you know, I've only got, what is it, 8, 9, 10 applications to hook up, 20 minutes for the first. Man, this is a breeze in three days, I'm done. So about a year later... It's... Sorry, what? Your certificates expire. Yeah, my certificates, no, no, almost probably, but I'm very bad at certificates. Internally, in a land, the lifetime, you know, I make it, I don't know, sometimes 10 years, I don't care. It's internally, so... Anyway, this was a walk in the park. It was based on good documentation. Do you know that there are developers who don't document their stuff? No. No, yeah, they don't, they don't. Some of them is really awful. Anyway, I added another application, etc., etc., etc. So this is what the client list looks like. And here you have... I made... This was my bad sense of humor. I made the key cloak server internally AD as host name, because that's what everybody from the outside would expect, right? And CMDB is here. Yeah. Okay, let's have a look at the program flow. And with program flow, I mean the Ansible playbook, my recipe. So there are two valuable files. One is global and the other one is encrypted, because, of course, to make a client definition on the key cloak server, you have to know the admin credentials of the key cloak server. So that's, of course, in an encrypted vault. And if you don't, you're a bad IT guy. Playbook works on the application VM. So I have made for every application a single playbook, because that's easy. You only, if you want to change something, you just look at that one playbook, sometimes two, couldn't be helped, because everything there is in a logical order, should be. And it's directed to the application virtual machine. And then the application virtual machine will connect to key cloak, do some magic, and configure its own, and then at some point, hopefully, it's done. So the playbook retrieves key cloak endpoints info via the host. Checks if key cloak client definition exists, and if it exists, it will delete it, because we have a playbook. And I can trial and error the hell out of it, and just delete the former client configuration, create a new one, and then configure that. So, yeah, it just deletes it. It fills a client definition template, that's in JSON, and that's a JSON file, uploads that to key cloak, with some keys that it has created, and some variables, and then, yeah, and then the client definition is created on key cloak. He checks, of course, if the client conversion is created successfully, if there was no error, and if not, okay, it downloads a shared secret, if relevant, if you use OpenIDC, because there are two ways to authenticate single send-on. One is OpenIDC, and the other one is SAML. So I have both in the set of recipes. So it's either OpenIDC or it's SAML. And Ansible leaves you with a configured application, if you're lucky. It even installs the application, and I need to start talking very quickly. Normally, I do this in an hour, now I have 30 minutes, so that's a problem, a slight problem. And at the end of the playbook, it displays remaining tasks, as in, look, you are the admin guy, this is the user credential, this is the password for the application I just installed. Now you need to go to this menu, this sub-menu, that you have to click this button, or you have to enter this string, if I couldn't do that remotely. For instance, if you install Jenkins single sign-on, you can do that using Ansible, but then it's a chicken and egg problem, because vanilla Jenkins doesn't support the REST API. You have to install a plugin, which I don't in my playbook. So, anyway, trust me, it should be easy. This is a very small print, but basically, these are all the steps that you do in the playbook. And since I have five-fifty minutes left, and the slides will be online, and there is a repository, every application also has a wiki page with extra notes, there are comments in the playbook, so, and if it doesn't work out, send me a message. That's how open source works, right? There's some tricks that I had to use, so this is a sort of condensed JSON stuff, so here you have your application-specific ACS, so that's where the key clock has to redirect you to, in this case, Zabix, to get you logged in, and you have several ID fields in JSON, and the ID is a random UUID. So, look up Community General Random String, and it's 20 to UUID, done. So, it's just a random string, and it changes every time you run this. You have to define the algorithm, the certificates which are generated, are placed in this variable, and then send this JSON file to key clock. Some of those applications need protocol mappers. In this case, the Zabix user protocol mapper, because it's SAML and Zabix expects the return thing to say, oh, yeah, Zabix user needs to have this email address, and you have to, that has to match on the email between Zabix user and key clock LDAP. Yeah, oh, yeah, and SAML's multi-valued roles false. It's now about a year ago that I did this talk for the first time, so don't use key clock 18, because you can set it to true or false. It will always be true, which means it doesn't work. That takes you a couple of days to figure out, but it was changed and the bug was smashed, whatever, after 18. Okay, so once you have a working synchthonal setup, use... Yeah, we're doing it again. This is if you start out for the first time with an application, and you don't have my playbooks, and you know a little bit about synchthonal, and you get it to work, then you need to have the client configuration of key clock, and there is a very simple script that will just give you the whole list of client configurations, and you copy-paste that one entry out of it, pipe it through JQ, replace settings with variables. There's even, I put in here, a very small div to compare two JSON strings, because he will order the fields and then do a div, so it's very easy to use. And why do I tell you this? Because I want you to do some work. Anyway, gotcha. Everything was better when you used HTTPS, because synchthonal and HTTP should work, but often doesn't. So just default in your head, make everything's HTTPS. Don't think it will be easier if I don't, because it won't. That's one. Tomcat expects an SSL keystore to have the password, change it. Don't change it, make that the password. It's, and maybe somebody else knows the other way, but that's my experience. Some application developers can't read. If the standard says optional, that is not the same as mandatory. So you look at stuff, you think this should work, it doesn't, why shouldn't it? And you start looking at it and debugging it and trial and error, recipe and delete client, make a new client, and at some points, oh, this is mandatory for them. The standard says optional. This developer, shit for brains, made it mandatory. And some applications are very badly documented, which is probably not a surprise, because that's basically our day job, right? Oh, yeah. And adding the free EPA keycloak user ID sync, midway was not a smart idea. I could just start from the beginning. So if you're thinking about using LDAP and keycloak, start with that and after that, do the recipes and don't do, oh, keycloak, oh, yeah, that works. I probably should sync to LDAP because you can start all over again. And to me, Ansible can solve just about any problem. Well, in IT, right? So not like world hunger or world peace. It's a little ambitious. And do not use keycloak 18 unless you like long searches. And walks in the park. Okay, so your job, if you choose to accept it. Because I have a repository. Add applications and build this further. The most, well, what I think are the most popular, but probably you think what's he talking about. This application is way more popular. I forked this, so I forked this from the one-stein repo that I made for one-stein, of course. But of this, I'm 100% sure I can always accept pull requests. Right? So I don't want to diss them. I'm very grateful for all the time they invested basically in me to make this happen. And yeah, that's simply because it's a very nice open source company. And if you have an application that you want to hook up using SAML, start with the ZappX Playbook, copy it, change it, because you know your application, I hope. Or you have got something running and configured, and you know how it works. Please adjust the Playbook and send a pull request to me for OpenID Connect. Start with the X-Wiki Playbook, because it's OpenID Connect-based. And every application is documented in the wiki on my repository. So it should be relatively easy if you know a little Ansible. And Ansible starts with the Playbook that does a ping and says, Pongbeck, if you have done that, you can do this. Yeah, sort of. Yeah, it's not that complicated. It was complicated for me because I had to do ten applications and some were severely badly documented. So Tui can tune to work in config, send me a pull request, and well, within the time. Are there any questions? Yes. Thank you so much. In the middle of the presentation, I mentioned that you use variables to grab those variables from a vault. I recommend grabbing them from a vault. We did the same. And now we have a couple of some applications consuming. So at the end, they are a volume of some hundreds of variables that you can't solve all over the directory. And then suddenly it became very slow and running the Ansible Playbook. We can really, really slow. And that's a constant and easy solution to that. Right? There are variables, not long-gauge series, but simple variables that increase the speed incredibly. OK, so to repeat the question, you have the experience that you have playbooks with in a vault a lot of variables and it becomes slow. And the suggestion by the kind server over there is to don't use large dictionaries. Or use fact caching. Or use what? Fact caching. Fact caching. Yeah. Or that. So, well, I had a very simple vault simply with the credentials of the keycloak server. That is basically just about it, which is, of course, two variables in a file. I did the password on my command line, of course, not in a repository. The. So my command line was simply look at that file for the password. And this is my vault file and do this keyboard with this host file. Yeah, sure. Welcome. Yes, kind server. You seem to be interfacing with a keycloak in a kind of like the wrong way. Because have you considered using the Antopole Community Keycloak module? No. There are actually very extensive keycloak modules. Yeah. They can configure keycloak. Well, the Ansible, there are some things about Ansible that I really love and some things about Ansible that I don't love. For one, I understand that because of the popularity, you need to start working with collections to split stuff and separate. What's the word for that? Accountability now? Separation of concerns. My goal was to make a set of playbooks that's very easy to read and a single playbook for a single task. So I don't use roles and all the subject directory, mumble jumble that comes with it. Now, a single file with a couple of steps that everybody can read also to understand how this stuff works and make it and configure it once. And yes, there are a lot of collections. There are a lot of excellent Ansible playbooks that are spread around the Internet Universe that gather information. You can even do a Kubernetes install. But that's a repository on its own. And I wanted to keep things simple, the KISS principle. Sure, but in Ansible, one of the objectives would be idempotency. And you start with deleting your client. So basically, you come to the idempotency tool. Will you confirm your client every single time? Yeah, I do. Yeah. You know that? What can I say? It's a design decision. And yeah, that's basically it. Sure, thanks. What about use cases when you have the IDP and the SP in different network segments? Think about your key flow needs to be in LAN because your customers need to be protected because they don't have proper security updates to key flow. Yeah, right. But they want their kitlab public or the other way around or their service. Usually, service is not in the open. Okay, yeah. So the service provider server needs to be able to connect to the key flow server during deployment of this playbook but also in production. And where they are, I don't know. I don't care. I'm just asking from the Ansible side to connect to them. I mean, like for the workflow, how do you deal with customers that have those requirements? What requirements would it be then? Not to run an unpatched key flow in public because your kitlab runs in public. Oh, yeah, but you can. No, I wouldn't do that. So, time's up. Thanks so much for your attention. And have a nice day. Thank you. Thank you.
Fixing a Kerberos vulnerability with the bare necessities
I'm working for the 3APA development team and I'm also the maintainer for Kebras, for Fedora, CentOS and REL. And today I'll be talking about, well, Kebras, 3APA and the Bronzebead vulnerability. So a few general words about Kebras. So it's authentication protocol relying on cryptographic, symmetry cryptography. And it's like 36 years old now. It can be described, I guess, as a very early implementation of a single sign-on principle. If I can run a quick poll, as who has been maintaining a service using Kebras for authentication, here. Okay, so a few words. And have you heard about the Bronzebead exploit for Kebras? No, okay. That's actually normal because it's something from the AD world, actually. So something that is often gets in the way of understanding things about Kebras is the terminology, like this old vocabulary like, so, a ticket, key tab, KDC, authentication, S-Rack, things like that. So I made my best in this presentation to try to use as generic terms as possible so you can really get the main idea. So basically the three terms I'm going to use in this presentation that are specifically to Kebras is things like ticket, key distribution center, and ticket granting ticket. To just give you a gross idea of how Kebras works, you run authentication once against the KDC, this is the Kebras server, basically. It responds with a ticket, especially in this case a ticket granting ticket, so a TGT. And then it uses this TGT to request tickets to other services, basically. So you can authenticate in all the services in the organization. So that's like the usual use case for Kebras, but in this talk I'll be mainly speaking about an extension to this core part of the protocol, which is the MSS4U extension. So it responds to a common use case that you are facing. Let's imagine you have a web application and this application has to access some backend services like an SQL database or a distributed storage system. And very often you have access control happening on this backend service, so you cannot access the user-related data that you cannot access using just the identity of the front-end service. So you need access permission specifically to the user. So what you want to do in practice is impersonate this user to actually access the resources they have access to. So the historical way we are dealing with that is that problem is by doing TGT forwarding, which is basically just you take this TGT, so this initial ticket, you send it to the front-end service and then this front-end service can just reuse it to request other tickets in your name. But the issue here is there's basically no granularity in the permission you are granting actually granting to the front-end service because it can just request ticket to any other service. So it's definitely something that had to be improved. So Microsoft actually designed this extension, so S4U. It adds two new mechanisms to the protocol. The first one is constraint delegation, also called S4U to proxy. What it does basically is so you have the user, this front-end service, I'll be calling it the proxy service and the target service like the backend, SQL database, distributed network volumes, something like that. And so in IPA we provide a tool to configure some delegation rules saying okay this web application has the permission to impersonate user for this backend service. You need specific permission to do that of course. And so I'll be explaining that with the diagram. So here you have three agents. You have the user, the proxy service, so the front-end service, the target service, and the free IPA KDC. So and you have each of them as a key except the KDC that has the key of everyone. So what we'll be doing when we're doing constraint delegation is first the user is going to use the TGT to run a request against the proxy service. Now the proxy service to process this request it has to access the target service as the user so what it will do is run an S4U to proxy request providing both its own TGT and what's called the evidence ticket. So the ticket the user provided to this proxy service. So free IPA will process the request so there is a condition for this request to succeed. You see in a moment why we have this condition is to have one of the flag in the ticket, the forwardable flag set to 1. And if that's the case it's going to accept the request so send back to the proxy service the ticket for the user to the target service and then the proxy service can reuse this ticket to run a request as the user against this service and the request. So it was constraint delegation. Now the other mechanism in this extension is called the proxy transition or S4U to self. It's meant to solve a different problem basically. The way it's described by Microsoft is when you have a case where you have a service which is not using K-Bros as the authentication method in front of a face by the user to authenticate. You can use that to obtain a ticket for this user to the Flotent service. So you can still integrate with some local services that are requiring a K-Bros ticket to work and something like that. And it's also a way to obtain some group membership information that are contained in the ticket. I'll come back to that in a moment. So how it works basically. So it's called proxy service. I wrote that for the courier and 3D of the diagrams but it's not necessarily proxy services can be any service actually. So it runs an S4U to self request for user U using its own TGT. And then so we come back to this forwardable flag thingy. So at this point there is an attribute of the proxy service in the 3API database and according to the value of this attribute the forwardable flag will be set or it won't be. So if it's false you don't get the flag. If it's true you get the flag sector 1 and then you send this ticket back to the service. So that's where the Bronze bit buildability gets in. So something you might anticipate while looking at that is if you have a service that has the permission to request this forwardable evidence ticket and this service also has a delegation rule with another service. It means the service is basically able to request an evidence ticket for any user and use it against the target service. So that's something that is not recommended to apply usually but if you paid attention in my previous, so I'm going to show you that. So we have a compromised proxy service here. It has a delegation rule to the target service. So first it does an S42 self request. The proxy service does not have this rule that allows it to get a forwardable evidence ticket. So that's what you get, a non-forwardable evidence ticket. The issue is here you have this part in the ticket that is encrypted using the key of the proxy service. So actually once it's received the attacker could just decrypt this part, flip the forwardable flag and then re-encrypt it and send it to the KDC again as a S42 proxy with this modify evidence ticket and then the KDC will check the forwardable flag, it sets so then it's allowed and then the attacker gets a ticket to the target service as a user. As you can see the user it doesn't do anything in this whole process so it can do so with just any user including a user that has administration privileges on this target service. Yes? I understand that the reason why the proxy service is able to change the forwardable flag is because it's encrypted using the key of the proxy service. But is that also the same reason why it's able to impersonate any user? It's able to impersonate any user because there's a delegation rule between the proxy service and the target service. But it does not have this specific property that allows it to request a forwardable evidence ticket. So this should not happen normally. If a forwardable flag was properly... The main important thing here is for you to sell this flag. I want the ticket to be held user to myself. It was deemed to be innocent because it's for yourself not for anyone else. And it's not forwardable so you cannot use it as evidence for anyone else. The problem is that not only it can decrease that area and change anything, the area is specifically not checked soon. So you cannot verify that they... Yeah, the whole problem is this area is not protected. It's protected by the key... I mean, it's encrypted using the key of the proxy service but that's all. The proxy service cannot... What I don't understand is how the proxy server in this case is able to change that plan but you also said it's able to impersonate any user. Because the S42 self works for any user. It's not supposed to... You're not supposed to get a forwardable evidence ticket for any user if you don't have this specific permission. So... Time flies. So the issue we had in the APS case, if there was just... There are a lot of reproducers for this issue on AD but there is actually none for IPA for multiple reasons mainly because there are... It was mainly an issue of encryption types. There are multiple encryption types for Keablos. And our default encryption type basically on IPA is... HMACSHA2 encryption types and they are not implemented by AD right now. So that's why it was... We had to implement a reproducer ourselves. Actually, there's a full request here if you want to have a look. So the solution that was designed for that by Microsoft is something called the ticket signature. I'll skip some details. Basically, there are a lot of things in the Keablos tickets and there is also another extension designed by Microsoft. It's called MSPAC. It stands for Privilege Attribute Certificate. So it's another... Basically a piece of metadata in the ticket that contains various kind of information and it's signed by multiple keys as it's a certificate. And what is... For this signature, it's basically... While it protects the encrypted part including the forwardable flag. So that's the response to this attack because here if the attacker tries to flip the forwardable, the flag is still possible because it's only encrypted with proxy service key. So the flag is flipped, re-encrypted, send as an SOU to proxy request. But here before actually checking the delegation rules, it will recompute the signature compared with the one saved in the pack. If it's not matching, it will just reject the request right away. But we had an issue for CentOS and REL8 because there we are using a MATKWAS version 1.18 and this version handles generation of the pack in a completely different way. There was a major rework in MATKWAS 1.20 and because we have some ABA compatibility rules, we cannot just backport this full fix on the CentOS and REL8 version. This is basically all we had in the API for KDB plugins. So there's nothing... We have no access to the encrypted part basically so that's why we could not implement this fix. But the approach we took was different so if we cannot protect this flag, maybe we can implement something to actually detect this attack when it happens. So there are a lot of information in the pack actually including a list of security identifiers. So it's something that is specific to the AD world but we have some support for it in IPA. And there are especially two SIDs that are interesting in this case. This two SID basically the first one indicates a user was pre-authenticated to use the KABLOS term in the normal way or there's another one indicating that the evidence ticket was obtained using SOU2. So that's actually something we can use in this case to... So we start the same way. So the attacker sets the forwardable flag to 1. Then it sends the SOU2 proxy request with the modify evidence ticket. And then what we do is... So we do the same, we check the forwardable flag and we check the SOU2 self rule allowing to set the forwardable flag also. And we check what's the SID indicating the way the evidence ticket was obtained. And if we see that the SID is not matching the value of the forwardable flag then we can just say this is actually a bronze bit attack happening so we can reject the attack the issue is... I said the pack was like a certificate so it's supposed to be protected but it's actually not for K. It was also compromised basically at some point. So there was a way for an attacker to actually modify the content of the pack including the SID. This is this CV, it's basically a pack spoofing attack. It was based on basically how the content was signed because to give some details the initial signature is done by the proxy service key and then this signature is signed by the KDC but the problem is since it's just signing a very small part of the pack you can do some spoofing etc. to eventually modify the actual content of the pack. This was fixed by Microsoft also by adding a new signature in the pack. So this time it's called the extended KDC signature and this time the KDC uses its key to sign the full content of the pack so it removes this vulnerability. So now it's basically the same approach except when we reach that point we check the extended signature before and then we do this check I mentioned earlier. So, conclusion already. So yeah, I thought this is an interesting, like typical example of the kind of tribulation you're facing where you're doing long term support especially for security rated protocols and it's also an occasion to talk about this as for your extension because that's something that it's a good example of a gradual shift. So, the next cabros is currently doing from the authentication only protocol to some protocol also now providing some authorization information. I already described that with this list of seeds but there are actually a lot of stuff and there are plans to start using that in the future in the IPA and open source projects. So yeah, that's all. Okay, so I was quick about that thought. So there are some reference interesting articles, pointers to multiple pull requests, backpots to centers, things like that if you're interested. Now if you have questions. How long does it take? Well, the patch actually it's relatively simple. There has been some hiccups in its deployment but the main thing is that you have all this kind of dependency like we have information we want in the pack but the issue is since this in 2022 the pack did not be considered trustworthy anymore because you can actually spoof its contents. So first you have to backport the support for this extended caddysys in nature to be able to actually trust the content of the pack and then you implement this actual detection method. It took roughly a half a year to actually design the scheme on detecting traces. It's literally detecting some evidence that is left by a hacker which is not supposed to be there but finding what you can trust from evidence took like a half a year in overall drawing board design. So it's not really written in code and then backport in this extended checksum took another problem because now NIA IPA running on Kerperos 120 becomes suddenly incompatible with 118 because of the re-factors that was like checksum missing and then another authorization data is missing and then all the walls and the wall reverberation doesn't understand. So there's a lot of work not technical but more investigative. You can definitely make a movie about this. And some politics against. And also politics of course, yeah, because we have to think also even for all the versions. Okay, so time's up. Thank you.
Passwordless authentication in the GUI
Well, as Alexander was mentioning, this is kind of a continuation of the talk I had before. The previous one was about pass keys. This time it's not only about pass keys but about other passwordless authentication mechanisms. So now I have more time so I will present myself correctly. So my name is Sikar. I'm a software engineer at Red Hat. I work in the identity management world, more specifically for the SSSD, Shadow and Linux PAM projects. I'm the upstream maintainer for Shadow. So if you want to contribute there, we are welcoming new patches and fixes. In the past, I also worked as a software engineer but in the automotive sector first and then in the 3D printing world for HP doing their industrial 3D printers. And something that I didn't mention here is that I like swimming so if you are a swimmer, we are in the same team. So regarding the agenda, I will have a little introduction about passwordless authentication and then I will deep dive into the actual status of the pass key. Smart Car are external entity provider authentication mechanisms in the graphical user interface. I will also provide some, well, something, the vision that we have right now, including some mock-ups and even a demo. And finally, I will give up a conclusion. So let's start with passwordless authentication. So what is passwordless? It's a way to authenticate a user without using a password. Well, that's kind of a common ground. It usually involves multi-factor authentication and also single sign-on. And on top of that, it strengthens the security by, well, you are using a public key instead of a password. You are not reusing it. And you are not vulnerable to a data breach because whatever the attacker takes from the server where you are storing this data, it's just a public key. They have no knowledge of anything else. And on top of that, it improves the user experience because, well, they don't have to remember so many passwords. They just go there and try to authenticate with passwords. So currently, the passwordless authentication mechanisms that we are providing in FreeAPA and SSSD are the PASCII, SMARCAR, and external identity providers. So for the PASCII, we are using FIDO2 in SMARCAR, this is one of our internal Spanish national identity numbers. And then we have OAuth2 for external entity providers. So now let's see what's the current status of PASCII in the graphical user interface. So first of all, you enter your username and you arrive to this screen where we can insert your PASCII, then what? I can't tell you. It's then press Enter. But it's because I know it. Nobody else would know. I mean, you can kind of guess, but you wouldn't then be sure. On top of that, that's a textbook. You are supposed to enter some data in a textbox, but we don't need any data there. So we are just informing you that you need to insert the PASCII. Nothing else. Second, if you press Enter there, you would arrive here. It's asking you for a pin, but it could be that you don't need a pin. You don't have an extra user verification. So why request for a pin when you don't need it? If you don't need it, just press Enter here and you would continue the process. But who knows it? And finally, you are requested to touch the PASCII. The LED is usually blinking, so you just touch it and you are authenticated. So our proposal. Two different things. So if you don't need a PASCII, sorry, a pin, you will be requested to insert the security key. You will insert it, insert it, press Enter, and then you are done. But if you are required to enter a pin, well, you have a textbook where you need to enter a pin. As you can see, it's kind of better understood what is suspected from the user. On top of that, I didn't mention it before. When you are doing PASCII authentication, you can do it either locally or remotely. If you are doing it remotely in the server, you will get a Kerberos ticket. But what happens if you are doing it locally? How do you know that you don't get the Kerberos ticket? Well, you need to inform the user somehow. And we are still trying to figure out how to inform the user that it will not get the Kerberos ticket. It will not be able to do the single sign-on and, well, to inform it. OK, next one, smart cards. So the state here is kind of better. You have the available users. You select a smart card, and then you come here. You are requested for the pin, and that's all. OK, this one was easy, but maybe not so easy. What if you have more than one certificate for the user? So you select the user, you come to the second screen, and we have the same problem as with PASCII. There's lots of tests here, and it doesn't fit in the same box. So maybe you know which one to choose. Maybe you don't. You come to the next screen, choose one, and you are requested for the pin. We also need to improve this. The user needs to know which certificates to select and why it's selecting this one. So currently, I don't have any proposal for this. We are still working on it. I will show you later on where we are proposing the OS mo-cups. OK. Last one, external identity providers. So I will show you the current state in the command line interface. So you try to log in with Sue, and it says authenticate with pin. It provides some pin at a web page, and then it requests you to press enter. So if you go to this web page in a browser and you input the code that is there, you press on continue, then you come to the next screen where you are requested to authorize this request. Because, well, maybe it's somebody else that is trying to be you, and it's not really you. So you authorize it, you press enter, and you are authenticated. Nice. And what about the graphical user interface? It's not possible. You cannot use it. And here comes the application. And here comes the first demo. Yeah. So I will input my username here. OK. You see. You have your username here, and then the login button. If you press it. You are shown a QR code, a URL, and the login code. So this QR code will redirect you to this web page. And on top of that, we haven't finished this yet, but we will provide also our embedded web browser alongside this, where you will be able to provide the login code and authorize the request. Thing is that, well, first of all, you are embedding a web browser inside the login screen, which is maybe not a good idea for security reasons. And then you also need to take into account that it's not easy. So we are still working on it, and it's, you know, a thing to do. So if I were to follow the workflow here, I enter the code. OK. I press and continue. Now I need to authorize the request. I'm requested for the PIN, for the password, sorry. OK. Now you are supposed to press on enter, and it will work. Well, I will tell you we have a bug here. A problem. I haven't been able to solve it before the presentation, so it will fail, and it will show you another screen. But in reality, that would be all. So you press done, and then you are authorized, and you are logged in inside the computer. So what more? OK. So we currently have several authentication mechanisms. But how do you select them in the graph? How do you select them in the graphical user interface? It's not possible. There's no way. So you are prompted for either the pass key or the smart card, and that's all. You cannot select it. No way. But we already have a proposal for that also. So I will come back again here. I will need, OK. You have the web login here, and you have this small key. And this user, apart from the web login, also has the password authentication option. So you can press it, and you are requested to enter the password. You can come back again, say, OK, I don't want the password. I want the web login. And it takes some time, but it comes back. On top of that, you are a user. You authenticate the first time with a given method, let's say pass keys. You log out, and the next time that you come back to the same login screen, it will ask you for the pass key, not some other method. It will remind that you try to authenticate using the pass key and that you succeeded so that the next time the graphical user interface will ask you for the same example. For exactly the same method. Of course, you can change, but so that you don't have to start changing the method every time because there's some priority that always tries to do the same authentication method. So in conclusion, we have here the software design. It's more or less that GDM prompts the user for the login prompt. The user will input the username and GDM will start upon conversation. So when SSSD has already resolved the username and obtained the available authentication methods and all the data related to them, it will generate a JSON message and send it back to the GDM. So GDM already has all the information for all the available methods and the prompts that it needs to provide. The user will provide the information and GDM will come back and will generate another JSON message informing SSSD which method to use and which data the user input. So if you want to know more about this topic, here you have a web page. This is the design that we are currently writing. It's still working progress, but for external IDPs, it's kind of more or less done, so we don't expect to change it much. As a wrap up, so these are the high level requirements. The user should be able to select the authentication mechanism. It also should be able to use the previously mentioned authentication mechanism to authenticate. On top of that, the previous attempt should be remembered so that the next time that the user comes, it's prompt for the same authentication mechanism. And finally, the user interface shouldn't get in the way, so it should be easy and simple. We don't need to start doing strange things. The user needs to feel comfortable and it should follow the same workflows that he's used to follow in other applications or in the web browser. So last slide, reference links. So the first link is the design mock-ups that the NOMI team has prepared. You have there almost everything except for the two or more smart card certificates. This is still working progress. On top of that, the second link is the link to the SSSD EDM interface that I mentioned previously, like two or three slides ago. And finally, you have a copper repository if you'd like to test it. We are building it for federal road height, so if you want to test it, well, I would wait one or two weeks more until we stabilize everything here, especially for external entity providers, but then you should be able to test it. So that's all. Thank you, and do you have any questions? Okay, so you are asking what happens if you are not connected to the Internet and, well, you try to authenticate. Well, if you try to do an external entity provider authentication, you will not be able to do it because you don't have to enter. Is there a way that I can connect to Wi-Fi? You can. Yes? You mean in the login interface? Yes, in the login interface. No, I don't think so. Uh-huh, no, yeah. Good idea, yeah. But maybe you can use another method that doesn't require Internet connection. Yes, but then what's the point of even having the Internet login? Like, if I need to remember my password and I have the ability to use a password which is less secure, then I just make the entire system even less secure by adding another way of logging in. Yeah, but. Well, not actually improving anything. Okay, so you want to know. I'm trying to criticize. Yeah, yeah, yeah, yeah. So I get your point here that, okay, you enabled the Web Login and you don't want to use another authentication method that is less secure just because you don't have Internet access there. Then I would recommend you to use pass keys or smart cards because usually the user is cached there in the. And is there like no network manager access in the login interface? No, I don't think so. Because that would make it like work. We can discuss it. Yeah, it's a good feedback. I mean, there are certain potential issues with network manager like network manager also has access, for example, to VPN configurations. So you would have to create a special interface for network manager that didn't expose secret information. Yeah. So this is actually a common problem that exists for quite some time. And I've been talking about this at FOSDM in 2016 already when the, how they call these in hotels they have portals. Yeah, yeah, before you connect to VPN, but you cannot connect to VPN before you go online and you need to be online to and then solve first this portal challenge. So it's still the same problem. You need to run browser effectively before. So running network manager with the access to you potentially user private information before authenticate and then identify the user is another problem. Yes. We are not looking at that problem specifically within the context of this one, but solving ability to run the browser before the login will help us to solve some problems. And I know that at least three major distributions are working on these set of problems, but it's question of prioritizing. Yeah, I think that it should be solvable to run the browser securely. Yeah, I think so. Yes. I mean, we have stuff to work in, but we'll take a while. Yeah. Right. Yeah, first thanks for working on this. Clearly, there's a lot of work to do. Have you done like accessibility review like on this? I mean for disabled people, especially and actually connected to that is, for example, I noticed that when you shown like the UI. It was a bit far off from what people expect when they go to a website. For example, like, for example, you shown like to select the factor notification factor was a small. And the bottom right with the key icon, right? It's not really what people are used to like when they go to websites. So why is that essentially and indeed like how is that connected to an accessibility. Okay, so the question here is whether we took into account the access to these ability people. I'm more specifically when you are about to select the authentication mechanism that the icon is kind of a small and it's on the right side. So I know there's been some people from the UX team working on this and I'm quite sure that if you go to the first link there and you provide your feedback there, they will take it into account. Yeah, there's like going to no issues in the design team section. There are actual these mocaps. Yeah, so you can add your. Yeah, yeah, and you can follow it and you know provide further feedback if you don't like it. Yeah, we are still working on this. So everything is like working progress as much feedback as we can get the better because we'll provide a better product for our customers to use. I just want to make good that was more than a criticism. No, no, no. Search for rationale. I'm not part of the UX team. So I don't have the exact details. Thank you so much. Okay, sir. First of all, thank you very much for your work as well as the presentation. There was a picture of logging with security that looked like mocaps. Are they already available or no, no, I guess you mean this one? I guess you mean this one, right? Yeah, it's still a mock-up and we'll work on it. So we started with external entity providers and we'll continue with the other two methods. So I don't know when this will be available, but we are working on it. Philip. Thank you. Sorry, go ahead. So of course you're primarily implementing, integrating this first into a GDM course because that's what you're working on. But will you implement it in such a way or when it is finalized and solidified, can people who have different display managers or even implement it all be able to implement that? Because otherwise we'll be forced to either use GDM or maybe SDDM. OK. So the question was whether other login providers, like I guess KDE or some other one. Like GDM maybe. Yeah, if they will be able to provide this authentication mechanism, the answer is yes. They just need to follow this design. So the PAM conversation is part of the LIPPAN, which is kind of a standard nowadays. And the JSON message, it's defined here in this design page. So the graphical interface just needs to follow this to be able to implement it. It's not implemented, but if somebody wants to already start implementing it, we are fine with it. I mean, we don't have any problem. We are providing Nomi because we are using that in Fedora. But anybody else can come here and implement their part. So they're going to have that tight level of integration where you need to have it? No, no, no, no, no, not at all. We just need a PAM conversation there happening and to follow this diagram. That's all. OK. Thank you. Yeah. Go ahead. I wonder if, for example, I have a laptop set up with both password authentication and a second PAMD module for Trezor, which is like this USB thing. And I can click the button and I can log in. And that allows me to choose whether I type in my password or I press the button on the Trezor. And you have it so that you have two different flows for the authentication. Do you think it would be possible to set it up so that you could either type in your password or tap your smart card to have it win a single flow without the user selecting the flow? If you have a smart card reader of an NFC device on your laptop, then when the password prompt is showing, it would potentially be possible for the user to tap their smart card even though the password prompt is showing without the user having to click on it. I'm like, I can type my password or log in to have a think of it. Yeah. It's the same on your phone. You can make your symbol or you can just use your fingerprint scanner. You don't need to choose what you want. This is a bit more complicated than the case of GNOME and GDM because GDM uses different PAMD stack configuration for cases when it detects a smart card. And it uses GDM smart card. Yeah. And that one explicitly includes expectation that smart card is engaged. So it will not use the one that uses just the password in that case. Your device, if it's supported by a separate PAMM module, yes. Then it will work in the stack of the normal password basis authentication. It will work with this one. This one will just be skipped completely because your module will handle it. That's how PAMM works. So the whole concept here, depending on PAMM, basically being used for everything. So time's up. Andreas, do you want to say something like that? Yeah. Do you plan to extend the PAMM conversion of how to talk to PAMM because that's still very limited? And one problem, for example, if you have multiple domains and have one to have the selection, that the user is able to select this domain, that's the problem to present. So you actually would need to extend the PAMM conversation actually to the first. So you can use this to start your life. We can use the same JSON format. It already allows you to do that. Yeah. Can I define a primary plausible which is part of PAMM that's not fully linked? And so they can go to that. So it's not a classic on multiple terms. OK, time's up. I think we can continue this discussion outside. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah.
Automated Integration of FreeIPA with AD and External IdP
So let's start with the next talk. And this welcome, Thomas Werner, talking about Ansible Free APA. Hello. So, let's go in here. So, this talk will be about Ansible Free APA to use for AD integration, so Microsoft Active Directory, and also to set up external identity providers. The plan was to have an online demo here, but yeah, there have been some issues. So, there is slides and there will be online demo later on on the web page. So, for the automated Free APA deployment, I have been using the work from my colleague, Raphael, and we had to change something in the inventory to make it work, especially in my environment. So, if you want to do it on your own, it's used as a base for the whole presentation. These are the steps to do, and one important thing is please fix time and time zone on all the machines, otherwise you will have fun. With covers, tickets, not valid tickets, tickets that are in the future or in the past and so on, so not fun. The first step was get your windows you want to use. There is a nice documentation on this web page from Raphael. The different steps you need to do where you can get the images and what kind of images are working and so on. The first step that we need to do from Windows AD step setup is to change the first script that we have this, Windows AD setup playbook to disable IPv6 because if we do not, we will have lots of fun with DNS later on. So, this was one of the most important things. And we're coming to the setup of the IPA server. For IPA server, as we wanted to have a replica deployment later on, it was needed also to enable DNS and auto reverse. Sadly, there is an issue with auto reverse creation later on, but it is fixed manually. And there is another issue with DNS with Windows, so you need to disable DNS stack validation. In the lab, that's a lab. Yep. You will find out if it's working for you or not. So, then you can simply do the steps that are on the web page. So, the first IPA setup, then there is a nice test to make sure that DNS is really working on both sides. This is the NSLookup test. So, it tries to find the Kerberos TXT records on Windows side and Linux side on both ends. So, it verifies that everything is working in here. And the last one is setting up the trust. I'm not adding the information in here because it's completely unchanged from the script in here. And after we've done that, we can log in with the AD administrator into our Linux server. ServerLinIPa test. You can see I can log in, I have a ticket, I can get my AD, and then I'm trying to do a change in IPA. And it says, hmm, invalid credentials. Okay, so, but we have a solution for that. Also an answer for your IPA that was added lately. So, we can add, we can grant the rights to the AD administrator to act as an IPA administrator. So, the first one is adding an ID overwrite that is needed to be able to use the AD administrator. And the second one is adding this an overwrite for this administrator to the admins group to make sure that this user is able to do something so that it has admin rights on IPA. And after we've done this, we can directly add a user, remove a user. You can do everything. So, host, users, whatever you want be. AD administrator is an IPA administrator. And the next part was, okay, let's try to do client deployment using this AD administrator. So, the inventory file needed to be a little bit changed in here. So, there's a client set up. There is also a setting. I don't know if you know about that. This is configured DNS resolver. This is a feature of the IPA client role to set up the client in a way that is that the DNS server or the DNS server you're configuring here. This is the IP address of the DNS server that was used. So, the first step of IPA client role is to set up network manager or systemdresolved or resolveconf so that you are able to use the DNS server directly. So, it's not needed to do this manually. And also, if you do an unconfigure, it will remove it again if you set the variable. So, it's doing this automatically. And the next two lines are to force it to use the administrator of AD. And there is one important thing here. You need to write it correctly. So, that means a capitalized A in administrator and also the domain capitalized. Otherwise, it will not work. If you log in, it's working without because there is a rule for that. But here, there is nothing. So, you need to write it correctly. This is the first issue. So, why isn't it able to find the AD administrator? But, okay. With this, we are able to deploy the client. And it's working afterwards. So, you see here, the next one is the playbook to deploy the client. So, it's the normal thing you see on Ansible Free APA. There is a playbook for this. So, you can simply consume it. Yeah, so, client was easy. What is next? Yeah, replica. But for replica, we ran into an issue with command line and also Ansible Free APA. So, there is currently an issue in the replica connection check. It tries to use admin. And for sure, the password is not valid. So, it fails. We will find out what exactly the issue is to solve it. It's affecting command line. So, Free APA package itself and also Ansible Free APA. So, it doesn't matter which you're using to deploy. They will both fail. But there is a temporary workaround. So, disable replica contact. But make sure that it's working. So, DNS needs to be working. And also, the reverse lookup needs to be working. Otherwise, it will fail also. And the next step is simply to deploy the replica. And then we are there. And we have a working replica. We can use it also to deploy clients. Also using administrator AD8-PI test and so on. So, we have some issues and we will work on them in the next time. But they are relatively small in my opinion. It could have been worse. And now we are coming to the second part. A colleague of mine wanted to present this here, but he was not able to come. So, the second thing that we added in Ansible Free APA was the possibility to configure and use external IDP. There will be another talk later on about external IDP. It will be true. But we will be a big deal for that. So, any open questions here might be solved by this one. So, Free IPA has the modules for external IDP. There is a new module that was added to Ansible Free APA and also the use of external IDP was added to the user and so on. So, we can configure Free IPA as an OAuth application on GitHub. This is the example that I will show here. So, let's go directly to this one. So, we are creating a GitHub OAuth application in the first step because this is needed to be able to configure external IDP with IPA. So, the steps is simply go to your GitHub, go to developer settings, OAuth apps and register a new application and read the docs. If you do so, it will ask for several things. So, it will ask for application name, the homepage URL which is also the authorization callback URL. So, you should have the same in there. This is the iperserver.com. And please also add a description to be able to find it later on. And enable device flow, this is needed for IPA to be able to handle this at all in the end. So, this is very important to enable. And then click register application. And if you have done so, you will get a client ID and also a client secret. It's very important to keep those secret. But one thing, you need both of them in the next step for the setup of external IDP. So, there is no way to get to the second one again. So, either write it down, make a screenshot, whatever, but in a safe way. And if you have those settings, you can go to Ansible Free APA. And here you see, we are using simply them in text form. But you can also use Ansible Vault for that. So, that you do not have the passwords here, but for simplicity. It's here, the same with IPA admin password. It's here simply to make it simple for us to see what's going on there. Otherwise, it will be a little bit cryptic. And so, this is simply creating, setting up the external provider. And in the next step, we need to retrieve the GitHub user ID. Oh, one thing that we should add here. So, IDP user ID is set to ID here. There is another way, but this is way better because with GitHub, it's possible to reuse names. So, it's really good to use ID here for authentication because then you will not run into this possible name clash later on. And this is common problem for many IDPs, that if you delete the user and after some time, another user registers the same user visible name, that user becomes basically squirting the previous one. And many of those providers, they run like 90 days protection of the accounts. Even if you delete accounts, you can not register one. But eventually, they expire. So, somebody can squirt your account in this way. If you've configured your systems to trust whatever was the user name in the system, good luck. You will be hacked a year later when you start doing it. So, taking these other fields into account is very important. And this is part of administrators to kind of design this thing. Unfortunately, all these fields, they are not visible in the UI. So, normal view can not see this information. So, it's admin that needs to discover. Yeah. So, use it this way. So, we're retrieving the GitHub user ID. It's stored here. And in the next step, the IDP user ID is using this retrieved user ID. The bad thing here is, sadly, IDP user ID here and IDP user ID here is not the same. One is a user ID, so a number, and the other one is really a user name. So, be careful and read correctly. So, the thing is, Ansible Free API is trying to use the names from IP itself. So, you will see the same naming issues in Ansible Free API that you see in free IP itself. And after we've done this, the user is able to authenticate. So, it needs to get the code, and with the code, it's possible to log in. And that's it. Thank you. So, we have like six minutes. Do you have questions? Yes, please. Please scroll to the beginning of your presentation where you describe the service stuff. This one? Yes. Well, I hate the NSSEC myself, but why do you disable the NSSEC presentation? If you do not disable the NSSEC right now, the IPA server gets a reply from the Windows DNS server, it's ignoring it. Maybe you should have said that this is about the lab setup. So, if you have a lab where you don't have a Windows configured DNS SSEC in the Windows DNS server setup, which is not default, they don't set up the DNS SSEC. So, your lab is basically disconnected from the Internet, and it doesn't care about DNS SSEC. So, that's your ADC DNS SSEC in that lab, or you don't. So, Thomas choose the easiest way to configure DNS SSEC in the lab. But, IPA configures DNS SSEC by default if you use the internal DNS server. That's why it's forced into disabled validation because it's enabled on IPA sign. So, either that Windows setup in the lab needs to gain DNS SSEC configuration, or both of the sites needs to drop DNS SSEC validation. Sorry, I don't get, because, well, if you implemented DNS SSEC validation, implying that the DNS records SSEC are required, it's something weird. Otherwise, you just don't get any signatures, and you have nothing to disable. Now, a buy-in server, which IPA uses as an internal server, has DNS SSEC validation enabled by default. You cannot switch it off unless you explicitly say to switch it off. Yes, but if there is no signature, it should not reach anything. It does check, and it does reject the request. It rejects on signed address. But this is just a lab story. I try you do not disable DNS SSEC validation in the wild, unless you know what you are doing and then pay consequences. Okay, thanks. Yeah, I think probably the problem, the reality is that a lot of RokyD setups don't have it enabled because nobody ever talks to them, they're blue. In many cases, people are using cloud-based DNS services, so it's not the DNS server on your infrastructure. It's some of the DNS servers provided by, I don't remember the names of those companies, and those typically do not have DNS SSEC enabled for the whole zone that the company rents out of them. Okay, yes, just a second. Okay, so there's some language. Does this experimentary work with Samba as well? Does the external ID provider work with Samba also? I think I need to answer this. Thank you. One of them is using AD user to manage IPA. This will work with Samba, because this is just a normal job between IPA and activity. The external ID provider is called IPA users, because IPA only authenticates users that are in IPA. AD users authenticated by the active territory domain controllers. Microsoft implementation of active territory does not have perverse pre-authentication method that supports anything like that at all. Completely. Same with Samba AD. Samba AD built with Heimdall has no way to handle this. Samba AD built against MIT perverse has theoretical way to handle this. It's not implemented, and it's on my and Andrea's plans to complete this work on Samba AD site. There will be more about it in the hour talk, that will be the last talk of the day. More questions? I hope it was captured by the mic. I hope. Yes? It's the difference between this kind of integration and the one you spoke about in the morning. Wow, good question. Maybe I can answer this one. This is basically Azure Freebase. It's a new Samba network that will make the point of computer development. Here we have an ID integration. This kind of Azure Freebase is simply created to establish a class with AD. The one for my presentation is basically a container service that is capable of connecting to the LDAP from AD to make a request to LDAP, Python LDAP, something like that. The one that was in the morning, the PAA theorem, in the center, the client enrolled into AD or IPA and provided services to web applications, which is key to all of this case. Any more questions? We have time. Thank you.
Connecting IBM AIX to Red Hat Identity Manager (FreeIPA)
Thank you. So you know if you work a lot of the green screens, so after some time you cannot distinguish between green and yellow. So it's what happens to me and we are talking about IBM RX, and it's usually green screen. So we are here in the deaf room and the first, I'm not a developer. I'm a classical system administrator. So yeah, you can say to me I'm a DevOps engineer because I can quote, I can quote and see and Google Go and Rust in Python and Ruby, so I used everything. And I ported a lot of tools to IBM IX and to Linux on IBM Power, and on OpenPower. I'm not an IBMer. Usually IBM has talks about IBM Power. I'm not. I'm an IBM champion. So if you know what is AWS Hero, Microsoft, most valued professional, so this is the similar program from IBM. I'm not a red-hitter. I don't have a red-hitter. So as you see, I'm a red-hitter certified engineer and I'm a red-hitter instructor. So that's why some of these talk may sound for you like a training material, so it is not. And we're here at FOSDEM, beer, open source hackers. What does it do? What has it to do with IX? IX is a series of proprietary, Unix operating systems called sourceWare. So this is my attitude to it. Real hackers don't need source code. I was born in another time in another country and we didn't have access to source code mostly. And the real hackers, or real hacker, is someone who can understand how the program works without looking into the source code and who can change it without looking into the source code. So, who are some of the men who use it today? Who uses IX? Nobody? You are all wrong. Do you have a banking account? You're a bank user, so do you have insurance? Your insurance user, so how did you came here by car, by train, by flight? Everyone uses IX. So retailers, manufacturers, if you bought something, it was done on IX. So do you have this thing? No, you don't have an onix on it. But in the back end, it is IX database, or database on IX, who processes all your orders and sign-ons and so on. So I added some, yeah, I added some marketing sheet because nobody knows what IBM power is. Sheet with long e, not with short. I just want to say that this is my favorite guy, IBM Power 8 and AC. So it has 1,920 logical CPUs and 64 terabytes of memory. So you can have it everything in one position, so in one virtual machine. So I did once with six, first time I did with 640 CPUs, virtual machines. So I wanted to look what are my CPUs are doing. And you know, even if you have, let's say, 40 lines in your terminal, 640 CPUs, it is how many? 16 pages of CPUs. So it takes some time to get them. But there are some funny facts about power which are not very well known. So fun fact number one, zero successful data breaches. So it is 2022. I don't think it is about that IX is so secure. IX is not so secure. So it is like any other operating systems. And most of the guys, system administrators, cannot use it correctly. So but the other side of the story, nobody knows about it. So it's difficult to breach into it if you don't know how to use it. Fun fact number two, it's 14 years the major reliable server. So somewhere here, the typo. So it's really reliable. To bring it down, it's impossible. So it just works. So it can work year long. You can forget about it, it will work anyway. So and I like this fun fact. So this is about performance. Let a P in power is performance. So you see here IBM E-server P5. So this is the fifth generation of power servers 2005. Did an SIP benchmark, somehow 8000 something steps. So eight years later Fujitsu's park could do almost the same. Last year, the latest and greatest Intel Dell Power H with the latest and greatest Intel CPU just outperformed by 1%. So you know what is the most powerful server right now in this benchmark. So the first three places are there. Before going further, we had to talk a little bit about IX and what we should understand. What makes it so easy and so difficult working with IX. So it's a real unique standard operating system. It is everything is standardized and everything what is implemented in IX is standardized. So implemented according to standards. But you know standards can be a little bit different according to the developer. So it depends on developer who develops or who implements their standard usually. One of the most things is binary compatibility. So if you ask any IX admin, they will say binary compatibility is the most important thing because yes, I can run on my most modern IX server. Binary is a program which were compiled 20 years ago. I did it even more than 20 years ago. Another case of this binary compatibility, so you don't innovate. Because you have it, it works. So why should you innovate? Why should you do newer things? It's not BSD based and not system 5B based. So it's a usual, it's OSF-1 based if someone remembers OSF-1. So it was end of the eighties, beginning of the nineties when IBM, HP and digital united together to make a new standard in Unix world. And they did OSF-1. And of course, because not everything can be standardized, it has some unique features. And let's go to authentication. So PIME, everyone knows. Everyone uses in Linux PIME. IX has support for PIME. Everything is good? No. So it's originated in Solaris, yeah. Can be used in IX. But IX uses old Solaris implementation of PIME from the end of the nineties. And it's really paining in the ass, sorry, to port some PIME module from Linux to IX. I tried to port Azure AD authentication to IX. And I failed and after one week I said no, I will not do it. Because the differences between APIs in IX and old PIME interface and newer interface. But IX has something different called load authentication module. This is original IX idea, how to make almost the same. So it was done even before PIME. So five years I think before PIME, they did LAM. Almost the same, but a little bit different. IX only technology and very popular in IX world. Again, not because it is the best technology. Usually because system administrators don't know anything else. So that's why they use LAM. It is by default there. So and the most big feature of it, there is almost no documentation how to use it and how to develop it. So first time I developed my LAM module, I used Samba source code to understand how it works. Because Samba had LAM module for IX and IBM didn't provide anything. So it's not really versus, it's together. They work together. So we have PIME and LAM. We can have application one using PIME and application two using LAM. It's flexibility. So we can have user one using PIME and user two using LAM on the same system. We can do everything we want. Every user has 50 attributes, different attributes we can configure. So it's not like in Linux where you have home directory password, user ID and so on. In IX you have 50 attributes. You must not use every attribute, but you can use it. And you can configure, you can have different password policy for different user based on different dictionaries and so on. But even in wars you can configure PIME to use LAM for authentication and use LAM to configure LAM to use PIME for authentication. So usually it's good that IX administrators don't know about this feature, but you can get into real, I don't know how to say it. So you will be waiting for authentication 20 years till it completes because LAM will consult PIME and PIME will consult LAM. So now let's go a little bit into details. So the first configuration we need to, we can choose, do we use standard authentication? It is LAM, loadable authentication model, or we use PIME authentication. So we configure it's just normal in config file and this is standard value, standard authentication. And in our case we leave it as standard authentication. So next in ETC security user we can configure different user attributes. And this is one attribute system which is usually tells us which loadable authentication modules we should use to authenticate the user. By default in IX there are two variants, files or compact, they are not very, really different, but you can install additional authentication modules and add much more. So it works with IX only functions authenticate, it is not POSIX, it's not single user specification, it's just IX. And you just get username from user and send it to this function. And functions read ETC security user, it somehow works with system which we configured and says, okay, let's go use this prompt to the user. So in this case it can be that user one uses LDAP and authenticate with free EPA for example. And other user uses Kerber's and authenticate with Microsoft Active Directory, the third user can use multi-factor authentication and authenticate through GitHub. So all on the same system. But it's not all because as I said sometimes documentation lacks some information. There is another function authenticate X in the newer version of IX. Yeah, IBM is like me in this case, so they have very big brain how to name functions. So I also use the variables name, R, Y, J, K, L, M and so on. But the difference is here with this state. So what should be used, I don't know. But one of these two functions is used by login and it then goes to loadable authentication modules. And we have a configuration for loadable authentication module in this file and this is standard way. As you see we have one module for 32 bits programs and another module for 64 bit programs. And this is again the problem for me personally because I like to use modern programming languages like Google Go or Rust. And they are 64 bit only on IX. But in this case you need also to have 32 bit program and it means you can use just C or that's it, nothing else. And this is what comes by default as you see IX delivers carbers and LDAP as default modules. They are just there, you need to configure them. But there are some pieces missing of information here. So if you want to use LDAP you must install IBM directory server LDAP file sets. They are delivered with IX, so but they are not installed by default. Similar if you want to use carbers you must install IBM network authentication service file sets. They are also delivered but not installed by default. In this particular case I use only LDAP but you can use also carbers for authentication and LDAP as directory service to list to users, to get directory for users. So do I need to do something on free EPASite? No, really no, no. You just install and use free PSU. Usually do. So I have such installation at customer site and they just installed, they didn't do anything special for IX on there. I usually do something, so in my test I usually do OK2oFS delegate too. It's not for LDAP indeed, so it's more for carbers single sign on which I don't use here but so it works. And I create separate IX for you because there are some gotchas with IX. So BESH is not on every IX installed by default. So just on newer versions of IX it is by default there. On older IX version we use con shell 93 and on IX standard user group has JID 1 and not 100 SM Linux. So that's why I do this using Vue. So and yes, just small ansible snippet what I do, what I told about and again thank you very much for ansible modules. So magnificent. So IX site. On IX site we just create second app configuration with this command and we specify here our IR server. The name which we use to log into free EPO and password you see I use the long very long password and cryptical so and where we can start searching. So it creates automatically configuration for loadable authentication module. It creates automatically LDAP client configuration and starts LDAP client. And if you see here it finds really finds where it can find users and groups on its own. Nothing, no magic, no rocket science. The only thing you can see here is RFC 2307. As in Linux you use usually RFC 2307 BESH and here there is no BESH. Why? Because as I told IX is a standard operating system. So and RFC 2307 since 20 years is a draft of a standard but not a standard. So sorry guys it will not be implemented. So this is official answer from my BMI gut because it's not a standard. So everything else looks good. So there are some configuration changes I see they must be done. So but in real life they should be done not must. So you want to have home directors for your users to be created when they log in. So I found so okay you say that by default all the users are in LDAP directory. So and I found that on the new US-TIX these two parameters and password policy plays not very nicely together with LDAP. Another feature the main list groups so by default IX use if user to IX comes from LDAP directory. It can have only LDAP groups. It can be in local groups and switching on these the main list groups I say okay the user can be also in a member in a local group. So and one more if you use views in FreeEpa like me so just don't forget to adhere IX views and restart LDAP client that's it. So everything is configured everything is working so it's not so interesting if something goes wrong you can use LDAP to check what FreeEpa brings to you or what IX sees on FreeEpa side. And if something goes completely wrong you can switch on debugging with such magical variables nowhere documented almost nowhere documented. So but be careful with them first of all they produce really a lot of output. Second you can find even your passwords in clear text in this output. So okay mapping sometimes you must change your mapping this is standard mapping what all IX uses but you may find situation I was for one bank another a little bit different mapping because they need some additional fields in their case. So and you can change the mapping here and it's rather easy it's no rocket science so does it work yes but there is always some but some so there are some wishes first of all as I told standards are so how. How developer implements the standard it's depends on developer and I extend that sees a little bit different password class change attribute as in FreeEpa so different formats of dates. So it's on IX side and I development tries to fix the problem since I think one year something like so it's when you walk with closed source operating system so you can't fix it on your own sorry. I don't I didn't find the way how to make HB a quacking. So I have the answer thank you. But I would like to have it working so yes free plays missing my favorite 50 IX use attributes and I would like to have them there it's really one of the things I love in our IX. We have even more IX specific like role based access control and trusted execution so if you switch it on you cannot execute some some binary on IX without it will check the signage of the binary and everything can be stored in held up so that's why it would be nice to have it in free part to. Yeah. And yes I exist missing 23 but I would like to have a native free Epa client which is not there. So and if you can help me feel free to ping and we will talk thank you very much for the time and. And. We have any. One minute. Yeah. Not a question just a side note. About stuff so. Well it's sort of catch 22 until you. Until RFC is implemented it can't become a standard for a proper standard. It is implemented by numerous. Yeah. So probably. Raise the issue to. IETF is not interested in finishing this work. The people who originally started this job not interested because everything is working for everyone. That's that's. Position. Okay.
Empowering FreeIPA: a dive into the modern WebUI
So ready? Okay. So, yeah, I have a huge disclaimer before I start. It's that this talk I'm giving now, it was supposed to be done by Carla, who drives this effort in Rehat. So, yeah, she was explaining me a lot of things, a lot of things that I hope I will remember. So, yeah, and also if you have challenging questions, we will probably need to get redirected to Carla somehow. So this is all about the web UI, the free API web UI. Okay. So, yes, I just would like to go quickly through the agenda. So I'm planning to, on doing a very quick overview of the background and historical context. Then I'll talk about the main motivation for the chain. We will see the technology stack transition from the current one to the new one. And this some stack transition not only applies to the technology or the components, but also has implications on the whole testing framework and documentation. So then if there is no demo effect, we will have a look to the modern UI from a live and public instance that all of you will be able to access as well. And you can play, if you have your laptop on, you will be able to play with it as well as we speak. And finally, yeah, before the Q&A, we will have a look to the future roadmap because this is a working progress. We haven't implemented all the functionality yet. We have big pieces, but it's not ready yet and we are working in upstream, you know. And what is more important, we will also cover how you can contribute to it as well. And then, yes, this is kind of the background. Free API is an identity management solution combining Linux, 3D9DS, MIT, Kervos, NTP, DNR, DocTag, Certificate System. I mean, one of the most important aspects is that it provides a centralized authentication authorization built on top of very well-known open source components, as you know. And it consists of a web interface, the web UI, and a command line administration tools. And, yeah, the very first version was released in August 2007, so it's been a while, the first version. And at that point in time, yeah, in the very beginning, one of the main goals, it was simply to have party TV between the UI and the click commands so that you get the same stuff from the web UI and also from the click commands. So that was one of the main goals at the very beginning, very long time ago. And the first implementation was based on Turbogears, which is, as you know, probably a Python web framework, consisting of Seral, WS-GNA components that this is still active if I'm not mistaken. So there is still activity in GitHub. So this web UI consisted of some sort of a tool wrapper, capable of interacting with the IPA server through the IPA IPI commands. And it was capable of sending IPA server commands, fetch the data, and modify before displaying. So very basic thing. So as you can imagine, during those days, no React, no Angular, so they had yet to emerge. It was a plain JavaScript in the end. And there was a lot of limitations, like it was necessary to optimize the number of HTTP requests and also minimize on how the JavaScript files exist. So in the end, we ended up using a lot of files for the web UI. So the first significant evolution of the web UI was driving by the Dojo library. So with the help of this library, it was possible to transform free API JavaScript packages in AMD modules so that it was possible to build into a single file, and this was helping a lot with performance. We managed to reduce all this inside number of JavaScript files to be processes. And then around 2014, all the styles and guidelines for UI, for UX, for the web UI, you know, were adjusted. This was done to align with the Red Hat, Common User, Environment, RQ, you know, which later became the pattern fly. The pattern fly that you know today comes from the Red Hat Common User and Environment. So basically, pattern fly is a set of best practices to provide users and unify them into the experience when navigating through the UI. So, yeah, and this is how it looks like the current web UI, the one that you can see today. So, yeah, so this is why we summarize somehow why are we changing it and what are the main motivations for this change. First of all, the current web UI is following pattern fly 2 and it's unsupported. So, some of the functionality that we need, this is kind of blocking us somehow to implement new features in the web UI because we are still following the pattern fly 2. So, it makes evolving the UI very difficult and what is more important is to stop using outdated tools and libraries like the Dojo library that I was talking about before that is going to be replaced by React. So, and then how are we doing, yeah, and this slide is somehow trying to summarize how are we going to do this. So, it's basically following some sort of guiding principles. So, the most important thing for us is to minimize the disruption with our assistant users. So, this is like when you go to the supermarket and the location of all the products is changed. Every time you go to the supermarket you are unable to find the things because they change the things so that you spend more time on the supermarket and buy more things. So, we are not going to follow that strategy. So, in the new web UI you will find all the things in the same place. So, we will focus more on improving the components and yeah. So, then now you can see the technical stack transition that we are making. Okay, on the left you can see the current web UI, all the technologies we are using there and this is what we are moving now. Okay, so the first one is pattern fly from the version 2 to the version 5. Okay, these are all the guidelines I was mentioning at the beginning and best practices. Okay, then we are going to move from dojo to react which is more modular content and is based on components and is kind of the modern way. And another one which is a very important piece is how we test the web UI. We are currently using PyTest and we are going to replace that by Cypress and Cucumber. As you know these two technologies follow the behavior driving test. They are more human readable. It is super easy to test with Cucumber and Cypress and yeah, this is an important change as well. So, okay, yeah, let's start with the comparison between the old one and the new one. So, the most significant one I think is the navigation bar. So, instead of having it there and the sub task, now we have it on the left bar and it is very well organized in the sense that all the sections, they depend on each other. We have three levels now and then you can collapse and expand as you go. So, this is kind of a huge improvement. And then, and it is always visible and now you have the ability also to hide it and have more space for the other bar. Another big change in this comparison is that the settings form and now we implemented a very cool thing here is the jumping links that we have here. Okay, so you have them available all the time and then you can click on them and then they will move you to the correct view. So, another one is that all these buttons, they are floating buttons and they will stay there all the time. So, also the tables, they are kind of refactored now and this is now much more clear as you can see, you know. So, another one is the scrolling. This is another important one because in the old one, when you were scrolling through all the list of users, for instance, I mean, you lose the navigation bar. I mean, the whole UI was scrolling. So, now this is not happening anymore. And now you can scroll and then you will have all the time the navigation bar there. Okay. So, this is a very nice improvement as well. Yes. So, let's continue then about modernizing the UI. So, this is kind of the architecture. We're following. Okay. So, on the left part you have the front end. You have the modern web UI is sitting on React. Okay. The React is kind of umbrella for all those libraries. From top to bottom. Then we have butterfly 5 that we already talked about that. The testing libraries, it's capable of connecting with the testing libraries with Cypress and Cucumber. Then we have the React router library that this is providing the multi-paging functionality. And this is because React's only built single-page applications. So, this library provides the feeling of moving through multiple pages. Okay. So, this is why we are using that one. And as you can see, we have the internalization one here, but with dots. With dots because we haven't implemented yet. This is not available. But it is on our plan and we need to investigate and see what will be the design. How can we connect that library also with React so that we can also cover different languages and stuff like that. And then the communication layer is one of the most important libraries that we are using. It's using RTK query library. And we are using this one because the connection with the IPA server is through RPC. This is one of the best libraries to do so. So, basically, from the modern web UI, from the forms, we are kind of collecting the data. And then we send an API call to the IPA server. Then we grab the response back and then we process the data that we show in the web UI. This is the way it works somehow. Yeah, basically, this RTK query is capable of performing group operations in JSON RPC. That's the whole idea. So, yeah, let's continue. Yeah, this is about the testing framework I was mentioning at the beginning. So, we moved from PyTest to Cucumber and Cypress using the behavioral driving development. This method that you know is more human readable and then implemented tests now. Super easy, but you are going to see that with this sample, this side-to-side comparison. This is the old version. This is PyTest. And this is the new one. It's more oriented on the user. So, you just describe what do you want to do in the web UI. Like adding a user. Okay, I open the side menu. I click on this and click on that. And this is the test. All right. And even better, I think I have a video. It was done by Michal Polovka, one of my colleagues. He's working on this. Yeah, it's working. Let's reproduce it. So, this bit is really kind of... Yeah, you are going to see now how one test is executed. This is going very fast because it's supposed to be fully automated, you know. You see? And the good thing about this is that if everything... If something fails at some step, it will freeze there. It will wait until you have a look and you will see exactly the step that is failing with all the locks and stuff like that. So, it's amazing. Yeah. So, with this framework, it's kind of really fun to implement tests now, you know. Yes. All these tests are obviously running in the GitHub project. We have enabled GitHub actions. And yeah, it's really easy to implement this. Okay. So, if we continue... If this is working... So, yeah. This is the same video, right? Yeah. Okay. So, now, yes. Let's have a look. So, before this presentation, I deployed a public instance in EC2 Amazon Web Services. So, I did a trick with the... Well, we have a trick to expose the modern UI, okay? Because we haven't implemented yet the login page for the modern UI. So, we are reusing the old one to login. Once you login, you can access to the other one. So, let me try. But yeah, you can access, try to access with the laptop and you can start playing it and start adding users if you like or whatever. So, on my end, I'm going to try to access it now. So, yeah. So, this is... Modern UI. So, yeah. So, yeah. This is the modern... I mean, the demo instance, but it's loading the old one. Because I need to login here first. So, do you know the credentials? I guess it was in the slide. Okay. Admin and secret 123. So, this is supposed to work. Yeah, this is the old one. Okay. So, now, I'm going to access to the modern UI. And then I just need to change here. Yeah, and this will give access... No, wait. Modern... Yeah, this is the modern one. So, if I open it a little bit. So, one of the cool things that Carla told me to show is that all the... is that this web UI is very dynamic thanks to React and pattern flying. Okay. And for instance, if I go here and inspect, you can see how dynamic is the new one. For instance, we start reducing the size. You see how all the things are getting little there. And it's kind of... For instance, now the panel disappears because of... to have more space. But if you continue doing so, it will... all the entries, it will appear like some sort of cart. Okay. So, yeah. So, this is gradually improving a lot of things here. So, yes. So, what I mentioned in the mock-ups that you have in the navigation, you can collapse, you can hide the one. I mean, the entire navigation if you like. And you have three levels. You can go for more sections. And, yeah. So, yeah. So, this is... there are available. And don't trust too much from this instance because I reset every night with one of these nice automations in the Amazon Web Services that destroyed the instance and then recreated again. But, yeah, you can play with it. So, let's continue with the presentation then. Yes. Yes, I embedded another video where Carlisle is playing way much more things than I. Okay. So, you have available in the slides. This is pointing to a YouTube video. So, where she basically is playing in a lot of content. Okay. That you can also watch if you like. And she's using a model, an instance with a lot of data populated so that you will see more content there. All right. So, yeah. Then, as part of future roadmap, what is next? I mean, the situation of the project. I mean, we implemented two major sections in the new YPI already, but we need to keep implementing new pages and functionality. So, it is on the roadmap to improve the routing and somehow figure out how to add direct links to pages. This was in the old one, but not in the new one. And we need to investigate. Also, we need to explore because the new web UI is capable of sorting the data or listing a lot of entry records. And we have limitations with held up. So, this is something that is still open and we need to test performance with that and see how can we improve the internalization that I was mentioning before is still missing. Also, we would love to implement a new login page, but in this new login page, we will, yeah, we would love to enable new authentication types. Like, for instance, if you have been in the other talks, passwordless external IDP, all this type of stuff, we will implement it in the new login page. Okay. And also, there is another topic, which is for a little bit for the future. Once we have the project more advanced is how are we going to adapt all those external plugins that user or the community, they have kind of plugins that they are able to connect to the web UI. And then we need to provide a solution for them as well. And, yeah, and the last thing is that we are driving all this initiative all this product in GitHub. We are engaged a lot with the community. And so, I can show you in the GitHub project, we have GitHub project, sorry. So that we didn't strive for in the sense that it's super easy to contribute to the project. I mean, if you are, if you like these technologies, and you will love to work with the Red Hat Web UI team. So you will be able to do so. This is the project and then it's super simple. You just follow whatever is in the read me file and then in terms of one minute, you can get all the development environment in your own laptop. And, oh yeah, I see. Wow. I miss the time. Yeah, so everything is well explained. I mean, with background you can bring up and yeah. So, sorry. So I think that was the last slide. Yeah, the next is kind of, yeah, how to contribute, but I already explained that. So there are open discussions as well. And yeah. So, yeah, that was it then. So I guess we don't have time for questions or. Yeah, that's a good question. I mean, so the development is happening in upstream. So it's very easy to follow what is happening there. Right. And let's say that there are three main sections in the Web UI project. We have implemented two of them. I mean, the most difficult one was the first one. Implement the first one because you need to take the decisions or investigate about things. But now almost all the things are sorted out and we are speeding a lot with the project. And we believe that maybe by the end of the year we will see the new one. We will see. But we can speed up if we get more contribution. Thank you.
POSIX identities out of OAuth2 identity providers: how to redesign SSSD and Samba?
Thank you for coming for this final talk of our dev room and let's hope we even maybe get the demo working, whatever that means. So this talk is about how to get POSIX identities out of the OAuth identity providers. Literally not about how to get them out today, but how we want to change the existing operating system layers, which SSSD and Samba both are providing, to make it actually working and maybe gain something interesting in the process of it. So who we are. This talk kind of packages together a work we do at Red Hat on VAP, on Samba and SSSD and it's from me, Alexander Bokavoy, then Andreas Schneider and Sumit Bosa who is one of the core developers of SSSD and actually one of those three S in the name is Sumit. So is it. Yeah. And this talk is about a bunch of different projects and different protocols and we will try to really just reflect on what is there and maybe see some of the future that is there. The thing is that the POSIX identities is not really what people typically think about them and it's something that is driven by the requirements of a kernel or any unique separating system. And typically for the kernel the most important part is that when a process is started it needs to run on the certain identity and when this process needs to access a data stored somewhere that identity needs to be checked against identity associated with this data. And the way how it's done varies widely across multiple operating systems but on the POSIX side, so UNIX-like or POSIX-like operating systems which Linux tries to implement and outperform in this implementation what could be there, the certain things are kind of required. So UID and GID here you know it's user identity and group identity groups have association with when process start there is a primary group of that user that is associated with the process and that's how user get access to files not owned directly by the user but there are more elements in it and basically the work we do in the login process is effectively enable all the knowledge about these identities to the processes that will be started after the login. And we focus on the enterprise environment but sometimes now enterprise is really a small company that outsourced their authentication to a cloud provider and how cloud provider handles their authentication is not their own business that their own business is what they are doing but still the machines are here the workstations and they have this data that needs to be stored somewhere right and associated with different stuff. And if we look at the OAuth IDPS in this particular case I'm looking at the OIDC connect default claims that you request when you use OIDC client then it's literally like two claims that can be reused none of them is really useful for the actual operating system the description in both cases so the Geekos field on the POSIX user side and the description of a group that's literally all you can derive from the standard claims because if you get the name of the user from the OIDC claim that's not literally the name you want to use as a POSIX username or POSIX group name you may but here be dragons it might include the stuff like SQL sequences SQL things like drop tables and so on in those names and you don't want really to get them in so you want to have certain control over it and you don't get home directory you don't get shell that's what the login process needs to know to launch the initial process of the login you don't get UID GIDs and so on and more to that the access is authenticated so you cannot just come to this IDP ask over OIDC endpoint and give me user token or whatever is there you have to prove that you are who you are and you are who I DC client effectively so the fact is on a linux system SSH server has to know that this user exists in the system before it authenticates you and there's no way to get this authentication before the authentication of the user so it's kind of a cycle that we have to break somehow then yeah on the other side you have effectively to enroll this machine this client that will be talking to IDP and that means you have to store credentials somewhere the it is not the user credentials at the sort of machine credentials so we come back to the enrollment process similar to enterprise systems and there are requirements that it should be stored securely so we we got back to what hardware dependencies TPM for measuring that what we store there is there and so on so it's it's a lot of non-trivial stuff you start with simple things and you get to the point that yeah you want to have this all secure right and if we look at the host enrollment okay some IDPs allow you to dynamically create clients and sure you you can do that users who can do that they there is a process that you can follow so properly if you have user credentials then you can enroll create these clients store credentials for the client somewhere and ensure that they secure encrypted and what not available only to the processes that get there so again bind to TPM nice nice stuff that been talked especially at the virtualization framework dev room yesterday about the confidential VMs and so on so there's infrastructure brewing in the cloud not on the machines but luckily all these machines that have certification for windows right they have hardware TPM and all this so theoretical it should be there and actually it is used in a while because Microsoft Azure AD enrollment for the machine actually uses this it requires TPM and in the hardware it requires to during enrollment process to obtain certain tokens and get all of this stuff and store it and exchange cryptographically sign it data with it all the interesting part of it is that we yeah we figured out parts of it and force it Microsoft to document some of it we got help from them not for everything but there is a process and there is even progress a bit in in proof of concept implementation of the enrollment for Azure ID from David Mulder from somebody in and Susie and actually a few other people work together with him but if you're enrolled you really have to protect these credentials you have to do something to get them seriously not leaking anywhere and also you need to have some information on the IDP side to retrieve it and and that's the bigger question there's no information like I said we we need to define so what this work is about really is to define bits and pieces that we want to store for post-ex identities in in the cloud and the most important part like it was already asked in the previous talks it's all online only so Azure AD authorization authentication and authorization is also online only Microsoft doesn't support offline work with that well they do support some local accounts but then you don't get access to the cloud basis replicated data because it's well not authorized then and finally we can do this on Linux of course because we have Pumstack we can have local passwords and so on and this so but really if we don't have this information and for many public IDPs like Google, GitLab, GitHub and so on we wouldn't have this extension or something there we will only have whatever is there we'll have to generate this information so what we came in is the experience that we had already with Samba and SSSD that we generate these IDs based on certain properties it's hashing of certain values and placing the results of executing some transformations into the ranges that we call ID ranges it's an algorithmic approach yeah algorithmic approach and there are plenty of them so Sumit came with a few approaches nice one is for example this fully qualified one where you basically use the fact that you can use name with email like style or Kerberos like style with IDP domain being the suffix and then you split it so the suffix is used to map to the ID range the specific range for that one and then the first part is used to identify user within that range the downside is alias it would not be possible to support with this because algorithmic mapping basically gives you one-to-one mapping otherwise you have other problems and but it gets you stable mapping across multiple machines so then you can log in with the same user on multiple machines and get access to the same files so this is kind of an interesting story we will still continue working on it and if you have actually post-ex information then maybe you just expose it so effectively make an a separate app that you can query and retrieve the ID ranges retrieve the description of the algorithmic mapping and certain things and maybe even attributes for the users like like Andrei was saying for the AX where is where I can get my 50 additional user attributes that AX loves to have here you get them right you can get this kind of extension and the nice part here is that maybe if we make this app and interfaces kind of common enough then it could be reused between first multiple implementations but second between multiple domains and maybe then we can at least start thinking how we can build up trust and filter in and filter out this information somewhere but of course you can store locally but the problem with the local one local information is that you not always also trust it but maybe you trust the federation federated part because if the IDPs trust each other for federation like you have your own key local key clock and then you have registered ODP client against Google or GitHub or GitLab or whatever that is the federation right the GitHub now knows you and allows you to query data about the user from which is registered at GitHub or GitLab and if it provides you post-ex attributes would you trust them is this trust actually going forward that far maybe it is between the organizations that have much more control like research institutions and universities which have their own network of collaborators and they can guarantee certain properties but not in general I guess so we probably should do there and by the way they recognize in the OALF framework well they recognize the same problem so there is a work ongoing on defining a spec for this identity chainings which is essentially akin to Kerberos S4U protocol extensions and ability to track down who did what and against whom so finally all of this brings us certain identity we need to log in in and IDP authentication that Icar was showing before was really or Thomas was showing with Ansible free ABA is really what it is about and but the most important part is okay it's annoying to perform this online authentication every single time right it's annoying when you have to do SU or sudo and go reach for for that GitHub link and perform it every single time it would be good to have it once well to have it once we use Kerberos we cache the fact that we were authenticated initially and that cache and all the metadata about this fact can be in a Kerberos ticket so this is typical Kerberos flow in a normal environment and it assumes that you have this Kerberos KDC the K distribution center somewhere in data center or somewhere and what if we put this locally what I'm just sorry so one of the idea is when you are logged in you want to access certain services they do not have support for an IDP but they have support already for Kerberos so why not use Kerberos and use that already yep and we already do this so we basically have a lot all this in the environments like Samba, AD or free IPA they both provide KDCs and they both allow to get tickets with various properties check them and and so on we have actually authentication against the external IDP the OAuth 2 device authorization flow that IPA supports it's done through the Kerberos it's actually done behind the Kerberos KDC so when you run K-init to obtain the ticket or SSSD authenticates you it runs K-init against the server and KDC spawns a process behind it and runs the authentication against GitHub, GitLab and the others and all your key clock that's already worse but it requires free IPA deployment okay if we run KDC locally we can reuse more much of this code we only need to get this metadata securely but we'll ready store OIDC client credentials right somewhere so maybe it's the question of getting this information to that small part of the thing running on the local and surprisingly if you want to run this it's not that much is needed of course some changes are needed but we can reuse existing code we can do the work that was already heavily lifted and here's a demo a small demo so I have here a machine which is actually let's hope I I can show something this is just a podman ryanon container just fedora 39 custom because I did some customization of the config in it this is a normal system so I can show you that here's ryanon SSSD inside this system and it takes like 40 Mac memory whatever system system did treats it this way this is KDC running on the same system it only takes 1.6 megabyte if you trust what system did says here at least but yeah it's it's not the question of memory use right it's more a question of do we want to expose this for more than needed but maybe we can run this over a local host L.O. interface maybe we can put this into namespace isolate and give only access to certain things there it's still still not possible to do that fully bridge in the local host things with the namespace where there be 5 KDC run so in this experiment it's running on the host unrestricted so it's like listening on all interfaces but what I get is this so I'm logging in specifically as normal user well this is admin user but it's a normal user right we have password and I get the ticket as a part of that login so now I have configured actually this system to I think it's this way to use SSSD with jssapi that means that if there is a Kerberos ticket and you're allowed to use certain palm services with Kerberos ticket you will be allowed to use it we can check that I am able to just reach that sudo without entering password the fact that I need to press enter is a bug in communication between sudo and SSSD in this but specific setup it it doesn't require to do that in IPA setup for example but the fact is I got the ticket and I used this ticket to actually obtain another ticket the ticket to the host service on this machine that's the name of the machine that's that's the name of the machine it's nothing special just the example thing yeah so that is already possible I did small setup with SSSD for this but I can do this with other SSSD just handles Kerberos better than this but what we get as a part of it is that we actually solve in another problem because just when it was in November last year Microsoft announced that they finally found their way to kill NTLM and you know how they did this by introducing local KDC on Windows 11 machines just normal workstations so they are doing exactly the same they're introducing a local active directory stripped down to not be an active directory adjust the KDC effectively doing this stuff on the machine using an extension that was developed and sort of abandoned more than 10 years ago for Kerberos this IO kerb and reviving it we have almost support for in the MIT Kerberos so we can try and use this to to get it all working and then we get to the point how we control all of it through the identity provider and now I am circling back to the first talk of today of SpiceDB and Zanzibar maybe that's one of Answers how we can standardize some sort of like that and import or export access to that through the IDPs and of course this is a lot of changes and yeah I think we have basically no time to get through it but we will be talking about this in detail at Samba XP in April conference which will be online by the way so accessible to everyone so I encourage you to continue and follow up and if you're interested please ask questions on maybe mail in this or just shut the mail in yep thank you very much you
How Tansu, a Reactive Agnostic Library, Simplifies Widget Creation for AgnosUI
Good morning everybody. So my name is Florence Bria. I'm a UI principal engineer in Amadeus. It's a big company doing software for the travel industry. And I'm here to present Tansu, which is a framework library, a reactive library, Agnostik, that we are using in our new widget library. So a bit of context. My team is doing a lot of widget libraries since many years, like more than 15 years. And we are of open source, a lot of them. The last one, which is very popular, is MGBoostrap. We've up to 2 million downloads per month. And so we are kind of developing a new one, Agnosti, which is using Tansu. And I'm going to explain what Tansu is. So first a bit of Tansu, then the features. Then we will see Agnosti in a few slides. And then we will finish by why Tansu is helping us a lot by developing this Agnosti library. So Tansu is a senior library. So it's a library to do a state management in a reactive way. And there's a lot now of frameworks that are implementing such a library. Okay, so we have Soligys, which has also a senior library, Angular. Now since last year, as a senior library, Quick from the start, Svelte Store, is also some kind of solid library. Vue is having a reactive system also. React has a preact signal, which is also from Agnosti, by the way. And we have developed ourselves on open source Tansu. Tansu is basically based almost on the Svelte Store, but is independent from Svelte. So let's dig into Tansu. So Tansu is providing a few state management such as Primitive, so for best store. The most important one is the writeable. So you can create a store by using the writeable and give it an initial value. And then on this store, the idea is that anyone can subscribe on the store to get the actual value of the store. And on the store, there's a set method and an update method to change the value of the store. So the set method, you provide the value directly. The update method, you provide an update function where you get the current value of the store, and then you need to return the new value of the store. So this is the primitive for our library. Then we have a way to extend our base primitive, like the writeable. The writeable is taking any store-like object and is transforming it into a writeable. So it is adding basically a set function if you are not providing it, but you can provide your own custom function like we are doing here for the base store. We are providing a set function that is basically setting the double of the value put in the set. And you provide also, you can increase the store by providing some API, like the increment function that we have here that is incrementing the value of the store. Like any other store, the writeable, you can subscribe to it and you can use all the API, including the set and the increment. Then we have a writeable. A writeable basically is a writeable, but we found a set and update method. It is just for us to provide a way for people to subscribe to the value, but not interact with the value. Same as for writeable, we have as readable utilities to transform a store and add some API to interact with the store and return a readable. So for example, here we have a writeable as a private store and we return a readable store where we have an API to interact with the current store. Then we have some computed store. So the computed store, they are a dependency on other store and they are going to return new value depending on those primitive value. So the derived store is taking, you need to provide when you are creating a derived store, you need to provide a list of dependencies. And then you have a function where you are going to return the actual value of the store. So for example, here we have x dollar and y dollar which are the dependencies. And we have the function that is going to take the value of the store, the two stores, x and y here, and we need to return the new value. So here we are summing the two values. The return of the derived is already readable, so you can subscribe. And as soon as there is a change in a value of one of the dependencies, the derived function is computed again. And it is computed synchronously. So this is important for us because we want to be able to plug in any framework, our library, so the best way is to be synchronized so that it has no side effects when we are plugging into any framework. And it reacts but on the other hand. There is another API to be able to handle a source and conus event with the derived. So we have the way to provide on top of the dependency a set function. And then it is up to us instead of returning the value for the derived, it is up to us to use the set function so when we want, change internally the value of the store. In this case, we can also provide an initial value because we don't know if the set is going to be a code synchronously from the start. Then of course the derived is also readable, anyone can subscribe to get the actual value. And if there is a dependency that is changing a value, the derived is run automatically. Then we have another so computed the store which is computed. This one is automatically computing the list of dependency. So if you are using any other store inside your function that is computing the value of the computed store, in fact, the dependency will be kind of known by the library. And if any dependency is changing its value, the computed is going to be executed. What I didn't tell you also is that the writable or the readable, they are also function that you can call to actually get the value. So here when we are doing like XC$ and parenthesis, so we are calling the basically writable and we get the actual value. So the computed here, if you are using any store inside the computed function, we will know if one of the store is changing its value and we can react to this. Thank you. Thank you. Thank you. In a change on first name and last name, the derived function will be executed. But I don't want to have intermediate change if the data will not be constant. So here if I am changing first name and then last name, if there was no batch, like we would have computed within the derived, shall occupy in our case. And with the batch, we are able to kind of stop the reactivity system. And until the end of the batch, nothing is basically happening, nothing is computed, and shall come, this is only send at the end of the batch. So the data stay consistent. And it's pretty important for our library to handle, because you don't want to send data to the UI that are not consistent. So what are the time-sensitive features? It's from what I've just did, so it's critton and type script. We have no dependency whatsoever. It's synchronous, like I said, all the derived functions, all the computations are called synchronously. So if you subscribe to a store, you are called also synchronously. It's lazy. Lazy means that basically if the store is not subscribed, if there's no subscriber to any store, the computation is not going to happen. So we are not doing any computation that doesn't have any meaning, because there's no subscriber, basically. So we are only computing if there's at least one subscriber on the store. And there's a feature that I didn't talk about, which is the unused. It's pretty interesting for our use case, which is doing widget libraries. The unused is a possibility, so on every basic store, you have the possibility to provide an unused function. This unused function is basically executing the first, when the number of subscribers goes to 0 to 1. So we are able to initialize things only when the store is used. And this unused function is returning a function that is executing when there's no more subscribers, so when the number goes to 1 to 0. So this is pretty interesting to put your application in a state, like put some event listener, only when actually the application and the store is used in your application. It's live in the page. Something that I also didn't talk about is that when you are subscribing to any store, of course at some point you need to clean up and you need to avoid memos. You need to unsubscribe also to the store. So every subscription is returning a function to unsubscribe. So here is the example. So what about Agnes UI and what's the context of where we are using Tansu? Agnes UI is a widget library. Actually, it's more than one widget library. Agnes UI is actually kind of seven libraries. There's one library that is the core that is framework agnostic. That is only written in Tanscript. It's basically a bunch of builders of factory, let's say, component factory and utils factory. So it's function that takes some props, some inputs, and what are returning, they are returning the older state of the widget and all the action possible that the widget can trigger. So if you have a pagination, if you click on the next button for the pagination, it will trigger a change in the internal state management returning by those builders. And basically, a new data will be computed for you so that you can show it on the screen. So Agnes UI does this core that is framework agnostic. And then we have, let's say, put some wrappers, some utils to be able to use it in multiple frameworks. So we have the possibility to use it as headless. So without any CSS, we basically do CSS of your choice, in React, in Svelte, and in Angular. But basically, you can use it in any other framework. It's just that we have only the time for those three and only the user of those three. We also are providing because, well, we are maintaining ngboostrap. And so a lot of our customers, even internally in my company, are using Angular, so ngboostrap. We are providing also the bootstrap wrapper. So this is the look of basically the bootstrap component here. So in some of Agnes UI, it's framework agnostic also. So there's a core framework agnostic. It's reactive because we are using Tonsu, we will see how. There's some wrapper bootstrap. You can have a less component. You can use Telwin to customize it. We are going to show them in the website using Telwin on some kind of CSS of our choice. It's highly customizable, highly configurable. So you can basically, the activity is also put in place for the customization and the configuration. So you have some kind of configuration mechanism that is injected inside our core. And that you can use specifically. And it's API-copied, meaning that all our components in every framework, they have the same API. Okay, same props, same output, same behavior. So for our company, for us, like where we are using multiple framework, and maybe we want to have consistency inside our product, it's quite useful to have the same look and feel for every application that we are developing. So what's the usage of Tonsu and NSU? And how it is helping us developing those libraries? So it's helping us for many things. And we will see that we need, in fact, all the features that we have seen before. So for example, we have a create and accession details. Okay, and the idea is to be able to know, we are giving like a list of elements, and we want to know if those elements are visible inside, are entering inside the screen. Okay, and we are using the intersection observer. So the intersection observer is a synchronous JavaScript utility that you can use that is going to call us every time there's a change inside what we are observing. So here, I didn't show you, but we are then observing the different element. Okay, so every time the intersection observer basically is sending us some new entries. Okay, we are changing this map, and what we are returning basically is this map in the visible element. So we are using the fact that with the derive, we have a set function that can be called as a chronosly by us, to put the new value of the visible element. And with this, we are transforming like this intersection observer in a change of the state and the visible element, visible element, dollar state. Okay, and we are doing this in a framework or a nostic way, so we can plug these utils in any framework or for choice. One other thing that is also very interesting is that we are doing with a library and people, when we are doing widgets, there's like a props or anything. Here we have the case where in a pagination widget, okay, and we are computing the page number. So the page number that we need to display on the screen, and it depends on two things. It depends on the page size, so the number of items you want to display on the page and the complete list of items that you have in your collection. Okay, if with Tamsun, we are able basically to write that the page count is computed out of this expression. Okay, and if any of the dependency is changing, this page count is going to be automatically computed again. Okay, so we don't have to say, okay, if collection size is changing, I'm going to do this computation, this computation, this computation, and it's going to change page count. And if page size is changing, I'm going to do this computation, this computation, and page count is computed again. So you have a way to basically declare your state, your internal state of the widget in a very simple way. So the thing we are going to see about Tamsun and our usage in AgnusRide is the unused. So the unused is a pretty interesting thing that we see on Zvelstore, but not on a lot of other scenarios. I don't think there are other scenarios that we have in this feature. It's the way, like we said, to execute some code only when the store is used in your application. So for example here, we are having the red-head ball of the active element. Basically, on use means if some point in my application I'm using this store, this function is going to be executed, and I'm going to start listening to the focus in and focus event on this document or document element. So I'm going to be able to react to those events only if the store is used in my application. And then if the store is no longer used in my application, I'm going to unsubscribe directly to those events. So basically you can set up things and remove things only if they are actually used in your application. And it's quite difficult to do this if you don't have such utilities. And also it's transforming everything that we are computing here at some point as it is red-head ball in a store so that it can use in a reactive way anywhere in your application. Okay, what's next? Good. So what we want to do with Agnes-Wide, but basically we want to extend the component list, we want to extend the utilities, add more documentation for user, contributor. We hope the people from MNGBusap will migrate and we hope people also will start using it in many other framework community because now we are not only on Agnes-Wide but we are targeting React, ZVED, or any other possible framework. So we are starting to advertise it and we will help people contribute to it. So have a look at the two websites, of course, they are open source, they are mighty lessons, you can use them for your design system or for whatever project on your own. So thank you. If you have questions. We have a bit of time for questions, so if anyone has a question raise your hand. Perfect. I think you have a suggestion, you said you decided to build your own signal library when there are already existing ones. Is there a specific reason? Are there some needs that weren't publicizing the library? So the question is, are there some needs that are not present in other libraries? Yes, indeed. So only pre-act signals is framework diagnostic. On the whole list I showed you. All the other signal libraries are included inside the framework but are linked to the reactivity system of the framework. So we did one which was framework diagnostic to be able to code only the core in a framework diagnostic way and use it in any framework for choice. So we had not a lot of choice. We basically started before pre-act signal I think also. So we have the Tansu library since a bit more than two years. So yes, this is much it. After, why did we don't choose pre-act signal? So there's a lot of features that is not present on pre-act signal. For example, the unused is not there. They have the batch function. I don't think they are lazy. I don't know. Maybe yes they are lazy. Well, it's close to what we have. We thought a few features that were missing. But before we started before them, at least we have open source I think before them. And we need them at the time. So there was no other scenario that fits on it. Yes, so Mobex is quite interesting. So we had a look at Mobex. It's quite verbose for me. It's interesting, but I don't know if they have the core that is basically framework and the stick and they are using the yes. So we had a look at the time, but maybe it was too many years ago. And when we developed Tansu, basically it was we developed not Tansu for NSUI. We developed it before. And the idea was because we were contributing to the Angular framework at the time. Angular was missing some scenarios of library internally. So this was a project to basically implement step management in all our applications, but also try to have a similar library that could fit our Angular need. So we are not really interested in having this library where we need to use it. We use a different framework and the different flavor. I think in Mobex, there's not a core that you can use for not having a stick one. So maybe we'll use this one. But I was not so... We played with it and we were not convinced at the time. Any other question? Just a second. If the next speaker, Diklan, can come and start sitting up. Sorry. So thanks, the talk. And the question is, do you, in any kind of... ...like performance testing compared to other solutions in the market? benchmarking, whatever? Well, performance testing is always quite complex. Nope. In a sense that it's quite difficult because you have always a framework that operates in the other framework. So what should we focus on? Which performances? What we try is to be as lazy as possible. We are lazy basically. So in terms of computation, we should compute only what we need. We have also this on use setup that allows you to only compute things only when there's a subscriber on the store. So we think we are pretty efficient, but we have not done kind of... ...complete performance testing. We are synchronous, so we are pretty fast. Also, in terms of when there's a change on the store, it computes every logic to compute all the derives and it's sending directly the data to the subscribers. It's a pretty small laboratory that insures itself. Yeah, we have time for last question if there is one. Just a reminder, try to always go to the middle so that others can also sit. And we can't be standing on the sides. That's a rule from the firefighters. So we really can't have anyone standing on the sides. There's a lot of last question. Yes, you can ask a question. Yeah, so we are thinking about providing a side effect function that is basically just subscribing to have the side effect run. Yes, indeed. Currently, we didn't have the use in our Orang Nostra laboratory, so we didn't implement it, but we were always talking about this like one month ago. So this is something we will have maybe in the future. Otherwise, yes, you need to subscribe to it to get its side effect. Thank you.
calc with calculang, a language for calculations
Hey guys, thanks for attending this talk about Calculang. Calculang is a language for calculations. So calculations, Calculang. And in Calculang we write pure functions but outside effects. So if you've probably heard that sound bite before, and no side effects means because I try to develop Calculang to be flexible. Now, I use a lot of terminology whenever I describe functions, but pure functions are similar to formulas and spreadsheets. I'm not the first person to say that. So you'll hear me mix up those terms, formulas and functions. And I want to show you the Calculang website, which is Calculang.dev. So here's the logo, and here are the things that are important. And down below we have some examples. So there are some finance examples that are simple computationally. Some simulations that are simple computationally. Some maths art examples. And some other demos of things. And I want to show you some of these. We'll start with the savings calculator, but I want to use a new experimental website. This is this. So here we're looking at savings calculator, a visualization. We're saving 1000 every year for five years. We're getting interest 2%. And we end up at 5,308. And over here, separate to the visualization. This visualization uses numbers from Calculang model, which is separate. And we have the formulas here and the inputs here. So we have formulas, input, output. And I want to show you the formulas. Formulas are the building blocks of calculations here. So look at the balance, the savings balance. This is the most important output here. Apart from some edge cases, the balance, the savings balance is the previous year's savings balance plus deposits, new deposits, plus interest. And we can look at interest. And it's the balance from the previous year, the previous year's balance times interest rate. So now I'll point out some things about this code. First of all, you might notice it looks like JavaScript. It looks like JavaScript because based on JavaScript, it compiles into JavaScript. Secondly, there's only two new primitives to know about Calculang formulas and inputs. Here, balance is a formula and it's blue, although you can't really see it here. And interest rate is an input and it's pink. And although balance is a formula and interest rate is an input, we can change it up here, balance is the formula we just saw. Their formulas and inputs are both implemented as JavaScript functions. So we call them, we call interest rate like it's a function. Fourthly, everything being a function is a nice uniformity, but it means there's a lot of calls and there's a lot of brackets. We've got brackets, brackets, brackets, brackets, brackets. Mostly these brackets are empty and that's because the Calculang compiler will analyze how inputs are used and it will populate brackets for you. Those last two properties are motivated by flexibility, which is something I'll speak more about tomorrow in particular today. I'm going to focus more on interactions. So we can interact with this model also from over here. These formulas. And if we double click a number, we get this overlay of the numbers that we used to calculate the number. So 508 is 5204 plus 104. And if we can navigate here, 104 is 5204 times 2%. So we can see the whole workings of a calculation and dependencies. So this is showing you a visualization of numbers and interacting with the workings and the formulas. But numbers that are on this in this website. So what about meeting numbers where they are? For this next example, I'm going to ask you to use your imagination a little bit. So we can imagine this is a piece of paper somewhere. It's got these numbers on it. 100,000, 4%, 10 years. And it's alone and there's a repayment amount 12,000, 3,000, 2,9 per year. So this 12,000, 3,000, 2,9, it was calculated so it has workings. It was probably calculated on a computer and I know it was because I calculated it with calculating. And with just this, which links the formulas and the inputs, then I can share not just the number but also the workings. So where that brings you is it's 12, 3, 2, 9. What that brings you is here, 12, 3, 2, 9. Repayment amount. And you can check the factors that are used. And I don't want to convince you if this is right or wrong. The answer might be that it depends. But just to show you a way to share workings along with a number. So this interacting with numbers and seeing the workings behind a number and sharing it is something we can debate at being useful for one person or another. But it's something we can debate for all users of numbers. And that brings me to developers. And numbers are a lot of what we do on computers, I think. So this example is a bar chart. You might anticipate what will happen. But we can move around. It's a projection of a reward. And how does this work then? There's a map. 2D map. And a player in it. And there's walls. And there's a field of view. And I'll turn around and look at this direction. And this, the calculations behind this to show you how the bar chart works, it's a we split the field of view. To produce this image, we split this field of view into many different directions. And we cast imaginary rays in every direction until they hit a wall. And then we calculate a distance. And for rays that travel far, that have a long distance, like these ones, then we give them small bars. And for rays which are interrupted by a wall at a short distance, then we give them big bars. And these bars here are very big because they catch this green block right beside us. Now, we can go deeper and we can, with a few clicks, see the workings as we interact with this. And look at the formulas for the distance. So, this distance is, there's two distances here. We can see the steps of the algorithm by highlighting here. This is the force intersection with the horizontal grid line. There's no wall. This is the second. There's no wall. This is the toward. And there's a wall. So we stop. We can calculate the distance. Now, there was a bug here. I don't know if you saw it. If we go back, so we can travel back to our gameplay. Here. There's something wrong with, this should be a smooth edge. At least this bar should not be so small. From the map, it looks like, from the time here, it looks like the ray is going through this corner, which it shouldn't. And so, can we figure out why? If we go through the workings. There was a check here. No wall, no wall, no wall. Wall. But, here, there should be a wall. So this check looks wrong. Let's look at what's going on in here. So X is like two and a little bit, and Y is three. X two and a little bit, three is a blue wall. So, that seems wrong. This formula, there are checks for the boundaries at the edges. So, that's not relevant here. And we have this lookup against the map to find the walls. And there is a subtraction here of a little value, which is a fudge. And it looks like our X is just very close to two. And that fudge throws it down to one, and then it doesn't find a wall. So, we're hit with this bug. Now, I'm going to show you how we can fix that by making the fudge smaller, and I won't remove it because it says it's necessary. So... Okay, thanks. So now we can compile with our change, and we can hot reload. So now we can see a fixed wall. In case we missed it, we can go back and forward. So before, after. We can travel through more gameplay just to make sure there's no regressions with that. And these things are fine, so... So that's our change met, and we're happy with that. So here we saw interacting with a visualization, but a simulation of a world, and some developer things like hot reload, time travel, which I think are well suited to functional programming, which Caglang is close to. And one more example. Whereas this one is a bar chart, this next example is a scatterplot. We've got stars, and we can travel through them. I can speed this up. And we can say that the state of this visualization is calculated by Caglang, and on the left we can see the state and the workings for that state. And up this in real time. And I was watching this from start to finish, and then I noticed that there is some redundant code here. This condition is never false. Now, in these examples... In these examples, I showed some interactions with numbers and their formulas and their workings, and a way to share numbers with the workings. We looked at numbers in the world, imaginary, and we looked at numbers for a simulated world. But the enabler for all of Verben I showed is this. It's the separation of concerns and it's the pure functions, which is the same enabler for the aims of Caglang, which to be shareable and communicable, it helps to not have programming mixed in with your numbers, to be shareable and communicable as far as numbers is concerned. To be transparent, it helps to work with different views to be interoperable, so that tools can give different perspectives. To be verifiable, it helps to be pure, so that the same inputs going into a function, you can expect a predictable output. Flexibility and reusability is this. These two are things I'm going to talk about tomorrow in declarative and minimalistic computing, dev room. That's all for my talk today. Now, if there are any questions, I'll be happy to take some. Thank you.
Your web app is taking up too much RAM. Let's fix it!
Hello everyone. Can you guys hear me properly? Nice, perfect. Yeah, today I wanted to start your presentation with quite a bold claim. I would say that your web app is taking up too much RAM and we could fix it. And this like comes from a thing that I noticed recently and is that if you look at your Chrome browser, you can see that if you hover over the tab, you will see that the Chrome is now starting, at least in a while, to tell users the memory usage of your app. Which like, if you look at most applications such as for example g-tub even while looking at a pretty big diff, the memory usage is not that bad. Yeah, like I mean, 122 megabytes were a lot in the 2000 but like now it's not as much. But if you look at other websites that maybe are a bit more expensive such as Airbnb, you can see that if we load a pretty big page, the memory usage goes way up. Like we're talking about half a gig of RAM being used by the browser. And what I was wondering like is it our fault? Is it the browser? What's in that memory that is being used? And we can find out how much of that is actually used by the JavaScript virtual machine, by our variables, our functions and our code. And the way of doing that, it's by opening the DevTools and there is a special tab called memory and you can see for each JavaScript virtual machine that is running, you can see how much memory is it taking up right now. Which in case of Airbnb, it was like 111 megabytes which is like, it's not much but it starts to be quite a bit especially when GitHub was like 10 megabytes compared to that. And then maybe you look at some more extreme examples such as here I propose fully stress test notion by loading a quite big table and we went into 1.5 gigabytes of RAM used just by JavaScript variables and that was quite wild because if you think about it, that's a lot. That's a lot for a web page. And also there are even worse examples or I would say more difficult examples like the product that I'm currently building and I'm building this web-based tool called the flux which is a tool for designing electronics on your browser and it is quite complicated because electronics is made up by a lot of different parts and it's built using typeskipped React, 3JS, React refiber so we use a bunch of technologies and also a bunch of abstractions to make our life easier and that had an effect on us. In fact because we wanted to be able to render very complicated documents with a lot of different shapes and text and everything has to run at 60 FPS, you can see how holding a big project can take a lot of RAM and that's something that backfired a bit. Why? Well because originally we focused a lot on performance, we wanted to have everything load very quickly, we wanted the scroll to be fast and originally we just optimized for performance, we were like yeah memory is cheap, let's just use whatever all the memory that we have so we just optimized what the profiler said, not what the memory profiler said and actually we did this because there was this article that from a while ago that was talking about yeah if you're building React apps just memorize everything, just cache everything that you can because that is not going to be an issue in most cases. People did we know that we were one of those cases and yeah and in fact you can see how like if you load a pretty big document at least before this talk the app will take too much RAM but they can really hear someone says well okay I have 16, 32 weeks of RAM on my desktop on my computer, why do I care about memory users, we're not in 1999 anymore. Well there are still a couple of reasons why we really care about this now and one of the reasons is out-of-memory crashes. If you're not optimizing memory usage the browser will limit you. In most cases for example in Chrome if you go over four gigabytes you will get this, you will get an old snap error code 5 which is an out-of-memory and there is no way to catch it, no way to solve it, the only thing that you can do is just prevent this from happening in the first place because here you will need to refresh the page to fix it. And on iOS it's even worse because on iOS sometimes the limit goes down and no one is really clear about what the limit is. For example if you are on Safari UIOS sometimes the limit can go as low as 300 megabytes and this is what you get, you get your browser loading up the page, trying to load the page then going out of memory, refreshing the page and going in an infinite refresh loop which you will see your user report and that's when your product manager will come screaming into your office why is the application not loading on my phone and because you're using too much RAM, so yeah clients might have a lot of RAM but your browser doesn't care, it will not let you use it. And another thing is that we also care about the garbage collection performance, if the more that you allocate the more that you will need to deallocate later and that's a thing that you have to care about because in some cases the garbage collection connection times can really hurt your performance. This is a bit of an extreme case that's like one minute of garbage collection but like this is something that is a bit more realistic. We were debugging an event handler that was supposed to run on mouse move so something that was totally off path and the major garbage collector took 0.5 seconds which means that there was a sharp drop in the frame per second just because the garbage collector had to kick in and so that's another thing that you want to care if you care about performance. Also memory is part of your performance optimization strategy and another thing is that as I showed before now Chrome is showing the memory usage of your website to your users so if your users are using like 12 tabs or if you are insane like in my case you have 10 browser's opens with a thousand tabs each, yeah that should start closing them maybe tomorrow. The users will be able to see that it's your website that is taking up their entire RAM and they will not be happy with you so now they will know which one it is and so yeah we're into a situation and for example my situation how do we solve this like how we approach this problem in flux? Well first of all it's important to figure out what is occupying memory and once you do that like there are multiple strategies that you can use to kill it with fire and then we also want to make sure we're not doing the same mistake aka we can set up some checks in CI or we can set up some monitoring even with remote users to check that the memory usage is not that bad right now. Of course in the talk of today I wanted to focus more on the first point because that's already a lot of things to talk about. So before going into the tooling I wanted to introduce some ideas about memory usage so that we know what we're talking about. I like to have those distinctions while talking about memory usage this is something that I made up in my analysis and like I noticed that there is a pattern of having either static or transient memory usage. What are we talking about here? Static memory usage it's when you have variables that are taking up a lot of RAM but they are long lived, they are global variables, they are state that is staying there and it's not really changing throughout the run of your application and that's basically what you would find in a heap snapshot and it is that the easy thing for example the document that loads and it is taking up a lot of RAM but you don't necessarily have a situation like that sometimes you could have a transient peak of memory usage which means that for example the user clicks a button and that button triggers a very quick operation which allocates an array with one million elements you can see it as a peak in the memory usage at that point and that sometimes can be a bit more hard to the bug because you want to find that on a heap snapshot because a heap snapshot is just taking an image of what's in your RAM at that moment and a peak of memory that would be de-allocated immediately won't show up in that so there are different strategies depending if we have the first or the second type of memory problem. Another thing that I like to consider is the count and the size of stuff. Why? Because you might have a very happy situation the kind of analysis wise in which you're locating a 500 megabyte string or a 500 megabytes array that's very different than allocating millions of small objects and if you have the first or the second situation you need to use completely different approach to analyze that because while if you have a giant object it would just show up in the memory profiler immediately as a very big object if you have millions of small elements it would be much harder to analyze them because you will need to check what's inside those those four bytes objects and another thing that I like to bring up is the difference between shallow and retained size and these are things those are two terms that you will see in the memory profiler and the reason for that is because in JavaScript everything is a pointer so if you have an array of strings it's actually an array of pointers to strings so the array itself could be very small like in order of bytes but the stuff that is pointing to could be giant like it could be pointing to a lot of one megabyte strings so when we talk about shallow size we talk about the size of that allocation itself such as the array which is small but that array it's causing other memory to stay allocated because it's referring to those one megabyte string so the retained size is instead the total amount of memory that that object or that array is forcing to stay and that is preventing from being deallocated and there's another last topic that is also quite complicated which are allocation types and which means that in JavaScript there are multiple things that you can allocate you have different other types you have code you have strings just array you have typed arrays you have also closures and each one of those behaves differently in memory and one cool thing that you can get from this is that for example also functions are something that can take up memory if you're not careful enough because functions need to save all the variables that are around them so technically that a function is an object as well in javascript and this means that even for example if you are creating functions in a loop that could become a memory problem because it's the same thing as creating an array of objects so like sometimes you can just look up the v8 in Chrome documentation and find a lot of interesting things about how memory is used internally but that's another topic for another talk I would say I wanted to instead look into tooling like if we are in a situation in which we have a lot of memory usage what are some tools that we can use to try to start to analyze what is going on and how to solve the problem well the most famous one is the common memory compiler which is that memory tab that you probably saw next to performance in the Chrome DevTools and it's quite powerful because it can work in three different modes I think the most interesting ones are the heap snapshot and the allocation sampling which works in very different ways for different purposes but it is that with the heap snapshot you can take a big snapshot of everything that it's in your RAM everything that javascript is working with and like imagine that you created a lot of variables in your code with this you can just save all of them and look at what's inside of them which is really cool because you can even see the values that you have there and for each one of those for each allocation that you have you can also see what is the retainer that means what why is this being a memory who created it and who is holding references to it and that's useful to determine who is the thing like what is the function that caused that thing to stay in memory and the heap snapshot are very useful if you want to check stuff like static memory usage because it takes a snapshot in time if instead you're more interested into transient memory peaks as I said before there is this other tool called the allocation sampling which works by accumulating every allocation that happens this means that everything that is allocated is saved here but you don't get the allocations so which is you can't really measure how much RAM you're using you can just measure who is creating that RAM that's objects we had not too many of them but some of them were taking a lot of RAM like 89 megabytes that's a lot and we had one specific object that was taking a giant amount of memory like 80 megabytes and which is then by looking at the retainers we were able to immediately figure out what was the function that was allocating that stuff that was retaining that stuff and that was one of the very first optimization that we managed to do because this way we went into the code into that function and realized how we were basically creating a bunch of functions this is react code and a bunch of string UIDs and saving all of them in a map and apparently that's incredibly inefficient that's probably not code that you look at the first at the first glance that it seems inefficient but if you call it thousands of times this is apparently sticking up 80 megs of RAM and so how did we solve this we refactored it a bit by using a set instead of a map and so it's very experiment based and with this we were able to have like a 50 percent improvement of memory usage which was huge and this really made the difference between being able to load some project at all or projects that would just crush your browser like documents and so that was one of the first big wins that we had so we were like yeah okay let's continue this eventually we will reach zero megs of memory use right no immediately after that we hit pretty much a brick wall in which we were taking hit snapshots and we were seeing that we had two million objects that were taking a lot of space and it's not that that like we had one big object to optimize each one of them was a couple of bytes and the heat profiler really doesn't help you in those cases and that's interesting because that's pretty much the same situation that you will find if you try to profile that same notion that I've tested it before or even Airbnb as it's actually the same problem and unfortunately the answer is the problem is react kind of and like we are in the same situation also with notion we have is it two geeks of ground no that can't be that is just being occupied by a lot of small objects so yeah we hit a brick wall but what we do now like the heat profiler is very bad and analyzing those kind of stuff thankfully we can export from it we can export a giant five gigabytes json from from chrome and then we look at the json and we see that the json it's in a format that is pretty much unreadable but thankfully there is someone that did work for us the guys at meta and it is beautiful tool called memelab which it's a toolkit for exploring memory usage which is very focused on finding memory leaks it has like an entire automation for that but I think this is even more it's even cooler because it provides you a very powerful API for opening snapshots from chrome and analyzing them what you can do is that basically you can read the objects in memory and perform analytics on them for example we wanted to answer this question which type of objects are taking up the most space out of the two millions that we found in a snapshot well this is a some code that we wrote I think that we don't have time to go too much into it but I can publish it the idea is that we can load the snapshot so load the current state of memory and find all the object types like what are what is the like the type skip type of the object computed total shallow size for each type and then sort and print results and from these the results were very cool because the um which is we had the for each object type even including like the the keys of the object how much memory they were occupying and which is we were able to see that on the top two we have one object that is called fiber node which is from react and another node another object that had base q base state memo state what the next q what is that that is not something that came from our application that's react again that's the data structure that is used internally for keeping tracks of hooks and so like we went into react to me so that there was exactly that other structure which in most websites that are using react heavily nowadays is pretty much the thing that is occupying the most memory with enough so it is we figured out that keeping tracks of hooks is expensive and but are we supposed to just tear down the 400 000 lines of react that we have in our app right now like that's a bit too far into the development so we wanted to know precisely what we need to optimize so we use memlum again this time we uh we see like even more uh by looking at this fiber node data structure that is used by reactor and we need a lot of statistics on it to try to figure out what is the react component that is taking up the most memory so that we can optimize that specific component first and we managed to do this because this way we were able to divide the odd memory uses by react component and see each hook how much memory it was using and with this we were able to find out a specific react component that was using a lot of memory and we cut the memory users down again by 60 percent which was pretty nice so that's like memlum saved us with this because we were able to make our app properly working and it also made us possible to answer other questions like as out of all the strings that we have in our app how many of those are uids should we start optimizing uids and make them numbers well no because we used memlum to find all the uids and we found out it was like two megabytes in total so who cares so that's also nice to know what to not prematurely optimize so just to sum up everything that i said i think that we can all agree that memory analysis is actually difficult especially because it varies so much between application between framework between browsers but it's important even if even in a world like nowadays in which we have a lot of a lot of round because for some apps it really makes a difference it makes the difference between you being able to use the notion on your phone or the app constantly crashing and never loading your data and that thing is that the chrome profiler is cool but sometimes it's not enough but thankfully it can export so that at least you can perform your own analysis externally so thank you for listening to representation thank you are there any questions i see a question here yes you were talking about the shallow size versus retained size yeah when would you ever be interested in looking at the shallow size sounds like the more interesting one yeah he asked about when do we care about shallow size when we also have the retained size well i yeah we care a lot about shallow size in our case it was all about shallow size where to write our own custom plugging for memblab to just analyze shallow size why because if you are analyzing like very big objects there are thousands of lines and in that case you have to use tricks like even virtual scrolling if you know that you can have like instead of allocating all the DOM elements you keep reusing the same ones and you think about that that's like ejecting from react because you are creating something just with javascript and the DOM and then you are creating a reactive wrapper for it so that's another thing that it shows that yeah react is good at orchestrating stuff but when it comes to the performance critical things that you want to have inside your application then you need to start optimizing it differently just a small mark or we continue with the questions so please if there are spaces please try to squeeze and not leave spaces in the middle as you could see we have hundreds of people waiting outside and here as well and we cannot have that many people on the sides so please try to squeeze don't let free seats for your jackets or something put it on your lap thank you and since we're starting to be quite a lot if you're going to go out please try to go out from the right side and avoid going out from the left side so that it's easier for everyone thank you we have a question here first i've got more more as a comment instead of a question so the thing is that with this limitation of four gigabytes for memory this comes from the fact that like chrome like compresses pointers so that small objects take less space basically that's one thing second thing is that is this is like a security mitigation so that when there is some like back in v8 it's harder to exploit it but also i've read on like a chromium box tracker that there is for example 16 gigabytes limit for fixed arrays so there may be different limitations for different things like web assembly also has a different limitation and also supposedly like electron abs doesn't have limits so yeah yeah that that's very cool thank you i think that firefox has pretty much the same limitations oh ask me if we're also trying with other browser yeah i'm mostly working on firefox actually and firefox has very similar limitation and sometimes it's even worse because sometimes we notice that the upper randomly sometimes takes more memory in firefox for some reason or some things are more optimized in firefox other things are more optimized in chrome so that that's very complicated to answer unfortunately because it seems like that the answer is either you look deeply into the source code of the browsers which is i still haven't reached that point unfortunately or you do try an error ah no the tooling um you know firefox also has tooling around it which is actually if i remember correctly more focused around analyzing the memory users of the DOM elements and it also has some facilities for for analyzing ip snapshots but since like memlab users works with chrome ip snapshots we went with that immediately and how do you go about running this in ci? oh um yeah that's a complicated thing because running in ci it's pure pain like you can use memlab and run it in ci because it uses playwright i don't remember if it uses playwright or puppeteer i think puppeteer and with it you can like orchestrate some some tests that open a page it can even like use some machine learning algorithm to find memory leaks the problem with doing that is that it's fine if your app is small if your app starts to become bigger then uh you will need to have a ci machine that is powerful enough to be able to run your app and the profiler on top of it which for us it meant that the the ci time went like in 30 minutes which was unacceptable so eventually we removed it but you can do it are there questions as your question there so from the browser or something like that? yeah that's another complicated thing because if you are using chrome i don't think that firefox allows you that but chrome does you have a specific performance or memory i think variable that you can use and you can check both the maximum heap allowed size and you can also read an estimate of the current family usage in our case once we do that we are constantly like giving data to segment then we analyze in amplitude with which we can like keep track of memory usage and we are also doing that for like the performance timing the problem is that we notice that that data who very quickly becomes bogus because it depends a lot on what the user is doing and when the garbage collector kicks in because the garbage collector sometimes is like it goes up to four gigabytes and then no problem goes down to 500 megabytes so it's extremely difficult to capture memory usage because you don't have a precise memory a precise measure on how much of the total retained memory is active and how much is actually inactive and going to be garbage collected soon so we try to do it and we have some charts showing how much memory is being used but it's very hard to make sense of them unfortunately any other questions you still have around five minutes for questions
Recycle, Reuse, Rebuild: Transformative Tactics for Turning your Brownfields Green
Our next speaker is Adam Uren. He's a developer with over 25 years of experience in IT. So big round of applause for Adam. Hi and welcome. Welcome to Recycle Reused Rebuild. And I'm here with you today to talk about some of the tactics for turning your brownfields green. So quick intro. My name is Adam Juran. I have my own company, Juranosaurus Tex. It's because, like you said, I've been doing this for 25 years. It makes me a bit of a dinosaur. And I functionally am a fractional CTO at this US startup called OK Capsule, which we'll figure later in today's talk. And before we jump in, you all know what brownfields and greenfields mean? No. Great. So a brownfield is like a legacy project. It's maybe brown because you feel like it's a piece of beep. And but that's almost what all of us work on professionally. And it's the greenfield projects, the ones with all the latest libraries and the latest version of Tan Stack Query or React Router or whatever your favorite X state, I don't know, whatever it is, the ones we get really excited about are weekend, our side projects. And this talk is about some of the ways we can help make our brownfields a little greener. So here's a quick overview. We're going to talk about the challenge of legacy systems, how to unlock transformation, navigating the JavaScript or solution evolution, how to empower development and developers. And then we're going to look at a quick case study, which I think highlights a lot of these points. And then a quick Q&A. And I have to resume my timer. There we go. All right. So this is what OK Capsule sells. We sell these dispenser boxes rolled up of pouches of vitamins, of supplements, so that we can reduce plastic. And you don't have to. If you're taking a lot of vitamins, you all look very young, but for anyone who takes supplements on a regular basis, it's kind of a nuisance to open up like three different bottles or four different bottles. And there's a lot of plastic that goes into that as well. What's really interesting about working for OK Capsule, this is the first time I have ever worked with a company that produces a physical product and not just code. Does anyone else work where the end product is anything manufactured? Cool. Cool. It's a US company, and it's supplements, so it falls under the jurisdiction of the Food and Drug Administration. By the way, as a Native American, not a Native American, but an American Native speaker, I sometimes speak really, really fast. If I'm going too fast or I'm not speaking clearly enough, just shout at me. It's a slow, no, no, no. The startup was started about four years ago, and just because that's what the CTO knew, it became Salesforce everything. Now, Salesforce is a great tool for certain things, but as a development platform, it's frustrating. Especially for all the things we look for, for scalability, for efficiency, et cetera. But it had everything, not just the CRM, the client relationship management, which is typically what it was invented for, or ERP. Oh gosh. Enterprise, resource, planning, thank you. Too many acronyms this early in the day, which is really because we have inventory and we have lots, and we need a way to manage that. And Salesforce is a reasonable way to handle that. But the API was written as into JavaScript using Apex, which is Java, cool. But there were other problems in that architecture, which I won't get into now. And various integrations, everything was Salesforce. And they just started to move some things out to AWS, but it was very, very, very early. There's a Shopify app to make it easy for some clients, because this is a white-labeled product. You can have your brand. It can be anything. And there was just a lot of tech debt, even within the Salesforce, MVPs, POCs, Band-Aids on top of that. It was kind of a mind-blowingly broad landscape of bad practices. And yet it worked. That's the thing. There's value in the fact that it was already working. Think about all the banking software in the world. It's all COBOL, in from where you're going to go and blend the reliability of what exists with the agility of maybe some new technology. And I'm getting a stamp. I have to read. Oh, no. Does this work? Yes, this is better. I'm sorry. This is, I have speaker notes. You know how it is. So what we're looking to navigate is what I call the middle path. And it's about finding that right balance, keeping the system stable and reliable while at the same time making them more efficient and ready for the future. So we have to carefully choose which aspects to enhance and how to integrate new technologies. So we're doing more than updating. We're, in a way, future-proofing the work. Because as it is said, what got us here really can't take us to the next level. So there's this amazing temptation when you look at someone else's code and you don't have the context and you're like, oh my god, this developer was an idiot. Like, who would write something like this? And another side of it is someone's looked at our code and thought the exact same thing. So there is this temptation to throw it away and start over. But it seems like this direct roots to efficiency. But hitting the reset button comes with its own costs and challenges. Because not only you're losing what you've already built and the opportunity cost, like the time and the money that was spent developing the thing that mostly works in the first place, but then there's knowledge gaps about why decisions were made. That's keeping track of whoever came before us, why they chose what they did. There must have been some, we hope, that there was some logic in the choices that they made. And almost inevitably, there's these tentacles of tightly coupled systems that reach way deeper than we initially imagined. So we really need to consider a more thoughtful and layered approach for a meaningful transformation. So who here has burned it all down before and be like, the hell with this? I'm starting over. OK. Did it work? Oh, and we lost my slide. I'm unplugged. Please go back to full, please. I'm plugged in, I promise. Yeah. We're going to have to find it again or reset it or something, you know? This way? It should be fine, but it's not there. Yeah, hold on. Window. There we go. And is it full? It's full enough. We're going to leave it like that. I just don't want that feedback. It's been driving me nuts. Let's try putting it down again. I'm going to take it like this. OK. So moving on. So how do we find this strategic middle path? So you have to start with an evaluation of legacy systems. You have to identify, see, stand back. See what is already delivering business value? Because, do you guys hear that? Hmm? It's when you get near to the things. This thing? Yeah. Try to. I'm standing over here in the corner. Anyway, OK, good. It's gone. So we have to look for the places that deliver business value. And I know that's not always our first instinct as developers. We want to do the cool thing and have a good developer experience. But sometimes our managers, they have a point. We can't. The whole point is we're a business. We're offering something of value and people give us money for that. So we have to look for what provides business value. Because sometimes the cost of rewriting or replacing doesn't justify the potential gains. Like, it's not just about us having fun. I mean, it is, but it isn't. So this assessment helps us pinpoint where enhancements can be made without disrupting the underlying value. And then we can bring in new technology to enhance these areas so that we can boost performance without starting over. So we want to mix the old's dependability, because it's already there. It's already working, mostly theoretically, hopefully, with the new's efficiency and do gradual improvements. So we're talking about evolving systems bit by bit, ensuring that they may remain stable as we integrate new features. This is also why testing is really important. Because no one sleeps well at night when you roll something else, roll something new out. And you're just like, well, I hope this works. I don't like that anymore. I did when I was young. Also, 25 years ago, no one was dead. No one was testing. At least I wasn't testing. OK, let's call it on yourself. So by choosing a middle path, we can make sure that we meet today's needs and also plan for the future. All right, so how do we navigate finding the right solutions? So JavaScript, which is the bulk of my new platform, but no to the back end and React on the front end, is a massively evolving landscape. I have trouble keeping track sometimes of all the new frameworks and new libraries or changes. And it's so easy to have FOMO, this fear of missing out on, I'm not using React server components yet. And it feels like everyone is. You have to just be aware of that context. But you do have to look at all the pieces out there. There's a lot. I mean, we think of the JavaScript ecosystem and all the packages and the vast majority of it is open source. And so you get to consider, like, do I want this library? What's the best way to handle this? Do I want state management? Do I want two stunts? Do I want recoil? Recoil is old. Redux is too big for my taste. X-State is my favorite. But there's like a million options. And how do you choose? So but even stepping back, you have to look at the whole architecture of the system and think about how you can go from tightly coupled towards loosely coupled. Legacy systems often, not always, often tends to be tightly coupled. So we want to keep our move towards loosely coupled, keeping the architecture scalable and flexible. And then there's the aspect of repair or rebuild from scratch. And it's not just about dealing with tech debt. It's about, again, weighing the ongoing value of the system as it is. Is there something of value that you can salvage before you decide to rewrite it altogether? And as I mentioned, open source versus custom solutions for OK Capsule, four years ago, Salesforce, the vendor solution seemed like the logical choice. It was known. They didn't have the director of engineering. And they just said, OK, well, let's start with this. This seems like a reasonable choice. And went. And they got themselves in pretty deep before they realized we're going to have to replace this, especially for the order infrastructure, the way that our clients can submit orders. And finding this right balance, really between open source and vendor solutions, can actually drive innovation if you plan it and navigate it well. So another poll for you all. Rebuild, repair. What do you all think? What have you done before? What do you prefer to do? It depends. Yeah, I know. Good answer. Good answer. Have you heard my talk before? So beyond the technical challenge, another aspect of transforming these legacy systems is how do we empower our teams? Because odds are there's already some underlying frustration in development teams working on legacy applications. We feel frustrated by the inability to move quickly very often. But there is sometimes this untapped potential there. And we can find the path for innovation and efficiency. So the activity is to engage with your teams and consider the mud of a brownfield. And decide whether when you're going to sow the seeds and when you're going to let this field life fallow, you're going to leave it alone. And this is the strategic transformation, applying thoughtful targeted strategies to cultivate sustainable tech landscapes. And when we all raise our hands having worked in legacy, we're not alone in this. We're all dealing with this. And there's really a lot of value in the collective wisdom whether you're reaching out through open source communities or you're asking the models of chat GBT. There's info or Stack Overflow too, back in the old days. But there's a lot of information out there. And you don't have to go this alone. So now we're moving on to the case study. And how am I doing on time? 18 minutes? Yes, yes, we're in good shape. All right. So we're going to talk about the problem I faced, my team faced, the opportunity, how we divided up, recycled, rebuilt, reused, what the outcome was, and lessons learned and takeaways. So if you all remember back in the beginning, I showed you that little dispenser box, and we sell these little pouches. What we also package with that is a pamphlet. The pamphlet is a marketing material. It also has descriptions of the supplement. They show up to orders the next day. The pamphlets are already ready. They use a little scan to make sure the order matches the pamphlet. So conga, this is another product within Salesforce, decided that we weren't paying enough. We were one of the bigger clients, and they're like, hey, we're raising your rates. And it amounted to highway robbery. They were blackmail, or extortion, or one of these terrible words. And my CTO is like, we're not doing that. He's wanted to get rid of them for a long time. And so he said, like the day before Christmas, we're going to rebuild, and you have until mid-February. We're going to do that. So yeah, so vendor lock-in. Who here has suffered from vendor lock-in? And the problems? I'm not alone. So this is how we dealt with it. We chose to basically reverse engineer. We had some of the pieces in place. Too close to this box. We applied the principles of a cycle, rebuilt, reused, and we completed the development under five weeks. We're testing right now, and this will be going into production within the next two weeks. We're just basically testing the printing and alignment and making sure, because again, it's a physical product. And just because it comes out in the PDF doesn't mean it comes out of the printer exactly how you want it. And I'm really, really proud that every member of my team played a role in making this. We have a very small, like five-person team, and everyone touched this in some way. So what's really cool is here's a few of the open source libraries. And we already had this node app, which helps us produce the nutrition fact panels, and we extended it to be able to print pamphlets using pug templates, existing visual assets, and an S3 bucket, no problem. We had to manually copy the text and style data from the Microsoft Word templates. We put it into a JSON format, and we put that into a database table. What's really awesome is the new architecture allows us to have now multiple blocks of dynamic code per page, or in multiple pages of these pamphlets. We're not doing that yet, but it's just another row in the database and another cycle in the loop. We added a new endpoint with actual schema validation for Salesforce, because Salesforce has the orders information at the moment, so it's going to call out to our panel app on AWS to generate the PDF and then send it back. You'll see another diagram on the next page. And with this also having this stored this way, it lays a foundation for us to build a section in our portal to allow clients to directly interact with their pamphlet designs and handle themselves, because right now someone on my team has to collect the data and put it into the database. So ultimately, that's an efficiency that as soon as we have time, we build this UI and then just charge the client every time they want to update their pamphlet, and we stay out of the mix. So have you all ever had this innovation under pressure? Kind of a thing before? Few of us. It's it was a fun challenge. It wasn't just too much, but it was still a challenge over the holidays too. So here's the original. Here's the before, if you recall, and this is a little more complicated how this all works. So Salesforce still triggers a cron job, hits its apex class, which calls the panel app. It then takes the image assets from S3, order data from the database, and also the style data. I didn't mention that aggregates it all through a pug template, generates HTML, playwright, handily, just creates a PDF of my HTML template, sends it back, attaches it, boom, done. So this just shows the comparison. And there's a lot more independence available. We've just taken out the blocker. We don't need this anymore. I can go bye-bye. And you can also see tightly coupled where we're like, these are all individual things. So now more loosely coupled. Cool. So lessons learned. What is that? So the journey taught us several lessons. The importance of reducing dependency, Fender Lock-In, power of creative problem solving, the need for strategic flexibility in our planning. And these insights have shaped our new approaches, preparing us not to face future challenges, but to anticipate them and solve them proactively, like the UI. And that's kind of it. So I wanted to just open up a Q&A session. Please share questions, insights, or experiences related to Brownfield transformation. And yeah, questions, please. Questions? Yes, sir. It's very curious to me. I'm in a position right now about the Starr-Grindfield project with a client that it's asking between AWS and just buying everything out of Salesforce. So this is already helpful. Thanks for that. You're welcome. But if you were in this situation, again, where you're starting from scratch, what would be the set of tools you would prefer? So the question is, if I'm starting from scratch, what would be the set of tools? And to quote from someone else in the session, it depends. It really depends. I kind of like, for certain things, I like Redwood JS, because it's kind of a full stack solution out of the box. But it really depends on what you need to start with and where you're going. You really have to, because Redwood's good for a few solutions. We have a micro, our new API, which was built last year, we outsourced it, was, is microservices, about 10 of them. And so it just depends. What are you building? So I'm dealing with people who really want to keep using Excel. So they just want to make sure I don't take the weight on Excel files. So is that the actual Excel or the experience of a spreadsheet? So now they come from manufacturing, but also physical products. Uh-huh. But yeah, they're clueless. On the text, they don't have a CPO. So it's really hand-holding them into the world of choosing the best for their future when they have absolutely no knowledge of tech. Got it. So the question is, he's dealing with a new client, and they're in manufacturing, and they really like Excel spreadsheets. I, my gut would be, find a tool that emulates, like a data table that emulates that and lets them get pretty close to that so that it's not actually Excel. Unless you want to use Excel as like a back-end, and I only say that, and that sounds crazy, but you know this website, levels.fyi, which they had all the aggregated information about who got paid what and all the fang companies? Their back-end for the first three years was Excel. It was Google Sheets. Actually, it was Google Sheets. I'm sorry. So, and that was their back-end.
Unraveling JavaScript's Heart: Mastering the Event Loop for Peak Performance
So, our next speaker is Antoine Perret, which is one of our local superstars. He's the CTO of Rosa, which is a super nice company in the health sector. Maybe he will tell us more about it. And he's going to talk about the heart of JavaScript, which is the JavaScript event loop. Background of applause for Antoine. Alright, so you probably, everyone here probably heard that sentence, right? Do not block the even loop. Okay, and you might have heard or read on the internet that you should prefer asynchronous code over synchronous code. So, question for you is, do you believe that as long as you're using asynchronous code, you're safe and you will never block the even loop? Who believes that async helps you get there? No one. Okay, okay, that is good, that is good. So, today we're going to look at this and we're going to ask is the question, hey, what does that mean not blocking the even loop? What does that mean not using synchronous APIs and whether or not using asynchronous API help us stay safe? So, I'm the co-founder and CTO of Rosa and Rosa is building a patient application. We want to help people live healthier for longer. And when you look at the kind of applications that we build, part of Rosa can be labeled as CPU intensive or there are some parts of our applications where we do some heavy computation. Okay, so we have a calendar application, we have registries that are deployed at hospitals, etc. And when you think about the kind of computation that we do, yes, sometimes you have to compute recurring appointments, so occurrences of recurring appointments. Sometimes we do have to compute hashes to store passwords securely. Sometimes we do have to parse large files such as iCal. And sometimes, of course, we do the diffing because we have the schedule of a health professional. We have a list of appointments and we want to know when that health professional is available. So today we're going to talk about the event loop. We are going to talk about how not to block the event loop and the questions that are that we are going to ask ourselves, does it scale? What does what if traffic does 3x, 10x and is there a possibility to have a denial of service? Because as soon as you block the event loop, you have a possibility to have a denial of service. So why was not created? What is the event loop? And then that is this of how we can hash secrets using B-crypt, thread pools to the rescue and what are the metrics of the event loop? That's on the agenda for today. All right, if you look at one of the first talk of Ryan Dahl, the author of Node.js, he's talking about non-blocking I.O. And he's comparing the situation in which you will query for a database and you will have a blocking I.O. system and then he compares that to the Node.js implementation where you have non-blocking I.O. And the reason why he wants to build that is because most of web applications are I.O. bound. They spend most of their time waiting for an external server to answer. They spend most of their time waiting for database to answer. So when you have a server in which you have blocking I.O., what you do is that you create for each connection, you will create a thread, meaning that you will have a memory overhead because each time you create a thread, you will have memory associated to that. And so you can't scale because each time you need to handle a connection, you need more memory. So the solution to that problem is to create an event loop. But as soon as you start to use an event loop, then you require non-blocking I.O. OK? So Node.js is born because most web apps are I.O. bound and because the CPU and I.O. live in two different scales of time. OK? So the CPU and the Jigga-Earths with the Jigga-Earths frequency means that the cycle of the CPU takes one nanosecond. And you have to compare that to a roundtrip between, say, California and the Netherlands, which will take approximately 150 milliseconds. But that is kind of tough to create a mental model around that. So let's make it easier. OK? So we are developers. We all love to drink coffee. OK? We also love some shows to watch some shows on Netflix. OK? So making yourself a coffee is like an operational CPU such as a new text lock or a lock. It is taking 25 seconds. OK? Making yourself a coffee takes 25 seconds. Looking at a Netflix episode will be something like 40, 50, an hour-ish, will last for an hour approximately. That is the world in which a CPU lives. Now, if you compare that to the world of I.O., it is the equivalent of taking a five-week vacation. Or it is the equivalent of studying for five years at university. So the danger zone is when you take your CPU and you bring it to with you on vacation. So basically, the danger is when you block your main thread because you're performing a CPU-intensive operation while Node.js was designed with the idea that you would drink coffee and not go on vacation. Keep those figures in mind for the rest of the talk. What the heck is the event loop? Who is familiar with this representation? Good. Philip Provertz gave an excellent talk about this 10 years ago. And you can play with that tool on the Internet. It's getting a bit old, but it's still really, really good. OK? So let's have a quick look at what it does. So here what you see is that you have code that you wrote. OK? And we'll see what happens when JavaScript executes that code. It's quite simple. We have a set timeout with a five-second delay. And we have a console.log. So obviously, we all know that welcome for them will be first, printed to the console, and that after close to five seconds, because that is not a guarantee, we'll see powered by BEJS. OK? Let's play that video if it works, and the video can't be loaded. It's not a problem. OK? So basically what the video shows is it takes that block, puts it on the stack. When it's executed, because we have a set timeout, it will put the timer and play the five-second timer in the web API part. After, during that time, console.log for each one after the other. And you can also picture, have a mental model around the fact that in the end, there might be even several loops. OK? Multiple phases, timer, spending callback, ideally prepared, pulling, where we all know JS is going to ask the OS, hey, do you have any connection, network connection for me? Is any file, has any file being read? OK? And when you understand the different phases, you can answer questions such as this one, promise.resolve.then console.log promise versus process.nextStick, console.log nextStick, which one will be executed first? Well, it depends on where it will be picked up from at the level of the even loop, what phase of the even loop is involved. Node.js architecture is inherently multi-threaded. OK? So we all heard that JavaScript is single-threaded. What we mean by that is we have one single thread to execute your JavaScript code. But it doesn't mean that Node.js on its own is single-threaded. OK, how many threads or how many processes do we have in a Node.js application typically? One, two, three, three, anyone? Four, what about five-ish? Five-ish is a good number. OK, five-ish is a good number. Well, you have a thread for the main even loop, for the main stack. You can have threads or processes for the garbage collection. You will have libv, and then libv also takes care of handles a pool of four worker threads. OK, so you have at least five-ish processes or threads when you run your Node.js application. All right, that is a bit too complex for today. OK, and so we're going to simplify, and I've grouped different, I would say, parts of that architecture into blocks or squares of the same color. So we're going to look at the orange square. It's going to be the main stack together with the run on the same thread as the even loop in red. Then we're going to have one single queue. OK, we're not going to distinguish between micro task and task. And then everything else is going to be called Node.js API and will be assembled together. So that is what we are going to work with today. OK, good. So prefer asynchronous code over synchronous code. Let's first look at what it means when we use core modules, and then we're going to look at what it means when we use MPM modules that we can download from the internet. OK, so FS module reading a file. When you use the asynchronous API, read file with a callback, it is non-blocking. When you use the read file sync, it will be blocking. What is the difference? Well, the difference is that in one case, the sync version of it will run on the main thread, while the async version of it will be run on one of the worker thread, of one of the worker of the thread pool. It doesn't mean that at the OS level, reading the file will be non-blocking, but from the perspective of Node.js, it is non-blocking because it doesn't block the even loop of the main or the main thread. So the sentence, prefer a synchronous API over synchronous API, is absolutely relevant and true in the context of a core module, because it will run the code of the async version into a worker thread. But what about a pure JavaScript library? How does that work? To answer this question, we are going to use the example of Bcrypt. Bcrypt is a way to create a hash to securely store its secrets. And it is interesting because that operation can be quite intensive in terms of CPU, depending on the number of cycles that you will perform or how secure you want that secret to be. If you go on NPM, you will see that there are multiple implementations of Bcrypt, and there is one pure JavaScript implementation known as Bcrypt.js, and there is one C++ implementation which is known as Bcrypt. And so here it is interesting because both have sync and async APIs, and then we can compare what happens with the pure JavaScript implementation and what happens with the C++ implementation. So let's look at the pure JavaScript implementation. Hash sync versus hash, it is basically the same syntax as the FS module, right? And what we are going to do is to imagine that we have two servers that receive five requests. And the five requests that they receive is you take your CPU on a five-year study at ULB to obtain a degree, so you perform a super intensive operation, and then four requests that are basically quite fast, you just watch the Netflix episode. Okay, and we are going to compare what happens in both cases. Now, the trick is that Bcrypt is a smart and well implemented library. And the Bcrypt asynchronous API is implemented in a way that when it has to compute a large hash, when it has to compute a long operation, instead of doing all of it at once, it will chunk, it will split it into two smaller chunks. Okay, good. So synchronous on the left is synchronous on the right, and then we look at those five requests, the big one and the four faster one. So at some point, the endpoint with the hash computation is called, and we put on the stack the computation of the hash. Synchronous parts, we have that big square, blue square, that needs to be performed, and you see that as the computation goes, then the green is going to fill in the blue square to show progress. Okay, because on the asynchronous part, we have a chunk, that is a smaller square. At some point, the second request, the first red request, the first episode that you're watching, is reaching your server. And what happens is it has a callback, it has operations to be performed, and it will be queued. And notice that in the case of the asynchronous API, we're quite close to be done with the first chunk. The first chunk is done, and then the script will schedule the second chunk to be run. What it means is the stack will be empty, it will use Node APIs to schedule the next chunk, and then the Node APIs will push another callback to compute the second chunk. So we go on, and now you see that on the synchronous API, we continue to move forward with the computation, and on the asynchronous one, we are executing the callback one. Continue, same here, at some point the stack is empty, because the stack is empty, the event loop is picking the next task, put it on the stack, and we continue to perform the compilation. As each chunk is done, or the callback is executed for one of those red requests, when the stack is empty, we pick up the next task in the queue, and we go on, and we go on, and we go on. In blue, and then only you have callback one, callback two, callback three, callback four. While in the case of the async API, you chunk it, and because of that, you can't do anything. You chunk it, and because you chunk the work, then in between those chunks, your server can handle other requests. Now if you start to draw some lines and analyze the response time, so the point of view of the user, this is what it looks like, and then you have that kind of chart, where you can see the duration of the first blue, the big request, the duration of each of those red requests. In the first part, what you have is that each red request is delayed by the entire long compilation. On the b-crypt async part, on the bottom part, then it is delayed by at most one chunk. That's why you have smaller timing for the red request. What will happen if you do the same exercise with the native C++ implementation? Because it is a native implementation, because you can use the async API, it will behave the same way as the core module of FS, and it will be executed on a worker thread. If it's executed on a worker thread, here's what the timings might look like. You basically have a timing that corresponds to exactly the combination that needs to be done for the red request, and there's no delay at all. There's a small difference between the C++ and the JavaScript implementation. The C++ implementation will be faster, but here what matters is the fact that the code runs either on main thread, or in a worker thread. It's not important to compare the speed of the C++ or the JavaScript implementation in that case. Sometimes you do have to take your CPU on a vacation. Sometimes you do have to do a heavy combination. What if you do not have a native implementation, or you do to have a slow application? Well, if you really have no other choice than to take some vacation, my advice is be sure to have a pool. Be sure to take your swimsuit with you, because it is possible with libraries such as Piscina, Swimming Pool in Italian, to create pools in which you can have threads that can execute JavaScript code. The API is quite straightforward, and in the end what it means is that instead of having one stack to execute your JavaScript and then a set of other threads to execute native code, you can create other tracks, all the threads in which JavaScript code will be executed. For example, say you create two pools, you can create one pool with four threads to compute, for example, B-Crypt hashes, and you can create a second pool to compute recurring events. And in that case, what it means is that when your code will be pushed to the main JavaScript thread, the main thread is going to communicate with the pool and say, hey, execute and do that computation for me. And then the pool will distribute that computation among the different threads that it creates. So here is what it looks like when you use a pool. It's quite efficient, it's quite nice. Is it a silver bullet? Well, no, there are several things you need to take into account. You need to choose the number of threads wisely. You need to determine when you use a pool and make an analysis. You need to be sure that the machine on which you run your application has enough cores, because in some situations it can be counterproductive to create too many threads and have too many processes running. And of course, you will have to monitor and check the memory usage. All right, how do you know when you need to create a thread pool? For that, you need to measure how the event loop is behaving. You need to measure the health of your event loop. And one of those metrics is, for example, the event loop delay. Another one is the max CPU time, and there are tools to help you get there. I strongly recommend Dr.JS with ClinicJS. It will give you such a nice graph and show you when you have a delay in your event loop, when your event loop is blocked. Measuring that yourself is not complex. That is all you need to measure on your server or in your node application. The delay of your event loop. What you basically do is you set an interval with a one second delay. So every second you will execute the callback with a set immediate, and you will compare the time at which you plan it and the time at which it is actually executed. And that, the time difference between those two, the start and the end, will give you the delay in your event loop. Time to wrap up. So do not block the event loop. And what you really mean by that, it is not about async versus sync. It is more about not performing CPU intensive task on the main thread. Okay, so that is what you have to remember. As long as you do not execute CPU intensive task on the main thread, your application will be fast and smooth. So here is a couple of advice. And have some coffee, drink as much coffee as you want. Enjoy the show. Take some time off from time to time. And thank you for them and you will be for having us today. Are there any questions? One question there. You have to speak up. How does PCNAS react compared to Node cluster API? So the question is how does PCNAS differ from the Node cluster API? So my understanding is that the Node cluster API basically means that you are going to have multiple instances of the same application. Okay, why PCNAS, one instance of your application will have multiple threads.
The Biome toolchain
Thank you. Our next speaker is Victorien, who is one of the lead maintainers of the Bayonne open source project. Big round of applause for Victorien. Hi everyone. Is everyone heard me? It's okay. Is everyone heard me? Yeah. I'm Victorien and I'm one of the biome lead maintenance. In this talk, I will present some unique characteristics of biome. We take a high overview of its internal, how it achieves this kind of unique characteristics. What is biome first? Biome is a code linter. Biome analyzes your code statically to find bugs to enforce convention with a team, similar to ES lint. Biome supports natively JavaScript, JavaScript, JS6, TS6. Unlike ES lint, you don't need to install zillion of plugins. You don't need extra dependencies or extra configuration to support TypeScript. Biome also tries to output helpful diagnostics. It brings you some context, explains the issue, and provides a course of action to solve the issue. For example, in the figure, you have a unique biome rule named no accumulating spread, and they warn you against the use of accumulation of the spread operators in reduce and maps. Biome currently provides 200 rules. Some are unique to biome. Overcome from ES lint, ES lint, TypeScript, and over plugins. One of our contributors is currently working on Ter-Wine class sorting. Biome is also a code formator. You can format JavaScript, TypeScript, JS6, TS6, JSON, JSONC, and we are currently working on CSS formatting. In contrast to Prettier, Biome is able to format invalid code, as demonstrated in the animation. Last November, one of the mountaineers of Prettier launched 10,000 Bounty. The goal was to create an alternative to Prettier, a faster alternative, right in the rest. That matching the Prettier output, in particular for the JavaScript formator, over guys come in the play and add monies, reaching $25,000. In one month, we made it. We reached 97% of compatibility with the JavaScript formator, not only for JavaScript, but also for TypeScript, JS6, TS6, and JSON. One question remains, is biome fast? According to our users, it is. Here we have an employee of OpenAI at TestBiome, and using biome to replace the S-Kint and Prettier, they reduce the linting and formatting time from one minute to about two seconds. Biome is also a growing community. We have, we reach 170,000 weekly downloads on NPM, and we have a big player that starts using biome. We have Astro, we have Arespaq, it's a project from Baddance. We have a three-person company, and we have a Gold Sponsor, Shiguredo, a Japanese company. Now you are familiar with biome. I propose you to take a look to its internal. Biome uses a leader follower architecture. We have a main thread, the leader thread. The leader thread spawns a thread for each file to process. This allows to scale with a number of threads on your computer. Each thread parts the file in a structured form, a tree, and under the file, for example, formatting, linting, and so on. Usually, code formators at Cundlinton in the JavaScript world keep every Watt space, every character. And the very tolerant parser is able to represent missing pieces in the tree. For example, here we have a hole because the variable is not initialized. And we have also a bogus node in some other case. And yes, we are able to format and fix the linting issue by replacing the bilete and removing the training space and indentation. If I use less indentation. That's fine. Many rules are like no var or no accumulating spread directly query the concrete syntax tree to output diagnostics. But sometimes you have some rule more complex. For example, no unused variables in the following code. The name variable in the conditional is unused. For this kind of rule, it's very hard to implement the rule just using the concrete syntax tree. In VatsCas, it's good to have extra information. For example, it's under scope, lexical scope. For example, you can see the name parameter is shadowed by the local variable name in the conditional. And you have also to handle a listing issue with function var places. To do so, instead of querying directly the tree, we consume the tree and provide the extra data in a semantic model. The semantic model is able to answer questions like find me all references of such declaration. And is this reference a right reference or read reference? For example, in the previous example, the name is unused because it's only right. It's not read. In September, we have a first version of name resolver. But a very simple name resolver. The goal of a random resolver is to bind the declaration to the references. Basically, this name resolver assigns a unique identifier for each declaration. And they follow an invariant that is a reference refer to a single declaration. You have some example here. It's correctly under scope because the name parameter is linked to the parameter for the same for the return value. And in the local variable case, it's also correctly bind. The issue is this name resolver doesn't play well with TypeScript. In fact, in TypeScript, you can use the same name for tip and variables. For example, here you have an interface and function with the same name. And using this first version of the name resolver, you can see some issues. For example, the return tip person is bound to the function name. And we export only reference the function. Actually, a reference in TypeScript can reference several declaration. Here, the export reference both the interface and the function. And we have also two differences between tips and variables because the return tip references the type and not the function. In fact, yes, maybe. We can just conclude here that a reference can reference a variable, a tip, or both. In fact, it's even worse than that because in TypeScript, a reference can reference an infinite number of declaration. Here, you have an example of the E type, in fact, reference three in fair declaration. You have also over issue, for example, in TypeScript, it's possible to partially reference a tip variable. For example, a case, it's both a tip and a variable. And here, you have the first reference only the variable and the last one, reference only the type. So to support TypeScript, we don't want, in fact, to handle every edge cases because it's very complex to expose this to the developer, to the implementer of the inter-rules. It's similar to you have a reality, TypeScript, and you want to model a simplification that covers most of the cases. What I propose is to keep the invariant that simplifies the API. It's a reference referred to a single declaration. And we handle edge cases with export, in fair, and others with different code. And we keep also important characteristics at tip and variable with the same name are possible and tip and variable duality with cases, for example. I implemented a second version of a name resolver. In fact, the name resolver add a tag to every declaration and references, a tip or variable tag. For example, here, the interface has a tag tip or for tip and the unique identifier we previously talked about. And for variable, for example, function, we have a tag variable and so on. And the reference refer ever a tip or a variable. With this new name resolver, it's possible to detect some issues. For example, here, we have a function, it's unused function, and the tip is used. And one important aspect is you don't expose the tip variable duality in the semantic model. It's only a resolving pace. Once you resolve the names, you just remove the tag and you have previously the same API. You can ask which declines every references of each declaration and so on. In conclusion, Biome is a formator as a linter. It also supports import sorting. Biome is fast because it's white and in a rest. It uses multicore capabilities. Biome is editor-ready because they use an error-tolerant parser and concrete syntax tree. And we support TypeScript, GSX, TSX. It's better included. And we have a tip-aware semantic model that allows to implement some interesting rules as use import tip, use export tip. This year and later, we plan to implement, to support more languages, CSS, HTML, Markdown, and more frameworks, Vue, Angular, Zvelt, Astro. We have already some support for React. We want to improve linter capabilities, multi-file analysis, because for now, we are only able to analyze file by file. We cannot analyze an entire project. And we want also to implement a SAP LIFI Type System to implement some rule of USLint TypeScript. We want also to support plugins. It's a requested feature by many users. And expand the tool chain with other tools. For example, transcut transformation and so on. If you want to help us, you can try BIOM first and report Azure feedbacks. You can also contribute to BIOM. We have a good first Azure on the repository. We have a great resource to create an inter-rule. Emma, the other lead partner, created a video for creating an inter-rule. If you have money, you can sponsor us also. Thank you. And I just put some comments. In fact, you can format your code base, you can link, and you can do both with a single comment, format and link. Thanks. I have some questions. Yes? What was the question you talked about? Sorry, I don't... What vision of ECMAScript do you talk about? Atmosphere. You can maybe... What vision of ECMAScript do you talk about? Advertisers. Okay. Advertisers. ECMAScript direction. Ah, okay. The last one. Are there any other questions? Can you shout? Can we use custom rules? Can we write custom rules already? It's a recurrent question. Currently, no, because we don't support plugins. But we have a core contributor, but currently working on that. We explore two options. First is WSM or JavaScript possibility for writing plugins. And another one is with domain-specific language. Maybe with a grid QL. Yes? For large projects that have large stinker operations, it might be difficult to switch to the bio-jazz stuff. You can have some library to generate bio-contagrations based on some existing configuration. Yes. In the next version, we plan to add some migration tools for pressure. But we're also thinking about doing the same for ESLint. Unfortunately, we have some differences with ESLint. Most of the ESLint rules have an equivalent biome rule. But we have some differences in the option. For example, we support less options because sometimes we think it's not relevant to support some options. And we have also some rules are split in two rules, for example. I can give you an example. It's NoVis alias in ESLint. In fact, ESLint forbid you to write, for example, const self, equal this. In fact, we think it's too restrictive. We decide to split in two rules. First is NoVuselessVis alias. In fact, it checks for use less Vis alias. For example, you use an arrow function here. You don't need to alias this with self. You can just remove the aliasing. And you have a second rule. It's use the arrow function that try to replace function with arrow function. And combining both, you have the same functionality as ESLint, but less opinionated. And we have also overruled more strict, but depend of the case. And this kind of differences make a bit hard to create a migration tool, but we're working on that. Yes? Sorry, the large advantage of criteria compared to formatting tools that came before it was its opinionatedness. What is your stance on the opinionatedness for the biogest? Sorry, can you just repeat the question each time, because for the live streaming? The question is, the pressure is opinionated, is biome the same? Yes, because in fact we decide to match the pressure output because of the challenge. And we implemented every option of pressure, and we decide to match its output. And yes, some users are not so happy because they don't like pressure. But most of the version of ESLint uses pressure. I think it's a good thing to match pressure on that. Yes? With TypeScript ESLint, it has some quite advanced rules, but it depends on the type. So for example, it will warn you if you call an async function without using await, because it knows that the function returns, I promise. Is this the kind of thing you do at the moment, and if not, do you have plans for that kind of semantic analysis? Yes. In fact, currently we implement about all the rules of TypeScript ESLint. It's basically all rules that don't require tip information. And for this kind of rule, no floating promises, I think so. It's this rule, yes. We need tip information. We have several ideas, but personally and the team also want to explore a simplified TAP system. In fact, we want to base our implementation on the new, the upcoming isolated declaration mode of TypeScript. It's for the next version or maybe the next one. And the idea is you require to have a bit more annotation in the code and combine this mode with a simplified tip inference system you can in fact achieve most of the type query in a code base. And I think this mode will gain a lot of support because it's built by Bloomberg and it brings a lot of performances in the TypeScript world. In one or two years, this mode will be the default. I think it's a great bet for us. Thanks. Any other questions? No. Big round of applause for Zipfana.
Staying Ahead of the Game: JavaScript Security
I think we can start. So our next speaker is Dheeraj, who is going to talk about JavaScript security. You should care about that subject, because it's an important one. And Dheeraj, when he's not coding, he's thinking about food or drinking coffee. So I guess, as all of us developers, big round of applause for Dheeraj. Hello everyone. My name is Dheeraj, and I'm very excited to be here. And thank you all the great organizers for having me. This is my first time at Fosdame, so I'm very excited. And today I'll be talking about JavaScript security. Before we get started, just with a show of hands, how many of you are passionate about security, or think of security before building applications? There we go. Yeah. Oh, all right. A lot of fun. That's great. I'm sure this is a very important topic for all of us, and I hope to share some of my learnings. Just a little introduction about me. I work with the amazing front-end team at GitLab. It's been almost more than four years. And I also like playing table tennis, and I can switch my hand in between. Just to have fun and winning the game. And I like to find security bugs with all the applications I use, just to poke around sometime for fun, sometime for swag, for swags. Who doesn't like free swags or goodies? So I've reported some of the companies where I've reported bugs. I've been doing it for a long time. This mere alert is something which I'm very much excited about, and we'll be talking about it in the next slides. To give you an idea today, Agenda is very simple. We're going to be talking about XSX, CSP, TEMO, and some of the initiatives we have been taking at GitLab to improve our security posture, and some of the best practices we can follow while building any application. And as time permits, we will have some questions and answers. First and foremost, security is important. Everyone knows security is like an elephant in the room when everyone agrees to it, but only a few take it very seriously. But like most of the companies do not know that if their app is compromised, they'll lose the trust from the users, and they'll not be able to get too much customers. So there's a famous saying that there are some companies who have been hacked, but some of them do not even know that they have been hacked yet. It's like speeding on a highway where first you get used, drive incredibly fast, and then you get fined, and then you start driving a bit slower. And our primary perspective about security is always about... The security is always at the back end, but in this talk, I'll be giving a lot of insights on the front-end security. There's a little story time. There have been many instances reported here and there about security bugs, where one of the popular ones was where somebody was stealing your McDonald's website passwords, how they were able to... was just using a reflected XSX, using a sandbox bypass for AngularJS, and as soon as a user visits McDonald's sites and they type their password, they were storing the password in the client side in local storage, and when you visit a link sent by an attacker, they would be able to use the passwords being stored on the local storage and send it over to the attacker website. And another instance where popular WordPress has been hacked a number of times, where there have been instances where a plugin has a malicious code which it uses a database, and once it is rendered on the UI with a vulnerable code, and if somebody was able to log in using admin, would be able to send a request and create fake accounts. So it's just like installing a keylogger using just a code. You might be familiar with this particular application, chatGPT. Even in that, a known researcher was able to just pass this code as a markdown link with a JavaScript protocol and it was being rendered and it was quickly fixed. So we have been talking about XSX, so it's called cross-site scripting. It's so complex that I have been knowing somebody who has been doing PSD in this particular topic, and to know more about it, it's very simple when data being supplied to the application becomes part of your code. In this example, like an image tag with a source which is invalid, and an honor attribute will help you to add some JavaScript to it. And the screenshot has an example where I reported about to Uber, where I just add this particular code to my name, and when admin sees that this name is malicious, they try to delete it, and when they delete it, that pop-up will appear, and that was reported using their HackerOne program long back. So it's more than just an alert pop-up. It's just a try, maybe. You can write it through the document. Yeah, exactly, right. So the problem here is with the document to write, it's just taking the URL from the query parameters, and without sanitizing or escaping, it's just putting it in the DOM, and which can actually lead to NexusX. So how do we fix or how do we protect this? So we have to analyze, inspect all the places where DOM elements are created, and there's a famous attribute called innerSTML function, which should be actually avoided at all, and to prevent these bugs, we can also have a linter rule to be placed so that these sort of vulnerable functions are not being used in the code base, and like famously, we should just do input sanitization or output encoding. We'll be talking more about it. Then we have on the client side, we can have multiple flags. We can help you secure the cookies. So there's two flags, which I would like to mention. One is secure, which will help you make the cookies, which are sensitive in nature, are always transmitted over STTPs protocol, and there's something called STTP only. It doesn't even know what its STTP only flag does. It cannot be accessed by JavaScript. That's correct, yeah. So this is something you might see for all the session cookies we have on all the applications. This will make sure that even if an NexusX bug happens, it will not be able to read the session cookies data. Or the cookie which has the STTP only flag data. At this point, I would like to show a demo. Maybe I can just show a demo rather than a screenshot. I thought that would be fun. So this is a vulnerable application, which is a secure site. And if you try to do like a hello world, you can see that it's a hello false name being displayed here at the bottom. And if you try to enter any HTML to it, let's see if it does render or not. And with just an underlying tag, there's underlying being rendered. So that gives a hint that it might be rendering HTML. So how about we just add the code which we have been seeing like as a maze tag with invalid attribute, invalid location, and the honor attribute. So it just adds a prompt or an alert, but this is again not really harmful. So we can try to read something like, like this is the website's name. And we can also try to have local storage data. Oh, here we go. So it's like you can read the local storage data which have been stored on this particular website. So to talk more about the code base, which is making this vulnerable. So in this, this is react application where you can just see that there's a piece of code called dangerously set in an HTML. It's a variant of, it's a variant of inner HTML for react. And it can be harmful. So actually it's not even needed here. So, yeah, please go ahead. Oh, sir. Yeah, sorry about that. So, so that, that was the code which was being vulnerable, which is being used to render the data. We put it in the search bar. But if you want to fix it, we could just, you know, remove this and just render it directly because react view such frameworks does escaping by default. And there is no need to render HTML. And if you just try to save it and refresh it. So it just tried to render exactly what you write in and not render the HTML we have. Right. So going back to the slides. That was backup. Like if the demo doesn't work, I can show some slides. And yeah, so like I mentioned, it's the dangerously set in an HTML should, can be avoided here. Very possible. And if it's not possible to avoid it, what we can do is like add some sanitization to it. So tool of rule of thumb for, for, for preventing such issues like first of all, never trust any user input you have in your application, wherever the user input is coming through it and avoid such inner HTML variants like dangerously set in an HTML, V HTML for Vue.js and always sanitize your input and escape it wherever possible. Though the framework like react view will do it for your automatically. And there's also things like links, links must be secure by default. There's not something browser does it and not even the frameworks does it for you. And then the ESLint or similar linters can be used to prevent that these functions cannot be used in the code base. Then the next we'll be talking about content security policy. So it's like an advance or defense in layer, defense in depth layer mechanism where, which allows you to restrict your website to load certain resources. And you might have seen this error in the console when, whenever a malicious content is being loaded, it will prevent and throw this error. I can, I can just show how content security policy can fix this particular issue. Let's, let's go back to the code base and just, so this is the vulnerable code. And, and if I go to his index dot HTML and try to add this meta tag and just write that we only want to allow certain script from the self self means the same origin. And if we, if you try to refresh it. Maybe I did not save it. Yeah, so, so if we go to the console and I zoom it, and if you can see that the, it was being refused to execute that in line event. And that's because we have added to, to allow only the scripts from this, from the same origin. That was an example to showcase that. So content security policies is really very defensive in depth measure by browser and can be applied for all the web applications. It can help with cross-scripting attack like what we show, what I showed in the example. And also being helpful with secure form submissions that it should not let your form being submitted to any STTP websites or to prevent MITM attacks or STTP as migrations or mitigating vulnerabilities like click jacking. This is how the CSP added can be added like what I shown. This is a particular code for Node.js helmet framework. And what I shown was the meta tag. So it could be done with any of it. Like another example where content security policy where it can be prevented. If there's example that it only allows resources from example.com and if somebody tried to request malicious.com website to load some JS, it will be prevented. And other security headers which can be helpful and can make a note of its X frame options. It is also helpful to prevent click jacking vulnerabilities. And HSTS is strict transport security for making sure your website is always accessible over STTPS. Then we have access, X protection, X content type options as well. These are generally being part of all the frameworks you might be using the modern browser frameworks are always having secure by default having these headers. But if you're not using any framework and going for a vanilla thing, then make sure to have it them. And now the interesting part like I would like to share some insight what we have been doing at GitLab and how we have been improving our security posture. So first and foremost, it's everything we try to do make as much as public. So that was something we decided a couple of few years back that we should work out upon improving our front end security posture and taken note of multiple steps. We can take it forward and firstly and foremost what we did was we have VSTML very similar to dangerously set in an STML in React. We built a safer version of it, which is V safe STML and V safe STML is basically remove the malicious part from the scripts. And how we are able to do is using an open source known sanitizer called Dom purify. We try to apply that sanitization to all the input we get VSTML and then we say FSTML and we just render it. So and also we have added a ESLint rule to our front end code base that nobody would be able to write VSTML unintentionally so that we don't really forget about it. And so far is doing good. And then we also made an effort to make sure that our link component which we use it GL link renders links safe by default so that we don't have to developers don't have to think about writing secured code. The component does it for them. In the GL link, if a malicious link like JavaScript protocol will be just set us about blank or null. And then it was interesting story about iframe sandboxing. This is a mechanism where you can contain a third party module. We have been using a third party library to render charts and it we like it lab can allow you to add charts in the comments notes description everywhere. And and if there is a vulnerability with with the charting library, it would also impact it lab. So what we did was to contain the entire execution of the charting library inside an iframe with this with sandbox and do not allow any cross origin request. And we implemented this and all the issues with the third party module were just fixed with the one simple fix. And applying all these defenses actually helped us reduce front end security policy. This is one of the instance where we were getting a lot of reports on our hacker one program and applying these defenses has helped us to make the security issues come down to. Manageable numbers and they like you can see in this example in the graph that the issue got increased around some of the time in November 21 and then we applied these defenses and the issue dropped very significantly. And I'm sure what we have been doing nowadays also it's keeping the numbers low and we try to keep it to us. We are trying hard to keep it more lower than this. So to get an idea like how how do you improve your security posture is like just try to shift left integrate your. Such thing into HDLC your software development lifecycle adopt secure by design principles so that developers don't have to think about it and just use the make sure they come component you're building or framework you're using has though all those secure by a default design. And you security middle middlewares don't read when the wheel and build your own framework or things like that. So helmet jay's is popular one for node jay's and use standard sanitizers again like Dom purifier is doing pretty good job. And then if you want to audit your vulnerable packages sync is a nice tool and NPM also has a audit functions which can use. And there's some learning resources I found it really helpful which is Stanford CS 253 scores and my favorite one is over at developer guidelines where it keeps updated with the latest attacks and the vulnerability information and Hector one Hector one oh one is also a nice way to learn about security. And anybody plays CTF or nice yeah I love CTF so it's the way to learn about security and and we have also have secure coding guidelines which can be followed to learn about these stuff. Thank you so much for listening me out. If you have any questions I'll be around or if you have time. We have time for questions. Yeah so it's not always. Yeah so I'll just try to repeat some of the. So what are the options when we are not possible to set X frame options as a deny. Sorry. Yeah the question is all about if if your application is not able to use X frame option as a deny. Some of this applications might share that I frame being used into the website. So I think one of the option is making sure them to use content security policy because it can actually allow the vendors who are using the the I frames on their website would only allow that. Content security policy is one way otherwise. I think that's that's way otherwise if you want to. Build a more advanced mechanism there would be like I frame post message communications which can allow you to you know only do secure communication with the I frame. Yeah I would love to know what you think about that. I would love to know more about this so that I can give more better answer to it. Any other questions. I will. Can you. Can you. Ah this question is about have you looked at new firewall mitigation to fix these issues. To be honest no I have not known a good firewall which because that's something. Needs to be trustworthy and stuff like that and all only worked upon like operating environments where they have installed a firewall for all the systems. But this this was more about like how general developer guidelines. Anyone else. I have a question. Big I love you guys. Thank you.
Codebase Conquest: How Nx Turbocharged Our React Workflow
Thank you all for being here and for waiting, sorry for that. So our next speaker is Nicolas, who is a staff engineer with a lot of experience and he is here to talk about Enix and an actual use case that he incurred during his time in Hazura. Thank you Nicolas, for your applause. So, does your build time keep getting longer? Well maybe we can extract some packages into overrack packages. But then the packages are extracted, started to explore the dev time to work and integrate in your app. And then it's hard to keep up with two versions? Yeah, at Hazura it was the same. The build time was like 15 minutes for the frontend. The dev reload time was like 5 minutes, so you make a change, you wait 5 minutes and then it's actually done. And tooling wasn't proper everywhere. So we had to make a change. And this is the story of this change. First who am I? I'm Nicolas, I'm a staff engineer at Pethitch. You can find my Twitter and my blog. This is also in the right hand form in my blog if you want to dig further. So let's get back to the topic. So what was the setup? We had two code bases, the open source one and the enterprise version. What we did was we extracted some of the code from the open source code base into a bundle through extra layers of webpack. And then we installed this into the enterprise application. It seems pretty standard, right? But then tooling wasn't the same everywhere. In one place we had touch scripts, yes, test, storybook, chromatic, Cypress, so very good dev experience, dev installation and everything. In the other side, which let's remember, enterprise clients pay for the other side, we had JavaScript, no touch scripts, yes, test, and that's it. No storybook, no entry and test, nothing else. Because it was so complex to work in this second part of the application, this was the end result setup. But that's not it. Get worse. We had one K-line of custom webpack config just to bundle part of the application into the other one. Log files management was hell when you change one thing in one place. You had to make sure the log file, not the package version, the log file was the same in the other place. Otherwise, things will crash in production and without end to end test, you only know when you're in production or when you test your dev environment. CI was very slow because of this whole system. So we wanted a Mono-repo tool. Let's have everything inside of a single Mono-repo, having them work better in union instead of isolation. We made a wish list for what we wanted to do. What we wanted in the Mono-repo tool was task orchestration, saying build this app before this one. We wanted to have dependency-graph-visualization because right now we have two packages, but in the future we'll have more. We want to see what the hell is going on without having to guess and looking at packages and digging through code. We wanted to have consistent tooling. Let's say we have just and the same config of just and the same version of just everywhere. Because yes, it wasn't the same version of just before. Fun to make with stuff. And we wanted to have contact constraints. And for example, the open source edition couldn't import the pro edition, because you don't want to give away things for free. Like companies get paid for. We wanted to have distributed task execution so that we could scale the CI by adding more runners and to say run those jobs in parallel and deal with how you want to do. And the bonus point is we wanted code generation so that scaffolding was baked into the tool so that in the end everything was done for us. So after this open X we went into the ecosystem, look at every tool that existed. And we checked every one of them. First a small disclaimer. This work happened about a year ago. New tools exist since now. Like Moon repo didn't exist back then. So if you want you can also look into Moon repo. And I also want to shout out every engineers working on those moon repo tools. They are amazing. If you have anything they are always willing to help. So kudos to them. So what did we look into? First one, Bazel. Bazel is made by Google to handle Google monopos. It's huge, complex, you can do a lot of things. But it's also very complex to use. We looked at Gretel because yes, Gretel can do other things than just Java. You can do whatever you want in Gretel. It's tailored to Java but you can do JavaScript, you can do Go, you can do whatever you want in it. We looked at Lerner which is the historical and classical tool to manage a moon repo in the old days of JavaScript. We looked at NX because I've used this in the past in the Angular days when NX was only an Angular plugin. And yes, this is a real monopo tool. We looked at Pence which is mainly used by IBM but also in other places. It turns out it's pretty good if you want to experiment and give it a try. We looked at Java repo because all the hype and trouble was solved and everything. So it was in the list. And so that was like the tool that we looked into. So let's see. We wanted Tasker-based acquisition. Well they could all do it so that's good. We wanted dependency graph visualization and Pence didn't support it. So those two are out. Then we wanted ecosystem tooling. Troubles didn't support it. Lerner neither. So we end up with either Bazel or NX. Project constraints, they both support it. Amazing. We wanted this task execution, they both support it. Cool. And congeneration, well Bazel didn't support it. While we could have added Bazel congeneration utilities and extra code, it was also simpler to set up than NX. Complex to set up than NX was way simpler to do. So Indian NX was the tool that met his needs that we had at Hazara. If you want to learn more about those tools, this is a great resource. It's open source and contributed by many of the maintainers of such monorapos where you have a graph of all the main things that make the monorapo features and each project is listed in here with what it can or cannot do. So we had with the tool NX. But turns out there is two flavors of NX. Integrated or package based. First let's go into package based. Package based is behave like a PNPM, such as NPM workspace. You have many packages, they all link together. It works pretty well. But it doesn't have consistent tooling. You can do whatever you want in your projects. The migration path is way here because you just slap an extra JSON at the root and it's done basically. But there is still a bit of step between the leaves. Let's remember why we are doing that because we want to make sure the build between libraries is way faster so that we don't reinvent the wheel every time. So then what is integrated? Integrated means that every tool in the workspace is unified and considered a monorapo as one unit. Every tool is consistent because every tool has the same version and the same configuration everywhere. You can train it in a specific project but the base is the same. But the migration is more thoughtful because you need to decide how you want to migrate. Do you want to align with NX context or do you want to bend NX to your wheel because you can do both. But thanks to this, we have optional build steps between libraries, which means we could solve all speed issues. But there is one more thing, plug-ins. But what the hell is a plug-in? A plug-in can do three things. It can generate generators that allows you to scaffold the bases. NX new library, done. NX new application, done. NX new storybook, done. It can execute it, which is wrapping the tool to make it simpler for you to consume. And the best part is automatic migration. For example, a new version of desk came up and you need to update your test to have a new configuration for the timer. NX will migrate your code for you automatically and it works 95% of the time. You won't have to do anything. This was really helpful for us because the code base was huge, like a million of code on those lines and it was hard to maintain. So that's all good and all, but we engineers, right? Tread-offs, not everything is green. Yeah, there is two big ones. First one is single version policy. We state that there may only be one version of a dependency and package inside of the monorepo. While it adds extra constraints, it's also recommended within any monorepo. Because if you have a library that is built using React 16 and another one with React 18, you cannot import the 16 into the 18 one. And the way I see single version policy for me is a bit like buy versus loan with interest. When you want to migrate React, if you buy, you just bite the bullet. You spend maybe a bit more time, but you do it everything at once and everything is a daily. Versus if you loan the migration, meaning you have to spend many times doing many packages one by one over time, every time you have to regain context, how do I migrate this again? How do I send this again? And every single time you want to migrate a new system, it takes way longer in the end. But it's a bigger investment up first. You pick. Buy enough tools is another constraint. You have to wait for the tools, meaning that, for example, like this version came up, you have to wait for NX to update in their setup so that it will automatically migrate the tools. In enterprise software, waiting for a day or week is not that big of a deal for a new test version, to be honest. And it's way better now because they work hand in hand with actual engineers working on those tools. And some of them actually work at NX now, so that helps a lot. And if you need it, there is plenty of escape hatches, so you can just do whatever you want in the case you may need. So we know what we want. We want to manoeuble. We want NX. We want integrated. How do we proceed? Because we're not going to say, we're going to freeze production for six months until we might get everything. That's never going to work. So the goal is to migrate incrementally without stopping the digital data work. And we add some requirements for this migration. First of all, we wanted to have no cost freeze during this migration. We had many engineers working on the code base, and we never wanted to say, stop working for half a day every week so that we can migrate stuff. That's not feasible. We wanted to have as little regression as possible. Nobody likes bugs. And neither of those customers. We wanted to adhere to NX. So that automatically migration what was as easy as possible. And which meant less maintenance in the end. And furthermore, if we have standard tools, then reusable skills. You can switch teams and everything is the same. So that's nice. Like companies that do loads of re-ogs, that's a big seller. And nice to keep. We had seven years of Githy story. Githy story is sometimes the only reason sometimes we can debug something because of the JavaScript and such. So we wanted to keep it. So here was the situation. We had our current code base. We then created a new NX workspace, like just create a new workspace. We import the code into the workspace. We build it. Is it working? Yeah, everything is done. Except not. Things broke, obviously, because our code had many issues. And so the next step is to identify a whiteboard and then break the current build. This way we can fix it in the current application. And then we can start over again. The good thing about this migration path is that at every step of the way we provided value to the actual developer working on the old system while preparing the new system. And at one point we identify some of migration we needed to make to NX. So every time we create a new workspace, we added a non-migration beforehand. And we did this cycle many times to make sure every step of the way it worked, we even had a crown to do on a weekly basis to make sure everything was good. And I mentioned we had to make tweak to NX. One thing we had to tweak was the JavaScript path because we had add slash. And in the monomaple, add slash means nothing because there is no root. There is only packages. But we tweaked it so we can make sure the migration was not blocking and require a lot of work on the previous code base. We had to include Node.js fanbots because even though no Node.js code should end up in the browser, we all have Node.js code in the browser, like HTTP and such. We had to make some specific changes to the web-config, like SVG and such. And we had to disable some ASN tools because, well, our code wasn't up to standard, obviously. So that's what we needed to do. What about our code, right? So first of all, we had CSS module without the .manual.tss extension. So there would be a VIN like CSS modules, but we didn't have the extensions. We had to fix it. We used an ability to pass in CSS in tabscript. And it shouldn't have worked, but somehow it did. So thanks, Webpack 3, I guess. But we had to change this so that it worked with Webpack 5. Path imports relied heavily on Webpack config, so we had to change that also. We had to update a test in tabscript to a version that is compliant with NX. We had to update the entry points so that they only export a component and not mount the application. And this was the kicker. Turns out, somehow, the build compiled with a lot of second-dependencies. Like a lot. Like 150 loops of second-dependencies within the codebase. And this was like one of the libraries, not just the bootstrap of it. So we had to dig through and fix our code, basically. And we went down through 95, and now Webpack was able to compile the application, and the browser was able to load it. So that was good. What it looked like in the end. We had our pro application that loads the pro library that imports the OSS library. And the OSS application that loads the OSS library and the end-to-end test that both imports the library and the application. Thanks to this, this was, by the way, generated by the NX graph of the workspace. We don't have to do anything. So all good, right? Everything is nearly ready. We just need to switch. And switching means keeping the Git history. So to keep it, we first made a commit to clean up the old workspace. Then we made a second commit to Git MV to the over place. Then we made an archive for OSS because, given we are open source product, we wanted to make sure a contribution went up broken because of this. Both commits, we applied the known tricks, and then we were in NX land. Thanks to this way, the second commit was able to identify into Git blame to make sure Git blame doesn't pick up this commit. So we still kept our Git history for whatever we wanted. In the end, the total freestime for this migration was three hours. From the beginning to the actual end of the migration, three hours total. It wasn't a fault lasting a few months. And the three hours is because of CI was slow to run on the four commits that I mentioned before. So all good and all right. What about the results? We want numbers for all users and all developers. First all users, zero bargain pollution. That was great. Because of this incremental approach that we took, we were able to see that every step of the way we didn't break something because otherwise we would have identified it in the app. The over surprise was that because everything is unified, the bonus rate decreased quite a lot from 43 megs to 13 megs. And funny thing is when you get a call from a service representative, thank you, Niko. I can finally use the app locally without being too slow to load. Thanks, I guess. It's a bit weird. You wouldn't before, but still. So this helped us at the low time. We have the application loading like five seconds faster thanks to this. Okay, that's good for devs and everything for users. What about devs? Well, 30x faster local devs. Because we didn't have to have build step every step of the way, we went from five minutes to ten seconds. This was life changing. Try to imagine when you debug something, you make a change where five minutes to see that the console you added show ups. Now it's like ten seconds in an instant for what we used to. And the CI was about 60% faster in the worst scenario. In the best case scenario, it's about 80% faster thanks to caching and things like that. All right, good. Is it the end? Are we done? We are now in Enixland. We have the packages. Are we good? It could be. It could be a step that we, you say is good enough. We don't want to go further. But you could. One of this area is architect of the coupling where you say I want to make sure that my open source doesn't import my enterprise code. And you can info that thanks to Linchwool in Enix. You have a Linchwool that's better than Debreche, but it basically says that a pro code can import shared and OSS and pro and that's about it. A shared can import shared. In a visual way, this looks like this. Where you ensure that libraries in the scope can only import within the scope or the scope they allow to go to. This helped us heavily to ensure that open source code stayed open source and the enterprise code stayed enterprise and open source couldn't import through the tooling production like a cloud enterprise code. Then the other thing we went further is to unify our tooling. While in this migration, we just add Enix, generate new entry and test. We add the new entry and test for our provision. And this costs us like 20 minutes to do. We now have a V-test in some of the new projects. And we also made our custom plugin because you could make your own plugin. It's relatively easy. And thanks to the plugin, we can create a new library. I want a library with this scope and this type and put it in the right folder for me. I don't care. Do it for me. And the naming would be automatic. Everything would be automatic. In those cases, you can say generate automatically like the code owner, update the CI if needs so and that. Because in the end, thanks to the plugin, you get the specificity of your tooling, all of the developer and engineers mind and into automation. Because we all know this documentation that is never updated. And a tooling is always updated because we use it regularly. So if we know it's all of it, we can look into it. So in the end, what I wanted to say is coding on a last code base shouldn't feel like this. You are not sure you're going to break something. You are not sure what you change with a fact. You have no idea what is going on. Instead it should feel like this. A happy dance. We just pass the ball around and have things moving in the right direction. Thank you for your attention. Are there any questions? So in this case, we didn't use NPM to share on the outside. However it's supported in NX to be able to release applications. And thanks to the NX plugin, it can understand your workspace and create a package for your library to be exported publicly on NPM. Next week there is a new launch event for NX and they are going to announce something that may be related to your question. Are there any other questions? Yes. Can you hear me well? Yes. My question is what was the main reason for such decrease of the bundle size? Is it because you are using all of these cycles in the code? One of the questions was why we end up with such a lower reduction in the bundle size, because what happened in the beginning of the talk, what happened before is that we had one package that we had bundled into a package. Sorry, there are a lot of slides. Anyhow, I think you remember close enough. So what we did before is we exported a large part of the application into a package and then we imported this package into the proper base. First change now is that Webpack now has a unified view of the whole system and has a way better tree taking. Because in this middle package right here, Webpack didn't understand what was actually imported into the end application and wasn't able to do as powerful tree taking as before. So that was one huge step that helped us on this. The second step was having updated Webpack configuration and tooling, which makes sure that we didn't need to target IE anymore. So that reduced like 5 megabits from the bundle. And so both things combined plus better CSS processing with like a unified view again of the whole system made that we had this decrease in bundle size. Yeah. So today I don't pay for it and I'm doing a similar migration using an X2. There is a new tool that I would investigate, which is called MoonRepo, which is similar in some cases to an X. However, through this day for an enterprise ready product, I will still use an X. Because the one thing they are moving towards to is to also have a way smarter CI. Because if your CI can understand your workspace, it can also understand better what to do and what not to do. And so for this day, an X would be still my choice. In the future, I will still investigate MoonRepo to see if it could make sense. But unless you have a huge scale like 10,000 engineers, Bazel would make sense. Because you could have a team of like 20 engineers working on Bazel. So yeah, that's my answer. Yeah. So just to make sure when you started with an X, you imported package by package. But you threw away the results in the ads. Yeah. And you redid it in two hours. Yeah. So this way, we made sure the old system was being updated to the change we needed to do. So this way, if for whatever reason we had to stop, we still provided value to the existing base. So on the question before, what do you think of TurboRepo? Yeah. So TurboRepo has some features that are integrated into an X in terms of a feature of parity. However, it lacks some of the larger system that is required for an enterprise project. You don't have distributed task execution, for example. You don't have unified tooling. You don't have generators. And this makes that, for me, TurboPo is a middle between learner and an X. It's like a middle ground where you have a bit better because you could have tasks like caching on the cloud, thanks to like Verso. But you don't have the full power of something like an X. So yeah. Yeah. If you compare TurboRepo with the other way of choosing the index, the first one, how would you compare it? So I'm going to have two answers for that. One which is related to next week announcement and one for today. For today, an X requires a bit more conscience and tooling when you set it up. But stay tuned because it will be even easier to adopt an X to an existing workspace because they are trying to change the fact that an X is smart and trying to understand what is your project. And you have less friction to adopt an X. Yeah. Did you have any non-Node.js applications or services that you needed to integrate in this migration or an X is only for Node.js related to nodes? Great question. So by default, an X is agnostic. There is an ecosystem of plugins that exists supported officially by an X that is very fund-electrified and circulated. However you could do whatever you want. There is community plugins for go, for .NET, for Java, inside of an X where for example for the Java project it will understand the POM.xml and try to understand whatever it can automatically. And one great thing about Polyglot repo like this is you can say when your backend change, we render end-to-end tests for the frontend because they are related. Because you can say your frontend like your SDK impulse is related to the backend because it is linked to the Open API spec. Then this, we trigger everything on the frontend. And this is where an X or a manual report shines is that it's one context even if it's for Polyglot. Unfortunately we don't have more time for questions so we'll begin with a close for you guys.
Web Performance: Leveraging Qwik to Meet Google's Core Web Vitals
Okay, so our next speaker is Ayub who is tech lead at Serview, company based in Paris, I think, right? Yes. And he's going to talk about how to leverage quick to improve performance in the web. So big round of applause for Ayub. Hello, so like you see here, my name is Ayub Elhuan. I'm a software engineer working, so I'm a rocking engineer, but I'm working in Paris for two years now. And today it's my first time giving a talk here in Brussels, and I was here for two times to attend to a BJS conf in the react Brussels, but this is my first time giving a talk here. So I'm so happy about that, and I'm so happy that there is a packed room here, and I'm so stressed with your questions. By the way, so let's start. Before starting talking about quick, let's make some context here. So imagine you're in a restaurant waiting for a dish, for example, for Moroccan couscous. Do you know couscous or not? Okay. Good. So you are in a restaurant and you are waiting for your Moroccan couscous, but the waiter said that you should wait 45 minutes. That's a lot, but for Moroccan one, no, you should wait. Okay. But it's a lot of time, so you can't wait for it, and what you will do, you will maybe go for another restaurant, you will write bad reviews, you will talk to your friends and say, maybe they have a good Moroccan couscous, but you should wait 45 minutes. That's a lot of time. I can't do that. Okay. So imagine that you're working on websites and you're having the same issue. You should wait for it to download all these JavaScript and show it to you. So before talking about JavaScript, let's talk about the core web vitals, who are the metrics that Google use to try to measure the performance of your websites. So we'll start, for example, with LCP, largest contentful pens. What is largest contentful pens? It's the time that your website will take to show the most large content to the user. Okay. So if it takes more than 2.5 seconds, you need to improve it. If it's more than four seconds, that's a lot. Okay. We have also the first input delay. It's the time that take, for example. For example, you have a button. Okay. But this button is not working and it's waiting for its JavaScript to work. Okay. The time that will take between showing the button and the button will work. It's the first input delay. If, for example, I show you a button and you should wait more than 300 milliseconds to work with it, that's not a good thing to do in your website. Okay. And Google are calculating that. We have also cumulative layout shift. So what is cumulative layout shift? Imagine, for example, you have your website and you want to click on login. But in this time you click on login, you have an ad bar that's shown in the website and you click on the ad bar. That's a problem. And Google is measuring that for our websites. If you do that, they will lower your ranking in Google. Okay. In Google search. I'm talking about Google search. Now let's talk about performance. Why your website should be performance. Here Google's done some study cases about some websites that try to improve the performance of their websites. For example, we have here MOBIFI. They improve their websites only, only with 100 milliseconds. And they had 1% more conversion. That's a lot. Okay. We have Cook, for example, with improving only 850 milliseconds. It's not a second. It's only 850 milliseconds. They had 7% more conversion. That's why performance matters. Okay. So by the question here that we have, why we have these problems with performance in our websites? Imagine, for example, you want more interactivity in your website. What you should add. You will add more JavaScript. Okay. So that your website will be more interactive. And when you will add more JavaScript, you should download more JavaScript and that will impact your performance. Okay. So here I will try to show a short about how SPA, single page application, works and how we tried in these last years to improve it. So when you have a website that you did with the SPA framework, you will send your HTML to the browser and you will show a blank page. And maybe because I'm experimenting that a lot with the websites to reserve my flight, to register my flight. And I had to, I will not say the name of it. And I should wait for, I think, 3, 4 seconds. And after that, in this 3, 4 seconds, I will get a blank page. And after my 4 or 5 seconds, I will get my page. Why? Because the browser is trying to download the JavaScript, trying to execute this JavaScript. And after that, it will render a page that is working. Okay. That you can work with. So to improve this process, what we tried to do? We tried to have something that's called hydration. What is hydration? We will send more HTML because here we didn't send a lot of HTML. It's only an index.html with nothing. This is why we show blank page. Okay. So now we will try to send a lot of HTML to show something to our user. Okay. But this thing that we will show, it's not working because there is no JavaScript. You should also wait and download the JavaScript, try to execute it, and now, doing the reconciliation process to have a working page. So maybe we will wait more time than before, but we are trying here to trick our user. The website will not work, but we'll show it to him. Okay. That's the thing. And we should wait. He can do nothing. And if you're now on Twitter or on YouTube, you will find a lot of people who are talking about these React server components, partial hydration, partial pre-rendering, streaming HTML, streaming that. Why they are trying to do that? They're trying to do that to improve this process and to try to show to the user something that works. Like for example, streaming SSR, they will try to, this will work and you should wait and we will download the JavaScript, execute this part. They are trying to improve this process. Here what I will talk about today is about Qwik. So Qwik has another approach to improve the problem of hydration. What Qwik will do will not have hydration, but it will have another concept that's called resumability. What is resumability? Here we will send the same HTML that have a lot of HTML. We can show a page and boom, the page is working. It's interactive. The problem that you can say now, does your page will be interactive without JavaScript? That's not the case. But there is a trick. The page is working, but because we sent a little bit of an event listener here, a global event listener who will listen to the interaction of the user and he will go and download the necessary JavaScript that your website needs for this part. Imagine you have a chart, a lot of data tables, a footer. If you are using another framework, you will download all the JavaScript for this page and maybe you will never use these JavaScripts. But you will download it. Here with resumability, you will not do that. You will download on demand. So the question that you can ask here is every time the user will click on a button, we should send a request to the server and we should wait for this request, execute the little JavaScript and it will work. So because Quick has a mechanism, what is this mechanism? Quick is using a service worker to download these chunks of JavaScript. That's the thing here. When you are using SBA, you have one single file of JavaScript that have all the JavaScript needed for our page. It's one thing. If you are doing, for example, lazy loading, it will have more and more chunks, but for other pages that are not shown for the first time. You can't do this for pages that are shown in the first time. Here Quick is using a Quick optimizer who tried to chunk all your websites. All these pieces of interactivity here, you will have for each piece of interactivity GS file. And the framework, what you will do? It will start to download in the background these files and store them in the cache, but it will execute nothing. So the DOM here, the DOM will be free. Will not have a lot of JavaScript. I will show you that in the demo code. So that is the resumability. You will have your page faster and also interactive. And with using this concept of resumability, you will need maybe for your website more JavaScript because you need more interactivity, but you will keep the same performance. Why? Because you will not have more files to download. You can scale as you want and you will download on demand. That's the thing. Okay? So what I will show you here, I will do the demo and I hope that it will work because always live coding, you're in front of people, problems come. Okay, I will try. I'm stressed for that, but maybe it will work. Okay? So you can see my VS Code. I'm a VS Code guy. I don't use Vim. It's too complicated for me. Okay? So here I created a file in our roots for our new page. Okay? I called it FOSDEM and here I will create a quick component. Okay? What you will see here is that the quick components, there is a resumblance with the react. Okay? Why? Because quick is using also GSX. Here what I will do, I will create a simple button. I will name it console and here I will go click and console, log, test login. Okay? So here I have my page. I have my button and if you see in the network, I don't have the VT files because it will not have them in production and I have the GS files and if I click on refresh, I will have nothing. I will not have my GS files that I should download with SPI frameworks. Okay? Not only SPI if you use hydration, also you will need to download the GS files needed for your first page. Okay? But here you will not have GS files. When I will click on console, I will download the GS file needed only for that. Okay? So I will do another thing. I will create second button. Okay? I do the same thing with login 2. Okay? So I will click here. I will have this file for only this interaction. I will click on console, log 2. I will download the second file for this interaction. Okay? That's the thing about Qwikis. They have chance for all your interactions in your website. Okay? So here I will create a account to have some simple example. Keep it simple because I have 20 minutes. And here I will create that. And here I will have a count button. And I will name it count. And here I will display this count. Okay? So here I will click on the count. I will have this JavaScript needed for this count. Okay? It's working. That's good. And here it's only the file for the framework, for Qwikis. What I will do here? I will go and create another button. Okay? We do simple console log. And I will name it here. I will do condition verified. Why? In this button, I will have a condition. I will not show this console log only if the count is more than 14. Okay? Okay? That's good. So it's not count. It's count value. That's great. Okay? So I will click on, I will name it console. Console. I will click on console and I will get the condition verified. Okay? But I will show nothing because I should wait for it to become 15 and now it will work. But the thing here is that we didn't use this condition. First, when we had the count, when we had our count 12 and we clicked on console, we didn't use it but we downloaded this JavaScript. That's a problem. Here with Qwik, we can do lazy execution. So what I will try to do here? I will try to lazy load this line because I don't need it. I need it only if the condition is verified. Okay? So what I will do? I will create a function here and I will wrap it with a dollar sign. Okay? And I will call this function. So I used only a dollar sign to do what? When I will click on console, I will get my file but without the console log and I will get this only with a link to this console log because I don't need it. So what I did here with this dollar sign is lazy executes a line because I need this line only if the condition is verified. Even when I will have the condition verified, I will click on console and I will get this file with this condition verified and Qwik will take this file and it will use it to the console log because now it will use it. Okay? I have really a loud voice. I was working without you. I have really a loud voice. That's too much. Maybe in the recorder recording, we will not get my voice but there is no problem. They can hear me. That's good. That's good. So, yes. Thank you. So here, like you see with me, what I did is only lazy load this line of code. Okay? So imagine this line of code is a lot of things to do and you can't do it in your front end. You can't do it on the client side. Okay? You need to do it in this server. What I didn't tell you is that Qwik has something called Qwik City. It's metaframework. Like for example, you have React. You have with React, Next.js, remix, Gatsby, other things. I don't know. And with Qwik, you have Qwik City. And with using Qwik City, we can do the back end for Qwik. Okay? Here, what I will try to do is this condition verified. There's a lot of things to do. So maybe I should do that in the server side. Okay? I will come here. I will call server function and now I will go to my client side. I will click on console. Notifies it's Qwik City because I'm using the server. I will get nothing in my console log. Okay? It will count more. Now I have 16. So the console log should work. I will click here. I will have nothing. It's only some things here. And I have nothing in the console log. Why? Because the console log will be in the server side. Okay? With only adding server to your line, you will execute all your code in the server side. Imagine you're using Java, for example, for the back end. You have on-giraffe for the front end. You should refactor your verifications of a form and you want to do that in the back end so you should do all of it, refactor all of the code, write it another time in Java and all that. Here with using Qwik, you can, you will also add a server function and it will work. Here for example, I will say you should return me a result. Okay? This console log is good. But after doing your work in the server, you should give me a return. Okay? So here I will give him a return from the server. You will come here, have a const result. Wait for it. We'll have a thank in here. You will cancel log your result. Okay? So like you see with me here, what I have? I have this cancel log in the back end and I should return this thing to the client side. Okay? So here I should have the, I should have the condition verified. Okay? Its count is 16. When I will click on cancel log, I will get my condition verified here and on my console log, I will get my return from the server. And here I'm trying to work with the server, the client in the same file, in the same file, like I said. And that's it. This was my demo. I think I didn't do more than 20 minutes. So happy to be here and thank you so much. I will keep the couscous image. Speaking of couscous, if you want to avoid the 45 minutes, you can try Tunisian couscous. It's way better. He's saying that because he's Tunisian. No, it's the Moroccan one. So questions? Yes. Your question is that you're worried about the size of the framework here. So for example here, I will use only the count. You will get the framework. So I think it's 15 or something like that of kilobyte. I can't see it. I'm stressed. In the button? Yes, 56. That's it. And if you will not need quick CT, it will not download it. It download it only on demand. You can deploy it simply with Versa, for example, or with the, it's like, it's a Node.js version that is included in the quick framework. You will not have the work of going searching for server trying to, no, you take the framework, you add the only simple adapter here. I don't have a good connection. Here in the docs, you will get your references, deployments, and there is, in the documentation, you have files, adapter files for each environment. You can only use it, and it's really, really simple. If you use, for example, Versa, you add only some configs in your files and it works. Okay. Yeah. Since you allow for the code from the front end to the back end, how do you deal with code injection? How do you deal with? Code injection. Code injection. From the front end to the back end, for example, in process of testing. It's like I showed in the demo. It's only, it's only with playing with, with your hooks that you have. There is, if I had time for that, you have, for example, use resource, you have wrote loader, you have wrote action. It's only hooks that you deal with. You take this code that is in the client side, you do it in this hooks or with using server function, and it works. In the same file, that's the thing. You will not need to add another file for the back end, file for the client side. No, you should only use the right hooks that you want, and it works. Is it also possible to unit test your components? I, I, I didn't, I didn't try, I didn't try to unit testing yet because I'm working on, on a project now with, with Quick. I didn't do that yet, but you can use, use, I think, it's the C-Press or the tools that we are using for framework, for front end framework. Yes. Yes. I, I didn't, I didn't do that yet. I didn't do that yet. There are other questions. We still have time for one question. Oh yeah. Is there a way to pre-fetch the JavaScript files? For example, if you have like the button click, and there would be more code to that, and there's the console lock, and you might have a bad network connection if you do that in the background, while the user's not clicking the button, is there a way to instruct the user to download as long as your runtime is here? Yes. That's what it's doing with the, with the, with the service worker. It's trying to download in the background the files. Yes. With Quick, you will not need to, to, to send another request to the server. And the good thing about is that Quick has, has, has now, has now a new tool that works with EI that you, you can use in your production, for example. If you, you deploy your website after a month, you will get some data of, of how it works and how users click on the buttons and which buttons they click first. For example, if you're using e-commerce website, the first button is add, add the article, and the second button is show the, the category or something like that. So Quick after using it, after, after being in the production for, for example, one month, they collect data and, and the second deployment, it will take the, this data and the, the, the download in the background with the files that maybe the user will need. Okay. They will change the way how it download JavaScript and they will start with the files that maybe you will need on, on your, on your website. Thank you. Yeah. We don't have time for more questions, but you can find a, you will be around. So big round of applause for you. Thank you.
Can we simplify charting libraries?
Alexander has been a React developer since 2018 and he likes creating UI that is nice. And he's going to talk about how we can simplify charting libraries. So big round of applause for Alexander. Okay, so thank you very much everyone for joining. To give you a bit of context about what we will talk today, I'm currently working at the MUI, which if you don't know is providing a user interface component. You might know us because of this library. And a very kind of a tradition each year you ask a user what can we do for you, what can we improve. And the community is quite creative, which led to other libraries, for example, a base, which is a headless library. But they are very creative. For example, Toolpad is a no-code application we are trying to build. And then there is the team I'm working in, which is MUIX. And we create the most complex components, for example, a data grid, a date-time picker, which are a bit more complex than a button and select. And a year ago, we decided to start the chart sephor. And this talk is about how we proceed, what we found, explored, and our current conclusion. So from the question we asked to user, what they wanted is a nice documentation. That's the main stuff they complain about a chart library. And having a developer experience that match what we do usually, for example, for the data grid. So we'll see together if this is possible. Okay, so I started with just thinking, having a dream, what would be the perfect user developer experience I would want. So for me, the best one is you have a wrapper. You provide him information about what he needs to know, what is my size. And each time you want to add an element, just add a React element in it. It seems pretty basic. It should be okay. Up to the time you add more data. When you add more data, it overflows and it totally makes sense just because the x-axis will need to communicate with the plotting to say, hey, stop after 10. But if you put larger data, you have another overflow issue. And just because your line plot needs to communicate with a y-axis. So I started my journey with a dream and I ended up with an issue because components need to communicate in all of our direction. And that is just one example, but it's the main issue of charts. Data management is a pain. There is a second one, which is customization. Here you can see for a button, we kind of all agree about what it can be. You can customize a bit the color, does a background have a color or not. The most complex stuff you can do is adding icons. Most of the time it's at the beginning or at the end. But for charts, you have much more elements. And the creativity of designer and mathematician is endless about how you can add annotation to it. So we need much more flexibility. And currently, all our developer experience strategies does not allow to do that. So we have two main issues. It's time to have a look in the past. This library exists since more than 10 years, so they have a lot of experience to share with us. And it's a pleasure to work in open source, so that you can have a look at why they made a decision and how the code is working. So let's start with rich arts. As you can see, it's a composition. We'll just say at the beginning that composition is a pain. So how did they solve this data management issue? Basically, you have a wrapper, so the line chart. And he says he looks at his children. So children is just an array of components. And he says, OK, which one is an axis? And I will extract all the data from its props to know from which point to which one I can display stuff. Does the same with all the elements that are plotting data. So here, line, mark, areas, and stuff like that. And then you do a kind of an aggregation to render the components with the correct properties. The file that makes that is 1000 lines. It's very hard to read. I assume it might be hard to maintain too. And when you want to add your custom components, you don't really know where the information comes from, because there is this black magic aggregation that will provide you some data. And to the bug, it's a bit of a mess. But it allows a lot of flexibility. On the other side, you have a much simpler approach. It's a single component. So for example, you want to line, you do responsive line. And you provide data. You can configure all the axes look like and configure the tooltip, et cetera. Each element has its props and a lot of options. So as I said, it's very straightforward. So one chart is equal to one data. So that will change according to your user, plus a set of options. But you get two main issues. For example, mixing charts does not really make sense, because you have two single components. You cannot overlap them in an easy way. And you cannot modify the features, because it's a single component. You have a finite set of options. And the option is not available. You can go inside the source code to update it. For example, supporting different axes for the left and the right. So having multiple axes for line charts is not supported. And except modifying the source code, you cannot do that in Nivo. So very nice if you want a simple chart. But if you, once you go into a wall, there is no option. So for the charts, it's a pure Ligava script. So as you can see, you select an HTML element, for example, main, and you run the code. Of course, all the complexity is hidden here. And to give you a bit of flavor, they kind of fixed the issue we've seen just before. The series can be multiple types. So you can mix a line chart, a bar chart. You can even put a pie chart in the middle of a line chart. It does not make sense. But for the software, it's okay. And it's an old software, so there is a lot of options. So you can do most of the customization you want. Due to time, I will just skip this. So basically, this is all the pipeline for rendering a chart. And the main issue I see with each chart is this one. The only stuff you have access is still the option object. So basically, you can provide the data. You can customize the option. But as soon as you want to render a custom element, you know if you've tried to render SVG just using strings, it does not make a lot of sense, or you need to have the components. So, now, save time. Nice. Just to resume, so we have these two solutions, basically, single components or composition. And as we've seen, data sharing with composition is a nightmare. And you can work around, but you get into the black magic stuff. And for the developer experience, it's not good. And for adding elements, you need composition, because as soon as you get to view these options, you don't know how to insert something, for example, in a Nivo as an array that allows you to reorder the grid, the axis, the plotting. But you know that when you reach the state, when you need to pass an array to order your elements, you will be quickly limited. So, it's time to go to the proposal. So, basically, we started with a single component. So, it looks a bit like a Nivo. You want a line chart. You say line chart, and you provide data and options. But behind the hood, it's composition. So, you have a like for a rich art. You have a wrapper and all the rendering components. If you look closely, you might see that the way props are passed are not exactly the same as for rich arts, and there is a reason. Basically, all the data that need to be shared and aggregated, so the axis, the series, and so on, are passed to the container. The reason is basically that we want to do this aggregation stuff in a need and way to say, okay, you're using our components, trust us about all the axis and the series need to interact. You don't need to take care about that stuff. We'll do it for you. And then it's passed to providers, so a series provider. But take care about knowing what is a bar, a bar series, what is a line series, what is a pie series. Same for the axis and interaction provider. For example, we'll say to you, the series with this idea is currently highlighted by the most. So, displaying accordingly. So now we can create the rendering part. So, for example, the bar plot, we'll call the series provider and say, okay, give me the data about the series. If there is none in Render.0, if there is some, he asks to the axis provider, okay, I have this bar with a value of 24. Can you say me which have a coordinate I should associate to this value? So, he renders the rectangle, and he will communicate with the interaction provider to know if the bar needs to be fade out, highlighted, or just in a normal state. With the same logic, you can create whatever you want. So, other kind of series, other kind of components. So, for example, we created the axis, legend to tip the basics one. For the little story, the reference line has been created by a user, just using the provider because it was a bar so. And of course, you can create your own ones, and that's the main success of this approach. So, as a conclusion, a single component. For us, it was a need because most of the time, for example, you just want to put a sparkline. You want to put a bar chart in your application very quickly. So, you say bar charts. You get few options, but just what you need to get the correct bar chart, and you don't have to care about all this internal stuff, about all components communicate together. But as soon as you want to do something very custom, and the charts are part of the earth of your business model, you want it to be as the designer implemented it, or display very specific stuff. So, you need composition. The main failure of this experiment was the configuration feeling. I wanted absolutely to avoid this aspect of, I give you a bunch of options, deal with it. It's not possible because there is so much interaction between the axis and the series, but you cannot split them into the options where they are needed. For example, axis in axis and series in the series. You need to get them all together to do the computation. You get this feeling, but okay. And the success is to empower developers to create their own subcomponents. And that is something I've never seen before, except if you go very low level on how to make charts. And to give you a flavor about how easy it is. Okay. So, this is a line chart. And there is a custom component in the middle, this horizontal line, that shows you for your most position what is the value on the left and on the right. So, this component is not very useful, but it demonstrates interaction and axis management. And so, to create it, you need two stuff. First, the bounding box in red, the most position. That's easy stuff. And then you want what we call a scale. If you use D3, it's the same object that allows you to convert the value to a coordinate. And what will interest us is the coordinate to the value. So, let's start coding it. I promise it will be very quick. Use drawing area is calling the provider that retains. Where do you plot the data? So, you get just the bounding box. And you use Y scale, provided the idea of your scale. And it returns you the D3 scale. Very easy. And that's all. That's all you need. After, it's boring stuff. You save a state. And you do your use effect to almost move stuff like that. If you store null, just because you are outside of the SVG of the drawing area, so you're under nothing. Otherwise, you're under a path. So, quickly. You go from the left at the most position. So, single point. And you draw a line of the wave. So, that's come from the drawing area. And then you just have to use the axis scale invert to get the value from the coordinate. You display it. And that's all. So, you've created a component that is completely custom. And that interacts with your chart. And you can reuse it into any other kind of charts you build with us. Thank you very much for your attention. Thank you. Most of the time, people don't know, but there is an option on to force them to send a feedback about talks. If you have some, please don't hesitate. Otherwise, there are my contacts for later. Are there any questions? We have a few minutes for some questions. Yes. You mean rendering a custom? Can we use a render props to render custom sub-elements? But the issue is, for example, with SVG, you end up with the order of your components, impact which one is overflowing which one. And so the question is, where do you render this element? So, for example, this line, you can imagine that you put it on top of a line chart and below the mark plot of the line chart. And so you need to get access to the GSX level. How do you go from simple mode to complex? You can go from one component and if you need more advanced stuff, you can compose. There is a single component for all the basic charts, so line, bar, pie, and the scatter. And if you want, for example, to compose a bar chart with a line chart, you need to recreate it. So we provide all the basic stuff. So basically, if you open in GitHub, the line chart.tsx, you will see a chart container, different plotting, the access, and basically that's also you get five or between five and 10 components to create your own one. How do you use the rest of MI as financing? It's kind of standalone. We reuse the theme mostly to be linked with, for example, the tooltip so that it gets the same color as the background of your application. But otherwise, it's SVG, so there is not that much in common. There is no button, for example. There is no select. We don't really need those user interfaces. It's more of a theming and the way components are styled, for example. You set that it follows the same developer experience or you can override the styling. How do we create a reaction between what you have in the data pool? How is the performance? Have you checked how it behaves? Because how we have to prevent this is with reaction between real reaction and real data and real data. Have you seen how this affects whether we have a lot of points? No, we did not try mostly because we are currently using SVG. And so we know that there is at least a wall that is waiting for us at a certain level just the time to render the SVG. So we did not care that much. It's part of the next year roadmap. Thank you all for close for ads on.
Building your own JavaScript runtime with Rust
So our next speaker is Leo, who is a developer at Dino, and he's going to talk about how to create a JavaScript runtime with Rust. Big round of applause for Leo. Hello, I'm Leo. As I was just introduced, I work at Dino, and I do various Rust. At Dino we do a lot of Rust, and we create a JavaScript runtime, but we want other people to be able to use it as well and make their own stuff with it. So we will explain internals and how you can make a small JavaScript runtime by yourself. But first, what does Dino? Many people still don't know, so better explain. It's a JavaScript runtime similar to Node, maybe similar to Bonn if you've heard of Bonn, it focuses on security, web compatibility, typescript out of the box support, and just a lot of built-in tools like a formator, lint, dock generation. We also have compiling for single executable, and a bunch of other tools. We are also not 100% fully Node compatible, but we're getting closer and closer by the day, and it's getting quite well. And what matters to this presentation is the modular code base. We have a lot of building blocks that can be used individually to build your own JavaScript runtime just with these Rust crates or Rust libraries to make your own one. Without too much effort, actually, we simplified this a lot. Yes, so first off we need to explain the internal structure of Dino, which is everything is built on Dino Core. Dino Core is a layer above V8, which is the JavaScript engine that powers all the Chrome and Vowsers. And Dino Core is just a small wrapper around it that simplifies a lot of the utilities around it and makes it a bit more friendly to use. It's not always easy to directly use V8 by itself. And on top of that, we have various other functionality that's built on top of that. That's extensions. Extensions are individual libraries that can be used by themselves to implement individual APIs and functionality. For example, a specific web API, like let's say fetch or a fetch of variations on individual extension. We have HTTP server, KV, root loads. Basically everything is individual building blocks that can be not copied and pasted, but imported and just used without too much hassle. Like usually to add an extension is like three lines of code and then suddenly you have a massive amount of more APIs that you can just use. Then we have Dino Runtime, which is a library that is built on top of a bunch of extensions that adds a bit more capability to it, including permission system, which relates back to us being a secure runtime. We have various permission-based functionality and flags. And also another additional feature would be the fact that we do some definitions of various global scopes and the Dino namespace itself. And also web workers are only implemented in the Dino Runtime grade because it's just not possible to have it as an extension just because it needs to interrupt with the extensions themselves. And then we have the CLI, which is what we compile and what people use. And that's not great. Yeah. And CLI includes the TypeScript support, a bunch of other, like all the tools of the CLI, like the lint, the formator, et cetera. And also then we have the compile supplement, as I mentioned before, that has a compiler single executable and testing infrastructure, benchmarking infrastructure, and dock generation. We have a fully static HML dock generator that you can just use and will always give a relatively clean output. But what will we build today? We will build a JavaScript runtime that can compile TypeScript, has a functionality to make a HTTP request, a console.log, some files to migrations like read and write, and deleting a file, I think, as well. And it's all in less than 20, 30 lines of Rust and JavaScript. We will connect Rangias just because this will be a relatively technical topic, so there's going to be a lot of code. So you want. First let's explain extensions more in depth. Extensions have various fields and options that can be set. Arps, which I will explain in a moment, basically the Clif Rust functions that can be used in JavaScript, so you can just write a Rust function and that will then be callable out of JavaScript. ESM is the ES module, so you can use ES modules, import static imports, and dynamic imports. Work as well, I believe. Maybe not. JavaScript files are just scripts, so not ESM. To include it, all works differently under the hood, so we have these two separate options. And then depth is declarations of other extensions. This extension depends on. This is not necessarily needed. It's more of a safety harness. Just it makes sure that you actually initiated the extensions in the right order so you don't actually forget to initiate an extension that another extension relies on and then everything floats and then you don't know what's happening. And there's some other relevant, less relevant options like config. JS, as I mentioned, above is ravelly use nowadays. And then lazy loaded ESM state, and I'm not going to go into depth. It's just config lets you configure some options to a specific extension. If you want to have some special state, you can use the state option. lazy loaded ESM lets you lazy load extension code, but that's nothing that we're going to go into depth for you to look at this in this talk. And then ops. So ops are these functions that you can declare in Rust that then are used in JavaScript. You can just call it like a normal function in JavaScript. And it uses this up to macro. I hope that's not too problematic of an explanation that what a macro is. I hope everyone knows here. Not your problem. And then basically you define arguments and return types with these special macro attributes like the string or this string. And basically it infers then the right type to map it from JavaScript to Rust. And vice versa, depending on the attributes. And yeah, you just write a normal Rust function like for example in that we just use tokyo which is tokyo is the async executor that we use in Dino and most of the Rust ecosystem uses it and then we just read the string, we read the content of a file of the path specified and we just return it. And we return everything in ops as a result which is either an error or an acceptable value because you might want to throw an error for example and that just handles it under the hood tool. So you just return an error. There's other various types that can be specified in ops. We have some more ambiguous types like V8 value which is just a generic JavaScript value. You can pass that in, you can manually match and do some more specific handling if you need some weird function that does based on different types, something which usually we try to avoid. Rather have separate functions that do more specific things. But then we also have Boolean, I supported numbers, strings, as there, array buffers are supported as well. And yeah, you can return and accept array buffers and it all handles under hood without issue. It's all been simplified as much as possible to make it as user friendly or developer friendly as possible. So it's really easy to just create your own functionality without too much difficulty. There is also this async is defined up top. It makes sure that the function is actually async and that it does need to use async functionality when you define something as async. And if you don't do anything async, it will usually then error out during compile time. So because async, it's just more complication under the hood that makes it less performant to some degree. Then here comes the code. So for this example, we're gonna have to find a few ops or cross function clarifications that make the code from JavaScript. And we have read file, write file, fetch, set time out, and remove file. So in the read file, as we just saw, we read the file from the path given and return that with write file. We can get, we specify a path and the content both as a string and write that to file to disk. And we return nothing as per the empty type. And then the fetch one, which might be the most interesting out of all of these, is basically uses request, which is a rust grade for doing HTTP requests. That I guess if we wanna compare to something in the JavaScript ecosystem would be similar to Axios, I think. It's very similar, maybe not similar in API, but similar in functionality and simplicity. And yeah, we may just do a fetch request and get the content of the body via the text method and then just return the content. And then we have a set time out, which just puts the current thread to sleep. And for the specified duration, it's passed by the user via this function. And remove file, just remove file, this is given in the path. However, we use a whole system called v8snapshots and it's gonna be part, I apologize because it's a very complex topic. Not many people really know what it is or even how it works. But to very simplify, it's let's take the current state of the JavaScript execution and you can store it in a file and resume it later. That's the simplest way to explain it. It's not exactly like that, but for simplicity's sake, let's stay to that. So we need a build script because we first need to do some setup. So first we initialize our extension. We call it Vangias as we said earlier. And we have this ESM entry point, I did not mention that earlier. But basically, let's you specify the entry point that runtime will use when starting up. And we specify our files. We have this ESM option and we have this JavaScript file, which we'll see just in a second. And we have some path defined. We want to get the path of the current build script location, some more specific Rust shenanigans. But we get this path and we join it with this Vangias snapshot. It could be any path, just we need a common location where this build script outputs something that we can then retrieve during runtime. Now comes the fun part, which is this create snapshot utility function that we have made that does all the snapshotting logic under the hood, tries to simplify it as much as possible. And you have a few options, most of them can be completely ignored. The only two important, three important ones are the manifest there. We cannot infer this automatically. So we have users to always set this value to be this and micro call to the target manifest directory. The snapshot path is the variable we defined earlier for the where the output of the snapshot will be. And then we have extensions, which is the extension we created earlier above. And we just want to initiate the JavaScript code. We don't have just initiate ESM file, it's initiate ops and ESM. Here we have not defined ops because this is just doing the build script. We do not care about ops at this point in time. They will come into play in a moment. First, we also want to support TypeScript. And this is just a small snippet of the code. There's some more boilerplate that is not necessarily interesting. It's just getting the path of the file and the media type of the current file just to be sure that we actually transpile JavaScript and TypeScript to JavaScript and that file types are all correct. But for that we use this AST create, which is basically a wrapper around SWC. SWC is a Rust create a library that basically implements TypeScripting as per TypeScripts wants and needs since there's no real specification because TypeScript. But it takes some options, the specifier, which has created a path or the name of the file that we want to transpile, the source code. So the text info, we just create this structure from the code that we got earlier from this V2 string at the top from the path that was specified by this function. And some boilerplate, this media type that I just talked about. And then we just call Transfile and magically we get the Transfile TypeScript as a JavaScript. And we can just use it. And then we have to code and we just create a structure of module source, which is how it internally is represented and we just return it. And this is all in trade, I guess the best way if you're familiar with TypeScript is like an interface and we implement this trade and it has a few methods but only one method is really necessary and it is a load method, which is what this is it. There's a few more lines both above but again, that's just for media type and some smaller error handling that is not really out of too much interest for this scenario or for the simplicity. Then we have our, this is in the actual main script where we get the snapshot that we created earlier during the build script and include it into the binary itself. And then we have access to this runtime snapshot and we will use it later on. And then we have the extensions, we initiate again, but this time just with the ops that we defined earlier. And this time we don't need the yes modules because we defined them earlier and snapshot attempt so they're part of the runtime snapshot from above. And it seems I forgot a slide. I can quickly hopefully fix it. This is not well prepared and I apologize. I hope this, let's do it the easy way. This is the JavaScript file with the internals defined. And basically we input the inner core as a JavaScript as a JavaScript module that you can import. If you use the inner core that has some utility some functionalities again just like the Rust version. Just this is for interropping between the Rust and the JavaScript. And we have the structure to score into ops. Ops again is, this is an object that can be used to access the functions that we defined earlier as we see over here that I hope it's big enough actually. Can the people in the back read it? Wonderful. Is this big enough? Wait, then let's, okay. Just to quickly reiterate we have this input of the inner core and instruction to this ops object which is just down below here used to call this op read file which is the one we defined in the Rust file earlier. And all under the hood it converts the values to the correct type matching in the Rust. And then whatever it's returned from this op read file which will be a file content we then just return it from this function that we defined in this object constant. Over here above we also have the console definition which is, uses code of print which is a utility defined in the inner core again, a few more helpful tools. And we just get all the arguments and just use this arts to match it but we then define the part which will be just stringifies and joins all the values. We don't need anything too complex for this example and then just prints it to the console. And then we have the same forever which just sets at the end the true value which is for if it's an error or not. So above it's false and then below it's true. Then further down we have the other function definitions which are read file, write file, move file, fetch, maybe just all wrapper functions around these ops. Technically this async was not needed. So one of the side part whatever and then we have the set timeout which calls the set timeout and then calls the callback. So it's relatively identical to the web API that we know. And it's assigned this to global disk which is the global namespace and also we assign to the global disk also the console and we define a runjs function object which is the object we defined above with all these extra small functionalities. To go back to here, we defined this extension again and the runtime snapshot and then we have basically all the building blocks ready now we just need to actually use it. And for that we need the runtime. This is again a bit more complex but basically we define a function that takes a file path. The file path is the JavaScript file we want to execute with the user's code that they pass. We have some utilities in Unicode that resolves the path against the current directory and gives you a model specifier out because that's what internally it's used. The module loader is what is used to resolve a module and any imports in it from this user specified file and we have our TS module loader. This is the TypeScript transpiler that we built earlier that is just the structure that we defined but I did not show that because we've boilerplate. Startup snapshot is the snapshot that we got from earlier from the setup and then the extension we need to initiate the ops that are defined so that the Dino Core and the JavaScript file that we designed can actually access these functions and load them up. And we don't care about any of the other options and then we have the actual usage which is this load main module. The load main module, it loads the main module of the entry point. Let's say if you run Dino run test.ts it will, that would be the main module and then it will work through the entire module graph which is basically all the imports one by one on the recursively. And this is async, a lot of this operation async because ES modules are inherently async and yeah we evaluate the module so we basically run it and get if there was any output and then we want to run the event loop because there's going to be multiple pulls let's say with async functions you've got to do multiple async calls perhaps or just stuff. We have some options that are not of interest. We have Dino Core includes inspector, utilities and pump via message loop which is again not much interest at some point or another. We just await this event loop running and return value of this result that we were calling earlier so we just then get out of this run.js function we get the result which hopefully will be okay and there's not going to be any errors but there might always be some error. A user might have to find incorrect variable names or have invalid syntax or something like that. And then we can do a small demo where we, I hope this is going to be big enough again. That's definitely not. We have this example.js file. Here we just call the set timeout that we defined earlier in the global scope and then just come out and this can then just be, kind of make this bigger. No. So life demos never go perfectly well but hopefully this should be working. So we then just do congo run and we want to specify this input file which we called examples.js. And hopefully this will work. It first needs to compile and yep it prints the weight and then the hello world that we call here. Now this is just a set timeout that's not as interesting as for example fetch. So I mean we could just console log the fetch output so it would be run.js because we defined this global variable earlier as run.js in this run.js down here and then we want to call fetch. I think we could fetch HTTP example.com and since this is async we want to await it and again let's run this and hopefully we'll get an unreadable wall of text of HTML output from example.com. It's usually not that long and yes we did a fetch request to a remote server. And we had the file system operations so I could just call await runjs.readfile and let's read for example this file itself. And then my terminal quickly. And hopefully it should just print the same output because we're reading self file yep and it reads and then the deleting and writing of files will work as well. We're not going to go too depth into that. It's relatively self explanatory and yeah that's pretty much it. I know I went a bit fast. I hope people don't have questions. There's a QR code for the actual repository where we have this so if people are interested to check it out but also we always are trying to improve the ecosystem and common problems of the JavaScript ecosystem and we actually have had problems with the dependency ecosystem of JavaScript and NPM and we decided that someone needs to solve this and as such we also created a new general purpose JavaScript registry that will work in any runtime. This has been announced a few days ago by Ryan Moindepf and you can join the waitlist at the QR code or the URL. That's it. Are there any questions? Time for one or two questions. Yeah. Let's see that I hate this. Inside the Docker container. I have this input queue of jobs where I send the script that I want to run and then just execute it and the output is from it. Is there any downside as long as I am only sending one single script that it needs to execute? No. I don't see any issue with that whatsoever. It should just work. Again, I'm not too familiar with Docker though but that seems like a relatively normal thing to do. Any other questions? What have been your biggest challenges in writing this run project? This project has been going on since it was announced in 2018 and we have written our internals many times. For example, extensions were called other things multiple times in the past. We renamed and restructured not entire structure of the code base but it was just multiple rewrites just to be able to have more capability but also performance wise improvements. Overall, it has been a challenge but it was something we could always figure out. Rust itself has never been an issue. It's always been relatively good to use. It's not perfect. No programming language is perfect but previously Dino was initially started as a Go project but we switched quickly to Rust for performance benefits as well. I hope that answers the question. Anything else? Yes? Yes? Yes? Is it? On this one? Yes. Okay. This could technically have been just accepting U64 directly. This should actually have been U64 directly and just been passed and not casted but that was probably just some oversight while writing this code. We casted it because it was from Melissa, except only U64 but this is just oversight. I have one more question. Yes? How does the performance on the custom run times or extensions compare to the foreign functions in the past? I'm not too familiar with FFI but we have optimized both FFI and these extensions a lot more but extensions inherently are going to be more performance because it's not a foreign function. These ops, I guess if you really look at them foreign functions, since it's calling Rust functions out of JavaScript, there is some plumbing but these have been optimized so much over multiple years that I would say like sync functions are basically maybe not no cost but close to no cost. Sync functions have overhead due to...
MessageFormat: The future of i18n on the web
So our next speaker is a very good friend of mine. I call him the big boss because he's a co-chair of the TC39, which is the committee that manages the JavaScript specs. So he's going to talk about message format, the future of internationalization. I hope I will say it right. And big round of applause for Rojoa. Let's go. Thank you, Aiman, for the gracious introduction. Major thanks to Leo for setting up the stage for me. There's no pressure there at all. Well, I thought a lot about this. It's very anxiety inducing to follow up to a talk like that. But I realized that it's nothing to be anxious about because why not follow up that very intense technical talk with something simpler? So I try to talk to you. It's mostly propaganda. It's a program that I need to induce you to. But yeah, I hope you're ready. Everybody ready? Let's start. Okay, great. So welcome. A little bit about me before we begin because that's obviously the most important part of the presentation. I am Ujwal. You might remember me from my username on the internet. It's not easier, but it's fine. I'm from New Delhi in India. And I live in Akharunya. It's a beautiful little city. I would describe myself as well. Zellet may be a strong word, but trust me, I care a lot about open source software and believe in the web, which is what I'm here to sort of make you fanatisized about as well. But I suppose, given that this is the JavaScript room, you all share a lot of these ideas as well. I love dogs and massage video games that hurt me psychologically. And I work at Agalia. So quick show of hands. How many of you know about Agalia? Wow. Only at this conference. Thank you. At Agalia, well, this is me trying to read a newspaper. Agalia is an open source consultancy. If you don't know about us, we are also a worker-owned cooperative. So that's neat. We've worked across a lot of open source projects and ecosystems, mostly around low-level software like the Linux kernel, different things in the Linux user space, and in the multimedia space and graphics and so on. You might know about some of our contributions in the web platform. A lot of the browser projects have a lot of interesting work that has been put in by my colleagues, which I'm very proud about. And lastly, but not the least, I suppose, we work in the compiler space, which is all about programming language design, right? It includes WebAssembly and JavaScript, but also stuff on LVM. And that's what I need to talk to you about, about JavaScript, about TC39, as he said, which is a very descriptive name for an organization. Do any of you know about TC39? What does it do? Wow. That is surprising to me. But TC39, long story short, designs and develops this programming language called JavaScript that we all love, I hope, to some extent. It's a complicated subject. What would you describe JavaScript? What do you think of JavaScript? Do you have a definition for JavaScript? We know what JavaScript is. I would like to describe it, hopefully, non-controversially, as a general purpose programming language, and now general purpose programming language, that is designed primarily for scripting web interfaces. Are you web developers? Are there any non-web developers that use JavaScript? No. See? Like, while JavaScript is a growing language ecosystem, it is still very heavily influenced by the web ecosystem, and it owes a lot to the web ecosystem for making it what it is, and therefore, as the primary language for making web interfaces, it needs a lot of tools that are required for web interfaces. But what is the web? I couldn't even find a logo or an image to describe the web, and I feel that the web is such a weird concept because we all know what the web is, right? But could any of you define what the web is? So I do feel that the web is hard to define or really put a finger on because it's gone through so much. I always rely on my good friend, Baudrillard, for explaining what the web is. So we have, for example, web as we know it, and then it just keeps what? And now it's everywhere. The web is in your gaming consoles, it's in your cars, in your toasters, probably. Does it need the web? Probably not, but I don't know. I don't trust toasters. But the point is that over the last couple of decades, the web has more or less emerged as the main platform for designing interfaces for people. This is not a controversial statement, I hope. But what is the web platform that I'm talking about? So to define the web platform, I would say that it's an interactive, decentralized communication platform at the scale, but that makes no sense. So the web is a standard platform for making widely accessible. They're deployed at the scale of everyone and rich user interfaces. We are not in the early days of the web. The interfaces that we use on the web have changed substantially. But this ambition of universality, universe, yeah, that's the, okay. Yeah, this ambition is built into the web and it's ethos, right? It is supposed to be for everyone. It is supposed to be accessible by a vast group of people. And any platform with such ambitions is not needed, but I would argue required to be accessible and internationalizable and localizable because how else do you reach the anywhere near like the target audience of the web, right? So the web has a responsibility as this platform. A quick note before we start, I've already started using these terms, but if you're unfamiliar in a nationalization, basically is the process of making an interface in a way that it can be localized into various different languages, cultures to suit your users. And as I mentioned, localization. Localization is the specific act of, you know, modifying your interface to suit a target audience. This is just a primer to, because I'm going to use this a lot. I hope that was obvious. But let's talk about the early web, the early interfaces that started thinking about internationalization. Because as much as I would like to talk about how the web has revolutionized internationalization, it doesn't start there. The story starts in the beginning because user interfaces were everywhere. Like people who were writing text based video games and C were also very keen on things like internationalization. UIs are composed of string content. These content, like these strings are what we are referring to messages. So when I say formatting a message, that is what I mean. Manual localization is a way to do this. I'm sure everybody is familiar with Wikipedia. It's kind of possible to have a website work in that way. But it was quickly proving to be unmanageable. Not only because it's hard to do it manually, it is. But also that it represents a very slim vision of what it means, what true internationalization means. Changing from a language to another doesn't actually mean internationalization. There's so many different axes within any given country, within any given language. There's so many different ways of expressing things that to reduce internationalization to merely catering to languages or locales is, well, locales can be complicated. But merely focused on languages, it's too simplistic. So the actual diversity of locales can never be catered to using this simple approach. Imagine having a different website, having a different version of your website for every currency that you support. And basically to promote a better, cleaner, more modular approach for building interfaces that can be localized, C-Hackers first came up with Get Text. Have any of you ever used Get Text? It's everywhere, right? In all of your operating systems, it's actually part of Lib C, so basically everywhere. But let's talk about Get Text. What was it? And I know that we are in the JavaScript room, but bear with me. It all makes sense, hopefully. Get Text was one of the two main internationalization systems that the early hackers cooked up. The other was CAD-GETs, but the presentation can only be so long. So let's not get into that. But it was the dominant system over time. CAD-GETs, as you might know, now not used anymore. And it was not standardized. We're going to be talking a lot about standards here and how they enable people to build stuff in a way that is reliable across boundaries. But Get Text was never standardized. It was, however, standardized in a de-factor way by programmers using it across their tooling and basically through documentation and education and so on. But its adoption by Sun and then GNU in like G-Lib C basically made it so popular that it was standardized through popularization, for instance. It was not as powerful as the internationalization systems we would use today. It mainly dealt with very static strings, and you could replace them with different strings and different languages. So already it's not perfect, right? But it was good for the time. It was what we had. It was better than nothing. But what it did was it went on to inspire an entire generation of applications, of interfaces that were not only utilizing internationalization but were built with internationalization in mind. And one topic that I'm going to, this is one topic that I'm going to keep returning to throughout the presentation. Giving power to your users is by far the easiest way, in my opinion, to, well, people do wacky things. That's the web for you. But it's the most interesting way, in my opinion, to allow people to innovate and come up with completely new paradigms that were unimaginable. And we'll see how Get Text inspired these things in its own right. So there's Python's Get Text. It's basically Get Text, but in Python, Python likes to keep things simple, I guess, in some way. Java introduced, however, for the first time, the concept of message format. So the context here, like you might also already remember, well, it's in the same slide. And the micro systems already big in this idea of interfaces in internationalization, of basically deploying their products and their users products across the market, so to say, was very keen on internationalization. And therefore, they basically, through acquisition and innovation, created the first sort of beginnings of message format in Java. At the time, to be this like, it's so funny to think about it. Java was, to that generation, what the web is to us. Java was a cross-platform way of building interfaces that can be then deployed to a massive audience. Minecraft, I guess, if it's the bill. ICU then, however, picked up message format. ICU was also formed by these organizations working closely together, but it was a much bigger effort. It was standardized, and it was developed in a way that was general enough for everyone to use and integrate in their apps. So ICU creates message format. It was originally a very close sort of copy, is maybe a bad word, but a very close imitation of the Java message format. But in its own right, it added more features and more power. And we can see how that affects things very soon. But here's a quick and dirty example of how that works. So you have this message, and it has an expression inside of it, right? And this expression is basically selecting for a number. So if there are no files, well, if there are zero files, you can say there are no files. And there's one file, and then anything before above one is, you know, there are X files. But this needs to be, this is just one message in one language. It needs to be translated into various languages. But finally, we had a format to express all of this, right? If you're writing in Arabic, for example, you need a lot more. In Japanese, you need just one. But, yeah, so if you think about this, ICU's message format not only subsumed the original Java message format that it built out of, but it added more, and it experimented in its own sample space. But some important things was that ICU message format, because it was separated from Java, it was able to rethink some of the design details about some of the processes that happened. And it was designed for the first time almost with massive feedback from implementers and translators at the forefront. Translators being a key word here, because I think it's not very controversial when I say, and it's kind of obvious from the documentation even to this day, that tools like GetText were primarily designed keeping programmers in mind. And finally, we have a lot of organizations who are very invested in translating their interfaces and their translators being part of this process of designing this format. So things were really starting to pick up, at least I would feel so. What it ended up with is a much better or, well, a much more powerful syntax, and that, as I mentioned previously, just opens up the space for innovation as it did. So here's a quick example. It looks huge, but please bear with me. But basically, here we have a simple message. Whoops. Hello? Is, okay. But yeah, you're, one of the things, okay, I'm going to go back. But one thing that I didn't focus on much was that ICU message format at the time allowed nesting. So messages could be nested, and this, as you might see here, opens you up to new use cases that you didn't think of before. So here we have like a nested select statement of sorts. And what we're doing is we're setting for the pronouns of the host who's hosting a party and the number of guests that we have to display a simple message. Well, simple at the end of it, but that's the amount of work that goes into making sure that you have a message that works for all the combinations. Right? And this was great. People were now doing more powerful things than they ever did. But what was going on with the web? So talking about the good old days of the web, whatever that means. The early days of the web were very dominated by either absolutely static documents or very static crowd apps, basically. What does that mean? It's basically about some simple operations. If you think about it early internet applications, so to say, even Twitter, for example, were built as very simple crowd apps. There were very few operations that you could do there, right? And if you compare that with anything we ever do now, it is a completely different landscape. I mean, even just Twitter and how it works now. So the early websites, because they didn't have a very powerful medium on the front end, they were like almost entirely on the server run times for their content, for whatever kind of dynamic things that they did. And guess what? For internationalization. So while the early web was not as powerful as it is now, people were still, there was a lot of appetite for internationalization and people did what they could. So they used Java's message format using Java text. Java.tex has a number of other APIs as well. PHP had an Intel object, if you think the JavaScript was the first to do internationalization. That is not the case. And Ruby on Rails also had this IETN. I don't know why I keep mentioning this. It's just one that is closest to my heart, I feel, because that's what introduced me personally to internationalization, how it's important, and how it can really bring life, spell life into your interfaces. And this allowed the early web developers to tinker with things like message formatting and more complicated internationalization use cases. So there was a lot of tinkering happening, but basically with the combination of this message formatting style that we talked about, the popularization of templating languages. You have to remember this is the time when templating languages were really taking off and HTTP content negotiation, which meant that the client would tell basically what languages the client accepts to the server and then the server could localize based on that. We reached not a great point, but a very important point for internationalization on the web because now things are starting to get serious. So we have a platform for building interfaces. It is being used by a vast community of people who are building websites, and they are utilizing internationalization techniques to make their content more accessible to a wide range of users. What about the Mpesky.js developers, though? What are they up to? Up to no good, I hope. But no, jokes apart, JavaScript also had its own parallel development during this time. But basically, for reasons that are not worth getting into in this presentation, JavaScript remained mostly not very popular. I mean, for a long time, people were hesitant to really do a lot in JavaScript, and it was not their fault. JavaScript was not as powerful as it is now, but things, again, on the theme of somebody giving power to a bunch of users and them taking it across the board. jQuery released in 2006. So that's almost two decades now. And how far have we come from there? It just, it sparked a whole sort of space of people experimenting and building more and more interactive web pages, basically. Fast forward to some years, we have React and the age of SPAs is dawning on us. Some of the most annoying websites kind of were created during this time, but all in sort of good time, and it all led up to something important, which is that now we have a very dynamic, a very expressive web that people interact with in very different ways than they did with the early web. Some of the interactions that we have with websites today are just, they probably weren't something the early designers of the web had any sort of idea they were enabling, maybe for good reason. But TC39, which, as we mentioned, is the standards body that is tasked with designing the language recognized at the time that internationalization is a growing concern for us, and we need to enable JavaScript developers to make the best use of this area and the techniques with it. They formalized a task group, task group two in this case, to work on internationalization. So a lot of internationalization features were developed and deployed to the web. As of this day, you may know about the Intel object in JavaScript and the various formatters and other things that it has. But basically modern JavaScript interfaces use these like in line, well, not in line entirely in the context of interfaces, but basically they use these APIs to internationalize, to localize for that context on the client and then sprinkle them across the interface, but it's not perfect, or is it? Well, the state of internationalization on the web is that outside of what was being done in JavaScript, most of the work and effort in terms of internationalization was directed at very specific things, like supporting more writing systems, which is great. We need to support all the writing systems, but it was very limited in its scope and ambition. On the other hand, internationalization grew and have now so many different features just to talk about them formatters. So you can format all these different types of information like numbers, including currencies and different kinds of numbers in different formats. You have dates and times that can be formatted in various different ways in so many things. You have collation, segmentation of text, which is something that is now supported by the interl object, and then there is some selection. So there's not a whole lot of selectors that are available to this date, but we have plural and ordinal selection. So if you have a number, you can basically select for the plural rules of any given language, which was cool, which has a lot of the building blocks that we need, but we're still missing important pieces. So talking about the timeline of where we end up with message format 2, the JavaScript group that was designing the internationalization API essentially realized they needed something of the sort. It was discussed over the course of many years, as you can see. It was iterated upon, but there was a general agreement among this group that message formatting on the way. The web is something that is both needed and necessary step to enable further use cases. Do you remember this slide? Like, this is not a great format, but that's what we had. So message format and the syntax and all the details of it were standardized 20 years ago at this point, and it is not sufficient for the modern dynamic interfaces that we built that JavaScript has enabled now. So why do we need a new thing? Well, first of all, as I just mentioned, we have so many more ways of interacting with our interfaces that just require new tooling, rethinking how the fundamentals work and the sort of outdated, outmoded tools don't exactly fit the bill. They are too imperative. They are too static to actually support our new use cases. There's also a sad lack of modularity and extensibility in the existing message format a bit more on that later. And because it's a standard that is designed for being accessible to basically everyone who uses ICU, which by the way is a great thing, the unfortunate side effect of that is that you can't really deprecate anything. You can't just clean it up and move on without making a breaking change that would annoy every user. So there needed to be kind of like there was Python 3, I don't know, that's controversial, right? But we needed a heartbreak, a message format 2 that would be designed from the ground up to deal with some of these things. And the diversity of locales that we know now makes it basically impossible to map any localization structures one to one. A great example of that is one that we just used in the past, like the number, the plural selection example. Do you remember that? Yeah. English has two plural rules or modes, which is that either things are singular or plural. Zero is plural, for example. And that's simple and that works for us. Well, Welsh, a language that is not so far, you would assume, has five. Arabic has six, I think. I'm sorry if I'm... But you get the point, right? Japanese has one. When it comes to any of these things, you cannot map messages one to one. You need something that is more expressive than that. And basically, like the design constraints of the old API made it very limited for the modern JavaScript ecosystem that we have. So not only do we need to take everything that the original message format did right, do it better, if possible. We need to also accommodate a lot of the innovation that was now happening outside of the standard space, right? In proprietary tooling or elsewhere. So we needed to really get our act together and that was message format two. So I am going to start on a quick and clumsy, as I said, intro of message format two. But are you ready? Okay. Context setting done? Yeah. So for context... Well, okay, more context. But a dynamic message string is the string that is not just a static string, but as we perceive and use strings on the modern web, they can change. They can morph around each other. It's ridiculous, honestly. But the goal of message format two is to enable a lot of these complex use cases that we haven't covered now while keeping the basics simple. Because at the same time, a very important goal is to lower the bar to make it easier for anyone who at the moment, and I'm not saying incorrectly, that they rightly consider message formatting to be a complex thing to integrate. But lowering the bar would allow more people to experiment with it and to basically think of new solutions for their new innovative problems. So, for example, it's text mode first, which means that for very simple messages, it's very simple to get into it. At the same time, it allows a lot of complex messages and expressions, which we'll get to. It also allows you to have declarations and annotations. So basically, we're getting dangerously into programming territory here. You can have variables and functions being called. And it's great because you have this great deal of expressivity that you didn't have before, right? And finally, talking of functions, there is the idea of extensibility and modularity, right? So now we have the concept of a function registry, which you can add to and a bunch of built-in functions that you can easily utilize for common use cases. So yeah, quickly talking about the kind of messages that we have, we have simple messages, which is, well, a message. We can have expressions where you can interpolate variables kind of like you might be familiar with in JavaScript, but yeah, different. Same but different. And then we have complex messages like selectors here where you can actually match for a particular value, kind of like a switch case statement. So this is the various kind of messages we have. Then as I mentioned, you can call functions. So in this case, we're calling a formatter function to format a date, and then it's part of a larger message. There's support for markup elements. So if you have some messages that have markup built into them, simple example would be text sort of base markup elements like bold and italic. More complicated things could be like this. And then there's support for declarations. So you can have variables and play around with them. But there's more. There is an extensible function registry apparently, which is a great thing. You can have your own functions. You can have private use annotations as well. And there's support for popular built-in formatters and selectors, kind of like we did already in the JavaScript API, but specifically built into message format. So this means date and time, possibly durations. It's kind of not settled yet. There's support for formatting numbers and integers and selecting for them and matching them. And plural and ordinal rules are on the table and possibly lists, formatting complex heterogeneous lists of objects. But yeah, as I mentioned, this is still something that is up in the air. This is one of the final pieces of the puzzle that we are yet to completely sort of put in the puzzle, I guess. How do you do that? But the point is that this needs to be settled. This needs feedback. This needs more data than we already have because we know what is generally useful. But what mostly helps is getting actual data from people who use these things or might want to use these things, but things that matter to them. So that was part of the shtick. But basically, do you remember that slide? It's still here. This is how it looks now. So you can have a more complex matcher. And now it's matching for both the guest count and the pronouns of the host. And yeah, you have an expression. It's basically achieving the same thing, but with a cleaner syntax and hopefully something more manageable than what we had before. Why does any of this matter, you may ask? Where did I start with? I kind of lost context. But as we talked in the beginning, UI design has evolved a lot since the beginning of UIs. And the web platform was developed for something but ended up becoming essentially the most reliable standardized way for deploying user interfaces. But a lot has happened since then. For example, the UI space has done a lot of innovation around internationalization about making more localizable UIs in a cleaner way, in a way that helps programmers and translators and everyone else in essentially message formatting. On the other hand, the web platform has evolved substantially with JavaScript. Things are very different from what they were. To bridge this gap, however, to fill the final piece of the puzzle, as I mentioned before, we need Intel message format. Because we have developed a lot in the web and interfaces are more complex and more dynamic than ever. At the same time, we have better tooling in every way when it concerns internationalization. But these two spaces have not yet benefited completely from the innovation in each other. So this is the idea. Not only is message format to built on top of a lot of the innovation that has happened within JavaScript, JavaScript is now sort of importing years of work that has gone into the internationalization space. So what is the message format from internationalization? That's where we were supposedly starting but it got lost somehow. After talking about it in Unicode and sort of coming to this format, we finally brought Intel message format back to the committee and it is now a proper proposal. So it's at stage one and hopefully it would reach stage two soon and could get deployed to various browsing engines and non-browser engines. But it is built on top of the things that we know and that we have discovered around familiar patterns that internationalization built-ins use. For instance, format two parts is this whole thing that we do in internationalization in JavaScript to allow people to essentially have more control over their formatters, which is not really a concern outside, right? But that is a major design point for the proposal, among other things. This is how it looks like. So you have Intel constructor like any other. Instead of what we do with most Intel constructors where we have the message at the beginning, sorry, we have the locale at the beginning. Here we have the message and then the locale and the options follow. And yeah, it works. This is very simple, but you get the point. But I hope that this was convincing enough for you to feel that this is something important happening here. And if it is, then how can you get involved? Well, one thing I mentioned is the actual message format two syntax and data model and everything that is being standardized under Unicode. You can go to this repo and read all the issues, see what's been done and give your feedback there. Help us out in any way you'd like. And then there is the JavaScript proposal. It's early stage. It needs a lot of work. It needs a lot of feedback and it needs people to be motivated about it, to write tests, to help us figure out the spec, to help us figure out the design details. And we'd really appreciate your help. You can find us or me on GitHub and Matrix and start from there. I'd be more than happy to guide you with this. And that's all.
Cryptography against AI: Deepfake resistant WebRTC videocalls
All right. Hi, everyone. Today I wanted to talk about this subject that you can see on the screen. And yes, as you guessed, it's a click-back title. So sorry about that. But let me first talk to you about the origin story. So a few weeks, a few months ago, I had a chat with former coworkers, and we had to use JITC to talk. I don't know if any of you know JITC. Yeah, all right. So it's our WebRTC chat, and there's been an update a few weeks ago. I don't know the exact time frame because I'm not a JITC contributor, but basically you had to log on to create a room. And then I thought, yeah, but how do I know that the people that create the room are actually the people that pretend to be? Since, you know, right now there's a lot of AI going on, and you know, you cannot trust people on the Internet. And as you know, just our neighbor in the next room, so the RTC room are talking about Skynet and AI and all of this shit. So, I mean, we have the right to be afraid, right? So, I thought, okay, how easy would it be to create my own chat platform? So, I would want to have something that has OJO as well as video, and I don't want to have huge server requirements because I don't want to spend money on it. I also don't want to store, let's say, username and password because, I mean, it's 2024. Who uses password? I'm all right. But we still want to have a way to, you know, identify the people. So, what can we do? Just, yeah, just a few, yeah. Just a quick recap. I know the RTC room is next, so if you're not familiar with it, just go make a tour. So, I want to explain the whole principle of real-time communication, but just so you know, I take Dwayne Johnson as a reference. So, this would be our signaling server. So, what he does, he brings people together. So, you have all these people that are joined to Dwayne Johnson. So, you can, well, it's a central server. And what he does is then he stands back, and then the people just are connected together. So, you know, the first one is connected to these three others, the other one to the others, and like so. And you know you can trust Dwayne Johnson. Well, I mean the signal server because he has a TLS certificate. So, it's nice, but can you trust the peers? I mean, you don't know who they are. How does peer work? It's actually a bunch of information about the people. So, actually you would generate an object called an RTC peer connection object, which will contain information about your connection that you will stream to the other peers. So, yeah, can we trust the peers? And I thought, yeah, let's use a technique that is as old as computer science. Let's just use cryptographic keys because that's, I mean, it's easy. And for that, I decided to use the subtle crypto API, which is available both on Node.js and in the web. So, it's really great. And why did I choose keys? Because digital signatures are everywhere and pretty much foolproof. And right now we have a lot of ways to interact with keys. So, I don't know if you guys know about Fido WebOtten. So, basically like you have like those UB keys, which are external devices that you can store keys on. You also have like on your smartphone, you can just choose to use your fingerprint maybe on iPhones you can use your face recognition, I don't know. But yeah, basically you can use different stuff. You have like the cryptocurrency wallets that are basically a set of key. So, that's nice. Or if you don't have an ECDSA key per right now, you can just use a Kudo SSH key gen. I mean, it's... And yeah, for those of you who are unfamiliar with cryptography, where have you been, but let's just say that. You have basically two keys. So, you have your private key, you have your public key, you should never share your private key with anyone, even with your mother on her deathbed. It's for you, it's yours, you don't share it. And there is your public key that you will share with everybody. And so, basically what we're going to do is that we're going to send a message in ClearTech, so the end goal here is not to change the message, but to keep it as clear as possible. And to add what we call a signature. A signature is simply something that your private key will generate, so your private key is private. And it remains private. But you include a signature that is generated from the message and added to the message. And then you add your public key, which would basically will allow anyone to decrypt the signature and to say that indeed you are the one owning the private key and that message is genuine. There's a lot of math stuff behind that. Well, we are not going into the details, but it's basically prime numbers and all this kind of stuff. Okay, so as I told you, we're going to use the Sodel Crypto API, which is available both on the browser and on Node.js. And beyond signing and verifying message, you can even generate and derive new keys. So even if you don't have all the keys we talked about earlier, you can still generate keys in your browser or in Node.js, and that could be useful for things that we'll talk afterwards. And that's what this talk is all about. Okay, so here comes the plan. So basically what we're going to do is that the system we're going to create, we will create a server, so the signal-signal server I told you about, and what it will do, so the first client we connect to the server, we will call him the host, so he hosts the room. He creates a room with a unique ID, and that's done using WebSockets. I'm sure you're all familiar with WebSockets. You can talk afterwards. No problem. It's pretty basic stuff as well. The host can send a special type of message to the room to send the public keys that he wants to accept in the room. So basically I would create a room, say I would create a room for my party, and I will add all the public keys of the people that I know, and then I will send you the link to the room, and if you want to join the room, I will ask you to sign a message to prove your identity. And basically since now you can use your phone to confirm your identity, for instance with your footprint, that's pretty much how I can assume your nut and AI trying to ruin my birthday party or whatever, so it's nice. So we do that for each peer, and it's okay. Just a quick security disclaimer though. When you land on the Sodor Crypto documentation, you get that big warning. So yeah, it's a long text, but basically it can be summed up as this project we're building is for fun. There's a lot of security considerations to take into account, and basically that is how you store key and how you should manage them. And as Adi Sharmeer says, Adi Sharmeer is one of the guys that created the RSA algorithm. Sure, it's really hard to brute force cryptography and to try to guess a ciphertext by just brute forcing, but usually the mistakes come from the system and not from the algorithm. So let's dive into the code. So basically what we're going to do. Here we're going to import a key from our browser. So we're going to use the P key, the word that's written there. To as an input key, we could use, as I said, a crypto wallet, a FIDO device or whatever, but here let's use something that should be generated from the terminal. We're using, so it's an elliptic curve signature, so it's using, once again, it's an algorithm that won't go into the details, but it's just a way to generate signature. And we specify that we want to use this key for signing because you can also encrypt, you can also derive keys that's other use you can have. Once again, we can talk about that after the presentation. There's a lot of stuff to do, but yeah, that's basically the stuff. And I'm also going to give you an overview of the idea in check I was telling you about. So basically when you connect, the server will ask you to sign this stuff, which is a JSON message, so well, this is TypeScript, whatever. So this is the payload we're getting from the server. I'm also adding a property which is called issued at, so that ensures that you have to sign the message within a certain time frame, otherwise you could have what we call replay attacks where people just trying to send a signature until it works. So this allows us to mitigate some effects. And so that is what the server hands you over, and this is what you would send back, so once it's signed, so you send the payload that you just saw alongside signature and your public key to verify the signature. Okay, so this is how we sign the payload. So once again, it's a big bunch of code, but it's actually quite easy to understand. So we have the key. We have the payload. The key is strictly, so we have the type definition for the script interface. So I heard the TypeScript question in the audience, so don't worry, it's fine. All the interfaces are defined. So we have the key and we have the payload. The interface I just showed you, and this function will return signed payload. So basically what we do is that we take the payload, which is a JSON object, you will stringify it, and then convert it to bytes, because everything happens in buffer and bytes. And then we will just sign it, and you know it's a simple function, just crypto.sign. We just choose the algorithm once again, ECDSA, so elliptic curve. We want to produce a 265.56 hash. We specify the key that we want to use, and we specify the buffer we want to sign. So pretty standard stuff. And we return that signed payload. So as you can see, it's really, really, really simple. I was buffered at how easy it was. And on the server side, it's pretty much the same thing. So you want to verify the payload. So pretty much the same thing as I told you earlier. The function looks pretty much the same. So we first need to transform the payload into a JSON object, but then once again we import the key. Once again, we use the JSON Web Key Format. You may be aware of the JSON Web Tokens. It's pretty much the same spec, but there is just a key. So it's a key that's been serialized as JSON. False mean we don't want to export it, and verify means it's a public key, and it's used to verify signature. Once again, we generate a signable variable, and it basically bytes to pass to the verify function. And then once again, the algorithm, the key, the signature. See, it's very, very easy, and it works. So all right, so I'm not going to give you the whole code of the single server because, well, it's not hard, not hard at all, but it's just switch statements. And I don't think it's really interesting to talk about here. Basically what we're going to do is that we have two types of, we have several types of message we send to the socket. I told you about the one that the host sent at the beginning, so just to white list a bunch of public keys. And then here I'm going to show you two. The first one is the one I call request. So when a new guest want to come into my room, that's one of the functions, and the other one is the out, so that's the one we just saw. So it's basically verifying the signature. So yeah, when I receive an off message, I will call the connect peer function. So basically what it does, simple as that, first it looks if the public key is valid for the room ID. So if I indeed white listed that public key. The second one is the issue that timestamp, which looks if that timestamp is valid. So for instance, if my server sent that message two minutes ago, I deemed that it's still an acceptable time to have, to receive the payload back. If it dates back to one hour, I just teach the message because that's probably a curve trying to ruin my birthday party. And the last one is just the signature verification that we just talked about. And if it's all good, and that's how WebRTC works, we will give the peer information to all the other peers so that they can chat together. One thing that I didn't mention, but it's important to know, is once you're using WebRTC, everything is, every information is from peer to peer. So the server gets back and there's no information that are sent to the server, so it's really important to know. And that's actually quite a cool design. Last thing I want to say is that I didn't get into the details and specific of WebRTC, but for the sake of simplicity, I recommend to use the simple peer library. I'm not paid by this guy, but this guy does awesome stuff. So use the library. And I just want to go a bit deeper. So I talked about video and audio chat, but what more can we do? So since we have, I guess this is your call to go further down the peer to peer rabbit hole. So let's use that sort of crypto stuff. You can do a lot of stuff. WebRTC is really great. You can do a lot of really great stuff. And I want to finish by doing a quick shout out section. So all these people already trust the process. So I'm going to give you a quick overview of some project that I worked with far or less far. And yeah. So first of all, I want to shout out about WebTorrent. Anybody know about WebTorrent? Nice. So it's a Torrent client that's built in the browser. It uses WebRTC data channels to stream data. It's really nice. It means that you don't have to depend on a central server to actually stream data. IPFS, anybody knows about IPFS? Yeah. So great guys also. They were there last year at Fosdame. I don't think they're here this year, but it's basically a protocol that it tends to not replace HTTP, but be side by side with HTTP. Basically it works by creating a unique hash for files. So they're fingerprint files with just a shaft fingerprint. And basically if you know the hash of the file, you can just request the file and it will flow a DHT. It's a really complex system. It's really interesting. You can retrieve that file that is hosted somewhere on an IPFS node. And IPFS nodes are connected to each other using also WebSocket and WebRTC. So it's a really, really interesting project. You should have. It's very nice. Some of us are here, but only a few of us. IPFS? Yeah. Oh, nice. You're so great. I love you. And if you want to know more about IPFS, I'm not the one to ask, basically, as those guys. And lastly, that's the project I've also worked on last year. It's basically a project that allows computation on top of IPFS. Also a great project. I think they are trying to hire a front-end developer. So if you like stuff with crypto and just talk to these guys. All right. I don't know how much time we still have left, but I made this great slide for questions. So does anybody have a question? Any questions? A few minutes for questions. All right. Or else I have that slide with my credentials. So this is my GitHub. This is my telegram. She wanted to reach me. And this is my public key if you want to. Thank you.
All Things Astro
Hello everyone, so our next speaker is one of my very good friends, a BGS team member and an Astro core team member as well. His name is Elian and he's going to talk about guests. Astro. Hey, who wasn't that a surprise? Alright, let me check that I'm not in the screen. Okay, hello everyone. Hope you're doing good. I'm doing good, I'm just a little bit tired. I just flew in from Poland over Zurich because I had a conference yesterday as well. So if I sometimes struggle with words, I'm sorry I'm tired. Also I'm in Astro core. Astro is this framework. I'll talk about that in a minute. But also in the React Brussels and the BGS team, I don't know if you've ever been to our conferences. That's here in Belgium. I was actually born here in Brussels and now living in Ghent. But also those guys are the same ones that actually organized this dev room. So maybe let's give them a quick round of applause as well. Yes. And they actually both left so they have no idea. But that's good. I also do my own meetups in Ghent. So if you live in Ghent or in Belgium overall, you're always welcome at our meetups. They're free. If you want to follow me after this or want to ask some question that you didn't get time to, feel free to follow me online. It's at Alien Coats on all platforms. So that should be easy. Okay, let's address the elephants in the room. What is Astro? Who has heard of Astro? Oh, wow. That is a lot. I asked the same question yesterday. We're like three hands. Who has actually used Astro? Okay, that's also a lot. Who is on the latest release of Astro? Okay, still good, still good. And who is using Astro professionally? Nice. Okay. No, that's what I was expecting. That's fine. That's fine. Cool. Okay, so it's a personal experience probably. Okay, that's good. Okay, cool. So we call Astro the framework for content-driven development. There are a couple of reasons that we say that. And I hope that will be clear to you after the talk. See it as being a comparable framework to Next or Next. It's a meta framework as we sometimes call them. There is a lot of discussion over should we call them meta frameworks. But let's call it that for now. We can later discuss on Twitter if it's, well, on X, if it's actually called a meta framework or not. This is what it looks like. This is the Astro syntax. Basically, everything that you want to write in JavaScript or in TypeScript, we support TypeScript, goes in between the front, the top, the dashes. That's always server-side. I'll explain that a little bit later. But it's a very familiar syntax. It's basically JavaScript at the top or TypeScript if you prefer. And below it's just, JSX likes syntax. It's not really JSX. It looks like JSX. But you can use class. So it's an improved JSX. Why is it ideal for content-driven? Well, it is because it's better for SEO and for meta tags and all of that stuff. Because we ship zero kilobytes of JavaScript by default. There is a few catches with that. We were one of the first frameworks to take this approach. But by now, we're surely not the only one. And sometimes a better tool fits a better use case or there is a different tool that's totally fine. If you want to discuss that, we can totally do that after this talk. Think of your traditional framework application approach. You write something in, let's say, next JS or in next. It typically looks like this. It doesn't always, we have now, React Server components and stuff. But I'm not going to account for that. All of these components require JavaScript or TypeScript and compile to JavaScript. And that is actually really weird because there is a couple of stuff here that is completely static and doesn't need JavaScript. For instance, the footer. It's just basic A tags, whatever. The header, maybe it's just an A tag that refers to your home page or an image. Why do I need JavaScript to render an image? That doesn't make sense. So what we do with Astro is basically we compile it all down to static HTML, CSS and JavaScript if you want to. More on that later. Basically, you have to remember HTML first. So what if you need JavaScript? You probably want some interactivity, right? You probably want to add a button, a hamburger button, drop downs, all of that stuff. What if you need interactivity? Well, of course, that is possible. We have a directive for that. It's called client. And that gives you a few options on how to control interactivity and tell the compiler when and how to hydrate components. I listed a couple. There are more. But I'm going to quickly just go over these. Client only is very easy. It just skips our compiler completely and ships JavaScript, as you would in React. And client media will only hydrate a component when a given media query is met. Think of it like mobile only buttons, hamburger buttons, all of those don't require JavaScript to render on our desktop side because you don't even see them. We have client idle, which will only hydrate components when the main thread is idle, when it's doing nothing. So basically, free for your CPU. Client load will just say, hey, I need JavaScript to send it to me. Then we also have a couple others like client visible that will only hydrate when a component is actually in the viewport. That makes sense. So what we actually can do in Astro, think of this as the basic HTML page I was talking about earlier, we can ship JavaScript to just a couple of components. Maybe an image slider, we need some things there. Maybe we need some header links, whatever, that are dynamic. We can do that. Of course, we are an open source thing. So you can build your own stuff. You can put that into Astro. And of course, you all know, as developers, if you let them free, they will come up with weird shit. One of those is Astro client when it's raining in New York. This will basically, like it says in the name, it will hydrate your component, but only when it's raining in New York. Cool stuff. Ben built this. Ben of the Astro core team has an implementation to show off how it works. But it's possible. It's fun. It's cool. There is a lot of creativity to be explored here. We call that concept islands. Islands basically referring to a component that's completely isolated from your other components. But we come with one twist. We have seen the astro syntax. But the components that you want on your client side, you can actually build in other frameworks. You say, add react to my Astro website. And then you can use react components inside of your Astro website. Or you want to use view or you want to use felt or maybe both of them together. That is possible. I won't say that it's a recommended thing to do. My thing disconnected here. Okay. It's not a recommended thing to do, but it is possible. But by default, without the client hydration, if you use a react component on Astro, it will still compile down to static HTML at build time by default. That's basically what makes Astro fast. There is, of course, a lot more. What I show you now is basically only the static generation side of things. That's the default. But we have so much more. And just in 2023, that was a crazy year for us. We did a lot of stuff. We shipped three major versions. And we have reasons for that. And I'll go over them like very quickly. I'll show you what we did and how we improved the life of Astro developers. So in January, I did my first real international Astro talk at JS world. Amy, you were there, right? With Omar? Yes. We just shipped Astro too. Astro looked completely different from the Astro that it is now. What we shipped, we shipped more than just the features that I'm going to share. But basically, these are the important ones. We shipped the new CLI. RCLI? I think it's crazy. It's crazy good. It's super clear. It's really easy. We just asked you a couple of questions. And on those questions, we set up a template for you. A couple of questions are, of course, do you plan to use TypeScript? Yes. What kind of TS config do you want? Do you want strict, strictest, loose default? Whatever you call it. You can do all of that. But also, since we are so open source minded, we also have released that as a client library. Well, not a client library. CLI library on its own. That's called CLAC. That's built by Nate, one of our core members, built in a weekend. And now it's used in different projects and it's actually amazing. Cool to see that there is like a couple of different projects that came from Astro. We shipped content collections. That was actually one of the biggest ones. Content collections give you a type safe way of working with markdown, MDX and all of the other markdown flavors. Even markdoc, for instance. This is probably very familiar to you. This is Zod. And Zod is this client, well, not client. This is library that basically checks your types on Eskimo. That's what you do here. And because that's type safe, we can also error check way better that I'm going to show you in a minute. This is how it looks like. So you get all the intelligence goodies. You get all the auto completion and all of that good stuff. We added hybrid rendering. So I was speaking about, as super clear, you can instantly see what's wrong. In your blog, the astrotutorial.mdx frontmetter does not match collection scheme. You instantly know what's wrong. What file is it? Oh, its title is required in astrotutorial.mdx. You instantly know what's wrong and where it's wrong. You fix it, done. Then we launched astrotutorial.mdx3. I think that was in August, if I remember correctly. We shipped view transitions. View transitions are a super, super cool thing. Who has ever used view transitions? A couple of people, not too much. Who knows about view transitions? Okay, that's a couple more. What's the reason that you didn't use them? Yell something. Time. Okay. Okay, yes, browser support. I was going to expect that one. Yes, it's not supported by all browsers yet. But what we do with astrotutorial is we polyfill a little and then it works. At least the basics work. And what view transitions are for the ones that didn't put their hand up, it actually looks like this. So astrotutorial basically does this SSG MPA page. But they actually, with view transitions, you can make an MPA with basically all static HTML files feel like SPA with client-side navigation, even though you're not shipping that to the browser. The browser will always do this by its own. Basically, really simply explain it takes a screenshot and the screenshot of your next page and transitions in between both of them. But you can do crazy shit with that and about the demo with me. It's not built by me, but I have it with me. Can I do it like that? Okay, give me a second here. You can all see this? Okay, okay, okay. Switch page. Yes. So as I was saying, browser support is a hard thing, but you can do shit like this. So this is multi-page application. Still, when I press North, look what happens. Okay, let me fix that. I wasn't expecting that to happen actually. Will it work? Yes. Okay, now it's there. So if I go back to South page, it's basically South.html. Look what happens. All of that animation is coming from the browser. There's no client-side hydration happening here. This is insane. I don't know if you're excited as I am. Yes, some people. Okay, okay. Not too much. It's fine. It's fine. But still, it works also with like the navigation API. So if at the top, I don't know how well you know ARC, but at the top I have just the basic buttons forward, backwards. That also should work. Yes. That's amazing. Okay, okay. Now let me go back to the presentation if I can get that back here. Okay, okay. There we are. And connecting. Yes. The craziest about all of that is actually from you as an end user. Well, end users are typically the clients that use the website. I mean, as a developer that will use that feature. It's only two lines of code. It's really easy to implement and we make it so easy for you to ensure that you have the best developer experience possible. A couple of other things, of course, if you think statically, you don't have middleware, you don't have all of this edge stuff. We added that as well. And the good thing is you can always create faster responses for your users anywhere in the world, wherever they are. But those are always like the catch words with edge stuff, right? It's also a little bit of a smaller runtime. So it's a little bit more difficult than that. But you get the point. Image optimizing. Images are hard. Can be hard. Can be really hard like in the browser sometimes. What we did is actually we released a virtual module, actually, which is astroassets. And you basically just import your image, just like you would do with a component, then use it as a source and it will automatically output a optimized WebP image. But of course, a lot of people came complaining and were like, where is picture? We need picture. We brought picture. And then actually you can do formats with it. So if you want to use Aviv, because that's even faster and actually not supported in all browsers, but you have a fallback to WebP, which is supported in all browsers, then we'll take care of that for you. So it's really easy for you to define and optimize the small bits of your website that are lacking behind. That's at least how you get. Also, we did a major refactoring of our internals, the JSX internals. And because of that, we also got another 75 performance improvement, which is great. We also brought this. I don't know how many of you are familiar with fast refresh. It's amazing. If you don't see what's happening here, that's good because then you're living a good life. What actually happens is, does anyone ever like build a dialogue, for instance? You click on it, you have the dialogue goes open, then you change some text and suddenly it's gone again and you have to re go through the whole flow again. That's the problem with state. Actually, what fast refresh does for all JSX in our, in our case, it will actually remind the state. So while you're typing, the state will update and you won't like have to go through the flow all over again. So it's basically quality of life upgrade for you as a developer. Page partials. It wasn't intentionally built for it, but of course we have all the HTMLX hype. And actually, this is possible now with Astro because of page partials. You just ship one thing, no HTML tag, no head tag, no body tag, just what you wrote in HTML and that makes using HTMLX in Astro possible. Then we'll have Starlight. Who has heard of Starlight? Less people than Astro. Okay. But there were a lot of people about Astro. What is one thing that you can name about Astro that is good? Documentation. I know you were going to say that. I just said it for you. Starlight is actually a, I want to say theme slash library slash framework. It's basically a great theme for Astro. But one important thing is that it actually ships everything that we have learned from writing docs for Astro and brought that to a framework for other people. And I was actually talking backstage a little bit earlier with Nicholas and he's using Starlight at work a lot and says it's amazing. Like you have all these built-in features that take care for you like the search. You can change that with Page Search or Algolia or anything you want. Really it's very pluggable. It's really good. And of course you have all the Astro goodies. You can use React, you can use felt, you can compile everything to native languages. You can do anything you want. But then we launched Astro 4. Astro 4 is cool. Why? We have a DevTool bar now and DevTool bars are something underrated sometimes. In our case, you can see your islands. You can see where your JSX is located. You click on the file, it will open. You can see that in this case it's not hydrated or it is hydrated. What's the text? How does it work? You can see all of that just in the browser like without leaving the browser. But also we shipped accessibility tools. Accessibility is getting more and more and more important and it is. And that's why we integrated that. So basically you click on the audit tool and it will tell you oh an image alt tag is missing. Oh these are misconfigured area roles. All of that will just show you. Really easy. But also it's super pluggable. So open source first. You can just write your own DevTool bar plugin and build it. For instance we have the Astro Tailwind Config Viewer which is basically you can see your whole Tailwind configuration inside of your Astro website or inside the DevTool bar. So basically if you do this well or there is a lot of more features you can actually just build everything inside the browser and never leave it except for developing code. Then we have built incremental content caching. A question I got yesterday for instance was what if I want to use Astro with thousands of pages? Where are the paints? And there are some of course like if you want to use SSG and you're constantly pushing new files then your build pipeline will just be very slow because it's always building and it's always building all of those pages. Even though sometimes they never change. If you change one file while building all the others basically that's why what incremental content caching does. It sees one file has changed and will only change that file. That makes sense right? But with doing that just for our own documentation it's still experimental but we tested it of course. We had a performance gain just for our documentation which is like 3,000 pages of 80% gain. That's a lot. The improvement is insanely good. And then we also redesigned our documentation to Starlight. Now it looks like this. I don't know if you've ever seen the previous one. It was also good. It was also like kind of built very hacky. We didn't have internationalization support before and such. We all have that now in Starlight and such in Astro docs. It's really great doc footing for both projects. Then we announced the ecosystem fund. It's a really cool thing that I'm very proud of. Actually what we do is we have dedicated the funding that we get as in GitHub sponsors and such things like that. We dedicated a hundred thousand dollars of that to give to other open source projects that are empowering Astro users. For instance one of those that got the grant was LuciaAuth. You know if you've ever done LuciaAuth it's basically an authentication library. That's also framework agnostic. But also they enable a lot of Astro users to build cool websites with authentication. And for that they deserve an award. Well they deserve at least some money to keep working on it. For instance we also gave 10,000 dollars to a team builder. They create themes for Astro and they output like one team per month or something. But that means that a lot of users get drawn to Astro because there are so many themes. So that's really makes it work. Of course that's not all of it. This was just like basically it was a ramble of features and how it works. There is more and there is more to it. And the question I always get is but what is next? What is the next thing that we are going to ship? Well I don't know. We have an open roadmap. So basically you decide. Our users decide. We have an open GitHub repository which is just a roadmap. And you can just make an issue there. We'll comment on it. We'll discuss about it. And then we'll get into an RFC. It's accepted. And then we'll actually build a feature. And if you can help in that, that's awesome. Cool. If you want to stay updated you can go to Astro.Build which is the website. If you want to join our Discord where we are very active both in development but also in support and questions you have if you can't pose them here today. Go there. There is probably someone super eager to help you out there. And Astro.Build says chat. And we also built a newsletter like actually was launched this week or last week. And that's Astro.Build slash newsletter. Cool. Thank you. Questions or is that another thing? If there are none I did a good job. Did you try creating... Is it hydrate only when it's raining in Brussels? Yeah. Because it's always hydrates. That would just be client side. I didn't. But I should. You should. It would be easy. It's just an equals truth. Big round of applause for Elio. Thank you.
Better Bee Be Better: spot more bugs than TS with less than JS
Hello everyone. So this is a really tightly talked with a bit ambitious title about a site project of mine. I will explain a bit what site project is about, two parts of the title and explain what I plan to do with it later. So I made a toy programming language for fun, like yeah it's not really original but anyway. I was thinking of it for more than a year now and I have a lot of ideas about stuff to experiment and I start implementing a lot of parts of it but at the end I had no programming language at all. So I decided somehow to send a first-dem talk and to try to do a programming language usable by people. So here I am and I will, I publish online a version where I have a domain name and so this is something I tried to write a few past months that is mostly usable and I can explain why. So the idea is I made a really small programming language and by small I hope it's easy to learn. That helps you to catch a lot of bugs at build time, what most of programming language try to do and I make it kind of looks like JavaScript and target JavaScript because it's more easy to target JavaScript when you do a programming language that targets something else. So it's one of the motivations of it. And so maybe you know that GS is not the perfect programming language. There's a lot of flow. There is one I really hate which is shallow code by default which is you pass something to some place of your program and it's changed behind your back and you don't realize only really after about what's really happened and it's actually painful. There is really weird behavior with every operator that everyone know and talk about. It's not really big issue but still annoying when you try to use it and discover it at the last moment. And there is most painful things about there is a lot of things that should fail that does not really fail and rather return and define and you will really discover it really later in your program and there is a lot of work on JavaScript to make it a better language but the problem is you will still want JavaScript to work with all JavaScript code so there is a lot of duplication in the language like the same things two times but not exactly the same thing because we try to fix where we ever so this is really confusing I believe. And I think JavaScript is actually a good language to have another language transpiled or compiled because it can work anywhere. It's actually quite fast because there is a lot of engineer that worked into making JavaScript fast and that's true JavaScript is run fast on a lot of platform and still there are alternative run times that you have to make JavaScript easily embeddable in other program or use it for other usage on the web. And there is that thing that everyone knows about maybe which is WebAssembly which is the thing that your language should target if you want to go on the web and don't want to work write JavaScript and it's cool, it's fast, it's also cross-platform, it's even know as the fact that you don't really have to manage your memory by hand in WebAssembly because there are no GCs that let you say okay I create object and let's manage the memory like you would do in JavaScript so it's really nice to have a language that built to WebAssembly but there is still few reasons to prefer to target JavaScript than WebAssembly. The one is WebAssembly doesn't work on older browser, it's really nice to have a language when you can do all the hack of the earth because it's really expressive and you try to do a type project programming language, really nice to have all this expressive power. You can get more easy to work with any existing API which is something which is really painful to do in WebAssembly if you already tried WebAssembly you will be aware that it's really hard to interpret WebAssembly code with any other API because you have to do JavaScript glue code and this glue code is not really safe, it's really hard to maintain and this kind of stuff. Having a target language that is familiar and when you can run it into a step by step debugger if you find bugs in your compiler or I mean mine type project and also you can build a term JavaScript to see with something called Quick.js which is actually really cool and so if you have JavaScript you can have native target for free, native backend for free. Maybe you were aware there is a lot of language that already tried to target JavaScript, there's a few of them named here. The one in bold are actually functional language and the one in italic are the ones that try to kind of look like JavaScript and the only one in the list that's really used by a lot of people is TypeScript because that's the one language that accepts the ideas that most of people want to write JavaScript or keep using working on JavaScript code base so I don't know a lot of people that's doing per script for example that look like Askel or Elms that look like to Askel too and that don't look like at all like JavaScript. So what does that mean to make something smaller than JavaScript? It means that I have for example one kind of function which all functions are lambda and an image function, the arrow function. I have one kind of string in the literal not three and the ones that one support string interpolation like you do in JavaScript but all the strings support it so it's more easy. So all this kind of detail like you just take the minimum amount of feature you can implement and try to make a smaller JavaScript and there is kind of the thing that's normally a rather functional language but I kind of implement which is everything return a value so every expression you can put it into another expression because everything should return a value so for example I can assign the result of an if-else branching because the value of an if-else branching is a branch that was executed so you can just say that which makes sometimes the code more compact and easy and something you can do easily in JavaScript because you have to define variables with a default value, reassign the variable in the branch and scan stuff which is painful and I don't have an undefined and just have an empty object which is empty value like when you do something that does not return a value it returns an empty value which is the empty object or record anyway. I have some integration to make it easy to embed into JavaScript and this thing that is that sorry since every file is kind of an expression and every expression is a value every expression is convert to the equivalent JavaScript value so every expression could be compiled and using JavaScript for example I have a function which is compiled to the equivalent JavaScript function and those equivalent things you would expect the function to do and it's kind of expected and I want to speak a bit about some things that's often when you use a lot of programming language we take we start to feel like a lot of behavior are normal and logical but if we really think about it those behavior are not logical at all and for example there are so many different scope rules in JavaScript and many language that does at some point there's too many scope rules it's really counterintuitive for example this is legal in JavaScript you can use a function which is declared later in your program because function are top-level things that can be used anywhere in your programs or order don't count but if you do a lambda if you do an anonymous function it's blockscop so you can't use something before it's defined logical it's not the same mechanism but still we are two mechanisms it's a bit counterintuitive and if we are in object declaration of course you can't use an item of an object within an object because you have to wait for the object to be created fully and you can't call the object within the object because you have to wait the object to be to be declared you can do that it's obviously is legal it seems logical but if we look at function we can use function inside function of course and we can even use value that are declared after function even if function capture the definition site because we set the semantic of JavaScript because JavaScript without harvesting which is function are actually declared in top of your block but only assigned later so and it's also true for const on let keyword so this code will actually work and display a lot of emoji and at some point do a stack overflow of course but this is really counterintuitive that you can you can access a variable here define here it's working function it looks so a lot of different variable so I come with the idea of doing something really more simple which is let's merge the two concept of object and block and say just one thing which is a block could have side effect a block could have value redefined but a block will return the object of the element within it so we just have one kind of brackets that do the two things in one and are one scope rule which is always the same and you don't have to remind four five scoping different scoping rules there is that rules that works for everything it's more smaller language like you just have one block to have instruction and that is also the way to define objects so that kind of one main idea and so other idea it's be is pure and pure it's the world that is so much used for things so many different things that I will be to explain what I mean by that in this context of this talk I mean that when you define a variable and call something else you expect the variable to not have changed because you expect not to have shared mutability like the you don't expect to functions that implicitly change a value of your program you what you expect is to have kind of a guarantee with a value of what it it in B of having whatever you do with the value the value and change and this kind of invariant is kind of things you you you know you will always have it's really cool because it's a lot of stuff I will show later in the presentation but kind of kind of nice to know that to know to whatever I do if I don't reassign x explicitly it cannot be reassigned by another part of the program I can't do that if I want to reassign x I have to return it as a return value of a function it's the only way to reassign it it's to have it at return value not to pass it as implicitly like that so I have a kind of immutability by default so immutability kind of says that when you define a value keeps that value forever but you can redefine value which work bit like mutation and actually is implemented like mutation can redefine things and if you redefine something in a different scope it don't touch to the thing in the upper scope so if you redefine value in sub scope and later call to the value it's the previous value because you you want some modification to be local to don't propagate and that this idea of scoping which is always the same that's applied everywhere which is maybe a bit weird but actually really handy and so the interesting part is about how having a more constrained language makes allow the ability to spot more bugs and I speak a bit about the fact that tabscript sadly have exactly the same flows as JavaScript because it's a superset so for example charm immutability a 4x2 have right correct inference and correct typing here I have an example of bad inference in typescript which is I it's borrowed from the typescript tutorial I should have typescript understandings that's if I want something which is left of right and assign a variable right I should have the right thing but the thing is typescript decide to infer s as a string and not as a left or right and why typescript does that because here we have a let s equal right and not a cost s equal right if you change the let by the cost typescript will correctly infer but since there's shared immutability everywhere typescript can't really know or assume that the value of s will not be changed so it just say okay it will be a string even if you don't change it if you even if you use it right after say it's a string because it's mutable it can change so you don't know it's don't know so so that's kind of among situation where so typescript inference failed and if you have immutability you don't have this kind of error of course and there is also bad typing this code is a bit nasty but the idea if you expect something which is a string or a number and then you check it's a string and then you kind of calling some things that changing the value behind the back of typescript and typescript can detect that because typescript just choose to don't keep track of that and assume you don't do it even if you can do it so this code was compiled it will check with typescript strict typescript strict is hard to say we say there is no no error everything is fine but if you run it every of course at one time you say that's two uppercase is not a function which is annoying I would expect typescript to catch it but in a way I know typescript can't catch it because yeah it's really hard to keep track of every value and and types then can't do that easily so this question is no how I expect types to work in my minimal programming language I've not finished yet and there is an idea that's come from mostly really mathematical language which is called dependent type language which is the idea that types could be values that could be used like a value you can assign like for example here I just say types kind of a function they just say if something is true or false and if it's true it's type and if it falls it don't type it's really natural to understand but then I could use this in my code in to prefix value and say okay let's check if it's work and everything and and this is a really expressive way to say that something is a type because every function could be a type as soon as the function return true or false a boolean value this since there are functions function can be composed so you can for example do structural typing you can for example do union types the things you have in typescript you can do interchange type it's just way to say is it a thing and a thing and a thing or a thing is a thing is a thing and and just comparing value it's not that hard but the thing is you want to run all that computation at run at build time not at run time so you kind of all you you will you will succeed to do that what typically typescript type script can do and the thing is it's more easy because as we see before the execution flow of the language is more predictable there's no shown metability you don't know if a value can be shown by something else and there is a technique that's come from static analysis which is technique which is mostly used to prove that's a language that don't behave correctly still behave correctly this technique for example used to prove that the C program works as expected and in a C program you can do whatever you want so you kind of need a powerful technique so the idea of this technique is typing a program checking a program the build time check is really like running your program you run your program and you check if it failed or if it worked but running a program does not scale because if your program has an infinite loop your check at build time will be infinite or if you do an infinite recursion so you have to change a bit to be aware of your program you have to change a bit of if else work a function call work and you have to change a bit the standard library to to not do print not write file not read file and the idea if you change a bit seems to be fully determined for example random function rather than returning a random number return the range of possible numbers so you know your number will be between zero and five because it's expected to be between zero and five but you don't know what value you you you change value by kind of the what would be the expected value in this case this is really simple and powerful technique but that does not really scale well because if you have a really big program you kind of running a really big program at build time which is problematic and there's a solution to that later in this presentation and I have a little demo of this working yeah so here I have four file but only two of interest I have something in type script which is I have an array of four element I pop an element and try to access the previous last element and obviously it failed and I have the same thing in these that look kind of similar actually parentheses are optional but I put them because it's more looking like JavaScript yeah so two two can do the same thing so if I do it in in type script yeah take a while it's really small file but still type script yeah cool I have an old computer for type script defense and it compile no issue everything fine and if I run it obviously I got in the fine which is not what I I expect here but if I do it with my super cool side project toy compiler I have a really nice error that tell me a lot about what's it is happening here yeah so tell me what's the static value of the array what's happening where how yeah and I work mostly on that on having kind of user friendly error and tries next to make this thing scale well and have more expressivity to explain type so so that's the idea um yeah I'm I'm I'm near the end of the talk so I implement a lot of things like really funny scoping bugs I use three-seater for my grammar which is what use github for syntax creation and analysis of everything on github I have all the things we have in JavaScript like the spread operator or the structuring I have things that does not exist in JavaScript like implicit function argument, hyper-overloading and I have a small standard library where with everything you can expect to to write a program and and I have also a client as I just demonstrate that work without node in your class you don't even need to have JavaScript installed on your computer to make it work I have build targets so you can customize your JavaScript with different compile time variable and you can even use or define have build cache so when you you do yeah it's not a good example because the compilation fail anyway but if your compilation succeeds if you run it several times you just compile only the files that change not the new one I have syntax highlighting not the best but still and I have a lot of things left to do I don't have yet pattern matching I would like to have it for the presentation but I still don't have it so I try to do it soon because it's really nice to kind of destruct your things in function if parameter or yeah anyway so this case where my language looks really too functional like you don't really can't really do while loops you can have forage but you can't have while and because you need kind of mutability to be able to say okay do that until that thing it's no longer valid and and I'm kind of thinking about having a bit of mutability constraint because it solves this problem and there is subtyping because actually with types I as I show them I cannot say that's a type is of another type and it's not that art you but I didn't did it yet because I have keep track of all the type declare and can explain that later if there is question about it and and yeah and there is a lot of things to do and societal is a part-time project but maybe one day it will no longer be a part-time project it's really nice to have a language server protocol maybe you are aware of that debug adapter protocol too it's kind of the language was designed the first place to not compile to JavaScript but to compile to kind of so I made a read me you can try it if you want and that's basically the end of my talk thank you are there any questions yeah so the question was what's programming language parted part theme I use as inspiration was this just JavaScript or something else for example rest or functional programming and the answer is I work in for the asset for an ascal company so I am really a functional programmer indeed and before I was working with rest and I really love rest but I I kind of feeling that trust is not a language for anyone because rest ergonomic make it sometimes really painful to use for writing like most of the program tasks you you want to write in your daily program or life so my first motivation was to try modestly to do a better rest like a rest but that's with maybe more simple rules and more easy to use but still the same kind of guarantee but it's still the idea a bit behind this project but I I like what I said in the beginning of the talk I try to to I try to make something small and hackable in on which I can really experiment if my semantic work in my syntax is is an ambiguous this kind of stuff and it's more easy to have something small that work on which you can experiment and see if it's usable rather than it's a bigger goal I have I had in the beginning which was more inspired by yeah rest or functional programming overall yeah thanks for the question so uh one of the you
How to Win 1st Place in the Kernel Patch Statistics - Tools and Workflows
First talk is by Uwe. How to win first place in the kernel patch statistics. Good morning. Soundcheck seems still good. I'll talk to you about how to get many patches in the kernel. The starter for the talk is the LWN patch statistics. That is presented after each kernel release where you get statistics. But actually this shouldn't be your motivation to get patches in the kernel. This is just a nice side effect. But it was a good starter for the talk. First about me and my employer. I'm Uwe Kleinekunig. I work at Pangotronics as a kernel engineer since 2008. I have several jobs in the kernel. I'm a PWM maintainer. But I already contributed patches through all the kernel subsystems. You can reach me via IRC and PGP if you have questions after the talk. If you are interested in the tools I present, I didn't create a repository for that. If you have questions or want to use the tools, just contact me. My email address isn't listed here, but you should be able to Google it. Pangotronics is a company that exists a bit longer than I'm with them. We're doing embedded Linux consulting, mostly for German industrial customers. In the kernel, my colleagues and me, we have several, we're listed several times in the maintainer's file. So we're working with our customers also in the mainlining business. We're selling them that mainlining is a good idea. Yeah. So if you have a good idea of what to change in the kernel, this is the process you have to work with. You put your changes in the end in a mail and send them to subsystem specific mailing list. Then ideally you get prompt review by the maintainers who are responsible for the code. Then the patches are picked up and sent to Linux Torward through in the end, creates a release from it. If you have a big series, you have to apply the same things. You have to do for single patches too. This is the usual or a short list of things you have to care for. These are not very hard rules, but this is what I think is the sensible set I use next as a base. Next is the integration tree for the upcoming kernel release. This is a good idea because if you send patches based on what is in Linux Torward's tree, you often get feedback that there is already some development happening and that your patch doesn't apply, so you have to rebase. If you use next, this is minimized. Even if you think you are a good kernel developer and you don't do the beginners mistake, use check patch. This is a small Perl tool that catches the obvious errors you can do with your patches. You forget your sign off or there are spelling mistakes and things. It's much nicer to get these things said to you by check patch then if you send the patches out and people tell it to you. The same applies for build testing. Do build tests ideally on several architectures because even for trivial patches, it's quite easy to break the build. The same reasoning as with check patch. For single patches, it's good to describe the change as well. The idea is that you want the maintainer to understand your motivation and the things you are changing. You want to make it easy for them to apply the patches to understand the benefit. This is still more important if you do massive patch sanding because you are adding much more burden to the maintainers. Also, addressing the right people. You don't want to miss the important people obviously, but you also don't want to annoy the others. I once sent a 600 patch series to the kernel mailing list and several people were annoyed. Don't repeat that. To get a big project, you have to pick something that applies to many drivers. What I did in the past is the remove callback for SPI drivers returned an integer, but that value is ignored by the core, which resulted in many drivers returning an error code in the expectation that there is some error handling in the upper layers, but which is wrong and this resulted in several resource leaks. The same for platform devices. This is my current quest, which is a bit more massive because there are more than 2,000 platform drivers that I have to touch. I am in the middle approximately, so there are still a few more patches to come. I have a few further ideas, but I will come to that when I am gone with this quest because doing more than one such quest at one time is really hard. Usually it is not hard to find something new to patch. If you touched all platform device drivers, you have seen quite some stuff and there is always something you can fix. What is very helpful to generate the patches is the tool Coxinell. It allows to describe a patch in a very high level form where you can... For example, this is a small version of a patch where I first try to identify... ...platform drivers that have a remove function that does not return zero, which is the first step before converting them to return void. Maybe I can... The syntax is just that you say, OK, I have any expression that is not zero and I just want to patch that in all remove functions of a platform driver, changes the return value from that non-zero value to zero. This is just to find the drivers that are affected by the quest. It is very hard to create a Coxinell patch that does the right thing for all drivers already. There is always some handwork that you have to find. For example, for indention, which is usually get wrong by Coxinell. With Coxinell, you don't have a tree where all drivers are adopted in the end. If you have 2,000 affected files, you don't want to commit by hand. You have to apply some shell scripting to make a commit for each file, which I think is the right thing. For some maintainers, they prefer to convert all their drivers in their subsystem in a single patch. But at least for sending it out and for review, it's easier to have one patch per driver. What I then do is I iterate overall changed files and commit it. The challenge here is to pick the right subject prefix. In the first approach, I just put the file name here. Then I go several times over my branch and use Filter Branch to adapt the subject prefix. This is depending on the subsystem, how they want it, if they want a capital or a small letter here, and if the separator is a colon or a hyphen, you have to check all the commits for the subsystem to get this right. I have a script that I keep in a scratch file. You see a short part of it where for some common drivers, I can adapt the subject prefix accordingly. This is much quicker than doing it by hand. Then here comes my usual workflow for formatting the patches in a mail and sending them out. This is the usual format patch call. I always put it in a sub-directory that I always call w. I don't know what it stands for. Then I have a script that I pass all my patches. I'll come to that in a moment. I edit the cover letter, which is a quite important part of patch theories, where you have to describe the overall idea of what you want to do and to show the benefit of the patch theories. This is the, I think, or I hope, the first thing that people will read about my patch theories. This has to be a good description to, again, make it easy for the maintainers to pick it up. Then I edit the list of recipes. I edit the recipes to the individual patches and send it out. Then what is a critical thing for tracking later, if I note for every patch that I send out in the commit, the message ID I used to send the patch out. This is important later. If your patch doesn't get applied, I can quickly find the conversation in my mail client to send a ping or to ask what's up. Then I put the commits I sent out in a dedicated branch near the top to track all the patches I have already sent out. This is the L file. This is generated using the getMaintainer script, which helps you to identify the interested persons for a given patch. This is, well, in the end, it's shell script. Usually you really have to adapt the list of people. For example, if I send out a patch series adapting several SPI drivers, I usually want that the SPI maintainer takes the patch series as a whole. What getMaintainer gives you, however, is that for the, or in this case, for the upmail PWM driver, that the upmail maintainers are listed as contacts. What I do then here, of course, one step back, this address append is another script that takes a list of patches and adds the people listed with minus t to two, to the two header in these patches, and the addresses on persons with a pass to option c to cc. So what I usually do for, well, this is a PWM series now, what I usually do is I replace with editor magic all minus t by minus c to first have them all on the cc list, and then I change individual lines back to minus t to just address the maintainer. And then I have a longer VIM command here to fix the syntax because currently it doesn't work. I start a quote here and I have the descriptions from getMaintainers here in the end, and this command above just throws away the parented expression and adds the closing quote. And then I can execute it and have all the people in the right mails. And what I'm doing here also is for each patch, I add the cover letter here to ensure that each person or each list that gets a patch also receives the cover letter to give the right context. This is also important if you have a patch series where you have dependencies where I introduce a helper in the first patch, and then this is used in the second. It's a good idea to also at least carbon copy the recipe for the second patch with the patch that introduces the helper such that they can easily understand the second patch. Here is a short snippet of my git config which is important for sending out or which I rely on. One is I blind carbon copy me all patches to make sure that I have all patches I send out in my index to be able to reply later to it. This is a good idea if you get sent email. This makes git send email ask before sending out each mail, which is if you have a big folder of patches, you don't want to accidentally send it out. So this gives you a chance to look again over the list of recipes and abort maybe if there is a problem. This setting is important for the nodes I added to the commits I had here. If I rebase them to be included in my tracking branch, the information doesn't get lost, so the nodes are copied on rebase. For sending patch series out and addressing the right people, it's beneficial to add one series per subsystem. That means not less. Don't mix several subsystems in a single series and also don't send several series with a similar or the same topic to the same subsystem. This is maybe a bit subjective. Some people, for example, NetDev is an example, they say don't send big series if you have say 30 patches, better use two or three series. It's a bit of experience to know this, but in general it's a good idea to do one series per subsystem. To save time and communication overhead, it's a good idea to be explicit about the expectations, how your patch series should be merged. For example, you can write into it, I expect this series to be taken by the SPI maintainer as a whole, even if there are maybe one or two patches which doesn't fix this topic. This doesn't have to be fixed and people can disagree, but this is better than if you have, you get no feedback and you don't get your series applied, and then you have to ask who will apply. So state your idea and, yeah, such that people can know what you think should be the best path. Another good idea is a slow start. What I mean by that is if you have a patch quest and you have to address drivers in 50 subsystems, don't send them all out at once. Start with the first one, pick something actively maintained, and then take the feedback to improve what you send to the other subsystems. So first send out one and then you can slightly increase your speed. But the effect is that you get better descriptions, people ask questions what they don't understand, and you can improve on what you write to the next maintainers. Good. I already presented you, I have a branch for all patches in my quest. I base this on the latest RC1 release. This is a bit smoother than basing it on next, where there's much more movement and it's easier to rebase from one RC1 to the next RC1, because this is all linear and you know what patches are really in. Occasionally it happens that you get a patch into next and it is dropped again, and in such cases you would lose patches, because they fall out of your tracking branch as they are included below, and then if you rebase the next time they are just missing. My tracking patch looks as follows, it's somewhere down below, there is the RC1 release, and then I have all the patches I sent out, and the few top commits is a collection of the remaining drivers that I have to adopt. This is one commit for all remaining patches, for all remaining drivers. In this case it's two such commits, because some drivers are a bit more complicated, they are not correctly adapted by Coxinell, so I track them separately to be able to take the necessary care. The top commit is where I rely on all platform drivers being converted, where I change the remove callback to actually return void, which is only possible after all changes are made. So it's the top commit to have the series still bisectable. What is really useful is the cherry command line parameter to lock, which marks all patches with a plus or an equal sign, and the difference is that the patches marked with an equal sign are already included in the left hand side of the expression here. So the mailbox patches were already applied in next, but not in the last RC1 yet, so they are still included in my branch, and the Macintosh patches are not yet included, so they are plus the work in progress patches, obviously I also have a plus bit, but that's less important. There's a similar option cherry pick, which has the effect that it lists only the patches that are marked with a plus in this syntax. This is the one I usually go through if I want to track which patches need some more care, which need a pin to make the maintainer act on them. I have this below each patch, ideally I have this marking I added that I already talked about in an earlier slide, and with not much, which is a full text mail indexer, it's quite easy to open a mailbox that contains the mail with the given message ID, and all the mails in the same thread. So if I open the virtual mailbox, well the thread belongs here actually, this is broken in a strange way, I see the patches I sent out, and in this case I can see, okay there was no reply, maybe it fell through the cracks at the maintainer or I added the wrong person, in this case I see it's also nearly a month old, so maybe it's time to send a ping and ask are there any problems or what's the state of the series. This is very useful to have an easy connection from the git commit to your mail, and not much integration of much really helps here. Occasionally it happens that you get feedback where you have to adapt things where things are not so optimal. In this case B4 is a great tool that I really recommend to use, even if you're not a maintainer, it's quite handy to collect the reviewed by and the act by text. Occasionally it happens that you already did some restructuring of your branch, and then git range div is very useful where you can compare the two different histories, the one that you already adopted, and the one recreated by B4 based on your previous submission, where you can see the difference where are texts, where are no texts, where did you change the code, and this really helps to create a single series that has all the improvements that you created on both sides. This is what I wanted to present to you. If you have questions either here in the forum or later, don't hesitate to ask me, or after FOSDEM contact me by email or ISE, send me your questions. I'm happy to help you for your next quest. Thank you. We have time for questions. I don't know who was first. I was looking at my send emails and have seen a lot of my patches sent to you because of GitMaintenors collecting your address often. Is it a challenge for you now to deal with all these emails because GitMaintenors will do to your many commits everywhere, often collect your email address? This is indeed an effect I wasn't aware of. If you attached all 2,500 platform drivers, you'll get a massive amount of patches in the next few releases. It's not very helpful to... If you submit it, it's not helpful to send patches to the person who just do cleanup on the driver who don't have the real interest in this driver, which also applies to me. I don't have interest in some obscure IDE driver that I just touched because it happens to be a platform driver and I changed the remove callback. On the other hand, it's also really hard to separate the list of people you get from GitMaintenors. Don't hesitate to keep me in the list. I'm very good at ignoring emails. I just archived some and it's quite usual to... If you send patches to, say, 10 people, that you don't get feedback from at least 9 of them. So this is life and I have a very big mailbox, but I usually can handle it. Thank you for your presentation. You have described your send workflow and you're not using B4. Have you talked to Konstantin, who has developed B4, because you have some special needs about CC and tool handling and cover letter and so on? No, I didn't. Mark already knows. I don't use B4 because with B4 you cannot individually change the recipients for the patches in the series. So what I like to do is if I have a series that touches, again, SPI drivers, I don't want to send the patch touching the iMix SPI driver to the Atmel SPI driver maintainer. So the list of persons is really hand-picked, which patch is sent to which parties. And with B4, at least last time I checked, you can only define the recipients globally. So you have to send all patches to the same set of people. No, I didn't talk to Konstantin. I have little motivation to do that because my patchwork works and I think it's a bit special for these big series and I'm not sure that there's a big benefit for extending B4 to that because for most people it's the right thing what B4 does. And the added flexibility for my use case results in a complication for tracking and for usage for all people which is questionable, I think.
Streamlining kernel hacking with mkosi-kernel
I'm very excited about this because this is actually the tool that I'm using to build kernels for a while now and it's made my life a lot easier. So thank you for that. Dan? Thank you. Yeah, so let's talk about kernel hacking. First a little bit about me. I'm Dan. I work at META on the Linux user space team and I'm a system demaintainer and I also maintain the tool that I'll be talking about today which is a METCO sign. So quick motivation for the stock. A little while ago I started looking into running system dejournality which I work on for individual users instead of just on a per system basis. But to make this work I actually needed a BPF feature for Unix tokens that wasn't available yet. So I looked at the kernel source code and I figured this is probably doable to do myself. So I got into kernel hacking. One site figured out the code and were written up my first batch. I of course had to test it but there wasn't really a clear way like this is how you test your Linux kernel batch. So I started looking into what I could do to do this. The first thing of course that I needed to fix is like if you have your batch you can't test your compiled kernel on your host machine of course because if it's broken then you suddenly lose your system. So you need a virtual machine or something to avoid breaking your machine. I also wanted to make sure that this setup is quickly replicable to any different machine because I started on my laptop because that's what I do for system dejournality and it works great. The kernel is quite a bit bigger than system dejournality and it also compiles a lot slower. So I was quickly looking for a bigger machine with a lot more cores so that my kernels could compile quicker. So it would be very nice if that I could replicate the setup very quickly to another machine. And ideally I'm not too reliant on whatever the host distribution of that machine is because well I work at Meta and we can get like very big beefy servers that have a lot of cores that we can work on but they might also be running some old version of CentOS with all the not all the latest tools available. So ideally I still get those but on the big beefy server with the old Linux distribution running. Of course I want it all to be fast so that I have a quick turnaround time so I can fix bugs or notice bugs, fix bugs, recompile everything and boot again without waiting too long. Like everyone knows the XKCD with like compiling and two dudes are fighting. So I wanted to avoid that. And then of course when you hack on the kernel these days it's not just the kernel that you're working on. There's very often some user space projects involved as well. A good example for the file system service XFS tests which is a separate project. I also wanted to be able to compile all those things and get them available in the virtual machine so that I can run them. So of course because I work on system D and we use Mako aside to do all of this for system D because system D suffers from the same problems. You also can't really test system D on your system because if it's broken then you can't use your system anymore. So Mako aside is basically my hammer and like kernel hacking is just another nail that I wanted to slam in. So what is Mako aside specifically? It's a tool that the Linux puttering developed to simplify its hacking on system D. So he had all the same issues so he developed Mako aside to fix it. What Mako aside does is it builds you a Linux image. So it invokes a package manager and then it installs packages. It packages that up as like in one of various formats and then it allows you to boot it either in a virtual machine or in a container and then you can do whatever testing you want and when you're done you just exit the virtual machine and it's like nothing ever happened. So Mako aside supports like it has a general execution flow. So of course like we have CLI options, configuration all that. We install packages for the distribution. So this is invoking DNF apps, what zipper, Pacman for all the distributions that we support. Optionally we set up a boot loader and all that if you're doing it bootable disk image. We run various system D tools that are helpful to configure an image. If needed we build an NNRMFS. This is again when you're doing bootable stuff. We generate what's called a unified kernel image. This is this new system D thing that allows you to combine the kernel command line, kernel image NNRMFS all in a single file and then boot that from that in UEFI. Then we package up the entire thing as a disk image and then optionally of course you can boot it in QM or container with this in the N spawn. So how do you get started with Mako aside? This is not like the kernel hacking specific stuff but this is just like if you want to make a side image you specify which distribution you want, you specify the packages you want in this case is in the NLINX and we're running on ARCH. We have an auto log an option to basically automatically get a root shell in the virtual machine and then you say I want to boot this in QM. That gives you something like this. So we support this for Debian, Santos, OpenSuzi, ARCH, Fedora and Ubuntu. And there's a few other distributions but they're all derivatives of these. So everything can be specified via CLI, the settings as you can see here but of course we also have configuration. So this is like the system, the init, any files and things that we all know and love. So we more or less do the same stuff so you can also specify it all in the configuration file. Using Mako aside for kernel development and development in general. So what I showed previously just in soft packages from the distribution of course that doesn't really help us. We want to build stuff from source either system D or in this case the kernel. So you can specify a build script. The build script is responsible for building your software. Canonically we call this Mako aside.build. So when you define that Mako aside will pick it up and it just contains the instructions to build your software. So either make for the kernel or mess on for system D. You can specify build packages which are just the packages that are needed to run the build script. So compiler, build system and all that. You can specify a build directory so that everything is cached. This is important so that your incremental builds are fast. With the build directory we have to build cached but we don't have the image cached yet so we have the incremental setting for that which will install all the packages once cached result and then reuse that on the next builds so that our image builds are fast as well. And then we have various settings that you can use to configure the image without invalidating the cache. So you can add extra files for testing or to configure your shell in the image or basically anything you might want that configures the environment to your liking. You can do with the extra trees and the post installation script so that the testing environment is the way you want it. Whatever customization you want you can pretty much do it. And then we have the runtime trees to basic which we use Fertile.ufsd then to mount extra directories into the virtual machine so you can make the XFScast source code for example available for running XFS tests or you can make your home directory available in the VM if you want that. Whatever you want with runtime trees. You can modify the kernel command line and whatever way you want. And we want to specify the output format as a directory so that we don't have to build a disk image but we can just boot from the directory itself also using Fertile.ufsd. Why do we want to do that? Because it's faster. It takes time. And we're looking for this quick turnaround time so we try to make everything go as fast as possible. So make OSI kernel is really nothing more than a make OSI configuration in the separate repository that's specific to hacking on the kernel. So we have a build script for the kernel and then we have various other modules that are all just build scripts for user space projects that are related to kernel development. So as of this moment we have of course a module for the kernel and then we have other modules for better FES procs because well I work at Meta and Meta work some better FES. The Linux test project which I added for Christian and then some other testing projects like block tests and BP filter which is a Quentance project for hacking on firewalls. So I added that as well. So you basically specify which modules you want and then all those get included. So getting started with MakeOSI kernel more or less looks like this. You clone the repository. MakeOSI is pretty easy to install. You can also install it from your package manager of course but it's a pretty fast moving project so in this case we install it from source so you just clone the repository. You sim link the script to somewhere in your path and then that's all you need. You can then run it. We download if you want, well by default for the MakeOSI kernel we download all the other tools we need on demand. So the only stuff it needs is Python and bubblewrap and of course the package manager and then that's enough to get started. Then we clone the MakeOSI kernel repository which contains the kernel configuration, a specific configuration and then you can write a local configuration file that basically says what distribution do I want to use to test or to use MakeOSI kernel. So we support Fedora, CentOS and Dabian at this point but it's easy to add more. The only thing that's distribution specific is basically which packages you need to do kernel development. So you just define the list of packages to build a kernel and to boot the system and that's sufficient to add a new distribution. So it would be very easy to add Arch Linux here as well. And then of course finally we specify the modules and we specify where our kernel sources live. So this is what the build source is setting. So your kernel can be checked out anywhere on your system and then you use the build source setting to specify here's the source location and then the target directory where it should be mounted when we run the build script. So this should always be kernel, the target directory. Of course the directory can be anything and it will be mounted in the right place and then we run MakeOSI and it will do its thing. So I hope this works with the internet here but I made a video. This is with everything cached so otherwise it would take a little bit too long for the stock but when we run QM we see the images cached and then we start running make. Kernel build is of course cached as well otherwise it would take forever. So not too much happening but we get a new kernel image packaged up. Then we make OSI does its thing and then we boot and then you're running in a VM that's running the kernel compile from source and you can do whatever testing you want and then we shut down again. So of course to build the kernel we need kernel configuration. We ship a default kernel config in MakeOSI itself. This is just with the minimal amount of stuff enabled to do to test various things and to the necessary drivers to be able to boot in a virtual machine. So we keep the drivers to a minimum and the features to a maximum. Anything like that's related to kernel development can be enabled so that it's available and then you can use it for testing. We also enable a few debugging things so that it's easier to figure out what's going on. For example also the kernel command line we configure it to panic on oops and stuff like that so that when something goes wrong when you're testing you immediately see and you don't have to go to the message to figure out if something went wrong with stuff like that. We also allow configuring to build a self test if you want that and specifically which self test so you can specify targets or you can specify to skip specific targets. For example the BPF self test because those take absolutely forever to build. You can specify your own K config if you want so you don't have to use MakeOSI's default one you can specify your own and the interesting way that we basically use this minimal config file is by using the all death config make command which basically says take the config file that we specify with K config all config use everything from that and set every other option K config option to its default value. So we specify what we want and we give everything else a default value. And then finally while I said that MakeOSI can build an inner ramifest for you building an inner ramifest is again more work which means slower which means slower turnaround time so in this case because we're building our own kernel anyway we simply build the virtual aero fests driver right into the kernel and that removes the entire need for even needing an inner ramifest so we just skip that step completely. As I already mentioned there's a few useful settings to like runtime trees and extra trees to customize the image. Another one that's useful for file system development is the QMU drives in the QMU arcs setting. So to add extra devices block devices to a VM with QMU you need both a drive which is the host facing side of it and then of course a device which is the guest facing side of it so MakeOSI can allocate the drive for you using a file that it creates itself on the file system which then removes when the VM shuts down so that's what you can do with the QMU drives and you can specify the serial or the drive ID you can specify the size and you can specify all the extra QMU options you might want in this case we specify that asynchronous IO should be done using IO U-ring and then of course you need to attach the drive to an actual QMU device so in this case we specify an NVMe device and we give it a better RFS serial and we specify that the drive should be better RFS which is the same as the ID we gave to drive. Like I said we can configure the kernel command line and if you want to do bootloader stuff you might want to hack on the EFI stuff or stuff like that you can also specify that we should boot in UEFI environment so that you can basically hack on the EFI stuff code or anything related to that. Well all this stuff I mentioned works like usually what you do with QMU is you have your dash kernel argument and your dash append and your dash init RD which you use for kernel development but when you start doing UEFI you might not have all of that available anymore. Now what MakeOSI does is it basically sets things up so that even if you're booting in a UEFI environment everything really works the same even though we don't directly might not directly use dash kernel anymore we might be booting from a disk image we can still like append to the kernel command line and all that it's all still all supported. You can get some extra shells in the image as well so of course you get the serial console but you might want extra shells so you can do that with MakeOSI SSH. You have to also enable the SSH option to make sure that the image gets configured for this but we do that by default in MakeOSI kernel. There's a very complicated diagram here that basically shows how we implement this in systemd but the interesting thing about what MakeOSI SSH is that you don't need your VM to have a configured network to be able to do this. So for VMs there's this alternative socket family which is called AFVSoc which allows for inter-VM communication that doesn't rely on the network interface being up and running and configured. So using a bunch of new systemd features what we're able to do is at runtime provision the virtual machine with your SSH public key so we can put it in the authorized keys file for the root user and then if there's a VSOC device attached to the VM in the next systemd release systemd you'll basically be able to automatically detect that a VSOC device is attached and if so it will generate a socket unit that will run SSHD on port 22 of the AFVSoc family and this allows you to connect to the VM over VSOC from the host without needing a network. We can also do install a drop-in file for the SSH on the host configuration which SSH support now as you can do drop-in configuration for SSH and we can use the SSH proxy protocol to basically take possession of the UNIX and the VSOC host name prefixes so that you can use those to connect to VSOC enabled VMs. So with all this setup you can basically do SSH VSOC slash the VSOC connection ID to connect to that specific virtual machine all without going over the network. So we don't use this stuff yet and may go aside we have our own version because the systemd thing is very recent but we'll be moving to this in the future once this is available everywhere. Running tests manually is all good and fine but you want to move from manual testing to automatic testing of course so we also support this when you want to do automatic testing you want to run the test and you want to get an exit status usually this is very simple with a process you just run the process in your shallow whatever and you get the exit status from the kernel when you run the test in a VM this gets a bit harder there's not really an easy way to get the exit status of a process that's run in the VM and transfer it back to the host. If you're running a directory from a directory with FURTAOFS you can just write some files to the directory and retrieve all the information that way if you want but if you're doing testing from a disk image then you have to mount the disk image once the VM shuts down to access the information and of course to mount the disk image on Linux you need root privileges so you have to start entering your password and stuff so it all becomes a bit more complicated so what we added instead is a way again using the VSOC stuff to have the VM when it shuts down and you use these two unit settings success action equals exit and failure action equals exit in the systemd unit when that unit exits the VM will also system they will also shut down so the VM will shut down but it will use the SD notify protocol which is a some systemd thing to send notifications to send the exit status over VSOC from the VM to the host and make a way so I can pick up on this and exit with that exit status so seems pretty trivial to get the exit status but there's a bit of work involved to get it out of the VM and then of course what we also want is the locks so this isn't actually upstream yet but we're looking to have add another forwarding mode to systemd-journally so that again using VSOC it can forward locks over an AFV socket and then we can listen on the host receive those forwarded locks with systemd-journally remote and write all the locks to a local directory and that means we can access the locks on the host without needing root privileges we don't have to mount the image we just have to the locks locally we run internal cuddle on it and we can access the locks see what went wrong with the test and debug further of course I'm not the only project in the space we do have some competition so the latest product in this space is Furtmianji so I thought I'd mention it as well because I don't want to claim everything for myself like there's more tools than just to make a side kernel so definitely take a look at Furtmianji as well Furtmianji is very focused on kernel development so it has a lot more options to for example use the kernel from the host and various other options but it's very specific to kernel development it also has its own in its system that runs in the VM which allows it to boot very fast but you don't get all the you don't get a regular Linux system like you would when you well I mean I don't want to like say that system D is regular but you don't get system D so if you wanted to start doing stuff with devices or something like that you definitely won't be running so it gives you a bit more limited environment so depending on what you're doing one or the other might be more useful yeah that's more or less it on the comparison if you want to know more about this like come talk to me afterwards or something and I can say a bit more about the differences between it too of course I'll end with some reactions from users so of course like Christian already said he was using it so it's very nice as well his reaction to it and then Joseph from Meta the better FS fast system maintainer is also using it so and he's also very happy with it so I hope it can be more useful for more than just them so please give it a try and I'm happy to answer any questions or implement more features if needed thanks for listening hello thanks for the thanks for talk or two quick questions so one what about cross compiling so that works we don't like we don't have like a specific environment rival in the build script yet that allows you to specify cross compile but we can simply add that but I already tried it like just by hacking the build script and saying cross like changing the architecture to compile for arm 64 and that works Christian also or I'm not sure who added it but we also had the support for compiling with LLVM if you want to and the second small question maybe I missed that because I was late for the talk what gets into the unit MFS so what about from the all in the last half so make was I kernel by default doesn't boot with an around fast when we do the third I your FAS stuff if you do a disk image then the inner MFS is built with make or side cell so I actually have another talk about this in the distributions there from but yeah we just install regular RPM packages or whatever into the inner MFS and then by default we just copy all the girl modules and firmware from the host but we have a suite of settings to basically include and exclude whatever you want and we also have like the stuff that in the ramfs generators to to include everything that's loaded on the host if you want that so you can configure a bit which firmware and then drivers you want we'll also make sure that we when if you specify these drivers modules to be included we pick up all the pen the dependencies as well so we make sure that all that is set up correctly and included I'm using the inner drum FAS stuff like I'm building full images and I'm not using the QMU part I'm using a different virtual machine manager for this and it works really nice because that was the biggest the biggest thing for me that it wasn't easy to build an inner drum FS especially if you want to do it destroy independent which was really annoying it's this also useful if you want to run a mainline on a new device where there's only some heavy patched in the window kernel so you want to test if your drivers work and so you need to you want to test it but don't want to touch any non-volatile memory just started somehow without like this fast boot boot or something like this sorry I don't think I completely here to her to question so this was all about mainly so if you want to test if the kernel works on a new device so where there's only vendor kernels are known to boot so you want to you don't want to destroy the user space there but first test it there before you touch your space and you want to boot it only from ram can it also be used in that way so it's very focused on virtual machines at the moment while make or size kernel specifically is but make or size can build your images that you can then deploy on another device so like you can run the stuff that is produced by make or sign on your laptop or you can flash it to your disk and it will it will boot but specifically without destroying the user space we don't have anything specifically to make that work you could take the kernel produced and then keep the user space the same but it's not something I've really I've looked at before so it probably won't work all right I think if there are no more questions and thanks for your talk thanks for the tool you
libvpoll: create synthetic events for poll, select and friends
This is the Kernel Dev Room, just a reminder from time to time. Ready for our next talk, Renzo and Luca are going to talk about LibVPol, create synthetic events for poll select and friends. Hello, I'm Luca Bassi, I'm a student from University of Bologna and today with Professor Renzo Davoli from University of Bologna we will present the LibVPol library. Many programs use the poll and select system calls to wait for events triggered by file descriptor of events. So you can write a library to be like a natural stack or device and you can implement your drop-in replacement function of a system call and use some dynamic library magic to rename the verb t's system call request to their virtual implementation. t seems all good but there is a big caveat that this approach doesn't allow to mix the normal file descriptor using the real file descriptor with the library one to use in the same select and poll and similar system calls. So Professor Renzo Davoli developed the LibVPol library that is a library that permits to define file descriptor whose IO events can be generated at user level. This permits to generate synthetic events for the standard Linux poll select, poll select, etc. system calls and with this library it's possible to mix the real file descriptor with the ones provided by libraries as a parameter of the poll select, etc. system calls. The API of LibVPol is very simple, it's three methods. Vupol create to create Vupol file descriptor, Vupol CTL to change the set of pending events reported by the Vupol file descriptor and Vupol close to close the file descriptor. To work with the LibVPol library needs kernel support for a complete implementation of this feature. There is two possible support to LibVPol, a kernel patch extending the eventfd system call and a kernel model that implement a virtual device, exponent dev slash vupol. The library also provides a fallback implementation but is feature limited and for full support they need a patched kernel or the kernel model implementing the virtual devices. So we had a problem, the problem was to mix a real file descriptor and a file descriptor code file descriptor created by library implementing stacks, implementing virtual devices. We decided to do the nice API for the library but unfortunately it is impossible to have this implementation complete without having a kernel support. The first idea was to extend the feature of eventfd. In fact if you try to code a library like that and you look around the system code and you step into the manual of eventfd the title set create a file descriptor for event notification. That's it. But unfortunately it is not providing ways to create, to set arbitrarily, pull in, pull out, pull out. That's just some way to synchronize processes using counters. So it was chosen because of the finish of the feature. The kernel patch adds to eventfd a third semantics because eventfd has already two semantics, the normal semantics and the semaphore semantics that is able to implement a service like the standard semaphores up, down, and down in the kernel. We had a third way, a third way, a third semantics, a loophole. There should be another way to implement it. So following the standard naming of system codes we have a loophole create, loophole, control and so on. It would be possible to create, to propose the adding of two different system codes, loophole fd create and loophole fd control. But the idea is that we feel that eventfd is the most clear and straightforward way to implement this feature. So out of these extension words we can create the eventfd with the new flag saying we want the semantics of the pole. Then following the semantics of eventfd we have read and write, read, return the current stage of rendering. We use the 64-bit buffer of eventfd as a bitmap of a set. Write is used to change the set of events for the file descriptor. For example you can add the pole in and pole pre using these system codes. So the operation flag add mod del is specified by setting the mod request bits. We had a problem, we had a proposal of solution but we needed to have it working quite fast, quite good. We needed it for our process. So we proposed, we developed another way to have the same service to implement this service as a virtual device. So we have implemented it. If you download, if you install the LibrePole package from Debian Ubuntu you get this module and this module is installed using decay MS. So you can have the service by adding the package after the kernel module. Here the implementation, so we use the LibrePole, the user-level library tries one after the other. If there is eventfd patch it uses eventfd patch. Otherwise if there is the LibrePole device it uses the Ubipole device. At the last resort it uses an emulation. For the device it is possible to open the device and the generation of events is performed by IO control like it's usual for devices. Examples. We work in Internet of Threats. The idea of giving each process the ability to have their own stack inside the process itself so that this process can be a network node by itself. People Peacock's Net is a library which implements a user-level network implementation. Peacock's Net is based off Peacot TCP. There is Peacot TCP plus a glue layer that creates the abstraction of a library-based network stack. We have added LibrePole, we use LibrePole in Peacock's Net so that you can write programs like a terminal emulator which has a main event loop waiting for standard input to send data on the socket and to wait for the socket to send to write on the terminal output. So it wouldn't be possible using other stuff or tricky solution based on unisockets or bytes to synchronize but it would be just on purpose for this. Instead of using LibrePole and extension of the kernel or the device gives a general purpose solution so that we are working to port even LWP as a user-level stack implementation so that we can mix and create applications that can use different stack implementations. Maybe we can create gateways between network streams coming from different network implementations. So to summarize we had a problem to have a file descriptor, real file descriptor and virtual file descriptor working together in PulseSelector, E-Pole and whatever POLU you can invent in the future. Providing POLIN POLAR POPLE and all the POLU events you can create in the future. We have proposed a library giving a nice and easily straightforward interface and we have implemented the proposed implementation and implemented a proposal to be accepted in the kernel. A device driver and an extension to say existing system code. If you want obviously all the source code is publicly available on GitHub. Okay, I think that's all of the questions. I wasn't sure if there were any slides but does this need each individual event loop library to add support for this? Or can you use this with existing event loop libraries like for example SystemD has an event loop library but we need to add explicit support to be able to use V-Pole file descriptors? Or can you plug this in without needing to modify existing event loop libraries to use it? So you create a V-Pole file descriptor and you want to use it with an event loop. Does the event loop need to do anything special? No, absolutely. That's the point. Instead of creating something special to use the POL selector, you use this method. You use the standard, clean syntax of the selector POL or whatever event you're waiting for. System code you won't not use. That's the point. So instead of creating unpublished solutions for the use of that application, this is a general purpose solution for all applications needing to work with select POL. Because you can add extra services to provide the same semantics, the same method to unblock, to signal, and for this system codes. Maybe one more question. Please. Is there any... There's also the new interface in the Linux kernel as IOU ring. Is there anything needed to make this work with IOU ring as well? Or I'm not sure if you've looked at that. I'm not sure. So I'll check and please send the mail or send feedback. I probably missed this, but why didn't you, for example, consider proposing a new E-POL extension? Like a new E-POL system call, for example? It's not an extension for E-POL. I use the standard E-POL like POL or the other. It's a way to signal to POL or E-POL that's never occurred. A common solution for this is to create, for example, a pipe and waiting for a red event on a pipe. And you know that if you receive a red event on a pipe, it means that something has happened. Instead of having this, you can use, you can write a flight description behaving like an event driver, so that you can port your code as is. Because if the driver would have to send POL and VAL, you can synthesize POL and VAL and your E-POL or POL can work in the standard way. With that extra code, it's just to simulate something because it's not possible to do what LibbyPOL does. Another question. As I understand it correctly, it simulates somehow the source of events somehow. But how does it scale? Because you use just one system call in the case of the I.O. control. It's just a system call or using write. It's just another system call somehow. So it doesn't probably not scale, right? So if you want to generate thousands of events, triggering events, it requires a system call each. I say it does scale. It scales as a poll of scales. Because you can use as many fider skaters. For each fider skater, you can have all the events you need. In which means you think there are problems with scaling? Yeah, I mean a poll or an E-POL or whatever scales somehow. But the mechanism on your side to trigger this requires one system call. So here you see another write on the slides. Another one, the other slides. If you saw the write, it's just a system call or using the I.O. control. It's just another system call to trigger this. So you have one event. The solution with the device, you mean using the I.O. control. Yeah, to trigger the event, probably we can do it offline later. Here the last line. No, no, on the previous slide. This one, the write call. Here, right? You see this? This is a system call, the write is a system call. And if you trigger it, it requires a system call. But each aesthetic event requires one system call, right? No. And if I want to simulate millions of... No, no, you don't need the call for each event because it's a bitmap. So you can set the bitmap of an event. You want to be added or set or cleaned. So with one call, you set all the events for one FD. Obviously for different FD, you need different... Right. Next. Thanks. We're out of time. Thank you for your talk. Thank you. And we'll have a five minute break. Thank you.
More flexible user namespaces
Okay, this is the kernel dev room. In case you wondered into the wrong room, the next talk is by Stefan Grabber about isolated user name spaces. All right, hello everyone. So, as Christian mentioned, I'm Stefan Grabber. I work for myself and I've been doing container stuff for the past 15 years or so at this point. I'm also the project leader for Linux containers. And with me, I've got Alex, who is a kernel engineer on the LexD team at Canonical. And basically this topic is something I've been thinking about and talking to a bunch of people, Christian, whoever else was at Plumbers for the past three, four years at this point. So, since about 2018, and I got lucky enough to have Alex at Canonical who's been able to actually get this thing working. So, bit of an intro, what are the user name spaces? To just kind of give an update on the current state of things before we try and replace it all. So, user name spaces were introduced back in this, well, fully introduced back in 312, 313-ish. That's back in 2013-2014 timeline. They allow for the creation of a name space, obviously, in which one can map one or multiple ranges of UIDs and GIDs. So, you effectively decide that, oh, in this name space, my UID 1000 is equal to UID 100,000 on the host or something like that. This allows for much safer containers because you can now map routes in the container to be something other than route on the host. So, even if something bad was to happen and you could escape the container for some reason, then you would end up being a nobody user on the system instead of being real route. That's kind of the idea. It's been used a fair bit in some container managers, especially the ones that we developed. So, Alex, Alex, they incurred all of those user name spaces quite extensively. It is also possible to use in Docker, Podman or other now, but doesn't tend to be the default. So, just to kind of show the current state of things again, so if I turn that one on, there we go. Let's just take a quick look at how that stuff works. So, as a completely normal random user name system, I can just do user name space, stream map, and now I'm route. Except I'm not, but I'm route in that name space anyways. So, when you do that, it just maps the route user to your own user. So, my own user, my laptop is UID 21105, and that's what route is in there. That works, but nothing else is mapped. So, if I try to actually switch to any other user, things are just going to blow up. Actually, this user doesn't even exist, but this will exist, because nobody would. And, yeah, it's just not going to work. So, the plain doesn't exist, which is going to cause a bunch of interesting side effects, like everything belonging to nobody, no group, all that kind of stuff. But, that's for an individual user just creating a community and privilege name space, and that works. Now, if we look at doing the container instead, so if we just create, as I was using in CURSE in this case, just creating an Ubuntu container, there we go. And, if we go in there, then that container is going to have an UID map and a GID map. So, in this case, it means UID 0 in the container is mapped to UID 1 million on the host, and the following 1 billion UIDs and GIDs are mapped the same way. So, UID 1000 would be UID 1 million, 1000 whatever on the host. We can also do, as I said, multiple maps, so you can start doing kind of punching type stuff. So, if you go and modify and you say, hey, actually I would like UID 1000 on the host to be UID 5000 in the container. I think that's in that order. It works. Then, you start the container back, and if you look at the map, did I forget to do something? So, G1. Oh, I looked at G. Oh, that's why. Hey, look at that. It shows you can actually do maps differently for UIDs and GIDs. So, yeah, UID does have that punch in the middle, which effectively causes three maps. That works. That's been around. It's fine. That's the status quo effectively. Now, what's wrong with this stuff? Well, so the current implementation, it relies on still a single global UID and GID space. And you get to then map, like create namespaces and map there UID and GID back to some trunk of that global UID space, a global UID GID space. That works. But you can create overlaps. So, you could have multiple containers that actually map to the same thing. You can have some random processes on the host that actually use the same UIDs and GIDs as you're trying to use in some containers. And that can cause some issues. That can cause the occasional oops, I've got way more privilege than I intended to have. And also, it can cause some issues with like potential denial of service attacks if you're starting to like use user specific limits or that kind of stuff. There is a way to try and avoid that, which is using shadows, sub UID and sub GID files, along with some helpers called New UID Map and New GID Map. And the idea with those tools is you get to actually assign for each user what maps they're allowed to use on the system. And so long as everyone uses those helpers and everyone looks at the file, there shouldn't be any conflicts. The problem is that not everyone uses the helpers and not everyone looks at the file. And even when they do look at the file, the tools that write the file gets really confused sometimes. This is an example of what happens when it gets really confused. I mean, nobody in this room hopefully can figure out what the root user actually gets as far as allocation, because even if you could, it's just broken. There's like, they actually overlap each other in conflict to the point where if you actually try to figure it out, it just ends up with something that's invalid. That's a bit of a problem. And just the general concept hasn't really worked out. So in practice, most tools at this point just ignore those files entirely and just do whatever they want. But that can have security implications, which is not ideal. So what can we do about this issue? Well, what if we had a lot more UIDs than GIDs? I mean, that would fix everyone's problems, right? Like, what about a lot more of them? Like, about, say, 4.2 billion times as many. Well, yeah, that's effectively what we've been doing. So in the Linux kernel, a UID or a GID that's represented as a UN32. We've changed that to a UN64. That obviously comes with some interesting side effects, which we'll get through. But that's the general idea, and that's the kind of concept we came back with at Linux Planner as kernel summit back in 2019. Now, yeah, but that's going to break everything, right? Well, it doesn't have to. So we obviously can't change user space to use 64-bit UIDs and GIDs. We're not going to even try to do that. That would be a very bad idea. And we would actually just move the problem, not fix it. No, what we want to do instead is keep that on the in-cannel and have, of that 64-bit, 32-bit is your normal user-visible UID and GID. 32-bit is effectively an ID, like a namespace ID, if you wish. And the obvious issue with that now is, okay, what happens with persistence, what happens with anything that is not in that namespace looking at those kind of IDs? You're going to have issues there. Well, if it's a process outside of the namespace looking at a process running in such a namespace, we can use the credential attached to the user namespace to figure out who created the user namespace and use that as kind of the proxy ID we show. So that's one thing we can do. On the file system side, we're going to go into more details in a tiny bit, but effectively we absolutely don't want the file systems to be aware of any of that, like file systems, still 32-bit, still normal. That means that out of the box, you won't be able to write or even really read anything. But thankfully, there are mechanisms in the next channel now that makes it possible to actually handle those kind of translations and mappings, which then fixes that. So how does that stuff work? So here you've got a tiny bit of code that actually creates one of those namespaces. So you do unshare user namespace, as usual, and then normally you would go ahead and at that point write your map. So you would normally write your ID map and your GID map. Instead, we just write to that magic file and say, hey, we want an isolated user namespace. And then we switch to the root user, and we're done. At that point, you are running as root inside of that isolated user namespace, and you get to use every single UID and GID that you want. As mentioned, don't try to access the file system. That is not going to work well. But as far as you being able to like spawn process, like actually switching user and messing with those users, that's going to work. And there's quite a few more things you can actually do in that stage, even before you look at any kind of data persistency or any other kind of stuff. So, okay, fine, file system, what happens there? All right, so on the file system front, there are kind of two different things you can do. The first one is, hey, you're in a user namespace, you can unshare also a mountain namespace. Now in that mountain space that you own, anything you can mount is owned by your namespace, and you can write to that. Obviously, we don't allow mounting most stuff inside of a user namespace because that would be a terrible security risk. So, your options are mostly tempFS, you can also mount fuse, and you can mount a few other select file systems in there. And if you do, on those, you can, because they are effectively visual, we're created from within a namespace, you can persist, like whatever UIDs you see inside of that namespace will be written as search on that file system. That works perfectly fine. Now, if you care about normal file systems and persistency and all that kind of boring stuff, well, then that's when you need to actually use a new feature, which was introduced by Trish Novodev, a wallback called VFSIDMAP. And with that, it lets you, so that does need a privileged operation, obviously. But as a privileged user, you can now say this mount on the host needs to be, like on outside of the namespace, needs to be exposed inside the namespace at this path, and this map is applied for the transition between the two, which most often for us means we just want to map one to one from inside of the ASA to the namespace to the host. So, if you write as UID 1000 in that isolated thing, it shows up as 1000 on the file system. So, that's effectively how you handle the persistent thing. You need to pick specifically what files you want. You need to go through a privileged helper type tool to pass those through as VFSIDMAP to mount. And then you get to actually use this thing. So, let's take a look. I'm going to go into, what's the, actually I have that in my notes. Just trying to figure out what the name of the VM is now. User analysis. This guy. Pretty sure. Yeah, there we go. Okay, so this is a virtual machine which is running Alex's patch set on top of the current 6.8 RC1 kernel. And right now I'm, right now I'm rude, so let's not do that. So, I'm going to switch to the open to user, and I'm going to go into a folder where we've got some tooling that we've used to say that stuff up. So, if we look in here, we've got a tool called goisolated.c. If we go look at the code for that, it's pretty much a slightly longer version of what I showed on the screen earlier. So, it does an unshare. It sets the isolated UNS bit, and after that it switches to both UID and GID to root in there, and then it execs whatever command I'm passing it. Okay, so the way we're going to call that is we're going to call it as, well, call that wrapper, and then have it called unshare, and we're asking unshare to also unshare effectively the mount namespace, the pid namespace, and to fork. So, that gives us mostly a mostly functional container. Okay, so we do that, and now we're rude again. But this time, if we go look at the maps, so your ID map is empty, GID map is empty. So, there's no maps there, because we did not ask to actually map anything to the host, we just have that isolated user namespace. Okay, so we've got, fine, a query route, we could do that before, it's okay. We've got another helper here called setUID.C, which is also extremely simple. All it does is it changes your UID, and it execs the command. So, okay, fine. So, let's do setUID 1000 bash. So, that worked. That's something that would not normally work. Normally, a normal unprivileged user cannot create a username space and gets more than just their own ID. So, you can make route work, but you can't make an arbitrary number of users work. So, I can do, you know, just mash the keyboard, whatever number, and that works. So, you get to do that, and now, let's see, what does it want? Okay, fine. I just need to get a second shell, and we're going to go back inside of that VM. So, isolated userNAS bash, default project. Okay. So, now if we look at the tree from outside of that namespace, what we see is, and you see, I can't actually highlight, so you see it kind of towards the middle of the screen, you see switching to Ubuntu user, there's a share, then there's the unshare, then there's another share. And we see that the whole tree looks like it belongs to the Ubuntu user. That's not true. We know it's not true, because that last bash is actually under that whatever user I ended up typing on my keyboard. But because that can't be represented to the host system, as a real UID, it shows, as we said, whoever created the username space, which in this case is the Ubuntu user, so it shows the tree as belonging to the Ubuntu user. If we go back here, and we look at the process tree, we're going to see something quite different. In this case, the host IDs can't be represented, so they all show up as the kernel overflow, so nobody know group. But then our own processes do show correctly, so we see root, root for the first two, and then we see what is it, like 4788, which is what I just matched on my keyboard earlier. So that works. Okay, fine. With that, if I try, like even as that root user, if I try to touch anything, I'm going to have a bad day, as I said. Five systems don't like this, so they just tell you, no. Okay, that's fine. But now if I go ahead and I mount a tempFS, so I'm at the tempFS on slash temp, hey, now, yeah, this works. I can do that. It's fine. If I switch to render my random user, and I go here and foo, okay, that works. And if I look from this level still, oops, we've got foo, did I create foo twice? Yeah, I did. Okay, I should have actually created bar. That would have been nicer. So I can do that. Okay, so now we've got two files, one created by root, one created by 4788. You might notice that my statue ID wrapper thing doesn't bother with groups, so it only changed the user and other group. It's fine. And we see those two files. But now what happens if we try to look at the same two files from, again, from outside of the namespace completely? Stefan? Yep. Do you want to take questions during the talk? Well, I can probably do in like a minute for during the demo, yeah. That might be easier than going back to demo at the end. So just here, if we go look at one of those processes of 556, and we ask for the root file system, we look at demand. Again, I'm outside of the isolated user namespace, so everything shows up as we ever created the user namespace. So we see Ubuntu Ubuntu in this case as the owner for those files. Event are, it's not quite, it's not actually the real thing, but that's what's going to happen in this case. I'm just going to show the last piece, and then we can take a question for the demo, which is the persistence piece. So in this case, again, my user name, I've created a space works. I've got multiple users in there. That's fine. I can write data so long as I write it on a tempfes, but none of that is obviously persisted in a meaningful way, and there's no way for you to do that. You can access some files, as you've seen, like I've been using comments and stuff, that works, but it only works if only random user, as in if the order permission would allow it, because no other permission checks can actually work. Now, if we go back here, we've got another tool. I just need to find the usage for that one again. In my notes, there we go. So, in this, enter, yeah. Okay, so I need the process ID, so it's 556 we're using, and 556. So what that comment does is it enters the map namespace, but not the username space of the target process. It runs a comment from the host, which uses VFSID map, to then map from the host my home directory to a share folder inside of that namespace. So we do that. And now here, if we go look at that share folder, hey, the UIDs and stuff actually do resolve. Okay, that's cool. Well, if we were to go look at the non-map version of it, which is this, the dot. So now, as root, I could go in that share Ubuntu and touch foo, and then as a, so if I do my set UID thing, and I switch to 1000 inside of there, and I touch again in the share, and we'll do var. Oops, sorry, Ubuntu, var. And actually, this one should have been also share Ubuntu foo. Okay, so that worked perfectly fine. If inside of the namespace, we go look at those two files, we see the ownership cannot be having the way we expected. And remember, like when I did that earlier, looking from the outside, it was all belonging to that fake Ubuntu user, because that was the owner of the namespace. But this time, we're doing it through a map. So if I go look at the file system tree here, we see that it just went through the map and actually persisted the data going through that map as expected. There was a question? Yes. Yeah, hey, you showed how you touched a file as a root user, and then you changed to the unprimitive user inside this isolated username space. How is it possible that this unprimitive user is able to touch a file owned by a different user? Is that how it's supposed to work? That's a good point. Yeah, so it was the same group, that's sure. Because my setUID binary is kind of messed up and it only changes the UID and not the GID, combined with the default UMask, it was allowed to do it. If my setUID binary had done the right thing, which is change both the UID and the GID, that would have failed. Good catch. Yeah, so that's effectively how that stuff works. There are a bunch of other places where you will see resources owned by an isolated user namespace from the outside in one way or another. Those are going to be showing in this case as the owner of the namespace. We're doing the owner of the namespace instead of going with the overflow UID, because the overflow UID has been, I think, one of the biggest confusion caused by the user namespace. With the normal user namespace, anything that cannot be represented is going to show up as that overflow UID. The problem is that the overflow UID is, effectively, the nobody-no group under POSIX, which is 65534. The issue with that is that this is also an actual user. You end up with that really weird thing where you can have some stuff that shows up as the overflow UID. I guess one example you could do is, say you've got a process tree, you've got a user namespace in there, you've got some processes running in there. They show up as the overflow UID if you look at those from another user namespace. Then if in another user namespace you actually switch to the nobody user, you will think that what I am, UID 65534, those processes run according to everything, are 65534, I should be able to kill them. No, you can't, because they're not actually running as that UID. That's been causing a ton of confusion. We're trying to do things a bit differently here and hopefully make things slightly less confusing. Remains to be seen, but hopefully. There was something else? The question is, maybe I misunderstand this part of the demo. Is this 4nbar files, they have the owner and the group of user, is it still inside the container or is it visible outside of the container as root? Is it like an escape from this namespace or not? Or is it because it is inside and a center? Which one, the one I'm looking at right now? Yes. This is completely outside. From inside the container we were able to escape as root. That's how. That idea of ID mapped mounts. With an ID mapped mounts, that's effectively what you allow. If you provide an ID mapped mounts with the full map, then yes, the user is totally allowed to create a file that's owned by root and even make it set your ID if they want to. That is possible. That is why you need a privileged helper to be able to set that up. In most cases, what we do is the path you actually share comes from a non-traversable path on the host so that no user on the host can access any security binary you might be able to create in there. That's a specific use case, for example. Or you could use a map and don't map the ID zero. Wait a second, please. When, for example, what other users like Portman and so on do, they use this with OvalAFS or together with OvalAFS and they, for example, ID map the underlying mounts that are used for OvalAFS, but they leave the up amount where the actual writes occur un-mapped, which means that the writes that actually then go to disk still occur as the container user ID and GID. So there are different ways to actually use this. This is just specific to Lexi, for example. Lexi or Incus, but specific. You need to be very careful when you use that because indeed there is that pretty serious risk. If you do allow writing as UID zero on the host, then yes, you can start doing set UID stuff if you're not careful, for sure. And one thing, for example, you can do, even if you do that, you can, like the ID mappings attached to a mount are completely independent of the username space. So, for example, you could say, I want to ID map this mount, but in this mount, you can't write as root. So the UID zero isn't mapped. So you can't even create any files as UID zero or GID zero. So you could, for example, delegate. Yeah, it's kind of similar. Only that in this case you can't write as any user at all. Like it's basically the kernel tells you fuck off. The kernel tells you to go away because you get the overflow. But that's, for example, what I tend to do if I share specific data from the host into a container. And the container, for example, is privileged. I just don't map UID zero in that mount. And then even a privileged container can't write as UID or GID zero on the host. Yeah, so in HPC we have this kind of trick of like using sec-cop to, like, ignore basically set UID and CGID calls so that people can actually install packages as root as themselves. Have you thought about basically like aliasing everything that's within the isolated user namespace to the user that's created it as far as persistence goes? Well, it gets messy. I mean, every time you try to map multiple UIDs to a single UID, obviously the reverse becomes impossible, which causes a lot of problems as it turns out. I mean, in your case, like what you, you know, I mentioned that you can use Fuse and that's potentially how you could do, how you could do kind of whatever you want. Because at that point, any unprovided user on the system can create one of those, one of those isolated user namespaces. Then they can mount a Fuse file system of their choosing in there and then they could pivot root to that and make the Fuse file system their root file system. And then Fuse can do whatever it wants as far as keeping persistence here of UIDs and GIDs. So if you don't care about that crazy performance, that's probably the easiest way to handle that. Because you could use Fuse as a de facto overlay file system that writes that metadata on the side effectively. So, but for things like NFS and things you would have to proxy everything through this Fuse file system. Did you think about using file system extended attributes to store the UIDs and GIDs that can... Funny you mentioned it, because when we were designing the user namespace back in 2014, that was actually one of the ideas initially. It was like, hey, we could do something out of the start and just use extended attributes everywhere to store that stuff. As it turns out, that becomes really, really painful because not all files and implement them correctly. It doesn't scale. Yeah, effectively it doesn't scale. That's kind of the issue. But it could be used like I said, like with the Fuse thing, that could be a way that you store that. You might use a user extended attributes to just store that metadata that way. I think we're just going to keep on with the slides for a tiny bit and we'll do questions again at the end. Otherwise, we might just run out of time. Okay, can I just ask one? How do you do punching through the namespace map? The use cases using architectural distribution emulation containers, I punch my own UID through and my home directory. So I still have it in the distribution I'm emulating. So technically, there's nothing that prevents you from still using UID map and GID map. So you can use the combination of the two. It's more fun to show with none whatsoever because that's more fun to show as a committee and previous user. But there's technically nothing that prevents you from using a mix of the two to actually fully map a single user's shoe if you wanted to. All right. So isn't that going to be a massive change? That's kind of what we thought. Initially, we're like, well, the user namespace patch sets for anyone who looked at it back in 2013, 2014, that was rough. It needed changes to every single file system. It was absolutely massive when Eric was doing that work. But because the Linux calendar is mostly written in macros, it turns out it's not so difficult these days. So to our astonishment, really, we're looking at a very, very small patch set to actually do everything that was in that demo. And a bunch of it is kind of infrastructure type stuff. And then there's the actual type change. It wasn't so bad. We're not fully done. There are a few more issues that still need to be resolved. It possibly will be a bit larger than that, but not by much. And that's definitely something that's quite reviewable and that is hopefully not going to be scaring people all too much. It shouldn't be too difficult because you actually shouldn't have to touch the... Right, exactly. As far as all the VFS stuff, we've not changed anything. We still do 32-bit, but it's all fine. And for the rest, it's actually just some types that had to be changed and the rest works. I mean, the main thing is kind of like add the boundaries just like, oh, we need to go and pull the user credentials out of the user namespace to figure out what to show. But that's the main thing, really. If you do scan the QR code that gets you to the GitHub repo with the tools that I showed, as well as the link to the kernel tree that was used. I don't know if you put the link to the package as well. If not, we can add that afterwards. Because I did build the kernel I'm using. I built it for Debian 11.12 and Ubuntu 20.04, 20.04. So if people want to play with that, you can totally do it. All right, so what's next? Well, we've showed this work at the Linux Plumbers conference and the next kind of summit back in November. And at the time, the demo wasn't working quite as well because we just had a bad build that day, which was unfortunate. This time, the demo, everything works. We've talked to a whole bunch of people as well. I mean, Christian, I will say, we're very close with Christian, so we made sure that all of the VFS stuff and what I think can make sense. The real next step is going to be sending an RFC to the containers kernel mailing list to try and get some feedback there. Hopefully, Eric is around to actually look at it. He's a bit of a hit or miss here as far as answering stuff. But hopefully, we get to have that reviewed by the username space maintainer. Before we do that, there are a few more things. We want to be able to run kind of normal LXA, LXA, and in-cars type containers with this feature. And for that, one issue we've got right now is around CGroupFS and the CGroup namespace. So what we want to be able to do is create entries in the CGroup tree. Normally, you would then turn them to the right thing, which is a bit impossible. And after that, you would do the unshare of the CGroup namespace, and then you can use that from inside the container. So the impossible part is that we see a bit of an issue, so we're looking at how to fix that. I think that we've bounced a bunch of ideas over the past few days with Alex. I mean, one of them is to effectively do a VFS ID map type stuff on top of CGroupFS. We'll see if Christian wants to kill us when we send out. That's one of the ideas. There are a few other tricks that could be done to make CGroupFS be more aware of the 64-bit thing to handle those specific cases to be seen. But that's one of the issues right now that prevents a straight up Alexi container from just working. There's also still some work to be done around SEM creds, like passing around U creds and that kind of stuff to make sure that this also passes the credential of the creator of the username space instead of an overflow UID. It's different, yes. So there are a few more bits here and there as far as those boundaries are just in to look at and make sure that it's kind of consistent as far as what's exposed. Once that's actually been sent, hopefully reviewed, or hopefully merged, there are a few more things that we would need to consider doing on top of that. The biggest one of those being nesting. So being able to create either an isolated username space inside of an isolated username space, because who doesn't like total other way down, but also being able to create a normal username space inside of an isolated username space. Why? Well, it's the usual reason. The usual reason is someone wants to run their old Alexi, Docker, whatever thing inside of a container. That's such a tough thing we shouldn't do. We shouldn't mix isolated and regular. Isolated should only do isolated and not have any ID mappings attached and as regular user as an isolated user and as I think it's just painful. We have to see just how nasty it gets. I mean, I agree with the main case we care about, but she's going to be isolated and in isolated because that's what we mostly want for testing and whatever. Regular in isolated, we'll have to see just how many people we break and how bad it would be to fix. If it's trivial to do, then maybe if it's a massive budget to make it work, then it's not worth the effort. I agree. And that's it. So we can do more questions. I think there were a few more. So you can only really write to a tempFS unless you start doing the UID map maps. Right. TempFS or fuse. And fuse is kind of magic because you can do a lot of stuff with fuse. Yes. But is there any way to... So I want to use this in make or aside too. I used new UID map and new GID map now and Landerhate said because it's such a UID and everything. It just sucks. So the problem is like writing, making these images takes quite a bit of space. So if I have to do it all in the tempFS, the machine is just going to run out of memory. Is there any way to get the tempFS back by, I don't know, a swap file or something so that I can actually somehow get this stuff? Possibly. I mean, I don't know if the tempFS supports that. Maybe some of the other virtual file systems have something similar. Like can we... Is there like a tempFS backed by file type stuff in the kernel? No, I think so, right? I mean, yeah, right now you would do fuse. And fuse, you can make it write to whatever the hell you want. The question is going to be performance. Yes. Fuse got a lot better because of the work on Vata UFS. They've done a lot of optimizations for that. Yeah, that's fine. Any questions anyway? I mean, we're looking at VFS at like an previous support for a bunch of other file systems kind of underlined, but they're mostly networked type file system which will probably not help you a lot. Yeah, hey, so maybe I missed it, but can you actually figure out from the outside that you are using this isolated? I forgot to show that. Yes, you can. Does that actually work? Yeah. Okay. I think personally that like the issue with a lot of interfaces in the kernel is that you cannot figure out that something is actually happening or not. Yeah, so in here, like you see everything belongs to the open-to-user, but if I look at that use UID 556 and I look at the status file and we look at the UID, you see this isolated UIDs and associated GID in there, which shows you the inside UID and GID. So, you can actually figure it out that way and it gets you all of the different ones like the effective and all of those. So maybe not to you, but a general comment. Can we actually have this in adjacent or something in the kernel? Because there are lots of tools that are parsing this and they are doing it incorrectly a lot of times. Yeah. It's possible because this file gets generated as a SQL file in the kernel. So it just gets generated line by line. So JSON is just too hard, too complex for the kernel. So that's why. Yeah. But you know, we can have Tycho add something to that Libre source or whatever we come up with as far as libraries to also pass that file. Because we're looking at parsing all of those stupid files. I think this file is like extremely easy to pass compared to mountain 4 or CPU and 4, a bunch of the others. But yes. There was concern about a security around username, name spaces and I think about APAMO restriction. Yeah, Ubuntu had fun with our one. So is that helping in some ways? No, it's going to make it worse for APAMO. I think that. It's like a bunch of distros initially were of the opinion that username space is the devil and we should just prevent everyone from using it. Unless you're rude, which cannot affect the purpose to an extent. So this they can have the big hammer turn things off. Then Ubuntu did a really weird thing recently and it's just in Ubuntu kernel, which makes it even worse. Which is that you get to only use username spaces on Ubuntu if you're running from a program that has an APAMO profile that allows it. That has caused a lot of people to stop using Ubuntu kernel. Because that's really it's kind of weird and there's no other way to kind of opt out of this particular feature. It is kind of bizarre. So we'll see this. I mean, it's still going to be a username space. So any kind of concerns around those users and being able to like you're still in a user space. You can still create network devices. You can still access more APIs than you normally can as a community and privilege user. So those concerns are still valid and I expect that for any distro that offers knobs to turn things off will also effectively turn that one off at the same time. The question, the thing, Tara, is this is going to make adoption of username space in other applications for other developers significantly easier. Which may put some pressure on distros to not outright block things because they're going to be blocking actual useful workloads. We have one minute. Yeah, one minute. We can probably do like one more question or people can always catch me afterwards. If you do unshare network in a username space, then that's unshare privilege escalation. Yeah. Because of all the UAFs and network. So just with that last question and if anyone has more stuff, I'm happy to take them afterwards. So in the patch that I didn't see like any test, how much effort have you put into like first thing this or like write test for it and check for intended interactions. So testing is interesting. I don't know if we actually have a lot of username space tests in the camera, which is kind of unfortunate. I think that's something that should be improved and we should probably take a look at trying at starting to get that ball rolling with this one. That would be really good to say. Because like the VFS stuff is very well tested. The username space stuff, not so much. That's our plan to write this test. For non-isolated case and for isolated tool. So we want to write tests for both isolated case and for non-isolated tool. Because we have not so much yet. Alright. Thanks everyone.
Linux Kernel TPM security and Trusted Key updates
So welcome to our next speaker, James will talk about TPM. Please. Hi everybody. This is completely different from the TPM talk I gave yesterday. But for those of you who didn't see yesterday, this is all my contact details. So I've got the usual email, a Fediverse link. I have been working on the kernel for a long time. I began in SCSI. I've gone through doing architecture maintenance for PA RITS. I had a long spell doing containers for parallels. And finally, I've become a reluctant TPM coder. And pretty much everybody who codes for the TPM always describes themselves as a reluctant TPM coder. As far as I can tell, there's no one who actually wants to be known as somebody who actually likes to code for the TPM. All my contact details are here. My blog site is about the best source of information for most of the stuff that goes on in the TPM. Unfortunately, I'm very sort of gadfly stream of consciousness. So you'll find a ton of stuff on my blog that you're probably not interested in, like legal stuff, Android phones, whatever else I happened to be working on last week. But I do try and tag it all. So there is a TPM tag if you're looking for TPM stuff you can go through. It's my matrix ID. And if you need my GPG key, I don't need to do a fingerprint anymore because I do overdain, which is a DNS sec extension for securely actually identifying me from the domain address of my email. This is the exact command I was arguing with Linus over about last week when my key had expired and you couldn't find the new one. So if we go to the basics of a TPM, it's effectively a separate shielded memory processing device. This is what one looks like. This here is actually the TPM chip. And this big thing here is the low pin count bus connector. So the chip itself is really tiny. And often it's actually sold it onto the motherboard. And they've been around for quite a long time. TPM 2.0 has agile cryptography. TPM 1.2 is actually obsolete. But a lot of you in this room will still have TPM 1.2 in your laptops. All you have to remember is don't use it. The actual TPM functions are shielded or asymmetric key handling, which is where you have a private key. The private key is only visible to the inside of the TPM and it never comes out, which is why it's shielded. They can do things like measurements. So you've seen things like EIMA on various slides. That's measurement that we do within the kernel. They can do data sealing, which is effectively for symmetric keys. The TPM is over such a slow bus and it's such a slow processing engine that it can do asymmetric primitives itself, but not fast enough for things like disk encryption. So the way we do disk encryption is if you use a key sealed to the TPM, it's in a TPM shielded blob, when the conditions are right, the TPM will actually give the symmetric key to the kernel and the kernel does all the symmetric primitives, meaning we can do it fast enough. And this is called data sealing. And then the final function a TPM has is attestation. If you're using a TPM for measurements, you have to prove to somebody else the measurements for what they thought that's done by an attestation function. The kernel itself really only does measurement and data sealing. And actually I got a question yesterday from Ignat Kochagin, who I think is here, demanding to know why the hell we didn't do shielded key handling in the kernel, because the kernel has a crypto subsystem, which actually now does asymmetric keys. So it would be a candidate for actually using the TPM functions for that. The truth really is that I've put it on my to-do list, and if we actually get lifetime expanding medical care, I will have time to do it, but otherwise it's probably not going to happen unless somebody else does it. Because my to-do list is about a million items long now, but I'm thinking about it. Let's see. Oh yeah, the reason you shouldn't use TPM 1.2 is because it had SHA-1, which is now a fully compromised hash. TPM 2 generates what's called an internal seed instead of a key. So usually with TPM 1, when you turned it on, it generated an RSA, public-private key pair for its root key. TPM 2.0 doesn't do that. It generates what's called an internal seed, which is just a long string of random numbers. And then what it does is it uses a key derivation to go from that long string of random numbers to an actual key pair. And the point about a key derivation function is it takes one number as an input and outputs a key pair. And the point is that as long as you use the same number as the input, it always outputs the same key pair. So for RSA, this means that what you're actually doing is trying to find primes, which actually makes the thing really slow. So this is why we tend to use elliptic curve keys with TPMs rather than RSA keys, because they actually have difficulty finding it. One function actually the kernel uses the TPM for is secure random number generators. All TPMs have to have a cryptographically secure random number generator, because one of their usual functions is actually generating key pairs, generating these seeds. And so we actually use, on boot, the TPM's random number generator to add entropy to the kernel entropy pool. If you re-own the TPM, it actually changes its storage seed. And this is useful because a key is hanging off the root, the storage key sits at the top of the hierarchy. Every other key you use in the TPM, sealed data, hang off this. And by hang off, what I really mean is the encryption wrapping for that key, when it's reduced to, comes outside the TPM, is a function of that storage seed. If that changes, all of the wrappings are unreadable by the TPM, and effectively all of those keys are shredded. So every time you re-own a TPM, you destroy all the keys that were created for that TPM previously. And so, like me, you use your TPM for, you know, not quite hundreds of keys, but definitely tens of keys. And, you know, I'm just giving away my laptop. It's a single operation to shred all of those keys as though they never existed. Keys to be inserted into the TPM are based on this parent thing. Actually, that's pretty much what I just explained. And only the TPM can decrypt them. So this gives, and the encryption algorithm for TPM 2 is AES 256. So it's got 256 bits of security as long as AES isn't compromised, it should be pretty good. State of play is that the TPM itself is really hard to use and really hard to program. We currently have two completely different library implementations. They're actually based on two different TPM standards. Both produced by the TCG, the Trusted Computing Group, but at different levels. So the Intel library conforms to the sort of upper library specification, and the IBM one conforms to the lower TPM library specification. But they're both in user space, so when I did the kernel coding, I couldn't use them anyway. So this is why I'm a really reluctant TPM coder, because effectively I had to rewrite all of these library functions. Now, I don't really want to get into which standard is better, war, but I actually followed the same standard the IBM one used, because for low level functions it's much simpler. And if there's one thing I can't stand, it's being overly complicated, because it just generates scads of code and I really don't have the time for it. But that means I also had to rewrite all the cryptographic primitives in this. And so I have a TPM session handler file that actually contains all of these cryptographic primitives for the TPM to use. One of the useful things to go back to the original question about shielded key handling is that with all of this cryptographic crap in the kernel now, it will actually be much easier to write shielded key handling for the kernel crypto routines, because they can piggyback off all of this, which is going to be quite useful. So the kernel TPM currently is used as well as the random number generator. It's used for I'm a measurement, entropy seeding, and this thing called trusted keys. So trusted keys are effectively TPM sealed data blobs that you can pass into the kernel, and the kernel will give you back as a TPM sealed data blob if you really want. And the good thing is that Lux disk encryption can actually use these sealed data blobs. The reason you'd want to use this is for key protection, because if you keep a TPM shielded key in user space, and at some point you're going to pass it into the kernel along with its authority, the point is that key is only ever unwrapped in the kernel, which is the most trusted entity in an entire Linux system. It means you don't get it unwrapped in user space and then have the risk that if you get a user space compromise, they can run off with your disk encryption key. In order to get at your disk encryption key, they actually have to compromise the kernel. So it improves the security boundaries. And the kernel itself exports a TPM device to user space, and all user space programs use this to send commands over. The original device was Dev TPM 0, but TPM 2.0 requires something called a resource manager. It actually technically virtualizes the keys in the TPM, because a TPM 2.0 has a really, really tiny internal key store. It can actually store three keys only. And so if you have multiple users of the TPM, each trying to insert their own keys for use, you'd rapidly run out of memory. And so we have a resource manager that basically allows one user at a time to use the TPM. And they don't see it, but behind the scenes when we switch between those users, we just take all of the keys out and store them in memory, in the computer's main memory, so that every user effectively gets an empty TPM to insert their keys in, so we don't run out of keys, key slots. And this has been built into the kernel for a long time. It's called Dev TPM Resource Manager, RM0, and pretty much every TPM library uses this to communicate with the kernel, so they don't get out of memory key clashes. And it works just fine, except for the Intel TPM people complain about it doesn't do session de-gapping. It's also on my list. I will get around to it, provided medical science comes up with a stretch for at least two lifetimes. Or somebody from Intel could actually submit the patches and we just do it. The only reason the current... So this session de-gapping means that you can't context save sessions for a long period of time, because if the TPM runs around what's called its sessions clock, it hits a gapping error and nothing works again. The way around it is you actually have to save and restore the session. We just don't do that. We could, but I mean nobody's written the patches. So TPM security. There's lots of sensitive information going on to the TPM. So if you're concerned about cryptographic randomness, the random number we got from the TPM should be a secret. If anybody snoops that, they can figure out what the kernel's entropy pool looks like, and therefore all of the secrets that it was generating itself. If you're doing data sealing, the data will come back to you over the TPM bus in raw format, and anybody snooping the bus will see the key you sealed, which is pretty bad. And the point is that you cannot necessarily be assured of a secure channel to the TPM. Most of them sit on this low-pin count bus, and attacks actually exist that snoop this bus. So a Canadian company came up with a little dongle that you just simply plug into a laptop, and helpfully most laptops have a little LPC slot in them, so you just slot this thing in, and it will snoop the TPM bus. Pretty bad, but it's an evil maid attack in theory, because you actually have to have physical access to the laptop to see it. And so I'd probably notice if someone were actually trying to insert that in my laptop. The problem is the LPC bus contains a lot of weird and wonderful devices, like keyboard, mice, other weird things, that are actually programmable. And one of the fears we have is that this attack has been generated, and you could actually, instead of having to plug a snooping dongle into my LPC bus, just reprogram one of these reprogramable devices to do the snooping for you, and what we think is a local attack suddenly becomes a remote attack. So securing the TPM bus has been priority number one for pretty much everybody doing TPM work for quite a while. The problem is that the way the Linux kernel currently uses the TPM is completely insecure, so we're vulnerable to these snooping attacks. And this is done by using something called TPM sessions. It makes handling the TPM in the kernel much more complicated, but I have written all the patches, I'm just trying to get them upstream at the moment. What it really does is just an ECDH, elliptic curved if you helman key exchange with the TPM, and then pretty much everything goes over an encrypted channel using this. I mean, it's not difficult to describe, it's just difficult to do. And one of the useful things we get once I add sessions to the kernel to do this encryption is we can also use the sessions now to do key policy. Key policy is actually a very interesting thing that I'll come to later, but it does make the whole thing way, way more complicated. And like I said, the kernel is currently insecure when it uses the TPM. And these patches I actually did first write in 2018, so they have technically been around for almost six years. Trying to get them upstream has been a long, long slog. I'm hoping that the end is currently almost in sight, or at least the TPM maintainer has been making positive noises, which he hasn't for the last six years. So, you know, if the wind's in the right direction, we might get this. One of the patches I got up early was standardizing key file format. So the format that the kernel uses for TPM keys is exactly the same as the format that all of the user space tools use. So you can actually use any of the user space key sealing tools to generate a sealed key for the kernel, which is really useful, because now we don't have to have religious wars about what tool is best. Anybody can choose any tool. And the idea behind this is that since we've had all of these schisms over what TSS, what library should you use, as long as the key format is the same, I honestly don't have to care about all of this crap. So, you know, anybody can use any tool, any tool chain. You know, you can be partisan for the Intel one or the IBM one. It doesn't matter. They'll still produce keys in the same format. It will still just work. This is great. And all producers and consumers except System D have agreed to standardize on this. And the key standard is actually sitting there. Sorry. Should have put my phone on to do not disturb. That was bad of me. And this standard is actually patchable. So if you see something that's missing or you want to use, you just send a patch to the... It's actually the OpenSSLTPM Engine list, and I'll just add it if it looks useful. And we're hoping it will eventually become an RFC. Becoming an RFC is very difficult. A guy called David Woodhouse, who's also a kernel programmer, is currently going through the pain of doing it for a different standard. And I'm just waiting to see what happens to him. And if he comes back in less than three pieces, I will try the process as well. So like I said, the kernel trusted keys already use this key format, so we're fully interoperable, which is useful. And one of the things this interoperability gives us is actually the ability to seal kernel keys without having access to the kernel's TPM. This is actually a function of a TPM called import and export of a key. So as long as I know the public storage root key of my TPM, I can actually create a key that's cryptographically sealed and the TPM itself can import it, and I don't even need access to the TPM to create the key. Because usually the format of the TPM key is very specific to the TPM. You can't create it because it's symmetrically encrypted. So these exportable keys actually have to be asymmetrically encrypted with the public-private key pair, making them slightly more difficult to use by the TPM, more complicated, more difficult to unwrap. But it gives you the advantage that if you're, let's say, your use case is in the cloud, and one of the things you're trying to do is release keys to a cloud TPM, you can just in your key release program say, okay, so the storage root key of this TPM is this. I'll wrap the key to that, just hand it off to the TPM, and the kernel will just boot. So, future patches. What are we going to do with this TPM in future? Once I've finally finished getting session encryption upstream, which is sort of the number one priority, because it's really hot right now that the kernel is the only thing which is insecure when using the TPM. And the first one is going to be key policy, which is a natural follow-on, because I already had to put all of the session handling code and the TPM to do encryption. Policy is just a fairly simple add-on to this. So the patches are only, I think, my current cryptographic patch for the kernel expands over a thousand lines. The policy patch on top of this looks like a couple of hundred lines, so it should be pretty easy to do. Once we have policy, the nice thing we can then do is create keys that can never leave the kernel by policy. And the way we do this is by using a feature of a TPM, oh, can do this, so we can use a feature of a TPM called a locality, which means that the TPM actually, TIS TPM has registers mapped at several different locations. When you use a location, each of these locations is called a locality. When you use the mapping of that locality, the TPM is sits of that locality. The way you're supposed to use it is just to block out the mapping so that nobody can use it, and the TPM is sealed of that locality. For the kernel, it's really easy, because all user space uses the kernel device. They can't talk to the TPM directly, so all user space only talks to the TPM at locality zero. The kernel ensures this. And if the kernel talks to the TPM at a different locality, we know whether your kernel or user space by the locality you're talking at, and we can have a policy that says this key can only be unwrapped at kernel locality, which means if I try to unwrap it in user space, the TPM will give me an error. So effectively, it gives you a key that can never actually leave the kernel, which is another useful property for security boundaries. The problem with this locality scheme, we thought it was brilliant and easy, and actually, I'm not the only person who's pissed off about this. The other thing that I think is really garrulous about is Intel locked all of the localities apart from zero unless you do a trusted execution launch, TXT launch, which is really, really annoying, because pretty much no one in the world does this. So we all have these TPMs that have the locality shut off, and nobody's using TXT, and you just look at Intel and go, well, nobody uses this crap feature. Can you just unlock the localities because it serves no purpose? But unfortunately, we can't get to it. Today, if I get... Yeah, I should get around to the demo. I blew through my demo time yesterday. I'm actually... I have a kernel that operates at locality 2, and I will demo it, but pretty much it won't work on any of your laptops, and we have to use alternative means for sealing the keys. So the alternative way that I'm thinking of doing this is to reserve a range of NV indices for kernel access only. Because you can only communicate from user space to the kernel using the TPM device, we can snoop the commands, and if you try and access these NV indices, we can just say, no, you can't do that. So effectively, we're behaving like a locality, but for NV indices. And then what I can do is I can get the kernel to take one of those NV indices and just put a random password in that index that's known only to the kernel, and then seal a TPM key that says, only the person who knows this NV index password can unseal the key, only the kernel knows that password, only the kernel will be able to seal the key. So even if we can't get Intel to cooperate on localities, which would be the cleanest and best way of doing this, we have alternative ways of actually enforcing that only the kernel would be able to unwrap the key. And this hopefully will be in future patches as well. And let's see. Key policy. TPM2.0 really supports a very rich policy language. I'll demonstrate some of it. It can have policies on time of day, reset count, you know, how many times this thing been rebooted, what are the values of certain PCRs, what are the secrets embedded in this object, what is our value of an NV index, and so on. This policy can be both AND based and OR based, so you can have these sort of huge lists of do this and this and this or do this and this and this or etc. Problem with this policy is it's described by a single hash value, which means it's burned into the key at creation time. And if you think about the way you might use policies if locking to PCRs, one of the things you do is I only want a certain set of kernels to boot and then unlock my root key, so I'm trying to lock the policy to the PCRs of the kernels, but I can't predictably know in the future what the hash value of a kernel will be, you know, because it hasn't been created yet, so I need a way of actually updating the policy after the fact, and this single hash burned into the key doesn't quite cut it for that. But fortunately, the TPM actually has... Sorry, this is all about hash construction you don't need. The TPM actually has a thing called policy secrets, which allows you to actually execute a signed policy and add it after the fact to the key, because the burned-in policy says, and any other policy that is signed with this key you shall accept as well, and then you just keep adding these signed policies to the key, and hopefully I should be able to demonstrate that as well. So, let's see, I have 15 minutes left for a demo. Let's see if I can actually do this. So, let's get that out of the way. So, this is... Everybody can read that. I don't think I can blow it up much further. Start the... So, in user space, I'm going to start the... TPM server, because I need a TPM for this, and then I'm going to go into a UFI TPM-based boot. So, this is actually booting a kernel, hopefully. And if I just check, yep, it's actually communicating with the TPM. It's always wise to check these sort of things. Okay, so that's my thing. I can log into it as root. So, there we go. I have a TPM that I'm emulating. This means that I have a non-standard patch here, because the kernel that I'm booting actually has these locality patches in it. So, the TPM I ran is actually now running at two separate localities. The user space of this kernel will be in locality zero, and the kernel is actually in locality two. So, I can demonstrate the keys that can't be unwrapped, except at the kernel in locality two. And let me also get a... So, this is a user login as me to this thing, and this is me... So, this actually just gives me a user space login to the software TPM in my home directory here. And so, what I can try and do is actually, if I get my demo scripts and remind myself what I was supposed to be doing, let's just do a very simple... So, this will just take a piece of data and seal it for the TPM. And, sorry. The data is actually 32 bits long. The problem with the TPM sealed data can be anything between one and 128 bytes long. But the kernel trusted key subsystem is assuming that you're passing in AES keys. So, it expects a key length of between 32 and 64 bytes. So, I just did that sort of randomly. If I go back to the user space login on the root, I should just be able to... Let me... Before I do this... We have to have trusted keys actually working before I can actually do them. And then... Oh, yes. There is another problem with the trusted keys. So, when I created that key, I created it... as effectively a PEM file because it's really used for the cryptography systems for OpenSSL. The kernel key system doesn't read PEM files. So, one of the things you can do is convert this to a dir file. So, if I did... If I looked at the dir file, that's the standard ASN1 format of a TPM key. Very easy. Problem, the kernel doesn't read these either. The way the kernel key system works is it actually wants a hex dump of the dir file. So, I have to do a reverse hex dump into a key file. So, that string of binaries is effectively exactly the same as the dir file. But this is the file that I now actually have to pass into the kernel. And so, inside the virtual machine, I can actually... I've inserted that key into the kernel. And the fact that I got a number back means that the... I can just look at the user key ring and you can see that I've got this key actually inserted into the kernel. So, I wrapped a key outside the kernel and I inserted it into the kernel. The usual way you do this is you actually get the kernel to create its own random key and then pass it back. And the problem with this format is the only way the kernel will pass the key back to you is by this thing called pipe. And for reasons best known to the kernel, it only pipes back hex strings. This is why you have to do the stupid conversion from PAM to dir to hex string, which is sort of one of the most annoying things of this. But let's get on to demoing a key policy. So, I created the key here. But what I'm now going to do is seal it to a PCR. And there's a very useful PCR, which is PCR 16, which is never used by anything because it's... Sorry. We'll use the SHA-256 on 16. It's always...it begins life as zero, nobody ever uses it because it's resettable. And resettable means I can wind it back. PCRs aren't supposed to be able to be wind back. But if I seal the key to this... Sorry. That's a PCR lock. So now I've locked this key to that PCR. I can actually unlink the... That key and I can just demo that it will re... Sorry. Forgot to go through the conversion dance again. Right. So I can demonstrate that I can load this key. But now what I'm going to do is extend that PCR. As you can see, the user space options for these commands are very friendly. And, by the way, this is the same... The same thing is true of the Intel and the IBM TSS. They're both equally unfriendly, just in completely different ways. But anyway, I extended this PCR and the point is, if I got this right, I'm not allowed to load the key anymore. And if I look at what the kernel said, it said trusted key unsealed failed because of a TPM policy failure, because I've extended the PCR. So what I can do now is actually... Let's demonstrate some signed policy. So I'm actually going to create a key that is now sealed to a public policy key. This key would insert. I'm going to lock it to the PCR. Which has a weird value. I'm going to do the conversion dance again. And this key I should also now be able to insert, I hope. Yep, so the key inserted. Now what I'm going to do is unlink it. I'm going to move that PCR on again. So I've now spoiled the value of the PCR, try and load it and it again refuses to load because it has a failure of the policy, because I just moved the PCR on to a different value. But now what I'm going to try and do for this key is I'm going to add another signed policy that locks it to the new value of the PCR. So this is adding the policy after the fact. I burned the policy into the key, but I'm now adding an existing signed policy to that key with the new PCR value and conversion dance. This new key should actually now load. There it goes. So I took a key that I created earlier. I didn't know what the PCR value would be, but thanks to signed policies I was able to add a signed policy to that key that now accepted the new PCR value. This is how you would keep the keys up to date with the state of your laptop. So the final thing I'll demonstrate is a key that can never be unsealed except in the kernel. So let's... So I'm just going to create a new key. I'm not going to bother with any of the policy stuff. I'm going to lock it to... It looks like locality 4, but this is a bitmap. So it's actually locality 2 only. So the way you lock it is just all the complex things. All the complexities of the TPM. But the point about this key is if I try to unseal it in user space, which is where I am now, it gives me a locality error because I'm not in the correct locality to use it. And now I just do the conversion dance. And if I try and insert this... I have to unlink the previous one, don't I? If I try and insert this into the key, into the kernel, it works because the kernel is at the right locality too. So this is finally a demo of using localities to protect keys because the data I seal to that key, I can't get back. And the point is if I got the kernel to generate this key and I'd sealed it to a locality, it can give me the key file back, but nobody can unwrap that key file. That key file can only be unwrapped by the kernel. It adds an additional layer of security to your disk encryption keys or something. So with that, I think we'll probably go back to the presentation demo over. And I'll just come to brief conclusions, which are the kernel TPM subsystem is evolving way more slowly than I'd like, but at least it's evolving. And we hopefully will eventually get at least to encryption in the kernel, which means that we should be using the TPM securely, and hopefully shortly after that, sort of policy and all of the other wonderful bells and whistles. And with that, if you like this presentation, it's all a web page using impres.js, which of course makes me a web developer, which is not something most kernel people admit to. And I'll say thank you and call for questions. APPLAUSE No. What a recording. So, James, I wonder if you can say a few words about the interaction between the TPM and virtualization and containers. So how does it work to keep, like a hypervisor, from seeing the keys of a virtual machine or from two containers, seeing the keys of one another? Well, so the question was, what do you do about TPMs and virtualization systems, both containers and virtual machines? Well, obviously, you saw my demo was using a software TPM. The traditional way of actually running a TPM is that you trust the host of the virtualization system. So the virtual TPM that I was running was running in the host, and then you make contact with the virtual machine. The way I was doing it, I was actually using a patched version of QMU because I don't quite like the Vproxy TPM that we have. But that's just because I don't like running the software TPM because I need to run the TPM reference implementation. The way you're supposed to do it is you have a bank of TPMs running in the host, one software TPM per virtual machine that needs to use it, with the ability to save and restore their states. The state can also follow the virtual machine image, and this means that you have a virtual machine that actually can use a TPM. Now, you alluded to the fact that if they don't trust you, the same thing works for a container as well because containers can also communicate with the TPM through basically by mounting the TPM device, which is quite easy as well. So you can do all of this if you trust the host quite easily. If you don't trust the host, you're in a confidential computing environment, you have a lot more problems. But the way we're actually trying to solve this is to put a virtual TPM implementation inside a layer of a confidential virtual machine that is protected from user space. So currently only AMD SEV does this. It has these virtual machine privilege levels. And when it boots up, it starts an SVSM, something virtual machine service module. I always forget what it is, secure virtual machine service module. And inside that is a TPM. This TPM effectively initializes itself each time it boots. So when it powers on, it has a different seed. But the public keys of that TPM are part of the SVSM attestation report. That's what binds the TPM to the confidential computing attestation report. And then the user space of this virtual machine can use the TPM running securely in the virtual machine as its own TPM for measured boot, key handling, everything else. And because it's running in a protected area of the SVSM, neither the kernel nor the user space can actually get at the contents of that TPM. It's fully protected from them. And it's inside the confidential envelope, so it's completely protected from the host as well. So effectively we can run a software TPM per virtual machine inside the confidential computing envelope. And Intel TDX is also coming up with something like this. So we should have a solution that works regardless of confidential computing technology. Does that answer the question fully? I think I understood it was pretty specific. Okay, sorry. Ignat is asking, can you wrap a key to a TPM endorsement key? Can you wrap a key to a TPM endorsement key? So the question is, can you wrap... This is a very technical thing about hierarchies. The answer is, of course, yes. Every hierarchy can support keys. But the usual... So a TPM actually has a split permission model. So usually TPMs have four hierarchies. They have the endorsement hierarchy, the storage hierarchy or owner hierarchy, the null hierarchy and the platform hierarchy. So the platform hierarchy is always owned by the firmware and you can never use it. The null hierarchy actually changes its seed every time it boots so it's not useful for key sealing. But the other two are the endorsement and the owner hierarchy. The problem is when you take owner of a TPM, you get ownership of the owner hierarchy, the storage hierarchy, not the endorsement hierarchy. And in certain TPMs, the endorsement hierarchy is designed so that it will only work if you know the endorsement password. And for some split TPM implementations, the owner might not know the endorsement password. So what you're asking me is, yes, it's theoretically possible, but there can be ways that the TPM is set up that it won't work because you don't know the password to do the key insertion. Okay, thank you. We're out of time. Okay, thank you very much.
Linux Matchmaking: Helping devices and drivers find each other
So we're ready for our next talk, so we need to be a little bit more quiet. Ahmed is going to be talking about Linux matchmaking, helping devices and drivers find each other. Okay, thank you. Yeah, my name is Ahmed Fatoum, and I will talk to you about how devices and drivers find each other. I am an embedded Linux developer with Pingotronics. We do kernel and bootloader porting, system integration, graphics stuff. And I think we often do is update kernels or bring up kernels on new devices and run into the problems that happen when we do that. Often we have some kernel patches because not everything is upstream yet. Sometimes we have multiple topic branches. We use a tool called umphazet, but in the end we have Git tree that we built. Maybe if we do a kernel update we have an old config, so we learn make old config to get the old config to a new version. Then we type make to build the kernel. Then they deploy the kernel somehow and sometimes it works and sometimes it doesn't and the kernel hangs at boot. And then you need to debug why that happens. So if you are doing a kernel update and you have a known good, don't waste time, just do a git bisect if you have network boot or something quick to test new versions. And when you have the commits that cause the regression, you can reach out to the author or you can discuss it on the mailing list or you can read it and try to understand why this caused a problem for you. Maybe it causes problems for others too or maybe it's just a problem in your configuration. But if you are doing a new kernel bring up or you are moving from a much detached kernel, for example a vendor fork, you often don't have a known good that you could easily bisect between. And yes, that's what my talk is about. How would you debug early driver issues? Here is an example breakage that can happen that a colleague run into. So he updated the kernel. Here and make old dev config or menu config. So make old config will prompt you for all the new symbols or new configuration symbols. Do you need that driver? Do you need that driver? There is a lot of there and if you do old dev config, you are not prompted, it just takes the default. And yeah, after he did that the kernel no longer booted or some device didn't was not functional. And with a git bisect eventually you would have found this commit to be the cause. And what this commit is doing is that's renaming a symbol. So previously we had MFT RK808. This is a driver for the power management IC on that board and that was renamed to RK8XX which makes sense because as you see the driver supports 805, 808, 809 and so on. And so it's a bit confusing to have the driver called 808 but it supports much more than that. But the problem is a Kconfig doesn't track such renames. It's from Kconfig's point of view. Your old config has an RK808 that doesn't exist anymore. So it's deleted. And the Neo kernel tree has an RK8XX that's by default off. So if you just do menu config this driver will get lost. And because that's a power management IC and everything depends on it. So if you need to drive a USB stick and output 5 volt, it has a dependency on that PMP MIC. If you want to use a higher speed mode on your SD card and need to lower the voltage, it has a dependency on that PMIC. And all these drivers that have a dependency on that one driver will fail to probe and you might not even be able to boot your system. And yes, you need to debug that somehow. If you know the culprit commit like here, the solution, the resolution is evident. You just open make menu config, you enable that one config option and then you are on your merry way and you can probe your driver and everything should work as before. But what I want to talk about is what if you don't know what's the problem. You have a system that's stuck on boots or you have reached user space and graphics don't work and you want to know so what happened. The kernel always be, of course, kernel bugs, but a lot of people use a kernel and if you run into an issue and you are not really the first one to try it out, it might be something in your configuration, maybe just a driver that's not enabled or maybe something that's specific to your board. And to be able to debug that, you will need some insight about how Linux does this early driver initialization step. So I will start with that and then talk a bit about the problems that can happen that early. So Linux device driver model is what matches devices with drivers. We have three main abstractions. One is the bus type, which is what sits between your device driver executing on your CPU and the device that you want to talk to. A bus type can be something like PCI or USB or MDIO or something like that. So it's meant to reflect an actual hardware bus and that's the software representation of it. Then you have the struct device driver, which is a data structure that has the entry points into your driver. And then you have the struct device, which is what the driver operates on. So you can have the same driver operating on multiple devices because for example, my laptop here has two USB controllers, one on that side and one on that side and one of them has also internal connections. So these are completely identical, but they are just at different places on the bus. You can see that here on that diagram. You have one PCI bus, there are three devices on it. Then you have a second PCI bus that hangs off the first one through a PCI bus bridge. And each device of these optimally has a driver that will try to bind to it. And this is done by a series of function pointers. There is a lot of entry points for device driver and so on, but the three function pointers that are interesting to us are match for bus type. This takes as an argument the device driver and the device itself and has to determine in a bus specific manner if they are compatible or not. So for PCI, where each device has a vendor ID and a device ID, which are, so if you are a vendor, the PCI sync group will give you a vendor ID if you pay them and then you can assign device IDs and then you can write drivers or people can write drivers via hardware that match on exactly that vendor ID and device ID. And here, once a match, once a bus determines a match, it will return to the driver call some positive value, a one, and then the device driver probe will commence. So the probe will take the device, will try to determine if that device is really what it's after, initialize it, and then you have a remove step to undo that probe. The struct device also has function pointers, but these are not interesting to us that early in the initialization stage. Here is a short quote example how it looks like for PCI. So you have a struct PCI device. This inherits struct device by embedding it and then adds PCI specific stuff. So it has the vendor ID, 16 bits, it has device ID, it has resources, so PCI devices have IOPods or memory mapped regions that are associated with it. That's all struct PCI device. Then you have a struct PCI driver that similarly does extend struct driver. And additionally, it has this ID table, which is a list of all vendor and device pairs that are matched by that driver. You see that structure at the end. And you can add to it also a driver data unsigned long where you can encode stuff so you have a... ... Usually you have the platform bus, which is... There is a catch all in the back of chips bus because it does a lot because the bus itself doesn't do anything for you. The bus itself is usually just memory mapped and it doesn't tell you what device is where. It's not like a PCI bus where you can actually ask the bus to enumerate the devices and report them. You have to have some sort of hardware description that tells you what is where so you can actually use it. How that hardware description looks like is on the right. You will have this compatible, which tells you I have a device that's compatible with that. It has that address and it has these resources, these interrupts, these clocks and so on. And then the platform bus will take this description and try to match it with drivers who also list these compatibles. But that's just one thing the platform bus can do. It can also match on ACPI. It can match on strings. It can match on the driver name itself. It's where you throw everything in that doesn't have a proper bus. That's a platform bus. And yeah, once any bus, platform bus, PCI bus and so on finds a match, it will call the driver probe function. So the name of the function is a bit of a historical artifact. Normally if you have already done the match, the device will probably be what you expect. So you don't really need to probe the device if it's really yours. But in the past, for example, with the Superio chips on x86, they usually had the same IOPOD. And if you write the values appropriate to one of them in tools, in tools, registers of another, you could break it. So they had like schemes where you need to enter a password into a magic register two times. And then you read from another register if it's really the device that you expect. And if it's not, you can return an error code, you know, def or you know, such device or address. And the driver call will try something else, but that's usually nowadays a buck if you return in a def. Usually you don't intend to return in a def, but some other return value. So the return values that are relevant either the driver is happy, it has got a device, it claims its resources, it returns zero. After registering it with some other kernel framework, for example, if it's an Ethernet interface, it will register a net def, call register net def to register a network interface that can be later called to interface with the device. And that's an uninteresting case because everything works here. What's interesting case if it returns an error code. And that's usually what happens when your boot is stuck, sends kernel bugs of course, but if your boot is stuck, it's usually some driver that didn't want to probe. And that's usually because device dependencies because a device probe is just one little part of a function, but each device has, especially on a system and chip, has a lot of dependencies on other devices. And yeah, if you have like eight dependencies, maybe one of these dependencies is missing and that's propagates up and kills the possibility to probe anything that depends on it. This was the case with that PMIC example that I had in an earlier slide because that was not available. Everything that depended on it, for example, USB or SD card controller, didn't manage to probe because the dependency couldn't be satisfied. And there are a whole lot of these dependencies and they are used at very different places. So there are generic dependencies like PIN control. A chip usually has more functions than PINs available, so it needs to mocks the PINs into the correct states. The Generic Driver Core will do that for you. Also the MA configuration. Then the platform devices will also do like initial clock assignment if you need to ramp up the rate of a clock or you need to reparent a clock. If you have a power domain which are like power is inside the chip, it needs to be powered on so you can actually talk to the device. And then there is a whole lot of stuff that's device specific that is inside the probe function of your driver which is requesting clocks, multiple power domains, GPIO, resets, files, or like the supply of the PMICs that we just saw in that problem. And the problem is when the device driver probes, it expects these resources to be available. And if it's not available, it just can't progress. So if it's like a reset line, you really expect that reset line to get the device out of reset. You can't continue usefully often without having that resource. So what the kernel tries to do is to probe the dependencies that you require first before probing something that comes later. So we want the PMIC to probe very early and then later on we want USB to probe, for example. How that used to be done was statically in the build system using init calls and makefile ordering. So I don't know if it's a bit too small. No, it's okay. So yeah, there are init calls which are the different stages that the kernel will run its initialization code from and these are synchronized using these sync stages between them. And if you do something in a subsystem init call, you know when something in a device init call runs, it will be available. And so you can place dependencies in a subsystem init call, for example. But the kernel uses a lot of these init calls for itself. There are not enough to represent all what the kernel needs. So what was instead done was that the order in the makefiles were used. So the kernel will work all directories, collect object files, and the order in which the object files are collected is the order in which they land in the linker list of init calls and that's the order where drivers are registered and if the devices are available, that's the order where the devices are probed. And so you have makefiles that still have some stuff like regulators early since some subsystem might rely on them or DMA. DMA is very important, so do that extra early. But that of course breaks down once you have a dependency that goes into the other way. Like here is a power domain driver that requires power supply, requires a regulator and power domains are added before regulators. So in this case it breaks down. You can't have one order that is okay for everyone. You can do that in a simple microcontroller or on a board that you know beforehand. Yeah, I could do that in that order, but any more complex system you will have maybe cycles even or you will have stuff that you can generically say because you can have plug-and-play and so on. And you can have one fixed order that always works. So you have to have a system that is only avoiding the problem of requiring resource that's not available in the kernel since 2011 or something. Also does detection that this happens and tries to re-probe at a later time. This is done with a special return value. It's called E-probe defer. It has a value of minus 517. It's interesting because all other return values are smaller numbers because they start counting from E-perma at minus one and then go maybe, I don't know, minus 100 something. But minus E-probe defer is like over 500 so you can more easily spot it. And this is never reported to user space. It's only for internal use by the kernel and it's what the kernel driver uses to tell the kernel driver core, please try me at a later time. The kernel driver core will go through the exact same motions. It will clean up resources but instead of marking that the driver probe has failed and there is nothing we can do to fix it, it will instead add it to a list of deferred driver probes that will be retried later. So if another device fails, succeeds to probe, then the kernel will try, can try again to see if now every requirement for that device is there so it can attempt a reprobe. And once it runs through the whole list and no new devices appear or no devices in that list manage to probe, it knows there will nothing happen anymore. And in that case, yeah, you can't boot but in the case that you have drivers that will bind later or you have maybe a cyclic dependency here that will help you because the stuff is being retried. And how that looks in driver code, here's a small example. This is getting an interconnect and USB 1 and then you have, you say get me a regulator or get me an interconnect. And you check if it's an error and then if it's an error, you return it. So if it's an error that you can't recover from, you return it and then the driver calls now, okay, that won't work. I won't try this again. If it's an e-probe defer error, it will just be propagated and the driver call will try at a later time and you are responsible to do cleanup. So you must take or the driver or the author must take care to clean up all resources so the driver probe can be attempted again. What you often see is this check that the error code is not e-probe defer. That's because e-probe defer is an expected result. So if you had a driver that couldn't get it resourced on the first time and it would say it couldn't get interconnect and it would work later on the second try, that would confuse a lot if you had error messages that are not really errors. So often you check for e-probe defer and only write the error message if it's actually an error and not e-probe defer. And you can see all this deferred stuff in the CISFS. There is a file there, CISCURNELDebugDevicesDeferred if you have debugFS support enabled and it will list all devices that have the probe not done yet. And in the case that you don't reach a shell because you have some dependency of your root file system missing, after 10 seconds if you have config modules, the kernel will time out and print all devices that it couldn't satisfy, that it couldn't probe because of missing dependencies. And then here in this case I am missing an interconnect driver and everything that depends on that interconnect driver, the power domains, the USB files, the USB itself will defer the probe. But you don't actually know why it deferred the probe. If you look at the slides before that, we had an error message that says couldn't get interconnect, but we didn't want to print it the first time because we didn't know maybe in the future it will be really probes that we can satisfy. But at the time from e-probe defer, you want to print it only on the last e-probe defer when the kernel gives up. But because we check here for e-probe defer that error message is lost, which is why for a few years now we have dev-er probe as a function in the kernel, it takes advice and error code and an error message. And if the error code is equal to a probe defer, it will just store the message in the device. And if it's unequal to a probe defer, it will just print it directly. And with that you get actually reason why deferred probe has happened. So here you see how it looks like in the debug FS. It will tell you block control is not ready. And then if you look for block control, it will tell you fail to get knock entries. I wouldn't know what to do with that error message, but at least I can search for it. And then I will see in the kernel source, okay, it tries to get an ICC. Yeah, I will search what's an ICC and then I will see it's an interconnect. And then I will look in the device tree, see there is an interconnect. And then I know, oh, maybe I need to enable an interconnect driver. And this would be a lot more cumbersome if I didn't have that information. And 6.8 is the kernel release that's currently being stabilized. And since 6.8 Rc1, these reasons are also printed to the kernel log. So in the case that your system doesn't manage to boot, you get the same error messages. Before that, yeah, you had to like start an init.id or something and mount it there. But you don't have to do that anymore. And yeah, if you have devaprope, that's an easy thing. If you don't have devaprope, you need to trace a bit and try to find out what's the last call that's failing. I pasted here some of the stuff that I add to the kernel command line from the bootloader. I add some kernel options to try to zero in on what's the problem. Earlycon is a useful thing because many drivers have a separate earlycon implementation for outputting a character. And that can be used even before the normal kernel driver is initialized. And you even use that because of a serial driver while it sounds very simple. It has resets as dependencies. It has clocks as dependencies. It might have even a power domain as a dependency. And you don't want to wait for all that to initialize before you can actually see something on the console. And earlycon sidesteps that because with the assumption that the bootloader has set up your serial console, the kernel could just keep using that set up serial port. And yeah, and later on when a real console with the driver model is registered, it can take over. So at earlycon and set standard outpath in your device tree so the kernel knows what to what console to use earlycon on. Then ignore lock level will get a lot of output when you enable debug stuff, but it's better than no output at all. So I just say ignore lock level and then filter it on my side with init call debug. You can print out the debug as they happen. This is useful if the kernel gets stuck at the moment. Maybe then you can see what is the last init call that the kernel called or the last few init calls if it's done in a multi-threaded manner. With dynamic debugging, if your kernel is compiled with config dynamic debug, your kernel will have all the debug strings built in, but it won't output them by default. But with dynamic debug, you can enable that for example for a file or for a line or for module. Here it takes dd.c, that's a device driver model file. That's the main file that does matching the devices to the drivers in the kernel and also will print debug messages like deferring a probe and so on. So you can see how often the probe is deferred if that's useful to you. And with a plus p, that means print. So yeah, this line will cause if you have config dynamic debug enabled to print out debug information from your driver core. And for good measure, I always add clock ignore unused and PD ignore unused when I am debugging because that's when the kernel has finished starting up. It will disable clocks and power domains that it thinks it doesn't need anymore. But if you happen to need them, the system will get will hang itself because there is still devices that require it. So yeah, it's not really related to the other stuff, but because it's debug stuff that I copy and paste, I add that too. Of course, you will want to remove it later for when once you have found where your bug is. And then for that particular problem of understanding why that probe has deferred, you can use ftrace. So the function graph tracer will print assuming of course you have it enabled in the kernel and so on. And you have enabled boot time tracing. It's a separate option. Yeah, if you have that all enabled, you can on the kernel command line say ftrace function graph and that will enable the function graph tracer very early. You can set a ftrace graph filter, which is a function during which the ftrace should run and you can set a maximum depth. And what the kernel will do is that once you enter this probe function, it will print a line to the trace buffer. And if you enter a shite function, it will add some indentation and print the next. And if it returns, it will return and so on. And then you have the flow of how the kernel walks through the probe function before you. I limited here to three, to a depth of three, because usually it's functions that you are interested in that claim resources are at a depth of three or less. You can increase it as you like, of course. And then you can check out what was the last thing that the kernel called that might have failed. So if you see at the very end, okay, let's try to get a GPIO and after that you only see clean up. So yes, probably the GPIO that's missing. I wanted to automate this a bit because one nice thing would be to get the error code out. So if you could just see which ones return an error code that is E-probe defer, you could just look for that and you don't have to look what could plausibly be the cause. This seems to be possible with boot config. Boot config is an ftrace function graph red val tracer that records return values, but you need to use boot config for that. And I haven't had the chance to use boot config so far. I tried yesterday for the first time. And what I initially wanted to do, I wanted to dump the ftrace buffer during boot time. So if the kernel gets stuck, I can't access the tracers to output my trace buffer to find out what was the last function that's called. Yeah, I can help myself by adding an init addy. I have an init addy for arm64 that I can just, that has a small shell that I can use. And then I can on the small shell from the init addy mount tracers and read it out. I could like something just out of the box where the kernel just, if it can continue boot, it would just dump out why it couldn't continue to boot. It currently will dump out this deferred probe information, but if you don't have this devair probe call, you won't know what's the reason why it didn't probe. And so ftrace could help. But I haven't managed to get it working. There is an ongoing discussion about that. And once I get it working, or someone tells me how I can get it working, that might be someone of you too, you can reply to that mailing thread and others will know too. I will try for sure a bit on the way back in the train. Yeah, and, ah, fadflink is something I wanted to talk a little bit about too. fadflink, it's a problem with eProbeDefur, it works, but you retry probes a lot of times when you don't really need to. In the worst case, the next device that you want to probe is the last one in the list. So you walk the whole list, you try all probes, and all will return again eProbeDefur until you reach the last one, then you probe that, and then you start the list again. And so on, you keep walking the whole list because the one you are interested in is at the very end. And you could do better than that if the kernel could take note of the dependencies. And yeah, if the kernel could take note of the dependencies, like read them out of the device tree, it could order probes, and that's what fvdeflink is doing. And yeah, it doesn't replace eProbeDefur completely, but it minimizes it a lot, which should improve your boot up time because you don't need to redo probes so often. And that's it. So if you want to take one thing with you from this talk, if you debug such an issue and you find a place where you could add def-er probe to make the life of other people after you easier, please do so. So the world will be a bit better after that. Thank you for listening. We have time for maybe one or two questions. Hi, Ahmed. Thank you for your talk. I just figured out you can use magic sys request to print the trace buffer. Oh, okay. Yeah, that would be a way. If you add the magic sys request, you can do that over the serial console too, then you could ask for the trace buffer to be dumped. I was thinking about triggering an oops maybe, but that sounds a bit less severe.
The case for a virtual Rust stateless codec driver
We're almost ready for our next talk. So if we could all quiet down a little beforehand, it would be great. Hey. Okay. It's about Rust. Shouldn't you all be excited? Okay. Daniel is going to talk about the case for a virtual Rust stateless codec driver. Daniel? Okay. Can you guys hear me? Is the microphone working? For those of you who don't know me, which I assume is going to be the majority, my name is Daniel Omeda. I worked for Collabra for three years, mainly doing codec stuff, the interface between just the streamer and the kernel. And as I've recently been working on doing decoders and Rust in user space, so now I'm fighting really hard to bring this to the kernel. And I'm halfway being hated by a lot of people just kidding. So yes, I'm here today to talk about the case for a virtual Rust stateless codec driver. I want everybody in the audience to really take a look at the title because we're going to go piece by piece every word that you guys see here. We're going to talk a little bit more about it. So I'm going to start with saying what a codec is and why we need codecs. Then I'm going to talk about hardware acceleration and codecs and why should we need that. After that, since now we have hardware, I'm going to talk about codec drivers because now we have hardware to drive. So how do drivers work? Now that we have a driver, we need a way to talk to the driver. I'm going to be speaking about the two major APIs in Vigil for Linux for you to be speaking to these drivers, the stateless and the stateful API. And after that, I'm going to talk about what is a virtual driver in this context and about Vizal, which is a virtual driver I wrote. And then lastly, I'm going to tie all of that back into Rust and hopefully why we need Rust in this particular context. So without much further ado, let's get started. And the first thing I want to talk, as I said, is about codecs. What are codecs? Basically codecs, they are a way for you to compress video data because without codecs you couldn't have video in the morning age. So if you have a 4K stream, for instance, and you do the math like 3, 840 times 2160 times 3 bytes per pixel times 24 frames per second times two hours for a movie, this is going to be very huge. So in order for this to be even possible nowadays, you have to have a way to compress that. But thankfully, all video signals, they're full of exploitable redundancies that that's how video codecs work. They basically exploit these redundancies to shrink the amount of data that you have, therefore making, you know, storing videos on your computer or streaming videos through the internet and a lot of other user cases, even possible today. Usually this process is lossy, right? So you can have lossless compression, but usually you lose a little bit of data. But the objective is to arrive at a possible approximation such that you don't notice that you've lost a whole lot of detail, hopefully, but you've shrink the size of your image by a large amount. And for a given bit rate and power envelope, meaning for a given size of resulting file, the resulting file size, and for a given amount of heat that you're going to be generating or power that you're going to be consuming from the device. All right, so we want things to be fast and hopefully cool, right? Because nobody likes to be sitting on a laptop, does a scorching hot, or you don't really want to be using a whole lot of power, or, and you want things to be fast. So what's the solution for this? Is to have a hardware accelerator, for instance. So hardware accelerators tend to be more power efficient and faster, and it frees up the main CPU. So that your main CPU can basically hopefully do other stuff. But they're usually less flexible because you usually only get what you synthesize into hardware most of the time, right? And you're not going to be synthesizing all the profiles for all the codecs. You're only going to be synthesizing a small subset, and well, you kind of, this makes it a little bit less flexible than doing CPU encoding or decoding. And there is another key aspect of this. Now that you have hardware, you have hardware to drive. So now you need a driver, and you need an API to communicate with this driver. So you have now yet another piece that you have to have that you didn't need to have with a pure software slash CPU approach. And to understand these drivers, we first have to have a brief look at what is inside a bitstream. Let's say you're watching something on YouTube, or you've downloaded some video file on your PC. What exactly is in there? So basically we have two blocks in there. The most important thing actually is the data to decode, right? That's obviously the most important piece. So the data to decode, the actual compressed data takes the most amount of space, but that's not everything that's inside there. We also have a small block called the metadata block. And what exactly is the metadata block? It's data that's not going to be actually decoded per se, but it's very fundamental because it controls the decoding process. So it's data that the decoder will be consuming in real time in ingesting that metadata to actually, in real time, decide how is the decoding process is going to look like for each and every frame. So as I was saying, this metadata controls the decoding process. It may dictate to the decoder how the decoder is going to decode a particular frame, in which case it will only apply for that particular frame. Or it can apply to multiple frames, depending on the kind of metadata we're speaking about. So in frame-packed decoders you have things like PPS and SPS and VPS, which are metadata that stays for multiple frames. Or you can have metadata that's only for a single slice, which is only going to apply for that particular slice and that particular frame. And you also have the slice and or tile data, which is the data that you're going to be hopefully decompressing. So far so good, I imagine. So how can we talk to these devices? Because if you recall, now we have hardware to drive, so now we need to talk to the device somehow. In Vita-Felonics, we basically have two different types of APIs that you can use in order to talk to the codec. One of them is the Stagefool API. I like to think of the Stagefool API as a black box. So you just send in this data to the device. The device will do its magic, quote-unquote, and it will keep track of the metadata by itself. You don't need to do much. You're just sending the data and bam, hopefully you get decoded data back. Very interesting. If the Stagefool interface is a black box, I like to think of the stateless API as a clean slate, meaning for each and every frame, you in the user space have to extract that data and then send that data together with the compressed data to the driver. And then the driver will use this metadata that you have just parsed and send with the compressed data and it will decompress the data. So if the Stagefool API is a black box, the stateless API is a clean slate. For each and every frame, you have to submit data that will tell the device how to operate and then the device will basically forget about it and then on the next frame, you have to submit the new metadata and the new compressed data for the device to decode and so on and so forth. It doesn't really keep the state of the metadata within the device, hence the name, Stateless. I assume so far so good. So now we're going to be talking about virtual drivers. I'm going to talk about the virtual driver which I have written called Vizel. And Vizel is basically, as I said, a virtual stateless driver that just pretends it's decoding data. And why would a driver that just pretends that it's decoding data be useful, most people may be asking, well, it's useful as a developer aid, right? To help developers who are working on this particular niche to either develop new implementations or use your broken implementation. Let's say you found a bug, you can use Vizel to dump some debug data to help you fix the bug because Vizel is a driver that instead of decoding video, just dumps a bunch of data, useful debug data through debug.fs, through ftrace, and through the Viti for Linux task-patter generator. And what is Vizel good for, as I said, is good to help you to develop a new user land application. Usually you have a working application, then you use that, you trace the working application and use the trace provided by Vizel to develop a new one, that's one use case, helps you fix bugs, helps you test your user space when you don't have hardware available. So if you don't have hardware that can decode a particular codec, you can still use Vizel to test the just trimmer code, to test the FF impact code, to test the Chromium code, so on and so forth. And you can use Vizel for prototyping. So first of all, let's do a quick recap over here. I explained a little bit what codecs are, why we may need hardware acceleration. I said that when you use hardware acceleration, you're now going to have a driver, and to talk to this driver, you have two APIs, and now I said that we have in the media subsystem some virtual drivers, one of which is a virtual stateless codec driver. What is the problem? Well, the problem is this. Can you guys read this? It's readable in the slides, I think. So what is this? This is what you find when you open up a codec specification. I think that this is for AV1, which is the state of the art codec. And this is how you parse this thing. This is how you extract the data from the bitstream to send the data to a stateless codec, because as I said, a stateless codec needs the metadata to decode each and every single frame. So in user space, you have to go through everything in here and start parsing. And right from the outset, I see a few issues over here. So we are reading, on the second column, there's F, and in parentheses, there's a particular number. That's the number of bits that you're going to be reading for that particular syntax. Now, what happens if you have a bug? If you were to read, for instance, there's a type over there that says show existing frame, and you have to read a single bit from the bitstream in order to get the value for this. Let's say that instead of reading, say, one or two, you're at false, because you had a bug, whatever. So now that branch, so show existing frame, is not taken, because you had a bug in your implementation. And if that branch is not taken, now you're out of sync with the entire thing. So now you're going to be reading like, instead of reading frame to show map IDX, which would be the next field, you're now reading frame type. Let's say that you were to read all that stuff and eventually read that return over there, but you didn't, because you missed this branch, because you read that thing over there wrongly for whatever bug. So now, instead of returning, you're reading everything, thereby reading over the memory that you had in the first place. If you had that return over there, you would have only a given amount of memory, but you missed it, because you had a bug. And now you're reading more stuff, thereby reading after the end of the memory. This is a crash at the best of scenarios. At worst, it's, you know, you can corrupt stuff, this can go very badly. So this is very tricky, and this is very indented. You can see that whenever you see a pair of parentheses, this is like yet another syntax element that will have its own way to be parsed, and it can have if statements and for loops and etc. It can have other things with parentheses, which means yet more indentation basically. So this can get very hairy, very tricky, very fast. And you have pages and pages of this. I'm pretty sure they have at least 20 pages of this stuff to parse, just to send that to the driver. So this can get very complicated, and not only can this be very complex, but we're also reading the indexes for arrays, and we're also reading a few loop variables directly from the bitstream. So if we go back a little bit, let me see if I can find it from here. Yeah, so frame to show map IDX, well, you're reading that from the bitstream, and you're using that to index into another B of memory, which is the ref frame type array. And if you read that index wrong, now you're indexing into an array with the wrong index, and as we all know, see here, I assume, we know that this can be very broken, very fast. So yes, this is very hairy. So here is my pitch. Here is my pitch to use Rust. We're handling a whole lot of metadata performance, as I said. A whole bunch of data that we're getting from user space, and although we do do some vetting of that in the kernel, there's some functions in the kernel aimed at potentially detecting invalid input from user space. They're not, you know, that foolproof, and the attack surface here is so huge that no amount of ad-hoc in kernel validation would ever catch all the possible ways in which this can blow in your face. This metadata thing is very structured and very complex. The media fields can change based on the value of other fields. As I said, if you read true, then go here, otherwise take the other branch. This can get very complicated as well. And you also have to maybe juggle between multiple versions of the same metadata. For instance, in HVC, you can have multiple instances of a VPS or a PPS or a SPS, which you have parsed previously, but of which only one is active. So you have to juggle between multiple of these, and only one is active at a time. So there's plenty of pitchfaults that you can sort of shoot yourself in the face when doing this. And the problem is exacerbated in real drivers, because as you will recall, so far we have been talking about vise all about virtual drivers. And if that thing crashes, it's bad, but it's not the end of the world, although it's very bad. But in a real driver, you may use this broken metadata that you're sending to the driver and use this broken metadata to program the device. So now you may be changing the decoding process of the device and who knows which ways. You can hang the device at best or corrupt the state of the system at worst, and you may even have to reboot the system. It has happened to me multiple times that I had some bugs somewhere, and I had to reboot the machine in order to, because the device was stuck, basically. So I assume by now we have, you guys can see that we have value in having rust here, because most of what I have spoken about is check that compile time when you have rust. So when you have rust, you have memory safety. If you basically, all the issues I was talking about, about accessing invalid memory, by default are prevented and rust at compile time. You have just fixed yourself a bunch of different classes of bugs for free, just by switching the language. And I was speaking about virtual drivers all this time and about Vizel in particular, which is a virtual driver for testing codec drivers and codec user space, because I think virtual drivers are the perfect candidates to experiment, and Vizel in particular is a perfect candidate to be rewritten and rust. And we can make that even simpler, because Vizel has a bunch of F trace code and debug Fs code, and we don't need any of that. We can strip away basically all these things for now and just have a virtual driver that boots and that can pretend that it's decoding data without dumping debug data to user space. We don't need that in the first version of a rust driver in Vidafilinix. And I think the most important part of my pitch is if we make a virtual driver in rust and we make it work and we prove to everybody that this thing is working, it's not much more work to get a driver for real hardware, because then only the parts that touch as the real hardware, like getting the MAs and getting the interrupts working and blah, blah, blah, these pieces of the kernel, they're basically being worked by everybody else, because everybody else who's interacting with hardware have the same issues to fix. So they're also working on this. So if we fix the, if we come up with a Vidafilinix to specific bits, maybe in six months or one year, the situation of rust in the kernel will be more advanced. Therefore, we will have more abstractions for more areas of the kernel. Therefore, it will be easier to write a driver for real hardware. We'll have the Vidafilinix bits and we can hopefully profit from the work that other people have done in other areas of the kernel. So I have been trying to do this for six months or one year, I think. I have sent to the main list a simple Vidafilinix to driver. And what we have in this driver so far, we have the abstractions for a few of the Vidafilinix to data types. You have a very thin VB2 abstraction for mutual can spawn a queue to share memory, share buffers between users based on the driver. We have abstractions for some of the Vidafilinix to Ioc tools, not all of them. We have the necessary code to get the driver to probe because believe it or not, it's actually complicated to get a rust driver to probe in the kernel. There's a proc macro going on in the background, blah, blah, blah. So we also have the code to get the driver to probe, which is in and of itself is an achievement, I think. And we have a simple module. And what does the simple module does? It basically boots, I mean, I'm sorry, it probes. And whenever a NIOcto is called by user space, it just brings, hey, this Ioc tool has been called. I am able to process this Ioc tool in Rust. I'm able to translate all the arguments into Rust and basically returns. It just brings something to make sure that this Ioc tool translation layer between C and Rust is working. And from this, we can start to add functionality to the driver so that when it actually processes the Ioc tools, it stores states and actually carries out what the Ioc tool is supposed to do. And what do we need in order to get this thing going? We need support in Rust for referral to control so that we can send the metadata to the driver. We need support for some media controller bits so that we can have referral to requests, which is a way to tie this metadata to a particular frame. So we need Rust support for this in order to get stateless codecs to work in Rust. We need M2M support for device run and friends, which is just a framework who's scheduling the decode jobs in the kernel and deciding when a job should run in the hardware. And we need more Ioc tool support because we only have support for a few Ioc tools and a real driver needs much more. And most importantly, we're still waiting for the green lights from the maintainers. I have been talking to a whole lot of maintainers. I've been to the media summit. Some of the maintainers I think are in this room or not. I don't see them, but anyways, I got some feedback from them. And I think this summarizes it. In visual Linux, nowadays, the state of the subsystem is that the subsystem is a little bit overwhelmed. There's a bunch of patches being submitted. There's not enough people to review. And basically, people are overwhelmed and nobody really has the time to have yet another language in the subsystem, which I understand this is a completely valid thing to say. So there's not enough reviewers, not enough maintainers. Some C frameworks like the media controller have some longstanding problems which nobody has fixed yet. So one of the feedbacks I got was like, hey, maybe these issues should be fixed in C before we add yet another language, which is some valid feedback. I agree. There's a huge fear of breaking C code, which is impossible. Just by how this works, you're never modifying the C code. So this can never broke whatever C code is already in the kernel. I promise. And the other issue is who is going to maintain this layer, because one of the feedbacks I got was like, hey, if I change something in the C layer and just break the Rust code, who's responsible for keeping these two things in sync and fixing the Rust code, which again is a very valid argument to have. But we use Collabra. We want to unblock this effort, which is why we have been investing in the media subsystem. We're investing in CI and we're trying to change a little bit how maintainership in the Vifrel2 community works. We're trying to get it more like DRM where you have multiple commuters and where you can share the workload. So we are doing what we can in order to alleviate some of the issues so that we can proceed with this effort, which is why we're proposing this virtual driver, because we think that this is a good candidate for experimentation. This is not the scheduler or memory management or anything, which if you break that, you've broken the entire kernel, this is just a virtual driver. If it's broken, it's not going to be a huge deal. So summary here, say the drivers, they take a whole bunch of intrested data from user space. This can get hairy. This can let you shoot yourself in the foot very easily. So the attack surface is enormous and Rust is a perfect candidate to fix that at compile time, we think. And the Vizel virtual driver is a brand candidate to experiment and Rust and to be rewritten and Rust in our humble opinion, or in my humble opinion. And this is what I had to say to you guys today and hopefully this was interesting for you. Questions? Yes, sir. Hi, thank you for the talk. So I have very rough understanding of Rust, but I would like to get more idea how Rust would protect us from part of the issues that you described. So I understand how it protects us from accessing past the size of the array, but in case when you have the metadata, it changes because of the previous values that you are passing, right? How Rust would protect us from misinterpreting that new piece of data? Well, if you get metadata that's broken and this broken metadata leads you to access another part of memory which is invalid, that just panics basically. So it's not like, and see you can dereference any value you want, but in Rust you basically can't you panic. And then you compare that with the user space Rust implementation to also make sure that you're getting sanitized data. So basically the only thing that, the only protection that we get is that we will not access array that is out of bounds basically. Because... We can actually handle cleaning these access. Okay, so tell me more. We can have a clean exit to that bad access. Okay, so what's the benefit? I mean you can always define a C function, right? That will check the size of the array, how much data is there left and never access the buffer directly just through that function. That will always check if you are not out of bounds. You can. So then... It's a human factor. Yeah. You're forgetting the human factor. Of course, yes. So you can... The amount of work that is there to add the whole Rust bindings, I'm not saying it's bad, right? I'm just trying to weigh in the, adding a single function that you can use to access the metadata, the buffer that you have made the data in, and porting the whole thing to Rust. I can give you another example which is like error handling in Rust for instance is much better. So while in C if you get something, a wrong metadata, you may fail, and then you may have a bunch of go-tos that you have to go back in the exact reverse order, blah, blah, blah. And Rust, that's free. The compiler just wires up the calls from you in the right order always. So you can never forget to clean up after you're done if you have an error, you know? And you can say, well, you can just define your go-tos in the right way and be careful. Yes, C is just fine, hence why we have drivers in C working. But as humans we can forget, you know? So it's just a way to, you know, kind of close in on the human error factor, I'd say. Any more questions? Questions? Yeah, I was wondering, if you were theoretically to transpile your Rust to C, would that be acceptable then? To transpile to C? Yeah. I'm not aware of any way to transpile Rust to C, but you can have a C API for Rust. Somewhat easily. So you can define a header file and you can ask this, the Rust compiler, to provide you with C linkage, C-A-B-I. So that you can have a C API for your Rust code in the kernel even. The idea was that your code is being rejected because it's Rust and not C. I'm sorry? You said earlier in your slide that the maintainers reject your code because it's written in Rust. Yeah. But if you wrote Rust and transpiled to C and submitted the C result, would that make them happy? Yes, that's not really how it works. You can have a C API to your Rust code where you're not transpiling it and sending C code only. It's just not how the process works, you know? So you're not speaking the same thing. You're speaking about transpiling. But I'm speaking of having native Rust code and offering as well a C API to that Rust code. And I am not aware of any way for you to transpile Rust directly into C. And only send the C code as you're proposing. Okay, I think we're out of time. I think so. Thanks so much for your talk.
From Kernel API to Desktop Integration, how do we integrate battery charge limiting in the desktop
This is the kernel dev room, in case you got lost, which is very unlikely given how hidden this room is. Our next talk is going to be from Jelle about GNOME battery charge limits. Welcome. Thanks for hosting. And yeah, I am... Wait. Oh. Oops. I'm Jelle van der Waal. I work at Red Hat and I started my open source journey by joining the ArtSync team and now a developer and that's how I got into open source and also how I came into FOSTA. On my day job I work for Cockpit and this is a web UI for your server and that's where we use like 100 APIs and it's a bit related to this talk. So this talk will be talking about the kernel CISFS API and since we're in the kernel dev room I'm a bit experienced with kernel development so I wrote three small upstream, three small drivers in the kernel mostly when I was hacking a bit on some all-winner stuff so XR1M, input touch beam driver, some small stuff. So I'll start with like what's the problem? So most of you probably get a laptop for work and I assume you're like maybe you're a developer and you don't have too many meetings, you plug it into your dock and you start working and you basically leave it there for the whole day which means that the battery will get probably doesn't deplete that much and it will constantly get charged and then over time it might discharge like 5% and it charges again and this isn't very great for your battery lifetime because basically the only the last 20-ish percent gets used and the rest of the battery isn't really used. Luckily vendors and a lot of manufacturers have made a solution for this and these are battery charged thresholds. They're implemented in firmware so sometimes you can enable a single BIOS and there's a switch for this and how you interact with this depends a bit so sometimes it's like you talk to the embedded controller or you do some WMI or ACPI but that's just implementation detail. So there are like two things which generally happen so there's like either you set like a charge threshold at 80% so basically the last 80% won't get charged and the other alternative is that you set start and stop charge limits so basically between the 16-80% this is the charge free zone and if your battery is under 60% it will start a charge to 80% and once it's there it will stay basically if you keep it plugged in it will be basically staying at range and won't charge. So that's good if you don't use your battery a lot except maybe once a month or once a week for travel then this basically means you don't use your battery at all. So yeah I'm working on this because I own a ThinkPad and the ThinkPad supports sending charge limits for this there's a CISFS knob and so my solution was my first solution was I make a system deservice, it runs on boot, it sets the charge limits and well I'm done. These limits usually don't survive like a power cycle so you have to apply them every time you do but yeah that's not really a great solution for like your less tech savvy Linux user. So for this there's already some software to set this, there's TLP which you might know and PowerDev1KD can set these limits and other these as far as I know don't really have any support for this and looking at the other side so looking at Windows usually the manufacturer provides some software for your specific laptop and they allow you to set this. So but this talk is about how we intend to integrate it in GNOME so it will be only about this. So for GNOME there has been an issue in the for GNOME settings about this like a lot of people want to have this and an interesting other interesting thing is that some vendors have some certification and that this setting these limits is also part of it but don't pay me on that because I'm not really involved into that. So and for myself I mostly wanted for myself so the question is how would we provide this for the user? We can do like we can make it configurable you just input some numbers we can or we can also ask like how would the vendors want to have this because some vendors like Novo they ship Fedora in pre-installed on laptops maybe they want to set their own custom limits which they tested in their own test lab or do we want or maybe users want like profiles this is what some of the Windows solutions you give you for example you have like a travel profile which is basically no limits at all and then you have something in between which is like 70-80% it highly depends for manufacturer they all have their own numbers and some have like a very conservative like only 50-60% so there's a lot of variants but yeah as you might know whether you like it or not GNOME GNOME intends to be like we would try to give the user like if there's already a good solution for it we try to make it as simple as possible so the user doesn't have to like oh I have this think pad but for think pad I need like 60-80% and for oh but I have an ASUS I need like 50-60% or something so GNOME tries to hide that away from users so what we ended up with and what was already pre-designed and after some discussion we just ended up there should be like a knob which just enables these limits and because in multiple profiles if you would probably confuse users like do you want conservative do you want mobile battery so we try to aim like for the simplest solution so so that's like how the UI would look but then how does GNOME actually get all this information so there's a demon running on GNOME standard if you log in so that's U-Power and basically U-Power is the bridge between the between CISFS over D-Bus to a user space so like well it's running user space but it's like the UIs so like GNOME settings the shell and this shows you normally like your battery indicator if it's charging or not if it's discharging how much power it is so this is the logical way to where we want to integrate this feature it's already writes something to CISFS so it has like the keyboard backlight support is handled by it so this will be the place to do it so yeah the simplest thing will be to export the so how U-Power works is it has objects for your battery or for your display or for charging devices and for your battery we just add like a start and interest hold so we can visualize this to give this to user and as we don't really want to make this in U-Power we will really want to add like configuration parsing so we want to allow people to still like there's still power users who want to configure this these settings so U-Power would have like one default setting but there are of course power users or other kind of users who want something different so there should be a way to overwrite this and an easy way to do this because we get already rely on UDEV is by using by allowing this to be configured for your hardware to be this is so that we completely don't need to any new configuration file you just make a hardware to be file and that allows you to overwrite the settings for like your custom for depending on the my string or your battery name or it's pretty flexible so that's all cool this is so that's and now we look at the kernel site and most of these charge vessels they're implemented in the platform x86 drivers they're set to easily like call your ACPI method or there'll be my method and then to set these to set and read these limits so basically you need to implement these two functions and then basically in the year for your device and then you're basically ready so that's easy and that's also what I thought like okay cool with the start in the end that's what I have on my think that well how hard can it be to do this for all the other laptops and then I started after I was already working on this project for some while for some time I started looking at the other devices and it turns out that actually not every laptop supports the same thresholds or so for system 76 MSI who I and Lenovo they also support the start in the end thresholds so it's like okay that's great that's what I have that's why I intend to support and then I found out that some drivers like the start threshold and I started digging a bit deeper and yeah the love of a driver basically the sub method checks like okay I've we used to try to set 180 that's the value of x-fet and that's it so it's okay that's interesting and then I looked at the time she bought driver and that was I found out more interesting so the setting of the end limit supported like giving you you could give a value from zero to 100 so for the ASUS driver the LG driver if you would give like set like right echo 70 to the end threshold it would give like any valid that's not what the touch bar driver does it just sees like oh say you set 80 then I also had an answer limits to or say is right 80 then I will enable the battery saving feature because the as you can see basically if the value is under 90 we will enable it so if you set like 91 we will disable it so this is fairly confusing and then we look at the function which shows the configured value and this just returns either 80 or 100 so basically you can only set an end limit of 80 so yeah that's not great and the ASUS driver supports 100 to 0 to 100 to be set and it's interesting when I read the code there was like yeah we can't read from or we we don't know how to read the configured states from from firmware so basically whatever you you right there it just cashes in the driver and returns what you what you set so that's interesting and another interesting thing is that according to users that by say you set like an 80% end threshold the driver the firmware internally sets like a start threshold of minus one or minus two so that's probably good that they do that but it's not very obvious when you or like I wasn't able to verify I don't have any I only have like think paths and so I have to trust users that they measure this and this is the case so that's that's tricky and it is logical that is that everything is implemented different of course because it's all the all the laptops are created by different vendors and most of these I think the the no vote drivers was done by the no vote don't pay me on that but the other right like the ASUS one is probably reverse engineered so I cannot blame or it will you cannot expect this to be like perfect because there's no no documentation about this it's just all reverse engineered but this did raise for me the question should the the driver also show which values it accepts because if you're trying to like make an application and it's not that great that if you present like okay we will charge to 80% but when you try to write this it will just return in a valid so the driver the driver upfront needs to the I mean the the application of funds wants to know what is supported so this is something which might be interested to to add to the kernel so yeah but about the actual implementation so I I for now I just keep the interest holds situation as something maybe you can think about the future for now I only support the the start and the end stop just holds because that's predictable for me with the interest hold I'm not sure if there's like a hidden threshold set or what it's like an unknown state for me so for now I'll have kept these out of the kept these are not supported that's a bit sad I would like to at some point do that but I I'm not sure not sure yet what to do about that so yeah I the the merge request are currently under review this is how it looks like and and then because you know I don't have too many side projects I I was thinking about some future work so one of the one of the things is that I don't believe you can change like before sample the toss about driver to return the invalid if you write 10 because that's you can't break user space so maybe there should be a new new attributes but I'm not sure how this will be like there's this other attribute like the charge behavior and as nicely shows you like okay this is currently set to auto it can also be set to these other strings so maybe that can also be I don't think you can change the current like start threshold to also do this because I will break a use space so maybe there should be like a new new file which you just can cut and then it shows you what is supported that will be like fairly easy to implement. Also I would like to see this be supported in more devices so I found out like I own a steam deck because well my friends arch and so constantly I got one and I found out that they also have a chart and charge and threshold but they were using a custom sysvests attribute for them so I asked them about it and they're gonna they're gonna fix this use the common framework and they intend to main in the upstream this at some point they're still figuring out some things. Dell is also interesting so some latitude say exports the the bios settings inside of sysvests is interesting they also contain like they also have like a charge control there and I was wondering how evil it would be or how bad it would be if you just exported this also in as a charge control start and end threshold. It will be a bit funky because but maybe it can be done. And I also own a service and this only has a bias setting. I looked at the the service kit for boss or yeah and there was an issue about this and it seems that this value cannot really be read or set from from the colonel side so it's a bit sad. And then there's the framework they actually have a tool. Which allows you to set an interest hold so I intend to to hack around on this. Once I borrow a framework from colleague and write a driver to support this and. And. Oh yeah and interesting as the frameworks. Firmware is open source so the embedded it seems like after digging the code so they the code which has the threshold also supports the start threshold but if you look at the extra implementation he doesn't do anything with the start threshold so there's a case there to. For somebody who wants to like. Write embedded controller code to see if they start just what can also be supported I'm probably not going to do that that's. For me that's too too deep like Colonel hacking I can do. That it's too much for me but it would be cool if it could be supported. Another thing I would like to do in the future is to calibrate the batteries so basically because you're normally happens that if you fully charge or discharge. There are some internal logic in the batteries firmware and this recalibrates the battery counters like how many. How much is still left so it knows how much initially capacity it has but of course it is great over time. And it measures this when it's close from when it starts charging from zero to one hundred percent it basically resets those counters. And as you're setting chart limits the the counters don't really actively get reset because they. Well you're not really fully charging anymore and think that supports a way to do this and it's basically not very very difficult. It's basically you said you disabled the charge threshold and then you tell. You tell that you can configure the charge behavior and you say do a full chest arch and basically the laptop will run off your battery while plugged into AC power. It will go to like two one percent and then switch over to AC battery to AC power and then fully charge and then this this should be reset. I believe the energy full count. I'm not entirely sure on that but it should really reset the counters and then you have accurate battery information again for when you're traveling. So this will cool. It sadly is only supported for now. I think that I think the framework can do this but I haven't looked too much in the interesting firmware repo story but it will be cool if more. Manufacturers would allow this to be supported and then the thanks to all the people who have helped me like this is not. Something I do a full easy on my own I have asked some people. I had some help from others people from the design on getting into like how G-Dip program works and I like for a new power kernel so thanks to them. Thank you. Hey I don't know how batteries work but would it be better for the battery but frustrating for the user to have like randomized thresholds instead of just like 60 or 80. Like aren't you just pushing it down. You know the aren't you still not using most of the battery when you you know just have a. Yeah like yeah like a starter and threshold. Yeah. I've always. I don't know why the. This room. That's great. This is. Oh all the. Just give him the. It's it's. So I know some users who do this for the for the steam back and they have this. Interesting bash script which basically sets like a threshold to start the 80% and then gently they lowered but yeah so I guess that in the ASUS laptop there seems to be like already a hidden. Start threshold and I'm not sure if that's a thing in the the the touch about or the. The other laptops so. I don't know it's. And this is also why I'm not super confident to support this because I'm not entirely sure what's going on. Questions. What happens if you. What happens if you charge your laptop when it's not actually running. A thing bad more precisely a thing bad. I have seen you have using some surface interface and Colonel site is it some platform specific six eight six or you're using power source framework for this. The notice. I believe. So it is already power supply framework. Okay. Hey I'm the maintainer for the power supply sub system and I would be willing to accept a patch adding like the. Acceptable values for the stop of the charge. As long as it's sensible like exporting a hundred numbers for those accepting any percentage does not really make sense probably in that case it makes sense not to export any fire or something. Yeah thank you. Next is. Okay more questions. So just a quick feedback from a user who created the bash script to do exactly the same. So one thing that I really needed to add to my bash script was to not only save the battery because I normally have the laptop that is always connected to the time the world docking station. And then when I need to travel I need to charge it because it's never full right. So I mean just a single switch to save the battery is not enough I think you should have something like key combination or something that would enable you. Okay so now I want to charge the laptop because I'm going to travel without maybe changing the overall policy. So like an override option. Yeah something like that. Yeah what we interesting. Okay if there are no more questions then thank you for your talk and we'll have a seven minute break.
Converting filesystems to support idmapped mounts
Hello, my name is Alex, I work for Fogunonical. I have a pleasure to work on Lexi project and do a lot of container stuff in the kernel and user space. We have been working on that new stuff about ADMapetMal and support for some file systems together with Stefan and with Christian. So today I'm gonna talk about the problems that we faced when we started to actually look into the network based file systems and how to support ADMapetMal for them because it's kinda hard sometimes. First of all, I'm not sure that everyone knows everything about that stuff, so I want to give some intro about how it works currently. And yeah, if anyone, if anybody there, we were listening to our previous talk about isolated user space stuff, please forget that for the next 30 minutes because that's a new feature. But this stuff is about stable API that we have in the kernel since I guess 5.11 or something. So that's more about supporting more file systems. So we don't do these isolated user space stuff in here. First of all, we need to understand that we have three types of ID mappings in the kernel. First one is the callers ID mapping, which effectively taken from the user namespace and from the current user namespace. User namespace attached, you can get the pointer to user namespace from the struct cred and you can get the pointer to struct cred from the task struct. Right, so if you're calling any kind of syscall in Linux kernel, you get a current task and so you can get a current user namespace. So we have a macro in the kernel to get that. And even if you're not doing any kind of container stuff, even if you're not using user namespaces, you're always invisibly using that because you're using the default mapping, which looks like zero zero in this big number, which is effectively the largest unsigned integer. And what does this means? The first number is the user ID inside user namespace. Second number is the user ID outside of the user namespace and effectively the length of this mapping. So this mapping is the identity mapping, which means that we effectively map zero to zero, one to one and so forth. Next thing that we have when we are working with any kind of EFS stuff is the file systems ID mapping. It's also represented as the user namespace because it's the thing that we are attaching to the super block of the file system. So when you're creating a new mount, let's say for example, for X4 file system, you have a block device, you're creating a new mount and if it is the first mount for this file system, not a bind mount, I mean, then the super block gets allocated and on the super block structure, we have a field called SC user NS and this field gets filled with the current user NS. So when you do a mount, it takes the current user namespace from your current task and puts that into the super block. And that's the file systems ID mapping, which means that if you're, let's say, inside the container with some user namespace and you do a mount, so your super block will get this user namespace effectively from your containers user namespace. And that's a pretty old stuff actually because I believe that it was from the beginning of the when the user namespace is very introduced many years ago. And third thing about we are talking today is the mounts ID mapping. Mounts ID mapping is the concept a little bit more high level because instead of being attached to the super block, we have the mount, we have the ID mappings attached to the mount. So it means that you can, for example, create X for file system on top of some block device, then do a bind mount and you can do this bind mount with some ID mapping attached to it. And once you get any kind of IO through this ID mapit mount, you will get some extra translate, UIDJ ID translation layer inside the VFS, inside the generic VFS code. And then this, all of that goes through the file system in mapping and then all of that gets written to the disk. So that's how it works. So important to mention that all the time when you're interacting with the kernel from the user space and if you use any kind of C-scores like start get UID, get sock opt for instance with the option so peer create which allows you to get the PID and UIDJ ID of the peer socket, you will get these values mapped in accordance with your current user space. So the callers ID mapping always get, always taken into account everywhere in the kernel. And for example, if you, so yeah, that's effectively all the examples and also we have the same in proc PID status file and all that stuff. So let's take a look what happens when you for example, take the get UID C-score which is probably the simplest one. Inside the kernel we have a few helpers to convert between the user space, user ID that we can work with inside the user space and with the internal representation of user ID inside the kernel because inside the kernel we have two types, UID T and K UID T. UID T is the user space one effectively because it's just a 32 bit thing. And K UID T is also 32 bit thing, it's the same in size, usually they contain the same value but K UID T is the value that represents the user ID always in the initial user space. Which means that for example, if you are inside the container with user name space, you have the, let's say user ID inside the container zero and if you have the corresponding user ID on the host, let's say 1000, then K UID will have the value 1000 always. But once you call the get UID C-score from the context of the process of the task that runs inside the container, inside this user name space, this function called from K UID Monct will be called. And the first argument of this function is the current user name space which effectively the time thing that represents the UID mapping. And second argument is the current UID which will be the K UID T value which is equal to 1000. And this function called from K UID Monct will try to effectively remap this host visible value 1000 to the appropriate value inside this specific user name space. It will be zero in our case because as I have explained in this case, let's say we have like mapping of zero inside the container to 1000 on the host. And so you will get the zero finally, yeah? And this function has a pair function called from K UID. And the difference between these two functions is that from K UID is like more like internal one. If we fail to represent the internal K UID in terms of some user name space UID range, the from K UID function returns minus one which means that something terribly wrong. We can't really represent that ID inside this user name space which is possible. For example, if you have the username space with that maps only like 1000 to zero and if you have the user ID let's say 2000 on the host you can't really represent that is any reasonable value inside, right? And if you call the from K UID it will return minus one. But function from K UID month it does the trick. If the from K UID returns minus one, it takes the overflow UID and returns that. That explains why we have these interesting stuff with like if you have the, if you try to access the, for example, the container file system from the host or that has another ID mapping and you will see this strange nobody user. That's because this function is used everywhere because we can't really give the user space with this minus one. We always, the user space always expects us to give the normal user ID, reasonable user ID. And also we have a helper called make K UID which effectively does the opposite thing. It takes the user space UID and creates the internal representation of it for the kernel. The same, we need to give, plug the current user space, current ID mapping to this helper and give the user space value. And that's what happens inside the set UID cisco. If you plug the let's say value zero, let's say one inside the user space to that cisco. Inside the container it will go like make K UID current user space one. It will go to the UID map and it will try to find that okay, this one is for what? And if it fails to do that, then okay, we get gain well. And so the set UID will not allow us to set this UID because it's not mapped. But if you have a mapping like zero, 1002 which means that you have mapped zero and one, then they succeed because the end K UID for that thing will be 1001 on the kernel and everywhere it will be represented like that. But once you, until you do the get UID or something like that. For file systems, what we have for file systems? For file systems we have, it's about super block ID mapping, right? We have two important helpers. One helper effectively takes the I node and tries to get the user space visible UID so the normal UID. This function called I UID read, but in fact it called on the right path. There is no mistake, that's perfectly fine because we are reading the I UID value from the I node. That's why it read because we read this value from I node. But of course it's called on the right path because when the file system driver wants to write the UID on disk or let's say send it over the wire, in that for file systems like this. We need to call this to get properly mapped to remapped user ID that we can then send over the wire, put on the disk and forget. And we have a second helper called the I UID write which does the opposite. It takes the I node, it takes the user space visible, normal classical UID that we supposed to work with and does the same as we have seen in the set UID system. It calls the helper called make I UID, but instead of taking the current username space, it takes the username space from super block. And second argument is the value. So let's say if you create a file on the file system at first from the user ID like one, so you will get that. Like it will take the value one and plug in there and so. This K UID will be written into the I node I UID field. And finally we're getting to the point when we can take a look on the whole picture like how it works together with the amounts ID mapping. Okay, imagine that we have the caller UID 1000. And this caller wants to create the file on the ID mapped mount. And we have these three ID mappings in place. We have the caller's ID mapping which is okay, which is something that we have been discussing right now. We have file system in mapping which is the, in this specific example, which is the identity ID mapping that does the zero maps zero to zero one to one, two to two and so on. And we have a new thing, amounts ID mapping, which maps effectively zero to 10,000 and has the length 10,000. So we have like 10,000 UIDs mapped with this shift. So the second thing is that effectively the shift value. So the zero goes to 10,000, one to go to 10,000, one and so. And what will happen in the kernel in this case once we try to create the file? First of all, we will create the internal representation for the user ID 1000, which will be 11,000, right? Small remark is that effectively in the kernel, to be honest, we all the time work only with this KUID thing. So it means that technically, when you calling the file system, CIS calls like let's say open with OcreateFlock, the first step is not gonna happen because we already have these values on the struct cred, but it's easier to think about it like that just to understand how much different mappings we have in this place, right? And second thing is that we need to, we need to apply this new concept, mount id mapping, right? We need to take the mounted mapping and perform effectively the reverse operation. We call the front KUID, we take the value that we've got from the collars ID mapping, and then we do this mapping in accordance with this this definition that we have. In this case, we are mapping the KUID 11,000, remap it, and what we get, we get 1000, right? Which is obvious. And then once we want to create the file on the disk, we need to get the IUIDT back, right? So we need to go through the file systems ID mapping which is attached to the super block to get the IUID that will be written on the disk. And so in our case, fortunately, we have the identity file system ID mapping which means that okay, we have user ID 1000, it goes to 1000, that's all. But let's think about another example, if we have the, for example, mapping like U0 K1000, in this case, we can remap that value, right? Because if it goes like U0 K1000, we fail because this U1000 is not in the range of this mapping, but for the second one, U1000 K0, we can remap because the corresponding user ID will be zero, but in first place, we can't. And what happens if the VFS generic code realizes that it cannot remap the value? It will give you the E overflow error. So that's the reason why you can get E overflow error when you're working with ID mapping, not only. Even if you're not using ID mapping, if you're using just normal mouse, what you're trying to, for example, to write to this mount from the another user space with another color ID mapping which is incompatible in terms of ranges of user IDs with this mount file systems in mapping, you can get this E overflow error. So that's the really complicated behavior, but that's how it works. We have no alternatives, actually, right? So you can create ID mapping mounts using these effectively two options. We already have the new feature that allows you to use the classical util Linux mount utility to create ID mapping mount, but in most distros, I don't think that it actually works right now because it's too recent, it's like one year or something like that. So I'm always using the Christian utility for to create ID mapping mounts. And internally, it just uses the syscall called mount setutter to set the ID mapping on the mount. And so you can, you always need to specify this attribute with the username space file descriptor. So we're always getting the, at least these days, we're always getting the IUID mappings and GUID mappings from the username space because username space, we have the way to actually set user ID mappings and JD mappings to the user space, from user space using the proc files, right, that's the reason. So currently we have support for all of these file systems, but if you take a look on the list closer, you will notice that most of them are local ones, so it's like the X4, better FS, XFS and so on. And recently we have been working with Christian and Stefan on the CIF support. Christian did the major work a few years ago, created the first implementation of that, but unfortunately it get lost in discussions and it wasn't merged, so I asked the permission to continue work on that because it was kind of important for our containers applications. And I get some rebate stuff and also we decided to use a little bit another approach to make it work. I will explain that a little bit later. So starting from 6.7 you can use the ID map mounts which is CIFFS, and yeah, CIFFS is the only network file system in this list, so. How to port the file system? The very naive way to do that is to just go through the file systems code, find all the places where we have like no M&T map, which means that this file system id mapping is not defined, so there is no id mapping. Replace it with the id map identifier, which is passed almost to all the VFS API functions from the generic VFS code. And then also replace the current FSUID, which gives you the KUID from the current user. And with the mapped FSUID, which does the same, but takes into account the id mapping. And also raise the FSUID map flag on the file systems definition. But no, that's not that simple because you need to be really, really careful with that stuff, otherwise you can really break things and or even open to some vulnerabilities or something like that. So the reason for that is that, okay, I would suggest that if you want to try to try and porting some file system to support id mapping, especially the network one, you need to go through the code of X4 as a really, really good example because X4 file system is like very complex one. It has many features. For example, you can do the overlay FFS on top of, and use the X4 as a one of the layers for overlay FFS. And for example, the rename callback on the X4 supports really interesting rename mode called rename whiteout, which effectively when you rename the file, usually it disappears on the previous place and appears on the new place, right? But in this case, on the old place where a file supposed to disappear, it creates the so-called whiteout thing. So this is effectively the share character device with the major and minor numbers zero. And that mode is enabled only when you call the rename from the overlay FFS. And I guess that only for that reason, this rename callback and VFS takes the id mapping as an argument because in all the other file systems where we have no support for that, we can't really use this id mapping in any case because we don't need one. Yeah, also you need to pay attention in the getutter because getutter what it does, it's effectively what is getting called in the file system driver when you call the statistical because getutter reads the attributes, fills the case, utter structure in the kernel with all the data like size, like user ID, JD stuff and all that. And you will definitely need to take id mapping into account in this place to get proper user IDs and JDs reported to the user space, right? Also there is a permission callback which effectively does all the permission checking unique spike in the kernel. So you need to also properly pass the id mapping in there. If you use, if the file system that you want to convert uses the generic permission helper, then you just need to pass the id mapping, check that everything really works and that's pretty much all. But sometimes it's not the case because some file systems will see that later, use really, really weird machinery to check the permissions. And also get ACL stuff and that's pretty much all for read code pass, but for write pass, the most important pieces is the, obviously the places where are we creating the new inodes, right? So that's the MK node, sim link, MK dear, atomic open and create. So we need to take into ID mapping into account in all of these places because we actually write the UIDs and JIDs. That's it. And set other which is getting code from, for example, challenges call, right? So you need to, as the challenges call takes the user IDs and JIDs from the user space, you need to properly remap them and write to the attributes. So that's, so for local file systems, as I said, you really need to take the X4 or better first or something, just carefully read the code. Be absolutely sure that you understand how it works and then go for the other philosophy that you want to support. Which problems we can have and we really have. First of all, some file systems, especially in Torque Ones, they, obviously in Torque Ones, they do the permission checking on the server side, which is really bad because what we want is to ID map it, map it mounts, is the local feature of Linux kernel. We don't want to tell the file systems remote server to be aware about that we have this crazy, interesting Linux specific stuff because the theoretically user may be from another operating system, right? So if we want to, if the file system does some UID, JID based permission checks on the server side, it means that we need to extend the on wire protocol, pass all of this ID map stuff over the network, write some logic in there, so that's not work usually. Effectively for a few file system, which is not the network one, but it almost the same as network ones, right? Because you have the user space demon, you have the kernel, kernel is effectively the client, and user space demon is effectively the file system, and the client, the kernel just takes the information from Cisco, does something with that information, produces the request, send it over the fuse device, and the user space read that, and so if we want to do all the permission checks on the user space side, and if you want to support ID map it mounts, we need to pass these ID mappings over the, so we need to extend the protocol that we use between the user space and kernel space for fuse, right? Also some file systems, it's also about fuse effectively, some file systems can do, some can allow you to completely disable the standard permission hook permissions, so effectively implemented almost like an empty thing, that just allows everything, and then do all the permission checks on the level of the I know operations, and the problem is that, I can remember that I have seen that in the while I was working on Ceph, is that in Ceph it's possible to set the configuration based on the path to the file, and specify the user IDs and JDs that actually allowed to read the sub directory, it means that you have the combination of permissions checking on the Linux kernel side, then you have some permission checking on the kernel side, on the server side, I'm sorry, which is the remote server with another kernel, which does not know anything about this stuff, right? And they do checks almost everywhere, even for lookup, and why it's bad for lookup? First of all, because lookup, I know the operation does not have ID mapp argument, and it's not obvious why it doesn't have, but the reason for that is that the usually lookup operation is getting called from the slow lookup pass in the kernel, right? If you have the pre-cached to dentaries for some pass, then we won't go to this lookup callback, instead we will just take the dentary, and it means that if you have the permission checks inside the lookup, then everything will depend on that, if you have this dentary already or not. So if you have not, then you go to the lookup, then you do the permission checks. If you have this dentary cached already for some reason, for example, if this dentary was accessed from another mount with another user, then these permission checks won't happen really, that's bad, right? That's why we want to have all the checks in one place, ideally, for this stuff. And of course, some of you can say that, okay, in this case we can do some permission checks and derevalidate helper, which is always getting called, yeah, to derevalidate, but not, because we don't want to do that, I guess. So, yeah, and also, third case that I've almost forgotten about is that some file systems has the local feature, really, really close ideologically that what we have in Linux, that does some UID-JD mappings on the level of the file system itself. And that's also a problem because I personally don't understand how to combine all of that together to make it work properly. Yeah, in third case, what I have found is that we have the combination, effectively, of the classical permission checks and the server side checks. Speaking honestly, we decided to forget about that because we just decided that if someone uses the IDMAPitMounds, we clearly say that, okay, you don't want to use the server side permission checks in this case, just disable that, just trust the kernel, just trust the client because, Ceph really trusts the client. If you have the key to interact with the MDS server, you can do anything. So there is no real reason to do some additional checks because you can, if you have the user ID checks on the server side and if you have a client, this client can give you any UID, right? So it makes no sense to check that because this information is not like, trustworthy so. So in third case, we have this lookup problem which is okay because it's only actual for this case when you have some additional setup, some additional configuration. And the third one is that for some reason, most, I guess historically, is that Ceph uses current FSUID everywhere to get the current user ID. Yeah, thanks. To get the current user ID, but what we want usually, we want usually to take the credential structure from the file because when you open, when you are opening the file descriptor, the credential structure from your current task gets stashed to the struct file structure. And then we expect that if you do, for example, the right syscall or itsyscall on this file descriptor, then everything, all the permission checks will be done in the relevant to this credential structure that we have on the file. And you may ask me why it's so important. It's important if you want to pass the file descriptor over the Unix socket or if you, for example, opening the file descriptor while you are privileged, but then you do some capabilities, drop things, or set your idea or something, and you lose your privilege effectively, privileges effectively, and so that can be a problem. But I was, to be honest, I decided not to send fixes for that because I don't want to break any real user space application. I don't know, maybe someone relies on that. So that's technically not ideally correct, but you will see. So, yeah, I effectively covered that. Yeah, what we decided to do, we just ignored these problems with the server side permission checks because we can't really do anything with that. And we were asked by the CFFS folks, CFFS maintainers, thanks, by the way, thanks to them for help, for reviews, to Viennkischenkartus, Huboli for helping with that because they were reviewing that stuff, especially the user space one, because I was forced to extend the on-wire CFFS protocol and add some extra UID and JID fields for the Inode creation operations. And of course, all of that was done in the backward, forward, anyhow compatible way, not to break anything. Yep, and what we are doing right now, we're currently working on Fuse, I have already sent a series of patches that enables support for Fuse. Unfortunately, only for the mode when we have the default permission set, because as I said, if you have the Fuse mount without this flag called default permissions, then effectively the permission callback is almost empty, it just allows everything. And in this case, Fuse file system expects that the user space will do all the permission checks in the user space, which is a problem because we can't handle that properly. And also, obviously Fuse protocol that between the user space and kernel play was extended to send these UIDs and JIDs over the wire, let's say. Yep, also in addition to this series, I wanted to be absolutely sure that this really works properly, so I have taken the three not random, really not random file systems. Overlay Fuse Fuse just as a good and relatively simple example for this specific case, it's not simple at all. Overlay Fuse Fuse, SEPA Fuse Fuse because I was already familiar with Fuse a little bit while I was working with the, so and GlusterFS, which is the new one. For GlusterFS, it's not an ideal implementation because I found, I unexpectedly found that GlusterFS also likes to do all the permission checks by default in the user space. And so that, a bit painful, but I found some special configuration option that allows to disable that and enable the default permission thing for that file system and it allows us to make it work. So to do, in our plan to go further with the Fuse series to make it fully like tested covered to be absolutely sure that everything is fine, then we want to convert the nine PFS and virtualFS, which can be useful if you do some nesting stuff like virtual machine with some shared director from the host and then the container inside, for example, which is not a rare case. And yeah, that's all. Questions? Thank you. Hello, thank you for your talk. Is there any caveats with ID mappings and interaction with Alasams? So like if you're doing some checks in Alasams, like what kind of UI did we get there? Because I was confused. That's a good question to be honest, because all of these ID mappings works is done by Christian, thanks to him, because he did all of these great API in the kernel, all of these preparation stuff. I mean that our isolated user space work and how we managed to make it work with the file systems is all, it became so small in terms of lines of code that were modified just because Christian did all of these crazy complex hard stuff in the kernel a few years ago, because he effectively provided us with the two functions in the kernel that we can patch easily, relatively easily. And so we get the ID mappings supported for some like new crazy case, right? And to be honest, I don't know much about Alasams, so I guess that it should be integrated. So when I did the original work, I went through all of the Alasams. And so for example, Alasams like SA Linux don't fuck with UIDs and GIDs, don't care about this at all. So most of these Alasam functions don't get past the path or UID and GID value at all. The only hooks are relevant, like security file open and so on. And then it's mostly Tomoyo and possibly some app armor stuff and they are all patched to take the ID mapping into account, although one caveat is I once tried to do some additional fixes inside of Tomoyo itself because it kind of does weird stuff, but the maintainer said, no, we don't care. I mostly care about like BPF Alasam because the hook doesn't get the UID, but like you can extract it from something. Oh yeah, they are aware of that. I talked to them. So yeah, well, for example, if you do a BPF Alasam and in hooks like security file open, you get the relevant ID mapping provided. And in other hooks where you only have the inode, yeah, then you don't have access, but that's also for example, not feasible. Like no, there is no security hook in lookup, but there is certainly locations where we have security hooks where you, for example, in the dentry cache, where you don't have any of that information available and it's impossible to make that work. Like you mentioned the lookup stuff, the lookup stuff itself, like it was two reasons why we didn't do it this way. First of all, because in lookup you initialize an inode and that always needs to be take the global UID and GID into account, the one that you see everywhere. Otherwise you end up with inode aliases in a way because if you can't cache an inode per mount, that's the one thing. And the other thing that lookup is called deep from within the dentry cache, which would have meant then suddenly you would have like, have to pass mount information more or less because it's mount information through the dentry cache. It doesn't make any sense. Also L would have killed me. But I mean, that's another thing why in these locations we don't want to have this. But for example, BPF Alasams, if they need that sort of information in specific hooks and is doable, then we can easily extend the hooks. Like I don't have a problem with this, like sort of more of a LSM question if they're ready to do this. It's, I think for most LASMs hooks, it simply hasn't been done because the LASMs that didn't implement that this specific hook didn't want this information. So it didn't make sense to provide it. If you have an LASM that wants this information, it's easy to extend it. Well, I think the other point is the LASMs should use the code behind the way because it seems this one is not the LASM. I think it's a little faster. And for always tricky when you provide a policy from users based on the current idea, you don't need to translate it to the LASM. Question? Yeah, you mentioned an FS real quick. How does it work with an FS? If I remember correctly, there's an upcall through the Linux Curing, right? So you get the translated.
Linux' receive_fd_replace() semantics confusing
Okay, this is the kernel room. Keep repeating this. And our next talk is by Tykel about Linux receiveFD replace semantics. Oh, cool. Hi, my name's Tykel and I'm going to complain to you about kind of an esoteric corner of the kernel API. We had some patches on the mailing list about six months ago for this. They were done by my colleague Alok, who unfortunately could not be here. So he is, I think, in the chat room. If questions come up, I will try and read them or read his answer. So the actual hard stuff was done by him. I'm just kind of here to complain. So what do we actually want to do? We want to intercept, connect with seccomp, and then do some stuff with the file descriptor. So we want to put a file descriptor into a tasks file descriptor table. And that eventually does cause this function that I'm here to complain about. So if you've not seen the seccomp API before, basically what you can do with this API is capture a system call from a task and then forward it to some other user space statement to do arbitrary things and then return a result. So if you look at a typical application, they want to do some stuff. They want to listen on the network or they want to connect to some networks or something. So they create an ePoll socket. And then they put the socket in the ePoll thing and then they connect on the socket. And then they use ePoll to wait to say, is this socket connected or not? So this is like the JVM does this when you make a network connection or something. This is like an extremely common use case. So let's just look at what actually happens in the kernel when you do this. So when you make an ePoll call to add a particular socket to the ePoll instance, it creates a big table. And the table, I'm sorry, it creates a big tree. And the tree is a tree of tuples. The first element in the tuple is the file descriptor number. And the second element of tuple is the struct file pointer. And I'm using 0x5 here to know it's an arbitrary pointer. But originally it's the file descriptor pointer that maps to file descriptor 5. So that's what my notation here means. So it takes a copy of the file descriptor and the file, puts it in this tree, and then you go on your merry way. And then you add a second one and it does exactly the same thing except for this one's 0x6. And then if you did it again, you would get another element in your tree. And this is some RB tree or something. And so it's sorted in some way. But how it's sorted is not really important for this talk. So that's what ePoll does. When a socket receives data then, this is in user space. You are waiting for stuff. Data is received on file descriptor 5. You read file descriptor 5. It tells you, hey, there was data on 5. You read 5. And then you have your data and you're happy. So this is like the happy path what people expect how applications normally work. But remember, I said at the beginning that we are interested in using setcomp to do, to munch with the file descriptor table. So the reason you might be interested in doing this is for all manner of things. We do it for IPv4 to IPv6 migration. You can do it transparently to the application this way. So the application doesn't need to know. And also, it's not in the data plane. So the socket is connected from the right place. And so there's no IP table stuff that's making decisions based on every packet or whatever. So that's some motivation. There's a lot of other things you can do with this API. But that's what I'm doing with this right now. So we're in this normal world. Everything's happy. Now we want to mix in this setcomp addfd. So specifically, we're going to look at what happens when I do the connect call. So what happens is I connect. This side is going to be the user space side. And the other side is going to be the other user space side. This is the daemon that it's waiting for notify events. So it's waiting to say, hey, have I gotten a connect call yet? So this is how it waits. Then it creates a new socket using some magic, whatever it decides, like however it decides is appropriate. Does an addfd call. That eventually calls this receivefd replace thing. You can see here my new socket is really 0x8. So it's not 0x5. So there's physically different underlying files. But I'm replacing it at fd5. That ultimately does the actual replace. Oops, and then it returns back to the user. And if they did a read on file descriptor 5, now because of the result of this call, they would get the contents of file descriptor 8. But there's some things that don't happen. And maybe it's easier if we skip one ahead here. You remember our tree? In here we had file descriptor 5 paired with 5. What doesn't happen is that 5 does not get replaced. So in particular, this does not happen. So this 5 to 8 change, that does not happen. And so as a result, you're in a very weird state where if you read 5, you read struct 5. But because the ePoll structure was not changed, it will report data on struct of this old file that was the one you replaced. So now you're reporting ePoll events on the wrong file descriptor, like things are all confused. In some cases, you will close the file descriptor from ePoll because it keeps a weak reference, and then you will report no events at all. So that is also bad. So what ideally would happen is exactly this, that we would fix this up so that ePoll works nicely when you use this API. So questions are how to fix this. So we have, and I'm not really going to go through this algorithm, but this is how we did it. I wrote a bunch of C to do it kind of the Cree way where you read and iterate over a bunch of stuff. And it works, and it's what we use today. So it's a good first crack. But you have to parse the FD info for every possible FD. And if you have lots of FDs, then individual system calls take a long time. That makes people sad. So we want to do something better. So one option would be with extra FD info. So the insight here is that the kernel actually knows if this particular file is attached to an ePoll instance. So it could tell you then you wouldn't have to iterate through all the files, but you'd still have to go and manually fix up and do this replace. You have to do all the string slinging and stuff that FD info requires. So this was where the kernel patches came in. So there was a patch series here by a lock. So Christian complained that this is a layer violation because the FD table is touching files inside of it, which is kind of backwards. And setcomp is the only user of this currently in the tree. So I suggest to do something else, which is this. So basically have setcomp specific flags for this fixup behavior. But I'm here to complain to you. This is sort of the end of my talk and maybe the beginning of the arguments. Because I don't think this is really that great because everybody who does receive replaced FD in the future has to then add their own flags. And it's, I think, pretty clear, at least to me, that the semantic that currently exists absolutely nobody wants. It's just we didn't think about it when we wrote all this. So I'm wondering if we add some arguments to receive replaced FD. If that's OK, go ahead. Yeah, I think the only, so the initial implementation was kind of a bit weird. I think it's not a problem if we extend, if we extend E-Poll to do the replace. So if you just have an E-Poll helper and you call it from your FD replace function to update E-Poll, that's sort of fine. I think that's sort of what we, the last version, I looked at it before exactly so that I could say something smart or look smart. What your last version kind of did, I think the only thing is the locking change for E-Poll. Yeah. Quite drastically, which makes it a bit more tricky. But I think other than that, like, it's sort of, I think it's probably doable. It's probably doable. The things that I'm not currently clear about, and I'm not an E-Poll expert, so is E-Poll has pretty advanced and complicated logic to check, for example, whether adding a specific file adds loops. So I think one thing that we need to do is we need to prevent replacing E-Poll FDs themselves, because then you have this sort of looping, basically you can have an E-Poll FD that you add to an E-Poll instance, and then you get notifications from another E-Poll instance, and sort of you need to take care that you don't get cycles and don't get deadlocks. So I think we need to stop that from happening. And there's some weird logic that I haven't yet figured out, where it sort of checks to limit the number of wakeups that you can get, and it sort of does that every time you add a file, and I'm not sure if that's specific to the file type that's added, or if this is just a generic sort of check. If it's just a generic sort of check, it doesn't matter, then updating the file point probably doesn't matter. The third thing is, is this race free in the sense that if you walk through the E-Poll instance and you update all of the file pointers, can we lock out everyone else using that E-Poll instance to not inject behind our back the old file again, and stuff like that? I think it can't happen, but... So I don't even look at the new locking. In the old locking, I think the answer to that was no. But in the new locking, I don't know. I could imagine that the answer might be yes. I just wanted to... I might not understand the full edge cases, because I only roughly played with this API, and I have a hard time understanding why it's a problem in the first place, like specifically for your use case, like APU-4 to APU-6. Why do you want to replace the socket after it was created? Why not sec-comp intercept the socket call when it creates it? Like you immediately duplicate the socket in your supervisor process, and then intercept connect of course, and then just replace the arguments from APU-4 to APU-6, but you already have the proper socket in the program, and probably duplicate in your supervisor, and kind of like that problem just doesn't exist. Yeah, the problem is you can actually change address families on the fly. So when you create a socket, you have to declare an address family, but if you use like send message in their connect API, you can change the address family of the socket on the fly, or you can declare it unspecified initially, and decide later. So when the socket is created, we don't actually know is this a socket we care about. So we can only reason about that when we actually connect. Yeah, but what I mean is like, what we care about is that the file descriptors and files respectively in some E-Poll instance, so what we need in the supervisor process is a duplicate of that file descriptor, and then when we later intercept other system calls, we kind of can match them and replace the arguments. What I mean is like we don't have to go and retroactively modify the E-Poll. I see, I see. We just track the socket through its lifetime. Yeah, yeah. So when a socket is created, you would dupe it, and then just keep a copy of all open sockets. Yeah, yeah, yes. Yeah, you can do that. We're ideally trying to do less like, you know, work, trying to intercept less things, not more. But this is, yeah, this could be another way to fix it. What I mean is like a more complex supervisor user space process sounds more preferable than more complex than inside the kernel. Yeah. I mean, the wider, one of the concerns I think that I uttered was, you do it for E-Poll now. What other parts of the kernel do we need to play the same game than eventually? Yeah, for example, we do not currently deploy this for applications that you select because it has exactly the same problem. So, yeah, we do not currently deploy this for applications that you select and general poll. And I think for selected would be even more difficult to do this, right, I guess. But I mean, you could probably say, fuck it, we don't care about select, block select, only allow E-Poll or whatever. But yeah, we've managed to get pretty far with just the E-Poll fixes. So, you know, at some point applications will start using IOU ring, but I hope at that point they'll also use IPv6 and then, you know, like the set of applications using IOU ring and IPv4. I guess you can have these fixed file descriptors, which is a whole nother can of worms that is kind of interesting. I wonder if this makes it more difficult. And I think IOU ring also has its own polling mechanism and so on. But again, like hopefully, at least for our use case, the set of applications that we want to do this on and that use IOU ring are disjoint. So, other questions? Oh, five minutes left. Why? It was easy. He said, OK, so maybe the talk's over. Cool. I had some other slides, but whatever. Thank you.
What is Linux kernel keystore and why you should use it in your next application
All right, so the next talk is going to be about the Linux kernel key store and why you should be using it in your next application. Thank you. Hello, my name is Ignat. I work for Cloud for and today we're going to talk about Linux key store. By the way, how many people here know that Linux has a key store? Cool, many hands. Because like James earlier showed us that it has a key store but probably not everyone knows that Linux actually has a key store. So, yeah, a little bit about myself. I do Linux at Cloud for. I'm passionate about system security and performance. I'm like Lolo programming, Linux, but loaders, drivers and other stuff written in scary and safe languages. And I'm a hard Linux fan. That's why I'm presenting from a Mac. And probably like most of you here, I'm a fugitive programmer because NSA banned writing C and C++ languages and enterprises. And why is that? And there are many reasons but one of them is regarding application keys and memory. And by the way, here is the brand that NSA recommends that organization use memory safe languages whenever possible. So what is the problem with application key? Regarding keys, we're like talking about cryptographic keys, right? So to dig into that, let's review the Linux address namespace, isolation concept. So yeah, you have these many processes running on your systems because Linux is a multi-threaded, multi-process system. But what these processes have inside, right? So usually it's kind of like your code, like compiled code, your business logic. Some libraries, shared libraries, if your application uses shared libraries, some data, like global data stack. And yeah, I have the stack box separately. So it's like data heap and global variable with mStacks, right? And then you have the kernel, right? Everything runs in the kernel. In the kernel also you have the core code. You have static and dynamic data. You have the drivers which you load modules. And also you have stack or stacks if you have different threads, right? And the idea regarding the address spaces is within the process, each process, and even within the kernel, everything can access everything, right? So it's like one global space, whereas you can't access the memory of another process from one process and you also can't access the memory of the kernel. Like it's separated. This is Linux address space isolation. If we zoom in into the main process, into one of the processes, right? Like let's actually review what can be here and what can be in your data. And it can be like some internal state. So you have global variables, like applications can keep some internal state in the data. Yeah, your process can have user or customer data if it processes some external inputs and does stuff. Right? And the most important thing is cryptographic keys. If your application does some sort of level of encryption, it probably has some keys in the process address space. And what if like suddenly your application becomes compromised, so either through your main application logic or through a library, well, it means because it's all in the same address space, it means all your data section is compromised, right? But not all data is created equal. So, well, yeah. So yeah, well, like if your application internal state is compromised, well, it can be good or bad, right? It depends. Like depends on your logic. Of course, it can be bad if the attacker has control of some kind of data which can, for example, change the control flow of your application. If you're verifying a password, you can flip back like true or false or you can put some authenticated flag on and yeah, this can be bad, but sometimes it's not as bad depends on if your application is simple, but it can lead to further compromise. Well, if your user customer data is compromised, then like it's much, much more now. And yesterday also mentioned Equifox, my favorite company. Yeah, if you're a user customer data leak, it's a big problem because kind of it creates a lot of pressure on the company and you have to pay a lot of fines, but it's very, very bad but still more or less recoverably. Equifox is still in business to this day, unfortunately. But what about cryptographic key compromise? And this is like a total game over, right? So like if your identity key is leaked, that's what anyone can be as you. If you're like the main data encryption key is leaked, everyone knows your data. So it's a data integrity compromise, full security compromise and total identity take over. So what are the, well, 1000 feet view level of methods you can leak your application keys, right? Well, first of all, untrusted inputs and out of bound memory access. So imagine you have stuff in your memory written somewhere, right? And it may be that like near that stuff, you can have like a cryptographic key also in the same memory. And the normal application logic should allow you only to read stuff. But like what happened, for example, in hard bleed, if you can make the application read past the buffer boundary, you can also read the cryptographic key, right? And this is what happened to hard bleed. Everyone remembers hard bleed. Well, if your application have arbitrary remote code execution, like what else to discuss there is game over, right? So like attacker can control the execution of your binary and they can read, and due to say everything being in the same process space, so they can read everything and as to write everything. Not much to discuss there, but in the example was recent one, lock for shell. Everyone remembers lock for shell. Who patched lock for shell? Should have asked yesterday here, Java, right? Well, buffer use can be a sort of problem for leaking a key. So for example, this is a very, of course, this is a simplified program, but specifically tailored to leak the key, but like it illustrates the example. So for example, it has to function and crypt and log, right? And oh no, we forgot to initialize the logging message in the log function. And if you actually execute it, you will see that it kind of actually leaks the cryptographic key. So what happens is you have the process as thread stack, you have your main logic. For example, you call the decrypt or encrypt data function, which will get the key from somewhere and may put it on the stack depending on the implementation. But if you then the function exits, but if it doesn't clean it up the stack with the key, the next function can take it over and actually has an example, sorry, has an access to that cryptographic key, right? This is why all the compliance and security folks will tell you you always need to zero memory after key use. Like you have to clean up. Which is hard to do in many high level programming languages, especially if in garbage collected languages, right? Finally, you have the debugging tools. If you have a logging can accidentally leak your keys like core dumps, like GDB, Ptrace, everything that can access the memory of the application can leak a secret. Yeah, well let's make our applications don't crash and fix all the problems, right? We obviously can't fix all the bugs, so we have to do something about it. And probably we can't do a completely secure application, but what can we do specifically for cryptographic keys? Because they are the highest, most valuable data in our process address space. What some applications do, well, they try to leverage the operating system address space isolation, so they basically create another process, right? It will have a different data section and you can just move the cryptographic keys over to a different process and you write some very basic, very simple, which is unlikely to have bugs, a cryptographic logic to handle these keys on behalf of the main process. And then you create some kind of well-defined, tightened user interface between two processes, right? So we call it the key agent model. So you have two processes, one, the main process and the helper agent. The main process does not have the cryptographic material in the address space and the main communicates with the agent through a well-defined interface to perform cryptographic operation on its behalf. And agent is usually doesn't process untrusted input, like it's not connected to the network and is usually, and more scrutiny goes into that review. And some of the example of these we all use every day. So who here uses SSH? Who here doesn't use SSH agent? You don't? Yeah. Yeah, so SSH agent, GP agent, stuff like that. But there are drawbacks to this approach, right? So we need to develop and maintain two programs. We need to design this well-defined interface. We need to add communication. Like we need to think about how these processes communicate. Should we use Unix, talk, shared memory, something else, HTTP. And probably it's a good to somehow enforce and authenticate the main process from the agent. And not if the agent is kind of like this thing that performs cryptographic operations, we don't want anything in our system talking to it and being able to do signatures with our keys. This is where we go to Linux kernel key store. And the official name is Linux kernel key retention service. I call it the key store. Some people say it's a key ring, but actually, like key store has many key rings. So I think the key store is kind of the most applicable technology. And what it does is basically it takes this agent model and instead of process two, it replaces it with a kernel, right? And the well-defined interface is just system calls. Easy. So in a nutshell, Linux kernel key retention service stores cryptographic keys as kernel object. And this gives us some flexibility. So it was initially actually designed to share keys with the kernel services itself. So like for disk encryption, for example, you pass a key to the kernel and the kernel uses it. But eventually it was extended to user space. And the advantages that keys are now stored outside of the process address space, you have already have a well-defined system call interface to access and use the keys. And keys are becoming kernel objects so you can have associated access control lists, permission checks. Like you have on files or some other kernel objects itself. And the nice thing about it is like the key life cycle can be implicitly bound to the code life cycle. For example, security deleting a key even if the process terminates abruptly. And for a kernel feature, it surprisingly has a quite good documentation. So what does the key store look like? So it's a collection of key rings and keys. So a key ring can have links to other key rings and keys can contain other key rings or contain keys. So you can get this like a tree like structure. So keys are just objects that contain actual cryptographic material or a pointer treat. They can be read and written to and used to perform cryptographic operations. There are several key types which I go on later. You have user, logon, asymmetric encrypted and trusted keys. And they're kind of similar to a file system but unlike the file which can be on the in one directory, like if you don't take into account the weird bind mounts or some kind of hard links, keys can be part of many key rings at once. And key rings, they, it's a collection of links to the keys. And basically they enforce the life cycle of a key. If a particular key is not linked to a key ring, like it gets automatically destructed. And they can be explicitly created key rings or implicit special, a thread process, user and session. And they do enforce the key lifetime and they are kind of similar to a directory in the file system. So let's see an example. And by the way, all the examples I'm showing, I copied it from a real terminal. So it's a demo which doesn't fail. So in this example, here I'm creating a new key ring and linking it to my implicit user key ring. And each key or key ring is designated by a serial number which you can see. So it's kind of a unique number of the object inside the kernel. And once I created the key ring, I can add a key there with some secret contents Hunter 2 to my key ring. Basically I can then show, kind of, KCTL show shows my key ring and key tree. So we have the session ring, the user ring, my ring and my key there. Yeah. And basically you can see that the serial numbers match so what we just created. And also like because I just created the key, I have access to it so I can read the cryptographic material back and get the secret. And I think one of the examples you can use is like secret sharing between two users. So you have Alice and Bob to users on the system and you may notice they don't have anything in common. So they have separate groups, separate IDs, everything is separate. No common groups or permissions. For example, and Alice can create a secret with Hunter 2 and put it in their user key ring. What Bob can do, for example, it can create a new key ring called from others, a recipient key ring. And Bob can actually set permissions on that key ring so it allows everyone to write there. Write means putting links to other keys. So then if Bob communicates the serial number to Alice, Alice can just move that key to the Bob's key ring and then we now see that Alice doesn't have the key anymore in their possession and Bob can actually now read the cryptographic material because Bob now possesses that key. Simple. There are special key ring types. And these special key ring times determine the life cycle of a key ring. So there are session key rings which are available to all the current process and all these children. So for example, if you are system D and you put a key in the session key ring, it will be available to every process on the system which is spawned by system D. The process key ring is private to a particular process. So like every process has their own implicit key ring which they can use to store process specific credentials. And there is also a sweat key ring which is specific to a particular thread. Then let's say you write a web server which serves several websites and each website has a different TLS key. And you can, if you serve a website per thread, for example, so you can kind of securely store a TLS key for that thread, for that website without other threads even having access to that key, which is really cool. There are also user key rings which are bound to the life cycle of a user. So it's a key ring which is shared between all the processes with the same user ID and there is a user session key ring which is similar to user but not important in this context. There is also a type called persistent key rings which the name is a little bit confusing because they are not actually persisting the keys on the desk. It has nothing to do with it. It's just the life cycle of these key rings are different. They're not bound to a process or a user. So it's kind of time bound. So if you basically don't access the key ring for a time out, it gets automatically destroyed. It's useful, for example, in Chrome jobs where you can't really bind, for example, a key ring to a user because that user appears and disappears from the system but you can put a time bound and while your Chrome job is running, your key ring will be available. If for some reason your Chrome job stops running, the key will be eventually destroyed. So let's see a session key ring example. So let me add my favorite Hunter 2 secret to my session key. And basically, I imagine I'm on a SSH session to this particular machine. I can see that my key exists, right, and I can see its ID and it's linked to the session key ring. What I can do now is, for example, in another terminal I can put a BPF probe on a user destroy function which is responsible for securely destroying keys from the kernel key store. And if now I just exit my SSH session, I log out, I can see that the probe works and my key was automatically destroyed because my session ended, so my session key ring got destroyed and all the keys are linked to it got automatically destroyed as well. And if I re-log in back, I can see that technically my session key ring changed. It was destroyed and recreated automatically and I don't have the key anymore. So what it helps is, like, if you select the appropriate key ring type, you can ensure that keys will be securely destroyed when not needed. And you don't have to explicitly clear the memory. It will happen if you're out. For example, if you bound to a process key ring, if the process dies, the key will get destroyed. And regardless how the process dies, if it's successful exit, if it crashed, if it cordoned, whatever, like the keys will be gone. Okay, so now let's consider, like, some different key types. So we check the key ring types, the key types, the simplest one is the user key, which we just saw. So you have the cryptographic material, you put it inside the kernel, and then eventually either this process or the other process, which has relevant permissions, can read that secret back. There is also, like, a special type called logon key, which you can put inside the kernel, but you can never read back. And this is where this type is primarily used to share secrets with the kernel for disk encryption or eCryptFS. So if you're in a relatively recent Linux distribution, if you dump your dmCrypt setup, you will see that some of your keys are actually coming from the kernel key ring instead of, like, you will see the bytes directly. There is also an asymmetric key type, which only supports RSA currently. So you put an RSA key inside the kernel, and technically you don't read it back, but you can perform some operations with this key, like you can instruct the kernel to sign data or decrypt something with the key. So for example, this is a simple example, it was open SSL, so we can generate an RSA private key. Kernel understands only pkcs8 format for unencrypted pkcs8 private keys, so we have to convert it to pkcs8 format, and then we can actually add it to the kernel, and then we can ask the kernel to sign something, and basically we can then verify that the signature is valid with OpenSSL. Which is very useful, so all the things I'm describing today, and more is describing Cloud for a blog post, and there we have an example where we completely replace SSL, it's like a proof-of-concept patch, but we patched OpenSSH and replaced the SSH agent with the kernel key store, so instead of SSH add, you do SSH add our bash script, which puts your private SSH key into the kernel key store, and if you run the patched SSH client, it will actually work the same as it would communicate with an agent, but you don't need any agents running on the assist. Cool, this is all well and good, this is how you can use it, but surprisingly key store can be very useful as a big corporate key management building model, but the question here remains, in all the previous examples you just saw, that we still need to put the keys into the kernel, so we don't want the secrets to be in the application address space, but we still need the application to put it inside the kernel, so even though if the application cleans up after itself, there is a small window of opportunity where application has the plain text secret in its address space, so how can we provision application keys without cryptographic material ever being exposed to the user space at all? So for this we have two other interesting key types, one is called encrypted key, and in this case the process has not the plain text key material, but encrypted key material with some other key, and the kernel has a wrapping key, so when the process inserts that key inside the kernel, the kernel automatically unwraps the key, and if we try to read it back, it gets automatically wrapped by the kernel again. But here we have the chicken and egg problem like how do you then provision the wrap key, right? So, still things, so what James showed earlier today in his demo is you can technically replace this with a TPM, and then you have a thing called a trusted key, so again you have the wrap key, but wrap to a particular TPM, you can insert in the kernel and TPM will automatically unwrap it, and again if you read it back, it gets wrapped. But this schema is not really great because as James mentioned TPMs are slow and there is as much as you can do with these operations, so like if you have thousands of keys you don't want to continuously poke the TPM to unwrap them, so you can do some kind of a combined approach where basically you have some kind of provision, right? So, and you have some kind of HSM in the cloud or on-prem, whatever which does your cryptographic keys, and then you provision a root key first, so you basically wrap the root key to a particular machine to its TPM, and then you insert it and the TPM unwraps it, but all the other thousand keys are encrypted with this root key, so the process received the wrap key and then it puts inside the kernel and then you don't go to TPM, you already have the root key which is a software implementation, can easily unwrap all the other thousand keys. But there are still problems with this approach, even though the application never sees the cryptographic material in this process address phase, but applications are still responsible for receiving this wrapped cryptographical material from this centralized KMS HSM service to wrap their keys, and so basically each application needs, who here uses Vault? Yeah, some people, right? So like it's, you kind of like know what, need to know what your Vault address endpoint is, right? You need to speak the Vault protocol or AWS KMS protocol, you need to basically integrate all this crap in your code, and there is little administrative control if like you're managing fleet of machines of the created kernel key object, so applications when inserting the key can set invalid permissions, so like anyone can, for example, if you set improper permissions on your RSA private key, any application, even malicious on your system, can use it to encrypt or sign data, right? And ideally like you also want authentication here, so KMS or HSM, that remote service, needs to somehow authenticate each requesting application if it can provide the wrapped cryptographic material. So how the kernel tries to solve that problem, it has two set of system calls. So far we've been using the at key system call with a key CTL utility, so it adds the key to the specified, key ring with the specified payload. So basically the application is responsible for the payload itself, so it's either plain text or in case of trusted or encrypted key, the encrypted payload, it gets it from somewhere and it sorts it into the kernel. And the payload is interpreted according to the key type, it's like no interpretation happens for user logon keys, because those are mostly symmetric keys which are random strings, it's a private public key for asymmetric cryptos or wrapped for encrypted and trusted. But there is another interesting API in the kernel called request key, so instead of applications inserting the payload directly what applications can do, they can ask the kernel, just give me my key, give me my key and give it an arbitrary string as an identifier. And it's on the kernel to actually satisfy that request, and obviously the kernel has no idea of everyone set up, like where should it take the key from, so it's one of the examples where the kernel can then make a user space callback and with a special helper program which you can then configure to actually deliver your keys, right? But it's a more centralized and transparent API to the kernel system, so how it works, so you have the process instead of adding key, so the process requests the key from the kernel and provides the identifier, so like give me my cloud app key one, so the kernel creates a placeholder, then it creates a special process, a callout process, helper process in user space called request key, and this one you can configure and you can specify different routes for different key types, for example if I requested the cloud app key one, it will go to the cloud sub-module and you can write these sub-modules in any programming language by the way, it doesn't have to be C, so you can write them in Go, it can be just simple batch scripts as well, which are basically responsible for if the path is cloud, it can contact your cloud HSM, get the wrapped cryptographic material, put it back inside the kernel, the kernel will then instantiate the keys and then the application will get its key back. So with request key advantages, you have a single centralized operating system API to request key from the application, so there are no KMS or HSM connection strings, you arise in your configuration form, just a freeform ID string, and it kind of fully decouples, your application is fully decoupled from key storage backend, so it doesn't care where the keys are stored and how they are distributed, and it's a more secure way to instantiate the keys in the kernel, so this special call-out process which is created by the kernel is very special in the sense that it has a special credential enforced by the kernel, so even if you launch the same helper process yourself as root, it will not be able to instantiate the requested key because it doesn't have a specific token from the kernel to do it. And this also call-out process is very useful, in fact it can be trustworthy, so you can perform additional security checks, you can implement arbitrary policies there, so you can check the requestor, user ID, group ID, executable pass, package name, whatever you suppose, is this application even allowed to request the key in the first place, and you can immediately deny that request. And you can support multiple key storage backends, you have local storage, you have a TPM backend, cloud HSM backend, whatever, and you can even swap these backends transparently, like if you, for example, migrated from on-prem HSM to a cloud HSM, all you have to do is just modify this helper process config file and applications will not notice. And then you have the nice thing that you need to only authenticate this single helper process on your backend. And yeah, as I mentioned, the backend connectors can be written in any language, so very easy to extend. But the nice thing about that with request key, the key management and distribution becomes a core service operating of the operating system itself as it should be, versus like every application has to deal with it on its own. That's basically it for today. Here are some links to some kernel documentation, to some key ring man pages, as well as the last link. Again, everything I told you today and even more is described in the cloud for our blog post, which is linked at the end. Thank you and I'm happy to talk to you. Thank you for the great talk. So I recall there was an API in the producer space to protect memory from kernel space. So the, like a given page was unmapped from the kernel. So if you had an out of bounds in the kernel, you couldn't access the memory, but of course the kernel could remap the page back again. My question is, are the keys protected in such a way in the kernel? And do you think it would make sense to do it? I mean, it would potentially minimize the exposure in theory at least. The default, I don't, I'm not sure about the implement, but I would say no. I think the keys are not like more protected. So the guy who wrote it is right there. And what was the question? If you put a key of the user space process into these areas, they will be more protected than otherwise. It still doesn't guarantee like 100% My point is the kernel could also do it so that it would protect those keys from itself as well. And it would only remap the page back again when it actually, when you do the request key for it. But what's the point then? If kernel needs the keys, it has to have access anyway and remapping and mapping is costly. The other thing is the key store API internally is also extendable. You can write other modules and this is what I asked for. James earlier, that you can technically write an asymmetric key implementation backed by the TPM. So the keys will not be even inside the kernel. It will be in the TPM, but then each operation will have to touch TPM in the first place. Or if you like design some kind of crypto chip or you can like design like an arm like a truss zone back here. So like whatever you want. There was some effort. I don't remember exactly which areas it touched to do this sort of separation between subsystems. But I only learned about it once. I don't know what they say this. No, no. Well in kernel it's still like the old, you mean in the kernel subsystems? I don't like it's still like a flat address space at this point. I don't, unless you're again using like arm trance zone or enclaves or whatever. My question is, so you mentioned that we can do RSA operations. Not everybody is using RSA. Are there any efforts to introduce other kinds of asymmetric keys? In particular, I'd like to see an explicated stuff. So, yes. So the kernel currently also supports ECDSA, but only for signature verification. It was added for kernel modules. I send like patches to actually support signatures through for the Q-Stone API twice. I didn't get any traction on them. I'll send it one more time maybe. Because I also know that the kernel has its own internal crypto API and has support for all of these operations. They're just not exposed through the key store. Well, specifically for RSA, for ECDSA, no. The kernel crypto API doesn't have crypto for ECDSA signatures for generating the signature. So my patch set included both the crypto subsystem and the key store subsystem. The kernel can do ECDSA signatures, but also this code is reachable through the key store API. Okay, thank you. Very interesting talk. Thank you. I have basically the same question. But also, wouldn't there be an urgency to get some PQ crypto in there? Maybe, but we have to fix ECDSA first before we have to learn to walk before the run, right? James, can you pass it to the next? So if we now add the trouson to the picture, does the kernel have any kind of API to interact? I mean, the key store itself, would it interact with the trouson to get the key or we need to still go to the user space to the helper and then the helper will just go through a normal way of communicating with the trouson and secure monitor call and get back the result and then the key back to the kernel. For the trouson, I think there is some code, because I never tested on an ARM system like similar to what we have, the trusted keys for the TPM back trusted keys. There is an implementation for trusted keys for the ARM trouson, the open source one. I saw the code, I never tried it, but it's there. So there is some reference of the application, right? Yes. The GNS and there is internal support for that. Yes. OPTE. Yes, OPTE. Alright, anything else? Oh, yeah. If you shout, I'll just repeat. It's just wondering which version I need to use this. Sorry? Which version is it available from? The kernel key store. I mean, it's quite all, I guess. What we did, I think from 6.1, again, we mentioned the crypto subsystem, the key store subsystem. It was really handy to insert the RSA key into operation with it, but you didn't have any ability to do the same with the symmetric key. So what we extended is like the crypto user space socket API to be initialized from the user or logon key. So now you can do it from 6.1, you can insert a symmetric key and then you can create a crypto socket based on that key to perform like AS encryption with that key without exposing the key to user space. Back to you, does this not want to... So if I recall correctly, you said that the persistent keys can expire after some time of being unused. Does listing the keys also count as using them? That's my first question. My second question is like, what's the time out time for it to expire? I haven't used them like so widely to have those specifics. I think the time out is configurable, definitely, but listing, I don't know if listing the keys actually reset the timer. I just want to answer the question from over here. It looks like the API has been available since 2.6.10, which feels old. Yeah. There is one person over there which... Maybe you shout, I repeat. As a certified micro-configuration enthusiast, is there a reason why this approach is taking rather than planning APIs for the value of duty and so on, that you need the space and have the same benefits? The question was why we didn't do it in user space, but... How do you add extra functionality to the kernel to give you the same benefits? I kind of don't quite understand the question. The whole point is not to expose cryptographic material to user space. You're saying the benefits are, for example, if a process dies, then you can immediately wipe the key from memory and that sort of thing. You could also add functionality to add consistals to the normal database processes that have that sort of benefits. Why didn't you do that rather than sticking extra things into the kernel? Because you can retrace the processing of user space, but you cannot retrace the kernel. Just saying. Anyway, we are out of time. Thank you very much. I'm sure you can get two infocations.
A few limitations in the available fs-related system calls
Okay. Ready for almost our last talk. Again, this is the kernel development room. Nobody got lost, hopefully, here. Our next talk is by Nick about limitations in the available file system calls. Okay. Now it works. Okay. Great. Thank you. Okay. So, hello. Thanks for having me here. Nick, I work at FORF. It's a research center in Greece, and I mainly do prototype bring-up. So, they give me boards and chips, and I make them boot Linux. We mostly work on RISC-5 prototypes these days, and this is one of the projects I'm working on with the teams to bring RISC-5 to the cloud server region. So, I've played with the kernel. I have some driver experience. I've contributed here and there, but I'm totally clueless when it comes to file systems. So, while working on this project, I wanted to take a backup of my home folder, and I wanted to preserve as much as possible. And also, I had another use case when I wanted to take a backup of a really large media library we have in our radio station, our open source radio station in Crete, in the university. So, I tried, I was looking at the available programs out there like copy, r-sync, and stuff, and I sort of wondered what would it take to implement such a thing with the current APIs available in the kernel. And so, let's begin with what it takes, what it means to copy a file. First, you have to preserve the file data. You need to do this in an efficient way, and you do not want the files to be larger when you copy them to the destination, or you want to be also space and time efficient. And you also need to preserve file metadata, and file metadata is, again, the permission bits, like retry, execute for group and users and others, the ownership, like a user and group IDs, the timestamps, and there are some old school attributes which basically 32-bit mask used for tracking some attributes that were used to be file system specific, and then you have the extended attributes, which is a list of key value pairs that you can use. So, let's see what we have to the available API we have for that. For copying data, okay, you can, we have the naive approach, you can just open the source and destination, read from one copy to write to the other, and then close them, but this is not very efficient because you have to copy data to user space from kernel and then copy them back. So, you have send file, which is a more efficient way of doing this. This is a system called that's, so read, write, it's the most portable way, everyone has this. Send file is Linux and free BSD. So, what you do with send file is you have to file descriptors, and then you give them to the kernel, and kernel does the copying for you, so you don't have to read, move stuff to the user space and back. And then we have a new system called that's Linux only, and it's supposed to be the most efficient of all. It's called copy file range, and it's, it, it's the goal of the system called is to also take into account file system optimizations like copy and write, like, refling, like for example, when you want to copy two files on a server, on an NFS server, then the, instead of bringing them to the client, do the copy and send them back, you tell the server to do it for you. So, this new system called is, aims to do all sorts of optimizations for you, and from what I've read, they also want to do optimizations at the block level. For example, NVMe has this operation where you tell the controller to copy blocks for you, instead of doing it from the CPU. So, they aim to take advantage of these things as well, and this aims to be the new API for copying files. It doesn't work across file systems, it works, it has some limitations on how, where you can use them, but it's pretty cool. And it's, when we copy files, we not only want to copy the data, we only also want to preserve the holes, because I mentioned we want to be efficient in space as well. So, what this, this is, is some files that, for example, you have a large set of zeros, instead of storing them in the disk, you just truncate the file, and the file system driver will bring you back zeros, but you won't, you won't waste any, any space on the disk for that. And you have a L6 system call for that, so you seek for the next chunk, and then for the next hole, and you truncate. So, you want this, you want to do this for preserving holes, and not have the file growing when you move it. So, that's for preserving data. How do we preserve metadata? Okay, you have CHMOD for the permission bits, it's a, you have, you can have CHMOD, or the family of system calls, and you have CHMOD at, which takes an opath descriptor, which is very useful, I'm going to come back to that. You can use CHMOD for the ownership, and you have UTIM and S for ATIM and MTIM. So, two of the timestamps we have for the old style 32-bit attributes, there is an IOCTL, which is not portable. This is a set of FS flags, that some of them are file system specific, so it's not that you can preserve them while copying from one file system to the other, it won't make any sense to have the same flags, some, some of these flags do not make any sense for other file systems. There are some flags that are common between all file systems, like the immune, the immutable flag, or the append flag, which tells you that you are only able to append to this file, or that the immutable flag tells you that you are not able to do anything with it, even the root cannot do anything with it. Pretty useful for backups, for example, when you take the backup, you can mark the file as immutable, and then if some, for some reason, someone's doing RF minus RF, it will not delete the file, it will not be able to change it. So, but the IOCTL is not portable, it's a hackish way of doing it, it's a long way of doing it, so now we have the extended attributes, which is a better API for having attributes to the, adding attributes to the files, and this is used for all sorts of things. It's used for storing capabilities, it's used for security modules like SELinux, AppArmors, Mac, you can use it for access control lists, X4 uses this to store data in the iNode, it's crazy, I mean it's a huge list of extended attributes out there, there is even a configuration file where it gives you like a filter on which attributes you can preserve between file system and stuff, and honestly there is not a central place where you can get documentation on this, you have to look, find your way around it, so this is the way of preserving extended attributes, so let's see some problems so far. So, copy file range, as the man page says, may expand whole holes on sparse files, so this system called that's meant to be, you know, the whole source of optimizations when copying, like taking into account various file system and block level, even block level optimizations will expand holes, so you need, you can use this for copying chunks of files because you need to copy a chunk, do truncate and then the next chunk, so you cannot use it like to copy the whole file and forget about it, this also makes it, this both the instant file and copy file range, there are no IO hearing ops for that, so for example, if I want to copy a whole directory and I could do a batch of system calls in IO hearing for every file and then let the kernel copy the directory and come back when it finishes, I cannot do that, we have IO, you have IO hearing ops for read and write and the others, and even we even have IO hearing ops for the extended attribute system calls, but not for sent file or copy range, if copy file range didn't expand the holes and we could have an IO hearing op for copy file range, it could be great because we could just put it there and it will copy, it would be much better for, you know, copying batch of files, we would have to do system calls all the time for seeking and truncating and I mean it's meant to be the optimal solution, the other problem is that the opath descriptors, the op variants of these system calls, they take, instead of having the whole path string, you give a descriptor which is an opath descriptors which describes points to the directory, let's say, to the path, this is very useful because you don't have to, so you don't have to carry the string of the path all the time and it's more safe for multi-threading because for example, if you change directory while you're doing stuff, things will be messed up, so it's better to use opath file descriptors, opath descriptors, you get those from open app, you get from open, open gives you these types of descriptors that are very useful, but we don't have ad system calls for extended attributes, so we do have for everything else, but there are no ad variants for the extended attribute system calls, which it's a bit messy because you can use this for the rest of your process, like for copying files, for changing, for u-times, for changing the timestamps, but not for chmod, for cho, but we don't have it for extended attributes. You mean with opath file descriptors? Yes, I think that doesn't work, but we do have fsetx, adderat, and fgetx, adderat. We don't have it at, no. I mean you can't look up, yes they only operate on a specific fd, but I mean... No, no, no, for example, for chmod and cho, in u-times, then you all have ad variants, you can use opath descriptors. We don't have ad variants, sure, but I should probably... Yeah, you can do some tricks with proc self and get file descriptors from there and resolve the path, but it would make much more sense to have ad variants for those system calls as well. We didn't have chmod ad, but that got fixed on 6.6 with the chmod.2 system call, so now this is covered, and something else that's weird is that the u-times and ad allows an empty path flag, but it's not in the man page, so we need to update the man page for that. So that's one thing, it would be great if we had system call variants for the extended attributes. It would also make sense instead of having to use ioctl for the old school attributes to have a more portable way of playing with them, so maybe if we could use, if we could wrap them around an extended attribute, it would be better, I guess, because we wouldn't have to resort to ioctl, we could have a portable, like have one attribute for immutable, one flag, or one attribute for no dump that map to the old school bitmux, so that we have one API for all attributes, the old ones and the new ones, which is portable and not ioctl, and by the way with ioctl you cannot use opath descriptor, so that would be better. Another thing that I told you about before is that for the extended attributes, I couldn't really find documentation in one place, I had to look over the place and there are some of them that are not documented in the kernel, they're documented elsewhere, so it would be great if we had documentation for all the extended attributes the kernel understands and the permissions required for accessing them, because there are, for example, security and trusted namespaces that you can only access when you have the DAC admin and going, this is the next part, and you have some others that you need for security capabilities, you need setcap, for example, so it would be great if we had documentation for all the extended attributes out there and how you can set them, what do they do, so in terms of capabilities, when you do a backup, you probably need a couple of accurate search so that you can read the source hierarchy without being the owner of the files that you're copying, if you want to make copy special files like devices, you probably need cup and k-node, chown for doing, for changing the ownership after you've copied the file, you probably want, you want for again cup of owner to be able to set the timestamps and most of the extended attributes in order to set the immutable and the append attribute, the old style attribute, you need the Linux immutable capability, so what I want to point out here is that this is a bit inconsistent, for example, we have a capability for one attribute for the append and the immutable, and then we have another capability for the capabilities for an extended attribute, and then for two families of extended attributes, we have Capsis admin, which is an overkill, we don't want to use Capsis admin, so it would be, it would be nice if we discuss this a bit, like how should we arrange capabilities, how should we, how should we, maybe we should introduce more capabilities or review how these are being used, obviously we cannot, I mean break backwards compatibility, but it would be nice if we had that, I mean this goes also into discussion regarding documentation, like we could have a documentation or what you could do with this attribute, which capability you need to play with it, and what it does, so it would be, it would be helpful to have that, and then we could discuss how we could break Cic admin or how other capabilities we could introduce for maybe other special attributes, like for, I don't know, ACLs or something. So another discussion, when do backup file, let's say that we have done already a backup, and now we want to update it, first thing you do is like you check modification times, okay, you can see with them, with them time from one and the other, and make sure that if they're not the same, then the contents of the file have changed. M time only tracks the modification of the contents of the file, if you changed anything related to metadata, if you changed permissions, if you added an extended attribute or replaced an extended attribute, then you, M time will not catch this, C time will catch this, which is the last change attribute, but unfortunately you cannot preserve C time, so when you copy the file for the first time, the creation time, the change time on the backup, it wouldn't be the same as the source, so when the source C time changes, you cannot compare it with a backup. So it would be useful, and so maybe it changed a bit later, maybe, I don't know, you cannot compare them because they are in a different system, different time zones, I don't know, and you have no control over that. So if metadata changes in a file, you have to read them all and compare them by hand, so for example, if I add an extended attribute, I have to read all the extended attributes of the source, all the extended attributes of the target, and then see if something changed, because I cannot rely on C time, because I cannot preserve C time. So some tools, I mean, are synced, checks for M time, and if you want to be more safe, it uses CRC, CRC is unreliable, it's not secure, it says for security, but you can make your way through, you can change the file and preserve the checksum by adding some extra stuff afterwards. Shall be better, both these approaches are not nice because you need to read the whole file both on the source and the target and compare. We could use measurements from IMA, but we cannot expose those, we cannot transfer them through NFS, so if we are copying files from the NFS, NFS server and copying them locally, you cannot get the measurements on the server. There were some discussion around this, there were some patches, they were rejected, because I don't know there were some licensing issues, like IMA's, GPL, and NFS is different, they couldn't standardize this, so they could do an RFC call, whatever. So if we have measurements, we cannot expose them, I mean, even if I, like, so if we had, we could, I could at least compare them and make sure that the file contents were the same or changed, because M time is not reliable, I mean, if someone changes the file, they can reset M time, and I may not take a backup of it and leave any secure copy there. So we have a problem there, and I suggest that one approach to this problem, okay, obviously if we could expose IMA measurements through NFS, that would be the most appropriate way, because we could have, like, secure trusted hashes that we could use, but one way of doing it is if we could preserve C time, if we could preserve C time, when we copied the time, the file for the first time, we could have the same C time on the backup and the source, so if the C time changed in the source, we would, we could check it with the backup, we would know that something changed not only in the contents, like in the case of M time, but also in metadata, so we could use this to skip the file if something didn't change, instead of having to check all the metadata and compare and then skip, and also because when M time changes, C time also changes, if someone tampered with M time, if someone tampered with the data and then reset M time, because C time would change, we could also catch this. There is no API for doing that, and the reason is, there are various reasons, but the most obvious one is, like, there's a chicken and egg issue, like when you set the C time, because the C time tracks the changes of, any change you do to the I, no, it would need to update itself, so it will change the C time again, so if you set it in the past, for example, and when you do the change then it would need to be updated to indicate that it changes itself, so there's, I mean, semantically it's a bit of an issue, it's not another issue that when the C time is in a remote system and you try to change it locally, it wouldn't work, if you give the command, let's say, to change it, it wouldn't work, because on NFS the server tracks C time differently, so it's not that easy, and there is this discussion about C time, okay, C time currently is only maintained by the kernel, there is no API not even for the root, and there is this notion that this could be used for forensics, like for tracking, like malicious modifications to the file, but if someone is privileged then it can do other things as well, it can change C time by modifying the system time, which already, there are already programs out there that do this for you, or you can just open the file system, change the date on the file system and come back, and you can even do this without crashing the whole thing, because now we have FS freeze, and I tried it, it works, so I believe that since this is currently, it's doable, people can't tamper with C time, privileged users, I suggest that we have an API for doing this in the proper way, not in a hackish way, and maybe with a new capability for that, it's the same way that we have a capability for immutable, let's have another capability for C time, and I don't know, change U time, and so my introduction, another system call for that. I was thinking of birth time or creation time, like should I preserve that as well, and I didn't really find a use case for it, there are programs that already create, remove the file and create a new one, like Vi does, so this timestamp doesn't say much, but there were some discussions like this, there is a Windows, like in Samba, there are similar timestamps for another operating system that tell you where the file was created in the first place, and yes, that's why they added this, maybe we could have, instead of trying to preserve this, which is an immutable timestamp, maybe we could have like a common standardized extended attribute that we could carry around, I don't know, we could obviously have timestamp inside the file, but not all file formats support this, and then final, how do we back up encrypted files? So with eCryptFS, you could just copy everything as is, because the metadata and everything needed to decrypt the file is in the file itself, and then you also copy the configuration like the eCryptFS, which contains the wrapped key and stuff, but if you encrypted the files with Fscript, which is the new way of doing things, you cannot back them up, you could use, you could figure out if a file or a directory is encrypted, you cannot figure this out for a sim link, because in order to take this, you need to do an IOCTL, so IOCTLs don't operate on no path descriptors, and currently there is no API to copy, to back up files encrypted with Fscript, the only way of doing is to decrypt them first, and copy them and then re-crypt them, but this beats the purpose of having them in an encrypted form in the first place, if in order to do a back up, I need to re- decrypt them then, there is a window of opportunity there for an attacker. So that's all, that's a summary of what I'd like to discuss or what the issues I found, again, I'm not a file system expert, I don't know what would it take to do these changes in the kernel, I've only seen this from a user perspective, and that's it, thank you. Can you go back to the original slide, so tracking C time and M time changes, this is something that Jeff Layton has worked on in the context of NFS, because they also want to track changes to files, they have a need for this internally to NFS itself, I'm not completely sure how it works, but we do have internally something which is called the iversion, which is the i-note version, but we haven't really exposed this to user space, and the idea is that you get an updated iversion every time, for example a data or metadata change occurs, so that could be thought about exposing this to user space, but we weren't quite sure if we can have a consistent definition across all file systems for this, but anyway you should probably talk to him about this specific stuff, I think someone proposed listset get x-adder at system calls a while ago, and no one was really opposed to adding those, though I kind of questioned the value of them a little bit, my main gripe with them had been that the original proposal made it so that you can use opaf file descriptors to set x-adders, which I think is just opaf file descriptors are limited, and if we add ever more functionality to them, that you for example can set x-adders to an opaf file descriptor, at some point defeats the purpose of having an opaf file descriptor. It's for consistency, I mean you have it for the rest of it, you have it for ch-mode, for ch-own, for u-time and so, it makes sense to also have it for x-hub, right? Yeah sure, as I said I'm not really opposed to it, just a bit weird, because in a way that the thing that I don't quite understand yet is if you have open-at or open-at2, and you have a file descriptor, a starting point file descriptor, and then you want to back up, I don't know, you want to extend the attributes further down the tree, then you could do an open-at, open this file, and then pass that to fget get x-adder. Well if there are workarounds there you can also go from proc-self and get the descriptor and resolve it. Yeah and the thing is, the thing is what I'm saying is like a lot of these, a lot of the newer stuff that we have in open-at2 which have lookup restrictions where you can, for example, don't cross into another file system, known and so on, all of that stuff is not available with, for fset x-adder and get x-adder and so on. But anyway, I'm not opposed to adding this, if it's useful it would probably be easy to do. Yeah, I don't quite remember the IOC get set flags, this is not exposed in static sets, I assume everything that we have in IOC get flags. I don't remember, so the statics will tell you, because you also have an IOCPL for determining if it's encrypted or not, which you cannot use or see or on Syslinks, Asimlinks because you cannot use it. For the immutable, I don't remember, it has the, I remember that the immutable and they're not damp, they are reported, are part of the statics, so you could get them with statics, but how do you set them? You cannot set them in another way other than the IOCPL. So I was, I thought that since we already have the x-adder interface for setting attributes, maybe we could map those old attributes to some new extended attributes that set them through this way, so that we have a common API for all of them instead of resorting to IOCPL and use these flag... We probably need to talk about this a bit more because I don't understand what you mean by mapping those two x-adders, but in general the x-adder interfaces, they're somewhat broken, the complete API is somewhat broken, like if you, for example, if you set 1000 or 20,000 extended attributes on a file, then LIST X-Adder won't even return them to you anymore, because it will tell you, oh, it's limited to, I don't know, 64 kilobytes or less, and if you go over that, you can't list them anymore, like the whole API is fucked, so I mean... Yes, but yeah, it's used for everything, okay? I agree with you. It's even used for inline data, okay? X4 uses extended attributes to store data in the, I know it itself, so yeah, I get it, but I need to preserve them, I need to preserve both of them, so since this is what has come up with, let's at least, I don't know, at least bridge the gap between the old one and the new one, I mean, in the same way I have immutable and append, and I have to do the IOCTL, just give me a standardized extended attribute to do this, so that I only have to use one API, and not, because with the IOCTL, you can, you should also mask the flags that are not file system specific when you do it, and with the extended attributes, you can... You should or you can? You should. When you read the flags, you mask those that are file system specific, so that when you preserve them, you do not try to set a flag to a file system that doesn't make sense of that file system, so yeah, it's a mess, so if we had a extended attribute for wrapping this, we could, for example, get an error in this case, or it should be, we had, we had one API for the whole thing, which to me, it would look much cleaner. I don't know what would it take for copy file range or send, that's, or for send file to become an IO during op. That probably, I just sent a patch to Yen, so I'm pretty sure he's happy to take this. And I don't know what, if we had this flag, because config file range has a flags attribute that's not being used right now, it's just empty, so if we had the flag there for, you know, preserving holes, that would be great, because I wouldn't have to do all this back and forth to determine where the hole is and preserve them. That would be great, I guess. I don't, I guess, I, I, I, I, I, I, I, I, I, I, I, I, I haven't seen the code, I don't know what would it take, that's why I wanted some feedback on this, like, if this would break things, this would be too much to ask, I don't know. So, um, yeah, that's a approach, yeah, that's a good approach. It looks to me that this will be a lot easier if there was some sort of generic, uh, API to dump, uh, tree, our SAP tree in the file system, any file system implements that on its own, and then you get something generic, and then there's something to restore that. So, with Hal, let's say, a system called that would do the whole thing. Yeah, exactly. Because the file system knows better. Yes, but, I mean, they tried, they are already trying to doing that, at least that's what I understood from the documentation, with, um, uh, with the copy file range, at least for the data stuff. Yeah, but just for the data, what I mean is you, it doesn't have the holes, but, so you say we could have, like, something similar for the, for the whole attribute thing. Yeah, you give it a directory, and then it, it passes from that, the entire file system, and gives you something with all the attributes, holes, and whatever the file system supports, and then you can re-import that into another file system, and maybe you add some filter to it so that it doesn't give you something that it doesn't support in the other files that you want, or whatever. Currently, this filter is in a config file on ETC, and user space needs to be... Yeah, but it's user space-based. There is not a registry for all these attributes. That's one of the problems, right? I don't even know how many there are. And the other thing is, perhaps, it will make more sense to, instead of doing APIs like copy file range, to use the splice concept and implement it for pipes and sockets as well. Send file does this internally. So, the previous one, send file, it uses, internally, it uses splice for doing this in Kendo space. But it doesn't support the same kind of, like, going over holes and that sort of thing. Yeah, you still need to come back, but that's why I was saying that, since the new API is meant to be called file range, and it has a flag attribute there that we can use, maybe we could introduce a flag for this. That would make more sense, because send file is already being used, and it would break things we want to be about. Sorry, we're out of time. Okay? Thank you, everyone.
Packet, where are you?: Track in the stack with pwru
So hello everyone, I know this is the end of the day, the end of the first day, so thank you for being so many to attend the talk. I won't be too much into kernel details in that talk, that should be relatively easy to follow. Yes, I'm sure this is a kernel dev room, this is not about Go, so don't be worried about the logo. I also do apologize if some of you attended Jeff's presentation from yesterday, so the same topic, the presentation from today will be pretty similar. But still, so what is Peru? This is, so the name comes from packet where are you, and this is an EBPF-based tool to debug packets going through the Linux networking stack. So we see why we wanted to work on that tool in the first place, how Peru works, and what are some of the features, and how can we actually use it in real life to debug real problems. So the problem is that nowadays we have a lot of things to debug regarding to networking stuff in general, so when you use containers with namespaces, Kubernetes, all these kind of things, you typically have packets arriving on the interface, and then being forwarded to a pod through a pair of these interfaces, and there's that big thing in the middle, that's a penguin, that also stands for the Linux networking stack, and from the point of view of someone trying to understand what's happening, it often looks like a black box that's difficult to analyze and to understand fully. So how do we get some visibility into that? We've got a number of things happening in the Linux networking stack, a few things, that gives an idea. It's very tricky to get to the right place, so where is my packet? So that's the problem we have. So usually when something goes wrong, we use TCP-DOM. Right, TCP-DOM is good. TCP-DOM is a great tool that's very useful. TCP-DOM works well here and there, and sometimes the stuff happens here, and that's great. Sometimes though, it happens in the penguin, and sometimes in the pod as well. So what do I do with it? Can TCP-DOM help in that case? Not really, not so much. There are some other tools to debug things. There is printk, well, comes with a number of drawbacks too. I need to recompile my kernel. It's quite slow to process, to adjust every time. I need to add new printk's. I may possibly have to add a lot of printk's if I have no idea where my packet's going. If I do things wrong, my kernel will panic, that's not great. And how do I filter on specific packets? It's difficult to do. It's far from ideal. We've got some other traces too. Perf is, for example, a good tool to trace kernel functions that's something else, and I can just look into that function and look what's happening in there. But for networking, really, it's hard to do this filtering on the packets that I really want to follow. It's also hard to extract the network-related information out of other things that Perf returns. And in the first place, how do I know what function I'm interested in? Where is the stuff happening? Where is my packet drop? Where is my packet masqueraded? Where the interesting events are occurring. So what if we could have something that gets a bit of all the functions in the kernel that will be processing my packets? And what if I could get callbacks and run programs when these functions are called? And if I could also filter these callbacks to make sure that I only process the packets that I'm interested in. So that's where we introduce Peru, which is based on the BPF. So I assume most people in the room have some familiarity with BPF. So I won't go too much into the details. Just as a few reminders, that's this execution environment inside of the kernel where you can inject programs from user space. They're going through the verification to make sure that everything is safe and won't crash your kernel. You go for the JIT compiler to turn these programs into native instructions and get some good performance too. And then you run your programs on some hooks where you attach your program in the first place with a diagram that looks something like this. So we have a program here that we will hook to a probe on IP local deliver, which is a function that takes SKB as an argument. So that's a socket buffer that represents the packet in the networking stack. And we scan the LVM, or with GCC nowadays, but most of them we scan. We turn that into an L file that contains the BPF program as bytecode. And then we use a loading program that can be in code, that can be in C, that can be in Rust, whatever, to extract the bytecode from that L file and inject it into the kernel through the BPF system code. Once in the kernel we get the BPF, the VFI are in to make sure that the program is safe. We compile the kernel. We don't have to, but most of them we want something fast. So we compile the kernel. We compile the kernel, right? We compile the program into native instructions. And when my packet is coming in, and IP local deliver is called, then it triggers the execution of the program. And I can communicate with my agent in user space through the use of eBPF maps to store data. So, for example, to store metadata about my packet and retrieve them in user space to know what's happening. That's great, but how do we keep track of all those packet processing functions? So I have IP local deliver, I have a lot of other functions that are doing packet processing too. That's where we leverage BPF, which is BPF type format, which is a metadata format with different information. So a bit like dwarf, but producing objects that are much smaller than dwarf and that target BPF specifically for a number of use cases. So we can have BPF information for one BPF program in one object file. We can also have it for the Linux image itself, which is... So this BPF object is usually exposed in the C-SFS file system. It looks a bit like this. We have a very simple program, sorry, a very simple function. It's going to get marked that takes socket buffers and argument. I turn this, I extract the BPF information from that object file that I compile into a BPF program. And this is the BPF information on the right side. So it works like this. It says, I've got a struct SKBuff with the different offsets of the different attributes. I also defined another type, which is a pointer to that type ID too, which is my struct. I also defined the prototype of a function that takes the SKB, so the pointer to the SKB as an argument. And I gave it a name, which is SKBGetMap. And because I have the BPF information about the kernel image, and because this BPF describes all the functions in the kernel, I can process that in user space to extract a list of all the functions that take an SKB as an argument. And that gives me the list of the packet processing functions in the kernel. So now I have a list of all the functions that I want to hook to. So that answers to the three criteria we had. How to get all the functions, where we can with BPF, how to get callbacks, we can with EBPF and K-Probes in the kernel, and how to filter packets. This way using EBPF, and that's it was a packet filtering mechanism in the first place, that's relatively easy to implement. So how does it look like in practice? So I've got two terminals. It's not live demo, sorry. I use a rule, an IP table rule to drop packets, TCP packets, 1111, which is cloudflare DNS for example. And I call Peru, so here I have Peru destination host 1111 and TCP and destination port 80. And after I call Peru, it tells me that it loads all the, it loads my program and attaches all the K-Probes that I'm interested in. So that's 1500 probes in that case. And then in the first terminal, I type a curl 1111. What happens below is that I get a list of all the functions that process my packets. So I see a list on the right, IP local arts, IP local arts, NF hook slow, and so on and so forth. Sorry. Eventually I get K-free SKB mem, which is the function that is called once my SKB is free because it's been dropped by the IP tables rules. The IP tables rules I can also see through the code to NF hook slow. So that gives me information about what's happening in terms of function. It gives me information about the process that's been creating this packet in the first place because on the, on the column in the middle, you can see that it's a curl process. I get also information about the SKB, which is not useful by itself. This is the address of the SKB, but it allows me to be sure that this is one SKB that's being processed in the list. If I have several packets in this output, they will have different addresses. It allows me to filter by SKB when I post process this information. And once I exit from my Peru session, then it detaches all the processes we're loading. Okay. So what fancy features do we have beyond the basic usage? We have quite a number of options for Peru. So this is Peru dash dash help. I won't go through all of them, but through a number of interesting ones. So before we go into the options, you might have noticed that the way I told you to focus on the packet with the 1111 destination was just the same syntax as for TCP. And we do have a support for pickup filters in Peru. And the way this works is, so if I don't pass any filter, things are pretty much straightforward. I'm using my BPF program, uh, compared from Peru loaded, uh, into the kernel. Now if I do have a filter, I turn this into some CBPF bytecode using the leap pickup. CBPF is not exactly the same thing as a BPF. So I cannot use it just like this. So Peru uses another tool underneath, which is CBPFC. Hang on. And it turns, uh, this CBPF bytecode into a BPF bytecode. And then we get this CBPF bytecode and we inject it into the regular program. Okay. We've got everything in place. We load it into the kernel and that's it. That should be, it's easier after that. Okay. Okay. Some other features, uh, we can trace the kernel itself. We can, uh, we can trace kernel modules as well. We've got a few options to trace either a specific kernel module or all modules. Uh, so if you process packets with functional, take SKBs in your module, you can also, uh, follow what's happening in them. We've got a choice of backends for, uh, for Peru. So there are two currently, which is the regular K probes and the multi K probes. So what do the multi K probes do? They allow you to, uh, well, you don't really realize it when using Peru, but they allow Peru to load a bunch of K probes, uh, all at the same time. So instead of loading your probes one after the other, you create an array of probes and you pass this array with the size of the array to the BPF system code and then everything goes nearly at once. So it's faster. How much faster exactly? So if I could Peru on my laptop with, uh, the backend K probes, a legacy one, which is, uh, available, which has been available for a long time, the new one is for five dot 18, uh, plus only. So I get, um, a few seconds to attach other probes. That's seven seconds here, but it takes one minute, 37, uh, seconds to, uh, to attach, to detach other probes. That's not great. Now if I do multi K probes, uh, that's nearly instantaneous for attaching everything. Like there's, there's no difference on that test and once again for the touching everything. So that's quite faster. That's a good improvement. Um, here are a few other interesting functions. Um, they're all in the same box. They are not exactly related to each other. Uh, so we can filter also by a namespace for Peru, like looking for packets in one given namespace and not the others. That's totally possible. That's, I think that's relatively easy to do from the BPF perspective because I believe the, the namespace is directly available from the SKB itself. Uh, we can filter, uh, TC programs themselves, which are not regular canal functions, uh, just like, uh, the one we have in the networking stack. Uh, but because your TC programs can affect the packet processing, that's also interesting to, to follow what's happening on them. And, uh, the way it works is by using some specific BPF, uh, programs, looking on what is what we call the EF and three FX it mechanisms to plug directly onto, uh, those, uh, TC programs. So we're looking at BPF programs with other BPF programs. Yes, it works. Uh, we can also track SKBs that change. So when does it change? So for example, if I, uh, clone my SKB or copy my SKB, so the way we do that is, uh, when the option is enabled, we, uh, hook onto SKB clone, SKB copy at the end of the functions actually. And we, uh, we say, okay, this packet was interesting when I entered the function. And when I exit the function, I mark it as a packet of interest in a BPF map. So in addition to filtering the, uh, the packets that I usually want that I provided the, uh, fitter in the first place for, I also check for each, uh, for each packet if it's present in the map of the packets that I want to additionally follow. So that helps me, uh, following packets that may have changed. We've got some interesting options for, um, changing the display or adding more information on display. So I can add, uh, meter data on the socket buffers. I can add, uh, the full SKB. I can add the call stack. Here's an example. I can add the, uh, the, the, the four to pull for the packets. So in this example, we have, uh, two functions that, uh, process my packets here. And, uh, below each function that is displayed, we have the full, uh, call stack for the functions. So that's quite helpful to understand exactly what's happening in the kernel and how it goes, uh, in terms of processing. So to real life examples that we've had, uh, when working on Cydium trying to debug things on Cydium, which is a, a CNI for communities with, uh, a number of things related to networking and sometimes, uh, complex cases. The first one is, uh, MTU configuration, uh, error, uh, which we had to debug at some point. Uh, so we have a, sorry, we have a very simple setup with the packets arriving on the interface and the MTU on the, uh, on the node interface, not the same as the one on the VETH interface. And, uh, it was, uh, relatively easy to find out in the, uh, the output from Peru that the MTU, uh, is not the same, uh, that, well, is lower than the length of the packets. So the only thing I had to do to get this is to, uh, to, um, add the output to the information to get the, uh, the information about the, the packet that comes in. Another slightly more complex example is, um, so I had, that was in kind, so I had Docker network in the middle. I had, uh, this configuration with a pod trying to curl to the outside and, uh, hitting an IP table rule, uh, leading to masquerading the packets. So my packet gets masquerading with the address of the node interface. That goes to the internet. Okay. That worked fine. So in the second scenario, we checked that the packets were also, uh, currently masqueraded or not masqueraded when going to the node. And we have a second rule, actually, that was not displayed on the first, uh, case, which should, um, prevent packets going to the other node to, uh, from being masquerading. And so the packet should go straight to the other interface, should not change, uh, its IP address, but the packet never arrived. So what happens? So if you write the title, maybe you have an idea already. Uh, we thought that the packet was not being masqueraded as we expected. We thought that, uh, the IP tables rules were not being applied and we could have maybe found the issue, uh, differently, but, uh, Peru helped us to quickly confirm in that case of the masquerading is indeed, uh, occurring. So that's what you can observe on that, um, sample, uh, output. We can see that we're hitting NF hook slow and we can also observe for the same SKB that the, uh, the IP address, uh, is changing. So this is the same SKB. I just trimmed the, uh, the addresses of the SKB cause it was taking too much space, but, um, they're the same. So once we had this information, once we knew that the, uh, the IP tables rules was, uh, not sure taking place that we hit the, the net filter hook, we went back to the rules. So we were supposed to exclude the traffic, the closer nodes for masquerading. Turns out that the IP sets containing the entries, uh, indicating which nodes be, uh, excluded from masquerading were missing the entry of the node on the left on the first diagram. So that get me busy for, for, for some time a few weeks ago, but, uh, we did it. So Peru in brief, it's an BBF base tool to debug what's happening inside of the Linux networking stack. Hooks on kernel functions using, uh, processing SKBs. It's very good to pick up things where it's been a post shot in a way. Uh, you've got more visibility on what's happening directly in the stack and not just at the interfaces. We can use pick up, pick up filter, uh, style syntax to, to filter packets that we want. So we don't get everything. We just focus on the flows that we're interested in. We can try STC programs, can and models functions, uh, modified SKBs. Uh, so that's quite, quite flexible. Uh, we can, um, a number of information, a number of information, including packet level metadata, uh, the call stack, um, and it's proven very useful to solve a number of complex networking issues, uh, that we've encountered so far. So quick note on some other tools that are not exactly the same, uh, but that also uses this principle of, uh, creating a lot of probes to hook into the kernel and look at what's happening. There is sweet snoop, which is, um, really convenient to debug what's happening in the kernel when doing kernel development because it focuses on the written values of the function you're trying to, to, to observe or also the written values of most function in the kernel. If you're trying to just detect what functions are returning errors. IPF-Dress2 is very similar to, uh, to Peru. Uh, there are some features that, uh, are different between the two, but otherwise they are doing the same focusing and tracking the packets. TetraKone is a security events detection, uh, sorry, is a tool focusing on security events detection and, um, it uses, it also supports these, uh, multi-K probes, multi-U probes mechanisms. Uh, it uses EBPF to detect malicious activity on the system and to block it, uh, for, for, for security purpose. So this is the end of the presentation. I'd like to thank Adity and Matt Ness, who did a great presentation a few years ago, uh, at KubeCon on the topic, and, uh, I reused some of the materials, so, uh, I'm very thankful to them. Thank you to the Peru contributors. Thank you to everyone. Of course, thanks for the team, the talk. I hope you enjoyed it. If you have questions and if we have time, uh, I would really, I hope just to be open to questions. Thank you for the talk. Does it work well with, uh, GCO, GRO, like the segmentation of laws when the packets are merged and dissected? Uh, GCO, GRO should see the, should get the SKB as an argument, so they would appear on the list of functions that you, uh, that you get from the output. So yeah. Can you, uh, just print, uh, the SKBs or also trace, inspect those in, inside? Like, for example, I've seen this particular, uh, value inside the SKB, the changes and causes some kind of bug. Can, can I trace it? So you can, you can get the SKB, you can dump the full SKB. I don't think we have a filtering mechanism to, uh, do some additional processing in Peru on the SKB to only raise when you have that value. What you could do is, uh, filter new packet flow, dump the full SKB, and then probably post process to extract the ones that have this, uh, erroneous values, I suppose. But you, you can get the, the full content of the SKB. So from that, maybe that would help. So thank you for the presentation. Um, do you have a, an idea of the performance of your tool? And are you satisfied of that performance? And do you see some opportunities to make it even more efficient to be able to use it in production? No, I don't know. So one clarification is that, uh, it's not my, I've not contributed to it. Well, I picked two typos. Um, so I've not run any benchmarks myself. I know there is some impact due to the use of K-probes because you're loading so many K-probes at the same time. So it does have some impact on performance on the system. Um, I don't think we've tried to use it in environments where, uh, performance was a hard constraint for us so far. Uh, how could we improve that? Um, I'm not really sure. We haven't really given much thought into it at this point. Well, there's obviously the, the issue of loading and detaching the programs that is greatly improved in the multi-K-probes interface, but that's something different at the runtime. Uh, exactly. Yeah. Um, thank you for the talk in the first place. Um, and my question is which behavior can I expect with packet rewrites or encapsulation, uh, network address translation and so on, is the packet evaluated in every probe or can I, can I trace the packet even before the rewrite rule? So for example, I filter to the revitn IP address or I filter to the address before, uh, we explain encapsulation or whatever, or IPsec processing and so on. Um, so the way I see it, if you use the option to track the SKG, the SKG, the SKG, the SKG is, even if the, the metadata changes, you should be able to trace them, uh, uh, even after. So maybe if you set a given destination IP, you wouldn't be able to trace the packet before it gets that IP because that would be like guessing what will happen. But after it changes, yes, if you have, um, you know, tracking of the packets and that you, that Peru, I did it to the map for you, then you would keep following that SKB after that. Yes. Does it also track the revitn packet so I, um, trace the original IP if it gets encapsulated other destination IP? If you, so does it track, even if it gets encapsulated and if the IP changes, well yes, because it's the same SKB, right? So if you're, if you're basing, basing your, your, your tracking on the SKB address, then yes, it doesn't matter if you change the IP. Okay. Thank you. Okay. Thank you.
Welcome to the Legal and Policy Issues Devroom
Hello everyone and welcome to the 2024 legal and policy dev room. It's the 12th annual. 12th annual. We're really glad you're here. We've got a short program but it's packed with exciting content. I'm excited to hear what people say. We're going to probably just do we say our names real quick? Yeah, go ahead. Hello, I'm Tom Marble and thank you so much for coming. Tom is the founder of the dev room. I'm Bradley Kuhn, one of the co-organizers. I'm Matthias Kirchner, also one of the co-organizers. I'm Alexander Sander from the FSD also co-organizer. And I'm Karen Saylor. You're on the panel. And I'm on this panel. Without any further ado, shall we begin?
RHEL and CentOS and the growth of openwashing in FOSS
Hello, hello. You can take that one. Hello everyone and welcome to the 2024 legal and policy dev room. It's the 12th annual. We're really glad you're here. We've got a short program, but it's packed with exciting content. I'm excited to hear what people say. We're going to probably just, do we say our names real quick? Yeah, go ahead. Hello, I'm Tom Marble and thank you so much for coming. Tom is the founder of the dev room. I'm Bradley Kuhn, one of the co-organizers. I'm Matthias Kirchner, also one of the co-organizers. I'm Alexander Sander from the FSFE also co-organizer. And I'm Karen Saylor. And I'm on this panel. Without any further ado, shall we begin? Hello everybody. It's cool that this room is so full again, but it's not a big surprise. I'm Marcus Falner and I'm going to be the host or moderator with this wonderful panel with Liam, Karen and Penny. And I'm pretty sure you already know what we're going to talk about. So here we are. Me, I'm Marcus. I'm an open source journalist. I've been doing writings for Linux and open source for many, many years. And I have disclaimer. I was working also for Susie once. And actually today I was thinking which hoodie to wear because yesterday I got a CentOS hoodie at the CentOS Connect event. And I have a Susie hoodie, but I thought both of them wouldn't be appropriate. So I'm here without any advertisement. And on this panel, we're talking about Red Hat, CentOS and hopefully also the general move that at least I see. And I hope that others also see towards an old demon that is sort of waking up again. And that's the demon, as I call it, of open core. There was an open core summit and there is open core investment funds that invest into open source, but only if you're doing open core. And that all came to my interest as a journalist because I was writing articles about the topic last year. And then I found lots and lots of articles of a guy from England that I knew from the past. And this is already where I hand over sort of to the self introduction of my, our guests here. So I'm not starting with the ladies. I'm very sorry because I already mentioned Liam. And okay, I'm not just briefly, Liam, open source journalist at the register. Everybody knows Karen. She has already introduced her just a minute ago and from the free software conservancy. It's actually software freedom. Software freedom. I'm so bad at that. Sorry. The slide is wrong too. The slide is wrong. Oh, good. Yeah, the slide is wrong. Oh, thanks for giving me an excuse. Yeah. And it's Benny Vasquez from the board of directors and the chair of the board of directors of the Alma Linux OS Foundation. So I hand over to Liam, who actually is a very skilled and long time writer. And yeah, he, I even hired him when I was at Susie. I was team lead documentation. And so Liam, watch what you're saying. I'm not your boss anymore. So we have them. Can we ask, can we ask you a question first? Yes, please. Why did you choose such a controversial title? Why did I choose such a controversial title? Oh, you don't know, you don't know me that much. I'm, yeah, I'm, I have quite a reputation in, in what I write. I often write technical things, but I also like doing some kind of investigative stuff. And so I have also some, some friends out there in the world of free or not so free on the open source software. The FSVE can tell you lots of stories. We've been working together a lot of times in doing research. So last time that was about data port phoenix, German, the story about a public institution being funded with lots and lots of millions for developing open source and never delivering anything in open source. Many years before I did lots of writings about the Luca app in Germany and an app, a tracking app for health services during the Corona epidemic, pandemic and similar things. So I'm kind of, yeah, I like to go where it hurts sometimes in journalism. I guess that's my reputation, isn't it? Yeah. So Liam, you just take. I think we have one. Hello. Okay. As you like. So yes, hello. My name is Liam Proven. These days I write for the register, a UK and US IT news site. I'm the open source guy at the register these days. For the previous roughly eight years, I was working in the area of tech documentation. And before I went to work for the register, I worked for SUSE for about four years where for a while, Marcus here was my boss. And before that, a couple of jobs before that, I worked for a short time for Red Hat. I actually moved to the Czech Republic to work for Red Hat and did not pass my probation period with the company. I moved to a whole new country for this job. And as was recommended in the onboarding pack, I started a flame war on the company wide mailing list. This is a recommended way to get to know people in the company. Mine ran massively out of control and resulted in me being let go after just two months. I am not a big fan of Red Hat. Now, I've been using Linux for about 30 years now. And Red Hat Linux back in the 90s was one of the first distributions I used. I wrote quite a lot about it back in the day. I do not generally use any Red Hat products or services anymore, but bizarrely enough in the Red Hat CentOS source code, I have found myself in a very strange position of defending the company and defending its actions. And I think I might be doing so here again today. Just for context, one of the articles I wrote, which Bradley kindly linked to from an SFC article analyzing this, I had at one point simultaneously a thread attacking me on hacker news for being an obvious Red Hat shill, and they told me that I should quit my job and go and work for Red Hat Public Relations, while simultaneously another guy on Twitter backed by multiple Red Hatters was saying that I was an evil hostile troll who obviously hated Red Hat, Linux, open source and everything it stood for, and I should be ashamed of myself. So when those came to a peak simultaneously, I just told them go and talk to each other and sort it out. But by the way, I'm not going to go and work for Red Hat because those guys fired me, so no, I don't have the option. So yeah, my position is a strange and conflicted one. But let's see. Karen, you never worked for any distributor, did you? Did you work for Red Hat? No, I've never worked for it. Susie? No, no, my path goes straight from engineering school to law school to corporate law to nonprofits. So my perspective is really from the public interest of software. My interest in software freedom originates from the fact that I have a pacemaker defibrillator. I have a heart condition. That's actually quite common. I just, like, we found out as a society recently. It's really exciting, but I've been at a very high risk of suddenly dying, so I need this defibrillator. I can't see the source code in my own body. I was shocked unnecessarily while pregnant. The whole thing is just wild, and it really makes you wake up and say, boy, I really care about what happens with my software. I really care that we have control over it. I really care that we can take collective action. And so I'm at Software Freedom Conservancy as executive director. I am a lawyer, but I'm not your lawyer. I've never been Red Hat's lawyer. And I'll just say that everything from my analysis is just what's going to be good for the state of software. And that's why we also have Benny here on stage as one of the persons representing one of the entities affected by Red Hat's moves in the last years. Yeah, so I'm Benny. I work with the Amalenics OS Foundation. My path to this is through, shall we say, I started as part of the CentOS community. I was a user for many, many years. When I started using CentOS, I didn't know that Red Hat existed. It was just, I walked into a place where we were using CentOS, so that's what I knew. And the last 20-ish years have been a wild ride, shall we say. And I'm excited to be here because the perspective that we have is 100%. As users of CentOS, we needed a solution. So that's where we came from. I also had some points in my career where I got in contact with Red Hat. Both as a journalist and also with many, many people, of course. As I said yesterday, I was on the CentOS Summit and I talked to Red Hat. And I'm very happy that we have a panel here with such exquisite friends, I have to say. Because, believe it or not, the number of people that said, yes, yes, I want to be on that panel that agreed to me when I asked them, exceeds the size of this panel by double, I would say. So we had like, I have like six or seven people that said, yes, I would come and then they said, oh, we are sorry, but corporate and things you know. And so that's, it seems to be a very controversial topic if anybody had ever needed that proof, after all. And controversial, controversial, there was one on this panel that I wanted, I have Liam, the journalist with him. You made yourself kind of an expert in this topic because when I started writing about it, I found lots of your articles on the register and I read them and they sort of, they changed my mind, my attitude to the whole thing because at the beginning for me everything was clear. Well, Reddit is doing something evil here and I need to write about it. And I wanted to find out the true facts and then I read your stuff and there seemed to be over the, since 2021, when Reddit started or did the harsh moves, let's say against Centaurs, there has also been a change in the public opinion about what they're doing and why they are doing. So last sentence, yesterday I talked to people from Centaurs and they said, well, it's actually not against us, it's more against Oracle, this move, and or not against Alma, more against Rocky because those two entities are trying to make money out of Reddit code. So can you give us a short introduction about what happened and why in opinion what... Okay, so, well, okay, context setting first maybe. I have been a journalist mostly freelance since about 1995, but I freelanced for the register for about 15 years before I joined them full time. So I've been there just over two years now. I have learned some very interesting things about people's reading skills. An awful lot of people are completely unable to skim. However, they don't know that they are completely unable to skim. So they will read a headline and have a look at an article and read a paragraph or two and jump around a bit through the later bit. And they think they've got the gist and they move on. Now, gist is an important word here. Amongst other things, I'm an English teacher and I used to be a professional English teacher. Scanning for gist is a key test of skill when you're reading in a foreign language or speaking in a foreign language. And it's something as a teacher, you look for people's ability to get gist. What I've discovered as a writer is an awful lot of people reading technical stuff online cannot extract the gist of a piece of text that they have read. So when I started hearing about Red Hat changing the terms of the license under which... The terms under which the Red Hat Enterprise Linux source code was distributed, a lot of people got very angry and started going, well, this contravenes the GPL. No, it doesn't. What it means is an awful lot of people feel entitled to talk about the GPL and either they've never read it or they've read it and they didn't understand it. Because I get people every day for two and a half years who read stuff I wrote and didn't understand it. This is a really big problem and I'm willing to bet half the people in this room to pick a random number will read stuff and not understand the gist of what they've read. The GPL says if you sell software that is based on open source code, you have to provide the source code to the customers to whom you sold the software. And that's all. It does not say you have to give the software to the world. You have to give it to your customers and nobody else. That's a really big point of the GPL which I think most of the people getting angry about GPL infringement have not grasped. Secondly, not everything in a Linux distribution is GPL. There's GPL 2 and there's GPL 3 and there's MIT and there's X. There's loads of licenses in there and some of them are much more permissive. But Red Hat said the license to the REL source code means customers get the source code and now only customers get the source code. But it's GPL source code. Once you get the source code, you can then do with it what you want. You can then put that source code on GitHub and give it to the world. That is fine. But Red Hat are then perfectly free to say, hey, we sold you the source code. We sold you the product. You got the source code. You have infringed your customer agreement by giving it to a million other people on the Internet. So we're terminating your customer agreement. This is my basic understanding of the situation here. That they provide the source, still, yes, the version that is compiled into REL, but only to customers. It's still open source because customers get the source. However, customers are not free to share the source with anybody else and stay customers. And that's the critical difference. How does that affect CentOS? Can you briefly say what Red Hat did with CentOS? Was upstream downstream from REL before? There was CentOS stream and there was some article saying that CentOS stream, Red Hat kills CentOS stream. No, Red Hat kills CentOS by introducing Red Hat CentOS stream. Do you want a whole potted history of CentOS Linux? Just a sentence. Can we just take a quick seat and respond? I just want to clarify that explanation of the legal situation. There are some bits of it that are a little more nuanced than that. If you have an offer for source, it needs to remain valid. It's not just simply your obligations continue if you've got an offer for source available. I also just want to say that while this kind of issue dances up to the line of GPL compliance, you're looking at the really tight letter of the license and wondering where is that line. Frankly, I would prefer that many of our analyses are not about how do we just follow the very tight license, but instead to say what's in the spirit. It's crummy to ask customers to have to choose between distributing the source to customers who have a right to habit, which is the spirit of the GPLs, and choose between that and support. I otherwise agree with your analysis, but it leaves out that important component. One, since an awful lot of people don't actually understand what the license says about the source code, what the letter of the law, if you will, says about the source code. Trying to analyze the spirit is a whole other and much more difficult problem. Look, if you've chosen to pay for a commercial Linux distribution, then you're not on board with the whole, it's all free for the whole world thing. If you pay for corporate support, you're buying into a corporate subscription. And if you're buying into a corporate subscription, that puts you in a different market segment to all those Debian and Arch and Gento and whatever users who don't have a commercial version and are exclusively relying on the community for support. Suzer and Red Hat are in this almost unique position that they sell support subscriptions for a corporate tool. There's also free products you can use if you so wish, but you don't get the support. You're not buying the software, you're buying a support subscription. This is important, but is it against the spirit? I don't know. Can somebody show me a strict documentation of what the spirit is? Because there's a strict definition of what the GPL is and half the people haven't read that. Let me ask the lawyer about the letters of the law. In our case, it's the license and the spirit. Because I'm pretty convinced that what Red Hat is doing here is both legally and license wise correct. That is an opinion. I consider it legitimate if I see what the target is. I think it's like many others, I think they did bad marketing on it, definitely. But the thing is, I think it's against the spirit of open source if they don't share because they are so sorry. I think it's kind of stupid to have big letters on the Red Hat website what open source is. And that says, developed in the community, shared publicly and everything. So they really on their own website, they have these words and then they make part of their product only available under limitations, which is again something that is nitpicking, I would say. It is limitations. You are free to use it, but you shouldn't share your gun. That is for big customers. That's a threat. If you lose support from a big, that's a threat. So I think it's against the spirit of open source. Karen, can you explain why it's also against the letters? This is so interesting because I'm going to agree with both of you, even though you're disagreeing, which is to say that I agree that here the customers are sophisticated. This is like a different type of... Oh, can you all hear me when I have it over here? Yeah. So the customers are sophisticated. So the result of the violation, if there is a violation, or the result of the policy is different than if you have consumers who are less sophisticated, who are in a negotiating position, who aren't knowingly entering into it. So I agree with that. But then on the other hand, I also agree with Marcus that we are not about necessarily expecting contributors in the free and open source software space to just do a bare minimum all of the time. We are here because we want something better. So I am not going to secure and say that this is a violation. I'm not. But I am going to say it was sure nice when we had sent us and we could see, like, you know, we had like a canary in a coal mine in a way. We knew, you know, we had access. And now we don't have that canary. The canary is gone. And so we may be... Benny wants to say something about this. No, go for it. You guys are doing a good job of fighting. I don't even think we're fighting. I don't think we're fighting. I'm prepared for getting asked. I completely agree with you that everybody should read the GPLs, like, read those licenses. Some of them are easier to read than others, but like, check them out and like, especially read the offer. I'm kind of torn on CentOS because I do not use it personally. One of the things that shocked me when I joined Red Hat is I had not really looked at a Red Hat product since Red Hat Linux 9. And they gave me a laptop and they said, you got a choice. You can run RAL and you get support or you can run Fedora and you're on your own. We don't support staff running Fedora. Obviously, server-side stuff, yeah. But all your colleagues mostly run Fedora. So junior management and so on ran RAL, senior management, have Max, and all the actual techies ran Fedora. So I put the latest Fedora on my staff-issued laptop and it didn't work. The touchpad didn't work right and the graphics didn't work right and it broke if I undocked it. And I asked on the support list inside the company, I went, yeah, it does that. What? I mean, it took me two days to install this wretched thing because your installer is so awful and it doesn't support the hardware it was written on. Yeah, we know. We're going to fix it one of these days maybe when we can if we can get open source drivers and stuff. But for now it doesn't work. And I went, this sucks. The user experience for Fedora 14 was awful. I defected over to Ubuntu in about 2004. Talk to them about... The 90s were hard. But I started a thread and I said, look, I have not used Red Hat in over a decade, but I am a bit of an expert in some of your rival products. I use Ubuntu, I use Debian, I use some related products and the installation experience is considerably worse. The driver support is considerably worse. The stability is not good. Is there a team in the company looking at rival products that wants to help? If so, I'd like to join. And if there isn't, can I start one? Because I think we can do better. And that got me kicked out because I offended the company code. But you know what? About the free distro and do you get the source code and stuff? The question of the killing of CentOS Linux as opposed to CentOS Stream, well, that's a whole other question. But I think... I don't like to say it. I don't buy Linux-based server operating systems. I'm very happy I don't run servers anymore. But if you buy into a commercial Linux, you're buying into a different kind of deal too if you use a free Linux. And if you complain that the paid-for one isn't like the free one, well, why do you pay for it then? Well, in some cases you pay for blameware, so I like outsourcing the blame. But the thing is also, I talk to many people these days and they said, well, if you're CentOS, Rocky, Alma, all of this is just people that want to use Red Hat style software but don't want to pay for it. So, and that is where I'm asking you, Benny, is that what impact did all of that Red Hat movement over the last years have on a project like Alma Linux? I mean, once you get on... Yeah, so as I said before, the move is said to be more targeting Oracle and not actually that much CentOS because Oracle and also, I was told, Rocky, they are doing business with it undercutting Red Hat prices, which Alma is obviously not doing. So you're kind of the good ones in this game and that's also nice to have you on the stage here. But how did it affect Alma Linux over the last years? Yeah, so, well, let's start with the supposition that the people who are using CentOS and Alma Linux and Rocky Linux are the ones that don't want to pay for Red Hat. That is not true. The number of people that have come to me and said, I used CentOS and now I use Alma Linux, not because I don't want to pay Red Hat, but because I was never going to pay Red Hat. I was always going to look for a free Linux. So the idea that we're somehow pulling funds or customers away from Red Hat is incorrect. There are certainly people, certainly companies who use a Red Hat-compatible operating system to decrease the amount of money they have to pay Red Hat, but that is not our primary audience at all. So we start there. But if you just look at the impact of what changed in December 2020, we were all, was it 2020 or 2021, whatever it was, we as users of CentOS were very frustrated, very uncomfortable with the idea that we weren't going to have CentOS 8 as long as we wanted, and we weren't going to have any CentOS 9 that was the Linux version, right? We all needed it for a variety of reasons. Everything from, I got to meet somebody today who uses Alma Linux to serve museums in Sweden. All of the point of sales stuff is done using an Alma Linux server. They were never going to switch to an operating system they had to pay for. They want an operating system that serves their need and is free. That's the kind of people that have come to us. In June, when things shifted again, it completely broke our pipeline, right? We had a built pipeline that was very good and very secure and very stable. When our pipeline was broken by the changes where the licenses or where the code was stored, it was, again, frustrating and uncomfortable. But Open Source did what Open Source always does, which is solve the problem. We will get around it. We'll find a way to do the things that we want to do, whether or not we have access to the source code. In this case, we do. We find it in other places to do the things that we want to do without violating RID hats agreements. It's easy for us to, I mean easy, relatively easy, right? But it's still possible. When I hear that, that kind of fosters my opinion that it's not really in the spirit of Open Source, what's happening here. But there may be different views from different sides on the whole thing. I think Red Hat is doing this also because of business model topics. There is business models like Oracle Linux or Rocky Linux, which Red Hat is not very happy about, and you are kind of like the collateral damage to that. To a certain degree, yeah. To a certain degree, sure. I do, just to be fair to Rocky and the people who work on Rocky, I do want to distinguish slightly between Rocky Linux and CIQ. CIQ and Rocky Linux are obviously very closely tied, but there is difference, right? So yeah, CIQ has a commercial thing. Red Hat has a commercial approach. We don't have customers. We have users. And it is, whether or not it's, like we get into the letter of it, right? It is certainly harder to do the things that we want to do, but I think that it proves the strength of Open Source. Whether or not they are violating the spirit at all. And I would add about the spirit that, you know, the problem with trying to do the bare minimum going for the letter and not thinking about the spirit is that you get so close to that line that it is easy to cross over it in other circumstances, right? So my colleague Bradley Coon who is here actually wrote a blog post that I actually was referring to earlier, that identified a couple of violations that we had had in the past. And that's what happens when you're looking for the bare minimum. And so I think you just have to be cognizant of that. So I guess that's also the point where we want to open this discussion up to you, as seen where many hands raised. Yeah, so many people. Just like 200 questions we will accept, I guess. Correct? We have 20 minutes. Yeah, so my main question, you were so right with that. My question that is on my mind is, do we have to change something in the open source community? Do we need a GPL4 that also covers the issue of not only to customers or that is maybe more precise or more clear on the limitations? Because I think what Red Hat is doing here is posing the limitations on those that get the source code. They are not really, but in a way, that is the effect and it has some side effects. And so maybe, so I start here. Thank you. I mean, I hope it's alright that I don't answer your question, but I want to say something about the spirit again, because I think, I wasn't involved in any drafting of the licenses and stuff like that, so it's just like from what I've read, but my understanding is when you read the writings of stall men, can you think further away? Okay, better like that? Okay, cool. I think it's very clear to me that the intention, one of the intentions of the copy left movement was to prevent software monopolies. And you can comply with a copy left license, and I'm not saying Red Hat is doing that, right? I have no idea, I am no position to judge that, but hypothetically speaking, if a company starts hiring all the developers of a software ecosystem and pay for their livelihood, and then go to the customers and say, of course you can go out on your own and be on your own, and no longer be a customer of us as a company, but good luck finding anybody else and just a technical skill to do anything with that software, because we control like 90% of the ecosystem. I think that is a problematic situation as a community to have that, right? Whether it is legal or within the rights of the license, I think to me, it is very clear that this is a problem. And to bring it back with two sentences to the questions you've asked, I honestly, I have no idea if that is a problem that can be solved by the legal framework, because it is so far beyond just copyright and copyright issues with the licenses that I don't know what the solution can be, I am very sure that it is a problem and a problem that we need to think on, because these are monopolies that are coming up in very new and very different shapes. And that is something that I came across with the research on these articles, so there is an open core conference, an open core summit that happened first time last year, the year before last year, second time this year in December, and it is driven by a company or entity, open source software.oss.capital, and they promise investment capital if you are doing open core. So they are, and Redhead founder, John Ewing, is it John or Jack Ewing, I think? Mark? Redhead founder, you are talking about the one who was at that open course? Yes, exactly. Bob Young. Bob Young gave an opening speech on this event, and so that is where I came across and I thought, okay, there is a lot of money that is looking for a new home, and they see that open source is a threat for investment, they see it as a threat to investments, because if you don't own intellectual property or all the developers on a certain market, then the investment money can vanish, because everybody else can do what you did before. And so that is also a very important matter in this thing, I guess, because what Redhead is doing is trying to avoid others from making money off their development. Which is the opposite of what Bob Young said. He talks about it at the open course, so it is interesting because he, you know, I don't want to be the only one who is here saying, you know, I love them and I hate them, I am here giving a positive view and really I am at heart, I am negative, I did the keynote this morning about outreach and I gushed about... ... ... Thank you. So you said the question is about who is a user and Redhead is talking about customers. Is that correct? So Karen, can you say something about that? I mean, a user of the software is everybody who gets it, right? They are only handing it out to their customers. Is that wording? Is there a difference in any way? Can you say something to that? It's really, you know, the license is written so that the obligations flow, so that the obligations flow downstream. So you are asking for distinctions that are not that relevant in the way you analyze the license and the company's obligations are when they distribute the first code. So now we have got to resolve the technical issues and the new microphone is running around. To answer the next question I can't see. Yeah. Hi up there. Come in, Houston. To everybody at home, this is a room like two kilometers long and one mile up on the left is a very nice friendly guy asking a question. Hi, I'm Jim from Oracle, otherwise known as Big Evil in this crowd. So section six of the GPL provides, and pardon me, in part, that you may not impose any further restrictions on the recipients' exercise of rights granted herein. The Redhead subscription terms provide that you may not redistribute the software, and if you do, it's a material breach. It doesn't just say we can terminate you. That gives them the right to sue you. Beyond that, they could also say just as easily, you don't have the right to modify, you don't have the right to make copies. If they are allowed by contract to restrict the rights granted under the license, how does the end user end up with any of the protections that the license is intended to provide? How is the license meaningful? Can Vizio, by way of example, provide a license with its TV that says, if you ask for the source, it's a material breach and you owe us money? Well, in my readings, I learned about the term in at least US law, that is the caveat-emptor thing, that means actually something like buyer-beware. I'm looking forward a lot to how European law will see this whole situation, but in the end, it's the freedom of a vendor by law to choose his customers. And if I'm doing messy things with the stuff they give to me, because of the GPL, they cannot infringe my law to use it and share it and change it, but they can still say, stay away, we will give you no support. You can call us 20 times a day, we won't do anything for you. Did I get that right again, Karan? The caveat-emptor thing? I mean, I'm not... You're not our lawyer. Let's move on in the conversation. Okay. Here. Yes. Thank you. Hi, I'm Christian. I have to say I'm in the spirit of free software, so what I say now is a bit against my own conviction, but I think we have to put things a bit different. Reddit, Linux is not consistent completely from GPL software. There's different licenses, so we have permissive licenses in there and it doesn't apply. And then the next thing is the question of packaging it is, consider the right work or is it just a way of distribution, just as Storman wrote, it's okay if you put it on the tapes back in the 1980s and if you sell these tapes, you can charge money for this. So this is the question I have to you, Karan Phelps, as a lawyer, and putting that way, if it is derived work, then I think it's in conflict with GPL to not provide the packages as the derived work or not allow the customer to provide the packages on their behalf to the wider community because they have the right to share. On the other hand, for permissive license stuff, that's not true. If it's not derived work, then it would be okay for Reddit to just cite the sources because you still get the software from the Internet, but you have no right to basically get their distribution work. I don't really know, so I'm very interested in your answers. Armand Linux is completely out of the packaging of CentOS, right? There's no packages anymore, anyhow, right? So you're doing packaging yourself? Yeah, so we pulled the source code from a couple of different places. Shameless plug, if you want to know exactly where we get it from, you can come to our talk at the distro's room tomorrow at noon. Thank you. The packages where the code that we pull is from... Where's the other microphone? I don't know. That's not the anthem. The code that we pull is from CentOS Stream, or from other places you can get it where it's free. And pulling that code and repackaging it from everything that we've seen does not violate anything, right? The reality... I think the thing that I would like to talk about at this point is kind of the question that we started with. This is going to continue to be a thing because humans are the way they are. We are going to end up in situations... This isn't like a new cycle to what we're doing, right? But he said openwashing is coming around again, because that's what happens. How do we, as an open source community, react to that? That's the thing that's interesting to me. Like debating whether or not Red Hat is violating GPL is... We've got 230 people in here and 230 opinions. I think how we fix what we're doing going forward is going to be the interesting conversation. Do you have a question? My question was more legal one. For instance, if we look at the GNU or FSF website, we have a GPL FAQ. And that's inform us on how to interpret the license, for instance. And as I understand, this has some legal weight. In the GPL v3, there is also a section about terminating services. Does that give us some information, even if it's not about source code distribution, but installation of software or modification of software? Maybe. I think it's really up for debate, if any commentary outside the license. At least from a US law perspective, it's really... The license says what it says. I mean, for my personal, like, two pence worth, a tough enough worth, I think Red Hat made a huge mistake in, so to speak, acquiring CentOS in the first place. That was a really dumb move, because they legitimized and sanctioned what was effectively a free rival to their commercial product. Now, when I've said this in print, people have said, oh, well, CentOS was dying. They rescued it by adopting it. Well, tomato, tomato, it doesn't really matter. It was a really dumb move to start supporting and offering a free rival to their own product. But that was a long time ago. They decided to stop offering it. Okay, fair enough. They're within their rights to do that. It inconvenienced a bunch of people. Well, they'd mainly inconvenienced rivals. But, yeah, I think it probably is against the spirit. Yes, I don't argue with that, and I do not wish to defend them. But since we can't really codify what the spirit is, somebody's got to pay for all this work. My basic overall position here is, right, I started my career with DOS, Novel Network, and tools like that, which were tiny and maintained by tiny numbers of people, and most of them were not written in C. The industry has gradually shifted to vast and vastly complex stacks of software, the lower levels of which are built in the most unsafe programming language known to humanity. Okay, that was a dumb move. Right? And the result of that is a vast multi-billion, in fact, nowadays, multi-trillion dollar industry, and paying people to fix it and maintain it and document it and try and make it all work together. Rightly or wrongly, that's kind of where we are. There are free products. There are paid for products. You've kind of already taken a seat on the edge of the shark tank if you started using one of the paid for products. I think life would have been simpler if there'd been a much clearer line between real official Red Hat Linux without an enterprise and a totally free one, which is not official. But that isn't where we started from. I think the company was repairing a mistake. So, I would like to add a thought to the question, how could we act as a community on this, and I think this goes right away to the panel later this afternoon. What can we do with additional obligations that someone tries to add onto the GPL? So, I think we could agree that Red Hat tries to add some obligations. I would even go a step further and say they made at least an unclear subscription agreement, and you can read it as you have to pay for every copy which is made, and we have an extra right to terminate your subscription if you should dare to redistribute the code. So, I would say if this is the right reading, you would have a conflict with the GPL, but if I were to have made it unclear by intent that you can always say, no, no, no, we just may consider the fact that you made distributions. So, from a political side, I would say Red Hat already partly won the debate in a sense that we are always saying, well, but Red Hat may most likely not renew your subscriptions, but the subscription terms are more harsh, way more harsh than what you actually are saying. So, I would like to ask to the panel, did you get the gist of the Red Hat subscription agreements? Yeah, personally for the article I even got an explanation which was, as Leon said, very open and clear from Red Hat, how they handle it and they will say they are free to kill the subscription contract with the user at any time, at any given time once they find out he or she or they changed and shared it. So, it's very simple. If they want, I don't think they are that ill, but if they want, they can just from today to tomorrow, they say, oh, we saw you flipped a bite here, no, we don't support you anymore. And I would really recommend I was briefly in Bradley's talk just before and the few minutes that I saw there made me really want to watch the video as soon as possible because you were also talking about companies imposing limits on their employees in terms of open source and transferring intellectual property and things. And first, before I give a word to Bradley, I saw... You could just say whatever. You sure? Karen? Okay. You can respond. Yeah, so... You wanted the mic right after that. Yeah, so I think, so I'm not on the panel because I talk about this issue too much and so I really shouldn't have the microphone, but Don gave it to me. So, I... You're in your hands. That's true. I thought he wasn't going to give it to me. So, when I first was given a copy of the rel, I think at that time it was called the Red Hat Enterprise Linux Agreement, in 2001 and asked, does this comply with the GPL? I never would have imagined this many people would be in a room to actually care about that question because it was kind of a how many angels dance on the head of a pin question at the time and in some sense it still is. It's an incredibly nuanced question. I think everyone wants to give these very pithy answers, Jim Wright from Oracle being glad to give us an incredible pithy answer, that of course it violates the GPL. I think that Red Hat is extremely sophisticated in their drafting and I have read every single version of the rel services agreement that's ever been published and on paper, I don't find a GPL violation. The problems that I have discovered are of the nature that I've often called, if a GPL violation falls in the woods and nobody's there to hear it, did it actually occur? And I say that because there are probably hundreds of Red Hat salespeople who walk in and sell seat licenses and there are hundreds, possibly thousands of Red Hat customers who believe they own seat licenses to rel. I have met many, dozens of people who tell me their company has N seat licenses to rel. That's not what the agreement says but that's what everyone believes is happening. And that I think goes to your earlier point, that there's something happening that is perception and while the letter of the rules don't violate the license, everybody believes it's proprietary software and I don't know how we solve that problem but jumping to a conclusion either direction of it must be a violation or it can't be a violation is not going to get us very far in figuring out the community problem that we face. That goes for me that also goes in the direction of the caveat emptor thing. So you, like, I mean in the US it seems to be very normal to assume that somebody that you're buying something from might be trying to fool you or to get some more extra money out of you. And so that's probably also the biggest failure or the biggest mistake that Red Hat did in all of this. I talked to them and they said, yeah, marketing wasn't very good on that. So obviously, oh no, shit man. But the thing is, is there anything that we as a community can do to be maybe more precise or whatever to counter these things? I don't think so. Actually it's the market, huh? I'll try to keep it short because I have two things now and one of those really got my blood boiling. So I try to start with the more nuanced one. And the nuanced one is we can quote the current GPL to each other all day long, right? But from what I understand from the history, the reason that a GPL exists is because GPL was written at a time where something like the cloud was in nobody's mind. And I think we're at a point now where nobody ever thought that there might be a player in open source who has just war money to buy 80% of the developers of an ecosystem and provide for their jobs and comply with the license. And there is no doubt in my mind, like Bradley, as you said, I'm completely certain that they comply with the license and I believe REL is free software. If they distribute REL to you, you get free software. You're free to exercise your full rights, right? But they have built something that we didn't think of when the license was drafted. So that's the one. And the second one. Let me keep it because Liam wants to directly answer to that. Okay. But the second one with the emotional one, I already wonder if it was something I said or he said because that's usually how the reactions that we get with it. I think it was more Liam. Can you answer? Absolutely. And then keep the thought. We're running out of time, so maybe we should move to the next question. I think you've got this completely backwards in how you should look at this. And I speak as a former member of staff. It is not. Oh my God, one company owns 80% of Linux developers. No. One company, rightly or wrongly, and I am no fan of theirs, came up with a very, very lucrative model for selling Linux distributions. And that has enabled them to pay for something like 80% of the development of Linux and associated systems. I don't like them. I enjoyed working there. Don't get me wrong. There are friends here at FOSTEM this weekend that I have spoken with that I made in my short time at Red Hat. It was a fun job. Well, actually, the docs are very dull, but you know. But it was a fun job. It was a great community. It's a friendly company as long as you don't violate the company religion that there is only one Linux and its name is Red Hat. But that's a whole other question. One company found a way to make Linux very lucrative and they pay for a vast amount of development, and that's a really good thing. I agree. Because Debian can't do it. Ubuntu, Canonical can't do it. Nobody else can pay for all this stuff. It's not they've got a monopoly. They're funding us and they're keeping us in a job. This is a last question here. We have something like two or three minutes left. So maybe a slightly different point of view. So I'm working for a large company and because the company pays my salary, I can do open source work. I'm a maintainer for open source projects and I like to do this. But I'm getting some kind of paid for this. And I guess for all open source that someone develops, all the people need to do a living. I'm getting paid for the work that I do for the company. I'm doing private open source project and I don't want to have money for that. But in general, someone needs to do a living by writing open source software. So this is something that we need to secure. And we've seen in the past that like elastic search or something, if big organizations take open source, provide it with their business model and let's say deny small companies to earn money, then we will lose open source. And I had discussions with Red Hat. I had discussions with lawyers of Red Hat and they made it very clear up to this point, we will talk friendly to you, but the next stage will only be possible if you pay money. You're using our open source software, so you will not get further answers. I don't think that is a good habit, but I would like to get back to what Bani said because it's what do we about this, not necessarily against Red Hat, but how do we as a community, how do we just get around it? So let Red Hat do what they want to do as long as we can have open source software on the right side. That's exactly the last question that I have on my list for the panel actually. That's exactly the question that I'd like to hear from all of those guys, those experts here, telling me what they think we should do or if we should do something about it or if we should just wait. So who wants to start? So what are we going to do? We're going to do what open source always does, which is fix the problem. I said it earlier. The reason open source continues to exist from my perspective is that we all continue to care about it. We continue to put in our time, we continue to put in our effort. We, whether or not somebody is getting paid for it, we show up because we care. There will always be people who commercialize it. There will always be people who feel fundamentally that open source is the way to go. And we're going to end up hopefully continuing to balance each other out. With people like that. I run Debian. It works pretty great. Maybe we're on Debian. I don't think this is a problem to be fixed. I mean, you know what? It might be nice in the spirit of doing it, but you know what? Somebody's got to pay for all this. My last three FOSDEM talks were more or less on the subject of how can we find a way to dump this entire stack from line one of LS and replace it with something a bit better built because I think that's what we should do. And I'd like that to be open source as well. So, are you still unmated? Thanks for joining us. We're out of time. We are done. Thanks for joining us. Thanks for joining us. I'm very sorry that we couldn't answer all the questions. Goodbye. I'm sorry. It's okay. Yeah, I know I'm sorry. What was it that made you emotional? I don't know. I don't know. Thank you.
Figuring out trademark policy on the fly
All right. I would like to welcome our next speaker, Peter Eisenthal. Please go ahead. Better now? Yeah, that sounds good. Okay. This did not come out right. Those were fun flags up there. We'll fix that later. It's open source. So I'm Peter Eisenthal. I'm affiliated with the PostgreSQL project. I've been coming to FOSDEM since, well, 2003. So, wow. I'm in the PostgreSQL project. I'm part of the core team. That's the steering committee, you could say. And I'm also a member of, I think, PGCA, which we'll explain in a minute. I've also been involved in the Devian project. I'm probably fair to say I'm retired now. This does not sound well, but I'll trust your figure. Yeah. Maybe you can mute that one. Okay. All right. I'll just keep going. I'm retired from Devian, but, you know, if you're in the Devian project, you do learn a lot about the open source policy. So that was good training there as well. I'm involved with SPI a little bit, because that's obviously affiliated with Devian. And I've also been involved in, because of these connections, to make PostgreSQL an affiliated project of SPI. And I'm not a lawyer. That's a cliche, but maybe in this room it should be said. I have no legal training, but, you know, if you do this for two decades, you absorb a lot of stuff. And in effect, the point of this talk is to explain the issues we work through. This is not a guide on how to do it, but this is just what we figured out. Maybe it helps. So by Postgres, I know like half the room here are Postgres friends. Hopefully most of everybody else kind of knows about Postgres, but just so we're on the same level as database systems, right? There's a stand across the campus here, and there's a Devroom tomorrow. But I like coming to FOSDEM, because, you know, the Devroom, Postgres, Devroom is nice, but, you know, I know a lot about Postgres, so that's not too interesting, to be honest. But I always go to these other Devrooms, and then sometimes give presentations there, you know, to sort of talk about intersecting and crosscutting issues like this. So that's why I'm here. Postgres has... Postgres came out of the University of California, Berkeley, you know, like BSD and others, and so we always used to say it has a BSD license, and then somebody actually read the license as, this doesn't look like a BSD license, this is actually the... We just talked about this, right? You actually have to read it, right? This looks more like an MIT license, and then we read it again and again, and we realized it's neither, so there's actually... PostgresGal license is actually a separate license registered with the open source initiative. So, but the point here is basically we're not too concerned in the Postgres project about copyright enforcement, right? Because there's no obligations, really, other than attribution, so that's not really a thing we're worried about. And then there's a fine print down here. Postgres, PostgresGal, and this Lonnie logo are trademarks, or registered trademarks of the PostgresGal Community Association of Canada and used by their permission. I didn't actually ask for their permission, maybe that's wrong, but... That's what that is. And then finally, I want to introduce those of you who don't know, to Slonic, this is Slonic, this is the logo, and as you just learned, it's a trademark, or a registered trademark, I don't know. So, the three pillars of intellectual property are copyright patents and trademarks. And the whole reason why we're here every year is because some of those who came before us essentially wanted a different arrangement for copyright around software. So you can share it and do all these things. And this has been very successful, obviously, there's issues that we just talked about, but the reason why we've been doing this for, you know, it fosters them for 20 plus years, and the whole thing, 40 plus years and so on, is because this worked out pretty well. And then some time ago, we had issues with patents, and we've had, you know, in this very event, you know, we just kept getting talked about. I don't really know what happened to that. It sort of fizzled out a little bit, but I don't really want to talk about patents here today, but this was sort of an issue that, you know, patents do exist. The question of whether software patents should apply was sort of the question being debated, and this was partially used to impose additional restrictions on free software use, right? And in some sense, this was sort of hacked around by putting patent clauses into software license, like Apache license is a good example of that, so I guess this kind of quieted that down a little bit. And then there's trademarks, and that's what I want to talk about. So why do trademarks even exist in general? And it's basically to prevent consumer confusion or have a functioning marketplace at all, right? Because otherwise, if everybody could just fill in some brown stuff in a bottle and label it Coca-Cola and sell it, then nobody could ever go shopping again and just rely on that anything is anything, right? So that you have to have some protection of names so there's any sanity in the marketplace at all. So that makes sense. And then the companies who have these built these names, these brands, they want to protect their reputation or their brand, right, or they could will. So that's sort of the general idea, right? So that's existed for a long time. I think that's pretty well understood. Why do we care about trademarks? We specifically in the Postgre's project that I've worked on, but also potentially other, you know, software projects, open-source software projects. So the thing that one reason why we looked into this is because the domain registrations are in some weird way tied to trademarks. So if you have a trademark, you have sort of priority to domain registrations. The details of that are very complicated, but this is kind of how some of this got started because anybody can just register a domain that matches your project's name and now there's so many top-level domains you can't really, like, prevent that. And then they put something up that looks like your website and potentially offers downloads or do whatever and then you have no control over what they were doing. So by registering a trademark, you have some kind of way to prevent that or have a sort of priority for these domains. But also, you know, you want to protect the reputation of your brand, of your project, right? If you have a certain size, then you get imposters and people want to take advantage of that. And then finally, in a weird way, what we learned is you have to protect your trademark just so you have it because unlike copyright, and we'll get to that in a moment, once you lose your trademark, you can't really get it back and so you might want to just have it for later, in a way, in case, you know, something bad happens. And so it's a bit of a self-preserving system in some way, but that's what we figured out. So here's a way to, again, I'm not going to talk about the patent stuff, there could be a third column but that's not what we want to do. This is sort of a chart here, what makes dealing with trademarks arguably much more complicated than with copyrights. Because, you know, I'm sure there's people writing software here in their laptops, like right now in the halls here, right, and they're not thinking about copyright because the copyright happens automatically, more or less in most commercially relevant jurisdictions, let's say. So as soon as you create something that applies to software and anything you create, you're the copyright owner and if you don't want to share it, that's fine. If you want to give it to someone under certain conditions, that's you, or that's up to you. But you don't have to do anything about it, you don't have to register it. There is copyright registration, that's not really relevant. Also the copyright applies internationally more or less automatically. Also in the context of software projects, everybody who contributes is sort of a joint owner in some way, right, this kind of protects, there's copyright assignments and some people do, but in general, like at least in our project, everybody sort of owns it together, that means nobody can just take it away. In trademark, this doesn't work that way because you have to register the trademark, otherwise it doesn't exist. Well, there's unregistered trademarks, but that is very weak, so you do want to register it. And so as paperwork is complicated, potentially you have to get legal help, or at least it's hard, you know, or somebody has to sit down, you have to pay fees, and you have to renew it, and somebody has to be the owner of that, it can't just be, oh, you know, us 250, who half of those we don't even know, really in person, right, you can't do that, you have to have someone who actually has it. And then, well, to share it, you have licenses, so that's sort of similar, but with copyright we have these public licenses that are just sort of there to take in a way, right, maybe that could exist for trademarks, but at the moment it's not really clear how that would work, or at least we haven't figured it out. And that's the final point I've already mentioned, right, you have to enforce trademarks, otherwise you lose them. Copyright, you don't really have to do it, I don't know, but mostly we don't do that, right, I mean, of course people do GPL enforcements, but certainly you don't have to literally enforce every single violation, you can sort of go against the big fish and things like that, right. And so the copyright stuff basically happens by itself, if there's no problems, you don't have to do anything, you put a license on your project, and if nothing weird happens, that's it, that's new, right, but the trademark, you have to be on the ball all the time, right, you have to file paperwork, make sure you do it right, and whoever does it actually has to make sure to keep it updated, you have to license it, and you have to go after everyone who violates it. And that's hard to do for just small-time projects, I guess, right. So what have we done? And we came, the title was sort of, we figured this out on the fly, because we, when the project was founded, we took the, we, I wasn't there, the University of California basically abandoned the software, and somebody said, like, hey, we'll put it on the server, we'll keep working on it, we'll figure it out, right. And this was actually, this was in 1996, and I figured this was actually before the term open source even existed. So I went back to the original email messages, and there was no terminology for that, they said, like, hey, let's do a project like free, as free BSD does. So that was sort of like, there was no term open source, it was just like, let's do it like these guys do, kind of. Well, that individual, you know, happened to live in Canada, and they had the domain registration in their name, because, you know, somebody has to do it, and then later on also registered trademark, it wasn't, we haven't really quite figured out why that was done initially, but they figured they should do it, so it existed. And when that individual then sort of wanted to step back and retire from the project, you know, we wanted to put this under some proper footing, and we were initially thinking just about the domain registration, and, you know, the trademark kind of came along with it, and so we founded this little tiny nonprofit association in Canada just to hold those domains and those trademark registrations, and it wasn't really supposed to do anything else, it was really just like, do a meeting once a year, rubbish them to budgets, and keep these things rolling, and do little else. You know, we're not running any events or anything like that, it was just a really tiny operation. The name of Canada has confused people over time, so we kind of dropped this last year from our title. It kind of worked out because the domain we used for that association, not for the software project, but for that association is Postgres.ca, so the CA can now stand for Community Association, which is sort of was a kind of pun we used. Because people kept coming to us and thought, we're like the Canadian, we, I'm from Germany, right, so it's just where that organization runs, we're like the Canadian user group or something like that, and can we run like a Postgres evening in Toronto, but that's not what we were about. So here's a little bit of a timeline. So in about 2018, we started registering trademarks in various jurisdictions, so we had the Canadian one already, we did the EU one, the funny thing is also once you register trademarks, you start getting spam. People like monitor these registries and then just want to sell you stuff, like hey, domain registration, hey, do you want to, hey, we have, we saw somebody registered something similar to you, do you want to help us figure this out for you, and you know, always kind of weird stuff, so it's kind of fun and annoying. We have the US registration, but currently on the secondary register, which if anybody knows what that means, so it's a little bit tricky to, if the name is already out there, then to register a trademark afterwards, so we're kind of backfilling that a little bit. And so we, and this is maybe some of you who are not close to the Postgres project might have even seen that and like the general hacker media in between about 2020 and 23, we had actually various disputes around the trademarks. In a way, a lucky coincidence perhaps that we just got those trademark registrations done shortly before that, and, but as you, as I just mentioned, it was really supposed to be just a small organization with really no operating budget, and then all of a sudden we had people go register domains that looked similar to the project, actually people going out registering trademarks, like Postgres as a trademark in some jurisdictions, just in their own name without asking anybody, and then putting that on their website and creating sort of a shadow ecosystem, that looks like sort of an official Postgres thing, and then we asked them nicely, you know, can you please stop doing that, and then they didn't want, and it goes on and on, and this is sort of, you know, legal cases were filed, and it went to courts and things like that, so that was extremely expensive, and we had no income stream right, this little organization, so we were figuring this out, but it's, you know, it's, this is a problem, if you do want to get into the trademark game, you have to be prepared for this to happen, and it's become very expensive. How harmful would it be for your trademark, not to pursue all of your legal cases? The question I heard was how harmful would it be if we just didn't pursue those cases? Yes. Well, to some degree we don't know that, because we, you know, there's no, there's no room for trial and error really, because if we say, well, let's not do it, then the next person that we really do want to pursue this dispute with, can then say in court like, well, you didn't pursue those guys, so your trademark is invalid, can get invalidated, and then it's gone, and then this whole thing is over, right? So the legal advice that we have been given, the actual legal advice that we have been given, that we should do it that way. So we have sort of a two-tier, so if you do want to use the trademark, because obviously, you know, people, you go over to the other building, you can get hoodies with Postgres on it, right? So obviously we do want people to talk about Postgres somehow. So we have a two-tier structure in a way that there's a policy on the website that says what you can do with it. This has been reviewed by legal aides, lawyers, right? So what you can do is, what are we already in the law about fair use you can do? So if you just factually talk about Postgres, of course that's allowed. Any kind of what's in the law about fair use you can do, of course. You can write logs and books about Postgres and stuff like that. People do, right? That seems fair. People want to make YouTube videos about Postgres short, that's fine too, right? And then we have, which maybe not obvious, we have a thing also in the Postgres community that we can in a way certify conferences, right? Because we don't have a central association that controls the whole project. And so a lot of independent people around the world run like a Postgres conference, for example. And there is a system that you can self-certify that your Postgres is what we call a community conference. It sort of follows certain guidelines. So it's not just sort of a sham corporate event basically, right? And so if you self-certify your conference as a community conference, then you can use the trademark. If you self-certify your user group evening as following these certain guidelines, then you can also use the trademark. Or if you have a regional NPO, for example, the thing that's happening over there in the other room, the Dev Room, the Postgres stand over there and the Dev Room is organized by an association called Postgres School Europe, which does these things in Europe. And there's others like Postgres School US, for example. So they are recognized NPO's and they can use the trademark. Yes, all the way in the back, yes. You say no company names on the phone? Yeah, so... How many... that actually brings in a number of products that are specifically key to Postgres? Yeah, yeah, we'll get to that, exactly. Yeah, yeah, so... By company names, I should have solved the company once in a focus. Yeah, yeah. So the question there, and we'll get to that for the microphone again, like what about company names and stuff. So what is not allowed by the trademark policy on the website is that you name your company or product as Postgres in the name, or register Postgres.something domain. Now you can try, but then you get into trouble with this. If we can be bothered, as previously explained. But no, if you don't... this is the policy on the website, you can read that, right? So if you fall into any of this, if you want to write a book about Postgres, just go ahead. You don't have to do anything. For some additional uses, you can get a license, right? This is something that's still kind of in progress, right? So this is... we've only been doing this for a couple years. Five minutes. Yeah. So the first thing we did is to respond to the question, we grandfathered, tried to grandfather everything, and that was already out there. So a lot of these, and we tried to track them down. You know, if you have a company that's well recognized in the Postgres ecosystem, and they somehow have a product or the company name itself that has Postgres in the name, then we got in touch with them and tried to work out, like, a license that explicitly allows that, but also contains, like, let's not do that again kind of thing, maybe, right? But to some degree, we also provide additional licenses. For example, and this is kind of the... it leads into some questions I will have at the end. So this is kind of what we're working toward or what's effectively happening, but it's, you know, it would be kind of nice to actually have some guidelines and we're not just making that up. So if you have a Postgres conference that is not self-certified as a community conference, you can still get a license for it. You know, we'll look at what you're doing. If you're not doing a total sham job, then that's probably okay. We'll get a license. Or if you want to use the logo for a product, you can use this Postgres, you can get a license for that logo. But that's all manual work at the moment. So what we generally allow is merchandise sales, except the kind of stuff that's happening there by recognized NPOs, or any new, ideally any new business names. And also weirdly, what we generally don't want to license is anything that's not even related to software that wants to use the logo for some reason. There's people who want to write like novels about Postgres or use the logo in board games and things like that, which is all interesting, but we have so far not really wanted to go into that. So this is kind of the question I'm asking myself, and this is sort of bothering me, right? We've had just these discussions of overlaying additional restrictions on top of software licenses, and this has been done with patents extensively, or tried at least. Just what we talked about a moment ago, what do they call it? Customer agreements that interfere with that. And remember maybe some of you remember the Ice Weasel fiasco, where Debian had to rename Firefox because of trademark issues, uncertainties with Mozilla. And I don't really know. We're not trying to trick anyone, right? This is not what we're trying to do. Others could do it maybe, but we're just trying to do the right thing. But what is the level of things that we shouldn't allow? I don't know. This is really hard to find examples about. We've looked at some examples. Debian, for example, has a trademark policy. Linux has this sort of trademarking system, and KDE has something, but there's really nothing sort of consistent that you just copy as a template. It's really, really difficult. So this kind of gets to my asks here to conclude. It would have been nice to have some any kind of organized information about this. It doesn't seem to exist. Now you can go to these organizations and there's others. Sorry if I offend anyone. I know there's some rivalries between some of these. I don't know. But if you create something, if it's software and say, yeah, I would like to share this somehow, how do I do that? You can go to OSI and learn everything about open source copyright issues. Or if you just create or something else, you want to share it somehow. Creative comments, they have all this information about this is how you can share and this is the kind of public copyright license you can create and so on and so on. So there's this whole thing out there that you can rely on and it works really well. But I've went through all of these things and searched the website. It's like, is there anything here about trademarks? Is there any guidance at all? It's like, no. I don't know. I know that they know that this exists because creative comments itself is a trademark and open source itself is a trademark. So this obviously exists, but there's nothing to find. So in some sense, this is sort of my call here. If anybody has any interest in this at all, it would be nice to share information somehow and build at least some kind of baseline expectations about like how do we create trademark licenses that are compatible with the software freedoms or the open source definitions or anything like that. That's what I would be interested in. And of course, if you're out there as a project struggling to work this out, maybe you want to get in touch with the postcards people. That's also fine. So great to conclude. So dealing with trademarks is much more complicated than copyright. It's not really clear where it's useful. So that was the question. It has in effect been useful for us because we had some bad actors who were trying to do the wrong things. This gave us a tool to prevent that. So it is in effect useful. But the way this interacts with software freedoms is unclear and I would like to learn more about that. I would like to learn more about how others do it or if others want to learn from us, that's also great. That's all for me. We have time for one question if someone has. I'm sorry, we run out of time for questions. No questions. That's also fine, but I'll stick around here and if somebody wants to ask, that's fine. Thank you. Thank you, Peter.
GPL’s Termination under German Law
Let's welcome our next speaker, Sebastian Steck. So hello, also from me a warm welcome to the Legal and Policy Issue Staff Room. Let's start right away. Today I would like to talk about GPL's termination from the point of view of German law. You might first wonder what do I mean by termination? So if someone, the licensee violates the GPL, there are some consequences for the licensee which happens automatically and there are other consequences which need some more manual action by the copyright holder. I will frame this as manual termination. So I would, in this talk, call both mechanisms termination. So even though most strictly you could say only the manual termination is really termination and the automatic part is something else like conditions or scope of copyright permissions or something like that. And in German law, as you might already know, we have a lot of precedents. Thanks for Ralf Welte and Thier Jäger. And in my opinion, the best summary is this relatively recent decision of the upper district court of Ham and they say, according to the well anonymous opinion in case law and the at least predominant opinion in the literature, the distribution of software license under the new general public license in breach of the license conditions constitutes a copyright infringement. So the assances here, if someone violates the GPL, the copyright holder has the remedies for copyright infringement at hand. And the predominant remedy used in copy left enforcement is here, of course, to get an injunction. So the copyright holder could get an injunction that the defendant may not distribute. The copyright holder's work or else if the licensee should dare to do it any way, the licensee will be fined or jailed. So but Harald Welte did not ask for a plain injunction about any distribution. So it was a very clever piece of lawyering in my opinion of Thier Jäger to say, well, we had a phrase unless accompanied with the materials that GPL required. So to be more precise, they had spelled out what the materials that GPL requires are exactly R. So and why did they do that? You could easily imagine to ask the court to get an injunction against distribution in any matter. So and the point here is the main defense of a defendant of a copyright injunction is say, well, but I have a license. You can't get a license. Forbidding me behavior, the license allows me to do. And this leads right to the point, what is this future distributions? The due past violations caused the licensee to lose their right to make future distributions and distributions of course have to be compliant on their own. So in particular, a company with the materials that GPL required. So up to my knowledge, no court had ever to decide whether the second alternative is actually viable and allowed. But there was somehow a backdoor or detour how this question slipped into some of the lawsuits Harald Wertefeld and it was basically the licensee argued, well, this termination clause is so strict that's unfair. It's an unfair term of adhesion and it should thus be regarded as void. So there would be no termination at all. So and the courts rejected the argument, this for example is from the ruling of the very first court ruling on the GPL in Germany. And they said the violator can reacquire the rights by acceptance of and obeying the conditions. So the courts had the opinion that even a past violator could make lawful copies in the future by obeying the GPL in the future. So it would not have been viable to try to get an injunction in the second way. Nonetheless, in my opinion, but as I already said, this is not yet tested in court, it is possible to get an injunction of the second nature if you take the copyright holder takes the right preparatory steps. So but you might first ask is it really reasonable to try to go that route or should we just stick with the first alternative. And the problem or the potential problem here is you may not fit any GPL violation into unless accompanied with the materials that GPL require. So you may imagine that I want to be violator says, well, our schemas, we put all the documents we need together with the binary, the license tax, the warranty disclaimer and a perfect offer for source. But it's only perfect to the letter if someone later asked, they just ignore them. So it may be the case that such conduct cannot be bear it by an injunction and following the first alternative. So I would argue in the following what could a copyright holder do to have the second option available. And then at the end of the day, at least, with the first law students, you would have you would go both routes in the alternative and then see what the courts actually say. So let's look back at section four of the GPL version two. That's the so-called termination clause. Here we see the second sentence, which basically says that any attempt to make use of the licensed work, which is against the license, will automatically terminates the rights of under this license. So the FSF had already come to the conclusion that if you read this word by word in a very harsh way, you could say, well, that means once more mistakes and all your rights are gone. So since they realized that that is impractical, in practice, they made another termination clause in GPL version three, which is basically a manual termination. So to summarize that under GPL version three, the copyright holder has to first send a warning letter and then wait some time and then they can send basically the termination notice, even though the GPL version three does not use these words, but I would frame it using these words. So by the conclusion that such terms would be impractical and even unjust was also made by the German legislator. And for this reason, they introduced this section into the civil code, which basically says if there should be a contract which says that if one party violates the contract, they lose all their rights, such provision should be instead read as a right to rescind in the event that a party should violate the contract. So we have now informed us what does the rescission actually or more precisely mean under German law. And rescission in layman's terms means the rescinding party sends a rescission notice and then the contract has to be unwinding. So more precisely here, the civil code says that the performances received must be returned and the benefits derived must be surrendered in the event of a rescission. And to get really the essence of what does this mean, we will have to look into a specialty of German law and that's first the so-called separation principle. So the separation principle says that every, every day's transaction, like for example, a regular contract of sale splits itself up into multiple contracts. You have first the contract of obligation which solely creates the obligations. For example, in the contract of sale, it only creates the obligation on the seller to deliver the goods and on the buyer to pay the money and take the goods. And then to actually transfer the title when the goods are transferred, you have to enter into a second contract, a so-called disposition contract. And you may guess it, to transfer the money you need a third contract, another disposition contract usually. So and the consequence of the separation principle is the so-called abstraction principle which says if an obligation or the complete obligating contract gets void, this does not automatically void the disposing contracts, meaning that the goods are not transferred automatically back, the money is not transferred automatically back, the rights are not transferred automatically back, the title is not transferred automatically back and so on. So in a contract of sale, this means that they have to take, the buyer and the seller have to take extra steps to actually transfer the rights, the titles back. So translated to the GPL, this would mean if the copyright holder resents the GPL, this would only turn the obligating contract into a new contract requiring those parties to give back what they had previously received. But strictly formally speaking, the rights still stay where they are, they only have to be given back. This may be a technical formality, I would argue it basically is, we will see later how you can handle that. But I think you should be aware that the rights stay first where they are. So let us now look a little bit more into detail what recession means or what do you have to do for a recession. So you may wonder what is the warning I have told right at the beginning and this is the section closest to giving us an answer, do you need to sign the warning letter before we can do a recession notice. And this section is more precisely on statutory or decision rights. So what we have talked before was contractual recession rights, read into the determination clause. So this section does not apply directly but nonetheless in my opinion this emphasised the underlying principle that you should send a warning notice or you must send a warning notice very unreasonable and fair and so on. So I would just recommend send a warning notice and then you are on the safe side. To summarize what we have seen up to now, there is at least another German law, no automatic termination of the GPL. That means current or past violations do not void automatically future distribution rights, of course future distribution which is compliant. But nonetheless a violation is still a copyright infringement because the permissions are only granted under conditions that the GPL is abate. In US terms you would say there is a scope of the license and if the licensee acts outside of the scope you can get an injunction saying no you may not act outside of the scope, you may not violate the conditions. So if you have a case as a copyright holder, if you see a device for example which is violating and you can say that there is a violating from just opening the box and see what is in there, you have a clear violation, I would recommend you strongly to go that route if it is well paved, there is lots of precedent and I would add one remark here, you can get a preliminary injunction in such cases. A preliminary injunctions are very favourable for the copyright holder under German law, for example they do not have to provide a deposit and the only drawback of a preliminary injunction is one that you have to be quick and be sure that it will prevail because else you would owe damages. So if you have a clear violation I would recommend go to the lawyer right away and then the lawyer will send a warning letter giving the violator one or two weeks to solve the problem and if the problem is not solved go to the court and ask for an injunction because the court will reject the preliminary injunction if you have waited too long and that is if you have waited more than roughly a month from knowing the violation. So and I may add here this is what Harold Welter did at least what he later did he also wrote that when he wrote on his own to violators he was just ignored so it was basically it was not only a waste of time and his own resources but it also would have written of getting a preliminary injunction so I was strongly argued there is no point to be that friendly in a clear case to say well we sent a warning without lawyer first so that you don't have to pay the lawyer for writing the warning. So but on the other hand there may cases well the copyright holder has to manually terminate the GPL and it's not so the scholars do not all agree whether this theory what I've presented is actually the right theory some others say well you have to do something which we could translate as cancellation or termination so but in the end you should set a deadline for cure in any way so send a warning in any way so no matter what legal theory actually would prevail you should send a warning and just say here is the warning if you don't fix the problem we will take further measures so and then if the problem is not solved the copyright holder can send in this case a decision notice it may be a cancellation notice or whatever so and here is one more important detail I wrote here partly a decision notice so you may have wondered what happens was the past distribution if the copyright holder resents the license have they be somehow unwinding and even though it's rarely mentioned in the civil code it's clear amongst scholars that you can have a partly decision if that's reasonable for example with the GPL you can say the copyright holder can say I resent for all your future permissions so that you can meaningfully give me these permissions back and I let the contract as is the GPL as is for all past distributions so if there were for example an obligation to provide source code this obligation would still be intact so and if you follow other theories like you need a cancellation you then have to send a cancellation notice or whatever so I think it would be relatively easy to formulate it well enough that you can fit the notice now your rights are terminated into any legal framework which might apply so and if rescission is the right framework you have then demand the permissions back before you can get an instruction not to use this granted back permissions but at the latest when you're at this step you should really get a lawyer so and then the lawyer will just write into to complain a we want to have an injunction on the conditions and if that doesn't work you want to have in the alternative our rights back if they needed to be given back first and then we want also an injunction against any distribution so that's basically what I would say a copyright you're doing such situations so before we come to questions let me make two more remarks first you may have copyrights or money or something else you feel is valuable for copy left enforcement and then your best address you can go to a software freedom conservancy so here are Bradley and Karen and I and I'm sure they are happy to help you with this but you are also able to happy yourself if you want to just assign your copyrights go to as of conservancy or assignment or if you have some money to give visit as of conservancy org slash donate or you can also become a sustainer and give them regularly money so and the second remark is so here is the disclaimer for the legal talk and the disclaimer may be surprisingly I have given no legal advice you may have not noticed it but the trick here is according to the German legal services act legal advice is only advice given in a specific individualized situation so it is allowed to everyone to discuss the general situation to discuss about some hypotheticals so during Q&A I strongly ask you to not bother me with your specific personal problems I'm happy to have general discussions about all this with you and I'm happy to discuss about hypotheticals but that's it so thank you for your talk and I think it's great that you bring up this last slide because now I can sneak in a little remark on the previous panel because I think the solution to all the problems that we have discussed earlier is fairly easy and maybe it's time for us as an open source community to start demanding that the ownership of the source code is with people like SF Conservancy and other non-profit organizations yeah like I must I'm I'm I'm I'm I'm a no I'm a cybersecurity consultant so I do horrible horrible work in my like what I'm paid for I'm a very unethical person in my day job but I don't write code and I especially don't write proprietary code and if I did I would donate it to Conservancy because I think that will actually solve the problem of for-profit companies monopolizing software so sorry I don't have a question on your talk but that was just too good an opportunity I did not ask either of them to promote us and we really appreciate it it's a little it's a little embarrassing but much appreciated I actually have a real question on your talk it's a really excellent analysis I'm curious if you looked at the other side of the question because some of the I guess they're called stat I don't we would call them statutes in the US that you were quoting they seem to also relate to this I'm sure you're familiar with Till Ligger's theory that you can receive a new license once you come into compliance merely by downloading the software again and it's it seemed like some of the things you were quoting there hinted at why he has that analysis because I never fully understood his analysis but some of your your some of your points seem to fit with that and it made a lot more sense to be hearing it from you so I think that the relevant sort you have to have here is what is the disposing contract of the copyright holder to transfer the right through the licensee and depending on how you would frame that you could say they would give it on an ongoing daily basis or you could say they give it up front or they could say they give it up front and additionally an offer to give it to you again if you should violate and depending on how you frame that you would end up with different things you would have to give back so so I was by intent vaguely about that because I know there are different theories what this what is actually transferred and so what have to be actually giving back but but in my point of in my point of view the essence is there is a decision right which would provide the obligation to give back what they still have no matter what it actually is yeah I'm from Germany I know all these things actually I met hard welcome until yeah and I can tell everybody that at least one of the companies he's you'd learned from this but the question is can you also give us insights on the situation in the US because I know there is something called the simple where case maybe Karen could tell us something about this and as far as I remember that in this situation the the copyright holder of the GPL license of the clearly stated on the website that certain companies permanently are banned from using their software I don't know exactly how the situation is I can say enough for sure that illegal systems different different are strongly different in in these questions and these kind of edge cases so I knew that there is theory under us Lord that the rights of her men play avoid I knew on the other hand that there is theory you can get a new license by downloading it but at the end of the day it will have to be tested in court yeah so I got curious about two things you're right in the about German law that you mentioned so as far as I understand one German court said that you can reacquire automatically rights under the GPL because you just go again to the website and download code and I could you maybe elaborate why the court says so because if it worked then you could actually constantly could violate I could elaborate on that and so sorry and just the second question is it under German law that the license has I don't know the German terms for that but I understand that in civil law you have contracts that have only this obligatory effect and the real effect of transfer of let's say property rights so do I understand correctly that licenses in German law have this real effect of transferring actually any rights okay so to your second question first this goes basically back to the separation principle and how you would understand the term license in the in on the background of German law you could say the license is only the disposing contract with the real effect or you could understand the term licensing it also includes the basically more contract obligating contract thing creating the obligations so well I assume even the term wordlet sense is not used strongly for the foreboding of for those things and to to the problem of what happens if a violator violates again and again or can they do that or can they not do that so I think if the violation is unnoticed or at least unactioned by any copyright holder they can violate again and again which would be each violation again copyright infringements but if nobody stops them then nobody stops them of course and on the other hand there is precedent which then lead to amendment of the civil code that no party of a contract can be forced to be trapped into a contract forever if the other party constantly does not apply to the contract so so no matter how which legal theory you actually apply or how you would actually frame it in my opinion you could always go back to this historic ruling and say well but a copyright holder cannot be trapped into TPL needing to tolerate violations forever thanks really interesting talk you already briefly mentioned us or the previous question was about the US am I correct in thinking that this will be not the same but similar in other European countries or am I wrong on that I don't know exactly enough the law of the other European countries to get to be able to give a precise answer but I would guess it's more like the German way for the countries which had to Roman tradition that's basically the non-English speaking countries and the English speaking countries have to have the British tradition basically and these have done other rules okay let's thanks a question
Fireside Chat on Further Restrictions, Imposed Downstream on Copyleft, Wreaking Havoc
Okay, I think we're ready to start our next talk. Let's welcome, Bradley and Khrusov. So, hello. So we wanted this to be a panel, and the problem we discovered was there are basically five people in the entire planet who have the knowledge to be on that panel, and only two of them were going to be at Fosden. So you have the two that are actually at Fosden to talk about this really important issue, and there's been a tremendous amount of things that have happened this year. But to start off, I'm going to hand it to my colleague, Christof, to talk about what this clause in GPLv3 is, and any GPLv3 as well, and what it's there for, why it's there, and what it does. Okay, thanks a lot. Hello. I'm very happy to be here. My name is Christof Chirvich. I recently started as the licensing and compliance manager at DFSF. So this is actually my first public appearance in this position, and I'm very happy to do this appearance. So as an introduction, we are going to talk about something that we at DFSF started to call, for example, confusing licensing or unauthorized alterations to the GPL, which is... Let me start by saying that the FFSF designed the GPL as the license that not just gives users their freedoms, but it also has a mechanism to guarantee this freedom, which is, well, co-pillift and some other obligations. And what we have... So what the FFSF has observed for quite a while is that people are doing different things, not just using GPL verbatim, as it was intended to be applied, but they try to add some additional requirements. And that was addressed in GPLv3. And in section 7 of the GPL, we have actually at least three different kinds of requirements that are differently treated. First of all, you have additional permissions or, let's say, exceptions that you can give, and I don't think we have to talk about them. And then there is a list of clauses that you can add that actually restrict what users can do as compared to the grant of rights that's given in the GPL, let's say the original GPL. So there is a limited list of these requirements, restrictions that you can impose. And if you do more, if you try to restrict further, these are called further restrictions, and we have the quote from the relevant part of section 7. It says that if you received the program with any of such further restrictions, you can remove that term. So basically, long story short, you do not have to abide by those restrictions. I don't want to take too much time, so that's this license mechanism that is intended to protect users in such a situation. The other protection can come from the FSF itself because we, as the authors of the GPL, we have copyright on the license terms, and we have also trademarks that are used in the license. So, after you apply the license, we want that to be done as intended, and we are not authorizing users of the GPL that alter it or twist it in the way that it doesn't give freedoms contrary to what section 7 allows. So we believe that not... So this mechanism can be used by users. There is also another mechanism that we can resort to. We could rely on our copyright or trademarks to protect the community from people imposing some further restrictions. Maybe let me stop here and we can maybe elaborate on that. Well, yeah, I think it's an excellent summary, and I think one of the points I want to add is kind of an historical thing because I worked on this years ago. This clause was invented to handle the problem of... Which was very common in the early 2000s, that someone say, this is under GPL version 2, but for non-commercial use only. Well, we all know, but we had a lot of discussion earlier today, you're allowed to do things under GPL version 2 commercially. Under V2, the V2 group of licenses, this created just pure ambiguity because there was a statement that it was under GPL, but then there was a statement that prohibited commercial use or commercial redistribution. This clause was specifically designed during the GPL V3 process, and you can actually go into the drafting history of GPL V3. I wrote an expert report, which I'm going to talk about in a few minutes, where I did this in a case that's going on in the US, and find that that's exactly what we were trying in the drafting of GPL V3 to allow users to do. There's so much good about the GPL about how it empowers users, and the idea behind this was to empower the users, just throw that clause away. That clause shouldn't be there. You should be able to go back to pure GPL anytime you want, and no one can, anywhere from the supply chain, all the way from the original license or down to each redistributor can add an additional restriction on your rights under GPL. On paper, by design, this looks great. The scary thing that's happening in the US, and I'm going to briefly summarize it, and then we're just going to turn it over for questions, is that many companies, particularly these VC backed open core companies that we talked about in the earlier session, have really, I think unfairly, I'm going to say this because Crystal probably won't want to go this far, but I'll say it because I'm not affiliated with the FSAF, they've unfairly traded on the good name of the GPL, and they've said, oh, we're going to put this, we're going to say it's under GPL, but then we're going to add some weird further restriction. Their users are confused. There was a case with a company called Neo4j, which did this, and a user who was also, by the way, a contributor to Neo4j, an integrator for Neo4j, saw them do this and said, well, I found this clause. I can throw it away. That's exactly it. This fellow named John Mark Suhi did exactly what all of us in the old days intended when this clause got drafted. He removed the clause, and in response, Neo4j sued him in federal court for a variety of acclaims, including a DMCA violation and various different damages. I had the dubious privilege because it was a disturbing thing to watch to attend the trial, and it really is heart-wrenching to realize that while we had this right on paper, hit the ground, and somebody was in trouble, it was very difficult for everybody, and John Mark's lawyers tried very hard to explain to the judge, he did exactly what he was allowed to do. We're kind of in a crisis right now about this clause because while Neo4j is just one incident, the case, which hasn't been fully decided yet, although many of the key parts of it were decided in summary judgment. I can explain more if people have questions what that means. We don't know what the future of this clause is, we don't know how it's going to work out, and we've got somebody who's been really badly treated by Neo4j and frankly by the courts for just doing what all of us in the free software world and the copula world in particular thought he should be allowed to do. That's a U.S.-centric thing, but I think it has worldwide implications because other companies will be watching, and their desire to trade on the good name of GPL and say, this GPL, except we've added this Commons clause thing, except we've added no commercial use allowed or whatever it is. We really as a community need to prevent that kind of behavior, and we're in the thick of trying to make this clause work the way it's supposed to. So maybe just one or two thoughts. I think in the simplest terms what we think about it is that when you use the GPL, it is intended and I think it's usually in this sense to transmit a very strong message that you grant uses all the freedoms that are there. So it's very concerning and confusing when someone uses the GPL, someone says, well, this is under the GPL, but tries to change the result and the message. So what we really are concerned with and we want to react is to stop this confusion. And on the other hand, we are really happy and that the licenses are there for that people are using them. So it's really, and we even allow people to create new licenses using the terms of the GPL. In the FAQ we've written how it's possible, but we then make it very clear how to do it without confusing, without using the name of the GPL because people may create very different licenses when they use the term. So I think that it's a very simple issue that we are concerned with this confusion and it should be stopped if there is someone who wants to write their own license. I mean, well, I don't think it's a good idea. We already have a lot of licenses, but if you really want to please do, just stop, don't confuse users. Yeah, I agree completely. And in that same space, we saw MangaDB do this. They took almost exactly the text of the FAQ or GPL, but again, I can probably say more because I can speculate about the FSF as not being affiliated with the FSF. They didn't, it seems to me, have a trademark problem because they called it something different. They named it the server-side public license. It is a horrible license. There's no question. It cannot be complied with by anybody on the planet. Even if MangaDB themselves tried to comply with the SSPL, they could not, but they own all the copyrights so they don't have to worry about it. But the, so while it's a toxic and terrible thing that the SS public license exists, it's actually better than the Neo4j situation because in the Neo4j situation, they're using the name of a Faro GPL. They're saying the software is under a Faro GPL, but sneakily at the bottom, they've put another restriction and they're captiously going after users and saying, aha, I caught you. And I have some real scary stories to tell of stuff that came out in trial. I'll just tell one and then we'll open it to questions. So during the trial, it turned out that the reason John Mark Suhey, who was a government contractor in the United States, he was working for our Department of Treasury, he had, and this is all public because it was publicly, publicly testified at trial, he had packaged Neo4j with a bunch of other software as a, as basically a system integrator to deliver a solution to the Department of Treasury. And then Neo4j, who he'd signed an agreement with to do marketing, because he was a business person trying to make lots of business relationships, they came to him and said, you have to go and tell the IRS, the Department of Treasury in the U.S. that they're violating the affair of GPL. And he said they aren't. I wouldn't, I wouldn't deploy the solution if I thought it was in violation. And they said, no, but you have to tell them that because we want to sell them a license. But of course, this dissolved quickly the relationship between them. And it was not too long after that then Neo4j added the Commons clause on and said, well, they're definitely violating the Commons clause. And then John Mark came back and said, but I never deployed the Commons clause version at the IRS in the first place. So why is it, you know? And then eventually he just removed the Commons clause. Then I can't remember if he'd actually deployed that version or not. But as soon as he started telling the IRS that they could remove the Commons clause is when Neo4j sued him. So it's just unbelievably nasty behavior. And as I said, trading on a good name of a good license for a pure avarice. So anything else you want to add before we just turn it over to see what people want to ask? No, thanks. Maybe when you have some questions. So confusing users sounds to me like unfair competition. So have you looked into using unfair competition law to stop such confusing behavior? I admit that I had not thought about that angle. There were some arguments made by, because I've read the entire docket of the Neo4j trial. John Mark's lawyers made a few arguments of that nature. The judge was not particularly interested in them, unfortunately. So I think someone would have to bring a separate claim of some sort for it to be possible. I don't know exactly how that claim would be structured, at least in the US, but I'm not a lawyer. I think you want to address that too? So I'm not sure if you're asking about confusing in relation to what, okay. So maybe when I was talking about confusing users, I was talking about the FSF's concern that someone is using the FSF's license, not as they were intended and confusing about the scope of their rights. So as far as I know, in order to invoke unfair competition law, for that we would have to, that would be the case between FSF and that person who is using. So I'm not sure that unfair competition laws would apply here, but I also don't want to bet on that. I would have to check. But maybe another case with unfair competition would be between someone who is using licenses with all these alterations and someone else who was just trying to use the license as, for example, following section seven and relying on the license without restriction. I mean, well, no idea, but that's an interesting direction maybe to check. I have a question. In American law, is it still possible to file a friend of the court brief in the Neo4j case? So normally, my understanding, I'm not a lawyer. This is not legal advice. My understanding is that you can really only do amicus briefs on appeal. And I should mention that John Mark is planning to appeal. He's got most of the negative decisions already because they were done in summary judgment. Really the only thing left for the judge to decide, and the case just closed a couple of weeks ago, is how much John Mark has to pay to Neo4j, because Neo4j has asked for a huge amount of damages. So the judge is going to decide what that number is. John Mark is planning to do an appeal, and at that appeal stage is usually when an amicus brief can come in. Now you may have heard that there was some sort of appeal already in this case. That was really misreported, and frankly, there have been a number of people who have misreported it badly, and it's very unfortunate, because that was only an appeal of a preliminary injunction. So John Mark actually, I mean, just everything went wrong for him in the case, because the judge just really didn't understand the issues that well, in my opinion. And the judge issued a preliminary injunction saying that John Mark wasn't even allowed to say that you could remove the comments clause, like that notwithstanding this text right there in the license. And he appealed that decision, and he couldn't say that it was open source because it was not. He appealed that injunction, basically a gag order. He lost that appeal. But that was only a preliminary injunction that was in place while the trial was ongoing. And so it was really a narrow appeal, and the bigger appeal is what will be coming next. And I would hope that, I mean, my boss is sitting over there. I'm hoping that my organization I work for will follow an amicus brief, and I hope we'll all do a lot more. I think you'll be hearing a lot more about John Mark's struggle after the decision comes out. And so there will be a lot settled in this appeal, and amicus briefs and everything else. I'm just going to make one quick follow up here. For context for everyone, I'm assuming, and you'll clarify this, that in this context, Neo4j has chosen the referral GPL, but it's under what you would call proprietary relicensing. In other words, they require a CLA. They're also selling a commercial license. And was that explained to the court why they did that? There were a number of conversations during the trial about, it's not a conversation at trial. There were a number of questions that flew around of that nature during the trial when I was there. The interesting thing about that, because I also filed an expert report, which the judge basically rejected, sadly, on John Mark's behalf, because you asked about amicus briefs. There's also expert reports. I was the expert witness for John Mark's side and the judge throughout my expert report. So I felt pretty bad about that, but it wasn't surprising with this judge. But to your question about the relicensing and so forth, nowhere did Neo4j ever establish that they are the sole copyright holder. In fact, I went through and looked through the Git history, and they accepted patches from third parties, and they did not have a CLA in place when they accepted those patches. Now it could be the case, we don't know. It could be the case that they carefully went around all those parties privately and collected copyright assignments. But those parties actually, in my mind, if they did not later do copyright assignments or other CLA-like instruments with Neo4j, Neo4j is infringing their copyrights because at the time, the whole thing was AGPL. And then later, Neo4j changed the license to this AGPL. And the answer to the other part of your question is, yes, absolutely. Their primary business model is selling proprietary licenses and using the license they have to scare people into buying such licenses. So I'm not exactly sure why anyone is actually really surprised about this. Because I mean, this whole license flushing has been going on for quite some time right now. I mean, we probably all remember the case of some company that distributed the kernel code and basically told its customers, if you redistribute the source for that, then we are not going to do business with you anymore. So yeah, I mean, this is kind of the logical consequence now. That was more or less my question too. I think that now restricts redistribution of the source code further and that seems to be in direct contradiction of this term. So if their distribution contains source under this license, it looks to me like it's in direct contradiction. What are your thoughts on that? We had a time-backs panel on that issue a little earlier and I don't want to turn this into another discussion of the Red Hat thing. The one thing I'll say that, I'll see if Christophe wants to add anything, but I'm not aware of Red Hat ever suing someone for, suing an individual sole proprietor for damages because of something they did with RHEL. The Neo4j's behavior is beyond the pale. I compared it to MongoDB before. I've never heard of MongoDB suing somebody in this way. So the fact that they went after just this guy who's a government contractor, he started a small business to do government contracting with open source and free software and they're coming after him in this aggressive way to get financial damages from him. It's really beyond the pale and I think there's really no comparison to what's happening in Neo4j to anything I've seen with regard to this whole area of proprietary licensing and so forth. Okay. My question was to say that it's a price in the Red Hat situation, not that they're concerned or would not dream of saving their company. Do you want to? Well, okay. Maybe I did not understand the question as well, but I just wanted to say that either, whatever way you take, either you are the user who wants to rely on Section 7 and just ignore the restrictions or if it's like the case of the FSF that we want to use our copyright and trademarks to start compliance actions with different people. And I'm not saying going to the court in the first place, but trying to talk to people and then make them realize that they're confusing users and they might breach our exclusive rights. Both of these require that actually someone gets involved and initiates this legal action or negotiation. So there are tools available. They have to be used, but that requires effort and I think that was already mentioned in the rail discussion before that requires effort and using the legal tools. And just a point of order. Obviously we could have had an entire like half day just talking about the rail thing. I really want to be respectful of how we time boxed it. We made the decision to only make that an hour because it could just go on forever. So I'd really like to not get too deep into rail while we're talking about this other issue. So zooming out a bit for the moment, if you take a broader look at the AGPL and sort of what it has become in practice, that it's become a sort of like submarine license type, like submarine patent type thing. Bradley, could you give your idea, your way forward, your roadmap, how you see the AGPL going forward because I see it mostly as something that is used to coerce companies into licensing, right? Like a slightly weaker form of the SSPL, but still used to gouge and to extort. So I'm going to put Chris off on the spot here. So we came up with a solution for this for Copy Left Next, which is called the Copy Left Equality Clause. What that does is it disarms Copy Left entirely if you try to do proprietary relicensing. So what the clause says is if a different license is ever offered for these copyrights, then the entire Copy Left evaporates and all of it turns into a super permissive MIT-style license. I would love to see a Copy Left Equality Clause in an AGPL v4. Can you give your thoughts on that? I just wanted to ask for some clarification because I don't understand why you are co-ing AGPL submarine compared to submarine patents. So how? Is it from copyrighted? Wow, okay. It's always used like in a VC context where you have this open core type stuff and then at some point they come like, well, you might not be using... Or are you publishing this part of the code or you have these extra acknowledgments from the user and you can't... For certain types of libraries you can't provide those rights of... Clearly you violate all the licenses. Maybe you should just buy like... Okay, so maybe I will just quickly say that we are looking at different ways people are trying to add something on top of the GPL and we really want to address it one way or another. So yeah. Thank you. So without speculating on the intentions of John Marksouie, do you think it is conceivable for him to initiate legal action against Neo4j for the copyright that he presumably held in earlier versions which they are still reproducing? So as it turned out in my analysis, I didn't see any work that John Mark had actually upstreamed because he was more of a system integrator. So while there's a lot of people who did contribute, John Mark wasn't one of them. So I don't know if he actually has a claim or not. I just don't know. I didn't see his name in the Git log for the period where it would have been possible to do that. But there were a lot of people, I mean like 12 to 15, I guess that's not a lot, but there were 12 to 15 people that had given a patch that was non-trivial to Neo4j during the period. And those people I think would have a claim. It was really a question of resources. I suspect of trying to get in touch with them. I emailed a few of them. They didn't answer. I wouldn't be surprised if Neo4j has already bullied them either by demanding or offering money for a copyright assignment or CLA or simply just made them afraid. Everybody in that space who's done consulting around that code base is terrified because they're watching John Mark get sued and possibly have huge damages. And certainly he's spent a tremendous amount on legal fees already just defending himself. So I think everybody who's in that community is pretty terrified. I'd like to offer a contrasting opinion on the utility of a GPL v3. In particular it's used by Signal for its open source software. And Signal of course is a non-profit. It's very lucky that it has the funding in order to be able to do that. But in particular what that requires is that for example corporations like Facebook who may want to use Signal are actually required to contribute their contributions back even if they end up using it over a server which allows them to circumvent for example the traditional GPL. So I'd like to just make sure that that's kind of in the air as well when we talk about uses of the A GPL. Although of course these sorts of restrictions can be used for positive or negative as well. I wanted to put that I think there was a useful discussion kind of going on here so I don't want to derail that. Sorry I kind of just wanted to add on to you just say that this problem that you're discussing is a very rare problem but it's important. The reason you're giving it publicity is because you want to make sure that it doesn't happen not to prevent it from spreading. It's a very rare problem that we don't want to spread further and I want to think that the same thing with the A GPL is the vast, vast majority of uses people are given their rights and freedoms and it works wonderfully. And so sometimes small problems get expanded because they're newsworthy, they're important, we don't want them to cause more problems. I don't know if you have any response to that, it's more of a comment than a question but would you agree I guess? Well yeah definitely we shouldn't forget about thousands and thousands of projects licensed under playing GPL. I agree with you Ian. If you look at just in terms of lines of code that are facing this issue, you're absolutely correct, there's not many lines of code in the world by percentage that are facing an issue of a further restriction being added that somebody is trying to remove. I think the reason I think it's more than just a publicity issue is, and it was a personal experience sitting in the courtroom and watching this testimony go by and John Mark being who did everything right, he did everything that the FSF in its documents and in the license told him to do, he did everything our community intended when we designed this clause to go and do that and then to sit there and be told like basically you're a thief, you're a DMCA violator, you ripped off Neo4j, all these sorts of things and then be facing in the next couple of weeks possibly huge damages. I'm kind of hoping like the judge will do something silly like be like well we find $100 of damages or something like that, it would be great if that happens, I'm not sure what he's going to do but they've asked for millions and so and John has already spent a tremendous amount of legal fees and he doesn't know how to, I'll be frank with you all, I've talked a lot with him, he doesn't know how to pay for the appeal at this point and so he's very worried about it because he wants to appeal and he really, I spent a lot of time with him over the last six months as I was doing the expert report and then going out for the trial and he's exactly the kind of small business person that is like the ideal in my mind of what free software consultants should be. He took a bunch of code including Neo4j, lots of free software stuff, integrated, made a great solution for the IRS which is actually a system to catch tax cheats as it turns out, it's kind of like this graph database of like all the tax returns in the US and figure out who's cheating so they know who to audit which if you don't like tax authorities maybe that doesn't sound so great but he's doing something that is a useful tool for his client and to have this come back at him. There was another point in the trial where it was so telling where, I just learned this because of the trial, apparently it's illegal in the US to volunteer for, if you're a contractor to the government to volunteer your time to the government, I guess it's considered like a weird form of bribery, like you're like, oh I'll give you services for free and so they got him cornered on the stand where they were like, did you ever receive a call from the IRS for support on Neo4j in the evening and like he's a gas and like did you give him that support? He said, well yeah my client called me and the software wasn't working so I fixed it. He's like, did you build him for that time? And he was like, I think maybe I didn't? And like, were you aware that that's illegal under government contract rules? And he's like, well yeah but he's like, did you, so you're admitting here that you committed this thing you're not supposed to do? And he's like, I guess so, I just a software consultant and when my client calls me I get it done and I figure out the billing later. But it looks so foolish that they were like just grilling him, this defense attorney or prosecuting attorney was grilling him over this question when all he was trying to do was make software work for his client. And that's what free software in my mind is totally about, like having all the source code and being able to do what you need to do to make people's problems be solved. And Neo4j's attitude is he shouldn't have been doing that. What he should have been doing is selling Neo4j proprietary license to the IRS. And he didn't do that and since he didn't do that he's a bad person. And so watching that was a very heart wrenching thing for me and that's why I just personally am so worried about how this is gone. Two quick questions and then we'll wrap up. So did he relied on third party claims? So if there were outside copyright holders he could basically use the same argument as in Versata and Simple Bear if I remind the correct president. So he could basically say I have the third party claim that I have a pure HGPL license because the Copy Left Clause in HGPLs as you have to license is under this term. Yeah, his attorneys did argue that and that was in fact a part of the summary judgment decision was that in fact this clause is not operative, is basically what the ruling holds. So two fairly quick questions. One is probably a bit speculative but do you think this is why we haven't seen any community Neo4j and MongoDB forks? Do you think forks are being bullied by the companies? Because I was wondering that like how is nobody forking that, right? And the second thing is what happened to Copy Left Next? Is that still a thing? Like I'm, is that a touchy issue due to the redhead affiliation as well? So I'm just wondering. Yeah, I probably could have. Yeah, so maybe I could have told more clearly that I don't want to comment on any particular project or company but yeah, so I won't. Yeah. On the Neo4j issue, definitely read, because there was another court case with Neo4j which was Neo4j versus Graff Foundation. And the Graff Foundation does publish a fork of Neo4j, but I urge you to read the stipulated judgment in that case because it is disturbing and heart wrenching. And I think it speaks to your question very closely. As for Copy Left Next, it's a to it problem and I'll get to it. All right, let's thank Bradley.
The new Swiss Open Source Law: "Public Money Public Code" by default
Okay. Let's welcome our next speakers on the news with open source law. Yes, good evening everybody. It's a great honor to be here at this conference at the FOSSTEM. It was many years ago when I was here last time, but now I'm glad to be back and I'm very happy to present together with Rika Koch our new law that we achieved basically getting in Switzerland. It has been a long journey and it's great that we can now present this and we are very interested also for your feedback in the end if something similar is existing in other countries or how we also have to continue in this journey. So briefly just our background. We are academics from Berlin, Switzerland, but from my side I'm also an activist since almost 20 years when I wrote my master thesis about open source community building and I'm very glad that we can now present this to you and Rika will start. Good afternoon from my side as well. My name is Rika Koch. As Matias mentioned, I'm a law professor at the Benefach Hochschule in Bern and I want to speak to you today about the regulation, the legal side of open source software in Switzerland for the public sector. Here it is again. So in the beginning when we talk about regulation of open source in the public sector there was literally nothing. And I wrote here dark past but we're not talking about the so far past. We're talking about 10, 20 or even two years back. Although there was a strategy of the Swiss federal government that said well basically open source software would be nice because it's economically efficient and it produces good quality but there was nothing in the law. We had a strategy. So when there is nothing regulated you don't really know whether you can do it, whether it is allowed for the public sector to develop their software open source, licensing it open source or not, which is called in legal terms there was a lot of legal uncertainty. So developers or the private sector who offered their software to the public sector didn't really know whether it was possible or not. So the crucial question here is can the public sector develop and also distribute open source software or do they have to do it closed source? Well if you ask the IT experts probably all of them tell you yes of course they can but in comes the spoilers, the legal persons, the lawyers and they say wait a second it's not that easy. So the Swiss government was in this situation where the IT experts please do open source software and the lawyers say no no it's complicated please don't and what do they do? They ask they pay a lot of money for of course the legal experts for a legal opinion. Oh sorry I hand over to Matias to explain first why the pressure arose. Thank you so basically from a historical point of view this is interesting because it was being done open source by several state agencies for many years so from the IT side obviously there were lots of open source activities on GitHub from different agencies however as Rika said it was not legally allowed and so when actually we started from our group of parliamentarians it's called parliamentarian group for digital sustainability, Paul Deghi we started basically lobbying for open source release from the Swiss government in 2011 so it was back then when we had our first initiative and asking politicians to actually support the release of open source software by the government and it seemed like a natural thing although the federal government rejected this that there will be no additional support it was basically clear to us that it will take place anyway and even more actually the Swiss federal court is very open source minded for many years they actually are completely based on open source software they host their all entire stack it's they use LibreOffice back then it was open office and all the federal court lawyer federal judges they use LibreOffice for their activities so and they even wanted to open source their back then their court management system called OpenEustizia it still exists OpenEustizia.ch it actually is still online this website but something happened very little Swiss Bernese law company IT law company they actually objected to against this release of open source software by the federal court because their business was jeopardized basically they were afraid of this government competition because they actually had a market of local courts where they sold their proprietary court management systems too and when actually federal court the big court actually releases open source they were obviously afraid that this government would destroy their market and what they did actually they asked another politicians to hand in to hand in actually another question on political side that would ask what is basically the aim of the federal court would they want to destroy the market for IT companies and then basically compete with small companies and this basically actually was the beginning of this legal dispute for last 10 years and now I hand over to the lawyer again and she will explain to you what the lawyers said about this story. Yeah maybe you know the saying we have in German two lawyers three opinions and it was exactly that so we had a first legal opinion issued in 2014 with the crucial question can the government develop and distribute open source software and this legal opinion said basically no why there is in the Swiss constitution and I'm sure in other constitutions as well the principle of competitive neutrality this means that the government should not mess with the private sector to the contrary that the government should make so-called favorable conditions for the private sector and they said that this is not the case when the government publishes open instead of closed source software. Now it gets a bit legal they said that distribution of software by the public sector is a so-called economic acts that would per se distort the free market. This would actually be allowed but only if there is a sound legal basis and if it's proportionate which means also necessary and these two quite old law professors they said no it's not written in the law okay that's that was true at that time but it's not necessary at all because everything you want to do with public software or software by the public sector you can might as well do with closed software it's even more it's even better suited suited to reach the public tasks so it's private basically but luckily some persons thought okay let's just ask another lawyer and paid for another legal opinion three years later and I when I say some persons I look at Matias so this second legal opinion said well we're not too sure whether publishing open instead of closed source software really is an economic and distorted act per se we might call it also an auxiliary services so you can't do the software because it serves the fulfillment of a public task and whether you do it with closed or open source software it does not really change the fact that you have the legitimation to do so so they said it's an auxiliary service does not distort the market and we do not even need a legal basis with this in mind this was government after long consultation they issued a law although you do not really need a legal basis they thought yeah we might as well make legal basis better saved than sorry and they negotiated the so-called federal law on the use of electronics means for the fulfillment of governmental tasks and there they did not only put the possibility to make open source software but a mandatory requirement this is the text we will look at this now and I will speed it up a bit so here it says the public bodies subject to this law shall disclose the source code of software that they develop or have Dilla developed by third parties unless the right of third parties or security related reasons preclude this so we have the first sentence is the principle that opens our software is by law mandatory for software that the public sector develops or for developments that they buy via third parties and the private market so not for software that already exists so if they they can still buy pre-existing software on the free market so and there's the rule and there's the exception the exception is you do not have to make it open for software if third right parties would preclude that that's clear that's usually intellectual property rights closed licenses maybe or for security related reasons and I personally really don't know what this should mean I've been told by people with more IT knowledge than I have that they do not need her so if someone of you knows what this could be please please raise your hand afterwards and give us this input and this was paragraph one the principle of mandatory open source software there are a lot of other principle paragraphs I don't delve deeper into that but just to show you that we've come from open source software is really distortive and against market neutrality of the Swiss government to also this and this is paragraph four and five and they said so the public sector they can develop the open source software and they can also offer supply services other services to other governmental bodies for if they charge for it so they have to take money usually but they can also also make provide other services which is then not deemed market distortive thank you Rika so basically here we have the issue about is the federal court this are the background of this whole clauses is the federal clause a court allowed to actually make community building basically help other courts to use their software and maybe answer some questions and also participate in the community and this is actually now also a legal basis for community building on the federal level now we have the law so what does this mean so a law is not helpful if it's not implemented it's and this is basically the activities of my next 10 years to make this law alive so one aspect and this is where also Rika comes in because she's a specialist in public procurement now we hope actually to find in the next few years more and more public tenders where actually the government procuring IT solutions will have to include different criteria which actually support the open source the release of open source software and the community building so this is actually a real excerpt from one of them public tenders it's actually not with the new law but I think it could actually serve as a good example from my point of view because it says that first one of the issues is that the software which is being built by the company is providing the solution has to be open sourced under an open source license on the GitHub account of the city of burden then actually they have to use not just any license but a copy left license including you PLH EPL so we heard about that thank you Bradley and we also have then the need for companies with experience in open source software development and community building and actually the community management is also one of the services provided by this IT distributor so from my point of view it looks quite nice that it's really something which should be in the future being used as one of them open source role models or good practices at least for for IT procurement nevertheless we have not much not as much activities in Switzerland compared to other countries especially Germany and we have a few activities so one thing was during pandemia that there was the federal IT administration which released the COVID certificate app as open source software which was then actually used by the Austrian government and used for their national COVID certificate and this is I think a very nice example of how governments can also interchange source code another example includes the Swiss map agencies Vistopo they produced open they supported open layers in the past and collaborated with some institutionalized crowdfunding with other agencies the first development of open layers another aspect is from a company initiated in swissland at Finis they actually started an open source project Kaluma and some workflow component framework and they supported now several canton several local departments of swissland to use this company to use this software and they founded this in osco community and this is also another good example which I think shows that even swiss people are able to produce open source software in a good way and the last the last thing you have heard all the railway activities open rail association today this was also partly driven by the swiss federal railway association I think it could also be a good example how swiss government or government companies can collaborate with others so hope there's still hope for switzerland and their open source activities and maybe with and especially if they'll you know there should be more activity soon there's one monitoring thing which we do osasbenchmark.com this is a very hobby pet project of mine which where we collect basically the the open source repositories of organizations from swiss government and companies and look how many repositories are released by which kind of company and here again you can see basically of 150 agencies and institutions how much open source they are already providing on github now what I hope in the future will also help us is the the the the also the high level political environment around digital sovereignty and digital sustainability so a few years ago we created the report on data colonialism where we pointed out this is danger of the big tech companies appropriating and privatization of data and also obviously of software and nowadays I'm working on a new report for the digital sovereignty strategy for the government where they have to actually release some some new recommendation by the end of this year and so we hope that this again also will help open source software development in switzerland now we are very interested in your feedback for the for discussion because first of all we would be interested on do other countries have similar laws in this way where it actually is not just allowed to open source software but it's really actually the default second question what is the potential and the challenges of this new swiss law what do you think could be also something in other countries being implemented and from the operational and the implementational point of view we know of several activities of other countries but so what in your opinion would be the best to do next in switzerland because now we have the law now we do need to do other things is our parliamentary group okay so that's it for the moment and very very interested in your feedback thank you very much I was a little bit irritated that you mentioned within the tender that you're demanding copy left while for some organizations within their own code if they want to contribute it for this to reduce their costs having a copy left may actually be exclusionary because some people using code that under a different license more permissive license may not be willing to put their stuff under a copy left in other words what's wrong with actually saying a copy left or a apache or a bsd or one of these other more permissive licenses so this basically as if I understand you correctly so the question is why is copy left license being recommended which actually excludes a number of organizations who may want to develop software with a much more permissive license than copy left well it doesn't exclude them well it can be integrated into copy left and final product so maybe someone else can add on this but um do you want to are you responding to that Bradley yeah please so so what it sounds like you're saying is that if it if it has a must on copy left then they might just have to upgrade the license that's under you know under mit to be copy left when they put the solution forward right yes so as far as I understand if you have the end product the final product this is include permissive license software and but you still can use the the less permissive the less restrictive license software thank you for the great talk you just mentioned that there was this argument of that providing open source software does market distortion that somebody gave that argument like the right of a private enterprise to make money on licensing fees is above other things shouldn't there be an argument that the government should make best use of taxpayer money rather than blowing taxpayer money on licensing fees and isn't this a good enough argument to reject that yeah absolutely I mean I did not understand myself the argument that why this should I mean just enable some companies to pursue a business model in a certain ways does not mean to be market competitive so I think we to the contrary I mean to enable the government to make a software that has the best return of use money for value value for money sorry for the taxpayers should be the first public interest so and that's how the interpretation of term competitive neutrality changed luckily okay you mentioned that the slow is forced in 2017 am I right sorry the law was the law like six years ago no no the law started on February January 1st 2024 this this year okay because I've got the question that's how often can be used that's security reason because there are you mentioned there are two reasons like it's third party and the second is security reason when there is exception so when it comes for like minister of defense the software could be perhaps proprietary software but if it's like pretty new so so I don't know if it's off or no yeah so this is this is exactly the point so the law is very new so we don't know yet how strongly will be implemented and fulfilled we we know that the government is kind of behind on releasing some guidelines and they're right now actually providing some guidelines but it will still take a few months or maybe years to really get it going on okay so on on the security issue I've so it should be used very sparingly right but there there are arguments that can't be made like Bradley mentioned when you're looking for people who're trying to evade like taxes there you can make an argument that people knowing how the government looks for tax evaders makes it easier to actually like beat these algorithms right so I think it's a good thing that's in the law while at the same time security by obscurity is a very bad thing and it should be like we should use that very sparingly but I think it's good that it's in the law so just from a like cyber security guy point of view and then the actual question do you know what the because we have the same argument in Austria where it says like the government may not publish open source because it it like distorts the market of proprietary companies profiting of proprietary software can you summarize what like the other side of the argument was and what what claims the lawyers made why this is not an issue actually you mean the pro side the like how was that how did they debunk that hopefully like you know ps yeah a good thing that you're from Austria so I can just send it to you the little penny but um to summarize they just said only real economic acts can distort the market so they compared it if it's open source or closed source is like if you write using a pen or your laptop it's just a means to help you doing your work so an auxiliary service and even if it was market distorted then you just have to have the legal basis okay but but then it's still necessary for to have better quality so very quickly I've seen other laws in other countries that were open source by default failing so it's good that you have already clarified that for instance security can be a way to circumvent this I wonder if there are some regional I'd say regional car valves like this is federal can a city or a canton or whatever avoid to apply this law because it doesn't apply to them uh yes so I have to say this is only for the federal government this is binding only for the federal government sub federal governments can do whatever still but what I would would also expect that usually in Switzerland at least the what the federal government is doing the canton's and also the non let's say the non federal players are also looking at so when they see there are benefits and there are obviously benefits otherwise it wouldn't be here so then I hope that people will be more used to actually procure open source software services and build communities okay let's let's thank you Mattias and Rita
Hot Topics: Organizers of the Legal & Policy DevRoom discuss the issues of the day
Okay. I'd like to invite everyone to take a seat for our last session. And this will be the REL panel continued probably. Well, I think that we'll start off with a different title and we'll see how it goes. So the purpose of this session is for us, your Deverm organizers, to summarize a number of the hot issues of the day and maybe go a little bit deeper on them. We've got a little while. I think that what I'd like to do is ask each member of the organizing committee to say a few words about their thoughts or what things are top of mind for them. And then I think we'll open it up for questions. But let's start with you, Karen. With me, I've spoken a lot today, so I'll leave more to other panel members, but we're looking for things from the day or just general hot topics. I mean, I think within the US there's quite a lot of legislative proposals over right to repair, which I think are quite interesting. I know there was a really interesting inter ruling that came out in this case called Vizio in the US. That was really fascinating. So I'll leave with those two. Let's go over to Alexander. I'll come back to me. He wants last. I'm going to take off. Okay, that was the longest way to hand over the microphone. But anyhow, I think for me, particularly the public money, public code card talk was interesting. So I mean, tomorrow we will also, Lina will talk about interoperability. And I think that's pretty interesting what's happening in Europe nowadays. So many countries are deciding to procure more open source software, free software. And it's interesting how they do this and what ways they're looking for. And still most of the approaches come as a lot of loopholes. Also, as we've seen today with the security, a reason maybe. But still, I think we made a huge step, also this public money, public code campaign, but also like, yeah, in the last five to two, three years, I think there was a lot of improvement and it was interesting to see it here in Fosbytsel and I think so. Okay, I am, I am my care. So, so I think the first thing for me is that y'all you're going to miss me to not want to talk about Rell anymore, which was pretty hard to do. And so congratulations on that, because I really don't want to talk about it anymore. And I've been talking about it for like 20 years and never stopped. But you filled it up for me. I'm done. But you're probably not. The big question, ask it. But the thing really stood out to me was that last talk. You know, I don't know if y'all saw my talk on the main track. I had this whole thing about things being politically unviable and certain types of free software, activism is just politically unviable. I was literally writing those slides thinking, well, a law that you would have that would require things to be open source for the government would be one of those politically unviable things. And the Swiss did it. It's, it's, it's very exciting to me and I just, I was really glad to see that talk. So for me, there's one thought that circled a bit around during the whole day. And this is, I mean, there are a lot of actions we see when some companies or organizations think that they could make more money when all the people who use the software would have to give money to them. That's something we also see in the, in the public sector. That this is actually an issue that there are meanwhile actors out there who take free software components and bid on them. And then they, they get money from the public administrations to deploy free software. But everything they do will not get up-screen again. And all the developers that actually do the work for the free software components, for them it's not sustainable anymore. And then what we see in those cases are those re-licensing, yeah, trying to solve this by re-licensing or, yeah, how to in general like what, what they can do about that. And I think, I mean, for, for me, one of the question is also like how do we as a community establish norms that those who actually ensure that we get better tools under free software licenses out there that they can do so and that we enable them to do so. And that in the end, yeah, also then the public administration gets free software that, that is successful for them and not, okay, we offer you this, this software components here, deploy it somehow when there's a problem. Oh, we don't know how to fix it. But we also have some proprietary software in our portfolio which we can sell you and actually we would prefer to sell you this. So last year when I had a chance to speak on this panel, chat GPT had just been released and so I was very curious about the role of AI and free software. And I still am, but I feel like this is a complex issue both from the sort of the intake side, you know, the GitHub co-pilot standpoint of the data that's mined to train the models and then also, you know, what are the downstream rights of AI or even, you know, who is the author of AI, you know, created code. I think these are really important questions for our community but, you know, today I'm struck with maybe more pressing thoughts. I also was very impressed by the talk from the Swiss and it reminds me and I know that many of you are from Germany. Am I not remembering correctly several years ago the city of Munich required open source software? This is a project which over the city of Munich community has a Linux based operating system. So yeah, with Linux that was something which was one of those projects which was very visual out there and always when people talked about free software and public administrations it was about Munich and actually there were then political reasons and other reasons why they switched which had nothing to do with the licensing of the software and the results also they had in several of the studies about this steps was also that this has nothing to do, the failure which was there had nothing to do with the software but with management structures, with a lot of decentralization there which was not coordinated at all and yeah, there are, I think there are some recordings of me talking about Linux from other conferences which you, yeah, can then follow in longer detail about that. May I just add another sentence to Munich because it's super interesting and we are following this a lot and after the Linux case they still continue to do open source and they are one of the administrations still in Germany that are very like progressive in terms of how they procure open source and free software and just lately for example they joined the free software project as a sponsor so they even donate to projects so they found a way to not only procure software but also support free software projects with donations and that's something you should also like not forget about. Well I also think that this clause about not having market disruption by open source software is a completely bogus argument. I think that we can easily defeat that. I think that the European message of public money, public code is great and I think that there's a lot of reasons that we can defeat that argument. So I think that there's a lot of promise for public tenders for open source software. But I want to end my thoughts on how important I think this Neil for Jay decision is. I was tracking it but only superficially and now I'm actually very worried about it. It seems like the reason that I'm worried about it is that well we could say there are all other things that copyright can't reach. Copyright is still a tool that we use a lot and in particular the AGPL I think is a very important tool in the world of cloud based software. So for me the Neil for Jay decision feels like if we were to allow this to go to pass that proprietary relicensing would defeat the open source side of the license and that is really for me a very serious concern and I'm shocked that the court would feel comfortable excluding a part of the license that they didn't like and say that oh you can just ignore that. That seems ridiculous to me. So I'm very concerned about Neil for Jay and I'm hoping that it doesn't set a negative precedent for all open source software. I would say it could be a lot narrower than it seems like that decision. If when you read it it's kind of a mess. So trying to figure out what it stands for we'll have to see what happens on appeal but I think it's premature to feel like the sky is falling but it is it is just stressing that it's in this situation for sure. Any other thoughts before we open up for questions? Yeah let's see questions. Hello. Hi everyone thanks again for the wonderful DeVroom. My thought for the year is that we're seeing very much more than previous a complete differentiation on the way that open sourcing handle whether it's copy left or non copy left and I want to give an example. I was in a presentation a couple of days ago that they were talking about AI and yeah we will develop this wonderful system that will create code for you and we train it with lots of code that we found and we make sure that we remove all the copy left ones. So yeah it's a complete separation of how we treat copy left with non copy left while obligations for open source license are in both. Yeah I mean I think and one of the biggest problems with code generation systems is the lack of attribution and nearly every FOSS license copy left non copy left has attribution and I think you're correct that since there hasn't been any kind of aggressive raising of failure to attribute with regard to non copy left stuff I think companies who are doing this kind of code generation don't think they have to meet those requirements and those requirements are important and should be met. They're there in the license. Just to put in a simple find point that you missed is that using things like a generative model to generate your code your code is not copy rightable. You cannot copyright that code. The IP you still have the IP issues which may be involved which may be litigated but the code itself is not copyrighted. I'm not aware what decision held that and in what countries but I have heard. No it's unsettled. Sorry it's unsettled. No no but you are saying that there's no copyright. Just one big thing with the stream we cannot have a discussion with the panel because we just have two microphones so we need to keep it to a question and then we can try to answer else we need to put that after the session. It will not work with the live stream and the recording. Yeah definitely thank you. We were saying that the folks that are the folks who are responsible for these generative models of course want the output to not be copyrightable. The material they say can be used in whatever manner that the users decide but that doesn't mean that when copyrighted code is output from the generative model that the copyright doesn't apply and there are eight lawsuits in the United States none of which have been settled all on this topic. It will be really interesting. I don't know if everybody wants another legal policy dev room next year but if we have one next year we will probably be discussing the results of some of those cases although they will take a long time and there will be appeals. It's very interesting the copyright office in the United States is very very torn about this issue they just issued a call for comments we had software freedom conservancy gave comments. The FCC reached out to us about this issue we didn't even reach out to that agency they reached out to us and then they submitted their discussion with us and others to the copyright office so we submitted our own comments and then the FCC submitted. So this is all a discussion that's happening right now and you can see from the questions that the copyright office is asking that they are really not sure so it will be interesting to see. This is actually more of it's a meta comment about the organization of the dev room but we were actually a little bit surprised. I'll be frank I was pleasantly surprised that we did not get a lot of submissions this year we expected to have just like a giant stack of AI and this AI and that AI right because it's the fad topic right. We had a few but not as many and many of them were very speculative but if somebody wants to do a very for example if somebody wanted to do a survey of all the cases that are pending about copyright ability and artificial intelligence and FOS that would be a great submission for next year and we would welcome it. Building on that comment about copyright ability of generated code speaking only for the jurisdiction of England and Wales and probably British law in general UK law in general it is certainly the case that computer generated code does attract copyright that's specifically mentioned in the UK Copyright Act. What we don't know is exactly who the copyright holder is going to be it whoever made the arrangements and there is some fairly problematic well there's one very problematic case that suggests that it's the programmer whoever the programmer might be but anyway. Okay so your assignment for next year's dev room is for every country submit the cases that are pending on this question. First of it we could do like a five minute briefing right from each country. Yeah first of all sorry for my horrible English. Of course I don't want to talk about redhead but what is actually the problem. The problem in my perspective is such companies like Oracle for example maybe and what our solutions was one of the questions maybe we someone can explain what is the difference between 501 C3 and 501 C6 and the Linux foundation. Yeah it's not for public good and maybe that's a problem and maybe Karen you can tell us why people like you are not on the board and there is no community seat anymore and maybe something like this is a fucking problem in our community maybe people like Brian Landuk is talking about that for five years and more and nothing is happening and maybe we can talk about that and maybe our community can do something to build up a foundation or something like that maybe or there is one we can join and say yeah you are doing Linux and stuff like that for money gonna go fuck yourself and maybe stuff like that can bring a better and more healthy open source community by the way I'm well socialized with BSD but I love the GPL and GPL maybe is a success model that is getting ripped. And people say that I'm controversial when I make comments about the Linux foundation. So I think it's a good question I don't think it's as bad as you're putting it forward because what I've heard not just in the last year but more over the last two years from lots of companies that are members of trade associations like the Linux foundation which are the 501 C6 that's a US designation. I think 6 is twice as good as 3. I think companies are starting to realize that the C6 model of the open source foundation is really just a coin operated model to do things and if what you want is to throw a bunch of company money together and get a particular project done that benefits just those companies it's a reasonably good model to get that done but Microsoft is an example of a company that would want to get that done and if they want to collaborate with Intel and everybody else they throw their money in and they make a project and I don't really have much complaint about that activity of the Linux foundation. My traditional complaints about the 501 C6 models is when they attempt to say their job is to serve the public good. Their job is not to serve the public good their job is to serve the common business interest of their members which is usually quite different from the public good. In the last two years I think the entire community is getting a much better understanding of these two different very kind of like US tax geek forms of regulation of organization and understanding that a 501 C3 is a charitable organization in the US that has to act to the benefit of the entire public and a C6 is a trade association. We also have my colleagues from EPSF Europe here are just doing amazing work in Europe by using an EV structure and I don't know if you want to comment on how that operates and how you're able to do good things in Europe with that kind of governance structure. You mean with the association? Yeah, with EPSF Europe. Is EPSF Europe an EV? Yeah. I mean in Europe we rather have the problem that it's very difficult to create associations which are then charitable in all the countries so if you want to operate all over Europe that's very tricky. Some things might change there at the time when EPSF was founded in 2001 that option didn't exist so it was founded in Germany as an EV because that's quite easy. You just need a few people and then it's done and then you can become charitable afterwards but as an association you are then struggling with getting you have tax benefits for donors in Germany also in some other countries who didn't accept well if Germany said that you are a charitable organization then we also believe so but for a lot of other countries you don't and then it's less attract or you don't get the benefits from tax deductibility like you get in some other countries but related to that look out at those links over there so if you want to support organizations that are working on the issues mentioned here during the day then those are good starting points and of the advertisement and over to the next question. Well it might not be surprising that it's more comments than questions coming from me again and there's two things I have and one maybe from the first talk of the whole redhead thing people might have gotten the impression that I'm not like anti that I'm anti capitalist or something like I love money and I love making money and I think people in free software should be able to make money right. I think the issue that we see is freedom one or freedom zero says the freedom to use the software for any purpose without restriction and if we build systems where a for profit company has a commercial interest to restrict people to use their software they are fundamentally at odds with that freedom so we need to start building communities and systems where the code itself and the software itself is owned by a non-profit organization and then there can be an ecosystem of companies around that that sell support that sell distributions it's probably not going to be as easy as I'm trying to paint it here right but we need to avoid situations where a company owns large parts of the software and has a commercial interest to stop you from using their software without paying them money right that is an issue. Yeah to add some controversy there so I mean we at the moment we have some discussions about public procurement and how to create successful solutions for free software there and I mean one of the topics there is this issue about there's a procurement and you want to ensure that those who actually have an interest in further developing the solutions that they get the tenders they that they have the knowledge to actually fulfill the tender there and not just the cheapest offer so that is one of the issues and there I would also be very interested to hear from you afterwards what are good criteria you could add in tenders that ensure that it's actually the people who know the code best or who contribute a lot to the programs that you can rather ensure that they are more successful when they have free software experience than those who don't have any free software experience and whose contribution might also not end up benefiting others again. The other question that popped up there is how when you want to break the monopolies in a lot of those areas in public procurement in public sector software how do you have to create what conditions do you have to create that companies there can actually that they would also invest money to further improve the software by yeah by by taking some of the capital they have and investing that already before by anticipating what public administrations might want to have and that is a problem which is still with free software more complicated to accomplish that you make this upfront investments in comparison to proprietary software but I still think I mean there there are models out there where you don't need to have the software with organizations there can be models where software is with individuals with companies with others and I think I mean the great advantage of free software is that you can have this mixture you can have a company developing the software but on the other hand you can have public administration paying some employees to also be developers of that software and you can have other companies also providing services for the same software so I think that's that's one of the big advantages we have there and we don't have to decide like all the software has to be owned by foundations or by nonprofits or but you have this this this mixture and that's the thing that that is important that we can see what models work better yeah I just want to tease out a really interesting point there which is that historically we really did talk about like gaining corporate adoption that was like one of the big and individual hobbyist and individual contributions right like those are what we're focused on and we were a lot less focused on collective action and so we had some examples of nonprofits that took on big big software problems the problem that we have is getting is one having the momentum to get started to solve the really big problems that are hard to solve in a like in a distributed way and second the the problem is that the challenge of good governance for those nonprofits if you if those are the ones centralizing that that initiative is extremely difficult because transparency is essential but not enough you know conflict management is not enough and so I really I truly believe that we're in a situation where we need to try a much bigger nonprofit and governmental lead software solutions for really important huge infrastructure solutions but to do so we have to try radically different governance structures than we've ever done before I want to add on that I think I think the theme we've heard from you've been saying all day and Matias and Karen were both speaking to is that the problem of monoculture and software is obvious in proprietary software but it can also occur in free software and it is extremely difficult to design a licensing structure or a governance model that is guaranteed to prevent a monoculture from developing around the projects and I think it's an open problem in fast policy at this moment so the reason why we were all struggling so much we're like but that's so why are they doing that and they're all in control and they're doing the wrong thing and it's upsetting me because we don't it's not like we have a ready-made solution we say oh if you do it this way if you change your governance model this way we'll be guaranteed something bad like this won't happen we just don't know how to design that yet and something I think we're still learning as a community my question was a bit around that if we look at the Linux kernel like it was done by developers and they retain like even if there is the Linux foundation and so on they still retain some control like the maintainers can refuse non-free user space do we have model where the project is controlled by people like for instance in GNU and company can still participate but the agenda is really set by the maintainer do we have like different models and good models for that yeah so it's interesting to use Linux kernel as an example is that we've we've actually seen the Linux kernel drift into more of a oligarchy from what was a truly when I got when I downloaded my first copy of Linux in 1992 and wrote some patches for it you know it was truly a hobbyist project and was kind of all the best things about free software and the commercial interest in it led to this oligopoly you know there's like seven companies or something that are 90% of the contributions every every year into Linux upstream and holding all those copyrights so so a project totally drift and it's still under the GPL we still have the ability to get our third party beneficiary rights under the GPL and Linux so there's lots of good there but the governance model Linux is it has drifted to a more corporate governance model just basically organically and it's weird to think that a project could organically move from a hobbyist project to a to a more corporate project but it kind of did it yeah so I feel it's somewhat reminded to like the the essay the cassette will enter the bazaar by by this discussion I don't know whether it's general known but it's it's it's like quite some years old and I generally a good read but I wanted to actually ask a different question or like bring up a different point which we talked a lot about legal issues here and we basically did not talk about ethical issues at all no one talked about freedom and this is something that everyone talked about open source software and like this first talk free was used several times but only in regards to price no one ever thought about like the freedom of the users and I make made me a little bit sad because I mean there's like an FSF logo up there and I think yeah we should talk about ethics we should talk about the freedom that the users deserve that the users need that the developers need and we should talk more about freedom and about ethics well that reminds me of yes please applaud flat that that makes me think of one takeaway that I got from Bradley's talk today which in the main room about third-party beneficiaries and I think that I think that many of us you know read the open source licenses with that in mind it's sort of obvious you know that that the user is getting a huge benefit from these licenses and you know one at one you know if you if you take that thought with Linux you think well great as a user Linux is open source software but what's a little bit counterintuitive and this ties into the earlier discussion about trade associations is that the copyright holder for Linux kernel you know the Linux foundation specifically has the interests of the trade association in mind and not the third-party beneficiaries so I just wanted to call that out yeah I just I also want to add that in the first the rail panel you know when we were talking about the spirit of the GPL we were really talking about freedom we didn't put a fine point on it but that was at the heart of that conversation you know one of the things about this dev room is that we wanted to make sure that there is a place to have advanced conversations so not introductory conversations about the important legal topics that are happening now and I think your feedback is really astute because I think that sometimes we're so focused on more advanced issues that we don't take a beat to take a step back and talk about the the impact in a fundamental way so I like that we should do that next year I've you probably noticed that in recent years I've changed my rhetoric to talk more about rights and inalienable rights because I think I think that the the way people interpret the word freedom and free has this ability to polarize people and inalienable rights are a thing that every light the UN has a UN declaration of human rights there's this idea that there are certain rights human beings deserve just by being humans I personally think that the right to copy modify redistribute improve software is one of those rights and I actually think it's better rhetoric I kind of think we should have had a software rights movement from the start not a free software but I think I think it's better better kind of categorize that way myself but but but but terminology matters less than ideas I think and I think the ideas are are aligned in the ways that you're concerned about I think it's one of the exciting missy so you haven't talked yet did you did you want to did you want to answer this no there is another question here but okay then but I think it was a good reminder that we are as we have this deaf room as an advanced topic deaf room we sometimes forget to remind ourselves why we are discussing all those topics and that that's exactly what was mentioned there if we call it software freedom or user rights or freedom I think we should be on the same page there thank you sorry I'm about to change topic but I've been dev room hopping all day so forgive me if this has been it's Boston and but forgive me if this has been covered but I was just really interested with the expertise we've got around the panel here what people's opinions are on the current status of the CRA I was in the presentation earlier on in Janssen and I'm just wondering if I'm the only person who's terrified and if I'm wrong to be terrified yeah I mean we at first time I think we discussed it last year and also this year you mentioned the main hall discussion but there will be also a deaf room tomorrow and we have a longer discussion starting at 9 in the morning maybe basically we'll talk for two hours on the CIA I'd say it's a it's a major improvement from the initial draft what we got from the commission and that was improved a lot by the European Parliament by the Council by the Member States so that's that's definitely a step forward I struggle a bit to be honest with the wording in the product liability directive because the commercial activity is not that precisely defined as in the CIA so I'd say there we made a still an improvement and also compared to the initial Commission proposal but it's not that super awesome good and this is pretty much also true for the which is kind of like combined with the ideas of these kind of exemptions and also in all of these files they will be like covered with a lot of like guidelines and delegated acts and we will just like learn in a few years how it will pretty much look like and this is I think something we need to do for the next years is like pretty much monitor the market monitor what will what will happen there and this is also like pretty much a call so if you if you see companies struggling because of these files then it would be very important to let us know about this nonprofits are out for all of them and individual developers are out for for in all of these files so there shouldn't be issue but there might be an issue if you transform your nonprofit singing into a start up at one point but this is also I mean on the other hand covered also with the like what's what's what's what's defined so what does the what's the sanction and this is and there's also I mean there are different rules for small and microenter prices than for like big tech and that's that's something where we can say in the whole debate we managed it that we that we were able to to make our points to what's decision makers and they understood a lot about free software during the debate in the last year I think and that's something we can build on for future debates and this will help us I think a lot for next files that they might might come up as a better proposal there's a better initial proposal let's say and to quickly add there for organizations like the FSF to follow up on so such issues for decades as well as for other organizations working in free software like as a CEO or FSF who really are working for for user rights it's important that we have the resources to actually do that so have a look at this slide and I was gonna say on a logistical point one of the reasons why you didn't hear a lot of that content here is that we know that there's the room tomorrow but you know it's certainly within the scope of the dev room if there are specific topics that you'd like to see in the future if we put out a call for a dev room and you don't want to present as a speaker but you really wish that a topic would be covered just send us a note through the system and you can say I I really think this is an important topic and sometimes we get like sometimes we get several proposals about a topic but none of the talks are exactly right and that's sometimes how our panels get formed and we could do the same thing if there's a topic that you all want so don't feel like like those of you who come and especially some of you who are here you know a couple years in a row you've got a good idea of what you think should happen here and you can be a part of that organization I just want to add how how appreciative I work of Alex's work I am and FSEF Europe in general on the CRA stuff we we saw in the US all sorts of US companies trying to lobby about the CRA stuff and then of course being in the US organization we get asked like what you know why is this US company like involved in this and I mean I'm just a stupid American what do I know about what's going on in in Europe with with the EU regulation and being able to rely on Alex's expertise and FSEF Europe's was a huge benefit it was it was a really as we've been going through this process it's a it's a great collaboration to be able to work with the FSEF Europe folks on this to have a place to go like we have we can we have no idea but we can point you to the people who know everything about this and are doing the right thing with it. Oh okay I have no idea maybe you do. Yeah I will take advantage of these two last comments because one thing that I expected that will be discussed here and I've not been hearing for it about it all day is the Digital Services Act like we talked for many years with the problem with the platform side stepping all the issues about you know copyright and free software because they just you know they run the free software on servers but what they do is collect the data and now we have this very big European regulation and from what I've seen from the schedule it's a topic that is not being covered. I wanted to to make a proposal for a talk I didn't manage to to get to it but yeah as a suggestion and I don't know I wanted also to hear your opinion am I wrong in thinking that this is relevant for a room like this or not. I've been answering so many things. Yes I thought yes I can with a very like high probability think about that there will be submissions for next year's call for paper which are covering this area. So we take this feedback. Another one. Sorry just a really quick follow up in the in this first talk like literally someone on the panel said that it is OK to take away the rights to distribute the software from the users and as long as someone does this on the panel I think yeah OK we need to have basic discussions. Yeah we need to have a lot of basic discussions because like apparently we can't even agree on like four essential freedoms that the users of software should always have and be at their users of Reddit. And you I'm sure you saw my comment during that panel and the reason I made that comment was because I was like it was Liam who said that it lacked the nuance that needed to be there right and Liam's Leons journalist for the register we remind everybody and therefore the register is known for its controversy provocative headlines and I really I really think it is important on the rel issue and every other issue and the one of the reasons we run this dev room is that there was a tremendous amount of nuance to these issues right and and I think that too often people jump to conclusions about them as you saw Liam do for whatever reason Liam's case I think it's to get people reacting which is fine but but I think it's important we think about the nuance that's there. So I think we are we're almost done with the time what about we just have a short round what each one of us some last words famous last words of today. Okay the first last words yeah thank you for the feedback I'd say this is definitely something we will work on for next edition I'm really looking forward to next year maybe we get the big room again and maybe a bit longer so that we can address all the topics you want to see and it was a great fun again thank you so much. Okay then I just have two two messages one is please take away with you that it's important that we are paying more for people or organizations or companies that are ensuring that the software you get is will give you the four rights to use StudyShare and improve them then you give for those companies that restrict those and the second one is please add in your calendar the 14th of February and say thank you to some other people in our community so that they are motivated and see that their work for software freedom is appreciated. When you go back home open every box of every electronic device you've had and see if there's an offer for source code and test it if they give you anything go to the use the source site on Ask of Concurrency and work with us to test whether it actually complies with the GPL test your offers for source all year please. I love this idea of asking you for your ideas of what topics we should cover in the dev room. If you go in the FOSTA mailing list you'll see our call for participation that has our email address I will just say it to you now which is FOSTA legal dash policy at F-A-I-F dot US which is free as in freedom dot US and encourage you to send us an email with your thoughts about topics that we should cover in the future and thank you so much for coming. Crap that address is running on my mail server I'm a little scared now. And what I was going to say is that if you like this dev room and you wanted to happen next year we did get a very small block of time this year so mention to the FOSTA organizers that you like it that you want to see it again and that way they'll know when they're organizing it but I would say overall it's just been delightful to talk about these important issues you know obviously we would have liked to have talked about even more issues and have different formats for those conversations but it's really been lovely thanks to Alex, Matias, Bradley, Tom and everybody who's here have a great rest of your FOSTA.
Cologne Chip GateMate FPGA -- filling a gap between hardware and software (with a presentation of the GMM-7550 module)
Okay, good morning everyone. I'm glad it's my honor and pleasure to welcome you and to open this development room session. And what I'm going to talk about is a gap between the hardware and software and why FPGAs are capable and should fill this gap in. What I see as a gap and what I see as a problem with it. But before moving to the really interesting stuff, a few words about myself. Just so you know who is in front of you. A picture probably you've already recognized. On the right hand side a few buzzwords I still can remember through my career. Some of them perhaps a bit too ancient, but anyway. And the reason and motivation why I've decided to come over here and talk about this gap between the hardware and software between the FPGA and software development board is that with years I feel more and more frustrated with the pace of observing, mostly observing the pace of the software development, how new technologies appear and disappear, so of course as well, how rapidly it's going, how many young new people are coming to the software development boards who have no clue about the hardware under this software and who do not know and doesn't care how to develop this hardware. And I believe that this difference between the hardware and FPGA development direction and modern software development environment is just widening with the years and that is dangerous for the entire industry. So short outline of the talk, the first part and introduction and we've almost through it already. Then I will talk a bit or mostly ramble a bit the differences between the hardware FPGA and software development as they are seen from the low end, from the hardware and FPGA part, not from the very fancy high end software development board. Then hopefully if it's still running I can show the live demo what is currently changing in the world of the FPGA, what is new there and why I believe those changes can fill in the gap and can brought more software-oriented people into the hardware, into the FPGA and hardware development. Then of course it's just the very first steps I've been able to make or see them, then I will talk about there. Road to head and yeah more about the hardware, some backup sites, contact information and of course what you can do after this talk or if it would be interesting enough for you. Since we are done with the introduction and plans, first of all if software people are not paying too much attention to the FPGA development why should they? They are perfectly fine with the modern hardware, with the modern development and all the technologies around it. Unfortunately for the last 10-15 years it's not exactly the case because the general purpose CPUs are almost reached their capabilities and there is no clear way to progress for them. For multiple reasons technological complexity, fundamental technological and physical limitations and so on. The other reason is that the current software payloads are getting more and more diverse and the very different high variety of different requirements and general purpose CPUs while they are capable to run all of those payloads they are not well suited for any single one of them. For the last 10 years with the rise of artificial intelligence and all those neural networks somewhat unexpectedly GPUs came into the play but this type of neural networks and modern architectures of neural networks are coming and going on the one hand side and on the other hand side GPUs after all have not been designed for those kind of payloads and alternative would be easier to design the new ethics for specific payloads but again there are too many different variants, too many different architectures and requirements are changing every two, three years with all the pace and speed of software development and no sensible hardware company is capable to produce and use silicon every two, three years it will just not pay off for all the investment especially on the modern technological nodes and with all the investments but there are very flexible and nice chips which are manufactured in huge volumes on the most modern technologies and which are suitable with proper programming and with proper support software support and proper tooling support which are capable to efficiently execute any payload any imaginable payload. So there unfortunately here we see another roadblock because all the software development software for the FPGA world is still 40 years old at least if no of course there are new releases every year or two releases a year but all the underlying technology and the design approach is still 40 years old which is ages and of course as I've mentioned it seems to be a very boring subject that not that many people are interested to look at it which I think I might be very wrong looking at the number of people here and still coming so some people are still interested in the FPGAs and development for them that's good and encouraging. I've already mentioned all the things in the mid 80s rise of RTL register transfer layer design methodology was a huge breakthrough in their ASIC and later on FPGA design and the two languages they are still around us and they still are prime languages for such a development VHDL and VRIOHDL were developed I think in 1984-1986 and those two languages are still with us and most are actively used nowadays. How many software development normal programming languages you remember from 80s which are still in active use Fortran and Lisp they are even older than VHDL and VRIOHDL and probably nothing else but definitely CER yes of course PDP 7 PDP 11 but also neither of them are on the top or own even 10 or 20 percent of the development and VRIOHDL shares 46 50 percent with 3 to 4 percent for everything else so it's quite a different picture compared to what it was about and after all those two languages and the RTL design methodology were introduced primarily for the ASIC development and only later on they were by coincidence because they were available new and fancy things at that time they were picked up by then young FPGA market adopted and still there with more or less nothing new even all the cheese sales, color, spindle, HDL, system, c, HLS all those new fancy technologies are going to generate VRILOCH which goes through the custom proprietary VRILOCH synthesis tools which is oh well you can say that LLVM and GCC are still can generate assembler but that's quite different but those things which are pretty common and well recognized by the industry and by many engineers working in the industry are well understood but they are not two main reasons for my disappointment in the current development the problem as I say it is that the only way we are programming FPGAs which are reconfigurable at any time you can change all the configuration in them at runtime anytime any number of times and you can do it really reasonably or relatively fast all the way we are still using and programming FPGAs is this exact way the IBM 402 or even the previous generation of tabulators have been programmed you hand wire a bunch of wires you call it a program then you plug it into the FPGA okay after synthesis and place and road things then you just put those bunch of wires inside your FPGA and using it as a playing hardware not changing a single wire at runtime uh partial reconfiguration is somewhat advertised by both major vendors Xilinx and Altira now Intel and AMD now not Intel again and they are advertised for 10 15 20 years but real support in either Vivado or Quartus is near zero you can do it but in real applications it's very very rarely seen the other thing in this same domain which is advertised for 20 years and still have some limited acceptance is of course HLS high level synthesis but anyway we have their modern chips manufactured on the most advanced technological nodes and we are still using them the way we used tabulators 70 80 years ago just putting wires put them in hardware and using this bunch of wires and the second reason for disappointment which probably just came from my experiences and embedded systems development developer that all the development is cross development strictly cross you cannot 40 years ago you can take an xiloc z80 processor you have a machine built on this processor and you have a c you have fort you have Pascal you have basic and you can program on the same chip you can develop everything on the same chip nowadays we have an FPGAs which are two three orders of magnitude more powerful than that 80 but you still need a huge x86 machine to program it to do anything to even to blink an LED connected to FPGA you still need a somewhat powerful machine and I have to admit that the modern FPGAs even the smallest of them are much more powerful than the machine on the left hand side and this machine will not be even close to run anything of the modern FPGA design tools. Vivado's latest download is something like 11 12 gigabytes, Quartus and Libero are somewhat around 10 gigabytes as well so you need literally 10 gigabytes or for some proprietary closed software to develop any application blinking LED for a one single tiny chip which is more powerful than anything you used 40 years ago so those two things which forced me to look around and to see if something is possible to change in this world and fortunately there is in 2020 so almost it almost exactly four years ago I've seen a new FPGA chip appears on the market then it was not even on the market but only on the show it was an gate made FPGA from Cologne chip and now I will try to make a live demo it's just a screenshot now but it's always better to see something live if it works so what we have here and I will talk much more about the actual hardware we have here in a few minutes but we have a console of the Raspberry Pi which is just a baseboard for the Cologne chip based FPGA model with some LEDs connected to it and what I have as a demo here is that the FPGA is pre-programmed configured with a default image a small software application unfortunately at the moment written on C which is roughly the same age as the rest of the technologies discussed here which exercise just a simple AND gate and then at runtime from inside the chip overload this configuration of the chip exercise it again and I think we will observe some bugs because the demo is relatively young it's something like two days old this running version so we just power it up connect to the console stop out of boot then upload and demo image okay I still remember the name of the image in the morning it's not always the case and trying to run it at that point I would ask you to observe those LEDs running blinking not the blinking one on the below but the LEDs on the top one it will show something like an AND gate behavior so two inputs one output three LEDs blinking and the same will be printed out here the input to the signals and output signal and then we will see if it works or not so okay for AND gate it's visible now it will try for five time repeated for five times to configure and now the blinking pattern is different it's now NOT AND gate and the bug I've announced before it's not intentional for the demo so you see with two zeros on the inputs there is still an LED on so output is high output is one but what is shown here is NOT AND AND so there are no changes here it illustrates the problem with the tool chain and with the changes so what has happened the output which goes to the LED are changed at runtime and it's observable but the outputs which are going to the internal CPU subsystem are not modified they've split it through optimization and place and road somewhere inside the chip and there is a bug so a demo is half running it's running on LEDs but it doesn't run on their console unfortunately and I had no time to fix this part of the demo but the most important part and essential thing about this column chip gate mate fpga is that it allows to change the configuration at runtime but those things are available and have been around for some years from the other fpga vendors that's so-called partial reconfiguration runtime partial reconfiguration but what is important and what I have not ever seen in any other fpga maybe it's available but a well-kept secret from the fpga vendors it's a possibility to change the configuration and run time from inside the chip itself and it works here tool support and how well it's supported with the current software is and the question remains to be answered not so good but everything is well involved and there are much better perspective here for the column chip because the company it's a small german company probably develop in this fpga and this fpga is their first programmable solution product before that they were working in the sphere of communication chips and modems and some asics design yes what was the intention to do this for as far as I know I have no association with this company directly just know some good technical support people from there to have an fpga developed one of the motivations to have an fpga developed outside of the us because two major alterations latest micro chip most of the or all the other fpgs have their ip developed within the states so the this column chip gate mate fpga is developed in cologne and it's manufactured in germany near dresden its global founder is factory so to have a kind of technological independence also to fill in the niche because major fpga players switched or have much more marketing and economical interest in meat and high-range fpgas this fpga is relatively small so it's small physically low power consumption and it fill is fill seen the low end of the fpga market and the other interesting part which as far as I understand was at least partially a motivation for this development is to have a free and open source toolchain officially supported synthesis path is yosis so gate mate fpgas are officially supported by yosis synthesis tools at the moment place and road is a proprietary binary from cologne chip but the configuration bitstream format is not 100 but partially open and free so and as far as I know there is a plan to provide the support for place and road for cologne chip gate mate in next pnr and once it's available maybe well it has been expected in year year half ago maybe it will be available this year they do have some limited resources working on it for multiple reasons as usual everything is much more difficult than it seems to be but once the next pnr support for the place and road for gate mate will be available it will be the only fpga supported by full for which the full design road is supported by the vendor with free and open software I know about the latest I see I know about the reverse engineering of cyclone 4 some parts for spartan 3 4 6 I do not remember I'm not that familiar with silence product line those are available of course but they are not officially supported and provided by the chip vendors themselves so you still once it's a reverse engineering you never know what other hidden secrets are inside so I believe that was the motivation and you also have a question available in non bga packages so I get reasonably sure unfortunately no but there is a solution for that as well just in a few minutes please I guess sure the question was if this package is available in something more suitable for hand soldering not in a bga package unfortunately the answer is no the only available now is bga 324 volts 0.8 so not that good for hand soldering but the rest it used to be a there is a way to try the chip without soldering it now I can say that there are multiple ways to do that relatively approachable accessible just an outline perhaps every single person in this room already know what or can imagine what can be done with those possibilities to change the configuration at runtime from inside the chip itself with free and open tools but as I see it it's possible if you have a soft core CPU at runtime you can change its instruction set architecture you can add your custom instruction to it at runtime nowadays I think most many interpretive languages are running in some sort of jit technology so they are measuring at runtime critical passes in the code and compile them into the assembler or native machine code but now there is an a door is open to make the next step after you compile to the assembly code for the really critical part of your algorithm you can just add unnecessary instructions to accelerate it or you can synthesize at runtime a necessary accelerator load it at runtime and use it after that so it will be a jit not only to the machine level but to the hardware level to the next step the other parts and I think I've already mentioned it several times this morning that it will allow a speed up of the development because what the software people I as I see it are used to nowadays is to have an immediate feedback you type some code interpreter is running it you have a web browser with a page automatically updated and so on with fdga you write some rtl 40 years old code then you synthesize it for some hours then hopefully next morning you see a feedback which slows down the development process tremendously and with this possibility to change everything at runtime from inside the chip itself it's possible to make an interactive interactive iterative and incremental development you can just add part revoke some part change part you don't like so I hope that those technologies will make an FPGA sound much more approachable and more interesting to develop for them please the question was if the changes are if it's possible to update just the part of the design or how well the part changes are insulated from each other on a stanza will be that I do not know exactly yet but as we as I we saw in this demo it was possible to update just a single lookup table so just one gate or just one routing path between the gates generally the configuration format allows to change on the single logic cell unit of FPGA which makes something like an 18 ports two outputs two triggers plus minus some routing logic so I would say that this level will be the minimal one but at the moment in the chip there is no protection and I believe there will never be any kind of such a protection so you are allowed to change everything and of course you can break the system you are running on or break the part of the system you are running it that's a question of tool support on how well the software will support those things of course since you can see the changes at runtime interactively it makes the technology much more open you can more or less see what's going on inside hopefully you can just wrote some signal from inside your design to the pin to the led to the scope logic analyzer so it makes your design more transparent more explorable and I believe it will encourage more people to try it out because it's not some magic box hidden by gigabytes of software and of course since it's possible the dream is now seems to be more approachable a fully native development after all you can instantiate a soft core CPU run your system this soft coins CPU it will take ages but it's doable and then you can take this bit stream and update part of your design so you can develop internally sorry there is another one question yes please would you recommend the set that you are demonstrating for a complete beginner who has never seen any FPGA development before I would be there enough to say yes and this it might be not easy or trivial for the first step but it would be really good if this person even if she or he haven't seen an FPGA development to have some electronic background at least to know or have some ideas about the trigger or logic gate or Boolean algebra is so definitely not to school children's but there were some experiments in this direction as well not on my part but I've read about them maybe not even to the first year graduate students but after two or three semesters of basic discrete algebra and electronics yes why not before the introduction of very low-quake deal and all those nice and useful industrial things of course so first bullets are right here it's possible approachable and it's possible to explore them right now the last one is still somewhat dream ahead and let's take a look what made those innovations possible but before we have another question yes please yes so the comment was that last year someone presented a self-hosted FPGA development platform uh I believe I've read something about it and it was a Yosses running on risk five soft-core CPU slow yes working for sure but here we can see the next step after it after you have synthesized the bit stream on the FPGA how do you put it inside the same FPGA and now it is so yeah that's the next step and I do not try and I am not trying to pretend that it's something very unique I know that there are many people interested in it and working in parallel same directions somewhat different directions and that's just good and should be encouraged and it's nice to share those information and it's one of the reasons why I'm here at Osdame where everything is open free and so on so what hardware made it possible a few words about the FPGA itself it's as I said at the moment the only variant is available with 20 000 logic elements cheap with 40 000 is also available physically in the same package pretty normal mid or low range FPGA but as you see you can run a processor and it takes less than 20 percent the entire system with memory or reconfiguration control support and CPU takes less than 20 percent of the available resources of this the smallest in the chip in the family so there is a path to grow there are some nice programmable PLS 5G service which allows to support USB 3 or PCI express gen 2 which is not so bad for low end FPGA it's even good as I said low power unfortunately the only available variant is in 324 volts bga so not hand solderable but let's wait for a moment for other options we have FPGA itself has a very regular structure very simple one compared to the modern edgy legs or ultra scale it's very simple transparent that makes it easier to support it with the software easy to experiment with it as I said somewhat more transparent and easy to understand and work with it here is their minimal cologne programmable element which is as I believe the smallest unit you can change the configuration for so it has not a conventional lookup table but a tree of two bit lookup tables and trees depth as two or three steps of two bits lookup tables some fast carrying out passes between the neighboring cells and input outputs to the global routing and to the chip block frames which are available on those columns between them so it's very traditional in some parts architecture what makes it different that it's an transparent open and somewhat better documented than the proprietary logic cells of their high end FPGAs and now the question was how to approach this if it's packed in the bga you cannot hand solder for me this question was the first one I've asked it in 2020 in year 2020 right before the corona and all the crises and since I am doing a hardware development for many years the easiest answer was just to develop a model for it I believe that this FPGA is the best thing that happens to the FPGA world since xc6600 which was open and very well accepted by academic community but then suddenly died and was is not longer manufactured back when I started to play or when I wanted to play with this FPGA eval and evolution key from cologne chip was not available yet so I've decided to be in front of the chip vendor themselves something pretty stupid on my part I have to admit of course when I develop a model for myself I have all the freedom to experiment with it model is smaller than the evolution board and you can connect it to something else you can combine several modulus for me the best way to get to know some new chip I am interested in is to design some boards on this chip first one may not work or magic smoke will come out of it but then you know exactly what is going on there and of course it was a fine and nice exercise with key cat because I've just seldomly used it before that decided to go why not to try it it's a good one then key cat switched from version six to version seven roughly at that time yes please make before why I think I'm following everything I found that I sporty I found everything how did you find it and why did you take enough documentation to start doing uh document a uh the question was how do I learn about the existence of this chip because it's relatively under the radars on the FPGA market definitely not as well advertised as uh altera intel altera again xylings imd still imd not sure i see and all other players uh first I saw it's mentioned on some mailing list in earlier 2020 but I do not remember where it was and probably it's impossible to trace down anymore but then I've just stopped by or at their booth at embedded world and it was uh the very first time I've explicitly asked the question if it would be possible to change a configuration from inside and I've got a definite answer yes but uh it's not working yet we do not have a board yet but we have a chip uh and back then it was an mvp so it was not their custom wafers but mvp run the very first one so embedded world uh let's say the first contact with the company and where I learned about them and all the documentation for it available on their site it's free you can download everything and of course you can learn a lot about the internal structure if you go to the usis project and look into the sources for the support of it then you know all the internal details or almost all of them uh what is the current status uh exactly four years from the first contact this company uh three boards were designed instead of single model all three of them are here combined in a system and all three of them are working uh the schematic symbol and pcb footprint for uh gitmate fpga is included in official kikkat kikkat version seven libraries now and I've been able to push it through which not always an easiest straightforward process um some software application as you can as you've seen running on raspberry pi to control model to download configuration into the model to exercise the interfaces of the model uh vhdl examples are running soft core cpu's are running integration with their modern software-oriented fpga design frameworks as light x fuses so few soak some others are walking progress so something is available something is not working yet I think nothing is upstreamed yet completely and uh as I said it was pretty dumb decision on my part because it took five time longer and the first prototype board was running in 2022 so two years from the beginning instead of half year and uh recently in december there is a new model from the bigger companies and myself from trends electronics and there is an another evolution board from olimix so you fasted how to start up development with the um uh gitmate fpga without soldering bga yourself you can use this model you can use evolution kit from cologne chip you can use a model from trends which is immediately available and you can use an evolution board from olimix when it is available as far as I understand they are still in prototype production status if you were late yes hopefully so so uh model and its design is completely free and open source hardware so what you can do you can go to github take all the manufacturing files and manufacture it yourself I've tried but it's not started yet to launch a group gets campaigned so you can order online and if there are sufficient number of orders it will be manufactured and orders fulfilled uh basically this model is universal one what I've tried to do is not to prohibit any use of this fpga so most of the functionality except for single i o bank is available on the external model connector so the other thing you can do would be to design your own base board for it or just an i o extension board which is perfectly hand solderable this memory board extension board is just two layers uh uh soldered and that attached directly to the model so that's all possible uh raspberry pi uh hat adapter is the only board current I currently have to exercise the model so it's uh connects the model to the raspberry pi 40 pin gpo so I can power it on measure current consumption program configure the model and so on uh another small one is just external memory so the soft core cpu inside the fpga is a bit more happy than only with internal memories and it can boot from there so just a simple one and it also has been designed as a part of a mechanical exercise because if you connect all the extension models to this one then the gaps are something like half millimeter and it's a little bit outside of the connector specification but it works so I need some real proof that it works and you can see it here now it's available so if you'd like to do something with gate mate fpgs at the moment once again I believe it's the only fpga which allows to change the configuration from inside at runtime so it has to be a gate mate fpga but it doesn't have to be a risk five which is currently there it can be an open spark it can be meeps it can be your own cpu or no cpu at all all the freedom is yours uh it has not it doesn't have to be a c probably c was the worst choice and my first intention was to use forth so you can interactively just type something change configuration exercise it and I have it semi running but I'm not there enough to show it here at all otherwise it will take another 50 minutes just to see some blink in the led first maybe longer than that and of course uh yeah any language fourth luah micropython scheme leapscala whatever you can fit inside you can run it you can change the configuration from this those software and definitely software people know much more about different environments and how to make it attractive interest in fast and of course it doesn't have to be a this model but I hope that this model some weeks days weeks will be available for group gets for online orders currently you can get go to the project website and there is a link to github with all the designs design examples support software and from there there will be a new update link to the group gets campaign where you can order the modulus if you'd like something immediately available then it will be an get made evaluation kit from colon chip or trends model late on all the mix model so hardware is coming on now but as I said this model is already running for a year those models are just two months as they appear yes please uh light x system so in megan is running but not upstream yet so it's just a set of pages or something but yes that's oh sorry I haven't repeat the the question is if it's going to be supported by somehow high level tools me and megan megan light x fuse soak yes those are plans for that and if anybody is willing to contribute to it please everything is open feel free to do it because I'm doing it but it takes as I said five times longer than plant sometimes 10 times longer um I just repeated that everything is open so you are welcome to create a software running on FPGA create a software for FPGA development to design hardware with FPGAs to design hardware with FPGA modulus because it's much much easier and this one would be approachable even to people with near zero hardware design experience that makes a model much more attractive all the complex problems are already solved you just have to connect the power and some IOS and that's much easier than the rest of the system design and of course if you find some problems there are many of them please report send a feedback and share all your results it's open so not only open source software it's open source hardware as well thank you for your attention and any questions please okay I see two three four questions five but I afraid that I overused all the time please it appears that writing software is easy it appears that logic gates is easy you're like in the riddle where everything is probably very oh I wouldn't the question was that it looks like writing software is easier and connecting gates is easier and that the complexity starts in the beginning when all of them are mixed but I would only partially agree to that uh writing software is extremely difficult writing good software and complex software is amazingly complex problem uh working with the modern hardware connecting gates with modern technologies and everything is also getting a harder day today uh FPGAs are very complex beasts where those complexity are multiplied and that makes this area especially interesting attractive and as with many many other places you have all the innovations and places to grow where some things are intersecting intersecting and FPGAs are exactly the intersection of hardware and software okay thank you very much thank you
An introduction to Formal Verification of Digital Circuits
Thank you everyone. I'm Cesar. Sorry, those that can stay. Sorry. Sorry. And I will talk about the introduction to formal verifications of digital circuits, which I like very much and I hope you too. So why do I, why do you need the formal verification or what, why would you want to use formal verification? So it's like a tool for finding bugs and finding bugs is what you want to do, right? And simulation is another way to do it, to find bugs. And it's complementary, but in truth, I find I'm using more and more formal verification and not so much simulation. And it does help finding corner cases, which normal simulation you need to put many, many test vectors, random test vectors to be able to even try to find these bugs. Like you have a sequence, a bug which is trigger, maybe you forgot to reset a finite state machine, maybe. And if you don't test in that sequence, you will miss the bug. And the formal verification will find the bug. It's really great. So here is a comparison. What does traditional simulation does and what does the formal verification equivalent or near approximation is? So in traditional debugging simulation, you want, maybe you make a simulation a few steps. When to look at the result, the traces to see if it's going all right. The equivalent is the cover. It's not like a software coverage. No, it's something different. You tell what you want to see. So you don't drive the inputs. You let the formal verification engine drive the inputs. And you say, I want to see three transactions. Every time there is a transaction, a write or a read, I count and I want that count to be three. And the engine will go and do everything it can to find these three. An input will give you these three transactions. If it can't, it will complain, not able, not possible. What is the equivalent to unit test? In unit test, you put some test factors. And you actually assert the result. So it's automated. It's not manually. In formal verification, you have bounded model check. You have your model in a bounded, in a limited number of steps. I want to drive this model and simulate it and see if it reaches a bad state, an assertion. But it's not like an assertion. If I put input two, it needs to return four. It's not like this. You have to assert that given the inputs which are driven by the formal engine, what the output should be. It's not a fixed number, not a fixed test. Test fixture is when you don't start from the beginning. Maybe you have several tests which share some common starting point. And you start from there. The closest equivalent will be K induction, where the engine will try from any state to reach the bad state. Or it will prove that from no starting point, it will reach a bad ending point, where the assertions doesn't hold. So equivalent for random test cases, maybe you want a coverage, yes, for your system. You try hand only because generated by hand is not feasible. You try to reach a corner case, but you cannot guarantee. You cannot actually see every random, will not cover every from zero to infinity or two. In the formal case, it does an exhaustive search of the inputs. So it's like random. If a random test will fail, it will find it. Yes. And the other difference is traditional tests can be procedure like you count, you do for loops, you call functions, maybe. In formal, everything has to be synthesizable. Everything needs to be logic. There is no software in formal. It's all hardware. So if you want to verify something, you have to make a test harness to catch it around it. In traditional simulation, you have a test vectors. Maybe a sequence of vectors, which you want to see. And then in formal verification, you assume. So it's not completely random, completely free. The engine has to follow your rules. So like if you have a valid signal, it has to stay stable while there is no ready from the next stage. Well, you have to tell it if it's, if valid is up, then ready, it will stay up while ready isn't false in the previous stage. So you assume inputs and you assert outputs. That's the rule. In traditional simulation, you also do assertions for automated testing, right? So the workflow in formal verification should be like this. In your high level definition, like harder definition language, you put the assertions. Then it can be a very long, a VHDL or any other language like Python-based language, harder languages. Then you make a work plan in the SSBI tool, which drives the process. What processes? First, you'll see, you may be remembered for the free FPGA and ASIC tooling. It will take your HDL to code that you can put into FPGA. Well, you also synthesize instead of an FPGA, it synthesizes to logic functions. Which logic functions? Well, there is a state, which is a state of all your design registers, all your memory. There is an initial predicate which says, this state is initial, like the registers are zero, the system is the reset state. Then is the transition relation, and it takes a state to the next state, like a finite state machine, for instance. And then you have your assertions. All these are Boolean functions, like AND or NOT and NAND. And then, these other two uses as MTBMC proves correctness or outputs a trace. It does exactly a search for an initial path to the bad state, an exhaustive search. If the path is not found, the design is correct. And if found, it outputs an error trace. This is an example of a bad trace. So, you have an initial state where the predicate holds, and there is a transition from S0 to S1, where the conditions hold on S1. All the way to the final, the case state, where there is a failure here in the assertions. So, the engine takes all of this logic here and try all the inputs, so this one is true. And then it tells you the trace, which hitches this. So, the algorithm is, from zero steps, you prove the base case, which is no path from the initial state will reach a bad state in K steps. Like mathematical proof. This is your base case. Then, the induction step. No path ending into a bad state can be reached in K plus 1 states. So, if the base case holds, and the induction steps holds, then you prove it for infinite time. You can hope to simulate for infinite time, but here you can. Yes. Like, this is why we call it unbounded inductive proof. No bounds. With our finite proof, I can prove it for all time for any input. Okay. Well, but maybe, inductive case, I cannot prove it in K. We try K plus 1, K plus 2, etc. Well, not for infinite, it's guaranteed to end, because you have a finite states, a number of states. Your number of registers is finite, is limited, so it can be extremely huge, the state space, but it's finite. It's true to the power of the number of flip-flops you have, right? But it's guaranteed to terminate. And there are very clever algorithms, so you don't have to try every single combination. So, let's try starting simple, right? This is a register with feedback. No inputs, no output, just internal state. What is the state diagram of this? If it's 0, it will be 0, it starts 0 here, and it will keep being 0 for all time. It can never reach the one state if it starts from 0. So, we tell the engine, the formal proof engine, please prove that it will ever be 0. What he will search for paths which end here at 1. But there is no path. I can tell you, try one step, K equals 1. It will work because there is no path in one step which will reach actually in any number of steps, right? So, the base case holds, and the inductive steps also holds. Complicated a bit. Now we have a registered output. So, here my notation is, the first number here is R, which starts 0, and S will be a copy of R after 1 clock. So, it's also 0. If R was 1, then it will move to a copy of it in the next state, okay? And stay there. But you see there is no path to here. And if it starts 0 and 1 here, it will go to 0 and 0. It will copy, lose the 1 in the next step. It will copy the R. So, you can simulate the base case holds here. And the inductive case, well, just one, just K equals 1 is not enough here. Because from here to here, you can reach a bad state if you start from here. Remember, the inductive, the inductive, only the base case will start from the initial state. The inductive state, the inductive algorithm, the inductive step will start from, from, actually will start from the bad states. It will start from here, trying to reach a good state. So, it starts from here, finds it's here and says one step is not enough. So, we try two steps. With two steps, yeah, it cannot be done. We have two good steps, two good states and one bad state. No way. So, we have proven this case with two steps. No. So, well, let's complicate a bit. We have an enable now. The output can be 0, will not copy R, and then I enable it and it can be, it will be copied. So, enable, I put it here. So, it can be here or here. You see, I not label the transitions and the input is actually inside the state. I treat it as a state also. Why? Because the transition relation is between two states and maybe it's from the states and inputs, I get the next state. Well, I put the two together, the input and the state is the state. Okay. Because the transition relation is a relation, it's not a function of the inputs. So, you see from, if enable is 0, it will keep 0, 0, 0. Likewise, if enable is 0, it will keep where it is. If enable goes to 1, then, well, it goes to here. If enable goes to 1, yes, it will copy. The input will be copied on the output. Yes, here. Here you have, for instance, R is 0 and S is 1. And enable now is 1. So, in the next state, that R, the S will copy R. But you see, I'm asserting only that S must be 0. So, R, I don't care. I don't know, actually. It's inside the design. So, only these states are where S is 1 are bad. And all the states where S is 0 are good. Independent of R. So, you see, the base case is good with two steps. It will never reach. The base case will never reach these bad states from here. Never, ever. And it will never reach. And, but now, what about if I have a, I'm in a bad state here. Okay. So, I try one, two steps. Okay, I, let's, I'm not proving induction, I cannot prove induction in only two steps. You see, there is a path here. What is the path? R is 1, R is 0. Not enabled. Then I enable. So, it is copied. Now, R, S is 1. It's a bad state. So, in two states, I can reach a bad state. So, let's try three. What happens in three? Well, I just repeat this. One, two, then three. One, two, then three, and four. Yes. So, you see, this loop here. So, I can try K equals four, K equals five. Induction will never prove it. Induction will never work. Why? Because I'm allowing this loop here. So, in induction, you, you can't have loops, actually. Yes. Some engines will remove the loop for you. Loops actually don't matter because you're only repeating yourself. Right? If there is a bad trace and you put a loop to repeat it, something in the middle, it only makes the trace longer. So, trace loops doesn't matter. No matter. So, two ways to solve this. I break the loop. I say, if enable is zero, then in the next step, enable must be one. I put a logic around this. This is the assumptions. Okay? The other way to solve this is to look inside and say R should be equal always to S. It will not make the proof wrong, wrong, because you are only strengthening the, the, the assertions. You're not allowing anything more that you already have. So, if you say R must be equal to S, you are not doing anything wrong. So, what will happen is these two will gain the X. By the way, the X, I said, are the bad states, right? The X, you, you, you know it. So, now I put X here and X here. So, the induction only has this, this X is up there too. To see about. And then in K steps, it works, because there is no arrow from here, from, from these bad states, you can go to here, but, but, but not from here. So, induction will work if these two are bad states. Okay. So, this is the kind of thing you have to worry about in formal verification. Yes. Well, this is a flip flop with input. So, R will equal D after one and eight, one step. So, it goes from zero. It can start from here to here. And in next step, S will be equal to, to D, R will be equal to that, to D. So, it will be one on one. Or if it's zero, next D is zero, it will stay zero. So, but you actually cannot verify this. You cannot verify, like this, you cannot verify that D equals R. Because D will, R will equal D in the next clock. So, with this, you cannot assert anything. So, what you do, you put a test harness around it. So, you capture D in S. Now, S is extra logic, which you need to prove that the flip flop works. So, I have here D, R, and S. So, R, S is supposed to be equal to R, equal to zero. So, it's supposed to, or if D is one, then eventually both will get one. So, the good states are zero, zero in the end, R, and S, zero, zero, zero, zero, or one, one, one, one. What are the bad states? Zero, one, one, zero. One, zero, zero, one. So, yes, the induction will work here, because this bad state can actually be reached. So, with zero, zero, or one, it will already work. So, this is an instance then where you need extra logic, external logic to prove things. And this is, well, I will not talk about this one, I just put it to scare you. Okay? Too many arrows, too many states, okay? Okay. Okay, so this is what will be, this will look in high level definition, much high language. This is very log. So, we like usual, the simple one which has the feedback. So, R equals R. And then, you have this extra thing here which, if formal is definition, is defined, then you assert that R is false. Okay? Not R, or you could write here R equal equal zero. It's the same thing. Okay? And this one won't be present if you target an FPGA, for instance, only on formal. So, you write the symbiosis write file, you prove, you put into the mode prove, you can do also BMC, only the base case here, or induction, you can choose. Here's the decay that I talked about. It is one or ten, what you need to put here. Then, the nice thing about UOSIS and formal verification, the engine which proves that statement, the long logic statement, this one, the engine is standardized. This output is a standard, so you can put, you can choose the engine you want. You drive it from UOSIS, but what proves is actually EISIS or Z3 or anything else that assets the output logic description that UOSIS outputs. Okay? So, this is instructions from for UOSIS here, and there's one missing here, which it is implicit, which is write SMT, write the logic here. But here you put the number of, name of the VADILOCK file and the top level of your design and the files here. And then you run it and it says induction, trying step one, trying step zero, and trying base case, summary, pass, it pass, so, proof, successfully proven by K induction. And it passes. And you put this into your test suite, GitHub, something, GitHub, or not GitHub, sorry, if you can. Continuous integration test suite, yes, CI. This is an example with a Python-based language, so you put imports here. This is an engine, by the way. I put here two registers instead of one because it will optimize away if I put only one with feedback, so I put two with feedback. Okay? And S will equal run and R2 in the next cycle. If the enable is true. This is one with enable. And then you assert that S is false, always, because in the initial state is false, is zero. So it will all be zero. And that is what happens, forget this about a moment. It will fail induction. You remember, you can stay with enable false for infinity. That was a loop. So it outputs the failing case here. It will start in the case where R is one and S is zero. And I don't only care about S being zero, but S will be one here. How? So enable was false for a long time. Then in the last possible time, it raises enable to one and you are cooked. Yes, you failed. So you go back and do the things I said. You assume that R either enable is true or in the past enable was true. So you don't have a choice if you're not enable now, you'll be enabled next. So this, only this one will solve your problem. And the other one is this. You assert that the R, the internal state is zero, which is true. You are not making any more mistakes. You are only making stronger conditions. Now R must be false. Then you can prove it in five steps. And now it will pass. Like either one of these. This one, that's why I said this one is equivalent to gray box, maybe your design is not a black box. You need to see inside. There is an internal state which is not reached maybe. So this is how one way to put memories in the team. So test memories. So this is a memory. One way to test memories. You can have gigabyte of memory. But what you do is you only test one address, but which address, any address. And then if the address is that what we tested, it will be captured here. And if you read from it, it will show the capture data and the data need to be equal. So that's a way to test a memory. If it works for any one address, it will work for any address at all. So you can have gigabytes of memory or megabytes. That's the same for you. You won't simulate gigabytes of memory. It only simulates these registers and the memory location. One memory location. And this one is if you have a pipeline with streams, you can count the number of transactions coming in, the number of transactions coming out. And then you say the counts need to be compatible. You cannot have more coming in that coming out. That will mean you drop it, packets. Or if you have more out that coming in, then you are duplicating packets. So that's a way to test. And that's one. I'll skip. Okay. So thank you. Thank you. Questions. You have questions. I haven't till when. My talk ends in five minutes. So you can. Yes, please. In the example you've shown, I saw that the depth was manually specified. Wait. If you can wait a bit before leaving, I still have five minutes, please. So in the Python code you showed, I saw that the depth was manually specified. If the proof changes over time or you adjust your circuit, you need to also adjust the depth. Yes. And if the depth is insufficient, will the proof fail? Indeed. Well, the question was if you adjust the circuit, will the, will need more steps in the proof? Yes, definitely. And more importantly, if I change the circuit and I forget to change the depth, will the proof still pass if it doesn't have sufficient depth? If it has insufficient depth, surely we'll have a failure. Yes. Next question, maybe. No questions. Well, if you think about a question, please. I have five, some minutes left. Thank you. Okay. One more. How does this scan to big circuits? Like how much time does it take? Okay. Good question. Good question. So I thought, please come down, people. The question was if, how much time it takes, what the complexity is. So one problem with a final verification, you can never be sure. You can have a simple design which takes tens of minutes. You don't know why. You made a modification. It, one second. You want, because, and the other thing is solvers. Like I said, you can change the solver. Okay. If you change it, one will be fast. The other will be slow. If you change the circuit, vice versa. Okay. So it's non-deterministic, really. Yes. Good. Thank you. Sorry. I, I was a bit nervous earlier. I am actually a dev room manager for this session. So I, I'm partially giving instructions to people here. Okay. Thank you. So the audience thinks I'm, it was good. Perfect. Well, if there is no much. Okay. Excellent. Thank you very much. Okay. I'll finish then. Thank you very much. You can.
Verilog-AMS in Gnucap
Welcome to Paul and Felix. Please start. They will be talking about the value of AMS in new cap. You can start. Thank you. Okay. I'm Al Davis and this is Felix over here. We're doing this presentation on Veralog AMS and GNU cap. So let's see what we have here. Okay. Here's an outline of the talk. I'm going to give about half and Felix will give about half. So we're going to kind of swap places midway through. The outline, what is GNU cap? Some history. All about the architecture of the program and the plug-ins. What is Veralog AMS? Something that some people, a lot of people need to know. Model gen is our model compiler that generates, essentially generates C++ models from Veralog. And we'll talk about some of the features, what we've worked on for the last year in a roadmap. First of all, what is GNU cap? Initially, the project started a whole bunch of years ago when I wanted to do circuit simulation on a Trash 80. You remember those? It's an 8-bit computer that I had a big 4K of memory in the thing and I thought I could do circuit simulation on it and that's what got me started on it. And one characteristic of the Trash 80 is that they're not big enough to run SPICE. And so, I mean, quite far from big enough actually. But anyway, that was the original intent. And ultimately, as the project developed, the goal was to go beyond SPICE, not just to re-implement SPICE, but to go past it. And we have in some ways. So, as it stands now, very much changed from the original Fortran. It has a very nice plug-in-based architecture, written in C++ modular, and it's written in such a way as to encourage public participation in the code because anybody can write, anybody can make a plug-in. And you don't have to necessarily wait like in many projects for the project supervisor to accept it. As soon as you make a plug-in, it's available. History, actually it was about 1980 that I got started. Al Circuit Simulator was a grad school project, and that was the first one that was, that was the version that was, really had some features in it. Released under GPL in 1992. GNU Cap in 2001, a GNU project. They treat us pretty well. And around 2008 to 2010, we re-architected the whole thing to use plug-ins as opposed to everything compiled into a big blob, which has, which makes a real big difference in collaborative development. Beyond Spice, GNU Cap is an early, an early mixed signal simulator. The emphasis is on mixed signal, mixed analog and digital, and even mixed different kinds of analog. And the idea of implicit mixed mode was really the terminology I used at the time for what they later came to call connect modules, which the VARILOG spec came about 10, 15 years later than this. Fast Spice, it's actually, in a way, the original fast spice in the sense that it, the algorithms will do partial solutions. And if you have a circuit that's only half active, it'll kind of push the inactive side out of the way and just simulate the half that's active, and you get some better speed that way than you would with a, let's say, the straight out spice algorithm. And if you don't look too deeply, it actually looks like the spice algorithm. You may not notice the difference. And another area that GNU Cap has done some work with the time step control. In spice, usually, most people will actually specify a time step. You don't have to do that in GNU Cap. It's smart enough to figure all that out. And so the transient analysis time stepping works a bit better. Mixed mode. Oh, yes, digital techniques for analog. What I mean here is that we use a matrix solver that actually only, that solves pieces of the matrix without doing the whole thing. Yet from a distance, it looks like it's solving the whole thing. But in practice, when you're doing circuit simulation, a lot of times, a lot of parts of the circuit are latent in the sense that there's really nothing happening there. And if there's nothing happening, there's no need to spend any CPU time computing it. And so GNU Cap uses cues and a variety of techniques that you normally only find in digital simulation. But we use it on the analog side. And if it fits your circuit, it can actually help a lot in terms of speed. If it doesn't fit your circuit, let me back this up a bit. Since I... Okay, here we go. If it doesn't fit your circuit, then essentially you have spice. But so we have... For the analog, we have an event queue like you have in digital simulation. We have a matrix solver that can solve only little pieces of the matrix when we don't need the whole thing. Low-rank partial matrix solver, that's what I mean. Low-rank means that, hey, I got a matrix of 10,000 nodes, but only 30 of the nodes actually need to be solved. We can do that. Where it actually incrementally updates the matrix, and then so that you make an incremental update to the previous matrix. It looks like you did the whole thing, but you didn't. It only needs to update the little piece. And this gives us better speed with full spice accuracy. The time step control in GNU Cap, transient analysis time step. The simulator actually picks a time step. It's actually more automatic in GNU Cap than in other simulators, in particular because we look at things like cross events. Now, a cross event is where your signal crosses a certain threshold. And spice doesn't get that at all. It doesn't do anything with cross events, but we do. And it actually helps a lot in terms of getting the correct time step control, particularly in control of a scary thing that's known as trapezoidal ringing. It really controls that pretty well. And then also getting started. What I mean by getting started there is getting the simulator started. Algorithms to how do you start this? What time step do you use right at the beginning before you have any information? We do that. The software architecture, it's in C++, a big emphasis on the plugins. The plugins are very much a part of the system that they're not optional. Everything is a plugin in GNU Cap, the simulation algorithms, the commands, the device models. GNU Cap analog simulator has no built-in device models at all. All device models are plugins. And I mean all. Even the resistor is a plugin. Even the, let's call it, submodels, a blob that you might want to use to make a model, those are plugins. So that as it starts out without any plugins, there are no models at all. So basically from the viewpoint of development, you want to do models, you're doing plugins. And I've heard accusations that in a plugin-based system it creates a hierarchy of developers that whether you're working on the plugin or not, no, if you're doing models, you're always working on plugins, even us. And so the program, so GNU Cap actually consists of a main program, a library, and plugins, and some utility programs. Utility programs like, well, one that we're going to talk about a bit a little later is the model compiler. Model compiler takes Veralog code and translates that into C++ code that we use as a plugin. And basically ultimately we're leading towards doing the whole language of Veralog AMS. The library, the GNU Cap library contains a, it's basic stuff that you need. It's not commands or anything like that. The library has things like the matrix solver, a circuit database, IO, has an expression evaluator. And these, this is just a library of functions you can call, databases that you can use to make plugins where the real work is done. And another bit about why on the plugins is that the plugins actually enforce modularity. Modularity is supposed to be one of those features of coding that you're supposed to do to make good code. If you violate modularity rules, the concept of the plugins don't work because they have to be, one model is independent of another. If I have a resistor model, I can't, it can't know anything about any of the other models other than through its, other than through its intended interface. So that you, you can't put in these random go-tos and come-from's or whatever you want to call them to get from spot to spot because it violates the plugin scheme. It just, so the, the coding rules tend to be very strict when you're using plugins because it's necessary for the plugins to work and it actually helps everybody to do that. So the idea of collaboration, one thing about the plugins is that in terms of somebody might make a plugin to do something and submit it and say, hey, I'd like you to, I'm going to do a pull request. I want you to put this into your product. I have this code. And, and, and my response is that we don't have to because you have a plugin, you put it out and it's available. And so you can have, you can have your caducap installed on your, on your Linux box. And let's say you installed it from the Debian package or something like that. And then you, you can essentially make your own custom version by, by, by twiddling with the plugins and, and, and do that in spite of the fact that you have the main, main installation through your library manager or through, through the, let's say you're in a university, the, the, the, the computer staff has installed it there and, and every student can have their own little twisted version to make models or whatever you want. And, and, oh yes, I, on quality, one thing about sometimes when people are making code to do something, the quality isn't all that great. And I look, and I look at that and say, you know, well, gee, I'd like to make this available to everybody, but, but with plugins, it's not a problem. We make it available and it's optional. If you want it, you use it. If you don't want it, you don't use it. So what we have is, it basically opens up who can participate. Okay. How are, what is, what is a plugin? How do they work? Basically the, well, first of all, they're dynamically loaded. They're DL open extensions. They just use the, the normal system called DL open. They're standard shared object models like libraries. And that's really, that's really the, the, the essence of how, of what they are. The, the, the basis is C++ derived classes like, let's say, let's say I have a, a type of plugin which is a device model. There's a base class that defines what a device, device model is, defines the interface. And I derive a class from that and I got my specific model. I got my MOSFET. I got my, my motor or whatever I'm going to make a model of. And, and I, I just loaded and through the C++ derived classes mechanism, they're hooked in. The dispatcher is a way of registering a plugin that, let's say I have a resistor and I, I, I have it, I've written this plugin for, for my own special kind of resistor. We have something called a dispatcher so that when you load the plugin with DL open, it registers itself with the dispatcher. And so now it can be found, it can be used. And, and it gives you a nice seamless interface of how, how you would use a plugin. Just as if they were built in. So in spite of the fact that all of this stuff is really external to the program, it looks like it's built in. Oh yes, wrappers. Well, sometimes we get code models, whatever that are written for something else. And I might have a different interface. Like for instance, we're talking, we're talking about device models here. We have device models of MOSFETs and transistors and stuff like that. Transmission lines. Suppose I have this old model for a, let's say it's a JFET and I look in the old Spice 3F5 code and I say, oh, here's a JFET model. The wrapper, we have a wrapper that wraps the, wraps the Spice model and makes it look like a GNU-CAP model so we can use it as a plugin. And so the result of that is that in, in addition to being able to use models that are compiled for GNU-CAP, we can use models that were designed for Spice 3F5, JSPICE, which is a special version of Spice designed for Josephson junctions and so on. And, and, and we can use models that are designed for NG SPICE. We can use them all as plugins. And through, through this concept of wrappers. Plugins, what? Okay. I, I've been talking mostly about devices as plugins, but also all of the commands are plugins too. The AC analysis is a plugin, the transient analysis. Right down to the source languages. Plugins determine the form, the input and output format, the input format. So that one, one way, one thing that might want in a circuit simulator might want to read Spice files, Spice, Spice input files. So we have a plugin that reads Spice input files. We have another plugin that reads Veralog files. And I probably shouldn't say it, but spec, because Spectre is a proprietary simulator from Cadence, but we can also read Spectre files. And to give you a path out of, out of Spectre, if you happen to have that. Jita is a schematic editor. We can read in the Jita files as an input and simulate from them. Quixator, the Quix project, which has been kind of dormant. We can read some of those files too. And so the idea is to be able to use the plugin mechanism to import from import, export from other code that wasn't necessarily designed for GNU Cap. We can use it. The idea is to facilitate sharing here. Source languages, measurements. There's some measurement plugins too. There's actually about 10 different types of plugins that work with GNU Cap to do various things. Plugins, wrappers. Okay. The wrappers, the idea of the wrappers is that like, like let's say I have this C model from that was written for Spice 3F5, which incidentally is the way Berkeley still distributes their B-SIM 3 models. And B-SIM 3, they're still on C. But anyway, anyway, we can, there's a lot of models that are there that may not be the latest version, but they're still available and still of interest for archival purposes. The idea is that sometimes you're working on something, I'm not going to say, well, I don't want to use the current version of the model. I want to use one that's four years old. And the four-year-old one, that four-year-old model is written in C, and we can read those C models and use them as plugins for GNU Cap. And not only that, but as we're going to be getting to a little later with the model compiler, the way we like to do it today is to write the models in Veralog, and we can do that too. Leading on to the model compiler, the model compiler generates C++ code from a model description. The model description, wow. Okay. The model description is, yes, it's time to turn it over to Felix. The model description is written in Veralog AMS, and so I'm going to turn it over to Felix, and he's going to tell you about the model compiler and the work we've done recently. Can you hear me? Test, whatever. So the model compiler, I just wait for a start sign. Test, test. All right. So, yeah, other than Al giving a basic generic instruction from the top and historical stuff, I'll introduce some of the work we did last year on that NLNet grant, which we have an extension for already. So one of our plans is generate code for other simulators, which happens to be on that slide. And the model gen, the previous one read some format that Al came up with, like, 20, 25, 30 years ago, before Veralog even existed, and that needed a little bump, and we do that now. Veralog AMS, some of you know it, maybe some of you don't. It's a long project, and it is a common denominator for a lot of stuff which started up as digital simulation and verification tool, and it's an industry standard now. And the AMS extension built on top of the previous stuff, adding the conservative flow and conservative and signal flow disciplines that otherwise are only known from SPICE simulation at that time. And former SPICE devs have thought about that in detail, and I think they did a good job with that standard. Yeah, there's no free implementation of that before the one we are working on now, I guess, and if I'm wrong, let me know. The features are a bit more tricky because it adds stuff that was available already in the digital Veralog, like hierarchical modeling. But like computational efficiency in analog mixed signal simulation doesn't make sense if you can't describe these networks, but the language itself has features that make these optimizations that we will need even possible. That way we get true mixed signal, that's something Al has already explained, and we head towards system level analog, but still with analog signals in them. The current implementations are centered around this Veralog A, which started with ADMS, which built a SPICE kind of front and for, or a SPICE targeting model compiler in around 2000, and this is the one we are actually trying to replace. In the meantime, there's OpenVAF, and it has a simplified SPICE interface that builds binary blobs that are now loaded into, I think, NG SPICE, maybe SICE, but it removes stuff from the SPICE interface that we would need rather than add stuff that we already have. So that's not what we want to do, and with that project in 23-24, we kind of took over or take over or overtake these developments in terms of features. Generally, the standard allows analog compact modeling, which is a great feed which Vladek has been putting much time into, and hence we do have Veralog A compact models, which we can now use without that work that wouldn't be possible, and a model compiler without models is a harder start probably. So beyond that Veralog A, that's the MS in Veralog AMS, L has been tinkering with these ideas since the early days. The standard document we currently have is from 2014. I think there will be an update at some point. Maybe we will have to do our own, but it's a pretty stable standard, and it's surprisingly stable, and it makes a lot of sense when you study it, and I had to because I had to implement it. This Veralog AMS Veralog was born in 2023, and we released the first master release last month, and we've got this funding for 2024 to add more stuff, and I will go through the features that we have by now, and that's the overtaking bit, so we do support hierarchy. We have Paramset, which is an essential component of the Veralog AMS standard. We can compile them. We do have Binning, which is available in SPICE as well, but it's much more difficult to use, and it's not standardized. We have compliant sources, like Veralog has different types of sources for voltages and flows and currents, and switching sources and whatnot, and making these compliant with the standard was a bit more work than anticipated. But anyway, we support tolerances. Different types of the system can follow different tolerances. You can have a temperature, and you can have a low voltage and a high voltage, and they all have different tolerances, and the standard and our implementation accounts for that. We add the time control to the model generator side rather than leaving it to the simulator, and the extensibility, I'll get to that. I think I said a lot, and I've got not so much time, so I'll just skip to the examples today, which I want to highlight, because they are important. The Compiled Paramset means, in addition to the module overloading you get with Paramset, you say, I've got a component, and I want to build a new one, and these are my parameter overloads. So you put in that Paramset statement into your netlist or your file, and that generates that new possibly simplified model, and we don't need to deal with model cards or the syntax anymore. For example, here we have ranges, and that gives us a way to bin, and a way to reuse code as well, because code reuse, why is this here? I forgot, but we will get to code reuse anyway. So the second phase of Paramset is the pruning. So you take your Paramset, you take your model, you combine the two before you compile. That way, structures in the model, here this is a simple capacitor model, I'm printing here, it has the describing equation, is this DDT statement stuff here in the else branch, but it has an if branch as well, and under some conditions it works differently. We want to get rid of that, because it interferes with performance, so we put in a Paramset that doesn't set the IC parameter, and so this condition is never satisfied, and it's just pruned, and that is the stuff we send to GCC to compile. Well, in the capacitor example it's very simple, it's one line, but imagine you have a million instances of some device, it could be a transistor model with lots of lines, and it computes the same constant value in each iteration of the simulation again and again, because you didn't prune it, the simulator doesn't know about anything because the compiler has compiled it all in. Well, that's a lot of work, and if you load pruned models, what do you get? It runs faster, surprise, and you get the same result because whether you pre-compute the additions, the exponentiation, the logs, whatever, it doesn't matter, but the model is simpler, and as a corollary, compilation time doesn't matter. You can take as much time to compile stuff as you want and optimize stuff. On hierarchy, yeah, so we have compact models with these relational statements here, and we have contribution statements, well, math statements, which you could replace by just adding sub-device instances. Well, we can do that, sure, but we can also mix them. You can have this and that, and you just put it into one box, and whatever you have, that's the code reuse maybe, you just reuse the resistor you already implemented and add some capacitor to it, and you get your low-pass filter, and you should reuse models because, well, reusing models, you don't need to compile them again, and you have a smaller memory footprint at runtime. You've got one diode for all your 1700,000 transistors in your net list. You need to validate them once. You don't need to validate every single implementation of a diode, and yeah, that's a quote from the LRM, and it is about the extensibility. The LRM explicitly says, we want to enable extensions, and that's how I implemented it. So you go and implement your own functions in the end, and same rules apply then for core team developer, so here's the roadmap, and maybe I better leave it. Shall I continue for two more minutes? Okay. The roadmap is we've got three sub-projects in the project that we have already got funding for, and we need to work on the analog part a bit, but also we add logic modeling, so that would kind of round up the whole package a bit. And on the simulator side, I mean we only have this simulator currently as a main target, and Discipline's Nature's Connect Semantics is something defined in the Verilog AMS standard. It tells about, or it defines how to model interconnects between digital parts and analog parts of the circuit, and the simulator needs to evaluate some rules to place these gadgets that do the transformation between the disciplines, following the rules from the standard. And the third package is also important, because nobody else does it, we need to be able to interoperate across the boundaries of tools. For example, we need to be able to store a net list and send it to somebody else to be able to make use of it. And this is something we need to define this year, and also we need to target other simulators with our model compiler. Currently it writes models that only run with GNUcap, it will run models that also run with NG-Spice when we are lucky. We shall see which one we will pick, and the device wrappers will also be extended, so we will add more of the current work that is happening in the field, like pushing the NG-Spice support to the current version, maybe. We have the plug-in interface, everyone can help, and we also want to see a wish list, really, because if I see something is needed, somebody says, oh, do generate, generate is great. Well, that triggers me to read about generate. The standard is so many hundred pages long, I don't know every single aspect of it, and if something is needed, that makes me look at it, and I know, oh, generate is now at the top of my to-do list, for example, or somebody says, oh, there are these Laplace filters, and they have a nice formulation or a nice way of specifying linear filtering. That made me curious, and I implemented them. They are not ready for release, but they will be out very soon, or somebody says, I don't know, give me an example, and I look at it and implement it, because it makes me curious, that's how it works. Thank you very much. Thank you for your time, and sorry for the additional time I took. Thank you, Mr. O. Thank you very much.
FOSS CAD/EDA tools supporting the open access PDK initiative
Okay, welcome. I'm Wade Krabinski. Wade Krabinski, and you can start. Okay. Good morning. Oh, it's already a good afternoon. I'm here to introduce all the larger selection of FOSS Free Open Source software for IC design. And mainly all we believe can support local European open access PDK in the shot. This is what I'm presenting here. It's a little bit compatible or related to previous talk where Team Nukap did introduce all and then Felix introduced one of the circuit device level simulators. This is a huge collaborative work. If you feel I should include your name here, just let me know. I'm consultant to IHP. This is a research institute in Germany. And they are very first in Europe to join Open Access of Free PDK Initiative. René, Sergei, Christoph, Alexey, they are from IHP and they are personally involved in the Open Access PDK development providing technical engineering support. There are a couple of guys like Mario Pascal Marcos. They are working on Verilog compiler. There are teams working on the models, supporting Quax development. I already mentioned Al and Davies from Nukap. NGiSpy's team, by the way, they will present NGiSpy's tomorrow at Kaikad Devroom. We have independent contributors, high level, well recognized professor, Luca Benini, Boris Murman, also in the US, contributing to Open PDK. Mad Vent is very strong promoter of this initiative and last but not least, Tim Hansen from Google who was very first to push towards Open Access PDK. So I will start with brief motivation. Why Open PDK is so critical, not only here in Europe but at international level. Also in a certain sense as a platform to teach, educate younger generation of designers, both analog, RF, digital. I have referenced not really Open PDK but quite important in Shatif in India, the FOSI. A couple of examples of status of Open PDK around the globe. Then I will introduce all what IHP did to enable, open enable access to IC manufacturing and setting up complete tool chain of software packages to support design at different level. So static layout, verification, validation, because the process is more towards analog RF application. This is the level which we want to focus. A couple of other points, I hope it could be interactive, really understanding where we are and what are next steps, what are important task to undertake to not really consolidate by bring all enthusiasts, volunteers, hobbyists, designers around Open PDK initiative, making a roadmap and then understanding status, think about all the challenges and opportunities which this initiative bring to the design, IC design community. Motivation, I guess you are also following all about European chip-act. This is discussion of over the last two, three years it's coming into power, into the execution. But following even public hearings here in Brussels at European Parliament, such case as an attempt to draw more attention, bring younger generation of designers to our domain, wasn't address. And at one of the public hearings of chip-axe here at EU Parliament, here in Brussels, only Jodebock, the VP of IMAIC address this point, that we can have new fobs, new production line, but we need new designers to bring new chips, new product to these fobs. Fabs are not running by themselves, they need the design. Exactly a year ago, less than a month, a year ago he repeat this questions again pointing to the main topic, how to build talent and skills to support IC design, not only in Europe, IC international conference run in US, how to bring this talent and skills, how to enable access. And we believe that OpenPDK could be also teaching education platform to draw more attention and bring talents to our IC design domain. I'm showing this slide, it's not really OpenPDK, but it's... This is initiative, there was a group of volunteers in Mumbai at Technical University of Mumbai, and they set up really nice teaching educational environment, mainly using NG-SPICE and CHICA to introduce even teenagers to micro electronics, starting with very simple examples like Flip-flop, small generator blinking LEDs, and they were guiding them through this very simple design process with simulations, helping design even PCBs assembled boards. This runs for a couple of years, each year they have more and more contributors, they were talking about something like couple of thousands volunteers supported this initiative. It's not about design itself, but they create huge packages, libraries of tutorials, they call them spoken tutorials, mainly in English as you can guess in India, but they start to translate into other languages, they were talking about French, Spanish, opening this facility for teaching in education also in local native languages. It's teaching education platform, but they recognize importance of free open source tools, and we have these two cases of NG-SPICE and CHICA. OpenPDK was triggered by Tim Ansel in the US, it's running for a couple of years, they partner with Skywater, they open the FAP, they release PDKs as open source, FAPLES was quite instrumental, helping creating all the infrastructure, they built certain tool sets to help teenagers, students, young researchers, I see designers to complete design down to tape out and manufacturing. There is a list of links about available resources. Skywater as I mentioned was very first, Global Foundry joined a little bit later, open two more processes, all a little bit kind of legacy, 180, 30, I think Global Foundry going to 110 if I'm not mistaken. So it's really a unique place where you can design, chip both analog RF, digital, submit, tape out, manufacture, get even test board and complete all the design flow. In Europe, as I mentioned, IHP, it's a research R&D institute in Frankfurt on other, they recognize the importance of open PDK, and they, it's like middle of last year, the organ released, they PDK for 130 bicemos process, and it's high end process, the bipolar part here, BJTs are working at level of half a terahertz on silicon, so 500 gigahertz on silicon, it's unique process, they open infrastructure, they support multi-project quaffers on different level, academic development, early access for commercial partners, start-up, SMC, SMEs, up to the multi-project, complete chip integration. Opening PDK is one task, what is also critical to understand what is really available in open source domain to support design, and positive sense kind of consolidate all in this domain in particle from the developer point of view, bringing all the tools working together, having equivalent to commercial tools like cadence or synopsis. The workshop they organized last June was two days, even which brought tools developer, designers, and all who are supporting this initiative, maybe from the back it's difficult to read the links, my presentation is uploaded, you can get all updates or references on PDF which I uploaded. So this is what I already mentioned, they are very first in Europe that there is no other Wafers Fab R&D institution opening the PDK and helping to access multi-wafers projects to complete design and manufacturing, because it's by CMOS process they are trying to target analog RF applications, and of course opening this environment, a lot of questions how to make all the flow reliable, support. In Germany they are quite lucky, German Ministry of Education recognize importance of this initiative and provide financial support, so this from the basic entry up to multi-project wafers and fund services. Bottom line is always money, so this is initial step, we have ongoing this discussion here in Brazil with European Union to motivate them to support this initiative at European level as well. This is the status, we have Sergei Heinebeck, he is PDK manager supporting this initiative at IHP, all it's online on the IHP GitHub repository with project information, information about technology itself, devices, cells, all the layout information, what is also unique, they are not opening PDK and SPICE library, but you can also access all measurements data for semiconductor devices and passive components. This is also important for others who are working on parameter extraction, model validation, they have physical data to run all the checks. There was a decision to use K-Layout as main tool for GDS generations. We are extending this and the team which Sergei leads works on all the enhancement, so you have to really follow up the GitHub repository to get all updates and new information. There is still open discussion which tools we should put in the complete flow. I mentioned K-Layout for GDS tool, X-Hemp for symbolic entry, schematic entry. As the BISEMOS process targets RF application, modeling capacitive components, transmission lines, parallel inductors, integrated antenna, we can benefit using other open source tools like open EMS for electromagnetic simulation and device validation for RF applications. Digital flow, here it's a little bit abstract view, I'm not a digital designer, but all what is available here is well established for digital design, like complete open line, open road flow where designers they can start with high level abstract definition, RTL, behavioral, VHDL, Veriloc, go through this path, generate layout and submit new chip for production. Later on I have a couple of digital examples which are already manufactured, tested and we have working silicon. Again, all is uploaded so you want to capture slight but everything is online. This is commonly used flow, open line or open road, but in France we have alternative tool chain choreolies maintaining in France at Sorbonne University, Lipstick's department. Unfortunately they were late to provide a slight covering day tools and tool chain they developed for digital design. Again, references are in the slides so you can upload or check the slides and I guess it could be difficult to read from the back. This is what we want to have also for IHP process, open PDK, complete analog IC design flow. This is example from Stanford University where Professor Rihanna she set up all the tool chain for analog design and targeting sky water or global foundry PDK. So they have schematic entry, circuit level simulator, layout tool and all the post processing verification, validation tools, DRC, LVS and again layout editor for final check before submitting tool chip for manufacturing. Is silicon? They were supporting this and they provide standard pad frames where you can put your analog digital design. Caravan in fact is the pad frame for digital circuits. This is caravan for digital design. They have also caravan pad frame for analog design where this part is set of instrumentation like signal generator, scope, to measure your circuit without needs for external hardware to make final test after fabrication. In Europe there is a group at University of Linz, Professor Prettell and his team. They are providing seminal installer mainly as dockers where you can access these tools and set up your analog design flow. Again, Linz at the very bottom. This is a vision of IHP where they want to have complete analog RF open source design flow. This is what Sergei prepared. We are targeting Quax, it's open source simulator where it has reasonable and well established schematic entry editor. This is a new Quax S and S stands for spies. Quax can drive spies compatible tools, NG spies or ZYS. Then, yes, so Quax is the front end, schematic entry in the background. We have NG spies. Then for layout, we are working directly, IHP works directly with K layout developers to make sure that all verification tools and options are available in K layout. Including DRC, LBS, other verification tools. I've mentioned about Open EMS, Electromagnetic Simulator. This is a quite important extension to K layout where if you are RF designer, probably you have some RF components, transmission lines, interconnect, spiral inductors, embedded antenna, and not all models are available in PDK. So EMS helps to simulate these models. As a result, you are getting set of S parameters which you can plug into your simulator and continue RF high-frequency design. Parasitic extraction, it's critical. You have to really understand how your design will operate at not only in gender speaking, how analog design will operate when it's prepared for manufacturing, including all parasitic effects. Of course, it's critical for RF too, but this is still on the discussion. And then you do post-layout simulation, prepare the flies, and submit for tape out using IHP multi-project wafers. There is a lot of discussion if this flow could be part of Euro practice. Euro practice is the shell organization which provides tools and access to different technology, so that the alternative path would be complete open source design tools till GDS tape out, and they would help assembling multi-project wafers submitting to funders. Just to introduce even teenagers or students at bachelor level to IC design layouts, there is a couple of really nice tools. This is Maud Van Silvitz where you can place their predefined layout of simple components starting with resistor or transistor. There is CMOS inverter. You are looking at layout not only to the representation, but this black line guide you for the cross section. So you can see the topology of your circuits. And from the layout, they can also generate very basic SPICE input file. And in this case, we have also the SPICE input for inverter. There is a bug. If you spot the bug, let me know. So this small challenge maybe for students. Okay, we go further. These two tools are part of design flows which are well established in the U.S. I already gave you an example of Stanford using X-HEM for schematic entry, and X-HEM was already set up for SkyWater Global Foundry, and they did some work also for IHP by CMOS PDK as well. For layout and all the tools, verification tools, layout versus schematic, DRC post-layout extraction, magic, it's still a core tool. It's legacy tool, but it's maintained and well accepted by open source design community. In IHP, they decide to go for K-layout as main layout tools. And I mentioned about work in progress, expanding and enhancing this tool to make this fully compatible with the design flow. For schematic entry, we are still working with Quax Team, which has a really nice schematic editor and well established interface for spies or ng-spies, or generic speaking, spies-free simulation level, and this library available, spies library available in IHP PDK. There are some other teams in open source domain like this revolution EDA, and they are also working on schematic editors. They want to also add layout editor and set of the verification validation tools. And again, if you want to learn more about other tools, there are even videos introducing the tools which are available for design. Yes, maybe a few words about EMS. It's 3D electromagnetic simulator, and they start with targeting mainly high-frequency antenna design, and this tool can be also used for modeling simulations of IC components, mainly passive transmission lines of spiral inductors. And this tool, the IHP team who works with K-Lout and Open EMS to integrate and make this smooth flow from the layout to the generation of 3D models for numerical simulation to support RF device modeling. As I mentioned, the eventual output of the models are as parameters, which you can simply fetch and add to your transistor level simulation to validate your analog RF circuits. Yes, this is a snapshot of all different antennas which EMS can model and simulate and generate a resulting as parameters. Unfortunately, I was not able to bring a case of integrated elements from ICs, transmission lines of spiral inductors, integrated antenna, but this work is in progress, and again, the main Open EMS website is referenced below. The talk before was also discussing all important enhancement to standard SPICE-free simulator. GNU-CAP is an interesting alternative, and they are working on their model compiler. As we are targeting NG SPICE as main transistor level simulation tool, we have OpenVaf, which is a new, true, very low-A compiler, not as it was before, where we had other tools which were generating model code, C++ code, which had to be compiled and linked with the model. Very low-A generator, the dynamic library, there is an extension to NG SPICE, which accepts new models as dynamic libraries and allows you to simulate with non-standard models which are not available in standard SPICE-free or NG SPICE. This compiler takes care about all important elements of the model, so it includes the currents, charges, the capacitances, and newest extension, and enhancement is also supporting noise analysis for semiconductor device models. This is fully integrated with the complete PDK flow, which we have... We have references, and there is a pointer to today's presentation by Arlen, Felix, and this eventually could merge with even better solution to bring new SPICE models to GNU-CAP or NG SPICE in the future. This is the case where we are illustrating implementation of new model into Quax. The left-hand side is a snapshot of Quax schematic entry, where you define your small sub-circuit. It's a dummy example, just a single transistor. Above you see all the tiny fields where you reference to PDK libraries, because the model for transistor is not standard. You have to also give a pointer to dynamic library, which refers to MOSFET model. In this case, it's PSP model, advanced MOSFET model for transistor level simulation. With one click, you are getting results. Single transistor, simple output characteristic of MOSFET device, which is defined in the open PDK in the SPICE library of IHP Open PDK. There is a lot of resources, main pointer for Quax as is here at the bottom of the page, as references for all slides I'm presenting. Tomorrow, there is quite interesting dev room. They will discuss KiCut PCB design tools, and part of the KiCut is also nice schematic entry. Behind schematic, there is fully integrated NG SPICE to support circuit verification before completing PCB design. You can find pointer information about KiCut, DeFrom, and Holger presentation, who will talk about NG SPICE and complete integration in the KiCut design flow. Almost immediately after opening IHP PDK, it's drawn a lot of attention, and this is something I would not expect, but Professor Mourmann, he is now at the University of Hawaii, and here at close look at IHP PDK and prepare a series of classes to teach his student, giving a real example. Again, he set up in this case, it's one of the classes, simple case of modeling or simulating MOSFET device, with simple schematic example, and all the settings of control cards to run SPICE simulation. And this is a part of the teaching classes. So this is also what we would like to see, the open PDKs as the teaching education platform. Not only somewhere in the middle of nowhere in Hawaii, but in particular here in Europe. Now, a couple of examples. So all what we are presenting, it's last half a year, but it immediately draws a lot of attention with some pre-announcements. And here it's one of the real silicon which was submitted for tape out. Team at ETH in Zurich, Technical University in Zurich, they design a digital chip. It's the RISC 64-bit RISC processor implementation. And to complete design, they were using OpenRoute. The very first implementation was already presented at Free Silicon Conference last summer. If you would have a chance, I would strongly recommend to also join and eventually contribute to upcoming Free Silicon Conference, which will be organized later this year at Sorbonne University in Paris. We have Thomas, he's very active at Free Silicon. He can give you updates about organization status of the conference. So this, I'm not the digital design, but you can learn a little bit more about status of this tape out on the website. And the silicon should be coming very, very soon. So you can capture and deliver a similar presentation in China. All of Rome are with mobile phones making screenshots. Okay, this is internal IHP design, which was also completed using Open Source PDK and the tools which we are trying to integrate. And it's not only the layout or tape out. There is a small photo of real Sekit, which is working, was measured and qualified. Okay, the links, you probably have to click to get all updates on this project. This is not directly implementation, Sekit implementation of design using IHP. Professor Prettl and his team at the University of Linz in Austria, they make a couple of digital designs. And they start working all these exercises with sky water. But now they are in progress of like transferring the design and preparing new tape out using IHP by CMOS process. And they should be ready soon with a tape out. And all these exercises help us to improve the flow. In particle analog part, we still need a lot of integration between different tools in this flow. There is other paper by Professor Prettl and his team. So it's open access paper. So you can learn more about all the digital flow they are using to create or prepare the chip for tape out. So again, X-hem for schematic, magic for layout. There was some presentation today, earlier today about Yossi system. All what we can have in open line, open route going down to all the files you need to complete layout and prepare tape out. So there is a lot of documentation in a certain sense guiding you for the process, design process and helping you also setting up the tools on your side. So now we are coming to the point that we have really a lot of available. Again, FOSSI, ECME initiative, it's not open PDK, but they created platform to teach and to educate teenagers, beginners, students and maybe young injures to expose them to integrated circuit design. Maybe on PCB level, but this is important point. Other groups and organization like IEEE in Particle Solid State Circuit Society, they financially support small group students enabling access to IEEE. Please, US organization enabling access initially to sky water and global fund that it PDKs. And they are running hackathons and competitions for I think for second class tape out. I was surprised that most of the design were coming from Pakistan. So this draws attention truly internationally. Of course, it's a little bit easier in US because they have huge sponsors as Google. If FABLIS, they help them to manage designs creating all this reference, pad frames or analog design, pad frames or analog and digital circuits, which then manufacture always sky water or global fund. There is initial work done also in Japan. They call this minimal FABL and they are opening quite legacy process, but they are recognizing the critical importance of this again for teaching education. And Europe, we are always behind and we have only one R&D wafer FAB, which opened the process as open PDK for circuit design. Risk five, this group is huge. I think this could be a reference for our activities. They did excellent job bringing CPU to public to open source and there are plenty of tools supporting this digital design. In case of our IHP open PDK, the target is analog RF, but of course we can also have digital chip and there were a couple of examples already. So what we want to have cooperation, I mean this is the key point. We want to help others to access PDK and removing this kind of legal barrier because it's open PDK. You do not need to go through legal procedure, sign NDAs or have quite restrictive in many cases ED licenses. We need more contribution to showcase the advantages of open PDK. Within the last half a year there is a huge response and we have a couple of final complete integration, but it's coming from academia. I hope this is our plan. We want to show that this initiative has a huge commercial impact, mainly targeting smaller team, spin-off, start-up or SMEs. All above, open PDK, NDAs, low level of free EDA tools can help them complete design and bring a piece of silicon to the market. I think I touched the wrong connector. So this is the list of challenges. Yes? The recording is stopped. Ah, I'm run. Okay, okay. So let, I let you read and we create the base. It's initial step and there are a lot of tasks to complete to make this smooth flow from schematic entry, layouts, verification tools up to tape out and final product. We probably should take this offline and continue discussion. If there is any input critical, constructive, I would be more than happy to collect all the information. Closing. Yes, without IHP, all what I'm presenting would be impossible. Seger represent IHP. If you have any technical question, I would recommend to talk to Sergey. This was also recognized in Germany. So there is financial support because the bottom line is money. Software developer designers, they need a couple of euros to continue their activities and we appreciate all the financial support at the moment from the German. They have this research my electronics in Germany and federal means of higher education and research. They put a lot of money. Now it's question will we manage to bring this to a European level and thinking about something European level research project to continue this. Activities and financial support all in open source domain to develop design. Of course, in certain sense, manufacturing. So with this, I will close my presentation. There are a couple of other events where we will be discussing presenting this. You can read and again. We have Thomas. He's coordinating free Silicon Conference. You can talk to Thomas. This will be a central event where we'll discuss open PDK, complete design flow, analog, RF digital, accessing open PDK. Open by IHP. Hope that other European funders will join this and in shot event we will have broader selection. Okay, by this, I will close. I guess I over it. If you have any question, I'm staying. There is Thomas and Sergey. We are ready to open your answer. Any question?
Using the ECP5 for Libre-SOC prototyping
I can try this with high school, because of why? Not to anything? Are you press a five? Second screen? No, it doesn't work on my laptop. I don't know why. This is not on the screen, you say? You are looping it? Yes, so it's better when I use your laptop for presenting the slides. And then switch when I'm at the end. Because I want to start now. So in the last few years I have been working on the Libre Sock FPGA prototype. Using the orange crap and what I did is porting the existing LS2 to the orange crap and began investigating why DRAM doesn't work. Both the ECP5 and the IS-40 FPGAs have Libre toolshelves and I have various FPGA ports including the orange crap which I use for Libre Sock prototyping. And there is also a micro-word which already supports the Libre Sock. So at the end of my presentations I might be able to do a live demo. And like DRAM that uses the original mig but we have switched to N-mig. So the next generation of the Milky Misc generator in that one doesn't fit into LS2. So I was unable to rebuild the original from the micro-word and decided that I would continue working on N-mig based DRAM. And I found some very good bugs here. ECP5 is big enough for prototyping the Libre Sock core and when I started I was already able to run cold boot but there was no DRAM so I began modifying N-mig boards and the clerser pins needed to connect. And the reason why I am doing this is I wanted to design a GPU that is even ready for VR. That is my motivation I have with LS2 and its current GPU needs non-free firmware and on the long term I wanted to avoid this. And I also mentioned that the i40 FPGA is used by Valve and Bitgrace for example in the Valve Index. Now the question is why do we use N-mig in DRAM? What we are using with DRAM itself is an N-mig port of Lambda Sock and N-mig is already used by Libre Sock project and we don't want to have multiple toolkits so everything gets ported to N-mig and we took parts of micro-word and ported them to N-mig and the same goes for most of the other things that we want to port them to N-mig. And the old mig which had some design weaknesses we want to avoid and we also want to avoid LightX which provides a huge collection of libraries and even software. And in N-mig we don't have all those features at the moment. N-mig is much more powerful than Varylock and VHDL. Actually it is a Varylock generator and it is much easier to use for anyone who knows Python so you don't have to learn another VHDL if you want to contribute to Libre Sock. And N-mig of course it works nice with Yosos, Next, PNR and GCC and all those things. And it is also used by DRAM. DRAM is a simplified RAM controller. It currently only supports ECP-5 but in the future we might want to support I-40 and those heavier Xilinx models which provide even more cells so you can design more complex designs. And I wanted to learn how to use a DRAM file that comes with the ECP-5. I found that there was already some software hardware designs, MicroVot that support booting Linux using the MicroVot and I wanted to do the same with the Libre Sock Core but I didn't get there because the DRAM isn't currently working. There are multiple generations of DRAM, for example DDR. As DRAM interfaces DDR4 for the Power 9, DDR5 for the Power 10 and on the OrangeCrab we only have DDR3 and it is small but large enough to boot Linux. And for the OrangeCrab pins we have Migen boards. I made the changes myself. And the controllers that we can use are found in both DRAM and in LiteDRAM so I have compared those controllers. They are very similar but there are changes that might be important. And I also compared software that comes with both. And if it does not work debugging is hard so I am going to look for ways to connect this to a host computer and for example run software from an external emulated memory. And here we have the ECP-5 DRAM controller. It has high speed I.O. interfaces with many built-in blocks and one of them is the data Q-throop. Buffer manager is a very complex module and it handles all the timings that you need for DRAM and those modules are FPGA specific. And we have Python implementations of both files so for DRAM and for DDR and those modules are very similar. Here is a typical module so you have to connect data lines, address lines and few address lines and clocks. And then we have a burst-dead signal that is used for read leveling so our software checks whether this burst-dead signal is asserted. And the data valid signal controls the DFI phases. And on the input size we have a pause signal which we have to assert if we change the delay. And then we use the read clock cell to set up the delay. So only with a certain kind of delay we can get everything working properly and we have to brute force which delay is the best one. Then we have a library called libgiram which we use to initialize this DRAM. And here we have to provide a context and we give it some base addresses and the profile for example includes the delays. And first we set the FITO software control then perform some init sequence. I think that it hasn't changed. Then we load the calibration and then we turn it back again and ideally it should work but I found out that it doesn't work that way. The MAM test doesn't pass or it sometimes only passes for some addresses and for some addresses it doesn't pass. So it seems that the data that I read back is corrupted. And that is a problem related to read leveling only. Read leveling on the ECP5 we only use read leveling and that has to be done for each file module. And there is an inner loop for the bitflip so we do tests for different combinations. We do a test for each read window and then try and use that returns a score and then we try to find the minimum and maximum delays. And then we take the bitflip with the best score and in this example only one of the bitflip values works. And the whole bitflip thing that isn't currently implemented in DRAM only in lite DRAM. And here we can see the read leveling for both files and we see that only the first bitflip works that we have the same settings for each file and that there are three working delays. And then we use the delay that is in the middle and once the delay is set we do a speed test that is currently not implemented but I'm going to implement that zone and then it copies the Linux kernel to the flash and then it puts into Linux. That takes longer time because it has to be decompressed. And then everything Linux kernel in a DRAM FS comes from the flash module that also holds the bit string for the FPGAs. First of all the FPGAs configured. Then I log in as root of course we don't have network that's one of the things I plan to share the BeagleBone Blacks network and then theoretically being able to SCP files to the orange grab. But I think that will be much more work but I have a Beagle wire, a small FPGAs so one of them will be used as a debugging aid and the other one will run Leprechaun. And I think it might be a good idea combining those because I have both boards and if I can use both for my work that seems to be a better option. And of course if I enter arch it shows that it is a power PC architecture and so now.
Welcome to the LLVM dev room
Welcome everyone to the LLVM Dev Room. I hope the microphone is working. This year we have three organizers. We'd just like to very briefly introduce ourselves. My name is Christoph Bales. My name is Peter Smith. And my name is Marius Brila. We thought we'd use the first five minutes to give a little bit of general information. It's an anniversary this year. This is the 10th LLVM Dev Room. The first one got started in 2014. Every year we were here, except for 2021, we couldn't find volunteers to organize. And there's been quite a few different people who helped with the organization over the years. I've put in a few names on the slides. Not going to call them out. And I'm pretty sure I probably forgot someone. My apologies. This year is the first time. There's also a GC Dev Room. And I'm very happy that we're running it back to back. So I'm hoping that enables some cross-pollination of IDs across the two communities. So that is very nice to see. Maybe a few words on if you're interested in participating in the LLVM project that you're not entirely sure exactly where to start or if you're a newcomer. I've put a few links here on the slides. I'll very briefly go over them. Most of the communication in the LLVM project happens on this course, which is a forum or discord. If you want to have the links, go to the FOSDEM schedule page. You can download the slides there and just click on the links there. The LLVM project has office hours and online sync-ups. Office hours, it's where an individual expert on something in the LLVM makes themselves available on a regular schedule. You can dial in and any question goes as long as it's on topic. You can just follow the link as I think about a dozen different experts who volunteer to do that. If you're an expert yourself and you think this is a good idea, please consider volunteering some of your time, too. Online sync-ups are regular sync-ups, simply on a very specific topic. They're also all documented on the website. We have a community calendar. I have a screenshot on the left there. You can't read what's in there, but it just gives an indication of pretty much any day of the week. There's at least something going on where people, sometimes on a specific topic, can come together to have an interactive discussion. All the ways to get started is have a look at getting started issues in the issue tracker. This morning, I were 148 open. We're now three hours later, so I'm not sure if that count is exactly correct still. There's a link getting involved with HTML, which gives you lots of starters on the technical details. LLVM does take part in Google Summer of Code, also in Outreachy. If you would like to work on LLVM and get paid for it, there's always quite a few different companies looking, having job openings to work on LLVM. That's all.
Linker Scripts in LLD and how they compare with GNU ld
This is this talk is about linker scripts and some of the some of the ways they differ between GNU LD, LLD. There are some bits where I've kind of bent that definition a bit and gone through the sort of the differences in the internals, sorry not the internals, as in the internal linker script because at least with some linkers when you when you say you're not using a linker script you are it's just a linker's provided it for you in the background. So first slide is basically just some basics so that you can understand what I'm going to be talking about for the rest of the talk. Apologies if you're already familiar with ELF and linker scripts this will be a bit boring but just very very quickly linker's job is to take input sections that you would have in your sort of ELF file normally your dot text your dot data dot bss which is sort of zero initialized stuff and it will combine them together into one bigger big blob and those are sort of then called output sections so I will use the term input sections for stuff coming from your object file, output sections what linkers combined together and then these will end up in program segments in your ELF file and then basically your operating system or kind of copy will operate on a program segment. Right so linker scripts sort of I guess more formally called linker control scripts it's kind of like a domain specific language that the linker uses I guess most the majority of the commands are to do with image layout you know where you map input sections to output sections there are a few additional commands as well like for example some of the commands are load more files and you might actually be surprised to know at least some systems your libc is actually a linker script but it's a linker script that loads the actual files behind the scenes to make sure you get them in the right order. Yes some details on the command line. GnuLD has a built-in linker script so even though and you can actually dump this with minus minus verbose if you're actually interested in the horror of what the new internal linker script is but LLD and gold and assuming mold as well basically have an internal they don't use an internal textual DSL scripts they kind of mimic it using command line language that type of thing or just basically hard code things. So one interesting thing this is just I guess not related to LLD or GnuLD but if you use dash t or which is the short form for dash dash script the script that you provide will replace the internal linker script but you can actually just put the script on the command line as if it were an object file and that won't replace the linker script it will add to it so you could basically add various different fragments that type of thing. Anyway here's an example of a linker script so you can see what sort of things there this is one from a very very stripped down one from an embedded system. I say I've used embedded systems kind of for the linker scripts because generally if you're say linking on a in-user space and Linux or whatever you really don't need to use a linker script for most of the time and the general advice is if you don't need to touch linker scripts don't touch them. So memory command at the top that's basically laying out where your various memories are on the embedded system you might have different properties like one might be flash one might be RAM that type of thing. You have these things called input sections descriptions which are that star dot text star that's the sort of things that linker's going to filter against so when your input dot section it will match against that dot text there. You have symbol definitions that you can put down that dot in the next of the underscore in the EXE IDX start is called the location counter and that's basically what will mean the linker will fill in with the address that was there at the time so basically at the end of dot text there will be certain amount of addresses space being used so at the end of that output section that value will get put into the symbol there so that your program can basically introspect itself by using these symbols. Have built-in functions for example align and these sort of arrow flash and at flash those are sort of ways of assigning things to memory regions and that can become important for other things we'll do later on. Anyway, GnuLD and LLD linker script handling so yeah as it's been mentioned in the GnuTalk this morning there's no specification for linker scripts the closest we have is the linker script manual in the Gnu documentation some parts are under specified some parts are implementation defined GnuLD and LLD are also moving targets so even if you did decide to basically reverse engineer the source code there would be no guarantee that by the next release it would be the same thing. So yeah so generally LLD will try and keep as close to the specification as possible it has made a design decision to differ in a few cases where there's been some I guess odd behavior accumulated over time I'd say these are not well specified languages that have been sort of gone through a programming committee they are accumulations of you know I wouldn't necessarily say hacks over time but it has been developed over the course of 30 years and it's accumulated a lot of rubbish. Okay so often placement so this is one of the areas where GnuLD and LLD differ slightly but they give you roughly the same results so as we sort of went back to that previous linker script and there were only fragments and it wasn't a complete specification of where all of the sections go so linker scripts do not have to be complete and you could only actually need to give a partial description and the linker basically if any section doesn't match any of those input section descriptions it's called an orphan and then it's basically the link the manual says it is up to the linker to place the orphans so basically it can place them where it tries to do something that's relatively sensible so you can if you're concerned about that I want to know what the link has done there is this this thing called orphan placement and it can tell you where things are that type of thing and then there's also an option called dash dash unique where if you want to get if you don't want the linker to try and mess about with combining your orphans together it will just put them all in their own individual sections that type of thing. Okay so here's an example of how a linker might place orphans what it tends to do is it tries to match the properties of the section so like for example you've got an executable section on the assembly code there you have the AX but for that means SHF ALOC executable A that would be read only AW right that sort of thing prog bits there's something in the file no bits that's runtime initialized zero to initialized data and the linker basically says okay what have I already got in my linker script well I've already got a dot tech section that's good executable so I'll place the one the orphan with a similar name afterwards that type of thing so that's the sort of thing that would do one of the interesting cases that we'll get to is where does it place when there's already symbol assignments so linker's got to be very careful to try not to break someone so carefully place symbol assignment so here's just a very very quick textual detail for some of the things that I've said there and in particular the example I've got at the bottom here you've got this last one dot foo there's a section called bar and then someone's advanced the location counter on a thousand so if the linker insert it says oh dot foo I can place that in the section dot foo but where does it place it does it place it after bar does it place it after the dot and the rule that the linkers take is it always puts it after any of the various expressions that it's done there because in general this is this is where programmers might say I want section start section end and if you insert something in the middle of that then you might have broken someone's program who's been to say try to make their own table of pointers they're iterating through that type of thing okay so here's an example of where GNU LD and LLD differ and this is actually fairly simple one it's actually quite hard to get them to differ in most cases but LLD in its default linker script prefers to place read only sections before executable sections GNU LD has the opposite and will place in read only after executable so if there's no read only in the linker script the link has got no information of which to say ah here's my anchor to place it afterwards so they will make different choices and there was a bug reporter about about this saying of the linker did something different but yeah it's one of those unknown difference yeah another thing that this is more of a curiosity with LLD and it's just something that I see when people port programs from another operating system is quite often someone will forget the A and if you forget the A that's essentially telling the linker that this section is not part of the program it's like a debug section that's metadata so now it does turn out that GNU LD and LD will place the orphan at the same place but LLD unfortunately uses that as an anchor point for all of the other sections so if you put in that particular case bar will get inserted after in bar there but then all of the debug sections will get put after it because that's suddenly the anchor point for all the no-alarm sections which is a bit of a curiosity I think at that point so yeah main thing if you ever pointing a program from GNU and something weird goes on check your assembler and the chances are you forgot to put an A flag on one of your sections okay okay so program header generations this is somewhere where I'm going away from linker script and veering much more back to the user space area so this is basically trying to explain some of the differences between separate code and no separate code behavior on LLD and GNU LD so this is a an elf program header so this is what describes a segment and the most important ones that you need to look at here are p offset which is the offset in the file p viadra which is the basically the virtual address that the thing will be loaded at and the p align which is a very very very strange very strange thing congruent to p offset modulo p a lot a line and this is a I guess you could almost call it it almost seems like a trick and to allow basically the same page in the file to get mapped in two different places in virtual memory and that can save some physical memory yeah okay so in what I call that system five by system five think of that as something like Linux or or BSD that type of thing now this elf file is actually memory mapped using various M map calls into memory this is actually quite different from an embedded system because an embedded system you probably wouldn't even load the elf file anyway you would basically obj copy the load bits out and then you would have some you know bit of initialization code copy from various bits to where they needed to be so they're actually in some ways even though I guess the linker scripts were designed before elf elf is not really well designed for embedded systems because you kind of almost misusing elf to make it work for embedded systems in a lot of cases but anyway I'll go from here okay so the reason I'm mentioning program headers here is that you can be very explicit in your linker script and use the p headers command but most of the time you actually want the linker to generate these things for you because if you get it wrong then the program just won't work so for a typical sort of link the the linker's gonna look for this thing called the VMA to LMA offset which is basically LMA being the load address and this is really only important in embedded systems where you for example want your load address to be in flash but your execution address to be in memory that type of thing so if that offset changes the link will change the program header you typically want all of your non zero initialized thing before the zero initialized thing because that's the only way that elf and elf program header can describe it and of course if you're changing properties from like RO to RW that's that whilst you couldn't in theory merge them you generally don't want executable RW in most systems okay so here's just a graphical example of some of the things that I'm making it's quite complicated diagram but this is where you get this alignment coming in in that if you think of your text segment there so I've deliberately done done that so it is just a bit smaller than the memory page and I'm using a 64k page here so you've got the data segment that is not aligned to a page boundary in the file if it was there'd be a big gap in it filled with zeros so what the what the operating system actually does is it double maps that particular bits you end up with the text segment mapped basically and part of the data segment mapped to the first page read only and then you have the second bit mapped to read write into two separate pages there so we've actually wasted one page of virtual memory but we've saved one physical memory page now the interesting bit for this is that the mapping for the read write is copy on write so you can't write to four one thousand and write into the read only execute bit but what it does permit you to do is to basically read towards the end of the execute thing and you'll actually be reading into the data segment read only now in theory if you've not protected your program and this probably doesn't matter that much but if you have hardened your program against rock and chop attacks there could be potentially gadgets in the read only data that if someone manages to redirect control they can find more gadgets in that same so this is section called Z separate code where it will sort of basically make sure that the read only executable is separated by page by pages so you don't ever get this double mapping from there and as you can see for the GNU LD layout it's got some execute then more read only and actually that can waste you quite a lot of pages in a small system particularly in a 64 or something like that where you've got a 64k basic page so there are control things for that so if you use no Z no separate code then you end up with them tightly packed like I had before so quite often various district just rows will choose different values of Z no separate code but if you do find hey all my binary sizes have suddenly got bigger it might be because of this Z separate code now GNU LD does something slightly different to that in that because it normally preferred read only before read only sorry read only executable before sorry read only no executable before read only executable it didn't quite have that triple or a sandwich of read only executable between the read onlys so by default LLD put would give you three program layout okay I need to speed up a little bit here but anyway that's just one example of differences in memory layout even without link script case okay so program segments and embedded systems so mentioned before you're on you had this arrow to flash and at so this is how you would basically arrange it so that your execution address for your data is actually in RAM but your load address your LMA is in in flash and that and then some program will actually go and copy the stuff from the flash contents into the RAM contents so reason I'm mentioning this is that there are some slight differences between GNU LD and whatever so and there are certainly some problems with LLD that we know at the moment so LLD at the moment will assume that your output section and is virtual memory address is monotonically ascending so you can break this with a linker script like this where you kind of because it's working top down it will just try and assign these sections into the memory region top down and unfortunately that plus 64 that second section really should be after the after the section in that file GNU LD is clever about this and it will actually sort the things to make sure that they are in ascending order but LLD won't and you'll end up with a bit of code that tries to work out what the virtual the load address is from its virtual memory address and it basically wraps around and goes negative so that's one thing but known bug in LLD at the moment that we'll need to fix but the other thing is say if you are writing and a linker script for embedded systems please try not to make LLD difficult for your linker and put things in ascending order is what I would say on there okay so just because I've only got probably one minute left how some other things or gotchas that you probably need to look out for so dot assignment within an output section see that dot equals four there and now you might think that means assign the location counter to four but no it doesn't there's a special case that if you do that within an output section it's supposed to be relative to dot section so that's actually saying dot equals dot section plus four so yes so if you want to do that please well LLD has decided this is silly we're not doing this and it just means dot equals four but it does mean if you have got an old linker script then then you can end up getting caught like this and you have there is a way of doing things in that's compatible with both which is used dot plus equals and it also that looks much nicer and it actually lines up to what you probably intended to do anyway so I think with that I just want to quickly mention some references and then I'll stop and so yeah so if you do want to know about what LLD does and mass-grade the current LLD maintainer has gone through and basically often when he wants to implement something he will put a blog post up up and write lots of interesting things about what he's found out about various things so it's not documentation it's definitely blog post type material sort of what's there at the time but it is quite useful to get into what the internals of these sort of things done and then there's some various sort of bug reports links and things there but I better stop there because I'm probably out of time now okay so we've got two minutes so I might be able to get one maybe two questions or something like that so the question is there an effort to standardize linker scripts I think it's very much down to the community so I think what we've said in LLD is that if anyone wants to add to the wants to change the linker script format then go to binutils mailing list and make sure you get it agreed with the canoe and don't just input we definitely don't want to just pile extensions into LLD but really it's just communication across the yet of the basic the problem with standardizing it is that that means that they can't change it so I think there's probably one they want to keep some of that yes so I know that if you use funky linker script stuff yeah with LDO then LDO like you can do things like we talked about like instructor lists or LDO doesn't seem to be able to take advantage of the link of the layout caused by linker script in order to do things like preload data that it knows should be in certain locations yeah is that some sort of fundamental limitation with the architecture or do you think that could be that so yeah there is the question was about basically LTO and linker scripts probably not interacting very well so yeah there there are some efforts that are going on with it within the LLVM embedded community for this to try and basically say how can we make sure that you can place well basically I'll probably take too long to answer this but things like interprocedural optimization can break certain linker scripts sometimes you want to say here's this region of memory it's and this region of memory and I do not want you to share things between these two bits of memory but LTO basically just assumes it can do all of it so there is some effort ongoing but we need to work out what the actual rules are here but we but yeah there is some effort in the community to go on to try and fix that okay I probably best stop there for
Patch based test coverage for quick test feedback
All right, next up, Shivam Gupta. Okay. Good afternoon, everyone. Today I will be talking about my GSOC project. It was mentored by Henrik Olson, and this GSOC project is about patch-based code coverage testing for LLVM patches. And so in this talk, the agenda is, first we will introduce what is the project is about, and then what is the terminology we use, like how LLVM test cases are written, and then we will see what is LLVM source-based code coverage that is used to get the code coverage of a patch. And then we see how it is implemented. It is basically a Python script, so we will see what are the functions used to implement this tool. And then we will see a demo, like it's already a patch is there in LLVM community. We will see like what is the lines are covered or what lines are not covered with this patch. So we will start by introducing introductions. So LLVM test cases are written in a lit format. Regression tests are written in lit format, and unit test cases are written in Google test format, or Google mock these formats. So the goal of this project is to help developer to create good test coverage for their patches, and it will also help the reviewers to know that the code they are submitting, it has a good test coverage or not. So this is the project, and to accomplish this project we have created a Python tool. It's around 800 lines of Python code, so it will fetch the patch as input, and then it will extract some information like what are the source lines in the patch, what are the test cases in the patch, test case lines in the patch, and then we build a LLVM project with the code coverage enabled. So it will instrument our binary. So whenever we run the test case with this binary, so it will generate a prof data file that will be further converted, further processed, and then it will show the lines which lines are covered or not covered by the source code of the patch. So LLVM test suits basically like they have two kind of test cases written for any patches. One is regression test, and second is unit test. Mainly the regression tests are written for most of the patches. So these regression tests are in .LL format or .C format for different tools. So mostly our focus is on regression tests, and some test cases are written in unit test case. So these test cases are test for libraries, like support libraries or FSEG data types. So these are checking the feature in the system, how it is well indicated in the system. So this is unit test case. Regression test is very small, but you can see at the top there is one run line which will actually run for this test case. Then there is unit test case. This is using Google Gold Test Library. So it has some micros to check. It is not important, but these two kind of test cases are in LLVM for any patches. And then we will see what is the source-based code coverage. So source-based code coverage consists of three steps. The first step is compiling the program with the coverage enabled. We want to instrument any binary. So we will use a fropile generate this flag we will use and this will generate foo binary which is instrumented. So in the next step when we run this binary, it will generate a prof data file. That prof data file contains the data for further creating coverage reports. So next is the tool is LLVM prof data. This tool is used to convert the prof format to prof data format which is further used by LLVM cov to generate or show the report of what lines are covered or not. In the next slide we have a simple test case and I have generated the report. It checks if the number is even or odd. If we pass suppose 5, it will say that the number is odd and this line, this if condition will not run. So it will show like this. This is the report of LLVM cov for any program. Next is implementation. So for implementation I have submitted two patches. For this first one is about the change in LLVM lit. This is the testing tool that is used to run the test case, regression test case in LLVM. So initially whenever we run a test case, it will generate prof data in some random name. So we have modified that and we have given a proper name for every test case. So it will generate a proper name in a specific directory. So this is categorization of prof data. Next we have the main tool that has all the functions that will pass the patch and then build the project, LLVM project and then generate data. Then process the data to show the coverage report to reviewer or a patch author. Next these are the some functions that are implemented in the tool. First two function is just a logging function. And then it is sequentially like we as a name suggest we have first we create the patch from the last commit or from the patch itself. And then we accept the source file and then we have write source file allow list that is used to reduce the coverage data. Because if we generate the coverage data for all the files of LLVM then it will be around 150 MB for each test case. So it will be difficult to process later. So we will use, we have used a flag afro file list. This flag used to generate coverage data for only the files in the patch. So next we accept the modified source line from the patch and then we build the project. We build the project with a flag LLVM build instrumentation. So this flag is passed during the CMECH invocation. So when we pass this the binary that will be old for LLVM project will have instrumentation enabled. And then we run the single test case with coverage and that is helper function next the modified lit test case or unit test case. Whichever they if it is if the patch contains a lit test case then it will run the regression. It will run the that function and if it has unit test case then this function will call and the test case will run. And next we have a process coverage data which will process the data. And next similarly we have a coverage file and it will run. Then we will have a print coverage detail that will actually be printing the coverage detail. We will also have a log file. So print coverage detail have a print a lot of details. So it will print something to log file and then we will print common uncovered line which is so in a patch there is one source file. But there are many test case. If one test case is covering the source file then it is covered. But if all the test case are not covering a source line then it means that this line is uncovered. So it will print the uncovered line this way and then there are some helper functions which is not important. This is the GitHub CI workflow that is actually is a file that is used to compile the project like on GitHub. So it is like it is holding the project and then at the end it is running a Python Git code coverage. This is the file name. So it will run here in the Python code and then it will print the coverage result. I will show this is the format. It will show the common uncovered line for the...
The LLVM Security Group: History, progress, remaining challenges.
Hi, my name is Christoph Bales. One of the things I do in LLVM is I'm a member of the LLVM Security Group. And I realize that I don't think there has been a lot of publication or sharing of exactly what the Security Group does. And so I thought, hey, let's see if there's anything interesting to say about the LLVM Security Group. So what's the purpose of the Security Group? It's there to make sure that when there are issues, security issues that are discovered, there is a way to responsibly disclose them. So that means when there is need to take action before the issue should be fully publicly known, the Security Group enables that. Its purpose is not to try and do everything related to security. So everything that can be done fully public in the open probably should be done elsewhere. A little bit of history. The idea of setting up a Security Group was done in late 2019. It took about a year and a bit. The group was up and running by 2021. And now three years later, we've been operating for, well, three years. Who can be on the group? There's a page on the LLVM Talk website, which I'm sure the URL will show up on some later slides. But it does describe that it can be individual contributors, security researchers, and representatives from vendor contacts. And here vendor means typically people who pick LLVM, build it, and release it, something along those lines. Currently, there's about 20 members, and the far majority are vendor contacts. So how to report a security issue? So far, we've been using the Chromium Bug Tracker, which seems a bit odd, but it was at the time three years ago the easiest way to get the Security Group going because that's a tracker that has features to make sure that you can report issues in confidence. We are planning to move to something different, most likely something that's integrated in GitHub, given that most of the other project infrastructure is now in GitHub. So what kind of reports have we received over the past three years in total? We've received about 47 reports, and in the next couple of slides I'll go into a bit more detail. This is Sunburst Diagram. Hopefully it will be clear what this kind of represents in the next couple of slides. So the stuff that's highlighted here, 27 issues were considered by the Security Group to not be actually security issues, or at least not issues that needed special treatment and could not be made public immediately. Just a list of, this is my personal classification. It's possible I made a mistake here, so four reports were just empty. Three issues report about Chromium because it's a Chromium Bug Tracker, that's a confused people. I don't blame them, like it's five bugs turned out to be just very regular normal bugs that probably should have been published publicly. There were 12 issues where people reported a memory vulnerability, like a buffer overflow or something like that, happening in a tool, not a library. So for example, a buffer overflow in the parser of the compiler, something like that. We explicitly describe that as we don't consider those are security issues, or at least not issues that need to be handled in a coordinated way. Two issues related to undefined behavior in source code, so someone showed a small program and said, hey, the compiler does something like clearly security issue. It is a source program written in C or in C++ and it has undefined behavior, and actually the compiler does what's allowed according to the standard. And there was one discussion on improving supply chain security, which is very useful, but there's no need to do that as a discussion in confidence. So let's jump to the issues that were considered to be, yes, probably these are, well, these are considered security issues according to the security group. So two of them were considered like we don't need coordinated actions here. It's okay to just have these as public issues. One was a particular sanitizer, not reporting an issue in a particular niche use case. Another one was a client warning not being enabled by default, not going to go into the details, but these could be discussed in public. 18 or 38% if I calculated correctly, were deemed security issues, and that required some kind of coordination. Again, very rough classification. One was about the compiler generating incorrect code. In general, not every incorrect code generation is considered a security issue, but in this case, if I remember correctly, it was something about access to the frame pointer that was not correct, and so we could see that this could be exploited. Three issues were memory vulnerabilities in Lipsy++. Seven issues roughly related to supply chain security. I put it in that category, and then seven issues related to gaps in hardening features. I say hardening features. They could also be called security mitigations. So I just show on the slide a few examples of what can be considered these hardening features like stackinaries, branch protection. The number of hardening features implemented in compilers over the past 10 years have really taken off. Let me put it that way. Going back to supply chain security. So what are the kind of issues we saw in supply chain security? One issue was about the Visual Studio Code plugin for playing D. The LLVM was potentially trusting workspaces where maybe the user should first say, yes, this is a fresh workspace. I should trust that. So a kind of supply chain security like in the tool LLVM built itself. There was one issue that said something potentially suspicious happened here. Did someone try to put the backdoor into the LLVM compiler? It turned out extremely unlikely that that actually happened. But I'm quite happy that if people have suspicions, they actually report it and it gets investigated. Two issues were related to, I think one was a person accidentally publishing their GitHub access token publicly. So everyone had access to commit to the repository. And then something about the website that could be improved. And three issues were about out of date dependencies on libraries. So in a few, not in the core of LLVM, but a few of the utilities around, they're implemented in Python. These typically say we depend on exactly that version of a Python library. And those were out of date and there were security related consequence of that. Hardening features again. Seven issues we had. Two categories. One was four issues were related to gaps in existing mitigations. So mitigation has already existed. And three issues were some new vulnerability got detected in another system or maybe in some hardware. And the request was before we make this publicly available, we would like to make sure that there is a new security mitigation feature in LLVM to help people to work around that issue. So I really raised through just showing a few stats on what are the kinds of issues we get. Can we have some takeaways from looking at those? I think yes. So first of all, achievements, success. I think through all of these issues, I think all of these, all the report issues seem to have been processed appropriately. I think that means the security group is working as expected. So that's a success. If you would like to have a closer look yourself, the security group publishes a transparency report every year at that link. One thing I do not know is maybe some people have published an issue that is covered in the public bug tracker. That really is a security issue and should have been reported to the security group first. I'm not entirely sure exactly how to get some stats on that. If you do have any thoughts on that, please do contact me later. Those were all room for a lot of good things. The security group is working as expected. But from these stats, we could also derive some ideas for what could we do? What could be done better? Where do we need some more work? I think there's at least five different categories. First of all, a little bit more clarity on the threat model and what the security issue is. The issues that from my experience at least were by far the most complex are all the ones related to issues related to mitigations in the compiler. Supply chain, there's quite a few issues came in. There's probably also something to do there. I already said we should move away from the Chromium bug tracker. Then something that isn't quite clear from just looking at the issues itself. I think we also need to improve how we communicate what the known security issues are to everyone who has an interest and who should know. I'll go into a little bit more detail there. First of all, what is security issue or a threat model? Actually, it's documented what currently the security group defines as this definition of what is security issue. It was a threat model. A few things highlighted there. A lot of young issues in a very wide variety of use cases. It's very hard for everyone having all different kinds of use cases for all of them to agree on these are security issues. We'll have to come up with a reasonable consensus. Before we can say, hey, this is really the threat model we follow. We need buy-in from most of the people who commit to LLVM because otherwise how can you make sure that patches follow threat models? There's actually a few thousand committers to LLVM every year. That may be a bit of work. Maybe the most important guideline currently is that if you're not quite sure of an issue, is it security issue or not, please err on the safe side and do report security group. You saw a slight majority of issues we received. We decided this is not security issue can be handled as a regular issue. It's better to be safe than sorry there. So by all means, please do report if you think it might be security issue. What are the security sensitive parts of LLVM? Non-currently explicitly written up. I think we need to improve this section. Which parts or which kinds of issues are not security issues? There's actually one definition here. So it's basically if you have a malicious input file to a front-end and like you make clang fall over in a segmentation fall or something like that. That's probably due to a memory vulnerability in clang, but in general we don't consider that to be a security issue. Second category of things that could be improved. Issues related to hardening features. Based on the reports that we received, I think for two or three of them, we said that no, no, the hardening feature works as expected. You as a user thought you would get more protection, but you didn't get it in a specific detailed example. I think the documentation for many of the hardening features can be made quite a bit better to be more explicit and more refined about this is exactly what you get. And are there a few things that we could do to increase the quality of implementation of these hardening features? I have a few IDs there. So one is to build a binary scanner that checks the compiler produced some binary. Does it actually everywhere apply that security hardening feature correctly? I build a prototype based on BALT, but that's a whole other presentation at some point. I already said that probably we should improve documentation. Some of the mitigations, so they're built in LLVM, they are made available for clang the frontend. But then if you go look where are they also applied to other languages and then you see some gaps. So for example, for the rest frontend, maybe some of these mitigations could also be enabled and maybe for other languages too. And maybe it could also help to help more competitive developers know about what are the special things to look out for from a security point of view. And so together with a few other people, I started an open source book to try and collect the information like what should a compiler developer know about security. And so the link to that open source book is at the bottom of the slide. I said something about supply chain, something, something we should could improve there too. There's a few categories for supply chain that we saw. So something's about just the web infrastructure for a project. Something about features and tool chains that help with developing software. Maybe something is needed there also to, to, to from the supply chain angle point of view. Secure against malicious injection of code into LLVM binaries. So yes, making sure that a malicious actor wouldn't be able to like modify the LLVM binaries or the binaries and produce itself. One thing that I think is a really good, very recent change is having some open as an F scorecard for on, on, on the project currently, which is a list of recommendations on best practices for software in general, software projects in general from a supply chain security point of view. And currently LLVM has a score of 5.2, so probably there's room for improvement there too. Moving away from the Chromium tracker, I'm not going to say that I'd spend any more time on that. Finally, some thoughts on better communicating security issues. So most of the security issues that we received, maybe people think of CVE as that's the way to communicate to the whole world. There was a security issue here. This is what you do. What it turns out, most of the issues that we receive on the tool chain side, there are quite a few security issues that actually, most people would say you shouldn't create the CVE for that. For example, gaps in mitigations like that doesn't make one specific program immediately exploitable. And so quite a few people say you don't create CVEs for that. But at the same time, these mitigations are widely used and people rely on it. And so that should be a way to communicate to users of tool chains. We found something probably you want to know and you can do your own decision. Should I update my tool chain or is it okay for me? There's a few ideas on the slides on how to do that. Maybe just have a security label in the issue tracker or something different. But I guess I'm almost running out of time, so let me move on. A little bit on, can you participate as a general LLVM developer or LLVM user? Well, please, if you do think you find a particular issue and you do think there's a particular security angle to it, please do report it appropriately to the security group rather than its public bug reports. When needed, spread the word that LLVM does have a security group and a process to report security issues. We do every month have a public online sync up. So if you would like to talk with a number of people on security group or anything relevant, please have a look. There's also a link to the calendar, what the exact schedule is. And if you're a vendor contact or a researcher or you would like to participate here, please do reach out. We do welcome new additions to the group. So in summary, the LLVM security group has been operating pretty well, I would say over the past three years. If you analyze the issues, there's a number of areas for improvements left. And again, if you do encounter an issue that you think, hey, maybe there's a security issue to try to remember to report it appropriately. And with that, I thank you for listening and hopefully a few more minutes for questions. One minute and 20 seconds. Right. I think you really understand the idea with vulnerability in the tool, not being concerned with the security group. If the tool is part of the LLVM project, why is it not? All right. So the question is, I think, this line roughly. So how come if there is an issue in a tool, how come that's not considered a security issue? I think it boils down to that. In the very first class, in your stats, you said there were two issues that you didn't need to consider. All right. Okay. Let me go back to that. So we had two issues, one in the sanitizer, one in a client warning. So the classification is mine. So I just looked at exactly what happened on those issues. On those issues, the decision the group took was it's not necessary to keep this issue private. This issue can be made fully public. It means that fixes will be implemented more quickly, which is a good thing. On the other hand, do we think anyone would be immediately the target of an attacker because we made this information public? And for these two issues, we thought the chances of that are really, really, really low. And so the trade-off was it's better to make this issue public rather than to keep it private. Hopefully that makes sense. Yeah. Well, maybe we have time for one more question if there's one. I was curious, since you said that you have some vendors in the group, do you think that they will be open to giving bounties for some of these vulnerabilities if they are significant enough? Because I mean, it's a real bug in the code gen. It could be significant, especially if mitigations are on my list. So the question is, would vendors be open for a bug bounty program? Sure. All I can say is that so far I haven't seen anyone even suggesting the ID from the vendor's side, so who knows. Bounty fixing. Yeah. All right.
Building a Linux distro with LLVM
Alright, so welcome to my presentation. I'm Daniel Kulesa. I'm the founder and primary developer of the Kim and Arlenox project, which is a Linux distribution. It's a general purpose Linux distribution with a particular focus on desktop computers as well as different similar cases, I mean client computer kind of cases like single board computers and so on. But since it's a general purpose we are also implementing server purposes and other things basically like just about any Linux distribution. Its focus is to be robust as well as to have stronger security hardening than most standard Linux distributions as well as having good default for things being deterministic so that you can actually install things using a single process and then it will come out pretty much the same every time and it will pretty much work out of the box. It's also supposed to be lightweight and transparent as well as overall pragmatic so not really focused on one particular thing like many niche projects tend to do. It uses LLVM as its system tool chain. It's currently only has LLVM. Well we do have some GCC builds which are bare metal builds for building new boot for different single board computers. That is a general purpose GCC compiler we do not have any right now. I've actually started some work to introduce GCC as an option but so far I've run into this very annoying bootstrap back on PowerPC which prevents me from fully introducing it so far. So it uses tools from 3BSD as its core user-land which are hardened with the rest of the core system and it uses muscle as its ellipse of choice. I'll explain why at a later point. For package manager it uses APK tools which is known from Alpine Linux but it uses the next generation of APK tools which is currently under development and not deployed in any other system. It uses binary packaging obviously but it also has an option to compile things from source if you like to. It has from scratch made fresh build system which creates APK repositories. It's also sort of intention to have things be bootstrapable so you can compile the system from scratch from source and bring it up to the same state as it is in already using some other Linux system as a base. It can cross compile. We do not do cross compiling for any official repositories. All the packages we actually ship, 45 architectures we ship, they are native but there's the option to cross compile in the build system sort of transparently so it does a lot of things for you so you don't have to, generally the packaging template doesn't have to do much to be able to cross compile as long as the build system for the project itself doesn't have some weird cross compiling issues. It will sort of work by default. We do support many architectures including ARCH64, little and big end in 64-bit power PC, RISC-5 also in 64-bit version only right now and obviously also X8664. It includes system-wide link time optimization for pretty much all packages I think. Right now we have about 1500 packaging templates and only about 50 or so have LTO disabled and some of these are basically false positives like the kernel has a different way to enable LTO and standard some are things like scripted things and so on. We also enable subset of UBSAN in production builds set up in trapping mode so that it does not actually include any runtime in the resulting binaries. This is mainly used to mitigate signed integer overflows which should become hard crashes in pretty much all packages unless there are some exceptions where things are currently broken in a way which cannot be immediately fixed. We at least use this to keep track of everything which has these kind of issues and fix them later. We also try to deploy CFI or control flow integrity mitigation from Clank system-wide which is a lot harder task because a huge number of software breaks with that. This is mainly C software where the issue is that many C programmers tend to like having like a function pointer which has some signature and then they take some function and cast this function pointer to the original signature and CFI does not like that. In C this is undefined behavior and people tend to think that it's actually okay because for example your custom function which you are taking a pointer to is taking pointer arguments and the original thing is taking void pointer arguments and they think it's okay because it will get implicitly casted to void but this is like it's still undefined behavior you shouldn't do that. You should declare your function with the original signature the one you want to have and then cast these things within the implementation properly and then you will have no undefined behavior but people tend to avoid this because they are lazy or whatever. In any case to get started how I got started it was in early 2021 and it was I decided to create a build system for Linux distributions called C build which was originally a re-implementation of the XBPSSRC system from Void Linux which I was using at the time. It's written from scratch and it's implemented in Python because I sort of got tired of different distro build systems being written as massive bash scripts which both results in the system being slow and also the system being difficult to debug and track issues in and also it's hard to trust these kind of systems when it comes to actually being integrated so you usually want to delegate things like signing and so on to different places because it's sort of impossible to tell which things can interact with which other things and which things run in trusted context and which don't and so on. So I sort of wanted to avoid all this kind of nonsense and create a system from scratch which does not have these kind of issues. So my initial system was Void Linux as I said at the time on PPC64 early architecture. C build has a sandbox which means the environment in which everything is built is sort of a container implemented with Linux namespaces. We try to harden this container as much as possible to restrict what build systems and so on can do but at the same time allow common software to be built reasonably. So like there's no network access in the container after some point. The root file system of the container is read only and it will only write to directories where it's supposed to write that kind of stuff. So the sandbox also means that it can be run on any distribution even if the distribution is actually completely incompatible with ours so it's isolated. So what does the bootstrap process for the system look like and what it looked like back in 2021? There's four stages basically. Stage zero is to bring up the first version of the bootstrap or the build container which runs using the tool chain from the outside system and you basically use the outside tool chain to compile all the basic things you need to assemble the minimal build environment. This is done with sort of minimal features and also not many compiler flags and so on so things like LTO are disabled at this point. We cannot make too many assumptions about what kind of compiler is used on the outside so we try not to and just assemble sort of an environment which is good enough that it can be used to build more things. For stage one we use these generate packages from stage zero built to actually assemble the container and build the same thing again but using the new packages. We repeat this twice more. For stage two we enable LTO and basically all the flags we need which will match the final environment so the result of that is basically what you want basically equivalence to the final but just for a good measure we use these good final or almost final binaries to rebuild once more and get a clean environment which can be used to build everything afterwards. So what does the environment look like? It's sort of bare minimum packages to build itself so the container is small enough so it can build itself but also you can install more dependencies into it depending on what you are building and so on. Have some build graph and build it over time and so on. So there's a Lipsy obviously and there's the compiler which is in our case Clank and the rest of the LWIM suite. There's the core user and different utilities basically that makes up basically the system because build systems need to run these and so on. And there's the package manager which is used to manipulate packages installed within the build container for different purposes. It's a small chimera Linux system just containerized. The containerization is done using the bubble wrap tool which is also used for example by Flatpak and it provides a minimal interface to the Linux namespace kernel feature which lets us make these small sandbox containers without requiring much other infrastructure. The outside host system needs to provide only Python for running C build itself, bubble wrap and a potentially static build of APK and that's basically all it needs to provide. Anything else is sort of set up by us. It runs completely unprivileged. There's nothing which needs root and you cannot run it as root. So why use LWIM? It's a more modern compiler design in GCC and it has many implications. It has state of the art sanitizers which are in a better shape than in GCC. GCC does have some of these sanitizers but it tends to be more out of date and things like CFI for example not present in GCC at all. It's also much easier to build and bootstrap with it has a relatively standard build system compared to GCC which is sort of completely custom. Half auto tools have some other cursed thing. Also cross building with LWIM is much easier because you only have one compiler and you only need the specific run times for the different C-sroutes for different arcs. It also has FIN LTO which actually enables us to deploy LTO system-wide and not fear it will actually be much of a problem. It uses far less resources this way. It's much faster, it's slower than non-LTO build obviously but the overhead is not so big that we cannot do it. It also has actually better performance these days which didn't always used to be a thing but nowadays it tends to be that client tends to perform about 5% faster on average in resulting binaries than GCC does. And often it tends to be less buggy in my experience. As for why not to use LWIM, very occasionally there's worse compatibility with things. Some things will not compile well with LWIM especially things with very cursed linker scripts and so on sometimes run into trouble with LWIM. Some of the supported architectures are kind of less maintained than in GCC which supports an impressive amount of architectures and overall there's fewer architectures supported. The LWIM suite as a whole also takes much longer to build because it's in C++ and it's sort of idiomatic C++. So on most architectures this is not an issue but for example now RISC-5 Builder which is not real hardware actually. We built in QMU user emulation because it's like seven times faster than the best next real hardware. It's still very slow and it takes like maybe 10 to 15 hours to actually build a new version of the toolchain so it's not great. And very rarely there's worse performance in like runtime performance in some things so like I think in maybe in some open MP things the state of that is still a little bit worse and so on. But it's not a big deal. Now let's take a look at the toolchain structure of a typical Linux distribution. We have CRC library which is usually these days almost always G-Lipsy or Muscle. Other Lipsies do exist but they are very rarely used. Pretty much nobody uses things like UCLipsy and so on these days. You have GNU Benutils which provides things like assembler and delinker and manipulation tools for else files and so on. Then you have GCC itself which is a CNC++ compiler, plus a compiler frontend for many different languages, plus a core runtime or well a portion of core runtime because some of the core runtimes provided by Lipsy, some is provided by GCC. You have LibGCC also and LibGCCS for the unwinder and so on. You have the C++ and the library called FlipSTDC++ and that kind of stuff. You tend to have one build of Benutils and GCC per target. The user-run API tends to look like you have the built-ins for fallbacks which are in LibGCC.a static linked plus the CRT and you have the unwinder plus dynamic built-ins in LibGCCS and the C++ and the library and so on. And with LVM it looks a little bit different. You have the compiler linker, assembler, Benutils and everything all in one, all in one suite. You just compile it all. The only separate component is the Lipsy which tends to be again G-Lipsy or Muscle. For G-Lipsy being problematic still so that's the main reason we went with Muscle. You have one compiler for all your targets and then you only compile the runtimes for the others. The API also looks a little bit different but this is not used in most distros because in most distros LVM acts as a drop-in compiler for GCC. But with native LVM style ABI you have the built-ins which provide all the LibGCC.a plus a portion of LibGCCS which is also static linked in this case and lipanwinder.so.1 only provides the unwinder ABI part of LibGCCS. So you also have the C++ library which is different but it can live in one process with LibSTD C++ because LibC++ uses anonymous namespaces in a clever way and that allows these symbols to mangle differently so it can actually live in one process. As for ABI compatibility, as I said, lipanwinder implements most of LibGCCS. So to make a makeshift LibGCCS you basically have to create a shared library of LibClank built-ins and then link it to lipanwind and this will implicitly pull in all these symbols and at least in a muscle environment which has no version symbols this will work. In a G-Lypsy environment it might not. As I said, LibC++ might be able to live in one process but this actually is only in theory because you might have to make LibSTD C++ use LibC++ ABI as a base. It will still conflict otherwise. G-Lypsy could not build with Clank until recently, now it does but it's still incompatible with native LVM style ABI because it dynamically openslipslipgccs and therefore we cannot use it. Muscle just works and always works so that's okay. There's another very neat thing in LVM as the allocator called SCUDO. It's the default allocator on Android and it's a hardened allocator but still a very high performance allocator. It has a modular design which is very different from most allocators so it makes very few assumptions about the environment you can run it in. We can mix and match these components and configure them differently. Most allocators tend to assume that you have LSTLS available at this point and they can just use third local variables. You cannot do this with Muscle in LibC.so because the dynamic linker doesn't set things up until later and the dynamic linker and the LibC are the same file so you have to be a little bit clever about it. We replaced the standard allocator with SCUDO in our system because it's much faster. For example, LTO linking of Useld takes third of the time now so that's quite a big difference. The main reason for this is because Muscle's stock allocator uses a global lock for consistency but this sort of pegs it to one thread so it's not great. We have no LSTLS so we just put a pointer directly in the P-thread structure and implement a custom thing around it and that works. The main drawback is very high virtual memory with the standard primary allocator is about 8 GB of process which is kind of insane but with the primary 32 allocator we use it's only about 120 GB which is still a lot more than most allocators but I have plans to try to tune it further. Cross compilation, well C-built can cross compile. Cross targets need cross runtimes which we do compile but it has a little bit tricky bootstrap if you need to do it without pre-existing system. The cross compiling environment needs to include compiler RT, Muscle, Lipanwine and Libc++ and its ABI library. This is all installed into one directory which is treated as the system route for the cross target and that's how you use it. It's pretty much standard. To bootstrap this kind of thing you first build compiler RT or well a small part of compiler RT the built-ins plus CRT begin and end files. You do this by telling CMake to force static lips only to get rid of compiler executable checks which will not work at this one because it doesn't have the complete toolchain available and you disable the sanitizers so you can compile them later but to this early point you only have to compile these built-ins and the CRT base. It requires lipcy headers for that still so you just give it some lipcy headers, you just take Muscle and tell it to install lipcy headers in a temporary directory and then give it to that and that works. Then once you have this you can actually build and install a lipcy itself. It needs only the above parts. Once you are done with that you can build and install lipanwines plus lipcy plus plus together which is best to do it together because you can do it together and if you do it together you actually remove yourself some trouble of having these things interact at build time so you just let it to build all three components and you're good to go. Once you still need to explicit no STD lip and CXS flags because you don't have the STD lip available at this point and the build system will otherwise assume it and break. Now once you have this you can actually compile the rest of compiler RT. This is mainly the sanitizers that's what you typically want. Once you have all this in assist route this is the full cross-run time you need and you can happily cross-compile anything for this target. As for practical experiences with LVM as I already mentioned before it makes system wide LTO actually possible. It has far lower resort usage this way compared to GCC LTO. For example at work I'll come currently on a break but I'm coming back to work soon. At work I work on webkit and when I compile webkit with GCC LTO it climbs the memory usage climbs to 80 gigabytes of RAM and it runs out of memory and crashes. So with thin LTO and clang this does not happen. The resort usage stays firmly within some 30% extra overhead compared to standard build so that's very nice. Of course there's the security hardening side which I already mentioned we deploy a subset of UBSAN and CFI is used to where we can use it. Other things we market as to do and maybe fix things later but the entire core user run is harder in this way for example. As for toolchain patching in the distribution this is mostly in line with GCC but still more than I would really like it's about 30 patches we maintain downstream. I would like to upstream some of them but I need to clean them up first. Distributions tend to be geared towards GCC style runtimes so LLVM is often reduced to being sort of a drop in compiler for GCC which is a bit of a shame. I think more people should use the native runtimes and actually test these properly on Linux not just on systems where they are native. And also the build system of LLVM can unfortunately sometimes be a big unpenetrable mass. It's partially due to CMake itself being kind of terrible but it could still be a bit better. Also major releases of LLVM can be kind of daunting to update to because they pretty much universally and always break some compilation of something. This is usually for good reasons. For instance recently LLVM actually switched some invalid function pointer cache to be errors by default as well as disable KNR style function declarations without return type and so on. This would be fine and this is actually a very good thing which should have happened 20 years ago but it didn't. Now when this happened we actually still run into tons of projects which break on this and worse it actually breaks in ways which we do not like. Particularly for example GNU Auto Tools. Lots of projects with GNU Auto Tools tend to be generated with ancient versions of Auto Tools because they ship these pre-generated config files. Once these use KNR style functions for different checks and when this happens the compilation of the const test will fail and it will get treated as not having some feature which will happen implicitly and you will lose that feature and this might sort of you know. So what we did is basically switched to always regenerating Auto Tools files on any project where we can do it. Just to never trust these pre-generated config files because it's really bad to trust them. This kind of stuff happens and as I said it's usually for a good reason and it's a really good thing that LLVM is actually pushing these things which should have been done 20 or 30 years ago but yeah it's still a little bit of a pain. On the other hand the community of LLVM has been very good and helpful in my experience and especially shout outs to Math Gray who has been writing very nice blog posts about all sorts of things and also has been extremely helpful on IRC and in different places with actually figuring out different issues of the tool chain and so on. Well in conclusion it's generally a really nice tool chain and there are some pain but in general it's nice and practical. It can build just about any Linux software which is neat especially given how many extensions and so on GCC has and it should not be reduced to just the drop in thing for GCC use on Linux it should be treated as a standalone thing more I think. Thank you for listening and if you have any questions now is the time to ask. How do you schedule package fields on your builders because if you remember a thing I would tell you it may have been client-based oil and fuel or something like that. Yeah basically what happens is that you push your thingy or change to the C-port GitHub repo and then we have a build bot which will pick up these changes and then schedule them all into the different workers which will build them and then upload them to the final repo server. So one worker is just building one package at a time? Yeah basically yeah it's good enough for the time being it's sort of it may receive a batch and the build system will sort of sort it and then do the thing it needs to do. Yeah? Do we have any idea about the size of the thing? The minimum build how big is the question and what kind of unit are you using? The size of what? Of the system? Yeah. Do you have the minimal system? A minimal container kind of system? Oh yeah sure. So he was asking about the size of the minimal system and it depends on the case really like a very minimal container kind of build is about 7 megabytes while a bootable system is maybe I don't know 50 or 60 if you really make it small but then you pull in Linux firmware and then it grows to 500. So yeah there was one more question in there. If you get significant, do you think the performance drop is enabled like a partial input? The question was if there's any significant performance drop to enable in these sanitizers? No there isn't because most of UBSAN is very cheap and incurs basically practically no runtime overhead in practice. As for CFI it depends on the specific software but in most cases also not. As for other things it really depends on the sanitizers but the stuff we need is pretty much always very lightweight. Yeah? There's a question in the matrix room so coming from somewhere online I'll just read out now. I wonder how many, which packages happened to avail if you have no network access at all in the build container after your preparation step? Well. All build apps from a known mirror and especially before starting any upstream provided scripts? Pretty much anything written in CRC++ tends to be okay and require no network access when you build it. What does require network access is pretty much anything written in Rust Go or like some JavaScript stuff or save some big software like LibreOffice density download things from Internet by default. We do have workarounds for that. For Rust we have a step which is run immediately after installing other dependencies which will pre-download and pre-vendor all these dependencies into the three and then from that point onwards which will disable network access for the rest. Anything else? Oh yeah, here. I have a question about system-wide LPO. Is that also included in providing a bitcode for 10 libraries? So the question was about bitcode for static libraries. Yes, we do ship static libraries in bitcode format. The main issue with that is they tend to be fairly big because you cannot strip the back info from them. But what we do is split static libraries into individual sub-packages so you don't install them by default unless it's needed. But you can still install these static libraries when you want them. Anything else? Yeah? Well, is this version not for a daily driver? Well, I'm running it on this laptop, for example. I'm running it on my workstation and my other workstation which are Arch64 and PP64LE systems. There's other people who run this and have the main issue. It's really like a lack of some software at this point. It's still much bigger than any niche distribution or has been at this point. There's 1500 or 1600 templates which is one template for one software. And we even include some really major big things like LibreOffice or Chromium and Firefox and basically all this kind of stuff. Okay, thank you. Thank you.
Build your first Clang compilation database plugin
Okay. Okay. I'll begin. Now that's some oxygen. Back in the room. Okay. Hi. So, compilation databases. Going to talk a bit about that. Here's the motivation in examples. So, we have some really simple project structure here, your typical project, I don't know. And you have a file that you compiled, file A dot CPP, and here are some flags that you used to compile that file. Really simple, right? Now, the question that the compilation database kind of wants to answer is, which flags would you give your static analysis tool or whatever tool you're using for file A, for example, and also a bit more difficult to answer for file B, right? You really don't know which flags you should use just by looking at that. You as a human might know, but as a two-chain difficult to answer question. And that is where compilation database really comes into play. Mostly, you notice that when something doesn't work. So, you do some static analysis. So, I tend to use CMake, although I heard that there are not many fans around. It works for me. And Ninja, all of these, they can do that. Like I think for me, you need some script. If you don't, if you're built with systems, some AK stuff, you can still use beer or beer. Beer you can also use. Maybe it's better choice with Bear. You can do some LDP load magic, I think, and then intercept the course to your compiler and then write out the compilation database for you, even if your system doesn't support it at all. And also, Clang itself can kind of emit stuff that you can put in a compilation database with the dash mj flag. But on the other hand, if you do that, then you already have the flags, right? So, if you pass that flag to Clang, then you already have the flags. I don't really see a reason why you want to do that, like only for some with script situations. Now, the problem of course is you only have the four compiled files, right? For header files, you typically don't compile them. So, you don't have entries in the compilation database for you, header files or for some other auxiliary files or files that you didn't compile yet, stuff like that. The file that you probably all have seen is the compile commands JSON file, right? That lives somewhere in your build directory. It has entries like these. So, there's the build directory is one of, so there is one file right now in this compilation database that is in the compile commands JSON. The command is there. You have which file was actually compiled here, so the file a.cpp and which is the output file. So, in this case, it's the main file. So, this compile command is just one representation of the compilation database that lives on your hard drive. In Clang or better in Clang Tooling, this is where the compilation database struct lifts or I think it's a class, but it's slightly simplified. It's actually pretty simple. So, there are some methods for getting some compilation database. Some methods for loading them. So, you give it for example some direction and say, hey, give me the compilation database that corresponds to this directory or this file. Or you say, I have this directory. Please figure out which compilation database I should use. This is what Clang also internally uses. These are more interesting for us. There are two things that can give you so-called compile commands. You can either get all of them, all of them that are contained in your compilation database, like the files that you compiled, or you can ask it to give you compile commands for files that you want a compile command for. A compile command is also pretty simple. It's basically what you just saw in the JSON file. It's a directory, the file name. It has the output file name. It has the command line as a vector of string. It parsed already the command line into arguments, which is actually pretty interesting, because this is pretty fast. I started looking into this because ComptiB, which I previously used for doing this stuff, it was really, really slow in doing exactly this step, parsing this and escaping the bash arguments into a string. There's one thing that is called heuristic. Heuristic, that sounds promising. You use some heuristic to figure out some flags that you don't really know yet. What does Clang actually do to give you compile commands? There's this JSON compilation database, which is just a specialization of the normal one that we just saw. It can load the JSON compile commands, JSON file, and then it's a compilation database. It does some steps. First, it will do stuff like this here, expanding response files with just irrelevant here. But what it will also do is it will infer the missing compile commands, and then it will also infer the target and driver mode. What it also does is it loads some plug-ins which is not showing here. So what is actually the heuristic that is used by Clang? Because you may have noticed that you can open with ClangD, use some other toolings on a header file, and will not immediately break. It will try its best to figure out the flags that you're using. What it does is it has this interpolation compilation database. What it does, it finds the closest available file in your database that is already there, and then it awards points for path similarity. So it will use the base name. So for example, in my example here, this file a.cpp and file a.h, they have the same base name, so this is already a point. Then you have also in the local structure of your project, you have some path and it will match on that, and depending on how similar the path is, it will award points. Then if you still have a ties, then it will use the prefix links as a tie break. Then in the end, it will replace the file name, and also remove the output argument to for example, work with header files. So now if you use that, this is the situation where it gets frustrating, you apply that now and you get some weird flags. Some that you didn't expect and didn't work. What happened? We have this other directory down there, still in gray. Some files list in there, they match better than our obvious file a.cpp, bad luck. For me that in other, this is often a copy of LLVM and that matches a lot. For me that happens a lot. Good. So now we know what the problem is and what the solution might be. Let's start building our own CB. But first, I will tell you why you might not be doing this. If you are happy with the default, obviously, you don't need to put any effort in, right? It works for you. Perfect. If your structure is simple of your project, don't put any work into it. Also, if you cannot build your tools, like if you cannot build your Clang Clang D whatever, and you rely on some fixed version that you get from packages or something, also cannot use that because you have to link it. Also, if you're using Clang D, if you use Clang D, then chances might be that you can get around with modifying Clang D's very good configuration file. It is pretty simple, but you can get around in most of these problems. If you can just use some script or some hack, do that. I was telling our working student that I was going to give this talk and he was like, Pascal, what? I just use bash and replace that. Why would you do that? If you can do that, fine. But why would you want to do that? The number one reason I would say is you have way more information about your code base than anyone else. There are tons of discussions online in Clang related like bug triggers and so on where people come up and say, why don't we enhance the interpolation in this and this way? Then some other person comes along and says, yeah, we cannot do that because that wouldn't work for this and this cases. So you have all of the knowledge and you don't have to be very conservative. The people who write Clang D and Clang, they have to be. Also, as I said, it's I think a very nice step into working with tooling. Also, if you're doing something really unconventional, I don't know, some live building or some obscure compiler that you use, also might be worth looking at this. Okay, so let's build our own CDB and something that is a bit more advanced but that we can still now understand, even for the purpose of this demo. That also uses part of Clang D's infrastructure and is also somewhat useful, right? So we want something. What you could do is we could just use an include graph. So we have this like here, say A includes the header file and so on. And this is some useful information. We can generate this information. We also have some additional information, for example, that file A and file B live in the same direction. We did also simple information that we can just use. Best thing now is that Clang already has information about that and tools to give you to help you. Not going to go into all of these, but there is a nice scanning tool that scans for dependencies of file. You just keep it a file and say, hey, what are its dependencies? And it will give you a list. Like this is perfect. This gives you, that works with. So this tool, dependency scan is simple. But this is the part obviously where it could get very complicated and work intensive to find a good thing that does best the candidate. Right? I did this graph thing now, but you know code better. So that's basically it. So how you can use that? In Clang D, it's pretty simple. You can just use a compilation database plug-in, which is a very thin wrapper around this what we just saw. In Clang, you have to manually overwrite it. There are two ways to do that. Either you just change the code, just replace the class name, or you can do some linking, I will show you where. For the other tools, as far as I know, you have to do the same as with Clang D and replace it manually, which is a bit unfortunate. I think one can get it to work with all of the extra tools that are not Clang D. You can also add your own plug-in system. So compilation database plug-in is actually just this, which consists of this one method here that says load from directory. And you take this and we can build our own one. So we have our custom compilation database plug-in now. We have this load from directory. We append this well-known path, easy. And then we just instantiate our class and return it, done. Very simple. And there's one extra step that might be a bit tricky. You need to link that in, and for that you have to have this static variable created. The name is not very important, but you have to do that. And it uses an LLVM registry under the hood, pretty nice. Yeah, with Clang, I'll upload this slide so you can take a look. You have to modify either of these here, which is a bit tricky. So the better way would be to use the plug-in system if you have your own tools. Just directly taken from Clang D, this works for all the tools. And if you just want to set your compilation database, you have the default ones and you can override it with your own plug-in then. Okay, so there was a lot of code. I will upload this, of course. And there's also a demo project online if you want to take a look at that. If you are a beginner and you're just taking to, wanted to look into that, because I look at different LLVM code normally and not at this part of LLVM project. I had some nice experience with, actually with LLVMs. I just asked them, hey, I want to do this kind of stuff. Is there something in Clang? And it was like, yeah, yeah, which is nice. And also, don't reinvent the wheel. If you have Clang already at your hand, there is so much stuff in there that you can just use, right? Don't have to do this. And also, maybe you just don't need this and can get away with the configuration. Okay, so there's a demo project that is online right now. Can take a look at that if you are into Nix. You can just Nix build it. It should just work out of the box. And yeah, to conclude, the compilation database consists of compile commands, simple, that help your tools and your ID. For unknown files, we have to use heuristics. And when you want to customize this behavior of the heuristics, you have to have some sort of this plug-in. The easiest integration that you can find with ClangD, because there you just need to link in your own plug-in, like I just showed you. Okay, and last but not least, there are some resources, very good discussions about why you might do this and why it's not so easy as it seems when you first start. Okay, thank you. I think we have two minutes for question, Adeni. Yeah? So how often does the default heuristic fail? So how often do you need to put it in your own one? I think it's, so the question was, how often does that fail and how often do you need your own one? So for me, it depends highly on the code base. I have my simple hobby projects where it just works 100% of the time. I'm happy. But for example, at work, I have experiences where this is maybe in 10% of the cases where it works. And I create a new file and it immediately breaks. And so something like that would help. How does this interact with C++ modules? There with C++ modules, the question was how to stay in the action with modules. It gets way more involved. So for example, the scanning tool has a different mode for modules. I haven't, to be honest, I haven't looked into that because we're not using modules yet. But yeah, there's much more work to be done, I think, yes. Yeah? Just a little curiosity, what should the answer be if for example, header files included into different files which have incompatible flags? Yeah, the question was what should be the way to do, if some, what should be the way if two header files have incompatible flags? One header file with incompatible flags. I don't know. No one, this is the core question, the core problem here. No one really knows. And maybe you have the information for that, right? You have maybe some clever way of doing that. But generally, the people of ClangD, they cannot answer that for you. We should ask LLMs to tell us for that. Yeah, maybe not ask the LLMs for that. Okay, I think we are running out of time. Okay, thank you very much. Thank you. Thank you so much.
Challenges of supporting multiple versions of LLVM in Intel Graphics Compiler
All right. Thank you. We're going to start with the next presentation. All right. Thank you for coming. Can you hear me back there? That's great. Thank you. So my name is Matawór Belicki and I'm a compiler engineer at Intel. Today, I'm going to talk about a downstream compiler project that we have. It's an Intel compiler but for GPUs. Why do I come here in the first place? So I would like to share this experience which we have of maintaining and doing a downstream compiler. I think this is the case for many LLVM-based projects that they work as a downstream compiler. Maybe they have similar issues to what we had or maybe they are thinking about doing things the same way. So they can see if our results would work for them or not. I think it also might be interesting and valuable to LLVM developers and maintainers to see how their project is used and maybe sometimes even a little bit abused. Yeah. And well, I also hope that I will gather some comments, some feedback, maybe somebody who is also doing something similar has better ideas to what we did. So without further ado, what is IGC and how it relates to LLVM? So IGC, as I started talking before, stands basically for Intel Graphics Compiler. And this is a SPMD compiler that targets GPU, where SPMD means single program multiple data. This is a kind of a paradigm that is very popular among GPU programming models. When you have a singular program that is executed on multiple instances of data at the same time, right? A little bit like Cmd but here the vectorization is completely implicit. You only program from the scalar point of view. You only see a single program but underneath it is working on a multiple amount of our own custom passes that work together. We currently are not using the LLVM Code Gen infrastructure in any capacity. What we do have is a custom pass-based emitter which performs vectorization at the same time, lowering the LLVM IR into another custom IR which is called VISA. And what you see at the bottom of the slide is basically how the target ISA look like, what we are trying to emit here. So this is basically our assembly and I think the most interesting part that you can see at glance is that it's explicitly vector instruction. So we have this impedance and difference between what comes to us in form of scalar program and what we emit in terms of a vector program. So to do what we do we are using the LLVM C++ API. So basically we use all the same functions and classes which you would normally would use to create an LLVM pass if you were writing it as a part of LLVM. We cannot really use the API here although it's stable and that would be great for us but it's really limited in terms of what it allows. So basically this API is mostly designed for front ends to create LLVM IR and then to push it through the rest of the LLVM infrastructure. But in our case we actually want to modify this LLVM infrastructure to serve our needs. So this is one thing and the other thing is our open source model here. So IGC is as I said before part of the graphics driver and for the graphics driver we do both close source and open source releases. And the same goes for IGC. So basically there are two flavors of IGC. There is a close source IGC that contains some proprietary code that is not available in the open source IGC. And there is an open source IGC which is freely available at any time. Here is a GitHub repository if you'd like to see it. So we are also part of some of the GNU or Linux distributions like Archibb onto or Debian. So this is how this works from the distribution point of view. When it comes to development we sadly develop close source first. We do have... So when we commit some changes we basically do this through the close source repository and then we have some infrastructure that takes care of replicating this for open source. So we basically have a system of annotating parts of the code which are sensitive and shouldn't reach public IED. And this script basically cuts them off and then we also have some CI infrastructure that checks in every framework. And this allows us to generally have both close and open source repositories that are pretty much in sync when it comes to latest changes. So this is the picture of how everything looks like around IGC. So why do we need to support multiple LLVM versions in the first place? So as I said before we are striving to be part of open source distributions. And this means that we have to fit into the distro. So basically all the Linux distributions have a set of LLVM versions that they support out of the box that they already have in their package managers. And what the maintainers of the distributions really want to see from the packages that come into the distro is that they use what already is there. So that we use already present LLVM versions. So this creates this pressure for us to keep updating LLVM on our end and always to be able to compile the latest version of LLVM that is supported by the distros.we target. And on the other hand we have this close source windows version of our compiler which is still on LLVM 9. And there we doesn't really see this kind of pressure to update. And because of that we mostly see drawbacks of updating. So each LLVM update brings to us both some performance improvements and some performance regressions. And well we only want the latter, not really the former. So we have to always come and remove the regression and this creates additional effort. At the same time the improvements doesn't really seem to feel worthy of that. So on the windows end we see a lot of less reason to update. That's why the updates there are slower. And this creates this kind of discrepancy when we have a range of supported LLVM by a single code base. So how do we do that? So when you try to support multiple versions of a C++ library similar to LLVM, a number of issues comes into play. So first of all there are changes in optimizations that... So the changes in LLVM functionality and how the things are done in there that create changes in the performance of the code generated by the compiler. Those kind of changes also occasionally cause some bugs on our side. For example when new intrinsic or a new IR instruction is introduced we also have to take care of it and handle it on our end. There is for example this pass that combines many PIP-Hole optimization which is called Instruction Combiner. And this is probably the most, the singular most problematic pass in terms of supporting multiple versions because it brings different results each time. And it does it to the extent that at some point we decided to just fork Instruction Combiner and keep it on a singular version and not really use the upstream one. But really those cases are hard to generalize and we usually have to work on each of them and then create a fix on our site. But what is much more easier to generalize in our example, in our case, is the API changes. And to do that we do have a part of code that we call LLVMRapper. And this is basically a small wrapper like library that provides a single unified API that allows us to call functions from LLVM9 to 15 without a single interface basically. So the biggest benefit of having that in one place is that we don't really have this bunch of if-deaf scattered around our code. And when we come and when we start to abandon the old versions and don't support them anymore, we just can go remove a singular call from the wrapper library and then just change this call in our code and it's very straightforward process. So if you're interested how the wrapper looks like, it's available here. It's a headers-only library. So the conventions that we use when developing this is that we try to mirror the LLVM header structure as much as we can. So it's fairly obvious where to look for the calls when you try to add them or when you try to remove them. And the other important thing is that for each wrapper function that we add, we try to follow the signature of the function as it comes from the latest supported version. So in the future when we can drop the wrapper, we just can replace it with a call to the new version of the library. So how does it look like in practice? So for most cases really, it's very simple and we just see places where we get functions, we get methods that change their name a little bit or maybe change the arguments that they receive. So most of the cases are like this. Sometimes, unfortunately, we have to go on and try to add something which is not available yet. This is one such case when we basically had to pull an implementation into our headers so we can use it temporarily for various versions of LLVM. In the worst cases, it can be whole classes, but I don't think there are many of them. I think this is the only one that we have right now. And when it comes to types, it usually can be handled by a simple type alias, so it's not that much. So how does it look like in terms of numbers? So right now we have 52 headers including 224 wrapper functions that allow us to seamlessly support LLVM from 9 to 15. As you can see from this table, so this table basically follows the header directories that we use. And as you can see, most of the wrapper types or functions are contained in the IR directory. This basically shows that this is the directory that we both use the most and probably the one that is mostly being developed upstream. Hence the DDDD changes. So what about the future plans? So as you may have seen, when I was talking, I was referring to LLVM 9 to 15 and we already have LLVM 17 released and we will have LLVM 18 soon. So when it comes to updating beyond LLVM 16, we have the small issue of the OPAC pointers. So basically the OPAC pointers change, this was a big change that was introduced with LLVM 14, but back then it wasn't mandatory. But with LLVM 16, it became the default mode of operations for LLVM. And the changes work like this, that it basically removes the pointy type information from the pointer passed in the LLVM IR. So we have this pretty simple code that doesn't involve any additional type annotations. So the unfortunate fact is that we heavily rely on the pointer information in our code base. So there are plenty of places where this information can be either, so where the type pointer like dependency can be fully removed because we have the information available elsewhere, or perhaps it's not even needed, but it was used there for some other reason. But for the GPU targets, we have a specific issue that we have some types that were really good to be modeled as the type pointers with the OPACs track type. So for the GPUs, there are things like samplers, images, etc. And these are generally handles that are passed from the runtime through the kernel code and then are given somehow to the hardware. And well, so we really have to know what is coming inside this type of pointer. So we really have to know what it is because we have to emit the right instruction. So depending on sampler, depending on type of image, this will be resolved to a different kind of load, or maybe not even a load, but something more sophisticated. And well, this information is really needed there and we couldn't really get it with the OPAC pointers because we only see then a pointer type that is passed to the kernel. So this is the entry point for us and then is somehow used, but we don't really know what this pointer represents. So this type of change obviously cannot be handled by the mechanism that I showed before. It cannot be handled by a wrapper. So we cannot conjure the pointy type information from the OPAC pointer, at least not always and at least not like for free. We could do some analysis, but this would cost us the compilation time, which we cannot really have. But fortunately, there are already some changes that has been introduced to LVM and I'm talking about it. LVM API, we will be still using at this point the type pointers. From there, we could go to LVM 16, but then we would be doing LVM transition and also starting to go from type pointers to target extension types, which we don't really want to do at the same time. So we decided that we will take this target extension type patch. It's actually a pretty small patch that is quite self-contained, so it's not that easy to port back. So we are going to port it back to LVM 14 and use this version with this patch to slowly move to the target extension types. And from then, when we are happy with our results, we will move on to LVM 16 at the same time, dropping support for everything that we had before. So coming to conclusions, it is actually possible to support multiple versions of LVM. It is a little bit of pain, but on the other hand, if you look back at the numbers that I displayed before, the wrapper isn't really that big. So it's like 200 wrapper functions, but when you think about all the functions that LVM is exposing, there is really plenty of them, and this is a really small percentage of that. And also, you know, it's distributed in time, so it's not like you are writing 200 functions up front. You are just slowly adding them and removing over the time as the window of supported version changes. So this shows that to some extent, a stable IPI perhaps could be created if there was such need in the community. But even if such an API was created, it would be probably pointless, because when you do the engineering the way we did it, you in the end still have the risk that you will come to the big changes like the Opaq pointers, when you will have to basically come and switch to the newer version with dropping everything else. So, well, that's mostly what I had prepared today. Thank you for listening, and I'm open for questions. Yeah, I know, sorry. Please go on. Did you consider using the old VMC API, which is supposedly most stable? Yes, yes, yes. So the question was if we considered using CAPI, which is stable. And yeah, we did, but the CAPI doesn't offer what we need. So, sorry, I'm asked to repeat the question. So the question was if we considered to use the CAPI. Yes, we did. Unfortunately, the CAPI doesn't support what we need. So CAPI allows creating new IR, so basically building IR from the parcel tree state that you would have. In our case, we don't really want to do this. We already have LVM IR provided by a SPIRV translator in our case. What we really want to do is to modify this and create our customer LVM passes. That's why we need the C++ API, unfortunately. Please go on. Can you mix up simple VM persons like say you have this product, this is the LVM9, or do you use those? And there's about there, like how do you do that? Sorry, can you speak up? I cannot hear you from here. Can you mix up simple VM persons like if you have LVM9 on your big old private VM, there's about the LVM9, how do you deal with that? So how do we do that? We fix them for the version, sorry, I was forgetting about repeating the question. So the question was how we deal with the bugs that are in the older versions but are not in the newer versions. So and the answer is that we fix them in the version that we have, in the version in which they appear. And if the fix is applicable to the newer version, obviously it's not a problem. If not, we will have to somehow wrap it up, but such cases are really rare. So usually we don't really see any things that are wrong in the LVM code. The code is usually what is wrong in our code and we can handle that on our side regardless of the LVM version. Maybe I misunderstood before, but were you saying that you were going to phase out the older version? So the question is if we are going to phase out older versions. Yes, we are going to phase out older versions because we are not able to support them and LVM 16 at the same time. So I mean technically it would be, but we decided that the cost of supporting both OPAG pointers and type pointers is not really worth it. But then the distributions that you were supporting previously or the Linux? So the question right now is about distributions and if they will see a dropped support. So first of all they still have the versions that were already released that cover the older LVM versions. And when it comes to the LVM versions, LVM 16 is not really cutting edge right now, right? We will have LVM 18 in a matter of days I believe. And well, so we are a little bit backwards, right? Have you considered assuming your backend for your custom ISA or is it like the closed part? So the question is if we have considered upstreaming our backend. So yes, we did. It's not really in an upstreamable state right now I would say. So basically, so it's all open source, you can go and check how it looks. So we basically have a very thin pass that is emitting the VISA code, the other IR. And then we have a separate library that is handling the VISA for us and is written in a completely different style than LVM etc. It's not really good fit to be managed with LVM. But yeah, we are seriously thinking about upstreaming and upstreaming as much as you can. And basically like this presentation and the fact that we had to put so much effort shows that it's really worth to work in upstream. So going a little bit in the direction of the earlier question is why do you need to support multiple versions of LVM if you basically have to shoot your own anyway because you replace and combine everything? So do you basically have your own LVM for when you complete your 4B and you actually use the system LVM and just replace it? So the question was why we want to support multiple versions and if we have our custom for the VM that we ship or if we use a system one. So I will start from the end. We are using the system provided version of LVM. That's why we want to support a range although a very narrow range in the open source. And going back to LVM 9 comes from the fact that we also support the Windows compiler which is on the older version because of the effort that comes with the performance regression that the upgrade brings. Sorry, I didn't get the question. Yes, yes. How can you do that if you use the system provided one? Okay. So the question was, so I mentioned in the slides that we are going to patch LVM 14 to support TAT and at the same time I said before that we are using the system version of LVM. So sorry, I haven't explained that pretty well before. So basically the patched LVM will be only internally for our development and when we are ready we will just say that we are supporting LVM 16 right now. But you know this patch version is only to ease the development on our side. It's not a release version. One possible way around to bind new things to distribute your modified version of LVM or at least package it all up as one. Yes, you are going to have two copies of LVM libraries but that might get you out of the hole. Yeah, so the question was about having distributing binary more than one copy of LVM. So this is an option and this is probably something that we will have to do occasionally but this is also something that we don't really want to do because we have, believe it or not, a tight budget for size of the graphics driver. So you know, you have this graphics driver. It is a very big blob. It is like 200 megabytes that the user don't know from time to time and really the marketing is pushing us to make it as small as it is possible. So even a megabyte is a big difference in this case. All right, I don't think we have any other questions. Thank you. Thank you.
elfconv: AOT compiler that translates Linux/AArch64 ELF binary to LLVM bitcode targeting WebAssembly
that transits Reax-664L binary to LNBM bit code targeting web assembly. So first, I will explain what is web assembly or wasn't for short and why we use wasn't. And wasn't with a virtual machine instruction set and currently this is used on servers or as well as browsers in production environments. And compared to existing applications, there are mainly two features, portability and security. And portability wasn't enabled us to run applications on both browsers and servers without modification. And of course, wasn't dependent on CPU architecture so that we can run wasn't applications on computers with various CPU architectures without modification. And in security, in the case of outside browsers, wasn't is highly isolated from the host kernel by Washi. And Washi is an API that provides access to similar OSR like features. For example, for systems so gets and so on. Yes, yes, and, yes, and Washi is implemented by Washi at times. For example, wasn't time and wasn't H and so on. And was was was Harvard, Harvard architecture designs and so the memory of the wasn't was wasn't instance is which clearly separated into right now data memory and memory and code of was can access only right now. Data memory and which include increases security. However, there are some limitations in the capability of applications. And first, wasn't can, wasn't can jump to only the course that are determined at compare time and in other words, it is impossible to indirectly jump to the code generated in the data memory. And second, was she implementation doesn't cover all projects API, for example, folk and exec and so on. So when you develop wasn't applications, you should consider the limitations. And now, many programming languages support was, for example, C C plus plus plus and go and so on. And however, it it isn't easy to build was in some cases as follows. And mainly there are three cases for us the programming language that you want to use doesn't completely support wasn't. And, and currently many major languages have begun to support wasn't, but only limited number of languages are available in production environments now. Second, binary is available, but source codes of the binaries are not available. And recently the number of op source, op source programs has increased, but several, several programs are still not published. And third, the case of time consuming to building the environment. And if the dependent libraries of the target program are not maintained, you might be not able to build the libraries. And in such a case, it might take much time to build. So next, I show existing projects that run in X binary on wasn't. And the first, and the first project is tiny mu. And this is the X86 and describe emulator available on the browser. And, and the next kernel can run on the browser. And so, and the second project is counter to wasn't. And this enables us to run the X kernel and counter, run time with emulators compared to wasn't. For example, tiny mu. And, and, and this can, can run, well, counter us without modification, but it can run with the same amount of time. And, and, and this can, can run, well, counter us with modification both on the browsers and wash-around times. And, but, however, these projects, these, these projects use PM on emulator that compiles a, a, relax, describe 32 L binary to a, a, several binary formats. So next, I will show the demo of L, L comp. Can you see? Watch. Okay. Okay. Thank you. So, and, well, well, I have prepared the counter image for the L comp project. And, and, and now in, in this terminal, the container of L comp has already started. And the target sample L binary to be converted is examples, L-sensitive, L-thousand. And, and this program outputs 100, 100 prime numbers in ascending order. And, okay. So we, we try to compare this L binary to was with L comp. And in the directory, there's a one file L comp.sh and L comp.sh is used to try to L comp to compile. So, and, okay. Okay. So, and target this was browser and L comp. And target this was browser and L comp. And target this L-sensitive L-thousand. Now L comp comp. Okay. Great. And, and serial files are generated and we can execute the was application with MS Gryffin. So, run. The browser. Okay. So the cyber was the was application has started now. And, okay. Wait. You can see the, I'll put this correct in the browser. Okay. So, okay. So now let's return to the presentation. So, so, so in compiling L binary to L and B and B code two, two modules are used. Okay. First is L comp lifter. This process L binary and maps every section and operates the next module. And, and is a library for lifting machine code to L and B and B code. And, as this figure shows L comp comp, L binary to L and B, L and B code with these two modules. And next, I will explain how L comp comp, L binary to L and B code and was binary. And, and ramming converter function to one L, B, M, I, R function. For example, as you can see, the, the, the, the, the, the square function one of the machine code is combative to the underscore function one and the square lift function is a L, M, B, M, I, R. Yeah. So, and also one CPU instruction is combative to one L, B, M, I, R block. And as you can see, the machine, the instruction of move X to X zero of the machine code is combative to one underscore move. Yeah. Okay. Okay. So next, I will explain the details of the combative, combative L, B, M, I, R block from CPU instruction. And there are three steps in the combative L, B, M, I, R block. And the first step is the program counter calculation. And this here shows percent 29 is a program counter of this instruction. And the next piece is updated to the next program counter. Yeah. And the second step is open calculation. And in this here, this, this instruction uses X seven and X three, X seven and X three registers. And in the open calculation, the X seven and X three is loaded. Okay. So, and the third step is calling the, calling the function of the instruction-specific operation. For each CPU instruction, RAMU generates a function that performs the instruction-specific operation. And the corresponding function is called at the end of the LM, I, R block. At this end. So next, as I explained in the beginning, the code of the quantum can indirectly jump to only the code that is determinable at compare time. And this figure shows how to deal with indirect jump VR instruction. So in this figure with VR X seven, indirectly jump to the instruction of move X eight and X nine. And in the error, the VR instruction. And the address to jump is 30% IDR and jumps the block of percent error on the square IDR. This is the VR instruction. Okay. So, and after jumping, R on the square IDR, we get the target rubber by calling the getR function and, and, and start to pass the VR. And after that, with the VR instruction, we jump to the target block. And also in that VR instruction, it requires all candidate labels as the argument. And this is, yeah, and this array consists of all labels in this function. And, well, but in the current design, the array of candidate labels includes only the labels within this function. So, and, and Elf Comp doesn't, doesn't support the jump and long jump now. And that is a future task. And next in converting the LNB bit code to a wasn't statically links the LNB bit code and Elf Comp runtime. And Elf Comp runtime includes the mapped, mapped memory of the original error binary. And that is stuck in the heap area of the error binary. So, and also Elf Comp runtime includes the program of the system called the emulation. And existing compiler, for example, M script and Washi CK compiles these two modules to wasn't. Okay. So, and in the React system called emulation, there are two ways of implementing the emulation. And the way of implementing depends on the RibBushy implementation. And in case one, if the RibBushy implements the tag system call, Elf Comp just uses the RibBushy function as shown in this figure of the right system call. Okay. And in case two, if the RibBushy doesn't implements the target, target system call, Elf Comp should implements the system call. And as shown in this figure of the not used PRK function in this code. So, it implements the system call. Just watch out. Okay. So, next, I will show the parameters of the generated binary, parameters evaluation. Okay. So, and the target sample F binary is a simple prime number calculator. And this program computes all prime numbers, lessens the input to integer. So, and one thing to notice here is that in this program, the evaluation are using X H6 under square 64 binary instead of the wasn't binary. Because in the current implementation, the system call emulation for wash-down time is insufficient. So, we use X H6 under square 64 as the output binary for benchmark test. I'm sorry. So, and the comparison method is QM emulation X64 to X H6 and square 64. So, we compare QM emulation with binary LD compilation. So, and I measure the power months in two cases. In the first case, the input integer is 10 million. And the second case is 15 million. So, and the power month evaluation is as follows. So, and as you can see, in both cases, one and the case two, LD compiling by LFCOM is 4,000 QM emulation. And therefore, we can say that LD compiling is 4,000 QM emulation, at least in some cases. So, okay, so last, I will show future works. And first, we will support the output of other binary formats. And currently, LFCOM supports the output binary of only Wazm and LFX H664 binary now. So, we will support the output of other binary as output of mine. So, second, we will never as compiles LF binary of other CPU architectures. Now, LFCOM can compile H664 LF binary. Yeah, so, yes. In the future, we will support other input binary. Okay, so, and third, we will, we will, we will append system calls emulation. And now LFCOM implements a part of system calls and a lot of system calls are not implemented. So, and specifically, when targeting Wazm as the output binary formats, some system calls are difficult to implement. For example, for exec and so on. Yeah, so, so, so, I think that implementing that system calls is very variable. So, and fourth, supporting dynamic linking. And now LFCOM can compile the static, static, link, LFU binary. So, and that's where that dynamic linking is an important function and will support in the future. Yeah. Okay, so, and fifth is the promise analysis of the Wazm targets. Yeah. Now, I measure the promise evaluation under the H664. So, I shoot promise at the binary of the Wazm targets. Okay. Okay, so, and the sixth is making LNB bit calls generated more efficient. Yeah, and, and so, yeah, okay. So, in the current implemented. I translate that to Wazm 32. Sorry, well, I, sorry. I, the 32 bit x86 platform. So, I think that the H664 L binary is mainly used in the world. So, I think the support of the L binary H664 is a big influence. I think that's. Yeah. I take the top. And you consider using it instead of Remila revenge, if you know revenge. I'm a core developer of revenge, disclaimer. So, I'm sorry. Could you, could you, if you have a question, more sorry. Yeah. Remila is a tool to leave from executable code to bit code. Yeah. There's another tool which we developed, which is called revenge that does something similar. Maybe have you considered that? Are you interested in that? I don't know. Sorry, I could use a more, sorry, sorry, sorry. Yeah. Is it an alternative to, revenge is an alternative library to, Remila. Have you, have you heard of that? Well, have you heard of the revenge library? Sorry. It was just saying, if you'd heard of the revenge library, does something similar to Remila. It sounds like you haven't heard of it. That was, that was my interpretation of the question. I think that will fly. I'm sorry. Yeah. When you measured, you did a performance between Kwemu and Elfkong. Yeah. Was that like the, what did you measure there? I didn't understand was it the compilation or running? What did you measure in that performance? Well, comparison performance. Yes. So, and, okay. Yes. Basically the, component performance is of Elfkong. It's very long. But in this project, in this program of the sample, sample F binary, about, about it takes one minute for the compiling. Oh, yes. It's the compilation that is faster. Or is it the running of the thing that is faster? Is it like, I don't, I don't understand. We're measuring the, the running, like the produced results for the compilation. Like which? Like which, I'm sorry. And, I guess that is for like running. That, yes. So, by Kwemu, it runs, it runs like the, it has the, the G that turns into the native code. Like it has the, ahead of time, or you have ahead of time compilation for that, what, what, what's that you run on a browser, right? Later. So, are you looking at the performance running on the browser here and comparing that to Kwemu? Or are we looking at that, like some compilation item, just understand like what are we comparing? Sorry, so, could you ask after the presentation again? Sorry. I'm sorry. Thank you. Yeah. So, you compare the performance with like emulated ARH 64 versus a X86 binary, binary. Have you also tried, like after converting this with, it was Alphcon to convert it back to ARH 64 and benchmark that against the original source? So, like what is the overhead of one, like lifting it and? So, the question is overhead of the lifting, binary lifting. Oh, yeah. So, yeah, so, and in the program of this performance evaluation, the performance overhead of the lifting is very small. And maybe it takes maybe three or four seconds to compare the lift, the binary to L and B bitcalls. Yeah, but what I meant is like, if you combine the big bitcalls back to the original architecture, so how is the performance of that binary compared to the original binary? So, you say that from L binary to the target architecture binary, the performance overhead in the LB bitcalls to target binary. Oh, sorry. I just follow up on that. So, if you just, I will just drop in directly, but from experience.
Map LLVM values to corresponding source-level expressions
Yeah, it's done. Yeah, well, we're about to start, so meet your Euro. Yeah, really. You're already on it? Really? I don't think so. Already on it? Yeah. You think it's on it, right? Yeah. It's on where do you want it? Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Hi, everybody. My name is Shivam and I work for KDAB and I also work this summer, Google Summer of Code with LLVM, working on this project, this mapping LLVM values to the corresponding source level expressions. And, but why? So the challenge of understanding the compiler optimizations. So, so compilers are basically performing different sorts of optimizations and it's not always possible that it's going to be optimized or basically vectorized. So, we basically our motivation was vectorization for the first because we wanted to include those optimization remarks in for the vectorization part. So, our motivation was vectorization. So, it's not always possible. So, your compiler can always vectorize your code. So, there's some sort of data dependencies there where that's why your compilers cannot actually vectorize all the time. So, on that cases you have to emit a good remarks and I'll let you know what currently clang actually generates as a remark. So, understanding why and how these optimization occurs is not always straightforward. Even the authors or vectorizer don't know what's going on if the vectorization didn't happen. So, consider this example. So, you can see there is a data dependencies between A of i and A of i plus 3. So, this loop clang will not be not able to vectorize this code. Okay. So, see this remark that produced with the clang which is that loop not vectorize. You can use pragma loop distribute. So, you can compile the tries to distribute the loop and it might be able to vectorize in some sense. But, just see the remarks. It's not clear that what actually gone wrong here and where's the data dependency. It's not telling you that where's the data dependency actually was and so you can improve the code itself. Right. So, it just that remarks and you can see this not actually clear. Right. So, it's a bit unclear and if you can have such this remarks nothing much just two expressions are the dependent source and the dependence destination for example. So, you know that okay there is a data dependencies between two to these two locations and so if you are aware of the code so you are going to modify your code and you might be able to modify in the way so you know that it will be possible for the compiler to vectorize now. So, you can modify the code by looking at this these expressions for example. So, yeah so it's going to surely enhance the remarks include the exact so if it includes the exact source and destination of the dependencies within the error for example and it will pinpoint the lines of those dependencies and let's look at the impact of these enhanced remarks so it would be clarity for the developers so they can quickly look where the dependencies are actually occurring and so they can improve their code and probably make it vectorizable and efficiency in the terms of that they can save time and improve efficiency by reducing the need for the deep debugging that where was the actual data dependencies so you can just look at the optimization remarks and you get the quite a lot of information that okay there is the data dependencies between two load and stores. So, let's look at the approach that we took for solve this problem or this project so approach was very simple to just utilize the debug information that available into the intermediate representation right so to recreate the variable and the function names lost during the optimization so the optimizations are actually a problem in our case because we currently don't know how to build those for example instructions that's lost into the optimization for example so if you see a multiplier if you see a MUL instruction in the IR so compiler might optimize that into shift left so the MUL was the original information that was actually available in the source code but right now we have shift left so we just lose the context that what was the actual source level information so that's still a problem for us and we have different approach for that but we didn't see to be so we see that it was not much a performance good so it was very bad so we wanted to clone the module so we have so we can look that what's happened after each optimizations so we can have the clone of the original IR and we can see that what's going on after every pass of optimization or every implementation passes so if you look at the different transformation pass and you have to look over that what's the thing that gets changed and anything that you have stored that okay there was the original instruction as MUL but right now it gets changed to shift left so you see that okay the MUL gets changed to shift left so you have to cache the expressions basically to reload the things in a new way so if there is no need for that so we will proceed after it so let's see how to utilize that information that available in the IR so LLVM uses a small set of infernsic functions if you are aware of so those are provided for the debug information so they contains different metadata so they have different arguments so these infernsic functions are named prefixed by this LLVM.debug and these things helps you to track the variables through the optimizations and code generation so if you look in the IR so if you dump your IR with the dash G symbol so you will get to know about the LLVM.debug.value or .declare so those contains everything related to the source level things so they contains the metadata information and the metadata are there for that so they can give you a lot more information about what was actually in the source so for example like variable names so when you trace the metadata so you can get the variable name from the actual source so for us these two infernsic functions were very important the debug.declare and the debug.value and let's try to understand a bit so if you can see the I is allocated and just below it you can see the call to the infernsic function which is .debug.declare and these you can see the three arguments in the infernsic function the first represents the first will always represent the address of the variable the second metadata always pointing to the for example you can see the DI local variable so which contains this is a metadata node and it contains the variable name so what was the actual name so you can see the actual name was I in the actual source expression so you can when you trace back the information so you can retrieve the name so these are these infernsic functions so the second can really help us a lot and the second was actually the source just the source information like name and the third argument is DI expression and generally DI expression is useful as a complex expression so you have expressions like for example int a is equals to b plus c a DI expression can hold that stuffs yeah and yeah so debug declare is used for that and the second is debug.value so .value is very similar to it it's just that when I is gets updated when a value is updated so debug value can up and so everything goes in the debug value for the same so we now have enough information to at least give a try to build the source expressions and only if the code is compiled with the debug info on so it's compiled using the dash g symbol so we use the we are infernsic as a bridge so our focus was on memory access and vectorization as I said so importance of the memory access pattern is so we really want this project for vectorization at first and then we also have a plan to give it a push into the debugger so debuggers can utilize this information to provide better remarks but the main goal initially was the vectorization pass vectorization is a transformation pass so a transformation pass can always query an analysis pass and what our work is our work was an analysis pass so this vectorization pass in LLVM can always query the analysis pass this transformation pass okay so project contribution is actually that we have we have built an analysis pass that can generate these mappings and provides a better remarks for the case of vectorization or any other things that requires it let's look at the implementation detail so for us the point of interest is load and store instruction because of the vectorization because we want to analyze the memory access patterns to actually emit in the vectorization so which is useful for the remark and for example just take a look at this C code and if you compile this with clag or to dash g and if you emit the LLVM just for showing you that what's going on so I think it should be visible so you can see the call to in intrinsic functions debug.7 so we can build these expressions from them so as I said so if you look these were the first is to multiply n1 but and we compile it with the optimization on so the multiply instruction gone away and it just updated by shift lift operator okay so that's why you can see the shift lift operator here and not multiply so that's a problem so that's a problem of accuracy of the expressions because we still not have a good plan of how to accurately generate the expressions because a lot many times these things gone away because of the optimizations on and it's always been a hard problem of how to actually debug in the case of when optimizations are actually gone so it's a classic problem which we still have to look so you can see the we can build these expressions from the instructions so yeah you can see this from example that computing the equivalent source expressions of interest will involve walking through the LLVM IR and using the information that provided in the debug in forensics so even though our current interest is load and store but we still have to build for every instructions because those load and store can contain any sort of an instruction when you trace back to them it's maybe any binary instruction it contains gap instructions so it might be contain any instructions and we have to we still have to build for them too and as I said that optimizations make it impossible to recover the original source expressions so as you see that for example the 2 into N1 is optimized to N1 left shift 1 so which is recovering the original expression may not be possible every time so let's look at how we proceed it's just a basic algorithm that I just want to go through so we started by traversing the IR so we have we started with the traversing the IR we identify the operations that were there for example load and store source or main so current interest is load and store instructions so specifically look for those instructions in the IR and then trace those operands it might be any other instructions it might be inside any metadata so we can retrieve any information like name and utilizing those metadata information building the source expressions and then we reconstruct the expressions with all the information that we get so that's a bit all but just look at the current state so it's still not yet upstream to LLVM the PR is here so what I need from you or anyone you have experience or anyone you have active in that zone of optimizations or for example analysis passes or transformation passes in LLVM so I do like you to have a look at the patch if you have some experience try to review the code as well and give some feedback so we can proceed with much detail because it's still it's still a new analysis and still need a lot of work for struct as well so as I mentioned we need more review on the patch and some active work from me as well and if any of you interested please reach out and as I said the struct pose a unique challenge to when we try to build the expression for struct for example it was not quite easy it was very difficult to build the expression for struct because they pose a unique challenge if you see them in the intermediate representation it's very weird to look at them because I don't know how they actually gets there in the IR how they represent it it's not as simple as the giving expressions for an array so struct is still a problem accurate source level expression in case of optimization it's still a problem and there isn't always one to one mapping between the source code and IR so if you see that we still don't know what to do in these cases if you see there's a pointer and the PTR 0 so IR can generate the same code for these two patterns and we don't know which have to pick so that's still a problem one solution for this was that debug information also contain the information about the file so there is a DI file in the debug info so what was actually we were discussing is so we have still information about the file path so what we can do is we can actually open the file and just go on that line and retrieve the information what was the actual ground truth just look at that second thing was fall back and anything because we don't know what was there so just fall back to any of them so but the DI file was actually quite easy but it's not good performance wise if you see open the file and just retrieving just going on that line and retrieving that so it's not good performance wise so yeah that's it for the talk and thank you for listening and if you are interested in letting knowing more about this project and the algorithm please reach out to me on mail or for example discord so yeah thank you any questions yes why do you need to rebuild the entire sequence of expressions for each of the values why not the specific value of the value it is what the dependencies and the production line from the file just like after a year ago can you just free us the question so you know like when you admit after marks there's a tool called the operator that just put everything in line so between that and what you have here it seems that you know you would get excellent results in terms of debubbability if you just did what the operator does plus you specify which of the values are causing the dependence that what you said and what the reason for the failed optimization okay so the question is basically about using opt viewer right yeah I was just admitting a more limited view as you have here and not trying to reconstruct everything so we still not reconstructing everything for example so we still we still not focusing on creating whole I at or just mapping whole I at the expressions we still focusing on those loads and stores as I said so we still focusing on that right yeah yes yes yes yeah so we we still we still picking up the load and stores so if we see that is there any gap instructions because gap instruction actually contains a chain of instructions so so but we still have to build the loop for loads and stores and opt viewer still not good at emitting those remarks it's still very abstract in that sense if I remember it correctly so I'm not sure how to go with opt viewer but we still making it for load and stores and tracing back the information yeah yep yeah yes nope nope not sure but one thing I can guess that so basically opening a file is not something which is very good performance wise and and just choosing and just going on that line down because you can see that there could be a multiple lines of code in code base so you have to go on that particular line so it would need it would need it would be very bad in performance was I think okay and there is no and there is no theory about if it would be more beneficial to tell the programmer the error that or the sub optimal choice that you made was between line 27 28 compared to generating some arbitrary complex expression that might not be representative of what the program originally wrote I'm not sure okay okay yeah I think it would be fine then if you if you're choosing for emitting a remarks then then you know that this is not good performance right so if you want to look at the performance if you want to look at the actual correct remarks so you have to going deep in the performance thing then yeah then it would be possible and we have also we also talking about preserving the metadata in LVM as they go through but but in LVM metadata is designed in a way so we can drop we can drop at any time so we still cannot preserve the metadata information so it's still a challenge this lot move what going on on this side so yeah yeah yeah okay yeah thank you yeah okay thank you for joining when you leave make sure to take everything yeah yeah yeah yeah
The Matrix State of the Union
Okay, so this is the Matrix Dev Room on FOSDEM 24, I guess, in case you are in the wrong room then take the chance now and leave. And we have an afternoon packed full of information about Matrix. It's only an afternoon, so if you want to look up more information in the air is an internet full of it. But if you are lazy and don't want to collect all the information yourself then there is this wonderful people, they collected the information for you and will give you a presentation now about the state of the union. Matthew and Amadin give them a warm welcome and the stage is yours. Thank you Jan. So we honestly weren't sure what to talk about because if folks came to the main stage talking Jan's in this morning, basically the first 25 minutes was the state of the union of Matrix. So we have a bit of a question mark on the subject here. Also Jan just promised that we will transfer the contents of the internet into your brains which we also hadn't really prepared for. Anyway, basically if you don't know who we are, I am Matthew, I am the technical co-founder side of Matrix, day job CEO at Element. Amadin, the non-technical co-founder side of Matrix and day job COO at Element. But we would like to at least try to tell you something new about what's going on here. And actually realized that we have never done a brief history of Matrix which begins before many of you were born in the year 2000 and free. Now seriously, the actual backstory is a bunch of us were at the University together at Cambridge and we were messing around and instant messaging on a project called Project Foxtrot. Now the idea of Foxtrot is that it was written in Java 1.3, new fresh off the books or something at that point in the late 90s. And what it did was to serialize hunks of Java, put it over TCP sockets except it was end to end encrypted using manually written Diffie Hellman RSA exchanges. So that is where I at least got the bug for Matrix and instant messaging and after we either got kicked out or left or graduated from Cambridge we ended up working at the little company doing APIs for the PSTM. So that's 2003. Fast forward rapidly to 2010. Well, my company were doing mobile app development and Matthew's company both got acquired about a month apart by a big telco vendor. You would find them in the depth of AT&T doing all their billing systems. So small startups having fun getting into a very big company. After a few years of rattling around inside Amdox, I'm not sure why we're not mentioning Amdox by name but it was Amdox, we discovered a new found desire to burn the phone network to the ground and annihilate it and replace it with something that would be open and decentralized and federated that anybody can join rather than the cabal of the phone companies where it's almost impossible to connect into them. And so that was where the idea of Matrix came from. We basically took the combined folks in Hren and London went to Amdox and said, hey, a little bit of a crazy idea but what if we build an entirely new communications protocol and if we pull it off, then you, my friends at Amdox, can go and sell it to AT&T and many other big telcos and you can replace the PSTN. And meanwhile, at the same time, the rest of the world would get a big benefit of the existence of Matrix. And amazingly, they said yes, with no strings attached, they allowed us to go and switch the business unit from selling clones of WhatsApp and Skype to telcos to instead building out Matrix. And that's what we did from starting depressingly in May 2014. So we are a couple of months off having been doing this for 10 years. Not sure whether that's something to celebrate or not in the grand scheme of things. What happened in 2014? So 2014, we all gathered in Hren and sat down and had a big brainstorm on how this thing would look like and ended up with mostly what Matrix is today. Not much has changed in terms of overall idea and architecture and these sort of things. We started in May and the goal was, September 2014, we're going to launch this. Like four months to figure out a high-level working Matrix proof of concept. And we did it. Yeah, it was a disaster really because we rushed incredible speeds. It was like the best gig possible that your day job is suddenly told you and all your mates at work that you can suddenly go wild to create something like Matrix and everybody's sprinted in slightly different directions naming no names. We might have ended up with three different versions of synapse at first. We had the bit that taught the client server API. We had the bit that spoke the server to server API and we had the bit in the middle that sort of meant to funnel stuff around the place. Each one had a different database schema. Each one had a different object model. They were all written in Python which was honestly a win but it's possible we might have sprinted a little too over enthusiastically into this and spent about six years playing off the technical debt that we accumulated in those three months and then ran up to launching synapse. Worth noting the end-to-end encryption wasn't there on day one but we did start in 2015 and we always designed as part of the protocol because if you are going to replicate data equally over many, many home servers, obviously it needs to be end-to-end encrypted such that if one gets owned all the messages don't go out the door. Then in 2016 it says we launched Element. I'm not sure I pulled the slide from but it definitely wasn't called Element. Basically when we launched we were using Matrix console at the beginning and then we said okay we need a very glossy app to actually drive the usage of this. We launched something which became Element at some point but initially was called Vector. What was the second name? Okay let's quiz the audience. What was the second name of Vector before Element? Yay well done and here we go. Element now is the flagship client and still growing there. Eventually in 2017 we set up shop as properly independent both like we started with the commercial company Element and also a bit later like in 2019. Yeah I think technically the foundation I think was incorporated in 2018 but we didn't do anything with it until 2019 to try to make sure that there is a clear split between Governance, the open source project and the protocol versus us practically trying to fund the bloody thing, Element running around, doing commercial stuff but that was the point where things started to split properly into your classic open source foundation versus startup trying to build stuff on top. We eventually turned on and went encryption by default in 2020 alongside Matrix 1.0 which I guess was June 2019 and then fast forward to 2023 when we announced the idea of Matrix 2.0 as showcase development X last foster and here we are today in 2024 the year of mainstream Matrix. Who knows maybe if you saw the DMA bit of the talk earlier it may or may not be getting the but yeah we'll talk a bit about it and Travis afterwards is going to have an amazingly very deeply technical talk all about everything you wanted to know about DBA. DBA? Well you could do one on DBA or you could do one on DMA up to you. I haven't asked permission for any other people in this photo to put this photo up but this is the original Matrix team on our way to REN from the London side playing Magic the Gathering or something. No it was all of us. Yeah that point the front and side is all in REN. Yes because we hadn't got to France yet. We're literally at some crappy travel lodge I think in Luton or Gatwick or somewhere on the way through to REN and so yeah basically that was the vibe at the beginning of Matrix back in May 2014 and more of a vibe is this which was the whiteboard in the Jupiter project room in the offices in REN where we basically drew up the possible architectures that we could use for Matrix. You will notice that there are four if not five architectures here the simplest one is just client to server to client and this was almost just mapping out the various options we had on the table but at that point we hadn't really decided how decentralized it would be. Then we had the one that honestly I came into this with which was assuming that it would be a little bit like SMTP and IMAP that you'd have like or just SMTP your client would talk to a home server that would cash in rooms which would talk to another home server that would be a single point of failure that would talk to a client. I mean it's a bit like a mark in XMPP sounds pretty easy. What I did not expect was for some of the folks on the previous slide to turn out looking really excited saying you know I think we might be able to do it like it we can actually go and replicate this between the home servers which I christened at the time the distributed sink nightmare in terms of active active replicated version of a protocol and then there is another one down here when you've got sort of two inboxes that sort of synchronize ish together but you basically have queues rather than DAGs and you had this one which has got lots of double arrows and I have no idea oh it's a mix net I think is basically all that was talking about you'd have a personal home server you have a bunch of relays which trusted maybe or not trusted I don't know I can't remember it was 10 years ago but either way that was the level of sort of whiteboard diagram that we were playing with at that point. So basically as Matthew said earlier fast forward almost 10 years now 2023 was very much focusing on the basics to work well thanks to the limits of funding which is good sometime like if you have a bit less money then you do focus on the thing more important. So we have posed a lot of things how do you want to do this do you want to go through the list Matthew? We have no you don't. I do it okay I will do it so the focus was very much on 2.0 CNAPs the SDKs Rust and GS SDK the otherwise peer-to-peer matrix is on the side pseudo IDs as well crypto IDs accountability however we still hope like very very soon we'll be able to get back to all of this low bandwidth as well and some of the done right work funded by Elements. The legacy Elements apps and dsdk that are based on are just put on bug fixes only and hopefully we'll be able to switch to everything to Rust soon and LiboM as well now that we have those amounts taking over and PortoDrum is on the side waiting for someone to take it and bring it up to what it all the power it can do. Yeah third room is particularly frustrating we got an email from the W3C after we announced that we had had to lay off the team element who are working on it and that nobody picked it up saying what this is meant to be our promised land of Web SG the Web scene graph API that we created I thought this was how the future of the spatial web as Apple would call it is meant to be and I said well I'm really sorry but we literally could not find anybody to fund it whatsoever even people like Rolls Royce who promised that they really needed this and would fund it then proceeded well first of all to lay off the team that we were talking to on their side and be not funded at all anyway it's been a really fun year so that said I'm gonna disgrace myself as you probably expect by wanting to talk a little bit about the projects which are shelved because it's really frustrating that an awful lot of work went into them last year until around November they got forcibly parked one of them is some pseudo IDs MSC4014 so this is the project to replace MXIDs with arbitrary identifiers per room now the reason for doing this is well first of all GDPR at the moment MXIDs get baked into the conversation history of your room and they are things like at Matthew got on matrix.org whereas if you had a different unique identifier on a per room basis that problem goes away and it's up to me whether I want to publish a mapping of my matrix ID onto the sender key or not the idea of this MSC is that it works out of the box with existing clients no code change is needed because the CS API maps the sender keys back to MXIDs when it hands it to the client however this does not provide account portability it's just replacing the MXIDs and it got implemented in dendrite in June of last year and if you're feeling particularly creative go and turn on the feature flags on dendrite and have a play with it but as I said unfortunately it is currently on ice I'm not going to force Amundoon to do the crypto ID one for the sake of alternating slides so for crypto IDs is an extension of pseudo IDs highly experimental the idea is that your sender keys become your end-to-end encrypted identity so we finally unite together end-to-end encryption in matrix with the idea of your MXIDs so the idea is that when you join a room for the first time you get a crypto ID generated for that room interestingly and perhaps controversially your client then signs everything it does or the events with the crypto ID so that you can basically prove that you own those events and as you move between servers in future you can prove that it came from me as an individual Matthew rather than being signed by your home server which you don't really care about if you're migrating between home servers this has the gbs side effect that we no longer have cryptographic deniability because by definition you would be able to see that a given client owned by a given user has sent a given message so there's going to be an interesting trade-off to be had there right now we do technically have cryptographic deniability but practically speaking it really depends on the trust model I'm not sure just how useful it really is other than on paper whereas this would obviously throw it away again implemented in dendrite and was just being drafted in Rust SDK when it got shelved in November the idea is that if you take pseudo IDs and add crypto IDs and add some magic glue which probably means storing account data in a room so it can replicate between servers then you would have client controlled account portability also a prerequisite for peer-to-peer matrix which is also on hold and again is currently on hold how am I doing on time Jan? More than 15 minutes? No no no no I am don't worry I am can I do a demo then? Okay so whilst we're talking about daily departed projects I know this is probably going to piss off a bunch of people but I really want to very briefly show the final bits that the third room guys did before they got killed so here's our third room using OIDC as oh that's a great start this is what happens if something is busy a bit rotting away let me try to sign into this using OIDC because third room was the first thing that we used to test out native OIDC we might have to wait a little minute for that server to wake up because I haven't logged in very recently so this is definitely a dangerous demo talk amongst yourselves imagine that a server is actually working here which it is right so where we left you last year at Fozden was that this thing had just launched and the next big thing was actually to make the whole thing scriptable and do fun stuff with it and it got to the point here where you could go and enter a world like this and this is just a matrix room stored in gltf with the sorry with the world data stored in gltf itself but what Robert and AJ implemented is if you press the tilt button at any point you go and get an inworld inspector up you can go and select things like buildings and you can do things like move them around and manipulate them in real time I think I showed this last year the next thing though was to make the entire thing scriptable by Wasm so you have a script editor now built in here which gives you a little bit of javascript what you can do is to go in and grab something like the buildings you just drop it straight in there and it right see the javascript to grab the buildings and then for every time the world updates you get a delta timestamp and absolute timestamp and I can go in there and do I do not know what this API is how back in this go let's assume it has a translation button and say that y is going to be what 10 units times or the sign of the current timestamp that will work right and if you head um save as run what it will do is compile the javascript down to Wasm using quickjs written by the amazing Fabrice Miller and reload the world and there you go let's upload in your dance I think this is so cool this is so cool you can see why w3c got in touch afterwards saying hang on this is how the future of the web is meant to be and where are the people and it's like well this is what it is so if you're watching this and you think this deserves to exist well first of all I'm not sure I'm gonna ever persuade the guys to work on it again because they feel pretty pissed off obviously that the project collapsed but the code is all there still and it's so antelizingly close to being absolutely amazing right sorry back on to what we're talking about cryptoidies or yeah what's the next time matrix 2.0 I mean who was in the thing in Nansen this morning do I need to go through this again oh crap yeah only about half of you yeah yeah perhaps we should have done that at the beginning of the talk in 20 minutes through anyway right so matrix 2.0 very quickly first of all this is not a spec release this is a state of mind a bit like web 2.0 it's made up of various MSCs and the status is sliding sink so instant launch and instant logging and instant sink it kicks ass but it's too fiddly we are currently performing slidectomy which is the technical term for removing the sliding bit from sliding sink and there is in fact a PR against the Rust SDK which basically shifts all of the ordering onto the client rather than doing it on the server and this is all my fault being stupid and over enthusiastic going and trying to do this over optimized implementation where the server figures out the best possible ordering and then the client tweaks it at the end and it turns out that having two different things fighting over control of the order of a list doesn't work very well so we've basically said the client gets to order it entirely the server does a very approximate probably based on time stamp thing and the good news is that it is just a subset of the current API so it's not a yes or no rewrite it's just basically simplifying the API so it's easier to implement then you've got end to end encrypted VoIP which again kicks ass we demoed it and Janssen and it worked this morning need to update the MSC because it's on its 6 or 7th iteration now and I think it's stabilized enough that we should actually spec it properly faster joins so synapse rapidly joining rooms and other home servers for that matter if they implemented them incrementally lazy loading the data in it would kick ass if we actually finished it so we got the hard bit done the kind of infrastructure and made rooms non-atomic in synapse and then actually didn't get to the point where we would get it to go faster significantly faster and then IDC which does kick ass but it's going to be a big migration as we need basically everything to support it before we start turning it on our matrix org etc but there is lots of stuff in progress if I have more time and try to show the QR single hop login demo which is super cool then Rust SDK is the brave new world that goes and wraps us all together on the client side and as of Friday as I mentioned in Janssen the JS SDK and therefore Element Web and anything else using JS SDK now uses Rust SDK for crypto so we are finally at the point where the old Le Bon c++ library is in maintenance mode and then some whereas Vadosmats the Rust implementation is our brave new future and I spoiled that Demir has already produced a post quantum PR draft for Vadosmats using the kyber primitives wrapped around I think curve 25519 so a kind of hybrid approach which should be compatible with the signal and PQXDH and key exchange stuff and what else are we doing in Vadosmats it was another big thing but I can't remember what it was another PR that landed basically we fixed all the crypto bugs in one place and a huge huge focus in the coming months is making the crypto finally suck a lot lot less. Should I keep going on MLS you can do the whole end bit of it okay on MLS people might be wondering hey it's not talking about MLS anymore what's that all about first of all we are still doing this you can track the progress on our MLS yet.com MLS is the group encryption that scales much much much better than normal double ratchet and almond Vadosmats we have it largely working on matrix has huge key bundles you have to store the keys and the media repository they're so big at the moment however there's been a lot of discussion on the meme side which we'll talk about briefly and Travis will talk about a lot more in a few minutes in terms of what if you actually used MLS to synchronize everything so rather than having a matrix tag for tracking no synchronizing data between servers what if you just chuck everything into MLS. TBD so there's a little bit of a do you put MLS over matrix or do you put matrix or meme over MLS and debate going on right your slides. Yeah basically as we said at the beginning 2024 could really be the year where our prediction was the convention would come to and the prediction was this that this is a slide taken off investor pitch deck saying in 2019 in five years everyone will communicate over matrix that's why we did this right. In 2019 it said 10 years and now because we're five years later and now it says five years just saying also this is written in R and it's real traffic from 2019 showing the I think the top 100 home servers talking to one another just saying if you're investor decks aren't written in R you're doing it wrong. So basically killing email and the phone network so why the digital market tag you may have heard of it they demand the big communication services called gatekeepers to actually interoperate with the rest of the world. Two of them have been named so far WhatsApp and Facebook Messenger iMessage is pushing back saying no no no we're not a gatekeeper but let's see where it goes. To the business yeah business to a user. So last year it was coming into force in the 7th of March they will have to actually expose these APIs as production ready and anyone in here who actually wants to interoperate with WhatsApp because they don't want to create an account there will be able to come to them and say hello can I please integrate against your APIs to talk to your users. She's a little bit ironic because it starts to look an awful lot like the PSTN in terms of you have great big telecoms providers and you go to someone at AT&T and said hello please can I talk SS7 to you so my little telco can talk to the big telco and they make you sign a massive contract and there's all sorts of back and forth to happen. Obviously we can't say what that will look like with method but there may be a risk oh well there could be entire spectrum between open federation versus closed federation versus everything in between and we just don't know what will happen. Let's see in a month and basically yeah one we may get to a point where matrix becomes the glue and between all the communication system and matrix them together. Yeah I mean I'm not counting on it honestly on this architecture particularly because everybody would need to agree both on the same dialect of double ratchet as well as the content payloads within but you never know if we get critical mass in some places perhaps everybody will follow. So yes we mentioned it this morning already still a lot a lot a lot of things to do especially on the core making sure the core is funded we're trying to put a big call out for fundraising and honestly the goal is really to get the big guys who actually are using it for hundreds of thousands of users millions of users without contributing a sense to the project itself and funding the core trying to raise the alarm. At the same time there is a public policy dev room there where we're trying to figure out how do we get the proven source projects actually funded so I'm going to run there to try to solve that problem very shortly after this. Cool. Thank you guys so this morning a lot of people actually let's go thank you to everyone who is supporting it and everyone who jumped in live this morning during the talk to actually become a member of the foundation and thank you to all the already how do we call them supporters organizational supporters as well in here. Yeah honestly if your organization is just happening to use element and matrix as its common system it really doesn't cost that much to put some money behind the bar to keep it going like we met x wiki on Friday and said oh how's it going and they said stuck notifications are the bane of my life and he said oh well if you actually want us to have more member to go and work on certain notifications and perhaps you can become a silver member of the matrix.org foundation and that is why there is a x wiki logo and a cryptpad logo on the slides there seriously it's meant to be relatively modest but if we get all the organizations doing it as well as the individuals then if nothing else is really a lot easier to go to the really big people like the EU and say look we've got already got 800 people supporting this this is an important thing it matters therefore you should match 20 fold 50 fold 100 fold and as a narrative it may work. So yeah meanwhile we have an awesome community a lot a lot a lot of things are happening around and this is the menu for this afternoon where everyone will be able to tell us a bit more about what they're working on looking forward to it thank you everyone. Any questions? Two I'm allowed two questions but they can't come from me. Kim. Excellent question. So the excellent question which I shall repeat is where the hell is multiple accounts support in element. Now most of the best of clients out there have it already however we've never got round to it in either element or element x there's not a good reason for it other than everything else taking slightly higher priority we did have it in matrix console the very first matrix client that we wrote before producing vector and riot or whatever it is now. Yeah no good answer other than we need to add it and element x would be a great time to do that it's built with it in mind we just haven't put it in the UI yet. Everyone can do it apart from element pretty much. Is there any indication that other assorted governments are looking at following on something So the question is whether other governments are going to take inspiration from the digital market tax there are some movements in the US around it trying to remember the name it's not the interoperability bill but something along these lines so there is definitely it's like GDPR then has been looked at by the US and European Europe is leading on these sort of things and yeah there is movement in that direction. So yeah the question is whether we are lobbying within the EU to make the API's the gatekeeper offer open so the DMA is forcing gatekeepers to open their APIs the big lobbying we've been doing in the last four years has been please please please don't ask them to only open their APIs but try to converge towards an open standard so that as small companies who want to integrate with these people I don't have to build a polyglot messenger which speaks what's up Facebook messenger Google blah blah blah all of them in parallel please please please so so far we don't really know what this is going to be the in the law of the DMA in the text it doesn't say you have to use an open standard but like basically we continue working with everyone the European Commission's and the gatekeepers and all the big corporations trying to convince everyone that that's the best way to go. I think the US equivalent of DMA is American Innovation and Choice Online Act maybe one of Senator Wyden's initiatives perhaps but you don't need to go and look it up and there was something I wanted to add to that but I've gotten what it was just the lobbying try to get everybody on a single it was it would have been amazing if we'd persuaded the Commission to basically put into law you have to speak an open standard the reason that they didn't is first of all it's not really their job as politicians to dictate the actual technical implementation they need to say what the outcome should be like a more competitive environment without massive anti-trust behavior but it's up to us literally up to us a lot in this room to figure out what that should really look like and the other big problem is that there isn't a standard that is suitable like matrix is great it's looked after the matrix org standard foundation but it's not an internationally recognized standards body so if we're gone through ITF already then perhaps that would have worked it's not like they can just tap on the wall and say use matrix even though it has some traction I would say this is an amazing segue to travestalk right now unless there are any other questions which I can cram in but I'm not allowed to no thank you very much
Interoperability & Matrix
So there may or may not be time for questions. There's a lot of detail. This is a 60 minute talk compressed down to hopefully 22ish minutes. So we will see how we go. But yeah, I'm here to talk about the technical details of interoperability. I'm also Travis. If you don't know me, I'm the director for standards development at matrix.org. I'm also on the spec core team. I run T2Bot and I work at element for trust and safety. I have a few jobs. But good news, there's already more that we can talk about. So Matthew had the talk this morning. If you haven't seen that or seen the recap of it about 10 minutes ago, covers DMA and the timelines in a lot more detail. To recap though, the DMA requires gatekeepers or large messaging providers to open up their APIs and their systems for interoperability. Encryption must be maintained between those providers. So you cannot break encryption for the sake of interoperating. You have to maintain it. These messengers have three options. You can become multi-headed similar to Beeper Mini where you have all the networks available in your one client. And you just kind of switch between them. You can create a bridge app where the user downloads a third thing and then you bridge locally on the device. That works. It's not great. Or you can speak a common protocol. We've been doing that for the last year. Probably longer. And oh yeah, they have to do this all by March 7th this year. So with that in mind, there are many projects involved as well. There is the more instant messaging, interoperability working group or MIMI at the IETF. They are trying to specify a standard that does this stuff. We are very frequent there. We are a direct contributor to this. I have written a MIMI protocol document in association with a few other people on the design team to try and simplify a lot of the components, particularly what linearized matrix is. Also, linearized matrix was originally created as this simplified version of matrix because it turns out that you don't necessarily need a ton of the fully compatible DAG stuff or even messaging history for interoperability. A lot of the existing providers just kind of want to throw messages around the place. They don't necessarily want to just kind of keep these things around. Obviously, of course, we have matrix. Hopefully, everybody here is familiar with that. But it is the decentralized and fully featured version of an interoperable protocol. What parts of interoperability do we have to worry about? A few. There is encryption. This kind of fits into a weird L shape. You have content format within that. But the encryption, we have to make sure that all the messages are secure. We have to make sure that everything is the same. Of course, we have to make sure that it is consistent across the providers. The content format, what do messages actually look like? We have to make sure that that is the same because the servers can't help us here. The clients have to agree on this. That is more of a challenge. We also need an authorization policy so people can be banned because they need to. Then we also have messages that people might not be allowed to send in certain rooms. Of course, we also have transport. The transport is just how the servers communicate because we have a room model that looks something like this. The room model is a combination of the encryption authorization policy and transport. We also have a definition of membership or participation, a little bit more on that in a minute. And also how the messages are found out themselves. In the very simplest scenario, we have clients talking to servers, servers talking to each other, and encrypted messages flowing between clients effectively. It gets more complicated when you add a third server, so we will do that later. Some of these problems are easier than others. Namely, transport. Super easy to solve. Pretty much everybody uses some form of HTTPS. Mimi wants to use MTLS. Linearized Matrix uses the same system that Matrix already does where you have a signed or a signing key that kind of gets thrown around a bit. It is unclear what the actual format over HTTP would be. Matrix uses JSON. Mimi wants to use some form of binary. Unclear what that actually is. We are also considering a binary event format specifically for this kind of thing. Protobuf and Seabor are kind of on the top. But to be determined, clients would not be expected to consume that binary format yet. I should probably just add that in. But yeah, we will end up using some sort of binary over HTTPS mechanism authorization exactly to be determined later. The other easy thing is authorization policy. Mimi does not define one. We have been working without one. We have just been assuming that people are able to send messages. Matrix obviously has one. Role-based access control is super popular amongst a lot of these discussions. There is those two MSCs there. 4056 covers the decentralization part of RBAC. Then you also have 2812 where it basically rolls as state events. It is an early form of RBAC. Linearized Matrix uses the existing authorization rules. Matrix authorization rules clearly already work. People have been using them for almost a decade now. They should be fined. We will figure out what Mimi ends up with eventually, hopefully. The harder parts are encryption. Most messaging providers use lib signal or something that is a double ratchet. We also have a double ratchet-like implementation called OM. It was not previously interoperable with lib signal up until about 2 a.m. tonight. We now have inter-OM, which has that X3DH support as well as some of the other delts you need to be able to support that sort of interoperability. Megalom is what we use in group chats to try and alleviate the load. Otherwise, with OM, you have to send a number of events for the number of devices in the room, which obviously causes problems when you have multiple devices per user or multiple users in a room. Matrix HQ would be a nightmare. The double ratchet does rely on existing infrastructure in order to send keys. It has no concept of membership. It does not know who to send the keys to on its own. You have to tell it who to encrypt to and then also send those keys yourself. Some messaging providers, namely Google, have announced that they will be using MLS. We also obviously want to use MLS. REMLST.com is where we're tracking that progress. MLS does have a concept of built-in membership, so it does know who it needs to send messages to. It obviously doesn't send the messages itself, but more on that in a second, namely this slide. RFC 942.0, that is where the IETF has specified this. I have a really awful crash course guide because I am not a cryptographer, but there it is. But yeah, there is a binary tree, so you have a root key and you have multiple nodes underneath that. With that concept, you end up with a concept of membership where only users or members that have certain keys can see other keys. That is how you get to know who to send the keys to, particularly the decryption keys. Mimi has refused to implement any other encryption other than MLS. They are obviously considering it as part of double ratchet because we do need an onramp. But with the IETF, they tend to get a little bit stuck in the RFCs. We are also considering MLS, obviously, and so we want to extend it. Decentralized environments, namely matrix, will have to use DMLS or similar. Membership. As part of the discussions with Mimi, we have been having some arguments, we will say, about what it actually means to define membership. We have decided that users join rooms and clients encrypt messages. Both MLS and double ratchet deal with clients. When a user joins the room, all of their clients join as well. This is hopefully not a novel thing that is here, but it is written in stone now. So we need to synchronize these two concepts. We call users to have a participation state or exist on a participation list. And then clients have membership. So users, participation, clients, membership. We also have to make sure that these are atomic operations because otherwise somebody joins the crypto state, but they are not part of the actual user state. That causes issues. So Mimi has started proposing a bunch of MLS extensions to persist application state within an MLS group. Because MLS has those extensions that you can just store arbitrary things, making the blob even larger so you must store it in the media repo. These are new as of like a week and a half ago, but it is called AppSync. It is a generic mechanism. Conveniently, it would basically be mapped to state events in matrix. So you can just add arbitrary information to the group, namely with a key and some sort of content. And then there are some operations that apply where you can add, remove, update, that sort of stuff. But yeah, it is visible to servers, but servers can't see the actual encrypted messages part of MLS. They can just see that state changes are happening and potentially what's inside those state changes, which is why they would map to state events in matrix. Double ratchet and participation is a bit harder. Because double ratchet, again, doesn't have a concept of membership. It's not terribly difficult to map these. It's a little complicated sometimes. So there's a couple of MSCs there that list this sort of information, namely the crypto IDs Matthew was just talking about. And then yeah, we translate these concepts to Mroom member state events as well as device lists on matrix. But regardless of the protocol, we want to make sure that people currently on double ratchet have a way up to MLS. So it's a natural evolution of the application rather than forcing somebody to effectively fork their own client, which brings us a little bit into content format. So clients need to end up encrypting and decrypting the same thing. Otherwise, there's going to be issues. Because if you send a text message to somebody and they just don't know what to expect, then there's not going to see anything. So we need some form of extensibility because messaging also has a ton of features. And it's constantly evolving. Servers can't help with this because it's already encrypted. And of course, it should be as small as possible. It should require minimal processing power because not every client is a laptop. Or sometimes the laptop is a bit slow. So Mimi has worked on their own TLS encoded multi part MIME format. It looks a lot like multi part email. It's not the greatest, but it is a notional format while we try and work out the exact things. But matrix already has events and you can already define your own custom event types. And you can already add arbitrary content. But what if we made that way more extensible? So we introduce extensible events or MSC 1767. We use content blocks to persist information inside of an event. We specify the course blocks there. And then we also try to make sure that the client can render arbitrary event types that they don't know about. So we lose a little bit of richness in the sense that if a client does encounter an unknown event, that they have to figure out how to render that. And it might not render in the same way for everybody, but at least render the same information for everybody. And that's the critical part. So an extensible event looks a little bit like this. This is just a basic text message saying hello world. So if your client supports HTML, it picks the HTML format. If it doesn't support HTML, supports the basic format. But critically, you have a type of m dot message and you have a content block of m dot text. So if we add a little bit more richness to that and create a fake schema for polls that definitely doesn't exist, please see the MSC for a real schema. You have an unknown event type for some clients, namely org matrix poll start. So you still have that text content block. And then you also have this poll content block, which gives you a little bit more information about how to render these events. So if your client knows what that event type is, I can go into the content, pull out the org matrix poll content block, render that in its UI, and then the client can interact with it normally. Otherwise, you end up with just the text and it is suitably okay. It's not great. But you still have the same information from the poll. And so yeah, currently extensible events are JSON. But again, you could make this a binary format in the future. More events get rendered by more clients, which is great. You can create more custom event types. You can do all sorts of fun stuff to be determined exactly what all of this looks like. We're still in the process of specifying all of the pieces, particularly the core content blocks, and also a registry so you can actually implement a client that understands all of these things. So a little bit on room models. The Mimi room model looks like this. So when you add the third server, there's obviously a little bit more complexity. Mimi primarily uses a hub and spoke fan out. So you have one central server per conversation, not for the entire global network, that is responsible for distributing messages. So server B and C try to avoid talking to each other if they absolutely can. And they talk through server A instead. So server A is responsible for sequencing, which is important for MLS. It has those characteristics in play. And then yeah, the follower servers, as they're called, go through that. And encrypted messages still flow between the clients as normal. The servers can't see those messages. So then we have the question of what does linearized matrix look like? It's exactly the same thing, just different objects, which is particularly interesting when it comes to the fact that it was rejected. Because it uses just regular matrix events. It's the same room state. It's the same matrix event stuff. It's a stripped down version of the server to server API because you don't need all the DAG resolution stuff if you don't have a DAG. Also, your DAG is now a linked list. So you don't have any state resolution to do. You have the same authorization rules. You can use the same extensible algorithms for encryption. You can use MLS, double ratchet, your own thing if you're insane enough to do that. And then you have all of the same capabilities of matrix. And you have the history and all of that. But critically, you can support having a DAG capable server in the room. You don't need to give up your decentralization. You can end up with a hub server that basically acts as that linearization algorithm or does linearization algorithm. And it also still persists the events, still distributes them. So when you get into decentralization, namely how matrix works, you use a DAG. You have full mesh fan out where each server contacts every other server instead of going through a central hub. Conflicts of the DAG are used or done through state resolution. So if two people try to do the same thing, somebody has to win. And the good news is state resolution can also be used to linearize the DAG. So through use of a protocol converter, which may or may not be a dual stack server, you can then bring these centralized systems, even linearized matrix into matrix to just further route them. So protocol conversion, they aren't bridges. Bridges somewhat necessar- they're necessarily break the encryption because when you're converting to signal to matrix prior to our existing or to our new interoperability capabilities, you end up decrypting the network on both sides of the bridge and re-encrypting. So you're only really encrypting to the bridge and not beyond it. So protocol converter doesn't decrypt messages. It just converts the envelope format to another format. So that way you can just keep sending your messages. This may also include translating some of the concepts. For matrix, we have two device events, some other protocols, namely Mimi, just send everything over what they call events. So we would have to translate those concepts into the appropriate matrix APIs. Again, you can make this either with an app service or as a dual stack home server. So instead of having a multi-head messenger, you have a multi-head server. And then, yeah, use msc3983 or 3984 to bridge the particular crypto concepts if your server doesn't necessarily support those key formats. So this is what it looks like. You may have recognized it. I stole it from Matthew's slides. So if you have a gatekeeper on the left there, you can do a protocol conversion. And that might be attached to a single server. It runs through matrix. And then you run another protocol conversion to bring it into linearized matrix or Mimi, where you have that hub and spoke, namely that the bottom two servers there aren't talking to each other directly. So those two nodes might be the same physical server, just running dual stack and not doing protocol conversion. But that's all right. So there are a few missing pieces. We haven't talked about anything to do with identity. How do you convert a phone number or a name or an email address into something routable? Who knows? That needs to be defined. We currently have identity servers in matrix. They're a bit centralized. We're hoping that somebody in Mimi can actually solve this problem for us. We also have an interesting idea around consent. Presumably, you don't want to receive spam. So how do you make sure that the person that is messaging you is allowed to message you? We also have anti-abuse. How do you report these messages over federations or over servers? How do you make sure that the servers can implement their own anti-abuse measures using whatever identifiers they can? Mimi also is not necessarily defined the exact identifiers that they want to use. Matrix already has user IDs, room IDs, aliases, that sort of stuff. But who knows? Maybe something different would work. So room metadata. Again, where does the room name go? Who knows? We'll have to figure that out. Matrix state events would probably be fine. Same thing with ordering. MLS requires ordering. There's a discussion around whether or not the clients also need that ordering. So what's next? We have no idea. As Matthew has mentioned, again, I'm just stealing from his slides. So linearized matrix will probably get updated as an MSC because currently the MSC is one version behind from the IETF draft. And the gatekeepers will have to publish their plans by March 7th. We'll see what happens there. The protocol converter concept will continue to be refined, of course. Mimi will also make some form of progress, hopefully get refined as well. And yeah, funding the foundation is the best way to make this work. So, questions. Yes. What are the stakeholders in the Mimi and why are so different stakeholders, like, not using the matrix approach? And what are the different interests here? Yeah, so the question is what are the different stakeholders and why are we going after certain approaches, I believe. So there are several players in the Mimi space. So we have obviously ourselves. We also have wire. There's Google and I'm forgetting all of the other ones, but there's... Yeah, Cisco, Wicker, Phoenix, and a few others. There's a few hundred people in the Mimi working group. You can see their company association as part of the membership list. I would suggest going there. As for the different approaches, everybody wants everybody to use their thing. We're no exception. We just think that ours is better. But yeah, we've been doing this for a while. Matrix was originally built as an interoperable protocol. And here we are with a legal requirement to have interoperability. So, surely Matrix is designed for that, is kind of our thought. We used to rely heavily on canonical JSON to maintain the technicality of the company. How does that translate to the Mimi particular and get the intracorrel? Yeah, so the question is how... Like we've previously relied on canonical JSON. How does that translate to Mimi and just general approaches with interoperability? So, canonical JSON has all sorts of interesting issues with it. What happens if you have multiple keys? What happens if the keys use a weird former of UTF-8? That sort of stuff. It's a very complicated set of rules that can realistically never be fully defined. So with a binary format, namely, that's what Mimi's interested in, you don't necessarily need a canonicalization, because if you keep the signature for the event next to the event, rather than in the event, like we currently have in Matrix, you are able to just sign the series of bytes. And the bytes can be in whatever order. You can deserialize them, see them more easily, and then check the signature much faster. So that's kind of where the Mimi direction is going, is we want to avoid a canonicalization algorithm, but we do need the more specific standard for what's contained in those bytes. This is something to be supported either throughout the chain, yes, we are going to be pushing more towards keeping the, instead of trying to make everybody use the existing matrix thing, I would suggest that matrix kind of adopt more of that binary event signing instead. Yes. You had a slide with things you didn't talk about? Yes. In many places, primarily in the Mimi working group, that's where a lot of these conversations are happening, as well as on the design team for Mimi. But if you are interested in them, or you have ideas, feel free to pop by the Matrix spec room on Matrix, and we'll be happy to engage. Do I have time for one more question? All right. Yes. All right. So how do we avoid, basically if you have two protocol converters, say they're both talking to the same network, how do you avoid message duplication? Good question. We'll have to experiment with it. We will be trying to figure out exactly what that looks like. We kind of have to wait until March 7th to see what the actual gatekeepers, namely WhatsApp and Facebook Messenger, have to offer for that certain capability. Thank you, Travis. Thank you.
Let's talk Matrix between Governments and Citizens
Hi, thanks for coming. Huberview is working in the government of all the government in this room. I bet it's only a few. Oh, maybe 10% or something. Yeah, quite a lot. Okay, so, hi, my name is Marco. So let's talk about 2AM and why am I here. I am active in the Floss community for about 10 years now of contributions to SignalDino and also in projects in the wireless mesh community tooling. I have a background in IT security and my current project during the last three years was in this building state-of-the-art infrastructure for the public administration in Germany in a German federal IT agency. And yeah, we think a lot about how we can improve our infrastructure, especially in Germany, but we also try to, yeah, think out of the box, out of the border. And this is why, why Metrix is very interesting for us. But let's start with what public administration does in Germany and other countries. So the government provides a lot of services that ranges from healthcare services to social services. There's dog tags registration, for example, there are holding benefits. And in Germany, there are 575 service categories with a total of 13,000 individual services that the government offers on different federal levels. So that's a lot. And also what we need to think about is that the government has a monopoly on these services. Like if you want to have some housing benefits, probably the government is the only institution that will provide you with these services and with this support. So if you want to receive these services, you need to go to your local government. And this is why it's important to have a, like, see how these services are designed, how they work, if they are privacy-friendly, if they're usable, et cetera. So in my opinion, it's very important to have a look at the tech stack behind these services and also the privacy usability, accessibility aspects in there. So how do we apply for these services? First, there's, yeah, the option that you don't have to apply for them. The government starts the process by itself. For example, sending you money for your child benefits. The government usually knows everything about your child when it's born, and then they could theoretically send your child benefits by themselves. That's what we call proactive government. That's usually a thing, a neat thing, but it doesn't really work in all cases. So there are use cases where this doesn't work. For example, for a registration in the kindergarten, it's probably a good idea to ask the people where they want to bring their kids or if they want to bring their kids to this specific kindergarten instead of just, like, distributing the kindergarten places. That wouldn't really be a great usability thing there. The second option, you could always and probably have already done this offline. You can go to your local city hall. That works for many people, but still for many people, that's kind of inconvenient. So the third option that comes naturally, you can apply online for these services. For example, via an app or via a website. And I'd like to look at this third option a bit in detail. Okay, let's start by requesting some government services via a web formula or via a mobile application, for example. That's comparatively easy because, like, the government websites are public. You can just find them online. And the contact details of the government agencies are also public, including their private, sorry, their public keys, hopefully not their private keys, sometimes they are public, but that's not by intention. So you can just encrypt your application form and send it to the right government agency, and you're basically done. So that's comparatively easy. Then usually, hopefully, the government responds. But the person that applied for the government service may have already left the website, for example, and installed the app where they applied for the service. So that's a bit harder because the contact details of these individuals of us are not publicly available, and that's also by design. We don't want that. But also, we don't want the people to force installing some random application and keep it installed for a longer period, or even at all. There should be different ways to access these services. We can't just hope that the app is still installed and we can send a message to the people via this app, for example. So let's have a look on how the industry solved this problem. And here, for example, banks and some insurances put some online mailboxes in place. That's usually very easy because they just stored the plaintext messages on their central service and provide a web interface or an interface via an app to receive them. That might be okay for some banks and insurance companies because they already know everything about BitOS anyway. So that's their service. They're directly communicating with us. Still, it's not really anti-antimcryptus here, but the two ends are the bank insurance agency and the people. That's occasioned some way, but if we built this for all people and for the whole country, we definitely need encryption. So we have local government agencies that want to communicate with the people. And there's a huge, so there's a large amount of information, a large of different services that are being provided. We don't want to have a central server that stores all this information about the applications and responses to that online on the server. So how did government agencies solve this? You have to summarize. Mostly, they did it very badly. We've seen a lot of data leaks in the past years. And I think there must be a way that doesn't include any risk of data leakage. These are just some examples. I found a line there probably a lot of more issues, but this is not a European problem. This is like a global problem. You can find governments on basically every continent that lost personal identification information of all the people in their country. So that must be a better way. So let's have a look how did the German government solve this issue till date. We have a lot of different online mailboxes. There's ELSER for those of you from Germany. We probably know it. It's a big application to pay your taxes. We have so-called the email. That's a German email variant that should be super secure. It's basically some regulation on standard email protocols. We have BundID, which is like a central identification service that also contains a mailbox. Then in the justice context, we have a lot of different mailboxes that are somehow interoperable, but none of these really follows the security by design principles and the Cedar Trust approach. And this poses a huge risk to privacy and security to highly sensible data. So this might explode somewhere, somewhere. In fact, actually, there have been incidents in Germany too, of course, as in other countries. For example, we had the, since 2021, we have so-called digital health apps, and they got analyzed by, it's a forschung, that's a collective of IT security researchers in Germany and the fact that these apps leaked personal data of more than 20,000 people. That's especially problematic because like in the healthcare sector, there are often very sensible information that might get leaked here. We also had a recent leak in the justice domain. That was the case of the Justice Maybox leak last year between October 13th and November 9th, a directory with personal identity data was publicly accessible to you to a config error. So this shouldn't have happened at all. There should have been technical measures in place to, yeah, that, yeah, make sure that this won't happen. That's especially bad in this domain. For example, if stalking victims use this Maybox to contact some, some Kurds, and especially then it's not, not really a great idea of their personal information, including their address is publicly accessible. So let's talk about some solutions. And I brought this vision here. What if, if, if communication between governments and citizens was easy, reliable and encrypted? And since we are in the metric deaf room, yeah, let's take metrics to the rescue. Metrics already provides enter and encrypted messages. It provides multi-device access from apps and web applications. It also provides access via third party apps and services. So for example, corporate IT service or e-governance apps, et cetera. This is all possible using the metrics protocol. So why not build a metrics-based secure communication channel between citizens and governments? And that's exactly what we are planning to do. We wanted to integrate metrics in Germany's national identity system. So first challenge here would be to build a proof of concept this year to demonstrate that this is technically possible. And we have some like technical things we want to discuss here. Also, for example, usability issues would be discussed here. And in general, when we do this, we of course want to have a great user experience. So what do we need for that? We need polls and multiple choice questions. We need push notifications and status updates. We need also machine readable data. For example, these polls would then make it easy for bi-directional interaction with the public administration using machine readable polls. That would be an interesting thing to look into. Also image and document uploads might be a feature. And the neat thing here is that metrics already built comes with these features. So there's not really much to build on top. We can just use this and go from there. Of course, we also need a great developer experience. That's something most government projects don't really think about. But I think especially here it's very important for us to have some SDKs in place for developers that are working on IT systems either inside the government, but also to help building apps and an ecosystem of apps and services for citizens and company-facing apps, for example. That helps us here with development speed for government services. So again, what does Matrix offer us? We have a great usability, especially compared to email-based systems here. We have tried and tested security. This exists already a bit and the protocols are known and we don't reinvent the wheel here security-wise. It's interoperable and it's easy to integrate and it's also ready to use in the real-world layer. Many features are already there in the Matrix specification. Some strategic thoughts on enter and encryptive communication. In my opinion is a key enabler for seamless e-government services. We need this anyway. We will also be able to... This will enable us to really build a privacy-preserving realization of the so-called once-only principle that enables governments to reuse already submitted data and documents. We have all this data in a machine-readable, secure way. It also might support us in some wallet-like use cases, for example, at a station presentation of attributes like driver's licenses. This also needs a secure communication channel before we think about all the additional challenges cryptography-wise here that we need to tackle. We need to start all of these things, need a communication layer as a starting point to interact between peoples and governments. A proud vision, where might this journey go? We will start with a mailbox app and later, if this works out, it might be a good starting point to provide the most common e-government services via this app. We would have an entry-end process to apply for all these services. This will definitely help us with usability and user experience. This might be a neat thing to look into. In Germany, there are very little government services that are already integrated into an easy-to-use app. Most of them are just huge web forms where you have to enter lots of data and then you send a form and hope for the best. If we go further, finally, why not build a framework for any e-government service? Basically, the service that is integrated into the app is basically a config file. This would help us to scale, obviously, and give us an opportunity for modulary specifying different services that we want to provide by just providing a config file that defines how the UI, for example, in the app looks like. Putting it all together, this will provide us a National Privacy First e-government app, which would be a neat thing to have. Maybe it will help us build up speed and get better in this domain. To conclude, let's talk a bit about infrastructure. The status quo is that we have different text tags for requesting services and also for replies. These are completely different infrastructures. For example, we are able to request services via a REST API and then there's a SOAP API to provide messages back. This is completely different. Also, it would be nice to... Currently, we have different text tags between different government agencies. These might be encrypted or not. That's obviously not good. What can we do about it? The obvious solution here would be to take metrics as an interability layer. In my opinion, that would totally make sense to have a basic common ground to communicate with different government agencies. Actually, that's what metrics is designed for. We don't have only the chat application use case, but also the communication layer between different organizations or people. That might be an interesting thing to look into and build some prototypes here. Plus, it would also be very easy to integrate industrial needs here. The industry is also, of course, a large customer, so to say, to the government. They are requesting, for example, building permits for wind parks. It would be nice if they didn't have to do this wire paper, but also wire an easy-to-use API and integrate their own IT services in this ecosystem. Everything becomes easier for the government and for the industry to work together here. Okay. That's all I have. Thanks for listening. I would really like to continue the discussion of course via metrics, if you like. Join the Metrics channel. It's metricsforgov.org. It would be really interesting to discuss with you there. I think we might have time for some questions. You already answered one of the questions online for where is the place to discuss about it. Another question online was, is there any plans to bring it together with the TI messenger communication from the German healthcare sector? Yes, of course. We hadn't any in-depth discussions how to bring this together, but obviously we would then be using the same tech stack. From an architectural point of view, this is what we want to do. We have all these different Maybox infrastructures in Germany right now, and we need an interoperability layer between them to make it easy to use all of them and have one place for people to receive these messages or send messages to the government. This is one of the design goals in the long term, to have all these services using the same communication infrastructure, making it easier for people and governments. So the question was, if the GNUTALA project that also has some origins in Germany might be... So the question was if the Nutala project that also has some origins in Germany might be, so if there might be some lessons we learned there, so I'm personally not that involved in the Nutala project but I'm looking at it with great interest because I think that would be a nice candidate to have privacy preserving payment, a privacy preserving payment system here. That would of course greatly integrate in such an app here, so just yesterday I thought about this aspect of maybe looking a bit deeper into Nutala there. From the perspective of making it more or interoperable in the European domain we are trying to or we are looking into of course other European or we're talking to other European governments if this might be also a great thing for them. We have the Interoperable Europe Act, we have the Single Literary Gateway regulation in the EU so it might be a good thing to maybe harmonise this not only on the national level but also on the European level. I think that's an important aspect when we build infrastructure and I don't know any other standard, the metrics that has the potential to solve this quite nicely. We are talking to them. It's our question. So the question was if authentication will provide the requirements that government services have in terms of authentication would have any impact on what is needed with metrics and I think we're going in the right direction here with OpenID Connect so this is like what government services already use. The thing is here this is not completely zero trust and so we are not there yet with security and privacy by design here because if there would be one central authentication server that would provide identities for all people in Germany using OpenID Connect this will be a huge attack surface of course. So we are also thinking about how to maybe integrate the German EID system that have it in my backpack. So we have some EID cards that can be used to authenticate people and that would also be an interesting thing to look into if we could deploy this privacy preserving authentication system for these kind of services. So that's a huge thing we are thinking about how to reduce risks here security and privacy wise when we build such a massive system that deals with highly critical personal data. Yeah so the question was if we provide any OZG services via this protocol OZG is like the German government service accessibility law that requires governments to provide services online and of course we have this thought if this would be possible at all. Right now we have like different systems other systems in place that are using different text stacks here but in my personal opinion this would be the like natural evolution to if we like communicate with people via such an app or via the metric standards we might also look into using metrics to fulfill these services but I think that's a long journey to go. There are some things you need to consider when we build this infrastructure because not just the communication you also have to deal about which services or who can request these government services we have to deal about authentication routing which government agency is the right agency you want to address. So I think from a technical perspective this would work but I think it will take time to think about and maybe sometime build this but yeah we also don't want to build something separate to the services that are already in place so I think the only natural solution here would be to transform existing services to yeah maybe sometimes using metrics and have a roadmap to or for developers and organizations how to migrate from existing services to metrics otherwise this will probably not work and yeah will create a lot of confusion I think. Yeah so the question was how to deal with with backups and device signing and all that stuff so how to handle private keys basically and yes we are thinking about this and we have some some ideas how this could be could be done of course we don't want people to like manually store some private key file on their laptop and like take the burden to them but this is like definitely a thing we are thinking about so if you have any inputs on this thing I'd be very happy to hear from you in the metrics chat. Thanks. Yeah there is so the question was could we use our German EID cards for this and the German EID cards are so they are able to put some some digital signatures out there the problem is that currently the the signing keys are not deployed on this EID card so you would have to build some infrastructure to deploy the the yeah signing keys for for the people like private keys and as a like certificates for for every people this is like a huge organizational thing but yeah maybe this this might be an option to go for but I expected to be I don't know nothing that happens in the next one or two years production takes a bit of time to build this. Thank you very much. Thanks.
Embracing Matrix for Enhanced Communication: Migrating the WordPress Community from Slack to Matrix
Hi, so this will be about migrating the WordPress community from Slack to Matrix. So first, quick about me, I'm Alex Kirk, I'm from Vienna, Austria. I'm at Automatic for since 2014 now. We run WordPress.com and others. I'm an engineer. I lead teams around localization and matrix. I'm sponsored to contribute to WordPress.org. And I've got some site projects, so if you have a WordPress blog, check out the Friends plugin for making your site your own hub for subscribing to others and enable Master on Apps plugin if you want to use Master on Apps with your site. So quick thing, probably I don't need to tell you, but just to make sure, what is WordPress? It's a popular PHP CMS. But in 2003, today it powers over 43% of the websites on the web. It has a blog editor that allows you to edit posts, but also the whole site. It's well known for its plugin ecosystem, with plugins like Yoast, Runt Custom Fields, WooCommerce and so on. And it's open source under the GPL. And just a step back, just so that you understand what our needs are as a community. So this is how we collaborate. We've got 22 make teams in different areas, so one about accessibility, core, design, polygots, meta, lots of teams, performance, sustainability. And they all work towards separate goals. But each team has a P2, a blog, where they post about new things that are happening, proposals, decisions that are being announced, lots more. This is like the asynchronous part of the communication. And then we've got sometimes weekly, sometimes bi-weekly chat meetings for sharing updates and coordinating. And these are quite important because they give people a definite time when they can reach collaborators on the project. So you don't have to enter a room and hope that the right person is there. But you know the time, at this time, people who work in accessibility, for example, are available. And we've got meetups and work camps. Meetups are local to a city. They're like the smaller ones. Work camps are the next stage where people travel there to meet. And then we've got the flagship work camps. For example, in Asia, coming up in March, EU in Milan, and US in Portland in September. And another aspect is we've got a project, an initiative called Five for the Future. So there we encourage individuals and organizations to contribute 5% of their time or resources towards the workplace project. So this means a 100-person company would have five people dedicated to the project. An individual would have like two hours out of a week. And organizations like that concept because they retain control over the person who can contribute. And thus they're confident of pledging towards that goal. And if you want to hear more about that, there's actually a talk by my colleague Jesus in this room, Shaping the Future, investing wisely in long-term open source development with Five for the Future. And this is how a release of WordPress looks like. These are the companies who contributed to a release. So 640 people from 186 identified companies. This is the make-wordpress.org site. This is where we list the teams. And as you can see at the bottom, we list the next meeting that will happen and not only in Slack but also in Matrix. And this is the meetings during a week. So every day a couple of meetings take place. And because of the distribution around the world, some meetings happen twice in a day so that everybody has a choice to attend them. All right. So our plan to migrate. So it happened or it started in January last year. We announced that we'll create a subproject to evaluate migrating to Matrix. And then we would evaluate and create the environment that we need, migrate history and integrations, and then finally launch, finalize what needs to be finalized, and turn off Slack. All right. So what could happen? What things we anticipated? First people don't like change. We've been on Slack for a while. So we figured we need to prioritize something superior. So where are the strengths of the new system so that people will want to move? There is complexity around decentralized systems. Like everybody knows centralized systems. They need to go to one address and that's the only way to get there. So people might not know what to do. And then we'd had Slack lock-in. We've got lots of migrations created over the years that make Slack nice to use for everybody. And that's why people like it, I suppose. So when you consider Slack and an open source community, there's actually a few things that are a bit tricky. So one thing is that Slack, a sign up is email-based. So when you join the WordPress Slack, you have to follow a guide. And typically we actually do this at WordCamps where we have somebody there who will help somebody to get onto Slack. It's pretty complicated. Then it's a commercial product. The free tier has a message retention limit. The data is siloed behind Slack stores. So you need API keys to access it. But many companies use Slack and it's easy to just add one more workspace to Slack. So for many people, the barrier of entry is quite low in the end. Having matrix to that. Of course, Federation means everybody could join from anywhere, from any home server. But for the WordPress community, we would want to lock them in through an existing authentication system. No retention limits, of course. And our WordPress community has multiple Slack workspaces for different countries. So this would be able to unite them in one place. And of course, an open source project should have an open source chat. All right. So we tried to make it easier to join Matrix. Number one, I already mentioned it. We created a way to use your WordPress.org account to access Matrix. And we created it in a way that anybody could install this plugin on their own server to use it to authenticate a user against, like, to join a Matrix server. And with the upcoming OpenMidiConnect being like full in for Matrix, this is a potential authentication provider. Yeah, so on WordPress.org, we've installed it. People can use the WordPress.org account. They will go through their WordPress login screen and just authorize the WordPress server to submit the information to Matrix. Number two, we created a Matrix client in a WordPress blog. So a WordPress page is made up of blocks. And those blocks, one of them can now be a Matrix chat. So we call it Chattricks. And you can configure each block individually. So one thing that you can do is you can pre-define the home server, which we'll do. But you can also restrict it to a single room. It's based on hydrogen. And we did some upstream contributions. So before we used it, hydrogen had, you could only have one hydrogen open in the whole browser between tabs even. So we contributed something so that you can use it in multiple blocks on the same page. If you have multiple posts, typically they would be like all put on one page and then that wasn't possible before. And we had a couple of bug fixes to use hydrogen with SSO. I'm not sure how many people had used it with SSO before. And this allowed us to create team chat pages. So what does this mean? We can give a contributor a URL, a WordPress.org URL, where they should go for a meeting. They don't need to know this is Matrix. They just see it's a URL on WordPress.org. So for example, for the Make WordPress Core that creates WordPress Core, the address of the Make blog is make-wordpress.org-core. So the chat page is just slash chat. Core has different chats. There's another chat. The design team has a chat and so on. So this is what such a page looks like. On the top, you have a custom, like it's a WordPress post. You can put anything there. We put there when's the next meeting, instructions on how to go there, also instructions on if you want to use your own Matrix client how to get there. And this is the chat rigs block, which shows the room at the time. For FOSDA, my colleague Ashish created a small demo, and it uses the WordPress Playground, which is an interesting concept where you can run WordPress in your browser, and you can test any plugin in a sandbox in your browser. So I've recorded a demo video, to be sure, but it's real time, so as it loads, you can see it's pretty fast. So this now loads WordPress, and we've preconfigured it with a chat rigs block, and here it joined the chat. You can go there and enter a message, and all you have to know is the URL of this page. If you want to add such a block to the page, this is how you do it. You use the Gutenberg block editor. You add the block. You configure it. You set a home server. And if you want to lock it down to a room, you don't have to, but it can be practical to do it. You just enter the room name, and then the block loses the room list and just shows the room that you attached it to. And then you can... It's a block. You can add stuff before, after, as you wish. It's a pretty neat way of giving instructions to people, or putting, I don't know, meetup agendas, whatever. It's like, it's a post. Additionally, we created our own element instance. Just you can preconfigure it with the home server so that you don't have to tell people, you have to enter this home server into the login screen. It's something where people might typically get lost already. And we also created a bridge. So since we control both the bridge and the matrix server, we were able to create all the users on the matrix server and use the Slack bridge with a slightly forked version so that we can use puppeting. So when you post something on Slack, your matrix user already will say the same thing in your name. And there's some upstream fixes, by the way, that could be merged. And... Yeah, so that makes things quite streamlined. And another thing that we wanted to do, we didn't want to lose the history of Slack. But it's been a bit tricky because if you create a bridge, the bridge needs to start at some point and you cannot really backfill messages. So what we did, we figured out this little trick of first creating a room and bridging it, then creating a second room and migrating the history of that room there. Then we would add all users to that new room. We would import the old events in sequence so that we can backdate using an app service. And if the user is no longer in the room, we have to re-invite them and so on. And when we're finished, we can then copy the events from the first room that already started to be bridged and thus close the gap in the history. So there is this period between you importing or getting the data from Slack and bridging that this gives you a way to close that gap. And then we can change the room aliases, reattach the bridge, delete the old room. We've got a room with all history. So now we have a matrix server, community, it uses Synapse with a Slack bridge. Open ID Connect configured and with the app service. And we migrated 3 million messages in 170 rooms, 45k users, 55 gigabytes of database size. And during this process, we updated the community. We had held weekly meetings as it's common in WordPress. We published meeting notes afterwards. And we've got coverage from the tavern. So it's like first in January, like we're starting this in April. This is what we're continuing. We had to figure things out about private and public messages. And then we installed the matrix bridge. So now to the migration. In November, we announced we want to migrate to matrix. And this is how we'll do it. We'll ask people to use matrix instead of Slack. Before the final migration, we'll post a message in every Slack room. Slack will be closed. This is where you need to go for instructions. And then finally, disable postings. It's actually quite interesting that it's pretty hard to just kind of disable a Slack server because in a way you want to be sure that it's still around. The only way to completely shut it down is to delete it, which is a destructive operation. So what remains is that people could still DM. Well, OK. So the feedback that we anticipated from this. People want the default. So we would figure like they would use element. We knew that the notification element is not to everybody's liking. There's no dedicated threads and mentions view as in Slack. Threads is coming. I saw it. There's a couple of things that people are used to from Slack that are a bit different. We anticipated that. So we felt like people could live without that. People in matrix have been living without it for a while. Search is a bit difficult. There's no search langes in element. And while there are many other clients, some of them miss important features like threads or have implementations that are kind of different. I mean, I've tested some of them. Nico, for example, it works, but it's different. And then when you provide a home server to community, it comes with all sorts of troubles. So you cannot limit people creating rooms on the home server. So people will create some spam rooms, whatever. So you need to be aware of that. So we started to collect issues by the community. So they said we are unable to force some things like we can on Slack. So you cannot reduce the time allowed for editing messages. You cannot enforce room membership for federated users. Well, okay. Thread messages in Slack, you can say, I want this message to also be posted to the main room. It doesn't work. Other Slack features, they're considered essential. You cannot ping a group on matrix. And you cannot ping here. As the room mentioned, but there is stuff that when you have one central server that you can enforce that you cannot enforce on a distributed environment. And scheduling of messages, reminders, not there yet. Through a bot maybe, but not in the UI well integrated as in Slack. Then accessibility problems. So there has been an initiative to improve elements accessibility, but there are still gaps like macro navigation. This over wasn't super great. Then we had bridges, glitches like out of other messages, duplicates, double, yeah. All sorts of small issues. Use experience around threads management, obviously. We anticipated that. Then some things that don't work with matrix failing to load time zone positioning, lots of user join events that can make things pretty slow. So what we did to address this. We implemented integrations and many fixes via bots. We used the mobile framework. We could use the RSS bot, GitHub bot. We tried to make it easier to migrate our own Slack integrations. So we have a post to room bot that uses the web hook to post messages to the room. The other direction, how can something on our servers react to something in a room? We implemented group mentions. Not super great. If you post a command to the bot, it will post another message mentioning everybody in the group. Which there are some very large groups. So there could be very long messages. And watchdog so that we can be alerted if some spam rooms are created by some community members. We also, because we had our own element instance, we could ship fixes there while they're waiting to be merged. We provided the channel for the community where they could get help. We created the commutation and guides. But we had to stop the migration. So Matt called off the migration at the state of the word. Then we posted about it. And in turns out the accessibility problems were too big. We weren't able to merge them in time. We submitted the patches upstream. And there was uncertainty around where, like, are UI needs, where are they on elements roadmap? What are the effects of the license changes that were announced with Synapse? And overall, like, do those changes mean that the ethos of the WordPress project are no longer aligned with the element or the matrix project? It's kind of, it creates a bit of uncertainty that was detrimental to the migration. So the current status. The WordPress community remains a Slack community. But now with the matrix bridge and all Slack history, new contributors can no longer need a Slack account to join the conversations. And turning off Slack is currently not planned. And we'll keep observing how the matrix product develops. So in summary, the WordPress community didn't fully merge and migrate in the end. But maybe the things that didn't work for us are not so important for you. We are a huge community. There's many voices. I could see in a smaller community those things not being as important. And I hope that this talk was able to help you identify which is important for us and decide if you are kind of suffering or not suffering under the same issues. Along the way, we did a lot of open source contributions. The WordPress plugins I mentioned. To matrix, open source, all our bots, the migration app service patches upstream. And that is it. Thank you. Yeah, check out the slides. There's lots of links in the slides. Yeah, so you mentioned that migration, you want to fill in your room with Slack and in for the messages there. You actually don't really have to do that because a lot of the bridges actually allow you to, the four messages that come in the history. So the question was whether we were using the functions of a bridge to get back the old messages. So in our experiences, it wasn't possible to backdate the messages. That was the main issue there. I suppose if you, it depends on the implementation on the matrix server. Maybe there's no, okay. Okay, so the question was whether we considered the relationship funnel? So that's a good question. Okay, so the question was whether we considered to use. to send another message. Okay, so the question was whether we considered to use an app service to change the push rules for a user in order to enable group mentions. No, we haven't considered that. Maybe it's a possibility. I don't know if the user would, basically you're saying that you would add a keyword to the user so that they would be mentioned and you would configure it for them. Maybe that's a possibility. Yeah, you were saying there were some accessibility problems which were kind of killed this. Could you give us a little detail of what was actually missing or what problems were that were? Sure, so the most important problems were around macro keyboard navigation, so navigating between the bigger sections of the app. So there's the sidebar of the spaces, there's the room list, there's the search menu and the messages. And for example, you couldn't get up the message list using the keyboard. And if you were able to somehow get into that area, then the voiceover read out wasn't very useful. It repeated for every message, for example, profile picture. It, so stuff like that had been annoying people. Thank you for the POS, but like the accessibility when I was outside, I think it's already reviewed. One of the other plans. Matthew said the accessibility team has reviewed the patches that were submitted. Other questions? Yeah. Did you have to disable some of those integrations or work around that you mentioned? For example, I imagine that here things wouldn't work very well on both sides in a different way. All right, the question was whether we had to disable integrations. So one interesting thing about the bridge is that it works in both ways. So migrating and integration could be done in a way that you first enable, like you create the migration on the matrix side and when it's ready, you turn it off on the slack side and enable it on the matrix side. And still both sides would be able to use the integration. So for the here one, well, it only worked on slack in the end, but I don't know, it depends on the team. Like part of the WordPress project, like there are so many teams that every team has their own ways of doing meetings. And some heavily rely on those group mentions, others don't, some need the here mention, others don't, it's like hard to make everybody happy all the time. So that's probably part of such a big migration that it's really, yeah, you get so many opinions and as with many communities, there are some louder than others. So the question from the internet, where can you find the tools you have used for the migration of the history of the room? Yeah, so the question from the internet was how we, if you can access the tools. So I recommend you to look at the slides, in the slide where I talk about the app service migration, that's where it's linked. Is there any integration with Element Call? The question was whether we did integration with Element Call, no we didn't. There is no culture of using video conferencing in the regular, some teams use it, but they tend to use Zoom at the moment I think. Yeah, it's depending on the team what they use. So for example, Slack Huddles as an alternative on Slack are not being used as far as I know. Is there a possibility for the migration? Or is it more of a licensing issue of accessibility? So the question was whether there is a possibility to complete the migration. I think it's certainly possible at the moment. I think there has been a bit of tension of implementing the migration fast so that people are not left behind. Like if you let the migration linger for a long time, then people will never migrate and then always at the end, like people get panic and then they do the migration so the whole long period is wasted kind of. So that's why the initial plan was to have it rather short. But on the other hand, I think this current hybrid state is not as bad as you would imagine because for new contributors, we've got this easy onboarding and one thing that I liked about this, the way that we implemented is that you can kind of slowly upgrade your experience. You start with the chat message, the chat URL, and then if you use it a lot, you could upgrade to elements, the one that we hosted, and then you could upgrade to another client. So I think that's an interesting way of luring people in. So maybe over time, the number of metrics users will increase so much that it becomes like a request from the community. But as of now, we're kind of waiting what the license changes do. And yeah, this hybrid state is one that I think is acceptable for the moment. Okay, no more questions? One more time. No more time. One last question. Matthew. I just want to know what it is about the license change on SNAPs that is causing it to. I invited to talk to Matt. Yeah, it's basically like WordPress is on the GPL license where you used to be able to modify software on servers and not having to push the changes back and also contributing back codes to the element project and assigning a CLA is something that makes people uncomfortable. All right, thank you. Thank you very much. Thank you. Thank you. Thank you very much. Thank you. Thank you. Thank you.
NeoDateFix - A solution to organising meetings in Matrix
Now, we will have Milton and then Norgin and then Amat and Mikhail. They will tell us about the Neo-Date fix and a good solution to organize meetings. Thank you very much. The stage is yours. Thank you. I'm happy to see everyone of you today here. As Jan presented, I'm Milton and we're going to talk about Neo-Date fix or previously known as Matrix meetings. It was the starting name of the project, but anyway, we'll start by talking a bit about who we are. Amat, yes? Closer talking. So we are four developers from Norddeck. We have been doing software development specifically developing web applications on top of Matrix. We are doing this in the context of the OpenDesk Sovereign Workplace project for the German public sector. We have built this suite of web applications that are embedded within Element. I'll explain a bit about that more later. But yeah, we have Neo-Date fix, which we'll present here today. We have Neo-Board, which is a real-time collaborative whiteboard, which is actually what you're seeing in this presentation. So we built these slides and they're presenting with the Neo-Board. We also have voting polls, which is Neo-Choice application that it's not spec-based, but I won't get into that. And also, if you were at the Fringe event last Friday, we were using the BarCamp application to manage the schedule and the speakers and the whole tracks there. So yeah, what is Neo-Date fix? Neo-Date fix is a web application that allows you to create meetings and video conferencing meetings, especially within a Matrix client using the widget API. So currently, the only client that implements the widget API is Element Web. So that's what we have to work with. It is a good thing. And yeah, what can we do with the application? We can create these meetings as meeting rooms, as I've said. The meeting rooms are created with the default widget layout. So we have the video conference widget expanded and front and center with other widgets that you can choose. Typically, it could be whiteboard or some other widget that you want to previously set up. So you can pre-configure this for usability and quick action when you get into the meeting. We can schedule recurring and non-recurring meetings and see them in this calendar view that we'll show. It also supports creating breakout sessions. So if you have larger meetings and want to create sub-meetings when split people between those meetings, you can do that. We also support users that don't have an account on the home server. So we'll bring them in, creating them as temporary guest users, and they'll join the meeting. And we can also integrate with third-party clients. So specifically in the open desk project, there's open exchanges. There's also a calendaring solution. And when you create a video conference call there, it will create a meeting room in element with everything set up for that call. And finally, all of this is fully accessible and with support from multi-language. Okay. So going to the widget part, if you're familiar with widgets, you sort of have an idea. If not, it's a way to embed web applications inside element. It gives you access to the room events and room state events, and not much more, but that's the gist of it. And the way we have built our applications is we built this common layer, which we call the widget toolkit, that gives you, like these, for example, a React component, which will inject the widget API client into your React, and you can start using it without having to do that integration by yourself. It comes with material UI components. So you can also have this consistent look and feel. You can change the theming. And it also comes with some mocking components for easier testing. And finally, it comes with a base Docker image that you can use to quickly deploy into your infrastructure, your widget, based on that. And finally, it's also not only a widget, but it's also a bot, because the widget API only gives you access to the room data. We need to create the room meetings. We need to set up all of these accessory workflows that we then use the bot to perform these. It's built using nodes in TypeScript and in the bot SDK and in the SGS package for the API that we exposed. And yeah, this is a broader overview. Now Mikhail will talk a bit more about the internals, how we are doing this. Hello, hello. Thanks, Mutham. I'm Mikhail. I would like to continue with a high-level architecture of how it works. So we have this Neo-DateFix widget that is embedded in the Element Web Client. It just uses a widget API with a toolkit to send and observe state events and call some other actions that the API allows. It all goes past through the Element Client to the Home Server. And some of these goes to Neo-DateFix bot. So in Neo-DateFix bot, it looks for some particular message events of some particular types. And when they are received, it applies certain actions to the rooms. So besides having metrics API, Neo-DateFix bot also has HTTP API that is used by, again, Neo-DateFix widget to provide widgets lists and additional configurations that widget may need. And additionally, it provides the HTTP API to manage the meetings by external clients, as I've already said. Addiction to these components, we also developed several Element Modules that simplifies a bit this setup. And also add some additional optional features, like, and this one, like Lifecycle Model, Guest Model, but these are optional. And then, we have a new one. It's a very simple way. So if user wants to start with Neo-DateFix widget, it has to create a room. And then he needs to invite bot to this room, bot will auto-cept the invite. And then user needs to grant moderation rights to this bot. As soon as it's done, bot adds Neo-DateFix widget to the room. So he can see calendar and create first meeting. So he could create single meeting or recurrent meetings. It all end up with one room for one meeting. But the meeting room is a special room. It has type Nordic meetings meeting in M-Rooms create event. It also connects to the parent room, this calendar room, with M-Space child and M-Space parent events. There is one too many relationship with the meeting rooms. The meeting room has widgets and of course it has some other state events that are related to the meeting. Within the meeting room, user could create a breakout session room. It also would be a separate room, but with its own breakout session type. And it also has a connection to the meeting room where it was created. So we use message events and state events obviously from the metrics. And all the message events there are prefix weeks, not Nordic meetings prefix. So they are the events that are sent by the widget. So just to manage the meetings, to create, change permissions, pages, participants, Tomstone the meeting or send some messages. The state events are used to store the state of the meeting. Mostly there are metrics ones, but in addition there is a net Nordic meetings metadata. It contains calendar information. So this is an example of this calendar information from the meeting metadata event. So for the single meeting, it is a list of just one entry that has certain end fields with a date time stamp together with a time zone. And so it's quite simple. And for occurring, in the case like it could be excluded dates and overrides, it has besides frequency rules, it's frequency daily interval one. It has exclude dates, to exclude some particular dates from this recurrence. And as well to, it could have several overrides to change some particular occurrences of this recurrent meeting. Yeah, this is all regarding the slides and I would like to hand over to Nurjin to show some of the features. Thank you. Thank you, Mikhail. Hey, Bill. It's Nurjin. So yeah, hopefully after the all talk you'd be teased enough to see some actions, some demo. So here, me and Ahmed will try to demo quick the basic features. So yeah, first we need to create a new room. We call it a calendar room. And yeah, so we need the bot to be added to the room and give him the correct exact right as a moderator above. So the bot would be able to configure the room for us and add the widget. Yeah, here we can see the bot added the widget into the room, pinned it in the middle. Yeah, so we can schedule meetings while, yeah, here you can see there are the information that you can create for the meetings. You can add participants. You can also allow or not allow messaging in these meeting rooms. And also we have a set of configured widgets by the bot. You can add, remove whatever you want. We will create an example of a single meeting and a recurring meeting. So it's basically the same. We can add the recurring meeting. We say if it's an open-end meeting or if it's end after specific time or yeah, after specific recurrences, for example. Yeah, here we can see we have first the list view, which are the meetings are shown as cards. Each card contains the information with extra buttons for the participants and also the share meeting. We can share it with the meeting with a link or by email or we can download it and as an ICS file and important into other libraries. We are also able to edit, delete the meeting and of course go directly to the meeting room. Yeah, other than the list view, we have also this calendaring view. We have the day view, the work week and week and the monthly view. And of course in each of these views, if we click on the calendar entry, we would be able to edit the meeting. So for example, here we can try to edit the whole series by switching or just edit like the whole series or one recurrence. We can edit here one instance, for example, save. And if we go back to the calendar, so the changes are reflected into the calendar, this deviates from the others regarding timing. So we can also join the meeting. We can see that the bot already configured the room with the widgets that we chose with a specific layout configurations that we set. Here we set that new board and the JITC are now configured. We also see that the bot sends notifications to the room with every change that we make. And also besides those, we have the near-date fix details. It's basically just another detailed information about this meeting room. Also we can do other actions with it, edit it or delete it. We can also go back to the parent room of this meeting room. And as Milton already talked about the breakout sessions here in the meeting rooms, we can create breakout sessions as many as much as we would want or need. Here we can select, they are divided into groups, named defaultly by group one, group two. We can distribute the whole users randomly or we can select them manually, whichever we would like. So, yeah. Yeah, here we can see the breakout sessions are created. They are also as cards and extra we can check that we can send a message to all breakout sessions. As an organizer, you want to notify all breakout sessions, yeah, let's be back to the meeting room or whatever. So we can send it. We can go to other user and here we can see that he got all the invitations. For example, this for the daily, if we view the message maybe, or go to it, yeah. Yeah, we can, so basically the message of this invitation contains also the recurrence rules, information when it occurs and when and who you are added by. And yeah, you can see here, for example, in the breakout sessions where Alice sent hello word or hello, the, yeah, the message was sent to the room and also breakout sessions are configured with Jitsi. Yeah, I guess that's all. Thank you. I will hand over to Milton. Thank you for the demo. Hope you guys liked it. Just to finish our presentation, we have a couple of interesting things that we find that want to share with you. So the first thing is that as you imagine creating lots of meetings and these temporary users creates, I mean, it's not, it's relatively cheap to have these resources, but we want to keep things clean. So we have these additional features where you can clean up the temporary users using a signups module and also have this sort of hackish room reaper that will go through the finished meetings in the past. According to their, there's a field that will tell you tell the bots when they should be deleted and it will clean up after himself, which is a good thing. And also we have, can you move to the next slide? And also we have what we believe is a very good end to end test suite because as you may know, besides units and integration testing, end to end tests allow you to script the full interaction from the browser to the element web to the widget and how it then interacts with the bot. So we have a fully automated way to have the environments being created, tested on, then destroyed. And this is obviously a preconditioned for us to have releases when these tests pass and they covered most of the features. So if you want to see a good example of using playwrights and test containers as a way to do end to end testing, please check out the repo. There are obviously still room for improvement. We are just finishing and should have soon, should be releasing support for encrypted meeting rooms and encrypted control rooms or the calendar room. We've had a slight issue here because we are clients, requires us to deliver to this special IBM Z platform. There weren't any bindings for the crypto rust. That's the case. So we're, I think we have that in order for soon release. Also make the bot clean the rooms instead of that hacky script. Support element call also when it becomes a defect host and they're there. Also have space scoped calendars. So in the demo you saw that there's a single calendar room and it will create the meeting rooms in your top level. So if you could have it create within spaces so the meetings would be within that space and who couldn't maybe manage different teams or different groups with different calendars, that would be a good thing. And yeah, and finally we can get meetings in publishing them out to another in another calendar in client would be a great thing to wrap up. Here are some of the resources, the links to our repos. I mentioned, I think it's an open source Apache to licensed applications. So be sure to check them out. Yeah, I think we're ready for questions. Yes. There was one question on the internet. Do you have support for entry of the rooms and if not is there plans to do so? Yes. So as I said a few minutes ago, we currently don't support it, but we are soon releasing that. So it's a matter of days. And the question was do we support encrypted rooms? Sorry. Yes. Yes, this is the Neil boards. This is maybe can can you show maybe a couple of features. So this is the this is a widget that allows you to have a real time collaborative whiteboard draw. It's an initial feature set, but it's if you went or if you participated in the summer, the matrix community summit in Berlin last September, we did a full presentation. It's online. You can check it out for more details there. So the question is what is how are we implementing the invitation page when we show them to the invited user with the information about the meeting. Do you want to? Yes. So it works like first of all, in invitations, there is a message itself, but it's of course constructed just inside of a member member. There is invite text, but so it's unfortunate we didn't show it, but it's there isn't there. But besides that, there is not this metadata event, for instance, it's we configured it separately in synopsis. Share it in the strip state. So when you go to invitation, it's already shared and you can see it already in calendar. So if you have calendar as a second user, you would see it there already. So we just yeah, we added to the strip state. So it's a bit. Yeah, exactly. Did I understand correctly that it only supports. The question is if we only support jitzy meetings and not in app meetings. What do you mean by in app meetings. I mean, most of the clients have own meeting functionality. And I just wanted to give the chance to use that. So if the answer to that is if there is a widget for it, we can support it. So if there are other alternatives for video conferencing, it's a matter of just developing a widget that supports it and setting up the bot configuration to then include it in the room. It can be implemented as a separate. So if not supported as a widget, so in theory, you can develop the widget is a toolkit and edit as another widget to the meeting. Okay, the question is what is the Docker container part that I mentioned. So in order to deploy widgets, they are a web application. So they need to run on a web server. So we have this Docker file template based on engine X that it's already prepared for you to just auto include in your base Docker image. For your app. So you just instead of including Debian or node based image, you include the widget server image in your Docker file and just copy over the build release distribution assets there. That's the main accelerator for that. So it's ready to use base image for widgets. Any further questions. You can download an ICS file. The question is, can we can we integrate with Google Calendar and other calendar publishing platforms? We only support downloading in the ICS file for recurring or single instance meeting. The inner format that Mikal showed it's restoring an I cal. I cal format. So the storage is using that. But yeah, we don't export any data out currently, but that would be a good thing. The community is open source. You can contribute with support for that. As well. Yeah. If you go to maybe can go to the resources page. You can. Well, not the widget, but the Neo board. Yes, we have a live widget demo for having this. Sure. The question is, how can you can you include and use this right now in your element web client because it's a widget and the bot you have to host it somewhere. So you would need to download and deploy it to some server or VM. It's not included in this time that I don't know. And it's, yeah. Thank you. Thank you very much.
MatrixRTC: The Future of Matrix Calls
Thanks for the amazing introduction. And as Jan said, so we're here to talk about the future of calling and matrix. And we actually bring some pretty cool new things and there's quite a thing going on at the moment. So yeah, we really hope from now on there's finally good calling and matrix, or at least we're doing the first steps and all this will be built upon matrix RTC, which is a underlying protocol basically, which empowers all the calling in the future. And that's what we're going to talk about, how this works, how this is structured, and how calls are built on top of this. So matrix RTC is actually something a couple of you probably already has encountered in the form of element calls. So basically this is a standalone web app where you can just have calls. It's very similar to JITC, but in the background it's actually running matrix RTC. And what hasn't reached yet though is that this is really in the federated system. So what we have here, this single page application is very, very enclosed. So it's running its own home server and it doesn't federate. You can't log in with your actual matrix account. You have to have a custom account for this specific application. And the change we're going to present now is that we actually have the same technology but in our federated matrix system. So before we actually start into the interesting new things, we talk about why we even considered redesigning all of this. Because as probably all of you know, there is calling in matrix since quite some time already. It's in Niko, it's in the legacy matrix apps, element apps, and it's an element web. So why not just work on those? There are issues. For example, if you call each other at the same time, you get issues that the calls sometimes don't figure out that two people want to talk to each other. Sometimes one of your devices never stops ringing. But why not just fixing those? Oh, I see a lot of knots. That's actually super satisfying. It's really good to see that people know what I'm talking about. So why not just focus on fixing those? Why rebuilding something entirely new? And the thing is, there are some pretty fundamental limitations. So it's by design just one-to-one calls. That's just how it's designed. And it never really was in this specification designed for something bigger. It's very call specific. So you can't build any arbitrary real-time application on top of it. It's just for calls. And that's something which we think would be cool if that changes. And the signaling is done over room events. That's not necessarily a mistake, but it makes things a little slower than necessary. And also, it's really hard to get right, as we can see with ringing never stops, or we call each other at the same time, and it doesn't converge to an actual call. So this is basically our vision, what we want to achieve. So we want calls to be a great and central part of matrix via matrix RTC. And those four columns are the core things which we really want to get right. So we don't just want to have calls. We want to think beyond calls and build an expandable system that motivates also other projects. So we already had this, not with the exact stack we have right now, but something very similar. And people like Element build third room, and also Nordic build things like the NeoBoard, which are also kind of built on a similar thing than matrix RTC. And we want to make matrix RTC really a thing where it's super easy to build those kind of things. The other column which is super important is that it's using a pluggable RTC spec end. So currently that's LiveKit, and LiveKit is an amazing open project. So it really fits into matrix from a culture point of view. So it's an open system, and it really solves all very complicated issues if you use WebRTC for calling. It even ships a SFU, and it's just a very, very decent combination, like matrix for the high level signaling and LiveKit for actually doing the WebRTC shenanigans you need to go through. It actually gets quite annoying if you look into the details, and they just do an amazing job to really get this all nailed down. And then it has to support large group calls. Everything which we want to have in the future shouldn't be just for one-on-one calls. I guess that's pretty obvious. And last, we want to make it as simple as possible for other clients to support the whole infrastructure. And we already have two apps from Fremantly, like the Fremantly app itself and FluffyChat which support it. And we have Element apps which support it. And we also, we talk about this in more detail later, want to make it as easy as possible for others to also add calling. There's like a widget path you can take, and also LiveKit helps us here, because they provide pretty decent SDKs. So if we want to build calling on Matrix, we really want to leverage all the good things about Matrix. So here's like a very short, I guess I can really do this quickly because everybody knows probably, what really makes or what Matrix is really good at. So what things we really have to pass over through this real-time infrastructure. And one of this is that it's an open standard. That's like one of the things I really see as a core of Matrix. It's super cool. It's really fast and so it's great out because it's not really that surprising. Then we have Matrix encryption, which is really powerful and it goes further than just encrypting for large rooms. It also has a very good authentication and verification system. And that's a thing which I think is super essential that you not only can connect encrypted to other people, but it's also verified. So you have the guarantee that if everybody is doing this device verification correctly that all the participants are actual participants you trust and where you trust the devices are not malicious. And that is what actually makes security in the end that you don't have any weird third party being in there which shouldn't get the data streams your streaming. Then it's a federated system. So calling definitely has to go this path as well. And what Matrix is also really good is in having persistent storage. So it's not just exchanging data, it's also storing data and replicating the stored data over multiple home servers. But that kind of comes with the cost that it's not real time, real time, what we need for calling. It's more like a, yeah, more in the below second range but not millisecond range. So having those four columns in mind, how can we now use Matrix to really build up something like a system that uses the best parts of Matrix while still succeeding in actual real time? And this is done by like those three core parts. We have the Matrix part, then we have Client apps which then use the LiveKit SDK and we have the RTC infrastructure which is LiveKit in this case. So starting from the top we can see that we have just a Matrix room which can be on a federated system. And the core component in the Matrix system or what the problem Matrix solves here is that it basically stores which user is currently in which session. So if I'm joining a room and I'm reading the room state, I immediately can tell who is in which session and how to connect to those people. So I know if there's a running call and I know how to connect to them. And yeah, of course then Matrix also does a lot more sharing keys and providing the accounts and the verification. Then as the next in the center part we have the Clients themselves. So here we have a couple of Clients which have only the green box in there and then Clients with the green and the blue box and each of those boxes is basically one RTC application. So to make this example more concrete, one could think of as the green box being Element Call or just calling in general and the blue box being some document, shared document real time system or third room or whatever have you. And some of those members are just in one RTC session and some are in two and this is also something that should be possible. And then last at the bottom we have the RTC infrastructure where we primarily want to use LiveKit but it also would be possible that you use FullMesh or we have this empty box at the end. It also should be possible to basically use whatever new technologies emerging. So if web transport at some point in time replaces web RTC then you could implement a new infrastructure which then does the same high level signaling over matrix but it still uses this new technology to have even better or even higher data transmission or whatever the advantage is. So now we look into a little bit more detail for those room events. So before we had them at the top, now they're at the right. So we have room, multiple member events and each member event has an array of memberships. We need this array because as seen before we could have a call and at the same time a real time document. And the top part of the membership JSON object here is the actual core matrix RTC part. This data is just there to know how to connect to this specific peer in this RTC world. So it has this very central field Fokie active where it says the type of Fokie or the type of connection you want to use in this case live kit plus it has all the necessary information to connect to this. And this could also, this is then the part which actually can be replaced with web transport or full mesh or whatever you. And then there's another pretty important field and that's the application. So each membership has a specific application associated with it. In this case it's M call and that basically also gives the typing for all the other fields. For M call we have a call ID as well and a scope. So if it's just a call for the whole room or if it's a breakout call or whatever you want to add to the calling specification. But you can also imagine all kinds of other things. So if we think about third room one possible field could be that you have for example a country or continent in there. So when I look at the room state I can immediately tell who is in which country and based on that I know whom to connect to. So we can do very high level optimizations in this matrix RTC world already before we even connect to an SFU. What time do we have? Like, oh it doesn't say here because. Oh, okay this is fine. And we actually can talk about this as well. So this is kind of an interesting thing and it's one of 20 problems I could have chosen which we encountered which I find really interesting to just really get the mindset what those call member events are and what kind of problems we encounter in such a federated world. So it's about call history. Whenever we have a call it's of course super valuable to then see in the room history that there was a call. How long the call was, how many participants there have been in this call. And one idea, one very trivial approach would be that at the end of a call we just send a summary in the room and the summary contains all the data. How many people there were, the duration and everything. But then we encounter specific issues which are very, very common in a federated world who creates this event. Like there has to be some kind of glaring and maybe nobody feels responsible for it. Maybe the one responsible has a client which crashed at the moment where he needed to send it. Maybe two people think they're responsible because there were some state which hasn't resolved yet. And it also would be redundant data because every state event is of course also part of the DAG so it is in the history of the room. So by having another summary event we of course introduce a possible conflict where if you look through the state history you see that the call was ten minutes long but in the summary it's twelve minutes long because there was a client side back failing to calculate the proper call duration. This slide actually got broken. Either way it's still visible enough so it works. The cool thing is if we look at the call member events which we showed before it's very easy to pass those events as join or leave events. So if we look on the left hand side with the green border we can see in the unsigned field we always have the previous state of that event. So if the previous state was an empty array and the current state is an array with a membership this can be easily passed as a join event while on the right hand side with the black border we have a previous content with a membership so somebody was in some kind of RTC session and now the current content is an empty array which implies that's leave event so it's really easy to tag those events. And looking at the next slide we have a visualization of a basically timeline so the left hand side has to be interpreted as the past and the right hand side as the present and the red boxes are state event changes which we tagged as leave events with the system we used before and the green boxes are state event changes which we tagged as join events. So if we go through a very simple example, member three for example they just had no changes at all so during the whole period which is shown on screen they were no member. If we look at member two in the past they were no membership then they had a join event so from that point on they were in a membership and then a leave event. So if we now run an algorithm locally that we start from the current or the present and we just go back collect all the leave and join events we can basically recreate the call state. So at each point we know who was joined and who wasn't and then we just loop through this algorithm until we find a point where nobody was joined and that then is of course the start so this slide indicated with green border so we have then all the information we need we have the start, we have the end, we even have the number of participants who joined we basically even have a heat map at which time there were how many participants like there's lots of data in there and each client can decide on their own what exactly they want to do with it and how they want to render this in the timeline. So yeah this is your part now. Who on time? Thank you Timo. So now we are going to look at implementing because well client implementers also need help and if you are one of those people whose client already has the WebRTC parts implemented you might be thinking ah shit I need to throw away all of the stuff I've already done. Not really. So Timo showed this already but there's this small RTC infrastructure bit which we are going to look into a bit. This is MSC3401, well kind of MSC3401 the m.call event has already been removed because it caused way too many glares and stuff if you want to know more about that you should watch Timo's matrix community summit talk about why the m.call had no ownership and stuff so it caused way too many glitches. The first half is just the matrix RTC stuff which Timo already mentioned about. The participants send the member events so the room has a history of who joined when and now if you don't have an SFU you could just say the infrastructure or the foci or the back end in the matrix RTC as mesh and then you can potentially just use the P2P MSCs which were already implemented by you or hopefully will be implemented for a mesh call and a mesh call is basically a P2P call between multiple participants. It's just not as scalable as you would think but now you can use your existing MSCs, your existing implementation for mesh calls and you don't even need an SFU or something but if you are rich and if you do want to set up an SFU then it gets much simpler. SFU in our case will be LiveKit but all of the signaling bits are now handled by LiveKit itself over WebSockets. The previous thing was over two device matrix events. The first half is the same but basically all of the signaling part is now handled by LiveKit over WebSockets. More about LiveKit I'm going to keep saying that SFUs are cool but SFUs are also very expensive and if you don't want anyone else to use your SFU you probably want to have some authentication in front of it. So if you are a home server owner admin and you also host SFU then you probably also will be hosting a JWT service which basically gets an open ID token from your Synapse server. You send it to your service. The service then validates if you are the actual one who generated that token and well then it generates a JWT token for you which you can use to authenticate with LiveKit SFU. Right now I believe that the Synapse thingy only checks if you are the actual one who generated the open ID token but I think there's already work going on for checking if you are actually in the room so only people who are in the room and if you want to actually join that room only then you can get access to the SFU. Some fancy stats. The LiveKit docs say that with around a 16 core Google virtual machine you can have calls with around 150 members. This is I believe 720p no simulcast just draw 720p 150 members feeds. From my personal testing I used a Hexner CAX21 well not personal but family gave that to me but it's a four shared VCPUs ARM core thingy and I could get around 70 participants with simulcast and 720p everything optimized I think. Ringing. You might not think ringing is important but ringing is actually very difficult to get. Mainly because native operating systems are not really friendly with you and will try to kill your app every possible second. So they started a GSOC project by the way. GSOC 2022 project at matrix. It's basically a three month window and you have to do this particular task. Well my task was actually implementing the whole WebRTC thing but I implemented the whole WebRTC thing in two weeks and for the next two and a half months I had to fight ringing. You need to focus on three cases. Your app could be in foreground, background and terminated. By ringing I basically mean if your app is in one of these three cases you need to be able to somehow ring the application when you get a call. Pivot it three times and we'll see the three ways. This is a story yes but hopefully client implementers can learn from this. This is the coolest part which I wanted to show at FOSTA. I did not know you could do this. This is Android specific. As far as I know only has one way you can do this. That's using Colkit which is the phone dialer app on the iOS thingy. I think WhatsApp also uses that but turns out Android also has a way to do that. It's called telecom manager or the connection service API and what you're seeing on your screen right now is the Samsung OEM dialer application and what the telecom manager allows you to do is put any wipe call from your application to the dialer so you don't really have to handle all of the OS killing your app and stuff because the dialer already has that. Then you get this fancy UI. You see all of these buttons, the hold call, Bluetooth, even the merge button works and I didn't have to do that. You also don't have to implement a new UI for all of those holding calls or you have another call when you're in another call. This was very cool but why this could not be implemented? For this you need to add your app as a calling account in your dialer app and that is a very hidden setting. I could not find a way to programmatically do it and also in some of the regions it's just blocked. It's apparently a regional thing so this could not get in. Frustrated by that I went to try to. Where? We just hack it. Apparently Android has two very nice thingies. Show on lock screen and the up here on top thingy. What we basically do is apparently we're out of, well running out of time so this is going to be super fast. We just call the up here on top thingy which then brings your app on the top and then you can use the show on lock screen. Even if your app is foreground or background and your screen is locked you could potentially just hack the app to get live and stuff. It does not work on terminated apps and no way my coworkers would have let me merge this thingy. Try three. Fine. We'll do it the right way. By the way, if you are thinking this is an obvious solution this was not obvious for me because family and Fluffy chat are written in Flutter and when I get a notification I would have to start the right Android bits then start the right Flutter bits and then decrypt the event and then show the ringing too much work. But well turns out after two tries I found out that push notifications already do that for you. Well, so we just abuse that now. You use the Firebase push thingy or the unified push thingy. They start a worker for you. They bring up the Flutter engine. A Flutter engine is basically something which is attached to your Android activity. Once the Flutter engine has started you can just hook on to that. You can hook on a VoIP listener to that and then kind of abuse it to see if there's an invite event coming in and then you show your own UI. That works. I hope that's the right way to do it. Please tell me if that is not. By the way, if you, like I said, I use the m.call.invite thingies for the thing now. But that's not a thing with LiveKit because all of the LiveKit stuff happens on WebSockets. So there's a new MSC for that. With this you can basically, this uses intentional mentions so you don't spam your whole room with your notifications. But you can specify which user IDs you want to spam, ring, and which, what your notification type is. It could either be a ring or a notification, yes. S-Frame key sharing. No time, but SFUs need another lock because WebRTC and said the SFU stuff uses S-Frames secure. Trust me, bro. Cascading. Yes. Right now your calls are, well, right now the calls are technically federated. So you could potentially have a call inside a room with SFU one and you could have a call inside another room with SFU two. The only main limitation right now is that all of your participants who want to be on a call need to be connected to the same SFU. With this you can also have like secure deployments where you basically just have the left half and then all of your communication is within your organization just for the local network, etc., etc. But in the ideal future what we want is cross SFU communication where every home server could have its own SFU and their JWT service, then all of the users from that home server connect to their own SFU and then the SFUs cross-federate, everything is federated. Yay. This is already a thing by the way, but it's a proprietary thing in LiveKit. So maybe if someone from LiveKit is watching, please open source it so Matrix can use it. Probably not going to happen. And how you implement this? The easy, there's two ways. You can either implement element call in the widget mode. I believe there's two SDKs right now, the Rust SDK and the React SDK, which already support widgets. So you can just use the iframe in your app, looking at you fractal people, do it already. And if you don't unfortunately support the widget API, well then you have to go the hard way. You need to implement it using the native LiveKit SDKs and, well, LiveKit has a lot of SDKs. The Flutter, Android, Swift, Rust, obviously Rust is there. Yeah, that's it. Thank you. Demos. By the way, if you can join this demo, you, Timo, I think they can use develop.relement.io. Yes. Basically maybe you go ahead and show the... Ah yes. Yeah, but so they can sign in. You can either use... You can either use... I should have written this down. You can either use develop.element.io or td-family.github.io slash fluffychat. I promise you this is not a phishing attempt. I can show you the CI run from what I deployed it. But well, and once you go there, just type in this alias and then you should pop up in a room and you can join a call with us. Could you repeat the URL? This is the URL. It is... Yes. Timo, do you want to start it now? Yeah. So basically, can people hear me if I talk without the microphone or... Okay, then you just have to talk with them. I'm talking. Okay, perfect. Yeah, then I can also talk. So basically what I just did is start a call. And the cool thing now is that we really have inelement web, inelement x, and infamately or fluffychat, we have the full new matrix RTC stack implemented. So all of them are able to... Can you hear me? Yeah. I can hear some weird sounds. So all of them can talk with this new stack. So you have to go to develop.element.io and there is a feature flag there. Oh, to be in the camera, makes sense. But in general, this is like the big new thing now that everybody can without doing something highly crazy, just go to develop, activate the new group call experience and then still and then be able to use the new calls. So basically what I just did is start a call, but I think I did a private call. So that's why I did the ringing as well. So I am joining here. And TD now... Someone's already in the call. Yeah, this was me just joining there. And I think maybe Kim is in there already. Oh, there is multiple people. Interesting. Well, that's element for you. You have been seeing this for months now, but now we go to the fancy thing, fluffy chat. This started a month ago, so this is probably ridiculed with bugs, but well, if it works yes. Kaboom. Nice. So this is really super, super cool that TD managed to get like in record time. Fluffy chat into a state where we have again a federated multi client system with calling with group calls. So yeah, this is one of the first few multi client... I think it's the third time we do it now. Multi client federated matrix RTC call. With screen sharing apparently. Questions? Do you guys want to break it? How many people can still join? Oh, we are doing a test. Might as well. Yes. Yeah, are there any questions? Does LiveKit send any emails back to say who's talking? Oh, yeah, there's actually lots going on on LiveKit. Oh, okay. So the question was if LiveKit is sending any signaling back to let us know who's talking and yeah, probably also who's showing video. And there's lots of things LiveKit does, so it's actually pretty sophisticated in that regard. And even there's things like if I upstream video, but nobody's consuming my video. Like let's say we have a conference of 100 people and everybody has me at the bottom, and LiveKit is communicating to my client that I don't even have to upload video anymore. And that doesn't only work with upload video and don't upload video, that even works with a resolution. So basically if lots of people consume me in just a tiny thumbnail, then my client automatically notices that I only have to stream the thumbnail. So there's like lots of optimization happening that at the end from a receiver point of view you basically just download what you actually see. And from a streamer point of view, you also only upload what people actually need to see. Yes? Who holds this LiveKit? You said that this is fully federated, but maybe I somehow missed the point where we talked about whose LiveKit service is used. Because in the previous iteration with full mesh, I thought the cool thing were that multiple MaTvic servers are involved, also multiple SFUs or whatever are involved. Now it seems like it's maybe the LiveKit server for the first one initiated or something. Yeah, so basically. This is kind of two questions. The first part was who's hosting the LiveKit server, where are they coming from? If it's federated, there should be like, yeah, same similar to MaTvic server, multiple servers, and that's exactly what's happening. So the idea is that in the future it becomes very, very common that next to your MaTvic home server you also host a LiveKit SFU. It's kind of similar to that lots of people also host a turn server right next to their MaTvic server. And then the second part of the question was how do we decide which SFU do we use? And of course, like what TD presented at the end, where you have the option that SFUs talk to each other, there you would just always connect to the SFU of your home server. And if they're federated participants, the SFUs in between each other would figure it out. Now there's actually a system that the first one, exactly how you also initiated it or presented it, the first one who's joining defines in their member event which LiveKit SFU to use, and then everybody's jumping on that SFU. And since that means if the first one is leaving and maybe others are joining, but they have a mistake that they put the wrong or different LiveKit SFU into their member event, we even have real time switching from SFUs. So it's not, I think it's a one second interruption you get, but it still works really well that if the first one is joining with SFUA, the second person has SFUB in there, then the first one is leaving the call, everybody's immediately switching to the SFU from the oldest participants. But I guess it's quite obvious this is mostly a workaround until we get to the point where the SFUs in between each other can exchange the streams directly, that would be, of course, much more elegant than we don't need this anymore. But for now this is exactly how it works, so we can always guarantee, because that's a very simple glaring algorithm, just take the oldest member state event, call member state event, that we can always guarantee everyone is on the same SFU, which is quite important for a call, of course. Does that answer the question? Yes, always. Do you see any technical difficulties with having recording or transcripts? So the question is about recording and transcripts, and if there's technical difficulties around this. So basically since this is matrix, the ideal and easiest to cross approach, or UX, however we want to call it, would be that those kind of things just happen as bots. So, or recording would happen as bots, where you can easily just have a recording bot, they are just another participant, they are part of the room, they get into the key sharing, so it's very transparent for everybody that it's not just those participants, but also the bot receiving the streams, and then this bot would take care of recording. And since it's all based on LiveKit and LiveKit is a very, very good infrastructure already, there are amazing tools for this, so recording should be fairly straightforward. The transcript question, which was also asked, that is basically an implementation discussion. You could also have a bot, and then the bot could stream the data into a data channel, or the bot could stream the data directly into the room, because it's then part of the room, or you could say you don't want any bot to get the data, and you want to run local systems, which do the transcription, and then just, yeah, do it locally, like there are multiple solutions for this. I guess we'll see what the future brings. This is amazing. I think somebody just joined the room with a, oh, but it's just unmuted. I thought that is somebody already having implemented recording and now playing live. That would have been so cool. I just got super excited, but I guess this is just my echo. Any other questions? So basically the current state is that it's just on develop, but it is ready to try out. I think this is actually something, can you show the path to activate the new group call feature in Element? So if you go to develop, there is one feature flex, so for now you will only get the option to do jitzy calls and legacy calls, but if you want to have the new matrix RTC calls or Element calls, you need to go into the settings and then feature flex, and there's a flag called, yeah, new group call experience, and if you turn this on, and on the sending and receiving client, it should all work, and on Element X, like the mobile client, Android and iOS, it also should just work. Like there you don't even have to activate a feature flex. You just go into the room, press join, and it should end up in the same room there as well. Actually, that's a part of the demo we could just do, right? Do you just want to join with that user? I think, yeah, this is actually also a thing we can show. It's basically easily also possible to have multiple devices per user, so that basically implies we have simple continuity, so I was connected with this computer, and now I just connected here. Oh, I need to read this. It's dangerous. So I'm connected here as well, and now you can't see any streams, right? It does show streams on my computer. Maybe they will recover. I mean, yeah, seems to not work, but it works on this computer. I mean, I can turn it around, so at least the first row can be convinced that it's actually showing this stream right here. So if I would hang up here, I basically did a continuity to move the call from here to here. Oh, and this is also pretty interesting. I'm not sure probably no one can see it, because it's just on the screen, but Paul has joined with an older version of Element X, and currently, if you're in an unencrypted room, you will stream unencrypted media. If you're in an encrypted room, you will have per sender encryption, and that's a part where TD kind of rushed over. So basically, if you have an older version of Element X, this isn't considered yet, and since this is an unencrypted room, if you join with an older Element X, you still stream encrypted data, but my client doesn't expect encrypted data, so that's why it's giving me all kinds of noise. So basically, this is proof that it's actually encrypted. So what TD said is... Trust me, bro. It's always super hard to demo an encrypted call, but here we are. We managed to break it, and there you can actually see that it's encrypted, which is... And the only reason this isn't an encrypted demo today is because we have different encrypt implementations of that. I believe Element uses room events, and I decided to use two device events, because why not? But this will get figured out once we start drafting MSCs and stuff. Exactly. Last question? All questions answered. Cool. Thank you so much.
The state of the Matrix Rust SDK in 2023
Hi, everyone. So today I'm going to talk about the state of the rest SDK in 2023. So all the things that we've accomplished as of last year and some of our future plans as well. So first of all, who am I and how did I get into the rest SDK? Well, I'm Benjamin Bouvié. I'm a software engineer in the Rust team at Element. Prior to that, I worked in a game engine, well, a game dev company on a game engine that was written in Rust and WebAssembly. And prior to that, I was a compiler engineer in the SpiderMonkey team, which is the JavaScript engine, powering Firefox, where I did Rust and WebAssembly. So you can sense that there is a common theme here. And back in the days at Mozilla, we were using IRC. And so I wrote a few bots that were just pulling out jokes from the internet and posting them on the channels. And then at some point, we decided to use this new cool thing called Matrix. And so I rewrote my bots so that they could also run on the Matrix using JavaScript at the time, because when you work at Mozilla, you have to bet on JavaScript all the time. And then a few years later, I decided to rewrite it in Rust because I like Rust. And I made this framework system called Trinity that would use Rust for interacting with the Matrix system. And then you can actually write the bot comments themselves using WebAssembly, which is pretty sweet. And I experimented to neutralize it in production. It's mostly a fun project. And that's how I started to use the Rust SDK. So what is the Rust SDK? Very good question. So it's a Rust library implementing the client server API to allow you to implement clients easily if you want to use Rust in your project. So the code is available on GitHub under the FHT2 license. And it does all the things that you would expect from a Matrix client. Logging in, logging out, sending messages, receiving messages. But I guess the most interesting thing is that you get into an encryption for free. And you don't have to worry about the, excuse my French, gory details in the sense that you don't have to learn about Olm, Meg Olm, like sending, uploading your keys, claiming keys, querying keys and all of that stuff, which is very fine. And that we handle for you. Some history for this Rust SDK. So there was in the past one project that was called Ruma for Rust Matrix, which modeled all the events that are happening in a Matrix room timeline. And also all the requests and responses to the endpoints. And the goal at the time, I think, was to try to create a home server in Rust. Eventually that didn't happen for the Ruma project itself. But people realized that it was a good idea to actually model all those events, request and responses, and reuse them across other projects. And there was another Rust home server that started to be written and that is conduit. And like in another timeline in the world, so there was Damir, who is now the team leader at the Rust SDK team at Element. He was doing Rust on his part like free time. And he maintained a small plugin so that you can use WeChat with Matrix. And that was written in Python. And so as he was trying to learn Rust, he decided to rewrite it in Rust. And the thing is, well, he did so. He searched for library written in Rust to do that. And there was none. So he decided to start one. And that's how the Matrix Rust SDK started. And from the outset, it started to use Ruma because it made sense. And that allowed to reuse massive amounts of code, which was very nice. And Damir, being a crypto engineer, he also implemented all the crypto stack, which was very sweet. And then that was first in the Matrix Rust SDK. And all of that code was pulled out and extracted as an independent library called Vodose Mac, which apparently in question means amphibian. And it's like a big pun across languages like Olm, Megolm, and like all of these just refers to amphibians, it seems. And yeah, so that's how it goes. All right. So why Rust, you would ask, well, this is my minute for the Rust evangelizing taskforce. So I mean, you're probably convinced if you're here already, but it's like at the same time high level and super fast. It allows you to write code in a very fast fashion without having to worry about lots of low level details and issues. It is secure and memory safe, which is very nice for a library because you want to have something very robust. It has an amazing tooling and ecosystem, like all the packages, the crates that are published on craze.io give you all the things that you want to have. And like the cargo, the tool that does it all is just wonderful. You can run tests and, you know, the documentation and all of that. You just want to also very important for the rest of this talk. It is compatible with foreign function interfaces. So you can call into other native languages that speak the CABI. So it's quite important to see. And one of the things that is maybe a bit undervalued in the Rust community is that it's actually also in trying to empower you to write a multithreaded code without you having to know too much about it, trying to make it very accessible. And it's a value that was in the community first, and you can find it in all the places. It transpires in translates to all the places in Rust from the error messages that just hold your hands and try to explain you what you did wrong and try to tell you how to fix the problem that you run into, et cetera, et cetera. So it's very sweet to use. And yeah, being a former C++ programmer, so there was this notice in one of the offices where I worked before that read, you must be this tall to write multithreaded code. And it's apparently at three meters high on the wall. So this is something of the past. Like with Rust, you can just be fearless when you're writing multithreaded code because there is this thing called the ownership model. And that makes it really easy to also model concurrent implementation of anything really. So that's really, really nice. So why the Rust SDK? Well, there was this story where we had three apps, basically Android apps, the iOS apps and the web version that is also powering the desktop version. And they all were using a different SDK and a different crypto stack. So that means that if you are serious about your security, and you want to, for instance, audit your cryptography, now you have to do it in three places and make sure that every single implementation actually does what it's supposed to do, which is a bit of a nightmare. And now you have also per platform issues. You can have a bug in one stack, and then you need to check whether the other stacks also have it, et cetera, et cetera. Well, now we are saying, no, we have only a single stack for the element apps, and it's written in Rust. In particular, it's a single crypto stack. You have very high test coverage. As I'm speaking, it's like more than 83% of test coverage in the Rust SDK. The VodoZemac library, the crypto stack is being first as well, which is very important in terms of finding issues, security issues. So that means it's a single place where you can add features, you can code once, where you use everywhere the old Java Dream that everybody knows and loves about. All right, who's using it? So there is Fractal, the GTK-based Matrix client, which is using it. There is IMB, terminal UI client, if you like, Veeam bindings and all of that. There's the new generation of element apps. The element X apps are only using that, which is pretty sweet. And also the crypto stack, as it could be extracted, and it's also like there are specific bindings just for the crypto stack. And so it could be used in the current generation of element apps. And it's another codename element R. And I guess that you can imagine what the R stands for at this point. Rust. All right. So what happened since the last first time? Well, the previous release of the Rust SDK was in October 22. So we made a new release this year. Yay! At the beginning of this month. Thank you. So it's still not 1.0, still quite experimental. We're breaking APIs all the time, but trying to do a better job at writing, changing logs and all of that. And we'll see how it goes. So new features. So you probably heard about sliding sync last year. And this year, the new kind of sync synchronization that makes it so that logging into a new device and retrieving events is always instant, even if you haven't opened the app for months or years. So we entirely support that. There is the basic feature that you can subscribe to specific rooms and list of rooms of which we get a sliding window that is computed by the server. But we're getting rid of that, as Matthew said. And it also implements a modular design in the sense that you have opt-in extensions for read receipts and typing notices and many other things. And all of that is supported in the SDK. As you can see on the right, it's quite verbose because, well, it's a very versatile and general like API to give you the most control so that you can build higher level primitives on top of that. We'll get back to that. And it's vitrugated behind the experimental sliding sync cargo feature. And you can, we basically use it in production in element X. So it's quite stable, actually. There's also support for OIDC, so OpenID Connect. It's a cross stack effort moving from the custom metrics authentication to OpenID Connect. If you have a metrics authentication service running, so it's another service running on your server alongside the Synapse or your own server, it can act as an actual OIDC provider or specialized proxy to an upstream provider. So if you have a GitLab instance, for instance, you can connect it to the metrics authentication system and then have your GitLab users log into matrix for free, like that. And so that's the server side part. It's also written in REST, which is pretty sweet because that means that the request and responses can be actually reused in the client, the REST metrics SDK. And the SDK implements all of that already. And we are also using it in production in element X. So it gives you all the things that you would like to do with OIDC, create, reload, metadata, register on your OIDC client, do the login flow in all the steps and all of that. And it's also behind the cargo feature at this point. Among the big news, we have a new default storage backend. So the storage backend are implemented using traits, which are REST for interfaces. The previous defaults when you wanted to persist things on disk was sled. And now it's been replaced to SQLite because, well, pretty much everybody knows about SQL. And it's also much faster for our use case. We still have an in memory backend if you don't care about losing states and an index DB backend that is used when you're compiling for the web to WebAssembly. Some new cryptography features. So there is this new thing called secret storage. And it's mostly an implementation detail, but it gives you an encrypted key valley store that is backed in the user account data. And where you can put any information that you would like to share across all your devices in a secure way. Like the server doesn't know about this information. It cannot peek into it and know what is in there because it's also encrypted. On top of that, we implemented key backup and restoration. So that means that when you have a new device, well, when you're using elementics, for instance, it will store all the room keys that are used for decrypting room messages in encrypted rooms in the secret storage. And then another device can restore them so that you can actually see the history of events before you joined with that new device. Also, in addition to that, we made it so that the cross-signing automatically happens and you don't have to worry about this at all. That's what's used to verify your own devices and other people's devices. And it's also like some of the private keys are stored in that secret storage as well. And speaking of high-level primitives, so we made a new crate, new package called the Matrix SDK UI. It is highly experimental and also highly opinionated in the sense that we're enabling a few cargo features by default. And we are trying to make it so that we implement the best practices in terms of user experience and performance. And it's also as robust and tested as the rest of the SDK, which is very sweet. And we use sliding sync as a foundation for all these new high-level features. One of these features is the room list service, which, as its name suggests, gives you a list of the rooms. Yes, it does it so in a way that we try to make it to show something to the user as soon as possible. So that's how you feel that the app is kind of instant when you open the app, because it will try to load just one event for all the rooms you were in, or for no, a few rooms you were in. So you have something to display. And then in the background, once that's done, it will try to fetch more events. And also you can configure it to say, this is a set of visible rooms in my apps. So because when you have an app, you cannot show like a thousand rooms, you will only show a subset, right? So you can configure it to say, this is the ones that are actually rendered on the screen. And those are prioritized so that you get more events for these rooms. Another thing we added was the encryption service. So it's basically a sliding thing that is just running encryption on the side, and it gives you access to more concurrency with the other one. So think of it this way, the room list service, the one I just talked about, when you're scrolling on a mobile app, it will change the list of rooms that are shown on the screen, right? So that now means that it's sending new requests to ask for things. And if we did the encryption in the same request, but it's getting a bit technical, but that would mean that we would need to abort those requests and delay encryption. So now we have basically more concurrency and more performance. And we can do the encryption task in the background while you're still scrolling on the room list using this encryption service. And we also have a notification service. So that's very specialized client that just handles push notifications. So if you're given an event and a room identifier, we want to retrieve the event and maybe a bit of context, like what's your name, what's the name of the person who sent the message to you and all of that. It's also using a sliding thing for that. And it makes use of the encryption service because on an encrypted room, of course, you would get a push notification for an encrypted event and the server cannot know if it's a meaningful event, right? Maybe it's just a reaction, putting a thumbs up on one of your messages. So we decrypt the event in the client itself and then we decide whether it's worth sharing as a notification. The one fun thing, if you can call it fun, is on iOS, if you want to modify the notification in case it's encrypted, it's running a separate process. And that makes our life very hard because even if you're just decrypting data, the state of the cryptography keys is mutably changed, right? So now we have multiple states that is global across two processes that are sharing the same database. So we had to be a bit creative to solve that issue and we are basically enabling the writer head log in SQLite and using some data in the database to indicate who's the process that currently tries to read and write to the database. So basically implementing a text like that. All right. And since we added those two services, the encryption service and the room service, we wanted to make it very simple to just fire synchronization and forget about it. So we made some nice high-level service that just wraps the other two and you can just build it and start it and it will do all those things for you and implement all the best practices and you don't have to worry about any of this. And then you can just take listeners on that service and get information that is meaningful to do when we're rendering for a client. Now that we have a list of rooms and decrypted events, what do we do? Well, we want to display them and we have an API for that called the Timeline API. It's basically a room view MVC, so model view components on steroids. The thing is that in the matrix protocol, events are actually like atomic. It's an app and only database. So let's say you have a thumbs up reaction to a message that is a response to something else that would be two events, like the reaction itself and the message itself. So the timeline will aggregate all those different events into a single timeline item that is much more what you want to render as a client on the screen. So it makes it much simpler to render a single timeline like that. And it does a lot of things for you too. It can enter local echoes. So basically when you're sending a message to a room, you want to show it even before the server has returned that it received it. So it will do that and then reconcile the response from the server with the local state and all of that. So it's pretty sweet. And it's all observable, very reactive. So that's nice. You just get, like, as a user of that API, you get notification that one item has been added or removed or updated. And you can just, like, react according to that. So how is this all used in ElementX? We're using a Mozilla project called Unify, Unify FFI. It will automatically create bindings for you for calling interest from other languages. So at this point, we generate bindings for Swift on iOS and Kotlin and Android. It can also generate bindings for other languages. And we use that for Go, for testing purposes, I think. It requires a bit of integration with the foreign languages runtime. And over the years, we've contributed a few PRs to this project. So we made it so that you can just use procedural macros for exporting your types and your input blocks to other languages. And we also added this year's support for async code. So you don't have to block when calling into an async function on the Rust side. It will just look as an async function on the Kotlin or Swift side. And you have actual concurrency and background processing happening, which is pretty sweet for performance. And reactive programming in Rust. How do we do it? Well, the principle of reactive programming is you have some data and you want to make it observable so people can subscribe to it. And then they will get notifications. And I mentioned the Timeline API that will notify you when there is a new Timeline item that has been added, removed, et cetera. So we're using crates that we created ourselves, IBOL. And there's also an extension that is div-based for collections because when you have a vector with a thousand entries in it, you don't want to say, oh, now there's a new thing that has been pushed into the vector. I hear all the 1,001 entries for that vector. No, you just want to hear that there's a new entry and that's its position, right? It also has some extra querying facilities. So you can batch all these updates, div updates. So you don't have to cross the FFI language boundary too often. That has an inherent cost that we want to avoid, some overhead that we want to avoid. And for your batch transaction, well, for your batch to be quite precise, you need to have also transactions to say, this is the beginning of the batch, this is the end of the batch. And also you can do some filtering on these stream of events, limiting, sorting. So it's kind of mapping to things that you would do on SQL in general. It's pretty sweet and that's what we're using, for instance, to filter the rooms in the room list immediately on the client side. All right. So some of the future work that we're going to do, well, I intentionally remain a bit vague here, but we're going to eventually support all the major features a matrix client would expect. We are already working on Scrantum cryptography. And as of today, I think there has been a PR against Voters and Mac to have something that is compatible with Libsignal and with what they do. So that's pretty exciting. And there is a general theme of doing more things client side. When you have end to end encryption, your server kind of becomes dumb sometimes because it cannot peak into the encrypted event. And so you have to resolve a lot of things on the client side. If you get a new event in a room, does that trigger a notification for an encrypted room? Well, you have to push a notification and it's the clients that will decide whether or not it resolves into an actual notification. And even for sorting the room list, you have to do it client side because if there is some room activity, you want to sort by room activity, just show me the room that have some activity. Well, it's the same thing. If the event was encrypted, you don't know if it was just a thumbs up reaction. Maybe that doesn't justify putting the room at the top. If it was something meaningful like an actual message. So that means that this task has to be done on the client now. And yeah, we're also computing the other badges client side in the rest SDK. So we are trying to be very careful to not get into static notifications situations because it's a pain for everyone, us included. And yeah, that's pretty much it. All right. Just a few things. Well, first to all the contributors of the rest SDK special shout out to Kevin Komei from the fractal community. It's done like a bunch of work in the rest SDK, including most of the support for OIDC on the client side, which was MaceFPR. And if you want to be on this slide next year, you can contribute to we have a few issues that are tagged as good first issues or help pointed if you want. And I would like to take this opportunity also to thank elements for donating all of my work to the matrix organization. You can also be a supporter of matrix if you want by following one of these two links. Thank you for listening. And I would be happy to answer any questions if you have any. To the internet asking why have you moved away from sled? Why have we moved away from sled? That's a good question. So I think in terms of performance, so slay this, if I recall correctly, I wasn't there when that happened. So it's kind of hard to answer this precisely, but I think that it's a key value store embedded key value store. And the performance was not great, especially on mobile devices. And we just figured that using SQLite that has been like performance tested and improved and tuned over the years was a right thing to do. And also the way you structure your data using a SQL database is quite different from the way you would structure it with a key value store. So it's just like slightly easier to perform requests when you have a SQL database because you know all of that. Yeah. Any other question? The internet also asks how is your developer experience when using UniFi, UniFFI in general? Are there any hard edges? That's a good question. So there's, oh yes, so when using UniFi across, for calling rest across other languages, have there been hard edges? Yes. It's been a few cases where we have a memory leak that is identified. Well, Kotlin uses the GVM and GVM as a garbage collector. And so we accidentally, and when I say we, I think it's like the UniFi group in general, introduced some leaks by having the equivalent of premises or futures leak sometimes. So that was a problem, but usually it's, I would say it's 90% of the time it's stable. And the 10% of the time where there is an issue, it's high priority for us because obviously it breaks our apps. So we fix it, we try to fix it as quick as possible and we contribute back. But most of the time it works fine for Kotlin and Swift. So the support and stability is also per language, I suppose, since you have to create bindings for each language. So yeah, I cannot speak for the Python or Go generation on the UniFi side. But usually, since Mozilla also use UniFi, they have to provide high stability guarantees as well. So they are pretty reactive and also fixing bugs. So it's working well. Yes. I was wondering about the startup times. Yeah, so the question was, what about startup times for the rest of the time? I think there were two questions. The first one was just starting the SDK itself and then when you're syncing a list of rooms, do you get instant and response and all of that? And well, it's native code, so you don't have to boot up an entire VM for the SDK itself. So it's pretty fast. It will restore the state from the disk. So that can be a slow step. But even like for users who have thousands and thousands of rooms open and I'm looking at Matthew on the side of the room, our general benchmark runner, it's pretty fast. And for receiving a room list, we are also tracking these performance of our time. Pretty much instant. And every time there is an improvement that needs to be done, we'll do it. Yeah. I mean, we are in sync with the synchronization times where about between five to 20 minutes, if you are a very heavy weight user of Matrix, now it's really up to three seconds. So consider that an improvement. Any other questions? Yes? What's the state of supporting extensible events in the rest SDK? So I think that's a question for Ruma. And since we're using Ruma for passing the events, and I'm pretty sure that so the rest type system is quite extensive in the sense that you can have union types. And for each event that can be extended, I suppose that you can have there is a variant in that unit tab that says it's a custom event. If you're referring to a specific MSC, I don't know what it is. And I'm sorry about that. Was that a custom MSC or? No. No. Okay. Just events in general. So yes, you will end up in these case where we have, you will match on this union type or the event and it will say, well, it's something I don't know about. So I'm just ending it over to you and you do something with this. Yes? Can we also provide a question for the rest of the SDK? I'll rephrase this question as are there plans to use the rest SDK for web? Because it's not used. So right now, as we are speaking, as of last week, people have started, well, have enabled by default for new logins on Element Web, I think. So that may be the nightly version using the rest cryptography for Element Web. We have a separate repository for bindings that are for WebAssembly because there's no meaning in using Unify for that. We can directly compile the rest to WebAssembly. So no need to have an intermediary in the middle. And I think the long-term goal is to use the rest SDK everywhere for the Element apps at least. So don't, don't take my word as granted, but I think that this is going to happen. Yeah. Any other question? Yes? That's a very good question. So the question is, is search in scope for the rest SDK and what kind of features would be out of scope for the rest SDK? So to respond for search, that depends if you mean room search or message search, full text search. And well, actually it doesn't depend because the answer for both is yes. We're going to try to take care of that. For full text, we, there was a previous client made by Element called Hydrogen. That was a web client and that could do that and had the fancy system to actually index the messages on your client and then share parts of the index with your other clients devices. So we're probably going to reuse and reimplement some of that in the rest SDK at some point. Yeah. In terms of what features are out of scope for the rest SDK, it's kind of hard to tell, but I think that everything that is like high level UI related like rendering widgets, but not in the sense of widget API, but actually UI widgets and stuff like that is not something that we want to implement or provide. And then I think that, well, the MSCs that have proven to be features that have been proven to be not very useful will probably not be implemented. It's not clear what's not in the roadmap at this point. Sorry, it's not a very satisfying answer, but yes, question here. So the question was using the rest SDK can store a lot of data if you're listening to lots of events and is there any way to limit that amount of data that is stored on this? Well, as I was saying, the storage is implemented as a trait. So one could always implement a different version of the Scallot backend and decide to drop items at some point. One thing that we wanted to make it is the ability to store events locally. And that's connected to the previous question. If you want to be able to do full text search, well, you have no other choice, but decrypting all the events and storing them locally, at least in memory for some time to do the indexing. And then the indexes have to go to the disk. And that means that, yeah, the size of the index can grow a lot. And so we would probably have to implement some kind of garbage collection and say, well, we kind of forget about like old data, older than like a month, year or something like that. And we only care about the most recent data. All right, thank you very much.
A microkernel-based orchestrator for distributed Internet services?
So like the 13 instances of this bedroom and by this I am over to our particular class. Can you hear me? Can you hear me? Okay. Hi everyone and thanks for being here. It's great to be able to speak here. So I'm Alex and this is my presentation on some very high level and pretty speculative ideas I've had for how we could use microcranels in like to do different systems and to host websites. So I'm part of an association which is called Duffler and if it works at Duffler we have some infrastructure which looks like this. So we have some very low powered computers like this one which are hosted at home. So at home of course we have possible issues like power going down or internet being cut. So we have some machines at different locations in Belgium and France. And the idea is okay we have this infrastructure which is like pretty fragile but maybe we can just put all these notes together and build this system. And this is actually what the Duffler infrastructure is doing. We have email, we have websites, we have instant messaging and a few other things running on these very basic machines. So currently our infrastructure looks something like this. So the idea is not to spend too much time to enter into the details of this but basically on the right end here we have the actual applications that we're interested in running. So for instance we have an element for chat, we have Jitsi for video conferencing, Crippad, other things. And to run all these applications we need this whole huge stack currently. So it's based on a Linux OS and MixOS for declarative configuration. And then we have this platform stack here which is based on an orchestrator which is called Nomad which we use. It's a bit like Kubernetes but a bit simpler and I'd say probably easier to use. But still we have all these different components which are basically... ... storage systems. GARZ is one that I'm building myself. And we basically pull these... ... software. And if we look more closely in what's happening on a single node, actually it's kind of a huge mess. So like this is the operating system running on one of these... Here we have all these management tools. Things that are... So yeah, from a conceptual point of view, this really systems like... ... to enter into too much detail. Let's say for instance we have here internet traffic coming to our server to request some information. It's going to traverse a reverse proxy which is going to do TLS encapsulation. Then it's going to go through an HTTP link to the actual backend which is going to talk to with specialized protocols to the storage layer. And basically we can describe all of these things with boxes and arrows connecting these boxes. So the idea is that actually this model of boxes and arrows is the model of micro kernels. Boxes are... ... ... ... memory between different processes sharing the CPU time. And also controlling hardware access. So this is like fundamental thing that only the kernel can do. Separate the resources of the computer at the CPU level between different things that are going on. And then the micro kernel will also provide some IPC mechanisms like message passing or shared memory. So... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... I've made things and connecting things very explicitly only when they. So this diagram is like what's running on one node, but maybe we can include some form of network transparency to make this more into a distributed. You There would be some impact on performance and we also need to be quite careful about that. Okay, so it's still time for some questions, comments, whatever. Okay, I might have one question. So the use case should be always the God, the thing that dictates what how the architecture should really look like. So what do you have in mind in this area, something like safety critical or security critical or really just some average information system? Yeah, we're doing the association with the facts. I mean security is important because we're handling personal data of people, but I wouldn't say it's a security critical infrastructure per se. But of course, like, yeah, one of the advantages of such an architecture is like security is easier to like build in a robust way because we have much more control. Okay, thanks. So so quite natural follow up question. We probably have seen it in this discussion here. So how do you persuade the average guy to buy him? How do you persuade a very guy to stop using their Linux distribution and start using your architecture? I think this is going to be a very long, long work before we can get to that point. But the hope is that this system is both more robust and easier to use because we like we can get rid of some complexity probably. And and we can have some. So if we get to a point where there's good tooling around this, and whereas there's a lot of examples which are already already running and it's easy to get your own started, then I think we can really have something that attracts people. But yeah, of course, it's a long road before we can get to there. Thank you. Any more questions or comments? I don't see anything. So thanks for the talk.
Run Node.js in a unikernel reliably
The stage is yours. Thank you very much, Martin. So as Martin told already to you, I'm Andrea and I'm a software engineer at Genetio. We are trying to build the cloud platform for web applications. So our user should build web and mobile applications and deploy them to us. Just before starting, I want to set some expectations. So first of all, this talk is going to be a talk of the challenges and solutions we've come up with by running real node applications on Unicernals. And secondly, I really hope that this presentation will spark some discussions. I know that the Unicernal community is active and creative and me and my colleagues are going to be around after the presentation for us to speak if you have use cases, questions, challenges that you've tackled and you want to chat with us about them. So let's get down to business. Okay, so before starting to build a platform, we really needed to put down some guidelines and the vision that we want to implement with that platform. So is this sound okay? Because it sounds weird to me. Okay. It sounds good, but that's not the Unicernal. Okay, okay, thanks. So back to vision. So first of all, we really want to optimize resources like power consumption, memory, CPU and costs for our users. Secondly, we don't want to throw away response time for that optimization that we did. So we want to have a snappy response for our clients. So response time is very important to us. And lastly, we want an easy to use, secure and auto scaling platform. That means that we want to take the burden from the developer to where I deploy this application. How should I scale it and stuff like this? This is the job of the cloud platform and we want to provide that to the end user. So to take all the boxes, we decided to try Unicernals in a function as a service environment. Function as a service essentially means that when we have an incoming request, we spawn a VM that will be bounded in time. So it will leave for a few moments, for a few minutes, sorry. It will handle that request and send back the response. And we are doing this with the Unicernals. So essentially we have our orchestrator down here and then for the request, we are spawning the Firecracker with OSV, that is the Unicernal that we are using, with Node.js and at the top level, we have the user application. But still this has some challenges, unique set of challenges than the classic Linux container and classic long-living server. So our challenges are reducing the call starts because we are spawning up many VMs, essentially every few minutes, we are spawning the VMs if we have incoming requests. We have to somehow reduce the boot time for these VMs. So we have to boot it very quickly or give it a try to another mechanism to improve this call start. Secondly, because we are a multi-tenant platform, we want to somehow be able to upgrade the Unicernal or do patch, patches it at the Unicernal level without really iterating through each image for each of our users and redeploy everything. And lastly, of course, we want security and we want process isolation. That means we don't want application one to be able to access resource from application two. So we've tackled first of all the call start, which is a big problem in the function of the service world. So to tackle that, we leverage snapshots, as everybody does probably, and we did snapshots as following. We are booting up the Unicernal. We are starting the Node.js process and immediately after we start the Node.js process, we pause and at that moment we are creating a snapshot and store it for later. When we want to spawn a new VM, what we do essentially is we start the firecracker, which helps us load the snapshot and then we are attaching a new disk with the user code. We are mounting it and we are importing it into the Node.js code that was already started in the snapshot. This will help us with the second challenge that we have, the upgrading the Unicernal and you'll see in a moment how. To optimize even further, what we are doing is that we are not really waiting for requests to come and to then start the firecracker load the snapshot and so on. We have already a pool of warmed VMs that are started but they are not scheduled on the CPU. They are just logged in the memory and then we have an incoming request. We just take such one VM from the pool, we attach the user code, we handle it and then we give back the response to the client. Going back to upgrading the Unicernal image. So as I told you, we are a multi-tenant platform so we expect to have thousands of user applications that are running on our platform. So somehow we don't want to embed the user code into the Unicernal image because if we do that, we will have to rebuild each image of each of our user once we do a Unicernal upgrade or a patch or a bug fix. So what we are doing is that we are creating a single snapshot with just with the Unicernal and with Node runtime and as I already told you, we are attaching the user code later. So essentially when we are doing an upgrade, we just have to upgrade that base image and each VM that is spawned afterwards, it will reference to that base image. So this is why we are mounting user code afterwards and this is how we can enable OSV and Unicernal upgrade without really redeploying everything on our platform. And lastly, we are doing security and isolation using Firecracker Jailer. This helps us running each process sandbox, Firecracker Jailer allows us to have different network namespaces in such a way that we are not using the same network interface with each process. We have different file systems and different process namespaces. And also Jailer allows us to limit resources to make sure that we have some kind of fairness between VMs because we don't really want one single VM to eat up the whole CPU or the whole IO bandwidth. So we can control the IO throughput and CPU time for each VM. Because we are running things with real world stuff, we also find out some bugs, especially in the first one is in the Node.js V8 compiler, it used the pop-up instructions incorrectly and in the privilege mode, which is the single mode in Unicernal, with this it enables interrupts and that was not, we cannot run any Node code essentially, it was all Node code was affected by this bug. And also we fixed some bugs in OSV and we made some contribution upstream. Those are related to using two file systems because we need the first file system for the Unicernal and the second one is the user code that we are attaching later. And also we found out some notepossicks compliant functions in P-treads library. Okay, now let's talk about metrics. So as I told you, the most important thing for us is the request response time, because we want to make sure that the end user of our clients are getting their requests as fast as possible. So they have a snappy feeling when they're using the applications. So we actually benchmark that to see if using the Unicernal versus Linux container is as fast, is there is a difference or if there is an improvement. So first of all, we did a bit of setup. We are using a client, so essentially the browser, let's say, the client is in Asia. And then we are comparing with the standard functions and service solution from AWS. And also we have our servers that are running on the GeneXia infrastructure. So these are the three actors. We are comparing AWS Lambda with OSV, which is a Unicernal solution, and with a classic Linux container to have a full comparison, to have a full picture of the problem. This is the code that we are running. Essentially it's a Hello World code. It's more of a ping. We are just sending a request and we are getting back a Hello World string. And these are the actual numbers. So for a cold call, a cold call means that there is no VM pre-warmed for us. So at the moment when we are handling the request, we also have to wait for the boot time. And in orange we have Lambda that was getting us the request back in 300 milliseconds. And in purple we have the OSV and in blue we have the Linux container that are performing like 60 milliseconds. And then we also have the worm call, which is for the Lambda is around 60 milliseconds. And for OSV and Linux is around 30 milliseconds. So as we can see first of all, is that the OSV and Linux are mostly performing the same. And the first question that comes into mind is why use a Unicernal and why bother with it if the Linux container is just as good. But what we cannot see on this graph is the Linux kernel footprint. So the Linux kernel takes up much more space in storage when we are creating the snapshot, much more space in memory when we're using the pre-warmed pool VMs and so on. So the reason we can use Unicernals is that we are optimizing resources, even memory and storage and so on. Next steps for us, first of all, is to integrate many more canals. So for now we are just using OSV because it was the mature project at the point that we started, but we want to use many more Unicernals that are just developing in the community. And then we also want to add more support for more programming languages. We just went full on for Node.js because it was very popular. But we also want to add more support in a way that every web programmer can deploy its backend code on Unicernals. As a last call, I want to stay in touch with the people that are interested in this kind of project. I think that this community of Unicernals is very active and is flourishing from the contributions that we are all making. That's all. Okay, again, time for questions here. I'll be there. Hello. Thank you for the talk. Very, very interesting. Just a very simple question. Can you give me an eyeball on how big your Unicernal base image for Node.js is? I don't really have the numbers. So how big is the Unicernal image? Yes, so... For Node. So the last one? For Node. For Node. I mean what you used in the demo. So you have some megabytes, the dependencies, or so adding up to the gigabytes. To be honest, I didn't look into megabytes. Under 100. So I just received the answer in the headphones. It's megabytes. Okay, perfect. Thank you. Thank you very much. I may have missed it in the presentation, but in the benchmark that you showed us, obviously Lambda was running on the Amazon hardware, but what was Genesio running on top of? Genesio is running on top of a bare metal in another cloud deployment called Host. So basically we are running, I think, on ARM, on an ARM server that is bare metal. So we are building everything up on the ground. Okay, so there may have been a difference also in the hardware that was provided by Amazon and the hardware. That might be true because the Avisunda is running Graviton and we don't have that kind of hardware, of course. Okay, thanks. Yeah. So, more questions, yes. A bit of access for me. Thank you very much. So thank you for the presentation. I have a kind of a question, like from someone considering to do serverless and knowing that there is AWS Lambda, what is basically your selling point? Why would one use Genesio versus Lambda? Given that AWS is a big company and that Lambda has been running for a long time. Oh, I see, so there are a lot of reasons why you would use Genesio over Lambda. So first of all, we provide a much more easy tooling to use. So it's more targeting to have a very low learning curve. So you can just pick up Genesio in the meters of minutes than AWS Lambda, which is a bit hard to understand for a first time user. And secondly, AWS Lambda is not really interested in resource optimization, at least at what we tested right now. So with Unicorus, we want to provide even more resource optimization and lower costs. Yeah, hi. Hello. Thank you for the presentation. One of the things that's interesting about Lambda's now is they have a snapshot mechanism for Java. Is that something which you're looking to do as well is to have snapshots for different platforms and different kind of run times on top of your framework? Yeah, so we are planning to use the same mechanism that we use for Node.js. So basically every program language that we want to support, it will also have snapshots and it will benefit from the same mechanism that we use. Okay, surprisingly, we have still time for questions. Okay. Bring them on. Yeah, I'm gonna kill it. Very nice optimization. I mean, with the preheating and the pooling and the others, are you targeting or are you planning to target also kernel internals because I mean, I'm getting this experience from Unicraft but also I guess in OSB, you could do a lot of fine tuning there. Are you looking into that or are you linking only on how to optimize it via, let's say, external means because what you showed there is I have this VM, I pre-warm it, is it something in internals? For example, bootloaders, boot times are dreaded, even in your major Unicornals. Firecracker can be also configured to ditch a lot of devices, all those sorts of stuff can be fine tuned and optimized for different use cases. If you're using Node, some items may not be required, so you can do, I'm not sure, some postponing. Are you looking into that as well? I mean, more of kernel internals? Yeah, so we are also looking into that. As I know, we didn't really look into the bootloaders, but for example, we looked on how to improve the network stack because it takes up a lot of time to boot up the network stack. Valdic, is it LWIP? What is the networking stack of OSB? The networking stack of it, is it also LWIP or what is it for OSB? The next stack. I'm not sure. The networking stack, what is it? I don't think it's the best one. Okay, that's a good one. I mean, LWIP is dreadful, but BSD is a better one. Gotcha? Yeah, so I know that we are looking into that, but right now, the things that we already have implemented are just treating OSB as a black box. Okay. Yeah, the next step for us would be to also optimize the Unicom and the bootloader part. We need a kind of site for that. No, no, for sure, for sure, because by many, Unicom also would take doing a different approach, but I think that's a very nice spot and also very challenging to look into. And Unicom will provide this because you are able to actually optimize the application itself, optimize the kernel because of the way it's running. Did you, I'm not sure, did you use SMP support for this? What is it currently running single core? It's running single core. No. No? We got different machine sizes. No, no, no, I mean the deployment of VM. Can a VM run multi-threaded with different trends on different cores? Yes. Okay. So when you deploy something, you can choose the machine size, you can get inside the size, you can get on the graph, that's all. And they have different CPUs now. Okay, for an actual Unicom only instance. Okay, awesome. Okay, let me have a question as well. I was quite taken by surprise that you don't optimize the image, which sort of goes in my opinion against the benefits of Unicom that you really optimize down the image just specifically to the workload and the APIs and whatever that the client is using. So why do you do this trade-off? Do your customers really think that rebuilding the image is so cumbersome? So that means that we have to let our clients tinker with the Unicom image, right? So we are actually targeting users that are not that much into the Unicom stuff, so they are much into writing back and codes and so on, and they have knowledge into that direction. So we'll kind of try to abstract that, so maybe this is why we did choose to create the base image for them without really asking them how can we improve it even more. Okay, so basically you are saying that you are automating it so you don't want to push that burden to the clients, right? Exactly. Okay, that makes sense. So maybe one final question? Okay, nothing? Anyway, thanks for the talk. Thank you.
Using the NOVA Microhypervisor for Trusted Computing at Scale
Next talk is coming up. Pudos Shreinberg does not need a lot of introduction, especially in the micro-curnal circles, but he is the author of the NOVA microhypervisor, and I believe this talk is more a state of update. The stage is yours. Thank you. Can everybody hear me fine? All right. So this talk is going to be about using the NOVA microhypervisor for trusted computing at scale. So we will talk not so much about micro-curnals or micro-hypervisors. We will talk a little bit about scaling NOVA, and we will spend the majority of the talk talking about trusted computing. So the agenda is first I am going to give you a little overview of NOVA. For those of you who have not been in the micro-curnal deaf room before, maybe a quick question. Have you ever heard of NOVA before? Maybe one-third of the people. I will explain a little bit what NOVA is and why it is a microhypervisor and not a micro-curnal. Then we look at what happened in NOVA in the last year, in 2023. The second part of the talk will be about using NOVA for trusted computing for performing what is called a measured launch to actually get some trust in the platform. At the end, hopefully we will have some time for questions. NOVA is used as the bottom piece, the screen box, the micro-curnal that is used in the Bedrock Ultravisor, which is a virtualization layer that sits underneath virtual machines. For those of you who are familiar with micro-curnals, the kernel is very small and most of the operating system functionality is implemented in a multi-server user mode, a deep privileged environment. All of these colorful boxes are actually deep privileged processes that run in user mode and they are isolated from each other and they communicate with IPC. This is what you would expect from a typical micro-curnal. The reason that NOVA is a microhypervisor is because it additionally provides a virtualization interface that allows you to reuse unmodified legacy operating systems in virtual machines. NOVA basically relays all the VM exits to those yellow virtual machine monitors, which then implement the virtualization functionality. The whole stack, all the colorful boxes are in the process of being formally verified and this is going to be important also when we talk about trust. We will not talk so much about all these boxes, we will talk primarily about NOVA, the green kernel at the bottom and a little bit about establishing the trust between NOVA and the master controller, which is sort of the inner process of the user environment. When we talk about scaling NOVA, it originally started about 20 years ago as a research project to address them and since then we have productized it to run on multiple architectures. On the left we have ARC64 which is on V8 and on the right we have X86 architecture, primarily Intel and then we run on all these platforms and more that are listed on the slide. So at the top left corner you can see a variety of arm SOCs and all the ones in yellow are actually not using standard UEFI or ACPI interfaces so they have proprietary builds, you get proprietary or board specific binaries. But for some, like the Raspberry Pi's or even AWS's Graviton Cloud servers, the same NOVA binary works all the way from small embedded devices with just a handful of cores all the way up to big cloud servers with in this case 64 cores. And we have the same on the X86 world, actually the same binary runs on all these platforms whether this atom SOCs at the top right corner or client platforms that you see up there all the way again down to the largest cloud servers with over 100 of threads. So that actually required some infrastructure changes in NOVA but before we get there, in the interest of time I'm not doing any live demos but here you can see or if you can't read it then look at the slides online. The output of NOVA boots on Raspberry Pi 4 or 5. So naturally we had an interest in making NOVA working on the Pi 5 and it just works out of the box if you use UEFI firmware. And the top line which is highlighted shows that it's actually the same build, so the same commit ID and the same build timestamp and you can see the differences in the cores. Raspberry Pi 4 uses a 72 cores and the Pi 5 uses a 76 cores. And as I said the same binary also runs in the cloud. So if you take for example an AWS C7G metal instance you can run that binary and it will enumerate 64 cores actually our Neo-verse cores and it can also drive all of the PCI devices on a platform actually in multiple PCI segment groups. So I don't want to go into the details here. The same thing on x86 where you can see on the right side, so the left side is the beginning of the log and the right side is the end of the log that we can actually run on machines with over 100 cores with hundreds of PCI devices and tons of memory. So what did we have to do to make that work? And I presented a similar thing in my talk last year, what I call an innovation timeline. We put out a new version of NOVA approximately every two months, so six releases per year. And then some releases are more packed than others. So about a year ago we added some local APIC, registered virtualization and support for Atom Sox to NOVA. But then the more interesting work happened over the course of the first two releases or the next two releases at the beginning of the year where I implemented support for Intel TXT which is trusted execution technology in NOVA. And also to make NOVA work with really large core counts, we made the current memory pool extensible. So the bootloader has the choice of giving NOVA little or very large amounts of memory depending on how much a particular platform would want to use. And then in the middle of the year there were some minor adjustments to read, copy, update and capability management that we will not talk about here today. And then at the end of last year for the Christmas release basically, the TXT work was so complete that we could actually extend the trust chain all the way to the master controller, this blue component in New Zealand. And then again for the first release of this year which is going to come out at the end of February, you actually get even more functionality for the TPM and everything that's listed in bold. So you'll be talking about in this presentation. So why do we want to do something in the area of trusted computing? What problem does that solve? And I mentioned this in the introduction that we are formally verifying the entire ultravisor stack. So once that is complete, you know that the source code that you have fulfills its specification and maybe you have a qualified compiler that compiles this verified source code into some binaries. And even if you have that, things can go wrong. The binaries can be tampered with by an attacker either during the installation process, during the boot process, or after installation and you want to know that the binaries that you built are actually the ones that are running or are being launched on a computer. So you want to know that some remote computer is actually running exactly those binaries and not some modified version. Before you give that computer some precious content like your super secret K-I algorithm or some secret data. So in order to understand what trusted computing and what a chain of trust is, we have to look at the concept of what people commonly call secure boot. And secure boot is not a very precise term. The better term is actually verified boot. And verified boot works like this, that you have some immutable root of trust in this slide showing green. And that's the initial stage and it's immutable and it's a root of trust because you cannot reason about it's correct. You have to assume it is correct and it's usually implemented in ROM which doesn't change. And then every stage like, oops, every stage measures the integrity of the next stage and verifies it against some policy. And if the verification succeeds then the next stage gets launched and if the verification fails then you fail the boot. And this is basically establishing a transitive chain of trust and the thing we care about the NOVA hypervisor is at the very end. And this chain of trust only works if everybody before gets everything right. And that's hard because there's millions of lines of code living in all these boxes and some of these boxes are actually very complicated and extensible. So the E in UEFI actually stands for extensible. And the moment you make a change in any of those components, could be you add a new PCI card or you change the order of your boot devices, it changes the measurement. So keeping your databases of permitted integrity measurements or denied measurements up to date is hard. And the industry has learned this recently when UEFI was affected by this logo fail vulnerability which basically forced every vendor to deploy a new version of their UEFI firmware and to blacklist in the DBX database the old version that they had. So it is not very flexible and it is a very brittle thing. And the screen box here in the background shows that all of this stuff actually belongs to your trusted computing base. Because if any of these components actually modifies or trashes the binary, then even though you formally verified your source code, this binary is not going to do what you want it to do. So can we do better? This is an open source conference and we are not so much interested in DRM, we are interested in freedom. So we don't want to enforce boot policies, we want to instead use a concept called measure boot. And it works very similar that a stage measures the integrity of the next stage but then doesn't take an immediate decision on whether the next stage is good or bad. Instead this measurement simply gets extended into a TPM platform configuration register which stores this value for a later attestation request. And then the next stage gets executed. And there is still the problem that certain stages like UEFI and the boot loader are extensible and that they sort of leave a very hard to manage gap in this trust chain. But there is also the problem that typically the whole boot process is not protected against DMA. So these components do not make use of an IOMU or SMMU which means even if the software is correct you could have a USB device or some FireWire device, some DMA capable device that simply DMAs into this memory and trashes the software that way. So again the trusted computing base isn't really getting any smaller. So can we do better than that? Yes, we can and this extends the concept of measured launch with a dynamic root of trust. And the core idea is that you can't really change anything in this boot chain. You still have to execute all the firmware, you still have to load all your drivers, your firmware drivers, you still have to make a boot loader like a boot choice and you still have to initialize your memory controllers. But you can do all of this in a dirty environment. So a dynamic root of trust lets the system boot into initially an untrustworthy state. So we don't really care if anything that happens to this point and only at this point we want to bring the platform into a pristine state. And this is very interesting how this works because what the effect if you do here represented by this green bolt is it is a disruptive event which feels a bit like a platform reset but it doesn't reboot the machine. It just brings the CPU into a well-defined state. It's actually a protective mode with a paging turned off and it holds all the other cores. You can see that in a moment and it forces the execution after this launch event to a code pass which has previously been measured and protected. So we don't care about all this stuff in the red box anymore. That gets eliminated from the TCB which is great because it eliminates millions of instructions and our TCB is now just this DLTM sequence plus NOVA. So what do we need for that? The technology that gives us this on Intel platforms is called Intel TXT. You may also come across the acronym CDNT which just is short for Converged Boot Guard and TXT. So Intel has fused the static root of trust which is boot guard with the dynamic root of trust which is TXT into one technology. And TXT is the one we care about. This gives us the DLTM. You need a CPU that supports this and you need a TXT capable chipset and a TPM preferably TPM 2.0 because TPM 1.2 is really old and it can only do deprecated hash algorithms. And you need an SINNET module which matches your platform. And the purpose of this SINNET module is the module that Intel provides that you can download from their website is to initialize and verify the platform in a way that it is securely configured. And once you do this you can later do a remote attestation by asking the TPM what these measurements in all the platform configuration registers are. And then you can remotely take a trust decision to say if this PCR contains some value do I recognize this value as belonging to NOVA's December release or NOVA's February release. And who knows why there is this sign of like grant here. So Intel develops all its technologies under code names and the code name for Intel TXT many years ago used to be called Lagrange technology and this is named after a city in eastern Oregon. So what happens when you do this disruptive event? How does this reset the platform without rebooting it? And it's very interesting. So first of all we have a number of processors. Here this just shows four so these are four lanes. And we have one processor which we call the initiating logical processor. That's the one which initiates the DM sequence. And we have in this case three responding processors which maybe in some arbitrary state we don't know. They could be sitting in some idle loop. They could be executing malicious code. We simply don't know what they do at this point. But we also don't care. And then some time before the disruptive launch event the code for NOVA which is in this case called MLE the measure launch environment. And this is in it ACM must have been loaded into memory and again they could have been corrupted in memory. It could be the wrong version. We don't take a decision there. And then later there can be an arbitrary amount of time that passes minutes hours. We can do this a week later. It doesn't matter. Some component is executing this DLTM which is a specific processor instruction. And what happens when you execute this processor instruction which is privileged is that everything resets. And the chipset broadcasts an S-enter cycle on the interconnect. And the S-enter cycle basically initiates all the, initializes all the processors into an S-enter sleep state. So we now know that all the other processors, all the pieces are not executing any instruction. They are sleeping. And it transitions control to this S-init ACM and it checks its integrity. So it has a signature and it has a checksum or actually a hash cryptographic hash. So the processor validates that this module is a valid Intel S-init ACM and it launches that. And this module runs entirely inside the cache. It doesn't use any memory because the memory might have been initialized wrong. The memory might have physical memory aliasing where two physical addresses point to the same page. So this operates in a very constrained environment but it is software that can validate that your platform is correct. That the processors are not overclocked. There's no undervolting. That all the chipset registers that need locking are locked and so forth. And the final thing that this thing does when it has convinced itself that the platform is in a good state is that it measures and launches NOVA. And it stores the measurement of NOVA in TPM PCR 17. And then NOVA gets control at its measured entry point. And at some point later after it has initialized enough of itself, it can wrong the rule the other processors into the secure environment. So that by the time we get to the end of this, we have now all four cores or 128 cores in this measured environment. And should anything go wrong during this process like a rogue CPU showing up that nobody knew about or a CPU leaving, surprised leaving this environment, then the platform transitions into a TXC shutdown which effectively resets the platform. So now we talk a little bit about the TPM because what we want to do is we want to measure the next stage into a platform configuration register PCR. And whenever you measure a component, what we really mean is we have a region of that component of that image that we care about that doesn't change. You can call it an immutable region which is typically the code and the read-only data. And you compute a cryptographic hash like a SHA1 or a SHA2 cryptographic hash. And you get in the case of SHA256, you get a value that is 256 bits long or a large number like that. And this measuring entity executes a command to the TPM. And the TPM is a little chip like the one shown up here that sits on your motherboard. And it has, in a typical case of a client platform, it has 24 platform configuration registers. And it invokes an operation on the TPM that's called PCR extent. And the PCR extent operation is interesting in the sense that you can't write to a PCR directly. You can only extend a new value into a PCR and what it does is it takes the existing value, concatenates it with a new value and hashes the concatenated hash. And this forms the new value of the PCR. So the sequence in which you extend values into the TPM and the values themselves are all reflected in the hash. Basically it gets mixed together. And once you look at the PCR so you can read the value, you can no longer recompute the original chain of extent operations that led to this PCR value simply because the hash function is a one-way trap function. So how would a remote verifier who can ask the TPM for a quote so you can go to the TPM and say give me the value of those PCRs that I care about and have the TPM sign that quote report so you know it's authentic. You can send that off to some other computer elsewhere and they can look at all the PCRs and say, okay, if this PCR has the value that I recognize then the platform has launched authentic software. But how do you know if multiple extent operations have happened into a PCR what the individual values are because the individual values represent the individual software components. And for that you need the left side of this picture where in addition to extending a measurement into the TPM it also gets stored in what's called a crypto agile event log. And this effectively is an auditable trace. It's a record of all the extent operations that happened. And in addition to recording which PCR and what the digest so this measurement was that was extended. There's also some event metadata that said the meaning of this extent operation is I hashed the command line or I hashed the RAM disk or whatever it may be. So you have to send both of these things both the TPM quote request and the crypto agile event log to a mode of verifier and it can correlate the two things that can use the crypto event log for a particular PCR to recompute the value in a PCR. And then you can then check if the quote from the TPM actually lists exactly that value of the TPR PCR and if it has been signed with an authentic TPM signature. And then you know what platform and what software is running on the platform. So I said the SNNIT module measures the integrity of NOVA. And NOVA is a kernel that consists it's an F image that consists of code and read only data but also of some mutable data and some heap. And not all of that thing is immutable in the sense that it doesn't change. And while some people may think it's sufficient to just do an integrity measurement at launch time. So when you boot the system that's not the full truth because you can also close the lid on your laptop which will basically shut everything down and only keep the content of memory alive. And when you resume the laptop then all your protections are gone. So on a suspended zoom cycle you actually have to repeat this integrity measurement and then this yellow section has actually changed. So not everything can be measured and how does NOVA tell the SNNIT module which region of it to measure. And there is this MLE header which enumerates the memory pages that NOVA wants to have measured. And NOVA is actually the entity that initiates the launch process. So there's no bootloader that says launch NOVA. No, NOVA gets launched in this dirty environment and then decides itself to launch its second stage and thereby tells SNNIT module what this to be measured region is. But before it actually gets measured the SNNIT module DMA protects this entire region. So the moment it gets protected no attacker can change it anymore not even visit DMA attack. Then it gets measured the measured value gets extended into TPMPCR17 and then NOVA gets launched. And there's also some TXT heap data structures that NOVA's preamble code and NOVA's post launch code used to exchange data with the TXT heap. So one of the things for example that the SNNIT module produces and stores in this TXT heap is some information about how many processes really exist. And it also stores some validated copies of ACPI tables there so that no IOM use get hidden or whatever. When you write software like that that you want to measure you have to carefully think about what should be included in the measurement versus what should be excluded. So if you measure too little then maybe something can be changed in a security relevant manner and it will not be reflected in the hash. And the thing that immediately comes to mind is let's say you have a command line parameter and NOVA has a few among them is one where you can say don't turn on the IOMU. So this is basically a chicken bit for debugging and when you execute NOVA with this command line parameter it's obviously less than fully secure. So you want that change or the configuration change to definitely be reflected in the hash that the NOVA version that runs insecurely can be told apart from the normal version which uses the IOMU with the full potential. So the command line must be included in the hash but if you have some data structures that maybe take time stamps you don't want to take them into the hash because they will change the moment that the time stamp changes. So this needs very careful considerations. And then the next question is if you have a binary like that and you've built it to the compiler, the obviously compile instruction sequences of its choice. How do you know what integrity measurement to expect? So you need some form of reference measurement and when you execute NOVA's built infrastructure and you build a binary at the end of the boot process it will output all the reference integrity measurements. So it will say the sharp one value for this binary is this and the sharp two six values is this and the sharp have 12 values is this and then you know what value to expect when you do a decision. So extending this to user mode what does it require? So it requires NOVA to compute a launch integrity measurement of the root PD which means we have to define what is the region of the master controller of this root PD to measure and we have to actually do the hashing. And for that we can do two things. We can either send the whole data over the LPC or SPI bus to the TPM and let the TPM compute the integrity measurement or we can compute it in NOVA and software using the CPU. And I originally thought that using the TPM it would be a good idea because the TPM automatically does it for all supported hash algorithms. But as you can see on the right side the TPM is really really slow and the bus that connects the TPM to the system is also very slow. So in order to hash a binary of two megabytes of size TPM actually takes almost 14 seconds and NOVA takes 50 milliseconds plus two. So the 15 are for computing the hash and the two are for extending the PCR. And then obviously NOVA needs to drive the TPM itself because it needs to send commands to the TPM and NOVA needs to append the entry to the event log. So all of that infrastructure had to be added and we actually have to measure the root PD before we launch it because we can't have a process launching some let's say malicious instruction. And then saying after I've done some malicious action I'm changing my image to look innocuous and then I'd say measure me. Then it will look correctly even though it has executed something malicious. So before you even execute the first instruction of the next module you have to measure it. So the root PD cannot tell NOVA which part of it to measure so how do we define this. And it's simple you can actually use the L headers, the program headers in the root PD and we defined it to say the first L header that is readable or executable but not writable. That's the one that contains code and read only data. That's the one that we measure. And then NOVA obviously had to learn how to compute char one and char two fifty six and char three eighty four and char five twelve. And basically the entire NISP FIPS one hundred eighty standard. And that looked like very complicated but due to the beauty of C++ templates and function overloading and inheritance the implementation of all these hash functions and NOVA is actually just one hundred thirty lines. And it can do all of these algorithms. So that brings me to almost the end of my talk. The last thing we had to add to NOVA late last year was support for the TPM. And the TPM has two interfaces. The older interface is the FIFO interface and there is a newer command response buffer interface and NOVA had to understand how to drive those that adds another two hundred fifty lines of code. And then you have to send commands across that interface and the TPM library specification is very large. It's thousands of pages but NOVA only implements the subset of TPM commands that it needs for this measured launch which is determining what capabilities it needs. And then you have to understand how many TPM has and how many PCRs and what algorithms and then performing some PCR operations. And that adds about another five hundred lines of code. But for both the old TPM one or two and the newer TPM two dot oh. And then there is a table that lists how the TPM actually gets used by various parts of the platform. The TPM has different localities. And locality four which belongs to the core root of transform measurement actually measures the next stage which is NOVA into PCR 17 and then NOVA which drives the TPM at locality three measures. No, the SNET ACM measures NOVA also into PCR 17 and then NOVA measures the next stage which is the master controller into PCR 19 and then the root PD measures the next component further up stack into PCR 20. So this is the list of all the cool security technologies that we have in NOVA now. Ranging from control flow enforcement, total memory encryption with multiple keys and the latest thing we added which we just discussed is trusted execution technology and adaptation. So with that thank you for listening and I'm happy to take questions. you
Unikernels Are Here: Building, Running and Deploying Application Unikernels With One Command
Okay, if I can ask for your attention, we have another talk coming up. It's by Razvan, who is somebody who also does not need a lot of introduction. He's the unfun terrible of Unicornals. And his talk is about the fact that Unicornals are here. The stage is yours. All righty, thank you so much. Hi everyone, I'm Razvan. I'm from the Unicraft Projects. We were also part of FOSDEM last year. Our Unicornal projects are now part of what we call Unicornal Alliance. So Andrea, Valdez, Jonathan, Martin, and the others are kind of in this umbrella term. So my talk is going to be about Unicornals finally being here. I saw you saw a bit of highlights from Andrea, given her talks. You're going to see more about that here. Let's get started. So I'm talking about application workflows. Typically we think about, let's call them consumers and producers of software, right? The first workflow is for people producing software. The second one is for people using software. If I were to make an analogy to Andrea's presentation, what she was talking about was actually users that, I mean people that produce software. People that produce software don't care about the underlying things. They don't care about operating system specifics, about how they just want it to work. Well, so do we. Users as well, right? There are, imagine that what you do is you're going to build the application, get the package, push it, or then you do something like app install or that pulls it and unpacks it. You have typically a pack, a software package. If you imagine a dev or RPM, that's kind of it. However, this has glitches and anyone who has done some sort of development work or even worse, sorry, Gaby, about that doing DevOps work. Yeah, there are nasty jobs out there, even nastier than current development. You're going to bump into issues. Issues such as that runs there, that doesn't run over here. I'm using that, you're using Fedora. You don't have those packages being compiled. Dependency cell, I can't tell you about that. Anyone with their right mind had to deal with OpenSS cell and OpenSS cell 1.03.03.1. All those versions knows about that. Also when you run a piece of software, that's generally not isolated. If there's a glitch with that, that will affect others, supply, software, chain attacks, all those items. What do we do about that? Well, generally what people have been doing nowadays is using VMs and containers. These are trying to tackle those issues. You want to have containers and VMs because you're going to be able to run them everywhere. If you're using Docker, even if it's behind the scenes using Hyperview or something, you're going to be able to run that particular image on Windows, on Mac, on different Windows, on different Linux distros, sorry for that, all of those items. There are no dependency issues. You have your VM configuration, vagrant file, profile, you name it. You're going to know that's going to work. I'm not going to lie, I need this particular version of Node or out of the 10,000 millions or gazillions libraries that Node has, I need to have the proper version. You just grab that and it works. Also, they provide isolation, which is very important to make sure they are only using what you require that may also impact the memory footprint and they're isolated from each other. Of course, each of these two approaches has their issues. Let's talk about VMs. VMs have very good two items. They're also mentioned, I'm going to measure next then because of the frequent mentioning, about resource control, this is a very important feature of VMs. You are able to allocate CPUs, memory, hard disk, you name it. Also, you have good isolation. For sure, there are attacks out there, hypervisor attacks, VM escapes, but they're not so common as container escapes. However, they have overhead. You need a lot of time to start the VM, boot it up. We talked about cold boots. You have large memory size and it's also quite difficult to create recipes. Particularly when you compare it to container size. We have vagrant files, but even with that, it's not that easy. On the other hand, we have containers. Containers tend to compensate the disadvantage of virtual machines. We have recipes, we have Docker files. Everything is there. You just grab it, it runs everywhere. You have registries of application, Docker Hub, Google Container Registry. You just do a Docker pull, something, it works. You have a run, you pass the name, it works. You have tooling with Docker, with Podman. You have good performance. You don't have that much overhead as you have with VMs. Also, Dandrea showed what comparisons between unicunnels and containers. But there are isolation issues. There are a bunch of container-based attacks out there. You don't have, you are sharing the same kernel. If the kernel gets screwed, then that's over. And you have imperfect disk-on-conference control. We know we have C-groups, but it's not that level you have with virtual machines for sure. So what we want is we want a blend of those. And that blend is getting good isolation, good research control, recipes, registries. We want tooling. We got good performance. That means that in the end, if I'm an application developer, I want all of this. I want you to deploy my application and then users are going to just benefit. They're going to be able to run quick, fast-booting, good, well-deserved, well-deserved, well-deserved, well-isolated, high-performance, pre-built application packages. And in our opinion, that solution is using unicunnels that has this combination of items here. So this is, sorry, this is a VM to continue to unicunnel. So kind of blend of this. My talk is mostly focused on unicraft, as I mentioned, part of unicunnel lives. So basically what we wanted is this. Let's just take a tour of the items I just mentioned here as we want and see how we can achieve it. So because unicunnels are virtual machines, the advantages of virtual machines, meaning good isolation and good research control, are here. Right, so you have this. Also, we want to have recipes. Bobby's mentioned about bunny files. We are using, when you're using unicraft, we are using docker files. We're using something that's called a craft file that has kind of this level of recipe. Three set ingredients and steps to undertake to make this happen. We have a registry of pre-built applications. Just go there, you're going to use craft run, craft pull. It will just work. You're going to see the demo. We have tooling. It, bunny is not yet implemented. However, it craft it. Craft it works. It's there, it's running. You're going to see it for your own. Right, and we have good performance. That comes from the inherent design of unicunnels. You don't have the main separation, but also it's because the way unicraft has been designed. We aim to have extreme performance. That was the kind of the rationale behind my question to Andre and Vali regarding the optimization with unicunnels. Because the way unicraft is being created is highly configurable, highly customizable. We can specialize it for every particular load to get the best performance we can. That means allocator schedulers, you name it. Right, so why am I saying unicunnels are here? If it asked me this one year ago when we had the talk in Fosden, I couldn't have said that. But now I can. We have a catalog of applications ready. We have profkit well prepared. There's also on the commercials at the platform that provides commercial support for this. You can see it happen. What did you have before? And you're going to see the demo. We had to configure the kernel. Go to a lot of different steps. It actually, because I tend to like to do things, but like everyone, I don't get to do them. I spent an inordinate amount of time last night, Martin knows, because I sent his slides quite late at 2 a.m., something like this, which is 3 a.m. in Romania. And I spent, I think, three hours recalling how do I build from scratch and GNEX. I cursed Simone a little because there was a recent change with the way we are mounting file systems. I had to squeeze a lot of items, but I managed to do it and it's now prerecorded. You're going to see it as well. But you have to go to a lot of steps here, configuration, build steps. It's an awful, awful, painstaking time that I, as an app developer, not as a kernel developer, shouldn't look into. I have to have, you're going to see a huge QMU command line. I didn't do a demo because I don't want to scare you and ruin your dreams. But if you go to a firecracker, you have a huge JSON file, you have to write the command line for it. Once again, really nasty if you're just an application developer. The application has to be porting on your craft that takes time. I mean, on any unique kernel, actually, you have to link it properly, it takes time. What do you have now? We have a set of pre-built kernels. You just do craft pooling works. That's it. I have an application. I write a small Docker file. I have my Python application. I use craft running just for you. You're going to see it. There's a single command build. Actually, there is a single one. For running, there are two of them because I'm using a bridge interface. I have to use two commands. I have to use one command to create the bridge interface. Maybe in the future, we're going to do it with one command. But one command is to create the bridge interface. The other command is to run the unique kernel. On top of that, and this is something that Simone, three talks from now, I mean, 6 p.m., is going to show the internals of running native Linux applications. We are basically getting pre-building applications already out there in Docker Hub, and we're running them. So there's no need actually for you to do anything. You just grab them and you run them. It's that easy. That being said, let me show you the demos. That was the kind of the deletree part. Let's look at the demos. For starters, I'm going to show you the demo of the before part. This is the way it happened before, the way we were building and running kernels before. It's kind of a three-minute one, so let's just take a quick look. I'm starting a T-max screen. This is a native one. I'm doing configuration now, and just take a look at what I do. I select the platform. I'm selecting the library. There's a huge palette of them. I'm going to engine X. I'm using a main function. I'm going to VFS score. I'm selecting a pile file system and embed the initial grand disk. You don't know what to do, but that's how complicated it was. Going back, I need to use a new version that Simone introduced. I'm going to dev. I have to mount dev. Save. This is the configuration. I have to know all the steps. I have those steps. What I'm trying to do, I need to create the initial grand disk. There's a command that I need to create. There's a makeMKCpIO command. I'm packing the file system inside the initial grand disk. That's packed. I'm now building it. It's going to take about one minute to build. They're just going to look a bit about that. All of these items, of course, can be automated in some source, but you have to know them. Every time I was testing, I was trying to build something, I get this option. I'm doing a lot of community interaction. It's painful to see people constantly bumping into the same kind of issues. Are you sure you're using staging? Are you sure you're mounting 9PFS? Did you select that particular weird, maybe you don't think that option there? These items were things that people were constantly bumping into. Now, this is now going to get built. It's the final linking step. And in the end, we're going to end up with the final kernel image. I'm now adding an interesting command to create the final link. It's an interesting command to create the bridge interface. And now I have even a larger command. I know, baby, it's amazing. And it's going to run. I'm opening a second console. It's running, so it's okay. But you can imagine how much time and effort you can see that the huge command there for Kimu was to make this happen. So this was the before part. This was the thing that I would say unicunals are not there. However, let's look at something else. So what I'm now doing is I'm using craft. There are two commands. I'm using a bridge interface just to make it more realistic. So I'm using a bridge interface. And now I'm running on that bridge interface. I'm running the same engine X image that's now pre-packaged, pre-built, pre-deployed inside the registry. It started. I'm going to query. It's start behind the scenes. I'm going to query it. That's it. So I can now just say, okay, that's the VM. It started. If I want to not use it anymore, I can simply remove it. And that's it. It's just on screen, right? That's what happened. Let's look at something else. Let's look at what we do with Python. Let me say, I think this could be it. Let me see if this is it. No, this is not it. Sorry. How do I close this? Q. Okay. I think it's this one. Yeah. So Python is also there. Once again, two commands. One command to create the bridge interface. And one command to run it. So I'm using craft run minus, minus net or for the network interface. Yeah, a bit more memory because it's Python. It could be worse. It could be it could be node. Yeah. Oh, they'll go there. Okay. So I'm using Python. I'm using the latest version. It's pulling it from our registry. And it's already there. I'm just curling it. It's port 8080. And it's running. Of course, now the similar using craft PS to make sure I have the virtual machine and I'm now able to remove it. And that's going to be it. Right. Once again, two commands. One for the network. The other for starting this. How can I make this so? Okay. You saw someone starting Python. How can I do it myself? Well, I have this here. I have another server. I have some craft file. If you see here, there's a different Docker. I'm using this command. I'm saying, let's create the network. Right. And I'm going to just copy the new server. So this is the way I'm customizing the build. It's similar to the new file system. This is the new file system. And now I'm going to run this. I need to have that export because we're using build kit from Docker behind the scenes. Right. And now it's going to grab that image. It's going to grab it. That's the kernel. It's going to copy the Docker file. It's now building the Docker file. It's from scratch. It's just copying that information. Similar to what you do in a Linux environment. And I'm running. And that's it. And it works. Right. So it's now saying buy world. And finally, let me show you something a bit more complicated. Let's run flash with Python. Right. So let's do it. I have a simple flash server. I have the Docker file, which says, okay, you need to... This is similar to having a Docker installment. Right. So I have the Docker file from Python, install the requirements. Then you're going to copy the implementation on it together with the libraries. And that's it. So I have the server. I have a Docker file similar to the Docker environment. And I'm saying, hey, grab the Python image, grab the Python kernel, run this from the Docker file, and then copy the... And then execute the... My server here. Right. I'm doing this to craft this build kit export from Docker. And I'm running it. And this will end up pulling the image. So you can check this out. It's going to pull the Python image, pulling. It's going to build now via Docker, via build kit the flash item. And now it's running it. And I'm just going to just curl the virtual machine. And everything is there. Hello from Flask. Right. All of this is now possible. You have the image, node, Python, whatever you want. You have your application. Do a Docker file for it. Use Croft on it just works. So these were all the items that you saw earlier. All of these you can check in our catalog. There is an... We call it the community catalog. There is also on the company side of Croft, there's a kind of a more commercial version of Croft. There are guides that we just published that are available showing you how you can use the catalog or add images to the catalog. And you can simply take care of this. Just run on command or to command and it works. Unicolas are here. All of those items are now in check. And together with optimizations such as those mentioned by Andrea and Vali, we are able to truly make use of cloud-based deployments. There are some other resources, catalog guides, see us on GitHub. If you're wanting to sit on a commercial side of things, as Maloga says, there is a platform using Croft Cloud, you can visit that and find more information. That's it. Thank you so much. Thank you. Thank you. We have time for one or two quick questions. Oh, God. You have to get the mic for the stream. So you mentioned that Docker isolation, sometimes the resource control, sometimes is it enough? Can you name some cases where it's not enough? Not necessarily. I'm not saying it's not enough. It's that the VM is, I would say, a better one. Because when you do a VM, you actually are able to provide different volume. And I think that also, CPU isolation, I'm not sure if, I don't know much about C group. It's kind of, I know VM resource control is better. I'm not saying it's for all these cases. I'm sure there are cases where Docker containers may be enough, but maybe there are cases where you want to have VM-based isolation. And Unicron will provide that. Final question? Yeah. Oh. Maybe pass it on. Maybe be asked. Maybe just ask. Maybe just, I'm going to repeat the question. Yeah. Yeah. So is that like an underlying kind of road time that would manage these VMs and how much your container, do you need a doctor to do this? Would there be a similar kind of Unicron or do you need a problem? Yeah. Yeah. There is actually, I'm not the person to answer that. It's Alex, he's kind of a tooling guy. But there is work being done towards that. There was something that was at some point called RunU for Run Unicron. And I don't know, actually the Nubificus guys have something that's called UK Run. So there's work being done for that. That's planned for Kubernetes integration if it's not already there. So that's already ongoing. If it's not yet public, it's because we have focus on getting items going first. But integration and tooling on all those container-based ecosystem is high priority on the tooling side. Yeah, yeah, for sure. It's on the menu. Thank you, Rosvan. Thank you. Time's up. Thanks for everything. For anything we can talk in the breaks. Thanks. Thanks.
Is Toro unikernel faster for MPI?
Okay, if I may have your attention again, it's time for next talk by Matthias Lorsen, this time about his unicernal and how he can run MPI code faster. Which is yours? So hello everyone, you take me well? Okay, thank you. I'm Matthias Lada, in this presentation I'm going to talk about the player in MPI applications using Total Unicernal. This is an exploratory work, so it's an area I am still investigating, so at the end of the presentation feel free to ask me any questions because I'm still benchmarking things, I'm not pretty sure where I'm going. So first I would like to present myself, I'm fascinated about practice system development and mutualization, and I have been working in these companies, this is my email, I did have a profile if you want to get in touch or you want to see some of my projects. This current project is not related with my current work, so it's something that I'm just doing when I have some free time. I would like to start by what is my intuition about what is an MPI application, I am not an expert on MPI, so it is what I understood since two years I have been working on this. So it is an application that compiles with the implementation of the MPI standard, so there exists several implementations of the MPI standard. The standard defines a set of APIs to synchronize and communicate parallel instances of the MPI application, so for example we have this sort of API, like MPI barrier broadcast and all reuse for example, so to set some of them. My impression is that the only performance matter when we deploy MPI applications, so I have a feeling that the virtualization is not very popular in HPC, at least my impression for the overhead that this adds. So my thought was that maybe MPI applications may benefit from the unicurners because for example Cisco are expensive, so in unicurners we remove that, we have calls, threads are cheaper than process, so you may know that we are not switching the page today, every time we are doing context switching in unicurner, depending on your application you can completely remove the scheduler because you are going to run only one thread per course or something like this. You can rely on communication and share memory for example, in the case of unicurners. And sometimes this is something that I just added, sometimes perform better than a general operating system as I guess and I say this because sometimes you can tweak your operating system to reach good performance, let's say. So yeah, well this is the diagram or the components that they are involved with when you are deploying an MPI application using a general proposal of the operating system. In this case I am thinking that the MPI application is running as a built-in machine but the diagram is more or less the same in case it is bare metal. So what we have is your MPI application, then it compiles with implementation of the MPI standard, for example OpenMPI and the OpenMPI is going to use some Cisco to communicate with the operating system to get some service like scaling file system, networking and so on. So what unicurners propose is well let's take a look at the data. Thank you. you you you you you you you about the scheduler. So the scheduler in tutorial is quite simple and also well here is no scheduler, it is the way that the tutorial creates threads. You have a dedicated API called a begin thread but it is a parameter that has to tell where the instance is going to run so you have to set up where the core is, I mean where you want that function to run. Otherwise it is going to choose all the times the booting core. The scheduler is quite simple, it is a cooperative scheduler, so I mean the thread is going to do something and then call this thread switch which is going to invoke the scheduler and each scheduler, I think I present that in the next one. Yeah, I mean each scheduler is independent one another so there is no communication between the instance so each core is scheduler completely independent one another and the algorithm is quite simple, it is going to choose the next thread that is ready, not more than that. And the idea behind that is that the idea was to have instance of the kernel that don't require any mechanism to synchronize the instance so there is no speed lock or something like this so all the access to the kernel data is lock free. So I just talked about the scheduler, now I am going to talk about the memory, also total memory is dedicated so when the kernel initialize is split the memory in rations and then all the allocations happen from that rations depending the core. So the splitting is quite simple, it is just split by the number of cores so you have two cores, you have two cores, you have three, so for a moment this algorithm is quite simple and maybe it could be improved for sure. So for example we have a memory allocator that since each ration is assigned to different core the way that we implement the allocator doesn't require any synchronization between the core, keep also the same idea that each instance runs independent one another. So for example when a thread has two allocator memories it is always coming from the same ration and also doesn't require any synchronization between the cores and so on. The idea behind this is also to try to leverage from technologies that integrate paths that you can have a known informer memory and then you have faster access to some rations. And in a general way all the kernel data in total is per CPU variables so it means that it doesn't require any synchronization between the core to access kernel data. And also to access faster to these CPU variables using the chs register for example and this is an improvement that I did a couple of months ago and so we have a table and it is faster access through the chs register which is pointing to that table. I don't remember exactly the mechanics but I think I have a blog that I wrote about that and all the access is log free. The only moment that we require synchronization between the cores is when we wanted to for example create a thread from one core to another. We need to synchronize somehow the cores to migrate one thread to another something like this but it's the only moment that we need it. Otherwise this is completely independent all the instance. And to end the principles of total I will be going to talk a bit about the core to core communication. And the idea was to even if you as a user you can implement anything on share memory as you want. I decided to implement the entire over share memory so each core has a set of big use that allows to get data from a remote core and say data to another core. And it was just a bit of I mean to have fun to do it like this. I mean I'm starting to see if I can implement the entire like this. And the idea is that the communication is core to core so we don't have only one queue per core. You have as many birch use as you need to communicate one to one for each core. I don't know how to say exactly but this makes that you don't require any protection to send or keep the exclusive access to this birch use. Because you have only one consumer or one producer and so on. And relying on this mechanism then I could implement the API from MPI like MPI Gutter broadcast and MPI scatter which are functions that require communication between the core. So from the root core to the root core and so on. So I think I will just talk a bit about the benchmark I have been done. I feel free to comment about this because I'm not really sure about the numbers I'm getting. What I did was to choose a set of well known benchmarks called also micro benchmark which is since that it is used for benchmark in different implementations of the MPI standard. And I pick up two of them the also barrier and also already use which what they do is just stress some function. So for example the also barrier stress the MPI barrier function which is something to synchronize the instance of an MPI application. It's just a software barrier let's say. And the other one they already use is stress the MPI already use function. Which is going to send some vector to the root core process something and get back the rest to the other cores or other instance. What I did was comparing with Linux Bermuda and Linux in IBM and I use it. I pick up this machine from the from a clinics which is an AMD epic with 24 cores and 64 sheen rate and the host I use it for the VN is a Wuntu with isolated cores. No sorry yeah with isolated course. And I ran the and I use it. KBM team was hypervisor. I'm sorry now the host was Wuntu and the guess was Fedora 38. What I did is in this particular case what I did was to use a huge VM with 16 cores. Maybe it's not the most common case for MPI people just have several nodes instead of putting everything on the same. In my case I was trying to play with this so I decided to use a huge VM let's say and then compare with total right. So this is how I launch the benchmark. So for example I am using 16 threads for example. I'm not an expert in MPI I'm not really sure if this I mean if the MPI run for example is really using one core per thread it will not be optimal otherwise I think. And I was launching for 1000 interaction so this is the result for the Linux Bermuda. No Linux in IBM sorry. So these are the numbers for the host barrier. Which is this test if I yeah. So you can see that there is quite huge difference between the Linux VM and the Unicolon. But still I have to read redo these numbers I'm not really sure about that I mean because there are one order of magnitude at least. At the beginning I was interested to compare with Linux Bermuda because I think we can achieve something like this in IBM. But then when I started to play with Linux VM I said well there is a huge already difference with the VM. And also I was comparing with the host already used as I said before in particular with that side of the vector. And also it's quite huge difference with the Unicolon. So in the two cases are 16 cores in the VM and the Unicolon too. And I think that's all about the benchmark. To have this number also I figured out that some issues in particular I don't know if you were measuring something in VMs. In particular in carry-in the early TCC register is not emulated so you have to be careful when you use that. For example you have to when you are doing numbers you have to check that the carry-in is still in time. So if you make the difference it's not going to work always so you have to be careful about that. That's all I think. The question is a question. It's a question. It's a pity I'm not doing... The question was why I'm not doing communication between the VMs using this implementation. Basically this implementation can only run on a single node but people are using MPI on classes with tens or hundreds or thousands of nodes. Why? Do you have any plans to extend that? Well I'm thinking about that because it's not the first time that they mention this. Maybe create an interface, I mean use butyonet or butyvisoc to communicate with other instance. You will have multiple VMs running that. But for the moment maybe I will do it soon. I'm not really worried about that. What questions? Which MPI implementation are you implementing? Because there are different kind of versions of MPI or Pitch or so on. Which one are you based on? I'm not really sure because what I'm doing is just trying to read the semantics of MPI. I'm trying to implement it at code. The number of the functions I'm implementing is based on what is the benchmark. That's all. This is why I'm doing it. No more than that. Do you have numbers when you increase the number of nodes? Do you mean if I have numbers when you increment the number of nodes? How that behave? Yeah. I'm still doing that number. The difference is still there between the VM and the Linux implementation. Still a difference in the sense that it's faster, let's say. I'm still doing those numbers too. There is no point in finding the question. I don't know. Do you have a question about the big problems that are happening in the end of the time? The question is if I understand why we have that difference. I don't know. There are a lot of ways to tweak Linux to make it more performance. Maybe I'm lacking that. If you tweak it, you're going to dramatically drop that difference and the configuration. I'm not really sure from where it's coming. But I said before, it's still numbers that I'm working on. Okay, I think we are running out of time. So, Api, thanks again for the talk. Thank you. We have a short break for five minutes. And after that, we will have talk. Thank you.
News from the Hermit Crab — From Soundness Foundations to GPU Virtualization
Go Martin, go! Okay, I guess. So let's get this started. Wow. Okay, thanks everyone for coming. I'm Martin from Avitiha-Aachen University, and I'll talk about the Hermit operating system. I'm here together with my colleague Jonathan, and a few students are also scattered around the room. Yeah, let's get started. These are the things that I'll talk about today. First, a general introduction into Hermit and Juni kernels, although if you've been to this room in the past few hours, you already know some of that. Then I'll cover some arguably interesting internals structurally, and then talk about two applications, namely GPU virtualization using Cricut, and application kernel profiling. Okay, we've been through this a few times now, but let's go through it again. We have compared to a standard VM where we have a hardware and a host operating system, which might also be missing if we have a level one hypervisor, and a hypervisor, we then have this virtual machine. And this virtual machine runs a virtual machine image, which... What's happening? Okay, this virtual machine image is just a full-blown operating system with its own guest kernel, user space, and everything else. Then we've also talked about containers before, which throws away the guest kernel and really tries to minimize the image for the application, and we have unicarnals, which then run in virtual machines again, but inside the unicarnal, everything is packed together as tightly as possible. We have the application, we have some user-provided libraries, and we have the library operating system all statically linked together. What this gives us then is an image that we can really specialize to the use case at hand. So that means for the environment, namely the hypervisor, and for the application itself, and what it should do. This leads to tiny images, only a few megabytes in size for Hello World, for example. And since we only have one process in this whole unicarnal image, we don't need any isolation between this process, other processes, or the kernel. That means we can do this as a single address-based operating system without any costly address-based context switches between. We can run everything at kernel level, have no privileged context switches, and then can just make system calls to function calls. And that's pretty cool. Enter the Hermit operating system, as you can probably guess by the logo. The logo is written in Rust, 100%, well, not 100%, but there's no C in there, at least. There's only Rust and a bit of assembly, of course. We mainly target Rust applications, too. So we have an official tier 3 Rust target for Rust applications that we can use. But we also have a GCC and NewLip fork if you really want to run C applications, though that's not our primary focus. We have multi-core support, we are easily configurable, and we can now also compile on Windows. Yeah, we can also support stable Rust nowadays through our own distribution of the Rust standard library, which you can check out here. Okay, let's talk about the platform support. Okay, once we have this image seen on the left where we have the application, standard library, NewLip, and the kernel, we can then run it on our own hypervisor, for example. U-Hive is a specialized Hermit hypervisor that is specialized to running Hermit unique kind of images, which is the focus of Jonathan. The main target for that is Linux KVM on x86, though there's also some degree of support for Mac OS on both x86 and ARM. And also upcoming, though not yet merged, is Linux KVM support for RISC-5, which is something that Simon worked on. Philip, sorry. We can also target generic VMs through our Hermit loader, which then chain loads the Hermit ELF image. We can support multi-boot on x86, we support firecracker, and there's also UEFI work going on, which will be there soon, hopefully. For ARM and RISC-5, we use the Linux boot protocol to be able to run on things like KAML. Okay, so that's all you need to know if you want to use Hermit. Let's take a look inside. This is the same unique kind of image again, but from a different point of view now. The left stack is the application stack. It is the application. It's some user defined libraries, Rust crates in this case, and the core crates of the Rust 2 chain itself, so standard, Alagon core. On the right side, we have the Hermit kernel, which depends on some crates as well, and Alagon core. These two things are compiled for different targets, though, because we don't want to use any floating point operations in the kernel target, because that's costly to switch between. And the user code is compiled for a special Hermit target, which does have floating point support and also tells the Rust standard library how to communicate with the Hermit kernel. We also provide together with the Hermit kernel, but compiled for the user target some intrinsic such as libm for math functions, or mem intrinsics for things like mem copy, which really benefit from having this floating point support available. One thing that I personally worked on a lot are soundness foundations. You can see unsafe and safe Rust on the right. And we published a paper on that. It's called on the challenge of sound code for operating system, and what this basically aims for is to make the Hermit target sound. That means any safety reasoning must not require context. That's extremely important, and the history behind that is that Hermit was once written in C without much strictness around the locality of this kind of reasoning, and we put a lot of work into going forward and migrating to a more Rust-like approach here. One thing that came out of this is Hermit sync, which is a collection of synchronization primitives used inside the Hermit kernel. Most of these are also independently published as single crates and republished through this image, so you can also pick whatever you like in your own project. Another thing is count unsafe, which you can use to count the amount of unsafe code inside your Rust thing that we use to analyze our progress there. The next thing I want to talk about is our evolving network stack. Originally, it was just a user-side thing, so the Rust applications would compile some network stack with small TCP, a Rust network stack, and C applications would use what's it called LWIP, such as Unicraft does. In 2022, we moved that from user space into kernel space, which is not that meaningful since everything is kernel space, actually, but we moved it to the distribution of the kernel. Then we implemented support for these D-Style sockets because before we had a custom-made API for networking, and now we want to standardize it and adopt these things because that will allow us to throw away all the user space network stack, which can then both C applications and Rust applications use the kernel-provided small TCP network stack. In 2024, we are going for Pulse support for async.io, which would enable us to run a whole bunch of Rust networking applications, which usually run on Tokyo or something like that, and work on this is already well underway. Okay, then let's talk about the two application-focused things. First, GPU virtualization with Cricut. Short introduction to Cricut, which is another project developed at our institute, ACS. It's basically just plugging networking in between some API. So classical GPU CUDA applications work like seen on top, where we have this CUDA app that calls CUDA APIs, a library from NVIDIA, which then performs the actual computations on the GPU. With Cricut, we plug a Cricut client next to the app and a server to the CUDA APIs, and then just tunnel through all requests and answers. That separates these two things, and we can move them wherever we want and control what's happening there. And we found it's not that... Yeah, it's not that high of an overhead. We can then use this for remote execution, scheduling, or monitoring of GPU applications, as seen here. We can have several nodes with virtual GPUs, which then run on another node for computation. We then adapted Cricut for Unicornals, and published a paper on that. And how we did this is Cricut is based on ONCRPCs, which came out of Sun way back when. And the reference implementation is Oden Complex and uses Linux-specific networking features, so it wasn't easy to port to our Rust toolchain, for example. And as you can already guess, we ported it to Rust. Our user code is then run inside the Unicornal and only like the server part serving the GPU is not run inside the Unicornal. We did this for Hermit and Unicraft. For Unicraft we had to develop Rust application support first, but we did that and now it's working fine. The last topic that I want to talk about is application and kernel profiling. It's a project that has been dormant for a while, but we are reawakening it and getting it up to date and getting it working again. It's called RF Trace for Rust Function Tracer. How this works is that essentially we want to find out how much time is spent in which functions when we run software. Instrumentation does this by changing the code that is output by the compiler. We are essentially changing the program that we measure, which kind of falsifies the results a little bit, but for that we get extremely reliable things because we measure each and every time frame inside a function. It works like this. We have our Rust source, which squares some number. That corresponds to this assembly for inter-architectures. If we just append the corresponding flex to the compiler, the compiler nicely inserts this call to a special mCount function. What this mCount function then does is it can inspect the stack to find out which function we are currently in. It can then take some timestamp and it can also insert a return trampoline into this stack so that it also knows when we leave the function again. Together, all of this together, then lets us measure the time of functions, which is cool. In the image it looks like this. Our F trace is just another static library, which is inside the whole image. It works for Rust programs, C programs, and also for images, obviously. It is very encapsulated, so it exposes only a few symbols like mCount and then does everything internally. When we measure such a trace, we can then look at it and have a trace replay and really see which function we go into how and how long it takes inside them. We can also have a look at these graphically, of course. There are tools available for trace visualization. You could also create flame graphs out of this and then optimize the kernel. We are looking forward to using that for further optimizing the network stack, for example. All in all, I think that is all I have to say for today. That is a broad overview of the different topics that we covered last year. You can check us out on GitHub. You can say hi on Zulip. With that, I thank you for your kind attention. Thanks, Martin, for the talk. We have a working mic, so we can have some questions. Five minutes. Hi. My question is how do you instrument the Rust code and how do you actually get the function codes in there? The what? The instrumentation and turn some calls into the Rust code, usually, that you have. My question is how do you get those function codes in there? The question was, you said it to the mic, so it should be. There is a compiler flag for that. For C code, it is much simpler. You would just compile with GCC and then say dash PG, I think. For Rust code, it is more complicated. Well, it is not more complicated. It is just more lengthy. I did not put it on the slide because it was two lines or something. But those are features available to us through LLVM. Rust work is on the way to make this easier because it is not a stable thing exposed by the Rust 2 chain, but through manually enabling the corresponding LLVM passes for the code, this works. Thank you. More questions? I had a similar question. We also have a track on profiling, benchmarking and Unicraft. You are using instrumentation for profiling. Are you also considering sampling profiling? For example, what you are using is Unicraft, we are trying to tie in VMI, virtual machine interface. That will be able to do some sort of snapshotting and the others. Is this enough? Also, Unicraft, you have GCof support now because GCC 13 has embedded GCof support, so that makes things easier. Is this enough for what you have tested so far, the instrumented approach? Because you have to build the application, you then have to run the instrumented one, maybe it is not similar practice, is this enough at this point? We will have to see. In general, we are not that automated yet compared to Unicraft. Our Rust application story is quite seamless, I think, and you just enable profiling through a simple feature flag, and then you run it and it gets dumped on the disk and you can look into it. This is also what Gaby is working on. Did you consider, I am not sure how F-TracingPerox does it, but for example, there is something called K-Probes or K-Raid-Probes or something like that, which is a dynamic way of instrumenting the calls. What that does to you is you don't have to have these items done at build time, so that means when you want to instrument the application, you can tie in some flags and then while you execute it, it replaces some sort of function, pro or web, with some sort of jumps. Interesting. There may be something interesting to look at. We are looking at that on Unicraft's side. Is this like inserting a general hook into every function and then dynamically chain? Gaby knows a bit more about that. It is a bit of a rewrite of the function for organic load. Basically, you have a function that you want to jump in and then you can do the whole function that you want to jump in. Similar to that, just by hand and for some functions only and switchable. Okay, makes sense. Still very cool with the flame graph. I mean, this is the most important item because everyone does profiling, but having some sort of visual way of determining what's actually being spent, that's really useful. Yeah. We have to switch to another talk, so Martin will be around for more questions. Thanks again.
Linux Binary Compatible Unikernels with Unikraft
Perfect. So, welcome everybody. My name is Simon Künzer. I'm the project founder actually of Unicraft and lead maintainer there. I'm also CTO and co-founder of Unicraft TMBH, which is a commercial open source company using the Unicraft project for building a new cloud. You saw some aspects actually from Rasmus talk. This was like three talks earlier than mine. Here I'll go really now into much more technical details, how it looks like on the kernel side. So, this is much more OS class here, what I'm doing. So, first of all, briefly, wave your hands up if you know what a Unicron is. Good. Okay. So, then I do it here really quick. In the end, what we are doing is turning an application into a kernel by using something that we call operating system libraries that are directly baked to an application and we run that as a virtual machine. So, then we're all on the same page. Especially the Unicraft project, what's for us important is the aspect of single purpose. So, we do specialization and saying the OS layers are built just for that particular application that we're interested in and that allows you a lot of optimizations that you can do. For instance, we do single address space because we run one application in one kernel. We have a really small TCP and a small memory footprint. Behind the scenes, we have something that we call the micro library pool. So, you have decomposed OS functionality as library available. So, this is for instance scheduling file systems, then libraries that do Linux or POSIX APIs so that ease programming with that environment. So, it's kind of an SDK. The current project focus is on Linux compatibility because our vision is actually we want seamless application support for existing Linux applications. And since most software is written for Linux, especially for the cloud, we want to remove as many obstacles that we can do for running them on top of Unicraft. And one aspect as Vassim was showing is the tooling side. We have actually two approaches from the kernel side. We can do application compatibility natively, meaning you take the sources of your application and compile it together with the libraries of your kernel. But we can also do binary compatibility and this is what I'm going to speak about here, where we provide the Linux ABI to the application. So, we do system calls, VGSOs, etc., etc., to support a pre-compiled application. And now I'm going over these individual steps. So, first of all, when you build a kernel and want to make that supporting running a Linux application, what you need to understand or the kernel needs to understand is the ELF format. Running that application ELF is actually a pretty straightforward process. So, you need to first be able to pass the ELF, load it in your memory space, then you need to prepare an entrance stack. This is all specifications that you need to look up and fill out the specific vectors and values in there, and then jump to the entrance. And then that application is working. The only interaction that you will then, from that time on, have with the application are actually system calls. There's two interesting things here regarding loading ELF applications. So, you have something called non-PIE applications and something called PIE applications, meaning non-position independent... Sorry, non-position independent, position independent. The non-position dependent, they dictate you the address layout in your unicolonel. And since we want to run everything with a single address space, it means we can support only one non-PIE application. The beauty of PIE is we can relocate the application in the address space. So, we can still work with a single address space, but can run multiple applications if you want. And, sorry, if we're going a bit more further with the original reason why PIE is now pretty common in operating systems, we could even apply with that single address space full stack ASLR, where we mix kernel libraries and application libraries completely spread in the address space. I give you some more readings when you download the slides, but basically in the project we are focusing on PIE applications, since most of them, at least the last five years, is the standard when you get a pre-compiled binary from the distro. System calls. So, we are in a single address space, we have a pre-compiled binary from Linux and we want to run it, so we have to do system calls. This is actually a long pipeline that we need to run. It starts with a Cisco instruction, so note here this is now x86 specific, but that instruction usually takes care of a protection domain switch from ring 3 to ring 0, but we run in a single protection domain, actually we don't need that, so we go from ring 0 to ring 0, but yeah, we need to execute that instruction. Then Linux requires us to have an own kernel side stack, especially language environments like Go. They don't give you a big enough stack that you can just continue executing on the kernel side, so you need to switch. In reality, of course, if you have an application that has a big enough stack, you could also configure Unicraft to get rid of this step. Then at the moment we are using full CPU features also on the Unicraft side, it's mainly coming from for supporting native applications where basically addressing the kernel is just a function call, so why would you restrict your CPU features? If you don't compile it with that, then you can also remove that step, but if you do that in a default option, we need to save and restore that state 2 of your CPU. Then we have a TLS switch, which we really require because we use the TLS for our OSD composites, you know, libraryization, sorry, because we didn't want to have a central big TCP control structure where we need to maintain every particular field. We want to have everything clearly decomposed, so the TLS was the way to go for that. Yeah, and then actually we're able to call the system call handler. That brings you to the question, does it have to be that complicated? We need to do really so many steps, isn't there something we can do a bit better with a single address space? And actually there is, and there is this mechanism called VDSO and also kernel VSS call, this is an old thing, so the VDSO basically for us is just a catalog to look up kernel symbols, so this is a way for us to provide the function addresses to the application that the application can directly call because again, single address space, single protection domain, we don't need to do a switch. The second thing is the kernel VSS call symbol, which was a standardized symbol in the past, mostly for x86 when CPUs were introduced to do sysenter and syscall instead of interrupt driven system calls, that was a way for the S, from the US to tell the application to enter the kernel more efficiently. For us, we can use it as a function call to directly jump in the system call handler, so it's no trap, no interrupt, no privileged domain switch, no need to save and restore of extended registers because even that is covered by the system V calling convention. And the idea here is we just need to patch the libc shared library of that application and then most system calls anyways go through the libc because they provide wrappers there and then we have that covered. So in comparison, no expensive syscall instruction anymore, it's just a function call, the stack maybe we needed depends still, no floating point registers, whatever I mean, you don't need to save anymore and the TNS register we need to switch still. Okay, now we get into this fork dilemma, how I call it, this is like always the bad word for unicolonial people. So this is a bit of OS class here, so you probably remember fork means duplicating the address space of your application so that everything is at the same address. Problem with this is it's a second address space, I don't want a second address space, I want everything flat in a single address space to save time in context switches or TLB flashes, right? So, doesn't work. What we actually would need in a unicolon environment is a fork version that forks the application to a different address. Unfortunately, without compiler support, that's not that easy. You could of course copy the memory region but you need to then start patching your application because you have absolute addresses in there like return addresses on stacks, pointer addresses, etc. So don't work that great but what we can do, sorry, for instance, let's say, what the question is is when you look at the applications, they're compilers position independent. In principle, the result is that I can run multiple applications in a single address space. It's just that this model with fork and exit doesn't fit, you would say, because we have this fork intermediate step. And luckily, there was also this was coming from, let's say, older generations of Linux when they were trying to run Linux on targets without MMU. There's a system called V fork that allows you that. It's a bit of a funky fork call because it doesn't fork actually. It just pauses your parent and allows you to execute within the memory space of the parent, the child, for a short period of time, actually until the child calls either exit or one of the execution functions to load a new application binary. And that is basically our solution for running multiple applications or like a shell or something where the shell forks, you load a position independent binary anyway in the next step, then you have it in a different address area and you execute. What we want to try out here is if that mechanism works great for, you know, if you could just replace the fork system call number internally with a V fork and see for how many applications that works great because they were doing fork exit before and for how many that does work. But obviously, won't work is for applications that just use fork to open up worker processes. But just for this fork exit model that works. Okay, and then the last point is a bit, I really rushed through. I was when I was preparing, I was checking, okay, I can't say this, there's no time, no time. So here I wanted to give you a bit of an overview of what we did the last year when we looked at different applications and we're always in front of that question. We need to have Linux compatibility, but at the same time, we don't want to re-implement Linux. And there's some aspects where it's really tricky. For instance, the first one is a really interesting example. It's about querying from the kernel side network interfaces or setting up routing tables, etc. So get if address and call, which normally needs a complete subsystem of sockets to be implemented. And then on top of a known protocol called netlink, just to do this user space kernel interaction for gathering that information. An alternative here could be, of course, we implemented that, but an alternative here could be also to make use of the VDSO again, patch the libc and do more direct calls for just querying the IP address from the kernel that's currently set. Another interesting point is applications that are so mean that implicitly rely on a specific behavior on Linux. And really, a funny example is primitive scheduling that you come across from time to time. So far we have cooperative scheduling in Unicraft because that's a really efficient for us scheduling way to do things. But then you, I mean, you put something like Frank in PHP or MySQL and they use busy waiting loops to wait for other threads to wake up. With a cooperative scheduler, you can imagine what happens. Basically nothing because you're constantly busy waiting but never give another threat to chance to pop up. Then there is this whole topic about which system calls you need to support, which ones you can stop, which one need to be completely implemented. It's actually true that you can stop a lot, you don't need all of them. I would here refer you to a nice paper from Esplos. Actually last year for STEM, the authors were giving a presentation here too about how to figure out which system calls you need to implement. Of course, there's also sometimes the application dependent, but there is a, you don't need to implement all of the Linux system calls to have Linux compatibility. There are really a lot of system calls that are really specific to some cases or setting up seed groups or whatever, which normal applications don't use. And then of course the whole topic about file system hierarchy standard, where of course application expect you have something under PROC or under ETC or somewhere. So far, we were able to go around that by providing a meaningful filled text file for the application, especially the PROC file system without implementing that yet. And that worked for NGINX, Node.js, Redis, HA Proxy and a lot of other number of applications. Okay, so now we're so fast through this, I'm sorry. So we have time for some questions. I guess there's some stuff for more clarity. We are an open source project. These are the resources where you can find us. And you can also, I mean, see that I put here KraftCloud, that logo. This is what we currently try to build with our company, which is a cloud that uses the beauty of unicronals for really fast bootups, high resource efficiency for serverless architectures, microservices, functional services, etc. Unfortunately, just two minutes for the questions, but still, are there any questions? So when you run everything in a single address base, do you actually need to enable paging at all? So yeah, with the CPU, actually we need to enable paging, but it allows you to build a page table at compile time. And then it's just switching that page table on during boot. What additionally happens with Linux applications that are sometimes doing MAP calls and mapping here, something or file there, of course, if you enable that support, then you need some kind of dynamic page table handling. But if you don't need that, you have the opportunity to have a compile time page table. So we don't have the time to discuss it, but I was wondering if you have paging, wouldn't you be able to use copy and write to the popular one? Maybe something to think about. Of course, we do also copy and write where you need it. The thing is, what we still want is a single address space. So that page table is, basically, there's just one page table, another page table per application. We don't have this page table switches, no TLB flushes. So this is where we gain actually a lot of performance also from. And since we say we are a single application, we run only one thing, why do I need to handle? Everything that runs in the unique kernel is defined to be trusted, and you have then the hard isolation boundaries outside from the hard-provise environment to protect anything that is going bad or an overtaken unique conference. If you write, protect the data pages of a process that does fork, you can actually detect processes that don't use default exec that do actual fork to share memory, and you would be able to detect that. I would just add to that because you would have multiple other spaces just for a short while, so it's not really a performance issue, right? Yeah, it's like two kinds, implementation effort and... But yeah, I see your point. Also for non-position independent applications, I mean, if you have a choice not supporting multiple of them and having multiple address spaces, I mean, why not go for multiple address spaces, it does not invalidate the unique kernel idea. No, no, no, it doesn't. It doesn't. It just comes with some cost, right? Right. Okay, thank you very much. We have to switch to another talk. Thanks again.
Support Dynamically Linked Executables via Linux ld.so and Implement ENA Driver to Expand Application of OSv
Hello, everybody. Can you guys hear me? Hello. Cool. My name is Valde Kozachuk. I'm one of the few OSV committers and I'm here to tell you about the latest enhancements made to OSV since my last presentation at Fosada a year ago. So, first off, I want to apologize for this very long title. Actually, most of my talk is really going to be focused on the first part, but I'll also try to mention a little bit about the other things. So, in today's presentation, I will talk about the enhancements to support statically linked executables and dynamically linked executables launched by a Linux dynamic linker. I will also briefly describe the implementation of the inner driver to support AWS Nitro. In addition, I will preview the new Xconfig-based mechanism to allow further customization of OSV. Finally, I will talk about upcoming one, zero release and beyond. Most applications do not make system calls into Linux currently, as we know. Instead, they do it indirectly by way of calling Lipsy functions that delegate to Cisco or SDC instruction on ARM. On Linux, for example, the dynamically linked executables are launched by Program Interpreter LD, which memory maps the executable else along with other else files. It depends on, like, Lipsy SO, Lipthread SO, and so on. Then, resolves undefined symbols like puts or pthread create and finally involves the main function. On OSV, the built-in to kernel dynamic linker plays the role of the Program Interpreter that performs similar steps as on Linux. But instead of loading the aforementioned libraries, it resolves the undefined symbols by pointing them to OSV implementations of those. The OSV linker supports both shared libraries and dynamically linked executables that are either position-dependent or non-position-dependent. The benefit is that programs interact with the OSV kernel using the fast local function calls without the overhead of Cisco instruction. On the negative side, the Linux compatibility is a moving target because Lipsy keeps adding new functions and on the OSV side, we have to keep implementing them. This slide here illustrates how dynamically linked programs would traditionally interact with OSV kernel. The drawing shows an executable procedure linkage table, PLT, on the left side. The dynamic linker and Lipsy implementation that are part of OSV kernel on the right side. In this example, after the dynamic linker memory maps the program into the memory, actually, more specifically, the self-segment, it then sets up the PLT to later resolve and replace the put function call placeholder with the address of its implementation in OSV kernel, which typically happens upon the very first call. Now, the statically linked executables interact with Linux kernel by directly making system calls and reading from pseudo file systems like ProgFS and SysFS. Initially, OSV implemented a fairly small number of system calls around 70 to support running going programs that were interesting because they would call Lipsy functions to create threads, for example, and execute system calls to do other things like, for example, Socket API. But this was not enough to support statically linked executables. To make this possible, we had to implement some key new system calls like BRK and clone and add substantial number of other ones to bring the total to 137 at this point. However, the most tricky part was adding support for the application fed local storage so-called TLS. The dynamic-linked programs that run on OSV, in a traditional way, would share the thread local storage with kernel and allow OSV to fully control setup of TLS. The statically linked executables, on other hand, want to allocate their own TLS and set the FS register on X64 or TPIDREO0 on ARM and to the thread control address for each thread. On X64, the solution was basically to utilize the GS register to point to the Persepio structure with a copy of that application, TCP, and basically update it on every context switch. On AHR64, we did similar thing. Now, the point of this enhancement is that we basically improved the Linux compatibility because now we don't have to worry about these cases, where, for example, application tries to call functions in Lipsy that OSV doesn't implement. But the drawback, obviously, of the system calls interface is that, obviously, we pay overhead of Cisco instruction every time, which on average I measured this around 110 nanoseconds on X64. This picture actually illustrates what happens behind the scenes. So on the right side, actually, OSV dynamic linker still plays some small role. It still memory maps the segments of the elf. It reads the headers, obviously. But then, really, it just jumps to the start of the elf. And from this point on, the interactions basically between the program and the OSV happens simply through Cisco instruction. The exciting side effect, actually, of enhancing OSV to support Staticly Link executable is basically capability to run dynamically linked executables via Linux dynamic linker instead of basically the OSV built-in one. The Linux dynamic linker, LD, is Staticly Linux, a tightly linked position independent shared object that is loaded and processed by OSV kernel in an exact same way as Static executable is. In Linux, the dynamic linker would be launched implicitly, right? And by simply introspecting the inter-program header. In OSV, we have to launch the LD, the Linux LD executable explicitly and pass its path along with the arguments as you can actually see in RO. And actually, as you can see in this script, runpy example. So we're passing actually the absolute path to the Linux dynamic linker and then we're actually adding all the path of executable and any arguments. So obviously, just like with Staticly Link executables, there is the same benefit that we are now much more compatible with Linux because one can take any application that works on Linux with G-Lipsy and it should work on OSV just because when we build the image, OSV is going to run, it's going to actually load the G-Lipsy, and we can't use it as any other library that given application needs. The drawback is the same because we are again paying 110 nanoseconds for every Cisco instruction. And this slide again tries to illustrate the interactions between the OSV and the application. It's, as you can see on the right, you have the OSV kernel. On the left, the application, the news dynamic linker, that is executed just like with static executables. And then it loads the application LLF into memory by using M-MAP system call. And then also executes the application itself, loads any libraries. And from this point on, all the interactions happen with Cisco instructions. Now to help analyze and troubleshoot static link executables, or dynamic link launch basically in this new way, we have added a new diagnostic tool that called S-Trace, which is obviously similar to what one can do on Linux. In essence, one can specify all interesting trace points using a regular expressions. In this example, to monitor system calls, you just add a Cisco star, and you enable S-Trace system thread that basically would print all the trace point calls to the standard output. And as the application basically gets hit, while program runs. How many minutes do I have left? Seven minutes. So to recap what I have talked about in previous six slides, in the first two I described the traditional way of running dynamic link programs on SV, which benefit from fast local function calls, but may suffer from compatibility issues. In the next two slides, I explained the new enhancements to allow running static link executables. And finally in the last two slides, I covered a new alternative way of running dynamic link programs launched by Linux dynamic linker on SV, which again may suffer from a tiny overhead of handling system calls, but benefit from much better compatibility with Linux. In essence, these new enhancements greatly improve the OSV application and should make possible to run more programs on it. In addition to what I have talked so far, we have also implemented a better version of the AWS elastic network adapter. In essence, we basically took the 3DSD implementation by AWS and made it work on OSV, and we tried to minimize all that. So basically, minimize the changes so that we can backport any possible future, for example, fixes. And disable a lot of stuff that simply does not apply to OSV. The resulting driver costs us around 7,000 lines of, sorry, yeah, 7,000 lines of mostly C code, and 56 kilobytes of larger kernel size. The challenge obviously was testing that because it can only be done on the running Nitro instance in AWS. And so far, the driver seems to be pretty stable. I've tested using, and seems to yield decent performance. I've tested that using IPerf3, NetPerf, and some simple HTTP server app application. As you may have guessed, actually, the ENA driver implementation is enough to run OSV on with RAMFS on Nitro EC2 instance. And so there's actually a script that I wrote to simplify the upload of the OSV image, creating a snapshot and basically creating AMI. And one thing, obviously, to run OSV on a Nitro instance with non-volatile file system like ZFS, or hopefully EXT in the future, we need to have NVME driver implementation, which is actually two pull requests from community at this point, but they haven't been merged yet. They need some love. In my previous presentation at FOSDM, I talked about kernel modularization and driver profiles. This year on it briefly describe a new feature that takes modularization to the next level, and which has been greatly inspired by Unicraft. In essence, the goal is to use the Linux kernel build configuration tool, Xconfig, to let the user select OSV components to be included or excluded, and various parameters to configure it. The make file would then simply act on a generated config file, exclude relevant object files, and pass any configuration parameters to the source files. And this is obviously very much work in progress. And obviously, unlike Unicraft, where all the elements are effectively Lego blocks, with OSV we pretty much have to do the opposite. We have to put sprinkle basically the source code with all these if-deaths. And this is just example of what kind of modules or parameters can be modified. And basically as an example of what can be accomplished with that new feature is that by hiding basically all the symbols, but those used by application, excluding all necessary components, and changing values of various configurable parameters as listed on the slide, one can build a kernel image of 788 kilobytes in size, and running a low-world app using 1.2 megabytes of memory. So it is, when I started optimizing OSV kernel like five years ago, it was like, the kernel itself was like 10 megabytes at least, and it required a minimum of 30 megabytes of memory. So it is almost 10-fold improvement. Well, I'm sure not as close as Unicraft, but we are, maybe we can squeeze to be at half megabyte. So we are, as I am moving toward the end of my presentation, I just wanted to mention that we are also planning to cut a new release of OSV10, which should include all the features that I've talked about. And I hope that we're gonna be able to implement the EXT file system, merge the IPv6 implementation branch, and potentially implement NVMe driver. I'm especially excited about the EXT file system support because I think it will make it easier to build damages on Linux, and then introspect, for example, if something happens afterwards. So beyond the upcoming release, we're planning to revamp Capstan. Capstan is like effectively like a craft kit. It just, but it hasn't been really enhanced in any way, or even to take advantage of any recent features of OSV. So we're planning to basically revamp it, and make it really easy to use, basically to help application developers to use OSV. And then in addition, we're planning to work on some of the security, so like ASLR, and that requires making kernel relocatable, and some optimizations. Eventually, and also finally, we are planning to make OSV to run on AWS Graviton, but that requires UEFI and some other things. And with that, I would like to thank the organizers for inviting me to this conference, and tell you about OSV. I would also like to thank SyllaDB for sponsoring my OSV work, and Dorbola Orr for words of encouragement, and Nadav Haral for being my mentor, and reviewing hundreds of patches, and implementing other enhancements. And finally, I would like to also thank all the community contributors to the project. And this slide, you can find some links about OSV, and thank you for your attention. And I'm not sure if you have any questions. Time for questions. We have time for one burning question, if there is. You wanted? Yeah, go ahead. This is your work on Linux compatibility. How are you handling new APIs, such as the IO U-ring and similar applications? Are you using? Your question was how do you add new applications to? No, no, so with the Linux API, that you are right for, I believe, for, how are you handling IO U-ring and similar APIs? So how am I consuming new APIs, Linux APIs? I don't know how are you handling applications, which do make use of those? So basically, this happens as the way I describe, typically, if the application is launched in the traditional way, OSV simply, resolves all the application symbols, like Lipsy symbols, and simply redirects them to OSV implementation of Lipsy functions. If I have an answer to your question, then we can meet afterwards and I can address better. Thanks again for the talk. Thank you.
Streamlining application development for Genode with Goa
Okay, my name is Johannes, I'm from Genote Labs, and I want to show you a tool for developing applications for Genote. So I think Genote has been present in the micro-conadero room every year, I think, since a couple of years. So I don't think it needs much of an introduction, but I have a short introduction here. And I'm just irritated because my slides aren't complete on my laptop, and it's okay. Okay, on the presenter. Okay, so the Genote OS framework has a component-based operating system, it supports different micro-conaderoes. So I would actually like to switch contacts after a lot of Unicernel talks today. We're talking about micro-conaderoes here, but I'm not going into technical details since it's the last talk, and since we actually want to look at how we can develop applications for Genote. Yeah, it quarterly releases since 2008. And in addition to the framework, we have a showcase that's this GALT OS that's based on Genote that we at Genote Labs actually use as a daily driver. So I'm actually running it on the system right now, so you see a control interface here. I will demo it later on. So back to my slides here. The GALT OS release roughly twice a year. So what was the motivation behind GOA? The tour I'm presenting here. So the Genote build system, it's very useful and efficient and robust for developing on the framework. But it can be a little bit cumbersome for our development, so we have to do a lot of things and there's kind of a learning curve. So if you occasionally get into Genote and want to try some applications, then it can be a little bit off-putting. So GOA, the name is from GOAL, but reached a little bit sooner. It was started by Norman who's sitting in front of me in 2019 as a site project. And last year we moved it under the umbrella of Genote Labs. And since then it has seen quite a few additions and features. And as already mentioned, its intention is to streamline the development of individual applications for Genote. Before I go into that, I have to tell you something about how applications are deployed on Genote. So the package management is made up the way that each user manages its own depot. And depot contains different types of archives. So for instance there are API archives which contain header files to use a library. And then there are source archives which contain the source code, obviously. And then there are the binary archives which contain the compiled binary architecture, specific binary corresponding to a particular source archive. And then we have such things as raw data archives which contain just raw data architecture independent, so no binaries. And then we have package archives which contain the runtime information. So the information you need to actually start up an application scenario. And it also contains a list of the required source, raw, and package archives. And then we also have an index which is a list of users, packages, or package archives that can then be used to, or can be deployed to, in scope you will see later on in the demo how the index is used to actually install applications on the system. And the archives are simply placed in corresponding sub-directives according to the names listed here. And now I want you to keep in mind that a Goal project basically resembles the structure of these people. I'll get back to that in a couple of slides and show you. So what's the actual workflow? So here's the very simplified application workflow with Goal. So first you import, or you may want to import third-party software. First import the source code and maybe also patch it. Then you want to build it using the build system that comes with this third-party software. So just reusing a commodity build system. Then you want to test or run it on your development machine or another machine. And if this works well, then you actually want to export and publish it so that other users can run it. So Goal now is a command line tool which basically has commands for all of these stages shown here. And here I will walk through these wide away. So when it comes to importing third-party source code, in order to make this Goal import run, we require some files. So first of all there's an import file which actually describes what you want to download or what you want to import. And this then actually populates the source or raw sub-directory. Here's a brief example. So this import file is basically a makefile syntax which defines a couple of variables, license and version, and then what you actually want to download. And there can be a list of the syntax. It has a name and a type. Here the name is Calc and the type is an archive. So a table. And there are other types supported. You can download from Git repository, SVN, or a single file. And then you specify the URL for the archive to the char or for Git it would be the revision. And you also define in which directory it goes. So it can be the source directory or the raw directory. And you are also able to patch this here. I haven't shown this, but it's all documented in the intro documentation. So just type Goal help import. And you get a man page style help which should clarify how to use it. Next step after having the import we would build the code. So Goal build just builds the code which is present in the source directory, whether it's imported or whether you wrote it on your own. What you also need is you need two files, a used API file and an artifact file. So the used API just says which APIs or which API archives you depend on and the artifact file names the build artifacts that you want to include in your binary archive. We have a couple of supported build systems here. So just plain can you make or CMake or autoconfcumake. We also have some Rust support with cargo and it's extensible so that you can add your own build system and this list will probably grow in the future. And there's also a couple of other files which you see on the right. So like configure arcs or make arcs by which you can control or add some flags that you need for building. So the next step would be to test run the scenario. For this we have two more files that we need. So there's an archives file and the runtime file. You can actually have in a Goal project a multiple runtime scenario. So there's a sub directory structure that you see on the right. And then archive file lists the archives, the depot archives you depend on. And the runtime I won't go into detail but it specifies how the... It specifies the runtime scenario. What I want to go into more details is supported targets. So by default you can run the software and the host system which leverages the fact that the denote executables are binary compatible with all the supported kernels that we have. And part of this is also the Linux kernel so you actually can run the denote system on top of your host system, your Linux system on which you are developing. And then the component is just pop up as Linux processes. And a very new target is the Sculptarget. So we have implemented a method to synchronize all the required files via HTTP server to running Sculptarget system. And thereby remotely run your application and test run your application on the system. So let's have a look at that. Here's my Linux VM. I think the font is big enough, right? So here I have a project that's a calculator app imported from the Lumre UI toolkit which was formerly known as the Hubo2 Touch UI toolkit. And let's have a look what's in the directory there. First of all there's the import file which I'm interested in. So I don't have any source code here so I need to first import it. But since Goal takes care of doing the import step before the build happens, I can just try Goal build. It downloads everything needed and then builds the application. So what I have then is IFDSource directory which is popped up. So the source code is imported into the source directory. Next I can type Goal. My G is actually a bit... Okay, Goal run. Now I see the window opening and now I have the calculator up here. So as I said before all the components are now running as Linux processes. And as you see it's pretty slow because there's no GPU acceleration in this scenario. So it's barely usable for actual debugging in this case. But what I can do is I can also run this on my scope system that I'm using here to present. So let's switch to that. For this we have a Goal testbed here which we have available in the sculpt images as presets. I cannot do this right now because my VM would then turn off. But I have a launcher here so I just launch this testbed. And then I add the argument sculpts. What I haven't mentioned here, this tool has bash completion. So I can actually spear some typing. Okay, let's switch over and at some point... I'll show you. It still needs to sync the files via the HTTP put to the HTTP server that's running in this Goal testbed subsystem. And now here we have the calculator. And all the log output I still see on my host computer. So it's quite useful for development and testing. Alright, so let's say now we have everything working. Next step would be to publish it. So exporting, I'll leave it... So export just assembles the poor archive. And publishing then means that we make a signed tar archive out of that. What we need for that is for each of those runtime scenarios that we have, or for the runtime scenario that we want to publish, we need to readme. And we also need to license file and diversion file. And yeah, that's basically it. I'll just show you this in the demo. So let's have a look at my version. So I want to publish a new version of this. And what I'm doing is we have a helper command here. I can just bump the version and see it's updated. Oh no. Oh no, it's updated. It's just the same thing. So it was the third of January and now it's the third of February. I can actually do this multiple times. So it just depends on some letter to the end of the version. Okay, then I go into another folder where I'm managing my index. And what I want to do is I want to add this into my index. So the index is just some x and l file here. So I'm adding this here. I'll calculate that. And now I say publish. And since I already have an index file, I have to specify that I want to overwrite that. And wait for a moment. So it says it exported the calculator app with the current version. And now the index, I need to put in my password. Alright, that's it. And now it's published only on, and it's present as an archive on my laptop or NVM. And I still need to sync this with the web server, which I'm doing right now. Which can take a bit depending on the network. Alright, all is done. And now in my control system, or control UI here, I first need to enable my user depot, fetching the current index. And now I'm able to, there it was, there. I install it. Okay, downloadable. Then I have to, when installing the app, I need to root the services. So it needs a GUI service, which I wrote to the window manager. Then I need something for the GPU acceleration, another thing for GPU, system clock, and also something here, and edit. Alright, there it is. Okay, we calculate again. Alright. So the process of all of this, or the development work, is documented on our community blog, which is called genodians. Genodians.org, so there you find user stories about the development and how to use things. Also, as you see in the screenshot, how to use this GUI testbed, I just showed, and what was involved in porting the calculator that I just showed. And yeah, more interesting things. Also the very bare basic tutorials, which go into more details of using GUI. And with this, I want to thank you. And to get started, just clone it and have fun. Thank you. We can allow one single question, I think. So thank you for this token demo. It reminds me of the BSD port system, of course. I have a question. Did you pick the calculator because it doesn't do anything fancy like saving files, and if it had, what kind of patches do you usually need to sort of a real world application to genode? The calculator, I used it because there are other simple scenarios with not many additional libraries needed. And on the other hand, it uses the Ubuntu UI toolkit. So there's all these cute stuff already in use by that. But it's mainly used because it's a GUI application and it has a narrow scope in porting efforts. So I don't have to show too much details. So that's the reason for that. Yeah, we have more complex ports, actually. So we have a LIN phone port quite recently. So yeah, there are more complex ports already available. So as you see here, there are also these projects repositories to check out. It's on the one hand the GUI playground from Norman, and then from me and a few colleagues, the GUI projects repositories on GitHub, where you find ports of wirecrack, doom, chocolate doom, and things like that. So there's a lot of things you can play around with. Thank you, Johannes, for the talk. One more time. What I forgot, we also have some takeaways here. So we have these books, the main genome manual with all the concepts, the genome foundations, and then a documentation of how to call genome to other platforms, including peripherals and all the heart and fan system. Thanks, so don't hesitate to take it. And I would like to thank everybody, the speakers, the participants, the audience, for participating in this year's microkernel and component-based West Dev Room. And hopefully see you next year again. Bye-bye.
Welcome to the Modern Email DevRoom 💌
All right. Good morning everybody to the modern email death room. It's nine o'clock Sunday morning. I welcome you, I'm Hans Jörg. I welcome you on behalf of the organizers of this death room, which is with me Benoit, Damien and Michele. We have the honor to have an email death room, which probably didn't happen for like 10 years or so within FOSTA. So it was a little bit of an experiment because we saw, okay, there's this email thing. Does anybody still care about email? So we said, okay, let's call it modern email maybe. Yeah, so that makes people maybe happy. And yeah, we were totally overwhelmed. So thank you everybody attending here during the days. Everybody submitting some proposals. So we really, really had a hard time choosing and making the agenda. You might have seen also our agenda is quite full because we really wanted to honor many different topics and interesting things that were proposed. So it might get a little bit rushy over the day. So I kindly ask you if you switch rooms maybe to make it quickly and efficiently. We might maybe cut a little bit question time and relay something to the chat, something like that. But yeah, otherwise I think that will hopefully a very nice day. And we also, you've seen there is some metrics channels and so on going on. So there will be also after this day, hopefully some further discussion about, you know, how to make and preserve and keep email relevant and nice technology for everybody. And as for the day, you might have seen we somehow try to cluster a little bit into topics in the sessions. So we start with what we call protocols session now. There will be afterwards some talks about JMAP, which is interesting new technology, you know, to modernize IMAP in a certain way or make it bring it into the future. We will have a service session about email service systems later on and operations session about administrative topics and how to run mail servers. We have clients, email clients. Many of them you will know and probably daily use and there will be certainly some interesting news I think. We have a security session and finally a small session about a new thing called structured email. So hope everybody stays to the end. And I think let's get started. And I think the floor is here for Fabian first. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you.
[Protocols] Security of STARTTLS in the E-Mail Context
Okay, so this is the modern email deaf room and I think I'm talking about something rather old and that is startillist. So this talk is titled, why is it better without startillist, security analysis of startillist in email context. And this is based on research. I've done some years back with Damian, so one of your lovely organizers here, Hanu Berg and Sebastian Schinsel. And my name is Fabian Easing. So since I have the honor of starting off this whole, well, the whole deaf room, I want to give you a quick overlook of at least the client side of the email ecosystem. And this is a graphic I have designed from a dissertation and so bear with me that the legend might not fit the context really. But that's basically, well, it's hard to read in that resolution, but this is basically a lot of the RFCs that go into the email ecosystem. No worries, I'll cut that down a bit. And I've basically identified three categories. One is network protocols. This is where we'll spend most of the time of this talk. So this is SNTP, POP3, IMAP and nowadays JMAP, not included in the graphic yet, sorry for that. Formattings such as internet message format and MIME, so the multipurpose internet mail extensions. And lastly, end-to-end encryption like S-MIME at OpenPGP, which probably no one uses. So in total, that makes up more than 150 RFCs. Yes, some of them are obsolete, some of them are deprecated. But if you really want to get the whole gist of just the client side of the email ecosystem, you will probably have to read a lot of them. And I've done that. And they have done that. So that's only one side of the mail, of course. There's also implementations. So over the course of my research, I've looked at more than 48 clients and more than 23 servers. And this is by far not all of them. So that's just the implementations I could get my hands on and not counting the ones I found on the internet where I really can't tell if that is one of the 23 or isn't. Okay. Since I also have the honor of starting off this protocol session of the email deaf room, I want to give you a quick, well, introduction to how an email gets from our sender Alice to our receiver Bob. So let's assume that both use some kind of mail user agent or what you would normally call a mail client. So for example, Mozilla's on the boat. So both also need a so-called mail service provider. That might be something commercial like mailbox.org or Gmail, you know, and well, that's basically necessary to get one email from the sender to the receiver. So Alice will first have to configure her MSPs, MSA or mail sending agent. So what she has to do in our mail client is basically fill out some kind of dialogue like this, put in a server, put in a port and choose an encryption mechanism. Let's say for the sake of this talk that she chooses stateless or the auto configuration told her to do stateless, which happens a lot of time. So then she will be able to submit that over S&TP or submission if you want to be pedantic to the MSA. And Bob will have to do the same configuration with the mail delivery agent of his MSP to be able to connect via POP3 and IMAP. And then he will receive the mail, obviously. So this is where we will stop for the sake of this talk. So I'll be looking at the first and the last hope only. However, I don't want to leave you hanging. So what's happening in between is mail transfer agent to mail transfer agent communication, which happens over S&TP. And I'll report here be dragons there and whoever has looked into that knows what I mean by that because that's generally from a security standpoint considered to be a pretty hazardous field just to the reasons how DNS interacts with that and how DNS works or doesn't work, they are and so on and so on. So that's not what we will be looking at during this talk. However, I think the first and the last hope are from a user standpoint, the most important hopes simply because your credentials are submitted over that. That means that you really, really want to have those connections encrypted by a TLS if possible. Or not only if possible, you want to do that. Okay, let me give you a quick introduction to IMAP because that's what I'll be basing my examples on and on start TLS. So every IMAP connection begins with a client connecting to the server on a specific port and receiving a greeting back. This greeting consists of this asterisk and then a status code in this case, okay, and usually a list of capabilities. In this case, the server says, I'm able to speak IMAP for revision one and I'm able to do the awesome login mechanism. So far so good. The client will then send tagged commands which consist of attack and a command as you might have guessed and will usually receive some kind of data in response. So that's called an un-tagged response. And in this case, these are again the capabilities we've already seen in the greeting. And finally, to signal that the server is done with processing this command, they will send a tagged response with the same tag. Okay, so far so easy. Now how does that TLS work? This basically is the same as you've seen before. The client connects to some port, usually for IMAP it's 143 via plaintext, via plaintext TCP connection. And now the server sends the start TLS capabilities. So that's the last capability here. And usually removes all the locking capabilities because we don't want to have clients lock in via plaintext. Then they will negotiate start TLS. So the client will say, hey, I really want to speak start TLS. The server says, yes, okay, we can do that. Then they'll do the TLS and shake and everything after that is encrypted. Okay, although attacks happen here. So we assume that the attacker is a meddler in the middle that can listen on the network connections, that can modify packets, insert additional packets and so on. But there will be no meddling with the TLS and shake, no trying to change any cipher text and so on and so on. So everything still happens in plaintext. Okay. At this point, you might be asking, so what is there even for an attacker to do? Well, an attacker could, for example, change this or this or remove this, change this status code, add something here or really do anything in between. So there's actually a lot of room for stuff to do. Of this, accumulated in USNIC security 2021 paper with the same title as this talk. And as I've said, this is a paper from 2021. So most of the bugs I'll be showing you today are hopefully fixed. For most of them, we know that they are. For some of them, we are not 100% sure. But most of them are fixed by now. So the questions we set out to ask were, first, are modern clients opportunistic? So StatiLS has the reputation of being an opportunistic encryption protocol. So basically, only if the client and the server can agree to do encryption, then they will do encryption. Otherwise, well, they will probably talk in plain text. This should never be the case for client-to-server communication. This is where this MTA to MTA communication falls into place. That is usually opportunistic, hopefully not as much anymore as it was in the past. But for clients to servers, you never want to have a plain text connection, actually. So if there is data sent in plain text, so let's assume we are doing TLS, but we are sending some data in plain text before that. And we wanted to know if there's any sensitive data sent in plain text to the server or to the client. The third question was, so if there is some data sent in the plain text, is it retained in the encrypted and authenticated context after the TLS handshake? So does anything slip over from one context into the next one? And finally, what happens in error cases? Specifically, we looked at alert mechanisms like the IMAP's alert code, which you will see in a moment. All right. To answer these questions, what we did was three-fold. So we built the email analysis tool kit, the security tool kit. I always forget what the acronym stands for. I'm not sure. It's East. And it's on GitHub. I'll put the link later. So what we did was build almost full-fetched IMAP, SMTP, and POP3 server to automate tests on virtual machines where we installed mail clients. So basically, you can write a config file, define what do you want the server to answer to specific messages. And if you don't, it does what a normal server would do. And otherwise, it will return whatever you told them to. So the second one was, we wrote some scripts for IPv4 Internet scanning to scan for a specific vulnerability on the Internet. And finally, we built also some local test scripts with which you can test your own implementations or your local installations. All right. So what were the key findings? The first finding was that clients can be tricked into not using StarTales at all. So they will leak credentials or emails. So what you see on the right-hand side is a classic StarTales stripping attack. So basically, you remove the StarTales capability from the greeting. Then the client might still want to do StarTales. So you just tell us, no, you can't do that. All of that happens in plain text. So why should the client not lock in? This thankfully doesn't work in most of the clients nowadays. So all of the StarTales stripping attacks, we found some of them, but usually they are not a problem anymore. However, we rediscovered a problem related to the pre-OS greeting. So the pre-OS greeting is used whenever the server can authenticate the client over some other channel. For example, people apparently use IMAP over SSH. And then you are already authenticated over SSH, so no need to re-authenticate to your IMAP server. So the client would receive this greeting from the server, which tells them, well, you are already authenticated. The client would then do StarTales. Wait a second. The IMAP-RFC states that the StarTales command is only valid in the not authenticated state. So they really can't do that here anymore. And most clients didn't. So of course, they won't block in then, because why would they? They are already authenticated. However, many clients decided to upload sent and drafted emails after that, because they are already authenticated, and their send folder is missing for some reason. So let's just upload that in plain text. So this worked in 15 out of 28 clients we tested. Thankfully only one library we tested, so one client library, was meant to be opportunistic. So opportunistic StarTales is really not the thing in clients anymore. That's at least fortunate. So the second key finding was that many clients process unauthenticated data that appears before the TLS handshake. So for example, we injected this alert message from IMAP, which then looks like this, for example, in Microsoft Outlook. This looks pretty real to me, if I look at it. And it also helpfully highlights links for us. So you can't do HTML in there or something, but you can still highlight links. So I might be tempted to click on that if the link looks somewhat convincing as well. So we found another case of that. You can basically all of these untaxed response I've shown you before, only somewhat bound to a specific command. So you can basically inject them at any point before the TLS handshake. And this results, for example, in this is, I think it's Sonabird, creating an attacker-controlled folder in your mailbox. And this stays even after the hijacked connection. Also not too great. So we were able to do this in at least 11 out of 28 clients. So still a lot. So what concerned us most was the last finding, that several servers were vulnerable to a long-known bug. So this bug was first detected by Vizavine M9's post-tix SMTP implementation in 2011. So more than 10 years ago. Should be fixed by now, right? So this was first described for SMTP, but we have described it also for IMAP and POP3 and it's pretty straightforward that that should be the case. This is also a problem there. And basically what happens is the attacker injects a second command after the start TLS command in the same TCP segment. So basically this BNOUP shouldn't really be there. So and what happens usually for servers is that they read the whole buffer. So they read the whole socket into an application buffer, all data that is available, read both commands, process the start TLS command, send an AOK to that, do the TLS handshake, and that BNOUP remains in the application buffer to be handled after the TLS handshake. And at this point it's interinspirant to the server that this wasn't sent over TLS and this was basically sent by the attacker. So they also answer after the TLS handshake. So we were able to do that in eight out of 23 servers. Remember this is a 10-year-old bug and over the course of the 10 years more than 16 of the 23 servers were vulnerable at one point. So basically this is widespread problem. So we thought about that and thought well we can straightforwardly extend that to clients as well because if we can inject a command why shouldn't we be able to inject a response. So we did that and found 16 out of 28 clients basically vulnerable to the same bug simply because it wasn't described before for clients. So what's the impact of that? We were able to do credential stealing in IMAP and SNTP, stealing and send and drafted emails in both protocols. We were able to tamper with the mailbox in both protocols that have mailboxes, do UIS boofing in all three protocols, and we're even able to do HTTPS hosting under the IMAP certificate of the server, which is pretty bad. So what can we do about that? Our first recommendation for users was to disable StarTLS in their clients and I think that's still sensible. So not disable it totally but switch to implicit TLS. Please don't do plain text. So do implicit TLS instead. But we realize this might not be a workable solution for server developers and MSPs. So if you really must use StarTLS you should isolate the plain text phase completely from the TLS phase, from the encrypted phase, fix the buffering issues in the process hopefully and as a protocol measure we should really streamline the negotiation and lock down what is allowed before the TLS handshake. So only accept the bare minimum of messages before the actual TLS handshake. So as a final note, a quick thank you to all the developers of open source mail clients. The response time to our bug reports was phenomenal, much, much better than the commercial ones. So if you want to test your own clients or your own servers, all our testing scripts are available. We're happy for any push, any pull requests, it's pull requests, not push requests. This is the QR code I'll post it in the matrix chat after. So in conclusion, StarTLS only extends the attack surface. Not only do you want TLS but you don't really want StarTLS. StarTLS issues are widespread among implementations and cross protocol attacks are possible for example against HTTPS. So in conclusion, TLS is better without StarTLS. Thank you. Is there any questions in the room or in the chat which somebody wants to relay? Raise your hand and I can't with the microphone. I see people are impressed. Hi. Alexi, when we were doing, I'm up for revision two. I got some of your comments. Did we fix all of these? Can you reopen the PDF, please? There's a slide on that. I can do that. Are we at the beginning or at the end? So yeah, I'm up for revision two. Fixed at least most of the stuff. So pre-ass, there's a warning for that. Clients must require TLS or clients that require TLS must close the connections. So no more problems with pre-ass, hopefully. Alert responses got a warning. And I think this one is a specific comment on the buffering issues. What's missing from My Point of View is a comment for the clients to do that as well. So this client only mentioned servers, but the response injection is not handled in the IMF RFC currently. That's the only thing that's missing from My Point of View. Okay. Thank you. Dave? Is there a question? This will be a spotty day for me. So can you say something about the general security of email? When I send an email to you, can I reasonably trust that it encrypts all the way? That really depends on all the MTAs in between. So I wouldn't make any general comment on that. Maybe if you are using a somewhat securely configured mail server, and I'm using a somewhat secure configured mail server that should work, but I wouldn't count on it for all the servers on the Internet. So basically from our scans, we know that many servers are not securely configured. All right. Not sure if you have time for one more question if there's one or otherwise we proceed looking at the timekeeping. Is there any final question? So let's switch. Thank you. And thanks, Bobby, and again.
[Protocols] Things we wish we knew before starting an IMAP library
Hi, thank you for being here so early to hear about such an old protocol. So we're going to talk about IMAP. We've both started writing some IMAP libraries and we want to share experience in that. We've hit a few issues along the way, a few surprising things. Hopefully this can help you if you want to deal with IMAP as well. So, I'm Simon. I'm working on the Go libraries and he is Damien. Hi, I'm the main head of IWF-Code. Yeah. So the first thing you might wonder is what is IMAP useful for? So maybe some of you know that IMAP is used to fetch messages from a mail server. So if you have a mail client and a list of messages shows up, this is fetched via the IMAP protocol. IMAP lets you organize messages into mailboxes. So mailboxes is what regular people call folders. So inbox, archive, spam, drafts, all of these are mailboxes for IMAP. The main advantage on the upside of using IMAP compared to older protocols is that it's possible to synchronize from multiple clients and devices. So for instance, if I want to start writing a draft on my laptop and then continue later on my mobile phone and sending my mobile phone, that's possible with IMAP. What's the basic way you interact with IMAP? So it sounds pretty simple at first. You open a TCP connection, ideally with TLS and without start TLS. And then you write a command and then you get back some responses from the server. So it sounds simple. Here's a very simple example. Here's an example of a login command where you specify your username and your password. And then after that you get an OK response from the server if the password is correct and the login is correct. So something interesting to note before going to the next slide is that... I'm sorry. I'm going to do this, no problem. So something interesting to note is that there's a CMD1 right before the login command here. So this is what we call a tag and it's used... It's an arbitrary string, a sendizer client, and it's used to match up the server responses with the client's requests. So it's just a string echoed back by the server. So the client knows that the OK response is for the command, this particular login command it sent before. OK. Here's a more complicated example with a fetch command which is used to fetch messages from the server. So here the client sends a fetch command and asks for the message flags and message envelop. The envelope typically contains a subject and the recipients and stuff like this. And then the server sends back some replies here with responses with the first message as the flag is seen. So it's not unread. It has been marked as important. And then the envelope is very big, so it omitted it here. And the second message has no flags. And when the server is done sending all data, it ends with an OK response. Something worth noting is that here in the middle, you might notice that the command tag is not included here. There's a wild card instead. So this will have consequences later. If you ask for data, it's complicated to know if you get replies for which command it was and if it was for command at all. We'll see you more on this later. In the fetch command here at the start, you might notice the one column wild card. This is the way you specify which messages you want to fetch. And we'll see how we do this in the next slide. So how do we refer to a particular message? There are two ways. Both ways use a 42-bit inside integer. So the first way is with something called UIDs. UIDs are a unique ID which doesn't ever change except when it does. It increases when a new message is added to a mailbox. So if the last message in the inbox has UID 42 and you receive a new one, then it will get UID 43. So the second way is with message sequence numbers. It's an ordinal number. So if you use sequence number one, it means the first message in the mailbox, sequence number two, second email in the mailbox, and so on. And it goes the same way as ICOIDs, like the oldest message added to the mailbox is the first one. So something interesting is that the sequence number, they get reassigned by some operations. For instance, if a message is deleted from a mailbox, then the sequence number shifts a bit. So here's an example of a mailbox with three messages, one with UID 4, one with UID 6, one with UID 12. And if the UID 6 is removed from the mailbox, then the first message stays with UID 4. And the second message is none of the UID 6. It's now UID 12. So the meaning changes depending on the state. Another detail is that message data is immutable. So if you fetch message contents, it will never change. If you want to edit a message, you need to re-upload it and then delete the old one. So this was to refer to a single message and we can also refer to multiple messages with something called SET. The simplest set is just one message. So here's just sequence number one. Here's another example with a column. You can say messages 2 to 4 inclusive. You can specify multiple ranges like this, like 2 to 4 and then 6 to 10. And the last one is 1 to wildcard. It means 1 until the end, until the last message. That's it for the IMAP introduction. Now we can go into the meat of the presentation. Do you want the microphone? Is it on? Okay, so let's go through all these layers. The first layer is types. So what's there to tell about types? A few things. Probably your journey as an IMAP developer will start as either a client or a server developer. So it's kind of tempting to try to implement only half of the standard and to a certain amount. This is possible because as a client developer you can implement command serialization and response parsing only. And as a server developer you can implement command parsing and response serialization only. You can kind of pick only half of the routines that you would need. But the IMAP standard has quite a few of overlap between commands and responses. So there are many types that you need to define and many parsers that you need to define and serialize. So you won't end up anyway with implementing 50% of the standard but more like 70, so to say. So my suggestion would be to structure your code so that you can easily extend it to the other side afterwards. For example using a shared module. And if you are lucky and someone will provide the missing side to you and you have parsing and serialization handy, you can do kind of cool stuff because you can first generate a random message and then ensure that parsing and serialization is inverse to each other by doing randomized tests. So there's a pretty powerful kind of unit test for it. At least for me it helped a lot as you can see at the bottom. Complicated stuff. Complicated bugs. Yeah, perfect. Okay regarding syntax, oh my. I will quote Mark Crispin from the IMAP protocol mailing list because I think it's not that bad but you need to be in a certain state of mind when doing it. Alright, let me think now I'm a bit tired today. But first and foremost the formal syntax should be your holy book. If any part of the syntax distracts you from the formal syntax, you should ignore it in favor of the formal syntax. Your eyes will glaze over and your jaw will drop. You can start saying no, no, no. Just work through that stage. It's a steep hill to climb but once you make it to the top you will see everything with crystal clarity. And remember, no matter what you do, do not try to implement any command or response by looking at the examples. And he's what Mark said, so he's right. I would add that before reading the formal syntax you need to learn ABNF and I mean you need to learn it by heart because there are some subtle things you need to be aware of. And regarding lexas and parsas, I think we agreed when talking about this things. IMAP makes in some places the impression that there are things like tokens saying arguments invalid meaning that there could be some generic argument. I had a very hard time to figure out what should a token be. So there are no words on what constitutes a token and I think Simon in version one tried it and got away from this approach or used a different approach in version two. So I don't know, maybe someone has a better idea but for me you cannot lex the IMAP syntax. And another recommendation, even the syntax has layers. So first of all you have the ABNF corals that are described in the ABNF standard and referred in almost any rule. And then you have these IMAP strings which make everything kind of messy. As an example, you see this is the lock-in command, looks kind of simple. And then you have this innocent looking A string thingy in there which is for example here the username and the password. And an A string is in fact one of three types and one of two protocol flows. So you have A string means either an atom or a string, more or less or some IMAP quirks. And if it is a string it can be a quoted string or a literal. And literals do require special care when implemented. So as a simple example we will start with password. It uses only a very simple character set so you can just write exactly these eight bytes as an atom. If you have a white space in it you need to put quotes around and if you have a quote inside quotes you need to escape the quote. So it is similar to programming, most programming languages. And if you have a literal, obviously if you have a new line in there, this would be the obvious case, you need to use these prefix here in curled braces and then you just send exactly the bytes that made up your string after a new line. With a twist as we will see. What we will glaze over today are ambiguities and defects and I had a few discussions already about this one. So I would very much ask everyone if you find some defect in IMAP please report it to us. We really want to start a collection on all of these things. And one thing I finally wanted to say, I quoted Mark Crispin from this thread, but if you now will go to the internet you won't find it. So the IMAP protocol, at some point it's not available anymore due to reasons. So and for me the only lucky thing that happened was that someone I know, it's the maintainer of the Mealy email client, he had this super cool online interactive WebAssembly demo and he used the dump as test data. So that was the only reason I could read it. I guess the thing I want to say here is let's try to be aware that knowledge is disappearing and maybe try to resurrect the IMAP protocol mailing list because it's awesome, it's like a travel trove of information. Okay, then let's go back to framing. So... Oh, everything tanked up. Yeah, I'm back again. So we're going to continue to talk about some higher level layer. So flow and framing, but by flow and framing we mean how does one split the IMAP stream into separate commands and responses. So this is pretty simple. This seems pretty simple at first. Here's a simple example, similar to what we've seen. Log in command at first and then the server replies okay and then the client sends a select command and then the server replies some data and then replies okay. So one may think, yeah, it's pretty simple. You just need to split a new line and each line is a message basically. And then literals happened. So here's a slightly more complicated example where the client sends a login command, the username, and then the password is passed as a literal. So first there's a number of bytes and then the next line there's contents. So here what's interesting is that these two lines are a single logical message. The second line here sent by the client is still part of the login command. Another interesting thing is that in between here there's a plus sent by the server. This is because the server needs to acknowledge literals. So when the client sends the first line here, it says, hey, I want to send a literal with six bytes and then later the server has to reply, yeah, you can go on with this plus and then option and comment after that. The client needs to wait for the acknowledgement before sending the literal data. Okay, so that's interesting. Let's try to look at only one side of the connection. So here let's try to look at only the client side and see what happens. So we can still make sense of everything here, like login with the literal and the next line and no op. Is this valid by the way? This sounds a bit weird, right? The client sends the username and then announces the literal and then the next line here, it sends a completely different command. It's not the password or anything. Is this valid IMAP even? It turns out that yes, it's completely valid IMAP because if the server replies no to the first line the client sends, then it can send the literal. It says, I don't want your literal. So basically what I'm trying to say here is that it's not possible to pass IMAP just looking at one side because you can't make the difference between this case and this case here, whereas the server rejects the literal. So you need in your IMAP password to have some kind of feedback from the other side of a connection to know what happened. And so one may think that we don't really need to wait for the server to acknowledge the literal. We can just send the command and the literal in one go and forget about it. The server will probably reply, okay, we'll probably acknowledge the literal in any case. So here's an example of what could go wrong if you don't wait for the server acknowledgement. Maybe you have a web form on a page which lets the user save a draft in their mailbox. And maybe the literal contains like, may contain some text like this which are valid IMAP commands. So if the server happens to reject the literal, then these lines are interpreted just as regular IMAP commands by the server. And these lines delete everything from your mailbox. So that's not great. And this can be potentially inserted into HTML email hidden and HTML on a single line. And yeah, if you reply to the email, you just use everything. So yeah, it's pretty scary. So to recap everything, something I haven't mentioned is that literal can appear basically anywhere. We've seen in the login command, but it can happen in the search command. There can be many literals for a single command. It's limited to one. So literals interrupt completely the regular syntax. You have to pause the parser from the server side or the client side if you receive a literal. And then wait for the other side to reply, yeah, go on. And then you have to resume the parser. And the literal can be nested into a list or nested into something else. So it's kind of complicated to do, especially if you're using, for instance, a parser generator or something. So we can pass IMAP just by looking at the single side of the connection as we've seen. And it's important to wait for the server to accept literals before going on or security within. So another aspect of the flows we want to talk about is commands such as authenticate. So authenticate is a command that lets the client use Sassel authentication. Sassel is a binary protocol. And to authenticate in a modular way, you have several mechanisms. So here's an example of a plain mechanism, which is a simple one with username and password, but there are those as well. So basically the idea is that you get a binary message and code it to base64 and then send it over. And the interesting thing here is that, so the client says authenticate command, the server says go on, you can continue the authenticate command. And then the client sends just base64, like the, what? This is not a regular IMAP command. This is just base64. There's no tag. There's no command name. It's just like the base64 data as is. It just interrupts regular IMAP syntax with completely something else. And IDOL does something as well similar to this, where client sends IDOL, server says go on, and then client can just send the ASCII string down like the four bytes down. And it's not an IMAP command or anything. It's just like an ASCII string. Start CLS and compress are kind of similar in the way that when you start these commands, it interrupts a regular IMAP stream and wraps it up with CLS or compression mechanism. So these are fun to implement as well. So in summary, for the flow section, IMAP demands you to conflate your passing with business logic with higher level details. So you cannot have a pure password in its own little module isolated from everything else. You need to wire it up with the rest of the IMAP library. It's kind of special in this regard compared to other processes. Okay, now on to operations and semantics. So let's talk about fetching messages again. There are multiple things you can request from the server to fetch messages. So basic example, the envelope we've already seen. Body structure is if you request the MIME structure of a message with a tree of nested parts. If you have attachments for example. And then to fetch the message body, you can use body square brackets. If you just request body square brackets like this example, you get a full message body. So here's an example, very simple message with two header lines and then a simple text. So yeah, if you fetch the body square bracket, you get everything. If you want to fetch only the header, you can use body square brackets header. And then you get only the first two lines. And you can request only the text of the message. So the howdy part here with the text modifier. But you can do more complicated stuff as well. Oh my. Yeah, maybe I'll go very fast on this one. You can fetch particular header fields. You can fetch sections, bytes, substrings of the results. You can fetch, if you have a multi-part message, we have an example with two parts. So the main part, the first sub part, the second sub part with an attachment. Then you can fetch only the first part here. So the counter disposition in line one. Or you can, here this one is interesting because it returns nothing. A header actually doesn't work in nested parts. You have to use a special keyword called mine for some reason. And then if you have a message attached to a message, then you have a section of the RFC dedicated to this particular use case. Like something everybody does every day, I think. Messages into messages, like Russian dolls. The last thing I want to talk about is unilateral server data. That's another simple example of a fetch command where you want to fetch the body of message one. And then the server replies, yeah, here's the body of message one. So everything's fine. Let's say another client happens to mark the first message as important. So the way this works in IMAAP is that the next time you execute a command, then the server replies here in the middle. Hey, by the way, the flags of the message one have changed. Even if you didn't ask for it, just before completing the command, it sends this data. So what happens if another client changes the flags of message one and you happen to send a fetch command right after this happened? Then you get something like this where the server replies first the body of the first message. Like hello world, like before. And then you get something interesting where you get another fetch item for the same message, but something you didn't ask for at all. So... Yep. So it's not possible to think of IMAAP as you request some data and you get back some data. It doesn't really work like this. You can think of it as you request some data and then the server pushes some data into you whether you want it or not, and you have to deal with it. And as a client, if you ignore all but the last reply from the server for the fetch message you asked for, then you won't get the body here. So it's something to look out for. Okay, last topic, extensions. These are a bit interesting. In GoaMAV1, I tried to implement extensions as a very modular thing, which you can plug. But extensions turn out to be more like amendments. Like fundamentally alters IMAAP syntax, flows, operations, everything we've talked about. Idle and compress are examples that add completely new flows. So Idle switches to a completely different mode than you need to send a downed SQL string to switch back. And compress, yeah, just wraps the connection with something else. And then you have another kind of extension like extended list, which modifies an existing list command and adds some arguments in the middle to add more options for the clients. The search extension for extended search, it changes how the reply looks like. So you send a regular search command and then you get some completely different kind of reply. And then the literal plus extension completely changes how literals work. You get a new syntax that you need to pass. So yeah, this doesn't work at all if you try to implement it as a modular thing. IMAAP is completely mononit, if you want to implement extensions that implement everything in the same repository. It will help a lot. All right, that's about it. Unfortunately, we don't have time to talk about everything we wanted, but it should be a good start, I hope at least. Any questions? Thank you very much first. I see a first arm. Yeah, quite immune. It really helps you at the time. Hello. Thanks for the talk and thanks for the library too. I think we're using it quite a lot. Thanks for the talk and thanks for the library. Oh, okay. Yeah, yeah. My question is like, you said like, sometimes you get responses from the server. You're not even asked for, does the server also send without asking? Does the server also send data without asking? So it kind of. I mean, if you, it will only send data right before it, right after it, sorry, let's go from the start again. It will not send data on its own if you don't send any command. You have to send a command and then you reply to the command and then add its own unilateral responses to it, which can be a bit arbitrary. Like it can be anything, really. It's usually at the end of the, just before the okay response, you get some extra data and you have to somehow maybe distinguish it from the regular data. But yeah, it doesn't really work in practice. I was glad to have you. Yep. Oh, yeah, yeah, yeah. I just added that on a little bit. So the IMAP standard is quite specific regarding and it says you need to be able to receive any response at any time. So it's quite, it has in the standard, but us doing practical things. The thing we learned is that you should not trust anything that's in the standard and to the best of my knowledge, most servers don't. So you have, there are exceptions, for example, by, by respond, by untact, like when the server do a shutdown. Yeah, as answered, maybe if you can explain a bit more, but to the best of our knowledge, most people doesn't do it because at least when we tested some clients, many clients, and I mean, I mean, like the most of the clients, they crashed when we sent this. So I think there's a reason why it's not so common in the real world. Okay. Okay. Just wanted to say that if you consider the client server interaction more like that the client told the view about the server and then the server updates the view whenever you send a command, then it starts to make a bit more sense. Yep. But it can be hard to architecture a client with, yeah, against this IMAP concept. Like sometimes you don't want this kind of thing. But it's good. But yeah, it's a good mindset for sure. Yeah. All right. Any, any. Is the only, having regarding IMAP as a cash fill protocol where the client has a view and the server fills in the client's view is the only way to write an IMAP client that will preserve your sanity over years. If you try to, if you try to act as though this were a web server, you will have and this works over the years. Each new server will surprise you in some way. Painful. Don't ask me. Well, your code is. All right. Thank you very much. And thanks again to the two presenters and we come to the next talk.
[Protocols] Unicode in email: RCPT TO:
So, I'm Arnt, I work at ICANN with email and DNS. I wanted to talk about, I wanted to interrupt about 20 times in the past talk because I have been working with IMAP, email, DNS for many, many years. I wrote about half of the RFCs that Gmail supports as IMAP extensions. Alexei here wrote the other ones. My work at IMAP involves fixing bugs in any open source libraries that deal with email or the DNS and we're hiring someone so please will somebody come talk to me. I also need to talk to people, explain to them how to set up DoveCut for example and explain to people about Unicode email which is the main focus. I am Norwegian, I live in Germany and today is the first time that I've actually been in the city where my office is located. No, yesterday was the first day. Tomorrow I'm going into the office, I'm going to get a badge. I should, well, ICANN rules say I have to explain what ICANN does. It doesn't really matter, the important thing is we do boring things to do with a domain name system. Somebody has to do the backups for the main DNS, we do that. Somebody has to look after uptime. I have colleagues who grow seriously stressed when some domain has been down for two minutes. In the summer the domain administrator in Lebanon died in some accident. Things happened really quickly to get the DNS back up. My baby is email address internationalization, a set of RFCs that were written 15 years ago, 10 years ago in two rounds. We had a testbed in China where we tried one architecture, then we simplified it, made it better and that's what we have now. It's the first email change that simplifies something. We have Unicode everywhere and that's why we have at least some support. This is a valid message. You may see there's no RFC 2047 encoding, there's no quoted printable, there's pretty much nothing. How many of you can see the syntax errors in that? Both? No, it's right. The syntax errors according to current RFCs are that there is no date field and there is no from field. Apart from that, all of them don't need, is optional. Message ID is optional. It is reduced. Technically it's optional. This stuff, that's a real life message with a couple of extra header fields, pretty much like the one I showed. It was actually written by a Ford K9 mail in the Indian company that forks that and sells it to the government. To Gmail, I'm supposed to blank out all the personal information but you can see that this is Devanagari, an Indian writing system that the main works in the sense that it works with Microsoft and Google and the rest of us, well, we don't match so much, right? The changes necessary to make this domain work in SMTP are fairly simple. The server has to say, yes, I support SMTP UTF-8 and then the sender says SMTP UTF-8 at the end there. If you do that with an unexcended server, you will provoke a syntax error and the mail will be returned as an error. This is a feature. It simplifies debugging. We tried it the other way. That was bad. This simplifies debugging and simplifying debugging is great. That domain existed for a while but it was removed. It's a test domain. I think I'm not going to go into the meaning of that. Once you have declared that you want to use SMTP UTF-8, all this is legal, including that domain. If you do not declare that support, then that domain is illegal, will provoke a syntax error. However most servers do declare support for it these days. I'm happy it's much the same. The client says enable UTF-8 equals accept, meaning the client will accept UTF-8. The server says yes, I've enabled that. After that, the client can just use UTF-8 in ordinary quoted strings, which has the side benefit of eliminating most of those literals that so plague people. If you have ever had a male client that couldn't handle a password containing non-ASCII character, that was probably due to lack of literal support for passwords. We eliminate that now. I don't know if it's not the only way that people read mail today. There are five main protocols. Three of them just support all of this. If you use Exchange or Office 365, your app will probably speak a protocol called EWS, Microsoft specific thing, find client library on GitHub, and so on. Everyone just support it in the core. I'm happy sort of the laggard there. Pop has a defined extensions, but I don't know anyone who has written any code for that. Have anyone here used pop in the past five years? Thanks. The nice things about this architecture we have just use Unicode everywhere is that code like this works. People recognize strings by saying, well, there's ASCII 34, then they just go on to find the next ASCII 34, and that works without change. People today use Unicode for all the strings in their program. Okay, here comes another Unicode string. It works. This is why when I patched the Ruby standard library, the change to support enable was as big as the change to support actual Unicode email. Enable needed an actual new command. The Unicode stuff pretty much only needed testing. The biggest program that I have patched was PostFix, which needed well over a thousand lines of code. The smallest one was PropMail with zero lines. Written in 1991 needed no changes to me. This says that it's a good extension. Most of the main, most of the modern languages have support for this in the standard library already. Rust should have, right? Yours should become the standard library. If you want to support it, I have a set of EAI test messages. That's currently seven messages. If you manage to render those seven, then you have support for it. Unfortunately, one of them will not be as simple as you wish, but that's life. If you want to see whether something does support it, I have an auto responder. Auto is the Norwegian word for gray. Also it's pronounced like, or except the other way. It demonstrates a really, very, very nice bug, which I'm not going to explain. It becomes a very rude word if you push it through a certain bug. Most of the service support it, at least six here. Postfix and Exim are the two biggest ones open source now. Something like the next four also support it. Halon and Momentum are the service that send you mail like your package has shipped. Mavis Clownout is pretty much support on the service side. Clownside is not so good. For my work, I run into a lot of bugs, speak to a lot of people who implement this. And this is the common bugs, things people tend to run into. The worst of them is the third Gmail uses all UTF-8. This is nice. It works smoothly with a lot of code. Exchange sends this awful things with Xn that you see there, which I hate. You can see why they do it, much does the same. It's something that needs to be handled. To my mind, it's a bug. There's a lot of code that looks at strings like in the header fields and say, oh, do I need to do RFC 2047 encoding here? That can be modified. You need extra tests to check things. In December, I went to China to do interrupt testing with various software companies providers there. Two of them had bugs where they accidentally encoded the local part, which is not legal. Absolutely not legal, will not work. It's complicated, but the short version is it won't work. It can't be made to work. You're the first audience that hasn't asked about downgrading either by rule or by interrupting. People always ask, what if people with a Chinese address want to send e-mail to someone with an ASCII address? It's a fascinating corner case for people like us. It doesn't really matter to the people who want Chinese e-mail addresses because they only write Chinese. If you only write Chinese, you're not going to want to send e-mail to somebody in India. So in real life, it doesn't really matter, and supporting it, which we did in the testbeds, made debugging really complicated. It was difficult to find out where a bug was. Better to kill the feature, have simplicity, have something that's simple to implement like those 100 lines in Ruby. And these are the RFCs that need implementation. I'll be happy to talk to anyone about either using libraries, whether Java, Mail, Netimap, Netimap. There are several things called that. Talking about this on the wire. I'll also talk about any older RFC. And I'll talk to you about the job that I have. One more person to do my kind of work. Yes, Alex wants a new job. That's the official slide that I have to have at the end. I can policy. Thank you. I've not seen who has been first of you. Very short question about the UTF-8. How is the current proposed way to do comparison in server-side software, for example, usernames and password and addresses? Right. The question was, how do you do comparison of usernames, passwords, addresses? There is a set of RFCs for that called Precise, P-R-E-C-I-S, which I understand is only modestly supported. At present, I think that pausing on bytes unchanged is best if you can do that. If you do that, some Arab passwords won't work, but in most of the world, it works. On a somewhat longer term, we need a shareable open-source implementation of those RFCs. It will be a problem for more people than part of the Arab world. Still, some implementations have done this for years and not suffered. They happen to be outside the Arab world. Next, Eniets. So, if people have addresses with non-ESC local parts, do they also often have in practice a backup address in ESC only so that they can use that if they get the failure to deliver? It depends. In China, people have both. In India, the people who have Indic email addresses have only that. So, that's culturally dependent. I think it has a lot to do with what kind of input methods people use on the keyboard. In front. For IMAP, as UTF-8 equals accept, been completely folded into IMAP for revision 2. Are there some subtle differences? Yes. I think there are subtle differences. There is one subtle difference, and you should do what IMAP for revision 2 does. Alexei got it right. His RFC is right. The other one is bad. So, I should ignore what's said in the old one. Even if in the revision 1, if I implement revision 1 only, I should do the thing that revision 2 says. You should do what revision 2 says. Okay, I have three questions. Maybe take it quick. Okay, quick. Andrei here. I like to use my accent character, which is just Latin extended character. I see that this email address internationalization is mostly for different scripts, but with this extended Latin, I have very often the program that my email address is invalid for many websites because there's this accent character. And it's just the, I'm just the easy part. I'm just doing the IDN part. The local part is almost never working. The IDN part behind the ad sign is working better, but still I have to put it there and code it. So, wouldn't it for maybe extended Latin actually be better for sake of compatibility to have some standard way of let's say, remove all the critics and make it ASCII and so it would be backwards compatible? Quick answer, Art. Suppose you do that using the standard library that exists. What do you think Gros becomes? Okay, it's not in the dictionary. Which, when you change somebody's name to something not in the dictionary, you have a user experience problem. The larger answer is that people have suggested and tried various kinds of downgrading. They are overall, each of them, more work than just adding support for unicodes. Nice intention in practice. Really, when you compare it to writing 100 lines of codes, almost everything is bad. Okay, so I think you're trying to show that the negotiation happens between one point and another point. It's not end to end, right? So if any intermediate doesn't support this extension, how will it affect the actual message being composed? The second question is, does this have any impact on DKIM? Okay, first question. Yes, the entire chain of mail service has to support it, which is the reason that I was happy that Amarvis, PostFix, Xim, blah, blah, supported it so early. The chain is, in practice, not a big problem today. The user agent is a big problem today. And particularly, user agents like contact forms, often things. Second, DKIM could be a problem. I still haven't seen any bug related to that, and believe me, I see all the bugs. Bless the DKIM people, because they must have done something right. All right, thank you very much, Arndt. I think we have all the questions. It says more questions, says to the chat, and I think Arndt is also happy to join. And we switch over to the next session, so we... Thank you.
[JMAP] JMAP: Getting Started
So now we are really getting a little bit after we dove a little bit in the old specs and standards and you know like details of IMAP, we are now going to hear a lot about Jmap, sorry, yeah, sorry. We'll be talking about Jmap which is a new set of standards that has been engineered in the last couple of years by some very engaged people that also in parallel have been contributing a lot to IMAP and still do and we are very happy that we have one of these persons here or representative from Fastmail which has been a company very instrumental and putting a lot of effort in this new set of standards and yeah Rick your stage is yours, we have learned about Jmap. All right, applaud fast because I got a lot of slides and I got a little bit of time. So we are going to talk about Jmap, funny story, I was pitched this talk where I was going to talk about Jmap and I'm like what is it, how does it work, why is it so great, how can you use it, how does Fastmail use it, all this stuff I covered everything, it was a really good talk, it was like an hour long and I looked at my email that I was coming here and it said you get 15 minutes. So and it had to be in PDF so this is just like the absolute minimum dot PDF of my slides and if you want to hear the whole thing and see all the builds and all the animations and everything about IMAP and Jmap that can be arranged very easily, talk to me later. That's me, I work at Fastmail, I'm not going to talk about myself, we don't have a lot of time. Let's talk about IMAP, sorry. Who is here for the IMAP, what I wanted to know before I wrote IMAP library earlier. You, okay well you missed a lot of horror stories but I'm going to give you some now, this is IMAP and I'm going to be real brief about it. What you're seeing here is the server in white, the client in yellow, we log in, we says yeah you logged in and now we select an inbox, okay this is the IMAP protocol, very basic but here's all the like beginning of the parts of the grammar you need to parse and it's a bunch and if you were here earlier you're going to see lots and lots and lots of more stuff, weird literals, like weird ways the interaction with the server changes how you parse the response, synchronizing and non-synchronizing literals, it's a complicated protocol and it's not like other protocols that you're using and there's a really simple reason for that which I'll get to, oh yeah, right this is the protocol to do stuff and then this is the payload of the message is MIME which is like another thing nobody wants to deal with, it works great and it pays my salary but I mean, oh, see what you want about HTTP and JSON but at least it's not this stuff, right? Like you probably all know how to use HTTP even if you don't know how it works under the hood and you probably know how it works because it's really nice and simple. So I had lots and lots and lots of slides talking about how weird IMAP is and I would love to tell you about it but I'm just going to tell you about this one thing and this was touched on earlier, blah blah blah blah, server and client are talking and eventually the client says I want to mark message 12 deleted, store the flag deleted onto that message and the server says great, you have fetched this information and this is where people get really confused and it comes down to something I didn't said earlier, the only way to understand IMAP is that IMAP is a cache invalidation protocol, it's a protocol that tells you what to do with your cache. So you've got a server and you've got a client and the client can send basically the commands you expect like I want to fetch or update or create or delete messages and the server's response is in response to that here is how you should update your cache and if you don't think about IMAP that way you're going to have a bad time. Everything, yeah everything works this way. If the client says I want to work with the inbox it says select inbox and the server says there are 172 emails and these flags exist which is a way of saying here's how to initialize your cache. When you say I want to look at my new mail the client says fetch these things and the server says you fetched these things which means put these in your cache. When you say I want to mark this mail red you say store this flag and the client says put this in your cache. That's how it all works and you have to start by understanding that even to understand IMAP. You want to talk much more about IMAP. Okay there is one more thing though. This is another fairly basic IMAP conversation where we're saying we want to come up to date and come up to date is really important. See at the beginning we say queue resync that means we want to quickly resynchronize our IMAP storage offline. So we say queue resync and that our client state is 123. Just tells what the state was the last time we synced. And we get told great your next sync is going to be 130. Here's all the changes to apply and when you're done you'll be at state 130. Without this IMAP kind of sucks. I mean it's better than pop but one of the great things about it is you can synchronize, go offline, come back later and quickly get up to date no matter what else has been going on. Okay now you understand IMAP. Good job everybody. Yep. Who wants to go implement it? Yeah these four freaks. Okay the good stuff is good but the bad stuff sucks and there's so much bad stuff. So like good stuff. You can resynchronize from previous session. Great. You've got a domain specific model IMAP is built around email. Really nice. How about the bad stuff? Okay the data format sucks, the transport layer sucks. The code that's out there is not great mostly. The key features of IMAP aren't in the core protocol sometimes so you need to make sure you've got the right extensions loaded, the right capabilities available or you're implementing to the worst common denominator and there's way, way too many parentheses. Okay so this is why we built JMAP. JMAP is the JSON Meta Access Protocol. It's just JMAP, it's IMAP plus a thousand right. This is what it looks like. So already I hope people are feeling better, you know what this stuff is right. We're posting a request, we're posting a request to the JMAP endpoint and we say I want to get these emails. Great right so just like everything else it's a restful protocol kind of. Here's what you get back in response. You said you wanted to get emails one two three four, here's one of the ones you might get. You did an email get, you're getting a list of messages, this one has ID one and there's its subject and there's more stuff. But it looks like this, you can parse this. Anybody knows what this means. Here's a bigger context of it so you can see like there's an ID and there's parts of the body and the subject but the thing I want to call your special attention to is it's got one simple date format. Yeah I mean you can stop there and it'd be a pretty good improvement on IMAP and IM. But we're going to keep going. Here's another thing, you can say I want to get, when the server responds to you, it can say yeah you did just get these messages and by the way your email collection is at state 616. It's just like that Q-resync thing. It's going to let you say later I've got mail and it's got it all up to state 616. Hey server tell me what changed since then. And the server replies and what it says is here are the changes. You were at 616, you will be at 717. These two IDs were created and this one has changed in some way. And then you can decide to do what? Update your cache. What do you do? Maybe you refatch those messages. Maybe you just invalidate the local storage but you know how to change your cache. It's just like IMAP. JMAP is a cache management protocol. It's just easier to use. Here's another example. Email query is basically what we call search when you search your email on IMAP. So we're going to search for mail that's been flagged and that's from me. Really simple. And the response to that will look like this. You did an email query. Here are the IDs that result from that. And the reason that it gives back IDs, it's about managing your cache. You should have messages cached. If you don't have these, well now you can fetch them. But if you did have them, why send you the messages back? You should have a cache with these messages. If you didn't, you would go ahead and say great, email get these messages. I didn't have them but I want them so you get them now. And it works great. It makes sense. You can think about this really easily. But we should talk about IMAP again. So in IMAP it works the same way. You say I'm going to search flagged messages and it says here they are and then you say I'm going to fetch those. Right? Makes sense. Same thing. IMAP and JMAP look the same in a lot of ways. This is what you don't always see in these diagrams. Where the round trips come in. Right? First we search, goes to the server. Server computes the answer, sends it back. They only say I need those messages. We say give me those messages. Send it to the server. Server finds the answer. Server sends it back. You're waiting for the speed of light back and forth twice. That's what happens here too. Right? You say I want to do a query. I get the answer. I look for those messages. It goes to the server again and it comes back. So the same waits sit here. But you don't have to let them sit here with JMAP. Because when you write your query you can write this. I want to do a query and a get. And what is the get going to fetch? I don't know the answer yet. That's okay. You tell the server which IDs do you get? They came from another thing I asked you to do. So get the IDs by looking at A. It should be an email query. Get the IDs out of the response that you compute before you send anything back to me and do the method call with those. It's called a back reference. And you can have a whole bunch of method calls that back reference to one another to let the server do all the work and only do a round trip back to you once. So you get one wait state. Really good. Okay. Couple more things. This is a larger section of a JMAP query. I've put in some more things. I've been skipping on these slides. Mostly you've been seeing this stuff, actual method calls. But what up here is good too. This is called the using block. It tells you what capabilities you want to use. This one's really simple. If you squint you can see we're using core which is like, yeah, I'm speaking JMAP. And mail, again, I'm looking at mail. But you didn't have to squint apparently. I had to build. But you can have lots of other capabilities. At fast mail we have contacts and calendars over JMAP. And those are going through the ITF now. They'll be RFCs and we have lots of other stuff too. What that means is if your server supports mail and contacts and calendars and other stuff, when you come back from offline, you can synchronize everything with the same request. Not just the same protocol, but hello, I'm back online. Please get all the changes since my offline state and fetch the updates to me all at once. You can also write your own custom data types for whatever appeals to you, whatever your business needs to use, add it to your implementation. Because even though the data types in JMAP are domain specific, we let you build your own. Anybody can build their own just by describing how those methods will work. I'll talk about it just a little bit. Fast mail uses for mail filters, your preferences, your credentials, your DNS, your files and billing, all kinds of stuff. We just do over JMAP because it's great. Okay. Getting close to the last things. We also give you event source. Event source is a long-running connection. I'm old enough that I still call it combat, right? Like you connect to the web server and you say, tell me when things change and you stay connected. And every once in a while, the server sends you a little blob like this saying, oh, there's an update to your email state. Oh, email and contacts have changed. And when that happens, what does your client do sitting there connected? It invalidates the cache. It can refresh things. It can update the screen immediately. So I'm at Paz this with something called idle, but CalDav doesn't, CardDav doesn't. And when you do this on your mobile phone, idle is not going to help you much because Apple sure as hell is not letting your phone sit there with a connected TCP stream live to your IMAP server all the time. So people build these interstitial servers instead of getting a web push which would just directly send your phone a message. And JMAP supports web push. So you could just get real-time updates from all these protocols. So this is our IMAP. We get rid of all the bad stuff just about and add all this good stuff. JMAP and HTTP, anybody can use. Avoiding around chips by combining these requests. Putting lots of data types in one place and real-time synchronization and the cost is that not everybody's using JMAP yet. It's growing, but it's still pretty early and there's way too many squiggly braces and double quotes. But like that's the price I'll pay. Okay. So what now? You want to know how this works? The first thing you should do is go look at this repository, fastmail slash JMAP samples. It's code that just does some real basic stuff with JMAP and you won't understand it yet, but it's going to give you an idea of what JMAP use looks like. Simplest form. Then it's time to read RFCs. Yes. Don't worry. They're actually pretty good RFCs. You should look at these if you want to play with JMAP. The first one is 8620, which is going to tell you what the basic methods are. And then 8621, which tells you the data types. So 20 is going to tell you things like how do you get, how do you set, how do you do changes, just what those are that work on any data type. 8621 is going to tell you the specific data types that we use like mailbox, thread, email and so on. Everything else, you just learn more data types in more, in calendars and contacts, basically how the protocol works. You learn the data types on top of the core methods. Some highlights from RFCs. Yeah, okay, I got it. A minute and 18 before questions. Email is the most complicated data type in JMAP for obvious reasons. Emails are big and weird and complicated. JMAP does a great job of making them easy to deal with. Here's an email get. When you do a get, you can also say which parts of the thing do I want to get. Don't get every property, just get pieces. So I might say I want the from, to, subject, preview, like a little snippet you see in your mail client, and it's mailbox IDs. So what do you get back? This. You have a build. Great. The to and from come back as structured objects that have parsed the email headers for you. Nice. The subject comes back decoded. That's ASCII, so that was a poor choice of string, right? But it comes back decoded. The preview is decoded and mailbox ID is this weird set thing. Why is it an object instead of just the one mailbox ID? Because it can be in multiple mailboxes ID. And if you hit me up later, I can tell you about labels mode, which is what we use this for. It's really nice. So the headers, you could fetch the subject, but you could also fetch the header called subject. And when that happens, you get back the quoted printable, the literal thing. But if you want instead, you could say, give me the subject, the literal bytes, or give me all the headers, because maybe there's multiple subjects, or all the headers, but decode the text. You can get anything like that. You've got no time left. I'll show you this. When you fetch the body, you can get the blob ID. Don't do that. That's how you have to mine parse it. Instead, you say you want to fetch the text bodies and all their values, and you get something like this. Here's all the bodies you need to display the full text of the message. Like there's no mind parsing, there's no remembering. What do you do with, you have multi-part alternative and multi-part related? How does that, no. Just, just do that. Okay. Yep. Time for Q and A. The first thing I will say is you can ask me for more later, use fast mail, blah, blah, blah. How about questions? All right. Same here. Hi. Thank you very much. So one quick question about adoption. Did you reach out to, because when looking at this protocol, and I've been playing around with it for some time now, it looks fairly similar to whatever Google and Microsoft do. I'm not familiar with those companies. Yeah, yeah. Yeah. So is there any chance that these guys would be interested in adopting this? Yeah. Yes. I mean, I think I can just say that. It's, you can imagine like Microsoft, Apple and Google are all standing around a well in like a spaghetti Western with their guns at each other, like who's going to change first? Right. Apple's client is by far the most popular mail client in use. Google's servers are the most popular servers. If either one breaks, we're in. And I've spoken with people, these companies, and they're interested, but of course, it's a huge amount of work on something that even though it's clearly technically superior and a big win is a gamble. It hasn't won yet. I'm pretty optimistic that we're going to see things happen, but I don't have any secret knowledge. Yeah. Thanks. Hi, thanks for the talk. What's about JMTP? Yes, JMTP. Yeah. So replacing, replacing server to server communication is a much more fraught problem than replacing what your client does. I would. Yeah, yeah. So submission. Are you asking about submission? Okay. So mail, MTA to MTA, right? Full exchange of mail between different, the Fediverse of email, if you will. That's going to be SMTP as far as I know forever. I'd love to see JMTP replace it, whatever the hell that is. But submission where your mail client says I want to give this message to be sent, JMAP supports that and it's really, really good. It has lots of really nice features. It has the ability to tell you, oh, by the way, that mail you sent bounced. It has the ability to tell you how many people has it been sent to. And the way that you create messages as a client author is much, much, much simpler. You don't have to think about constructing mind bodies yourself. You can just say, here's some attachments. Here's the text in the HTML and the server can do everything for you. So it does replace that. It also, because it's one protocol, you're never like, I can fetch mail, but I can't send mail because one server's up and one server's down. It just always works. What do you do about encrypted messages? So you're like open PGP or SMIME sort of things? Yeah. So what do we do about encrypted messages? Punt. Well, so there are some RFCs about SMIME and handling SMIME messages, I think all by Alexi, if not mostly by Alexi, that I would say are optimized for the server having access to your key material, right? Is that a fair way to describe it? Yes. Yeah. And there have been discussions about how we would deal with encrypted messages when the server doesn't have your key material and only the client does. We've talked about it's complicated and I think there's interesting things we can do. But generally, Jmap is built around the idea that whatever the server can see, you can see. And encryption, as usual, makes things less convenient. All right. Thank you again very much, Rick. I think it will be around.
[JMAP] OpenXPort JMAP: a PHP library for Data Portability
All right, we head on with the next talk. All right. Floss is yours, yours. So let's wait for the room to cool down a bit. So I'm one of the lucky ones having only five minutes for my talk. So I'm going to keep it very brief. I hope you can hear me. Good. So I'm Joris. I work at Rodrigo. We do quite a lot of work on data portability. And that's how we came to JMAP. So Ricardo already did a quite good job of, well, presenting what it's all about. For us, the main thing what we wanted for is having a unified API. So I think there was one slide where he said we add files and calendars and contacts and whatnot, own extensions for that. And it's really well for that, actually. Yeah. So I'm just going to skip that slide because I don't have much time. Yes. So in the end, one thing that was not mentioned in the previous talk is that JMAP, the JMAP calendars and JMAP contacts is built upon Cardiff and Kaldiff, which themselves, they build upon iCalendar and vCard. So there is a replacement, a modern replacement for iCalendar and vCard, called JS Calendar and JS Contact, and a modern replacement for Cardiff-Kaldiff, which is called JMAP Contacts and JMAP Calendars. And that's what we are mostly using heavily. And yeah, in addition to a bunch of other data types that we also added. So the work that we did is we, first of all, we have a client, we have a server. We move data from one service to another, data portability. So the client is using, it's a Java client. So we collaborate with Danny Gulch here. We have added a lot of features to the library already. We still need to work out how to combine that well with what is already there, because we also would like to see that the JMAP Java library is the go-to library for JMAP in the Java world. And on the other side, on the server side, we have our own software. It's called JMAP Open Export, which basically makes it very, or is supposed to make it very easy to add a JMAP API to PHP-based systems. We already added support for quite a lot of data types, or verticals, files, calendars, contacts, and so on. So it can also be used to lift files that are on a... It's an ongoing project that you could attach a JMAP API to files that are somewhere on a server, and then you can migrate those away. And obviously, we support JS Contact and JS Calendar. There is an RFC in progress right now to convert... RFC for JS Contact and VCAD is already existing. How to convert between those two, and another one is work in progress to convert between iCalendar and JS Calendar, so it's to make it easy for developers to start with those formats. Yeah, so basically, that's what we extended. Right now, we have a JMAP API for NextCloud, RunCube, and Ancient System Squirrel Mail, and Horder is more or less Ancient System, I would say. Yeah, we already use it in large-scale migration projects with a lot of users. Yeah, so let's finish with the last slide out of time. Yeah, so there's also a JMAP Dart client from LinaGora that we are extending currently, and building a JMAP CLI around that. Yes, so there are also other specifications that you could read upon. Didn't finish quite in time, I'm sorry for that. Oh, fine, thank you. Looking around, here's one. How many lines of code is your Java JMAP client, and what does it require in direct dependencies? So we might even relay it to the next speaker, I think. Yeah, but our client is quite big, actually, but it's a library that we're using, which is quite lean, I would say. Now I don't feel bad at all. Any further questions? Otherwise, I think the next speaker may come, which is actually Daniel Gutsch, the author of the state, the JMAP Java library, and some tools.
[JMAP] Intro to Ltt.rs a JMAP client for Android
It's fine. Anyway, good morning everyone. My name is Daniel. Today I'm going to take a few minutes to tell you a little bit about a JMAP only client for Android that I've been working on for a while. But first a few quick notes about myself. I usually work in instant messaging. I'm an XMPP developer. I am on the Council of the XMPP Standards Foundation. I develop an XMPP client for Android called Conversations. And yeah, JMAP is a long-term side project of mine. I checked yesterday that I registered the LTT.RS domain in 2017. And I think I've been working on that for even longer than that, somewhere on my hard drive. There's an implementation for the pre-RFC JMAP thing. That's at Fastmail Road. And yeah, these days I develop a aforementioned Java library and the Android client for letters. So why JMAP? So as someone who's starting from scratch, I think you already got the sales pitch for JMAP. You have a same set of extensions. You can do sent and receive over the same protocol. JSON parsers are readily available. You don't have to do whatever IMAP is. On top of that, you don't have to do any mind parsing. If you ever wrote a mind parser, you know how much of a relief that is not having to do that. It has built-in push support. So especially if you're targeting web or modern mobile phone operating systems, it's good to have vendor push. And yeah, essentially just see Ricardo's maybe omitted slides on how bad or how weird IMAP is. And yeah, you pretty much know why I way to JMAP. So a little bit of the architecture. The way applications have changed or how Android applications are developed have changed quite a lot. In the last 10 years, Google has released a set of libraries they call Jetpack that make application development a lot easier. And Lettuce tries to use a lot of them. For example, there's Room, which is a database abstraction layer where you basically define how your UI displays the information in the database. And then whenever you write to the database, your UI automatically gets updated. And only those things that have changed. So the way I implemented is that my JMAP library has a generic search backend that's then implemented by Room. We write data to Room. And then magically, our UI gets updated. And we don't have to do anything. And also because my main job, again, is developing conversations, which by now is like 10 years old and quite legacy. Lettuce also has a sort of playground for me to work with new Android APIs, such as Material.U, which is the new design language, or predictive things like that. So you already heard that both IMAP and JMAP are essentially like cache management protocols. And that allows us to have great offline capabilities in Lettuce. So all queries, whether you view a certain mailbox or even if you do a search, those are all cached. So if you retry a search or redo a search, when you're offline, you still see all search results. And then all user actions are handled by another Jetpack library called WorkManager that automatically retries those actions when the user comes back online. Yeah, while the app is in the foreground, we use web sockets and event source to listen for server side changes and refresh the UI. And we also, when the app is in the background, we have a fully open source web push implementation. We don't actually use LIC Play Services library, but we talk directly. We are open source code too. So Firebase, or the Google Play Services to retrieve a web push URL, you can actually trick Firebase into giving you a web push URL instead of doing the application server thing that you might be familiar with from other Android apps. But that requires like Vapet, like a voluntary application server identification, which JMAP currently does not support. And I'm in the process of writing in RFC for that. And yeah, because we have native web push, we can also hook in other push implementations that are not bound by Google. For example, like Unified Push. And the way that works is, for example, that the JMAP server can tell my XPP server to tell conversations to wake up letters. And then Google is not involved at all. And I can like self-force every part of that. We also have native enabled by default order crypt support. No plug-in required. It just works. You see a lock icon on your compose screen. If the other part supports it too. During account setup, we ask for key import. If we previously used setup messages, just refer to the auto crypt spec on how that works. But server devs, please allow us to search for arbitrary email headers, because we need that to discover the setup message. That's it. Thank you for your attention. You will find the code of the JMAP library on Codebook, the Android client. If you want, follow me on message. I'm Daniel at Google.Social. The source code for my slides is also online. Yeah, thank you. Any questions? Thank you. Any conversation about letters or JMAP? Come on. So you said there's no need for a MIME parser. Is there really no, never any reason to have a MIME parser yourself? Yeah, I didn't want to put that on the slides. But as soon as you do like PGP encryption, you do have to do a MIME parser. That's what it was. Oh, damn, now I have to deal with my parsing. But the MIME that is in most PGP messages that I encountered is a lot saner than what you might encounter on wild email servers. So yeah, that's a relief. All right. Any further question? For the boost, I want to know. Do you use Udify? You can receive the notification or all the Zmaps over? Yes. Yeah, so JMAP has built in web push support, which is like an RFC as well. And then you can either speak web push towards Google and let Google relay your messages or use Unified Push. And you best go to UnifiedPush.org if you want to learn more about the self-hosted version of Unified Push, because that's too complicated of a topic to have in a five-minute Q&A session. All right. Any further question? Otherwise, thanks again to Daniel. Thank you.
[Servers] Exchanging Microsoft: Implementing 27 MS Exchange Protocols & APIs in OSS with grommunio
which is, as you introduced, a complete groupware suit. And I will start by pointing out the corner piece of all of this, which is map-ed map, the messaging application programming interface, something that has started around in the early 90s, let's say, by Microsoft. And it can actually refer to a multitude of things, including the data models. You have inboxes, folders, messages, attachments, and a relation between one another. And then you also have the programming interface. So, for example, this is what Outlook uses a C API that is in some DLL. But as we may see in the specifications, map is also used to refer to network protocols, which makes this a bit confusing. Because a program like Outlook could use the programming interface without ever talking map on the network. Or we could have the inverse where there is no map programming interface, but there's map token at the network. And so the leading map-ed product is, well, there's no question about it, is just Microsoft, so exchange and exchange online, which is the, the first is the on-premise server, which you can install in your own hardware. And exchange online is just this magic cloud service that you have no more site into what happens. There's an extra interface provided by the online services, which is Graph API, another rehash on how to access your own mailbox. And maybe that will, in the future, be added to the on-premise version as well. We don't know. And the coverage of the on-premise version is, well, it's not a problem. But it's a problem of exchange is quite broad. There's over 100 documents, over 8,000 pages of documentation that they've written in the past 10 plus years on the matter. And of course, because we have our Unix mailbox formats and internet mail formats that were defined by, well, let's say the internet community, the text-based formats that we all know and love. You also have to support them in such a mappy world, because that's what everybody else talks in the transport. And so coming to Grimunio itself, for the audience at Foster, it's probably easiest to say Grimunio is a product and it's also something of a Linux distribution with various components. We have our core central server, so what we call the information store, which holds the mail and provides a basic API, like give me this message, give me that message. And then we have added components such as the postfix that's well known for just handling most of the SMTP delivery part. For file sharing, for example, we have also reused more software because why invent your own? If you can make something else work rather well, and we can use either own cloud or next cloud, they are pretty much the same still. For chat, for chat we've used Mattermost and all of the integrated with, for example, the user database that we utilize. It may be hard to see on the screen right now, so feel free to click on the PDFs in the schedule. So our information server is called Gromux. It wasn't really meant as an acronym, but if you find something that fits, feel free to just make whatever you want of it. And it implements, so the first slide alluded to 27 protocols, it's a bit more, but it's still not all of them, all of the 100 documents. And we just used and implemented the bits of the specifications that we needed to get email clients running in the expected functionality, which is mail, calendar, contacts, meetings, which is a part of calendaring. And so at the end of the list, you can see, for example, PST and CFP, so we can import from most, so like 80% of the Microsoft's very own formats. I'm still working on something more, but I need to see where that leads me, because that's not documented. So when we do reimplementation of the protocols, what you more or less do is you set up an SSL interceptor on either the local machine or, for example, a Linux machine, use the same SSL certificate or redirect. The tools for this are Fiddler, which is something of a wireshark in its own right, but more web-oriented. And then you just compare to the specs if it matches them or not. And you've write your usual marshalling code, which is serialization and deserialization to network protocols, because the Microsoft ecosystem is very binary format heavy. So, yeah. Using MFCMAP and OutlookP1 can trigger the data and the various actions on the mailbox and then look at what the requests look like. Again, this screenshot doesn't do justice on this laptop, so please excuse that. So you see the individual requests and can analyze them. So people have already written large amounts of dissectors in the past, and one can use that. Furthermore, once you get the decoding of the network bytes right, one can look at the logical structure at which point. It's beneficial to just turn on request logging at whatever level there might be one issue remaining. For example, when you just return the wrong data or unlocks are broken once again. That's then where you can also step in with the buggers, for example. We also implemented EWS. It's the exchange web services. It's more of an XML kind of access protocol once again for the mailbox. Again, we start Outlook or other mail clients. Look what the requests are, trying to make sense of it using specifications. And then possibly replay those requests to an exchange server, because when you start out, you let the Outlook connect to an exchange and see what is it going over the wire, and then it's re-implemented in our own. So that's a bit of, it's a laborious task, but eventually that too was finished. The tool we used here is called Postman. It's something like a graphical lip curl if you want. And so we now have a little bit of a problem. With the help of EWS, connectivity for Mac clients, for example. So this is Outlook for Mac with my sample inbox here, just one attachment. And it also works with Mac mail, it's called, which is Apple's own implementation of a group web client, I guess. And so you have now access to exchange-ish calendars, but using of course our implementation instead of Microsoft's. A very small technical interrupt, sorry for that. I got some feedback from the back of the room that it's quite hard to understand speakers in general. Is that right? Can probably raise your hand if it's a little bit hard to listen. Okay, that's nice. I was with a technical team at the sets that cannot amplify the speakers in that room because the acoustics in these rooms generally is a problem. So what can be done is that we keep the doors closed first of all, that people keep a little bit silent because it will reduce surrounding noise at all and probably also as a speaker, you can try to speak up a little bit because that would work. Bring the mic. If you might. Maybe just use one. Turn it off. Yes. Thank you. So this works. Seems to work better. So in the course of doing that, I have and my team have found a number. Like so. Directional microphones. So in the course of all that work, we have identified multiple problems and specifications. Things being underspecified or omitted outright. And so we just send a bunch of pull requests to Microsoft in that regard and they've been accepted so far. And so Dromot is the information store that we have. It uses SQLite. So it's also quite snappy. I would have a demonstration, but this is a presenter laptop, so I didn't really depend on it. So we can do all these various protocols, including the traditional RPC. So Samba was a great help in that. On top of that is MAPI HTTP and it's EMS MDB. These are all the binary formats that the classic outlook speak that you can run on Windows. And as said, the Mac ecosystem uses EWS and your mobile phone. This is then on the right hand side. It uses actually EAS, the active sync. So yet another protocol for mailboxes. I think I'll be here implementing even more protocols for the next 10 years. As I said, as I alluded to earlier, Graph API is the next hot thing on Microsoft. Yeah, let's see where that goes. Yeah, some components are implemented in PHP 8. If you want, you can also run them on 7, but who would want that? This comes from that there was existing software that could be reused once again. And so that's our main binding if you don't want to interact with C++. What does the feature hold? We would like to work on better utilization of accessing one mailbox, like carrot mailboxes. You've never really seen any until you have a lot of one store where a thousand users at once try to access it. Better improved support for the internet formats would be very well done. We have some old internet mail pauses in there for the RFC 2822 stuff and so on. I've started working on that using more of LibVmime, also already properly realized by some other open source projects of the past. And of course, reporting more errors in the specifications as I move along and find some time to deal with the Microsoft paper. Because not all of the specifications are actually published on GitHub, so all you can really do is make a normal issue here. I'd like to have that in that paragraph edited and see where that goes. Thanks, that's my time. If you have any questions, now's the time. Maybe we'll change again. Thank you. Is there any questions? I see one here. Wait a second. The current Outlook clients. Which one? Yes, the new one in the web one and the one right before that was still the C implementations, the two for Windows. Which protocols do they use? They're unused right now. The Windows Outlook client as of today, so 2019, 2021, offers the classic binary protocols, EMS, MDB and NSP. It can run them over DCR obviously or MAP HTTP depending on which features the server advertises. So it can be proxied these days since 2013. And you can also configure the Windows Outlook to use ActiveSync instead if you so want. But that way you don't really have access to public folders or shared mailboxes right now. A question about these various protocols that Microsoft made, how old are they and do you think some of them are well designed or not? What's your feeling about this? I would like to say at first I said, oh, well, this is horrible, but then there's a bit of a storm, where you get into, oh, so that's what they must have for that time. Certainly, so the DCE stuff comes from OSF as some of you might know. So the people that invented X.400, the very terrible version of LDAP. So that's the worst I guess. Rest is kind of sensible at some point. There's a bit of legacy baggage and times when you think, but why? You got to do what you got to do. It's specified and if you want to make it work with the various clients, you just have to do it and then move on. And do you have client interoperability issue that different clients do it differently or are the clients somewhat consistent? So under Windows at least, everyone uses mappi32.dll, that's the C interface. That's, oh, how long has it been here? So mappi started, I believe, in 92 in its very first incantation and then they, at some point, it was what it was today about 96, I guess, outlook 97, maybe everyone remembers. That's where it, it still looks the same like back then. So that's the approximate age of it. As said, you have the C interface for programming or even visual basic interfaces. And then there's the network parts, how the network protocols are, I don't really know because Microsoft only started specifying them after the EU legislation, antitrust things and so on and so on. So you can see in the documentation, the documentation was an afterthought as always, which isn't all that wrong in practice. I mean, Perl is not really specified by its interpreter. Python is kind of specified by its C Python interpreter until someone said, we need to improve that. So all good. But specification is really like manual pages. You can't read it like a book. But if you try to find something out, you use the index or just search function for a keyword and then go to all those places where the keyword appears and then just read and then maybe magically you get a bright idea what it was that it's supposed to be. It seems to me like in any large business, the people go in and go out of jobs over the years and then, okay, so this paragraph has almost nothing to do with the next one, but it's correct what's in there. All right, we still have time for questions. Is there any? There's one. I see here on the table of the C++ 70 protocols, there's also IMAP and POP3. Are they actually re-implemented on the project or you're using another backend server to actually implement over them, exchange and so on? We have an IMAP POP3 gateway on top of the information stores because I would love to throw the one out that we've inherited over the years and use something like Dov and write a plugin for Dovcott, for example. Unfortunately, Dovcott is not documented at all. So right now I'd prefer, well, we have what we have, so that's that. All right, thank you very much for your talk again. Thanks.
[Servers] Aerogramme, a multi-region IMAP server
Hi everyone. So I will present IROGAM which is a multi-region IMAB server. And so the goal of this talk is to discuss about this multi-region thing. But before starting some context, so my name is Quentin and I have a PhD in distributed systems and this talk will be a lot about distributed systems because that's something I know. And I try to work as much as I can for a collective. It's called Duffler and we try to build a low-tech, etic internet and if you want to know more about the things we are doing, there was a talk yesterday about garage where the infrastructure, the self-hosted, geo-distributed infrastructure we have is presented. And IROGAM is part of the strategy and the project of this collective and also a very nice thing. It is supported by an internet and they are very nice. I have to mention it. So first the problem we want to solve and I like to say that with emails we want to make other people available when it would be as always impossible due to distance. We can achieve this goal only if the underlying system is working. And so this talk will be about distributed systems but also about availability and reliability. And so I have three main ideas that frame the decisions when developing IROGAM. So the first is that we should not trust the cloud and hosting providers not only because they can fail and when they fail your service is not working. And the second aspect is that we think there is some space when it comes to IMAP server designs to study and to try new design, new trade-off. So there is no perfect solution. We don't have a magic solution but we can try new ways and new designs. And the third part I will try to convince you that this new design can work in the real life. So first don't trust your provider. So generally when you have the title of this talk is multi-region. I think the first part is to define what is a region when you talk about a cloud or hosting provider. So it's the Google Cloud Platform Region Paris. So its name is Europe West 9 and it's made of three data centers. And last April the whole region, so the three data centers, was unavailable for three weeks. Not totally but the outage lasted for three weeks in some part and it was due to a fire in one data center. And due to some tight interconnection between the other data centers and many software, the other data center were unable to work not due to hardware failure but due to software problem. So three weeks without emails you can imagine that it could be very hard when you use it for very important stuff like I don't know paying tax and seeking for a new job and so on and so forth. So the idea, it's not new, is that you should move to reliability first design. You should think reliability in your service and not rely only on your server. I'm sorry, the book is named Cloud Native Patterns but we could have named it Distributed Native Patterns and it's the same example with the region in, this time it's Amazon in the US and we see that, so the author of the book, Today's Three Services, Netflix, IMDB and Nest and only Netflix took the effort to deploy in a multi region and there was only one that were still working when the US, this one region was not available. I think it's the secret source of Google when it comes to Gmail or when it comes to Google search. It works despite data center failure, despite multi region failure because they are designing their service as reliability first. So it's easy to say that we should design our services as reliability first but in fact it's hard, like many things and something which makes it hard is that when you are in the same region, latency are very low, like one or two milliseconds but when you consider multi region deployment, I have made a test so between Paris and Warsaw in Poland and we jump to 30 or 40 milliseconds. It's not a lot but when you have distributed protocols, often this latency is amplified and there is such example in yesterday's presentation too. So we know that it's hard but it's even harder in the context of the email systems and the Apache Jambs documentation summarizes it very well. So the hard problem is, yes well done, the hard problem is about the monotony QID generation. If you are at the beginning of the dev room, UID in emails have been explained and so they say you have basically two solutions. Either you choose to doing weak consistency and so you risk data loss or you choose strong consistency and strong consistency is very sensitive to latency so it will be very slow. So currently the answer of the Apache Jambs developer is you should not deploy Apache Jambs or the cat-sandwrap part in a multiple data center setup. You should pay for consulting. Okay. So if we make a wider review of the existing work, maybe I have missed something and let me know, but you have some leader follower design which are for example, Cyrus or Dof Cut and you have some consensus or total order based design like Stalvart, IMAB, Gmail, Apache Jambs, Wilder, and so on. So this consensus or total order is often outsourced to the database. So for example, FoundationDB, Cassandra, lightweight transactions or MongoDB. There was also a research project named Pluto and they tried to design a mailbox server on a serial DT design. So it was very, it works very well in a multi-region setup but they have an incomplete implementation because they do not support the monotonic ID. They only support sequence identifiers. So yes, it's interesting if we don't implement the whole IMAB protocol, we can do multi-region way more easily. So our solution was, we wanted to implement the full IMAB protocol and so it's a trade-off. It's not a magical solution but we decided to live with conflicts. So in fact, in IMAB you can have conflicts as long as you detect it and you change a value that is named the UID validity. So it's not free, it has a don't sign, sorry, it will trigger a full expensive resynchronization for the clients. So for example, we see two processes, so you can imagine that's two IROGAM processes and at the end for UID4, the two processes assign the same UID for different emails and when the other one learns it, there is a conflict. And so in our implementation, assigning a UID is a log and we have an event log that is not totally ordered but only causally ordered and we have a proven algorithm to solve conflict and compute a new UID validity. Also there is a proof in our documentation, if you want to read it or to review it, we are interested and we try to be as clever as possible when we synchronize this event log to reduce the conflict window. And so you might say we are cheating because we are changing the problem. We don't try to have monotonic UID but we try this time to handle correctly conflicts. And yes, it's true but I have two arguments. Often people are tweaking raft and they are doing bad things. And I have two examples when in Kubernetes, an issue that has been opened like six years ago and it's still open because they are violating some invariance due to a caching of raft for performance reasons and another one is the post-mortem of GitHub where they also use raft which is a strong consistent algorithm. And they say that and they show that they have done some optimizations that break some invariance of the protocol. And you can reduce the risk of conflicts as much as you can. So the most important was to have the correct solutions. And so if you want you can put a multiplexer in front of a irogram and redirect all the same user to the same server and so you will reduce even more the risk of having a conflict. So talk is cheap, show me the mail server. So I will be quick on this part but I've tried the deployment in France, in Netherlands, in Poland. And so you have some screen shot and you can check the IP address. There are some IMAG server listening on. And on each region, this is the deployment. This is connected to post-fix through the LMTP protocol. We have implemented in irogram. And irogram is a state-less software and so all the data managed by Garage which is in fact doing the magic behind the scene with is a geo distributed design. Yes. And I have a demo. So I will try to show you. So I'm just using something like NetCAD to connect and show you that there is an irogram server listening behind the domain name. And after that, I have configured this iMAG server on my phone. And you can see that I have a mailbox. And now there is a Gmail. It's the Gmail web UI and I will send an email to this server, to this multi-region server. And so the email is sent and now we will wait until it's received both on the phone and the computer behind. And that's it. So that's the conclusion. So we started with three ideas. And so this is the answer. So irogram is designed from the ground up for reliability. So it was the most important thing to us. And we decided to tolerate UID conflicts instead of trying to be to enforce monotonic UIDs. And so we tried to handle them correctly and minimize them. And finally, we want to prove that irogram already works in real environments. But irogram is still a technological preview. And it's not yet deployed in production. So be very careful when using it. Don't use it for real workloads. No, I think during this year we will deploy it on infrastructure for real users. And that's one of the future work we will do as much user testing as we can because we don't want to lose important information for people. And we also plan to implement KALDAV and CARDAV. And maybe in the end, envision irogram as a group. And something that's so important is performance measurements and improvements. And I can say that many design choices we have done will result in the fact that irogram might use a bit more CPU or memory than your regular email server. And you have to take also into account this fact. So thanks for listening and I cannot start questions if you want. Thank you very much. I see one question over there. Gentlemen in red. So first, thank you very much for this design. I've been working on distributed email for quite a bit. And UID generation is part of the story. What is your approach with keeping the IMAP session synchronized, especially the modification sequence to UID mapping, IMAP ideally, and other things like that with such design? OK. So we are handling the synchronization, the rest of the synchronization in the IMAP protocol. So we have a view and ways that we are maintaining. And so as I've said, we have an event log. And each irogram server sessions are watching the event log that is stored in garage. And so when there is a change, we compute the difference. All right. Further questions? Last call? OK. So. Ah, there's one. Can you say it just again shortly? What is garage exactly in a few words? Can you say a bit about this? So we say that garage is a distributed data store. So there is one API that is S3, which we call often object storage. So it's like a file system, but with way, way, way less features. So it makes possible to have efficient distributed deployments. And garage is inspired by a research paper entitled Dynamo by Amazon. And it's a design of a key value store. And garage has a second API, which is named K2V, which is very similar to also Raya KV. If you know Bashow, it was a company and they don't exist anymore. So garage is really about replicating your data and making them available. And you have this API about object storage, but we have this key value API. And so it's really the foundation of your data layer. And that's a new way, I think. And that's what we wanted to prove with Irogan. But we can design applications a bit different and use not only for a garage for binary blobs, but also for that lightweight database used. So I think I understood from the website that you also encrypt data addressed. But you haven't mentioned that at all. Yes. You're doing it, right? Yes, we are doing it. It's in the code and it's a choice. Maybe we are keeping it for the next year, probably. But sure, yes, in garage, all data is encrypted with a key that is derived from your password. And so when the data that are stored in garage are always encrypted. And the data is in plain text only in the Irogan process memory. But it's not really ready. We have to find as many things, but we have many ideas about that. All right, thank you very much again. And I think we will head over to already mentioned Apache James.
[Servers] Apache James: Modular email server
I am working with Apache James. Basically, first, a few words I'm working at Lina Gora. Our mission is to promote data sovereignty and especially give the tools for organization to communicate together without using big gaps. So we are working on a suit called twig workplace with twig mail for e-mail, twig that is relying on matrix for the chat and also file sharing. So as part of this development effort, we were looking back in the days for an e-mail server that is easy to scale, at the time we were looking for a file sharing. So we are looking for a file sharing. So we are looking for a file sharing. So we are looking for an e-mail server that is easy to scale. At the time we did not hear yet the talk about aerogram. We were looking for a modern e-mail protocol. Hopefully we already heard about Ricardo's stuff called J-Map protocol. And we also needed to be able to do deep integrations inside the mail server. So we started with the protocol. I am sorry, I am a bit frustrated. I did not get to speak about J-Map so we will take one minute to do so. We started implementing J-Map into Apache James back in 2015. Before even the normalization effort started within the IETF. We are big fans of I-Map. We implemented twig mail client in Flutter. So Odriga is using it. The Dart dependency to write J-Map CLI for instance. Basically we are able to take a mobile team that is not an expert at all about e-mail and get them to implement a mail client. The things work fine, works fast. Synchronization is easy. Most of the pains of I-Map are lifted. So twig mail works on multiple platforms, iOS, Android, Web. And it is also used on top of other mail servers like StoreWart Labs. So about the mail server itself, because it is a track about mail servers, Apache James is part of the Apache foundation. So it is a track about mail servers. To my knowledge, it is the only e-mail server that is part of the foundation and has an open governance model. It started back in 2003 from Project Jakarta. So it is kind of a cousin of Tomcat and projects like that. It is surprisingly influential in the Java world. The mail that I will present later is kind of the servlet of mail. So a generic way to write e-mails. Some of the important people within the Apache Software Foundation did actually contribute at some point to Apache James. And Neti Network Library, which is very influential in Java. Norman Mauer is a previous contributor of Apache James. So regarding the overall setup, what I recommend actually to use is the distributed setup for Apache James, where basically we host metadata in Cassandra. Big binaries into S3, distributed search with open search. There was a little licensing problem with Elasticsearch. And last but not least, RabbitMQ for messaging, things like IMAAP, IDOL and stuff like that. Of course, we orchestrate everything and run it on top of Kubernetes and are integrated with metric systems like Grafana. So now let's look inside the code. This is more or less the classical e-mail server architecture. You've got protocols on the left, SMTP, IMAAP, which would call the mailbox where the mails are being stored. And you will submit emails to a mail queue and apply mail processing. So what's important here to notice is that you've got green dots. It's not updated the slides, but now you've got a green dot here that allows you to depend on simple interfaces in Java. Write Java code in a completely separated project, compile it, and embed it into Apache James. Configure it. You have a set of extensions that already exist. You can use James APIs. You can inject your own component. And then basically have your code run inside the mail server without touching the mail server. And then you can run it on the mail server. And then you can run it on the mail server. And then you can run it by switching a single line within that e-mail server. So sorry that might be complicated to see from the back of the room. I did not thought about that when I copy and pasted those rectangles. But basically the mailet container, you take things from within the mail queue. And the overall design is to have mailets, which is an action, applied conditionally by a matcher. So you have two little interfaces that you work with. The matcher represents a condition. And you would organize a pair of mailets, a matcher, inside a processor, which is a stream of execution. You have a specific mailet that allows to switch a processor. And a couple of various basic implementations. All of that is defined in XML and is fully customizable. I will give you a little example. So a hello world mailet that is kind enough to look up for the language and print hello world based on that. So a mailet would get the mail and applies an action to it. You can modify the mail. You can trigger some external APIs and so on and so on. All I need is to depend on the mailets API. From there, I compile my project, I get a job, and I just register it somewhere into my XML configuration, put the job into external jobs and go. So back into there, I can just go back to the mailets. I can just go back to the mailets. So it's actually quite powerful and you can connect the different sets of extensions together. We've been speaking a bit with Daniel about push. We received a contribution lately to have an IMAP extension to for push for iOS application. And basically you are able to plug a mailbox listener that listens to the mailbox events, register an IMAP app, and you can get an extension that creates the registrations and you would be able to get the push working like that. So that's quite powerful, but James is written in Java. You have interfaces for everywhere. Everything has an interface and we rely on inversion of control with a library called JUSON, which means that basically you can assemble your JUSON modules the way you want. And of course you can reuse existing modules, which means that you can make your own tailor-made server with Apache James. As an example, so because we need to follow the Apache way, we need to be in open governance. At Lina Gora, we decided to clearly split the project, which is Apache James. That's where open standards go. That's where the distributed mailbox is. That's where everything related to modularity, extensibility is. And we reuse that as a framework to bundle our own twig mail servers that have a couple more extensions, things like autocomplete for email addresses and stuff like that that are not part of the JMAP standard. So that we reuse to actually build the JMAP standard and build our product. This is a very nice contribution that we did get back in 2020. So this is to give you an idea of how you could use James. The idea is to validate GPG key. So basically, using the Web Key protocol, I would submit my key to that modified Apache James that will send me an email encrypted with the private key that I've just been uploading. I would reply to that email, which would validate the key and serve it there. So it's proof of concept. It had not been merged part of James, but it's to show you that you can really play and do interesting things with deep integrations. Who is doing pop free? It's the guy in the room doing pop free. Pop free is an awesome protocol because you don't have a UID and it's really, really, really simple. So in France, when you go and see a practitioner, you would get a repayment order that would be sent to the National Healthcare Insurance that of course transits by email. And every insurance would get a mailbox receiving millions of dollars. So you would have emails a day. And of course, you need to have a damn thing, geo-replicated on three different locations and so on and so on. So I map the latency, it would go crazy. At least we don't use aerogram. Volumetry is big. And of course, they have a very crappy description of homegrown custom formats that you need to, that slide, don't do justice. It's actually a couple thousands of lines of code to get all of that fitting in Apache James. The point here, when I arrived on the project, they were actually able to write tons of mail-in matchers, listeners and so on and so on themselves and plug it in together. So we were also able to rewrite the storage engine. And we also had a lot of different design to be able to leave some where Cassandra restrictions on dumpstones and listing millions of emails. Another project that we did was actually to also integrate within MSS Santé. So that's the mailing system for French health practitioners. It has some specific security restrictions that are related to it. So we were able to have also some specific integrations for that customer, like upload directly, attachments received into their drive. So basically, we quite a bunch of extensions and modularity going on in there. And surprisingly, even things like banking applications, that's also email. And it's very specific. They have millions of users with very, very, very tiny mailboxes and it needs to be cheap. And they have custom SOAP APIs to access the messages. That's also the kind of other things that you can do with Apache James. So I did not cover much of the technical details. I did do a hands-on session back in the day in 2019 in the Apache conference in Berlin. So if you are interested in getting more information on the code and watching some hopefully live coding that did not go too wrong, you can do it. The talk is online. Thank you very much. Do you have some questions? Thank you very much. Okay. Let's see your first hand. Thank you. So are there any pre-existing modules for spam filtering directly with Apache James? So you need to speak louder because I did not understood the middle of the question. Are there any existing modules for spam filtering so that you can use the same language as you did with Apache James? So basically we are integrated with spam assassin and air spam D, especially with spam assassin because we are an air spam D because we have mailbox listeners. We are able to live train based on the way you move messages, your spam filters. So I think that's a good point. So I think that's a good point. So I think that's a good point. So I think that's a good point. So I think that's a good point. So I think that's a good point. So I think that's a good point. So my answer is yes, there's already some integrations. All right. Further questions? So here's somebody. Yeah, I have a question. You were talking about these examples from the health system and from the banking. And I'm not sure if I understand it correctly. It looked to me like this is very email as sort of an API in a certain way, right? For very specific procedures and processes. And if that's somehow right, you might fix me anyway. Do you also do special processing of these emails? I mean, is there any special mind parsing involved or maybe you can say a few words? So first your understanding is correct. Apache James is very modular and of course it works as a regular email server, but you can use it for all various corner cases that could be hard to handle with other technologies. Regarding mind parsing, I'm also the maintainer of the Apache MIME 4G parsing library that of course you can do some pretty complicated mind parsing within Apache James. Does it play a role in these use cases, in this medical or banking one? Yes. All right, let's see two more hands. Maybe first the other guy and you. Yes, related to the previous question. Are the emails handled by the healthcare encrypted or? So they are encrypted and it is transparent mostly transparent to the work that we are doing with Apache James for them. Okay, so is this like transport encrypted or pay-due encrypted? Depends, but there's a lot of things going on with S-MIME. Oh, okay, thanks. Have you seen any maillets be created in programming languages like Scala, Groovy, Closure, those ones based on Java? So yes, yes, we have a couple of example of Scala mailets. We use Scala at some parts within Apache James. For example, the J-Map stack is completely written in Scala, so yes. All right, we would still have time for a quick question if there is any. One here. Oh, sorry I didn't. Ah, sorry. Yes, okay. Misunderstanding of mine. You mentioned POP3, it's very nice, but I suppose you have IMAP as well. Is it ready for standard IMAP usage or do I have to? Sorry, it's a misunderstanding. POP3 is a horrible protocol, but it's that one given use case of needing highly available protocol that can be multi-data centered. It's so simple that it fits the bills. Okay, and IMAP is a separate? We support IMAP, the big range of IMAP extensions. IMAP is fully supported and we also implement J-Map as a protocol, so very wide range of protocols implemented. Okay, fine. Thank you, and thank you again, also Benoit. I hope I didn't see anything. Thank you. Thank you. Thank you. Thank you. And yeah, we have one more talk into service session, which will be Mikkel about MOX.
[Servers] Mox: a modern full-featured mail server
So good afternoon. My name is Michiel Lekin. I'm a freelance software developer from the Netherlands. Last year here at FOSDEM, I first announced Mox, a modern, secure, all-in-one e-mail server. As you may know, running your own mail server has a bit of a reputation for being hard to do, but what have I told you? Running a modern mail server can be easy. Alright. So, thank you. So the goal of Mox is to make it really easy to run your own mail server so that you actually do it, and then you can stay in control of your data, and you can help keep e-mail decentralized. Now, Mox is a new implementation, entirely new, and written in Go. Now, that's a lot of work, and you might ask, why would you do that? Because we have so many open source components that you can just use, and that's true. And for the past decade, I've used many of those components to good use. But a few years ago, I had to reinstall my machine, so I got a completely new one, and I just felt a bit reluctant to install that same software again that I've been using for the past decade. And for two reasons, at least. One is to see. The language with small mistakes have big consequences. Don't get me wrong, I like C as well, maybe in the past. And the software written in C is very high quality, but I wanted a new machine that lasted for another decade, and I see that C is not really going to be part of that too much at some point. But the bigger problem is basically the complexity. Over time, as e-mail has grown, new protocols added, new extensions, new software components have been added as well. So, to make our fully modern e-mail system, you need many components and make them all work together. I think many self-hostings, at least, they stop halfway, so they have a semi-modern e-mail system set up. You can make it easier to get all this configured and set up with a distribution or a Docker image or something, but you still have all these components working together. There's many integration points, a bit of friction, some data loss. Sometimes there's security issues when, you know, message headers are seen as authoritative but added by some component. So, I think what happened is with all this complexity, some people, you know, just stopped running their own mail servers anymore because it just too much work, and they migrated to the cloud, centralizing e-mail, that's not a great development. So, what we need is an easy to use mail server. So, you need to write a set of features. So, Mox tries to deliver many features. I'm at four for reading your e-mail, S&TP for sending receiving e-mail, SPF, DKIM, DMARC for message authentication, because just S&TP is not enough, but that's also not enough. You need TLS, of course, for encrypting your communications, but S&TP for sending receiving e-mail is unverified TLS. So, you want MTA, STS, and ordain to, you know, check that you're talking to the right person, right machine. So, Mox implements both for incoming and outgoing e-mail. Then there's ECME for, you know, your management of TLS certificates. You want to make it easy, don't do any manual TLS tumbling. Junk filtering is part of Mox. So, based on historic messages and their non-junk classifications, Mox will reject, accept, incoming mail, more about that in a moment. Then internationalization, so you can have Unicode in your e-mail addresses and your headers, both in your domains with IDN and your local parts. Autoconfiguration is in their various flavors, all supported by Mox to make it easy for mail clients to find and write service settings for new accounts. Then we've got a webmail included in Mox. We'll have a quick look at that in a moment as well. An admin web interface, so all configuration is in files. We want the full power. You can use the admin interface to quickly navigate and make some changes like add, remove, an e-mail address, an account, or a domain. A web service included, it may sound a bit crazy over the top, but modern e-mail basically requires HTTP stack with MTA, STS, Autoconfig, Jmap soon. Now it's already part of the deal. What I've noticed is people trying to run Mox and a web server on the same machine. That's really annoying because configuration gets complicated. Instead, I just added some web server to Mox for some static file serving and reverse proxying. That problem is also solved. Permit these metrics, structured logging, operations become a bit easier. Then the Mox Quickstart. That makes all this stuff easy to do. Installing Mox, you take a new machine, you've got a domain, you run the Quickstart and you pass it an e-mail address at your new domain. Mox will generate, so the Quickstart will generate configuration file, decon keys, etc. Create a new account. We'll print all the DNS records that you'll copy, paste into your zone file, or you have to manually edit them in your web interface of the DNS operator. That's not so great. Then Mox also, the Quickstart, also Linux, generates a system-d unit service file, so you just enable that and start it. Then you've got a fully working modern e-mail system. All of this is MIT licensed, so you can do whatever you want basically. Then as developers, a little bit about code. As I said, it's a new codebase. It's a modern codebase coherent. All of this is in the same style. It's very self-contained, so few dependencies. That makes it possible to have it in the same style. It's about 73,000 lines of go and 21,000 lines of tests, mostly unit tests and a bit of integration tests and some fuzzing tests. There's 11,000 lines of type scripts, very strict type scripts for webmail and interfaces. The code is cross-referenced with RFCs to make it very, not easy, but to make it something more maintainable. You can look back and see why you did certain things. Of course, Mox is written in Go, so it brings a whole bunch of advantages like memory safety, standalone binaries, completely subtly linked, also includes a few assets, so it's really just one file that you need. Fast compilation time, great for developers. Dependency management is pretty much solved in Go. You get reproducible builds out of the box, and it also works with cross-compilation, which is trivial to use in Go. Now, there's not much to see about a server, but we have a webmail that I can show you. It's not pretty, but it looks mostly like a standard email client, I think. Mailboxes, message list, message view. Let's open up mailing list. There's some threading in there. You can select multiple. I'm using keyboard shortcuts as well. Mark some unread messages and mark them read. Then there's HTML support with or without external resources and tracking pixels. Then there's a little example of Unicode addresses. The search is easy to use. We've got some quick filters on that side. We could send a message, but I'm sending a message from another mail client that should be arriving. There it is. Select some text to quote as civilized people and send a response. That's a webmail. It's not pretty, but it mostly works for my needs of email, sending and reading. Then I would like to say many things about lots of features, but I just limit to one thing, spam filtering in Mox. Analysis for incoming messages is based on historic messages in an account, based on their junk and non-junk flags. It's always per account. Whatever one account does is not related to what happens for the incoming message for your own account. Of course, this means in order for this to work, you need to have the proper flags on all the messages or as many messages as possible. Email clients don't always help with this, but Mox does help with that because in the default setup, you get an account where incoming messages or messages moved to the junk mailbox gets the message flag. If you move something to a archive mailbox, it automatically gets the non-junk flag. Also, if you move it to the trash mailbox. Also, if you're in the webmail and you have a message open for five seconds, that's long enough for it not to be junk, probably, so it also gets the non-junk flag. That means most of the messages in the store will have these flags set properly. There's a difference in how Mox handles known senders versus first-time senders. For the known senders, they're recognized from sender address or just the domain of the sender address. Maybe it's another person at the same company. Or we look at SPF or Decombe signals in a mail message or we look at the IP address of the remote server or various subnets of the IP address. If there are recent historic messages from that same sender, we look at the junk and non-junk classifications of those messages. If the recent ones were junk, we reject the message and otherwise we accept the message. But if it's a first-time sender, we don't know enough about that sender. We do, of course, something else. So, the BEG analysis is also part of Mox. It's essentially a reputation of words. So, you look at the words in the message, then you look at historic messages and their words and their junk and non-junk classifications. If there are too many spammy words in the message, you reject. If there are enough hammy words in a mail message, you accept. Then you can also configure a DNS blog list in Mox, but it's off by default for a few reasons. One, often these DNS blog lists are centralized services. We don't want to rely so much on them. And you would be sending the remote IPs of those you communicate with to some central party, which is also not great. Then we don't want to break existing email flows. So, this is also one of the reasons why it's only on first-time senders. So, if you've been communicating with someone for a long time and suddenly someone puts their mail server on a blog list, you can keep communicating with them at one break, only if that person really started spamming you all of a sudden, then you mark a few messages as junk and then in the future, the mail filter will just adjust. Now, in Mox, being all in one mail server really helps with this because during the S&TP transaction, all this historic data, these messages and flags and words are available for analysis. Then a special handling for messages from mailing lists and forwards because essentially most of the analysis disabled and DMARC policies are not enforced. Now, what do you do with an incoming junk message once it's classified? Well, one does not simply deliver to the spam mailbox. That's not friendly for users and not for senders or recipients and senders, I think, because the sender thinks that the message has been seen and doesn't get a reply. The recipient may be receiving some message and doesn't get it, so they wait or they constantly check both the inbox and the spam box. I think it erodes trust in email. So, I understand that it's done to not give spammer's feedback about their spam runs, but users should come first. So, instead what Mox does, Mox rejects the message at the S&TP level while it's coming in with a temporary error code and a very generic message. So, the generic message means that the spammer doesn't know for sure why it's being rejected and a temporary error code means or causes the sending server to try again a few times and at some point tell the original sender, you know, this message cannot be delivered and then they know they can find another way to communicate. So, you don't have this problem anymore of lost messages in the spam box. But just like with the spam mailbox, Mox has kind of the same thing but different. It's the rejects mailbox. So, anything that's rejected is still stored in this special mailbox. It's a fixed size mailbox. Old messages automatically removed and so you're waiting for some kind of transactional email. Maybe you did a sign up to a website. It's not coming in. Then you can check the rejects mailbox if the message, because maybe the sending website used the infrastructure with a bad reputation. Then I can just move that message from the rejects mailbox to the inbox, Margaret is non junk and the next time because of the historic based filtering, next time those messages from that sender will be accepted. So, the important point is that you don't have to keep checking the rejects mailbox because the sender knows you didn't get it and that's different from the spam mailbox. So, this seems to be a graph for me but if you have ideas on how to improve on this, let me know. Then a bit about the roadmap. There's still a lot to do in Mox. So, I want to implement a simple HTTP based API for sending some messages and also receiving some feedback. Just so web apps for example can just with a simple call make some sense of emails. If you know of any standardized ways of doing this, let me know. Yeah, okay. But I said simply, really the dumbest thing. But I guess maybe it can be that simple. Then I want to add calendaring. It's not email but users and myself included expect it to come with email. I need some more SNTP and IMAP extensions. A JMAP will be coming at some point. So, so far I focused on IMAP because all my meal clients were using IMAP and I wanted to have a working meal system. But, you know, I, JMAP will be, will be coming. I want to encrypt all data at rest. It's not currently done. I want to be able to have a second Mox as a backup, Mx and a backup instance. In order to do junk filtering on the second instance, I will need all the data as well or the historic messages. So I want to synchronize everything to the other one. And then I, you know, the, the, once all the data is there, you can also use it as a, as a, as a failover machine. So that will, that will be nice. Forwarding to external addresses not yet done because it's a complicated, gets complicated quickly. I think modern email is not really set up for that anymore. So, Mox is a different way of applying rules to incoming messages. Then lots more on the list. Too much for today. So, final slide. So it's been a year since I first put out the Mox code. I've gotten quite a lot of feedback. So thanks everyone who sent in bug reports and made feature requests or sent in patches. Very helpful. Then also thanks to NLMet. They've been funding continued developer the Mox since August last year. So that's been instrumental to keeping working and being able to keep working on this. Also thanks to everyone who wrote all those RFCs about email. They're very, they're excellent and they match practice quite often. So my call to action today. If you're not doing so already start running your own mail server, you know, staying in control of your data and keep email decentralized. And you have many options ready. And now there's just another one called Mox. So give it a try. Send me an email. It's a great way to communicate. Thank you. Thanks. Oh, I saw you first. You only have three minutes. First of all, I think it's a quite incredible project for one person. And I was wondering how many third party libraries do you use and how much the code you write directly to implement all this? Yes. So I think the main external library is called Beebolt, which is a fork of the database layer. So the messages are stored in files. The database layer is a database thing and it's pretty much a key value store. But anyway, that's the main external dependencies. There's something for Prometheus. And then there are a few dependencies that I wrote myself. So those are not really all that external. And otherwise it's mostly the Go standard library and the extended Go standard library. So very few external things. So yeah, it feels a bit like a not invented here syndrome. So I want to rewrite everything. But it has been very instrumental because sometimes I've made sweeping changes and there's no one. I don't have to make pull requests, try to convince people to do something that suits my needs so I can do whatever I want. So it's really sped up development, I think. Fantastic project. I have a quick question regarding database. I don't know if it was answered right away because I heard the database was whether the data is a sort of a database if you like or could be changed or whether and whether whatever it is, could we use Unix normal tools to just go through them? No. No, you cannot use normal Unix tools. What I don't really want is say having a meal day that someone else also makes changes to because I have to do lots of work to make sure that I synchronize or chase as well. So I've chosen a simple approach. So messages are just stored in a file system individually at the moment and there's one database per account that has the index for all the messages in that account and that stores also the message flags, etc. So the database is also essential basically for all the history and all the data. I could talk for a long time about the database library but. Okay, quick one. What is your experience with scaling this up? How many does a MOX thing? I've not tried. It caught me because of the Bayesian per user and we tried that and that didn't work out. So I have no idea where the limitations are. So I would like to try to see where it breaks and I don't know at the moment. So I've only run it small scale and really targeting it small like self hosting a bit and not the tens of thousands of users or something. So I see many hands which is great. We have a little more time since we have the switch of succession. Anyway, when people leave in the meantime maybe be silent so we can use the time for a few more questions. Let's try how many we can get. I didn't see the order so forgive me. Thank you. Do you have any plans for LMTP support? No. But why would you use it? Why would you need it? I'm writing a small... Oh, you need the microphone. You need the microphone. Sorry. I'm writing a small like mandrel clone and now that they shut down and for that I need to be able to put an email message into the server. Yeah, okay. So maybe a better solution would be to put it in the go code and make like a fork or something. From what I've seen LMTP is almost like SMTP but just it has this improvement of getting reply codes per... It's just simpler. It's lightweight. It's just a dumb-down version for mail drops. Did I get that right? You reject mails but still deliver them in the rejects mailbox. Yes. Whoa. Wow. Scary. Yes. About the reject. So I think that's basically like grey filtering if you are... It's basically like grey filtering except that you will continue to reject them or do you do anything special if they come back or if they go... Yeah, so if they come back with the same message, I deduplicate it based on the message ID or the hash of the entire message if there's no message ID. But then we'll still be considered rejected. Yeah, it will still be rejected. Yeah, yeah. And does this interact with the junk no junk flag for Thunderbird and other IMAP clients? Well, so I think there's the flag dollar junk and dollar non junk and as far as I see Thunderbird sets it without the dollar. So it's not useful. But it does also interfere, I guess, because Thunderbird does not a client side and it would work but it's kind of duplicate then. So I disabled it. I disabled the automatic classification on my Thunderbird setup and I just let the server basically do it myself. So I now don't get a lot of junk. The filtering is okay. I still get a few pericas sometimes one a day and I just junk it and then it's okay. Okay. Thank you for your questions. There's still a matrix chat. Mikaela will be around. Thank you Mikaela. Thank you.
[Clients] Introduction to Thunderbird for Android
All right. Welcome everybody for a new round of the modern email death room. And now it gets a little more user friendly user experience because we are in the email client session. And yeah, we start with a very interesting new development. Many of you might know there was K9 email which might turn into, or will turn into Thunderbird for Android. And we are happy to have K9 main developer, I think I can say here, and he will give us some first hand insights. That is yours. Thank you. Yeah. So half of the talk was that. So just kidding. My name is Kati. I will tell you a little bit about Thunderbird for Android. First we'll start with a little bit of history. And it's about K9 email. Like you mentioned, it will eventually be renamed to Thunderbird for Android. Our journey starts in 2008 when the first Android version was released. Jesse Vincent bought an Android device, tried to connect to his self hosted mail server. Back then that was more common than it is now I guess. And it wasn't working. That's because the email app that ship with Android wasn't really great. And he figured it's part of the Android open source project. So he will just fix it and it will work. He did fix it, but then he found out you can't just install an update to a system app. So back to the drawing board, like extracting the code of the email app from the AOSP source tree that contains all the app, the AOS. Build it as a separate app, give it a different name. And then it was working and he figured if he made all this work, other people might like to use that as well. So he uploaded it to the Android market. That was the name of the app store back then. And he released the source on Google code, which was a thing back then. And since it was really early days, most of the Android users were nerds like us, many of them developers. A lot of them realized that the email client that ship with Android was really crap. And so a lot of people ended up finding K9 Mail and hoping the bugs were fixed there. Most of the time, because it was forked from the original email app, the bugs were still there, but at least they could fix it easily. So a lot of us found K9 Mail that way myself as well. I joined in 2010 or late 29, depending on how you count. And because neither K9 Mail nor the email app were working with my providers, I had to fix K9 Mail. Unfortunately, we don't have a lot of time. So I can't talk about all the awesome people that contributed bug fixes and features, even the early days, to make K9 Mail as popular as it is. We'll have to skip forward a little bit to events that are half relevant for Thunderbird for Android. The next one is that Jesse made me the project lead because I kept fixing bugs even once I wasn't affected by it. At the end, I was doing releases and stuff like that. So basically everything maintained as Jesse went off to start a startup. Doing keyboards. Fast forward a couple more years. I was contacted by Ryan Sypes from Thunderbird. And he was like, we have lots of users, but they also use mobile devices and they want clients for that. So Android and iOS. And he was talking to lots of people, trying to find out how they can do that. How can there be a Thunderbird for Android, Thunderbird for iOS? One of their ideas was to use one of those cross-platform frameworks where you can write JavaScript because Thunderbird and desktop also have a bit of JavaScript, so you only have to write the code once. I was like, I have no experience with that, but it sounds like a horrible idea. I offered to ask my friends that I'm mobile development that have used those frameworks before and everyone was like, yeah, that's nice if you have super simple apps that maybe do some rest calls and display some data from the web server. But everything that goes beyond that, you probably don't want to use that, especially if you're trying to write an email client. So I told them that and Ryan went off and talked to other people, tried to find out how to do that. What I took away from the conversation was like Thunderbird was asking for donations and was funding their development that way. I was like, I can probably do that as well, right? So I wrote a blog post. What's up with K9 Mail? At that time, it was a difficult period for K9 Mail. The last stable release was September 2018. So that's one and a half year before the blog post. And the next stable release was maybe one and a half years off because we were doing a big UI rewrite. The Android platform changed underneath that. We had to do a lot of catching up, so we were able to run on more than Android versions. I wrote a blog post outlining all of this and asking people for donations. And that kind of worked, but not really. I mean, maybe you have tried it for your own project. If you just write it in a blog post that nobody reads, you don't get a lot of money. At the end of that year, I ended up with not even 6,000 euros, which is nice for a hobby project. You can probably buy a new laptop, but you can't pay rent with that. Nevertheless, I tried again next year for iLOVE Free Software Day or Valentine's Day for regular people, I guess. Wrote a blog post, K9 Mail is looking for funding. So don't make it about the stuff that we can't do, like do new releases, leave with asking for money, and then basically I still outlined all this stuff. Like we can't do releases because we still have to do a couple of things. That one spread really widely. And in February alone, over 18,000 euros in donations came in, which was very nice. I figured if that continues, I am getting rich. Spoiler alert, it didn't. So donations got down, but at the end of the year, it was still 51,000 euros, which is not quite a salary for a seasoned senior developer, but it's enough to live on. And I was talking to the Thunderbird people on and off during that period. At some point, they were like, maybe we can just fork K9 Mail. I'm like, you could, but I mean, it's not in a great state. If you really want to, I can help you a little bit, but I'm not sure if you really want to do that. And then at the end of that year, I was contacted, oh, damn it. I was contacted by Ryan again, and he was like, we have to do this now. We need a Thunderbird for Android client. How about we just use K9 Mail, rename it to Thunderbird on Android, and be done with it. And I'm like, okay, this asking for donation thing is nice, but really donations went down, and you have to constantly remind users to give you money. And if you're a maintainer for an open source project, there's just one task on top of the huge list of tasks you do anywhere. And I'm like, okay, if someone else could do that, and I could just work on that, that would be nice. So I asked Jesse if he'd be fine, if his old project was becoming Thunderbird for Android, and he's like, yeah, sure, go ahead. And that was basically the start. Still, it took a couple more months until we actually announced that K9 Mail will be joining the Thunderbird family. And I was basically hired with a company that does a pace developer to do Thunderbird development. So I guess first full-time employee working on K9 Mail. And the idea was K9 Mail will be renamed to Thunderbird for Android eventually because we wanted to work on some features that Thunderbird for Android should really have by the time it is released with the Thunderbird stamp of approval. So the next thing we did is we hired a second Android developer because two people get more work done than one. And we started informing the community about our progress. And if you've read those blog posts, you will know that we haven't released Thunderbird for Android yet, even though the plan was to do it by mid-2023. That kind of didn't work out. And then we figured, okay, maybe do it by the end of the year, cut some features, and then we decided, no, we don't really want to cut features, we want those in there. So, well, there's no Thunderbird for Android yet. You will ask yourself, well, when is it going to be released? And if you were hoping, I will say now. I'm sorry to disappoint you. The answer is very open source, so when it's done, like I mentioned, we want to get the features in there. And you will ask yourself, well, what are those features? Roadmap, asterisk, so that's the current plan. I mean, we've changed it before, might change it in the future. That's the plan for now. The new account setup, we've been working on what feels like forever, but it's almost done now. The latest beta will probably be the last one, or penultimate one before a new stable release. If you're on the beta channel on Google Play, you can already use a new account setup. But you can't design three is not something I would have chosen, but our users really like new shiny stuff. So, we put it on the list, improved folder management, conversation view is what modern email clients need nowadays. Thunderbird sync, that's also part of the asterisk thing. It's something we really, really want, but there's a lot of technical problems or open questions on the infrastructure side, also on the client implementation side. But the basic idea is sync settings between instances of Thunderbird, be it the mobile version or the desktop version. Then what I put on there, we have existing functionality that needs a bit of tweaking to make it more user-friendly. And of course, Android keeps changing stuff, so that's also something on the list, which is I guess this year Android 15 will be released. There's been no new APIs announced yet. I think that starts in March, but who knows when we'll actually get to the release. All right, what about that? A lot of users have mentioned that they really like the brand and the icon and stuff like that. And so we decided, well, we'll keep it around. We wanted to change the application ID, the identifier Google Play uses, anyway to something more Thunderbird-like. So Thunderbird will be a separate app, and we will just keep, can I mail it around? Of course, we don't want to maintain two code bases that then diverge, so we will build from one code base, two apps. Hopefully, we haven't started on that work yet. And then the difference really is meant to just be visual stuff, so I can have name and the theming. Yeah, and since we are now in the client section, we can also have screenshots. That's something that several people can't really do, right? So, yeah, it's not too large of a screen. It's also really boring. It looks like every other email client, basically, you have a message list that contains a list of messages. If you tap one, you get a message view, which displays the message contents. What could use a little bit more love in K9 Mail is the compose screen, but still it works for simple messages. Hopefully, in the future, also for more complex stuff. Then in the first screen, if you tap the hamburger icon, not to be confused with a Kibbup menu, you get an account switcher at the top, and then a list of folders. Improving this to make it look nicer is also in the list of folder management stuff. Right, and since we're in open source project, if you really want to, you can contribute. The slides are on the FOSTA website, so you don't have to type down the links. We are hosted on GitHub, so we are not doing the Thunderbird thing, like using McEul, using Bugzilla, and stuff like that. Translations are on WebLate. We have a ton of them, but could use really help for some of the more obscure ones. One of the blog posts goes into details, which one need help. We have a support forum where mostly users help each other, which is really nice, so we don't have to do a lot of stuff. But also, we develop and monitor that to get an idea what users have problems with and fix it. We also have user documentation, which is not very often found these days, I find. It's also very outdated, so it turns out maintaining user documentation is work. If you want to help out the project and you're not good with code, maybe you're good with words and screenshots, so you could help out with documentation. All right, the one-minute sign goes up, and I'm also done. Thunderbird has a stand here in Building K, level one. If you want to talk to some of the desktop people, they are probably also here. Maybe Hans-Up who's working on Thunderbird for desktop. Well, Kai is here. He's on the floor, so you probably don't see him, but there are also some people at the stand. Right, and with that, I'm happy to answer questions if you have any. Thank you very much. Sorry, don't ask for the release date, probably. He won't have any questions. You need to help me, probably. There's so many people. I don't see. Ah! So, how is the funding going, basically? Where is the money coming from? Oh, I see. There was a talk about that. You can probably find the video recording on the FosterM website. Ryan talked about how Thunderbird is making money. The summary is we are asking users for donations, and that worked out really well. Last year we made over 8 million. I don't have the exact number, so a lot of money. By the end of the year, we have all of Thunderbird, so the desktop app, mobile, and some other projects we're working on. The plan is to have 45 people working by the end of the year. Any further questions? While people are thinking, you were mentioning the iOS topic in between, but somehow that got lost in the rest of the presentations. Can you say anything about that? The first idea was maybe if we have an existing client, so the same story with Android, we could do that, but didn't find any. So the ideas will start from scratch, and we're currently looking for someone to start that off. So it will come eventually. Which year? Probably not this one. You know, when it's done. Any further questions? Thank you. I have just a quick question regarding the forum and the issue tracker. Will you keep them around? Thank you. The issue tracker, yes. I mean, people report bugs, yes. And we don't want to switch to bugzilla. No offense to people that have to work with that, or even like it, but no. The support forum maybe depends on, you know, how it works out. If there are a lot more users and not more volunteers helping out like answering questions and moderating this, probably goes away, but for now there are no plans to abolish it. All right. Last chance. I see no more hands. So thanks again for that nice presentation. Thank you.
[Clients] Taking care of Roundcube Webmail - current status and future prospects
Welcome, my name is Anna and I've been with Next Cloud since 2020. Anna Focuss primarily in the backend development and I am responsible for the round-cube maintenance at the moment. I'm also at the security team with Next Cloud, so a bit about that later. So first things first, this is the question we have got most in all of the help forum, blog posts and everywhere. No we won't merge round-cube and Next Cloud Mail. Both products will stay independent as they have been and they will receive independent development and independent loving care. So don't worry about that one. Yeah, let's get into the development aspect of things. So we have hired a specific engineer for round-cube, so that's a person that will be responsible for the maintenance, for issues on GitHub, for contributions on GitHub. The thing is it's not gotten that much love the project itself. There's like 50 open PRs and like 300 issues at the moment which haven't been, I'm not saying not triage, but it's hard for the community contributors to look into everything, of course. I mean it's not their main job and we appreciate what they do, so that's what we want to take care of. What we also want to do is we want to do regular security and bug fix releases. This is really, really the main focus at the moment to get us up to date on security stuff, to get us updates, up to date on bug fix releases. There is one person who has been doing a lot of development for round-cube, that is Alexander. He has been doing most of the feature development for round-cube at the moment, but he is not working for round-cube, he's working for somebody else. So we want to help him get new features development, feature development done and do like feature releases and tandem with him. We really want to make sure that we're not edging out any contributors. We really, really, really appreciate what they're doing for the project. So please don't worry if you're a contributor or somebody who wants to contribute, we really, really would love for you to put more energy into this project. I know a lot of you love round-cube and have been using round-cube, so let us know what you think, let us know what features you want to see on GitHub and we promise we will take care of them and look into them and actually give you a response on GitHub as well and not just leave it there out in the open. Yeah, as I said, community-driven development is always appreciated. As with every open-source project, I'm sure you all have the same kind of thing there. Yeah, more care, more feature, more love for round-cube because it is an amazing product, it is really cool and I mean, it's been around forever, so let's keep it going. Another thing that's changed is how we handle security issues. Since I'm part of the security team with NextCloud, we already have an existing process for this, so we're using Hacker One. We haven't discussed yet if we're going to pay Poundty for this, but it is a possibility that we will actually pay you to pen-test round-cube and the advisories will be published on GitHub in the future because right now there is no established mechanism for this, so you don't really find security issues all in one place with CVs and everything. Yeah, that's pretty much everything from me for round-cube. I still have two minutes to go, that is a very short presentation, so yeah, let me tell you a little bit about how it feels to take over this project. Actually, it's really scary because I know a lot of people love the project and don't want to see it in a draw somewhere, not maintained and everything. As a developer, it's also a challenge to get into a new code base obviously because we have different coding standards at NextCloud than round-cube has. There's different expectations from how the community works with us or we work with the community. Of course, there's implementing the email standard, which is not easy, as everyone knows. There's IMAP, which is an old protocol and it has its challenges, but it also has its cool stuff. It's a challenge, it's an exciting time, it's a scary time and I'm really looking forward to working more with the project. Yeah, let me know your questions. That's basically it. That's me done. I see you have something. I cannot decide who was first, so I start here because you're just closing, so. I noticed as the developer of Snappy Mail that more people are integrating Snappy Mail in NextCloud because of the slowness of the NextCloud Mail app. They also want round-cube. Yes, there is a round-cube app. Will there be better integration with NextCloud? We haven't discussed it yet. I really can't tell you that it is for project management to decide. For me, I personally have worked on the mail app and I am partial to the mail app because it has seen a lot of blood and tears as well from the developers. But yeah, there is a round-cube app for NextCloud and with the code base in round-cube improves, then the probability that the round-cube app for NextCloud is going to be better is very likely. Does that answer your question? Okay. Okay, so this question we solved here. Is there any more questions? Yes? Sorry, I think that. Have you already gathered some experience with Hacker One and how is it like? Yes, I've worked with Hacker One for two years now. We handle all internal or like NextCloud security issues via Hacker One. It has produced some good results. Obviously, it's not always easy because it duplicates and stuff like that. But for how the reports are structured, how you can evaluate a security issue, it is actually pretty decent. Yeah. And it offers an integration with GitHub, so it's not that much work to copy it over to GitHub and then publish it. Yeah. Is there any interest by commercial ISPs in supporting round-cube? To use it as a webmail app for their own purposes? As far as I know, Alexander actually works for an ISP. So I think they might be paying him for that, but you would have to ask him yourself if that is true or not. I know that a lot of ISPs have forked round-cube and have their own kind of version of round-cube that they maintain. There is, Hans, if you mentioned this project from the French government that has their own kind of round-cube implementation. So I'm sure there is interest because it is a powerful tool and it works really well. People like it. It's easy to install. I've tried it myself. It was very nice, very easy to do when Docker wasn't doing its thing. Yeah. Yeah. So I hope there will be interest and I hope there will be interest in the community as well to get the product back and get it a bit more popular again. Yeah. That is my goal for this list. Yeah. Any more questions? As a wise, one thing that came to my mind when thinking about, we have seen 4K9, there is a list of actual features somehow blocking the big renaming. So I wonder, is there anything you discovered on the roadmap or on the bucket list in round-cube, which particularly you would like to address in next time? Not yet, no. We haven't done any sort of project management evaluation yet because we didn't have the developer for it. Now that we have hired a person for this, I'm hoping we can actually get some project management up on GitHub as well. So we're using the boards with NextCloud so that would be easy. We can actually sort issues into swim lanes and then work through them. Since it's only one person, progress will not be as fast, but on the other hand, I mean, it's not like nine people can carry a baby in one month. There's the owner from project management. So things will hopefully be getting done quicker than now. I also really hope to get through the backlog of PRs. We have 49 PRs open. The first one is from 2015. So there's some bug fixes in there, but I've also seen some features. I have seen some left to right and right to left text support so you can change for right to left languages. That would actually be a really nice feature because a lot of people have right to left text. If that could be merged, that would be great, but there is a problem with the CSS classes for different team implementations. So that would need to be thoroughly checked and that is probably something that the developer should and can do. It's all different themes, see how well it works. And yeah, also sync with Alex on this. She's had some input. So yeah, maybe you'll see some right to left text support soon. All right. No more questions. So yeah. Thank you. Thank you.
[Clients] aerc, an email client for the discerning hacker
Hello everyone. So today I'm here to talk about ARC. We'll see that, but more of that in a minute. So about myself, so my name is Robin. I'm an open source and Frank Zappa enthusiast. Maybe these things are related, I'm not sure. Actually in a previous career I was a sound engineer in Brussels and now I went to software. It took some time. And I took over the maintenance of ARC since 2021. Actually this project was created by Drew DeVolves. I'm assuming some of you know who that guy is. And there was a question about how do you pronounce this thing? So who cares? It's not, I mean. So Drew pronounces it with his American accent, orc. In French we say, and I've asked an Hungarian guy, he said, we say, I guess you can choose. As long as it's more or less the same. So what is it? It's actually an email client that's running in your terminal. Like very similar to mutt or pine, alpine. There's even a melly or melly, I don't know how to pronounce that. It covers a large span of protocols. I don't think there's a standard protocol that we don't support yet. One of these features which is kind of unique is that you can have multiple tabs. So when you're using mutt starting to compose, you actually, the editor takes over the terminal and you are stuck. You cannot browse your other emails or watch the original email while replying. You can do that with ARC. Also you can have VIM style commands and key binings. You can compose in VIM and view your emails in less. There's an account configuration wizard. I'm speeding up because I'm seeing the timer. One unique feature is the filters. There's an example just after. You can control it with install scripts. Actually you have a virtual terminal embedded into ARC. So you can start shells, edit or whatever is running in the terminal. You can run it into ARC in a frame or in a new tab. It has been designed with email contribution in mind. So you can actually apply patches from emails directly without using GitHub or anything else. So it's like the Linux kernel. They use that way. It's a short demo. I don't know if anyone in the back can see well. So this is a short presentation where you can, so this is just plain text. You can also have, you can visualize patches and with via filters you can colorize things. So it looks like a patch. You can also view HTML in plain text via filters as well. And you can see also images with filters. Soon we'll actually add also a way to visualize real images using the SIG cells and KT extensions. So if you have a modern terminal you'd be able to really visualize images without the hyper pixelated way. And yeah, anyway, so here I'm just running a terminal and you can just run anything you want. There's no problem. Okay, so what are filters? Filters are just, you define per mime type, so text plane and then you define a command. And you can use it to do a bunch of various things. So you pipe a message by part into a command, then into a pager which gets rendered. And these are small examples. So anyway, you can just, we have a lot of examples in the man pages. And 017 just got released this week. So if you want to check it out, please do. And please give us some feedback on the mailing list. And that's what I have. So thank you. Thank you very much. Short question, which probably language is used? It's written in Golan. Golan. Okay. All right. Any questions? Do you use server-side search in, for example, Jmap or IMAP, or do you rely on your entire mail history being synced to the local disk? No, it's all server-side except when you're using a not-much or mail-deer backend. So server-side. Yeah. Right. Questions? Yes, one, two. Yeah, this inspires my next question. What about any kind of optimizations for the indexes and searches for local files? And so we have, if you're having your mail local, you can use not-much to index your mails and search. And if you're using IMAP or Jmap, usually the servers have the indexes. So you can, the searches are pretty fast most of the time. And we currently, if you're just using plain mail-deer, you don't have any index speedup. Maybe we were thinking about running a not-much, you know, hidden into ERC to speed up the searches, but I don't know. We were still wondering what to do with that. Yeah. Thanks for your talk. So the last time I tried to use ERC, my IMAP and STP, I do SSH forwarding. So they are ports and local hosts. And then the certificate work, and I had to apply patches. I can't use Go. I don't can't rebase Go patches because I don't know anything about Go. And is that, yeah, is that something that you want to support? Or is... Are you using Proton mail or anything? No, it's at my works mail. Okay. And so is it a self-signed certificate or... An internal domain that I can't access from outside. I need to SSH and then it's a certificate, but for local hosts at my side. Okay. And maybe you can install the certificates in your CA certificate store local or something like that. Maybe we could do something in ARC mit. It feels like this is TLS stuff, which is out of our hands. Yeah, I don't think so. Okay. Yeah. If there's a variable you can use a set before you start the application to ignore... Go looks at a different place for your certificates. So if you only talk to this one host... Oh, maybe, yeah. Yeah, we could do that or we could say ignore invalid certificates, but I don't know. I'm more comfortable with installing the certificate in the global bundle so that, yeah, then you don't have to hack into anything and it will work. All right. Have a look at NK3rd, the greatest open source. Yeah. There's a question from the internet from Moritz. Does the account with do autoconfig? All right. Yeah, so there's a very basic autoconfig based on DNS server entries. But not all providers actually do provide server entries and Moritz is currently submitting a patch series to improve that autoconfig stuff. Yeah. Very subtle. Nice. Just one last thing. There was one guy who actually reported an issue during FOSDEM because the FOSDEM Wi-Fi only gives you an IPv6 address and we had a bug in IMAP where we actually resolved only the IPv4. Anyway, it's fixed, pushed, and yeah. FOSDEM works. Yeah. Thank you again. Thank you.
[Security] Modernizing email encryption: the crypto refresh of OpenPGP
Welcome everyone. I hope you can hear me well. I am Daniel Huygens. I am the cryptography team lead at Proton and I will be talking about modernizing email encryption today, primarily about the new version of the OpenBGP standard, but also about the OpenBGP standard more generally and sorry, the OpenBGP ecosystem more generally and how it has evolved in recent times. And I will in fact start with the latter because OpenBGP has a rather bad reputation at least as far as it comes to the user experience. And you might remember having to generate a key manually using KnoopaG in the comment line and that used to be basically the only way to use PGP. But it's no longer the case. So my employer would like to tell you that obviously Proton made everything better and you know it's so much easier now, but it's not only Proton also Thunderbird has OpenBGP support built in now. You still have to click a few buttons, but it's much easier than before. And also other applications like FlowCrypt make it much easier to use encrypted email. And for a technical audience like this one, you might think well it's not a big deal, I can use the comment line or I even prefer it and that's perfectly reasonable and fine. But for a wider audience, this is really important. And also for us it's important because you want to write encrypted emails not only to yourself, but also to your friends. So the more people that use encrypted email, the more reason there is in fact to use it. And so the user experience is also very important and it has improved over the past decade or so. Then regarding the libraries, there used to be basically only one open source, open PGP implementation which is GNUMPG. And of course GNUMPG is still around, but there are many more implementations nowadays. So for example, OpenBGP and OpenBGP.js are two implementations that Proton maintains and uses. And then there's PGP English which is specifically designed to be easy to use and it's a Java implementation, a wrapper on Balty Castle. Then there's PGPI, a Python implementation, RMP, a C++ implementation that's used in Thunderbird. And there is Zekoya which is a new implementation in Rust which Neil presented on in the main track earlier today. If you happen to have come across that, it aims to provide a modern implementation of OpenBGP also for the comment line, basically providing a drop in replacement for GNUMPG. And then key distribution. You might remember that back in the day, you might have to export your key manually and then attach it to your emails or upload it to a key server manually. All of that is also not really necessary anymore today. So there is the key server protocol. Well, it has been around for a while but it's more widely used now and can be used by applications to automatically upload keys and also automatically fetch public keys of your friends. If you want to send an encrypted email to someone, your email application can automatically fetch keys to do so. And WebKey directory can do something similar but instead of fetching from a key server, you can fetch the public key from the domain of the email address that you're sending to. So if your provider supports that or if you're self-hosting your email, it can serve your public keys that way. And then finally, there's AutoCrypt which is a way to send public keys in email headers so that also again, the user doesn't have to worry about it and everything should work automatically hence the name. So slightly diving into HP key server, I won't dive into it very much because the presentation is not primarily about that. But so as you can see, you can simply make an HP request to fetch a public key for a given email address and you get an open HP key back if there is one and then you can use that. And WKD is very similar except instead of making a request to a key server, you make a request to the domain of the email address as I said before. All right, then talking a bit about key verification, you might remember that FOSSTEM used to host very cool key signing parties. And as you can see, these people are having a lot of fun and although the party hats have been photoshopped on, but nevertheless, it can be fun but admittedly, most people don't want to spend their time doing this in 2024. At least the average user probably doesn't. So there is an alternative to that as well, which is called key transparency. And I've presented about it at a previous FOSSTEM but just summarizing briefly, the idea behind key transparency is that we publish all the public keys of our users or of the provider's users or the key servers, public keys or whatever to an append-only transparency log somewhat similar to a certificate transparency. For example, if you heard of it, it's a concept where TLS certificates are published to a central transparency log that's an append-only log and everyone can verify that the TLS certificates that you are getting are in that log and therefore are probably not malicious. And the basic idea here is the same. We publish the openBGP certificates to a transparency log and then in this case, all the owners of the public keys need to check their own keys because they're the ones that then generated them and know whether it's the correct public key or not. So they go and check in the key transparency log whether the public key for their email address is correct. And then everybody else who fetches the public key can be certain that it's the correct public key given that everyone sees the same keys. So it's roughly similar to a blockchain in a way. I know it's a dirty word, but the concept is vaguely similar. Everyone has the same view over the data and if everyone checks the keys, everyone can be sure that all the keys are correct. And we have shipped this at Proton. There is also a working group at the ITF to standardize key transparency, not specifically for OpenBGP, but the general concept. And we would also like to standardize the use of key transparency for OpenBGP so that not only Proton users can benefit from this, but all OpenBGP email can be protected in this way because the advantage of this again is that all of that is fully automatic. So the OpenBGP implementation or the email client, let's say, can verify the keys without the user having to do anything. So it makes the use of end-to-end encrypted email much simpler. All right, so then getting to the actual OpenBGP standard. First, a short history. The current OpenBGP standard stems from 2007, believe it or not, so it's new for an update. Although it's not the case that nothing has happened since then, so there was an RFC for adding the Chamelea Cipher and also for adding Elliptic Curve cryptography using the NIST curves in 2009 and 2012, respectively. But that's the last RFC related to OpenBGP that was published. After that, there have been a number of drafts, one for adding EDDSA that has been widely implemented and also adding Curve25519 has been fairly widely implemented, but nevertheless has never been standardized. And there are a number of other things that were kind of overdue. So the RFC 480 BIS draft was in fact intended to be become an RFC as well, but that never happened. And then the CryptoRefresh is the most recent draft and that will become an RFC very soon. It's currently in the editor's queue, so it should become an RFC in the coming weeks, hopefully. So what is actually in the CryptoRefresh? Well, a lot of things, since it's been so long since we had an update, there were a lot of things that needed to be updated or that we wanted to update. So first of all, it merges the previous RFCs for Camelia and ECC. Then it standardizes finally Curve25519 as well as Curve4Fraight and the Brain Pool Curves, which are commonly used by the German government and they like them. But they are not mandatory to implement. Curve25519 is mandatory to implement Curve4Fraight, recommended to be implemented. Brain Pool is fully optional, so they can use it. Everybody else doesn't have to. Then it also adds modern AAD encryption, authenticated encryption, which was also fairly overdue, so it adds the use of OCB, EX, and GCM. GCM was slightly controversial, so I'll dive slightly into that. So first, OCB is the mandatory to implement algorithm. EX and GCM are fully optional. The reason that they're there is because even though OCB, in theory, should be fastest. In practice, in our testing, particularly GCM is usually fastest because it often has assembly implementations, for example. That's part of the reason why it's there. Another reason is because GCM is standardized by NIST, and so it's FIPS-approved. So for the people that care about that, they can use that. But again, it's fully optional. So for those that want to use the theoretically fastest, or once it becomes actually fastest, everyone can use OCB. Then also a memory hard password hashing function was added, Argon2. The previous password hashing function in OpenBGP was very weak, so this was also very, very necessary. This means that if you, for example, encrypt your keys with a password, they're protected much more strongly. Of course, it's still important to choose a strong password, but it has become much more expensive to do brute force attack on that password. Then it deprecates a number of legacy algorithms, such as RSA with weak keys, DSA, Algomal, and so on. All of those things that we really shouldn't be using in 2024 have been deprecated. Then it prevents a number of key overriding attacks that we discovered in collaboration with ETH Turic a few years ago, that we worked around in our implementations, but the workaround was fairly expensive. So now they've been fixed in the spec, which is a much cheaper way to do it. So for the implementations that worked around those issues, they should provide essentially a free performance improvement. And by the way, the AAD algorithms also essentially improve the performance as well. So it's not just, the main focus was about improving the security, but improving the performance is basically an added benefit. Then finally, it's not quite finally, it also protects against future vulnerabilities in hash algorithms by sorting signatures such that if for example, SHA2 ever becomes broken in a way that SHA1 was, even though we don't expect that today, but it provides some protection against that. And then finally, it adds a padding packet, which means that if you want to hide the size of your files or the size of your message when you're sending an email or of your attachments, you can do that by adding a padding packet to hide the size of the unencrypted data. All right, then for what's next, you might think, well, what about this? So the obvious one is post quantum cryptography, which we have also been working on, but is not yet in the crypto refresh yet. But there is a separate draft for the use of post quantum cryptography in OpenPGP, so that will come relatively soon as well. And then finally, we are, again, quite finally, we also would like to start working on forward secrecy. That's not quite as far along yet, but perfect forward secrecy is obviously something that Signal, for example, has been championing and is a good security property to have in an encryption standard, although it's slightly more difficult to achieve in an email context since you're storing emails basically forever usually, but still there are some improvements that we can and would like to make. And then, as I mentioned before, we would also like to standardize key transparency for OpenPGP. So then as to the implementation status, here you can see a graph. Some of the implementations are already very far along. Some of them not quite yet. Also, notably for its absence is Gnupig. Unfortunately, it seems like Gnupig does not want to implement the crypto refresh and rather would like to stick with the previous drafts, RFC 480B, which they have rebranded Libra PGP, which is a rather shrewd marketing move, I would say, since Libra Office is better than Open Office, so clearly Libra PGP should be better than OpenPGP, right? But actually it's not the case, it's more or less a rebranded version of the old standard. And there is a lot of controversy about it at the moment, so I felt like I couldn't really get around that. So I've here included a short comparison. In the interest of time, I won't go through all of the points, but I would say the technical differences are very minor and in my personal opinion, should not have led to such a big schism in the community. And in fact, I still am somewhat hopeful that we can find some sort of resolution, particularly if you consider that RFC 480B originally was intended to become an RFC. If it had become an RFC and people had implemented that, then the crypto refresh would have been an update to that and we would have basically had to implement both anyway. And my proposed resolution is essentially that, I would argue that everyone should implement both. If you're going to implement the crypto refresh, implementing Libra PGP as well as not that much added effort, the other way around, there is a bit more effort. I haven't heard any objections from any of the other implementations to implementing crypto refresh, so I still hope that GNU-BG eventually will do so as well, although it seems unlikely at the moment. But let's see. So in conclusion, we're trying to drag OpenBGP into the 21st century. Hopefully, we've succeeded. Thank you. And my other point that I would like you to take away is it's becoming more and more possible to build modern email encryption applications using OpenBGP. It doesn't have to have the UX of 10, 20 years ago. And finally, I hope that everyone will implement and use the crypto refresh. Thanks a lot. Thank you very much. I see one straight hand immediately, so this needs to be rewarded instantly. Hi. First question that comes to my mind, especially when you compared GNU-BG to other implementations, is what about hardware support? Because in my mind, and this is why I haven't used either of these implementations, especially those JavaScript based, is that I'd like to keep these keys in my hand on a device. So what about it? Yes. So there is an open pull request for OpenBGP.js to add support for hardware based keys, although in full disclosure, it has been a bit idle in the last while. But I still hope that someone will do the work to support it also in other implementations. I'm not fully up to date on what's the status for the support in all the libraries, but certainly it would be good to add support elsewhere as well. Yes. Another question, actually. I wanted to congratulate you hard to felt for not having the suckful user interfaces that PGP used to have. This sounds hopeful. Thanks. Thank you. I'm very excited about your approach to key transparency, well, or protons, not yours personally. I think it's very good. Do you have any thoughts on the relocation transparency to make that more? Yes. So in our implementation of key transparency, we do include, for example, when the user marks a key as compromised or obsolete, it is included in key transparency. So this means that once that's included, other people shouldn't use the key anymore, right? And I would imagine that in the standardized version, you would similarly include revocations in key transparency such that when you revoke a key, you can be sure that others won't use it anymore. The way I don't, you just get a new record for the mapping, which is not the relocation. Yes. So we always support updates to the key. So the key transparency always provides a snapshot in time. So we repeatedly publish all the keys, conceptually speaking. And also when you generate a new key, the keys are updated, but also when you revoke a key. So essentially it will be the same thing. When you were going through your list of new changes in OpenGBT, you were talking a lot about these optional features. But does it make sense to have optional features when both ends kind of need to implement them in order to be able to communicate with the... Sorry, I didn't fully hear what kind of features were you saying? The optional features like the... Optional features, I see. So there is a lot of new mandatory features as well. Curve 25519 is mandatory to implement. OSEB is mandatory to implement. But to be perfectly honest, a standard doesn't have that much power over implementations just by existing. Every implementation in the end can choose what they implement, even if we write that it's mandatory in the spec. We hope that everyone will implement the mandatory parts and the optional parts as well, usually. But we can't force anyone, right? All right. Thank you again, Daniel, for that interesting talk.
[Security] Thunderbird Email Security, plans and challenges.
So, welcome. My name is Kai Engert. I have been working with Mozilla and contributing to the Mozilla code since 2001, also including email. And I've been a full-time employee of Thunderbird since 2019. And today I want to talk about Thunderbird email security and some of the plans and challenges on that topic. We all know, yes, there are some creatures who could read our email. So, they sit on the service and some robots scanning some mass surveillance monsters and cybercriminals. Okay, we don't like that. So, the problem is that there is no protection while emails are stored on service. We do have some TLS transport security in the infrastructure, but it's not enforced. And it's... So I think we need more than TLS transport security. We heard about that earlier. Of course, we want and need end-to-end security for both encryption and digital signatures. So Thunderbird supports two separate technologies. There's been S-MIME. I've worked on that in 2001 before Thunderbird were born. And, yeah, we have... It's still supported. And we also have OpenPGP, which was previously supported using the Enigmail add-on, which is now fully integrated using an integrated code since 2020. I want to briefly mention some of the things we did in the recent past. We implemented unified status feedback so you get similar UI for both S-MIME and PGP emails when reading an email. When you compose an email, we also have some similar controls to enable or disable encryption. We have made it a bit easier to resolve the problem when you want to send an email, but you're missing the recipient's PGP key. So we have some interactive UI code to help you find the missing keys. We also added some reminders when you try to... When you start composing an email and Thunderbird detects that it can encrypt, then it will remind you if you want to enable it. And just most recently, in the new version from last summer, 115, we have added a long-ass feature which is you can tell Thunderbird optionally to please enable it automatically. If you see, we can encrypt just enable it. Some people have also asked to automatically disable it, but I think it's a necessity to pay attention so we have the option to have some warnings here shown to the user. So other things we did, activists asked for and people who are sharing their computers with others, they have asked that we do support some individual part phrase for the secret keys. We did that. There's some parts missing. We need to make it more convenient by adding a cache. We also implemented the auto-crypt compatible key distribution mechanism which simplifies group conversations by including all the keys of all participants of an email conversation that's called Gossip. We have that recently added. I think we will have it in the stable version soon also. And we added support of publishing keys to keysopentpgp.org. Let's look at the sum. I want to also mention a few general challenges that we've just recently seen. Since some providers now add fmime on the server-side infrastructure, we are now seeing messages coming up which mix two technologies. So people complain. They have a user has composed a pgp message to them and now the whole thing is suddenly wrapped in another fmime layer. And so that's a challenge for the user interface presentation, how you deal with that. What I have one idea is if it's just a signature layer out her most, maybe you just ignore that one, but I'm not sure it's the best solution. So we are still open for discussions if you have better ideas. So there was discussion, what should we do if a message arrives with a digital signature that we cannot completely validate as being good? What should we do? Currently, we do say, well, this has a bad signature, but some people say maybe that's not worth in a plain text email. Maybe we should just stop not showing any bad status at all and just treat it the same as a plain text email. So that's also a pending thing we should do because there was some agreement to do that in a recent pgp community meeting with other developers. And another big unresolved area is if you combine emails with digital signatures and where the content is nice and shiny with HTML and CSS, which many users want to have, the problem is that HTML can be used to manipulate what's shown on screen. So the sender of the email might have seen something different when composing than you as a reader see. So that can lead to confusion. Researchers have shown that. So what should we do about that? I don't have a good solution because nobody agreed to my suggestion to just revert to plain text whenever we have signatures. So but maybe we should show weaker signatures. I'm looking for ideas here. If you have ideas, please, please send them in. So now let's look at some more broader scale. We have the problem that only a small portion of all emails are using S My more pgp at all. They're not used much because there are barriers of entry to use them like Tobias presented. You have to get a third and it's difficult. And then when you have keys, it's complicated to manage. And using email encryption at all can have unexpected consequences. If you just set it up on one device, you have maybe a problem to access your encrypted email from a secondary device. Users can lose their secret keys. They will also lose to the archive of encrypted email. So I think it's still necessary to, we must involve the user. That means user must be willing to accept the consequences. Also user must be willing to take care of the secret key file or lose their archive. So what should we do? How could we get more people, many more people to use email encryption and signature? I think fully automatism is not possible because we have an heterogeneous ecosystem and we need the user to be involved. That means I think that we must better assist users. And that leads me to the question, which technology is easier to use? The past five years, Thunderbird Focus was open PGP in that area because it was necessary because we had to integrate it to ensure it's still usable. But now the question is, is that still a good idea to continue to focus on PGP? As we heard from Daniel, there is currently some, there are currently some disagreements. What's the future of PGP should look like? Daniel has presented a very optimistic outlook for the future. And I agree, many of the things he said would be nice and great to do. But we have the problem that there is a group of PGP in the ecosystem which is difficult to ignore. So I'm, and that's the problem because they are, Daniel suggested maybe everyone should do both. But well, that would also require that client applications support both keys from both specifications. And I see that as a big complication for users to have to manage different keys for different recipients. And I have suggested, I've tried to bring the group together with many discussions and I've suggested even introducing a common key format. But there have been no positive reactions from that. Well, from the IETF side, I usually get lots of good ideas and willingness to discuss. But both sides would have to agree and I don't see the, I don't need much openness to support these ideas from the PGP side right now. So I don't know what will the future bring, of course. No final word has been made. But at this time, to me, I have the worry that the future of PGP is a little uncertain. There are a little, there are conflicting specifications. There might be incompatible implementations. And I don't know how much hope there is for a unified specification. I still hope for it. I think it would be best and we really should see it, but it's not clear whether it will happen or not. And if that's the case, I'm worried that PGP might become less interoperable and more complicated to use in the future. And with that, is PGP the right way to go right now in West Moin PGP when we don't know what the future will bring? My suggestion is maybe we should wait a little and see how the developments in the PGP side go and whether there will be some more agreements in the future. And maybe we should, Sunderberg should wait. I think what we have right now is working. Both specifications have a common base. So PGP is working and you can interoperate right now. It's just that I'm not sure how quickly we should jump on these new ideas and implement them. Maybe it's time to wait. And I suggest Sunderberg should continue to both support both PGP and SMIME. But maybe one idea is I'm presenting that as a suggestion. I'm not saying we will do that and I'm looking for your feedback. Please provide feedback afterwards. And here's a suggestion. We could try to make SMIME easier to use for everyone. We could try to eliminate the barriers of entry that are currently there. And we could say maybe SMIME is an OK technology for users with a limited threat model. And we could say open PGP is more targeted for users with a broad threat model. And as a consequence, they will currently have to accept that there is a slightly higher complexity. Well, why is that? Well, let's look at SMIME. I think it's more widely available in email applications. And if you trust, as a user that certificate authority, do that job right, then SMIME is easier to use than PGP because you don't have to do manual checking of keys. And we don't have the transparency stuff yet that was mentioned. Maybe we can do it in the future. But right now, it's not there yet. And it might be appropriate for people with a limited threat model. It protects against passive reading, SMIME. There is a remaining risk that there are falsely issued certificates. We have seen digital notar in the past. But CAs are regulatory audited. And of course, they don't want to lose their reputation. So I think the risk of falsely issued certificates is not that big. Also, we certify transparency making it even harder. So I think that remaining risk might acceptable for many. So but we would have, in order to follow that idea, we really would have to find a way to get certificates to everyone for free. We would require, like Tobias implemented in his demo, to find a way to automatically obtain and refresh certificates from inside the email client. And then also, we would need something better for looking up the certificates of your correspondence. Maybe we could implement certificate transparency like way for SMIME certificates where we maybe even to protect against the spammers. I'm not fully up to date if the certificate, what's the PPP specifications as, if it's also redacting email addresses already. But yeah, maybe that would be necessary. And if we have some kind of cloaking with a hash, then we could maybe implement a certificate directory that is like a key server and that could consume the information from the transparency logs. And maybe we could use that to make discovery of correspondence certificates easier. So yeah, and PPP could be more dedicated or declared as the preferred technology who don't want to accept that remaining risk of false issues as my certificates. Yeah, and they could still do the manual key verification at the cost of having a little bit more complex technology. So if that idea, if we get a positive reaction to that idea, maybe we could say making PPP easier to use in Thunderbird. Maybe that could become a little less priority and rather focus for PPP on the higher, on the security improvements and interoperability parts of that. And rather focus on making SMIME easier to use and I plan to post some suggestions in the new future towards the Thunderbird email discussion this way, I present this idea in more detail and I will be looking for your feedback. Thank you. Thank you very much. I just want to add one more thing. I somehow expect that there will be many questions. So after this is finished, I will go outside and I'm waiting for your question and to have follow up discussions outside as well. Hi Kai. May I ask why are you still using R&P over the Sequoias version octopus like the crypto library? Well, your question implicates that I should prefer one side or the other. I don't prefer one side of the other. I think I don't want to give any of these conflicting specifications and advantage. In my opinion, Thunderbird should remain neutral. In my opinion, so conflicting parties should get together and find a unified specification and I would like to wait for that. And that if that switching implementation doesn't give me an advantage because I don't know what's the intention of the Sequoia. Will they fully support both specifications? I don't know. There's Nio. Okay. Rush. All right. Are you saying that if we implement V5, then you'll use Sequoia? I'm not making some promises. I'm just saying that additional other alternatives currently are not wearable and if things change, we can re-evaluate our thinking. You mentioned that S-MIME can have a lower barrier of entry than open PGP. To my understanding, the primary problem with encryption is that the user loses the key and he cannot read this email anymore. I don't see how S-MIME has any advantage over PGP in the sense because if I can just as well lose the key, the certificate authority cannot regenerate my key unless you want them to do so much rather than the key. I don't see the advantage. So I think the problem exists with both technologies. That's the same. But yeah, maybe we could introduce a key inscription key. Maybe we could introduce concepts that Thunderbird generates a key encryption key for users. They make it back up with a path race, maybe 20 words writing down which just a randomly generated symmetric key which we back up in paper form with 20, 24 words written down. And then maybe Thunderbird could encrypt all the users' private keys with that single symmetric key. A possible idea that could probably be used for both technologies. So I didn't spend it... Yeah, that's a general idea which we could do which would help both technologies. Alright, any final question in the room? Have you looked at the secure join standard and do you think it might be an option for Thunderbird users to have guaranteed internet encryption with verified fingerprints in a very user- I have not seen the project you mentioned yet so you would have to point me to it and we can have a follow-up discussion. Alright, so thank you again.
[Security] Email Autoconfiguration, and 2FA for email
So, hi. I'm Ben Bocz. I've been working for 25 years on the Thunderbird project, pretty exactly by now. I was a member of the Thunderbird Council, the Leadership Committee, but I'm speaking here for myself and not for the Thunderbird project. I've been for a long time consultant and helping many companies use Mozilla products in code and their products. And most of the point of this talk, I've been involved in four different OAuth implementations, specifically for mail clients. One was for the largest German Internet provider, two of products of my own company, and I was a reviewer of the OAuth 2 implementation in Thunderbird. I've been implementing the IMAP, POP3 and SMTP authentication logic in Thunderbird, meaning server capabilities, authentication methods, password prompt, and so on. And the account creation dialog, including the auto-configuration and the protocol on that. So, about auto-configuration. So, this talk has two parts. First, I'm talking about auto-configuration, and then multi-factor authentication and email. Auto-configuration. Auto-config is a protocol which allows the user, if he sets up an email client, to only enter the email address and the password, and completely automatically set up the email account. He doesn't have to enter anything else, just email address and password. And we find out the configuration completely automatically from that. The user doesn't have to enter any host name, authentication method, cram five, whatever. How do we do that? The email address is ben at example.com. So, example.com is the email provider. So, we just take example.com and get the contact from that. Email is supposed to be federated, so we ask directly the email provider, do you have a configuration for us? So, that's a well-known URL, auto-config.domain.well-known.something. And we ask, is there a config there? And it's simply a static file. So, you can just create the static file once manually. You put it on your web server at the specific location, and all you need is a web server. You don't need any other server-side software for this. But of course, not all ISPs in the world are going to support that. Google, Microsoft, Yahoo, et cetera, they don't have this config. So, as a fallback, we have a database. This database contains pretty much all ISPs in the world. I had to go and search the configuration for every ISP manually in the support documents, put it in machine readable format, when possible, test it, and put it in a database. Other people helped, of course, with that. So, the result is, in this database, you find the configuration for almost all ISPs in the world. If the ISP has more than 0.1% market share, it's most likely in there. But, that still doesn't cover corporate domains, company domains, custom domains. So, my example.com is not covered by this. So, what we do is a MX lookup. So, we make a, what is the MX, for example.com, and we find, oh, it's MX.dreamhouse.com. So, we look up the configuration for DreamHouse, or Office 365, and we got the config this way. This protocol is an internet draft right now. The goal is to make this an internet standard. There is a draft for it. You find this on this URL here. And, if you follow, if you, an e-machline, implements this protocol, and you follow this protocol as it's written down there, you can set up more than 80% of the email addresses fully automatically. So, more than 80% of your users don't have to do anything else than just use an email address and password. How does it look like? This is how it looks like. And, the first reaction I will get is, oh, yeah, except, whoa. Today, I would use JSON, of course, that was 15 years ago. Was a good choice at the time. It still has some advantages. But, this is roughly how it looks like. We're going to look at the details. So, it starts with the email provider. This is the domain. Here, it's on Microsoft.com, but normally, that would be the part after the email address. And, you can have multiple of those, depending on how many domains are hosted by this email provider. So, if I hear who it's all the country, you're who.de, you're who.fr, you're who.ite.com, et cetera. And, the MX servers are also in here. So, if you make a look up, we take the domain part of this, of this, of the MX server, and we put it in here as well. And, this is how we can find the configuration. So, in the database, it's going to map that in the other way around. So, there's going to be an entry for this and this, and we can easily look that up. And, even on the server, this is a static file on the server, so it's really fast. There's also the display name here. There's a long and short version, you don't see that here. It's completely optional, but as a client, you want to use the name of the provider as a count name, for example, you can use that. This is how the config looks like. The incoming server here. We implement, we specify four different types. This IMAP, pop three, JMAB, on special request from Fastmail, yeah, JMAB, and exchange. And, you can have multiple, you can have multiple of these incoming servers. And, they are ordered by preference. So, all other things being equal. The client is supposed to take the first configuration in the file, but the client has the choice to use another one. So, for example, in Thunderbird, if the configuration shows that there's both an IMAP and a pop three configuration in there, Thunderbird gives a choice to the user, what do you want? IMAP, pop three, there's trade-offs, and the user can make the choice, and the client is going to take then, for example, the first pop three configuration listed in this file. Funny fact, half of the Thunderbird users have a pop three account configured. Of course, there's more IMAP accounts, but still half have a pop three account. I was really surprised about that. It's really popular. I thought nobody uses that anymore, but it's popular. There's exchange servers, and of course, you have the SMTP server as outgoing server as well, and the structure looks pretty much the same. You have twice the authentication here. All of them have to work. This is not the same as the IMAP capabilities where the server just lists something and it might not even work. These both work. It's just, does the client support OAuth? If yes, he's supposed to use this. If the client doesn't support OAuth, it can go on to the next one that it does support. And there's the format of the username in there. It could be Ben, it could be ben at example.com. It could be Ben backslash example, like the Windows domains that is in here. Oh, and by the way, in the database, we always prefer as TLS, overstar TLS. And if there's a plain text configuration and an SSL configuration, we don't bother listing the plain text configuration. And there were situations where Yahoo or others said, no, we don't want you to list the SSL only for paid customers. We said, we don't care. If there's an SSL configuration, and it works, we put it in there. And this is the way Thunderbird protected the users years before other clients did, because we knew these configurations and the ISP didn't advertise it. You can also capture address book and calendar sync and file share. Thunderbird doesn't implement that by this possible. So you just enter email address and password, and boom, you have it all set up. Email address, calendar sync, file share, contact sync is all in advance. Ah, so like I said, there's a specification out there. I would appreciate your support. It's like verbally your support, expressing support in the right forums so that this moves forward to stand, actually moves forward to the standardization. And if you're writing an email client, please support the specification. It's really helpful. And of course, if you have an email service, it's always appreciated if you support that as well. All right, second topic. This is a less happy topic, multi-factor authentication and email. We all want that. The ISPs want that. We want that. Unfortunately, not that easy. Right now, the situation is that only if you're Google, Microsoft, or Yahoo, you can do OAuth. The rest pretty much can. There's a few smaller ones, but if you're not part of the select few, which is hard-coded in Thunderbird, or the email client, you can do it because the client doesn't have any way to can discover the OAuth server. So which options do we have? We can use OAuth as it is specified right now, or rather not specified right now. We can, I'm making a proposal for M-Auth, but I'm dubbing M-Auth or more, OAuth for mail. It's OAuth, but we nail it down further. The things that OAuth doesn't specify, we mail them down and specify them for mail so that it works well. Third option is PASCIs. Could you please, thank you. Okay. Thank you. So OAuth was originally designed clearly for websites. Like Zoom wants to authenticate with Google, and they have a relationship, and this is what the spec is written in mind. It clearly shows because if you're trying to implement for mail clients, you run into all sorts of problems. Most of the problems are related to the fact that, oh, OAuth is not really a specification, it's more like a framework. It says if you wanna do that, you would do it this way, but it's up to you. The server decides about everything, it can do something, it might not do it. There's no way to know whether it's going to do it or not. That's for its real problems for the implementation, because as a client, I cannot rely on anything at all. Everything is optional, I don't know what's going on. Even for the same server implementation, it all depends on the configuration, and that specific AP, EISP, whatever they have put in there, this is what works and what doesn't, and I cannot rely on anything in my code. Problem is that users always blame the client, no matter what the reason is. So in my company, we're offering a little email add-on, and in support we get this all the time. The user says, and my email doesn't work anymore, it worked yesterday, it doesn't work today. So there was no software update since yesterday and today. What could it possibly be? Of course, it has to be that the administrator changed the authentication server, the page something, it's something changed at the company, his own company, and there's no way for us to know, there's no way we can fix that, there's nothing for us to do. We cannot even test it because we don't have a login for that company, there's nothing we can do. But we can't, the customer doesn't understand that. He says, hey, but it works in this client, it works this, I want you to work here, you are broken, bye-bye. And I lost this customer, I lost, we lost so many customers because things don't work at the OAuth level, and there's nothing we can do about it. One of the big troubles, one of the big troubles is expiry. None of this expiry stuff, like OAuth is all about expiry and refresh, this is pretty much all that OAuth does, and none of that is specified. There's a lot how it should work, how it almost always works, but it's always optional. So the expiry, I have no idea, is it 12 months, or is it five minutes? I have no idea, and it makes a big difference how I implement my client, but that user has to log in every 12 months or every five minutes. I have to structure my code accordingly, my UI accordingly. But when I write this, I have no idea what's going to happen because it's all configurable. Same with the refresh token. Usually I'm getting a refresh token, but it's optional. So what I'm proposing with M-Auth, that we nail this down. So if you're going to expire that, please tell us when it expires. Again, it's in the spec, so that I have a chance to refresh before it expires. So please send this expiry time. Please send the refresh token. Most servers do that, but it's optional. We would nail this down saying it has to, if you use this for mail, if you want to use this for mail, you have to send a refresh token. You have to actually refresh the refresh token, so if the user continuously checks mail, it is not going to expire. All these little details are not specified in O-Auth, and we would need to nail them down for mail to work properly and reliably. And on the server side, this is just a matter of configuration. Like, we don't need to write any new software, it's just a question, how does the ISP configure that? So all we would have to say, like, if the ISP, if you want to use O-Auth for mail, you have to configure it in this way for it to work. And for us, this configuration is a question of working or not working, because the users are going to complain if they have to log in all the time and not going to use our product. I'm going to skip over error handling. There isn't pretty much no error codes, like all the guys have access denied. I don't know whether the password failed or the user canceled. I don't know how to react to that. Should I show the prompt again or not? We need to have some proper error code. And client registration is my biggest worry. The O-Auth specification requires that the client sends a client ID. And then the specification, it says, the way the client registration works is outside the realm of the spec. It's explicitly not specified. And even worse, this ITFRC specifically states, you may have to sign a contract in order to get a client registration. You may have to sign a contract. What does that mean? Right now, I have to sign a contract with Google and with Microsoft in order for my email client to work with O-Auth. That's the situation right now. That is a problem even right now between the big ISPs because they're not always at peace with each other and they can block each other this way. That's the problem right now. Even worse, for me as a little guy, I have absolutely zero chance standing up to Google with this contract. Like, if I want to offer an email client, I have to sign this. I don't have a choice. So Google can force legal conditions on me, contracts on me, and put in there whatever they want, I will have to sign that. That is a legal nightmare. A huge liability. So if this was an IETF spec for mail, I don't think this is fine between websites. Like if Zoom wants to authenticate to Google, they can make contracts. But for a client's service protocol, I don't think that would ever pass IETF standards because it's pretty much by definition not open. This is actually worse than patents because of the pattern I might ignore. A pattern, it might not apply to me. A pattern might not be valid. But this is a contract between me and the party and it's definitely valid. This is much worse. So there's one proposal in the room to use dynamic client registration. There is a specification which is dynamic client registration. So the ISP can offer to any client, any every instance would just connect, give me a client ID and the ISP would return with a client ID. Apart from the fact that it makes the whole system useless, I don't know of any implementation of that. Like there is a spec for this, but I've never seen any client implementing that. I've never seen any server implementing that. I've never heard of anybody who has ever seen any implementation ever. So there is a spec for that, but we would have to write the server, the software. We need to write a server software. It needs to scale for two big ISPs. We need to write all the client software. And once we've done that, we've added complexity, but the whole thing is pretty pointless because the client registration doesn't actually do anything security-wise. I could just open up outload.exe to find the client ID and secret and theoretically just use that. So security-wise is snake oil and it doesn't serve any purpose. So if I want to know what the client name is for whatever debugging or help purposes, I can just look at the user agent because all this HTTP, I look at the user agent string and if we put a proper value in there, I know what the client is, but I get around all these legal things. So as far as I end up C, the only advantage of the client registration is to force a contract on, which is exactly what we don't want. So there's a simple solution to that. We don't need any new software, very simple, in this MOR thing, we just specify the client ID is going to be mail, M-A-I-L. Problem solved. Hard-coded string. And you don't need any new software for that. The ISP just needs to configure that client ID in their software. If they want my clients to connect, they have to configure this ID, problem solved. That's what I'm proposing for M-Auth as a solution to this. There's another big problem with O-Auth. It inherently depends on a web browser. So I want to implement an email client. I already have to deal with HTML email, but there I don't want JavaScript, I don't want cookies, I don't want video, I don't want any of that when I render HTML emails. When I want to do O-Auth, I have to have all of that. O-Auth won't work if I don't have JavaScript. It won't work if I don't have cookies, and the cookies need to be persistent. It has to be a full web browser. So I'm probably going to use WebView or something, but then it's going to depend on the Android version that the user has, which WebView version he has. So that's going to be a support nightmare. I need a specific API. Because I need to track when this login sequence finished, which Ureli is on. So now he's done now, he's locked in now. I need a specific API for this embedding web browser. That's an extra API, which most embedding APIs don't have. It's an extra complexity. It's already difficult. I don't know how many client, email client implementers are in the room. I don't know how you feel about putting an email client on a web browser, mandatory in your email client. I don't know how you feel about that. But that's the situation right now. There's another option. I can just launch a system browser. So I'll launch a URL, go to the system browser. That's actually what Google wants. Problem there is I interrupted the flow. That's a problem. The user left my email program at this point. He was in the middle of setting up the email address, and now he's in the browser and he finds the news and he starts reading the news and cat pictures, and I lost it. Maybe he's going to come back. I don't know that the flow is interrupted. And in order to know when the user finished, I need to redirect to HTTP local host. That's a web server that I have to implement in my email client. So I have the choice now. I can either implement a web browser or a web server in my email client. I don't have those two choices. I don't like either of those. So that's where we are. You could argue that we have to implement OAuth anyway, because we're dependent on OAuth for Google, for Microsoft, and for Yahoo, which is true. However, the problem right now is still contained. It's really these few three things that are where really needed for all the others that don't need it. My hope is that we can contain it there. If we open up the floodgates and open it up to all the ISPs, we have a legacy that we will never get rid of. You heard the talks about IMAP, which legacy that is. If you implemented email, you probably had your hat scratched for one reason or another. There was a reason 30 years ago why they did it this way. I don't want to be the guy who puts OAuth in email and creates legacy that people have to deal with 20 years from now on. I don't want to be the guy. This is why I don't feel at ease with putting OAuth into email. There's another option. It's called PASCIs. We talked about MAUTH. PASCIs. PASCIs are the new cool thing. Google, Apple, Microsoft are fully behind that. They implemented that in record speed. It's supposed to be super secure. You can bind this to biometrics of your phone. You don't need this code, two-factor thing, code thing. It's still secure. The big ISPs really, really want this. They're really behind this. This is a big advantage because maybe we can contain the OAuth problem and migrate users to PASCIs in this way. We could also allow that for all other ISPs as well. The other advantage is it's a very simple protocol. It's a challenge response protocol, which means the server is sending some kind of information, some blob, some JSON or string, sending it to the eMac line. The eMac line is sending this to an operating system API. The operating system is popping up a dialog. Do you want to log into this and this website or this and this domain? The user can approve or disapprove. He might have to use Face ID or thumbprint, depending on the settings. There we have the two-factor authentication with biometrics. It's secure. Then the operating system generates another information. We send it back to the server and the server says, okay, so it's a simple, we just pass on information back and forth. It's very simple on the protocol level. I don't need a web browser and it doesn't have all of these issues that I just mentioned with OAuth. So I don't know too much about PASCIs, but my gut feeling tells me this is the better way forward and something that is much more easy to support in the future. This is an open question. If you know something of how this would work with PASCIs and have proposals or want to get involved in that, it's an open question right now. Let's discuss this. So, questions? Thank you.
[StructuredEmail] Structured Vacation Notices and Structured Email for Roundcube
All right. So this time I have the pleasure to introduce myself. And somebody else needs to take care. I don't overuse the time. So yeah, my name is Hans Jörg. I'm from Odriga. I have two hats or two histories. So one history is in my main email history is in migration, portability. So you've seen some of our jamming work earlier that day. But actually, I have an earlier history in semantic web technology. So I was a semantic web researcher. I did some stuff on Sematic Media Wiki, if somebody or few is aware of that in the past. And this is a new project, actually, where these things tend to converge. Some people who read their email on the console typically don't like it at all. Recently, what it proposes. But yeah, I like any feedback on it. Some people might even like it because it maybe fixes something what HTML email also broke. And the whole idea is structured email. And I'll present you a reference implementation for RUNCube and a particular application, which is a structured vacation notice, which probably is compiling to email people in particular. So first of all, a claim. So email is sort of your personal API. But you're a little bit of a mechanical Turk in there. So you need to read it. You need to understand it. And you need to act upon what people ask you to do. Other services ask you to do. Second, the email is underappreciated. I think everybody here in the room would probably agree. And so one of the ideas here in order to bring these things together is to make email content, maybe not in general, but for parts of emails or certainly emails, more machine readable so that the tools you develop might help people in certain tasks to make them more efficiently or even to do novel tasks. And the very rough idea is basically like you have a multipart alternative text and text HTML in an email that you also embed structured data in RDF, which is a W3C-specified knowledge representation language according to certain so-called data models. So schema.org is a very popular data model which search engine vendors have set up for, basically, you find it in websites. Like this is movies. This is a song. This is an article. And the very idea is to also allow users or tools to include that in emails so that email clients can make sense out of what is in that email. So yeah, that sounds quite abstract. How could that look like in practice? So actually, this is not something I invented from scratch. So actually, Gmail, Yahoo, some other vendors, WebDE in Germany, are already doing it. So if you fly by Lufthansa or a certain airline and you have a Gmail account and if you opt it in, these airlines might already send that schema inside of that email. And you might notice there is a special display within Gmail that shows you a certain information on the flight, allows you some certain action, might automatically import it to your calendar or at some point also to your Google Assistant and so on. That's nice. The problem here is currently you need, this is only for select senders. You need to register with each vendor, actually, basically, to have this happen. It's only there for very few select use cases, like traveling and maybe ordering in the web. And it's unidirectional. So it's only from a service to you. You cannot by yourself use that. So it's a little bit against the idea of email, right? So I can. I mean, obviously, I would not send a flight probably to somebody, but maybe something else. So schema.org alone does have 800 concepts. And what Gmail supports is like six of them or something like that. But actually, there's already nice, very nice use cases for even this travel information. And there will be a talk just after this by Folk are sitting in the background. So I won't talk too much about that. Second example would be link sharing. So there is share by email, right? And this is how it looks like in K9 email, right? I mean, not blaming K9 for it, but you basically get a URL sent. And this is what you receive. And in this case, basically, you are stuck with Spotify. You click on it. You have said Spotify song. But K9 doesn't know this is a song. And you are with Spotify. And OK, you can listen to the song. But if you're on Apple Music, it's up to you to deal with that. And with structured email, the idea is you could take some metadata, which in the case of Spotify is actually even embedded on the Spotify link already. So nobody needs to do manual annotation. You could put that into the email instead of the link. And so your email client would not just have the link, but the email client would know this is a song, Brussels Jetem, and from 2003 by Al Jaleh. And it could even match, for instance, with your local media player if you have that as an MP3 or something like that. So you could basically give a dereference the kind of content that got shared in that sense. And you have a much better user experience, a little bit like you have an instant messaging when you send a link. Where also, like what's happened so on, they do extract the Twitter cards and kind of stuff. Another use case, maybe even more fancy, is location sharing or even live location sharing. Many instant messaging tools allow you to do so, but it's within their ecosystem. So you're bound to their implementations, their privacy rules. And it only works if you send to another fellow WhatsApp user. So it's also not really open and decentralized. So we built a prototype where you send a location based on the JSON-LD snippet. And we have a prototypical implementation where the client on the mobile can push the updates of the location to a URL with a secret UID, which the user receiving it can actually use to refresh it. So if you're receiving email client support, you could do this user experience. This is an example which we did. And of course, you can also do have some fallback. So you can get an HTML email. Of course, then it's not the live location, but you can do something like a fallback like you have in some newsletters. Click this link. Go to the browser. Even though this is, of course, not the best user experience here. And then another very familiar use case for you, vacation notices out of office messages. So it's typically something you enable for your email account while you are traveling on FOSTA. Maybe it's a weekend, not so many people will write to you. But maybe you arrive back in office on Tuesday. So you say, I'm staying in Brussels till Monday. Please contact my colleague in that meantime or so on. It's still something you need to act upon manually. But it would be interesting if your email client could actually understand this is an out of office message, a resistance at date. And probably this is the person I could redirect the mail to if I wanted to choose. And this is basically what we did. So we did an ITF draft for this to specify it a little bit, the process. And basically, you can even leverage most user interface data you have from the CIF vacation extension. This is how we implemented it in RunCube. We just take the date fields which you anyway fill in there and the reason and put this into the structured field. And if the receiving email client is capable of understanding it, it may store this information for the time which the user is away and it can highlight it. And you can even put it or choose as the user on vacation to include it in emails prior to your vacation. So you could say, even if I go to vacation tomorrow, include that metadata already in just any regular email if you want that. And so recipients can already see, ah, Michel will be in vacation starting tomorrow once he wrote me this mail now. And I might hurry up answering him or something like that. I'm not suggesting this is like it has to be, but it's just illustrating that you can even do additional things which you could not just do with regular out of office right now. And yeah, what is the current state here? So these examples I've shown you, there is currently an ITF working group that's very recently formed. Last November was the first meeting. There is a mailing list here. So even for those of you not familiar with the ITF, please join that list if you're interested in that topic. Any feedback, any questions, everything is very appreciated. There was already quite some good feedback from the community. So like Sunderbad Board Council made a decision like if there would be an RFC, they would be willing probably to implement this or to merge this into their code. First drafts already got adopted in the working group, still sometime under the form of RFC, but things will be going. We are working on a reference implementation where we graciously received money from NLNET and the NGIU program. This is published right now during FOSTA. So you can go to Packages. Not sure if it's already on Packages, latest on Monday. We will provide some guidance so that you can use our round cube implementation as a blueprint for your own vet mail probably, even some reusable code so you don't have to write everything on your own. And there is even first adopters. So for instance, the developer Fairmail, I got in touch with him and he implemented the very first beta of it like within a day, which was quite an awesome experience actually. If you hear that, I really appreciate. And that would be really great thing. So finally, maybe a little bit of an overview of how this currently works. So this is the URLs where you'll find more information. We have one library currently where we do the extraction of the structured data from incoming emails. This could be reused on the server side of your application. We have two libraries which basically are template libraries. It's a little bit user experience-ish. So we are still searching people that really are keen on CSS, HTML design stuff. So if you know somebody or so, please help us because we think it makes sense also to have at least a simple example for how to render these cards for very popular kind of information so that every client doesn't need to decide on its own how to render a Spotify song or something like that, or a music song. Even so, of course, every client could opt to do so. But we want probably to provide some examples here. And we do it both for the actual rendering, but also for this HTML email, which we want to send as a fallback for those that don't have the fancy client yet. And then, yeah, I say there's two RimeCube extension. One is for the structured email as such, where you can do the Spotify thing, for instance, or receive these kind of things. We also have working on the Next Clouds mail thingy, where you can actually interact with the Next Cloud Cookbook app, where you can import recipes that you receive by email. And there is a separate plug-in for the structured vacation notice. That's all, actually, already for the moment. Thanks for listening, and I look forward to feedback and questions. So, yeah, maybe somebody can see. Yeah. All right. Did I say a hand question? Is there concern? I mean, to me, it seems like we've had this discussion that this is just kind of in the background of a mail message. As long as it's not overwhelming data size, it doesn't really matter to people. But the question would be, is this the kind of thing where maybe you have a client that's not displaying really great, where all of a sudden you start having all these random attachments that would confuse a user? Because they can't do anything with this themselves. This is all meant to be machine readable. So there was an interaction between, but I repeat, so I understood correctly. So your question is, is this something that might confuse users if it's somehow mangled inside of you? What are the ideas around trying to prevent confusion of users if a client doesn't know how to handle it? Two things. So first of all, you can see it as a multipot alternative. So just like if the email client doesn't understand it, it just won't get rendered. And also, it's metadata. It will never be shown if the client just doesn't know about it. So you can use it with existing clients already. Actually, you receive those emails probably personally, because Lufthansa might include it already even in the mail sent to OX. You just don't do anything with it. I'll assume you're writing my emails. Sorry? I'm joking. Yeah, yeah, OK. And actually, the interesting thing is even the opposite is interesting. Because we had people coming to us that had exactly the problem where you get a PGP key or an email signature attached to an email, because actually, an email client doesn't even know what that is. And you could actually use this structured data also to provide additional information about what certain email attachments are about. So you could even help email clients to provide a better user experience in that case. What's the incentive for any provider to actually send structured emails? Because it seems that it's an activist opposite. They don't want to like Spotify, or they don't want to be able to send what song it is. They want people to go to Spotify and nowhere else. And same with the Lufthansa thing. I mean, they don't want to send. They want to publicize their brand. They want to upsell services. Yes, good question. Then they don't want to send just a generic message with no possibilities of those. So the incentive is to not use those. OK. So you say there is probably what is the incentive? There is no incentive for both Spotify and Lufthansa to send this. Point one, so for Lufthansa, Lufthansa does it, actually. You can try. I'm not sure about Lufthansa in particular, but airlines do it with Gmail. And the very reason is Gmail gives them a preferred visualization. And actually, it might even strengthen their brand appearance, because they might have a special. There is research being done that the click rate gets even higher when you have the special presentation. So that's at least one theory. I'm not saying I spread the truth here, but just giving you an idea. For Spotify, I was not claiming Spotify itself to send it. Because what I was saying is, you share it. You are in your web browser, for instance, or within Spotify. And you say share with. And you go to the email program. And Spotify does have that data on their website, in the metadata. And the incentive there is for search engine optimization. So they have it because they want to get into the Google ranking very high. And we just piggyback that data by using it in email, in that sense. But you said the share with the feature. The share with feature is controlled by the Spotify client, which is controlled by Spotify. Oh, no, no. It's not, anyway. Because it's obviously URL, because they want to have set for WhatsApp. They won't change that. But with the URL, we can actually pull the metadata from the website. Like the Google crawler does it. So you want to hijack that thing and then put it in? In a way. Which is fair. OK. Thank you.
[StructuredEmail] When is my flight? - Semantic data extraction in KMail and Nextcloud Mail
Okay. Okay. So, yeah, then we'll continue basically right where Hansjord left off. I'm Volker from KDE and I'll talk about how we do the semantic extraction in K-mail and specifically focusing on the travel use case. Many of you probably traveled here, so you might see why this could be useful. So, if you book your flight or your train or your hotel, you get the confirmation as an HTML monstrosity full of advertisements and fine print and somewhere in between is the information that you actually care about. So, you need to find that and transfer it into like your calendar or your travel app and that if you do it manually, right, there's tedious and error prone. So, why can't we have that automatically? And that's basically the point that got me into that topic. I was on the way home from a conference needed to find my departure gate and I was written in like light gray on white in that style. So, I did what you would do in that case like you read the email source code because that's easier to read and I stumbled about a nice compact summary of the trip and that was the schema.org Jason that Hansjord mentioned. So, just showing that in our email client right that should be easy. Six and a half years later, I'm not standing here and still talking about that subject, so as things usually go. So, Hansjord showed us already, right, it's the schema.org Jason that is something that I think Google proposed 10 or 15 years ago for websites and for HTML email. Meanwhile, managed by the W3C, so it's a proper open standard. As an ontology that tries to model the complexities of the real world, right, it has all the fun involved with that. But generally that is sane and something we can work with. Then, however, we got in touch with the harsh realities out there because there's not just that nice Jason format, there is also commonly used a micro data representation that basically embeds that tree of information in the HTML structure of the HTML email. Technically, that's possible and still well defined, but it then basically puts HTML parsing into your problem space with all the fun that that entails. Well, okay, so we implemented that as well. Then we discovered a third variant of encoding that information, basically syntactically invalid Jason. Comalist Jason is particularly popular, so we ended up adding workarounds for the Jason parser to deal with all of that. Then we found the actually much bigger problem and that is semantically incorrect data. I think the most extreme case was Air Berlin. They had the arrival and departure times for flights in the local time zone of the airports, as you would usually do it. But then they added the UTC offset of what is presumably their server location. So if you travel to the US, eight hour difference, you probably noticed that something is wrong. If you travel from here to Finland, a subtle one hour difference, super dangerous, you're under risk of missing your flight. Another common problem, there's an address and there's a geo-coordinate. They mismatch and not just by a few meters. We have to deal with that as well. Then of course the other big problem, this is by far not as widely used as we would wish. You find it with some airlines, some of the hotel and event booking platforms have it. It's super rare for trains. I think in Europe it's only a train line. In general, on a scale from Silicon Valley startup to 100 plus year old European national railway, it's clearly biased towards the former. It seems to be even less common in Asia than in Europe. That isn't really satisfying, but at that point we were hooked and we really wanted those features. We started to look where else we could get them from. There's actually a lot of stuff that we can extract data from in such emails. One particularly useful thing are flight and train ticket barcodes, which then moves PDF parsing and image processing in our problem space. It gets worse. That thing is an entire world on its own. I spoke a bit about that last year in the railways and open transport deff room. I tried to skip that here. Another thing commonly found on booking emails is Apple wallet parsers that zip files containing JSON. Parts of it is machine readable. Parts of it is visual representation, but at least for location and time in the barcode. That's a good starting point. Then of course there is the whole unstructured human readable part. For some of that we were able to build generic extractors. Something like an airline boarding pass. They might look very different from a visual and layout point of view, but they can all be very reliably identified using the barcode. The barcode only contains very basic information, like the day of travel, not the year or the time, and only the airport codes, but not the gate, and so on. All of that information that is really relevant for you is in that human readable text somewhere. It's possible to identify that and match it. For everything else we have provider specific extractor script. That's usually a few lines of JavaScript with regular expressions or X pass queries on the HTML. Not pretty, but it gets the job done. With all of those ways of getting data out, we still have the problem that the data quality isn't really on a level that we can work with. In particular we care about the very exact time, including the time zone. By time zone I really mean IA and A time zone ID, not UTC offset, because if you have a delay over a day-life saving time change, and yes that does happen, then you really need the exact time zone to know when your new departure time is. And the other aspect that is really important is the precise location. So as a geocoordinate, that in turn also helps with determining the time zone, but we want to have features like routing to your departure location or your hotel. And in order to improve on the input data, we use some external data sources like OpenStreetMap or VickyData to resolve airport or train station identifiers and get to the exact location. And we have a few things that apply domain knowledge. For example, if you email, we first to a flight from Brussels to Stuttgart, and mentions a flight time of about an hour. There's two airports with Brussels in the name. They are both close to, or at least both of them are in Belgium, so we know the country and time zone. There's also two airports with the name Stuttgart. One is in southern Germany, the other one is somewhere in the US. But based on the flight time, we know exactly which one of that is possible, right? And I may have uniquely identified the other airport and so on and so on. And then in the end, we have some validation and plausibility checks because they're still either incomplete or nonsense coming through, right? So if you would require time travel to make that trip, then it's likely wrong somehow. And that's then how it looks like in the integration. So we run the current email through the extractor. If it finds something, it shows a summary of that and offers you to add that to your calendar or to your travel app on the phone. This is in KML. Originally, the extractor started as a library for KML, but it's also available as a standalone command line tool by now. That's how we did the integration in NextCloud. Same thing, right? We showed a summary of what we found and you can add that to your calendar. There used to be a Thunderbird plugin, but Thunderbird changed the integration API and since then that stalled a bit. There's a lot of demand for that, so it would be nice to redirect that at some point. And then there's of course the dedicated travel app, it's a memory that we built out of all of this that Hans-Jörg had already mentioned, where you get a timeline of your trip and it then fills the gaps with local public transport and looks for the weather forecast and reminds you to bring a power plug converter if you're traveling to a country where you need that. And I mean, that is exactly the kind of high-level semantic features and workflows that we can build if we actually understand what you're dealing with in your emails or in your documents. So if you produce any kind of transactional email, you most likely have a machine-needable representation of what this is about, so please add that also to the email in some form, ideally in the format Hans-Jörg is working on, but as you have seen, we are not particularly picky in extracting, right? So anything that isn't regular expressions on human readable text would be a big help already. And then finally, I haven't mentioned that yet, all of that of course runs on your device, right? Unlike Google, Apple or TripIt, we don't read your email for this. That on the other hand means we have not as many training samples as they have, so we entirely rely on people donating us travel-related emails in some form, so that is one way to help. Yeah, and that's it. Thank you. Thank you. All right, again, we have our number one question to ask today. Do you have any statistics on signal to noise ratio? Essentially, how many times is the information wrong? Do you kind of any reviews or testing in terms of like, you say that incorrect information is better than no information, but does it ever get confusing to a user, for example? I mean, we try very, very hard to detect stuff that is not plausible or to fit out anything that we at least can detect. How much gets through that is not detectable and then confusing. I don't actually know because the samples we have work or are filtered out, but at least we don't get a lot of bug reports with I missed my flight because it showed something wrong, and usually it is individual providers and they are consistently wrong, so we can add workarounds for that to filter them out and not show anything for them, for example. But there is the risk for providers that we don't know. If they send out something that we can't detect, we might show you a wrong departure time, right? And that is a problem. But you could, I know you could not log, instead of not showing the possibly wrong information, you could not log it somewhere and then to make those statistics. I mean, log in the way that we get the information. Yeah, because it's not a website. That would go against the whole privacy idea that we are very... But if, I don't know, if user agree to send those kind of... We don't have like a data donation feature built into the app right now. That might be an interesting option. But some people send this to us then manually, basically. Yeah. I might, before I give some mic to Arndt, I might just comment on that because we talked already also at Mark to people and there is a lot of the email senders, right? So, and in general, there is some interest by them to support this in a way. So, I have a strong assumption, like, if there is such faulty data, there might be ways to incentivize at least the big senders, the big brands to do it right. So, I'm not so concerned about that. Yeah. Asking people to send bug reports is okay, but if you ever get a mail client to send something to you, to log it, you're going to get information about people's sex life. No matter what you try to get, you're going to get that. It just happens, trust me. And then you have GDPR problems because, well, you thought it was the name of an airline, but it actually was the name of a person. Yeah, I mean, that is, I mean, that's one of the motivation why we are so focused on doing this locally and with keeping control over this. Because, I mean, your personal travel is already quite sensitive. But if you combine that with everybody else, the amount of patterns you see, right, I mean, all of us travel to Brussels in the first weekend in February. If that happens once, right, that could be by chance. But if it happens in the next year as well, and after two or three times, that is not random, right? Then there is some relation between the people involved. And that allows you to do some scary network analysis. If you're looking for the structured data that's already there, it's the open travel alliance. First it was in XML horror. Now it's in JSON. So maybe that will be, can be implemented in the final structure. Open travel alliance. Yeah, I don't know that one yet. No, it's international. Everything is in there, the planes, the trains, boats. Okay. Yeah, we, from the scheme of the world stuff, we support flights, trains, buses, events, restaurant reservations, and ferries and boats. Yeah. But there's certainly more that can be done. One quick final question. I wanted to remark that the anonymization of data fields is possible without being able to trace it back to an individual human being. Because airlines are innumerable. So you can get to the proverbial shouts, whereas user names or people's names are not. And so you could hash everything into the WAHOOZA and still recognize whether or not you should have recognized the field differently than what you've actually rendered in a client in this case. Right. Yeah, but anonymization has turned out to be rather tricky on input data like PDFs, where we also rely on the proper structure. So as soon as you start to modify this, it's not sure that the extractor still detects it in the same way. And we often don't know what kind of sensitive information is even in there or what the fields in the back would mean when we start with a new format. Right. So it's very hard to predict what we need to strike out. Sure, yes. But I thought we were talking about the JSON. Once we have the JSON, sure. But the JSON alone is not really enough to fix the extractor. We need the source document in its original form without modification to see where it goes wrong in the extraction. So if there is proper JSON in the source, then yes, then the JSON is enough. But if our source is a PDF document attached to the email and the barcode in there, then I need the full thing to debug why we failed the extractor. I'm interested, but we'll take this offline, I suppose. Yeah. Yeah. Right. A short technical question is Bogo in the room. Ah, right. There he is. Great. All right. So thank you very much for that lively discussion. Thank you, Falka, for the presentation. Once applause again.
[Ending] It's all about the email. Ugh, what?
What is the future of female? So if you agree, you can shout, like, yeah, or, you know, this is for agree, and if you agree, you can go, boo, and, you know, thumbs down. You can count, oh, all right. Okay, let's get started. Quick introduction. My name is Bogos. People call me Bogos by one, get one. It's easy to remember. Thank you. Yeah, I prefer that like a beer, you know, it will be closer to my mouth. Okay, I used to write PHP, Python, and GS. Now I suck. I work on a call-up project. I don't know who remembers the call-up. Yay, awesome. But I really suck at it. I'm working on a- I'm part of Thunderbird Project, but I'm not representing the projects right now. I was- I'm born in Bulgaria, the Czech Republic. I love heavy metal. So there was a call, yeah. There was a call by me this morning about teaching people how to be nicer with heavy metal, so if you didn't have a chance to watch it, you can watch the recording, which is currently really bad. Okay, started. So as I said, I haven't thought a long time ago. I see some old people here starting at the same time as me. In the past, it was something like, I want to implement something, but the technology or the framework does not allow me. I want to do this, I want to do that, but hey, the PHP library that I have, it's really crappy. I really don't want- I want to do that, but first we need to invest more time, or it's not worth it. Now, looking at the topics over the conversations today, listening at the outer rooms, now everything is possible. Everything. If you want to do something, there is an API or a smart person that will write something. So just keep that in mind in the next few slides. Some of the things might look impossible to be done, but keep an open mind and just remember everything is possible. Okay. So those ideas, I will share that proposed by real customers, people that are using email every day, myself included. Let's get started. So my email is my identity. Yay or nay? Okay. So it's 50-50. Thank you. Thank you, Paulsman. Thank you. So let's say I'm the customer, right? I'll be the main protagonist. So I have my email client and a university, for example, wants to verify that I have a real diploma. It's not like something that I downloaded from the Internet. They send me a verification request through credentials. So they want to see information about my age and whether I'm using the ID credentials in my diploma. So I'll say whether I want to proceed it or not. So they're very different use cases. And again, this is really available to be implemented because the standard and the implementation of lots of languages already exist. Second one is be ready. Build the fans. So I'll talk about that. First of all, we really, we, the customers, we really would like to see all of the messages encrypted and to add to the code. If something is not in code, I want to see it in a very special folder. It's dangerous. So, okay. I didn't get anyone. Oh, all right. One. No, no. Okay, okay. Yeah, yeah, yeah, yeah. We're talking about the perfect world that you guys will build and on the next day of the day of the room, you'll present. So fishing, so, the information, fair, so, manage keys, but it's easy. This is really important. You know, jocosite. It's really important to give the power to underrepresented groups of people that are really fighting for our rights in different parts of the world, journalists, you know, fighters for human rights. They really need an easy encryption, but just a few pro-support. So this is why this is important. Oh, next. I don't have a lot, so don't worry. Okay. So I want my long email to be translated to TLDR. So to extract, of course, something that is useful to me, but then I would like to understand how much time I actually saved by this. Instead of reading the email, the green part here, and at the same time, I would like to understand how much energy the engine doing all the calculations. So if I'm one that is aware or about the end of the question, I don't think that is included and embedded by email. Yay or nay? Or I see the majority down. No, no, no. This is an example. That's maybe not. All right. The local is available. If you don't know this website, you can go and check it out. I think I have one more. Mood detector. All right. So imagine it's Monday morning. Monday morning. And usually the Monday morning is really crowded because of many things. I want to go there, open my email client. So the email client detects that I'm really sad. So it's like, yes. Maybe, maybe, maybe, yeah. And the analysis of the sentiment here. So I want to see my happy emails first. A gift that I buy for Ireland Made in Concord. That's happy. My boss sent me during the weekend. Weekend? Not happy. So post-exists. So if you want to use the second facial, it's open source. Use the camera to detect whether you're happy or sad. And the first one is that you can. Analysis, whether this is happy or not. And of course, you can teach it everything. All right. Quickly, the last one. I would like to have a trust levels. For example, I receive email from my wife, from my kid, from my boss, from everyone. And I would like to see something like whether I, if I trust my wife and my wife trust. Hello. Hello. I would like to see my wife and my wife trust. Her sister, whether she'll trust me. So the trust concept, and it's really nice. You know, some people would like to see that as well. I have one minute. And again, again, your name. All right. Again, everything is possible. Please be open-minded when you create a next generation of email services. And do something out of the box. Creative. I think this is it. Thank you very much. If you want to connect, this is me. If you want to kick my ass, I'm leaving right after this one. So thank you very much. Thank you very much. All right. Thank you. It was very entertaining. I think there are no more questions open. And I think on behalf of the organizers of this deaf room, I think that was a real intense, but also really interesting day. Thanks for all your participation. Yeah, we hope to see you. Yeah, we hope to see you next year probably.
Welcome to the Monitoring & Observability devroom
you you you you
When Prometheus Met OpenTelemetry
So, hello everyone. I'm Pavel. I'm very excited to be here and I will speak about Prometheus and OpenTelemetry and especially how we can use OpenTelemetry project to scrape Prometheus metrics and what are the challenges with this setup. Quickly about myself, I'm Pavel, software engineer at Red Hat. I mainly work in the distributed tracing space. I'm contributor maintainer of the OpenTelemetry operator, Grafana tempo operator and the Yeager project. If you would like to reach out to me, you can do that on the Twitter, on the CNCF Slack. So, today I would like to do some introduction into metrics ecosystem so we better understand what are the projects we can use and then talk about the differences in Prometheus and OpenTelemetry from the data model perspective, how they do things. Then we'll talk about what Prometheus components we can find in OpenTelemetry project, both from the API SDK perspective and in the collector. The second half will be a live demo. We will deploy very simple Golang application instrumented with Prometheus client and we will gather those metrics with OpenTelemetry collector. All right, so why are we actually here? We are here because the ecosystem for collecting metrics is fragmented. There are different projects that provide different capabilities. So, there is a storage, some projects that can store metrics, some projects that can only define protocol for something metric data and some projects that can be used only as an API SDK, something that developers use. Prometheus sits in between, so it provides kind of end-to-end framework for collecting, sending, storing, visualizing and alerting on metrics. Prometheus is very well-adopted, it's very robust and people know how to use it. On the other hand, there is OpenTelemetry project, which is kind of new and for metrics, it provides kind of more limited set of capabilities compared to Prometheus. People still want to use OpenTelemetry for collecting metrics because they can use it as well for collecting other signals like traces, logs and it's better integrates with third-party vendors, your SaaS observability solutions. So the overlap, there is in the API and SDK, Prometheus has clients, OpenTelemetry has an API and SDK and then there is a protocol. Prometheus has its own metrics protocol and OpenTelemetry has OTLP protocol. On top of that, in OpenTelemetry there is collector, which competes with Prometheus agent. Agent doesn't store metrics, it can just scrape them and send them to Prometheus via OTLP, not OTLP, but Prometheus remote write. What I would like to highlight is that OpenTelemetry as well has the auto-instrumentation libraries, which are not present in Prometheus. I think it's a great innovation in open source because those libraries, as we saw in the previous talk, they help you to very quickly instrument your application without any code changes and a recompilation. So I think it lowers the bar of adoption of telemetry in your organization. So that's the ecosystem. Then we should think about how we can use these systems together because we want to combine feature set that they offer to us. So let's take a look before we go into the demo, what are the differences in Prometheus and OpenTelemetry. First of all, the most obvious one is how the protocol works. The Prometheus will pull the metrics from your process and OpenTelemetry, you have to push the metrics into the collector. It's not big of deal. Some protocol might be better for some use cases. So for instance, the push might be better if you have short-lived processes and you need to quickly offload the data before the process shuts down. On the other hand, pool works very well in Kubernetes. I don't think that's kind of a blocker when using these two systems together. However, the second point, the data temporality, I think it's kind of a big deal. The Prometheus uses cumulative temporality, which means that the last observation contains the previous observations. So if you have a counter in Prometheus, it will contain the sum, the aggregation of all the previous values. In OpenTelemetry, we can use as well cumulative temporality, but we can as well use delta temporality, which means that the observations that are sent over the wire will be just deltas. So if people are coming to this room, it will just send one, one, or maybe two if two people entered at the same time. And Prometheus cannot ingest delta temporality metrics as far as I know. So that's a problem. The second difference, or the third difference, is the histograms, or the exponential histograms. As far as I did the research, I think they are almost compatible. However, in the OpenTelemetry, the histogram as well contains min and max values. So in Prometheus, you can potentially lose some precision of what was observed. The next difference is the resource attributes. In OpenTelemetry, when you collect data, there is a resource object that contains information about the process that is sending the data, which is a pot. It contains pot label, deployment label, replic acid label, node label, and all those things. In Prometheus, the concept doesn't exist. All the labels go to the metric usually. There is a workaround to put these labels into the target info metric and then do the joint. However, it kind of complicates the user experience because you need to do additional join when querying the data. The next difference is float versus int. Prometheus uses floats, and OpenTelemetry can use float and int. I don't think it's a blocker because with float you can represent very well all the metrics. And last major difference is the character set that the system supports for metric names and label names. In OpenTelemetry, we can use UTF-8 in Prometheus, only a limited set of characters. So what happens is that when you are sending hotel labels, they will get corrected to the form that Prometheus can ingest. So if there are dots, they will be substituted to underscores, for instance. So as I said, I was working in the distributed tracing space for a long time and I started doing some metrics. And when I did this research, I was even wondering if these systems work, right? Because there is kind of a lot of things that can go wrong. And I think the delta temporality might be the biggest one. So I started looking into how can I solve this problem. And in the OpenTelemetry SDKs, the OTLP exporter that exports OTLP data, it can be configured to translate delta temporality metrics to cumulative with this environment variable that you can see on the slides. And then as well, you can set it to delta if your metric system supports delta or to love memory, which will use even more delta. You may as well ask the question like why we have two temporalities, right? There is a cumulative and delta. And as far as I understand, the delta temporality can be more resource efficient when you are instrumenting your process because the SDK doesn't have to track the summary, right? They will just quickly send the deltas to the collector or process that is collecting the data and doesn't have to do that processing that the cumulative metric store is doing. Okay. And then the temporality, okay, it's a problem. And then in the Prometheus exporter in OpenTelemetry ecosystem, it will do some delta to cumulative temporality translation for you. However, if you are using Prometheus exporter in the hotel SDKs, they will most likely drop delta metrics. So that's something to watch for. Okay. So what are the Prometheus components in hotel ecosystem? In the SDKs, as I mentioned, there is Prometheus exporter. However, if your metrics are delta temporality, they will most likely be dropped. As far as I was going through the code and looking at the exporter implementation, maybe it's not the case in every language, but I was looking, I think, at Golang and Java and that's what I saw. In the OpenTelemetry collector, there are three components. There is Prometheus receiver that we will see in a demo. Then there is Prometheus exporter that will try to handle temporality correctly. And then there is remote write, which will most likely drop your delta temporality metrics. Okay. So let's try what I prepared. It's a very simple hello world style application written in Golang, instrumented with Prometheus client. And then we will have an OpenTelemetry collector with Prometheus receiver scraping those metrics and exposing the same metrics on the collector slash metrics endpoint through Prometheus exporter. So we have receiver and exporter. And addition to that, we will print the metrics into the standard output of the collector. And we will compare if they're correctly propagated. So let me jump back to my console. I guess it's too small. I'm not sure I can change the color. It's better. Okay. So just for reference, this is the app. It's just main class. Using Prometheus client defines a gauge for tracking the version. There is a counter for counting requests and histogram for counting the request duration and some HTTP endpoints. So the app is running. I will just forward the endpoints and refresh the make request. It's a hello world, nothing special. We're going to see the metrics. We get a histogram counter and gauge and not many labels. As a next step, we're going to deploy the collector, which is again a very simple setup. We are deploying a deployment. And then we have a Prometheus receiver with a static configuration. So in a collector config, you can have multiple receivers of the same type. So I have two Prometheus receivers. One is called static, one is SD. We're going to use the static which will scrape the Prometheus example app service. And as you can see, this config is very similar to what you see in Prometheus. So you can potentially copy paste your Prometheus config into the collector config for Prometheus receiver and it should work. And last step, what we're going to do, we're going to enable the receiver in the metrics pipeline to make it active. And now I'm going to deploy it. As you can see, the collector is up and running. And I will pour forward again the metrics end points now of the collector. And we see kind of the same metrics, right? Here's 18, here's 19 because the Prometheus scraped end points with increased the counter. And what has changed are the labels, right? Now I see the instance label, which is the service name and the job which I defined in the collector config called app job. And then, yeah, we see the same metrics, the histogram, the version counter and the direct-quist counter. Okay, as a next step, we're going to make it a bit more automated. We're going to use the Prometheus service discovery in the second receiver. So we need to define the Prometheus as the config. And in this case, we're going to scrape all the pots that have the label that our app is using. Our pot defines this label. So we're going to enable it by just, you know, overriding the name of this receiver. It's the same functionality that Prometheus supports, right? I'm just using it in the open telemetry collector. It should restart. It's up and running. We're going to forward. And now, again, the same metrics. What has changed are the labels. The instance is the pot, right? Which makes more sense if we are configuring the service discovery for pots. The job name changed to Kubernetes. This is what we defined. In addition to that, now we get the target info, which defines the additional labels the receiver discovered. So here I see the namespace, the node name, the pod name. I think it's readable. And so what I can do right now, I can write Prometheus query that will do joint and get all these labels associated to the metric. Or in the collector, I could write a configuration that will put these labels from the target info into the metric labels directly, which will simplify the query. However, it will create more time series in Prometheus. Okay. And as the last step, we're going to use the pod monitor for the pod that we deployed. And we're going to use collector to get this pod monitor, configure the receiver, and scrape the metrics. So the way how it works in OpenTenometry operator, we have additional components called target allocator. And when you enable it, it will watch all the pod and service monitors in your cluster. And it can watch a subset of it. It depends on the label selector. It will get the scrape targets and then distribute those targets across collectors that you deploy. So if you deploy 50 collectors, it will distribute the scrape targets into those 50 collectors so that all the collectors get the same load. How does it work? The operator will deploy the target allocator and collector, will change the Prometheus receiver config with the target allocator service name. And then collector will connect to the target allocator to get its targets. Okay. So we're going to just enable the target allocator. For that, we need to change the deployment node to stateful set. Enable the target allocator. And now we don't have to do any config in the receiver. We can just leave this empty, the scrape config empty as an empty array. However, we need to change the Prometheus to, we need to just define a single Prometheus receiver because the operator will change. There is a convention that operator will find this receiver and change its configuration. Okay. Apply the manifest. And yeah, it's crashing. It's a demo. But it's just waiting for the target allocator to be running and then it will start properly. Sometimes it just takes some time. Okay. It's up and running. Now, if I refresh the same metrics endpoint from the collector, I see labels again they changed because now the instance is again the pod IP. The job name is what the Prometheus receiver uses by default. And then there's labels like namespace, pod directly on the metric. However, the target info should as well contain the metadata from Kubernetes, like what is the pod name, what is the namespace name and so on. Okay. So what we saw is that the Prometheus receiver works pretty well. We can use it to scrape Prometheus metrics. There shouldn't be an issue and it's as well using the Prometheus configuration. So if you are familiar with Prometheus, we can just directly copy paste the config into AutoCollector. However, what we haven't seen is if the process is instrumented with Auto SDK, then the Delta temporality metrics will most likely be dropped if you are using Prometheus receiver. However, if you are using OTLP exporter from the SDK and we set the temporality correctly to cumulative, then those metrics will be correctly propagated to the collector and then to Prometheus. So be careful with the Delta temporality. The Auto SDK should use the cumulative temporality by default. So that shouldn't be an issue. But if you are using something custom, then be careful with those metrics using Delta. So to wrap up, we saw the Prometheus receiver. It essentially contains the Prometheus configuration. However, the dollar signs in the AutoConfig, they are substituted to environment variables. So you need to escape them with two dollar signs. That's one difference. In the open telemetry ecosystem or in open telemetry collector and operator, there is no support for probe and scrape configs. And in the service and pod monitors in the AutoOperator, we don't support TLS. There are limitations. So where do we want to go with Prometheus and open telemetry? The Prometheus is planning 3.0 release. They want to improve the OTLP ingestion endpoint. So you can now ingest OTLP metrics into Prometheus, which is great. However, if you are using Delta temporality, those metrics will be dropped and they want to improve the support for it along other features. So yeah, feel free to help us to build this thing, to be more robust. On the open telemetry ecosystem, there is kind of two projects where you could contribute to improve Prometheus support. In the collector, there is the Prometheus receiver that we saw, Prometheus exporter and remote write. There is a lot of issues on the GitHub where you can help. And on the operator, sorry, we are planning the next CRD version. We want Alpha 2. And we want to create a dedicated target allocator CRD that will expose more Prometheus config. It's as well something that we are working on and we are very happy to accept your contributions. Okay, and this is all that I prepared for today. Thank you. Do we have any questions? No questions? Going longs? Okay. Thank you once again.
Strategic Sampling: Architectural Approaches to Efficient Telemetry
Yeah, so welcome to this presentation about Open Telemetry and Sampling. Today with me is Julius. Hi everybody. Yeah, so we have to pass around the microphone. I'm Julius. I work at Cisco mainly in the telco domain. I'm doing a lot of infrastructure, cloud related stuff. Lately I had the task to add Open Telemetry to our stack and yeah, that's why I'm here talking about Open Telemetry. One second. Okay. Yeah, my name is Pinar. I work at RATED mainly on observability tools and they also have the chance to contribute to Open Telemetry. And what we discussed today is basically a quick recap for those who haven't been in the room earlier about Open Telemetry. We discussed traces, why it's important and basically how we should sample our traces and what is probably not a good idea. And then also some challenges that may occur when you would like to apply some approaches to sample your traces. So what is the Open Telemetry project? The Open Telemetry project is a project which merged between open tracing and open sensors a few years ago. It's a CNCF project which is quite fastly growing and the idea is to provide a vendor neutral data collection so that you don't get vendor locked in by some agents or you need to learn the stack always new. And the project consists out of multiple parts. There is a specification in API on SDK. So in case you would like to instrument your applications, there are auto instrumentation agents that we have seen earlier today as well as a collector which was also shown in the previous talk which is able to have different inputs, process them, send them to the back end of your choice. There are other helpers like Helmcharts and the Kubernetes operator which make our life on Kubernetes then way easier. So in this example we see a web shop which is instrumented using the Open Telemetry SDK and what we get from it, we mainly care here about the traces. For example, a user will interact with our gateway. The gateway will then enrich the context and add the tracing context on top. This one is then propagated to all our different services. Those services would then report the data that is created, for example, the spans with some metadata to a database which helps us afterwards to analyze it and visualize it. So for example, we get then this nice architecture view that we can see on the top as well as the gun chart like view on the bottom which helps us then to analyze basically how the request went. So if we have long delays, if we have some errors in between, we can just find this out and see it. So when we have a normal web shop, for example, the majority of traces is probably quite not really relevant for us. Traces of interest would be probably, let's say, transactions which have special attributes or error traces or high delays. And we should keep an eye on this because if we sample everything, this might turn out to be a bit more expensive. Here we have an example from my colleague Julius from their setup. It produces a decent amount of traces, around one million per minute. And if you would sample all the traces and directly ingest it, for example, into AWS X-ray, you would end up with around $250,000 just for sampling the traces with probably a lot of data that is not really relevant. If you would bring this down to, let's say, 0.1% or even less because you only are interested in this special attribute, some errors or some high delays, this becomes way more reasonable. We go down to $250, and if you have to explain to your manager that you would like to analyze your system and you would like to spend a quarter million, this is probably not the best idea. So the question is now, how can we bring this down to 0.1% or even less? There are some approaches. One would be head-based sampling. Using head-based sampling, the gateway in the beginning would make the decision if a trace gets sampled. This information is then propagated in the trace context, and we can then, for example, define a probability of 10% so that we would like to keep 10% of our traces. And so this is then usually configured in the SDK. So the SDK can be configured using some environment variables, which then leads to the point that we always need to restart our gateway because this is the instrumented part, as so on as the others, but the first one makes the decision. There are options to overcome this. There is the Jega remote sampler. This was originally introduced in the Jega project and the Jega agent and collector, which communicate up with each other. And basically the collector will have a list of configurations for the SDK, and therefore we can configure the SDK on the fly, which is then quite useful. There is another way to reduce the amount of traces, so we can sample always and send the data to a specific collector. And this collector will use then the probabilistic sampling processor to also bring down the amount of traces that we then finally ingest into our database or our service that whatever we take there. Another approach would be tail-based sampling. Tail-based sampling is slightly different. So here we would store the trace and all its spans that are associated to it in the collector, for example, that will receive all this data. And this collector will then make the decision if a trace should be sampled or not. And we will come to this decision later. You can define certain policies which then basically help you to determine if a trace should be stored. The setup is then a bit more complex because you add another component. And also then you will add extra cost because you need an extra collector which does this. In general we can say if we have rare events, so like there are sometimes small errors or sometimes high delays, we can definitely better capture them because we can guarantee that we have this. With head-based sampling you lose this information if we have a sampling rate of one percent and we have an error rate which is way less. This is not really working. So tail-based sampling also introduces some overhead. For this one million traces you need around these resources that are listed there. So basically an X large or two X large instance on AWS which will add another 130 to 260 US dollars on top. But still when you compare the price it's still super reasonable. So the next thing is basically how would this look like in a setup when we introduce the open-talented collector with a tail-based sampler. We would have here, what you can see here is the service one which is requesting service two up to service four. And the trace context is there propagated and the trace one goes through them. In service three an error occurs so the span and the traces everything is reported to the tail-based sampling collector. And the tail-based sampling collector then will make the decision if this should be sampled. As we assume we have a policy which says we would like to store errors. In that case this one would be written to our database and then finally we can observe this trace. This is a configuration how it looks like when you deploy the tail-based sampling collector using the open-talented operator. On the left we see it's the kind open-talented collector. But the more interesting part is more in the config. In the config we define the receiver, processor and exporter. This is basically the input what we want to modify and then what we would like to export and where to send it. In the pipeline section we then glue this together and we can define the OTLP receiving part is what we would like to ingest. The processing part we have them there listed there and they will be processed in this order and then they will be exported in this case to Temple. On the right side we have more details about the tail-based sampling collector. And there we see for example the policies that are defined. There is the error-retained policy and on top without an error so it's an OR. We have another policy that says we would like to keep 10% of our traces. There are more options to configure those policies. I link them here in the slides. When you go to the GitHub page you see there's tons of. One interesting would be probably the OTLP, what is it? Scripting language name. I forgot it. Anyway. You can glue there quite complex policies if you like to. Another important thing is the numbers of traces. So you should tune the numbers of traces always together with the memory limit and the resources. The reason is with the numbers of traces we define how many traces we would like to keep in memory and the trace can have a different amount of data when so multiply spans, more metadata, less metadata. So this is basically a number which isn't useful to define the memory usage. And yeah, so therefore we should tweak the others too. And then we have the decision weight. Since we have tail-based sampling we have no defined end of a trace. This means we never can make sure that this trace ended. So there can always be a span that is reported. We should tweak it in a way that it's reasonable as short as possible but the longest we can expect or the longest trace that we might expect should be the number so that we don't end up with inconsistent traces afterwards. And especially if you have a CACD system that you are tracing, this number is probably way higher and if we had a 10 seconds value every step would be its own trace so we would then end up with inconsistency. The same when we want to replicate our setup here. You can see it's the same set of as before but now we have two collectors. What's happened here, the Kubernetes law balancer will send us to collector 1 and collector 2 independently and now both collectors will do their decision on their own. So collector 1 will go and make the decision after 10 seconds and it will see in service 3 there was an error span so it will sample this and write it to the database. While collector 2 and collector 4 will send the data to collector 2. This one will see the policy is not hit and it will just drop it. So next Julius will show you how to solve this too after we change the microphone. Like this, can you hear me? Right, so maybe it wasn't clear on this slide what you see here is actually a problem because one of our tail samplers decided to sample the trace while the other one decided not to sample the trace because the span with the error was only propagated to one of the samplers. So I will talk about how we can fix this by using smart load balancing in front of it. What we would like to have is a kind of load balancer which distributes all the spans that belong to one trace to a single tail sampler so that the tail sampler which is buffering all the spans can make the right decision in the end. Additionally, we would also like to create multiple instances of our load balancer so that we can scale every component because if you have a single load balancer that would be a single point of failure in our system. Yeah, so how do we do this? Introducing the load balancing exporter. So the load balancing exporter is also part of the open telemetry project. It consists of three main parts. The first one is a resolver. The resolver is responsible for finding all the upstream back ends. So in this case, this would be our tail sampling processors. There are multiple ways to do this. Usually this is done using DNS. This can be done using the Kubernetes API but it can also be done by just hard coding a list of back ends and IP addresses in there. Second, we have to define how to talk to our upstream back ends. You can define the protocol. I think at the moment there's only the OTLP supported but it's very easy to add additional protocols to that and people will do probably. The last thing we need is to define our routing strategy. So we have to define what traces go to or what the routing or the load balancing should be based on like a shouting key. By default, this is the trace ID but you can also specify it to be the Kubernetes service. Right. And the load balancing parts use a method called consistent hashing. I will not go too much into detail but this makes sure that the two load balancers, they don't have a state but they still sample or send traces with the same ID and spent with the same ID always to the identical tail sampling processor. Let's see how we can define this in code. It's pretty much what I just said. You define the routing key. You define the protocol that you would like to speak to the upstream back ends. You can define a sending queue which is acting as a buffer between the load balancer and the tail sampling back ends. You can define the resolver. In this case, we're going to use the DNS resolver. We make use of the headless service which the open telemetry operator is creating for us. It contains all the IP addresses of the parts of our tail sampling processor. And on the side, you can see the different configurations that you could use for static or Kubernetes-based resolvers. Right. So to resolve the problem, we can easily scale. There are some problems and there are some details that you have to care about when deploying the load balancing exporter. Once of all, the load balancer exporter only makes sense if you can export the traces faster than you receive them, of course. Because otherwise, the traces and the spans that will pile up and your sending queue will be at full capacity very quickly. There is a lot of metrics exported also by the load balancer exporter which you can observe, like the queue capacity, the queue size, and also the back end latency which you can use to make sure that the load balancing is happening in an equal fashion across your back ends. If this is not doing well, it looks like this. On the top, I'm not sure if you can see this. Hmm. It's not possible to turn off the lights, unfortunately, and we didn't have time to make the slides, like, inverted. Right. Okay. The problem is the queue capacity is always reached and traces will be dropped in that case. So it's not enough to make a big buffer between your two connections or between the connection. You need to make sure that you're actually exporting or writing traces faster than you're receiving because eventually they will overflow this queue. In our case, the problem was the frottling on the CPU, basically, because the tail sampling processors are very CPU intensive and also memory intensive. The next problem that we have is that we need to load balance the connection to our load balancers. So it sounds a bit weird, right? What you can see on the bottom chart is what happens if you just point your services at two load balancing exporters and let Kubernetes do the load balancing. You can observe that one of the exporters is handling, I think, it's 4,000 while the other one is only doing half of that. People who saw this before know it's probably GRPC connections, which are very long running and Kubernetes doesn't like to load balance that. What can we do instead? We can use the OTRP HTTP protocol instead. So this is using HTTP one. So the load balancing will be working as intended. However, we lose a bit of the efficiency of HTTP two. The headers will be recent all the time. The connection cannot be reused. As an alternative, we could also use some level seven load balancing, NVoy, basically instead of just passing the request to a different backend, we can inspect what's going on, take the individual GRPC packets which are flowing and route them to the different load balancers. NVoy is a beast. You need to understand it. You need people to maintain it. You need to deploy it. And so this is a more complex setup. And the last option we could see here is to deploy open telemetry in a sidecar mode. So what we saw before is basically running in the deployment mode. We have a central collector. Everybody is sending the traces to it. And in the sidecar mode, we deploy a collector alongside our ports. So if you have 10 ports, you would also need like 10 collectors which are running. You see the problem. You need more capacity on your cluster. And there's a bit of overhead there. Right. Last point we're going to talk about is autoscaling which doesn't exist yet in the open telemetry framework. Where I come from, our traffic is very depending on the time of day. So in the night, nobody is sending SMS, nobody is calling, whatever. But during the day, we have high peaks of load. So ideally, we will also like our tail sampling operation to be slower at night while it can scale up during the day so we can save on the resources which can be pretty enormous, as you saw. Right. And also what we observe is that the errors that we would like to catch usually happens during the times where there is the highest load because then things go wrong. Yeah. And here you can see that the amount of SMS send correlates with the amount of received spans. I'm not sure if you can see it. Okay. So what me and Bina did is we set out on a mission to kind of build a thing about load balancing, autoscaling solution that we can tie together. What we came up with is using some sort of a Kafka intermediate stage between the tail samplers and our load balancers. Doing some autoscaling on Kafka is a well-known problem and it's solved. So that was like a good fit for us. We had the idea to make the tail samplers whenever they come up, create a topic on our Kafka cluster representing this tail sampling processor. And the load balancer, on the other hand, will do list topics on that Kafka cluster, let's say every five seconds. So when we create the tail sampler, our load balancer will know about it because it can see the topic. And what the load balancer will then do, it will do the same thing. Instead of routing it to a different HTTP endpoint or different IP address, we will route the traces and the spans to a different topic on our Kafka cluster. So you have multiple tail samplers all listening to a single topic while the load balancer is just rebooting to all those different topics. Right. That's basically the written description of the image that you saw before plus the configuration options which we added in green. So we added to the load balancing exporter the Kafka protocol so now it can speak Kafka and send stuff using Kafka. You can defa... Sorry, I started in the wrong order. The first one is the resolver which does the list topics call. Then we have the protocol and finally on the receiver side, we have a similar topic, a similar setup. The only addition here is that we have the create topic call. So whenever the receiver starts, it will create the topic automatically for us. So we put this to a test, very basic test, Docker compose. I think we spun up like one load balancer and two receiver pods, sent some traces. And what we could see is that the setup is indeed working as intended. So the load balancing works 50-50, the traffic is split perfectly and we can also see that the topics are automatically created for us. If you think this through, you can see that there could be like a lot of problems. What happens if you want to scale down? What happens to the topics? Will there be cleanup? What if I clean up or delete a topic but there's still stuff in it? Do I have to think about, I don't know, that letter queues for the traces and spans that I missed? So definitely if you want to run this or deploy something like this in production, you have to put a lot more thought into this than hacking together, I don't know, 500 lines of code. Right. Yeah, the slide with the GitHub link is gone so you cannot find the code. Right. Which brings us to the conclusion. Quick recap, traces are valuable. If you have a complex system, they get even more valuable. Not all the traces are equally valuable. If you have a bunch of 200 or case, nobody's really interested in them. So you have to focus on the traces which are relevant for you or the ones that you're interested in. You can use head-based sampling or tail-based sampling. Head-based sampling is very easy and cost-effective but you cannot put all the configuration that you can do in tail-sampling. Tail-sampling helps to focus you even better. If you do tail-based sampling, you need to think about load balancing at the same time. Yeah. Right. And the last point with the proof-of-concept that we did was just to show it's very easy to extend the open telemetry framework with minimal code to achieve some custom solutions. So you have Kafka, you can just bring open telemetry or make it match to whatever you're running in production. All right. Thank you everybody for your attention. Thank you. Thanks for the talk. You mentioned that you could switch from GRPC to HTTP so that you would get a new connection every time. Yeah. So you could switch from the GRPC connection from time to time and let it reconnect. The GRPC connection should reconnect? Yeah. Yeah, you can do that every five minutes, for example, just to mean that the client side. Right. So that's something that had to go into the OTLP receiver side, which then, so the exporter side will then after five minutes just terminate the connection. Yeah. I was not aware that you can configure this in open telemetry. But Kubernetes has this tendency to reallocate the game, the connection in the same node anyways. Okay. If that's true then HTTP doesn't help. Good point, yeah. Any other questions? Yeah, thank you for the talk. So on the receiver there is the max connection age. So you can, you know, and I have a question for long running traces. So there is, you mentioned there is a limit like a timer. You can set like 10 seconds and then the sampling decision will be made. Is there a way to do it dynamically? So let's say I see a tag from, let's say Kafka or something that takes longer to process. And those traces would be sampled with a timer that is, I don't know, longer than the default. As far as I know, there's no dynamic timing. It's just a hard coded value. It was the number.
Unifying Observability: The Power of a Common Schema
So, up next, we have Christos and Alex and unifying observability in the power of common schema. Okay, thanks everyone and welcome to our talk. We will in this presentation talk about the conversion story of two schemas of open telemetry in the elastic common schema. But let's first introduce ourselves. My name is Alex. I'm leading the open telemetry initiative at Elastic and I'm a co-maintenor of the open telemetry semantic conventions project. Hi, I'm Christos. I work on elastic as well and I'm software engineer focusing on observability and specifically open telemetry where I am a contributor and a prover on the semantic convention project. Okay, we would like to start with a quite easy and simple question. How many of you do know exactly what open telemetry is? That's great. I can skip some slides later. How many of you do know what semantic conventions is about? That's what I expected. And how many of you do know what elastic common schema is? Okay, thanks everyone. So let's deep dive a bit on the history of open source tools and standards in observability to give us a picture where the standards come from. Let me. Okay. No. Does that work? Okay, around, do you hear me? That works well. Okay. Around or a bit more than 10 years ago when microservice emerged that also changed the observability market and industry. That's when like big tech companies started building their own open source tools for collecting observability data. So tools like Zipkin, Jega for distributor traces emerged, the Elk stack for logging, Prometheus for metrics. We heard a lot about this in previous talks. And based on this defective standard tools, then actual standards emerged like open tracing, open sensors later for distributed tracing, open sensors also covered metrics and the open metrics as a derivative of Prometheus format emerged and Elastic has its own ECS that defines the semantics of structured logging data. Since we will talk a bit more about ECS, a quick introduction what that is. So ECS stands for the Elastic Com Schema and it's basically just a definition of a set of fields that describe the semantics in structured logging data. So for example, if you're collecting a service name with your observability data, the Com Schema tells you that you should put this value into a field that is called service.name, not app.name or application.name. So you have common names that you can later on search for and this also allows you to correlate data across different signals. Now as you can see, we already have at least four standards here that are partially competing, partially complementary. Plus we have all the tools that also create some defective standards for collecting data. So it's ridiculous to have so many standards, right? We need one more that covers all of them. And usually what happens is we have one more that is competing with all the others. And yes, we have one more standard for observability. OpenTelemedia will come back to the comic later again. This is the slide that I can skip based on the Paul. So OpenTelemedia provides not just a standard but a full ecosystem and framework for observability. For collecting data, sending it protocol. One thing that I want to highlight here, there is a specification in OpenTelemedia that defines what data you can collect, like traces, metrics, logs. OpenTelemedia working group is also working on a profiling signal. And what we will talk more about in this presentation is the semantic conventions. Semantic conventions are very similar to what I've shown for ECS. And basically defines, yeah, attribute names and their semantics. Let's have a concrete example of how the data structure in OpenTelemedia looks like here with some logging data. Very simplified view here, it's a bit more complex. But let's say we have a set of log records, right? The OpenTelemedia protocol defines like the core structure of that signal with fields like severity text, which is basically the log level and body, which is basically the log message. In addition, you can collect with your observability data additional context information. This is usually represented in so-called attributes, and that's where semantic conventions come into play. The semantic conventions define which attributes exist, their names, types, and also the semantics behind this. For example, if you're collecting an HTTP access log, right, and you want to capture the HTTP request method, this is the attribute name that you would use for it. Now observability data is usually also captured in a broader context for some resource like a concrete service, a host, or other resources. That's why OTLP wraps the actual observability data into a resource wrapper, and a resource again has a set of attributes, so-called resource attributes, that describe the resource, something like the service name, host name, and so on. So this is the structure in OpenTelemedia for collecting observability data, and semantic conventions is just about the attributes basically in their meaning in this data. Now let's come back to our timeline of standards. There's one important thing I didn't mention before. Actually OpenTelemedia, and we heard this in the previous talk, is the result of a merger between open tracing and open sensors. OpenTelemedia also supports Prometheus metrics and OpenMetrics that we have heard in some of the previous talks, and just last year, Elastic also announced the donation of ECS into OpenTelemedia. So coming back to this, the question is, is it really that we have one more competing standard? I would say actually not. With OTLP we have less competing standards, and OTLP really succeeds in reducing the amount of competing standards and becoming the one and single standard for observability. Now as I said before, Elastic announced the donation of ECS into the OTLP's semantic conventions project. Why? Yeah, because there are great benefits to this. First of all, there are complementary parts and strengths in both schemas that we now merge into one single schema. And second, we grow two different communities by merging them and providing a bigger network effect. So it's a huge win I think for the community, but there are not only benefits, there are also challenges, right? First of all, the overlap between the two schemas is a potential for schema conflicts. And to resolve these conflicts might mean that we need to have either breaking changes in the one schema or in the other. We have seen the structure of observability data in OpenTelemedia, which consists of the protocol with the nested structure plus the semantic conventions. It's quite different to how ECS defines the fields because ECS is just a plain definition of fields without like nested structures or so. So there's some difference resolving that is a bit of a challenge. Another interesting thing that we discovered when we started merging ECS is that in OpenTelemedia before the merger, many times attributes have been defined in a concrete context. For example, we have here an HTTP server span and the attribute HTTP route is basically defined under the semantic conventions for HTTP server spans. The problem is now if I want to use the same attribute in a different context like let's say HTTP access logs, I mean there was always a means just to reference the other attribute, but it feels sort of weird because in the one context is a first class, right, attribute and the other one is just a reference that overrides some semantics. So learning from ECS, what we already achieved with the merger is that now we have in OpenTelemedia a dedicated attributes registry that serves the case of just defining attributes with their types, with their meaning and in the different semantic conventions and their use cases we are just referencing those attributes. So we have clear separation between defining attributes and using them in a concrete context. And finally another challenge is metrics. Metrics formats in OpenTelemedia follow the TSTB model. So we have a concrete metric name like system disk IO in this case with a type, with a unit and we have a set of dimensions modeled as attributes. In this case direction for example for disk IO read or write. In ECS previously the metrics were basically modeled as numerical fields on documents and you can have multiple numerical fields in the documents so you can have multiple metrics. That's the reason why often some of these dimensions that we have in OpenTelemedia are just encoded into the metric name on ECS side. So we have things like disk read bytes or disk write bytes. This is quite a big difference in modeling. This is a case where we are learning basically from OpenTelemedia and adopting this at Elastic now also with Elastic Search supporting TSTB. So we see we are learning from both sides which is a great thing and we are coming to the best solution possible for the community. And Chris will tell you how this actual merger is happening in practice. Thank you. Can you hear me? Okay. So as Alex mentioned there are a lot of things going on so the question is when is time to celebrate the merger that everything has been completed and the truth is that we are not there yet. There are things that needs to be done and actually everyone believed in the beginning that once the merger was announced that that's all. I mean we have not anything to add there but yeah the truth is that the actual work started right after the merger was announced. So yeah let's see some examples of how the merger is happening and how things are moving forward. So I have some real examples here from the upstream repository on GitHub with issues and pull requests. So this one for example is trying to add some new resource attributes for the container images and specifically the digest of the image. So as we can see that PR was filed on the 4th of July I think yes and it took it some time to get seen right. So it took us like many review cycles more than 20 blocker comments actually there so lots of back and forth lots of discussions but that one was actually merged after almost two months. And another example is about a very important attribute the IP of the host hosted IP as we call it and this one was really unique really interesting actually because this PR was filed by a non ECS contributor. So actually that contributor used to work for a company that it's I would say completely unrelated to the ECS project but it was quite nice because in that case the existence of the ECS project was taken into account and there were very interesting conversations and it took us like almost three months to have it in. So yeah it's quite obvious with these examples that the merger was not something trivial not something straightforward that can happen from one day to the other by for example writing a script that will transfer everything from one project to the other or something like that. So we have decided to take an approach to move let's say not so fast and pay attention to the detail and have the proper people work on specific areas so as to leverage their expertise and be sure that what we are merging to the up seem to the final project which is actually the sematic convention of open telemetry will stay there and everyone will be happy with that in the future. So that's more or less the areas of the sematic conventions. We have areas in area about databases cloud containers Kubernetes HTTP system metric system resource attributes and many others. And yeah so we have started focusing on specific areas some examples is the effort that we are doing on the system metrics area we have a working group working there focusing on the stability of the area. We are in a really good position now we are moving towards the ability really soon and the same for the process namespace the process area the process resource attributes and the same for container area we are close to achieving the 100 percent converges there the recent going PR that will add the final attributes final metrics excuse me same for HTTP and network areas we have good coverage HTTP sematic conventions were declared as stable really recently so we are adding on top now which is quite nice and yeah we have work in progress in databases mobile areas cloud Kubernetes so we have working groups getting started and focusing on these areas and yeah over the past months we are focusing on making the project as good as possible it's a community driven way so we as ECS contributes to the contributors donating this project we are not only focusing on the merger itself but we want also to ensure that the sematic conventions project will be there and will can serve us in the future so we are also focusing on other things as well like improving the tooling of the project working on the guidelines this is quite important because there are many times that the guidelines of the one project are in conflict with the guidelines of the other projects so in that case we need to take a step back and reconsider the guidelines and see what we want to have there as a final result and yeah also we work on restructuring the project before it was the sematic conventions within the project were grouped by signal logs metrics traces and so on but now we have a better organized organization there and we group the attributes by topic and yeah as Alex mentioned already we have introduced the global attributes registry it's actually a very big list with all the attributes there and then within the actual specification you can reference the attributes from there so yeah that's quite useful and we're also working on adding a new concept from ECS which actually the attribute nesting or reusing some namespaces that means that if you have a namespace for example always dot whatever you can nest it attach it as it is under the host namespace for example and you don't need to redefine it again so yeah these are some examples from the upstream most of them are closed some of them are really let's say close to be completed but we have some small blockers there but the work is moving forward that's a that's the point and yeah how the community is organized around these so as I mentioned before we want to have proper people working on specific areas leveraging their expertise so we have working groups working on each area and we're trying to first declare their the areas of the semantic attribute the sematic conventions as stable which means that all the semantic conventions that we will have there will be stable and then we can use them in the actual implementations so the next step is to tune the implementations accordingly which means essentially the open telemetry collector and the language SDKs and yeah some examples the system metrics working group the working group around databases we have a security semantic conventions working group which is getting started now we have also approvers areas for the mobile area containers Kubernetes and many others that I don't mention here and the process looks like this first once you want to create a working group or a specific project you propose the working group area and you mentioned there what issues you want to work on and then you will have people expressing their interest to join this effort you will need to find a sponsor from the technical committee and yeah once everything is decided we have a specific project board we have regular meetings we have people getting assigned to the issues there and yeah the work is happening like this and yeah regarding the merger itself in yeah technically it happens like this we follow this process so once we have to either introduce some new fields some new semantic conventions or we want to move something from ECS to the semantic conventions of open telemetry we first check obviously what we have in these two projects and we also check what implementations have so far essentially the open telemetry collector or the SDKs because there are cases that the for example the collector already uses some some let's say metrics there or some semantic events some resource attributes for example but those are not yet part of the semantic conventions of open telemetry so in that case we also check what there is there so we might find something interesting so we can use it and once we have everything considered we have a final proposal we raise an issue or a pull request directly and we start the discussion within the community we yeah particularly focusing on measuring the breaking changes because you can imagine that we want to avoid bringing frustration to our users on both sides so yeah that's really unique really important thing to consider and we go through the review process and then once we have a conclusion we merge and then of course we need to handle the breaking changes because they are there most of the times and yeah the summary for today is that the merger is happening feel free to join us contributors are more than welcome everything happens in the app stream so if you are interested please join and you will see that you will find that you will have real impact from day one there and the goal of everyone is to make the semantic convention of open telemetry the one unique straight one unique and straightforward standard for observability and security that will be there for the future so yeah with that you can find us on csf slack channels or by using our github handles and some project meetings on Mondays we have the semantic events working group meeting same our next day Tuesdays we have the specification sig meeting and on Thursdays we have the system metrics working group 530 30 central time and yeah without any questions I think we're out of time do we have any questions hi thank you for the talk this this was really interesting and clarified some things for me I have one question about what's how what are the benefits of these semantic conventions in terms of like front-end tooling that that we are using because I know that you know there's this idea in open telemetry project that you have semantic conventions and you have common attributes for different signals and then we collect all this data in all these different signals in some observability tools and I imagine in like front-end we could automatically correlate different signals if we have this like common attributes I'm not up to date with the current state of this this area so yeah this is my question what are the main benefits of following this semantic conventions yeah I would say there are two actually one is I mean open telemetry is an open source standard right and there are many vendors adopting this so we need common semantics of what the data represents to build features higher level features on top this is the first thing and the other one is correlation as you already mentioned cross like different signals to also have correlation cross or through the resource attributes for example so you can drill down basically on different signals into the same resource and yeah I would say these two things and also cross signal correlation not only through resources but things like trace ID to have them you know both on locks and traces and later maybe in profiling data this kind of things okay thank you so are you doing something like that in elastic like in front-end at the moment is there any work going on in this area like correlation of different signals yeah of course like I think that's that's the goal for for every observability vendor to bring all all these different signals together yeah okay great thank you very much any other questions going once okay cool then bingo plus okay
Linux load average and other silly metrics
We'll see something very basic, the load average, the thing that you have on top, on top when you look at the performance of your server. Very basic, but with a lot of misunderstanding and the goal is really to understand if it's useful or not and at least how it works. I usually do that as a live demo, but I'm not sure about the Wi-Fi. I think I've lost the connection, but I have some recordings. Basically, what we will do, we will look at what we have in top. So this is not moving because I lost the connection, but we will see later on recordings. You can start to think about it. I have run something that you can see in the processes there. I have two CPUs. I have a load average of 32 for a long time. I don't know if you care, but I have 99% of weight I owe. Basically, my question to you is, do I have a problem or not? I am bound on a resource or not. If I'm bound on a resource, am I bound on CPU or I owe or memory or whatever? This simple question, I see a lot of people who cannot really explain it. The goal of the presentation will be to tell you that you can mostly ignore the numbers that are on top of top because those are about the systems, the processor, what you care about for your application performance is more the tasks that are running and this is probably more useful. Going back to the slides where I have the recordings of all the demos, so we will not try to reconnect to the Wi-Fi. Also, so that screenshot of what we have seen, people using the cloud, cloud providers like to provide nice graphs about performance and usually they put first the load average, the CPU usage. Typically, I have two processors. I have a load average of 30 and my CPU is doing nothing. Memory is 100%. What do they want to tell us with that because most systems will have usage at 100% and that's probably cool. We will look at that in the next 20 minutes. First, this is the recording of what I wanted to show you. That was what was running exactly the same. You see the load average, the number of CPU, the weight, IO, there. What do you think about it? Who thinks I'm bound on CPU? Who thinks I'm bound on IO? Who thinks I'm bound on IO because I have a weight IO? Less people. That's already good. Here, we see a high weight IO, but maybe I can advance on the recording. What I show in this case, when people think that I have a problem with IO, is just to run something else. Let me check where it is in the recording. If I have the wrong recording, I will just explain what I show usually. Sorry, maybe it's in the next recording. What we see is load average, high weight IO, but the most important, what I really care about is this, the state of the tasks. Who thinks I am bound on IO because of the D state? For me, this D state gives me a clue that most of my processors are waiting on IO. Probably. We see that it's not so exact science, but that's something that can give some clues. I'm lost in my slides. This is the next one. I'm running yes. You know the yes command? It displays yes. I'm still running the same IO there, the same throughput. I'm doing exactly the same, and my weight IO has decreased. This is how to solve weight IO, just run something else. I show that to explain that this weight IO is not about what your tasks are doing. It's about the CPUs. When you do IO, you don't need CPU, so you wait. If no one else wants to do something on the CPU, then the CPU state just remembers that, okay, I'm idle because someone is doing some IO. Now I'm running something else that uses this CPU. This CPU is not idle. This weight IO just means idle, and idle because the last one did some IO. The only information I have from weight IO is that the CPU could be used for something more useful than weighting, but doesn't really give me the information that I have a lot of IO because depending on the other workload, I will sit there or not. The state doesn't lie if my processes are all on the D state. At least they are not on the R state, the renewable state, so they are not using CPU. In the next one, what I do to understand better the kind of IO I'm doing, the kind of system call that puts this D state, I just run S trace on my processes, and I just did the S trace dash C to count them, and you see that most of the system calls are P writes. That's actually what I'm running there. I'm doing writes with the P write system call with direct IO. That's basically what I have there. If I want to understand really what is behind a state that is not the R state, the renewable state, I can trace the system calls to know exactly why. I will explain why I'm looking at that because even if D looks like disk, you can do some IOs that are not in D state, and you can have D state that has nothing to do with IO. So it can be misleading. The D state is uninterruptible calls. So your process has something to do that is not in CPU and does it in an uninterruptible state. Depending on the system call, it can do it uninterruptible or not. Often IO like the P write is using this, but there are some other kind of IOs. Any questions so far? Any remarks? Okay. So next one. I will run something else if I remember exactly what I'm doing here. I will run FIO. The difference is that I'm not calling the P write system call. I'm calling the Lib IO, asynchronous IO library. Basically I'm doing the same writing to the disk with direct IO and you can see the throughput is mostly the same. However, I'm not in D state anymore. So there are some IO who put the D state, but there are some IO who just put the sleep state, which is not uninterruptible. So very misleading when you see those things and try to guess what happens. If you are stressed, there is no guess. You know exactly the system call. And I think this is what I do just after. If I stress, I see that most of the IO calls here are IO get events and there is some IO submit. This is our asynchronous IO works. P write just ask the kernel, I want these blocks and wait to get those blocks. With asynchronous IO, it tells the kernel, I will need those blocks. So that's the submit and then can you work on something else and come back and say, oh, do you have my IO? If not, I will wait. The submit goes in this state, but it's very short because it's just a submit. The get events, if it waits, goes in sleep state, the S state, and not the D state. Depending on the kind of IO, you will see it at this state or not. And the wait IO there depends on the state, but more important, I don't know if I can go back. Well, I'm sure I can go back if I replay it. I guess that the load average was lower when I was running that because the D state counts in the load average, the S state doesn't. Means that some IO counts in the load average, some IO doesn't. Means that with load average, you don't really know what happens. Okay. The next one, I'm running something else. So those were direct writes by passing the buffer cache. And here I'm running reads and more I set direct equals zero to FIO. FIO just simulates different kind of IO. Typically I work with databases. I'm a developer advocate for UGA by DB that is a distributed SQL database compatible with Postgres. I've been working also a lot with Oracle. They do those kind of IO, Postgres does not do direct IO. It goes through buffer. Oracle, you have the choice. So really depends. Here, what I would like to show you, I don't see it from here, but I'm probably in the running state. Yeah, it was not sorted. But here, I'm mostly reading from memory, from the cache, from buffers. And this is why you see that much faster. And a difference, I'm using more CPU there. You access memory more than you access the disks. And then this is in CPU usage, the kernel part of the read. I mean, my application is doing the same. Just an IO call. So the user space CPU is still low. But on the system, on the kernel, what Linux does is read from memory. And this is where you have some system CPU there. That counts in the load average also. I just, okay, in the meantime, I did this trace to see the reads there. So I have periods, the same system call. What is different is what is behind. That it reads from buffer. And I don't know if you have seen it. When I was attaching with S trace, the state here was T. That's the state when you attach. And of course, it has a little overhead. You do that to troubleshoot. The important thing is the runable state. I'm saying that either I'm running in CPU or I want to run in CPU. And I don't know which one from those metrics. That's the point. I have only two CPUs. So I know that I cannot have more than two tasks running in CPU. They are running able. They are waiting in the run queue to be able to run on the CPU. Top will not show the figure. Load average will add those rating and those running. If you want to see the difference, you need to look at the statistics from the scheduler in slash proc scheduler statistics or VM stat is showing you the run queue. I'm saying that because I've seen a lot of people comparing the load average with the number of CPU. Like if load average is higher than the number of CPU, I have a problem. Maybe not because if the load average is due to IO, you don't really care about comparing with the CPU. And if the load average is high because you have a lot of processes in the run queue, then probably you have a problem because you have tasks who need to run something on the CPU and just cannot and are waiting in behind. So we have seen different kinds of IOs and they look differently. Many times where I've seen, especially on databases, where I've seen different teams, the Linux team looking at the system and the DBA team looking at the database. And in many companies, they don't really talk together. So one is guessing what the other is doing and a lot of misinterpretation on all that. It's very important if you look at the numbers from the system to understand what the database is doing. And also it's very important for the database administrator to look at the system because many things in the database metric will be different if the system is overloaded. I give a quick example on Oracle, you have wait events where you can know exactly how much time you spend on IO. But it's not exactly how much time you spend on IO. It's how much time between the time stamp it takes before the IO and after the IO. If your process is in the run queue, the database thinks that it is doing IO, but maybe the IO is done and it's just waiting to go back to the CPU just to set the counter on the time stamp. So that's also the message. I say that to database administrator, but applications, if you run on a system that is overloaded in CPU, then probably all of their metrics, because they require CPU cycles to get the number are probably wrong. So why did I call that silly metrics? I didn't came with this. If you want to understand what is what low-dverage measures, Linux is open source, so just look at the source of it. And you can look at the source, but more interesting are the comments which can explain the intention of the function. And so in Linux, the load average is defined as this file, so the source for load average contains the magic bits required to compute the global load average figure. It is a symmetric, but people think it is important. So you see why you see that first in top? It is silly, but some people think it is important, so let's give them something. And we go through the grid pane to make it work on big machine with T-class kernel. So the load average idea comes from Unix systems where it was really measuring the load in CPU and where it was easier to measure it because you just counted the ticks in the scheduler. Linux works differently and means that it is difficult to measure and maybe it makes no big sense. So yeah, good to know why this metric is there just because people coming from Unix were used to have this single graph showing the load and compare that with the application and what is done in the application, but if you don't look at the state of the processes, then it can be misleading. It's easy to understand exactly why we see this state, these IOCOLs in the load average, just the way it is calculated. There are two things that are interested in the way it is calculated. First, it is an average and that's also a problem. If you look at the load average, you will not see a peak of activity of five seconds because it is average. The other thing is that it counts the number of active, so the running state, which is more renewable because if you are in the run queue, you are not really running and it has the uninterruptible calls just because they thought that if we show only the CPU load, is it really the load of the machine? For example, you run a database doing a lot of IOCOL. Then we say that the load is low if everyone is waiting on the disk. Let's add an interoptable because in many cases, we have seen that those IOCOLs are uninterruptible calls, but they are not always, so it can be quite misleading. It doesn't mean that you don't have to look at it, but if you look at it and know what is behind, then it can give you some clues like the clue about IOCOL looking at other things, but more interesting is the process state. A process can have something to run in the CPU and then look at the scheduler statistics knowing if it waits for the CPU or there is CPU available and when it has some calls to do, they can be done in this state or as state and they will be accounted differently by the load average. Any questions so far? Okay, the next one is more about memory just because it's another thing that is misleading in some cases. I think it is quite clear in top that you can look at the available memory, but I see cloud provider showing the use memory or the free memory and here I just want to explain for those who don't know, if you do buffered IO like I did with direct equal zero. Okay, I thought we have five minutes now. Okay, perfect. So I will finish quickly on that. Do not look at the free memory. I'm just showing that if I do some IOs, it will take some free memory, but that is easily freed if it needs look at the available memory. That's the memory that is available to your process, but also think that it is available. You can use it, but if you use it, then another process doing buffered IO may not find its data in the case. So if it is available, doesn't mean that it's free from any impact on the others. Okay, I just put the last one while I'm talking and taking question. The idea there was just to show a really silly program doing V fork that has nothing to do with the data, but just to show that it will go to the state, it will increase the load average and that's the case I've seen in some system where the load average was thousands on a database having its file on NFS and network issues and then those uninterruptible calls increased the load average, but without any consequence because they weren't doing nothing. The only thing is that it's ugly when you look at the load average and the other thing is that they are uninterruptible. You cannot kill them. So you want to restart the system to have nicer numbers, but of course you wait for it. So just be careful, load average accounts some IO and accounts some CPU and you have some IO that you do not see there. Okay, do you have any questions, remarks? Thank you. What about pressure stall information? Very good question. If you have seen at the first screenshot I was running pressure stall information, which in my opinion is a better picture. The pressure stall information is counter telling you during the last 10 seconds, for example, how many, not how many, if there were some processes with pressure on CPU, so to run on CPU to get IO or to get some memory. So it really gives you an idea about the pressure itself. The only thing about pressure stall information I have is that in most of the kernels, the distributions I've seen, it is compiled in the kernel but not enabled by default. And then because it's not enabled by default, I've not seen it a lot. And then I think it's a good idea. Each time I used pressure stall information, it was giving me the right idea, but it's just a subset of the systems I've seen because it's not the default. And then maybe there are some cases that I don't know where it's not perfect, but I try to encourage people to enable pressure stall information where instead of looking at all that, you just see that you have some processes that could be faster if they were not on pressure, on RAM, IO, or CPU. Okay, I think we are just... Another question? If it's okay? So looking at a very generic use case, if you were to redesign the cloud provider's graphs, would you change it? What would you change it to? Could your list maybe the five most important metrics from a generic use case that you would put on a dashboard? On a dashboard, I think pressure stall information can be really nice on a dashboard because you can show that to user. User running on the cloud, for example, they want to know if they are on pressure on CPU or on IO because they pay for that. So those ones I would put that. Load average, maybe with a clear description that it is CPU plus some IO, and memory, available memory, not use memory because a system doing some IO, some buffered IO will always use all the memory in Linux. Maybe we have...
Fast, Cheap, DIY Monitoring with Open Source Analytics and Visualization
Welcome to the next speaker, a big applause for Robert who's going to present on fast cheap DIY monitoring with open source analytics and visualization. Okay, thank you. This is wonderful to see you all here. Big shout out to FOSTA. This is the first time I've ever been at FOSTA. First time I've ever done a talk here. It's totally awesome. So thanks a bunch. All right, we've got 20 minutes to talk about do-it-yourself monitoring with Click House. And just some intros, just a little bit about my qualifications to talk about this. My day job is I run a company called Altenity. We're a server provider for Click House, been around for a number of years. But more particularly, I've been working with databases for a really long time. I've been working with Click House as a user for about five years. Our company also does a fair amount of work in the ecosystem around Click House. Among other things, we are the maintainers of the community provider in Grafana for Click House. That's one of our projects. It's quite popular. It has about 14 million downloads. So let me jump in. I probably don't. Pretty much everybody here knows what, how many of you guys run monitoring systems here? Okay, it's easier to ask who doesn't. How many people use Grafana? Okay, excellent. That'll save some time. Okay, monitoring. You say answer questions. So stuff goes wrong. You want to figure out quickly what is going on. Or maybe things are going fine, but you want to do things like capacity planning. Monitoring helps you answer that question. I'm kind of old school, so I say monitoring. Nowadays, people tend to say observability. But I'll use the word monitoring just because I'm used to it. So back in the days of old, I've been around for a while, so I was on the last talk, talked about inheritances from Unix. When things broke on Unix, you would kind of go in and lay hands upon the machine and run VM stat or IOS stat and try and figure out what was going on. Well, those were the bad days. Nowadays, we tend to do things graphically. And there's a couple reasons for that. One, it's way better because if you're trying to find a problem like, hey, somebody is just like blowing out the CPU, you want to see durations, you want to be able to see a bunch of metrics together, it's much easier to see these things graphically. The other thing is that, as you all know, we're now working with systems, for example, that are based on containers, and you can't actually go in and lay hands on them very easily. Moreover, they can restart and then you lose what's on the local file system. So graphical monitoring, graphical display is critical. So what we're going to do in this talk is I'm just going to give you a few clues about how to build one completely from scratch. And basically, when you're talking about monitoring, you have a number of parts. This picture shows them. You start with metrics, logs, and traces. I probably don't have to explain what those are, but those are standard types of information that come out of systems. You're going to ingest them into some sort of database, which you can then run queries on and it will store them over time. And then you have some system for doing visualization. That's thing number one and alerting. So to tell you when bad stuff is going to happen. We'll mostly be talking about visualization today, but alerting is another important part. So the core of these systems is your store. So something that holds onto these metrics and allows you to do analyses on them of various kinds. And for this talk, we're going to talk about Click House. How many people here have heard of Click House? Excellent. Okay. Great. By the way, we have some of the core committers here for Click House. So if you have any questions, I'm sure you'll get answers. So Click House is kind of like, you could think of it as kind of like, bit like MySQL or Postgres, but it does analytics in the sense that it runs pretty much everywhere. It's open source, but it also has all the architectural features designed to read vast amounts of data very, very quickly. I won't bore you with this specific kind of marketing things here, but instead show you a slide that gives you a little better idea of what it is about Click House that makes it so fast. So our first thing, like most analytical databases, it stores data in columns. So you can think of them as being, every column being a big array that's split up in a bunch of files, but can be read very efficiently. Click House also allows, can make replicas of data very easily. So you can, by having more replicas, you have more things to query. You can handle more, you can handle, you know, more load on the, read load on the system. Another thing that Click House is extremely good at doing is compressing data and then reading it in parallel form. So all of these things taken together allow us to scan data, often, you know, hundreds of millions of rows, very, very efficiently. And I'll show you an example of that later in the talk. So then Grafana and Click House go together hand in glove. I think, I don't have, we don't have actual stats on this because you just never know what people are using. It's all open source. But Grafana is probably the most commonly used visualization tool with Click House. It's been available for years and just about everybody uses it, particularly for operational metrics. And I love Grafana. There are many tools out there, but Grafana is just the level of interactivity allows the fact that you can drill in, sort of look at different time ranges, bounce around between different metrics easily. It's really great to use. All right. So I'm going to build an example here in about five slides. So I'm going to just pick VM stat, which I showed in that previous slide. And what we're going to do is crunch the data, load it into Click House, and then show it in Grafana. So the question is how to do that. Well I'm just going to do it from scratch. So the first thing I'm going to do is collect VM stat data and turn it into a format that I can load into Click House. So this is a little Python program that does it. The important thing to note is not the details because they're probably not that great. I'm not a particularly good Python programmer. But the key thing is there's 14 lines of code here that crunches VM stat. And every five seconds we'll burp out some JSON. And what that JSON looks like is this slide right here. So pretty beautiful, right? This is data. This is data that we can get into Click House really easily. How do we do that? Well, first of all, we're going to make a table inside Click House to hold it. And there's relational databases want things to be in tabular forms. But it turns out Click House has a pretty expansive idea of what constitutes a tabular form. In this particular example, I'm going to do it the simplest way, which is I'm actually going to just create a table with a column for each of my data values. And so what you see here is a table definition that maps pretty much directly to the values that you get out of VM stat. And just a little bit of, you know, if you haven't used analytic databases before or aren't, you know, like a deep database person, we tend to think of these column values as being one of two types. Dimensions, which are characteristics of the thing that we're measuring, or the actual measurements themselves. And that's important because generally speaking, when we're scanning the data, we will group by the dimensions, like collect all the data by host, for example, over a certain period of time. And then the measurements of the things we aggregate. We take averages. We take max. We take min. So on and so forth. So that's just a quick intro, like a 60-second introduction to data modeling inside an analytic database. So the third thing we need to do is we've got the table. We've got the data. We need to make the data go to the table. So ClickHouse is very, very, has a bunch of different ways that you can load data. One of the simplest ways to do it is to, it has an HTTP interface, and you can simply push SQL commands and data to go with them up using Curl, which is a great talk on about two hours ago. So this is an example of the code. You can just say, in fact, that top level, if you're familiar with SQL insert commands, that's an insert command. ClickHouse has this kind of interesting variation on it where they have input formats. Instead of reading a bunch of values, which look like tuples, in this case, I'm actually going to read some data, which will be, for example, if I'm doing a post, it will just be in the payload. And it will actually be a bunch of JSON documents, each of which will turn into a row in the database. And this thing down here just shows you how to execute exactly that command using Curl. So, by the way, one of the things about this talk is everything I've done is there's an example out in GitHub. If you go ahead and Google, like if you're sitting right here and want to do it, you could just Google ClickHouse. Actually, I've got the link at the end. I'll show it to you. It's probably not worth it. So anyway, so we can load that data up. In the examples, you'll see that I actually wrote a little script that I can just run and put in the background. And then it will collect this data on each host that I'm measuring, sticking it into ClickHouse. Then what we do is we build a Grafana dashboard. So everybody, pretty much everybody in the room knows how Grafana works. What you're going to do is if you haven't done it already, install Grafana. You will need to add a plugin to talk to ClickHouse. There are two of them out there. There's the Altinity plugin that we maintain. That's the one, the Community plugin that's been around for years. There's a new one from ClickHouse, Incorporated. Pick a plugin, install it, put your connection parameters in, and then you can write a few queries. And within a few minutes, if you're at all familiar with Grafana, you can create a dashboard that looks just like this. This literally took about 15 minutes to create. And then go crazy. So you've got data loading. So using loading it up using Curl, putting a little script around it maybe so that it can sort of reliably load the data up and dump it. And then you can go look at it. But the cool thing here is that once you're in a database, you have this incredible freedom to use the data any way you want. Because ClickHouse is not a database like Prometheus, which is basically designed to hold metrics. ClickHouse is a general purpose analytic database. You can ask virtually any question of that data that can be expressed in SQL. It may go fast. It may go slow. But this is an example where we're asking a question like how many machines had a certain amount of load, like over 25% load for at least a minute in the last 24 hours. And we can sum the number of minutes. So this is just an arbitrary question, but it's just a few lines of SQL. We can get this question moreover if we have a Grafana attached to this, we can turn that into some sort of display on Grafana that will show it graphically. You then have something that you can really play around. And this is just the effects of running some popular commands like stress is a great command if you want to see something hog memory. What you can see in this graph here, that big blue part was the OS buffer cache, which was actually filled with a bunch of pages from previous processes. You can see that the buffer cache was pretty much blown away when these stress runs began. Up above you see the result of running sysbench CPU command. You can see the effect on the CPU usage. So at this point, this is a very simple example, but you actually have a fair amount of insight into what's going on these machines. So the next thing is to take a drink and to scale this up a little bit. This is a toy that I just showed you. Anybody can do it. It's just a few lines of code. So as you start to think about scaling this system, making it work across a bunch of hosts, a bunch of different types of metrics, maybe adding logs, the first question that probably comes up is, hey, I love open source, but do I have to write everything? And the answer is no, not if you don't want to. So there's a couple of projects that if you're going to go in and do this, you probably should be looking at. One is Fluent Bit. How many people have used Fluent Bit or Fluent D? OK, fair number. We use Fluent Bit quite a bit in our cloud stuff. So what Fluent Bit does is it basically has a bunch of plugins which will sample different kinds of metrics, turn them into a data format, and then put them somewhere else. So for example, you can get the same similar CPU metrics to what I just showed you. There is an input plugin for Fluent Bit which will grab those, and then you can turn around and post them to Clickhouse. And it works as a daemon, so you can just bring it up, let it run. And so as a result, you don't have to figure out how to parse all these different formats. Moreover, you don't have to worry about basic things like posting to HTTP. They take care of it. So that's a really useful project to look at and one that you should consider. Second thing is Open Telemetry, which if you were here two talks ago, is basically trying to create a common data model for all observability data, including metrics, including logs, including traces. And that gives you sort of like a universal broker, if you will, that can handle data coming from all kinds of different sources, maybe Fluent Bit, maybe custom stuff that you build, and then it will push it into databases like Clickhouse. And in fact, there is, OTEL as it's often called, has a provider for Clickhouse, which I believe is in the alpha stage. Some people are using it. It still has some performance issues, but one of the things it takes care of is building the data structures that you use to store your data and doing them in a kind of rational way. So in fact, that's to answer question one fully, if you were going to build this out, you might have a metrics pipeline that looks kind of like the following, where you have Fluent Bits, perhaps feeding this to an OTEL collector, that's this broker that Open Telemetry provides, pushing it to Clickhouse and then reading it to Grafana. I'm not saying this is the only way to do it or even that it's a good thing, that's what you have to decide, but you can do this. The pieces are out there and they're all open source. Second question, this is Fostum. You yearn to use Postgres or MySQL. Why not? Why is it we? Why don't we just use Postgres for this? Well, actually for the example I gave, the little one, you could use Postgres and it would make no difference whatsoever. But as you start to scale, it becomes really important to pay attention to how the data is actually represented and what kind of powers of query you have on top of it. And the key thing to notice here is that Postgres and MySQL are row databases. So they store the data as rows. There is a plug-in for Postgres, I guess, that is beginning to change that. But in general, if you read anything out of a row, if you touch anything in the row, you'll have to read the entire row. Whereas with Clickhouse, it's columnar, so if you read columns, say, out of 109 column table, you only touch the data for those three columns. Moreover, by putting things into columns, effectively a raise of a single data type, taking advantages of things like sorting, say time-based sorting, it tends to compress extremely well. You can literally compress a lot of these metrics by a factor of 100, so they will compress down to 1% of their previous value. Let me just show that graphically. So what they affect is, is that when you run queries with Clickhouse, they can be easily a thousand times faster than Postgres and MySQL because of columnar structure, because of compression, because of parallelization. And this illustrates it. This was a sample, literally reading three columns out of 109, scanning 200 million rows. The amount of data you would read in Postgres or MySQL was 59 gigs, everything. In Postgres, in Clickhouse, first of all, you read three columns, so that was literally 3% of the data. You're already up 33x. Now those columns are compressed, so you're not just up 33x. The amount of data that you're reading is actually, has been reduced by a factor of almost 3,000 at this point. And then you can spread it across, you can spread those reads, say if they're coming off an SSD, you can be doing them across eight threads, so you could argue that that actually is going to reduce the amount of work for each thread by approximately 23,000. So these are, like, not just an order of magnitude, but orders of magnitude less IO. This is the reason why Clickhouse is great for this kind of problem. Third thing, how to handle data from a lot of collections. So for those of you that know Clickhouse, you saw that example and you thought this guy's an idiot. He is adding five rows at a time. And so data warehouses, the flip side of using columns is you touch a lot of files every time you load data. So what you want to do is load data, you want to buffer data into big blocks. So like maybe a million messages at a time. There are many ways to do this. Clickhouse has what's called async IO. But in general for a system like this, if you're starting to receive data from many collectors, you're starting to get into levels of traffic like 100,000 events, 200,000 millions of events per second, what you want to consider is introducing Kafka or Red Panda or a similar event stream. And Clickhouse, so the collectors write to Kafka. This then breaks the connection between your producers, which are getting monitoring data and Clickhouse, which is the consumer. Moreover, Clickhouse has very good integration with Kafka. So there's built in, there's multiple types of integration, but the most commonly used one is Kafka table agent. So that basically wraps a Kafka topic, able to read it so you can do a select off it and it basically reads off the queue. And then there's a trick where you use materialized views to do that read automatically and stick it on a table in the database. So as your architecture gets larger, you definitely want to include this so that you can then read large blocks of data very quickly. Final thing. So I talked about this basic mapping of the data. I modeled it as a table, which is exactly matches the data that's coming out of the JSON, but that's not the only way that Clickhouse can do this. So there's actually a number of options. And one of the most common ones, and one that I was playing around with just as part of this exercise, was to have a column that just contains the JSON as a string. And then what you can do is there are functions in Clickhouse to pull the values out one by one, but they're kind of painful to use. There's extra syntax. So what you can also do is basically as the data loads, you can just pull out particular values and turn them into a regular column. This is very simple to do with materialized views. And so what that means is actually if you're doing a demo for this talk, and it's like 3 a.m. in the morning before the talk, what you can do is basically load the entire JSON document, pull a couple columns out to make your queries work, and leave the rest there. One of the secret powers of Clickhouse is it does schema management or schema evolution very, very efficiently online. So as time goes on and you see more stuff in that JSON string you'd like to read, you can just pull it out into new columns. That's a command that can be, those are commands that can be executed instantly. All right, we're down to zero seconds. We're almost done. So there are other options. So you can, if you have key value pairs, you can do pairs of arrays, like an array of keys, an array of values, and then Clickhouse has very efficient functions to match those up and to process them. You can also use maps, so those are like hash tables. And these are other ways that you can represent JSON. Clickhouse has a JSON data type, but it is experimental and actually do for rewrite. So I just found that there's actually, it's on the schedule, so I just found that out last night. So it's going to be implemented, re-implemented in a way better way. But in the meantime, there are lots of ways to process JSON and Clickhouse has many other features that make JSON very handy to process. Okay, where can you find out more? So lots of sources about this stuff. I'm listing them here. The sample code is up in GitHub that shows everything that I did here. If you just Google GitHub, Clickhouse SQL examples, it will probably be the first thing that pops up. And there's a directory in there called open source monitoring. So you'll see that. And then, yeah, Clickhouse official docs are very, very complete. The Altenony Grafana plugin for Clickhouse, Fluentbit, Open Telemetry. And then we do lots of blogs and YouTube videos about how to use this stuff, including doing monitoring. And that's it. So thank you very much. And if you want to get in touch with me, you can connect me on LinkedIn. I'm on Slack, CNCF, Clickhouse, AltenonyDB, Data on Kubernetes, just, you know, or if you connect, send me email, whatever. So any questions? Do we have any questions? Could you please stay a few minutes for the Q&A so that people understand the questions? Any questions? You showed us data coming from a script. Can we do the same thing with data coming from web applications, for example? Oh, from a web application? Yeah, absolutely. Anything that can generate metrics, you can push them to Clickhouse. And in that case, if you're generating metrics, I would definitely recommend going and looking at Hotel, because it has, for example, it has SDKs, which you can embed in your application, generate metrics that will be translated into a standard form, and then they can get pushed to Clickhouse. So did I understand correctly? There's an SDK for doing this. Yes, it's part of OpenTelemetry. That's one, or you can do it yourself. It's part of Telemetry. There's multiple solutions for this. Hi. So first things for the presentation. My question is, does it make sense to use Clickhouse for tracing, storing traces, and such things? Like traces and logs, other stuff? Yeah, tracing. Absolutely. So, processing and tracing is actually one of the major use cases for Clickhouse. And what's kind of interesting is you can use databases like a time series database or something like Prometheus. The cool thing about putting in Clickhouse is once you get it in, you have this whole, because you have a full SQL implementation in there, you have much more flexibility about what you can do with the data. That's thing number one. We also have people that we work with, some of our users, that just have standardized on Clickhouse as sort of the one size fits all. So that reduces the sort of operational complexity. Third thing is, Clickhouse is just really good at anything that's time ordered, including logs, including just practically any kind of data that's emitted over time. So it's a very powerful engine for this kind of thing. I focused on metrics because it was just one use case, but traces and logs also work very well. Very well. We have time for one. Thank you. I'm looking for a tool to do reprocessing. So once I ingest a lot of data into Clickhouse, I want to do sort of on demand some fancy complex calculation that's a bit too complex for a query and then store the data again in Clickhouse. Is there a good framework or tool for that? Yeah, I don't quite understand the use case. So you're saying you're going to push data in, run a calculation on it, put it somewhere else? Yeah, so I gather 100% of data, but I need only 1%, and on that I need to do some complex business-wide calculation. Right, okay. Yeah, there's a couple options. Within Clickhouse itself, I would recommend you look at materialized views, which are basically like insert triggers that will run a query and then put the result in another table. They can be used, they're normally used to, or the most common use case, is to do pre-aggregation, but actually they are a transform, just a generalized transform mechanism. So anything that comes in as an input block, you can run calculations on it using a query, mix in data from other, you can join on other tables and do anything you want and then dump it to a target table. And that's a very efficient mechanism. It gets invoked every time an insert comes in, the query runs on that block of data. If you have, it sounds like you have something a little bit different. Yeah, I can ask you. Yeah, come by later and we can talk through it. Thank you. Okay, I think that's a wrap up. Yeah, thank you.
Implementing distributed traces with eBPF
Thank you so much. My name is Likol Goczewski and I'm here with my colleague Mario Matias. I think I pronounced your name right. Yeah, you pronounced it very well. We work on open source project at Grafana called Grafana Vela about software engineers. We didn't practice this presentation much because we live on two different continents so you get what you get. It's always not too bad but yeah, we'll give it a shot. Let's go. So we will first do a very quick introduction to what is distributed tracing. I know most of you already know but just to try to get a common mindset even for people that is new to observability or to distributed tracing. Then we will explain a bit how it is implemented and how do we implement it in Grafana Vela using ABPF. So if you want to instrument a server, you might add an instrumentation library like for example the OpenTelemetry SDK and insert some instrumentation points in your server to get on each request a span containing data like the start and the end or some extra information about the request like client ID, the path of an HTTP, the response, etc. Then you can send that to an OpenTelemetry collector and visualize that. If we have a distributed service in which one service calls another, gets responses and so on, you could still do the same instrument each point and then send them to an OpenTelemetry collector for example. But the spans themselves could give information but separately may lack a lot of context. So if you get just a bunch of front end database back in the span separate, it will not be as useful as for example getting for each span which is the request that invoked that other request so you can see everything in context. This is what we say name distributed tracing or context propagation. In OpenTelemetry concretely we use the W3C standard that is using a trace parent header in the request. So you can insert into your request, you can insert headers with the trace ID and the parent span ID and then their services getting these or receiving those invocations can read this trace parent and add it to their own request. So that way you can always track the context. This is not any real SDK, any real language, it's just an example on how could you do it. You have a service and on each request you can read this trace parent, create your span, the part of the trace and when you have to call other services you will add this trace parent in the headers and then in the span. This can be manually done by code, be an SDK or this can be injected by your instrumentation or SDK agent like the OpenTelemetry Java or OpenTelemetry.net agents. Bayla, those are another or follows a similar approach especially for these services that are written in a language that is not so easy to instrument be an external agent. I'm thinking of for example, compiler languages like Go, Rust and C. In that case, Grafana Bayla can be deployed in the host, in the same host as the services you want to instrument and it will use the EVPF technology, we will talk later a bit about it, to hook and inspect the runtimes and libraries of your application or the functions of your application and as well some points of the Linux kernel. Then compose metrics, traces and forward them to your OpenTelemetry collector. What is CVPF? I mentioned it before. It's just in time, virtual machine that is shipped inside the Linux kernel. This allows you to efficiently hook programs to multiple events in the kernel, libraries and the user space programs. For example, Bayla can hook every time an HTTP request is received in the instrumented application. Bayla can execute immediately a piece of code, a probe and then inspect and even modify the memory, the runtime memory of your process or even the kernel. This way is able to know when request, service request starts and ends and even inspect some arguments about them. Bayla has two ways to provide a span information. One is to inspect at the language level. At the language level, we only currently support Go and it hooks user probes into the Go runtime and the Go libraries to inspect them. To support other languages, this is compiler languages but also Python, Ruby or other interpreted languages. It hooks K probes in several kernel functions and libraries to know when connections are started to read the arguments of the requests and the responses and so on. We are able to do that in Go. We are currently inspecting HTTP, HTTPS, GRPC, HTTP2 and soon SQL. At the kernel level, at the moment, we are inspecting HTTP and HTTPS but other protocols will come at some point. We will talk about how to provide the spans but Nicola will talk about how the context is propagated with Bayla. I think you can hear me here. You can hear me, right? Yeah, this is working. We showed a previous example where we had this done by manual introduction in that logic in the program about reading the trace information coming in on a request and then how we send that over which is effectively what most of the open telemetry SDK instrumentations do or the agents in Java or .NET, they do that injection for you automatically but we do it with eBPF so you don't have to have an SDK added to your packages or languages when that doesn't exist or languages where maybe your library dependencies don't quite work with the SDK because of different versions or it's not up to date or whatever the reason. We hook into the program like Mario mentioned in different ways and when a request starts we actually read the memory with eBPF and what is in that trace parent. If there isn't one, we'll generate one according to the W3 stack. Then what we do next is that we notice an outgoing call and then in that outgoing call, if we can find the information about the headers, we will inject the outgoing trace header just like the SDK would do. This is what happens in Go currently with Vela. This is exactly what we do. Now internally, how this all works? Well, to make sure that we can tag an incoming request on a server accepted something like slash ping for example and it did an outgoing request to slash ping me too and in that case we need to track that this incoming request matches this outgoing request by the call maybe async. Maybe somebody wrote a library and said, well, I don't want to wait for this request. I just want to do it async for whatever the reason. I'm using some reactive library. In that case for Go, we track the Go routine creation and termination essentially. Because the Go runtime and the standard libraries are very standardized and everybody uses that, we're able to do this kind of stuff. It doesn't need to be the first argument, needs to be the context. None of that stuff. We just track Go routine creation and we're able to match it later on. That's how we propagate the context. Now for the other languages, we thought, well, how are we going to do that for other languages? People use number of libraries. How do you do this on compile languages? Somebody does just think on time compile language. It's kind of hard. For that we wrote additional support that does something more sneaky or if you will, something more interesting. Land 2 servers or two processes talk to each other over HTTP, for example. They have a unique pair of information and they identify every connection. I have a client, opens a remote connection to a server. It has a source port, which is typically a femoral. I have a destination port, which is a server port. When we see that connection pair, we use it as a unique key and we say, we'll store it in the eBPF map. Then when the server on the other side gets that request, they look up that map and say, well, I have this connection pair. Does that match any client that made this connection? It does require that one single baler monitors both processes. If that is true, then we can actually tag these requests between servers without actually using this transparent propagation. For languages where we haven't written additional support to inject the headers information, we use this as a backup option. This context propagation correlates internally requests through the kernel. Here's an example. We start the client call. It may read the transparent information that was present from a previous call, but if there isn't one, it's just going to generate it right there in eBPF and then store that information. Then later on, when another server request happens based on the client call, we'll read that map, read the transparent information, create the spans, just like if you will, that transparent logic flew through the HTTP headers. More or less the same. There's restrictions, of course. Obviously, for this to actually work, we have to have a single node. Now, these eBPF maps can be shared on a volume and maybe there's a way to use that, but we don't do that and support that right now. This is also not released yet, so we just have it in the main branch. It's one of the newer things we added. But with this, I think I'm more of a person. I'll believe it when I see it. I think we want to try to do a demo to show you everything's running off the laptop that Mario has here. We're not going to connect to any cloud services, but what we want to demonstrate is a few HTTP services here. And GRPC also. They're using GRPC in this case. They're returning Go. We're going to have one Bayline instance. Look at all of them. We're going to use this little tool that actually Fabian made, this little Docker Compose LGTN, which has the full Grafana stack with all our open source products, with the OpenTelemetry collector setup that it can ingest and do traces, metrics, and everything you need. Very convenient for testing. Very convenient for testing or spinning up your own Dockerfana cluster at home. So it's just one Dockerfile with all of it. I also wanted to mention, because we didn't say, it's obvious the presentation is about distributed traces, but Bayline does support metrics too. So HTTP metrics were included from the Star Door product. Traces distributed traces is some of the newer stuff we're working on. Okay, so for this demo, we will show a simple distributed application. It means a synthetic application is just a frontend sending a request to a backend, and the backend asking for distributing some load on the workers and then getting a response. Do you need to hold that? No, it's okay. It's okay. Thank you. Then I have added everything into a Docker compose file just for facilitating the demo in my laptop. So we have this OpenTelemetry collector, which is the hotel LGTM container that Fabian did. And we just dropped Bayline as a container. You can drop Bayline there as a host process, but for convenience also as a container. We need to give access to the pit name space of the host, because it will have to instrument all the processes in that host, and also privilege access because loading EVPF programs requires administrative privileges. Then we set here some the OpenTelemetry endpoint in a standard configuration. Bayline accepts the standard OpenTelemetry configurations for setting up many values. And also we are providing a configuration file. Basically here we say how to group the HTTP routes. For example, there is a route that calculates a factorial, and you will pass in the request, you will set factorial and the number to calculate. We don't get a cardinality explosion because we don't want to create a different route value for every number we calculate. So we say, okay, just group all the URLs matching this pattern, group them in factorial number. And then we tell Bayline how to discover the services to instrument. We have a frontend, a backend, and a worker container, and then we pass that. This accepts any regular expression. So if we say just a dot, it will instrument or try to instrument all the processes in the host. But in that case it will also instrument some parts of the Docker API, the Docker Compose API. So to not generate noise, we are just providing the services we want to instrument. And let me then run this Docker Compose file. Okay, this application is a very simple application. It's a huge factorial calculator application. I will just write a number, and it will calculate the factorial. And if you need more numbers, okay, you calculate. Boom! This is an error introduced as on purpose because I also use this application to track errors from Bayline. But it usually works. Then, doing that, we have, Bayline was already running. We have been generating some traces. So let me go to the local Grafana. Let's see. I go to, for example, explore. Here I selected the tempo, and let me search for all the traces. Okay, beautiful. It's strange because here we can see that Bayline... Oh, yeah? Okay, let me check. No data. Okay, it happens in the best families. No, but we have this... I mean, it is able to... Okay, I don't know what happened. But... For sure, it's a book in Grafana. So I have here many, many requests. Or many traces. Let me just instrument this, submit trace, which is the one that triggers the backend and the workers. If we enter here, you will see the trace information. How the front end invokes the backend. You can track also an internal status of the request, like how much time the queue is in... Or the request is in queue or is being processed. And you can see how, for example, the backend might invoke the worker multiple times. So we got distributed traces automatically. We can even see the node graph of all the requests. How this process invokes or the relation of all the traces as a graph. How the front end as a server, because we instrument either server or client side spans. How the front end invokes the backend, the backend invokes different workers and so on. I just want to add something here. So we're here, if you see, when you look at the Bayla stuff that we produce, we produce these two spans for some of the server requests. We have in queue and processing. And for most people, that's like, what is this two things? Like why are you tracking two times? And if you have a typical application server that saves with an in-go, and you accept the request and as soon as that happens, go or launch a go routine for this. But how long before this go routine gets scheduled on a physical thread, which is M in the world of go, and how long before this physical thread actually gets CPU time? So from a traditional instrumentation, you instrument the handler of the server request. This handler of the server request is the time the handler started running, not the time that the runtime accepted the request coming in from the kernel. Well, at the EVPF, because we're at a low level, we can actually track that time. We can actually see where the request actually came from the kernel, when the go routine was launched, and when you finally got the handler to run. So in a situation where you have a server which is overloaded and it's not able to serve the request, you'll get the actual request time, much closer to what the client sees on the other end. Rather than the fake time, which is what the application server would see normal. Okay, so that was the demo. Let's summarize something that is that, using EVPF, you can capture distributed traces, as we, as Nicole explained it, with some limitations. The advantage is that it requires almost no effort from the developer or operator, in the sense that you don't need to reconfigure your service, you don't need to change the code, you don't need to redeploy, just drop it and get whatever Rela can get. Yeah, and it's, another conclusion is combining this packet tracing with language level support, is what we, we allow Bayla get those distributed traces. So if you like it and want to give a try, Bayla is available to download freely, to test it. You can, you can connect to our GitHub page, or, and then you will see instructions and links to the documentation or the main open source page of Bayla. Yeah, and on the GitHub page is what we start with, we have a link to our community Slack, if you want to chat with us, and we also are soon going to start organizing the community call. So once every month we have a call where you can just join in and chat or yell at us, for whatever reason, but yeah, that's it. Thank you. Thanks a lot. Oh, so many questions. I'm running. You said that when you're tracing in Go, you, you are, you are tracing the coroutines that are, that are handling requests, but in Go you don't have ideas of these coroutines and you don't have the relationship between them. And to, to make it worse, the go around time actually reuses coroutines for something completely different. So how do you, how do you do that without constantly handing pretty much all the coroutines all the time in order to get your trace? Yeah, okay. So like with EVPF you get superpowers. So from a regular goal developer perspective, you never actually have the access to this information. Yeah, for whatever reason, they won't give it to you. But with EVPF, I attached the go runtime. So the address in memory of the go routine is my ID. Now I can tell when the go routine starts and when it gets parked back, when it's reused for something else, it can be reused and that's fine. But at that time I'll clear up all the information because I know the go routine is done. Because like superpowers. Hey, thank you for your talk. I'm one of those guys that manage a lot of infrastructure in code in general. And when you say that, hey, you just have to eat that and just work sort of a box, it's kind of scares me because potentially it can cause problems. And one of the issues that we saw with both kind of solutions usually is if you inject into request a tracing header, potentially the request might be changed. And some protocols do signing and request like AWS signature free, for example. And they don't really like you injecting headers in the middle of request, especially at a lower level. So how do you envision if you have some kind of like agent in the code itself, then you can work on that by disabling the tracing on both specific endpoints. But if you do that at a lower level, then you don't really have a visibility to be able to disable that or recognize that you are creating a request to such a back end. How do you envision like working around those issues in the future? Because this is one example, but this will happen many, many times. Yeah, yeah. So that's true. So if you need sign some IDs and whatever, it's not letting you change the header information, then disable that feature. Don't use what we do right now for propagating using the headers. Use the black box. This is the back boxes are sort of the full back. We've been toying with the idea that maybe in the future we'll let it work with an external storage of some kind that we can actually make past the one node restriction we have with the black box right now. But that's the very reason we're designing for because in so many environments, injecting the header information is just not possible. I'm dealing with interpreter language. No compiled methods, no dice. So I can't do anything with you. Thanks. Good question. Thank you. Thank you.
What’s possible in observability when we have frame pointers
All right, so yeah, what's possible in observability when we have frame pointers is kind of the talk. But let's start out with like a kind of like actual use case of observability, right? So we have these workloads. We can like graph the CPU cores and we can see some things happening and we might be wondering what's actually happening at these spikes, right? And we can use profiling, I guess, to figure out what happens at these individual spikes just to like understand, okay, like in this scenario this was happening in another scenario or like at another time something else happened. We can like get profiles manually and compare them or we do something called continuous profiling where we just like all the time over time, yeah, profile, hopefully a little overhead we can even do it in production or not hopefully, but it's a reality. We can do it in production, right? So we can then store all of these profiles and over time kind of like ask questions when we want to in retrospect and we don't have to worry about missing data points and we have kind of the security or yeah, the ease of use that we can just click on some spike and then get a frame graph or in this case an icicle graph because it's like top down and not the other way around. We call it icicle graphs and you can see all the stack traces and you can like instrument very nicely, introspect what's happening and I don't have a slide for this but we can also kind of like this flame graphs and then we can see in like red where things got worse and in green usually where things gotten better and it's pretty obvious most of the time if you have such a big spike like that's right the point where we need to look in such a flame graph and where we need to like check out what's happening in the code. So yeah, that's kind of a pretty good use case for observability, right? But yeah, what our frame point is but before we come to that quick introduction I'm Matthias Läuwe, I'm a senior software engineer at Polar Signals, I work on Parker which is like the open source project doing a bunch of these things but I also work on Thanos, Prometheus and lots of other open source monitoring projects. Yeah and hey everyone, I'm John Seger, I'm VP of Engineering at Canonical, I have a kind of interesting journey to open source but at the moment I am leading the development of Juju and a whole suite of kind of enterprise apps which we call Charm so if you want to get access to like the best Postgres on your infrastructure or the best MySQL or the Grafana stack or Parker or you want to build an identity stack with ORI and with OpenFGA and products like that, that's kind of the effort that I'm leading. The orchestrator is called Juju, it's been around a really long time, Charm's all written in Python and we're kind of building out a big catalogue of operators that allow you to not just deploy those things but actually compose them all together and integrate them in a really common way irrespective of whether your infrastructure happens to be bare metal or Kubernetes or VMs or on EC2 or on Azure or some combination of the whole lot so that's kind of what I'm up to at the moment. Awesome, yeah and I'm looking forward to hearing more from you but before I do that, let's talk about profiling again or like what profiling data is made up of and you can see these like points in time just T1, T2, T3, at some point in time, we basically want to look at the current stack trace or what the program state looks like and we can see that like at T1 we had ABCD, at T2 we had ABCNE so slightly different and then at T3 we had the same thing again so kind of like just for the sake of the demo or the example, one was like executed twice so maybe it was like executed 20 milliseconds in total and the other one 10 milliseconds so we kind of like count how often we see these stacks and then kind of can make assumption on how much it is running and this is kind of like a sampled profiling profiler, it basically only like every so often looks at these stack traces but over time we can really nicely like see the big picture of things happening. The good thing is because it's only happening so often the overhead is pretty low which again I touched on earlier for our use case figuring out what's going on, it's pretty nice due to being pretty low overhead. So how do we get to these stack traces, how can we see these stacks that we then get all the memory addresses for and then we can like nicely format them using the function names for example in the icicle graphs. So the best case and that's kind of the whole point of the talk right are frame pointers and frame pointers looking at this bit of C code it's hopefully not too daunting in a monitoring observability room but we can see we have the main function at the bottom and that calls a function and so on the functions call each other and then at the very top it just goes into an endless loop. And kind of the important part in all of this is looking at the assembly on the right hand side we can see that okay I omitted like the main function and the a1 but then we can see b1 and we can see that at the very beginning we are pushing and moving some registers around and those are actually the instructions to push the frame pointer onto the stack and then we are calling the next function right and the pushing of the registers so that we know once the next function is done executing we can come back to exactly that previous function and continue executing. The one thing I want to mention here is in the past there were a couple of discussions about the overhead of using frame pointer so we have the push and move instructions and then once the function is done it needs to pop that frame pointer so there were a couple of extra assembly steps involved especially on 30 bit systems it wasn't great performance wise but I think unless you are a really really special case it should be fine for almost all workloads even in production and that's kind of the point of this so basically our binary on the left hand side we can see our set up frame pointer so that's kind of the first instruction that our assembly is executing it is putting the frame pointer onto the stack before then going and doing the actual call to the next function right and before doing that we have to add the return address to our stack so that once the function that we are calling is done we know where to continue in our current function right so we need to know where like this other code we need to execute after calling the function we are calling right now where we need to continue so that's why we have the return address and we then actually do the function preamble and we run that function and eventually we return the function we are at the pointer that then actually tells us where to go back to right so the function that we called eventually returns and we want to go back to the original function however we are then executing after that function call right so previously we were can you see my mouse no we were over here and now we returned like one step and after that right because we don't want to call that function again going into an endless loop we want to continue afterwards however we want to know what called us right so basically what we want to do is whenever we have a stack we want to know which function called us and do that all the way such that we eventually end up in the main function and we know all the functions that we see that we have on the stack up to the point where we are now basically that's kind of like working the stack here and the really really cool thing is we can do this in ebpf so I don't know how many attend the previous talk ebpf kind of a hot topic right now for us it's really really cool because what we can do is write a small program in a C dialect and then get that through the verifier and compile it into ebpf code and then load that into the Linux corner and the way it works is then we actually don't use syscalls like the slide originally says but what we then do is like tell the Linux corner to every so often run this snippet of ebpf code and what we do is do the same things like stack unwinding that you are stack walking that I told you about like two slides ago so essentially what we do is we start or we start in ebpf we get kind of the context we get the current stack pointer and we look at the leaf of the stack so like kind of the very top like the currently executed function and we can then use that to essentially read that instruction pointer and from there get the frame pointer and the special occasion here is the instruction pointer has to be the return address minus one because of the thing I just told you about two slides ago right so basically that's how we can then know where we were called from and we do that all the time up until at the end we do that we get an instruction pointer or that zero so this one then means basically we reach the end of the stack and we know we can terminate or we reach the end of that stack trace. In between for profiling you can see over here we do something with the stack with the frame and what we actually do is we kind of like just get the memory address of that executed function and we basically have an array of all the frames that were executed at the end and have the memory addresses and those memory addresses we can then use to get the function names for that function. So having frame pointers in ebpf makes regular profiling super easy and we can then do profiling super simple we don't have to worry about like special compiler configurations because we can just assume that frame pointers are here for us to then basically use them to figure out the entire stack of the currently executing function. There are ways to do exactly that without frame pointers and shout out I think it was in this very room one year ago there was a talk by Javier and by Charlie who were talking about stack unwinding without frame pointers using Dwarf I highly recommend it it's really really interesting but yeah something for another time and then obviously not only like the profiling use case but if we have frame pointers in the executables in those executing stacks we can also use all the other debugging tools right not only for profiling we can use the bcc tools bpf trace perf etc and they also have the kind of same benefits. So essentially what that means is that the possibilities really become a lot more broad and open or like we can do a lot more things because we only have these like two memory reads and for example in bpf trace we can use the like one liner here to essentially also build a really simple but working profiler that uses the use stack to get the user space stack unwinding and count how often it sees things and that's super cheap then but also like the go execution trace actually traces everything that's happening and because unwinding is so has so little overhead we can also do things like that and once we have profiles continue and kind of like the performance aspect we can do something called profile guided optimizations and just making profiling so cheap that's something where I think a lot of innovation is also going to happen in the future and some outlook like some super new papers the context sensitive sample based profile guided optimization so something we are super excited about because yeah it will allow a lot more things to happen as well but maybe another Boston talk is going to happen about that in a year or two so bringing frame pointer to the masses I'm super excited to have John talk. Hey all right so I'm here to tell you about now we've seen all of the cool stuff you can do when you have frame pointers how we at Canonical are going to make this available to all of you much more easily and so if you didn't see this on an outside our blog a couple of months ago we have decided that from 2404 LTS we're going to enable frame pointers for the entire Ubuntu archive on 64 bit platforms. The caveat on 64 bit is because back in the day 32 bit CPUs obviously had far fewer registers and so sacrificing a register to hold the frame pointer came with a much higher performance overhead in reality these days with 64 bit you're looking at on average kind of less than 1% unless you're in a very specific group so if you're doing like turbo pants on head HPC stuff or high frequency trading or real time things where kind of that like 1% could really really matter perhaps this isn't for you and we can make exceptions in the archive for those packages but in general for 2404 you can expect to see frame pointers for the entire archive through main and universe etc. This is pretty exciting because the LTS I probably need to tell you is going to be installed on many many millions of machines right and then supported for at least 10 years by Canonical so this is going to make a big impact for people who need these things. This stuff is often already enabled by the hyperscalers so people like Amazon, people like Netflix, people like Microsoft are already doing this in production and now you kind of get it for free as well just by using Ubuntu. So I mentioned there will be some you know pretty much negligible barely noticeable for nearly all use case performance impact we're kind of willing to wear that because what it actually enables in the medium term is for us to do a lot of work on our distribution right so we're in the process now of running benchmarks on a kind of pre frame pointer Ubuntu and a post frame pointer Ubuntu ready for the release and that will hopefully help as I identify any outliers so if we hit certain packages where we feel like the performance hit is too much then we will disable it for the first release for 24.04 or we will try and work out what other optimizations we might make to that package to make it work better with the frame pointers enabled. So this will really really help I think downstreams to enable or to gain the benefit of frame pointers and optimize their own workloads. If you are someone who just uses Ubuntu as a platform and you build your own code and let's say you use Python or you use go or use no JS or whatever suddenly those big holes in your frame graph graphs are just going to disappear when you move to 24.04 without you having to do anything. This is really just the start which when I make 24.04 a really focused release on kind of performance engineering and performance itself so what does that actually mean having the frame pointers is one thing but you also need the tooling to actually utilize the frame pointers and kind of inspect the stack and the folks at PoloSignals with Parker are one part of that but we are also looking to include tools like BPF Trace and SysStat and the Perf Tools on Stable by default in Ubuntu. Not in every single image so those of you that are about to screen map me because you use the minimal image or you ship 100,000 container images a month and you don't want to ship BPF Trace and all of them don't panic we are essentially going to enable all of these tools by default anywhere where we ship a kernel. So a Ubuntu server image, a full size server image that doesn't include lexd images, it doesn't include OCI images but if you install Ubuntu on a server or in a VM you will have BPF Trace by default, you will have SysStat by default. Essentially a huge majority of the tools that Brendan Gregg describes as crisis tools will be there by default and the reason that is super important is because if your system is in crisis it doesn't matter whether the tools are in the archive. If your system is right on the edge and then you hit the system with a whole bunch of network IO and disk IO to go and get a package from the archives it is potentially going to put that system over the edge. It may not even work in production, the system may not have access to the package archives and so you just need those tools to be there and we are going to make sure that happens. For places where we don't ship a kernel all of these tools will get wrapped up in a new meta package so if you do want it in your lexd containers, if you do want it in your container images, in your debug images then you will just be able to see it really really easily with a single meta package. We are looking at what other compiler optimizations we can make across the archive as well so this might look like rolling out GCC03 for a huge part of the archive, we are not going to do that in one big bang go because there are some trade offs there and we are also looking at essentially not maintaining a low latency kernel and a generic kernel and just shipping the low latency package by default. None of these are firm, 100% definitely going to happen in 24.04, these are the goals we are working towards before the release in April. Finally, some of you may have seen we have been doing some work on working out how to get Ubuntu and the archive to take advantage of the newer instruction sets, AMD64 v3, AMD64 v4, v5. We actually have a build of the entire archive that uses AMD64 v3, you can get it in a PPA and test it in benchmark, it is faster like TLDR but we need to do a bunch of upstream working apps to work out how we can essentially kind of multiplex that right so that you still just go ubuntu.com slash download, download an AMD64 ISO and it does the right thing without you having a massive long list of different instruction sets to choose from for AMD64 so that work is coming but probably won't land for 24.04. We also continue to introduce new patches into things like GNOME, we are still trying to get the GNOME triple buffering stuff landed ready for 24.04 which gives a much smoother experience on the desktop as well. This runs really from Ubuntu server right up through to Ubuntu desktop and these tools will be available to desktop users too. You as a developer on Ubuntu should have access to the same debugging tools that you find in your production workloads in our opinion. On a side note, we are trying to do this at a really big scale at Canonical, we are hiring practice leads that will sit in a central team to build processes and tools and essentially give advice across our 40 or so products and we are also hiring dedicated performance engineers for every single team whether that team be doing Go, Python, NodeJSC or whatever. If you are interested in that talk to me afterwards, check out Canonical.com slash careers, there is a couple of Canonical folks in here as well who you can talk to. If performance is your thing and you want to come and make use of frame pointers and make Ubuntu blazing fast then that is always an option to you. Finally, from my side, we have done a bit of work with Polar Signals, they have been helping us along this way. We have snap packages and charms available for Parker both for the agent and the server. On any Ubuntu machine you can see this in a cloud in it file with a single line. You can snap and store the Parker agent, give it a single config with a token and start continuous profiling out into Polar Signals cloud or you can host this over infrastructure yourself on machines, on Kubernetes, on containers, whatever it is with Juju. We will continue to make improvements to that over time. It is a super easy way to get hold of this nice continuous profiling hotness in Ubuntu. That is it, get in touch. Thank you very much for that. Looking forward to the Ubuntu release. Are there any questions? Questions anyone? Once, twice, nobody? Okay, then thanks again and next up we have QuickWit I think in 20 minutes. Thank you, bye. Cheers.
Modern application observability with Grafana and Quickwit
Okay, so thanks video team for fixing this. So that was very helpful and now few minutes late, but we just, you know, do the same talk a bit later, same 20 minutes slot plus five minutes Q&A is quick wit François. Welcome. Should I say it is it sufficient? Yes, I should be okay. Hi everyone. I'm very happy to be here. Thanks for having me in this room. We have been working like on observability since three years at quick wit and I would like to present here what we have done during the three years of outcome of it at least. First I will introduce myself a bit. So I'm François. I'm working on the core engine of quick wit, which is a search engine. I also the co-founder of quick wit and I also work a lot. I'm working a lot on the graph and data spring that I will show you in this presentation. So for this, for the agenda, I would start like with taking a step back, a short one and then we'll talk about like the problem of cardinality that we can have with metrics and then I will show you like very briefly the engine of quick wit and how it works. And finally, I will show you a demo of quick wit working in Grafana for application monitoring. So let's start by taking a step back. So I'm showing this graph. It's not mine. This diagram is from Ben Siegelman. This is a Googler who works on the distributed tracing dapper, software dapper when he was at Google and you co-founded also LightStep, which is a company doing observability and monitoring stuff. I kind of like this diagram because it summarizes all the complexities, the intricacies between monitoring and observability and the different signals that we can get from our applications or our servers. So at the bottom, you have the three signals or the three pillars. It depends on how you call them. Traces, metrics and logs. And generally, you store them in different databases, metrics, you store them in time series databases because you want something optimized for it and it can be very optimized for this kind of data. And for trace and logs, it depends. You can store them in a search engine or you can store them in dedicated storage. I'm sure you know tempo and loki, loki for logs and tempo for traces. And on top of it, you try to build your monitoring software with alerts on metrics or could be on logs or even traces if you can. So let me talk a bit about how the problem of cardinality here. At the bottom, you can see that metrics goes always into the TSDB, but you can also put some traces information into your TSDB. But in this case, you need to be very careful about what you did because in traces, you can have a lot of labels and you can be very, very accurate about what's happening. So that's why I just want to stop there a minute. When you want to monitor a distributed system and we all generally have that somewhere in our job, they can fail for various number of reasons, even for a tremendous amount of reasons. And so you may want to label everything like if you have a software that is deployed with different versions, you want to have this version label, same for the host, same for the customer ID, if you are a SaaS, for example, you can have thousands of them. And you want also to monitor your services, your endpoints. In summary, your cardinality will explode and this will be a problem for your time series database. It's a problem, it can be either a performance problem or a money problem because if you look at data dock pricing, for example, you will pay $5 for 100 custom metrics. So if you want custom metrics because that's it, if you want something very specific to your business, in this case, it will cost you a lot. So generally, you don't want to have all those labels on your metrics. You want to control and just keep a low number of them. That's not the same for traces. In general, traces, you want to keep everything so that you can dig into, like, really understand the full trace, each unit of work in your software, you will be able to understand for each customer, for one customer particularly, for one request ID or for one user ID, you will know what happened in your system. So generally, you keep everything in your traces. And that's what I will talk about today and that's what QuickWit is for. So QuickWit is an engine that is storing logs and traces and particular traces. It does not handle metrics. And it is a bit different than other search engines in the sense that we decoupled a compute storage. So we have the same approach as low key and tempo on this, that chip storage is great. If you use an object storage, it's also very reliable. So you don't lose your data and you have all the benefits of decoupling your write pass, your read pass. It's really great when you want to scale to a lot of data and that's the case in observability. And last point is that we worked on the search part a lot so that it can stay sub-second even if all your data is on object storage. And I will explain how it works very briefly. So the engine architecture is quite simple. It's globally the same for this kind of decoupled compute and storage architecture. You will find the same for a tempo, for example. So at the middle, you have your object storage where you store your data. This is the source of truth. On the left side, this is the write pass. And on the right side, this is the read pass. So on the right pass, you have your incoming, gizand documents, could be traces, could be whatever. And you have your indexes that is running. And each 30 seconds typically or each 15 seconds, it will build what we call a split in QuickWit. This is a file where we put all the data structures that are used at search time. So it's several well-optimized data structures to be searchable on object storage. So we create them, the indexer creates them, and then upload it to the object storage. So you will have a bunch of splits that are put on the object storage each 30 seconds, for example. And each time you put the split, you also put one row in a meta store. It could be a PostgreSQL database or it could be just a gizand file stored on the object storage. So we will add just the metadata of the split in it. And once it is inside the meta store, like the metadata of the split that was uploaded on object storage, on the object storage, then the searcher is able to search it. So you have this nice decoupling where then if you want more searchers, you just have to increase your number of searchers. You can even shut all of them down. That's not a problem. So that's for the high-level view. To understand why QuickWit is fat on object storage, I have to show you also how to do how a split is made. This is an interesting part because it shows you also the different data structures that we are using. And it will help understand how we can achieve fast search later on. So you have basically three data structures in the split and one thing that we call a hard cache. The first data structure is the doc store. It's a row-oriented storage. So if you have a document ID, we will give you the whole jizz and document. The second data structure is the inverting index. So in this case, if you are looking for a user ID or a quest ID or a keyword, it's optimized in the sense that if you give a user ID, we will retrieve immediately the list of document ID that contains this user ID. So it is very fast. And then you just have to retrieve a document from the list of whose document ID is. The third data structure is the column now store. So here it's for doing aggregations. If you want to do analytics on your logs or traces, we will use this column now store. You can have a lot of columns. You can have spare columns. That's optimized for that. And the last part is what we call it's a split footer that we keep in general in the memory of a searcher because it's very, very small. I put it 0.07% of the size of a split. So it's very, very small. For that, it's cool because you can always keep it in your cache on your searcher. And in this hot cache, you will find all small pointers to the other data structures so that when you make a search request, you will need only to make one or two requests to your object storage to find the response. So that's why I said when we optimized QuikWit for object storage, it's because of that because we optimized those pointers. We put that in one footer that we can keep in cache. So just the next part now, I will explain you a bit how spans are stored in QuikWit. Okay. I have only eight minutes left, so I need to speed up it. In QuikWit, you can model things as you want. You can put any documents. But for span generally, you want to stick to the open telemetry data model. So what we did for this demo is that we used the data model based on the open telemetry data model. Well, you have a bunch of fields that are always there. And you have also some dynamic fields like resource attributes and span attributes. All those are very dynamic. So I put some random examples here where you can generate random keys and random fields. And here are the nice things that QuikWit is also shameless so that we can store every inverting index and the columnar storage. We can store all those dynamic fields without declaring which fields you have or which you don't have. It will index everything in the inverting index and in the columnar storage. So it's nice when you don't know in advance all your attributes that you have on your spans. So it's time for the demo. So for the demo, I prepared a demo for application monitoring. So my first problem was generating spans, traces that are understandable for this kind of goal. So I discovered recently a tool called, which is an extension of K6, which is like a project from Grafana for testing, for load testing. And there is a nice extension to generate traces. I will show you a bit or it works. And then I deployed a QuikWit cluster on Kubernetes and I did a Grafana instance to show you the results. So a word on the XK6 extensions. It's a nice extension just to, you can declare some spans. Like here I put some services like shop backend, ethical service. And you can declare whatever you want. So it's a template and then you can set the cardinality. You can set if you want some random attributes. So it's pretty nice to stress test and see if your engine can handle high cardinality fields, can handle like many random attributes. So it's pretty cool. So let's do the demo because I prepared it's live. Can you see it? Maybe I can zoom it a bit. So here I'm a heating or Kubernetes cluster and the index in which I'm setting traces. So we have approximately here now 341 million spans. So if I do a refresh, you will see this number moving. So now we have 355 millions of spans. So great. You can see that the uncompressed size of the document that are ingested in QuikWit have around 200 and almost 300 gigabytes of size. And that it's less, we compressed data a lot in QuikWit. So here it's around you can divide by seven on this example, the size of the data ingested in QuikWit. So here the size of Publish Plits is the size that is taken on the JStorage by QuikWit. So what can we do? So we are sending a lot of traces just to confirm that it's live. So here it's just a dashboard based on QuikWit Prometheus Matrix. So it's live. I launched it I think six hours earlier today. So I'm just sending traces at 11 megabytes per second. That's not huge, but it's already pretty decent because it represents around one terabyte per day. And you can see that it is running with only one indexer. And we have one indexer that is not really doing a lot of things. It is using between one CPU and two CPU here. The spikes are due to the compaction that we are doing because QuikWit generates a lot of splits. And so we need to merge them. We need to merge them so that we reduce the number of threads that we will do on the object storage. So great, it's working. It's live. And I can show you that we can search traces with it. So let's have a look. So I'm running a search query on all the documents. So you have a lot of them here. You can see that per minute here you have one million spans. So that's a lot. And we can focus on maybe I want to remove QuikWit traces. It should work. It's service name, I think. Okay, great. So here, as we are sending QuikWit traces in QuikWit so that we can monitor our own cluster with it, so now I'm selecting only the traces that I'm generated for this demo. So for example, we can look for query articles. So here, for example, what I can do is focus on one span ID. So as we have an inverting index, you can look for very accurate attributes. So here I took a span ID, but I guess we can do this on another. Let me check. I need to check article two card. So now I want to focus on this one because I added on this span particularly a high cardinality field for the demo. So you can see here that you have this random attribute with this random value that is here. So I can focus on it. And very fast I will get the results. You can see that this attribute value happens only very rarely. And it's with a search engine, it's even faster when you have this kind of query. So that's nice to search through spans, but generally you want to dig into one trace particularly. So you can use a Yeager plugin that is pointing at QuickWit cluster and it will return all the spans for the given trace. So it's easy to go from one span to the whole traces. But usually you want more from your data. And by that I mean you want to monitor your services. Looking at one particular span is very nice, but when you know what you are looking for, but when you don't know, it's better to build this kind of monitoring dashboard. So here, okay, I will try to go a bit earlier in the past. Yeah, good. So I have set different panels. The first one is like a data histogram and I'm counting HTTP requests. So for calling HTTP requests, I'm using span attributes. I can show you that. So here I'm saying I want all HTTP requests. So I'm using like the open telemetry semantic for that. And I'm saying, okay, I don't want all other spans. I want only those ones who have at least one HTTP target. And here in the second panel, I'm doing like a group by by status code. So I'm using again like a dynamic attribute that is the status code. And I'm doing a group by on it and building this data histogram. This one is a bit the same. So here instead of doing a group by on a status code, I'm doing a group by on target. So if you want to monitor your endpoints, that's useful. This one I can show you a bit more because here it's not the count. We are computing the sum of the duration of each span. So I'm doing group by on each target and I'm summing all the span duration. So you have a good feeling about the time taken by each of your target, your each endpoint. But that's not enough, right? You want something that is more useful. It's nice to see this, but it's not enough to take a decision like is there a problem or not. The last one is the P95 latency panel is more for that. So you can see that you have a nice, smooth latency, a P95 latency on all your services, except for this one. This is normal because I sent some traces with different latencies. So we will dig into that. So we know that maybe there is a problem here. The panel average latency here is just the average. I think you understand that. And this one is also interesting because I'm sending spans from two virtual data centers. So one is CDG and one is a BIHRU. And so you can see that most of the spans are coming from CDG here. So for example, if we want to dig into why there is a problem on the P95 latency, and I will stop after that. I will just show you that you can add your query here and say, okay, I want to look at all the spans that are over one second and one second, 0.1 seconds. And so you can see that only the endpoint article, so card is problematic. You can see that maybe there is a problem in the data center because you have suddenly more requests here. And of course, if you go up to two seconds, then you see that here the color has changed, but it's coming from this data center. And if you want to dig into one particular traces, you can open like and have a look at it. So that's it. I think I will stop there because time is up. If you have questions, come and see me. I will be happy to answer them. And also I have some stickers and hoodies, so don't hesitate. I'm happy to give you some.
What is CI/CD observability, and how to bring observability to CI/CD pipelines?
Okay, next Dimitris and Geo will show us what CI-CD system observability looks like and how observability in CI-CD system pipelines works. Yes, that's right. And hello everybody. Can you hear me? Awesome. So thanks for having us here. It's like actually the second ever talk we're having for them because the first one was three hours ago. So for you who were there, we really saw it's going to be kind of repetitive. But like, yeah, let's get started. So what is CI-CD observability and how to bring it to CI pipeline? So this is our abstract. We're not going to go through that because it's huge. We know so in a nutshell, this talk is about enhancing our pipelines reliability and performance by bringing observability at every stage of software delivery. So we're going to answer two questions like how we can identify flakiness and bottlenecks and try to envision a future of effortless visibility. We're also going to talk about the pivotal role that OpenTelemetry plays in the thing we're trying to solve. And then how we want to shape CI-CD's future and explore the challenges and opportunities ahead. A little bit about us. I am Dimitris, a software engineer at the Platform Productivity Squad at Grafana Labs. And I'm Gio, a software engineer in the Explore Squad. Yes. And Grafana Labs. Yes. And with that, let's go through the agenda. So we're going to start talking about what is CI. And this is with CI-CD systems. And then introduction to OpenTelemetry, some under conventions. Why is it important to own your data? And then where we are now, practical use cases and what's next. So we're going to start with a question. Like what is CI basically for everyone? And we're talking about the definition here. So definition of continuous integration may vary depending on who is the author. So for example, we have a few experts talking about that in two different books. And they are talking about something similar, but using different words. The only thing we can make sure about is that, we may be sure about is that, like, the word continuous is always going to be there. Because we're talking about something, you know, a never ending feedback loop, something which continuously run in order to identify flakiness, may help us like, you know, improve our processes and all that. So all that being said, we go to the next question, which is yes, but like what is CI for real this time? And as we said, CI can mean different things. Like it's a list of things basically. And for every person, for each person means different things. Like it may be a mechanism to reduce repetitive manual processes or like enable better project visibility in some cases or find the resolved flaky tests and builds and, you know, try to prevent paging people at 3 AM in the morning because, you know, human hours are important. So the next slide is how a typical CI CD pipeline looks like in a modern company. So I think this is the same for some of us here. So it's like, we start from testing, building to deploying. And by the way, we know that we haven't talked a lot about CD, but we believe that like CI CD observability should come as a whole. There's no CD observability without CI observability basically and vice versa. So we're talking about deploying to pro like waiting for 3 AM maintenance until we deploy to pro and then get error or downtime start panicking. And then when we resolve all of those things, you know, go and add them as assets in our LinkedIn profiles, like I'm an automation master engineer and all that stuff. So we're going to the next question, which is the most like important one. What is CI? But like for real, real this time. And what we're looking for is just a word. Anyone there to guess? Someone who wasn't in the previous room, by the way. No one. You dare to guess? I would say alerting. That's right. So CI is alerting. So CI and alerting actually serve a common purpose and we should see like CI and alerting as like one component. So when a if alerting is integrated into CI, we can catch issues early in the development process and like effective alerting within CI ensures that threshold bridges and potential problems are identified before deployment. So if we see CI is the left shift of alerting, we can have as I said, early detection in the development or shifting focus to proactive monitoring, preemptive issue resolution and stuff like that. As you can see in the picture as well, it's like they need to hold hands like for as long they are alive, I think. So for the next slide, so we're presenting a few things about what CI does basically and CI, we define CI like is the guard in early stages of the development. So it detects changes, it maintains builds health and constantly monitor system signals. So CI catches issues before they appear in production. So alerting on the other side. So this is not a slide to compare continuous regression alerting. We just want to show how like tightly coupled they are. So alerting is our alerting system like in later stages. So if something slips through CI, alerting should be there to catch that. But the most important thing is that we have to remember CI and alerting are not two things working in parallel. It's different components. CI and alerting is just one component and we should regard these as a whole. So we need like for alerting, we need to make sure to create actionable items. So if something slips through CI, alerting is going to catch that and we should have like enough run books, documentation or stuff like that to make sure that everyone can find the solution to a problem real quick. So where we are now and what is this whole talk about it too? So observability so far, what we have. So from manually trying to search through the logs and traces, trying to find the root codes and going from a GitHub or like GitLab or whatever you're using to your CI vendor which may be something different to Grafana or any other observability tool you may have, like trying to colorate to go from one to another for the same error maybe hard. So we want to make sure that you can, we can like create a centralized way of dealing with these problems. So if observability is so late like at that stage of the run stage of our development and deployment life cycle, we have limited visibility during early stages, difficulty in root cause analysis, increase, meantime to recovery and you know this is something that Gio is going to talk to you more about that and also you can miss optimization opportunities like it's really hard to say what I have to fix if I want to improve my build times for example if you don't have observability in your CI pipelines. So how it looks like now when something catches fire, there's this meme, we all know that meme, we would expect it to like you know someone to go there and like take out all the fires but instead what happens if you have the testing, if you don't have CI CD observability and alerting this happens. So it's too late to go and resolve anything, you may lose money, customers, you know how it goes. So if we want to get data out of CI CD we have to focus our shift a little bit to the left. So we have to focus our shift to build testing and deployment. This way we can like address issues before they escalate in production environment or we can enhance efficiency by catching like problems early in the process or have like enhanced system reliability and robustness and also it is a huge asset when it comes to cost reduction. So what is this amounted to is this slide which is my favorite. So the thing is that instead of having this fire and you know the developer like the engineer sitting in the middle and waiting until someone gets there to like take out all the fires we may be proactive. So we may be proactive and mitigate the fire before it appears and do something about that problem right. So we tried to make it easy for people. So for new joiners, for people who are not familiar with the CI vendor we use. So we tried to make it easy and neutral for everyone to go and use like CI CD observability output. So for that reason we used open telemetry. So we tried to define standards. Geo is going to talk to you more about that. We tried to define standards and certain patterns which are going to be used widely by everyone no matter what they really use. And without handing over to Geo to talk to you more about open telemetry. Thank you. I don't know how this works. I'm going to just... Whatever. Hello everyone. So yeah. Give me a second. There is. So first of all, one question. Here is familiar with open telemetry. Or of course. I want to give a quick definition for where it is not familiar. So open telemetry is a framework designed to create and manage telemetry data such as logs, traces and metrics. Of course the definition is way more comprehensive and accurate. It's on their website and I'm about to take a look. But for the context of this presentation I would like to focus on two bits of the definition which is semantic conventions or we can call them standards and owning the data regenerate. First of all standards. That's pretty much I think in topic. When you talk about standards we always think about the way in which we want to know about our data. Standards I think are a great thing or defining as a semantic convention and a standard. A great way of knowing what data you are storing. What to query for. What to query for the data. And semantic conventions in this context can be divided by, let's say in two categories. And we can divide them by signal type such as logs, metrics and traces or by the area they belong to such as generic semantic conventions, databases, exceptions or whatever. There is one thing that we think is missing here which is continuous integration and continuous delivery. And we think however that this is the perfect fit for this. Why? Let's take a look at three different CICD systems. GitHub Actions, Drone and GitLab CI. They tend to call things differently. I have highlighted here three specific bits which is duration, job names and outcome for three different pipelines. The idea here is that the data behind those three pictures is actually the same. Duration is duration. The job name, I guess here Drone calls it stage or job, I don't know. Like there is a different name. Like every CI tool calls something differently. Same thing for the outcome. Status can be outcome, can be, I don't know, I'm not good at English, but it's the same data. And by defining standard we can define a single common way of referring to this data. So what happened was that at some point I wanted to investigate why a test that we had in one of our pipelines sometimes was taken three minutes and some other times nine. Which didn't make sense. There was no code changes. Like rerunning the same pipeline, the same job would result in different times. So yeah, I started writing some Go code. I'm a front-end engineer by the way, so not. But it was pretty good. No? Okay. It was. It did the job. It did the job. And I was able to get the data we needed from our drone CI system into Lockheed, Ampo and Minere. It worked. We were able to ask the questions we were curious about. Why that was happening? We found something. But before finding these answers, I shared the news with my team. I guess they got a bit excited about it. And what I mean by that was that another colleague was asking me to write the same thing but put the logs into elastic search. Another colleague of mine was trying to want to get the same things for GitHub Batch instead of drone. Yeah, I mean, like, I'm very good at Go. I don't know. I didn't feel like there was a very scalable way of doing that for me. The question here is, how can we prevent ourselves from writing the same code that does the same thing but for different systems and for different databases? And it may sound like a silly question. None of us, I think, uses 16 different databases to put their metrics, logs, or traces in. The point is that if we ever wanted to do that as a community, like not as a Grafana, not as me, we would have to write, I don't know, I didn't count them, 80 different things only for those databases and only for those three CI systems. And yeah. No, there was no way I was going to do that. One of the things I want to talk about is the importance. That is important. There is a bit of the definition that we didn't talk about, which is owning your data. Now when I started thinking about owning my data, what I always thought was having ownership of the hard drive that I was going to be stored in or having ownership of the machine that my database was running on. In reality, I think that this definition was a bit beyond. This definition goes, I think, in deciding what to do with the data you generate, where to store it, where to send it. We can very well be using a cloud provider. The idea is that I decide and I know where my data and what data I'm storing where. Open telemetry helps with that. Open telemetry defines standards, defines a way of doing this. It's a common language, if you want, to send data everywhere and to get data from basically everywhere. Now what we built was, in this case, an open telemetry collector distribution, a very small distribution that was meant to solve our specific use case. And specifically, we wrote a drone receiver, which was listening to a web book that was written by a drone that would receive data about completed pipelines and generate traces, logs, and metrics and then export them to logitm.com and Prometheus. The idea here is that we didn't have to write any of the other components. Those are all open source components that are available. They work with everything. We didn't have to write any of that code. We just had to configure it. There are some other practical examples, which is Jenkins plugin, another GitHub action that generates trace data, and our own one if you want the link. Now in practice, I try to be fast because I think we're running out of time. What does it mean in practice? What we were able to do by implementing CSE observability in our own pipelines and in our own software delivery process and software lifecycle in general was to identify a flake test that we were having in Grafana that was sometimes popping up. We didn't know why and what was happening. By pushing the logs of our CSE system into logitm, we were able to identify where and when that logline, that test output, appeared for the first time. From there, we found out that the build number on-drawn, from which we jumped to the GitHub action that for the first time caused access to fail, which meant that we were able to solve the actual issue. It was some code pushed to someone else, totally different test that apparently was, shouldn't have caused any issue, but it caused this test to be flaky. Another thing we did was we built our own custom experience of the Grafana. First one is just a, let's say, fancier way to look at CI logs. Second one, I think, is a bit more important and that's something that Dimitri was using in his job, is still using, which is monitoring at any given time which one of the build, of the branches, as a failing build. Why that? Because in Grafana, we have to support quite a lot of different release branches, when we have security fixes to make or when we want to release all their versions of Grafana. We need to be sure that our builds are passing because not having them passing means that we have to spend time first fixing the issue and then doing the release. By having an alert on these status, we were able to know beforehand when something was going to fail because the build itself was failing and prepare ourselves to fix it beforehand. So, what else it unlocks? Really, a lot of things. The idea here is that you own your data. Owing your data means that you can ask the question you want, ask the question that you feel are right for your use case, which can span from security, flakiness detection, performance improvements or door metrics. I don't know. Honestly, we don't know. We feel like there are all opportunities here that can be investigated. But we feel like this is adding a lot of value. All in all, we liked having everything within Grafana and production metrics and NCI metrics. Now, what's next? This is, as I said, very early stage. Our proposal got accepted by the Open Telemetry Technical Committee. We formed a working group. We want to define standards to work on and to move forward. So, please join the conversation. We would really like to hear different use cases. So, yeah, reach out on the CNCF Slack channel. And the second link is the Open Telemetry Proposal for CI CDO observability. Yeah. Thank you all. Thanks to you. Thanks, Dimitris. Do we have any questions so far? Can you please wait? We have still five minutes for Q&A. Okay. Are there any plans to release this feature on Grafana for public? How much do we have to wait to use this kind of stuff? Well, I think... So, the thing you can do right now is just use Open Telemetry, maybe take what we did. The thing we did with Collector Create, your own Collector, use our Collector, use the thing that exists already for GitHub or Jenkins, and then make sure that you export all the valuable information for you. And then you can still use Grafana as your visualization tool, of course. But like I said, plug-in, it's not ready yet. I mean, all these things came up like a hackathon project, essentially. So, we are still unsure. But you can do it yourself. That's my... If you set everything up, if you set the Open Telemetry exporter and stuff, it should be piece of cake. Any further questions? Does the collection of Metric itself impact on the CI? I didn't get the question, sorry. Does the Metric's collection itself impact on the time execution of the CI? I mean, if it has a performance impact, oh, okay. Our doesn't, because the way it works, it's a long-running process that waits for a CI to complete and listen to a webbook. So, the whole thing happens after and in... I think it's even a completely different server machine than the CI system. So, it's not really... it doesn't impact anything. It really depends on your CI system and how you collect those Metric's logs and traces. So, in our case, no, but it really depends on how you implement the collector. One more? I don't see your hand. And that's it. Thank you.
Introducing Observability to an airline
Hello everybody. Can you hear me? Is it working? Awesome. This is my first talk, so I'm just going to do this because hello, FOSTA! Yes! Great to hear some energy in here at this time. Right, so my name's James and my major client currently is a, so my major client, my only client currently is a major European airline. Get that right? And I wanted to talk to you today about some of the challenges that we're facing in introducing observability to that client, a framework that I've kind of put together to overcome those challenges and some thoughts that I have overall about observability. This talk should be applicable to any big organization. So there's not really anything that's specific to an airline, but if you think about the scale of not only the size, but the amount of different tasks an airline would be doing and the kind of vintage of most major airlines, you'll kind of get an idea of what we're talking about here. By the way, just as an idea, who here works for like a company that's got more than a thousand people in it? Okay, fair enough. Okay. And how many of those people are actually using observability on any scale? Okay, some of you, awesome. You should be doing this instead. In this talk, I want to walk you through three steps I'm taking to introduce observability. One, I'm calling an observability transformation. We're not going to be talking about anything too technically exciting here, and we're certainly not going to be talking about anything like introducing observability to like the cockpit or anything like that. This talk is about helping you get your company or client or whoever else on board with observability. It's about making that transition successful and it's about making it sustainable. And of course, the associated love and adoration of your peers for making their lives a whole hell of a lot easier. So, first thing I want us to do is align us on what observability is. So, that'll be easy. Does anyone want to, I tell you what, we're running late, so I'll just tell you what observability is. Firstly, I think that what we've got to remember when we're talking about observability is that a lot of people don't really know what to think of, but they're probably thinking of something like this, like a big ten-foot view of everything that's going on. Obviously most people in here won't think that that's observability. Why not? Can anyone say, like, is this observable? Is this an observable system by our definition? This is what I think of when I think of observability. And when I speak to anybody that, you know, may be lay technical or non-technical, this is the kind of thing that I'll introduce to them. I know that I'm putting a definition on something that, you know, and that's a little bit controversial, but this is what I think of. So, this will help you kind of ground in what this talk is about. So, you can imagine, like, as we went through that previous thing, like, there's this cake being made. And so, you know, I can describe quite easily that previously, with a, like, monitoring process, we would monitor, get the metrics and the logs from each individual component of that system. But now what we're going to do is we're going to collect the request for a cake through that system. And this has some clear value if we start talking about this. There's this other way of talking about it that's like, you know, observability is how we understand something by the internal something, I can't say, but it doesn't really kind of get across the value to people that may be a bit skeptical about this. And I think that this kind of does. So, let's just pocket that idea for a second. This idea basically describes observability as recording work done to satisfy a request. So, a request is completely observable when you can see all the work done to that request, and a system is completely observable when you can see all the work done to all requests moving through a system. This to me is much more tangible. It does tie it specifically to requests or events. However, I do note that when we talk about observability and making long running processes observable, most people try and arbitrarily or otherwise find ways of cutting them up into individual traces anyway. So, I think that this is fairly close to, like, how we're doing observability in practice. So, in my view, an observability transformation fits alongside other transformations which, when done right, leads to much more productive organizations. So, with Agile, we move from waterfall to more incremental development. With DevOps, DevSecOps, all of that, we move from silos to more cross-functional teams with cloud, like it or load it. We move from buying things up front and hoping that they were the right things to buying things on tap as and when we needed it. So, with observability, we're really talking about moving everything 90 degrees. So, instead of observing individual systems, we're going to observe requests as they go through them. This should also act as a warning. Just, who has gone through an Agile transformation? Keep your hand up if you think that that went really, really well. Yeah. And I'm using this word very, very specifically because this is another thing I want us to pocket as we're going through this. You do need to think of this as a transformation and you need to think about the kind of pitfalls of other types of transformations and how to overcome them if you want to introduce observability to your company, client, whoever. Okay. So, we're all aligned, please, on what observability is. We know we want it, but we don't get to decide. So, we need to think about who we need to convince. Although you could probably get away, especially in smaller or more agile companies, we're just convincing a couple of people and going ahead, often with this sort of thing, you're going to have to convince a lot of people. And so, this is me capturing three broad groups of stakeholders here that you're going to want to convince if you want to bring people along with this observability transformation. And you want to get everyone on board because if you only get, for example, the C-suite on board, like the higher ups, if you like, on board, then engineers will just make your product fail, make your transformation fail so that they can get back to their work, like with any other thing. And then management will just say, right, I've just lost a load of productivity, we can solve this by getting rid of this observability thing. Similarly, if you get your engineers on board and they keep pushing towards it, you'll land up with them being burnt out because they're not being given the time and the resources that they need to actually make it work. So, it's worth thinking through very quickly here, wary of time. I can spend ages on this slide, by the way, because thinking about stakeholders is really, really interesting, but I'm just going to pick up a few highlights. As an example, anyone here a skeptic would describe themselves as an observability skeptic? I'd imagine, maybe, do you have any reasoning? No, that's fair enough. But it's worth noting that even in here, and I think that there's lots of people outside, the thing I compare it to is kind of transforming towards test-driven development. A lot of places will introduce test-driven development and the way that they'll do it, for example, is their experience will be that some manager somewhere will insist on 100% test coverage. So, they've gone through that, they have to do all these ridiculous things to jump through hoops to get this transformation to be complete, and then they come out at the end of it saying, well, test-driven development's crap, we're not doing this. They managed to get rid of it and they managed to dump it. So, you might think that of these three people, the engineers would be the easiest to convince, but there are lots of people that are out there that have gone through three or four of these now and really need to be sold on whether this is going to help them. So, really, don't think that they're going to be automatically on your side just because you're convinced. Also, I'll note all the disagreement that we have just in this one conference about what the best tooling and the best approaches are anyway. Quickly on things like management, management will want to be convinced that it's not going to break down productivity. One example I'll give when we're looking at, for example, higher ups like the C-suite, they're going to be interested, you're going to be asking them to spend money because you can't just say, oh, we're going to do this, you want to actually resource a team. With my client, what we did was we actually went through the outages over the last 12 months and we did some estimates, we said there are estimates and we caveated like what the caveats are. We went through and we worked out how much time we think would be saved on outages, on each of these outages, if they had good instrumentation of their code and if we could identify the issues more quickly. They could go away and they could calculate that in terms of a cost which they could use to justify it. So, don't forget about your stakeholders. One thing that you didn't hear is in all of that, is what tool to use. That's because, sorry, everyone that makes a tool, it largely doesn't matter at this point. People want traces because they want less downtime, they want more clarity, they want to capture lost revenue or whatever else. But you can do that with pretty much any observability tool right now. So, the one thing you don't want to do as part of convincing people is to try and sell them on a specific tool. That can come later. In my engagement, we're focusing on tempo. And the reason that we're doing that, I'll introduce some of the other reasons in a bit, but the main reason is because we always use Grafana, we already use Prometheus, it slots right in. And we don't really have to discuss it much. There's another thing which is because tempo is open source, we don't have to involve as part of selling this project, a new vendor, and new commercials and stuff like that. So, open source to the rescue with that. But really, you want to get your project approved so you can go and start instrumenting code. Last thing I'll say on this is team topology. This is an example of the sort of team that I'd expect to go and start an observability transformation. You'd want, I prefer smaller, more agile teams. So, you might look at this and think, well, based on my business, I might have two or three of your software engineers, two or three of your operations engineers. That might be an anti-pattern. You can go and look up all the reasons why bigger teams tend to do work more slowly. I'm not going to cover that now. So, I'm looking at a kind of crack team. Software engineers are going to get in and go and instrument the code. We've got an operations engineer that's going to make sure that we clear the pathways to actually get those spans out into tracing databases. And finally, we've got somebody that's kind of in a product owner position that's going to protect that team, make sure that they're not answering inane questions all the time. And that is also going to be working with the business and with the other product delivery teams and the platform team and whoever else to make sure that concerns are raised, that they're heard, that they pivot when they realize that actually they've made a mistake. So, that's an important role as well. But remember, this is a transformation and we're trying to do new things. So, we're changing cultures here. So, you do need to be responsive to feedback and you need to be responsive to feelings. Otherwise, your engineers here are going to make the best system that never gets used, which is another pitfall of transformations. Okay. Those are my thoughts on convincing people to do an observability transformation. Now, let's imagine you've got the thumbs up. Let's move towards implementation. Most important thing is to not get bogged down in the details of the infrastructure. You need to move to instrumentation. But, you are going to have to need some sort of tracing database. You are going to need some sort of tooling. If you have something already, so for example, if you're already using a provider of some sort and they have it, then great. Consider that. However, one of the ways that you can make sure that these things move faster is by moving your tracing database into where the data is that you're collecting. You think about big, old companies, big and or old companies. They get really nervous when you say, right, we're going to collect all this data and we're going to go put in this cloud provider over here. Now, that can take months to agree. And so what you can do is you can short circuit that, start that process, start discussing how you're going to do this. But you can also at the same time move your tracing databases into maybe the accounts or the cloud provider that's actually already been agreed to use this. There is a downside here, which some of you might be thinking is, well, doesn't that mean, James, that you'd have maybe multiple tracing databases, which means that you wouldn't have all your spans in the same place? And that is true, but it means that you can move on to instrumentation. It means that you can move to the point where you have like maybe two traces that somebody has to look to, and then you can get other people in the business to say, hey, wouldn't it be useful if, and then you can start having the discussions. Don't try and boil the ocean on these things. And we're being pragmatic here. So as an example here, this is if your client is in AWS, you can quickly get Tempo. There's a good article on the Grafana website deploying Tempo on Fargate, which means that you can get that up nice and quickly. So again, that's an advantage of using these things. And more importantly, you can deploy it. You can find out it's the wrong thing to do, and you can go do something else. And it's a great thing about using these open source tools is you can really work it out as you're moving. With that in mind, get instrumenting. And know that to start with that team that I put together earlier is going to be doing a lot of the work themselves. Automatic instrumentation is your friend. Get your software engineer to go and find the code bases that are across the system, especially on your hotpaths, and start raising PRs to auto instrument. You know how best to do these in your company. Some companies, they want to start the conversation with a PR. Sometimes they want to start with a meeting or something like this. But getting auto instrumentation in to these code bases will mean that you will start being able to build up the shallow layer of these traces. Then if any teams start becoming interested in this, opportunistically pair your software engineer with those teams. Pairing mobbing is a great way to share knowledge. Remember, a lot of these software engineers will not have done this kind of thing before, and doing it's kind of hard if you don't know how to do it. You don't want them to get frustrated to throw in the towel and say, no, this is dumb. This is hard. This is not the way we used to do things. Whereas if you can put your software engineer in with a pair as a pairing or a mobbing situation, they will have happy times and everything will be lovely. Also, make sure that you point out the value when you see it. It's very easy for us to see these things and to go, oh, it's great. And so obviously it's great. But this is new to people. So point out the 10% of their queries that has like this weird choke point. Point out all these advantages you're getting from this instrumentation and from all these spans as you're collecting them. Find, when there's an issue, when there's downtime, get your team to go and see if they can race the people that are doing incident response to finding where the issue is based on the tracing. Once these teams realize that they can see through walls with this stuff, they'll soon start instrumenting their own code. But you need to get them to look. Another trap is to get bogged down on the problems that are harder to instrument. Airlines and banks and other places have a bad habit and that bad habit is Fortran. Or like Zidark or some mainframe thing or whatever. If anyone here, has anyone here just put your hand up if you do any development on like COBOL, Fortran, anything like that? Awesome, awesome. If you go an instrument something like that, please come and talk about it. That sounds awesome. That sounds like a lot of people are going to talk about this one. I'll be fascinated by it. But if you're doing this kind of project, now is not the time. Something like that is not really going to correct me if I'm wrong, anybody out there. I don't think there's any instrumentation for Fortran code or anything like that. Treat it as a third party system. And also don't try and instrument other people's code. I've seen this happen. People will go, right, okay, we've got this third party and it's this third party code that we deploy. How are we going to instrument that? Do not instrument the stuff that is there and then accept that you're going to get to a point where it's going to roll over to logs and metrics. If the tool that you're using allows you to be able to connect up logs and metrics to your traces, that's really handy because remember in these big organizations, you might never reach the golden sunlit uplands of traces for everything. So you're always going to have to go back. You can think of it sort of like fast travelling through the infrastructure as that you are not going to be able to get to the point where necessarily you're going to be able to get into the point in the Oracle database that you're really trying to kill that has actually had this problem. But you will be able to fast travel to the bit in the code where it makes a query to an Oracle database and then you'll know which logs and traces to look at. So the goal really is for wide coverage, especially of hot paths. And that brings us to another thing which is culture change. So you've been working on this for maybe six months or so. It's fairly short projects. You've gotten traces. You've got end to end on many of the request pass through the systems. People kind of get observability now. So those three people will come out with a few others and build an observability engineering team, right? I would say that for most organizations, that's the wrong way to do things. There are companies for which observability engineering having separate teams stood up for that kind of thing does make sense. But for most places, you're really going to be looking at creating this kind of. This is one of my favorite slides ever, which is weird. I have weird favorite things. But this talks about like a DevOps transformation where what you do is you create a DevOps team and the best DevOps team disappears after like six or twelve months because what it's done is it's created this thing where they come together. And you should, you know, this is a valid way of doing things for observability as well. Ultimately, you may have an enablement team. However, instrumentation should be being done by devs as part of their day to day work The tooling needs to, oh five minutes. Oh, slow down. Enablement should be sharing best practices and doing training and stuff. The tooling really needs to be absorbed into an existing platform team. And this is the really cool part is that now, if you think about it, you've gotten to this point where you've got all this instrumentation into your codes. You can start thinking about what kind of tooling makes sense for your organization. Whereas when you started, it's very hard to do. That wasn't five minutes. Okay, I'll stop. So, yeah, if you've done your job well, hopefully these people won't need you anymore. And you can go and absorb back into teams and you can call that project complete. You might be able to do some kind of enablement team. But as I said, that wasn't meant to have a question mark at the end of it. Go and effect change. I'm going to end it there. There wasn't much time. I've got so much more I want to talk about on this subject. So I might do a follow up thing. If you have any questions, I'd be happy to answer them and you can find where to find me at that website. Thanks, James. We have still five minutes for Q&A. Some questions here. Okay. There is one. I answered almost everything. Hi. How long has it taken for you to convince a big org and an old company to move from no observability to some sort of observability? Completely convinced. I'll let you know. So I maybe joined back in May with this client and was helping them with a previous project that was getting wrapped up. So I'd say it's been eight or nine months working on other things and identifying this as a need where it's been working through. Yeah, it can take time. So because you've got all, as I said, you go back to that stakeholder slide. I could have spent a whole 20 minutes just on that because you've got to kind of get everybody aligned. I've done lots of like meetings. I've shown the people off, shown things off to people. I've shown off all these slides and stuff, gotten everybody on board. And yeah, so I think by the time, you know, I'd say that everyone's actually in lockstep, probably about now actually. I should say though, by the way, is we didn't, you know, just not do anything until that point. So there's been lots of opportunities to like seed things as we've been doing other work as well. So yeah. All right. Some questions? No, then thanks, James. Thank you. Thank you.
Netdata: Open Source, Distributed Observability Pipeline - Journey and Challenges.
So the next presentation is from Costa about net data and open source distributed observability pipeline journey and challenges. A plus. Hello guys. So how many people use the data in this room? Oh, a lot. Okay. So guys, I have to admit something when I when I get pissed off, I write code. I know others go to the gym. But for me, this is how this thing was born, actually. So I was migrating some infrastructure from on prem to the cloud to one provider. Doesn't matter. And we had a lot of problems. It turned to be a cloud provider bags, not our bags. But after spending a couple of million euros in observability, building a team of eight people installing every tool that exists out there. This was back in 2014. I concluded that everything that I had built, I was a level executive back then to a FinTech company. I concluded that everything that I had built is an illusion. So the entire everything every observability tool that I tried is an illusion. Is that make me feel happy? It doesn't actually troubles with anything. So I started thinking, OK, why they did it like that? Why they are not real time? Why cardinality or granularity is such a problem? Why, you know, you have to set up everything. By bit, you need months and months of preparation in actually to build something useful. And I started experimenting. I started writing code to see can it be done differently and what this different would look like. So I ended up, I started the first committee in 2014, it is in 2014, it is in GitHub. After a couple of years of experimentation, I released it and boom, people loved it. So what is the data today? Today, the data is a monitoring tool that you install on a server and it is a monitoring in a box. So it has all the functions of a monitoring pipeline. It discovers metrics, collect metrics, stores metrics, machine learning for metrics, to learn the behavior, alerts check, health checks for metrics. It exports metrics other times on databases. It streams metrics and replicates metrics between the data agents, of course visualizes, query engines, everything. So it's an application. You install on all your servers and each application is a monitoring. So if you install the data today on an empty VM, an empty, just go to AWS, get an empty VM and install the data. What are you going to get? We are going to get this. 200 dashboard charts, 2,000 unique time series. Five unique alerts monitoring, more than 100 components. As a viewer, as a logs explorer for system to journal, a network explorer for all your connections, unsupervised anomaly detection for everything so it will learn the behavior and trigger anomalies, detect outliers. You are going to get about two months of retention using half a gigabyte of disk space and all of these run with one presence CPU of a single core, 120 megabytes RAM and almost no disk IO. Now when you install an agent, you understand that since each one of them is standalone, you have to switch, go one by one to see what's happening. But then we build this. So the same software can become apparent. So when it becomes apparent, it's easy. You go to the others and say, stream there. And you give the IP of the other thing and then up it again. So the moment this happens, the parent now can provide alarms, machine learning, dashboards and everything for all the infrastructure that is behind it. But we didn't stop there. You can have as many layers as you want of this. So you can have data centers that they have apparent cluster and then regions, you know, a bureau for something that has another cluster of parents, et cetera, et cetera. This scales well. So we tested the data apparent with 500 nodes and 40,000 containers. Against Prometheus. We did the same with Prometheus in parallel. The data used one third less CPU, half the memory, 10% less bandwidth, almost no disk IO. And in the same store as footprint, we gave them three terabytes. We managed to have up more than a year of retention. More than a year of retention. Of course, it is in tiers. Now in December, the University of Amsterdam did research about the most energy efficient tool. They tested, of course, Prometheus and the data, et cetera. The data, they said, being the most energy efficient tool. And what happens in CNCF? The data actually leads the observability category in CNCF in terms of user's love. Actually we don't want to incubate in CNCF because of their policies. We believe that the data should be something different. So we never applied. So to build this, we had a number of challenges. The first challenge is that we wanted everyone, everyone using the data, to be from zero to hero to day. So you don't have observability. That's okay. You're installing a data and you have observability today. How we did that? The first is an observation that we all have the same kind of infrastructure. It's a finite set. We use some visual servers or physical servers. We use packets, applications, databases, web servers, message brokers. All these are standard components. We use similar disks, similar interface. Everything is similar. Of course, the Lego we built is different. Each company is unique. But the components are pretty much the same. So what will happen if we start monitoring the components? What will happen if the monitoring solution knows when a postgres is healthy? When a Kafka is healthy? When an NGNX, an HA proxy? What if the monitoring system itself knows if the networking stack of the Linux of this Linux is okay? And how to monitor it and how to present it and how to visualize it? So the observation that we have a lot of common actually allowed us to think a little bit different. So how many work with Prometheus here? Okay. So did you know the Prometheus, the OpenMetrics format? So you have a metric and then you have some labels and the value that changes over time. That's the thing. What we did there is actually to say, wait a moment, not all labels are the same. The labels that depict, for example, that this is reads or writes on a disk I.O. are not labels. They should not be labels. They should be something different. And we came up with this needle framework. This needle framework says that there is context. The context is the metric name in Prometheus, exactly the same, the metric name that has before the labels. We have instances. This does not exist in Prometheus. Instances are all the labels except the ones that get time series values. So if we're discussing about disks, it's the make of the disks, the type of the disks. This is the number of the disks. All these are labels to enrich an instance. The reads and writes of its performance are time series. So what this allows us to do is the following. The first thing is that it allows us to go and configure test boards at the context level. So we don't care about the metrics anymore. In order to provide a meaningful dashboard, we only care about the context. Then it allows us to configure alerts per context which are applied to all instances. So we can say, hey, apply this alert to all NVMe disks. Apply this alert to all Postgres databases. Apply this alert to all Postgres databases before version X or Y. So the first thing is that this thing allows us to have a fully automated visualization. You don't need to configure visualization with that data. It comes up. The same with alerts. They come up. We see 344 of them. So the moment something happens, boom, an alert from the data will come up. I think this is not important. It's how if you see it from the disk perspective or from the container perspective or from the database perspective, how the labels are matched, it's not that important. And this is internal information for us. So this allowed us to have, as I said, fully automated all of this. And the data is the only monitoring tool that you can install mid-crisis. So you have an incident happening now. You don't have observability. You install the data, alarms will fire up. The dashboard will be there for you to investigate. The second challenge was, can we get rid of the query language for slicing and dicing the data? So in order to do this, we provide, you are going to come on a metadata board and suddenly you are going to see 2,000 metrics. And you are not aware of them. You just installed it. So how, what can we do so that you can make sense of this? You can see the dashboard and say, OK, I understand. I understand what I need to do. So this is a data chart. On the charts, we did the following. Of course, there is an info. What is this? But then we added a number of items on the chart to allow you slice and dice and see anomaly rates, you know, make sense of that thing. Now each menu that you see here has an analysis. And this analysis allows you to see, for example, for the nodes, allows you to see all this information that is described below from the same view. This is not another query. The biggest change that we did is that the data calculates this for every query, for every, so it's a cube. Think of your data as a cube. You can see them in different ways. What we do is that we provide a time series for one of the views, one of the sides of the cube, but we calculate some numbers for every possible pivot that you can do. So that by just browsing the menu, you can understand, you can grasp what the data say. The same is with group by. So if you want to slice them, we do exactly the same. We give you all the possible options that you can slice by just dropping a drop down menu, two clicks, boom, different cube, different view on the cube. And the data, and this is how the initial problem, do you remember, that was the migration I had? The migration I had was that the bug at the service provider was that they were stopping the VMs to perform some of their updates, but they were stopping the VMs in little increments. So suddenly, one Friday, all the VMs were going in slow motion because they were stopping starting, stopping, starting, stopping, starting like this. With the data, I was able to find it because every data collection that is missed in the data is a gap. So it is not in the chart. Grafana, for example, told the other tools, smooth it out because there is no bit. In the data, there is a bit. I have to collect this every second. If I miss one, it's a gap. I didn't collect it. So another of the challenges, I am going fast because I have only five minutes, guys. Machine learning. Google in 2019 announced this. So they said, come on. We cannot make machine learning work for observability. So all our ideas are bad. And as they say, Todd says here, we should feel bad about it. So the data does the following. Trains for every metric collected, trains 18 different models. These models cover the last few days of the metric behavior. So for the last few days, the data knows how the metric behaves. We machine learning. Now what we do is that every time that we get a new sample, we check these 18 models. And if all of them agree that this is an anomaly, we say, OK, this is an outlier, guys. This is an anomaly. OK? Five minutes, yeah, I know. OK? So what this allows? I'm going fast because there is a scoring engine inside the data. So scoring engine is an engine. You don't query for time series. You say, hey, I want to find the similarities in the metrics. I want to know the trends in metrics. So can you score them all for me, please? All. It doesn't matter how many I have. So to give you an example, this is a data dashboard. There is a menu where we categorize everything. So it's an infinite scrolling dashboard. One chart below the other. Of course, there are many tabs with different views. And you press the AR button on the top right. And there is an anomaly rate on every section. So you can see now where your anomalies are. And we have a host level anomaly. So look what happens here. When anomalies happen in our systems, they happen in clusters. Look there. It's five nodes. It's 10 nodes. It happened together. This every line is a node. Look there. They happen together. And not only that, this shows the percentage of metrics in a node being anomalous together concurrently. So you understand that NEDA can detect not only ifs a metric is anomalous, but here we can see that it can tell you the spread of the anomaly inside your systems, inside each system, et cetera. And then you highlight the area of interest. NEDA gives you a sorted list of all metrics that are anomalous within this region, this time frame. So the idea here is that you're a ham moment. Or what DNS did that? Storage did that? The ham moment to be on the top 20, 30 items. Another mission accomplished. Two minutes I know. Logs, I will pass through. NEDA is a fund. We base everything on system in the journal D. We have built our own logs management solution like log key and elastic and the likes. We zapped it. We zipped it. Zipped it. How is this story? Don't use it anymore. And what we do is we rely on system D journals. Now system D journal has a lot of unique features guys, especially if you are developers. The ability to index all fields independently of the cardinality. Independently. It doesn't matter. A thousand fields, different values on every log line. Indexed. Okay? This is unique. You can make troubleshooting amazing like this. We have a UI similar to Grafana, similar to Kibana. And actually while all these tools you see on the screen, it's 15 million entries. The title. 15 million entries. Logs. All the others will sample 5,000 at the end. We stop sampling at a million. It's fast. We stop sampling at a million. We submitted patches to system D to make it faster. We managed to bypass all the deficiencies of system D for old systems. You will give me two minutes. Okay. It was 50 seconds. I have another one. So, okay. We did a lot of work. We provided a tool, look to journal that can be used to feed to system D journal. Your HA proxy logs, your engine logs or whatever, log FMP logs. If you have a go application, something look like this. Boom. Push it everything in. And of course there is a weakness. It's a system D journal. And this is mainly because it's reliable. It's fault tolerant. This is what has been designed. It's secure so it seals the logs. They cannot be tampered, etc. The key problem is this storage space that it needs. It needs a lot more compared to Loki or I don't know with elastic, but for Loki that we tested, 10 times more. And you can make it a little bit better, 4 times more if you use a compressed file system. So another mission accomplished. That's the last one. Wait, I didn't see what it was. Yes, and that's another beauty that the data provides. So observability is more guys than metrics, logs and traces. So the moment you run your thumb infrastructure and there is an issue, metrics, logs and traces usually are not enough. You may want to see the slow queries the system has. Database has. You may want to see the socket that an application is currently connected. You may want to see different kinds of data. Most of the observability solutions give up. They say, OK, open the console. What can we do? Go to the console and figure it out. What we did is this. So every data collection, this isn't a data. It has plugins, data collection plugins. Every data collection plugin exposes functions. It says, hey, I am Postgres collector and I can expose, give you slow queries. Hey, I am a network viewer and I can give you the list of circuits. So we built the protocol in such a way so that you are there. You go to the grand, grand parent and you ask for something from this node, this function to be run. It does. So actually the whole slogging system that we have, system-less system in journal, is built like this. There is a system in journal plugin there that exposes logs. And this is a network viewer. What we did here is very simply. You have sockets. As you can see, it's 30,000, 31,000 TCP4 sockets. The key problem was how you make sense. So what we said it's OK. From there is the internet. From here are private IPs. Above are clients, down are servers. The position plays a role. So you can immediately see what connects to the internet. And this is live. So if this was not a screenshot and you SSH the server, you would see the SSHD boom, moving. This is live. OK. Our monetization strategy, the data is open source. This is what we sell just for clarity. So we sell infinite horizontal scalability. We sell role-based access. Access from anywhere because it's on-prem and on-prem solution. Mobile app for notifications and dynamic configuration and persistent customizations that we have a SAS offering that does it. Of course, the same SAS offering is offered on-prem for a price. That's it. Thank you very much. Again. Questions. Thanks, Costa. Do we have any questions here? I think that is fantastic. I've been using it for a while and I really, really enjoyed the experience. A while ago, I was working on a massive experiment. I had 700 machines that had to throw gigabytes of information. So I started to look at NetBe and basically went to the grandfather solution. And then the monitoring was something that wasn't there. I mean, it's a SAS service. It's not open source that part. Do you think that you're going to open source also that part? Wait. The first thing is the open source agent and the SAS offering that we have have the same UI. It's the same thing. It's on the different UI. What the data cloud can do is instead of collecting all the data, if you go to data log, you push all your data to data. We don't do that. So your data are inside your data inside your servers. What the data cloud does is that it can provide you a unified view by querying all the relevant in the data. So it queries them, gets the data, aggregates them, and gives you the same view a parent would give you. Okay. Without having the data. We rely on your agents being alive for this. This is why we have the streaming. So when you have a femoral service, a Kubernetes cluster, the node may vanish. What you do there is you need a node outside this, outside the Kubernetes cluster to have the P2P parent for your data. This is where your data will be eventually. This allows you to have very small retention on every server, even disable a mail and health checks, etc. Disable them on your production systems and have a cluster of parents. Parents can be clusters so you can have two or three or four or how many you want in a chain so that you can loop them, replicate one to another. And they will find out if something is missing, they will find out by themselves. Okay. We have time for one, another question. Thank you. I would like to ask you which issues have you experienced when scaling up on several systems or centralized things? Which issues have you experienced when you scale up to several systems or other centralized things? As with every software, there is a maximum vertical scalability that can sustain. The data as you show is better than Prometheus. Actually, you can be in almost two times in some cases in terms of memory, one third in terms of CPU, etc. But still, there is a gap. The power of the machine, you can give to it. Okay. Now when you need then to scale out, what we did is that we split a query, a number of calculations, has to do a number of stuff in order to give you this little chart there with the drop downs and the likes. What we do is that we have the agents to have the work and then we have a central component that does the rest of the work. So we managed to split the query engine in two parts, one that runs at the edge and one that runs centrally. Still there are limits. If you go and put, I don't know, 10,000 and a data agents in one room and you want one chart with 10,000 queries, you understand it's going to 10,000 queries are going to happen. So it may take some time. But still what we found out is that the combined power of your infrastructure is better in most of the cases than having a very big server. So even if you query 10,000 servers, there is some latency due to the round trip delays, et cetera. But the horsepower that you have, each server will do just a tiny thing. Oh, this is my part taken. This is my part taken. But this means that the heavy lift of the query is spread to 10,000 servers. You get it? So it becomes more scalable, faster at the end of the day. All right. Thank you.
17-year journey of the Mozilla Support platform & its community
Like yesterday. But yeah, today I'm going to walk you through the 17th, probably. I need to add Sweet there. So it's like Sweet 17 years journey of the Mozilla support platform and its community. And today I'm not alone. I will be with Smith from the platform team and also Tasos will also join us to answer some of the questions that you have around the platform. So without further ado, I think I'm going to just start with the agenda that we're going through today. So I will be first, walk you through the inception, like the story of how the Mozilla support came about. And then I'll talk more about the community. And last but not least, but Smith will walk you through the platform and what are the tech stuff that we're using and how you can contribute to the platform, to the code. So with that, I'm going to start with the first. Section over here. I'm pretty nervous because I know that there are, I saw a lot of like older Mozilla folks over here. So I'm afraid that I'm not telling the right story about how it came about. But I need to dig into our archive, like our blog archive to kind of like find out what exactly the start of the Mozilla support platform at that time. But before the Mozilla support, there were already some kind of like platforms that community based, like those are like the community initiated those efforts, like Mozilla Zine, Mozilla Polish and Gecko Zone. I've never heard of Gecko Zone, but it's actually something back then. But yeah, those are like community based and none of them are official. So and it's scattered, right? Like across different sites. So there's no like official effort from the Mozilla site. So just before the launch of Firefox 3, I found actually the first Google conversation in the Google group that I found was from May 8, 2007. And then after that, like five months after that, the first version of Mozilla support in Tikiwiki was launched. Tikiwiki is an old CMS. I was, I've never heard of it before, but it's actually a thing. So after that in 2010, we moved to Kitsune, which is a in-house platform that we built from the start because we realized that Tikiwiki doesn't really meet our criteria or like a requirement. So we moved to Kitsune. And there was a gap in there. But it's like the days where we actually lack resources. So you only have very few developers like maintaining the platform. And because it's an in-house platform, we cannot struggle at that time. But around like 2019 or 2020, there was a moment where we merged our team, like the customer support, sorry, at that time, I think the support team from Mozilla side with the bucket side of things. Pocket is another company that we acquired. And then we merged the customer support team together as a customer experience at that time. And that was probably the start also where we kind of like, you know, like trying to build another muscle of a support team because at that time, we need to support like premium products with the plan of like launching Mozilla VPN and stuff like that. So we're trying to build a support operation team at a time for the premium products. And now fast forward to 2024, we're actually planning another migration, although it's very, I've limited like information to share with you about that. So that we're planning to, you know, like migrate our CMS to a new CMS called Week Hill. But the plan is still going on. So maybe Smith will explain some more about it. So the community, as I mentioned at the beginning of the timeline that you saw just earlier, there was a Google group that I found from the first blog post that I like, you know, like find out. In that Google group, there were lots of people like, you know, like bunching ideas. And I can't even tell who's are like the staff members, who's are the volunteers. So it's just like, embodied the community spirit, like the relationship of Mozilla and contributors. So community definitely part of like Sumo since the beginning, even like, you know, from the first part of like the inception. And it's still true after today because most of our, most of our product, like free products, are supported by community contributors across different parts of the world. And they've been contributing for a very, very long time. The last time we did a contributor survey, we find out that actually 50% of our contributors have been contributing more than five years. And even some of them are contributing since the beginning, since 2007. And they're still contributing until now. So they're very passionate about their work, about their contribution. And they're pretty much self-sustained. I think even if I die today, I think they'll still be able to run the community on their own. And because of their experience with the product and because they have been involved since the beginning, they have very deep understanding of the product. I think even more than like staff members, they're in the company right now. And one of the things that the community provides like huge value is also on the accessibility part where they enable our content to be accessible in, you know, like languages that we otherwise won't be able to support because it's also part of like contribution areas that we offer in the Mozilla support. So yeah. I'm going to show you a few contributors that the picture doesn't show. But there's this contributor. He's Jefferson, his intellectual property attorney by day. But he's been contributing to this small platform since 2007, so since the beginning. And he is mostly contributing to the forum, to our support forum. And he's really knowledgeable in the customization part of it. So like doing CSS configuration to tweak your browser appearance, stuff like that. So he's very knowledgeable in that area. And then there's also Wenxi, who is our contributor based in China. He's contributing to the translation, localization of our support articles into Chinese. He's been translating like more than 50 hundred articles already. So he's really, yeah, he's really big on this effort. He's also part of like the proponent of like open source movement. I think he's involved with the Free Software Foundation at some point. So, so he's definitely a big open source supporter. And the last is Bitya. She's pretty new in the community. She just joined in 2021. But I'm really excited because she just shared this news with me that she got hired as a content writer. And in fact, it's actually one of the things that she contributed in support, in the Mozilla support platform. So I'm super proud and excited for her accomplishment. Okay. So before I walk you through like our struggle as a community that has been run for like 17 years old, I'm going to talk about like some of the things that influence the growth of the community, right? Because there are multiple factors that come in the equation of like how our community can grow. For example, like there are external factors like industry competition, for example. I mean like the domination of Chrome is out of my hands. So, but at some point it's also like constrain our growth as a community. So that also, you know, like that's also something that we shouldn't forget. There's also internal factor like product market share, for example, the leadership support and stuff like that. So that's also another stuff that, you know, like makes in the growth of our community. And there's things that we can control at the right side of it, like the community experience part, the onboarding, the documentation, the governance of our community, and the community platform as well. Those are like the things that we can actually control to improve. Although we may not be able to kind of like improve some of the other things, but those are like, you know, like where we're going to focus more on, because we have more control in that area. So, I'm pretty sure there are a few kind of like community builders in this room right now. So if you're just starting out in like community building or community, you know, industry, you may want to take note on some of the things that we learn as a challenge as a 17-year-old community. I feel like given the age of our community, there are like multiple people who have contributed for a very long time. So the knowledge is kind of like concentrated on some people, and that has been a problem, because we want to make it more accessible for newcomers, right? So how the problem is like, how do we share this knowledge from the, you know, the other contributors to the relatively new folks who are joining our community recently. And then there's this community help part about how do we keep the morale of our contributors high when the circumstances and the dynamic of the organization has changed over time. And last but not least, we also have to think about how do we quantify our value as a community team. How do we show up internally to our leadership? So that's also something that we struggle with right now. But what do we do about it, right? So the first one, split the community. It turns out that when we're too big, we probably need to split the groups into smaller pieces. So the concentration is kind of like divided into different areas. That way we can make newcomers feel like they can have influence over the community, because otherwise if it's, you know, like if it's already established, if the community is already big enough, like it's hard to believe for newcomers to think that they still have a place in that community, because it's already established, right? Everybody have their own role. So the strategy there is to split the community into smaller group focus groups to make room for those newcomers that are joining the community. And then the second one that we're planning to do is the proactive outreach. So things that we're doing right now in FOSTA. We haven't been doing much of this after the pandemic, but you know now that the world is kind of like recovering, there is a chance that we can, we want to be more proactive in meeting like potential contributors. So that's what we're planning to do more. And then last but not least, this is really important is building our data capability internally. You know, like thinking about what value that we want to show to our leadership to convey that how valuable contributors are, because it's easy to take those for granted, because it's already there since the beginning, but you know, like we need to have some kind of like quantify, we need to be able to quantify our value in order to show up to our leadership. So with that, I didn't speak much about how people actually contribute in Mozilla support, but there are actually lots of like ways to contribute. So the first one is the support platform, sorry, the foreign support. So people can answer support questions in our platform. And then the second one is to improve foreign health articles. Is it still working? Right, it's good. The second one is to write and improve our health articles in English. And then the next one is to localize our articles into another locales. Last ones are providing supports to our users in social channels like Twitter, mostly Twitter, and responding to reviews, people reviews on Google Play Store. So we have a tool that we, like the third party tool that we use to accommodate the last two contribution area that we have over there. And if you're interested in contributing, there's a link to learn more about what we're doing with Mozilla support. It's support.mozilla.org. I also have a few flyers that I'm pretty sure I bring here that I can just, you know, like give to you after this. So yeah, we're also at the booth in the same building, in the same floor. So if you're curious to know more about Mozilla support in general, or, you know, like the Mozilla organization, what we're doing recently, you can just come to our booth. I also have lots of like contributors, like community members over here, here and there, you know, like I see Danny over here, team, Paul is there, you know. Somewhere. So yeah, definitely talk to those people, they know a lot. Okay, let me try. Oh, good. Yeah, so definitely come to us and talk more after this. We're going to be at the booth anyway. Also, I'm looking for Spanish-speaking contributors, because we're kind of like straddled there right now. We don't have anyone who's in charge of the local group in Sumo for Spanish. So if you can speak Spanish and if you're interested in helping us, let's talk after this. With that, I think I'm going to give it to Smith to talk more about the platform. Hello. Thank you, Kiki. Sure. So I want to get the robot voice thing going, if we can do that. So I'm going to tell you a little bit about our platform. Our platform itself is the architecture is containerized Python app using the Django web framework. We use Postgres working along with Redis for storage and caching. Our assets are packaged through Webpack. And we're using vanilla.js, jQuery, SAS, Svelte. We use Elastic for search. And we're running in Kubernetes. Our inference is maintained through Terraform. So the kind of support that you just heard that we provide on the site, we have very community focused support that goes through Kitsune. And then we have paid support as well for the paid product support where those tickets turn into tickets for support staff. Our support journey goal, you know, we want to make sure that the site lets people find the help that they need quickly and that it's very usable. And we want to continuously improve using data and information and make the platform better. So, hey, it's just a Django project. You know, why is it complicated? So it's a very rigid monolithic architecture. So it's been around for a while. Changes in our platform can have a larger impact than you might anticipate. So there's a bit of an avalanche of things that maybe you don't seem connected but are very connected. It's a large project. We have custom parsers for our Wiki syntax. In the knowledge base, there's a lot of tight product integration, very custom CMS. And then, of course, technical debt. So, you know, as a product ages and comes in and out of being well supported and low supported and very well supported and medium supported, people touch it many hands. So there's some debt there. So I guess, you know, I can speak to that. I keep saying I'm new, but I guess I've been here a little too long to be new anymore. But when I started, you know, there were some barriers to entry, I think, for me, like learning the language, like, you know, A-A-Q, Kitsune, Sumo, you know, understanding that the apps inside of our larger Django application are, some are more complicated than others. So each piece, you know, we did very large upgrades. It was some of the early work that I was on for Django upgrades, jQuery upgrades, and I got to experience sort of the depth of peeling back the onion of our platform a little bit. And it goes on and on. The good news is we've already changed some things since I started. I'd like to say, like, I personally changed all these things, but that's not true. Ta-Sos is looking at me. We've improved the developer experience. Our documentation is a lot better now. So you're using MK docs. You know, it's a lot easier to navigate. It's correct. So if you wanted to get started contributing to Dev, you could take a look there and you should be able to get going. We've improved our coding standards. We do a lot of in-house quality checks. We do code reviews now. And we're trying to behave better with our repos and organize better, have better ergonomics automation. Excuse me. And so, yeah, we've, like I mentioned earlier, we've already updated many of our, you know, the core bits and pieces of the platform. So Django, jQuery, many other packages were updated. We've altered our data storage, our cloud provider, and we've started supporting additional products. So like Pocket, as new products come on board, we are now able to add them in and support them. We do content and data audits of the site. So we're getting better at keeping the content clean and we're finding new ways to organize it and keep everything meaningful. And so what do we want to do next? You know, we have some plans to merge our different tools for localization into, like, a more singular process. We want to build and automate pipelines to leverage different ways that we can get localization done. You know, some of those things could include machine translation. You know, just getting the need to the right place as quickly as possible so that we have something. We're changing user flows to make things a little bit easier for people to navigate the site. And so that would include, you know, not just people stopping by for help, but also people who are contributing, making it a little easier for everybody to get around, breaking apart sort of the monoliths of our site into the core functionality and being able to address those in a more standalone way. And then dropping our custom wiki syntax or altering how we're handling wiki syntax. As again, you heard earlier, we're trying to integrate Wagtail CMS, which is going to bring some hopefully improved usage in how we handle our text and make things a little bit easier for creating and managing documents. So hopefully we'll get away from the wiki syntax a little bit. And then some work has also been going on to inform our information architecture and taxonomy, so that things are arranged and accessible in the more easy to use, easy to understand model. So probably more to come on that as we get there, but we're getting there. So we have, you know, a lot of helpful links here in the stock. I don't know if this will be shared out. I assume it will be shared out. And then please feel free to come, you know, if you have an interest in contributing as a developer. Great. Please come check out these links. If you have an interest in contributing as anything, that's fantastic. Please come check out the links in the prior slides. And that's it. We're going to have a few minutes of Q&A. If you have questions, I'm going to... We don't have enough microphones to run with the microphone, but please voice out your question. And I'll have the speaker to repeat it for the recording. So yeah. Also, for those of you who just come, there's more space over here. You may want to come to the front to get your seat. So any questions? Any questions? Okay. Thank you. Well, can find them at the booth. Yep. That's side a few meters down. Thank you. There's actually a question. Thank you. As Kiki mentioned, there are a lot of empty seats on this side and everyone enters this side. So either try to come on this side or people please move a bit in the middle. Leave the sides more open as people will come in more now. Thank you. Bonjour. Ready? Let me get that. Okay. Who speaks first? And Mark, I think you're going to have the portable microphone.
Community Driven Data Collection and Consent in AI
So, thank you so much for joining me. My name's Jessica Rose. And I work on the Common Voice team. I am very, very annoying about at least four things. And one of them is I'm really, really excited about human languages and the web. So when I sat down and said, oh, I'd really like to submit to speak, immediately I thought, I'd like to give a 45 minute talk about Common Voice. And then I thought, it's not especially helpful for people. So I thought I'd go ahead and pull that back slightly and talk a little bit more broadly about where our data comes from when we talk about AI models and AI outputs. Given that we're at a Fostum audience, I'm reasonably sure that folks have a general vibe about what's going on with AI, that you have a set of data, models are applied to it, and that some kind of machine generated output comes out of that. The focus of this talk is going to be about where data sets are sourced. We're not going to be talking a lot about models. And I do fear that the older I get, the more my thinking and the more my speaking is about the philosophy that we bring to our work. It is a fantastic opportunity for those of you who are watching remotely, if you're not super excited about consent and data models to escape, for those of you who are here in the room, you can pointedly look at your watch and escape and you're safe. AI is increasingly in the news, and you may have noticed that it's not a bubble this time. It's definitely not hype that every single headline about AI is grounded and focused and deadly serious. But that a lot of the headlines we're seeing are both the people saying, the tool makers behind AI saying, AI is absolutely going to save us, with their other hand saying AI, which I'm making, which I'm selling, which I'd like some funding for, could also kill us all. Assumingly, kidding, Sam Aitman, the head of OpenAI said, AI will most probably lead to the end of the world, but in the meantime, they'll be great companies. Very relaxing. Elon Musk generated a robust amount of AI hype. I liked this quote best because it is equally true of bears. There's some chance above zero that AI will kill us. Very relaxing coming from someone working on AI who presumably likes some more money for AI tooling. But both of the quotes before, both of these gentlemen, are also signatories to the AI safety statement on AI risk saying mitigating the risk of extinction from AI should be a global priority alongside other societal scale risks such as pandemics and nuclear war. How relaxing, how calming. One thing that I genuinely enjoy about being me is I'm not as smart and not as confident as these people who would both like to sell us a future where AI fixes everything and a future where AI kills us all. I'm very, very happy to scale down my concerns about AI. This is very much not the company take. This is my own personal take, which is that both the incredibly positive AI as our saviour and the very scary AI like bears could eliminate us all. This kind of hype distracts from something that's very, very real, which is the things that are wrong with AI today and the things that are likely to become more calcified with time. One of my largest concerns around these personally is that AI from tool makers and AI from individual projects are taking the sort of commons of what people have created and turning them into individual company generators for money that you're living in a future, we're all living in a future. Oh, where sometimes things don't load. That's fine and everything's going to be fine. We're living in a future where these large AI models come from data. If anybody's interested, this was a screenshot about where chat GPT's data comes from. And I'm going to paraphrase this badly, please don't sue me. If I recall correctly, they were very, very proud. A lot of the AI projects are a little hand-wavy. They say, oh, don't worry about it. We find our data. The Facebook AI Lama model said, our pre-training data is collected generally in accordance with open sourcing practices. It's a very long sentence. The chat GPT said, hey, we find this data publicly on the internet. This is all publicly available data. That's fantastic. We all know that if we find an image online, we can reuse this on our websites no matter what. If you accidentally pick the wrong cartoon mouse off the internet and reuse that, it won't ruin your life. The idea that AI models can scrape the internet at large, build data sets of our own work, and give us a future where we've duplicated slides is very, very relaxing. Fantastic. Everything's fine. Open AI's large language models, including chat GPT, cool, cool, cool, are developed using three primary sources. And I love this. Information that is publicly available on the internet. Have any of you put writing or art? Y'all actually, who has not put your stuff on the internet? Amazing. Who is super happy about chat GPT taking your stuff? Those two of you, it was like, yeah, the future. Information that we license from third parties. I'm a lot chiller about this. And I think this one's really exciting. Or stuff that people gave us. How chill. How cool. You can just find stuff online or people. And I like the human traders. This is especially menacing. And the really exciting, the really terrifying part of this, I think, for me is if you write for a living, if you make art for a living, if you perform for a living, we're slowly seeing a future where you're being asked to compete with models trained against your past work. That when we're looking at an environment where scarcity of work comes together, it's including the opportunity to compete against your past selves. The question of whether or not scraping is theft. Is something I will leave to people who are, again, much more opinionated than I am. This picture of someone stealing has nothing to do with anything. But what are our options? If we want to make AI now, what can we do that's not taking publicly available information? What can we do that's not having our human traders give us what they want? This question, non-rhetorical, I appreciate it. I am naturally biased, but I do work with the common voice team. And this is a open source, crowdsourced, multilingual voice speech data set. We've taken, and this is a very, very 2001 way to say it, the most YOLO approach to licensing possible. I'm very old. And what we've done is we've said that there are more than 7,000 languages in the world. Right now, most voice assistants are very, very chill, about 20 languages. If any of you have used common voice assistants, which are going to be some of the more common speech AIs you're likely to encounter, you absolutely know that these work 100% of the time, 100% perfectly, as long as you sound exactly like I do, and as long as you speak English with a very, very standard cadence. For those of you who don't, very chill, good luck with your other 19 languages of the current time. What we've got is we've got, this is a lie, we've got 118 languages right now. And people donate their voices. They read clips online, and we have about 28,000 recorded hours. One of my favorite questions, and you can yell, nobody on the other thing will hear you, can you guess what language we have the most of? Like most clips, most data? People, they all said the correct answer immediately, which is Katalon. They're watching remotely, they get what they get. So we're really, really excited about this. We've asked people, can you donate your voice to us? We'd be so excited, and they do. And we release this under a CC0 license. For those not familiar with the CC0 license, this is the most have a good day license you can do it. You can, we ask that you don't identify individual voices from the data set, because that's incredibly creepy and weird, and y'all wouldn't. But you can do whatever you want with it. It's free, you can build stuff with it now. We've had people do academic research, we've had people make weird art. I love all of it, and the stuff I don't love, I think try not to think too much about. There's a ton and ton of stuff available, but this is, you can tell, I'd go on forever. This is not the only way to do things. So our first model, ask people for their data, and give it out as unlimited as possible. I asked four different people tightly connected to the Wikimedia Foundation for help with this. I'm still going to get this wrong. Y'all can email me when I'm wrong. The Wikimedia Foundation has an incredible data set they've built, which was not necessarily aiming to be used in AI data, but is being picked up by many data sets with being publicly available on the internet. They have users generate original text data, which is being released at a Commons by attribution license, where at the same time they've got Wikimedia Commons data, video, audio, photographs, being ingested under a network of different licenses. They've got tight internal regulations about what kinds of license data they can take in, and the data that they give back continues to exist under this license. It's CC by attribution and GFDL for their contributed, generated text and work for Wikimedia. They do an incredibly complex, one of the experts I talked to said, our licensing system is incredibly simple, and the other three said, no, no. But they're governed internally by a network of internal policy guidance, legal guidance, and volunteer support and community debate that keeps their licensing in check. The two things we've done so far are ask people for their stuff and give it to anybody, or ask people for their stuff under a licensing network and give it back out under that licensing network. Data trusts get complicated quickly. This is saying we don't want to give out our data under set guidelines that are inflexible. We want to have ideally humans, but we want to go ahead and in the current inception take individually contributed data and have a board of directors or have a board of guidance give access to this. Some of this can be under specific open source guidelines. Some of this can be for profit saying, hey, we only give this out under these contexts. And unfortunately, right now, the only data trusts I've seen who are operating in large data sets in meaningful ways tend to be operating in a individual contributor data comes in, and that's handled as a large block. One thing that I see again and again that I'm always excited for, that I'm always rooting for is every three or four years there's a startup promising that they're going to bring individual choice around data sales and data access to individuals through data trusts. There was one two years ago, there was one five years ago, there will be one six months from now. I'm really, really excited to see if this is the future, but haven't seen this launch yet. While I'm very, very excited, both professionally and personally about open source access for data and open source data sets, one option for AI models and AI companies for getting your data is literally just to pay for it. Right now, people's work is being scraped at large from the web, and we live in a reasonably carefully constructed capitalist system. We could, people working on AI, pay people instead of just finding it publicly available on the web. So far, and this is not the exhaustive list of where we could get consensually sourced data, we're going to see a very, very easy, lazy similarity between all of these. We can ask for contributions and pass that data back out freely. We can ask for and offer contributions under a range of existing licenses. We can create data trusts for controlling. We can ask to buy their data, but really, all of these futures involve asking for consent for people's data being ingested. This doesn't really look like the direction we're heading in right now, though. We need a couple different things for this to work. For somebody looking to start a project around open source data or even closed source data that's based around consent, looking at governance models and internal structures that build the external policy, build the consent pipeline for asking for and disseminating data, and policing that to make sure things go okay. There was recently a data trust in the United Kingdom that handles health data where they had promised again and again, this is only going to be for science. This will only ever be for science. Can we have your health data? We will only ever use this for science. Earlier last year, it turned out that science did actually include insurance companies in a very relaxing and trusting way. Looking at how do we set up our internal policies for data projects that increase trust? Creating these external facing structures as well. How are we going to do this? Are we going to ask for data and give it under CC0? Will we have an exhausting and delightful legal framework of licenses? Or are we going to build a board? Regulation and oversight is an incredibly spiky question when we look at where we're going with AI in the futures. A lot of the larger AI tool makers right now have said, you know what? We absolutely welcome regulation, but you can't have a subject to copyright. That would really hold us back. We welcome regulation, but not any of the regulatory pathways that would hurt our businesses. We welcome regulation, but none of that regulation. Unfortunately, if we're hoping for a future where our data is only sourced consensually, we would need some kind of framework where folks asking, folks seeking consent in our data aren't in turn having their stuff scraped by folks not interested in consent. Community enforcement as well. I was so excited to come to speak with Tafastham about this today because I'd really, really love to leave the folks who are going to be working on AI tooling, the folks who are interested in AI tooling, and the folks who are going to be using AI tooling to be asking ourselves as we pick it up, oh, this is so fun. Oh, this does this. Oh, where does the data for this come from so that we can start to make really, really principled choices or, I can't tell you how to guide your ethical futures, well-considered, somewhat principled choices that meet our needs, thinking about what we will and won't work on, what we will and won't use, and what we will and won't pay for is something I'd really like to leave everyone with. It's very, very corny, but if we believe that an AI future where our stuff just gets ingested and we can compete against it our own past selves, if we're lucky, isn't a very hopeful future. If we're going to try to do anything that's not lie down and wait, we need some level of hope. And hope is fantastic. Hope seems so glimmery. Hope seems so soft. I think hope is also part of a righteous determination. You all work incredibly hard on the things you build, on the things you write, and the things you code. I'd love for you to be just as determined as you are in your work with keeping the intellectual rights to your work consensually given. Human-source means, often that we give it to the world, but that gift, that giving has to be optional. Thank you so much. Yes, coming. Can you help me pass the mic please? Thank you for the representation. In the Common Voice Project, you have people submitting samples and people verifying samples. I did some sample verification and I heard a lot of samples pronounced by the same persons. And if I understand correctly, to make a good AI, you need samples from varied sources. Have you ever had to worry about people with good intentions contributing too much, actually? I love this question because we have, especially with some of our less commonly represented languages, we've got a relatively small number of super contributors where we say, oh, wow, this is a little bit less. And all of the voice contributions are tagged by hash ID to identify the same speaker in the training sets. But we absolutely do know that we've got specific language communities where we've got individual speakers overrepresented. And both saying, cool, if you're doing this because you're a language nerd, if you're doing this because you're studying, that's fantastic. If you're doing this because you want to train models better and better, you can probably take the summer off. Our conversations we've absolutely had with people in the past. Yeah. But thank you, both for contributing and for the question. Oh, I should add, and this is not a question but a plug, if anybody speaks a language that's not currently on common voice, please come, I used to be a teacher. I was like, please come and see me after class. I'd love, love, love to see about getting new stuff on. I'll ask you nine excited language questions. Can you, oh yes, on the other side. I'm at 5K steps. I'm looking forward for 10. Don't let me down, 10K. We're doing this. Thank you. Thank you for the presentation. I was wondering if there is a difference when you are trying to, for example, make sure that the speech recognition is accurate for persons with a different accent, let's say, because they come from maybe they're Swedish Iranians and they speak Swedish in a certain way, or if let's say a person has a hearing loss and then they might speak, like people have different types of hearing loss and therefore the accent, I believe, could be more varied and I'm wondering if the approach would be different when you are trying to make it accurate for these different types of groups. Oh I could have just talked about speaker recognition this whole time. I'm so excited. So one thing we do is we allow people to include optional metadata and that includes accent. And we've got a drop down of the most common accents we've seen for each language, but that's a blank text field. There's a researcher named Kathy Reed down at ANU who's doing research about language identities and how people self describe. So we have seen people saying, oh I speak Swedish with an Iranian accent. We've seen people say a huge string of descriptors saying, oh I'm in my 20s and I have a little bit of a Zoomer accent but I used to live in Massachusetts and I said okay. It's very specific. I like that. But we do also have folks who are talking to us about say, oh I'm a bit hard of hearing and this is the accent related to that. So really, really free text accent descriptors has resulted in some incredibly interesting metadata just in how people describe their own accents. I do have a question myself now building on that. Do we use that metadata in a way when I will go and approve or listen to someone to say yes this is correct. Would I get any of the inputs? You know like yes we are targeting this with an accent or we are targeting as clear as possible so I would be. So if you're validating common voice clips, we won't show the accent data associated with it. The general guidance is if you can understand it, if they're saying what it's saying, I'm generally pretty happy to accept it. That's great. If I give consent but want some attribution, is something like that possible? So with common voice, I'm so sorry that it's CC0 so there's no attribution associated with it. One thing that is super exciting is the platform is all open source as well. So we have seen language communities that said hey, we're not really comfortable with the CC0 aspect. We're going to go ahead and fork the platform and create parallel language corpora. Plural corpus, I always flub it. So they're parallel language corpora under different licenses or even we've seen individual users create their own speech corpora based on what they wanted to do with their own voices. So under common voice, no, I'm sorry. Under other collection projects, often yes. It's a hard decision where to go, which side to go. Yeah, yeah, yeah. This is open source. Thank you. So when you talk about the CC0, I also understood I remember that there was some issue of concern because it could have been exploited by some other company. What I wonder is if exist or if I've ever evaluated a license, let's say this is CC0 except from massive machine learning. So on this very, very controversial question at Fostum as well. I've heard other people say them like this. It's not how I would describe them. I've heard these licenses described as Franken licenses sometimes or open source alike licenses. There are a ton of, so I think I want to separate out my personal opinions from the company opinion because I don't know what that is. As an individual speaking for myself, I think these are really exciting and interesting. They're not true open source. But for people to be evolving licenses and to be evolving how we think about permissions as we evolve projects with data, even the ones that don't work are just such exciting experiments. Yeah, they don't always have to be something I use in my projects. They don't always have to be pure open source. But I love mess and they're all very exciting mess. They're often not all. So there's some really interesting ones out there. There's an ethical, I can't recall what it's called. There's a open source alike license based around, you can only use this for ethical uses which immediately splits into 90 different, what is ethics? What's a big company? And it's just a very, very exciting street to go down. Thank you so much. Thank you.
Debugging HTTP/3 upload speed in Firefox
Okay, I think we can roll it. And we are moving now to debugging HTTP 3 upload speed in Firefox and I'm more than happy to welcome Manuel Buschard for it. Hello. I'm Manuel Buschard. I'm Manuel. I'm working at Mozilla in the networking team called Necro and we work on Firefox networking. And in this talk I'm going about our debugging of HTTP 3 and HTTP 2 upload speed. And for this I'm going to give you some background information first. Then I'll cover the HTTP 2 upload speed problem that we investigated last year. And afterwards I'll go over to the HTTP 3 upload speed problem that we investigated afterwards. So yeah, first to the Necro team. We in general focus on security, privacy, but always also on performance. And our protocols that we work on is mostly HTTP but also DNS, web socket, web transport. And we also own the caching and the proxy feature. So this is what we generally work on. And when we think about networking performance, we usually think about it in terms of how long does it take from clicking a link to seeing the result. And for this we usually just need download speed. For other use cases like uploading large files like videos, we generally also need the upload speed. And in this talk I'm going about the HTTP 2 and HTTP 3 upload speed. Those protocols are more in focus. They are relatively newer than the HTTP 1. They got introduced in the past decade. And yeah, so for HTTP 2 upload, first what's the difference in HTTP 2 to HTTP 1. In HTTP 2 we allow to make multiple HTTP requests via one TCP socket. And this TCP socket is handled by the operating system. And real quick, the bug in our HTTP code that caused the slow upload was that we configured the socket to have a fixed size buffer of 110 to 80 kilobytes. And this fixed size buffer became a bottleneck in high bandwidth situations. And yeah, for the fix we just needed to adjust this TCP socket to not set the fixed size buffer and let the operating system handle the buffer size. And this shows that the operating system is responsible for the upload speed or the performance of upload. And this is a stark difference to HTTP 3 upload. And with this fix of just not setting the fixed size buffer, we can take a look at Chrome upload speed, Firefox before the fix, in red and in yellow Firefox after the fix. And we see that in certain configurations like high bandwidth and also from low to higher round trip times we have upload speed improvements up to like four times the speed. So we only have to wait a fourth of the time. And we are on par here with Chrome, which is using all the bandwidth you can use for the upload. So with this fixed last year, we took a more in depth look at upload speed in general. And we also had bug reports about slow HTTP 3 upload and with HTTP 2 seeing very good results, we made it a high priority for us as well and took a look there. So for the fix or seeing how much it changed, we introduced some high level telemetry. And these are person tiles of user reported upload speed. We have different versions, 114 on the left side is around one year ago. And in 115 to 16, we rolled out the HTTP 2 upload speed fix and we can see the improvements in the high level telemetry about upload speed. It's an improvement like in the higher parts, it's roughly doubled and not quite, but it's very visible. So now two difference to HTTP 3 upload. HTTP 3 upload is widely different. HTTP 3 uses a different transport layer. We don't use TCP anymore, but Qwik was standardised alongside with HTTP 3 and just relatively recently. So the standardisation was finalised in 2021, which is two to three years from now, right now. And Firefox also included HTTP 3 upload around the time in 2021. The work started in 2019, which is all relatively recent in comparison to TCP and HTTP 2. HTTP 2 is around a decade now, all. And the problem is different here because the operating system is responsible for the TCP stack. It is responsible for sending all the data performant and in Qwik we have to implement the same congestion control in Firefox and the Firefox application, so it's not the responsibility shifted to Firefox or the application. TCP is already decades old, it was done about 50 years ago and it's operating since 30 years and got a lot of eyes on it. And our Firefox implementation is really new and we were kind of the first ones to look into upload speed performance here, so we had a lot of low hanging fruits here to work on. And I wanted to visualise this a bit, like we have HV 2 and HV 3, which are very similar. In HV 3 we rely on Qwik and Qwik is also implemented by us. In HV 2 we have TCP and TCP is provided by the operating system. So I want to go into a few findings that we had in our presentation, in our IO graphs and other tooling that we took. The most useful tool for us was IO graphs where we just printed within the application, like with logging, when we send packets, when we receive packets, how big our congestion is and everything. So the first problem we have, what does this graph show? So this graph is our congestion window over time. What is the congestion window? So the congestion, well, I would like to go over this. We don't want to overload the network. And overloading the network is called like congestion control, well, not overloading the network. And this is the responsible of the transport layer, which is TCP or Qwik. And all the bugs that we had were in this congestion control, or most of the bugs. And the congestion window that we have is the estimate of how fast we can upload right now. And this changes over time. With every packet that we receive, we think we can upload more. So we have a graph like here where we steadily increase the congestion window over time with all the packets that we receive. And when we detect that packets got lost, we are assuming that the network is overloaded and we reduce the congestion window by half. And this is like one of the early graphs that we had. And orange are like the bytes in flight that we have. They circulate from top to bottom. Increasing again, blue is the congestion window. And what we see at the drop points is that the congestion window doesn't half. We would expect it to half during a congestion event. Instead it drops almost to zero. And this was one of the bugs that we had. We just dropped to zero. Each packet that we detected was lost, half the congestion window. And normally you would only do this once, but we did it multiple times for each packet. So essentially we dropped to almost zero on all congestion events here. This was one of the first fix. Later... Yeah. Later. This is the same graph. With the congestion window problem fixed, we had to investigate further. There were more problems. Here, like all these drops of packets going down, we want to stay with our bytes in flight as high as possible, with our upload speed as high as possible. But we dropped down quite some times. And if we... Yeah. For this problem, we need a bit of background information. And this background information was this slide, which I apparently put a bit later. And I'll go back to the background about Cric first before going over the next problem. So Cric got introduced. Sorry about the mix up here. Cric, the new transport layer protocol. What is Cric? Cric is on the same layer as TCP, but conceptually you can have multiple TCP connections at once over one quick connection. And we have other benefits, like TLS being integrated, so that the connection setup phase takes less time, only one round trip time instead of two round trip times. Yeah. And now we get back to the introduction of the concept of congestion control. Traction control is for us handles like not overloading the network from all participants in the internet. So everyone makes sure that we don't overload the network and keep it usable for everyone. And the congestion window is one of the concepts that we looked at the first graph and also in the second graph. This is our estimation at how much can we upload at a time. What is our upload speed to the destination server? And so our estimation depends on us receiving packets. And we want to increase the congestion window only if we are sure that we are using the congestion window. Like we are sending as much data as we have in the congestion window because otherwise we are not sure if our estimate is correct if we are sending less data than what is that we estimate we could. And this detection on whether a packet was sent during the utilization of the congestion window like sending as much data as we could. This had a bug as well and made us mark packets as not utilizing the congestion window for 50 to 75 percent of the packets which meant that we didn't increase the congestion window as fast as we could. This is another simple incremental fix for our HPE 3 upload speed problem. And after fixing this the graph looks like this. It has a steeper curve, steeper line. Here we also see that the first problem that we had got fixed. We don't drop to zero with the congestion window but have it halfway here. With these steady increases we can also see them in our high level telemetry that we introduced for our HPE 2 upload speed problem. In HPE 3 in the higher network bandwidth we have already an increase of three times. We are three times as fast as before tackling the problem from around 31 megabits per second to 93 megabits per sentence. This is the 95 percentiles. This is a network speed of better than 95 of all clients. Also visible from the high level telemetry. For the current state we are still working on this. We have more bugs that we are aware of and are also in contact with or in collaboration with contributors who can upload or request logs from them to have a look at their network condition. This is the diagram from before but from the contributors log where we can identify which problems are present from our machines in comparison to their network location. With the logging mechanism which we also included in Firefox this became a bit easier about logging. A few of the further works that we are currently still aware of is that the upload has a few CPU bottlenecks. Mostly profiling. The QuickStack made us aware that not the cryptography part of Quick is taking most of the time but some other parts which is unexpected. We have already identified a few code tests that can be improved and are improving these. We will also continue with profiling this. We also have similar to the HDP case we have a fixed size buffer. This will get to be a problem at some point at much higher bandwidth than with HP2 upload speed problem because we have a buffer that is 8 times as high, 1 megabyte instead of 128 kilobyte. We are also aware of the problem that when we are in package reordering networks we detect these package reordering as losses too frequently. There are ways around this in TCP specifications like REC or Forward Egg that we are taking a look at and investigating which one we want to implement and which proves to be the best of the options. We are also setting up CIS to catch regressions in the future and also have a detailed view from different networking conditions, how they look. We have seen where we got improvements in HDP3 already, it is now at a similar level to the HDP2 upload. It is looking already a lot better but we are still on it and we are aware of a few bugs and we will investigate further. We want to make it as good as we can to see all the benefits that HDP3 can provide for us. A lot of this was in cooperation with contributors reporting bugs. One specific bug is the HDP3 upload speed bug. If you want to take a look at our work there you can follow the investigations there. You can reach us at the Metrics channel if you want to get in contact with the NECO team. We have a NECO specific documentation also about creating logs. If you are interested in the NECO team we are making ourselves a bit more transparent by providing our meeting nodes and having a blog. If you need help with fixing bugs or want to get in contact like contributing, we also are going to provide office hours where you can talk to us directly and get in touch. Thanks for listening. Thank you. We might have time for one or two questions. Hi, thanks for talking. I just wanted to ask if there is any chance of Quick being brought into the Linux kernel or Windows kernel or wherever else Firefox runs. The question is whether Quick is going to be implemented in the operating system with the Shokut APIs. I am expecting that it will be implemented at some point. We do have, I have seen some TLS integration. This is one of the stoppers probably that TLS has to be integrated into the kernel as well. Quick is so new that it didn't have time to be integrated into the operating system. I think as soon as operating systems provide APIs we will start using them. They are not here right now but in the future I would assume yes. Two years of being standardized is like nothing. TCP is like for 30 years already. Last question, I see a lot of people coming in and for sure Manu will be available outside, no? Yes. Making promises is your name. My question is just which congestion control did you implement in Firefox? Yes. We are using Qubic by default. We have also implemented new Reno and we are looking also at BBR because this is also exciting for our lack. It's better for lower latencies. We didn't have a plan to implement it right now but it's like in the future we probably will tackle that too. Thank you so much Manu. I didn't count some of them but I took photos and we can count at different rates. I saw the sticker you have on your water bottle. Yalazila. Yalazila. Yalazila. Yalazila. Yalazila. Yalazila. Yalazila. And I think also this is a no? Yes. No. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes, yes. Yes. This is all about the politics. I'd say about the politics. Call me to be disturbed. I'd like to work with you. Yeah, except... I used to... my mother told me how to... You know, like when they had to mess up with this. Yeah, I mean, it's funny when we're not having a discussion. Do you think there's a few... ...sci-fi sites? No. Yeah. I don't know about that. Sorry, it was on. Yeah. Yeah. Yes. You're stuck with me for two minutes or so. Carmen needs another adapter and we are looking over it. It's coming. We're fixing everything. Is anyone that already got the t-shirt from the booth or the collapsible mug? Oh, good. You're saving the world. I'm there for it. I also like it. Someone else is doing steps today. This is how you stay in shape. You moderate the dev room, you run. All good. We are first. Also a big thank you to Konstantina. She organized the booth, by the way. Konstantina and Mozilla, if you want to... Yes. Yes. Do you want to? She brought more stickers here, especially this stock is related to MDN. We have the sticker here, if you want. And we have the cute llama. I heard it. Gold. It's here, waiting. One is mine. And if you want to learn more about llama, the project, not the animal, we have a guide here with all things related to Mozilla and AI. Grab the... And the first one is from the first talk about support. Grab these papers and you can have more information. Are we ready? Let me go then. Without further ado, ladies and gentlemen, Chris will introduce us to the MDN curriculum. Yay! Hi. APPLAUSE So, hello everyone. Nice to see you all. Thanks for coming. My name is Chris Mills and I'm going to take you through a new MDN project soon to be released called the MDN Curriculum. Take you through a little bit about who I am to begin with. I describe myself as a death metal hippie. I love documentation and I love the open web and I love tinkering with open standards. I used to work for Mozilla. For quite a while I was the content lead and team manager for MDN. But I left and did some other stuff and now I've come back as a contractor and this is the current thing that I've been working on with the MDN team. Another thing to add is that I'm a heavy metal drummer so if you want to ask me a question later on please speak slowly. A little bit about this talk. We are going to talk about, first of all, some of the problems that myself and I was perceived with Frontend Development in 2024, particularly in terms of education and the skills that new web developers are bringing to the table when they come and get jobs. I'm going to take you through the thoughts of how a curriculum, a new curriculum could solve some of these problems and some of the research that we did to try and prove out some of our theories about this. I'll then talk to you a little bit about the actual curriculum that we came up with and kind of its structure, its approach, some of its goals and then I'll talk to you about possible next steps, some of the things that we can then go on to do with this curriculum as a basis. Now, first of all, I'm going to talk to us about, talk to you about something that we're very good at in open source communities, problems and complaining. Yay, Mr. Brexits, back in the UK government, I'm so pleased. No, not those kind of problems. Really, we're talking about problems with Frontend Development, kind of what skills are new web developers missing when they come into the industry? What's the effect of web, what's the state of web education, what kind of effects are these problems having on, you know, the web in general and the quality of sites that we build? One thing that I've talked to quite a lot of hiring managers about, and this will also be mentioned in the research that I'll talk about later, is just lack of general core principles of new developers coming into the industry. I mean, a short anecdote that I'll share with you is a couple of years ago, a friend of mine called me into his company, he worked for a large agency at the time and he wanted me to talk to all of his Frontend teams and he wanted me to talk to all of his Frontend teams about accessibility. Really basic accessibility, you know, just kind of use headings and paragraphs and use alt text, that kind of stuff. And I went in there and did a 20 minute talk and I was thinking, do I really need to talk to these folks about this? And it was like a revelation to them. They were all like, whoa, so that's why you have to do this stuff? I was just blown away. I was like, I thought we'd kind of largely won this battle and moved on. It kind of blew my mind about how little they knew about this stuff. And I kind of feel that with a lot of the new developers community industry, you know, they're not really learning core languages and old school standards as much as just kind of, well, I want to get a job so I'm going to learn React and I'm not going to turn this into a massive winch, but you know, that kind of results in not knowing these core principles and best practices quite as well as perhaps they could. The next thing to talk about is lack of core language skills. This is another thing that hiring managers have talked to me about a lot. So people learn React and other frameworks, but they don't maybe take the time to learn the core JavaScript language as much as they could. So, you know, they can build websites that work great and have a good look in UIs, but maybe their problem-solving skills aren't quite as good as they could be when they suddenly need to get brought onto a problem that requires not writing some code inside a framework. And also, we kind of worry that maybe this is not so good for people's long-term employability, because, you know, if they just learn React, what then happens if all of a sudden the company goes, well, now we're going to do all of this stuff in a different framework, or, you know, another framework suddenly becomes really popular and every employer wants to use it on their projects. This is probably the biggest one that I've heard from employers is just general lack of soft skills from UIs. So, and I know, you know, you could make the argument that this kind of stuff comes of experience, but it really would be great to try and promote that learners spend more time thinking about skills such as research and kind of basic critical thinking and problem-solving and also working on having this constant learning mindset that you kind of need to have to succeed in this industry, because things are just always changing all the time. So, who's to blame for any of these problems? Well, not really anybody, I would say. I'm not going to point the finger at anyone in particular, because, you know, you've got all of this ideological thinking that says everything should be accessible all the time, and this should happen, and then this should happen. But actually, people just want to get a job, so it's no wonder that people go, well, all of these job adverts are saying, I need to know this framework, so I'm just going to take the quickest path I can to get employment and be able to pay my mortgage and buy food. Coding boot camps that I've reviewed largely tend to focus on this kind of stuff, you know, and again, I'm not blaming them, I'm not saying it's a terrible thing, but they tend to be, the attitude tends to be, you know, we will take you from nothing to getting your first job in three months or six months or whatever, and that's a perfectly reasonable way to frame what you're offering to people, but there is the problem that maybe the best practices and the background skills aren't maybe being as taught as well as they could be, and of course, courses become out of date very quickly. Particularly, this tends to be a problem with university courses that I've come across. I know a lot of lecturers that really struggle to kind of keep up with all of the stuff that they've got to do, which isn't just learning about technology, they struggle to put the time in to keep their skill set current with all of the stuff that's going on in the industry. So, I think that's a good point. I think that's a good point with all of the stuff that's going on in the industry. And then, I'm also going to just say a few things about interview processes, and again, this definitely isn't the fault of the actual learners trying to come into the industry. But because people don't tend to have a consistent set of skills, a lot of interview processes tend to kind of be, well, we're looking for this kind of unicorn that knows these ten things really well, that are all really complicated. And all of the people that we're talking to have kind of got about four of these things definitely shown up on their CV. So, we've got to do a whole bunch of whiteboard interviews and coding interviews and huge, long, convoluted interview processes to check whether this person can do this job that we're trying to hire for. Another interesting thing to make mention of AI, which has already been talked about today, is it fascinated me that in the last maybe six months to a year or so, I've started to hear multiple hiring managers talk about the fact that oh, we had to put a load of extra processes in and the interviews have become even more complicated now because a lot of our candidates are trying to cheat using AI. I've literally heard about people having chat GPT open in another window whilst they're doing an interview and just typing all of the interview questions into it and then parroting back the answers to the interviewers. And it's like, that's a bit nightmarish and it's difficult to really think about what to do about that. But I kind of think, well, if these people were maybe more confident in their skill sets in the first place, maybe they wouldn't have to think to rely on that quite as much. Another interesting thing is that something that we're sort of looking to do with some kind of curriculum would maybe to have some kind of industry standard benchmark certification, eventually. This is kind of pie in the sky, often the future. But maybe this certification could kind of say, you know, anybody that's got this certification, it's a trusted certification, you know, in the same way that industries such as law or architecture have trusted bodies who have these certifications that everybody gets to prove that they know what they're talking about. But we don't really have that for our industry. And employers don't really trust some random certificate from some, you know, whatever boot camp, you know, I'm not saying those boot camps are bad or not trustworthy, but employers just have a hard time trusting them. And as makes perfect sense, they value demonstrable experience and portfolios a lot more.
The MDN Curriculum: Better web developers for a better web
Are we ready? Oh, let me go then. Without further ado, ladies and gentlemen, Chris will go and introduce us to the MDN curriculum. Hi. So, hello everyone. Nice to see you all. Thanks for coming. My name's Chris Mills, and I'm going to take you through a new MDN project soon to be released called the MDN curriculum. Take you through a little bit about who I am to begin with. I describe myself as a death metal hippie. I love documentation, and I love the open web, and I love tinkering with open standards. I used to work for Mozilla for quite a while. I was the content lead and team manager for MDN, but I left and did some other stuff, and now I've come back as a contractor, and this is the current thing that I've been working on with the MDN team. Another thing to add is that I'm a heavy metal drummer, so if you want to ask me a question later on, please speak slowly. A little bit about this talk. We are going to talk about, first of all, some of the problems that myself and I was perceived with Frontend Development in 2024, particularly in terms of education and the skills that new web developers bring into the table when they come and get jobs. We're also going to take you through the thoughts of how a curriculum, a new curriculum could solve some of these problems, and some of the research that we did to try and prove out some of our theories about this. I'll then talk to you a little bit about the actual curriculum that we came up with and its structure, its approach, some of its goals, and then I'll talk to you about possible next steps, some of the things that we can then go on to do with this curriculum as a basis. Now, first of all, I'm going to talk to us about, talk to you about something that we're very good at in open source communities, problems, and complaining. Yeah, Mr. Brexits, back in the UK government, I'm so pleased. No, not those kind of problems. Really, we're talking about problems with front-end development, kind of what skills are new web developers missing when they come into the industry? What's the state of web education? What kind of effects are these problems having on the web in general and the quality of sites that we build? One thing that I've talked to quite a lot of hiring managers about, and this will also be mentioned in the research that I'll talk about later, is just lack of general core principles of new developers coming into the industry. A short anecdote that I'll share with you is a couple of years ago, a friend of mine called me into his company. He worked for a large agency at the time, and he wanted me to talk to all of his front-end teams about accessibility, really basic accessibility, just kind of use headings and paragraphs and use alt text, that kind of stuff. I went in there and did a 20 minute talk, and I was thinking, do I really need to talk to these folks about this? It was like a revelation to them. They were all like, whoa, so that's why you have to do this stuff? I was just blown away. I was like, I thought we'd kind of largely won this battle and moved on. It kind of blew my mind about how little they knew about this stuff. I kind of feel that with a lot of the new developers' community industry, they're not really learning core languages and old school standards as much as just kind of, well, I want to get a job so I'm going to learn React, and I'm not going to turn this into a massive whinge, but that kind of results in not knowing these core principles and best practices quite as well as perhaps they could. The next thing to talk about is lack of core language skills. This is another thing that hiring managers have talked to me about a lot. People learn React and other frameworks, but they don't maybe take the time to learn the core JavaScript language as much as they could, so they can build websites that work great and have a good look in UIs, but maybe their problem-solving skills aren't quite as good as they could be when they suddenly need to get brought onto a problem that requires not writing some code inside a framework. Also, we kind of worry that maybe this is not so good for people's long-term employability, because if they've just learned React, what then happens if all of a sudden the company goes, well, now we're going to do all of this stuff in a different framework, or another framework suddenly becomes really popular and every employer wants to use it on their projects? This is probably the biggest one that I've heard from employers is just general lack of soft skills from new hires. I know you could make the argument that this kind of stuff comes of experience, but it really would be great to try and promote that learners spend more time thinking about skills such as research and kind of basic critical thinking and problem solving, and also working on having this constant learning mindset that you kind of need to have to succeed in this industry because things are just always changing all the time. So who's to blame for any of these problems? Well, not really anybody, I would say. I'm not going to point the finger at anyone in particular, because you've got all of this ideological thinking that says everything should be accessible all the time and this should happen and then this should happen, but actually people just want to get a job, so it's no wonder that people go, well, all of these job adverts are saying I need to know this framework, so I'm just going to take the quickest path I can to get employment and be able to pay my mortgage and buy food. Coding boot camps that I've reviewed largely tend to focus on this kind of stuff, and again, I'm not blaming them, I'm not saying it's a terrible thing, but they tend to be, the attitude tends to be, you know, we will take you from nothing to getting your first job in three months or six months or whatever, and that's a perfectly reasonable way to frame what you're offering to people, but there is the problem that maybe the best practices and the background skills aren't maybe being as taught as well as they could be, and of course courses become out of date very quickly. Particularly this tends to be a problem with university courses that I've come across. I know a lot of lecturers that really struggle to kind of keep up with all of the stuff that they've got to do, which isn't just learning about technology, they struggle to put the time in to keep their skill set current with all of the stuff that's going on in the industry. And then I'm also going to just say a few things about interview processes, and again, this definitely isn't the fault of the actual learners trying to come into the industry, but because people don't tend to have a consistent set of skills, a lot of interview processes tend to kind of be well, we're looking for this kind of unicorn that knows these ten things really well that are all really complicated, and all of the people that we're talking to have kind of got about four of these things definitely shown up on their CV, so we've got to do a whole bunch of whiteboard interviews and coding interviews and huge long convoluted interview processes to check whether this person can do this job that we're trying to hire for. Another interesting thing to make mention of AI, which has already been talked about today, is it fascinated me that in the last maybe six months to a year or so, I've started to hear multiple hiring managers talk about the fact that oh, we had to put a load of extra processes in and the interviews have become even more complicated now because a lot of our candidates are trying to cheat using AI. I've literally heard about people having chat GPT open in another window whilst they're doing an interview and just typing all of the interview questions into it and then parroting back the answers to the interview, isn't it? It's like, that's a bit nightmarish and it's difficult to really think about what to do about that, but I kind of think well, if these people were maybe more confident in their skill sets in the first place, maybe they wouldn't have to think to rely on that quite as much. Another interesting thing is that something that we're sort of looking to do with some kind of curriculum would maybe to have some kind of industry standard benchmark certification eventually. This is kind of pie in the sky, often the future, but maybe this certification could kind of say, you know, anybody that's got this certification, it's a trusted certification, you know, in the same way that industries such as law or architecture have trusted bodies who have these certifications that everybody gets to prove that they know what they're talking about, but we don't really have that for our industry and employers don't really trust some random certificate from some, you know, whatever boot camp, you know, I'm not saying those boot camps are bad or not trustworthy, but employers just have a hard time trusting them and as makes perfect sense, they value demonstrable experience and portfolios a lot more, so it would be interesting to see how we can match these up in an effective way with some kind of curriculum. So yeah, this is the question we came to, we thought all of these problems, could we try and solve this or at least go some way towards solving this with a new curriculum? And we thought, well, we can't just trust our hunches, let's go out there and do a bit of research. And, well, I went out and I talked to a whole bunch of people from four different groups, trying to get some insights from both kind of people on the new learning end of the table, the new people trying to come into the industry, people that have very recently gone into the industry, new web developers, and then senior web devs slash managers, you know, these are the kind of people that are actually on the hiring panels trying to hire people to come into the industry, and then educators from universities and colleges and boot camps that are trying to impart these skills to allow people to enter the industry. It would take far too long to go through all of that research in excruciating detail, so I'm just going to share a couple of findings with you. But a couple of the key questions that we asked were firstly, in their opinion, what skills most commonly missing, and secondly, how valuable do they see an official curriculum or certification being for employment purposes? And in terms of the missing skills, I was quite glad to be validated in my thoughts that basically a lot of them said, yes, things like core language knowledge is missing, things like soft skills are missing, and things like fundamental best practices are missing, semantic HTML, accessibility, responsive design. There was also quite a lot of mention of tools, obviously not things like frameworks, because a lot of people learn those, but things like version control tools, for example. There was some sort of worry about people missing the idea of how to use those tools and linters and all of the other kind of tool chain stuff that comes along. The value of curriculum slash certification was an interesting one, so the response we got to this question was quite overwhelmingly negative. As I said before, people did come across and say, well, you know, experience is more important than having some kind of certification. People said it even sometimes feels like a bit of a scam if someone's turned up and said, hey, you should trust me because I have this certificate. It sounds like it could possibly be a bit snake oil-like, a bit kind of what, you know, how are they trying to trick me here with saying they've got some sort of certificate? Other people I talked to even said, you know, we don't like the idea that courses and certifications could create some kind of barrier to entry for the wealthy. You know, you've got to pay to have this thing to be able to gain entry to the industry. You know, that's not a good kind of look and not something that we try and promote. But yeah, so it was quite negative. On the positive side of the value of certification, we did have some folks saying, well, actually we could get behind this, but only if we have a reputable provider or industry body to kind of officiate this certification. It would also be useful to have some sort of baseline of skills to say, well, you know, this is the official industry standard of what you should know as a new front-end developer. Because there are lots of people out there teaching things and learning things in quite a lot of different ways. So we took this research forward and we created the MDN curriculum. So to take you through this a little bit, our key aims were to provide a baseline of skills, which basically talks to what I just said. We also wanted to make sure that it provided a balance between your short-term tooling and frameworks and all this kind of stuff. You know, I also tend to call this the kind of short-term employability skills. You know, it's like all the job ads says react, so I'm just going to go and learn react. But to provide a balance between those short-term employability skills and the more long-term core skills like core JavaScript and like accessibility and semantic HTML, etc. So that's kind of a difficult one because it's important to teach both, but you don't want to kind of bore the pants off people with hours and hours of history lessons about web standards philosophy. It's just not going to work for a beginner. The next thing that we wanted to try and put across of this curriculum is, you know, what if we could use Mozilla's brand and reputation to give this project credibility so that whenever, when we eventually start producing things like certifications based on it, because we'd like to do that, it might be seen as trustworthy by the industry. We also want to make sure that we regularly review the curriculum and any kind of courses based on it, maybe any kind of partner material that we recommend that people go to to try and learn the curriculum. We want to review them really regularly to make sure that they stay up to date with industry trends. That's also a big thing. Stuff does become out of date all too often. The number of courses that I've seen that basically still say, well, JavaScript is jQuery and you're like, hmm, a lot of people still use jQuery, but that does seem somewhat dated now. And the last thing that I wanted to talk about is just avoiding this kind of conception that it's a paywall barrier to entry, you know, we want to make sure that the curriculum does provide a completely free educational experience with options to go to paid partner courses if people want, choose to pay for such a thing, but if they don't, they don't need to do that. So the high level structure of the curriculum and actually I think I'm just going to jump into the demo now. We do actually have the curriculum available in on a development server at the moment. We're going to release a pilot program, maybe in the next couple of weeks or so to allow people to actually start looking at this themselves and provide us with some feedback. But at the moment, we've just got it in development. So this is what it looks like. We're trying to present it as a nice friendly experience with a bunch of different modules. So the first thing that you get to is the core modules. This is essentially the baseline skill set that I've talked about. This is what we think everybody should know to be an effective front-end developer. And we did a heck of a lot of research and got a lot of opinions on what should be in here. But we also provide these getting started skills that we, as we're referring to them. So this includes the soft skills that I previously talked about, lots of advice and lots of links to resources to say, hey, you know, you should brush up on your research skills. You should think about what it means to work in a team. You should look into courses that provide skills on how to do well in a job interview, for example. All of these soft skills are very useful to people coming into the industry. And then we've also got a module there on environment setup. And this is kind of, you know, not exactly web-related skills, but it's all of the skills to do with making sure you're familiar with your local environment that you're going to use to actually build websites. So things like command line and code editors and the file system, you know, because it's that kind of stuff that really trips people up. Like I've taught a whole bunch of beginners classes to kids at my daughter's school, like after school clubs. And the kind of the number one thing that I came across that really messed with complete beginners is trying to figure out file paths and creating files on your local system and dealing with all of that weird stuff that Windows does when it hides known file types, which messes you up when you're a beginner and you don't know what's going on. So all of that kind of stuff, because it's really important, even though it's not really web-related. Then the final part of this is we've started to list a bunch of extensions. Now, these are not part of the core, and we're not going to try and insist that any courses that want to conform to our curriculum teach all of the extensions. Because the idea is that the core provides the essential stuff that everybody should know. But of course, as a web developer, once you have all the basics under your belt, you're then going to want to start specializing in things. You know, some people might just want to work mostly on CSS and do layouts and things. Some people will see themselves more as JavaScript developers. Some people are going to be a bit of a hybrid in the mix. So we're providing modules on kind of, you know, more complex, specific CSS subjects and particular types of tooling and things like security and privacy and testing. So that people can go on and specialize in those kind of things after they've got the core under their belt. And this was partly because there were so many opinions flying about what should be inside the core. What's the actual essential stuff that you need to know to get started? That we just thought, well, we need to keep this kind of quite small and focused because, A, as it is, the list is pretty intimidating and large. You know, the list of skills you need to know is quite intimidating and large. And also, it makes it harder and harder to find courses that are actually going to be able to conform to our curriculum. So we just kept it fairly small and focused. Now, if we go actually into one of these modules, just to give you an example, it looks fairly like an MDN page, but with some differences. You'll see here that we explain JavaScript. We've got all of the submodules teaching all of the things that you need to know, like variables and text and arrays, et cetera, et cetera. And then we list resources for people to go to to start learning those different topics. As I say, it's a curriculum in the academic sense of the word. It's a list of topics or criteria you should know. It doesn't provide all of the course material integrated into the curriculum because we thought, well, there's so much high quality material out there already that why should we kind of reinvent the wheel? You know, there's loads of good stuff on MDN already, so really it just needs a bit of organization. It doesn't need kind of repeating and duplication. And then also we can start to recommend partner courses. So this gives you a little idea of what we've got in the curriculum. Just go back to the presentation now. So I had these here just in case, you know, the Wi-Fi obviously wasn't going to work. So next steps, what are we hoping to do with this kind of baseline curriculum now that we've now after we have it published in a couple of weeks or so? As I said, there's a number of follow-on projects that could come out of this curriculum. I'm really hoping that we've managed to do a whole bunch of evangelism around it and start to get it adopted as kind of like somewhat of an industry standard baseline of skills that students can say, hey, I'll go and learn that curriculum or educators can say, well, I want to write a new course for my university. So hey, I'll base it on that curriculum because it's a good solid baseline of skills to know. We're also looking at recommended conforming courses. So we've got a bunch of starter resources, but some people will want to have a very opinionated, complete course just to work through from start to finish. So we're currently looking into a bunch of partners, both free and paid, that we could recommend people go to to learn all of this stuff. And, you know, again, this is a good option than having to create a new, a complete new course yourself at great expense. And plus the MDN team has never really had experience in creating high quality video courses and there's lots of them out there already. So we wouldn't want to compete with that. But it's quite tricky because we're having to review these courses in great depth to make sure that they conform to all of the stuff that the curriculum says you need to know. But also they're teaching the stuff with a kind of a philosophy that aligns with Mozilla's philosophy about this stuff and doesn't teach bad practices. You know, actually teaches all of the accessibility stuff properly and uses semantic elements in the examples and all this kind of stuff. So it's been, you know, we've reviewed a couple already and it's been a very long-winded, challenging process, but, you know, very useful for both sides because it's taught us things. And it's also given us a load of feedback that we can then give to these course providers to help to improve their courses. And then just to mention again about certification, there's nothing on the cards yet, but we really would like in the future to create some sort of certification, you know, create an exam and get a certificate that proves that you know all of the stuff in the core curriculum, you know, and maybe in the future we could use the Mozilla brand to give this credibility. We could also have some sort of system that gives the students like a ready-made portfolio along with their certificates so they can actually have something to show employers to prove that they wrote some code and they know what they're talking about. And we're thinking, you know, maybe in the future this could make all of those really tricky employment tech interviews a bit easier because they could go, ah, well, you've got that trusted certificate already, so maybe we can skip like step one and just go on to the next stuff, just make everyone's life a bit easier. And I think that's about it. Thank you for listening. Let's take some questions now. Anyone, anything? Seriously, either in the other corner, I'm kidding, I'm kidding. Sorry, I haven't missed it. What's the license for the content? So the license is the same as standard Mozilla content. It's CC by SA. Have you considered crowd-sourcing contributions so they evolve? Yes, absolutely. So the question was, have you considered crowd- So the first one was what is the license and it was CC by SA and the second one was have you considered crowd-sourcing contributions? And this is an interesting one because on one hand we want to keep the curriculum kind of very slowly and deliberately evolving. We can't have people just kind of ramming stuff into the curriculum all the time because it's supposed to be a stable curriculum. But on the other hand, it would be really great to get crowd-sourced extensions because there's lots of extensions that we could publish. And they don't even necessarily need to be on the curriculum site, although it would be nice if they were. So yes, yes is the short answer. Hello. I saw that you made a really nice initiative. I think it's really good to have a baseline for people that want to join the industry, especially as front-end. And I have a question about something that's not about what you're showing, but it's related to the topic that you show. It's actually something I'll show very shortly, very briefly. That candidates using AI on interviews. So I have stopped the interviews for a while because it's very stressful. So I didn't have the opportunity or maybe I did and I didn't notice. But do you have any words of wisdom on how to deal with candidates that use AI during interviews? My company used to only have interviews. One of the major interviews was live, like with the person in the office. But nowadays it's mostly online and it will be really nice to have some kind of... So yeah, it's interesting, isn't it? I mean, years and years ago, when I worked for print book publishers, we used to have complex tools that would actually crawl the web and try and find examples of plagiarism. Of course, that's not quite so helpful anymore because the answers are just generated on the fly. One thing I've seen is some of my friends that have been hiring recently and dealing with this problem have literally had chat GPT open themselves and as they've asked the question, they've typed it into chat GPT just to see what answer it produces and gone, hmm, how similar was that thing that that guy just said to that? Oh, okay. And in terms of more sophisticated solutions, I've seen a lot of quite intelligent proctoring stuff recently, so you actually have an AI powered helper that will sit there and it will examine what they're doing and it checks for their eye movements and it checks for them doing things like, you know, being unfocused from the app that they're in for too long and all that kind of stuff. But I mean, I've seen that in a lot of examination platforms, but of course a lot of it gets really complex because it's like some of them are like, well, you've got to use our own modified browser version or app to app, but we can actually keep tabs on you whilst you're doing that. And yeah, so there's lots of solutions coming out, but that's some of the stuff I've seen anyway, but it is, it's a hard problem to solve. But yeah, thank you for the question. Thank you so much, Chris. All right. If you do have any questions for him, he's around the room outside. Yeah, thanks for listening, folks. Thank you. Thank you.
Firefox: Good things come in .deb packages
Hello. Hi, everyone. My name is Gabriel. I'm a senior release engineer in Mozilla, and I work on shipping Firefox on several different platforms. And today I'm here to talk to you guys about the new debt package that we are shipping to our Mozilla app repository. So first I'm going to talk a little bit about the journey from Nile to stable builds, and then I'm going to elaborate on some of the reasons why we thought a native package might be useful for people on debt-based distributions. Okay. So early last year we started talking about setting up an app repository for Mozilla product builds to help us offer better support on Linux and stuff like that. It's really challenging to support distribution builds for us because they're built with different compiler, compiler versions. We can lead to some issues. Yeah, so first around October we started shipping a Nile package. And it was mostly for Nile community. This offered them some benefits like they didn't have to create a desktop file. It also made it easier to update the binary. We have some data that actually suggests that we keep people more up to date on these debt packages. I think probably because people update the whole system components in the app store or... Yeah, I wonder if they did other stuff. Yeah, so I made a blog post about that. We got a lot of feedback from the community about developer addition debt package. So we shipped that. And now as a Firefox 122 we're shipping stable builds to the repository. So we want you to be able to use Firefox how you want. And we know browsers are complex applications that support many different use cases in people's lives. So we wanted to offer a native package in addition to SNAP, some flat packs. So this package, it's built in Mozilla infrastructure from the Firefox source code without any modification. And the builds are supported by Mozilla directly. Another good thing about the package is that we spend a lot of resources in optimizing the builds using PGO and things like that. And we wanted people to be able to get those benefits without having to install our tar balls but rather getting packages from this repository. I like this one. The updates are faster in case of chem spills. So the new app repository is plugged in directly to the Firefox release automation. So when we ship Firefox we upload the build directly to this repository as soon as it's available. Which is nice in case of security patches and things like that. And here's a slide about how to install it. So you can visit that. The website right there and follow instructions, as I said. It's easy. It's just about adding the Mozilla app repository and installing the package. The package is not perfect, surprise. So if you have feedback, if you actually try it out, you can join our matrix channel and let us know if the package is working for you, if you're having issues. And Mozilla will offer support. Thank you. Thank you. Thank you. Thank you. Yes. Can you provide arm 64 builds? Not yet. Can you reply the question? Yeah. So here's if we offer arm 64 builds. Not yet. We've been talking about it. Yeah. Working on it. How are the bindries constructed for these packages? Do you use the native devian tooling to build it? Or do you basically repackage the same binders you would put in a snap file? It's the same binary that we've been shipping as a tarble forever. And yeah, we use the native devian tooling to repackage that into a dev package. And that's what we put in the app repository. I saw a hand over here. I lost the person. Someone had a question. I don't have any questions. Oh, OK. You're tricking me now. Yes, I see. Coming. For those new in the room, this is a challenge for me to have 10K today. So don't hesitate to ask questions. Hello. Do you envision the dev package being like a stepping stone towards the future where flat packs and snaps work well? Or it will be like a permanent offering going forward? I envision it as a permanent offering, just like an additional option for people that like the packages. Yeah, makes sense. Why did you specifically choose dev as a packaging format? I didn't quite understand that. Over like something like flat pack and have a custom repository for that. Yeah, we already have a flat pack. So this is just a different option for people that want to use that packages. There's already a Mozilla flat pack repository? Yep. OK, I didn't know that. Good to know. Yes, go ahead. Sorry. The microphone is coming. So that we mentioned dev and flat pack, so the next question is what about other packaging formats like OS native, for example, RPM or any other, like, I don't know, like something for ARCH or anything else? Yeah. I mean, that would be cool. There's definitely been conversations over RPM. Yeah, we're thinking about it. My question will be if you're supporting dev now and all these packages, is it not a burden for you to continue supporting more? Don't you have a plan to focus on something more straightforward? It is true that we're taking on more work by supporting these packages, but I think we wanted to offer that support to the Linux community. That's, yes. Thank you. Are you going to be working with projects like Debian to promote the Mozilla repositories for their, like, stable user bases? Yeah, we, the Debian package maintainer is a Mozilla employee, he helped us out with this. Yeah, it's just a different package, it's a different set of trade-off, right? There's a lot of different guidelines when you build the package for the distribution and different infrastructure limitations and things like that. So it's more like an alternative package for people that find it useful. Thank you.
Firefox, Android, and Cross-browser WebExtensions in 2024
Please welcome Rob Esimione. Thank you. Simon. Simon. Simon. Okay. Is it working? Yes, I think we're all good to go. Okay. Hello. Welcome. Thank you for coming. My name is Simeon Vincent. I'm a developer relations engineer working on Firefox add-ons. Who are you? I'm Rob Hul. I'm working on the web extensions team at Firefox on extension APIs that can be used to build extensions. Today we're going to be talking a bit about building cross browser extensions in 2024 manifestv3, cross browser web extensions, all that good stuff. In case you're somehow not familiar with extensions, they're personally one of my favorite aspects of the web platform. Web extensions are little programs essentially that allow you to customize how the browser looks and feels. A couple of examples. Tree style tabs will add a tree of pages as you browse the web on the left side or I guess in the sidebar. You block origin probably needs no introduction with this crowd. So it is a content blocker or a content blocker that allows you to block pieces of functionality on web pages. And Dark Reader as I kind of contrast to some of these other things allows you to have your preferences. Like if you prefer using dark mode, Dark Reader will enable that on websites, even if the websites don't support or the website author didn't create a dark theme for your ideal website. Web extensions are cross browser extension model originally developed by Chrome and now all browsers are aligning around this particular approach to designing extensions. Over the years there have been a bunch of different ways that browsers have tried to provide ways of customizing or personalizing the browser experience. And the web extension model is particularly interesting because it is built using web technologies. It's built on top of the web platform itself. So when you customize your browser using a web extension, the author was using HTML, JavaScript and CSS to build that experience. Web extensions as a cross browser thing originally started with Firefox, building a new extension platform based on Chromium. And since then Safari has adopted it and it's before Edge adopted Chromium as its base, they also adopted the web extension model. So this is, while it's not like a standard in the W3C or other standard group, it is a cross browser effort. All of us are working together in the web extension community group in the W3C. And in addition to browsers, the community is also involved providing feedback and helping evolve and enhance the extension platform. So today we're going to take a closer look at how to build some extensions with a mindset towards maximum compatibility with current browsers with new form factors like Firefox for Android. You're now able to have your web extensions available on your mobile phone. And how to kind of maximize compatibility as you build out a new browser extension. Okay, so as the first demonstration, we are going to show how to modify a website. That's a very common task. Most extensions that are just starting out want to enhance a website in some way. So this is a very simple extension. This consists of two files. A manifest or JSON file that declares some metadata such as a name and version of extension. Not very important for the rest of this talk. Something about the manifest version which will be more significant as we see later. And the context scripts entry that declares which file should run and where. In this case, it declares to run on every side. The exact demo doesn't really matter. In this case, we are just going to register a click handler on the document that shows a dialogue when you click it. And now we're going to try a live demo. So first, oh, can you connect your Android phone? Oh, yes. Okay, while he's connecting his Android phone, I'm going to type the comment. So there's a WebEx tool that is a program that allows you to run or build extensions very easily. It takes care of looking up the location of the browser bindings and packaging the extension and launch it without effort. So WebEx run is the common to run extension in the current directory. T is target. In this case, we are starting at Firefox and red. And then let's see. Does it do anything? Okay, so yeah, I'll show that later. Okay, well, I'm at it. I'm just starting to clean copy in background. So this is a Firefox for Android. Currently showing something. And I ran the comment before it showed that some parameters were missing. In this case, we need to select which device is being used. And then it still doesn't work. Those on Firefox for Android on Android device are multiple Firefox for Android applications. The IDs are listed there for convenience. In this case, I'm going to select nightly. So the Phoenix is the code name for nightly and Firefox 10 for the release version. There are more things such as pet and even non Firefox builds. Okay, I launched it. There's a request to install extension. I click somewhere and I get a dialogue as expected. So now I'm going to repeat the same for desktop just to see how easy it is. I could type Firefox desktop. By default, it launches Firefox desktop because initially that is the goal of that X. Later, we expanded it to older browsers. Example.com. Typos. It works again. And for completeness also Chromium. Example.com. And it works. It takes the load the menu at this point. Yes. Okay. I'm going to back to the slide for a bit. Okay. And now to the demo. So as you can see from the extension management page. It's possible to load it manually. Just for the sake of demonstration, I'm going to remove extension that were installed by the tool. Oh, okay. Sorry about that. I'm going to manually load extension. There was the original one we loaded. Manifest. All files are there. And notice that there's some error. So if you increase the size, I'll just pop in. Go ahead. Explain the spectrum. So at the moment, Chrome has formally dropped support for Manifest v2. Or that's not quite, it's deprecated support for Manifest v2. So it's no longer possible to publish a new extension in the Chrome Web Store using Manifest v2. And in fact, it's been about a year since it was supported. In over the course of the remainder of this talk, we're going to be focusing specifically on Manifest v3. Because if you're building a new extension, you have to use it. And as you clearly see this Chrome, it can still load Manifest v2 extensions. And of this year, they will start experimenting to turn off Manifest v2 support. At which point it's not possible to load Manifest v2 extension in Chrome any longer. Firefox still continues to support Manifest v2. It hasn't deprecated Manifest v2 yet. So again, just load extension Firefox just fine. I think it's also worth noting that Firefox does not yet have a deprecation timeline for Manifest v2 either. Whereas Chrome has a kind of end of life timeline formally announced and is coming in the next six months. So the way to fix this would be to switch to Manifest v3. The question is how to do that. So like you see, there's an error here. There was the original code that we started with. Manifest v2 was mentioned there. So we have to do Manifest v3 and then we try again. Just for the sake of, since I've opened anyway, I'm loading Manifest v3 extension. Okay. And this was the Manifest v2 one. I'm removing it since I don't care about it. This Manifest v3. Okay. There was an old example, the new one. Again. If I view this page, there's still some errors here. I'll get to that later. Yeah. The next part of extension model that is very important are host missions. In a previous example, we ran the extension on all websites. We requested to do so on our website and we bumped the extension Manifest v3 and it still worked in Chrome. What it didn't show yet is how to run extension on Firefox. Well, I did show how to, but not that I did so. So let's try again. As Rob does this up, one of the notable differences between in Firefox's support for Manifest v3 is host permissions are limited by default. We want to encourage developers to go in the direction of giving users more control over where and when extensions run. So by default, the set of hosts that the extension declares aren't granted by default. Thanks. So at this time Manifest v3 extensions do not get the permissions unlike Chrome. That's a common source of confusion for extension developers. And as you just saw, I launched Firefox again with a Manifest v3 extension. You click around, nothing happens. What you do see is that there is built-in extension UI in Firefox to control permissions. So if you enable permission from this point and load it again, it will work. And so that is the built-in UI in Firefox. Chrome has similar UI to disable it. So even if you currently think that your extension works in Chrome, it may not if, for example, the user has disabled it or maybe sometime in the far future Chrome has also followed a similar approach. It's also worth noting, even though we only show Firefox and Chrome here in today's presentation, that Safari has also something special for host permissions. In fact, all host permissions are disabled by default. Safari takes the most strictest approach to host permissions for privacy reasons, meaning that you have to explicitly interact with the browser UI to allow the extension to run on the page. In a previous slide, I showed built-in browser UI to do so. Now I'm going to show how to do it in your own extension with the right context to convince the user to give you those permissions. To do so, in the Manifest v3 declared options UI page, in this very simple extract of the options page, I show a label. This is just a very simple explanation. In reality, you may convince the user that they will win something. I mean that they will get some useful functionality. Please do not mislead the user. And for the sake of this example, a checkbox to send the permissions. For the demonstration, I'll be very to the point. Check in the checkbox will give extra to example.com. In your own extension, you may use a more descriptive, non-technical description. And finally, I load the options.js file. In options.js file, I declare the permission that I want to request. The permission follows the format that's shown on the screen. The origins key of the object to declare list of match patterns, which you saw before in the Manifest file. Match patterns are the common way in extensions to define on which host an extension operates. In this case, in this context, with permissions, the path is ignored. Because when you request access to example.com, it doesn't really matter whether it's example.com slash favicon or example.com slash index. It's also worth noting the scheme in this case. The wild cards are a little special in the way that they work. They're not, you know, general glob patterns. Scheme, the asterisk here specifically means HTTP and HTTPS. It doesn't include filed URLs. And the asterisk in the domain or in the host section, it can only be a prefix. It can't pattern match on the top level domain or in the middle of a pattern. So it can only specify subdomains. And in this case, asterisk. includes both subdomains and example.com. Thank you. Okay. So the UI was very simple. We had a checkbox and a label that you can click that triggers the checkbox. To start with, we will first make sure that the checkbox reflects the actual state of the permissions that you have. So you will request, well, check the permissions that you have and update a checkbox state. Very useful. Now you need to make sure that when the checkbox is ticked, that the permissions are actually granted. To do so, we introduce a change listener. A change listener is as fired when the checkbox state changes. When the checkbox itself is checked, so that means that the user checked the checkbox, we want to make sure that permissions are granted if they have not been granted yet. So we use the permissions.request API to do so. And if somehow the user dismisses the permission request, we uncheck the checkbox to make sure that the UI is consistent with what actually happened. To remove the permission, we do something similar. You'll notice that in this case, I'm using try catch because Chrome has something special about it, which we'll explain a few minutes later. And finally, also note that I use permissions.onEdit and permission.onRemoveTheVentListeners here. These are triggered when the permission change, not just by the checkbox here, but also if you use built in browser UI, or simply have opened the tab in a different page. It's a very common mistake of extensive developers to not account for changes in other tabs or other external factors. So make sure that if you have some UI or settings page, synchronize it with your other potential external triggers. I'm going to demonstrate it. Oh, just a bonus tip. So in the demo here, we all, well, clearly you can see that the browser is using dark theme. By default, the web page and Accenture page include that use a white, well, a light theme, which is one simple trick, adding a made and named call scheme with content dark is and light. You can automatically get the dark theme support for your extension page. Even if you don't develop any extensions, take away this tip for your own web applications. With very little effort, you can get dark theme support automatically without any extra extensions. I actually learned about this from Rob earlier today. Right. So if you click on the check box, you'll see that permission request comes up and then you can choose to allow or disallow. Will you continue? Did you also want to show it on Android? Oh, yes. Thank you. So on Android, the flow is also similar. You find the settings page as you can see through these arrows. And then you can also grant the permission to similar UI. I could do a live demo, which is for the sake of time. I'm just going to the screenshots. If people are following this later, the slides are also published later with screenshot included. So you don't have to watch the whole video recording. So this is the Chrome page. Okay, that was the special part with Chrome that I'm going to show now. So I can do this too by recycling real但. So previously I said there was some try catch to handle some error in Chrome. The reason for that is, I just wanted to make it bigger. Okay, yeah, it's bigger. Thank you. If you click here you'll see that nothing happens for some strange reason. The reason for that is that in Firefox you can always, the user can always toggle permission and extension can do the same. In Chrome an error occurs when you try to remove permission that is declared in the counter scripts. Because the extension claims that it's required permission and Chrome refuses to remove it. You can still trigger the same flow by using the side axis control that I mentioned before. So there's the built-in UI from Chrome to control axis. There are several ways including on click. So now I try again and I see the checkbox is unchecked. I can check it, permission request appears, approve it. Okay now I'm locked in the jail. But the important thing there is, again, the permissions can be revoked outside of your extensions flow and you might not otherwise have signals to it. So it is critical to subscribe to those events and make sure that you're reacting to them appropriately. Right, so in this example that we looked at so far, we've statically declared our content scripts and had them automatically run on the page and then respond to the invocation on click. But if we really want to embrace user control, user privacy, it is best to react to a user's invocation of the extension and then perform the work that you need to do rather than always statically injecting and potentially having access to sensitive information that you don't actually want. So both as a user, you're protecting your own data, as a developer, you're respecting the user and making sure you don't have access to content you don't need. So the pattern that we want to encourage is that kind of in-context reaction and we do that through a special permission called ActiveTab. So in this demo, we're going to be using the ActiveTab permission to detect when we're in vote and then get temporary access to the page and then we're going to be using the scripting permission to execute the script in context. And we're going to be doing that using the extensions action icon, the thing that appears in the toolbar that identifies your extension. It's also worth noting that declaring action in this file is necessary. If you don't, you can't actually use that API. So it's kind of an implicit permission. In this demo, we're going to be targeting specifically Firefox. So we're going to start with a scripts array. This is kind of the more traditional approach. It uses a background page, an event page in order to run the script. We are going to be targeting Chrome in just a little bit. So we're going to kind of ahead of time polyfill the browser namespace. The Chrome version of the extension platform uses the global Chrome and Firefox when it implemented the web extension APIs. It used the namespace browser. And the big difference between the two is that browser supports promises and Chrome historically did not. It does in manifestv3. So we're using this quick polyfill. If you are targeting manifestv2, you'll need a larger, more extensive polyfill. So as described, when the user clicks the action, we will execute a script. We will inject on the current tab, and we're just going to alert. We're going to skip over that for time. Skip the demo. Yeah, skip the demo. You get the idea. It works. So when we target Chrome, we're going to actually need to use a different entry point into the background context. We previously used scripts. That's what was used with pages. Chrome no longer supports background pages. It only supports a service worker. So we need to declare a different entry point. Again, you believe me that it works. The important thing now is we have two different entry points. And ideally, as an extension developer, you don't want to deal with having a bunch of different files in special configuration per browser. So in very recent versions of Chrome and Firefox, it's possible to declare both strips and service worker in the same manifest file. But because this is so recent, we don't want users with older versions of Firefox or Chrome to get these updates without, and have the extension fail to load and run. So we're going to add some guards to our manifest here. We have four Firefox. We have the gecko and gecko Android keys and the strict min version. It just coincidentally in both browsers is version 121. There's no association beyond that. And critically in Firefox, the ID field is required when uploading a manifest v3 extension to AMO. And then in Chrome, we target, again, 121. And that all works as expected. Wonderful. Over here, I want to highlight that scripts is an array. So it's possible to declare multiple files. And so you can have libraries or chunk your application up into different pieces in order to better organize your code, whereas service worker is a single entry point. So it's possible to achieve this multi-file loading thing with a slight tweak. In this case, we can use like a service worker JS file as your entry point. And then in that file, we're just going to use import scripts, which is special to worker contexts. It will synchronously import these scripts and execute them. So we can achieve the same result with, again, a single manifest file and a slight tweak to how we load service workers. We are going to skip this part. Basically another way to get the same functionality without the background script, if you care about cross-platform stability, but without dropping support for older versions, you can get rid of the background script and use a pop-up that opens when the button is clicked and close the pop-up. So let's talk about short-lived background context. In manifestv3, one of the major changes to the extension platform is in the past we had persistent background pages, and that's no longer an option for constraints outside of the browser's control. In order to be able to run everywhere and run efficiently, we necessarily limit when an extension, we terminate idle extensions background contexts. That has some important consequences for how you design and build your extension. So we're going to very quickly skip towards the end because of time. We have prepared, like, for over an hour, I guess, a presentation. So if this kind of interests you, you can look up the slides from the FOSTEM link. The speaker notes are also included. We don't see them here, but we make sure that the slides are understandable even without the video. Yes. It's also, I think, worth noting we're planning to expand on this and do, like, a webinar later to share some of this content and explore it with less time constraint. So hopefully we'll see you tuning into one of those sessions and answer some questions from you all about building browser extensions. Thank you. Bye.
Integrating LLMs: Intelligence is tricky
Thanks for coming to this talk. I know it's the last one of the day in this room, so I'm sure we're all tired. I'll get through as quickly as possible. But yeah, essentially my name is Gabriel. I am a tech lead on the innovation studios working on a project called Formuleic, but more on that later within Mozilla. This presentation is all about, yeah, as we just mentioned, why intelligence is tricky and how to integrate LLMs in standard code. So I guess this probably makes sense to start with the definition. So LLM, probably a term you've heard a lot throughout this conference, is essentially a large language model, and that is a program that can naturally understand human text and do something with it that a traditional program cannot. So just to reiterate that, it's essentially a program that doesn't need a specific syntax, it just works with natural language, which is actually pretty interesting when you think about it. Actually, I had a question for the audience, who here has played around with chat GPT or other chat bots or LLMs? Literally everyone has their hand up. I was the opposite, who didn't? Yeah, but who found that actually useful? And just put everyone too, awesome. Okay, just about, not everybody literally. Yeah, so just before we get started too deeply, I just wanted to make mention that there's a lot of terminology in the space. Everything has a name, a definition, and lots of weird words. So apologize in advance if I don't explain every one of those words, just there's so much going on in such a short talk. So yeah, so as we all just learned, everybody thinks it's kind of useful. I also think LLMs actually have utility and are quite useful. These are just two random examples I pulled up, but essentially LLMs can help categorize content, answer questions, provide summaries, help create content, and structure unstructured data, and plenty of other things. They're essentially the proverbial hammer in the toolbox. They can kind of do anything you want them to do. Doesn't mean you should be using them necessarily, but you can do it. So I think it'd be appropriate just to have a quick example of a traditional app that is not intelligent, and then what it looks like when you add intelligence, and just how easy it can be once you get things set up. So in this case, we have a node application, just takes unstructured text from a user, and stores it in a database unless you use a retreat. Nothing special. But then when we add an LLM, for example, we can then take this unstructured information, craft a little prompt for it, a prompt is just an instruction for the LLM, and then get the LLMs to return something quite unique or useful to us. In this case, the LLM output is a category of what the node could actually be called, and that helps with organizational structure things in the database. And this is amazing, but let's be real. There are some issues over this technology. It is not perfect, unfortunately. It's also very young, to be fair. So yeah, so what are some of the issues that you can see when you interact with this technology? It includes hallucinations, a.k.a. it just makes things up. It likes to lie sometimes. It's not great. But also inconsistent format. So when you're in code, you can imagine you want to talk to your services, your APIs, in a structured format, and LLMs like to not necessarily listen to that format, and they'll reply with, instead of JSON, it could be marked down, or broken attributes, or things that just don't make sense. So that takes a lot of hand-holding and validation. There's also the performance and cost aspect of running these services are quite computationally expensive, and we all know GTs are expensive and scarce. Fundamentally, there is also text account limitations that you can only interact with these LLMs with a certain amount of text before sort of forgetting what you actually asked it to do, which kind of sucks. There's also a education and documentation kind of lack thereof, especially for open models. So that's something that takes a lot of learning and trial and error to do. Lastly, there's friction points that include bugs and security issues. So this is, again, a new technology. Of course, there's going to be bugs, and of course, there's going to be security implications that need to be thought of. And what's particularly crazy is that there is actually over 50,000 text models on Hugging Face at the moment. And there's so much choice in so many models out there. It's actually quite hard to understand how to gauge which ones are good. And on top of that, there is also a ton of licensing across these models, and that also complicates how you can select and choose which models to actually use. But there's actually more models, of course. There are the proprietary and closed models that are ever so popular as these little diagrams pictures show. These models are popular for a reason, though, because it is exceptionally straightforward and easy once you add your credit card to get these systems working without having to think too hard about these models. And that has some consequences. The main consequence, though, is it kind of creates a technical vendor lock-in. These models all interact with these prompts, as we just saw. But those prompts essentially have to be curated to the model to get the actual value. So you can imagine you write a bunch of prompts for one proprietary model, and now you expect to run those exact same prompts in an open one, and you don't get the same results. So this is like a key friction point for open models, because there's not so many examples. There's not so much documentation around these what prompts work and why. And then when you do run a prompt that you already had, it doesn't work. And then you just stop using that model, for example. So this is just a quick little demo. Two of these models, one of them is open, the other one not so much. They replied with a relatively good answer. Whereas the other model in the middle just decided it didn't want to do it. So the reality is, if you actually tweak the prompts and you add a little clarity and you write the prompt to the model, regardless of the model, you can see that the responses were actually really consistent and really good. So it's quite interesting to see how, with a little bit of effort, a little bit of elbow grease, if you will, you can get something that's considered, maybe not the prime model, to still output something that's still useful. And that's where my team and project come in. So Formuleic, today's our public announcement, if you will, so it's kind of exciting, is going to help or try to help anyway. Create a platform for open prompt scripts that anyone can interact with. They're open by default. And we'll help enable the creation, sharing, and testing of these different prompts against different models. And of course, we're still in super active development, and we would love to get your opinion as we're building out these repositories here. So please don't hesitate after this talk to say hello. Yeah, so that's my super quick talk. Please find out more about us online. And yeah, that's it. I still have like four key steps to do. So where am I headed to with the questions? Don't hesitate. Yes. Do you have already some plans on how to integrate LLMs and tides? Yeah, I think we have a lot of questions. I think we have a lot of questions. Do you have already some plans on how to integrate LLMs and tides, Mozilla tools like Firefox? So it's a question like, are we thinking about adding LLMs to Firefox? Honestly, I'm not too sure what the long term plans are. I know there are people obviously playing around with the technology, but I don't think there's anything officially on the books. It doesn't mean you can't add it to your own version of Firefox. So I'm just saying. Anyone else? Don't be shy. We are good then. You, we are closing.
Broom not included: curling the modern way
So, yes, it is everywhere. And what is sometimes I think people tend to forget or don't realize is that we keep releasing curly releases like crazy. So these are single blobs for every release we have done since 1996 until now. That's 254 releases. We did one this Wednesday. So anyway, that's just, you know, yeah, we keep releasing stuff and we keep adding source code and changing source code. This is the, so the blue line, the top line is all code in curl and live curl taken together. The red one is just the command line tool. But since the command line tool uses the library, so it's, there's a lot of development, a lot of changes, and we're adding a lot of command line options. That, I'm not sure that's so good, but we do that anyway. And so, yeah, back in, when I renamed the project curl in 1998, we had 24 options, I believe, and now we have 258. So that's roughly 10 a year on average. So yeah, and chances are we're going to have even more soon, right? Well, and even if we just look at the last few years, you can see that all of these graphs show a significant growth there as well, meaning that we're still, we keep doing things and we keep changing things and we actually make curly better product every year. So I just wanted to mention that it's a good idea sometimes to upgrade this, you know. I don't know exactly how it is in all projects, but in curl we have this, I don't know, there's a trend, people keep using very, very old curved versions. Sometimes it's a little bit boring, you know, and we actually fix stuff. Yeah, so curl is this Swiss Army knife of internet transfers and one of the cooler things we did within the last few years is that we had a parallel transfer site. I hope you all know that. So nowadays then, traditionally, of course, if you just write curl in a lot of URLs, it will get them seriously one by one by one by one, but now you can use this option and it'll get them all in parallel. Well, actually it will get up to 50 by default in parallel, you can change that. I guess there's some, usually by default there's a 10, 24 open file descriptors, right? So there's some kind of maximum that will cause you problems, but otherwise it's awesome for when you want to get a lot of files down faster, possibly from just lingering on your disk like curl has always done, or it does by default, right? So you can ask curl to remove it, remove. For example, you have a timeout, you really want to get this download done in two seconds, and if it's not done in two seconds, it's going to fail, right? But what happens with that leftover file? If you have this flag, it won't be there. If you don't use the flag, you will have a partial one, possibly, right? So, and we also have added a lot of fun other ways to control multiple transfers, really, or transfers in general. We've had this option for a long time, which is, this has, this is a sort of a reverse speed limit. If it's slower than this, kill it. If it's slower than this many bytes per second for this amount of time, we don't want it anymore, basically, detecting catching, stall transfers, too slow for me, stop. But we also have this other one. So sure, only, only do this transfer at the maximum speed of 100K, that's bytes per second, or you can also do, if you want to do those, you know, download a million files, maybe you don't want, you don't want those million files to happen immediately, maybe you want to slow down the rate a little bit, and then you can do it like this. I just want to get those files at maximum two per second, or that means, that means started requests, or maybe three per hour, or 14 per month, or week, and blah, blah, blah. Basically, a way to, if you really want to get a lot of files, maybe from the same server, and you, maybe it's your server, you don't want to overload it immediately, or maybe you know that the files are just updated every once in a while, anyway, so why not slow it down. And of course, you can also ask, so ignore the file if it's too big, and this will also nowadays actually stop the transfer if it, if it downloads that much, which in some cases you don't know how big the file is, right, and you might not want to fill up your disk space when the server surprises you Monday morning. And of course, one of the bigger things I've done recently is, yeah, KERL has had this concept that I call config file, it's really not a config file, it's passing the user names or whatever in those config files, and you couldn't do that, but now you can, because we have introduced a way, so we have introduced this new option to KERL called variable, so it's basically a way to set the variables in the command line and in these kind of config files for KERL, and why do you want that? I'll show you just some examples, so basically you can set, it's just a name, it has some content and you can get that content from the command line like that, or read it from, you can also, you can set it in the config file with different syntax, but that's the same as the old, all right. So you can basically set it as pure content like this, or you can read it from the file, you can get it from the environment variable, which means that you can do stuff like this, or you can set it from the environment variable like this, you import it like with a percent sign. I'll show you some examples and why it's cool. Blah, blah, blah, blah. So basically, so you can set, why would you, you set a command line and you can then expand them of course, so if you want to for example set a command, a name, and you want to use it in an option, you can ask it explicitly to expand it, and it sounds a little bit weird, but it works like this basically, so you can expand, if you want to, the data option as someone might know, it's short version is dash D is for sending a post for example, so in this case you can send the content, the variable content in here would be sent as a post, and the content of the variables host user here would be in the URL that it uses to the command line. Basically just gives you more freedom to create weird config files and use that to do more curl, and you can also, what's even more fun than this, you can then for example read those contents of files for example, so if you wanted to post from a file or you wanted to read credentials from a file, you can do that as well, and then if you want to make it even more cool, you can apply different functions when you expand these contents, so you can JSON encode it, you can URL encode it, and you can be 64 encoded if you want to in the config file, so really to help you do those weird things that people want to do. For example, you get the, and are complicated more so than we want to sometimes. So basically with this tool, it's a really simple one, it basically just gets sets parts of a URL for you, like this, if you want to, you can give it a URL and you set a host name and it'll just replace the host name in the URL and output that, or you can just ask for the host name part from a URL, or you can redirect a URL, so if that is the first URL and you want to redirect another, how would the end result become, right, typically then of course a relative redirect, dot dot slash dot dot slash blah blah blah, what would happen, or change parts of it, like update the port number or you could append query strings, or you can do more complicated things, you can output everything as JSON, so we would split it up in different components and output everything as JSON and you could j que it or whatever you want to do, and you can also do things like extracting parts of the query string, so maybe I want to just a component from that weird URL, a little thing just to help you work with URLs better, we also worked quite a lot on adding JSON stuff in curl recently, so for example you can, we have this option called write out, you should know about it, it's called, it's actually dash w in the short form, it helps you just output stuff from the previous transfer, it has a lot of variables, you can download speed or headers or things, and we also, it works like that, we recently added support for getting all that data output as JSON, so now you can do like this dash w and the variable called JSON, and that variable then will spew out the huge JSON object with all of those variables that I call them as JSON object, pretty much it's a fancy way, if you want to get everything in JSON, you work with JSON, you want to do j que it, curl helps you go there, and in the, oh right, it also, we have this other variable called header JSON, it's called header JSON, right, it then outputs all the response headers as a JSON object, also to help you work with the headers maybe scripted, j que it, whatever you want to do, so basically you do more JSON, so you can, and of course we ship hb3 enabled by default, if you happen to be, have that config enabled, no, this too has that, but you know, in theory you could, no, but it's a tls setup problem for all of us, but we enable hb3 by default, and if it's possible and it's a really cool way, and now you can do it with curl, and you can just, you know, ask for curl to use hb3, it's called dash dash hb3, and hb3 as you know uses quick, quick is a different transport protocol, it's done over udp, so it's not the same connection, right, so we can't upgrade from a hb1 or hb2 to an hb3 connection, so they actually have to be different connections, and since hb3 is still being blocked, problematic in expersentable attempts, so when you ask curl to do hb3, it will actually try hb3 and hb1 and 2 at the same time, and go with the one that works best, that works, it's actually, it's raising them together, so it starts with hb3, a little bit before it tries the other ones, it works really well, so it's in the happy eyeballs style, so we do happy eyeballs on pv4, pv6, and then happy eyeballs between hb3 and hb2, and hb2, one will of course downgrade to hb1 if it can't speak hb2, kind of fun, you should try it out, and of course if you're using then the parallel option thing with uppercase z, it will do them multiplex if possible if it's on the same host, really fun, and of course I have written about most of this in everything curl, you should read it, it's a lot of pages, thank you. Thank you. Questions? There's a question, I have stickers. Yeah. Thanks for the presentation, are those all nice features accessible also for library users in C or in Python bindings, or is it available only from command line? Most of this is powered by the library, so almost everything that is that I mentioned here, that is network related, is available to the library, really problem, and that's also why I wanted it to be used the same parser, so true rel and curl uses the exact same parser, so they will work exactly the same way on those URLs. One more question? Nothing more, thank you. Thank you. Thank you.
Improving IPv6-only experience on Linux
They immediately think of dual stack and it's sort of like people put equation between IPv6 and dual stack and think that dual stack is the best. It sort of is except for one tiny detail and that is the one in the bottom that basically by using dual stack you are not addressing the problem by IPv6 was deployed in the first place and this addressing of scarcity of IPv4. Therefore there are some other transition mechanisms I think there are some other transition mechanisms that can help you with this thing and one that stands out is called NAT64 and yeah we are at this event where the network uses NAT64 for the default network since 2015 if I remember correctly. How does it work? Well for IPv6 it's easy the path is clear for IPv4 the path is goes via translation box that is called NAT64 box into IPv4 internet. This works very well you can see it that yeah you can use it basically mostly for most networks but sometimes you can hit some very corner cases for which for some people could be very noticeable but for some people can be like critical usually it starts giving you hard time when you start playing with virtual machine or containers on your computer then usually those kinds of software are usually relying on IPv4 connectivity that native IPv4 connectivity that is not available. So there is another standard called 464XLAT which tries to closing the gap by editing a piece of software on the computer the piece of software is called CLAT and in that sort of like reverse translation to what NAT64 does which is called PLAT in this case and and then so by this trick of double translation first IPv4 into IPv6 here and then transferring via IPv6 to another translator translates IPv6 back to IPv4 we can have applications super happy because they see sort of like dual stack they see they see both IPv4 and IPv6 and everything works well. This is a technology that is very well used in many devices but not in Linux so I put there a slide with some like typical cases of what you plug into networks and how do they work on V6 only network like the one the main network here the one that is called FOSDEM. If you connect your Android phone your iOS device or your Mac OS computer it should work fully these three categories have CLAT there and also there are mobile networks in the world who are running V6 only for their mobile phones like for millions I said billions on the slide maybe it's just millions but it doesn't really matter lots of devices. These devices into one network and yet not waste IPv4 address for every single device that is in the left most category which objectively doesn't need it. That's the question here and the answer for this is quite recent RFC which introduced a new feature to DHCP the old DHCP protocol is used for IPv4 and this this option is called IPv6 only preferred. And it works like this if you have a standard every time a device attached to a network it starts this DHCP handshake it has these four messages discover offer request and aqua knowledge and the device if the device is capable of running IPv6 only it will just signal this in the request parameter list it's the option 108 and if the DHCP server is ignoring it and does not send the option 108 back in the offer that means the network is not capable of running IPv6 only therefore it will use normal DHCP and it will use normal IPv4 this is the case of the most networks but what if the server actually offers something in the DHCP offer in that case the device says oh that means I'm in an I'm in a network which does not need IPv4 for operation and therefore I will just stop the handshake here because the handshake is stopped midway it means the address is not committed the address is will be released after some time out and not used and therefore the address will be saved in the in the in the in the DHCP pool that's exactly how you define network that has a name the most stable name is IPv6 mostly IPv6 mostly doesn't mean that it's most it means that the network is designed to run IPv6 only but still have some IPv4 for legacy devices like windows or Linux or smart home such network has to provide perfectly working IPv6 and the IPv6 have to have not six for support ideal there should be signal 96 for support with this prep six for option which is an option in router advertisement so when you attach to network you receive route to advertisement you immediately know that there is not six for and how you configure it and also must provide not native IPv4 for the legacy devices so everything works yeah this was like sorry I'm a little rushing into it but yeah I want to focus on the main point of this talk which is Linux on IPv6 only basically how to move Linux from the second category works mostly into the first category works perfectly where are what are the gaps that needs to be closed to to have Linux not depend on native IPv4 there are basically two options how to tackle this problem and we somehow has to be and it turns out system yet for the has sort of like unsensible default that it's actually on by default so system yet for the in the latest version is refusing before even though it doesn't set up sealant so it makes your computer sort of broken but it has been reported and hopefully it will be fixed one way or another now how you learn sealant on your computer bad news is there is no native way of translating v4 into v6 inside Linux kernel so unlike BSD for instance it's not possible the good thing is there are at least four software solutions that are like third party but can be run easily first two of them are user space they use the tune tab device and they are working user space the next two net four six and jewel are actually kernel space so you can just compile the kernel module install it and it works there's also Perl script cause sealant D which takes care of setting up sealant it works sort of okay but it has some limitations that are making it not universally accessible and not something that will be in every single every single distribution the biggest issue here is that it works like this that the sealant is sort of like on a stick it's a separate network interface that means that to make it work you sort of have to set up router in your computer and also set up firewall to allow traffic between two interfaces on your computer and not only this you have to also undo it once you want to turn off sealant and just imagine the varieted the diversity of illinux distributions this is virtually impossible to do correct way so I don't think this this is going to fly what is supposed to fly ideal sealant should look like this it should support multiple instances so you can have more interfaces more interfaces can be connected to different IPv6 only networks so maybe you can have more instances of sealant it should set up automatically as soon as you detect net64 and it should that is key point also react dynamically to change IPv6 has changing protocol it supports dynamic renumbering of networks and all those things that can happen dynamically so it's not like you run it once and then it's then it stays it has to react on all the dynamic changes on the network and it shouldn't need to touch firewall or forwarding there's actually draft in ITF that is being worked on right now setting up the requirement for sealant so I was discussing with many people how such a sealant could work and I found out this solution that I'm quite happy with it and that's why I'm presenting in here it basically use IPv1 and network namespaces so the thing is you run the translator inside the network namespace to you and I hope some of you can maybe pick up on it is that I'm missing the piece of software that will orchestrate it that will listen to netlink events see that there is a new router it has net64 set up an IPv6 address for it finds a free address for the translation space from IPv4 we have another IPv4 scale set it are only eight address assigned for sealants so you can realistically run about four sealants if you are very creative you can maybe run a little bit more but to be honest four instances of sealant should be okay but the problem is you have to always dynamically figure out which addressing the sealant should use and react to subsequent configurations and of course my goal is to have this in common Linux distribution which means hopefully network manager will pick this up hopefully system dn4d and this will this will make the life of on IPv6 own network on linux much nicer that's everything for me if there's any question hello excellent presentation and work one question have you figured out how macOS x does it yes macOS does it by the sealant is actually integrated in the network interface so from from application point of view you see network card and has this v address 192002 just have a look if you have mac we will see it on your on this network and if you use vileshark and look what is getting out of the network card it's just ipv6 so basically so basically the translation happens between what application sends to it and what gets out of the on the wire so is it the kernel implementation or is it it's in the kernel and it's and it's and it's you know because it's part of the network interface it can be even duplicated if you have multiple interfaces active both each one has its own instance of sealant hi actually two questions first do you know how android is doing it because it's a bit closer to common links the solutions and also is there an argument against doing it in the kernel or is it just a case of nobody having done it yet sorry i missed the first first question how how is android doing it and is there an argument against doing it in the kernel or is it just that nobody did it oh yeah how is how is the android doing it android is doing it um similar way uh they have several generations and the last uh the last uh part of the android code is heavily using these three letter aquarium of network code that i don't remember uh e b thank you
Load balancing using XDP
It's possible with XVP to do bare metal package processing at the lowest point in the software stack before the kernel, kernel network stack and this is make it ideal for speed. At the end of execution of the XDP program it should return a code. There is three possible choices. XDP pass, let's say the packet pass to the normal flow in the network kernel stack or XDP drop or abort to drop all the packet and so this is invisible to the normal stack or XDP TX or redirect to send the packet away to another destination. Let's see the most basic possible program XDP program that is this one that all it does is just return XDP pass so the normal flow of the packet is preserved. To compile XDP program we use a clang with the target BPF option so this is for example to compile the previous program and after we can load the program to a network interface using the XDP loader that is a command provided by XDP tools. If we replace on the previous program that one the XDP pass with XDP drops all the coming packages will be dropped. As I said this all will happen before the normal kernel network stack so this program this packet will completely invisible to the normal stack for example from TCP dump. Fortunately for debugging it's we can use XDP dump that permits you view what the XDP programs are doing with the packet and check what is the protocol if it is IPv4 than 8 we check if it is ECMP line 12 and after we access the previous map and add one to the value. The next step obviously is reading this data from the user space so we can access the map with a syscall and read all the map. Here we need a loop because for performance each core on the HCPU on the system have a different map, a different copy of the map so we need to loop and some of them but it's quite easy to access and communicate from user space to a BPF kernel space program. Obviously we need to redirect packet if we want to do some nonbalancing so to redirect packet we use XDP ticks or XDP redirect return code. Before redirect we need to change value of the packet for example in TCP samples is very easy we just swap the source MAC address and the destination source address so when we return the XDP code the packets will be redirected to the other machine we selected. Another important optimization to do on the load balancing is using direct server return so all the servers send their response directly to the user without going another time to the load balancer. To do this load balancer and server must share the same from the group of the canned server if the server was the first one the new direct the new request go to the second one and so in this case unfortunately we break connection for example if a server goes online but if we remove on the case that the server wasn't on the first choice all the packets continue to go to the same score so if we remove not the first one but the other ones there is no problem for render washing instead in many cases the ashing completely changes if we change the set of server and all the connection blocks and only one or all of the servers. This is all interesting but it will be really nice if we can load balancing do load balancing without load balancer. For doing this we can exploit we can leverage the ACMP routing feature that the routers have. Equal cost multi-path routing is designed for split traffic designated as single EP across multiple links of equal cost but we can use it to split the traffic between server if all the server announce the same EP and in this case the router blissfully redirects the packet to all the server without knowing that is not one but end server that receive packages. So we can go from a normal this normal architecture with a load balancer between a server and router to a complete distributed load balancer where the package are distributed using ACMP. So the ACMP is really useful for high performance load balancer and using optimization like direct server return we can also increase the throughput. To redirect all the package to the same backend server we need to use a consistent ashing algorithm and it's possible to leverage the ACMP routing available in routers to distribute packets between server and deploy the load balancer directly to the backend server without the dedicated machine. All the code that I developed is available open to the public city and there is also a link to my thesis all in Italian fortunately where I examine all the available load balancers in the market and all I explain all with more or less briefly let's say like 70 pages. Thank you for your attention. Thanks for the talk. Thanks for the talk and I have one question. Is it possible in XDP to inspect the source the data in the packets so you can just. Yeah you can impact the packet quite really easy. Here we have only the protocol and the protocol and which type of packet it is but it's possible to go more inside the packet to examine all the packets. Okay and the second part of this question is it wise to do it like from the. Good but yeah. Sorry guys if you leave the room please leave from the door over there. Wait until he's over. Thank you. Thank you.
Nephio: A New Approach for Automating Telco Workloads
So, I'm going to talk about the architecture. All right. Are we ready for the next session? So my name is Wim Hendricks. I work in Nokia. I'm heading the technology and architecture. In this talk, I'm going to talk about nephew. Nephils are about thousands of sites that potentially have to be working together to actually make this service happen to all of you who are using smartphones or tablets and what have you. So that's kind of the problem space that we are trying to work upon in nephew. And so there is basically a set of issues. One is the scale. The second is the heterogeneous environments of all these network connections and making them work seamlessly together. And then, of course, we are working into an environment where there is just not one single person in an organization involved. There is multiple roles within an organization who are actually involved. So that is what we call infra people. And that is the application side of people. There will be security people and so on and so forth. So we have to deal with all the roles and responsibilities in an organization who actually take care of certain aspects of that deployment that has to happen to make this service work. All right. Moreover, what you see is that if you look to, let's say, a mobile network, right, so we are talking sometimes about terabits of capacity. So that means that we are also, me being part of a vendor, right? So we used to basically own and control every piece of that stack, right? But because we are moving into this cloud-native environment, we have now desegregated stuff, right? The problem that we also face is that we are trying to have very tight control over the infra and we want to basically work in that cloud-native space, right? So the question is, how do you do that? Because you have to basically give away some of that control to other people, right? So if you can see, we are looking at this is a quite challenging space, right? And what has happened so far in the past is that every vendor, including ourselves, we basically said, okay, we control our own pace and then we have a bit of different vendors involved. And so what you will see is that you will end up being, connecting multiple components and components and components and it's actually quite challenging to do that in a cloud-native way. So we basically said, okay, can we do better, right? And this is where Nefio was born, in a sense, because we said, okay, all of these workloads, they are moving inside of a cloud-native space, meaning they are moving into a Kubernetes environment, right? And it's nice to basically do this aggregation. We can do microservices. We can basically put all these components onto a Kubernetes environment. Why don't we leverage that same framework to basically automate and orchestrate the configuration and the setup of that whole stack, right? And that's kind of, right? Now in Nefio, we basically do two things. We basically look at, on one hand, the application itself, right, which is 5G. And then we also look at a set of primitives that are not yet available to us that we would like to have to solve this problem, right? So we basically do two things. One is we are basically defining the use case that we are using to actually figure out what are the primitives that we are missing. And then we are basically adding those missing components inside of a Kubernetes framework to be able to address those problem spaces. And as such, I mean, we leverage KRM. So I haven't asked, so Kubernetes is probably a lot of people familiar with that. If you want to show hands, Kubernetes, I think we should be fairly familiar. Okay, pretty good. So we leverage, so you are familiar with the term of KRM, right? So KRM is the Kubernetes resource model, and we leverage that all the way, right? So that means we have a clearly defined API. We have the set of metadata. We leverage the desired versus observed state to basically figure out a declarative base of operation. We leverage the event-driven ecosystem and stuff like that. So we leverage that to the full extent. Now what we have seen in order to solve that problem at scale, we were missing a few primitives, right? And one of those primitives that we have been added to the component is what we call configuration as data, right? Because if you look to how Kubernetes works as its basis, you actually have a CRD or a Kubernetes resource that is basically triggering something, right? Now because we are dealing with this massively complex environment, right, we said, okay, a single unit is probably not sufficient for us. So we defined the concept of a package, right? So rather than having a single CRD, we actually built a package, and a package is a collection of KRM resources that you are going to use as a kind of what we call a blueprint or a service catalog, right? So it's basically think of it as a collection of KRM resources that do things together, right? The second thing that we did is today in order to use Kubernetes, you typically have to build a controller in code and stuff like that, right? So we added the capability better. You basically say I want to create a deployment and then there will be a replica set controller which basically says, okay, I'm going to select this node and I'm going to scale this out and I'm going to deploy a set of POTS over a number of resources, right? That's what typically happens today inside of Kubernetes and you have different methods to do so. You have deployment, you have replica sets and so on and so forth, or stateful sets, and you pick and choose the one that's familiar with you. Even that we work above a cluster level, what we do in nephew, we call it a term, what we call a package variant or a package variant set which basically says I want to have my package which I was talking about and I want to deploy that on these sites, right? And each of these sites will then be what we call specialized within their own context where the relative parameters, for example, this site needs this VLAN, this site needs this IP address, this size needs these PLM and ID's and stuff like that. So they will be specialized based on their specific context on where they get deployed and as a result, that's then being deployed on that particular cluster, right? So you see that if you look to the analogy of Kubernetes, we are working at the level, at the cluster level versus where as Kubernetes works at the node level scheduling type of level. So we work a level above but you see a lot of concepts that were born or that were basically derived from Kubernetes. We are leveraging within the framework that we are deploying within nephew in order to stay as close as possible to it so that we can leverage the benefits of that whole ecosystem. Now to put that into perspective, I try to explain that a little bit. So we have a concept of management clusters. So this is a regular Kubernetes cluster but we use it typically for our control engines, right? This is where our, what we call the configuration as data server which is ported in our implementation is used upon and then we schedule work, network functions onto those specific workloads, right? And how we do that is we basically have this concept of a package on the right hand side which is our blueprint. So think about someone as a vendor or as an operator, someone basically put that together, right? So we have the KRM resources that are needed for that particular environment and then we say, okay, I want to deploy that on 10,000 sites, right? When we do this, when we try to make variations of that package for a particular context, we also said, okay, let's divide and conquer, right? So because you could basically build a pipeline that is very narrow and very strict, right? But typically what we see is that you need flexibility, right? So you want to have a flexible system. And so we developed a concept, what we call the conditional dance of the choreography which is a set of primitives, I think of functions here that each do a specific thing. So for example, IP address, VLANs and so on and so forth. And each of them basically make sure that if those things are needed, they are basically being called upon and specialized those packages with a specific context for that type of environment. And that's how we can make this work in a very scalable and flexible and pluggable play and it's easily to extend that within a specific environment. So what did we do so far within Nefuse? We are a very early pro yet. So and it would be good if people would like to join us. So we have done, so we are just about to release release two, right? So what we have shown so far is basically the concept that I showed in the beginning. We have basically proven that with Free5GC. So Free5GC is an open source project that basically deploys the core side of a 5G type of network. So we have basically proven that we can use the machinery to actually deploy and show what I'm presenting here in the slides is actually working up to setting up a call, right? So we are using the whole network setup that actually connects all of these functions together. So all of that using the primitives that I was showing. And in release two we added OII which is another open source project mainly focused on the radio but also as the core so that we prove that we can do this not across one vendor but across multiple vendors, right? And so we are extending this whole framework with more primitives as we go along. As I said, we are a very young project, right? And so we seek for a lot of help. So if you are interested to join us, we would welcome. So please contact me or please look at one of those resources that are available here because this is all the information which you can have a look at to the operation. And what you see is that you can actually move rather more quickly than in an on-up type of approach but we are actually trying to solve a different level. So what we call the domain level whereas on-up is working at the orchestration level. Okay. And so is the same if we say the advantage of NetEar over VMware, orchestrator or other vendors? Yeah. So I think, see, okay, I'm of course trying to advocate it but personally I see the following. Kubernetes has been the orchestration for containers, right? That was where it is born. If you ask me, it has the right primitives to be an automation and orchestration platform for anything. So I see Kubernetes as an operating system to actually do any automation, right? That doesn't mean. And what is the advantage of that in my view is that, so when you have all of these different components, first of all, you have a huge ecosystem in open source that is developing and extending Kubernetes for lots of use cases, right? So you can deploy AWS resources, Google resources, you can deploy clusters, you can set up servers. So you first leverage a huge ecosystem that is being developed in open source which we should all love in this room, right? Secondly, that doesn't mean that you can as a vendor not benefit or do things specifically but the big advantage for you as a consumer when we do that is today when you do VMware orchestration, you build your own VMware orchestration server with your own database with your own and then you see it's not only VMware orchestration, you need a bit of this component and a bit of that component and a bit of this component and all of a sudden you have more servers to serve the network than actually the network. I'm exaggerating, right? But the advantage what I see personally is that you look at automation from these, the use case still will be specific to you and there will be a VMware specific controller, right? But if you build it on the same platform, you as a consumer, we benefit from not having to deploy another platform but leverage what we already have if that suits you. That doesn't mean you cannot deploy another Kubernetes instance for that specific environment, right? But at least the integration.
Network Topology Discovery: how it really works
you Patrick will start you away. at that time. And I co-founded a company that is offering support and custom development for NetXMS in 2009. And since 2017, it's my full-time job to lead the development of NetXMS and also participate in the management of state company. So, why do we need network discovery in the first place? So, we usually want to create nice maps of our networks, preferably automatically, so we don't need to draw it, and we want them up to date. We want to correlate events in our network based on topology. So, a very simple example, when we have an upstream router down, you don't want to get a lot of alerts for all devices behind the router. We want to correlate them into single alert. The main source of information is IP neighbors. And it will be ARP caches on devices for IP version 4 and neighbor discovery protocol caches for IP version 6. And basically, we just read all the IP addresses mentioned in the ARP table and took it as a potential target for monitoring. The routing tables are another useful source of information. We can take host routes, we can take gateway addresses from routing tables, and this is really helpful dealing with point-to-point links or any other links when we don't have ARP protocol. And this actually allows our passive discovery to help through the civil links, for example, in the network. In NetXMS, we also use other sources of information about IP addresses. So, any syslog message, for some reason, comes to the management server. Any SNMP trap received, we can take the source address and use it as a potential device to monitor. And we also can support proprietary methods, proprietary MIPS, like an example of Micropeak, Neighbor Tables, so anything that is specific to certain equipment. In NetXMS, we have a special layer, we call it network device drivers. Those are pluggable modules that hide vendor specifics and provide information in a unified form to the upper layer. Active scan, that's really simple, we just send out packets and network devices like switches and routers, they should be accessible by SNMP from network management system. The monitoring system will take some preparation steps. So, for each device that is added to the monitoring, we will read full interface information from that device, name of the interface, description, MAC address, IP addresses, because we will need this information to match topology data. If device supports BridgeMap, which is normal for all switches, then we will read Bridgeport Mapping as well from BridgeMap. And if device is LDP capable, then we will read LDP local port information as well. Then we have multiple sources of topology information, and the first and most important and most reliable usually is LDP. It's an industry standard now, and a brief summary how LDP operates. So, each LDP capable device will send a fixed intervals, information frames that can be received by other LDP capable device. And if we talk about switches, the switch will work in a bit unusual way, it will not forward the frame as it will do with any other frame, it will receive it and process it, and on other ports it will send its own LDP information frames. So, each frame contains a sequence of TLV structures, and there are a few mandatory TLVs, which principles as LDP, so basically we will do the same stuff, but we will use different MIPs. Another interesting source is switch forwarding database, or sometimes it's called MAC address table. It's not a protocol as such, it's just a table within the switch that determines to which port the switch should forward frames designed for specific MAC address. And if we read MAC address table from the switch, we can identify ports that only have one MAC address known on them. And we can assume that this is only one device connected to that switch port, so we can add a topology link connecting that device, which may not be manageable at all, so it may be just, we can just ping it and know its MAC address, and we know that it's connected on that switch port. And this is a MAC address table example, and we see that the first three ports have only one MAC address, so we can assume that the end nodes connected, it's not some link between switches. Another bit unusual source of topology information is spending free. This is a spending free protocol that's never intended to be a protocol for topology discovery, but still we can get some information about connectivity between switches from spending free, and we can use it as last resort if we don't have LLDP or CDP between switches. So how spending free operates? When the switches are connected in the spending free domain, they elect a root bridge, and then every switch will have a root port. This is the port that points to the shortest path to the root, and it will also elect designated ports. In theory, the problem is there's a lot of vendors, especially maybe not from the top, they implement LLDP in the wrong way, so they may send wrong information LLDP packets itself. They may send everything correctly on LLDP level, but they may report absolute garbage through the SNMP when you read this LLDP myth. So we actually have a lot of code inside the monitoring to deal with inconsistencies and incorrect data in devices from different vendors. Those are some comments from our code related to specific devices. So that's it, a quick overview of how we do network discovery. If you have any questions, you can ask us later, visit our website. Thank you. We have one question. Hello, sorry, can I ask one question? Question. Sorry. Yeah. I have one question. Real quick, I was wondering if you've looked into using Open Config at all to collect some of this data. No. Have you considered or looked at Open Config at all to collect some of this data about neighbors? I don't. Open Config. Open Config now, we don't use it. But we always open to new ways.
ipt_geofence: Protecting Networks using GeoFencing, Blocklists and Service Analysis
Today I'm going to talk about an open source project designed for protecting the network. In particular using some concepts such as geofencing, block list and analysis of service networks. I will tell you a little bit the idea behind this project. So in a sense this work started mainly when the war in Ukraine started and we have seen a really spiking in attacks towards servers. And so we wanted to do something that was easier to maintain than what we had in the past. So we said let's do something. And this is a work that we have started to develop with some students at the university. And we said to ourselves we need to handle a little bit IP firewalling a little bit better because you know a firewall is great if it is static. So in the sense that you put some rules inside the firewall and they stay in the firewall for a while. It's not continuous, add or remove, add or remove. This is not a nice thing. And a typical example is geofencing. Geofencing is the ability to block traffic from specific countries. We know that it is not a solution. Definitely it is not a solution. But for specific protocols it makes sense. Let's make an example. Suppose that you have a VPN server that it is used by your company to simply to connect you to your corporate site. So what do you want to leave it open for everybody? You can claim that today you are in Belgium maybe tomorrow or after tomorrow you will be back to your home country. So probably it is not a good idea. But in general some sort of geofencing is necessary. If you manage a service on the internet you will see that most of the logs inside your Pashi.log or Postgres or whatever are generated by people that are not really interested in what you do but are interested in breaking you or trying to find a way to break your system to go somewhere else. So it makes sense for us at least to circumvent. When I say geofencing it doesn't mean a specific country. It doesn't mean Belgium for instance. We must say Europe for sure, not South America. And another problem is encryption. Encryptions is a very nice thing to have. We are not going to remove it. It should be present everywhere. But the problem is that if you look at this problem from the network and we are in the network in that room you would see that encryption creates a problem. Not just the analysis of the traffic per se. So when you say the initial handshake, the way people are contacting you, if the fingerprint of this guy is a good fingerprint or is not a fingerprint used by a previous attacker. These things are very nice to have but they are not enough because you don't know after the handshake what is this guy asking you to do. So this is another problem. So this has been the motivation. So the idea was to create a single open source tool because you can use many tools to do that. Okay, ready. Very simple. I will show that in this presentation. But the problem is that you have one. So to have something simplified in one place. And also to do this for the next thing that we show in a second. So geofencing. So this is a typical example of how you can block a country on Linux. So this is a simple script. You have a country. You have a country where you want to ban. So you download from a certain place. This is a typical example. The list of networks that belong to such countries. You have to make sure that this is updated because you have to do it from time to time. Because countries in particular, sometimes there is a network that is sold or acquired by the companies belonging to a country or another. And you have in essence to download that didn't occur, but I hope that it is not upset with me. download the file and then put it into the file. And this is the way it works. And every day, you have to do it every week. General from time to time, you have to do that. And if you have several countries, it's a country, one, BCN, whatever, so you lose here track of what this is in this country. So you have to remove everything and start over. So this was how I understand that configuring this is complicated. Even though solutions exist. Again, geofencing is not the way to solve cybersecurity problems, but it's a nice way to mitigate them, for sure, at least for specific protocol. So this was one of the motivations. The second thing I told you before is so-called block list. Block list means I want to block specific people based on something, okay? It's usually based on black list. So a set of IP addresses that, as I said, have been pre-labeled, or if you want to use the machine learning word that today is very popular, or if you want to say, there are some people that for some reasons have put in some list, some security guys, that have done something nasty to other people in a few days before or the previous week. Usually these are created by putting honeypots on the internet. And when you see the violations that people are trying to break to those honeypots, you will see that these guys are labeled as bad guys. The nice thing is that there are several black lists available on the internet, but this is not good news all the time. Unfortunately, because some of those black lists are run by volunteers, okay? So some of them are good, some of them are not maintained. So in particular, if you imagine the free VPS services, so those that you can buy for five euro a month, they are constantly moving. So today this IP is bad, tomorrow it's not bad anymore, and the thing happens to the reverse. So the reputation is something that is dynamic. So you need to find black lists that make sense. So for instance, these days we are using the, one of the black lists that we are using, the Stratosphere IPS black list. That is a very good black list. But unfortunately, since some days, they are black listing for instance, Google, or they are black listing with GitHub, or they are black listing 888.0999, so they are the public DNS servers. But if you don't have some regional knowledge, so something on the place where you live, that doesn't mean the town, but it means the network operators, or your neighbors. So let's put it another term. If you take a black list that is coming from the US, it will be 70% effective to you, but it will not protect you for the rest of the problem. So we need something that are created here. Also some of the black lists, please read the paper, use very large CIDR. And I don't believe that everybody is bad. Maybe in this room there is somebody that is bad, but I don't think that 99% of the people are bad. So it put a slash, whatever, slash two or three, this room and everybody's bad. So this to say that they're good, if you use high quality black list. But if you want to take something on the internet and put it, the longer is the better, so this is the wrong way of doing it. Also we have the problem of attacking on services. The problem of the service is that most of the time, it is encrypted. If it is not encrypted, very often it is becoming encrypted. And this again, it's good news. So the black list is a way to prevent nasty people from contacting you, but then the rest of the group, that should be the majority, can still create problems to you. This is what I wanted to say. So the black list are not the solution. They are nice to have. They are nice to have a solution to put in place, but they are not enough. So if you look at log servers, this is an example from the log server I did when I did this live two days ago, it's one of them, it's full of them. Most of the logs are like this. So you would see good people contacting you, but most of the logs are attempts like this. So authentication, authentication, authentication error, too many attempts. This is the email and this is the web. So if you look at the WordPress, it will be even worse. I tried to break, you see, they tried to put a single config file for everything. Okay, you put everything in one place. And because it's designed for security, we have also put some security features, okay? That spans from the network side, but also to the service side. So we have something we call watcher. So tools that are watching log files and searching for anomalies. It can block and block people, very good. And they call so refresh a blacklist automatically. So you don't have to create complicated scripts that sometimes they break and put several countries into it. And the result can be shared first of all through Telegram. So we receive messages from wherever there is something wrong. You can execute actions, okay? Or you can send them through Xerion Q and we are adding additional brokers, so for example QTT to distribute them. So to have them into a single location. And the config file, it is very simple. So you specify first of all the market. I think links mean drop or pass, okay? You can hear them, but it can also mean slow down. So you can mitigate certain traffic based on that. You can specify the policy. For the policy, you can specify what you want to do. So in this case, if the policies drop, allow these countries, okay? Or this continent, North America, okay? You can say everybody from this place can connect. If you, for instance, pass, it's the other way around. So block, allow everybody except those countries, okay? And you can specify what part you want to monitor, so what part you want to enforce your rules, and what part you don't want to look at all because they have to be open. So something to ignore. So for instance, this is 10TP. In addition to that, you can put some honeypot ports. So it means that if you want to say, okay, these are the list of my services, good. But if somebody connects to another port, that is not one of the ports, so why is it doing that? Is it a mistake, so it does it once? Or it is an attempt, it's a scanning, so we have seen before, so network discovery. So once we made the decision, we marked the traffic as good or bad, and the Linux does the rest. No more packets are sent to user space. So very little, you know, CPU usage, and everything is happening inside the, inside the kernel, in VST, it's a little bit different. And watch it suspend by the tool, so when you do this control, start, start and stop, everything is done by that, you do reload, that's the automatic. And automatically it refreshes the blacklist or blocklist every night for you, so you don't have to do anything. And this is a typical example. So look at the time, again, this is when I did my slides a couple of days ago. If you look at the time, you'll see, and these are just two servers. It's always like that. Most of the problems are created by some people that are spending their time to do things that they should not do. So our service, for instance, is in the Netherlands. So we see some time, very simple patterns that they are moving from one server to another to do that. So this is all, but what's next? So what is the idea? As I said, one of the motivation was not just to create a simple administrative tool so that in one place, everything happens. But we want to create something that it is used to secure services for everyone with one single config file, very simple, in JSON. But we would like to do something next. So what we are creating, we are creating a sort of cloud. So in essence, we would like those tools that are deployed to be put in a way so that they can speak each other. The cloud doesn't mean that you send everything to Amazon. Send your data to the party people that you don't know what type of use they will do, what that can pollute your data with probably some IPs that are wrong, maybe simply because they want to block other people. But the idea is that you can run your own server, especially if you are a service provider, so you have many services, or you can put this service also on a laptop, okay, on a Linux laptop that you can bring home. If you see something bad happen, you can report this to the central server. And this is what we want to do.
Network Function Abstraction: A delicate question of (CPU) affinity ?
To Gaffani the renter of the So the thing to watch out is you may have heard with the other speaker. If when you move, you should be okay. I think I got it. He sure was rubbing against it. That's why it was making so much noise. Thanks. Right. So hello everyone. Today I'm going to talk to you about the challenges of deploying your network workloads at different levels of extraction, and particularly from the standpoint of CPU affinity. So first thing first, I'm Hadi, and I'm currently part of the vector packet processing team at Cisco alongside Nathan and Hedy who contributed to this presentation. So we're all contributors to the FDIO VPP project, and Nathan and Hedy are active contributors of the Calico VPP data plane. So first, I'd like to introduce the concept of network function. So it's basically any application or network service which processes packets, and you're probably familiar with physical network functions, which are appliances like routers or switches. And in these systems, there is a challenge to L4, and also it can be used in many network functions as a very performant data plane component. So rather than using a scalar packet processing approach, it goes for a vector packet processing approach using an optimized graph of packet. So also, I'd like to talk about the Calico VPP data plane. So Calico in itself is a CNI, a container network interface for Kubernetes. It allows to deploy Kubernetes cluster with additional network and security features, and it allows for seamless communication between Kubernetes workloads, maybe VM, containers, or legacy workloads. So what's nice about Calico is that it enables the use of other data planes, in this case VPP. So Calico VPP allows for the introduction of IPsec tunnels and also wire guard traffic. And it's also been in GA since December 2023, if you want to check it out. So first, I want to give an overview of the CPA pinning problematic. So currently, we have CPU pinning, which is by definition binding a process or thread to a particular CPU or at least a set of CPUs. And within the scope of network workloads, this allows us to ensure stable and optimal performance for network workloads. And you may have some workloads which might be with a single thread or they may be multi-threaded, so you want to avoid contention. And also workloads need to be quite performance. At some of them, my process hundreds of millions of packets per second and require the most out of your CPUs. So the question is, how do you select CPUs for pinning? And also, why is CPU pinning important in the first place? So of course, there are some concerns concerning the memory architecture of systems. If, for example, we have one of your network workloads which is pinned on any new node, we'll try to access the memory in another new node and you'll have additional latency. So the best practice to have is to try to pin your network workloads on CPUs in the same new node. And of course, distance also for any network interface card that you might have as if it's present on a new node and you're trying to run a network workload, which will use that device will have additional latency. So in short, try to be numer-aware. There are tools if you'd like to get your numer architecture or visualize it directly from the terminal or getting picture outputs. So some short recommendation would be to avoid pinning on Core Zero as some processes run by default. And if you can change kernel boot parameter and you want to set up system for maximum performance on benchmarking, you can try to modify some settings such as ISO CPUs to isolate Core. You can change also the affinity of some of the IRQs or in reduced kernel noise with no HZ in the score for to remove system cloutics. So what we attempt to do is see the impact of CPU pinning on one of our example workloads. So we wanted to test connection between two VPP instances that establish an IP-6 tunnel. So we run a workload locally using only virtual interfaces and which is each VPP instance using one core only. And the results were as expected, we managed with proper pinning to get the best performance around 10 gigabits per second with IP-6 tunnels and the more we went into scenarios like cross-numa or using only two core to pin for process, the two VPP and the two IPERF, this is where we observed the most performance loss. So moving into abstraction challenges. So first of all, the virtual machine. Virtual machine are popular to employ network workloads in. They allow to abstract or hardware and have multiple isolated system where we can run our network workloads. One of the main network namespaces and separate file system, but they also have Linux kernel construct that allow us to manage the resources of our containers, which is C-groups. So we're directly also able to pin the threads of our network workloads running within container using task command previous command we've seen. So if you look in particular at one controller of the C-groups, the CPU set controller, it allows us to limit the set of CPUs on the host machine which are available for a container. And when we do that, we can move towards dedicating some cores to a specific container or a specific workloads. And of course, it needs to go and turn them with isolation on the host machine of the CPUs that we're going to use. And you should watch out for difference between C-groups V1 and C-groups V2. C-groups V2 has been around a long time since 2016, but there are many systems, especially against the system, which still run with C-groups V1. And this is going to cost a mission on where you can fetch the current CPUs which are available. So I like to talk about one of the challenges we have with VPP, a concerning CPU set. So right now if you run VPP and bare metal, what we can do is rather than using task set, we can provide VPP with a list of core we'd like it to pin itself on, to spawn threads and pin itself on. And bare metal, it works pretty well. So we're going to ask it to pin itself from 0 to 3 and it's going to pin itself properly and run without any problem. But here's the challenge. We're going to introduce an abstraction which is containers. And container in this case will have a limited CPU set. We're going to ask it to only use CPUs 4 to 7 on the host machine using C-groups CPU set. And what's going to happen? We're not using task set. VPP is trying to pin itself on 0 to 3 and this is going to fail. This is going to fail the P-throw. VPP takes that into consideration and it's mapping. So again, what we learned from this challenge with the CPU set is that we should be aware of the cores that are exposed by environment. Our environment is going to expose currently available resources and at the same time we need to be aware of how application that's running within the container instance in this case fetches available cores. So similarly to our previous use case, we tried to launch an IP-seq tunnel between two VPP instances. But the twist was to introduce an additional abstraction layer which is container. In this case we used Calico VPP to route traffic between the two VPP instances. And we wanted to see if we could expect on the performance side similar performance result with abstraction. So by adding an additional hop we obtained with our IP-seq tunnels a traffic of around 9 Gbps. So compared to the 10 Gbps obtained with Burr metal, this is pretty close. We lost 1 Gbps but we gained the additional feature introduced by Calico VPP which is additional security and isolation. So to close it up, if you're thinking of switching to performant virtual network function, think about VPP. If you're thinking of adding a performant data plan to your network function and think Calico VPP if you're using Kubernetes workloads. And of course be aware of your architecture when configuring especially the layer of abstraction you're currently running your workload on. And stay tuned if you're interested in seeing more VPP. Pim will have a talk this afternoon. But we'll talk about how he managed to get hundreds of millions of packets per second with MPLS on an XPC with community hardware. So this is the test spec we used for our machines. And thank you for your attention. So right now we're assuming that a set of course is already been made available to a container. And this is going to be the role of the administrator deploying those containers. It's going to deploy them on specific human nodes. This is an assumption we need to take. If for example we have a container with a set of core that are present on a human node and set of core that are present on another human node, there's not much that can be done with this system. So we're going to take a set of core that are present on a human node and set of core that are present on another human node. There's not much that can be done with this system. So there's a need for awareness at different types. Yes, I was wondering related to kind of big little design CPUs. Is this a problem as well for pinning or did you look into this? Sorry. Like big little CPUs? No. Like the ones with performance cores and efficiency cores? You mean with the new Intel architecture with the yes, with P cores and E cores. So no, we are not considering this. We're trying to be as agnostic as possible. Thank you. This is going to be the last question. Thank you. There's the elephant in the room about hyper threading and the fact that the repeat does not perform well if it's scheduled on through on two different hyper thread cores that actually sits on the same physical core. How do you schedule workload on one core but not on the twin core? Let's say. So one of the issues that might arise if you have two cores that are on the same. If you have two threads are in the same core using hyper threading is that for example, if they're going to deal with the same packets, they might deal with the same packets of cash information and this might create some contention. Yeah, this will definitely create some contention. So it's a specific use case. This is fine if you're only going to read information, same cash line. But if there are rights, there's definitely going to be some contention and lock and slow down. Thank you very much. Thank you again. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. I just changed it. It's a slide also here.
Testing iptables firewall rules with scapy
Yeah, then hello everyone, so we would like to start. Okay, yeah, so today's topic is testing IP tables firewall rules with Skeppy, comply with the cybersecurity requirements from the UNECR 155. Okay, turning the on button helps. So yeah, first of all our agenda, so first of all we want to introduce ourselves and also our employee shortly. Then we want to ask the question, why test your firewall rules? Afterwards we want to talk a bit about the basics of the net filter subsystem and IP tables. Then we will show you why we choose Skeppy as a tool to test the firewall rules and we'll also show shortly the tool landscape that is out there. After that we will now for over 10 years. ElectroBit offers the EB Corpus Linux distribution based on the Yachto project, but nowadays we have now also version based on Ubuntu and we are cybersecurity management system compliant. So now coming to the question, why should we test our firewall rules? Well, the answer is basically in our case, well, we had cybersecurity requirements for our embedded Linux distributions and for the automotive industry. This means complying to an Air 155 which is demanding that you basically take care about the cybersecurity for the software that goes into your car and starting in just some few months all new vehicle registrations, so vehicles that are new in the market will need to comply with that and as we are working for building distributions for such cars, well, we basically also needed to certify and test our firewall rules to sum it up. So what is the overall situation? Okay, we have a packet filter and this inspects our traffic in the networking stack. Of course we have different use cases for using this packet filter like the firewall, traffic statistics are locking and in our case we have a net filter in the corner space and for user space because the project was already going on for several years. We actually have IP tables still and not NF tables, so that's our overall setup. And just that much about that. Okay, I think it's forward. Output and post routing and then the egress and yeah, that's just to give you an overview how the net filter looks in general. So and IP6 tables, then a user space program, I think you all know it but just to repeat it and so that we are on the same page and to interact with the net filter. It is organized in tables like the filter nut and mangle table you see here. A table then consists of a change which basically again consisting of a list of rules which can match a set of packets and a rule that specifies okay, what to do with a packet that matches. You can say for example, I'll drop it, I return to somewhere else, I accept the packet or some user defined action and if the packet doesn't match then the next rule in the chain will get executed. So as I've already said, that is going to anywhere that is not sent itself is then sent to the Docker isolation stage two and otherwise we are returning and in the Docker isolation stage two we are dropping every traffic that is going to the Docker zero interface and also returning and therefore we are back at the original state there and yeah. Yeah, so now I want to give a short overview over the tool landscape but I will only highlight why we choose GAPI because we only got 25 minutes. So yeah, I think here you can see the most common tools that are able to create custom network packets so CUT, Nemesis, NetCut and GAPI and I think there are many more out there. They all have their pros and cons but I want to go now a bit into detail why we choose GAPI. Yeah, so why GAPI? So it's a Python based interactive packet manipulating library so Python is very common, I think everyone knows this and it has a very low barrier to get into. Yeah, with GAPI you can define send, receive, complete, custom packets and you can manipulate across different layers very easily so on the slide you can see link, network and transport layer so it's very easy to create their custom packets because the barrier is very low and you have a very easy entry to create your first custom packets with it and send them and receive them and have some highlights. Yeah, and what was also a reason that we choose it, we have already in the integration department running Intest framework that is completely Python based. So we needed to look that we choose some Python based solution. So how do we test the system? So we have the ingressing path and send then a packet to the application layer as you can see there and then we have also the egressing path where we send packets from the application layer then to the egressing path so in our demo version we will provide a show later. We use KEMU and send packets from the host to KEMU and from KEMU back and one additional thing, so GAPI has its own network stack so that runs beside for example the net filter so you have to also keep in mind. So we have here some basic examples to show you how easy it is to craft your first custom network packet. You see here the first example is just a TCP packet with destination port 80 and the flag as is set and as you can see this one liner, when you copy this into your GAPI console is your first packet and then you just have to send it to a specific interface or address whatever. So this is as easy as it is. So the second example is a bit more advanced. There you can see that you can also use random IPs with various other options here for the UDP protocol and it's very easy to understand what is happening I think or from my point of view so it's good. The other example we have is an ICMP package. So we just have here on destination IP that also works and then the ICMP protocol with type 3 and code 0. SCAPI also provides ready solutions so that for example for the NAVA discovery with IPVs. Rule here TCP, okay that we got covered then we have a certain port that we can also easily specify and then we have time to live of 8 here though that's also just setting the parameter here for the IP packet and finally we also are interested in certain flags. So here we are using for example SYNFLAG here once and then we have already our fitting packet and we can just send it. Here we are sending it just easily out with SCAPI. We can then on the other side SNF for it for example we take a look okay what is arriving at PC at p port 1 2 3 4 and then we can just take okay well those packet and the flags did match so it should be accepted by the firewall and otherwise okay well the flags didn't match and so it should be rejected. Or to have a different example here we see a firewall rule for the source that should match in the input chain on the source IP so okay again just put it into the packet then we have a certain destination port again with TCP so we are just crafting our packet again in SCAPI as planned and then we are just again sending it we can again just SNF on the other side with SCAPI and check okay whatever is arriving on ETH0 now we can also have a filter in our SNF function specified so we check okay is it TCP is the source as expected and then we are waiting okay we are waiting for one packet here and if this packet is there we are executing the packet to check function and okay we are checking here for the port if the port is correct and already SNF on the correct interface and check the IP address and if we are on if we are received a TCP packet on the filter already so it would match then the firewall rule and we can say okay it should be accepted or if the port is not matching I don't know it would be 23 then it should be rejected and here one last example it's basically again the same situation we are crafting a ficking packet with TCP and port 100 here and then we are again considering the interface here and also of course our IP address I think you can see it very well from the arrows and then you can just send it out here and again the other sub basically looks as known we are sniffing on the end in the TAP0 interface this is well motivated by our QIMO setup then checking the filter again some sanities executing a function again still checking the port and if this matches then already it should be all accepted by the firewall and otherwise it should be rechecked and I think we now have time for a demo yes so please yeah so as you can see now on the screen hopefully it's visible for everyone good I don't know yeah yeah so yeah you can see on the on the right window you can see the QIMO that is running we already loaded the firewall rules we want to test and now I would start the sniffing so that we can do the ingrass thing test and yeah in the in the IP table fields are have gone up so everything worked fine on the on the ingrass path so now we can also show the other way around yeah yeah I think so we can stop the demo I have I think I have some some mistake in one of the rules so now I'm not allowed to send a packet this is the output packet we showed before that gets redirected yeah sorry for that but if you go to the slide deck you can see you have can see here the demo so it's a link to the GitHub repo where we have also stored everything with a read me then you can see can test it on your own and if you want to extend something or yeah so yeah then we are already at the end of our talk so as a summary so you now know why you should test your firewall rules so it's yeah not only regarding to the UN our UN EC are 155 yeah you get some overview over the net filter IP tables we showed you bit the lens very short overview of the landscape and why our solution was skeppy yeah you hopefully get some basic insights how you can use skeppy for your test cases hopefully and yeah also you saw that you can test IP tables with firewall rules if the if everything is set up right so yeah with that we want to thank you for your attention and if you have any further questions then please feel free to ask now or you can also see our contact informations on the slide yeah thanks also from my side sorry about the screwed up demo I promise on give that it's fixed if not we are fixing it today still so questions hi great presentation thank you so I searched occasionally over the last several years for something that can simulate IP tables rules without needing a full network stack since for example the IP tables save utility can dump entire IP tables configurations and the syntax seems simple enough is there something non-obvious that prevents us from simulating rules entirely in memory without the need for a full network stack sockets ton tap devices or virtualization I think I cannot fully answer your questions so I would say from the perspective we have why we test the firewall rules so we our test case what was that we wanted to test our the firewall rules we have in our system so the showcase we had here in the in our company we are running the in tests directly on our target on the hardware because we want to know how the behavior is directly on the system with the complete firewall config loaded so that we can ensure everything we have in our case we have requirements and we need to test if the requirements are fulfilled in the in the firewall we have in the system and so yeah that's true and yes kept the supports fast testing and we are aware of it and yeah so the basic answer is yes and how you of course say okay I want to have a focus on passing there or there I mean it's not to be generalized because it of course depends on the concrete firewall the concrete user cases you want to cover but yes kept the supports it so that's also a great part of it I would say and I was that I think we can thank you and we are now directly on time so sorry but please have a nice day have a great post
iputils project introduction
Whether you have used pink or trace road or trace path, some of those implementations, I just wonder, does anybody use arping? Okay, you are network administrators, I guess. And clock diff, has anyone used recently clock diff? No, that's a nice question. Thank you. IPv2 is a very old project. It was started by Alex Seikyuznetsov in 1999. He was a Linux-Cannell network upstream developer. He was also IPv2 upstream developer at the time. He ported some BSD sources from Linux and he wrote some other tools for IPv2. And he maintained the project till 2002. He also used net-death-linux-cannell mailing list. Hideaki Yoshifuji was the next maintainer. He was also Linux-Cannell network developer. He was doing IPv6 at the time. Hideaki improved the project a lot. He started to use Git, so we have some history now. He moved the project to source for Chnet, which was popular at the time. And he still continued to use net-death-mailing list. He introduced use-illipsee support, so it was not just for G-Lipsee. Although he made his last release in 2015, the last widely-adopted release was probably the previous one from 2012. Because IPv2's development slowed down, David Heidelberg forked IPv2 and moved development on GitHub in 2014. The initial goal was to upstream various patches from Linux distributions. Still at that time, I did also muscle-lipsee support and other things, because the tools were very old. License cleanup was done, which people from Linux distributions approved or were happy about that. There were other people at the time, for example, Janssen Aček and Pavel Šimetdá. They were both from Redezhet. Pavel improved a lot, modernized the code. He started to use the new C-functions, get other info instead of the old ones, which were for IPv4 or for IPv6. And there were other improvements. Semi-Carola was the next maintainer, starting in 2017. He modernized the code a lot. And he also introduced Messonbelt system. There were other people at the time, Noach, Myron Hans and Yuri Hornovian, who still maintains localization. There could be another question, who needs localization for tools like Pink? Really? I guess not really many people, but I got approached that people really like localization. So I kept that. I came in 2017 and actually there were obviously many people in Git history. There are nearly 140 contributors. And there was history before. So current tools. IPv2 tools have currently Pink, Arping, Tracepath and Clogdiff. Pink sends ICMP a correct Vest to network host. It's very old-called from 1983. I think it's the most important IPv2 tool. And it supports both Sockets, raw socket and ICMP datagram socket, which is more secure. Unfortunately, not all distros use that. I see some of the people from the Bien. So I would recommend to stop using raw socket. But the reason why it's used is system D, which is not used on all systems. You know, the Bien supports other, other in its system. So that is reason why Pink wouldn't work by default. Yeah. Below we have example, pingingsusa.com. That's very basic example. I'm sorry. Pink supports obviously a lot of functionality. So there are loads of switches. So just a simple example. Arping. It sends ARP requests to network host. It was written by Alexei Kuznetsov. And it supports IPv4 only because the protocol itself is for IPv4 only. So, again, basic example. Trace path. It traces path to network host discovering MTU. Again, it was written by Alexei Kuznetsov. There's a small example. Tracing path to suce.com. And clock diff. That's again very old quote from 1985 from unknown author supports IPv4 only. We removed some obsolete tools in 2021. Those tools were using some experimental protocol which were not relevant later. Or there were much better implementation of other tools. So there was no point to maintain something which is not really well used or it's kind of buggy. Because those tools we have in IPv2 are basic network tools. You know, written long time ago. There are obviously other projects which are implementing similar tools. So just to highlight some of them. F-Ping is very enhanced ping. It's written in modern C. It allows to ping any number of targets. Its output is designed to be parsed. So it's good for using in scripts. Also it doesn't perform reverse DNS lookup by default. Which is in some cases faster. MTR, my trace road. It's a tool which combines trace road and ping. It uses QE and N-Curses. And it's also for free BSD. Very nice tool. Those two projects are collection of tools. So busybox is for low power embedded devices. It has many tools and among them are ping, ping and trace road. It's somehow compatible with tools from IPv2 but it implements just part of the functionality. Inetutils is old GNU project which also has RHS and stuff like that. So very old project. Not that active nowadays and it has also ping and trace road. So future, IPv2's future, what we should do. We should rewrite the code to the modern C. We concentrate mainly on ping so other tools are neglected. I wonder whether we should keep clock diff. Also trace path, it's questionable because my trace road is much better. There is trace road, the original project which is also better than trace path. So it's a question whether to keep this. Project would need reviewers and real network developers. We should write tests because we have CI tests but we don't have functional tests. So sometimes regression slips in. Tools could have JSON output and color output. So that's it. Do you have any question? Sorry, I didn't quite understand how system D or lack of it can force to use row circuits. There is a sys control tool which handles kernel parameters for networking. ICMP data gram socket is by default allowed just for root. So if you want to have ping just for normal users and you want to use safer ICMP data gram socket you need to set something. And this row says that with ETC, CCTL config or somehow is called that file. And this works differently for system D and for other tools. So if you want to use busy boxes in its system then you would lose this configuration. I would say mainly there should be a solution just not to block this and there is the band bug report. But no one works about that. Any other questions? Hello. So I have one question. What is the future of the IP tools? So what's the next feature or roadmap that you are actually getting on? What's the future? Or like five years or ten years? So those tools are very old. So one would say the work has been done. But the problem is there are bugs, there are improvements which can you know, broad regression. My motivation to join the development was to keep ping working because I need that for Kana network testing. So I would say there is no future otherwise someone finds interesting to rewrite the tool as an exercise to rewrite them into more than C because the code is terrible. It's 40 years old or something. So no real future but I think JSON output would be a good feature and color output would be also good. So some of those. But mainly maintenance mode.
Orchestrating Change: Automating GÉANT Network Migration
Thank you. I just used an upper around Ansible runner, which is part of Ansible core. So it doesn't expose us too much to any kind of red dot thing. But as you can see here, at the end, you really call Ansible like you would do normally. So in this case, we are populating some extra variables. These WF4 trunk JSON is our subscription with all the attributes that we need for that specific subscription. And then a bunch of extra variables. And you call this playbook iptrunks.yaml. So yeah, this is also available for the masses. Something very important, and I'm going back to this idea of describing your business logic, is the modeling behind it. So we first started to look at the node configuration as a composition of something we call base config, which is specific for the node, or it's the same of every node. And let's say plus a bunch of services and prerequisites that gets added to actually realize the final goal. And so we can map services with products, and then we decompose again a service in product blocks. So for example, we have a product block that represents your physical interface. It could be a link aggregation or a single interface. We have a product block that describes the VLAN with the IP address on top of this interface. And then we have multiple product blocks that describes layer-to-circuits, BGPP years, etc., etc. And so this is an example for something very, very simple, but that if you think about it, it's not really native Ansible, not because you want to describe a coordinate, so a pipe that goes from A to B. So normally in Ansible, you would split it in two, and you would say the A side goes as host-vars on the A node, and the B side goes as host-vars on the B node. But actually that's not true, because you want to kind of define attributes for the pipe itself. And here you can see how the models kind of allow you to separate these things. So what is there, it regards the target. So it regards it's about your router, the site, and whatever kind of attribute comes up. And up there there are the attributes for your core link. So for example, your IGP metric and stuff like that. Whew, I'm late. And so just to wrap it up, this is an example of workflow. The operator fills in a form, then the workflow ending starts, goes to IPAM, gets IP address, gets DNS configured, calls Ansible, configures the router, calls Netbox, puts the router in Netbox with all the interfaces, etc. Yeah. So in case you want to do network automation, don't forget that this is a network problem, so you need network engineers, but it's a lot of software, so you need developers. And we found this super complicated. We found our self-lost in translation many times, but it's doable. Yeah, I don't think that telling you how hard it is to do DTAP. Yeah, you know. Yeah, this is the main takeaway. And if you want to know more, please feel free to approach me. Now I'm around. I have a bunch of links that you should find in the FOSTEM site. So if you're interested, all the code is there. Everything is for you guys. Happy FOSTEM. Thank you. Thank you. Any questions? Thanks. Yeah, yeah. We use NetConf to push configuration. If you say something like this to a network engineer, he will start crying. And it's also not true, no? Because this is done for standard provisioning and the provisioning of service. If stuff explodes, of course you go CLI. Have you had a look at Nordnir and Naipalm in terms of... Can you hear me? Okay. Did you have a look at Nordnir and or Naipalm to replace Ansible? Like to execute configuration on routers? Yes, but no. Yes, but no. But that means... How to say... We didn't... No solution for you. Sorry? No solution for you. No, we have the solution. So the question was if we were looking at Nordnir or other automation tools, what I'm trying to give you is the concept that automation is kind of simple. Okay? And especially in our network, we have 40 nodes. I don't need anything super performant in terms of scalability or things. If you compile a GINJA template and you are able to talk with my device the way I want, so for example, Comfy Private, plus commit command and these kind of stuff, I'm happy with it. The complexity, the real complexity, is taking the process, the entire process made of emails and Word documents and calls and show it in the orchestrator. So maybe in the future we will look at Nordnir, maybe in the future we will do it, but for now, no need. I hope. Thank you. Thank you, guys. APPLAUSE
Flying higher: hardware offloading with BIRD
Hi, I'm here to talk about what we have done to, you talk about what we are doing at my work to, does this work? I'll just change slides here. Good. We are, we are an ISP in Copenhagen and we do fiber to the building type networks where we do the high concentration of customers, we can give some quite good rates. Then we currently have a little bit more than a hundred gig going through our network and if we look a bit at how a network can be simplified, then we have some external connections, PNIS, ISPs, transit connections and they are all going in to a router that is, DLC that means default free zone, so it doesn't have a default route, it only knows all the million different routes to all the networks connected to the internet and then from there traffic going into the network goes on to a free switch where we have OSPF cloud which knows all of the internal routing and then we have end users connected to that and we have some internal streaming servers so that we don't have to get any unicast stream from Amsterdam or elsewhere. We used to do this with an old Cisco big chassis router back when we had 10 gig, it was a very simple time and then we upgraded to something like this when we changed it to 40 gig and these both operators router and stick but I'll come back to that later and then in 2020 we changed, we just run it on a 2U server with 4, 10 gig ports changed together so still 40 gig and then in 21 we upgraded to 100 gig with some of these cuts and like the router and stick I talked about before is basically when you have a single port on your router and then you just use reland tags to differentiate between different external connections and you start doing that because router ports are expensive so it's a lot cheaper just to aggregate it with relands on a switch and it's also an easy way to scale without using too much money. So now we get into what TC flow is, it's an API extension to the kernel originally I think developed by Melanux but it's mainly marketed as just for making open v-switch more efficient and bypassing the host system. However, apart from just being used in this sense, it can also be used to do forwarding or other shenanigans. Here in the bottom you can see that on the network card there is an embedded switch chip so that it can have virtual machines represented as a port in the NIC itself and that way the offload rules can forward traffic directly into a virtual machine. Still not used for that however but this is what you can find marketing wise from all the vendors, they generically just call it something with OBS offload. A TC flow is part of TC which is Linux traffic control. It operates with chains and priorities and then you can have some different things you can do through packets, you can change them, you can drop them, you can redirect them or go to another chain and continue there or you can trap them whereby they go to the CPU. In the hardware offload part of TC there's a few other modules other than flow that can also do hardware offload but generically you have the rules of skip software or skip hardware so it's kind of the other way around where if you want something only installed in hardware you say skip software and wise versa. It's kind of wind diagnostic but you really have to read deep down the drivers for the different devices to see what do they support or just have the hardware tested. A lot of the hardware can do many of the operations that we need for this project but we have only been able to test it with Kandidex cards at this point. So these chains and priorities and rules in it is basically that the packet goes into chain zero and then it takes the first matching rule with the lowest priority and then it does whatever is there, it could be the go to and go chain one and then continue and it could do this number sequence wherever you can recognize it but this limit to 256 go to set max. But then they can drop the packet, they can trap it or redirect it and redirect can go out any other port in the better switch chip we saw before. I'm going a bit quickly because it's a short time flood. If we know what does it take to actually forward a packet in hardware, we need the wheeler to be different when it goes back out of the port and we need the magnetizers to have changed in order to prevent routing loops. We need to determine the detail or hub limits and then before there's also checks on but we don't actually need that rule in hardware, the hardware does it automatically. If we didn't suffer we would need that rule and then push it back out of the port that came in. Even on dual port nicks we cannot push it out of the other port. They are at least on the Kandidex cards, they are just two separate nicks on the same PCB. Not that they cannot talk to each other. So if we do this then it would look like this. If you just use a TC command. So what does here is that it says that it's an Ingress rule and it has a chain and preference and then it says it's a VLAN packet and it's a skip software and it's IPv4 this one and then modifies the VLAN as an action and goes on to modify the magnetizers and then decommends the detail and updates the checksum and then it pushes it back out again with the redirect. If we then show this rule with the show command afterwards then we get output like this. This is a bit edited to fit on the slide. So the action order here is actually a little bit longer. We can see that it sets these values with these masks when it changes the magnetizers and then the decommends detail is just a masked overflow. If you then use the dash S option then you get statistics. We can see that a few packets and bytes got pushed in hardware and never saw this off a path. So we used this for a few years. We just did it statically for all inbound traffic because for inbound traffic as shown in the diagram earlier it always just goes to the layer 3 internal network so that part is static if that site has power that will always work otherwise it wouldn't even advertise our address space. That will always work so therefore we could do that statically for some time but now we also wanted it to work for some high traffic outbound prefixes but for that we needed to work with BDP but yeah more than a bit later. So in the static case we basically have this chain and priority design where we have two rules and change the rule where it just spits it up in IPv4, IPv6 and then in chain one or two we first say all packets with an expiring detail needs to go visit the CPU so that the CPU can send back an IP packet saying that this packet exploded and then it matches some link net so that our BDP sessions don't die because we still need BDP to be able to so the packet for internal traffic link nets in our own address space we need to have rules specifically for that so that that actually gets to the CPU and then we just match the inbound destinations and go to the chain 4 and 6 where we have the big rule changing the backdressers and VLAN tag and so on so we ran this for a few years until we needed something more dynamic and this is basically how it looks when you have amount of where you place the rules so I wasn't able to generate enough packets in a small test setup that I could actually get the nick to hit the limit in the beginning but from 15 packets and 15 rules and more you can see that it decreases with the more rules I add to the hardware that needs to check the slower it gets at the moment we have around four million packets going through each of these routers so we had like this is the worst case where every single packet needs to go through this many rules so the numbers is a lot better if it's a very traffic pattern this is the worst case scenario then in order to make this dynamically and get all this these rules put in dynamically based on bdp changes we have made some software that just has an event loop with some netlink sockets and talks with the network stack in the kernel gets all the routes and links and neighbors and and then generates the tc rule set based on that and dynamically updates and then we have bird feed in all the rules into a separate routing table and and then it automatically gets notified there because we have a monitoring session in that link and then updates the rule set dynamically so that we can use the cpu on the long tail of all the non-offloaded prefixes and in the bird side we just have a kernel protocol to have an extra kernel table where we then have some type protocols copying from the full routing table select prefixes like all our own address space but also some select cdn's and other stuff that we then offload and in the future we need to make the configuration a bit more flexible so we can also handle directly connected paths so if someone wants to use this for their home connection if someone has 5.7 or something like that where they get 25 gig connections then you can we want to use this in our hacker space as well where we have 10 gig connection so like expand it a bit make it more flexible so it's not only our use case and then we need some kernel support on mtu and ecmp support but it's easier to do that after we've presented this and then we need to test if we can also do this with binding the port together so that we can use both ports in a dual port nick and just install the rules in both of them we've also tested how much power do we need to do a small embedded setup so with a dual port connected f5 we can run that on around 15 watt so that's a quite efficient solution but that's more for the hacker space scenario than for real world isp stuff yeah that was the end of it i think that was also the time yeah we are on time yeah yeah any question hi there yeah hi there so you've shown us performance graph of the worst case scenario yeah is it better addressed in software than in hardware in that case or it's uh that is also right now some performance issues in in the kernel side but that'll probably get fixed soon but the graph was the worst case in the hardware only because it had an amount of non matching rules followed by the one matching rule for the test traffic that i generated and if you would let's say execute the same scenario in software only would that lead to better performance or no it had like at the hundred rules it matches the software performance so equal performance right yeah after a hundred rules okay thanks so it's way better performance than software until you have a hundred rules in hardware thanks well actually the kind of got answered by your last question i was just wondering and do you do like net flow analysis or something to choose the rules that the choose the routes that you're going to use those 100 yes we we have some flow analysis to do that and select the prefixes and what is what's your normal level of how many rules you do populate at the moment we we only have around 30 or something like that but we need to do it a bit more dynamically and do it a little bit more clever so that so that we put related routes that are overlapping out in a separate chain so that it's it can check on the on the largest prefix and then only if it matches largest prefix go into the subchain so we need a little bit more of those kind of optimizations but currently all the ground functionality is there for the basic use case and all of the reference counting internally and chaining the primitives from the kernel together to have only have the same destination mapped in as one target so that mobile routes can go to the same next hub by having the same chain that they all jump to since you found some traffic to the cpu how do you protect your network from pathological traffic creating a denial of service we don't have enough of it to really do a lot right now but it does happen occasionally but we have so much capacity that it's rare that it's anything significant enough to want a lot of extra time spent on it we have first initial customers where it's mostly meta gaming and stuff like that that leads to details we don't have hosting and stuff like that where web shops are more targeted yeah excuse me i did not get how many rules you can offload to the hardware you can offload many rules to it but the chain of rules you can check right now drops in performance when you are above 15 rules and then it gets worse from there but that was in the worst case scenario where all the packets will go through that many rules where half of our traffic is inbound traffic and therefore all the inbound traffic to our own address space we can check that in the first few rules it looks a very low number of rules yes it would be nice if numbers were higher on the amount of rules but it just means you need to be more clever in designing the how the chains are constructed okay and were you able to experiment with other hardware from those we would like to do that as well we would like to test with other hardware as well but at the moment we we have some connex 4 as well and connex 4 might actually be but i only discovered that after i left for here but connex 4 is mentioned in some documents that it might support it but it also depends on the firmware versions because a lot of the supports for instance for TTL document and stuff like that depends on the new firmware release okay but they have some documents saying it's connex 5 onward it's a feature called asap2 that is this feature yeah so if you look in data sheets for asap2 from nvidia and melanox then it's it has this feature but they have some of the initial documents on it say that connex 4 should work as well yeah sure thank you okay thank you you
ZeekJS: JavaScript support in Zeek
Hello. If you can hear me well. Thanks. My name is Arne. I work for a company called Corelight. I work on the seek project. Quick information, who of you is using seek? Anyone? Three, maybe. I want to talk about JavaScript support in seek. But first, if you, well, there are not many people that have maybe heard of seek. It's a passive network security monitor. It's existence, well, a long time, 95, was development started. It's open source and BST. It was called Bro until 2018. Bro isn't really a name that you should use for a project anymore. So it was changed. And if you look at it from a high level, you sort of feed it packets at the bottom, either from a live network traffic, like live interface or from a PCAP file. And what you get out at the top is sort of a set of logs that describes what kind of activity is in your network traffic. If you look under the hood, there are a few more details. So it's an event-driven system. It has a custom scripting language. We have some, we call it broker. It's a messaging interface to talk between separate processes. Yeah. To give you a flavor of the logs that were at the top, sort of, those are single entries for single connections. So on the right-hand side, there's the con log, which is the most central log. And, well, there's the identifying five-tuple. We also support IPv6, but that's an IPv4 example. The service field indicates what kind of protocols Seek was able to discover within that connection. And then the bottom is sort of statistical information, like packet counts and duration. On the left-hand side, you see more protocol-specific log, in this case the quick log, which has been recently added. And for example, there's the, so you can see the server name in the client protocol. And if Seek is able to decrypt the initial packet of a quick connection, it forwards the crypto payload to the TLS analyzer, which can then extract that kind of information, and we put it in a log field as you see. That is sort of the data that you would push to elastic search or Splunk and then do your analysis there. That's sort of not Seek jobs, we just produce logs. Okay. It's a fairly old system. It has a custom scripting language, and it looks sort of, that's just a sketch. It's not actually going to work like this, but it sketches how the quick log entries created. So there are two event handlers, one for initial packets, so whenever there's an initial packet, that event is raised, and we create a info record, which represents the quick log entry in the end. And then there's another event that is the SSL extension server name that is raised whenever there's an SNI extension in the client Hello. And you can handle it and basically enrich that quick log entry with the server name or with the first name. That's just a heuristic here. The bottom is a sort of log write call where we actually then produce that JSON entry. So yeah, but it might look a bit unusual in the beginning. It's a fairly powerful language that has some network domain specific features that also allow you to write detections with Seek and sort of build advanced analysis also within that scripting language. What's not so great is sort of interaction with the outside world that log write, for example, is the thin layer above the whole C++ logging framework. So that is not implemented in Seek script, but then you have to do that in C++. And usually any extension that you want to do, you have to resort to writing a plugin in C++. Yeah, we do have so if you don't go to C++ route, we do have support for asynchronous HTTP requests. And if you look a bit under the hood, then the thing is spawning is read and it's launching for writing stuff into temp directory and into a file still and then it reads them and gives them back to the script. So it's a really scary implementation of an HTTP request. So the idea was to, well, why don't we use a language that maybe does provide all that stuff and sort of has a rich ecosystem and has is well known as well. And particularly the Node.js, because of the libraries and the NPM system, so that there was sort of the idea. And as a twist, we are doing this as a plugin and not by patching Seek source code base. We just want to build something external to add support to Seek to also use JavaScript. So quickly about plugins. They're basically shared libraries that seek loads at startup and within that plugin you can access Seek's C++ API or also hook into certain execution paths. For example, whenever a new connection is, so new connection state is created, you can implement the hook set up analyzer tree and attach something to that connection usually analyzers, a protocol analyzer we would say. They also really made components where basically implemented against an interface. There's no component for a whole scripting language, so we sort of resort to the first two to implement the JavaScript support. Okay. So that top hopefully doesn't look too unfamiliar if you have some JavaScript. There's an event object on the left that is called Seek, sort of a global object. There's a well known on function where you register that an additional function for a certain event in M. So that that looks more usual problem in the Seek script example. And as an addition, there's the there's the HDT module from our HDPS module from Node and there's also an example how you could put how you could post the connection you had the end those SSR server names mentioned before to an HDP endpoint just from within Seek script. So we want to get there. And the first step is to, as you prevent Seek from interpreting .js files as Seek script, which it would do with default. And you can implement hook load file and basically check if the if the file name that Seek is attempting to load is ending with .js and return one basically says well don't bother about it I'm taking over and we are stashing away those JavaScript files. And that works for files in the command line or also those with directives loaded. So the add load directive. Step two is sort of to initialize the whole JavaScript engine, sort of the V8 engine and the Node.js environment. There's documentation about that. There's a link here. This is sort of a sketch. It's a bit complicated but I have good documentation about it. What is happening at that point is also that we are loading the JavaScript files and so the top level Seek on calls are actually executed. So we need to provide this Seek on call already. So I'll say this is just step three. I need to slow down a bit. Just for myself. So step three is the call to Seek on is basically getting an event handler name and listener function. And with that event handler name we can use C++ APIs to look at the event handler object which is a Seek specific object representing that, well, belonging to that event name. From that we can get a script function which usually has a list of bodies and each of the bodies contains a statement list and then there are further statements. So usually the script execution is interpreted. So it just runs down all those statements and executes them. What the plugin can do is add another body into that list of bodies and provide the custom statement subclass which when executed really just calls into JavaScript and executes a V8 function. So when this first happened it was really exciting. You see a hello printer from Seek and a hello printer from console. It was nice to get done. What was not so nice is that you need to map types between those two languages. So there's different types on the Seek side and JavaScript has other types. For example the address or subnet type on the Seek side we currently just mapped to strings in readable form. It's not the most performant but it was nice to have Jason stringify and have IP addresses like that. I'm not going to talk much more about this. The last step was to integrate both of the IO loops. Seek has its own IO loop that is KQ based and Node.js has also an IO loop which is libUV based. Usually the Seek IO loop is blocking on an event call waiting for a packet to be served or a block of packets or a timer has expired or something else happening and an act on it. What the plugin can do is register something called an IO source and in the case of libUV the plugin takes the backend file descriptor of the libUV IO loop and installs it into the Seek IO loop which means that whenever something has to be done on the Node.js side like a client is connecting on a listening socket then the backend file descriptor of the libUV loop becomes ready and the Seek IO loop is waking up. Recognizing this is Node.js file descriptor that became ready. I need to transfer control over to that loop and the plugin runs the Node.js loop non-blocking until there's nothing left to be done and control is then transferred over back to Seek. Yeah, that was the most tricky part of the whole plugin. I didn't talk much about the picture before, the architecture, but where I would position that is sort of, it's not completely technical to correct, but if we have extended the event engine a bit with Node.js event engine down there and then also the Seek script language, so we have extended everything with being able to also use JavaScript instead of the Seek script language. As a summary, I find it really impressive that we could do that without actually patching Seek. Everything was in place to pull this off which is testament to how Seek was built over the years really. We're not going to replace the Seek scripts that are existing with JavaScript, that is not sort of the plan. The integrations you wanted to build or maybe just wanted to have proof of concepts of things that you previously needed to quickly use C++ and find some C++ library to do whatever. You can now tap into NPR ecosystem or JavaScript and try it with that. That plugin is sort of coming with Seek 6.0 by default, so if you have LIT node installed and you compile Seek it will just be supported really. And our container images also have it built in by default as well. Any questions about that? Any questions? Hi, Armin. Have you evaluated the performance of this? Does it impact performance a lot? I would say it runs slower than just Seek and interpreted scripting, mostly because we need to translate between those two types. I would also currently position it to not necessarily run JavaScript in the packet path unless you are really adventurous. We have also Seek processes like the proxy and the manager that don't do packet processing. They have a lot more cycles there. If you run JavaScript there and do sort of pulling in IOC information, that's one use case, that you can do in a node that is not in the packet path. We would be interested in performance numbers. Thanks. Have you explored other languages as well, apart from JavaScript? Not explored, I sort of have in my mind as a proof of concept Python, but JavaScript was sort of asynchronous, it's non-blocking. That's a paradigm there and that's what we needed as a replacement for Seek script. Thanks. Any more questions? Thank you very much.
Bringing routes to Kubernetes nodes via BGP: introducing frr-k8s
Can you hear me? Okay. So, today I'm going to talk about a project that we started more or less this summer, which is FRR Kubernetes. Some quick words about me. I'm Federico. I work for this networking team at Red Hat, in charge of making the open-source platform suitable for telco workloads. And because of that, I managed to touch many network-related projects, the SRE operator, some C&I plugins, our primary C&I and lately MetaLEDB, which I'm currently maintaining. My handles are for the power for Mastodon, Twitter and Gmail. If you need to reach me out, ask questions, I will try to answer. So, it's funny because this talk has something to do with the talk that I gave last year here for them, where I presented how we in MetaLEDB replaced the native Go BGP implementation with FRR. And first of all, what is FRR? FRR is a Linux Internet routing protocol suite. It implements a lot of protocols. It's very stable and super-supported and supports BGP and BFD, which were a couple of protocols that we were interested into. What is MetaLEDB? Is anyone using MetaLEDB? Nice. So, I'm the one to blame if something is not working. MetaLEDB is the load balancer implementation for Kubernetes cluster using standard routing protocols, including BGP. BGP allows us to announce our load balancer services across our network. If you are using Kubernetes on bare metal and you need to expose your application, there is a good chance that you need something like MetaLEDB. It's not only alternative, but it's the one that I maintain. This is more or less the architecture. We have the Kubernetes API on one side expressed in terms of services and MetaLEDB configuration. We have some code that takes all these resources, munch them together, and produces an FRR configuration that an FRR Sidecar container processes and then handles the BGP implementation. Last year, in this very conference, I got this question. Can I run MetaLEDB together my FRR instance on the cluster nodes? This is something that I keep hearing a lot. Not only that. What I keep hearing is, hey, but now inside MetaLEDB you have FRR, so you can also do this and this and this and this. No, because MetaLEDB is about announcing services, not for example about receiving routes and injecting them to the node, which is a common request. Why is that? On the cloud, everything is easy. You have one single interface to the node, one default gateway. You get the client who wants to hit your load balancer service, get to the node, enters the CNI, goes to the pod, the pod replies, and then the reply goes to the node and then exits to the default gateway and reaches the client. All is good. But on bare metal, we have users that want to have different networks for different class of traffic, for example, and you have the clients that are not on the same subject. So what has happened in this scenario is that your client reaches your secondary network and guess what? The traffic will try to exit via the default gateway and will not reach your client, or even worse, you will be beaten by RPF and you'll have a bad time trying to debug it. I've been there a couple of times. So this was more or less the request. How can I have something that is able to configure FRR running on my node together with MetaLEDB? There are a few alternatives. The easiest one, at least the easier for me, was run to FRR instances on the node. So I don't have to do anything on MetaLEDB. The user can configure its own FRR instance, but that comes with a few issues. You have duplicate sessions, you have duplicate resources consumed on the node. You have to use custom parts to let the router know how to connect to MetaLEDB and how to connect to those custom parts, to the other FRR. The other option is using two FRR instances in Cascade. This might work, but FRR wasn't able to peer with localhost until recently. It limits the flexibility of MetaLEDB because MetaLEDB has a lot of per node configuration knob. You can say I want to peer to this BGP peer only from this subset of nodes. In this case, these will affect only this session, which is useless. And also, what about the BFD implementation in MetaLEDB? It will establish BFD only through this path. So the next one, which is the one that I'm going to talk about today, is to have a shared instance between MetaLEDB and the rest of the world. So the extra configuration can scale. We can have something that is distributed across all the nodes. We don't waste resources because across the same BGP session, towards the router, we can do what MetaLEDB needs to do, but also other stuff. The cons were this was a lot of work. And getting the right API was tricky. It wasn't clear how to handle conflicts, how to merge all this stuff together. But eventually, this became a design proposal in MetaLEDB, and it converged, and we started working on it. And this is how this new project was born. It's Kubernetes-based, the one set that exposes a very limited subset of the FRR API in a Kubernetes compliant model. I wrote this description, so it's nice. This is the new architecture of the new thing. It's basically we stole what we had in MetaLEDB. We have this new FRR configuration resource, and it does basically what I already described about MetaLEDB before. But now we have a different API and a different way to configure this thing. How to deploy it? It can be deployed as a standalone thing, and this is something that I want to stress. We can use it together with MetaLEDB, and we just released a MetaLEDB version that uses this one as a backend, but you can also deploy it as a standalone component. So you can use it for your own purposes, regardless of the fact that you are using MetaLEDB or not. Now I want to talk a bit about the API. There was a good amount of discussion on this, like we were not sure whether we should have exposed the raw FRR configuration to the external world or having something that was higher level. Because there were some issues in this. How do we merge configurations? How do we allow two configurations, produced by two different actors, to become the same FRR configuration? How to intercept the configuration conflicts? If it was the raw configuration, that would be our Royal mess. And also the way MetaLEDB configures FRR is very opinionated. It gives some certain names to route maps, it gives some certain names to prefix lists, and if we wanted to extend that with a raw configuration, that would have become part of the API, and it would have been something that we couldn't change. Eventually we ended up with something high level in terms of CRD, which is FRR configuration. And this is how a configuration looks like. It has a BGP section in the spec, because we are anticipating that we might need other protocols. We have our outer section, we support multiple routers, but they need to live in different Linux VRFs. We can configure the neighbors, and we can say what prefixes we want to advertise or to receive from those neighbors. And this is how advertising looks like. We can say, I want to advertise all the prefixes that I configured in my router, or only a subset of them. And the same is more or less for the receiving part. We can say, from this peer, I want to receive only the prefixes that are matching this selector. Or we can say, I want to receive all of them. And we have an old selector, so you can say this specific configuration applies only to a subset of the nodes, which is always useful. And of course, because we know that there will be a lot of configuration that we don't cover, we also allow for experimenting or for covering special needs, our configuration, and there is a priority field where basically this gets appended to what is rendered inside the configuration from the API. And of course, we have BFD, communities, local preferences, and all the stuff that Metal.ed is currently exposing. It's covered in this API. And now I'm going to talk a bit about how multiple configurations are merged, because this was a pain point. You have multiple actors throwing configurations at the cluster, and those needs to be merged together in order to produce one single FRR configuration. And there were some guiding principles into this. We wanted a given configuration to be self-contained, meaning that you can have prefixes on one side and saying that you want to advertise those prefixes on another resource. A configuration can only add to an existing one, meaning that you can add neighbors, but you can't say, I want to remove this neighbor, applied by this other configuration, because that would steal the work to other actors. And a more permissive configuration can override a less permissive one, meaning that if you have received none, you can have received some, or a receivable will override the received some. And this is how we can merge to different configurations. We have one neighbor on one side, we have two neighbors on the other. These two configurations are compatible, and then on one side we advertise only a set of prefixes, and on the other side we advertise all of them. And these are two compatible configurations that can be merged together. Another thing is you apply all the configuration, and nothing is working. It happens a lot. We have validation webbooks, but given that the configuration is composed by multiple configurations, we know how Kubernetes work, and some things might slip. So we are exposing the status. We have three fields. One is the last conversion result, which means that if you have multiple incompatible configurations that makes to the controller and the conversion will fail, this is where you will see the error. This is the status of FRR loading the generated configuration, and this is the configuration running inside FRR. So it's something that can be used to inspect the status of the thing. With Metal LB, again, now with this new implementation, we have the same Kubernetes API on one side. Metal LB will generate an FRR configuration. It's going to be read by this new demo, which will talk to the router. And this is how a Metal LB configuration gets translated into this new one. So we have the routers, we have the neighbors, and we have a selector. So each speaker will generate a configuration for the node where it's running on. Yeah, this is what I just said. And this is when we add the service. So we will start advertising those prefixes related to the load balancer service, and things eventually will work. And of course, this is something that can be expanded, providing your own FRR configuration that gets merged to the one generated by Metal LB. I have a very quick demo. It's my first time on a live demo, so fingers crossed. Very quickly, the demo environment is a kind cluster. We have the demo running on each node. We have an external FRR container that represents more or less the external router. And now I'm going to... Okay, so here I have the kind cluster and a bunch of configuration. We have here the external container. It's paired, or it will want to pair with each of the cluster nodes. And also, it will try to advertise a couple of prefixes. And I can go on the configuration side and look at this. This is what I just stated. We want to advertise only one prefix. I'm going to apply it. And hope. Okay, so the session is up with the whole three nodes. And we have the single prefix advertised by the three nodes. And now I can look at this other one, which says advertise all. And I can apply it directly. And it's going to be merged, hopefully, to the other one. And then now we have two prefixes advertised. So it's working. We have CI, so it shouldn't be a surprise. Now I can do something on the receiving side. Here we want only one service out of the two that the external container is publishing. And this is a session inside the node. And eventually, yeah, here we have the last one is the route that is published by the external container. Yeah, what else can I show? I have five minutes. Oh. I can do this. So this is a pod running into the node. And if I try to ping it from outside, it's not going to work. For example, what I can do is try to put that prefix. No pressure. And it pings. So again, another nice example. Thank you. Okay. So I also have other examples, but I think I stressed my luck already enough. And we have still five minutes. Okay. So what's next? There are these, I don't know what's next. FRR provides a lot of opportunities. This is more or less a subset of what Metalaby offers plus something that was asked by a lot of Metalaby users. But of course, you can come and provide feedback, suggest the new features, open issues, or even contribute to the project, hopefully. The good thing is we have a framework that we can expand and grow on implementing new FRR features. A few resources. We try to keep the documentation aligned. So we have an upstream Redmi. We have the Metalaby documentation. There is the Metalaby channel on the Kubernetes Slack, which is where I live daily. And of course, the FRR community is super vibrant, super helpful, and always open to provide feedback and give help to us. And with that, if you have any questions, I'll be happy to answer. Thank you. Why did you keep using the FRR configuration files, which are quite painful to merge, as you said, instead of using the North One APIs? Can you raise your voice a bit? Why? Is it better? Yeah. Why do you use FRR configuration files, which are, as you said, quite easy to merge, instead of using the North One APIs, which have a NetCon thing? Because at the time, that was declared as experimental. I don't know if things changed in the meantime, but like, okay. So then we can... You should. Okay. Yeah. But like, we had all this mess already in place, so it was easy at the time to recycle it. But yeah, if there is a proper API, I'd be happy to start moving to that. Thank you. Any other questions? Okay. Thank you.
Multi-network in Kubernetes: No batteries included
Check one two, check one two, all right. Thank you everybody for coming to our talk about multi-network in Kubernetes and how there are no batteries included. My name is Doug Smith. I'm a maintainer of Multis CNI, which is a multi-networking plugin and also part of a working group related to it and I'm joined by Miguel. I'm Miguel Duarte. I'm a software engineer working for Red Hat, particularly in the OpenShift Virtualization Networking Team. I'm also a member of the Network Blumbers workgroup and yeah, sometimes work with Doug on this kind of stuff. Awesome. So, we've got to rip through this pretty rapidly and it's a pretty complex problem space, but we're going to run you through it as quick as we can. So we're going to look at what exactly multi-networking is in Kubernetes and kind of show you what the problem is that we're looking at. There's also kind of like current set of solutions and then also future solutions that we're looking at as well. And even if you're not necessarily interested in the multi-networking problems in Kubernetes, we kind of hope that you're going to be interested in sort of the problems that we've identified that we think are really common to a lot of engineering problems in general and especially for open source communities. We also have a demo for you to watch at home because we have some short time. So the first question we should be asking is exactly what is multi-networking in Kubernetes. So the thing is it kind of isn't because it's not something that Kubernetes actually is interested in solving. What do I mean by this? So the Kubernetes networking model pretty much says that any pod on the cluster can reach any other pod in the system. Cool. How does it do it? Like one interface on each pod connected to the same network. One interface. If you need more, well, it's outside of Kubernetes. The community pitched in together and implemented that out of three. But well, first, why do you want multiple networks in Kubernetes? For instance, like network isolation, let's say you want to meet the compliance requirements of like you need to separate traffic not only in software but only physically in the network. This kind of thing happens every day. And for that you need multiple interfaces. Or for instance, you want to implement like a firewall or something. Well, you'll need at least two interfaces. So this is a reality. There's a need for it and Kubernetes does not do it on its own. So the problem, you don't have batteries for this. You can do it. The community has provided ways for this to happen but it's out of three. And you need to deploy a bunch of stuff for this. So you need to deploy like controllers on the nodes. You need to add more and more pieces. So it's solvable but it's not entry. It's not native to Kubernetes. Furthermore, while it works, its user experience is challenging to say the least. Like it's cumbersome to use. It feels clumsy. There are a lot of ways for you to get it wrong. Like if you just put an attribute that does not exist or put a typo on it. Well, it depends on the implementation what will happen. And at the end of the day, if you have something that is error prone, a lot of people are going to make errors in it. In one word, this is pretty much like Arcane Knowledge that needs to be used in it. So current solution for it is multi-CNI. So multi-CNI is a CNI multiplexer. So CNI is your container network interface. It's an API that allows you to specify how you are going to run plugins that talk to this API in order to plumb your networks, how you're going to connect your network interfaces into your pod to the rest of your network in Kubernetes. What MULTIS is designed to do is to multiplex CNI plugins. So you use custom resources, which are extensions of the Kubernetes API. They're not natively part of the API. They're a way that you extend it. And they give you a way to quote unquote kind of trick the platform. So you add MULTIS CNI into your network. You populate these custom resources with CNI configurations, but CNI configurations are JSON and Kubernetes resources are YAML from a user perspective. So you kind of mix both of those, and I'll give an example of that in a moment. But we also have an effort that's ongoing for a Kubernetes native multi-networking. So what this would do is take kind of this concept that we have out of tree and get these pieces natively into the Kubernetes API. So we would actually extend that. And probably as a building block, we may actually implement them as custom resources, which is a detail here. The one thing to keep in mind, though, is that this will be an extension of the API without an implementation for you natively. So it will actually still require an add on itself, which is also a bit of a challenge. But we really like the idea of extending the API. If you take a look here, you're going to see this is a Kubernetes PodSpec. You've probably seen them before, but we use an annotation. And those are freebies. Anyone can add an annotation. We have a specification for how it should look, and we see that specification. But it's got JSON in there. So if you're walking through this object in the API, you hit this. What do you have to do? You have to parse it, which is no fun, no fun in the least. Now with the Kubernetes native, you are going to just have it all YAML. So if you're writing a Kubernetes controller and you're using client go, you're just going to walk through this easy as pie. So it should be a lot easier. But we have to ask ourselves, what does the future look like? It's kind of a complicated scenario. Number one, we'll probably still have multi-CNI. It's out there. People use it. They're going to continue to use it. And then we might also have Kubernetes native multi-networking. So we've got these two things. But there's a bunch of other projects in the space that may be up and coming as well. So CNI 1.0 has been around for quite a while. And CNI plugins run on disk on the host. They're not actually containerized. CNI 2.0 would make a step forward to being able to containerize those and also would give us an opportunity to kind of update this API. We also have in the works the Kubernetes network interface, which is a proposal which would bring CNI and Kubernetes potentially closer. CNI itself is container orchestration agnostic. It doesn't actually relate specifically to Kubernetes. It was actually kind of invented in a parallel time to Kubernetes. So before Kubernetes won the container orchestration engine war, there were a bunch of different container orchestration engines. So it tried to fit the needs of all of them. But Kubernetes is the winner, and we kind of need a way to get a little bit closer. So let's figure out the lessons that we have learned. The first one is sometimes you have political problems. We want to extend the pod network and the pod object to get these items here natively in there. And maybe it's not so much a political problem as it is like a people and intentions kind of problem, like what are we trying to solve here? And not everyone sees this exactly the way. And this is a very core part of Kubernetes. If you've ever used Kubernetes, you've definitely spun up a pod, you've definitely touched a pod object before, or an abstraction of it like a deployment or whatnot. So to extend this, to add this network is hotly contested. And there's more than that. Like first, as Doug said before, things is that APIs are forever. Not in a sense that you have to maintain stuff backwards compatible, but pretty much that Maltes exists, solves the problem. And if you want to have multiple interfaces in a pod, Maltes is already doing that. Are you actually going to update all the manifests of your deployments to comply with this new API to stuff that is running in production? Well, maybe not. So there's this to kind of have in mind. Next scope creep. Like everybody wants to solve a different problem. And it's very hard to get quite focused on the, let's say, the least common denominator of the problem space. I mean, just doing that has been extremely challenging in these last six, seven, eight months, year. I don't know, like I lost track of it. And finally, handling a technological problem is a lot simpler than dealing with people and opinions. It's very, very, very easy to clash on those. Like it's hard to get to choose like a restaurant between your four friends to go out tonight. So it's even harder to get you to like agree on what the API would look like on something so critical and so, let's say, central and paramount as the pod spec, for instance. And here's like the, we would really like you to take a look at this demo. But again, like, yeah, better do that at home. Like just hit or scan this or hit the link there. And it's short, couple of minutes. And you'll see how the native current effort for native multi-networking looks like from the user's perspective. And yeah, that's pretty much it. So any questions you have? Well, fire away. Any questions? Can these additional interfaces be used to connect devices that are outside of the data center via VPN, for example, which is a problem I've been trying to deal with and couldn't find manageable solutions? Yeah, thank you. Okay. If these interfaces can be used to connect the pod, for example, to via VPN to external networks. Oh, yeah, absolutely. So this, something that you can do is to use these to connect to existing resources that are already in your network. So I guess a VPN absolutely could be an example. And so oftentimes the reason that people use these additional networks is they have existing infrastructure and they go to deploy Kubernetes to become more of a like a cloud native approach. But they have legacy systems that they need to integrate with. So if you've got existing networks, this would be a reason to do that kind of as a sidecar that you could go out to a network like that. Great question. Can you talk a bit more about KNI and how it relates to the multi-network problem? So that is an excellent question. So I would say that one thing that we're trying to solve is that the way that, for example, multis CNI works is somewhat inefficient. You have this flow to create a pod and in that flow is a call to CNI, which then is called today you would usually use multis to do that. When multis is called, it stops that creation of the pod and goes and queries the API, the Kubernetes API itself. And KNI may be one opportunity to make this flow linear in order to pass information directly to some type of multi-network solution that already has the information from the API, for example, instead of having to interrupt that flow to call the API. So that's one possibility. Another possibility is that, at least from my perspective, as Miguel said, APIs are forever. So multis, I'm a maintainer of it. Certainly it'll be around for a while, but as we may get the Kubernetes native multi-networking, it may also be a layer to do compatibility as well between kind of the new way of thinking and the old way of thinking. Is that helpful? Nice to see you, Gargay. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you.
Declarative Networking in Declarative World
So welcome to the next one in this track. My name is Mateusz. I will be talking about declarative networking now. Yes, that's very good. Yeah. Yes, we spent quite some time talking already about Kubernetes, how networking is done. I'm very glad that people from MULTUS took the hard part of explaining, you know, multi-networking at the level of containers. I'm also glad they didn't say anything about host networking because this is what they don't do, this is what I do. So we are going like very smoothly lowering the stack. So I work at Red Hat like they did. I'm best in Switzerland when I'm not touching computers. I do farming. I actually like it much more, but it doesn't pay my bills so well. Here we are. I don't do AI. That's something, you know. Everyone does it, but no. So I will skip why multi-networking because Federico was talking about this and, you know, if there are reasons for you to do multi-networking, you know that you need to do it. And if you don't, then you don't. It all started because, you know, clouds never care about multi-networking. You go to AWS, GCP, FB, ICIA, you pick your three letters. You get a VM. It has a single network interface and that's it. But at some point you realize you need more network, bandwidth and all this kind of stuff and you're just going to start doing bare metal. It won't fly anywhere else. And once you start doing bare metal and network configuration, probably more than once you've seen, you know, the very ugly network manager config file. It's just a static file and, you know, the syntax is somehow opinionated. It's okay once you learn it, but it's still a static file and it flies if you have one server. It flies if you have three servers, but does it still fly if you have 300 servers? I'm not sure. And one problem is that, you know, those are all files and they don't apply changes automatically. So you modify your file and until you restart the interface or the machine, you may not even notice that you've made a mistake. So you may have some configuration that flies for, you know, last five years, but in reality it shouldn't and the reason is just because you've never rebooted. So, yeah. There was another talk about this before, yeah, you shouldn't have your servers run for two years at a time, but, you know, that's another story. So what is done to somehow change this? So you don't need to modify this file manually. Network Manager gives you command, which is NMCLI and you can modify those configurations using somehow nicer syntax. You can say, you know, modify connection, IP address, yada, yada and it has slightly better error handling because you can see in this example, I never distinguish slash and backslash. Sometimes I will just write, you know, I will write slash 24, but it's not really slash, it's backslash and I will see an error, you know, invalid IP address. That's super easy. But then I fix that, well, I think I fix, but I'm putting IP gateway, which is not in the subnet of my IP address. It cannot fly, like this configuration is utterly wrong, but syntax wise it's perfectly fine and system will allow me to do this. So, you know, is it really the state that we want to have? Well, we could discuss. So we have some basic protection about some basic bugs, but yeah, we could do better. So we got this tool, which is NM State CTL, so we still live in the realm of, you know, Network Manager, but we want to try to be a bit more declarative now. We want to change this syntax so that, you know, at the end we do this for Kubernetes and Kubernetes got this very nice notion of APIs, everything is well defined, everything is declarative. So let's try making cost networking declarative also. So how about we create some API which would look almost like a Kubernetes CR and allow changing this. So let's fix a YAM in which you define your state and I think this is the biggest improvement over the previous file, that you define how you want your network to be configured and afterwards let's event a tool which will be taking care that this configuration really works. So I don't want to dig into details of, you know, this example here because it shows some basic configuration. IP address, IP, IP routes, DNS server, so in general something that you always need, it does that I claim that this syntax here is much nicer than the syntax of this file. We can argue afterwards that I still claim it's nicer and, you know, at this moment there are no containers in the game. We are talking about vanilla Linux, you can do it and you may not know about containers. But now how about we wrap it around API and kind and let's take it to Kubernetes. So let's make CRD out of this and use everything that we built in the previous three minutes to have something that is declarative and something that Kubernetes will be reconciling. So in this scenario and I think that's pretty descriptive use case, you know, you have multiple network interfaces, you want to bond two of them and doing this using all the static network manager yada yada, it's ugly. So how about you just write that you want to create a bond and let something else make it happen and let something else make sure that this bond is all the time there, that no matter what you do, you start SSHing to your Kubernetes nodes and all this kind of yada yada, let this be the safeguard that once you define this configuration is there. When you define a pod, you go, you may delete the pod, but if you have, you know, deployment, demon set, all this kind of stuff, something is there to recreate this pod. Why cannot we have something similar about the networking? Well, we can, so let me do the very short demo on that. So what I have now, I created and we will go through the examples in a moment. So first of all, this is something I didn't mention, but you know, Kubernetes CRs and Kubernetes API in general tries to protect you from doing stuff that won't fly. And you know, you have very extensive validations at the level of Kubernetes API and it's super amazing. I also would like to have something like this here. For example, I will try to configure on one of my worker of the Kubernetes cluster some stupid DNS server that simply doesn't exist. For people not familiar with IPv6, I defined here link local IPv6 address. So there is a very, very low probability that something like this actually exists in your network. And on the other hand, I have this worker and let's just look at the DNS configuration. So I'm running on my default setup, it's there, and I will try to now apply this configuration, which I told you is wrong and you should trust me that there is no DNS server behind this address. Okay, so we created this CR and okay, now it's not what I said because we see in the ETC results, which we watch all the time, that this address appeared here. But this is only because we are doing some kind of reconciliation of that and I have a time of 20 seconds on this controller now. So you can see that this CR is in the state progressing, yeah, 20 seconds already passed. So failed to configure and my configuration is reverted. I won't go into logs of what happened, but you need to trust me this server doesn't exist so it makes no sense for my real server in the cluster to get this configuration. So that's it. So we think you revert that and you get the feedback that, sorry, we cannot apply this configuration because it's nonsense. Apart from this, what I can also do is I have another file in which I will simply take some additional network interface and I will add IP address. Very simple, we do this very often when we are provisioning servers, but you know, maybe you just got some additional network interfaces installed or whatever. Doesn't really matter. At some point you want to configure the next network interface. So this server, we don't need it anymore. And so the output is big, but you want to look into this part. 3S0, we don't have IPv4 address, we only have the IPv6 because you always get this one. So I'm going to apply this configuration now. Now that should not be a magic. This address appeared, but that's boring. You apply some configuration and it's there. But what I will do right now is I will manually on this server delete this IP address and I will make my controller, which is behind every CRD in Kubernetes, to reapply this so that this IP address is back there because if I define something via CRD, I don't want some admin going around my servers and change this configuration. If we agree that we are configuring cost networking via Kubernetes API, let it be like this. So I'm deleting this. We don't have that. So we have the previous state. Now I will do some small hack for the purpose of this demo because I realize that the timeout is set to five minutes. So we'll need now to sit five minutes and let the controller to realize that something changed. I will just kick the controller and... So we were in the worker two, which is this one, so I will just kill it. But not the only thing I did, I deleted this pod. So it's not like I somehow magically apply this configuration again. And we see that the IP address is back. So again, if we just sit here and wait for five minutes drinking or whatever, this would be the same. So that's it. And also for the sake of completeness, I have a setup with the proper DNS server. So, well, I already applied this one, so there is no point in doing this. But you've seen the wrong one, so you have to trust me that the good one would be also configured there. And the slide back is here. So that concludes the demo. So some final remarks because, yeah, that was really lightning talk. So all this stuff that I showed you, the backend, which is nm state, this is written in Rust because why not because we can. It uses network manager as a backend, well, pretty obvious, but we could discuss, and this is something that should come afterwards. That today I have this, it works using network manager because this is what I do and this is what customers that I have behind want. But if there is someone without network manager with a strong reason not to do network manager, but would like to have this, we can discuss and I would be very happy to discuss. Of course, there is a Kubernetes operator behind this because this is what I just demoed. And you are not bound to using this as a CLI tool and this kind of stuff. There is usable API, so you can get Rust create for that. You can get Golang library, you can get Python, probably something else, but those are the most popular and this probably should, I assume, those three make everyone on this audience happy. And yeah, we have a moment for questions. If you want to talk more about this, you can find me on the Kubernetes Slack and yeah, that's it. My personal wish would be that, you know, Kubernetes, and we know it from previous talk, previous two talks, never really cared about managing cost networking. No one really wanted to take this into the realm of Kubernetes. Well, it's not like I wish that we got this API now into Kubernetes upstream, but I wish. So yeah. Maybe we have time for just one question. With networking, you can do the worst thing and pull up the network up to what you want. So what if, for example, you misconfigure the IP address of a node and the node is unreachable from the controller? All of it can be fixed. Yeah, so this is exactly what I showed with the example of DNS. I could have shown it with the example of IP address, but if you wanted to create a CR so that you configure, for example, IP address and the gateway and applying this configuration would make your node unreachable, then we will revert this configuration exactly like reverted DNS because that's the purpose of the operator, that it has a safety checks so that it applies configuration. It checks if all the connectivity works as before. In this case, we had DNS, so it applied new DNS. It was checking in the background. Can I still resolve names? After 20 seconds, it realized, no, I cannot. I'm reverting this change and the CR is marked as degraded. So exactly the same would be happened if you have IP address and you don't get connectivity there. All right, thank you. Great, thanks.
I want my own cellular network! Having fun with LTE networks and Open5Gs.
Okay, so before beginning just a question, how many persons are from network enterprise service or something like this or work with core networks? Okay, so I want my own cell phone network. So that's why you're here and first of all, I'm Alessandro Cieri, I work for one of Italian service provider and I work for exactly cellular network engineering and IP network engineering and I use open source software and Linux from many years now. Then just a few, just a little index because the first part is just a theoretical part because we need to understand some concept in order to make a network, a cellular network and then we have a demo so okay, we will play with hardware. I'll try to be as clear as possible about this. No, okay, so what are we talking about? We are talking about an APC network, so 5G core, sorry, 4G core and it's the core that's actually serving your cellular at the moment. If a smartphone you are connected to the blue blocks inside the slide, so central one and this is also the high level so it's the panorama network, it's post. The UE is your smartphone, so the normal terminal, then you have the run or the radio part of the network, the towers and the ref equipment, then you have the core network which is the brain of the world problem and then you have multiple interactions with other services because the package core is just for packets, it's in those packets of the network, not the world services, they are connected on other parts of the network and are integrated by specific interface, so that's the big picture. But even the core network is created by simple functions that are responsible for just one task and it's composed by functions and so these are just logical, not real nodes in the strict term, maybe in commercial open source implementation they are combined in one node, they are split but these are the roles, let me say, of the nodes and for every node we have a specific function, for example, the first node we are going to talk about is the MME which is the brain of the network because it's responsible for dealing with the signaling or in other terms the interaction between all the elements in this kind of network, so it's the brain, it's the element that's making the decisions and it coordinates also the other nodes. After that we have the HSS which is the node responsible for authentication, it's the user database talking about a bigger picture and the HSS is responsible for authentication because every user is authenticated, that's his own keys, otherwise I will be able to make calls from any of every number and it also handles the authorization, which service I am able to get with my SIM. Then we have the DNS which is not strictly a 3GPP name but it's real DNS because in 4G networks the discovery and the topology of the nodes is described by DNS records and the MME is responsible to get its kind of configuration or IPN resolution by making queries to DNS into specific records and making decisions based on the answer it gets. For example, if I have a TSC tracking area which is a common name for the area that is served by my radio, I have to get other information from other nodes that have to handle the traffic and this information are gotten by DNS. The first information I have to get for the example of people is the S gateway which is just a decoupling element because the S gateway is the anchor point between the network itself that is the centralized network and the radio part which is completely into the wild, it's on radio sites so it's distributed on the territory. In order to get this you have this element that decouples the plane between the internal part of the network, the core interface let me say and the edge or the access path through the network that is the run and the B. Then we have the real router because it's the router more or less of the network that is the packet gateway and it's only just to handle the traffic of the user. If you have familiarity with the fixed network access it's the BNG or the broadband access gateway of a mobile network and there it's also the element that gives you IP, network configuration and for example the detection of what traffic are you doing and how it has to be charged. The last part of the network is the PCRF that is a policy enforcer. If you are using the metaphor usable from the fixed network it's kind of the radius of the mobile infrastructure because it's handling your QS. How much data you can make? What kind of QS you have? You are a prioritized user, a normal user, so it gets you the data about this. In the beginning we talked about packet core network. Now I was just to focus you on the packet term because it's the key point of an LTE network. If you are familiar with other kind of network, the 2G1 or 3G1, there were networks that were made for making calls. So the problem was I have my user on the territory, I want them to make calls together, just for service. Then the data arrived as a don. So the data network was emulated as a call, like the ALAP times in the 90s. The new network, instead the LTE network, it's a completely different paradigm. My focus is when the user is connected it's able to make data. The real problem is making just data. Phone service, value of the service are just supplementary service. The real problem is making data, open NPN. From this little difference, all the other parts of the network are just some elaboration of it because if I'm making data I use an IP network, IP transport network, no, and all the SSN network or circuit driving network. That's my problem. So I use new protocols that are carried by IP packets. So my data IP packets, my transport protocol, it's IP, so I make IP network. And then if I use IP network I can also use some code software because I don't need specific or let me say custom because in the old days network equipment they're very, very, very custom equipment. But I can use really standard and also commercial equipment because the big part of the problem is the software, not the hardware. In the old days the problem was the hardware because I had to implement my protocols just in hardware. And for example many commercial implementation of connectors from many vendors are on normal interplatform or normal, let me say servers, not like in the old days there were very specialized hardware, very, how to say very specialized hardware, ASICs and just circuit boards that were making the computation in just one matter. So no upgradeable, no software reconfiguration, we had the board and the board just know how to do something. And the last point is that in the old days the network was, let me say very trusted. So it's a problem for the user to authenticate to the network but from the point of view of the user the network is always a safe zone. So if I'm able to connect to the network it doesn't mean that network is good, I can connect to it. So we have many news about spoofed networks or other, let me say an essay like mechanism that are used to fake a network and to make users to connect to them. In 4G and in 5G it's also stricter. This policy we have a sort of mutual authentication. So the user needs to authenticate to the network to get the service but also the network is some way authenticated by the user and I know when I connect to the network that I'm connected to my network not to someone that is impersonating it. And the most part of this picture we see before is about dealing with signaling. Signaling is the real nervous system of the network because it's basically commands between network elements that make the network acts like expected. And for a basic procedure like attach that is the first thing your phone do when it's powered up, we have really a lot of signaling because just the first part it's on the radio. That's part of it's just for the UI which is the signal to the radio. Then we start dealing with the core network and signaling because the terminal asks for the network for, I want to attach what I have to do. And then the network reacts on this. It's the MME that is doing the work coordination part because it's asking for example for keys because we're talking about network authentication. The authentication is done by keys. I have one key into the HSS, I have one key on the SIM card and I have to check if these keys are the same. If they are the same I can attach to the network otherwise I'm not the user. I'm another one that is trying to coordinate me. So the MME is asking for keys, the HSS is answering back with keys. Then the procedure is propagated to the user. There is a challenge response mechanism so I calculate an hash, I send you the challenge with this hash, I expect it from you, I correct answer. If it's not good you are not where you are trying to be. If the authentication is good I start with the security procedures. It's an agreement on the keys and protocols that I'm using on the radio interface. After that I'm just authenticated to the network, I'm still not connected. And now if I'm authenticated I start with the network. Okay, now you know I'm the user, I know you are the network. I have to understand where I can put my data. So you start with some location procedure. We just have two major scopes. The first one for the request is related to giving the network the place where the user is. It's called update location request. So the user is in that location and from this procedure I get a response that is update location answer. In the answer I have my profile as user. For example, what resources are you able to open? This APN, this other APN, you are able to make this kind of traffic, this kind of pipe. Otherwise you have these resources on the radio, bigger one, less one. So at this point we have the profile, we have just the profile because now the MME knows what the user can or cannot do. And at this point the MME starts another procedure that is related to, okay, now I know the user, I know what can do, where this user can attach, where he can put his data. Who is the node that is responsible for his service? And the decision about this is done by querying the DNS, like we said before, because it's a dynamic system. So I put my talk and my APN, I get back the two information I need, which is my S gateway for the run handling and which is my P gateway responsible for my internet traffic, let me select this. So query a response and then I can finally start with making the tunnel because the traffic is tunneled to the core network in GTP town. And basically I start a procedure for create the session, so create session request, it's propagated, then I hit the pcrf here, which says, okay, which kind of traffic are you applying, which is your QoS, pcrf, and then I can create finally my session. So at this point I can finally make traffic. All this procedure, it's completed, every time you power up your phone or every time you move from a cell to another, in real time. You have lots of this procedure. And the core network is engineered in order to make this procedure really, really fast because you have millions of it in your commercial network. Last concept we have to talk about, last theoretical argument, core network is all on the signaling part in contra position with the user plane part. This paradigm can be really pushed to the bleeding edge because I can, for example, in the old implementation I had control plane and user plane on the same node. So I had to run specialized hardware with specialized signaling in order to get the service. And with the 4G networks I have the QoS paradigm, which means basically control user separation. So this is my control plane part, which is basically the intelligent part of the problem, which is making decisions and communicating with other elements. And then I have my traffic, my user traffic just here. This is the user part, which is also called UPF, and UPF is just network handling. I get my rules from the control part and I process the traffic using these rules. In other words, this is something I can easily replace, reconstruct, change implementation, make it containerized because it's just computational power. And this is network handling. So it's payload intensive, but it's standard. I have just two or three commands and I communicate with these commands. Okay. That was the theoretical part of the problem, but now we want our network because the tool was, I want my own network. And the problem is that when one thinks about telco network, it seems like black magic, no one knows how it works. It's not something you, if you're uninvolved in them, it's not something you study more or less sometimes. But there are a lot of open source implementation. For example, open 5GS that is one I use, next APC, SLS run, and also Osmo come for all their networks. So there are a lot of implementation. And they are all working. And for example, open 5GS is also a release 17 compliant, which is one of the last standard that is published. So it's something that is driving my community, but it's also usable because it's not like mock up. It's real implementation. So it's really implementing a lot of service. Your telco provider may use everything of these features. And this is just the whole ecosystem that is implemented by open 5GS, which is a combo 4G or 5G network. And I recommend both standards. Okay, too much talking, demo time, because the idea is that you can literally power up your telco provider with just a laptop and some hardware. Let's say which hardware. We need compute power, container AZ, VM, container. You can choose whenever you want because it's open source. So you can recompile and do whenever you want. I personally use containers because it's for me. You need to know the interaction of the components of the component itself. So the theory we had, we, we had it before. So the interactions and some of the standards. And then we start with the hardware. You need an SDR because really not be are really, really expensive, but with the new radio and some other software you can have your SDR acting as you not be. And then you have to also have some, let me say, a ref hardware, especially something to shield because in most countries, 5G, 4G and 5G or also all telephony frequency are strictly volamentated. So you cannot go into the air. Otherwise, they will, you will stop the service from others, form a real or a legacy service provider. So for example, here my setup is just with a ref cable. So no ref, no ref emissions. Then another tool that make your life easy and that they can say it loud. It's an RF spectrum analyzer because it's the only part you can, you cannot see. So it's one of the most important instruments you need to, to, to own in order to deal with their problems. And then last part, you need SIM cards because as we said before, all networks doesn't usually identify new networks, does not mutilate indication. So you need to have a real C. R. R. So I have SIM card to write out with the kids and a lot of time. So. Time was most over so I will quickly that's the content. Yeah, that is the implementation. It's always on this laptop, which is offline. So it's always there and it's made by containers. So the blue one are from open, open fact. Yes. The red one is from SRS, SRS, SRS run, which is a project from SRS LT that make a speech tomorrow. So come and see it. Then we are our RF hardware, which is user and SDR and the just a Mongo database in order to put the user data. And then I try to make you just the quick demo. Okay. Okay. No time. Okay. Okay. Okay. I'll fire up my SDR that quickly easy. I just power up the corner, which is really, really, really fast. Let's start and then I just connect my UI. I forgot to power up the not be so it's not working right now. Okay. And now we are lucky we should see the connection itself. So it sees our in of the. This is connected. This is actually modulating and then from Unix network manager. I should be able to make the connection. And connected that mod indicated and. And I'm able to reach. So this pink is going from the laptop that's this user equipment coming to the modern from the modern meets send to the RF equipment and then it's handled by the big gateway. So the nine nine nine it's not connect on the laptop, but it's on the big gateway container. And basically. This is my traffic. This is my traffic. Okay. Sorry. Okay. So. Yeah. Yeah. Yeah. Yeah. I'll try you know. Yeah. Yeah. Yeah. Yeah. Yeah. No. Yeah. Yeah. Yeah. Okay. Sorry. Yeah.
Navigating the Networking Maze of Kubernetes: A Journey of Discovery, Confusion, and (Hopefully) Enlightenment
My name is Antonio, I'm a Kubernetes maintainer and working in SIG Network, STL and SIG testing. And my interest here is to show a bit more about the problems that Kubernetes has internally and why something like multi-network are not easy to implement. So Kubernetes started as a container orchestrator platform, right? Excuse me. Hey. Okay, you shut up the light on the top. What is the light? On that one. I think it's there. This one? This one? Okay, I must say to break it down. It's like the meme with the two buttons, right? Which one? And sweaty now. The tableau? Tableau? Let's go with tableau. Okay, okay, okay, good. The tableau. Well, this is the thing. In Kubernetes started as an orchestrator and now it's a big thing and everybody wants to do Kubernetes. And this comic is the best example, right? People just want to throw everything in Kubernetes and magically start to happen. But that doesn't work this way, right? In order to absorb all these new problems to me, right? Kubernetes, what it does is it implemented a plugable architecture. Instead of being everything hard called, we try to define interfaces, APIs. You can see it's here, right? It's common these days to build your custom resource or you can have the CSI to implement your driver and your storage. But you see that something is missing here, right? You don't see CNI. Since Docker seems to be in 1.22, Kubernetes doesn't have any networking capabilities. So all the network is behind the CRI API. So it's left to the containers. And if you went to the DAG multi-network meeting like one hour ago, you can see that the people that implement network plugins has to do a lot of diagnostics with annotations and JSON embedding and all these other things. This is something that we want to solve. But before we go there, is how Kubernetes work is just you have a control plane and then you have all these worker nodes that have resources and you run POS, right? Everything is controlled through the API. Kubernetes at the end is you have a REST API, you define the API and you define the behaviors. And these behaviors are asserted by the E2E test. So when people implement something, they can run the E2E test and validate that they are API. They are implementation match the APIs. With this, you achieve a consistency across the ecosystem, right? A lot of people think, oh, Kubernetes use a lot of IP tables. And that's a big lie because IP tables and Q-proxies and implementation detail, right? The API services. But you can implement the same without IP tables. You can use eBPF, user space, XDP, sockets, whatever. The important thing is you implement the API, you run our E2E test that defines how this should behave, and the user will have the same behavior, independently of what it has behind the scenes. So when we move to the network, you need to think that Kubernetes was created as an application orchestration, right? So it was not created as an infrastructure as a service. And then these consequences, you have three APIs for the network, three primitives that we can say, right? The nodes that are the VMs that run in an infrastructure that can be assumed as a node network. The node network here is not an IP network, right? It's an abstraction, say, it's everything that is connected there. So all the VMs are in a node network. All the pods are in a pod network, right? The pods has this requirement, that's the Kubernetes requirement. Every pod has to be able to talk with another pod without an app. And this is important because this allows all the applications to talk with each other. And pods need to communicate. Pod needs to discover, right? And then we want to hard call all the APIs everywhere. We need to create a set of pods and then expose these pods. And for that, you have services, right? So simplifying, obviously simplifying everything, you can find these three networks in a Kubernetes cluster. If we go one by one, the first network is the node network, right? This is you have a bare metal cluster or a cloud provider, you run your machine, and you have a virtual machine. So this is a virtual machine or a server, right? It has resources, it has interfaces, it has whatever. So the way that this work is for the network side is you have two fields, right? One of the fields is the status. The status is the one that holds the addresses. Something discovers when you start the node, the cube let, that's the component of Kubernetes that runs in the node. It starts to discover all this VM, all this machine, and start to populate these fields, right? All these fields, right? So you have all the addresses. The first thing, okay, let me start back. The first thing that the cube let us is to register the object. So this creates the API object. As we say before, Kubernetes is about API semantics. So once you've registered the object, things start to move. And one of the things is it starts to check for conditions. How much memory, what IP addresses, because this information is going to be needed by other APIs or by other controllers. In addition to this, there are the bootstrapping of the VM is complicated in the sense that it can be controlled by the cube let, or it can be controlled by an external cloud provider. Then the addresses and some of the information that you can be in the API can be populated by this other controller. In addition to this, you also have a field that's a reminiscence from the past that is the POSSIDER. Contrary to what most people think, the POSSIDER is an informative field. So the CNI network plug-in or whatever can use it or may not use it. So in theory, a lot of plug-ins use it, but it doesn't mean that because you have POSSIDER in the spec, you are going to have this SIDER assigned to your POSSIDER. This is one of the problems when you develop APIs and commit these mistakes. This leads to the users and generate confusion. We talked about the node unit initialization first. When the node starts, and we talked about the CNI not being part of Kubernetes. When the node starts, the first thing that you need to do is to check if the network is ready. Because you cannot create POSSIDER, the POSSIDER cannot get IP addresses. So for that, there are two container runtime calls. There is one container runtime call that is network ready. This goes through the CNI API to the container runtime. And the container runtime right now, cryo and container D, the only thing that it does is to check if there is a CNI config file. Just that. You can fake the file and it's going to say that the network is ready. Moving from the nodes to the POSSIDER. This is one of the most tricky parts. The POSSIDER is the minimal unit of work in Kubernetes. So the POSSIDER is created and lives in a network, is able to reach other POSSIDER networks, but this presents a security problem for people. So what we created is the network policy API. That means that you are able to define relations between the POSSIDER. This POSSIDER can talk with each other, this POSSIDER can talk with each other. This is a high level. So what happens when you create a POSSIDER is that a user creates a POSSIDER or a deployment. A deployment is a composition of the POSSIDER API. So this is going to create an object called POD in the API server. The controller, the scheduler is going to see, okay, there is a POD created, but it doesn't have a node assigned. So I'm going to assign this POD to this node because it has resources, whatever constraints that you put there. Then it assigns a node to the POD spec. And the QLED that is watching this POD object and got this POD assigned gets the POD and starts working on it. It starts to create the POD. That's the so-called declarative thing. And how the QLED starts to create the POD is via the CRI API. So it starts to build the CRI API as a JRPC service that is used to communicate between the QLED and the container runtime. So the first thing that it does is to call this RampotSambos call. And in this RampotSambos call, you have networking parameters like DNS, config, PORMAP, hostname. When this goes to the container runtime, as I said before, CRI and containerD, UC and I, but this is not mandatory. You need to create the NegoName space and to create the Nego, right? And once you create that, it goes back and the QLED keeps working. After that, you get the POD IPs from the status response, right? So that's the only thing that the QLED and Kubernetes know about the network. Create a POD and receive the IPs. And this is the big problem that we have right now to start to model multi-network and this other complex network of policies, right? So we cover the nodes, we cover the PODs, and now we need to cover the discovery, right? We have everything running, everything is fine, but we need to implement applications. For that, we created the service API. For the service API, what you do is you are able to define to expose a set of PODs by selecting. To say, okay, I want this POD that implement a web server to be exposed through DNS or through a load balancer, right? So for services, there are like five types of services. One is the cluster IP service. The cluster IP service is kind of a load balancer. You define the set of PODs that you want to expose and you expose via cluster IP, virtual IP and virtual POD. It's just a full forwarding. You have a whole set of options to modify the behavior, but basically that's only that. For node port, this is a way, so right now you have the applications and you need to expose them externally. And for that, you have the node port. The node port is the typical form mapping that we do in our routers to expose something in one internal server, right? You get the IP of the node, a port in the range that doesn't collision with other things in the host, and you expose internally. And then you have the load balancer services. This is different. What it does is it creates the service and waits for external controller to populate a load balancer that is able to send traffic to the PODs inside the cluster. You also have ways to expose the services with DNS. That's the headless service. So basically what it does is it creates a new record, a record that as Bakken has the PODs, right? As the A answers. The way that the Kubernetes service work is when you have a service object with a selector, right? And there is a controller that is watching the PODs. When the QLED creates the PODs, the QLED updates the POD status and the IPs. And so when the POD is ready, this controller is to see, I have a POD that matches this selector of the service, so I need to start to create an important slide for this service. And then when you have the service and the important slides, the proxy implementation, like you proceed, is able to define in the data plane the for load balancer, right? So you have this cluster IP, this virtual IP, this virtual POD, and you have these Bakken. It's just installing rules. One of the most tricky things with services is to do termination, graceful termination, because this, you saw the service type load balancer, the problem is that you need to coordinate synchronally three parts of infrastructure. So the POD has a state. The POD can run, start being creating, start running, and start deleting. But there's some time that you want to do rolling updates. And for that, what you want is, I want to run my new version, but people is still hitting my endpoint. I don't want to lose any package. So for that, you establish a grace period. And this grace period is related to the endpoint slides. And this is used by load balancers to implement serial downtime. So in a common state, you start updating the POD. The POD is still able, you see the run line, is still able to receive traffic at that point because it's terminating. So it's going to be able to answer. But the health check starts failing. So the moment that they start failing, the load balancers start to move this endpoint out of rotation. So the new traffic is only going to the new PODs. Once the new POD is removed and replaced by the new one, the health check starts succeeding again. And it can start to receive new traffic on the new application. As I said before, Kubernetes defines some APIs and others are defined by composition. So you can have a load for load balancer on top of service. There is this ingress API that allows you to use the service of a game and define another layer of abstraction with the seven primitives as HTTP. So you could say this path this way and this path this other way. These APIs were created in the region for some things that are no longer or we carry over some of the problems. Okay, thanks. So as evolution, we have Gateway API that's the new API that has standardized. And the other problem is that Kubernetes itself is portable and all the things that you can see there is the network. So the same thing that you run a POD, you may be running a webhook that interferes when you create the POD with the API call and all this stuff. So this is one of the other problematic. We need to be backwards compatible and support everything. If you want to be more interested in see what is going on, we have a public dashboard so you can read the caption what is going next. Okay, sorry. Sorry, we are behind schedule so no questions. Thank you.
Units of composition: recipes, overlays, and packages
No, just rock and roll. All right, test, test. All right. Welcome to FOSDEM 2024, Nick's Dev Room. So I might have been a little bit early. Everyone's still kind of trickling in from the real hard parting last night, so that's fine. All right, I was just going to get started and kind of take it away. So my name's Tom, Tom Burrick. I work at a place called Flock's. I work on a bunch of things called Knicks. And I want to talk today about the different kind of units of composition we have in the Knicks ecosystem. Some ideas about what it is that we could maybe do to improve some of these things and just some issues. And also just a little bit of explanation of some of these units of composition that are maybe less or known or hard to understand. And maybe just start some conversations about it. All right, so I'll go over kind of a few ideas, trying to present a concept, figure out like, yeah, what is this thing? Why we care about it? What are some of the problems we have? I have some proposals, maybe some examples. And hopefully along the way we learn something and come up with some ideas. As a disclaimer, a little bit of Knicks understanding is expected here. But if you got here at 9 a.m., I assume that's not a horrible thought. All right, so basically here's where the experience might come in. Hey, like what do we call this thing? Feel free to kind of shout out. Like what is this thing we're looking at right here? A package. A package, a function. What else? A derivation. What else? Sorry? An expression. Okay, so that's a problem. This is used everywhere, and I just got how many different names for what this thing is. Right, this is something we don't, I think this is an issue. So all these names we used are not quite right, we don't know what it is. So the main idea is like, one I'm going to explain how this thing works, why it works, why it is the way it is, and then be like, hey, let's actually name this thing. Let's make this a first class concept. Let's somehow make this thing a bit more core to the system, rather than getting seven different answers for what this thing is. All right, so we have some whole bunch of issues, tried to reference some of these things, but like, hey, we tried to define what this is. But we just used package, I don't know, four or five times in the definition of what a package is. I don't think this is going to work too well. Right, that might be a problem. We have another example of Robert Hensing having an idea of let's formally define what a package is and make kind of NICS understand this, because it's funny, we don't really have a formal definition of what a package is, yet we often call NICS a package manager. Well, is it a package? It's a little bit confusing. So he's got some ideas on that, which I really like the progress on that. There's another issue that came from John Erickson. I also really like this one, like, what are these functions? We call package on them, but saying this is a call packageable function just doesn't roll off the tongue very well. And also, like, this also has issues with when you're trying to do things like cross-compilation or if you're having like your look at your memory usage, having everything be operated after the fixed point operation NICS packages doesn't quite work. So there's memory implications or performance implications. I want to address all this, or at least kind of come up with some ideas. All right. Why do names matter? Well, let us communicate. I could tell a beginner, I could teach someone, hey, here's a thing, here's its name, here's its properties, here's the form of what it looks like, here's why it's the way it is, and having a name helps. If every single time you have to kind of subtly give them a not quite right word for the thing, they're going to get confused. Because you just said this was a derivation, but then when you call it, then it becomes a derivation, but it's not quite, it's really, again, extremely confusing, especially if you're a beginner. Right? Like, we might kind of understand and look at it and recognize something like this, but look at it from the lens of someone who's new. They're not going to. So let's define this a little bit more precisely. We use this thing everywhere, and you encounter it really, really in the process. As soon as you want something like, oh, I got my own thing I want to put together, or I want to compose an environment, or compose anything, we almost always use this abstraction. So let's use it. We want this to be understandable. We also put lots of abstractions on top of this notion. Things like overlays are built on top of it, things like package set. NICS packages is built on top of this notion, but we don't even have a concrete understanding what that notion is under the hood. Alright, here's stuff you hear all the time from people, like bring you, you know, it's their first week, second week of understanding what NICS is, and you ask questions, like, I created this thing, like, now how do I build it? How do I add it to NICS packages? Oh, I built, I put this thing, but where do I put it? How do I now bring it into NICS packages? That's non-tribal, you have to kind of understand how to do this. What am I, I'm trying to add a package, but I'm trying to do this in a NICS OS system. And it starts showing up in packages.whatever, why? Or it can't find it, again. It's because, you know, they don't know how to compose these things, and it's hard, it's complicated, and we don't teach them well. What's an overlay? Oh, no, if they ask that question on their second day, you're done. It's never going to, you're never going to be able to get past them. Like, what's a fixed point? Oh, here, it's difficult. What's called package do? It calls the package. So is package the function, or what, even the naming of this is really odd. Oh, I want to do flakes. Okay, cool. How do I add my package? Oh, well, we've got to figure out what it is you're trying to do, how to compose it again, all sorts of issues. Alright, so I'm going to go quickly through a few of these things as to, like, what they are, try to explain them, and there's not going to be a full rendition of all these concepts, but just enough to get people to kind of, somewhat of a common understanding, and maybe serve as a starting point for other research. So call package, what is this thing? It's a function itself. It's going to call your definition with all the correct arguments from a scope. What's a scope? I'll go into that in a little more detail, but just consider this is the scope of things that it's available to it. And it gives you some extra usability benefits. We use this thing to kind of make things a little easier, to kind of organize mixed packages as a common way of doing it. You don't have to use this approach, but this is what mixed packages uses. This is kind of the idiom that we have. There are some people experimenting with a few other idioms, specifically with, like, the module system, but this is currently, like, the one that's in use. A pretty good explanation of this is located at this URL. All right. This is kind of a roughly a way this works. So we define a few functions to kind of, a few helper functions that's like how it works. We start defining this call package for our package set, and then we go use this thing and make things. I'll kind of break this down a little bit. So we have a helper. This helper function takes three arguments. We give it a scope, some f, some functions. This is the thing we were trying to name earlier. I'm kind of leaving it a little bit unnamed for now. And some extra stuff. There's a little bit of NYX magic here to say, hey, go grab the correct arguments from your scope, okay, and then just go call the thing. And then override it with some extras that are helpful. It just calls a function with arguments from your scope. Okay. So now that we have that defined, now we give it a scope. Here's a pretty simple scope, just, you know, a few values, not even packages. Just some values. Or this could be something massive, like all of NYX packages. The entire scope might be available to you. All right, so now we have a call package that's hopefully usable. Except, I forgot one little piece. We also extend this definition scope with something we haven't even defined yet. Something that is the value of all the packages you're going to end up with once you're done. This part's mind-boggling. It has to do with the fixed point, lazy language. It's not a very complicated thing, but this makes it complicated. This is probably what makes it hard to understand. But that's the idea is that it has access to everything once you are done. So it captures some closure, namely the scope, and then it extends it with more things that you want to define. All right. So now you've got these two helpers, and now you say, hey, I want to extend it with these extra stuff. Right, and now we have this reference packages, the packages, that whole lazy fixed point is done, and we have extra things that are now brought in. Cool, so now I should have a, b, c, and d. Well, this notion of I want to add a few packages to the scope, you're going to do that a lot. You're going to say, hey, I want to add these few things, and so we give this notion of adding a few packages or modifying them or kind of having access to them. We should call this thing an overlay. And for various reasons, we start implementing now overlays. Except this becomes almost impossible to understand, especially through a beginner. Right, you get these arguments, final prev. What are these things? I have to use the final call package, if I use the prev, I get confused or mixed up, or some things won't be available, or some things will be, but they'll be a previous version. Again, we throw this diagram in front of people, and then they go, great. This was, now I understand. Right, no. Again, that doesn't quite work, I find, in practice. So overlays are powerful, though. They are useful. They solve a really good problem if you need to do something kind of deep in the dependency chain and manage these sorts of things. And when you know what you're doing, it is powerful. But with power comes responsibility, and you can mess yourself up really easily. You know, you've got nested sets, you've got infinite recursions, like you can get confused, and very often this is what happens. But most users don't need this, especially in the beginning. Most users just say, I want to add just one more package, like the one I just started with, my toy example from the tutorial, or the project I'm working on right now, and I don't care about all this complexity, I don't need all that power. So there's a few other issues. It's hard to kind of, from an overlay, get that original function, that original thing. You can't really get it, because the way we normally define this, right, is we call it. So this thing is kind of hidden away, it's tucked away. I mean, you can probably figure out a way to extract it out, but it's not trivial, it's not easy. And we've muddled some of these concepts. All right. But overlays are essentially the correct way. Like this is the way you're supposed to use the composition with mixed packages for everything to work out right. A lot of the machinery kind of expects you to kind of go down this route. You don't have to, but again, a lot of the instructions, a lot of the idioms, a lot of the tooling will expect it. But it's hard to use. All right. There's another concept that we put in there. We have this concept of a package set. I don't even know how well this is even documented, but it's this idea of having a set of packages, a set of things, and we extend it with a few other attributes. And those attributes are the things that allow you to then further extend upon it. That's useful, except everywhere where this is used, it's implemented slightly differently. So that's another issue. So we use this all over the place, right? All the X packages, whatever packages stuff, is all the example of one of these package sets. And it's useful, again, super handy, but it also can cause confusion, because we don't explain this to people very well. Try to override something in Python packages. If someone could do that without reference or copy and pasting, I'll get you a sticker. So one proposal is, hey, how about we actually look at all the different package sets that we have in X packages and just make sure they're standardized. Right now they're slightly different. Some things are named slightly different things, or they're implemented in a kind of subtly different way. Let's perhaps try standardize this. And that way, if it's standardized, we can document it, we can talk about it. Another notion that if you dig all the way into the internals is there, but I don't think, I haven't really seen this written up. Not really anywhere. This is notion of a scope. What is a scope, right? I actually referred to this earlier on. So a scope is basically saying, hey, take your package set. I want to be able to extend upon it, and then later extract the portion that I extended it with. I mean, this is really helpful when you do something along the lines of, I want to add 10 packages to NICs packages and then pull out the 10 I just defined. That's basically the trick that this thing allows you to do. I'm not going to go into a lot of detail here, but it's a really nice trick, it's a really nice concept because otherwise it's very manual. Otherwise you have to always remember, oh, I then have to go extract back out of something. I have to go inherit or grab those attributes. Whereas with a scope, you could just say, make a scope, make a bunch of changes, and then go essentially grab the def. Go grab all the changes I just made and expose them now, and only those things, not all 80,000 remaining packages in your scope. So kind of it's a little bit like a right barrier that you could use. But probably have time to go more into detail on that one, but it's a useful notion. I don't think we talked about it enough. It's yet another thing, another abstraction we've built on top of this original thing. So let's talk about some ideas, some proposals of what we can do. And some of these stuff we could actually do today, it's just a matter of either making a decision or like thinking about it, or maybe some discussions. So first off, let's give this a name. I don't actually care all that much what name we give it. I do have some thoughts on that, but I just want to make sure we have a name so we can communicate this. So what name? Package was mentioned, not quite correct. It's related, but the by name construction that Selman has implemented for Nix packages uses package function, package fun as the notion. It's correct, but it's a little bit awkward. Not super thrilled about that one. Derivation, again, that's not quite what it is. It's the thing that will produce one. So because of that, I kind of like the, it's a proto derivation. It shall become one, but this is technical and people already run away screaming once we say the word derivation. This will make people run away screaming even harder. So that's not quite, that's for academic papers. Blueprint, kind of nice, it kind of makes sense, right? But it's pretty sterile, it's very rigid, it's not very fun, we're not very human. I kind of like recipe, right? Because it's this notion, you know, it's like you're cooking, it's human, it's food, everyone likes food, I like food. You have all sorts of other fun little concepts that could come into play, like hey, you substitute one kind of flour for another, or you substitute one ingredient for another, you could tweak your ingredients when you're cooking a lot, you know, you ran out of eggs, or I don't know, you don't like that much pepper. So that kind of gives us the notions that we like, but again, if someone comes up with a good name, I will go with it. I'm not strongly attached to anything. Any name is better than no name. So, yeah, recipe is kind of a fun one. It gives us cookbooks, variations, it's human, it's friendly, it's colorful, blueprints are usually very static and ugly. And here's some other things we could do. So once we have this thing as a concept and it has a name, well, let's say we're using flakes, not everyone is going to, but if you want to, hey, why don't we just expose these things just directly? Like, hey, here's the functions that I'm going to be using, and then from this, you can, from pretty mechanically, pretty generically, just create your overlays and create your actual hard packages, actual packages that are for a particular system or for a particular environment. But let's expose this as a top-level thing. And this has some benefits. So, hey, there's no system here. You didn't have to pick a random system just so that you could expose it so that other people can override the system you gave. Like, that whole problem is now gone. You could do cross-compiling because I could just pass whatever inputs here I want, and these are self-contained. We're not taking anything else from scope, hopefully. There's an obvious translation. So let's use that translation, implement it, I'm sure you can come up with many different ways to do so that have different usability trade-offs. And let's try to make this like a top-level output. That's just something that we use. NICS packages should expose them. Right now, one of the things people often probably do is they use the file system layout of NICS packages as like an unofficial API to grab these things. Well, now that we're migrating all those, those are all going to break. Those are all going to like the by-name construction, do we want to really rely on where it is in the file system? Like, no, let's just put them and expose them as the functions that they are. Right? We're all about functional language. Let's celebrate functions. The next thing about this is now if you start using this in the Flake ecosystem, there's no lock files needed. This doesn't refer to anything else. There's no system, there's no base NICS packages. You take NICS packages and add this, but I can grab recipes without even evaluating really much. I just have to like parse the thing and grab the function. I don't have to evaluate all of NICS packages just to end up overriding it with another NICS package. Like, that's where you get lots of memory and lock file bloat. So, I don't even need the lock file. I don't have to read the lock file. I don't have to interpret it. Right? And that prevents now a bunch of lock file bloat and the consumers of this. I could consume this thing without ever caring about that. That's a benefit. So what do I want? Hey, all the various frameworks that we have out there, like, just expose this as a top-level output. Like, nothing stops. Technically, like, I do this today, nothing stops us from doing it. Now, there's no tooling that makes it easy, you know, NICS Flake Show where other things don't present it or we can have documentation for this thing that tells you that this is a standard. But we can start doing this and there's no, nothing stops us. And we can start making it nicer and nicer to use over time. So that's kind of an idea. Yeah, I'm just kind of, I guess, reiterating that, hey, these are just pure functions. And that way, again, there's no lock file issue. So some additional thoughts. I guess they got cut off. So you show this thing to people, show it to a beginner, you go, okay, here, here's how you define your own thing, here's what you do, and you start describing it to them. And you almost inevitably, they think about this top thing as if it was like an import statement. And they always ask, what can I put here? Like, what is this thing? And you go, well, if you can grab stuff from NICS packages, okay, cool. But they also, you can grab additional stuff, stuff that you just kind of composed in with your overlays. And so they, I don't know, is Lib in there, or My Package is in there, or they only, the upstream, like, what's in there? And we don't really expose it. Like, this is the scope that's available to you at the time. Why don't we actually expose it to people? I want to be able to search within the context of what is possible for this, what's visible to call package, basically. Let's let the user search it. Why not? It's valuable information. They want to double check that what they just defined got in there. It's good for exploration. It's good for analysis. You can now analyze what's available in your scope, even, to make decisions. So, that's another kind of thing, a little bit beyond the first thought. So what next? We don't need technical changes. Again, you could do this today. You could do this by using a kind of flake schema. You could just do it yourself and just kind of rely upon it or check for it. But I want to, all the time, just kind of get this convention started, and then at some point start to get some feedback. Like, is this good? Part of me talking here today is getting an idea from everyone. Is this on the right track? Is this helpful? Do we need an RFC for this? Then we start adding support into the various frameworks and libraries. We can start, actually, at some point adding some support for this or some of the utilities we have to make it easier. And then, you know, we make that developer experience even better. Like, how do we move these around? How do we update them? How do we borrow them from another cookbook? How do we inherit cookbooks? That sort of thing. How do we mix this or combine this with all the teaching materials that we have? So there's, you know, it's not super easy, but we have things to do. Another notion that I ran across that actually seemed pretty useful in practice was if I have defined all my recipes, right, you can kind of say, hey, well, using some base, add this stuff to it. And once you can define this and use this thing, it kind of looks like the with keyword. It has a lot of other, like, benefits. You can start to define how it interprets what you gave it. So here you can start to kind of say, oh, I don't have to import the thing. I can just refer to it by path. Or other niceties. But it's another kind of useful abstraction that I'll just say has worked out well in my opinion. All right. I think that's a blank slide. I don't have time for a demo, but, like, this kind of works. I use it today. It's useful. It's friendly. It gives us a lot of benefits that I like. I'd like to expand upon it and make it kind of less of, like, a unique, strange thing that I do and something more that might be a bit more common. I've got some references in here. And I am willing to talk to people about this sort of idea, this sort of thing. So come talk to me. And I guess we're open for thoughts, questions, comments, tomatoes. Thank you very much, Tom. Is there any questions? Thanks for your presentation. If you want to expose the recipes inside of Flake, how would NixBuild work? So NixBuild, right, doesn't take a recipe. NixBuild needs a package. This is where the distinction matters. So you would still have to say, hey, my packages are, with this base Nix packages, bringing all my recipes. Right? But that's a simple kind of way to say, it's kind of like the for all systems. It's kind of a very simple helper that can do this thing for you. And that could just be kind of injected into the templates. It could be used and tell people, hey, here's how you do it. Here's the easy answer unless you need more power to, like, only, you know, but the easy answer will be, hey, convert all these recipes into packages. Boom. Now NixBuild has it available to you. So you would need a lock file and Nix packages input inside of Flake? Yes. So if you want to build a package, yes, obviously then you need a lock file, other things, but I can still grab those recipes directly without reading your input, without reading your lock file, because it's independent of them. But yes, if you want to actually build a concrete thing, well, you have to have a concrete base. Any other questions? Back. Yeah, this is kind of just a curious one, a technical one. In your proto derivations example, you have a function which takes nothing. And it, well, sorry, it takes a set, but the set has nothing in it. And it would be unexpected argument if you passed it. Yeah, my data at the bottom, what is the purpose of my data by comparison to everything else? Why put the set as an argument there? Well, it just doesn't need arguments. Great. In that case, call package goes into no arguments. Let me call it with the empty function, and the result in this case isn't even actually a package. It's just some data. All right, so it's a way of interfacing call package because it expects the function as its signature. Yeah, you could process these in such a way. For example, one trick I've been playing with is to process just a simple string as if it was the arguments to run command. It's kind of nice. You could just throw a script in there without having to define a lot of things. This is kind of serves a marker. Again, you could interpret this set in different ways. I don't know exactly what the best way is, but you can kind of figure out different types are handled differently. So you could detect, hey, oh, this thing is a function. Great. In that case, defer that evaluation, or it's a piece of data converted to a run command, or it's a path. It's a path, okay, imported. My call package does this for you right now. If you give call package a path, it goes, oh, it's a path. Let me import it, and I'm expecting inside of that path there to be a function. Or you can pass it directly a function. So this is just another little trick. Again, whether we use it or not, I don't really. Something fun. Thanks, Tom. Let's give it up for Tom again.
Fortifying the Foundations: Elevating Security in Nix and NixOS
Okay, good morning everyone. I'm Dominic Mills-Holl and today I'll be presenting on fortifying the foundation's elevating security in Nixson and XOS. To give, well, before I get into the talk, I'll give a brief introduction to myself. I'm a software engineer that's broadly interested in application development, build system compilers, and algorithms. And I've served in varying capacities in different open source projects such as being a mentor under the Palasadus Foundation and Haskell.org under the Google Summer of Code program for three years. In addition to that, I've also been a participant of Google Summer of Code-like programs such as Summer of Haskell and Summer of Nix, which I both participated in last year. And now to get to the meat of the matter. Today I'll be presenting on this talk, which is about the various features that were implemented during the sovereign tech fund's Contribute Back Challenge, which occurred in the fourth quarter of last year. And the sovereign tech fund is essentially a fund mandated by the German government that seeks to support the development, improvement, and maintenance of open digital infrastructure. So Nix was one of the nine selected projects in 2023, and the focus was on three aspects. But this talk will be mainly focused on the first aspect, which is a proper boot security chain for Nix OS. This was chosen because it's easily the most expansive and arguably the more interesting of the three. So let's get into it. So I assume many of you use Linux, because why wouldn't you? And I know that when you first installed, like whenever that was, when you first installed Linux, you had to disable secure boot in order to proceed with installation. And it's a very common place for us to just disable secure boot and just completely relinquish that thought of ever having a dual booted Windows and Linux machine, because why would you? And essentially what we're interested in is implementing this boot security feature in Nix OS. And that consists of a number of different facets, but historically, there were two earlier implementations, Lands Boot and another one developed by the Terminet Systems, Boot Spec, Boot Spec Secure Boot. Yeah, sorry, it's literally on the screen. But there were a bit unsatisfactory in the sense that they required one to first install Nix OS in an insecure manner and then modify the configuration file in order to apply, well, Lands Boot in particular, in order to apply Lands Boot, then Nix switch rebuild and then you'd have secure boot after you've made the necessary configurations for your machine. And for you as an end user, that maybe is fine, but say you're using Nix OS on some kind of network or some kind of cloud server or you're trying to find a way to kind of determine if you could build some kind of server form. It's not really satisfactory because if you're providing some kind of service like that, then you pretty much leave yourself susceptible to things such as boot kits that can basically take control of your entire system from the boot process onwards. And essentially you kind of want to adapt yourself to a secure boot story in Nix OS in the sense that you want secure boot to be the default option. Once you get the Nix OS installed and put it in your device and you boot it up on a new machine, you don't really want to have to turn off secure boot. You want to essentially have an option where you can have secure boot by default on Nix OS the moment you've begun to install it. And this consists of a number of different steps. I mean, the secure boot is essentially just a chain of trust such that various components from the inception of the booting process just kind of sign one another. And the methodology that we kind of took in this approach was basically to use a unified kernel image because it would keep everything in one particular place. You won't really have to worry about booting any RD or MS-D on the kernel and whatnot. You basically have everything in one place and you just kind of have to sign that and then you go forward from there. And how that works is that essentially secure boots is a very Microsoft-centric thing. So you kind of have to adapt the Nix image that you're building to accommodate for that. And what that means is that you translate the language into the language of portable executables. So what that means is that you kind of have to create new structures in order to manage these portable executables in such a way that you can have Microsoft keys on your machine and secure boot will work. You don't necessarily have to have Microsoft keys per se. You can install your own personal keys. But the problem with that is that the chances of you breaking something are astronomically high. So it's probably best to have something that has Microsoft keys. And loosely speaking, what I mean by has Microsoft keys means that somewhere on your laptop there's like a sticker that says Microsoft somewhere. And that usually means the machine. The UFE has unified extensible firmware interface, has Microsoft keys in its database. And how this was done implementation-wise was that we essentially used something called a meta-writer structure because we couldn't think of a better name. And what this does is not necessarily modified portable executable in and of itself, but more or less kind of begins to recreate it from scratch, kind of looking at every single aspect of it and then modifying that accordingly to accommodate for what changes we want. So the reason why we had to do this is because portable executables are very, for lack of a better word, say volatile. It's not really meant to be cross-platform or anything like that. It's more or less meant to be made for just Microsoft-type products and Microsoft-type products only. And basically you kind of have to look at a lot of the headers and see how they're related to one another. If you mess up, if you write the wrong data to the wrong header or if you have a data section somewhere and it doesn't link to somewhere else, you've ripped your system again so you don't want to have this type of issue anymore. You basically want something that is already developed, quote-unquote, in-house and just works from the get-go. And essentially this was accomplished. If you actually look on the NixOS issue, you'll see that Linus Heckelmann, whose GitHub name is Linux Ackerman, and I often forget what his real name is, but actually did manage to construct a NixOS shim with an embedded self-signed certificate. So that part is done. But there are actually two aspects of the story. The second aspect is to send the shim to the shim review committee, which is basically a bunch of independent, well, they're affiliated with Microsoft in somewhere or another. And this allows one to kind of, by the way, for those, for the uninitiated, a shim is just a first-stage bootloader. So just a bootloader that can load an operating system or another bootloader. And what this essentially means is that once this gets reviewed by the shim review committee, it means that they say, yeah, okay, we can give you some Microsoft keys and you can sign the ISO and everything should work. So we're currently in that stage where we're able to construct a NixOS shim that can embed a self-signed certificate. But we haven't yet got the green light from the shim review committee, but there's nothing to indicate that we shouldn't. And more or less, this is so far the secure boot story in the broadest possible way. But more specifically, there are other aspects such as the boot spec specification. So this is boot spec version one. And a part of the work done during this project was creating a boot spec version two. So if you're probably wondering, why do we need a version two? Version one looks good already. It's not, there's nothing dynamic about it. It's just a JSON document. What could be wrong? Well, NixOS only has one, it can only take one argument, a string. When in actuality, that doesn't fill the entire gamut of the entire spectrum of, you could have multiple in-entities, essentially. And that's usually the case if you have, say, some global user settings, some specific user settings, some environment settings that you've changed. And usually these end up becoming some CPIO archive files somewhere. And so you essentially want it to be a list. And also you can also dynamically generate initRD from the EFS. So yet you could just end up with a lot of initRD files that just spring out of nowhere. So that one isn't very satisfactory because it doesn't consider all the use cases. And you also see initRD secrets. But here's the funny thing. There's actually nothing secret about initRD secrets. The secret in initRD secrets is a plain text file. So it doesn't actually provide any form of security. And ultimately this is what boot spec version two looks like. So initRDs now has, it's a list. You can add things to your list to eventually put them together and then that satisfies the most general case possible. FDT third, that's not really going to be discussed here, but essentially it's just to consider the case of using that's you boot. And device trees, we didn't have support for device trees previously. But we do now with boot spec version two. So what we've done with initRD secrets is that we've basically used a hashing method. It's a very boring hashing method. So it's now more secret. It's actually a secret. You don't have a plain text file anymore. And this is the difference between boot spec version one and boot spec version two. And this is somewhat of a segue from security. It's tangentially related, but it's still a very important feature nonetheless. So we define AB schema to refer to a type of primary and secondary boot partition, wherein if the primary partition fails, you'll switch to secondary. And then this is generalized into something that's called automatic boot assessment in NixOS. And what this means is essentially if I want to load a NixOS generation and it fails for some reason, and the most practical way this would fail, to give a concrete example, is if you went into Nix packages unstable and got that package AMD GPU and you realized you can't boot into your generation and notice from experience. So this is an example of where this would be useful. And essentially what it does is that it generalizes AB schema to NixOS generation. So in that case, you had a very simple case before, and it's designated as an indeterminate with a specific number of boot attempts. So you have some predeterminate number of boot attempts that you want to have. And then eventually, once you've gone past that, then it's no longer deemed as a good boot. You can't decryment it every single time, and then it's designated bad. And this is useful in the case that, say, you have some kind of procedure where you have unattended boot. So maybe you have some kind of build farm, maybe you have some kind of service where you have to boot into some specific NixOS generation. Maybe you're using NixOS as a replacement for Ansible. And you add things to NixOS, you add new services, maybe some of the services are from Nix packages unstable, and you just, it's more or less everything is automated. This is a use case where that would be useful. If somehow you magically find a way to upgrade things without you having to touch the Nix configuration file, then automatic boot assessment will take care of the fact that your generation may be corrupts and you may have to switch back to an earlier generation after certain more boot attempts. And this is an example for an authoritative DNS. The parameter boot.com specifies what you call a synchronization point. That means that it eventually just cycles back. So we haven't specified the number of boot attempts here. I don't believe it's a default. This work was actually driven by Julian Malka, who's giving a talk on this. I really hope it's not the same example he uses, because I didn't ask him before. But more or less, this is an example of where you would have a type of service that has a specific number of boot attempts that needs to be done. And you can use this automatic boot assessment to just count the number of times that it would be completed, and if not, your failure action is to reboot. You could also specify the number of times it could be rebooted, but I don't believe that's done here. And another feature is integrity checks for the next store. So when implemented secure boot and transitioning to stage two, limitations do arise. You can mess with the file system, essentially. If essentially you don't really have any place to prevent that from happening. And a number of different times things were tried. So for example, the invariant which works on the block device layer was used, but it was unsatisfactory in the sense that it basically creates an entire copy of your NixOS generation, which takes up a lot of space and doesn't really provide any means of flexibility to work around that. FS very to works on the file system level, but it doesn't... The invariant is also read only. That's the reason why it takes the entire NixOS generation. FS very to works on the file level, so that's a big improvement over the invariant, but it doesn't prevent you from kind of just switching files at that point either. So you could easily switch between bash and Perl and whatnot. IMA and EVM, IMA is intermediate measured assessment and EVM is extended verification measurement for system integrity. They more or less have the same problems as FS verity and they're really only good for auditing purposes. Like they don't really help us that much in terms of our intended goal of looking at the file system and seeing if it gets past stage two. So what was essentially chosen was the simplest method, which is just to use NixStoreVerify. And the problem with this is that there's a penalty for it. If you have a low-end device such as a Raspberry Pi, you probably have to wait two minutes extra for it to boot. And if you have a high-end device more or less like a desktop machine, there's still a five-second penalty. And there are, we're currently considering this such as Apple's signed file system, which looks promising but has to be combined with something like BCatchFS or BCatchFN or something to kind of really make it a viable option. And this isn't, this is just speculation. This work is actually driven by Will Fanature, whose GitHub name is Elvish Jericho. So he's done the research for this, but he hasn't, and this is the end result of it. But in terms of future work, he's looking at considering whether to use that mechanism that I just mentioned a while ago, which is to use Apple's signed signature volumes to verify the NixStore in stage two. So enabling integrity checks looks as such. You have to first create a public file key, and then in the NixOS modules, you essentially import that into the trusted public file keys, trusted public keys, using boot when during the, in the boot, in it verify a configuration. And it's really just as simple as that. The only issue is that you may have to find a way to kind of hide your file key somewhere if you're in a situation where you have the boot unattended. And lastly, interpreters in Nix. So Nix essentially has a lot of bash and pearl and Python scripts all about the place, and this leaves a lot of room for vulnerabilities. And in phase one, essentially all the pearl scripts were removed. So set up, etc.pl, which sets up users and whatnot. That's replaced by overlay in it already. Update users, groups, that's replaced by system, the system, the users, this user's functionality. And the broad replacement of activation scripts, that actually isn't necessary for this, but it does work in terms of performance and in terms of gaining performance and maintainability benefits. So this is, I failed to mention that this was just like phase one of this challenge. We're still waiting on the status of phase two. But phase two would also involve removal of bash scripts. I'm not entirely sure how Python scripts would be removed because they're more tightly integrated. For example, if you tried to use a Nix OS test, you'd end up having to use Py. You'd have to write it in Python at some point. So it's not clear to me exactly how that path forward there. But pearl has been removed in this phase of the sovereign tech funds country-by-back challenge and in the future, eventually bash should be removed. So yeah, that's my talk. Any questions? Okay. Well, feel free to reach out and push. What is store-fairification? How about a little huge cloud system wrapper that's a verified store, perhaps on first access? Well how would you verify, like what would we use to verify them? I mean the Nix data base hashes there. I mean, we have to, obviously, have that hashed somewhere. Yeah, yeah. Yeah, so Elko is asking how would we verify, can we use just the hashes from the Nix store to verify the files themselves? And I'm not sure where the hashes are stored to be honest, but I mean. Nix for Nix dv.sqlite. Okay. Yeah, but you need the signature somewhere of the hashes. Okay. I'm not sure. I'd have to circle back and get back to you on that. If there's no other questions, let's give it up again for Don. Feel free to reach out. Thank you.
Packaging Bazel and Bazel-based packages
All right. So we now have on stage Guillaume who is going to talk about, as he mentioned, packaging, Basil packages with Basil Software with Nex. Thank you. So welcome everyone. I spent some months working on and off on trying to package Basil 7. And I figured there are plenty of fun things that happened and nice things to learn from that. So I wanted to share with you today. I'm Guillaume Maudoux, a computer scientist. I'm like a founder of BIL systems. All of them, well, the one I prefer is the one we don't want to speak about it today. I'm working, like for my day job as a consultant in doing Basil and Basil work at Twig. You have all my contact info. And yeah, this work was funded by Intuitive Surgical and Twig together. So getting Basil 7. Basil 7 was released last year, end of last year, and it took some time to get this into Nex packages, even though there is already like a Basil 5, a Basil 6, a Basil 4 in Nex packages, and each time you make a copy, reuse, and try to improve, it still takes a lot of time. I think there is like a fundamental difficulty in trying to get Basil to work inside Nex because they just don't like each other. To be honest, I think Basil tries to do everything and doesn't want to share with anyone else. So of course, there is friction when you want to encapsulate it into something else. But initially, it looks like they could be quite happy together. Basil is like file-based, like make, like CMake, like all of these nice build systems we use for all the packages inside Nex. So that should work, and Nex is package-based, so the kind of one works, manages the package, the other one, just a project, should be easy. But then it gets a bit worse because of course you cannot get the benefits of Basil. Basil wants to do remote caching, remote execution, and all of that is prevented by the Nex sandbox. So just like with other build systems, in this case, it's not different, but you see how these things do not work as well as you may expect from Basil. Then we have some issues, like a bit more annoying. Basil assumes that there is like the file system hierarchy stand out everywhere, and like Nex does not provide that anywhere, so I kind of stuck there. Basil loves to include pre-compiled dependencies, and it will happily, and this feature is intended to help all the users that are not using Nex, but it's really annoying for us, it downloads dependencies from the Internet, and that's kind of built into Basil. So with the sandbox, we break Basil completely and we need to work around that. It's really hard to package, so I would like, like these are like four snippets of things that we won't discuss today, and that we have to do to make Basil work. We have to patch all of the user bin paths. We have to remove some chunks of code that try to access the system state, and that's obviously not possible inside a sandbox, like preventing the system from going to sleep. Apparently, Basil has an empty arg when it calls GCC and it crashes. I mean, we don't even know if that code is still needed anymore, but like, not so much to do, we didn't try to remove it. At some point, at least, that was needed. And, yeah, it comes with nested archive, so if you unpack some things so that the patch works and then you have to repack them in the exact same way otherwise everything breaks. Lots of fun, but we won't discuss that today. I've picked like five issues that are more meaningful, I think. There is the Java toolchain. That's something that we need to set up properly to be able to build Basil itself. And then setting the right path, because if we want to package Basil, we want Basil that works as closely as possible to the vanilla Basil. Then we move slowly to building package that uses Basil as a build system, even if it's also the case for Basil, because Basil uses Basil as a build system to build Basil. You're still following? Good. So we enter the realm of like, fetching dependencies of the build, which is quite tricky. And then we will also discuss about like, picking the right Basil version using dot Basil version file. And again, the Java toolchain. Why not? We will see it's so complex that it deserves two points. So we start with the Java toolchain. Basil hardcodes everything everywhere. So here, Basil tries to do something nice for us. It will download the single jar binary that's already pre-compiled. It's a cc++ application, by the way. So of course it doesn't just work like that with Nix. And for the Linux and Windows, it will pick the right one for you. But there is no way to change that, except like patching the source code. We cannot like, from the outside configure the build so that it picks something else. In this case, we probably want this one, the one that's built from the cc++ sources with the c compiler, because we can do that on Nix and we get the proper binary, but none of these pre-built binaries work. So we have to create, and you see this is a patch, so there is no way to do that without like modifying the source code again. We create toolchains for Basil that do not contain these like pre-built dependencies. To do that, we use a non-pre-built toolchain configuration. So that defines somewhere, even though it's not that much tested in the codebase, so we can reuse it. And we use the localJRK, meaning we want to use Java from the system and not some pre-built Java, because that one also doesn't work. Okay. Yeah, okay, that's all right. So this is like showing the non-pre-built configuration that we use, and this thing that we would get otherwise by default, like the remote SDK, it's something you download from a remote cache. And the single Java one we've seen on the previous slide. But that custom toolchain is kind of fragile, because nobody uses it. So when I updated, suddenly it was missing one of these entries, and we got like a proper binary downloaded from the Internet, and that didn't work. So we have to make upstream patches to get it to work. Thankfully, that one landed fast enough, so it was like not broken too much. It was broken between 7 and 7.1, so no one should ever see that on a proper release. Another funny issue we have is setting the right path. Initially, it seemed pretty simple, like we have to patch the paths, so that's onNixBuildBasal finds all of the common bash scripts that everyone expects to have. So we do that, and we do that in a lot of places until it works. And in the end, we have something that doesn't really work, like Basal should. We've discovered that... So this is the behavior of like Vanilla Basal, based on like several options that you can change. And if you take Basal with some patch that we have, there's something very strange there, like it doesn't set an extra branch, so you can end up with the path that's configured differently than what you would expect. And you have the same with all the patch, so everything has branches everywhere, and it's like, yes, it is not consistent with what you would expect from upstream Basal. That's really annoying, because as we discover later, users rely on that fake behavior, and then when you try to fix it, they say, hey, you've broken the build. Yeah, I think this is the correct way to do it with respect to Basal, but I know that we've been using that NixBuild Basal for like three years, and you're used to it. So now it changed your build. It's really annoying that we are maintaining a fork of Basal to some extent. So the first issue we had, we end up with a path that is like literally no such path. That thing can only come from Nix. We are the one that are coded. If Basch from Nix starts without the proper path, it will default to no such path. It's kind of similar, like normal Basch will default to user bin, user local bin. That makes no sense either, so in both cases, it's a kind of a nice default, but doesn't really work with Basal, because Basal expects to have some default Basch tooling there. We have a script for that, like if path is like no such path, then we export the path to the default shell utilities. It should work, except we have this runtime inputs. I had no idea what it did. It seemed like good. We want all of these dependencies, except this modifies the path to that value. And it does it so that it happens that value to the path. So we have a no such path, colon, something else. It kind of works, but it's also kind of ugly, like why is no such path in there? It shouldn't be. That's one of these red dots. Okay, we move that line. Easy enough. And then we still have some very strange things. Like the path is composed of two parts, the path that Basal sees from the outside, and the outcoded string path. So some concatenation happens somewhere. This is like, this does not happen in the default Basal behavior, right? It's only the path there, only the path, only the path. So when we should have Basal's path, we see an extra outcoded string. Where does it come from? Of course, we wrap Basal with that path. So this is technically the path that Basal sees, because it has been wrapped. But it's not what the user expects, because the user expects to have the same path as the one that's ambient when it calls Basal. Okay, so we need to remove more and more wrappers. These were all useful at some point. They fixed some issues. Now we have better fixes implemented, but these ones are broken. And in the end, we get this very nice graph that does not depend on which Basal binary we are using. I mean, it took me some time to read there, so I'm really happy when I see this graph. This explains a lot about how Basal decides what is the environment for the actions that you run, and we can discuss it later at some point. At least, all our Basal binaries are consistent. Modular something in the hard-coded version from like Vanilla Basal. It's been user-bin-user-local-bin, and we set it to some proper path that contains default bash things. So, if streaming Nick support in Basal, that's like kind of... Ideally, we would not have to do all of that work. It should be way easier to build Basal in Nick's. Maybe some of the common stuff we can accept, like patching Shebangs, we know that, but having to redefine everything because Nick's is not aware, because Basal is not aware that Nick's exists and only thinks about, you know, if it's Linux, then probably that binary works for you. Give me a way to change that. Give me a way to configure that, and Basal does not do it. Technically, it's feasible, but it's not really used, at least in the Basal code base. There is no way to configure it. It's just like Basal knows better than you what you should use to build it. And in this case, it's obviously wrong. The biggest chunk here is like prefetching build dependencies, because that's where, of course, we have to fight with Basal. Inside the sandbox, Basal is not allowed to download anything. So, we have to download it before Basal kicks in. And Basal has to find it. So, what we used to do before is like look at the workspace file, which defines all of the external dependencies. This is kind of looks like Python, right? Yeah, it's Python enough that we can execute in an environment that defines HTTP archive. And we collect all of the files that we depend on. It's not perfect, but it's funny because it's a big hack, and sadly, it just doesn't work anymore, because they have changed the format. So, now the workspace is empty. We have no Python to execute. Now, we need to parse some JSON, the log file, which is something that really nice that comes to Basal. Now we have a proper log file, so we know what we can do with log files in X. We can parse it, we can retrieve the information, and then make proper NICs derivations of that. So, we can parse the log file with JQ. It's a bit obscure, but it works. Then I wrote it in Python because it seems like it's obvious. And then I wrote it in NICs because why not? We can parse JSON in NICs and extract everything. So, I have this nice script that takes the log file, a JSON file, and generates a repository cache, and these are like formats that Basal uses to fetch to store things that were downloaded from the Internet. And so, we first go there to see if it's already downloaded before trying to do the actual download. If you did it well, it will not try to download. If you do it wrong, of course, it still tries to download and crash. It's obviously not that easy because there are only three versions of that log file. It's been released like a year ago, but they are making fast iterations. Version 4 is there and it's probably going to change everything. So, I don't know if we will support all of these versions or maybe stick with the latest. It's really unclear right now. And the format is kind of well-defined, but it's like each rule can have its own internal format. And sadly, the URLs and the ashes are hidden in that internal format. So, we kind of need to support everything that's possible, which is obviously not going to work, but it seems that all in all, most projects is the same set of commands and we are able to do something with it. Of course, any failure makes it that we cannot build Bazel or that we can't build a package that uses Bazel because we are missing dependencies and we are back to the nightmare of just trying to download everything locally and then make a big blob that we hash and say, hey, take that. But it's totally not reproducible. Yeah. Speaking of dependencies, there was like a nasty issue that came from like the Bazel versions. If you look at Nix packages, I mean, we are nice guys. We provide Bazel 4, Bazel 5, Bazel 6, it's Bazel 6, but it's also the default Bazel, and Bazel 7, right? All of them are provided, but no. When you build with Bazel, you want 7.0.2 or whatever. But Bazel needs exactly that version. Why? That's how it works. But the thing is, it also works really well if you delete that file before building. Most of the time, but no. Now again, it started failing and it fails for a funny issue. Some of your dependencies, they are dependencies of your builds. There are things that you need. But some of your dependencies, you don't know about them, but there are things that Bazel needs because it has built-in rules, and these built-in rules have like built-in dependencies. And these are also in the log file. But if you start building with a different Bazel than the one that was specified and that was used to generate the log file, then these built-in dependencies are not the ones that Bazel wants. And Bazel tries to download it from the Internet and then it crashes. So we need to do some magic. We need to take some files that are downloaded for the project and some files that are downloaded based on the Bazel version. We need to merge them together. And that creates a folder that Bazel can use to build correctly with a different version. I'm really fond of one thing here. The fact that you can merge this thing by using Simlink Join, it means that the format is not that bad. Well, when you reach there, you're like, okay, this is pretty neat. I think Nix could learn something from that format too, because what they do is something really smart. They store like binaries from the Internet under hash, just like Nix does. Technically, it's like fixed output derivation. We download something and we put it in our known hash. But with this problem, like where we change the name or change something from the input and forget to update the hash, then it happily reuses the same hash forever. Bazel fixes that by having like an extra file that tells where it was downloaded from. And so you can pass extra information that will invalidate the cache and force a re-download. Probably just to check that it's still the same file, but if it's not, then at least you'll check. I'm not sure how to implement that. It may cost a lot if you make like a few changes in Nix, but there is some inspiration to take from there, I think. Anyway, we are back to the Java toolchain. So we have a Java toolchain that can build Bazel, but now we want users of Bazel to be able to build Java projects. It's kind of the same thing as the null-a-dull, because when we build Bazel, we can hard-code everything we want. We can even change the source code to make it work. When users are using Bazel, we cannot like change. We don't want to patch Bazel too much for it. We want to make change that work by default, but we also want this change to be like revertable, so people that don't need it or it doesn't work for them for some reason, which happens sometimes, then they need to be able to remove it. And to do that, we use another hack, a very nice one, I think. Bazel by default, access slash etc slash bazel dot bazel RC. There is no redundancy there. And we add an indirection, so we patch Bazel so that it access first a Bazel RC that is like hard-coded by Nix. And that one contains a few settings that Bazel will always rate. The thing is, using that technique, we can, I mean, we set up some default values, so we force Bazel to use our local, non-pribute Java toolchain, the one that we painfully set up. That's it, yeah, we force the version of Java to run Bazel, and then we try and import the real file. This is very transparent. There is a small risk that users don't know that this exists and don't understand where this option comes from. I hope they will know too deep about Bazel. But otherwise, each of these options can be overridden because these being options, the last one wins. Okay, so if you add a new one, like you change Java runtime version to remote gdk, your last flag will win. Except for one, because why not? This one, the first one wins. So nobody can override it. When it's there, it's like stuck. After some discussion, they realized that it was a bit silly, and so we agreed on removing the order for these flags too. That was also a funny, funny, funny thing. So regarding all the things that we do to make people build packages, we do it with our Nix build Bazel. I don't think there is as much to upstream as the things I presented before, because these hacks are a bit annoying, but they are kind of expected too. These are things we do for other build systems. Like we build packages, we download the dependencies, and then we provide them, and we then have to add some wrappers. In Python, we have even more wrappers. So I think this is kind of expected, even if it's a lot of work, like parse JSON in Nix, maybe that wasn't twice, either. But I think we've reached a place where it's already kind of easy to use, even if not perfect. We cannot compile Bazel from real sources. We use the generated sources, which is not exactly the same thing and differs in a lot of ways. It's not fully bootstrapped either, so we cannot really yet build Bazel with Bazel, meaning that building things with Bazel is still difficult. We build Bazel using the bootstrap process, which is, again, different, simplified, and more guided by the Bazel team. So it's not yet perfectly easy to use, especially on Nix OS. But then at the same time, I realize that upstream has some interest in Nix, they know we exist, they are willing to add patches. If you have a proper reason why we need it and we can argument, then they will happily merge it. I was also surprised. I know I work on very small projects all over Nix packages, but on that one, there are reviewers, people making comments, people testing my temporary thing, reporting issues. It was a really nice community feeling for one of those projects. So I think it's really well supported. And of course, being Nix, every time we update it, we improve it and we keep the receipt. And so we can make improvements, we can build upon the shoulders of Bazel 4 or 5 things that came before. Well, as you've seen, that's all I did, and I'm the kind of person that really loves challenges. So if you are willing to join, you are welcome. And if you have other challenges to share, I'm really keen to hear that with you. Thank you. Do we have any questions? Thank you for the talk. You mentioned generated sources versus real sources. What are the generated sources of Bazel? They do pre-generate some like, tarbles, things that are not compiled, but provided in the Bazel binary. Like all the built-ins, Starlark rules will be there. That's like the zip file that I have to unpack and repack. That zip file does not exist in the normal sources. You don't commit a zip file to... And a lot of other things, like they try to help people, so they provide binaries. So that's with that generated tarble, which is huge, like it's like 250 megabytes. It contains all of the binary dependencies. And a lot of those we don't need, but also it makes it really easier to have all of the needed dependencies. Because of the acts we do with the Java tool chain, we need one more file that's not there. That's pretty easy to compile. So we don't have to think too much about dependencies when we build Bazel with that tarble. Okay. If someone who enjoys Bazel discovers your work and understands it, do you think they will be convinced to stop using Bazel and use Nix instead? No, I mean, that's not at all the point of this talk, at least. I can't say what they will feel. Maybe they will discover that Bazel is complex. But the reason Bazel is taking off is it has all of these optimizations that you want. So as someone that deploys Bazel a lot, and a lot of companies, they still want it, right? You don't want Nix to come and compile for two hours because you made one change to one file. That's like... For me, these are complementary build systems. But ultimately, Bazel was not trying to be a package manager, too, because that's the part that's really conflicts between Nix and Bazel. May I have another one? So from the practical point of view, how soon can we have something like use Bazel mod lock hook instead of fetch utters with nested if else loops for a platform and queue the support? And how far are we from being able to say Nix and it tends to flow? I can answer the second question easily. Like really far. But for the first question, it's definitely something we could do. It's just like I'm amazed by the amount of work I had to do, like recompiling Bazel because it's not an incremental process right now. It takes like 20 minutes, 30 minutes. So each iteration takes that time. So we have to focus on making it work, not like trying to make a nice set of hook. Ideally, this would work, of course. But there is a slight problem also. It's like you need these things before evaluation. So it's not just a hook, right? Because the hook cannot download the files. To do something like that, then you would need recursive Nix or some way of Nix inside Nix to be able to do that. So right now I have implemented it with import from derivation for like making it simple to use. And in some cases I just like download the log file, add it to the sources and use that directly. So I don't have to have an import from derivation at all. Can we just commit the log file of the hook on graph? Yeah, we can do that. It's just huge. So what I do is I take JQ, remove a lot of the things that are not needed, and it's still a log file. All right, let's have a round of applause for Guillain. Thank you.
Remediating thousands of untracked security vulnerabilities in nixpkgs
Okay, up next we have Del Roth. He's going to be talking about improving security in Nix packages. Thank you. It's a microphone actually working. Yes, good. So I'm Del Roth. I've been working on Nix packages for a few years now. I've been involved in some security mediation events and recently working on Nix source infrastructure. So it all comes from a story. So let's start with a story. Sometime last year, this vulnerability dropped kind of silently. Chrome released an update saying, hey, we released an update saying, hey, we released an update. You should really update today because people have actually been exporting to the world and it's giving code execution. So the interesting thing is we actually patched that in Chrome really quickly, but also what Nix packages, Nix OS and some other distros started realizing that that's not actually a Chrome vulnerability. It was reported everywhere as a Chrome vulnerability at the time, but it was actually vulnerability like a dependency of Chrome, which is WebP, which is an image parsing library. So we patched WebP and not just Chrome and everything is solved. Everyone depends on WebP, so whenever the version is updating Nix packages, they'll pick up the update. It gets rebuilt. Everything is magical and we don't have to do anything else. So then it's viewed about a month of work to actually make this happen. So this is the tracking bugs that I actually linked at the bottom for trying to actually fix this vulnerability in Nix packages, not just in Chrome, not just in WebP itself, but in actually everything else that's in Nix packages. I've highlighted part of it here, which is that some applications bundle their own version of WebP. Each of these need to be updated separately by Nix packages' maintainers. And that's not just Chrome, that's including some other web browsers, for example, that was not including Firefox, but for example including Thunderbird, because the packaging was slightly different, etc. See Bigo for a guess of all the non-applications that need an update in their status. So this is Bigo. And yeah, so as you can see, this was about a month of work. I'm not going to go through the whole list, but there's a lot of stuff. This list is probably not complete, because as I'm going to get to, we lack tooling on this, we lack statistics, we lack data. And so this talk is trying to, given overview of this problem, try to bring awareness to it and try to bring up a few solutions and how we could actually do things better. So why is this happening? Why did we have to fix so many things? So this is a phenomenon known as vendering. Vendering is when a piece of software decides that instead of depending on the library that they get through, like, you know, package config or through the general build environment providing to it, the software is just going to copy the source code of that library, put it somewhere in their own source directory, and then, you know, it's easier to build because people don't have to install dependencies. The problem with that is, since they've pinned the version of the dependency, whenever there's an update that needs to happen to it, then the update doesn't just need to happen to the dependency, it needs to happen to everyone who has pinned the version of the dependency by copying it into their source repository. To some extent, vendering also happens with rock files. Rock files aren't exactly the same thing because technically you're not copying the source code, you're just enforcing that your software can only build with one specific version of a library, and sometimes even if you're providing the hash of the source code or the binaries that you need to use to build the software, which means that in practice, you're not copying the source code, you're just copying the hash of the source code, you're still basically pinning it and making it impossible to do any kind of update. So, for this specifically web people ability, so we spent about a month on this, we did not, but it should fix everything. So that's a sad part of this, is that tens of people, I don't actually have a full list here, but tens of people spent probably hundreds of hours combined on this, trying to fix software that we have index packages but maintainers weren't super active on or just chased on upstream, because if upstream has copied the version of a package of a webp into the source code, then you need them to actually go and fix the problem, or you need to apply patches, but patches are fragile, they'll just break on the next update, etc. And yeah, even though we spent hundreds of hours on this, we did not actually fix everything. We fixed, I think about like 50% by count of a new number of packages, but what we did actually is we spent some time categorizing and saying, okay, these are the actual high risk stuff that's likely to get exploited, that's connected to the internet, pulsing and trusted input, and then there is the rest, which is like, you know, maybe we can get away with not updating it now and sometime in the future upstream will realize that they actually are vulnerable to something and we'll maybe fix it. Even if you look at only the high risk stuff that we categorize as high risk, we had some packages in there that we did not actually get fixed. We had to mark them as insecure index packages because even though they are like internet facing software that passes and trusted image files, for example, like email clients, they did not get an update within a month for critical vulnerability that Chrome people were saying was exploited in the world. So, let's play a little game. Let's have some maybe, well, I don't know if we'll do audience participation. Give WebP copies index packages. So I've counted, I've actually been building some tooling as part of the remediation for this Give WebP vulnerability and we have a better idea now of how many packages are copying libraries that we have in Ex packages. So for Give WebP, we had about 116 different packages and by packages here, you know, we had a whole talk about what the package is. I'm talking about something that is built by Hydra. So, next package is stuff that is not non free, that is not insecure and not marked as insecure. And I'm not counting like, I'm counting only one architecture, I'm like grouping by package name. So we have about 116 Give WebP copies, but I think WebP is actually fairly recent. WebP is a modern image format. What about the PNG, which is significantly older? 2037 WebP JPEG, which is maybe even more common than with PNG, 2053. ZGib. So ZGib is a really small C library that people have been using for maybe 30 years to decompress gz files and zip files and stuff like this. We have about 761 of those in Expack, copied throughout multiple Ex packages things. So let's say that for example, there would be like a vulnerability in the PNG. How do we go and fix it? Well, given that we took about a month for 116 packages in WebP and we got 250% of them, I guess the PNG would take about like two months maybe and we get also 250% of them. It's not really a great outcome. So is this actually a problem? How often do these libraries actually have vulnerabilities? Also, okay, we have copies of them, but maybe they're actually being kept up to date and it's not actually that bad. So here is actually for gpng, this is a grouping by version. It's another that we actually have enough information that we can figure out what version of the PNG is being embedded within all the packages in the next packages. And you'll see that the top of the distribution is very much like recent versions 1637, 1639, 1640. That's actually pretty good. Next packages unsurprisingly is at 1640 right now. We have the gate. Well, I don't know if it's a gate version because you want 170 in there and I don't actually know it's definitely not as used as the others. What I've also looked at is the biggest date for some of these versions and these are also versions really small than 10 years ago. So we actually have two packages in next packages right now using a gpng version from 2004, which is kind of impressive. It's like, you know, was x6664 even a thing at the time? I don't know. But somehow it works. I don't know actually I've not tested it, maybe it doesn't work, but it's in there. You can next build it and you'll get a binary that has a gpng from 2004. Does it have vulnerabilities? Yes. There's like about like 12 different critical CVs that give code execution, buffer overflows. Some of this might be might be mitigated these days because we have vulnerability mitigation that's part of operating system. That's part of compilers and stuff like that. So it's not exactly clear how many vulnerabilities apply to these old versions. Another thing is that there isn't really anyone right now who like finds a new vulnerability in gpngs and goes and say, oh, I'm going to test it on this version from 2004. Just to see if it actually applies. So a lot of the vulnerability databases out of date and doesn't really contain the right information to even check against that. I've mentioned block files. So block files are kind of a new problem. It turns out if you go back to 10 years ago, we didn't have software in Rust and go in JavaScript. At least we didn't have as much as we do now. Java kind of did block files a bit with Maven even at the time. But it's mostly a new phenomenon. And the good thing with block files is it's actually really easy to get a full transitive list of all the dependencies because they're in the rock file. That doesn't mean that people are any better at actually managing their dependencies unfortunately, even though there is good tool game to do so. So for example, for Rust, there is this tool called Calgo Audit. And Calgo Audit is a tool that takes a Calgo.rock file and tells you all of the vulnerabilities that apply in this Calgo.rock file. So I used some tools that I wrote as well to go through every single Rust package, current index packages. And that's kind of like looking at every single derivation that has a Calgo.debs in it, correcting the rock file from that. And what we find by doing this is that there is 62% of all Rust packages in the next packages right now that have at least one vulnerable dependency locked in a rock file. This is, I mean, I'm describing as an X packages problem. The problem is it's not entirely an X packages problem. We're just fetching it from upstream. It's just people are locking dependencies. We don't really have control over that. We did for Python, C, C++ dependencies. And upstream is just not doing as good a job as distributions were doing. I mentioned a thousandth and a hundredth vulnerable dependencies. About 750 of these are actually higher critical severity based on CVSS score, which is a third metric, but it's about as good as we have. So yeah, if you get a Rust packages in X packages at random, you have a 40% chance that one of these dependencies has at least one known critical, like higher critical vulnerability. Doesn't mean it's exploitable, but let's say that even like one percent of this is exploitable, that's seven packages in X packages with higher critical vulnerabilities exploitable. That's still not good. And one percent is just a random number I picked. Yeah, so as I mentioned, it's an ecosystem. It's a general open source ecosystem problem. I don't know that this specific log files thing is something that we can fix at the next packages site. Next packages has some fault. We have some rust software that's just out of date, for example. And so when we do that, you know, well, the log file is also out of date. But from the ones that I've manually inspected, this is not the majority of these cases. In the majority of cases, this is that next package is packaging the latest version from upstream and is just containing insecure dependencies. What is causing vendor in X packages? A few things. We don't actually try to prevent it. I've checked. I was really surprised next packages does not have any documentation, any policy against vendor. There's nothing that says, you know, if a software has an option to use system with PNG itself using its own empty version, there's actually no documentation that says that we should prefer using that option. A lot of people do it because it's good practice. Not everyone does. We don't really have, yeah, like we don't really have a way to prevent it for newer language ecosystems like for Goura, JavaScript, as I mentioned. You don't have a choice. You just have to vendor stuff because we don't have a way to, we don't have Rust libraries in X packages. We just have the GIF software. Same for Go, same for JavaScript. Well, now same for JavaScript. We used to have Node 2.x for a while, which kind of added NICS derivations for libraries, but it was automatically generated anyway. So it's not like we could really do much about it. And finally, like until recently, we didn't actually have any tooling to try and detect and measure this problem. So it was just like hidden, big of the water. And we couldn't really go and say, oh, hey, there's a new derivation that's being proposed by someone. Like a new package has been added into NICS packages. Is it actually rendering anything? People would have to go and manually check. And nobody was doing this when reviewing packages because it's just a lot of effort. This potentially stuff we could do now automatically with some newer tooling that I've been writing. As I mentioned, we don't have policies against rendering, but it's even wasn't that in X packages. We don't have really policies against even building from source. It's preferred, but there's actually no preference expressed anywhere. I've checked again today and I could not find it. So people just go and fetch things from app images, for example. Upstream this with an app image. It's too complicated to build. I'm just going to fetch the app image, run patch health to fix a path to dependencies. And then the problem is, well, you don't really know what libraries, what dependencies upstream has been using to build these app images. It's usually not great. So this is something we had with WebP, for example, where I think Anki, for example, like the flashcard software is using. We just using the app image for it. And it was vulnerable because it was using some build environment from 2018 or something. That was, of course, not receiving any security updates. We fetched dev files. We fetched... People are very creative about how to get binaries. We fetched targz. We fetch static go binaries. We... Let's not even talk about JavaScript because you can just fetch a targz and unpack it somewhere. And it's fine because when would JavaScript software ever have vulnerabilities? And, yeah. So some of the distros have famously strong preferences for building from source. Debian has really good policies regarding rendering, which have always been kind of the gold standard, I think, in the distribution world. We should probably do some of that. How do we address Rust, GoNPM, et cetera? I don't think we can. I think it's an upstream problem. But what we could probably do is make it clearer to users that they're actually using intellectual software. It's not really a problem that any other big distros have been... Has been hitting much because just... NICS packages are much bigger in terms of scope. We put everything in NICS packages. We don't have a UR. I mean, we have any UR, like some people use, but like the bar for what goes into NICS packages is very low, right? We don't actually have many policies against like, you know, this... Let's just keep this out of NICS packages because it's not well maintained by upstream. We don't really do this much. So, you know, by being a huge package set, we have the problem that we have pretty bad software that's not really being kept up to date. We have stuff that's just not maintained anymore by upstream. And I feel like this is some... Like the way we should fix this in NICS packages is like, lockfile insecurity problem is just by making sure that if upstream is actually maintaining the lockfiles, we should probably inform the users and make them actually aware of the risks. We currently have this non-rune... Like non-runeabilities bit that we can put in the package. The problem is that it's extremely cold and it's extremely annoying to work around. Like, people have to manually... It stops evaluation. It's not a warning. It's like a critical error. You're using like an insecure package. And so what people do is they just allow every insecure package because the easiest way to work around the error. So, yeah. Tool Gang, as I mentioned, until recently, we didn't really have any way to detect this. I have actually written a bunch of things to try and detect rendering. So, one tool which is called GrepNICSOS Cache, and what it does is get rid of GrepNICSOS Cache. It goes and it takes a list of store paths which we get from Hydra. And it runs... It will just go and fetch every single store path that Hydra has built, which is like a few, usually 100,000, and it just runs some signatures on it. It looks for some strings that you find as part of the implementation of some libraries. And if the library has been vended or statically ganked, you will find a string in there. And sometimes you can even get version numbers and stuff like this. And some other projects I've been working on is... I've kind of called it NICS packages Vendor Drone Scan because I wanted to also make it include the above, like, new rendering thing. But this one is currently specifically doing log file analysis. So, finding all of the log files for Rust, JavaScript, and doing automatic vulnerability detection based on the log files. Conclusion. We have new tool gang. We have a better idea of how rendering goes in NICS packages, and it's not great. And it's a problem because we cannot actually fix security vulnerabilities in base libraries right now. We tell ourselves that we did by, you know, fixing the library itself, and we say, I have 100 different instances of the library being unpatched. How to fix it? Well, awareness. Now all of you probably know about this, and when you review new packages, and maybe look at whether this is happening, more discipline. I think we should have policies about this. I have not thought about the exact policy we should have yet, but we should probably have one. And better reporting for cases that we cannot fix ourselves, which is the insecure log files, most of the time. Yeah. Here we go. If you have any thoughts or comments, please ask any questions now. Otherwise, this is a contact info and some links to tooling. Thank you. Thank you for that. That was terrifying as someone who's come from the Debian world. Exciting as well. It's a really simple social approach that we could take to this to add another tick box to the default pull request template to say, have you checked that there's no vended crap that you could be using a system library for? I think it would help. At some point, if we just continue adding check boxes, people are just going, people are already ignoring a lot of them. They say, has anyone actually checked the sandbox thing in the pull request template anytime soon? It's like, you know, I see two people check, raise their hands. The rest of us have never touched a check box. Yeah, I kind of, I'd prefer if it was automated through tooling, if we could detect some of it automatically. And I'd prefer if at least we fix the policy first and then, you know, maybe figure out the actual edge cases before we start taking people to look for stuff without actually being accurate about what to look for. But yes, we probably should be doing some valiant of this. One of my favorite things to do is package and archive and preserve old software. And some of that work has been done as a pull request in next packages. Sometimes it doesn't get merged because, you know, it's got an old dependency on like Qt4 or something. So they say, oh, no, we can't merge this. And it does prevent some software from getting into next packages, but there is still a lot of software still in there that managed to sneak in. And since we don't have a policy, it's kind of like ad hoc, like some things get in, some things don't. Some people launch Crusades against like old Python, and they don't want that there. And yeah, it's kind of, yeah, messy. So what do you think about it? Because I think there's like a real value proposition for archiving old software, because Table is not nix-source.org, will archive the source code. It will be around forever. You'll be able to reproduce it in 20 years. Yeah, so what do you think about like, you know, banning old stuff and only striving for perfection versus keeping everything in next packages and just accepting everything? Yeah, I don't think we should be striving necessarily for perfection. Like, I don't think we can anyway, but the problem right now is, so for the case of old software, for example, usually one of the things that's blocking old software is when they use library versions that, you know, when they have so many dependencies in next packages that having this old software will induce costs on other maintainers, because they have to care about this old software that will never actually be updated to, like, you use a new API or something like that in the library. And so I think that we should figure out a way to include this old software or it can use this, like, you know, less-maintained software in a way that doesn't use a bandwidth of all the maintainers. Right now, we don't have a way to distinguish this software from, you know, stuff that more people care about, more people use, which means that whenever there is security radiation that needs to happen, the people doing the security radiation don't have a way to distinguish these things. And we use our bandwidth on stuff that maybe is, like, your old software and stuff like this. That's why I think that we should have better query categorization, better ways to inform the users about which category a given software falls into. We don't really have any of this right now in next packages. And I think it's, I don't know how we've managed to survive that long without having such a system. I think we just burn a bunch of maintainer time on stuff that, really, we shouldn't. And we should just accept to be broken. Hi. As you went around interacting with upstream to have these sorts of issues fixed, I'm pretty sure some of these things were also things that other distros were also dealing with and working with. As you encountered them, what sorts of things did you see in terms of those interactions with dealing with upstream, where you had maybe requests coming from other people, kind of a similar position as you, but just from other distros? Yeah. So a lot of the cases where I've actually had to contact upstream myself were stuff that was not actually packaged in other distros, just because, you know, Debian doesn't actually package .NET software, for example, or doesn't package much Go software, surprisingly. Like, if you want to get Grafana from Debian, I think they still won't have that in every repository. I mean, it's not free anymore, so they have a good reason now, but, you know, they never did, right? Do they even package Prometheus? Right? Like some pretty base software that people would expect to be able to apt-get. You have to use some external repositories because they don't have the right tools to package Go codes. So I think because the next package is much broader in scope, we have a lot more things to care about, and we've had to do a lot more of the talking to upstream. There is stuff that has been useful to all these distros and, like, you know, we, like, other distros have contacted upstream before us in some cases, and usually when we do contact upstreams, they are receptive to this. The problem is when they just don't reply. So, for example, for WebP, we had the issue that the main libraries that people are using to use WebP in Go was just unmaintained, and we failed the bug, and the maintainer has just still not replied to it to this day. So you have 500 users of this library that indirectly has a vanduid, vulnerable WebP version. What do we do? So we had to go and manually contact some other users of this and say, like, hey, you use a vulnerability that's not actually maintained anymore. You should fix that. And suddenly it's like, you know, like, the tree of things you need to contact grows and grows. It's a complicated problem. But does that value? It does. It does add value. Yes. I mean, it's general, like, you value to, like, the whole software ecosystem. It's just not the next packages thing, but it's tiring, right? You know, it's not feasible that we would be the only people caring about this for every single vulnerability. Great. Let's have another round of applause for Delo. Thank you. Thank you.
Nix for genetics : powering a bioinformatics pipeline
We have now up next Alexi talking about NICS for bioinformatics pipelines. So thank you everyone for coming. For five minutes I will try to make a kind of different presentation and try to say how NICS can help safe patients. It's not a clickbait title I promise. So I have a doctor in training but I also have a background in computer science so it's a kind of a mixed presentation and I'm working in France in Besanson Hospital. So when we are dealing with patients we want basically three things. First we want to give accurate results because for these patients diagnosis can be life changing. Second we need to be reproducible because all the doctors trust us with giving accurate results every time. Finally we want to be as fast as possible because there is a high demand for results. I'm working in a rare disease setup where obviously things are rare so it's hard to find and how do we do it? Well it's a mix of computer science. And expertise and state of the art technology. So here is a very worth scheme of how everything works. We start from a blood sample of a patient and we try to extract the DNA and sequence it on this machine thing. Unfortunately the machine doesn't do everything and we need some bioinformatics in there. And also the bioinformatics doesn't do everything either. We need a human at the end of a pipeline which is why there is a CSV file that a human has to read. And basically what the bioinformatics setup does is that it figures out a list of candidates for diagnosis and try to filter down the results. For example it can go from one million candidates to a thousand. If it filters too much we can miss the diagnosis. If it doesn't filter enough, well the human will have a really hard time trying to pass the CSV. When you say pipeline it's a really fancy word for just a set of common line utility tools but we also have databases in there that are in our setup just text files compressed. And when I say pipeline we just feed data from one CLE tool to another. And now how does Nix can help it with this? Well as a medical lab we have to be reproducible. It's like in the law. So Nix is a perfect fit because we can fix the software dependency and the dependency either like byte by byte dependency. So that's done. So it would be great if you could run on the high performance computing cluster. And in our region the folks in our cluster agreed to install Nix. And now we can run our current production with Nix there. Two things we didn't do with Nix was to manage the whole workflow. There is actually a tool for that Nix but it's more like a niche thing so we prefer to use a more common tool. And the final things what we could do in Nix but we didn't is to manage this large database because in our setup it's a different folder for Nix so we cannot install it. But it's there in Nix. Last last thing. I really enjoyed the community. It was a really nice interaction. I'm sure everyone knows. But it's also kind of a slow process because I tried to package something myself which is not easy at the beginning. And as you know there is like 5,000 pull requests on GitHub so feedback can be sometimes a bit slow and also I'm working on my spate arm either so it can also take some time sometimes. But for example the support for large databases has been added after a few conversations on Matrix. It was really fast. I hope you take some key points there but if you want to know more you can send me an email and I'll be glad to answer. Thank you. Thank you. Thank you. Thank you. Thank you.
Automatic boot assessment with boot counting
Hi, can you hear me? Up next we have Julian with automatic boot assessments. Okay, hello everyone. So today I'm going to, my name is Julian Malca and I'm a PhD student at Telecom Paris. And today I'm going to talk about automatic boot counting, automatic boot assessment with boot counting, sorry. And so what I will talk about is why we need automatic boot assessments. What is automatic boot assessment and like one of an implementation that is a system-deboot counting, I'll show a demo. So why do we need automatic boot assessments? Because we are using NixOS, we have like something I call the NixOS benediction. It's very difficult to break your system. You really have to want it to break your system. And even if you mess up your NixOS configuration, you can just roll back to a past generation and just be solved by the NixOS magic. But sometimes this benediction has limits. And let's say if you are administrator for a remote server, you perform some kind of server update, let's say kernel update. And you mess up. You choose a kernel that cannot boot your root partition. And at next boot what is going to happen is that it's going to fail to boot. And if you don't have any physical like BMC, then you will need physical intervention to revive the server. So this is this kind of problem we solve with automatic boot assessment. So boot assessment is any kind of technology really that can, and I can only assess if a boot entry is bootable or not. And we have one example that is system-deboot boot counting. So boot counting is a feature of, as I said, system-deboot. And the idea is the following. Each boot entry has a counter when created. Each time system-deboot tries an entry, the counter for this entry is decreased by one. If the entry is booted successfully, and I will define what booted successfully means, then the counters are removed permanently. But if the counters for an entry have ever reached zero, then the entry is marked as bad, and it's sorted in the boot menu at the end. Just if I get just a little bit more in depth of how this works. The counters are embedded in the entries' name, file names. So you have the file name, then you have the plus separator, then the number of remaining trials, then the number of failed trials. So this is generation nine. It has four remaining trials and one failed, and it adds, at the beginning, five trials set. Counters are decreased by system-deboot when it booted the entries by simply renaming the file. And you have to define some definition of a successful boot by scheduling whatever you want. You need to be started successfully before the boot-complete target. So when the boot-complete target is reached, then the entry is renamed by the system-deboot, the bless entry unit, that is going to remove the counter. And we are done with this entry. We consider it good forever. Okay, let me show you a demo. Right, so here I am in a VM, I am booted, and I'll show you that in the configuration.nix, I have enabled the feature and set the number of trials to be two for any entry. The VM is booted successfully, but I will do a massive mistake. I am emulating a mistake. I have a BKHFS file system and I will rename it as X4. So it means that now this partition will definitely not get mounted, and when I will rebuild, it will even change the kernel to a kernel not supporting BKHFS. So now it's rebuilding my configuration. You see when I am done rebuilding, I get no error, no nothing. I think everything is good. I show you the boot entries. The boot entries, you have five boot entries, and the last one as a counter, you see the dash five plus two, two trials for this entry. And now I will reboot this VM. So what happens when I reboot? At the beginning, everything is fine. My generation five is sorted first. It will try to boot it. Kernel crashes. It reboots. Now it's still sorted first because we have two trials for this entry. Again, kernel crashes, reboots. And now you will see it's now sorted last and we are going to boot generation number four. And of course we are going to boot it successfully and that's it, that's the feature. It's available currently as a PR. It will be merged very soon and be available in the next lesson table. Thank you.
Typhon: Nix-based continuous integration
Hi everyone. So today we're going to speak about Typhon, our software for Knicks-based continuous integration. Let's say, for the sake of the argument, watch your enthusiast, and you're asked to set up CI at work. So what do you do? You convince your boss to use Knicks, because that's great. And you install a header. It's the de facto software for CI with Knicks. So your job is fantastic using Knicks, but soon you realize that not everything is perfect, because first you need to install the thing. And it's not easy. It's a full. And so you want get upstate choices. Then you need to configure the plugins, and each time you change the configuration, you need to redeploy the thing. Also, it's hard, because when you want to change a plugin, you actually need to write a poll for scripts, and you need to redeploy it again. Last thing, when you want to do deployment, all you get is this rank command thing, which is a bit hard to use, and a super staple, which you don't really like. So you start to dream about something much more simple, something declarative maybe. Maybe you want your plugins to be defined, user defined basically, with Knicks maybe, and you would like some better deployment, more in line with Knicks philosophy, with declarativity and the productivity. Okay, so in this dream, how does it look like to configure CI for a project? Well, at first it looks a lot like it does in Hydra. You set up an attribute set of derivations which are going to constitute your jobs. But then you write a Knicks expression for your project that looks a lot like this one. So here the makeGitUpProject function takes all the information that needed for a GitHub workflow, with the repository, of course, some arbitrary deployment rules. And of course you're going to need secrets like GitHub tokens and SSH keys to set GitHub statuses and do remote deployment. This expression is fed to Typhoon through the Flake URL. And once Typhoon spawned your jobs, it's going to use the project expression to build actions. So actions are scripts, which are user defined and Knicks built. They are run in a sandbox and triggered by Typhoon on various occasions to provide features that will be provided by Hydra's plugins. For instance, the most important hooks triggered by Typhoon are before and after every job to set statuses, of course, or do any kind of deployment. In a little bit more detail, an action is sandboxed with only access to the store and to Internet. It does not have access to the local machine. So for instance, it does not have access to secrets for other projects. It takes JSON as input containing the decrypted secrets and of course contextual information about your job. And it outputs JSON to communicate with Typhoon. Thanks to actions, Typhoon is completely for diagnostic. Actually all the communication between Typhoon and the forge is done through actions, meaning Typhoon can fit a lot of different workflows. But how do you write actions? Well, of course, you use the Typhoon's Knicks library that lives in Typhoon's flake. It would be quite frugal at the beginning, but soon it would go to fit a lot of different forges and various kind of deployments. And the goal would be, of course, to have an ecosystem of actions like we do for GitHub actions, but much better and using Knicks instead of YAML. A few words about how you would code something like this. Of course, you would use Rust to get some like technologies like Actix and Dissol for the back end and a nice web app using Leptos. And so you would start coding and soon you would have a prototype. Soon the prototype would run CI for itself. So it would be time to present the project to the Knicks community at FOSDEM and tell people to try it. You would still want them though. It's still a prototype. Everything you talked about today is maybe not yet fully implemented, but still it's ready for beta and you're waiting for feedback for issues, a lot of issues, maybe a contribution even to the actions library. And all that would be left for you to do is to thank everyone for listening to you.
rix: an R package for reproducible dev environments with Nix
Alright, hello everyone. My name is Bruno. I'm a statistician and data scientist, data janitor, whatever you call it, in Luxembourg. Are there some people that use the R programming language here? Statistics? I will see some of you. Okay, cool. Maybe this will interest you then. So what is R very quickly? So R is this programming language that's been around for 30 years. It's like a floss implementation of S and it's mainly used and mostly used for statistics, machine learning, data science and all that kind of thing. And it comes with all these built-in objects that we like very much when we work with these things, which is data frames, matrices, formulas, models, etc. So that's all built into the language. There is like a little hello world. You can, with the base language, do linear regressions so you can load data frames or CSV files very easily. You have formulas that define like your model very easily and you can do that with the base language. But you can also extend the language with packages and these are really called packages. So you have deeplier, you have tidier, these are very popular packages for data manipulation but there's many others. And this here is like a typical data manipulation pipeline in R. So you start with your data frame and you keep passing functions to that with arguments and you do your aggregations, you do whatever you want. And so we have, as of writing, around 23,000 packages that are available through the two biggest main package sets if you want, CRAN and Bioconductor. I wrote that all are available through NICS packages. I don't think that's fairly accurate. I think not all packages are available but most of them are available. Personally, I've never found a package that wasn't available through NICS packages. So what this means then is that we could use NICS to set up an environment with R, with our packages that we need, etc. and use that to work. But that's not really a thing in the R ecosystem like this per project environment. If you use Python like for data science, very typically you will see people start with a virtual environment with a specific version of Python, specific versions of packages. That's not really a thing. At most what people do or our users do, they do like per project libraries of packages, right? That's a thing. And if you need more, people would typically use Docker and there's been the Rocket project for that that really popularized the use of Docker in the R ecosystem. That being said, I wrote with a colleague called Philipp Bauman. We wrote the Rix package. So Rix is itself an R package which provides this really familiar interface to R users. It's a standard function. You can specify the R version that you want. You can specify the packages that you want. These packages can come from CRAN, can come from Bioconductor. They can come from GitHub as well. If they are hosted on GitHub only, you can set up tech packages without typically a thing that R programs want as well. And system packages, we called it like this. Maybe it's not the best thing, but this would be kind of other tools if you need Git, if you need whatever, you can add it there as well. And you can specify IDEs because for R Studio, which is a popular ID for R programming, there's like a wrapper that needs to be installed as well. So this would take care of that. And it generates that expression that I'm not going to show you, but it's like a Nix expression that will install all of these things. It will look automatically for the right revision. And if you put in Git packages as well, it will also generate for you the hash because there's like a little server that we set up that downloads the package there, computes the hash and then sends it back to the user. You can also use this with Nix function within R. So you could execute any function or any R script inside like a sub shell with a specific version of R. And you could then within your interactive session that you are currently running, you can then get that result back and continue working with it. So this is useful if you are like doing a reproducibility study and you just want to execute one particular function from a paper, for example, and you just want to get that result. So you can do that as well quite transparently. If you're interested, there's this website that you can check out. It's still not released on CRAN or on CRAN, but we are aiming at doing that in a couple of weeks. Thank you for your attention. Thank you.
Preparing a 30 year-long project with Nix and NixOS
Hello everyone, my name is Rémi Nicole, I'm this dude on the internet and I work for the CEA which is the commissariat for atomic energy and alternative energy. But the CEA is quite big so I should say technically that I'm CEA, DRF, IRFU, DISC and all the way at the bottom. What do we do? Well, we do control systems for big physics experiments like particle accelerators. And so what is a particle accelerator? Basically, it's a bunch of hardware. There is a plasma chamber that just exposes protons and then you need to give protons some energy, you need to control them and you need to do some diagnostics. For example, if you want to make the protons turn, you need to have an industrial power supply and an electromagnet. And so you need to control the power supply to control the strength of the magnet. And so we use a framework which is called EPIX, DayLake acronym in this space. So it means experimental physics and industrial control system. And it's quite old software. I'm showing you the old logo because it quite explains well what it does. It's a single protocol which is represented by the line and some clients and servers. So we have, for example, the input-output controller which does the control of the power supply, for example. And we also have some graphical clients, alarm system and some archiver. And so what do we do when you're an EXPAN? Well, you package it with NICS. And so you can see the logo of NICS kind of eating the EPIX logo. And I'm not going to talk too much about that because chances are you don't have a particular accelerator at home, so you won't really need this project. To be fair, someone did use EPIX to control a beer brewing system. Yeah, beer people are weird. And so what in terms of network? So you need a network as isolated as possible so you don't exactly need to do this much update. And usually you don't want to update something. If something works, you don't want to touch it because it takes a lot of money to restart the accelerator. And so what you need usually is a good resilience of the system. You have a lot of assumptions to rethink. And we could ask you, we could be asked to modify some software 10 years after it was in production. And so what I'm going to present is how we use NICS and EXPAN for this kind of resilience. So first things is we use Flake for pinning projects, which is good because anyone can just pick up the project back up and it should compile and work. There are some exceptions that you have when you have such a large time scale. For example, some software might not be available in 10 years. Maybe GitHub went down because Microsoft or something. And what we have as a solution is to do a lot of CI and use our own cache server extensively. And by caching, I mean caching really everything. So usually what you want to cache is the runtime dependencies, but what we want here is we also want to cache every build time dependencies. And so what we should have as a system is that even 10 years after it was deployed, we could modify anything down the stack and we could pick any project back up. We also need to cache fake inputs, which is a bit weird to do. We also need to cache NICS itself because maybe in the future NICS won't evaluate, we will have some deprecation and won't evaluate the old NICS code. And so the system that we have, thank you Maurice for working on this, is that we have a CI server, in our case it's a GitLab CI, which will build our derivation and we also build a build time derivation, which depends on all the build dependencies of the software. And then the CI we call a webhook in the cache server and the cache server will actually pull all of those dependencies. And why do we have a separate cache server is that with this system we can use profiles because over time the cache server will fill up and so we need to figure out what old version of the software we need to clean. Yeah, I have a hopes that NICS can be used for building resilient systems. And yeah, if you're curious, here's some links. And if you want to build time derivation, there's an example code here. Thank you.
Running NLnet on NixOS
Alright, thank you everyone who moved and gets some space. We can now start the next talk. Josse is going to talk about using NixOS at NLNet. Hello everyone. So, yes, my name is Josse van den Noeven. I'm an employee at NLNet. NLNet is a Dutch foundation. And who here has heard of NLNet, by the way? Are there any hands down? Wow, this is amazing. That's very cool. Yeah, so it's an honor to work at NLNet. And this talk is going to be about how we use NixOS there. There were so many hands. NLNet is the organization which here at FOSAT might be known for, you know, spamming stickers everywhere. We have the stand in the K building with so many stickers. And each of these stickers here is a project we have supported. But not all of the projects that we have supported have a sticker because, you know, command line tools might not have a logo all the time. As you can see, NixOS is up there as well, as well as many other projects. I'm wondering who here has ever had funding for NLNet? See, that's less hands. But we have funding for open source projects. So if you have good ideas, if you're part of a community that has these tenacious bugs that nobody is coming around to help fund to fix, or if you have a protocol which has not been implemented in your particular library, or whatever good idea you have, just look on our website what other projects we've been funding and, you know, write your own proposal. Proposals for writing to NLNet are not difficult to do. It's one form. You say who you are, you say what your plan is, what the outcome is, what it's going to cost, or what you think it's going to cost, and then you press send. And every two months, there's a new call. So this is the tagline that we use since this week, actually. We have a PR person now, and she, you know, she says the message should be simple, it should be clear, and it should be to the point, and so we try to, she tried to fit it into one line. We fund those who contribute to the open internet, you know, because that's what it's all about, why are you here at FOSDEM? And, yeah, we're just very happy that we can help there. So what do we mean when we talk about the open internet? Well, we should be able to communicate directly, right? Get rid of big tech, which is in between our communications. No dependencies, no lock-in, just get the source, compile it yourself, and that way we can have a good democracy, we can be independent and not have to be living in fear that some service is going to be taken away from us, because, you know, we can run it ourselves. So self-hosting is a thing that we very much promote, yeah, free software, free society, and this logo here, Next Generation Internet, is, that's the thing that makes me standing here, because that's the fund by the European Union that is providing over 90% of the funding that Anonite is able to give out. We have been given out money for decades now, but we were always very minor operation until the EC decided that, you know, there's so much software in this world, we're running on it, we're depending on it, we should, you know, also be the owner of it and invest in it. So that's what the EC is doing now, and we are one of the facilitators that, you know, seek out the right projects that are to be supported. So we fund open software, hardware, standards, documentation. When you submit a proposal to us, it has to be something that you can deliver, you can, you know, get pushed or you can publish somewhere, not, for example, server maintenance for that or having a meeting for that you have to go elsewhere. We like to, you know, we like to check what the money is being spent on, and that's also what we have to report to the people that give us the money, which we will mostly be doing, so we try to keep the bureaucracy very low. Yeah, self-hosting. So self-hosting, of course, means system administration. Who here likes system administration? 50-50. Yeah. So, yeah, it doesn't always go well with system administration. You're in the basement in some organizations. In the Netherlands, we're only small, so I get to sit with the other people. It's not all that bad. Once a year, you know, you have system administration appreciation day, which is awesome, right, if people remember it, and, you know, if they're not on holiday, because it happens to be in the middle of summer. So, yeah. Not everything's perfect. Okay. How do you use Nixle's in a small organization? That's what this talks about. And in the Netherlands, we're currently tendon police when I started, we were four, so we're growing. Also, when we started, we were running a bunch of different systems with backups sometimes, no commits of the configuration, so no history of what was running. It was a mail, for example, was running in a BSD system with set-afers, so it had snapshots, so that was pretty good. And our requirements are really not that crazy. We need mail, website, telephone, you would think. But then if you drill down, there's quite a lot of stuff that you need to keep running, actually. So here's what we have that is open source and which is not free and open source software. So a website, obviously, it's run by EngineX. Our email server is self-hosted, mailing lists. We have our own code forge. Well, what makes us tick? Our grant management system. That is running using open source components and chat, video, micro-blogging since a short while. We are also hosting it ourselves. But not everything. For example, our router, which we could do, of course. We haven't gotten around to that. Printer, open hardware for printers. That's not worth it right now. We have some people using Apple devices, so it's not completely open there either. Biases and chips. I mean, we support people designing chips. We're not yet at the stage that we can also dog food those. But we have quite some components that we do ourselves. So when we choose a system to get rid of the whole collection that we had before, what options are there? Well, there's Nixos, there's Geeks. We could go to a closed cloud, but obviously that would be very bad for our image. Or we could go to an open cloud hoster, which there are more and more of those now. But we said, well, we are funding projects. Projects are sending us that code. It would be great if we could also try to keep our knowledge about all these systems up. So let's try to do it all ourselves. And then Nixos has quite a lot of advantages, also some disadvantages. But the declarative part is, yeah, it takes some getting used to. But it's really useful, right? It's just nice static files. It's mostly reproducible, and mostly it's mean 99.99% for the stuff that we use at least. Extremely many packages, as you've seen in the talks just before now. You can mix versions of stuff. I'll show you a bit later how we actually need to do that. The Nix language, well, there's a lot of discussion always about it. But personally, I really like it. So you have to get it. But then it's great. So it's familiar to us because before we decided to switch all the systems to it, we were already using it on our laptops. So that's a bias over there. The Flake lock is very important to us because we can lock down the dependencies and we can be sure that whenever we update, it's a conscious choice to do so. Propriety packages are packaged, but they're disabled by default. So we don't have to worry that by accident we are starting to depend on closed software. Yeah, there are some downsides as well from our perspective. So the community is organized on a proprietary system. A lot of open source projects these days are. And we really promote self-hosting. So if a project is self-hosting, that's a plus in our book. Another thing, not everything is as polished as it could be. I'll show you that we are using an officially unstable feature. So yeah, and there's no storage handling. And what that means, I'll get back to that as well. So there are a lot of green flags there. Full disclosure, Nixos is a partner with us. When people get funded at an Lnet, they also get services. So they get free packaging and Nixos is providing that. So we are a bit prejudiced when choosing Nixos. For me, Nixos, I've been using it a long time, but I always find it very difficult to write the packages until one day I had to explain to a colleague of mine how these files work. And I was sitting there and suddenly it clicked that yeah, everything is a function. I mean, it's called a purely functional package, but still somehow it didn't click. But then I had to explain it to him what are these brackets at the top with the columns. And yeah, that's the arguments to the function. And the rest of the file is what comes out of the function. There are many Nixos developers who are thinking, wow, this is a newbie here. But I feel a bit embarrassed to say it, but once that clicks, it's really a very nice system because it's like JSONnet or Haskell, other functional languages. It's very predictable in what it does once you get it to do what you want. So is it just Nix? Is that enough? How do you deploy it on many systems? So there was a talk by Sir Leanne Rappen a few years ago on all the possible options that there are to deploy Nixos to a number of systems. So it's a whole list here, and in her talk she explained what the pros and cons of each of these systems were, and that was very helpful to us. So that's why I wanted to highlight it here. That was really amazing work that she did. And in the end, what we chose is to keep it simple and do everything with Nixos rebuild. And that's the basic command that everybody's using when you're using Nixos. And it turns out you can just manage your service with that. So all of our systems are defined in one Git repository. They're all defined in one nix.flake.nix file. Each machine has a configuration nix, hard configuration.nix, but there are a lot of placeholders there for stuff that we import from another directory where most of the services are configured. And we try to keep the simple readable for everyone. We use a JSON file that has sort of the structure of our setup in there, and then that's imported and readable as variables further on in the system. So if you do a nixflake show and the flakes are the not yet completely stable part of Nix that we are using, then you will see Nix configurations has five servers in our case. And what we do to deploy that is we type nixos rebuild, switch, and then we say here's the flake for the server, and it should go to that server. So that's how our deployment system works. It's just built into nix. And this is our machine's JSON, so it tells us what should be the IP number for the different machines, what name servers should it talk to, where are the secrets. Secrets management is really done by rzink by us, so we just, when the machine reboots, we don't store the secrets in the nix store, we just copy them into the slash run directory with rzink. And yeah, here's the flake. So we are mixing an old version of nix packages because we haven't completely switched that, I'll explain later why, with nixos, the current nixos, I mean you can just do that, you can put it together, and so these are the inputs, and then here is the function that defines the outputs where these things are coming in. And this is then a very simplified version of how we define each of our machines. So we have a function called makeSystem which takes the hostname and the definition, and we define our systems by looping that function over all the machine definitions. It's a bit more complicated because it has to know which inputs to use on which machines, but this is sort of the magic that makes us able to just use nixos rebuild. Now when you're setting up your system, this is the thing I think is most important, is the alerts. The computer has to do stuff automatically for you, and you would like to make sure that it continues to do so even while you're sleeping or while you're giving a talk at FOSDAN. So I'm very happy that this box in my mail folder has not had any unread messages for a very long time now, so that's great. Our alert board is green very much of the time. We have a very particular alert here which is called the nixos flake committed. So if somebody doesn't deploy without committing first, it gets read because then it's undocumented what our system is doing. This was a zoom in, but I think it was good enough to read. Yeah, so backups, that's the second most important thing for your system. We use Borg for backups and Butterbackup, to do snapshots every hour, and here's a small point of critique for nixos or actually a feature which is not really there at the moment. When you do anything with software and it also needs data, you have to say where the data is and everything is declared in nix, except the folders have to be written by hand or they're set by defaults in the services. Doing backups, there's no enforcement that there is a backup or an easy way to do the backup. In the setup of your backup system, you have to repeat all the directories again or you define them at the top level and then you use the variables for those directories at the top level. This is a thing that might be a bit more polished, it's an opportunity for a new system, a new extension. So mail, who here is hosting their own mail? Wow, that's not enough. We need more people hosting their own mail. It's so important, it's still the backbone of all your communication, it's email. We really want to self-host, we were self-hosting, so when we're setting up a new system, it would feel like a defeat to stop doing that, so we continue doing that. And nixos has a project which is called Simple Nixos Mail Server, which ties together Dovcott Postfix, LDAP and Rspundi together, it didn't use to tie up LDAP, but we needed that, so we paid a contractor to add this support and upstream it. So that's what we're using right now. However, we're announced, so we're funding a lot of projects. We're also funding Stahl words, that's simpler, all included Rust implementation of a mail server, and we're also supporting Mox, that's a go implementation of a mail server. And we're soon going to try out Stahl words on a less important mail domain of ours. Yeah, and then you get these wonderful 100% scores. If you fill around long enough, well, actually we didn't have to fill that long because the Simple Nixos Mail Server really configures your mail properly, and this wonderful website, internet.nl, is what you can use to check if actually your mail server is configured correctly. One highlight of Nixos that we really value is the testing. So to test two computers working together is made very easy in Nixos because there are Python scripts that you can call and you set up both computers and you tell them how to talk to each other and what the expected outcome is, and all of these scripts, many of these scripts are just part of Nix packages. So you can read how these systems, how this testing is done and for your own setup you can also write those scripts, and that's great. And we run that in CI via Flake checks. Well, sometimes something can go wrong. You don't have to be a genius to see what's going wrong here. We are sending the configuration of server one to server two, and this is where the system that we saw earlier comes in handy, how to fix your booting because this really killed deployment one time. So when I say we keep our system simple and we try not to build on top of stuff, here we decided that it would be a good idea to make a small alias script that only takes one argument so you don't confuse the two servers with each other anymore. We recovered from this in five minutes so it wasn't that bad, but I did get a big fright. How do we do updates? I'm just putting this command here. It's not that interesting, but I just want to have it documented somewhere because it's a bit long, but we have a number of inputs to our Flakes, and if you want to update just one input, which is often something that we need to do, for example, when one of our software packages updates that we write ourselves, then you can do only that Flake input with this command. So conclusions. We like to keep it simple. We just use the basic tools of NixOS, and we put most of the configuration, or we try to move as much as we can into JSON files so that it's easier to read. So technically, NixOS is really great for an alat. However, for the average office, it's probably quite complicated to do this. So I think there's an opportunity here for open cloud providers to use a system like this and make it more user-friendly. And in fact, there is a project currently called NGI Fidiversity where the EU is funding us to help create a new hosting stack that will be using Nix. And that has just started. We have the planning phase for this. So if you're interested, look it up. There will be this, or talk to that guy over there. That's a raving, but this will be probably a talk next year for them. And with that, I'm done, and I'm open for questions or tips because there are many people who are more expert than I am. Thank you. Do we have any questions? Hello. Thank you very much. I'm just wondering, you said that you are thinking your secrets to run directory, like why you are not using like, Agenix or SOPS for that, which will do it for you and you don't need to do it manually. So the reason we're not doing that is there were so many options which made me confused. And then also some of them were putting the keys encrypted, but nevertheless in the Nix store. And I just felt more comfortable doing it with Arsene. That's the whole explanation for it. Hi. You said something about Nix not being aware of storage locations. I didn't really understand that. Could you kind of explain that a bit more of what's... Yes, so Nix defines where all of your software is coming from and how to compile it, and it puts it all in the Nix store. But of course the software is interacting with data. And there's no sort of, you know, a type or a class which defines where the storage is. So you could say, I'm doing backup now. Just backup all of my systems. Or if you pass a directory into a service, that directory is an object which has been defined elsewhere as it needs to be in the file system. It needs to be... the file system needs to have, you know, a type of file system. It needs to be mounted. All of those things are something that you have to take care of. And because Nix is declarative, you know, once you hammer it down, it's fine. But it would be great if at compile time you get an error for that. Any other question? There's a question in the back. I just wanted to react to the storage location thing because it's interesting. So in NixOS there is a problem which is you want to declare things, you want to be declarative. But when you deploy a software, software often comes with automatic migrations. So they proceed to the operation on your state, on your files, at every new deployment. And this breaks the rollback system. Because if you rollback to previous version, you don't rollback the data. You just rollback the configuration. And what could be done here is that the NixOS modules themselves could learn about where the state is. What does it mean to back up an application? What is dependency in the PostgreSQL database or whatever. And it would start to provide a solution for the problem you're mentioning. Yes, exactly. Databases are a whole extra level of complexity and possible, you know, data corruption. I think we can take one last question. Hi, thanks for the talk. I might have missed it because I joined a little bit later. But in the configuration, do you have a way that you are happy to pass secrets? Yes, so the way we pass secrets is we have a top level JSON file. And there we declare all our secrets. So for root, it needs these secrets. So wire guard needs a private key, the mail needs a password. So these are files that have to be under slash run slash root. And when the Nix evaluates it, they are stored. Where do they get? So Nix doesn't do anything with that. It just writes in the configuration where that file is supposed to be. And then when the machine starts, some services will say, hey, I'm missing my password. So I copy in the password in there and then I restart that service. But that doesn't happen very often. We are fairly small office. So we don't have 100 machines. So automating it more seemed like complexity and overkill for our situation. Okay. Thanks. Thank you very much.
Clevis/Tang: unattended boot of an encrypted NixOS system
All right. We are now going to get ready for the last talk of the NixOS Dev Room this year. We have on stage Julien and Kami, who are going to talk about an attended encrypted file system boots on NixOS, which everyone who encrypts their hard drives knows about that program of not being able to reboot remotely. So give them a round of applause. Thank you very much. So I'm Julien Malka and this is Kami Mondeau and we are both old students of Ecole Normale Superior and we did this during our studies a few months ago. So we are going to talk about Clevis and Tong on NixOS and I have to say and to thank NLNet for funding this new feature in NixOS. So the plan is we are going to motivate why it's interesting to have full disk encryption and remote servers. We are going to present Clevis, which is the automated encryption framework, given EAPDES view of the Tong protocol and then talk about our implementation and show you a demo. So full disk encryption of remote servers, of course, it looks like a good idea, like critical data or even non-critical data should be protected by full disk encryption nowadays. Problem is often more difficult to do it on servers because when you have full disk encryption you need to basically input your passphrase at boot and this requires physical intervention so it's a bit painful to reboot servers. One solution that is often used is spawning a SH server in Itardi. This way you can instead of having physical intervention you can just SH to your server and input your key and unlock your root partition and continue booting. But this still requires synchrony new intervention so if your server reboots for some reason that it was not planned and you are not awake or doing anything else your server actually doesn't reboot. So that's why I think this kind of solution is interesting. Okay, so before Julian dives in the Macalum-Rallye protocol we are going to present the Clevis automated encryption framework. So it's a project developed by the team Latchset which is basically just a big set of batch scripts wrapping around Jose which is the core encryption system and it's a pluggable framework for automated decryption. So there's the notion of PINs which is central is a plugin that implement automated decryption and to encrypt some data with Clevis, well you just have to do Clevis encrypt and then you write the PIN you want to use, the config, your plain text message and then you get a ciphered text in the format GWE. So the first kind of PIN you can use is TPM 2.0 PIN. So I don't know exactly how works the encryption using TPM but basically the, well Clevis first Jose first generates a key and then this key is used to encrypt the message and then this key is itself encrypted and using the key generated by the TPM and it will be decrypted the same way when Clevis needs to decrypt the message. So that's very useful if you require TPM 2.0 and not 1.2 for that to work and then the most useful kind of PIN is the TANG PIN which was designed by the same team of developers and so it's a server implementation providing services without the need for an X-Crow. So basically you do Clevis encrypt TANG and you precise the URL of your TANG server and this will encrypt your message using the protocol that Julien will describe to you later. And then there's Shamir secret sharing which is a combination of the, it's a way to break down a secret into multiple pieces and so there's the notion of threshold here it's T and its value is 2 and so you basically make a combination in your configuration of multiple PINs, the previous examples being TPM and TANG and here you see there's three PINs, one TPM and two TANG servers and the threshold is 2 so you have to have two working PINs in order to decrypt your secrets. Obviously at the encryption time you need a whole three PINs to be up. So Julien was explaining. Thank you Kami. I'm going to try to explain a bit how the TANG protocol works which might be, I'll try to explain it as simple as possible but this is extra extra for you. So basically the TANG protocol looks like the Diffielman key exchange. The Diffielman key exchange is something used to get a shared symmetric key between two actors on an unsecured channel and the idea is not that complicated. The idea is that you need to imagine that we have a mathematical operation star that is really, that is easy to compute so if I get two numbers A and B I can compute A star B but on the other hand super difficult to reverse so if I get C which is A star B it's very difficult to get A and B. And the idea is that we have also J which is a public parameter and the idea is that we have a server and the client that each generates a secret SNC and then they send their secret star J and then they both on their sides multiply by whatever they receive from the other side. So at the end on both sides they get C, JS and SJC which is the same thing so they get a shared key that they can use to encrypt their messages and talk with each other but somebody listening here cannot do the separation of these operations so it doesn't get any information. If the math is a little bit too abstract you can imagine this with paint. We have J which is the common paint then you have secret paints on each side they do the mixing of paints and they get each a color and from this color it's very difficult to find which paints were used to do the mixing then they send both their paints and add their secret to whatever they receive and they get a shared color which is like an analogy for the shared key. This is the if the Diffie-Eleman protocol used on the internet to devise a shared symmetric key to discuss with somebody and the Tongue protocol is derived from this so there is two sides of this. The first side is like the provisioning so it looks like Diffie-Eleman, they both generate the secret, the server sends J times its secret and the client sends nothing. The client computes what should be the shared key but doesn't send its secret to the server and so he divides the key, encodes whatever message he wants to encode with this key and then throws the key away. So at this point the client has an encrypted message, ciphertext but doesn't know how to decrypt it because it throws the key away. When he wants to decrypt it will do this. So it will generate a new secret E and send to the server J E plus J C. The server will take this and multiply by its secret and send it back to the client. At this point the client does this mathematical operation so whatever received minus J S E and this gives it another key that it used to decode the message. And here you have the math of why this works. So if we go Y minus J S E is X S minus J S E. X was J E plus J C so J E plus J C S and if you compute this it gives you J E S plus J C S minus J S E which gives J C S which is K. So with this manipulation it can find the original key again and decrypt the message. So what does it give us? So let's say an attacker gain access to the client. As I said at this point the client discarded the key only as a ciphertext cannot decrypt so you gain access to the client cannot do anything. Take access to the tongue server. Tongue server doesn't receive any information at no point in this protocol so you have access of the tongue server you cannot get the secret. And if you intercept all messages here and here because of what I said like the Diffiel Man assumption it is very hard to reverse these operations you also get no information. So basically the only way for an attacker to have the secret would be to get access of the client and the server at the same time or to be on the local network where you have your tongue server. That's the principle of how it works. Let's talk about the Nixxas implementation now. Okay thank you Julien for the theory. Now we're going to dive in the Nixxas implementation. You'll see that it's really simple. So if you have your Nixxas laptop with you you can do it live. If you don't well maybe install Nixxas. And so first you have to deploy your tongue module. So it was added by a Jeff Proch. Who is here by the way? And you can enable it with simply with services.tongue that's enabled. Then there's only one parameter really important is the IP address is allow which is critical because it defines the subnet that you trust because every machine on this subnet will have access to your tongue server. And if someone gets your secret.jw using only tongue well it can decipher it. So you put a network that you trust. Then you don't forget to open the firewall as I did a lot of time. So you can change the port. This is the default port in the Nixxas module but you can put it on port 80 if you want. So now you have your tongue server. Then the Clevis module it was packaged inside the init.rd already existed as a Clevis as an Insignix packages. So basically it works in init.rd both system dstage1 and scripted init.rd. So what it does is before decrypting the root partition well it tries to do Clevis decrypt with the secret.jwe which was put inside your init.rd secret. And if it succeeds then it pipes the value to decrypt the root partition and if it fails it falls back and interactive unlocking. So how now how to use it? First you have to generate the secret. So it's really easy you just type in your secret and you pipe it. Be careful because if you have a you will most certainly not have a return carriage inside your secret so you have to add a dash n when you pipe to Clevis. And then you write your Clevis configuration as we showed earlier. Then when you have your secret you can put it in your etc. in xy folder or in your flake and then you write this line which is booted in the Clevis devices. Then you put here the name of the device inside your file system dot slash root partition configuration. So it has to be the same name. And then well this works only if your root partition is encrypted using one of these three methods. The client has also a tpm 2.0 device and what we require. So first you type in your secret in a secret way so it goes to the secret environment variable. Then we state the config that we wrote previously in JSON format for Clevis. So you can see a threshold of 2 for a quick but just. Then you can pipe your secret to Clevis and creep the secret sharing and with the configuration that you use. Snow in post. So this means that the tongue servers have been reached so that's why you actually have to choose the sign in keys. Both tongue servers. So now you have your secret. You can try and decrypt it. Your secret works. And now you have to modify your configuration. So first thing because we're using the tpm you have to add the canal module. So I added both but I think you can probably just add one if you know which one. Then you add the boot in the configuration. So it has to map. So it had to match the file system configuration. Then you can rebuild. And while we're rebuild we can turn off one of the tongue sockets. So the tongue one server. And yes because we defined in the threshold that just one of two is required. Then we reboot and we see that it's not a reachable error communicating but it's still boots very quickly. Then it's locked out partition right. So it works but then you turn off the second tongue server just to check that. And then you can see that there are two errors for both tongue servers. And so there's an error reading the passphrase. And now you have to type in your password so if you didn't set up an open SSH server while you're basically you have to reboot your machine from a. Okay. So what's left to do? Well you can add more pins to match your needs. So one useful one would be Yubiki. And then if you have like an exotic encryption solution maybe you can try and use them with Clevis for instance Veracript or I don't know whatever. So there's many other solutions to use. So feel free to contribute and thank you very much for your attention. We don't have a mic so shout the question and repeat it if you have a question. Question was in case where the tongue servers are down can you still SSH to input your password or are you just out of luck? And the answer is I expected to work correctly with the existing SSH server and it's already featured so you have to configure it to do that but it's possible. There is none apart from the fact that Clevis is just fancy wrappers around like the same calls that the system the crypton world features are doing to the TPM. I don't know which one was first. So yes the question was can you put the maybe the configuration. The question was about the configuration and so the threshold was set to two but we asked for the TPM to be live to be available and only one of the two tongue servers so it was very quick but there's actually a combination of pins so there's first the TPM then inside the SSS configuration there's another SSS pin with the threshold of one so this threshold applied to the TPM and the SSS with two tongues and the threshold of one so I guess it's okay. Can you repeat it? Yes how does encryption key management work with the FS as it supports only one passphrase. The answer to this it's the same passphrase that you encode in a Clevis secret so what it does it will first try to decrypt the Clevis secret and if the decryption succeeds then it will use this key as the encryption for the FS otherwise it will ask you to input it like interactively. So I guess the answer to this question is currently there is only one key phrase that is either in your brain or encoded in a Clevis secret. Yes. Do you know of any possibility to have an encrypted kernel and init.org so like a bootloader we use the TPM to decrypt the kernel and init.org? I do not know about this no I don't I mean as you described it it maybe it's possible but I do not know of any implementation of this kind of things. I guess you need something to decrypt your kernel anyway if you're going to have encrypted kernel what is going to decrypt it? Yes okay then systemdboot becomes the kernel I don't know. I think that's pushing it a little bit why would you think you want your kernel to be secret it's like if you're thinking about secrets you want to put in init.org that's something different and we have so currently in XS we have an encrypted init.org secrets but there is also a new features coming into systemdboot and then learn something. So we have an encrypted system init.org secrets encrypted via the TPM and systemdcrypt what's the name? Credential systemd credential sorry that's maybe what's more what you want. Yes. Yeah I agree so the remark was it's sometimes good not to have like a kernel unencrypted because then an attacker reading your boot partition would know exactly which version of the kernel you're using and maybe target some specific vulnerabilities. Thank you for the remark. Okay so the remark was you can use a Kexec to load basically any other kernel that you might have decrypted from the first kernel so that you can still have some kind of encrypted kernel. Thank you. Yes. Should I use the same time server for multiple hosts or set up a different host or set up a different time server for other because any host could decrypt I guess. Well as long as your tank server is in a secure network that you control the access and you have exclusive access or that you have also set up a TPM on your servers but as long as your network is secure I'd say that the protocol implies that Tanks servers have not access to anything except their own signing key so you can use as many that's what's cool about the Macalum real real protocol is that you can use as many clients as you want the message doesn't even leave the clients when it's encrypted and the key using for encryption also doesn't leave the client so well they have the key. Okay so the question was if malicious host reaches your private subnet and gets access to what the client or the. Well if they get access to the GW secret yes well then they can if they have access to the tank server and the GW then they can decrypt so that's why you have to either well put your tank server on something that you control very much or well encrypt your GW token when you put it on your flake or whatever it should not be clear. Okay so the question was what happens if your tank server is down is down during the reboot of the server. Well as you saw well depending on your configuration if you have a lower threshold well it would print an error communicating with the tank server first but then it could depending on your configuration still boot or not if you only have set up one tank server it won't boot and then fall back to interactive unlocking. And if I may add something you may like configure a system so that it after like it times out trying to unlock your your repartition after some while and reboots and then start again and if it was a transient failure of tank server then you might be saved by that. We can do one short question. So there we have some documentation on the next menu. This is sorry this is a merge and available in the next and stable right now it's not in 23.11 and you have you have some documentation in the unstable manual. Thank you. So. Thank you.
Welcome to the Open Hardware devroom
Okay, so it is 9 o'clock, which means that it is time for us to start the open hardware in CAD-CAM Dev Room. First of all, I want to say welcome to the Dev Room. We've had a long run at Fosdame where we had about eight years in a row and then we missed one last year. But with a little bit of luck and a little bit of help from our friends, we're back again this year and hopefully we'll continue on. So we've had two years without updates on what we are doing in the hardware community, what we're doing in the open hardware community, and what we're doing in the hardware development community. And today we're going to have a series of talks that really delve into some of the most interesting aspects of creating hardware and building things for your use to make things easier, to make things better for us and better for the world. So I'm excited to listen to what each of our speakers here have to say. Just a couple of administrative things. As you come into the room, we're pretty light here at the beginning because no one wants to listen to the intro talk. But as you come into the room, do try to move to the center. That will give people who come in late the ability to have a seat without standing in front of you. And other than that, we are scheduled on a back to back basis, which means that our speakers will be taking questions while the next speaker sets up their laptop. And you saw that I was just kind of fiddling with my laptop here to get the resolution right. We'll have five minutes of change over time, so you get five minutes to kind of set up your laptop. Taking that five minutes while you're setting up your laptop for the next talk, we're going to have questions for the previous speaker. We have a microphone that we're going to pass around for questions. Do try to either wait for the microphone or if you are the speaker and someone asks a question without the microphone, repeat the question back. Repeat the question back. You'll record it on the video stream. That will allow our audience to get into the... Yes, that will allow our audience online to be able to see the questions. So we have our helpful placards here at the front. They will give you 15 minutes, 10 minutes, 5 minutes, and then times up. So when times up, that means that you are in the five minute change over period. So the next person is going to come up and begin setting up, and you should ideally already be taking questions. Now the other placards that you have, you'll either say, speak closer to the microphone. As you can see, this microphone is picking up things pretty well and pretty far away from the microphone. But if you whisper and you're a very quiet speaker, you might need to talk closer to the microphone. There's a sign for that. Other than that, there is also a reminder. Can you please repeat the question? So if you see the sign, please repeat the question. Just stop. Repeat the question. Double check that that's what you're answering. Get it recorded and then continue on. So with that in mind, I would like to welcome up our first speaker for the day. We are going to have a quick introduction, not so quick maybe, to building information management from Tomas Kirchen. No, not.
Multi-disciplinary geometry (libraries) in BIM and the IfcOpenShell software library
All right. Thanks so much for being here and thanks so much for having me. So this is not really a hardware topic per se yet, unless buildings qualify as hardware too. But let's talk a bit about them. So we have exchanged buildings for a long time. So these are just like sets of line drawings without really computer readable semantics associated to them. I don't know, maybe there's a parallel to how the PCB community exchange models. But I don't know, I would say for the last 10 years or so, we're more and more exchanging buildings as rich, semantic data models, where the models we exchange have also a meaning that computers can relate to. And so in 2011 I started IFC OpenShell, which is a software library for dealing with these kind of models. IFC is the open standard. It's called the industry foundation classes, which is kind of a meaningless name. But that's where the name comes from. IFC OpenShell is one of the geometrical forms you can use to exchange representations of your building elements. IFC is also very much inspired by STEP, which is probably a standard familiar to most of you. So if you're familiar with STEP, you know, OpenShell. And in IFC we just prefixed everything with IFC. And then right there is some point, you see that there is, well, I wouldn't call it a spike. It's really like, I don't know, a mountain ridge of new contributors that came on board and that's with the Blender Boom add-on. So in my work I'm mostly focused on analysis of these kind of models. And then, yeah, Dion Mold came on board and he started actually writing an authoring tool on top of IFC OpenShell that allowed people to really graphically create models. And, well, you can see the effect of that in terms of contributors. So we have quite a bit of modules, but in essence we have a parsing library, geometry, geometry interpretation, so that we use predominantly open cascade, hopefully familiar to some of the people in the room, to read these kind of geometry definitions and translate them to BRAPs and then interact with them in a bunch of ways, convert them to desolated formats or, yeah, some other things. So I started my module mostly in C++. The Python has been hovering around zero for quite some years. And it basically had to do with the ecosystem, so open cascade is C++. Because, yeah, it doesn't leave you much of a choice. But quite soon on we realized that if you really want a wider movement, a richer ecosystem connecting different modules, but also the BIM world is rather academic, so we have quite a bit of students, or software developers doing rapid prototyping. So it really made a lot of sense to have an interpreted language with a more accessible syntax. Yeah, so also quite early on we started having Python bindings. And then, yeah, you see the same spike here basically in terms of contributors. That Blender BIM add-on is built on top of Blender, you might have guessed. And Blender is also obviously a wonderful piece of software, very extensible. And also, yeah, the client-side code is mostly in Python. But it's not only that, it's also really a higher-level API that was being built on top of a low-level, let's say instance manipulation we had. So, yeah, I see it's really an extensive schema. It's roughly a thousand to a two thousand classes or data types. And if you really want to interact with that in a meaningful way, you need to operate on a bit of a higher level. And that's also where the steep increase in Python code is coming from. But what my topic for today is really the geometrical challenges that I encounter. Because our industry is really multidisciplinary. So, for example, what we exchange as building models are really detailed decompositions of the building into very specific elements. So, one wall, one ceiling plate, and all the kind of supporting structure. But if you look at the building code people, this is from the city of New York, for example, they want to ask questions on much higher abstract levels such as the facade. So, this is, well, it's New York. So, they envision there is something like a base on top of which a tower is built. And there are requirements about the proportion regarding base and tower. But none of these things exist in bin models. We don't even have a facade, we just have this bag of elements. But we have other, let's say, different perspectives on geometry. If you want to do thermal simulation. If we are here in this room generating a lot of heat and that heat dissipates to neighboring spaces. And then you're also not really interested in all the detailed elements that make up your building. You just want to have basically a graph of spaces and thermal interfaces between them. And there are ways to exchange this information as part of IFC building models. But the generation of them is still rather buggy. And that's also the challenge, of course. Does every authoring tool need to implement this kind of generation of data? Or are we going to opt for a more collaborative ecosystem of tools? Where you just generate a building model and there is other tools to enrich those models. And that is what I'm really hoping for, but it's not what we're currently seeing in the industry yet. Geospatial people, they really want to focus on the things you can see. They come from GIS, so they can only observe what is actually visible to them. So they don't really want to deal with this kind of invisible surfaces. They just want to have manifold representations of interior or exterior shells. But so this is what we have as a building model. And as a summary, what we want to generate, for example, is this representation of just a facade that joins across these walls. And it could be thousands of wall elements. The further you are in your development process of the building, the more you're going to decompose these things into the actual physical things that are going to exist in reality. But yeah, the data comes from heterogeneous sources. I really want to advocate for an ecosystem because we already have an open standard. Yeah, let's have a more collaborative ecosystem where we can augment this data. Here, the interior, where you really want to know, for example, will I bump my head here? But in normal models, these kind of representations don't exist. You don't have this higher level representation of the interior. You might have a description of the space, but it wouldn't be enriched with all the geometries that further eat out of that volume. And of course, none of this is ultimately very precise. So there is gaps either accumulated due to floating point rounding errors or manual sloppiness or also on purpose. If you're building a building, you have to accommodate for the fact that the walls, especially this metal, they are expanding and contracting depending on the temperature. So there are also actual gaps between these elements in reality in more detailed models. So maybe that representation that I just showed of the facade doesn't in reality really exist as a volume. And then, so how to solve that? Naively, you would maybe think, yeah, let's just Boolean union these things together and call it a day. But yeah, that's quite a challenge in terms of performance. But you also have to make a choice there. Are you going to rely on these kind of fuzzy Boolean operations that allow for a certain imprecision or still join these disjoint volumes, even if there is a nanometer or millimeter gap between them? Or that's the open cascade paradigm, for example. Or are you going to rely on these kind of exact Boolean operations that, for example, CGAL offers? CGAL has a very interesting number type, I think, where a number is just not just a number. It's basically a binary tree of all the operations that were used to construct a number. And as such, it is really arbitrarily precise. There is never any rounding occurring. But it obviously has a monumental performance impact. And it's also not maybe necessarily what you want, because you want to join across these kind of imprecision issues. So earlier attempts, yeah, they made the performance problem even more extreme by using a Minkowski sum. So you have kind of a padding volume that you apply to every element and enlarge it slightly, and then union them together and then shrink. But this is not really feasible on detailed models. So what I try to do for this particular problem, and I hope this is going to end up in... Yeah, I'm showing mostly experiments, but I hope that this will really at some point also be part of, let's say, the core parts of my software library. But what I'm doing here, for example, is decomposing these solid volumes into trees of half-spaces. And then the neighboring half-spaces average these out. So the two faces here, here and here, of these two disjoint walls, yeah, they are merged or aligned. And the same here and the same here. So this really allows for some sort of local adjustment, so that you're really sure that things snap into place. And also, it really causes very neat models because there is almost no intermediate vertices because all the nearly coplanar surfaces were exactly aligned. But this is still a challenge to make this work on the really detailed models. Another example where I used half-spaces is if, yeah, for example, the facility management people, they also operate on a much higher level. They don't care about every rentable unit. They want to have aggregates of those. How many square meters of rentable space do I have across these models? And these interior partitions, you don't really care about them because tenants can remove them anyway. So you include them in your square meter counts, which means that you basically need to take this volume and extend it to this volume for them to touch and then union these together. So for that, I used Sparkle, RDF Lib in Python. I built, this is what you see here, is a graph of spaces, the half-spaces bounding the spaces. These are touching the faces of the wall. And in the wall, you have an opposite. And so we really form a graph. And then I query that graph based also on sabantics. And of course, only the non-load-bearing walls that I can aggregate over. And, yeah, as such, this is this kind of patchwork blanket, is all the individual spaces that are in a model like this. And then what the facility management people want to know is this. So this is all inhabitable spaces, I think. So without utilities are these kind of things. But, yeah, there are large performance problems still with these kind of approaches. These arbitrarily precise operations in Seagull, it's really immensely robust. The first I really come from, let's say, I've 10 years of working experience with Obakas Cade. So you come to lower your expectations a bit in terms of what works, how many crashes did you encounter when you load complex models. And then in Seagull, everything just works. It's not always what you want, but that is then your own fault, typically. But, yeah, still really computationally intensive. So as a side project, I've also written my own foxalization library. Because, yeah, especially these kind of challenges, superimposing a lot of elements into the same domain, closing certain minimal gaps, yeah, that's really what foxalization is perfect for. And, yeah, this, so maybe if you would want to union all these kind of building elements in, yeah, in Obakas Cade, it just wouldn't work. It's not robust enough in Seagull. It will take a considerable amount of time. And in foxalization, it's really just, yeah, I wouldn't call it instant. You have to deal with a different set of challenges. Suddenly, the complexity is no longer based on, let's say, vertex and face counts, but it's really based on, yeah, actual, the physical dimensions of your building. So if you're building a larger building, your computation takes longer. But it's still, yeah, it does perform better than Seagull, I would say. Yeah, and then you can do topological queries on those again. So, yeah, this is a very famous testing model that we use in our industry, but I expect it's a little bit cryptic for you to read. But let's say this is exterior space, and here is a door, here is a door, here is a door, here is a door. Here you see a little bit of a stairway. So it's really three-dimensional, but I've kind of folded every 3D volume. I flatten it over the Z-axis to kind of get a 3D volumetric, a foxal grid of just, yeah, let's say, the mast that we can walk on. Yeah, and then you can just do topological queries again on those to see how long is the evacuation distance from a particular point in time. And I'm not saying it's not possible on, let's say, regular polyhedra or B-Rab's, people have been doing that. Yeah, but I came to really appreciate how trivial those kind of operations are on foxal grids. Same for the headroom, basically. It's the same kind of idea where I start from the 3D volumetric space interior that we can breathe in. I flatten that over the Z-axis to just the little surface where we can stand on. But I remember how many foxals I flattened downwards, and based on some sort of color coding or threshold, that's either sufficient or not. So, yeah, you can see here under the stair that there is obviously a little bit less space to stand. So for this, these kind of, yeah, when we started this project, we envisioned that end users would be writing their own kind of analysis scripts. So here you see visually all the operations that were involved in one of these, in one of these, I don't remember which one, but one of these computation graphs to union those foxals, subtract a bunch of things through these traversals, to really figure out the space where we can stand. We also do some sort of padding so that we don't start walking in like two centimeter areas. We kind of assume that we have a little bit of a body, so all the obstacles are dilated a bit. Some of these things are also element-specific, so we do specific things with the railings. I don't remember the details, but we really created our own little scripting library for these kind of things. I don't remember why we just didn't create Python bindings. That seems like it's easier than really creating your own language with your own interpreter, but at that time that's what I did for some reason. Obviously nobody has ever tried to create their own little analysis script to do these kind of things, but as you can imagine it also requires some documentation and stuff that we also didn't provide, but it was really fun to work on. What I see as the advantages of this kind of foxalization, you can also associate numbers to yourselves. What you see in the headroom analysis where you can stand and the evacuation analysis, you can really associate a number with every cell. It's also in a uniform way across every dimension. That's also a bit harder on poly-heat writing. Bullying operations are really just that, bullying operations. If you superimpose two cells, a one and a one becomes a one, a one and a zero becomes a one. That was really, if I can implement bullying operations, then it's trivial. It's efficient to calculate distances also, and we close these gaps. I think for our built environment sector it's really quite a good match for some analysis. I've seen OpenShell going back to where it all started for me. It's quite an extensive software library. It has all the different revisions of the IFC schema. It has these geometry mapping functions. There's about, let's say, 200 classes in IFC that somehow affect geometry or representation. They have an implication on our B-Wrap conversion to OpenFascade. They have these conversion functions. Then we rely on OpenFascade. It's gotten quite a large code base. In the new version, I have been playing, well, if it's been happening for various years now, so I shouldn't call it playing anymore, but let's say working on the idea to support multiple geometry libraries. The robustness of OpenFascade has really improved dramatically over the past 12 years. It's really a usable, very powerful software library now. But there are still cases, if there are issues being reported on my get up, where I have to say, yeah, this takes me two months of investigation. I'm sorry, I just cannot help you with this. In that sense, I have some hope. If I have a secondary geometry implementation in Seagal, that I can really provide the best of both worlds, also to people that want to do these kind of analysis that I just showed earlier, like aligning these half spaces. They have a better starting point with Seagal. But for that, I created my own taxonomy in the middle of geometrical concepts, so that this kind of implementation here is a little bit easier. Yeah, so what I've learned in all of this, so Seagal is predominantly only polyhedra. They have some sort of curved things in some packages hidden somewhere, mostly polyhedra. The exact rational number type I've talked about, they have an interval that wraps them for performance. They have good documentation, but I find their set of packages somewhat incoherent and chaotic. You don't always know where you need to look, and it's not always easy to go from one package to the other. And I think the focus is rather academic. A lot of the CAD concepts that maybe we rely on, they don't necessarily offer out of the box. And maybe you can read for yourself what I think about OpenCascade. And that's it for me today. Thank you so much. Thank you. Okay, do we have any questions? Yes. Do you see any use for the Seagal library to help with the OpenCascade geometry? Yeah, great question. So the question is, to what extent can maybe these libraries also help each other, enrich each other, and not only exist as two choices at runtime, but also how do we make sure that the library is not only available and not only exists as two choices at runtime. I haven't really explored that yet. So far I'm only at the point of, yeah, you can try one, if it fails, try the other. I think what I would at least want to do is kind of automate that process. So that the software tries the first one, and if they're precious or produce bad results, automatically try the second one. As a next step, I would envision that I'm able to rewrite the results from one library to the other. And yeah, by that time, you would maybe use, indeed, a more intelligent combination of the two. Like this Minkowski sum that I showed make an element a little bit larger and also relies on convex decomposition that you decompose your element into convex parts. Yeah, that is really quite powerful and something that I think only Seagal offers. If you can bring that maybe to OpenCascade for some reason, that's a very good use case in mind. I think that would be quite powerful. But so far no plans. It's quite slow process. Thanks for the question. Yeah, please. I had a question regarding the open spaces that you have between walls for dilutation. Exists there a class for defining the dilutation? Because then the next thing would be that you interpret it in a different way. Does it all put a flexible material in it for strength calculations? Interesting question. So the question is, I mentioned earlier that walls are not always connected to account for, yeah, dilutation that elements can grow and shrink a bit. And is there a specific class to encode that information in the models so that that can also be handled? I would say yes and no. If you would want to, I would say the standard is flexible enough to encode that information. The standard is also very asparic. There's a geometry description. There is a taxonomy of types of elements. And you can also refer to other classifications. So if you want, you can encode that information. And the problem with that is also that I've seen is used to exchange information from one person to the other. And there's a lot of inherent knowledge that we all rely on. So, yeah, if two walls don't exactly match, there is probably some textual description there that still says, well, this needs to be sealed with that and that. And that's obviously also much more efficient. It's difficult to interpret for a computer. Yeah, but people in our industry, I have to admit, don't really care about that. I mean, it is a machine interpretable standard, but people still use it predominantly only for communication and coordination. So you can, let's say, superimpose the geometries. If there is a beam going through a wall where there is no hole, yeah, you have to call each other and solve that. But actual computational analysis on these models is rather rare. Thank you. Thanks. Yeah? Two. There is a microphone on this way. Hello. When we say about IFC, we should remember about the versions of IFC. What is about the versioning with your libraries and all this stuff and how you are related to all these things? I mean, the versions of the standard because it's also going on. Yeah, and it's actually also one of the reasons why I had to create this mapping layer, because I didn't discuss it, but we indeed, for example, here, IFC 2, X3, IFC 4, we have more or less the same concepts. There are minor differences. But I haven't really found a way, for example, to account for the fact that these two classes, the polyline and 2, X3 and the polyline and IFC 4, that they are, let's say, 99% compatible. So it also really dramatically increases the compiled size of my platform, because everything is compiled multiple times with slight variations to account for the same, for the different schema versions. And yeah, there are other software libraries that maybe more start from above, have a unified schema and then allow to serialize that as version 2, X3 or version 4. But I have the more side-by-side as really computer-generated code from the schema. So yeah, we have migration scripts to go from the one to the other, but they are written by hand and probably not complete. Okay, one last question and then we're going to change over. Let's change over. Yeah, look at the render on what. So your voxelization code, how did you set the voxel size? Because I assume if you set them too small, you don't find the holes in the wall, and if you set them too big, you close up corridors. So do you have to do an optimization for the size of the voxel? And what sort of size are we talking about for a building? Yeah, great question. And the main consideration was actually performance and also memory usage, because yeah, it's really a cubic relationship. If you have the size of your voxels, your memory usage increases eightfold. So, but the good thing about the construction sector, as opposed to let's say general purpose CAD, is well, it all relates to us human bodies. So we're pretty sure about, let's say, the size of the geometries we're going to expect. And yeah, for that kind of reason, I also didn't create an optimization step to find the optimal voxel size, mostly just five centimeters. Okay, so thank you very much. We're going and that was excellent. Thank you.
Dune 3D - the making of a maker's tool
Hi, I'm Lucas. So when I am not writing the CAD software of any kind, I usually do hardware projects, some of which I've shown here. As you may see, they're pretty much all the same thing. They are a circuit board in a 3D printed case. So for designing them, one needs basically two softwares. There's CAD software for the printed circuit board and software for the 3D printed case. What both of these things have in common, that CAD is pretty important there, since what you're drawing CAD is what you're going to get. So when you're doing, for example, woodwork or metalworking, if you need an extra hole, well, you just drill it. But that obviously doesn't work for PCBs or 3D printing. So yeah, it's pretty important to have a proper CAD software there for the first thing, for PCBs. I solved that issue for myself a couple of years ago by writing Horizon EDA, but that's not what I'm going to talk today. But for the 3D stuff, I found myself oscillating between FreeCAD and Solvespace, since both some things great, but neither of them covered everything I needed. So let me elaborate on that. So FreeCAD itself, pretty much all the features I needed, some of which are step important export and support for chanples and fillets to make the things look more prettier with a little effort, but it falls short by not, by the peculiarities of referencing stuff, the sketch of being modal and not being able to easily make constraints in 3D. For Solvespace, it's pretty much the other way around. It has significantly less features, but these features work really well and I found it really pleasant to use. So at first, it's dismissed it, and since it doesn't do step import and export, but everything else works really well. So is there anything that does all of that? Unfortunately, I didn't find anything, so I thought, well, it's not the first time I've written a CAD software, so maybe let's try writing a 3D CAD software. So after all, so what do we need to make a 3D CAD? So first of all, we need to show something to the user. For that, we need a 3D viewport with all of the usual stuff like shading, navigation, and selection, but fortunately, I already did more or less that for Horizon EDA, since Horizon EDA has a 3D preview, and it's basically all of the OpenGL boilerplate already done. So we have that. Next up, we need a geometry kernel that takes care of all of the Boolean operations and extrusion and stuff like that, and for that, there are some of you might know it, OpenCascade, also from the talk before. It has some words, but I had some experience with it from Horizon EDA, and it works okay. And it's also pretty much the only game in town if you want to have jump-fast forwards and proper interaction with stuff. So that one's there as well. And next up, we need a solver that takes care of solving all of the constraints and entities and stuff. And for that, there's also something that we can use, in particular, the solver from false space. The solver from false space is available as a library, but that's with a small asterisk. The library itself is a C wrapper around the C++ internal from false space, but the wrapper itself is pretty limited, so I ended up not using the wrapper and ended up using the internals from false space myself, and they are pretty easy to use, actually. So we have that one as well. And last but not least, we need a user interface of some sort with all of the boring stuff, like preferences dialog, way to select tools, the general tool handling, and all of those little, lot important stuff, such as the access lollipop that shows which access goes in which direction. But fortunately, I already had all of that in some way or another from Horizon EDA. It's a 2D CAD, but well, undo, redo, and stuff like that pretty much doesn't care if it's a 2D or 3D CAD. So yeah, then I realized, well, I had all of the building blocks to make a 3D CAD, so I started with it, and that was back in August last year, and now I'm here to talk to you about 3D, a parametric 3D CAD. So I said it took about six months to get from basically a blank window in GTK to where we are right now. As probably expected, it's written in C++20, and it's about 33,000 lines of code, and it uses the, you use GTKMM4 as a GUI toolkit. Using GTKMM3 would have probably been a slight bit faster, since I've already used that for Horizon EDA, so I would have been able to directly copy-paste code. But yeah, GTKMM4 was the last version at the time I started it, so I went with that, but that's probably a topic that I should write a book about, since there are quite a things that were a bit annoying about GTKMM4. And same as Horizon EDA, it uses UUIDs for everything, and uses JSON as a data storage format. So yeah, I pretty much reused all of the concepts that worked well in Horizon EDA for GTKMM4. And yeah, just a couple of days ago, I released version 1.0, and yeah, so it's already packaged for, in Fatpack, for the Windows folks, there's an MSI installer, and the good thing was, well, it wasn't the first time that I had to take care of all of the packaging, so the packaging stuff was pretty much just copy-paste from Horizon EDA again. So what does it do? It has a parametric to the sketcher, that has all of the usual stuff like lines, arcs, circles, and constraints to draw these lines and arcs. There's a convenient all-in-one tool that handles lines and arcs in one tool, so one can draw arbitrary outlines in one tool, and there are also some convenience tools for drawing an axis-aligned rectangle or regular polygons, as they're needed, for example, for hex nut inserts also. To make things 3D, there's extrude and lathe, so lathe is basically a 360-degree revolution, revolutions that are not 360 degrees aren't supported yet. And to repeat things, there's linear and polar array, and to combine multiple solids, there's the usual operations from open cascade, so union, difference, and intersection. So for that, I just basically had to expose to user what open cascade offers for to make an N in 3D. There are also constraints such as distance, angle, point-to-plane, or point-to-plane distance, or that's useful for example when you want to make a hole that stops at 3mm from the last edge, you can just use a point-to-plane distance of 3mm, and that's it. For the step import, I basically copy-pasted the code from Horizon EDA that turns the step model into a set of triangles, and I also reused the code for extracting the reference points since the idea is that you want to import your circuit board, and you want to add some reference points, and then you can reference these points in the geometry, for example if you want to fit your case around the circuit board or make cutouts for connectors. The last important point are fillets and chamfers. These are basically just calling the open cascade functions to add a chamfer or a filler to edge, but unfortunately the way it's implemented right now is subject to a topological naming problem since all of these edges are just referenced by index, so if one changes the geometry in a way that adds extra edges or so, it breaks, but well. I was used to that from FreeCAD, so it was okay. So how does it all fit together? So in the middle there's the thing for the document that consists of all of the introduced specific data structures like groups, entities, and constraints. These are then presented to the user with the renderer and canvas, where they are turned into primitives that I can render with OpenGL, and then the user uses the tools, same as in Horizon EDA, to interact with the document and to take care of the solid model. All of the entities get transformed into something that OpenCascade can understand, and then that's again triangulated and rendered. And to take care of solving the things, there's the interface to the solver in the space, solver in solve space, and app-rolly as to be expected. The hardest part of implementing all of this are these interfaces between OpenCascade and the solve space part, since that's where the impedance mismatches are, since I had my data model and the data model from OpenCascade or solve space, and it somehow had to fit together. So what's next? So there are some, of course I have got some plans, mostly some basic things like measurements, revolutions that are not 360 degrees, or stuff like copy-paste. But the big distinction between, from the project point of view, between doing 3D and Horizon EDA is that with Horizon EDA at least have the aspiration that one might eventually be able to do really big and complex parts, but I want UnityD to stay and to be and stay small and little easy to use CAD software that doesn't have the focus to cover everything and all. It should just be a tool to make simple 3D printed laser cut or CNC machined cases for PCBs, probably something else, but it already does pretty much everything I need for my use case, so it'll probably, mostly stay as is, with of course some bug fix and UI enhancements, but yeah, don't expect anything big to happen there in the future. And I think then we're already over with the presentation, and now for questions I think. So, questions? Thanks for the talk. Very impressive for this time scale. You were talking about having 3D constraints, and then you just showed an extrusion size, but well that's something you can also do in FreeCAD, right? Do you have any other possibilities to do more complex constraints in 3D space? Yeah, sure. Okay, so the question was if there are any more complex constraints in 3D space, there are some, such as angled or point-to-point distances, or what one can also do since 2D and 3D can work together with a means of work planes. One could, for example, construct at a work plane in the same group as the extrusion that's perpendicular to the extrusion, then do whatever one needs there, and then constrain the extrusion to that. Or one can also constrain the extrusion to another sketch, so one can put the extrusion in a work plane and then do stuff there, so there really isn't, and then it's all protected into the work plane itself, so there really is no limit of what can do, but yeah, that's all the way that wall space works. Thank you for your talk and impressive effort. Do you think that CAD programs, CAD suites with this level of complexity could be a good stepping stone for beginners and maybe even children from very simple drag and drop programs like TinkerCAD towards something more parametric that they can manage to use when they start to grasp the basics of these kind of suites? So I think, yeah, it definitely has a learning curve since one needs to grasp the concept of constraints, degrees of freedom and such, but I think that's pretty much the same thing in every parametric CAD. So yeah, there are some idosyncrasies in terms of the user, the interface, and it's driven by a global menu that unfortunately has some discoverability issues, but yeah, I think it's something that one can also try with children, but yeah, I've never, I don't have any experience in that education space. Yeah. Great work indeed, especially for the time you spent on it. So in the beginning you showed these tables with check marks, but you didn't explicitly conclude that you had all the check marks for your software. Yeah, so let's go over it. So a step import, step import and export is pretty much done by OpenCascade. Since OpenCascade does the import, i.e. the triangulation and extracting reference points and export is just calling a couple of methods to take the topo ds shapes and write them to a step file, gemfas and fillets are just methods to call from OpenCascade, and all of the three bottom things are basically the same thing as in SolveSpace, since overall June 3D is pretty similar to SolveSpace in terms of overall operation. There are groups, constraints, entities, and if one knows and likes SolveSpace, they'll probably also like June 3D. Right, and another question, thanks. If you would have spent the same time in either SolveSpace or FreeCAD, could you have improved them to your needs? Yeah, but I was pretty sure that question will come up. So, let's go over it. I think FreeCAD, I've looked at the code sometimes and I've also find that there are really a lot of code, and I think especially changes like having a non-modal schedule would have probably been way more work, and with SolveSpace, they have their own geometry kernel for probably good reasons, and from a project conceptual point of view, I think OpenCascade and SolveSpace are pretty much diametrical. SolveSpace has really this nice self-contained thing without that big OpenCascade dependency hanging off the side. So, yeah, that's why I conclude, well, it's probably easier to write my own, and I also noticed that I really like writing CAD software. Okay, we have time for one more question. I use CAD software to create 3D models for on PCBs, to render on PCBs, and I felt that problems like SolveSpace are missing color support for faces. Does your Dune3D support this? Right now, it doesn't support colored faces. Yeah, I have to look into how to accomplish that with OpenCascade. These are always the topics that are a bit tedious, and yeah, well, it's OpenCascade, and as mentioned in the talk before, it has a rather cryptic API, but the good thing is there's FreeCAD, so FreeCAD is pretty much the best OpenCascade documentation there is. Okay, thank you. Okay, thank you very much, Lucas.
Comprehensible Open Hardware: Building the Open Book
Good morning everyone. As said, my name is Joey Castillo and I'm here to give a brief talk on, I guess, comprehensible open hardware, which for the record, I'm not a maker of open hardware tools. I'm just a humble user of them. But yeah, this talk comes from the perspective of someone using the tools made by the folks in this room to learn from open hardware designs and make some of my own. So one of the first things that I wanted to build when I got into open hardware was called the open book. This is an open hardware e-reader, more or less. And I wanted to make this for a long time, way back in like 2018, 2019, when I really wanted to make something like this. I didn't actually have the skills to make something like this. So to get there, I went online to steal as much as I could from folks like Adafruit, who make open source hardware. In opening up their designs for things like this e-paper driver board, they let me copy a lot of what they did for their gadget into my gadget. But getting ahead of myself here. The open book is the thing that I wanted to make, and I had some goals for the device. Those goals were pretty simple, or simply stated. I wanted to use it to read books. As I pitch it to new acquaintances, it's like a kindle that you build from scratch. I wanted it to support reading text in all the languages of the world. And I also wanted it to be affordable, accessible, and for lack of a better term kind of DIY-able. So just to give you an idea of what the device is and what it does, here is a short video of it in use. Here's a listing of books and short stories on the device. And I can launch this short story by Leo Tolstoy, which of course renders in Russian. The center button goes back home, where we can select a different work, like here the Tao Te Ching, rendered in Chinese. So I think it's pretty fun as projects go, but the fact is that's only half of it. The other half is I wanted the open book to be comprehensible to the person who builds and uses it. Like, it's through open hardware that I learned to build open hardware, and I really want to pay that forward to the people who have their own open book. To explain some of how I tried to do that, I have to flip it over to the back side. So there are a lot of issues with this first revision of the open book, but I'm showing it first because of this sort of dense silkscreen text that kind of became my trademark. Back when people were on Twitter, multiple folks at various times called me the Dr. Bronner's of PCB design. For this habit of filling every millimeter of my board with text. Up here, I'm narrating the entire soap opera of an ideal diode circuit giving five volts to a regulator, which is interesting, I guess. But why? Why should I pack my board silkscreen full of this kind of stuff? To answer that, I need to briefly do my ideology of why I got into open hardware. The problem, as I see it, is closed tech, especially as shipped by these big tech companies, fails to serve users of the technology. I tend to look at technology through the lens of power. Like, take this Kindle, for example. Who is this technology designed to empower? And while, yes, it does allow you to read books, I'd argue it is designed to empower Amazon. It's designed to push you into dark patterns that make you spend more money with Amazon. It's designed to surveil your taps and profile your habits for Amazon. It's designed to steal your attention and monetize it by selling ads for makeup or toasters. Meanwhile, the end user just wants to use the device on their terms without ads for toasters and is prevented from doing so by the platform owner. The big question for me, why does the tech get away with this? And the answer that keeps coming back to me is the technology is fundamentally incomprehensible to the end user. A device like this arrives fully formed as a slab of glass and plastic and it's meant to be used in the ways the platform owner sets forth. It's not meant to be understood or hacked or made to better serve the user. So what can we do about it? Well, I don't have all the answers, but in my practice at least, my goal is to make tech that folks can understand. My theory is if we can design well-documented open hardware that people can build on their own and understand, at least in the broad strokes, we can teach them that they don't have to accept technology that wasn't made with their best interests at heart. There is this fantastic quote from Bunny Wong in a blog post about hacking his Garmin smartwatch. He writes, the point of open hardware is not the ritual of compiling our stuff from source. It's the awareness that technology is not magic, that there's a trail of breadcrumbs that any of us could follow to liberate our digital lives. So with that in mind, what are the breadcrumbs? What trail am I laying down for users of my objects to follow? Over the course of a few years, I've had the opportunity to design several different versions of the open book and I think I've found three different sets of breadcrumbs for three different contexts. The first one has to do with helping the user understand how the gadget works, the second helping them understand how to build the gadget themselves, and third, explaining how to make use of the gadget. Let's take the first one first. This is one of the earliest open book prototypes and my vision back then was to use the silkscreen to narrate what each component on the board does. This has some benefits, I think. Like on the plus side, this could demystify the tech for someone who sat down and actually read the silkscreen. On the downside, I have to say, space is limited and I honestly left wondering if this is the most useful information to give the user. Like, I want to demystify the tech, I want them to feel like this is something they can understand, but is understanding how a MOSFET works the best way to do that? The best answer I can come up with is maybe. Still, there were a couple of bigger issues with this version of the open book. The parts are kind of small. These are 805 passives, which are pretty small for the average folk. They're fine-pitched parts. There's also parts like this microSD slot, which has its solderable connections hidden underneath a shield. I borrowed that footprint from Adafruit, which is open hardware and great, but they design for manufacturing, not hand-building. They're also honestly just way too many parts on this board. It's trying to do too many things and it's overwhelming to someone trying to build and understand it. So, yeah, this realization led to a new design that I called the abridged edition. This version cut the part count down considerably and tried to make it as simple as possible for people to build themselves. I used bigger passive components, 1206, and I picked parts with pins that are easily accessible like this new microSD slot. Instead of making folks solder down a fine-pitched microcontroller, I used the Raspberry Pi Pico module, which has a super-friendly 0.1-inch pitch. Yeah, some parts like this flex connector I could not buy on a module, but then I realized I can make my own module and have it preassembled. This little cast-related module, the green part, includes the e-paper connector as well as the whole supporting boost circuit. I ordered dozens of these for a few bucks a piece and I offered them alongside my main PCB. This meant that DIY makers only had to plop down one module to get the display working instead of a dozen densely packed fine-pitched parts. I also decided rather than using the silk screen to explain things, I could use it to explain how to build the thing. Adding step-by-step instructions alongside each of the parts on the board is a different trail of dreadcrumbs, but the upshot was you could follow the instructions, literally counter-clockwise around the board, and if you followed them all correctly, you would end up with a working device. Okay, so things I like about this set of dreadcrumbs, well, it is super effective. Since releasing this design, dozens of people around the world have assembled their own open book boards without any of my involvement. Like these photos are community builds that I never touched. I didn't even send them a board. These are people going in on group buys and part swaps and having enough success with the build that they've moved on to hacking on the firmware, which is exactly what I wanted to see. I'll also say we did a workshop at Hackaday Supercon in 2022, very ad hoc, not a formal workshop. We just sat on the floor of Supply Frame HQ and I guided a dozen people through building their own open books, and every single one walked away with a working device. Like, the plan worked. Still, after the abridged edition and doing these workshops kind of hands on, I realized to make the project scale to more people, I couldn't rely on everyone soldering it together themselves. I would have to have most of the thing done for them. This means I'm no longer using the silk screen to tell folks how to solder the thing together. Still, I did want to use it to do something useful. I still wanted to encourage that comprehensibility that we were talking about earlier. In this case, I kept something from the original open book, arranging the components in functional blocks, even if I can't fit room for narrative text to describe what they do. These blocks match what's in my schematic and how the components are grouped over there. This still gives people an overview of how the device works. You can see this is not a pile of parts arranged haphazardly. These parts work together in ways the user can understand. This is the battery charger. This is the power supply. Still, there is the question of what to do with the rest of the board space, and I can't leave it blank, so. The trail I'm finding most useful these days is the trail that leads to making use of the device. For this latest version of the open book, I'm including pin assignments as well as notes on how to develop firmware for the device right there on the circuit board itself. So, I'm going to be honest with you. I use this a ton when I'm writing my own firmware. Like, I am lazy, really. Sometimes I don't want to search my own documentation. Sometimes I didn't even write the documentation. If I don't want to open my schematic to try to decode what I was thinking when I designed this thing six months ago, what are the odds a user is going to go to all that effort themselves? Having the docs right there on the board is an affordance for people making use of my device, and as I found out, I am one of them. This also works on boards of many shapes and sizes. This is the circuit board for SensorWatch, the Casio wristwatch mod that I'm wearing on my wrist. Also, a shout out to Lucas, who is just up here. I learned everything about making a Casio board swap from your open hardware Pluto watch, so thank you for that. SensorWatch owes its existence to your project. Anyway, you can see here we're on the backside. This board is less than one inch in diameter, but we're still able to include notes about which pins are which, the capabilities, and even which on-chip peripherals I expect you to call on to make use of those pins. Self-documenting circuit boards like these attach relevant information to the hardware you already have physically in hand. This board doesn't just have pin labels. It has a narrative of how you wire it up. It doesn't just have component designators. It tells you what they mean for the device configuration. Oops. It creates a self-contained artifact. This is a prototype of a new version of SensorWatch. I'm still working on the pin assignments, and they may change before it's final, but even if I put this down and pick it up in six months, I don't have to cross-check a revision number with a schematic and a datasheet to get hacking. All the relevant information is literally in hand. Moreover, that information becomes available to the end user as well. Unlike closed-source objects, you have to painstakingly reverse engineer. Putting this information on the board itself makes the object hackable by default. We're throwing the doors open to the end user without forcing them to do so much as a web search, much less a deep dive into my repo. Also, just as a side note, this technique pairs very nicely with code that makes use of the same names. If your silkscreen says you named a pin button alarm, and the headers for your board support package also name that pin button alarm, you've made everyone's life easier, including, actually, maybe even especially your own. Once again, I am not someone who invents open hardware tools. I am just a humble user of them. And I don't have all the answers for when it comes to making or helping folks to rock the stuff that we make. Still, these are some of the ways that I have tried to make some of my stuff more transparent. And I just want to close with some questions that I can ask myself and we can ask ourselves as we finalize our designs and send them out into the world. Questions like, how would I imagine someone using this device? Am I offering affordances that make it likely they'll achieve what I hope that they'll achieve? What kind of information would I want to give a user of the device, both at a basic level and at a more advanced level? And also, I didn't put it on the slide, but what would I want to know if I'm picking this up after six months and I've forgotten most of my design choices? Most of all, can I tell the story that I'm trying to tell, the story of the device in a way that makes sense? Because if I can figure that out and print it right there on the board, both the artifact and its back story will live together forever. Anyway, that is what I wanted to share today. So I'm going to put up my info and I would love to take questions if we had any. Thank you. So first of all, I love the product and I love your philosophy in open source. So those open books support EPUB files right now? It does not. So the open book uses a ESP32 microcontroller. It's a very kind of resource constrained platform. At this time, I'm supporting plain UTF-8 text. That is my file format of choice. That might also be a bit of an ideological choice. I like the idea that a plain text file can represent a literary work. Plain text feels powerful. I think if space aliens come and see the ruins of our civilization in a millennium, they'd probably be able to figure out the UTF-8. I'm not sure if they'd be able to figure out the plethora of things that go into... EPUB is just a zip. Yeah, having said that, folks ask this question a lot and now that people are hacking on the firmware, I think it's entirely possible that I think the ESP32 is a capable microcontroller and I'd be curious to see what folks come up with in this space. So while it's not something I'm working on myself, this is the ethos of open source, right? Throwing the doors open to folks. Awesome, thank you. Thank you. Yes, but for me the problem is microvision in the sense, so because there is a lot of SMD, surface-mountain device, and for me it's not practical. You need an installation of a big house to have a fire and this precision soldering and so it's not for everybody, this kind of thing. So I know hardware, but hardware is now in the past, it was easy. Now it's very difficult. So if you can do a book format, it's little also. I prefer a large format and to chance to make a chance to come annotation and so on. So what do you say about that? I think you're absolutely right and I think this is the reason that I'm starting to move toward getting it PCBA assembled and maybe people just... maybe the experience of building your own book is like taking a circuit board that's assembled and putting it into a case of your choosing or 3D printing a case and maybe that's the larger things that you're putting together, but I totally understand not everyone is going to be able to solder these fine-pitched parts and yeah, I think maybe my appetite for DIY got ahead of my understanding of everyone's capability or desire to DIY. So I think you're totally right and yeah, I'm probably moving toward more PCBA in the future. Can you pass the microphone back to Andy? Hi, thanks. It's really great. A couple of questions. So the sales screen can mess up your board now if you make a mistake in the documentation. That is correct. I think it makes me very diligent about triple checking things before I send it off, but no, I will not lie, that has happened to me before. No way of automating that? I'm very curious and maybe some of the folks in this room have ideas, but I do like the idea of if I know I want to annotate, for example, a line on my microcontroller symbol, I want that to be on my sales screen. That would be very interesting to see if there are ways to link those things together. I haven't yet run across ways to do that, but if anyone in this room knows tools that can help me with that, I would love to do more of that. Can you do field substitution in KeyCAD in the sales screen? Okay, cool. I've been told I can do field substitution in KeyCAD in sales screen, so that is awesome. As a user of KeyCAD, I will check that out. Any plans for having a camera on the book that you can scan the board and show the schematics and the documentation on the ebook itself? Interestingly, not only in the ebook itself, but I have a colleague who's working on using kind of QR style codes to get a better sense of the assembly of various devices. I think there's a lot of possibilities there. I also had a slide about the idea of, I like the idea of putting things like QR codes that contain text, not URLs, but if I could put a basic readme in a QR code and you scan it and you get the full text of a pinout or a description, that would be very interesting. But yeah, possibilities. I see one more in the back. There's a question on the line on the ETA. So, yeah, the question is, am I planning to offer the open book online or is there an ETA? And I hope to do a crowd supply campaign at some point this year. I'm just, yeah, it's hard to find the time to do all the things I want to do, but hopefully by the end of the year, hopefully in the next few months I'll have a pre-launch page up and we'll be able to, yeah, put something out there. Okay, so thank you, Joey. Thank you. Thank you all.
FreeCAD - state of the union
So, we have York and Aikson are going to talk to us about the state of FreeCAD. So please give a warm welcome to York and Aikson. Hi. Hi. Is the microphone, yeah. Hi. Hello everybody. So this is what we're going to talk today. I will make a short update of what's going on, what has been done recently with FreeCAD. And Aikson will talk more about the assembly side of things. What do I go to the next one? Page, huh? The next page, yeah. Oh my. Oh, it's hard to use other people's software. My goodness. We'll get there. Wow, it's good in the browser. We'll get there. Basically, I will already talk while we are there. Yes. Okay. So the main thing everybody is always asking us is when is version one point there ready? And I have good news for you. It's almost ready. I know we've been saying that for like five years, but I swear it's almost ready. We basically, we have two more things that we think we cannot call it one point zero without that. It's the top naming problem and assembly. That's basically what I'm going to talk about briefly. And I will talk about these other things too. That's basically all that we are doing now and what you could already, you can already see if you use a development version. Can you go one further? I hope so. Something, something. So, um, Oh, proprietary platforms, you know what that is. It's not responding. I go back. Okay, let's just do it on the browser. Yeah. So, um, that's a teacher. A lot of things. Woo. Thank God. I don't know what to say. So top naming basically is, it's the main course of, of free cad is a problem we have with free cad since a long time. Basically when you transform geometry, any geometry in free cad has named components with this is edge one, this is edge two, that is face one, face two. And when you change the geometry, the open cascade kernel of free cad reconstruct the things, the thing, and especially with, with sketches where there is a computation needed to know which edge depends on, on which one the order is shuffled. And this prevents or this hinders all the naming. When you reference one edge by name, for example, if you say, I'm building upon one edge, edge one, and then your edge one is somewhere else, your model breaks. Everybody who has worked with, worked with free cad knows that problem. That's our main course. And it's basically about to be solved. Hopefully. Crossing fingers that we'll see when it's there if it holds a promise, but it's basically a fork of free cad, which is the, the link three branch. It's working there already. So we have an example of it's working and, and it's working really reliably. So basically there is a new engine that keeps tracks of the name and renames things as it needed. And that is the last piece of this is being merged currently. So in the next few months, who knows this is done. So this is basically everything I just explained. Assembly. Dr. Co will talk about it in a moment. We have lots of new stuff in the sketch. That's basically where most of the new stuff is being done now. We have auto constraining, which is much better. We have lots of people are working on really interface things and having the workflow in the sketch or more streamlined, removing some unnecessary operations. Now you can build these things and you erect a grid to try and other shapes and you already have the right conference put for you. You just need to edit. If you want to change anything, we have on screen inputs. So you make a rectangle, you click two points and you can enter the dimensions there without having to build. We don't have to go click anywhere else. So those are like small UI improvements. But we are beginning to get into UI and UX and try to address all those little things and make see what we can do to make things flow better. Lots of teaming and UI work is being done too. There are new teams coming, trying to solve, make all these little widgets a little bit better, a little bit less different one from each other. There are light teams for who likes light teams, dark teams for who likes dark teams. Everybody is having fun with this. It's surprising how everybody use them to say, oh, wow, this is so much better and it's just teaming. But you can do so much with that in terms of having your own application like behave more the same across, across where benches etc. Before I let Aksion talk just a last word about what's happening around Fricade. We are in a kind of exponential growth now. Things are like pouring everywhere and new developers coming from all kinds. We have a non-profit since it's our second year now. The FPA is the Fricade Project Association. And we begin to finally, we begin to learn how to spend money. Like it would be hard to earn money, but it's actually much harder to spend it in an open source project. How do you do that responsibly, transparently? How you make the best of the money? It's really something we had to learn and it took more than a year to begin to learn about it. Some Fricade veterans have also created a commercial company around Fricade, which is Onsell, which is its aim is basically to try to sell commercial solutions to people who need commercial stuff. Companies who are not comfortable with taking an open source software. And so the idea of Onsell is to bring them a commercial package. So with all the stuff that companies need around an open source software to feel more comfortable. And as always, Blender is our main inspiration into all this. Like how they managed to get to higher commercial levels and still, it's still a community project. And that's basically what we're trying to do here. Like maintain Fricade as a community project and try to wet our feet with that commercial world. We want Fricade, but without getting lost in the way. So I'll let action talk about the assembly system now. Testing, testing, can you hear me? Alright, thanks. Okay, I'm going to talk about what is called the Onsell assembly solver. I'm part of Onsell and assembly will be LGPL. And the work starts from my work. Oh my. There are your slides now. I know, but... How do I get out? Get out? Escape? There's an escape? Oh my. Can you go just page... Oh yeah. If you go that way, you have yours. Alright. So basically, my work starts here. Goodness, what is so tough suddenly? I have a very little introduction while you're looking at it. There was at the time a second Fricade project. When Fricade started, there were two projects, two open source projects named Fricade. And one of them was ours that we know. And the other one was something that I called launched. And now after like 15 or something years, Dr. Ko is like putting his Fricade inside our Fricade. And it's like a cosmic combination. Both Fricade are joining into one effort. Right. I started out doing this in earnest in 1991. And this project here, I started out as Fricade and I launched it in the year 2000. So predating Fricade.org. Anyhow... So the important thing here is this software here is a multi-body dynamic software which is equivalent to say, Adams for those who are into multi-body dynamics, which is probably the premier multi-body dynamic software used in industry by Boeing, Ford, Caterpillar and so on. So I was in the theory of multi-body dynamics in the 1980s. So I really got absorbed into it. But the license for Adams was, of course, ridiculous for a professor. And I just said, okay, I know how to do it, so I decided to make it. So the next question was, you know, do I program in FORTRAN? Do I program in C? And I said, if I did that, I'd be 20 years behind. And I was just saying, you know, let me look, see what is more productive. As it turns out, in the 90s, small talk was the big rage. Small talk had just invented the GUI. Small talk had just invented the mouse, integrated IDEs and so on. So I used small talk for this program. And within a year, I was able to get a simulation going. And then I spent about, you know, three to five years, depending on how you count it, to get what you see on the left side. Speak closer. All right. On the left side. And I tried to commercialize that. Indeed, we did form a company. And we got bought up by Adams. I decided to leave to start my own. And I put this, as you can see, the graphics here is, of course, pretty bad. Just extrusion. 2D is extruded. That's it. So I made add-ins for space claim. And the motion simulation is mine, still running in small talk. But the simulation now works with space. And the system is quite capable. It can do systems like this. And hopefully, FreeCAD will be doing this in the future. Certainly in kinematics and assembly, that will be all LGPL. We, the multi-body dynamic side, the dynamic side, we undecided at this point. Okay. So hopefully soon FreeCAD will be able to simulate systems like this. Okay. All right. Back to my slide. 15 minutes left. Okay. Good. So the theory, if you're interested in the theory, the theory is right there. Okay. It's in open office format. And this is just a summary page in PDF. And for those who, you know, want to get into the details, it's all there. Okay. And hopefully it's reproducible too. All right. So this is the open source version, which is in C++. All right. In order to make it work in FreeCAD, I had to translate it from small talk to C++. And that was an exercise that really taught me something interesting too. So translating from C++ would be similar to translating from Python to C++. So, all right. A little bit more on the theory. I'm supposed to make it technical. I guess they want me to impress you guys with the technicality. So. It's basically you have the world frame or the inertial frame. And then from there you go to the assembly frame for, and then from the assembly, you go to the part frame. All right. And then from the part frame, you have markers frame. All right. And then from the marker frame, you go to the end frame. And the end frame can be on a point or it could be on a curve or it could be on a surface. All right. And the point itself could be moving relative to the marker and the curve itself could be changing shape relative to the marker. And why would I want to do that? For example, if I have an actuator moving a piston or hydraulic piston out as a function of time, I want to be able to describe the movement of the E marker relative to the M marker. And that would be the purpose of having things move. So once you have the kinematics where the position of things are, the next thing is to use constraints to make sure that they connect in an interesting way to make your mechanism. All right. And the constraints are basically absolute constraint, Euler parameter constraint at point. That means they are at the same point in the plane. The lines are perpendicular. They are at a certain distance, constant velocity in couplers and so on. Extruster. So the equations and we solve this in a right way, mathematical, exact way. You should get kinematics and dynamics quite nicely. So right now we have a lot of assemblies in free cat, but I don't think any of them solve the full kinematic equations completely in 3D. So with that, you can create joints like this, rigid, prismatic, revolute, parallel, cylinder, cylindrical, spherical, and so on. And hopefully almost all of modern day mechanisms could be solved using this combination of joints. So let me share something that I think is most interesting that I discovered in this practice. So I started on a small talk, which is very Python like. So I think that could be shared with you guys. Going from small talk to Python or Python to C++. I was a bit worried, you know, how do you do it? And, you know, C++ is of course a terrifying language to me. But I realized that actually I don't need to use all the bells and whistles of C++. Let me just get into C++ so that you can get into free cat.org. So I made C++ in a way that is small talk like. And once I was able to do that, miraculously, the small talk and the C++ could stay on side by side and look very similar. And as a result, I was able to do a translation in about 6 months when I was worried, wow, it may take quite a long time. So I want to pass this on to you because Python people may want to do that too. You want to, you have developed something in Python, it's nice, but it's slow. You want now to put it into C++ to make it fast. And you have never wanted to do that because it's just terrified of C++. But now I think there's an opportunity. It's not that difficult. Okay, so you will make Python and C++ look alike. So what's the secret? I don't worry about protect and private. Everything is public. All variables are public. Functions are public and virtual. Okay, and the secret is to use smart pointers, shared, underscore, PTR. If you do that, then the C++ objects behave very much like the Python object. Okay, and in the past, you are afraid to use pointers because you worry about memory leaks. Okay, you worry about when to use new and delete. And if there's a match, you are guaranteed to have memory leaks. And then the other thing is in C++, you have copy by value or copy by reference. But if you use shared pointer, those things, you don't have to worry. The only slide additional is the worry of circularity. You have one shared pointer, A to B, and if B points back to A, that would be a circular reference. All you have to do is you use a shared pointer from A to B, and then from B to A, you just use a plain raw pointer. And that circularity for shared pointer will not be a problem. As you go around in a bigger circle, you just have to break one of those with a raw pointer, and you should be in good shape. So that's my, I guess, extra bonus message to you guys about Python to C++. That's it. Just one last thing to add is that there is already a version of this working in FreeCAD that you can get in the on-cell has a special build of FreeCAD that they issued which already contains part of this. So that's testable in their build and soon in main FreeCAD. Okay. Are there any questions? Hello, thank you very much for the very nice talk. I'm very curious with all this new powerful functionality, what your thoughts are about possible changes to the user interface. If it will be easy to integrate, if there need to be new concepts in the user interface, maybe if there would even be visual programming, like how do we envision that users can use all the functionality that you introduce? Assembly has been around in a long time, so I don't think there's anything unusual. So we are developing something on on-cell, but I believe that any of the assemblies can use my solver to get good assembly solutions. Hello, can I ask if there's any roadmap for the path work bench and in particular about using the fourth access, the available ability, if that is on the roadmap at all? Yes, what do you know about that? Sorry, could you repeat? Brad Collette is the, I guess, originator of path. From talking to him, he wants to make it as professional as possible, or as solid as possible. So I'm sure it's in our roadmap, but I don't know exactly what that roadmap is. I mean, it's certainly a high priority. Let's put it that way. So my question is, is there going to be a default assembly workbench and a DAST case with one one? Good question. You want to? I think we have Pierre Boyer creating one in on-cell, and we call it an integrated assembly. But like I said, for everyone to work together to create one nice one for freeCAD, it's definitely our goal. So we'll put our integrated assembly out, and as usual, freeCAD is open source, and people can give a lot of input, hopefully constructive input, and then we can move on. Yeah, just to add something, we hope that this will be the one that unifies everybody. But it would still have all the others, and there are many paradigms, and other people who want another sort of assembly in that will stay. We just hope there will be like a good default that most people will want to use. There's a question online. Once toponaming is solved, would you allow adding custom names to elements, faces, edges, for example, instead of only having generic names like face1, edge2, etc., a user could attach custom strings? I'm not sure I understood exactly the question, but the thing is, yes, the translation engine that we're... There is a kind of mechanism that maps the older edge1 to the new edge1 and keeps tracks that's always the same edge that gets named edge1. That engine is also able... You can use custom names there. So instead of edge1, you could change the name to left edge. And so, yeah, I think that's what the person is asking, if you can begin to customize the things and give labels and names to things. And yes, that's already in the engine. Thank you very much. Super news for the topological naming problem, but I think the one most interesting to me is the change in the UX. I think that's one major, major wall for beginning users. Another question I have, I was recently listening to a podcast from Opulo, a machine making pick and place for electronics, open source but commercial. And what I found very interesting is they said, we want to build with FreeCAD just so that our machine can be the best. How much emphasis do you put on building features in FreeCAD to support this community open hardware approach of just contribution, like building... So for code, it's very easy because you just add line of code, and you have a Git diff. What about Git diff in FreeCAD somehow? Yeah, we're thinking about that all the time. Of course, it's very hard to obtain. It depends a lot on what's your use case, what you want to see in diffs. But if you look at the FreeCAD forum, you have several threads about that, people looking for possible solutions. I would say none of the other software...
KiCad Status Update
There's quite a few seats down here in the front row. If you've got, if anybody's looking for seats, right here is four, three. Okay. So, let's please give a warm welcome to Wayne Stambaugh. Thank you. Thank you. Thanks. I'd like to start out by saying thank you for everybody for attending. It's great to be back at a live Fosdom again. This is the first time I've been back since 2020 pre-COVID. So it's great to get out in front of the key CAD user base and get to talk to people. If you didn't show up at the booth yesterday, it was a lot of fun. We sold our swag out much faster than we thought we would and all the stickers were going before lunchtime. So hopefully next year now we know we'll bring a little bit more with us. So there's a lot to talk about. I know some of this is going to be kind of fast. I'm going to get through it quick, but because there's been quite so many changes that are going to happen in the upcoming version eight, I'm just going to blow through them. The talk slides are available online. There's some animations in them that I'm not going to have time to let play all the way through. So if you are curious about how some of the new features work and how you access them, you can just download my talk and then play through it on your own. So let's talk about what's going on. I'm only going to talk about what's happened in the last year because it would be too much to go all the way back to 2020. Unfortunately, that first line should have said key CAD date was already released. Well, we ran into a few issues. We're going to have to spend an RC3 here probably in the next day or so. I expect eight to be released with some time in the middle of February at the latest. Fingers crossed, but I think we're pretty close. Last year when we ran the version eight end of year campaign, we raised over $200,000 in donations and donation matches from other companies. So that was really, really a successful donation campaign, and that's allowed us to pay developers to continue to contribute to key CAD. So all these new features that are in key CAD and V8 and then moving forward is largely in part due to those having those funds available to help pay our team to continue to contribute. That's been really beneficial. We had our first conference since the original one in 2019 in Chicago. There was a key con Europe this year in Ocarina, Spain. For those of you who didn't get to attend, it was a much smaller, more subdued event, but it was really well done and I think everybody had a good time. Interestingly enough, spun out of that, there's a company called Watch You Next PCB who is one of our platinum sponsors. They also decided to throw a kind of impromptu key con Asia. And so Seth and I were in Shenzhen in November for the first ever key con Asia. So what else is going on in the team? So in the last year, we've added three new lead developers. Watch You Next PCB who sponsored the key con Asia event. They actually have hired people to work full time on key CAD. There's one full time individual now. Once we release eight, he'll probably be the next member of the lead development team. They also hired a second person that he's bringing online and getting him used to the key CAD code. So now we're going to have some additional resources that help the project. The biggest improvement though has been in the library team. So the library team has grown tremendously. For a long time, we had this huge backlog of symbol footprint 3D model libraries that kind of were getting stale because the people who were running the library way back either changed jobs or they had to go do something else. Life got in the way. But in the last year, we've added six new members to the library team and there's been eight now. Is it eight? Sorry, I probably didn't update it. But it's been a huge amount of backlog. I'll go through the statistics at the end here just to show you how much that's improved. We actually have a technical writer now. We have one individual who spends a lot of time just all he does. He's like our, they always say there's no I in team for Graham. He is truly a one man team. So our library docs don't lag or our key CAD documentation does not lag as much as it used to. It used to be the documentation always lags quite a bit. For version eight, it's going to be relatively up to date. There'll be a few things that are missing. One of the other things that was interesting is worth electronic out of Germany had contacted me about providing their footprint and symbol and 3D model libraries to key CAD. And so they've been slowly starting to online and contribute stuff. And their goal was to get their entire product line in the stock key CAD libraries. So at some point it's quite a, that's a big company and it's quite a few parts. So thank you to them for stepping up and documenting their basically providing their own symbols and footprint and 3D model libraries for key CAD. So one of the things that's interesting, we'll talk about, I'll also talk about this in the statistics part is in US and in Europe we have quite a bit of market share. We actually have a very large presence. But in Asia we've kind of lagged behind, but in the last year we're looking at the download numbers, Asia's really starting to ramp up now. And you know, that's a really big market. So it's neat to see key CAD getting in, in making penetration in that market. And I think one of the main reasons, especially in China, is we now actually have quite a few people who are full time either translating the application, translating the documentation. So they have really good native language support for key CAD. So that's really, I think that's what's helped. We now have five platinum sponsors and if you're not aware platinum in key CAD is 15,000 a year or more. So we now have five of those. And I know it's a little limited right now, but the key CAD stores open. So if you want to get some key CAD swag, there's not a lot of items there yet, but as time goes on we will add more and more items to the key CAD store. So head to store.keycad.org and check out, get your latest key CAD swag. And of course we always like to give a little love to our platinum sponsors. I see Felix is around here somewhere. Eisler, the newest sponsor is DAI and they're a consulting firm. They are our latest platinum sponsor. DigiKey and Watchu, Next PCB. So Watchu is the parent company of Next PCB. You are familiar with them. They are a PCB and PCBA manufacturer. And of course key CAD services corporation whose goal is to continue to support the key CAD project from the commercial side and everything flows down from there into key CAD proper, which so we all get to benefit from. So what would we add in version 8? So there's a lot of things that happen in version 8 that are not in version 7. So we made a bunch of SVG exporter improvements. Some of the primitives used to export as line segments and now it exports as its own primitive. There's now a startup splash screen, but that's been disabled some more about that later. But it's there for like rebranding. If somebody wanted to make their own variant of key CAD and wanted to put their own splash screen up there, they can do that. There's now, oh it's slow. Oh it went too far. Page up. Come on laptop. There we go. So there's now all the hotkeys. So here you can see the animation plane. You can assign multiple hotkeys now to a single action. We have ARM64 builds. So for those of you who are running Windows on ARM64s, we now have data binaries for you guys. One of the big contributions this year was there's now an easy EDA project importer. So your easy EDA and easy EDA pro projects will import directly into key CAD. The whole project schematic, everything. So that was a nice contribution. We introduced the command line interface in version 7, but there were bits and pieces of it we're missing. So now you can run DRC and ERC from the command line. So if you want to do like a CI tool, so every time somebody commits a change, you run the ERC or DRC, make sure it's clean. And if it's not, you can automatically ping somebody, hey you broke this, you broke that. So that's going to be in, that'll be available in version 8. Yeah, we all like those CI tools. Keep people on their toes. So here's something that's interesting to happen this year. So one of our, this is one of the things that key CAD services did was there was a customer who needed this for their, this was a request, they paid us directly to integrate this into key CAD. So everybody gets get support key CAD now. It's not everything is supported in get, but most of the basic things that you would need to keep track of version control of your designs is now built into key CAD. So the property panels on seven, it was only available in the board editor. It's now available in all editors. So the schematic and the symbol editor and the footprint editor now have the little panel. You select the object, you get the, you have the panel up. You can modify your object properties without having to open and close the dialog. Here's another, oh come on, you can do it. There you go. Here's another one. So we have customers that do really complex designs and they requested this. So they have, when you highlight a net, sometimes it's so complicated in order for you to find where on the design where everything goes. If you have a deeply nested hierarchy, it's cumbersome because you got to walk down, walk up and down the hierarchy stack. There's now a navigator. When you highlight the net, there's a navigator allowed. You see all the elements that are connected to that net in the bar on the left. Click on it, opens the sheet, takes you to that element directly. So if you do really complex designs, it's a time saver. That was also paid for by somebody else. This wasn't something that was even on our radar. It was just something that a paying customer requested. They paid for it, goes into key CAD. We all win. So there's now search panels on all the editors. So there's a global search panel which allows you to search for all kinds of different objects. You click on it, takes you to that object in the view. Instead of the old find dialog, this is a lot more convenient. It's a little bit more useful. You can see what's available. So there's now an internal bomb tool. In version 8, you no longer have to generate your bombs using a script. In the past, we always scripted it out because everybody would argue about what a good bomb is. We provided a tool. It's obviously not as flexible as the scripting is, but if you just want to export a simple bomb, there's now a tool for that, built into key CAD. So we also have contextual object grid alignment. So you know how sometimes you want the pins on 50 mil grids, you want your text to be on 25 mil grids. You can set it up contextually and when you're using that object, it will automatically pick that grid spacing. So instead of you having to constantly change grids back and forth when you're connecting pins versus like moving your text around to get them all nice and lined up pretty, there's a tool that handles that automatically for you. You don't have to do it. There's now nested symbol inheritance. So you can now instead of in version 7, the derivation level was 1. Now the derivation level is infinite. So you can create subgroups of subgroups of subgroups and you can stack. So you don't have to keep redefining the same fields over and over again. You can define a set of fields and then build something on top of that, the symbol on top of that, symbol on top of that. So that's available in v8. Oh, God. Yeah, this is a pretty big, there's now a tool to check for diffs so against the library. So you're working in your schematic. You don't know if there's a different, you know, it allows you to see a diff. So you say you run an ERC and you say there's a diff, you get a diff error or a diff warning that your symbol doesn't match what was in the library. There's a tool now that will show you what the difference between the two objects were. So you can decide whether you want to pull in the change from the library or just ignore it. Okay, so there's, we now, you can now directly import CAD, Altium, CAD Star and Eagle Symbol libraries directly into KeyCAD like you could do some, you know, some of the other ones instead of having to like convert them and then, and then bring them in as KeyCAD libraries. You now can just bring them in directly. There's a preview. Is it going to show up? Yeah. So there's now a previewer for the library so as you hover over, mouse hover over them, instead of having to click it and see it in the editor, you can now just hover and say, oh, yep, that's the one I want. Just some handy convenience features. There's now a library file. I don't recommend anybody doing this. This is just for demonstration purposes. Here on my hand editing, a Symbol library file, and if you watch in the, it's updating, we now watch all the files in both the Symbol and Footprint libraries. And if you, if they change, it will tell you so you don't save over top of existing changes like say somebody, say you got two people working on the same library at the same time. Some of you is going to get a, hey, you're going to overwrite somebody else's changes. So we implemented that. We now have a simple single button now. Let's say you have a bunch of libraries that you've imported from Altium or Eagle or wherever with one button. Now you can just save those as KeyCAD libraries because one thing we can't do is we don't write anybody's proprietary formats. So we just, it's a read only library. But if you want to edit them in KeyCAD, you just save them as KeyCAD libraries and go on about your work. We now have differential cursors in the simulation, in the simulator. There was quite a few changes in the simulators that much improved. So you, you know, there's a lot of LT spice like features that people are used to having in a spice simulator. Oh, come on. We can now directly import LT spice schematics. There's a caveat with that. You have to have LT spice installed because it goes, because it needs to go back into LT spice installation and get all the LT spice symbols. So if you want to do this in order for it to work correctly, you do have to have LT spice installed on your, you just can't take your LT spice circuit and then import it without being able to import the rest of LT spice because it references its own internal stuff. So we have to do this ourselves in LT spice and we have to extract all that out to make a simulation, but it works very well. I mentioned earlier we got a bunch of spice simulator improvements. We have FFTs. This is a really bad oscillator here. So it's fun to make. You can see, you should only see that one spike there at the beginning and none of the other ones and a perfect oscillator. But I did this one just for fun because I know how to make bad ones. I've experienced that. And S parameters and Fourier. So like most of the features that have been available in NG Spice that we just haven't extracted out, that information that we haven't extracted out in the simulations is now available. One of the more requested features, come on, you can do it, there you go, is editable power symbols. So before in KeyCat if you wanted to create a custom voltage, you had to go into the symbol editor, copy it and then change it. You can now do it on the fly right from the schematic editor without having to create a new symbol. We vastly improved the importers for SVG and DXF, or I'm sorry, we now actually have SVG and DXF importing into this schematic editor. It was only in the board editor and the footprint editor I think before, but now it's now in the symbol and schematic editors. We can export to Cadence Allegro PCB designer, so for those of you who don't want to use our board layout package, you can now export the net list to Cadence. And we've switched over, so a long time ago we switched the board editor printing in Cairo, which can do things like alpha blending, which is kind of more important for the board editor, but we've also switched over just recently in the schematic editor. So like if you have bitmaps with alpha instead of being no alpha, they'll be printed with alpha blending. Okay, that's it for the schematic editor. So it got the bulk of the love for version 8, but there were still a lot of changes in the board editor. So we have the same tool that we have in the schematic editor, check footprints against the board in the library. You can get a visual diff, you get a visual diff just to see what the difference is before you accept any changes. You can now import, come on, you can now directly import Altium Footpoint Libraries. Before we didn't, you had to save the Altium Board Importer would import them and automatically convert them to key cad formats. Now you can just import the library as it is. SOLIDWORKS PCB files, we now do, that wasn't fairly easy because they're basically Altium PCB files, but we can now import SOLIDWORKS PCB files. There's a do not position flag, DNP for the board editor now so that when you export your position files, it won't export those if there are things that you don't want your pick and place machine to populate. We now allow connectivity on any random shape. So you can draw a shape on copper, any shape you want, sign a net to it, it becomes a trace or another connect or zone, but you can basically draw any arbitrary shape and give it a net name. We've added some major improvements to the interactive meander tuning, here's the new properties dialog that allows you to set the parameters when you're doing your meander tuning. There are a bunch of step export improvements, including if you really want to have ridiculously large step files, you can actually export the pads and the traces and the vias, and your step files will be gigantic, but you can do it now. That was a feature that people requested, but be prepared, you're going to have some big step files. So property panels again is now in the footprint editor, you can see down there I click on an object, I get the properties, I can just edit them in the properties editor. We also have the preview, the hover preview, flyover preview in the footprint editor as well. There's a recent, we now export to IPC 2581. So I know this isn't supported by a lot of manufacturers yet, but we're now in a position where when it becomes more widely supported, we'll already have it in KeyCAD. And there was a whole host of 3D viewer improvements, things like visibility panel, so you can turn layers on and off, and there's a bunch of other stuff in the 3D viewer that were massively improved. So I'd like to thank Roberto because I shamelessly stole this from his presentation at KeyCon. This was a matrix of the changes, and so everything in blue was the importer. These were the importers for third party tools. Everything in blue was in seven, the orange is now included in eight. We have a few gaps, we got Gita, import, and PICAD, but we still need to do the project support for Altium. So right now you have to import the schematic, the board, sync them up together so that KeyCAD is happy, but I think in eight and nine we're going to have project support for Altium. So here's some fun statistics. So the source repo, so between now and version seven, the source repo, and it's actually more than that now, had 4,500 commits by 15 different authors. Actually KeyCAD sits at 1.63 million lines of code without translations and another 176K lines of comments. So we're rapidly approaching 2 million lines of code. The library team has just been busy, and in the last year we've added 1,207 new symbols for a total of 20,000 2023. I'm sure it's more since I've made this. He's clean, that's not here shaking his head. Footprint library, we added 713 for a total of 13,454. And just to give you an idea how significant that number is, I was informed this morning that one large, well-known component distributor doesn't have that many footprints. We actually supply more footprints than some of the distributors do. So I don't think I have the permission to say who it is, so I'm just going to say just throw that out there, but it just shows you how massively improved the KeyCAD libraries are. We added 238 3D models for a total of 6,700. We did slip a little bit in our language translation, so for v7 we had 17 languages that were 99% translated. We only have nine for v8 release, but I'm hoping that situation will improve as v8 gets out there. We saw, I don't know how many people saw this, but Felix and Eisler posted KeyCAD usage, and this actually shows the growth from 2020 to now. In 2020, I think we're roughly in the mid-20% of their orders for KeyCAD, and we've continually grown to now, we're 42, but I've heard recently somebody say something like 50% in the last month. So you see all the other EDA tools going down, KeyCAD's going up. That's a nice trend. I like that trend. And it's an Oshpark also demonstrates similar kind of trends. They're seeing KeyCAD usage go up from their customers. Now, that's not universally true. There's other board vendors, I'm sure, that have different statistics, but most of the board vendors that we interface with directly, they're seeing those kind of numbers. So that's really good. So I'm going to blow through this quickly. I apologize for not having a lot more time. But here's what's coming in V9. So we're going to have IPC support for internal procedure calls. So one of the things that KeyCAD has always had an issue with is our Python scripting. People call it an API. Technically, it's not an API. It's a wrapper around the internal KeyCAD API. So anytime there's something internal changes in KeyCAD, we build the Python scripts, we break stuff. We are in the process now of working on an IPC interface that will act as a go-between between any high-level language, including Python, and a running KeyCAD instance. So one of the things you know is you can actually bring KeyCAD down with a rogue Python script. So we're going to try to fix that in NINE as best we can. And at some point in the future, we'll deprecate the scripting stuff and make everything built on top of the API just to eliminate those kind of issues. And so you'll have a stable interface. So when you write a Python script using the API, it's not going to break the next time you compile and rebuild KeyCAD. We want to do, one of the things that's requested is customizable interface, including toolbar layouts. We're going to try to get that done in NINE. There's some talk about doing a visual diff merge tool for Git. So you look at the difference between, so you'll see the visual difference between the schematic or the board before you change or merge. So you have a merge conflict with somebody else. You can look at the diff and say, oh, yeah, I want mine or I want theirs. We've been getting requests for embedding licenses and project and library files. So we'll implement that. Support for barcodes, multi-user editing is something we've been discussing, whether or not that happens or not. That's a big one, but it's something that we're looking at. And of course, I talked about the pads and the guita importers. We've also been taught, one of the things that people have asked us about is being able to save the old file formats. Historically, KeyCAD has not allowed that, but we're actually in the process of thinking about that. And also, forward file compatibility, we were just open a file, an older file or newer file, and KeyCAD just says, well, I don't know what to do with that. So it just doesn't do anything with it. But if you save it, you're going to overwrite and lose your anything that the old version of KeyCAD, you'll still be able to open it. It's just some of the things won't work. A lot of other applications do similar things. For the schematic editor, there's actually a merge request now for a tool that synchronizes sheet pins and hierarchical labels in the schematic that it references. So you can do this bi-directional updating in both directions. We're going to replace this. So right now, we allow sharing of schematics between projects. We're going to stop that and go with a reusable design block because that particular thing causes us so much grief that we decided it wasn't smart to continue to support it. That will give us the design blocks. We're going to do variants for schematics in nine. I don't know if we'll get the board variants, but at least we'll get the schematic variants. Bezier curve editing tool. People who know there's a Bezier curve support in KeyCAD, but there's no tool to edit them in KeyCAD. So we're going to do that. The board editor, there's also now a zone manager. That's also a merge request that's ready to go as soon as we release eight. It allows you to edit all your zones in a single interface. Instead of having to open each zone individually in one dialogue at a time, now there's a zone manager for all of them. Multi-channel designs, that's in progress. Pad stacks. We're really hoping that one gets in because that's one of the feature parity issues that we have. Like so when we try to import from other tools that support pad stacks, we can't import them. We have to make an assumption based on a best guess and you get the pad stack that KeyCAD could support. So we're going to do pad stacks. Guard rings, that's a feature for those of you who do high-end piece stuff and you want to guard your high-end piece circuits so you don't have to leak each current. Those are useful. Right now our router doesn't really make it easy for you to design a guard ring. And also we're going to do the Bezier curve editing tool in the board editors. Somebody's working on a table tool right now. That'll be in the schematic and maybe the board editor. I don't know whether that's actually going to happen but hopefully it does. So it'll just be a table like any table in your favorite document editor. They'll just be native table support. We want to embed 3D models into the footprint so you don't have to have your 3D models. 3D models are external to the board. They'll just be embedded in the board and you take the whole board with you. You've got all your 3D models. ODB++ export, this one's already also in progress. Our friends at WatchUnextPCB are working on that because their infrastructure uses ODB++ since they, you know, that's what they use when you order boards. They prefer it over Gerber so they're going to provide ODB++ support so if your favorite board manufacturer is an ODB++ only shop, you'll be able to export that. Okay, that's it. Just a quick wrap up here. I can't, I mean I get to stand out in front of the team as the project lead but it's an incredible amount of time and I always want to say thanks to all our developers who contribute to KeyCat. It's really gotten impressive the last, you know, the amount of contributions just keep going up. I'm really, really encouraging and it's fun for me as a project leader to see that happen. Thanks to all our sponsors and donors if you've contributed to the KeyCat donations. Thank you very much. That continues the sustained growth of KeyCat. Thank you for your continued support of the KeyCat project so everybody who uses KeyCat thanks. We really like the fact that you use KeyCat and we hope we can continue to support your needs as a project. So anybody who's ever organized a dev room or anything like this knows it's not a nontrivial amount of work. Thanks to Seth for organizing this. It doesn't happen by itself. And hope I get to see everybody here next year and I hope I get to see everybody at least at one of the KeyCons this year. So keep an eye out. The one in Europe this year is going to be in Germany. We don't have a date yet or a venue. We have people on the ground who are working on it. And as soon as we have that information we'll put it up on the KeyCat website and on the forum and you'll be able to keep your eyes open and then hopefully you guys, hopefully we can see as many of you there as possible. Early September in Bohem. Is that when it is? Early September? Okay. And I'm not 100% sure we're going to have a Shenzhen and Asia one this year but I suspect we will. Has Hubert committed to that? We are going to have KeyCon Asia. It is just waiting for the time we coordinate with Maker Faire Shenzhen. So we're going to be on the same weekend as Maker Faire Shenzhen. So yeah, if you want to go, that's a great dual hit because if you've never been to Shenzhen Maker Faire it's really impressive. You should go if you get a chance. So okay, I'm open for questions. If anybody has any questions, no. Thank you. I had a question about the libraries. Are there plans to move the libraries in a... In a... This deep... Po. Are there plans to... So that a project could import the libraries directly from... Yes. Not right now? So the question is, are there plans to allow importing library objects directly from a Git repo? Because all our stuff is basically saved in a Git repo because we design it. You can import the project libraries, not the globals. Yes, not the globals. So yeah, because the Git support, obviously, the libraries that are already in your project will be part of your Git project. But externally, no, we don't have anything at the moment for that. But I mean, if somebody wanted to spin up a Git plug-in, that wouldn't be the... No, but I wouldn't turn you down because I think other people would probably like similar things. Do you have any plans to integrate some sort of mixed signal, real-time interactive simulation into... Kind of like Muldi's in, basically Muldi's in. Well, okay, there's... So on the simulator front, we've had a lot of fits and starts. I wish I had a better, a more rosy outlook to give you on that. We had some people working on EM simulations, so we were going to take the board, break it down into its 3D representation, and then do like EM and maybe a power solver. But the people... There's several things on that front that make it difficult. The most difficult thing is finding the manpower to do that because that's pretty specific kind of... You have to have a pretty good knowledge of how to do that. The other problem that's kind of been problematic is a lot of the libraries that do that in the open-source world, because we are an open-source project, obviously, we're not going to use like MathLab. They don't necessarily build well, they don't play well on all platforms. And so KeyCat, one of the... If you're not familiar with it, one of the things that KeyCat doesn't do is make second-class citizens. All the major platforms are considered equal. So if I can't provide a feature on Linux or on Mac OS, I'm not going to do it on Windows. It's got to work on all three. So that's kind of been a little bit of a... We've had a little resistance there. I don't think that problem is over... Oh, not solvable, I think it is. So the person who's implementing that's not only got to do the end part, like the solvers, they also got to get all the libraries to build all the dependencies that they need to integrate into KeyCat to build on all three platforms. And that's a bit of a load. So I do think it's going to happen at some time. Obviously, it's never going to be as fast as I want it to, but we do have in our big wish list of things we want to do, it is there. It's just whether we get the manpower to do it. Any other questions? Am I done? One more. Yes. Go ahead. Okay. Congratulations for all the amazing work. Thank you. Contributors and maintainers. And I want to ask about one of the planned features for the next release you talked about, GiveDiffMerch tool. I think it would be amazing if the command line tool could export a GIF animation. So you could... Export? You mean what the command line told you to export the Diff? Yeah. So, like, I don't know, GIF animation, something like that. So when somebody comes with a pull request, you could see, okay, what's changing without needing to download or open? I mean, just an idea. That's not a bad idea. I mean, what, like a ping? We'll put an issue for it and we'll see about that. Yeah. So any more questions, please follow. Wayne will be out in the hallway to answer any questions. So thank you once again, Wayne. Thank you. Thank you.
LibrePCB Status Update
Hello everyone, my name is Wurl van Bruin, I'm the founder and main developer of LibrePCB and today I will give you a short update about the LibrePCB project. So for those who do not know LibrePCB yet, it's an open source EDI software to draw schematics and design PCB and the main goal is the same as KitKat but there are some differences and of course it is a cross-platform, it runs on almost every computer, Windows, Linux, MacOS and more and its main goal is actually to make creating hardware easy, efficient and more or less foolproof with an intuitive user interface and powerful architectural concepts. So while the intuitive UI is especially helpful for beginners to get started easily with PCB design, it's also intended for professional users, for example who care about things like a same file format or a command line interface to automate some tasks. So let's take a look at what happened in the past one or two years because there are some great news. Especially at the end of 2022, it was an exciting moment because I started to work full-time on LibrePCB, now doing it for a bit more than a year and of course this leads to a lot more progress than in the many, many years before. In addition, the LibrePCB project has been approved by the NLNet Foundation to receive funding through the Next Generation Internet Program, which helps a lot to keep the full-time development ongoing. Then our fabrication service got PCBWay as a new manufacturing partner, so if you order PCB through LibrePCB Fab, you can now choose between ASLER and PCBWay. Also, I'm very proud to have several new sponsors on board from last year, Bitelé Electronics, NextPCB PartStack, PCBGogo and WinZorz. Last but not least, there are many individuals supporting the LibrePCB project with donations or other kinds of contributions, for example translations or creating libraries and so on. So with these sponsorships and the donations, the LibrePCB project raised around $8,000 in 2023. In my opinion, that's already quite amazing for this relatively still early state of the project. So at this point, I want to thank all the supporters and contributors and for your trust in the LibrePCB project. So this really makes me happy and thank you very much for this support. I take this as a sign that the LibrePCB is on the right way, so I hope it's okay to continue this way. Nevertheless, it's still a very long way to go until we have a stable funding for the full-time development, so I hope this support continues many more years. Other things which happen beside the application development are a completely new website with much more content, a new documentation system with more documentation and also for a few months now, we also have official video tutorials on YouTube. Not complete yet, but at least a few ones now. But now let's take a look at the application. In September last year, version 1.0 was released, which was a very exciting moment. And beside many new capabilities in the board editor like thermal relief pads and so on, this release also added a 3D board viewer with step model import and export, which is not only fancy, but also a great way to review the design before ordering the PCBs. But actually, I mean, this is known from the 3D viewer, it's known from many other EDA tools. Probably every EDA tool is able to show your preview. I'm actually especially proud of two features which make generating production data really a pleasure. First of all, we have introduced comprehensive support for assembly variants and manufacturer part number management. So, MPNs can now be stored in libraries, so you don't need to add them to every new schematic you need them. In the schematic editor, you can even assign multiple MPNs to one component to export them as second source parts to the BOM. I mean, who didn't experience any supply chain issues in the last few years? So, it's nice to actually specify second source parts. And you can even specify different parts for different assembly variants. For example, assembly a 10K resistor in one assembly variant and zero-ohm resistor in another assembly variant. And to actually make generating these BOMs and any other output data, a matter of seconds, we introduced output jobs as a new unified way to export any data. So, these output jobs can be configured very flexible and stored within the project. So, the exactly same output files can be reproduced on a different computer. You don't need to configure anything again. So, yeah. And if they provide a command line interface, it's also very easy to fully automate the production data generation. For example, if you like to use a continuous integration system. So, now, a short demo is worth more than a thousand words. So, I would like to quickly show you a few of the features. I hope this works. Okay. On my screen, it looks completely different. But, okay. I think it's, you understand what should be there, I think. Right? So, the first is the 3D viewer. Let's see if it actually, yeah. More or less. Okay. Strange. So, I just want to show you actually that the 3D feature is very, very easy. Actually, you don't need to care about it. You just add a resistor or whatever to the schematic. And our libraries have the 3D models built in. So, you don't need to care about them. The part to the board editor, let's say a THD variant. And it immediately appears in the 3D view with a 3D model. And, yeah, it's even possible to switch between different footprints, for example. Different pitch. And the 3D model is automatically updated to the new footprint. Or, for example, vertical mounting variant. So, you actually cannot even do anything which isn't compatible. It's always assigned to the footprint you choose. So, yeah. Now, let's take a quick look at MPN management. I mean, the most simple use cases, you just want to add some component. And you have now the option to actually choose a concrete MPN, because they are now sorted in the library. So, if you add a component by its MPN, and let's quickly also add it to the board to actually make it appearing in the BOM. And when you export the BOM then, I think it was LED3, it immediately appears with the MPN, you're just assigned. So, it's very easy to generate high-quality production data. And another use case, for example, I mentioned before, if you want to add a second source part, you can just choose a different part, let's say, from a different manufacturer. Add it to the same component. It is listed as an alternative part now. And if you export the BOM now, you have a new column with the second source MPN and you can generate the BOM. So, there is no need anymore to manually adjust the BOM after generating it before sending it to the assembly house. You can generate it completely finished. No manual rework needed anymore. So, then, to actually generate the BOM, you can use the output jobs feature I just mentioned before. So, you can also generate these jobs. Every job means one or more files which are generated. For example, the Garber files, there is one job to generate Garber files. And if you, for example, like to send Garber files in a zip file to the manufacturer, you can just add a zip output job, choose you want to have Garber files in the zip file, maybe also the assembly PDF within the zip file. And now the output jobs are stored in the project, so you have to do it only once. And now you can generate production data, for example, for single jobs. Just double-click the job, the files generated and opened. Or you can generate all data at once. And you get, for example, the zip file you just configured, containing the Garber files and the assembly PDF, just like you want to have it to send it to the manufacturer. Also no manual file editing or archiving needed anymore. So, if you make any change to the project, one click and you have all files updated. But of course, not everyone likes to manually generate output files, even if it's that easy, because there is even a more easy option available. If you don't like to care about all these things, just start ordering your PCB right within the application. It's uploaded to our fabrication service website. You even get ERC warnings if you didn't resolve them in your project yet. You can choose your manufacturer just forwarded to the manufacturer you like. And without handling any files manually, you have your project... Okay, I was too fast. You have your project ready to be ordered. Just enter your shipping address, payment information and so on. That's it. So, let's switch to the slides. Okay, so now what's the overall state of the project? Generally, liver PCBs are fully functional and can be used productively for projects which are not too complex. Not too complex because hierarchical scheming is a very important factor. So, what's the overall state of the project? Generally, liver PCBs are fully functional and can be used productively for projects which are not too complex. Because hierarchical schematics and bosses are not supported yet. And also the trace routing tool and actually the board editor in general is still rather rudimentary. So, from time to time it might be a little bit inefficient. So, yeah. And of course, part livers is always a problem. It's not very comprehensive yet, but at least with liver PCBs, it's very, very easy. And to create the missing parts by yourself. So, a quick outlook now. The upcoming release will contain an EGLE project importer. So, it can import complete EGLE projects. And there's also some work ongoing currently to integrate live part information into the application. When you add a component to the schematic, you immediately should see then the part lifecycle status, stock availability and the price. So, this will be very useful. So, I hope we can make it happen. And yeah, it's clear from time to time. Some technology updates are needed. For example, switching to Qt 6. And yeah, for long term, as I mentioned, the trace routing tool needs some improvements and also hierarchical schematics and bosses. I think these are a must have. Yeah. So, if you like to support my effort on creating an easy and powerful EDA software for everyone, I would be very, very thankful about the donation. And to keep the full time development ongoing as long as possible. So, yeah, and there are also many other ways to contribute. Just check out the link here. And if there is any Wikipedia article right here, please let me know. We are looking for some help to publish a Wikipedia article. And please let us know your feedback on the feedback survey. So, yeah. The slides are online. Here are some links to get easily started with Libre PCB. That's it. Thank you very much. Thank you. Thank you for the presentation. I'm using Altium Designer and Qiget, and I work at a shop where Mentor was used. How is the state of the import of Altium Qiget and Mentor? It doesn't exist yet. Do you have plans to implement either of those? Any plans to implement these imports? I think Qiget import would be quite obvious. The other ones, I don't know yet how much effort is needed, how these file formats are known or not known, how to read them. So, yeah, yeah. I think at some day we will look at the imports, but it's of course not a high priority. So, did you encounter any problems with patents or something during your development? Because I'm developing a clone of a commercial software where I'm dealing a little bit with some patents that I might violate during that. Sorry, I didn't understand. Patents. Did you have any issues with those, like registered patents of companies of, I don't know? So far, I didn't have any problems with patents, but yeah, I'm not an expert in this area. So, I just tried to take care of licenses of things I use to hopefully not doing anything against the license terms. Any other questions? Okay, thank you, Urban.
ngspice circuit simulator - stand-alone and embedded into KiCad
I translate. No, no, I directly plugged into the laptop. Okay, so we are going to continue on. The stream going out that is being recorded looks nice, so the rest of us, we're going to suck it up and just listen to what we are here to learn from Holger Voight. So please give a round of welcome to Holger. Yeah, okay, so many thanks. Angie Spice, Circulator and Pink. Talk about Stand-Alone and Embedded in Turkey Cat. Well, I give a short introduction to Circulator. Then talk about what's new in Angie Spice. Talk about the Kikat-Angie-Spice interface and give some simulation examples. Well, conclude with what is next. Yeah, why circuit simulation? You emulate electronic circuits per software. It should be cost efficient at time saving. That's it. Some details, of course, you can check functionality without making hardware. It's very important if you do IC design because fabricating an IC with a defect circuit, this is very expensive. You can check for parasitic elements. You can make variance very easily. You can change some device parameters and see what is happening. You can evaluate new concepts with not too large an effort. You can cross-check against automatic circuit generation as a final simulation test. You can anticipate reliability, make degradation simulations. And it's a good learning experience because you can look into a circuit without using hardware to do so. You can see voltage and currents in different branches in this. Very interesting. Yeah, Angie Spice, what is it? It's a circuit simulator that numerically solves equations. Describing electronic circuits, it can also be other types of circuits. For example, thermal could also be mechanical. And you are interested mostly in time-varying signals in electronics, its current and voltages. It's the open-source successor of the vulnerable SPICE-3 from Berkeley. Okay, we have a circuit. This is a very simple circuit, an inverter with two transistors. And this is the entry to Angie Spice. So Angie Spice is a command line input tool. Many people said, ooh, command line. But I've just learned command line is very nice. KeyCut has got a command line and other software also. So we are not too bad with that. Okay, you have the net list, the SPICE net list, which contains circuit description, power supplies, transistors, some simulation commands to run the thing and some model data to run this thing. The output is graphical indeed. It's a time axis and the voltage axis. The ideal green input. Yeah, it's green still. And the simulated output, you see the inverted signal. Yeah. This is the Angie Spice user interface. Yeah, on the input side, you put in the circuit net list, the circuit description. You put in models or model parameters for the devices you're using in your circuit. And you put in simulation commands. And the output could be data tables or tables to file, of course, could be graphical plots. We use the venerable X11 interface or native Windows plotting capability. Or you can plot to PostScript SVG or use NuPlot or other tools for outputting. Yeah, what's new in Angie Spice? The current release is Angie Spice 42, released in December 27th last year. We have an additional matrix over. I will talk about these things a little bit more in detail in the following. We have a new matrix over in addition to the venerable SPACE 1.3. We support VariLock A-coded compact device models. We allow co-simulation for mixed signal simulation with VariLock digital circuit blocks and mixed signal digital analog parts within Angie Spice. We allow core simulation. Again, mixed signal with C-coded digital. So there is a way to translate C-code into Angie Spice readable shared libraries. And we have a, and I'm benefiting from the vastly improved graphical user interface, key cut, especially the upcoming key cut A is offering for using Angie Spice. Well, the matrix over. What is the circuit simulator doing? The circuit simulator, if you look inside, Angie Spice gets the circuit, makes a setup, parsing the netlist, reading the model files. And then it is, if you do a transient simulation, simulation versus time, then you have the ever circle here between model equation evaluation and these data go into the matrix and the matrix is solved. And then you go for the next time step and you repeat this until the time is over and you look at the output. The model evaluation is already running in parallel in Angie Spice. We use open MPs, so if you have a multi-core processor you typically have today, you use, benefit from that. The matrix evaluation is not paralyzed. These sparse matrix solvers are difficult to paralyze. So we have been looking for a long time for an additional matrix solver. We used the sparse 1.3, developed in 1986. And now we use an additional optional selectable KLU matrix solver, which is ongoing development by T.A. Davis and his co-workers. And with KLU you get a speed-up of simulation by factor 1.5 to 3 if you have large circuits and especially if you do circuits for IC simulation. And this is, of course, an advancement. We allow Verlach A compact device models in Angie Spice. Compact device models, these are the model equations describing modern transistors, for example. These complex, tiny things like FinFETs also would have 500 parameters and lots of differential equations to describe and people do the development in Verlach A. And so we had a real need to have an interface to this Verlach A because this provides access to the modern devices like BISIM bulk, which is for ultra-short channels, or BISIM CMG, which is for FinFET, or for gallium nitride devices, power devices, high-speed bipolar transistors, and so on and so on. Yeah, and we got this set up in cooperation with the company SemiMod, who did this open-source development. We have the Verlach A model description. We compiled this model with an open-source compiler OpenWath, compiled directly into a shared library, and this shared library can be read by Angie Spice, which has got the OSDI interface. So we are reading directly the Verlach A compiled model from a shared library or DLL. Yeah, we make use of this. For example, I've been mentioned maybe already, open-source PDK is for IC design, are upcoming, and one of these is the IHP open-source PDK, and this is a 130-nanometer CMOS process with integrated ultra-fast bipolar transistors. Ultra-fast means 500 GHz or so. The model used for the bipolar is the so-called phobic model, which is integrated into Angie Spice for some years now, and the MOS model is a PSP model developed by, currently I think developed by Lehti in France, and this is Verlach A, and we translate this, put this into Angie Spice, and so we can support this open-source PDK with simulation. This is just a simple example, 19-stage NAND gate ring oscillator, so we have 19 NAND gates in series, feed them back, and it starts to oscillate, and we have your frequency, this is an FFT of this signal, a frequency of 600 MHz, and you divide it by 19 by 2, and then you get an inverter delay of 280 picoseconds. Okay, yeah, we allow digital Verlach circuit blocks into Angie Spice. Looks a little bit more complex, but isn't that much complex? We have a Verlach digital circuit block. We compile this with an open-source compiler very later into some intermediate C code, and then we compile this intermediate C code with some C templates in addition, which are constant all the same. We compile it with GCC or MSVC into a C-CodedShare library. And this C-CodedShare library is read by Angie Spice. Angie Spice has a so-called code model interface, and we have written a code model in decosim, which directly interfaces this shared library. So we can now run simulation with this standard Angie Spice netlist, which may contain lots of analog, plus digital blocks. This is an example, it's just a demo, it's not a productive simulation. This is a successive approximation register analog to digital converter. Six-bit. And this uses the digital SARAR block written in Verlach with the analog part, which is a capacitor array with some switches. Okay, and even if things look complex, using this is not very complex. You need two commands. You have this command, Angie Spice, and Angie Spice calls a script written in Angie Spice control language, and you enter the ADC Verlach description. It compiles the Verlach thing, it compiles the GCC thing, and then you call the Spice netlist with the standard command, Angie Spice ADC dot sir, which contains the analog part and contains the simulation control, then you get this kind of thing. Okay, I just enlarged a little bit. You see that it's a successive approximation. This is the ramped-in voltage, and this is the x-axis is time. And this is a new start. We try to get the value of this point here. It starts with the starting value, and then successively approximates the input. Well, with a certain delay, 8.5 microseconds here, which is the time you need for the conversion, then you are here in the stable phase, and this is the red line just shifted by 8.5 microseconds, is, well, the output signal. Yeah, so digital plus analog. Okay, you can also do this with C-coded digital type of models. You have C-coded independent processes. You compile them with GCC, for example, or with any C compiler. And these communicate with NG-spice via another code model. The digital interface is now called deep-process. Well, this has been developed by Eurospalatis from Isotel some time ago, but we have now, for the recent version, have adapted it a little bit, modified it so it will also run under MS-Windows. And now we can, yeah, simulate some circuit, which has some circuit blocks from C-code. This is just, again, an example, a simple example. The C-code you see here, this is, yeah, a gray code generator. This gray code generator is compiled and loaded into NG-spice, and this is the output. The plotting here is by GTKWave, because this is a nice digital plot. Yeah, and you can use these kinds of blocks. So you define these compute functions with data out and data in and some other, and the time or the clock circuit, clock going in, you can run C-code digital circuit. Okay, so I want to talk about schematic entry for NG-spice, because this is under continuous development, and, yeah, it's a nice usable thing. Why do we want to have such a graphical user interface? Well, NetList as input quickly becomes confusing. You need schematic entry. You need to see circuits, circuit schematics, and then have an interface to the simulator. You can get better documentation, of course, if you group inputs and outputs. This is not an NG-spice development. So we develop this, for NG-spice, don't develop these graphical user interfaces. We make use of existing ones or support the development. And, of course, you need one, because all other simulators are mostly, most of all other simulators have one, so you have to offer one. There are three of these interfaces currently under development, we cooperate. This is a thing called X-Schem, whose main focus is on IC design. There is another one, QXS. This is a very universal interface, which specializes a little bit in RF simulation. And then, okay, we have the key-cat. So I wouldn't say that key-cat is developed because it's a graphical user interface of NG-spice. No, the other way around, yeah? You have heard about this PCB design and layout tool, and it offers a simulation, and the simulation engine is NG-spice to support the circuit designer. So, of course, I can then make use of this beautiful interface. Okay, just show these interfaces in strange colors. Yeah, I won't talk about these. I want to talk about this one, again, in strange colors. Okay, but you could imagine that it could look nice. This is the ischema window with some circuit, simple circuit, a simple phase shift oscillator with a 4.2 kHz frequency oscillating. And down here, you see the FFT. Of course, you see it's not a super clean sinusoidal signal, but, okay, this is the 4 kHz thing here. Yeah, so what is the interface looking like? Ischema does this schematic entry. Ischema generates the SPICE netlist, and Ischema also does a graphical presentation of the results. So, it sends the circuit netlist to NG-spice, it sets model parameters to NG-spice, so the simulation commands, and it gets back simulation results. NG-spice is here used as a shared library to this key-cut process. Yeah, I would like to make a live demo. I don't like these colors, but let's see if we can survive somehow. Okay, this is my starting template. I do not start from the zero because it takes too much time. So, this should become an operational amplifier. Simple thing, amplifier by a factor of 10. Okay, what is missing is the operational amplifier. I try to grab it, grab it from the library. So, we just load the library, it takes a little bit of time, but only once the first time, then it gets faster. I know that it is in the library simulation-spice, and here is the op-amp. I grab it, and I move it, and hopefully it fits because, yeah, it did last time. Yeah, it does. Okay, so this is how you place additional elements. Very simple. But now, let's stop. We don't need any more, I hope. Yeah, and now we do simulation. This is a real-time simulation. I look inspect for simulator, and I get this simulator interface. Well, black is green, and pink is white. Okay, I'm sorry for that. What do we want to do? We want to do the transient simulation. Transient simulation is output versus input versus time. Okay, and so we have, yeah, what is our input? Let's go back and have a look. The input is a sinusoidal signal with an amplitude of 0.1 volt and a frequency of 1 kilohertz. Okay, back to the simulator window, and I just click on to start simulation, and here is our simulation. The input is the small one, and the output is the red, who stays red. That's great. The input is the red signal. Okay, so this is transient simulation versus time. We could have another simulation. To be honest, I have prepared this. Four, this is so-called AC simulation, small signal simulation versus frequency. So you see the frequency behavior of this kind of circuit. Yeah, we again run the analysis, and you see that the amplification is 20 dB, so it's 10, is constant, but the operational amplifier has one single internal pole, and so it goes down. Okay, so this is very quickly, you just see what's going on. I think I have time to make some additional change. I put an additional capacitor in here. I collect my capacitor, I transform it because I have to rotate it. I put it just in here. Let's do it in here. And I have to give it a value. I guess I take one mic one. Yeah, and then we go back, and do the AC simulation again. Oops, there's something changed. We have this, this is sort of low pass behavior. It stayed, and now we have some high pass behavior for the low frequencies due to this input capacitor. Yeah, so very quickly you do a small change, and with a simple click, we are there. Okay, so this is what I wanted to show live. Let's go back to the slides, and I give some more examples. Yeah, the first example, this is, again, why do you want to simulate? This is a 2.5 kilowatt class D audio amplifier. And you would say, this is strange. No, you go to some Amazon and click in looking for these kind of amplifiers and 300 bucks. You can get a kilowatt amplifier today, because it's a digital amplifier. And, okay, so what did I do to get this simulation? Okay, I made a symbol myself of this audio driver circuit, just drawing the symbol. And this audio driver circuit is also something I created myself, because it has the analog input. It has a path width modulator. This is a translation from the analog signal to a pulse width digital signal. It needs something more. It needs a complementary pull output, because we have two transistors here. And it has a dead time generator to avoid shoot through, because what will happen? You have minus 100 volts here, plus 100 volts here. And if you manage to open both of these transistors at the same time, you will see the result in form of smoke. And so you have to avoid this. And, okay, and some simulation commands in here. The input is 2 volts, again, 1 kilohertz. You see the power supply. The output load is a 2 ohm resistor. Well, and this is the output. This is the input signal, and this one is the output signal. Okay, and with the double frequency, you have the power signal, the blue one here. And if you do an RMS over this output power signal, you see here it's kilowatt up to 4.3, for example, you will get an output power of 2.6 kilowatt. The simulation has a great advantage. Nothing explodes. You can just do it, and if you do, you can investigate the output filters and can check loudspeaker models and everything just by simulation. Of course, you can also do real-time real amplifiers. This is Tiberio Vecol has made this Q17 amplifier derived from the famous quad 405 audio field amplifier. You see lots of transistors in this thing. The output stage, the input is an operational amplifier. This is the modern contribution of the whole thing and some voltage generators here. Well, yeah, and you can, of course, simulate this and look similar to our 2.6 kilowatt, it's 100 watt, and what you see here is just at 300 milliseconds, we switch the output load from 8 to 7 ohms automatically to check what the output load would mean, and you see a little bit increase in output power. So you can model all these things and model the influences and so on and so on. OK, NG-SPICE allows to do mixed signal simulation. Mixed signal simulation means you have analog and digital circuits in the same simulator, and you could also simulate the digital part like the analog part, but this takes a lot of time, and if you have more than a few gates, it would be much too slow. So NG-SPICE includes an event-based simulation, which is very fast, and this is a mixture. Well, this is the veneral 7400 series of devices. You have flip-flops here, you have some output decoders and some NAND gates, and you have some XOR, or NAND, this is NOR gates. Yeah, and you can simulate this whole thing together, and you see that this is mixed signal means because we're using the digital output here for a delay line. So we have an RC delay and another RC delay, and we have the original signal, and so this gives an output pulse of a specific width. This is the clock signal generated in this circuit, and this circuit here, which is shown, is a rotary encoder, so encoder which does give optical signals when it's turned around one or the other way, and this is the digital output, again plotted with GTK wave, and you see this here, the Q1 signal is coming before Q2 signal, and because in the rotary decoder these two decoders are shifted a little bit, so you know that this is turning left, for example, and here the turning is changed to the other direction, and you see the Q1 is coming later than Q2, and this is detected by this circuit. You have here the pulses, let's say for turning left, and then left turning, switched right turning, and you see the output pulses here for the turning right. So mixed signal simulation is, and this is effective, because the whole simulation thing is 25 milliseconds, so it's ultra-fast, it's click, and it's there. You can even run this on this computer here, which is not the fastest machine. And we can have pure digital. I made a symbol for this up and down counter. You have the input clock, you have the input up and down signal, and here it's a 3-bit, 8-state counter, and inside of this is a state machine, and it's a very, very simple state machine. You have here the states from 0 to 7, so the 8 states. Here are the signals you see from 0, 0, 0 up to 1, 1, 1, and here is what the states are switching. The input is at 0 state, and the input is 0, input means backward counting, then the next state is this one here. Or if the input is 1 and we are at state 0, then we go to state 1. If we are at state 1 and we count down, we go back to state 0. If we are at state 1 and we count forward, we go to state 2. So you can do very simple programming inside one of these code models used by the digital event simulator of NG-SPICE. Well, and here's just the signal, the clock signal. This is the up and down, the up and down signal, and we count up and count up, and then we switch to down and then we count down and we switch up again. So very simple simulation, and the simulation time of this whole thing is mere 37 milliseconds, so it's very fast. Okay, so much about the examples we have. What's next in NG-SPICE? Here are listed some ideas, some more or less fixed plans, and some actual activities. We will do more tests with the open source PDKs, supporting the sky-water PDK, and especially the upcoming IHP PDK to support analog mix signal and RF simulation to support these kind of designs. We will improve the RF capability by adding harmonic balance with a special effective method, for example, to simulate intermodulation of signals and so on. We will support reliability and degradation simulation. Well, nothing lasts forever, chips don't last forever, and people sometimes want to know how long they will live, and so you can try to model that, and this will be done here. And hopefully with a funded project, this is very interesting. There has been the request for transient noise simulation. This is a difficult task, because we don't want to rewite the complete simulator, we have to figure out ways, and again here it would be very difficult to do that. If somebody is interested in integrating this into NG Spice, please let me know. We will improve the usability of key-cut NG Spice graphics interface. Continuously, people are requesting things, and we are detecting things, and we can try to simplify things, we can try to support more of what NG Spice is offering internally right now. For example, the digital simulation is, should be supported by having digital basic blocks as input, and digital plotting, for example, as output. And we have to enhance compatibility, because the world is, somehow we are competing against commercial simulators like LT Spice or Q Spice, or P Spice, or H Spice, and what other... We cannot do this in full, but the basic things should be compatible. But all these four I have mentioned have different, slightly different input languages, slightly different models, and so you have to take care of this somehow. Yeah, that's it. What I wanted to provide you with information, here is some support, websites, if you need more details, here they are. Thank you. APPLAUSE So, while we are taking questions, the video team is going to try to repair the video locally, so your questions will not be able to refer to the slide. Hi Holger, you said something about the creation of semiconductor devices. Would it be possible to simulate the creation based on radioactivity? Yes, this is included in this development plans. Thank you for the presentation. A quick question is how do we input the state machine in the component? Is there a special window where we come and we type it, or the state machine must be written in a dot c or dot something and we give it to the component? Yeah, the simple state machine, the question is how can we code the state machine into ng-spice? The simple state machine I have shown is just a text file. This text file is loaded, you put into your spice netlist a single line with a specific model and this model loads the state machine. That's it for the simple things. The complex, you could of course write state machines and c-codes if you want to. Then you have to do this translation. My question is maybe a bit naive, but would it be at some point feasible to include the tracks or geometry inputs from KCAD in order to mimic the links that you place between your spice components? Please, it's a little bit... Track width and we also have the PCB stack up. Would it be somewhat feasible to from this geometry inputs associate a kind of approximation of the S-parameters of each lines between the components? Yes, there is some work ongoing. It's not that intensive to use an EM-sover, it's called Sparse Lizards, to extract these data from your lines in KCAD. I think it's a lack of manpower to make this a real tool. KCAD has added IBIS simulation, so you have IC output and IC input, only the output and input signals and many semiconductor vendors offer these models. Then you could basically have a transmission line or an RC line in between to simulate the signal integrity. The problem is, as you said, to get these data from your PCB. Slowly, slowly moving on. Basically, yes, but this is a key-cut or ischima, it's a key-cut work, it's not the NG-spice. The NG-spice takes the transmission line parameters or takes the parasitic capacitance resistances and then does the simulation. So the EM would have to be data from the key-cut? Yes, exactly. The EM has to come from the key-cut. I wanted to ask if anybody has used the C interface to, for example, make simulations of existing microcontrollers or things like that that you could have in your design. There has been some activity on this, very scarce. I think it's two. It's yours, Platysy, from Iso-Tel. Just look up his website, Iso-Tel, and you could find some information on that. There has been another guy, I think he has used Arduino interfacing to NG-spice, but I don't know much about this work. Are there any dynamic languages that are possible to be used as a model, or is it just compiled languages that have to be loaded? If you don't care about simulation time, for example, would it be possible to use any scripting language to... Yes, there are various kinds of making models. You have the very old A-road, but this is compiled and static. And it's compiled, it's there. You can do models with NG-spice internal nonlinear voltage sources, for example. And these are very dynamic. And many power semiconductor device makers, they make so-called sub-circuit models, which are comprised of spice commands. These can be very complex, difficult to debug, but then you can do whatever you could imagine. Is it possible to perform simulations over PVT, so over process variants and voltage variants? Yes. And would it be possible to do this without changing any of the models itself? Yes, this is the typical content of the model. Content of modern semiconductor PDKs when you think about IC simulation. The worst case simulation or corner simulation is typically integrated. It's different model parameters. The model stays the same, but certain parameters are changed. So we have a question from online. Just heads up, we're still working on the video, so lucky for us. Holger is able to continue answering questions for the foreseeable future. Online they are asking, is there any post-processing of waveforms such as THD, FFT, etc. possible? FFT is standard. FFT is standard in NG-spice and is standard in the Kikat-NG-spice interface right now. It's more or less two clicks and then you have it. You can set up, NG-spice has a very powerful scripting language, well another language. It's not Python, it's another language which originated in 1990. So we keep it up and have more than 100 commands available. And you can do a lot of data processing with this scripting. So for example, classification into bins, or if you do Monte Carlo simulation, you can run Monte Carlo simulation. You can classify these data into bins. You can do a lot of post-processing internally in NG-spice. Well, of course, if this is not enough or you want to use standard interfaces, there are Python-NG-spice interfaces available. So you can use all these Python libraries which are there for data processing. So it's a lot of action, but the action has to be done by you. Okay, we have time for one more question. You do not actually work time. Okay, so let's give Holger a round of applause. Thank you very much. Okay, so we're going to check.
Modos: Building an Ecosystem of Open-Hardware E Ink Devices
Hi folks, thank you. So my name is Alexander Soto. I go by Alex, the founder of Modos. And yeah, so we're building an ecosystem of open hardware e-inck devices. There's a box of a PCB, red one, green one, and also our prototype or our paper monitor. Please don't plug it in. If you're interested, I can show a little demo in the back. And let's get it all back. I love to get my PCBs back at the monitor and such. But yeah, trust you all. But yeah, please check it out and I'll do a little bit of a live demo a little bit later. We'll do it here, but think it'll be a little bit complex. But yeah, please all pass it around and check it out. So a little bit of backstory. In 2021, sort of the height of the pandemic during lockdown, my bedroom kind of transformed into a workspace. And just from the morning to night, I was just constantly being distracted, having to refocus and then be distracted again. And yeah, being in front of a computer for 13 hours, I got to the point where the device that I'm using to be able to do my work is also the same thing that I use for leisure, is the same thing that I use for entertainment. So can we have technology that is different, that is more calm, that is more humane, that's more aligned with what I got while being. So our focus is reimagining personal computing with a focus on creating calm, inclusive and humane technology. We'll upload the slides, but here we have two links to some earlier videos where we show off our electrical ferretic display controller. It runs at about 60 hertz, open hardware, uses an FPGA, get a bit more into the details of the hardware specs in a little bit. And this is the team that sort of turned that vision into reality. So Wenting has been the lead designer of creating the Electro-Ferretic Display Controller. He's been working there for quite some time and going to wait a little bit so we get the whole image back on the screen. Oh no. Hey! Thank you, awesome. His presentation is read by design. You didn't believe me. Yeah, so recap, Wenting has led the design of our Electro-Ferretic Display Controller. Brody has worked in particularly on the CAD manufacturing chassis. Michael had many conversations thinking about what would it look like to create a software architecture or software stack that's tailored for E-Ink has like a medium. And I'm kind of the guy that does everything in between and kind of supports everyone, thinks about this nonstop and tries to make things happen. Alright, so lastly I also want to say thank you to our community and also the NoNet Foundation. We've had about 300 plus people who want to be able to be in our private program, about 5,000 people in our mailing list and also about 3,000 testimonies and in those testimonies as well a lot of feedback, learned a lot and there's also, sorry, and also I'm going to say thank you to our NoNet. NoNet really, we're a NoNet-sponsored project. It looks up on our caster and yeah, thank you for your support and thank you for helping us really getting to finish our prototype. Okay, so from the community survey findings, we did a community survey. We're asking folks, you know, what are the particular use cases that they use their computer and or also what they would like to use for the E-Ink device and the overlapping categories were reading, writing, coding and focused sort of focused task and those are the majority of categories that people had mentioned. But that I had a general idea that most people would be interested in but I think where I learned a lot with the takeaway were the sort of same problems that I was experiencing myself with just being distracted, being stuck on the rabbit holes or having to use a computer for an extended period of time kind of got that a lot from different people in the community who also kind of expressed similar concerns and then here as well, folks discussing about problems related to like eye strain, people who have tried to use like other solutions and still, you know, filtering glasses and such and but still being a problem. So overall, there's these sort of general categories of one people who are looking for more of a balanced digital life, sort of reducing screen time on social media, entertainment, unplugging, seeing the sun, being outdoors, seeing away from a screen. But also people who are looking for like less visually stimulating digital environments and trying to reduce a digital clutter. So that was like one group or one demographic. The other one being folks who experience some form of visual impairment or maybe some form of like light sensitivity. For example, there's, I always mentioned this, I need to look up the specific person who did who filled this out but there was an engineer who was writing on behalf of his wife who was experiencing epilepsy and she tried all different types of solutions and just is trying to find something so she can be able to interact with her digital devices and that comment has stayed engraved in my mind and a big motivator for myself. But there are other health issues that people reported, things related to like myopia, epilepsy, light sensitivity, headaches, migraines, traumatic brain injury and post-concussion syndrome were quite frequent as well, which to me was also very much a surprise. So here's my pitch, I guess you want to look at it. I think there's a need, right? I think there's a need for being able to create technology that like satisfies our essential needs but also protects our well-being. I think we can maybe redefine the role of our devices to foster more healthier balanced life and hopefully with the start of this control is with the display controllers to create a new class of devices that are kind of like built from scratch to sort of embody these principles of human technology both through hardware and software design. So, Alan Kay, people who are really serious about software should make their own hardware. So, I think hopefully the monitor I have no idea where the monitor is but hopefully it's being passed around. People check and take a look at it, great. So, that's our monitor, that's our newer revision that we have. We built that using our key cat and also our free cat. We have a bit of a block diagram here for folks. Also, folks want to know a little bit more in details. We recently updated our repository which has a much more documentation into specifics of how it all works. So you can feel free to take a look at that. I have some excerpts from there as well. And yeah, so we're using a Spartan 6 FPGA Type C for the display port. We also have a HDMI or DVI video input. Using a Raspberry Pi 2040 for USB communication and upgrading anything related to the firmware and waveforms. And then this is the caster block diagram for our FPGA. Good. Take a look at that. Again, I would redirect folks to take a look at the documentation that we have on our GitHub that goes a bit more in detail than I could possibly do in this one presentation. But some of the features of the display controller is that it works with screens from 6 inches to about 13 inches. Works with the sort of black and white electronic operating displays from E-Ink. Also works with color displays and also DES. Really extremely low processing in the video. Finish show it, but also do a live demo. Very low processing delay. Yeah, we got four level grade scale, 16 as well working. Let's see. Yeah, optimized for the four level grade level scaling. If you ever use a commercial E-Ink monitor, they have these buttons that are in the front that switches these particular modes. We also have that. I don't know who has the monitor right now, but yeah, in the back, yeah, in the back there's like a little button. It's a little blue button. It doesn't work right now, but no, no, it needs to be connected to like a laptop and such. But I'm just saying in that button you can cycle through and it switches between different modes and the different modes are for the particular use cases. So if you want to be able to focus on typing or reduce input delay, there's a mode for that that you press it or if you wanted to use it for looking at black and white image and looking at grade scale it switches and that's all happening locally through the host software that's on the hardware itself. And a little bit more about how it's driven. I'm not going to go read through all of that. Let's see, I want to do more slides. So yeah, so pixels, they're ranging at 2D array. Refresh rate to between 50 to 120 hertz. It's a bi-stable display. So they maintain their stay after the electrical field is removed. And the sort of frame buffer driving mechanism, it uses two frame buffers to determine the picks of colors and then the picks of color changes between the particular value 0, 1. And there's also sort of a global counter of how it's used for the track, for the frame duration. So this is just a little bit of the basics when it comes to electrophoretic display controller and sort of e-extrins in general. Going to create a better version of this. It's a bit more simplified, so I apologize to just sort of make it, yeah. So let's see, give us grade scale. So when it comes to grade scale you're often switching between black and white in order to be able to get that. So you're constantly switching between zeros and one and kind of switching between these particular different modes. And then lastly, so one of the optimization that we've done is that instead of having one global counter, we allow for individual updating for the region. So we're updating each pixel independently. And we also have this sort of like early cancellation method. Could talk about it more, but just kind of want to leave it there. And I think just the last thing for next steps. So, how much time is that? Okay. So for next steps, we have a, so we've been working on this for about two years now and I think we finally have the prototype more or less finished. So we want to be able to do a crowdfunding campaign, most likely on CrowdSupply this year. So there's a link here where it, if you want to be notified when the crowdfunding campaign happens, we can send you email to be notified. We're also a relatively small team of about three or four people. There's also a separate link that as for folks who want to contribute with just various different skills if you want to be able to support in documentation, in CAD or getting more involved with the display controller. We have a link there for more information. Don't think, yeah, pretty much wraps things up. I think it could talk a little bit more, but I'd rather leave room for questions. And that's it. We have some, okay, let's see. Questions are always right in the middle. Thanks for the talk. How do you deal with the waveforms being proprietary? We've generated our own waveforms. Do you put any work into updating those, improving them? Yes. Sorry, the question was related to waveforms and how we generate them. I'll say that we generate the waveforms ourselves and there's certain similarities or patterns across different displays, regardless of different sizes. So we are maintaining and updating the waveforms that we have right now for the 13-inch panel as well and for the 6-inch panel. So you focus mainly on the hardware, but especially doing focused tasks, for example, requires quite some specialized way of displaying things. Are you also providing some kind of solution for that in software? Yes. So the question was related towards hardware that we've been focusing on that and what would things look like on the software side? Yeah, we've spent quite a bit of time looking into what that would look like. The one approach that we've looked at is to use in Wayland protocols in order to use things related to damage tracking in order to be able to do partial refreshes. For example, if you have two overlapping windows and you drag in one window over, it would recognize that this is the area that has changed damage tracking and would only update that particular region rather than doing a whole full refresh. So I think that there are things, for example, that we can use with Wayland and Wayland protocols at the higher level of the software stack that in one way you can look at it is that it abstracts away the idea of waveforms and you let the higher level software stack take care of that with the display manager. So one of the hopes and dreams is that we can work with SORSA with Drew DeVau and a few other folks raising up funding for that to be able to work with them to have that be part of something we can do, which would allow us to one, create applications that are native for E-Ink and also backwards compatibility. Yeah, just one ring. I have no... Yes, oh, hi. I am so sorry. Yes, hi. Yes, why did you show us the Spark on 6 as a platform since it's quite old by now? It's what we were most familiar with and what we had access to and experience. Sorry, the question was why did we choose Spark on 6? It's what we were most familiar with and had experience using. There is another gentleman, another person from NowNet, Victor Suarez, who is also interested in porting our work into other FPGAs. So I think it's not tied to Spark on 6, it's just where we started. Yeah, so I think it's regarding the FPGA only. So you have plans to upgrade to let's say FPGAs that are faster or this was just the original decision, but just because you were more hands on that particular family of FPGA. Yeah, so the question is related to the FPGAs if we plan to use more modern ones. So yeah, it goes back. That was the one we're most familiar with and that's the one that we're going to keep using maybe in some future with Switch, none of the one. Spark on 6 FPGA is doing the job right now. People are more welcome to contribute and port it to other FPGAs. That's one way, so welcome that as well.
The Basic Economics behind Open Source Funding in 2024
Okay, our next speaker is Amanda Cassini. We are very fortunate to be able to get a talk in the open hardware dev room on one of the topics that in the open source community is much more easily addressed in software, but is harder in hardware as some things are. So please give a warm welcome to Amanda Cassini. Thank you so much for inviting me here today and for inviting me to speak. I'm very excited to talk to all of you about economics and economic modeling, which to be quite frank is actually not my background, but is something that is very important for the work I do now. And so who am I in case you're wondering? So I'm Amanda Cassari. I'm a pale white woman with light hair and eyes. I usually wear glasses, which I should put on now if I want to keep reading things. I'm a researcher and engineer at Google and I'm currently leading a team focused on research and education in our open source programs office. I'm also a co-lead for Project Ocean and Across, which are an external faculty to the Vermont Complex Systems Center and a co-founder of Open Source Stories with Julia Farrioli. Who's here today? Thanks for being here, Julia. I sit on the board of directors for something called the Computational Democracy Project and I once wrote a book with Alice Zhang on feature engineering. I'm also queer, a very proud mom of two smaller humans, a US Navy veteran and lucky enough to live in the indomitable state of Vermont in the United States. It's just south of Montreal, which is easier for some people to know. And if it helps you better to follow along or understand what we're working on right now or what these slides are, please, there's a bit.ly that you can see all of these slides and my speaker notes, which will amount to a transcript. And that's bit.ly backslash e-c-o-n-o-m-i-c-s dash o-f dash o-s-s dash dash f-o-s-d-e-m two zero two four. And so I also just want to put in the caveat. I'm very sorry, but something about listening to things from 1.2 to 1.5 speed means I usually talk at that speed now as well. So hopefully if you miss something, you can go back and look at the transcript and speakers later. And I'll just make sure I'm being at a good accessible rate. So like I said, when I gave my introduction, you may have noticed I am not an economist. However, I am a complexity scientist and I try to see the world through a mix of applied mathematical models and abstractions. And it's this background that I'll be leaning into as we explore economics of open source and open hardware that exists right now. I guess to a little bit more just background, why do I care about this? So I've been working on this. In 2019, I was originally working in Google Cloud as a DevRel manager and an amazing team, but I also had this really nagging feeling that there was something kind of understanding about the way that technology and open source worked that we were fundamentally missing out on and we were trying to estimate and understand our work. And that's how I met actually through Julia, was the half bakery of ideas and discussing something. I was like, I have a niggling feeling of something we should be working on. She agreed and together we were able to pitch an idea for what is called Project Ocean. And that's a pilot. It was a pilot, it's continued on now and that's looking at collaboration between academia, industry and communities to understand open source at a global scale. We did launch this in early 2020, rough time to start a new project, but we were able to achieve many of the goals that we had then and that work does continue in partnerships and collaborations now. One of the most audacious goals that we had was mapping out the entire open source ecosystem and then sharing that with everybody. And it sounded honestly at the time like it was not that audacious because it just sounded like it was a way of looking at information and then sharing that in a way that everybody could see, which was something that we all deal normal basis of just making information transparent. But one of the barriers that we ran into then and we continue to run into now is just even defining the problem space of open source with stakeholders and communities as a shared understanding. If you've ever done any kind of research, that's always the place you have to start as being able to actually define the boundaries and the constraints of the problems that you're working within. And the reason that this was a challenge wasn't because there wasn't precedence in explanations of how you could go about that. It's that we were finding that a lot of the mental models and analogies that were used were not actually universal or global. Those models were actually preventing us from having a deeper understanding of the complexity of what was happening in open source because they were too simplistic or honestly in some caries is just the analogy that no longer existed today and was not the reality of the world that we were living in. So that was really hard for us because all of these models also made an assumption about the kind of baseline that you're working off of. So it gave you an understanding of like, well, this is what we should all assume in terms of the number of people involved, the kind of work that is critical, the amount of money that is existing and transacting to be able to keep this ecosystem thriving. And without all of that, this is where problems might be existing. So if we were looking at this concept of risk and resiliency, you need that baseline to understand where risk and resiliency either has interruptions or needs additional resourcing. So we were struggling with this fact mostly because organizationally, as a group and within this larger company, it was challenging for us to understand how we could move forward not just effectively but in a way that moved with purpose. And we can't do that until we start breaking some of those popular over-reductionist models of how digital structure looks today. This is a very popular comic. You have not seen this yet. This is used a lot to show demonstration of gaps, not only exist in understanding but also in a collective failure to respond to the challenge that exists for this large global ecosystem. So just to describe it briefly, this is the XKCD comic. There's a bunch of blocks. There's this idea that this represents all of digital infrastructure. And at the very bottom in a critical space is a person from Nebraska who has been thanklessly maintaining something since 2003. And more often than not, I mean this comic gets thrown around quite a bit, but I've actually seen this in position papers as a reason for why someone should be getting billions of dollars to do research or initiatives. This is their example they're giving. It's supposed to be demonstrative of the fact that the open source ecosystem is brittle and that large-scale investment is critical. But for that very specific purpose that that group wants to get billions of dollars for. And this is why this is my counter to this comic, is that the reality is that we're already spending billions of dollars on sustaining open source. I dislike the term sustaining because you can sustain broken systems. It doesn't mean it's working or it's working well. It just means it can keep going. And so because this has been so effective in centralizing buckets of money and attention, there are walls that are being built around organizational spaces for collaboration that we have to break through in order to keep moving forward. It's not that we're not investing, it's not that investments are happening, but it is that increasingly we are localizing decisions for investments and support in a way that is not creating resilient systems. Which is a problem when the system you're dealing with is not localized, does not have local optimans. It's in fact global, decentralized, and it's an equitably resourced. So to move forward we need new approaches for both challenging the assumptions as well as providing well at paths as understanding how we can approach this differently. So this brings us to the impetus behind mapping out these open source ecosystems. So not just creating this natural understanding, but really what this whole idea was about is that we were assuming and we still assume that if we have better information and better ways to understand the world we can make better decisions. And this is important for my work especially and what I get paid to do because that's understanding where resources that we have available should go. And they always have to be in the context of the business. That's what working for a for-profit company constrains me to. But we have to know where we are so we can work collectively to understand where we need to go next. And one way of knowing where we are is to always look at where we come from. And I don't want to discourage the idea that history cannot teach us what we should be working on and where to go next. However, some of the work we've defined as black swans, these are events that have fun and minimally changed and impacted how we approach open source today. They also sometimes get used as a scare tactic. And again to centralize over centralized resources in an already constrained environment and sometimes by the very organizations that we are trusting to work in our larger best interest. It may be true that there is an intolerable amount of risk that exists not only because of incidents like these but that we haven't even seen yet. And it might be that investing in those in a way now will prevent something further in the future. It might also be that the next problem that happens like this will also only be visible through hindsight and no amount of centralization right now will prevent that. So we can be informed by the past but we shouldn't be scared by it. So I think that my argument here is that we should be able to move forward with a critical eye again with like looking at better information, better understanding of what currently exists now. I've previously talked about tools and frameworks that organizations like open source programs offices can use to identify stakeholders and especially in companies as well as how to define and design metrics for regular reporting. These are honestly frameworks that are useful for any type of work not just open source but they do help in moving forward with something that's very large and messy where you don't have an ability to look at customers coming in and things going out. And those are building blocks of information for teams who work in an open source to frame the value of our work for businesses, partners, communities. This work was not created either randomly or on a whim or in a vacuum. So our team together had been working towards this very specifically because we were asked at the time to develop an ROI model of open source that we could use across the board, across the business. And this was for all of Google, for all the work that we did, for all the investments we put in. How can we describe that using an ROI model? My problem at the time which still exists and exists now is that I still maintain that ROI is almost always the wrong economic model to use for open source at scale. The reason for that is ROI is a very specific economic model that assumes first order inputs and outputs. And so those first order inputs and outputs mean you always know what's going in. You can see those direct effects and it's always an output. If you are a network scientist, this means you are always looking at a first order system. Open source is not a first order system. And so if we try to boil it down and make it a first order system, we are actually either abstracting away or making an overreductionist model that does not actually serve us. One example of this is when projects are tried to be measured simply by lines of code. Lines of code has not actually been shown over time to be something that is an indication of either productivity or even efficiency. When it is used to judge productivity and creativity, then we run into large problems with teams being devalued. We can't simplify that away. So what are the abstracts we should be using in taxonomies? I realize I'm going to go through this fairly quickly. So all of this is grounded in the idea, by the way, that the current structures and systems we're working within are the ones we have to work within. There's more research that goes through these problems and instructions using critical theory. I'll be taking a pragmatist approach today and just be talking about it in terms of the world we're currently living in, trying to describe those a little bit better. So resources. Resources in this capacity, I'm talking about an abstraction of not just money. And the reason for that, again, is that we boil things down to money quite a bit when we want to understand what kind of resources we're looking for and what resources are important to us. So there's just a few listed here. And again, these are somewhat abstract. This does not even account for things. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. For just playing the call. This was actually, I think this was an interesting achievement though. It also means that if you were considering these kinds of fundings you should understand You You You That's called op-ex operational expenses and that changes year-over-year for companies and organizations Academic institutions always have to take into account some kind of general institutional fee So if you are working as part of an academic university or an academic project group and they get a grant You're not going to get all of that money So there has to be costs associated with it You need to take that into account again as part of your budgeting and as that being your fiscal host Another part is there's a lot of government funds and NGO funds that are starting up right now Those may be restricted only to citizens of that government and then we run into the problem of how do you talk about? origination and citizens for a global project All monetary funding requires fiscal hosts we went over that LLC a nonprofit an organization with a platform and the reason that exists It may feel like that's blocking you the reason that actually exists is mostly related to anti bribery laws Organizations and companies can't just hand people money and then have no record of it have no tax record of it These all basically funded the same kind of guidance as being able to identify and talk about things Okay, so quickly outputs versus outcomes. So again before talking about these larger modeling structures I feel like it's important to differentiate these two Outputs of your project are the very specific units of work which you've identified in your roadmap or plan This is the how of what you want to achieve Outcomes are slightly different. They address the so what of your roadmap and your proposals So this is how you identify the purpose of what you're doing It allows you to frame the importance of your work in a way that may not be readily available to a general community or the group of Funders that you're trying to target So when you were tracking your reporting what comes from your work, you want to be able to address both of these things outcomes and outputs So very specific for open hardware. I'm so sorry. It took me so long to get here The original creator of the bomb Right so part of bills of materials and challenges that comes with funding on bills of materials is this question of centralization and Decentralization of hardware component availability So as you're trying to look and plan out and explain your project things along the way There's a big difference between when you first write your design proposal and then down the line Maybe when you get funding then when you can act on that and when you have to deliver on outcomes and out Outcomes and outputs. What is the component availability that you have and has that changed you to other market forces? high maintainer costs again inequitable Inequitable availability of parts It's difficult always to find people who may have the specific skills that you need and when you're operating on these tight timelines If you do not have a robust group that is working together and you need to do some kinds of swaps Or you need to be able to have someone who can maintain a changing bomb over time Then you're gonna have to make sure that you are Working on those in a way where you're not necessarily going to be able to depend on depend a bot You may not be aware that something was either obsolete or that was impacted on a timeline that now moves the entire project weeks at a time So get lead and delivery times also the challenge of one versus ten versus ten thousand when you're trying to order parts And when you're trying to get things from manufacturers, they have to be able to create and work on those in batches Also, your household may not be have the size to be able to have 30,000 of one tiny piece in a box that you only need 10 of So just in terms of funding asking for funding when you're thinking about how to ask for different parts That's also something you have to consider based on the kinds of project you're building or what other people are building is Going to be as well that consideration of how do you share resources that are people or time? Okay, there's no option up not every single kind of development has an open option for all kinds of hardware So especially working with different kinds of chips different kind of microprocessors sometimes the only Option that you may have is proprietary software to be able to work with your hardware You have to be able to budget for that as well And that's difficult for folks to understand who are used to funding and all of the tool chain They're working with is an entirely open tool chain that doesn't have a cost Associated with it licensing fees are absolutely something that you should be including as part of your budget You have the option to build it yourself, but you may not always have the option to build it yourself So working with proprietary tools puts an additional cost on your budgets Working on funding lack of funding. There's a lot of conversations around software funding There's not as many around hardware funding at least not for specific purposes So I think in general this is also a challenge when just when working between different kinds of funders and funding Where is it that you fit that you feel comfortable with? There's a higher cost of getting it right the first time when you have to order something Then there's a long lead time and then it shows up and you realize that you've crossed a few of your paths And now things blow up But software you can just go through and re-change it and then redeploy it It doesn't exist the same when you're trying to connect components for hardware So okay very quickly. I'm so sorry ROI ROI actually can be used sometimes So I don't agree with it as a large-scale Model to work with however if you are working on something like a contract or you're working on a very specific outcome of work This is a good time to use ROI in your proposals So being able again those identify those inputs the kinds of work that you're doing that out comes in the outputs This is something that you can boil down to something like money and time As long as you're always taking into account things that do have monetary equivalents and direct effects There's also something called XYZ model This is basically assuming a black box in the middle So the red is kind of inputs the blue is outputs you have a feedback loop What happens in between is a little bit fuzzy, but because you're able to describe it and again from that like kind of outcome perspective Then it's easier for people to understand what you're doing and why Another kind this actually recently came and it's a good example of this from Hoffman at all paper the value of open source they take a supply and demand model They worked with some very specific people to understand what kind of things they're using within the corporate industry And that allows them to look at things more from a supply and demand perspective So there's even more things especially that apply to software and hardware There's many other microeconomic models that I think that can be applied and have be tried in this case Again, I think the main takeaway hopefully for this and the challenges that continue to exist is that there's not one that fits all There's not one that fits every project But a lot of funding is designed to examine things through one very specific project being clear through as you walk through and Identifying your needs as well as identifying what resources you have and what the outcomes will be will depend on your project Don't try to squeeze it into the model They're asking for if you feel that that's not going to accurately represent the needs that you have all it will do will cause What's exactly happening to somebody I know right now which is they have a week left and a bent frame with a four week time frame Or a four week lead time So they tried to save some money Didn't work out materially and now they're going to be spending six times as much to get to that outcome Part of that was through inability to look at resource planning and explaining things in a way that actually helps there The person who they're working on identify it as something that they should be spending that money up front on So collectively understand where we're not on the same boat. We're in the same storm We're not on the same boat, but hopefully our flotilla together Is a place where we can continue working to support each other and where we want to be Thank you You
QUBIK a 1p PocketQube satellite platform
Okay. Our next talk will be from Ilyas and this is the first of a dual of talks that we'll have on a similar topic, so you are in for a treat on the future of space flight in open source. Thank you. Thank you Ilyas. So my name is Ilyas. I am a core contributor at the Libre Space Foundation and my job title is doing space stuff. So before we move on, let's get the audience up to speed. In the title, I think there are like two words that may be unfamiliar. So does anyone know what a pocket cube is? Okay. A pocket cube is a really small satellite. It's like five centimeters, a cube of five centimeters size. So it's this. It's really small. And the one P, the one P is the one unit for pocket cube. So this is one P. If you go to two P, you extend one time the size three P, you extend two times the size four P, it makes no sense. You just go to a bigger satellite. Do you know what a deployer is? Most of you. Okay. So rocket people do not like to put your satellite with duct tape on the rocket. So they want a fancy box to put everything in. That's a deployer. And the last term, do you know what Leo is? Leo is low earth orbit. Start from around 200, 300 kilometers and go up to like 1200. So this is the terminology. Let's start our story. So this is the story of cubic. So it was, I think, three years ago, summer time. And we got a phone call with, it was actually an email. And they told us, you know, we have a spot on the rocket in a deployer for one P satellite. And it's free. And this matters because the cost for launching one of these could be like 15,000 years, maybe more. So that's a great opportunity. And we said yes. And they said, okay, you have a few months to get it ready to go. Okay. So let's see what we have. We looked around on the space staff we have. And we have a comms board. That's important. Because like a basic satellite, you need something to communicate with. And you need someone, some source of power. And that's what we had. So it was quite capable. It can do all this modulation and stuff. I won't bore you with it. But it was a starting point. It needed a bit of tweaking. And that's it. So where are we going? They told us you're going to an orbit of 300 kilometers. That's nice. For the weight of these satellites, you expect to be up there for one to two weeks. Good enough. And it was the first flight of the rocket. Not good enough. So we came up with the plan. There was lots of hardware that had to be done. So we had to test the comms board we had. And then we needed a way to add power to it. And even it would be even nicer to be able to not just have a battery that will die and have some solar panels to get extra power. And then all these had to be somehow bind together to form a satellite. And then put all this on a plate that can go into the deployer because there are some kind of specifications that require you to have like a specific shape that goes in there. And then there were some testings where you put this in an oven and some materials that may evaporate. They do that in your oven and not on the lens of the satellite next to you that messes everything up. And then there is some campaigns where you torture your satellite and shake it and hit it. So you make sure that nothing will come loose during the launch. And then you just send it to the rocket people and you have pizza. On the software side there was some software that could do communications but it needed to be tested and evolve a bit. And then once we were happy with what we have we flashed the final firmware in the satellite and then we have pizza. And on the bureaucracy department we had to coordinate frequencies. So what's that? Let's say you have a radio and you decide that you duck deep push the button and then you put a tape recorder next to it and say I am Joe. I am Joe every second. And then you just pick a frequency. Let's say I just put my birthday in. And you just start transmitting. And then you get some kind of military band and local people around the country don't like it. Imagine that but the whole planet will not like it. So you need to get a frequency allocated to you. And then there are some managerial staff for exporting this and what is this and what is that. And customs people are not happy and then you have to make them happy with lots of paper work. They love paper work. And then have pizza. So all this is good. We can do it or we think we can do it. But why? So what should this satellite do? And we realize that there is some kind of a problem that we could try to address. So imagine this. When you, your satellite and many other satellites and for this size it could be like 50 or 100 go out in space. It's the equivalent of going on the top of a hill, have like a jar of marbles, just throw it down hills, go to the next city and try to pick one which is yours. So there are raiders for this thing and there are like military services that do track all this stuff. But during the first weeks it's like a blob of stuff. So in order to be able to communicate you need to think. You need to know which one is yours and you need to know where is it. So the first part is quite easy. Because you just wait for the lands, satellites go out and then you wait, you look up and at some point you say, you pick. Hey, it's me. Nice. Can you hear me? Hey, it's me. Okay, where are you? Hey, it's me. Can you tell me your position? Hey, it's me. And then no contact. Luckily the play me part, it's identifiable. Because you would have some kind of ID beacon, whatever that you design. So you know that's, that's yours. The other part not so easy. But we thought that if we exploit the Doppler effect, because these things go really fast and really fast is like seven to eight kilometers per second and to give you like a sense of how fast this is, I will go to the center of the city and come back. So I'm going and back that fast. So the Doppler shift that happens to the transmission from the satellite can help you start identifying the orbit. There are some, we call them orbital elements which define the orbit. I won't bore you with that. But there is a way that you can do that. So that's the experiment. So on the hardware, we started designing a power system. And so we got really popular solar harvesting chip that doesn't PPT that's the SPV 1040 with some solar panels. We also added the battery management chip. So it actually it's a battery goes so knows how many. Can't go in goes out. And you can get a good sense of what's going on your power system. Some minor modifications to the comms, the communication board. And then we we designed all the structure that will keep everything together. So again, the power budget. We wrote a tool to do that because commercial ones are not open and they're very expensive. So this is what this tool produces. You just enter how much solar panels you have. And then you get a really nice plot, which is for each side of the satellite, how much power you're getting. And this is awesome. But the thing is when your satellite is kicked out of the box, because it is kind of mounted on the bottom, just flips around. So the reality is something like that. So the structure, that's a structure. We designed this in the concept of having the PCBs that mount the solar panels and everything to be structural element. And this is an exploded view. And this is how the systems fit in there. There was a lot of room because we can have like four PCBs in there based based on the standard we tried to follow. But we only had like a power board and the comms board. So we had two batteries just to be on the safe side. And then just to add weight. So the more heavy you are, the longer you stay up there. And the more you pay to go over there. So there's a ballast board, which has some weight to like reach the top weight you can have on these things. The antenna on the bottom, it's a measuring tape. It's really good on unfolding itself back to shape, whatever you do to it. And there are actual numbers on that, as you can see. Actually, let me pass this around. So the antenna, we did some simulation for the antenna and the radiation pattern. On the right hand side, you can see how this thing is tied down to the satellite in the stored configuration. And then we get another goal. And they say, you know what? We have an extra slot for you and say, okay, instead of building one, we just build two. That's great. And then there's the thing with a deployer where you were going in. So can you build it for us? So we said yes for some reason and we came up with the revised plan. So the revised plan also includes the deployer. I won't go into that because that's the next talk. But this was the birth of PicoBus, the deployer we built. So you have to wait 15 minutes to hear about this. Moving on. We had all the circuits and the PCBs ready. You need to add some kind of conformal coating to protect the stuff. So you spray this and then you inspect it with UV light and it glows and where it doesn't glow, you have to apply more. Quite simply. This is an almost finished structure. It's a photo from the, during the assembly. And these are the two assembled satellites ready to go. So in the ballast board where it's nothing, we thought we could go like, let's put some ideas in there instead. So this is a small board with like the four principles that we believe in regarding space and the openness in space. We moved on to the bakeout procedure. So that's like upside down jar that goes on the vacuum machine thingy and it goes to really, really big, low vacuum. And then there's like a flat light on the right that does infrared heating of everything there. And what you do is you measure the mass before you bake it. You measure the mass after and you have to be within some specs or something really important evaporated in there. So it's an ago. And the final step was to do the proto flat campaign. This is where you put everything in the deployer and then you just torture it. And it hope, you hope that when you open the deploy again, you don't get sand coming out. On the software side, we built some drivers for the hardware as a standalone project so it can be used for other things. We built the elementary and telecommand. There was a finite state machine for orchestrating of what the satellite should do. And then because there was a delay on the launch, the software people, we had like more time in our hands. So we decided, okay, let's add another project. So there was a new project called open space data link protocol. And it's a way to structure your data and telemetry that goes up and out to the satellite. And we also built some kind of ground station telecommand software to operate the whole thing. Another interesting aspect of the development was that while the hardware was ready and the software was written, we were using the actual ground station software that was going to be used during the mission, which was part of satnav. So during development, we had like a satnav dashboard that would give us the state of the satellite. So everything is good and nice because it's next to us. But the good thing is that this is ready. So when satellite goes up, you just reopen this and you have your data. So these are like the final steps, final firmware is going in. Everything goes into the deploy for the last time. And then you wait for the launch. The launch provider was firefly aerospace. Like the dream payload was like the mission we were invited to join. And a few months has passed and there's the moment that you actually wait to happen. And it's a very stressful moment because sometimes from firefly things can go to fireball. And that's what happened. As you can imagine, there were feelings. And the only thing you can do since you know that since the thing blew up, they will probably going to build another. So we build more cubics. And a few months later or almost a year later, here we go again, biting nails, hadging teddy bears. And then you get this picture. So that's our deployer in the other satellites. Firefly did a good job to interrupt the stream just before we were deployed and then do a playback but not say that it is a playback. So they said all payloads are deployed and you get a video with this thing closed. And they said, okay, what happened? But there's a switch there that says, you know, the door has opened. So hopefully things go good and they did. So we started receiving telemetry from the satellites. It's cubic three and four because one and two kind of disintegrated or something. There was a minor issue with the A2C bus on one of the satellites, but the reset solved this. It's not a really nice way to do that. It's kind of windows you like, but at least it worked. We attempted some telecommand and control. The command was received. We wasn't able to get the reply. And the telemetry was received by the satellite network. So these are like real space data here. The power system kind of overworked. So the battery was like continuously full, which is good. So what do we get out of this? The platform, the cubic platform was a success. It is considered now Terral 9. Terral 9 is like the view of the space and it works. So it's like the top level. There is no pen there. So that's good. Firefly did not reach the target orbit. So this affected the mission life because it was reduced to three to four days. We managed to do orbit determination. But since the orbit was decaying really fast, it was unusable. The information was there, but you could not verify it because the next orbit was totally messed up. And this gave life to four more projects. PicoBus. See, look, you should listen about that in 1630, which is exactly what the experiment is, but in a more commercial way. And we have the simulator and the space data protocol. So the platform is one P-Pocket Cube bus, the one that's going around. If you use one battery and you do not use a ballast, you have room for one or two payloads to put in there. You get 350 to 500 milliwatts of generation for your staff. There's battery monitor and management. We have a really good documentation. It's currently coming together from the various wiki pages and issues, but it's really detailed. And it's a cost-effective solution for research, education, radio amateur or whatever you can imagine. So what's the future? We need to move comps to like version one and call it that. Create a better power board, finalize a standard and then create a document about it. And go bigger and fly more cubics. This is all the software that we used. It's all open software. And these are the people that helped this thing to become reality. Thank you very much. So if there is time for questions. Yeah, yeah, I have minutes for questions. Five minutes for questions. How many ground stations and passages do you need to like estimate the orbital elements of the satellite? So how many ground stations and passages you need to determine an orbit? Okay. Obviously the more the better, but with two or three ground stations and a single passage, you get a really good estimate. And then with the following ones, you just kind of nail it actually. So the other results, actually you can get more info on the SID look, talk about it, which is quite extensive. Yeah. Since you're using CCSDS, you also know the pain of CCSDS, I hope. Have you thought about looking at other like open source space protocols that are already out there instead of writing a new? Actually it is based. So the question is whether we thought for existing space protocols instead of implementing a new one. The CCSDS provides some compatibility with commercial equipment. So we thought it was a good idea to go that way. So there is an open solution that exists that is compatible with this kind of protocol. Okay. Thank you very much for an amazing talk. Tiedelini. Oh, you've got it. Nice. If I can ask one last question, or at least one more in any case. Did you have any considerations for radiation effects in your design of the hardware and any mitigations in that sense? Okay. So the question is whether we had any consideration for the radiation effects on the hardware. When you fly what we call commercial of the self components, you know that they're not space hardened. So there is always a chance for something going wrong. But you can go around by designing in a like more clever way. For example, MOSFETs may latch themselves. But if you have a monitoring circuit that detects that and your power cycle and I think most of the times you get you can get out of this situation. For the software, which is also quite important. There was like a triple storage technique implemented for all the variables and information in the satellite. So also there was an ECC RAM used from the microcontroller. And there was a polling. So you read three values and you choose like the two that match. And this is it. And there is also scrubbing frequently scrubbing where you read everything and you rewrite it. So even there is an error, you have corrected it and then put the correct information in all places. Okay. So illness will be available for questions outside. Of course. We have three more minutes. Great. Excellent. Then maybe maybe maybe two more questions we had. We had one right before. It's all the way. Yeah, but the ground stations. Did you build your own ground stations or there's like a network that you can use or how this works because it looks like the most expensive part of the project is like you have to grab, you know, whole planet. Okay. So the question is, did we build our own ground station or did we use the session network? The simple answer is sat mugs. The more detailed answer is that we actually our first really first project was a ground station network. It now has around the five, five hundred stations around the world. It's operated by contributors. So we have like a really good coverage in the UJF band and also on upper bands too. Hello. Thank you for the talk. Two brief questions. Did you have an attitude control system for the cubic or it was not needed and same thing for thermal management was important or in the design. Okay. So the question is, did we have an attitude control system and was there any thermal management? So the first question is no, we just toss it out there and let it spin. It was not important for the experiment. So and also there was no time as you might have realized for the thermal management. There were some tests in vacuum chambers and we saw that it would survive without needing like an extra provision for that. Of course, the PCB design is built in a way where you dissipate all the heat from the components to the PCB. But that's kind of best practice anyway. And the temperatures we got from the telemetry were actually well within the automotive limit. So we were good. Okay. Thank you. Any additional questions we can go on the hallway. Thank you.
A satellite's final safehouse: The deployer
We need to orbit the satellite back. So our next talk is a continuation on the theme of Libre Space. Now we are going to talk about the the the deployer. So please welcome Thanos. Hello everyone. So I'm Thanos. I'm also a core contributor at Libre Space. I'm a mechanical engineer too. And I will continue the talk of Vilias basically and that's some stuff about the satellite's final safe house, the deployer and specifically our own deployer, the pick-up bus. So some words about the deployer. What's a deployer? So now that our satellites are like the cubic small, you need to have a way to attach it to a rocket. Okay, so basically as mentioned before, we build a box, place all the satellites inside, mount the deployer on top of a rocket. Here you can see multiple deployers actually. And when the time comes, the rocket gives a signal to the deployer and says, okay, open your doors and deploy the satellite. That's what the deployer does. So how do you start designing a deployer actually? So that's a really tricky question. For us, we knew that we wanted to house 8P units. So 8 of the cubics, for example, that you can see. So I will do a quick walkthrough of the internals of the pick-up bus just to have a bit of context and then we will dive deeper inside the pick-up bus itself. So you start with a rail. You place all the satellites on the rail. Some way we will see how. Then you need to push the satellites outside with some kind of spring most of the times. And then you need all these things to be mounted on the rocket. So you put the flange there to be mounted on the rocket. After that, it's a best practice to close everything because space and also add some kind of door to keep the satellites inside. So now you have a box with satellites inside that can be pushed outside. But you also need a locking mechanism and deployment mechanism actually. So when the rocket gives the signal, the satellites do not stay inside. They go in orbit. So that's the final version of the pick-up bus, V1 actually. And these are the main components of it. Let's dive deeper now. So again, we start with a rail. That's one of the most basic parts, but one of the parts that actually restrict you because of the satellites. The interface of the satellites and the deployer too. So we have the pocket cube standard. This is the pocket cube standard actually that gives you the dimensions of a pocket cube and it's available everywhere around the pocket cube too. So satellite manufacturers can put side panels there, sometimes even deployable things, which is cool, which is the hot stereo you can see around. So for the pocket cube, there's a base plate down where the deployer actually interfaces with this base plate. So we manufactured the rail actually. So the rail, that's the top view of the rail. So you see here it has like a notch where the satellites slide inside and are held from the base plate. So for our specific rail, it was machined out of space grade aluminum, 7075 T6 alloy. It also was hard to analyze to give the surface as much strength as possible because satellites were sliding inside this two millimeter slot. So now, yeah, sorry. Yeah of course. So now we're moving to the PUSHER sub-assembly, which is what pushes the satellites outside. So this is a really early version actually. We actually try to follow the rapid prototyping procedure as much as possible. So we constantly 3D print parts, break parts, then redesign parts, and then 3D print again parts and then break them. So after much discussion, we opted to use constant force springs. This is really good. It's a good practice because you cannot just take a proper compression spring and just compress it all the way. So it gives a really big range actually. So with a really quick paper towel calculations, we got a rough estimate of the spring strength and also the satellite exit velocity, which is a really important number when you're building a deployer. So when we finished with the paper towel calculations, we actually machined by hand a dummy rail, as you can see there. We 3D printed PUSHERS and barrels and attached the spring to do some testing. So let's see the PUSHERS of assembly in action. That's our first prototype. And it seems to be working. Yeah, and they did the drop of the table. That was a really good one. So it worked. So we moved forward with this design actually. So again, you can see here the pick up us. You can see one side of the rail. So we have the same assembly mirrored. So we can house 4P on the one side and 4P on the other side. You can see the PUSHERS of assembly here. And you can actually see the machined part in the middle, the final rail with the PUSHERS that's now machined from PTFE material. It's not 3D printed. So the PUSHERS was made out of a single billet block of Teflon. That Teflon is a really great material because it's space great, which means it doesn't out gas. It can operate really good in volume. It also has a really, really small friction coefficient. So it slides amazingly with a hard-denodized aluminum. The second part was a barrel. It can be seen in the right side of the photo, which the course and force spring was wrapped around the barrel. And the spring was attached on the top of the rail. So in the right picture, you can see the PUSHERS that are in the top. That's the deployed state of the PECO bus deployer, actually. So moving on, we now go to the doors of assembly and the thermal knives mechanisms. So, okay, now we have the rail. We have the PUSHERS of assembly. We need to design the door, too. So you need two things. You need the mechanism to hold the door close. And you need the mechanism to open the door. That's the mechanical thing. You also need, from the electronic side, some way to attach the door and then cut it with a signal from the deployer. So we used this mechanism. These are early prototypes. So we used the pin puller mechanism with a compression string. And in order to hold the pin in place and actually secure the door closed, we used the enema string. And in order to cut the enema string so the door would open and the satellite would be deployed, we used thermal knives. Basically, a nigh-chrom wire which gets heated, cuts the enema string. The spring is decompressed. The pin is pulled. The door opens. So we did more prototyping. And we knew that we had to build electronics from scratch, of course. It was mandatory. So the PCB that we did would handle the communications with the rocket in order to receive the signal. We would also have the thermal knives attached to it and also the deployment switch. Of course, you need a way to see if the door actually opened. So yeah, we used two thermal knives because space and redundancy. And we also used two enema strings. So only one of the thermal knives needed to work for the deployment to happen. So you have two. It's a more redundant system. So that's the final door subassembly with the machine parts and everything. So you can see here the enema string is wrapped around like that. And the two thermal knives are here, one here and one here. So the two strings hold the pin inside. They get cut. The doors get released. So the door subassembly was complete, finally. It was ready to be integrated to the rest of the deployer. So behold, that's the deployer, the final deployer, the final pick-up bus. You can see here the wire harness. This is getting connected to the rocket. So when the signal is received, the door will open. So now let's see pick-up bus in action, actually. So this is the deployment test that we did. And you will see the thermal knives starting to glow. There they are. The pin will be pulled back. And then the door will open. And satellites will fly outside. In this specific case, these are dummy-mass satellites. They're not actual satellites. They're blocks of aluminum. But yeah, that's how it works, actually. And you can see the push-up assembly that now it reached the top there. So continuing now, I want to have some slides about the testing that we do and how do you space grade an assembly, actually. So one of the testing things that we do is called protoflight testing. Protoflight comes from two words, prototype and flight hardware. So when building pick-up bus, we had a really short period of time to do everything and build the two satellites that would be inside. So you had six months. So protoflight helps you with the time that you have developed really easily. So in this case, the qualification model, so the model that is tested and goes through the vibration and stuff, is the same as the flight hardware, actually. And the satellites were integrated inside pick-up bus when this protoflight campaign happened. So basically, you want to see if the DQT, the device under testing, will sustain the launch. Actually, launch is a really, really bad time for the deployer and the satellites. Huge vibration, huge accelerations, everyone. You really need to see and be sure that balls will not start flying around, pretty much. So the steps are the following. So step one, you do a resonance survey. What that means is you identify the eigenmodes of the system. Actually, you need to have, usually, the first resonance should be pretty high, at about 100 to 150 hertz. But that depends on the launcher, actually. So you get these tables. You place the whole deployer into a machine, a vibration table. You vibrate it on all axes, and you get the resonance frequencies for each axis. Then you do a signed vibration profile. So this is where the bad things start to happen for the deployer. So you basically start punching the deployer with vibrations and the satellites inside and hope it survives and doesn't break. So you pass from five hertz to 125 hertz with a signed wave profile. That's really painful to watch. But what's even more painful is the random vibration, which is step three. That's the real heating. And you pass at the same time, the same machine, from 20 hertz up to 2 kilohertz. So that's the profile, and the deployer must sustain this. When random vibration finish and you are a bit relieved, OK, things seem to go OK, you took what's static testing. We simulate the static loads exerted on the deployed unique launch. Which again is painful to watch. But when everything is finished, you do again another resonance survey. So you add another table, basically, to there. So you have the post-serve resonance. So you need these values, the resonance before the testing and the resonance after, to be the same. If they're not the same, something happened. Something flew through, something broke. In our case, they were the same, actually. Some words now about our design simulation tools. So everything we do is open source. The tools we used to do. Our things are open source, too. So we use free cut to do all the modeling, everything. We also use free cut to do all the simulations. Before starting to hit your deployer in the vibration table, you need to do simulations. So we do a lot of model simulations to actually try and predict the eigen modes before sending the pick up bus to be hit in the vibration table. We also do static simulations because the deployer is bolted from the flange, and you have a lot of stress, stress is going around. We use calculus as a solver to do the vibration in free cut, and we also for the electronics, we use key cut to do everything. So as I mentioned before, the pick up bus was developed in a really short time. So now, eventually, we had a bit of time to develop the version two of the pick up bus. So by using the tools that we mentioned, we actually were able to do a lot of improvements that were obvious with the pick up bus we won. So the most, the improvement that mattered the most was the mass improvement that we did. So because it was designed in such a short notice, we had a lot of big safety factors, and we tried to do it really quickly. So now, we had more time to do simulations and more time to do design stuff. So by iterating the plates, we actually managed to cut the mass in half of the deployer. So right now, again, the capacity is the same. It can house up to eight satellites inside. It's almost half the mass of the pick up bus we won. It has a larger satellite envelope, which means you can fit more stuff around the satellite, but it has a smaller deployer envelope overall. It's a bit smaller deployer. And again, it has updated electronics because we found some minor issues, so we fixed them. So we did a really cool thing, again, with the isogrid patterns that you can see there. It's really space rated. So we had an aluminum frame this time, not just a plate. And we closed it with a huge PCB, an FR4 PCB, and polyamide tape, couple of tape, pretty much. So we closed it, and it was secured. We will see how that goes. We will start manufacturing the next months, and the expected loss of this is in the next few years. So before we leave, if you were watching about the pick up bus story, so pick up bus one was integrated, as Ilya mentioned in the previous talk, and it was launched, actually, but the launch, again, didn't go as planned. Yeah, it was a really sad thing to watch, but space, these things happened, explosion happened. So, but there was a plot twist. We received a phone call from the guys at Firefly, and they said, sometimes, we think we found your payload. We were like, what do you mean, it just blew up at 15 kilometers. You cannot do this, it's waste. What do you mean? No, no, no, we think we found your payload. We were like, yes, but so they found our payload. So that's the pick up bus that survived through a rocket explosion at 15 kilometers. And there's more. It was okay. We opened it, removed the satellites, they worked. It was, everything was so okay that we thought they didn't launch it. The only thing broken was this electronics cap and this poor capacitor. It was amazing. It was like, it didn't even work. Like it was great. And they actually sent us photos of the pick up bus mounted on a carbon fiber part from the payload bay, and we think that this huge carbon fiber part reduced the drag, and they found it in a beach. Mear van der Beek's Spaceport Base, when it was launched, it was like in the middle of the sand, two meters in front was sea, two meters from the back it was rocks. We have photos, but we cannot share them unfortunately, because they don't let us, but it looked pretty much like this. So yeah, pick up bus survived because space is hard. Pick up bus seemed to be harder though. I will close with this really cool video, which is the orbital pick up bus. This again is provided by Firefly. They had the camera on board. So let's just enjoy it for some seconds. Open source space, everyone. So unfortunately, because of the launcher they didn't have, as Ilias mentioned, they didn't have the doubt, they didn't download the open pick up bus. So that's what we get. Which by Firefly is great, like pick up bus. With the air. So yeah, that's it for me. Yeah, that was another deployer actually. So real funny, really quickly, when we were watching the stream, and there was a cut in the stream, and every day after that at the time said that they had confirmation that all three payloads were deployed successfully, and the stream came back and we saw this, and the door was closed. And we were like, no, come on. But yeah, after that, the Cubics worked and beaconed. So yeah, successful deployment, everyone. So I will keep that there. And I'm open to questions if you want. Thank you for the talk. Why is so much solid structure on the outside even required? So the question was why so much solid structure is required? So it's not really. In the pick up bus V2, you can see it's a more lightweight, but the pocket cubes are light. They're 250 grams, but when you have eight of them, it's a bunch of kilos. So when you place it in the rocket, acceleration, it's a really hard phenomenon. So you need actual structures to hold the rail in place so it doesn't move in place. It doesn't move everywhere. So the plates on the sides of the deployer actually held the rail in place. So you also need to sustain the vibrations too. So you need to have a bit of a mass to absorb these vibrations too. In the pick up bus V2 actually, instead of having a solid plane, we have two ribs. I can show you in the presentation. You can see it actually. We have the design is two ribs that secure the rail pretty much. So here is attached to the rail. So when you have the whole rail protruding out, you need to support it like that. So we did a lot of way saving by doing this. And you also need to have it separated from outside from everything else. So you need to have the deployer closed. So in case a satellite has a malfunction or it breaks or whatever, the deployer contains the satellite malfunction too. So you need to have it closed. Did I cover your question? Yeah, thank you. Thank you very much and congratulations. Oh, welcome. My question was in a eventually version three or version four, which you consider also having a communication bus to communicate with the satellite swatter inside the deployer for system checkouts, for example. So we don't know yet, to be honest, it would be a case. I don't know, we can ask the electronics guy. Have you ever thought about this? So yeah, maybe what it would be beneficial for the other version was to provide the chance of charging the satellites. So for the people that put the satellites inside, whether it's us or other people to have like a way to charge them after the integration, that would be an extra step too. About communicating with the satellites, I don't think so because I don't know, but I don't think so because when they are inside, there are key switches that are pressed. So the satellites do not start becoming or deploying or whatever. In the cubic, you can see the two switches on the bottom plate. So when they're inside, they completely shut down. Yeah. Yeah, here. Yeah. What kind of simulations did you perform? Raise your hand, I cannot see you, sorry. Yes, here. What kind of simulations did you perform and did those simulations inform further iterations of the design? Yeah, all they did. So much. Like we do static simulations to simulate the static loads and see if they can withstand these loads. And we also do the model simulations that I mentioned before because in order to do the vibration testing, we have to fly from Greece to Spain. So we need to be sure that, sure. We need to have a measure to see the eigenmos and stuff. So we do vibration testing and static simulations. So here it's a very good example because we did the through simulation, we reduced the mass actually of the part. Hi. In your test deployment video, when the door opened, there was a certain amount of bounce. So did you have to take precautions to try and make sure that the satellites didn't hit the door on the way out? Yeah, so the springs, in the door there is a hinge mechanism, so it has two torsional springs and are really, really strong. So the back glass is minimal. Now in the V2, we are thinking about adding a locking mechanism to it so the door doesn't bounce back. But it's not such a huge problem, to be honest, because even in the open position, the pick-up bus door has a bit of preload because of the torsional springs themselves, their legs are a bit angled. So even in the open position, there are like half a Newton meter of torque. So it stays there. Welcome. Yeah, there was a front question some here. But yeah, yeah, okay. Hi, thanks for the talk. Did you know why the door closed after the deployment? Oh, it didn't. No, it didn't. It was, it stayed open. You mean when it's torbid, correct? No, it stayed open. It didn't close. So why did it look like it was closed? Yeah, because the Firefly Stream was cut and we didn't have a footage of the door being opened. We only have footage of the door being closed before the satellites came outside the deployer. No, no, it stayed open, but we just don't have footage because the stream was cut from the rocket, the stream. They didn't download this specific part. Yeah. Do you, do you dump the vibrations in any way from the rocket through the pick-up bus into the satellites? As far as I know, no, we certainly don't. I don't know if there is any such mechanism in the payload bay of the rocket itself. But as far as I know, no, we certainly do not. We have time for one more question. Hi, thank you for the talk. I just wanted to ask what science did you do? Was this just proof of concept to show it would work? You mean for the deployer, I assume not the satellite itself. So, oh, sorry. What science? The deployer is pretty much a box. So it has one work to do, deploy the satellites outside. So there is the aspect of the proof of concept and the mechanical functionality of this thing. We managed to build it and make it the RL9, so it worked in space. So that was the goal, to have this kind of box work in space and deploy the satellites. But the science behind it, it doesn't have the experiment, maybe. You can say that the experiment was to deploy the satellites, actually, if you want. Yeah. Okay, thank you for very interesting talk. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you.
Electronic boards production automation with KiCAD scripts
So, next up we have a very important talk from Svetan, who runs a company you may have heard of. They build the awesome little single board computers and they have been automating much of their production line that uses KICAD. This will give an introduction to how to do this on your end as well. So please give a warm welcome to Svetan. Thank you. Okay, today I will share some scripts which we use to make our production more efficient and which you can also use. I am Svetan, the owner of OleMX. OleMX is a company in Bulgaria which is dealing with electronic design, production of electronic products and we have about 1,000 different originally designed by us products and which we do this for more than 30 years and we produce everything in both in Bulgaria. Most of our products are open source hardware. So the design process is always fun. You create something, you challenge your brain, it's very satisfactory when you solve something which you think it's impossible to solve and this gives you a great satisfaction. At least with me it's always this. It's always fun, it's always challenging and give you a great satisfaction of the creativity but later is the production process which is boring stuff. You do repetitive boring stuff which doesn't bring you any challenge. But if you don't do it correctly there is a disaster and financial penalty for every mistake you do in the production. And why we do water than to manufacture? I don't want you to just design the electronics because this is the only way we can pay our bills. When you have a final product actually you monetize your designs and what you are doing. So here I will just mention what can go wrong when you produce electronics. First stage is components apply problems and this is something which doesn't depend on you. Actually you cannot do much about it but there might be differences between the components from the different lots. And for instance even one single capacitor sometimes you can receive it a little bit lighter, a little bit darker, a little bit yellowish, a little bit brownish, a little bit reddish. And I will say what big deal this is just color change but all the inspection equipment take pictures of the boards and compare pictures and when the component has different color it recognizes like different components. So the inspection starts to get the false alarms and you have to update all your libraries every time you get the component with different color. Even one and the same components if it comes from different vendors they can put it with a different orientation inside the tape. So if you just realize that everything is the same like the previous batch you are in trouble it will be not assembled correctly on the PCB. Different component marks. Recently we got LED. If you ever assemble LED you will know that on the back of the LED there is a green dot which shows the cutout or anode. This is you can the pure decision of the vendor. Some put it on the cutout, some put it on the... So when you assemble your LED's vice versa with 180 degree they simply don't work like expected. And of course there is a lot of fake components and the major problem with the fake components is that they never come 100% fake. The dealers of the fake components they are very innovative. They mix for instance 5% fake components with 95% original components so when you assemble the boards you just find that 5% doesn't work like expected. And you start to think that maybe you did something wrong but basically it's because of the components. So all these problems have a solution. You can make a procedure when components come to your storage you never put them in the storage without first testing them. Check the component orientation, check the component color, check the size variation and this is predictable and this is something you can handle. The second problem is with the operators. We are humans we do mistakes and we are not machines. We don't work the same way in the morning and later afternoon because you get tired or you have some problems in the house or you play tennis the previous day and your leg is hurting and this is just distracting you from the work you do with the machine and if you do mistake it multiplies immediately. Every hour with hundreds of scrap boards manufactured. So to be effective in the production there is major two problems to solve. First one is how fast you can move from the CAD program to the machine program. And the second is once this is set up how quickly you can change between the different PCBs which are already stored in the machine memories. This is one typical production flow for electronic production. You have to have the components then they go through the solder press printer then they are inspected then you go through the SMP placer again inspection, oven inspection through whole component inspection and test. And the challenge to generate these programs for these machines is that they are usually five or six different machines which are made by different vendors and they have different ideas about the software, about the libraries, about the component names and when you add this to the stocking and to the CAD libraries you actually have to find the quick way to align and to alias that this component which is in the key CAD is that component is your stock and it has three or four different libraries which has to be associated with this component on every machine. This is very time consuming and usually one operator is getting the files from the CAD and start manually to associate these components with the different machines. This is a problem because these machines have to assemble they don't have to stay and be in programming because machine time is money. If you keep your machines without producing you are not effective. The second challenge is the changeover. As I said people always do mistakes it's not avoidable. So we just have to find a way how to minimize these mistakes. For the last 30 years we are trying to improve our process and this doesn't mean we don't do mistakes. We do a lot of mistakes but we try every time we do something wrong to analyze what is the cause for this problem and to take some corrective actions so we don't do the same mistake again. And the only possible solution for this is to have a computer assistance to all steps where humans are involved. If the computer help the people they do again mistakes but at least they know that they do mistakes and correct them. Kick-out is proven to be a nice tool to do this job. Why? Because it's totally open and always flexibility and we can extract from kick-out all the PCB parameters all the component properties. Every information we need to do the proper programming on our assembly machines. But of course for this purpose we had to reverse engineer the file formats of all machines we have on the assembly lines. This is still work on progress because we have machines from Samsung, from Sony, from Omron, NPM and as I said every machine has different file formats. You have to experiment and to see what exactly is this file format and what you have to put inside it. And this is how our script is looking. We have an Olimax plugin. When we start the plugin we have this screen and we can select the PCB orientation. Why? We need the PCB orientation because some machines work from left to right. Some works from right to left. Some can be operated from the front. Some can be said to be operated from the back. And every time you change this you have to change your origin of the PCB. Then we can pick if we want to have the components which has property to be excluded from the bone, to be excluded from the position file or to detect not assembly which feature I understand is available in the new version of Kikat. And the variant properties because one PCB can be used in many products. For instance you have one PCB with different variation, different memory footprints, industrial version components, commercial components. And this generates the files for all the variants of the board we can manufacture with one PCB. And of course this is what to export. When we run the scripts it creates this JSON file which creates all the necessary info inside and then we import it in our JSON ERP. This is something we created for our internal use. This is general information about the board. Kikat is generating preview for the top and bottom of the board. And we have automatic match for the components from Kikat to our stocking. If I was in the marketing I would say we use artificial intelligence but I'm just an engineer so I'm just saying this is just fact. Here we have preview and we can go through the bone and see visually every component where it goes on the PCB. And then we can select from the drop down menu for which machine we want to generate the programming and we get a native program for the machine which you can load the machine and it starts working without wasting time to associate components to teach libraries and so on. We did the same for the changeovers. Here you can select from the bone, for instance this board has two machines in line and we can select and copy which component on which machine we want to be assembled. So the recap is that since we start using this scripting we multiply times increased or productivity with the production because otherwise it takes days to build all the machine programs manually. Now you see you just click on the drop down menu I want to have file for the printer or file for the placer or file for the optical inspection and it single click it generates the files and you load the machine and it starts assembling. With the changeovers there is no significant speed up but it helps to do less errors. That's it pretty much. Before you made all these custom tools how did the vendors of these machines expect you to use them? Like are they expecting you to do everything manually or do they have their own set of tools? You can do all this association in the machines but you are losing production time because when you program these machines they don't work and they have some tools which are for importing for panelization and etc. but it's done in different way for every machine. So it's very troublesome for the operators to remember how this machine was programming what I have to select this or that and now with single click they get the text file, load to the machine and it works. So there are a number of other companies that do similar things and they all have different machines so the question is what kind of data format do you have internally that other people can then write converters from for their own machines? We start from the kick-up. From the kick-up we can extract every component property, rotation, library name, etc. And from there we see with the reverse engineer for what this machine is expecting to receive like a file and like an info and we just make the bond between the two. So it's, we didn't intend to make it universally for any machines just for the machines which we have in use. So why do the companies selling these machines just don't care? So why is there no kind of import? Because there is no coordination between the different components. For instance Samsung has one concept how to look the menus, how to look the user interface and this is decided by their engineers. They have no clue what Sony is doing or what Yamaha is doing or every engineer in every company has different concepts, how to name the components, how to place the libraries and it's unavailable that they don't do joint development to match their. And basically when you buy the machine you start to learn how to program it, how to do that. There is a huge books where everything is described but it is totally different on every machine. Were you contacted by any manufacturer that says well, we're, no, have you been contacted by any manufacturer of machines that say oh we just have done that already. Here is our importer from Kaikets. No. Is there any manufacturer you know of that does anything like that? I don't think they use Kikat or they know of it. This is just solution for our case. We use Kikat, we use also Eagle for legacy boards which are produced before we start using but we found that in Eagle we don't have so much access to all the resources on the design like in Kikat. So this is why Kikat is so flexible and so easy to fit in our goals. So my question is, is your plug-in open source as well because I think it's kind of interesting the whole reversing theory of some... If this plug-in is later embedded in Kikat, we don't have a problem with this. I think perhaps separate out the kind of reversing during part of it to have people support more machines could be super interesting. Yeah. Hi, you mentioned that you have optical inspection and changing with colors of capacitors. So when you have these scripts that generate the files for all of your machines, do they have sample pictures for all of your components? Which additional data do you have along besides the Kikat libraries? This is unavoidable because they just, for the optical inspection, for the moment we just prepare the file and prepare the libraries and then the optical inspection can take different pictures for one and same components and you say this is the variance of these components so you just add them to the libraries and next time they can scan between the different variants. But this is not done in the script. The script just makes the positions, the few usuals and other info for the optical inspection. Thank you very much. Thank you. So I think there were some additional questions. So Svetan, if you're available in the hallway the additional questions can be face to face.
Jumpstarter: Open Hardware In The Loop for everybody
Next, we have the Jump Starter project with Miguel and Ricardo. This is going to be a rather interesting bit of open harbor for those of you who don't know. Miguel has a long time contributed to the KECAD project, so an alumnus if you will. So welcome. Thank you. Hello. Hello. Yep. Thank you very much for attending this session. My name is Ricardo Nariega. I work in the office of the CTO at Red Hat. And I'm Miguel. Yeah, I worked with him since one year and a half ago. And we are going to talk about the Jump Starter project. We will go through these slides and hopefully a live demo as well. So let me introduce you to PNAT. This is PNAT developer. PNAT works developing applications for embedded systems and edge computing use cases. And he uses all the more tools of development that we know. He develops locally in his laptop, pushing code to a give repository to have version control. He uses IDs, testing frameworks, virtualization containers. So basically all the other tools that we use for developing services in the cloud. The problem is that PNAT, after some hours of coding, he really needs to test a release candidate or some code that he thinks is ready to test in the real target platform. Let's say he uses an NVIDIA Jetson device, for example. The problem with this is that he needs to take the power adapter connected to the Jetson device to the plug, then an Ethernet cable maybe to get some connection, then an HDMI cable or a serial cable. Then at the end he goes, takes a USB stick, puts some operating system image in the USB, install it in the device, and at the end he needs to take the application to the device somehow via SSH or whatever. So by the end of the day, the poor PNAT is completely exhausted. And this is when we start to think, okay, we need to tackle this. We cannot afford every day doing the same thing. And this is where Jumpstart came to life. We see that developing applications for embedded devices comes with unique challenges. There's a huge lack of standardization. Every device is different. We see in big companies that enrolling these devices into CI systems is rare or sometimes is very expensive. We want to keep high quality in our code, in our applications. And testing, especially automated testing, is a key aspect of it. So we thought, okay, what are our testing goals? We would like to test our application in those target platforms at every pull request that we push into the repository or for every merge request. We would like to test a release candidate in all the models of the platform that we are going to run in production. Let's say a point of sales, I have five or ten different models, so I want to test my application in all of them. So we need some kind of automated testing and if possible, something that is hands-free, so no manual intervention. And this is why we created Jumpstart. It's not a device management system, it's basically a testing tool. So what is it? I know this is the open hardware room, but Jumpstart is basically a software project. It's written in Golan and it has the concept of devices, which are the devices under test, embedded systems that we want to test our software on, and the concept of driver. Driver basically exposes the capabilities of a hardware connector. We will explain more later, but a hardware connector is a piece of hardware that allows you to enroll these embedded systems into CI platforms. We have built a script language based in Yaml and allows you to automate some of the onboarding process. And Jumpstart allows you to remotely control these systems and it has the following functionality like power management, control signal management, storage, and console management. It works with the major CI platforms like GitLab runners, GitHub Actions, Jenkins, TecTon pipelines. We are developing as well a Kubernetes device plugin to be able to schedule these TecTon pipelines in the Kubernetes nodes. And at the bottom of the slide, you can see when you use the Jumpstart CLI how you can list the devices that are connected. So as an example of GitHub Actions, if you want to enroll your embedded devices into GitHub, you just need to run a self-hosted runner service per available device, per the device that you want to run. Then you can add a tag like Jumpstart.vrasperepy for. And whenever you want to run a job, you just select which tag or which platform you want to run it on and it should work. We have created a reference design for a driver, for this hardware connector that I mentioned before. We call it Datlink. And Miguel will explain more later. But if you see Datlink, you just need to connect it via USB to the GitHub runner and then create your workflow, your GitHub Actions workflow, and run it. This is an example of the GitHub Action workflow. For example, list a device, download an operating system image, prepare the image, mount it, change some configuration, inject some application and ready to use. And then we can use the scripting system language that we have created to automate the onboarding of the device. Just a disclaimer, if you use, for example, GitHub Actions, it's better to change the default settings because for first, Jumpstart requires full root access to the runner. So if someone has privileges to run, to push a PR, it can compromise the system. So this is how the script language looks like. You can put a name to the script, a selector for the target platform, and then a set of steps that would automate the onboarding. Power off on, write image to disk, and then we can control also the console. We'll see that in action later. So as I said, we have designed Jumpstart with modularity in mind, with a driver-based model. The Datlink is our reference design, but we have also developed other kind of hardware connectors just to show you how easy it could be. And if there are other hardware that you can use to enable this, please write the driver and you can leverage all the benefits from Jumpstart. So we have the Datlink driver. Driver B could be done the same with an SD card multiplexer, plus a smart plug, plus a serial cable. So, yes, as Ricardo explained, when we started the project, we didn't find a proper test harness. Along the way, we found some others, and we will be adding drivers for those. And if you have something and you want to add the drivers, I'm super happy to help. This is what, it's very obvious, at least now it's not pink, like in this morning we had an issue. So what our test harness is doing is switching a storage device between the testing host and the device. So you can access the storage device, the iOSV, from the testing host and write your image very, very, very fast. And then you can connect it back to your device and power it on, and then you can talk to the device via the console and we have some control pins. So far it's very basic control. There is no analog interfaces. But we have the next revisions. We have taken a lot of feedback to add extensibility to the platform. So yeah, this is how the version 1.1 looks. We did it in a mini ITX form factor, so you could put it in racks in a data center or in boxes, like in this case, the one we brought. You can control power via a barrel connector. So in the back plane, you have the inputs for power, and here down you have the outputs. You can put your storage device in here and you can mount your device under test on top if it fits. So you can control up to five amps, and you can provide the power via USB PD. Yeah, so we have, yeah, as I said, the USB storage multiplexing, and this is how it looks if you mount something on top. This is star 5 vision 2, and yeah, we are running some tests with that. And then, one of the best features of this is the speed. So you can get five gigabits per second, so you can go to a little bit of speed. And it makes it very interactive. When you are working, you get feedback very quickly on if things are working or not. So that's really nice. About the hardware, so the design is made with kick-up. You have the repository in here, and here the pollution of our prototypes. So we made 100 before, I mean, last summer, and yeah, it was around $80 per device, and we just made five, I think. Then we made 1.1, and we added some additional EMC filtering to the power. We moved the storage device inside because initially it was outside, and it was okay, yeah. And we added some connectors for the expansibility, so we have an SPI and H2C connector, so you need like a doter board to talk something specific to your device. You can do that. We have version 2.00, which we could not produce yet because of company policies. If I want to make an order to make the prototypes, it's going to be beyond the maximum without the purchase order, so I need to register the vendor and so on, and it's complicated, so eventually. But that one, instead of requiring two USB connections, which this one needs one for control and one for the storage, we'll only have one. I think I have a picture here, yes. So in this one, so the connection from the testing host comes here. It comes to this USB 3.1.5, and you can connect additional devices. Maybe you need to put a camera, or maybe you need to put a logic analyzer, or a Canvas adapter, so you can do it via USB, and with the software we could detect via the USB topology where those devices belong. So the idea is that you could have a testing host, but you can have 10 jump starters. That links, so we changed the name at some point to make it clear what the software and the hardware was. And yeah, we also added a connector for APX. Yeah, I will run a little bit more so I can make the demo. That link board has a controller chip, and the firmware is written in Rust. It has a nice console that you can talk to if you want to do it manually, but that's handled by the driver in Jump Starter. For people who make hardware with USB, this project is super interesting. You have it in almost every Linux distribution, and it allows you to update your firmware on the field. So you can publish your firmware to firmware update, and then you can, I mean, you create the descriptors, and so on. Fingert update in the Linux systems will realize that you have a device that is on the database of firmware, and you can get updates through the network. This is how it looks. Suddenly, I couldn't take one of Jump Starter, but this is how it looks, for example, if you're running on desktop, or if you do it on the console, you see something like this. So, yeah, we're releasing every version in GitHub with all the production files, so you can just download the production files and take them somewhere, and hopefully, I mean, probably you will need to adapt to the vendor, but you can get that. And, yeah, Cid asked me if I could talk about them. We are talking with them to see if they can pre-make an amount of devices and make them available in their co-create program. Normally that is meant for, I mean, if you are a creator and you want to make money on your device, there are other programs like this. They will handle the production, and you just take care about your design. But what we did in the meanwhile, I don't know if that will work or not, we just provided the links to the, how do they call it, the Fusion Gallery. So, when they made the prototypes, they give you a link that you can share and they will repeat the prototypes for others. And now, hopefully, small demo time. So, okay. So, this is, we prepare this demo repository that is actually connected to this. This is registered as a runner on that GitHub. I don't know, hopefully, it's connected, we'll see. And this is the device under test that we have. This is Raspberry Pi 4. And we are building an image and testing it on the device with two different distributions. So, yeah, this is some of the previous runs that all passed. We can look at them and we can see, for example, how do I see that? Checks. We can see that they were tested with Raspberry Unlight and Fedora Rock Height. So, in the process, it will download the latest version of the image, prepare the image and test it in the hardware. So, we go to one of those. You can see the steps of what happened. Okay, this one was a simpler test. Yeah, we can see previous runs. And we can see an example here of, okay, what happens if I break the construction of my image? In this case, we are testing a DPM module that is connected to the Raspberry Pi. So, if I remove the DTV overlay in the config for the Raspberry Pi, it should not work. So, when we go to the checks, we see, okay, Raspberry Unlight stopped working and we can see that it failed at the DPM interactions. We can see, yeah, when the image was being flashed, it was not working. Into the device. And I could show you if this is all working. Maybe I need to make a bigger font size. In this case, this is what the runner is calling. I can list the devices or I can run stuff on them. I can do things manually. For example, I can power on the device. Hopefully, I need to tell which device I want to power on. So you can, and if it's working, yeah, power on. You can power it off. You can request a console or you can run the scripts that we run in CI. For example, if I go with this other device, which is an SD wire, I don't know if any of you are familiar with this. It has an SD card and then a connector that looks like an SD card so you can plug it into a device that boots up the SD card. So if you connect these to the jump starter to the software, you can see that it's working. And I provide, it also needs like a serial console to talk to the device, otherwise it's going to complain. So if I list the devices, I should see, okay, I have the SD wire with this serial number. In this case, I cannot make all the associations with the tags and so on, which are stored in the hardware. So I have a config file that matches the serial number and then at this point I can just flash one or the other and it's set disk image. For example, if I set the Raspberry Pi 4. So this is the same process that you can do in the scripts, but you can also do it manually. For this, I need privileges. We want to split this part of the executable in a separate one only for that purpose with lots of filters to make sure that it will not break anybody's server. Yeah, this is the nicest part. How quickly those. Yeah, we need to be a little bit cautious with the data. Linux, because even if you request the system to eject the device, sometimes it tells you, okay, everything is all right, but the cache is still finishing in some part of the subsystem. And yeah, I think we're, okay. Thank you. Thank you. Thank you. Thank you. From the software side of things, have you looked into LabGrid? No, I have not. But, sorry. So yeah, the question is if I have looked at LabGrid and not, but I will. Yeah, because it does something also similar. Is it open source and open? Maybe we should work with them. Hi, thanks for the talk. I would have asked for LabGrid too, but maybe one other thing. Do you know about the automated testing conference call, the monthly one, that's coming out of the Elinux project? There are already record people there talking about the Coliseum stuff. Maybe that's interesting to share there too. Yeah, so what's the name of the? It's the automated testing conference call around the Elinux project. Yeah, I can come to the front end. Okay, yeah, thank you. That's great. Yeah, that happens sometimes. Big companies, you have people working on things from different places. Thanks for the talk. I wanted to ask how do you actually specify tests? How does the test work with this? Do you have the device? Yeah, so. Is that the YAML syntax thing? Yeah, exactly. That is the YAML syntax. So far, it's rather simple. So for example, if I go, and this is available on the repository, if I go here to the demo, and I go to the Raspberry Unlight, for example, you can see this test, TPM, or latest raw image. So we assume that the image is already built, and we just tested on the image. And it's just a series of steps so far, steps, interactions with the CDL console. So we expect something like that for control. And yeah, we want to add also integration maybe for other types of devices that are not Linux based, maybe within where you want to flash them. Something that I did not explain is that we also do power metering. So one of the things that we want to do is provide that report, maybe from this point to this point. Okay, how many millibars, what's our I consume? So you can check if your software is consuming more or less in your hardware. So really cool project. I'm really excited. So you said there's an external USB for connecting modules. So is that for so I can test something externally like, let's say a very simple system where I've got, I'm turning on a light switch, I can plug in something with a Luxe beta and check that the light actually came on rather than the board went, yeah, that came on and I have no idea. Yeah, yeah, that is the idea of the version to that. Yeah, you could connect anything and have a way to associate the device under test with those devices because you know where they are in the bus. My question is a similar question because there's, you know, literally thousands of pieces of various test equipment out there. Most of them these days probably run on either a network interface or a USB interface and they're like, yeah, they're classic. They've been translated from the old HP IB GP IB, skipping commands. But you know, things that are like really powerful test tools is, is there any plan to integrate be able to integrate stuff like that because maybe I have an analog board that's really highly precision. And I need like a really highly precise meter, you know, digital multimeter to measure that I'm not going to measure that with a simple weight bit a to D. Is there any any plans on being able to integrate stuff like that so you can. We want, I mean, we want to be able to enable that but every tool is different. So we want to make it as modular as possible. We haven't still thought exactly how to how to do that. But the first thing that I guess we need to figure out which which USB devices are related to that or maybe other config file in the system. So we have a lot of consistency saying, OK, this serial number has these two and these two and these two associated. And then when you call the software that talks to that tool at some point in your script, you can you can talk talk to that. There is even more interesting stuff like sometimes you need to test different parts of your system in parallel. Maybe it's not one hardware piece. You have them and they need to talk together. So at some point we want to be able to to run in parallel multiple devices and the rest. But yeah, let's see how far can can we get. OK, and thank you so much for your presentation. My question was around the jammer files and the specification about although it is not available now the canvas communications and the flex ray and other protocols. If do you plan or you have a roadmap in order to put it into your specification and the way of sending standard ones in order that OK, I want to get profit on my job and of the opens of the tool. But I want to create a new hardware, but I want to use the same protocols as yours in order to change for your second version that you are developing into one year. But I need to I don't know if you have planned into your role in order to something. It's not on the roadmap. But yeah, one of the things that is doing and probably one of the reasons why it's getting into the embedded space is automakers. And yeah, they I mean they use can buses or automotive other med. Yeah, so at some point, yeah, we will need to figure out how to do that is we haven't thought about it, but yeah. OK, thank you very much for the presentation. Thank you for having us. You
From hackathon idea to hackaday prize - How we make a Braille embosser.
Okay, next up we have Stéphane who will be talking to us about his project to build an open source Braille printer. So please give a warm welcome to Stéphane. Hello everybody. Thank you for attending this session. We are here to talk about a project, an open source open hardware project which is called Braille Wrap. And we will explain the story of the project, basically the technical subtlety of the machine. And talking about some questions the machine is asking to us. First we will talk a little about Braille. Braille has been imagined in 1829 by Louis Bray, he is a French guy. We build a system with six dots grid that you can read with your finger. It's a tactile system for excited people with a well defined dimension that you must respect if you want to call it Braille. Basically each combination of the six dots correspond to a letter in the alphabet. Very simple. But with six dots you have a very limited combination for a style. So every country in the world have adopted a different system. The grid of six dots remain but the signification is slightly different. Even from French or from English, for a basic letter you can have the same. A is just one dot in French, just one dot in English. If you have a capital A this is slightly different. If you have a number it's different in French and in English. About that, making Braille documents is a subject, a topic in the maker's movement a long time ago. You have many projects in open source movement to make Braille documents or Braille labels. I think the first one was Braille Go. It was a teenager who built Braille and Bosse with Lego bricks. You have also open Braille from Carlos Campos. This was a project with a recycled printer. Many of these projects were successful because you can use it to produce Braille documents. But they are proof of concept because they are very hard to reproduce. If you want to build a Braille Go you have one or two pages of documents of how to build it. If you want to build a system from Carlos Campos, open Braille, you need to find the good old printer and deal with his document to produce it. In 2016, my name is Kit. It's a French nonprofit organization. Basically it's a fab lab oriented to technical solutions for disables. They start to test making Braille with different solutions. In Hackathon, they customized a 3D printer just putting a Braille needle at the end of the 3D printer. They write a piece of software to translate text into Braille and the Braille into G code. This was functioning about an hour. You need to customize your printer and you always need to do things for the device to continue functioning. It's based on 3D printer so it's not very fast. Just to make a dot, you have a stepper motor so it takes some time to push the paper. If you need to make 600 Braille dots on a piece of paper, it takes about an hour. In 2018, we take all this project, think it's proof of concept, and start a project. Open source. Actually Braille is less than 1.2. Easy to build in a fab lab. The idea was to have a device that everyone can build, not to have the better device of the world. With widely available parts, we wanted to have something that not very specific parts that will be hard to approximate. The first point of the project was how to make Braille dots. All the projects we have seen are using a Braille needle, pushing the paper on some kind of soft material. Sometimes a mouse pad, sometimes a piece of plastic. They always use stepper motors or something like that. We think that using a solid needle would be a better choice because they are strong, fast, cheap, and you just have to make an on-off. It's really easy to operate. We put a Braille needle at the end of the solenoid, and it's working incredibly well if you put an anvil on the other side of the paper. You have the solenoid on the bottom with the Braille needle at the top. We have the sheet of paper on the other side. You have just an anvil just to control the shape of the dots you are making. Once you have that, you have a tool to make Braille dots. You just need to move the tool around the paper sheet. You just need to have an X and Y system. This is a very standard design. Every numerical machine, 3D printers, CNC are doing that. They are moving a tool in X and Y space. Actually, this is what is a Braille wrap. This is a numerical machine, but something like a wrap wrap. If you remind of the first 3D printer, it was a 3D printer that you can build yourself. It's a wrap wrap with an embossing tool, and not with a hot hand with plastic provisioning. The frame is built with laser cut. You can do it in many materials. The most popular one is plywood or PMMA acrylic sheet. You can do it with any single one. All mechanical parts are 3D printed parts. All the bearing, linear rails, the stappers, and also all the electronics. We chose to use a 3D printer controller with customized firmware just to have some specialized functions to handle the paper. Finding the edge of the paper is not really the same as doing a reference course in a numerical machine. It's very little. Because making a machine that is working well, easy to build, easy to build for everyone, we make workshops with the machine. You can see that the first version, the image on the left, is not the same as the last version, the image on the right. The message starts small, but starts. The main problem probably was handling the paper sheets, because paper is very strange. When you handle it in your hand, you imagine a very soft material, but if you lock it in the good position, you can jam a printer, you can see it in the 3L1, and you can jam a sheet of paper in a bright wrap. What I wrote is you can lock a stapper with just a piece of paper if you put it in the good situation. This time, we learned that you must not force a piece of paper to go where you want it to go. You just have to tell him gently where to go and everything will go fine. The next aspect is the software. Once you have a good device, you can print, make dots on paper, but you still have the problem of the braille. At the start of the project, we made a piece of software in a hackathon, so in a few days, you know, with a start of braille translation, only in French. And the software that generates G-code files that you need, another software to send to the printer. It works, but it was not very user-friendly, you know. A few months later, we used NadBri. NadBri is an open source braille translation software. It's available on the Internet. It's a French project. So the braille translation is still in French, but we were able to modify the software to put a driver and just to have a print button in the software. So you write your text, you translate it into braille, you put a print and you have the result and the embosser. So for the user, it's a big improvement, but we had some problems due to the evolution of Java and the lack of maintenance on the NadBri project. So we start our own software. Looking in the Internet how to translate braille, we find a wonderful project that's called Libre. Libre starts many years ago and is a fork of a project that is called Braility Onlinex. This project started in 1995, you know. So it is widely available, free, open source. Many developers contribute to have a good translation in many languages. So you have 200 braille standard translations available in Libre. You have everything you want, French, Italian, Spanish, Chinese, Swiley, Arabic, many, many. So we use this to make the translation and also we made an accessible software. Software Access Braille Wrap has been tested with what we call a screen reader. This is software that the Insighted use to use a computer. You know, this is basically something that read the screen to tell them what is written. When you want to write accessible software, it's not just reading the screen. You must imagine the scenario and putting just the information you want to be told by the computer for the Insighted people to use it easier. We have some projects for the future. USB Braille, this is basically a Wi-Fi extension to use the embosser just with a smartphone. You know, you connect it to the Wi-Fi. No software to install on the phone. You just have a web application. You write a piece of text, you put print and you have a label, a piece of braille on the embosser. This project is funded by another organization, Moolove Open Source. This is a CCLab. We have another project that is called Desktop Braille Wrap. The idea is to mix vector graphics to make tactile schematics and Braille. The idea behind this is to build a metro plan, a building plan, something like that. We made some tests and it's working very well. Since 2088, as we put the project on the internet, we have people on the other side of the world sending mail, okay, we have build one. It's working well. We made some operations with the machine, Braille Wrap Cameroon, where we built six Braille Wrap in Cameroon in four workshops in four cities. One guy built one in Fab83 in Bhutan with his team and earned the public price of Fab83. It's interesting. The best reason for this project to have been reproduced is that we are working hard on the documentation and the assembly guide. Of course, it's a never-ending story. We need to work on it on every evolution of the hardware. Nowadays, it's more than 100 pages of documentation, step-by-step assembly guide. What we have learned with that is that sometimes using 3D rendering in the documentation is better than using photo, you know, because you don't want to take photo like you build the prototype and once you have build one, you don't want to build another machine just to take the photo. The other reason is that when you're making 3D rendering, you choose what you want to show. So you can hide some parts just to focus on some parts. I'm at the end. Bright availability is still an issue even in our country. So the question asked in this machine is how an open source solution can spread in the population where they need it. Can FabLab contributed? And more widely, what can we do with open source for promoting a more inclusive world? And as the time is over, there is some details on the project issue. Download the presentation that I have uploaded on the software for the first time. Thank you. So thank you. We'll have questions out in the hallway. If anyone has any questions, you can meet out in the hallway. We're going to set up for our next presentation, which will be starting momentarily. As you come into the dev room, please do try to move into the center of the rows and leave the outside seats available for other people to follow. Is Pieter here? Oh, yes. Thank you.
Automated Documentation for Open Source Hardware
Okay, our next talk will be from Pieter Himmer. This will be on automated documentation. Please give a warm welcome to Pieter. Yeah. Thank you for being here and thanks for the great programming because, yeah, my talk really follows well, follows up well on the last talk. So my goal for this presentation is to tell you about how to automatically generate IKEA-style assembly instructions for open source hardware. So before I do that, let me just explain a little bit about myself. I've currently just started as a self-employed software developer or slash researcher and I'm, my target is to make a living in open source software development. So I'm currently, I've just had my first contract with Ansel that has been talked about before and this allows me to work on a really nice project in Freca that's close to my heart, is managing data for models, so improving parametric design. I'm also a co-founder of the Open Toolchain Foundation, which is a foundation that aims to support a whole tool chain for going from a design to a physical product and we've seen great examples for that. I come from academia, I've been a researcher in the area of high performance computing, but a couple of years ago I got a chance to work on actually my passion, open source software and hardware and I was in a project in Hamburg with which, in which my colleagues from InMachines in Gratia were creating the OpenLabs starter kit. So the idea was in a span of one year and a half create eight machines for FabLabs, 3D printers, laser cutters, CNC machines and three versions each, so rapid prototyping so to populate basically FabLabs and all open source hardware. But of course there is one problem that a documentation we heard about in the last presentation, so documentation is really crucial for open source hardware, without documentation no open source hardware. And it's for replication but also for collaboration informing people how to improve things. But as we probably all know here documentation is very labor intensive and it's always out of date essentially. So related work, there are two basically two approaches document after the fact you design your machine or whatever you design and you start documenting with the potential problem that you miss important things to, for example to tell collaborators the design decisions that you made. Another basic approach is document while doing but this has the problem that you may document much more than you actually need. So in the current state of the art I think in documentation for open source hardware is git building and a difference with our approach and git building is, git building is I would say text first, images second and for our approach is images first and we try to minimize the amount of text. And that helps us with something that we find very important is a semantic relation between the source of the hardware, the CAD files and the documentation. This is difficult to do in, this is difficult to do in if you have text describing a machine for example. So our goal for this research was to integrate a design and a documentation process, generate assembly instructions automatically and support design evolution. If the design evolves then we hope that we can just push a button and regenerate a manual. So my colleagues of Ingressia in machines they created the Fabulaiser Mini and they spent many months with three persons, a graphic designer, a CAD expert and a machine designer to create a very high quality IKEA style assembly manual, assembly instructions and this was really nice but also a huge effort and I believe that when the instructions were done it was already out of date. So this was our starting point in trying to automate this process. So an overview of our approach is that we have a CAD file that we annotate in a CAD-like manner with, for example, with layers and something that we call layer states and we have a textual specification and the textual specification relates to the CAD source. I will show you that later on and we combine that information and generate a PDF assembly instructions. So we created a dedicated workbench in FreeCAD to help us annotate the CAD file. So typically for us the input is a step file and for example we have a button that allows you to select some screws and then you press the button and then they will automatically explode and show this red line that you can see on the screen. Another thing to highlight here is that what I circled here is one of the step layers, step one detail and that is something that we can refer to from the textual specification. I will show you later. And another thing is that the window that you see here with the CAD model, we created essentially a button that allows you to take an SVG screenshot, high quality image of exactly what you see here and we can remember the camera position. So you can essentially move and rotate your model and think, okay, this is how I want to show this in the manual. You save the camera position and you can generate an SVG image out of it. So let's go into the textual specification. At the left we have a domain specific language that helps us describe the manual and at the right we see the output and we can specify the title and we have a command for a bill of materials and this is the only thing that we need to specify and then we get the visual bill of materials that you can see at the left on the assembly instructions. We can specify which main image we want and what kind of highlight we want and this is what I ask you to remember step one detail. So the image that you see here comes directly from that layer state that we defined in the CAD program. This is where you can see the semantic relation between the information in the textual specification and the annotations, the CAD like annotations in the CAD file and everything that is underlined rats are basically these references into the CAD file and that allows us to create this page that you see. So and we also have some annotations, remarks and how to that references a dedicated how to page that tells you how to do things for example, combine to profiles for example and we have commands to add tools to the page. So what was the result? So hopefully you are of the same opinion that the original and the generated are very similar. There are some details that are that we don't have the same flexibility as a graphic designer for example in the text on the top you see there is a red dot in the text. Well so we cannot create so or we cannot do the same thing because we generate the page and we just don't have the flexibility. This is another page and again here is a problem with the flexibility that a graphic designer has for example in the bottom you can see there are two options for the same for a tool and that is difficult to do for us when we generate these pages. So because the original manual was developed over a course of months with three persons without any time tracking it was really difficult for us to measure scientifically what the cost benefits in terms of time are but we tried to give at least an indication and we didn't have the resources to create this whole manual so but on a small model of about six steps a small vice and a vertical lathe to really create this manual was about 25 minutes so but it was us who wrote the software so let's make it a factor two I think it's still pretty good for creating these kinds of manuals. So in terms of design evolution going from version N to N plus one we heard about this in the previous talk so minor changes in the model let's say replacing small screws with smaller ones do not require any action at all you just push the button and the new manual has been generated. If the changes are larger then we can show that because of the abstractions that we use the changes that you have to make are all limited in scope and for example if you make a larger version then you probably have to zoom out on one of and store that camera position but after you've done that you can just push the button and it generates the manual again for you. So and the biggest change that yeah the biggest change that you can encounter is if you split up steps assembly steps or you merge them then you basically have to go all over all these abstractions. So before I conclude this talk I would like to acknowledge the people that I've worked with so the co-author of our paper is JC Mariscal Melgar at the Helmut Schmidt University in Hamburg the team that created this example manual that was a huge inspiration Daniel Ingressia, Markola and Liana Sayuri Honda of Inmachines Ingressia and so this project has been funded in the context of the Interfacer project funded by the EU and to make the professional to make the software a bit more less research software I received some funding from the NGI-Dubb C OpenCall as part of the Open Know-How project. So my conclusion is that well our research proposes a novel solution to the documentation update problem in the collaborative environment of open source hardware so I'm happy to take any questions. Hi, thank you for the presentation. I wish this existed two years ago when I was writing the documentation. The question is is there the capability to also add a photo or like an external picture to this tool? Yeah so we don't prefer photographs because they tend to get out of date so we prefer to have 3D renders but yeah I think it would be possible to add that to a layer yes it needs to be customized a bit. Okay because I do not want to model glue in Frickad when I need to. Right thank you. Is there a reason why the steps are explicitly numbered rather than automatically enumerated because if you add a step two between step one and current step two you need to rename all the things. Yes no there is no reason you can name it whatever you want. We just did this to make clear that these things represent steps but no you're right. Hi Peter thank you for the great talk. I can use this not only with Frickad designs but I can use any step file that I have some step assembly and take this apart or is it limited to Frickad? No actually it works better with step files than with for example assemblies in Frickad because for us it was difficult to choose which assembly because there are so many. But and the hardware designers in our case used Fusion and the CAD expert took the step file from Fusion, drew in many more things and used Rhino so for us the input was step. I have one more question. I saw the point from the camera was in your code so I saw the point from where you're looking was in the code. How do you select which part would be taken out like the screws or the hinge? Is this done by clicking and selecting or how do we do that? Very good question. What you do is you have something so the idea is to create layers in Frickad and we have a layer for parts for each step. You have your model and you can select things and then these parts will automatically go into the layer that you just selected and will disappear so this makes it very easy for you to basically go down your model and select everything in the right step and that's the basis for the bill of materials. Then we have layer states that define which layers are on or off and so you can basically go from one layer state to another layer state and this positioning for example, the positioning of the screws is also stored in a layer and if you turn that layer on or off it's straight switches between the actual view, the assembled view or the exploded view so you can switch very fast between all those views. The parts are like manually. The screws are taken out from the transport menu or are they taken out on the transport menu? The question is whether the screws are taken out manually or not. No, it's automatic. You just select the screw, you hit a button and we can. So the technique is we take the center of the bounding box, the center of mass for a screw that is a bit more, well if the screw is here, the hat and there it's a little bit to the right so we know which direction to take it out so it's automatic. Okay, thank you very much for the very interesting talk. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you.
Sharing parametric models as web apps with replicad
Hi, I'm Steve. I'm a Swiss software engineer. I like to thinker. I like to make things. I like to share the thing that I like and the thing that I make. This talk is a lot about these kind of things. And the story starts, as many stories started today I think, in 2020, 21. And for some reason lots of people started to pick up new hobbies and I was no exception. So I started to do 3D printing. And it was a lot of fun. I bought a cheap Chinese printer. I think you're quite a bit with it. And I must admit the hardware part was not really my thing. I was more into the modeling part. But yeah, lots of fun. The thing is it was not as easy to share with friends. A lawyer friend, they're not going to thinker as much as I do. So it's more difficult to share. The thing is the machines are getting better nowadays. They're getting closer to being appliances. And so I can share them with them. And I can try to share the hobby, generally speaking. And the good thing is the modeling part, people are not going to model. I assume that these people that are interested potentially in 3D printing, that are going to interested. They're going to go on one of these websites, if you don't know, they're a repository of 3D models. And they're going to apply a very simple workflow. And they're going to download the model. They're going to slice it. So use the software that you tell the software, what is the model, what is the filament you use, what is your printer, and it magically spouts out a file. That is a print. That is just it. It's very simple. And if you're not technical, that's perfect. And I've been using this workflow for different things. If you look on the left, you have this thing that you might be using. And it's a way to make snowballs that are perfectly shaped. And if you remember, I'm a Swiss engineer, so I like my snowballs to be perfectly shaped. The other ones are not beer crates, because I don't have a printer that big. Therefore, batteries. Anyway, they're great models. They're simple models. They're fun to print. They're fun to share with people and to give them and all things like that. And this is not what I'm going to talk about. You know, these are very well modeled and shared as just a single file. What I want you to talk about are things that are more like that. So this is a very good project that you can find on printables called the Honicom Storage Wall. So it's a way to do pegboards for the printing. You have this base plate, which is a Honicom that you put on your wall and then the community has gone well and done a bunch of different attachments. And you can attach anything. So here it's probably in someone's office, but I've seen people using it in their bathroom, in their kitchen. You have attachments, people model attachments for everything. And it's just great. There's a lot of big community around it and all these kind of things. So I'm not going to talk about the attachments and modeling the attachments here. I'm going to talk about the plates. Because what happens is, you know, these things are made for 3D printing and 3D printers that tend to, you know, have different sizes and different beds. And this is what you can see in this file. You have different sizes of base plates that correspond to popular printers. They're not going to cover all of them, but you know, you can get quite far. You then have people who want to have, you know, nice borders because, you know, perhaps it's in for their kitchens and they want to have the kitchens being looked really well. And so people, you know, the community has provided. But then you get into these explosions of combinations and you don't have it covered necessarily by the community. And I can see people in the back just screaming parametric models. And yeah, yeah, I know. And this is what we usually think about it. And I mean, small parametric models, software, and anyway, I don't think these are the best answers. This is the best work, one of the best ways that we have now. But I think we can approve on it. And I'm going to show the limits first. And the people making the honeycomb storage wall projects are really good. And they have shared their the files that they've used to build. So they build the thing with Fusion 360. They also have some people in the community have re-implemented the model in OpenSCAD. And I'm just going to walk the, you know, the simple workflow I was at the beginning, you know, download slice print, what it looks like if, you know, you were new to 3D printing and you want to, you know, change the size of your build plate. So you download the model, this part of the same, then you have to find the hobby version of Fusion 360, or I don't know what it's called now. And if you try to do that, you know what I mean, it's not easy to kind of hide it. And then, you know, they try to get some money. If you figure that out, then you download it, pick sometimes a big file. Then you just, you know, you have a professional tool in front of you. And presently, I, you know, I'm comfortable with Fusion 360. This is what I use to actually learn CAD. So it's quite nice. But, you know, you just have a huge program in front of you and you don't know what to use. Perhaps you, like me, and say, oh, it's a challenge, I'm going to be interested and watch a lot of video and learn how to CAD. But perhaps then when you're done, you just don't know what you were trying to do. Or, you know, what is more likely to happen is you're going to abandon. You know, you're not going to customize it, or perhaps you're going to ask your friend who's more technical to do it for you. With OpenSCAD, you have something that's similar. I'm going to go faster. So download the model, then you download OpenSCAD. And the thing is, perhaps you're, you know, it's an open source thing. You don't know exactly what it is. It doesn't look as professional as the other tools. And, you know, the computer telling you, oh, it's not only software, you're sure, perhaps you're just going to abandon there because you don't trust it. Then, you know, it's code CAD. I love code CAD. But if you're a lawyer, code CAD is not your thing. And perhaps you're going to try anyway, but you're going to change the wrong line and add the wrong type of code and all these kind of things because, you know, not everyone is a programmer. And so, you know, you're going to feel that you're going to abandon and you're not going to have the thing that you want. So what do we want? We want to lower the bar for the end user, make it, you know, make these parametric models accessible to everyone. And the solution that, you know, proposes to have something that works very fairly similar to what we have before. You just add the configure model step and then, you know, download, slice and print. And how can you do that? You have these web generators and configurators. I don't know if you know them, but typically, if I want to create a QR code, I just, you know, Google QR code generator, I skip the first five because there are probably just lots of ads in there and I know the good one. I don't remember right now, but, you know, you have these kind of things for these two. It's single serving for a particular purpose. And it's just great. And there is no reason, or there might be a little bit, why we shouldn't share our parametric model that way. You know, it's just software to just do one thing, you know, the UNIX philosophy. And so what I have is a QR code for you. You don't need to go there, but it's because you can see it here. So it's a configurator that, you know, creates the Honeycomb wall storage thing. In the middle, you have the model. If you are on the top, you have something to configure it, you know, the number of rows, the number of columns. And here you can just download. You can see what goes, you know, configure, download. I don't have the slicing and the printing because it's another tool. I don't want to implement a slicer and a printer in my software. It's just something very simple. And you know, you have a couple of things where you can edit, you can see the code and all these kind of things because, you know, we share stuff. The code is open. Just go. What you're thinking is, I don't want to build my own configurator because it's, you know, you have them to maintain a server. You have to pay for it. You know, you have many reasons that you might have. Oh, I just went a bit too far. But, yeah. Yeah, you don't want to pay for it. Many reasons that you might not want to do it, right? Because perhaps you're more back-end person and you don't really care about building the UI. Perhaps you're more, you know, you don't want to touch C++ or you don't want, you know, you don't want to install, to compile some stuff on servers. You know, many reasons you don't want to do that. And so the thing that you've already seen because there was a bit of spoiler, we want to lower the bar for the maker as well. And the way that I've been trying to do that is with this project that I've been building. So it's not the first purpose. The first purpose is going to come later. It's just a bit of suspense. But with Replicad, you can, you know, as someone who is interested in code, make a web configurator very easily. So what is Replicad in that context? First is the online workbench for CodeCAD. So if you want to do CodeCAD, so it's drawing with code, you can just go to the workbench and you code on the left and you have your model on the right. It's something that, you know, was probably originally done by OpenSCAD. You have many, many different examples now. You have cat query, you have something similar. And in the terms of something purely online, you have something called Cascade Studio that exists. So it's nothing completely new. But you have it there, it's something that exists. You can do your model. Then it's something that is a bit different. It's a dot in JavaScript TypeScript. And perhaps some people are just, why would you choose JavaScript? Many reasons. It's a great language now. You should try it again. And the second one is, if you are actually new to CodeCAD, you might want to learn to code. And there are lots of resources for JavaScript online. It's, you know, it's a bit everywhere. And so there's a lot, there are lots of resources to learn it. Also, NPM exists. If you want to do some Voronoi stuff, there's a library for that. So it's also quite nice to just use a language and not have a specific language for what you do. It uses the OpenCascade kernel. So it means that you can make fillets as well as you can make them in FreeCAD, which, you know, means what it means. It, I mean, it means it's a powerful kernel. It's not perfect, but it's very powerful. So you can do lots of things with it. And the last thing, which is coming back to why I was introducing Replicant in the first place, we have a built-in web configurator. So, you know, you draw your thing, you can download the model, or you can click on something to share it, and you generate the link. And the thing that if you, perhaps, didn't let you the time to open the configurator that had before, is what you have directly. Just, you know, a bunch of parameters that you expose, a way to download it. It's very simple. But it's not everything. The first thing, you know, as a software developer, we build things for ourselves. And so, you know, I'm saying that I'm rolling the bar for the end user and, no, no, I'm doing it for myself. But it also means that, you know, perhaps you're a bit like me, and you're a web dev, and it means that the bar is also lower for you to build things with this. So perhaps in the list of things that I had before, some of them say, well, it's not that bad to build your own UI paths. I want to. And actually, replicates this library, and so it means that you can. I can just import it. You build your own front-end project and use it to the library. And as an example, I'm going back to my conference-driven development project that I had before. And here, you can look in the corner, and it's just parameters, right? It's not that great, because what I want is distances. I don't want rows and columns. And so, oh, what you'd have to do is actually do a bunch of math before actually, which is not very good either. So I did another project that you can look at, which is what you expect. It's an online configurator that just generates the same model than the other one did, but with my own UI. So it doesn't, you know, it shares some resemblance with the other model, because, you know, I'd meet both, and so I have a bit of style or try to. And the thing is, now you have distances. You don't need to have rows and columns and do the math, etc. Yeah. You have an undo button, because it was already there in the thing that I copied, but you might want to do undo in your particular project. But it's something that I built just for that. And there is a viewer, and actually it's more responsive than the other, because I spent five minutes to do the responsiveness and all these kind of things. So this is what you can do with this kind of thing. And my aim was to use CAD as a web API. Nowadays, the browser is just an amazing piece of software. You can do audio in it. You can do, you know, 3D rendering, which I use actually. But drawing stuff with CAD is not something that is there by default, which is probably a good thing. But you might want to do CAD in the browser anyway, and perhaps use replica for that. That's kind of my point. Actually, you know, going back to why I did it, this is something that I did. Another thing that I'm into is making boxes for board games. Don't ask, you know, people have some more niche hobbies. And so I made a specific UI for making boxes for board games. So, and it generates the box. But, you know, someone might want to generate documentation or to make the first step and not have to install a free CAD for that. So it might be a tool for it. Or, you know, if you have some specific needs, perhaps just for hobbies, but for work, it's something that you could use. It's kind of my point. And so if, you know, we get in towards the end of the day, and we're going to think a bit about what have we learned. The first thing that I didn't really mention, but I want to stress, because it's quite important, you know, I've said, oh, perhaps not great to share things, you know, parametric models. Actually, there's no wrong way to share stuff, right? It's just, let's try to be better about it. The thing that you can do is we can lower the bar, make it easier to share parametric models and, you know, as configurator, especially. And then, if you were a bit code-cautious and want to play with these kind of things, perhaps have a look at the workbench. You know, there's a bit of a community on the GitHub repos. You can ask questions, discussions, people are interested. So have a look. And if you are a web dev and you want to do a bit something more involved, you can think of replicat as a library to work. And so this is all I had for you. I hope you had fun. Do you have any questions? You make this sound so easy, which is wonderful. But where were the dragons? I mean, part of the thing is, was learning, I mean, I rely a lot on a project called OpenCascade.js. And to me, the dragons are here. Copiling C, C++ is not something that I want to do. And so I could avoid them. Then, it's lots of fun of, you know, building the thing, trying to figure out the different technologies and things like that. There was no, I mean, and then it was about learning OpenCascade. And, you know, one of the things that a replica does is it tries to handle memory as well as possible, because OpenCascade is C++. And, you know, you have to manage some times the memory. And there are definitely memory leaks in my project, but, you know, you're welcome to find them and fix them. But yeah, it was trying to find ways to do that. At some point, I designed the API to handle it. And then I found a way that was better to just have it magically disappear. So when will you buy a laser cutter? So you also make laser cutter boxes? Actually, it won't be, I don't have a laser cutter and I've been, you know, resisting buying one for a while. Before getting into 3D printer, I bought a silhouette machine. So, you know, it's to cut paper. And if you go to the website, the deck in the box website to make boxes, before making the 3D printed boxes, I mean, boxes in paper. And so you have the die lines and it generates, you know, you have the same interface to generate die lines to cut things for paper. But yeah, so, you know, I'm resisting as much as I can. And perhaps there is also one thing that is a bit of a dragon, that is perhaps a rabbit hole that I partially fell into, is the 2D part, because the OpenCascade is not that great for 2D Boolean operations. And so I started to implement them by myself and I'm, you know, starting to build a 2D CAD kernel and it's getting a bit out of hand. Great talk. Thank you. Have you ever thought about free CAD import or kind of a connection to free CAD as modeling free CAD might be easier than coding it? I'm not sure exactly how that would, I mean, you can import step files, but then you have, you know, the whole model and the way it is, I'm not sure that it would be easy to do. And part of the thing with code CAD that, you know, makes it easier, typically, you know, you have the type of naming problems that they're, they're solving currently, you don't really have it, because instead of selecting by, you know, the number, you know, which one you have in the array, because you clicked on it, you can say, oh, I want to take the edge that is at this distance from that, because you know this when you, you know, you model and batch, yeah, you have to do a little bit of mass to figure out what is the distance and things like that, but normally it's basic geometry and you're going to have it wrong and do it again, but it's going to be okay. And so, no, the answer, short answer, sorry. Okay, thank you, Steve. Thank you.
Testing in a Box: Streamlining Embedded Systems Testing
I know everyone's excited for Tessa. We keep the good talks for the end to make sure you all stay the entire day. All right. So we have Tessa in the box. So please give a warm welcome. Thank you very much. Because the end is always a bit chaotic, I was just going to say, could we have a round of applause for all our volunteers here? So without further ado, this talk is testing in a box. It's going to be given by my colleague. Hi, everyone. My name is Munti Chirma. I'm a software engineer at Coltsing. I've been with them for almost a year and a half, and I specialize in embedded hardware and software. And myself, William Salmon, I've been at Coltsing for about six and a half years, and I've specialized in embedded Linux and Linux integration and testing of those things. So I'm going to give us a little bit of context as to how this project was motivated and a bit of a very high level view. And if you guys are wanting to do similar things, maybe you can take inspiration from this for persuading your colleagues of the value. And then that's going to give us some very top level requirements, and then it's going to talk through how those top level requirements become hardware requirements, and then our implementation. So Code Think has been offering software services for over 15 years. We work with open source software a great amount. All of our developers have Linux laptops, and all of our infrastructure runs on open source services. So we really believe in open source software, and this is kind of our journey into open source hardware as well, and we're trying to take all of the good things that we've learned from open source software to open source hardware, but we're a little bit earlier in that journey at the moment. So we work with a lot of embedded projects. We have an awful lot of automotive clients, and we also have some in the financial industries, and what they all share is that they want to bring lots of different bits of software together, so they always work in exactly the way that they want, and they want to make sure that what goes into their things is exactly what they want, but also that it behaves exactly the way that they want it, so how do they verify that it behaves how they want testing. So one thing that we found across most of our clients is that they all share some kind of testing transition, some kind of testing progression, so on your left hand side you've got your final production, and the reality is that some bugs only ever turn up all the way out there, but they're very difficult to deal with, and we really want to avoid that, and then we've got things on the right hand side of this that are very convenient for us to run our tests with, but maybe don't provide as much value. So no developer wants to come and contribute bad code. We have these really creative developers in our industry, many of you have sat here among us, and we want to make their lives as easy as possible, so we want to provide as much value on this side of this transition, but also make it as easy as possible to get information back from the later steps. So what sorts of things are we battling? So some things that we find quite problematic, so slow automated tests are problematic because that's taking a long time to get information back to our developers. Manual tests are never going to be particularly reproducible, so we really want to squash those. There are some tests that are just expensive, maybe they involve some kind of consumable that's expensive, maybe they involve some huge AWS server that we just want to touch as little as possible, and sometimes maybe there is only one lab car because it's a brand new car, and so we are really just resource constrained. It doesn't matter how much money you spend, you've only got one. So kind of our drivers are that we want to make the early steps as accurate as possible, so we want to use as representative hardware as possible, we want to have as representative peripherals as possible. A lot of cars these days have surprisingly complicated networks, so there's a lot of different bits of hardware to go into making it really accurate, and so as our code moves along this transition it gets more and more expensive, more and more difficult to deal with, so another thing, and we get longer and longer feedback times for our developers, so a key thing is to make the left hand side of this graph as easy as possible. For our developers to interact with, so making it easy for our developers to trigger things for later steps in the transition, but also to get information back. So testing on hardware is what we're all about in this talk, and some things that, so I talked about this is kind of a requirement for the next step, for Muddit, and so some of the things that we definitely want to be able to do with this project is completely flash a rig, because if you're not completely flashing a rig, you're going to end up with the last MR affecting the next MR, whatever it is, we need to make sure we've got known starting points, so we need to be able to do that. OTA, over the air updates are a thing that goes wrong all the time, and have massive consequences, so we definitely need to test them. As we went through our testing transition, we tend to find the more and more complicated bugs. One of the reasons for that is that the actual tests that people do kind of get more and more complicated and more and more integrated, so you need to be able to have tests that in a single test can interact with the UI, and the various buses involved, so in automotive often can, and also get down to some peripherals. So one test that we were working with quite recently involved changing the set point of a car's climate control, making sure that from the UI, not just some API call, but actually have the UI set the point, then have that command go down through a bus into a real or a mocked peripheral, and then have data come back up through the bus of the value slowly changing and make sure that it all behaves together. By doing that, automatically we can also improve the reproducibility of that test. Some very top level requirements are we need CI rigs, so we need some rigs that are only for CI. We mustn't be merging without our automatic tests. We need some kind of coordinator so we can make the most of those assets that we have. If we have five rigs and we've got 50 developers, we want those rigs to be absolutely hammered so that those developers are waiting as short as time as possible. We need to be able to interact with the UI, and we need to be able to interact with the hardware through various buses through IO, and we need to be able to control those peripherals. So now that I have gone through all of those, we have a pretty good set of requirements for our testing infrastructure. Thanks, Bill. Now that we've talked about different types of tests and requirements for them, let's go through the basic requirements for testing infrastructure. Let's see if it's a slide. All right. I don't know. Anyway, I'll go about it anyways. So, I mean, talking about testing infrastructure, we need a computer of sorts to talk to your device on the test, and you need hardware to simulate the actual production application. So that would be, I mean, talking about car rigs, we're talking about canvases, we're talking about serial in most of the test images. These ECUs will have serial enabled, so you should be able to probe the ECUs and get out of it. So you'll have multiple serial dongles. In some cases, the testing application or the testing requirement is going to be so niche that you'll have to make custom hardware, and sometimes making custom hardware is not feasible, so the first thing you'll do is have your own reason of the shelf hardware, and you'll get a working example from there, but following which, you'll have to make your own piece of hardware. We've done that before. We've had clients who wanted to test device discovery for Android automotive and CarPlay, and their ways of testing were quite manual, so they had engineers with spreadsheets and iPhones, so the engineers would go to a rig, plug in a device, check the output, mark the representation in a spreadsheet, and based on that, they would make a decision that if it's behaving as it should be or not. So based on the description, we can make out that that's a solution to a problem that will not and cannot scale well. So we had to automate that problem for them, and we made a custom hardware for that and called it the USB switch. So the USB switch is a bi-directional USB-C switch, which can help you programmatically control one host between two peripherals and vice versa. It's a completely open source hardware made with QICAD and the respective firmware and cases are also available with the project. It supports USB super speed. That's something we've tested internally, not certified, but that's something we've tested, and recently we've gone through the effort of EMC getting the hardware EMC tested. It has cleared the EMC tests, and we are in progress of a process of getting the documentation for it in the open as well. The QR code is going to take you to the project, so if you want to learn more, come. So now that you've gathered all the pieces of hardware for your basic testing infrastructure, you put it all together, and the resulting setup looks something like that. So one good quality of this setup is it's functional, it works, but it will confuse you when you're setting it up for the first time and setting it up multiple times and tens of hundreds of rigs maybe sometimes is going to be quite a challenge. You have to maintain a bit of material, you have to maintain the documentation for it, and sometimes you'll buy hundreds of field long-books and you can't buy them anymore, so you'll have variability in your piece of hardware. Again, a trouble. So that gives us a question, what are the key requirements for a robust testing infrastructure? So when we talk about long-term support of software, we cannot discount the maintainability of the supporting testing infrastructure. That means we want to be able to maintain the inventory of spares. So when you put in the effort to set up a rig, you buy a certain piece of hardware that comes with its own tooling, and then you put in the effort to set up that tooling with your testing rig, and then you can't buy it in the next two years, that's added effort, and that added effort means time, and that time means money. So you want to be able to buy and manage that piece of hardware that you're putting on your rigs. And you want to be able to have the testing setup as easy as possible so that the developers can actually get to use those tests, get value out of those tests rather than spend time on setting up those tests. And we want to be able to have the setup as consistent as possible because consistency removes variability from your setup, and that means that you'll be pulling out less hair when you're debugging problems. So keeping all these requirements in mind, we kind of made our own solution, and quite descriptively called it testing in a box. So testing in a box can be seen as a multiple for testing. It's supposed to be a kind of I.O. for testing and facilitating remote access to your rigs, and it's designed with modularity in mind. When I talk about modularity, I think this image depicts the modularity a bit better. So each section in that image is something that we call as a slice. So at the base of the slice, we have a SBC, an unbased SBC. On top of that, we have the USB switch slice. On top of which we have a slice which can hold can dongles. And on top of that, we have our in-house built I.O. board. I'll be talking about that in the upcoming slides. And the USB switch slices and the can device slices are stackable, so you can add more USB switches if you want or not have them if you don't need them. And same goes for the can devices. So I have the hardware with me here. So in its most basic setup, this small contraption gives you the ability to have a GitLab runner. It gives you the ability to do UI validation tests with OpenQA. You can do device discovery tests by programmatically plugging in and out USB devices. And you can check states with GPIO. And you can also mock or monitor can devices. So essentially checking out all the boxes that we put on our list collectively in the beginning of the talk. So now to the fun part. That's the custom built I.O. board that we have for testing the box. Starting from the left hand side, we have three FT232H, which gives us serial SPI square C and JTAG. Then we have an R24T for HID emulation. And we have three USB hub, USB two hub inputs. And all of that then gets connected to your host, which can be an SVC or a laptop with a single USB C cable. We also have an ESP32C3, which sends serial to the R24T for HID emulation. And a new pixel, because everybody likes LEDs. And yep, if you want to electronically isolate your GPIO, you have an optocouples on top. So this is the version one, region B of the board. And these are the changes we got to make to version two of the testing in the box I.O. board as a work in progress project. So the idea behind the version two is reducing the cost, getting the bill of material smaller and with the space that we're going to make with that, we want to add more functionality to the board. So we're going to be getting rid of the three FTDI, FT32H and replace it with a single FT4232H, which will give us 4X serial and 2X MPSSE, which is multi-protocol synchronous serial engine. So basically that will give you your SPI, I2C and JTAG. And we'll be getting rid of the ESP32C3, which because we are not using it that much, the version one on the board was more of an experiment. So we saw after you refused that we were not using it that much, so we're getting rid of that and moving the functionality to the IP2040. And because we'll be getting rid of so many USB devices, we can actually get rid of one of the USB hubs on the board, which was re-chained initially, so that again gets the bomb cost lower. And with the space that we're going to make, we are looking at putting in USB PD on the board. So what essentially you'll be able to do is that you can put in a battle jack on this board and have USB PD out, so you can basically power other SBCs from the board itself. I've said a lot of words in the previous slide, so here's a quick overview of what we're going to be changing and the benefits. So here is an example of the use case that we have internally for testing the box. So in this example, we have a laptop, we have an X86 requirement, so we have a laptop connecting to testing the box, which shows modularity because currently in this case we're only using the IO board, which is connected to a Jetson Orin, and the output of a Jetson Orin is being monitored with the capture card. Now I won't be going into much detail about the actual example, but I'll be talking about the developer workflow. So every time a developer has to make a change, they can test those changes on their own machines, which runs on QMU, but when they're very confident that they can push these changes to the main, so they raise a merge request, and that merge request will trigger a set of pre-much tests. Those pre-much tests will flush the Orin, it will run a set of tests and check that it meets the set of requirements, and based on that, once those pipelines are green, the developer can be sure that, well, that the code hasn't broken the system in a way that was not intended, and the reviewers also have a metric to basically check the code against. So any piece of hardware is only as good as the software that comes with it, and we package together with testing in a box various tools. So we have Ansible scripts that let you easily set up good cloud runners, easily set up open QA workers, CI templates, we also have CI templates that you can use, and we also have a self-servicing script that you can use to set up the UDL rules for the IO board. We also provide an example, which, in a sense, for that, you need a Raspberry Pi with the AGL, and all you do is put the QAD binary on it, I'll be talking about that in a bit, and that essentially gives you a fully functional test in which you are doing UI validation, you are able to run your GitLab CI pipelines, and test and monitor can signals passing through the entire network. So I've got a couple of things to talk about. I've mentioned open QA, QAD, and Canvas before, so open QA is a tool that we use in-house to do UI validation tests, we have found it quite useful. So we have also contributed to it to make it better, and QAD is a tool that we've written to do user input simulation in CI, and how you make that work is, it's a small binary which you put in the right place in the root of S, and that helps your GitLab runner to send it HTTP requests, and mock the touch inputs that a user would actually, like the user would do. And Canvas is a tool that we worked on as well, it's something we've made in-house, which helps you build those tests and make it easier, essentially. So what's next? As I've already mentioned, we're already working on the version two of the board, it's still a work in progress, so if you guys want to join the project or have any suggestions, we're willing to, and we're very happy for some much requests or suggestions. And we're also looking forward to actually making a can of the expansion board, given that we set up these boxes on various automotive rigs, and we do use various canned dongles, it would be nice to have a single one that helps us basically transfer the knowledge between different projects, so we'll be looking at making one of that as well. And there's many more to come, so stay tuned, and that will be it for the day. Thank you very much for listening. So this is a great initiative. So what I'm wondering, like the single board computer or some other component, can it act as a USB device using the USB gadget subsystem, and can I, with the existing setup that you have, act both like switching between acting as a USB device and acting as a USB host, depending on the use case, using the same USB connection to the device of the test? I mean, I don't see why it's not possible. Like it depends on the single board computer that you're using, it needs to have a USB, and what single board computer are you using? That could also be a question. Yeah, I mean, again, for our cases, we got the Rock5Bs, for instance, in this board, which actually does support USB gadget. But the actual setup of testing the box is not limited to the Rock5B. You can put any system as you want, as I say, it's modular, so you can actually change it for Raspberry Pi, or a much more beefier compute if you want. Okay, so it's okay to ask, like how does it connect to the board? How do they connect with the interface? Just a USB cable. Okay, perfect. Single USB cable. So, really cool project. It's really interesting thinking about testing hardware, so I have no idea, so this is probably a question from ignorance, but when it comes to sort of what you call a test rig, how close do you get to, like, some mechanical hardware where something's moving, and at which point do you have to worry about state control? You say you have to flash the whole thing, but as soon as you've got hardware in the loop, you've got a state of not just the electronics, but you've got a state of the hardware, and that's where we've thought about this and decided to let somebody else think about it, and I hope that's you. It depends. I've got a mic, so you're off. Sure, so essentially in this case, like, we try to get as close to the production hardware as possible. Of course, it is hard to kind of test a moving car, per se, you know, so the idea behind the Cinebox is, as we'll explain in the testing progression, that we want to get as close as possible to a testing environment, so when you actually go to test and deploy to the production environment, you find as less bugs as you can, but yes, certain bugs will slip through, but the more closer you can get in your testing rig, the better it will be when you actually deploy to the end car. Okay, yeah, very cool. Do you support or have plans to support acting as a SD card for system booting from SD card? So that is, again, implemented in detail of your tests. Testing in a box is a multi-tool, so if your actual device supports booting from that, that shouldn't be a problem. Again, that's implementation in detail of how you want to do your tests. Booting from SD card. When you say, booting from SD card, do you mean, does any of our hardware boot from an SD card, or? The device and the test moves from the SD card that image has. Yeah, I mean, again, as I say, that's an implementation thing, so. The question was, can the device and the test boot from an SD card? Yeah, I think my answer is the same. It's an information in detail to the actual thing. Yeah, so there are various bits of hardware out there for mocking an SD card. We don't have any examples with that, but if you had something like that, you could potentially use the USB hub to facilitate something like that. All of the devices that we have allow you to flash the root file system through some kind of GPIO manipulation to put it into a reset mode, and then you can flash it like that. So the more modern Raspberry Pi's allow you to flash them by putting them into a reset mode, so you don't have to mock the SD card and the more sophisticated rigs we have also support that. However, if you did really need to do that, then there is hardware out there that we don't currently use in this example. But if you were doing that, then please send us some patches and we will add documentation for that for the system. So thank you. So I noticed that this is extremely neat and modular and self-contained and it's very easy to see what it's doing. Why is something like this not standard in history? Because in consumer electronics and in medical, big mess of wires is the standard testing rig and apparently automotive as well. So why is everyone tolerating this? Do you want to? That's a good question. I think some of it comes down to the fact that everyone thinks this is secret source and they spend huge amounts of money on things that are very niche and custom. So the way that we're developing this is that we think these are really important things to have and so we're working on them. And like I said, these are tools, so we think that tools should be in the open, so we're doing it in the open and that means we can collaborate and we can all work together on them. But when things are in little silos, then often the tools are the things that get hurt the most. So I would say that I think there's a lot of secret source or perception, someone else was talking about secret source before and it was a perception of secret source being really valuable but maybe it's not. So yeah, I think it's a really good question and maybe it would be better pointed at the OEMs to get themselves together. Did you want to say something? I was just saying lines. Okay, so it's a cool project but if I want to participate in it, could I just buy ready made boards or do I have to make them myself? So we publish all of the files that you need to go and ask a standard FAB to make them. So you can take the files from our project, go to a fab of your choice dot com, upload the files and in a few weeks you'll have them. We are in the process of hopefully CE marking this, so that gives us potential for selling them directly to people. But in order to make that economically successful we need to move a lot of the items and so I'm not sure if that will happen but we want to make it as easy as possible for everyone to collaborate. Yeah, if you want to buy them then not from us but we make it very easy for you to do that. Well, making a small batch production is expensive. So I was thinking that maybe you make thousands of them just sell to... In theory but some of the cheap hobby fabs, so this is all chi-cad, if you've got a little chi-cad project and you want it made, you send it off to China or somewhere like that and you get it back not actually that expensive. I think those companies that are set up for those small batches, if you wanted... how do I word this? If you wanted a dongle for this that was custom to automotive you might have to pay a few hundred pounds for that custom thing whereas you can get five of these for like a hundred quid or something like that. So yeah, it's not nothing but actually if you're interested in this niche stuff I hope it's not too high a barrier to entry. And if you're really interested in helping us out, I don't know. Some of these have made their way out of CoatThink into the wild so I don't know. Okay, thank you very much for a very interesting talk. So that concludes the open hardware dev room for 2024. As you make your way out, if you want mine looking down at the floor and if there's any paper or anything that moves, if you want mine picking it up, bringing it with you to one of the garbage cans, we also have a bag up here. So anything that you find on the way out the door, I also wanted to very definitely thank the other folks who were critical in making this dev room happen. Ian and Clement and John and Wayne. This is not an individual effort so thank you very, very much for all your help.
WebAssembly, WebComponents and media filters all at once: a proposal to open the Web to variety of formats
I'm going to talk about WebAssembly, WebComponents, and MediaFilters all at once. I propose to open the Web to a variety of formats. That's it. So thank you very much. I'm really excited to present this project today, since it's the first time I'm presenting the open source aspect of the Bevar project. So I'm Jerome Gorin. I'm a lecturer and researcher at an engineering school, which is called Gunilla Salamiya. I've been quite active in many open source projects at the Kniele and in Telecom on a JPEG. But the work I'm presenting today is the fruits of the research I have conducted many years ago, ten years ago. At this time, the proposal of my PhD was more theoretical and not practical. But since then, there is many technologies which have been integrated inside the browser, which let us deliver all types of contents on the web. So with my associate, Maya Bistrom, we created a company which is called Bebara. That means in Swedish Preserve. And with this company, we want to promote this technology and we want to speed up the adoption. So the talk I'm going to present today is to show you the open source aspect of this project and show you how you can contribute to it. But to dive into this project, let's start by thinking back a decade ago. So at this time, when browser didn't include the ability to embed multimedia playback. So at this time, you have to use plugins, web extension like Flash from Adobe and later Civilite from Microsoft to add this ability to play contents. But this kind of extension are now used. I mean, they face many issues like post-sale, HTML and CSS integration, security issues, accessibility issues. So to fix this, HTML integrate new tags, which are the video tags, the audio tags to allow the browser to natively support multimedia contents. So these new tags allow a wide variety of usages like rich internet application, like social media, video sharing. But they also restrict the media format and content to an handful of codecs. So there is no guarantee that a format will be supported across all the browsers. So the formats which are supported, you can count it on the end. It's like mp4, mp3, Flac and so on. So on the next slide, what I show is for instance, org and Fulera are not supported across all the browsers. And what is also concerning is that the JPEG Excel has recently been dropped by Chrome. So what this realistically means is that a lot of people put a lot of efforts to develop useful formats, useful codecs, but they are restricted from a wide adoption due to this gate keeping placed by the browser. So what we are proposing is kind of the construction of this by-bill tower of an entire container and format in order to let people freely use and develop their own formats. So we are only using W3C standards and open source technologies to fix some of the problems of this gate keeping and to make everyone, it make it easier to integrate like new formats to innovate, to deploy stuff and also give the ability to support legacy formats like AC3, DNG, MKV, EPS and so on. So let me now turn into details of our solution. So there are two parts, there are three parts, but there are mainly two parts which are the web component and the web assembly and we also use media filters. So for the first part, for the web component part, we instead of creating like new scripts, creating new tag, we just extend the usual tag, the audio tag, the video tag and so on. We are using the attribute IS which is something standardized by the web component and we add internal logic, we add logic to this video tag. So this is where web components are used. Then for the web assembly part, we are creating a new attribute which is the using attribute on which you point to an external library which is compiled with web assembly. So for instance, on this one, on this example, we will point to the library, I mean the libraries from the XIV foundation which include OGV, FLAC and VORBIS to decode the input source which is in OGV and maybe in AFERA. So if we think about it, about this solution, it can be like a bit overwhelming because I mean we know the source, the format of the source and we include quite a lot of code on it. So what we added as syntax is the web attribute. So we developed what we call a solver which is based on the open source project JPEG which has been presented many times at FOSDEM. And this solver will be able to create a media pipeline to adapt from an input source to an expected output source. So for instance, in this example, we have as a source an MPEG1 program stream that embed MPEG1 videos which is mostly not supported by browser and we will transcode it to an H264 format file which is supported by all the browser. So what the solver will do is by itself the solver is able to check amongst all the libraries which has been provided with the web attributes to do the transcoding, so to adapt the video to the user browser. So in this example, we take a portion of code of LIM and MPEG1 and LIM X264 to do the transcoding dynamically. So with this same principle, we are supporting audio and we are supporting images. So now let's do a demo. So I have delivered on a website so you can test it with your desktop or with your mobile phone if you just flash the code. I have created two web pages, one with raw contents delivered on the website, on the left part. So you will have a raw G2K image, a JPEG-G-cell image, a Dolby Digital AC-free sound and an MPEG1 video. And on the left, we will not use the universal extension that I have presented and on the left, which is the main page of BVRA, we will show the results with this universal extension. So I will do the demo in live so I hope that it will work. So if we go on Safari on the left, so this is the page without the universal extension, what we can notice is that Safari itself is quite supporting a lot of format and container. So it supports G2K images, it supports GXL images, it supports MPEG1 video, but it doesn't support AC-free. So now let's see the situation with Firefox. So in Firefox, you can see that G2K is not supported, GXL is not supported, AC-free is not supported, and MPEG1 is not supported. And with Chrome, that's the same solution. So now let's see the real main page of BVRA.com that includes this universal extension. And so G2K is supported and if you just dive into the source code, this is the semantics I have shown to you. So you have the ease to say that we are using the universal, we add the internal logic, we have the solver and we are using the open source library which is OpenJPEG. Same for GXL and on AC-free and MPEG1. Now if I switch to Firefox, you see that you will have exactly the same result. And on Safari, everything is as it is, but you have now the AC-free audio which is supported. So on all the browsers, you have the same results. So we also release an SDK in open source. We release an IDE based on Visual Studio Code that you can use. And the goal of this IDE is to help web developers to find the right combination of filters for a given content. So I'm going to show you the IDE. So to install it, it's quite simple. You go in the store of Visual Studio Code, you type BVRA. So on my computer, it's already installed. And then you can open any content in the file explorer. So for instance, let's use the GXL file. So you have first the preview. So it means that a library is existing to decode this content. On the graph part, you will see the mega filter pipeline that has been used to decode this content. You can also view the source of each filter. So on this one, the filter is based on libgxl. So I think the connection is quite slow. And on this part, on the Accessor tab, then you have the script to be integrated inside your HTML. To support this given format. So now, I mean the connection is quite slow today. This is the source that has been used for the graph. So this is this one. So I use libgxl. And this is the code from one filter. So as you can see, it's like using GXL. And then you have a semantics just to adapt the open source libraries with the input format and the output format to help the server finding the right combination. Yeah. So we're to find us. So this is like the end of my presentation. At the moment, we are starting. We are like only supported and full of media credits and container. However, we are like adapting new libraries. We are working on support of new formats, new type of documents, 360 videos, 3D object. I mean, everything which is multimedia can be like constructed with the server, you know. Everything is open source. I mean, what I've shown to you is open source. So you can see check the code from the editor on GitHub. Also test interface and the SDK is on better slash filters. Everything is in LGPL. So you can contribute to it. You can take this code. You can, yeah, I mean, I've kept this decision short to have the time to let the time to you to ask a question or you can also find me later on the audience. So that's how for me. Any question? Yeah. I mean, it depends on which. Sorry. Okay. So our expensive it is to transcode on the browser. So it will really depend on the type of filter you will use. I mean, we have some filters which are called WebCodec. So it will use the acceleration of the browser to do the transcoating. So by using the WebCodec, if it's supported by the browser, then it's very fast. The other thing is that you don't always need to transcode if you are using a canvas, then you will just decode and display it on the canvas. So it means that there is like no delay when you open a file, when you are using the canvas. So it really depends on the complexity of the encoder. It depends on the technology which you are using and depends on how you want to integrate your video. But we are using WebAssembly, so there is no overhead which is imposed by WebAssembly itself. You have quite a native performance, you know, on your browser. Yeah. These are static files, why do you want to transcode ahead of time? Because you can, if it's like delivered, why don't you use static files? That's it. So that's a good question. I mean, why? Because first you can adapt depending on your browser, because for instance some browser does support the native file like Jigsail, you know. And the other thing is there is a lot of files which has like functionality in it, functionality that are embedded in the container. Lex takes for instance DNG files like raw files that are used by your camera, by professional photographer. Then if you use the DNG itself, you will be able to like view the raw format. You can view the preview and you can play with, you know, everything that you can do for instance with Photoshop, you know, like having this high HDR range of color and so on. And if you're on 360 videos, then you will have for instance to use the interaction. If you're on document, you have to add this interaction. So by playing with native file, then you will not lose any functionalities that the container has initially, you know. Yeah. Do you want to make a browser plugin so you can view the websites that don't have the text? Yeah, there is also a browser. Yes. Yes. Do you want to do a browser plugin for that? There is a browser plugin already that is able to detect the, if the format is supported or not for a given content. For, I think that the best functionality yet is to trust the web developer itself because he knows like he wants to use a specific file. He wants to use a native content so he will integrate, I mean, the functionality that he requires for his website. If the web developer asks the user to install something on his browser, then I think that he will lose a big part of his audience. It's better to prepare everything for the end user than ask him to install something. You know, I don't want to come back to the situation where we were with Flash and Civil Light and have all this kind of issues. So, I guess the web pages that still have JPEG Excel will make sense if you want to do that. Yeah. Can you repeat again? Yeah, yeah, yeah, yeah, of course. But this extension is existing, I mean, but it's like less useful than this first presentation that I'm doing, you know. Yeah. Is the MKB file format not supported? It's working on it, actually. Ah, yes. Sorry. Is MKB supported? I'm currently working on it, you know, because what I'm one of the break we can have is that a lot of format has patent on it. So what we can distribute freely is license free patents. So MKB is one of the license free patent is you have the open source on it, the open source code to be used. So the application is quite easy and I think that that will be my next work, you know. Something that I forgot to present, which is quite really important, is that the plugin extension already have a store on it. So if you want to try a combination, then you just have to add a library on it. So for the moment we have PNG, JPEG, JXL, OpenJPEG. We have the full FFFPEG, like a Kika.t.pmp, and then we extract some part of the FFFPEG just to reduce the size of the big project of this project. But MKB is really like the MKB inside the XIV decoder, so I will extract it and I will work on it. And then if you like click on add, for instance, let's say that on this one it was JXL, then I'm adding OpenJPEG on this one. So you will see that it will be a candidate. And on the preview it will check amongst the JXL and JPEG and see that JPEG itself is not just for me, you know. So it's like unused.
GStreamer: State of the Union 2024
I'll try my best. So I'm Nicolas Dufresne. I work for Collabora. I've been hacking on G-Streamer for over 10 years now. I also hack a bit on Lib Camera. I've earned a T-shirt this week. I'm going to give you an update of what is happening in G-Streamer since last Fosdame, basically. So in numbers, at last Fosdame, we were releasing 1.22.0 of G-Streamer framework. And since then, we did nine bugfix releases. So the Rust component, oh, that's off. So the Rust component, I know why. Let me fix that quickly. Sorry about that. That's using time. Sorry about that. There we go. And just a different version of the tool. So as we mentioned last year, the Rust bindings and Rust plugins actually lives in separate repository with their own release cycle. So there was 13 releases of the Rust bindings, 18 releases of the Rust plugins. There has been 13 security fixes that are basically security issue reported by researchers all in C code and over 600 backports on the C side and 600 backports on the Rust side. As for the development, we haven't released 1.24 yet, even though Tim actually mentioned that it might have been released. It should have been released. We'll work on it. I think we're super close. But understand that we had 1,400 merge requests on the C side, about 5,000 commits, plus 750 commits. That has been quite a lot of work. And it also introduced a bit of instability in the development tree that we need to take care of. On the community side, one of the big changes that we're slowly, so we didn't kill it, we're slowly moving from IRC to Matrix. So you can join us to the Matrix community, which brings basically a different topic challenge. Different topic channels, so you can actually have a channel dedicated to a topic which makes the channels a bit less noisy. We also introduced this year for support, a discourse, which is a forum. It's something that has been requested for a long time. People always wanted to use our issue tracker as a forum, and we didn't want to. Now we give the users a forum to do that. And it actually killed the mailing list, literally. I don't really get anything there anymore. Now the rest is going to be quite fast. Might skip some of your contribution if you're a G-shrimmer contributor because there's too much. But I'm going to give you another view of what change this is, things that didn't fit. So at the very core in the GST meta, we've introduced the ability to serialize and deserialize the meta for IPC purposes. We introduced the GST analytics relationship relation method, which is an attempt to standardize the analytics data exchanges between different ML systems, AI systems, or standard open CV kind of analysis. Small but there, and sort bin is now in your registry. So it's an element that you can discover and that you can start learning about four years later. ONNX have gained inference elements, zero copies have been refactored. This is the most active machine learning set of element in G-shrimmer. Orc has gained AVX to support, which increases a lot the software processing on recent laptop. You don't need to muck within code bin, which makes encode bin a lot more useful in a lot more situation. WebArt with WP source has been updated to the latest API. And finally, I think this feature is kind of cool and it didn't fit anywhere. QML6 has a mixer. So it's a mixer to display, which is a lot more efficient than doing tons of render paths to get a single buffer to then render it. On the codec and parser side, we've did quite an important change to the H264 parser to make our frame splitting actually spec compliant. It's not yet fully stable. We still have some regression to look around. But at least it means that you can do fun stuff with your bitstream now. And G-shrimmer won't think it's multiple frames. Small addition codec to JSON is just a set of plugins that takes your bitstreams, dumps the headers, stream headers, and slice headers, and frame headers, and makes a JSON out of it so you can actually read it. It's very useful if you develop an encoder. There was no GPEG parser in G-shrimmer officially. Now there is one. You no longer have to implement a parser in every single GPEG decoder. MPEG123 became our primary MP3 decoder, replacing MAD. Vulkan video, actually, we were early adopter of the Vulkan video standard. And we now have an H264 decoder that has been merged. And there's more work coming. There's a. Then we added a couple of codecs, LC3, audio codecs from Google's, PHQ, which I'm less familiar with. We've enabled support for the SVT AV1 software encoder. And on NVIDIA codec side, HDR, stream sharing, on AMD property codec side, we got HDR, HEVC, and AV1. Streaming side. So I put a little ROS logo next to what is actually being written in ROS. And I'll explain a bit further why later. So you can see that the WebRTC sync now can input pre-encoded stream support. It has the 3D11 and QSV encoder support. There's a base, I think it's a base class, a base class for the WebRTC syncs. And out of that, there's Genus VR WebRTC sync, VR for video room not for virtual reality. And AWS sync also, that makes your life easier. It basically handles the signaling for you. And it also handles the encoding for you. It handles the bitrate adaptation for you. So it's much easier than going straight to WebRTC bin. We also have some enhancement on the WebRTC source, including turn support. We now have a web server source to go with the web client source. We have an HTML API. I must admit, I didn't go through deep on what that thing do. But it's on top of it, and it offers an HTML. It's a kind of separate project inside the project. And another mentioned, there's basically a new feature you can do, synchronized playback in JitterBuffer using your system clock if your system clock is synchronized with your media clock. That's more of an embedded use case. But yeah. More about streaming. NDI sources gain zero copy. There's a new element called AWS put object, which not very good for throughput, but it's a much lower latency for a small chunk of data that you want to upload to AWS Cloud. HLS CMaf sync finally, so you can serve CMaf, which is the common fragmented storage for all the fragmented implementation, whether it's dash or HLS. You can now serve HLS using CMaf in Gstreamer. The fragmented Moxer has gained VP8 and MV1 support. And another one, which is a bit niche, we now expose a W3C media source extension API that mimics the web API that you can bind into whatever you want. The intention with that is basically to be able to share your JavaScript between something that runs in your browser and something that runs in kind of standalone player. Specifically on the binding itself, I probably missed a couple of things there, but there's GES, so Gstreamer editing services bindings. There's bindings for the VBI parser and encoder. The handling, I mean, there's accessors for the pad probe, which makes using pad probe a lot less verbose. And there's bindings for some metadata, like the video SE user data on register, the RTP source meta, and yeah, a lot more. This binding is extremely active and maintained. An interesting one, it's a bit uncommon to have such an amount of activity on video formats, especially pixel formats. And yeah, we've seen 20 new pixel format on the software transformation side, 27 new pixel format in GL25 and D3D11. And I'm actually skipping D3D12 and CUDA here, which probably have the same numbers. We introduce quite something. It's basically the ability to pass through Linux DRM pixel format and their modifier that enables GPU compression on Linux mostly. And that should make your video playback on your laptops a lot faster. It comes with helpers to do CAPS negotiation and all the design for the negotiation also is published on the website. It's implemented in VA, MSDK, Wayland, and there's more support coming. Small, but because it came from a new contributor, someone added 10-bit support to the WebMalpha support. We also introduced 10, 12, 14, 16-bit software debiring, which I mean, I think nobody had touched the debiure for 10 years. So it was a bit of a surprise, but then we're back to the modern days. I think it's mostly because of lib camera. And yeah, but a lot of activity this year has been on the Windows side, mostly done by a single person. So let's get the Windows updates. So on D3D11 support, there's an IPC sort of sync that has been created. So you can share basically D3D textures around across processes. You now have another lay element, a bit like the Cairo element. You get a draw callback, and you get to use the system context to draw. That's about it. That was a question from last year Fosdame. We now have D3D11 support for QT6. Lots of improvement and pattern and stuff in the D3D11 test. You can now pre-compile your HLSL shaders and cache it on disk. There's basically the NVIDIA decoder can now output to D3D textures. And you have the right support to optimize the rendering of overlay, which is basically a companion to D3D11. On D3D12, that's a much newer subsystem, but it's equivalent, and it's getting almost all the same features now as the D3D11. So added MPEG-2, H264, HEVC, VP9 decoder. I'm actually surprised not to have AV1 there. It's probably there, or probably just missed it. And H264 and HEVC stateless encoders, so they're a much harder encoder to implement because you have to do more work. And compositor overlay, screen capture, color space converter, same pre-compilation support and shader support. It didn't do zero copy until this year, so zero copy, threaded decoders to improve that throughput, and it's interoperable with the D3D11 framework. So that's most of the Windows update. On open GL side, well, most of the work has been to be able to negotiate the DRM modifier so we can get compressed video frame into GL, zero copy into GL and render it. Just recently, we also added the ability to pass through the DMO formats so that we can upload to GTK4, the rendering to your compositor. So that's coming. You'll be able to basically skip your GPU completely so it can shut down while you watch your video. We added surface display support and GTK4 paintable sync in Rust now has GL window support. More specifically to UNIX-like systems, VA, Wailin, and MSDK, I mentioned that already, have the DRM modifier support. We have a VA and V1 encoder now implementation for the newest brand of Intel laptop mostly. The Android media codec now has been ported from being fully C calling into Java to be implemented using the NDK, so it's implemented in C, except for I think one callback, it's reduced a lot the overhead for codecs on Android and we added AV1 support. There's a new set of elements for UNIX, which is the UNIX FD source and FD sync. It kind of replaces SHIM source and SHIM sync. The main difference is that what goes, it actually passes through the caps and it serialized the G streamer method to the other side and it also passes MMFD or DM above or any file descriptor that you would be streaming. So zero copy on like the SHIM source and SHIM sync element. OSS that's opens on a system from that is still used on a BSD these days. Now you can animate your audio device, which was quite a missing feature. The Wailin sync have gain support for DRM DOM allocator on Linux. It's not exactly automatic yet, that's coming, but what it does is that when you do software decoding, you can decode into a GPU importable buffer and that means that your Wailin compositor is gonna be able to use a hardware layer, which remove some copies and you're already software decoding, so whatever gain you get, that's a gain and that makes software decoding better. Now embedded Linux, we added V4L2, everyone's stateless decoder. For me, it's two years of work because we also did the kernel side of it. People from Pengatronics added the UVC sync element, which is a companion to the Linux UVC gadget framework which basically lets you use Gstreamer to create a webcam out of your little chip that supports the appropriate USB protocols. In V4L2, stateless codecs are now CI tested using QMU and a virtual driver called Vizzle, so it does not really decode, but it produce image that is modified based on the parameter, so it's basically good for catching changes, which are usually just a regression and it actually works, so we're pretty happy, it run in 10 seconds. Our fastest decoder out there. V4L2 source now cares about frame rates, so when you use V4L2 source, you may get something better than two to five frames per second by default, it's pretty nice. The GC allocator gain at DME buff, DRM DOM allocator, hopefully we'll gonna get a GBM also allocators in the future and we also have buyer support integrated into V4L2 source and yeah, we've improved the, oh yeah, we've improved a bit the V4L2 state for decoders in order to try and get the Raspberry Pi 3 and 4 experience a bit better, so and that included the new dynamic resolution change method in the specification and HDR10. There was quite a lot of work on close captioning, I kept just few line, but basically there's a boxer for C8608, the CC combiner has been greatly improved, there's a C8608 renderer overlay and it looks like the development of that is moving toward Rust now and there's a converter now for the older close captions to the newer close caption standard in the Rust plugin, Revisdery. We've retired some stuff actually this year, so the GSTOMX is now gone, it was completely unused by Raspberry Pi, we never implemented any Android support because it was not exactly a real OMX and basically there's been no contribution for years, so we decided to remove it. Kate, which is based on LibTiger, there's basically, it's becoming difficult to get LibTiger, so we were like, yeah, let's stop supporting that and finally we removed the use of G-Slice all over the code base, simply because G-Slice has been outperformed by the system heaps now, so we'll use the fastest one, which is the simplest one also, so Malachy is faster than G-Slice. We have future plans, I have no intention to actually spoil anything, but I'd like to underline that lately there's an effort that I've started to actually rewrite the RTP stack of G-Strema, mostly for security reason, because it deals with a lot of network input, but it's an opportunity also to fix some of the intrinsic performance issue of the current implementation, so we really hope that that's gonna be better. It's definitely a multi-year work ahead of us before that replaces the existing stack. I also personally strongly consider contributing a replacement for our parsers. Our bitstream parsers has been responsible for 50% of the security issue this year, so that's definitely just the AV1 parser, I think it's the fourth in a row, so we have four bug fix releases that are made quickly because of the AV1 parser, and the first one actually came with a nice code execution example from the researcher, so it's very concerning. If you want to learn more about these things, I've been really on the surface, but all our G-Strema conference across all the years are recorded by UBTV, UBcast, and you can go watch them and you get more detail. We had one in October, which all fresh and covers a lot of this, and please if you want to get more in touch, there's gonna be a G-Strema conference against this year in October, and it's most likely going to be in Montreal, actually, first time out of Europe. Thank you, and we have three minutes for questions, I think. Five, oh, okay. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Sure. Can you say a bit more about the W3C media source extension, why you chose that and if you're seeing any active media extension? So the question is, can I give more context around the W3C media extension? So our use case was, basically my team did that, so our use case was that we had some people that wanted to use Shaka Player, which is a JavaScript player, but without the browser because their device is not strong enough to host a browser. So basically there's not enough RAM, there's not enough CPU, but they really wanted to share the same player, and the idea is that the first step was to make, basically an API that is like MSC inside of it. We didn't create it, we ported it from WebKit, and that is really easy with the object to expose then a JavaScript API that you can use a node or something like that. Yes. How hard is it for someone who knows the C API quite well to get started with writing Rust plugins? Yeah, so the question is, how hard is it for someone that knows well the GStreamer C API to write it in Rust? Do you have a better answer than me on that? I don't know. I've personally been for last year in kernel development, so I haven't really been into the Rust thing, but from my experience, my tutorial trials, if you really know GStreamer, because we have such great bindings, and you have so many examples now, it's not that hard. The hard part is basically the discussion that you have with the compiler that is telling you that you don't really understand what you're doing initially, that you don't really understand what's ownership, and it's part of Rust. You really get intimate with the compiler. But could you say if you want to develop a new plugin and you should go to Rust, or should you still seek with C for time being? I think it's the best recommendation to give. So if you were to develop a new plugin, should you go to Rust? I think my answer is yes. Yes, it's great to have a new plugin for Rust. Are there any plans for the new and what kind of work in writing Rust? You'll have to repeat the question for myself. Is there any plans for writing some of the core GStreamer Rust? So the question is if there's any plans to rewrite some of the core in Rust. For now, no. I mean, you don't consider the RTP stack as a core because it's actually a plugin. So yeah, there's no current, nobody actually shout out that they were doing that. There's few things that could actually be in Rust. Like someone, Sebastian actually made the PTP helpers in Rust that was kind of autonomous and independent that fits, but yeah, in the short term, no. I think we're good. Well, thank you very much. Thank you. Thank you.
Streaming live using RIST On Demand to thousands, how you can have your cake and eat it too
All right, good morning. Again, I'm Sergio Amarata. I'm a board member on the RISC forum, and an active member of the RISC committee that writes specs for RISC. And I'm also the maintainer of the Librisque Open Source project. So today, we'll be talking about how RISC can support end to end live streaming with packet recovery. But in particular, I will explain how we can support this in a broadcast scenario, meaning streaming to as many users as your bandwidth can support. So we'll cover the topic in two sections. First, we'll provide a roadmap or an update on the RISC specification and the Librisque project itself. And then we'll go to a practical application and show you how you can do live streaming in a large scale with the open source tools provided. So on to part one, the development roadmap. The last time I gave an update at FOSDEM regarding RISC was February 2020, a few days before the pandemic shutdown. Now, four years later, we will explore what happens instead. I guess if I have waited one more year, I could have blamed the Thanos snap for the delay. So let's do a brief recap of the beginning of the protocol. In 2017, the VSS forum created the RISC activity group for the purpose of creating a unified interoperable protocol for transmission of IP data over loss in networks. The requirements were that it needed to be based on the UDP protocol, and it needed to include negative acknowledgment retransmission requests. So one year later, after a successful multi-vendor interop event, the simple profile specification was published. You can see that in the bottom. The RISC activity group then proceeded to add multiplexing and encryption capabilities and publish the main profile specification in early 2020. It was at that time that the Libris Library Open Source Project first was published. And you can refer to my talk I gave back then, where I go into detail explanation of what the simple profile does and the main profiles and the differences, et cetera. So as you can see on this slide, the RISC activity group has been quite busy adding features to the protocol to accommodate all possible use cases during the last four years. What started with the simple profile, the first release, as the desire to add packet recovery to an RTP stream with an MPEG-TS payload, has now turned into a reach protocol that will work with any payload and which includes multiplexing, encryption, and authentication. So Libris, the open source project, currently supports a simple profile and main profile. And we're working on adding support for the advanced profile. So in addition to the core specifications for the protocol, the RISC activity group has also published a series of recommendation or best practice documents. These are documents that extend the protocol into specific applications, into specific niches, and you need to actually consider that in the library that is compatible with this specification. So the library Libris, when applicable, has been made compatible with these recommendations, the code synchronization, the relay, et cetera, et cetera. So enough history about the protocol and the specification documents, those are all publicly available. They're not behind any payload. The VSF documents can be downloaded, the PDFs, and you can look at the specs and all the recommendations. Let's talk about the Libris open source project itself, right? In case you are not familiar with what RISC is, we can define it with just one simple sentence, like you see up there. It's a new protocol for transmission of IP data across lossing networks using UDP with NAT-based retransmissions. Before getting into anything else, I'd like to clarify the three most common misconceptions people have about the RISC protocol. And this has come about in talks and conversations. People tell me, oh, well, RISC is only for MPEC-TS, false. Advanced profile includes support for any payload with clearly identified payload types in the header now. There's even an registration with support binary payloads, et cetera. And misconception number two, you need a large buffer and therefore latency is large, a second or more, right? In order for you to use RISC, false. You really need two to six times the round trip time, the RTT, between the two endpoints you're trying to send the data through. So the shorter the RTT, the total buffer required will also be shorter and you can talk about 10 milliseconds, 20 millisecond total latency. It's just depending on what network you're deploying it in. In addition, and this is a major misconception on that second point, RISC also supports real-time data channels with no added latency, with lossy channels that you can have APIs going back and forth in real-time and send data that cannot wait for these buffers. Misconception number three, you can only use RISC for transmitting in one direction, right? You send data over there, this is packet recovery, you're done. That's false. The protocol allows for bidirectional transmission both with and without packet recovery on both directions. The limitations are usually introduced by the implementer of the protocol. The specifications are broad enough so that each implementer has the freedom to add or remove features at will. So let's talk about the Libris Development Roadmap. How do we determine where to go next, right? So we divide it into three categories. The first one is we want to improve the reach of the library. And by improving the reach, we mean improving the adoption of the library by client applications so that users can go ahead and have it available on every device. Libris, of course, adds support for all the different specs like I showed before and all the recommendations so that all these reach features that make the protocol have more use cases are available immediately under the Libris library. The second is distribution, right? We make sure that our library compiles on every platform so that it can easily be adopted by anybody and that it makes it when possible into open source applications like FFMPEG, Libris, OBS, etc. As a matter of fact, when running it within the video LAN servers, it compiles in all 21 different architectures that are predefined in their CI. So we're pretty confident that if somebody wants to use it, they can. In the distribution aspect, we also have it on the major distros now available in Debian, OpenBSD, etc. And the third aspect of how to determine what the roadmap is is we do timely enhancements and timely bug fixes very quickly when they come about. So on the feature set, I think that the most important addition recently that allows the application to be used into this broadcast market like one too many, the media service scenario is the EAPSRP6A authentication protocol. It was introduced in 2022 and what it allows you to do is instead of the normal model where you have a pre-shared key that you have to share among two endpoints which makes it very insecure because if that communication of that key for the encryption gets compromised, your entire network is compromised now. This protocol allows you to do a username and password, a unique username and password for each of the connected clients and part of the protocol, doing that username and password exchange which is different for them, includes the negotiation and the exchange of the pre-shared key. So there's no risk anymore of that pre-shared key to ever be compromised. Other features that allow the broader adoption is that we're working on a one-way satellite application, we're working on multicast discovery and a few other things. So third aspect, the distribution. Many FOSS projects already have Libreps compiled by default or have it as an option. If you know of additional projects or if you know, please drop me a line, I'd like to keep a database of which projects are actually already included in it, if possible. Ritz is also a part of my own day-to-day operations which gives us the advantage of finding the bugs before they are found in the wild and we fix them very quickly. OK. So performance enhancements over the last few years, we now have the ability to automatically configure based on the network conditions. The Libreps library does an RTT with a new packet that was released, the Echo packet, 10 times per second. What that does is it lets us measure, with a UDP, you know, ping, not a regular ping, the network conditions between the two endpoints. We know the inter-packet spacing, the variance, mid and max, we know the latency, we know all these things and with those values, with those parameters, and if you look at the default configuration, the library will auto-adjust its buffer to the perfect buffer for that link without you having to guess or know anything about the network. It will also adjust the initial buffer, the reordering buffer based on your jitter on the network. Your inter-packet spacing jitter, gaps in maximum jitter, and make sure your reorder buffer is at least that much. We've added, you know, because we've done these very large improvements, we've realized we need better metrics, so we've added support for Prometheus and other things straight out of the library, so that you can actually grab that and, you know, plug it into third-party tools and immediately create your dashboard that gives you the proper visibility in the connections. And, you know, last release was just a couple of months ago. So the top priorities for 2024 for the development roadmap is we want to add support for DTLS encryption and authentication. We want to fully add support for the new advanced profile that adds, you know, the new header ID with the special payloads. And we want to try to see if we can get back support of the library into VLC 3.0. So the goal of the original project, like we mentioned before, was an interoperable standard for this type of transmission. There were, you know, half a dozen or a dozen different methods or there still are of doing UDP with packet recovery, each vendor specific, et cetera, et cetera. Our goal was to create an interoperable standard with multiple implementations, and I think we've achieved that at this point, at least at the higher broadcast level and tier one, tier two companies and a lot of the open source projects that support REST. They all talk to each other, even if it's not the same implementation. So now to part two, right? Let's look at REST as a live streaming platform, right? And particularly we want to look at a model that does N2S. How do you use REST and Libris in particular to do an N2N streaming chain, like the one we're doing here, for example, or, you know, any one-to-many scenario, right? Lots of viewers. So let's diagram, you know, a simple scenario here. We have three components, sources, the sender, which is a REST device, and many receivers on the bottom, and the box here on the bottom, you know, symbolizes a single one of those receivers. So we see the logos up there for FFMPEG, BLC, and Open Broadcast Studio. That could be also G-streamer, any source, any encoder, it doesn't matter. Somebody that has the ability to generate compressed or uncompressed video stream, right? Well, we need a binary stream of some kind pushed to the library. Libris in particular doesn't care about what the payload is. You can push anything in the payload, we'll deliver that to the other side, even though the spec for simple profile and main profile say that you're transmitting MPEG-TS, the library doesn't look at the payload or restrict it in any way. Okay. So the source is sending a UDP, or RTP media stream into the input. We buffer it so that we have it available for retransmission, and the minute the buffer is full, we start listening on, we put the sender in what we call listening mode. It opens a UDP port and start listening for receivers that want that stream, right? So the minute our receiver wants to connect to us, then the handshake happens. I'm obviously oversimplifying the process of the handshake that all happens. The SRP68 protocol is quite complex. It would take a talk just to go through the details of that handshake and everything that happens. So this is only symbolic. The handshake happens. The username is sent to us, and we check for that username within our database of username and passwords. It's not really a data-major password, but a password hashes to keep everything safe. If the authentication succeeds, then we send as part of the SRP68 protocol the pre-shared key so that the receiver can decrypt the data now. Once the data is decrypted, that's it. We have an end-to-end transmission from source to hundreds of destinations with just the risk protocol in between. So with proper planning and setting everything up correctly, you can have a 300 millisecond glass-to-glass, one to hundreds of listeners. You need a good network. Like I said, the latency is more dependent upon the RTT between the endpoints than anything else. I mentioned 300 milliseconds because in our large-scale deployments, we've done this anywhere within the U.S. with 300 milliseconds glass-to-glass. When you have to expand it and have users that are across the ocean or with crappy networks or Wi-Fi, the latency will auto-adjust. The protocol will auto-adjust. For those players, suddenly they get 500 milliseconds. We notice as a rule of thumb that somebody in Wi-Fi gets a penalty of another 200 milliseconds automatically. So how do you do this from a practical point of view? The LibreSploracle includes some command-line utilities that allows you to send, receive, and relay. The RISC2RISC is the one... If you want to do a relay application one-to-many, this is the ideal scenario. You can also do it with a RISC sender, to be honest, but the RISC2RISC is effective because it acts as a relay, doesn't encrypt or decrypt, doesn't do anything, but receives data and sends data both in the RISC format. You can put this in a CDN, your data sender anywhere, and you configure in the RISC2RISC a listener with authentication, and then you put your stream from anywhere, your source, like from here, to that endpoint. Then you configure the other end, the one that's going to send to the older viewers with a database of user-oriented passwords, and now you have the full authentication. It adds no additional latency in that process. It's only the latency that you decide to put as far as buffering. As far as quality and quantity, the sweet spot seems to be between 3 and 5 megabits per second, resolution 720p or 1080p, whatever code you're using gives you better or less quality, and that seems to traverse all the different VPNs, corporate networks, et cetera, without any issues. Quantity, the RISC2RISC can handle 100 simultaneous connections, and the number seems low, but because of the threading model and the fact that it has to do retransmissions, after that the retransmissions get compromised. The way you scale is that you can instantiate multiple instances of the same RISC2RISC application within the same machine, and in our case, we have 1500 simultaneous viewers going off of this type of transmissions 24-7. So the RISC password utility is also a command line utility available on the project that allows you to create the username and password combination hashes, just like the HD password file in Apache has a similar format, that's why we created it this way. You run the utility, put a username and password and that outputs this username with a hash, and then you append that to a file, and then the sender can grab that file and use it as an authentication database. In the case you want to scale that to a much higher level, you integrate directly with the library and you use the library callbacks to do the authentication yourself against your own databases, and you can scale that to thousands of users. The command line sender is a typical scenario of what I put in the diagram, what I was using in the diagram, you put the input any type of UDP stream, output you encrypt it, and then the output URL, if you look at RISC, is in your column, column, you add 127, you add the add, just like you do typically for FFMpeg or VLC or that type of stuff, when you want to listen instead of send, and it creates a listening on that port, and that's all you would need to do to create a sender and use the sender as a really, as well, just for one stream. On the receiver side, you want a player, for example, that you can put the username and password, right? You put the RISC in FFMpeg as the input, RISC, column, forward, slash, forward, slash, et cetera, or VLC, or any one of your choice. In our case, we did a custom VLC application inside of Raspberry Pi where we were doing this 1500 at the same time. There were Raspberry Pi's running VLC 3.0 inside with a lib-RISC implementation inside. The transmission of the secret in this case, which is a password for the username and password, should be handled in the same way you share passwords now for any account outside the scope of the protocol, and that's it. Then it becomes very simple to create a large-scale network with this. So the summary is the key feature for this is this new type of authentication that makes the secure implementation on a large scale, and it gives you better latency, lower latency, then the equivalent HLS or dash, with a security model that's built into the protocol. It's no longer the browser or the DRM inside the browser, everything. It's the protocol handles the entire DRM. So we have a really solid roadmap for the future. We were looking for additional contributors and people that want to help adding the next set of features. We're looking for open-source projects that want to implement the library. We'll help you put it in. And that's it. Thank you very much. Thank you. Okay, the question is, what if you're pushing your stream to Africa with a really bad connection? What is the acceptable packet loss? I'm not sure what you mean by acceptable packet loss. To me, zero is an acceptable packet loss, and the protocol is capable of achieving zero if you give it enough buffer. You give it a second buffer, and the round trip is 200 milliseconds, and you will get zero packet loss. We've done tests and we've done transmissions between Australia. I was just two weeks ago doing a demo, a transmission from Australia to Madrid. 16 cameras at 10 megabits per second each were being transmitted in real-time using RISC, and they were being used in Madrid for a production of the event. And the transmission didn't have a single packet loss, and it was all done across open internet. We used one second buffer there because the connections were relatively good, but if you go and, you know, if your transmission is really bad, just increase your latency, and the protocol will recover. We have part of our CI integration process tests that add 50% and even 75% packet loss. And you see spikes in bandwidth, but we recover every single packet if you give it enough buffer. Does it support simultaneous build rates? Does it support simultaneous build rates? Yes, we support multiplexing. In all this example, I've done just one UDP input. You can configure the library and the command lines to ingest multiple UDP inputs, give it a different ID, and then on the other side, you can demultiplex them. I assume that's what you mean by maybe having different build rates within the same stream. The camera, like, sends it on the fly according to network to the combination? Correct, yes. And one of the specifications that you saw on the recommendations was called source adaptation. It was written precisely to accommodate that scenario. What is the best case, best use, or the best practice recommendation on how to do source adaptation? Reduce the build rate, adjust the build rate based on network conditions. It's all documented in a part of a spec as well. So for non-MPEC-PS payloads, as you mentioned, is there already a mechanism like a composite trail to basically define the mapping of different payloads? Absolutely. For advanced profile, there's a GitHub repository that has the mappings already. We have a dozen or two dozen of them. I'm one of the administrators of the repository. All you need to do is go in and, you know, put an MR for whatever binary payload you want to define. All right, thank you. I have another question here. Is it also possible to multiplex and demultiplex subtitles? Is it also possible to multiplex and demultiplex subtitles? Yes. The protocol itself doesn't care what you put in. We consider each of them as a binary payload of some sort. You're the one that determines what the format of that payload is. And you have this pipe. You put multiple UDP streams. One of them is going to be your VTT payload or closed caption or whatever you want to put in with whatever format you want. We don't define or control the format of what you put in. We do to decide on multiplex and mulling. We give you the capability to give them IDs so that in the other side you can map those IDs to different outputs when it comes out. Thank you. But it means that you don't do any timing, right? In between the different streams. That's all user-side. Well, no. When you give us... The question is, that means that you don't do any timing or synchronization. On the contrary, because we are taking care of the multiplexing, when we ingest all the different UDP streams, the timing is guaranteed. The minute we receive that UDP stream, we actually, in the library, the implementation that we did, we grab the timestamp at the network card. This stream came in at this time, and then we reproduce that exact timing on the other end. We reproduce the spacing, the pacing, and the latency. We make it fixed, so that is not variable. That means that when you multiplex many things in the same tunnel, you're guaranteed they're in sync on the other side, or at least as they were when they came in. We're starting the use cases of the protocols to the more for... kind of the current adoption on endpoint devices, mobile devices, browsers. Okay, the question is, the use cases of the protocol, what is it towards more, point-to-point devices, point-to-multipoint, browsers, etc. This is the last question of our time. The original idea was to just do point-to-point transmissions. That was the original scope when we created the first version of the spec. That has changed. We achieved that, and now we went beyond that. Now we want to tackle the distribution. We want to tackle the one-to-many, the media servers. We have actually a project going on with Miss Server to add a lot of this functionality and the scalability as part of the project itself, so that we have at least one media server that already supports that in a very scalable way, where it becomes very simple for an application like VLC, or VFF Play, or Gstreamer to hook up to this media server and start the playback immediately using the Pshuoroko. Thank you very much.
The state of video offloading on the Linux Desktop
Yeah, hi, I'm Robert. I work for Collabora on rail and stuff. And yeah, I'm here to talk about video offloading on Linux. And I'm a bit nervous because I haven't got this many talks so far. But yeah, thanks for joining me here. I will give you a short introduction about what I'm even talking about. Then mainly talk about the current status and what happened in the recent time. I'll do one demo and show some benchmarks and then add some more notes and then hopefully we'll have time for questions. So for everyone using a Linux, most of you folks are probably used to or know what hardware decoding is and why it's good for your video decoding performance, like why you like the API and so on. Many of you probably don't know that there's a second step involved after decoding the video, like getting it into the right format and scaling to the right size, which we usually do using GL or Vulkan these days. But actually, most hardware does have hardware fast path to make this much faster and more efficient. And this is usually in the display controller. They often can also rotate and so on. And yeah, you would like to use them to get the maximum performance actually. And as you already already maybe hear from the name, display controllers, this one sits at the very end of the rendering pipeline. It's like after your app, after the compositor, at the very end before the buffer goes to the screen. And yeah, this is normally not quite much used on the Linux desktop. So what do we do in the embedded world? Over there, we have like lots of elements and lots of software which can already use the display controllers directly by not using a windowing system. Like you have like, yeah, things like the G-smart KMS sync. All kinds of apps have custom backends to use the KMS slash DIM slash GBM APIs of the kernel directly. And on X11, there was an extension to actually make that usable on the desktop. It's called Xvideo. It never really took off and I could jump over that. Now we have Wayland. And this is a picture from 2014, so mid 2014, which I think was the first presentation of showing a Wayland desktop with the windowing system where video was actually offloaded. This is Daniel Stone from Colever. Yeah, 10 years ago, we started using things, we started using Wayland and all this video offloading in the embedded world. There is G-Stymar elements and MPV back end and so on and so on. But practically this is still mostly limited to the embedded space. So what happened in the last 10 years is that as you probably know Wayland made a lot on the desktop. Now most people, I hope, are using it now. Apps actually started using hardware acceleration like GL and Vulkan. This only just happened in the last couple of years in many cases. And we got a lot of better kernel APIs. We got things like DMABuff modifiers. Lots of things happened over the years. So now what happened in the recent time? And especially in 2023 and also in the last couple of months. And I'll start with one of the big news which is like this year, Mada and Gnomescher finally got support to have short YUV buffers so apps can actually pass over video buffers to the compositor which is needed to then have the offloading step. And then GTK4, the GTK4 folks jumped in and made actually use of it and introduced a new widget which allows you to offload content like video to Wayland compositors and that is landed and it will ship in the upcoming 4.14 release. And we hope to have at least one actual video player out in the wild on Flatpak using that and also hope that the GTK4 Paintable sync which is used in a lot of GTK apps will more or less support that out of the box. Fingers crossed because it's all getting close now. And this also depended a lot of the G-Stu-MRO work that you saw in the previous presentation from Nicolas, the GL work there. So just to give you a short impression how this looks like you will have a video player and it can put the video into a so-called Wayland subsurface this is the case where the big bag bunny or the video is in front of the rest of the content but you can also have the other case where there's actually content on top of the video especially subtitles, controls, whatever. And yeah GTK is now able to do that via Wayland and the Wayland compositor is then possibly, I can then in many cases offload that directly to the hardware and I will show you that a little bit later in the demo. Further on, another big development happening in the Wayland world that many are not probably not so much aware of is that like Chrome OS is switching to Wayland, they are moving their main browser to use Wayland and also to offer support for Wayland apps and that means they ported their Wayland back end to support this kind of video offloading which they previously supported by having Chrome OS using their own private APIs and so on. They are now supporting the stuff using Wayland's APIs and this is mainly tailored for Chrome OS of course but it also works on other, it can be made to work on all other Wayland compositors and we have experimental patches for that and we hope that these will find their way upstream in the very, yeah in the near future. Yeah point is we have GTK 4 toolkit, we have Chrome, a browser which support this already shortly about compositors. By now like Western as a more embedded focus compositor has been supporting most of this for a long time, quite well but now we have Mada supporting video offloading in fullscreen cases already, Kwin just landed support for at least one such video format, NV12 which is the most common one, just landed a couple of weeks ago and will apparently ship in Plasma 6, I haven't tested that yet, it will probably also be limited to fullscreen only but it's there now so both big desktops are covered and on the WL routes front, Sway is very actively working on getting better hardware plane support so you can actually have the stacking of overlay over video however yeah just landed support for the most important APIs, that doesn't mean that all WL routes based compositors will support it automatically, like especially those with their custom 3D engines like Wayfire, Hyperland and so on will either have to adopt these APIs or will have to do their own stuff so yeah in the WL routes world it's a bit mixed between very well supported and not at all but yeah the big desktops environments are there yeah the biggest thing missing on the video offload side in the Wayland world is we don't yet have a proper protocol for the color representation which is needed like for example for HDR 10 or for 10 bits video content you often use a color space I hope I use the right word now which is called BT 2020 and we don't have a way yet to from clients to tear compositors now that you want this but it's in development for quite a while and will hopefully land in the not so distant future but this is one of the big missing part and it's kind of related to the color management protocol which is generally the big Wayland protocol needed for HDR stuff in general which has a lot more things there but all these stuff are falling coming into place maybe at least from protocol side we will have have everything this year so short conclusion on this part most desktops have at least basic support for graphics offloading now and yeah there's lots of developments in many places involved not everything is already supported but it's a good moment like to make your app or a framework toolkit whatever maybe look into this and adopt it because yeah then we can make things happen and get faster video playback so just because I'm so confident that everything works I even show you a live demo now or even though it's very small and I have no idea how much you will see here this is a Pinebook Pro with a very low power GPU but it has very good video decoder and a good display engine and what you see is the GTK4 demo player and we have overlays from the compositor enabled so opaque regions of the compo of the app are in green and transparent parts are overlaid in purple and what I quickly wanted to show you is like in this case here the video is now behind the actual GTK window and the GTK window has just has has some punched some holes so you can see through it and the cool thing is if I move it around I know now it's offloaded because it works and now if I go here we have rounded corners GTK4 detects that it can't offload things to Wayland it transparently changes to rendering itself so you can easily implement this or this things work picks up and every frame is perfect and yeah this is running matter and I make it quick now if I start the demo and as soon as the video just disappear you now see the overlay disappear the video plays this is now on a hardware plane with hardware scaling on matter highly efficiently it works in short okay I'm a quick now so I wanted to show you some nice quick new benchmarks from from intersystem Intel GPU top and when I tried this this happened on multiple devices and I wanted to include it here it means that the offloading works but it there's some bug somewhere in the stack outside and the reason why I want to show it to you okay is the the the important point I want to make here is that graphics folks are people working with graphics usually know that it's like graphics is really hard to to test and to prevent from regressing from time to time like this is on an Intel like this should work but somehow it regressed recently in some kernel and here I would like to make the point that getting this stuff to work on the desktop for if you work for some vendor or somebody who makes drivers is an awesome way to get lots of testing from lots of users who actually give like right issues and give you good yeah go ahead so if you actually want to sell things for embedded but want to have good driver quality just ensure you have a linux community which uses on a desktop and prevent cases like that um blah blah blah I'm skipping over that yeah yeah yeah better less watts and so on and so on and we are as good as the mpv native Wayland back end on gkk4 now um I skipped that um yeah it has video uploading has significant advantages for for battery life for resource consumption on the desktop as well as on that unembedded and you can implement this we have all the technology now to implement this in proper toolkits or complex apps like Chromium and it's worth it let's do it um I am one note I don't want to skip here which is a bit controversial like it is very controversial but I just wanted to know for everyone for transparency reasons talking about DRM like that DRM as you know or if you just so what we are doing here is pass through from the decoder to the display engine that actually means it becomes technically trivial to support hard DRM um I'm not saying we should do that I'm just saying that the technical limitation to do that is no more or less done and discussions should be heard about if you want it um but just like yeah this makes it very easy to do all this um yeah on the more positive side we have experimental patches for various things to to make it work also with real v4.2 decoders no I skip that ah yeah NVIDIA just added in the news beta driver a couple of days goes added support for something in that direction I'm not sure if it's GL if it's Vulkan if it's both but and even the proprietary NVIDIA driver seems to be on board ah nice it's both somebody just said yeah questions hi thanks a lot for the talk Chris one thing that kept me confusing wayland is the missing color management so you could use color profiles this is going to work similar later uh yeah the question was um or the the statement was that one thing preventing people from using wayland is missing color management and um yeah there's a lot of active work and a bunch of companies involved um who are very very actively working on that we had even had a hackfest last last year where lots of companies came together and will have another one in a couple of months I think and there's hopes that yeah things will come together right now in the x you can apply one profile to one display and another to the second display you just couldn't do that yeah but what what that color color color color big discussions can't handle that in here and a bit of too much off topic any more questions also yeah uh firefox is looking into there are using that they want to use it especially for 10 bit video oh the question is what is the status for firefox and I've talked to some firefox folks who said they would yeah they are looking into it for 10 bit videos to get that working and yeah let's see I haven't seen any patches yet so but having it's working in GTK will probably be quite a good um yeah but make hopefully convince them to to put some more work into that yeah so you've explained the various strategies to be able to like the underlay strategy and overlay strategy where the video is under the uh compose and another where the video is over the controls so do you always pick the underlay strategy in GTK or do you because some hardware only supports underlays and some of our platforms only support overlays so it's how do you choose in GTK do you always do underlay for now or um so do you switch to underlay and do you overlay or something um so the question is very specific about um strategies how to layer things um and how this is done in um GTK for now uh in short um chromium has options for that um GTK for prefers overlay for the moment and um actually I personally would like to always use underlay when possible in the future so you don't have to switch the surface around but that would require a new protocol which I made um which we could discuss um yeah okay I think I'm out of time yeah you can find me later for more questions thank you lot
Livepeer Catalyst and The Conspiracy to Solve Video for Everybody Forever
Hello everybody. My name is Eli Malin and I'm here to talk about the conspiracy to solve video for everybody forever. It's top secret. Don't tell anybody. Definitely don't live stream this talk or anything like that. This is a conspiracy. We don't want to let anything out. My name is Eli Malin. Like I said, I've got socials there. I'm a director of engineering at Live Peer, actually very close to my five year anniversary at Live Peer, which is longer than I've done anything. Today I want to talk about what motivates me and what I think we can do to make video better for everybody in the world. I'm also going to talk about Live Peer Catalyst, which is software we put out, a media server that I think is going to help us toward this end. I've been working in decentralized video since 2016. I quit my job to try and go start my own company and from there all the experience I gained in that led me to Live Peer. This is the best life advice I've ever gotten from Dominic Tarr who founded Secure Scuttlebutt. If anybody is familiar with that project, I asked him how I could contribute to Scuttlebutt and he said, figure out what you are uniquely suited to do and then do it. I think since this tweet that's been approximately what I'm doing is I happen to know a lot about video tech. I thought I could think of ways to improve the lives of people using it. That's what motivates me and that's what causes the mission to try and solve video for everybody forever. We've had a lot of nice state of the union talks in this room about how things are going. This will be the state of the union for our conspiracy to solve video for everybody forever. There have been some setbacks. Video is not yet solved for everybody forever. I'm going to talk about a few of them here. The first I want to talk about is corporate centralization. This is an analysis of game stream platforms, actually like all live streaming platforms of a certain category here. You can see we sort of live in a world where it would seem to send out video to people. You need to be a mega corporation. We've got YouTube owned by Google and Twitch owned by Amazon. Everybody else add up every other competitor and they don't come close to those two. While these platforms have enabled a lot of people to get started with video, there's also been some problems. In order to make these platforms work at these scale and make it work with all the different content producers, these platforms tend to be extremely strict about things like having copyrighted music. Even when we're talking about a clip that you put in years ago, if that gets reported, you can get your Twitch account or YouTube account permanently banned. This has happened to a lot of people. Basically, it wrecks their entire careers. They build entire careers on top of Twitch, on top of YouTube. Basically, these backroom deals that Twitch and Amazon, that Amazon and Google rather, have had to sign with different folks, causes them to, in order to make the copyright holders happy, in order to these deals they sign with the big music labels and that sort of thing, they have to permanently ban these streamers. I want to point out this isn't a law or anything like that. This isn't the rules about how you're supposed to use copyright. You're not supposed to use copyrighted content, but in terms of permanently deleting things, that's something that they just negotiated amongst themselves. But it basically, given this chart here, it basically has the force of a law. That's people's entire livelihoods going away. This is one group of people for whom we have not solved video for everybody forever. Another one would be, of course, Twitch streamers in South Korea, where Twitch has said, nah, it's too expensive to operate networks there, some sort of contract dispute between local Korean ISPs and Amazon. You just can't stream on Twitch in South Korea anymore, or you've got to go around in VPNs and that sort of thing, which is probably fine for the people in this room, but not fine for most people that are wanting to actually go about having a live streaming platform. YouTube, same deal. If you get three copyright strikes in 90 days per the contracts they've signed with the big record labels, yeah, just gone, gone forever. Here's another one. This came up in my research for this just recently. I used to do a little bit of Twitter live streaming, because it was a fun way to show, especially like the coding I was doing. I've got a local archive of all my stuff, because I'm a video engineer, and that's how I operate. Most people wouldn't, and so this is Twitter saving a little bit of money, post deal on acquisition, and then, of course, Tumblr's ban, which was on adult content, but there's sort of algorithms, just sort of ended up nuking a lot of people's old Tumblr libraries, things they had spent years cultivating and building, and that's just all gone now. I would contend that video has not been solved for everybody forever for these people. Oh, yeah, here's a fun one. Yeah, so a lot of people, if you build up a PlayStation and you buy an episode of Mythbusters, because you like Mythbusters, then you have Mythbusters and you want to watch that. Yeah, it's just gone now. You don't get to watch it anymore. Why? Some agreement between Sony and Discovery that happened in some boardroom somewhere that will never be privy to, and that's just gone. The concept of buying a piece of digital content apparently is just not something that really exists. Here's one in the news recently. Deepfakes, Taylor Swift was briefly blocked from being searched for on Twitter X, which is kind of funny because there is a ton of explicit Deepfakes being made of her and they didn't have any way to get that under control other than to block her from search entirely. This is an emergent problem. This has actually gotten worse. Some of this other stuff is long-standing problems, but this, technology continues to develop. This is an emerging problem that's making video less solved for people, so that's not very encouraging. This next one is just a little pet peeve of mine, but have you ever had a video on your phone and you want to put it up on a TV? Yeah, that's like impossible. You can get a Chromecast. The best ways to do that are Chromecast, which is of course, you can have this proprietary Google framework built into every app on iOS and Android in order to make that happen, which is not super impressive. Some very heroic efforts to reverse engineer that, but I don't think it's quite to the point where it would work in all cases. Or if you have an iPhone, you can pay Apple 130 bucks for an Apple TV. This is to get, think of the cumulative video knowledge in this room. That TV knows how to play an H.264 video, I guarantee it. It shouldn't be a problem, but to this very day it is. Curious if anybody, this is maybe my personal pet peeve, but I've worked in video adjacent startups for a little bit. I'm going to take it out on B2B SAS a little bit. This is the fate of every video startup if you don't actively combat against it. You might start with high ideals and all the stuff you want to open source and make a really accessible product to everybody, but then the big client comes along and they've got specific requirements, specific demands, and your company basically just turns into doing whatever they want. I've been around in the video industry for a little bit. This is very common pattern that's happened with friends of mine, and I think it's sunk a lot of other really promising products. More on that at live here in a little bit. That's the depressing part of the talk. Let's talk about some successes. There have also been some things that have gone really, really well. One thing that really took off during the pandemic is like masses and Plex servers. I don't know if anybody has got like a Plex login or a Synology login for a friend's mass or that sort of thing. My group of friends got really into this in the pandemic where anybody could access this. Shout out to my mom who I think has ripped more DVDs than anybody in America. I bought her a NAS for Christmas a few years back and she's, they just talked her and my stepdad through doing a raid on the NAS in order to get more storage on there. That was maybe the best Christmas present I've ever given. Yeah, this has been, these don't go away. Like Mythbusters on Discovery does when you buy it on PlayStation. Take a DVD, you put it on this and this keeps working. Little self-explanatory. This works. This has worked since it came out. BitTorrent is arguably the most successful decentralized project of all time. It works consistently. You can get everything on it and your content doesn't get yanked when somebody else, when there's some weird contract negotiation that happens. This is a popular one, maybe an unpopular one, but I did some work with video NFTs at Live Pier and these still work. On OpenSea, the video is all backed up on Filecoin there. You can play it on this and there's actually entire like crypto platforms that this, your creator library, you can upload this and it's an NFT collection associated with your crypto address. They could decide not to mirror that content for you, but in terms of like having your own content library compared with the YouTube and Twitch and Tumblr cases where they're like the company could just yank it from you. Yeah, no, this, you have to have some faith in the blockchain that you're running on of course, but other than that your library is your own. So those are some good ones. There's also some emergent projects I want to talk about that I think are going to help us sell video for everybody forever. The first is that decentralized social is going to become a thing. I put some of my favorite ones up here. I am a big fan of Blue Sky and the At Protocol. I worked with Scuttle, actually got my sort of my start in decentralized tech pre-crypto working with Scuttlebutt from Dominic Tar, who you saw earlier there. Some stuff on the farcastered lens in the Ethereum ecosystem. Activity pub and mastodon. I sort of give half credit. I have a lot of love for people in those ecosystems. I think there's a lot of incredible work happening. I think they, a lot of them, would acknowledge that the weakness there is the sort of the lack of a portable user that could move between servers like a concept of like a user with a key that can sign data and own that with what they're working for, which allows them to own their own media library and not be nuked when a corporation signs some deal or whatever. It's consistent with what I call the fundamental principle of decentralization, which is that user actions are sovereign. If I upload a video, I should have access to that no matter what else happens. They could decide to take my videos down from their servers, but in terms of like the definition of my content library, what videos are associated with me. That shouldn't be something that anybody can take away from me. Yeah, some of you may be familiar with this. There's a group called the Coalition for Content Providence and Authenticity. It's a bunch of companies now. It's started by Adobe and the BBC. This does a couple things. This was a little bit, these companies getting ready for a world full of deep fakes where to combat in a world where these models proliferate and anybody could generate any video that says, whatever, how do you ever trust the video you're looking at? And the answer here is actually pretty similar to the blockchain answer, which is you do so with signing chains. So you've got a C2P enabled camera, which has a little encoding chip on there. It's got a private key associated with that. That signs the video as soon as it's created. That goes to some editing software that makes some transformations to that and signs that it did so. And then by the time you get to a user on the web, they could theoretically click in the corner and say, okay, here, this video is created at this time and edited by this person. And it's not perfect, but especially with the provenance chain back to the original camera, you can have more confidence that you're looking at the right thing. This isn't in wide adoption yet, but I'm very excited about it. And I think it has an answer to some of the things I'm talking about here. So yeah, the goal in my mind for that to leverage this in the social case would be like, I could livestream on Blue Sky, transcode that video on the LivePeer Network, more on that in a second, and then somebody else could post a clip to their mastodon. And you could look at that clip and be like, oh, I'm looking at a clip from Eli's livestream, right, through the magic of signing and deterministic transformations and that sort of thing. Yeah, this is a really crude example that we came up with for how that could look, like a little button you can press in the corner of the video to get information on that. Yeah, so let's talk about, so that's me talking about some problems and solutions in video as a whole. Let's talk about what LivePeer has contributed to this. I'm going to give a really fast overview of what LivePeer is and the LivePeer Network. LivePeer's mission is to build the worlds open video infrastructure. We started out doing that with the decentralized network of video transcoders. Most people in this room has probably run into the concept that video transcoating is very expensive. Doing this kind of processing can cost as much as like a dollar per hour video on different cloud services. The LivePeer Network instead has this group of decentralized orchestrators, basically people running in video cards in lots of sort of unconventional spaces, not in data centers as people might be accustomed to. Because of that, we can offer sort of this radical low-cost transcoating solution that's much cheaper than existing approaches to this. But we got caught in some of our own traps here. We had the decentralized network. It worked really well, but as we stepped into the video industry, people didn't want to hear about loading up an Ethereum wallet with funds and transcoating on the network and that sort of thing. They're like, hey, can I log in with an email and password? We're like, so just sort of because it's what you do, we ended up building a SaaS product, which I'm really proud of. We did a lot of really good work. Shout out to the MIS server team who MIS server makes sort of the core streaming engine for the live streaming parts of this. But it got to the point where engineers at LivePeer couldn't even boot this up on their own laptops. This was just so chaotic, so many different microservices. It's a globally distributed team, so it tended to be like one person had their own service that only they knew how to operate and that kind of thing. My work for the last six months in particular started out as something called LivePeer in a Box, where we take all of this and we put it all in a single Docker image with a single point of configuration management and that product became what we now call LivePeer Catalyst. There's the LivePeer Studio hosted product. We've been using the phrase get lab model a lot. There's LivePeer Studio hosted product and there's LivePeer Catalyst, which is the self-hostable version that anybody can run on a laptop. I'm just going to give a really quick demo of this. Let's see if it'll cooperate. There we are. There's instructions for this on doc.livepeer.org. I intended to have this running when I started, but ended up having to restart this laptop here. Good, good. We should get a ridiculous amount of spam here. Shout out to MistController booting everything up here. Okay, okay. As I mentioned, lots of different services that are all crash looping until everything is set up and running here. We'll go over to... This is local host. I just have it as my URL here for the purposes of putting a TLS cert in there. This is a full self-hostable version of LivePeer Studio. I can step into a stream and hopefully this will work. I can go live right from here. Nice. We support WebRTC, ingest, RTMP, SRT, and then RISC as soon as the Mist server project that was mentioned ships. We'll have that. I'll stop the broadcast here. It's got scalable live streaming that can be fan out to as many nodes as you want. We've scaled up to... We tested most recently with 200,000 concurrent viewers all over the world. We've got assets. There's one I recorded when I was testing this. This is my live stream when I was sitting there finishing my slides. This is an asset associated with that. We've got lots of different features here, multi-streaming, so you can push out to Twitch and YouTube, stuff like web hooks and signing keys. Going for this freely distributable, full-featured server here. I want to leave some time for questions, so I'm going to jump back over here. Concluding thought, what do you take when you put all these things together? We've got the rise of decentralized social. We've got stuff like the C2PA, which is going to provide content provenance about video, so we know it came from. We've got this freely distributable MIT licensed server with all these different capabilities. What do you get when you put all those things together? The answer is I don't know yet. This is all emerging very quickly. I can see the future a little bit. I can see a world where some of these decentralized social projects start to take off, and of course any social thing is going to have live streaming eventually. In order to make all of that work, it cannot just be assertions by a particular server. We want to have signing mechanisms and that sort of thing. I'm looking forward to that world. I'm looking forward to building it. If any of you are interested, you can join the conspiracy here. That's the Live Peer Catalyst community page on the QR code on the right there, if anybody wants to. We're hosting a party this evening at Market Bar at 7 p.m. Feel free to stop by that, and we'll talk about solving video for everybody forever. I've got a couple minutes for questions, if anybody. Other than that, I can just give you some time to scan some QR codes. How can people contribute? How can people contribute was the question. There's a landing page over there talking about how to get started. Actually, if you go to docs.livepeer.org here, there's a whole catalyst section now, talking about both how to boot up your own catalyst node locally and how to develop on it. If you want to make changes to things internally, if you want to change some of the internal microservices, that sort of thing. Yeah, we also have, I didn't do it yesterday because I was here, but every other week we have the Catalyst Hackers Group on the Live Peer Discord. So, discord.gg.invite.livepeer, or you can just google it and figure out how to get there. That's all the people that are helping us build this sort of thing, coordinate there. Cool. Thanks, everybody.
Multithreading and other developments in the ffmpeg transcoder
Okay, so I'm Anton Hrnoff. I have been working on FFMPEG and LeBeuville for like 15 years. These days I work with FF Labs, which is a company that does FFMPEG-related consulting. I will talk about my recent work on the FFMPEG transcoder, the CLI2. So first of all, I will explain what this is, because this is a very frequent point of confusion for people who are not part of a community. So there is a project called FFMPEG, an open source project, and the main product of it are these libraries, LeBeuville something. LeBeuville codec is the main one. It's a suite of decoders, mainly encoders, encoder wrappers, and so on. It is used basically everywhere that decodes multimedia or encodes or does whatever, video players, web browsers, anything. Other libraries, AV format for muxing, demuxing, IO, AV filter for filtering. We have some other libraries that are less important, but AV codec is extremely widely used. Besides the libraries, we also have a set of tools, and the main tool is confusingly also called FFMPEG, and that is the reason this slide exists. So the tool is not the project, the tool is a subset of the project. We also have some other tools, which are less often used, but FFMPEG, the transcoder, is the thing I'm going to be talking about today, not the libraries. I also do work on the libraries sometimes, but that's not the concept of this talk. So I hope that clarifies things. So now onto the tool. So the CLI is, I think, one of, you can say, it's the most popular transcoder on the planet, or two planets until recently. It is based on the libraries from our project, obviously. We try quite hard to put all format-specific logic in the libraries, so the tool is agnostic. We don't succeed entirely, but mostly we try. It tries to expose the entire power of the libraries to the bits of it that apply to transcoding. So usually when a feature is added into the libraries, the first user is this transcoder. So if you want all the features as soon as you get them, this is the tool you want to use. And this is the reason, or one of the reasons, why you might think this is just a thin wrapper of the library, because everything is, everything, like all the heavy lifting is in the library, so the tool is just a very simple wrapper. This is not true. It is a very complex tool. And the reason for it is that multimedia is really, really complicated, and handling all of it, all the weird corner cases, is very hard and requires a lot of code. And it really covers an absurd number of use cases. Individual users use it to convert your video, your personal files. Giant corporations use it to run transcoding farms. And anything in between, there's an uncountable number of websites which are just upload your video and run it through FFMPEG, and so on. So it is used at all scales. It has a ridiculous number of options. I think it's roughly 200, and nobody can remember them all. So the tool is really quite a complex one. I will go through its history a little bit for practical reason. So the FFMPEG project dates to the year 2000, and in the first commit that we have back from the CVS days, there is already FFMPEG.C tool, which had about 700 lines of code. But it was quite different from the one we have now. It could only do raw input. It could redraw YUV or PCM, and or PCM. It could also grab from V4L or Dev DSP. It could encode them, and if you had both, you could choose just one, but if you had both, it could max. And the intent, as far as I gather, was to use it as a companion tool with another tool which was called FFServer, and you could then use it to build a kind of a streaming solution, which was a big thing in those days. But later on, FFServer had issues and was very sick, and we had to put it down. But FFMPEG, the transcoder, survived and thrived. So this was what we started with. As time went on, we got to this. It's interesting that we got to this in only a year, and the size got to about three times as big. And now we have decoding, we have demaxing. You see we can have multiple inputs, and every input can have multiple streets. Streams and a string can be either decoded or string copied, which means you just copy it without transcoding. The things that are decoded are then sent to an encoder, and then to a maxer for maxing, you can have multiple maxers and a single, I don't know how, I do, a single string can be sent to multiple destinations. So in theory, you could build these kinds of complicated processing graphs. In practice, the user interface was essentially unusable. It was impossible to understand without reading the code, and nobody could actually do it. But in principle, this was possible. As time went on, we got more features. We got subtitles in 2005. After some time, we got filtering. AV Filter was a GSOC project, which had a very painful development process. It was out of mainstream for a very long. And eventually, it got merged. And then the first user, one of the first users of AV Filter was the FFM-Background Scoder, of course. So we got that in 2010. Then later, we got what is called complex filter graphs, which are best explained in contrast to simple filter graphs. So a simple filter graph is something you could just insert somewhere here. It's just a black box, and which would not change the meaning of the arrow. So it's a black box that has exactly one input, exactly one output, and they are both of the same type. And a complex filter graph is anything that is not that. So we can have multiple inputs, zero inputs potentially. It can have multiple outputs. It cannot have zero outputs, because that's not useful. It can have different types between inputs and outputs. We do have some filters, for example, that take audio and turn it into a picture. And anything that is of that kind is a complex filter graph, and we got support for that. A few years after simple filters. Then we got basic hardware acceleration. Back then, it was more of a playback feature. People didn't really use it for transcoding or for any kind of advanced processing, and as we heard today, only now we are getting some things fixed in full hardware pipelines. So back then, we got decoding, it was mostly a toy, because many chips also could not decode faster than real time. So it was a very limited usefulness. A few years later, we got full hardware pipelines, which means that a decoder gives you a frame, which is a hardware frame on the GPU, so some opaque pointer or handle, and then you could pass it to filters, which would process it still on the GPU, and then you could give it to a hardware encoder and encode it, and the entire process would go on without copying the frame into main memory, and so losing performance. By then, so by 2022, which was when I started this project, the tool got to 11,000 lines of code. So non-trivial. We got dynamic parameter changes, we got an absurd number of options, like seriously, and the options interact with each other in highly non-trivial ways, and it is sometimes some massive pain to, even for me, like I'm the main tenant of the tool, and keep it in mind how all of these options interact is impossible. So our poor users. But they all, they want it. People need all this stuff because of all the use cases that it covers. So the general transcoding pipeline right now looks roughly like this. The change from the previous one is that we have filter graphs here, and as you can see, this is a complex filter graph with no inputs. It could, for example, generate some sound effect from, like, a synthetic one. And the top one has, you know, the middle one has two inputs and two outputs. So those are complex filter graphs. Besides that, it looks kind of like the previous one, but the code around it is a lot more complex. And the problem is that the way we got here looks roughly like this. Somebody needs a feature, and they add the feature, and they take the shortest possible path to that feature. And this is, in most cases, done without much regard for how much harder will this feature, which is bolted on top of what was there, how much harder will it make future development? So sadly, almost nobody ever considered this much. And then every such step adds a multiplicative factor to program complexity. So when you add the feature and another and another and ten such features, you have to multiply the complexity from each such step, and at the end, when you want to add another feature, every one of these before getting away, which means that it grows exponentially. And if you know anything about exponential growth, it means your program has a hard bound on how big it can get after that no human can understand it. And this is essentially where we got the fundamental changes to the transcoder became essentially impossible. So at this point, I would like to mention the same by Dijkstra, which I really like, and which I don't think enough people believe. People pay lip service to it, but if they believed it, they would not write programs the way they do. Basically, elegance and simplicity are not an optional luxury. They are essential. If we don't have them, we cannot maintain our programs. Just cannot. Nobody can. So this is the motivation for which I started this project two years ago, which was, I call it multithreading, which is true in a way, but really that's marketing. The main thing is bring code architecture, the way the code is actually written in alignment with this, because this is the way the program works. This was not the way it looked like. If you looked at the code. So the project was make the actual structure of the code match the data flow. And the way I did this was mainly actual object-oriented design. So make things into objects. The objects have their responsibilities. They have their private states, which other objects cannot touch. And the data flows downstream through this pipeline that you saw here. So ideally, the way it should work, you would think, is that, well, some data originates here and just flows downstream from each of these. This was not the way it worked. We would get teleportation. We would get even worse backwards teleportation sometimes. And this is just impossible to reason about. So that is, that needed to be solved. And you can see that, yeah, multi-threading is somewhere in there. Every component, every note on that picture you saw now runs in a separate thread. And you might think typically when you hear threads, you hear performance, right? You want more speed. But this is kind of almost a side effect you get for free by picking the architecture correctly. So it is important. But we get it for free. It's almost a side effect. Anything else I wanted to say? Yes. So and with the right kind of architecture, you can add major new features. You can do development and add actually new things. So the project was started late 21 and was merged quite recently about two months ago. It was massive. It was 700 commits in total. The way I did it was small patch sets and typically a single patch set would move things around, add objects, move stuff to them, make things private, clean up some old things which didn't work and nobody could understand. I often encounter an attitude that moving code around is just cosmetics. It's just clean up. It's not real programming. And I strongly disagree with that because the way I see it, you move things around enough and suddenly things which were impossible before, they suddenly become possible and sometimes they become easy. So it's really important to appreciate that just moving things around can really, really help you a lot. Along the way we got some extras. We got three filters for demuxing for people who know what that is. That is sometimes useful. If you don't know what that is, you don't care. We got latency probes. I think that's quite a cute feature. The transcoder was not really designed for low latency use cases but people tried to use it that way anyway. We are trying to add more real support for it. This is one of the steps towards it. Now FFMPEG CLI, if you pass it the right flags, it will tell you how much latency is added by each step in the graph, which I think is nice. This is enabled by this feature which is also interesting to library users because it became possible in the libraries and then the tool started using it, which is opaque pass-through, which basically means that you get a packet from the demuxer, you attach some user data to it and it propagates all the way to the filter graph and through the processing graph and then you can extract it here and you can add more stuff to it along the way. This is the way these latency probes work. It was kind of possible before but you had to basically do all the work yourself. Now the library does a lot of it for you, which is nice I think. We got timestamps improvements. We had some really bad breakage in timestamps handling for years and maybe decades. Some of that was fixed as a part of this cleanup. We have a nice really cool thing that is called the sync use, which almost nobody cares about but they make output predictable in some cases where it wasn't before. That's the status. Future work is, what we got now is everything is multi-threaded but that's not the end. Another thing, other things that I want to have are, so you see in this picture that we have demuxers and we have decoders and also we have encoders and we have muxers. The status right now is that a decoder is a part of a demuxer. They always go together and similarly an encoder is embedded in the muxer and this is limiting for a bunch of reasons because for example sometimes you might want to instantiate a decoder as a standalone thing without a demuxer. For example, you might want to pipe encoded output back to a decoder and sort of feed it back into a filter. There are use cases that need that and this is not possible with current design. So what I've been working on after that is splitting the decoders into their standalone objects so they can be instantiated on their own. That is work in progress and eventually I want to do the same for encoders because you might want to send for example the output of one encoder not to just one muxer but to multiple muxers without encoding it twice. This is also useful for some cases and this is not possible currently but in the future hopefully. Dynamic pipelines that is more speculative in the future are adding nodes basically at runtime and for that we would need some kind of scripting maybe Lua that's at this point this is just vague hand waving for the future maybe. There have been some mentions of an event loop based architecture maybe again this is something we are just thinking about there are no actual steps towards it. It might have some advantages to have like a single thread which dispatches work to a pool of worker threads. It might be more efficient in some cases. We'll see. So that is the current status. Thank you. So many of you. Am I supposed to choose? Okay so you're the first. Several months ago I noticed I was trying to package fmpeg and I noticed that you know we're on the help site at the point to any of the documentation for the library since it's in the header files of course. I noticed this in public in the fmpeg twitter accounts and stick the other accounts it began insulting me calling me terrible terrible names and as a result I don't plan to be working on fmpeg in the future. However I wanted to know is this something that you've personally seen or is anyone else because personally I think this stuff is very interesting but that unfortunately is an entirely separate issue that makes it very difficult for me to contribute. So I wanted to kind of leave it there and I understand that's not a question people have answers to but I really really wanted to say it's very important to me and I would like other people to ask that question. It is an issue a problem we are working on it but yeah it's not really related. Summarizing this case the question. Our community has issues. Sadly. Sorry. We know you know sorry we are working on it. Yep. This is really impressive. Actually that you put this all. So I was wondering how did you do it. How did you start with it and how did you map it all out and then bring it back together all while. The rich and architecture is still under development. Well I think the way I described it is. Okay okay so the question is how would I plan such a work like in advance right. How would I schedule the work. Yes well I think this is the way I described it as moving things around is really the way to do it. So these are kind of small changes that really you just take a small piece of functionality and you move it somewhere else. Sometimes you decide well this thing should not be visible to outside of its owner because it doesn't need to be and then gradually after 700 commits as you see. The picture becomes much cleaner because when I started with was any component can see and sometimes access and touch and sometimes even change some other component which is distance and related. And. Yeah you you identify a list of these of these instances and you clean up every single one of them. It takes a lot of time it took me two years but it can be done. It can be done. And I don't believe that it could be done any other way like some kind of an initiative. Fix everything at once I think that would crash and burn because because we have so much functionality that like. 20% of it would break and and users would riot and. Yeah that wouldn't work. Yep. Clean and well designed. Well if the code has a maintainer. Yes so how do how can I encourage the maintain the submitters to submit clean patches. So I think if the code has a maintainer who cares about the cleanliness of the code then I can tell somebody how this is this is garbage. Clean it up but most of the code unfortunately doesn't have a person who just sits there and reads our mailing list which is just a giant volume of patches. And if nobody rejects it then it often happens that code just goes in which is suboptimal so yeah we need maintainers basically who care about their their. Code or if we don't have maintenance we need to have people who care about. The project as a whole being maintainable and again that it's not so many people who are willing to really read the patches because reading the patches is. It's not fun sadly. So you would say that the strongest leverage you have is in case of the project. Well I can reject patches so yeah or I can tell people to to clean it up. Yeah. Of the future work you see going into. Release none of it which which work. What. And to expect the seven point out to be an LPS. So which which of this of this future work is going to seven point zero and the answer is. The answer is probably none of it because seven point zero is basically. Around the corner so I will not be able to finish any of this for seven point zero but seven point zero will be a massive massive release we have this we have VVC we have I am. We have a Vulcan AV one one we have so much stuff. Yep. Sadistic. So the question is. In whether the migration should be started as soon as possible, or should you wait until 7.0. 7.0 will break APIs, but the breakage is not big. So in general, I would recommend to do it as soon as possible. So there isn't as much work you have to do. Okay. So we are done. So thank you.
StreamCrafter - In browser broadcasting
Hi everyone, my name is Marco. I'm one of the maintainers of MISSERVER and today I want to talk just a little bit about the streamcrafter which is a broadcasting studio which runs from any modern web browser. Any questions about that? Well, it looks like there's still a few minutes left so I'll go ahead and give a bit more context about the streamcrafter. First of all, this is developed by the MISSERVER team so I want to talk just a little bit about what MISSERVER is. Then I'll move on to the streamcrafter itself. I'll do a quick demo and talk about what will be next on the real map. First of all, what is MISSERVER? Well, it's a media server. It's completely open source and public domain right now and it has a very broad support in terms of which ingest protocols, delivery protocols, codecs, containers, remixes on the fly and it's very efficient in memory and CPU usage. We think it's a fairly cutting edge media server so to say. Hopefully we'll get some more contributors in the long term to work on the streamcrafter with us because we're all backend engineers and especially making the interface nice and snappy is not our strongest part but we're working on it. So what's the streamcrafter? Well, let's say you are developing a social media platform and I have to deal with a whole bunch of users who do not know anything about codecs or configuring OBS to get a lower latency or a higher quality. Basically, they just want to drag and drop their inputs and have a go live button and that should be it for them. So what we want to create here is something which is very intuitive to use and the system integrator can then set up which kind of delivery protocol they want to use and just drop it into their platform. It's a drop-in react component and it's also a composter of course so the user can add cameras or screen shares and in the future we also want to add the ability to pull in streams from this server or pull in the video and audio feeds from a web-articity conference call so that the streamer can just composite all of these video feeds and use the audio mixer to their liking and just broadcast that. Cool. So this is a slightly outdated overview of how the streamcrafter started. As you can see, it's not that complex. You have a way to add inputs and then you have a way to mix all these inputs together and process them at an overlay or a sound effect or whatever and then you need ways to broadcast this. Now one thing which you don't see on this image is how do you get this media data from the input to the compositor. So right now the default way this works is it all happens in the main thread which is not ideal because if you have a whole bunch of inputs it kind of slows down a bit. So we're moving to a web worker mode which is already implemented but web APIs aren't really there just yet to make it work really well. So at the moment what it does is the web worker will ask the main thread for new frames because during the broadcast and then the main thread will send back individual frames to the compositor and it works. It's not ideal but in the future we hope to make use of the media stream track processor API which isn't available in modern web browsers yet but it would allow you to just transfer the entire video buffer into the compositor and then it can do all its work in a separate thread and then broadcast that directly. So let's move on to the most exciting part which is going to be the demo. Let's move this a bit to the side. So as you can see the interface is not our strongest part but it's usable so we're going to just add a scene and add a couple of inputs. It's a bit difficult to use from... oh, that's after a good start when... oh shit. Looks like my mouse isn't working on the big screen. Cool, so it looks like we won't have a demo of this but feel free to visit the website video.strong.rocks and play around with the broadcaster. But basically you can just drop sources into the canvas and it will also have a way larger canvas screen on your own monitor than this little screen over here and it should stream in low latency and you can just share that link to a viewer anywhere else in the world and they will be able to view it. There it is. Yeah. Here's the inspiration here. Oh, yeah. Thank you. Cool. Can you add a second window and we can also put the player side by side. I'll let you. That's green. So second screen. Well, let's just start streaming first then. That's fine. Yeah, so we've added a scene. Now let's add... just add a tab screen share. And then you just drag that on top of there. I don't know. The scaling is a bit off but it should be better on your own monitor. You can crop layers if you want. Like if you only want to share this part of the screen and it will automatically fit the layer inside the... Automatically crop the input to fit inside the layer so that people don't have to worry about stretching the input and it looks off. Yeah, and then you just click start, share the link with your viewers and then they can view that instantly. Cool. Cool. Well, it looks like it's not responding anymore. So what's next? Well, as you can see the UI needs a bit of an oval. It needs to scale better for mobile devices and for low resolution screens. Secondly, a code refactor because currently it's a bit of a prototype grade code. We want to make it extensible and easily maintainable. We want to have a plugin feature so people can add their own processing to the video layers, for example. And lastly, integration. Currently you can broadcast in web RTC which is fine for low latency workflows. But maybe you want something with a bit higher quality at the cost of a bit of latency. And so we're thinking of a tight integration with Miss Server so you can stream media data. For example, in Matroska format using HTTP readable streams, streaming directly to Miss Server. And that way you will get a bit higher quality without the low latency of web RTC, of course. But it should look a bit better. Cool. Are there any questions? What format is the video from the RZTAP to Miss Server in this case? If you're using web RTC it will be, sorry. So what format is being streamed from the stream crafter to the media server? Well, if you're using web RTC it will be Opus Audio. So you might have to do a bit of audio transcoding to get to AAC. If you're streaming with the other options which we'll be adding in the later dates, you will be able to transmit in any other audio format which is supported by the browser. It will be VP9 if you... What's the video codec being transmitted? It will be VP9. But this is also something which you can modify if you're using a different caption. Regarding the video and audio, what are the limits of the codec? What you can do with it? Is it machine specific, browser specific? Where do the limits come from for the codecs? So what limitations are there on the video and audio codecs? Yeah, I think you're doing it inside a browser. What are the limitations of that program? Well, I think the biggest limitation is... It depends on which browser they're using, if they're using a modern browser or an old browser, for example, modern browsers have very wide support for basically anything you want. Of course, anything that's happening on the main thread can get a bit slower over time. That's why you want to move all the compositing to a separate web worker thread to keep the main thread nice and snappy. Because if you add lots and lots of inputs, you can notice the UI starts to slow down a little bit. That's what we're trying to prevent there. So in a new browser, if it has AV1 supporting the future, you'll be able to just... Yeah. ...not AV1? AV1 wouldn't be supported right now. I don't think it's supported at the moment, but maybe in the long term. You're saying you will roll back-end to engineer, but this is all done in the browser, right? This is a direct, yeah. Well, I consider it a bit of back-end engineering because you have to transfer media data from to the web worker. You have to overlay them a little bit of math there. It is a little bit of back-end work, but you know, of course, the UI presenting it is all front-end work. Is there anything you should be running on a server? So this is all running inside the browser. It will be cool, of course, to offload it to, for example, a mis-server and do auto-compositing in the background because then you can maybe do a bit more fun stuff with it. But the idea is that you can just drop this into your existing platform and your users can start going live without any other setup required. Is the rendering hardware accelerated or does it need to be? This is currently not hardware accelerated now. What framework are you using for the UI? So what framework am I using? Currently, it's all written in React. We do want to put some of the processing into native JavaScript, maybe, but currently it's all hooks and components. I'm assuming it's open-source, but I'm just taking it out. Can I find it? Yes, sorry about that. So the question is, is it open-source? Yes, but we're still working out which exact lines we want to put out. We do have a repository. It's not public yet. I was supposed to do that before the talk. So, yeah, probably later today we'll have the GitHub repo up with the full roadmap and demo link. How many people are working on this? So how many people are working on this? Currently, it's just mostly me. It's products by the mis-server team. And hopefully, of course, in the long term, we'd like to have more contributors because it is an open-source project. It would be cool if we can have other people work on some nice plugins for more video processing or audio processing. But this was all written by me at the moment. Sorry, what would be the next steps to this project? We'll be the next steps. So we do have a full roadmap of other features we want to add. So, of course, the UI is the most important thing to fix because that's what the end user will be interacting with the most. We also want to have more publishing options because WebRTC does degrade the quality a little bit. We want that option for the integrator to say, you want to have the full quality and maybe at a higher latency. And also, more input options like currently, it only has screen shares and adding video and audio devices accessible by the browser. But it would be fun if you can add way more inputs than that. And we're not really focused on having advanced editor controls because that would be maybe too daunting for a normal person to use. It's more about having the flexibility for the system integrator to choose these kind of inputs, these kind of outputs. So the question is, how is this being uploaded? Is it being recorded or not? This is a very broad question. It's very broad because I do this solo, those jam sessions. And if I can just use my phone to do it and people can watch it, that's fine, but also I want to get the video afterwards. Yeah. So at the moment, it does not do recording. Of course, if you send it to a media server because it does support any web-based media server as well as the server-special signalling protocol. But if you send it to a media server, that can do the recording on the server side, of course. But adding recording from the browser directly, that wouldn't be too much of a feature to add, I think. It would be nice once you add to the web map. So thank you. I have a couple of questions. I don't know, should we have to... Yeah, no, that's fine. I think we have some time. One more question. What is the commercial app that is similar to this? Sorry. What is the reason that it is similar to this? Sorry, can you repeat the question? The commercial app is very commercial. But it is similar to this? Well, of course, so is there any commercial application currently already integrated or in the... Is it similar to this presentation in the content? You're talking about existing applications, which... Yeah, so, Restream, of course, they have web-based broadcasting suite. But they don't have the composing in the browser. The user can choose a layout and add a few, like the cameras and stuff. But it does look really nice. We want to get some ideas there from the user interface. I don't think there are many competitors in terms of what we do exactly. Of course, you have OBS, that's a client application which users to install and configure themselves. Sorry. Restream it. I have to check it out, but... Cool. Thank you.
PipeWire State of the Union
Alright, okay. My name is Swim Taimans, I work for Red Hat and I started writing pipewires some seven years ago, I don't know anymore, way too long. I gave a talk about pipewires last year, so basically it's a follow-up on that, a little bit of things that happened in the last year. For those who don't know what pipewires is, it's basically a multimedia sharing and processing engine. So pipewire was originally built to send video frames from Wayland to applications because screen sharing in Wayland was completely unimplemented in anything, so there needed to be some way of funneling those frames around. It went to a whole bunch of iterations to make that happen. It started with G-streamers, some custom implementation, and then version 0.2, which is something that sort of worked, and then it sort of devolved into an audio framework because people think pipewire is for audio, but it's actually more for video. So it devolved into an audio framework and here we are now. So basically the core of pipewire is to link applications and hardware into a graph. It's very similar to what G-streamers does, you make a graph of processing elements. In pipewire's case, this is distributed, so it's an IPC mechanism to funnel multimedia around between apps, devices, and so on. So there's a whole bunch of multimedia that you can funnel around, cameras, screen sharing, but also audio. So pipewire tries to implement all of the APIs to make that possible. So there is support for Video for Linux, there is support for Bluetooth, there is a compatibility server for Pulse Audio apps, for audio, a compatibility library for Jack applications. So you get all of these things here, all sides covered, and you can also run Jack next to it, but in essence, it funnels data around. It's built in the same principle as G-streamers, so it doesn't exactly know what data is, it just funnels it around. And it does so very efficiently or try to. So that's basically where it is now. We managed to build a whole bunch of stuff on it and replace Pulse Audio and the Jack Demon in most test-tops now with pipewire. So 1.0 was released last year, so that was a major milestone. Very happy about that. So for that to happen, I wanted to have at least as good latency as Jack server so that we could actually replace pro-audio use cases with pipewire without having to sacrifice latency or performance. So that took a while, but it eventually worked and now we are on par with Jack regarding latency, and it's using quite a bit less CPU for large buffers, and it's getting almost a little bit better than Jack for very small buffers. So that's pretty good. One of the reasons for that is Jack is more efficient even at lower buffer sizes, but pipewire is more optimized in its conversion and funneling samples around. So that's the compromise, I guess. So compared to last year as well, we have now support for NETCHAC with Opus. I think it was a question last year, why don't you have that? Well, now it is there. So you can actually NETCHAC between Jack and pipewire. They're compatible. One thing that doesn't exactly work very well is firewire devices. The problem is that I don't have a firewire device. You can't really buy them anymore, so somebody needs to send me something. They are like €1,000, you can maybe buy, I don't know. It's also professional audio, so you need cables to connect. It's just not a plug and so on. So that's a little bit of a back home yet. What else are we working on right now for AES? It's basically RTP. It's used for various hardware, professional hardware, that does audio over TCP and IP. So you can interface with Dolmete devices and so on. It requires like a shared clock with PTP and all of that. So we have worked that in pipewire. You can run the graphs with PTP clocks, it syncs and all of that. So people are testing that. Very specialized hardware, I don't really have any of these things. On the other end, we are now past the audio stage and now we are going back to the video. Because last year, some things fell into place to make that possible. For example, video modifier support was added. It requires a multi-step negotiation. I have these modifiers, do you support that? No, I do. I do all but then what video formats and what resolutions. And you need to go back and forth to arrive at the video format that the compositor in this case and Gstreamer, for example, or any other application like OBS, to get the most efficient video frames negotiated. We also added support for compressed audio formats. So for Bluetooth, we are still tracking, it's a draft, low energy audio. There is development in Blues, which is the D-Bus service that runs and there are all the connections with devices. And it exposes a D-Bus API that an audio server such as PipeWire could use to talk to the Bluetooth devices. So there is development there and we are trying to track that and match it to make that work. Some small things that were added that we don't actually know what to use it for. Interesting things that are happening is the video support. So I hope this year this will continue going forward. So we added video support in Firefox, so that means that instead of Firefox going directly to the video for Linux device with IOCTLs, which is not so nice in sandboxes, but which also doesn't work with newer cameras, because newer cameras, they need much more setup. They need to control setup media controls and all of that. So there is a new library called Lib Camera that also handles these new kind of cameras that you are supposed to use. So instead of porting Firefox over to Lib Camera, it's better to port it over to PipeWire, because then you get all these cameras that are new, but you can also do some other things like send video frames between applications into Firefox. I was going to try to demonstrate that, but the video support in OBS is still a pending patch to make that happen, maybe next year. So there is also camera support there. So in OBS, it's an application for making screencasts and YouTube videos and stuff like that. So you can compose some things and try to demonstrate that. There is also a thing called virtual camera. So OBS can export its scene, and it looks like a camera that PipeWire makes, and then you can actually consume that feed into Firefox, and you can start chaining just like you would chain audio processing elements, but then with video. So that's hopefully something that we will try to make work this year. There's some more work needed to get that going. So we are bug fixing small improvements, because there is nothing really to be done on audio that we know it should work. And all the remaining problems are, in my opinion, I don't know yet, driver issues, timers that don't work so well, unpredictable delays in drivers. So I think the work needs to be done somewhere else. No immediate plans to fix there. So all the work goes into the video side of things. So video routing. So we're working on video converters so that we can convert between formats. Like if you want to implement certain shaders that work on one format and not on others, this should be made possible. Also processing filters with Vulkan shaders or processors. So here, so now that Firefox and OBS use pipe wire for the cameras, we need to start thinking, okay, this is now going to work in flat packs without having to open the whole socket. But then we can also start adding security, like the pop-ups, do you want to allow this camera, yes or no, or take away access to the camera if you don't want it anymore. So there's some talks about that to make that better. This is in-planet currently with the portal. But there are other use cases, like for example, we don't have any access control for audio in browsers at all. But that is something that we'll hopefully flesh out this year. Another thing, explicit sync support. Again, if you do the video processing, it's better to delay or like to queue up as much work in the GPU as you can and then have GPU itself synchronize all the buffers waiting for rendering and stuff like that. So explicit sync would transfer buffers and also file descriptor with it that you can use to wait for completion of the buffer data actually. So that's also something that we want to try to do. And then tooling and docs, the things we continue doing. So I was going to show you a little bit what it looks like the video. Everybody knows the audio. Also a little bit of tools here, I don't know if you know any of these things. So there's like a top thing. This is interesting. It doesn't do anything because there's nothing going on. But you can also get these things like a draw view. I'm showing how now because then you can see the cameras as well as a device. So if you, I don't know, let's see if this is going to do anything. Probably does, but there's no, it's going to the HDMI. Anyway, you can see, maybe it comes on the feet. I don't know. So you can have like a little look what's going on here. In this is a tree view of the graph basically. So you have like the audio driver is iterating and pulling in samples from another tool, PA Play. You can also see this as a graph view. And all of these things, you can link them together to other things. So right, so each of these devices and nodes are in a graph. You can visualize the graph. You can change the links between these things and do all these things. So for example, for OBS, this kind of what it is. Well, I made a very stupid scene, but you can make some interesting things. I don't know. You can put some backgrounds there and place yourself there. So this is using a screen sharing from one of my windows. I think the terminal, but it could be anything using pipe wire and also the capture, which is a new thing using pipe wire. And also these things here, the microphones, they are still a bit pulsed audio. You can look in the graph here. That's becoming a bit more complicated. But you can see these green, these yellow boxes here light up. So you'll see that hopefully a bit more. So you know Shell, that's the screencast stream that sends video to this one. That's the camera from OBS. Yeah. So I was going to show some Firefox things, but there is no export button here. So normally in OBS, you can now do so start streaming and send all of that to, I don't know, one of the hundreds destinations that are supported. But you can also start a new camera, a virtual camera, and then you could consume that camera or this composition in other pipe wire apps. So if we enable all these pipe wire apps and we make them as efficient as possible with all of the video modifiers and all of the tools that we get from Vulkan. Yeah. We should be a step closer to the ultimate goal. Yep. Some other thing that's interesting. I haven't shown that yet, which is basically called filter chain. So you can do this. You can make a small little file. Wait, let me see where I put that again. Yeah, this one. Yeah, it's a conflict file. It's not very easily, but I can imagine GUIs that generate these things, but nobody has written any of them. But you can basically make a little graph of lots of plugins and LV2 plugins. And you can link them together and then you can tell pipe wire to make a new sync of that. That's the input for applications to use. And then that is the output of this filter. So this is something that does again. And you can just then I'll use some debug here. Okay. You can run this graph. And if all goes well, you should also see new sync here. So this is this new thing that appears. So you can just stop this program again and take away the sync. So this is interesting. And did I quit? I can do it again. And so here is this new volume sync. So you can just on the fly create and remove devices if you want. It's a bit like pulse audio with loading modules. But in pipe wires case, you don't actually all need to load them into one demon. You can have separate programs starting and stopping them as they go. So for this filter chain, for example, that's used on for like implementing like sound correction for speakers and all of that. We haven't done any of these things on desktop yet. Also, maybe something we can do. Like for example, on Apple, you get the sound of of apples are so great because they do a lot of filtering to make the frequency match speakers and all of that. So if you don't have that, it sounds very thin and a lot of laptops, they need some extra processing to make them sound great. Sometimes why they sound a lot better on Windows. We don't do any of these things yet. So that's also something that we can do with these filters. All right. Something else that that I don't mention here because it's actually another project, which is the session manager. I've shown this. One big component in all of this is a session manager. We use wire plumber normally. So that one is kind of orchestrating all of the things that happen in the graph, the devices that appear. If a player comes where it's going to be linked, how it's going to be linked, is it going to do a mixing down mixing or is it going to need some filters before it does that. So all of these rules are external to pipe wire in in a session manager. So a lot of work is also happening there. It's a separate project. But yeah, there's, for example, a version five coming out where all of the conflict files are rewritten in a different way. So that's also a change or interesting things that are going to happen. For the pipe wire demon itself, I think it's kind of that's what it is. No new plans. Okay. Yep. So the usual. Yeah, we worked a lot on our documentation too. There's a lot more stuff there. Also the weekend as a whole lot of stuff. It's a bit difficult to organize all of these things, of course, and it's why am I. This is weird. I didn't start the browser. Well, I could do that. I guess. We got tons of information on the week. All of the stuff should normally be documented somewhere or another. So a few. Problem is that it's so, so much configuration and so much options that people get lost. I tried to do some simple guides. How do I enable multiple sample rates and you literally have make this file put that in it. That's it. So. All right. And get up. That's where we are. So yeah. Questions. Yes. Speaking of docs, I was looking at them just the other week. I assume you have the ability to use your own event loop manager rather than the basic tutorial, which says create this one of pipe wires. Yeah. So the question is, can you use your own event manager or do you have to use the pipe wire one? You can use your own one. The pipe wire one, you can make it and then you can get the file descriptor from it and add that to your own, to your own loop. So for example, no shell does that. It uses the G main loop. KD as well. Is there something or do you know some project which hooks in speech recognition into the audio part and creates subtitles on fly? What they're in the stream. The question is, is there an application that hooks in the audio stream and generates subtitles on the fly? No, but it's a great project. I think. Yeah. There's also the case, for example, of keywords. So listening for keywords like, hey, Google, okay, Google or something like that, or I don't know. Hello, Gnom. Yes. When you talked about consuming the virtual camera, would you be able to send those sources to multiple destinations? Yes, so the question is, does a camera can be sent to multiple destinations? Yes. So there can be multiple consumers from one camera in pipe wire. I can actually show that. Just to show what's going on. How am I going to do that? I can, for example, start OBS. So that's one using the camera. And there's also, let's say, I think there's an example here. It's in build. Build. Examples. I think it's called video play. Other way around. No. The thing is, of course, the second one has to have the same resolution of the first one. There's no conversion going on immediately. There's a way to reorganize the negotiation and all of that, but that is a GAM policy for wire plumber, I think. So that's, I think, it's not immediately implemented. Yeah. I was curious about the, the work capacity. I know that there is an AES-627 plan. And I also was wondering if it was the same thing for video, maybe a SIMPTP or an NDI or things like that. So the question is, RTP or network support for video? Completely unimplemented. At all. So only done for audio. Yep. No. The current stage, like, is it just, do you have, you've been involved with people who are using the AES-727 communication? Yeah, I know people are testing it. There's an issues page about the state of it. So I'll have to look it up what it is exactly. But you can find it if you look for AES in the issues. You find it on all of the hardware that people test with things they have, the tweets they have to do, and then we try to all, so that's ongoing. I have to go over here. Yeah. Yeah. Thank you for making it because I switched to hardware like two years ago and it was just a very pleasant experience because it just worked. Yeah. And I've also been using it in music related stuff and all the places Jack for me as well. Yeah, it's great. Cool. That was the plan. It wasn't the, if I have to repeat the question, it wasn't the question, it was just praise. Yeah, we have more questions. I have two questions. The first one is for the wire plumber. Does it have a ground using the place or it's just a direct command line? Command line. So the question is wire plumber. Does it have a GUI? No. No GUI. So you can, for example, have several applications and all the sources you can.
Generating music with Open tools, APIs, and NO AI!
Okay. Thank you everybody. As this part of the slide says, my name is Steve, and as this part of the slide says, I'm a little bit of a geek. Now, if you've all turned up to the correct room, this is a talk about generative music. Over the course of the next 20 minutes or so, I'm going to be talking about some of the processes that I use to generate music with algorithms. I'll be looking at some APIs, not all of them. I'll be looking at some tools, not all of them. Doing some of the live coding environments that you may see around and talk about some other ideas that you might like to apply. Also, what's not in the talk is AI. There's a room downstairs for that. Everyone has been doing so much AI for the last year. I just said, I'll do something completely different. It's not that I hate AI. Last year, I wrote an album, five songs, four were co-written with AI, one wasn't. And I said, can you tell the difference? The answer was, no one so far has told the difference. Therefore, AI is not stealing our jobs. So, first off, who am I? What have I done to deserve a place on this stage? Or is this slide should be called the ego slide? This is where the speaker brags about themselves for 10 minutes. Everyone looks for the live streams of somewhere else in the building. I'm a computer geek. I'm a developer. Essentially, I've never done marketing, never done sales. I build stuff. I like building stuff. I build stuff in cloud. I build stuff on game consoles. I wrote a book about old retro computers last year. It was reasonably well accepted. I compose stuff. I've spoken at this conference a few times. And all that's really nice and fun and interesting. But what's more interesting is what's not on that slide. I'm not a professional composer. I make this stuff because I like it. I'm not a professional algorithm person. I do it because it's fun. That's a nice long-winded way of saying, if I can do this, anyone can do this. So, what is this? What are we going to do? So, the first thing up, I've got some audio on this, so I will have to put my mic closer to the laptop. So, the first thing is, let's look at simulating tape loops. Back in the 60s, there's a guy called Steve Reich, and he had this idea of having a piece of tape that just went around in a complete loop and then again on another machine running at a very slightly different speed. This calls sometimes the music would cry, sometimes it would separate apart. And this was really quite interesting if you want to spend 18 minutes listening to a New York preacher saying, it's going to rain, it's going to rain. So, it needs to say, no one has actually listened to this the whole way through. But today's your lucky day. It's okay, maybe not, we don't have time. So, this is a version of how to simulate a tape loop. And I'm going to do this with HTML5 JavaScript because JavaScript is the best language. Excuse me, because JavaScript is the best language. Correct answer. And it's all very simple. You create an audio context that just says, I would like to do some audio, please. You then say, I need to load a sample. I use the fetch library because everyone does. You bring it in, you do some munging of that data because you have to. And then you just say, I want to play the sample. And that's it, the job's done. Now, you get some additional things you can do once you've got that sample. You can say, right, well, I want to loop it between the one second mark and the two second mark. Just a parameter. You say, oh, I want to play it at the normal speed. Or if you change 1.0 to 1.1, you're playing it 10% faster. You change that to 0.9, you change it 10% slower. If you want to use semi-tones, then math power 2 to the power of 1 over 12 is the obvious mathematics you need to use. You can change the panning to move between the left and the right speakers. And you're doing this connect thing. All of the system is just basically, you connect this node to that node. You connect the sample node to your panning node. You connect the panning node to your audio context on the output. And that's it. It's the sound of the thing, any kind of pipelining system, you're doing it in the browser. So the first attempted audio, let's see how this goes. And you can see they're slightly in time. Completely out of sync. And then after a while, they're back in time again. And if anyone wants, that's an audio sample from Nine Inch Nails, which we'll come back to later on. If anyone was interested about the math power 2 function, if you're looking at the mathematics of music, every note and the note, an octave above it, is double the frequency. So you've got to do those little pieces of mathematics if you want semi-tones, but that's just math. That's easy enough, right? So let's create a remix. We know how to play sounds. We know how to load the sound in. We know how to play it. We know how to change its pitch. So what we'll do is find some source material. Nine Inch Nails, I think, for the seventh album called the slip, they made the album, I say they trend made the album, and then released under Creative Commons license all of the individual tracks from the album. So you could take any individual part, you could take the drums, you could take the bass, you could take the synths and guitars, and you could then do whatever you want with them under the Creative Commons license. So I did. I found some sounds that I liked. I chopped them up. I decided two parameters. How long into the song before I start playing that sample? And then how often after that do I play that sample? And that encode looks very, very simple. You load them in, you have two numbers. I have used prime numbers here. All of these are prime. There's a reason for that. I like prime numbers. I also know that because they're prime, they are going to clash less frequently. If I had the numbers two and four, those sounds would always come together. But with the primes, there's a longer time before they clash again. So in a lot of my music, you'll spot that if you are so inclined. We then write a little loop. We play samples, we do intervals and all the normal JavaScript loveliness, and we get an industrial remix that sounds like this. Now, you may or may not consider that music, but I do. And I'm the one with the microphone. So now we're playing samples, we're looping samples, we're doing all sorts of clever stuff. Now, it's your turn to go and do it. I did that because I found some sounds that I like to use. How would I describe to a room full of people how to build your own symphony? Well, first off, pick six notes that work together musically. Doesn't matter if you're not musical. Go to the library. Find some sheet music from some old white dude who's been dead 200 years, all out of copyright. Turn to the very last page, because the last bit of the symphony is when all the instruments come together, go da-da! So you know all of those notes are going to sound good together, whatever six you pick. Pick any six of those notes. Go to Google and map the notes on that page to what you actually have to play. Then go and find six sounds. They can be short sounds, they can be loop sounds, they can be long sounds, it doesn't matter. Pick six sounds that you like. Then attach one of those notes to one of those sounds. Pick a start point, pick a loop time, and congratulations! You've just written the Brian Eno album. That's all there is to it. And let's face it, if just, if 10% of you go away and try that now, next year we're going to have a full schedule of that music. And I can sit back there drinking my beer in peace. So what if we don't want just sounds? What if we want to create actual notes? And we want to decide what note goes where. So at that point we're moving to MIDI. MIDI is a specification about notes, not sounds. MIDI says play the middle C, but it doesn't say how the middle C should sound, because middle C on a piano sounds different to middle C on a violin. I use these libraries because I wrote them, not because they're the best, it's that because I wrote them I know how they work. That's the only reason. There are better libraries out there, go use them, but just find some library that lets you generate a MIDI file. That's what you need to do. Then you create an algorithm. Have an idea, doesn't matter what it is, have an idea, generate a series of MIDI notes, let the sequencer play them, assign a different sound to each of the notes and see what you come up with. And this is where I started. Back in 1996, I was reading a book on modern music, and by modern we're talking 1950s, and there was a piece by Giorgio Sighetti who poems symphonique. In this he said, this piece of music has 100 metronomes all ticking slightly out of time with one another. And I lived in a small seaside town, in the words of the song they should have closed down. They call it holiday resort. I live there, I call it a last resort. Actually, I really hope we're not recording this because this might go back to someone I used to go to school with. We had no record stores of any worth. I like the idea of poems on the field. I was like, what on earth does 100 metronomes ticking out of time actually sound like? I had no idea. And this is before Tintorwebs, Amazon, there were no record shops selling this stuff. So I just wrote a MIDI library and I simulated it myself. But I thought, well, instead of doing a metronome, because metronome being mechanical, will do exactly what you say. I said, well, I've got a computer. Back in the 50s, they didn't have a computer. So what I'm going to do is I'm going to sign a different note. I'm going to have one note that plays once a bar, which is that one at the top, then I'll have a note that plays twice a bar, then a note that plays, you know where I'm going with this, right? A note that plays four times and five times. And because I had a computer that could play 32 notes at once, I decided I would build up to that. And the code is very, very simple. You pick a series of notes that work together. You have a massive loop that just says play this note, wait an amount of time, then play the note again, because MIDI is a serial protocol. It is all about time. You play a note, time happens, you wait, play another note. That's what I did. And I created a piece of music with a bunch of channels that sounds something like this. One note, two. And it gets quite chaotic to the point of you can't hear any particular notes or beats anymore. And at the end, you notice I've also changed the instrument because the important part here is the refining bit. I was going through, I say this sounds very archaic. So what I did is I introduced every note individually. So you could hear them coming in, hear them dropping out. I morphed the sound of the piano into the sound of a harp. And I slowed the whole thing down. So as you get to the end, it just feels like very gentle and plucking. Completely different to where you start the piece. And that's what I wanted to portray with the idea. So if you've got an idea, you just generate music. You could use the digits of Pi. You could use gray codes. So gray codes is basically a binary system where only one binary digit changes from one step to the next. So I've done this. Every melody is different. Every single one of those is different to the one before. But it sounds identical, but it's not quite. Which means it doesn't fall foul of stupid laws that say you are not allowed to play repetitive music. Because not one bar is repeated for eight minutes. Everyone is different. Which you can only sense of being do if you're coding this stuff up. You can use prime numbers. I've used prime numbers a lot. You've seen. I wrote a piece of music for an audio book. The audio book talks about, it's a story about going to Mars and bringing a martian back to Earth. And I thought, what's the only common language between Earth and Mars? I figured mathematics and prime numbers. So all the sounds on here are generated by primes. Audio book bit. All generated by whatever the number happens to do. So I don't have control. I set this up and go, off you go. And sometimes it doesn't work. Sometimes it's a load of junk. But that's where the human comes in. I edit this stuff. I look at it and go, well, this did a bad job. That did an okay job. I keep going on. I've done something with all of these. We'll come to these later. And the online encyclopedia of integer sequences, brilliant. You'll never get bored. But the Fibonacci one on there is an interesting case in point. Fibonacci numbers generally shouldn't work for a music composition. Fibonacci numbers, you have two numbers, you add them up, that's your next number. You add the last two numbers, you add them up, that's your next number. And you go on like that. But very quickly, you run out of numbers. Or more precisely, you run out of keys on your piano. Because after about two bars, you've just run out of notes. So I said, well, obviously Fibonacci is useless. We cannot use this for music. Then I actually realized, well, what if you go backwards? What happens to Fibonacci numbers if you go the other way? And this is what happens. They alternate between positive and negative. So okay, there's something that means they can't go out of a range that quickly. So I said, okay, there's an idea here. But what is it? Do these notes represent semitones, tones? Are they going to be part of a key? Are they not part of a key? What are they? I didn't know. So what do I do? I wrote an algorithm that processes all of them. It just generated two hours of music essentially, using every combination of everything I could think of, until it produced this piano piece. You can hear it's going up bits and down bits and up bits and down bits following the patterns. And this is just me picking up the best bits of what it did. So naturally, it's not an AI here. It's a more artificial stupidity on my part. But it sounds okay. It's not awful. I've heard worse. I've written worse, to be fair. And if you're wondering why I haven't gone into the whole Web MIDI thing, I gave a talk on this a number of years ago. I refer you to the link if you're interested in the Web MIDI components. So we've looked at taking samples, looping them, and doing funny things with samples. What if, and looking at the MIDI, how do we want to generate actual notes? But what if we want to generate the sounds? Where there are a whole load of ways of generating sounds via little algorithms that you code up. So, Mozzie, for example, if you like arduinos, you've got this nice little soft synth thing inside an arduino. This was me building craft work out of four arduinos and another arduino to synchronize them all together. The very raw 8-bit sounds. So if you like that crunchiness in that old techno, they've got a drum coming in a minute. Very raw, very rough, but if that's the type of sound you're going for, all you need is an arduino. Synchronize them with I2C. Jobs are good. If you've got a Raspberry Pi, you can get an entire DX7 synthesizer into a Raspberry Pi that boots up from the flashcard in probably much zero seconds. In fact, if you've got a decent Raspberry Pi, you can get eight of them multiplexed on the same Raspberry Pi, which means you can build yourself a portable DX7 synthesizer. It's fun, it's great fun. Sonic Pi is a lot of people use this for the live coding thing, for its music. Fluid synths, if you're old school and you like the font synths, there's a synth font thing where instead of fonts being used for typography, they had fonts for sounds. And there's a whole fluid synth thing that lets you create, you know, use your own fonts. Super Collider, you'll probably recognize this as something that looks like code. That's something like a drum. That's what it's doing. C-sound is a much more low level approach to the same idea. Here you can see I'm programming in various frequencies, inputs and outputs, and at the end I just say make some tones. Or if you want them together. It's craft work in the box. This is a new one on me, so I don't know if it's pronounced glikol or glycol. Again, it's another programming thing. You say, oh, I have a bass drum running at speed four, I'm sequencing this 60 and this is how it works. And then you piece them all together at the bottom saying this is my output, and it sounds like this. Mercury is another one that I haven't heard of. It's the same approach again. You say these are the type of sounds I want and this is how I want it working, but the timbre is so different. Possibly a bit early in the day for this one. But it's all generated from those six lines of code. So you can imagine generating that and then playing around with it in real time. If you're someone that likes to use desktop, there's a whole load of other stuff. I did a thing inspired by a FOSDEM trip back in, I think, 2020. I gave a talk where I was talking about Web MIDI and someone said, you know what, wouldn't it be great if you could change the thing that you just did? Because I'd written a piece of music that was a fractal. You have a melody line and if you play every other note of the melody line, you have exactly the same melody line again, just half as long. And if you take every other note of that melody line, you've got the same melody line again, but just half as long and so on, all the way down to a single note. And someone said, that's a really good idea. Could you do a whole bigger version of it? So I ended up doing that. And I called it Symphony 1.1. I generated an interface which was using a graphics library. I generate the score using ABC, so you can actually see it, and then it plays in the web browser. Which seems crazy. You're generating an entire symphony in a web browser. So that's obviously not going to work as a live demo, is it? Yeah, okay, let's give it a go. Now, is that actually... No, that's not going to be on the screen, so it's not... So here we are. What we're saying is these are the notes where you're going to play every other. Pick whatever you like. Generate the score. There we go. Programmed and actually generated the score, rendering, and then we just play. And you can export that out as MIDI, play it with real humans or whatever you like. So with that in mind, I'm going to put this as some background music. While I switch back to some actual slides, this is also generated algorithmically. I'll say, thank you for your attention. This is me. I'm going to update my scorecard there. Okay, 23 postems. That's now 24. 24 talks. That's me done. Thank you all. Now, is it time up or is there time up and time for questions? Okay, yeah. So now I've told you how I do it. You probably want to know why. And I don't have an answer for that, so find another question. There's a question at the back. Do we have a microphone or just shout? I'll repeat. I'm wondering about what to do with harmony. Because the power is all the melodies, do you usually state a single key or do you have similar algorithms for harmonic modulation or stuff like that? So the question is, what do I do about harmony? And my knowledge of harmony is very limited. So I am generally sticking to a key. I'm generally sticking to the basic major and minor chords in that key. And then I will use a process that say, well, what note is in this key that fits the chord patterns that I'm going to use? And if I'm doing a basic CFG chord pattern, I'll go, right, well, I can use one of these three notes and I'll do that. In the Fibonacci example, which I played earlier, I used a process called tintin' boole, where you look at the notes you're playing and you just say, okay, well, I'll play the next note down that is either a B and F or an F sharp. And it just picks one. So I don't get to choose what the note is, I just get to choose the algorithm. And this is something that a lot of composers started doing in the late 1940s, 50s, because they just come through a world war. They weren't happy with it. They didn't like the fact that the people of the time were saying, you cannot like music this way. You have to do this with music. So they were saying, okay, well, how do I know what I'm doing? It might be subconscious. So they came up with a series of rules and they apply the rules and that's what I'm doing, just picking the rule and let's running with it. So good question to end on.
Open Source Community Updates
Good morning everyone. I'm going to speak about a few stuff about different open source multimedia communities and I'm going to speak about what happened last year because we had the beginning of the year which is why we are talking about 2023. For those who don't know me, I'm the president of the VLAN nonprofit. I'm an active member on a few open source multimedia community. I'm also doing other things outside of the open source multimedia community but I won't say that because that shouldn't happen. I have also a few companies who are doing open source multimedia consulting. So I was here last year, I guess, quite a few things happened, mostly about releases on FFMPEG, on David, on VLC and others. The good thing is like last year I came and gave some promises and we actually did the promises. Well, people did. I did nothing except making some slides. So FFMPEG 6.0 was stuck just after 4th them last year. It was quite a large release because there was a lot of discussion about what it was because we are trying to move FFMPEG to a one-year release schedule. So what is a large release and what is not a large release? So I started doing stats because for some reason no one did stats in the past and seeing that one release gets 200 people around is quite large. Not people sending patches but actually merging them there because the FFMPEG community still has issues merging all the patches that we receive. Major changes. The beginning of work on multimedia spreading mostly on the Merx website at the beginning, risks 5 optimization, hardware, AV1 decoding and some work on FFTs, new APIs. Well, see my talk from last year because that's mostly it. A lot of new codecs and filters. Well, I did the presentation a few days before release. The release happened. It was a big success. I hope. So then the next major release was 6-off. It was a bit difficult to get. We were quite late compared to our initial schedule. Our initial schedule would have been summer. It was more autumn, like October, November. And this one was supposed to be a small release. It's not a major release. And you still see that it's 150 contributors and the number of line changes is insane. Of course, the larger contributor is Anton. You should see his talk from this afternoon. But a ton of work on multimedia spreading of the FFMPEG CLI. Of course, it's not completely activated in 6.1, right? But still all the commits went through because Anton knows how to do small commits instead of big, major patch sets, which is easy to review. A lot of things happened on the Vulcan decode. Acceleration, hardware acceleration, mostly the work from Lin. It's maybe one API to all the new hardware API in the future, right? Yeah, OK, no. It's another one. At least this one is supposed to work cross-platformally. Well, then like a VR API. I did push a lot on FLV plus and RTMP plus, which is basically extending RTMP and FLV for new collects. So if you're not happy, blame me. I was deeply unhappy about all the new stuff which were supposed to replace RTMP and like whether it's called RIS, the sortie, or Rush, or stuff like that, right? Oh, it's going to be great. Yes. But RTMP is here. RTMP is everywhere. RTMP is on devices. And like, oh, yeah, we're going to do a new stand-down in 10 years was not really what I liked. Also because never happens, right? See the XKCD about that. So we're extending the RTMP to support multi-track audio, multi-channel audio, new collects, especially so now you can give AV1 and also HDR over RTMP. Is that a good solution? No. Is it a pragmatic solution? Yes. New decoders like RivaTuner, but also VMIX, which I quite care about, and quite a few passes for decoders that are coming afterwards, but they got in C.1. And the beginning of RIS 5 optimization. And for those who care on Linux, there is now an AV1 via API encoder. 7.0. 7.0 is out soon, TM. It's a very large release, probably one of the largest. EVC, for those who don't know, it's Samsung Kodak that was underlined by ISO, which is supposedly with less patents than VVC. I say supposedly because of course that's not true, right? Because probably Cisvell will do another patent put around that. The major part is VVC, right? It's mostly done by a few people, some in China, some around here. And that's probably one of the largest decoder that we've seen in FFMPEG in the last few years. Because as you know, everyone decoder was done in David, mostly for licensing reasons. We wanted to have AV1, MIT and BSD license decoder, see the essay of Stolman about why that's okay. And now we have a VVC decoder, right? So it's probably the largest work that we've seen on FFMPEG since HVC decoder, and it went a lot better. It's still going to be marked as experimental because it's not first enough, so we don't know exactly the security of that. But what's interesting is that's around 18,000 lines of code. It doesn't support the whole full VVC almost. There are a few features missing, so I'm not sure how many will be in 70. And also it's reusing some of the assembly from HVC, but also some assembly from David, right? Which is something I did not expect. But we'll talk about that for the next year, I guess, because we'll have a lot of VVC assembly this year into FFMPEG. QoA, more RTMP, more AV1 work, and lately AVIF support is coming. I hope real video will come because there was a batch on the mailing list. I think it was forgotten. No one cares, but just like it. For old guys like me, like having real video six was cool, and I hope Lynn can finish HVAC. Else it will go to the next release. Stats. So I did two types of stats compared to 6.0 and compared to 6.1, right? So if you look at for major release, it's 180 contributors, 2,500 file touch, and more than 350,000 lines of code in one year. That is huge compared to what we've done for FFMPEG 6, right? So most like, well, a good 50% degrees. Of course, half of it is from Anton. No, okay, no. Maybe not, but if you've not seen the talks from Anton this afternoon or the one from Anton at VDD, you have to, right? Because, but basically it's much better for everyone. And mostly people who are using directly FFMPEG CLI. And if you want to have any ABR letter, multiple encodes, multiple protocols, and so on, you don't need to use directly the, the, do a new tool based on the APIs. Of course, a lot of cleanups and API changes because it's a major release. So of course, a lot of threat safety because else the military spreading work will not work. A lot of things on arm assembly, mostly for HVAC, but also for a few others. So good, better speeds. And on the API changes, there is lots of new codecs and profile because of the one that we added. Quite a few things about HGR metadata, IMF headers and the related channel mapping changes. There is a new thing called Stream Group, which we're going to use for IMF, maybe for enhancement layers like LCEVC or other things like that, some Dolby profile. Seven, eight, I don't remember, right? But some of those. Lots of discussion about side data, including the new packet side data from stuff on Direct3D12. So we can have a Direct3D12 acceleration. And of course, because it's a new major bump, a lot of deprecations, including the final YUVJ deprecation. Yay, we've only talked about that since 2013. Yeah, so military spreading, see the talk, right? You have to. So that's mostly it about for FFMPEG. I'm now going to speak a bit about David. So a lot of things happen on David. Yeah, sorry. A lot of things happened on David in the last year, right? We had like quite a few releases. They look small. They're not. There are a ton of work in February, in May, in September. But what's interesting is that we did basically all the optimization for everything that you care about today, right? So all the neon is done, all the SSC3 is done, 32 bits, 64 bits. AVX2 is done. We finished by all the intratools, Z1, Z2, Z3, like really the stuff that, except when you care about image, are very small in terms of runtime. But all of that is done, right? So for normal people, the work on David is done, right? Well, I'm not sure we are normal people. So now there are things happening on AVX512, mostly by Enric, right? So, and the good thing is, a contrario from what people have been saying for a long time, which, oh, no, you cannot use AVX512 to be faster than AVX2 because of the issue with TDP and the clock changes. It's actually faster, right? And in many cases. And also because now we have other chips, which are not done by Intel and are quite competent, you can have AVX512 without going, slowing down the whole CPU, right? So I think it's mostly done for AVX512, because on AVX512, we will not do all the coverage of AVX2 because in some places it's not worth it. But this is some of the things that are happening on the next release, which is happening next week. Martin, maybe. There is RISC5 that was done by Nathan. So we start the RISC5 port. Mostly the inverse transform were done. Hopefully more people will help this year. And from nowhere, from China, some people arrive with a long arch support and they did a ton of things, right? Like a lot of the inverse transform, some loop filter, some loop restoration and MSAC, right? So that's quite useful. But still, it's like a bit more niche than the usual, the normal mainline users. Interesting things were done on reducing memory usage, because some people, I think, meta complain about that. And it was just like, oh, okay, yeah. One of the problems with memory on David is the way we're doing the frame threading, which is why David is so fast. But like one of the problems is that it can use a bit more memory. So we looked at that and we did some fixes for that. The next release, I don't know when exactly it's out, because there is some security issues that are integral overflows and I think are exploitable. So I need to discuss with Chromium people to be sure that they know before I do the release. But that's mostly about it on David. We are looking at David hybrid decoding on GPU, but so far I don't have much to say about that. A bit about VLC. We did quite a number of minor releases of VLC this year, mostly about 3.0, lots of security issues. The last release was 3.0.20, three months ago. And we've had a large number of downloads. We've seen 150 million downloads in three months, which is around 50 million per month. So that's very steady. And you know that I care about one release number or download on one release, because you can know, it can help to have the install base of VLC. So the good thing is that we are soon going to beat Firefox in terms of users, not because we're getting bigger, but because... But yes, that's interesting is that the number of downloads in VLC is actually increasing in the first month. Like in five months, we always get around 220 million. Now we're seeing that it's getting a bit more. Usually in three months, we have 120 million. This is what we had two years ago. So we're at least getting bigger. A lot of those users are of course on Windows and Mac OS. What we're seeing is that the number of users on Mac OS is increasing, which is worrying for me. But VLC4, a lot of work happening on the clock. We have lots of difficulties stabilizing the new clock, which is one of the large work on VLC4. And the cool stuff that we've been doing that are finally out, which is the VLC on Unity and VLC on Unreal so that you can use open source tools directly to output video and real time video inside 3D engines. And of course we did some stuff on VLC in the web browser because it's actually working now. But most of the things that happened this year on VLC are related to the Android and iOS versions. I don't talk often about those, but for them, because usually I don't have time, so this time I have. At the same time, we improved quite a bit the Android Auto, which is different from Android Automotive. Well, it's Google, right? They can find a great naming, then they fuck it up. So Android Auto is like your normal car and you can basically play with something that is on the phone. So we have had major stuff on Android Auto, right? So the app is actually usable. It's not done by a few nerds. And at the same time, we had Apple CarPlay, which is not like Apple Car because that's for 2028. But yeah, so actually now people are using it because it's usable. Most of the people use that for music, of course, and not really for video because you shouldn't watch video while you drive. Some people are laughing, but you know that now bigger cars have actually screens at the back for the kids to watch directly. Anyway, Android 3.5 and 3.6 of VLC got big jump on foldable, right? So because like we're back in the 90s, now you can have flip phones that you can open, right? Quite popular in the US, weirdly, when we see the stats. No one else cares. Support for Android 12, 13, 14 because, well, they need to justify new things. And of course, it's breaking the UI and breaking the permissions for absolutely no gain for their users. But mostly, we back ported or forward ported, I don't know how you call that, the web server features that we have on iOS, which is extremely popular to Android. So you can basically upload files directly through a web browser because MTP and USB is now completely broken on modern Android versions because they decided that, yeah, you can't use that anymore. On iOS, a lot of things that were already on the Android version came, right? The other way around, we tried to match them, play back history, but so everything like network library features, right? So you can use your Plex or DLNA server, SMB server and still have like continuity history and so on, right? External subs, for some reason, Felix, I did CDG support. Where is Felix? Why? People asked you to add karaoke of CDG? Who are they? Why? Okay, I'm sure, MKA. But the last thing interesting is we have now support for VisionOS. So if you have 200, then 300,000 euros, you can buy one, right? And so it seems that Apple has no idea why would we have support for VisionOS, right? The SDK is completely broken, nothing works, but you can run VLC on it and watch your favorite movies directly on VisionOS. Yeah, now I'm just going to speak a bit more about the community. We did great VDD in Dublin thanks to Anil and Vibhuti. We did, as usual, crazy stuff at night in Dublin. People thought we were crazy. We are. But it was like quite a good VDD. A lot of VLC and FFMPEG falls were here, so that was pretty cool. But it's important because like our communities are sometimes a bit difficult. And so on video land, we organized VDD 2023. It was important because like the last things were that we didn't have, we didn't do enough VDDs because of COVID. And so we did some elections. We have a lot of things that we're going to change in the nonprofit, mostly on the infrastructure work. We need to buy new servers and we do most of our infrastructure because our new servers are now 10 years old. We did a NAB boost, which was completely insane with our big Julien. It was quite fun. On the other side, on FFMPEG organization, there were lots of discussion about the community management. One of the reasons is that when we decided the General Assembly elections, and it was not precise enough the way we would update the list of members, so there was a lot of discussion. But like the problem is how do you reboot strap based on something that is not correctly well done? We should have used Lydia earlier to have like good organization. But anyway, this is fixed as part of this year. So we now have a good General Assembly and so we managed to be like T-Series, which is technical committee and community committee, so that we will be able to fix our discussions, at least decide on them. And also like we've been doing FFMPEG technical meeting, the last one was at VDD. We did offer one in June in Barcelona. Was it the year before I don't remember? And one at Fosdam, right? So trying to do what we've been doing on VLC, which helped a lot, also on the FFMPEG part. Last part is just like, and that's for a lot of people watching, not too many people in the room, but like the FFMPEG community needs more support, more corporate support, more money, right? It's now like a core infrastructure project, and it's one of the only ones that is not supported by Linux Foundation and CNCF and all those people who have a lot of money. So the only two companies were actually giving like really supporting YouTube and Meta. But the other one is very difficult to get any cent, because like some of the big GPU chip providers are very poor when I ask them. I would suggest that if you have time, look at the talks from Kiran at Demux where I explained all those issues, but like we really need help on all those things. And I think that's it. Thank you everyone. And because for once I'm not rushing, I even have time for questions. No questions? Yeah? What about the LTS? So in theory? Okay, so the question was, and someone asked the question to Anton before, and I think Anton skipped the answer. Yeah, I didn't answer the LTS part because I forgot. Yeah, yeah, yeah, yeah. You forgot or you did not like the question? If we follow the plans, 7.1, which will be at the end of this year, will be LTS. That's at least supposedly the plan. Are we going to match our plan? Yeah, I think so. The plan is to have 7.5, 7.9 as LTS, and so 7.1 as LTS. Yeah, there we go. No, 2027, that's a target. Other questions? Yes? Yeah, so Unity is a piece of shit company. They are using open source tooling and they're based at the beginning, the web completely based out of the work on C-Sharp on Linux, right? So it's basically a C-Sharp store. They're using a ton of open source libraries, but including LGPL and GPL libraries in their tooling and so on. But if you do extensions on the store, right, what like we're doing, they now refuse open source and not just GPL or GPLv3, right, like Apple or Microsoft, like just LGPL completely off, right? You
Innovations in H.264/AVC software decoding (Architecture and optimization of a block-based video decoder to reach 10% faster speed and 3x code reduction over the state-of-the-art)
Okay, I guess we can start. So, hello everybody. I'm Thibault Rafaillac. I'm a postdoc in Montpellier. And it's really a pleasure and an honor to be here. This is my first time as a speaker in FOSDEM. So it's pretty cool. So this talk will be about H.264, is a library that I've been developing for about ten years. And that was experimental initially. It was like a tool project, like a toy project to try different programming techniques, like unusual stuff. And then I've been working on towards a stable release since 2020. At the moment, it supports only Intel architectures from SSE3 onwards. And at the moment, it supports a progressive high. And the MVC profiles, which is Blu-ray 3D. Yay! So first a few benchmarks. At the time there was a measure last year, in November someone measured the performance. It's actually, it's currently faster than all of the state of the art. It's about 10% faster. And the most important thing is that it's three times lighter in both code size and binary size. On average, it's 10% faster. It's actually faster for smaller resolutions like 480p. And the speed being very unpredictable, I usually focus on code simplicity and number of instructions, as Anton would say, and I would very agree on that, on elegance of the code. It helps speed as a side effect, not the core effect. But the biggest advantage has been when adding new features like MVC. Because when the core is simple, basically adding new stuff is really like a breeze. That's less code to patch, less potential side effects, and easier remembering of the big picture when you come back to the project. Because I'm not working full time on this project, it's just a side project. First of all, just a little question. Who has ever developed an encoder or decoder video or audio? Just to see. Okay, quite many. Good. This talk will be quite technical. It's for you guys, or folks. Now it's time just to open the box and see which of the techniques I think have been the most impactful in simplifying the code and the architecture. First a bit of context. A.26 AVC segments and images in macro blocks, called macro blocks of 16 by 16 pixels, which you get from the byte stream in raster scan. So that's on the top left. Then each macro block is further segmented into 4, 4, or 8, 8 blocks in zigzag order. And when you have your code base, the code base basically is in like five or so parts. You first parse symbols from the byte stream, and then would compute inter, so for each macro block, compute a prediction based on either neighboring pixels or pixels from previous frames. As intra and inter prediction. And then you would add the residuals, like the rest, the difference to make the full picture with IDCT. And then going to the blocking, which is just a post filter that blurs the image along the edges of blocks. Okay, now for the meat. First technique. The first technique is very simple and almost of a troll thing. It's just put all of your headers into a single header file. Personally, I've always been very annoyed at tiny header files. When trying to understand a project structure, you have to open a lot of them just to get the big picture to know what is calling what and so on and so on. And out of this anger, I just put everything into a single file, which is about 6k, not that much. That contains all of the structure type desks, the inline functions, that are defined in each C file, and the same D type desks and functions, which I will discuss later. This has a good impact actually on code size, and it helps diving back into the project after a long time, because I'm telling you, I still remind that I'm not a pro side project. So far so good. The second technique is about the architecture of the codec itself. The overall architecture is designed like a hardware decoder, in that this is a graph of code blocks that are activated one after another. And then after expressing this graph, I express it in C, not the other way round. So I'm not thinking in functions in C, I'm thinking in code blocks, make a graph out of it, and then express it in C. So in C, it makes the nodes, the code blocks are functions, obviously, but then passing execution between code blocks becomes tail calls, and everything is done so that the tail calls are converted into jump instructions. Inlining is disabled so that each block is present only once in the binary, which helps reduce the binary size. And also, thinking in code blocks, instead of input output functions, makes less use of parameters, because you're not thinking into what my functions is going to take as parameters in return, it's just you're thinking, I'm going to pass execution to that function. But that function takes its input from the context structure. So that improves the readability overall. Next one, tree branching. That's one's the technical. In AVC, the intraframe prediction propagates neighboring pixels in a given direction. So in dotted, we have the neighboring pixels, and in full line, we have the block that we are trying to predict. For each direction, the code will follow the same pattern. First, we load the pixels into registers, CPU registers. Then, we would probably fix the colors that we fetched, particularly if the pixels belong to a block that was unavailable, that is in a different slice, or that is out of the bounds of the picture. If we do that, then we basically just propagate the pixels from the left to the pixels that are unavailable. Then we would compute the actual colors, so doing the math in the CPU. And finally, store the values to memory. That is the typical process that is executed for each direction in the intra mode. So AVC has nine directional modes. So basically, you get your neighboring pixels, and you have nine possible directions from which you have to propagate the colors. In the decoder, usually, it looks like this. It looks like a branching on the top, and each of the directional modes is one function. So technically, that's how it's implemented. It's usually nine functions, and branching towards these functions through an array of function pointers. But there are two things we can improve here. The first one is the fixed tests. The fixed tests are present inside some of the functions, not all of them. But we can see that they operate on the same conditions, most of them. They do the same tests, so we can merge them upstream. And the second thing is the storage code is basically the same. Not all the time, but most of the time is the same, so we can merge it. So when we do that, it looks like this. It looks like a tree. The storage is on the bottom, and then you have the compute and the loading, which has a fixed donor with it. And you have only one branching. So it looks like a tree, and in practice, what you do is you branch once, one conditional branch to the leaves, then go down with unconditional branches down to the storage operation. I told you this is very technical. In C, the branching is done with a switch. The branching inside the tree is done with go-to's, and branching out to the trunk is done with break, breaking out of the switch. I tried implementing that with functions, so doing all of the compute in different functions, but compilers get crazy at that. It's messy, so it's really the simplest is just to use switch, go-to, and break. The practical impact. In ABC, you have three intra-modes, intra-44, intra-88, and intra-16-16. Intra-44 has 14 leaves out of nine directional modes, so the impact is pretty okay, but intra-88 has 32 leaves out of nine directional modes, so that makes a good impact. Actually, in the decoder, the intra-88 benefits the most from this. Still, this technique is very general, and maybe applied to you in your code. If you manage to represent things as a tree, as a downward tree. Okay, fourth technique. In H.264, all of the context data resides in a single structure that is passed to every function. That is a classical technique, I would say, in many decoders. That you have one structure, that is the mother of all structures that stores everything. It's the context structure. In H.264, this structure, the pointer to this structure, is stored in a register. It's just mapped into a register with GCC that allows it. That's very dumb. The code actually looks like this. If we have GCC, we assign one register, we reserve a register to that pointer, and we patch all of the function calls so that GCC doesn't pass this pointer to functions, and Clang, or other compilers, will pass this pointer to functions. Easy, right? In practice, the binary size is reduced by 5% with GCC, and there is a minor speed-up, and it actually helps on my builds, GCC be faster than Clang. Yay. GCC9. After that, I think, for some weird reason, after GCC9, the performance actually drops to a greater version of GCC. So far, so good. Fifth technique. So I'll try perhaps going slightly fast on this because they are very hard to understand if they're not into A, B, C, but still. In A, B, C, every block has neighbors, and when you predict the color of the values of a block, you basically look for the values of the neighbors, and what the spec contains a lot is conditions. Basically, when I ask the value of a block of a neighbor, I'm asking first, is my neighbor available? Does it belong out of the picture? Is it out of the picture, or is it in a different slice, or does it exist? So if it's available, then I can fetch the value, but if it's not available, I fetch a default value. And the second test is the neighbor coded as the same mode as me, enter or intra. If not, default value. If so, good value. So one technique is just to allocate fake blocks in memory that will contain the default data. So basically, your picture will be surrounded by unavailable blocks which contain default values. And the second technique is that all of the blocks that you decode will contain the default values for the other modes that will be decoded. So for example, if I have a block, a macro block, that is coded as intra, it will still have motion vectors that will be decoded for inter. So it will behave like an intra and an inter macro block. As you would guess, this makes the code consume more memory a bit, but it makes the code a lot simpler. And I mean it really a lot simpler. This is actually a very important technique, but it's hard to achieve because you have to look at the spec and spot opportunities where the spec allows you to do that. So it consumes about 25% more memory than FFMPEG in practice. And that will be possible in the future to reduce it to 5% more in FFMPEG only. Sixth technique. Here is what it looks in memory. So we have the picture in a strong line. And in yellow, you have the blocks that are actually, the macro blocks that are actually stored in memory with the neighboring blocks stored on the top left. And as you can see on the right, you have no blocks because the memory circles around. So basically when you look on the top right, you're actually looking in memory to the top left. To the left. Yeah, that's a joke. This is still about accessing neighboring values. So in H264, we have a problem when accessing neighbors of subblocks. So in a macro block, blocks are stored in arrays. And when you want to access the neighbor of any block, any subblock, its neighbors may be in the same macro block or in a different macro block. That's what we see with B. So B is stored in the same macro block, fine. But A is stored in a different macro block. So in a codec development, typically what you would do is copy all of this in a buffer. Copy the values from the neighboring macro blocks in the same buffer so that you have everything packed in the same thing. What I do is different. I use pre-computed memory offset. So that's a nasty technique. Basically, when you have your value in memory, you know where your neighbor is going to be in memory. You know it's there. It's in a macro block. The offset is basically constant in memory. I know it. But the problem is that it belongs to a different structure. So what I do is just to compute the offset with offset off in C. And I look to the memory at that position. It's non-standard C. So you didn't hear anything. So the first two are the sevens technique. I have nine. These are the last three. I promise there are eight and nine are pretty better. So this one is about inter-prediction. In AVC, for interframe prediction, each macro block will be first segmented in two rectangles, sub-rectangles, that each will receive a different motion vector. And the problem is that there are many possible shapes. So there are many possible ways to just cut a macro block. So there is not a single number of motion vectors that you will fetch. Traditionally, this incurs many tests in the decoder. For example, if I have a 2x2 block, then I will fetch one motion vector. If I have two rectangles that I will fetch, two motion vectors, and so on and so on. That's a lot of tests. That's many tests. The thing is we want to merge all of these tests together into one. That's possible. To do so, we assign a bit mask to that shape. So we convert the shape to a bit mask, where every rectangle will contain one set bit and the rest as zeros. Then we convert that into an integer. And note that on the left, we have nine rectangles, or squares plus rectangles. And on the right, we have nine set bits. So we know that we are going to have to fetch nine motion vectors out of the byte stream. To do so, we loop on set bits. That's a classical technique. We do count leading zeros on the bit mask to get the first set bit. And we clear it with the last operation. That clears the last set bit. This has a good impact on code size, because it merges all of the tests into one loop. And it has a minor impact on speed by reducing the number of conditional branches. So less pressure on the branch predictor and so on and so on. The general pattern, if you want to sit in and reproduce that, is to convert your macro block type into a bit mask, then iterating on the bit mask. And in the decoder, it's used to parse the reference indices, the motion vectors, and the residual coefficients. Okay. Using vector extensions. So in H264, SIMD are used everywhere. And by everywhere, I mean everywhere, in the whole of the code base. It's not separated into decoding and parsing. The SIMDs are used even in the parsing. And this is thanks to the use of GCC vector extensions. So I use GCC vector extensions. And to be clear, I use no hand-coded assembly in the decoder. It's all C. It helps actually reduce the register pressure in scalar register, because every time you have some scalar code, if you have an opportunity to vectorize it, you just reduce the pressure in the scalar register. So anything that you can do with this SIMD, it's just a win. Just a big win. So to do so in code, first you define your vector types. So this is standard GCC extensions. Then to define vector-capable arrays in your context structure, I use union. It went through a lot of trial and error. And this one is the only one which doesn't generate random stack spills by the compilers, both GCC and Klang. So this is to me the most stable one to use. And the last thing is that I shorten all of the Intel Intrinsics so that two reasons, to make the code more readable, because the Intel Intrinsics are amazing to read, and to help the future ports to other architectures, ARM and RISC-V, whatever. Four things to care about when using vector extensions. First is don't use built-ins. They are a bit unreliable, and they may generate big code, although you don't always know. So to be sure, if you want to use a specific instruction, don't rely on built-ins to generate that. Just use that specific instruction. Then don't use vector sizes that are not supported by the native host. Compilers get crazy at that. Don't index a vector by a variable, obviously, because then the compilers will just push down on stack and then index the stack. And don't use automatic vectorization, as this is very theoretical too, but it's not reliable enough in practice. Okay, still got some juice. Last one. So this is a very important, a fun technique, actually. I will take here the example of the de-blocking filter in ABC. So as a reminder, in ABC, the de-blocking filter is a post-filter that blurs all of the edges in that order. One, two, three, four, then five, six, seven, eight. It's conditional blurring, but for the sake of this presentation, let's assume we de-block all of the edges. In yellow, these are the reads. We read pixels, and in orange, we read and we write the pixels. The problem we have is that to filter all of the edges in one continuous operation, we would need to store all of the pixels, so all of the values. As a reminder, one pixel is one byte. It's a luma value. We would need to store all of the macro block values into SIMD registers. In yellow, that needs about 22 hardware registers, although we have only 16. This is physically impossible. So what we traditionally do is that we work one edge at a time, loading the matrix from memory, transposing it, filtering it, then storing it back. And we proceed edge after edge until we're done. If we count the reads and the writes, each vertical edge is 16 reads, 16 writes, and 50 shuffles for the transposed operations, back and forth. Each horizontal edge is much easier. We do six reads, four writes. And in the total, it makes 88 reads, 80 writes, and 200 shuffles. Can we do better, of course. Now, if you imagine that you have an infinite number of registers, as in C, but you get a penalty, the more registers you use. What we would do is we would load the left matrix, filter edges one and two, and then write the left pixels that we won't need anymore. Then we would load the right half of the matrix, filter edges three and four, and at this point, we have the entire macro block values in registers or in live variables. And some might be spilled on stack, but that's not my business. That's a compiler. Then we would transpose the whole thing. So the whole matrix would transpose it. We load the top missing parts. We filter edge five and proceed down with edges, storing the values as we complete. So ideally, if we count all the reads and all the writes, it makes 35 reads, 34 writes, and 163 shuffles. So that's a net win. But, of course, there is a but. Ideally, if I would code this by hand, it makes 22 spills, spills on stack. And actually, with Klang, it makes 40 spills. And 40 spills means about 40 writes and about 50 to 60 writes. Reads, sorry, reads. And it's less on later architectures where you have 32 hardware registers. So that's somehow a win, but I'm not quite sure. It's like, meh. The good thing is that this register-saturating technique is easy to design and implement in C, especially with regards to reordering instructions, because spilling is very sensitive to that. It has a good impact on code and binary size, and a minor impact on speed. It's useful for de-blocking and inter-prediction for the six-tap filters, which I'm really happy with. And last but not least, in general, using C instead of assembly for the SIMD code allowed me to improve the filtering code, the core code, by about 20%, 20% shorter, especially thanks to eliminating redundant operations used in macros in assembly, and reasoning with more compact code. So it helps you get the bigger picture and really be more ambitious in how you design code. With this in mind, I would encourage you to consider switching to assembly... No, sorry, C. For SIMD code. In my own experience, I've always been able to find important improvement over assembly code at the expense of register pressure, of course. In literally every part of the decoder, be it inter, intra, de-blocking, and not IDCT, because IDCT is just perfect, but how the mark transforms. In these cases, the use of C has made the code less overwhelming and allows you to go the extra mile and optimize further. With this, I would like to thank you for your attention. You can see the code there, and I will take any question. APPLAUSE And last note, Kudos and last round of applause, please, for Olivier, who is here, and who agreed to open-source this work at the very beginning, so open-sourcing for free. APPLAUSE Are there any questions? Yes. Just out of interest, the speed, basically, benchmarks, whether single-threaded or multi-threaded? Single-threaded, all of them. Sorry. So the question was the benchmarks where they're single-threaded or multi-threaded, and so the benchmark conditions were all single-threaded for all of the decoders. Thumbs up means last question? Oh, I think you're OK. Thank you for your attention. APPLAUSE
FFmpeg VVC Decoder
Good afternoon all. I'm here to talk about the VVC decoder in FFMPEG. I'm going to introduce VVC. I should imagine if you're in this room you're already somewhat familiar or at least interested but I'll refresh some of the coding tools and some of the objectives that it has. Talk about where FFVVC, the FFMPEG VVC decoder fits into that. Again, what new coding tools VVC introduces. Talk a bit about the threading model which is one of the most more interesting things for those of you who already have some experience with FFMPEG. Then go over performance, how that compares to previous codecs and the other VVC decoders out there. Conclude the talk, talking a little bit about the Google Summer of Code program this summer and the next steps for FFMPEG. First of all, a disclaimer. I did not write very much of this code at all. The credit should go to Noemi in China who unfortunately couldn't be here today. Who am I? I am Frank Palmer. You can find me at frankclammer.com. There's various other contact details on there. I was one of the Google Summer of Code students this summer working on this project. As you saw in the agenda, we'll talk a little bit more about what that involved later. Going into the introduction then. VVC or not H265, H266, that should read, is a new standard from the Java. It's succeeding H264 and HEVC, so quite big boots to follow. It's got two main objectives. It aims to have 50% lower bit rates than HEVC for the same quality of video. As the name suggests, versatility is the other main objective. That involves a lot of new coding tools for things like screen content coding, adaptive resolution change for things like video teleconferencing, independent sub-pictures. Versatile applications underlie a lot of the decisions made in the design of VVC. The open source landscape of VVC. For encoders, you have VTM, which is the reference software. You're not really going to want to use that for practical encoding. You have ENC, VVNC, which is developed by the Fraunhofer Institute. That is a practical decoder, encoder very fast. Finally, you have UVG266, which is an open source project developed by the community. Then on the decoder side, you again have VTM. You have the dual of VVNC, you have VVDEC, which I believe there's a lightning talk on that in a little while, which is very fast, very good decoder. You have also developed by the Fraunhofer Institute. You have OpenVVC, which is a community project VVC decoder, which is relatively performant for a single core. Unfortunately, that has now been abandoned. I don't think there's been a commit in about two years. Finally, we have what this talk is introducing, FFVVC. The state of FFVVC, the C code was merged at the start of the year. I believe it was a month ago exactly now. As John Baptiste talked about in his talk a little while ago, we believe it will be in FFMPEG 7.0, but possibly under some sort of experimental flag. The Inter-Prediction Assembly has just been merged about a week ago. We have some other assembly that has been written and is in the review process. It's important to note though that FFVVC is not yet maintain complete. There are some coding tools that are missing. The big one that we've heard from the community is intra-block copy support is not yet implemented. There is a patch set for that that's in the works. I'd be doubtful it will be in the 7.0 release of FFVVC though. Most of the other features that are missing are things that are a bit more exotic than intra-block copy. Features such as wrap around for 360 degree videos not yet implemented, independent sub-pictures, reference picture resizing, some of the more exotic stuff, but that will all come in time. This shows the assembly status, what has been written so far, what we're prioritizing, and what we've been able to reuse from HEVC. Things that we've prioritized so far are largely low hanging fruits. The inter-prediction we were able to reuse quite a lot of that from HEVC for good gain. SAO is entirely identical between HEVC and VVC so we've been able to rip that directly. Inter-prediction and ALF are both big contributors to the decode time in C only, their high priority. One of the GSOC projects last year was working on the ALF stuff so we'll talk about that a bit more so that's on its way. Inter we've managed to get some bits out of David for the more generic stuff just like averaging functions. That's been effective in getting a quick speed up there but we need your help with this. There's not many of us working on this at the moment and there's a lot of assembly to write. That's going to be key to performance as we'll see in the performance later on. Decoder size. I believe the biggest decoder now in FFMPEG in terms of lines of C. I'm not sure how it compares to David but even being the biggest decoder in FFMPEG it's still much smaller than open VVC and VVDC as you can see here. How do we manage to achieve that? By being in FFMPEG basically we're able to reuse parts from previous codecs. We're able to use the CBS Quebec reader you can see there and reuse like whole swathes of code also parts of the binary so it's kind of hard to measure that but you get a more bang for your buck in terms of the size of a compiled delivery codec. In the future I believe we may be able to also use some aspects of hardware decoder APIs to do the DPB reference management. We managed to be much much smaller and that's one of the main reasons really motivating putting this inside of FFMPEG. The other one being FFMPEG's vibrant community we can say which hopefully will help maintain this into the future. Moving on to what's new in VVC so there's a lot of new coding tools like a dizzying amount. You can see here you could talk for an hour and many people have about even a subset of these. As you say we haven't implemented them all yet but there's loads to play with which yeah feedback to them the ability to make much smaller bit streams and also to make more versatile video content. What FFVVC introduces that's new for FFMPEG is this stage-based thread model so lots of previous codecs have the frame and slice thread models which do well for sort of low number of cores but have some sort of here ceiling at certain point and so FFVVC uses a much more fine-grained thread model which is able to allocate threads based on the stage of decoding individual CTUs and yeah as that says it means we're able to much better utilize higher core counts and so our C code with no assembly we're able to decode 4k over 30 fps on you know relatively high-end desktop processor but I think that's really impressive. This thread model is possible to implement in HEVC. FFVVC does not use it I think it's also possible to do stage-based decoding in AV1 but it wasn't a factor in the design of AVC. The way that it works is you divide each CTU into several stages of decoding they're all listed there and the key thing is that each stage depends only on the current or previous stage of the neighboring CTUs and so you can start doing the D block of one stage before you've done the pass even in the like top left corner very far away sorry before you've done the intro I think you have to do the pass for all first and the effect you get from this is this sort of wave front that progresses across the image of each of the different stages and yeah it allows you to use much more cores. To allocate those cores we've had to introduce this new AV executor utility which has been made available in LibAVUtil so you can use this for other projects inside FFMpeg. It's a really simple algorithm at the moment but centralizing the control of allocation of threads you know not repeating yourself means we have now one location where we can make improvements here. It's a really simple algorithm it's based on I think some of the earlier implementations inside Python and Java's executor structures or whatever they call them but yeah having that one thing in one location that can be used throughout FFMpeg to improve multi-threading. Yeah so onto the performance section so at the moment it's pretty slow compared to previous codecs I mean this is to be expected by to a certain extent VVC is just a more complex codec than previous generation stuff it has to be in order to achieve high rates compression. This SIMD here false and true for FFVVC so this is with stuff that's not yet in FFMpeg master this is with the current state on the development staging repo. You can see we are getting about over 200 over a doubling of speed increase for FFVVC already but there's a long way to go as you can see from David's really impressive assembly speed up they have there but our multi-threading picture is quite different so that shows you the effect of doing that stage based multi-threading we're just much more easily able to use higher numbers cause yeah note here that this is using hyperthreading which is why you've got quite the knee there at six threads and but below six threads it's really not far off from that ideal you get a core you get the same multiplicative increase in the speed up comparing it to VVDC then. VVDC uses the same stage based threading model so you're getting a very similar performance between FFVVC and VVDC. Open VVC uses the conventional frame and tile based multi-threading techniques so that's quite useful on the left hand side there that figure to compare what is the effect of this new threading model but you can see and then on the right hand side the single threaded performance C only between FFVVC and VVDC is pretty much on par. VVDC behaves has quite significantly different performance on different operating systems but the average between the two is pretty much the same and on 4k it's a similar picture but everything just gets slightly more pronounced. Open VVC is slower that the speed up that we're getting from using more threads matters even more for larger videos so you can see that effect here but we're still lacking on the assembly front so VVDC has a lot of assembly already for quite a few different architectures and you can see that they're really pulling ahead once you enable the assembly there. The theoretically FFNPEG VVDC decoder should have somewhat of a higher ceiling due to the fact that FFVVC's assembly will be handwritten whereas VVDC's is using intrinsics and on some architectures using SIMD anywhere as like a portable SIMD library which introduces them overhead so with enough time hopefully FFVVC can be even faster but we've got a long way to go to catch up to them at the moment. So just sort of wrapping up to the last couple of things here so talking about the Google Summer of Code program in 2023 so there was two Google Summer of Code students contributing to the VVDC decoder this summer. Myself and Sean Liu so I worked on a lot of the stuff that was added in version two of VVDC so that includes the support for 12 and 14 bit which needs the range extension which changes various things to the entropy encoder when you get to higher bit depths and I've also been working on AVX2 optimizations for the inverse transforms they all had to be written from scratch in the end there's not very much that you can share between HEVC and VVC due to the way that the HEVC transforms are written in FFNPEG and Sean Liu is working on also on assembly transforms for the filters which some of them are in the process of being upstreamed at the moment I believe. So yeah next steps as I'm sure this performance and what we've been working on has sort of shown we've got a very solid baseline with the C performance and the multi-threading but we need lots more assembly in there to be able to compete with existing decoders so upstreaming and what we've already got implementing more functions with assembly also more architectures so ARM is going to be a Google Summer of Code project for this summer potentially also risk five there's a lot of work on doing risk five assembly for FFNPEG at the moment so we'll need that in time polishing off the maintain conformance so implementing those features that I mentioned for missing earlier particularly intra block copy is a high priority the thread optimization 32 plus cores so we may be able to improve the AVX2 utility for higher core counts if there's sufficient demand for that and the GPU based decoder so a lot of the stuff in VBC is really well designed particularly to do with the separation of stages that we saw earlier means that it's really well suited to decoding on the GPU so that's something on the far horizon. Concluding so FFNPEG now has a VBC decoder I've introduced that new threading model showing some of the benefits of that talks about the C in multi-threading performance and how that compares with VVDC and given an update on the status including the optimized assembly we're currently working on we'd help with this like especially with the assembly there's just very few of us who only work in our free time so progress on that front has been relatively slow so yeah patches welcome alright yeah thank you very much for listening. If anyone's got any questions I'll be happy to try and answer them as best I can as I said in that just like disclaimer I did not write very much of this code I just did you know the bits I've talked about and then I've worked on doing bug fixes especially since we've one thing I forgot to mention part of why we're going to have to be experimental is OSS fuzz we've only recently started being fuzzed since we went into FFNPEG master so we're getting a lot of reports for that at the moment that we're trying to work through before we go into like a normal release but I'll try and answer any questions as best I can yes. So the question was have we considered trying to use C in forensics? Yeah as a step between having fully C code and having handwritten assembly for everything it's not the FFNPEG way FFNPEG everything is handwritten assembly I think there's a little bit in like lib SW scale I believe but that's when the FFNPEG is in the process of removing that tiny bit of C in forensics that we still have so yeah I mean we're probably not going to do that just out of you can go faster with handwritten assembly so if we're trying to get that same performance and even be VVDC I think it's the only way to go really. Okay there's no more questions yeah thank you very much.
Edit video/audio with or without Vim
you you you Hi, for those of you who are here at the last talk, you will not believe it, but exactly the same thing happened to this talk. My friend in China, he couldn't come here, so he asked me and also Dr. Zhao to co-present his interesting software. So let's start. So it is a video editing software and you know traditionally video editing is quite time consuming with all the drag and dropping going on. And my friend thought it would actually be more efficient if we have something that's kind of text based in the terminal. So let me kind of just do some quick demo. It's going to be very preliminary. This solution has a lot of features, but I'm just going to walk through the most basic ones. So let's say we have some videos, ignore all the temporary roughcast something files. So let's say we have take this one, open Wi-Fi 2022, which is actually Dr. Zhao's talk the year before, first time. And so let's assume that you have obtained some subtitles in a SRT file. Maybe you did this yourself. Maybe you use some like AI generators like those of those these days. And what you do is the first step would be to convert that to a TSV file, TSV being just tap separated values like CSV, but just the tabs. Although it's not like any TSV, it's a kind of a specific format. So for instance, and now I've skipped all the installation steps. But if you have my friend counts Vim plugin installed and you open this with Vim, you'll be greeted with this. And it looks like a normal editor, but there are actually like a lot of key bindings that you can do. So for instance, it's integrated with the MPV, which is just a media player that has like CLI options. And if you press like tap, it will just play this line. There's no like hopefully my laptop is loud enough. Yeah, the subtitles are not very precise. Okay, so that's kind of just that, but that's not really editing video yet. But let's do some very simple editing. Let's say we want to just cut these two lines out from this video. You just select them and just press space. Good morning, good afternoon, and good evening. And you have a video that has these two lines. And you can do more fancy stuff with it. Of course, for instance, let's say you want to just, let's say we want to pick part of this line. Let's say we just want, we don't want good afternoon. We don't care about people who, for whom the time current time is afternoon. So just if you just press pipe, it splits this line into two and we can skip this line. If you just press backspace, it skips this line. You will notice that what it did is actually change the first column. And then if you explore this again. Good morning, good evening. Right, we skipped out the part of good afternoon. We don't care about afternoon people anymore. Okay, so that's the basic stuff and I only have two minutes left. One thing, one final thing I will show is just, so we saw all those MPV windows. You can see this, the popping up and popping out, but maybe if you want to just have it open all the time, you press backspace, backslash twice, it pops up and you can do things like seek to this place, seek to that place or like you can seek an MPV and bring the time sign from MPV to the TSV. We can do fancy stuff like that. And also, if in any case this is not professional enough for you, which will be very surprising to my friend, you can explore this to Final Cut. And if you just press X instead of space, it will export to FCP XML file, which you can then open in Final Cut Pro. I would make my friend really sad if you have to do that. Okay, so that's like a very quick tour of what you can do with this wonderful VIM plugin. But one thing that's what came in mind, like this talk is called with or without VIM plugin, is really just like a way of using the text or the text TSV files to video editing. So my friend uses VIM, so he developed a VIM plugin, but it's very easy. If you don't want to use VIM, you can just use CLI yourself. It's built using a bunch of CLI utilities. We can develop a plugin for a different editor, so that will make my friend count really happy. And there are more fancy features that I don't quite understand. So I would just play a quick demo video that he prepared. Hello, I'm Grover. It's called Elvish. Okay, that's all for today. I'm joking. Let's get started. Go manage it up. Okay, there's more information. It's all linked from the FOSTA schedule page. And do we have time for Q&A? Two minutes. Okay. I'm going to take questions. Yeah. Yes? Is it only Final Cut XML or can we just export to a normal XML that we can also import from your profile example? I've heard that the FCP XML file is supported not just in Final Cut Pro, but also in the other... Oh, the question was, does the exported XML file only support Final Cut Pro? My understanding is that there are multiple video editing software that supports the same format. It's called FCP XML, but it's actually supported by multiple programs. Sorry. You're supposed to take the question. My answer is confirmed today. My answer is confirmed to be correct. Oh, Ash. Oh, bad.
S2S: PeerTube instance dedicated to Sign Language
Hello, welcome in Brussels. We are glad to be at FOSDEM with you and we wanted to talk you about our project which is called S2S and the VAMO is a project to tell everyone who has sign-in videos that we can put them in the same place. Many videos are very good on the internet, they are signed correctly and we would like to get a copy of all the videos in the same place. We use peer-tube to do so and if you go at our website you will only have videos that are available in sign language or with a specific subtitle for deaf people. You know the subtitles, the colors for example, and where it is written things that happen in the movie but that no one speaks for example is a door closes and makes some noise outside as a camera then it can be written so we like this video too. The better video we can have is with sign language and subtitles because therefore everyone can understand everything. This project has no need of funding. It is a very cheap project because it's just hosting a website and we just use the peer-tube software to host videos so it's about 100 euros per year to run this project so we really don't need money. We need people that do the same thing in other countries. We will be glad if someone does the same thing in Germany because we could federate our peer-tube instance and have much more content available and we also need many people who know about this project to post more videos so we can grab many many more videos. Do you have any questions about our project? Show us the next slide. Show us because I feel exactly on the next slide. You can show it. You can go to the slides. There's one right there. Chrome is already open and probably being faster. Sorry. Okay. Thank you very much. Thank you very much. You
5G-MAG Reference Tools: Bringing 5G Media to Life
All right, let's go. So who of you know anything about Fiji Mac? Can you raise your hand? Fiji Mac? OK, that's why we are here, right? So let's start by who we are. So Fiji Mac is an international non-for-profit cross-industry association. And what we do is basically to apply global internet Fiji-based access, global APIs, to media, to multimedia applications in the domain of, for instance, production, contribution, news gathering, et cetera, but also on streaming, also on broadcast, and also new media, like XR. Applying all these technologies means not only taking care of media, but also taking care of network capabilities, network features, transport protocols suitable for doing and then architectures for streaming, for CDNs, for broadcast, MantiCast, network assistance, satellites, like non-terrestrial networks, and so on. We decided to launch a development program not just to talk about technology, but to actually build stuff, right? And we have established a community of developers that is sponsored by Fiji Mac, but is open to anybody. And what we do together with, let's say, these companies and our Nolfe-G Mac members is to develop reference implementations of all these technologies we've been explaining for validating standards, for building demos, for testing and experimenting. And that's, obviously, for everybody, network operators, service providers, broadcasters, and so on. What we do in the reference tools development program is actually to build all these series of technologies. You can go to the website and you will have more information there. Particular, for instance, we have our own set of CDN notes. We can get metrics in terms of quality of experience, in terms of consumption, reporting, so who is consuming my player, my OTT player. For instance, that's information useful for service providers, broadcasters, and so on. We have also developed our own end-to-end system for something called Fiji Broadcast, which allows you to broadcast radio or TV to your OTT app on your phone, not based on the internet, but based on broadcasting. And we are recently onboarding a new project for AR, MR, let's say XR applications, and we have started developing all these series of tools where you can actually do this, right? So you can create content for XR devices, and you can put your, for instance, TV channel for users on this, on this display. This technology works, actually, we have been at IBC, for instance, showcasing Fiji Media Streaming, what I explained, and also Fiji Broadcast. If you wish to participate, this is fully open to everybody with an interest on all these technologies for production, contribution, and distribution. We have a GitHub repository where you can find all the projects, there's documentation, at the moment there are plus 30 repos with different technologies. We accept code under the license terms that contributors feel comfortable with, that means we have a repos on OSI, licenses, but also on other kind of licenses, please check if you would like to contribute. And you can also participate, anybody is welcome to participate, there's a Slack channel, we have public calls every Friday for developers, for academia, for the industry in general, and we have also a Google group with information on announcements, releases, release candidates, testing periods, and so on. You can find all the information at FijiMak.com slash community. So if you have an interest on Fiji Media Production, an app link video, on streaming, on Fiji Broadcast, multicast, beyond 2D volumetric video, et cetera, et cetera, next time, please have a look at FijiMak, thank you. Thank you. Thank you. Thank you. Thank you. Any questions?
dublang, a multi-language live coding system
Hello, thank you first the organization for accepting my talk and thank you for coming. And in this talk I will present software that I am developing since I think two years ago named the Dublang. It's a multi-language live coding system. And then I will do a very short presentation here with a small video demonstration. And first a bit about my profile. I work as a research software engineer inside the project named Cortex Platform in France inside the University Gustave Eiffel. Also I am a collaborator of the software heritage project as an ambassador and a debunk contributor and also as a hobby. I am a live coder and visual artist. I am very interested in live coding to produce sounds and to produce video as well. And that's why I created this tool to support my interest in this subject. First the name of the project. I think it's important to mention from where comes the inspiration. The Dublang name is inspired by the musical style Dub. And Dub consists of remixes of existing music and Dublang consists of remixes of existing software. Then one of the goals of the Dublang tool is to have a single interface text to our interface live coding interface to manipulate and to use multiple different tools in the same source code in the same session. Then how it is designed. The Dublang system is designed in a client server architecture. And for that I am using in the client side I am using Nail Vim text editor because I found it very easy to implement not also because it's easy to implement using the Lua language that it's a really nice script language that fits very well in the purpose of this tool as the purpose is to mix different tools in the same environment than a script language like Lua language works very well. And then in the other side I have the servers that is being managed by system D service. Then here is one example of how it looks like Dublang source code. Then here is an example where I have two different languages and the region with the hashtag and exclamation defines a region for a specific language. And then I can have in the same source code different regions with different program language. And then for each language I have to implement inside the Dublang system extension through a plugin. Then the Dublang system is built on top of the architecture is pluggable. And I can create plugins, new plugins to integrate new languages or new tools. Let's see if I can play this video here as example. I hope the sound works. It doesn't work the sound? No, it doesn't. But I don't have sound yet in this moment. Oh, that's still something. You can try. You should be here. Plugging this into the audio object. If you're looking, you might get sound. What I do with this? Which put a click left or right? Sorry. Oops. What happened? I don't know what happened here. Oh, man, what's happening? Okay, there we are. I clicked it in the wrong button. Sorry. It won't feel screen apparently, but there you are. Let me go back one or two seconds. Then here, I have more or less the same code I showed in the previous presentation. Where is the sound? I lost the sound. Ah, look here. Oh. Yes. Then here, when I evaluate this, it's being executed by the super collide server. And then in the same source code, I'm going to add some... I think I finished my time. Just to finish this is now them. Then this is Bambam, it's being executed by the Tidal Cycles language. Then two different servers and the client is sending it to the proper server. Sorry for extending my time. Thank you for your attention. No time for questions, I suppose. Thank you.
VVdeC<>Arm: Optimizing an open source VVC decoder for Arm architectures
Okay, so yeah, I'm the filthy, uh, from over guy that asked the intrinsic question. Um, and my name is flow and nice to meet you. Um, last year, my colleague Adam, I had a talk here about, um, the open source VBC decoder, um, VV deck and the encoder VVank here at foster and I, I optimized the decoder for arm architectures in my master thesis, which I will talk about now. So basically that's this on the right. They can see the, um, Zindi optimization of VV deck. So VV deck is optimized for SSE 4.1 and AVX 2 4.2 was needed. So 4.1 was enough. And to be also able to run VV deck on arm architectures, um, the, the open source project Zindi everywhere is used, which basically like in this case, parts the SSE implementation to arm architectures in the justice by using, um, either built in functions, um, or neon intrinsics in this case because it's arm and, um, but it can also only use, um, scarline implementations and tells the compiler to like vectorize it automatically. So yeah, a combination of these. So my goal was to, um, yeah, make it faster for arm. Um, for that, uh, the first thing I did was identify the hotspots. Um, I was profiling VV deck using instruments since I was using this M1 PC here. And yeah, um, I divided the profiling into three steps. First of all, I identified the most time consuming functions. Um, with these I checked like the performance on arm versus the performance on X 86. And the third part was since VV deck is implementing, uh, every Zindi function as a non vectorized version as well. Um, I compared by wanted to know like how much, uh, speed up does the Zindi implementation generates. And with all this information, I chose for the foremost promising function, which basically means I wanted to get the biggest bang for the buck. And yeah, I chose to optimize these four functions. Um, on the left you can see these, the names of these four functions. Don't mind the names. Um, the only thing that is interesting is like the speed up. And, um, yeah, this graphic shows the manual optimization optimization. So the optimization I did versus the automated, um, the automated optimization from Zindi everywhere. And I visualized this for one of the JVET video sequences for a quantization parameter of 43. And, uh, yeah, you can definitely see that like two functions have a really nice acceleration so compared to the Zindi everywhere implementation. So in this case, the apply load Zindi function and the X get SAD function, but generally speaking, uh, you can definitely notice that Zindi everywhere does a decent job and to, yeah, in comparison to like just optimizing with C and forensics. And yeah, after having a look at the single function accelerations, I also wanted to know like how much is the impact of the optimization of these four functions on the general, um, on the total acceleration of VV deck. So I measured 11 JVET video sequences two times, obviously, since I need to compare them and average that for every or for, um, common quantization parameters. And yeah, the range is between 3% and 9%. What is not definitely noticeable is that like with, um, decreasing quantization parameter, the speed up gets, um, lower. And this is because the bit rate is higher with lower quantization parameters. This may, uh, this is because, um, not this is because, but, um, and because of that, like the decoding of the entropy decoding is getting more complex and yeah, it gets a bigger piece of the cake. So yeah, that was basically my master thesis in a nutshell. And after that, um, I also integrated like Zimdi everywhere to, um, to, to port the AVX to implementation, to arm, which also led to a contribution, which was pretty nice. It led to a conclusion to Zimdi everywhere since there were some, some errors in the portation. And right now, since there's also an encoder, I'm repeating the optimization for VVNG. And in the future, we might also optimize for the scalable vector extension like directly, or the scalable matrix extension. So yeah, thank you for joining for us. If you have any questions for free to ask, you can also ask me at the, I don't know, post foster and drink up. I don't know. Yeah. I have one. So what is translating across all the speed presets when you do the encoding, the decoding improvements? Uh, what the presets? Sorry. So when you do the encoding, you have different presets, right? So I didn't know that's you are asking about the encoding. So after the encoding, when you decode, right, does it translate across all the presets for recording? Cause every preset may not have all the tools. Uh, yeah, that's true. So the question was like, um, they are different, um, like there are different presets in the encoding, which affect the functions called in the, um, decoding. This is, um, true. I mean, I did like, uh, I tried to get a general overview, which functions were used by like profiling several, um, yeah, several, um, settings and tried, uh, yeah, and tried to figure out which functions were used most and average that basically. So yeah, there's a like a bigger story behind that behind the profiling, obviously, since this was only a five minute talk. And yeah. Does this mean that I can use a Raspberry Pi now to decode it? Have you tried to use the ARM devices to see? Okay. So the question is, um, can I use a Raspberry Pi to, uh, to decode it? And I mean the Raspberry Pi is based on an ARM, right? And I would say, yeah, obviously you can write because, um, I mean, you could do it before as well because Zimdia everywhere was included and Zimdia everywhere ports, uh, the SSE implementation to ARM, which, um, I mean it doesn't do it like perfectly, obviously, but we actually, um, submitted a paper or some colleagues of me are submit to the paper at a mile high conference. Um, yeah, I mean, I can, I mean, I can even probably put it up on, for, on the foster side, maybe. If you want to see that where they, um, like measure the performance of Zimdia everywhere on ARM and, so much of examples of what platforms, I mean the platforms, uh, which are supported are also like visible on the GitHub repository. So, um, yeah, this is also, um, on foster website. Um, like when you go to my talk, like there's the VVDec repository linked to it. And there you can see it. Um, yeah. That's tight. There's another question. Why don't you probably, you're simply by hand instead of using the quality performance in June? I mean, obviously we are still the best, right? So we are still the best when it comes to decoding and encode. But, um, like VV, like VVDec and VVN is performing pretty well in comparison to other VVC, um, coders, I would say, right? Um, yeah, that's true. That's obviously true. I mean, of course we have a head start, but, yeah, but I mean, let's see, right? I think nothing better than a healthy competition. Yeah.
Bridging Research and Open Source: the genesis of Gephi Lite
All right. Well, thank you everyone for having us. So I am Paul. I am from Westware, a company based in North in France, together with Alexi and Benoit. We have the full team here in front of you. And we are going to talk about software, JFE Lite, web applications to analyze network online. But we will be more talking about the genesis of this open source project, which is very linked to academia as you will see. Okay, let's get started. So we are going to present the history of how we ended up creating this software. The first point started in 2007, where in a university, in University of Technologie de Compiência in France, a professor called Frank Gitala, who is actually a linguist, started to do research on mapping the web. Looking at the web as a space of documents, and I see how we can actually study that. Together with Mathieu Jacomi, with there, they created a session called Web Atlas to start creating tools and do research on mapping the web. Mathieu created the first prototype, which is called Graphilte, to depict the network as you can see in the screenshot. One year after that, still based in academia, a research project called TIC Migrations, led by Don Adiminescu, a sociologist. They wanted to study how my immigrants keep a link to their home country using the web. To be able to do this study, they created the first version of Gefi, Gefi being a desktop application, coding Java to actually create maps of nodes connected by edges, a network. One year later, in 2009, they actually published a paper in the ICWSM, which is a big conference of computational social media. This paper, actually, it was a poster, as you can see in the picture. This is Mathieu Bastion, one of the lead devs of Gefi software. This paper actually won the Test of Time Award, which is an award for the most cited paper. It's not a paper, it's a poster, actually, but still. Ten years later, this paper has been very cited a lot. Sorry. Gefi, at that time, was really used a lot to create maps that you could print, or even step on, as you can see in those pictures, because it's a huge map of words, and you can actually explore it with your feet. Thank you. So, at this point, we're in 2010, and we start to see infographics on the Guardian's website, on the New York Times website, and it's kind of the beginning of the data-vis online era. And researchers and people who use Gefi want to actually share the networks online. At this point, I'm a student at the University of Technology at Compiègne, and one of the best solutions to actually do things interactively online is Flash. So, I start developing solutions for Gefi users to actually share JXF. JXF is a graph exchange format, actually. That's how Gefi exports networks. And there's various experiments on this, and they're up on Sigma, which means simple graph mapper, actually, though the weird case thing is. And in the same time, people from Gefi actually find a pretty good solution to share networks online. The idea is to generate a very large, very big picture, and to use C-Dragon, which is a technology at this point that has been replaced later by Open C-Dragon. Basically, it works like in a cartography software. You can zoom in and see better resolutions images while you zoom, etc. And this works well, but you have no interactive feature, but at least you can share things online. Also, another researcher, Raphael Welt, who is working at Erie in Paris, develops Geixf.js, which is the first full-featured application to actually zoom in a network online, have the mini-map, a search field, information about the nodes you select, etc. And it still works today, which is kind of amazing. And that's the first step into real, enhanced navigation in graphs online. Also, multiple people work on JavaScript graph rendering engines. So there's Protovis, which has been replaced by D3.js, which has become a very successful library to render all kinds of graphics online. So I started working on the first version of Sigma in JavaScript. And at this point, some people from the Oxford Internet Institute bind all of this together by developing this Sigma.js exporter, which is still kind of used like two or three years ago. There were still publications based on that. And that's it. People can successfully share the networks online after Jaffy, basically. So let's move on. So the next step is start in 2013. And it's about exploring those networks online. This only starts with another research lab, which is the Sciences Po Media Lab, in which I was part of this team, and Alexi too, actually, for many years. And starting in 2013, Alexi joined the team. And Sciences Po Media Lab was actually, the mission was actually to help social scientists to actually use digital methods, either data, but also tools. And so one of the research we've been conducting in this lab was to try to see how we can use a network math and analysis by studying pictures. As you can see on this picture, we call that the visual network analysis, where we really think that, like Tufta introduced in the statistics, how we can actually use the picture as a way to analyze your corpus. And to do that, we needed to actually go even further in developing ways to produce those pictures and to interact with it online. So we worked on SigmaJS to publish a V1, which was much more a performance. And right after, we did integrate this library into another research tool called HIFE, which is a web caller dedicated for social sciences, kind of. And as you can see, this tool was really key. SigmaJS was really key to actually embed the social network analysis right in the tool that we were producing in the lab. We also created other lots of different tools, like many lines, which was more rooted into pedagogy issues. The idea was you could create a slideshow directly with a network on it, and you could actually, each slide was defined by the way you can filter, zoom in in your network. And that was a proof of concept of how you can actually guide your audience through analyzing a network. This work has been presented for them, actually. So at this point, we have a lot of tooling to rendering graphs online, and the next step is kind of to work more on manipulating graphs online. I mean, one of the things that Jeffy does well is computing things, like page rank, centrality, clustering algorithms, and this kind of things. So the first big step in this direction is from Biomplik in the room as well, who also started working on Sigma in 2013. And he develops this JavaScript library that allows handling networks as a data structure. And within the library, there's lots of standard algorithms. So it's easy in graphology to compute a page rank, to compute various metrics, but also to render some layout algorithms, so not rendering, sorry, to compute layout algorithms, like the position of the nodes in a graph. All the things that Jeffy does well, graphology starts to mimic it, basically, in computing side. And we start thinking of Sigma as just a rendering engine that would be based on graphology, so that we don't have to take care of computing things in a rendering library. Mathieu develops graph recipes, which is yet another tool to actually generate renderings of graphs online, but this one is a bit different instead of having those applications that look like Jeffy. Here you actually script, graphology, and get an image output. And this allows having kind of crazy things like heat maps or... ...area diagrams, I don't know how to call this, or kind of circles around areas, etc. And this is really a new approach on all those tools. Also, there's Minivan, which is yet another application online to actually start exploring networks that uses graphology and Sigma, etc. And yeah, at this point... I forgot this one. And Guillaume also develops Ipy Sigma, which is a binding to display Sigma networks in Jupyter notebooks. And this has been quite popular as well. So graphology is more and more used, and yeah, at this point, really, we can do lots of things in well applications on the graph computing and rendering. And so our last step in our history starts in 2019. So as Alex just said, at that point we had a lot of different tools to do lots of different network analyses in lots of different contexts. But from there, we still don't have a generic graph network editing tool online. But from 2019 started some convergences to actually go into the directions. So actually Alex and Mathieu and Eduardo with another dev of Gefi did a presentation here in POSM trying to explain how contexts of doing network utilization from a Java desktop application or from a JavaScript web browser based one are really two different contexts where you can't hope to achieve as much precise or scaling memory management in the web browser for many different reasons. Yet right after some guys from MediaLabsence were still, so Guillaume and also Robert, did this prototype, a hidden one actually, that you probably never heard of, Nancy, which was actually mimicking Gefi editing tools. So in that tool, you start having like you load a file, you can render the layout, you can change the color, you can actually edit your map clearly. So we start this paved way to having a really editing network experience on the web browser. But at the same time, we three of us actually ended up creating Westwares, this company we are in for many different reasons. I have no time to stay here, but still we ended up creating this place where one of our main skills of this team is managing networks in web application. So we were lucky to have clients paying us to do that, to integrate all those technologies we talked about into custom web application for them. While doing so, we just made all those tools much better and better. And so for instance, still thanks to our friends from Sianxpom Media Lab, we published a new version of Sigma.js, for instance, the V2. And also sometimes to that, another researcher, Thomas Venturini, sociologist, paid us to actually create yet another tool to share with Network Online. And we call that one Retina. It's really like we try to do that with all the features that we wanted to be there to make it easy to search, to filter, to manipulate a network as a document. You don't edit the network here, you share it. But right after in the same year, we were to this event here, the GIFI Week 2022, which was actually organized by Mathieu. So it's a social event to gather the community of developers around the GIFI community, and so we joined this week and we proposed the community to bootstrap this project of mimicking GIFI on the web. So what is GIFI Lite? It's kind of GIFI on the web, but lighter. So we just like, we put ourselves together and we said, what makes GIFI GIFI? And we picked the main features like we need to be able to actually customize the appearance of my nodes and edges because it's a mapping application basically. And we need to be able to actually modify the position of my nodes because this is a mapping application again. So we picked all of those features and we kind of decided to design simpler versions of them and versions of them that we could implement in the web applications basically. And the good thing with the web application part is that it's just a full client web application. It's just some JavaScript and HTML file. It's very portable. Once your browser loads it, it just runs inside your browser. Your browser kind of becomes GIFI Lite. If you need to have custom features for yourself and fork it, you can just deploy it on your domain. And people actually already did this in DMI at Amsterdam and Cortex in Paris, both research programs kind of. And yeah, let's dive just straight into it. So this is what GIFI Lite looks like. So for people familiar with GIFI, it's kind of the same thing. I mean, the center of it is a map of my network. And I will load a network. This network is actually the network of the people, the tools, the conferences and the institutions that we discussed in this presentation. So as a GIFI user, I will start by computing things like, yeah, let's see how the nodes do group with each other. So this is a community detection algorithm. I have a new modularity class attribute in my data. So I can render it. I can go to appearance and say, okay, I want to see the colors of my nodes by modularity class. Okay, but it's still a bit messy. So I'm going to change the position of my nodes. I can start by putting them on a circle, for instance. I don't know. And then maybe run some physics algorithm that has been fully designed inside GIFI. It exists in graphology. So we put it back in GIFI Lite. Yeah, okay. First, I can just stop it. It starts to look like a map. That's nice. Also, here, so I see my communities. There's this purple community, which is about kind of us, actually, westward and Sigma. There's this media lab community. There's this original GIFI community. So, okay, that's nice, but I want to see the type of my nodes, actually. So what if I put the type as the color of my nodes? Did that work? Okay, yeah. I have all these purple that are softwares, purple nodes. Blue nodes are papers. Okay. I see the enterprises, the events. That's nice. I can also maybe set the color of my nodes. Sorry? Size of my nodes? Oh, that's big. So let's put lower things. I don't know. Okay, that works well. And at the end, since it's GIFI and it aims to actually generate a picture, I'm just going to export the file as a PNG file. Okay, I won't click on it because it's going to lose the context of the browser. I prefer to stay in it. Also, I can filter. That's interesting. I can filter on nodes. The easiest thing would be to actually display the start range and to redo the whole presentation, but here. So it all starts in 2007 with Web Atlas, Mathieu Jacomi and Fran Guittara. And yeah, one year later, they built this thing called GIFI. And one year later, I start working on weird flash applications. And then at some point, those two clusters are joined by SigmaJS and the GIFI SigmaExporter and so on and so on. Also, one interesting thing in this network is if I come here and I search for FOSDEM, I can actually see that if I select all nodes. There are 10 confrontations in the 12 or 13 past years that have been done in FOSDEM to present all those tools and all this context. So yeah, it's been a long run. Yeah, so as a wrap up, so at the end, what's all of us creating this GIFI Lite software? So the first thing is like, we were about to do that. Thanks to the very vibrant and brilliant community around the GIFI software. In this community, you can find academic people, you can find developers, some designers too, and also small experiments and prototypes that have been produced along the way helped us actually to make also the ecosystem of libraries and the thing better till we could actually at the end create this tool. Also, well, this is what I just said. Yeah, thanks to Internet Archive, actually, which helped us a lot finding traces of old forgotten prototypes. And the web is an amazing platform. It was easy, really easy for us to actually create this tool to put it online with all the web capabilities of the features we have now. And of course, as I said, many of those tools exist. Actually, all of them mainly exist because they are rooted into academic needs, also design and support. A lot of money and time come from academic projects in this story. We are a company. And so we have some customers that are from academic but also some which are not from and also customers also hire us to make also ecosystem of tools better and better. So we also benefited from having our customers paying us to make that happen. And the last tribute goes to FOSDEM as Alex said before, because this place is very important place actually to share all those ideas. And new prototypes and just emulate all those energy to all going into developing open research tools. Because when you are really looking at how we do research, you, well, I've been trained with Bonilla tool, for instance, to understand how crucial are those tools that manipulate knowledge to create or to do research. And those tools needs to be open to really assure that all the transformation of fact till knowledge is reversible and questionable. So we are very happy to provide the community with the new tools, the Giffy Light, and we hope that it will meet also needs you might have. Thank you. So we have time for questions. Do you have any questions? I actually added graphs, meaning, you know, connecting information, looking together, like even joint collaborative editing. Do you have any easier tools to allow people to collaborate online to, you know, do journalistic research or do you have any new medities or things like that? Yeah, so the question was, we talk about editing networks, but we only talked, we only demode editing the visual aspects of the visualization. Actually, this network we showed you has been created inside Giffy Light. So we can, you can create nodes. You have a graph on the here. You can create edges, of course. So you need to select a source by searching into your networks and targets. And so we actually, so we have to say that we, we did a new iteration of work on Giffy this week. And we added, we're making those addition, that addition features much better for this. Do you actually want to add a photo of the egg nodes visually and drag edges? Yeah, you can drag, you can also drag nodes. I don't know, you can't create an edge on the stage. That could be done. But for the second part of the question was, can you also do that collaboratively? So this is going to be more difficult because since it's all based in the broader, making collaborative features, we'll ask you to have a server and to have WebSocket and everything, which is a brand new thing. But you could use the interface and fork and see. We actually worked on one of the biggest projects we've worked on those last years is actually what you're describing. Like this, but with a backend and collaborative features. And it's named GraphCommons, but it's not open source. But a good example for that. Yeah, yeah, no, this collateral, yeah, Giffy Lite is just in the browser right now. The one, another question, yeah. So I really enjoyed the community mapping part of this. And I'm wondering if you could speak to metrics and measurement. How have you seen this used to analyze communities and actually pushed community development strategy using Giffy Lite? So the question was how I tried to rephrase it. In which extent can you work on community detection and with community statistics inside Giffy Lite? So the answer is for now, well, here you will find all the algorithm statistics that are available in the Graphology library, which is JavaScript based library to work on statistics on algorithm on networks. So far about community detection, we have Louvain only. But I mean, as long as we develop new algorithm in Graphology, it will be easy to port it inside this tool. Yeah, it's kind of a modular way to take it. Yeah. I was wondering what has been your publication strategy to support this work? Particularly, do you target graph conferences? Is it more successful to target graph conferences or social sciences conferences? That's a good question. So the question was what was the strategy for communicating around the tool? Actually, we don't have a strategy. So basically, you see that actually the Giffy desktop software was presented at ICWSM, which is definitely a social network analysis discipline, kind of. We are not from academia and for them is one of our place of choice to communicate about what we do. But yeah, so I'm a bit short, actually, to answer you. Last question? Have you looked at new renderers for Graphology? And also, in this slide, the addition to all that things like how and the numbers of such, do you consider including those in the range of them? You can send the pull request. There are other developers who have developed other rendering algorithms around Graphology, outside of Sigma to render more complex things such as hit maps or halos around the nodes, et cetera. Benjamin was asking if we are planning to actually integrate those features inside Jaffy Lite. One answer is that contrary to Jaffy Lite, which shows that what you see here is actually the preview of the export, yet you will do it as an image. So right now, no. It has to be done in Sigma to be rendered in Jaffy Lite. Thank you very much.
Cosma, a visualization tool for network synthesis
Okay, good morning everyone. My name is Arthur. I'm an assistant professor in Lyon, France. I'm here today to acquaint you with a little program called COSMA. I'm going to present the design choices behind it, touching on mostly two points. It's going to be a short presentation. The architecture of the program, which may interest you if you're working on interactive publications. And the features, which may interest you if you're a scientist or working with scientific data, and you have information management needs. So that should be every scientist. I'm presenting on behalf of the team, and first and foremost, my developer colleague, Guillaume, who is not here because he's on hiatus for a very happy family reason. And also my senior researcher colleagues who have a lot of knowledge and research colleagues who have advised us on the design of the application since the beginning. Okay, COSMA came as part of a research program on Paul Hôtelé. I'm very happy to be mentioning Hôtelé here because he was born and died in Brussels. He was a famous Belgian figure. He was a pioneer of knowledge organization. He's recognized today as a precursor to information science. He was a pacifist, an internationalist, a feminist. He had also some flaws. He was a utopian. He had some sometimes a bit dated views on topics, but he's a very interesting figure. He's the one who popularized the word documentation, so that's that. His main idea, Hôtelé, was to go beyond the book. What he wanted to do was extract all facts from publications and sort of organize them into universal encyclopedia. The idea was that universal access to knowledge would bring peace. That was the utopia. He worked all his life on tools to achieve this, including bibliography, classification schemes, index cards, and so on. There's a museum dedicated to him in Belgium, in Mons, so if you're in Belgium for a few days, I encourage you to go and visit it. So in 2018-2019, we worked on a map of Hôtelé's professional network. It was our take on an idea that had been done before, which was to combine a graph view and also a card view, so like a little index card with metadata, but the note that you're currently selecting in the graph. And one day I asked Leone, can you make that for my research notes? Because at the time I was accumulating files that looked a bit like this, a bunch of plain text files with notes on specific things. These aren't actually my notes. I just borrowed Andy Matousiak's note for this presentation. Andy Matousiak is a researcher who's working on tools for thought, non-linear writing, etc. The idea is that you have files which reference each other with links, internal links, just like in a wiki, double brackets around the title or an identifier, if you prefer using an identifier. And so what Guillaume made is that he designed a prototype which became eventually COSMA, which renders these files into an HTML file. So yet another graph application, after all the graph applications that we've seen in the previous presentation about Giffy Lite. So this is an HTML file which contains a graph view. The rendering of each file in HTML and also a few navigational tools, an index, a small search box, etc. This could be anything. It could be any kind of knowledge base. It could be a glossary of terms. It could be a network of people, of concepts, of events. Really it doesn't matter. It's like a commonplace book or wiki or a zettelkasten, if you're familiar with that word. Even a mind map to some extent. Conceptually it's a bit like that. What distinguishes COSMA is that we have, well, the architecture and the fact that we designed it around scientific writing needs. So I'm going to describe briefly the architecture point and then I'll describe the features a little bit more. So it's purely a visualization program. You cannot edit data with it. It just reads plain text files. And most of the features are actually located in the exports. So this is actually COSMA. It's a command line application. And you use it to generate these HTML files. If you're familiar with Tiddly wiki, it's a bit like that. So it's a single HTML file which contains everything except Tiddly wiki. You can edit the data. This is read only. So it's less like a web application and more like a sort of augmented document. You can share this file, obviously. It's just an HTML file and people can open it in their browser. And the idea was that I was familiar with software like Jaffee and I always wanted to be able to share graph visualizations with colleagues or students, but not as static images, but as interactive things. And there are lots more options now that exist to do this. We just did this for little markdown files. So that's the brief point about architecture. The features, as I mentioned, they're related to information management needs. Everything is designed to encourage knowledge organizations. So categorizing things, classifying, indexing, tagging, relating things to one another. It's basically a memory aid, actually. It's not for graph analysis. It's more for network synthesis, so to assemble document graphs about things. And the way it encourages knowledge organization is to provide a few features that reward this knowledge work. So, for example, if you assign types to your notes, colors will appear and you will have filters to modify the display. So you can toggle, for example, one type. Here I've toggled the inside type, which was in orange. And it also and mostly encourages link-based knowledge organizations. So that's using links in the way you're describing the relations with things. And the way it rewards that is to provide contextualized backlinks. So that's the thing that's at the bottom right here. These are the incoming links. So you see here where this note has been cited, and most importantly, how it's been cited because you have the context, the surrounding paragraph that's here. So that's a contextualized backlinks. Not an idea that we invented, we just borrowed it from actually web pioneers. It's been going around for a long time. And in recent years, there's been a wave of tools for thought text editors in which you can create little notes, link them together, organize them, and they pretty much all have this feature. We just wanted a way to have it for scientific writing and also to be able to share it easily. Now the big thing that we did is we have the same feature, but for citations. So if you're working with bibliographic data, so you have maybe a raw JSON file, maybe more likely you're working with a reference manager like Zotero, N-node, Mandalay, etc. And if you're in your notes, your citing works. So for example, here on the right, I'm quoting the two references that you can see are stored in the file on the left. Well, Cosma will generate a bibliographic note, that's the dark gray one here. I haven't created a text file for this note, it's been generated automatically. And most importantly, it will show me the backlinks as well for the citations. So I can see where I've been citing which work and how in which context. I want to close on the idea of network synthesis very quickly in my dissertation. What I argue is that linking the simple act of relating two things to one another in hypertext, it's a knowledge organization process. So that expression is actually a thing in knowledge organization literature. It's classifying, indexing, tagging, basically any process that you do that organizes knowledge. And linking is a way to do a lot of that. Linking could be a way to index, to classify, to tag, to assign things to others. And most importantly to compose with links you can express new ideas, just like Lego. If you have a note on a concept and a note on another concept, and you just bring the two together in a sentence, this relates to that because this, this becomes a new idea, you express it in a new note, and that's ideation, the basic process of research. I'm going to skip very quickly all these examples that I had added because it'd be fun if there were some time for questions. And just to say that this process of synthesizing knowledge, this is why I titled the presentation Tool for Network Synthesis. Obviously in the process of research the first step is analysis. You start with an object, a phenomenon, and you start, and you try to decompose it to see the fundamental building blocks. But the goal is to take those fundamental units and sort of mash them together again to produce new things. And this tool is just that, it's just a tool to help with this process of knowledge synthesis, which is to assemble and expand over time these little document graphs. I'm saying document graphs because there's the expression knowledge graph. Knowledge graph is usually a set of descriptions in a database, and these are just little documents, so hence the word document graph. Right, I'm going to hand here, and if you have any questions. Thank you. Do we have any questions in the room? We have four minutes. A question about using graph-based and markdown-based, I don't think it improves the blocks or the accident. So the question was can we use this application to visualize nodes that would have been created with applications such as Obsidian or Luxik. A colleague actually wrote a little Obsidian to Cosma Converter because we have a data format which is close but not quite the same as Obsidian. Obviously you have to have a YAML header, the links have to be a certain way, etc. So there is a converter for, if you have nodes written with Obsidian, there's a converter out there to transform them into the format. I don't know that there's such a thing for Luxik. It's possible because it's just plain text, markdown, YAML, it's very easy to write, I think, a custom parser and convert it. Do you have time for one more question? Thanks for an interesting presentation. At all I'd really like to use in combination with Obsidian. I was wondering about the format of the nodes. You mentioned Zetl-Caston, which has a specific format and way of linking. There's permanent nodes, there's every node. Could you elaborate a bit on that, on what type of nodes would work well in this, not a synthesis, a way that you would use? Yeah, a repeating question. What type of format would be ideal to work with Cosma since there are many formats out there at Zetl-Caston? The type of nodes. Oh, the type of nodes. Atomic nodes. I've shown Andy Matushek's notes, he writes a lot about evergreen nodes and the principles behind evergreen nodes, things should be atomic, densely linked, and the titles of the nodes should describe one thing and maybe work almost like APIs. It could be a sentence that describes the idea. So that's the best sort of mental model. It's less suited for a daily log, for instance, than for a sort of conceptual knowledge base, again, where you try to relate events, concepts, people, etc. I hope I was clear. Thank you so much.
From the lab to Jupyter : a brief history of computational notebooks from a STS perspective
Hi guys. So no demo for me. I'm just here for some food for thoughts. And I will talk as a social scientist about a specific case. What I want to do, I have very little time, so I will move very fast, is to make two things. First, a very, very short history of Jupyter's notebooks. And then, sort of a plea for better knowledge of the way scientific software are made and their history. Because I think it takes a lot in our area. The question is, and my starting point is, where are our stories of scientific software right now? I mean outside the specific events and globally in the main scientific area. Because software won't say a lot about these everywhere. And they ran from bespoke and code to international stars. So we are software's every-round research, but very little stories of how they have been made and how they evolved. And social sciences rarely looked at those software. And when they look at them, they show there are very specific dynamics going on. Research of software are open indeed. They are looking for uncertain ends. Researchers are usually known as specific developers, and there are very specific funding constraints on how software are developed. And these are specific consequences of the way those specific kinds of software evolved. The code can have some brittlessness. There is a lot of intertwinement with scientific activity. And it led some researchers to become specialized in software engineering and developing software. And it led to a lot of specific journalist of friend. We have seen one with J.F.E. Light just at the beginning of this day of the room. So I want to take a step back. And because there is a lot of open question about that. First, how can we tell the stories of our scientific software and how social sciences can tell stories of scientific software? Because there are different journeys, especially in open source. And there are different steps in the history of each scientific software. Sometimes it stops, sometimes it continues for years and years. And on a broader level, there is much intertwinement between open source and academia. And especially, what are the links between open source and science? And how the connection is made between academics and software engineers? And just to quote Christopher Calti in two bits about UNIX. In fact, the UNIX spread first to university, computer science departments, and not to business government or not government organizations. And then that it also became part of the co-pedetrical practices of generation of programmers, computer scientists. So there is something connected between open source and open science. And I want in my very little time, but I had to work with a specific case, which is the case of Jupiter's notebooks. And to say it in one sentence, innovation, it is an innovation going from research to become a worldwide infrastructure of data science. It was released, notebooks were released in 2011, 2012, and spread everywhere. And they won the ACM award in 2017. And it is the perfect viewpoint to see how a scientist of course emerged, how he progressively get more and more abstract from this starting point in the laboratory, and diffuse within and outside academia. If you want a long version French history, there is a paper in Al, but I will keep it very short. I'm not here to advocate about Jupiter notebooks. I use them, I love them, but I won't try to convince you. And I'm quite sure there is a lot of people against them around here. And if you are not against them, but you want to see why people are against them, just have a look to the dry-gross talk. But I'm making some sure that you know approximately what Jupiter's notebooks are, because I have no time to discuss about them now. What I just want to say is a very quick story. It is first a PhD student, then a specific script, which is a Python, then notebooks appeared, and finally we got Jupiter, as we know currently, which is basically an infrastructure for interactive data science with different kinds of languages. And you can see this evolution with the Python Dev mainly released, with the progressive emergence of notebooks around 2010, and the appearance of Jupiter. I just go back on those different steps. So let's dive in this history. The important part is to have the context of the early 20, or the term of the millennium. And we are at a moment where we had a lot of achievement with the free software movements and open source development. And there is around the laboratories, paradigm of literal programming, from the next move. And for people coming from computational science or mathematics, there are a lot of proprietary open software specialized for interactivity with programming like MAPL, Mathematica or MATLAB. And at this moment, there are also the beginning of the scientific Python community, which just is starting to develop with the first SciPy workshop organized in 2002, in 2002 in Austin, Texas, especially in the south. And in this context, Fernando Perez was at the beginnings of Python and then Jupiter, was a PhD student in his fourth year, tried to finish dissertation and wanted to move from proprietary software to open source and Python and need something more interactive to do his work. And the script which will become Python was a simple personal fix for the problem of his own workflow and was really grounded in his common sense as a researcher in physics and computational science. So he wanted something to make sense, programming with interactivity. And this was the idea, the value inside this moment that will unfold in a job. In this basic case, the SciPy community, so the scientific Python community was quite an amplifier and there was a very quick reception, and to the secret reception by this community and the company which backed SciPy and thought posted IPytranslations on their web page. And they get a lot of support from this community, think back and contributors, and quickly after this start, other contributors joined the projects, especially Brian Kanger, who jumped in 24. And they managed to secure financial possibility to continue and it was attained with post-doctoral grants that fellow Peerers get at Colorado Bolger and then thanks to the support of a team in Berkeley which joined in 2008. So the fact is, IPytern is something really well grounded in academia and SciPy community. If you look to the main contributor of IPytern, almost everyone was a PhD, some of them are in a position even later after the emergence of the software. And notebooks in this context were just a feature which appeared later of IPytern. And because 2004 and 2011, the project developed, a lot of support was given by the Python community and there is a lot of features and tried multiple times to add a notebook feature because it was something already here in other software. There are five missed attempts before they were able to make a first fabled version of notebooks because some technology, especially for browser, was not available. So in 2011, 2012, a new release of IPytern included IPytern notebooks. It was the beginning of the history of Jupyter and it works pretty well because it was really quickly adopted by the SciPy community while outside the first specialty frontiers of the developers of IPytern. And in 2021, Nature can say that IPytern notebooks are one of the ten codes that are making science, sort of a huge thing inside the SciPy community. But progressively, the notebooks became something more important and they led to abstraction of what a notebook is and the way researchers are using programming in their work. And there are two dynamics. The first, it was a movement of abstraction out of the Python community and on the other one, it was strengthening of the practices in the project of software engineering. And this allowed the project to make a split and to move from a very specific IPytern tool to something more general, more abstract, which became the Jupyter project and was backed with six million dollar grants of foundations that support open science. So it was a huge move because it led to refactoring the code, change the philosophy, reconstruct your latest with the whole project and there was a lot of money involved because it needed a lot of, you know, hiring of software engineering to do so. So at this point, Jupyter became something which escaped the academic world and had a worldwide option. Notebooks became standard of data science and they were integrated a lot of services like, you know, Google collab or use in third party, you know, tools already existing like the regular studio code. So it was, you know, a turning point in the way this initially scientific project became something way bigger than scientific community. And somehow I would stop here because it opened a lot of questions. Of course, for the research community, the question are what the current users of scientific, of competition on the books, what kind of work are they doing? How does it make the way we are programming change? But at this point, the question I want to carry here is does Jupyter project or software are still scientific software? And so how does something which was created inside within the scientific community is starting to get another dimension and to be something bigger or no more, you know, a research tool. So just to rub up because I am going to the end of this presentation, I want to stand for more historical documentation, not only documentation of code, but historical documentation of how those specific software genres are associated with scientific specialties, institutional background, funding possibility. And we need to take this specific dynamic seriously because of course, for competition on the books as we are trying to do with other colleagues in different projects, and there is a GitHub repo if you want to add some archive in the story, but also for all the other tools that are inside our laboratories, inside our daily routine of scientists, because they are a huge part of the way we are crafting knowledge and they don't have the same history than other more material, you know, artifacts and scientific instruments as the discops or particle accelerators. So it's my point, I finish here, thank you. Sorry for the speed. How can we define scientific software? Very neat question. Can I and how can we define what is scientific software? I think the only way I can answer that is that software crafted within the context of scientific research at some point and that builds not for making, you know, a complete tool but for answering specific research question at some point in the advancement of knowledge. And usually there is a national literature about the way that scientific software are really different like that don't take really seriously into account at least at the beginning, versioning, test units, they are quite squirming the good practices of software engineering. At least at the beginning and then if the software is still around a few years after and gain more users, it started to integrate those good practices. So somehow there are two universe but more organizational and social universe different and I would say scientific software defined by the
Prompt Compass: A Methodological Approach to Evaluating the Use of Large Language Models in SSH research
Good morning everybody. I am Eric Borra. I am an assistant professor in journalism and new media at the University of Amsterdam. I have a background in artificial intelligence and I have been a tool maker with the Digital Medicine Initiative for about two decades now. And one of the reasons why I make tools is to understand new technologies. And today I will talk about large language models and particularly their use in social science and humanities research. So who of you has used chatGPT? I think most of the Western world actually has used chatGPT in the last year. ChatGPT is based on a large language model where you ask a question and get an answer. But social science and humanities researchers have also found that you can just send it instructions. And you can instruct it to do all kinds of things such as sentiment analysis. So you prompt. It's the way of interacting with large language models. You specify a prompt and you say classify this tweet. You input a tweet and it will give you a nice classification of that tweet. You can also extract entities and actors. You can include topics and teams. So you have the prompt I just showed. You enter a New York Times article and it will extract all these things for you. It will extract country names, organization names, people names, specific teams and topics. And it's actually pretty good at this. At least chatGPT is. And I'll discuss other models in a second. And it's not just named entity extraction or simple classification. It also, like, somehow extracts teams and it can abstract from the text. And researchers have been using this, for example, to extract narratives from posts. So here's a very complex prompt, which you can see in the slides after the talk. But there they use this prompt to go through many, many Reddit posts, hundreds of thousands, to find out whether there were any conspiracy theories in there. And they devised a prompt to draw out these narratives. And they actually found that LLMs worked really well in extracting conspiracy theories as well. So researchers have been using this and they have been looking at all the tasks that are typically done in social science and humanities and are starting to test whether LLMs can help in doing these tasks. And there has been a lot of research in the last year, especially 2023. This is just a really small snippet of this research. But this research also comes with problems, which I'll touch upon in a bit. But their use is also understandable, the use of large language models in social science and humanities research, because they seem to ease and speed up previously difficult and laborious tasks, such as classification, extraction summarization, and so forth. And they're actually employed as junior research assistants. Now, while this may seem useful somehow, a lot of people seem to be using chat GPT. Actually, all the papers I've just shown, they're based on chat GPT. And chat GPT comes with problems, because it's a platform service. And platform services, as I guess all of you know, are volatile black boxes. You don't know what's going on in the back end when they update their model, when they align something differently or sensor or whatever, whether it's getting dumber or not. You basically don't know. Chat GPT is also very expensive. If you're using the API, you pay for a request. This is one research project by Miguel Escobar of Arela, who calculated that to process the one and a half million news items he had in his corpus. He'd need $150,000. It's just too expensive. There are, of course, also privacy concerns with chat GPT and other platform models, whereby with chat GPT we know that whatever you input into chat GPT is also used as training data for the LLM. Users have also found personal and private information resurfacing from other users, etc. So if you think in terms of open science, replicable research and ethical research with privacy concerns, you basically cannot use these models, even though you can go to privacy.openai.com and state that you don't want your inputs to be used as training data. Well, so how to deal with this? Can we use LLMs in social science and humanities research? Fortunately, the answer is yes. Chat GPT is not the only model available. You probably heard of Google Bart or Gemini. You may have heard of Clodes, which are other platform models, but there are also a lot of publicly available models. All the yellow ones highlighted here, and this is only, I think, until the second quarter of 2023. Since then, a lot of new models have appeared, most notably Mistral, for example, the French model, or Mistral, the 8x7B model of Mistral, which are really good and are catching up on the performance of chat GPT. Publicly available, however, doesn't mean that it's open, that it's open source, that it's free, because there is this whole infrastructure and apparatus to train models, to fine tune models, to use models in your own work. And you see all the orange and red here. Most of these models aren't open, or have different licenses, etc. So it's not, yeah, you can't just use another one. You need to think about these things. Two other considerations before I go to the actual tool. If you use the same prompt in different models, you'll get different results. And this is actually the same prompt in a series of image models, but you can visually see how results may differ. This is something to take into account. And last but not least, there are a few technical parameters in using LLMs. And one way to control differences or variability in the output of LLMs is through the so-called temperature parameter. If you set the temperature to zero, you'll always get the same result. The most probable or most likely outcome was if you increase the temperature, there is a chance of less likely outputs to be included in the results. But again, all these papers, none of them mention temperature, whilst it's a very important parameter. Last but not least, this is work I've been doing with Maichieu, small syntactic differences in a query that's semantically the same may lead to different results as well. So you need to test your prompt for robustness or consistency. So summarizing, open AI or platforms like chat GPT are volatile black boxes that cost a lot of money. There are issues of privacy and security. There are different models which have different licenses, which have different results. LLMs are not deterministic and small changes in prompts may lead to different outputs. So we need research interfaces where we can control for such things. We want to be able to do open science with LLMs. So how do we take into account the volatility of platforms, the robustness of research and its replicability and explainability? And this is where I started tinkering with a tool I called prompt compass, or actually I had chat GPT call it prompt compass. And it's a research interface. It's not a chat interface. It allows you to take into account all these considerations that I've put up. So you can choose various local models. It has default parameters for replicability. It contains a library of research prompts, allows for batch processing user input, and allows you to evaluate prompt model combinations. Do I still have some time to demo this? Cool. Let's do that. So prompt compass is available on GitHub. We also run it at one of our servers. The design doesn't really shine on this beamer, but anyway. Here you can select various models which are loaded from hugging face. You can easily add a new model and select one of these. You can find out more by clicking on the model card and then see what the model was made for, how it was trained and so forth. All these models are loaded from hugging face, which is like the GitHub for language models, but we can also choose GPT for where you and then or any other model of open AI or platforms and then enter your API key and go over that. You can go into the settings which are default sets to replicability. There's a little explanation of it. There are a lot of prompts extracted from the literature and from actual research. And you can input your own prompts like this. You can or you can adapt existing prompts. And then you can provide user input either line by line or upload a CSV. And then if you click submit, the selected model will be loaded. And each of the lines will be run through the model with the indicated prompt. So in this time we chose sentiment analysis, which says you're in advanced classifying AI, you're tasked with classifying the sentiment of a text, which can be either positive, negative or neutral. And this is where we'll input or loop over our inputs. So in this time the user is happy, it's classified as positive. When user is just a user, it's classified as neutral and the other user is a liar, it's classified as negative. And this tool is not the end all go all tool for working with LLMs, but it is a way to test models, to test parameters, to test prompts, to test the robustness of prompts and to get all this into easily digestible outputs CSV. So far for the demo. The technology is used, it's really simple. I'm not like a hardcore coder, but I'm like more of a tie some stuff together coder. Streamlit is a Python interface for making easily making web applications of machine learning tasks. Lang chain is a very bloated way to easily connect LLMs and to work with LLMs and prompts and Huckingface is the place where all these LLMs are stored. We run this on a 24 gigabyte GPU, which is a bit expensive, but it's not very expensive. Like each research group should be able to get one. And I mean, yeah, so to get back to my rent against platforms, making LLMs locally accessible makes them stable and replicable. But we cannot run the biggest models unless we have access to bigger infrastructure, which we sometimes have. But this is really meant for researchers that want easy access to local models. I made a video tutorial, tutorial which you may want to watch. And maybe there's still room for questions. Just three minutes. Atlas TI is a rather big and well-known software package for qualitative coding, right? I'm not sure why they chose to only use Chatchi PT. But yeah, I mean, we've had experience with local LLMs that you can actually also do similar things with extraction and coding. So I would definitely be in favor of actually using local LLMs. On the other hand, if you have proper validation procedures such as intercoder reliability and F1 scores, etc., you can get a long way with Chatchi PT because human coders are also fallible and may also be different today than they were a few weeks ago. So it's not that it's not possible or not usable at all, but you should be prudent, I think. You said it's mostly for testing the models, but how big of an input file do you think? We've run this on more than 100,000 lines of CSV, I think even more. So in the digital methods winter school and past summer school, we actually run a lot of prompts through it. And it seemed to work. Sorry. It was asked how big of CSV files it could handle, and I answered more than 100,000. So it's actually also used in production for relatively small-scale qualitative research, but it's not limited to things you could do manually anyway. So let's switch.
UB App: Using Design Justice to involve marginalised communities and urban planners in co-designing a new photovoice tool for citizen engagement
Thank you. Yes, sir. As already mentioned, my name is Sophie and I work at the Techno Anthropology Lab in Copenhagen. I'm here to share with you an app that me and colleagues and local urban planners and local marginalized communities in Copenhagen have developed. It's called the Urban Belonging app and of course it's open source. But the reason why I'm really here to talk about it is because I think the process of how we made the app is really interesting. So, okay. First of all, PhotoVoice is good to know what that even is and this is kind of a participatory action research methodology that has been developed especially to find methods and ways of evolving illiterate communities and has been developed through kind of 60s and 70s and in the 90s it was really kind of stabilized as a method with a lot of kind of frameworks and publications coming out. That frames this as a method to involve local communities in kind of contributing their perspectives and their experiences on particular issues. And it often also involves not just kind of capturing photos and giving cameras to local residents and communities but also involving those communities in kind of selecting and highlighting and doing storytelling with those photos. So kind of a longer participatory process. And to situate kind of where we developed this app, this was a development that came out of a project called the Urban Belonging project, hence the name of the app. And this was really a project that set out to kind of map marginalized perspectives on the city. So it was rooted in kind of urban planning context and asked, you know, how can we involve different marginalized communities to shed light on different perspectives on the city? And in the process we need new citizen engagement tools that kind of re-tool citizen science and maybe challenge some of the problems that we've seen in the field. So the project involves a range of different research partners and urban planning partners both in Copenhagen and in Amsterdam. But most importantly for what I'll say today is that it involved these different community partners that represented the communities we wanted to in the end use the app. So we had LGBT plus people, people with physical disabilities, homeless folks, ethnic minorities, people with mental vulnerabilities, deaf people and also international expats. And so what we wanted to do in the project was to involve these groups in mapping their experiences of belonging in the city. And we found out that it was really important to kind of involve the groups from the beginning and even framing and designing the project. And we used, as I'll get back to, their inputs to frame kind of the specs and the design of the photo voice app that we then created. And then the communities went out and used that to kind of collect different data points about the city, which was then kind of interpreted in these community led workshops and it resulted all in kind of different exhibitions and co-creation of new frameworks of what does it mean to design cities for belonging. So that's kind of the overall project. But if we hone in on the development of the app, which is really what I'm here to talk about, as I already kind of signified in the title, the project really takes inspiration in two things. So on one hand data feminism and on the other hand design justice. And so design justice, it kind of tells us that even if we're really well-meaning as designers, you know, we kind of, we often end up having these abstract ideas or ways of representing community needs. And most importantly, most of the kind of strategies involved creating these abstractions where the communities are not at the table of the design process. So we think that we're being kind of inclusive and so on. But at the end of the day, if the communities that we assume will use a technology are not at the table themselves, they're not fully represented. And so Costantel Schaak tells us that, you know, involving communities and community members is kind of the most important thing to do, not just because of justice and kind of fairness, but also because community members will essentially have knowledge and experiences and perspectives that we cannot imagine as a kind of homogeneous group of designers. So this is kind of design justice. And then on the other hand, we have, or supplementary way, we have data feminism, which tells us that data infrastructures often are, you know, kind of reproducing or enlargening the biases and the hierarchies that already exist in society, especially when it's a homogeneous group of privileged people that sit and do the technology design and design of data infrastructures. So data can be used and often is used to kind of oppress and exclude and extract. But of course, we can also think of those same infrastructures and practices as opportunities of kind of liberation and co-creation. So with these kind of jumping off points, we went into a design process for our app that really involved a lot of different stakeholders. And as you can kind of see in this visualization, that becomes very messy. So it's not so linear and easy and predictable as we maybe would often like, but also it yields really interesting results. So as signified in the top corner here, we had a research team in Copenhagen, STS, digital methods researchers, we had visual methodologies researchers from Amsterdam. We also had kind of a starting point which was that we actually built on an open source app that even existed to begin with, which of course then had built in epistemologies and so on that we needed to kind of work with. But then we involved kind of planning practitioners, we involved citizen engagement professionals, so the people in Copenhagen that work with citizen engagement every day. And then importantly also the community organizations themselves, which are kind of the red dots. And I think even mapping and being transparent about when do different kind of actors have an input in such a process can be an important step of kind of situating our tool and being transparent about how it's made. But to kind of give you a few deep dives into then well what does the tool do and how was it shaped in its kind of features by these different actors. So first of all kind of the research team of course started with talking to urban planners and citizen engagement professionals about why are we not using photo voice more often when we want to involve a local perspective in urban planning. And certain kind of limitations arise both from these practitioners but also from the photo voice literature. And one is that you know photo voice often collects a really small number of photos and have a pretty limited also sample of participants. And that's because the way it's done right now is really kind of hands on that a researcher will send emails out to different participants and ask them to go and use the camera in their smartphone to take pictures and then send them via WhatsApp or email or something to the researcher who then has to sit there and kind of keep track on the different participants and the different images that are coming in and they come in in very unstructured formats as well. So that leads to the second limitation which is that photo voice data tends to be really unstructured. You can imagine that if you just ask people to send you photos in a text or on email it will be really different what different participants do and how much kind of captions and annotations of those images they send. So there's kind of an issue there that makes it very difficult to actually kind of scale and to use as a stable method. Then data is also not so safely stored when we ask people to just kind of send these images in you know text messages or whatever. Then because of this kind of handheld approach it also means that most projects they tend to focus on one marginalized group because it's really difficult to manage a lot of different users within different kind of communities and so on. And that is of course a huge limitation if we want to kind of bring different groups together around discussing issues at an kind of urban scale. And then finally because it's again pretty handheld the way it has been done so far it also means that researchers tend to tell the participants where they should go to capture photos because you need to kind of know geographically where the data is coming from and since you're not really tracking that at a systematic scale in terms of geocoding or geotacking the result is often that you're kind of already framing where it matters to go and look at a certain phenomena. That's of course kind of something we could tweak or rethink. So the app that we've developed has these key features so it enables kind of a structured collection of photos where users can be invited into different groups and tasks and are given a certain prompt that of course can be customized and then that means that we know in response to what is it that we're actually getting photos collected. Then the app also geotracks where the photos are taken and the walks that people are kind of going on while capturing these photos. And then most interestingly I think for kind of quality quantitative research purposes is that we can ask participants to annotate their images when they collect them on the spot. So we have no issue with kind of recall problems or memory problems and we get this structured kind of metadata on our images. And then to add an even other layer we also can ask participants to react to each other's photos. And so that means that we can start to get this really kind of rich data where we know how the person who's an author of a photo feels about it but also how other people that look at that same photo feel about it. That of course invites as you can imagine network analysis and all these different kind of analytical opportunities open up. So just to show you kind of an example of what that looks like this is the very simple interface where you can either go for a walk or you can just take a single picture then you start a task and you kind of are prompted with whatever is the prompt defined for the particular group. And here I'm on university Wi-Fi so the little devil wheel is spinning out of control but then when you take the photo what happens is that you're moved into this annotation space. So you're asked to also say something about the photo and this can be customized of course project by project what are the interesting kind of metadata points to collect. You can have kind of scale questions different categories you can have open answer other categories and so on. And then you can keep going and you can do this either as a single task you can go on these kind of walks where you you maybe go for half an hour in an area and you collect multiple photos and then we both track the individual photos but also the way you move through the city or the space to train. Okay so those are kind of the core features but then as I have already mentioned the important thing here was that the communities that we wanted to use the app were involved in kind of designing the specs for it. So some of the things that we learned in that process I'll just give you a few examples and I mean the list is really exhaustive but just kind of to give you some pointers of what this can offer. We had a representative from the homeless group say that you know while homeless people are used to using smartphones and taking pictures they're really not kind of trained in the Instagram like aesthetics and visual hierarchies of how to take certain pictures that look in certain kind of way and so they were really worried that if we invited multiple groups to deliver photos to this project these kind of visual hierarchies might mean that certain perspectives are foregrounded more than others if you don't have kind of visual aesthetic skills in framing and taking photos in a certain way. And a similar thing was highlighted by the Danish Deaf Association representative who said that you know the deaf community is a highly visual culture but they were worried that if you want people to kind of document the challenges and the negative things in the city it's really important that people don't feel a pressure to take photos in a certain kind of way. And so these kind of inputs informed decisions to not have any editing or filtering options for instance in the app so when you take photos on the spot and you cannot edit or change them so the light is the light and the way it looks is the way it looks and we also had decisions to kind of standardize the photos to a square format so that there's not like differences between what kinds of phones people have and what are the ratios of the photos and so on and how good are people at framing. We also kind of left out the option to upload images from the library so this was a kind of purposeful choice to make it to make sure that people had to be kind of in situ in the city capture things as they are when they see them and avoid these visual hierarchies that some people can take a million photos and go home and edit it and then kind of upload the good ones. This has since changed because we've developed the app of course to be able to also do other things so now you can indeed upload from a library but another kind of thing that came up from all these different communities was a course a lot of issues with trust so they have been involved many times by the city this is not only the case in Copenhagen but everywhere in the world these types of communities are so tired of being involved in kind of false processes that don't give them any control over the data that they contribute and where they are not really sure kind of what the data is being used for and if they wanted to go and kind of delete the data or pull it out of a data set that's really kind of difficult for them. So we also designed the app to have kind of different features that would support trust with the community so one is that participants are anonymous in the app so you cannot kind of know who has taken which kinds of photos and that's of course really important as it means that it is in workshops or other kind of formats that participants can choose to say that they have taken certain photos and unfold them and they can also choose to not do that. We have an option to invite people with email to the app but we also have an option to just pre-generate kind of anonymous user IDs and invite people via that. We give participants control over when and how the tracking the geotracking in the app starts and ends and we give participants options to always see the data that they've collected and contributed and to always delete it directly within the app. So these are just kind of some of the features that came from this involvement but I think they're really important in in securing kind of trust with the communities. So I'll just show you a few kind of examples of what are the kind of analytical affordances then that this type of data collection opens for. So using the example of the urban belonging project we had more than a hundred walks by 33 different participants and we had 1400 and some photos collected and of course as I've said you can geolocate them and see where they are in the city and how people move. You have the images themselves which can be analyzed in a number of ways. You can use AI but of course you can also use the annotations that people have created themselves. We even have the reactions that I mentioned so this map for instance shows distributions of photos with the sentiment in the middle of the dot coming from the author themselves but then also other people's sentiments or how they feel when they saw the photo and reacted to it within the app. This is kind of shown as a circle around and this of course is really interesting in workshops as you can bring up kind of more contested photos and say well why why are we disagreeing or agreeing on something. We can open kind of geospatial types of analysis and see well where are different people taking different kinds of photos and we can even zoom in on kind of the very local level and look at this kind of route data and see well how are different people moving in and out of a space. So yeah we're seeing this being used for really interesting kind of analytical cases. These annotations that people are giving to the photos also helps to zoom in on certain issues. For instance we could see in our project that photos of urban nature which is what people have said their photos are about predominantly were associated with a positive sentiment and so that can kind of inform you know zooming in and filtering your data and looking specifically at photos of nature if that is what is interesting in your project. So we've done kind of this development of the app for this project but then since then we have kind of open sourced it and kept developing a whole toolkit around it to make it as accessible for as many people as possible. So we now have kind of a web-based interface where you can go and manage groups and set up the tasks and do different translations of the tasks again to make it accessible. There's an easy overview and export of the data for people that are not programmers and this is of course really important to make sure that local NGOs, architects, designers, all kinds of groups without kind of data science expertise can also use it. And then we have this kind of interactive data dashboard that is just easy to generate and so this is an example of what that looks like and every kind of project that uses the app can just kind of easily generate this kind of interactive way of publishing your results or your data set. So with all of this said we've been really kind of happy to see that the tool which is hosted at University of Olbo but it's open source it has been kind of used in different ways so we have a lot of people who are using kind of our version at the university but we also have different companies who have grabbed the code and set up their own version of this app which is of course what we wanted and so we've seen really kind of interesting use cases. Gale Architects who were also a partner in this project they've set up their own version of the app and are running all kinds of different projects with that. We have had an early use case of the DVSA which is an NGO in Seattle use it to map environmental justice issues, artworkers with disabilities is another kind of NGO with different researchers using it so cultural heritage project in Italy and currently the University of Copenhagen is using it to map perceptions of nature and we also have kind of municipalities and policymakers use it in relation to urban development. So this is really positive but I'm also faced with a few questions that I can share with you as I round off. So one is well what kind of community-based financial model can we set up to secure the kind of continuous maintenance and development especially as we want to host a version of this app at the university that can be used by NGOs and so on that don't have resources to set up their own app. How do we kind of continue to involve marginalized communities and perspectives in how the app is tweaked and developed from here and then finally kind of how do we offer up this tool while warning against techno solutionism so because the tool is developed as an inclusive tool we've also seen some bad use cases where it has just kind of been used without much thought about methodology or anything else and it actually ends up really disappointing the communities that it's supposed to serve. So those are some of my kind of questions that I'm struggling with right now and that we are working to figure out. Yes. Great we have a lot of time for questions around maybe seven minutes. Yeah hi thank you. What came to mind was this whole kind of gift exchange for. Sorry what? Gift exchange. Oh yeah. Is it then data for use of the system or to go back to answer your questions can it be that we'll receive vouchers or some kind of I don't know events or food source or something something else that they the participants get in return for. For participating. Yes exactly. Yes like a larger framework that could be incorporated into the system. Yeah. So that was my question like where does the data go. Repeat the question. Well I hope that I understand it correctly but I think you're asking how can you design kind of measures into the to such a project that actually also gives something back to the communities that are donating their time and their data. Is that correct. Yes. Yeah I think that's really important and it's very different what you can do. I mean there's a lot of rules for instance on the university that we can't just like pay people but I was participating in this project in Seattle where they could do that. They could actually pay participants an hourly wage for that time and that's of course like the most useful if you can do that but if you can't then you can find other things to do. So in our project for instance we did professional headshot photos of all the participants both to kind of use if they gave permission for that in our report on the project but also for them to use professionally and that was actually pretty popular and so we tried to find we gave movie tickets and all these other things right. And I think depending on where you are in the world food kind of vouchers and stuff like that can be really helpful. Yeah. Kind of a not fun question but one thing that was lacking in the presentation which was amazing great is what about the content moderation on the photos? Content moderation? Yeah well that is the sign into the kind of this kind of this kind of interface that we have online. So you each participant can only see their own photos so I mean that's pretty protected and then there's the the moment where you if you want to as a project manager to open the photo the collective collected kind of photo library up to people so they can see and react to each other before that happens there's a moderation feature so you can in this interface you can go as a project manager and say you know are there some of these photos that shouldn't be circulated out to everyone and of course as a project manager you can also choose to ask participants to go and review their own photos and delete the ones that they wouldn't have others look at. So it's giving kind of the tools to the project manager and the participants to do this. Yeah. I think this is really important kind of point from data feminism is this idea of paradox of exposure so this kind of sensitivity and awareness that creating visibility is not just always positive but can also lead to harm and to kind of negative outcomes and so we've really thought about that a lot and I think it's really important to design for that both in the tool but also in the process of how you actually use it. Yeah. Oh there was one there. Positive neutral or negative. Did you give any thought about like the difference between a photo being truly neutral versus there's good and there's bad and I see both and so I would classify as neutral. Is there any way to like get this information out of a picture or is there something that someone would need to go and like examine after we have something? Yeah. Well that's a complicated thing to answer in a very short way but the question is how can you distinguish between positive negative and kind of neutral versus well ambiguous that there's both positive and negative in the same photo and in our project the urban long project we had the middle category be kind of ambiguous so that was how we defined it and other people maybe choose to say that the middle category is neutral or whatever I think that really depends on context but I also think that what we did here was create an app that can that can generate this really enriched data set but our entire process was set up around having participatory workshops so we brought this data to people we brought the photos we brought the maps and so on and then asked them to tell us and so really the answer is that we cannot answer the question but that we should bring this data back to people and use the visual nature of it to engage dialogue and so it might be both ways really but the participants can say so. Yeah. Sorry what? The policy makers did they show any interest? Yes. Yes the cities and city kind of municipalities and city councils and so on they know that they have a problem like they are lacking tools to engage diverse perspectives and as we did here multiple groups and not just kind of tokenizing and involving one minority group so they need this sort of tools and they needed to be structured because that is what gives validity to a political decision right and that was what was missing before with this kind of handheld approach to photovoice. Thank you so much.
Beyond Ratings: Empowering Communities through Wikirate for Transparent Corporate Impact Research and Analysis.
Thanks. Hello. So my name is Vasily Gikazyaki and I'm a data engineer with Wikirate International. And I'm going to talk about how Wikirate empowers communities for transparent corporate impact research and analysis. But before we get into details about what Wikirate is, I would like to talk a little bit about what is the problem with environmental, social and governance data of companies. So usually when it comes to ESG data, we can say that they are expensive, exclusive and inconsistent. There are a lot of data sets hidden behind paywalls. So individuals need to pay thousands of euros per year to get access. Additionally, there are a lot of ratings. There are a lot of organizations actually producing ratings about companies. But the problem is they don't provide access to the low level data sets. So it's difficult really to understand what they are rating. And also, yeah, they don't make the methodologies transparent or the sources transparent. Yeah, finally, the last few years companies started reporting more ESG data in a text format, in sustainability reports. But the problem is that a lot of company reporting is not standardized and that hinders large scale analysis and comparisons between companies. So what makes open research so important in the context of corporate accountability? It actually fosters transparency in corporate practices and empowers different stakeholders, especially people that don't traditionally have access to those data. That they don't, let's say, investors and they don't have the money to pay to get access of this ESG data. It encourages collaborations in global scale and promotes data driven decision and policy maker making and drives positive changes. So Wikiread is an open source, open data platform that brings corporate ESG data together in one place, making it accessible, comparable and free for all. It's a wiki that means that anyone who has a passion for sustainability and ESG data can come to the platform, contribute to the research, contribute to the available data and organize their research as well. So our community is mainly comprised of civil society organizations, academics, university students, data and sustainability enthusiasts. And we strongly believe that in research and in research in companies everything starts with a question and ends with an answer. So I would like to give you a sort of overview of the structure of the data on Wikiread. So these research questions we call them metrics and we can have for its metrics several answers. And its answer is linked to a specific company, a year of reference and a specific source. So here we have an example about, did Airbnb UK Limited produce a statement in relation to any modern slavery legislation or act in 2022? And the answer in this case was yes. It produced a modern slavery statement under the UK Modern Slavery Act and there is a source and citation actually linked to this answer that leads to the actual modern slavery statement of a specific company. So in Wikiread in addition to research metrics we provide also calculated metrics as a tools for calculations and for analysis. We can say that the research metrics are actually the building blocks for analysis and the calculated metrics actually are built on top of research metrics and allow users to run calculations. So we do have namely Wikiread in score metrics, formula metrics and in more strictly formula metrics allow users to run their own calculations in coffee script so they can be quite complex or not complex. It depends on what the users want to do with the data. So and this actually, this calculated metrics helps to bring transparency into ratings. Here we have an example with the fashion transparency index is actually a rating that scores companies, fashion companies based on how transparent they are on different sustainability topics. We are a partner, we have a partnership with Fashion Revolution which is also an NGO and they're building, they do this analysis in research and we're helping them to actually make the research, the data, the ratings, the analysis, everything transparent available to the public. So one source of data on Wikiread is of course data that are coming from the ground, from civil society organizations but also there are a lot of data in the public domain. So it's easier to bring structured data and semi-structured data and building data integration pipelines to Wikiread but of course we do have the challenge of unstructured data and how we are going to bring those kind of answers into the platform. So for those reasons we are performing research projects and we have called for volunteers to come and research those reports and finding answers to questions on specific topics like modern slavery, greenhouse gas emissions, etc. So how is the data used? One use case of Wikiread data is building data dashboards that are actually used for advocating for change. One example is Fashion Checker which was developed in partnership with Klinglo's campaign and actually advocates for worker rights, especially worker rights on the supply chains of fast fashion companies. We have the Beyond Compliance dashboard that was also a partnership with Work Free Foundation and is a living data dashboard that assesses modern slavery reporting and tries to highlight gaps on modern slavery reporting and puts for new legislations and for new policies. Also the data is used for writing news articles and producing help CSO, CISO-Cyber-Suside Organization, produce reports and making research findings and analysis transparent. So it's also used for writing research papers, research some of those and Wikiread data are free, are under Creative Commons license so it's welcome to anyone to use the data, explore the data. They can do it through the API and through the user interface. We have an available RESTful API and several wrappers that will allow users to pull data from the platform, also contribute data to the platform if they want to. We also have a GraphQL endpoint that allows users to form more dynamic queries based on their needs. So where to start with Wikiread? If you're interested in contributing data, you can always say start with the guides please. Read the guides as most of the questions are answered there. But of course if you have any more questions you can directly contact us. And yeah we have several projects that are in need of contributors. You can help us improve the data. We have verification tasks, we have verification tasks so we ask the help of the community to help us with this process. And yeah if you are interested in volunteering these links are available to the slides. And yeah you can contact us if you want to share ideas with us, form partnerships or get support. Yeah as I said in the beginning Wikiread is an open source project written in Ruby. You can check out our github repository and if you want to get started with Wikiread and DECO you can do it. And you can also create if you want your own data dashboards if you're interested in ESG data. So yeah I think that's all from my side, maybe it was too fast. But yeah thank you. Thank you. We have maybe four minutes of questions. Hello. Hi. I have a question about if AI has helped you in a way in any of these processes for example while manipulating or getting data from the public domain or something like that. Yeah so yeah we are considering, sorry the question was about if AI helped us in any way obtaining data from the public domain. So the answer is that we are considering using now AI and LLMs for extracting actually more structured answers from text reports. But still we are in the test process so yeah hope that I answered your question. Yeah. How much of companies are covered by the data set? Is there specific industries you are targeting? Yeah so yeah we have about, I'm sorry yeah. The question was how many companies we cover at the moment on the platform. So we cover about 1400,000 companies. The biggest focus on research or the biggest companies. So more data can be found at a very popular let's say companies and because we did have a lot of projects on the fashion industry we do have a lot of data about fashion companies. And in total at the moment we have five million answers. So yeah. Any other questions? Yeah sorry. So it's open to contribution as far as I recall and you mentioned some verifiers. I want to ask how do you make sure the data is consistent and how do you go through the checks and see if the data is reliable? Yeah so yeah the question is how we check that the data that they are coming into the platform is reliable. And of course it's a question also because we are doing crowd research sometimes people do not have the expertise on ESD topics. And what we do is we have different verification levels. So we consider an answer verified when more than two people have come to the same conclusion. And you can see on the platform that we have Stuart verify and community verified. Stuart's usually are members of the community that they have more expertise on the specific topics that the research is about. I'm always the person that's like let's squeeze as many questions as possible so we'll do really rapid questions right now. Very quickly. I was just wondering if this could be expanded to cover other types of data rather than ESD. The question is if this could be expanded to cover other types of data. Yes it can. It's again about environmental data but one use case that it comes on the top of my mind now we are having companies you can have something similar for countries. So you can highlight for instance the electricity or water usage or CO2 emissions per country and not focusing specific on companies. Thank you so much.
From Grassroots to Standard Practice: how an Open Science society shaped university initiatives
I'm going to go to the next slide. Let's see. The U.S. is now going to present from grassroots to standard practices how an open society shaped university initiatives. Here you go. Hello, everyone. I'm coming here to talk about my experience at the Surrey University in the UK where I did my PhD in the last four years. I just finished my PhD. Thank you. Thank you. Basically, is this working? Next. This society was founded in 2019, which is when I started my PhD, but I wasn't part of it at the beginning. Sometimes it's difficult to find out what's going on at the university. It was founded by Marta Topor, who was a student at the time. Emily Farron, who is staff, she's a senior lecturer, I think, at the university. They wanted to create this society to tackle this kind of reproducibility crisis in the field of psychiatry, but we'd aim to expand to all the other areas. It was a society run by young researchers and postgraduate students. It was open to any students, but undergrad students are usually not interested in this kind of society. At some point, we have over 100 members, which was very successful for this kind of society. We were inspired, but different from the other students. We were inspired by the different students and we were inspired by different grassroots initiatives. We have the UKRN, the reproducibility network in the UK. This is a national peer-led community of researchers that want to tackle this reproducibility, trusting science, and improving methodology of scientific methods. We also have other initiatives, like reproducibility. This is a journal club where people get together to read papers about methodologies and how to improve methodologies and analysis. The Riot reproducible, interpretable, open, and transparent is also a science club. I think it was started by King College London. We have the UKRN, and the Riot reproducible is also a club in King College London. Those were our aims. Integrate, open, transparent, and reproducible methods in science. Help with this rigor and quality and this trust in science crisis. As I was saying, we have the UKRN, the Riot reproducible, and the UNR, and the UNR, and the UNR, and the UNR, and the UNR, and the UNR, and the UNR, and the UNR, and the UNR, and the UNR, and the UNR. We have a lot of meetings and discussions and workshops, and we also have conferences in the university. We managed to create... It run for two years maybe? Two or three years. Two or three years, yeah. Yeah, and that was very interesting because it brought like all the university together so we could discuss with different people, so not only students from different faculties, but also staff and researchers. And from these different events that we were creating, we started in January 2020 the monthly mini hacks by Daniel Curtin here, she's here. She will be giving a talk later. But yeah, this was when we started putting a greater focus on the computational methods of these social sciences, so not only that reproducibility, but also we wanted for people to have the skills, coding skills that maybe are not so easily accessible when you are in social sciences or in non-computational fields, but that are very, very important for those fields. So it was this cross-disciplinary collaboration and skill sharing and kind of like hands-on coding so someone would come give like a maybe 20-minute talk or 15-20 minutes about a topic and then there would be like a collective programming time. And then it's always better to kind of learn these things in a group, right? So you can ask questions instead of just watching a YouTube video or reading something online and then no one is there to guide you. And yeah, it's also useful to encourage and promote these best coding practices. So sometimes I'm sure many of you have had this problem when you go to like a research, academy, software project and you have no idea what's going on and all this spaghetti code so to try to make people be better at coding. And also as a way of training and improving your research and your employability and just your own skills that may be helpful for you in the future outside of the university. And I guess, yes, the isolation and the learning curve that is worse for some people in different fields or just depending on what background, what opportunities you've had before, right? So with the pandemic, there was a lot of things going on, right? Some of them very bad and in a way we had new opportunities because we all went online so there was more like a global opportunity of sharing from outside the university. I guess we always had this option but we weren't really used to do things online. So we created the mini-hack consortium which is kind of like taking the mini-hacks initiative outside of the reproducibility society which they were still interlinked but it became its own thing and we started creating a lot of online courses every month bringing different people from different universities. We had people from universities in Spain, in Germany, but also in Latin America. We had a collaboration in Colombia. We did a two-day hackathon for a neural language processing. So bringing people who were experts in different things to talk about their experiences and things that were... maybe they weren't experts in but they've learned about, I don't know, if you've done a PhD in some science, you know latex but when you're starting in research you have no idea how to write latex and it all looks very strange and difficult to understand. So it went very well at the beginning. We had a lot of people interested. We have more people joining the workshops so when they were in person maybe it was more difficult to advertise it across the university but when we went online we could reach a larger audience so that was very successful at first. These are some of the tools that we were using both at the reproducibility society and the mini-hack. So yes, we have all our files and data at the Open Science Framework repository so if we had any presentations we would record them, we would upload the video so you can go there later and all the slides and anything that was required at these workshops. Then we publish all the slides and any outputs of these workshops in F1000 Research which is an Open Science scientific publisher so then you could also have your own thing in your CV and record, right? So it would help your... in the future and then obviously we publish all the code in repositories open for everyone to use and for this advertisement of the workshops we would use Eventbrite which is not open but it was very useful to reach a larger audience so it's not only the people you know and that already know you but it also promotes your events to people who have related interests so there were a lot of people who had no idea who we were and they were just joining because they saw the topic was related to their field or something. And this brings me to what happened at the University of Surrey so we created all these events and workshops and things and then through Emily Farron who was part of the staff she was pushing this to become more policy at the University of Surrey and more people started joining from the staff, from the researchers areas and we got... I think they adapted this open batch the open batch, yeah, to know that your... I don't know what I was going to say that your project has all these data standards and all these open standards of research and they created a working group that was leading this change within the university and they created a community with forums and chats where they could share all these things across universities or not just in psychology but across all the other faculties they created a research handbook, they actually created an actual module that people can take about open science learning about open data and they created this open research annual lecture and this is basically the continuation of the conference we were doing as students So to finalise this is what happened to us We were very successful at the beginning, we had a lot of people, more than 100 people as members of the society but with the pandemic apart from going global we also had other problems People were meeting less online, less in person People weren't going to maybe the faculties and the office to work so we started like putting it apart and because students come and go people were leaving the university and there wasn't any new people to join and at half the time that this type of initiatives required which is a long time So I was left alone with the mini hacks and it just completely faded in my last year of PhD when I couldn't get more people to give more talks or workshops, I just couldn't find anyone else But at least the university policy now, maybe more students will come in the future and we'll want to take this on and we learned a lot so we already have that going on for us and I don't know if we have any time for questions Sorry We do have to be so let's take some questions, thank you Thank you Thank you So it's a very nice story, it's a pretty good way but is this journey like what you just presented Yes, yes, so the question is if all of this work that we've done is anywhere and people can take on from that Yes, so everything we've done, not only the workshops and the events that we did but also like our policies and our documents of how to organize and how to do everything is in our open science framework repository so the first point here and people can go back there and see like how did we do the main list, what are the resources that we use what was the kind of like flow of organizing a society which is helpful even if you want to create your own society that is not continuing this That was the second question, second and third Yes Thank you very much for the presentation, what's your point of view on the initiative you have in science club with respect to this The European Open Science Club, is this similar to some of them? I don't know, I'm not familiar with this So would it be like similar to this but European level Okay, so I am not aware of the European Open Science Club but I think it would be very interesting to connect with all the different initiatives that kind of have the same aim in place and that would probably be useful also to get people to collaborate with this, thank you
Bridging contributor's knowledge and the technology of The Turing Way, an open guide for data science
Okay, great. Yeah, after the last speaker, I think my sticker game isn't quite up to standard. But I'm Jim, I'm a research software engineer at the Alling-Turian Institute and a volunteer core member of the Turingway team. The title I submitted was Bridging Contributors' Knowledge and the Technology of the Turingway and Open Guide for Data Science. And I thought that was a bit long and vague and maybe I'd have another go and I came to a personal perspective on the interface of infrastructure and people which is at least shorter, but I'm not sure if it's much better. But I'm going to be talking about getting the people who contribute to your project with the infrastructure, which does mean the technology, but also means the processes and people who control what gets into the project and how decisions are made. So the way I like to pitch it is we all contribute to projects for a reason. We're making something for a purpose and there's different ways we can measure that and you might think about a number of contributors or stars or downloads or engagement or something. But the common thing between all of these is we need the contributions. So that's how we make progress. And so maybe what I'm going to suggest is maybe the important thing, the most important thing that you should be thinking about and measuring yourself against is how well do you facilitate people who are contributing to your project. So a bit of scene setting for the Turingway. The Turingway is a handbook for reproducible, ethical and collaborative data science. It's developed completely in the open on GitHub and is open to the licensed creative commons. And there's quite a large number of contributors. The last time I looked it was a bit over 470 contributors in total. When I work on the book, that's what my screen looks like. My background, I have a chemistry degree in PhD and during that I was exposed to Linux and open source software through computational chemistry and became really passionate about that. And now I work as a research software engineer. So I'm not really a chemist, but I'm not quite a computer scientist and I'm somewhere floating in the middle. And the Turingway is a big project where I work. And I started off just making a few tiny contributions, fixing typos and links and things. And late last year I became one of the co-leads of the new infrastructure working group and we think a lot about the CI and automation and how we help people get stuff into the book. So what's the actual problem I'm talking about and trying to think about? So if you're a maintainer of a project, one of your key tasks is maintaining quality and to a certain extent that means putting up a bit of a barrier to contribution. You need to have some standard, you need to keep standard, you don't want to break things. And that can be a bit tricky because that involves a bit of pushing back against people, maybe giving critical feedback or maybe not accepting certain changes and you need to kind of strike a balance there of encouraging people and then getting stuff in particularly when your contributors... I always say non-technical, which really means not software engineers, I suppose. Your contributors may come from many different backgrounds and have different amounts of experience. So some might be more or less into tech and that can also create problems because you can't assume that every contributor is going to be the same and they might need different levels of assistance and that might make sense to one, might not make sense to others. In the Turingway in particular, the community is incredibly diverse in terms of their educational and professional background, the language they speak, time zones where they work, where they live, their lived experience. And most of the contributions, most of the data that's actually in the project is pros and not software. They're contributing their ideas and their knowledge in the form of text rather than working on the code. And so also the people are generally not software engineers, which means there's quite a lot of additional support in terms of, you know, YAML format, why does Markdown work and why doesn't CI pass. But I think the important thing there is that all of the people that contributes make valuable and important contributions to the book and their sort of technical ability, so how well they understand the build process and things, isn't really a good measure at all of the value of their contributions. So we focus quite a lot on how to enable people to contribute to the book. So here's our approach at the moment, things we do. Probably not surprising, but everything is version-controlled in Git and the projects on GitHub. But I think there is an interesting question of why do we do that and why don't we just have a wiki if it's mostly text. I think the simple answer is the advantages of version control are just too strong. You can go back in time. Handling multiple contributors is really easy even when they're working asynchronously on different branches and you have to fix conflicts and things. And because it's a guide about open and reproducible data science, there's elements of do what we say, we've got to demonstrate the sort of culture we're trying to create. And so that means doing everything in the open, doing everything as reproducibly as possible. There is a community handbook, and I always love how sort of meta this is. It's like a book within a book, and it's a book which tells you how to contribute to the showing ways. There's a contribution guide, there's the code of conduct, style rules and things like that. So yes, I love that the book tells you how to write the book and contribute to the book. And because it's part of the book, it's completely open, and if you think those rules should be changed or adjusted or can be clarified, you can also contribute to that. Recognising contributors is really important, and we try to recognise all types of contributions, so not just text and not just code. And one of the ways to do that is to use tools from the All Contributors project, where you can tag people for the types of contributions they've made, and you get this nice table of people and their contributions, and that is also displayed in the book. And on a Git repository, people are encouraged that if they feel they've done something, they've put in some effort, they should suggest that they be added for a certain contribution type on this. More recently, we started using a GitHub action which ties into the crowd-in API. So crowd is the platform that's used for making translations to help to better recognise the translation efforts that go on in the during way. There's a lot of support. As I said, a lot of people are not super technical, and so we might need to work quite closely with them. We like to think of pull requests as a chance to work with people and collaborate with people and make connections and not just a barrier to stop things you don't want to be merged in. And that support goes even further. There are different types of events and co-working sessions to help people get contributions in. So there's regular co-working. There are sprint-style events called book dashes. And so there's a lot of stuff which adds a bit of social element, but it's also about helping people work together collaboratively to get things into the book. And we lean on, see on automation a fair amount, and the focus there is remove burden from the users so we don't expect people to build the book or run tests themselves. Everything is done in CI for you, so you don't really need to know about how it gets built. You can just focus on writing the muck down. So here are my sort of unoriginal lessons for how to search contributions. Building a community takes effort. If you write some code, you put it on GitHub, it doesn't necessarily mean people will engage with you. You need to be quite proactive in reaching out to your community and assisting them. And that means you need to know who your contributors are so if you can identify that and figure out what would help them, I think the thing to think about is what thing can you do which would most enable them to contribute. Leaning on CI is great. You can sort of say what CI says goes. It's a fair way to sort of compare people's work, all the tests are done in one consistent place. Everyone's being marked to the same standard and avoid sort of arguments about, well, it works on my machine. I think this sort of goes without saying version control, not optional. It's brilliant, do everything in version control. And if that means you need to do some support to help onboard people and how to use that, which we definitely do, it's definitely worth the effort, it's worth the pain of doing that. You should be flexible. So I think something to keep in mind is it's better to bend the rules a little bit to better contribution in and you shouldn't let the perfect be the enemy of the good. However, you do need to know when to be strict and here's some suggestions of what maybe your red line should be, things that sort of aren't acceptable to merge in. However, even when you're doing that, you've just got to keep in mind be kind and respectful and actually problems are an opportunity to get to know someone and help someone and teach someone. So thanks very much for listening. I'd just like to thank a few people. I'd like to thank all the infrastructure working group on the sharing way. That's Brigitta, Danny, me and Sarah. I'd like to thank Ale and Anne who's here with the project and community managers and provide a huge amount of support in getting the working group started. Skrobiria who worked to make all the brilliant illustrations that you've seen. So without them, you would have been looking at a lot of bullet points which wouldn't have been as fun and absolutely everyone that's contributed to the sharing way. If this has sounded interesting to you or the book, A Guide to Open Refugees, Refugeeful Data Science, you can read the book. Here's some ways to join the community, the sort of social aspects. And we've got more definitive, clear ways to get involved. So you can read the community handbook. We've got good first issues and there are events you can join to start making contributions. Thanks very much. So we have one minute so you can take one question as you please. Why we welcome the next speakers if they are here. Yes. So you mentioned Ale have seen how you're recognizing contribution and you can have people nominate. I'm curious if you have sort of a list of what those categories of contribution are because I assume that many products struggle to recognize non-striped contribution. I'd like to see that you have a record but I was curious how do you, I don't know what those times are, do you have a list of how you can recognize these other titles of contribution? Yeah, I'm going to read the chef. Yes, so the question is we say we want to recognize all types of contributions and earlier in the slides you could see a little emoji to say what those are and so is there a specification for what those are? So I can learn more about that. So the All Contributors Project has a list of contribution types and the emoji and so they've, I would say it's sort of loose suggestions that some of those are infrastructure, content thinking, event planning, things like that. So we roughly follow what that says. We're not sort of super strict on saying this emoji relates very specifically to this kind of work and actually the approach I think it's probably written somewhere in the community handbook is more like if you feel like what you've done, for example is event planning, you should find the closest one to that and add yourself. So we use it quite loose and take the attitude that even a contribution of any size, if it's a measurable contribution it's worth adding an emoji. But yes, the All Contributors Project has their table of what those mean and some sort of suggestions. But I think it's quite nice. I think another project you can sort of maybe use them a bit as you want, maybe some of them are more or less relevant depending on what your project does. Thanks.
The CarpentriesOffline: Teaching Foundational Data Science and Coding Skills with Little or no Internet Access
Hi everyone. So I'm Abhishek Dasgupta. I'm a senior research software engineer in the University of Oxford and I'm presenting carpenters offline today with my colleagues Janata and Colin. And so carpenters offline is about teaching data science and coding skills in low resource settings in places where you don't have internet. It's a way to do courses without access to the internet. So who are we? So Janata is at Newcastle. Colin is at the National Oceanography Center. I'll let them introduce themselves later on. And there are other people as well. We have collaborators from the University of Florida and also Stella and Barsha and Durham. What is the carpentry? So for those of you who do not know, the carpentry is a non-profit organization which was built to teach foundational coding and data science skills to researchers. And their vision is to be inclusive and to make the software teaching skills as accessible to as many people as possible. And they have various kinds of carpentries. So they share some courses like software carpentry, data carpentry which is focused on data science and also library carpentry for library and information sciences. And we have various roles like we have a carpentry instructor so anyone can go through instructor training and be a certified carpentry instructor. There are carpentries workshops which goes through the approved carpentries curriculum which includes things like introduction to Git, introduction to Shell and introductory courses in R and Python. And we use technologies like WebBest. Our course notes are all open source and online. And we use Etherpad and Google Docs for shared notes and everything is on GitHub. So how did carpentries offline start? So again it was at the Software System in the Institute Collaborations Workshop, HACTA. So we started upon this idea that what if you do not have internet and of course a lot of the instructions in the current carpentries curriculum require you to access to internet. So they will say download this from PyPI or download Python or RStudio and install it. What happens when you do not have access to internet or maybe internet is very expensive or maybe you want to work on the EuroStar which we found out that internet does not really work and the Wi-Fi is there but it is a sham. So we came up with the idea that why not use Raspberry Pi's because they are cheap and available. I think this was before they suddenly became unavailable but basically using some sort of low cost single board computer to host the carpentries infrastructure and allow people to access it offline. So it won the HACTA and it also, you know, had the SSI Fellowship on that in 2022. So the first part of it is actually getting the data from the internet onto the Raspberry Pi. So that is a package developed by a team at the University of Florida in collaboration with us and that is called offline data site. So offline data site is a component of carpentries offline but you can use it outside. So if you for some reason want to get R and Python and like the common packages, you know, pandas and ampi all that, on cash onto your computer so they can work offline, you can do that. By default it will install a certain set of packages which are customized for carpentries but you can add your own packages to that and it's available on PiPI so you can install it. What we mirror is the latest installers in Python and R and also we use partial mirrors of PiPI and CRAN. You can customize the packages so you can specify your own packages to download and once you download it you can set your local PiP or CRAN to get data from that and then we also mirror carpentries online material of course and installers like RStudio. So I'll hand over to Colin. Thank you. There's been three threads to our project in terms of building hardware that people can actually go and use. The first is to put the carpentries offline onto a Raspberry Pi that can be booted up and run at a workshop. The other is to actually build a bootable flash drive that can be used with an old laptop perhaps and then the third and latest one is to actually build a miniature HPC high performance computer that can be used for teaching HPC lessons. So for the first option we have the Raspberry Pi. I have here a Raspberry Pi Zero, one of the cheaper ones. I think these when they're in stock cost about $10 or $15 or euros and they can run as a Wi-Fi access point mirroring everything. Some of our lessons require us to have Git Hub so instead of Git Hub we have a program called Git T which kind of has all the functionality of Git Hub but is self-hosted on the Raspberry Pi. We can also run an EtherPAD server, mirrors of all the lessons and CRAN and PiPi mirrors. So what it looks like we can use this, the Pi Zero or we can use the bigger Pi 1, 2, 3, 4, 5 now. And when this boots up and I will try and do a little demo. So when it boots up you should see an access point called Carpentries Offline that will be accessible. If you were to then join that access point and go to your web browser and type in Carpentries Offline.org or 192.1681.1 you will see the web page that is listed there and that will then enable you to get onto the Carpentries Offline. Do we dare to do this? Where's the Wi-Fi chooser? It should be at the bottom there. Oh, it's on there. Good. Is that joining? Connecting. You can see this on the up screen. It says we're established so let's see if the demo gods are in our favour. And there we go. The web page being served from this little Raspberry Pi Zero. If I was to click, for instance, on the data Carpentry and go to the R Ecology lesson, there is a complete mirror of the R Ecology lesson that we can then teach from. And we also have Git T so you can log into Git T and have a very similar experience to GitHub or if you want to download some software to say you need to install R there is an R Windows and an R Mac package available for you to download. We also haven't got it listed on this page but you can then point your CRAN or your PIPI installation at this server and install things from those mirrors instead of having to get them off the internet. And I will put this back on. Edgy Rome to keep going with the slides. The slides are idiotic. Oh, okay. So I don't need the internet. It's beautiful. It's not working. I'm just going to keep going from there. So one of the problems we're trying to build the image for the Raspberry Pi is we initially started with a set of instructions of boot up your Raspberry Pi with an image you've just downloaded, type this and type this and type this and type this and eventually you'll have a Carpentries image. We then moved to having a shell script that could run all of that automatically so that it was a bit easier to reproduce. Then at a hackathon we went to, someone suggested what seemed like brilliant idea of we could run this in GitHub actions and do it all in the cloud and have that spit out a Raspberry Pi image for us. Many hard months of work later realized this wasn't quite so easy because the emulator for the Raspberry Pi is really slow and it turns out that GitHub actions actually has a six hour time limit which wasn't enough to do all of our installation. We had a few hacks to speed things up so one of the things we found is that not just is the computation slow but the network access out of the emulator is really really slow. So downloading anything inside the emulator was much slower than downloading outside so we actually download all of the offline data size stuff outside of the emulator, mount it in a virtual drive and then copy it into the emulated image and build the emulated image and now we've got it down to about two hours and it pretty much works. The one snag I'm currently having is that GitHub is not allowing us to upload the final image and I think we're hitting the maximum file size limit from GitHub and need to find somewhere else to host our images but there's a link later on to our GitHub if you do want to go and download the last image we managed to get on there. As a kind of side effect of doing that I started testing builds in the cloud just natively on a AMD64 system and realized that we could also then build a Docker container containing all of this and that was actually kind of useful during workshops because sometimes some of the carpentries infrastructure would go down midway through a workshop and the last thing you want halfway through a workshop is telling everyone to go and download something and finding that the website they want has gone down so we found that we could also replicate all the infrastructure in the cloud or on our own server if we have a server and it meant we had a backup version of everything needed to run a workshop which has saved me on multiple occasions now from workshops where we lost access to something and I've now got a almost one press solution with a Docker container that I can deploy out to a cloud which also works very nicely for testing stuff. It's sort of the tail end of the COVID pandemic. Raspberry Pi's got really, really hard to get hold of and really expensive. It was always the joke that Janetta had all of them in her house but I don't think she was the sole cause of that but the chip shortage certainly didn't help us and so we started looking for alternatives to the Pi. Do you want to take over at this point? Do I want to do this bit? I don't want to. Where are we now? On to option two. Oh, okay. So this option two is also because some people say but I already have a laptop so why do I want to go and buy a Pi? Especially if we look at countries where you don't have access, that's already a big problem apart from when there are two scares to find. So we've come up with the idea of doing exactly what we're doing with the SD card for the Pi and creating an image for that but creating an image on a flash drive which can also be downloaded, written to a flash drive and it has exactly basically the same software on as the SD card and you can just boot any laptop that you can boot from a flash drive. You can boot from the flash drive and it turns it into the same server as we have here. Oh, okay, so now let's put this. So more or less last year, April, I was running a workshop, an intro to HPC and everything that could possibly go wrong did. So I was starting to think, well, I also had a shed load of my fellowship money left and I had to spend that on something and I thought, but hang on, if we extend this project to cover the intro to HPC, that would be a really cool thing to do and there were a few things that we could do with the money HPC. The hardware is more visible so people don't know really what an HPC is, this massive thing in the cloud somewhere hidden and they don't really ever get to see the HPC that they work on it and it's also quite easy to mimic the original limitations which you're not going to do during a lesson on a big HPC but on the many HPC you could actually do that, you could hit limitations and see what will happen and teach people how to cope with that. You also, and this was the big thing, people get this email, say register for your account on the HPC, they don't do that, they show up on the day and you've got to jump through all those hoop of keeping people registered which is not a quick thing to do because sometimes there are loads of things to be done before you get your actual HPC account. So also the nice thing is it doesn't interfere with the real HPC so users can get quite afraid, they're going to break something on the real machine so in this case you can assure them it's here, we're not going to break anything of national importance or international security or something and of course if you don't have access to a real HPC then that's also covered. Also the day when we ran the workshop one of the things was somebody was doing something on the node, they were running on the login node and you know what's that, you can't even log in, let alone run your scripts so you won't have that and you won't also have problems with network access because it's all local. So like Colin said the reason there was a scarcity of Raspberry Pi's was because I probably had them all in my house. So although I got the money for building a many HPC I decided to go for these rock pies which are sold in the UK by a company called OkDoo and RS Electronic and it was an absolute disaster trying to get these things ordered and the time I had to work on this thing was passing because these guys couldn't get my order, it's all too dark. So I collected all the Raspberry Pi's in my house and built this one that you see there in the picture and that consisted of three Raspberry Pi 4.4 nodes and then one for the login node, in the meantime I've added I think two more nodes or something and it's running the Pi OS Lite, the 64 bit one and then the head node also acts as a Wi-Fi access point so everybody can just log into that. So in the meantime I did manage to buy the rock pies but we are still in the process of setting that up because the idea in the end is especially the Raspberry Pi's would probably make more sense but the rock pies were slightly more at a higher spec but the idea is in the end to produce two images that you can again just download from the website and we want to be able to build up this operating system with scripts so we still need to do the scripting, at the moment I'm still doing everything manually but so we want to work to do the scripting and then download an SD card image which you can just write to it so that the people who will end up with this mini-HPC don't need the knowledge to set all of this up and this is basically the software that we want installed on it because this is what the lesson covers and what most people will be using and what we need to actually do what we want to do for the networking, for the scheduling, etc. And then I've got to get to the point where I credit everybody so all our gets here, if you want to know where I got the STL's for the printing so a lot of people I do have to give them credit because everything is their work. I forgot to put this picture on last night so this is pseudo for executing bash commands so I forgot that one to add it then, it's the important one. And then there's the small one that was called RM. And so here's some more links and credits. The Raspberry Pi image that we have at the moment that Colin's been talking about can be downloaded from that link there. You can find us at CarpentriesOrFlying.org and what's not on there would hopefully soon be on there because it's a work in progress all the time. We also have a Slack channel on Carpentries workspace and so we can be found there so if you have any questions or anything you can get us there and that QR code will lead you to CarpentriesOrFlying.org and I think that's the end. Thank you. We have time for questions. Can you share with more how people are using CarpentriesOrFlying outside the team? Okay so, I forget to repeat the question. Oh, how can we use CarpentriesOrFlying being used outside the team? So one of the things so far was because we're still working on this image is that we have not, we've got a lot of people that keep saying they're interested but nobody has really taken it on yet. So especially with the Mini HPC also I'm still working on it but I hope to now soon be able to run my first workshop on it and then hopefully it'll take off from there because then I can say to people okay we've done it. Also I've kind of used it, I went to UNO's first year of Strathclyte to give a, to run a workshop and the power went down and the internet went down and I was able to use it there just from my laptop actually that day. No, I didn't use the Pi, I didn't use the Pi too because I had it all there but I've known we've not been able to get people because I feel it's not in a state where I can actually set somebody off with no experience because we also need to develop an off-boarding lesson. So we hope to have more hackathons and more work sessions where we can work on these things and get it ready for other people to adopt. So if anybody here wants to go off and adopt it please let me know and you know where to find me. Actually there was another one that I ran in South Africa where we wanted to test this and we ran into a limit of eight people connecting at that point but I think Colin has sorted the problem out. He sorted the newer Raspberry Pi's but there's zero, I think so there's a limit and some of the older ones do but there was a firmware fix for that. A typical Carpiti's workshop would probably have between 10 and 30 people so we'd be aiming for that sort of size again. A typical Carpiti's workshop would be between 10 and 30 people but we had limits with the older hardware but we think the newer hardware doesn't have that problem and we should be able to get 30 people on it. Another question. Yeah I spoke faster than I rattled over because I didn't know how much time it was left. And this hat makes me warm. At least we didn't get the hot hats. Five minutes. If there are questions please ask away. If not, it's cool. You all have been informed that we have a social event tonight. We are going as an organizer team of this room to the tavernier bar so you can check the queue on the left. That's the way to the tavernier. We are also organizing an event next week that's going to take place online with some other talks not only that couldn't make it in the tight schedule today but people who couldn't travel to Brussels. We wanted to be inclusive of those people too so that's the QR code in the middle and the website is the last QR code that's on the back of the room and the right there. A very important question. Where can I get a hat? I've got some more at home. Next time I go to touring I'll ring one. These tools are all representative of the carpentries. This is the Alipay software. This is the library carpentries. We also have a protractor but we are figuring out how to get that on the bed. Thank you very much. Thank you.
The French Open Science Monitor: steering the science based on open bibliographic databases
Hello, everybody. I'm Anne and I'm a, yes, data engineer, software craft woman. And currently I work for the French Ministry of the Higher Education and Research in France. So it will be important for the project, for the presentation, because I'm working on the French Open Science Monitor. So before, before all, I have to say that in France we have no CRIS system, the system in the university that will reference all the publications and all the works in the, in each university or in the, in the national level. So, or at the national level. So, so, yes. And back in 2018, the, the, the policymakers decided to, to, to create the national plan for open science. So to, to promote the, the open science in France. And the, the first question was, so how, how open is the French science? So according to, so to, to answer to this question, we developed a tool to, to be able to, let's say a sovereign tool to be able to, to measure open science. And the, and the, later the requirement would be, would be not to use the proprietary database. So everything had to be open and transparent. So to first begin with the publications, which, which was the first part of the project. Yes, we, we started from there. The, the, let's say the, the, the graph on the right, proprietary bibliographic database. They're kind of complete. They have lots of metadata, but they are missing some part of the publications because they are the one that decided what publication will be in their databases. And on the other hand, the, the open bibliographic databases, they have, they are more complete, more extensive, but they are missing some metadata information or sometimes the quality of the metadata was not really perfect. So we decided to create our own tool based on this. So first we collected multiple metadata available on the, on the web, just like with Crossref specifically. And we tried to aggregate some other sources like OpenAPC, PubMed, and, and, and web crawling to, to complete the information, the, the metadata. So, and based on this, we developed a tool to be able to automatically detect the country of the affiliation in this metadata. So we, we this little hoist, it's the tool that we made. And the, the previous one, PubMed, Crossref, and HAL, we just collect and extract metadata from there. But based on this metadata, we had to build the country detection. So we, our affiliation match, which is able to detect the country of the affiliation. And we, we, we, we, we, we, we, we, we, we, we, we, we, we, we, we, we we, we, we, we, we were able to communicate. So we wanted to create, to create, to create our affiliation match, which is able to detect the country of the publication. Just to, to draw the, the perimeter of the, of the French publications. So, yeah, that's it. That's just What I said. So by, by example. So one of the examples here. So the first compromise would be to, to, to say, okay, if I detect France in the affiliation, it will be France. the affiliation, it will be French. It will be French, sorry. But then we faced this problem with Hotel de France, which is a hospital in Beirut. So, OK, first, wrong result. And then we improve our affiliation measure based on the public institutions database, just like raw. So yeah, that was our first challenge. And then we still, we keep consolidating the metadata by adding the open status based on NPWOL. So for each publication that we collected, we added, thanks to the DOI, we added the open status on it. And after that, we redo the classification to say if it's open or not, but if it's green or not, if it's bronze or not, hybrid or not, there are some different levels of, let's say, openness, according to the journal and according to the APC. So we developed a new classification open access type. And we also added, let's say, categories there, so a tool to be able to add some classification for the publication, to be able to know in which, let's say, discipline or let's say, which scientific field this publication is. So yes. And with all these metadata, we were able to build some indicators. So our first problem would be to measure the open science in France. And we know how the results, according to this methodology, I might show you something else, just like. The website with the results. So according to all the computer metadata, we know that the French publications are open in 67%. So that's the point. And thanks to the, let's say, categorization by discipline, we have some other graphics. Some other graphs, sorry. Just like being able to know the openness by discipline. So here it's about the, let's say, mathematics is the most open scientific field in France. So yeah, you can go on the website where there are more results. But no, I would like to. So yes, that's what I said. Yes, and with all these results, we asked the universities in France, if they were interested, to have their own results. So the only condition would be we asked them to send us their own perimeter. So their own list of publications that they have in their university. And then we were able to send them exactly the same graph, but adapted for the smaller perimeters. And we now have more or less 200, let's say, local variations of this monitor. So there are some universities perimeter, but some labs too. So yeah, and I just show you that. And after that, we did a second round to try to detect and to, let's say, measure the openness of data and code on the French publications too. So yes, let's try to invent another methodology. So first, we needed to collect the data. So that's just what we said before. So we have the whole publications in France. In France, sorry. We tried to download all the PDFs of the publications freely available on the internet. So in this case, there was almost no problem as long as the PDFs were still available. But for the closed one, we need to find some agreement with our partners in the project to give them tokens to be able to download it. And after that, once again, we needed to consolidate the metadata. So we first choose Grobit that you might, may, probably know, which is a tool that takes a PDF in entry and give you a standard XMLTI as an output. And all your data is structured in it, just like giving you the author, paragraph, keywords, affiliations, stuff like that. And after that, we used, so we developed two tools called Datastate and Softsite. And those tools tried to detect the mention of data or code in the PDF. So first, the mention. And after that, in the second layer, trying to categorize or qualify the mention if it's a mention of usage, if it's a mention of production, and if it's a mention of sharing. So those two tools have been developed by Patrice Lopez from ScienceMiner, the guy who is behind Grobit 2. And once again, we were able to do some indicators to have some ideas about the part of the, let's say, the part of the use of data in the publications in France. So among the world publications, so the way to read this graphic is among the world publication in France. So in the 2021, 60% of the publication were mentioning the use of data. So it's a little bit just like we need all the words to make sense of the graphics. And this one is the which part of the publications saying they have produced data if they mention the use of data. So there is just like a funeral approach. Just like first, they take the one that shows data and among them, the one that produced the data and then among them, the share of data. So yes. So once again, to read this graphic, we can say that in France. So 22% of the publication that mentioned the sharing of their own data. So that's it. And as a project about open science, we had to be fully open. So all the code is open. So there is a link here on our GitHub. And the whole data set is open. And then we published our methodology somewhere. And even the talk here is open. And so this project started in 2018. But from there, the open world really moved. So Open Alex released the data about the world research in CC0. So if you don't know it, you should have a look there. And the whole set of publications across the world and the institutions and the funders and the authors. And everything is in CC0. It's quite early. And everything is not perfect. But the good point is that exists now. And it's improving from day to day. Thanks to that, people from Coqui in Australia that developed the Open Access dashboard. So it's a website where there is an open access rate for each country in the world. So it's not perfect for France because it's always difficult to detect the affiliation. So it's a good approximation. And finally, last year, the CWTS released the Leiden ranking but in the open edition. So everything is open there, the data that they used. Because they are based on Open Alex and their own methodology. So yes, that's it. Thanks for listening. Thank you. Questions? Yes. I do have a question for you. Can you make up something? Yep. Yes, I have a call. Thank you very much for this. I'm sorry, but I use it a lot to monitor open-size outputs for Wikimedia purpose this month. I have two questions that maybe will answer some of them. The first is about the timing. So how many years after the publication, year, do you think the data is reliable? Sometimes it takes years for a both ways to deposit. The second question was about Open Alex, but we already answered it. The third question was how do you share what's the best way for people to build on this data? For example, do you send new notices to HAL? And the final question was about the topic. Do you contribute upstream? So yes, thank you. I'll try to remember all your questions. So, okay, Grubit, we don't contribute to Grubit directly, but the two other tools that have been developed from, not from scratch, but from other experience from the same guy, so Patrice Lopez. We paid him for the two data sets and soft sites that have been adapted and improved for this project. But everything is open source, so if you need it, you can use it. Second was about, I don't get it, the one with HAL's notice. Like if a publication that you detect is missing from HAL, it's not HAL. No, what we do, so if a publication is missing from HAL, it's not the perimeter here of the project. But as long as, let's say, a university or laboratory asks us for a monitor about their own science, they send us the list of the publication. And in return, because we have the data, we send them the full list of the publication with the metadata that we calculated. And in this metadata, there is the HAL ID if we match it. So they might, if needed, they might have a way to find the part of their publication that are in HAL and the part that is not. But it might be interesting indicators to add it. You're right. And I forget probably one of your questions. The delay, how many years ago? Yeah, you're right. So the delay, so yeah, the delay in publication, it's always a question. So there is a delay because we grab data from multiple services. There is some delay to propagate the truth between those platforms. Plus, there is the delay when we decide to collect this data because the whole process is quite long. So us, we decide to collect the data four times a year. But we publish the new version of the data only once a year. So and it will happen the end of February this year. So we mentioned here that the data are from, so the data displayed here are from 2022, collected from 2022. So that's the reason why we only display public publication until 2021. So we decided, so in 2023, we published the monitor based on data on 2022. And this data and only the publication from 2021 and are taking into consideration. So it's two years. So yes, it's two years. And that, guys. Sorry, I think too long. Yeah, we've run out of time. Yeah, thank you.
Infra Finder: Increasing visibility of open technologies for open science
Thank you so much. Hi everyone. My name is Emmie Tang. I'm the Engagement Lead at Invest in Open Infrastructure. And my name is Chris Wu. I am Product Lead at Invest in Open Infrastructure. And today we're here to introduce InfraFinder. But before that, if I can just take a minute to introduce what Invest in Open Infrastructure is. We're a non-profit initiative. We work to increase the investment in the adoption of open infrastructure to further the equitable access and participation in research. We do this in a couple of ways. We build tools and recommendations based on our research and evidence to help decision makers and funders make informed decisions about their investment into open infrastructure. We also catalyze investment into the space by running funding pilots. Well, this looks like in practice for us. Our work currently is centered around three core programs that you can see on the screen here. We are catalyzing investment into the space. As I said, running a couple of funding pilots at the moment to look at how we can diversify the funding mechanisms and players into getting them to invest more into open infrastructure and to build towards a healthy, resilient and sustainable future for research and scholarship. We also partner directly with open infrastructure service providers and funders and adopters to provide tailored recommendations and engagement based on our research. And last but not least, we have something called the Data Room, which is all about tools, actionable tools and recommendations that, again, institutions and funders can use to help guide their investment into open infrastructure. An Infra Finder that we're about to share with you is part of the Data Room. So, Infra Finder, why are we building it? What is the problem that we see? So, we know that open infrastructure can really advance open values because it's customizable and it can be adapted as the community's needs grow. And we also know that it is recognized to be important by international policies and recommendations like the UNESCO Open Science recommendations and also in the US, for example, the Nelson Memo on open access publishing and infrastructure for that as well. So, the question is then for us, how can we increase effectively the adoption of open infrastructure? To start thinking about this problem and this question, we decided to first look at who we want to solve this problem for. And so, we chose the target audience of institutional decision makers. So, these could be libraries who are choosing to adopt a certain repository service. It could be library publishing teams who are choosing a publishing software. We choose to work with this audience because we want to really make an impact at scale. So, institutional level adoption of infrastructure means that all the researchers within that institution is then using that open infrastructure. And we also have had long-standing relationships with institutional decision makers. They are some of the strongest supporters of our work and we've had years of engagement with them so we do understand them a little bit better. So, that's where we're starting at with Infra Finder. To start understanding how our audience see the problem, we took months to do user research and focus groups and interviews to try and understand what is going on in their heads, what is going on around them as they make decisions around adopting infrastructure like repositories and publishing software. And we find that actually finding the right infrastructure for them is a really, really hard and slow process. They need a lot of information, information related to costs, for example, how much money will they take them, how much staff time will they take their team to be able to maintain the infrastructure, to upgrade the infrastructure in the short, medium and long term. They are also trying to risk the purchase so that they don't want to understand if the infrastructure service is going to be maintained and continue to be in existence for a foreseeable future because you probably don't want to buy infrastructure that will stop existing tomorrow. They also want to see if the infrastructure will fit into all the other things that they've already got in place within their own institution and libraries. And last but not least, there are some institutions who are thinking about how can we use this infrastructure decision to advance their institutional vision for openness. So what I'm trying to say is that there are a lot of considerations and every single institution we spoke with had a different priority in mind. They all rank those differently. The information, but for everyone, the information is scattered on the Internet and it's very hard to piece together the information that they need to compare different surfaces. And also it's just really, really time consuming. It's easy, someone's months of work to be able to pull this information together to put it into a comparable form to show it to their management and to convince them that they are making the right decision. So this is the problem that we've defined for this first release of Infrafinder and the one that we're trying to address right now. And I hand over to Chris to elaborate what we've built. Thank you, Amy. Okay. Let's see if this works. All right, so thank you for your attention so far. It's my pleasure to introduce what we're calling Infrafinder to you. This tool is meant to help folks navigate the complex landscape of infrastructure services and standards that enable open research and scholarship. So what we wanted to present to you folks in this first iteration was up-to-date verified information about various infrastructures. We also wanted to make sure that the key information that they had told us that they were really looking for, that pragmatic information we'd shown you on an earlier slide, was in one place. And then also we wanted to provide that easy to use comparison view. So I'm going to show you designs and the prototype. Right now we're still in progress, so this isn't quite live yet, and it's a little bit buggy. Subject change, but it is still my pleasure to show you this preview. So this is an example of a page that we're calling the Individual Provider Service Page. So an individual provider is invited to give us information that we are presenting in this way. You can see it allows the provider to show us mission, their key achievements, technical attributes, and as well as information about community engagement, which is so important to the open vision. They can also tell us about policies that they have, as well as additional information about the organization itself, and at the very end they can also describe their own funding needs. So this is the Individual Provider Page. We have worked, we invited 87 partners to work with us, and we received responses from 57, 56, yeah, 56 or 57 of them. So we've already collected this data that is shown, this will be presented when people first come to InfraViner to be able to see what's here. We also have developed a nurturing process that will allow this listing to grow over time and also will be doing our verification process as well, checking against publicly available records where they are available to ensure that the information is both up to date as well as verified. And in the end, users, when they go to the site, will be able to pick services. We're choosing to give people the option to choose three at a time, because three is a good number to compare against. So this is the comparison page, and it kind of boils down all of this information into a way that people can see this high level info side by side so that, for example, they can see, is there an open product roadmap of the services that they've chosen, which groups have them, and which groups do not. So again, high level overview, but I hope that this gives you a taste of what it is that we have in mind. We are aiming to, so most of you who are working here on what we would think of as the provider side. So what's in it for you? If you are interested, being part of InfraViner allows you to showcase your key achievements and highlight your funding needs. In addition, you'll be able to reach more potential adopters as well as funders. And finally, you can learn from and connect with other infrastructure services. In fact, when we sent out our initial invitation of those providers who chose to respond to us, they actually were telling us as well that in seeing and hearing about what kinds of information users were looking for, it kind of sparked this idea that maybe they themselves would want to start to work on some of the things that they had put off, knowing that potential users were in fact looking for that info. So you can, in fact, complete our expression of interest form if you meet our criteria. You can add your service to InfraViner by going to this URL or the QR code. I'll give you a minute for those of you who've got your phones out. Okay. And yeah, in terms of next steps, this is a beginning. When we release this, it's not the final product, it's not the end of the conversation. We are aiming to release in a few months' time, and we'd love to start and continue the conversation from here. So one of the things that we've been wrestling with with regard to this is how to ask about interoperability. So as we mentioned earlier, one of the things that we discovered users wanted to know is, like, does your thing work with our situation? Most people would call this interoperability, but not everyone understands interoperability in the same way. So I'd love, Emmy and I would love to talk with you about, like, how we can ask the question about interoperability in a way where we can collect data that kind of essentially, like, helps us understand, you know, understand how you think about it in the context of how users think about it and collect that data in a way that we can then standardize. So yeah, so users want to know if your technology will work with the other things that they have running and the answer isn't always, we'll just use our API is what we've discovered. So please come and talk with us afterward. We're going to be here for the rest of the day and as well as tomorrow. So anyway, I'd like to thank you again for your time and you can sign up for our newsletter for updates about all of the work that Iowa is doing as well as the launch notice for Infra Finder. And should you want to email us, you can reach us at infra-finder at investingopen.org. Thank you.
Best practices for research in open source ecosystems
Okay, we'll get started. Thank you everybody for being here today. We're so pleased and honored to join you and talk about a project that our research team that we both work together on has been thinking about for a long time, which includes questions like what information is missing when you move beyond a repository looking at open source. I will do my best. Something happened last night. I can't explain it. Yes. So in this talk we are going to explore some stories about what can go wrong and how we as researchers and practitioners in the community can work to find best practices for open source ecosystems research. As it helps you, our slides and speaker notes can be found at and I will spell it out, BIT.ly backslash B-E-Y-O-N-D dash T-H-E dash R-E-P-O dash F-O-S-D-E-M-2-4. Hi, so I'm Julia Farioli. I work at AWS as their open source AI ML strategist, but that's not what I'm here for today. I have had one foot in research and one foot in practice for basically my entire career and I'm especially interested in what motivates people within open source as well as the potential for the understanding of motivations to increase resiliency of our digital infrastructure. And hello, I'm Amanda Kaseri. Please keep reminding me to speak up if I forget. I work in Google's open source programs office as an engineer. I also continue to work around data and AI and I'm also a researcher and I have an emphasis of working at looking at open source through an intersectional and a feminist complexity lens. And our co-author who could not be here today is Dr. Juniper Lovato. She is a multidisciplinary complex system scientist who is a research assistant professor of computer science at the University of Vermont. And her work explores the ethical and governance issues related to socio-technical systems. So through our work, we were struck repeatedly by how much of existing research around open source and digital infrastructure lacked context from those working within the ecosystems. We also saw how these research findings were making their way into open source ecosystems and infrastructure, even if it wasn't necessarily applicable to them. So these observations led us to establish at least a start of some best practices for research in open source. Combined with others that we as a community collectively develop together can help us better study open source ecosystems as the socio-technical systems that they are. So, I mean, just basic fundamental assumption here is that open source is and has always been much more than a repository. It's a complex multi-level ecosystem of human contributors who collaborate and cooperate to achieve shared creative endeavors. And we, collectively are also part of open source communities in our work and our passions. We're collectively a dynamic socio-technical system that is always in production, both people and technology, and evolving towards a distributed goal. We're also, unfortunately, and I say unfortunately just because of some problems that come from it, a very attractive research space for scientists, especially scientists of technical systems and science, because open source ecosystems are so data rich, have such a long history, and have many exciting applications for understanding society as it is, whether it's governance, cybersecurity, team dynamics. However, we frequently see science that does focus only on repository data, and that gives a limited snapshot of the wider ecosystem and explores and ignores many of the explanatory variables of social systems. And that's just one missing data point that we're concerned about and have done some work on. When we are talking here about open source ecosystems, the reason we stress on the ecosystems piece is because we're referring to the collection of repository technology, infrastructure, communities, interactions, incentives, behavioral norms, culture, and studying these as a whole requires community cooperation and participation to understand all the interacting and the interdependent parts of the system. So at the heart of all of our ecosystems are humans, it's us. And our collaborations and outputs reflect social, emotional, and technical labor of the group of individuals moving towards, again, the shared and distributed goal that we all have. One still unsolved problem, and I want to stress both of these, is when both industry researchers and academic researchers, two separate groups, different outcomes and incentives, overlook data from open source ecosystems as part of a review process. And this is because data from open source, if you're not familiar with institutional research boards or review boards, data is usually obtained through scraping or APIs, and it's considered this category of what's called secondary data. And secondary data by some rubrics is not centered on humans, it doesn't require consent from research subjects. But when you are moving data about people, that is inherently data about someone and transforms them into a research subject. But when this happens, we as a community, we as individuals, are unaware that we're being studied. And then a paper comes out talking about the open source project that you worked on, and you're like, well, I didn't, first of all, I didn't realize any of this was going on, but also do I agree with this? And you never have the opportunity to give consent as a research subject in that case. Juniper, so Juniper actually just recently finished her PhD, we're super excited and happy for her. And she has been working in the field again of like data ethics and looking at this, this intersection. And she shared with us the advice from her PhD advisor, which is just because something is permissible, does not mean it is ethical. And so for example, just because open source repositories are public, sometimes they're permissible to scrape, but it doesn't always mean they're ethically fair game for any use without the community's consent. But I would like to give a positive example of this, by the way. So there was a group of researchers in 2022 who did work with the community to learn more about what was possible with repository data. So their research questions did center around the repose themselves. This is in 2022, it's Coutilla et al. And they published an excellent paper that was looking at open source and maintainer well being. They did a mixed methods approach that they looked at quantitative signals, they did a diary study, they did interviews. And they actually determined that it was not possible to determine maintainer well being from any of the signals that they studied purely off of the quantitative work. Now this usually doesn't get published. So right normally there's a hypothesis and they find that it didn't work out and then it goes into the clock, like it goes like somewhere else. But this time they actually published that like no, these things are not correlated, you cannot find these. There's too many individual confoundings for us to be able to say these will effectively work with the community. And that's the kind of research, honestly I would love to see continue to involve and encourage and for us to participate in, because it's breaking down the mental models, not only that exists of this rich community, but also that might hinder us or be used when out of context in a way that does not serve us. And that allows us to build trust with folks who are trying to understand and for ourselves, understand ourselves better in a way, again, that is not just being observed, but that is participatory. And that's because open source data is not just code. It's the collaborative labor of a group of people. And for all of this, we just want to make sure that we are working as researchers as that best practice piece, is not to throw away the socio element, don't ignore the fact that there are humans as a part of these problems. And that also helps remember that we should be treating each other with care and respect, because when we become part of that, we are ultimately part of the same system. And now as researchers and new practitioners, we're at this really critical moment. I mean, and I'm sure everybody's been talking about critical moments all day, open sources and critical moments. Research as well is in some very interesting critical moments, especially questions around open data, how data should be open, should it be open, what are its use cases. And community members are themselves subject matter experts. We should hold a wealth of knowledge and lived experience. We know the system's best. As collaborators, we actually lend much more experience to the projects as opposed to being silently studied or alone. So involving communities through participatory methods will help researchers better understand the systems they're studying, what's missing and what is truly available for purposes of research. And so another point we'd like to give around is this concept of, I talked a little bit about context. So Dr. Helen Niesbaum puts forth the concept of contextual integrity. And that is the idea that the protection for privacy is tied to the norms of a specific context from which the information is gathered. We have a lot of, you know, there's memes around about like overheard out of context, things out of context. But it applies differently here when we're talking about research, which also now fundamentally impact things like funding, people's well-being, your jobs, your ability to advance in your career, being recognized even for having done that work, as opposed to being invisible in history. And so we want to just emphasize here Dr. Niesbaum's idea that a central tenet of contextual integrity is that there are no arenas of life that are not governed by norms of information flow. There are no information of spheres or spheres of life which anything truly goes. And so we see this breach of contextual integrity in cases where data is taken outside of its intended environment used for another purpose. The phrase that sometimes gets thrown around is, but this data is already public. It's already out there. Can I use it for anything that I want to? I mean that leaves just enough room for circumvading pretty much every ethical issue related to data that is found online. In 2016, a specific example, the open data project known as GHTorrent, which was one of few community projects which hosted a structured history of GitHub's activity information, had a lengthy discussion and an issue about sharing aggregated data using GitHub's email, user email addresses. So they collected everything together and they shared it all out. And then as part of those commit messages, as part of that metadata, was people's just regular email addresses, not hashed, not changed. And that aggregation and sharing was being used without consent from those individuals because they were finding themselves the targets of mailing lists. So people would get together the list, they'd scrape all of the emails off of it, and they might do it for something like surveys. Like even researchers asking people like, hey, we'd like you to be a part of this study. But they didn't go to GitHub, put their email out there because they wanted to be contacted for a study. So this is a very long discussion. This point here actually will take you to, thank you so much for Julia for finding this, it will take you to in the Internet Archive link because it's no longer part of that initial Git repository that issue's been taken down. And I would like to just emphasize, so the GHTorrent actually is no longer being maintained. The previous website's no longer applicable. You can still see the repo as it exists on GitHub Archive. So I'm grateful to the Internet Archive for saving that information so we can understand that more. But I just want to add the caveat because we're using this. There is a hyperlink to ghtorrent.org. Do not click on that, it will take you somewhere you do not want to go. You're all about transparency. But that issue around email addresses on a platform, mailing data lists is an excellent example of why this may be controversial to say here. Openness in and of itself is not necessarily always a good. The release of raw data is of course good in the strict sense of reproducibility and of transparency and of working collectively in a community. In other contexts, openness may harm people and public trust that people have between researchers and the community themselves. So we just always need to strive for that balance and ask questions around openness, ethics, and privacy, especially in consultation with the people who exist there. Thank you, Amanda. And I think that's a great segue into the idea that researching open source software is ultimately research about the people behind it. And yet the data about the software are far more readily available than the data about the people. And sometimes that's for very good reasons. But when doing research in and around open source ecosystems, we need to make sure that we're not exacerbating or reinforcing inequalities in the existing system by failing to question what is absent from the data. So plenty of research has already been done into who contributes to open source and why and what benefits they see from it. And these benefits are not necessarily that insignificant. A fair number of people have jobs because of their work in open source. A fair number of folks get sponsorship for their work in open source. And a lot of the ways that people decide who to sponsor or who to recruit for jobs is through what is visible in the data. So if you do work that doesn't get captured in the data, then you don't receive those same benefits. One of the things that I think a lot about is how if you are not getting paid for your work in open source, you are basically paying to do work in open source. And that isn't because you're like literally handing out money, it's because you're spending your free time. And free time is the currency. In 2017, Lawson wrote a fantastic post about time as currency saying that I've already told my partner that if and when we decide to start having kids, it will probably quit open source for good. I can't see how I'll be able to make the time for both. And in 2019, this was reflected in a paper by Miller et al. And it has a delightful title, which is why do people give up flossing? Which I just I love from a pun perspective. I think it's great. But they found that for all contributors, occupational reasons such as major life changes were the most cited for leaving open source significantly more than lacking peer support or losing interest that are more commonly discussed in the literature. When we are looking at who is present in the data and who isn't, we need to understand and make sure that we're keeping at the forefront of our mind that the economic incentives and the availability, which is also a little bit of an economic incentive, of the people who keep the lights on are not evenly distributed. One of the papers that I want to see, free dissertation idea, is why do people never start flossing? We don't have research on that. So if we think about who leaves open source and why, what are the barriers for people who never come in? I think we're at an open source conference, right? Okay, so open sources everywhere. It powers mission critical systems. We know this. It's everywhere from space exploration to social networks to insulin pumps, which still terrifies me. But every person on the planet is affected by open source software, whether they know it or not. And that's critical to keep in mind when we're thinking about research and ecosystem integrity. This example may have crossed your radar. So in 2021, there was a retracted paper where researchers submitted known flaws to the Linux kernel. They had absolutely no intention of allowing these flaws to be merged upstream. But there was a lot of pushback. And in 2021, Greg Scott was quoted as saying, these researchers crossed a line that they shouldn't have crossed. Nobody asked them to do this. A whole lot of people wasted a whole lot of time evaluating their patches. And I think this is a really interesting example, because from the perspective of the researchers, they did not see an issue with their approach because they weren't going to let anything affect the technical system. Nothing was going to be merged. No flaws were going to be incorporated into the Linux kernel. They designed it to be an encapsulated experiment. But they failed to realize and take into account that there are people in that system. And they focused on the integrity of the system without considering the people or processes involved. And so we don't know how this would have worked if they had figured out a way to get consent for this experiment. But we do know that the way that they went about it made a whole lot of people really angry. And it did get them banned. The entire university was banned from contributing to the Linux kernel. Which is, I mean, actually I would put that on my resume. That's pretty impressive. Hopefully that ban will be lifted maybe in four or five years time. But these ecosystems are always in production. It is impossible to know where your software is being used. Because as open source folks, we tend to hate telemetry. And so we just have to rely on the systems that we have established and make sure that we are treating them with respect. So running behavioral experiments, which that wound up being even if it didn't intend to be, but behavioral or technical experiments in open source ecosystems may impact the world's infrastructure in unknown and immeasurable ways. It's difficult to know the scope of your research in an open source ecosystem. Small changes to one part may be what breaks something extremely important that you just had no idea. So we do need to treat open source ecosystems as systems that are perpetually in production. So what do we do? This is like four best practices. What do we actually take away from this? Well, as researchers and practitioners, we need to work together to provide practical context for research approaches, results, and recommendations. We need to consider the ramifications of research upon the ecosystems being studied, as well as the culture and individuals. And finally, look beyond the repository for factors that may influence methodologies and findings. And we acknowledge that again, like wearing these many hats, sitting in these many seats, this is a learning process. So science is all about understanding and finding out new knowledge. But that's also science as an ecosystem that is also always in production, learning because those ramifications of what you learn can have impacts down the road. We're all trying to figure out how do we use data that is online? How do we put our own data online in a way that we opt in an opt-out of? How do we get control? And how do we be responsible about it in terms of others that we're working with, especially within a community? In these transition periods, where things seem confusing, just again, we want to encourage that that's a point of communication should increase and not decrease. The worst thing that we can do is to start to shut things down where we start to silo ourselves, as opposed to come together and working closer with each other to strengthen that humanity as a part of our shared experience. We have an amazing gap to bridge the, amazing opportunity to bridge the gap between people who want to understand what is happening in software and science and technology. And by being those ones who can participate as a part of that, and that's really cool. It can open up opportunities to welcome more people into the open source ecosystems because those are more scientists who are then contributing their own code, contributing their own data, contributing their knowledge as a part to lift us all up. And so thank you so much for having us here today. We wanted to make sure as the last slide, you get all of the references that we have talked about today. And then of course, these slides as well, in case you want to do hyperlinks clicking in there. I do want to just point out that second bullet point also is a link to the full paper that this presentation is based on. So you are more than welcome to read the additional best practices in slightly more formal language than we've used today. Thank you. Thank you. We do have questions. I've got a minute. If you want. There are a lot of them, so please, Judy, one. Judy, one. That's an excellent presentation. The data is actually talking about mixed data. Is there also an element of jurisdiction about you? Because I'm going to sort out why researchers, but exactly described as precisely one where we told them you can't do that because of wireless GDDR. So I'm wondering if there's a jurisdiction versus ethics, and was the case with the Great Souls at the US of Minnesota? Yes, that was. So the question was that are there jurisdictional implications for web scraping and the data that you obtain through web scraping? And I think this is where we both say we are not lawyers. No, no. And so this is a place where, yes, there are most likely jurisdictional considerations. What they are is where we would go talk to our lawyers. Yeah, and I also think you brought bring up an interesting point because we did try to differentiate and we acknowledge that. So there's industry research. There is academic research with institutions. There's government research that are funded by government agencies. Like research becomes a throw all. But each group does have its own ethics boards, legal processes, requirements for sharing and working with information. So yeah, that absolutely does exist. There's not one universal standard. But it's also why we're trying to talk about this in terms of practices as opposed to here is our like resolute suggestion on the commandment. Yeah, here's the rules that you should be following. Thank you for bringing that up. That's absolutely a great point. Yeah, thank you very much for the presentation. If you want to make a point and then you can comment on that, which is super important that we have all the research into open source. But speaking as a researcher, it was literally a gold rush, get up to mine, and we've got lots of silly papers. And it would be better if researchers focused on what's really important. That's not necessarily always open source. So anyone who can provide access to open source data, for example, by being a contributor who works for a company could also positively provide access to the company. We need as much research on how the company is in front of you, even if it's for prior check of work. That's my opinion equally that we need it. And because researchers for the low main groups, you don't have any of them, most of them have a developer. So I would not encourage anyone if you're willing to work with researchers to find ways to maybe also get a company interpreter that will improve the world for us. So I will briefly try to summarize the comment there, which was that research into open sources is good. There was a bit of a gold rush when we first had access to GitHub data. But we also need to do research into the proprietary side of software and software development, because that's important as well. And I would agree with that. It's speaking as somebody who has worked in big tech for quite some time now. There are some significant challenges that I personally am not sure how to overcome to doing that. But it's on the wish list, I think. Yeah, that's an excellent point around like when you need to do studies and you only have what is either conveniently available or ever available, that sometimes you're limited into what you can find. I also think it does limit the questions you can ask. So I think where I would agree that it will be interesting, hopefully, more of those collaborations between researchers and all kinds of developers, whether you're within a corporation or without. My concern becomes when, and I was a reviewer as a part of the mining software reviewer, mining software repositories conference this year. And the challenge that I saw with more papers than I'm used to was just a complete mismatch of understanding data and research questions and the hypothesis of whether or not that was something you could even ask using that data. And so that's a challenge as well, is just having people who understand what is possible of asking of a certain question. There's a connection to open source, which is really nice in company and trial development or, which is called inner source. No. Sure, yeah. I do believe we're out of time. Out of time. Yeah. But we will be outside for additional questions if you want.
PHAIDRA - A Repository Where Research Data Goes to Live (Not to Die).
Thank you. Thank you. Outside for additional questions if you want. Thank you. Thank you. Thank you for a great talk. Thank you. Thank you for a lot of that. Yeah, this is an honor. Yeah. Welcome. Do you have an adapter? I've lost one. Okay, I don't know if we have a USB-C here. Please do not take it away from your needs. What? Here you go. I have an adapter if you want. There was one on the table. Let me check if it works. Really? Yeah. Does it look like it's working? It takes a bit of time. Try it. So again, please, let's try to move all the way to your... to this side of the room. Everyone look to your left. If there's an empty seat, sit in it. Is it pronounced... Mancanguli? Real Mancanguli, yes. I'll introduce you very briefly and let you... Okay, yes. Yes. Okay, so we have now Ramón de Gullir, who is going to present... without further introduction. Yes, thank you. Can you hear me? Is the mic okay? Okay. So thank you for being here. I'd like to present also a repository, but it's not a software repository. It's a data repository, which is a kind of dual what we are now using very often in research data management at universities. And who's talking to you? My name is Ramón de Gullir. I'm working at the University of Vienna in Austria. And I'm heading their department, which is called IT Support for Research. So I'm based at the Computer Center and focusing and helping the researchers to preserve the data long-term. So I'm not a researcher, but I'm in the support shoes. And during my work, we are also responsible for a repository for keeping the data safe and preserve them long-term. And for this, we have developed our own system, and I'd like to present to you today about the system, what's about, and why we did this. And didn't go to something else, and we are developing still by everything by our own. First of all, the origin of our system, FIDRA, we started quite early. So the project of FIDRA started 2007. During this time, there haven't been a lot of data repositories out, so we could use. We looked and researched about what kind of systems they are available. And we found a solution that we can build on, but it wasn't directly fit what we wanted to do. This was FIDRA. And FIDRA is still the base of our repository as well. And the main idea and the goal was to find a home for all the data which are around the university. So for this project, there was a combined project from the library. So the main idea came from the library, but from the beginning on, we have the collaboration with the computer center. And also, and that was the interesting part from the beginning, there have been a service unit which is called Center for Teaching and Learning, also involved. They dropped out after a few years, but now in the last three to four years, they came back to the repository issues as well. So we are launched very fast, our first repository. It was 2008, so it was really bringing the people together, looking what kind of systems we want to use and then build something new and then we go to this repository. So and then because it was a very rushed start, the system wasn't that good developed. And the problem we have seen this as new partners came in and these partners now wanted also to use the same system what we have and then the problems arise. So we rebuilt everything from 2011 and these were also the very early start at thinking we should do everything as open source and open source project. And then national project came in where we also developed based on the demands that came from the project and we go further and further. And the thing was that we are our own biggest customer, so we built our system, but we also developed this for some other fader partners as well. So what is now the idea how we are doing this and how we are doing the development. So as I told you, I'm also research support center, so I help also researcher. I'm connected with their needs and this is also key for such a system is identifying the needs and managing the different needs. So we have really different needs. The main idea is preserving the data, but the point of view from a library is different than the point of view from the researchers as well as the point of view from a funding agency, which are also a big stakeholder in the whole ecosystem of research data management. And also the IT support, the IT center as well plays a big role because they have to maintain and operate all the system, which the other people want to know. And the thing is also they want to know which kind of data are in their infrastructure because we are having a lot of spaces on our shared data where people drop out. And then nobody knows which kind of data there are. And if we're using a kind of data repository, we are knowing about the data. So that's mainly the idea. Over the whole years, what we have learned and what we are trying to improve is the needs are key. So we are started from the perspective of the library and the needs of a library to describe and to also implement the systems in a way that barbarians can work easily with data and preserve them to do their work. And then suddenly we are presented this what we have to the researcher and they had completely different needs. So balancing these two things is not an easy thing. So our approach and our idea is that the data are the key factor of these and we have to manage the data. But the viewpoint of the data can be very different. So also in this way, then we have thought about how we can make different viewpoints. And this we are going to enable an easy to use or an also an easy to self develop a new interface, a new user interface to the data that you see all the data. So we have different approaches, different kind of data, but we have also in the repository and from these different interfaces as well. And these are also the thing what I'm why I'm here is what we have learned about open source is just putting an open source license to a piece of software doesn't mean that's an open source project. So open source project is something else open source project is to engaging with the community and to bring all the things alive. And this is what we are now heavily working on it and bringing also to our software and new community or not the new community also the community who are wants to bring the software more to life and to develop this further. So now we are that that's the situation what we are now in when we have some ideas what we want to to to go to the future. So working with the community of data management is is really the key and we will keep to this. So we are part of the data management community because we are the user of the repository as well. We are engaging with the researcher. We are engaging with the people who brings us the data and we know that we have just covered a little bit what what's necessary for the future because we can handle. Yes we can handle files which are household files that we have on a laptop print into the repository describe them. But there are a lot of challenges out there how do you go with big data how do you go with databases how to go with applications. So there are a lot of open question and these we want to work with the community together to find solutions. And from the digitalization point of view is that keeping the data secure secure and keep and preserving the data long term is very essential because the viewpoint is now going to the data. And we need we have a lot of knowledge into the data and we need also to preserve the data so and just to have the data and not the knowledge around this is also not what we want. We need the whole growing of of of the data management and this is just one answer to the big challenge we are all facing when it came to data management. So. What are now what what are so to to to wrap up. These is it's just the beginning. So I know it's not the perfect solution but because that's I'm here we need people to look at these what we are doing and then say OK in which direction it's necessary to go to improve. And there are a lot of challenges we need to master and it's the best way to go in the community to challenge all these masters all these things. If you want to get in touch there are some links. So there's the link from the repository. There is a community site. Then we have the GitHub repository as well. We have a podcast. This is how we engage also with the community. We want to get known more from the community and their challenges for these. We have a podcast. We are we can you can see what we are doing at our university. You have also the links to the other universities as well. We are our partner partners. We have a conference. This is a yearly conference. This is always in November. And if you want to get in touch with me. So it's also my email address on the slide. Thank you very much. We may have time for one short question. Thank you for the presentation. How do you see the connection with the European Open Science Cloud that you listed in the data history? Yes. So as a point, the question was how we are connected with this, what we are developing with the European Open Science Cloud. So we are connected with national projects, but also not national projects. This is also international projects because all which funded by the ministry goes to this national European approach. This is all European Open Science Cloud. So we apply to fair data. We are going to have these these standard the standard APIs which we then can link to also to other repositories or to the European Open Science Cloud. We are also part of the European Open Science Cloud. So the University of Vienna is a member of the European Open Science Cloud. So we are highly connected to all these things what's happening there. Thanks again for the moment. Thank you. Just a moment. Thank you. Thank you. If you do the microphone, I do your computer. Okay, thanks.
Updating open data standards
Okay, so thanks for bearing with me. I'm Sara Petti and I'm going to talk to you today about updating open data standards based on the journey that brought us from the Frictionless Specification version 1 to the version 2, which is currently ongoing. So briefly about myself, I'm the International Network Lead at Open Knowledge Foundation and also the Frictionless Data Community Manager. I love the digital commons and I'm based in Bologna, Italy. I left here some ways you can contact me via email, ex-former in Onos Twitter and GitHub as well. So before we start, I just wanted to give you a quick introduction for those of you who might not know the Frictionless Data Project and the Frictionless Standards about the Frictionless Data Package, which is the core Frictionless specification. The Frictionless Data Package is basically a standard to package your data. It's very simple and very easy. Basically, you package your data together with a descriptor, so containing your metadata and a scheme about your data. There I put a link if ever you want to explore all the specification of Frictionless Data, you can just go on that website. But so the Frictionless Data Package was released in 2016 in its version 1. So some years have passed meanwhile, and it has actually gained a lot of tractions in research communities, academia, but also it has often been mentioned in the Open Data Guidelines of governments and public administrations, and it's often used by data wranglers as well. So we started to think about what actually was the success of Frictionless, why was this standard so successful, and these are some of the things that we came up with. So the first thing is, so the Frictionless specifications were not developed alone in a room, but they were really the outcome of modern 10 years of iteration with community of practices, stakeholders, and also a full engagement on issues around interoperability, data analysis, and data publications. As you've seen from my slide before, the specifications are very simple. I think that that's also one of the key of the success of it. Basically, because they are very simple, they disrupt as little as possible, whatever existing infrastructure is already there. When thinking about actually the Frictionless specifications, we always had in mind as an example the CSP, which is a standard for tabular data, and we think that the key element of CSP and why it is so adopted, it's because it is so simple that everybody can use it. It's not maybe the most adaptable to specific use cases, but it's still adaptable by almost all, and so it's one of the most used actually tabular standard right now. So the simplicity of the Frictionless standard also mean that they are extensible and customizable by design. So they are designed for tabular data, but we have a lot of people in the community that use it for other data as well. We have metadata standards that are much unusable, because of course we want to have to bear in mind that data must be fair, but we also keep in mind that there are human that might want to manipulate that data, and so the metadata standards are also human editable. Another thing also very important for us was not to reinvent the wheel, so try to reuse as much as possible existing standards and existing formats for data. And then last but not least, we tried to build as much as possible something that was language, technology and infrastructure agnostic. Once that was done, we started thinking about the options of the standards, and one thing that became clear quite quickly was that a standard alone was sometimes not enough, and that you need also a technical implementation of those standards. And it's funny because I was talking yesterday with someone from the Frictionless community who was telling me exactly this, that it's so great that we have basically built libraries on top that you can use to perform a number of things on your data, for example validate your data or extract your data, and those are present in a number of programming languages. So I work at the Open Knowledge Foundation where the core Frictionless team sits, and we developed for example a Python framework, which is the first link that you see there, but then the community that uses Frictionless also developed other libraries in other programming languages that perform some of the same functions as well, so we have Frictionless R for example, Frictionless JavaScript, and those all form what we call the Frictionless universe, and here's a website that I'll definitely encourage you to go and have a look if you're interested. So okay, it's all very nice, everybody adopted the standard and it gained traction, why did you need to update then? Well, of course since 2016 issues started to accumulate in the GitHub repository, so basically last year with the core team at Frictionless, we started having conversations with the community, and we started to go through all these issues, try to triage them and see those that were more requested, those where there was more conversation ongoing, those that made more sense because of the internet requirements that came up during the years, and so we decided to start a draft roadmap for version two, and then the second part was okay, now that we decided to update those standards, how do we coordinate this update, and that was probably the part that took most part in as a community manager, and here I tried to resume the key elements of this update and the things that it was important to take into consideration for us for this coordination, for the coordinating this update. So the first thing is of course don't do it alone, right from the beginning it was very clear to us that we had to take into account and bring in people from as many backgrounds as possible, because as I said before, the Frictionless data standards are very simple and they are adaptable to many different use cases, but if you want to build something so simple you need to also hear a lot of people, a lot of, have in mind a lot of use cases, because they can actually help you to build a common data model that will fit then the needs of everyone, or at least it will help you find some minimal common ground. And so when we started our Frictionless data specification working group, we brought in people from research institutes and universities from different academic fields, but also libraries, open data cooperatives for example, and engineers as well. The other thing is be clear, so the first thing that basically the working group asked us was, okay, very nice, you want to do this, but please let's define the overarching goals of this project, let's have a roadmap of this project and let's have it somewhere that it's easy to find. So for us it was quite easy, we have a project website which is frictionlessdata.io, so there we published a website announcing the specs update, detailing the goals, the deliverables, and from there we also linked to the roadmap, which is actually on GitHub, because that's where also the technical discussion with the community is happening in all the issues that you see there. The third thing that was in the beginning a bit taken for granted, but it actually needed some thinking as well, was to decide how to decide, because okay, we sat down with the working group and everybody was like, yeah, okay, we'll do this with consensus. But then we clearly realized that it needed some definition as well, because not everyone was understanding what consensus really meant, does everyone need to participate every time to the discussion, even if maybe it's some part of the specs that's not really important to them. And so we basically decided that PR can be merged in the specs only if two-thirds of the working group has participated in the discussion and has a favorable opinion about it, and that consensus, we understand as consensus when we reached a kind of like solution that everybody can live with. And the blog, it's in the announcement blog if you want to go and have a look. So that's it. So just to give you a view of where we are now, at the moment, basically we had 36 issues that were part of our first roadmap. So 10 out of 36 are now closed already. Of the remaining 26 open issues, 11 already have a first PR proposal, and then 23 over those 26 have already actually an ongoing working group discussion. What we decided to add also as a kind of like information for the community, for the working group as well, but also for the broader community, is to have a public live track also on GitHub live as an issue, and basically you can go there and we update it on a weekly basis to basically have a place where people can monitor the progress. And our aim by June 2024 is of course the release of the friction specs version 2, but we would also like to release a small Python metadata mapper and also some integrations in external systems like SICAN and Zenodo. To conclude, I just wanted to mention that this update is made possible thanks to the generous support of the NL Net Fund, NGI Zero and Trust. There's a lot of fantastic opportunities out there that maybe could be useful for you as well. They found a lot of open source projects, so I encourage you to go and have a look. And then I wanted to thank you for listening to me today. I left there a bunch of links that might be useful. So the first one, the frictionestlitter.io that I mentioned a couple of times, is the project website where everything is linked from. So if you want to find, for example, all the GitHub repositories, you will find it there, but also the different project pages. We have a community chat on Slack, but if you prefer to use an open protocol, you can access it by a matrix as well. I left the website of the Open Knowledge Foundation and also our Twitter handles of the Frictionless Data Project and Open Knowledge Foundation. Thanks. APPLAUSE We have time for one question. Yes, thank you. A very short question. We said, we said, agree on a commentator model. Someone who has repeatedly failed at that. I'll be right back. Thank you. So the question was how to agree on a data model, because that's very, very difficult to agree upon. I think for us, the key, it's of course something very difficult to do, but it's basically to take away all the layers of complications and all the sometimes specifics of some type of data. Instead, what we did was basically collect all the kind of data that we wanted to support and try to understand what the common things were and basically start from there. Of course, again, it is very simple, adaptable, but it is also something that is focused on tabular data. It is extensible to other kind of data as well, but that's the kind of course you have to have a data type in mind. I don't know if that answers your question. Thanks again to Sada. APPLAUSE APPLAUSE
Wikimedia projects and OpenStreetMap as an Open Research Infrastructure
The aim of this presentation is to show how Wikipedia, Wikidata, Wikimedia Commons, the Wikimedia project and also OpenStreetMap and other resources can be used as open infrastructure for research. We're talking about websites that are based on an open infrastructure, so they're based on open and free software and of course they have all content that is available openly. What is also interesting about this ecosystem is that it's incredibly multilingual, so you have a really wide community of contributors in over 300 languages. And even more, it's one of the biggest existing online communities and this is obviously a feature if you want to collaborate with citizens, which is one of the aims of OpenScience, so working and collaborating among people and institutions. Also something that is also valuable is the fact that we're talking about resources that can host different kinds of content. So it can be data, but it can be also images, audio, documents with also a community that can contribute in different ways in improving this content. It can be restoration or improvement of images. It can be also adding captions. It can be transcribing documents. There are many advantages in those projects. Some of those are very well known. The visibility is probably one of the biggest. We're talking about for Wikipedia, 28 billion views per month. We're talking about the visibility that Wikipedia and the Wikimedia Commons have provided to collection like the Met collection, the Metropolitan Museum in New York. It moved from a collection that was viewed two million times to 10 million times. So the visibility of those projects is very impressive. But also we're talking about an international community, a community that has also chapters around the world and the desire of enlarging the community with policies and with funding that have been created. Also we're talking about reusable resources, so a resource that really provides content, information data that are really available also to people that don't have particularly technical skills. And also there are other features like the fair data principles that are applied also on all those resources, but also an attention to new ethical principles, the care principles, or the synergy with open government and with GLAMS, so with cultural institutions. So those resources are already used in research. Wikidata is probably one of the major examples. And the beautiful project, SCOLIA, is one of the examples that you might access that provides information about researchers and topics. But there's been a lot of work related to how to use those resources as a research infrastructure. And I'm just quoting some of the papers related to this and focusing on Wikidata. Daniel Michen has done an incredible job in this. He was also a Wikipedia in residence for the Open Knowledge Foundation. We just heard a presentation from them. He was the first Wikipedian in residence and he worked extensively on open access, improving content also on Wikipedia related to it, improving also the communication of the project among the open science system. And in 2015, there was this project in working on Wikidata as a virtual research environment, which is very promising. It was not financed, but it kind of gives an idea of how the infrastructure can be used and is already used in this direction. Furthermore, there are studies that are highlighting how, sorry, I need to breathe. This is something that I, you all noticed that it's something that I sometimes forget. So going back to, in 2019, there came up this study about Wikidata. So it shows how Wikidata is already extensively used, but he also talked about how art and humanities and social sciences are not very present in the field. And a research about art and humanities used and it's used in Wikidata show how there are projects that use the data, but there are a few projects that collaborate using the data. So create a community that actually upload the data from research and use the data. So I'm just going to present to you three positive elements and three challenges that I encounter in my work related to arts, humanities and social sciences that I think might be interesting to highlight. So for the advantages, the large use of all those resources combined together, so not only using Wikidata, but really take advantage of the different format that those resources allow to upload. A second element is the broad interest for a heritage and museum, so the existing and real attention that are on those projects. And the last issue is the possibility of visualizing and monitoring content. I breathe a moment and then get back to you. So for the challenges, a major one is a copyright and the restriction to public domain, the difficulties of course of collaborating with a community and also the challenge of scaling up and working with the different skills. So the first element, the possibility of using all the infrastructure is particularly interesting for humanities, arts and social sciences because it allows to really take research, resource and data. And in humanities and social science, you have a lot of also qualitative data. So you have interviews, photos, you have site exploration, you have artworks, you have content that comes from archives for example. And you can find on those resources the possibility to upload it. Also working with OpenStreetMap allows for example to enter data that Wikidata would not allow. So the combination of those two really allows you a broader work on those infrastructure. This is an example that comes from the upload of data from the Ticino region, the region in Switzerland. And the upload was done on Wikidata but also OpenStreetMap and with the upload of images on Wikimedia Commons and the creation of articles also on Wikipedia. The second issue is related to heritage. At the moment the 97 nations have participated in this contest called Wikilov's Monuments and they have uploaded an incredible number of data but also they have worked in creating one of the most incredible database of heritage sites around the world. And this content enriches the existing resources but can also be used to evaluate the existence of images and also the presence of heritage in different countries. This is a visualization we've been working on that allowed also to create a research based on the analysis of those data. Another project, another focus of the community is working on content coming from GLAMS. So GLAMS stands for Galleries, Libraries, Archives and Museum. So we're talking about the broad network of cultural institutions and institutions. Consider also that universities have libraries, have collections, have archives. So it's very strange how sometimes the research institutions perceive separately the GLAMS. And there is sometimes a great difficulty in Brigida too. Also a lot of research for humanities and social sciences comes from those sources. So you work on documents, you work on images, you work on collections and this is really a center of interest for researchers in those fields. And the Wikimedia project, particularly after 2006-2008, have really invested a lot of energy in encouraging institutions to become open access and to upload content on Wikimedia commons and also with synergies with Wikidata. And in Italy we did a project in which we contacted all museums. We created the best existing database of museums existing in Italy. It was done in collaboration with ICOM Italy. We uploaded a national statistic about museums. So on Wikidata you can really access all the available data about museums and museums in Italy are quite numerous, as you can imagine. And also they started also collaborating and opening up their content to make sure that also museums were engaged in checking their data and contributing with authorization. This is a topic that I will shortly touch. We created a forum that allowed the museum to upload authorization for the content. In Italy there are restrictions also on public domain. And this forum was also developed with Daniela Scasciatratte, who might be here, so one of the developers, and so to facilitate also the institutional contribution to the project, which is one of the problems with Wikimedia commons. You need to be an individual to contribute to Wikimedia project. So you need an external interface or a system that allows to associate to a user an authorization that gives the authority to that user to upload content for an institution. So it is a step that is still missing. Those data allow to produce research. So you can monitor museum in a country. You can view if they have a person in charge of communication, how is their collection, if it's digitalized or not. And we enter in this a third positive aspect of the Wikimedia project and OpenStreetMap is that you can really visualize content in amazing ways. And visualizing content doesn't simply mean I have a statistic, I see what is there. It means also to visualize knowledge because what is on Wikipedia and what is in Wikimedia commons very often provide you information of what is available as knowledge. For example, images of heritage. In Italy we had a discussion with the Ministry of Culture because they miss images of certain areas of Calabria. So the community actually negotiated with the government, the data, and then they produced content that is now accessible also for the government. So monitoring what is available there somehow provides an image of what is actually available on the internet and to anyone. So monitoring knowledge is also interesting if you are contributing to it and you're contributing to modify it. So if a museum or a researcher is improving content related to the architect Paraviccini you can really see how you made an impact on that knowledge. And it's quite incredible to visualize this impact because normally impact on research is made with completely different criteria. So this criteria is actually something that is the mission of a museum of course to improve knowledge otherwise if their mission was to have a lot of visitors they would offer beer that would make it a bit easier. But also it's really a way of changing the perspective on how you create a researcher that is really available and visible. Now I brief a couple of moments and then I move to the challenges. Okay the first one is copyright. This is an issue that is present on all humanities, social science research. This is obviously a very well-known challenge. So you would expect that for example you take a photo of a monument and you upload it. Actually things are a little bit more complicated than this in particular. The fact that you need to identify what is heritage and what is not. You need the rights obviously of the photographer but there are other issues related to property and also to the rights of the author of that building. If you live in a country that has freedom of panorama you can take photos of everything that you can see outside so you're fine. But many many countries do not have freedom of panorama and it's a right that was not unfortunately made accessible to anyone with European copyright law. So in those countries you need to ask the authorization of the architect that has not been dead for more than 70 years or the artist that produced an artwork. This is a layer of complexity. Furthermore you have layers of complexity that are added to public domain content. This is tricky maybe it will change because in theory with the European copyright law maybe we are moving in the right direction. But in Italy you need to pay a fee for every commercial use of content in public domain. And this is obviously a very complex block. So those restrictions create layers of complexity and makes it more complicated of course to upload content on the Wikimedia project. In particular because those projects really want content that is clearly open and accessible. I still have a lot of time so I should relax. So I want to make sure that I tell you everything that I might know. So we did a project to explore the impact of culture on safety in Africa. We did it in three countries and in particular in Cameroon we worked a lot with authorization. So we uploaded Cameroon, Duane Cameroon has a great production of public art since 1991. So there are artworks that are disseminated in 13 neighbourhoods and it's quite an incredible project because they've been commissioning artwork to international artists, local artists. You can see the transformation of the city through those artworks. So what we did, we uploaded images on Wikimedia Commons. We created data on Wikidata connected of course to the categories. We created a list of artworks on Wikipedia in English and in French. And we uploaded text because all the production of the research project was on CC by SA and with the authorization. So for every single institution and every single author we created a permission that was then sent to the system of recording permission of Wikimedia Commons. And this allowed to create really the possibility of uploading content. Since it was done in Africa it was a bit more complicated so what we did was we had printed a form that the artist would sign, we scanned them and send them to the permission system that recorded them and registered a ticket. So of course I took an example that is particularly complicated because public art in Africa with a living artist with no freedom of panorama is probably the worst you can get. But it's feasible so it's complicated but it's something that is possible to do. Of course it requires a lot of changes also in procedure and also the need of creating processes that allows the upload of those authorization and facilitate also this connection between institution and the rights management. The second challenge is related to collaborating. And now I don't know how many of you contributed to Wikipedia. How many had their content deleted on Wikipedia? This is something that is an experience that I think everyone... So contributing to Wikimedia Project is not easy. It's a little easier sometimes on Wikisource and Wikiquote I would recommend as a first if you want to go on holiday on those projects is quite fun. Also no Viki Voyage can be challenging too. So those projects have a lot of rules, policies and also collaboration is never easy. So everybody that collaborates know that they are challenging in involving other people and also creating processes that are transparent. There are some specific rules of the project that researchers need to take particular attention to those in particularly of course the no original research which is also an advantage for a researcher because you quote the work of everyone and you source everything. Also conflict of interest, the fact of declaring why, how we are contributing, neutral point of view for the encyclopedia and of course Wikipedia is an encyclopedia. So it doesn't provide space for everything. But it's true that for museums, cultural institutions, for heritage, also for improving articles related to territories that are very connected to topics related to architecture and art, this Wikipedia is perfectly suitable. Also you need to consider that research in humanities and dissemination. Sometimes the two boundaries are storing information and disseminating information. Sometimes also scholars would like that the way they store information is beautiful, is accessible because it's something that might interest a broader public. So it's sometimes not sufficient to store a folder on Zenodo. You would like to have an interactive map that allows you to see the building, having access to all the documents on. And Wikimedia project can somehow provide this infrastructure. The last issue is how to make this scalable. So of course working on licensing, working on CC0 for data is an issue. But the upload of content on the Wikimedia project requires a certain expertise. And what I saw in the past is that very often projects worked when there was the community involved. So people that were experts already of those projects. So this joint work and also maybe the model of Wikipedia in residence could be an approach that can be interesting also on the Wikimedia project and OpenStreetMap. Finally I wanted to just mention that I'm working on a landscape analysis of research infrastructure for social sciences and humanities. So I started on Metawiki, that is where we start, we always say, oh you find the don't make it Metawiki. So here you find a list of research infrastructure to make sure that Wikidata has those resources. But the truth is that at the moment there are two problems. The first one is that all local infrastructure or collection databases are not connected and they're not perceived as research infrastructure because they are too small and they don't have national relevance. So having the possibility of bridging those resources and maybe Wikidata can really provide a landscape analysis on this would be very valuable. Also making sure that we know about those is very useful because those are resources that can also nourish those websites. And finally there is the problem that investment by government on research infrastructure normally they focus on implementing the infrastructure and maintaining the infrastructure while also populating the infrastructure is another issue. And there's going to be also a presentation about OpenRefine that is very important and relevant for this because obviously you need tools that allows really to nourish and to connect those infrastructure. I'm done. Thank you. So now you land there and take some questions. I told everybody that I know to not ask questions. Yes. One question they asked the question was if we have an idea of how much data from research feeds Wikidata and is accessible on Wikidata. I think this research of 2019 might give some insight of it. I don't think it was more focused the study on models rather than actual data. It is sourced. So I presume it's something that is possible to is an information that is possible to view on Wikidata. So that would be feasible. It's true that sometimes the taxonomy of also property so the possibility of actually getting a full access to the information is not obvious. Also for research infrastructure one of the challenges that one thing is called the virtual library. The other one is a digital library. The other one is a repository. So combining all those also broader terms makes it a bit complicated to get a full idea of it. Thank you. Another question. Thank you. I enjoyed the talk. One thing I was wondering is there is a link up talk in the chat here because I enjoyed any and Chris's talk talking about opening infrastructure finder and now you're talking about opening infrastructure finder. I wonder whether there is dialogue or anything at all. Yes. So the question was how to connect maybe the if I understood correctly connect the possibility of finding open repositories and how we can connect with this one. It is important to notice that a lot of libraries and repositing existing repositories are already collaborating with Wikidata. So there is a desire. Europe Anna which is one of the biggest repository has a very strong collaboration for example just to mention one of the most well known. This is a repository of Glam's for open research. There are lots of connection associated rather to repository that provide information about researchers or papers. This is something that is implemented on Wikidata quite nicely. But it's true that also in general the investment are not on something like Wikidata. So investment are either repository by topic and at a national level I never saw an investment that is on Wikidata. It is rather in maybe creating some interconnection. So this is something that my but of course I'm also here to actually stress this. I think we should collaborate more with Wikidata that would be valuable, useful and efficient. So that's all. Thanks. If you have any more questions we can welcome them. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you.
Detecting Propaganda on Facebook and Instagram Ads using Meta API
Thank you. So well, this is an open source conference and I'm going to talk to you about ads on Facebook ads of all things and then the Meta API. So please let me explain. Well, so I'm Jean-Glennard, I'm working at Viginum and Viginum is a French operational department, technical department aiming at analyzing foreign digital interference. So we have three big missions. The first one is to detect and analyze this foreign information manipulation and interferences and the social media and online. Then we also lead and coordinate the French protection mechanism against these attacks. And finally, we also contribute to international works and in particular European works to study and understand these foreign digital interferences. So that's why I'm here today. And why do we care about ads? Well, of course, it's hard to talk about ads on Facebook without talking about Cambridge Analytica. So it was a micro targeting scandal, very well known because of the involvement in the Trump campaign in 2016. And then there was a high suspicion of involvement in Brexit and possibly in other campaigns, other elections. And there might be other companies like Cambridge Analytica. So historically, this is really a big thing in this information world. But obviously, after these events, Facebook made some steps to provide some transparency in that advertisement empire. And they did open a web portal to at least let people know what were the ads that could target people. And then we had other operations. And this is a long running operation, the Doppelganger. This operation was already fully reported in September 2022 by EU in the fallout. And it was linked by Meta at the time to Russia. So this is actually a screenshot at the left of the web portal where you can see the ads. And those are free fake pages sending the same ad with the aim of polarizing opinion. And basically trying to manipulate the opinion. And then the campaign re-suffaced with another shape. And we made a report on this in June 2023. And this time, it was like a big network of typosquad websites and media, like big media in Europe, also governmental sites. There were like thousands and thousands of Twitter bots and still Facebook ads. And you see here a screenshot at the top. And then there was this report in October 2023 by Resets focusing really on the Facebook ads part of things. And they found what they estimate to be hundreds of thousands of sleeper pages, pages that could be mobilized to send fake advertisements to people. That was in like a few months ago. So this thing is really going on and on and on. And this is actually a screenshot from last week when I was making this presentation. So you see it's always the same thing, disseminating advertisements with the aim to polarize and provide fake information to people. So that's why we care about ads. All right. So the aim of the talk is to see how open are Facebook ads. And the first half is about how to, a practical guide, how we can actually access this data. And then I'm going to show you some analysis of what we can infer from this data. So there are three official ways to access the data. The first one is the daily report. So basically you don't load a CSV file from a website, like there's a screenshot at the top right here. And it's very limited because it's only ads that are labeled as political. So it's really a tiny fraction of the ads. Then you have the web portal. On the web portal you can see, you can target like specific country, a specific ad category, enter some filters, filter by date, and then you see the ads. And you have to scroll through all the ads. So it's nice, but it's hard to do data science with this because it's slow and it's finicky. It's not usable like for real analysis. You also have the API. So I'm going to talk a lot about that, where you can actually plug into the data and access it. But then it requires registration and coding skills. And finally, there is this open source project. And I really wanted to talk about it because it's great. It's actually an ongoing mirror of the Facebook ad library. And you will see the code online. And you can kind of understand how you can access the API just by looking at the code. And then there are data dumps hosted on Kaggle and other platforms. So it's really a great project. It's not official, but it's a great source of information. All right. So the how to path. To access the data with the API, you have to verify as a meta developer. And for that, you need to provide your cell phone or your debit or credit card information. And then you have to go through another step. And you have to verify your government ID. In my case, I had to send both my ID card on my passport. Because the first one was rejected. And it's kind of ironic because if you want to just send an ad, they just ask you for the debit card. They don't ask you for anything else. All right. So now you have access to the API. What can you do? Well, there is pretty good documentation, but it doesn't document the most important thing, which is the wild card. It's not just wild card thing. It's double quote, wild card, double quote. And with this, you don't have to specify a topic of interest. You can actually get all the ads. And then you can work on it. So it's quite interesting. And then there are two things that I found a bit counterintuitive. You have somewhat unspecified quota when you're associated with each token that you have. And it's like a daily refresh of the quota of ads that you can get. But then you don't have a limit on the number of tokens that you can create. So you can just create as many tokens as you want and then get all the ads that you want. That's the first thing. And the second thing is that when you ask, like in your request for the ads, you have to specify how many ads you want back. So there is a pagination system. And by default, it's something like 20 or 50. So it's kind of very low. But if you ask for a pagination that is too high, then you're just going to have an error. And the error depends on the size of the text of the ads. So then you have to try again with a different limit. So actually, it requires coding to really get the data. So that's two things I wanted to mention. All right. So go through the verification process. You get your code working. And this is what you see when you look at the ads in French language and targeting friends. So kind of a limit is called. One third of the ads are sent on Facebook, one third on Instagram. And then the remaining ads are split evenly between Messenger and Audient's network. Audient's network is kind of, they provide some kind of software thingy so that people creating mobile apps can send ads and build upon meta like ad system. So yeah, when you open your mobile app and there is some ad on it, it might use this Audient's network thing. So most of it is Facebook and Instagram. And then when you look at the volume of ads through time, well, you see there's almost nothing up to mid-2023. And then you have data. When people see this thing first, they think, oh, Facebook, they removed data. And actually it's not that simple. Actually it's the opposite. They made a change in the API sometimes in July of 2023 and they really started to make the data available. Because prior to that point, only the social issues, elections or politics ads were available. And then starting in July of 2023, you have housing and employment, credit and most importantly, the bulk of the ads, like more than 95% of the ads is non-labelled. So it's actually a good thing that happened a bit more than six months ago because now we have the data to analyze. And well, so we've been talking a bit more, we have been talking a bit about the social issues, elections or politics. I'm going to give you a bit more information on that. So it's completely self-declared category. So this is actually a screenshot. I went through the process to publish an ad and stopped at the last moment. And you can check the box if you want. But then there is absolutely no enhancement mechanism if you don't want to check that box. So it's purely self-declared category. Of course, if you check the box, then you're going to probably have your ID, maybe there will be more scrutiny of your ad, maybe more moderation. So obviously, people that want to do this information campaign are not going to check the ad. But can we make a model that would tell if an ad should actually have checked the box, should have declared itself as political? That model could be used to identify the doppelganger ads I've been showing you. And it could also be used to estimate the volume of ads that should be labeled as social issues, elections or politics. So actually, we can. And I'm going to show you how I built one. But just at this point, how many French ads we should expect to be social issues, elections or politics? We do have some ideas from past work. So in Brazil, this work, academic work, found 2.2% mislabeling. So non-labelled ads that should be social issues, elections or politics. And that other academic work found 2% to 4% mislabeling in the US. So the same thing. Non-labelled ads that should have been labelled. Okay, so to build the model, what we need is some data and then some technical way to build a classifier. So the data I used is from this academic paper. And it's great because it's in French language, but then you have other data sets in English and in other languages. So that's my data sets containing real political ads. And then we also need a classifier. So I'm using a large language model because everyone is doing that these days. And I took specifically the Mistro AI 7B model and I optimized it with the Unclosed Library because it's very fast and it's open source. And just to give you another magnitude, it takes one hour on a free to use Google collab instance. So it's actually quite quick. Anybody can do it. Like resources and GPU or whatnot. All right. To dig a bit more in the process. So to be really clear about what was done. So we have this data set of 4,000 something political ads. And it was collected in 2022 during the French presidential campaign. And it actually categorized the data into nine topics. Could be energy, social issues, these kind of categories. And then we need some control. We need to separate political from non-political. And for the control, I just collected random data in the French scope in late 2023 and targeting the non-labeled ads. So I expect some contamination because some of these non-labeled ads should be political. But I expect it to be kind of marginal, not so important in my data set and it's not going to kind of prevent the model from learning. But anyway, I will have a little bit of performance hits. Then I split into train and test and I finetune my Mistro AI model. And to finetune the Mistro AI model, I wanted to predict between one of these nine categories or non-political. And I do that to get a bit better results than if I just wanted to instruct the model to predict between political and non-political. Because then the model has to kind of think a bit harder about why it's political. So it's kind of to improve the overall accuracy. And actually, the results were good because I got more than 90% precision. It was a very good recall as well. So I was very happy at this point. And then how can we actually use it? Well, the first test is on the doppelganger ads. So that's another example here on the top of the ads, top right. And I got actually perfect classification on this because the text is so obviously political that there is no doubt about it. So all these ads, as I said, are non-labelled by Facebook. And then the model says that they should all be political with focus on international affairs. So yeah, it's working well. And in case you're interested, that's the prompt that I use. So prompt in English asking to categorize into one of the nine categories or other. So non-political. That's the advertisement text in French. And you can see that some words actually, they have spaces inside. So here, chef, it's leader, UE, European Union, there is an underscore. And negotiate here, there is an underscore. So it's as if the text is aiming to kind of bypass some keyword detection. And those ads were not labelled as political. So maybe this is the kind of moderation, of automatic moderation that Facebook is doing. Well, I'm happy because my model is able to identify all these ads. But what if I want to really estimate how many ads beyond those ones should be labelled as social issues, elections, or politics? Well, for this, I took a random sample of like general non-labelled ads. And I ran them through my model. And then I wanted to really double check that the outcome of the model could be trusted. So I asked two human annotators trained on MetaGuard lines to really make sure that these ads should be, according to MetaGuard lines, political, labelled as political. And I find that at least 1.9% of the non-labelled ads should be labelled as political. So it's conservative estimate because some political ads might not be flagged by my model. So it's really at least. And it's aligned with previous research showing 2 to 4%. So I'm not so surprised. It might not seem much, but then if you take 2% of more than 95%, then you see that most of the political ads are actually non-labelled, right? Because the labelled political ads is 0.3%, roughly, in the French scope. So this means that about 15% only of the political ads are labelled as such. Well, that's about it. So what we saw is that Facebook gave access to actually data and more and more data. It can be used. It's a bit tricky, but at least the data is there. And here I talked about foreign interference, but this is relevant much more broadly. Because you see many things in these ads that are not linked to foreign interference at all. And I think the civil society should kind of look into that. You have six months of fully open data, and you have this open source mirror that makes it really easy to at least make sure that you don't miss any ad. And my belief is that this self-labelling mechanism of social issues, electional politics, is problematic in itself. I think for this to work, it needs to be really coupled with a very strong moderation. And as of now, I don't see this very strong moderation. So maybe there are other ways that we can really make sure that political ads are not somehow they don't go unnoticed. I don't know. That's about it. I'm super happy to answer any questions. Thanks a lot. Now we can take some questions. Paul has a question. The first question is like, you have any idea why this change happened in obvious class degrees in the NTI? And the second question is with your model, what do you do with that at the end for the visionary and the third developer? When do you use this? Okay. So the first question was why didn't Meta kind of gave access to the NTI? So the first question was why didn't Meta kind of gave access to more data in mid-2023? And the second part of the question is how do we use it in practice? Right. So the first part, I don't know. It might be linked to the DSA, the Digital Service Act, that is going to come into action kind of soon. It might not be linked at all. I have no insight about that. And the second part, of course, we try to automatize and to be much more reactive to really understand more how Facebook ads can be used and easier for these information purposes. Any other questions? Yeah. Just on the paper you cited there, I can look into it and they use like a more traditional mass language model. Yeah. So why did you choose to use this rather than the probably lighter model? I don't remember this one. It was a nice buy. Sorry. Yeah. So the question was, so this paper used a somewhat traditional model, language model, and why did I use a large language model? Why go through all those troubles? If I remember correctly, this one used naive bias, something, or maybe another one. Well, so for me, large language models are perfect for this task because they embed some knowledge of the world. So it's not like it's not thinking only about, oh, in my training dataset, it's all about Ukraine and Ukraine and Ukraine. So I'm going to say everything linked to Ukraine is actually political. It's more broad and it can predict things like this is kind of political and it's about gender issues. And then this will be generally valid beyond this or that specific issue that is in the training set. So I think the generalization abilities are much better. I didn't yet. I tried to do another approach which had other problems. So I tried an urban approach and with KNN and the data that was also at all. So I don't have a proper benchmark. But the results are good. So why not? Another question. After you find a mislabeled X, do you do something about it? Is there a way to report it? So the question is if someone finds a mislabeled add, what can they do about it? Well, it's a bit disappointing. There is a system and you can click and say this is wrong. I want to report this add. But then you have to indicate precisely the law that has been infringed to say why it's illegal. So you can easily report something about guns or something outright illegal. These texts are just information manipulation campaigns. So I think the reporting mechanism is lacking as of now. Another question. If I understood correctly, you are fine training misdraft on text only. Would it be profitable to use the images because on all your examples there are images. Can you process that in the future maybe to gain more precise? Yeah. So the question is can we also train on the image to have a much better classifier? Obviously, yes. Obviously, it's much harder because then you have different other magnitudes of data. So yeah, we don't do it yet. I thought it was very interesting. You mentioned that Facebook needs more moderation. Where do you foresee that? What do you see the future of that if it was done well? Do you see that as something humans do? Do you see that as something like the models that you've just used or some combination? All right. So the question is how could Facebook enhance the moderation? More hands or more models? Both. Both? Well, both are possible if the hunch is right and if people can really escape moderation just by putting spaces, then there is obvious improvement in having better models. Yeah. It might not be enough. Can you talk about the content on the propaganda you see? Can you teach some group that is trying to do some manipulation on people? So the question is can I go into more details about what was the content of this information? So I could, but it would take a long time. I advise you to read all these reports and there is actually a using for lab page where there are at least 10 or so reports and you will see it's extensive. It goes in many directions and the information is perfectly available. I have a question in the back. From your experience of using the APIs in general, you know that there is a way to actually associate the ads with the contents that were displayed when they were displayed. Okay. So the question is the ad are linked? I'm not sure I got the question. Do you have a way with the APIs to actually link the ads that you were studying with the contents that were displayed around the ads? Oh, okay. No, no, no. So the question was can we infer the context? Why an ad was shown to a user? Was it because there was calling on some, for example, Ukrainian topic? Was it something that they did that was not in relationship to Ukraine or whatnot? No, we don't have this data available at all. And that would be very precious. Maybe one last question. In your experience, from what you've seen, the posts that are displayed for ads are self-contained or they contain links to other platforms. Yeah, so this report. So the question was, was it self-contained? Like was this information only in the ads or was it just a way to have people click and go somewhere else? So this report in June 2023 really goes in details in what happened. So actually there was a system of several redirections through different websites. And then it was JavaScript redirections and then typosquad redirections and then like several layers of redirections to, at the end, bring people to malicious websites where there was more disinformation. That's the idea. And technically it was quite interesting to investigate. So again, if you have time, these reports are fun to read. Okay, let's thank again Jan. Thank you.
Unlocking Research Data Management with InvenioRDM
Thank you. Hello, everyone. My name is Karolina, and together with Havi today, we will tell you about how Invinio-RDM is unlocking research data management. But before we start, I would like to ask you if you see any connection between those three images, and is anyone able to answer that quiz? And Luisa says that you have three seconds to do so. Those are cuts. Those are cuts. No, no, sorry. So what about now? Sorry? Yes, you're close. So the common, the connection between the images is that is CERN, actually, where the World Wide Web was invented. It's located in Switzerland, so fondue and chocolate. And that's because you can see the funny pictures of the Internet of the Cuts. Thanks to World Wide Web Invention. But it's not the only thing that we do at CERN. So we are housing the biggest machine in the world, the Large Hadron Collider, and many more machines which experiments are using. And also we are sharing our knowledge and welcoming visitors. So wherever you are in Geneva, Switzerland, please pay us a visit. We do much more than only physics at CERN. So we do also open source projects, like, for example, the World Wide Web, that it was given back to the public. But that's not the only one. And this is what we are talking about today. So it's Zenodo, which I have been told that some of you know already, but you probably don't know what's in VINIA-RDM. But I will start with Zenodo. So Zenodo is an old-purpose research repository that any researcher around the world can just go and store their research results for free. And it is hosted there at CERN as long as CERN exists. So the question is, why do we need such a place? And this is the answer. So very important, up to plus, the crucial scientific data, many years of research work inside. Well, we don't want to allow this to happen ever again. So we provide a safe space for storing data for the researchers. But not only researchers. We have also integration with GitHub. So you can cite your software, stored in GitHub. And what's the advantage of storing it also in Zenodo is that GitHub allows you to delete your software, but it will be preserved in Zenodo. And we have received many questions about the platform, if it's possible, to take it and install it as it is in another institution. So up to a point, it was not possible. But we have received so many questions that in the end, we have developed another platform, which is InvinioRDM, now the engine of Zenodo. Now it is possible to easily upgrade the software, install a new version, and we are basically supporting the underlying engine. So, Havi, if you were to characterize InvinioRDM with one word, what would you say? That's a good question. So you have to use one word, I would say, that InvinioRDM is fair. And when we talk about the concept of fairness, I'd like to quote our former director general, San Roth-Litter, who once said, why do I like Zenodo? Because Zenodo is fair, fair in the sense of lower case and fair in the sense of upper case. The most conventional use of fairness, which was already covered by this first part of the presentation, is like equitable or just. Now let's see how InvinioRDM embraces and promotes the fair principles that are, that is an acronym that stands for Findability, Accessibility, Interoperability, and Redusability. So starting with the first one about Findability, when we upload our research, one of the key things is that we want to have a link that we will make sure that it will resolve over time, that it's not going to be broken. And for that purpose, we have DOIs, which is a digital object identifier, which is a globally unique and persistent identifier. We encourage people to use their own DOIs if they have one, otherwise there will be one automatically generated and registered using, registered in data site. It's as important to have a nice metadata. That's why we adopted the data site metadata schema, which is simple, yet a powerful format to describe nearly any research output, data sets, software, as she mentioned, journal, papers, anything you can think of. And of course, to find out all this data, we need a good search engine with capabilities, such as filtering options or search variations or powerful query syntax that will allow you to find the data even without the identifier. So these are key aspects, not only for humans to find data, but also machines. So if we continue about accessibility, a very common use case is that we have our data and we want to keep it restricted, but we want still people to find the data. For that purpose, you will make your metadata public, and if people want to access the data, they will have to request access via a simple form, and then you can choose if you want to grant access or not. In the same way, you can also share different links with different permissions levels that will allow people to view the record and the non-pullished versions or even edit to make collaboration easier. Now if we talk about interoperability, one key thing is to follow standards. That's why we follow the one I mentioned, the data site metadata schema, which includes things like common vocabularies, which will allow us to have the same concepts to describe data as other people do and other machines do, so that we make sure that everyone will understand it in the same way. Another important thing is that when we upload our work, we have to link it properly with other data that is also uploaded, and you can do it very easily as well. And if we talk about how machines exchange data, we also provide a strong REST API that allows you to build your own data, build your own integrations of top of the Miner-DM, and we also have an integrated YPMH server, which is a standard in how systems exchange data. If we talk about the reusability, I think one of the key aspects is that when people use our work, we want them to cite it correctly. So here you have different styles of citation that will always include a DUI. The DUI is very important also to track the impact of your work. If you remember, she talked about software citation. We know that 85% of all software citation is on Senado. And of course, having also a clear licensing information, it's also important so that people know how they can use your data under what conditions. And I want again to stress a little bit on the metadata, so having a rich and comprehensive metadata is very important not only for people to reuse their data, but maybe for people to also reproduce it in the future. And since we are talking about the reusability, do you think there is something else that we can reuse? Yes, we can reuse the whole software entirely. So these are examples of how InvinioRDM was reused with other institutions, with our other partners, and as you can see, it's very customizable. Those interfaces are very different from each other, so it's quite flexible if you would like to join this sizable community, that it's still growing. We have many partners around the world. And if you would like to install an institutional repository, also in your institution, you can get to know more about InvinioRDM under this QR code on the right side. Also, you could pass by our booth, it's in the building K, floor 2, floor 2nd, and if you are a developer who would like to contribute to Open Source Project, you can check out our community on Discord as well. We answer questions, and you can see also a growing community there. So thank you very much. So are there any questions? Thank you very much for the talk. I already know it, and I like it, by the way. But I have only one specific question for the... You also said, you have plans to support the process with mixed licenses. Software is usually not just one license, there's a lot of SPDX expressions or something like that. Okay, I will just repeat the question for the stream. So we also were told that it's like our repository, I think that's worth to mention. So the question was, if we plan to provide more licenses, so I think we were very fast here on the slide. There are already many standard licenses that you can find, and they're available, but also it can be customized. So whatever license you need, you can add to the software. If there are more licenses, you have a data file under CC4, CC by 4, and draft code under MIT, so then you cannot simply say from the outside this is only MIT or CC by, so then you need a list or CC by and MIT or something. Okay, you mean if there are multiple values for the licenses attached to one record, do I understand correctly? Okay, if I remember correctly, it was no? You can have multiple licenses. Yes, so you can have multiple licenses, but you cannot map it one to one. You cannot say for the file is this license and for the metadata is this license, so it's not there. Okay, thank you. I think the question is, if I archive the software in Zenodo, how long does the software is preserved in Zenodo? So the question was for how long is the software preserved in Zenodo? So the answer is as long as we have data center at CERN, as long as CERN exists. Okay, but what is the commitment of CERN in order to organize in terms of how long it will last? Well, in terms of contract, can I? Well, for now we say forever, but let's see what future holds. We'll see if the sun goes out. Sorry? We'll see if the sun goes out. Yes. Hello. I am, sorry, out of those compared to other data in photos at CERN, is it more specified for scientists' researches? Or... So what I think the question is, is that if it's targeted on one area, that's this what you meant, of research whatsoever? Yes. So it is not targeted, because it's very, like we said, it's reusable. So we have, for example, universities also installing the software and keeping it as institutional repositories, but these universities might differ in the domain. So it might be, for example, Northwestern University, but then they host many domains. They do a lot of research. We have also installations at CERN, like one is an ODA and one an internal one, an institutional repository, which we are in the process of migration right now to upgrade the version of the software. But I can also, like, come back to hear there are much more many usages. So it's not targeted to simple domain. Okay. Next time again, there's another theme. Thank you.
Making OpenRefine more reproducible
Okay. So we welcome Antoine Delpache, if I'm correct. And yeah, Florey Searst. Thank you. So I'm Antoine Delpache. I'm a developer on the Open Refine project. And I'm very happy to be back in this bedroom to tell you, give you a few news about Open Refine. And in particular, I'm going to be focusing on what I'm working on right now, together with Zoe Cooper, who's a designer on the project, to make Open Refine more reproducible. So I will first explain to you what Open Refine is, because I'm not assuming everyone was here four years ago. And if you were, don't worry, there are some differences that you might be able to spot. I'm very keen to know if those differences look good to you. And also, what do I mean with reproducible in this context? So what is Open Refine? It's a data cleaning tool. So you can import tabular data, mostly, in it. And then it lets you do all sorts of cleaning operations on it. Guess what? So let me give you an example. So this is a database of filming locations in Paris. So every time you film something in Paris, you need to register it with the city, and then they make this data set. And one thing I can do here is to say, let's match all of those films with an external database. And we call that reconciliation. So in this example, I'm going to reconcile it with WIC data that we've already heard about earlier today. And because reconciliation is a bit of a tricky process, we have various options to let you configure how we're going to match your data to WIC data so that we just don't only rely on the names, but also on other attributes that we have in this data set. And we then have various tools to help you make that a little bit efficient and let you review the results of the reconciliation manually. So for instance, here I can hover this and get a link to the WIC data item that it could link to. So that's a sample of one type of operation that people do a lot with OpenRefine. You can then manually match things if you want to go through the entire data set yourself. Let me show you something else. Well, first, once you've done this reconciliation, you can pull some data from the target database. In this example, I could, for instance, do something quite simple. Sorry. Let's just add a new column with the URLs of those entities in the database. So that's something that I can do quite quickly. And you get your new column. You could also pull more information from WIC data, identifiers in other databases, things like that. Let me show you another sort of operation you can do in OpenRefine. This is the column with the directors of those films. And I can try to cluster them. So what does it mean? Well, we are going to basically look through all sorts of values in this data set and try to detect whether they might refer to the same entity. And when that's the case, then you often want to normalize those to one consistent spelling. That's very useful, typically, as a first step for reconciliation. So those are samples of the canonical values you could use. So let's say I want to use all of those suggestions and accept them as valid clusters. OK. So those are the sorts of things you can do in OpenRefine. Now, what do I mean by making this tool more reproducible? So imagine you're a researcher working on some data that you've collected. You're cleaning it with OpenRefine as part of your research process. And at the end, you want to publish a paper about what you did and you want to make your research process transparent. So you want your fellow researchers to be able to inspect what you've done in OpenRefine and ideally even reproduce it on a similar version of the data set. So what can we do for now? The best thing we have for this so far is our undo-redu tab. And as you can imagine, it's primarily designed for undoing things that you've done, but it also happens to list all of the operations you've done so far with OpenRefine. So you could try and copy and paste this in your research article as a way of saying, this is what I did. Now, this is not exactly ideal. So we are working on improving basically this part of the tool. And before we get into reproducibility per se, there's already a lot of usability issues with this interface. And that's where it's been very interesting to work with a designer on this project who was also not familiar with the tool before she came on board. And so she was really able to come with a fresh eye and identify things that I really couldn't see anymore because I've been looking at this for so many years already. So for instance, here, it might not be clear to everyone that you can actually click on those previous steps to go back to them. We don't have any undo button in OpenRefine. We only have this weird undo, redo tab where you can't really click on the undo or the redo, like things like this. And so it's been really eye-opening. What else can you not do? Well, say I realized that this match here was wrong and I want to undo just this operation, but I want to keep all of the following ones. There's no good workflow to do that, but it's very often requested. So let me now show you what we can do with those extract and apply buttons here. So I'm going to roll back here. And if I click extract, I get this interface where I can select some operations I'm interested in and then I get some code for them. And this big blob of JSON is something I can copy and share as the representation of those operations. And I can also reapply them later on on this project or another one. Now, the problem with this is that it's very hard to work with this representation. It's very unreadable. And it's also very brittle. So for instance, if the column names of your new data set do not exactly match the column in the original data set, you will have horrible errors and it will be very hard to do anything with those operations. So that's the core of what we're trying to solve, providing a better representation for those operations so that you can understand what they are and also reapply them reliably. So as a summary of the main goals of this project, make the basic undo-redu functionality just more usable. Then make this reproducibility also easier and effective because we want those representations of operations to be reliably applicable. And also adding this advanced undo functionality of undoing not just the latest operation, or maybe just modifying the parameters of an earlier operation. So that's the main goals. And what do we have so far? Well, you might have already noticed some differences in this prototype, but let me show you another one. So far I've been working on making open-refined operations aware of which parts of the data set they modify. Because the problem is, if you want to let people undo a deep operation, then you need to be able to detect which following ones can be kept or not, or if they need to be recomputed because the data they were working on has been touched. So now that we have this capability of scoping operations a little bit better, you can, for instance, run reconciliation on multiple columns and that will run concurrently, which is something that wasn't possible before. So you see the reconciliation I started earlier, it's only 7% complete. It's a very slow operation because which data reconciliation is particularly slow. And now I can already start reconcealing the other column. And if you see, we already get some results, although the first one hasn't completed yet. So that's already won win. It's not directly about reproducibility, but I hope this will be work on by users because it should save people a lot of time. And on top of that, we've done some research about how other tools represent pipelines or their undo-redu functionality. So this is a screenshot from Talent, another data cleaning tool that we've been looking at. And in those sorts of data cleaning tools, you design your pipeline explicitly on a canvas. So it's a very different sort of user experience. But we've also been looking at Excel, how they let you track changes, or basic undo-redu functionality in Google Sheets, things like that. So that's been also very interesting in trying to get some sort of user experience that our users are already familiar with. So as you can see, this is all work in progress. This is what I have just here, a prototype. We don't have full answers to all of those questions yet. But we're working on this, and we are very keen to hear from you. So if you're interested in those topics and would be happy to test out some ideas with us, we're running some user testing sessions. So you're very welcome to sign up for those. And that's basically the state of the project. And I also have some open refined stickers if you happen to organize some training events in various places. So do also get back to me if you want some. Thank you. Thank you. We can maybe take one question. Thank you for the presentation. So it's an interesting piece of software. But what exactly is the target audience? Because I mean at some point, if you have the data rendering script, it makes the job. I mean, to not get me wrong, it's interesting. But just to know who exactly you are targeting. So the question is, what is the target audience of open refined? So it's a broad range of communities. I would say it's generally suited for tasks where you can't really just write a script upfront, which will do your keening. And it's not really about whether you like programming or not. It's just some tasks where you need to be looking at the data while you're doing the cleaning. As you saw reconciliation, it's a messy thing. You can't really just come up with the parameters and make the matching. You need to be looking at the data. Same for clustering. So it's a mixture of interactive data cleaning and a little bit more automation that you would have in Excel. So basically here the point is the point is the point click aspect for the operations. So the real point is the point click aspect for the user. Let's thank Antonina again.
Node Reduction through Artificial Intelligence Inferences using Graphology and SigmaJS: A Case Study on Hypertextual Conversations in Freight Train Graffiti in the North American Region.
I had my notes here, so I'm supposed to read them because my English is really not that well. So in some parts I'm going to try to just read the slides and in some others I'm going to try to improve. So first of all I want to really thank you guys being here, the Graphology and Sigma developers because thank you, thank you for your work and thank you for being here. I really appreciate that. I'm using that library. I'm also in one moment. So in one moment I also used Force Atlas II that Matthew Jacome is also here. So it's kind of really, really amazing for me being in this chance to at least talk to somebody that has the same interest and that's why in my own country maybe it's difficult to do this kind of presentation because I have to make a long introduction. As I'm going to do right now, but about the social phenomena I'm studying, it's freight graffiti. So this is one of the visualizations we can achieve and this is, as you can see, it's a really hot mess happening there. It's a lot of stuff and we're trying to get this. It's kind of a synthesis or maybe an abstraction too. So in this visualization at the end we'll be able to link the users to the symbolic forms, the symbolic or the meaning they are using to make a community. So I just invite you to have that in mind. We're going from this to this, but before we actually have to get all that information, gather all that information and make it happen in this visualization, in this computational visualization. Something is happening in these train yards. So we have two different stuff. We have to break that, this really long title. I know maybe social scientists were always in verbose mode and we were just talking right and talking right and I see how you guys are really into synthesis and really straightforward. So what I'm going to do here is talking about two different stuff. One, a computational process that is a really fancy name, no production through artificial intelligence, inferences, but it's just a filter. We're just cleaning all the mess. And the other one is about a social phenomenon. This is happening in physical world, real world if you want to say so, but we're trying to take that data also and make it happen in our own framework. So these guys tend to write their names on the freight trains and these trains will always travel from all the North America region and then some other persons will track them, will take some pictures and will post this on Instagram. So this has been happening before social media. So this is a community. It's a practice community. If somebody here know Jenkins, maybe you will know what I'm talking about, participatory culture. So that's kind of the same. There's two, two, that's the same phenomenon happening in two different places. In the physical world, in the digital world, and that's what we called an on-life phenomenon. So the case of study for this presentation, as I told you, is the graffiti, the freight train graffiti in the North America region. So an hypertextual conversations, I don't know if it makes sense for somebody here. One guy in the morning make a presentation about his presentation was Cosma. That's the software he was presenting and he talked about that guy that invent this kind of linking document idea and that's hypertext. What about hashtags? Hashtags may be this kind of hypertext too because they will function like a gathering point. People will join in those places through their own publishing, their own post after they tag them with any hashtag. This network will be talking about users, Instagram users, that post and mark this post with any hashtag. So this can make clusters or clicks and that's what we are trying to look. These small groups that share something in commune, that share meaning, all these posts are meaningful for themselves. So that's I think something happening here too. This is like this big group of persons that gets together and fuzz them and then we have these little clicks happening in each room and then anybody will move from place to place and make these kind of networks if we try to see it that way. So the other part I would like to talk to you is about this filter. This filter happens in two different levels. One with a Python using some other libraries too and the second one through Graphology and Sigma, I think misspelled, using JavaScript. So this was introduction guys. The point here is I'm going to share to you how each step uses different open source libraries or software and that's one way to acknowledge to all the developers here that all that effort you are doing is making people like me that is not really a developer. Trying to make a dialogue, talk between social science, computational science with the tools that I can try to use. So there's a word there that is really important. It will go all the way from the whole slide show, it's data. We have been listening to that concept a lot and I really feel kind of sad because the effort that I can see in those talks before was about standard data. Big platforms make tools to make standard data and this example is the whole other thing. It's really different because it's a really custom data set. It's a really custom, it's a really niche social phenomena so there's no tools to study this study object. So we have to make them with anything we can. So data is the key and it's the link between execution devices, between disciplines, between programming languages, theoretical frameworks, development libraries and social phenomena. That will help us to make interoperability between all of these different dimensions. And I think, and I hope you do too, this will be only possible through open source and data. Data is the key here. So the journey starts. I'm going to try to be really fast so I can have some of your comments. I will tell you this is a master's degree thesis so each step it was way long. If you think this is verbose, that's some other stuff. So I'm using the first link between, I want to show you is between conceptual frameworks, theories. So we have Thompson, a guy from England that is trying to find these kind of categories to detect meaning, to detect symbolic stuff. And we also have the graffiti de firma from Figueroa, that's a Spaniard, another Spaniard guy that retomb these I exist, I am the SCART, I don't know in French maybe, the SCART, I don't know the right pronunciation, the SCART, the SCART is to the cart. To see how graffiti writers broadcast themselves to the world. So we're making this link, right? Because data will be the key here. To make this link between some theoretical point view perspective, to a way we can manage to just back up at least, we need to make this, look for these terms, look for these stuff and make it some sort of way to, well to data. So we at least have these three categories, those things we are looking for. We are looking for geographies, so we are looking for cities, so we are making a dictionary, a city dictionary. We are looking for communities, that's symbolic, shared terms, so this dictionary is about the words that graffiti writers use to tag their own posts and the freight workers use too, so we can mix them, merge them and make this freight train dictionary. And last but not least, we have entities, so we are looking for graffiti writers names. We are going to scrap, we're going to mine these hashtags, these hashtags, conversations, these hypertextual conversations and the network, we have that simple structure. Users post some publication and add some tags. But we are not only using one user's post, we are using a lot of them. So we have this seed node, the seed node is the first hashtag scrapped and this Instagram data mining boat, really original name, will download an infinite number of posts and then add new hashtags that are found on these publications. That will give us this primitive kind of network, this is a small one, the seed node was graffiti bombing, we used a mining depth of only zero, so it will only mine that in this case 30 posts that are using this specific hashtag, but as you will see, graffiti bombing has 30 posts, but this other post is also using different hashtags. So that is how this network is built. For making this mining, I'm using Instagram app, it's an unofficial Instagram app for Python. I don't know if it's a privacy stuff and I know it's tricky, so I won't talk about it. But I use Docker, so I can make a Raspberry, we try to mimic human behaviors, so this mining will last for maybe one week for each conversation and if these conversations are really large, it will last longer. So that's why we are using this low consumption computer and then after we scrap this, we will put it on the SQL database. So we are going from the publications to the SQL database. So this will be a really fast way to put it. We came from reality to Instagram and from Instagram to our own dataset. But we are now looking for these terms on the dictionary we already made before. So in this case, this is a writer from my city, Afex, he wrote that train in Mexico and now somebody else sent him the photo in Utah. So he will put his name, the place it was found, some other stuff and some slang for the same community and also his crew, his group. So if we try to put this text in spicy, we'll give us only one token and that won't help. So we have to split the hashtag in small words. So the answer was really cool. It was already on Stack Overflow. So thank you to that guy. I put it in the paperwork. It's there because we have to acknowledge some others work. So I have to build this really big dictionary of all the words we know in Spanish and English to split the hashtag. And after that, we'll look in these dictionaries. If some word is inside any dictionary, it will be marked as. If it's not, we will think about it as a writer name or as a crew name too. But we're making sure this is real and we will look for graffiti in any part of the caption to make us sure that that strange string is actually a graffiti writer's name. So this is simple, but it works. We have those words, how those hashtags were marked by this software. So we have throw up calls, bubble style. And this is interesting, but it will be more interesting when we try to put everything together. We do it with a spicy docker on Sorrasberry. I'm going to be really fast now. These are two image detection models, ones for Google, ones for IMAID. And that's cool because this is the same technical process, the same image, but it's seeing two different stuff, right? Because the models will see what we want them to see in the training. So that's really straightforward, but it's really important because the Google model, it won't be useful for me at all. So this is the result of using this model. Also it will be interesting when we put everything together. We are using Jupyter with Google collab because it's free, so we can make an SQL query and then it will download the images and apply the model. So the beauty of relational databases is you can access to these different content from different sizes. You already know this. But the point here, we are going from the database to JSON network and we will get something like this, right? We have in the middle, in yellow, the inference notes and purple, the images were detected. And the point here is to see how users gather using the same symbolic stuff. That can be the same symbolic stuff, some kind of graffiti, some specific slang word, some kind of city. We can see in this point how somebody in Tijuana maybe will use the same... Some group in Tijuana will use maybe the same style. I think that's important. But this is a hot mess again. It's thousands of notes of different types, so we have to clean this. Looking for significance, or meaning or symbolic forms. We know a man is an animal suspend in webs of significance. He himself has fun, but we can clean that. We're trying to clean that. So the note reduction will be the shortest path to the meaning. We're going from that to the really clean network. At least I think so. So the algorithm, the thing that's happening here is that for each user node, we're making an array. Then I have another array of the whole network. And if they have a shortest path that is like an algorithm using graphology, and another place is two, that if this match some... If it match three steps, it means that the user has some symbolic node detected and then it will link them. If it's not like that, it will delay the node. So it will change the network structure to this now. It will be... It's really different from that one from before. So I think... I don't know if we have some time. If you want to see how this works, and if you have also some comments too, because I would really love to see if you guys have anything that I can change, I can add. I think it's really... I don't feel like there's some questions you can do. I think it would be better if you just told me what to think about it. So in this case, the network starts with 4,000, almost 5,000 nodes. And at the end, it's really small. Let's see. I don't even remember which one is the biggest. Sorry. We went from 5,000 to only 500. So I put this example because... There's a principle phenomenon called divergence that when we mine the whole hypertextual conversation, it will go for a lot of places that we are not interested. Like if somebody used red as a hashtag, if somebody used love, if somebody else used no, it will just move the conversation that we are trying to mine to different places that we are not interested. So in this case, macro, it's a really known writer. Let's see if we can find it. Well, this one is also a really known writer. So in this specific example, we can see how the object detection model tagged this photo as a wild style, as a throw up, and the Google model will tag them like a wheel and a train. But also, we can see how this hashtag was tagged as a graffiti writer. So that gives us an idea that... Well, not an idea, an evidence that this guy is a graffiti writer name and we can see his intervention. We can see who is the user that makes this post. When we can access to the user, the user neighbor network, we can also be sure that he's using these specific graffiti styles. Okay. So I'm going to finish with this. In this case, when we apply this filter, it's way better, it's way cleaner, and we can start to see... I think I'm just talking nonsense now. Do you want to add something? That's my answer question. Thank you. How did you get inspired to run this as a master piece and continue doing research? I mean, checking the throw ups and the graffiti on trains. What was the practical aspect that motivated you to be fine? Okay. So the question was, what was the practical aspect of attracting the graffiti and what motivates you personally to do it? I had a pre-grad also graffiti as a central topic and I thought it would be easier to do something that will continue this personal initiative. But at last, I've been late for two years now. I shouldn't deliver this last year because... But it was really interesting to how learn this new stuff and put it together to make some scholar work. What made the difference? What do you see? Right train, for instance. How do you know that right train is the background and not the name of the artist? Okay. That's a really... That's a good question. I... You can repeat the question. Okay. How we managed to difference the freight as a graffiti... Is not a graffiti writer's name. So there's a big dictionary. It's built with all the known words. So it will distinguish between known words and words that are out of that vocabulary. Yes, Alison? I have one question then if no one else has one. How do you... Have you had any insights in your graph that really excited you? Yes. So the question is if the insight of the network really excited me. Yeah, I think it does because it was like a kind of a serendipity, you know? When I start to see like this small notes linked together from the terms, it will pop like how some terms are linked for some graffiti styles too. So everything's connected and I think the way to get to this is tailoring data to our own needs. Right. Yeah. Okay, folks. Can we have a big round of applause? Thank you. Thank you. Thank you. Thank you.
Qadence - A library for Digital Analog Quantum Computing
All right, folks, we're going to start. David, it's you. Hello. Hi. I'm David, or Jorick. I work at Pascal. I'm going to tell you a few more words about that in a minute. And I am here to tell you about an ongoing work at Pascal called Cadence. And as you can guess from the name and possibly from the logo, it's related to quantum computing. So before I proceed, I would like to stress out one thing. None of the things I'm going to tell are my work. I'm just, for one thing, I joined Pascal recently with a baggage in programming language theory, compilers and things like this. And this project has not reached the stage where we can use programming language theory or compilers just yet, but maybe someday. So a few words about Pascal. What do we do? We build qubits. More generally, we build quantum computers. We build quantum algorithms. We build quantum tools. We build quantum teaching materials. I forgot to mention we are a private company, but we are a spin-off from several laboratories. Sorry, there is a strong research background at Pascal. And importantly for today, we build open source tools related to quantum computing. And if you're interested in knowing what the inside of a quantum computer looks like, well, that's part of the inside of one of ours. I think this one is called Fresnel, but I'm not sure. You can see lots of lenses which suggest that lots of lasers are involved. Yes, lots of lasers are involved. We're not generally allowed in this room because of the class 4 lasers. Way too dangerous. Still, cool to have. So if you're like me, you might have a question. What the heck is quantum computing? I mean, we all hear about it. A little bit. Well, I hear about it every day, but I pay for that. But we hear about it in mass media and everywhere on LinkedIn, etc. It's still not clear. At least it wasn't clear to me. It might still not be entirely clear yet. What quantum computing is all about? So the first thing is quantum computing is about computing with qubits, not with bits. An important part of it is quantum computing is very much research. You may have seen many announcements, each of them informing us nicely that the last few problems in quantum computing have been solved. I'm sure that we are going to see these announcements for the next 5 to 10 years. Quantum computing is currently a very active research domain, but it's a research domain. And while there are companies that are actually building quantum hardware, we are not there yet. It's not something you can buy at the local shop or even if you go further down the road. And it's probably going to be a few years before we can do anything really useful except in a few domains. I'm going to mention that a bit later with quantum computers. Still, it's extremely exciting. And when I say it's open research, it's open research for the hardware, it's open research for algorithms. And these algorithms most of the time are designed based on mathematical models of quantum computing. There are a few algorithms, but not many algorithms that actually run on quantum hardware. And there is lots of research on compilers and tools, but again based on mathematical models usually and simulators. Lots of hype too on quantum computing. So on the upside, it means that lots of credits for quantum computing, lots of funding, which is why companies such as Pascal and a few others can do their work. It's also thanks to this that a number of academic laboratories can do their work. And it's a good time to be working on quantum in general and quantum computing in particular. It makes things a bit complicated when you have to read a press release and it's a bit hard to understand whether the new problem that has been solved on a mathematical model has been reproduced in labs or is actually ready to come out in production. Why do we care about quantum computing? Well, we do care about quantum physics anyway because in computing, I mean, because CPUs need to deal with quantum phenomena on a daily basis. One of the reasons why we cannot make CPUs that are much faster anymore is that we have hit some physical limits. I'm not exactly sure which ones, I'm not a field physician, but they exist. So we want to go for the next generations of hardware and at some point you can either continue fighting quantum physics or try to embrace it. So that's one of the reasons. Another reason is that there are hopes that quantum computing will be faster. I mentioned hopes because despite some papers including a famous paper by Google two years ago each, we don't know yet. There are good reasons to be hopeful that for some classes of algorithms we will have something very fast, but we're not sure yet. Similarly, we hope that we can be energy efficient. I'm going to show you some algorithms later during this presentation. And there are good reasons to be hopeful that we could possibly someday replace entire data centers working on very specific algorithms with something much smaller. Again, this needs to be validated in labs and on industrial hardware. We're not quite there yet. And also simply because we don't know how to build new hardware at the moment. If you look at what's needed to train chat GPT or at least an old version of chat GPT, I assume it's worse now. If I recall correctly, they were using 10,000 boards, each of them carrying, I don't know how many GPUs each of them carrying, I don't know how many cores for the training part. And I don't know how long training lasts. So how we do it at the moment is we expand as many resources as we can, which is not something that can last forever. Again. So I mentioned bits, 0, 1, easy. Cubits, three dimensional, more complicated. Plus you have the question of whether the qubits are 0, 1, which is a complicated phenomenon, its measurement, and I'm starting to have a few intuitions about it, which probably means that I'm wrong. So there are two flavors of quantum computing. The first favor is digital quantum computing. This is a program in digital quantum computing. If you look at it, you'll see something that looks very much like a circuit. Well, that's why it's called a digital circuit. You have quantum data coming from the left conveniently. All these rx, ry, rz are gates, which operate on the quantum, on the qubit, sorry, in these, all the ones prefixed with r, r rotations on the sphere. These x, z, and I could have had, y's also are symmetries on the spheres. There are all other gates, but these are the ones that I had an example to use with. And at the end, you might be able to do some measurement, and in practice, you'll have to run your experiment many times because what you end up with is probabilities. So you need to measure probabilities by taking pictures, essentially, which means you have to take many pictures. So as I mentioned, a program is a circuit. And there are programming languages for almost 10 years, I think, there have been programming languages designed to create those circuits, or at least to give a syntax to the circuits and possibly to do modeling and simulation on those circuits. But the big snag is the hardware isn't there yet. One of the big difficulties that digital has is noise. I know it's not the only difficulty, but that's the one I remember, which is already good for me. Again, I'm coming from a different field, adapting is complicated. On the other side, you have analog programs. This is an analog program. This is actually part, I believe, of the test suite of one of our computers. So the test here is, hey, can we make a program that looks like our logo? Needless to say, it's probably not a very useful program. But we need to manipulate things at a very fine level. So in practice, when you're dealing with analog, a program is not a circuit, but it's also called a circuit and some parts of it will model as a circuit. But in practice, it's geometry and pulses. It might be different for other kinds of hardware support, but I think the ideas are generally the same. When I say pulses, I mean laser pulses, so you have to set up a frequency, a shape, and things like that, which is a bit complicated. I'm not going to claim that I have any understanding of how it works. And this, why do we care? Well, there are two reasons. One of them is this actually takes advantage of the hardware. It maps extremely naturally to hardware constraints and to some classes of problems. So from the top of my head, there are a number of graph algorithms that map very naturally to this. I showed you a two-dimensional representation, but it could also be three-dimensional. And so graph algorithms, a number of optimization algorithms. I'm going to show you a little bit of an example later. And if we have a problem that maps naturally to an analog circuit, the big advantage is that this is something that you can mostly run today on some machines. Not everything can be run, but we're much closer to this than in a digital world. And one thing I should mention, if you are familiar with the history of computing, well, every computer nowadays is digital, but before World War II, there were already many computers and they were pretty much all analog. So if you look at the battleships of the UK, US, French, German, Navy, they all had onboard computers that were electromechanical and that were used for aiming precisely. So they were computing ballistic trajectories. It worked before we knew how to do digital, and it worked because this specific problem that they wanted to solve had a very nice physical, electromechanical representation. In the end, they disappeared. It took a few decades for them to disappear replaced by digital, because digital was so much more generic, but it took lots of time for digital to catch up with analog. So these justifies war were interested not just in the digital, which is going to be much easier to program once it works, but also in the analog, which might give much better results in some specific cases and which is much closer to being actually something that we can use. Of course, the problem is how do you program that? I mean, that logo was not very intuitive. Well, it's easy. Well, no, really. And I apparently accidentally removed one of my slides, which was a big differential equation, which showed on one side the interactions between atoms and the other side the interactions with the laser itself, which I have no idea how someone can go from this differential equation to actually writing an algorithm, but some people succeed and they have my complete admiration. Anyway, that's why we, and when I say we again, I mean they have devised cadence. Cadence is a toolkit. It's designed for experimenting. You can experiment both with digital circuits, with analog circuits. You can mix them. Once you have written your circuit, you can simulate or execute it. When I say simulate, the world is a bit overloaded, but simulate. I mean, an emulator running on your CPU or GPU that's going to pretend that it's doing quantum physics usually at a fairly deep level. You can pick a level or execute. Well, if you end up in the subset that actually runs on the machine, that you need big glasses and be very careful to look at, that we have in the basement, we have a few of them. They're not really in the basement, but we do have them. So if you end up with this, you can compile your program to essentially a sequence of laser pulses and then send laser pulse to the computer for execution. We do that because there are many experiments that still remain to be done. We're not quite there yet. One of the reasons, I'm putting it first because that's the one I'm most interested in, but it's not necessarily the main reason, is this is the kind of thing that can help us find out how to design a programming language that is both usable, ideally by human beings, and also executable on the hardware, which is something that doesn't really exist at the moment. Another thing is, even without that, just having some abstractions on top of laser pulses, for instance, we have libraries of geometries, well, that makes life easier when you don't have to actually solve that differential equation all the time. An interesting aspect of simulating and executing circuits is that we can run optimizations for at least two different meanings of optimizations, one of them being how we deal with noise. Noise is a big problem with quantum computing if you put your substrate, you should put your atoms too close to each other, they're going to interact, if you put them too far away from each other, they're not going to interact, how do you send exactly the data you want and not the data you don't want from one to the other. So that's the kind of thing we can simulate using CADNs or lower level tools, or possibly other tools, but anyway. And the other thing is something I'm going to show you very soon, again, still might work. So at some point, I assume that some people will ask questions, don't be surprised if my answer is, I have no clue. Okay, so let's look at a few demos. So this is an example of a graph. Let's re- yeah, okay, this is a random graph. We want to solve the MaxCAD problem. It's a well-known problem in graph theory. The detail is not extremely important. We want to find the best places to cut the graph according to some criteria. So this can be reformulated as maximizing this value. And someone, I was sure I had written my sources somewhere. Okay, so someone has devised an algorithm to do that. Sorry, I didn't sort my sources. So this starts by waiting, yes, after the wait. So we derive a circuit from the graph. So there are as many nodes as edges, if I recall correctly. And we do a number of operations whose objective is to eventually make some configurations more likely than others. So I couldn't tell you exactly how it works. Many operations, many, many operations. Yeah, and in the end, we can measure stuff. So once we have this, we can represent the quantity we want to maximize as an optimization problem for one of the many different, what? Okay. Demo effect. Hop. And so this code is basically PyTorch for people who are familiar with PyTorch. And then we can run what we call training in that case. So we can run the optimization problem. So what we're going to do is iterate. So there is a theorem in the paper which I forgot to cite that shows that this computation is eventually going to converge. There's no guarantee that it's about after 100 iterations. But in practice for a demo, it seems to work. And if we pick the configuration that was most likely, again, there is this problem with the cat which might or might not get out of the box. If we pick the configuration that is most likely, it happens to map to the solution that we're looking for. And here, so we need to cut in such a way that something, something. I don't remember exactly how to read this, but I'm going to read it. I don't remember exactly how to read this result. But the interesting part is, hey, quantum algorithm, give me the grants. So that was a digital algorithm. I'm going to show you something that has a very similar outline. We want to fit a curve. So this, we're just going to take the curve x maps to x2 and see if we can teach a quantum circuit to basically represent this curve. For this, we're going to use the quantum, the ansatz quantum learning algorithm, which exists. And basically, we're going to try and optimize a number of parameters, a number of angles here, and see what we can do. So again, let's finish our circuit. What is going on? It was working this morning. Yes. Yes, no more error messages. Okay. Okay, so this is with the initial state of our quantum circuit. The dots are the approximation, the, are samples that we want to approximate. And the curve is the initial result. As you can see, it's not exactly a perfect match just yet. So we're going to run a few steps of learning algorithm. So this one is just pure by torch, just regular optimization. And usually it works. Normally it works. I'm going to pretend that it has worked and I'm going to pre, to start. Yep. What the? Yeah. All right. So after a few steps of learning, this is what we get. We have an orange curve that why not absolutely perfect actually matches fairly well the blue dots. So okay, it's not, not time to call the noble committee for that. But this has applications. Of course, this is a very simple example for a very simple curve that we want to fit. But if you look at it with a little tolerance for approximations, this is kind of the things that neural networks are doing. That the learning phase is something kind of like this. In fact, there is an entire subdomain of quantum computing. That's quantum machine learning. And this is, I believe, one of the simplest algorithms of quantum machine learning. If you look at the API documentation of cadence, you will actually see a QNN module. So quantum neural networks. And this is a very, well, a very active subfield of an already very active field. Because if the hypothesis we have on, if the models we have of energy use and computational power are correct, this means that hopefully we could replace these tens of thousands of cores used by a chat GPT or whatever its competitors are named and replace them by something that consumes way less energy and hopefully runs at least as fast. So time to reach conclusions. What do we have? We have a toolkit designed for exploring the design of quantum circuits, both on hardware that already exists, on hardware that we believe is possible and might come out of, into labs or out of labs within the next five years, and on purely hypothetical hardware because why not? Experiments are interesting. We have this mechanism circuit optimization, which I've showed you. I showed you how it could be used to solve problems or to approximate curves. It has also other applications such as the problem of noise. I mentioned noise between atoms, for instance. Sometimes you want to optimize based on noise models and make your things work because you know that your model isn't perfect or at least your high level model isn't perfect and you want to go to a lower level model. And again, it's not a programming language, but I hope that maybe someday it could serve as the beginning of one. Ongoing work about enriching everything, writing libraries for domain-specific problems, for known algorithms, for geometries, etc. There are many questions. There is ongoing work on compilation, on the subset that we already know how to compile and more larger subsets. And of course, I'm trying to make this easier to program. And when I say we, of course, I mean them. There was a paper recently accepted and presented at Planck. If you were interested, it's on the last line here. And all the documentation and the source code are on GitHub. So thank you for listening. APPLAUSE We have like four minutes for questions, my friends. I'm sorry, did you catch it? Was there any attempt to implement the circuits that we mentioned as an actual problem? I can see the question for the mic. Yes, the question is whether these particular circuits have been implemented on hardware. The answer is I have no idea, I'm sorry. LAUGHTER I believe... No, sorry. I'm not going to say random crap. I don't know. Right now, the main use case is experimenting with this. But again, for the second algorithm, for instance, if we can manage to make its scale to very large... to a large number of curves and more complicated curves, there is a potential application to basically machine learning in general, not just artificial intelligence, but... And the former one, I can't think of any specific example for the former one, but I know that graph algorithms are very interesting for many things, because, well, for one thing, there are good reasons to believe that they can be executed on existing or almost existing hardware. And there are many important problems that can be modeled as graph algorithms. For instance, we are in an energy crisis at the moment, and all the energy distribution problems, for instance, are graph algorithms. I've heard of people who want to work on it. I have no idea whether they actually work on it. Also, for car... for modeling the circulation in cities, things like that. I couldn't tell you about more than that. Okay, I think we should also thank you very much. Thank you very much.
Science without secrets – how Galaxy democratizes data analysis
Okay, good afternoon everyone. We are happy to introduce our project here. We are part of Galaxy Europe project and who we are. I am Polina and I am PhD student with Galaxy and my colleague Mira is US Galaxy EU administrator. And we will split our talk into parts. I will talk more focused on research and how to use Galaxy for research. And Mira then will talk more about technical details. What is Galaxy? Galaxy is a platform for data analysis. Basically it's free and open source and there are almost 10,000 tools in Galaxy toolshed for data analysis. It is cited in more than 11,900 papers and there are extensive tutorials for data analysis that you can do with Galaxy as a researcher. And this is cross-domain platform. It is usually used in bioinformatics but it can be also used and it is used already in chemistry, ecology, climate science and astrophysics and more. There are multiple interfaces that Galaxy can offer. There is intuitive graphical user interface for researchers and there is also a unified open API for more advanced users. And what is Galaxy for? So if you are a researcher then one main advantage of Galaxy is the graphical user interface. So here is the main window of Galaxy and on the left side you see the list of tools. I repeat that there are almost 10,000 tools implemented into Galaxy that you can use with this graphical user interface. And for researchers and data scientists it is always the question how to organize your data. And Galaxy provides this concept of history. History is the directory that contains all your inputs, outputs, all your tools that you use for your data analysis and its parameters and more. And you can also do as many researchers in parallel as you want because Galaxy also provides this feature of multiple histories that you can use in parallel and switch between them easily. And yeah, that's cool, that runs. There is a short presentation of how it works for bioinformatics tools. So here we make a short analysis of biological data with very basic bioinformatics tools such as FASTQC, which is actually the tool for quality control of FASTQC data. And so we can see now we use another tool for bioinformatics to adapt and you can see in the main window that we can advantage from this graphical interface. So usually these tools are command line tools and Galaxy provides you graphical interface that you can easily use if you are not familiar with command line. And yeah, finally you have your outputs on the right panel in your history. And as I already said that usually all these tools that Galaxy can provide user interface for are command line tools. And if you want to be more focused on research, you don't want to learn this command line and then Galaxy provides this. So you can, all the parameters that you need to set in command line you can set in more visual way. Another question that is important for researchers is reproduce your research. And Galaxy can also provide this. This is like the small demo of how to reproduce your analysis with clicking one button. Or you can also analyze your data in one go. So for this Galaxy provides you workflow feature that you can create your workflow which looks like the sequence of tools in a visual graphic interface. And then you can use this workflow. And here is a short video of how we use this workflow editor in Galaxy to create our own workflow. So here is on the left panel we choose the input. Then we set the name of this input. Then we choose any tools that we want to use for our analysis. And we set them in the order and we can use the output of one tool as input of another tool. And yeah, in this way we create a workflow. It can be more complicated of course. And here is another video of how exactly we can use this workflow that we just created. So first of all we need to upload data into our history. For this Galaxy have different options. So we can upload them from local computer, from cloud, from external resources or other ways. And yeah, here we found the workflow from the list of our workflows. And we ran it by clicking one button. And then in our history we have our output. And Galaxy also provides to researchers different ways to visualize your data. And also this is also very important to share in research. So Galaxy can provide you different options to share your histories, your workflows, your visualizations, your data. And you can choose to share it with specific users, with your community or make it published and available for everybody. And yeah, this is not only what I showed you. Galaxy can do, Galaxy can do many things that are supposed to be in a data lifecycle. And now I will give a word to my colleague. Hey, so you are now a bit overwhelmed. I think that's quite normal. But I have good news for you. Galaxy has its own training network. And it's made for self-study. It has step-by-step learning paths, videos, interactive tutorials, and covers topics for everyone. For students, for developers, and even for administrators. And we host worldwide training events where hundreds of instructors in all time zones are there to answer your questions. Now I will go more into technical detail, but first I show you which Galaxy servers are available. There's three large instances that provide a lot of compute power to the public. One in Australia, one in Europe, which I administrate and one in the US. But there are also many more domain-specific and national Galaxy servers, as well as hundreds of small-scale deployments. And with a few clicks, you can easily host your own Galaxy server. Now if you decided to host your own Galaxy server, where do you get the tools from? Is there kind of an app store? Yes, there's an app store, but we call it toolshed because you cannot buy anything and it's open sourced. But you can share all the tools if you like. And the toolshed currently has more than 9,700 tools of various scientific domains, mainly bioinformatics, but also other communities. And there's a curated selection and it covers also interactive research environments, such as your Jupiter notebooks. And if you wonder how these tools are shown in Galaxy, because they are usually all command-line tools, as was mentioned, there's no magic behind it. Every tool has a corresponding XML file that describes its parameters, its inputs and outputs. And there's even a tool for rapid development, which is called Planimo, and that makes it quite easy to contribute. But if you do research, you want to have every experiment reproducible, and this is why each version of a tool should have a fixed set of packages it depends on. And you already saw that there was a requirements version in the wrapper, and Galaxy can use dependence resolvers like Konda, but also can run tools in so-called multi-containers that are described by a unique hash for a set of fixed dependencies. And this way, every tool is reproducible in all Galaxy servers around the world. Okay, let's assume you found a server and you found your favorite tool, and you want to start a job. How does it actually work? So your web client is communicating to a whiskey-compatible Python web server that responds to your request and creates a job in the database, but then there are Galaxy job handlers that pick up your job and create kind of a script, like a bash script, and submit that script to a cluster and then monitor its status. And once the state changes, it will be mirrored in the database and then also sent back to your client. And which compute environments can we actually use? So we can use large HPC clusters, for example, running Kubernetes, HG Condor, SLAM, or many others, but we can also use remote compute resources with POSA, which is a small Python server application that is developed in the Galaxy project. And if you don't have such a HPC cluster, you can also run it on your laptop or even on a Raspberry Pi. But if you want to scale it to 10,000 of users, then you, of course, can have a more complex setup. This is the setup I administrated at you, and yeah, maybe you have a look, but I will not explain everything in detail. We have an OpenStack Cloud with about 8,000 cores and 42 terabyte of memory. But how can you contribute to Galaxy now? So you saw the tool wrappers. If you have a tool you want to be implemented in Galaxy, you can just write your wrapper and open a pull request. And also you can contribute to the source code, which is on GitHub. You can contribute training material, your scientific workflows, or other resources. For example, if you have an HPC cluster, you can spin up a POSA endpoint and contribute compute resources. Yeah, I want to thank you for your attention. And if you want to keep in touch, we are in Matrix, Master Done, and GitHub, and we will stay here a few more minutes to answer your questions. One question, Tassel. Do you have a success story for people using your platform? Do you have a success story? Yeah, you did your masters, right? Yeah, I did my master. Just take the question. Okay, the question is a success story of how people are using. So many researchers use Galaxy for their research and many biologists. And so I also used it for my master project. And yeah, I wrote master thesis and all made with Galaxy. So this is my own success story. Miss Bernice. All right. Sorry, folks. We do have to wrap up and move to the next speaker. Thank you, everyone. Thank you.
Workflow managers in high-energy physics: enhancing analyses with Snakemake
Okay, so good afternoon, all both evening, I suppose. I'm Jamie, so I'm a doctoral student at TU Dartmouth University in Germany and a member of the LHCB experiment at the Large Hadron Collider. More importantly for this talk, I'm a user of SnakeMaker's three years and will come on to what SnakeMaker is and how it's used within our experiment. To jump straight into the title, so really the first two words of workflow managers, they do what they say on the turn. It's tools to manage workflows which is unsurprising from the name. It's not uncommon in many fields, but particularly in high energy physics, that we have data of some variety. We want to apply process, some form of workflow to get some results and ideally a nice paper out of things. From this, we can kind of structure our work with a workflow manager. This involves defining our workflow, organizing or reorganizing the rules in our workflow, so this kind of flow chart we have running a workflow or miraculously rerunning when there's a change to the code base or if there needs to be a reproduction of a result, and also to document a workflow. So we've had a bit throughout the day about reproducibility, so it also plays into that. There are a lot of tools on the market. Here is just a variety for a few different places, some more leaning towards the toolside of things, some more leaning towards the framework side. Things like common workflow language. The focus here today is really going to be around snake make. So snake make evolved really from the GNU make paradigm, so the workflow is defined from a set of rules. These can be related to one another and a directed acyclic graph is generated, which can just be really thought of as a flow chart linking all of the rules together to get a set of required target product or results. We can use wildcards within there that really allows us to create dynamic workflows, thoroughly encourage having a play around with it. It's a really fun tool to use, really good tool to use. It's based on Python, so it's Python compatible, you can in fact use Python functions within snake make. This does provide a very shallow learning curve for those already experienced with Python. It's been in development for a while, it's recently had almost an overhaul of sorts with version 8 which released at the end of last year and restructured a lot of the functionality to look more towards the future. It originally was a large tool in bioinformatics, but within the last five years it's been picked up in HEP. Now I've said a lot about HEP and high energy physics, it's probably worth just briefly touching on what that means, so this really is the physics of the early universe, this is what we talk about when we talk about particle physics. For example at the Large Hadron Collider, where the experiment that I work on is based, we accelerate particles to large fractions of the speed of light and then collide them to see in these very high temperature, high density environments what the old conditions of the early universe would have looked like. There's four main experiments around this ring, which is 27 kilometres of tunnel under the Swiss French border at Geneva. By experiment LHCB just in the foreground, other background just next to the airport, and it's just as easy as looking at the differences between matter and antimatter. This is our little experiment here, 100 metres under Geneva. Now analyses in high energy physics really aim to measure some things, so this is a project all the way through aiming to get a final measurement that might be a massive particle, the lifetime of a particle or how it decays for example. We want to contradict ideally the standard model, which is our best understanding of particle physics. Any contradictions here would be a sign that some things arise, that there is new physics out there. We start off with experimental data, an extractive measurement from that by fitting, selecting and so on. There's a number of processes we can apply and these all take place in dedicated scripts, usually written by analysts. And since analyses are collaborative projects, these can range from a few authors up to dozens really in some of the larger analyses, this means that every analysis has a large kind of shared and dynamic code base. This may sound familiar, so a software project in many ways has parallels that can be drawn there. In terms of what we require from a workflow manager in high energy physics, we can break it down also like this. So starting off we need results that need to be reproducible, we can't really afford to build a second large Hadron Collider, so the best we can do is ensure that our results from the raw data make sense if we rerun the analysis. Similarly, if there's a new theoretical input or if we have a change to the scripts, we may need to rerun this analysis. Workflow managers really make this very easy. Excellent, thank you. The data that we have is often stored remotely and this is because we have absolutely tons of it. The scales of data we have are several terabytes per analysis, there can be hundreds of analysis per experiment ongoing at once, are enormous and this will only get larger. There's a plot I'll show near the end that shows how much of a daunting case this is, but we'll come onto that. So these really can be large scales of data. The scripts, as I've said, in an analysis can change very frequently and also can be of very different types. So these could be in Python, C++, Fortran, it really can range, so you want a flexible platform where all you need to do is ensure that it runs in the shell and you can then deploy it within a workflow. And finally, we need to be very scalable and deployable. This is both with regards to the amount of data, the amount of authors, and also ensuring that everyone collaborating on an analysis can contribute to it. Snakemake actually meets all of these needs, it meets them very well and it's seen a lot of uptake particularly in my experiment at HCP, to the point where we have a really strong user base and we've started now to have our internal expertise leading to internal training. I've linked here a very recent, this took place yesterday, training from a colleague place. There's also within our starter kit, which is given to all of our new members, there's a Snakemake course within there. So new analysts are trained in this tool so that they can use this right from the off and since most of them are familiar with Python, that really is getting started from day one, getting straight into physics. The features and functionality, suit analysis is really, really well. So there's interface for HPC resources, the amount of data we have means we can't process all of it kind of locally, so we do want to make use of the resources we have around the world and also local clusters. And also because all of the data is remote, the integration for remote access protocols within Snakemake enables that in a much more user friendly manner. So if we start with kind of scalable deployable workflows, so there's a few different ways that Snakemake approaches this, we can break down our larger workflow into smaller files, either as wrappers of common snippets or parts of code, or into individual files. This also helps with maintainability and ensures that one small change in the workflow can't destroy everything. Additionally, check pointing within workflows allows for much more flexible definitions, so if a script is going to produce an unknown number of files, it's not known or it's not deterministic how the workflow will look from the start, so we can re-evaluate that flow chart further down the line when that's being produced, so when we know how many files we have there. We can also batch the jobs that we have, so if we have a rule that happens many times, say over many files, we can batch that so we only need to consider as many at time and that also reduces the overhead locally when running it. Lastly, just in terms of deployability, so there's integration for specifying package requirements within Condor so that when you are then using another machine or for another user that it will run as it does or should run as it does locally. In terms of distributed computing, so as a large data scale requires large computing scales, it's not uncommon within analyses even without using workflow managers to use. Clusters and HPC resources for processing, fitting, it's becoming more and more common with tools like our data frame which is able to run through Spark. Snakemake supports common interfaces, so some of the ones just listed on the right hand side and actually how this is implemented is very straightforward, so a workflow can be defined exactly as it's done locally and the only additional part is specifying a profile, so this is typically a job script and a submission script and then in the command line just an additional flag specifying the profile is given, everything else runs as it would locally, it's just that the rules then are submitted to the, or the jobs are submitted to the cluster. Resource limits can be set, so if you have a job that you're worried is going to need quite a lot of memory or you're sharing a small number of resources but want to use central resources, that can help there in a lot of ways. And finally, if you have a job that maybe you're writing a very small amount of text or file which you would prefer to run locally where the overhead isn't justified to submit it to central resources, that can also be specified as a local rule so that regardless of whether you're running locally or through a cluster, that's, that always runs locally. And finally in terms of the functionality, so most of our data is stored either on EOS or this is AtCern or on the worldwide LHC computing grid which is around the, around the world as in they. The remote implementation within Snakemake allows for different providers, so we mostly use X root D, this is the protocol used by EOS and WLCG, but also S3 is becoming a bit more common in places and that is also supported. So in both of those cases there's a common, a common implementation where simply add the provider.remote, excellent thank you. And this can be wrapped around your path so you just define the rule as you would normally and add the additional parts on there. And there's some functionality within there, glob wildcards for if you want to correct a glob function on these resources rather than having to download all of them in advance and keep local to avoid repeatedly downloading or streaming files. This list isn't exhaustive though, I think about eight or nine, possibly more now. One of the big changes with V8 was to increase the flexibility of the interface there. Finally just really what do analysts need going forward, so we have this tool that works very well that we do want to kind of deploy more widely. There's lots of discussions within the community as to how we do that. Going forwards, scalability is really probably our biggest issue. So this is from the Atlas experiment, we are where the cursor is. In five years time we have five, six times more data than we already have as well as the data we already do have. This will keep on going, so we'll have the high luminosity error which is what this large increase from about 2028 will be where we take enormous amounts of data compared to what we currently have. So we need to be able to handle this and we need the tools to handle that. The experiments are already, LHCB used to be the smallest of the experiments, it's technically I think still the smallest and sits around a thousand authors. This grows by about 100 every year, so we need to be able to deploy for that many authors across kind of analyses. In terms of usability, analysts typically aren't software devs by trade, so having a tool that is very user friendly is super helpful in having these implementations that work very well out of the box adds to that. So going forward they're kind of continuing the good work there. And finally in terms of functionality, really what would help is further collaboration and there's already been quite a bit already between devs and HEP users. So for example the pull request on GANGA which is a service that we use in LHCB for cluster interface to implement that as an executor in the same format that was shown on a couple of slides before. So to draw some conclusions, workflow managers really are very useful to our research, they save us a lot of time in making sure that everything's run in the right order, that nothing is left behind and that we can look back and see what's been run. The tools do meet our needs, the functionality is there and in fact the use of these tools will become unavoidable. Within the next few years the scales of data we have will mean we have to use tools like Snakemake. And finally we have a very strong user base, it's a bit spread across the experiments, not necessarily all in Snakemake, so CMS use Luigi which was on the much earlier slide and bringing that together to collaborate both across the HEP community and also with devs really is also crucial in moving that forward. So I've left a few links on there, the training from earlier and I would otherwise welcome any questions. One question I have for you, so a happy user of Snakemake, one thing is sometimes challenging is the visualization of the data and the dedication flow, how do you handle this especially when you have thousands of nodes in the data? So to repeat the question, so within Snakemake there are ways to visualize it, this can be very difficult with large scales of workflow, so the question is how to approach that. Often this can be done with grouping, that helps in a lot of cases for very large rules that then when visualizing the DAG this narrows everything down. I believe that's both the DAG and pipeline modes, one expands to every job and one just goes to individual rules, so one is naturally a lot narrower. Really I would say between batching group, those that have been the ways I've personally dealt with it, I'm not sure from the outside or elsewhere in the user base, but that would be how I've gone about it. Sorry, I think we have to move on to the next talk, so thank you very much for joining me.
Open Neuroscience: practical suggestions for conducting open neuroscience research
Cool. Take it away. Cool. Howdy, y'all. I'm Danielle. It's nice to see you. Nice to meet you. And I'm super excited to talk to you about open neuroscience. And I'm really glad there's a lot of people here. I did not expect everybody to come because I know this is a developers conference. So I'm going to tell you first why you should care about open neuroscience. First point is that we all have brains, right? Any time I mention neuroscience, everybody gets really excited because we do have brains and that is essentially what we are, the integration between our brain and our body. But the issue is oftentimes our brains don't work the way we'd like them to, right? Raise your hand if you or somebody you know has ever struggled with mental health, neurological or psychiatric conditions. Right. Everybody's crazy and sad. And so, and this is not unique, right? Neurological and neuropsychiatric conditions are one of the greatest contributions to the global disease burdens, right? If we think about the fact that 28% of the global disease burden is neurological and neuropsychiatric, that's including communicable diseases which spread like wildfire. We all had COVID. And so we understand the importance of the world. And so we understand the importance of better neuroscience means better health. But more than that, why am I presenting this to you? You may get that implicitly, but I also think there is a long and storied history of neuroscience and computing. And so we have been learning from each other and influencing one another since the inception of our fields. And to exemplify this, I have two people up there. Everybody probably knows who John Vaughan Newman is, hooray. But Rafael Laurent de Neu was a very close collaborator with John Vaughan Newman, and he was a colleague of Ramoni Kahal. Out of curiosity, how many of you here do any form of neuro stuff? More than I thought. And so you will know Ramoni Kahal is the one who is responsible for any time you've seen a picture of a neuron, and it's in black and white, probably him. And so not only were they delving into the structure of neurons to understand how they functioned, this was seminal to the work of computing for how do we communicate information. And so you have interactions between Rafael Laurent de Neu, who is understanding how electricity communicates information through the structure of neurons with John Vaughan Newman coming together at these conferences on complex systems and biosciences. This continuing trend of neuroscience and computers have continued to influence them, right? Fast forward a little bit, you get Frank Rosenblatt and the perceptron, right? Based on neural networks, literally how neurons communicate with one another. And so here we are today. But not only have they continued to learn from each other since the inception of our fields, now in our kind of open neuroscience focus, we continue to learn from computing science how we can make neuroscience more applicable. And so we've all talked about the reproducibility crisis. I'm not going to spend too much time on it, but we can say that having open source technology, sharing your code, sharing data, and other techniques that I'll talk to you about today, are ways that we can improve the quality of neuroscience and therefore hopefully improve the quality of health for all of us. But another reason, some of you in this room may be academics. I hope some of you are academics. And if there is anything that funders are starting to like, it is the fact that we're going to be able to do this. It is the fact that you may have open source technology, or that your work may be reproducible and shareable, and not just because you have pretty figures, but the fact that funders themselves are realizing that open neuroscience is more robust, which means it's more often replicated, which means it's often more generalizable and holds true in ecologically valid and clinically translational applications. This means they get what they ask for with their investment back to them. And so I'm glad personally that funders care about open neuroscience because it then provides the means for researchers to do the extra time and effort that open practices often take, and we acknowledge that. I also think that open neuroscience helps facilitate a synergy between industry and academia. Oftentimes, industry has a ton of money, but they may be focused on research questions that are of interest to their stakeholders or profit shareholders, and maybe not so much time or interest for R&D. Academia is R&D. So if there is a little bit more open neuroscience from industry and from academia, perhaps we can help each other with this. So when I talk about open research, I'm not just talking about academia, I'm talking to industry people as well. And finally, the principles I'll discuss in open neuroscience, a lot of them are transdisciplinary. There will be some fund neuro-specific examples, which is why I hope you're here, but there are also many transdisciplinary principles. So I'm going to move through the different stages of a typical research experiment, data, preprocessing, analysis, dissemination, and for each one, I'll just talk about one or more aspects of open neuroscience, starting with the data we have. Specifically the fact that I will first talk about the type of data that I mean, and I mean neuroimaging data. So we have a beautiful drawing of a brain here. That was published in 1643, Andreas Vesalius and Dake, what is it? Dake, Humani Corp. porous day fabrica. I'm sorry if you're Italian, I just butchered to that. But he was a visionary in terms of how to visualize the brain. He was the one who came up with taking this 3D structure that we have and slicing it, thin sheet by thin sheet and drawing each sheet, and in doing so, sharing the visualization of how structures connected from top down, starting to give people a 3D visualization of brain function. Now that's really cool, but the issue is only the people who could do dissections, which were not very common, actually had a chance to interact with these structures themselves and see how these structures connected in 3D, and as we know in biology, structure often implicates function. But worse off is that the structures were dead. Yes, they can tell you structure, but brains are so complex and filled with electricity and neurotransmitters and fluids. A dead brain is a very poor representation of the complex emergent phenomena of this dynamical pink walnut that we all share. So this is where neuroimaging comes in, and this is the kind of shit that I do. This is the kind of stuff that I do. And so, it's late. And so these are the ways that I prefer to visualize the brain. I take the brain, I break it up into a bunch of regions, like in the top left-hand side, and then the gif just to the right of that, all those regions, you see how they connect over time. The brighter the color means two brain regions, I should have labeled them, but zero to, in this case, it was zero to 110 brain regions. I would use a different atlas now, we all improve. And the lighter the color between the two means the increased mutual information between them. And every frame in that is every couple of seconds. And so this is a person playing a cognitive task. It's a bit of a memory game. And we see how the brain evolves its connectivity over time. That's really cool. We also have images of how the brain lights up, if you will. We have better ways of visualizing brain structure. So this is the neuroimaging data that I'm talking about. So there's tons of types. We have EEG, MRI, anachronome, alphabet soup. The point being is each neuroimaging modality has its own challenges. In terms of its spatiotemporal resolution, let's take EEG, right? EEG caps, and I'm going to, this is going to come up later, so. EEG, raise your hand if you know what EEG is. Oh, yes! All right, I'm not going to bore you with a ton of background. But basically, for those of you who don't, it is a cap pictured here. It doesn't have to be a cap, but often a cap placed over your head computes the sum of the cortical electrical activity that we have. Problem, our brains are not just the outside of our brains, right? We have tons of nuclei and complex structures deeper within. That we don't really get with EEG, but EEG is a lot cheaper to use than my personal favorite, MRI, the superconducting magnetic donut that we have pictured to the right of the EEG. And that is a fantastic spatial resolution. And at the moment, we can get submillimeter resolution. That's pretty good. There's still thousands of neurons per voxel or 3D pixel, but the problem is temporal resolution, right? I speak quick, the brain moves even quicker. It's at the speed. But the issue is with fMRI, we're not measuring electrical activity. We're measuring the response of blood flow to increase brain activity. So right now, my language centers are firing, firing, firing, and it's like using a muscle, right? The blood flow will come to those brain regions, and that takes time. Time, that is essentially just a slow down representation of brain function. Thank goodness the brain is a scale free system, so we can at least learn some principles from slower temporal resolutions with EEG's faster temporal rosin and bad spatial resolution with fMRI's. All this is to say, there are challenges with each. But if you are going to collect data, we're going to talk about how you can do it in an open and reproducible manner, and step one is to think about it from the get-go. When you are creating your ethics protocol, when you are designing an experiment, and you think, okay, I'm just going to have my standard consent form that my lab or the university typically has, make sure there's a section about data sharing in your consent forms and be specific about what types of data you can and will ask participants to share. So if somebody takes an MRI scan, right, typically it's very easy to share a group level statistics. If I take everybody's brain and I do a bunch of processing, and then at that point it's probably in a standard Atlas space and you can't backtrack to get to any personal identifiable information. Problem? That's really boring, because all of the beauties and the complexity and the individual differences within each of us. So we have to balance this trade-off of being as open as possible while being as closed as necessary to protect patient and participant information. And so I encourage you to think about that from the outset and put it in your consent form so that you're not stymied by having all this data, realizing you can share some cool stuff without harming patient or participant privacy and realizing you didn't ask them if you could share it. And nobody likes to get an email or a call six months after you were paid 20 bucks to participate in a research experiment being like, by the way. So think about it from the outset. Another important thing when collecting data is to consider addressing gender bias. And so what I mean by this is in many different studies, in most studies, there's a typical bias to have male participants, and that is a problem because half the population has female brains. And so we could think about this with the success, or should I say failure, of a lot of drugs that go to market, right? This is not a neuroscience example, but typical medical trials often have male-dominated participant groups and participant samples. Then a drug gets approved, ha-zah, we think it's good. It goes to market, immediately pulled back, because the other half of the population that now takes that drug has adverse reactions that we're not accounted for during the clinical trials. Same stuff goes for doing experiments, get diverse brains. Which brings me to point three, breaks a weird cycle. Who knows what weird means as an acronym? Acronym. Nice, for those of you, good woke points. Weird stands for white, educated, industrialized, rich, and democratic. Most of the time, neuroimaging research is done in western countries, Australia, the United States, and leaves out a lot of the global south, often because neuroimaging experiments are expensive and use a lot of resources, that the global south typically does not have the R&D budget for. There's some really cool research out there on this. But point being is the state of neuroscience is not representative of the world's population. So we need to get better at diverse recruitment. So for example, I live in London now, and I looked at the most recent census data to see what's the kind of demography of London. And I found out only 46% of London is white, everything else breaks up into different categories. So I make sure that my study populations are representative of this, but it's not only cultural or racial ethnic, it's also socio-demographic. It's very easy for me to go to my university canteen and goes, who wants 20 pounds to participate in a cool neuroimaging project that you get a 3D printed brain at the end? Everybody loves that. Problem is that everybody there in Imperial College London tends to probably be very educated and potentially wealthy and therefore is not subject to the health impacts of the rest, that literally the rest of the population faces. So I make a conscious effort to do recruiting in local faith centers, and I'm talking go to your local Islamic Center, go to your local synagogue, go to your local Afro-Caribbean church, get people interested, and I'm not just saying like ask people to do research with you, show them why it's interesting. So most of the people I work with have heroin use disorder, I do addiction neuroscience and psychedelic neuroscience, and so I go to service users and I say, what would be helpful to you? What do you want to understand better about yourself? What would help you? Maybe not use crack this time. And so these are the sorts of dynamic conversations you can have with communities and make sure you have community investment in your research because the community will give you back that your time and your effort you put in. And again, better research samples means better generalizability, more robust data. Huzzah. Finally, we're going to break various participation. We talked about EEG earlier for a reason. If I'm running a neuroimaging experiment, I want to put an EEG cap on somebody's head. If you have an Afro or just thick and kinky hair, or you've got box sprays or dreads or banchanauts, whatever, it means the cap is not going to fit close to your skull, and I can't get your brain data. So there's two options. Participants either say, okay, thank you. Okay, sorry. Or they're really patient and kind people who let me move their hair and put electrodes one by one by one by one in their, on their scalp so that I can get that data. The problem is then that we all, I have to make sure I communicate to the participant, yes, the electrode conducting pace is water soluble, but maybe ask them, do you want to do this on a hair care day? And if you do, I want to make sure that I give extra remuneration to participants who have extra hair care considerations because if I do a study, my friend washes my hair in the sink. That's fine. That works for me. If somebody else who has to go home, spend a lot more time working on their hair, they should be reimbursed for that time. Right? So think about diversity from the outset of funding your experiments. I did, that was a long harangue, I'm sorry. But pros and cons of collecting data is that the pros are that it's customizable to your research paradigm. So I'm really interested in the relationship between reward processing, decision making, and attention, and how it goes wrong in addiction. Oftentimes it's really hard to get addicts in the scanner, right? They have a lot of well-founded trust in large structures like universities, healthcare systems, what have you. So I often need to go out and collect data, but when I'm developing my computational skill set and I can just use healthy brains, I'm going to avoid the time and resource expense of collecting data and utilize what is already out there. But if you are going to collect data, Danielle's pro tip is to test your analysis pipeline with a small pilot sample first. It's not only good for you for your downstream analysis, but it's great for registered reports, which have we heard of this earlier today at all? Or in general? Any nods? Somewhat, okay, not a lot of nods, so we're going to talk about it. So registered reports are when you say to a journal, I would like to do this study. Here's my introduction, here's my methods. Do you, and it goes through peer review, it's like half the paper goes through peer review. If they accept it, that means whether or not you have no results, that paper is published. That's pretty cool. And that actually gets over the trend where large journals have this, I don't know if you guys have seen the correlation between the higher impact the journal, the smaller the p-value a study typically has, and therefore the burying of no results. So it's really nice that some journals are happy to do registered reports, which it is more time and effort on early career researchers often or whoever's doing it. This is why it's nice to say to funders, I'm going to budget extra time to collect this data in a reproducible and ecologically valid manner. I'm going to develop my analysis pipeline with some pilot data and test it, and go to the time and effort of doing a registered report, then finish recruitment and study sampling. Open Science takes more time, but for neuroscience example, if you are going to collect data, data anonymization is important, because we want to share our data, but if somebody comes into an MRI scan, and I take an image of your brain, it's not just capturing your brain, it's capturing your beautiful faces as well. And so we need to think about how and why we're going to scramble that, and how we're going to take care of that before we upload this data to whatever open neuroscience data sharing repository. And I'll cover some of those in a bit. So I have some examples here of how some people just completely remove the skull, the other sorts of tissues and what have you, and just go straight up to uploading brains. You can do defacing algorithms, you can do blurring, masking, whatever. And so there's pros and cons to each of these, but I just wanted to give this as a highlight of how we need to think about participant safety before uploading data. But you could also just use previously available data, and we love this. My whole PhD was basically the Human Connectome Project. Pros, it's plug and play, it's well validated, it's easily citable, but the limitations are what I mentioned before, and it's the pros of collecting your own data. There's limited study populations, maybe you have the study population you want, but not the imaging condition you need, etc. But one thing I really like about open data sets is that even if I collect data, I can often try to reproduce even some of what I do in my collected data with an open data set, make sure that my results are replicable with these larger sample sizes than my lowly grants and funding can afford. And if you are just going to use open data from the outset, I encourage you to look from the outset for compatible databases that you can do test-re-test reliability. So I'm a big fan of using the Human Connectome Project, which is an American project, in conjunction with the Chinese Human Connectome Project. So lots of similarities in how the data was acquired, but you have a really different population from which that data was collected. Brilliant. Pre-processing and analysis. This is what I spend most of my time doing, and this is what I'm going to spend the least amount of time talking about, because this gets real technical real quick, and I don't know how much you know or care about this, but if you are interested, come find me. But the thing that I will say is that a lot of pre-processing and analysis uses a plethora of open-source technology, and choosing a toolbox can feel like an analysis multiverse. So when swimming through the sea of analysis techniques that you can use, one, I'll just leave you with a few tips, one of which is to use bids, or brain imaging data structure. It's a way of organizing your data, where a lot of open-source technology can take your data and shove it through their pipelines, give you shiny results, but better yet, if your data is in bid structure, and when you have properly anonymized it and gotten consent for it, uploaded it to a data sharing repository, everybody else can plug and play. It increases the ability for other researchers to use your data. We love it. Now, let's see if this demo will work. Who's heard of Neurosense? A few people? Hazard, but I'm glad most of you haven't, because that means this is fun and cool. Actually, I have it pre-loaded. Neurosense is one of these examples where a ton of data is uploaded, and we get to see what things look like. So Neurosense.org, so let's see. So all these studies that have done language, they just have language as a part of it, right? It could be listening, it could be talking, we're keeping it vague, but you can see how many studies did this say it was? A thousand one hundred? That's a lot of studies, and yes, there's some light, and you can just click through. You can download these maps. I won't get too much into why this is really useful to prevent a circularity of analysis, but it is. Somebody give me a term. Dementia. Dementia, thank you. Oh, you picked one that you knew was going to be up here, didn't you? I'm grateful. Only a hundred and forty-two. Well, ah, so maybe there's a hundred and forty studies of frontotemporal lovar dimension? Who's to say? Either way, the cool thing about this is not only can you see the areas of the brain, you can then see all of the studies that have contributed to those fMRI-based magnitude of the bold signal changes. You can also do a bunch of FAQs, fun neuroscience and analysis tool. That'd share it. Right. Dissemination. How are we going to share our stuff? Who has heard of paywall, the business of scholarship? If you haven't, you should watch it. It is a documentary on, and it's free and open source. If you go to that website, you get to watch the documentary. There is a terrible, just very boring joke about two minutes in. If you get past that, the rest is great. I always say, but the point of this documentary is that it talks about the history of the for-profit publishing industry and why it's just not great today. It's a necessary evil, but if you want to learn more about it, check out the documentary. However, if you're also a part of the academic sphere, where we need to just fill our CVs with DOI so funders are really happy, then there are other ways for you to do this other than traditional research outputs, many of which include things like preprints. Great, but also protocol papers. Protocol papers are peer reviewed, and they contribute to this kind of open methodology and gives you a space to really get into the guts of what you've spent, probably at least a year, really sweating over. All of the trials, all of the errors, all of the bugs, you can put that in a protocol paper. Otherwise, you just start publishing those. Registered reports we've talked about. There's a whole spectrum of open access. I'm not going to make you sit through it. These slides will be online, so that's there. Who's heard of Psyhub? Good, everybody, moving on. There are other means of dissemination. By the way, in the paywall of the documentary, Anna Kay, I forget her last name, the creator and coder for Psyhub, she features in it, which is very fun, so she gives cool interviews. Anyways, how am I doing on time? You've got like six minutes including questions. Cool, I'm almost done. Other means of dissemination. Y'all have heard of GitHub, but there's a few you haven't, and if you're a neuro nerd, Nitric is a good one. There are others. Open data, I told you to come back. These are all of the different data sharing repositories for mostly fMRI data, but there's also EEG data, MEG data. Have fun. These are more examples of why we should care about open data, funding agencies, researchers, the public. I kind of got over it. Parting words are going to be short and sweet, because I actually want to hear your words and questions. And it's just the simple fact that open neuroscience benefits everyone. Thank you all for your time and attention. Thank you. Thanks. Thanks. Questions, my friend? You in the yellow hoodie. Thanks a lot for the presentation. In terms of explainability of these analysis and thanks, I've read a lot of research papers, but a lot of them don't really include the code. And lately maybe it started from 2021. They had something, so the European Union started publishing some stuff, but they did not keep up. And it's not keeping up anymore. I think there's just only maybe some journals which are accepting and are enforcing that. So I believe that's going on. Poorly. I'm also frustrated. I read a paper. I see the coolest analysis. I'm like, how can I do this? And I control F code, control, yeah? Ah, I told you I was going to do that anyways. I stink. Okay. So what do I think about the state of open code sharing? There was some, you know, progress being made in Europe about it, but now it's kind of died back. I'm based in London and I don't see too much of this done in the American sphere as well, where I'm from. I say, y'all, you can tell. So I will say that the, to answer the question of the state of data sharing and neuroscientific papers to increase the interpretability of the analysis and also the ability for you to just play with it on your own, it's not great. I wish more journals enforced that. It's interesting to me that you have funders who ask about open source stuff, but then the other kind of top gatekeeper of academia publishing journals don't enforce it. And I think this is where grassroots efforts do so much, but if you make it policy and regulatory policy as well, then you'll have a lot more uptake. And then we can all do it a lot more. This is probably stuff that you think about yourself, and I'm sorry I don't have better answers for it. I also wish it to. Anyone else? Fun questions? None? Maybe you're shy. Talk to me after. Any more questions, my friends? We do have a minute. We have time for one. We have time for another as well. Yes. You've talked a little bit about scales. You've talked about the gene in the R.I. And I have one question. Immediately pop up when I see such a theory is, how does all of this relate to the multi-scale problem in the population of neuroscience? And... Yeah. It's a great question. It's something that I know the field is trying to... Oh, thank you for repeating the question. Yes. When I talked about scale-free systems, how does this relate to the aspect of multi-scale integration of neural data in understanding the brain? Is that a good summary of your question? Thank you. Okay. Then, for me, it's trying to harmonize multimodal imaging. I've not quite bridged the EEG MRI divide, but I'm starting a bit with PET MRI. So, in this sense, I am trying to develop a better understanding of how processes at the neurotransmitter level, positron emission tomography, to those of you who don't know, is a way of injecting somebody with a radioactive ligand. If it's glucose, your brain is an energetically greedy organ. It takes 20% of the body's cardiovascular output. If you're using more parts of your brain, you get more blood glucose. But you can also do it for mitochondrial activity, for different pathological proteins, regardless. You can see where cellular-level phenomena are moving throughout the brain, and then, because you can do PET and fMRI at the same time, start to see how those cellular-level processes relate to these macro-scale processes that we capture with fMRI. Since those are captured at roughly the same time scale, we can start to develop an understanding there. In terms of integrating EEG and fMRI, there's some really cool work I can maybe share with you a little later if you're interested, but I have not been able to do it personally yet. Thank you. I think it's time to wrap up. Cool. Thanks. Thank you.
MiniMill: a miniature Field Mill Electrometer for airborne platforms
Thank you so much and thank you for staying. I know that I'm keeping you up from your beer, which will be quite lovely and with a gnarly piece of software and hardware, let's say. So I'm Lily and this is a part of my PhD work that I completed a couple of months earlier. I'm a researcher in astrophysics and atmospheric physics and what we did with a couple of folks back in where I'm coming from, Greece, we made a miniaturized sensor that's used to measure the electric field strength in the atmosphere and I'll show you in a bit. So a brief agenda of the presentation, so I'm going to do a bit of an intro and discuss a bit the scientific question that was posed in our minds and that attracted all the funding for this research to be conducted and then why we chose open source, which is self-explanatory, let's say for that case, the sensor assembly and a few testing bits from the sensor, some observations from scientific campaigns that we did throughout all this period and then our data sharing platforms and what's next and what we hope from the sensor. So a lot going on here but what I want you to keep in mind is that we're coming from a field of, let's say again, atmospheric physics and what we were seeing throughout all the years of science is that our experimental data were showing that particles and particularly ones coming all the way from the Sahara dust particles were transported all the way towards central Europe most of the time and we are all the way from the Atlantic so we couldn't figure out why that was happening and we couldn't because our models and our forecasting models couldn't depict this change. So we were hypothesizing that these particles are moving within a fairly dynamic system which is our atmosphere and that we have this vertical electric field that's within the entire spectrum of the atmosphere and as they were moving within there was some electric force most probably that would act on the particles and make them to, let's say, float within the atmosphere or negate the gravitational force that was acted on the particles and then another, let's say, after effect of these electrical forces could be that it would orient the particles on the vertical direction and that would give them even fairly more time within the atmosphere which was another part of my research back in the PhD days. So in order to have and collect all the data that we could possibly have for our hypothesis we needed to do some measurements from ground-based sensors but we also needed and found that the best practice to do was to launch balloons over in the atmosphere, lower atmosphere with atmospheric electricity sensors and acquire the data that we needed within the layers of the dust particles. So could we verify our hypothesis through the observations and then we needed new developments in order to do so? So the only way was that to do all these launches that I talked about before. So let's say why we went for the open source in terms of, you know, all the traditional academic structures that do not allow the scientists to form collaborations and they did not foster, many times they did not foster collaborations between open source practices and then there's this concern in terms of the researcher personal status and recognition and that all the open source work that's been done doesn't get recognized as it should be and again then there are these biased reward systems that fog, let's say, the perception of the researchers and what we wanted to do is exactly that challenge and create transparency through our work and what we found up till then concerning the specific sensors was unfortunately fairly closed so we had a couple of projects which were depicted from publications, you know, the traditional system of the academia to present their results and these were for the left-hand side they were not tested on these balloon-borne platforms that we needed them from and they were most of the times sketchy and there were nonexistent schematics at all in terms of the sensors which was pretty bad and then there were some other stuff that were mostly homebrewed projects which weren't, let's say, tethered to our applications for the measurements or that they weren't sensitive to the electric fields that we wanted to measure so these two on the right-hand side are both very cool projects and we used a lot of their, let's say, bases for our projects so again cool ones but not the exact one that we wanted for the data in the research field so a few words about the sensor we are calling it a mini-meal which is a miniaturized version of a commercial field-meal electrometer that's what they are called and what it actually does in the principle of operation is that you have a fan with two vanes that they're periodically screening a sensing electrode on the bottom side of the fan and this what does is that it induces charges on the effective area of the electrode and this time-resolved charge creates and induces an un-stern induced current and that's what we measure with the sensor and the internal processing and the data for the sensor are amplified and then we get the measured voltage which is twice the amplified output voltage from each of the coupled electrodes for the latest version of the sensor we were using DC-brustless motor which was used and the speed controlled respectively in order to minimize the electromagnetic interference in our system and the circuitry and then we also integrated gyro three-axis gyroscope for the rotational position of the sensor so we were trying to use as low RPMs as possible for the optimal response of the of the electrometer on the right hand side that's the exploded view of the sensor it was used in we were using layers so we have like the cover plate which unfortunately you can see here because it's shielded by the electrode view for the PDF and then we have the motor and the veins that I told you about and then we have the motor mount and the back plate which was all cut in aluminum for sturdy purposes and then we have the intermediate PCB with the electronics of the sensor so it was ADC it was controlled by an analog to digital microcontroller and we used a serial transmission for the data so it was all assembled in the institute with the two guys that I showed you before and that's the final result for the sensor it's quite a small and robust sensor like 8 centimeters at each side and we have used a thermal shielding because you know as you go up within the atmosphere the temperature drastically falls and we needed to shield our electronics and the battery supply and what it does is that I said measures the electric field strength with the altitude so the pros for the sensor was that it was easy to reproduce it's a fairly low cost sensor with all its the hardware and the electronic components components it's lightweight and it's disposable which is it's a bit of a problem because you have you know all these metallic parts and all these circuitry that are going into the ocean most of the times and they are not retrievable because you know the cost of getting it it's quite larger than the sensor itself so we will jump to another version which is going to be not biodegradable but definitely with less metallic structures and hopefully it will be sturdy enough for the measurements that we would like to do and same same validation methods so on the cons it was had these bulky electronics because it was an implementation on a PhD level and we had the minimal amount of knowledge for this for this project but hopefully it's going to get better and better with SMD's and stuff so what it does in scientific terms it slightly overestimates the electric field because as the sensor you know it's actually hanging from this balloon and you have all these wings and shears that are hitting it as it goes up so we have parasitic electric fields that can be detected from the motion of the sensor but we are using a rotation the rotation of the sensor to minimize this effect and it has unfortunately a limited operation temperature at minus 50 degrees it can mostly operate at this specific version on dry conditions so if you have like heavy rain conditions it's going to be heavily biased in terms of the output but that's that can be fixed easily and it has a maximum altitude of about 16 kilometers which was in the atmosphere again which was fairly biased by the battery level freeze due to the decreasing temperature so it's a sensor you have to do all sorts of calibrations to be ready and robust for measurements that you're taking so we had a hard vibration test we were actually you know moving the sensor around and getting the data we had the temperature resilience test and we tested its response with a commercial which was well validated instrument and they had similar responses pretty good ones and again we put it on different different conditions and the most you know famous let's say set up for calibration of such sensors is that you're putting them like into parallel plates and you're using a fixed voltage between the plates and you are securing the sensor in order to measure this fixed voltage that you have and to control the output so that was the standard calibration setup and it's also since it's pretty small and it doesn't have you know these complex aerodynamics can be easily tethered to UAVs we haven't completely tested that through modeling but we know that it's viable and it can send send its data so the telemetry now that was a bit complex because we couldn't use we didn't have the knowledge actually to make a telemetry on our own and just pop the data down but we also needed meteorological data from collocated with our electric field data because that was just a single measurement and we need other parameters also to explain you know this this ecosystem of models that we needed to use for the forecasting so we created another second sensor which was also measuring its own stuff and we desechained our electrometer to the second sensor and then to a commercial radius on if you are not familiar with with these these are like small instrumentation that are flying almost every day from around Europe and the meteorological services are releasing those those balloons and they are giving us all the data that we need within the atmosphere for meteorological parameters so a cool stuff that was done here the the radius hunt had its own protocol data protocol the X data one and we needed to do a bit of reverse engineering in order to connect our sensor to the the X data protocol and then trick the the ground station that was receiving the data that our sensor was actually one of the commercial ones which was yeah which was dumb because you you just was putting we're putting with our hands an ID and a fixed ID and the the ground station knew that that was completely different sensor I said you know okay just passing through the data that's that's fine and then we created the decoder for the for the data and then we have the raw output and also the sensor was able to to measure as a standalone but on the ground not flying up in the atmosphere with the serial that was giving out a text simple text output and then again we said you know okay we have one we've made all this effort like for a couple of months to create one let's create another 30 which was quite tedious but yeah but we needed that because you know when you when you research all these dynamic systems as the atmosphere you need to have as temporarily dense data so we needed to launch almost every day or maybe twice a day if we if we had the correct conditions so of that we are going to a very very nice place the Cape Verde islands in South Sahara and South Africa yeah and we've launched there through an initiative and a campaign that we did with the ESA the European Space Agency it was called ASCOS it was a pretty big campaign and it had all sorts of atmospheric stuff going on over there it was not only that part but we were tasked to measure the electrical conditions in the atmosphere and some you know very fancy results from the first experiments that we were doing on the plot whatever you see these colorful dots these are different days that we were launching our sensors and on the x-axis you have the electric field strength which is measured in volts per meter and on the altitude the x the y-axis you have the altitude so within these layers that you can see but visualize here of dense dust particles we have this increase of the electric field that we are seeing in different days of thank you of launching our sensors and that was a pretty good win because you know from from this addition of the sensor we were getting data which were quite realistic and we were getting some hints that the dust particles were charged indeed as we were expecting again we did a correlation between the wind speed and the electric field with a wavelet transform and we saw a small anti-correlation but that gets a bit too researchy and technical so I'm going forward and that was a very very nice day for us because we are going all the way again to Cape Verde and that's a day that we have a pretty intense layer dust layer what you see with this colorful green thing over here is dust within the atmosphere it's measured from a very specific laser which is called lighter and we also launched the sensors that day in order to see if we have this electrification of the dust particles and it was a pretty successful one we have many many and lots of data from the days that we have on the campaign and and the data are also available so the sensors for the day let's say were operational up to 40 kilometers due to battery decay so we are keeping the data up to that point so now what do we do in terms of sharing all this work so primarily our repository is in github and we have also collaborated with the evaluation and validation center for the data of ESA and that's where the the raw data set from the sensors and the entire campaign can be found we also have right over here the escos calendar which shows a step by step process of what were the instruments used the the daily plots from the sensors and then all the other instruments that we were having and all the work including all the publications done for the work the the data sets and all the presentations can be found in in zanodo so please do contribute if you if you find it interesting and yeah so some long bullets but i would like to to mention that so what were the challenges that we faced as we were conducting all these open source practices and all these research is that we had some you can have fairly large standardization issues so if you have variations within the sensors and all the the bulk of the sensors there it can affect the right reliability and the consistency of your collected data so another step was to have a pretty good documentation quality so that we can ensure the the project success and especially for those which that were not quite familiar with the hardware we do not have that many contributions to know if it is a good documentation but you know please do check the the project so also we needed to have a technical expertise minimum technical expertise for all those that would like to assemble the the sensor or that we employ the people that we employ to assemble the sensor so also we have data calibration and validation that needs to be compliant and with the regulations unfortunately of large research organizations when you're talking when you're talking about research that can affect many fields and being interdisciplinary so also we had to battle a bit with the funding constraints which is present and limited in terms of the open source and open research projects and we had to do a lot of testing and and also the quality control of the data themselves can be quite challenging so also we have the the ending that's the long-term support when addressing open source hardware issues through releasing updates of the of the hardware and if that's going to be a compatible technology for the second version of the sensor so the key takeaway is that do we answer our initial scientific question and fairly yes which was quite good and the minimal produced nice results and we were pretty happy about that and it's definitely not something that can be used only in research you can you can build it on your own and just use it for fun if you'd like on your backyard so in retrospect in terms of the in terms of the result that I had in my personal research is that the electrical properties do not play a significant role on the long-range transport of the particles which was a boom okay because we we didn't know all that work but again we had a milestone and a physical mechanism that was not accounted before in all these research so what's next the next step is that we want to migrate in complete open hardware technologies like a free cut and key card and version the version 2 is being baked at the moment and we would like to do it on our local hacker space with a bunch of people that were interested and yeah that's more or less it from me thank you so much all and that's my email if you want to address some questions apart from the ones that you're going to do now if you have to and my geek handle back there yeah hi so if it doesn't turn out that electrical properties are not responsible for the transportation of desert dust do you have any ideas or suggestions what might be yeah I do thank you for the question that was that was one that I was expecting on my phd defense and never came so that's good yeah um um a factor that I didn't you know quite oh do I need to you yes please repeat the question so the question was that if there are the the electrical properties and not a factor that will be very crucial for the particle dynamics if there's something else that plays a role and if we have an idea about that and yeah yeah we do um it's actually the orientation of the particles which was the the second part of my phd that I was uh researching with an instrument and we've seen that particles are indeed getting oriented in the atmosphere and from you know complex modeling of the particle dynamics we know that they're going to float a bit more from that movement alone so but that movement won't be efficient only by the electrical forces you have winds like strong updrafts that um are gonna you know orient the particles or maybe do a sling sort of effect and make them flow within the atmosphere so there are pretty complex mechanisms but that was a bit a small milestone to that maybe one more yeah yeah really one of the open source hardware uh so in your documentation do you also essentially for example duck on all the steps of the hardware along with for example calibration and maybe maintenance and repair details yes you said you're working on a second version are there any efforts for example to say institute replication efforts around the if there is a story and a replication to essentially as a open source hardware application yeah yeah yeah it's going to be so we have recently uh well oh yeah they repeat the question yeah I have to remember all that you know and my my brain is jammed so um if the the the question was if the current documentation uh has you know all the the tedious processes that we've used from the hardware uh documentation to the um steps step by step process of calibrating the sensor and stuff so yeah that's that's correct we have it and uh the the concept is to have to be able to you know with with some small steps to be able to recreate the sensor yourself um it needs a bit of you know cnc machining and cating it doesn't have 3d printing at the moment but maybe uh the best strategy would be to wait for version two which is going to be a fairly usable and much more usable and then you can reproduce the results yourself so the the basic calibration process was the one that I showed you we did it like in our in our house and it's yeah you you need the voltage supply and the the plates and that's more or less it but yeah for for the purposes of the research you need to do all the other stuff that I numbered in terms of calibration thank you yeah uh is this kind of measurement also done in space yeah that's a nice question um the specific yeah is it nice you know you you want to go with the flow and just answer quickly but that's yeah so are that no no I didn't say it did I say sorry um so does do such kind of measurements get repeated uh in space uh well not with the specific sensor because it's um pretty um pretty bad in in low temperatures pretty pretty bad in low temperatures and it doesn't have space hardened components so the the basic strategy to go to there are like rods that are collecting particles and these are translated to electric field measurements when we talk about let's say rovers for example when you're going to mars which is actually the image that I showed here so that's the planet marsh and it has a fairly large global dust storm um which is pretty pretty nasty and lasted for at least a couple of years so when when you're speaking for that kind of technology that's the go-to sensors but these were sparse I think there was actually one mission that was successful and that landed on on mars with such an instrument that was using electric field measurements and then apart from that nothing yeah around the earth around the earth yeah um not not that I'm aware of I know that there are some uh processes of magnetic field measurements that are done with small uh pocket tube satellites but I'm not aware of electric field measurements yeah because you know if you make the satellite two centimeters bigger you have the nano satellite format and you can it's quite easy to get in yeah yeah it's quite easy yeah and I hope that you stay there all the time yeah all the time I know um we're actually uh I'm pretty close with guys that are uh with us uh in in the room they're called uh Libre Space Foundation and they are using all sorts of open hardware and open software stuff to make small satellites so it would be a nice collaboration let's say where yeah he's waving yeah he's doing all sorts of mechanical engineering stuff yeah so it needs the the objective it needs the scientific objective yeah that's it okay thank you so much thank you
Welcome to the EU Policy Workshop Devroom
So good morning everybody, welcome to the open source in the European Legislative Landscape Devereux. I have a confession to make which is that we applied for this devereux two days before the end of the closing deadline and we have made it up as we went along after unexpectedly being awarded a devereux. So the whole day is very organic but it has a very important purpose. We've discovered that the European Union has noticed that devices contain software and that the software needs regulating. And they have started doing an amazingly effective job at writing software into regulation. So one of the people we have with us today, Benjamin Bergel, wherever Benjamin is, he's presumably, I know he's here but he's hiding. He was involved in writing the NIST 2 directive and then he went on to write the CRA and he is surprisingly expert if you have a low opinion of EU policy officers or unsurprisingly expert if you know that they're all generally brilliant people. However we discovered that the EU's model of what open source is is that it is low quality components full of defects that are created by hobbyists in their basements. And the regulations rather reflected that. And so we found over the last year it was very valuable to engage with the regulators. Today what we want to do is not talk about the technical details of any regulations but rather gather the feedback of the open source community so that we can document the reflections and outlooks of the community for the benefit of the commission as they go forward in regulating within their digital agenda. So we've arranged for there to be four workshops today. The first workshop that is starting in six minutes is a workshop on the consequences of the Cyber Resilience Act and the Product Liability Directive. Then the second workshop which starts at 11.15 is going to look at how we engage with policy makers as a FOS community. The third workshop which is at 1.20 is going to look at how we can assist in getting more free and open source software in use by public administrations. And the fourth workshop is going to look at how the free and open source community can come alongside the task force that is implementing the DMA and the DSA and promote interoperability given that the best path to interoperability is not standards but rather the implementation of standards in shared open source packages. So that's our agenda for today. We have some ground rules that you'll see again during the day. First of all we encourage you if you are like me and you talk a lot to maybe talk less and to encourage and leave space for other people to express their opinions. We encourage you to always be holding the microphone when you speak in a session where notes are being taken and that is all of them because today we have four rapporteurs free to the workshops. The rapporteurs will be listening to what's said, noting down the substance and writing a written report for us to send to the commission after the workshop. When you do start speaking please make sure every time you start speaking you indicate who you are and if you have an affiliation what your affiliation is. Please note that this is a very complex topic and we know that it's a very complex topic so please be open to new ideas. When we run into an intractable problem let's note it and move on to something we can fix rather than obsess about the obstacle. And finally there's two ways of looking at this. Please observe the FOSDEM code of conduct or if you prefer let's have fun and make new friends.
CRA & PLD: [begin workshop] How will the open-source community adapt to the new EU Cyber Resilience Act and Product Liability Directive
I'd like to hand over to the chair of the first panel, which is Martin Erz and Martin Erz from MLNet Labs, and is going to lead what we do next, Martin. Thank you, Simon. So welcome to the first block of the day, which is about CRA and PLD. I just heard from Simon how the structure generally works. I will say a couple of works about how this block will work right now. So an important person during this session will be our rapporteur, who will be writing down all the things that the speakers say, but also perhaps the things that you will bring in, because the idea of these sessions is to actually have some interaction. So for this session, that will be Merco. Merco will be our rapporteur, and at the end of the block, he will summarize what he learned today. So for the agenda of this particular block, we will have two lightning talks. We will have a panel, a workshop bit where you can actually do something yourself, if you haven't already, by asking questions. We will have a third lightning talk, and then we will close with a rapporteur's summary. So that's our agenda until about 11.15.
CRA: 40 new ways the CRA can accidentally harm open source
Hi, so my name is Toby Langelle. I run a small consulting firm based in Geneva, Switzerland. And I have kind of straddled throughout my career open source and standards. So people thought it was a good idea to bring me in to talk about this. So this lightning talk is called 40 New Ways to CRI Can Accidentally Harm Open Source. And that of course references the 40 plus standards that are, the harmonized standards that are going to be written in the next couple of years to essentially make it possible to implement the CRI. So the first thing I want to say is the CRI has landed. It could have been really, really bad. A lot of us were really, really concerned. And it turns out that it isn't. Firstly, first thing is the open source community rose to the occasion. And I think that's really amazing and it was beautiful to see. And like a lot of people put a lot of work in. And I think we should all be very thankful for the work they have put into helping us. And then also policy makers actually paid attention, listened, and considered the input from the community. And also for this, I think we ought to be really thankful. So thank you for both sides for making this happen. In the process, we avoided harming open source pretty seriously. And we also avoided harming that EU's ability to leverage open source, which was another one of the potential risks of the original versions of the CRI. So we do now have a lot more clarity. There's an asterisk there because lots of people still have lots of questions, myself included. My key takeaways from the last version of the CRI is that the responsibility falls in the right place. IE was the people monetizing open source. The company is monetizing open source. So for me, this is really important and it's great that this has spelt out really clearly in the last version. And then the other thing that I thought was really interesting is the open source stewards, this new notion of open source stewards, which really institutionalizes the foundations that have been playing an important role in our space. And it's also, I believe, a really smart instrument for the EU's ambition around Southern Tech. That said, it's going to have industry and ecosystem-wide impact. I think companies will be a lot more cautious. I will certainly advise my clients to be more cautious. And a lot of projects will move to foundations and I think they will do so earlier. And then the conformance requirements, they're going to climb up the dependency tree. And so essentially, I'm suspecting pretty quickly most of the ecosystem will actually be subject to some parts of the CRA, probably the lighter version that is for open source stewards. And I do have a question that was this. This is going to create a lot of financial and work overhead. And I'm still kind of wondering who's going to be paying for this. So I think this is a question that will need to be dug into a little more in the future. So to meet the CRA, there are essentially going to be two options. Either you demonstrate conformity by yourself, so the burden of proof is on you, or you will essentially follow a set of standards, the harmonized standards, and this is going to provide presumption of conformity. So the fact, though, the standards are going to be how the CRA impacts open source because that's what everyone's going to do. Essentially follow the standards so that they can be presumed to be conformant. And so 40 plus standards, that's 40 plus way things can go wrong. If you believe that the standardization process is less opaque, easier, more open source community friendly than policymaking, I have bad news for you. And so essentially the same kind of misunderstanding, the same kind of risk that CASC carry through the CRA is probably going to carry through 40 different standards. Actually sitting in 40 different rooms to make sure that 40 different standards don't harm open source in a weird and unexpected way is a lot of work. So I mentioned the opaque standardization processes. Also open source has special requirements. Things have to happen in the open. They cannot be patterns around the standards. And not every organization functions in an open source friendly way to put it mightily when it comes to how they deliver the standards and how non-uncumbered by patterns these standards are. So that's also something that will be incredibly important to make sure that the open source community can actually have access to those standards and be able to implement them. The two last points is there's a huge diversity of open source stakeholders, a lot of which were very poorly represented in the CRA even though the open source community was there. So there were these two words obviously and they were very much involved. Hobbyists, it's very hard to actually represent hobbyists, right? Small commercial open source startups that are going to be incredibly impacted, including in the EU because they will be considered manufacturers, rightfully so. Probably don't have the resources or the know-how to be involved in the process. And the last point is interop was other jurisdictions. One of the huge strengths of open source is the fact that licensing is essentially standardized worldwide and like the MIT means the same thing here and there roughly sufficiently that it's like okay. And if we start having security standards that are different across different jurisdictions it's going to be a huge burden on open source maintainers and open source developers and we want to make sure that if you comply to whatever the EU comes up with in terms of standards it's fairly similar to what NIST is coming up in the US, etc. etc. And that's it. Thank you very much.
PLD: When software causes harm – who pays and why?
Okay, so I'll advance the next lightning talk, which is about the second big legislative effort that went on during the past, well, couple of years, really, smiling at one of the people that worked on it in the European Commission, which is the Product Liability Directive. And with us today is Rob Carolina, who is the General Counsel for ISC, Makers of Bind, who's going to give you an introduction into product liability in five minutes, which is... So, take it away for Rob. Martin's original idea was do the product liability thing in three minutes, and then you can do some other stuff for two. So what I'm doing here is giving you a reading test, and I'm trying to condense down to two and a half minutes a topic that we spend about 40 to 60 hours on in law school. So the reason that I'm giving you this reading test is because I want you to be familiar with this fact pattern. I'm going to tell the story in reverse from how I usually do it. This is a story about an automated car that hits a pedestrian in Ireland, Pat Victim. That car has on board a piece of software called Bravo Drive, which has included within it a piece of software called Open Sesame. The car was imported by Exotic Imports. The car was manufactured by Einstein Motors in California. Einstein Motors got Bravo Drive software from Bravo Bits BV in the Netherlands, and Bravo Bits VB got Open Source, Open Sesame from Firefly APS in Denmark. Terry Dastardly hacked into the automobile because of a weakness in the authentication package, provided a few inputs. And the next thing you have is a car that runs over Pat Victim in Ireland. Don't worry about Terry Dastardly. He dies or she dies in a horrible paragliding accident or without money or is run over by a bus. Just take them out of the equation. The question that product liability seeks to answer is, in a situation like this when we have an injured victim like Pat Victim, who pays for their injuries. Two slides that look like this. This slide is designed to teach you the difference between two different legal theories on how you sue people who manufacture things. The left-hand side is the law of negligence, at least as it's practiced in common law countries. I would not come to a civil law country and teach people about the Napoleonic Code. However, I will talk to you a little bit about common law and suggest that the two are not worlds apart. As you can see from the chart, when our victim tries to sue all these various peoples, Johnson, exotic imports, Einstein, Bravabits or whatever, Victim is in a little bit of difficulty because the people who manufactured and imported the car did everything reasonably. They selected good components. They selected trustworthy producers of things. They did not act rashly. Whereas the error in the situation came from a software vendor called Firefly and maybe, just maybe we could establish that they owed what's called a duty of care to the victim. If someone like Pat Victim was a foreseeable victim when someone wrote this authentication package in Denmark, but as you can see, it's going to be difficult to establish that. Now, in the reading test that I gave you one slide ago, I did put in there that the folks at Firefly, they had a bad week. The problem with their package was because someone made a coding error and the QA people were kind of asleep that week because we're going to get that in a forensics report from an expert who's going to come to trial. The right hand side of this slide is designed to teach you a different area of law that was adopted in the U.S. in the 1960s and in Europe in 1985, which says what do we do in situations like this where everybody acts reasonably but Pat Victim still has injuries? And the answer is we don't look for people who did things unreasonably. We don't care how careful they were, how cautious they were. We look for people who manufactured and put into circulation a dangerous product. We tried really hard to make it safe. It doesn't matter. If it's dangerous, it's called no fault liability for this reason. And as you can see, because the automobile manufacturer and the importer, and this is the law as it exists today in Europe under the 1985 directive, because they were dealing with a product that is dangerous, they will be strictly liable, but the software vendors will not because software has not been deemed to be a product. Enter the PLD, which changes things on the right hand side of this chart. And as you can see, what happens here, one of the design characteristics of the PLD, and the origin of these slides, by the way, was I did a talk at Etsy five years ago, which said this is coming. So I keep using the same slides for five years, and they're still accurate, is that we recharacterize software as a product, and now we can attribute liability to Firefly because they distributed a dangerous product, a piece of authentication software that had been, that didn't work properly. We'll just leave it at that for right now. And since we're running a few minutes ahead, I have one last slide that I'll show you, and I'm just going to hold on this for 60 seconds while you read it. If you're looking for a copy of this, I just posted it half an hour ago on X and on LinkedIn. So whatever the answer is, depends on what questions we're asking. I know a question I'm asking, I'm the guy on the left. It appears the questions on the right were the questions asked by the European Commission. And that's how we have the answers that we're talking about today. Thank you. Thank you, Rob.
CRA & PLD: panel
Okay, so welcome back to this session on the CRA and PLD block. We are having a panel with some of the people that directly wrote the pieces of legislation we're discussing in this block. To my left we have Benjamin Bergle, who is working for the European Commission as Head of Sector for Standardization and Product Security. I almost did it right. Next to him is Chuck Dinghou, who is Director for the Python Software Foundation and we really wanted to get a community perspective on this panel, which is what Chuck will provide and also what Chuck will challenge you to help us provide, because that's kind of what we are trying to do here. And finally we have Omar Enaji, who is Policy Officer for DGGROW, who has worked on the Product Liability Directive for multiple years now. My name is Martin and I will try to ask some questions. You will be asking the really clever ones. I will be asking the other ones. So let's get started. I would like to ask our panelists to do a real quick introduction specifically to answer the question, what does implementation of these laws mean to you? Because we've been over the proposals, we've had the negotiations, they're about to be confirmed by Parliament. So what this panel is about really is about looking forward. We're not doing the negotiations over. We're now looking at when will these actually hit Europe and what's needed to get there. Thanks a lot, Martin. So for the Cyber Resilience Act, I mean the text isn't final yet, right? So we don't know exactly when it will enter into force. As I said yesterday, sometime around the middle of 2024, maybe a little bit later. And then we have a three years transition period. So manufacturers, hardware and software manufacturers, they will have to start applying the rules roughly around June 2027. So that gives us three years during which we can prepare for the implementation. We just had this fascinating presentation on the 40 standards, right? So that's going to be a huge part of our work, helping the European standardization organizations with the standards. We will also have to produce guidance, of course. And thank you actually very much for inviting us here because I think these are the venues where you get all the tricky questions that need to be answered in the guidance, right? Because of course the CRA is a high level piece of legislation. It will not provide an immediate answer to every edge case that you may have. So I think this is where the guidance really comes in. And we want to be inclusive in this process. We want the community there, open source, single vendors, everyone. And we're really looking forward actually to this process. Thank you Martin as well. So for the PLD, a bit different from the CRA because it's a directive and not a regulation. So it requires transposition at a new level in each member states. Actually the law will be applicable in each member state. So this year will be 2026 in theory around June, July. It will depend exactly when the parliament will give the vote. And by then the liability rules will be kicking in. So yeah, that's roughly it. Would you mind spending a couple of cents more on the difference between a regulation and a directive because we appreciate that a lot of you may know a lot about software and we also think some of you may not know a lot about you lawmaking. So can you? Yeah, so I mean just a quick legalistic view says at you level you have three types of acts. You have three types of acts. A regulation, a directive and a decision. A regulation and a decision basically only requires to be directly applicable at national level. But the law remains the same. For a directive it requires transposition and the transposition is basically incorporation into the national law. You will have 27 different laws that would say the same basically. But because of the particularity of the directive it would also require changes in some other parts of the national legislation. A directive also requires implementation along with the incorporation. The regulation only requires the implementation of it. And it's directly applicable as a regulation into the national laws while the directive needs to be incorporated to be applicable. So you will have the central piece of legislation but for the rest you will have national laws that will tell you or give you the answer. And the role of the commission during the transposition, the two years transposition, that's why there is a deadline for that. It's to check each legislation to ensure that there is no mistakes that doesn't go against what the main legislation says. So that's the big picture. So my next question is about your personal experience trying to express false into law or maybe to interact in the EU policy space whereas you may have previously focused on the developer space. So a different question for each of you. So for you, what was it like to work on a policy topic as someone who is very knowledgeable about software development? For each of you, what was it like to work on a topic with the nuances that open source has in your policy? So first of all, again, my background is very similar to a lot of developers. I'm closer to a software developer than a policymaker. So for us, I think we have a lot of concern about whether I will be reliable. I mean, maybe I've created some fun stuff. I publish it as open source because I want to share it but then you have no control of who is taking it and doing what about it. For example, the car example maybe at the beginning when I created this project, I'm not expecting someone to use it in a car and then the car will hit someone. So that is something that I think a lot of developers have that in their mind. There's a bit worry that now if this happens, will we be not publishing anything anymore so that will affect the open source ecosystem a little bit more? And also, for example, if you're working for companies or maybe then your company would tell you to not do it because the company don't want to get involved in your hobby project that may get into trouble. So there is a lot of concern, I think, as someone who, you know, and also, because software is very different from hardware, right? You can't make something at your backyard and then come and in fact you can take it in production. But software, you know, the power of software is like, you know, some individual developers, they can still develop a piece of software that is, you know, very applicable in a lot of application but is maintained with very limited resources. I think that that make hardware and software a huge difference in terms of scales. You know, you don't have enough, you know, resources, you can't massively produce something in hardware but if you have limited resources, you can still massively produce some things in software that like a lot of people use, right? So that's the concern from a developer perspective. Yeah. So. Thanks. Yeah, I mean, for us, I think it was a huge challenge to adopt the existing European framework for product legislation, the new legislative framework, as you call it, or the CE marking that you're familiar with, to software and to cybersecurity, right? Because, I mean, software is not a tangible good. It's different and cybersecurity is also very special. It's not the same as safety. Usually we've always regulated safety. Now, for the first time, we are regulating security and I found that to be a huge challenge. I think we managed to get it right but it was a challenge. What I really liked about engaging with the open source community is that you meet a lot of passionate people who really care, right? So when we regulate other areas, you get to meet lobbyists who are simply paid to defend interests. Of course, you're also defending your interests. But on top of that, I mean, you meet people that actually really care about the things that they work on and you see it's more than just a job, it's a mission for them. And I really appreciate that. Well, I mean, for me, it's a bit different because let's say that the product I really like to direct is about any type of product. So what I had, it's basic. I think it's the CO2. Do you see maybe it's a defective product. But the idea is basically how to deal with perfume industry, with car industry, with tables, with chairs, with vaccines, with pacemakers, with hats, with whatever you want, all of those industries. With the PRD, we didn't have a specific sector. We had all of them at the same time. And what we actually needed is basically to have people that could represent each of those sector to hear the concerns and what could work and what could not work. And I have to say that with the open source software community, it was maybe a bit harder to achieve that because of the fact that you are all individuals, there is not really someone that represents you. You need to speak a little bit harder because this is not a mic. It's only for the recording. So what was really complicated for the open source software community for me is basically I could not have a single voice that could tell me what were the full concerns, had different voices. But to be totally honest, the one that we're more talking about, your issues, let's say the bigger one, which I pretty sure do not represent you. And so that was the main difficulties for us from the PRD perspective, was to get what are the real concerns and how do we reply to them. But at the same time, we also had to be totally honest. The PRD is a piece of legislation made for victims, which is basically all of you, all of us. So we needed to find the right balance, not to put too much pressure on the one that creates the product, but also not too much pressure on the person that actually suffered the damage. And that was what we needed to achieve. And where we need to find the good balances when we have your inputs, this is where we can actually find the perfect balance in a way. So I will be asking, giving the crowd a opportunity to ask questions. So if you have one, raise it before I'll get to you. I'll ask two questions to Benjamin so I can have a look around. So my first question, Benjamin, and it's about stewards, is how can a steward know they're a steward? And my second question is, suppose they find out they're a steward, but they're not in the EU, who is the supervisory authority they are supposed to be talking to? Okay, so I mean, you find out if you're a steward by looking into the law, the law defines the concept of steward, right? It says you have your, if you're someone that's, I mean, if you're a legal person that supports a project on a sustained basis, and this project is ultimately intended for commercial purposes, you are a steward. The regulation also gives a few examples, such as foundations, I mean, not every foundation will be a steward, but if it meets those criteria, it's a steward. And so you can look it up in the law. As I said before, there will be cases where it's maybe not as clear cut, right? We hope that with the guidance, we can also address those cases. So I'm quite confident that the end of the implementation process, people will usually know if they're stewards or not. Now, if you're outside the EU, so the CRA is indeed a regulation, yeah? It means it applies across the entire single market in a uniform manner, and all the market surveillance authorities are responsible for you, essentially, yeah? If your product is, or if your software is published and accessible across the entire internal market, then all the market surveillance authorities will also be responsible for supervising you. So I will be walking into the crowd to get a question. I will be off camera, which is fine. So please state your name and affiliation and a question if you have it. I'll hold the mic. Okay. My question is about a Debian, which there is a Debian foundation in France, and there is software in the public interest, but these foundations only handle financial issues. They have nothing to do with code in any way or form. Are they going to be considered stewards? Yeah. So unfortunately, I cannot give legal advice on individual projects, right? Because if I get it wrong now, then it's a huge problem. So you will have to check for yourselves. I mean, what I can tell you is, I mean, we put some indications into the law when you could be considered a steward. So for instance, when you are hosting the collaboration platform, if you are to some extent governing the project, if you take decisions on the project, or if you do steer the development process, then you would be considered a steward. Taking another audience question. So please state your name and affiliation and then the question. I'll hold the mic. Thierry Carreze, Open Infra Foundation and the open source initiative. You mentioned the chilling effects on development and engagement from the open source community. And I think it's the main fear we have is that whatever legislation is created, it would prevent or discourage people from participating in the open source commons. And I think it's linked to any uncertainty will be interpreted in a worse way. So how are we going to, with 40 standards on the CRS side and transposition in every country, 27 countries on the PLD side, how are we going to have enough certainty for those people to, for them not to have this chilling effect on their participation? Thank you. I'm going to Omar first. Well, I think you can send an email to one of us. That's basically the first. I mean, we are open to have any discussion with anyone that has an issue on the ground because we are not on the ground. So this is how it works for every unit in the commission is basically everyone has legislation or has a policy and we receive feedbacks from people. Someone, for example, for transposition would say, well, I'm in Spain and this is how the law applies in Spain. And I'm pretty sure that that was not the main idea because when I looked into the main piece of legislation, it says something opposite. Well then it's the work of the commission to realize, well, that something goes wrong there and then we enter into contact with the national authorities. That's for the transposition part. But if there is like, there are issues during the years of applications of the directive, then we have what we call a review clause in each piece of legislation. Every three years or five years, you will have someone from the commission, usually one of us that will do the review with the study, having interviews, taking all the evidences and proof and you will collect all of them and then realize, OK, there is an issue that was not foreseen at the beginning. How do we solve it? There was a gap. How do we, how do we feel it? That was actually the main, the same thing that happened with the PRD, the PRD that dates back from 85. It took 40 years to review it. Before that, we started the collection of the reviews of the proof and we collected all the opinions and this is where I say that maybe your community was the one that was not really involved into that because of how the process is, but everyone has a voice in the seat there. Sorry, I want to ask a follow-up question. So I know that like sometimes the White House will have some open call for like suggestions and comments. Will you plan to do something like that? Well, first of all, we need to apply it, but that is for sure that for the next review, which will happen. So it's two years, four years, it will be in six years. In six years, we will take us, do a state of play of how it's applied and then obviously we'll have to collect a bit of information and we will have to check with people of the industry, the communities to see what is their experience and if there are things that work or don't work. So that's how we will have to do it. But I cannot tell you right now, but it will be one, but I'm sure that it will be one because it's how it works for these kind of things. Yes, so I would like to fork Omar's answer. I would like to add that, I mean, I don't think there will be a chilling effect on open source coming from the CRA, to be honest. I mean, let's be frank, open source is essentially outside the scope. I mean, of course, there will be cases where manufacturers will, of course, try to place requirements upstream, right, and talk to upstream developers, but I mean, you are for the most part not covered by the Sub-Brazilian Act. If you want to make sure that the transition goes smoothly indeed, I mean, please do reach out to us. I think we've proven over the last year that we are a very approachable bunch here. We are taking your concern seriously. We are going to do our utmost to find solutions, but we are even legally obliged in the CRA. There is a specific provision that requires the commission to consult the community. I mean, we would do so anyway, but you even have that reassurance that we have to do it. And yes, I mean, just please do reach out. Just one thing for the chilling effect, because for the PRD we have experience with that. Forty years ago, I could show you the newspapers that was going all around Europe from manufacturers saying that if this piece of legislation would enter, there would be no products anymore in Europe. I'm pretty sure that this is not the case anymore. What the PRD did is basically give trust into people. When they buy something, they know that if something goes wrong, at least they will have the back cover. That is the idea of the entire piece of legislation. One practical comment I would like to make with respect to the question that was just asked is that after the panel, we'll have a workshop. And one of the mechanics is that we want to ask you about your fears, but also hopes and perhaps your solutions. So if you're listening to this and think, hey, but I have these corner cases that I'm really worried about, make sure to remember for like 20 minutes more and then put them to paper because we're actually trying to collect these. I saw multiple hands. I'm first going to ask a question myself and then I'll return to the audience questions. So it's related to the PLD. In December, a political deal was reached on the PLD and one of the things that was publicized by the MEPs looking out for open source specifically was that open source would not be in scope if it was not a commercial activity. And it was a delegation to the technical level to implement this idea along the principles of the CRA. Now when the text of the PLD became public in end of January, what we saw was that there was a single or maybe one and a half PLD recital and the CRA has seven, eight maybe. So I'm asking is the PLD team that much better at writing recitals? Can we somehow use the nuance that was expressed in CRA in the PLD or are you going to offer guidance? Because I was a bit surprised. I was expecting more nuance but maybe I'm wrong and you're the expert. So maybe a bit of a tough question for you Omar and I'd like to hear from you. So I mean I will ask you a very short question. How many products do exist in the world? Because what you as a community got is basically one full recital over 47 while the PLD applied to millions of products. So I think in proportions you got quite a lot actually. The difficulties for the PLD is basically to say that yes there is a CRA that gives an explanation about open source software but you will also have the AI act for that. You will also have all the type of legislation that will get the open source point and we have to cover all of them at the same time. We cannot copy paste from one single legislation because we apply to all of them at the same time. So the difficulties was really to find the right wording. I think as you said the MEP that quoted said that the main idea is that is the commercial activity. While this is applicable for any product, any product that's actually been developed or supplied outside, mostly supplied outside of a commercial activity, it's out of the liability regime. And that's what we written in the recital. It's basically restating the fact that if it's outside of a commercial activity then you're out. But if you're in, that's where the PLD applies. We cannot create a specific regime of open source in the PLD itself also because of the nature of the legislation that has to be neutral and you cannot have just very specific provisions about one single product because each provision has to apply in the same way to any other type of product. That's a bit of the, and the CRI would apply for cyber vulnerabilities but then you will have the AI act that would apply also for open source. And for us we need to cover all of them so that's why it's in this way. So I have a question that relates to work that will be a little bit out of your hands. So for you Omar, it's about the 27 member states that somehow need to get the work you did and then make their own and somehow understand the nuance of what open source is about. For you it will be about Toby Stark on 40 standards. What will you be doing for Chuck and me and all the other people writing software to apply what you learned in the past 12 months or maybe you already were to help the people doing that work not make, like understand the nuance of what is essentially a niche of a niche but also rules the world of products with digital elements. So what will the commission do to help the community in these stages of the process? Okay. Yeah, I mean, so the commission is not writing the standards, right? I mean, that is how it works. I think you also would not want us to write the standards. So that's probably a good thing that we are not writing them. It's the European standardization organizations. They are made up of national delegations from the national standards bodies. These standards bodies, they often send representatives from manufacturers and from others. We have like, the commission has basically three ways of being involved in that process. First, we are the ones drafting the standardization request, which is the basis on which the ESOs, the European standardization organizations are going to work on those standards. So in the standardization request, we can already express our expectations, what the standards should look like. Then, although while we are not going to be writing the standards, we are going to be there all the way, right? So we will be in all the meetings. We will listen to the conversations. We will give our views. We will answer questions on how things are to be interpreted in the CRA and so forth. And in the end of the day, we also have to rubber stand the standards. They have to be cited in the official journal of the European Union that gives them this power to give presumption of conformity. So what I can reassure you is that we are going to be there all the way. We are going to look at the process very closely. We are also more than happy to engage with those parts of the open source community that do have expertise in standards, right? To find solutions to the issues that you may have. So again, I mean, I already said it a couple of times, please do reach out and let's discuss that in more detail. Thank you. Well, I mean, my work is not done yet. As I said, the transposition will kick in as soon as the co-legislator have officially voted, which should happen either in June, July or September in any event. And then after that, we launch the transposition period, which means basically that we will be receiving the 27 piece of legislation piece by piece, or sometimes just the entirety of it. And we will have to work closely with each single member state to ensure that the legislation reflects exactly the directive. What we have as a tool in the commission is what we call the infringement procedure. So when the commission realizes that a member state does not conform itself with a new legislation, we can bring the case to the court to ensure that the member state applies it or does it properly. I'm not, I mean, as a small background, the first PRD took for some member state more than 20 years to properly transpose the directive. So I hope we're not going to be there, but this is how it works from our side. And then once it's transposed in any event, we will have to check constantly if there is a good application, because one of the things is it's not only the transposition by the member state, but also how the jurisdictions will be applying the law. A national court is also a representation of the member state at your level. So if there is a misapplication at that level, we would also have to intervene to ensure that it is done in conformity. Thank you very much. I will take two audience questions, one here and one there. And then we will continue with the panel if there's time. I will be holding the mic. Please state your name and affiliation. Alistair Woodman representing 2501C3's outdoing open source projects. As far as the PLD is concerned, do you anticipate that the market will support insurance policies for this to deal with the sort of quenching thing? Or is it a non-goal or a goal to encourage insurance in this particular regard for non-mulse-feasant behavior? I think that was for you Omar. So the PLD does not have any requirement about insurances. So everyone is free to do whatever they want. Basically, it's just you need to calculate your own risk. And once you know your risk exposure, you will know whether you need one or not. But it's not from outside that we do it. And to be also totally frank with you, as I said also yesterday, most of you here will never have a claim on PLD. I mean, this does not happen every single day for each type of product. We have a few cases that can happen. You can have access to all of them. It's true that for software it's a bit more rare that this happens because you have something that the traditional products don't. You can correct the piece of software before something wrong would happen. You know that there is a vulnerability. You know that there is maybe something defective inside. And then you will correct it with an update and then you avoid having any issue. That's a bit more of a facility for you. And we will not impose from our end-in-end insurance for that. That's a bit of the approach. Audience question. Please state your name and affiliation. Hi. Olli Johansson. I'm an open source developer also active in OpenSSF and OVASP. The problem with those 40 organizations that create standards for us open source developers is that ECMA, Sennilec, all of them require quite huge fees. Who will pay them so we can take part in the standardization effort? I think this will be for Benjamin. Yes, I don't think I have necessarily a satisfactory answer for you, right? Yeah. So I will take note of your financial needs. But indeed, I mean the CRA is just one of many pieces of legislation. So we do not shape the standardization policy. We just use the standardization process for the CRA. But indeed, I mean this is an important question and we are more than happy to look into that. Thank you. So we're slowly nearing the end of this panel. I'm going to ask a number of questions in succession and then we'll see if there's more time for audience questions. So to Omar, I'll ask, do you know about any other related legislation that is coming for this community that we should be waking up about? So take a moment to think about it. I'll get to you for the answer. So for Benjamin, I would like to talk to you about the guidelines. Can you be very specific about how people can contribute to the process of writing them? And there is this delegated act possibility about voluntary security at the station programs. Can you talk about what your intentions are, maybe how people can help? So my goal is with these three questions to the two of you, is to give people in the room a clear view of what they can do. Should they have the time, the money, etc. to be involved in EU policymaking? So now I'll hand over to Omar. Well, I have no idea. I have to be totally honest with you. We are many, many directorates. But it's as simple and it doesn't really require any money. It's just two times time to check what the commission is doing. The various directorate channels, I mean, mostly did connect, but it can also be did you grow? Could be just wherever the directorate is. And then if you have a question, you are wondering something, you are... I mean, don't say that to my other colleagues, but you can send an email to the units. And this doesn't cost any money. They will happily reply to you and give you any answer that you're seeking. There are stuff that you don't understand from legislation. There are information that you would want to bring to the attention to the commission itself. Our emails are open for that. And this is also our role to have a look into what happens on the ground. I mean, as I already said, we are legally obliged to consult. But we would do it anyway, of course. As the commission, I mean, we are very likely going to organize conferences where people can attend and bring their ideas to the table. We are likely going to have some form of expert group or a similar body where people that want to, like, be more involved than just ad hoc, but in a more structured manner where they can engage with us. And, of course, you will be seeing us at conferences. You can invite us to your events. We're happy to attend, maybe not always physically, but online. So there will be plenty of opportunities to engage. As regards the voluntary security attestation programs, so, yeah, I mean, the idea is basically to give those projects that are not directly in the scope of the CRA a chance to provide some form of assurance that the projects are secure, right? We know that many of these projects, they don't have financial resources. So the provision is quite open in that regard. It does not require the ones that develop a project to also pay for that program, but other people can step in. So, for instance, integrators that take an active interest in a certain component because they need it for their own due diligence, they need the assurance that it's secure. They could also team up and pay for that assurance. Now, these attestation programs, there is only a so-called empowerment in the law. That means that the commission is empowered. We are allowed to flesh them out. So they are not there yet. We don't have these assurance programs at this point in time. But the commission will be able to work on this. And for this, we will also need your input so that we can shape these programs in a way that they are useful for the integrators or users to have the assurance that they need. But they also take into account all these specificities of open source projects because they are often so different, right? The way they are structured and the promises or commitments that they can make compared to more traditional, manufacturer-based projects. So I think we'll take one more audience question and then I'll ask Chuck if she wants to do any reflection on what this means for Python maybe. Let's see. I think there was a hand pretty early on. So name and affiliation, please. Hi, Vittorio Bertola from Open Exchange, which is a 300-people German open source software maker. So the question is, well, first, this is creating cost, of course, not for security because we have a flow of security record. We already spend all the money that's necessary on security. But for the bureaucracy that now you are introducing for compliance. So this is making us less competitive and all our competitors are from outside Europe, including Google. So how is this going to be compensated? And maybe ours, we are a pretty big company. We can cope for it. But the French Foundation for Debian that has to hire a lawyer, there are going to be costs. Are you going to put some money onto this, maybe to fund developers to cope with security issues or to fund the bureaucracy? And also, how are you going to avoid the international players from gaming the system? I mean, it's way too easy. I see this happen for like the Googles and Apple. They create some initiative, which is a non-profit. They put the code into that. It gets outside of the CRA scope or maybe gets the like system, whatever. And then they don't have to support the cost of compliance. Well, we still compete with the same piece of software and we have to pay the full cost of compliance. So do you have any thoughts on this? Do you want to check to go first or answer the question first? Yeah. Yeah, I mean, it's true, of course, that there will be some bureaucracy. I mean, no law has ever been created that doesn't create some bureaucracy. Okay, maybe the PLD doesn't because it only hits you once something happens and not before. But usually, of course, there is a certain compliance cost that's quite unavoidable. I think the competition concerns there may be a bit overstated because the CRA does not only apply to European companies or manufacturers or open source projects, but it applies to anyone who is bringing, publishing or putting on the market those products in Europe, right? And we all know that Europe is a big continent. It's quite relevant. There are probably very few manufacturers in the world that do not place products on the European market. So they will all be subject to the rules. We do have some field facilitations actually for small manufacturers when we talk about actual manufacturers. So there is a provision that again, it's an empowerment for the commission. It allows us to create a form for a simplified technical documentation for small companies. So that means that small companies, they will only have to fill out one form essentially and the length of the form is somehow going to inform the expectations towards how much information you're going to provide. So I think that can help a lot like one single form makes your life much easier. And then we also have some funding calls. Actually there are funding calls ongoing right now until the end of March that also aim at helping small companies deal with the implementation of the CRA. Thank you Benjamin. So I think we are at time. I would like to thank our panelists for the courage to come here to talk to us, to have this conversation. They're not leaving yet, but I will ask you for a round of applause before we continue.
CRA & PLD: workshop
So a quick apologies to the people on the live stream. What we just did was workshopping. It was, in my idea, real fun and also real messy. You couldn't see it on camera, but now the program will resume on camera. And I'll hand back over to Toby, who will take you through it. Wonderful. Thank you. Well, I thought that was fun. I hope you enjoyed it too. So now we're going to, each station, each group will have about five minutes to talk on some of the key findings that they found and share them also so that the live audience actually knows what people are concerned about, what their hopes are, and what kind of solutions we want to. So I'm going to invite, why don't we start with the, well, the PLD, is the PLD fine? Yeah, I think that's fine. Good luck. Thank you. Do you want me to go through quickly? Yeah. Okay, I think the good word is trust. That's the main one. The idea is really to bring the trust in new type of products, at least. The alignment before the piece of legislation is a good wording. I mean, as I explained from the beginning, the PLD applies to any type of product, so it needs to be aligned in that sense. And I think some people got the idea with the DMA because of the interoperability that the DMA is introduced now. And it's a bit of the idea also of the PLD that when you have hardware and you can install different type of software, you also need to pinpoint the reliability of the person. That is a bit of the opening as well to these new possibilities. I think, so for the fears, we have three types of, across at least two main ones, the scope, I guess it's the, still the clarity of the scope for the commercial activity, and the cheating effects that it could have on the community. I mean, I can talk for the cheating effects, but that, again, we will have to see exactly how it will work and how it will affect directly the community. I cannot tell you in advance, I know that there are fears about that, but that's also the reality. And as I said, what we only did is to clarify the situation. We did not change drastically the situation. I know this is not what you will hear all the time, but the scope was like this already before, but now has been clarified to his previous way. So we will have to check for that. For the scope itself and the commercial activity, again, I cannot tell you on each case how the commercial activity applies. What I can tell you is that the commercial activity is not always the same, or at least the scope is not the same of the piece of legislation. The PRD will not be applicable to certain products that the CRA does not apply to, to make it another wording. It's not because you're not covered by the CRA that you're outside of the scope, that you will be outside of the scope of the PRD. You might actually be. So just to make it as clear as possible. There was a good question on the open source silicon chips. If I just remember which one put that question, and I will then get back to it just to understand a bit more. But I would just go to the solutions. There is a good call for guidance. There is something that we call the blue guide that applies for every product in the union. The blue guide, it's just a guidance, not a piece of legislation that can help market surveillance authorities to apply a legislation, definition, etc. Could be a good way also to update it for software and specifically to open source software to make it more flexible and more adapted to that. I think that's a very good point. I would just use this time to give a short on the communication because there were some points about contractual liability or limiting the liability. That's not possible. There is no way that you can escape the product liability directive even by putting any type of clothes. I'm sorry for that. The piece of legislation is made for protecting the most vulnerable person. So you cannot sell a product or you cannot provide a product by limiting it. But what you can do is if you are a micro enterprise and someone wants to take your open source software, you can decide with him that you will not take over the liability in case something wrong happens. And so basically if a victim goes to court against you, you will pay. When I say you, let's not imagine yourself, but you as I say, you can just basically compensate and then you have your own agreement where you get back the compensation from the integrator. And in the other situation around, if the victim goes against the integrator, the integrator then will not be able to go against you to claim part of the compensation that he gave. So that's something that exists in possibility in contract. Thanks. I think we're good. Yeah. Wonderful. Thank you. Should we go next? Do I talk or do you talk? Do we do both? Half-half? That was a bad answer. Hi everyone. So on the workshop on the CRA standards, we have, I think, a lot of participation, a lot of thinking heads and a lot of fears in the beginning for sure. Some hopes, which was nice to see. And there was also nice to see the connection between the hopes and the solutions. So I think with time we got to some solutions and I'll let the other moderator explain the solutions. But I think in terms of how we organize the hopes and fears, there's things that the open source community should probably do and or that they should address. There's things that the standardization development organizations need to address. And there's things that the regulator needs to address. And I think trying to figure out how we can collaborate and cooperate between this triangulation is going to be key moving forward. So thanks very much for being here and I think we hope we can continue the conversation. Thank you so much. And sorry to put you on the spot. Don't go away right now because you have not been introduced. And so Felipe Jones-Mauw. He is the standards person for the CRA at the comm, the EU commission. And so you will, this is the person that you will be able to bug about these issues the most, right? So thank you very much for being here and thank you for joining me and organizing this session. This was great. So in terms of solutions, we grouped them in a few sort of like topical clusters. I think the ones that really stood out are open standards. Like everyone's really concerned about like the process but also access to the actual output of those standards. So I think this is going to be really critical. Community organization is one that comes up fairly regularly too. So how do we do this? How do we structure ourselves as all of these different stakeholders in the open source community to participate in this is something that we need to work on. Then there are requests about good EU guidelines. I think like this is great. So also requests for EU funding to help was, well probably organization and sending people to actually participate in these efforts would be for example great. If I can shamelessly plug into that. There's actually a call recently being accepted and a project starting called Cyberstand. And their aim is to support the participation of experts in standardization efforts. Also some other sort of auxiliary tasks around the standardization, the development of the standards for the CRA. So I really encourage you to take a look at that and if you're interested in participating then that's probably an avenue for you to do that with some support. Thanks. So where can we actually reach out to for this? The name of the project is Cyberstand. So from there I hope you can. So the answer is Google for the Cyberstand. Search for Cyberstand sorry. Or get in touch. And lastly, yeah, so one of the things, well, this is great. One of the points that came up is better access to policy makers while we're doing this right now. And you know, it's a very accessible person. So thank you very much. And then some, well, EU-US mutual recognition of standard compliance. So I care about this. I think that's great. And then a focus that we're more on actually doing the work of actually making the security better of software. And also being able to do self attestation and have the integrators or the manufacturers fund some of the compliance and effort, which of course I believe that open source sustainability is very important. So I'm all for this. Wonderful. Thank you very much. Last stand. Do you all want to come here? Can we do this? Oh, beautiful. So in those two corners, it was more close to the text and close to close to specific aspects of the legislation. Over in this corner, we were going from coming from the other end of the chain. We're talking to people about where they are at the moment and their perception of what is happening, what could be better. And so we looked at all the hopes and fears and hopes and fears are kind of the same thing. You can hope that something will go wrong or you can fear that it will go wrong. And so we bunched them together and we made a few different categories. In general, there was a lot of discussion of funding, but we found this is a complex topic. And so in funding, we have to think of specific suggestions rather than just increasing funding, make funding available to more smaller entities, remove the costs for certain things. For example, having to buy a copy of standards. And then we noted that participation in projects is also as good as funding projects. So there are multiple ways to think of funding. A second thing is procurement and procurement by the EU institutions can be good for supporting projects. It can also be good for increasing the awareness between the EU institutions and our ecosystem. The third thing is that there is a lot of funding that is currently available and there is little awareness of it. There was the open science cloud, which has one billion of EU funding and only one person in our group was aware of it. So there is also an information gap there that we as an ecosystem and the EU institutions can work on. There is the issue of being first mover. So in some of these pieces of legislation, the EU is the first to regulate in a big way the way software is distributed. And so we have to keep in mind that when we do this, it may be copied by other parts of the world. And so on one hand, it's useful to do it well for our own people. It's useful to provide a good example. And we should probably also try and work with other parts of the world to ensure that other parts of the world don't bring in similar topics into legislation and completely different requirements. And then we have a different set of requirements in every different region of the world and developers will have a big headache. FOSS awareness within the EU institutions is a big topic. Basically the more people are aware of what FOSS is and how it is funded and how it's developed, the better we would be treated as FOSS. So we need to increase awareness of who develops FOSS and increase involvement. And the last thing is international coordination, working more with the UN, WTO and ITU. I've got two minutes left and I said to Benjamin that I'd give him the last two minutes to give his impressions of what he heard in general. So if you would like to. Thanks a lot. Yeah, so first impression, everyone wants money. I'm not sure we will have all the money that you want. We'll make an effort to support the community through calls, of course. Yes, I think I've noticed that a lot of the hopes that you have are actually the same hopes that we have, right? So you are hoping that the manufacturers will contribute more back. That's also our hope. I think the CRA is doing a lot in that direction and I think it can be a game changer. And yeah, I mean, you are also hoping that the community will step up and also provide solutions such as templates and so forth. And I mean, that's of course also our hope. I mean, the CRA cannot succeed without you. I mean, it cannot be just the commission doing all the work. We also need you. So we are really hoping that especially the more those players that are a bit better equipped in the community that they will help draft. I don't know, cybersecurity policies that then other smaller stewards can maybe copy paste and take over or that you will also maybe start projects for tooling to help companies or open source developers test their products, ensure that they are secure. So we also really count on you. That's also one of my messages for today. Thank you. And thank you for all the participation. I will hand back now to Martin. Thank you for everyone who wrote down their hope, fear or solution. Thank you to the people from the commission that were willing to put up with our chaos, trying to run a workshop in a venue intended for call. And thank you in particular to Toby who made all of this work.
CRA & PLD: CRA conformance for Open Source Projects
Our next speaker for today is Marta Rybczynska and I probably didn't manage to pronounce her last name so I will be asking her to do that again and show how I was almost right but not quite. Marta will be talking about CRA conformance and the thinking that Eclipse has been doing around this and she has a background for a number of years developing different solutions but I think she also closely followed the CRA. You may have seen her article on the Linux Weekly News months back which was a very good summary of where things were at at that time. So without further ado this is Marta and enjoy her talk. Thank you Marta and you pronounced my name quite correctly in fact. My name is Marta Rybczynska and I'd like to do a test implementation of a Siri in five minutes today. So let's go. The example open source ecosystem that is quite standard one with a physical product to make things easier. Starting from the end we have the final product that is sold to customers and we have the device manufacturer of that product and that device manufacturer is assembling multiple open source and preparatory elements adding their own software to the whole thing to build their product. This device manufacturer can of course have multiple product and they are not integrating one open source project they are integrating upstream project A and of course a hundred other open source projects that upstream project A develops a project under an open source license and they have dependencies. They have a dependency B that is another open source project working in a similar way. So okay here enter open source towards you have already probably seen the definition I highlighted the important parts for me. Legal person that has a purpose of objective to provide support of open source. Okay so what comes out of it was towards pop in in the whole thing. They pop up for the dependency B. They pop for the upstream project A. That's that's pretty expected and then a few remarks in there. Very likely stewards will be foundations especially if they have trademark to the project name. That is quite quite obvious situation but we also have situations that are little less obvious. When we can think between stewards or manufacturers or none of those for example if there are four profits that are supporting projects that are not critical to their income like open sourcing CI scripts, open sourcing, programming, tooling for their board. Things that are absolutely not critical that they are absolutely not monetizing. And we also have consulting companies not giving names. They are many consulting companies that are contributing to open source projects in a sustainable way for years. So how do they qualify? And when we add this can we have multiple stewards for a single project? If we just take that definition of a steward why not? There may be a foundation and there may be a company that actually donated the code to the foundation that's still contributing. If it's if they are not monetizing why not? And then interesting case stewards. There's a definition by stewards also have some obligations and what happens if the stewards cannot force the project or they want to force the project but the main developer say I'm not going to implement that. Pay someone to do that work. What do you do? Question mark. Okay and then we finish adding the CRA elements to our scenario. We add due diligence or that the device manufacturer should do about the open source projects they are implementing. We have the conformity assessment that they should do while releasing their product and we have the final user documentation that they are expected to release. And well mostly for the con we have some challenges for the conformity assessment. Changes and opportunities for the open source world. A final product includes dozens of hundreds of open source projects usually. So manufacturers quite often use the same project in many different places and many manufacturers use the same open source project in different places. So what makes sense and what is logical to do the conformance work, to do the paperwork all together in an open source way and release it open source license? Oh there's an alternative. The big ones will be able to pay the whole work on their own. The small ones I'm not absolutely sure if they include a hundred projects. So that will be it for me. Thank you.
CRA & PLD: rapporteur playback
So I'm trying to summarize what we have learned in this session. We had a great opening from Toby who kind of explained to us the importance of these standards that will be written to accompany the legislation. We had a lot of discussion about the 200 pages of text that is the law. And we will then have fun reading the 4,000 pages of text that are the implementation standards and a lot of the details will be in those standards and are still to be developed. He also pointed out that the CAA has landed now and is not catastrophic, which I think is an important point to take. It is also an opportunity for the FOS ecosystem to step up here and to play a leading role as stewards. We have a lot more clarity with the separation of roles. We also have the first time that a major law in a major economic block talks about free and open source software and describes a specific role for open source software stewards. So I think that's a win. Next we had Rob walking us through the wonders of the hardware and software supply chain and how liability, especially strict liability, can work out in that and highlighting that the approach in the UAPLD is one of strict liability. We had a panel where besides the really interesting questions which I will go to in a second, we also had a very symbolic picture here in the front. We had the Python Software Foundation sitting in the middle squeezed between the Cyber-Syber-Zerians Act and the Product Liability Directive. And I really thought when I saw this, this is kind of a, the picture of what we're seeing because we have a group of people that is making free software available to the world trying to do the best and essentially really, yeah, working in the public interest here and trying to see how to, how can we make this work in the environment of our own regulatory frameworks. Also are the highlights here. I found it indicative in the question to Ben who is going to be a steward and what happens if the steward is outside of the EU that, first he said, looking through the law, which is the right recommendation, but he also said this is to be more clarified further down the road. Yes it is, but that is just indicative of this uncertainty that we are currently in. And so it's the right answer, but it means that we need to stay on this topic and we need to get answers to these questions. I also thought that the question from the audience about what if we have a very decentralized open source community, maybe with a legal entity in France, but that's not really controlling anything that the developers do, it is just coordinating the work. It was very pertinent because this is exactly the gray zone between being an individual developer, being a loosely organized community and then being a well organized centralized community, a more centralized community that clearly qualifies as a steward. And I think there, it's not just on the lawmakers to make this clear, this is an impact where the free and open source software community will have to sharpen our own governance norms to make it more clear which are we in this situation. So there will be implications on how the communities operate that were clear in, this was one of my takeaways that was clear in this discussion. We focused a lot on the Cyber Zillions Act because it was such a pertinent topic recently, so I was really glad to see the Part Liability Directive here. Key takeaways that I took from the discussions are you cannot escape the Part Liability Directive, I hope I'm quoting that right, and I think Oma also was able to say why. If a law protects the most vulnerable person in the chain, then it can easily exclude others. Who is the most vulnerable person will probably have to always be accessed in a concrete case. He also pointed out that an important aspect of EU law making is that all these laws have review cycles and they're not one-off written and then collect dust and we will have to live with the consequences. They will be reviewed and he encouraged us to engage in the review process and to provide our feedback. This is probably one of the most important takeaways today for the people in the room is let's stay engaged here basically. How am I doing on time? Two minutes, okay. We talked a little bit at the end of the discussion period on how will the European Commission engage with the standards. Oma pointed out that the European Commission does not write the standards and what the process will be and that your source community will be encouraged or is encouraged to actively participate in this. That's a big takeaway that we need to take. There was always the question of who pays for the additional bureaucracy, who pays for the fees to participate in some development organizations. I think we didn't really get good answers here today but it was made clear that this is an open issue because our organizations are almost exclusively non-profit organizations and that's additional cost. Additional cost requires more fundraising. I think the question of who pays for this, who pays for this and who pays for this got repeated a couple of times including in the workshops. Let's go to, if we have a little time, I summarized a couple of highlights here. One is a big shout out to our own lawmakers here. The EU is approachable. It was here in this room. It was in multiple panels. It was willing to answer questions from angry developers. I really congratulate them to this attitude so thank you for being here. Another takeaway is why is this room so important because I think it was Toby who pointed out that we have a very diverse set of stakeholders and normally in the processes the stakeholders that are well heard are the ones that have the means to do so, the big foundations, the larger projects and we need to have the hobbyist community, just small like one person enterprises and all those also like, yeah, involved and that's I think more in this room than in the discussions until now. I want to point out one thing and this is in response to Omar who said software is only one thing out of a million products. That's true. Software is also an every product and every product consists of 80% for you know, so software. That means we're not one in a million. We're 40% of the overall market if you divide it 50, 50 between hardware and software. So I think we're totally worth it to be especially considered in the law and I think we do have the impact that justifies that. Regarding engagement in standards, last statement here, the EC made a very direct offer to engage. I think we should take that especially the foundation I speak for the Linux Foundation here. We will engage. There's one really positive signal here. We've recently been appointed to the multi stakeholder platform for ICT standardization which is kind of the consulting group to the commission here and we will use our influence there to bring more free and open software players into standards development. Regarding that as well, keep in mind that standards development is also a national activity in the member states. There will be representatives of your member states appointed into those standards bodies and it's a great way to engage through where you live and get somebody into that. With that I would like to close, a big thank you to the panelists, to the moderators and to all the participants. Thank you very much.
FOSS policy engagement: [begin workshop] OSS Exchange with policy makers and policy support of OSS developers two faces of the same coin?
Okay, everyone. So we are ready for our second workshop in our death room open source in the European legislative landscape. We have some general rules. I think Simon want to talk about them. No, this is okay. So we have some general rules for the whole death room. That means for the next two hours, this is true, but also for the rest of the day. Please, we have some issues here with like the sound quality in the room. So please, remain as quiet as possible. And if possible, please only change rooms in between the sessions. So we are having like quick changes in between. So there should be enough time to like leave and enter the room again. And also we have this microphone. And there's only this one microphone. It looks a bit strange to talk into it, but it's important for us that people in the livestream can follow us. So this means if you want to address a question, please wait until we are around to see microphones so that people in the livestream can also follow us. We in this workshop want to build on what we just like discussed in the workshop before. And I think Mirko put it very correctly to say we should stay in the dialogue. And the question is how do we stay in this dialogue? And how, yeah, shall we do this? What have we learned? For example, during CIA, but also other debates. And this is basically what we are going to discuss in the next two hours. So we will talk with Martin and Enzo about their CIA experience. Then we will talk to Jean-Luc and Clementine. They will show us about how important open source projects to the European Commission and how they are funded around NGIO. And then we will talk with Cyril about future EU free software regulation. And then we want to turn this workshop here into a panel which then turns into a fishbowl conversation. And I will also then later explain you how this fishbowl conversation works out as soon as we go to the panel. But now I would be happy to see Enzo and Martin. Would you be so happy to join here and then we go to our very first talk of Martin and Enzo. Who wants to start? You.
FOSS policy engagement: a CRA retrospective.
So Martin and I, I think we met seven months ago, six months ago, eight months ago, something like that. When you started to get interested more and more into policy, and I was in Eclipse trying to make something out of the CRA so that we can solve all the issues that we've been discussing during the first session. And we quickly realized that we had backgrounds that are very complementary. In a sense that he has obviously this all open source background, and I have this policy background into advocacy and how to advocate in terms of policy. So out of our four matters combined, we sort of decided to combine our efforts. So here today I discuss what it's like to do advocacy, and then Martin will explain what it's like from an open source perspective to be new in Brussels and try to do something about it and how we try to organize the whole thing. So I think what we need to do here is share the information about how policy making is done in Brussels. In Brussels you have several institutions gathering together in order to create those policies. The main three ones not entering into details of all the policy making procedures are the commission, the one that is drafting the proposal, for example the one that drafted the PLD, the one that drafted the CRA, the two policy officers that were here this morning are the one who did it together with their teams. Then you have what we call the co-legislators. One is the parliament, the one that we directly elect, and the other one is the council, the council of the EU which is representative of all the government of the EU. So we're talking France, the Netherlands, whatever country you're from in the EU, that's your government that is there. Then the question is how do you actually influence those policies? There are I'd say several things to keep in mind as a community to do that. If you want to influence that policy, first is to get interested so that you gather the knowledge on the policy, to gather the reason why this policy is happening, to gather the actual details of the text or the issue that the policy makers are trying to address. The second one is get organized, trying to actually have an impact so that you have credibility and the policy makers don't see just one citizen coming to them but a group of citizens that is organized enough to represent a part of the society. The other one is just write down stuff so that you have clarity. Then you have to identify the different elements within the policy making process that can allow you to get involved. Here I'm talking about contacting policy makers, the REC-1, being able to identify them properly. I'm also talking about getting support in your network, in the companies that you know because the open source community in the case of the CRA for instance, we won part of the overall challenge for policy makers but they also have to discuss with industry, car companies, large tech companies that are closed source as well and all of this needs to be addressed. I'd go back to Marty now giving details and then we'd be exchanging throughout the presentation as well to see what didn't work, what worked, what could have worked if we would have acted differently and then if you have questions we try to do that as well. I'll start with a quick promise. I was here last block and this will be the last time you see me here and then I'll just sit down and other people can talk because we take this rule seriously. What I will be sharing with you is my personal story of how I got here because that was not really my plan. So my role is to work between policy and technology from an Elnet Labs which is a small R&D organization in doing DNS enrouting from Amsterdam. In September and because this is not about me but about the lessons I learned, I'll give you the lessons first then you can plug your ears for the remainder of my talk and then that works for the deal. My lessons first, I think we were too late and if you're too late, if the commission makes a proposal then you're chasing the train, right? So we did a lot of train chasing and that means you need a lot more effort than you would have needed if we were in front of the train at the station discussing where the train would be headed. I also think and I learned that we cannot expect FOSS to organize as an industry organizers because if it turns out that it is nicely organized like a trade organization and you're probably not talking to the whole community but just to the industry part of it. And I think because of these facts that the digital dossiers in the European Union needs to change a little or we need to figure out the mechanisms to do advocacy from the community because for all the digital dossiers software is relevant and for software open source is relevant. And I'll try to illustrate why I think these are the lessons because I think I and we all got quite lucky on this one. So two last lessons, I think we should be talking more to parliament as a community. They should be the people that are most accessible to us. I think I personally didn't talk enough to parliament and it turns out that even if you don't have any EU policy experience you can figure this out if you have enough time and if you're lucky then you can make this work. So now to the story and you can plug your ears if you want to because you just had the lessons. So in September 2020 I found out that there was this thing called the CRA and I read it and I thought hey this is weird, open source is mentioned which is great but it's also clustered under the whole idea of non-commerciality and we all know that there's also a lot of open source and commercial products right. So I sent some emails and I got lucky real quick because I reached out to the Dutch digital civil rights organization Bits of Freedom, they connected me to a law professor and they connected me to a wonderful recent graduate who had a lot of context on this law because she had recently interned with the team that actually wrote it. And I want to thank Francine because she delivered the mental model to me and to a lot of people I work with later about how this actually fits in with the NLF and that's thanks to her. So in October I contributed to the first blog by ISOC, they were the first I think to mention hey there's this thing coming and I wrote a little newsletter saying oh I'm spending some of this time reading this stuff and I send a little tweet and then I got lucky again because the tweet was noticed by the commission, by Benjamin specifically and he said hey maybe you should come and talk. So that was kind of a surprise, I mean you write. So I was planning a visit to Brussels, I bought a t-shirt, super relevant detail, yeah. So I was there because my organization works in DNS right so I was attending a DNS conference in Brussels and I decided to drop by the commission because I was invited. So I learned a couple of things, they have really nice rooms with really nice flags and you get, they receive you friendly people and then you get kicked out because that room was for the boss. And what also really stuck with me and Benjamin repeated this morning, one of his colleagues said oh it's so refreshing not to talk to a lobbyist for once and I was like yeah I've no idea what I'm doing here but so but it was really constructive and I'm not saying this to please people, I'm saying this, this was November 2022, we actually had conversations about whether compliance work would increase security of open source and I was arguing it probably wouldn't and we got the question like is there anything we can do that will increase security and so we got into the conversation about if a vendor is obliged to report a vulnerability maybe they should also be obliged to send the patch if they have one and that was just a conversation and I don't know how it got into the law, maybe Benjamin will tell us someday but I think at that point already they were thinking oh maybe you can tweak this a little which I think was great. I also learned and this is about talking to parliament, Benjamin and his colleagues were very insistent that I should talk to the co-legislators and I was like what I'm here now so co-legislators others also other and it turned out they were actually done already because they made the proposal they could explain the proposal but they were not the party making changes at that point in time. So in December I visited Brussels again and I came with a list of examples like hey this is the ways that people write software and maybe interpret it as a commercial activity and people told me yeah it's great but you need to talk to the co-legislators. I was like oh great maybe I'm doing this wrong but I'm not talking to you so maybe you should come to Fosdam. So last year we had a Janssen, the session with Omar with Benjamin and I think that was the first time we did EU policy at Fosdam and some questions were raised because Benjamin told us, it's on camera, you should be talking to the co-legislators and Alex who is chairing this room today was in the audience and asked so what is your plan because you're talking to the wrong people and he was right right but I didn't know so we're just trying to get. So what it did get us though was it started building alliances right because a lot of people started interacting with me with others, Open Forum Europe became very active and I think that was what Fosdam did for us. Now remember that blog it did three things so I got in touch with people who did know how Brussels worked and that was useful because I didn't. I got an email from an aide to the Dutch senate and they were interested to understand what was it what this about so they started writing questions to the commission, they started writing questions to the Dutch government and these questions are obligatory to answer right so it created some pressure and we got a visit at an outlet lab from the Dutch delegation writing the CRA from the national like from the council perspective and it turned out that the Dutch were both very pro CRA but they also grasped FOS so that I think was a win because at that point in time we kind of or at least I could start to talk to the right people instead of the friendly people because that was kind of how this works right. So the questions from the senate helped for the civil servants to actually show up because they need to focus their time on where the relevant problems are right and I got some help from people that are more experienced working the Dutch government so I want to thank Bert. Then came some silence because FOS was over and I talked to the Dutch and then everything moved back into its own room right or just silence and I got a rejection from parliament because I applied to go to the hearing and they're saying yeah we don't know you we were mostly talking I mean there's limited seating and we're talking to the people we've heard from before so that was a bit of disappointed I mean it was a very kind rejection letter but it didn't help me as like a random Joe trying to but also open forum Europe built up steam so in and so open forum Europe was the place where a lot of us met on a weekly basis Kiran made an agenda every week I was now in the corner and he basically made sure that people were actually discussing the same topics no matter whether and we or maybe I got lucky again because about every couple of weeks I got from random people leaks from the council discussion from what the parliament was discussing and I learned how to analyze these and just share like oh this is what is being discussed and I also had to I also learned how not to write a position paper because we wrote some they were very lengthy and I think they were completely ignored so that was a good thing to learn and we yeah and we learned that the policymaker perspective perception was that we were just trying to get out so I and I think a number of people shifted focus to challenge some specific assumptions instead of arguing about the scope so over the summer I started emailing some me piece the commission the council there was a lot of silence and a bit of despair but then in the October November time frame suddenly communication started flowing we got emails not just me anyone asking for reviews we got asked questions for input there was a proposal floated about an open source steward and no one saw coming and I think at that point we started to be in the position where we should have been before the train left there were people engaged on a topic talking to the right people about the policies they were making and I think the policy makers delivered in parliament in the commission and in council by actually having conversation about these topics with people working on the specific aspect so this is how I got to these lessons learned we were too late we cannot expect the community to organize like industry and I think digital dossiers need a way to get us involved at a stage where where in a way that actually works for the community oh yeah and you should talk to the co-legislators that's it for me I'm having over the end so because I let it did a lot of things weirdly and I'm happy to hear about it so I think it raises the questions then from the very beginning I sort of said this is how it normally works you get organized you get interested and then you start writing stuff and then you speak to the co-legislators but the question is and Martin raised it very well he got lucky I wish I could be that lucky in my life but how are we gonna get organized who is gonna be the person that is enough interested in the open source community so that we get the information at the right time who is gonna write the stuff how are we gonna agree on the stuff that we want the open source community to say all those questions we need to figure them out figure them out on our side because Martin just said the co-legislators and basically all the institutions need to figure out a better way to interact with us but how do we also get better at interacting with them and that's the question that we're probably going to have to discuss today during the workshop the and then the fishbowl and that's I think the question of today in this specific session are the two what do we want the co-legislators to do so that they can come to us more often and in a better manner and how can we step towards them as well so that we also get better at interacting with co-legislators and the commission thank you
FOSS policy engagement: The impact of the NGI Open Source projects on EU policy and values
I trust them. However, now we want to hear about, is this the first slide? Oh, sorry. Like this. OK. Jean-Luc, and Kemen, who starts? Good morning, everyone. I'm Jean-Luc Doréle. I work for the next generation unit, an generation internet unit. It's not a new unit. So together with colleagues, and I have one of them here, we are supporting the so-called thanks generation internet initiative, NGI. So I don't know how familiar you are with NGI, but I will introduce that in a few words. And then Clementine will give you some insights on what we are currently doing in analyzing the impact of this initiative. So we have heard about the legislation this morning, the CRA, PLD. We'll probably hear about the DMA a bit later, the DMA-DSA. So the European Commission is the executive body of the European Union, and we have the right of initiative, and we give initiative and suggestion to the co-legislators. And we have two main, to simplify, two main stream of action. One is legislation, one is money, funding. So I'm representing the funding part, and the colleagues who are representing the legislation part. And in the funding part, you will have many, many things like agriculture, policy, regional development, infrastructure, and you have something called research, and more precisely, the Horizon Europe program. So this is where the money for research is channel, the main instrument. By the way, it's a law, so it's also a legislation, so co-legislator matters, even in this part. And as part of Horizon Europe, there is a lot of actions. You can imagine, it's the research spectrum. It's a little bit less than 100 billions euro over seven years, that's the budget. And if you scroll down, at one point, you will find something called next generation internet, that's us. It's been operational for five years now. We are completing the fifth year. We mobilized between 20, 25 million euros per year. That's the envelope. And what we do is we fund open source. So I don't know how familiar you are, but there are a lot of people funded by us in this FOSDEM event, these two days. There was a session yesterday, there was a stand from NLNet. And the way we work as commission is that we don't give the money directly to the funded party, but we give money to intermediaries. Why? Because in the context of open source, we believe that we don't have the instrument to fund open source communities, because at the commission, we have instruments for big consortium, 2, 3, 4, 5 million, and a lot of participants, et cetera. So what we do, we give money to intermediaries, and we have projects, and some of them are here. And they, in turn, take this money and do calls, open calls. You can find them in ngi.eu, and there is somewhere a section on those calls that are open more or less continuously. We just awarded a 27 million euro funding for the next three years, so it's a quite significant project called NGI0 Commence Fund. And this notion of commence is that we are funding community-based and community-governed development of open source. So there are some nuances in open source that will not dig into that details, but check the eligibility if you're interested, and you will find that it's relatively, it's very open. Well, five years means that we need to step back a little bit, understand what we have been doing. It's 1,000 funded projects. Again, you can find the catalog in ngi.eu, 1,000 projects, and we need to understand what's the dynamic of this portfolio, because comes the next steps of funding, and it is a core legislation, it is a policy, so we have to have the impact understanding so that we can come to the decision maker and tell them that's the five years we have done this and this, and this impact this and this. So we have contracted the Gartner and the Clementine that was involved also in the early discussion more than five years ago that settled the rules and the principle for this initiative. And I'll leave you the floor for detailing precisely the benchmarking of these five years of NGI and its impact. Thank you, Jean-Luc. Good morning, everyone. I am Clementine Vellier. I used to be a researcher here at ULB. I studied a master in IT at ULB, and I have been coming to FOSDEM for the past 20 years. Five years ago when the European Commission launched the bid for creating the next generation internet initiative and shaping it, at Gartner we are humbly in love of technology, and so of course our understanding was that it had to be digital commons as well, the next generation internet, and that we would have to reach out to the existing community. So what did I do? I called an LNET, I called Mikhail, and that was my lucky moment. We won the bid. And so that year at FOSDEM when we finished the study, we had one of the keynotes of FOSDEM presenting the NGI study with the policy officers at the time. So that's also for me, I think, the first where the European Commission came to FOSDEM, and I'm glad that they're here every year since. The reason why we thought it was also interesting to present the NGI in this track, as Jolnik was explaining, is that the NGI initiative with its thousand projects has an impact also on EU policy, because it actually helps implement them. So we've been trying to collect data points fast to present them at FOSDEM today, and these data points come out from the initial analysis of a survey of these NGI thousand open source projects. So we looked at how they implement EU law, how do they carry and foster European digital rights, how do they provide alternatives, how do they link to standardization, what standards do they implement, and the topic of sustainability is also analyzed through other different dimensions. So basically, there are six European digital rights and principles. We have data points here that basically show how many projects implement how many digital rights, and so basically we have over 20, 25 percent of the projects in NGI that implement up to three of these digital rights, and we can cite this number, which I think is really interesting. You have three quarts of them that actually state that they increase safety, security, and empowerment of individuals. These EU digital rights are really important. I encourage you to follow the link and keep them at heart. I think they stick close to the open source community as well. So this was one of the initial findings. The next one is about how do they fuel the uptake of EU policy. So we have cited many different EU policies. Those that were actually mentioned by Jean-Luc, so I won't go too much through them, but I think what's interesting to see is that let's say 90 percent of the projects see themselves as implementing EU policy. So to be honest, I think the open source community understands somehow EU policy because they have similar values and they understand what they're thriving for. We're getting there. Another data point that's really interesting, of course, is the fact that more than half of the projects state that they're implementing an alternative to large monopolistic solutions that exist. And so as these alternatives are open source and available, they will, of course, be a good source for driving digital sovereignty in Europe. The standards. The standards are also very strong on the European Commission's radar for many reasons. For one, they fuel innovation, but they also help create, let's say, scale and breadth in the market and interoperability aspects as well. So obviously the NGI projects are rooted in the standards community, mostly ITFW3CIOIEE, but we've heard also other standardization organizations linked to geospatial aspects that were cited. And now, of course, this is key as well because the NGI is going to, of course, be part of the Web 4.0 that will take place worldwide. And so geolocalization will be a key part of that. What about sustainability? I mean, we've seen in these last data points that there's a strong footprint in those EU digital rights, values, policies, standardization, but what about the outcome? How do these projects follow through? Well, actually, three quarters of them follow through. It's a significant data point for many reasons. For one, it shows that there isn't a concrete outcome. So either they follow through because they are part of a larger effort, an open source community or a business, and so those contributions funded by public money have been integrated back into the community solutions. The second point that's interesting is that 8% of them have, as a follow-up, created a company or foundation. And to build on that point, so we can extrapolate the data and then estimate that NGI so far in those five years has generated probably a data point around 80 companies or organizations or foundations that have been created. What about those companies? So we asked what are those business models? And interestingly and probably not unsurprisingly, the main business model is about SaaS hosting hosted solutions for almost 50% of them. So let's say this is also an interesting data point because we understand that this releases or at least takes away the struggles for implementing other people's open source tools and solutions and the technical hurdles if you're not a techie. So the reuse is also important. How are these tools reused and how are they made available to end users as a wider part of consumer base? Another very interesting data point talks about the developers. So what about the developers? What about the open source community? Well, the NGI definitely attracted the community and the talent. So in that sense, we were very happy to see some of these data points because it was the bet to actually reach into the existing community and inject the funding into the one that's actually making the internet run and trying to make it, let's say, pursue further its maintenance, its update and its innovation. And so in terms of community, we have 40% of projects that have a small community. We have 15% of projects that have a large community. We call that over 50 people. And then we also have six of these projects that are running in a community larger than a thousand people. So if we extrapolate that data, we can estimate that 80,000 people are actually contributing to the NGI through code testing or bug reporting. This is a very important data point for us. But it's also interesting to see how the NGI solutions are channeled through the open source community through the mainstream distros. This means that the user base is also expanding. We're looking on how to make sure that these solutions are reused. And last but not least, I think one of the main lessons also is the notion of agility of funding. The mechanism that Jandik explained at the beginning of this session is about funding innovation. And that was also one of the recommendations of our study was how to do this. We have these notions of course of fail fast, of incremental funding when you fund innovation. And the fact that three quarters of projects pulled through means that a quarter of them are recognized as failing fast and that's fine when you fund innovation. I think from my take today, from the discussion of this morning that were really interesting, as much as we managed here to do an agile funding scheme for the open source community, as much the European Commission and NL net have done with the CRA discussions, almost an agile policy making approach. And I think there's lessons to be learned from that. I think there is value in iterating, in iterations. And I think the open source community has taught a lot in an unorganized organization that you can't really tap into formally, but somehow always finds a way to innovate and always finds a way to get their messages through. So for me, this was a really insightful morning and I'm really happy to have presented these data points today with you. Thank you. Thank you. Yeah, let's hope we can continue with the interesting and conversations. With this, I'm going to hand over to Kiran, who is there.
FOSS policy engagement: EU FOSS Act
Thank you very much. So we spent the last 15 months thinking about all the problems that could be solved, caused by a piece of e-legislation. At the same time we were trying to think in a positive sense, what could we do to get something positive? How could we have something useful? And so one idea that I started thinking about was an EU FAS Act. However, after discussing this with a few people I agreed this is not the time for a EU FAS Act. It's not certain that it would get done correctly. So I just added a little bit of punctuation there to get myself out of that one. So the idea is when we saw the CRA and we saw that it was a very long document with a single paragraph about free and open source software, we started thinking about is this the right way to do legislation. Free and open source software has been getting bigger and bigger over the past few years. And so I looked for a few studies and one study found that 76% of code bases are free software. And this could be biased in favour of free software because how do they get access to these code bases? But the same report also said 96% of the code bases contain free software. So you could imply from this that also some of the code bases that were not available would also contain free software. So another way of counting it was by NISA which is partly funded by the European Commission so there's some legitimacy there. And the average IT application is 70% open source components and this number it may be higher today because that was a 2019 number and it also concluded that this was already a doubling since 2014. So the actual number for 2023, 2024 could already be a lot higher than that. So what we're trying to think about is in future if legislation was to be written is there a way we can ensure that the drafting matches the reality which today is that free and open source software should be the assumption. We should have texts that regulate software with us in mind and there should be a little paragraph at the top if you happen to decide that you absolutely must keep your source code secret then you have separate obligations but the default should be free and open source software. So the CRA was a lot of work. We heard from the Commission on Thursday that they put out a call for input and they got over 100 responses and only two of them mentioned free and open source software. So Openform Europe was one of the people who responded. I scanned the response, I couldn't find the other one if anyone knows who it was interested. But the thing is the response to the input we gave was on a fairly generic proposal, you know, we're going to do something in this domain and we gave a generic response of well, be careful. And so this is all we can do if we don't have the specifics of what is actually going to be regulated. Now, possibly due to the input we gave or due to other people's input, there was this one paragraph in the CRA and it was very useful. It gave us a starting point to have conversations. This one paragraph later got turned into pages and pages of text to improve the situation. But it was interesting that the CRA got written and then after the drafting of the CRA during 15 months of negotiations and meetings, the actual provisions that will affect free software got written. So what we were thinking of is what would be the principles which we would like to see in the way legislation is written in the European Union. And so one is that we would like the topic to be discussed at an earlier stage in detail and so preferably with all the people in this room, preferably in the most public way possible. But there are also internal bodies in the European Commission. The European Commission has an open source program office in ASPO and maybe, I don't want to hand them too much work, but maybe they could be consulted either to comment on the content or to be asked who should we be talking to. They currently have two staff, maybe that could be more. The question then is also raised of when we want to consult, when we want to talk about how will this impact us, who do we phone? And it's true, there is not a single industry association and it's not even an industry. So it is a difficult thing. However, we also have to try. And so first off, European Commission coming to FASTEM last year this year, it's a fantastic step and it's a fantastic way for them to get a feeling for who is actually behind all this software. But we are made up of small entities and they need funding if they're going to participate in any of all these meetings and all these procedures. And this already complicates the way we work together. At the same time, there are instruments in the European institutions that could be used. So there are high level expert groups for certain ICT and standardization topics and there are multi-stakeholder platforms and maybe there could be one for free and open source software. The second thing is that things should be clarified earlier. And so at the moment, the text of the CRA is complete. However, we're now going into two and a half, maybe three and a half years of negotiating 44 different standards with standard organizations where we're not well represented. It could be useful to have certain assurances that we will be taken into account and they're taking into account this category of software, which is the majority, should be a basic principle. And so this could go into the general text of legislation and then specifically on standards rather than a definition of open standards in the EU legal system, which says that if the standard is developed in an open process, it's an open standard, which leads to standards which we may not have access to. We don't know if we will have access to these standards that we can't implement in free and open source software. And so this idea of open standards freely implementable standards needs to be thought of in terms of patents, in terms of technologies, in terms of compatibility with our development and our business procedures. So with these, we hope that we would at the least, well, we hope that we will get access to all these standards and we hope that we will be able to use them. These things can be clarified in regulation 125 is up for review and maybe this is somewhere we can put these suggestions in. So in summary, what I wanted to say is that what we're hoping will happen is that the European institutions will discuss early and clarify early and that would help us out a lot. But from our side, we also have to figure out what exactly we're asking them to do. And when we're talking to specific people, we have to make sure that we're asking those people to do something that is within their power. And so from our side, there's also if we want to get to the stage where we could trust the EU institutions to write an EU FAS Act, we also have to improve the way we interact and make sure that we're asking for the right things and giving them the right input. With that, when I'm chairing meetings, I'm quite strict with timekeeping and I like to eat my own dog food, so I'm going to cut myself off and thank you all very much.
FOSS policy engagement: panel
Okay. So then we can go over to the panel in a second. Next to the people you just already met, we have Jules from the European Parliament here and Walter who is from Freischrift and also connected to Etri. And we want to now basically discuss what we've learned during the day with CIA, but also like in these quick lightning talks. And there was an interesting question from the Matrix Channel, and I want to use this to kick off the panel and ask you, open source is a worldwide and thus also European thing, one of the few citizens movements that is somewhat European. And the European Commission is most mostly unaware. Question mark. Do you agree on this or do you agree that like the open source community is not around in Europe at all and that the European Commission doesn't care at all or how do you see the situation? Maybe also in the Commission, but also in the Parliament and Walter, you want to jump in immediately. Then Walter, you have the floor. I would agree with the idea that it's a global thing and also that's a very much a European thing. I'm not really aware of any equivalence of FOSDEM elsewhere on the planet. FOSDEM is in that sense a very unique community event. So in that sense, it is a European open source community. Moving on to the Commission, one of the things that is worth to keep in mind when you try to engage with policymakers is that governments are not monolithic entities. And I think the CRA itself has proved that half of the Commission was not knowing what the other half was doing. Also be aware that in large bureaucracies people come and go. So there may have been wonderful people four years ago in the Commission. They have moved on to different positions. So if you're talking to a bureaucrat or a bureaucrat in your own country, be aware that they have certain powers and influence, but it may not extend to the things you want to have changed. So that was more the point I was going to make and I'd like to hand over the microphone to someone else now. Who wants to jump in then? I can try. So one question is the global nature. If you look at the statistics that we got from 1,000 projects, and it's not insignificant statistically, you will find that the people that are working in our projects are very well aligned with European policies. It can be this European declaration for digital rights and principles that was mentioned, but it's also surprisingly a very good alignment with the policies that like CRA or DMA, DSA, there is an alignment and the community, open source community we are dealing with, they are very much impacting the implementation of this policy. Now for the Commission, yes, it's a large bureaucracy. There are a lot of departments there. I'm representing the one that is funding, so assumingly a bit more friendly with you guys. For the legislation part, I think the CRA is an example of ultimately a good relationship. The question is how to represent the communities. There are a lot of organizations at job and level, and I think they did a great job in this context. Now what is true is that the developers themselves, they are working on what they like to do, and they are not necessarily very interested in this discussion and conversation at policy making level. So in that context, there is in my view a lack of representation of developers themselves. It's also true at the national level. It was mentioned the legislation at European level is a co-legislation, so member states and the parliament are key players, and we don't see, well, it's maybe it's a biased view, but at the national level there is also a need for representation of the communities so that they can influence a national but also European level development of policy. There are a lot more to say, but I think we can give the floor to the others. Thank you. I work for the European Parliament delegation in the European Parliament. That's why I'm here today, and I won't talk specifically about the commission, but more about the parliament, which I know better. And in that case, for PLD and CRF, for instance, we've seen that there were slight inconsistencies in the outcome of the negotiations between the two legislations, which shows that there is also lack of awareness in general, and we've seen that especially at the beginning of the term where there was a great lack of awareness on what open source is, what open source community is, and especially how diverse the community is, because you have foundations, business models, but also hobbyists, which are different persons, different interests. And that was reflected, I think, in the PLD and CRF, that at the beginning there was this commercial activity that was very much defined, and anyway, on the general level, I would say that along the term, the awareness raising has worked, and that, for instance, in PLD, the rapporteur were very much open to include interesting elements and important elements for open source community and to better protect open source community, so I would say it's getting better. Yes, sure. I mean, I think everyone has a role to play, and we see how people organically step up and how things play for them when they do it. I think it's a very democratic approach as well. I think the past year has proven that it's possible to influence EU policy, even if it's not organized, and of course, all the lessons are being made today. But again, I want just to remind that the European Commission was here at the keynote of FOSDAME five years ago to mention the fact that there was funding available and other communication passed really well, and the thousand projects exist today, so I guess a similar approach has to be found for voicing EU policy concerns to a certain extent, and of course, if some key stakeholders in the community, be it the foundations or those in the room today, can help in creating some forms of guidelines and templates just to make sure that we have a good understanding of the future. Thank you. Thank you. It also costs much more of institutionalization at the Brussels level. Get subscribed to those newsletters, but there's bad news inside of the good news again because the digital rights groups have more and more become professionalized, and you end up hiring people with background in law international relations because you're engaging with legislative processes and you want people to understand law. But then you end up with similar knowledge gaps inside civil society, and there's also a need for people with background in development and a more technical background to keep engaging their nearest digital rights NGOs, et cetera, and they're quite most welcome to do so because one of the things I find regrettable is that civil society did not mobilize quick enough around the CRA. It's more a case of good luck or a happy coincidence that we got where we are, which I still think is not good enough, but not as bad as it could have been. It also shows that there's work to be done on the civil society side of these things. Yeah, absolutely. And I also do believe that with this we can open up the discussion a bit so that we turn this panel into a fishbowl. So that's basically the idea and I want to do this together with Karen. Maybe you first introduce yourself and then we explain a bit what we are trying to achieve. Hi, I'm Karen, and I have been sort of coming to not fuss them, but digital events, I was saying to Walter, I could have worn a t-shirt like his, but a different model, different size from the CCC. I got engaged in sort of trying to figure out how tech influences society, and then I got into politics, and now I'm in the European Parliament for the Liberal Group and was working together with Jules on ensuring that open source was included into the PLD in some of the last sort of shadow meetings and trilogues by taking text from the CRA. And it's quite apt that I'm in a fishbowl because politicians have so much going on that they're a little bit like goldfish, so you need to remind them and keep telling them what they need to know. Yeah, and with this, so a fishbowl discussion means that normally we have these panel discussions and then you are happy to ask questions and then panelists address them somehow and basically we talk about ourselves, but there's this empty seat here and yeah, and we feel to use it and to take part in the panel discussion with the panelists and so put your points on the table, but also discuss with the other panelists, so feel free to join. First of all, but we also have to say that as this is foster, we would not take the first ones and dropping by, but we will also look a bit for diversity and so we highly encourage you if you are not quite male men to also take part in this discussion here and I think it's in particular important for Fossum to do so, so please take this empty chair and take part in the discussion. Yeah, so, okay, anyone else? Then feel free. Yeah, sure. So the only most important thing is, yeah, respect yourself and please with the microphone. Oh, sure, sure, sure. Yeah, so let's see how it works exactly. Yeah, yeah. Yeah, so and I said it's important to have the microphone for or like. I understand that, yes, and speak loudly regardless. Good. And don't worry, I'm not going to stay here for very long. I'm going to step out as soon as I can. Anyways, I'm Jonas and I work for Nokia and responsible for open source things there and the kind of like I'm been really excited about the topics of kind of like what has been going over the what is now called the policy week, starting with the commission thing on Thursday, then the open forum Europe thing and now this the discussions here. One thing that I noticed that in the in these discussions is that and some of these discussions I've heard before so I have a little bit of a kind of like an dark past so I am and recovering standards person and I have a history and internet governance things and they there's a lot of parallels into the kind of like where we are now as an open source community where we were in a kind of like Internet community community when it started in like early 2000s the discussions. So but the good thing sadly that's kind of like that comes with good and bad things and the good things is that people now finally understand that open source is actually something very important for the society. Bad things is that people get very interested in those things and then things like political interests come in involved as well and all of a sudden we need to get organized differently. So but I think that the good thing is that we've done this in the Internet community already before and I think we can look at some of the processes and some of the kind of like things what we did at that time. And similarly like an open source community the Internet community was very diverse community. So we had big companies like me for working for Nokia we had small people that were advocates and then they kind of like we had the civil society and then governments and so on. And the thing is also that we need to organize ourselves in a little bit of similar fashion. So and I think that there is a little bit grass roots already happening that that because for instance for us in there were a lot of people said that the open source community is not as organized as the industry or something like that. To be honest in this topic the industry wasn't very well organized. I started to read about the C.R.A. from actually the NL Labs blog post which was very good that woke up a little bit. Hey what is this and what should we look at that. And and then the kind of like the overall discussion started to go a little bit and I agree that even for the kind of like industry we had a lot of luck because we had a lot of people driving towards the same direction from the open source community. As what we need also as part of the open source community. The other thing is that what we want to start to work work together is kind of like looking at that we are all part of the community. We have different views. We have different approaches and stuff like that and we have to focus on those things that what we are in common and then look at it using the different strengths that are and the facts that different people are parts of the community. Even though that not always sharing the same ideas are also present in the different rooms that might be even difficult to get into by other parts of the community. And I think that through this we most probably can get into very fruitful conversations and very fruitful cooperation among the overall open source community. I hope this helps. Okay yeah anyone who wants to react on this. I think one of the key point when you reach out to institutions is get organized. I'm sorry. Get organized. Get organized first. Get on time especially because then negotiations can go fast. So you have to be there on time so that we can take on board your your inputs on time because then otherwise is very difficult. And the last thing is especially if you get organized that we can have one single contact point for instance which for us is very useful. And also for your inputs on the institution side again because you also have the campaigning side but for us what is useful is your expertise. So we need your expertise. And in that case we need your inputs on very specific points sometimes on the low side sometimes on the technical side. And that is why you need to get organized so have diverse teams so that they can provide inputs on low and technical side which is very important for us. Well I think as Kieran was saying earlier it's really important to get engaged early because that's when you'll have the most influence. And then also I think what Walt was saying earlier was that I mean it's kind of been forgotten what happened in the early 2000s. You had the software patents you also had actor and you've had the copyright directive. And I mean there is forces that can be mobilized that can actually reach the MEPs and now is actually the time to influence people and get it because they're running to be elected. So now is when they'll be listening. And the problem is since 2005 is that the big companies and society has gotten more engaged in regulation of technology but not always in the good way. It's sort of like let's use technology as a magical source to fix our problems. You have initiatives in the U.S. You have initiatives in the EU that are not very positive for regulating technology. You have safer kids online in the U.S. in the Congress. You have chat control in the EU which is probably going to go to the next mandate. And be aware that people that you think would be obvious to be supporting inclusion and adapting to open source might not always be so. We had a lot of pushback from the social democratic group in the parliament against including provisions for open source in the product liability directive because they were worried about protecting consumers. So you need to like really be able to explain the digital side to politicians that don't always get it and want to sort of regulate tech in a bad way. It's hard to be more. We don't really need to step into the middle like this is already a circle. Even better in terms of forum. We don't all look to the same people. That's pretty true. Yeah. For the next one. Okay. So everybody says get engaged. Everybody says get engaged early. And there were really good processes for that within the EU with Green Book, Blue Book, the whole shebang. And here we skipped a lot of the processes it seemed. Why did we do that? Is it not working? Yeah. Okay. Hi, I'm Jordan. I'm Karen's assistant. And I think you raised perhaps what is the single most important point. But I think the reason that we skipped a lot of that is because a lot of people aren't aware that you have this possibility like and it's. Yeah. But it's it's it's it's it's also it's also exhausting to have to keep an eye on all of the legislative proposals that are coming out of the commission. You know, I mean, I don't know how many they come, but it's a good like 30 or 40 proposals every every year. And I mean, yeah, we should like it's important that we keep an eye on this. Yeah, exactly. Exactly. No, no, no. I 100% agree. 100% agree. I'm super sorry. The thing is we need to keep the microphone for the live stream running. So that's so please even repeat when you do this strategy. Or do you want to? I'll take some of the tone out. What I said was if in response to the notion that there's like it's busy if you're in parliament to keep track of everything, we're actually as a community are being told we have to be proactive and track everything and it's we're just not set up to do it and it's Speak closer to the microphone. Interesting. Okay, so what I was trying to say is if it's exhausting for the people who actually work in parliament to keep track of everything, imagine how exhausting it is for the people that are just in the open source community dodging bullets as they come down from the sky. Very briefly, but also very importantly, parliament does not have to be a public parliament. Very briefly, but also very importantly, parliament does not have to write the inshaed legislation. You're barking up the wrong tree. It's the commission that has been overwhelming everyone else. They have a massive tsunami of really crap proposals in the past mandate. So was that the reason that the normal process was great to skip? Because if we have to be engaged early for us, this process is ideal because you then have one of these books, you can send it to people like look what's there and then respond. And because we skipped that formal process. How can we engage early if we don't see what's happening? Well, there is an obligation of transparency. There are processes like impact assessment. But I know it's something completely foreign from a lot of people. And here I see a gap personally. Again, I'm not on the legislation part. I'm not on the funding part. But the gap is that we heard that open source is very important. It's 80% of modern software. It's a large part of the IT business and IT stack. And a lot of players are benefit from it. The question is, so it's important, but there is a gap of representation. It's not in the mind of a policymaker. So what you describe is a situation where open source comes after. But if it would come first, it would mean that decision makers have this understanding that open source is important, especially in Europe. So this gap, in my view, is the main reason why we are late in the process. And to address this gap, it's a question of making decision makers that you guys are important. So it's a question of having people from the community representing, okay, we have European organizations that are doing great job, but we also need the grassroot people and the process to democratically represent them at the level of decision makers. That's my impression. Okay, I got it. So I was, thank you for all the answers to this. And I was listening a little bit to what you were saying, but also to the point that we're made throughout the week since we had this event from the commission, the OFE event as well yesterday, the summit, what they call. And it reminds me of something I used to work on before. That's called SME policies and this principle that is called the Think Small First Principle. It was developed by an expert group on SMEs. Hopefully one day the commission has an expert group on open source so that we can actually reflect on those principles. And I was reading through the paper that was written in 2009. And just if you look at the table of contents, it starts with the importance of SMEs in the EU economy. That's exactly what we've proven throughout the week, the importance of open source for the economy. The impact on regulations have on SMEs. We're in the same situation, impact that regulations have on open source. And then they explain what is a Think Small First Principle. First, listening to SMEs, listening to open source community. The second is what they call the SME test. Why don't we just go for an open source test every time we enter into a digital legislation? We think, can we apply this to open source? Does it work with open source or not? And then after that, it's a bunch of stuff that are way too bureaucratic for me to read them through today. But maybe if we have an expert group one day, we can actually build us the bureaucratic stuff as well specifically for open source. So I'm thinking, you know, we've been discussing the processes, but maybe also the policies could be adapted. So I have just a practical proposal. There used to be an application that was called Follow the Law. And then you could just put in like, I'm interested in digital legislation. And then you would be notified at the beginning of the process that a process was happening. And you could follow all the way through at what point it was, who you would need to address if you wanted to be part of it. So maybe this can be remade again. Unfortunately, it is no longer. Someone from the panel wants to say on some of these remarks here. Yes, exactly. I was hoping someone would step up from the community and said we have a tool to help for this. So, kudos to you. Another point we've mentioned that many times in this conference and the previous conferences, software is pervasive everywhere. There will be software in everything as the world goes digital, and therefore there will be more regulation. That's unavoidable because software probably will be a commodity, just like we're regulating our air and our water and our electricity. So probably my question also to the open source community today is what legislation would you like to have to help open source? I think that would be a valid question as well. So I would like to respond to the gentleman in the red shirt because he said it would be a good idea to check policy against as a me interest of the EU. And especially open standards can really help bring forward the position of open source within the EU. And it would help European SMEs a lot if open standards are at the forefront of EU policy because that helps our sovereignty. It helps European companies exist in a landscape with big tech and really help small companies to compete. And if local governments are keeping to open standards, that helps a lot for SMEs. So maybe take it from the standard side and that will help engage open source as well. I'm Jason Peccio, I'm a biomedical engineer and I work in Brussels. Very active in everything regarding fab labs as well, the digitization of the hardware world. I'm more of a free software guide and the open source because I think this question of public European stuff is really about making the citizen more free. And what I saw in the digital rights stuff is exactly about that. And so if we talk about open source in the context of political framework, I think it's important to put the freedom as more core to it. What I made me take is about the data of the project that NGI supported. You said a lot of them follow European politics. I'm a bit surprised because for me politics follow the will of the people and so follow actually the community. And so this should actually, should. So my question is aren't those policies following the will of the community and therefore shouldn't the funding go towards funding the policies? Like we are supposed to give to take all free time to pay for people, like to pay and to use all free time to defend our opinions. Why should we be paid to tell the commission what they have to do since, well, that's what governments are for somehow. It's like democracy. European institutions. Get into it. So I will start with a question to the audience. Oh, you know what? I will ask a question to the audience to start. Oh, yeah. So Thomas de Pierre, what I do does not relate to what I do for open source. My job is not related to that. You may, some of you may know me because of my blog post a year and a half ago called I am not a supplier. I have a question for you and for the audience in general. Can you raise your hand if you get paid for something linked to your presence to this room today? Right. So what you are all telling me is that the people that do the work cannot represent it in this room. And then what I'm saying is that what we saw is that most of the people doing working on open source and we have ample data on that are not paid for it. They do it on the free time. That's what we call our best. The other people that cannot do the work you're asking them to do. My total time for open source and I maintain multiple packages that are niche but fully in need a lot of context and knowledge is this niche is two hour per month. Just reading the law is more than that. I cannot and people that do this cannot participate in the grass with people you're asking to participate in this cannot do that to come here. This is my free hobby time for the next three months or four months I have to take. So I understand all the discussions that are happening here around the fact that we need engagement from these people and we need this engagement to happen. How do you want us to do it? This is a part we need to decide as a community and this is a real thing we should probably talk about more than things like digital right and so forth which I already have a quite a lot of organization working on this. Thank you. Yeah I think that decision makers answer to these policy remarks. Well I mean the difficulty in Brussels is actually figuring out what's going on because once it's public the bulk of the influence has already been made because once the commission has their proposal out they've already made up their mind on what they want to put in there. So that's why I'm saying you need to influence the politicians that are trying to get elected before they get into the parliament and you need to get to know their offices. You need to get to know the policy advisors in the political group delegations and also like the policy advisors in the groups. And a lot of companies are trying to influence both the commission and the politicians and we are the and they have the time to do that. Not a lot are thinking about that the source code that is also at the basis of commercial code proprietary code is actually open source and that is maintained by developers doing it in their free time. And that is actually a message that was really difficult when we were sitting in the negotiation room to get across. And that's why we had to push really hard to actually get that part into the text. So being here taking your free time to talk with us and put these things into the room is really important. I think Edry has been doing a great job on digital rights. I mean the commission is now talking about digital rights but I don't know if you remember Maritjha Shaka that was in the parliament for about 10 years until the election in 19. I mean she read out a speech about digital rights and then told the audience afterwards this is actually from the communist party congress. So you can have a lot of fine words that sound good but it's how you apply them and implement them and put them into legislation that's really important. And I mean one of the things that we like was putting our head up against in the negotiation was that the commission has this blue book of definitions and stuff like that that nobody amongst the co-legislators actually had heard about. And we're like come on this is your own internal standards. This is not supposed to override legislation. That's supposed to like follow and be adapted according to the legislation that we're putting in. So I mean it's difficult for all of us and that's why we need to work together on this. Sorry for being long again. Could you maybe hand over to Edry as the representative? Well, no I'm not the Edry representative. I would say there's no separation between digital rights and open source. The digital rights movement was started by the open source people. We have become somewhat more from our roots in that respect but please do re-engage if you feel it's not sufficiently represented. You're deeply disagreeing with that notion I see from Elaine. I would also respond to the notion and I'm one of the people that's not being paid for this by the way I should mention. This is not a matter of being paid for this. You as a citizen if you want to be represented it's too much work for any citizen to get represented properly. And there are more channels than just your political party if you're a member and please be a member of any political party. Well maybe not a Nazi party but there are a democratic society that does not function solely through elections. There has to be permanent engagement through other channels than just the election. And that's what civil society is for. That of course it should be. And lastly democracy is not about the quality of decision making. It's about the non-violence of the decision making and the non-violence of the transfer of power. So do not be overly disappointed when the outcome is poor. At least we're not shooting each others in the trenches of Verdun again. And that's a win. If I may one comment it's on the question of the representation and the fact that we are professional of the profession. And we are not representing grassroots to some extent. I think the question you ask is the question of representation of democracy. It is not. Okay well I thought it was. But the question of money you ask. No no time. Ah fine okay. Time. Time. Yeah well time is money you know. People have used to pay the time without money. This is not about money. It's about time. Right. You need to keep maintaining the stuff you all used to live. Exactly. If I take away from that time to explain the things. I cannot keep the thing that you need to live and to keep alive. Working. Exactly. So well the argument is time is money. So you need a way to have the right people representing you and that will have the time and will have the money. Because they will have to be paid for that. And so they have to represent you and you have to find a way that you find the people that. So you have a European organization. You have OFE. I will not mention all of them because there are a lot of them. Well they are not representing. They are coordinating and helping whatever. But what you describe I think is a democratic representation of the grass root people. And you have to find solutions. There are examples. Look at IETF. IETF is grass root. IETF is people contributing with their own time. Wait, wait, wait, wait, wait. Not really. They are big tech. Okay. But it's one person one vote. If you like. It's not one. You want to talk about maybe ITU? No. Okay. You can think about organization that organized this representation and maybe take example. Maybe it's the wrong example. But the argument is you have to find ways to get a representation. That's it. It's not just open source that has to make time because the people in open source they contribute with what they can do best and it's mostly writing code. And there's a thing that people in policy roles have this attitude towards technology quite often that they say we don't have to understand it. And there's an educational gap there. And it's not that the people who write codes have to make time to explain all that. The people who make policy have to understand technology because it's at the core of our democracy. So there are people willing to explain it but don't take them away from writing the code because that's really important for society. So it's also an effort from the people in roles that are important for policy making to learn about technology and try to understand how important it is. So there you go. Thank you. Thank you. Good morning everyone. My name is Monique and I'm involved in the NGI community since a few years. I think this debate is amazing. I think we cannot pretend that we all speak and understand the same language because we have different skills and we do different works. This is a multi-stakeholder community and we have to respect each other and we have to learn to work with each other. I think the work that we have been trying to do in the NGI community has been really to talk different languages to different people and it's very difficult. At the end of the day my job is to run the NGI outreach office and I can tell you it has been very difficult to reach out to some of the open source communities but it has also been very hard to speak to politicians, policymakers. So the fact is that we speak different languages and we work according to different rules and to different, we have a different pace. But I think also that the progress and I've been here the whole week attending all the events, workshops at the commission, open source forum today, yesterday, here at FOSTA. I think it's amazing what's happening. We have open source, very high up in the discussions that a few years ago we could have dreamed of being even mentioned. So I think there is a cultural change and we all have a different role to play. So let's not, so I think to go back to the guy from Nokia, I think we really have to work on the processes and on the ways we can work better together. I think this should be the core topic of this discussion. Hello, I am Karin. I am not paid here to be here. I have my own company in, I have a background in public administration. I'm a contributor to RiotOS, my microcontroller operating system. So thanks, happy to help out on this. What I was wondering, I heard policy makers speak with a lot of people and groups before they make the first draft. Who are those groups? Who are those, what kind of institutes are that? Because that's a procedure that's already there with whom they talk. And maybe from that side out, can we then look on what as an open source community can we make or go into to speak with them? Okay, I'm already getting feedback from the repertoire of the session and so who will also have the, the, the, the, the I'm going to summarize what you are saying in the last minute. And he told me we have a lot of questions but no answers at the moment. And maybe we should like focus for the next minutes on this and to kick this off I would first of all, Simon. Thank you. So I'm Simon Phipsen just because I'm organizing the room. It doesn't mean I don't have opinions. So I work for the open source initiative. The answer to your question is in the impact assessment of the CRA. The impact assessment of the CRA explains that the European Commission went out of its way to approach SMEs for input on the CRA, but did not make any attempt at all to consult with the open source community or anybody who is a genuine community member. The reason for that is because that their societal model is a societal model that has manufacturers or producers, it has labor force and it has consumers and it has no recognition that for the last 30 years there has been a, there has been commons based peer production and it has no recognition of the fact that there are people engaged in that. And so the number one thing that needs doing is actually a very difficult thing that needs serious political engagement, which is the Commission as a whole needs to recognize that society is no longer the society of the 1950s and that people, that the people are connected to one another and that as a result of being connected to one another, they are actually doing things that are being harmed by their regulation that believes that unions and MEPs represent the people. Unions and MEPs do not represent, I believe, anybody in this room. And as a consequence the Commission completely failed. Maybe there is one Karen, I don't know. But as a result the Commission has discovered to its surprise that coming to Fozdem, Benjamin last year was quite surprised to discover he had a thousand new friends who wanted to share their opinions about his legislation in the corridor. This year he was a lot more positive about it, although he did ask me if he was going to be egged by the audience. So the most positive thing that I've got to contribute is a very difficult thing, which is the Commission as a whole has got to recognize that society is no longer the industrial society of the 1950s. It is now the digital connected society of the 21st century. There we are. So we are still looking for answers, right? Well, for the Parliament side I can say that the question was to who did we reach out to for the PLD, for instance we mostly talk to free software foundation open for Europe and Debian for us, and Marcel Scolias team. So I would think that in our team, for instance, if you send an email to us, we are very sensitive in the parts about the open source community, obviously. So if you send an email to us, obviously we will answer. But again, that's also the main purpose of the discussion today, you have to be targeted. If you're not working at all on the topic, we might not answer because it's not our topic, we're not working on the file on this specific file. So I think that's the main thing. But on our side, definitely we will answer to open source community almost every time. Thank you very much. I am Pablo Correa from PostmarketOS, a small but very fast growing community to put real Linux into phones. And there was one thing that caught my attention from your NGI presentation. You said that out of the millions and millions of euros you gave out, you managed to get 80 people to organize a community or a foundation to establish a legal entity. And that seems like that's a huge success, but that's also way too small for the amount of money. Or like my feeling, like I am part of this overworked people working their free time trying to grow a community to create some sort of entities so that we can at some point make this a viable product because we see people, we see consumers wanting this. We want this, people want this. And we are lacking the tools to actually grow into something where we can organize ourselves. We are lacking, we are all developers. And we are lacking people with the background in business. We are lacking people with the background in law. And the funding does not support us on that. The funding, like we don't get or we just say, like, can you please get involved? And it's like, well, maybe I can get involved in, like you said, in my 25th hour of the day. But if we are all working on something else because we can't finance ourselves, if you can help us like pay advice on free software, there's people from grassroots organizations happy to do that. We need help and we need funding not only for development, also for development, but we need funding for more things. And if you can provide that, that might actually help us. Okay. A very quick one. And these are also then the last words for the panel because then we have to transform for technical reasons with the camera. Okay, no, just very shortly on NGI. So we have a new project, NGI Commons. And there is a special effort to precisely provide business support. So in terms of what business model could be on top of an open source, you guys, you are developers who you don't necessarily have the right, yes, knowledge in terms of what sort of business models needs to be done. So this is included in a new project called NGI Commons Fund. So that's good that we ended the specifics. Yeah, so with this, I'm going to end the panel and the fishbowl discussion. Thanks for the insight. Thanks for all of you taking part in this. But now the funny part comes and Enzo will sum it up.
FOSS policy engagement: Fishbowl conversation - share your experience
Yeah, absolutely. And I also do believe that with this we can open up the discussion a bit so that we turn this panel into a fishbowl. So that's basically the idea and I want to do this together with Karen. Maybe you first introduce yourself and then we explain a bit what we are trying to achieve. Hi, I'm Karen and I have been sort of coming to not fuss them, but digital events. I was saying to Walter I could have worn a t-shirt like his, but a different model, different size from the CCC. I got engaged in sort of trying to figure out how tech influences society and then I got into politics. And now I'm in the European Parliament for the Liberal Group and was working together with Jules on ensuring that open source was included into the PLD in some of the last sort of shadow meetings and trilogues by taking text from the CRA. And it's quite apt that I'm in a fishbowl because politicians have so much going on that they're a little bit like goldfish, so you need to remind them and keep telling them what they need to know. And with this, so a fishbowl discussion means that normally we have these panel discussions and then you are happy to ask questions and then panelists address them somehow and basically we talk about ourselves. But there's this empty seat here and yeah, and we feel to use it and to take part in the panel discussion with the panelists and so put your points on the table, but also discuss with the other panelists, so feel free to join. First of all, but we also have to say that as this is foster, we would not take the first ones dropping by, but we will also look a bit for diversity. And so we highly encourage you if you are not quite male men to also take part in this discussion here and I think it's in particular important for FOSM to do so. So please take this empty chair and take part in the discussion. Yeah, so, okay, anyone else? Then feel free. Yeah, sure. So the only most important thing is, yeah, respect yourself and please with the microphone. Oh, sure, sure, sure. It's, yeah, so let's see how it works exactly. Yeah, yeah. Can you say something? Yeah, so and I said it's important to have the microphone for or like. I understand that. Yes, and speak loudly regardless. Good. And don't worry. I'm not going to stay here for very long. I'm going to step out as soon as I can. Anyways, I'm Jonas and I work for Nokia and responsible for open source things there. And the kind of like I'm been really excited about the topics of kind of like what has been going over the what is now called the policy week, starting with the commission thing on Thursday, then the open forum Europe thing. And now this the discussions here. One thing that I noticed that in the in these discussions is that and some of these discussions I've heard before. So I have a little bit of a kind of like an dark past. So I am a recovering standards person and I have a history and internet governance things and they there's a lot of parallels into the kind of like where we are now as an open source community where we were in a kind of like Internet community community when it started in like early 2000s discussions. So but the good thing sadly that's kind of like that comes with good and bad things and then the good things is that people now finally understand that open source is actually something very important for the society. Bad things is that people get very interested in those things and then good things like political interests come in involved as well and all of a sudden we need to get organized differently. So but I think that the good thing is that we've done this in the Internet community already before and I think we can look at some of the processes and some of the kind of like things what we did at that time. And similarly like an open source community, the Internet community was very diverse community. So we had big companies like me for working for Nokia. We had small people that were advocates and then the kind of like we had the civil society and then governments and so on. And the thing is also that we need to organize ourselves in a little bit of similar fashion. So and I think that there is a little bit grass roots already happening that that because for instance for us in there were a lot of people said that the open source community is not as organized as the industry or something like that. To be honest in this topic the industry wasn't very well organized. I started to read about the CRA from actually the NL Labs blog post which was very good that woke up a little bit, hey what is this and what should we look at that. And then the kind of like the overall discussion started to go a little bit. And I agree that even for the kind of like industry we had a lot of luck because we had a lot of people driving towards the same direction from the open source community as what we need also as part of the open source community. The other thing is that what we want to start to work together is kind of like looking at that we are all part of the community. We have different views, we have different approaches and stuff like that and we have to focus on those things that what we are in common and then look at it using the different strengths and the facts that different people are parts of the community even though they are not always sharing the same ideas are also present in the different rooms that might be even difficult to get into by other parts of the community. And I think that through this we most probably can get into very fruitful conversations and very fruitful cooperation among the overall open source community. I hope this helps. Anyone who wants to react on this? I think one of the key points when you reach out to institutions is get organized first, get on time especially because then negotiations can go fast so you have to be there on time so that we can take on board your inputs on time because then otherwise it is very difficult. And the last thing is especially if you get organized that we can have one single contact point for instance which for us is very useful. And also for your inputs on the institution side again because you also have the campaigning side but for us what is useful is your expertise. So we need your expertise and in that case we need your inputs on very specific points sometimes on the low side, sometimes on the technical side and that is why you need to get organized so you have diverse teams so that they can provide inputs on low and technical side which is very important for us. Well I think as Kieran was saying earlier it is really important to get engaged early because that is when you will have the most influence. And then also I think what Walt was saying earlier was that I mean it has kind of been forgotten what happened in the early 2000s. You had the software patents, you also had ACTA and you have had the copyright directive and I mean there is forces that can be mobilized that can actually reach the MEPs and now is actually the time to influence people and get it because they are running to be elected so now is when they will be listening. The problem is since 2005 is that the big companies and society has gotten more engaged in regulation of technology but not always in the good way. It is sort of like let's use technology as a magical source to fix our problems. You have initiatives in the US, you have initiatives in the EU that are not very positive for regulating technology. You have safer kids online in the US, in the Congress, you have chat control in the EU which is probably going to go to the next mandate. And be aware that people that you think would be obvious to be supporting inclusion and adapting to open source might not always be so. We had a lot of pushback from the social democratic group in the parliament against including provisions for open source in the product liability directive because they were worried about protecting consumers. So you need to really be able to explain the digital side to politicians that don't always get it and want to sort of regulate tech in a bad way. We don't really need to step into the middle like this is already a circle. Even better in terms of forum, we don't all look to the same people. That's pretty true. Yeah, we'll talk about the next one. So everybody says get engaged early and there were really good processes for that within the EU with Green Book, Blue Book, the whole shebang. And here we skipped a lot of the processes it seemed. Why did we do that? Is it not working? Okay, hi, I'm Jordan, I'm Karen's assistant and I think you've raised perhaps what is the single most important point. But I think the reason that we skipped a lot of that is because a lot of people aren't aware that you have this possibility. It's also exhausting to have to keep an eye on all of the legislative proposals that are coming out of the commission. I don't know how many they come, but it's a good 30 or 40 proposals every year. We should, it's important that we keep an eye on this. It's exhausting for you, what do you think it's like? Yeah, exactly, exactly. No, no, no, I 100% agree. I 100% agree. I'm super sorry. The thing is we need to keep the microphone for the live stream running. So please, even repeat when you do these settings or do you want to? Just quickly. I'll take some of the tone out. What I said was in response to the notion that it's busy if you're in parliament to keep track of everything, we're actually as a community are being told we have to be proactive and track everything and we're just not set up to do it. And it's... Speak closer to the microphone. Interesting. Okay, so what I was trying to say is if it's exhausting for the people who actually work in parliament to keep track of everything, imagine how exhausting it is for the people that are just in the open source community dodging bullets as they come down from the sky. Very briefly, but also very importantly, parliament does not have to write the insinuated legislation. You're barking up the wrong tree. It's the commission that has been overwhelming everyone else. They have a massive tsunami of really crap proposals in the past mandate. So was that the reason that the normal process was skipped? Because if we have to be engaged early, for us this process is ideal. Because you then have one of these books, you can send it to people like, look what's there and then respond. And because we skipped that formal process, how can we engage early if we don't see what's happening? That's probably the problem. Well, there is an obligation of transparency. There are processes like impact assessment. But I know it's something completely foreign from a lot of people. And here I see a gap personally. Again, I'm not on the legislation part. I'm not on the funding part. But the gap is that we heard that open source is very important. It's 80% of modern software. It's a large part of the IT business, an IT stack. And a lot of players are benefitted from it. The question is, so it's important, but there is a gap of representation. It's not in the mind of a policymaker. So what you describe is a situation where open source comes after. But if it would come first, it would mean that decision makers have this understanding that open source is important, and especially in Europe. So this gap, in my view, is the main reason why we are late in the process. And to address this gap, it's a question of making decision makers that you guys are important. So it's a question of having people from the community representing, okay, we have European organizations that are doing great job, but we also need the grassroot people and the process to democratically represent them at the level of decision makers. That's my impression. Okay, I got it. So I was, thank you for all the answers to this. And I was listening a little bit to what you were saying, but also to the point that we're made throughout the week since we had this event from the Commission, the OFE event as well yesterday, the summit, what they call. And it reminds me of something I used to work on before. That's called SME policies, and this principle that is called the Think Small First Principle. It was developed by an expert group on SMEs. Hopefully one day the Commission has an expert group on open source so that we can actually reflect on those principles. And I was reading through the paper that was written in 2009. And just if you look at the table of contents, it starts with the importance of SMEs in the EU economy. That's exactly what we've proven throughout the week, the importance of open source for the economy. The impact on regulations have on SMEs. We're in the same situation, impact that regulations have on open source. And then they explain what is the Think Small First Principle. First, listening to SMEs, listening to open source community. The second is what they call the SME test. Why don't we just go for an open source test? Every time we enter into a digital legislation, we think, can we apply this to open source? Does it work with open source or not? And then after that, it's a bunch of stuff that are way too bureaucratic for me to read them through today. But maybe if we have an expert group one day, we can actually build us the bureaucratic stuff as well, specifically for open source. So I'm thinking, we've been discussing the processes, but maybe also the policies could be adapted. Could you go ahead with the microphone? We have a great pleasure. So I have just a practical proposal. There used to be an application that was called Follow the Law. And then you could just put in, like, I'm interested in digital legislation, and then you would be notified at the beginning of the process that a process was happening, and you could follow all the way through at what point it was, who you would need to address if you wanted to be part of it. So maybe this can be remade again. Unfortunately, it is no longer. Someone from the panel wants to say on some of these remarks here. Yes, exactly. I was hoping someone would step up from the community and said we have a tool to help for this. So, kudos to you. Another point we've mentioned that many times in this conference, in the previous conferences, software is pervasive everywhere. There will be software in everything as the world goes digital, and therefore there will be more regulation. That's unavoidable because software probably will be a commodity, just like we're regulating our air and our water and our electricity. So probably my question also to the open source community today is, what legislation would you like to have to help open source? I think that would be a valid question as well. So I would like to respond to the gentleman in the red shirt, because he said it would be a good idea to check policy against as a me interest of the EU, and especially open standards can really help bring forward the position of open source within the EU, and it would help European SMEs a lot if open standards are at the forefront of EU policy, because that helps our sovereignty. It helps European companies exist in a landscape with big tech and really help small companies to compete. And if local governments are keeping to open standards, that helps a lot for SMEs. So maybe take it from the standard side, and that will help engage open source as well. It's a double question. Yes, I'm Jason Petjo. I'm a biomedical engineer and I work in Brussels. Very active in everything regarding fab labs as well, the digitization of the hardware world. I'm more of a free software guide on the open source, because I think this question of public European stuff is really about making the citizen more free, and what I saw in the digital rights stuff is exactly about that. And so if we talk about open source in the context of political framework, I think it's important to put the freedom as more core to it. And what I made me take is about the data of the project that NGI supported. You said a lot of them follow European politics. I'm a bit surprised because for me politics follow the will of the people, and so follow actually the community. And so this, I mean, should actually, should. So my question is, aren't those policies following the will of the community and therefore shouldn't the funding go towards funding the policies? Like, we are supposed to give, to take all free time to pay for people, like to pay and to use all free time to defend our opinions. Why should we be paid to tell the commission what they have to do and what the government's, well, that's what governments are for somehow. It's like democracy when you're one. The European institution is a global source. It gets paid to the legislature. So I will start with a question to the audience. Oh, you know what? I will ask a question to the audience to start. Oh yeah, so Thomas Dapierre, what I do does not relate to what I do for open source. My job is not related to that. You may, some of you may know me because of my blog post a year and a half ago called I am not a supplier. I have a question for you and for the audience in general. Can you raise your hand if you get paid for something linked to your presence to this room today? Right. So what you are all telling me is that the people that do the work are not represented in this room. And what I'm saying is that what we saw is that most of the people doing working on open source and we have ample data on that are not paid for it. They do it on their free time. They are what we call our best. The other people that cannot do the work you're asking them to do. My total time for open source and I maintain multiple packages that are niche but fully, indeed a lot of context on knowledge is this niche, is two hour per month. Just reading the law is more than that. I cannot and people that do this cannot participate and the grass with people you're asking to participate in this cannot do that. To come here, this is my free hobby time for the next three months or four months I have to take. So I understand all the discussion that happened here around the fact that we need engagement from these people and we need this engagement to happen. How do you want us to do it? This is a part we need to decide as a community and this is a real thing we should probably talk about more than things like digital right and source which I already have quite a lot of organization working on this. Thank you. Yeah, I think that decision makers answer to these policy remarks. Well, I mean the difficulty in Brussels is actually figuring out what's going on because once it's public the bulk of the influence has already been made because once the commission has their proposal out they've already made up their mind on what they want to put in there. So that's why I'm saying you need to influence the politicians that are trying to get elected before they get into the parliament and you need to get to know their offices, you need to get to know the policy advisors in the political group delegations and also like the policy advisors in the groups. And a lot of companies are trying to influence both the commission and the politicians and they have the time to do that. So I think a lot are thinking about that the source code that is also at the basis of commercial code, proprietary code is actually open source and that is maintained by developers doing it in their free time and that is actually a message that was really difficult when we were sitting in the negotiation room to get across and that's why we had to push really hard to actually get that part into the text. So being here, taking your free time to talk with us and put these things into the room is really important. I think Edry has been doing a great job on digital rights. I mean the commission is now talking about digital rights but I don't know if you remember Marie Tishaka that was in the parliament for about 10 years until the election in 19. I mean she read out a speech about digital rights and then told the audience afterwards this is actually from the Communist Party Congress. So you can have a lot of fine words that sound good but it's how you apply them and implement them and put them into legislation that's really important. I mean one of the things that we like was putting our head up against in the negotiation was that the commission has this blue book of definitions and stuff like that that nobody amongst the co-legislators actually had heard about and we're like come on this is your own internal standards. This is not supposed to override legislation. That's supposed to like follow and be adapted according to the legislation that we're putting in. So I mean it's difficult for all of us and that's why we need to work together on this. Sorry for being long again. Could you maybe hand over to Iqbe as the representative? Well no I'm not the other representative. I would say there's no separation between digital rights and open source. The digital rights movement was started by the open source people. We have become somewhat unmoored from our roots in that respect but please do re-engage if you feel is not sufficiently represented. You're deeply disagreeing with that notion I see from Elaine. I would also respond to the notion and I'm one of the people that's not being paid for this by the way I should mention. This is not a matter of being paid for this. You as a citizen if you want to be represented it's too much work for any citizen to get represented properly. And there are more channels than just your political party if you're a member and please be a member of any political party. Well maybe not a Nazi party but there are a democratic society that does not function solely through elections. There has to be permanent engagement through other channels than just the election. And that's what civil society is for. Of course it should be. And lastly democracy is not about the quality of decision making. It's about the non-violence of the decision making and the non-violence of the transfer of power. So do not be overly disappointed when the outcome is poor. At least we're not shooting each others in the trenches of Ferdinand again. And that's a win. If I may one comment it's on the question of the representation and the fact that we are professional of the profession. And we are not representing grassroots to some extent. I think the question you ask is the question of representation of democracy. It is not. Okay well I thought it was. But the question of money you ask. Fine. Time. Time. Time. Yeah well time is money. Yeah because people who have units choose to pay the time without money. This is not about money. It's about time. What? I need to keep maintaining the stuff you all used to live. Exactly. If I take away from that time to explain to the audience. I cannot keep the thing that you need to live and to keep alive. Working. Exactly. So well the argument is time is money. So you need a way to have the right people representing you and that will have the time and will have the money because they will have to be paid for that. And so they have to represent you and you have to find a way that you find the people that. So you have European organization. You have OEF. I will not mention all of them because there are a lot of them. Well they are not representing. They are coordinating and helping whatever. But what you describe I think is a democratic representation of the grass root people. And you have to find solutions. There are examples. Look at IETF. IETF is grass root. IETF is people contributing with their own time. Wait, wait, wait, wait, wait, wait. Not really. They are big tech. Okay. But it's wait, wait. It's one person one vote if you like. It's not one. Well you want to talk about maybe ITU? No. Okay. You can think about organization that organized this representation and maybe take example. Maybe it's the wrong example. But my only argument is you have to find ways to get a representation. That's it. It's not just open source that has to make time because the people in open source they contribute with what they can do best and it's mostly writing code. And there's a thing that people in policy roles have this attitude towards technology quite often that they say we don't have to understand it. And there's an educational gap there. And it's not that the people who write codes have to make time to explain all that. The people who make policy have to understand technology because it's at the core of our democracy. So there are people willing to explain it but don't take them away from writing the code because that's really important for society. So it's also an effort from the people in roles that are important for policy making to learn about technology and try to understand how important it is. Thank you. Good morning everyone. My name is Monique and I'm involved in the NGI community since a few years. I think this debate is amazing. I think we cannot pretend that we all speak and understand the same language because we have different skills and we do different works. This is a multi-stakeholder community and we have to respect each other and we have to learn to work with each other. I think the work that we have been trying to do in the NGI community has been really to talk different languages to different people and it's very difficult. At the end of the day my job is to run the NGI outreach office and I can tell you it has been very difficult to reach out to some of the open source communities but it has also been very hard to speak to politicians, policymakers. So the fact is that we speak different languages and we work according to different rules and to different, we have a different pace. But I think also that the progress and I've been here the whole week attending all the events, you know, workshops at the commission, open source forum today, yesterday, here at FOSTEM. I think it's amazing what's happening. We have open source, very high up in the discussions that a few years ago we could have dreamed of being even mentioned. So I think there is a cultural change and we all have a different role to play. So let's not, so I think to go back to the guy from Nokia, I think we really have to work on the processes and on the ways we can work better together. I think this should be the core topic of this discussion. Hello, I am Karim. I am not paid here to be here. I have my own company in, I have a background in public administration. I'm a contributor to RiotOS, my microcontroller operating system. So thanks, happy to help out on this. What I was wondering, I heard policy makers speak with a lot of people and groups before they make the first draft. Who are those groups, who are those, what kind of institutes are that because that's a procedure that's already there with whom they talk. And maybe from that side out, can we then look on what as an open source community can we make or go into to speak with them. Okay, I'm already getting feedback from the repertoire of this session and so who will also have the the honor to summarize what you are saying in the last minute. And he told me we have a lot of questions but no answers at the moment. And maybe we should like focus for the next minutes on this and to kick this off I would first of the Simon. Thank you. I'm Simon Phipsen just because I'm organizing the room. It doesn't mean I don't have opinions. So I work for the open source initiative. The answer to your question is in the impact assessment of the CRA. The impact assessment of the CRA explains that the European Commission went out of its way to approach SMEs for input on the CRA but did not make any attempt at all to consult with the open source community or anybody who is a genuine community member. And the reason for that is because their societal model is a societal model that has manufacturers or producers, it has labor force and it has consumers and it has no recognition that for the last 30 years there has been a there has been commons based peer production and it has no recognition of the fact that there are people engaged in that. And so the number one thing that needs doing is actually a very difficult thing that needs serious political engagement which is the Commission as a whole needs to recognize that society is no longer the society of the 1950s and that people are connected to one another and that as a result of being connected to one another they are actually doing things that are being harmed by their regulation that believes that unions and MEPs represent the people. Unions and MEPs do not represent I believe anybody in this room and as a consequence the Commission completely failed. Maybe there is one Karen, I don't know. But as a result the Commission has discovered to its surprise that coming to Fozdem, Benjamin last year was quite surprised to discover he had a thousand new friends who wanted to share their opinions about his legislation in the corridor. This year he was a lot more positive about it although he did ask me if he was going to be egged by the audience. So the most positive thing that I've got to contribute is a very difficult thing which is the Commission as a whole has got to recognize that society is no longer the industrial society of the 1950s. It is now the digital connected society of the 21st century. There we are. So we are still looking for answers, right? Well for the Parliament side I can say that the question was to who did we reach out to for the PLD for instance we mostly talked to Free Software Foundation, Open Forum Europe and Debian for us and Marcel Scolias team. So I would think that in our team for instance if you send an email to us we are very sensitive in the Pirates about the Epoenso's community obviously so if you send an email to us obviously we will answer. But again that's also the main purpose of the discussion today, you have to be targeted. If you're not working at all on the topic we might not answer because it's not our topic, we're not working on the file on this specific file. So I think that's the main thing but on our side definitely we will answer to Open Source community almost every time. Thank you very much. I am Pablo Correa from PostmarketOS, a small but very fast growing community to put real Linux into phones. And there was one thing that caught my attention from your NGI presentation. You said that out of the millions and millions of euros you gave out you managed to get 80 people to organize a community or a foundation to establish a legal entity. And that seems like that's a huge success but that's also way too small for the amount of money. I am like my feeling. I am part of this overworked people working their free time trying to grow a community to create some sort of entities so that we can at some point make this a viable product because we see people, we see consumers wanting this. We want this, people want this and we are lacking the tools to actually grow into something where we can organize ourselves. We are lacking, we are all developers and we are lacking people with the background in business. We are lacking people with the background in law and the funding does not support us on that. The funding, we don't get or we just say can you please get involved. It's like well maybe I can get involved in my 25th hour of the day but if we are all working on something else because we can't finance ourselves. If you can help us like pay advice on free software, there are people from grassroots organizations happy to do that. We need help and we need funding not only for development, also for development but we need funding for more things. And if you can provide that, that might actually help us. Okay, a very quick one and these are also the last words for the panel because then we have to transform for technical reasons with the camera. Okay, no just very shortly on NGI so we have a new project, NGI Commons and there is a special effort to precisely provide business support so in terms of what business model could be on top of an open source because you guys are developers so you don't necessarily have the right knowledge in terms of what sort of business model needs to be done. So this is included in the new project called NGI Commons Fund. So that's good that we ended the specifics. Yeah, so with this I'm going to end the panel and the fishbowl discussion. Thanks for the insight, thanks for all of you taking part in this.
FOSS policy engagement: rapporteur playback
So I was asked to try to summarize this and I was asked to do that before I knew it was a fishbowl and how difficult this is going to be. So bear with me and please don't insult me if I forgot something that you've said. I just tried to summarize. We're going to reuse the recording for everything that I've missed so that we can try to summarize all of this and all the points into a small report that we can share with policymakers and hopefully try to create that process where our voices are heard, which is the first point I'm going to try to start with. I think something that was interesting in the room is that there's sort of a divide between us about whether or not we believe that it's part of our role as citizens to get more and more involved within policy-making, more and more interested. And those that say, well, no, what I want to be interested in is writing code. And so this divide is interesting in a sense that one said, yeah, okay, but you're a citizen. You can just get involved the same way that, you know, if you're unhappy because there's not enough parking spot in your street, then you go to your town hall and you complain and you do it. Then the following point after that is that there is a huge, huge, huge knowledge gap and something that was missing from the conversation are ideas about how to solve that knowledge gap. But something that was quite interesting in that conversation as well, that is to say, well, there is only one organization that can really help you into doing that knowledge gap and as the European Commission. And that's where the sort of conversation ended on the knowledge gap. We know that the community has to do more. That's what people have said a lot and get more involved. And that's what people have said repeatedly. But we also have said that there is a work that needs to be done from the institution so that they share more information. And the only answer was the commission is the right place to do it. Again, don't insult me if I forgot something you said because I see people doing this. So sorry. Then someone... Something that was mentioned after and I'm not entering into the details of the NGI projects because I had trouble understanding them to be fair. I guess that's the knowledge gap. Is the lack of processes as well. So we've said there is a lack of knowledge. There is a lack of knowledge in our community. There is a lack of listening and from communicating from the institution. But something that was said repeatedly and repeatedly and repeatedly is that there is no processes for us to be able to express ourselves. And one was saying, yeah, we don't have processes within the community to be able to express ourselves entirely. Someone mentioned OFE, someone mentioned OFE as a coordinator, some organization that don't remember the acronyms because I was taking notes hardly. And someone else said, yeah, but there is also no processes within the institution for us to be there. Emails were mentioned by the parliament many times. Send us an email. We'd be able to share the information. But they also say the limits of their own resources. We're going to try to answer the emails, but if we're not really responsible for the files, we can't really do anything about it. And I guess there is something I can't really express more than what the community has done, which is we don't have a solution yet. So maybe the best way to conclude this summary of the FISHBALL is we should think about it more, discuss it more so that we figured out the way to solve the problem. And when I say we, I don't only mean the open source community, but also mean the policymakers because I think that's the call out that our community did to the policymakers today. Speak with us, but also speak with ourselves, with each other so that you figure out a solution so that we can be more involved. Again, sorry if I didn't summarize the best way possible. I tried to take better notes next time. Thank you.
Public services interoperability: [begin workshop] Free/open source and Interoperable European public services
Okay, good afternoon everybody. We're going to start our next session. Everybody has found a seat so we know how many people can come in. That's up to the people at the door. My name is Gai Selenius. I work with the European Commission Open Source Program Office. This is for the recording. Okay, I will amplify myself. I will quickly run through the ground rules which you have seen already. We would like you to make sure that everybody gets a chance to speak. Me and Lina will be handing around the microphone. This is for the people online. We want to focus on finding solutions for the problems that are coming to you soon. This is the third workshop. The title is Public Services Interoperability. It is made out of two parts. We will first discuss the Interoperability Act and we will then have a presentation on the Commission's Open Source Strategy. And the reporter is Axel who I think is still outside so somebody should get him in. And with that, we are almost ready for our first session. Welcome. Okay, so hello everyone. Just a very quick introduction. My name is Lina Cevajos. I work with policy at the Free Software Foundation Europe. I just want to make a little bit of introduction about this session. If you have been here before, you have seen that we are trying to try different formats doing workshops, fish bowls. And this format, we kind of imagine it more as a discussion, kind of like what we have had so you don't need to move, you just need to raise your hand. I will bring the microphone to you. And I also wanted this to be not like a technically like a Q&A but more like let's chat and let's try to find common ground.
Public services interoperability: The Interoperable Europe Act; the challenges and opportunities for the free and open source communities.
Welcome. Okay, so hello everyone. Just a very quick introduction. My name is Lina Cevajos. I work with Policy at the Free Software Foundation Europe. And yeah, I just want to make a little bit introduction about this session. If you have been here before, you have seen that we're trying to try different formats, doing workshops, fish bowls. And this format, we kind of like imagine it more as a discussion, kind of like what we have had. So you don't need to move, you just need to raise your hand. I will bring the microphone to you. And I also wanted this to be, you know, not like a technically like a Q&A, but more like less chat and let's try to find common ground. Yeah, so about the topic. So this is a, the session is on digital public services and interoperability. There will be a session more focused on, like that is not the digital public services later on. And because we're talking about this, we're going to have two parts. One is a focus on the Interoperable Europe Act to serve as an example of what's happening at this moment when it comes to digital public services. And this is an act that, you know, happened last year. And I personally have to say that I had the feeling that it was a little bit under looped. So not so many people pay attention to this. And I think it is a very crucial piece of regulation because we're talking about interoperability in digital public services. So to start the discussion or to start this, let's talk about what is interoperability. So like the definition, you know, is like the ability of information systems to speak with one another. And now we want to make use of this feature to deliver public services or to make public administrations to deliver public services. And this can have so many examples in practice. One that I can always like to give is that imagine you're going on a road trip from here to Paris and then you go with your car and then you want to park on the street and then you get to this machine and then you want to put the car plate off your car and then it turns out it doesn't recognize your plate because it's from Belgium and then you just end up having a park in another country that you cannot park. So is this, you know, like we're talking about things that affect us all in the EU. I mean, it has to do with freedom of movement, but also with education, with health. There are so many aspects that are so important when we're talking about interoperability and digital public services. And of course, for these to work, we're talking about critical infrastructure. And this is where free software plays a huge role. And that is the reason why we were trying to be active last year and still trying to be active to make decision makers understand the role that these, the free software and open standards come or have in this regulation. I also have to say that, I mean, the commission proposal already acknowledged this. The commission proposal already came with some of these wording. And there were different things that we were trying to get active. So I guess we're going to learn more from this, from Issa, from Calvin, from the commission. But there is like a very interesting inclusion of a governance structure. So we will learn more about it and we were trying to push to be there because that's where, you know, the decisions are going to be made. And so on. And of course, I don't even know what I had this slide, but anyway. And then the second part, then we're going to focus on the European Commission efforts that are happening when it comes to the direction of free software. And for this, then we're going to have a decision with Hays and with his colleague also from the OSPO, from the European Commission OSPO. We're going to discuss all these efforts because there have been some of this coming from like years back, but just bring an example. The open source strategy after that decision that came with it, the code repository that another European Commission is having. So we're going to learn what has been done. But what I want us to keep in mind, and I think along the day I've been seeing and I think it has been a very fruitful discussion on how we can get engaged. And how we can interact with decision makers. But it is also important that we keep an eye on implementation because of course it is important to advocate for free software when it is possible with decision makers. But once the text is done, such as the interoperable Europe Act, we had the opportunity, we did what we could. We would discuss this during the session. Now we have this piece of regulation that we need to make sure that it is implemented. And let's make use of these words that we struggle and we fight so much to put in there. And let's help them as well to figure out how to implement this. So let's look at the future. Let's look at examples. Let's see what has worked well, what hasn't worked well. And let's try to find ways where we can all collaborate and we can make use of all these efforts that are happening at the moment. So, yeah, and I also want, yeah, again, like the structure. So we're going to have this presentation and of course we're going to open the floor for questions, comments and so on. Remember that you have to wait for the microphone for the live stream. And it would be also nice for you if you could introduce yourself, your affiliation and so on. So decision makers also know who you're talking to. Who they're talking to. Yeah, and I think that's all from my side. I really hope we can learn a lot on this conversation. And again, let's try to keep an eye on the future and how we can monitor implementation of all these amazing regulations happening. And yeah, without any further ado, then I'll hand over the floor to Issa from... I'll bring this. Oh, cool. You got it. Okay. I'll do it. No worries. Yeah, from Edgy. Yeah, thank you very much, Lina. Okay, here. And thank you very much. Thanks for the slides. Issa is with us on behalf of... She's one of my colleagues in the Commission, Digit B2, our recent new unit. And Issa is also very much informed on open source. I'm happy that she's here. Thanks. And me too. I'm very happy and thankful to be here because I think it's a great opportunity. I'm also very thankful for Lina introducing the act as something that is important and should get more attention. Because I'm completely unbiased after working four years on this. But of course, I also believe that this is an important piece. But for now, it's a piece of paper. And it will only get to live when... And it will also be on you to make this... fill this with life. And I think there are some opportunities in there. And this is what I'm going to focus on today. And then I'm looking forward to the discussion and to your ideas. And yeah. So let's get started. What's this act about? And I mean, the first very maybe disappointing news for you. It's not an open source law. And it's not nothing that... now we can say, yeah, it's clear that it's not an open source law. But it's something that has actually been discussed. And has actually been discussed from the very early start when we did the impact assessment. Should we have a law that is on GovTech, open GovTech. But then we saw that there was no majority for this path. So now the interoperability policy evolved into the Interoperability Europe Act. And maybe I just quickly why no majority. This actually came from both sides. So from those who are very much fan of open source but said, but if we want to do an open source law, it should be made in a different way and shouldn't build now on this interoperability. Then it's too linked. It's too much in one direction. We should do it properly. Let's not make it from the back door in open source law. That was the fans of open source who were against making this an open source law. And then there were the ones that said, yeah, but we are not there in the public sector. This might hinder the digitalization of the public sector. So this is what we found. We put something in the middle. That's the compromise that we found. This is what I'm going to present now. And what was the main objective of the act? It was to help EU and member states administration to deliver connected digital services to citizens and businesses across Europe. And I think I don't need to tell you that open source can contribute to this. I just strongly believe it. And I also don't need to talk to you about how interoperability is linked with this objective and with open source. What is maybe interesting is that the European interoperability policy has always focused not only on technical interoperability, but has these four dimensions of also legal, semantic and organizational. And that was very much in the negotiations that we've talked a lot about. How can we actually make stronger mechanisms to help also legal and organizational interoperability? And this you might also find back when I now go into the components of the act. Because we say the act is structured around these four components. So a structured and co-owned EU cooperation governance, mandatory interoperability assessments, recognized in your reusable interoperability solutions, and strengthened interoperability support. And now I'm going to tell you more about what this keywords, what we mean with this. So I start with the solutions. One of the mechanisms that will be established is that the EU public administrations in Europe, and they are the ones that are actually the act is addressed to EU administrations from EU level until a member states local level. And all the binding requirements that are in the act go towards these public administrations. They don't go towards private parties. And those ones will decide together on certain solutions, and solutions are in a very broad term. So solutions can even be a framework, can be an architecture, can be guidelines, can also be a solution, can of course be software. Like the definition that is in the act is quite broad, but the public sector will agree in this interoperability governance that there are interoperability solutions that will become interoperable Europe solutions, so that everybody will not be, they don't have to reuse it, but they will at least need to look into the reuse of such solutions. And this I think is an opportunity for the open source community that wants to know where is the EU public sector moving, looking into the interoperability Europe solutions, might help them to see, okay, if this is here in the catalogue, then it's not a software that fits with those interoperability Europe solutions, it might be easier for public administrations to reuse it. And then another thing around the solutions that what we've managed to get into the law is that sharing is a default. And for now, between public administration, sharing is not a default, and proper territory solutions are a default. And if you want to share among the solutions, then the IT people go to the lawyers of this world, and I'm going to talk about one of them so I can talk better about them, and they will ask those people, can we share, there's somebody else who wants to reuse it, and then the lawyers come and say, yeah, but this is problematic, and there might be this and this problem. So because, and we hope that by just putting in a law that normally if somebody asks you, you have to share, then the lawyers already need to argue why can't I share. So we hope that this changing the default might already be a push in the right direction to make public administrations that are, and they also have to be risk, very risk adverse, to be more friendly towards taking this risk and being brave. And what come in the negotiations, also, and that's what Lena also talked about, is that there is also a small thing about priority to open source solutions when you are doing interoperability solutions, and this is something that is now very new in the text, and that public administration will need guidance on how to implement this, and will need help on, and this also I think where the community has an opportunity to help them, guide them on how can we really make this work and put this into practice. As a second component, there is this governance, and there is, the governance is composed, as Alina already said, around the board and the community, and in the community is, we were thinking about you also as part of the community, of course also the community of local public administrations in the board, actually it's only the member states represented, but IT is very, sometimes very federal and some scattered around all public administrations levels, so the community is really about putting many different actors in the field together, and then giving structured input, channeling this input also maybe with digital platforms, digital tools towards the board, to take the decision to have always a sounding mechanism towards the board. Another thing that I haven't seen when we were drafting the law, but when I talked with open source people who said that was interesting for them, was the clear points of contact that with this law, we are having now responsibles on member states level and in EU institutions who will need to implement the law, so when you have anything that is linkable to the law, you know who to reach out to, and that this is something that is actually important for the community. The third component, the mandatory interoperability assessments, I think this links back to the discussion I heard here before on the fishbowl where you said, but it needs the right processes and how do we get into the policy making, and it's very hard to bring the two worlds together actually, the interoperability assessments will be something that when public, in the public sector, is setting any binding requirements for their digital public services in the future, before taking a decision on these binding requirements, they would need to do and publish a mandatory interoperability assessment, and I think this interoperability assessment report will for the community create a lot of transparency and help them actually then also engage in policy discussions afterwards, because the policy text is already translated into concrete requirements, maybe it might not always be technical requirements, because if you look at the law, it might be requirements on business level, and now we are looking into how to really make this work in practice, but I think it's a very interesting tool that can actually help bring, and that's why we put it in the law, because it's about illegal interoperability bringing IT and policy together and helping to have this conversation early in the process, and not when the law is already written and then you sit there with the law with requirements that are actually not implemented in a user-friendly, citizen-friendly, engaging way. So that's definitely something we need to talk about in the community and to develop in the coming months. The fourth component is what is also in the law, it's around GovTech cooperation, there is innovation projects, and it's the first time that there's a definition around GovTech also in the law, so it's just that the recognition that this is important, even if there is no strong legal requirements around it, I think this is an important step because it helps to argue that this is important, if you say, but because it's an allowance, it's important. Then we have a spot on regulatory sandboxes, and here also for me, they are an enabler for legal interoperability, and they have to involve also the GovTech, but create a way when you do innovative stuff to have the York legal questions channeled towards the EU level and have the EU regulators already pre-discuss on how can we tackle this, how can we have this innovation also rolled out in a legally safe environment. Last point is upskilling of the public sector, and this might also sound a bit vague and lame that is in the act, because it's nothing, it's not something you go and the police can come and say, you have to upskill, but I think if there is, it's mentioned in the act that the public administrations have to skill their stuff on interoperability, and I think you can't skill them on interoperability without talking about openness, without talking about open code. So I think a lot of these trainings will be created in the next coming years, so together then look into what can we push, what messages, and then you can talk in the public sector to people who know about open source and are not afraid of open source. So this is an opportunity for me. And one opportunity that I don't have on the slide is that this is going to be evaluated, so everything that is not in today, now if we talk, get the right questions in this evaluation, then it might be that in the next version this can also get into the text, so this is the entry model and we will see where this takes us, but this also needs early involvement, this needs, actually, that the community reaches out as soon as possible and say, this is our ideas for where it should go, and this is the question we should ask. So I'm very, as I say, now maybe you understand why I think it's very relevant that we are here and that we start this conversation and looking forward to the discussion. Thank you. .
Public services interoperability: workshop Interoperable Europe Act
Yeah, so about the topic. So this is a, the session is on digital public services and interoperability. There will be a session more focused on, like that is not the digital public services later on. And because we're talking about this, we're going to have two parts. One is a focus on the Interoperable Europe Act to serve as an example of what's happening at this moment when it comes to digital public services. And this is an act that, you know, happened last year. And I personally have to say that I had the feeling that it was a little bit underlooked. So not so many people pay attention to this. And I think it is a very crucial piece of regulation because we're talking about interoperability in digital public services. So to start the discussion, or to start this, let's talk about what is interoperability. So like the definition, you know, is like the ability of information systems to speak with one another. And now we want to make use of this feature to deliver public services or to make public administrations to deliver public services. And this can have so many examples in practice. One that I can always like to give is that imagine you're going on a road trip from here to Paris and then you go with your car and then you want to park on the street and then you get to this machine and then you want to put the car plate off your car and then it turns out it doesn't recognize your plate because it's from Belgium and then you just end up having a park in another country that you can't park. So is this, you know, like we're talking about things that affect us all in the EU. I mean, it has to do with freedom of movement, but also with education, with health. There are so many aspects that are so important when we're talking about interoperability and digital public services. And of course, for these to work, we're talking about critical infrastructure. And this is where FreeSoft plays a huge role. And that is the reason why we were trying to be active last year and still trying to be active to make decision makers understand the role that these, that the FreeSoft and OpenSunders come or have in this regulation. I also have to say that, I mean, the commission proposal already acknowledged this. The commission proposal already came with some of this wording. And there were different things that we were trying to get active. So I guess we're going to learn more from this, from Issa, from Calvin, from the commission. But there is like a very interesting inclusion of a governance structure. So we will learn more about it. And we were trying to push to be there because that's where, you know, the decisions are going to be made and so on. And of course, I don't even know what I had this slide, but anyway. And then the second part, then we're going to focus on the European Commission efforts that are happening when it comes to the direction of FreeSoft. And for this, then we're going to have a discussion with Heis and with his colleague also from the OSPO, from the European Commission, OSPO. We're going to discuss all these efforts because there have been some of this coming from like years back, but just bring an example, the open source strategy. After that, the decision that came with it, the code repository that another European Commission is having. So we're going to learn what has been done. But what I want us to keep in mind, and I think along the day I've been seeing, and I think it has been a very fruitful discussion on how we can get engaged and how we can interact with decision makers. But it is also important that we keep an eye on implementation because of course it is important to advocate for FreeSoft for when it is possible with decision makers. But once the text is done, such as the Interoperable Europe Act, we had the opportunity, we did what we could, we would discuss this during the session. Now we have this piece of regulation that we need to make sure that it is implemented. And let's make use of these words that we struggle and we fight so much to put in there. And let's help them as well to figure out how to implement this. So let's look at the future. Let's look at examples. Let's see what has worked well, what hasn't worked well. And let's try to find ways where we can all collaborate and we can make use of all these efforts that are happening at the moment. So yeah, and I also want, yeah, again, like the structure. So we're going to have this presentation. And of course we're going to open the floor for questions, comments and so on. Remember that you have to wait for the microphone for the live stream. And it would be also nice for you if you could introduce yourself, your affiliation and so on. So decision makers also know who you're talking to, who they're talking to. And I think that's all from my side. I really hope we can learn a lot on this conversation. And again, let's try to keep an eye on the future and how we can monitor implementation of all these amazing revelations happening. And yeah, without any further ado, then I'll hand over the floor to Issa from... I'll bring this... Oh, cool. You got it? Okay. Yeah, from the... Yeah, from the... Yeah, thank you very much, Lina. Okay, here. And thank you very much. Thanks for the slides. I'm very happy that Issa is with us on behalf of... She's one of my colleagues in the Commission, Digit B2, our recent new unit. And Issa is also very much informed on open source. I'm happy that she's here. Thanks. And me too. I'm very happy and thankful to be here because I think it's a great opportunity. Also very thankful for Lina introducing the act as something that is important and should get more attention because I'm completely unbiased after working four years on this. But of course, I also believe that this is an important piece and that... But for now, it's a piece of paper and it will only get to live when... And this... It will also be on you to make this... Fill this with life. And I think there are some opportunities in there and this is what I'm going to focus on today. And then I'm looking forward to the discussion and to your ideas. And yeah. So let's get started. What's this act about? And I mean, the first very maybe disappointing news for you, it's not an open source law. And it's not... Nothing that... Now we can say, yeah, haha, it's clear that it's not an open source law, but it's something that has actually been discussed and has actually been discussed from often very early start when we did the impact assessment, should we have a law that is on GovTech, the open GovTech. But then we saw that there was no majority for this path. So now the Interoperability Policy evolved into the Interoperability Europe Act. And maybe I just quickly why no majority. This actually came from both sides. So from those who are very much fan of open source but said, but if we want to do an open source law, it should be made in a different way and shouldn't build now on this interoperability. Then it's too linked. It's too much in one direction. We should do it properly. So let's not make it... From the back door in open source law, that was the fans of open source who were against making this an open source law. And then there were the ones that said, yeah, but we are not there in the public sector. We might... This might hinder than digitalization of the public sector. So a bit... This is what we found. We put something in the middle. That's the compromise that we found. This is what I'm going to present now. And what was the main objective of the act? It was to help EU and member states administration to deliver connected digital services to citizens and businesses across Europe. And I think I don't need to tell you that open source can contribute to this. I just strongly believe it. And I also don't need to talk to you about how interoperability is linked with this objective and with open source. What is maybe interesting is that the European interoperability policy has always focused not only on technical interoperability, but has this four dimensions of also legal, semantic and organizational. And that was very much in the negotiations that we've talked a lot about. How can we actually make stronger mechanisms to help also legal and organizational interoperability? And this you might also find back when I now go into the components of the act. Because we say the act is structured around these four components. So a structured and co-owned EU cooperation governance, mandatory interoperability assessments, recognize your reusable interoperability solutions and strengthen interoperability support. And now I'm going to tell you more about what this keywords, what we mean with this. So I start with the solutions. One of the mechanisms that will be established is that the EU public administrations in Europe, and they are the ones that are actually the act is addressed to EU administrations from EU level until a member states local level. And it's all the binding requirements that are in the act go towards these public administrations. They don't go towards private parties. And those ones will decide together on certain solutions. And solutions is in a very broad term. So solutions can even be a framework, can be an architecture, can be guidelines, can also be a solution, can of course be software. Like the definition that is in the act is quite broad. But the public sector will agree in this interoperability governance on some solutions that will become interoperable Europe solutions so that everybody will not be, they don't have to reuse it, but they will at least need to look into the reuse of such solutions. And this I think is an opportunity for the open source community that wants to know where is the EU public sector moving. Looking into the interoperability Europe solutions might help them to see, okay, if this is here in the catalogue, this might be something when I built my software that it fits with those interoperability Europe solutions, that might be easier for public administrations to reuse it. And then another thing around the solutions that what we've managed to get into the law is that sharing is a default. And for now, between public administration, sharing is not a default and proper territory solutions are a default. And if you want to share among the solutions, then the IT people go to the lawyers of this world and I'm one of them so I can talk better about them. And they will ask those people, yeah, can we please, can we share? There's somebody else wants to reuse it. And then the lawyers come and say, yeah, but this is problematic and there might be this and this problem. So because, and we hope that by just putting in a law that normally if somebody asks you, you have to share, then the lawyers already need to argue why can't I share. So we hope that this changing the default might already be a push in the right direction to make public administrations that are, and they also have to be risk, very risk adverse to be more friendly towards taking this risk and being brave. And what come in the negotiations, also, and that's what Lena also talked about, is that there is also a small thing about priority to open source solutions when you are doing interoperability solutions. And this is something that is now very new in the text and that public administration will need guidance on how to implement this and will need help on. And this also, I think, where the community has an opportunity to help them, guide them on how can we really make this work and put this into practice. As a second component, there is this governance and there is, the governance is composed, as Alina already said, around the board and the community. And in the community is, we were thinking about you, also as part of the community, of course, also the community of local public administration in the board. Actually, it's only the member states represented, but IT is very, sometimes very federal and some scattered around all public administration levels. So the community is really about putting many different actors in the field together and then giving structured input, and channeling this input also maybe with digital platforms, digital tools towards the board to take the decision to have always a sounding mechanism towards the board. And another thing that I haven't seen when we were drafting the law, but when I talked with open source people who said that was interesting for them, was the clear points of contact that with this law, we are having now responsibles on member states level and in EU institutions who will need to implement the law. So when you have anything that is linkable to the law, you know who to reach out to and that this is something that is actually important for the community. The third component, the mandatory interoperability assessments. I think this links back to the discussion I heard here before on the fishbowl where you said, but it needs the right processes and how do we get into the policy making. And it's very hard to bring the two worlds together actually. The interoperability assessments will be something that when public, when the public sector is setting any binding requirements for their digital public services in the future, before taking a decision on these binding requirements, they would need to do and publish a mandatory interoperability assessment. And I think this interoperability assessment report will for the community create a lot of transparency and help them actually then also engage in policy discussions afterwards because the policy text is already translated into concrete requirements. Maybe it might not always be technical requirements because if you look at the law, it might be requirements on business level. And now we are looking into how to really make this work and practice. I think it's a very interesting tool that can actually help bring and that's why we put it in the law because it's about illegal interoperability, bringing IT and policy together and helping to have this conversation early in the process. And not when the law is already written and then you sit there with the law with requirements that are actually not implemented in a user-friendly, citizen-friendly, engaging way. So that's definitely something we need to talk about in the community and to develop in the coming months. The fourth component is what is also in the law. It's around GovTech cooperation. There is innovation projects and it's the first time that there's a definition around GovTech also in the law. So just that the recognition that this is important, even if there is no strong legal requirements around it, I think this is an important step because it helps to argue that this is important if you say it, because it's an law. Then we have a spot on regulatory sandboxes and here also for me that they are an enabler for legal interoperability and they have to involve also the GovTech and have, but create a way when you do innovative stuff to have the York legal questions channeled towards the EU level and have the EU regulators already predescuss on how can we tackle this, how can we have this innovation also rolled out in a legally safe way. And last point is upscaling of the public sector and this might also sound a bit vague and lame that is in the act because it's nothing, it's not something you go and the police can come and say you have to upskill. But I think if there is, it's mentioned in the act that there is the public administration have to skill their stuff on interoperability and I think you can't skill them on interoperability without talking about openness, without talking about open code. So I think that a lot of these trainings will be created in the next coming years so together then look into what can we push, what messages and then you can talk in the public sector to people who know about open source and are not afraid of open source. So this is an opportunity for me and one opportunity that I don't have on the slide is that this is going to be evaluated. So everything that is not in today, now if we talk, get the right questions in this evaluation then it might be that in the next version this can also get into the text. So this is the entry model and we will see where this takes us but this also needs early involvement, this needs actually, yeah, that the community reaches out as soon as possible and say this is our ideas for where it should go and this is the question we should ask. So I'm very, as I say, now maybe you understand why I think it's very relevant that we are here and that we start this conversation and looking forward to the discussion. Thank you. Thank you.
Public services interoperability: Open Source efforts in and around the European Commission; and how about a next EC open source strategy
Let's start our second session on the, let's now focus even more on the European Commission and its efforts on the direction of open source. So we have two experts from the OSPO here to tell us a little bit what has been done. And again, let's try to remember to look up the future. Let's see how everything has been done, what we can learn from it. And let's also try to bring up new ideas on how we can make this work. So, all yours. Thank you very much. We also have stickers. I think that's important. Please pick them as you leave the room when you walk away thinking, oh, this is not going where we wanted to go. There's a flyer. And outside you already saw we have a roll-up to make a bit of advertisement for the OSPO. So Gai Selenius, I'm Dutch. Sir Anjith Arora, I'm confused. Indian originally, but have been in the UK for 40 years and Belgium with the last six. And the two of us will try to do a stand-up comedy show that will tell you a little bit and bring you up to speed on what happens in open sourcing and around the commission. We've been here already for quite a few hours and you've heard a lot about the policy developments. The OSPO was started three years ago to remove legal and organizational barriers so that the commission could become faster at doing and sharing open source. And what this first slide, our staircase diagram, should show you is that we've been in open source forever. He's right behind the camera now. He's one of those who worked at the commission and kick-started what now is the Apache project. That's before this timeline started. And the commission has always used open source in the infrastructure right from the start. And I spoke to the people who installed the first lamp stack saying, OK, so that's where we kept notes on which server was doing what and what the passwords were and who was doing things and so we could transfer it when we rotate it around the organization. And pretty soon other DGs, because Srinjeet and I are talking from the Digit, the Directorate General for Informatics that's doing like an internal service provider, pretty soon other services, the ones who are doing policy or putting things in place, were calling Digit saying, I need a server for, I have a pesticide database system, it ran on a lamp stack. And so the second layer, the use thing, that really exploded quite fast. And so in our data centers, I think the numbers are 60, 70 percent these days are Linux machines, replacing other Unix types by the way. And then around 2003, 2006, in our Directorate General people started realizing we need to work at the member state because this stuff is reusable, it's going to save us pots and pots of money. And the interoperability program was started that led to the interoperable act. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you very much. No matter how much I say this. Your time is above. Thank you. Thank you very much. Thank you. Thank you very much. Thank you. Thank you. Thank you. Anna. I've been trying to observe this change in the Commission ever since 2007 when I started writing for the European Commission on the open source observatory. And so I've seen it going on this path. And I helped the Commission ask if I could help revamp the open source strategy. In the year 2000 and 20, we announced an open source strategy that came with 10 action plans. They are in the strategy, the strategy is open, it's out there. Except that we never gave the details that was far too much. The shopping list was like this. And first being recursive, start yourself, set up an open source program office. That was really quite easy. But the two big barriers we had was that the Commission had a big problem in making its software available as open source. Because we had a red tape process that would take six months of just a project person chasing signatures to get permission to make the software available as open source. In the room we have one of the IP specialists who helped us a great deal with removing this barrier. We now have, the default is if you want to share your software, it is open source. If you want to share this proprietary, you go into this paperwork process that will take you six months, which means no project will do this. Because we don't have time. No development project has a project developer that can spend 10 minutes every Friday afternoon chasing signatures. Now when the project team says this is a useful software solution, it shares its open source. That came with code.jropa.tu because we needed a place to share our software repository, share our software. And the other big thing is that we moved our internal teams to sharing software amongst themselves. And finally when you created a project on your internal repository, it was closed to you and your team. Which means that I as an external, not in that team, could not see what the code was. I didn't have access, only saw the name and you could find with a bit of luck the project owner and you could ask for permission to see the code. We've changed this default and we already see the rewards because teams are now starting to reuse each other's work. Not so much literally the code, but they're starting to build on each other's preparations. Open sourcing, I managed it. I talked about it already. We have 400 projects, 2000 users. It's growing. You're all welcome at code.jropa.tu. It's GitLab based. You need to log in, but that's just my small step. The legal barrier was removed. I talked about that. We have labs, an important way to onboard new open source tools to our users. So people from other services can come to us and say, can you, you know, we would like to experiment with HedgeDoc for example. Well, now we can do that. Here are the ones we're currently running. Ythsi, OpenTalk, HedgeDoc, CripPad, Discourse. And we're always looking for others. Outreach is what we're doing here and in other places. We successfully integrated open source in our IT governance projects, which means that when a new project is proposed, we are asking questions about open source. So we are forcing projects to from the start compare and look on the market to see if there's an open source alternative for their proprietary product. And if they are willing to share whatever they're doing as code. And I will hand over to Srinjeet for the important pilot and preparatory action. Thank you. So I joined, as I said, six years ago on this project, EU FOSSA 2. So we have this cycle. We may start some initiative. The European Parliament gives us some money and we start what's called a pilot project. If that's successful, we move on to what's called a preparatory action project. So hence the number two after the initiative. And in the EU FOSSA 2 initiative, we actually created an inventory of our open source. We didn't have one. Not ashamed of that because I don't think many organizations have one even today. And it's a fast moving, changing scenario. And then from that software inventory, we were able to identify which was our most important software. And then we decided to run back bounties and hackathons to improve the security. So a lot of things were done under EU FOSSA, many for the first time like back bounties and hackathons. And then we thought, why are we not doing this on a European scale? Really, we need to cooperate on open source and bridge these islands that exist. And actually it's the right time because open source has matured to a very high degree in many member states. So it makes sense to cooperate and on a number of areas. So basically in terms of specific initiatives, so FOSSA is about FOSSA for European Public Services. And it's about cooperation. Now breaking that down into specific work packages, one of them is to create a catalog of solutions. So the French government, the Italian government, all member states have built wonderful open source systems, solutions. Why are we not reusing them? Often we are not reusing them in our own countries, let alone across Europe. So giving visibility to solutions built by and for public administrations is a wonderful idea. And we've had great successes in national catalogs and regional catalogs, saving lots of money, time and increasing interoperability. So FOSSA is going to, we already have an MVP and we're going to expand it to many more member states. And hopefully we'll have a rich catalog which will save time and money. Can I steal back? Yeah, please. Yeah, because we're being flagged also for going over time in our own session. I just wanted to round off with one thing. Is that we are trying to prove that we can reuse software. So the catalog system is actually thanks to the Italian government team digitalicus. We're starting to reuse their tool and the standard public auto jamal is also developed by them and others. And this is also something we're implementing. And there's the author. There we go. So thanks for that. And I think you should deserve the credit for this. And we are now in the process of launching our internal efforts to revamp the strategy once more. Because the staircase diagram, the first one, those were the internal strategies. The last one, the most recent one was an external communicated internal strategy. And we're going into this process again. And so the next round here is where we would like to discuss with you, whether we would like to have ideas from you, we would like to know what else can we do as an Ospo. And I think the whole day so far has given us a lot of stuff to work with. But that's what we'll do in the next session.
Public services interoperability: workshop Open Source strategy at the European Commission
Okay, yeah, perfect. Thank you very much for the time and for the insights here. So yeah, I mean, I guess some of you were already familiar with the strategy and so on. But now that we know what it has been done, then I guess you guys have some questions, comments, and just to keep it like a custom practice now, I would like to kick it off with a question, actually. And I mean, since I read the strategy, I was kind of like struggling with this whole thing of inner-source. So I really wonder if you, now that the strategy is almost over and then you're kind of like renew it. If you think this is the direction to go, or like, you know, first of all, what do you guys mean with inner-source to put everybody on the same page? And also if this is actually the direction that the commission should be going or if you're thinking about something else. Sorry, forget the very important microphone. Inner-source is really maybe not the best label for this thing. But the goal was to make our software development, the commission has like 6,000 software developers working for it at any given time, doing all kinds of projects. And some of them were in the room. And what we wanted to achieve is that on the road to making the whole thing as open-source as possible, we needed the first step. We needed to have the existing projects of which there are many realized that they needed to get ready for reuse. And you can't go to a project that's been running for 15 years or for five years and say, oh yeah, by the way, we're going to make your stuff available next year. Because these guys are like, whoa, but it's full of passwords. It's full of internal references to machines and stuff. So the code has to be matured. And so we've gone through bug bounties and hackathons internally to make sure that projects were ready to move from being, you know, in the basement with just these five people that have been working on this for 10 years to being shared with the other colleagues, with the other professionals whom we can rely on not to abuse the information that is often in these systems. And the other thing that we're doing is that we're only, we're making it easy for projects to go open source as easy as we can. But we're not forcing them. If there are reasons within the team, within the DG, within the service to say, this is so niche, why should it be open source? Or this is so secretive, why should it go open source? We're not going to force them. But we already see a lot of projects going open source because most of the developers, this is where they want them to go. And they'll talk to their project managers to say, we should go from inside, we should go to Code.org Europa. So the numbers that were on the slides are very good and give us another decade. Maybe that's too slow. It might go faster. All right. Yeah. No. Hi there. My name is Romer Adams. I'm working at Red Hat. Building actually a solution for the public sector. Two quick comments. First of all, great that you discovered that there's our responsibilities with CRD, SCRA and PLD from a code perspective. So it's good. The second part is that in a source for taking this into consideration is one of the best thing you can do when you are in a very disconnected environment and you cannot share anything with outside from a security perspective because you can apply all the construct of open source to build software. And at some point when you clean everything up, you can share that with the other member states. The last part is more a question there. When we're thinking about the open source catalog that you have built, I guess we're speaking also about blueprints. Will there be any incentive for member states to get to those blueprints as a first class citizen in tenders versus trying to build something from scratch? Hot potato. Thank you. Good question. So I think the idea of reuse is warmly welcomed by almost everyone. Why would you build something twice if it's already working in municipality A? Unless it's not good enough. You could amend it, you could improve it and that's the idea of an open source solution. Alluding to your question about making it mandatory or incentivizing, it's not for the commission to do that. I think open source is by its very nature a very cooperative, open culture. So we are hoping that procurement will come on board and say, why do you want to build this? Show me that this doesn't already exist in the European catalog. So I think it'll happen eventually because it just makes financial sense. So I think that's where the solutions will become blueprints. Hi, my name is Dali Boran. I work for the commission services. So I'm a user, but I didn't work there for all my life. So I also worked as a developer in the industry in the education and coming back after several years, I saw a great progress in open source software on my computer. So thank you for that. On the other hand, as a practical user, I have problems with it because I've seen things moving in another direction. We are forced now to use the Office 365, as you know, then TMS teams for also internal and external communication, not to speak about the applications for some jobs which you can fill in the form only using Microsoft Windows or some testing for in Epsoc, which can also be done on Microsoft Windows computers. So I cannot do anything about it. I mean, I've been trying using the available open source software as much as possible, a LibreOffice, and even some other solutions. But I think you might. So I urge you, at least if not to look for some long term or radical solutions, which I would, at least not to allow this regression of the current situation, which I see coming in six months to one year. Well, depends. I mean, internally, what do you use for your team communication? Please answer that question. You need the microphone as well. We use everything, including the tools that come with the machine and when we switch it on. But we also use all the alternatives. And if you were a software developer for the commission, well, if you were, then the good news is you can have a Linux laptop at your disposal and running Ubuntu mostly. And this, yeah, we're, you know, we see that pain point, but I can already tell you that the OSPO is not there to make that switch. What we can do is translate the demand from our users, which includes you to say, look, there is a need for an alternative. And there's many teams that interact with the outside world that cannot use MS teams. There are many teams and projects of people working on legislation, in fact, they're asking us, do you have anything because I cannot use Office 3, whatever? It won't work. And so with them, we're trying to figure out what's the next thing we put in our lab and how can we make it accessible to legislators in the member states so that they can work on the next generation of laws. So we're trying to find ways to do this. I wish we could show a lot of progress. I think it'll take a bit of time because, yeah, it's in place. Yeah. So I think the first thing to say is that we absolutely acknowledge what you're saying. That's the truth. Right? Now, the question is how do we solve it? It's a collective we as crisis outlined as the Ospo we are doing things like using some of the labs and some of the other mechanisms, right? We're pushing open source. So there's another movement, right? Open source is expanding throughout the commission and its awareness is benefits. And yet there is an entrenched Microsoft culture. So I think we it'll just it's a question of time and it's a question of continuing to pass on the messages that you're mentioning. And we do that as the Ospo. So I think that's the answer. Sorry. It has to change. I agree. The best moment to start that change is now and we're trying to bring those experiments. Exactly. We're doing something what we can. I'm sorry. I would say let's go for a question and then maybe you can. So well, thank you. But unfortunately, regression is going the other direction. And unfortunately, it consumes more work time. It consumes more money, bandwidth, energy, so it's not compatible with Green Deal. And also, please think about this. If we would have, you know, an open source working environment for all the administration in Europe and for the politicians, maybe they would think differently and then support the open source more. Okay. I agree. And then I take command from that again. No, no. I always say it's a question here. Okay. Yeah. My name's Ingen and I'm just a citizen. I was trying. I'm trying to put this question somewhere. I'm not sure whether it fits. But why we were busy with CRA, another regulation sailed by which has implemented, which implicates maybe problem for open source, which is the European ID. So A does. And how can we assure that this European ID will work on open source systems of all kinds? So on free smart, free androids on Sailfish, on Linux smartphones, on Linux desktops and everywhere. And how can we ensure that the ID application itself is open source, which I think is mandatory for a fundamental thing like this? That's not about. You want to leave the room now? No. I can get part of the answer. And I'm going to look in a few other things, if you don't mind. Because there's so much going on. And as I was trying to say, there are good developments happening in Europe. And I want to first acknowledge there's a circle on this diagram, and I put it back online. These are the OSPOs in the member states. And four of them are in the room. Five of them a few minutes ago. And we're working with these peers to figure out, A, the citizen interaction. I think that's a, it's clear that we need to, this is a goal for us, a task for us that we didn't identify three, five years ago. And about the commission is doing a lot of open source reference implementations. So when we, when there is a big regulation like this, like A does, for example, there is an open source reference implementation. I do not know if that one would immediately run on your mobile phone, but it will run on a Linux machine. That's what it was made on. It was made for it. And this is also how we help the member states test their implementations because they can either use the reference implementation and install that, or they can test theirs against our reference implementation. And we do this with many other tools. So you'd be surprised to find commission, commissioned, sorry for the strange formulation there, but software built for the commission, either by software developers working directly for the commission or through companies that are in industry everywhere. If you do an import, export decoration these days, you're touching software that is made available as open source and built through the European commission. If you're in the steel industry, you're most likely doing things that is done with open source commissioned by the commission. If you're doing e signatures, it's most likely using open source libraries that were developed as open source through the commission. So this is one of the good things that is happening. And then I would just like to point out that the OSPOS are doing good stuff in the member states too, and we're trying to build a network that reinforces itself. So let's take two more questions because we're running out of time. Can we get the OSPOS to stand up just so we know who they are? Thank you for the great talk. And we spoke a lot about infrastructure and tools, which are very important. But is there a European wide plan to somehow replace Mastercard and Visa? These are essentially payment clearing systems, and these are leaking hundreds of billions of euros worth of economic value out of the thing. If there could be an open source trustworthy solution built into it, which every bank issues to retail people, that would be great. Wonderful. I think that's the answer. We're not the right people. And these are very large initiatives. There are many, many large things that could be done. We are working at the grassroots level in terms of open source, propagation of open source internally and connecting with OSPOS externally, and doing projects like FOSCEPS. And check out the NGI funding framework because there are super interesting projects on this topic too. Okay, so one last question. Thank you. Hi, Paolo Vecchi, board member of the Document Foundation. Well, I would like to say thank you guys, or the OSPOS team, or the European Commission, which is now 150 people. I think most of it is in this room. And, well, a handful of it, because I actually seen the progression of what has been going on in, well, lately, and you did a very good job. In a way, you said that, I think you commented that probably is not your job yet, maybe, also to promote some open source platform within, maybe the European Commission, or just part of the way the users will see mostly. So we have an example of LibreOffice, or maybe also Linus Texop, or something like that. I suppose that that probably is going to become one of your tasks later on, maybe when you're going to be more structured. But in a way, I hope that we're going to get to a point where the European Commission actually is going to be the example for the rest of Europe, where an organization that manages enough people to fill a small town will show how to do things, how to implement open source, so that for a small town to nations, actually going to be able to do it quite easily. And another thing regarding, well, something that I'm very biased about, LibreOffice, there's been a bit of an effort, a lot of effort in trying to get LibreOffice updated in the Catalog application, the European Commission. It is there, at least, so in theory, it should be able to install that quite easily. And it would be nice to have more feedback and see how many other open source applications you would like to see on your desktop, so that the rest of the community can say, there guys at the European Commission, can you please add this one, because it actually makes sense and we help other people switch into open source. That is a collective effort. Thanks. Okay, perfect. So any comments on there? Should we wrap up? Yeah, we can wrap up. Okay. I'd just like to tell you that LibreOffice is on the Commission's laptop, so... Yeah, I think the general answer is the small army that can fill a town is already using open source, right? That's why open source is on our list of software to use, officially. So it's a lot more usage than we... that might be visible, right? But sometimes the external tools like teams and things and word... We have to connect with everybody in the Commission, right? So it's increasing. Thank you. Okay, perfect. I'm sorry, we need the microphone. Okay, so now we're going to have a wrap up from the repertoire on the inside's inputs that we got. Thank you very much, everybody. But let's take around five minutes. Only five minutes and then you can go.
Public services interoperability: rapporteur playback
Thank you. Okay, perfect. I'm sorry we need the microphone. Okay, so now we're going to have a wrap up from the repertoire on the inside inputs that we got. Thank you very much, everybody. But let's take around five minutes. Only five minutes, and then you can go. That's all right. Okay. Oh, yeah. I'm sorry. I'm sorry. I'm sorry. Okay. All yours. I'll do a gesture thingy. Hello. Okay. So for the few of you that might still have forgotten about what we heard about in the last two hours, I'll just do a quick wrap up. But first of all, thank you all for putting forward so many questions. We're very happy to see that many people be interested by public sector and public sector. And having so much interest in what our speakers presented, I think maybe it's all right. I think maybe some of the most interesting comments were first on the presentation of Issa before that related to clearly the question of the implementation of this act. And then we'll have to go back to the first question. Clearly the question of the implementation of this act and how to be part of this question of the interprobability assessment. Now it was also made quite clear that this legislation is not an OS law and that there were many pushbacks again having this made as an OS law instead of having it just as an interprobability Europe act. A lot of you also raised question of the open documents, of the open format and how this will be integrated into this regulation. I think that's what's very nice to hear. And on the question of standardization there as well, it's been quite clear that the board would be one of the main actors in the implementation of those. As for the second presentation and the question of the OS, the question of the OS strategy, you raised question about the inner source, how to move away from inner source. I think it was also quite clear that it was the first step and that it was really helpful for you. And on the question of incentivizing or having mandatory use of open source and what was the strategy used which was clearly trying to understand how open source works, why people adopted and trying to avoid having counter effect by making it too mandatory or strong. Yeah, and maybe just to finish, I think one last one. Oh, yeah, thank you. One last one was really interesting also on the question of the EIDES. Thank you for bringing that up and other regulation like this. So we talked about very specific policy papers or regulations today, but there's a lot of regulation at the U-Level that are constant by open source. And I think it was really interesting to learn about how actually it can be shaped up by this reference implementation and so on. So, yeah, thank you all. I will finish to write a better report about that later on. And good luck to those who stay. We will start, I think, in a few minutes. So, yeah, thank you. Thank you.
Digital Services Interoperability: Intertwining EU telecom law, the DMA, internet devices and Free Software
Thank you. Thank you. My name is Nico Riecke and I'll give it to Lucas to start it off. Hello everybody and welcome. I'm very glad to be here. Thank you for the organizers to invite us to this talk. I'm very happy to see that there are a lot of interest to DMA because we have been working already on this for some time and I would like to contestalize all the problems that we already started hearing here about interoperability, about security, about having access to infrastructure from the telecom perspective because together with Nico we would like to prompt you with our experience on advocacy on routers. And I think that contestalizing this example on routers and router freedom in Europe can help us to understand a little bit better what expects us when we start dealing with smartphones from the DMA perspective. So let's talk about end user control of devices, this contestalization about the DMA and telecom and then I will give the word to Nico so he can tell us our experience with router freedom. So the first question I would like to ask us, do we have control of our devices? Devices are becoming ubiquitous, we are using that for everything in life but I have the feeling that we are losing control over our devices. We cannot change the battery, we cannot uninstall the programs, we cannot even install programs. Today people call it side loading but in laptop we cannot, we don't call that side loading because we are just downloading and installing. But well big tech now says to us that if we wanted to install something outside Dev Store we needed to side load. This is not good for software freedom but let's talk about that. So I think that we are losing a little bit control of our devices and here are some key aspects of gatekeeper control of our devices. They are imposing online accounts us so if we wanted to use our devices they say first you need to create an account with me. And then I think when I bought my Android phone I was prompt, the first thing that I showed in the screen was you need to create an online account. Then when we use our smartphones we are already trapped into vendor lock-in because we have no access to third parties repositories and app stores. And this is really key because on these repositories is where we can find apps and we can exercise our software freedom in order to populate our devices with our software. And last but not least we are not free even to uninstall software that comes to our devices. And we check that sometimes on drive devices or iOS devices there are a list of apps there that is draining our battery and they are doing stuff because it's proprietary but we don't have access to the source code so we have absolutely no clue what is happening there. So based on these facts I think that we are losing control of our devices. And therefore some questions that I would like to prompt the audience and perhaps in the coming moments of the workshop we need to answer is that how we can re-empower users to have control of our devices. So we already heard that DMA is a very important piece of that and I believe so I think DMA is crucial but we need to go further. First, we need to recognize that devices or ecosystems are mostly proprietary. So the largest two operating systems smartphones they are proprietary, Android and iOS they are proprietary. And therefore being that and since they are so large we are calling the gatekeepers due to the monopolistic power over termination bottlenecks. So basically everything that we needed to do with this device we need to go through this company. That's why we are calling a monopoly over the term features of these devices such as one example operating systems, browsers and app stores. And of course we hear them since they have this power of devices they can hinder interoperability and they exercise type controls of APIs, apply proprietary standards that we heard today, hampering functionalism, block access to drives and hardware. So in the FSFE, the Free Software Foundation Europe we have been working on a concept that we call the device neutrality. And this we want to re-empower users and to give control over devices back to them by software freedom, eliminating vendor lock-in and giving end users control over data. And last but not least I would just quickly because I'm a lawyer and I would like to point to what is happening nowadays in the EU, right? So for 10 years we have been working on the open internet regulation also called the net neutrality regulation. And this regulation had very clear rules over internet access devices. So it applies to routers, modens and other internet access devices. Then in 2018 there was a reform of the telecom law in Europe called the European Electronic Communications Code. And it then implemented some rules over iOS, over operating systems and apps and also to network operators. But now comes the DMA with rules over devices and operating systems and apps. And in order to contest to allies all these challenges I would like then to give the word now to Nico, I'm sorry, so we can learn a little bit how DMA can connect to that from the perspective of routers. Okay, as a case study, yeah, yeah. Routers of freedom, well, my wife and I were really excited. We got our first house, we were moving in together so we had to prepare for the move and one of the things we had to do was get an internet contract. And we didn't think much of it, we were doing some comparisons and said okay we got this contract. It was some all in one provider and you could recognize it from the box, it was an all in one box, it was a TV, telephony and internet. But also outside of being a box that did everything, it was a modem, a router and it did so badly. It filled quite a bit of times at a certain point, it filled entirely and we had to wait three days without any of the services to get a replacement. But after another failure getting dropped out of an internet call I said okay this is it, I'm going to get a router myself so I know I can trust it and it's reliable. But I found out that this internet provider didn't really support that. It was all to me because previously it was on a telephone network connection and they were even advertising that you could use your own router and modem and some of my friends were doing so. Also if you would call them for support they say are you using your own modem? They would just assume there was something you could do but not with this provider. So that was all then I learned about router freedom and of course there's a lot of benefits here on the slide. Some personal but also some in the grander scheme of things like competition and creating a healthy ecosystem of devices. Now internet providers put up quite some barriers to prevent you from exercising router freedom, the ability to choose your own modem and router. Some of them actually have some technical merit. For example the telephone network is laid out differently than a coaxial network. If your modems are doing bad things on a television network, the coax network, as the lines are shared you might interfere with the devices of your neighbors. But that of course is why we have standards. If your device is compliant and the standard you can get one from the store, plug it in and it will work and there's no reason to deny you those freedoms. Actually one of my FSV friends said oh it works just fine. I have these and these devices running at some friends. I could really recommend to use your own router and get the freedom you want. So it wasn't really a technical barrier. Now at the FSV we've been at this for 10 years as Lukas said and we keep an interactive map keeping all the states and working with regulators to ensure that router freedom is actually achieved. One thing in 2015 is that we had the EU net neutrality act and it was like yeah, Inno says users should be able to replace or change their device. So you think okay router freedom everything is good but not unless the regulators regulate. That's the main thing and that's why we have this map to keep up with the regulator and the state. Now about two years ago in Belgium there was a consultation about modemvrijheid, a freie modemköser, basically the ability to choose your own modem. There was a consultation and I saw this like okay we have to get in on this even though I'm from the Netherlands, I care about this, I can do in Dutch. We got a different volunteer, some from Belgium, and together with Lukas we responded. We were quite alone initially not having other parties that had the technical knowledge to go through this legislation and have the community behind them. But eventually through a survey we were able to actually engage the crowd and come over the argument and we achieved router freedom. So shortly here are some examples of people in our community using their routers at home to also establish the practice of router freedom. There's also benefit of free software on routers but that's something else. And myself I now happily using fiber with my own router and Lukas if you want to wrap it up. Yeah, so yes, so it's a big win when we have router freedom and we fought against the operators that and they always came to say that interoperability is a problem, security is a problem. But with router freedom we could prove that this cannot be a problem and I hope that with our discussion with DMA we can bring our experience and say that we can also overcome this problem. Thank you very much.
Digital Services Interoperability: Panel Discussion - The technical challenges of interoperability requirements in EU law
All right. So we've heard from the community, we've heard from policy makers and I think it's time for you to be able to ask questions because that must have been a bit frustrating to not hear specifically when I hear your laughter about the specific Apple stuff that I'm not going to comment specifically. So it's time for questions. I'm going to ask a few of them, the dumb one, and I'll let you ask the smart one. My first question is actually to you, Alex. Here we have a lot of developers that are developing the app, open source app, they build their businesses, but sometimes they're hobbyists as well. They just like to develop apps because they're training themselves because they're curious because they like it. So when I hear your presentation about interoperability of messaging services, what I'm dreaming of is a kid one day or a hobbyist that is just able to build an app that is interoperable with every single messaging service, it's just for himself. He's not selling it to anyone, it's not a business, it's just him or her or them and wants to be able to just interoperate and create something to exchange with his friends. Do you think that's something that the DMA will be able to create and that's what we're talking about or it's just impossible? No, I mean, overall the DMAs of the DMAs to create opportunities. So if you're a software developer and you see, well, I need, for example, access to these APIs and all this, then you can go to the obligations of the DMA, you can say, how can the DMA can help? And so the idea is to create some room for developers to develop their software. Now when it comes to the specific question of messaging services, Article 7, if you want to go and take a look, which is the one that deals with this, is quite specific in the sense that, for example, it says that the level of security and end-to-end encryption provided by the gatekeeper has to be preserved. So I can imagine that if you are a software developer, then you have to, and you're starting from scratch, somehow you have to work towards that. So it's not that I do something and that's it and that would work, but you have to make sure that that level of security and encryption is going to be preserved. And that I suppose that requires some manpower. But the idea, anyways, is that as far as you are providing a messaging service that you intend to provide in the European Union or are already providing in the European Union, then Article 7 is for you as well. Does anyone have a question to this or to something else? Okay, many. Let's start from you. So actually I have a question from one of our livestream viewers for the Matrix Channel for Interoperability. The Data Act mentions harmonized European standards. Is there some work underway in that respect now? Thank you. I guess that is for you, Lede. Here you are. Yeah. So the work is actually starting now. So it has not yet started. We are now starting with the inventory of standards that facilitate interoperability. And the creation of this, the creation of an inventory with all the different standards and protocols that would help facilitate this interoperability among the different cloud services. So it's not there yet, but it will be hopefully soon. And if needed, we might need to define our own standards. Thank you very much. I have a follow-up question for this. So here again, I'm going to ask you pretty much the same question I asked, Alex. What do you think the open source community can do if they leverage that specification in order to create new solutions that could be open source or other type of solutions? Do you think that's something that they'd be able to leverage? Yes, of course. Also because we want to have open specification, open APIs so that these people can change, you know, you're not logged in into one service provider, you can change from one service to another. And this will open up also new business opportunities, new application developments, etc., etc., new cloud services. Who knows? So it's also another opportunity for innovation. Okay, I've seen hands raising. So I'll let you go for it because you're closing on anyone. So you've got that for me. Thank you. I'm an independent lawyer and political scientist. Okay, okay, I'm an independent lawyer and political scientist. So I have three questions here. First about the DMA with question with Apple how they will integrate or not. It seems they don't want to. And the thing is if we look at the Microsoft case that has been opened in 2006 from Commission against Microsoft, one of the problems was interoperability from the Samba project. So afterward, as the court ruled that Microsoft should provide for free the documentation for interoperability, Samba get access. But there was some problems still about some clause that should still be secret. So that can end there and maybe Apple would like to do something like that. Secondly, about the messaging things, about the interoperability messaging. One of the problems about that is what about the quality of the end users? What I mean is, for example, if there is a WhatsApp user, one of my friends WhatsApp users, and for example, I am a signal user, I cannot interpret with them. But on the question of security and privacy, one of the reasons why I use signal is because of the privacy and the security concern. And the problem is if there is interoperability between WhatsApp and signal, does it mean that my message will go to a lower level of security and privacy because my friend use WhatsApp? Do you see the problem? It's a problem of quality. We will go down on the quality of privacy and security or we will up it. It's an open question. And the third one I forgot, I think it was, it's much more larger question. It's about the question of anti-competition law. So how do you see DMA will articulate with anti-competition law that is very useful for open source community as we've seen in my case and various cases? Thank you. Right. I'd be lying if I'd say that you got the easy questions. So good luck. Yeah, now these are very interesting questions. And I come actually from the field of competition law and I'm glad you're asking this. It's just my hobby. So now about the, I'm actually happy that you mentioned the Microsoft cases because those are quite important in history of antitrust when it comes to applying competition law to digital cases, right? And as you say, it's very important for the open source community. Now one of the, and I think this speaks about the importance of the DMA and how powerful the DMA is because when it comes to these cases, first of all, they run for a long period of time because they take time to, you know, first of all, identify what's the issue, what's the anti-competitive behavior of that company, identify that the company is dominant in the market and then implement remedies say, okay, that was illegal. Then you have probably an appeal and then that appeal will go forever. And then finally you have your solution. And by that time, then you just gave up. Now with the DMA, actually enforcement is, it's actually really quick because as I was saying in the presentation, once you've been designated as a key keeper, all the obligations that we have in the map, these ones, they apply in six months. So you have six months to adapt your business and comply with all of them. Also another difference with competition law cases like these Microsoft cases is that there is no, there's no like exception. So for example, when it comes to competition law, you can say, well, I'm doing this because this is efficient for the market. And then you have a whole argumentation about how that was good for the market in the end. And that takes some time also to figure out, right? In the DMA, there are no exceptions. So what you have is like in some of these obligations, you have very minor targeted kind of safeguards, so to speak, where you can say, well, you know, I'm going, I have to, I'm the key keeper, I have to comply with this obligation. And what I can do is I can take, I can implement some measures to preserve integrity and security of for example, the device. But that doesn't mean that you don't have to comply with the obligation. That means that you still have to comply and you're allowed to implement just some proportionate justified measures to ensure that the device will keep that integrity. But that will be it. So with you coming back to your example about saying, well, what if the key keeper says, I'm not going to open that API because of, well, they need to really justify that on that, within that particular exception that we have there. But it's not really an exception because they still have to comply with the obligation. So there you can see, you know, a big difference with these anti-competition law cases that take ages. And you know, sometimes compliance is not as effective as we would like to. Now moving quickly then about, for example, your last question about competition law and the DMA, we have actually at the beginning of the DMA, we have some legal provisions saying how actually both interact. So for example, when it comes to the DMA, and this is an important aspect, the DMA is not like something that is just really rigid. We have some mechanisms to adapt the DMA, have more obligations or amend the obligations that we have is if the practice, for example, in the software development community changes and we identify more issues, we can include those. And one way of including those is, well, if we have been running an anti-trust case, a competition law case, and we have discovered that there are some things that would need to be in the DMA so that we can actually act quickly, then that's a way of the DMA informing from the competition law. So there's a lot of synergies between the two RAs. And then finally, very quickly, on the quality of use and messaging services, you're right. So one of the concerns would be, well, you know, what service is provided, is a service provided by Meta, you know, what about my Meta data going to the service of Meta, right? One important safeguard that Article 7 has is that users have an opt-in option. So if you, for example, are using, say, Signal, and Signal says suddenly, oh, we want to interweb with WhatsApp, then you're still, as I use it, you can say, well, but I don't. So then you can just opt out from that option and you can deactivate the interoperability option from there. And that's a way of calibrating that concern. In any case, if you think about the way that Article 7 is constructed and interoperability for messaging services is constructed, it's a request from the alternative messaging services. So in this case, it would be Signal that has to evaluate, okay, this is the privacy that I value for my users. And you know, it's a business decision that I go to WhatsApp or to Meta and request interoperability. Now if we are a very privacy-conscious service, maybe it's not in your interest to do that. So in any case, privacy has to be preserved. That's also in the law. And still users, ultimately, you can choose and say, well, I'm still in Signal, for example, and I don't want to have anything to do with WhatsApp. Thank you very much. Questions? Yeah. I remember you raising your hand from the very beginning. I'll come to you too, don't worry. Here you are. I had a question regarding the long fight for getting refunded from Microsoft Windows. So when you buy a laptop at the store, you can only find laptops with Microsoft Windows. And when you refuse a license, you basically cannot get refunded. That's a very long demand from the open source community that for software that you don't use, you should be refunded. So there is the monopoly of Microsoft's still Windows on laptops. And I don't see how the side loading of apps is going to help to tackle that problem. I think there was a missed opportunity in the DMA to say instead of allowing the user to have this right to sideload apps, we should also have the freedom to load anything we want, including alternative Linux distributions. At the moment, I see that this sideloading apps is only tackling the problem of the user space, the applications, but not the operating system. So it will help, but I don't think it's enough because you're also strengthening, but by doing apps for those platforms, you're also strengthening those platforms. You're strengthening this oligopoly by doing that. And also the sideloading apps, I was expecting that the OS manufacturers would put a lot of restrictions using any kind of excuses to not have full access to the hardware, try to obfuscate certain APIs, don't provide documentation on how the hardware works and so on. And it's the first step, but I don't think it's enough. I don't know if you want to answer to this because it was as much as a comment, as a question. Yes, several questions? Yeah, all right, let's get several questions then. No? You were asking for the whiteboard. I think two, so one, two, and then a third one. Craig Russell, and I'm a software developer. It seems that if you're going to allow developers to access these services, then instead of making them petition the gatekeeper and go through a list of, I don't know, how many thousand applications they're going to process and then say, well, it'll be sometime in 2036 before we get to the ten thousandth application. Why shouldn't you require the interoperability APIs to be open source and put it in the public domain, then there's no argument about who can use it because everybody can use it, whether you're open source or commercial. Thank you very much. That's my favorite question so far. Thanks very much. I've got a question actually back to the application specific. Some of us will have seen what Apple have published and what their proposal is, and it's very obvious from their proposal that essentially every time there's a safeguard, every time there's anything that could lower the amount of things they have to do, anything that could help them restrict the amount of freedom users have, they're exploiting it and they're going to keep exploiting it. My question really is how is the commission going to manage to make sure that basically Apple dragging their heels isn't going to stop this from happening because right now what it seems like is they're going to charge developers in a way even if they're not using Apple services and so on and so forth. How are the commission going to stop Apple from dragging their heels because clearly they're very much willing to take this to the courts. All right so I'll leave you answering the questions. If anyone has a question for Lere, if not I will ask the next one. Here you are. Yeah, no actually I can tackle the, well in a way the way I see them, the three questions I cannot relate it. I see the relationship in this way. So for example the first two questions, two minutes, okay. Now the first two questions, I think so for example about refunding licenses and then also why is Apple doing this about having the request and not really just giving all the APIs open. Well this is actually the type of feedback that we are looking for. This is the type of feedback that we need from stakeholders because in a way one of the things of the DMA is that this is also an important aspect is that gatekeepers once they've been designated they have to comply with this within six months and it's for them, the burden is on them so they have to come to us and put on the table solutions to comply with this and they have to explain well we're going to comply with the vertical interpretable ligation in this way and then it's for the commission and this I can go to the third question, for the commission to evaluate if whatever they are doing is actually within the law. So for example when it comes to giving the open, the API, sorry, giving the APIs directly open instead of having like a process request that can look more than some then it would be asked to evaluate if that compliance solution that they are putting on the table is actually compliant with the law and then we will see with all the procedural tools that we have and we can actually have a case in this case against Apple and then this leads me to the third question which is what can we do? Well the DMA gives the commission a lot of powers to investigate and enforce all the obligations not only vertical interpretable but all of them so for example one option would be to, the commission can have specification decisions, we can tell Apple and say well you have been proposing this, this doesn't work, you have to comply with this specific solution that we are putting on the table now and that's something that in competition law for example cases that doesn't really exist, competition law cases you have a remedy in the case and that would be it but as here we can go to Apple and we can say in this case, we can say well this is our specification decision, this is how you should comply with this article and then we can also open infringement procedures that are actually pretty quick and in those non infringement decisions we can say well you have not complied with this obligation and here's your fine, this is how you should comply with that and then if they still don't comply with that we can impose daily fines on a daily basis and then if they get three strikes like three non-compliance decisions that would be systematic non-compliance and there the commission has a lot of powers even to impose structural measures. Thank you very much, then I will get two more questions, there is one from each one from each and then after that anyway we have the workshop session as well where everybody can ask questions and make comments directly on the tables but I'll let Amandine explain this. The question I have for Leire in the meantime is that open specification we just mentioned and you mentioned open specifications as well and it seems that you are trying to achieve interoperability with your open specifications as well so I'm wondering is there some sort of bridge between the two regulations in sense that the commission could be leveraging an open specification developed in another through another piece of legislation in order to enforce interoperability requirements to gatekeepers for instance one day and something I wonder whether or not you see also something similar but I shut up and I'll let Haas give the question. Are you a mind reader? So this was a question from online, so I'm just a messenger for interoperability the data act mentions harmonized European standards is there anything in preparation? Thank you, thank you indeed I am a mind reader and the question was that way. Thank you, my question is about the cloud interoperability is it only about the data or it's also about the workloads and if yes on what level is it on like yes. All right so I give you the mic later and Alex if you want to add something because I think it's connected to what you do as well and then after that we'll have to wrap up. Okay so the first one with respect to the open specifications and the relationship with the data markets act I think that somehow they are related but for the time being as far as I know in the data markets act there are no gate keep there are no cloud service providers named as gatekeepers. So okay okay sorry sorry so that currently in the data markets act there is no gatekeepers there's no cloud service providers named as gatekeepers but eventually the open specifications for interoperability should also apply in that respect. With respect to the question online I think everybody answered it so this now a process in which we're actually analyzing the existing standards and if there aren't any on interoperability we will define them there's going to be a study allowance on that and what was the last one? The last question? Oh yeah the data the data and the workloads so in principle in principle the data act is for data at infrastructure level and also at platform and software as a service level but not so much for workloads so but that's something to think about thank you. Okay so I think we're gonna wrap up the panel but then we're gonna open the workshop when you're gonna be able to be more interactive and then I will finish your question to the audience as well. Alex mentioned the fact that now specification has to be open maybe they will have to require the user specification and you ask a question very clearly why not open sources so I wonder if any of you see anything that could stop a regulator to say no no you have to use that specification and you have to open source those specific APIs and open standards I just wonder because that's one of the question that is on the table of Amandine but I will let Amandine and give you a round of applause for the panelists. And I will let Amandine give you the details about how the workshop will be going where you can actually give your inputs on how to enforce the two different pieces of legislation.
EU Policy Devroom Wrap-Up
So we're going to close the session for today and for the whole day of session. So I'll ask the DEVRA managers to help us to organize if they could come as well with us and then we'll let Simon close everything. So thank you everyone for staying right to the bitter end here. So the DEVRA managers that have been organizing the day for you today are Enzo over there. Enzo was allowed to do this by clips. There's Deb here. Deb broadly did this because he wanted to. There's Heath here from the European Union. He's had to take it took him about a week to get permission from his boss to do this. So I think I thank him very much for doing that. Yeah, yeah. So let me see. We've got Alex here who is from FSFE. We've not in the room at the moment. We have Axel from Open Forum Europe who is out wandering the estate. And Martin without whom we probably wouldn't have done it at all right from the beginning. So thank you very much, Martin. So what are we going to do with all this? Well, the reason we've had a rapporteur in each of the four workshops is because we were told that if we want to get any traction at the European Commission, we need to give them a report. And so we've taken notes of all four of the workshops and we're going to construct a report that gives the essential feedback from each of the elements. We're going to make it look nice. And we're going to work out how to subdivide it so that it can be used in each of the directorates where it will be a useful tool. And hopefully that will be a way that there will be lasting change and not just a great weekend at FOSDEM. I am also very grateful to Alistair Kergan from the FOSDEM organizing team who has been our guardian angel for making this happen and making the keynote session that we had yesterday happen. Without him, we probably wouldn't have got it. And also last year. And so we're very grateful to him as well. And I'm very grateful to so many people, the people who are here now and the people who have been here all day, who have been so positive and encouraging and engaged so well with the European Commission staff. And I want to especially thank the European Commission staff who have given up a weekend day. In some cases, in the case of Omar and Benjamin, two weekend days to come and meet 8,000 friends who they didn't realize they had before. I'd like to encourage those of you in the commission to treat us as your friends. We're not lobbyists. We're subject experts. So please refer to us whenever you're preparing legislation in the way that you would refer to a subject expert. Many of us are freely available to you whenever you write. We're on signal. So. And on matrix. Maybe parliament. So. So behind the scenes, we've also had some support. You haven't seen anyone from council here today. We did try to reach out. We didn't actually find anyone who was free this weekend. And we're grateful to the people from the parliament who supported us. And it's really very good to have had all three parts of the Trident present here. I think it amazes me. I've been coming to FOSDEM since 2006. And it amazes me that it's taken until 2023 last year for this to show up at FOSDEM. But we are going to try and make sure that it remains an important instrument in creating end user agency and software freedom for people throughout Europe going forward and in perpetuity. And with that, thank you very much. And there is a closing session in Janssen.
Thunderbird: Why Visual Change Is Good
Nameless. You cannot blame me for anything. Hello everybody. Thank you so much for joining. This is a much packed room than I was expecting, but it's fine. My name is Alessandro. I have a slide that I present myself. Of course, I need to fill up the time. Why visual change is good? We're going to talk about the dreaded. We're going to change things up and your user will absolutely love it. Nobody will be upset about it. And we are going to look at a little bit of history about Thunderbird, what we did and the success that came from all the changes that we did. So first of all, who the hell am I? I'm the director of product engineering and you might ask yourself, what is a director of engineering doing here talking about design? I've been a designer and a developer for almost 20 years. Yuck. Started as a front-end developer and UX designer at Thunderbird. First I was already there. Build many absolutely terrible GTK applications. I'm a really strong advocate of open your process. Open, share everything from the ideation, prototype, mock-ups, anything that you have. Just share them with your community. It's free user feedback. You don't have to pay for user research. We already have people to do that for you for free. So who doesn't know what Thunderbird is? Okay, good. That would be very awkward for you to be here. This is Thunderbird. This was Thunderbird. This was Thunderbird version one. We officially initially released it in December 2004. So Thunderbird is actually 20 years old. It looks pretty nice. I guess pretty standard email application. It has a folder pane, message list, a message pane, a toolbar. Nothing crazy, nothing new. This is Thunderbird 102. We released this a little bit more than two years ago. Kind of looks worse, right? This is colorful. You can identify. It's easy to read. You can identify the major interaction points. There's a lot of nice white space. It's breathable. Yes, it's dated with the little icon from Windows XP era. All this type of thing. But this is flat, cramped, 25 toolbars, all flat icons. What the hell happened? So all these wonderful things. And the reasons because this happened is if you don't know, Thunderbird is actually built on top of Firefox. Thunderbird is a bunch of layers of C++, CSS, HTML, and JavaScript on top of the GECO render engine. And at that time, when Thunderbird was initially created, Firefox and Thunderbird were developed at the same time. Anything that was happening in Firefox was trickling down into Thunderbird without much control. All the user interface changes or the tool kit changes, we were just inheriting those things. So when Firefox adopted the Aero Windows 7, we got that by default. All those rounded tabs. Even having tabs was weird for Thunderbird, but that worked. When Firefox was completely compliant with GTK 3 and accent colors, we got that by default. Windows 10, their new flat accent color toolbar. So there wasn't really a time where designers were on top or in charge of Thunderbird. It was all just at some interface things from Firefox. But the problem is that Firefox UI is technically just a toolbar and a setting page. Yes, there's much more, but the majority of the things that you use your browser for is just browsing the web. You don't use your browser to look at the UI. Hopefully the UI gets out of the way so you can enjoy the web much more. Thunderbird has a lot of UI, a million dialogues, a lot of message panes, a lot of options. It's just all the tool kit and UI and design system and ideas that come from Firefox, they don't translate and adapt seamlessly on Thunderbird. And then these happen. We broke up. So these are all excerpts taken from Wikipedia. The Thunderbird development was just a burden. They decided to give it to the community, which was fantastic because community members stepped up and kept the project alive. The problem is that community members were, the majority of them were engineers. And when you leave the user interface choice to an engineer, everything goes great, right? Like they, oh, we have so much space on this toolbar. Let's add another button. Where's the option? Let's put in a sub-menu, a sub-menu, a sub-menu. It's fine. Users will find it. So it turned into the most powerful email client that no one can fully use. Still off today. As off today, we get email supports and requests like, I wish I could use Thunderbird, but it doesn't do this or it doesn't do that. And our answer is mostly, well, it can, but you need to enable it from the preferences but in the sub-menu and also change this dialogue a little bit. And now you can use it the way you want. So this is an example of the steps that you need to do if you want to move an email to a folder without using a mouse. This is the only access point to keep it fantastic, right? Like very easy if you do a wrong step, everything closes all of a sudden, fucking great. And then we have other examples that are like the pinnacle of user experience. We have our filters dialogue. You can create custom filters. It's extremely powerful. It's an amazing piece of engineering. But oh my God, what is that? A lot of users, they say, I prefer to go into Gmail and set the filters there because it's easier. We have these situations when you need to change something, you need to open a dialogue and then open another dialogue from a dialogue and then another dialogue and then you can do your things. Or this, which is for a lot of our users, this is the most optimal interface ever created. I can see everything all at once and again, I have stacks of toolbars. What can I ask more for that? So obviously this translated in slowly losing users because what was working 20 years ago and what was working 10 years ago in terms of what was the standard of user interface and user experience, it's completely different. The level of interest in an approach to accessibility and assistive technologies or just the heuristic researches and discoveries, they are completely different from what they were even five years ago. And that translates to user retention. Users will see Gmail is easier to use. Yes, there are a lot of other value propositions from like a webmail, but even other applications are more enticing. They look a little bit more beautiful and things like that. So we had, yes, our absolutely fantastic community kept us alive and they continue. We kind of have a steady stagnant user base, but slowly, slowly declining. These tip down are December because people don't use email in December. Thank God. But yeah, we were like, I don't know, 20 years, maybe a thunderbolt would die, like no more users. So whoever got this type of feedback from any of their users, this is kind of a silly but honest and knee-jerk reaction to any change because users, they love their muscle memory. It's the most important thing, which is true. When you get comfortable with an interface and when you start doing things without actively thinking where to look for, it just becomes natural. And when you see someone like a designer stepping into a project, hey, I'm going to change everything because this is bad. The first reaction is like, fuck you. And this is a little formula and hopefully if there's any developer in here, you would appreciate this strict operator. This is a little formula that reminds myself sometimes. When you don't have any, any like zero user interface and user experience updates, you basically have a stagnant application, which doesn't mean that it's familiar. Familiarity doesn't mean that you cannot change anything. You cannot bring any improvements, any updates. You can maintain familiarity but without creating a stagnant environment. That's what we did. We created a new effort called Supernova because we literally wanted to blow things up and create new elements from these galaxy explosions. And this is a little screenshot. This was one or two. This is 115. It's not finished, but it looks like an email application from 2005 now rather than like 1992 or it was before. And you can go to Thunderbird.net, click on what's new. There's a little funny little slider. You can see the differences. The thing is we really focus on familiarity and muscle memory and retain the current user. We didn't want to alienate our audience that for 20 years they kept using Thunderbird, even if it didn't have any substantial update. So rather than just saying we need to make this prettier, we need to make this look modern, which modern is a very relative concept. What is modern today is not modern tomorrow. Don't follow the trend. So we approached this from what are the improvements that we can do? And we started focusing on the first problem. We have a million and one features that users cannot discover, cannot find. The density control and font size control that we have here are very important because a lot of our users, they have multiple monitors with different DPI. Sometimes they plug and plug out inside like different monitors. They're going on laptop. And it depends on the operating system, especially in Linux, how the operating system talks with the application, how our gecko engine reacts to it. Sometimes the density doesn't change. Sometimes the font is a little bit too small, a little bit too big. These solve 50 bucks per month constantly. My font doesn't work. My density doesn't work. We just exposed these on the primary menu and we made them, not in an option, sub-option, with like multiple text fields. Just a little thing and this is absolutely lovely. We focus primarily on accessibility. Who cares about colors? Who cares about, it's pretty, no. It is accessible. Users with mouse, keyboard assistive technologies, color blind users, can they actually use this without tabbing with the tab like 35 times and losing their minds? Then we focus on consistent paradigms. Every tab in Thunderbird, if you open the address book, it looks from another application, the calendar from a third application and all these things. We start implementing some consistent paradigms. Let's create an accent primary color that we're going to use for primary buttons. They use the same color to indicate user call to action or indicator. Let's create a color palette that is consistent across the space, consistent icons. Also the icons were coming from Firefox. We didn't design our own icons. All the icons were coming from what the designer of Firefox did. Incredible design that was working well for Firefox, not for us. So we decided to take in our hands and then fewer dialogues because before the address book, you want to edit a contact in the address book. You needed to open three dialogues. Now we have this crazy application called different panels and models and all like very, very intuitive and makes things much more easier and compatible with assistive technologies. Obviously, our community loved it. It was like, thank you for doing this. We were waiting for it because the main complaint that we got was this is not familiar for me. Like you broke my muscle memory. I don't want to reuse Thunderbird. I don't want to relearn how to use Thunderbird. And the majority of our audience, unfortunately, are like old engineers, like people in their 16, their 70s, asking your mom or your grandma, please relearn everything from scratch is absolutely not going to fly. But something happened. The last little thing in at the end of August, 2023, we officially opened the faucet and upgraded all our users to Thunderbird 115 before we were trickling in and fixing bugs and regressions and things of that. And now look at that. For the first time in six years, we're getting more users. It's absolutely beautiful. And the feedback that we're getting are incredible. We have two major feedback. One is I used to use Thunderbird in 2017 at a stop because it didn't have this or it was looking outdated. I was looking for something different. Now I'm back and it's perfect. I love it. Or the other is I never used Thunderbird, but I tried and downloaded a couple of years ago. I hated it. So I just left it there and now they're trying it again and they're using it and we're getting new user, which is fantastic. So at the end of the day, visual change is good only, only exclusively if you do it tastefully and you do it with a controlling intuitive upgrade path. So thank you. You've actually got like 10 minutes left. Holy shit. Let's do this again. Okay. We've got a lot of time for questions. Yes, please tell me. Tell me everything. So remember repeat the question before you answer. Do we have questions from the room? I see one there and one there. So we'll go Mike. Great. Great talk. I'd be curious in terms of, I'm not sure what sort of like format you're getting at, but you've seen, I assume overall, user growth. Yeah. Did you see a lot of your legacy users drop off? We had a little bit of a drop off. We didn't, sorry. Hey, I have 10 minutes. I can do whatever I want. The lovely person asked if we have data to confirm if we saw any drop off in existing user. Yes, we had a little bit of a drop off, but we didn't pull any actual numbers so I cannot speak like specifically, but in general, we have an average of 15 million active users per month. And the total complaints that I got of people very upset were around 200 messages. So it looks like as a first impact is like, oh my God, our community is absolutely upset, but the rest of the 15 million didn't really care. So always take the grain of salt off, like take them with a grain of salt. When you get user complaints, is that really the voice of the whole community or is just the knee jerk reaction of you're changing things I hated. So fuck you. Thank you. You mentioned focus on accessibility. Yeah. I have some tests that interact with under birth using accessibility API. Yeah. And I noticed from version to version it's working worse like less descriptions and left on the notes, etc. Is it going to be improved with this? 100%. So that's our question. The user is asking this lovely person is asking throughout the years, which version do you remember? Whatever throughout the years, the accessibility got worse than worse. And if we're going to address this and we're improving it. Yes, 100%. The major problem is that what we were using before all those message list and folder pain were and this is probably you're going to collapse some of you. The user interface was generating C++. That's a Zool 3. I don't know Zool XUL. Yeah. Very easy to work with absolutely fantastic. It was easy to inspect everything in like, yeah. We had like almost borderline zero control on that. So if things broke, it was very, very hard to figure out what broke. Now our new folder pain message list is a virtual HTML list box. It's just an order list with list items. We need to rebuild all the accessibility that we broke in the past 10 years. It's going to take us a while, but we collaborated with visually impaired users from our community. We do a lot of beta testing. Version 115 is later a lot better. There's also a little thing which is funny. NVDA and JAWS, which are the most famous assistive technology softwares out there, because Thunderbird was so broken, they created layers to create compatibility with Thunderbird. And now those layers are broken because we removed the whole thing. So great. Okay. You go. One, two, and someone else. Yeah. Okay. So I initially was one of the absolute users and I might say it could have been improved a lot by communication. So as you see, even in the pictures, the default on the old version was much denser in the number of emails per screen. So the trick to be happy with the new version is to set your density much denser than you had previously, but this was not announced. So I had to learn it through Hacker News. Okay. So the user, that's not a question, but it's fine. The user is saying that the default density is much more spread out. There's much more white space compared to the old one. So in the previous version, the button did not really do anything and it was well hidden, so everyone forgot about it. Yes. The density button was not this horrible. Yes, absolutely. We're going to talk later. I want to ask you why you need that density. So, you receive fewer emails. Is that it? You receive fewer emails. No. Well, technically, a lot of designers might know this, but your focal point is actually like this big. So you don't need all the density. You have the perception of, I'm seeing everything. It's great. I was like, no, you're not that efficient if you have everything cramped as well. You need space. Faster to scroll your eyes than to scroll your mouth. Yes. There was another question. Yeah, there was another. There was, I think I saw her first. Yeah, one, two. Yeah, good. Dan, sorry. How do you market to users you may have lost to say, hey, try us again. We've changed. Okay. So you're asking how do we market and target users that we lost and change? We don't really, right? They're still part of our community. They're still shouted at us. They tell us we're doing a terrible job. And then we show, hey, we made $9 million in donation this year. So I guess we're doing something right. And we're gaining 30% of users. So we don't, there's a certain point where you need to pick your battle. Like if this person is very upset, like what do I gain from servicing them specifically and their niche needs rather than the whole community that we were completely like not servicing. So, thank you. So I actually used it for like 15 or 20 years maybe. So thank you. Thank you for using Thunderbird. I was actually mostly not contributing. My point is like so I'm very happy to see more users and probably more money coming. And so the software is going to improve in 10 years. But do you actually really attribute these two weeks changes or is it because honestly like for the past five years or like, okay, seven years ago, let's say, Cernibur was dead for a long time. Like, Mozilla was not taking care of it anymore. Other people were thinking should it move to like, like we talked about many different people starting to take care of Cernibur. So I was wondering if the peak now that we are seeing is just something like, okay, the project is back on track. There are actually people taking care of it again. So if I ask for something, someone is actually going to listen to this. Like also a lot of good work has been done on the booth like the performance. For me, the increase can be like something bigger than new weeks improvement. Yeah, so yeah. So the question is can we make sure how are we sure that the user increases attributed to exclusively UX improvement and not all the other things that happen throughout the years. Like Thunderbird has more development and there's more trust in the community, all these things. So Thunderbird was announced that it was part of MZLA, had a new for profit, and we started releasing monthly beta releases and ESR. We had all the resources in, I was hired in 2019. Our marketing and outreach for the user continue constantly throughout these four years. Users, the clients still going down. We kept having the same feedback. Oh, Thunderbird is not that fantastic. I'm going to try it. Oh, it looks like from 1999. That's, yes, is not 100% UX exclusive success. It's a lot of other things, but your user, when they use an application, the first reaction should be, oh, wow, this is nice. And if you don't have that, it doesn't matter if it's a well supported or they hold the features. There's the whole thing, but the entry barrier is absolutely terrible. Your general population, not the open source enthusiasts and the privacy focused person that really knows that one's Thunderbird, the general population will just like don't care about any of that. So yes, it's a combination of a lot of things, but if you don't, if the first impression is yuck, nobody's going to stay around. You can take one more question. One more question. I saw him, sorry. You showed operating system themes, not messing with the UI. Is that still a problem? Where are the canonical people here? The person is asking if the operating system changes are still the compatibility with the operating system still messing things up. Yes, because we release on Windows, Mac OS and Linux. Windows and Mac OS is pretty stable. They have the human interface guidelines. We kind of stick with them with those pretty quite frankly. We still have bug reports of my custom arch with KDE on Dracula theme doesn't work with your theme. Can you fix it? So yes, it's still a problem. Unfortunately, we rely on the gecko render engine. So if Firefox fixes it, we get it. But it's very, very tricky and we have limited amount of resources. Our front-end team and designer, we're only six people in total and we're services 20 million active users. So it's a pill bottle. Can we show some love to... Thank you.
Reimagining Personal Computing with E ink: Community Insights and Design Challenges
Hi everyone. Thank you for having me. Thank you for the organizers and volunteers that have been doing this event and all the other ones as well. I appreciate it. So yeah, so my name is Alexander Soto or please just call me Alex. I'm the founder of Modus. So just a bit of a backstory. In 2021 at the height of the pandemic, my bedroom transformed into a workspace. So I was spending most of my time from the morning to the night just in front of a computer, right? So using the computer and sort of battling this distraction, refocusing and then trying to get back into my work. And you know, the same tool that I have to use for work is also the same tool that I'm using for like leisure, for entertainment, and sort of this is kind of what brought about the idea of being able to or trying to re-imagine what would it look like to have computing that is calm, inclusive and humane. So at Modus, we're kind of reimagining personal computing with focus on creating calm and inclusive and humane computing. And we're doing that by creating a high refresh rate electrophoretic display controller. In the slide that's uploaded, we have videos where you can take a look at that. I also have our prototype here. So if you see me or if you want to try it out, I'd be happy to demo it so people can check it out. And this wouldn't be possible without the team. So this is really the team that's turning this vision into a reality. Wenting or Zephyre has led the development of our electrophoretic display controller. Brody, who's done amazing work sort of like CAD and manufacturing for our paper monitor chassis. And also Michael, who have had many conversations of what the possible software architecture and sort of stuff would look like. And last but not least, I want to thank the community really. We did a community survey. We had about 3,000 people fill out the community survey. We had about 300 different contributors and people who are interested in joining our pilot program, over 5,000 people in our mailing list, and a special thanks to Nelnet. We were working on a prototype. We're getting close. But we needed some support and Nelnet really, we're a Nelnet sponsored project. So really extended thank you for your support. If you haven't, please check them out. They're amazing. And a little bit about that survey. Oh, a little jump. Sorry. A little bit about that survey. So from the survey, there were multiple responses that were allowed. We had about 3000 people who filled that out. And some of the most popular use cases of why people want to use E-ink were reading, writing, coding, and just in general, be able to do it for a focused work. Not too surprising, but from the survey, what I learned a lot were from the feedback. So I think for me, I was also, what I experienced, just being focused, being distracted, refocusing. There were other people as well that shared similar concerns just from this quote, you know, I lose hours and days and weeks of my life, getting sucked down rabbit holes. I'm not so sure who hasn't experienced that before. In this room at least. And, you know, just with entertainment and content, missing deadlines, and then other people expressed concerns about our eye strain and just accessibility. Other folks who have been using commuter, you know, since they were a young age and have tried to look at other solutions between, what are the examples that they gave? Well, using different solutions that are available and still not having really any success. So I've read each one of the comments of all the feedback that I got. And there's some overarching like patterns and themes. So there's this one, there's a desire for living a more balanced digital life. So reducing your screen time on social media, entertainment, be able to kind of like unplug and be able to be outdoors. Oh, and I connected online. Be able to connect, be outdoors and also being a more less visually stimulating environment and kind of reducing digital clutter. So that's like one particular group. The other group where I learned more about is just people experience eye fatigue and strain, but also very specific health issues, people experiencing myopia, epilepsy, some level of light sensitivity, headaches, migraine, traumatic brain injury. And I think one feedback or comment that's been engraved in my mind was an engineer who was writing or completing the survey on behalf of his wife because she had epilepsy or has epilepsy. And it's tried all the existing solutions and they weren't working. So that's like one specific comment or feedback that's engraved in my mind. So Cat Holmes is the author of a mismatch how inclusion shapes design. And this particular quote is one that kind of speaks to me. It's, you know, all of us are temporarily able bodied and at some point in our lives we'll face new kinds of exclusions. As we age, when we design for inclusion, we're designing for our future selves. So I'm getting older, I'm more divergent, I have a host of other health issues that I'm aware of and unaware of. And this quote and then kind of working with Models and the work that we've been doing, kind of spoke to that to me at least. And overall I think that there's a need for creating technology that satisfies our essential needs and while protecting our world being. How can we like redefine the role of our digital devices to foster more healthier and balanced life? And at least a vision is, you know, can we create a new class of devices that are built from scratch that sort of embody these principles of humane technology both in hardware and software design? So what about software? You know, how can we be productive in a calm world? Can the work that we do be synced on the cloud, fall into our focus? Can we collaborate without notifications? Can we scale a minimalist UI to more than just reading and writing? So if we start with the basics, the common use cases were reading and writing. We have kind of like a mockup of an example of a writing application, simple text editor that will allow you to do prose and code and also a simple reference for we can like browse the web using Gemini. You know, this is how you would sort of explore and use information. So, you know, how can we possibly scale something like this? And modellix. So modellix is a core framework that's in development that's designed to tailor applications and documents across various devices and end users' needs. So the way that I like to think about it is imagine responsive design but taking much further, not just related towards particular screen types, but a longer range of different types of devices they have and really cater to the needs of individuals. So here we have the source of the particular application, a minimalist reader, and then let's say you have a reader who has low vision or perhaps is using maybe some forfeiting device, the application itself will sort of adapt to that. So yeah, so modellix adapts to the interface to users' preferences, making adjustments depending on their need. There would be a semantic model that guides the adaptions across different modalities, screen types, operating system devices, and enabling developers and users to extend and enhance the application for each of the modalities. Some of the approaches of this is to be able to restructure the complexity of the interface and moving complexities associated with the particular modalities. I have a little bit of a mock-up here of kind of explaining how the modellix and modalities would work. We would have a modellix-aware application and then would present to the user a visual and or possibly like an audio interface. And we're also looking into, well, missed a slide there, but the complexity, some of the challenges are the complexity in the representation of user interface and also how can we be backwards compatible with existing applications. So here we're also looking into saying, also looking into possibly using like large language models for being able to support applications that are not made with modellix from the beginning. Need to expand on this a bit more, but we'll do that later. So next steps. So we've been working at this for about two years now with our display controller and our prototype. We are pretty much done and one of the things we want to do is be able to do a crowdfunding campaign later this year to be able to make the devices or the boards available to people. We're also a small team. We're about three or four people who've been working on this. So if anyone wants to get involved, anywhere between documentation, between design, we'd love to have you all. So we also have a link here if there's different ways to be involved. And I think that's mostly it that wraps up. I think, I don't know if I sped that through too fast. I have 15 minutes. Well, I have 15 minutes and have a demo and I'm happy to answer any questions. So that's it for me. Let's take some questions first and then we'll see if we've. Yeah, I could do. Yeah, I would have to disconnect it from the display, but I can't get here, but yeah. I would have to disconnect the from the here and there, but we can do it afterwards. Not a problem. Any questions? Okay, I see one down here. Please repeat the question. Yes. So the. Yeah, so I didn't. Sorry, the question was am I working with multiple different devices or displays? Is that correct? So initially, the motivation for the project was to make a laptop. We initially had an investor was interested in supporting it. Sadly, they backed off a little bit. So we continue just working being complete bootstrapped and working on the display controller. So the first device that we want to make available to people as well, one, the board itself or hardware engineers, hackers, people want to work on it, give it accessible. The second one being the monitor itself. It'll be a nicer case. So that'll be the second device and we're working right now actively and identifying like what would be the third possible device where they're being between doing a reader or maybe a dedicated like typing device for like authors and writers. This talk today was more focused on like software and like what we envisioned. What would it look like to rethink to be an inclusive technology? I have another talk tomorrow is talking more about the hardware side of things, but the long term vision is to be able to create a whole new class of devices that use our controller hardware and software stack that is optimized. So right now I'm on the next year. Hopefully we'll have maybe a reader here and Fosom 25. But yeah, so that's kind of what it is. And I think part of it is that everything's open hardware. It's on our repositories. There's a read me. So if you're a hardware hacker engineer, please feel free to get started. Hi. Have you been in contact with one of the larger districts of environments like the integrating is one of the challenges you had was not being able to easily easily tell what part of the interface needs to go where it needs to be changed in a certain way. Yeah, so the question is, have I spoken to larger projects, for example, like no more KDE and the second part of it? Just want to make sure. In regards to the changes, I haven't spoken to any of the larger projects. If anyone is in KD or I'm happy to talk. I did speak with Drew DeVault from Swarstut and kind of thinking about what would it look like to create a dedicated software stack at the higher levels. And we see promise in using Wayland particles, specifically things related to damage tracking that would enable applications to create this idea of an ink native application, for example. So if everyone's ever used like a remarkable or any other dedicated device, be able to create ink applications that are natively built and with the support of Wayland protocols. So that's the extent that I've gone, but I haven't spoken specifically to no more KDE projects, but happy to. This is what I've been thinking and working on for two, three years nonstop. So I have many ideas. Yes. Yeah, we're going to hit a 60 Hertz refresh rate. Sorry, what is the refresh rate that we are aiming or targeting with the ink displays? We're able to hit a 60 Hertz refresh rate right now with our controller. It's has good, has anything that's in the market right now. And we strongly believe we can improve upon that both in the optimizations at the hardware level. And also with the dedicated software stack and Implement and Wayland protocols, we could create a more native, create a more native like experience. So yeah, open up the hardware stack, dedicated tailored software stack. I think we can see a lot of optimizations. Yeah, back. So the question was related to the 60 Hertz. It is excellent. There is typically problems related to like ghosting or the problems and have we found a way around it to answer your question? Sorry, to answer your question, yes, we found a way around it. So there's ways where you can modify at the hardware level in order to be able to do that. But still coming back to the earlier question about the Wayland protocols and the larger projects, I think this is where we would need support to be able to create software that's tailored to that medium. But I have a general belief that the display controllers that are available in the market that a lot of existing companies and products use are not optimized so much. I think it's, if we look at E-EngleVraw, they primarily have focused on e-Veeters and digital signage, but not so much in other form factors or devices, let alone having software that's tailored to it. So I think there is a unmet need in the market, so to speak, where these controllers could be pushed further, but they haven't. So I think with the right combination of optimizing at the hardware level, the software level, probably having specific guidelines of this is how an E-Engle native application would render, this is thus the homo-dialics piece, right, that I sort of mentioned before, the tie-in, I think we could hit 60 hertz and also reduce go-sing and things related to that. Any other questions? I'm going to, do I have two more? I'm going to go with the person that will, yes. Yeah, the question, yeah, the question is what would it take to make a modalex of wear application, modalex of wear application, UI, library, or whole different design system? We're trying to figure that out, so it's in development, so I think part of it is a combination of everything you just stated, part of it would be philosophy, part of it would be also design, another part of it is figuring out the right software stack for it too. So I think we've gone under a lot of our rabbit holes trying to figure out what would be an appropriate stack, and I think the way that I look at it is we need to get it to the hands of people, we need to be able to get the board and the monitoring to the hands of people. We have some ideas and some levels of direction, but tapping into the much wider community, I think is where that lies. I'm going to, there was a question over here, yes. You're covered. Okay, awesome, thank you. There was a, think of a question, yes. I'm trying to change our user experience, our UI and so on, so what do you want to talk about? I remember hearing for more, call for the, for last 15 years, starting with one of a lot of her child and other lots of things, so on the one hand, now you might have a bit easier because there is more hardware, like the brand laptop, like the brand, so the brand is anything framework, like M and P, yes, or the other can be, there is much more fragmentation, so that's one thing you need to fight and then you need to convince people to start using this UI in your application and we just had talked from standard birth. Yes, so the question was what is our focus and then mentioning projects like the OOPC, mentioning the fact that it's been around for about 15 years and just what is our focusing hardware and software. I think overall, I see Modos and I think where we're at right now, I see myself as more of like an enabler, an enabler to be able to enable other folks to be able to take E-ing devices in whatever direction they would want. We as like Modos have this particular vision of creating this idea of a humane technology and create a hardware and software that's tailored to it. We might just disappear overnight, right, but the fact that it's open hardware, there's a repository there, you can go ahead and use it. I did track, I don't know if everyone's done track, but I think, there's like a totem pole, I think this is as far as we have a foundation where we can start if someone like Pine 64 or another one like OOPC would want to use our display controller in order to take it in a particularly different direction, it's more than welcomed. I think given our capacity where we are right now, we just want to be able to first focus on the display controller, be able to build the community and then see where the community really takes it. So, prior to my work, I worked a lot with universities and other educational institutions. One thing that's really been happening recently is especially with students with extra needs, universities have been really pushing for assistive technology a lot more than we do with this thing. I was wondering about an area that you consider looking into working with organizations like DSA in the UK or ensuring that the US and the UK are looking into other organizations. So, finding tools for students with extra needs and almost building like a foundation there for other developers to then jump into that market. It's a brand that we are looking at. Yeah, so the question was related to have it looked into like assistive technology in particular the needs of students in the education space. Does that summarize the question well? Yes, I've been looking into assistive technology. I don't know enough. I need to do more research on it. I've looked into it, at least back in Boston. There are these SBIR and SCTR grants that are available, but they're complex in the sense of being able to apply to the grant. The path to enter into a positive grant is a substantial lift. So, I have looked into assistive technology. I think that there are, it's a different market. It requires different needs. I don't know enough about it, but I've looked into it as possibly being a base to start from. I'd love to talk to you further more about it and learn this is why I'm here. We're going to take one more question and it was just back there. Sorry. Also happy to talk afterwards. Thank you. I have a question about the brief that you mentioned in the beginning of like the E being a source of less, but the I think you can help in particular high conditions. So, how do you make sure that your design process matches that needs? The question was related towards E-ink addressing issues to like eye strain and fatigue and how we can make sure that our design process is aligned with that. Is that somewhere as a question? Well, figuring it out, still working on it. The challenge when it comes to E-ink and other displays is that there's also a lot of mixed studies there. I can't conclusively say that problem is related to blue light, for example. But also our eyes and pupils are changing throughout our lifetime and everyone here has different levels. So, I don't have necessarily the funds to prove this is what it is, but I have heard a overwhelming amount of feedback of people with light sensitivity and issues. So, to your question, yeah, I think that's something that we need to take into consideration in our design and would want to get. There have been some folks who have some level of sensitivity or some problem who've expressed interest and I'd want to involve them from the very beginning as part of that design process. I think now that we have our prototype done, that will enable us to answer more questions like this and investigate further. Okay, thank you so much. Thank you. Really cool. Awesome. Thank you.
Liquid Prompt: yes, we can drastically rethink the design of a shell prompt
Okay, thank you everyone for coming. I'm going to read because it's easier. So this is Nojan. Yes. And his talk is liquid prompt. Yes, we can drastically rethink the design of the shell prompt because command lines are interfaces too. Isn't it? Yeah. People forget that. So take it away. Thank you. Thank you. So this is challenging of course to come after this very interesting presentation by interesting people, interesting and new interfaces while I'm here talking about this old piece of software that's kind of specific if not the niche, the prompt. Right. So who here is using the terminal or know what is a prompt? Okay. Okay. Okay. Okay. So I was seeing myself working on a niche problem but maybe it's less the case. So of course I did introduction about what was shell prompt because I thought that it was interesting but then maybe it would go fast on that since most of you should know what it is. What's interesting is that the default prompt looks like that. That's a bash prompt in many distribution. And its purpose of these default systems are to indicate where, when there are problems and follow the state of the work. Right. So that's what we would want a prompt to do. I started working on that because I have many students every year with whom I work and they would show me that when I said, okay, let's work. In my line of work, we are using a lot of command lines and we're working on HPC cluster so that SSH is the only entry point on that in many case and that's what my students are showing me. And honestly, the trick is do you know where the prompt is, for instance, in such a case. That's the default these SSH prompt on my system. I just don't know where to look at. So of course, people had the same problem for years now and there's a couple of existing prompts that you can install on your system and just start using right away. These are the seven most known. And I happen to be the author of this one, Liquid Prompt, which is historically the first one actually, but not the one that got the most successful at the end. But I did a study recently because with the guy who was working on Liquid Prompt, we were wondering should we continue working on that since there is this very well-known software. So we did an extensive study on the feature set and the design of all those prompt systems. And I wrote an article that I will not explain in detail here because it's talking a lot of features and going into much details. I will just show you those two tables. The first one is about the feature sets of those systems. And what I want to advertise, of course, is that Liquid Prompt has, if not the best, the largest feature set, at least the one that are the most interesting, which I call the essentials, the one that are tied up to the shell. Because many prompt systems are actually interested in listing as much environment as possible, which is like the versions of tools, basically, like all the compilers. They would show you the version when they're... But Liquid Prompt shows much more information tied up to how you use the shell, actually. So if you want to know more, you can go on the article. As today, I want to talk about design. In the same article, I'm also talking about design. So there's a couple of ideas we wanted to follow while designing Liquid Prompt. And we paid a lot of attention on that. And they are summarized with those words. A good prompt should be what we call focus. I tried to use single words, so of course they are not completely the right choice, but meaning that the prompt should target states that are actually useful to the user during a work session. And it's not actually useful in most of, at least, my work session to know all the versions of my tools, for instance. Not always. It should be seamless, which in a prompt basically means it should be fast and not showing too much information. It should target the states of your system that actually change. You maybe are not interested in a state that you know from the beginning of your session to the end of the session that's the same. It should embrace the fact that some states change less often than others. So there's states that you are willing to know more often, to look at more often. And of course, those important information, they should be really visible. And I'm not defining configuration because you know what that means. So we did this extensive study again on the design across all those other prompt systems. And here again, of course, I'm advertising my work. But really, I want to emphasize that that's not a post hoc justification, right? That we really wondered whether we should continue working on a liquid prompt. And that's what came out at the end. Okay, but you don't have to believe me, but you can read still the article to get more information. Okay, so those other prompt systems that I'm making fun of with these tables, they look like that. So this is a very classical. Those are screenshots of the Omaiposh system, which is another prompt system, which is quite good. Technically, it's a good piece of software. And each line, basically, or couple of lines is a different theme. This one is the prompt with the more teams. So basically, the approach to design of all those other prompt systems are like a sequence of colored segments, which I call that segment rainbow. And of course, for a prompt, there can be a lot of information to display. If you have the feature set of liquid prompt, for instance, there's a hell of a lot of stuff you can show on the screen. I did make this theme for liquid prompt not for everyday use, but of course, if you fancy it, you can use it. That shows all the information we can actually show. So of course, there's no question. You should not display that at all time. I will not go into detail about everything. So of course, we can do better than those rainbow of segments that just change every time you do something, and it's difficult to spot where the information is. The first idea is that we can show the important information first. We can help visual passing. We can avoid this list of segments, of course. We can be colorblind friendly, of course, having a rainbow or stuff that's difficult. And the guy with whom I'm working is actually colorblind friendly. Is he friendly? He's colorblind. Of course, he's friendly to other colorblind people. I'm sorry. And also, we want to avoid text overload, which happened very often also with those systems. And another thing that I will be talking today is we can have logical sequences of information, things that are linked together while you're reading them. And I will also talk at the end on semantic threshold. But it's difficult to introduce. I will just show you. So to do that, LiquidPront comes with a default prompt, which is kind of the well-designed prompt you would think, you know, text mode with some colors, but close to the classical approach. But we did the dot matrix 10 theme, which is a theme for LiquidPront, that completely changes that. It doesn't look at all like a classical prompt. It looks like that, basically. I'm going to show you a little bit of features and explain that. The first thing is that it takes three lines, which may seem a lot, but we thought that nowadays we are using these terminals in high resolution screen anyway. So we have some room. So the first thing is that we prioritize the information based on the location on the screen. And we try to make it stable. It doesn't change of places. If you have this sequence of segments, every time an information or a state changes, everything is moving around. Here it's not the case. It's more stable, a lot more stable. So we have just to say a few, the type of connection, who is connected, the name of the machine, where you are at. Here you have the right align section that shows some sensors, the temperature, the CPU load, whatever. We have this line that separates sections. I'm going to show why we have that and shows everything that's stable. But you would want to display, like I don't know if you're working in containers or you want to know these versions of this and that. You can show them here. And here we have the big version system repository management or display with the left side for the remote server, the remote repository and the right side for the local repository. I'm also going to explain that. And here, very near to where you type the important stuff like errors. Here you have an error code that's displayed. So while this, for instance, these section lines, it's because if I display the same screen that I show you at the very beginning of the presentation, but with dot matrix, you would see that. It's a lot easier to parse, of course. And for instance, in my line of work, we're working with a big build step that takes a lot of screens on the terminal. And I'm very often scrolling up and down to find back where does that start, where does that end. So of course, having these sections easily easy to spot, it's a good help. As for the colors, you've noticed surely right now that we decided to go for these black and white segments, which of course is easy or it's color blank friendly, right, for most of the case. And we use only two colors, one for what we call notes, the blue one in the default system and yellow for the warnings or the errors. And with those two colors, it's kind of enough to display what we want as an information. And of course, you can switch depending on your color blindness, you can switch the pair of colors that you would need. We're still working on the 256 color space, so not RGB, for compatibility reasons, because there's not that much terminal emulator that supports the RGB color space, surprisingly. But then we stick with that, which in our case, since we are only using four colors, black, white and two warning colors, is more than enough. We did develop a tool that helps you select couples of colors based on the contrast. So you would just, that's kind of a side note, but you can say, I want to have a good contrast with this color, and it will find in the NC, there is very niche color space, what other color you can use in combination. Okay, so now I will highlight a couple of design features of dot matrix. I will not go over all the details, but just to name a few. The first one that I liked is that to avoid having too much text and too much icons popping up, we used negative space, which is actually actual space in the prompt that appears when there is some state that changed. So here, for instance, on the first one, there's an SSH connection, and you can see that the user is actually disconnected from the left-most side. Right, so you're not connected on your laptop, there is a space. The same goes for the root user here, which also appears in the warning color. Here, you have the user which is disconnected from the path, meaning that you don't have the right, right, right, I'm sorry, my accent is different. You cannot write in this directory. Another interesting idea, in my humble opinion, is the use of these logical sequences that you can see on the VCS section, so the last line. It reads from right to left, so you are at the right, that's where you type your comments, right, you are doing git comments. And what you can see, how do you interpret the first line, for instance, you have a commit, you have changes ongoing, that's the blue section with the numbers. That's the number of lines that have been changed against the head state. And there's an arrow on the left, meaning you can surely do a commit with that on the master branch. So that's what the second here is doing. Here, I picked up a part of the diff and I did a commit and it appears here. What this section means is that the remote repository has seven commits on its own. So you're going for a conflict here, which is you should have pulled before, you should have, you know you should pull before starting a new branch, but if you forgot, then LiquidPump is warning you. There's probably a problem going ahead, but you can still push another commit and so on and so forth. Yeah, okay. I cut the slide where I solved the commit issue because it was quite long, but you get the idea, for instance, that you have these commit paintings and if you add another commit, it would be there. And then when you push, it would just disappear. You have nothing left to do, so there's no more warning to show you. Here, what I'm showing is that we use these two colors, these pairs of colors, to give you a semantic hint on the urgency of what you're doing. So since you have something going on, some work, some diff in your repository, if it becomes too long and you can, of course, tune the threshold as you want, it becomes yellow in the warning color. And the same goes for the commit. Here, I've put the threshold at five. I know that if I have five commits, probably I want to push or pull or do something like that. Surely your commit limit is higher than mine. And that's it. So I did not talk about everything. That's the summary of all the features that we can display with dot matrix. So there's many, many, many things. I have shown about the space system around the kind of connection that we have. You have a space for a team of screen connection, for instance, the read only. We saw that and so on and so forth. And there's the diverse warning. For instance, if you're connected with telnet, there's this big yellow arrow that shows. Anyway, the same goes for shrewd and so on and so forth. You can also summarize if you don't want to see the numbers of lines or commits, you can just get one icon and be done with it. If you want to, again, in this article, we detailed a couple of thoughts on the design that you can go and read. And thank you. 10 minutes for questions. Perfect. 10 minutes for questions. So, if you make sure you repeat the question. Yeah, I noticed that. Thank you. Very nice to hear someone in line. I was wondering if you ever did, like with your last example, for when you're telling someone it has to do something that might be expressed. So do you do any user testing or at least student testing with your phone? Yeah, so that's a good question. That's actually part of the reason why I decided to do, oh, yes, sorry. Every time I saw the, why do they do it? It's obvious. So the question is, did we do user testing to test the prompt and in particular the level of stress while we are telling them to do something? So the answer is no. Because I'm not a designer. I'm an obvious designer. I just happened to discover that quite recently. And I liked it. So I read a lot and all, but I'm not a designer and I decided to come and force them in part because I thought, oh, maybe in this room there would be some designer that would have some idea and that would be able to answer my question. I have a couple of questions. What would be best? Where to display the error stuff like that? So yeah, no, no, but that would be, also because our user base is nothing compared to the previous talker, right? We are not talking millions or we are talking a couple of thousand at most and most of them don't answer, communicate or... So yeah, I have three users with whom I'm discussing basically. Myself comprised. Yeah, so to continue on the design, it's really elegant with the negative space and everything, but how do you, if there are at least some good documentation to say what each icon or each sign represents and how do you convey because sometimes I might forget what's the difference between one big arrow and two small arrows? Yeah, yeah, yeah. Well again, the answer is no. There's no good... Yeah, sorry, sorry. So the question was, is there a good documentation to remind the users what the features are and most precisely what does mean the negative space which is not completely obvious in itself? So yeah, no, there's this kind of summary and there's a readme that lists all the features and all. I would not count that as a very good documentation. Yeah. So again, that's a tricky question. I did not manage to solve that myself except by playing with less. We did a single line version of the prompt though, or either two lines, this one, sorry, two lines, in single line that's really not possible, which is a little bit more dense, but yeah, no, I did not find a good way to solve that either.
Bad UX is Bad Security: Adventures in Qubes OS UX Design
Thank you. All right. So I would like to start with this very controversial sometimes notion, which is I want to convince you all a bit that that sentence that is up there that bad UX is bad security is actually true because I get often people who tell me that complete bollocks I will later talk a bit also about cubes but I don't want to start with this. I would like to start with the general principles. So why UX matters for security? The thing is very often when I talk with hackers about security people come to me like but we don't actually need usability. People can figure it out if you care about security you will figure it out and that's not a good approach. One thing is of course it's not like security and privacy are things that you should be you should have to deserve to work for them not everybody deserves only the smart people but the other thing is it doesn't matter if it's the fault of the user if it's the fault of the software if we get compromised if we get harmed the harm is done and I would personally like there to be less harm less damage to the users and that's that means making things more usable for people taking into account how humans work how human brains work this is of course sometimes a controversial concept but we are all human here and we make mistakes like people like humans who make mistakes user errors are a real vector of attack and a very important vector of attack when we read about compromises and for example big corporations very often the initial vector of attack was oh somebody clicked on the link or somebody answered the phone somebody talked to somebody and said what they shouldn't somebody made a stupid password so we cannot just say well I did the tech side all the problems there are user errors not my department this is not a good attitude it's like if the UX for the door or for the door control process is terrible and you end up doing oh nobody can remember the code just put the sticker next to the door then the person who designed the security system failed yes people shouldn't put sticker with the password next to the door but also the person who designed the process did a bad thing this is not good and also we are not mothers and fathers of our users we should not be like oh you have to deserve this you have to work harder why are you not paying attention chair a bad user we need to treat our users seriously like adults who also sometimes have different priorities than our programs not like children because the thing is humans make mistakes this is a thing this is a this is a truth universally acknowledged we all do we will make mistakes we may have other priorities than making using the software perfectly like very few people just want to use the software as good as possible they want to use the software to do something and also the problem is our brains were not exactly optimized for using computers also controversial our brains brains have a lot of heretics a lot of shortcuts that they take all the fascinating optical illusions just tell us this our brain is not perfect at perceiving the universe and reacting to what's happening we have a lot of iffy things in our brains and this is something that we as people who make software need to take into account people also do shortcuts they do it like they want to do things fast and if you keep noticing that your users keep doing a shortcut that is for example a secure less reasonable terrible there is a need to do we cannot be just like well stop doing that this is bad bad user no no if for example people people keep people keep walking on the grass then they probably get need to get somewhere and maybe that's not how this square should have looked like you have to take this into account that people will want to get close to their goal not necessarily in the way that we would like them to do it and again even the smartest person in the room can be in a hurry you can have a bunch of brilliant engineers brilliant physicists and they may make stupid decisions and they may sit and be like yeah and it can't be that bad right this one time what could possibly go wrong it's not that terrible or something exploded oops we have to take this into account we cannot be just making the software that we make with the assumption that people want to make mistakes you want to get perfect users this is just impossible that's that's not how humans work one of the big things that I find very important for designing things security related security related processes is in attention that is we generally just notice the thing we care about we don't notice everything that happens in the background this is not a bad thing this is very useful for our brains that's called cocktail party phenomenon when a human being can actually for most humans understand conversation in a very busy room and a cocktail party because our brain is very good at being like this thing I care about all the rest not important not my thing but this is very annoying when you are trying to design a good process for security because this means that a small red blinking light may be ignored the error message may not be read because the person just cares about one thing and I really like to refer to you to a psychological experiment that demonstrates this is how humans work it's called the invisible gorilla and the experiment was people were asked to watch a short film where a bunch of people was playing ball passing ball and told count how many times the balls is passed at the end of the short film people asked okay how many times the ball was passed cool did you notice the man in gorilla suit walking around and 50% of the participants did not notice the man in the gorilla suit because they didn't care about it they were told to count the passing of the ball so gorilla was there a gorilla and that's how humans work we cannot design our secure processes thinking yes people will pay perfect attention to everything all of the time that's just not how our brains work and I like to show it on the example of the error message this is liberal office error message this is what a designer program a sees there's an error message as an explanation what happens all very useful things and this is what a lot of users see because what they want is to get to the file and there are some words and they're annoying because they are stopping them from getting to the file please give me my file so it's just a bunch of annoying red stuff and a big button that says oh go do my thing and then the person opens the file and be like I cannot edit it something's wrong what happened is there an error message and this is I know this is annoying when we are designing things and making things is just like just read the error message why are you not reading the error message people want we have to think about communicating things not just in the error messages because a lot of people would ignore them because they don't care about them in the moment the error appears okay so this is my introduction this is my introduction on human brains complicated what is the thing I'm working on this is keeps us a reasonably secure operating system we don't we don't say it's perfectly secure because nothing is perfectly secure don't use computers if you want perfect security and cubes is a fairly complicated thing it's sort of a meta operating system which means that it has a bunch of virtual machines talking to each other everything's isolated this is my virtual machine that has my devices this is my virtual machine with my work everything is compartmentalized and the thing is we are trying to make it actually usable for people because you could have done the thing of partitioning things into virtual machines manually but it would be such a pain to actually make it work cubes provides the layer that allows you to actually use it to get all the security of really strongly isolating the things you're doing but also being able to use it without writing pages and pages and pages and pages of shell scripts this is a slightly cut but mostly visible diagram of how cubes works so you can see a lot of different virtual machines called cubes because we are funny like that and there is for January for the user there is a bunch of system stuff that does all the important system things and there is a bunch of user things like this is my cube for work stuff I have my browser my liberal office whatever I have my social media cube and those two cubes those two virtual machines don't know about each other they can talk they can share things if I click on a stupid link on my Facebook account it won't compromise my work which would be very nice so that whole idea of providing this separation is very very nice but it leads to a very complex usability situation because you don't have just one operating system you have a bunch of them smushed together that's not easy that's why we are providing a lot of interesting tools to make the process of using those things together a bit easier but also to still maintain some security and I want to discuss two things that we are doing that I think show interestingly how this can be done how you can make things usable but also think about security and the thing is the first thing is copying and pasting so in the normal system Linux Windows whatever you select text you press control C or select copy whatever text goes to clipboard control file V then the text goes to a new place this is of course terrible from the security standpoint mostly there is a bunch of attacks that are your clibbert that steal things from your clibbert put some things in your clibbert that should really not be there cubes makes it a bit more complicated sorry for the slight cutoff this is some technical problem first you copy text but this lands in the clipboard of the virtual machine you copied it from and all the virtual machines don't know about it to actually move it to another virtual machine because for example on your private Facebook you found this fascinating link that you have to share with your co-worker you have to press control shift plus C copy to the global cubes clipboard and then control shift V to copy it to another VM this is a bit more complex and yes we theoretically could have done this is more easily right we could be just like us always copy everything but that what goes all the security problems that would cause all the issues where one thing could steal the clipboard from other thing that's not what we want does the introduction of this separate step it also means that when people are trying to copy and paste things in cubes between different virtual machines they have to stop for a moment and think do I need to do that this is what I want why am I doing this this is something that forces you to stop and to pay attention for a second to this process and that leads to slightly better decisions with relationship to security of course it's not perfect some people get very much used to it they get it becomes also like automatic for them yeah this is yet another step just press the keys very quickly but and that means that of course further security is still needed that means we have to provide more layers of configuration of information of what's going on we do have a whole complex policy that allows the user to configure it and the thing is there's a lot of text here and a lot of you will be like nobody reads that yes that's why we put it in the settings so only the people want to customize what's going on actually go and read it the other people probably want because they don't care but only if you actually care enough to want to learn a bit about what's going on then you go to the settings and read it and then you can specify for example what can copy to where and how to control it so it is making the process so we are making the process of copying and pasting adding this additional step to make things a bit more secure by leveraging those two mechanisms technical one but also making people think for a second about what's going on the other thing that we are doing that I think is very interesting this is current work I could say devices things you connect to your computer they are evil like a lot of them can be very malicious you never know what actually happens within the thing that you are connecting to your computer maybe it is actually a USB stick or maybe not maybe it's some more malicious device that's just masquerading as a USB stick you know it's very complicated with them and even those devices that are not evil they very often can do far too much for example microphone camera they are very powerful things they can record a lot of things that we really would not like them to record and of course our browsers our programs are swearing to us that nothing malicious is ever happening but some people don't think this is sufficient level of security and for many people well attacks can happen and we would like to be protected about it that's because that's why QPSOS isolates all the devices in their own cube and the user can decide okay my camera I want to connect it to this cube this virtual machine from which I'm making calls but not to one from work because I want my boss to have absolutely no chance to see that I'm working in my pajamas or my microphone can only be connected to this cube not to the other and the problem with devices is that the initial user interface for handling them was made by engineers and it's not very friendly there are small things there is a list of stuff a lot of complicated technical details of what's coming from where one for example one USB stick can appear multiple times for very good and sensible technical reasons but it's very annoying when you have to figure out okay which one of them is the thing I actually want to use you have a list of cubes you want to connect to which is also very small and I ended up with this and I decided to ask my users okay does this work for you is it good and a lot of people said no this is terrible because I keep making mistakes because I I want to connect for example my USB stick to my development cube but I keep connecting it to my work cube because those things are very small and it's very easy to click on the wrong one and the thing is yes it's a user error it's not the fault of the system that the user clicks on the on the wrong thing but we would like the errors to be less common I know it's a user error but I still think we could make it easier for users to make less errors good decisions and that's why we're working on redesigning it and I think this is a decent example this is not yet working in cubes this is incoming will happen very soon once I finish working on it so extremely soon we we are changing things to one provide more information which is another thing that a lot of users told me when I started talking to them actually doing user interviews like yeah I know I should know that but I have no idea which of the devices I see listed is my camera because they all have like names that consists of numbers and letters randomly maybe we should we can actually show people which one of the things is the camera which one of the things is the microphone that's why icons actually show what's happening that's why there is much more space between different options and that's why the options are actually described not guess what it's going to happen no now I'm using actual full sentences to describe what the thing is doing and yes this is basically a visual update right this is not a technical change this is not deep delve into the back end of the USB of how cubes handles USB stack but this is a change that a lot of people when they saw it all of you said oh wow now I think I will make less mistakes now this will fix a bit of me problem but at the end we will have a more secure system everything will be better for me as a user even though it's just a visual change of course some people are like and this is terrible too big why it takes up so much space but unfortunately you can never have everybody be very happy this is basically the same okay so as a final word on these two examples and generally I would like to say a bit about how to design with security on your mind if you're a designer or if you're a programmer making a secure things that want to be secure designed for human error designed for mistakes not just for success take into account that people will do things badly people will be in the hurry if you if you ever want to design a process for a thing that's supposed to be used by human imagine that your user is currently having their six month old baby yelling and they're cut puking at the one at the other side and you want to design a thing that will not completely compromise them even if that unpleasant situation does happen the things that are secure should be easy making things in security should be harder making a shortcut the shortcut the easy way should be the secure way going around because people will sometimes go around also we are open source people we like to go around sometimes so the going around the insecure way should be harder design for actual human beings don't don't think that if it's a user error then it's not our fault because unfortunately user error is also our fault not just the user thank you five minutes for questions please yes isn't it creating more friction in the process and rather than focusing on adding more layers to like force people to go to read the like all the security issues rather than that why is there not a focus on the display of the error messages that makes them read it makes the user read them more properly okay so the question is why why are the more friction instead of just making better user errors so two reasons one reason is that sometimes it's difficult to tell a part user error for what user wanted if I copied what I wanted into a wrong cube this is a user error but it's not an obvious error that can be detected by the system and the other thing is friction is not always bad we like to think that friction in design is always a bad thing but friction also forces people to stop to think for a moment and at some sometimes when we design the system so that people have to make certain choices we give them a large variety of choices but there are some choices we have to give them a chance to actually make those choices and friction allows us for the stop to make a choice I don't want to add friction to every copy and paste within a single VM there is no friction it's just when you're going outside and the friction is by design to also show that this should not be a common operation to decrease on the making shortcuts thing yes Do you have some methods to encourage users for secure behavior for example let's say what prevents me to log to social media on my work cube so the question is how to prevent users from making bad decisions security wise for example logging into social media and work so in short we don't have a technical solution for it we have just a solution of describing like tutorials how can you use it sharing the setups of the developers of the core users so like educating people encouraging people to use different colors for different environments you also if you want to do it like you can limit yourself by limiting network access of different cubes so be like okay this is firewall and cannot access Facebook or whatever we don't have a good solution like system-wide this is still a decision that the user can make it has to make also because the user needs to divide their work into those virtual machines themselves this is something that the user generally has to do No so the question is do I have any favorite examples of UX oh this is a very difficult question security yeah oh I don't know I'd say that I really like how those usb tokens pass for u2f authentication work so I really like this process which adds just the perfect amount of friction with the need to press a button so I think this is my favorite example we have to finish you
Web-accessibility for open-source privacy & security tools
Hi everyone, can you hear me? Yes. Hi, thank you so much for joining in. So my name is Rashi, I'm from the Accessibility Lab. And given the time we have, I'm going to do a little overview of accessibility. I'm not sure if you'll have time for case studies, but try to look at a few aspects of what can strengthen accessibility and of course, I'm going to ask you, and you can also ask me some questions. So we look at promoting the digital rights of accessibility and people who have different disabilities, which we think should be vital to anybody. We look at accessibility as an ability to be able to function, someone who can operate the different devices we have, have access to the different versions of browsers, and basically looking at the least part of resistance, eliminating different obstacles when transmitting, receiving, or understanding information. We also get, one of the things we also see in the open source conversation and around is a lot of people like to interchange and look at accessibility and usability as one and the same thing. But they are very different. The way we operate is very different. Although usability, we have a lot of allies in the usability space, but accessibility in hand with usability does strengthen and provide an extra level of security or privacy or any of those factors that people look at. We also look at offering users and customers with a good user experience because accessibility for people who have different abilities fail to consider people with different disabilities, whether you have a motor disability, you can have a visual disability, you can have a physical disability, and you can also have a disability a lot later on. I see my parents aging and a lot of them also have disabilities with a lot of aspects in the visual world. We also know someone in our open source space who was visually challenged for a very long time. This is a slide we usually use based on who we are talking to. So we talk to private sector, we talk about how it has competitive advantage and how it can help with SEOs and positioning and compatibility. Also depending on which region you are from, complying with legislation because some of the legislations are pretty strong if implemented regionally. Our accessibility, we look at increasing the number of users in keeping the ones that we have, whether it's providing accessible online services, providing a sense of autonomy. This is a project that I worked with, carries around. We are just going to show you some aspects of how people with different visions see it. This is normal vision, this is someone with humanopoeia, macular degeneration, diabetic retinopathy, glaucoma, tritinopia, protonopia, acronotoposia. This is looking at an overview of the different lawsuits that we have seen. This is data from 2022 on the lawsuits that have been sued, plus action lawsuits, so it can be from e-commerce websites where companies and services have failed to provide basic accessibility or adhere to basic accessibility guidelines. You can also see food security, you can also see education somewhere but education should also be somewhere here. Basically people who are challenged, most of them like to enjoy a personal social life to be able to access things they want given that they have so many challenges for mobility. Websites have been the primary target. I know that we live in an app culture but a lot of people with disabilities like using websites. We also look at understanding different personas. We work a lot on providing audits for websites or it can be applications. One of the reasons we were not typically in the open source realm but the reason why we got into the open source realm because we realized that a lot of open source teams, although they care about a lot of the social conscious aspects of accessibility but they might not have the resources. For example, we have a long standing partnership with Open Technology Fund where we provide pro bono services for applications that work on countering surveillance or enhancing the privacy and security of these tools that look at censorship and look at countering it in an effective way. This is a simulator. How many fuel have you heard of? Do any of you all use... Is there anyone who has an experience with accessibility here or is there... Maybe I'd love to kind of maybe get an on-site view of... Some of the apps in Linux are really inconsistent. NVDA for example and they use different... I don't know, it's... No, no, it's absolutely fine. Go ahead. Is there anyone who uses the screen reader or analyze? Go ahead. Yes. Is there a day only in a terminal or just, you know, a situation to server? And I know there are some current stuff work going on to fix some... I know like an office doing... Working on their... It's not... And such to make things... To set standards and make that patient call like a... Right, right, right. But it's like still not... I think a while. Are you someone who also works with these tools regularly? Is it something as a part of your employment? Are you looking to get into it? Well, I don't use the tools currently because they're not accessible. But I definitely don't... Okay, thanks so much. Is there anyone else? Just one, we have one last one. Anyone who would like to chime in or I can just move on. Okay, so these are assistive technologies. We have Braille displays and switches. I don't have any, but we also have a few keyboards that are Braille friendly. And of course they're also catering to folks who... People use different switches. We might have a disability with the hand. Just showing you a few visual examples. There'll be eye tracking devices. Can I show you a few more? Yeah, screen readers. Google Talkback and VDR Jaws. There's anyone who's had experience with that. And I'll also move to legislation. I know I'm in a tech conference, but I think legislation is also an important aspect when it comes to pushing ways in which... And also improving everyone's information about what's out there and what's really happened. Yes, we did have the convention, the UN convention on rights for persons with disabilities that acknowledged it in 2005, 2006. But again, it doesn't necessarily mention anything about web accessibility. It talks about how people should uphold the rights of people with disabilities by providing them with dignity, respect and livelihood. But nothing specifically on the web. But it was still an important aspect which was signed together by different UN countries at the New York headquarters in 2007. Yes, the US is... I feel like it's an example where a lot of the aspects have happened in a very reactive measure. Because there have been reactions of lawsuits and people just suing different people which is what has led to the ADA Act being so strong when it comes to finding organizations. And I'll perhaps go a bit... I'm not going to go into the details of how it came about and what happened, but I can go ahead and talk a little bit about lawsuits. And there have been hundreds of them. Target has been one which had a class action lawsuit damage of 6 million. You can also see Bed Bath and Beyond, Home Depot, Domino's Pizza. Surprisingly, a lot of educational institutions as well. Harvard, Massachusetts and Netflix. And for something very basic, for not providing close captioning. Even when I fly now, I would still see a lot of the entertainment content. Only 5 to 10% of them have close captioning or any sort of captioning that makes it disability-friendly. What I do will know that I do have a local airline that I enjoy flying a lot which finally had sign language interpretation and translation. But yeah, there really is a long way to go there. We've also had... Oh, okay. So I have 5 more minutes. So yeah, in 2014 FedEx was also sued by a US Equal Employment Opportunity Commission. And in 2005, you also had the National Association of Deaf that stepped in and supported it. And FedEx actually came back and said that we're so sorry. We don't actually have the money to be able to comply with these aspects. But then anyways, they had to pay damages in perhaps a few millions. But I mean the idea, what I'm also trying to get at is that a lot of these companies have the opportunity to be able to make amends but they don't. And yeah, so a lot of... And perhaps a lot of the culture also... You have a lot of other governments that are following suit. You have Canada. You can see as an example of the Accessibility Act which they passed. But it only complies with government or the public sector that has to comply with the basic level of accessibility. You can also see that in the EU where you have a lot of the accessibility requirements for ICT products and services. They specifically... I do believe they have a provision where they mention certain aspects of accessibility. I'm just reading the clause. They do have access of accessibility where they talk about everyone adhering to the Web Content Accessibility Guidelines 2.1 level A for clarification. They also talk about... But again, it only applies to public sector bodies. So anyone in the non-profit space, civil society or private sector gets by. But then what are these content accessibility guidelines? So they basically... Again, when we talk to you, we're not doing anything new. We're not reinventing the wheel. We're just adhering to things that have always been there. The World Wide Web Consortium has a Web Accessibility Initiative which anyone can join. Anyone can sign up. They create content. They also create... And they for a very long time... The Web Content Accessibility Guidelines 2.2 was passed in October. So they've been around a long time from the 1980s. They build curriculum and they build directives and they will guidelines on how things across the world, Web interfaces can be more accessible, understandable. So they have... Currently the guidelines has four principles, 13 guidelines and three levels of accessibility. And I can talk about... We won't have time to talk going about into the principles and guidelines, but then we can move a little bit into examples of what is A-level accessibility, AA accessibility, AAA accessibility. Most of the public sector government websites across the world, whether it's ADA or whether it's the Canadian legislation, or even within the EU, they have to adhere to AA guidelines. And in the US, I think with the ADA and with the implementation of the ADA, the legislation is such that you will get fined until you're able to achieve that specific criteria. So one of the... I know this is... I know it's a cheesy example, but this is one of my favorite examples. This is the accessibility statement from the White House, which talks about their ongoing accessibility efforts and the criteria that they follow. And how people... And they also mention the different disability aspects, whether it's sensory, cognitive, vision. I think it's a great example. So yeah, I'm going to talk a little bit about... I'm sorry for rushing, but then I just want to ensure that I can also show you some examples. But yeah, we talk about four tenants under the content accessibility guidelines, a perceivable where we want our users to be able to perceive and are able to look... Are able to sense, use one or more of their senses to be able to see the content, whether it's... And then you talk about operable where users should be able to also control the UI elements, whether it's buttons or using a keyboard or a mouse. And robust is the content must be well adopted to the current standards that we have, whether it's functioning across different browsers and applications. And understandable is, of course, the content must be comprehensible to its users. So some examples. This is alternative text. It's a level A example where it's a textual substitute for a non-text in web pages. It focuses on images, but it also applies to multimedia and other content. So it also serves many functions. It can also be seen with screen readers that announce alternative text in place of images, helping users with visual cognitive disabilities. And also in cases of places where you don't have a high quality of service or low bandwidth with internet. If an image fails to load, then the browser will represent the alternative text visually in search of it. Okay, I'm going to end. So this is an example of alternative text, closed captioning examples. I'm going to end with this. Can I play a YouTube video? I wanted to show them an example. You cannot play audio through the system. You can try and turn up my laptop audio as high as it will go. Okay, can you mind playing this? Because this is the example of audio descriptions. Audio description is an example of... Audio description, this is again level A. Maybe you can hear. There's no volume though. Oh, give me a tissue. It's got my Bluetooth headphones. Oh, no. It's fine. I think it's because it's going through the HDMI. I don't think you're going to get audio. No, maybe through this one. Yeah, you could maybe play and turn the volume all the way up. Yeah, one second. Because it's, you know, in the... How do I... Oh, no, I get it. I get it. Yeah, yeah, yeah, right? Okay, enough. Wait, and this is... Do I just play this and they can see this? I can now, they can hear them. I will... Okay, ready? Go. Okay, go. Can you hear? Hello! He takes a deep sniff. The snowman smiles and moves to Walter. The hatchet is running on the spot. The wind up full-fledged his chin. The snowman leaves his arms a clutch. The reindeer puddles his front legs. Head over heels, the snowman pulls the... The reindeer does the breast drop. The snowman rolls his body with flips onto his back. The reindeer's tongue is just to the ice. The snowman holds his head, twig on, the reindeer nips, target the carrot. The carrot flies off and lands on soft snow. The reindeer goes after it with the snowman at his body parts hanging on his tail. The snowman puts himself back together again and the only point of it is his nose, this state. The reindeer jams the carrot back in place at hands like a crowd of happy people. The snowman plans if he sticks the other end then goes to sneeze. The carrot is nose with both hands. His head shrinks off first. That's my turn. Okay, and with that, thank you so much. I'm just gonna... Yes, thank you so much for listening. I'm sorry for being a bit rushed, but if anyone has any questions... APPLAUSE Also, if you are an open source tool, we'd be happy to do an assessment and audit. We even do trainings where we go a little more deep dive into success and fail criteria of what goes on. Please do reach out to us. You have my QR code here. Hopefully it's easier for you. But thank you so much for listening and joining. And if there's anyone who has any questions, maybe do we have three minutes? Anyone at all? Yes? I just tried. There is no access to the detract. For the first time, maybe you or someone in this room may want to organize one next year because as someone said earlier, I think we're presented. We are all... As we get older, we will all get more and more disabilities. And actually, like most of them have already had one. Which is pretty obvious. I couldn't drive the night because of my glasses. So I wonder if next year we could have an accessibility trial. That would be really nice. Yeah, if you have that. Thank you for that. Thank you for the recording. Just what the question-asker said. No, I said we'll be happy to look into it. Can you hear that? I said we'll be happy to look into that. Thank you so much for the suggestion. Our idea is that we want to build a culture of accessibility throughout, especially in the open-source tool space because it's free, it's easily available, and also abiding by the principles of making it more accessible for people with different disabilities. Yes, please. Well, this is not true just to your presentation. Well, in general, since you mentioned that, I want to make the open-source community more accessible. One thing I noticed is that every single talk that I've been to today is that I'm not sure, like, I can hear what's going on, but then the slides and the presentations are not, like, loaded ahead of time, so then there's no way I can follow along. I've just played, someone just be like, oh, yeah, you can look at this and that, and I'm like, what is this? One big issue there. And I know that they saw the news out there after, but I really don't go back to it to look at this one again. We'll be less sloppy next time. Thank you.
Penpot 2.0 is here!
Change the slides. Magic. So we have Pablo from Tenpot, talking about Tenpot 2. It's here. It's here. So excited, so excited to be the last talk of the day. Also, there's some nice free chocolate. Not called to action yet. Just free chocolate there so that on your way we exit. But we will have a bit of a birthday party here with all of you. Because we're turning four today. So yeah, we have some waffle here from Brussels instead of a paella from Spain probably, that way we're using. But basically it's very important, very exciting. Every time we come to Fosn, I think my first Fosn was 2005, 2006. But it was only four years ago, 2020, that we announced this was going to happen. So every year we come here and say this is something new, and then we have Alfa and then Beta and then 1.0 and then 2.0. So very exciting. So I'm going to take a bit more water because of the excitement. So we're going to discuss PEMPO 2.0 and then it's time to meet Hanson Demo. We'll see how it goes, the staging server, and the Wi-Fi. So for those of you who might not know about this PEMPO, PEMPO is this open source design platform for design and collaboration, code collaboration. And we like to discuss, and this is very, very relevant for the open source design track, design and code collaboration. Perfect talk by Ariel, perfect takeaway also for PEMPO 2.0. So we believe we bring design freedom for our product teams, and we do so in various ways. The fact that PEMPO is open source, definitely a key ingredient. It gives you privacy, security and customization. You can hack it, you can do it whatever you want. You can use cloud, you can self-host it. We are pro open standards, so that means everything is SVG and CSS native. We make sure that we're not creating yet another proprietary format. We want to have this sustainable design and sustainable collaboration with code. And we do believe that it's important that whatever tool we build has to bring something that was not present in existing design and prototyping tools in the past, which was this collaboration between designers and developers. It was felt similarly as some good tools or code tools that are not welcoming to designers. Similarly, design tools, design prototyping tools were not welcoming developers. So what if we fix that? That was the whole idea behind Gradient PEMPO. The next generation of design tools should be about collaborating design code. So this is like the basic intro on PEMPO. But we are here to discuss PEMPO 2.0. This is a major release. You could call it PEMPO 2.0 or PEMPO 10.0, because it's just a massive change in just one year. So we're going to cover UI redesign. We're very proud of that. The new component system, wonderful new inheritance and overriding and all that stuff. CSS read layout and some of the stuff. So let's see how PEMPO 2.0 UI redesign looks like. Like this. No, that was 0.2. But that was only four years, five years ago. It is elegant. It is simple. Why frame me? Of course, the reference. Anyone gets the reference of the picture? No? Willow? Willow fans? No? Ah, I said, ah, Willow fans. Yeah, that's my age. And so, no, this is PEMPO 2.0. Look at this. It's very fancy, right? I would like to have the light steam, you know, but perhaps at this time that would not be smart for me to ask. This is just wonderful. I mean, this is just a design that is being created with a beautiful interface. Because open source and, you know, beautiful go along. What was behind this, the whole UI redesign? Well, this is a design tool, design prototyping tool. It has to be interactive, it has to be real time collaboration, it has to be multiplayer, and it's a productivity tool after all. So we needed to reduce the cognitive load. So, it's so tempting to make many things be achieved differently. So, in terms of real estate and how you would achieve things, goals, we reduce the cognitive load through heretics and through research, and just intuition sometimes. By the way, the picture you are looking at is a portion of our design system, which is completely available. I will show it now in a minute. We also improve accessibility. We, strong believers in accessibility should be a de facto standard for everything we do. It is absolutely challenging to include all accessibility in a design and prototyping tool, since it is very visual and it's very a complex tool and has a lot of micro interactions, and we already discussed cognitive load. But we try our best for the size of the team that we are, you know, just 15 people in the broad team. And still, we do want to pursue that. So, major work here was color, of course, and typography and size and relative shapes and that. So, pretty basic, but still, I think, worthwhile. We will continue to do that. Of course, you should be able to use Pemport to design accessible UIs. But here we are discussing Pemport itself as an accessible tool. And I think it is beautiful. I really honestly think it is beautiful. Probably one of the best, most beautiful open source tools, but also one of the most beautiful tools. Okay. I think it's also, sometimes it's just about pride. And why not, right? So, here we are showing just like a crop. We are going to see just the theming, dark theme, light theme. As in case you are, you know, you are fans of one or the other. It's not important what we are showing, it's just so that you can see how different Pemport looks now that we have support for both dark and light themes. And of course, you could create your own theme, whether it's corporate theme or just some other theme, because now we have the possibility of having n themes. We just created the two most common, okay? Before I go into that, why did, what? Okay, that's probably, okay. So, is it, yeah. So, here, that's for later. So, you can actually enjoy our design system as a library, if you want. I mean, this was meant to develop our own UI, but if you like it so much that you would like it to inspire your UI, why not? So, we have many libraries and templates available, thanks to our great community that, you know, continues to provide amazing stuff for everyone to reuse. This also will be available, and I think it's pretty cool. So, it follows the typical design system pattern, so, and all that. So, we use that, okay? Yeah, okay. New component system. This was, a ton of requests, basically, had the underlying theme of new component system. For those of you who are not familiar, basically, it is now a thing in design, I mean, not now, but like now in the past few years, to make everything highly reusable. Similarly, as we developers have thought about how to code. And so, we actually, part of this design work has borrowed terminology and abstraction. Abstractions from the code and engineering world into design, because it works. Design is also a science, and so, it is easy to borrow those concepts. The, what we wanted was to make it easier for everyone to build, like, the main components, like the original elements that are like the ideas, like the ideas of the components, and then very easily track the copies of those ideas. Pembot 1.x did not have this metaphor. It was much more abstract, and you had some trouble finding where the ideal component, or the master component, the parent component was. Now, it has a kind of physical representation, sorry, not physical, but, you know, what I mean. And it is easier to track those components, the main version, and then follow their copies. And that comes with all sort of very cool ideas about inheritance, overriding, overloading, and also using a copy to reset the ideal. If you are so happy with a copy that you think every other copy should now follow this copy. The way you do that is that you basically reset the main component through that copy. So, by the way, Riz of Hans, who here is a developer? Okay. And who here is a designer, or does the design? Okay. Both. Both, both. Yeah, yeah, yeah. The question is, not exclusively. So, then I have a call for action for you developers in the room. The proprietary design tools are coming for you. They're coming for you because you represent ten times, well, here, much more, but you represent ten times the market size of the designer world. So, it's now obvious for the proprietary design tools, design proprietary tools that you are the next in line for milking, being milked. I hope you have strong opinions about that. Also, the updating workflow is now much obvious what's going on. The synchronization, and we'll show, I hope we'll be able to show that during the demo, is obvious. I mean, when you are synchronizing things that are there, right there, simultaneously, that is very obvious, but also, and I don't think I'll have the time to show this during the demo, when things are synchronized behind the scenes, okay? You get notifications, updates, you can decide to dismiss some synchronization, and perhaps later on, now is the good time. So, that has been improved. And then we also have new capabilities, very obvious ones, very tangible ones. Annotation, which is, okay, I'm going to document this component, whether it's the main component or the copy of that component, but also the swapping, the quick stopping. Because when you have everything as a component, you sometimes want to swap that component for another one that is also capable of taking the role of that component within that context, okay? So, here's a very simple example where you have the main component, that's a very simple landing page. The main component is the one top left, right? You know that because it has a specific legend on top, so it's very easy to spot. And then the rest are copies, and the synchronization is instantaneous. This is really like capturing what someone is doing on Pembroke campus. This is a very simple example, of course, but it's good for animator GIF on a presentation here, okay? So far so good, right? Yeah, yeah. This is a component swap, and I was discussing a minute ago. So, here we go from image gallery to image gallery, but with title and description. So, basically, someone decided to have different components that could fit into, in this case, this is an app. Looks like an app. But what if I try this, or perhaps in a different context, I want to show different stuff. For whatever reasons, you should be able to have your components easily be swapped. And, of course, this is easy to navigate. Here you can see, you pay close attention. You see there's the content. The content is basically an arbitrary categorization that the designer or the user used. But you can then go back one level and find everything in your component library. This was just to show just a small list, okay? Very good. And, wow, we have CSS Grid Layout, or CSS Grid CSS Layout. Not Grid Layout, because Grid Layout, of course, we had that from probably 0.2. Grid Layout is the print media standard of columns and rows. This is the CSS. Why is this so important? Because this delivers on a promise. We said if we really want to unite designers and developers around one language, what if we're able to bring the code language, the expressiveness of the cloud-tip programming, that is, in CSS, natively into building a design, without using the code, just using a user interface. Okay? See some people saying, aha. So this is a complex theme. Probably it would deserve its own, I wouldn't say, Fawcett and Track, but perhaps a talk, which is the cloud-tip programming, the cloud-tip design. So this is about if you want to read more on this, just stick with the cloud-tip design. And the cloud-tip versus imperative is about expressing rules to get to a point, but not exactly how I get to the point. And CSS is perfect about that. Because the browser understands the rules, tries to get to a goal their way. So when you're designing for the real world, one could argue that imperative design is problematic. It's not fluid. It's not reactive. It is limited. But the cloud-tip design is able to be okay with a fluid world uncertainty. And CSS embodies that very keenly, finally, after the specs of CSS Grids or Grids CSS layout came in 2019. So it's very recent. And for an open standard, that's very recent. So we started with FlexLayout, which is about alignment that was present earlier in Pempot. So Flex is about one-dimensional alignment. But Grids is about bi-dimensional. So with both, and you can combine them the way you want, you have almost total freedom. You can do all sorts of compositions. You know, Flex, Grids, and you can nest them the way you want. And Pempot was able to build that natively. For the first time ever for any design tool, we decided to trust the code standard instead of creating our own interpretation of how design should be created with new vocabulary and terminology. So this is very opinionated software building here. So here you are seeing edits in a grid, again, cropped and very simplistic for the sake of, and you can see where, if you're familiar with CSS, you're basically seeing CSS visually. And you can see how the code next to the UI is automatically being updated because it's synchronized by design in a way. We actually started with the code, created the user interface, and it is trivial for us to impact the code. This code is part of the Pempot's user interface. You can go to inspect code and you can see that. I just pasted there, you know, synchronously. So it gives Pempot users the possibility of the CloudTip design, which is amazing. And all those YouTube tutorials about you designers you know about CSS, this is the code tutorial you, it's easy for you. You just follow this code. No need for that anymore because you can just use your visual language, knowing that it is expressed as code instantly. So I would rather, I mean, I would like to ask for a round of applause for the team to get us to this point. Demo time. So this is, I hope you can see it, yes. So this is a very simple design. I'm going to just select this and make it like this. So this is a, I mean, don't pay, this is a bento design, it's trendy, that's not important. So this is a grid. I'm going to actually edit it. And I'm going to just go and add a column to the right. Okay. So notice that we are using FR units by default. You could use whatever you can do. Go out, auto, pixel, doesn't matter. It's fine if it's like this. And by the way, I forgot to duplicate this file, so I'm messing with someone else as by now. We have a limited undo. So, yeah. And then what I want is to pick this one, this element, and I'll just put it here. It automatically understands this lot. This one, I'm going to do something different because I'm going to create a component out of this, so Ctrl K. And I'm going to duplicate this component, so Ctrl D. Now I have a copy and I'm in Components. You can see that, sorry if you cannot see it very precisely, but there are different legends there on my console. So what I'm going to do is I'm going to just move it here. And you notice it doesn't react really to the fact that there is more space available. And this is a reactive design, so I want to do that. So that's easy. I just select it. This is a copy. And I can go here and just... No, no, no. This is not going to happen to this demo. Okay. Just one, just a mouse. Just a mouse. And I'm using a trackball. It should be easy. It's great that I feel the... Everybody stop breathing. Okay. This is a certain level, you know, whatever, precision. So here I will go for just use the space you have. Okay. Totally... But notice, notice, there's more, there's more. Notice that the main component did not react because I overrided, I overwrote this attribute, which is fine. But if I go to the main component and change, perhaps let's go for something silly as the fill here. Okay. And I changed that, then the main copy does react. This is the synchronization that I was talking about. So I'm going to use something like this, I don't know. That's it. There's more. Because... And this is something that happens a lot. I go to the copy here. You can, of course, navigate all this, but if I go and select button and I change the fill, yeah, like this, let's say, something like that. And now we all praise to the demo gods, okay? I can decide, okay, I like it so much that I'm going to update the main component. Okay. And I update, and that happened. And now the main component, if had all the copies across not only this file, but elsewhere, if I use this as a library, would have the notification that that main component has changed. Do you want to apply those changes to your copies? This is very nice, right? And so to finish, because I know I'm out of time, one last thing. I have here a code pen. I always like to end with something like this outside pen pod. So I can take this, I can go to inspect, I can go to code. All this is there for you to enjoy, to fund, everything. So I'm going to copy the CSS and just copy this, okay? It's going to take a while because there's a lot of images that depend on the Wi-Fi. We now have HTML on top of SVG, so you pick what you want. We don't care, you know. It's everything is, as long as it opens up, that's fine. We copy that. So what this is doing is, since we are, if we are telling the truth, you should be seeing, the moment it downloads all those base 64 encoded pictures, the design, let's see. Yeah, that's what I'm trying. I have Wi-Fi, it works. Yeah, this is, well, I'll send you a link. But basically, this is basically what you need. You need the HTML, the CSS, and it's built exactly to your, to the perfect standards. Because nothing I did was not possible using the CSS expressiveness. So there's no way you're going to mess it. It is one-to-one perfect match. So that lost in translation, that back and forth issues that typically designers and developers, you know, express that they're having, very frustrating, doesn't happen with Pembot. And of course, this is real tank collaboration. I'm just single player mode here. But so, so quickly to finish, we saw the UI design, new components, this is right now. We have some other cool stuff going on. And the question is, when Pablo, when do I get Pembot to buy it all? It's coming, it's coming. Wouldn't it be nice if we had it today? We have a staging server. If anyone interested, come to any of the Pembot team members. We can give you the security URL, which is basically quite simple. And you can try it out. But it's in the next few weeks, basically. So we're aiming for February. So it's still forced in month. And so very, very soon. So thank you a lot, the team, the community, and everyone. You can find more stuff there. Thank you everyone for staying up to now. And I hope you enjoy all the work that came from Pembot to Bueno. And now, before we leave, or where all the track ends, anyone has a lighter? I do not have a lighter. So. May I steal the light? Yes. You didn't sing happy birthday? Yeah. Okay. Hello. How's it going? Oh, yeah. All right. So it's so exciting. This is our event. It's basically how we were born. So it's very exciting to do this. So I wish everyone wishes something nice for their open source project for Pembot, for Foslan, for the community. So it is like this. Yeah! Take it. It is chocolate. It is chocolate from Pembot. Thank you very much. Thank you.
Open Source Firmware, BMC and Bootloader devroom - intro
Okay, I think that we can start. It is my pleasure to welcome everybody here in behalf of our team who is organizing open source firmware BMC and bootloader dev room. Today we have fully packed schedule with very interesting presentation about bootloaders including grab, U-boot, system D also about security in this area and also we will be talking about secure boot, DRTM and other stuff. Unfortunately we are not able to accommodate all presentations. We got much more interesting presentations unfortunately due to lack of time in our room. We are not able to accommodate all of them. Some organizational stuff as I said the schedule is fully packed so we have to be careful with all presentations and all presentations will be stopped shortly just before the end. There will be some time for questions around five minutes. Also we have after the FOSDEM and meeting which is organized in the funky monkey pub I think. We have some places reserved there so we are invited to meet with us there and to discuss all topics which we are talking about during this presentation. I think that's it from our side. Peter do you want to add something? Yes I forgot about that. Thank you. Any questions before we start? Yes, Mike is only for recording. Yes this is confusing. I was confused yesterday during one of the presentation but this year this works in that way that the Mike is used only for recording. Anything else? Okay so let's start.
Open Source Firmware status on AMD platforms 2024 - 5th edition
Welcome to my presentation about open source firmware status on AMD platforms on Fosden 2024. It is a fifth edition already of this presentation. For those who don't know me, I'm a firmware engineer at Friendep. We are based in Poland. We do open source firmware stuff. I'm mainly interested in core boot, advanced hardware features, security, stuff like that. I'm a maintainer of a few platforms in core boot. Sorry, I'm full. You have to stop. We don't have the devices. Two of them. So we have full. Two of them. Yeah. So yeah, please have a seat and we will continue quickly. But a few people I think. Yep. There's one more here. One more here. Yep. Okay, so yes unfortunately. There are like two people in the corner. Yeah. That's it. Sorry, excuse. Okay, come on, come on. So for those who don't know Friendep yet, we are doing various core boot stuff. UFI, FWPD, Yocto. So you may always also find various contributions in those projects as well from us. And the platforms I will be like mentioning throughout this presentation are mainly on this slide. This is like kind of glossary. For the terms I will be using throughout this presentation. And those processors or micro architectures are either currently now supported in core boot or where supported in core boot up to some point of time. So if you need to please go back to the slide. I have uploaded them already onto the system. So you can always check on that. And let's start. So a little recap from the last year. In the January 2023, we had another release of core boot, which happened to deprecate a few more platforms based on AMD silicon. We were not fulfilling certain requirements about the code quality and drivers and interfaces that were being also deprecated by this release. So like we lost a couple of platforms like PCN, GCP-1, Lenovo AMD laptop, the G505S and others. However, since then there were no more removals of any AMD boards yet. So that's kind of promising because all that is left right now is like quite modern hardware. So I don't think it will be like dropped very soon. Okay, to also recap the recent status about the mobile processors of AMD in core boot. So last year I have talked about the patches that were sent to the review by Starlabs. Apparently they had their own design on AMD processors for laptops. However, since then there were unfortunately no updates and I haven't received any information from Starlabs about any plans or status of it, unfortunately. For those who track also the other developments like AMD Chromebooks, the AMD, Mendocino and Phoenix are still in development, but the FSP binary, which is responsible for the whole slickensization for Mendocino has been published, but not yet for the Phoenix. And of course the publication intervals indicate that it may happen quite soon because we had the interval between season and Mendocino about five months. So right now it's like passed about... So it should quite soon happen, but I'm not sure about the release dates. But yeah, the difference is that the Mendocino is like Zen 2 architecture while Cezanne was Zen 3, so it's also quite... not so straightforward about the release dates because Zen 3 is like a newer architecture, but then there seems to be some kind of update to an older architecture. So let's continue with the Corbuth status on a little bit older and newer platforms. So we also had the initiative to bring back the ASUS KGP D16 platform. We have been trying to upstream that code that we have rebased to a newer Corbuth revision. However, we have received a response that it will be like too much work to get it back and probably there is like no manpower to actually review the whole of the code. So we decided to try to redirect the funds we have received from Immune 5 for the KGP revival to offer some additional features based on the Corbuth Dachar that we released for the KGP. However, there was no response from them, so this project is kind of stalled. But yeah, let's leave the bad news behind us and let's move maybe some forward with some more positive news. There is also an initiative by an individual with a nickname Hanetser. His name is Marty Plummer and he decided to port MD-FSP to a desktop board. He is doing it in time as a hobby. He has had some successes with CISAN-FSP. However, he has success with Picasso-FSP, so the older micro-architecture than CISAN. He could sort of boot the platform, but of course there are still some problems to solve. MD-FSP for example can initialize only soldered down memory, so if you have a platform with typical memory modules, it is kind of problematic. When he tried to use some kind of newer processor with CISAN-FSP, he faced problems where FSP had CPU IDs hard-coded in there, so he possibly couldn't get past the insertization for processors that were not intended for use with this FSP. Also, there is also a problem with the PSP binaries that are actually published for the Chromebooks, so the mobile processors. These PSP binaries are specially crafted for Chromebooks and the verified boot, and they might not work well also with something that is not Chromebook. And the hardware that is not Chromebook, because apparently there are some configuration fuses that are distinguishing like a Chromebook device with a non-Chromebook device. But you have also some new initiatives which are much more promising than hacking with FSP or old platforms, and what I mean to say is servers. Something that was my many probably considered almost not possible not quite long ago to have open source firmware on servers. There were moves from Intel to make it happen, and we probably saw some FSPs being released on EDK2, Tynocore. We have had efforts from 9 elements that were porting some servers on Safari Rapids, for example, but that still uses the old, good-known Intel FSP. What AMD thought about is like entirely new approach. Because the model of FSP is like very, very costly, because they have to port the UFI reference code for their silicon into an Intel FSP format, just like constant work of rewriting, adapting the code, testing, and to be honest, it is not like maintainable and scalable approach. So what they come up with is the open-sill, which stands for open silicon-icization library, which is fully open-source silicon-icization code for AMD servers. This project has been announced on OCP Summit in Prague last year, and the initial plan was to show a proof-of-concept on a general platform. So it is current generation AMD Epic Server processors, and we also have a working Corbuth proof-of-concept, as well as EDK2 reference code as well. If you want to know more about open-sill, I also encourage you to watch the presentations from the OCP Summit or from the OSFC conference, so they are covering in more detail what open-sill is. So let's try to summarize how the current state of the open-sill Corbuth looks like. So I did a quick round and tried to build the general reference board Corbuth binary with these few little simple steps. And just to show you a few statistics, there are still of course some blobs that are needed, like the PSP, there is no way around that, and they are still quite heavy, like it's like four megabytes, as you can see, the APU, AMD APU firmware. But comparing to Intel, where, let's say, current generation desktop has one megabyte blob of microcode, four megabyte blob of management engine, and another one and a half megabytes of FSP, that's already like much better situation. But at the end of the build, we are informed about some missing blobs, which is the APCB. So the Agiza configuration block, if I recall correctly, it is the input information for PSP, how to train the memory, what is the topology, how to find the training parameters, and stuff like that. I have later checked that these blobs are somewhere present in OpenSyndery repository, I think. So I don't know why Corbuth hasn't integrated them, maybe they already did, because this presentation is like two weeks old, so things could happen in the meantime. So I think it is doable to get those APCB blobs from OpenSyndery for sure. So we have like a revolutionary approach for AMD for OpenSource firmware on their silicon, but what can we expect in the near future? According to official AMD information, the OpenSyndery is going into production mode around 2026 with the server processors that will be released that year. Currently now it is only a proof of concept code, so it is just for evaluation only, and you can use it for personal purposes. But what is more important, that AMD plans to expand all market segments of their silicon to use OpenSyndery. So in the coming years we will see all possible platforms that could run OpenSyndery. So basically we returned to the golden era from like I would say 2000 something, where AMD were releasing also full installation called 4-Dale platforms, where everybody could actually make a fully OpenSource bios firmware for AMD platforms. So that is very reassuring and exciting news. I have also got information from AMD employee about a new library, something like that, which is called the AGSA compatibility layer reduced, which is a wrapper on the original AMD UFI reference code that can be integrated with Tyanochor EDK2 to boot a Ginoa server platform using UFI firmware. So it is very Ginoa specific, so building it might be a little bit tricky, even for experienced developers. It is quite fresh, so don't expect a rocket high quality of it. It is just an initiative done in probably some free time by one of people who sits together with us today. Feel free to try it. I haven't yet had time to look at it, but it is there. It is also public repository on GitHub. Okay, I will speed up a little bit because I am running out of time due to those disruptions in the beginning. PC engines, probably most of us know what these platforms are, what this company were doing. They were supporting Corboot for many, many years. And we see more interest in this platform. We are going to launch a product based on Corboot with the Shara, where we will offer the standard features that we offered with the standard PC engines firmware, but we will try to use the UFI, so we will provide more security. We will have secure boot, we will have setup password, TPM 2.0 support, measured boot, verified boot, stuff like that. And it also will be available as a subscription, so anybody can donate us to support us and make the development happen for this platform. Because for the past few months it was quite neglected because PC engines ended the official open source firmware support. There are also efforts by Felix Held from AMD, which also did some work recently in upstream for this board in free time. So this platform will still be alive for those who are fans of PC engines. Change boot. We also have dedicated talk for that, so I will just only briefly mention it. So we are expanding the possibilities of launching the operating systems with dynamic root of transfer measurement for AMD platforms. We will cover the UFI boot mode and booting Zen with QPSOS using Anti-AvenMate. So I encourage you to come to Mach-A's talk, which will be at 20 past 4 p.m. in this room, so you will get more details about this initiative. And I mentioned Densharo, but possibly not all of you know what this is. So this is like our downstream Corbo distribution, which aims to make the firmware more approachable for end users and regular people. So we aim to provide a validated pre-builds of the firmware that are known to work and which are to help spread the open source firmware usage that way. We also offered the subscription model, which allows the subscribers to integrate directly, for example, with developers, request the features, have the most recent updates earlier than regular public builds. Also, they are given a special treatment in terms of newsletters and stuff like that. So a little bonus at the end, which is might also be interesting. AMD also published the AMD secure encrypted virtualization firmware to GitHub. This is the firmware that is running on the PSP, so it is very revolutionary publication, I would say. Because till now nobody released any parts of proprietary co-processors to public. I haven't heard about Intel releasing any parts of Intel ME code. So again, AMD number one. And I would like to give special thanks to Paul Grimes, which is with us here from AMD and Fakesheld for the insights, review and suggestions to the presentation. And yeah, I'm open for questions if we have time for that. Thank you. Do we have time for some questions? Yeah, three minutes, I guess. Please. Okay, so the question is whether the approach taken by Oxide to bring up a modern platform just using the PSP Blobs could be adapted to Corboud. Technically, yes. But what Oxide did was still like some bare incision of the IO buses. They needed to program certain registers to get the PCI Express and stuff like that. So it's not entirely, let's say, independent from the code that runs on the host. So they still needed to put some kind of small portion of incision and extract the secret bits needed for the incision. So what we are right about is that the PSP actually, yeah, it can bring the RAM app and boot the platform. So you have at least half the responsibility of the BIOS less to do. That's true. But adapting that thing is, I would say, not so scalable. And it would result in a very feature-limited result, right? And other questions? Please. Do you have a plan to support Corboud with open-sealed on framework laptops? Our framework laptops? Yes, there's a plan to support it. It's going to be a proof of concept. It will not be a full replacement. For example, it won't support any power management features. So if somebody wants to build it and put it on their framework device, they can. The framework is not going to be supporting it. It's going to be independent. But yes, we will have it for both the 13 and the 16 inch. Did you ask about AMD? I've been talking about AMD. Yes, AMD because open-sealed. Also, I forgot to mention, I'm sorry, Felix. This is also like last-minute information. Felix also pushed some patches for AMD Phoenix to be integrated with open-sealed. But right now, as open-sealed for the mobile platform, it's not available. The open-sealed is stopped. But the infrastructure is being prepared, I would say. So we might probably expect something in not so distant future. We are out of time. Yes, I can also answer questions like outside or later. Thank you.
immune Guard: Streamlining Boot and Kernel Security in the Cloud
I'm here today to represent a completely different talk normally from Firmware. I mean, I'm probably some people know me from the Firmware world, so I did a lot of open source Firmware stuff. But like there was another topic on the CFP was TPM in attestation and I thought like, okay, we had to start up for three years, it failed, but we built an attestation product. So I thought like, okay, let's make it open source. So my co-founder Kai is here as well. And so we thought like we just make it public and probably also like integrated and some other stuff. So I'm giving today a presentation about it. I first thought I give you a lot of insights and architectural information, but we worked with three people on that for three years. Yes, plenty of code. So I decided not to use that. So I just give you today a high level overview. But if you're interested, you can just go to GitHub and look at the project. So yeah, streamlining boot and kernel security in the cloud is basically the topic I hope. Oh, sorry, I hope it works. Sorry. Ah, no, it works. So who I am. Yeah, I'm now a consultant. I'm doing a lot of firmware and like firmware security and trusted computing stuff. I also did a lot of open source firmware. Can see me with some Facebook servers there. I'm holding. Yeah, and I did a three year startup journey at Immune that was unfortunately directly in COVID. And so that wasn't that great, but I can share some information about the journey in terms of attestation and TPMs. And we also like have a booth at open source, you know, foundation booth at K2. So if you want to come over to talk a bit more, I would just leave after this to just go to the booth to talk to the people. So feel free. I worked a lot of open source projects. Yeah. So the story at all in total is that we had the startup. We built a SAS product like like all these startups do, right? And so the idea was basically you do endpoint security with this product. And what we wanted to achieve is basically one secure endpoint systems like Linux, Windows systems securing them with an additional layer of security for specific type of attacks. So that was our goal. And it should like be integrated with the standard endpoint protection mechanism of nowadays like most of the Windows guys probably use endpoint protection like antivirus, what Windows defender, whatever is built in. But the Linux guys for sure have also other stuff like integrity measurement architecture, it's more open things. And so that's great. So we built this type of SAS, which gives you observability. And yeah, what does it do? Yeah, basically it protects everything from the power on over the security configuration over the firmware itself. So there we come back to the firmware topic. We also protect the boot loader, the operating system and the drivers. And we also protect the endpoint protection or EDR. That's really business words, but let's say you can have that on Linux with open source stuff. So that's what is possible. So if you ask Chattivity nowadays, what does the malware first do if you have a system and it invades the system through an exploit or whatever, it will basically try to hide itself if possible from the antivirus or whatever kind of detection software protection software is there, or it will try to disable the antivirus. So that is the most common thing. So nowadays they really kill off all this snake oil stuff. And so what we try to build is something much better than that. So for that we can look at the problem at the moment. So there were a lot of attacks in the 2000, 2016, which basically targeted apps and a little bit of hypervisor. But nowadays the attacks going to more into the firmware direction. And the reason for that is basically that there is a lot of growth inside the firmware ecosystem. So for example, we had in 2000, I think 32 kilobytes or 64 kilobytes of firmware, probably a little bit more on some systems. But nowadays we have 64 megabytes on a laptop. This laptop probably comes with still 32, but the newer one and there's other Thunderbolt firmware and other of other things. There's tons of attack surface on the firmware level now. And so that means companies like Dell start to make cloud services for their BIOS. So they attach the BIOS to the cloud. And that was quite prominent. It was, I think, in 2021 summer. And it was super fun because what basically happened here is that they had a TLS connection into the FI itself. And the certificate, the PKI was broken for whatever that was like a bug. So you could basically hijack 30 million devices over the internet. That's kind of funny. And the funny part is if you do that over the firmware, you have full access on everything. No hypervisor, no antivirus, nothing will protect you from that. So that's really crazy. And I mean, this you have also on Linux systems. It's not like it's disabled. It's there. You just don't know. And so we thought we need to close somehow the security gap from hardware firmware to hypervisor operating system core. Because apps protected nowadays with tons of security mechanism in Linux as well to do that even if you use proprietary software. You can just do open source things to do that. You can just go to any security conference for Linux and check that out. But that's easy. It's already solved. It's starting to get solved. This is like, that's easy. So we try to do the hard thing. So how do we do that securely? How do we really make sure that the attacker is not disabling us, right? Because it didn't disable any other type of antivirus. So what we try to do then is we want to use the security chip. So there are a lot of security chip nowadays out there, especially in the server ecosystem, but also on laptops like these ones. Either they are in firmware or they are in real dedicated hardware. Or they come as part of the system on ship. So the processor itself. And the most prominent one is the TPM2, which is now everywhere in every laptop. So probably every one of you has at least the TPM2. Or if you have a MacBook, you have a security chip, it's not the TPM, but there's a chip security chip. So everything is there. And so we have that TPM stuff quite long. It's everywhere. You can also have that in data center devices. And so what we did now is basically we built a product where you have like as us, you have the security operation center and they do insight and remediation. And then you have the employee laptops. I mean, in the open source world, you probably have like your own fleet you want to manage, and then you have like some kind of monitoring tool, a monitoring system which tells you what's going on and then you can act if something happens on your system. That's what you want to do because you never know what happens. Because there can be vulnerabilities you don't know. They're all talking about risk mitigation, but the reality is even with all the known risk, there are tons of unknown risk. So this is difficult. So how do we do that in detail? I try to not to explain it too deep completely. So what we use is the TPM chip. It has a secure credential storage. It's basically on the most of the devices and we do remote attestation. Remote attestation is basically the idea of asking the TPM chip, give me the state of this platform. And that's what we're doing. So we ask the TPM chip, give me the state of the platform and thanks to the current situation in the firmware world, it gives you the state of the platform in a specific type of let's say artificial binary blob which is signed, which is publicly parsable. You can write a parser for that. There are tons of parsers out there already. And it's signed. It has the timestamp and the nonce. It has all the hashes of the executed code down to the, at the moment for Linux, it's down to the Linux kernel. And from there, if you start the integrity measurement architecture, they are like, it's enabled in all distros. You can just enable that and it hashes everything which has been executed. You can also write a policy for that technology stack. It's more like an endpoint protection replacement for Linux. So it's kind of super interesting. Most people don't know. There you Ubuntu, Fedora, whatever is able to do that. You just need to place a configuration thanks to system D and Linux. For doing that, so super easy. Anyway, and so what we do then is we gaze all this information with the open source tool. This open source tool is basically gazing information, putting that into the attestation, signing everything, making everything super secure. You can even bind the TLS keys for the encryption and everything to the TPM. So it basically is secured also over the TPM. So everything is clear. It's also clear. The platform is identifiable. And then we send everything back to the cloud and do our crazy cloud analysis with our check engine. That's what we basically do. Quite similar to a normal endpoint protection, except that we do tofu. We don't do like what they do is like they try to find specific patterns for more. What we do is we check if this is a legumintate action by having smart decision tree to say, okay, this change in array where we know this can be secure. And thanks to this architecture is quite fine granular. That means we can do it quite deep and in a good way. So how does it look like? So if you run our software, you can find now on GitHub, you basically have a dashboard. It looks like that with all the risk and incidents. This is all like more business like, but we sort like customer wants to buy it, right? It needs to look like shiny, not like there's another project called KeyLime. It's also nice. They are more focused on container for attestation, but they do it with less good UI stuff. So we try to have a really big focus on UI and that was kind of super important for us. And so what we built in as key features is basically the secure device identification. We can identify the device securely. We know the integrity of all the software components. We execute until the endpoint protection or the kernel takes off with that drivers. And we have also the secure monitoring which runs all the time, right, remotely. And you can just like build that into your data center operation or whatever. You can even do it on endpoint operation. And so additionally to that, we gained storing credentials in HATRA and the trust income. You can additionally, we didn't do that, but you can also like push credentials to the platform for VPN and other stuff. That would be the next step in the future. And also access revocation by incident. So then you can, okay, 10 minutes left, but I'm already quite far. So yeah, the incidents, if you run the system because it's super complex, that's why I didn't want to explain it in detail because it's like a lot of code and everything. It's basically giving a list of incidents on multiple affected devices. And there you can see specific type of incidents. It has documentation for all this type of incidents. You can check it out. You can extend it. You can improve it, whatever. If you want to work with a project, it's super. We are quite open for contributions. So and it has also an extended feature list, which is quite, yeah, let's say it was quite experimenterial because we are start-ups. So we tried out a lot. So we added entity to support. This is for, so in order to make really that the firmware is coming from the vendor, you can have some service from Intel to verify that. And we added the service that's there. And then there's, finally, it's there famous for firmware vulnerabilities. We did the scan over the firmware as well. So we extracted the firmware from the system, put that in the data we sent to the, to the analyst system. And then we could also find their vulnerabilities and everything. We added LVFS integration, which was kind of nice because then you can check also for firmware which is there in LVFS and check if hashes matches up. And if it's the same, because the biggest problem of all this attestation story is basically the firmware. But still it's got much, much better than before years ago. I did it like 15 years ago. It was horrible because the firmware implementation were all buggy. And now we have UFI. I don't like it, but it's much better than what BIOS provides in terms of security. And so the hashes are somehow can be pre-calculated. You can evaluate them and say, okay, this change or that's changed, right? So that's the only important thing. You want to know what changed basically by seeing the hash and having additional information. We have built an engine for doing all this checks. This is like trust on first use. So the first time you install it, it's basically then trusted. And when something then changes, it's tried to automatically evaluate. This is like the PCS is a hashes basically platform configuration. We just check it up and looks and sees, okay, this change, for example, the bootloader changed and specifically the hash of the certificate changed. And we know, okay, this certificate changed. And this is normally a sign, for example, for an attack because if you have a bootloader and it's updated, it's super rare that the 10 years valid certificate changes, right? So you can add additional checks to the engine. You have also a wireless engine. So if something you won't really allow, you can just accept and then it's like, it's a trust-based then or the trusted computing-based. And we also have risk detection protected by the attestation. That's also part of it. And there's a nice view for incident-risk and device views. It looks like this for a single laptop. So you can see it, for example, for my Lenovo SyncPad X1 Carbon. And you can basically see what kind of device integrals stepped past and where it stopped. So that's what we did. So we could see from the supply chain to the device configuration overhost firmware bootloader operating system and to the endpoint protection. So you could directly see what's going on. And if you like, here you can see this in all the picture, but you basically could like open it up and then you see a lot of detail what happens there. So you can see, okay, this hash of this file changed with this certificate and whatever. And so you can investigate that. Additionally for security stuff, if someone just wants to use the code or get some insight, we opened up CME into CME measurements, which wasn't public yet. There's RIM certificate support. This is like for the TPM RIM certificate, for the supply chain security stuff. Current drivers for firmware extraction for Linux and Windows. Then we added IMA support similar to KeyLine and AZ support, as it's an antivirus. And for Windows, generic endpoint protection support. So you can also use it with any generic endpoint protection. Mostly we tested it on Windows Defender. We added Intel ME API support. It's like the management engine of Intel. We can talk with it and find a lot of information out there. It's super helpful. So if you need code there, just grab it. And we can also verify the TPM itself that is coming from the vendor and not like someone exchange the TPM or whatever and try to take the system. So you could download that for Ubuntu Fedora and generic Linux and Windows. And then there's an OS open source agent. It's more like a tool you can just install with the system descript. Execute that and this one gives us the data. It has really low memory footprint, so it just runs every one minute you can say and delivers the information. Push only to the system, which is kind of good because we don't have a back channel executing stuff on the system. There's another security hole. So it's super easy to use. Yeah, and that's also available. How long? Still five minutes? Okay. Thank you. And yeah, this is a repository. You can find it on GitHub. We probably move it maybe under another org or we maybe keep it. I'm not sure, but you can find it now there and feel free to check it out. This is BSC licensed. So everyone can use it and build also products from it. What I want to do the next few months is that I want to do more code cleanup. I've already thrown out a lot of stuff, but we still have some leftovers, especially for business features and so on for accounting and all this nonsense and no one needs. So I'm just ripping that out and adding CI support, pre-built releases for Windows and Linux so you can just install it directly on the system and ease up the deployment because currently it's there's an instruction to do so, but it's like super complicated stuff is Kubernetes and all those things. Maybe we can make it a little bit easier and there might be other features where we can integrate it directly into system D and other stuff. So yeah, that's basically the idea. And now I have a demo. I hope it works. Please work. So I can probably, can I? Yeah, we have. Can we make it with sound? No, I don't need. I explain. So this is Windows 11 with all security features on all this super crazy Windows 11 stuff you can know from Microsoft and their marketing. And then we are in our tool here, right? It's just like, it's a little bit like probably that could be also done better instead of tokens. But anyway, we just like bind this system to our system and it installs it and takes a while when it's done. Come on. Okay. So and then on some points the device shows up. It takes a while. But of course the engine needs to like analyze it and then it's trust on first use. Basically it's protected. And this demo is based on Black Lotus malware. So I got Black Lotus malware. So we won and tried it out. And the funny part is, so what you can see in a few minutes is basically that you install Black Lotus and Microsoft doesn't. So they know the signature because I didn't repackage it. So that's what you normally do with malware. So they can check the signature anymore. And so, but you can see that I execute the malware. Then the anti-rials saying, yeah, I know the signature because it's known, but this is quite easy to fix. But I didn't do that for the demo anyway, because I don't want to to mess around with packaging P32 binaries. Anyway, and then I just executed on the system. And what happens is that it over overwrites a bootloader which is before Windows and injects code into the kernel and turns off all security. But everything in terms of endpoint protection after the reboot tells you everything is fine. And if you looked in the detailed information, see all the security what they built is completely disabled. So that means attacks going into the deeper level are not covered by Windows. They cannot be protected with all their security stack. And so our system basically detected the malware, as you can see with the red information, the bootloader manipulated. And then it takes a while until Microsoft gets, I don't know what it does there. Like I'm a Linux user, but this thing shows in few seconds that everything in terms like secure. Oh yeah, now it shows nothing, right? And then you go to details like advanced protection or whatever. And then yeah, everything's turned off. So yeah, that's how it goes. That was my presentation. Thanks for listening. Oh yeah, say it's all I have to return to. But anyway, that's it. So any questions? I hope I have some time. Yeah. My question is you are mostly focused on malware type of attack. But could this software somehow prevent an attacker with a physical access? Like if someone manipulates my laptop, for example, if you could temporarily disconnect TPM in boot and you could install this malicious bootloader. So you disconnect the TPM, you boot the malicious code and then you submit all the right hashes. And you pretend TPM seems that it's a valid right. We cannot protect against bus attacks on the TPM. I mean, this is like hardware attacks. You cannot do anything against hardware attacks. There's no real security against hardware attacks. Even in a confidential computing environment, side channel attacks and what power glitching attacks and all this stuff is always possible. So we don't say we can protect against hardware attacks. We can probably detect like if you plug in in USB stick, right? If you try to execute something else or if you try to modify things on the system, that's what we can detect. But we are not like claiming that we can really do it low level on the bus system. I think you can provide that by there's like a way to authenticate the TPM. So you can have an authenticated session which is encrypted and then you cannot do this type of attack anymore. But that's what is currently missing. I think in the goal in TPM library. But yeah. And then Oscar. Yeah. Please repeat the questions because you know, oh yeah, sorry. I'm sorry. Yeah. Okay. So from your previous question, I think we can also protect from supply channel attacks. Right? We can. So what, so if you can, can we protect against supply channel attacks? We cannot completely protect against it, but we can probably detect modification on the firmware side. It's because there's also signature verification and this is all measured. So you can do that. What you can also do is to detect if the TPM is the same TPM and if it's really a TPM from the vendor. This is also possible. But yeah, sure, there are limitations in the threat model. But I think that's a good trade off. I mean, for me, it doesn't have to be perfectly secure. But yeah, probably we could extend it, make it better, improve it. But yeah, that's the current state. Yeah, some of these questions are like, security is a binary state. Security is not a binary state. It's a spectrum which we kind of travel and we can be either, and it depends how much we have, we can lose, how much we can pay for that, then that much we can do. Yeah. That's what I think also security is not a binary state. Not zero and one. It's somewhere in between, unfortunately. Yeah. I have a question. You mentioned your presentation here. Do you plan to cooperate with the TPM support? So if you can at the TPM support depends or if we cooperate with the TPM project depends on like, we are open source project, right? If we can like support that, why not? If someone makes a pull request, feel free to do so. We will just integrate it, right? So there's no reason why we shouldn't integrate that. I think I just want to have the three years of work we made, not completely a loss. So sure, probably it's not taking off as open source project, but that's always how it is. But we thought like it may help some people and give them some more insight what we did. And yeah, maybe they can reuse that for business or for personal reasons. So malicious agent attacks, this is a good question. So the problem is with the TPM, you cannot forge the stuff from the TPM itself. And since the quoted data basically has a nonce and timestamp and everything, you can forge it. That means the agent can definitely try to connect to our system, but if we don't get the blob, we don't talk with the agent anymore. So this is like, sure, you can say we don't talk back to the system. That's one option you just completely go silent, but we have for like a counter and there if it's 50-minute over whatever or like specific amount of time, it doesn't call back to us. Then we say the system is unknown state. So I mean, that's a trade off attestation, but I think it's fine. Yeah? If anyone is interested in TPM hardware attacks, stack smashing just post that thing doing that in like a minute, one of the actual carbon. So just put a snap for hardware on top of the TPM stuff and there you go. But yeah, part of the protect against it. Yeah. So we did it for more of their protection basically. That's software-based attacks mostly. But if you can detect it, at some point, and somebody is sniffing something, and you can somehow detect that, you can do something like wipe remotely or whatever, right? So all data doesn't bring up too much. Yeah. That's true. Especially with the carbon that automatically unbox. So that just happens because some other passions match, and then later you will be aware that we give you a scenario. Yeah. Any other question or anything we have done? I don't want to play more time here. Thank you very much.
Standardizing the generation and signing of boot images
you . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . trusted firmware and then that in hand starts up your opti and finally it goes to the main new boot image which finally loads your Linux kernel. This is a very typical boot flow. This is what you see commonly but it's talking about one core right. When you think about it you're talking about one core but that's not the case with chips today. So what if you have multi core systems. So an example would be the Texas Instruments K3 architecture of devices and here you we have two cores running. So you have your 32 bit R5 core and your 64 bit A72 core. Now in this case we need two sets of SPLs right to get it going. So your as you can see I've inserted two SPLs in the boot flow and your R5 SPL will run first. Do the initial stuff and then it'll jump to ATF again to opti and then your Uboot SPL for your A72 core comes up and then finally you boot to Linux. So this is how it looks like. Now in this presentation in the interest of time I'll be talking about the A72 bootloader mainly not the R5. So yeah that's what I'll be talking about. So what do we need for the A72 bootloader? You'll need your ATF, you'll need your opti binary and you'll need DM firmware. DM is device management firmware. It's kind of like a TI version of ARM-SEP and then you'll have your SPL binary and then you'll find your device tree blob right. So let's say I want to make this position agnostic my final bootloader. So I can just append or prepend the fit header at the top so that the entire image is basically position agnostic. So that's what I've done over here and what about security? You obviously need security. So you have your X509 certificate appended to the top of each of your binary blobs. So basically it's signed all of the blobs inside are signed and this is what our final A72 bootloader binary looks like. So as you can see it's not very simple. I mean you can run it through a simple shell script and get your final output but it wouldn't be, it's not really the standard right. This is how we used to go about generating our final binary. So you can see I have U-boot. You basically give all the inputs. You give your opti or device manager firmware, your ATF and U-boot has custom scripts, a bunch of long shell scripts that used to you know tie up everything, sign everything, stitch it up into a final image. So this is what we used to go through. But in cases of higher security devices you'll need to have a core Secdev K3 which is an external TI maintained repository. And this is how we used to sign the images a few months back. Obviously there's a lot of issues with doing this. One is maintaining and scaling. It's a non-standard flow. It's for example let's say we already have more than four bootflows at present and extending it to all the bootflows you know where the binary will have to change. It gets difficult. Packaging gets difficult. And this is not the standard. It's not distro friendly. And there's no unit level testing. They're just shell scripts. You don't really, if it works it works. There's no test coverage there right. So these are the issues with the custom scripts. And this is a small snippet of the shell script that we used to use. So you can see unless I showed you the final image what it used to look like. I don't think you can gather much from this script. And you can see the highlighted ones are pointing to external scripts. So you have scripts within scripts and it's just a mess to get to know what your final image looks like. So yeah. Thank you. So just a little talk about Binman why I started writing this tool a while ago now. Packaging is actually much harder than we think. And you can see an example of that there. Some other things that go on. There are SOC specific tools that need to run. And as mentioned before there's different phases of the boot. And the image needs to contain code for all of those. It's also nice to be able to see what's actually in the image. And so this this bin man tool lets you look at an image and list it out. So the image is described as data. So rather than shell scripts or code or whatever you describe the structure of the image in a simple data format. This image has U boot and it has SPL. It has a size of one megabyte and it has some pad bytes. So that's basically how you start. Binman normally runs as part of the U boot build. So it's the final step after all the inputs have been created. You run it runs build man sorry bin man and produces a final image. But the nice thing is you can then use all those inputs and run it again separately maybe in a signing server or in some other step in production. So bin man also deals with missing blobs. It deals with tools that need to run and so on. And it can produce an image even you know even telling you that this image won't work but at least you're able to validate that you could get that far. Bin man consists of it works with a list of entries. Entries have a different type that you've seen the U boot ones and the SPL ones but there's loads of other ones as well. And they just packed one after the other. They normally can't overlap but it is possible in extreme cases if you want to do that. Bin man is written in Python. There's an entry based class. You then have an entry blob subclass of that if you like. You can see that in the middle of the screen and you can sort of extend it from there. So a blob is just basically an entry that has a blob of data in it. But you can make arbitrarily complex things that involve you know producing signatures and that sort of thing and it's fairly easy to do that. To add an entry type you basically put a new Python file in the right directory and give it a give it a class name and off you go. You can I mentioned you can run command line tools. It's actually possible to list out the tools that are available. If you don't have one you can do bin man tool minus F to fetch it and it will go and build it from source or find it in the binary or whatever it has to do to get the tools. So you don't have to go and hunt around for three days trying to find vendor tools. The code has a lot of comments. It has 100 test coverage. So it's very very strongly designed to to be reliable. That's it from me. Yeah so now you've seen how what bin man is and the rest of the presentation is going to show you how we switched rather migrated from what it used to be with shell scripts to using bin man. So this is what the final flow looked like. There's no external repository. There's no custom scripts. It's just a bin man device tree file that we've plumbed in along with the other inputs. So this is what it finally looked like. So as you can see just like the image on the right which is our target image. So you have a fit node within which each of the individual binary blobs that have to go in. So your a t f your opti or dm all of them are kind of packed in nicely. And you can see that ti secure is an entry type that we've created to mimic not mimic actually generate the x509 certificate that has to go on top. So it's being passed the contents. So the binary that it has to sign which is in this case the a t f binary and the key with which it has to sign the binary. So it's all nicely packed in and now you have a somewhat visual representation of what's going on and you can manipulate it easier easily. And yeah since I didn't get space. So this is the remaining of the two blobs that have to go in. So you can see there's a few things that you can notice from here. So one is yeah there's a custom entry type that we've defined and along with which there are standard entry types that we've used. For example opti and a t f their arm standards. So you can have the standard entry types defined as well. And it's already there in the bin man folder. And at the same time let's say you want to reuse the same device tree for building many different boards. And let's say each of the board is using a different address to load your a t f. And that's also easy to plumb into the bin man flow because it evaluates config options. So according to your build your config will change and yeah. So to kind of finish off we'll just quickly go through what the python class looks like for TI secure for the x509 certificate. So you have a special method of python in the beginning. So that's just there. And this would do the reading of the node. So you can use your FBT tools to go and grab the properties that you've mentioned in your device tree. And you can even add your own properties. That's also possible. So for example sha is the property that we are going and grabbing by default it's 512 if you haven't mentioned it. If you want to change the sha value you can give that property in the bin man node itself. And then this would be the method that you know is kind of important which is actually setting the contents of your entry. So in my case I've defined a get certificate function that actually goes and runs open SSL on the binary that you fetched and put that in the entry. So obtain contents is what is doing that. And in the case of for example you have a u boot SPL which contains the symbol to your u boot image that you want to jump to. So there's cases where you'll be writing symbols so your final image would change. So process contents runs after at the end of your build so that it'll go and update your binary essentially in your final image. And here's another last method that was used which open SSL is already a bin man tool. So it's already present there and like Simon mentioned all tools can be all CLI tools you can easily port to be run within bin man itself. So here you're just adding the open SSL tool since we'll be using it. Now we're kind of towards the end of the talk. Some of the developments that are ongoing is for example the bin man dt node is not part of the device tree specification as of now but Simon has been working on it and that's an ongoing work. Then the ability to pass custom firmware via the CLI argument. For example let's say I want to pass the dm firmware argument as an actual CLI argument instead of hard coding it in the bin man dtsi that's not supported as of now without making changes to the original u boot make file. So yeah that's also something that that is in the works and finally the x509 template that that is used to generate the final certificate that's in some ways kind of hard coded right now even though it's a very standard tool that should be generated on the fly. So that that is also something that's ongoing. Now there's a bunch of u boot boards that still use custom scripts today and they can all be ported towards using bin man which is the final aim of this presentation to get everyone to finally port and use bin man as the standard. So some of the references I've used mainly the u boot documentation Simon's talk at OSFC and my colleague Brian's bootloader presentation as well and you can also see the patch series that was used to port the entire k3 devices to using bin man. Yeah lastly I would like to credit the FOSTA organizers and Texas instruments and the u boot community that has actively been working on bin man. So yeah now we are open for questions. That's not really. So the question is how does bin man relate to make image the make image tool? The bin man calls the make image tool. Bin man can produce fit images as you probably saw you simply just write fit in there and you get one. So it's a lot more convenient. Make image has the SOC specific stuff. There's no plan to you know rip all that c code out and write in python and bin man. It's simply make images is sort of one of the tools if you like that bin man uses. It's actually already part of the r5 bootloader which is a little bit more complicated which is why we didn't cover that. The question was if you can kind of recursively sign the images so a signed image within a signed image and you can do that which is a part of the r5 bootloader. We'll be sharing the slides so you can have the r5 view as well so that it'll cover that. The images are actually hierarchical so if you want something you put it here if you want the data that comes that goes into that you put inside it and you can just keep going right. So that's one of the nice things I think it's the I can't remember what it's called Mesa or something that uses sign within signed and it's simply a case of putting it in the description. So I think you're talking about changing the key once the images. Yeah that is also possible. You want to take that? I don't know much more about that but yes you can. So the public key has to go in a prior stage but because you're producing a cohesive firmware image right where all the phases are essentially have to be there then yeah bin man can can stuff the key from one into you know that's used in the next phase into the prior stage firmware and that's obviously necessary. Yeah yeah. Yeah so if you look at the DM firmware right now that's an external blob. Oh the question is huh? So you have to mention the blobs that you so the question is whether we can include external binary blobs into the final image and whether we can use scripts to generate an image and then you know port that into the final bin man made image. So the first one is yeah you can reference external blobs like I've done here. So DM is a blob that's not in the flow so it'll go and pick that up as an external binary and in terms of scripts so UBOO does the first build and then basically you can mention the binaries that are dependent that that that have to be created before bin man can be run. So you can mention that so you can maybe run your script before that and get your binary ready and then bin man will just do the packing. So it'll only run once the input binaries are ready to go. So is this already upstream? Yeah this is already upstream. Yeah. Any other questions? Thank you very much. Thank you. you you you you you
systemd-boot, systemd-stub, UKIs
I had my second talk of the day. The first talk was very somewhat topic, but it focused more on the distribution side of things, how to build all this stuff. I welcome you to look at the video if you have some time later, because what I'll not be able to answer in these 20 minutes, hopefully the other talk might. So I will talk about this in boots, this in BSTAP UKIs and what those are, and why you should all switch to that, of course. So let's jump right in. So system reboot, what is it? We usually call it bootloader, but it actually isn't. It's a stupid boot menu. Like if a bootloader, at least in my view, is something that actually is capable of loading sectors onto disk, parthing them, and then eventually setting up the boot params and jumping into this, we do nothing of that in system reboot. All we do is we give you a menu and you pick something, and then we chain load some other UFI binary. So yeah, it's a fancy boot menu, nothing else. Makes it on one hand dumb, but also nice and robust. It's built around this model that you have drop-in files inside of a directory, which I guess is very different from Grub where you have these boot scripts and things like this. Our way to configure things is supposed to be as simple as possible and modeled after how we started doing things in package management and classic Linux distributions like that, but it's not this pattern where it has this directory, like a configuration directory or a directory where you put desktop files on the desktop into and things like that, where every package can put more stuff into it, and then the combination of all of them is what makes the system work. And we just said, okay, let's do it the same way. Have one directory in the ESP, and people that want to populate the boot menu just put one file in there, and that's what populates the boot menu, and that's already it. So, yeah, it just takes basically this Linux pattern around package management and just takes it to the boot loader stuff. So, this boot is UFI only, right? Like, this makes things nice because it basically means we don't have to actually do any boot loading. It implements something we call the boot loader spec, which is spec we wrote ourselves. Actually, it just basically tries to define in abstract terms how, like, where to place kernels, where to place descriptions of what to boot. It supports two kinds of menu entries. Type one and type two, we call them, I think the focus should always be nowadays on type two because they have much nicer properties regarding measurement, cryptography, and things like that. But type one still exists, and people will continue to use it because it's more flexible in regards, it allows you to configure the individual items manually. Type one is a configuration file, basically, which just says use that kernel, use that in already, use that stuff and things like that. Type two is something where the boot menu items are just binaries, UKIs, as we call them, UKIs. We'll talk about this later in more detail, but the very short version of that is it's a kernel glued together with its init.rd and a couple of other things and then turned into one UFI binary. So, basically takes much of the early state of the OS and makes one thing out of it that can be updated as one, signed as one, measured as one, loaded as one, which makes it robust and secure and very nice. Since Friday or something, System Boot is also like eligible for signing, like Suze actually did this ahead of time, but now it's officially okay, so you can get it assigned by Shim with the same infrastructure and things like exactly like you can get grub signed. System Boot is supposed to be fully automatic, no configuration, right? Like there's no boot scripts, no nothing. I mean, there are some configuration options, but the design is to just work and not require configuration, right? It should be just one binary you drop in and then you have this other directory where you drop in the menu entries and that's supposed to be it. Of course, there is something like you can configure something in EFI variables and there is also a configuration file, but that's just for the nerds and it's not supposed to be the default. It also has a nice functionality that besides looking at these directories for boot menu items actually capable of finding Windows installations automatically and Mac OS, which is kind of nice because you don't have to configure that either, right? Like from the West you don't need to do anything. SD boot when it boots up, it just looks, oh, is there also Windows installation that adds the one awake? It's really nice because it's robust and it has also benefited that if you add Windows after you install Linux it will just show up. It also has APIs to user space, which I think is very important, right? Like for us, the bootloader world and the user space world are not distinct, right? Like they are closely intertwined for various reasons, like for example, because user space adds and manages the boot menu entries because from user space you generally might want to be able to select what's going to be booted next because there are things like automatic boot assessment, right? That you figure out did this boot actually work? If it worked, booted forever in the next time, if it didn't work and you've tried a couple of times and give up and revert to the previous thing. So this always requires communication between the bootloader and the operating system. So we defined, that's actually another spec where we defined this generically with CFI variables and things like that. We said this is how bootloader and user space can communicate and can send each other commands basically. It also does early boot random seed stuff. This is because traditionally in particularly in VM environments there was no RDRan, no Virtio RNG things and then Linux really didn't like it. You didn't have to any entropy in your VM and then certain things just hung and that's super annoying. So we took a bit of inspiration with something that FreeBSD did which is an early boot random seed. So basically you have a random seed that is stored in the ESP. You can update it from user space and it is updated from user space. After we did this, Jason Donfeld who's also the maintainer of the Linux kernel RNG, we wrote a couple of things that we kind of confident nowadays that it's really good actually and the good thing is it works everywhere, at least everywhere where you have EFI and make sure that from earliest moment on you have really good entropy in addition to whatever the hardware might support you. It has automatic enrollment of Secure Boot keys which I think is actually kind of nice. It implements like this tofu concept for Secure Boot enrollment. So if you want to change your certificates, which I think people should do and particularly in virtualized environments, then you can just add the keys to the ESP and then on first boot up when we are in setup mode we'll just enroll the whole thing and then we'll be locked down. So you have the trust on first use. Like on first time you boot up, nothing is enrolled, nothing is trusted. That's the moment where everything is trusted. Then you add the keys and from that point on this is how it's locked down. It also has this thing where again with the drop in deer you can load additional drivers mostly exist so that people who really want can make the ESP or something like that, one of the weird file systems. Yeah, already mentioned briefly automatic boot assessment exists which is like the infrastructure that we count before. We boot something, how often we have booted it and then from user space can report back if that actually worked and then I get this kind of robustness thing going. So much about system reboot. Boot control is one part of the user space part side of things. Boot control is like a command line tool for installing system reboot. That's kind of its primary job but it can do a couple of other things as well. It's a use space side. You can tell it to boot on the next boot up to specific menu entries. You can list the menu entries. You can update the random seed in a couple of other things. We hope that it actually runs automatically on boot for example to update the boot loader. It always will do this. To make sure that the copy of the boot loader that is in slash user is instantly copied also like if it's for some reason the package manager who ever updated system forgot this it's always kept up to date. So the focus is really that the boot loader is always up to date. It also resets the random seed by the way like from the Linux pool so that there's a good chance that the random seed is as good as it could possibly be. So much about boot control. Next thing system destub. System destub is also UFI binary. System destub is basically a little UFI binary that you glue in front of a Linux kernel and an inodori that runs in UFI mode. It does a couple of things before it transitions into the actual kernel. Why do we have this? It does a couple of things like for example it measures the payload of what it's going to start. So now you might wonder if it's a UFI binary that a second would sign and things like that. Why does it need to measure because firmware already measures all second boot binaries. Very good question if you ask that. The reason we do this is because these measurements that the firmware does are PCR9 I think and there's a lot of stuff in there and that basically means it's hard to predict because there's stuff that is controlled by the firmware and there's stuff that comes from the West and you cannot bind security to a PCR that has sources that you cannot really control. At least you cannot do this in a predictable way from the West point of view like figure it already out. Like you cannot predict it on basically the Fedora bridge systems if you build Fedora. But if we do the measurement separately of the payload of the UKI we can do that in a separate PCR and then we can predict it because in that PCR there's only going to be the stuff that the West vendor controls and not also the firmware stuff and then cover the firmware stuff with something else. UKI is what this becomes when you use system-based stuff right like the combination of system-based stuff plus a kernel plus an inodori plus kernel command line plus all these other kind of things that's what we call a UKI unified kernel image. Yeah it is system-based stuff supports a couple of sidecars right like this UKI model that we try to want to push distributions towards where you unify everything into one image that you can sign measure as one update as one and things like that that comes as with problems like inherent problems like for example the inodori that you built into this you like we expect that vendors will build those on their build system so they're always going to be the exact same ones on every installation which is great for many reasons but also horrible for others because depending on the machines you will need large drivers large firmwares like Nvidia driver for example comes with multiple hundred megabytes of firmware if you would always build that into all the UKIs that you as a generic distro vendor ship to your people then yeah this will be really really large second-build binary as it turns out because of all the measurements booting really large second-build binaries works but it's kind of slow so I also had this inherent problem that yeah in this model where UKIs are built on OS vendor build systems the question is open how do you parameterize that right like because on a simple laptop you do not need to parameterization you can figure out everything on its own but a learner is supposed to be generic right like you have these installations that want additional parameters like they want to configure I don't know additional ways like route passwords so that it can log into the inodori or a boot ice-gazzy device that you actually want to boot to so these this I mean there's a reason why the kernel command line exists people want to want to be able to do this in certain setups not a laptop is mentioned that should not necessarily be necessary but the more you go to the server side they all want to do this so we came up with a couple of ways how you can have sidecars so that even though while we push everything to the UKI model where you have a single thing that is self-contained and that has everything you can put next to it the sidecars that configure individual things like there's one concept we call system creds I went into this in more detail in my earlier talk but let's just summarize this at system decreds like the asterisk cred stuff that is that is basically short little bits of information like like keys like cryptographic keys and passwords and things like that that you need to operate but they're individual bits and they they are encrypted and locked to TPM so that you can actually put them in an untrusted environment like for example the ESP where there's no implicit trust and you have to authenticate it before you use it there's another concept we called add-on if I add-ons which are basically the same idea as UKI's right you take you make a PE binary that you can sign can you can measure as one however you leave out the Linux kernel the inner D in all these kind of stuff and just insert the kernel command line that you would also add to UKI so you basically have a binary that looks like a binary but doesn't contain any code however you can authenticate it via the usual Secureboot usual shim APIs like it was a binary because the UFI just cares that it's a PE thing so these add-ons as we call them are our way how we can allow people to extend the kernel command line because when a UKI is booted and system D stop takes over it will look for the side card files find this add that to the kernel command line and boot on and it's in a fully trusted way because these things need to be authenticated the same way as everything else is authenticated via the the shim Secureboot stuff I already mentioned this the system you stop also does measurements right like of the content so that we get this isolated out so that we have yeah one PCR that only contains the US stuff separate from the stuff where the firmware is this means duplicate measurement but that's fine at least I think it's fine something that also does it can read additional kernel command line options from SM bias type 11 I'm not sure SM bias type 11 well I'm in the boot loader room so I hope you know what that is like SM bias you you you probably all know is like this descriptive thing that the firmware passes to the West and there's one object type you can add that's type 11 and it's wonderful because it's just called vendor strings and you can put we can put anything in there that you want and various virtualizers like QEMO for example allow you to directly set that from the kernel command line from the QEMO command line and yeah we use that also to extend the kernel command line right like so you can just on the QEMO command line sets a string that is just implicitly added to appended to the kernel command line that is eventually booted we kind of want to push people actually to the model where they use this more often it's actually an awesome thing and I'm kind of pushing like trying to push all the cloud renters to adopt this as a generic way to provision data into VMs but anyway other topic another component is UQify it's a basically it's a Python script that helps you gluing together UKI so it will take system to stop kernel and I read sign it as one it will also do the TPM predictions of what will the PCR will look like when it booted the signs all that with second boot it gives you one EFI binary that you can just drop in the in the SP and boots up and everything's secure and wonderful then one other tool system be measured like much of this like all of the what I'm talking about here is actually part of system me because I'm system the guy system the measure is a tool you probably don't have to interface with it anymore because UK UK fight does that behind the back for you it's the actual engine that predicts the PCRs that the UKI will will result in if booted yeah I just want to mention that exists and yeah there's another tool called common stall as for the traditional distributions so that they can ship inside of a devian package or RPM a kernel and that this tool it's like plugin based and things that will copy the kernel into the SP and potentially it built the UKI at that moment right like because we want to cover a couple of different models like one model where the UKI is built on build servers of the OS vendor and another model more for the let's say democratic devian style distributions where they can do this locally so that they can use their own keys so yeah the kernel and styles are infrastructure to make this happen it's really nice this full UKI support for example for this like so if you want to do your sign your own stuff you can trivially do this because you can just use that and drop in your keys and Etsy and then it happens magically there's something I don't have that much time anymore should we switch to questions okay this is one of my last slides anyway assistant you part PCR log is one of the most recent things it's a more complete prediction engine like I already mentioned the system we measure tool which is able to predict the PCR measurements that a specific UKI will result in system dPCR log is supposed to cover all the other PCRs that they are that are firmware stuff and things like that system dPCR log deals with like all the other operating systems generally have this like Windows Chrome OS Android they nowadays have all these predictions and well depending if they actually care about TPMs or have some other second-hand clif thing doesn't really yeah it's all a little bit different rare but the ones that care about TPMs generally have this prediction engine where they just look at all the different things that happened during boot analyze the UEFI event log and try to calculate a TPM policy to lock disk secrets to our version of the tool is called system dPCR log it's supposed to be modular so you have again drop-in directories where different components of the west that will show up in the boot pass like the UKI boot loader shim and things like that plus components that are not even necessarily under the OS control but our firmware stuff can be described with little jason fragments to just say the measurements that I expected for each of these components there is a concept of alternatives because usually you don't want to lock your secrets to exactly one kernel or one boot loader version because you want to update them and then if that update fails you want to be able to go back and things like that so usually for each component you want alternatives also very well supported and system pcr log takes all this information explodes what all the the the pcr values could be in the end and then generates a TPM policy out of this that it stores in a TPM and v index and that then our disk encryption stuff can reference as an access policy long story short this covers like this basically locks down the OS against the firmware versions with all the measurements that firmware does that are not necessarily predictable for the OS because yeah the firmware people suck there's also like support of course that if we cannot predict firmware measurements we have some some logic there to deal with that so if you do all the combination of this then you get a super secure system and everything's great my recommendation is do this but these components are relatively independent of each other and as the things happen and different distributions started adopting different parts of this earlier like for example Susan nowadays adopted system to boot already but rel for example for the confidential computing stuff already adopted system to stop and they all pick different parts of this okay my time is over so this is my my summary here if you use all in combination everything works great but if you pick what you want you don't have to pick anything at all if you don't want but if you use in combination you get this full boot chain stuff everything's secure and relatively robust because all the update cycles are around individual files you have ways how to parameterize it and extend it and yeah there are a couple of more slides we don't have to cover them but let's move to questions we have five minutes for questions so yeah so the question was regarding whether the system D stop stuff works outside of UFI environment the answer is no like it uses UFI APIs and it's just UFI all of the what I was talking about here is more or less model after UFI system to boot system you stop absolutely only UFI but the further you go with the with the like carline store for example that has nothing to do with UFI like unless you actually use the parts with well I mean I know it isn't there I think there are sorry so my my suggestion would that it would be a well just just adopt UFI and avoid all this mess I don't know I think everything has problems UFI have some my general like I mean I get this all the time like this thing like oh we have to stick to grub because it supports all the non-UFI world and I say sure my recommendation would always be if you look at this stuff like there are certain like philosophical ideas built into this right like you have a drop-in directory you put on type one type this kind of stuff is entirely generic and there's like type two is not generic but type one is totally generic right like so my recommendation for it for that by the way is just use UK eyes as they are right like they they are a PE wrapper like it's a really simple format actually PE right like and it's just an envelope that carries sections for you if I think grub now this can parse that too cannot see so if grub can parse it you stupid bootloader should have no problem at all parsing it and then you suddenly have a universal format and you boot windows style PE binaries even though it's not Windows but it's I think it's it's the way to go like model it after UFI UFI has its words everything has its words but I think it's way better than than the stuff that came before for it and just yeah so my recommendation would always be if you can't do this stuff at least consider the ideas behind her a bit behind that and like drop-in directories and sing like single file updates and these kind of things and then try to model it afterwards and the more you can take over the easier will be your life because this probably will end up in all the distributions and the less differences there are the easier it is right so I think even grub supports type one at least on type two or something like this so so we're on purpose we wrote the spec as a generic thing both of like of all like there's a spec about yuki yuki i there's a spec about the bootloader spec there's a spec about the bootloader interface like where how how we do because we will always clear to us that not everybody's gonna do you if I so uh yeah we did that as a nice service to the community but the other people have to figure out if they actually want to adopt this it took them long enough to not adopt it so far but now things are changing if you want to not do you if I my recommendation would be let's yeah look at the specs and I'm sure like adds to shit to it like it's a spec senders like there's a getup issue thing and if you need something else then send an issue and like if it makes any sense at all we have no problem with adding that to specs very short yeah very short so all these projects are system well it depends like the specific age like so the question was if all these projects are under the system the umbrella so that depends right like so we we created like a group which you call the uapi group where we try to standardize these things right this is to a large degree admittedly system the adjacent people right like who like adopted a way of thinking like we do it but there's nothing specific in it and the specifications are on purpose written independently of like the word system B doesn't show up in this but like it might show up in the things but it's that's not the point of it right like so the code right like that's a different story right like we this is in the system tree right like this is developed like unix was developed I guess you have like this get repository and you have all these components on this the fact that it's in there doesn't mean you have to use them you can mix and match them like is mentioned like different distribution pick different things up like opens us it at sd boot first and not sd stop and then sd the right hand was confidential computing take sd stop but it is not interested in sd boot because they didn't want a boot load at all so yeah this is how it should be right yeah we'll give you the buffet pick what you want don't pick what you don't want I don't care but yeah this is ultimately it's very Linux focused various ufi focused very systemally focused but I think look at the specs maybe you can reuse something but yeah yeah I think even the uk i stuff like because we don't like you know how the firmware jumps into your uk i that is not defined by us that is simply an artifact of the fact that's a pe file right like so you can jump like any find any other way to jump into it you can even look for the Linux kernel in it right like that's what grub does right so it looks for the linux pe section in it ignores all the other stuff and if it wants to and then does the classic boot protocol that does not do anything like this okay but anyway yeah thank you you
GRUB - Project Status Update
Thank you guys. My name is Daniel Kippur. I work for RACL. I'm software developer and grabs upstream my tenure since 2005. I think it doesn't work. My only issue is only for recording. I have to speak louder. Please remind me about that. Thank you. I want to present what is happening in the project right now, where we are and other things. The plan for this presentation, I want to discuss what happened during the last two years. I want to dive a bit to the cover it to work, which he did during the last three years. Also, I will be discussing what is happening in the project right now. And also, I will show you some statistics for federal grabs upstream patches. So, let's start from the beginning, let's say. I suppose that it is not needed to remind what is grab is. In general, it is a bootloader, but we are not focusing on UFI income horizon to system D. We support all architectures and platforms, which are most of our architectures and platforms. I suppose there are some platforms which don't support architectures. But there are tons of architectures and platforms, and we try to give a user a lot of flexibility, what can be done, etc. So, the grapple is happened in the last December, and we provided support for new compilers, new bidinutils tools, and other stuff. We also finally did a unification on the UFI Linux kernel loader. Before that, every architectures had own implementation of EFI kernel loader. And additionally, to make the problem much more complicated, we had a completely different X86 architecture loader. This was terrible, and finally, thanks to many people, we were able to merge all these EFI bootloaders into one thing, which is common across all architectures. There are some minor differences which we had to leave, but unfortunately, they are minor, and I suppose that at some point we will remove them completely. We also, as I thought earlier during the Linux presentation, added initial support for bootloader interface, and I suppose that we will be implementing more specifications which were developed by SystemD guys. We also, during last development efforts, made some important changes to the Grapp runtime memory management. Previous implementation was very rigid, and it allocated memory from the firmware just at the initialization of the Grapp. So, it didn't allow us, for example, to run Ergon 2 KDF. So, we had to change this memory allocator implementation, and finally, Grapp is able to just allocate the memory from the UFI firmware just on the runtime. So, it is a very nice thing. We also added support for PCI and MMI or UARTs, which is, I think, very important if you are some problems, because you are able to print all debug messages directly to these UARTs. We added Sdl support, LongArts support, TPM driver fixes. In previous versions, we had problems with this driver, because when UFI failed, then TPM driver failed, then boot stopped. So, this was a problem. Currently, this is fixed. We also added debugging support improvements, improved test cases, improved documentation, and tons of other fixes. As I said at the beginning, I would like to dive into the coverity work, which we did. Finally, we were able to achieve our goal. Currently, we have zero outstanding defects in the coverity, which is, I think, nice. And this work was initiated by Andrei Bozhankov in 2013. Later, Vladimir Serbinian, who is also a graph-matterner joint, and together they continued this work until 2017. We started this work in 2020, when boot-hole security popped up, and it took us years to do analysis of all coverity issues, not solved earlier. So, finally, we fixed more than 600 issues reported by the coverity. Sorry, we did analyze more than 600 issues, and fixed more than 500 of them. About 100 of issues were false positives or something like that. After that, we ran a new coverity tool, which revealed additional four issues. Fortunately, they were not grave, and we fixed them quite quickly. So, all of these issues currently are fixed. At this point, in general, we do not accept any patches, which introduce coverity issues. So, this is not accepted at this point. We also employing fuzzing and other tools to find bugs in the graph code. And I would like to thank you all these people who are working on the coverity stuff. At the beginning, it was quite easy, because the reports which we were looking at were quite simple. But if we were diving deeper and deeper into other issues, it came out that the coverity reports are very confusing and difficult to fix, or even understand. So, this took a lot of time to do analysis, especially the final few issues. One issue which we analyzed with Vadimir was in the Grapp-Moller allocator. And even if we thought that we fixed the issue, the coverity was thinking that it is not fixed. We don't understand why, but we finally silenced forcibly coverity, because we were not able, even we contacted with the coverity company to get information why it works in that way. And we were not able to understand how this works. But as I said, in general, currently we have coverity issues reduced to zero, and we will not accept any patches which will introduce any coverity issues. So, what is happening right now in the Grapp project? As you may know, currently Microsoft requires that PE binaries has to have special properties. So, first of all, it is required that PE files must be aligned at the page size. And the more important thing is that you cannot combine a right attribute with executor output for the data section. So, it means we have a special code which properly aligns of this section of the shim, of the Grapp-Loder, of the kernel, and it requires a lot of work. Shim currently has support for an X. Currently we are working on support for an X. It was initially done by Red Hat. Currently work is taken over by Oracle and we are currently looking at this code. And I hope that shortly we will be able to post updated code for the upstream version on the Grapp mailing list. And I hope that it will be matched in the upcoming release. We are also working quite closely with the Tenchbot project. I do not want to dive in this topic because we have presentation at the end of this dev room. It will be made by Jack and Maci from 3MDep. So, they will provide more information about the Tenchbot project. Unfortunately, we have to update all embedded Lips. The most difficult problem at this point is the Jypri-Library which provides encryption primitives for the Grapp. Currently we are discussing various options. Also, we are considering moving from the Jypri-Lib-Jypri which is a huge thing. Thank you. And maybe not completely aligned with current Grapp needs. So, one guy with whom I am discussing the Grapp dev app, proposed that maybe we will move to something which is a simple library which is just to file a forget its name. It is on mailing list if you are interested in, which contains just two files. One C file which implements the cryptoprimitives and one header file. So, it will much, much more simplify the abandon this into the Grapp. And this way we will get Argon to KDF which allows us another thing in the Grapp which is automatic disk unlock with TPM2. The another issue which we have is that due to Horiex historic reasons we have distributions carrying tone of downstream patches. And this make Grapp upstream code more difficult to use in these distributions. I will discuss this later and I will show you some numbers to give you a clue what the size of problem is. We are also working on fix ups and clean ups for documentation. We are also looking for setting up CI infrastructure at this point. In general, I do all tests manually so this is not very convenient and I think that it shouldn't work in that way for projects like Grapp. We expect that the next code free will be around October 2024 and next Grapp will be close to November 2242. So, as I said, let's have a look at Fedora downstream Grapp patches. This is Rafa estimates. These statistics were prepared by Alec Brown, my colleague from Oracle. So, you have to take this number with grain of salt because we were just looking for at the names of the patches. So, if the names of the patches in the upstream were a bit different, for example, by dot or something like that, then we were not able to identify them. But this shows you that we have huge problems with the downstream patches and this is due to historic reasons. In general, at this point, we want to focus also on merging most of these downstream patches. I will show you more numbers which are more relevant, I think. Sorry, not this one. As you can see, Fedora 40 currently carries around more than 350 patches. Of course, there are some of them which were back ported to this release, but still we have more than 200 downstream patches, which, let's say, makes Grapp downstream maintenance very difficult and also provides some problems for Grapp upstream development. For example, some people come to us and ask about things which are specific to a given distro, but do not apply to Grapp upstream. So, this is a real problem. In general, we try to focus on this work in the following months. Alec Brown would be helping with this, but we are looking for more people who want to review these patches, who want to work on these patches. If you know a given area, you have experienced a given area and you are happy to help with that, that will be perfect. So, I think that's it from my side. Do you have any questions? Yes. Thank you. Go ahead. Thank you for the talk. Are there any plans to modernize the Multi-Boot 2 specification? Right now, most of the definitions there are only x86 and MIPS and are constrained to 32 bits, so you can't have modules above 4 gigabytes, you can't have entry points above 4 gigabytes. Which fields are there? How to date it? A bit. What is the process for getting the fixed? I'm not sure what is the idea behind using this protocol, but I'm not against using this protocol, but I would want to understand use cases, etc. If you want to improve this spec, just take the latest spec from the grab source code and work on it. I'm happy to review it, to discuss other things and solutions. I think that, especially on EFI world, currently we are deprecated using this kind of protocols. For example, we deprecated usage in the kernel, deprecated usage of Handover protocol, which was also artificially designed for EFI environment many, many, many years ago. I think that similar case is with Maltiboot 2, Maltiboot in general. And I think that on EFI platforms we should use EFI interfaces, as for example I suggested, but on non-EFI platform I think that it makes sense to consider something like Maltiboot 2. Of course, if it is properly improved, I think its design is quite generic and can allow us to use it on different architectures and platforms. But I think that we should not target this protocol to EFI environments, I think. This is my opinion. Does it reply your question? I can just send you some pictures for example for ARC. Oh sure, I'm happy to review them and discuss on my weakness. Go ahead. I'm a gen2 developer here and one of the biggest pain points historically was that your release case, you mentioned that the next release was Manfrouk Tova, which I hope that will actually ship them because historically some of these released it, it had to wait 2-3 years. Same problems with the work here, all these patches, and they give you simple patches. I think for the community and for downstreams, just getting more updates on how frequently you feel like is one of the biggest. So, the suggestion from the audience is that we should show them the cadence of the releases and speed up the review process. I completely agree. That is, as you saw, we are aiming at making a release at the end of this year, but my goal is that and from the talks with other folks, my goal is to make a release cadence around twice per year. This is my goal. But due to lack of resources, at this point it is quite difficult. So, we are looking for more help from the community in reviewing patches. Especially, I think that at this point one of the most important projects is forward-porting patches from the downstream as much as possible to reduce this backlog. And I think that it will help to review currently posted patches on the mailing list, etc. I hope that this surprised your questions. Thank you. Any questions? Go ahead. Yeah, yeah, I'm looking for you. From the stats that you showed with all the patches and all the releases, do you have any estimate on how much are bug fixes and how much are the file systems and how much is integration? The question is how the audience is referring to the list of the patches which are in downstream. And the question is how many of these patches are bug fixes for the sheen, for the EFI protocol and other stuff. As I said, this is only a rough estimate. We just did this work a few days ago, so we don't have these numbers. Alex started looking. After some discussion, we decided that we will be looking mostly at this point, mostly at the UFI code to identify which EFI code is. It doesn't need it, which code should be fixed and matched, etc. So if you need more information, if I have one, I can share with you. Thank you. Go ahead. A question about your statistic in the slide before that. Stefan? How does that add up to 222, 323? What's the total? What's missing? Could you repeat once again? If you take any line and write up the numbers, it doesn't come up to 460. 22 patches? What's the total? It's the total patches were put on the given upstream version. So for example, let's go back to this one. The first number shows you how many matches were put on the graph 2.0 upstream. 87 patches were backported. It means that if some patches were put into the master, into the graph 3, then Fedora started getting some patches from the upstream. So 87 means that these patches come from master graph. 64 new patches. It means that there are new patches which didn't come into upstream and just live only in downstream Fedora. And as you can see, the number of patches is increasing. So this is a bit confusing. As I said, we were working during the last few days to just show you numbers. So there are no precise, we have to take them with grain or salt. Just wanted to show you the size of problems which we have at this point. One of these patches is bootloader spec support. Is there someone working on getting the upstream? Not yet, but if anybody wants to work... Oh, okay. The question is if somebody is looking at boot a spec support which is currently downstream and wants to upstream this code. Not yet. I haven't heard about this work, but if somebody wants to pick up this work, I will be happy to review this on my main list. So there are all things which are defined. Various boot protocols I think that can be implemented in the graph. And we are happy to review those. Does it apply because? Okay. Go ahead. So for the patches, a lot of Fedora patches are actually new features. Yeah, yeah, yeah. My question is about multi-boot support. Okay. So if multi-boot support is getting deprecated, what's the recommended way for booting Zen? Yes. The question is about multi-boot protocol and if the plans are to deprecate this protocol, how we boot the Zen? Which relies on this boot protocol? Yes. We should move to, in general, EFI protocols which are currently used and situation gets much worse there because there are some discussions about dropping support for Shimlog protocol because it makes some problems in some situations. In general, we will be moving from various kinds of strange of bootloaders on EFI platform to just start load image and start image protocol functions and they should be used. I communicated this to Zen community. They are aware of that. They were not happy with that but they told me that it is better that I told them about this problem now than later because they were considering working more on this multi-boot protocol to support Secuboot. Does it apply your question? Kind of. Okay. We can discuss this later as you wish. Okay. Next question. Is there a way that we can avoid using grub scripts and have more of the information that Hunter and Debian don't need to create these scripts? Could you speak louder because I have difficulties with hearing you. Is there a way for distributions to avoid needing to write grub scripts? Well, I think that we should repeat the question. So the question is it is possible to avoid the grub scripts in the distributions to make grub work. As I understand the question, I think that we should migrate to something different than grub scripts like a mechanism which are used in system D or something like that. But load aspect, exactly. So this is a potential way to use these scripts in the grub right now. So we are considering this. You heard the discussion on how that somebody will pick up this work and we will be able to match this support shortly in the grub. Okay. I think that was the last question. Thank you very much. Thank you.
Kernel command line to configure userspace considered harmful
Okay, thank you. Good morning, good afternoon. Thank you for having me. My name is Luca. By day I work as a software engineer in the Linux systems group in Microsoft in the Azure organization where I manage the Azure Linux OS that runs on the infrastructure there. By night I'm involved in various open source projects. I'm an assistant demantainer, a debian developer, a DPDK LTS maintainer, a bunch of other stuff that I consistently forget about. Now, if you read the title of this talk, you might think, hang on, was that really intended to be that provocative? And the answer is yes, yes it was. This is my yearly talk to make new friends. But of course I tried to mean it in a positive way. So I want to provoke some thoughts and discussions, see what we can do about this that I consider a problem and that I think we are in a good place to start fixing. But first, even though everybody lives and breathes secure boot, some background, if you're not aware, if you work on bootloader, so boot system, you already know all of this, but just one slide. So in the beginning we had BIOS, everything was great. The security model was low as if. In the 2000s we got Ufi. So Intel, Microsoft, a bunch of other people got together and created this new protocol for firmers. And it actually has a security model, which is nice. Now, it gets a lot of mud thrown at it. Every time there's a bug in the news like the logo face stuff, people are going, oh, why do we need secure boot? It's always broken anyway. Well, having a security model doesn't mean that everything is perfect or never breaks. It's software. It runs on computers. Of course it breaks. The point is that we have a process to deal with it and we have a natural security model to follow. So the way it works is that there's a chain of trust that starts in hardware. So for example, Intel boot guard. So the hardware verifies the trust in the firmware. The firmware verifies the bootloader, the bootloader verifies the kernel. The set of keys and certificates used are stored in the firmware. I won't go into details in that because it's not too important for this. This is generally called secure boot in a nutshell. Now, this in the 2010s, thanks to the work of a lot of people, like the government and many others, Linux joined the party finally. We were shut out of that ecosystem for a while and by default, distribution couldn't boot on new hardware. You had to go and fiddle with the buyers on this secure boot. This changed in the 2010s. So we have Shim, Grab 2 and the kernel lockdown stack and distribution can work again by default. They are signed with the U-featured party CA. You get your Shim signed by Microsoft and then you sign your second stage bootloader like Grab or System Eboot with your distribution key. And then we have this patch set in the kernel that was called secure level in the beginning when it was out of three. Then it was merged as the lockdown LSM later that basically tries to protect the kernel and the firmware and the hardware before the Xeboot service is called. Xeboot services is an API call in the U-fe interface and when that happens, a bunch of things get locked down. You cannot change secure viable anymore. The firmware goes away and a bunch of other things. It's very important to protect the system before that. So this is what this ecosystem tries to protect. Secure level also tries or lockdown tries to separate UAD0 from ring zero. So the theory is if you are root, you shouldn't be able to change the hardware or the kernel memory outside of what should be allowed. So this is not perfect. It went a very long way and it fixes a lot of problems. It's not perfect. Of course, it's softer. It's never perfect. But yes, the idea is we have this boundary between UAD0 and ring zero. And this has been working for 10 years or so. It's great. We moved, we have no trust whatsoever to having trust until the point we start the user space. And that's great. But other operating systems are way far ahead. My course is way far ahead. Android is way far ahead. Windows is way far ahead. We do nothing for user space so far. But in the past couple of years, we've been talking a lot about how to fix this and things are starting to happen. So this is the next level. We'll have a unified kernel image. And by the way, Renard had a talk this morning about UKIs and I think he might have mentioned before as well in the previous talk. I could not get him because the room was full. But we've been talking about this stuff for a while and there were at least three or four talks talking about these things. So you might have already heard about these concepts and we'll repeat them again in a different context here. So what we are trying to do is extend that level of trust and security and authentication to user space. For example, the E30 right now on any generous distribution just sits on the boot partition on the ESP and anybody can offline or even online that has write access that can inject anything. And they're just built locally. They're unverified. You add a backdoor to the prompt that asks for your locks encryption password and you wouldn't have any idea because it's completely unchecked and untrusted and verified. Unified kernel images try to fix this. The unit RD is part of the PE binary that gets signed so that the shim or the firmware verifies it before loading so that we can extend the chain of trust a little bit further into user space. At least the first part of user space, the unit RD. But that's not enough. We want to go further because once you go to your root file system for the unit RD, well, that also is unverified. Now, Wacuzov is working on the IP, Intervity Policy Enforcement or Execution or something like that. It's a new LSM that basically allows you to write a policy that says any binary that runs on my system in user space must come from a DM Verity volume that is signed by a key trusted by the kernel. DM Verity is a mercury system to do online verification of block devices as they are loaded and opened. It's a very, very nice interface that has been available in the kernel for 10 years or so. And with IP, we can use this to move the chain of trust into the full user space. So now all the code that runs is fully verified with a chain of trust that goes back to the hardware. With discover of these images, we can also protect further payloads, so containers, end-spon containers, portable services and other things that are attached to the OS. If you're running a read-only system, you need some way to attach new applications there, of course. And with DDIs, you can further extend the chain of trust again in the same way for those payloads as well. So we put all of this together when we have the shim and lockdown stuff for the boot process and then UKIs for the InterD, IP and DDIs for user space. We have a very nice system that chains back to the hardware and implements a full root of trust. And that's very nice, except for the kernel command line. This is just stored as a plain text file in system debut type 1 BLS. So it's a type of boot images supported by system debut or in grab as well. It's just plain text file. If you have root access, you can write whatever you want there. If you don't, it checks, that just gets out there and run. It also can be edited on the fly if you have access to the keyboard, which probably is fine on a laptop, because if you have access to the laptop, you're probably the owner. But if you're on a server, owner of VM, or a confidential VM, that's kind of bad, especially for the Taster computing case, because the serial console is just a device owned by the supervisor, which is outside your PCB. So why is this a problem? Because it has become kind of a kitchen sink. Just for the kernel alone, there's that document there, which is very nice and documents a lot of options available. It's 7,000 lines long, and itself says this is not a complete list. So we don't even have one list that says, okay, this is everything you can do with this untrusted, unverified interface for your machine, which is not ideal. Also, I checked, I'm not a kernel developer, but I checked as far as I can see, the very first parsing of the kernel command line happens in the kernel's EFI stub before Exiboot services. Remember I said before Exiboot services is a very important point in the boot process that before that you want to protect your system. You'll be really careful about what happens and what is allowed to run and execute and change the flow of execution. Now, you can use the kernel command line to configure the kernel to do things like disable a Cerenox, disable IP. I talked about a moment before about IP. You can disable all these security components using the kernel command line. And it's not just the kernel that you configure. It's called the kernel command line, but it's just a command line. You configure everything and anything in user space. Everybody sees it by default. It's approximately aligned. It's there. Everybody has their own parsers. The custom written to parse it and read it, and it's used for absolutely everything. And again, this is bad for confidential computing. The Cereconsole is outside of the TCP. So this is a difficult problem. Now, of course, there are historical reasons for this. It's super convenient. It's amazing. You have a problem. You just press E to edit a debug, and then you can get some debug logs if your machine doesn't boot. That's amazing. That is super useful. But I think we're getting close to the point where we need to make decisions and whether we want to allow this always or in some cases or disable it completely in some other cases, because it is the last bit missing, as far as I can tell, in the security story of the boot process on Linux. So for system reboot, we have decided to stop allowing editing the command line and supplying untrusted sources of input to it when you boot UKIs. You cannot do that. And we made a lot of friends with that decision, I can tell you. So the problem is, of course, the flexibility is gone. Can we get it back? What are the use cases? So one of the main ones, the root FS auto discovery. In this case, you would do root equal, devs, da1, or whatever. Now you probably may be using a UGID to identify the disk so it doesn't switch partition. You don't lose booting. We have something called the discover partition specification system that is supported by our tools. So basically, the very quick resystem reboot tells the intrad where your root is. It automatically finds it based on UID set on the partition. So this use case is very well covered now. I already mentioned UKI, that we mentioned very frequently as FOS, so I go very quickly through this. You can add a command line to the UKI when you build it. It's very easy with UQFI, our tool to build UKIs. But of course, it's one entry. It's a fixed entry. The UKI is meant to be shipped by the OS vendor, and that is very not flexible, of course, because the OS vendor doesn't know what you need to have on your OS to make it work. Now we have a future plan. We'll get to this this year, but you'll be able to actually specify multiple options. For example, your OS vendor will be able to say, I have my kernel command line, which is the default one, and then one that has debug, and another one that has factory reset, so that you have multiple options. And in your boot menu from system reboot, you can select a non-default one if so you wish. This is very hard to list, and I get Leonard to implement that very soon, but he hasn't done that yet. The other thing we have, so system is tab is the small UFI stub that is embedded in the UKI, the first bit that gets loaded. So we added this thing this year called addons. Again, they can be built by UKFI, and what they are is they're just PE binaries, so they are signed, so the firmware verifies them using the secure boot certificates before loading them, but they don't have code. There's just a kernel command line configuration. So you can use this, and system is tab, we automatically load them if it finds them. Again, through the firmware, so verified and signed and trusted, and then you can use that for extending the kernel command line that was in the UKI, and that is fixed. This is really meant for platform owners. So for example, if you want to set crash kernel equals some amount of memory, that's probably the same across your whole fleet, at least for the same devices, so you can use the same addon everywhere to set these cases. Again, we want to add selection to, right now every addon will be used. We want to add a menu to let you select which one you want a boot in case that is needed, but we don't have that yet. It's again on the to-do list. Next thing we have extended extension images. This can be used to extend the IninterD. So you can drop them in the ESP, they are the invertees, so again they get verified by the kernel, and given the IninterD 2 is fixed, we can use this to extend the IninterD with additional code or configuration. It can be used for both configuration, overlay none ETC or code overlay none USR. Again, we don't have any way to select which one you want. We just pick every one, every extension image that we find in the ESP. And also you can embed them in the IninterD if you want and extend the router fast with this, or download them at running time to extend the router fast when it's read only. Finally, this is my favorite one, and I think this should give us enough flexibility that we can start to talk about actually disabling this stuff by default. So credentials are a very simple concept that we added to system this some years ago. They are just key value pairs. The key difference is that they are scoped by default. They are only visible by user space and only by services they opt into them by key. So in your service you say load credential key. If you have a credential for you name it there and it will be loaded. Everybody else will not see the content because they can be encrypted. And we are, I think we have that already, if not it will be ready very soon. You can encrypt them ahead of time if you know the TPM's public certificate for the SRK. You can encrypt them ahead of time for any machine. And they are decrypted only when the service starts and reads them. And only in the namespace view of the service. So they are fully isolated. Nothing else outside of it can see the credential. And you can drop them in the SP again in a per image or global directories. Again we don't have selection. Everything that is found in this location is picked up. And we are starting to add support to every system component and outside of it to use credentials to configure things that used to be configured with the Cora command line. So your networking can be configured credential. Your users, your autologin, your root password and a bunch of other things can all be, which you need to start. Like literally hundreds of things can be configured using credentials. I have a pull request open. Hopefully, as soon as I figure out the TPM measurement story, but we will also allow you to create new credentials from the boot menu. Like when you have a system boot with type one or grab two, you can edit the Cora command line. You'll be able to on the boot menu to just type credential and then a name and then a value. It will be passed to by system D and added to the interface so that it can be used. So I think this is very powerful. It's something that should give us all the flexibility, most of the flexibility that we need. Maybe. Is that enough? We have GPTO to discover for your root for a system. UKIs, add-ons, extension images, credential. Is this enough to cover all the use cases that we need in the 90% or maybe 95%? Of course, there will also be a case where you have to go put your hands on a machine that is completely broken. What you do in the case is you disable security. You break glass. You take your node offline. If it's a server, you take production workers away, you debug it. You disable security and then you can do whatever you want. Let's say 1% of the cases. Are we there yet for the 95%? That is an open question. I hope we can have some discussions about this. We also have a Secure Boot 2.0 comment. We are starting to think about it. Should this say something about what is allowed or not allowed to do to the kernel before execute services? It's kind of an important topic. Should it be a specification or should it not? Now, as reviewers and maintainers of your space components, I think it is time that we start to say, hey, if you're adding a new option via the kernel command line for my user space program, please at least also add a credential to it so that we can use both at the very least. And is this a full term? Most likely, but we still try because we are trying to push the envelope a bit forward every time and hopefully we'll get somewhere. So I think that's it and we have three minutes for questions. Oh, thank you. I was fast. Questions? Please. So you said about selecting crude partition, there was this discoverable partition specification. How you handle multiple installations of the same distribution like free fedora installations on the same disk? So the way this works is that system disk... Sorry, repeat the question for the microphone. If you have multiple partitions, multiple distributions, and start at the same time, how do you find out automatically what their disk is? The basic stuff is that the way system deboot system disk actually finds it is that it tells the initardivia EFI runtime variable which disk the ESP that was used to find the system deboot or the UKI was on. So you get the root of the disk. And then the auto generator there takes the disk and only look at that. So if you are installing with multiple disks, then we select the right disk like that. If you have multiple root partitions on the same disk as well, then I have no idea how we do that. I think we recommend to use different volumes for that. I think there's some way to do it. I don't remember to be honest. I need to look at the generator. There you go. So use a different UKI for the different root of FES basically. Yeah. Or again, with the credentials, we support credentials then for the auto generator yet? So we should add that. So that's probably a good way we could configure with the credential. But this is made so that by default you find the right thing for the simplest case. And then of course you need configuration for the complex one. If you use the same disk for multiple root file systems, well then you need to tell which one to pick. And that's one way. And I think we it is configurable and we should have credentials support for that. So you drop in a credential and then you decide which one to use. Yes. But it's a good question. That's a BTFS question. How to deal with starting from difference of volume on butterfests? I have no clue. I don't use butterfests, but the meta people here do. Do you have any other? I don't remember. Right. So yeah, it's not supported right now in the specification, but there was a buzzer, I think. Patches welcome. As usual. Yes. Yes. Anything else? Please. So the question was, can we use the auto generator when we create the UKI? The answer is yes, because then you would use the kind of command line. But if you are generating UKIs locally, our idea is that the UKI is generated on a server somewhere by your vendor. So I wouldn't work in that case. You could create a credential when you install it, for example, to tell it actually going and figure it out from that idea. But yeah, if we do support building UKIs locally, we have kernel installed plugins for that. It does work. And yes, you could do it that way. Yes. Yes. That could work if you're building locally. Yes. Sorry, that was according to the UUID in the UKI itself. And again, yes, it can work if you're building it locally. Yes. Anything else? Yes. And no, it is not a workaround for broken EFI variables. We added this so that we could configure autologene in VMs. I think that was the first time we added this back then. But this is a way to, again, the main use case was to be able to have secrets that are encrypted against the TPM and are not visible by default. So the services don't have to implement encrypted and decrypting all the stuff because that is hairy, especially against the TPM. That was the main use case, to have this as sealed stuff that is only visible by the service in its main space only when it runs. That was the main use case because normally a lot of times you configure secrets by environment variables and things like that. And of course, that is bad because environment variables are inherited now on the process tree. And you don't want your secrets to leak down to all your child processes. So this is one of the reasons we added the credential concept. Yes. So another question for credentials. What is the scope of credentials you load from this ESP? Is it the whole EUNITRD you can see it or some? The worry, so again, they are obtained. So sorry, yes. What is the scope of credentials loaded from the ESP? Does the whole EUNITRD see them? Yes, if they obtain. So your EUNITRD is trusted and verified and signed. So you build a configuration, you say service, full and bar can load this credential. Service, ABC, don't. So only full and bar that have opted in will see that and it will be decrypted for them. But yes, anything in the EUNITRD can obtain and I think we transition them across as well to the full OS. So they will be available also for services running from the after EUNITRD to full OS transition. Yes, credentials are awesome. You should check them out. They all, by the way, the slides are online and all these things are links to the actual documentation. Anything else? I think we have two minutes. I have a pretty dumb question, but let's say I want to put an RO on the kernel and online. I do that with the prediction. No, because that is for the, depends. Is that for the kernel or, oh, sorry, sorry. If you want, if you want to put RO on the kernel command line, would you do that on using a credential? Depends who's reading that. Is that for the kernel or is that for a user space in EUNITRD setting up your root file system? It depends on the case. If it is for your kernel, well, probably you want that in the UKI itself, because it's something you want your image to run in that configuration state, so you put that in the UKI itself. If you want that to only for certain cases, then maybe you can use addons and only deploy that on the machines that use the same image, but with a different configuration. So the answer, yes, it depends. There's many ways to do that and depends who's reading it and what the use case is and if you want that to be the default or the non-default on whatever else. I think we have, okay. Thank you.
TrenchBoot - project status update
So let's start. So thank you for that. So you already know a bit about the TrendWolf project. So now we will take a look about another practical application of TrendWolf project as an anti-eveal solution in QPSOS. My name is Maciej. I am currently over seven years in FremDep, currently at the position of engineering manager. I open source enthusiast interested in bedlet systems contributing to various open source projects. As FremDep we are involved in various open source communities, mainly Corbuth, Yocto, but also others such as FWPD. We are also pop-and-power foundation members. We did some work on Raptor Talos 2 Power 9 platforms. So what we will talk about today, we had already some introduction about the TrendWolf in general. Now we will have some introduction about QPSOS anti-eveal made solution based on TrendWolf. What is its current state and what are the further plans. So pretty simple. Let's see. So we will not cover the whole progress. We have regularly updated on various conferences. We've given some status updates on last two QPSOS summits. So we will cover the progress since the last one, which was October last year. You have links if you want to watch the previous ones. Feel free to. So just a few words about what eveal made attack is and why we want to prevent that with TrendWolf. So eveal made attack is a one when you leave your device unattended. Someone has physical access to that and can tamper with that. And you, for instance, come back to your hotel room and you don't know whether your device was tampered with or not. Anti-eveal made solution aims to provide you with some tools to maybe not prevent because that would be quite difficult to prevent such attacks, but at least give you some information whether your device was tampered with by some external person or not. In terms of QPSOS, we have QPSOS anti-eveal made solution which was already there. It's a set of scripts. Right now we are improving extending attack based on TrendWolf to support broader variety of hardware, for instance, TPM 2.0. And it requires TPM, so it requires some piece of hardware and also it requires DRTM, which was mentioned before, so some silicon vendor feature either from Intel by AMD currently. So what's the current state? In short, we have released two milestones in the last month and we are just starting another one. If you want more details, there are links to GitHub milestones. These three were funded by the NLNet Foundation. So links are there and we will now briefly cover each of them. So the MySum number two, it was about QPSOS anti-eveal made solution on Intel boards with TPM 2.0 because previously we had only some support for 1.2. As you saw before, there are many parts in TrendWolf project. In the previous presentation we saw a bit maybe different use case because we saw the one with Linux kernel, with QPSOS, we used Grub and then Zen hypervisor, and not Linux directly. So for each MySum, for each phase we have a set of GitHub actions releasing the packages so we can download them and install in the QPSOS installation. We also have blog post which describes how given MySum was verified so we can also reproduce that if you are brave enough. So as the project is ongoing progress, it takes already a few years in total, we can say. There are some changes in the upstream trend boot protocol, so the phase three was aiming to align with that with the TrendWolf anti-eveal made solution. So again we have a set of packages you may use and install, and we also have a nice blog post you can read to have some more insights on that one. So we are just starting phase four as said before. That one is about AMD platforms. We want to cover both TPM 1.2 and 2.0 solutions. We are still selecting hardware. Right now we think about ASUS KGP-D16, it's quite old already, but it allows us to verify both TPMs. It can also have open source firmware based on core boot. So that is one we might use here. Another one is some super micro board. There are some cubes for point two installation issues, we have some discussions open, so we are still considering that one. At least with the latest version there were some problems. What are the further plans? So after we finalized, or maybe not after, we are already scoping the phase five. Its goal is to bring UFI support as well, because so far all of the previous one were focused on the legacy boot only. For the phase five we want to support both Intel and AMD platforms with both TPM 1.2 and 2.0. So we are finalizing the scope to gather what needs to be done basically. We plan to publish that as another Github milestone, so if you are interested you can also follow that topic. Definitely the UFI one is interesting, because nowadays it's the most common one for many boards. So it opens up a wider, higher variety in terms of what we can actually test, on which boards we can use that solution. Another thing about our roadmap is further improvements of testing and documentation. As you saw the project consists of multiple moving parts. We have Zen, we have Grab, we have anti-vmage scripts. Installation, even though we have some CI, we have some packages, CubeStyle packages being built, it's still a tedious task to do it manually to test recent changes. So we are aiming to automate that as much as possible. In the last status we shown some progress in that area, but it was only on QMU to have some automated installation of the anti-vmage solution. Right now we are pursuing to move that forward, so we can also use that on real hardware units, not only on emulation targets. And also it would be nice to automatically pick up packages from GitHub actions with rightest changes to see what's the result is. Another one we are interested in is upstreaming AMD Linux patches. Maybe it's not directly related to the CubeStyle's anti-vmage solution, but to trend-root it is. So Jack said about the Intel upstreaming effort, which is a very demanding task. As we see it's version number 7 and counting. We've secured an internet grant for AMD equivalent of that port, so we could be able to help here. So we need to sync up on the latest changes with Oracle. Ideally the Intel patches are merged so we can have easier time posting another set of patches for trend-root to the Linux kernel mailing list. Another area we are interested in is as we are developing the Shara open source firmware for some of the platforms. So we have control over the firmware part and we can make sure that all of the properties required to actually use the trend-root can be properly configured. And if there are bugs we can at least try to fix them. So that would be very nice to have trend-root running with Cubes and perhaps also without Cubes on different hardware targets supported by the Shara. We already do use some of that. For instance on the Dell OptiPlex one of the other series we use that's one for testing for the current phases. So we use Corbett firmware with CBIOS as a reg AC solution to test what we currently have. But we aim to also once the Phase 5 is completed with the UFI support, support perhaps the Navacast and Laptos which are already certified in Cubes OS as well. Another idea since that the testing is mostly done by us at that point. We could use maybe some help if there are some folks interested in that. We know actually we know there are because we have already some reports but it's not that easy to jump into that. So we're considering also to make it a bit simpler. Maybe set up currently you can already download packages from GitHub Actions from GitHub releases but maybe we can set up a package repository with trend-root packages with trend-root patches. So you can more easily install them and try them out if you are interested. If that sounds interesting you can feel free to join Matrix Channel let us know so we can try. We can gather some attention and see what can be done here. So that was it. Any questions? Sure. I have a question. As I saw you are strongly insisting on still implementing support for TPM 1.2. I wanted to understand why do you want to support such hardware and do you have any hardware which supports DRTM and TPM 1.2? Very insisting. So the question is why do we still support TPM 1.2 and do we have hardware supporting that? Yes, we started with TPM 1.2. So for instance the stations showed here they have TPM 1.2. TPM 1.2 these are like... Yes. But they can run coreboot, they are really open source because they can have even more initialization natively in coreboot without FSB, correct? So they are quite similar for TPM PADS EX230. So TPM 1.2. When they were produced? Quite a while ago but this is the last one that you can have fully open. Ah, okay, right coreboot and FSB. Yes, you can have a mini open source and ME discarded as much as possible. So they are popular among those people which are here on the first day. So some old plant-based signal support, they fire some of the back-up to the mic. Of course. I understand. Yes, okay. How's the U? We were anyway. I have to test the U5 1.4 TPM 1.2. Yeah, I know. And... Two questions. One, you mentioned you fetch in your CI packages from GitHub actions. How you do that? Because GitHub does a lot of fetching artifacts with authentication. Yes, not... I mean we don't use that currently. Oh sorry, the question was how do we fetch packages from GitHub actions? We do not yet, so... Yeah, I'm aware of that problem. You can fetch from releases but not from the jobs directly. Yes, that's a problem with GitHub actions. We like GitHub CI for this reason also. And the other question for UFI support, what to do about runtime services? Oh, sorry. Sorry. I guess the question was about what to do about runtime services and I guess it's still a little bit discussed. Or is that? Yeah, there are some ideas to virtualize that. Can you do that? Yeah, is there a proof of concept from non-enables? Yeah, another hypervisor do the same thing actually where they do runtime services in virtual environments. So it's a common concept. Yeah, yeah. So you essentially isolate that through... Yeah, that would be nice. If that works? As far as all is started, when the fumes start and then the hypervisor... Something like that. There is write up. Okay. Can you show me? Sure. Anything else? This is the most recent publication on a non-enable blog. Oh, so a colleague wrote a KVM hypervisor where he puts like EDK2 in that environment. Yes, yes, yes. What was that? So he puts in the go KVM runs the EDK2 UFI in this environment. What was the blog? But this is about... 9 elements, 9 ESAC blog and the last post I guess is about that. This sounds like running virtual firmware and not calling... But that's part of this concept because this concept is like very long time, I guess like this, like we're going in that direction. I mean, just the single things. Yeah, I know. Setting UFI variables. You have to convince someone to contribute that. There is another thing, capsules. Thank you. And you mentioned capsules. Yes, capsules. What are your calorie goals for that? That's okay. Always use some vulnerable pieces. Yeah, for the better ways. You know, I can set boot next to something and then reboot to capsule. That's okay. You still need to reboot to... Yeah, like you may have it. So one reboot more. So you have at least two because if you want to reboot to the UFI, this is... No, that's broken. On most implementations. Yeah, yeah. Did you try it on the shadow? I think yes. And there is no bug about that. I don't know. Are you sure? It's a bug about that. Two minutes. If this is so new that I don't know, I can... I have last status... Last status was Friday. So you have access to the variables. I need to start, I think. It is not because you have to set variables to boot next to something. Yeah, that's okay. Potentially you can do a capsule. And you can reboot via ACPI or something. Yeah, exactly. That works. I think Windows do reboot via... You have Wi-Fi? Exactly. ACPI. I'm not sure. It's a sort of graphic level bug. It's my sort of... My experience is that there are several firmwares that have broken the UFI reboot. So we default to ACPI. Yeah, okay. Another question? We have time for one more. Okay, thank you. APPLAUSE Oh, yeah.
Open Source Firmware, BMC and Bootloader devroom - outro
So we're doing gathering in Funky Monkey under this address. So if you know 3MDep and what kind of part is we doing for 12 hours on V-PAPS, then you know what we can do. And if you are willing to bear with us talking about the firmware, talking about trustworthiness, talking about TPM, secure boot, whatever was part of this dev room, you are invited, feel free to join. And I guess that's it from me. And thank you very much for surviving to the end. Thank you.
Beyond Joins and Indexes
Good morning everyone. Thank you for coming to the Postgres Dev Room. This is our first opener. We're good? Yeah, we're good. The microphones working, yeah. My name is Bruce Momjin. I am one of the Postgres core team members and it's a pleasure to be here. I was told this is the death slot for a speaker but hey this is looking really good. So, thank you for coming. I promise you an interesting 50 minutes. I hope not to disappoint because I'm going to talk about some pretty complicated things and I hope they will be very interesting. They're certainly very interesting to me and hopefully they'll be interesting to you as well. As you know we have a whole span of Postgres talks today. I was looking through the list of talks and they look really interesting so I know you won't be disappointed. This talk is actually a follow up to another presentation that I've already done and I'm going to go over that in a minute. Probably the most interesting point here is this right here and this QR code which is a link to 62 Postgres presentations, 2700 slides, 121 videos of me speaking about Postgres. So, if you are curious about Postgres and you'd like to know more about this presentation or others please feel free to go to that URL and hopefully that will help you. So, as I said before this is a follow on to a talk that I did, originally wrote in 2011. So, by the way these slides are online right now so if you want the slides and you want to look at them closer to your laptop for example just go to that URL you'll find those presentations right there. So, this is a follow on to a presentation I did in 2011 about the optimizer. I'm going to ask for a show of hands, how many people have either seen the slides, a video or me present that talk. Okay, not a whole lot. Alright, so that's good to know. That talk is basically giving you an introduction to the optimizer. As you may know, optimizer is a critical part of a database. It allows you the system to choose when to use indexes and which type of join methods to use, how the importance of statistics and things like limit clauses and so forth. So, if you're curious about looking at the precursor of this presentation, again this URL here at the bottom will work and if you download the slides you can just click on that little URL down there at the bottom and that will take you to that presentation. But that is not about what this talk is about. That talk is about the basics of the optimizer and this talk is about everything else which is why we call it beyond joins and indexes because it's beyond the concept of joins and indexes is what we talked about in the previous talk. We are going to talk about 43 other things that Postgres does beyond again using indexes and join types. There's a lot of them that are actually really, really interesting. I learned a lot in preparing this talk and I hope you'll learn a lot as I prepare it. I color coded some of all the sections although I kind of ran out of colors as you can see but you can see they're kind of grouped together. For example, the ones over here on the right, the mustard color, yellow I guess. The green ones are related to comment table expressions. I have a talk on my website about comment table expressions. These other ones are about parallelism. The red ones, the pink ones are related to aggregates and so forth. So again hopefully this is helpful to you. Another aspect of Postgres is the ability to control the optimizer. I will not specifically talk about all of the configuration parameters but this is a list of pretty much all the config parameters that Postgres allows you to use to control the optimizer. Again we have two URLs here that I think are very helpful for you to study that. The ones right up here are the ones that I covered in my previous talk so I'm not going to be discussing the ones from the previous talk here but I will cover all of these right here related to things like gather merge, parallel, hash ag, memo wise which is kind of a funny term, incremental sort and so forth. I will not be covering these although I do cover these in another talk about partitioning which again is on my website so if you're curious about partitioning that is where you would go for that. Now I would love to say that I have a grand story about all of the join types that weaves into a very poetic narrative but unfortunately I can't do that. As you can imagine the join types are kind of distinct. There is not a real great way of presenting them in a sort of a way that connects them together. So we're basically going to spend the next 45 minutes basically going through the individual types and explaining why they're used and why they're important. And again we're going to start with some really silly ones that are really kind of not very useful but as we get forward we'll start to see some really interesting ones and of course at the end we have some really bizarre ones in some ways. The first one we're going to talk about is called a result node. If any of you have ever round explained before and you see these node types in the explain plan that's what we're going to be talking about. So you probably see the things like index scan, sequential scan, merge join, hash join, nested loop. You see those node types before. Those node types are talking about my previous talk. What I'm going to talk about now are the node types that I did not cover in my previous talk which are actually really interesting. Another thing that you should be aware of is that this presentation was originally written as SQL. So I basically created an SQL script that had a whole bunch of queries with explain running and then I ran it and then I captured it and I put it up into the slide deck and colorized it and labeled it and so forth. So if you want to run this presentation download this SQL file right here at that URL and just run it through PSQL and it'll just like fly off your screen. The only problem is you don't get the colors. It's just all one color. But you can test it. You can see and reproduce what you're seeing basically by running that SQL. Probably no questions about the result type. The result type is basically result is just a constant. Whether it's a string or whatever it's just a constant. There's nothing fancy going on here. You're basically just saying select one. Another thing is I'm using colon explain and you're going to see that over and over again in the presentation. Colon explain basically just turns off the costs. It's just so it makes it simpler for you to see. You don't see numbers in here that really aren't adding anything to the presentation. That's a PSQL feature right there, the backslash set and the ability to run explain without costs. That will reproduce the presentation on your screen. This one you might not have seen before and you might be a little surprised. This is not part of the SQL that I used back in the 90s. I guess SQL 89 didn't have this. I don't, Vic isn't here. He would know when we added this. This part of the SQL standard, it's basically the values clause is basically like a select with a bunch of values. Except that it's kind of like select with a union and select with a union. Instead of doing that, you can just type values and it makes a row of one and the second row has the number two. It's basically a very kind of throw off the cuff kind of a clause. It has a special scan node which is called values and that's exactly what it looks like. Another thing you're going to see over and over again in this talk is things that are in blue are causes of things that are in red. If we look at this slide, for example, the cause is values in blue and the result is the value scan. That's the output. If you're ever looking at a slide, you say blue is the cause, red is the output that caused the result of the blue. You'll see that over and over again. Any questions so far? Great. Generate series. This is just an example. There are many other cases where functions generate multiple rows. But any function that generates multiple rows, it's going to create a node type called a function scan. Normally functions return one value. That's kind of the mathematical definition of a function. But of course in SQL, we've gone beyond that. We have the ability for functions to return multiple rows. Not only multiple values in a row, which you would use in an out clause, but actually multiple rows. That would be something like a function scan. This is our first legitimate output. This is a case where we're doing something called an incremental sort. I had trouble understanding what that actually was, but I think this should illustrate it to you. How many have seen incremental sort before in their plans? Anybody? A couple? Incremental sort is a case where you're sorting by multiple keys and the earlier part of the key is already sorted, but the latter part of the key is not sorted. It kind of makes sense. You're incrementally sorting. You've got the front part, the early fields are sorted, and the later parts are not. So here we have, I've created a table with a million rows, and I've created an index on the first column, x. I've analyzed it, so I've got statistics on it, and I add a column y on the end of it, and then I select from it, and I do it by x, y. What happens is you can see the system is smart enough to say, well, I can get part of it sorted by pulling off of the index I've already created, but I can't really do the y, so I'm going to do an index scan on the x table, and then I'm going to do an incremental sort on top of that, and I already have x sorted, I'm just going to add the y part. So if you didn't have this, effectively you couldn't use the index, and you'd have to basically resort the whole result set, obviously it would be much slower, that's why we got incremental sort. And what you're also going to see in this presentation is a lot of diagrams, because I love diagrams, they help me to see what's going on, visually. What you can see here is you can see that the table, originally all of the 3's are together, all of the 4's are together, but you can see all of the y fields, the second column, are all in random order, and effectively what incremental sort does, it knows that the first blue section is in order, doesn't need to touch that, and it merely sorts the second column. So we're going to see this kind of pattern over and over, I'll show you the SQL, I'll show you a diagram that kind of explains what it does. Any questions? Okay, great. Unique, you've probably seen this before, this is not necessarily the unique clause when you do DDL, you can actually create a column as unique, that is actually not what we're doing here, it's basically, typically would use if you're using a distinct clause on top of some kind of result. So here I'm generating numbers from 1 to 10, I'm ordering them and I'm saying make sure they are distinct, so what we're doing is we're basically doing a function scan, remember function scan we just saw that earlier, right, and then we're doing a sword on top of that, so all the values are sword together and I'm running unique on it, basically another way of doing this, this is the way of distinct, another way of needing unique is a union, I'm not sure how many of you remember, but union always does distinct removal, you know, a duplicate removal, right, unless you use the all clause, unique is always going to remove duplicate, so even though I'm just saying union, I'm not saying union distinct or anything, it automatically does it that way. So therefore, when I do 1, 2, I basically am going to take my new result sets, remember, we did remember, that was the first node type we learned was result, remember, way back, hey, four minutes ago, then we sort them so all the results are next, the duplicates are next to each other, and then all we have to do is get rid of the duplicates as we go forward, and again, similar case here, we have a bunch of random numbers, we sort them so all of the duplicates are now next to each other, you can see the sixes and the threes are next to each other, and then we run the unique on it, and all it does is compare and removes any duplicate next to each other entries, and we get our unique output, okay, great. Okay, append, this one exactly what I talked about before, remember I said that union will remove duplicates by default, that's true, but if you use the union all clause, right, it doesn't remove the duplicates, so we have a special node type just for that, it's called append, so when you say select one, union all, select two, we have our two result nodes, and we just append the values right on the end of each other. It's exactly what it looks like, this is my first result set, this is my second result set for the union, and I'm just sticking them, I'm just appending them next to each other. Very, very, very basic, very basic case. Okay, merge append, this one's good, okay. This is kind of weird because it combines two terms that we think we know, right, we just talked about append, we know what that does, okay, but then we have the merge, which sounds like a merge joined in me, right, it's kind of, you're going to see this pattern where we've got a node type we know, another node type we know, and if we put the two names together and it does something different, it's not ideal, it's trying to kind of match it, but this is, what will we have, okay, so what I'm doing here is I'm taking a values clause, which we talked about before, remember values clause, I'm taking another values clause, I'm union alling them, so I'm appending them together, right, okay, now I'm appending them together, but each of the unions is already ordered, this is the key aspect here, okay, remember we had append, append just sticks one on the end of another, I will tell you that putting this presentation together is like a jigsaw puzzle, because you've got all these node types and you can only talk about the first node, the second node type, if you talked about the first one, and getting it all to kind of fit in your brain is quite a challenge, I hope I've succeeded, but effectively what we have here is two values clauses, but these values clauses are automatically ordered, and therefore when we do a union all and we want the result to be ordered, okay, the stupid way to do it would be to just take the results of a pandemic and then sort the whole result, right, that would be the silly way to do it, because we already have our results ordered in two pieces, so what merge a pen does is it takes two result sets that are already ordered and maintains the ordering as it merges them together and it repends them together, okay, so here you can actually see that right here, we've got our sort, for the first one we've got our sort for the second one and now we do our merge a pen, and I apologize for the diagram, but this is the best I can do, what we have here on the left is the first result set, on the bottom we have the second result set, as you can see from the query we sorted the first result set here, we sorted the second result set here, and as we append them together we want to maintain the ordering that those result sets already had, and to do that we're going to take the lowest value from each result set and just repeat it, so the lowest value between these two is two, the lowest value between these two is three, the lowest value between these two is three, the lowest value between this and this is four, five, six, eight, eleven and twelve, okay, so by using merge a pen we've avoided having to resort the results, we basically kind of merge them together, and if you're familiar with the way a merge join works, that's kind of how it works, right, it takes two results and kind of compares them and kind of walks down, finding the minimum matching values as it merges them together, it's the same concept, this is what I'm kind of getting at, that the terms that we use here are not random, like the fact we call this a merge a pen actually has some logic to it, because we're taking what effectively is a merge join and we're sort of repurposing that concept to do a pen and retain the sorting, any questions? Yes sir? Do you know the same thing by using merge a pen instead of just, I don't think the answer. So the question is do I know how much time we're gaining by doing merge a pen versus just sorting, so we have, we're a cost based optimizer, so we know the cost of how, what it would take to sort the whole thing and what it would take to do merge a pen, so we are always reevaluating that, all we know as back end developers is we're going to run the cost of both of them and we're going to figure out which one is faster, I don't know how much benefit it is, of course it depends on the size of your result set, but we're only going to do merge a pen if it's a win, if it would be cheaper to do it the other way we do it the other way, right? Other questions? Yes sir? Does the query guarantee the order of the two set periods to leave out the order by? So the question is if I leave out these order bys, No, the other, the lower order by. I'm sorry? Is that the query? Yeah, this one here? No. This one here. This one? Yeah. Well, if I don't have this here, I'm not going to do a merge a pen, because I don't need to, I don't just, I'll just append them together, I don't need to merge them and maintain, the only reason we're doing that, you see how order by one is in blue, that has to be there, if that order by isn't there we aren't going to use merge a pen, because we don't need to preserve it, right? Yeah? In that case would the other two order bys also be removed? So the question is in that other case where the two other order bys also be removed, the answer is no, because the user would still get the order by of the first result, and then the second result would be right underneath it. So they've specified in the query that they want the order by, we're going to maintain that. But because they've added an order by after it, we're kind of overriding it and kind of using their order by. Now I'll admit this is a contrived example, we could have done an index scan to get this order by. So this is the fact that I've got two order bys up there, you see they're not in blue, they have to be there, but it could be some other query, we could be doing an index scan and pull the orders that way, and that way we don't have to do sorting again, could be anything, right? Yes. I told you earlier that we do this order by inside, you kind of like hit the optimizer, which I'm not supposed to, but I know that you need to restore them. So it will do this, right? But I mean in more complex cases it's not necessarily that you had to restore the other two. So the question is do we need to order by there? The fact is if there's no ordering of the two results, we aren't going to do order by, we're just going to do a big one huge sort and just run with it, right? The only reason we're doing that is that. Okay, so eight and nine, two new options here. One is called subquery scan and one is called hash set up. I know I'm not super proud of hash set up, it sounds like a, I don't know, some kind of science fiction thing, or I don't know what, but let's just look at this. So this is a query where we've got a thousand rows and we're saying select from the small table and then remove or subtract or whatever, how do you explain it? These other rows. Now we know by looking at this there are no rows. Okay, so just go work with me here, all right? The system doesn't know that, that I've actually removed the rows from the same table twice. We don't have an optimization for that. So what we're going to do here is we're going to run something called a subquery scan and we're going to run it twice because we've got two queries here and then, I'm sorry, subquery scan and then we're going to do hash set ups. And again, crazy, crazy diagram, I'm going to walk you through this. What we basically have, this is the outer part, the first part of the query and this is the except part of the query. The query, we're removing all the matches, okay? And what we're going to do, and this is kind of weird, is we're going to create, we're going to kind of append the two together and we're going to put one, a label one for the first query and a label two for the second query. Okay, so here's the first query, all with ones in the second column. Here's, same thing with all, two is in the second column. And then what we're going to do is we're going to hash them and we're going to hash them basically in a random order because again, hashing doesn't have any ordering to it. And we're going to look for the ones that basically all of the ones that don't have a two. So for example, the seven does not have a two match for the hash and therefore it's part of the output, the three and the six have a one but without a two and those aren't going to go out. A 12 is going to come out and the five and the eight and the eleven have a two without a one. Okay, so anything basically that has a one without a two, that's what we're going to output and that's how we're going to implement this except right here. Again, we have some, if you want to read this at some point, this is related to how we do intersect and accept and so forth. It's kind of interesting if you're curious for later, you really want to study the slides, feel free to read that. Setup is what we would use for intersect. Intersect again, another opportunity here. So we want to find the ones that are in both of them. So we have large intersect from large, again same issue. We do a subquery scan, we append them together, we sort them and we do set up. Again, similar diagram, here's the first part, here's the second part. We label with one, we label with two, we create a joined result, we create the hash but in this case, we're now looking for cases that have a one and a two. Remember before it was cases that have a one without a two? Now we're looking for cases with a one and a two. And you can imagine we're kind of using the same code, right? It's sort of the same idea, it's just the filter you put on at the end. Because remember this one was all the ones without twos. This is cases where there's a one and a two together. Three has a one and two together, five does not, six has a one and two together, seven, eight, eleven and twelve do not. So that's intersect. Any questions? Materialize, this was an interesting one. I had trouble kind of understanding what this was, because materialize to me, there's like a materialize command, SQL command, like for materialize views. That's what I thought, is it that? Again, we're reusing terms quite a bit here. So what we have is a query that's selection small, and it also selects from a copy of itself, but again, the optimizer doesn't know this. And we're doing a very weird comparison here, we're doing a not equals. As you imagine, equals is really easy to do, not equals is kind of awkward. So what we end up doing, and I know this is kind of weird, we basically take the inner side, and we actually create a memory copy of it. So we load the matching rows, remember this is a small table, in fact it says literally small, we know it's a small table. And it just loads that into memory, so we can do the not equal comparison much quicker, than it could if it had to read them out of shared buffers. That's all it's really doing. It knows because we're going to be hitting this thing over and over and over again for not equals, we don't want to keep hitting the shared buffers, so we just bring it in, bring a copy in, and we effectively just do a bazillion comparisons on our local copy of this very small table. Memo-wise is a weird one, that was at I believe in 14, I think, somebody? Yes sir. Sorry, what was the phone number? Yeah. The local memory, is it a working memory set? Yeah, this would be your working workman. Could be, yeah. So if I cranked up my workman, would a... Would you work... So the question is if you cranked up workman, would you be more likely to do materialize? Maybe, yeah, I could maybe. I think so. Give it a try, yeah. Okay, Memo-wise was introduced... Memo-wise is a weird term to me, like it's a memo, it's like a letter, like what is it, right? It turns out that Memo-wise is the sort of academic term for this thing, and I'll explain what this thing is. But that's how we got the word Memo-wise. We had a long discussion about what to call this, and somebody said, oh, that's Memo-wise. And we're like, what do you mean that's Memo-wise? And they sent us some academic paper, and they're like, oh, okay, that's what it is. All right, so let's take a look at what Memo-wise is. So it's kind of hard to set up, I need to create a table with duplicates that also is too small to make sense for a hash joint. I know that's like a big word, a lot of words. But effectively, what I have here is I'm going to do a join, and I have a small table, but it's not big enough to hash it, because hashing is expensive. And I need something that's too big for a hash joint on the other side. So it sounds like the requirements for this thing almost never happened, but it turns out that Memo-wise has a lot of things. But Memo-wise happens all the time. I don't know why, but when I read the description and when it's important, I was like, pfft, nobody's ever going to use this thing. But it turns out that it actually gets used quite a bit in real-world applications. But again, it's a case where we have a lot of duplicates, something's really big, something's really small, so you can memo-wise it, and something's really big, meaning you're going to do a lot of comparisons. So we have an index on the Memo-wise field. So here's the query we select from small with dupes, and we join it to a medium table. And here you see the Memo-wise clause right here. If you're curious, this blog post right here does a great job of explaining Memo-wise, and you can see right here, Postgres 14 is the release that that was added in, because that says Postgres 14 right there. 14 later, right there. All right, so what does Memo-wise do? It basically creates a local memory cache of the table you're joining to. So basically it's a case where I know I have a lot of duplicates here. So here's, like, this is duplicate of that, this is duplicate of that, so forth. So I know I have a lot of duplicates, so I know I'm going to be hitting the cache over and over again. Right? So I'm going to be hitting the cache over and over again. Right? So instead of doing what potentially could be an index lookup over and over again into the index, I create a cache. And I basically say, okay, is this a match? If it is, then I can say that's a join. If it isn't, then I've got to go over here and check and refresh and make sure it's okay. It also has a negative cache. I'm not showing that, but there's a cache of stuff that isn't there as well, which I'm not going to show you. So the point of Memo-wise here, it's right here, inner-side lookups that return no-os are also recorded in the cache. So my point is that when you're going to do an index lookup over and over again, because you have a lot of duplicates, and you're going to be checking it over and over again, why don't we create a cache so we can remember the index lookups, and we don't have to keep doing them. But again, only in limited cases, table has to be small, has to be duplicates, has to be the inner-side, the Memo-wise side has to have an index, so we can refresh the cache when we need to. That makes sense, it sounds like crazy. I thought it sounded crazy, but it actually is really useful, and it's kind of cool. So if you see Memo-wise in the future, you'll be like, oh, that's kind of neat. Okay, any questions? Okay, let's launch into more of a section. Okay, I know we've kind of hit a bunch of sort of discrete topics. I'm going to move into an area where we have some coherence. We kind of move through. We're going to talk about grouping and aggregates now. So here's a query where we do a join, and we're saying x is less than 0, group by x. So I didn't know that you can do a group by i when there's no aggregates in the query. I learned that in doing this presentation. I thought a group by always had to have some aggregates out here, but turns out it doesn't. Basically removing, wearing, adding, what if I does the same thing? So here's a group clause. It's going to give me everything x less than 0. And all it does is it basically just removes the duplicates. That's all the group does. It says okay, 1, 1. Okay, that comes across 1, 2. I got two of those. I'm only going to get one of those. For these I would go across. I have three of these. I get one of those. So again, group by with aggregates is similar to distinct, except duplicate detection can consider more columns than those selected in the output. Again, I give you, that's an option for studying later, exactly what that means. You can try it out and see how it works. You can do a group of a single column, and that is actually a use for group alone. You notice I'm getting, notice I have, these are not unique. Like this and this and this, these are different, but they all generate one output. So it kind of trims off the one column. I know it sounds really silly, but there are actual use cases to this. So all the ones get output, choose in the first column, get output, all the freeze get output. Aggregate, everyone's familiar with this, the count command, we have a node type for that, just called aggregate, very easy to predict. Here's a group aggregate, which would be a group by with account on top of it. So this is an aggregate, again, makes sense, right? We learned aggregate, we learned group. What do we call the node type when we have aggregate and group together? Group aggregate, right, makes a lot of sense, so that's what we call it. And group aggregate effectively outputs the non-aggregate column once, and just like the group by, which we talked about, and then instead, for the second column, it runs an aggregate across that second column, right, which is what we're all familiar with. Oh, why do I find networks are available? Isn't that exciting? Okay, hash aggregate. So this is a case where it's not actually an aggregate, we're basically doing a distinct using a hash. There's no mention of aggregate here at all, right? But what effectively we do is we take all of our values, and we put them in a hash, and we merely have one value for each hash. It's very similar to group, the group clause. Remember how the group clause got rid of duplicates? This is a way of doing it, except instead of doing it by group, we're doing it by hash. Okay, instead of sorting, we can basically just create a hash and remove the duplicates that way, and that's what the distinct is. So normally I wouldn't think of distinct as related to group, but in fact, I can see now, kind of, okay. And I have lost my mic, so I'm sorry about that. I will fix that. There we go. Great, okay. Mixed aggregate, I'm not sure how many of you are familiar with rollup. I do have a Windows function talk on my website that explains what rollup does, okay. And effectively it does, the rollup is basically taking, again, the unique values and then rolling them up into an aggregate. Okay, and it also sorts it, which is different than the other one, because you notice that it's all sorted, okay. Window functions, again, I have a window function talk on my website, but again, these are all kind of grouped together. So this is a sum over the entire result set. It generates something called a window ag, and a window ag effectively just takes each individual roll, but it manages to output an aggregate across all the rows within the group. If that makes no sense to you, I recommend you take a look at my window talk. It is kind of unusual how this works, but effectively all we're doing here is it allows us to maintain the distinctness of the rows. Window functions allow aggregates across rows while the individual rows remain distinct. And that's exactly what's happening with the window ag. Okay, moving on to parallelism, we do have a nice reference here to the Postgres stocks about parallelism. I'm going to go over a bunch of parallelism nodes that are quite interesting. So here is parallel sequential scan, partial aggregate, gather and finalize aggregate, okay. So here we're doing a sum on the large table. So we have a big table, we're doing a sum, and we generate a whole bunch of parallelism here. Parallel sequential scan, a partial aggregate, something called a gather, and then a finalizer aggregate. So kind of like prepare for the diagram of craziness here. What we basically have, again, going from left to right, we have the first part of the sequential scan. Remember, we're only scanning one table, but we've broken it up into two parts, because we want to scan them in parallel, right. So here we're scanning, we're using one background worker to scan the first part of the table in parallel. This is called a parallel sequential scan. We're taking the second part of the table. We're also doing a parallel sequential scan on the second part of the table. We're also going to generate what's called a partial aggregate. That partial aggregate is going to be the aggregate result across all of the rows that our parallel sequential scan has processed. And now we have a partial sum right here. The same thing down here, this is a partial sum here. We then send both results to the parent, which generates something called a gather node. That kind of makes sense. Now the gather node is gathering results from parallel workers. And of course, because we're generating a sum, all we need to do is add together the two rows that we've gathered, 27, 33, and we issue something called a finalized aggregate, and that generates my 60. Okay. Now again, this is just a two, but we could use a hundred ten. However many parallelism you decide to use. And again, it's scanning different parts of the table in parallel. Yes, sir. Why there is partial aggregate, finalized aggregate, and aggregate nodes, because they are just the same, but they are different parts. And using parallel here, you just aggregate on a smaller size table. But I don't know why you decided to call three different nodes, but basically they're the same. Okay, so the question is why are we doing, why do we have different, why is this not the same as that basically? And the reason is that for some, for the sum command, they're the same. But if I'm doing something like a max or standard deviation or something, we're going to have different operations to join these together. So sum is the simplest one, that's the one I use, but for other aggregates, these would be more complex. And we may do different things at different stages. But I see what you're saying, it's sort of, the point is that, it's just the way it's processed. They probably, some cases could be the same, other cases they can't, so we just call them different things. Okay. Gather, now we saw merge append. Now we have gather merge, which sounds kind of like, well, what happened? Okay. And what gather merge does is it effectively is going to take parallel workers and then just merge them together. Again, I have the same parallel scan here, I have a parallel scan here, I'm going to do a sort. So again, I'm not using aggregate here, I'm doing a sort. Okay. And now I've scanned part of it. You know, I keep doing that, that's not good. The reason is because of the way the clip, the clip doesn't go into my shirt properly, so I keep having to shove it in there. Alright. So basically we've sorted, within the background worker, our results, we sorted and now we're going to gather, merge, remember merge append or merge joint merge, merge append. We're going to take the lowest of this, the lowest of this, and then we're just going to keep doing it and then take those and merge the two ordered results together. Okay. Makes a lot of sense. Parallel append, all we're going to do here is we're going to append stuff together. This is one of the craziest diagrams I have, I think. So here we're doing, we're doing our background worker parallel scan and we're going to take the, we're going to append the two of the workers together because this is a join again and then we're going to take the other part, we're going to join that and then we're going to sort those and then we're going to merge them. So I know it sounds like kind of crazy, but what we're doing is we're doing four sorts and we're appending them in stages and then we're sorting those in batches. And then, so it's a combination of basically a parallel scan with a sort involved, which also happens in background workers. So again, it's just, it's just, this is the craziest diagram I think we have. Parallel hash, parallel hash join. Here we're doing a join, a join in parallel. Again, crazy diagram. Here's our parallel sequential scan. We're going to hash those together in a shared memory hash, which is kind of like mind blowing, but effectively we have dynamic shared memory and we're going to create for the background workers a shared hash table and they're going to join, push those into the parallel hash table and then once we get this shared hash, which has been built by multiple background workers, we're going to take our outer side and we're going to join against these, that shared hash into potential background workers. And then we're going to gather them together and get the result. So not only are we doing the sorting in parallel, we're actually creating the hash in parallel and we're doing the hash join in parallel and then we're returning the result. Okay, so I realize it's a lot, but that's exactly what it's doing. Okay, let's move on. Comment table expressions, again, have a nice talk about that on my website. Honestly, I don't get any money for advertising my talks, but you'd think so from this talk. So if we do a comment table expression with a materialized node, we just do something called a CT scan and effectively all we're doing we're scanning across the comment table expression we created. Okay, I got it. Work table scan, this is also with a recursive comment table expression. We would do that here. We're going to loop around through this and again we create something called a work table and a recursive union. This is a diagram from my other presentation. It talks about how comment table expressions work. And again, this is a diagram. It's basically looping in and creating this comment table expression. And then as you loop through the results, you're continuing to append to what we call a CT source, which would be used later in a query. I know if you're not familiar with comment table expressions, it's not going to make any sense. I apologize for that. I apologize for my microphone. Project set, this is a case where we have a function returning multiple rows in the target list. Not the from clause in the target list. Very interesting. Lock rows, if you do for update, we generate a lock rows node. If you do a table sample, we generate a sample scan. Not surprising. If you're using XML table, we actually have a table function scan. I think that is the only function call that uses that node type. Just a very special, very obscure case there. Foreign tables, if you're familiar with those, we have special foreign scans for those. If you've ever used CTids, we have a special Tids scan. CTids are the physical location of the values. We're basically using a Tids scan for that. This is what a Tids scan would do, effectively open a certain page and return a certain value in the page. Insert generates an insert node. Update generates an update node. Delete generates a delete node. Truncate does not, by the way. Truncate is different. Merge, the merge command generates a merge node. Exists generates something called a semi-join. A semi-join is very similar to a normal join, except it stops after the first intermatch. So it's similar to any other join, but it stops after the first. It doesn't keep going to find out how many matches there are. As soon as it finds one, it can stop. The in clause will use also a semi-join, and again, some details on how null handling works for in exists for those people who are curious. Not exists uses something called an anti-join. Not surprising. Anti-join for not exists. And not in is kind of weird. So technically exists and in are almost the same for nulls, but not exists and not in is actually different. And again, we kind of explain it in the query here. We also have something called an outer... We have a feature that I realized during writing this talk called an outer join removal. Notice I'm doing a left join on something where it actually removes the join itself, because it has a unique index and it knows there's only one possible match, so it actually got rid of the join, which I felt was like crazy. That optimized pretty smart. And finally, two things I didn't cover, tuple scan and custom scan. There are... There's documentation and postgres about it, but you don't see this very often. So that does complete what I wanted to do. I believe the time is exactly 9.50. Thank you.
Isolation Levels and MVCC in SQL Databases: A Technical Comparative Study
So, yeah, the idea is to talk about resolution in postgres, compare it with other databases. We will see the concept and we will see also the practical usage from developer point of view. I'm Frank Bacheau. I've always been working with databases. Now I'm developer advocate at Sugabite that is distributed postgres. We use the postgres code for the SQL layer but the storage is completely different. So, let's start if my clicker works. I have a question for you. So this is a picture of a car, obviously. Do you think the car is moving forward or backward? So, think about it. Actually it's not moving because it's a picture, not a movie. But the snapshot, the picture was taken when the car was moving. But I cannot tell you if it was moving forward or backward. There are two snapshots there because it was taken with a long exposure during the night. So this is why you see this movement of the light and also with a short flash taken either at the beginning or at the end. And this is why with the same movement you see different kind of anomaly there with lights in front or the other. So I'm showing that to compare with databases. Databases are always moving because people are changing data. And when we want to read it, we want to see a consistent view of it of something that is moving. But the same. There are some anomalies that can happen because when you read things change. So how can you solve this anomaly? One solution is to stop the car. You stop the car, you take the picture and then you have a nice picture of it. But then the car cannot move. Another one is to take a movie of the car and then when you want to see the full picture of the car, you can go maybe back in the movie and pause it and then you can see it. And actually databases do the same. They can stop the database for modifications with locks or they can take a movie of the changes to be able to go back to the past. So two kind of implementations and you find them in all databases mostly. You look, you can look what you read and then it stops. It cannot be changed until the end of your transaction or you can read from a preview snapshot. So you lag, you don't see the latest changes but you have a consistent point of view. If you have read books, learned about isolation levels, it was probably not introduced with the pictures I've seen. It was probably introduced with that. Don't try to read it. I will not even talk about that. This is what you find in most courses. This is what you find in the SQL standard but actually no database implements that. That was defined before any implementation and the modern databases do not work like that and do not provide those isolation levels even if they put the name to match what is in the standard. They do not really do that. So let's forget about it. If you want to read about it, there are plenty of sources. I would like more to explain how the real databases work and basically the isolation levels that you find in modern databases, you find so really simple that is very similar to what is defined in the SQL standard but the others are based on snapshot isolation like the picture we have taken. Okay. So I'm really saying that there is the SQL standard on one side and there are the databases that you use on the other side and there is a big difference. For the SQL standard, when SQL standard was defined, the idea was that users do not have to care about the others in the database. You just work, define your transactions, code your access to the database as if you are alone in the database. You do not have to look explicitly anything. The database will do it for you. So for the SQL standard, looks is something that can happen in the database but not visible to the user in the SQL standard. There is no lock command for example. But in real life, in the real databases that exist today, this is not how it works. Most of databases are able to read from a past snapshot. They record the changes and they are able to read from the past so that they don't have to look too much and the developers can also choose to look themselves some information that they don't want to change, to be changed by other transactions during their transaction. That's typically the select for updates. You will not find that in the SQL standard. You will find the syntax for select for updates but not to lock it. So yeah, big mismatch between the theory and how databases work. Before talking more about MVCC and defining it, I will introduce a few concepts. My idea is that there are many databases and they have many versions. You cannot learn, remember how everything works but if you have the basic concept then you can understand better how all those things are working. So I will explain a bit more what is special about SQL transactions especially if you come from no SQL databases. I will explain the reads and writes time in a transaction because transactions can be long and we will see optimistic and payment-seemistic locking and maybe name them differently. This explicit locking in SQL that is not in the standard but that you use in SQL applications. Any questions so far? Okay, first, SQL transactions are complex. So if you work with SQL databases you probably know that if you work with no SQL databases and if you see the no SQL database vendors telling you that they have transaction, they have AC properties, they may be, they are not talking about the same thing exactly. SQL transactions are complex because they do many reads and writes during the transaction and the transaction can take a few milliseconds or seconds or even minutes. Many reads and writes and the complexity is that in a transaction usually the writes depend on what you have read. Usually if you take a simple example you check in for a flight so you read to see the seats available and then you will pick one. But your choice to pick one that is a write will depend on what you have read. You don't want to pick a seat that is not free. And then the problem is that another transaction may have read the same map of the three seats and may pick the same. This is where we will have conflicts and we will see how it is solved. So SQL transactions are complex especially when you have a decision on what you have read that will determine what is done later. And in SQL you do not declare everything that you will do. I can say begin transaction, read something, have a coffee, come back, write something because during the coffee I was thinking about what I will write. That's very different from, for example, no SQL databases where each call you exactly tell what you do, what the database has to do. Here you have a transaction with user, maybe user interaction between them. And even if you do a single row insert that looks like a simple write, so not this complex problem of reading, writing, actually on your table you may have foreign keys, you may have secondary indexes to update, you have a primary key. So even this simple statement has to read before write. You read, does the key already exist? If it exists already you have an exception. If it doesn't exist then you can write. And when you write you write to secondary index, you check the foreign keys. So even something that is very simple from the developer point of view can be a complex transaction where write depends on reads. And as I mentioned, no SQL vendors now start to have some transactional behavior. But when they call that ACID like the SQL properties, it's not about those complex transactions. For example, in no SQL databases you just have single calls you put or you get something. Many no SQL databases do not try to be consistent. They have individual consistency when updating the secondary index. They don't have unique keys, etc. So that was what is more complex with SQL and why it's not so easy to implement a database. And then during this transaction where you have reads and writes they cannot all happen at the same time because it's different place on disk, it's different CPU instructions. So the problem is that the reads and writes happen at different time during your transaction. Cannot be atomic. But for the application it has to look like it is at atomic all happening at commit time. And so it happens at different time then on different states of the database and then the consistency is difficult. So how databases can try to reduce the number of different states where it reads and writes. You can read from the past if you recorded all changes like in a movie. You can read from the past but you cannot read from the future. Obviously you don't know what the others will do. Except if there is no modification for example if you look everything that you read you can read from the future because you know that the future state is exactly the same as the current one. For writes it's the opposite. You cannot write to the past when you see movies where people go to the past and change something it's really consistent at the end. But again you can be sure that there are no modifications. You look everything that you write and then where you write is the same state as at the end. So if I have a timeline between my begin transaction and commit I read and write at multiple times I can read and see as of the time of the start of my transaction for example. And I can read and write and at commit time come back to set the right time stamp as the commit time because this is where everything is visible. So we'll see that in NVCC databases you read from a state at the beginning of your transaction and you write as of the commit time. That's the magic of databases to be able to do something that looks like happening at the same time even if it's not possible physically. So to do that in NVCC databases the writes take an exclusive look. In NVCC is about reading from the past the write you cannot do anything else than looking what you write to be sure that it doesn't change until the end of the transaction. To read you can read from the past and then instead of multiple different states of the database you have only two states the read state and the write state. But then you need some additional logic in the database that compares those states check if they are compatible. If something has changed in between it may have to raise an error to tell you okay you have read from this state you have written on this state but they are not compatible because someone changed something in the state in between. Another concept optimistic and pessimistic looking and so those are general names that are used at many levels. At low level in the implementation but also from the application you can the idea is that optimistic looking you think you will not have any conflicts or you do not do additional things for that you will just raise an error and retry if it happens. Pessimistic looking you expect that you will be in conflict and then you wait on it. But those terms are a bit misleading. Those are terms for example in new Gabbai when we did the documentation we used other terms like wait on conflict. So this is really the behavior. A conflict is detected for example you want to take an exclusive lock someone else has locked the row. You just wait that they are finished so that you can continue so there is an thank you. So you wait for you detect the conflict on the row for example but you wait for transaction. Another one is fail on conflict just you detect a conflict you don't wait you just raise an error to the application and then the application can retry later with the hope that the conflict is gone. But there is another one and that's also why optimistic and pessimistic looking are not really useful. There is another case where you may want to skip on conflict. For example you want to read and lock a row that is locked by another transaction. You don't want to wait for it. You don't want to raise an error you will just ignore this row. This again of course locks are not in SQL standard but this cannot be in the SQL standard because it's not very deterministic but this is used a lot for queues for example if you want to dequeue events that are in a table you want to lock them when you process them but if someone else is processing a row you just want to process the next one. Typically we will see that it's the case of select for update skip locked. And last concept before going into the different implementation. Explicit locking so by the application again this is not in the SQL standard but this is our most application under the anomalies that could happen in the different isolation levels that you will see. And typically when you read a select for update is a read but it's a bit more than a read you read and you tell the database about your intention to update something which means that it's a read that behaves like a write like an update lock with a lock on it. And then you have those choices. Wait on conflict. Don't wait. Raise an error or skip lock. Those three possibilities where the developer tells the database I want to read that I don't want it to change or lock it and if you cannot lock it then give me an error or wait until you can lock it or just go to the next row I will come back to this one later. So you can just ignore the isolation levels and just manage that from the application. If you know what you are doing for example if I know that I'm reading the status of a hotel room that I want to book I can select for update what I read and then I know that nobody will change its status until I commit my transaction. So you don't really need to care about isolation levels if you are okay about thinking about the conflict yourself. Thinking is in the code and of course in tests the problem with concurrency is that it's not always easy to test the different combinations. But it's totally possible. And those select for update we will see also for share those concern rows but sometimes to avoid some anomalies you may want to lock more than a row maybe a row that doesn't exist yet. For example you don't want anyone to insert a new row that's where you realize that this row does not exist then you can also lock a table. Again you don't find that in the SQL standard but all databases give you the possibility to lock. Okay do you have any question about that? So the main message there is that it's totally okay you can ignore the isolation levels do your own looking it's totally okay but only if you understand them. So most people use the default isolation levels read committed in Postgres and that's fine. You don't need other isolation level if you don't want to but you need to understand how it works to be sure that you handle correctly the cases where you have to block something yourself. So basically that's the goal of the presentation talking about how it works so that you understand it maybe you will not change your isolation level after understanding all that but at least you will understand how it works and if you have some flaws in the logic with the select for update you use for example. So MVCC stands for multiversion concurrency control I don't really like this name others may call it multiversion read consistency at least it makes it clear that it is only about reads. The writes you cannot do anything else than looking what you write this is only about reads isolation levels are only about reads it was also called multi-generational architecture in the first databases that implement it but basically it's versioning when you change a row the database keeps the old version of the row so that you can read from a previous snapshot. Twenty years ago IBM in a paper that compared with Oracle was saying MVCC is implemented in only one database Oracle no other database did it because basically it's not good and today I think that only DB2 is not using MVCC release or the history has proved that they were actually wrong now all databases use MVCC because you don't have to look too much when you read. Another thing that they were saying was that the model where you look what you read the one that is defined in the SQL standard is better because the developer or the user doesn't have to think about the code around the select for updates but finally most application developers really prefer to put the select for updates in the transactions where they are needed rather than going to isolation levels that may have other problems the only thing where they were right in this paper was saying that you need to understand it in your database because all implementations are different. So the non-NVCC ones like the DB2 they cannot read from the past because they do not record all the changes they have to record it for recovery reason but they don't record it for this reason. So they cannot go to the past the only way is to look when you read you look in share mode so many people can read at the same time but nobody can update something that you have read which cause problem because then you may have deadlocks then you may have you look too much basically for no LTP application and MVCC databases they look only for the rights they read from the past and they have some conflict detection to see if those two states conflict with your transaction consistency or not. This is also for repeatable read only because if you do read commit even within transaction you will read the current... The remark is that when I say I read from the beginning of the transaction that's the general case in read committed we will see the isolation level at the end the read time may be reset for each statement and we will see why. But the idea is that you have only two states when you are reading your statement. And yeah this is about read. I will go first I have a reference on the first mention of this architecture but basically the changes are versioned we will see that there are different implementation Postgres versions the rows in the table and query can read as of a specific time so we will see as of the beginning of the transaction or as of the start of a statement and then it can do those optimistic looking without locks that's the main value of it the readers do not block the writers. Typically in non-MVCC databases if you are the DBA and just want to count the tables the rows in the table and you do a select count star which you do easily in Postgres if you do that in DB2 you will lock all your applications because you will lock everything so yeah quite nice to be able to read without blocking the others. So basically MVCC became popular because it allowed mixed workloads even if you have a no LTP application there is some reporting on it there are some analytic queries on it and you don't want them to block the others. So then the implementation all databases most databases today implement MVCC but all in a different way and then the bravio may be different and that's why it's interesting to understand it. First when you version you have a version number some databases use a timestamp some databases use a number that is always increasing the log sequence number for example in Postgres and what do you version? Many databases version the rows in Postgres each row if you update a row you will have a new version of the row. Then the question is I say what about the index entries? There are some databases that also version the index entries Postgres doesn't do versioning for the index so the index has the two entries and then in the row you have this version number. There are some databases actually I think there is only one database that doesn't version at all level but at block level so a low level the storage the page or vehicle does that at block level so very different implementation and where do you store the past versions? That's also a big problem you can store it keep it where it is for example in Postgres you have a row you update it the new version will be written at the end of the heap table and the old versions just stays there in other databases this old version will be moved to an undue log a rollback segment can have a different name and both are pros and cons. If you move it then you have the overhead to move it but you don't put overhead on the other that we have to read and the size of the table so many different implementations. Also you need to change the versions if an index go to a row and then you see that this version is too early when compared to your read time you want to go to a past version or it can be the opposite you always go to the past version and then you see if there are new changes. So in Postgres when a row is updated a new version is written at the end of the heap and the old version has a pointer to it so from old to the old. Question? I didn't get the end of the question. We have a microphone it will be easier. I'm just asking that the version number whatever it is which shows that a change has been made so will it also point to the row or rows which have been updated or deleted or changed like in one version or it just shows that this table has been changed in this version? For example in Postgres it is per row so typically if I read from an index the index entry the value that I'm looking for I have one or multiple index entries for that and then it will go to the table and then if it's not the right version it will go to the right version so it follows it. Okay so I think it's kind of an idea which is attached to each row and not the whole table right? Yeah the version is per row for each row. I think only Oracle is very different where the version is per page so you read a page you see which transactions are there and then there is the information to undo the page to a previous version but most databases do it per row. Of course it has an overhead in the row because you need enough information to follow that. And also the big question you keep all versions but how long? Because your database will grow if you just keep all versions. So you need to keep all versions to cover the longest query. If I have a long report that takes one hour and if it's read time is one hour ago I need one hour of versions but after some time you need to clean that and it Postgres is the vacuum that does that. Kind of garbage collection again it has different name in all databases but basically you need something that cleans that. If the old version was moved in a different place then the cleanup may be easy. You have the overhead when you move it but then the cleanup may be easy. If it's done in a lazy way where the old version stays in place then you need something that cleans up. So all implementations are different and then the most difficult is the indexes. The secondary indexes in Postgres all indexes are secondary indexes because the row is stored in the heap tables. They have all the entries and then when you clean up the old version you need also to clean up the index. So it's really something that is very complex to implement MVCC in databases. It has the big value of readers not blocking writers but it's not easy and all implementations are different. I will not go into the detail in a few databases that I compared. So I already mentioned the level of versioning table rows for Postgres. Oracle does it per block. For example in Ugabite we are per row but also per index. So the index has also the versioning. The interface is the same as it's compatible with Postgres but the storage is completely different. So I already mentioned how it works with Postgres, happened to the heap table. Then you have to think about where are those versions. If it is happened then the different versions are scattered. There are some other implementations where all versions of a key, the primary key are together. For example SQL Server implemented MVCC very late compared to the other databases. They move the old version to another database where initially the time DB now they have a specific storage for that. But it's all per key. MySQL is also very special. MySQL doesn't have heap tables. It's only in the index organized table. So the rows stored in the primary key. And they have two logs. They have an insert log which you don't really care when you read from the past. You just ignore the new insert but they have also a log for the updates or deletes. If you delete it, it's just marked like an update saying it is the end of lifecycle of the whole. So they have another log for that. Yeah, question? It has only one log, not two logs. It has different types. In MySQL? In Unodebs. Yeah, okay. Yeah, thanks. Only one log. Those are very different. I know quite well Postgres, the Wacol and of course, you Gabbite. I know a bit less those. The documentation is also not very... The SQL Server implementation, there are not a lot of documentation about it. MySQL, yeah. But basically, they all have also garbage collection and the main difference, the choice of the databases. I said that the retention of version must cover the longest query. Some databases, for example, Wacol, doesn't care after a while not to have too many old versions. So it will remove it. You set the retention that you do. But that means that a long query may encounter this snapshot to old error saying, oh, I cannot rebuild the read time that you ask for. Other databases will just don't do garbage collection, render longer running queries. For example, in Postgres, if you have a long running query, vacuum will not be able to clean up everything that it has to clean up. And that can be problematic when you read from a standby. And that can be different. You can have the same equivalent of error. I'm not going to go into all detail. The most important is that you understand the different concepts. Those are different. Those are pros and cons, the overhead on different operations. One that is very nice with Postgres because you have all the old version in place. Rollback is very fast. Usually don't really care about how fast is a rollback in normal operation. But when you have a crash recovery, you need to rollback the transaction that were not committed at the time of the crash. So this is really fast in Postgres. But in Postgres, keeping the old version in place, you have all these problems with bloat and vacuum that you have to manage. In other databases, the problem of stopping garbage collection or not when you have long running query, do you want to give priority to those queries that they can finish? Or do you want to give priority to the garbage collection that maintains the performance? So those are different pros and cons that we have seen. Important to know what is the default isolation level. Important to know because most people do not change it. They use the default. So at least you should know what you are using. In Postgres, it is the read committed isolation level. The same for many databases. And mostly for performance reasons. If serializable was easy and fast, probably everybody would use it, but not the case. And even if they all use read committed, they have a different behavior. I said that at some point it has to detect conflicts between reads and writes. And when it detects a conflict, the big advantage of read committed, and I will explain that later, the big advantage is that when there is a conflict, in some cases the database can restart the statement to a newer read time. So from the application point of view, you just wait a bit more and it is managed by the database. Postgres does not restart the statement, but it can reread a row, which may show some inconsistency because it can read a row from a different read time. In my SQL, the default is repeatable read. We have also some strange things that may happen and look like inconsistency. Some operations are not really isolated. The serializable. Very few databases use serializable by default. A few distributed SQL databases use it, but for example, co-crochet is implementing read committed. Mostly to be compatible with existing applications and because serializable has to look too many things. So I mentioned the read restart. When, so in read committed, what is different with read committed is that the contract with the developer is that the read time can change during a transaction for each statement. If I am in isolation levels above, the read time is also the time where the transaction starts, the begin transaction. With read committed, each statement has its own read time, which may bring more anomalies if you rely on what has been read before. But the big advantage of it, because the read time can change during a transaction, if a conflict is detected within a statement, the database may be able to restart it as of a newer read time. You cannot do that for the whole transaction because you don't know what the application did before. Maybe the transaction did something not transactional. Some databases restart the read in read committed when they encounter a problem. Postgres doesn't. Postgres just rereads the new whole. I think, I'm not sure, I think the main reason is that to restart a statement, you have to take a snapshot just before and to roll back to this safe point. I said stop shots, I must say a safe point. To roll back to this safe point to redo it. And in postgres, current version taking safe points for each statement is probably too much overhead. Actually, this read restart is also possible in other isolation levels if it is the first statement of the transaction. The thing is not really related to the isolation level. It's more when the database knows that the application did nothing before, then it can move the read point. So basically, I'm going fast on that. Basically, bad things can happen with all databases at all isolation level and it's good to know it. So postgres can read rows at different point in time. Oracle can set a reusable that is not reusable. It's kind of a lie in interpreting the isolation levels. SQL server has MVCC, but when it needs to lock, there is an eye overhead, so it's not always recommended. MySQL in repeatable read can see commits by others. So basically, bad things can happen at any level, not at real serializable, but this has a cost in terms of performance. But it's not a big problem if you understand it and you know that you can do your own concurrency control with select for update, for example. So the goal is to know which problem can happen with your transaction. It depends on the isolation level, the database you use, but also what you are doing in your transaction, and then you can use mostly select for update to manage that. So with implicit locking, I already said that it was not in the SQL standard. SQL standard was made for a lot of user interactions, long transactions, where a user really starts a transaction, can go back later at the time where users are at only one screen, so you cannot switch to another application and leave a transaction open. Today, you just don't want to do that. Today, applications probably know all the transaction intent from the beginning, which means that the application can say, okay, I'm reading, but with the goal, the intent of writing, then I do a select for update. And in short, if you use read committed, big advantage, there are no locks on what you read, and the read time is the start of the statement, which is really cool. And if you want, but may have some anomalies, if you don't want those anomalies, just use select for share, select for update to lock what you want. Instead of having the database locking all reads, you just lock the few ones that you need to lock. In repeatable read, the difference is that the read time is the start of the transaction, not the start of the statement. The problem is that there, a conflict may raise an error, and then in the application, you need to catch this error, a serializable error, to retry later. But you can also lock with select for update, or lock at a higher level, lock a table, but then, be careful, you lock a table, nobody can update it. In serializable level, everything works well. You don't have to care, you don't have to lock anything yourself. There is a performance penalty, but also, the developer doesn't have to care about locking, but has to care about the retry logic. And that's not so easy, it's not just retry until it works. You probably went like an exponential back off, you retry 10 milliseconds later, if it still blocked, you retry 100 milliseconds later. Okay. Going fast on that, there is more than select for updates, you can select for share, it's just a read lock, many sessions can select for share, but be careful, if in a transaction, you select for share, and then you update, you may have some cases where you will have a conflict, and then you prefer to reserve with an exclusive lock before. You can lock tables, but be careful, and for all of them, you have the choice of wait on conflict, or raise and error, or skip it, if it's a queue. Okay, we are just on time. I have a series, or I don't know, maybe 10 block post about it. So I try to summarize that in this presentation. This is a topic that is very interesting, because different in all databases, and many developers, there are a lot of developers that do not develop only on Postgres, they have other databases, and many developers think that they can write database agnostic applications by using a framework that can generate the syntax for many databases, that works for the syntax, but as we have seen, the preview is different on all databases, so there is nothing like a database agnostic application. If you have questions, do not hesitate to contact me. I will be there all day, but I have a session in one hour, the other side on something different, Linux, load average, if you want to look at it.
Reducing Costs and Improving Performance With Data Modeling in Postgres
Who's using Postgres in here? Yay! Thanks for being here on Sunday and our next speaker, Charlie, is going to talk about reducing the costs and reducing the costs. Those things out there. And other things as well. Good luck. Thank you. Thank you. Good evening. So yes, welcome. Today we're going to talk about how to reduce costs and improve performance. Doing really easy stuff, really easy things. Well my name is Charlie Batista. The presentation is not about me and this was made by ChatGPT so we can get these lights later. Sorry. Okay. Okay. Good to go. Nice. So what we're going to talk today, so we're going to have a bit of a review on what's about this talk. So we're going to review some concepts on how the hardware works, right? So try to understand a little bit what is cash, how Postgres stores data, and the summary. And thank you guys. You are good to go. So I think it's gone. See you next year. I'll be a bit fast because I have a lot of slides and not much off time so I will try to get. If you have questions at any time just raise your hand. That's fine. So just interrupt me. Or if you would like to, you can also wait for the end. We try to answer as many questions as you can. So what is this talk about? So this talk is not about go there and modeling a business. It's about how to model database, right? We try to understand a little bit how the underneath hardware works and how we can play nice with this and how Postgres can play nice with this, right? So we see some concepts about how computer stores data, how Postgres stores data. And it may be get a little low level but I'll try to keep it as more higher level as possible so we can all follow together. And we hope that the end of this talk will be able to understand a little bit and save some money, especially for those running on cloud, right? You know that space and these and those things cost a lot of money. And well, that said, let's start. We're going to do a quick review on the hardware. So I suppose most of you guys have seen this picture before. So this is the memory architecture. So how the memory is divided on the hardware. If you see down here, we have those secondary storage. This is your hard drive, SSD, HD type if anybody still use that thing. So this is the thing. If you see it's quite large but it's slow and usually inefficient. The latency is high. As we go up to the top, we go to the CPU registers, they get really, really fast. That's where the magic happens but they also get really, really expensive, right? We want to do our best to always use them in a very efficient way. Things to understand about memory. Memory is either volatile or non-volatile. That down here is non-volatile. That's where you save your data and you should save your data there if you want them the next day. Because if something happens and power loss or whatever, so everything that is up memory on those non-volatile, they're going to be lost, right? So also, as I said, the down is cheaper, the upper is higher price. Memory can also be basically accessed in three different ways. We have random access, direct access and sequential access. We'll see that most of the times we're doing random access, especially in RAM, RAM is basically random by nature. We also always try to do, when you go to the hard driver, to do sequential access both write and read and you try to understand why. So if you see here, see I have four CPU cores and I have the IO controller. One thing that you need to realize is the CPU is not connected to your disk. There is no physical cable or path away that the CPU talks to your hard drive. It doesn't matter if it's SSD, ADDD, tape or whatever. It needs to go to the memory controller. The memory has physical direct access to the hard drive. So every time that you need something, your CPU asks the memory and the memory catches that thing inside of the hard drive. And then it moves up all the way to the CPU. Also a very interesting thing that most of the developers do not realize is we don't write bytes on hard drive. When you're a software, open a text file and you save your name, my name is Charlie, five characters, five bytes, no, I still save one block in which most of the systems are four kilobytes. For that very simple operation of five characters, I'm dealing with four kilobytes block on the operational system. And that is for everything. The database also do the same. So if we can start doing more work with those four kilobytes block, things can go faster. That's one of the main ideas. Another thing that you see here, ADDD, they're really slow but still have a lot of companies and people using them because they're quite inexpensive nowadays. But random IEO is terrible, is low, is horrible. Every time that you need to do a random IEO there is horrible. And the problem is it's a mechanical device. So most of the people believe that the performance problem is on the plate, that the spinning plate. Actually that one is fast. The problem is on top of the spinning plate you have a literal ah that needs to move back and forward. And this movement is really, really slow. Really, really slow. So if you do a random IEO and the arm keeps, needs to keep moving back and forward, that's going to be horrible or whatever application. And especially for database. That's going to be really, really bad. So on SSDs, it's not that bad but it's still, the performance or random IEO is still not as the same as the sequential IEO. Both writes and reads. So writes are closer but they're still not as the same. So this is a little bit about what I mean about sequential axis. Sequential axis is when you write one block after another. Sequential and random is when you have that mess. So most of college people there when you go to their bedroom they had random axis there. So you can think about that thing. So but yeah. What is this cache? We're talking about improving performance and talking about those things but what does it has to do with cache and performance and database? So cache on its very simplistic definition, I got it from the Wikipedia, it is a hardware or software component that stores data so you can access them in the future in a faster way. We have many, many different levels of cache. So we have natural cache, we have hard drives cache, we have application cache, most of the database they do have the application cache and we also have the CPU cache. This is the one that we really interest for us today. And we have some definitions here. So what is a cache heat? Anybody? Come on guys. Exactly. Let's say for example I want to do a select and I want to get the Charlie, the first time that I do select and get Charlie's information and goes for the disk up to memory CPU. So but it stays there. The database doesn't throw it back because next time that I do a select for a Charlie's information again the memory is not there. The information is on the memory. If luckily on the CPU cache, remember that high top we also have cache there close to the CPU which is really, really fast. This is a cache heat and a cache miss is the opposite. So if I do a select and that row has never been selected before it needs to go to the hard driver so that's a miss. It needs to go and then all the way so that's going to be really, really slow very inefficient. So if you have a heat ratio higher so more cache heat than cache misses the better and the faster is our application. So we always try to improve that metric there. It's a very important metric. We also have some writing policies. So we have the write through and write back policies. So the write through policy is when you send information to the cache especially when we save in data to the cache so the database is saving data it can keep the information in the cache now and save later or can immediately save that information both in the cache and in the hard driver. So a write through is immediately save that information. So it stays in the cache but then immediately saves information. So what's the problem with that? The problem is latency. So we increase latency when you use that policy. Oh is it all bad? Some applications are fine with that. So some applications that need higher reliability will implement that policy. And so we have tradeoffs here. The other one, the write back. So the information stays on the cache and eventually that's up to the CPU if it's a CPU cache or up to the aggregate of the application to eventually save that data back to the hard driver. Remember, everything that's up there if we have power loss we might have a problem. So in this case we might improve a lot of performance but we may lose in reliability. So we need to have that tradeoff. And then we have the prefetch. Different CPUs, they're very smart. At least they tend to be very smart. So those theory that when you go to cache, the information that you get now you probably will get more information around that data. So we call it cache locality. So based on locality if I have Maria, Charlie and John, if I select Charlie, the probability that they'll need data from Maria and John is higher than that I get down the line. So it's playing about probability. We also have the time locality. So the information that they just accessed now has higher chance that they're going to access in the future, like in five minutes and a few minutes. So what this prefetch does, when we ask for one block for the CPU, the CPU, oh, this guy's accessing this block so he's probably going to need the next block. So the CPU prefetch loads in advance that block and puts in cache. So if I need that block, it's already in cache. The CPU doesn't need to go back to fetch that information again. So that's improved performance. That's awesome, right? Now comes the problem. Cache are expensive. So when it gets close to the CPU, they get really, really expensive. If you're going to buy a laptop or device or whatever and you choose the CPU i9, i20, you're going to see they have those L1 cache, L2 cache, and they're usually in gigabytes, right? Thatabytes. They're usually in kilobytes. You're going to get kilobytes of cache. What can you do with kilobytes nowadays? Almost nothing, right? Another problem is what we call cache line. The cache line is literally the line that the cache is divided. So the cache is divided in many lines and each line has a specific size and usually depends on the CPU word size. What's that CPU word size? So if I have a 64-bit CPU, the CPU word size is going to be 64-bits and the cache line likely would be 64-bits. So why likely? Because when we go to the CPU, we start not counting the time in times anymore, in seconds or nanoseconds. Now we're counting the time in clock cycles. So inside of the CPU, it has those things that they call registered that are really, really fast. It takes only one clock cycle for the CPU to get information there. So it's really fast. When we go to the cache, the L1 cache, it usually takes between three to seven clock cycles. It depends on the CPU and depends on the cache. So things start to slow down. People always tell you memory is really fast. Memory takes 215 clock cycles. Can be 100 times slower than the CPU cache. So memory for the CPU point of view is really, really, really, really slow. So we don't want to go there. Can you imagine your hard driver? They didn't even put here because I don't know the number of years of data to fit there. That's insanely slow. So that's why we always try to put information there. Remember that thing that I said about the line size? So we might have a problem. We want to fit everything inside that size. We don't want to waste that thing. So can you guys spot a problem in this here? I put a really simple algorithm here. So a tip is not the code or anything. It compiles. So that's not a problem. Anybody? On the line. On the line. That's exactly on the line. So most of the developers, at least a lot of them, will believe that information will fit in memory just like this. So we have the int, the boolean, the int, the boolean again. The problem is it won't. Why? Let me see if this thing works. Can you see? No, that pointer doesn't work. So you see from 0 to 7, it has 8 blocks, little blocks. So it is 64 bits there. 8 bytes or 64 bits. This is in this example the size of my cache line. So because the CPU can only fetch one word size at a time, it can only fetch that information. So the boolean that is only one bit here will have to go to the next page. And see all those white stuff? We call it padding. It's waste. Waste of money and waste of time. So in this example here, if we go back, we have one cube, big blocks, right? We could fit them in two big blocks if we aligned them properly. So we will have two CPU blocks to fetch that information. But actually, we're going to have four, double the time. And we only have four variables here. So we mess it up things really, really quickly. Keep that in mind because that's going to be really important for our discussion here moving on. So yeah, how POSC is organized the data? So POSC and most of the database has its own file organization. Those are the most common one. We have the B3s. I just put here in ascending order. So no special things. It's not that one uses more or less on B3 or one is better on others. It's not the point. So but POSC uses what we call hip files. Hip files are really interesting things because they are very simple. If not the simplest one, it's one of the most simple. How a hip file does work? So it's basically one spaghetti of information. You put one block after another. That's it. That's a hip file. There's no order, no guarantee of order. There's nothing special. You just have one block after another. This is a hip file. So it's very simplistic implementation, right? But it can also be very efficient because remember the locality thing. So if you just have one block and if you prefetch all that information, that's going to be sequential. A problem for indexes, sometimes people do not understand why POSC please do not pick that amazing nice index that I just created for my application. So it was there. I have an index there. Sometimes, you know, but the database doesn't pick it up. And a lot of people, they want to facilitate the database to pick things up because they think there's nothing in the database. Well, sometimes they also do and then they realize they're not. But it happens. So one of the problems is the index is a B3, most of them. So B3 by nature is random. The access is random because you have one block here, another here, another here, another there. So you are changing the access pattern from sequential to random. Remember how random lies expensive, especially on the ADD thing that they spin in this? Yeah, that would be horrible, even for a really efficient index. So that's why the database say, nah, not today, maybe tomorrow, you know, today I'm fine. I just do a full table scan, let's just get everything. And more often than we sometimes think it's faster. So but if files in postgres, they have some very interesting properties that go back here. If I put somewhere, no, here, no, here. Well, they are eight kilobytes. So the block on the hip file, I think, yeah, eight kilobytes size. And each file has the limit of one gigabyte. So it means that we can only have one gigabyte table in postgres, right? No, limit one gigabyte, just create another one, and another one, another one. I have one terabyte information that I'm going to have 1025 files. So be mindful of that, because depending on the file system that you use, some file systems, they do not play well with many, many files on the same folder. Right? Be mindful of that as well. That might be a problem. So it's of eight kilobytes size. Keep that information in mind as well. That's also going to be quite important when we move on. So as I said, one row is just appended after another. So not in fancy. There's no fancy organization or whatever. And if you go for postgres, insert all your data in a nice way, index it there, okay, yeah. And if you do an update on that data, and then you search again, if you do not put another thereby, that's going to mess up your order. So be also mindful about that. So the hit files has no guarantee of order at all. Right? So this is basically how it's organized. And every single block, it has its own organization. So it has a header. The header has a lot of data. I put the information here, as I said, I'm not going to go through all of them, because that's not the point. But you can go and search on the documentation. They're really nice. Now what matters for our discussion here is how these things are organized. Right? So this is inside of the block. This is a picture of the block inside of the database, inside of postgres. So we have the header, and then we have the data. Nice thing is data is started and stored by the end, not by the beginning. We have an index, sorry, a pointer that points to the data. So the data goes towards the center from the end of the block, and the point also goes toward to the center. So when they get close together, that means I don't have space anymore on that block. Right? The block is full. The database needs to create another block to put more information. Also, postgres doesn't split the data between blocks. If your block doesn't fit inside, if your data doesn't fit inside one block that is 8 kilobytes, actually we'll see that if it doesn't fit inside half of the block, it's going to do something else. Otherwise, you're not going to lose your data. But yeah, the database will handle that. So also, those rows here that we see here, those tuples, tuples and rows, we call them interchangeably the same thing. So the rows here, they also have their own organization. I put here all those things. And I will go for a couple of things that are important for the discussion, not all of them. They are fixed size. Well, they have a header as well, that the header is fixed size. A question, does anybody know how postgres stores new? If I have a table and I have a lot of columns that can be new, anybody? The null bit map. Sorry? The null bit map. The null bit map. Does anybody know what is a bit map? Okay. Yeah, a bunch of bits. So you have what we call map, actually, you have the sequence of bits. You have 11111010100001. That's the map of bits. So what postgres does, it has this sequence of things. And the position of the one of zero on that thing is the position of the column inside of your row. So if it's zero, it means there's no null. If it's one, that means you have a null. So it's highly efficient and compact the way that stores null inside of postgres. Yeah. This is how it works. And also what is very important here, this is, again, another photo of the row, is this padding. Remember the CPU padding? Well, the database also tries to keep things aligned with the CPU. So if things don't align with the CPU, the database will add padding. And we'll see how nice that can play. Right? Remember that I said if the data doesn't fit there, the postgres will save somewhere else. This is what we call toast. This is the oversized attribute storage technique. And usually goes really well with coffee. So postgres uses another file, that's not your data file, to put the information that doesn't fit inside of the block. And then it puts a reference inside of the block, a pointer, to the beginning of the data on the other file. So the database does it automatically, you don't have to do anything about that thing. So every time that you have text, VACA, 3000 and something, that, you're going to have a toast because that won't fit in 8 kilobytes page. So there are a lot of improvements since being created with compression and those things. So it plays quite well. And when we understand how those things work, we can also change our data organization. But that by itself would be another talk. But I can give a just a small example here. So let's say we have the table user, where the user does the authentication, right? So for the authentication to work, we only need to compare the username and password. But then when the user has been authenticated, sometimes you want to show the pictures, the bio and a lot of information that's there in the toast. So if we create a table with only that information for the authentication, the authentication process that's a complicated process can go really, really fast because you can fit a lot of things inside one block space. And the database doesn't need to go to the toast. We have the toast in the information and the modeling. So and then when we need to show the information from the user, we can use, still use that same primary key that we have a reference for for an key and get that data on that very specific time. And for most of the applications, the authentication is a hot path. Everybody knows goes to the authentication. But really few people go to the bio page. So we can improve the performance of our application just doing nothing. Really simple changes. And we get a boost on the performance. So we already got the things, the performance improvement. How those works. So as I said, when it's filled a percentage of the block, it will be saved elsewhere and has a pointer to those things. And now we come to what really matters. So if the data is too large, they go into toast. But what if I have too many small problems? Well, if you have too many small post-process has a limit of how many columns you can fit inside of your table. Because in that case, it won't fit on the block, right? So the question is, if we have too many columns, how does it do if it doesn't split? So you just have the maximum of columns. Depending on the data type we're using, for example, if you're using big inch, that is eight bytes, so you have that number. If you're using smaller inch, you can fit more. So but you have a hard limit because of the block size. That is what it does. If it doesn't fit, you need to split the table in more than two, three, four, whatever many times you need. But that would be insane. Like have hundred or six, 64 columns inside of the table. Can you repeat the question because you're not talking about it? So the question is, the post puts the information outside because it's a larger than the block. But if we have too many rows, too many columns inside of the row, how post-process handles that? So that was the question. And again, post-process just has a lot of limits on the number that you can put there. More than that, you need to create more than one table. That would be the solution. Now we come to the data alignment and padding. So remember that I said the database also does padding and post-process also does. So the natural alignment on post-process padding is of eight bytes or 64 bits. That's the natural alignment. So what does that mean? It means that every time that you put one integer, integer is four bytes, you need to put another integer together to get a perfect alignment. If you put an integer of four bytes and a big integer just after that integer, so that big integer is going to go to the next one because it doesn't fit the natural alignment. And you can ask me, well, what's the problem? Just go to the next one. Not a big deal. So every type has its own alignment here, as you can see. Shaw has basically no alignment needed because it's variable. It doesn't mean that's good to use car. Actually it's good to put at the end, not at the beginning. And we'll see why. So we see that Shaw has alignment of two bytes. It means, yep. So does the order of the fields or can optimize the order of the fields? So when I decide my table, do I need to think this? I got it. So the question is, does the database automatically reorganize internally to put it in the best way? Or does the DBA need himself to change the order of the fields, the columns, to put in the best way? The database does not. This is work of the DBA. The DBA has to do this work, and it's a really good question. And we'll see sooner or later why it happens. As we see here, every type has its own alignment. This is on post-presokumentation. It's just a copy and paste. So the two bytes on most machine, that means that one will occupy two bytes there. So for example, we are aligned with eight bytes. So I can fit four of those data types one after another that does eight bytes. However, if I put one of two bytes in a big integer that's eight bytes by itself, we're going to have six bytes, pageant, waste. Remember, pageant is waste. Most of the time, waste of money, especially if we're running on cloud. All right. And it's really possible to optimize those things to make them to work in a better way. And this is an example. So in this example here, what I did, yes, let's say I create that table. See, we have a really few columns there, not many. I just put in a random order, like I put an integer, and then I put a bar card, and then a timestamp. So that was a very small table. I only inserted one million rows. So on that one million rows, I got a certain size, and then I organized it just to align better, to remove the pattern. The alignment saved me 25% of space. How much would be 25% less in your build on AWS of storage? We make and buy a burger, right? We never know. So we'll see a couple of our examples. Wait for people to take photos. All right. Yeah, question? Jason B, it has its own, if I test the Jason B data type, that's the question, right? So Jason B has its own specificities, so, and most of it doesn't go inside of the block, because most of the time it doesn't fit inside of the block, right? The problem about Jason B, it has its own algorithm of optimizations. So, but I haven't answered your question, I did not, but that would be really interesting one to do, to see how that works, how that plays together. Especially because Jason B has a binary format, so the binary itself, algorithm should be able to do a lot of optimizations on the way. But yeah, I'll take notes of that one, because now I'm curious. Another question? So the question is, if there is any analyzer tool that we can use to see how much space we're wasting, right? Not that I know. That might have there, but not that I know. That also would be a really nice open source tool to develop, you know? People looking for ideas, that's a nice one. Back to Jason B, the data base, the table. Like, there are a few things about the Jason B, is that like 60 bytes, 32 bytes, 64 bytes, how do you organize the database? Okay, the question is back to Jason B, how does it organize it? It's organized in 64, 32 bytes. I don't have an answer for that question, because I really have no idea. I haven't played with Jason B much. Most of the work I'm doing with those things, performance, and how I would say, netrodata types, right? So, but yeah, I don't have an answer for that either. How long do you take to review everything and optimize it? And are there any tools to make the process much quicker, like any automated scripts? The question is how long does it take to do the full reveal and the process, and if there is two? Well, I don't know of two. And the time, it highly depends how complex your database. If you have a small database, like the one that they use for the TPCC, it took me like a few minutes, right? But to have really complex database, it can take you some time. But it's time that works it. The question is if the alignment only works on Postgres or if the other database is as well. If I'm not mistaken, Oracle also does, but its implementation is specific. It's from database to database. And I have not played with those, especially because this type of documentation is not highly available. And if you do not have documentation, you really need to go to the binary and it's really time consuming. And I haven't been working with other databases for like 15 years. So yeah, but they probably do. The question is if we have a way to ask Postgres how the alignment is, right? If we have tooling, yes, we do have some extensions that we can go deep and see how the organization is. We even have some extension that we can check the memory for from time to time. Okay, moving on because I only have 15 minutes. Thanks, sir. Probably now it's less because he's been showing me that for five minutes. So, okay, what are the implications about those things, right? I showed you one example. Now I'm going to show you another one that they use a tool named sysbench that they did a TPCC-like experimentation to see. So this is just an example of one of the tables. This is how sysbench created the table. This is normal stuff. And you see it's really small table. There's not much. And most of the columns are integers. So nothing going on there, right? And this is what I did. Besides the flashing thing, did you guys not see anything? I changed some orders of the column. It's a shame that this point is not working. But yeah, I changed some orders. See, I put all the integers on the top and then I put the small integers on the way that they are in pairs to improve the alignment. And then I put the other columns. See that I put the timestamp. A timestamp, if I'm not mistaken, is also eight or four bytes. And after that one, I put the another small inch because they can still use the same space on the alignment. So I have four and two, still have six. So the only padding that I'm going to have is after that one. So I tried to minimize the padding as small as I could. Back to the other one. See, it's just not that bad. We had an integer as small as small integer and then an integer. So not so bad. So it shouldn't be of huge difference. So this is what happened. See the new on the left, the schema name. The schema new are where I created the new ones. The schema public are the old things. If you see the total size of the orderline one is 3,000 megabytes or three gigabytes, 3.8, to be more fair. The old one was 4.1. It's about like 15% increase, right? And also interesting, look at the index size. We also have improvement on the index size and they have exactly the same indexes. Let's go back. See the index, the primary key here, we have the column one, two, three, four, and then another two columns for the other index. And the same. Exactly the same indexes. Don't change the name. Exactly the same color, exactly the same order of the columns. I didn't even play on the index for the optimization. I just left them there. I just did for the table, right? And this is what happens. And I'm highlighting here for one of them how much space is saved. One table. And a very small table, right? So, but how does it play with performance? Because, okay, one thing is I can save any space on this and performance. So obviously it does a TPCC type test, right? Try it to use as well. The answer is I got an average 8.4 performance improvement. About around, on average, for this example, this load, 19% disk space production. I'm cutting down at 19% of my disk space views on cloud providers. I think that's why they never approve my talks on their conferences. Now thinking about that, I think it makes sense, right? The latency, I reduced it about 15% latency improvement. Just shuffling the columns around. The application does not even need to do. And I'll tell you, when I created those tables, because the SIS bench has its fixed structure on those, the insert and updates, I had to trick the SIS bench. I had to create views and then rows inside of Postgres to make those views insertable, updateable and deleteable. So on top of this, I still have latency on the database I had to put because of the tooling. And even though I got almost 9% improvement. Can I just clarify, I'm confused, your average write and read around 8.2, 8.5. So I thought the latency would be around between those two numbers. But obviously you mean latency in a different sense, or you're measuring? Latency on the application side, because latency is not about how fast or as low you get the data, but also at the end how fast or as low you process the data. So you have more data in smaller blocks. So on the application side, latency is going to be a lot better. So I'm also improving the application for free. You have some kind of improvement, but what are the things you need to do to reorganize the columns? So your question is what happens if I need to go to the table and reorganize the columns, right? So, yeah, if you already have an application, you may need to do the trick that I did here. Right, first create the new table, so you're going to double that for a certain moment type. You create the new tables, and then you create views and you create rules inside of the database. So where your application will insert, update, and delete, obviously select as well on those views until you change. Or if you use a tool like, if I'm not mistaken, PG... I forgot the name of the tool, you know. Actually, the guy had a really nice talk on Friday on PG Day that you can do online data change for your thing. So your application does not even need to know that you're changing the database. So you have a few different options. What's the name of the tool? PG Online Scammer Change? That's from ISQL. Oh, PG, okay, yeah. PG Online Scammer Change, that's true. Thank you. Why was there is a reason for the project not doing the organization for you? The question is, is there is a reason that the PostgreSQL does not do the organization? Yes, because even though we can have really smart things on there, the database doesn't know the full story. So it may try to reorganize the data and mess up the things on the way. So it's always safer for the DBA to do those things. And at the end of the day, the database should keep the data and retrieve the data for you. So you need to know how the data plays inside as well. Yeah, I wonder if it's important which columns can be known and how often these rows actually contain those columns? So the question is, is it important for the column to be known or not and how often it plays? So it is important, not for the column itself, but the following ones. Remember, PostgreSQL doesn't store the know, right? It stores on the bitmap. So then will be a pointer there and nothing on that place, right? So yeah, it does play a role. I haven't tested that to see how much impact that would be, especially because I just used the tooling, right? So but yeah, it definitely should have some impact, for sure. Would it be fair to say that this benchmark measures bit tree more than anything? And let's say that we're not prioritizing latency, but rather we're doing like the same sort of dense joints and we want more bandwidth. Is it true that if there's something like grid, right, we're going to be able to get as much bigger improvement in bandwidth points? Can you rephrase that? I don't think I understood the question. So for this benchmark, it seems like it measures bitmap scan. So this is like a bit tree benchmark. And let's say that we're not after latency reduction, but rather we want to get more bandwidth. Let's say in a case where we would have distinct or time-based joints. Could we use something like grid to effectively get a much bigger set in bandwidth if it's going to be dense enough? If I understood it correctly, the question is it's fair to say that the benchmark mostly tested the bit tree performance, right? So not really the density, if more density of data that would improve. Well, actually, as I explained, post-crisis, especially for insertion and things, we're going to have on the heap file. So we often do not do, at least in this type of benchmark, not do with the bit trees and performance would be really marginal, especially if you saw we don't have many indexes. It's like most of the tables here are the primary key, and I feel them, they're only one index. And the only bit tree structure on post-crisis for this example are the indexes, because the table itself, they are not. They're just heap files, right? So in that sense, I would not fully agree, but definitely density would play, like if you could have more dense tables. And the example that I gave for the authentication, so what do you do when you change? You just put a few columns there for the authentication. You are increasing the density of the information you have on that table. So on the same block instead of having three or three, that will have a thousand, right? It's a lot more dense, so that per se makes it a lot faster, especially if you think about that, time is also wasted on the network, as you mentioned. And network is not only the bandwidth of the network inside of the computer itself. Well, we have five minutes. Let's rush to the end. So this is a summary. So post-crisis stories. And actually, if you, now wait for the photos, almost, okay? So if you want to take something, this is the summary. Every data type has its own alignment on post-crisis and can cause padding. And it's really, really dangerous. So we can get really, really mass with all data, especially because most of the tables in our application has like 20, 20 something columns. And we are not careful on how we put them. Yeah, we can mess up. Questions? I think we have two minutes. Possibly, yeah. Yeah, possibly. When you have fields with varying size, like the text field, what are the problems? The text, the VARCA, right? So they do not play well with padding because as the variable, so the database doesn't know how to optimize them on that way. So it highly depends on how your information is put. And that's usually best to let them to the end. Sorry, can you say that again? Okay. Uh-huh. Yeah, the question is, if I test it on larger database because smaller database might just fit fully in memory. So yes, I did. And what you need to realize is all the data, all the padding that we have here goes to memory and all the pad that goes to memory goes to network. So in all the padding that goes to network goes to your application. So you're wasting space in your hard drive, in your memory, in your CPU cache, in your network and your application. What are the numbers? Well, it highly depends on application to application. So I would say that's empirical. There's no science on that one that you can get up to 30% performance improvements. That's all. If you have more questions, thank you guys. Here is my link again. Thank you.
Clustering in PostgreSQL: Because one database server is never enough (and neither is two)
Hello, everyone. Welcome or welcome back to the Postgres Day Room. We have a great speaker again. Umar is going to talk about clustering in Postgres. Thank you. Hello and good afternoon, everyone. Thank you for being here and not going out for lunch instead. So we're going to be talking about clustering in Postgres, and we're going to be walking through an abstracted level concept as to why clustering is required, the various different architectures that are typically used in production to make your database reliable. And the challenges that are associated with the concept. A little bit of an introduction about myself. The name is Umar Shahid. I came all the way from Islamabad, Pakistan to talk to you about clustering. I've been working in the Postgres space for more than 20 years. And I am currently running a company, the name of Stormatics, which was founded about a year ago. So working on a startup focused on professional services. And yeah, my past has been associated with various other Postgres organizations, including EDB, Second Quadrant, OpenSTG and Percona. Two of these companies do not exist anymore. OpenSTG was acquired by Amazon and Second Quadrant was acquired by EDB. So Stormatics is focused on providing professional services for Postgres. We don't want to talk about that a whole lot. So, on to the topic. Now, in order to understand why clustering is required, it's important to understand what high availability is and why you need a highly available database. Now, you want your database to remain operational, even in the face of failure of hardware or network or your application. You want to minimize the downtime so that the users of your application do not experience any interruption in their experience. And it's absolutely essential for mission critical applications that need to be running 24 by 7. Now, if you go back about maybe 10 odd years, it was okay to, let's say, have your credit card declined or just not working on a machine because the network was not available or somehow the connection was broken or there was some problem with the communication. In this day and age, what you expect to be able to do is just to tap your phone and instantaneously get the transaction through. And you never expect it to be dropped or an error to pop up unless, of course, you run out of credit. That's a different story. But it just works. Everything just works. And the only reason everything just works is because the entire infrastructure is highly available. It's always available. And that availability is measured in 9s. And I'm going to explain in just a second what those 9s are. Now, we're going to start with a very basic and that's 90% availability. When I say 90%, it sounds like a lot, but that's one line of availability. And if you span that over the course of an entire year, that means that the potential of the database to be down while maintaining 90% of availability is 36.5 days. So in a given year, if your database is 90% available, it's going to be down for 36.5 days. Anybody find that acceptable? I really hope not. Now, you go to two 9s of availability. That's 99% and that's a better number. That's equal in order of magnitude higher in terms of availability. And the downtime goes down from 36 days to 3.6 days. Now, again, in a given year, if your database is 99% available, it is going to be down for almost 3.5 days. And again, for most mission critical applications, that is not acceptable. 99.9% is 3.9s of availability with an allowance of 8.7 hours of downtime. You go off to 4.9s, 99.99, that's 52.6 minutes per year. And 5.9s of availability translate into 5.26 minutes per year. So just to make sure that we understand what availability is and how it is calculated. Now, the database runs on the cloud, so you don't care, right? How many people agree with this? Oh, I'm so glad that nobody agrees with that. We've got a bunch of experts over here. Yeah, so just because your database is on the cloud does not mean that it is always available. And here's, I just copied and pasted the service level agreement that Amazon has up on their website for highly available RDF clutches that are in multi-availability zone configuration. So you have the highest form of availability that you can get with RDS, right? They talk about MySQL, MariaDB, and Oracle, and Postgres, and they specifically say that it's only, this SLA, is only for multi-availability zone configurations, and they're going to make commercially reasonable efforts to make this available, right? So if they're losing money, they're not going to do it. And what they promise is three and a half nines. Not four, not five, three and a half nines, 99.95%. And if they're unable to give you those three and a half nines, what they say is that affected customers will be eligible to receive a service credit, i.e. the service that had gone down on you, you get more of it. So three and a half nines translate into 4.38 hours of downtime per year. So if you're running an RDF cluster that is spread over multiple availability zones, you can expect your database to be down for almost four and a half hours every year. Now, what do you do if you want better availability? That's one of the reasons why you have clustering. And I'm going to run through now a few basic architectures of how clustering works with Postgres. This is probably true generally for databases as well, but we are in the Postgres Dev Room, so we are going to be talking about Postgres. Now, in this very simple basic cluster, you've got one primary database and two standby nodes. The way this cluster is structured is that the internals of the architecture are invisible to the application. The application simply talks to the cluster, which of the nodes it's talking to doesn't care. It reads and writes to the cluster, and the two standbys are essentially read replicas, so there's redundancy in data in each of these replicas. And in case the primary overhead goes down for whatever reason, there's a hardware error, there's a software error, there's a network communication error, whatever it is, one of the standbys can take over and the whole cluster will just continue working. And diagrammatically, just to explain how that works, just so that we're able to visualize the whole sequence, that's the sequence of events. If the primary goes down, step one, the standby one, or well, it could be the standby two as well, but just for illustration purposes, the standby one takes over as the primary. The previous primary is retired from the cluster. A new replication path is set up from standby one to standby two. The standby one is labeled as a primary. Standby two becomes the prime replica of this new primary, and a new node is spun up, or the old primary is recovered and included in part of the cluster. So this is, you know, at a very abstracted and high level, diagrammatically, how an auto-failover procedure works. Now, there are other forms or other variations of clusters that you can set up with Postgres, with various different intents. This illustration talks about a cluster that has load balancing, and what load balancing does is that it focuses the right processes to one node, which is the primary. Those rights, that data, is replicated over to the two standbys, and the application can actually read from one or both of the standbys. The idea over here is not to allow the primary node to get hogged by read operations. It can focus on the rights, and the read can be served from the two standbys. This is load balancing. And again, you know, the auto-failover, etc., goes on as previously discussed. Same cluster, now with a backup process in place for disaster recovery purposes. Now, notice that the backup is taken outside of the cluster. It's an off-site backup, and for good purpose, if the entire cluster goes down for some reason, if, you know, there's an earthquake, there's a fire, a whole, you know, data center goes down, you don't want your backups to also go down with it. So the backup is taken at a different location. These backups are taken with requirements in mind that are mostly fall under the two concepts of RTO and RPO. Both of these stand for recovery time objective and recovery point objective. Anybody over here who is hearing these two terms for the very first time? RTO and RPO? Okay, okay, so I'll, you know, just take a moment to explain this a little bit. So recovery time objective means that in case your cluster crashes or goes down for whatever reason, how much time is acceptable for you to be able to recover the entire database, right? Depending on the criticality of your application and the criticality of your cluster, that time could be very, very small, or you could allow, you know, a few hours, a few minutes, maybe a couple of days for the recovery time. Recovery point objective is how much data can you afford to lose in case the cluster crashes, right? From what point is it acceptable to be able to recover the cluster? Now, again, for critical clusters, the RPO might be very, very close to zero, i.e. you don't want to lose any data, but of course there are implications to it. There are efficiency in space and financial implications to trying to achieve both RPO and RPO that are close to zero. So you keep that in mind as you design your architecture and your disaster recovery strategy. Point in time recovery is something that is kind of aligned with that, you know, it's about what point you can recover your database from. You want to go back in time and recover your database to that point in time, that's the PITR concept. Also, it's a footnote over here, but just, you know, a piece of advice, it is extremely important to make sure that you're periodically testing your backups, because if the restore does not work, the backup is absolutely useless. And you will only discover that in the case of a disaster, and then, well, that's a double disaster. Another form of clusters that you can have is a multi-node cluster with an active-active configuration. In the previous configuration, we had a single active node with two stand-byes. In this configuration, you've got multiple actives where your application can both read and write on all of the nodes in the cluster. Now, this is a little tricky, and the topic also tends to be a little thorny when you're discussing this with enthusiasts. And the key point over here is that you have to have your conflict resolution at the application level. The database, at least Postgres, the way the open source Postgres works, does not have the capability to resolve the conflicts for you. So in case the application writes on active one and does an update of the same data on active two, there's a conflict as active one and active two try to replicate data to each other, that conflict where the database will not be able to resolve that conflict. It's the application that needs to be active-active aware. This is asynchronous replication between nodes, and this architecture is shared everything, which means that all of the data is written to all of the nodes. The data is replicated to all of the nodes. And then, you've got another cluster, which is multi-node with data sharding and horizontal scaling. This architecture is shared nothing, which means that no data is shared between the nodes across which the data is being sharded. So data is distributed and you can scale this cluster out horizontally, as much as you, well theoretically at least as much as you want to. There is a requirement of having a coordinator node up there, which decides which node to route the query to, which node to route the data to. And you could set up automatic sharding. You could also have the read and write operation automatically directed to the relevant nodes. And then, last of the architectures that I'm going to discuss in this conversation is globally distributed clusters. Now, theoretically speaking, the last two clusters that I described with active-active configurations and the sharding, you could have them globally distributed as well. But I have a separate slide for this, primarily because of one reason, and that is the specific requirement that different regulations can have about geofencing of your data. So many different jurisdictions of the world are increasingly enforcing that their resident's data does not get outside of the country that they reside in. And you want to make sure that you've got local data being stored locally and read locally. And with geographically distributed clusters and with the right configurations in place, you can implement that geofencing. That, of course, also has a side impact of better performance because you're reading and writing locally instead of somewhere that's 10,000 miles away. Now, talking about replication, primarily dividing it into two technologies, synchronous and asynchronous replication. I was just trying to explain over here a little bit about the differences between the two. Anyone over here who has not come across the concepts of synchronous and asynchronous replication have no idea what these two terms mean. Everybody already knows. I could have just skipped this slide. That's fine. So very quickly, walking through some of these points, in synchronous replication, data is transferred immediately. It is not committed till all nodes in the cluster acknowledge that they have the data and they can commit it. In case of asynchronous, the primary does not wait for that acknowledgement, that handshake. It will just commit the data locally and will assume that the replicas will commit that data in due time. What it does is with synchronous replication, there is a performance hit that you get because you need to wait for all of the nodes to agree that the data has been committed. And in asynchronous, you achieve much better efficiency. But also, there is that chance of inconsistency of data if you have an asynchronous replication set up in your cluster. It's faster, it's more scalable, but there is that little bit of data inconsistency problem. So in case it is absolutely critical for your application to have all of the data, all this consistent in all nodes of the cluster, synchronous replication is the way to go and you will need to take that performance hit. Any questions so far before we move on? Yes? In asynchronous replication, you are saying data may be inconsistent. It doesn't mean that some data may be lost in one of the replicas, so if you have to recover your data, some fun may be missing. And then how do you find that in the case of this? Okay, so the question is that in case of asynchronous replication, if I say over here that the data may be inconsistent, does that mean the data gets lost and if it does get lost, how do we recover it? That's the question, right? Okay, thank you, very good. So the idea over here is that as the data is being shipped from the primary to the replica, there is a certain time lag. It could be in microseconds, but there is a certain time lag where the data exists in the primary and does not exist in the replica. And in that fraction of a second, if there is a query that runs across both of those nodes, it will return different datasets. That is a risk that you take. Now, in case during that lag, during that time, the primary goes down, there are chances that the replica will never get that data and hence that data can be considered lost. Now, there are different ways to protect yourself against that kind of an eventuality. That includes being able to replay the right head logs. That includes just making sure that any data that is written is actually sent across. And so even if the primary node goes down or it crashes, the data is still in transit and the standby is going to eventually commit it. But yes, there is a slight risk there of data loss. In case if you don't have this kind of disaster happening in the meantime, is there still a possibility that like, because like you're saying, the commits are not waiting. So is there a possibility that there could be an incident that will fail or something goes wrong and like one commit just is missing? If nothing goes wrong, is there a guarantee that there is? So at the database level, because Postgres is compliant with Acid, it is going to be consistent. Within the cluster, however, there is a lag. We're going to discuss the application lag in just a little bit. It's one of the challenges in setting up clusters like this. But you're right. When we talked about the load balanced cluster in just a couple of slides ago, one of the things to keep in mind when you have a load balanced cluster and you're reading from the replica instead of the primary node is the fact that there is a lag between the primary and the replica. And when you are reading data from the replica, there's a possibility that some of the data has not yet been written. Does that help? Yes. What's the maximum network latency? I'm sorry, I can barely hear you. What's the maximum network latency to build up the cluster for synchronous and asynchronous? So the question is what's the maximum network latency that you can use to build up the replica? That's the question. I think that's a fairly open-ended question and I'm afraid I may not be able to give you a very precise answer. There's a lot of variables involved in designing that kind of an architecture. Network latency, and again, this is something that we're going to be discussing in a moment, depends on a lot of factors including, well, actually, it depends on a lot of factors and not all of those factors are directly related to your database. So it is related to your hardware, it is related to the network connectivity, it is related to the medium of connections that you've established between the two nodes, how far the two nodes are, you know, spread. So there's a lot of variables involved. And as you design the cluster, you need to be, you need to recognize those variables and you need to design the cluster based on what you have and allow for, and you should have allowances for some of those, some of that lag and some of those nuances of the network that you have. All right. Okay, let's move forward. Actually, this has absolutely nothing to do with my presentation. I just put it up there because, well, I don't want it to be too dry, right? Okay, so, yeah, now we come to the part of the challenges that you face in clustering, as you set up clusters of Postgres, and there are four, and this is in no way a comprehensive list of challenges. And as we go into each of these challenges, I will also not be able to cover all aspects of these four points, but this is just to give you an overview of the kind of variables and the kinds of, you know, points that a DBA would typically need to keep in mind as they go about designing a cluster and make sure that they are highly available. And the first point that we're going to discuss is split brain. Now, anybody over here who, again, and I'm going to ask that question in a different way, has never heard of split brain, does not know what that is. Okay, a few hands went up. Good. So the next few slides are not wasted. Okay, so what is a split brain? It's a situation where two or more nodes in a cluster start to think that they are the primary. For whatever reason, there are different reasons for that. There could be different reasons for that, but for whatever reason, two or more nodes, if they start to think that they are the primary, they will lead to a situation that can cause data inconsistency, inconsistency that can cause data loss. And the scenario that is called split brain. And it could be caused by connectivity problems. It could be caused by latency. It could be caused by a server locking up because of, I don't know, a long running query. There are many different things that could cause it, but whatever causes it, it's a difficult situation to be in and it's a difficult situation to resolve. Now, a few ways to prevent a split brain scenario. So the first one is to use a reliable cluster manager. Doing it manually, writing scripts, et cetera, you know, it will still leave a few holes that are, that can cause the problem to recur. There are cluster managers, there are tools out there that can help you. We'll talk about them a little later in this presentation as well. And what they do is that they implement algorithms and heart rate mechanisms to monitor and automate the whole process of cluster management. And because these tools are designed to make the decisions for auto failover, they will help you prevent a split brain situation. Another thing to keep in mind is do what's called quorum based decision making. And essentially what that means is that a majority of the nodes need to agree which node is the primary. This also means that there's a requirement that an odd number of nodes, a cluster should be made off an odd number of nodes instead of even. Because if you want to do, if you want to rely on some voting on some quorum based process, you need to have an odd number of nodes that could vote in. Let's say in a particular case, you've got a primary that is operating as per what it thinks is normal. And one of the stand-byes loses contact with that primary and begins to think that it needs to take over as the primary. Now you've got the original primary and one node both acting as primary. You need to have a tiebreaker in place that will say that, hey, stand by one, you're wrong. The primary is still working. You just lost connection with it. So you need to, you know, stand down. So that's what the, what quorum based decision making is. Now in case there's some, and this is something that we, you know, sometimes work with our customers at times that are requirements from the customer that says that, well, you know, we can only have two nodes. We cannot have more than that. Or we can only have an even number of nodes, not an odd number of nodes for whatever reason. In that case, we implement a witness node, which does not hold data, but can be a voter in the quorum process in order to act as a tiebreaker. So that's what the witness server does. And, you know, you want to make sure in order to prevent a split print scenario, you want to make sure that your network is reliable. And it's, you have redundancy in the network. So if one path goes down for whatever reason, you, the traffic can take a different path. And you want to minimize the risk of partitions in the network. And you want to make sure that you've got reliable connectivity between data centers if your nodes happen to be split across data centers. And then there are a few miscellaneous housekeeping items that make sure that you've got a good monitoring and alerting mechanism in place. So in case, you know, your cluster is approaching a situation where the resources are running out or the network is getting congested or the CPU is being maxed out or whatever. You know, you get alerted in time so that you can get, you can act and take preventive measures. Regularly test your cluster. You can simulate situations where, you know, connectivity is lost to test how your cluster behaves in case of that. And you need to have very precise and clear documentation because if, let's say, I'm the one who's implementing this cluster and I take a few decisions as to what thresholds to set and what configurations to, to, to program into my cluster. A person coming in, let's say two years later or three years later may not know what the decision making was and why it was done a certain way. We want to make sure that you have very clear and precise documentation that is coupled with training with new resources that are coming on and are helping maintain and manage your cluster. Now, in case a split frame does occur, what are, you know, the recommended best practices to recover from it? So you get into a situation where now two nodes are thinking that they are the primary and they're ready to take, you know, data in and they want to establish, establish themselves as the publishers of the data and expect standby nodes to become the subscribers. What do you do? So the first thing, of course, is to actually identify that that has, that has happened. You won't be able to do anything if you don't know that the split frame has occurred. So in order to identify that kind of a situation, again, monitoring and alerting are crucial elements to it. You need to have a good monitoring plan in place. Stop all traffic from your application and stop all replication between the nodes. You know, this, this, this will mean that your application goes down, but your application stopping is a lot better than your application feeding in or reading the wrong data. So just stop the application. Now, this is all manual. I am not aware of a tool that will do this in an entirely automated fashion, but this is something that a DBA and an expert will need to do. So determine which node is the most up to date. Two nodes are competing to be the primary. It's now you who decide which one is the actual primary. Or maybe, you know, you're unable to decide that because there are some transactions that got committed on one primary and some transactions got committed to the second primary. What do you do now? You want to make sure that you, that you replay the transactions that are missing and make one primary the de facto leader of the, of the cluster. You want to make sure that the nodes are isolated from each other till the, till you've rectified the situation. And then you reapply either through backups or through the right-ahead logs and, you know, just, just reapply the transactions that are missing on the, on the primary that you've decided and then reconfigure configuration. So let's say, you know, you might decide that the standby who decided to take over actually has more transactions. So you make it the primary, make it the new primary. And now you need to reconfigure applications such that the other nodes are actually taking data or replicating data from this new primary. You had a question? Yes. You did mention twice already, right-ahead log. I think it would be helpful if you could also decipher why is it called right-ahead log, what it is. Okay. Thank you for asking that question. I will run under the assumption that, you know, it's something that everybody would know. So thank you. So the question is, I refer to right-ahead logs and what are they? So the way Postgres works is that every transaction that is written to the database goes into what's called wall buffers, wall, WAL wall that stands for right-ahead logs. It goes into buffers and then those buffers write to the logs on disk. And, you know, it's those logs that are getting committed to the database and the incremental transactions as they come in, the right-ahead logs keep track of those incremental transactions. And it's those logs that are used for replication, those logs are actually transferred to the replica, to the standby, and they are replayed on the replica in order to get the replica into the same state as the primary. So these are files that are on disk that contain all of the transactional data that the database is handling. Does that help? Yeah. Thank you for pointing it out. Now, once you confirm the integrity of your cluster is that, you know, is when you can start re-enabling the traffic coming into the cluster. But before you allow traffic coming in, you know, it might be a good idea to just run that cluster in read-only mode for a bit so that you can cross-check and double-check and re-verify that everything is working. And then you're working to your expectation before you allow write operations. And then, you know, make sure that you run a retrospective because a split-brain scenario is scary. It's difficult to recover from. You don't want it happening every other day. Right? Yes. You do not just have the fancy mechanism that fills up a secondary, a second primary, and then have it failover. I'm not sure I understand what... So, you're referring to shoot the node in the head, right? I think... I'm not sure if I can shoot the node in the head. No, I think it's... Oh, offending node in the head. Yes, that's what it is. So, yeah, there is a mechanism. I haven't talked about it in these slides, but in case there's an offending node that, well, you can't really rectify you shoot it in the head. Right? You just kill it and then you rebuild a new standby. So, yeah, that's what you're referring to, right? Or, what is something else? Why would you need this complicated rectification if you could just immediately stop the brain and then pay for it? So, because before you do this, you don't know which of the primaries is actually farther along in the right-to-head logs, or if there are transactions that are in one and not in the other. Right? So, you want to establish that fact first and then, you know, recover from there. So, this is in order to just make sure that you don't lose transactions. Right? Okay. So, yeah, running a retrospective, extremely important. Make sure that it doesn't happen again. We're going to go through some of the other challenges. I think split-brain is the most important one, but the other ones, you know, they're kind of like a variation that can cause split-brain, but we're going to go through these. Network latency is one of the things that we, you know, a question that was asked a little while earlier. So, what network latency means is that it's the time delay between when data starts off from one location and reaches the destination. So, any delay that it encounters going from one place to the other is called latency. And the challenge it causes is that delayed replication could possibly cause data losses because, you know, as we discussed, in case of disaster, the primary is going to shut down and there's possible data loss in there. And also, more lag or more latency can lead one of the stand-byes to believe that the primary has gone down. Right? And, you know, they can try, that can trigger a false failover. Causes of latency. The network could be getting choked. Low-quality network hardware. The hardware, it's easy to get wrong, especially when it's costly hardware that we're dealing with. The distance between the two nodes, at best, data travels at the speed of light and it takes a finite amount of time to go from one place to the other. And the longer the distance is between the two nodes, the longer it takes for data to replicate from one to the other. If you have a virtualization setup, it can cause overheads. There can be bandwidth limitations. And security policies can force inspection of all of the data packets causing further delay. And transmission medium will also cause some latency. For example, fiber optics are going to be much faster than something that's based on copper. Right? That's plain physics. And there are ways that you can prevent false positive resulting from latency. You want to make sure that all of your monitoring and alerting and mechanism that you set up during the design of your cluster are fine-tuned such that you adjust the heartbeat, you adjust the time-out settings, and you make sure that your cluster does not read latency as a trigger for failover. Some of the best practices include making sure that you're testing your cluster periodically. There are different workloads that you would want to run on your cluster to simulate different environments. So you want to know what kind of time pressures your cluster is going to encounter with different kinds of workloads applied to it and want to configure and tune your time-out and heartbeat accordingly. And of course, documentation and training are ever important. The third challenge is about false alarms. So we talked about network latency as one of the causes of causing a false alarm. And a false alarm essentially means that an issue is reported when an issue does not actually exist. And again, when an issue is reported, it can trigger a failover when a failover is not really needed. And a failover is an expensive operation. You don't want to do it needlessly. It impacts performance. And false alarms, of course, network issues are there. The configuration and the way your cluster has been set up could cause false alarms if your thresholds are too low. You might want your failover to happen instantaneously the moment the cluster detects that the primary has gone down. But the primary might not have gone down. It might have been just running a long, running query and is unresponsive. So you want to make sure that your configurations are correct. Resource constraints, if the load is too high, the network traffic is too high, the CPU is maxed out. Somebody had planned a schedule maintenance and not told you. Something as simple as that could cause a false alarm where you think, well, okay, the network has gone down. We need to do something about it. You don't want to do that. And some of the long running queries can create exclusive logs from the database which can make the database appear to be nonresponsive. And the automated systems will not double and triple check going into the logs and going into the stack tables to figure out which of the queries are running and whether the database is locked or it's just simply unresponsive. And they can cause a false alarm. And prevention techniques include making sure that your thresholds are optimized, testing, and making sure that you run simulations is the way to go in order to optimize those thresholds. You also want to make sure that your software and all components that are part of the cluster are up to date. You want the latest versions of your software. You want them to be bug free. And yeah, monitoring and alerting, comprehensive strategies, best practices, documenting, training your stuff. The last of the challenges to be discussed is data inconsistency. And what this means is that you call it data inconsistency. It doesn't happen within the database because as we discussed that Postgres is asset compliant. So the database will not be inconsistent, but within a cluster there is a chance of inconsistency if the nodes are not in sync with each other. And the challenge is, well, if you run the same query across different nodes of the cluster, there's a possibility that you get different results. You don't want that. The causes, one of them is replication lag. We've been talking about this over and over. In case data is written into the primary and is yet to be written to the replica and is being delayed for whatever reason, you will get inconsistent data between the two nodes. Network latency and high workloads could be a cause. And this can cause loss of data in case during that time a failover is triggered. That's one of the risks with this. Split brain can cause the data inconsistency as well because, well, if two nodes think they are the primary, they are going to try and take writing of the data or they are going to establish themselves as the publishers of the data and they are going to have different pieces of data, you don't want that to happen either. And any configuration that is not optimized for the functioning of your cluster, incorrect configuration, can cause inconsistency of data. How do you prevent it? You manage your asynchronous replication very closely. And now notice that I did not say synchronous replication over here. I said that you just use synchronous replication primarily because it has a huge impact on performance. And to do the extent possible, our advice typically is to avoid synchronous replication. And not only does it have an impact on performance, one of the downsides is that in case the primary is working and the replica goes down for whatever reason, the primary is going to continue waiting for an eclotage from the replica and the replica has essentially taken the entire cluster down with it. So there are very few challenges involved with synchronous replication. Regularly check transaction IDs across the cluster, monitor replication conflicts, there are statistics and tables that are and views that are available within Postgres to allow you to monitor this replication. You can monitor them and then detect those conflicts and resolve them promptly. And make sure that you have regular maintenance done on your database. Vacuum, we had a talk just a little while back that talked about why table is bloated, why dead tuples are there and why vacuum is needed in order to remove those dead tuples. And we want to also make sure that analyzes run frequently on your tables so that it can optimize query planning and you want to prevent a transaction ID wraparound which is probably something that is a whole talk in itself. We won't go into that during this conversation. And yes, this all sounds really, really hard. It is next to impossible for a single human being to be able to think about all of these variables and actually correctly configure clusters and be mindful of everything involved over here, which is why we've got tooling around it that does not automate the entire thing, but it takes care of the critical aspects of your cluster. I mentioned three tools over here. There are other tools available as well. All three are open source with reasonable license for usage. Repmanager at the top is licensed as GPL. It provides automatic failover and it can manage and monitor the application for you. PG pool has a license that's very similar to BST and MIT, which means it's a very liberal license. And it acts as a middleware between Postgres and client applications and it provides functionality much beyond simply clustering, so it will give you connection pooling and load balancing and caching as well, along with automatic failover. Petroni is a name that just keeps coming up. It's wildly popular to set up clusters with Postgres. The license is MIT and it provides a template for highly available Postgres clusters with the smallest cluster being three node. And it can help you with cluster management, auto failover and configuration management. And that brings us to the end of our presentation. Two minutes to go. That's the QR code for my LinkedIn. Thank you. Thank you. We actually have a question. The gentleman earlier alluded to network fence and Kubernetes. You'll have to be louder. The gentleman earlier referred to network fence and should denote, which is only possible because of PVCs, right? Like persistent volumes, they're saying Kubernetes, right? But the kicker is most often than not, the volumes themselves, the PVCs, are the cause of those transient issues. What if we don't want to use persistent volumes? What if we want to use ephemeral NVMe? Is it currently possible with Postgres to manage a cluster without using persistent storage and defaulting to shoot a node? So the thing is that when you're working with... They might have gone off, but let me try and answer you loudly over here. So the thing is that when you're working with databases, you want persistent storage, right? A Kubernetes kind of cluster is designed for stateless applications, at least on the ground up, but for databases, you want persistent storage, right? In case that you're working with a scenario that is just completely... That does not use persistent storage, those are cases where I don't have expertise in. So I won't be able to definitively tell you how to go about handling it. So Matix, those are like EC2, please, I imagine.
For Your Eyes Only: Roles, Privileges, and Security in PostgreSQL
Hello, everyone. Thanks for joining. We have Ryan talking about roles, privileges, and security in post-dress. Over to you. Thank you. I also get to quit this at the end of my deck, but it's fairly easy. I'm Ryan Booz everywhere. I try to be my blogger software in Booz, which you'll see some stuff time to time. And I will try and get this updated version in my repo hopefully before the end of the day, and I will do my best to link it to the website. They have a place for us to do that. So quick agenda. Roles and privileges. It's something we have to deal with everywhere, whether it's in post-dress, whether it's on your machines, whether it's in your applications. And it's something that, as I've done more and more in post-dress, helped users in post-dress, helped people transitioning into using post-dress, understanding roles, and how the privileges that we can inherit and use really interact with things. So that's what we're going to try and go top down. It's, every time I run through this, I'm actually writing a chapter of a book on this, and it's like trying to figure out the exact order because of all the pieces that have to come together. So we'll start with the building blocks, get through roles, talk a little bit about inheritance, which is really important. And for me, it's about getting down to object ownership because I think that's where most people get confused and have a difficult time using post-dress at scale with a lot of people. So quick disclaimers. We won't cover everything. There's just too much, right? But everything I cover in here should be applicable. It is applicable to anything that's currently in support version, which is 12 plus. Honestly, it should work with anything from nine, six forward, aside from one or two things that have been added over the last few releases. So let's go ahead and dive in. So first, the building blocks. So there's four pieces here. One is just, I want to talk about the building blocks. We're going to talk about roles, security, and ownership, just to get you through as we go. So if you have been using post-dress for a while, you may or may not understand this, but it's really critical to understanding how privileges and roles work. And so if you are in a hosted environment, this may not matter to you. But again, it's really important to understand. So as you create roles, objects, and the ownership and the privileges, this is how this works. So we have, if this is a bare bone server, we have a cluster. And we have a host, whether it's a VM, whether that's bare metal. And on that host, you can have as many running post-dress clusters as you want. So we just had to talk about clustering, but the actual process running on the host is called a cluster. If you go to the documentation, that's what you'll see. So we can have multiple clusters. They just have to run on different ports. And then once you have a cluster, now this should make sense. Really, there's a lot that goes on in there, but as far as this talk goes, the two pieces that are really critical and really symbiotic to one another is roles and the databases, the objects that are contained therein. So these exist at the cluster level. So roles are created at the cluster, databases are created at the cluster, and again, remember, for this talk, cluster is one instance of post-dress, not many, many instances. And then, like I said, the interplay between these is actually just a little bit more nuanced than I used to spend a lot of my time in SQL Server a number of years ago. And it's similar, right? We're talking about roles or users and privileges, but the way that they rely on one another in post-dress, again, can be a nuance that not everyone picks up initially. So we'll try and talk that through. Essentially, just to show that everything in the database, everything we care about from an object perspective has to be owned by a role. And so it cannot exist until a role exists, and then it includes a database. And so there's just this back and forth that we're trying to understand as we go. When you're in the cluster, every cluster has what's called a PG-HBA file, that is, host-base authentication. And it's the first layer of authentication. So again, if you're running your own server, this exists, you have to do something with it. If you're in a hosted environment, most of this is taken care of for you if you're in a cloud vendor. And I like to think of it almost like a firewall rule, right? So it's a file that literally shows which hosts and roles can connect to what database is using what authentication method. And it's a matter of reading top down as a connection tries to happen into Postgres. It matches each of those properties. What host is this connection coming from? What is the role that's trying to connect? And what is the method they're trying to use? And the first one it finds, that is the role, that is the HBA rule that it lists. These things can be very, very long, right? But it's just left to right on each line. What type is it local? Is it host-based? A bunch of others. Which database? All databases, the users, addresses, and then methods. So when it comes to methods, you've probably heard this if you've been around long enough, but their trust still exists. And so just avoid using trust. Really at all costs. What that means is on that machine, if the host and the user matches, you're in. You just trust it and move forward. It's not terribly secure, right? And so in most environments, if it's not some kind of central authentication, like Kerberos or whatever that might be, most places do give you Scram 256 now. So Scram was developed, I forget now, five or six years ago, really kind of took over, replaced MD5 and some other things. So Scram 256 is what we recommend if you're using password-based authentication. And just need to make your, it's probably what you're using, but if you don't know, go ahead and look. All right, so that's just the building blocks. We have a host. We have a cluster running on it. We know the inside of the cluster. We have roles, databases, and some objects. So let's go ahead and talk about that first part. Once we've at least gotten through that host-based authentication, we have a host, a user, and a type. We've matched the rules, and now we're allowed to try and connect. Who are we connecting as? So roles, obviously, own the database schema's objects, things like tables, functions, views, things you would expect within a database. And roles own the database itself. There's a role that owns the database that's created. Roles have cluster-level privileges. It's this nuance of thing we'll call attributes, and I'll show you in just a minute. Those are separate from the privileges that you get within a database. But they're kind of, again, it's like the host-level, what can you as a user do as a role do in this data in this cluster? And then you might be able to be granted privileges to a data... They can be granted privileges to database schemas, objects, and so forth. And then possibly, as we'll see in one second, some roles have the ability to grant their privileges and their privileges to other roles in the database. And we'll see why that's really important. So just to talk, I've been trying consistently to say roles over and over and over rather than users and groups. So in the SQL standard, role is there, and so is user and group. User... I might have the backwards, but user and group is also in the standard. But starting with, I think, was 8.2. We moved to just roles. So there's no real semantic difference between roles and groups. It doesn't do something magical. What we tend to say, what the convention is, when you create a role and it's allowed to actually log in to the cluster, we kind of consider that role a user. And if it's not allowed to log in, we consider that a group. Everything else about the roles can be consistent. They can all have privileges. They can all do a lot of things. They can own this. Even a role that can't log in can own something. And you'll see why we do that in just a little bit when it comes to inheritance. So you can do this, create user and create group does exist. They are simply aliases to create role. And so if you say create user, it, behind the scenes, does create role, whatever attributes you pass in, it will apply. And then by default, it will apply the login automatically. And group will apply no login so you can't get in. So there's really no reason. It depends on your environment and how you work. But there's really no reason you can't do a create role consistently across the board. Any of these will work. They'll get the exact same thing done. Just recognize the first two are not running create role under the covers. And so I keep talking about these attributes. So now we understand a little bit what roles are. You can apply attributes to the roles. So they are predefined settings that, again, are at the cluster level. There's nothing to do with the databases yet. And they map to this catalog table called PG roles. So these are the attributes. I say Postgres 15. I think these have been the same attributes since Postgres 9.6. One might have changed. I don't remember to be honest with you. The ones that we, I'm going to talk about just briefly through the rest of this. What most of you are probably concerned with as you are administering databases are the ones that are underlined. Can you login or not? Is this role a super user or not? Talk about that in a minute. Can they create other roles? Can they create databases in this cluster? Is it password based authentication? And then can they inherit privileges from other roles? The other three that are listed there, again, a little bit complex in connection limit, if you really want to set it, you can. Just recognize if you don't set those other couple strings, the connection or the attributes, the strings or inherit. By default, roles will be able to inherit from other roles. We're going to talk about that a little bit. They have unlimited connections. If there are available connections, I can connect many times as that user from that method and so forth. Any questions on roles? One thing that I often forget to talk about, there is a way, again, depending on what you're doing and how you are administering Postgres, you can actually, for a role, set many of the settings that you could do within Postgres. If you go into running Postgres instance and you can do something like set search path, set jit, you can actually alter a user and set that property so that every time they connect and has to be connection, that property will get set for that session. That can be really helpful. Sometimes you get lost in documentation and it might be useful for what you do. I just chose jit. Here's an example. Jit can be really good and it can be really troublesome when you have complex queries that are lots of data. Maybe the jit actually is not as helpful. Maybe you have a report user in your database that's often running really complex reports and you just don't realize that maybe jit's one of the reasons that it's not being as efficient as it could be. Maybe with that user, you would turn off jit. You don't have to think about it anymore. Every time they connect, jit would just be turned off for that session. How many of you have heard of the SuperUser? Most people have. If you've worked with Postgres, you've been warned about this thing called the SuperUser. Most people, if you're learning, have logged in with the user Postgres and they can do whatever they want and they never think about why they can do whatever they want and we move forward with life and we forget. You would think that someone who has access with SuperUser would kind of be like the superhero, the neighborhood friendly Superman. It's always looking out for the benefit of good of everybody but the reality is SuperUser is a lot more like this. You can do anything, anywhere, destroy whatever you want and no one can stop you. It means we have to be really careful with SuperUser. Again, as Postgres has become more and more popular, the usage has increased, depending what SuperUser is needed for, which in many ways is often very little. Compared to quite honestly some of the trouble you can get into with it, it's really valuable to know what you can do and ways to get around it. You get one SuperUser created when the cluster is initiated. When you say initDB, you get a user. That user has to be a SuperUser because things have to be done. Roles are going to have to be created. The process is running as that user. But it doesn't mean that you actually have to use that user moving forward. There's a lot of recommendations where you actually can change that user Postgres to no login. You can't log in now. You can log in as a role that could set log in if you really need to for some reason. There's a lot of ways that it's necessary for some actions that we're going to do, but it's just really powerful. Typically named Postgres. Is that because when we run initDB, the user, the process that is running Postgres to init that DB will be the name of the SuperUser that's created. In most systems, when you install from an RPM package or something like that, it will be the, the rim keeps all of our RPM packages up. Give me a hand. It's Postgres user. In Linux, it's created. Therefore, the SuperUser is called Postgres. You can actually tell initDB to use a different role if you want. Create a different role and use it. But generally it's Postgres unless you have a different environment. And it bypasses every security check everywhere in Postgres except for login. So as long as that host is allowed to log in and you're a SuperUser from that point forward, you can do whatever you want. So it's kind of like root on Linux. So most cloud providers do not provide this to you. Now there are some, if you are in your sandbox environment, like a private VM or something like that, you may get direct access and you may get SuperUser. If you use AWS, Microsoft, Google, whatever your hosting provider might be, you do not get SuperUser. They give you something that is like SuperUser. We all, we trust him. So it's just enough power, but not so much that you can destroy the world. And so the recommendations, you'll find this in docs. I actually forgot what page and I tried to find it quickly, but there is this recommendation in docs and then you'll see this elsewhere. If you are going to manage, and you're a DBA of a Postgres cluster, it's usually best practice to create, just as you would in Linux, create a user that can do what you need us to do, but is not root. And so in this case, we say something SuperUser like, at a bare minimum, they probably should be able to create other roles that will allow them to create roles, alter roles and so forth. And they probably need to be able to create databases. But if they're not SuperUser, they can't just go to any database, delete, remove, modify anything they want. And that's what you're trying to prevent. So it allows user management, but a little bit safer. Now there are still some things that you may not be able to do if you are not a SuperUser. There are some extensions that require being a SuperUser to install. Now the team consistently has worked, we'll talk about the very end, about providing new roles that can allow us to do some of these things that used to require SuperUser. So I know that that's one that's been talked about, for instance. It used to be checkpoint. You could only run a checkpoint if you had the privilege or you had SuperUser. And so now there's a privilege with 16 that allows you to run a checkpoint, even if you're not a SuperUser. So we have roles, both regular roles, super-duper roles, and the kind of roles we want for managing our database. And then for those roles, we need to apply privileges. And at the heart of it, we've just, by creating roles, all we've done is been able to log in. And so if we want to actually do something in the database, we have to understand privileges in Postgres. So obviously there are a set of access rights, to database schemas, objects. Now when I say objects, I generally mean things like tables, views, functions, store procedures, things that have ownership of some sort. Not every single thing in a Postgres database is actually owned by an owner, a role. Most things are. They can be granted, privileges can be granted or revoked. You've probably been used to this either in Postgres or elsewhere. And then the one thing I, it's, as we get to one or two things at the end, it's easy to forget that any time you run a script, and it says, grant something to somebody, it only impacts things that exist right then. So a lot of people I've seen will start a database up, they'll do something like grant all, select all, to all tables on public to whatever. And they think, great, I've solved my problem for the rest of time. And then they create a new table, and no one can read from it. When you explicitly run a grant or revoke statement, it only impacts the things that exist right then. So just keep that in mind. So here are the privileges, 15 and 16. I actually thought I went through and changed all of those to 16, so I must have missed that. These are all the things that we can set. Now the ones I have underlined, starting with Postgres 15, are the ones that are essentially provided by default unless you modify anything. So every single user, again, super user side, and unless you've modified something, every role will get these four privileges on any database on the public schema. And the reason is there's this role called public. It's basically hard coded in a Postgres. You can't remove it. You can't get rid of it. And every role gets is granted membership into public. And again, most roles inherit. And so when you have that kind of role, you automatically get the connect privilege, right? So I've passed HBA, I provided an actual password that works, but if I don't have the connect privilege, I can't connect. So I can turn off connect to a database. I might have multiple databases, but this user does not get to connect to that one. That's really the usage. Again, I can connect, but if I can't use it, I can't do anything. You can actually connect to a Postgres database, get a valid connection, you're connected, and then you want to do anything, select whatever it might be, and you're just denied. And that's where usage comes in. Temporary tables and then executing things like store procedures and so forth and functions. Now, if you're using Postgres 14 or below on the public schema, you also have the create privilege through public on the public schema. And so that we realized gets to be some of a security hole. All right. And the reason is, in this case, I don't want to get into too much, but if you create something on any schema, so on the public role, that's where most people were creating things. A lot of us still don't use schemas in our applications. We just create tables by default. They go into the public schema. And so if somebody created a store procedure and they weren't super user, there are ways, actually not that difficult, if you know what you're doing, to create a function, somehow get someone with elevated privileges to run it and you can get super user. Another talk that I like to do. So we realized that. So basically starting the 15 and above create is not provided through public to the public schema. So you have to be explicitly granted. Every role has to be explicitly granted create. And then when you create your own schemas, you have to grant create to other roles if you want them to be able to do it. So recognize that change. Now, the one caveat here is, if you've been upgrading 12, 13, 14, 15, when you upgrade to 15, it doesn't take away the privilege from roles that already existed. Again, all of this is point in time, right? I applied the role at some of the privilege at some point in the past. I have to explicitly do something to modify that. And so security best practice. I've been talking about public a little bit. And again, this is more what has come around. It's got a lot of attention over the last few years, which is there's just this potential for bad things to happen on the public schema. And so most folks, most advice you'll get is to revoke all privileges from the public schema from the public role. Again, you can't get rid of the role. So you want to remove all privileges from public. And then per database, you probably want to remove privileges specific, you know, to the database itself. And what that would mean is, again, that comes to the connect, right? So you have to be able to connect to a database. If I don't revoke all privileges, any user, they're part of public, public has connect, then they can connect to that database. And so that allows you, this just means that then you have to be more explicit with every database, every schema and so forth. All right, you'll find this a number of blog posts, people talking about security, and especially two years ago with Postgres 15, there was a lot of news around this. Now granting privileges is hopefully pretty straightforward. The docs pages on grant and revoke are really good. They go into a lot of detail on all the privileges, what it means when I say I grant someone select. What does that mean? When I grant someone delete, what does that mean? What is it just delete rows or does it allow me to do something else? And so there's a lot of good documentation, but you grant something to an object, to a role, and then you can name a, you shouldn't, you name a schema, whether it's public, whether it's all, could be all schemas or specific schema. So in this case, we're simply granting create. So now this admin can create something. They can use and create in the schema that I've created, but then we're going to create a junior role, and the junior dev role, and we're granting a select and certain update, but they can't delete, they can't create in the database. In theory, they've been given usage on the database. I missed that out here. I should have had that in that line. Now there are other ways to do this. So again, just remember, explicit grants only affect current database objects. So I'm going to do a quick demo at the end of this to show you all of this very quickly and hopefully, you know, tie all the pieces together. Again, these pages are really good. And so it just answers all the questions, every privilege. And if you don't know, if you go to the Postgres documentation, there's a search box up top, and it works pretty well. And so you can just simply say grant, grant, privilege, grant, revoke, and it will come right up. All right. So we have the cluster. We understand we have roles and objects and databases on the cluster. There are some attributes and privileges given to a role at the cluster level. Then we get to the databases themselves. Now we have privileges, which we can grant two roles for all the various types of things within the database. But if you notice on this slide, if I had to do this for every user, this gets really frustrating and complex. Now quite honestly, this is probably why a lot of people, myself included, is just easier to use SuperUser. Just log in with that one user, do everything you need to do, because I trust myself. I'm not going to do anything bad. But the better way forward is to deal with inheritance. So you may have noticed, you may have not, that earlier on, this is one of those attributes, one of the privileges, I'm sorry, the attributes to a user. Do you inherit privileges or not? Now it doesn't matter if you aren't granted membership into some other role that would apply privileges. You could receive privileges from. So roles can be granted membership into other roles. That's why there really is no group and user difference here. It's just whether they, you know, again, we say whether they can log in or not log in. But if you create the roles that cannot log in, treat them as groups, you can apply all of your privileges to those groups in ways that make sense and then grant ownership into those roles from other, for other roles. So this is really the preferred method for managing it. What you would expect in a, you know, whether it's Linux, Windows, whatever it is, you have groups, you have users, users can be part of multiple groups. It's exactly what we're talking about here. But you have to go to some effort. So again, just a really quick example we hate, we're creating a senior dev user role, they can inherit, report user, they can inherit. And then we create two groups because they can't log in. All right. And so we just, we explicitly say no inherit. Now, you don't have to do that for groups, but it can get a little bit messy trying to figure out exactly where everything's coming from. So a lot of wisdom is your groups, let them be separate and apply the groups you need to other, to your actual user roles. And then I've said, hey, grant, insert, update, delete on all tables to the read only privilege. Now it's, I should have to the admin privilege. I was like, wait, that's not read only to the admin privilege, right? So right here. And then we have grant select, all tables to this read only privilege. Right. But those are my two groups. They can't log in. So how's this going to help me? Well, you can then grant membership into those other groups. So I say grant admin and read only priv to the senior dev and then only grant read only priv to the junior dev. And so essentially what that looks like is this. Those two roles both have read only privilege. But the senior dev has now also has other privilege. So you kind of keep building on top of that. So it's a great way to be able to apply the kinds of privileges you need across many roles. And then if you need to update something, you update one object, the group role, and it will be applied to all of the users that are inheriting from that user. Any questions on inherit? What that looks like? Yeah. Okay, the question is, if the super user creates roles, do those roles get the same privileges as super user? No. Okay, misunderstood. If you create a role and you grant it super user, they're super user, just like what's exactly same as what you would expect Postgres to do. Yep. It's a flag in the database. And if it doesn't matter what the name of the role is, if it is a super user, you're a super user. Have fun. Don't destroy. Okay. But maybe that is the fun, right? No, I'm kidding. Test your backups. Okay. So great. We have, you know, just trying to build down through this, we have our cluster, we've created roles, we understand what those privileges look like. We understand that there is, you know, this, this level of the roles and the privileges they get, but then we get the object ownership. And honestly, this is when I decided I started to need to dig into roles in Postgres. Because I was using super user for everything I didn't care, right? And then I actually started to manage an application with multiple users, a lot of devs in a, you know, one environment, a couple different users for various applications that were connecting from another environment. And all of a sudden I was like, what is going on? Because this is not what I thought was going to happen. And that's when I had to really start to dig in. So that's why all of the other stuff leading up to this is important for me as an application developer or running or helping to teams of application developers effectively use Postgres. So object ownership, whoever creates the object, whatever role you are currently logged in as, or that session is currently acting as, when that object gets created, they are the user. Table, function, view, you know, on and on. Even a database. When I create a database, if I had privilege to create the database, that database is now owned by me, not by Postgres, not by some other user. So it's really, that's just the first thing you got to understand. Now, the owner of the object is essentially like a super user of the object. Right? They're not a super user, but I own the object. I'm the only one that can actually do a lot of things on that object, unless I've granted other privilege. And there are some things only I can do. Or a super user. So I like to think of this as principle of least privilege. When I create something, the way that Postgres works, it says, we don't want anyone to do anything. You have to tell me Postgres, the cluster, what everyone else should be able to do to this object. I don't care if they're part of some group that has access to this thing, and you're both part of the same group. I don't care. They have to be given explicit privilege in some way. So that's kind of the first place that you start to get confused. If you happen to have multiple devs, and you're on a test database, and you're all part of the same group, and all of a sudden dev one creates something, and dev two says, oh, let me go just see what you did, like access denied, like what? What are the test server? What are you talking about? And this is what it gets down to is object ownership and understanding of that. Now, again, roles, there are some roles that can actually, you know, grant, yeah, default, sorry, default privileges. So we're going to talk about default privileges in just a minute. And that's where kind of the power for managing application and creation of objects and management of objects can be really helpful. So this is what I showed earlier. And hopefully you can see, and I actually forgot to make this point early on, and I apologize for that. The one unique thing for Postgres with me coming from a different database is that although the roles are created at the cluster level, I cannot connect to the Postgres cluster, unless I can connect to a database, every connection is to a database. And so I might have the right password might have the right host might have the right role. But if I don't have access, I don't have literal connect privilege to any database, I can't get in. So there's this thing that like I almost said, symbiotic earlier, like roles and objects are separate. But what's a little bit unique about Postgres, again, for me is you they need to exist together. That's why when you initiate a Postgres cluster, you get one database and one super user because that super user can now connect to the database that's named after itself, blah, blah, blah. So there's this new one. Now the problem, though, is if all of my users are creating all these different objects down here, right, they're all owned by different people. And as I said earlier, the owner of the object is is like the super user of that object. And so then you start to get into conflict of who can use what and what can you do in that object. So what I've learned over the years, now I work for a company called redgate, you may have heard of the Flyway application, it's migration schema based migration, redgate owns that product and manages the open source portion of that. And we see lots of folks that are moving from other databases to Postgres. Yay, we're super excited about that. But again, understanding this ownership principle is so important. So they will, you know, go and create, they don't even realize what owner they're connecting and running these migration scripts as. And if the migration scripts don't explicitly modify ownership, all of a sudden they have objects in the database that are run by multiple people because different people were running these migration scripts. And then you get into a big issue because now someone wants to modify this table, we've turned off login for super user. And only that user can modify the table. And you just get into this like roundabout, right? So what we tend, what I tend to like to tell people is particularly as you get up to your production database. Now with Flyway, what we would say when we help folks do this, we go through, you know, dev, we have a staging server. And often what we'll say, and I'm going to show you default privileges in a minute, is create run all of your scripts. Now again, if you do a dump, you'll see that after every object in a dump file, if you do the script, it postgres explicitly changes the owner. Now that's also where you get those error messages, if you don't have that owner on your machine. But the object was created, whoever creates the object, it doesn't matter if it's for a backup script. If you ran a backup script from your server, and those objects were owned by Joe, and you go run it on the other server connected as Mary, all of those objects would be connected, created as Mary, if you didn't explicitly change the ownership now. So that's what it's like this nuance, right? So we tend to recommend when you are actually going to production or even your staging server, you run those scripts as one group role, and you make sure that group, you know, doesn't have things like select and delete, whatever, they are just allowed to create the objects. But you have other roles that are granting permission into those objects then, in a way that is accessible. And the beauty of doing that is you can still switch to that role, you'll see that in the demo I'll do in just a minute. So if you needed to modify something about that, you can still set into that role, and then you know exactly which role you need to get to do the modification. So this is a nuance here, and the value to this comes with default privileges. So as you'll see in the demo, again, I create an object like table. Only I can modify that table. I don't, unless you're a super user, I don't care if we both are part of the same group roles, only I can modify that table unless we set it to a role that both of us are a part of, and then both of us can be, can switch into. So this is just a really simple example. And I'll show you another one in the demo. Default privileges are way to say when I, as this user, so I'm connected, you guys are, is everything okay? Okay, they're staring at me like I'm doing something wrong. The, so I create default privileges, I'm altering them, and I'm saying grant select on all tables to the role public. Now it could be any role, right, but I'm saying the public role. Now, anytime I create a table, if I had gone ahead of time and removed all the privileges, whatever, anytime I, as whoever, whatever role I'm running that command as, every time I create a table from this point forward, everyone will be able to select because everyone's a member of public. Right, does that make sense? If I didn't do this, every time I create an object, I then have to explicitly grant the roles. That gets really tiresome. Now the only nuance here, and I have been dealing with this, so again, Redgate has been doing a lot with Flyway and Postgres, and I've been trying to help them understand that only exists for, again, when I create the object. If I later go and modify this default privilege, nothing changes about the objects I created earlier. You still have to go back and grant whatever you just modified to all of those other objects. Right, but it's super helpful. So from a migration perspective to just ease the management, what we tend to do is say, hey, make a group role that, you know, certain people are part of us, they can set to that role and modify the objects if they need to. But then you know the owner of all the objects, and it's not necessarily the Postgres user. That's what most people end up doing on the cloud host environment or something like that. Any questions, yeah? Just about syntax, so first we have actually, who has privileges from equal sign and what privileges, and who has given these privileges, correct? Exactly. So this says that the user Postgres, the owner of this specific default privilege, anytime the Postgres creates a table where it just says equal, that's public, that's all. And so they have read access. The question was, I apologize, you know, basically what's being shown here. So when I create the default privilege, you know, the equal with nothing in front of it just means public. And then you can name multiple roles. In this case it's just the owner obviously has all the privilege, and they always have all the privilege, right? Yeah. So do you think it's possible to have wild cards after like, you know, like, any database is structured if the user will have access to these types of privilege? That's a great, so the question is, is it possible to have wild cards? And I think you're saying, like, if I create a default privilege, and I don't know if I said this earlier, and I apologize, this is per database, right? So if I create the default in the database, sorry, I don't think there is. I mean, again, you can create things like, in this case I said on tables, you can do things like on tables, on views, so you get a lot of the objects. But I don't believe there's a way to say like a wild card across multiple things. So a great question, the question is, could you do this in the template? Yeah, you could. You could create your roles in the template database. You could, for the roles that you want to use, set your default privileges, and if that all works out and you have all the roles and owners, every database you create is going to get that stuff. All right, I just really want to quickly run this demo, it's about five minutes, and so just so you can see it, because sometimes for me at least, that's just helped me see what's happening, right? It's one thing to see slides, but just really quickly, so providing object access, because this is, again, this is where I see so many of the actual problems happen. When you don't give someone a super user, all of a sudden things just go haywire. And so you can either explicitly grant access every time to every kind of object and go for it. A lot of work, do what you want to do. You can alter default privileges, and now any time something is created in that database by that role, including something like migration scripts, they will inherit these privileges for whatever roles you assigned. You can then set role in the app, I'm going to show you that, so in Postgres you can say set role, so I could connect to the database, I could set, change my role for that session to the owner of the table so I can do something with it and modify the privileges and so forth. And then in Postgres 14 and above, we're starting to get some of these other attributes to do more. I talked earlier about this, this is the object ownership thing in security. Again, there's a number of talks on this, I think I have an old one maybe linked on my blog somewhere. So let's go ahead and quickly do the demo. So I have an empty database, this is going to be really quick. So I'm using dbeaver, I just like it because of the color coding stuff, just a little bit easier for you to see and show. So the current role I'm currently connected as Postgres, so this session I initiated as a Postgres super user. And I'm going to create a new schema, and I still have to do all the things I want to do, so I'm going to create a developer role, now it says no login, so what kind of role is this? We consider this a group role. And so the set role, if you say none, that will change the ownership of the current session back to whoever initiated that session. So as Postgres, I just had this here because I think earlier I had said to something else. And so for that role, we're going to do this, we're going to grant select, insert, update, delete on all tables in the demo app schema to this group role called developer. Now it can't login, so it can't really do anything, right? And then we're going to say grant create and usage to this role. And then we're going to create our developer users, it doesn't really matter, you know, anything here doesn't matter. Oops, I am not hitting, am I hitting the right keys? Oh, my apologies. So dbeaver, I can just say control, enter, and it will run the commands. So I've created two users, and now the magic. I can grant those users access privilege into that role, that group. Now at this moment, now that they've been given granted access, what does that mean about their privilege? I have not granted any privileges to those users yet. But what do they now have? Select, insert, update, delete on all tables, and they have create and usage, right? So now without doing anything else, they can use that schema. And we can see if I, so now I can set role. So this, I could have multiple tabs, I could have connected as dev one on one tab and dev two on another tab. In Postgres, when you say set role to a role, it's basically like switching user, there's one or two things that don't happen at that moment. One is, remember earlier I said you alter, you can alter some settings, those things don't get run when you do set role. But otherwise, if I'm allowed, I'm running a super user so I can do this, if I have membership in that role, I can set to it and act as that role for a little bit. And then I can go back. So I'm going to set to dev one, so this is as if I had connected as dev one to the database now, and I'm going to create a table in that new schema, because I can. Again, we haven't granted anything to that user explicitly except membership in this group. And now we'll see, oh, I need to create the table, don't I? What's that? I didn't, what? So here, I'll just drop this because that's going to miss everything. Yes. Ah. Alright. Come on. Yes. Does not exist. Okay, maybe I, well, if this doesn't go, then we'll just move on and I'll show you what I can. Alright, there we go. So now I can see that, oh my goodness, my hands are not hitting the right keys here. So I have this table and the owner is Postgres. So now I set my role to dev two and I try and alter that table. Of course I can't. I don't have permission to do that because Postgres created it. They didn't give me permission. So I'm going to go ahead and drop that table. I could also just alter the, I could just alter the owner. What I'm going to do this time is I'm going to set the role to developer. This is the, no, again, it can't log in, but I can set the role to, I have access to developer or I'm super user. So I'm now, now I'm developer. I create that same table and now we can see that it's owned by developer. Okay, what does that really do for us? Well, now I can go back to dev two and I can try and alter that table. And of course this doesn't work. Maybe I didn't. What's that? Oh, I didn't. Okay. Let me just talk you through this rather than, man, I literally ran through this five times today. My apologies. The big point here is as we go down through, as long as the user is a member of that group and that group created the objects, I can do the privileges I'm allowed to do on that object then. All right, so it's a way to let me do some stuff. Now, some things I may not be able to do, I might have to switch into that role to do some alter things like that, right? If I want to alter the object itself. And so, yeah, I see, man, that's really crazy. Anyway, the main, hate when a demo fails, right? The main point is there's like two recordings of this. You can see this run through if you want. It's just to say again, you have to grant specific privileges. I was going to come down here to the default privileges and show again that once you set something like the default privilege, as long as you create those objects with that role, they will get whatever privileges you said to the roles that you provided. And so, it's just a way. So in this case, it was just a read on the report user. I want them to be able to read from every table. If I'm not using Postgres 14 and above, I would have to make sure that they have select on all the tables. Setting a default privilege is one of the easiest ways to do that. All right. So last thing, go back. Demo fail. We'll have to get that end of the time. Just to really quickly bring up predefined roles. So predefined roles have existed for a while and Postgres 14 and above. There's a lot that's been done to try and do things, provide roles that for management purposes. So you don't have to be a super user. I gave the example earlier of checkpoint, right? So now you can give someone this checkpoint. We call them predefined roles. You can grant them membership into that role and then that user could run a checkpoint. Things like read all data. This has been a problem for a long time. So starting with Postgres 14, I think it was, we had the read all tables and the write all tables. So if you just wanted someone to be able to read all tables, in this case, in all databases, because it's a role there, you could now create this, you know, grant them access into this. Here are the current predefined roles. This is updated to 16. I believe the one that's different here, I knew earlier and right now I can't find it. But this is where you can do things like read servers. So a lot of monitoring programs now require you to be able to read the log or to read files from disk. Well, if you don't want super user to connect, you could grant your monitoring role something like read server files so that they can still read the files without being a super user. All right? That's the end of it. I really apologize for the demo. I love giving that demo and I don't know what I did. But anyway, if you have questions, I'll take one and then we're going to have to be done. Yeah. Thank you. Great. Yeah. Great question. For those who are still here, the question is there like a log cap of transitions of, I guess the mic is off. There's, is there a log cap when you grant things off and on, right? I had this default privilege and then I modified it. There isn't. You would have to do that in some way. Maybe through scripture if you do.
Your Virtual DBA (PostgreSQL on Kubernetes using an Operator)
So, time to start. Please welcome to the stage, speaking about running Postgres on Kubernetes, Karen Jax. Hi, thanks, Jimmy. Yep, so I'm Karen Jax. I'm a senior solutions architect at Crunchy Data and I'm going to talk to you about running databases on Kubernetes or how to create a virtual DBA. I've always worked with databases, so I've just included a little picture of my career to date just to prove that I'm vaguely qualified to talk to you about DBA type stuff. This is the first ever job title I've had that doesn't have the word database in it, but I still only work with databases. Okay, so in the abstract I said a lot of people who have looking after databases as part of their job responsibilities aren't actually these days database administrators. I see sysadmins, I see infrastructure teams, I see application developers, I see DevOps teams, all sorts of different people who don't necessarily have a training or experience in database administration, database administration who are expected to look after their organization's databases. And I see that in particular in organizations where everything's running on Kubernetes and the databases are just seen as another part of that landscape. So if you're in that situation what do you do? Do you go out and quickly learn to be a database administrator? Do you phone a friend? Do you panic? I mean a better option would probably be to go out and think about using one of the Kubernetes operators that's been created by database experts. So we'll have a look at what a Kubernetes operator does and how it can help you to create a virtual DBA and what you want to look for when you're choosing an operator. So we'll quickly have a look back at how database architecture has evolved over time. Make sure everyone's kind of on the same page so that you know what kind of things Kubernetes does and what it's useful for. Look at some of the special features that makes Kubernetes suitable for running a database environment. Try and figure out what a DBA actually does because that's going to then give us an idea of what we need the operator to do. Understand what an operator for Kubernetes is. Look at the features that you might expect from it and then finally have a little look at how you might go about implementing an operator and trying that out for yourself. So first of all the history, I promise it will be extremely brief. Once upon a time databases were deployed on physical servers or bare metal. You could run multiple databases on a single server if you wanted to but you had to accept that those databases are sharing the resources of that physical server and competing for them. If you wanted isolation you had to deploy a single database instance per physical server and that's going to bring with it very high overheads in terms of maintenance, operating costs, hardware costs etc. But you do get that isolation, you can manage them independently. Then we got virtualization in the form of VMs. So now you can carve up a single physical server into multiple VMs and you can deploy a database instance to each of those VMs. So now you've got isolation, you can manage those independently. You've still got fairly high overheads. So you've got the, as well as your underlying operating system, you've got the hypervisor and then your guest OS. But you have got that isolation. And fast forward to 2024, many databases are now running in containers. So just to recap, a container is a lightweight self-contained software package that you can deploy pretty much anywhere. Containers use features of the underlying OS. So it's using C groups and namespaces. So they're sharing that underlying operating system. But they remain isolated from each other. So now if we deploy things like this, those database instances can be managed completely independently. They're not competing for resources. But because those containers are sharing the underlying OS, they're much more lightweight. You're looking at typically maybe tens of megabytes versus gigabytes for a VM. So not so many years ago, most people thought the idea of running databases in containers was a completely crazy idea. This year, all of my customers are running some or all of their production databases in a containerized environment. Some of them are running multi-terabyte mission critical databases. And some of them are running hundreds or even thousands of databases. So some things obviously changed. There's been a shift to make people see this now as a viable architecture for databases. So let's talk about some of those features that have made people move to a containerized environment for their databases. So as I mentioned briefly, containers are isolated, they're lightweight, and they're portable. You can create them and destroy them quickly and easily, which means that a containerized environment can be extremely flexible. It's very easy to scale. But containers are also stateless and ephemeral. A container's data and its state only last as long as that container exists. As soon as your container's destroyed, you lose that. Which as you can imagine strikes fear into the heart of your average database administrator. You need to take special care obviously when you're using a containerized environment for a database or at least ones where you have any kind of attach any kind of importance to your data. So putting aside the stateless and ephemeral issues just for now, your organization probably isn't managing just a couple of databases. Excuse me whilst I get my display back the way it's supposed to be. A lot of organizations are running hundreds or thousands of databases. And once you get to that stage, it's probably going to feel a lot like herding cats. You don't want to be doing all of those maintenance tasks associated with those containers and the databases in them. You don't want to be doing that manually. You need some kind of tooling to do that for you. Which is where container orchestration comes in. So container orchestration platform such as Kubernetes will let you manage many containers. It will automate the entire life cycle of those containers and it will integrate also with DevOps tools. So it allows you to do things in a flexible, automated, repeatable way. So a container orchestration tool will take care of a long list of tasks, things like provisioning, deployment, configuring your containers, scheduling, scaling up and down, repairing things, replacing containers that have failed services that have failed, creating services, allocating storage, different resources, load balancing, network and security. Kubernetes is an open source container orchestration tool and it's the industry standard for container orchestration. Just to reassure you that it's not a newfangled thing. It's actually been around a reasonable amount of time now and it's been a graduated CNCF, Cloud Native Computing Foundation projects, 2018. Kubernetes can be run pretty much anywhere. You can use a managed cloud platform or you can run it yourself either on premises or in the cloud. You can either run vanilla Kubernetes or there are a whole host of different flavors of Kubernetes. So you might hear people talk about OpenShift or Rancher or Tanzu or EKS, AKS, GKE. There are all sorts of different versions of Kubernetes that you can use. So to the why would you want to run Postgres on Kubernetes? It's no longer considered a leading edge technology. It's very much mainstream now and it's trusted in production by many, many users for database workloads. One of my favorite quotes is actually from Joe Conway's blog post where he says resistance to containers is futile and he points out that actually on modern Linux systems, because everything's running using C groups and namespaces, you're effectively already running your database in a container. So the customers I work with have many, many different reasons, use cases for running databases and Postgres on Kubernetes. Automating the deployment and administration of their databases is obviously a huge one. That's one of the main reasons that people cite for wanting to be able to do things. The features of a container orchestration platform that we saw on the previous slide are already, they go a long way towards doing the things that you would need automated to look after your database environment. There are other features as well that help with that and we'll look at those in a few slides. But otherwise, we see customers that want to be able to deploy and manage their database environments at scale. As I mentioned before, maybe hundreds or thousands of databases. They want to run multi-tenant environments. They want their database environment to complement an existing microservices environment. A lot of the time there's already Kubernetes in use in the organization. The applications might already be running in Kubernetes and they want to bring the databases into that environment. A lot of them do it because they want to be able to create a database as a service type offering, whether that's for internal or external customers. We'll have a quick look now at some of the other Kubernetes features that can help to build our virtual database administrator. First of all, a little bit of terminology. Even though Kubernetes is a container orchestration tool, you don't deploy an individual container in Kubernetes. You deploy a pod. In its simplest form, a pod you can think of as just a wrapper around your container, but it can contain multiple containers. Then we have a deployment. A deployment consists of one or more copies or replicas of a pod. The pods within a deployment are ephemeral and interchangeable. If one of those pods is destroyed for any reason, Kubernetes will just stand us up a new identical pod. We talked about the benefits of containers, the features of Kubernetes, but also the fact that a container's data only lasts as long as that container exists, which obviously would be a bit of a problem for a container that holds your database. You probably don't want your database to disappear if you lose a container, so you need some kind of persistent storage. Kubernetes provides that in the form of persistent volumes or PVs. By creating a persistent volume claim, a PVC, you can attach permanent storage to your container. What about standby databases? We've talked about pods in a deployment being interchangeable. If you lose a container, Kubernetes will just say, okay, that's fine, I'll just create you a new one. If that's your primary database container, you can't do that. A primary and a standby database aren't the same. They're not interchangeable. You can't just replace one with the other. You need something in there to tell Kubernetes that there is a difference between these. It's very rare that you'll be running just a standalone database. You will almost definitely want high availability, but also you might want replica databases for read scalability. Scalability is one of the big use cases for Kubernetes. We need Kubernetes to know that our primary and our standby database aren't interchangeable, that you can't just replace one with the other. We also need it to know that they can't just be started up and shut down in a random order. It needs to be carefully considered. For that kind of situation, we've got stateful sets. A stateful set is similar to a deployment, but each of our pods will have a persistent identifier, so it keeps that through any rescheduling. If pod one gets destroyed, it will be replaced by another pod one, and it will still be attached to that same PVC one. It will still be attached to that same storage, so it can keep that state. The Kubernetes documentation says that stateful sets are useful for applications that need stable persistent storage, ordered graceful deployment and scaling, and ordered automated rolling updates, which sounds very much like what you would want from a high availability database environment. Another useful feature is sidecars. We saw that a pod can contain one or more containers, so a sidecar is a kind of helper container, so it's tightly coupled with the main pod in your container. You might have, for example, alongside your database container, you might have one that exports metrics, one that exports statistics from your database, you might have one that performs your backup and recovery. We've seen what kind of things Kubernetes can do. What does a DBA actually do? This is a slide from the DBA evolution talk that I gave here last year, and for that I looked at various definitions of a DBA to try and find out what the general consensus is for the DBA roles. It turns out that apparently DBA is responsible for managing and securing computer systems that store data using specialist software, which tells us absolutely nothing about what a DBA does day to day. I compared that at the list of responsibilities that went with those definitions, and I looked at a whole load of different job adverts for DBAs to try and get some kind of consolidated list. Of the things that DBAs are actually expected to do, and it's a pretty long list. The general consensus is that a DBA will do some or all of, ensuring the availability of the database, usually involving putting in place some kind of high availability infrastructure. Design, implement, and maintain the necessary backup and recovery procedures. Design, implement, enforce, potentially various different security security requirements, create database users, manage database access, ensure data protection. Implement monitoring processes, perform ongoing monitoring of the databases, looking at things like performance, the security space, etc. Database design and development, including data modeling, for example. Support and troubleshooting, including 24-7 support, uncle support often. And it goes on. Installing and upgrading database software, providing database expertise to other teams, to other people, so for example to the business, to other technical staff. Performance tuning, capacity planning, putting in place the necessary procedures for creating databases and maintaining databases. Of course, there are different types of DBA. Some organizations will split the roles out differently. Some DBAs will be expected to do different things, but all of these things will need to be done by somebody. Okay. So we know that Kubernetes provides a lot of the features that you need to manage a database, but how are you going to go about setting up a containerized Postgres environment? Kubernetes doesn't natively speak Postgres. So you need to put in place some kind of mechanism that's going to tell Kubernetes how to manage your database cluster. You need it to know about replication, about backup and recovery, about monitoring, about upgrades, and all sorts of other things. To do that, you need expert knowledge in two domains. You need expert knowledge of Kubernetes and you need expert knowledge of Postgres. Most organizations find it difficult enough to find somebody that's got expert knowledge in one of these domains, let alone both of them. Fortunately, Kubernetes has another secret weapon, the operator. So this lets you extend Kubernetes functionality using custom resources, and we'll look at a custom resource later, and something called the control loop, where it keeps checking the current state of your cluster to see if it fits with what you've defined, and if not, it will make necessary changes to keep it in that required state. Even more fortunately, there are various Postgres operators that have been created by Postgres experts. I can speak in detail about the Crunchy Data Postgres operator, Pego, because that's the one I use day to day, but there are others out there. Each of them works in a slightly different way and might use different tools, but each of them combines that detailed Postgres and Kubernetes knowledge, so it extends the functionality of Kubernetes and lets it speak Postgres. It allows you to define in a manifest what your cluster should look like, and then work to deploy your cluster and keep it in that state. So what do you want from your Postgres operator for Kubernetes? The idea of Kubernetes operator is that it will perform all of the tasks that a human operator would otherwise do. So what we want it to do is automate as many as possible of those responsibilities, those tasks that we saw on the previous slides. For example, database availability. Most production environments, as we've said, need some kind of high availability. You'll probably be using Postgres' streaming replication so that you've got a primary database and one or more replica or standby databases. You'll then have some tool, a framework such as Petroni and XED. There are other frameworks available. This is one that we choose to use, and it's well respected and it has a rich set of features, so it's used by a lot of people. So you'll put that framework in place to manage your cluster. You might add in a tool such as HA proxy to maintain a virtual IP address so that you've always got your application connections pointing to your current primary database. There are quite a few moving parts here. There are various different tools to install and configure, and it can be quite fiddly to get that set up in the way you want. So you definitely want your operator to be doing that for you. If something goes wrong with your primary database, you want to be sure that you're going to get an automatic failover, that it's going to promote one of those replica databases to be your new primary, that it's then going to reconfigure any existing replicas to stream from that new primary, and that it's going to move your application connections to point to your new primary. You don't want to be doing any of that manually. You want that to happen automatically for you. And then for a combination of the self-healing magic of Kubernetes, Petroni, and your operator, you want to make sure that you have a new replica created to replace that primary database that you lost. You definitely want as much as possible of your backup and recovery to be automated. You want your operator to install your backup tool and configure it, so for example, PGBackrest. You want it to let you define one or more backup repositories that could be a local repository, that could be a cloud or network-based repository using S3, for example. You want it to take care of your wall archiving. You want it to take care of taking backups for you. You want to be able to schedule those backups. You want it to take care of removing obsolete backups once you no longer need them. You want it to retry backups if they fail. And then to minimize stress, data loss, and downtime, you definitely want as much of your recovery to be automated as possible. You'll still want a human operator in a lot of cases to say, yes or no, we are going to restore. Can we accept this data loss? Can we accept this downtime? There will be decisions like that to make by a human operator, but once those decisions are made, you want that process to be just a click of a button. In addition to your primary database cluster, you might want to be able to define a disaster recovery cluster or a standby cluster. A lot of people have a separate Kubernetes cluster in a different data center, in a different region, for example. And you want your operator to make sure that's kept up to date, either via wall streaming from a cloud backup repository that it sent the wall files to, or via streaming replication, or belt embraces. You might want it to do both. You might want to use a similar setup as this to create a clone of your database for test or development purposes. And you want your operator to allow you to do that very, very simply. In terms of security and data protection, there's obviously going to be manual effort here. You want to be in charge of defining your security policies. But the operator should provide you with the means to implement those. So you want it to do things like managing database access, so creating database users, making sure they've got the right permissions as defined by you. Maintaining pghba.conf entries, encrypting passwords and storing them in secrets, managing SSL or TLS, generating and managing the certificates for you. Monitoring is a hugely important part of database administration. You really need to know what's going on in your database. You want to be aware of potential issues before they come emergencies. Rather than reinvent the wheel and create your own monitoring system, trying to figure out the queries that you need, the scripts that you might want to run to keep track of what's going on in your database and then maybe setting up your own dashboards, you can let the operator configure monitoring for you. So the pigo monitoring architecture, for example, looks a bit like this. You want the operator to configure the logging parameters for you. You want to make sure that you're actually storing all of the information that you want in your PostQuest logs. You want it to export metrics from your database. So we have a sidecar there for metrics from your database. You then want it to either integrate with your existing monitoring stack or you want it to stand up a monitoring stack for you. So Prometheus with pre-configured metrics, alert manager with some pre-configured alerts, Grafana with dashboards that are already set up for you. You'll probably be pleased to know that it's not going to take over your database design and data modeling because you obviously want to keep some of the fun bits of database administration. And although the operator isn't going to completely relieve you of support duties, it should mean that you're called on less frequently in an emergency in the middle of the night, for example, because you've got that high availability already put in place and automated. You've got the self-healing capabilities of Kubernetes. You've got the monitoring in place so that you've already been keeping an eye on things and trying to react before things become a problem. You've got alerting in place, so hopefully when thresholds are exceeded, you already know about those things and you can fix them before they become emergencies. So hopefully you're only going to get involved if there's something particularly complicated going on that needs detailed analysis. What about database software install and upgrade? Well, the install bit's easy. You don't actually need to do any installing of Postgres or of those associated tools such as PG Backgres, the Prometheus Grafana, your Petrona. You don't need to install any of those because they come pre-installed in the container images that are available with your operator. As for upgrades, a few slides back we talked about stateful sets being useful for applications that need ordered automated rolling updates. The operator can use exactly that technique for performing a Postgres minor version upgrade. Next week, when you want to upgrade either from 15.5 to 15.6 or 16.1 to 16.2, you can simply change the version in the manifest, so in the definition of your cluster. Reapply it and then you can watch as the replicas are upgraded. One of the replicas will be promoted to be the new primary. And then finally, the original primary is updated. Major version upgrades obviously require a lot more planning and testing. So the operator isn't going to take away all of those tasks for you. It's not going to take care of reading all of the release notes. It's not going to take care of testing your application with the new version. It's not going to take care of checking your application code to make sure that you're not using any deprecated features, for example. But you do want it to perform automated upgrades from one major version to another. So in the case of Pego, that uses PG upgrade. Other operators might either use PG upgrade or logical replication or PG dump and PG restore. Does the operator mean then that we don't need any database expertise? Well, as we saw, there is a lot of database expertise that's built into the operator. But it's not going to do everything. We still need a human export for things, experts for things like strategic considerations, looking at the need, the actual needs of the database application, considering business requirements, for example. Okay, performance tuning. Again, it's not going to do everything for you, but it can do certain things. You'll still need to do the initial setup, making sure that you've got your application configured the way you wanted, et cetera. But you do expect the operator to do some of it for you. So it could set initial parameters to a sensible value. It could make sure that you've got connection pooling available, make sure that you've got the PG stat statements, extension available and enabled, make sure that slow queries are being logged, for example. And as we saw before, make sure that you've got monitoring and alerting in place. Capacity planning. So the monitoring and alerting that you've put in place should mean that you can see what's going on in your database. You can see the resources it's using. You can see how much space it's using. You should be able to know approximately what kind of trends you're seeing. In your definition, in your manifest, your definition of your cluster, you'll have said how much storage you want. If you're using a storage class that supports dynamic resizing, you can just change that in your manifest, reapply it, and your volume will be resized. If not, you can create a new instance with a bigger volume and use the same technique that we saw for the Postgres minor version upgrade to do a rolling increase of your volume. If you're using that rolling technique, you can also use that if you want to reduce your volume in size. Other resources such as CPU and memory, for example, can also easily be scaled. And you can use things like the request and the limits to make sure that you allow it to claim more resources up to a certain threshold. Database creation and database maintenance. So users and databases, I don't know the details of how this works in other operators, but in Pego, for example, you can state a number of users that you want to have created automatically in your database and the databases that they should be able to access. If those databases don't already exist, it will create them for you. Database maintenance is a really wide ranging and very unspecific task. So this is a list of some of the things that might fall into that category of database maintenance. And we've already looked at a lot of them. So we know that we can expect our operator to help us with a lot of those. And other maintenance tasks such as index rebuilds, for example, gathering statistics, that kind of thing could be scheduled via the operator. You can define everything in the same place so that you don't have to then manually change things and implement things later. Okay, so you're now obviously really excited to give this a try and see all this magic for yourself. How can you do that? I'll show you how to get started with Pego, but as I've said, other operators are available. First of all, beg, borrow or build yourself Kubernetes cluster. As I've said, that can be either one that you build yourself, that can be in the cloud, that can be managed for you, or it can be vanilla Kubernetes. It can be one of the many different things. It could be OpenShift, Tanzu, Rancher, all sorts of different Kubernetes platforms available. Next, fork the Postgres operator example's repository, which gives you a sample manifest. It will give you Helm charts, customized manifests that help you install and configure and deploy your first Postgres cluster using the operator. Okay, so I'm just going to go through this step by step. So clone the repo and navigate into it. Create a Postgres operator namespace, and if you're lazy like me and don't like to keep typing minus n and the name of your namespace, set it as your default namespace. Install the operator using the customized file that you'll find in the install default folder. Then you'll see that it will create a load of resources for you that are needed for managing and managing that database cluster. So the one that we're most interested in is this Postgres cluster custom resource definition. That's what's going to let us define our cluster. Now, to define our cluster, we're going to just use the example Postgres.yaml that's provided for us, why reinvent the wheel. So I've created a copy of that in a Fostum folder, and then I can make whatever edits I want to my Postgres.yaml. So the first couple of lines here is just saying that I'm creating a Postgres cluster resource, that I'm going to give it a name Fostum just so I know which cluster it is, that I want to use Postgres version 16, that I want three replicas. So replicas here is in the Kubernetes sense of the word replica. So that means three database pods. So I'll have a primary database pod and two standby or replica database pods. And then I'm just using the default storage class, leaving all of the defaults there. So I'm just going to have a local volume here, but you can specify whichever storage works in your environment. You might want cloud storage, network storage, local, you know, whatever you're using. And I've just said that I want to have a one gig volume. That might not be hugely visible right down at the bottom there. Okay, last few lines of the manifest. So the last few lines set up the backup and recovery. So at the moment, we've got backups, pgbackrest, it is, it's just pgbackrest. I'm just going to configure a single repository called it repo one. And again, I'm just choosing all of the default parameters. So I've just got a local backup repository. You probably don't want to do that in production. You will probably want some kind of sensible place to store your backups, but this is just my little test cluster. So a local volume is absolutely fine. You can specify multiple repositories if you want to. So you can have a local repository and a cloud repository or a Google cloud repository and AWS one or whatever combination of repositories you want. Okay, so once I've created my manifest, that's my definition of my cluster. I apply that and the operator will set me up a three node high availability post-press cluster. So it's now got the Petroni managing that high availability. I've got a service that points me to my primary database. I've got all the things that we talked about before. So if we have a look at the pods that that's created for us, we can see that was my operator itself from when I did the operator install. These are my three, oh, sorry, no, those are my three post-press instances. I can use a different command if I want to see which is primary and which is stand by. It's created my repository and it's taken an initial backup for me. I've also, I've not talked about that, but there's also a PG admin pod there as well. So you can use PG admin to log in and look at your database and run queries, et cetera. So that was, I think it was a 26 line manifest. That's, that was enough to get you up and running with high availability, backup and recovery. You can then make all sorts of changes. If you tweak that manifest, you can set up backup schedules. You can create that standby cluster that we talked about. You can install the monitoring stack. You can implement connection pooling with PG bouncer. You can set your different post-press parameters, your patroni parameters. You can tell it to run certain SQL queries when it initializes your database, et cetera. I forgot other things. You can tell it where to schedule your pods if you want to. I've just left everything at the default and let it schedule them wherever they want. I've got a three node Kubernetes cluster and I'm just leaving it to do its thing. So that was just a really quick kind of, how can I get started? But I really do, even if you're not planning on using it in production, it's really good fun. So give it a try, kill your pods, delete services and watch it kind of repair itself. It's fun. So conclusions. So a post-press operator for Kubernetes really does act like a virtual database administrator. We've seen that it knows how to do most database administration tasks. It can automate everything from deployment of a high availability cluster to backup and recovery, monitoring, upgrades, et cetera. It lets you implement a, I think this is from my marketing team slides, it lets you implement a robust, secure, scalable architecture. It combines the strength of post-press and Kubernetes so that it keeps your database cluster running smoothly. And more importantly to me is it leaves you free to do the strategic, interesting and fun bits of database administration. So that is all that I've got to say on the topic of post-press on Kubernetes. And before I move to my thank you slide, I just want to do a plug in case today hasn't been enough post-press for you. The next community post-press conference in Europe is PG Day Paris on the 14th of March. And we obviously really hope that as many of you as possible can join us. And just for Fostum, we have created a 10% discount code with limited availability. So I think that's available just until tomorrow. So very much hope to see you there. And that's me. I've put a link to the slides there in case anybody wants to see those. And I think, do we have time for questions? We do. So thank you. That was a very comprehensive talk with a lot of useful insights. Anyone who, I see a hand there. If you can make sure the next question is right at the bottom so that Jimmy has to run back and forth, that'd be great. And can ask you a favor. Can you repeat the question please so that it makes it into the video as well? Say you want to install an extension that's not by default in post-press, like post-GIS. How would that be handled by the post-press operator? Will it be detected when upgrading in such? So this operator does have that, sorry, the question was, if you want to install an extension that's not by default, something that's not by default in post-press, how would you handle that? So for this particular operator, post-GIS is one of the extensions that's available in the images. For others, I don't know, but I suspect that that would be available because it's an extremely popular extension. So we tried to include the most popular extensions. Otherwise, you can create a layer on top of the container images that are provided for you. You can install the extra extensions into that. Some of them will let you create your own custom sidecars. So we saw the extra helper pods, so you might be able to install certain things into a sidecar as well. You said that if a primary instance goes down, then the job of the operator is to assign, for example, replica one is now the new primary. So why to rephrase it, why is it, we don't want the operator and the primary instance to run on the same worker in the Kubernetes because if that worker is shut down of the electricity, there isn't anyone to assign a new primary. Okay, so the question is to do with the operator assigning a new primary database and saying that we don't want our two database pods to be on the same worker node, is that correct? So actually embedded in the operator code in this case are some anti-affinity rules. You've spoken a lot about the advantages. Do you also know some downsides, like for example the lower performance on the same hardware or something like that? So the question, I've obviously spoken a lot about the advantages but are the disadvantages, for example, performance for the same hardware. I haven't done extensive, well I say extensive, I haven't done benchmarking, but just anecdotally from what our customers see, they're not reporting any significant performance degradation. That's not to say that there isn't any, I haven't like I say, I haven't done those tests, but we certainly haven't seen customers saying we moved to Kubernetes and it's running more slowly. So you said that progress instances, progress pods are being managed as a stateful state, but what about pullers? So how many pullers do we need? For example, if I want to expose read writes and read-only service to my applications. So do you use a single puller for those read writes and read-only requests or you use a separate set of pullers? So the question if I've understood correctly is how do we use a single puller or multiple? You can configure, it's up to you depending on your actual use case, depending on where your connections are coming from, how many connections you've got, how they're being used, etc. You can define how many you want. There was a very, on Friday for the extra PG day, there was a very interesting presentation at by Joe about a problem with G-Lib C and correlations and one of the workarounds was you created your own binary. That's going to be a lot more complicated in Kubernetes or is that something which your operator supports? I'm just curious how to manage that sort of rare but important edge case? I guess that's the kind of situation where, oh sorry, repeat the question. So there was a talk on Friday by Joe Conway where he talked about an interesting edge case where there was an issue with G-Lib C and the workaround was to recompile the binaries. So is that more complicated with the operator? I mean it's for the average user that's going to be complicated whether you're running in Kubernetes or not potentially. That's the kind of thing where we would probably recreate a container image with that workaround and make that available. So certainly if it was for a paying customer, I imagine that's the kind of thing that would be done with the images available to the community. I guess at some point that would be made available or as I said before you can create your own images so you can base your own images on the ones that we provide so you could potentially do it in there. So potentially a bit more complicated but it's still the same process. And we have time for last question over here. Sorry. I was wondering how backups can be restored after a cluster-wide issue for instance. So the question is how can a backup be restored after a cluster-wide issue? So in the manifest there's a section where you can say what the source of your cluster should be so you can say that it should come from a backup and you can obviously put in your point in time recovery requirements etc. in there. Thank you very much.
Postgres vs. Linux filesystems
Hello, thanks for the introduction. So my name is Tomas Wunderer, I work for ABB, I'm a Postgres contributor, cometer, developer, and so on. And I'm here to talk about Postgres versus file systems, right? If you want, you already can find the slides on my personal website. There's nothing much else, just talks. I gave at different conferences, this talk is already there. So if you want to look at the slides in more detail, you have the chance already. So a very short overview of what I'm planning to talk about during this talk. And by the way, if you have any questions during the talk, please shout. I prefer to answer the questions as we go, because we will go through different stuff. And it's easier to answer the question about the topic that I'm currently talking about, right? So in the first section, I will briefly explain how Postgres uses file systems and how. And I will a little bit talk about the overall design and maybe some advantages and disadvantages of that, try to explain why it works the way it works. And then I will get to the main goal of this talk and this like give an overview of how Postgres performs on different file systems. I'm going to restrict myself to file systems that are on Linux, like the usual file systems that you probably can use for production workloads or might consider. I'm not going to talk about like experimental file systems or file systems that are not used regularly. I'm not saying those file systems are not interesting, but I need to restrict myself to something that is actually benchmarkable and so on. And I'm also going to talk about file systems on storage that is attached directly to the machine, right? Because once you introduce like network attached storage, which usually introduces like latency and so on, that changes the behavior significantly, right? So I'm going to talk about if you are concerned about performance, you probably use directly attached storage anyway. And I'm also not going to talk about like managed instances because if someone chooses the file systems for you, then this is kind of like useless, right? So let's assume that you do have access to storage that is attached directly to the machine. And that you have a choice to like which file systems to use. And I'm doing these benchmarks and these talks because I wanted to learn myself something. I'm not here like, I'm not an expert on file systems, right? I'm a database developer, database engineer, and I wanted to know like how does it work now? Because like we do have benchmarks from like 20 years ago, but the hardware changes over time evolved and like was the current situation, right? And in the end, maybe we will talk a little bit about future file systems of like storage in Postgres, but there is a really nice talk by Andres Freund who talks about like how we might evolve one of the things about storage in Postgres, which is direct IO, asynchronous IO and so on. And there are actually developers that might give better opinions on this, right? So there is like a talk from the PGConFU, which is like two months ago. It's on YouTube available. You can find it. So I restricted myself to measuring data on like the usual Linux file systems, which is XT4, XFS, those are the traditional ones, then the new which is BTRFS and ZFS. ZFS is like not like native Linux file systems, of course, but it's commonly used. Then I've been like thinking usually don't have a single device, right? You have like multiple devices, so should you use like LVM to build like a volume on those devices? Or should you use something that has multi-device support built in, right? Which is like BTRFS and ZFS, they don't require you to build like a rate, a rate and that. Then there's like the question of snapshots, right? If you need just the bare file system and POSGRACE is fine with that, then the XT4, XFS are perfectly viable solutions. But maybe you want something smarter, right? Maybe you want to be able to do backups using snapshots, or maybe you want to use the, you know, the sent and receive which is built into ZFS to replicate data or stuff like that, right? So the question is like what happens when you actually want snapshots up more? I did some experiments with stuff like compression and so on in the file systems, but in my benchmarks, in the stuff that I benchmark, it didn't make much difference. I'm not saying that that's like universal truth. But I'm not going to show any results with and without compression because simply there was no difference for the OLTP workloads that I tested. So a very brief executive summary, just like to explain what I found or what I think is my conclusion is that in general, you should prefer a major supportive file system, right? You do run databases because you want to maintain the data, right? I mean like if you have a file system which is like super fast, experimental, and you know once in a while loses your data, it's like okay, maybe you don't really need like F-Sync at all, right? So my recommendation is in general to use supportive file system which is supported by whoever supports your operating system or like the environment. And that usually means like one of those four fast systems that I mentioned. The other thing is that you should use sufficiently recent kernel. And there are two main reasons for that. First is well, we do improve the database, but the kernel improves other parts too, right? So if you are using like old kernel and that might mean a couple years ago, a couple years old, you are losing a lot of optimizations and improvements that are typically focused on like a new hardware. So if you are using new hardware, you are usually losing a lot of performance. The other important reason of course is the bugs, right? And I'm not just talking about the regular security issues and so on. I'm talking about like data corruption issues that are in the kernel. I think I do have a slide where I mentioned the F-Sync gate, which I think I spoke at FOSDEM about in 2019. But the other part of the executive summary is that EXT4 and XFS are roughly the same performance. I mean like I don't think you need to talk very much about like should I use EXT4 or XFS? Like will it be faster or slower? Through this, I mean like in my experience, the differences are fairly small. And by fairly small, I mean like 10% difference in throughput for example. It's something that I believe I could probably tune the file system, actually eliminate by tuning the file system or maybe buying a slightly faster disks or like something like that. Is it a throughput? I also put 10% for the overall communication piece. Yeah, so the question is like how do I measure, you know, the performance? What do I mean by throughput? I mean OLTP performance in the database, which means like small, random IO, random reads, random writes and so on. So that's what I mean by difference in performance. If the database does like 100,000 transactions per second, and the other on a different file system, it does like 110,000 transactions per second. That's the throughput that I care about. Does it answer the question? Yes. Yeah, cool. Obviously throughput is not like, I'm going to talk about like other ways, other things that need to be considered when comparing file systems. But this is like the gist, right? Like if I had to choose from the EXT4 and XFS, I would probably pick what's default into distribution that I'm using because that's simply easier. And then of course, if you need something like more advanced, if you need snapshots, for example, and if you use them heavily, then I would definitely go with either ZFS or BTRFS. And I'm probably way more in the ZFS camp, like because of the reasons that I will talk about later, about the results. Obviously, if you only need snapshots like once in a while, you could use LVM and snapshots at that level that works. But the native snapshots in copy on write file systems are usually much faster. They have like much slower impact on throughput of the database or on performance. Right. So the first thing I'm going to talk about is why Postgres actually relies on operating system so much because there are databases that just kind of like ignore the file system or and either like implement like completely custom file systems on raw devices or do something else, right? Like do much more like direct IO and so on. And the answer is like, I do recognize the complexity of the file systems, right? Database engineers sometimes have the tendency to say like, oh, it's file system. It's simple, right? It's like you overvalue the complexity of the layer you are working on, on the database and kind of like diminish the say like, oh, no, all the layers are simple. The stuff that I'm doing is like very, very complex, right? And I want to say that I don't think that at all. I do recognize that all the layers, both below the database and above the database, have significant complexity, right? And I'm not here to talk shit about file systems. I'm here to learn something essentially, right? So Postgres is a database, right? We are storing and accessing data and that's the whole point why we actually do what we do. But we do leave the low level stuff to the operating system and the operating system implements the on-disk format, it implements the caching to kernel, it implements the device drivers that communicate with the hardware and so on. And we just use the Postgres interface on top of the operating system, on top of the kernel. And all the low level stuff is responsibility of the operating system. That might change a little bit with the patches that improve or start using the asynchronous IO and direct IO, but so far that wasn't the case. The question is like, is it even a good idea? I mean, shouldn't the database just do everything directly and just ignore the operating system? Well, sure. If you have the developer capacity to do that, if you have an infinite amount of money to actually spend on development, then sure, you can do everything, right? But the project doesn't have this advantage, right? We do have a limited amount of time and so on. So we decided or like, I haven't been contributing to Postgres back then, but the choice was to just leave as much as possible to the operating system and it worked quite well so far, right? And I'm not sure it would even be possible to do the custom stuff because there was so much, for example, Postgres supports many platforms and the support for direct IO and so on varies a lot between the different Unix systems, right? Even though they all support, you know, implement Postgres, right? So there's like a lot of difference, a lot of nuance and we would need to deal with all of that. So that would be terrible, I think, and it would not allow us to actually improve the database as much as we did. And of course, by relying on the operating system, we automatically get all the benefits, all the improvements those guys do, right? So if they improve the file systems, we do get the benefits, which is great. So how Postgres actually works in general, like a very simplified idea is that we have like the Postgres as an application, essentially running on top of the kernel, which has some, you know, shared buffers, which is memory managed by the database. And then we have some processes which are either doing some, you know, maintenance operations or whatever, or as the backend processes are handling user connections, right? So you connect to the database, it will fork a process and the process will access the shared buffers, which is where the data is going to be for the backend, right? And when the data is not actually in memory yet, the Postgres will read the data through Pagecase, which is managed by the kernel, and the kernel will do some magic and will, you know, read the data from the disk to the hardware interfaces file systems and will include some IOH scheduler to govern all the process, right? So that's roughly how it works, how Postgres is designed. With the Direct IOH, we will kind of like ignore the Pagecase and we will, you know, do still talk through to the operating system facilities, but without the Pagecase, right, of course. And in that case, the shared buffers will be much larger, right, or it should be much larger, of course. Right, so this is like the Direct IOH kind of. Anyway, we are still essentially in this, and this whole talk is about this architecture. So I spoke about a couple of reasons why you should use like new kernels and what are the problems with like relying on all the kernels, and that's like, well, there's a lot of things that can go wrong, and there's error handling, but what happened in like 2018, we discovered that we are actually not either receiving the errors at all from the kernel, because for example, you open a file to one file descriptor, you do some writes on the file, and then you close the file, or the file gets closed for some reason, and no one actually knows about the error at all. Even though you might have like another file descriptor for the same file, you will not read, learn about it. And there's like different ways to lose information about errors during fSync, for example, which is pretty fatal for Postgres, because we do rely on the fSync, for example, during checkpoints, right? So luckily, it's not a very common issue. I mean, I don't remember when I actually got like the last fSync error in, you know, when working with the database, but when it happens, it can, it should definitely not be like a silent corruption, right? So this was fixed, I believe, but again, it's something that needs to be, that is fixed only in sufficiently recent kernels, right? So you need to run a recent kernel to be immune to this. The other problem, of course, is that, and that's not like a bug, that's a problem with the design in general, and it is because the most of the IO activity is actually managed by the kernel, by the operating system, but it does not actually have any insight into what the database needs, right? It has no concept of, well, this write is more important than this write, right? Because this write affects user activity, and this is some sort of like a background task which could wait, right? Like the operating system has no idea how to, has no way to actually, you know, differentiate between the writes, so that's one reason. The other reason is, for example, prefection, right? The current storage systems rely heavily on actually having full queues. If you only request from SSD one block at a time in a synchronous way, it's going to be really slow. If you submit like many, many IO requests ahead, then you are able to saturate actually the storage device, like the, the tooth, right? You get like much better performance in general. And again, that's something that the database needs to do explicitly. It's not something the operating system can, can do on its own, and we do actually rely on the operating system to do prefetching for sequential scans, for example, but the, we need to do explicit prefetching for other types of scans. So for example, during index scans or bitmap heap scans, we need to do explicit prefetching. So it's like, this is a design problem. So rule number one, use recent kernel. All the kernels have all kinds of issues. Okay. It's not always like perfect. There are regressions in kernels too. Once in a while, you can get like a dropping performance because something went wrong. But overall, I think it's like something you should do. Right. So this was like a very basic explanation of, of, of why POSGRESS IO, why POSGRESS uses file systems the way it does. And now I'm going to talk about some benchmarks and stress test because this is like very, like high level. Right. I like to, to do some measurements and look at the numbers and say, like, okay, so this performs well, this sucks. Right. So what I did, I did a lot of stress tests, which essentially means running PG bench, which is OLTP database benchmark tool. Simply, it does a lot of like random IO to POSGRESS. And I measured like the truth. The first important thing here is that this only really matters if you are IO bound. Right. If you are hitting the, the storage, then that's the only way, that's the only case where the difference in file system performance can actually affect the throughput. Right. If you are CPU bound, for example, because you are working with very small amounts of memory and it's all in cache, then the file system doesn't really matter. The other reason, of course, is that typical production systems are not using IO for 100% time. Like once you hit, for example, saturation of like 75%, right, you are already like being affected by, by latency increase system so on. At that point, you probably already are thinking about upgrading the storage system or like migrating to do something else. So that's one reason. So keep this in mind when interpreting the results that I'm going to show you. It's probably like the worst case scenario. The other thing, of course, is that I only have some particular hardware available. And some of the file systems, especially like ZFS and so on, they do support a lot of different features. So you can like move the intent look or stuff like that to a different device. I didn't do anything like that. Right. What I recommend you do, if you are actually evaluating like different file systems for your use case, to actually try that with, with the hardware that you are considering. Right. To actually do your own measurements. Right. I would love to have like a perfect benchmarks for all possible hardware configurations, but it's not possible. Right. So I'm going to show you a bunch of results, a bunch of charts. And I'm going to, I think what is more important is like not the exact numbers, but it's more about the visual, like, like understanding like what's happening. Right. So for example, this is from two machines. This is like a smaller, older Intel machine. This is larger Xeon. And this is the time that it takes to do a bulk load into the database. Right. Of scale 2000 means like, I don't know, 30 gigabytes of data. And this loads the data, builds indexes and so on. And the first bunch of results here, which is in seconds. So the shorter, the better. These are just like regular file systems on LVM without any snapshots, just like right. And then there are a couple that are two that are actually multi-device without the LVM using the BTRFS or ZFS file systems, like multi-device support. Right. And you can see that it's like almost the same, except for ZFS for some reason, it's like much slower. Right. But that might be a hardware issue or like specific to this hardware configuration, because on a different machine, which only has a single device though, it's like NVMe, the difference is like much smaller. Right. And there is no LVM because there are no multiple devices. Right. So that's one thing. That's what I mean, what I said that the difference between EXT4 and XFS is like usually very small. And then we have a couple results for snapshots when you start creating actually snapshots on LVM. And you can see that it like improves, oh sorry, improves like, degrades like significantly. Right. So it suddenly takes like twice as much time in some cases, except for the native file systems, like the copy and write file system, BTRFS and ZFS that didn't actually like got much worse. Right. So this is similar thing you can see here for the other machine. So what I conclude from this is that if you actually do need the snapshots, use the ZFS or BTRFS. Yes. So for BTRFS, I just did like a regular, I didn't want the specific like explicitly else. Right. So I just created the BTRFS like. Because the easy optimization would be to turn on copy and write for the files affected by the database. And then when you do the snapshot, it still does copy and write only and in those points. Right. So I consider like disabling copy and write because there's like an option for, I'm not sure if it's a mount option, like no data copy and write and so on. The problem with that is, as far as I remember, is that it actually disables like checksums or affects like these capabilities. Right. Which, and that's what I don't want. Right. I do want the checksumming and so on for these file systems. Right. So that's it. Right. Well, these are the results with the LVM snapshots and these are the built-in snapshots. Right. So my conclusion, if you want snapshots, if you need snapshots for like, because it makes, for example, backup simpler for you, use these file systems, then I do have some results for OLVP-PG bench, which is like a read-only mode. It simply means like select by primary key. Right. It does a lot of random IO. This is like the large, large scales, which means that it actually is hitting the disks a lot. It's not like in memory. And you can also see that on the smaller machine, which is like just four cores, the differences are fairly small. The ZFS is a bit slower. I assume that it's because it's not using the page cache. It's using the ARC cache and there's like, like different size. It's like smaller than the page cache in this configuration. So that's fine. On the larger machine, which you can see that this is like five times or four times higher throughput because it's using NVME, then the beta RFS is getting slower. ZFS is slightly slower also. Right. Which again, in absolute numbers, this is not great. If ZFS is giving you or like beta RFS is giving you some additional features, I think this is perfectly fine. For the read write, and I'm actually showing for the read write like different scales. Scale, this is like a small scale, which means everything fits into shed buffers. So we are actually doing a lot, very few random writes. A thousand here means it fits into RAM, but not into shed buffers. And this is like much larger than memory in general. And you can see that, again, the EXT4 XFS kind of like perform the best. And unfortunately, the copyright systems, once you exceed the available RAM, get much slower. The OLTP PG bench is not exactly, it's very uniform access. Right. So, yes. Do you use large blocks on ZFS? So, for ZFS, I use the 8 kilobyte blocks. Right. So I reduce the size of the block to match the postgres data block. What I was going to say, well, I wanted to say that PG bench may not be a perfect thing to model your database, your application, because it randomly and uniformly accesses all the different parts of the database. But usually what you have is you have a very active subset of the database, which probably fits into memory. And then you have the rest of the database, which is like historical data or users that are not very active or something. So, this, which means that you probably are not very affected by this. This is like the worst case possible. Right. And you are probably somewhere in this region. Right. In which case, the ZFS is slower, but not by much. So that's one thing you need to consider when interpreting the benchmark results and applying them to your application. Right. But one thing I'd like to mention is that throughput is not the whole story. Right. I mean, if you only get information about like how many transactions you can get per second, that doesn't actually, you know, fully explain or fully describe the database or like the performance of any system. The other thing that you need to look at is latency. Right. Because if you get like very different latencies, like one, one request gets handled in one millisecond, the other request gets handled in five, five minutes. It's like in, on average, it's probably not very good performance. Right. So what I did is I actually show behavior over time, not just for the whole two hour run, but I show how actually the performance changes over time. And this is like the throughput. And you can see that EXT4, one thing I want to say, don't look at the numbers. Right. The numbers, you may not even be able to read them from the back. That doesn't matter. You can look at the slide later. What matters is that you can compare the charts visually. Yeah. You can, you can look at the first row and that's the small data set, which is the data set that fits into shared buffers. The other row is the medium, which is like fits into memory, but doesn't fit into shared buffers. This doesn't, the third one, large one, doesn't fit into memory at all. But that's the read write. And this is read only. Right. All these, this is small read write, medium read write, large read write, large read only. Sorry, there is a mistake here. And this shows like how that actually behaves over two hours. And you can visually compare each row. Right. So you can, for example, see here that EXT4, XFS are really, really stable. Right. You get really, very similar throughput over time. BTRFS is a bit slower. ZFS also very stable. And then once you get larger and larger data sets, the behavior changes. Not for EXT4, XFS, of course, you get like slower, lower performance. But for example, for BTRFS, you get like much more, much more jitter in, in the throughput for per second. Right. So, so that's not great. Also, you get like progressively slower throughput. She's not great for ZFS. It's similar. Right. I mean like, you get like more variants in, in the throughput. And ultimately, even for ZFS, you get like much lower throughput for read only. But I started talking about latency. This shows me still just throughput over time. It shows me like how, how it changes over like two hour period. It doesn't show me latency. Right. So, this is the result of percentiles of, of the same test. And ideally, you would see something like this. Right. I mean, this is, I think, 25%. 50%. 75, 95, 99. And ideally, you would see like perfectly straight lines, which gives you very consistent performance over time. Right. So, this is really, really nice. I mean, like the throughput was fairly low. But this is really nice because it's very predictable for operation. Similar thing here. Right. You get some blips here, some, you know, spice latency and so on. But it's very short, very predictable, really nice. And you probably will not even see this in like a monitoring. For ZFS, it's not that great. It simply needs to do, I don't know, compression, whatever, do copy on write of the data. For BTRFS, it's unfortunately much worse. Right. This means that the, the latency spikes are pretty significant. I mean, if you look at the throughput, you can see that there are like a lot of fluctuations here. So, that's not great. I would definitely, as a DBA, I would like to see something like this. Right. Because it gives me nice smooth behavior. This is okay. This is not great. Okay. For the smaller machine, it's like a very similar, similar story, except that the differences are not as pronounced because simply the storage is not as powerful. Right. I mean, like you get similar performance for the smaller dataset, then as we are increasing the amount of random writes in random IO, it gets worse. And of course, similar, similar outcome for, for the latencies. Right. So, I use this as a visual way to compare the results. Not the exact numbers, but like how the chart looks like. Right. And I think I do have to say like a super large machine, which is, I don't know, 100 cores, AMV epic with four NVMEs. And you can again see very similar pattern with like EXE4 XFS. There are some fluctuations here. I'm not sure what exactly that is. I need to look into that. But the, and I would say the ZFS behaves like better here. It's like nicer. You can see those are most likely checkpoints, these spikes. So there's probably a way to improve this. Similar for latency, right. Like these are really nice. Well, you can always improve that, but this looks really nice. ZFS is slower or worse. BTRFS has some latency spikes that would cause a lot of trouble in production. Right. So, there was just like looking at the file systems and with some basic tuning at the file system level. But there are also things that you could think about at the Postgres level. And the first level is, well, you need to be careful about, about filling the page cache. Right. Because what can happen in Linux and with the default configuration can happen quite easily is that you accumulate a lot of dirty data in the page cache because Postgres will just write stuff into the operating system and then eventually call fsync. Right. And if you accumulate like 10% of the RAM in the, in the page cache and then say, okay, write all these five gigabytes of data to disk at once that will inevitably affect the, the user activity. Right. So, you need to be careful about, for example, decreasing the background bytes. And I think I do have this here. This is EXT4 with the default, default here, which I think is one gigabyte for, for this machine. And this is the throughput for, if I decrease the, the dirty background bytes for 32 bytes, 32 megabytes. And you can see that it's much, much more consistent. Right. Because here the, the gray, gray chart is essentially like per second throughput. And the red one is like average over 15 seconds. Right. So it's like a smoothed out. And you can see that it's like almost the same throughput, but this is like much more variable. And for the latencies with 32 megabytes, sorry, 32 megabytes, one gigabyte, it's the same, same thing. The decreasing the, decreasing the dirty background bytes makes it much more consistent. Obviously, if it had just like benefits that would be the default. Right. Unfortunately, if you decrease this, you kind of like reduce the throughput of the machine, of, of, of the system. Right. How much, I don't know, you need to test it. Right. Or I do plan to do the test. I don't have the numbers yet. But in this case, obviously, the, the impact is like minimal. So that was one thing I want to talk about. Yeah. The other thing I wanted to talk about is full page rights, which unfortunately something Postgres has to do. It means that after each checkpoint, the first change to the page will write the whole eight kilobyte into, into the transaction lock. The problem with that is that it inflates the amount of, you know, data we write into transaction lock. And it can easily happen that you, you just, by doing the full page rights, you hit the next checkpoint. Right. Because you write so much wall that you are required to do the next checkpoint. And it's like infinite loop. Right. So you will do like a lot of full page rights. I do believe that ZFS actually allows you to disable this. Right. So in ZFS, you can actually optimize the Postgres to benefit from the feature of ZFS, which can be very beneficial. The problem with ZFS that I run into is that it's really difficult to configure prefetch like for, for sequential scans, for example. I mean like PGDOM, for example, if you do that on, on the database for me, it took like twice as long as on the other file systems. Right. I'm, if there is a good way to enable prefetch on, on ZFS, I'd like to know about that. But I found like, you know, 10 different options at different places in ZFS that should be configured. That's like very difficult for me. Right. So what about snapshots? I mentioned that with snapshots, you would probably expect lower performance. Right. Because the, the file system needs to do something else. Right. With ZFS and BTRFS, that's not really the case, because they do copy and write by default. So that's okay. But what is the impact of doing a snapshots on the EXT4 XFS in case you are using LVM? Well, these are, these are the results for EXT4 LVM snapshots, BTRFS with LVM, BTRFS when you do that natively in BTRFS, and ZFS with native snapshots. Right. And you can immediately see that if you are doing snapshots, the, the ZFS and BTRFS can easily compete with the EXT4, which can only do that through LVM. So that's like, if you need snapshots, if you want to benefit from snapshots, if you are willing to pay for snapshots, then ZTRFS or BTRFS can actually do a pretty good job. Like at least as good as the traditional file systems. Of course, there's still the problem with latency. In this case, once you start doing snapshots, snapshots on EXT4 and LVM, the latency gets much worse. And I would even say that the latency of ZFS is better. It's more predictable. BTRFS is still a bit slower. Or like, obviously the latency is much worse. In all those charts, the scales are always the same for all, you know, charts in the same row. So it's like easy to compare this. So you can see that the 95 percentile, which is the, you know, the violet here, is much higher than here. So this is from a different machine from the large AMD. And you can see that, of course, with, when you have like EXT4 with no snapshots, it's, it's really fast. Once you start doing snapshots on LVM, and by doing snapshots, I mean like having three snapshots at the same time. Right? So during the benchmarks, I just created like a snapshot every five minutes, and then deleted the snapshot after 15 minutes. Right? So there are always like three snapshots at the same time. You can see that this is like a massive impact on EXT4. And I'm not sure if you are willing to pay for that. And then, of course, like, BTRFS is better. ZFS, sorry, this is BTRFS with no snapshots with snapshots. And there is like no difference here, right, between those charts. So which is great. That's exactly what we expect from, from those file systems. And just to compare BTRFS and ZFS, again, ZFS no snapshots, ZFS snapshots. You can see there's like almost no difference when you enable and start doing snapshots, which is great. Exactly what we expect from copy and write file system. But the comparison between BTRFS and ZFS is pretty clear, especially for this scale, for example. So this is one of the reasons why I'm more like a favor, a fan of ZFS. So that's all I wanted to say today. If you want, you can find all the results, all the charts on GitHub. If you want the source data, or if you want the, the scripts that I used, I am very open to just providing them. I have no problem with that. It's multiple gigabytes of data. So that's why I didn't put it on, on GitHub. But I'm still going to do more benchmarks. I will publish it there. If you want to look at a very interesting paper, which I think explains a lot about like the challenges, how actually we need to saturate NVME storage. There is a very nice paper from VLDB. I highly recommend it. And yeah, I think that's all.
Build Your Own PostgreSQL DBA Out Of Available MySQL DBAs
Before we get started, these folks in the blue vest, the ones that just left the ones here, give them a big round of applause. They have a lot of time and effort. They've been doing something like napping or eating. They're doing something more fun than this. They've got to do this for the lovely youth in the community. So today we're talking about building your own Postgres DBA out of available MySQL DBA. And Dave Stokes, I'm a technology evangelist for Percona. Two years ago I joined Percona. Before that I was on the community team for MySQL. Stars MySQL AD, you don't know through sun and the more flax acquisition. Origin story. 2007 I hired to join MySQL AB, which was then a cute little startup company. And became the, well, what's it mean a PHP programmer to be running the certification group? If you want to be certified to be a ZBA or developer, I was the guy who signed off on your application. The big trouble I had was that hiring managers would call me and they'd say, it's hard to find a MySQL DBA. And I said, well, you have a list of folks who recently certified on the website. And they're like, yeah, by the way, it's impossible to find Postgres DBAs. And over the past couple of years, or several years, I've noticed it's still that problem with getting DBAs. So an economics comes down to a make versus buy decision. If you go out in the free market trying to buy a DBA, it can be expensive. You're also not quite sure what you're going to get. But if you have folks who run MySQL databases and they're fairly tolerant at it, you might have a good chance of converting them over to running Postgres. This is the definition of make versus buy for those who are interested in that. So why MySQL DBAs? There's a lot of them out there. They have basic knowledge of what a database does, what you do to it, what you don't do to it. They know the care in feeding, watering, and taking care of the basics. They're usually Postgres curious. For years they've heard, yeah, MySQL is okay, doesn't do this, doesn't do this right. And Postgres is so much better. And they can go, yeah, well sure, eventually the curiosity gets to them. Also there's a lot of similarities between MySQL and Postgres. Both were started by guys named Michael who tend to piss people off by the things they say. Also, when you show them some of the goodies that Postgres has, it becomes very, very attractive to them. Also in the past couple of years we've had a lot of folks who weren't DBAs for businesses who have seen their data taken from underneath them and pushed up in the cloud. What they're doing now is either boring or redundant for now if they want to do. And they're looking for another opportunity to be database administrators. So how do you recognize MySQL DBAs? Now I recognize some folks in this room. It's a couple of them are kind of scooting down in their chairs. First sense are t-shirts and coffee cups. You'll see a whole bunch of wide variety of them. And some of them are kind of cute, some of them are clever, others are kind of scary. But if you know that they're not a Postgres DBA but you know they're the admin of database, you might see some of these signs around. So Postgres versus MySQL differences. Both are relational database management systems. Both open source. Both very popular and both are technically old enough to drink. And since that's turned... Uh-oh, did I not turn it on? Yeah. Oh, there we go. So Postgres. Postgres has better support of the SQL standards. It's governed by a mailing list, which I always thought was kind of crazy until I've seen in actions the past over years. It actually works fairly well. Active community as you've probably seen today in this room alone. MySQL has seen this easier. It's governed by Oracle. If you're one of the folks who spent Thursday and Friday in the MySQL Belgium days, you've seen a lot of the stuff that is coming out from Oracle that may not impress you. But they also have an active community. So someone once said the devil is in the details. Okay, so you found your candidate to make your brand new Postgres DBA. What do you do with them? Well, you're going to mention to them that they're going to, at the end of the process, have better skills, be cross-trained, have better job opportunities, and now complain about knowing two databases. How many folks here run more than one database? Okay. And by the way, what we start is you're going to tell them you have different approaches to the same problem. Different isn't better, isn't worse, it's just different. They're going to learn a whole bunch of new tools. The basics are still the basics. You still have to do backups. You still have to do restores. Account administration is very similar. Tuning is wildly different, but not in theory. Query tuning is a big difference in Postgres. It's a little more complicated. Then there's the really neat stuff. Postgres has two JSON data types versus just one. The merge operator, more indexes as you can shake a stick at. But unfortunately there's some stuff that I call the, oh my god, why do we still have this stuff in 2023? It should be 2024 that you need to warn them about. Let me get to that in a bit. First steps. Give them an environment that's similar to what they're used to working with. The way to do this is go down and download the video rental database. Now for you folks without gray hair, 20 years ago if you wanted to watch a movie in your house, you had to leave the house, go to a store where they had these things made out of plastic, either VHS or Sony beta format that had the movie, and you take it home and you put it in a special player. You actually had to go out and get that. And they may not have your movie in stock. Now 15 years ago, rather than DVDs, rather than having video rentals, they actually had DVDs, which is a little bit easier because that was a standardized format. MySQL used the Secura database, which is also a DVD or video rental database in their documentation, training, and blogs for years and years and years. So you're going to give them a similar environment to what they're used to. And you'll talk them through how to do a simple create database, show them how to use PG restore to load in the data. Real simple. By the way, if you've never seen one of these, this is what a video store used to look like. Okay, now that they have that DVD and rental up there, have them log in. Well, they're still logged in to use your Postgres and create a user. Now, I advise you to do it as a super user. You might have heard Mr. Booze earlier having great ideas of why you don't want to do that. MySQL DVAs are used to having God privileges, and if they screw it up, you go back and you show them how to reinstall the database. As I mentioned, this is dangerous. You bypass a lot of security stuff, but you want them screwing up at this level, so you learn not to do it at a higher level, which is much more expensive for you. Okay, back in their user account, they type in PSQL minus D DVD rental. That tells them that they're talking to the database of DVD, DVD, rental, and they get that lovely little prompt. The prompt with the equal sign OctoThorpe is warning them that they have super user privilege. You'll have to point that out to them. So, at this point, we have a Sekela-like database, something they're used to, and you can have them do assignments, play around, do similar stuff that they're used to. And the great thing about this is it's a familiar to them, lots of stuff to join. It's easy to do, and up to now, it's been dirt cheap for you. Now, the first thing they're going to do is they're going to type in the command show tables. No, show tables. Oh, my God, this thing's broken. What's going on? This is where you tell them different, equal, isn't worse, or better, it's just different. So, you'll have to tell them how to use the slash D commands. Turn them out a cheat sheet, walk them through them, show them tables, show them indexes, show them sequences, and you'll have to explain what a sequence is. But it's just different, and it's going to take them a while to get used to this. Have them use cheat sheets, there's nothing wrong with that. So, if there's no show or create table, you show them how to do a slash D actor, and they get the information they're used to looking. Now, the format, I see one person who's a MySQL DBA that I know, is noticing that the format is a lot different than what they're used to, but it's the same information, column name, type, whether it's nullable or not. This is going to blow them away, but you're going to explain them what a sequence is. It tells them where the indexes are. The information's there. It's just in a different format. Once again, different isn't better or worse, it's just different. Now, have them do a simple query. That works like what they're used to seeing. Hooray. About this point, they're going, you know, this isn't so bad, I can get used to this. And at this point, you have them hooked. Have them do a simple backup. PG underscore dump, very much like MySQL dump. I know you probably use PG restore or something else, but this is a very simple, fairly generic, fairly common tool they can find just about anywhere. And show them that you're piping it to a file. And explain to them, you know, what's going on. By the way, if you're going to be at Confu in Montreal in two weeks, I'm giving the other side of this talk where it's the MySQL DBAs learning this information. This talk is mainly for folks who run Postgres, who want to see how they steer the boat of learning around the shoals and reefs and rocks that are out there for these folks who want to learn. Simple restore. They're used to doing this in MySQL. Now you've shown them how to get around not having show tables, how to do a backup and restore. For a lot of companies, that's 50% of what a DBA does when they get started. Whoops. Once again, print out this cheat sheet. The slides, I'll show you how to get that later. It's, once again, different, and they'll get used to it. The other thing is, when they start looking at data types, they're going to notice that auto increment has disappeared and it's been replaced by serial. There's a serial data type in MySQL. It's not heavily used. There's some issues with using it. But it's the same idea, two bytes, four bytes, eight bytes. That's great. Just call something different. Now we get to sneak in sequences. Now MariaDB has sequences MySQL doesn't. And to get around this, MySQL has something called auto increment. Every time you add a record, and don't specify that column name or specify null as the value for that column name, it auto-mentally increments. Now, you show them this and they're going to be happy. They're going to... We create a simple table and we tell them, okay, this is going to use a function called nextVal. It's going to go out to the sequence and pull the next one off in the stack. It's real simple. And they'll catch that. I give them a quick demo, point out that we're not giving any values for column X into table X, but the system automatically generates them. As you see over here, they're going to be happy with that. They're used to seeing that sort of that work. And you can tell them that if they go out to a slash D, they're going to see the entry for the table and the sequence that supports it. And you're telling them that this X here is the X same X here, this is the column name, this is to say that it's a sequence. So about this time, they realize that things work the same. They just...at the end, but the intermediate function,ality is a lot different. Something else you can do for them that will amaze them is create a table and have it populate itself with a generateSeries function. Can't do this in mySQL. I don't know how many hours I've wasted in my life generating test data in tables, and to have this suddenly unleashed was a big revelation. Wrapping sequences. This is the part where you're going to have to toss them very slowly because it takes a while to catch this. We create a sequence. Minimum value is one, maximum value is two, we tell it to cycle. As we do the next value function, it goes between one and two, one and two. Now there are some edge cases where this is extremely handy. I used to work in a place where we had product that came in a part one and a part two, and there was always a pain in the butt to generate the data for that. Show them how to check the details on sequences. Here's the one we just created, minimum one, maximum two. It increments and it cycles. Sticking points. This is where you're going to have to be patient. In some cases it's like talking to a five-year-old, other times it's like talking to a 15-year-old, and other times it's like talking to a 35-year-old. Explain is a lot different. The MySQL addition that they're used to shows them column, table, partitions, key length, references, but possible keys, gives them the query plan down here. Now, using the explain that's found in Postgres, it's the same rough material. It's just a different format. You'll have to explain to them why you want to put Analyze in here, or Buffers, or whatever else you want in there, and tell them that it's the same rough information. It gives us the query plan that doesn't actually show the SQL. The one thing that they're going to freak out at is they're going to see cost, the startup cost, and the overall cost. They're not used to seeing that information. Then you can explain to them that when you have things like indexes, there's some set up time before you use the index, and that's where this cost will pop up. Once again, different, not better, not worse, just different, they'll get used to the format soon enough. Also, my SQL is not used to seeing YAML or XML output from this. Now, talk them through how to read this, how the nodes are selected and mentioned here. It's a different format than what they're used to, and they will learn how to pick out the various information from the various tables. In this case, I put Film in Magenta, Film ID in Blue, and Actor in Red. Show them that the various types of joins that are available to them, and it's just in a different format. I don't want to calmly discuss. This is the hard part on you because you're used to this, they're not. I mentioned sequences already, they'll pick that up. Materialized views. There is a way to get materialized views in my SQL, but it's a third-party software. It doesn't always work the way you want. That is something that's going to pique their interest because they might have worked in a company where they have something like delayed stock quotes, where a materialized view would have been handy for them if they've had it. Explain, as I showed you, it's different for them. It's just formatted in a way they're not used to. They're used to connecting to the server and getting a thread. They're not used to the overhead of actually getting a Linux process. That's something that's a little different. They also need to be taught that everyone's using some sort of pooler. Not a big problem, but it's there. Now, vacuum. Be very careful when you mention vacuum. They're not used to calling rows tuples, and the idea of the heap will really throw them. That's the hardest part when I start talking to folks about this one-on-one. It's the equivalent of a teenager throwing all his laundry in the middle of the bedroom, and he knows exactly where it is, and he can pull out whatever he wants, but no one else can. Then tell him, yeah, when the stuff gets dirty, you actually run a vacuum over it and cleans it up or cleans up the tables. That will freak them out. There's lots of good documentation out there on the web about that. The next thing to do is teach them automatically about auto-vacuum. That will save a lot of problems. Toast. MySQL has something similar to toast. If you have things that don't neatly fit into a block, that's how it extends it calmly, and we'll actually tell them about that. Then buy them an adult drink and talk about wrap-around XIDs. For the MySQL folks who are in here, every transaction has a unique ID, and it's a 32-bit number. Unfortunately, in lots of possible... It's possible to wrap those around, and once you wrap those around, the older numbers, the data that's tied to them, you can't get to without doing a whole bunch of really nasty mechanical work to get that, and then you lost the new stuff. There's ways to modify... Well, ways to monitor how that goes, but the first couple of times you run into that, or here I mentioned of it, it's frightening. The other thing that you're going to show them that's going to really pique their interest is tricks like this. MySQL, you can't do a filter like this. This isn't a thing where we're going through movies, and we're trying to get the summary of the links for the R-rated and the PG movies. To do this in MySQL, you're going to have to write some CTEs and do some other nasty stuff or some winnowing functions. Here it's just a simple query. Now there's some reading out there for them in the watch. I produced and I'm about to reproduce a series of stuff on MySQL, teaching MySQL DBAs postgres. Halfway through my production of that, the company changed their logo, so I need to go back, update some of the material, change the logo. Checking for bloat and tables and how to scale postgres. The other thing I recommend is this book. You'll hear that it's postgres 14. We're on 16, 17. We'll come up with this later. This gentleman has done a marvelous job on documenting everything. It's been available free as a PDF for a long time. You can actually get print copies, recommend print copies. They make a nice impressive thump when you throw it when you get frustrated, but the material is there. I think it costs 20, 25 bucks to get the printed copy delivered to your house. Check out scaling postgres. That's an interesting weekly website. It gives you a lot of great information. These are the videos that I mentioned earlier. I'm redoing those. Follow my ex information and you'll see that. Now, if you want a longer version of this presentation, I have it there. I gave an early version of this at Percona Live last year. That was mainly from the MySQL folks learning postgres. I'm here right now to beg you for your help. Percona, the company that employs me, is working on transparent data encryption for postgres. Other databases have that natively. Postgres doesn't. We have it in alpha, about to go beta. Unfortunately, the guy who was here, who was the expert on it, is unfortunately recovering from his bad dinner last night. I was going to have him answer questions. Download the code. You can run it off Docker. I run it off Ubuntu's version of postgres. It encrypts your data. Someone just casually going through your data can't read what you have in there. It's very handy and we need people to try it. With that, I want to answer your questions. For folks who have suggestions on how to teach postgres, I'd love to hear that too. Thank you. Any hints? Any questions, please? Thank you. Any questions? Yes? How would you introduce my SQL DBA as a sort of postgres replication cluster, etc.? How would I introduce my SQL DBAs into the clustering models found in the postgres world? Replication. If they have a Galera background, it's a little easier. If they're used to an inner-DV cluster, it's a little more complex. You're going to have to speak to the options between running FCD to control everything. That's similar to kind of like my SQL router. Something like Petroni that does the coordination. There are some articles out there that people have them, but there's really no easy correlation. I think five years from now, postgres will kind of be on par with inner-DV cluster, and we won't have those problems. Right now, it's kind of a mess because you have third-party software all over the place. By the way, is booking.com seriously thinking of switching over? Yes, it's a postgres. For here. First of all, thanks a lot for the talk. I'm actually quite young, a SQL DBA. It's just two years since I've worked on that. It's actually pretty interesting to see the differences that you are showing up. One of the things that you mentioned, like it felt like having emphasis on is the explain, how could it be quite different? Actually, the latest version is not so different. There is this theme of the execution plan that is slightly different. But I guess that one of the key points for me, at least, while checking postgres recently, is about the amount of operations and operators that it actually has. That is like more than 50% of my time spent with engineers, helping them to fix their queries and to find creative ways to actually get around to the limitations. Postgres has an embarrassing amount of great features out there. There's like eight different types of indexes out there. In MySQL, you used to hashes and... I guess the index is the only one that works. So there's a lot of really neat stuff out there for MySQL DBA to discover. Getting it mastered and getting it running right, it's going to take a little bit of time learning in elbow grease. But it's a wonderful opportunity if you're in MySQL and you don't like the heat wave pressure that MySQL is pushing on everything, and you want to try something different. Thank you. Thank you. Thank you. Any other questions, please? Thank you. I was wondering if you could... I mean, I've worked a lot on Over the Years on Oracle and Postgres, but not so much on MySQL. And the thing which always intrigues me, and I was working with MySQL DBA, is the indexing. Now, you have... originally you had MyISAM, then you had INODB, but these are not traditional sort of heap and B-tree on top style indexes. It works differently, but I don't know. Can you just... is that a conversation? You did allude to it in your talk. Yeah. I mean, different isn't better or worse. MySQL, especially with the INODB storage engine, would love to have every table have a primary key on it, and they store everything in the primary key. When you update a row, race that row, put in the new stuff, write the old stuff off in the undo log, and away you go. You're not having an old version of that sitting out there in the heap. That's kind of the 10-cent version of how that works. Thank you. I'm very curious. I'm going to look into this more. It sounds expensive. Well, that's the great thing about open-source software. You can get, like the demonstration there, you can take an old laptop, put on a copy of Ubuntu for free, put down your favorite version of Postgres for free, have them download the DVD rental tar for free. Opportunity cost is doing things out the door. If you had to teach someone to be a DB2 or Oracle DBA and had to give them their own environment to get started up, you've got licensing costs, you've got a whole bunch of other stuff that's out the door. So... Thank you. Thank you. So thanks for coming.
Introduction to the Public Code and Digital Public Goods devroom
So, hello. Welcome to the Public Code and Digital Public Goods Dev Room. My name is Elena Finlay-Diracht. I'm here at the Foundation for Public Code. This is my colleague. Hello, everyone. Nice to meet you. I'm Amreen Taneja, the Standards Manager at Digital Public Goods Alliance. And there I manage, lead and promote the Digital Public Goods Standard. So, very excited for this Dev Room today. And I'm Jan Einley. I'm also at the Foundation for Public Code and I'll talk later. Cool. And I'm going to... So, in case there's any confusion about what we're doing here and who we are, this is a Dev Room dedicated to everyone developing public code. That is open source code that implements public policy used for the public good and by public organizations like governments, public administrations and state corporations. Digital Public Goods, DPGs, are open source software, open standards, open data, open AI systems and open content collections that help meet the sustainable development goals. So, we have a couple of housekeeping notes. Most importantly, the FOSM Code of Conduct applies here. So, please be respectful in the space. Oh, sorry, this way. On this side of the... Okay. Secondly, even more. Okay. All right. We have a window open for ventilation. That's to make the space a bit more comfortable. If people would like more than one window open, I'm happy to hop on that. We're going to leave the window open all day in any case. And that moves us to the third housekeeping point, which is that if you have any questions, if anything comes up today, talk to Jan, Amrine or me. And that's it. So, on to Amrine. Thank you so much. I'll just take a moment and get this up. Okay. So, well, I think moving on. So, I've already introduced myself. So, first of all, I'd like to warmly welcome you all to this dev room today. First things first, I'd like to share with you a bit about the Digital Public Goods Alliance for those of you who are new to this organization and concept. So, we are a multi-stakeholder initiative which was launched in 2019. And our mission is to accelerate the attainment of sustainable development goals by facilitating the discovery, the development and the use of digital public goods, which are essentially all open source solutions. So, I'll share about this as we move forward, but I'd like to kick off this conversation by introducing you to the Digital Public Goods Standard. Right. So, just to give you a little bit of context of where this concept and this definition came from. So, the DPG definition was actually laid out by the UN Secretary General. And there are five kinds of open source digital solutions that are recognized or can be certified as DPGs. So, these are open source software, open data, open content, open standards and open AI models. So, we have a set of nine indicators, right, that make up the standard. And I'll share a bit about each of them with you today. So, the first one is SDG relevance, right. So, this is a very broad topic. So, essentially any application that wants to do good for the society in some form or the other will come under one or the other SDG, right. So, what we expect from you here is, first of all, to establish a clear contribution to either one or more SDG and also explain how your application will be seeking to do that and achieve that. And then also we have an SDG tracker tool, which I'll be sharing in the presentation as we move forward. The second indicator is open licensing, right. So, the DPG standard has a set of specific licenses that we accept. And all licenses, supposing that are, you know, they're approved by OSI are there for software. We have Creative Commons licenses for open content. And then we have various other licenses for AI systems as well as data. So, because there's positive time, I'll not get into too much detail right now, but I'd love for you to have this conversation with me later on. I'll move on to the third indicator for now, that is clear ownership. So, that essentially what we mean by this is, the DPG status needs to be renewed every year, right. So, you have to send out an application everywhere every year and, you know, your application needs to be up to date with the standard that we have created. So, we need to know who the owner of this application is and it could be either a person or an organization. Both are acceptable. And what you have to provide to us is a proof of ownership, which is anyway a legal requirement for the application. Now, fourth indicator talks about platform independence, right. So, this is a tricky one, right. And the goal here is for vendor lock-in to be avoided. And we prefer for everything to be open source, but let's say you have a proprietary component within your application. So, when you apply for a DPG, what you have to do is provide an alternative open source component to this and explain how it should be implemented and the condition being that it should be relatively easily implementable, right. That solution should be easily implementable for anybody who has enough technical knowledge about this. So, we in fact have external, you know, facilitators and experts for this particular indicator. We have them with us today as well. So, Ivan, that's for you. So, if you have any questions around this indicator, please feel free to contact him. Now, coming to indicator number five, that is documentation. So, this is fairly straightforward, right. So, it basically means that you need to have all your documentation in place. So, this can be in the form of a repository or, you know, on your website or in the form of some good book. And it should essentially have enough detail, you know, that someone with enough technical knowledge should be able to deploy the solution by themselves. That is the requirement that we have. Now, moving on to indicator number six. So, that basically talks about mechanism for extracting data, right. So, if your project collects any sort of non-PII data, then it should be possible to access it through non-properity formats. That is the condition that we have. And now, coming to indicator number seven. So, this is adherence to privacy and applicable laws. In fact, I have some news around this indicator which I'll be sharing with you later on. So, essentially what this means is that your application, it should be compliant with, you know, any of the privacy laws that are there in the jurisdiction where the application has been created or where you intend to operate. So, if it's Euro, it'll be GDPR or anywhere else, you have to provide proof of compliance, of course. And that can be through, you know, providing us a terms of use or privacy policy. And of course, these things are held on a case-to-case basis. So, you know, you'll be speaking to our reviewers around this. And once you satisfy the conditions, then we move forward. Now, coming to indicator number eight. So, this is adherence to standards and best practices. So, essentially, any standards and best practices that apply to the industry where your solution belongs, you have to adhere to them and you have to provide some proof of adherence as well to us. And lastly, coming to indicator number nine. So, this is do-no-harm by design. So, do-no-harm by design essentially means that we, you know, we say design because we don't look at the implications that will be there, you know, down the line somewhere which are completely out of your control, right? So, we look at how the digital solution is being used or rather how it's being built and not how it's being used. So, that is what we kind of focus on. Now, moving on to the next slide. So, this is about how do you become a DPG. So, this is a three-step procedure, right? So, first stage is nomination. So, nomination essentially means that you can either nominate yourself or a third party can nominate you. And the second stage after this is technical review. Of course, this is a very, very rigorous process. We have level one and level two reviewers who go through, you know, your application. And if your application satisfies all the conditions, then, you know, your application is essentially certified as a digital public good and it is recognized on the registry. So, like I mentioned, step one. So, we have a five-minute eligibility test that anybody can take and you can figure out whether your solution is at the outset capable of becoming a digital public good or not. Step two is the nomination. So, this is what the application form looks like and this needs to be filled up as per the criteria that we just spoke about. And this is step three. So, success. So, if, you know, your application is selected, it is added to the DPG registry and this is the SDG tool tracker that I was talking about. So, this is where we have 150 of the DPGs categorized and arranged as per the various SDGs that they are striving to contribute towards. So, now coming to call for experts. So, I was mentioning about something about indicator seven. So, this is where, you know, the standard is entering phase two of operations. So, what this means is that we are going to be fine-tuning critical indicators of the standard through two expert groups that we are launching, one on privacy and one on AI. You'll see this poster across the dev room and outside as well. So, if you're interested, please feel free to scan the QR code and apply. And these are the requirements. So, if you're a subject matter expert in either privacy or AI with a technical background, legal background, academia or, you know, any other background which you think would be a good fit, please do apply. And it's not much of a time contribution. It's about three to four hours for this knowledge partnership. And of course, if there is previous experience in standard making, then that is also highly encouraged. And with that, it comes to an end from my side. I would like to introduce Yan now. So, who is our DPGA member as well as the co-host here for this dev room. Thank you so much. Thank you, Amreen. And I'll come from the Foundation for Public Code. We're a non-profit. We're based in Amsterdam, but we aim to work globally. Just last year, we started chapter in North America. And we exist to help the public organizations who already decided that they want to work with open source, develop open source, to help them do that in a collaborative way. So, ensuring that also that anyone can reuse what they have been doing. And to do that, we have the standard for public code. Here are some old versions. We have some new paper versions here, if you'd like. Just last month, we released 08.0. And it has a number of different criteria in it, certification criteria. I'm not going to go in as deep as Amreen did. But this is what we use to sort of like certify that a code base is able or easy to collaborate on. And our philosophy is that it shouldn't contain any surprises. It should be more or less the best practices in the open source business. So, you're probably already doing most of it already. And then there's probably also a lot of shortcuts that you have made to save some time that you're not doing, but that you wish you had the time to do. We have collected them all here, because that varies over. And if you comply with the standard, our thesis is it will be very easy for someone to come up and collaborate with you. It's of course an open standard itself. It's cc0. You can start using it immediately. You don't need our permission to do anything. And you don't need us to come talk to you. Reuse it, adapt it to your needs. If you find that something is, oh, this is shaping with me, please contribute back to us so we can continue to improve it with your feedback. And these are sort of like the type of requirements that we have. And just as Amrin showed with the DPG standard, we also have a self-assessment test that you can do. There's just 10 yes or no questions to give you an idea how close you are to dig into it completely, because in the entire standard it's like 116 requirements or something like that. And there's a review template, of course, and a checklist to easily check what you're doing. And we list everyone who is compliant on this website. Today it's a list of zero, but it is a list still. But we also include right now everyone who has said, oh, we are aiming for this goal. So everyone who has the ambition gets listed there. And then just a tiny little thing. We also have a number of governance game decks. It's a little sort of a game you can play with your community to figure out how do we want to do with our governance. And we're giving them out from the small fee of signing up to our newsletter. And with that, I want to introduce our first speaker from the day.
Sustainable Open Source Development
Hey everyone. Everyone's ready for some, learning about some sustainable open source development. What? What? Okay, cool. Hey, so I'm going to be talking today about human essentials, and that's how you can find it on GitHub. A little bit about some of the best practices we've done, hopefully have a little bit of a discussion at the end. I tried to save a little time in the talk. So hi, I'm Sean Marcia. I'm a software engineer on GitHub's social impact team. I'm also part of Ruby for Good. I love baking. I'm a Zimergist. I'll let you Google that. And I'm from the Washington DC area, so if any of you are planning on coming to Washington DC, or you're from there, reach out. I'd love to take you out for a coffee or tea, or whatever your beverage of choice is. I guess everyone found out I was speaking, because they're all rushing in. And I'm going to be talking about the Human Essentials app. It's a couple of fun things. We are a digital public good. Got our certification last year. We won the 2022 Pizzagatti prize from N10. And so audience participation. Who's familiar with the concept of an Essentials bank? Okay, there's one person. A co-worker of mine. So Essentials banks, well, that's actually not why she's familiar, but Essentials banks are things like diaper banks, peer supply banks, adult and continence banks. They operate on the same concept of like a food bank, where they don't give food directly to the public. They give it to organizations. They give it to the public. So like a diaper bank collects diapers and things like that, and gives it to like homeless shelters, women's shelters, like high school programs, things like that, that are distributing directly to the public. And these are the organizations that Human Essentials software serves. And just a couple of quick facts about diaper banks in the U.S. is just last year, 40% of the families in the United States relied on or had diaper need. And like the last time they did the survey was in 2010, and it was 33% of the families. So things are maybe getting a little bit worse. And of those families suffering from diaper need, 25% of them had to miss work because of that. And the reason why they miss work is you can't put your kid into daycare unless you have diapers for the kid. And if you don't have diapers, you can't put them in daycare. Well, then you can't go to work, and then you don't have money, and you have less money, and it's this vicious, you know, negative feedback loop. And the same thing too, like 28% of these families often had to choose between buying food for their kids or diapers. And just, it's not a good situation for families in this situation. And so like our software, some facts about it. Now we have over 240 banks across the United States, and some in Canada now registered using it. And like I said, they're on the partner system, and so we have about 5,000 community partners. Our project started in 2015, and it's now helping over 3 million kids a year and over 500,000 period supply recipients. We've had over 300 contributors on GitHub, which is pretty cool. We're endorsed by the National Diper Bank Network and the Alliance for Period Supplies. These are the two big national networks in the United States. We're digital public good, like I said, which we're super stoked about. And we're 100% volunteer driven. Like we've never had a paid person, and we don't charge anything to all the diaper banks and peer supply banks using the software. So yeah, like are we a unicorn? Like how did we do this? Like how did we get here? And so kind of first of like background before we kind of get into that, some background information, basic kind of things, is like we have like a team etiquette, which really is three things. It's to be patient with people, be helpful, and be kind. Like we really believe that it's much better to be kind than to be right. And so, you know, like kindness really matters, and like that is our ethos with our teams and our contributors, which I think, you know, is why we have so many people. And also like just kind of like some basic like stats to like, you know, we have a readme and GitHub, we, you know, we have a contributing guy, like all the little things to make it easy for someone to come to the project and just get started. And importantly, we have a code of conduct. Just like, you know, they said there's a FOSDEM code of conduct. A lot of people don't realize that if a project doesn't have a code of conduct, some companies are actually prohibit their employees from contributing to those projects. So if you have one of these projects, make sure you have a code of conduct. And it's the right thing to do. And, you know, I think we highlight this in our contributing guide. Like, you know, we start off like, hey, code of conduct is important, but then like you see this like, like, hey, if you're unsure of anything, just submit a pull request or an issue and just ask. No one's going to yell at you. We're not one of those evil open source projects. Because, you know, open source can be intimidating for people for the first time. And so, you know, we just say, hey, give it your best effort, and we're going to make this a welcoming place for you. And GitHub also makes this really easy for projects. Like, if you've ever been on the Insights tab and the community standards, like it lets you know if you have all these things. And if you've never been in the Insights tab, like I'd say, go check it out today, especially if you're data driven, because there's a lot of really good information just about like the cadence of your project and just what's happening. And it'll really, yeah, give you some insights that you may not be aware of. And like another kind of background thing is, it's important to know who is contributing to your project and why they're contributing. And, you know, like we've had over 300 contributors to our project, and like, we've tried to talk to them and understand why they come and contribute. And we've found that it's really for four reasons. Like, the first reason is, is like they want real world experience working in software. You know, like they can build to do apps or they can build things, but like there's no replacement for, you know, like your resume or if you want to show, like highlight work, then a real project. You know, because maybe they're coming from like C++ or Rust or some other language and they want real world, you know, Ruby on Rails experience. And so like this gives it to them. Some people are here because, you know, just like everyone at Fosdm, they believe in open source software and they want to contribute to open source software. Some people are here because they believe in the mission and like, like perhaps they, they benefited from a diaper bank at some point or they have a family that has or a friend and they want to contribute for that reason. And the final reason is people just want to be part of a community. Like they want a group of friends because, you know, we are all friends on the kind of the maintainer team. We get together on regular cadence, we chat and zoom. And so, you know, they want to be friends with like-minded people. And I think these are the best people around. And on the other side of that is like, what's important for a maintainer of a project like this? And like for us, a maintainer is someone who doesn't write code. A maintainer is someone who makes it easy for other people to contribute and to write code and to, you know, write assets and to manage a project. Like our general belief is a maintainer should be writing code 10 to 20% of the time and the other 80 to 90% of the time they should be facilitating everyone else being successful. Because, you know, rather than just being one person, they can be a force multiplier for a lot of people. And then the last kind of bit of background information is, you know, it's important to understand the type of project it is too. Like this is a, like human essentials is a SAS. It's not a library. Like when you're maintaining a library, there's a lot of things to think about like backward compatibility and like does it run in all the different versions of the language, all the different frameworks. But we're a SAS. Like the number one thing we have to be concerned about is user data. Because, you know, we're dealing with such like vulnerable populations. So we have to keep that data safe. And so that guides all of our decisions. Okay. Now that we've got the background of the way, like what works? And again, this is what works for us. And so your mileage may vary. So the biggest thing that I think that works for us is the human impact. Like there are a lot of amazing open source projects out there, like talking to a bunch of them just in the hall before I came in here earlier. Like there's a lot. But these type of projects, like, you know, these digital public goods, like we have something that all these other open source projects don't have. Is we have like the human angle, the human impact that no one else can, like those other projects can't in my opinion, compete with. So we take advantage of that. Like when we're writing our issues. You know, we could write an issue that says, hey, when we click this button, it sends an email. Which is, you know, is accurate. But, you know, like we'll always tie our issues. Or we try to always tie our issues back to the human impact. So contributors know who they're helping and how they're helping. So, you know, when we click this button, it sends an email to remind a family to come, you know, pick up their diaper supplies. You know, so, you know, so they're able to get to work that week or be able to work that week. Maybe that's a little contrived. But we're always trying to write, like highlight the human benefit. And the other side of that is, you know, we try and facilitate as much stakeholder interaction as we can. So we have regular meetings with different diaper banks, period supply banks. So contributors, you know, can meet, can talk to, and really understand, like, this code they're writing and contributing to, who it's helping and how it's helping them. Because, like I said, like there's no substitute for actually hearing from the people you're helping, how you're making their lives better. Consistency, like, again, like being consistent has really helped us maintain our contributors. Like we have, you know, we have a public calendar. We list all of our meetings on there. We have regular check-ins with our stakeholders. We have regular office hours. We do deploy same time every week. Like, even if we don't have a deploy, we send emails out, like letting people know why. Maybe it was Christmas or maybe we're in the middle of something large. But, like, always regular communication, both the stakeholders with contributors. Another thing we benefit from is being part of Ruby for Good. And Ruby for Good is a nonprofit that builds software for other nonprofits. Like, maybe Code for America or Code for France, or like these organizations that run several software projects. And they have a couple big events each year where they get a lot of, think of them as code retreats for nerds, where they get a bunch of nerds together at a site. It's kind of like all-inclusive. So, like, they do coding during the day, and then at the night they do a lot of community building, like playing board games, seeing karaoke, sitting around a campfire, making s'mores. But so, there's a lot of community building around the teams, as well as doing good. And it's also a great time for the teams to do a lot of, like, in-person, like, road map, and just, like, the deep work needed for these projects. And as well as, like, the Ruby for Good kind of events or conferences, we submit the Human Essentials Project to a lot of Ruby and Rails conferences, because, like, it's a Ruby and Rails project. So, like, RailsConf, RubyConf, RailsWorld, we'll submit it as a workshop event where we will create 20 or 30, like, really small issues for people to contribute to, and run the workshop for people to contribute, make their first open-source contribution. And so, they will come out, we will facilitate them, like, writing code, making an open-source contribution to Human Essentials. And generally, we start these with also with bringing, like, someone from a diaper bank out or peer supply bank, and giving them, like, a five or 10-minute, like, just a little talk on Human Essentials, on, like, diaper banks and what they do. And again, like, to tie these people, like, to the work they're doing, and, again, putting a human face, like, hey, like, I'm going to write this code today, and it's going to help this person, which, again, like, I think is a really special thing about, um, what we're doing. Uh, Slack, we're heavy users of Slack. Like, we, uh, we're part of the Ruby for Good Slack, which is really nice, because there's all these projects in there. And so, if we're ever stuck on something, we can put out, like, a call for help, hey, we need, we need advice on this, and people from other teams will come and, you know, help us. Um, but we have a public channel, we have, like, a bot channel that talks, the bot channel is really spammy, but it's talking about the pull requests and issues and everything coming up. We have a lead channel. Uh, the other thing we did is we set up a Slack, a separate Slack instance for the banks and their partners, which, which actually turned out to be a really good idea, because, like, now that there's so many of them, uh, like, initially it was just kind of a, like, in their minds a place to come and get tech support, but, but now it's kind of turned into, like, a community for them, and it's actually, we're, we're now, like, kind of flies on the wall in there, like, listening to them, talk to each other, and, and more interestingly, like, we hear how they're using the software, because it's not always how we built it, or we intended it to be used, and so, but, like, this lets us help them, uh, you know, in a way we weren't intending, because, like, oh, this is how they're actually using it, so, you know, so we can make, you know, this, this, and this better for them. Uh, you know, we're big fans of continuous integration, uh, every, every bit of code that gets submitted gets run through, like, a battery of tests, and linting, and, you know, um, breakman, like, security vulnerability checks, all, all that kind of thing, because, again, like, we don't want to protect the data, we want to make sure nothing bad is coming in. Uh, in, in fact, we use GitHub pretty much extensively for all of the, uh, parts of the project, uh, like, dependabot for keeping all our dependencies up to date, and security, uh, continuous integration, like I said, uh, project management, like all the, like, we're not using Jira or anything like that, uh, workflows, talk about workflows in a second, love workflows, uh, you know, Wiki for information, pull request templates, issue templates, like, all the things in one place, it's just been really nice for us, and workflows, like, workflows have turned out to be my, my favorite part of, like, the GitHub experience, because, like, for us, it really allows us to, uh, like, offload a lot of the emotional labor, uh, of this, and, like, there's a lot of, I see nodding heads, like, there's a lot of emotional labor to, to, uh, maintaining, like, a, a project, like, you know, someone claims an issue, and then are they still working on it, and then you have to, like, pester them, hey, are you still doing this, and, and, that, that's hard, and, and, like, it's hard to, like, you know, to, you know, bug people or pester people, but, like, GitHub allows us to, like, automate that, like, we have a, like, a bot that, after 30 days, if, if an issue is, uh, stale, like, someone's claimed it, and they're not doing anything, it'll just say, like, hey, this is, nothing's happening, we're gonna, we're gonna take you off this in seven days, then seven days later, they get unassigned, it gets remarked, help wanted, and then somebody else can work on it, and then, like, all that, that pain of having to, like, you know, try and, like, pester this person is gone, and also, like, the good emotion of labor, too, that maybe sometimes we forget to do, we, um, you know, like, like, someone, here's a poll request that happened, and, you know, K-man had got merged, and then, uh, when the, when the deploy went out, we automatically, the bot automatically notifies this, this, this gentleman that, uh, hey, your, uh, your code was included in the deploy, and now it's out there, it's out there helping these, uh, these Depper Banks, which is awesome, um, and one just tiny little example, uh, like, our workflows, our workflows are also, you know, if something gets merged in, okay, changes on the, uh, one of the project boards, uh, and this, this is probably gonna be controversial for some, some of the people here, but we really feel that branches are better than forks, uh, and, like, the, uh, and the reason for that is, uh, like, oftentimes, like, a poll request will come in, and it's, like, 99% of the way there, like, it's just missing, like, some tag or some, uh, like, it's just missing something minor, and we could go back and forth with, with the contributor, but, like, we found it's just easier for a maintainer to, whatever, add, you know, add in that tag or add in that one little missing thing, and quickly merge it in, because then you get, like, a faster, like, like, feedback loop and, like, faster results for the contributors, and, like, they're happy that this, this thing got merged really quick, and then, you know, then they're gonna pick something else up and, and, and, uh, and get it going. But again, uh, I know that this is very controversial, some people probably, uh, think that's terrible. Um, uh, we're also very opinionated about the code we let in, like, all, all code gets linted that comes in, we require tests, uh, we're, we're very also opinionated on what libraries come in, we want all the libraries coming in to be boring, like, we don't, like, if there's a JavaScript framework that's four days old, and someone wants to add it to the project, we're probably not gonna let them, because, you know, because then it's up to the maintainers for the next however many years to maintain this thing, and it's also harder for contributors, like, so if we are using, like, standard libraries, standard packages, uh, like, very, like, standard conventions, it makes it easy, you know, it makes it more welcoming, more people can, can contribute, can contribute, like, regardless, uh, you know, of their level, they don't have to know all these, uh, specialized, uh, uh, libraries. Uh, and then the other thing is, like, we use realistic seed data in the, uh, uh, like, when you spin the app up locally, and so when you're, um, oh, everyone liked the joke. Uh, uh, uh, yeah, so our seed data is realistic, so then if you are running the app locally, like, you can, you get the same, uh, look and feel and experience as, uh, as a bank, uh, who uses it. Uh, we're also very intentional about how we build, uh, our teams, like, our maintainer teams, like, uh, in the early years, it was just software engineers on it and developers, uh, but we, we quickly realized that that's a really terrible idea, uh, like, because one, because engineers don't always speak, uh, non-profit, uh, and so, but so, you know, we've, we've made it, made it a point to add in, you know, product, product managers, designers, and really, like, anyone, it doesn't even matter if you're technical or not, like, if you want to be a, be a part of the team and you just want to do good, like, we, we will find a way to bring you under the team and help, and help, because, you know, uh, good people are good people. Yeah. Uh, but again, it's, it's not all, like, uh, uh, sunshine and roses, like, there are, there are challenges to, to, uh, you know, uh, uh, maintaining a project like this, and, um, uh, like, the big one is kind of like institutional knowledge or, like, project memory, like, like I said, our project has been going since 2015, which is nine, nine years now, wow, uh, and so, like, it's hard to remember, like, after nine years, like, hey, why was this decision made, and why, uh, why did you do that instead of this, and, um, uh, and, you know, like, we, we have Wiki, we have Slack history, you know, we can go look at old Git commits and, and our GitHub issues and, and, and things like that, but, like, a lot of times there's conversation in the room, and we just don't know why things have happened the way they did, and so if anyone has a solution to this or anyone, and I know, like, there's architectural decision records and things like that, but, uh, if anyone has solved this problem, find me later, because, you know, I would love to know. Another challenge we have is, is tests and testing, like, um, like, you know, like, unit tests, integration tests, system tests, like, um, you know, like, obviously, we'd love to have all system tests, you know, because they really approximate a user using the application, but then that slows your test suite down, and it slows down your feedback loops, and, um, but again, like, where, where are those borders, and, like, we're always struggling with, you know, the testing, um, and, like, and contributor feedback, like I said, like, we've had over 300 contributors now, like, we've talked to a bunch of them, but sometimes people come, and, like, they're really gung-ho to contribute, and, you know, they submit this amazing pull request, and, you know, we merge it in, and then they disappear, and I don't know why, it bothers me, like, did we do something, because, you know, we want people to come and be welcome, because, you know, like, um, happy supported contributors, you know, these become our long-term contributors, um, and the other, the other, obviously, thing, uh, challenging thing was, like, you know, we're struggling with a project like this is cost, like, uh, again, like, we don't charge anything, we don't pay anyone anything, but like, there still are, kind of, like, fixed costs with a project, and, so, like, how do we do it? Like, uh, like, luckily we're, we're really fortunate that, like, Microsoft gives a free Azure credits to nonprofits, so if you don't know about that, definitely take advantage of it. advantage of it. I think it's like 3,000-ish a year, which is more than enough because again, we're not operating on GitHub or Google scale. It's got our production server, our staging servers, and that's pretty much it. So it's great for us. We get free email via SendGrid for nonprofits, free air purger via bug snag. And so I think the only thing we really pay for is our domain name, which is awesome. But I know that's a challenge, because again too, we're also a little worried every year, like, well, what if the Microsoft for nonprofits goes away or any of these things? How do we pay for it? So now, if you aren't aware, I'd like to introduce you all to you for good first issue. It's a new site we've launched on GitHub, and it's defined projects like human essentials or like most of the digital public goods are listed on there. If you contribute to a project or you want to contribute to a project like this, you should definitely come and check it out. And there's a little recommended project there where you can submit your project to be added or like I said, just find projects and it will by default list any issue in your project that you've tagged help wanted or good first issue. So this open terms archive here will like open up and list a bunch of the help wanted and good first issue tags or issues. And I saved five minutes for questions. Are there any questions? Thank you. Not a question, just a comment. So in regards to the idea around contributor feedback where you're like, well, we have people, they do all this work, they submit their first poll request and then they disappear into the ether. Something I've seen work in the Kubernetes project and the other project is having really clearly defined roles that you effectively have a job description for and you can say like, we're looking for somebody to fill this out. You know, like we're looking for somebody to help us do the issue triage where every time somebody has a new issue, you go in, you ask the person, you give me more details, because they very much say, people say, something's broken and they walk away, that's their issue. And that's like not useful as a maintainer, right? So you say like, we need to do this once a week, we need to do triage new issues. These are all the things involved with triaging new issues. And like very often that's a way of getting, it's a little bit secure that a good first issue, like that was the problem that we found in most of the C&C projects actually, was like, you can get people to do that good first issue, but like giving them to do a good second issue, much, much harder. But having roles helps and like really defining the boundaries of how, of a very specific path, how somebody can deal with you, like typically the commits follow after that. Like if somebody is engaged in say, triaging issues once a week, it's very likely that in the course of that triage, they say like, oh, I can do that in five minutes, I'll just do that. And then like, in the back of those, they can take on something harder, we're going, oh, that takes a little more context, we can't try to go as a good person, and you're like, but I can take care of it. And so that's like a way of kind of getting people to stick, and it's also that way of like mentoring your leadership in the project. And it's also a good way for them to communicate with work that they're doing the project externally, which has effects on like how their, the work that they do for your open source project can help their careers, which is realistically a thing that people want. Awesome. Yeah, no, thank you. Also, just some comments on that. I think that it works very well with Kubernetes where it's like a multi-faceted infrastructure with different companies at the end, so having a very clear defined role, often both, because people do want to get involved in that, maybe because a lot of interest from their work and work, but a lot of the time in very open communities, it may not work. One thing that we tried at the RR project a few years ago is, in the way you get in the project, you do something, and you find people, you interact with them in one of the community forums, and you make friends in the new state, most of the projects we participate in because people there, there's a real life-minded people. So we try to reverse that angle with having a funnel of getting into the project, don't talk to people in this channel, make friends, and then we find, after talking, then we find something that I'll be able to use. I'm going to, because I don't think the people on the thing can hear the conversation going on. Oh, sorry, okay. So I'll open up for more questions, and I'll try to repeat the questions too. Yeah. My question is, the platform that you built, is it primarily targeted for these audiences, or can it also be used for other countries? Oh, no. So the question was, can the platform be used for other countries? So it is, obviously it can specialize, like, and the language inside it is specialized for diaper banks and peer supply banks, but if you go to the repo, like, we have actually an entire file in there on, like, if you were running, I don't know, like, a tool bank, or like, any kind of like donation bank platform, you could use, like, we talk about how, like, hey, you could take this and use it right for a tool bank, or a medical equipment bank, or, like, all these different banks that are, are exist. When quick follow-up question, is it only in English? Yes, the project is only in English. Yes. But if you want to translate, we'd, yeah. You mentioned it's all 100% volunteer driven, and I was wondering if you could talk about that. Why did you make that decision? Have you considered paying contributors in the nonprofit, I mean? Yeah. Well, so the, so the nonprofit is also, like, fully volunteer driven, like, it's not a, like, so, like, there is no, so I guess we have to, like, yes, we thought about, like, the, so the question is, like, have we thought about paying people, and, like, ideally that would be great, and also to, like, kind of just giving the software to, like, if it crashed, and something really bad happened, like, that's something that goes through my mind, because, again, like, these are all volunteers, and if things go down, like, how, you know, we can't tell, we can't tell someone to call in sick from work to, you know, get this fixed. I think a lot of the people in our team would, because, you know, they really care about it, but, but yes, but if you have a way for us to get a lot of money to do this and pay for this. It's just a funding issue, yeah. Yeah, like, we've just, like, right, we've just kind of followed this path, and, like, ideally I think there'd be people that would work on this full time if, you know, there was funding, but, yeah. Right. Are we done? It's time. It's cool. Thank you, Sean. Thank you. Thank you.
Some updates on Public Code in Germany
Okay. Hi everyone. My name is Marco and I'm an active member of the Floss community for about 10 years now of contributions to SignalDino also in the wireless community tooling, wireless mesh community tooling area. Currently I'm working for the German government for a German government agency that builds IT infrastructure for Germany, mainly backend infrastructure. And we like we are in the middle between the 16 federal states of Germany and the federal so we have a lot of stakeholders in place and to contribute to. Also during this job I get a lot of feedback and see a lot of things that are happening in Germany. So first maybe let's talk about a little motivation about this talk. So in Germany I have the feeling that the term open source is very omnipresent in the public administration also in politics. No one actually speaks about free software. So the open source term is the leading term here. Also there's very little information about how Floss is used in public administration and also there's little knowledge in public administration about how to handle Floss software appropriately and there's hardly any contact with the Floss community. There are exceptions of course but like generally speaking there are ways to improve that. There are also hardly any statistics on the use of free and open source software in the German government and so my impression after three years in this domain is now that everyone is talking about at least open source software. Maybe they also mean free software. Maybe they don't do the decision between the two terms. It's also okay but in practice hardly anyone is really doing or following these software development practices. Yeah right now there is a lot of happening in Germany and I thought it might be a good chance to give an update what happened in the last year or so and what's happening right now to give you a better feeling how these things that happen in Germany might also be like relevant for other countries or if you are from Germany I hope it might also be interesting for you. Yeah so the first question are we Floss yet in Germany especially and I wanted to start with the state of Floss laws and regulations there. So in June 2020 yeah two and a half years ago there was a principle defined in the service standard and this service standard like these designs or gives design principles for government digital web services like interactions between people and the government and this service standard is also mandatory for the largest digitalization program over the last five years. Those of you who are from Germany may know it as online two and a half years or short OZG. It's a law in Germany that mandates the government agencies to provide their services online and in this principle it says that a source code from the realization of digital services must be made available as open source so that's very progressive. We think that's a nice thing but the problem with it it's not mandatory. There made a survey I think in 2012 no what was it it's written down here 2022. They made a survey and out of 15 from 221 people that have been been asked to give it a high priority in their in their own projects but that's only a very few people from and in practice I also see many people don't don't know it actually so it's not not very broadly adopted. Then in 2021 there was another approach there was an obligation from the economic stimulus package also intended to improve government digital services and there it says the source code will be made available as open source whenever possible. Nobody really knows what this whenever possible means and unfortunately the federal ministry of the interior didn't really keep track of which projects actually released software as as open source under any open license. I think actually so I personally know only one. There were a lot of projects in there that got funding so this really didn't have had much impact. Then in November 2021 there we had a new parliament in Germany we had elections and the coalition that formed after these elections then decided or wrote down in their coalition agreement that they wanted to or yeah that the development contracts of public agencies should be generally commissioned as open source and the corresponding software that is being developed should always be made public. So this is like the same intention again and yeah there's a but because like after this agreement the German government spent 4.8 million a billion euros investment in proprietary cloud infrastructure in addition to 1.3 billion dollars to Microsoft licenses of course you can't just throw Microsoft software away it doesn't work. So this is like more a long time terms change but these 4.8 million cloud infrastructure that have been like this was a new contract that didn't exist before in this form you could have like invested in open source software here. Also in general less than 1% investment during like for the second by the current government investments from the current government from the current legislation went into the open source software ecosystem and also the plan financing for so-called senders that's the the German Ospo has been cut by nearly half due to resource I don't know they didn't find the money that was needed so they had only 24 million euros that's still a lot of money for an Ospo that's great but yeah compared to the initial plan it's less than we expected and hoped. And also there is still no floss procurement regulations that are badly needed to give government agencies a tool to really require these procurements to be based on open source licenses. But we have some policies in the German federal states we have 16 federal states in Germany and two of them Turing and Schleswig-Holstein they defined a priority for free software in their federal laws. The first one like it's mainly the same text in those regulations and the first aspect is that a priority for free software should be applied if technically possible and economical this is again we don't really know what this means it's like hard to define when is it economic to use open source software compared to proprietary software often this like also comes with long-term impacts so this is this is a really hard question and it's easy to like find some arguments why it's not even cheaper. Also for in-house developments the rule is that an open source license yeah has to be applied and the software needs to be published as long as it is not used for security relevant tasks and this is still again I don't know what your security relevant task is and even like for maybe people thought about like police software etc but still I think we in this room are know that even like especially in these domains it's super relevant to have open source software to have the possibility to see inside the code and see what they're doing there. First to improve the security and second like to improve control of what agencies do in their day-to-day business. Okay but still like these two federal states have these like thought about these questions, applied some regulations that's great I really like the effort there and in practice we see there are still also there are some very motivated people in the governments there and they're doing everything they can to improve this even further so I think that's a very good first step here that's nice. So let's have a short look at the European perspective here. I just created a graphic based on the information from the join-up platform also from a questionnaire to the German Bundestag like our federal parliament and we see that currently right now there are some countries in Germany actually in Europe actually I would say it's a huge amount of the or like relevant parts of Europe in terms of like their power also in the European parliament have some some regulations in place concerning open source software. The Swiss parliament just passed a law this year or last year sorry in March 23 to publish all government software and an open license there will be another talk in the legal and policy issue deaf room at 226th Oc so yeah head over to this talk to get more insights about Switzerland. Okay so but let's have a look at floss in practice and in generally we must summarize that these political objectives that have been defined are mainly ignored to be honest in public administration so there is no like the step from legislation to like the execution of this laws is hard and it's yeah it's not done not not not done yet. As in the industry we also have this phenomenon of open washing like presenting some kind of software as being open when in fact it's actually not. A small example for this is like the government site builder that's used to build the websites of all the German ministries and on their website they say it's based on open source we dig a bit deeper we can see that the technology the technological basis is 100% open source that sounds great so I wanted to dig a bit deeper and I tried to find some download link I have found some and unfortunately I didn't come to any git repo or something instead I was greeted with an HTTP basic auth because like the software is based on open source software it's correct but it's not released as open source software. So why is that that the public administration doesn't really respect these political intentions that have been formed on or formulated on like every federal level in Germany like from the from the top federal level in the Bundestag to the federal states and as far as I see it the public administration has like no no too little experience either with public procurement of free software it's hard they don't know how to like buy free software and buy support for free software and also they have no experience with releasing software or releasing their own code as free software there's little incentive to invest in existing free software coming from some laws and regulations and also there's little incentive to release own code and to collaborate with others to improve their own code because there's so little knowledge about the like benefits of all of that. Yeah, like in summary I think the application of these floss software development models still heavily depends on individuals we have individual cities and we will see there in we'll later see an example of this where it works really well but yeah my feeling is that it's still dependent on individual persons that yeah mandate for this and do the the heavy work in practice it's not really widespread and spread adopted in all government agencies. Yeah we will later see how to fix that but first maybe let's talk about some wins there are also great things that are happening in Germany. Germany just built an open source collaboration platform called OpenCode it consists of like GitLab instance there's a discourse forum there's a wiki.js wiki and it also is also based on the public code YAML standard that is used to annotate the purpose of public software and this encourages the public agencies to make things open today the administrations do not really dare to do this though like with this platform they can see that other government agencies also release their code and if others do it it might be okay and I might also be able to to release my code as free and open source software and I think that's a great thing. Also it's somehow a safe haven for for public administrations to get some first experience where they don't have to go to all the real free software or external free software repositories like like gitlab.com or or like even github where they have no experience with and like this is inside the government even if it's public it's something government owned and this might help to to convince some people to release their software there. I think it works okay so there's already more public organizations on there than on github at least for for the german organizations to be fair there are very little organizations on on github in Germany from the public administration yeah but still only a few real projects exist on this platform to be honest many of them are stubs many of them are just code dumps or other kinds of documentation consultation processes etc so it I think it's a good start there needs to be more more code there integrate all these products we know like for example next cloud colabora also Univention Corporate Server, Op-Mix Change and all this this kind of software that exists but doesn't really integrate very well and the idea of Open Desk is to pay the software vendors to to build integrations between these solutions. Also there is an interesting project from Germany that's called Kullibri it's completely public funded by an IT service provider of the federal government and it's basically a component library that uses web components and is meant to be what are meant to have a strong focus on accessibility and they also do a real open source model they also accept accepting contributions in my opinion they have interesting great tech but it doesn't really feel like public administration project it's the normal open source project and that's great great as a huge recommendation if you if you're looking into a component library maybe this this might be an interesting thing for you there's also the current design system that's meant to be used for all government services to build a unique or re-identifiable design system it's not an actual software project like the design system which that defines which design elements are used on the websites but still they have the philosophy and also the community building parts baked in in their DNA and they're trying to to get involvement and they're trying to to build a community that's also a great thing that that that happens right now already mentioned there are some cities they're doing great progress the city of Munich built its own open source transparency website and this is really interesting because they document which floss software they use they also document which software they contribute to both in terms of code and also in terms of funding and they also document which software they write and publish so they really understood the benefits of free and open source software and build a website to make it transparent i think that's a good example maybe for other cities too and we have the national documentation portal that's like read the docs like project where documentation for developers on the core government infrastructure can be found it's by itself a license under the european union public license and it's also contributing accepting contributions so let me close this talk with the question what does it take for free software to be to become the the default in public authorities and i've brought three challenges here the first one we need to release custom build software of course under free software licenses i think the regulations here are very very important so there needs to be some regulation in place that enforces governments to do this because otherwise they have no knowledge in this there's no so like there's little to no motivation why to do this in the first place so regulation helps very much here to to release all all the code yeah and of course also knowledge and skills in this area need to be built up in the administration maybe our ospo made contribute a lot to this in the next years but that's a major challenge in all government agencies second challenge uh for software procurement is of course a real yeah a thing here we need you you you you very important because we need to measure our progress does it really work does it do we make progress in in this in this area right now there are hardly any statistics so i think it's it might be a good idea to have the mandatory use of a researchable software catalog before buying any software like the italian government does this already there's the italian free software catalog and all italian government agencies need to have a look at this catalog it doesn't really say anything about what they do about this like results they just have to document that they have searched here for for the software or for the kind of software that they want to buy and if there's something in there they that's a good opportunity to look into it and see like for example is yeah is this software useful for us before they're buying any yeah non-free software here yeah if you want to learn more we collected some infos on best practices about free software in the german government also some examples this kind of follows the idea of of like the awesome list but like just find some information about what is what has worked in the government to improve free open source software maybe this might also be something for for other countries too for your communities too i really encourage you to build some knowledge about what already exists and communicate about the efforts that have been taken already okay thanks for listening and if you have any questions you can contact me here or yeah maybe later outside if we have time maybe one or two questions we don't have time okay
GNU Health. Incorporating Digital Public Goods in the European healthcare system
All right. So first of all thanks to the organizers for having us here. And I got to say I'm not Louis Fai-Khann but I'm spontaneously replacing him today. So nevertheless I will introduce both him and myself. Louis is both a computer scientist and a physician and he founded New Health a bit more than 15 years ago. And he's specialized in genomics and medical genetics. And apart from being active in social medicine he's also involved in animal rights. Then shortly about me I studied computer science in Hanover and there I'm employed since a bit more than two years. And mainly I'm working on an Ansible deployment of New Health to ease and improve the installation process but I'm also reporting and fixing bugs or rewriting the documentation. And last year we also hosted the annual conference of New Health in Hanover. And it was also together with the Orson conference. Sebastian will do the following talk about Orson. And the institute I'm working at is called Computational Health Informatics and even though we are only working inside computer science it's always related to medicine. So behind New Health there's a non-profit, non-governmental organization called Ngu Solitario which is working globally and it's focused on social medicine and New Health. But there's also the Global Exposome Project that aims to investigate how the environment has an impact on our health and how social problems like pollution of water or factory farming or wars also impact this environment and consequently our health. And then again there are also projects about animal rights where it is involved. Ngu Solitario is spread quite around the globe but when it comes to productive use in hospitals then we hear the most of projects in Latin America or Africa for example in Argentina or Cameroon. And then there are many research institutions, hospitals and so on for example in the top in the middle there's a university in Argentina that is cooperating quite much with New Health. Okay, so what is New Health actually? In general it is a hospital information system but the core is a hospital management information system that is often called HMIS node. And there you have one client-server architecture and it takes the quite realistic approach compared to other ways of organizing the infrastructure of hospitals. And it is first of all based on Frighton which is an enterprise resource planning tool so you can overtake the user management and inventory stock and finances functionality from this. But then we are adding modules for hospital functionality and putting this on top. And like Frighton it is written in Python and using the PostgreSQL database back end. Even though Frighton could theoretically use others we are always taking this to first have a uniform way and then also because there are many good functionalities for productive use. And then for example you have really many modules that are part of New Health for example about surgery or the laboratory or genetics and bioinformatics and as it's used in many precarious reasons, New Health is embedded as also one subproject which basically means that there are for example images for respiratory pys because sometimes yeah it's really a matter of resources what to use. And as the name says, New Health is a GNU package. So the HMIS component as I said is a client server architecture and on the upper left you can see a screenshot of the client and with this you can generate graphs, you can display images, there's a calendar you can use yeah and also the electronic health record is part of this. Then there's a reporting engine coming with Frighton and so all the information you feel in the database fields can be exported as an ODT. So there's a LibreOffice in the background and you can yeah generate this and print it or start outside the program. Yeah. Besides there's an integration with Orsan which is a DICOM server to support medical imaging and actually there's no DICOM viewer integrated in New Health and as usually there is the DICOM format used. It was chosen not to reimplement any DICOM viewer or do all the work Orsan has already done but to integrate Orsan and so to synchronize patients and studies between the two of them and to just use Orsan's DICOM viewers that are integrated there already. Apart from this there are also other components of the New Health ecosystem for example the Federation and my new health. So my new health is an app that is that can be used to enter vital data and in the end also to share that vital data. And last year at the 40th birthday of GNU the second version was released where all the dependencies outside Python were eliminated because many people don't have Linux on their phones and we had requirements before they were now eliminated and it was migrated to Kivi so now the idea is to have something cross-platform. And then the GNU Health Federation aims to connect multiple of those HMIS nodes and ideally also make the people, give the people the opportunity to share the vital data they recorded with the hospitals. And so to give one example the colleagues in Argentina also used this already in the beginning of the COVID pandemic to trace how much, yeah, just to trace the situation of COVID. And now, yeah, to come to the topic of the room also, GNU Health was declared a digital public good which is in the context of the sustainable development goals of the UN where many goals should be achieved until 2030 and one of them is healthcare and so, yeah, GNU Health is part of this and also just advertised at the European Commission join up where, yeah, free software or open source software is, yeah, advertised inside the European Union and then compared to other software projects, of course there are always bureaucratic barriers and also certification processes but there are many steps to check if your project is a medical device software but actually at least the hospital information system itself and the electronic medical records are not a software or a medical device. Of course then there's the other stuff for example in Germany would for sure need to have an interface with the insurances and most of the productive use is somewhere else. Then, yeah, from our point of view, proprietary software and public healthcare is a contradiction, yeah, and we think that there should be, yeah, a move to free software and there's really many barriers and a lack of funding especially for free software projects and, yeah, there could be really many benefits of putting more resources in communities like this so that everybody can profit from what people are working on. This is why we also signed the campaign public money public code. I already saw it in the slides of the talk before. I guess the most people know it but basically the name already says if there's public money spent for a project then the code should also be available to the public. Said quite easy but also not the reality. Yeah, I'm finishing with a side of that Luis often says which is who has this a social project with a bit of technology behind, yeah, to highlight that it's not only about the software but also about the philosophy behind. Yeah, that's it. Thanks for your attention.
The Orthanc ecosystem for medical imaging
Okay, so let's go. Okay, so thank you very much. So I'm very happy to be here with you today and I will give an overview about the Alton project. So as far as I'm concerned, I am a professor at the University Catholic of Rueval and Heuve, not very far away from here. And I've been working on this Alton project for a long time. So before diving into the project, I will talk a bit about how medical imaging works in hospitals. So everything is built around what is called the DICON standard. So whenever you are going to do PET scan, CT scan, MRI scan or whatever, all of your images are captured by the modality and sent to a large database in the hospital that is called the PACS, the picture archiving and communication system. Then the radiologist comes into play and they write a report about the case they see and then the images are sent back to the patients, to the general practitioners through web portals typically. And also there can be specialized devices of our applications that connect to the PACS in order to apply for instance artificial intelligence, algorithms and so on. So what is very nice in medical imaging is that everything is driven according to one open standard that is called the DICON standard. So DICON rules everything in medical imaging, the imaging devices and the viewers, but also the PACS server. And you must also know that DICON is not only inside the big imaging but it is also part of the small imaging for instance with mammographies, with ecographies, with CT scans, with radiographs and so on. And it is also used in veterinary applications. And DICON is really two things. It's first, the file formats. So you can think of it as a kind of a GPEG image with metadata that describes the acquisition of the patients, the physicians and so on. Plus a network protocol. It may look a bit outdated as a network protocol because it was designed almost 40 years ago, but still it is commonly used nowadays and it is used to send images between the different modalities, to search the contents of remote modalities, to retrieve data and so on. So everything is built around this DICON standard. And now what is the Orton project? Actually the Orton project has its roots in my post doc at the University Hospital of Liège where we wanted to get back in control with our medical imaging workflows. Because at that time we wanted to treat machine learning applications and we came to the conclusion that the different modalities that we had inside the hospital were not very easy to use and so we needed to create one new platform that was called Ortonc. So the first release of Ortonc was in 2012 and it was released as the first free and open source DICON server that just starts out of the box and that has a REST API. Now there are other DICON servers in the free software community that also provides such features but ArtOrtonc was one of the first. So the basic idea behind Ortonc is that we wanted to create a micro service that is available as free software in order to foster and share the knowledge about DICON. And nowadays the main use cases of Ortonc are threefold. First, Ortonc is a DICON server that can be used as a PAX. It is commonly used to route images between hospitals and between sites for instance for doing research, for doing the applications of artificial intelligence algorithms and so on. And we have also tele-radiology portals. So here you see the main user interface of Ortonc. It is translated in different languages, not only in English, French, I think there is also a Spanish translation and a Russian one. And so the main features of Ortonc are the fact that it is very lightweight. You can start very easily an instance of Ortonc on your laptop. And it is an industrial grade, large-scale project with over 400,000 lines of codes. We do have tele-radiology solutions that are built inside Ortonc. So you can, when you install Ortonc, have many different viewers like those ones. 2D viewers but also 3D volumetrics rendering and also stuff that are very specialized and that are focused on nuclear medicine and radiotherapy. Ortonc is really deployed in actual hospitals. Here for instance you have screenshots of an hospital in Malaysia that runs Ortonc for all of its medical imaging workflows. Installing Ortonc is very easy. You just, if you are using a proprietary ecosystem, you just have to download the installer and it will work. But you can also use Docker, Kubernetes and so on. Now I will just show you some advanced features because I don't have much time but if you want you can get in touch with me if you want some demonstration. We have support for digital pathologies, images which are very large images and we want to publish them in a web browser. We can automate workflows notably using Python. So we have the possibility of writing Python scripts like this one that allow to route images between the different modalities inside the hospital. I would also like to mention the fact that physicians are nowadays very excited by artificial intelligence, obviously. And for them it is something that is very magic and it is very good but from my perspective as a computer scientist I just see a black box. So I'm not very happy with this because now if you ever look at the number of artificial intelligence algorithms for radiology, there are hundreds of them. But there is no knowledge sharing about those ones. So people don't know how to use those artificial intelligence as end users. We do have a lot of open source libraries, open source models, open access models and so on but the general audience cannot deploy and run those algorithms easily on their own computers, on their own infrastructure. So we are working on this and for instance here you have a research project that was published one year ago where we used WebAssembly in order to run deep learning models directly inside the web browser. So this means that it is a fully open ecosystem that is secure by design because everything is run by the laptop of the physician and you don't have to install anything. Everything comes with Ortonc. It is a C++ software so you don't have any dependency and it can run even on Microsoft Windows 7. So we are very proud of it. Here you have an instance of Azor, a research project we have with Musea. So we treat 3D images of bones, of human bones and it is used for cultural heritage. So it is one of our research projects nowadays. We use, as I mentioned earlier, the different viewers in order to do nuclear medicine and radiotherapy. So just now to the conclusion, thanks you for the reminding. We have here a recognized project with a worldwide community. As Gérald mentioned, we are very friends with the global community. We are a digital public good so with the 3 and 9 goals. We have been recognized by the Free Software Foundation and so our mission statement is to freely share the knowledge about medical imaging including about artificial intelligence because I think it is really something that is very important nowadays. So thanks for the attention. We share the Q&A session I think. We have time for a few questions here for these two speakers because they got together and merged their two things. We have a quick question. So we were talking about human health and that you would like to have a European or questioning how to get a human. What do you think is the limiting factor? Is it big hospitals being afraid of not having a commercial entity delivering it during the support of what is limiting? Why don't we have this? I would say just three words, third, fear, uncertainty and doubt. People are not comfortable with open source and with Free Software simply because they are not used to use them. I am pretty sure that every hospital in all their infrastructure use Free Software, Postgres, MySQL and so on. Every hospital in the world already uses Free Software but they are not aware about the equivalent software for the medical uses. Yes, there was a question. Can you show me the side view so that you can actually pan and zoom in to the map? Yes, yes, of course. If you want, I can show you. So the files, you don't have to import large files. You can just leave the files where they are and just browse through them, right? Yes, the question was, is there a need to bring all the entire volume which is quite a heavy weight volume of data inside your own computer in order to see them? So the answer is no. Actually, not for this viewer. It is a tele-radiology viewer which means that we only download one to this slide at a time. We don't know the full volume. But here, in this case, we do have to download the full volume. Yes. Yes. Yes. Do you have examples? I'm sorry, I have problems with my ears. So you're getting started with Docker or some of the windows. Yes. Do you also have example images that we can use out of the box to play with it? Oh, yes. We do have a web server, for instance. Can you repeat the question? Sorry, the question was, is there sample images for DICOM? If you go to the demo website, if I open the web interface, then I have the possibility here to download a zip file. And so you can just download the image. But it is open access images, so you could find them elsewhere. Yes. I'm just going to call that the software thing used in the UDT. Oh, sorry. When your software is being used for diagnostic purposes, does that mean that you need to have effectively FDA certification or a class 2 medical? Is that anything that you're thinking of? So, at once, so, S, Gerard, sorry. Good. So the question is, do we need an FDA or CE approval in order to use this inside the hospital? So you already mentioned this. The answer is you don't necessarily have to have this regulation. People, physicians can decide to take the risk of using a free software. They can always decide this. It is up to them to decide. If they don't want to take the risk, then they want to have someone in the commercial entity to take the risk at their place. And that's what, that's the meaning of the regulation, actually. It is to give the risk to someone else. Actually, that's the essence of the certification. So, if I can jump onto that. Yes. Even the big commercial players do not want to be seen as a medical device because that entails that all your source code needs to be audited. So it's not only, it's not that it's a problem for open source. Even the commercial parties, they do not want to be seen as a medical device and they're actively fighting with them. And it is also important that Gnuels and Ortonc are not medical devices. We are not, we are at the class zero, just like Microsoft Windows is a zero class medical software. The, oh, I'm sorry. We have to repeat what he said. We can. Oh. So we, so the commercial, please do. So the commercial players also don't have the interest to get classified as medical device. That was what he said. Yes. And I was telling that you, the, the DICOM viewer, I've, I've shown you this kind of stuff can be considered as medical device at the moment that the physician uses the, uses the software to do some diagnosis or some treatment on the patients. If it is just to inform a people, there is no regulation needed. For instance, a web publication portal doesn't need any kind of certification. But if the web, the tele-radiology portal is used to make a diagnosis, then you come into the class two medical software. So, sorry. One last question. Yes. Yes, yes. We are actually, I have one of my researchers. So the question is, can we extend this kind of models in order to do explainable AI inside? So the answer is yes. And actually I have, for instance, one of my researchers who works on mammography in order to create explainable models that can be run in any computer. So that's, that's one of our goals in the next few months. Thanks. Thank you both, Sebastian and Eira. We have a session that up until 10 minutes ago wasn't in the schedule.
OSPO4Good UN Event report & 2024 Call for Participation
Hello everyone. How was everyone doing today? Good. Alright. Little energy is always helpful. This is an interactive talk. Please feel free to stop me at any time and ask questions. There's no dumb questions and I'm happy to answer any questions at all. With that said, I'm Jacob Green. I'm from Ospo Plus Plus. Does everyone know what an Ospo is? Okay. Nope. Just for those that don't, it's an open source program office or an open source program organization. It ends up being the epicenter of open source for an organization. Whether that organization is a company, an industry, a government from a city to a nation state government, as we've heard earlier. Germany now has an Ospo. Yay. We have university Ospos now and hopefully we're going to get some NGO Ospos coming up soon. I'm here to ask you to come to a party. Kind of like FOSTA, but we're throwing a party at the United Nations. On June 9th and 10th, we have a huge room available for 800 people and we want to bring together the entirety of the open source community, including governments, universities, NGOs, and industry. But not just from Europe, not just from North America, we want to bring people together from all over the globe. We held this event last year in June, June 20th, I think. We had about 80 people. This time we're going for 800 people. This is the report that we came out of that event. Ospos for good, building cooperative digital infrastructure. We're going to be hoping for this report to come out next year, but that all depends on your participation. A little bit of history about how this came to be. We held this event, I'm from Baltimore, the city of Baltimore. You have seen the wire? We have some challenges. This is why we're in the public code dev room here. We want to cooperate with you as a city of Baltimore, as a representative, not of the official city of Baltimore, but just as a resident looking to cooperate with you. We need infrastructure to do it. We want to run your code and I want anything that's developed in Baltimore to help in your cities. The swap going back and forth between you all. We've been working with the United Nations, the European Commission. There's a European Commission Ospo Network now of governments. We have about 13 of them. We have about 12 university Ospos now in the US. We're starting to branch out these Ospos everywhere. Now we're going to need to start cooperating. On November 21st of this year, at the European Commission Open Source Awards, the UN Secretary General's envoy on technology, Mr. Amadeep Gill, said some remarkable words. He equated the Ospo as soft infrastructure, just as important as hard infrastructure, our roads, our bridges, etc. The Ospo itself is an organizational construct, like the Office of the CTO. He equated it to helping us to solve challenges, those challenges being helping to localize and advance the SDGs, helping to get cross-border solutions in Europe and interoperability, which is very important for digital sovereignty, to helping achieve AI governance and working with AI. You in this room could think of a bunch of other policy issues that might be used by an organizational construct that's about cooperation. He took it a step further and he called for the cooperation between these Ospos. He said, and I'm going to hopefully screw this up. We need to start connecting Baltimore, Buenos Aires, Bangalore to Barcelona. Four B cities, so we're looking for more B cities. We have Berlin coming up hopefully next. Boston hopefully is coming up soon after that. If you're in Berlin, happened to be on March 5th for FOS backstage. We have a Berlin Breakfast Club that's been going on. The Berlin City Ospo will be there talking and answering your questions. The event we're going to do now was announced yesterday at the Open Forum Europe Policy Summit. Both the UN Secretary of General Technology Envoy, again called and opened up, made the announcement for this event. The UN OICT, which is like the office of the CIO, also enforced and asked people to come. I would like your help in making this possible. What does that mean by help? Well, we would love your help in making sure that it's a global event. So one, I'm looking for attendance. We would love for everyone here to come, whether you're a developer, a community manager, you're a policy wonk, you're in government, or you're looking to use open source to achieve policy missions. Various NGOs that you would think about as not having to do with open source are actually heavily involved in using open source to achieve their policy outcomes. We're looking for attendees. Second, we're looking for those attendees to come as groups, as a delegation, you might say, because one of the things we were to happen after you leave the event is for you to work on building cooperative infrastructure in your own city, your own government, your own NGO, your own university. Third, we're looking for, as I said, the attendees to be coming from all over the world. So that means industry from all over the world, universities from all over the world, cities from all over the world, world governments from all over the world, and NGOs from all over the world. I can't pull that off myself. My organizing team can't pull that off themselves. We need your help in that. So attend, help drive her participation, and then come have a good time. The United Nations is an amazing place if you have not visited. It was awe-inspiring last year when we went, and I would like for the open-source community for us to come together there if possible. We're making this a yearly event. I'm opening the floor now to questions. Questions? Okay. So, also, how do we get our governments involved? Because we need to have governments engaged with us, the members of the United Nations. Any thoughts on how beyond the city of Baltimore you're going to, like any hope of giving the White House to show up? Yes. We're going to show up? I don't know yet. Can I repeat the question? How do we get our, how the question was, how do we get our governments to show up? One, we have examples now. When I first started evangelizing hospitals around, we didn't have examples. Now we have 12 European governments formed in a network together. That network was just approved in European budget legislation called FOSEPs 2. So now you have a point of contact to point your government to, to say, you're not alone. Second, we can connect them to the United Nations itself who is looking to bring this together. The ITU, the International Telecommunications Group, just announced a program for governmental OSPOs that they're, they're running. So it's no longer a theoretical thing. We now have programs starting up for OSPOs for good and just OSPOs and government in general. So please reach out to myself, jakeupatospoplusplus.org, Yan Oralena. Thank you.
From disconnected elements to a harmonious ecosystem : The Epiverse-TRACE project
First up, we're going to hear from Hugo Brisson from Disconnected Elements to a Harmonyne Ecosystem, the APRiverse Trace Project. Hi, my name is Hugo. I'm the lead software architect at data.org and today I would like to talk to you about the work that we are doing to build a harmonious ecosystem for epidemiology as part of the APRstrace project. So today's scientific research relies more and more on data science and computational tools and this is true across fields such as epidemiology, climate science or econometrics. But the pipelines that are used by these data scientists are getting also increasingly complicated to maintain and to update. And to change just a single step in this pipeline just to use a different piece of software, you may have to spend hours of data wrangling just to get the right format for the inputs and the outputs. And the problem is that this maintenance that is really complicated is something that we cannot afford when we are in the middle of a crisis. The price is just too high to pay. When the next pandemic hits and we want to get really fast results to understand what's happening, it's not the time to do basic boring data wrangling, we want to do actual science instead. And so set differently, we have some good isolated free software tool but we don't need just good isolated pieces of software, we need a robust ecosystem as a whole. And this is precisely what the APRstrace project is about. It's an international multistakeholder project to harmonize the ecosystem of epidemiology tooling in R. And we do this by making the existing pieces interoperable, by supporting existing tools to adopt global standards such as the ones that are defined by the Digital Public Good Alliance or organizations like R OpenSci and by developing a community, a sustainable community around these ideals. I can also define our goals by what we don't want to achieve. We don't want to erase the existing established communities. We recognize that diversity of solutions is good, it's nice to have a rich ecosystem but we need interoperability in this ecosystem. And so the way that we do this is by involving the community. We work with existing established communities and by this I mean both established communities of users such as public health institutes or NGOs but also existing communities of developers. And in the end what we want is to come up with a solution that increases usability, sustainability and maintainability for everyone involved. We've had already quite a lot of success with this approach. We've managed to package and release a lot of un-maintained non-portable code bases and including many more tools than the ones that are presented here but just for the sake of this session I should mention that two of them are already registered DPGs and one is in the process of being submitted. Having a sustainable network of collaborators is something that is really exciting and really ambitious but as you can guess it comes also with challenges and in particular research and academia are really competitive spaces which makes it difficult to build some collaboration between some communities. Additionally, because we have a multi-stakeholder community, communication is really difficult in a network that has so many collaborators and so many nodes which creates delays and miscommunication and the question is how to build something that is sustainable even though funding isn't probably in this space. To conclude I hope that I managed to convince you that responding to this crisis be it the climate crisis or the next pandemic will require interoperable tools and that this can only be done for collaboration and multi-stakeholder project. But even though it's necessary to have this kind of complex community it also brings a lot of extra challenges especially around communication, collaboration and sustainability and in the end what may appear initially as a technical challenge is even more of a communication and social challenge. With this I will finish just with a picture of the entire core team of the project and invite you to come to talk to me if you're interested about any of this. Thank you. Thank you. Thanks. And thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you.
Legislation Editing Open Software (LEOS) - an innovative open-source solution for drafting legislation​
Yes, we just go right on. Okay. Thank you. Good afternoon everyone. I'm Fernando Nubla. I'm a project officer from the European Commission and I'm going to tell you a story, the story of Leos. So, upon a time we were in LaGislan and you can imagine LaGislan is not that funny. It's talking about legislation, you know how complex legislation is. And in this case we had law legislation that was complicating the life of everyone living in this kingdom. Legislative laws was enforcing very complex rules to everyone that wanted to create a new piece of legislation. Rules about the structure of the documents, formatting, etc. He was not taking care of the versions. We had versions everywhere, in local computers, in surf folders, in everywhere I can imagine, even in a paper. So it was getting very complicated for the people working with the legislation. And there were a lot of people and no one was collaborating because law legislate was not helping them on that. So we went to the round table, we tried to find a solution. We needed to help these people. So we started by creating a work plan, an idea of what we need to help them. We were defining more or less the solution that we wanted. But there was something very important. It was the financing. We couldn't do this without budget, without any financing. So we used two programs of the European Union, the ISA program when we started in 2012. And then, right now, the ITA Euro program since 2020. With the budget, the financing, and with the work plan, and with the idea of the project, then we created Mr. Leos that you have here, and you have also in our search. Mr. Leos, we wanted him to be an open tool that is a web application. We wanted to be able to draft tests. We wanted to have a rich editor where you can put images, formulas, track changes. We wanted to have collaborative tools to create comments, suggestions, to work with other people, everything centralized. We take care of all the versions. So now they are not spread everywhere. They are all in a central place, and you can go and check them. And something very important, we wanted to use open standards. We didn't want to keep drafting legislation in an unestructured standard that you can use further. We wanted to create something that is structured. And we are using the commandos for you standard that is open for everyone, for any administration or government that wants to use it. And the last aspect is that we wanted to do it open source. That was something new, the European Commission doing open source projects. We did it. And we bring with us the community. We are not just alone. We wanted to do this with member states, with other countries, with academia, et cetera. Whoever wanted to help us is welcome. And then I know you were waiting for this. There was a battle, of course, between the Leos project and the legislation. But no one was harmed. So our idea was to convince them that there was a better way to do the things. So finally we end up with a legislative in our team. We are running around Europe. We are helping other member states, other institutions to use this tool that is open source, is available for everyone. And together we are going towards the future. And the future is leading us to artificial intelligence, machine learning, imagine drafting legislation just with one click and getting the proper test. And this is the tool. So everything that I said is true. You can check it out. We had all the features that I was saying before. We had the structure. We had the versions everywhere. We had the research with the touch changes. You have on the right the collaboration to create comments, suggestions, the use of tests. We are using open projects like the CK Editor, Hypothesis, EUI. So we are creating our software open and using open source projects. You can check us in this QR code. You can scan also our t-shirts. We are in code.tropa.eu Leos. All the software is available there. You can check it out. You can contribute while waiting for you here. Thank you. And this is the end.
From Excel to Grist: the example of a massive transition towards open-source software and contribution by a French government agency
Hi everybody, I'm here to tell you a story about how French public agency managed to migrate their users from Excel to Greece. By the way, my name is Fréran, I work as a developer for the French public agency ANCT in the Doné Territoire team, Doné Territoire meaning Data and Territory. But I'm not here to talk about myself, I'm here to talk to you about Anne, who is a French public agent working for the ANCT as she is in charge of subsidy programs and she has to have the feedback about how it goes. So she sends Excel spreadsheets to Priscilla, Paul and Patrick who are regional workers because she hasn't had the data herself. And then they don't have the data themselves neither, they send to regional local agents and well, it didn't work, went as expected. Paul didn't understand correctly what Anne asked for, so he sends her new data. Then Patrick is in CECLIVE, so it didn't have the time to send the data to the local users in time. And Priscilla is a bit worried about the data, she received back from the local agents and she asks for clarifications, sends partial data to Anne, but still she has a bit delayed. So finally what happened? Anne Gathers and the data, she realizes that the data are incomplete and outdated and well, she has colleagues who adjust her to send the data and well, she went back to the regional and local agents for clarifications and they're just fed up with these spreadsheets. So here comes GRIST. GRIST is a collaborative spreadsheet, but not only it's also a local platform, if you know it's kind of an alternative and open source. It's made by GRIST Labs with a small US-based company and yeah, we deployed an instance that we self-host ourselves for data sovereignty. Why we chose GRIST? Because we didn't develop it ourselves at first. First because it's hacker-friendly. You have formulas in Python, you have many possibilities with that, just with that. Second you have access control list, which means that when a local agent fears the data that is for his local administration, he doesn't see the data of the other local agents. Obviously, it's free-level open source. That's why we're here, but it was a criterion we had at first. And it's easy to consume data, so we can transform and publish part of it as an open data. Here is how GRIST looks like. You have a spreadsheet and well, I don't have much time to make your demo, so let's get in touch. And just to finish, we don't fork this project. We contribute with development and many stuff like that. Thank you. Thank you. Thank you. Thank you.
Gno.land: Improve Your Understanding of Our World
So, hello everyone, I'm Alfred Thorn, an engineer and someone with the Earth for open source and decentralization. And today I will speak about Noland which is a mix of open source and decentralization to try to find the truth collectively and to try to basically create new democracies online. So, one of the main reasons we want to create Noland is because we consider that right now we have too many control from governments, from the media, from censorship, we have AI making a real line between real and fake, between humans and bots. And what we lack right now is not a way to send and to transfer resources and data but basically to coordinate on social concerns. So one of the initial starting point is basically we use Web3. Web3 today is not helping that much for social coordination but it makes something great. It allows exchanging data, resources, financial resources, messages between peer to peer people without intermediaries. And it brings new level of transparency especially in governments. So we don't need to trust a source of parties, can be companies, governments, internet providers or whatever, we just need to trust code. So that's what brings Web3. But Web3 has so many problems especially in terms of UI, UX, scalability, simplicity and that's exactly where we try to improve the problem by working on UX, on scalability and on basically moving from the financial world to social coordination world. So the main vision of GNOS, so it's our third project, we started with real blockchains. The first vision was to create the equivalent of microservices for blockchains because we believe we won't have a single big one but so many specialized small ones. So we created Tenamint and Cosmos and with an internet of blockchains you can have specialized blockchains or geographical blockchains but that can work together. So you have the mix of independence and collaboration. And what we want to do now is to say okay, how can we not compete on important bigger applications so basically not Excel style stuff because blockchains today are mostly about managing balances, integers, I can transfer a balance from address to another in practice. When you look at social networks, at democracy, at source, at Wikipedia, it's definitely more complex. So we created a new language which is interpreted meaning that open source is not an option, open source is unforced. There is no way to run a binary, to run something obfuscated. You need to upload your source code. When you upload your source code, all the good things from open source like reading but also reusing are unforced. It's not possible to make some things that won't be reusable. You come from probably the classical world where you publish a binary on an API that your website. In practice, all the data is on your server. Even if your code is on GitHub, the data is on your server. What we want to create is this concept where open source is not just about code but about the state, about verifiability and also about the data. So we have an interpreted data mystic simple language so everyone can code and not only the most expert, full of data ability. The last point is about the consensus mechanism. In blockchains, you are aware of GPU based that cost money and that are batch-oriented on environment. You are probably no proof of stake where the goal is to, if you are rich, you have power and we are creating proof of contribution. Proof of contribution is a way to basically gain a score based on your contribution, expertise, alignment. Everything makes this way more human-centric. It's not more about trying to be rich but about human coordinating. So basically our goal is to take all the good things from open source but try to bring new things. This is the example of GitHub. GitHub is definitely not the open source itself but before GitHub you had open source but it made it better and what we want to do is to make the same with another level where you mix data where you mix transparency by default. I need to stop but if you want to discuss, it will be a pleasure.
TruBudget - a DPG to support the project workflow in international multi-stakeholder environments
So, hey everyone, my name is Zuri, I work for KFW in a field of development cooperation and in this case development cooperation is not about GitHub but working together with third world countries and donor organizations like us to make the world a better place. So together since a couple of years we developed the digital public goods so we also registered that and that's what I would like to show you and also to invite you for collaboration. There are three slides, I would like to explain the problem to you, our proposition and where we are at the moment. So that's one example, if you look at development cooperation and how it works today, so you see one country here, maybe someone recognizes it, it's Ethiopia and we started to count how many government organizations and NGOs are actually working in this country and supporting it. And at some point we stopped counting so we didn't have any time to put more logos on it. The problem here is that we don't trust the data exchange and information exchange for the partner countries, they don't trust basically us. So in the end we end up, many NGOs and we as well, to do the project ourselves instead of just giving the money to the country so that the ministry there can actually do, build schools, hospitals and so on. And that's the real problem and already in the Paris Declaration 2005 we decided that we should do all this stuff on the systems of our partner countries that never happened because of lack of trust. And now what's our proposition to solve this? We called it true budget. So we figured out that the solution is not to install some kind of SharePoint or Google workspace or something, so another intermediary because whoever owns the data is actually more powerful than the rest, the data is the new oil basically. So we don't want to own the data and also the partners, if they own the data we actually don't trust them potentially. So the idea was to build this decentralized solution, which is truly decentralized and managed data there. What I mean with data is actually to manage the workflows, who is doing what, how are the projects implemented? So for example we build schools, how are the tenders done, how was the money dispersed, who did what basically. So what is actually stored as decentralized data is the status of all the different workflows, of all the different participants in this network of people and organizations. So technically what it is, it's actually a front end based on material UI React, so the JavaScript stuff, also the API side, it's a Node.js server and on the data side it's a very small blockchain solution, it feels like a key value data store basically, but it has a very nice property that actually synchronizes across different nodes. So there's a kind of a conscious mechanism that the data is then synchronized across the different parties. And that is very important also from a political view that basically you process this data on i-level. There's not one party that owns the data and the other one doesn't, so if any one of these participants here, that's only a very simplified view, would get out of the picture, it would still work out, right? So where we are at the moment with this, yeah, we did this since a couple of years, as I said, we are registered as a digital public good. We have a couple of pilots, for example with a Brazil Amazon fund, it's a very important one where sort of Germany paid money if less Amazon forest was destroyed, so it is very good. We had the vaccine alliance also very important, Burkina Faso is I think one of the oldest projects we did or one of the first we started with, we have the Ministry of Water, we try to manage this data and get to a situation where we can actually give them the money and they actually use the money to do good stuff instead of us developing the projects which we believe is not the most sustainable approach. Yeah, of course, as I said, you are invited to contribute to this project, it's running since a couple of years already, we have a couple of contributors, it's also the first open source project we did as KFW, it's a state owned, German state owned bank, so remember the talk we had before that state owned organizations, it's tricky to do open source, we achieved it here and I'm quite happy to be part of this project and would also be happy to if you join us. Thanks a lot. Thank you. Excellent. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you.
Intro to Janssen: Managing inter-domain trust and security
Hi everyone, my name is Mike Schwartz. I'm the founder and CEO of Glue and I also lead the Janssen Project at the Linux Foundation. And how do I do next? Next? Well, what happened now? There we go. Okay, so in 2009 I was working in digital identity for enterprise customers and I was sort of tired of recommending commercial solutions. So we wrote the first Glue server to provide an enterprise identity access management system, authorization, authentication. We built this product comprised of open source components and we sold it to many large enterprise customers over the years. In 2020 we decided to contribute our IP to a community-governed project at the Linux Foundation. And we couldn't contribute Glue, that's our trademark, so we created a new brand called Janssen Project. And Janssen, so the trademark is free for companies to use. In 2023 we launched a new developer website called Agama Lab, which helps developers write in the Agama language, which is a programming language we authored at the Janssen Project to help developers use low code to write web identity journeys. So Janssen is composed of multiple components. Think about it like Linux. It's a number of open source components deployed and working together. You can, it's built on a mountain of open source like many other open source projects. We are in production only supporting containerized deployments. So each service runs in its own container. From a functional standpoint, digital identity is really existentially important for governments. We tend to think of governments as having one identity system, but actually there are many ministries and agencies and departments, and each of those departments have their own identity infrastructure with their own usability, accessibility, and security requirements. So we're in a multi-IDP world. Each government needs to manage trust, and that's just hard to do. And Janssen helps governments build this identity infrastructure so they can service their ministries and departments. This time is wrong. Actually, so I have a deep dive on Agama. If you're an IAM geek, please join us. It's tomorrow at 9.45, not five. So, but I'm going to do a deep dive on Agama if you want to learn how we can use low code to build these authentication flows. So why would you use this project? It's not for every organization. We really are looking for organizations that have what I would say described as economies of scale in IT. That means that they have, if you're a small company, you should probably use a SaaS, but if you're a government and then you have, then you need to really manage this yourself. Janssen is good at handling customer requirements. We forget that a lot of these organizations, they already have infrastructure that they need to support, and they have special requirements. So hosting yourself enables you to meet all those special requirements for your identity. And data sovereignty, a lot of the customers that we serve, banks, telcos, and other banks, telcos, healthcare, education, government, they need to host this data themselves. They need to persist the data in their region. And then finally, you might need this infrastructure if you're building a product. So many products actually include a digital identity component. For example, NEC is using our software to build a biometric authentication platform for governments. So if you need to build an identity stack, you can sort of build this into your product. And you can know that it is safe to use. I should mention that actually Janssen Project is registered as a DPG. And that means that it's safe to use. And it's important to remember that, yeah, we don't want to get hashy-corped. So with that, I'll end. And thank you very much.
Moodle: Empowering educators to improve our world
Hello everyone, so my name is Noel and I'm here to talk about Moodle. Moodle is a learning management system that you can use for your online learning and teaching and our mission is to empower educators to improve the world. We want to do this in an accessible way that can be used for everyone and that can be customized for every use case. We do this through open source and actually Moodle started more than 20 years ago and the first commit was actually the same year as the first edition of Forstdom and preparing this talk I have been looking at the archives and this is actually the first talk about Moodle. It had been mentioned here and there but this is the the first talk specific about Moodle so it may be the first time some of you okay it may be the first time some of you hear about it so I hope you find it useful we are a certified B corporation and Moodle is a registered digital public good and in case you don't know who is using Moodle at the moment more than 400 million users translated to more than 160 languages mostly contributed by the community the translations and you can find these stats in stats.moodle.org but I have to mention that since Moodle is open source and can be self-hosted all this information is only obtained so in reality there's probably more people using Moodle than this. In this slide maybe there are some logos you recognize we know that Moodle is used by more than 60% of higher education institutions it's also used by many education ministries and it's also used by many governments and NGOs so Moodle is used all around and who is making Moodle. Well there are some important part of the contribution from the open source community and other companies but mostly it is done by Moodle HQ which is the company I work for. We are currently more than 200 team members distributed in more than 20 countries and we speak more than 26 languages and I didn't want to leave them without mentioning the text tag so Moodle is made with Vanilla, PHP and Vanilla JavaScript with an SQL database and the mobile application which is the team I actually work for is made with Ionic using Angular so in case you want to learn more you can look in the ad repositories to look at the code. Also I mentioned that it's very customizable to different use cases so you can build plugins for Moodle and if there is something that is not doing already there is likely a plugin already working for that and if there isn't you can make a plugin yourself. Here you can read the developer documentation to see how to build it both for the LMS and for the Moodle app and finally even though the Moodle LMS is at the core of everything that we do there is also many other things. For example I already mentioned the Moodle app which is interesting for low resource environments because you can use it to use offline so you can download the contents and fill the exercises and everything and it synchronizes when you go back online. We also have Moodle Cloud so you can self-host Moodle but if you want to get started we have a software as a service solution which is Moodle Cloud. We also have MoodleNet and Moodle Academy to share and find learning resources and if you want to integrate Moodle with your organization we have Moodle Workplace and Moodle Certified Partners and service providers so there is a lot more that you can dig in if you want to learn more. So that's it you can learn more at Moodle.com and if you need to contact me my mail is noel at Moodle.com and that's it. Thank you.
Fiscal sponsorship for / and FOSS projects
Good morning, good afternoon, good evening. I guess the morning is over here. Good afternoon. Hi everyone. How are you all doing? Any energy in here? The last session of the day? Yeah, everyone seems pretty tired. Okay, I'm Angela. Great to meet you. This is Madeline and we're from Code for Science and Society. How many of you build products? How many of you are working on products? I assume almost every hand in this room is probably going to go up. Keep it up, keep it up, keep it up. All right, how many of you love operations of those people with their hands up? How many of you love running payroll every month? How many of you love when something happens in your bank doesn't actually pay the people? Everyone's calling you? No hands up. All right, well, so we have the magic solution for every single one of you that put your hands down. How many of you have heard of fiscal sponsorship? Oh, all right, well, you know the end of the talk, but Madeline, what is fiscal sponsorship? This magical thing that might help everyone in this room. Fiscal sponsorship is indeed a magical thing and it leverages economies of scale so that we build shared operational and administrative infrastructure so that projects can thrive. What that means is that you get to work under the umbrella of an organization who will handle things like your payroll, like your taxes, like your insurance, things that take up so much time and energy and take away from the ability that you have to invest in the things that you care about most. And it's not just that when you participate in an organization, you get extended some of the benefits of being in part of a nonprofit. In the US, for example, that's a 501c3, where you'll yield certain tax benefits. But what else do you get out of a fiscal sponsorship, Angela? Is it just operational and administrative infrastructure? Boring, shmoring, taxes, blah, payroll. The most interesting thing to me, I'm a people person, can you tell? I like people. Yes, I like people. Shocking. I think it's the people, right? Like, as we build all of these open source tech products, like everyone's head down, oh my god, I gotta get this code out. It's gonna break. What if you could actually come together with other people as we're doing here, but do it on a regular basis? You talk about the challenges you're facing. You share best practice, lessons learned. You talk about which funders might give you some money. The ability to actually come together and build camaraderie and build products that have shared ethos and mission alignment is, in my opinion, one of the best parts of being under a mission aligned fiscal sponsor. So we are very excited to answer any questions if you have them. We represent a fiscal sponsor. We are based in the US, but there are European fiscal sponsors as well. So for those who might not be familiar with them, they're actually some regional. Regionally, I think there's one in Berlin, if I'm not mistaken. And so it is an option even if you are a Europe based. So happy answering questions. Good luck with all your product building. Good luck finding someone to help you run your payroll. And look forward. She's very, very adept. Thank you so much. I see no questions, but come find us. We have time for questions. Yes, yes. Questions. Interactive. Someone said they wanted interaction. Yeah, please. Sounds. Maybe for our friends in online worlds. Non-profit. We have now founded an US entity and we are about to apply for fiscal sponsorship. This card that you take, so that's like a 15% cut. Do you consider this expensive or cheap? The question was, do you consider a 15% overhead charge or fee expensive or cheap? That's a good question, right? I think that every fiscal host does have some sort of percentage. It can range from, I think the lowest I've seen is 8%. This is my colleague, Rhea. Hi, Rhea. Emily Aucou, hey. We all are from CSNS. I think the lowest I've seen is 8%. And maybe on the higher side is 15. Usually this covers fees, right? So even just as those who run your own payrolls know there are fees with so many various platform services. And so all of those cost the labor of the operations and finance team. So I think it's a good question to ask. And as you're shopping around for your fiscal host, ask what they do and what they don't do. So some of them are very low touch. Others are very involved. And so depending on those aspects, the fee will change. Definitely. Anyone want to add anything? Angela, repeat the name of your org again. Our org is called Code for Science and Society, CSNS. We are based and registered in the U.S. as a 501C3. And we take open tech and open science projects and are willing to share our menu of services with you all if you're interested. So please come find us. Thank you. Thanks so much.
What can digital open source projects do to reduce our environmental footprint
So how many people here are worried about, say, climate change? And yeah, a little bit of a concern. There's definitely issues, crazy weather, blah, blah, blah, blah, blah. What does that have to do with open source and my talk? Well, let me tell you. We live in a finite world. And as much as we want to believe that the cloud is green, it's not. Everything that's digital is tied to an atom. So as I say here, if you don't measure it, it still matters. It still matters because we're looking at the environmental footprint of our digital lives, and it's significant. It's about the same size as the airline industry, and it's growing exponentially. So when we're thinking about our digital infrastructure, every piece, every bit is tied to an atom. So whether that's electricity and whether it's hardware, it all comes from somewhere. It all has an impact. Think about the lifecycle of our products. There's not just the process of these devices that we have around. It's also the question of the creation and disposal of them as well. There's a lot of systems that are interrelated, and they have a huge impact. So when you're developing code, or if you're doing open hardware, think about what is the ecosystem you're working in. It's so interrelated. Whether you're a JavaScript person who's working on the web, you're a PHP person who's building content management systems, whether you're a Python person that's involved in creating other data processing tools, all of these things that have communities rely on networks of other pieces of code and a lot of people to maintain and organize them. So think about that ecosystem of people and code that our projects work on. So so much of this is thinking about sustainability like a measure of quality. So how do we make sure that good code is both accessible, because I have to say that I'm an accessibility person, but also is sustainable, that we're trying to minimize the impact that we have on the planet. And that that is baked into the definition of quality. We're thinking about it early in the process. We're not waiting until the very end to evaluate it. We're trying to build it into our CI CD pipeline so that we're catching errors and we're looking to minimize our website or we're minimizing the impact of our code on a spring by spring basis. And you're having a livable planet is not a feature request. We need to have that. This is a bare minimum that we need for our society. We need to start working together around that. So there's so much to learn and so much that's happening in this space right now. 20 years ago, this was not something that was generally thought of. People were like, well, just don't print out the web and don't print out your web pages and your emails and you'll be fine. It's like, no, that's not good enough. The information is changing very quickly. There's a lot to learn in the space. And I think it's really important to try and think about sort of learning that, but also finding ways to contribute. So where can you give back? Where are the experiences that you've had? How can you get involved in measuring your project's impact and moving ahead on that? How do we look up at leveling up our expectations, encouraging more people to discuss and to learn about this? So it's really important to have these talks. There's a whole section of talks here at FOSTAEM on energy as there was last year and that's wonderful. If you're going to the State of Open Conference, last year they had a whole sustainability track as well. In making sure that there's some conversation about sustainability as part of your project is really important. Getting people engaged and doing something about sustainability is a good way to go off and keep optimistic about it and keep our attitudes that we can make a difference, we can make a change. This is something that is doable. So getting people involved. And this is a huge problem. But everything that we do in the end is going to be insignificant as an individual contribution. But as Gandhi said, it is important that we do it. We need to go off and find ways to contribute and to play our small part to move things ahead. There's lots of best practices and best standards out there. There's one from the Web Sustainability Guidelines that was just launched as a draft in September. There's so evaluation and development on that. That's the Sustee Web Group or the Web Sustainability Guidelines. There's also the Green Software Foundation that's done some really good work. I'm building infrastructure around that. The Green Web Foundation is another one that has infrastructure and information about that. Also the IETF and the IEEE I think also have sustainability projects as well. So there's lots of different ways no matter how you're involved in the tech world to look at best practices and sustainability that you can work with and extend. And that's all I have. Yeah, any thoughts? Okay, any questions? Does anyone here have a sustainability? Go ahead. Any kind of practical tips that make matters good? Look at how much processing there is. Sorry, any practical steps. One practical step is to sort of look for where there is time and transfer. So how do you try and minimize the efforts and try and make sure that you're counting the milliseconds used to process it? What are the process heavy things that your code is using? Yes?
Open Terms Archive
So I'm here to talk to you about the digital common that I'm quite passionate about, Open Terms Archive, which is a digital public good incubated within the French Ministry for Foreign Affairs. It has been existing for three years. The code base is under an European Union public license. And its goal is to enable the democratic control of digital platform. Make sure to shift the power balance from the large digital services to the end users and regulators. How do we do this by measuring the evolution of the rules, the terms of services, which are actually the way in which the services decide that you can do things or not. And oftentimes it takes over some regulations. And we provide tools to archive and enable analyzing and influencing those rules. We are basically creating technical tools that will enable us to unite different actors who are powerful enough to influence the rules of these large actors to be more loyal, more respectful of end users. Right now this is a bit blurry, but it's going to get better now. So who do we try to unite? We just looked at who can impact, who can really change the behavior of those large actors who are not respecting your privacy, who are not respecting your rights online. Regulators, if I'm a privacy regulation authority and I can have a million euro fine on some company, it's going to change its behavior way more than if I'm just yelling at it that it's not respecting me. Legislation, it can take a long time to be enacted. I think about 10 years for GDPR, but once it's in place it does actually change things. Why? Because someone in the big companies actually fears to go to jail. And that's quite impactful as well. Takes a long time, but useful. Press and media. There are tons of articles that say that some companies are bad and evil, but sometimes there's enough coverage that it will actually threaten their business model or their user base. It happens sometimes. Think, for example, on the WhatsApp case where they had to back down because so many people were afraid of it, didn't like the change. And finally, consumer protection associations, such as some of you might have heard of the Shrems, for example, cases in the EU that prevented the bulk sending of transfer of data of EU citizens or any consumer protection association that can go on a lawsuit against the companies. So we have demonstrated this impact model by providing tools that have been taken over by the European Commission for ensuring that some regulations were actually impactful on the behavior of companies. We have been in contact with Congress people in the US and so on and so forth. So we have these examples, and I have to be fast now. This is a DPGA public good session, so I want to highlight that we've been selected for the Nobel Prize Summit thanks to the DPGA. So we're very thankful to the Digital Public Goods Alliance. And if some of you are considering applying to be a DPG, I can talk more about it, but yeah, it's cool. Okay, that was not geeky enough for FOSDEM. I know. So now how does this work and how can you participate? First of all, we just need to define which documents you want to track. That's quite simple. You just use a URL and then a set of selectors like CSS stuff, usually just CSS selectors, a bit of JS if it's necessary, but it's like 5% of the cases. So here you define your target documents that we are going to track. Then we have made the tracking part quite simple. You could think that you could put together a script in half a day to just download stuff. That's true, but there's a difference between this. Time's up. I say it's four minutes. Sorry. Okay. So, yeah, oh, good. So yeah, we do this with software. We store every version and then we make it readable. So basically we will extract just the legal text instead of having all the menu navigation stuff and all that. And we will provide a diff for all these people, regulators, legislatures, all these people who don't really get all this stuff, but now they have a way to be notified when there's a change. And then we can have humans who will write down a summary of what has changed and circulate it around. And then instead of having one person there, one press person here, one regulator yelling, we have all of them together and thus we believe we can actually influence. We also provide data sets and this is a decentralized system. We would love to have more instances if you want to create a new one in your country jurisdiction. Please do. It's going to be amazing. Please visit us at contact at opentermsarchive.org.
OpenFisca
Perfect. Hi, I'm Matisse Schneider. I'm here to talk about a digital comment that I really care about. It's called OpenFisca. And as it says here, it's the most widely adopted free and open source engine. It's under AGPL 3 to write Rosas code. What is Rosas code? It's the idea that if every piece of regulation, legislation had an implementation in code, it would be kind of easier to figure out what this is all about. And this is now 13 years old. We've been, we're used in many, many different countries. Here are some of the countries that we are used in. There's an engine, and then there are models of the legislation. I'm going to show you what's the sort of product that you can build on top of it. Here is a selection. We don't have everyone here because it's on a voluntary basis to register, but we do have quite a few. One of the most common use cases is for NGOs or governments to build benefits assessment tools. Tools that will enable people to figure out what the hell they are entitled to in terms of social benefits. Usually you have to figure out which agencies distribute which benefits. Here, thanks to Rosas code, you can just talk about your situation. You describe your personal case. You fill in your information. And then we just send this to the Rosas code engine, which basically acts as like a big calculator, but instead of calculating arithmetics, it calculates law. And then it's going to tell you the sort of stuff that you're entitled to based on the current status of the legislation. So that's one most common use case. But once you have a model of the legislation that you can interact with through APIs that you can compute, you can do other stuff. Like for example, here, this super scary user interface is actually used by members of parliaments in France to simulate the impact of potential reforms. So here is the payroll of a person. And here to the left, so I'm going to do a really fast because that's a lightning talk. So in the lightning, I'm going to now create a change in the legislation where I increased the income revenue tax by 5%. And you can see immediately the impact on all sorts of potential families. That's pretty cool already. You can design your own reform. But if you're a member of parliament, you can log in and get this calculated on 60 million on the whole population of France on real data in about one minute, thanks to the fact that Open Fiscal does vectoral computation. But that's not limited to the governmentary users. You could do your own thing as well with it. Here is an example of the user interface to provide some readability exploration on legal models. That's the demonstration one. So it's a very simple model. We have the same one for France. It has over 6,000 entries and so on. So for each country, it's going to be different. But you can see different things. And for example, you can also see the different parameters in legislation, such as the value of an entry in the legislation that changes. Here, for example, is in the French model, the list of countries that are member states in the European Union. So if you ever want to assess if one person is a national of an EU member state, you can either code this yourself or you can just go to an authority that is going to tell you what it is and also what it was historically. Because Open Fiscal stores all the history. So I can tell you, for example, here, that's the initial creation of the European Union. And so you can write formulas that will take as input that am I an EU national and depending on when you run the simulation, it's going to give you something different. Here is, for example, the hourly rate for minimum wage and do I have to stop now? Is it 30 seconds? No. Okay. All right. You can calculate stuff also. It's cool.
Developing in Public : Open Source Tech Education
Hi everyone, my name is Hermann, Benji, I come from Venezuela and I'm the founder of Code Your Future. Code Your Future is a non-profit organization that trains refugees, asylum seekers, forced migrants and people of low income backgrounds in programming skills and help them start a career in tech. Today I'm going to tell you a little bit our story and what we're doing with open source and creating one of the most inclusive platforms to contribute to curriculum development. But before that I want to show you a video, but you know we're small groups so we're going to try to do this session interactive. So we're going to talk. Are you ready for that? Yeah? So I'm going to start with a video that we normally share with our COFers when they join to start the training program. It's a video that we created with the community and the voice is from our students. We don't know if the sound's going to work, but let's give it a go. Code Your Future, in this video we will learn how Code Your Future works, how it is different and how you might want to change your expectations. Code Your Future is a majority minority community of adults and together we are powerful. Everyone has some kind of struggle in our community and everyone also has help they can get to others. Everyone is wanted and needed. Together we can solve many problems. We are not a school of students and teachers. We are professionals collaborating to achieve realistic practical goals. CYF is not an accredited institution. We can't use our certificate to get a job without actually learning coding. All we can do is learn to build great software and we can do this to employ our client by actually building that great software. At CYF we don't tell you what you can do. You're not here to judge your potential. We only look at what you have done. We make decisions based on evidence. You have to rebuild things and you have to show us. Nobody will allow you to do your coursework or tell you off for not coming to class. If you don't do your work your reward will be your own failure. So you're free to make bad decisions. You are an adult in charge of it all. And so is everyone else. Everyone at CYF is a volunteer and we are all choosing to work together. Everything has been created by a volunteer. We sort of problem and try to solve it. There's nobody coming to save us. We are here and we are saving ourselves. CYF grads come from many different businesses and go to many different jobs. We are here to support you to get a job you want and a life you want. We know that with hard work, honesty, kindness and challenge you can get there. Don't sit around waiting to be taught things or complaining that you haven't been taught enough. Build your own projects. Find ideas that motivate you. Fall in groups and work together. Don't wait around. Try things. At Culture Feature the only failure is the failure to try it. You have this year to change your life with a huge hope for two forwards. Seize the opportunity. It's your life and you can change it at Culture Feature. Okay, that's it. Thank you very much. This is an intro to our organization. I get a little bit emotional when I see it because it really puts together a lot of who we are, what we do and how we do it, what we have been doing over the last seven years. I'm going to start by telling you a little bit of my background. If you're interested in knowing, is this sounds interesting like you want to keep hearing about it? Thank you. It's good to know. No, no. Okay, so as I said, I'm from Venezuela. The back story, sometimes when people hear about Culture Feature and they come to me, you know, you found the Culture Feature, tell me what is the story, how did it come, how did you do it, what was your inspiration, what is the inspiration? I think I summarized everything in these two areas, migration and on fairness. I spent pretty much more than half of my life abroad traveling in different places. And I started in the tech industry, had a great, you know, a good job, and had a wonderful experience learning and growing, but I was always a little bit touched by the on fairness, the inequality, the inequity in this world, how, why some people have access to so much opportunities and others, not just because they happen to be born in a different place and in the wrong country, by the wrong autocrats, and that was something that just kept nagging me for a very long time. And at some point it was just unbearable to kind of just keep doing the same thing. And above all is that when I was a child, you know, my country was a pretty stable, relatively safe place. This is my home country, it's in the mountains in Latin America, in the Andes, and it wasn't the richest place, but it was okay. I had a great childhood, I was playing in the streets with children, and I never thought that the country was going to change so dramatically. No one ever, when I was a child, imagined that the country was going to become this. Mass migration in Venezuela is one, had one of the biggest immigrants from any country. It's like millions of people have left the country over the last 10, 15 years. It's completely different to my childhood. These are people that a few years ago literally started walking, leaving the country, because that was the only chance. So this is the concept of a forced migrant. You don't want to leave, but you feel you have to. If you want to look for safety, if you look for other opportunities, because it's just so bad going on there. And then later on over the years, I started realizing that this story is actually overlapping with many other stories of what was happening. I'm sure all of you have heard about the war in Syria from a few years ago, and the migratory crisis that emerged from there. Well, people in Syria, they will tell you that when they were children, they also never imagined that the country was going to end up being like this. That they were going to start migrating and walking across, but they actually also hosting refugees from other places. It was a completely different story. And it's this change of reality that made me think a lot about the world. And when I decided to quit my safe job in tech and start looking for funding something else, well, at the beginning it was living on an overlap, but I entered this other world. The world of refugees and asylum seekers, the world of long-term unemployment, the world of unsafety, where you don't know where people have homes, or if you're going to keep the same place for a while. It felt that it was literally a parallel universe, that we live in certain reality and then the others are living in another one, and we rarely overlap. We rarely know what's the reality in that other place. We hear about it, we see it in the news, we see some statistics here and there, but do we really know that many people that live in those circumstances? And all of that is in contrast with this wonderful world that we live in tech, with so many advances, so many developments, so sophisticated technologies like this one. Can anyone recognize where this comes from? What device took this image? Sorry? Yes. The web space telescope, you know, a state of the art. This is capturing a constellation from one of the oldest pieces of light that exist in the universe. It's more than 13 billion years old. This is a time machine. It's looking back to the origins of the universe. It's only between 200 or 400 million years after the Big Bang. And we have this technology, and we can enjoy this beauty, this wonderful images that tell you this immensity that is out there. And we have these beautiful things happening, and I want us to just like see it for a little bit. Isn't that beautiful? They call this the nursery of stars, because it's a constellation of the formation of the stars over hundreds of millions of years. But all this beauty is contrasted with another reality. And I want someone that is brave enough or two people to start explaining what it is. What are these numbers? Anyone guesses? You can try this. No, it's okay. We can just try. A person out of 78 has to migrate from his own country. That's correct. One in 78 people in the world is a forced migrant. And that number keeps, the ratio keeps getting worse and worse. One in 78, forced migrant. 10% of the worldwide wealth is detained by, no, sorry, detains 7600 of the wealth. I'm not sure if I was right. Yeah, that's correct. 10% of the population in the world owns 76% of the wealth. And I checked this million times, and I just couldn't believe it. What else? Any other one? Sorry? Is it too low? Do you want it to be? Well, it's bad enough. What else? Yeah, the 10% produced nearly half of all the CO2 emissions. And then on the flip side, How are 50% keeping 2% of the income? Thank you. And are responsible for 12% of the CO2 consumption. Yeah, exactly. So these to me was the best representation of inequality in the world. When we're talking about wealth and pollution, these are the two extremes you have also, like, you know, very huge numbers when we're talking just about 1%. It's a big, big challenge. Do you think you can change something with your project? I mean, a bit of a tree bud in my opinion. Okay, tell me about it. A lot of companies already talk about inclusion, and they have like websites, and like, there's a lot of discrimination. But I'm not sure you're going to change that. How can you have impact on that? Thank you very much for the prompt. That's very good. We're going to talk about this. Yeah, we're going to talk exactly about that. So I want to know from the audience if you would be interested in helping changing people's lives. Raise your hand if you are. Great, thank you. That's great. And you don't have to say, I don't want to. The vast majority of people, they're like, you know, it's okay. My life is complicated enough, and that's okay. Come in, welcome. No, no, don't worry. Come, welcome, welcome, welcome. Welcome. The question is how, right? You were saying, is it realistic? Is it possible, right? Come in, come in. How? Come in, please. Do you want to sit? Don't worry, don't worry about it. How? So let's explore that idea, that concern a little bit. How? How could that be? What could we do? Any ideas? We want to change a little bit. We want to do something, small, tiny contribution. What can we do? Someone that hasn't said anything? No? Small steps, just teach one and one teach another, and another teach another. Like small things. Training. Yeah. Thank you. One person training another one. Information. If the media can talk about all of these problems, it would just be a huge step. Okay. Awareness. We want to keep talking about these things. Uh-huh. Maybe as a response to your comment, I don't know if it is, well, probably, well, at least if I'm considering how the world in, from my point of view has developed, I do know not believe that there will be one change, one technology, what I don't know, fusion, for instance, which solves all energy crisis problems. And this will also serve to, um, to ensure that we, that there won't be poor people anymore because we are living in a highly inequity, uh, in equal world. But, um, maybe it helps at least to, uh, like you said, with small steps in your own environment where you can make impact. So what you're saying, if I can summarize this, there's no one solution that feeds all. There's no one magic wand that is going to solve all of this. Absolutely, right? And you know, you're, you're good to say, hey, is this, is this a thing? Is this possible? Now I would like us to spend one or two more minutes doing something that is extremely unusual in these conferences, which is let's talk to each other. So let's look at your neighbors and then just talk for one to a minute. What other ideas would you have to actually bring some tangible? Are you, do you, do you feel comfortable? Is that okay? Let's explore that. Let's, let's talk, you know, we're a small audience. Let's talk and they say, how else can we do these things? What can we do? Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Do, do you guys want to, do you want to share? Well, just ideas, just to spark ideas. No? Just think. We're like brainstorming. OK. OK. Brilliant. Thank you very much. Thank you very much. Thank you so much for being so kind and speaking to each other and sharing. We have 10 minutes left, so let's carry on. So as you were saying, there is like, it's a complex topic, but we've been involved for seven years working in this area. So we have maybe a little contribution to make here. One thing that I just want to highlight, hell is paved with good intentions. This is something that I learned, and I saw. There's a lot of people that we want to do good, but it's really hard to actually do it, not only just to action it, but actually to have the impact that we're trying to do. A lot of the times, the good intentions can lead to really bad consequences. And that's something that is very important to analyze in every single aspect. In a decent case scenario, you actually don't manage to help. But in a really bad scenario, you actually cost more harm. One of the areas that for me has been a really good framework to understand is the mesoherarchy of needs, to understand, OK, when we're trying to make impact in a person's life, we kind of look at what it is that we're trying to do. And in the first level, we have the physiological needs. When people need food, air, water, for example, when there's like big emergencies like earthquakes, and there's this horrible words that are happening, like we know that, OK, we need to help people in these areas. It's very clear. It's very, you know, people really understand that intuitively. But then when we try to do something else, it becomes a little bit more complex, yeah, more difficult. It's like, OK, people need more than that. It's not enough to just like have the food and the air and the water, we need more. And one of those areas is we need to feel secure. We need to feel, you know, that, for example, when we have a job and we have employment, we feel like a little bit safer. We feel like, you know, and also when we start connecting with others, when we have a sense of belonging, when we also believe in ourselves and we have self-esteem, and finally, when we actually keep learning and keep growing, we actualize ourselves. All these are important, and all of these are interventions that we can do. When we think on a project, we think we're more or less in which area we're trying to make a difference. And what we decided in Code Your Future was that our goal was to launch careers, was to going to help people start a career in tech. We're not going to help anyone. We're going to help those that need it the most, those that you saw in those statistics, people that were forced migrants, that suddenly arrived in a new country and had no connections and no jobs and no way to know how it was working. And then we saw that tech was a great place to be because it's pretty democratizing. If you know the skills, if you have the skills, then you can actually get a job. And that's what we started working on. But then later we realized, well, it's actually not just about this. Because actually, yes, you can, you know, you say, I'm going to try someone get a job. But actually, in order to get a job, people have to be skilled. But in order to be skilled, people need to also feel safe and a sense of connection. So our community of professionals, the old volunteers, and they're all coming to share, give their skills, the old people like yourselves, normal people, they come and they created that sense, that community and that growth. And then it goes back to the question, does this work? Does it actually make a difference? And we did measure it because we said we're going to do a training, let's organize a training that's all very nice, very exciting. We started like doing web development, the full stack JavaScript. But then that's not enough. Like that is something interesting, but there's much more than this. The real part that mattered was getting people to get jobs. And we were lucky to see from the very first class, people getting jobs, like Centaur, refugee from Ethiopia, was unemployed long term for many, many years, went through the program, and now he's working at the Financial Times. Or Ahmed, a refugee from Syria, his dream was to go to university, but the war interrupted all of that. Come, please, come in, come in. The war interrupted it, and he was completely lost. He never went to university, but he always wanted to be a developer. He joined Code of Future, and a year later, he was having his first job as a developer. And finally, Ansi, who is from India, had given birth just four months before the program started, had a tiny, tiny baby with her. She had not done any employment for five years, was completely alone, feel disconnected, very shy, very low self-esteem, and went through the program. Please join us. And then, and she's a developer five years later. Now, is this going to change the huge numbers that we saw there? No, of course not. But it's just basically one person at the time, because every single person matters. Every single person makes a difference, and is like connecting and knowing that you see that journey of people getting into employment. It's a dream. These are some of the numbers that we have. Mass majority of people we work with are living below poverty line, mass majority are ethnic minorities. We have a huge range of ages. We have a great gender split, and above all, people are getting jobs. This is our ultimate measure of success. And over years, we have been diversifying. We have an open source curriculum development that Daniel that is here has helped organize for many years. We've been diversifying, getting into new areas of development. And we started like here, but then we just been growing. One of the last ones was an SRE program that we created together with Slack to help people pay them all the way from like no programming all the way to getting a system engineering job at Slack. We are doing this. And because we wanted to talk a little bit about our syllabus development, at the very beginning of the early years, we were very excited. We started developing our own curriculum and creating content. And we had this really long, long list of content. It was very good. We're very excited. But over the years, it became really hard to change, really hard to adapt. It was a little bit like at the beginning was, you know, we had these little blocks and they started growing. So at the beginning was like, okay, well, we have a block and we put another block. And then if we wanted to do a change, it was easier because there wasn't much content. We had to change one and then we put another one. But over time, we had like bigger, higher towers of content and information. It was really, really hard to change. And then at some point it was like this. We had this big thing. I was like, oh, Daniel would say, I think we need to change this part because it's not working. It's like, oh, well, but if we have to change this, we need to change so, so many things. So it become really, really difficult. And then we had to design a new curriculum paradigm and we just went in the open source mindset. How can we make as inclusive as possible? How can we allow as many people as possible to come and contribute? And we basically completely changed the way that we had the curriculum and we decided we're not going to have content. What we're going to do is basically point out to whatever content is out there. And then we have basically bits of information that can't live anywhere. And the content basically is just those pointers that say where those bits of information are. So if we want to change something, we just rearrange it. And if we want to move from one to the other, what we have to do is just point it out the direction. So our curriculum, instead of these pages and pages, longs of text that becomes really updated, and it looks like this, this is our curriculum. So if you're interested in knowing more, join our open source curriculum development. Thank you very much for your time. It's been a pleasure talking to you. This is Code Your Future. And we're here in the sun. Thank you. Thank you so much.
Open Source Railway Designer (OSRD): why SNCF Réseau start an open source project ?
Yes, let's start. Thank you for being here late. It's difficult to go after my colleague because my subject is more lighter than his. Nevertheless, we'll try to answer to the question why OSRD, the French infrastructure manager, has started an open source project. So first of all, this is the S&CF Resort. This is a railway infrastructure manager in France. That means that this company is responsible for the operation and maintenance of 36,000 kilometers of tracks on which 15 trains are running per day. That means 2 billion passengers a year, almost 100 million tons of freight per year. This company has to face several challenges. First of course, we want to be on time for the decarbonization of transport. We want to increase the traffic and as we are a public company, we want to optimize our infrastructure investment because we use public money. And I think that's why we are in this room today. And of course, we want to be competitive to plane and road at European scale. For example, in May I made a trip from Paris to Berlin. It spent nine hours by train. It's not possible now that only for a thousand kilometers in Europe, you spend nine hours in a train. We do the same in France and it's spent only three hours. So we have to do something and to show that at the European scale that trains and railway can be competitive compared to plane and road, of course. Or not. Sorry. So we need tools to show this problem and to solve our issues. But S&CF Resort is not software vendors. And in the meantime, we can't find any software in the market that can answer to our needs and to provide a solution to this specific challenge. My team knows well this figure. What S&CF Resort here? We start with mobility needs. We design the timetable and we sell space time. This is a slot where the train can run. And we produce in real time by opening the tracks and switch the red light to green and the train can run. This is what we are selling here. We earn money. And in the bottom here, there is a network where the train can run. Then we have to build the network and we have to design this network. And it's because we need capacity to make space time. So it's quite easy to understand. I think this representation of our activity and where we need some help with the software is on the long-term planning by designing the timetable here and manage several infrastructures, several models of infrastructure. And on the real time or quite real time, we need to add some more trains. Even if we have planned all the trains in the years ago, because here it started 10 years ago, even if we have planned some trains, we want to be able to add more trains in the D-Day, for example, to be competitive to road, especially for freight trains. Because when a company wants to send goods to a customer, it has a choice of a truck on the road and a train. And if we say to the company, you have to ask a train 10 years before you want to sell the goods, it won't work a lot. So we have to be able to answer to this need in a very short time. So we developed our software, this is the topic of the talk, and how it looks. So it looks like we have several infrastructures. This is the French infrastructure here. It's coming from internal data. And we have a lot of data putting on the map. And we have all the tracks and all the platform. And we are able to choose a starting point in Paris, the station. And we want to go to Berlin, which is another infrastructure, but we don't have access to this data. So we took the open street map data. There is a lot of data in Germany on open street map. And with our tool, which is called OSRD, we are able to find a path in the infrastructure and calculate the running time here. And with this calculation, we show at the European scale that a train is able to take only four hours to go from Paris to Berlin. This kind of tool doesn't exist yet, except in open source now. The worst slide I ever made, it seems like a postcard from the 80s, but I just wanted to show the issue with an example. And can you find the difference between the upper part of the slide and the lower part of the slide? What is the difference? Not the obvious difference. It's orange, it's white, okay. I know that. But what? No, no, not quite the same speed. Yes, you're right, but it's not, it's obvious, too obvious, sorry. Another one? I have more trains power. The power, no. I have throughput. I have throughput. It's more trains power. No, there is the difference here from here to here. It's 800 millions euro. You can't find it in the picture, but if you want to transform your high speed line from Paris to Lyon with 13 trains per direction to the same track, or yes, of course, with a new rolling stock, but with 16 trains per direction, it will cost 800 million euros. I have to think because it's a huge figure. So, and for doing that, we need public money because we don't have money at the Sensei Frézot. So, we need public money. And we, in the business plan, the three trains are part of the cost of the, can find the cost of the 800 million euros. So, if we missed our studies, and if we, at the end, there is only 15 trains or 14 trains, we put 800 million euros on the trash. So, we have to be careful by doing that. We have to be careful with our studies and we have to be sure of what kind of tool we use and we want to master it. So, as I said just before, we add software from vendors. They are acting like back box. And we, to explain to the state that will give you, give us hundreds of million euros, we have to be transparent and say that if we need money, it's for that. And it will cost money because of that. So, we need to be transparent to explain everything. So, we have to, for now, we share only maybe data, but now we need to share algorithm because the running time calculation between for a train, which help us to know how long does it spend to go from Paris to Lyon. You have several algorithms. Each software has an algorithm. You put the same data. You don't have, never, you never have the same result at the end. And can you cut the mic? We have 14 running time calculation algorithm at the S&C Friseaux. So, it's a big mess already internally. So, we have to explain and to share the algorithm and to be able to explain what happens when you put data in and what happened at the end. Of course, by the past, we make some worst study and we spend money to, and we expect result and we never had this result, but we spend money. So, it's not a good thing for you. So, we need to restore confidence. Reliability as a software, we need to be, to track all the bugs. And if we share the code to the community, everybody is able to see the bug. And I hope, correct the bug. And of course, we prevent security flows by opening the code. By using, by opening the code, we expect that the IT integration will be more easy. And because every companies, every railway companies, every country that want to integrate the tool, OSRD, they can, they will be able to do it because the code is open, the internal data model is open and you can, you take the tool and you connect with your own database and it's quite easy to do. And of course, this is internally in each company and between company, as the code is open, the data models are open, we want to facilitate interoperability between the railway company in Europe and then we'll be able to do a train from Paris to Berlin. I don't want to insist about that, but nine hours is too much. And of course, if we do open source and the railway company invest, I don't want to people working for free, of course, but if you are in another railway company, we can share the cost of the maintenance and developing new features. And at the end, this saving will benefit to all the rest of the community and of course, for the railway. And of course, if it's open, it allows and it makes possible collaboration between companies, between a large community of developers and we expect that some feature will come without any plan because people will see something and they give ideas to developers and our companies and then there is an issue which is open and maybe a proposal to develop a new feature and something unexpected can happen here and always to the benefit of the railway system. And by opening again the code, we expect and we already working with a community of possible contributors, other infrastructure managers in Europe. We are starting a common project with the Swiss company. They have a long-term time table tool. We try to plug these two tools and to have the possibility to make the timetable for the train from 10 days before operation until the D-day. And we expect also contribution for researchers. Researchers have a lot of ideas but they are working on their side. They don't have access to data, they don't have access to algorithm but they develop very, very good algorithm, optimization algorithm that could benefit to the railway system to reduce the impact of the railway on the energy consumption, for example, or to reduce the cost of the operation of the railway, etc. There is a lot of possibility to optimize the railway system. So it seems to be nice but we have opportunities and threats. So again opportunity, OSRD is very innovative in the railway sector and I think with the possibility to be shared by two other railway companies, it will give some opportunities and we choose to create an open source foundation around the railway software and it's called Open Rail, sorry, and it starts last Monday, the operation of Open Rail Association, it's the name, starts last Monday and we hope that it will benefit to the railway sector by giving a neutral area and give the confidence to other railway companies to develop together. And of course as we said, it will help to the European integration by offering European railway infrastructure and to show that the railway could be competitive compared to plain and we have talked about standardization and the main goal is to be competitive to other transport means. But we have a lot of threats or a lot of reservation because people internally think that the running time calculation is an industrial secret and running time calculation is Newton's second law and numerical integration. So it's secret since 300 years, it's not secret since 300 years so I think we can share it. But people still believe that it's secret. Sorry, we have to face a question about the market rules which is very tough in the railway industry because we are not integrated in France, we have the rail, the infrastructure manager and we have several railway companies, railway undertakers that operate the train on the network and there is an issue if a railway company win tender, I didn't remember the name, win a tender by the use of open source software, what is it possible or not, it's a legal consideration that I don't manage of course. And there was a question about sovereignty but as you see I made a line on it and it's over now because OSRD is the first project hosted by the OpenRaid Foundation since yesterday. So it's good news for you, for us and for you also because it's not an SNCF project now, it's a neutral project, it's open to the community and yeah you can work on it, you can use it, you can contribute and we are waiting for you. Thank you, that's it for me. And I think I have time for questions. We have about 10 minutes before they start off the video so there's plenty of time for questions. What kind of technology we are using? The older stack, the two guys that are, the back end is in Java and Kotlin and Rust, I put Rust in the middle where but Rust in the back middle and the front end is React and we use also some map lib for the map and what else? What did I forget? Something? Oh yes, some Python also, it's all technology yes and I didn't say it but it's a web app so we put that in a browser. The license of OSRD is LGPLv3 which allows us to open the code of course and allows us also to keep part of some modules, keep it closed and private if we want to keep it and it allows to the vendors because we don't want to kill the vendors, it allows the vendors to keep, to take the core of the OSRD and make business with their own module. That was our choice, yes. Sorry. When was the project created? It's very difficult to talk. This version of the project was created three years ago. We were five, some more five and we are 50 now. We are founded by European Union and French state and also SNCF Rezo. So that's why we can't be so, such a big team, yes, we are 50 and we are in a beta test currently and we hope that we had a V1 at the end of 2024. Why is it hosted? It's hosted on GitHub and for now it's an OSRD project but maybe next week it will be on Open Rail Association OSRD. We just have to make the move but it's official. We can say we are an Open Rail Association project. So if you use the software to kind of double check or maybe even challenge the calculations of other competitors? Yes, the question is did we change, did we check or challenge the running time calculation for example? Yes, we check with at least two big software when it's German, the other one is Japanese and the difference between the running time calculation was very low, 0.4%, 0.4%, something like that. And does the data usually, I mean, does the data of the infrastructure actually allow modeling the real world that well? So the question about the data, the data that are expected can model yes, the physics of the train and make the calculation possible. Actually the quality of the data is not so good so we have to make magic about it or be resilient about the quality of data but yes, almost all the data are available. It's not very difficult to, there is 13 parameters for the running time calculation which are very easy to find. Yeah, I don't know. What is my activity between Open Rail Association and SNSF Result? That was the question. I'm fully employed by SNSF Result and I work at 100% for SNSF Result and the rest is for Open Rail Foundation. Yes, I think now they're supported yes because we have this on Monday we have the first board of directors and they are high level people they are making nice meetings so it's okay. Now it's okay. I don't know. The reaction of software company, we have no reaction for now. I know that some are interested to use the core and to make collaboration but that's it for now. I have no direct interaction with the software company. The previous algorithm you were using were provided by proprietary software from external companies or were they already developed internally? The previous algorithm in SNSF Result was a homemade in the 80s and it died three or four years ago and it has been replaced by a vendor software and they try to respect the homemade algorithm and we now compare to these two algorithms and yes and now it's a black box we don't know what happened. Okay thank you very much for being late.
How to Use Private Data in Generative AI: End-to-End Solution for Retrieval Augmented Generation with CrateDB and LangChain
in the morning on Sunday. It's nice to see all here, looking very bright and early. So we shall get straight into it. Let me welcome the first presenter of the day, Maria and Christian from CrateDB, who are going to be talking about privacy and generative AI. Thank you. Good morning from our side. Pleasure to open the Dev Room today and thanks for being here that early on a Sunday morning. We're going to talk about a very interesting topic, generative AI, how to use your own data and how we can build such applications also based on open source software. I think everyone is used to open AI and chat GPT, but you never know what happens with your data in these cases. So very, very brief overview. This is gen AI. I think everyone in the room played around with it already. Just a very quick summary of the basics here. You have your source data of any kind of sort of data. It can be text, it can be code, images, audio, videos. Everything is transformed. We are encoders, but billions of parameters that we use, a lot of text, a lot of input to train the so-called foundational models. We as users formulate some prompts against it. We ask the models some questions. It does its job and it generates the output and a language model does nothing else than predicting the most likely next token that it should generate. That's all the magic behind. We see a very, very big potential. When I first tried chat GPT more than a year ago, it was amazing. It started to write code for me. It starts to generate articles. I even went to some tools out there, took 30 seconds of my video and all of a sudden I can be a virtual speaker. Very, very impressive, super fast, but there's also a bot assigned to it. Obviously, some quality issues. All of you heard of hallucinations. Last week we had the example of what color is the water. Is it blue or is it really transparent? Depending on your training data, if you use children's books, the water is obviously blue. If you use the real-world training data, water should be transparent. Same as snowflakes or not white. They are transparent technically. Also, a lot of ethical questions, a lot of governance questions. Official government people talking to deep fakes, not realizing it. Also, a big threat that we have in the future. We have to be aware of also some environmental impact. The key thing we want to talk about today is quality and reliability with the importance of current, of accurate, and also of private data that is not available publicly. Because all of these foundation and models have been trained on public data. What's in Github, what's on the internet, what is in the documentation. Yesterday I watched a presentation with a clear message to everyone writing docs. We are responsible for what these models tell us. If you write bad documentation, we get bad results from GEPT or other models. It has been trained on not so good training data. Here, for example, Maria figured out promo code, open AIS web. If you register there and put the code 20% off. But unfortunately it was not working. So asking GEPT, hey, how can I apply the promo code? I'm sorry I know about this promotion. That's something you don't want to happen if it's a company chat bot. You want to avoid this. So perfect example why we need this current and accurate data up to the minute, maybe even up to the second. We need this current data. And obviously non-public data, private data, it's internal documents, it's confidential documents, documentation that is not public. And also if you are working with, they use legal documents, they use the technical documentations, vectorize it, put it to a language model and then for the maintenance workers, they have an application ready. But this is information that also must not leak. And this brings us also into a little bit of a dilemma because there are multiple options to bring this private data into the foundation and models or to enhance this foundation and models. First thing, again, I think everyone in the room heard about it, is fine-tuning. Where you give some input data, you really change the parameters, the weights in the foundation and model so that the knowledge gets incorporated into your fine-tuned LLM. Very good. You put the domain knowledge in there, but there are also challenges, right? You don't solve the frequency issue of the data. It's still some static knowledge. So there's research out there that one single wrong training data record can kill the overall performance. One guy says the water is blue and all of a sudden the response of the chat, but it's all water is light blue or something like this. And it doesn't solve the problem of hallucinations. You might still get a lot of hallucinations and not talking about the resources that we need. So second option, retrieval augmented generation, which is kind of developed into kind of a standard when you want to work with your own data. So first step is you really need to make the existing data, whether it's videos, it's data from internal database documents available to create the embeddings, to calculate the vectors, how this knowledge is internally represented. And then as soon as your user asks a question in the knowledge assistant or the chatbot, there's a called retriever is then asked, hey, please give me the relevant context. And this can now be a similarity search in the vector database, or it can be a combination of various searches, a full text search, to your spatial search, a regular SQL query to get information out of your databases. This context is returned back to the retriever. It is put into a special prompt, as context, as additional information to the prompt, and together with the question, and this additional context, not a large language model can generate your answer. And you can put into the prompt, as we will see in the demo also, please use only this contextual data. If you don't know the answer, please say you don't know. Limits the hallucinations a lot, doesn't prevent them 100%. Good. I think I talked about disadvantages and challenges already. And one advantage I forgot to mention is access control. Now that you really get this context from either vector store a different database, maybe create, you can put fine-grained privileges there. The example application that I mentioned before, some of the maintenance workers are not allowed to use legal documents, for example. So they don't use the index, use the embeddings of the legal documents, but they are obviously allowed to use the technical documentation. And someone from the legal department, oh, what is the support contract with XYZ? Are we now in liability? Et cetera. Obviously, they need then different indexes, different search indexes. How to do this? How semantics represented? Key is the vectors. So, or embeddings. And the vector is nothing else than a series of decimal values or an array of decimal values with a lot of different embedding models out there already. And every model has its strengths and weaknesses. Some are more optimal if you use, for example, German text, if you use Chinese text or Indian text, right? A very different way how to come up with the semantics and to analyze how the attention mechanisms internally work, right? Because the sentences are built in a very, very different way. So you see different performance there or highly specialized models. You do an image recognition. Oh, it's a sleeping cat. And this can then be vectorized as well. And you can search for this context in your vector store. And now, if we think this one step further, how could an architecture look like for such a knowledge assistant or a chatbot? Prototype is always easy to build, but you need to think about a lot of a lot of additional topics. First of all, it starts with the data, right? The data that you want to train, that you want to vectorize, that you want to make available for your search. So we've shown here a landing zone from different sources, can be the original sources. You might copy it, depends on the architecture you want to build. And the important thing is the processing layer. How do you chunk your data? How do you create the vectors? And obviously, you need to store these chunks of information together with the vectors and provide proper data access control. Second part here, the LLM part, talked about it now multiple times. You need access to the embeddings, you need access to the large language models, and then also needs to be some logging. What do do you use a query? How much cost does it incur? Is the performance okay? A lot of logging that also occurs here. And intentionally, an LLM gateway put in front of it because it needs to be changeable. Chatbots with a lot of functionalities don't want to go into all the details, obviously monitoring and reporting. And the beauty of it, you can build all of that with open source tools nowadays. And also the embeddings and language models can be open sourced, a lot of alternatives out there. Now, why create a long chain? You need a robust data management. As we have seen, there's a lot of different data sources involved here, data stores, whether it's logging, whether it's semantics, your agents communicate in JSON. So you need to store all of this information, ideally in one store, not five, six different databases here that you need to operate, you need to learn the language, et cetera. And also long chain, other opportunities are also out there. Think of Haystack and others that you could use. But all of these frameworks give you a very good set of building blocks. You can just use them. It's available in Python, JavaScript, there are also Java ports out there, ports to other languages are now available. Everything you need is already in these libraries to come up with your overall architecture. And that's now the point to hand it over to Maria. She will guide you through a demo where we want to use it, try to simulate how you can use support tickets, internal data. Here we took some Twitter posts from Microsoft. We will vectorize them and we'll show how a support or a customer can then interact with this chat bot, ask certain questions. It won't demonstrate it's not such a big effort. You can get started right away. And all the demo, we put the link here on the slide. You find also the link to the demo in the app or on the website for the talk. Thank you. Do you hear me? Okay. Awesome. Thank you. So you have heard a lot of theoretical aspects of the drug and how it works. I have a little bit more than 10 minutes to show you a practical example. But believe me, we can have hours long workshop on this topic. So essentially, the idea today is to show you how to augment some of the existing LLAMs with the private data and how to use it for the context of some specific questions that this LLAM has not seen so far. So we actually use data that capture customer interactions on Twitter and these customer interactions involve different questions from the users about Microsoft, Amazon, all these different products today and how actually the support from these big companies actually answer to these user questions. So this is not something that you usually see on the Internet very easily. So if you have maybe some problem with some Microsoft product, yeah, very often you can actually find the solution out there. But some very specific questions that are asked directly to customer support is probably a very good reason why it sells to the customer support. So you didn't find the answer to this out of the box. And we will use CradyBee as a vector store to support this example. So I think Kristina already gave you a good overview of what the CradyBee is. What is the long chain? Long chain is an open source Python project that actually is used to facilitate the development of LLAM applications. It's a pretty cool project that integrates a lot of large language models, a lot of models for calculating embeddings and actually something that helps you integrate some data source with some language model without thinking out of the box how the full engineering pipeline should look like. Actually you can just do this in a couple of lines of code. May I add one point here that I forgot to mention. Although you use long chains, very good starting point. What we have also seen for very advanced purposes, you want to directly interact with your data, with your source data, with your vector store and all of that is available in standard SQL, no matter which data model you're using. And CradyBee is an open source store, one of the easiest ways to run CradyBee is actually to use a Docker image. So a vector support in CradyBee has been available since 5.5 version, but if you actually always pull the latest image, you should not actually think about this. So once you run this Docker run command, we actually run the instance of CradyBee cluster and then we can access the admin UI in the local host. So currently I think because of the resolution of this screen, yes, not everything is available, but actually in this admin UI you have a couple of tabs that you can use actually to monitor your cluster to run some query in the console and also to have overview of the tables and the views that are available in your database. So let's go back to the example because the time is flying very fast. So what we need is the first step, we need a couple of import statements to make sure that the long chain and all libraries that we use in this example are available. What is also important is that you import CradyBee vector search interface that is available in one of the long chain versions, let's say, which is used to interact with the CradyBee. And as a next step, because we need to interact with the CradyBee instance, we need to specify how we interact. So this is done by specifying connection string. We are using open source version running on local host, but you also have option, for example, if you want to deploy CradyBee cloud cluster and at this point we also give option for all users to deploy one cluster that is free forever so you can just run it and use it for testing purposes. Finally, we need to specify the collection name that we are going to use in this notebook session. So if we run this piece of code, the connection string is now available and then we can start interacting with the CradyBee. So for purpose of this notebook, I rely on open AI models. Of course, there are long chain supports, so many different models, you can actually integrate many of them, but if you choose to use open AI, make sure that you have open AI key as a part of your environment variable. So now let's take a look at how the dataset looks like. This dataset is also available on our CradyBee dataset repository, which is also open source and it contains the customer interaction about Microsoft products. So essentially we would like to now kind of narrow the scope of this notebook for the for the illustration reasons and time reasons. So essentially this dataset has some information like who is the author of this question, whether it's unbound, outbound question, when it was created, what was the context of the question or the answer and actually whether this text is response to something or is it response tweet or is it created in response to something else. So essentially all this information and now the idea is to feed them to the large language model and to ask questions that could be for example seen in this dataset. So as a first step, if you remember this big rug image is to create embeddings. Embeddings is actually the representation of your data that is suitable for machine learning and AI purposes. So first as a first step we need to load the data from this dataset and for this we use CSV loader interface that is available in Longchain and now in this like few pieces of code we already we already creating embeddings for all the data set for all the entries in our dataset. So if I go back to the admin UI I can see two tables. So in the first table actually gives me a collection of entries. So as we as we define the the first collection we created is called customer data but essentially what is interesting now is to see like embeddings created for all the entries in this in this collection. So for example this is the instance of the of the document that we are actually using for the training purpose or for the context purposes and you can actually see how the embeddings look like. So if you use open AI embeddings usually the length of your of your vector is going to be 1040 something yes it would be size of 1040 something but you can also for example choose some other embedding algorithm for example hugging face as you can see suggested here which is which is open source and it can easily be used out of the box in just two lines of code. Now once we have these embeddings let's define our question and our question today is like okay I have some I have some order on my Microsoft Store but I want to update the shipping address and how I do this. I also here put alternative questions so like when you play with this notebook you can also put your own questions and see actually whether this data set has enough information to answer this question. So once the question is answered what we want to do is actually we want to find the context that is relevant to this question and this context is done by doing similarity search of the vector representation of our question compared to the vectors actually that we stored in the creative instance and this is actually done in just one line of code. As Christian suggested vector search is one way to find the relevant context of course kdb supports other types of searches like full text search or geospatial search or just key search keyword search so like you can use different type of searches combined together to find what is what is the relevant context for your question. Once we do this we are now ready to actually ask our LLM to answer our question and how we do this. First we need to create a prompt that explains LLM what his purpose is. So his purpose is today to be expert about Microsoft products and services and should use the context that you are going to actually give to the LLM to answer relevant questions but if the answer is not fine in the context it should reply with I don't know and this is very simple way to create a prompt that actually gives instructions to LLM how it should answer specific questions and finally we just need to create small chatbot by using some of the available models that are integrated with the long chain and also passing this context together with the user question. Once this is completed we can access the answer and in this case it says to update the shipping address you will need to cancel your current order and place a new one. Maybe that's something that is still up to date that is relevant maybe it's not relevant anymore but it's actually something we learned only from the dataset we provided so this is a way how to actually how you actually use your private data to teach LLM actually what should what should be the context for any incoming questions. So I hope you like this demo you can play with this notebook it's on our creative B examples repository and you also can see there are other similar notebooks for different different different types of examples for different prompt engineering examples or like how to create another another form of chatbots how to use another embedding algorithms so please let us know what you think give us a feedback open a new issue on this repository and we are looking forward actually to work with you on these topics. So I think that is all from us thank you for being part of this session maybe we have time for one question okay awesome do we have questions anyone thank you for the talk I have a question about the embeddings model because if you encode prompt with language model and use external embeddings model they cannot be in different spaces and if you do similarity search have you tested it and do you see the effect of different embeddings I mean it's a very important question now if you the way you create these embeddings is super important and you're usually limited to one embedding algorithm because you need to they need to have the same length and obviously they need to capture the same semantics simplifying a bit and this is also what I meant with the customers that we work with they were able to create different indexes right and then the retriever gets more and more complex as you've seen on this architecture slide this is a simplified example you maybe you need to query different different indexes created by different embedding algorithms you know so that you can search your images you can search your textual data right obviously you might use different things there and then re-rank the results come up with the really relevant context maybe from different indexes and maybe you also want to combine it with a full text search or limit it to customer support tickets from Europe trying to come up with a good example there are or to customers support tickets from the US with some geospatial inhibition but this is then the re-ranking of the results that really identifies the particular context that is really relevant for the question okay thanks a lot any more questions no so thank you very much for the very nice talk thank you you
A murder party with Lea
Okay, so now we can start. Thank you very much for coming to the Python Dev Room and getting up early on Sunday morning with this cool weather outside. So now we are going to have a very, very nice talk by Pierre Denis, who is a long-time Python user. He's also the creator of Liya, and he's going to talk about Liya in this talk. Liya is a Python module for helping to calculate probabilities on situations presenting uncertainties. And what that means, I hope he's going to explain to us now. Thank you. So welcome everybody. So we are here about something serious, a sad story. I'm not a good storyteller. I'm afraid, but okay, Dr. Black has been killed last night. Maybe you have heard about that. And okay, we have three suspects that have been identified with given probability to be the killer. And it seems that colonel Mustard is the most likely, is most likely the killer with 40% to be the killer. Then we have Mrs. Peacock, 25%. Mrs. White, 10%, and Professor Plum, 25%. Okay, then these are prior probabilities, but we have the help of a profiler, a segment, the profiler. This guy is very smart. And he can tell, for example, that if Mrs. White is the killer, she'll be absent for the investigation with 95%. Otherwise, if she's innocent, she'll be absent with only a probability of 20%. And the profiler tells you several statements like this with probabilities. So when you see this kind of situation, you see, okay, it's quite complex. How can I use this information? Because nothing is certain. Okay. So the investigator is Leah. Here, Leah is not a person, as you have understood. It's a module dedicated to probabilities. So okay, I have several statements here. In other presentation, I elaborate on this, but this time I prefer to show you Leah in action so you can better understand what it is about. My claim is that Leah is something quite easy to use, quite intuitive. You know probably that there are several packages dedicated to probability or statistics. The core feature of Leah is to be easy to understand and probably well suited for education. Okay. Let's start. First, I import Leah, which is here in version 401B. Anyway, so first of all, I want to define a fair coin with head and tail. I do that. So Leah can work with any Python object here. I define probabilities on strings, but you can define probabilities on numbers on any Python object. Okay. Here for education, I prefer to switch to fractions. You know that Python has fraction included. So I've switched the display to have fractions. So if I want to create a biased coin, I can define several values and here it means that tails will be three times likely to go that then head. So I have a new probability distribution. So what I'm doing here is a crash course on Leah because we want to be acquainted to it before doing the investigation. I can also use a probability mass function to define probability in a fraction. So Matplutlib is integrated so you can display your histogram about any probability distribution. Okay. So I want to make 100 throws. So I use my B coin variable, my probability distribution to calculate to make 100 random coins, throws. So you see in this random throws that there is more tail than head. But okay, how can I be sure that it follows the probability that I given? Simply you can use Leah, the same function as before, Leah, VALS. You provide the values and this time it will use the random sample as it will be a frequency counter and you see that here more or less it conforms to the probability distribution that I provided for the biased coin. So what is interesting on this kind of object is that you can use many of what you do usually on your Python object. For example, you can index. If I ask for zero, it will take the first letter, head or tail, H or T. I can chain with the Python lower method and I have lower case H or T. I can map Python function here. This means that it count the number of characters which is four, head and tail, four characters each. So we have a certain four. And as you could expect also, all the operators are overloaded. So if I concatenate my distribution B coin with fixed string, I have a new distribution that follows what has been defined. Okay and here it's something a bit funny. What happens if you multiply a dice with a coin? You get that. Okay. Let's now throw two coins. So the new method allows you to define new event with the same probability. Here I define two coins which are biased together. If I add them together, I have all the combination possible with associated probabilities. So we will see that this is very important. We are able to calculate conditional probability with the given method. So here I try to see, okay, assuming that I know that the first coin is tail, what is the combination of the two coins? So here we see that the previous result has been filtered out to get just the two remaining possibilities. So it's a common feature of LIA is that when you define variable, there is a kind of lazy evaluation. They remain linked together in a network that define the relationship, the dependencies between the random variables. Okay. And you can also define Boolean events like, okay, what is the probability to be? I define it at 140 seconds. And then I can use operator like to be or not to be. And the result is it's true, it's certain true. Because okay, to be it's either true or false and not to be it's the contrary. So together it's certainly true. Okay. And there is also a dedicated function in LIA which is P. So you can extract the probability of true. So you get really a real probability like this. Okay. Let's go on. So here it's an excerpt of a book that it's three centuries old from Abraham de Moivre. It's probably one of the first problem solved by de Moivre here. Okay. Let's ask to find the probability of throwing a nace in three throws given a fair dive. This is how to calculate in LIA. So here I define a dive. I create three instances which are independent, which are assigned to the 1, 2, 3. And then I ask for the probability of any of one of these dives is a nace. The result is 91, 216th as calculated three centuries ago by de Moivre. So far so good. Okay. No, I don't know if you like playing a role-playing game. So there's a small example that where you can use LIA. So imagine that you have here this dwarf which fights against a troll. Okay. I first define a new kind of display with percentage because it's more convenient here. I define two different kind of dice. Okay. Imagine that your attack hole is d25 plus 4. Okay. What is the probability to ever hit? You see, okay, it's easy to calculate with inequality. So you have to be greater or equal that the troll armor class. You get this probability. So the damage of the magic axe is to d6 plus 5. Here is the result. But this damage is only applied if the dwarf can hit the troll. So for that we have a special construction, LIA if underscore underscore to avoid collision with the Python if. And okay, this means if there is a hit, then I apply the magic axe. Otherwise, the damage is zero. And here is the new histogram. So this is the probability, the actual damage that is done to the troll. And then from this data you can answer the, okay, assuming that the troll has 20 health points remaining was the probability to kill him in four rounds or less. You see it's deadly simple with this formula to calculate. We find it's 40%, something like that. Okay. Okay. You follow? So I will, I have many, many, many examples. But by lack of time I will drop maybe some of these examples. Boys or girls paradox, something very funny also that you can find on Wikipedia. So the chance to be a boy or girl are even. So okay, boy, one half, girl, one half. Mr Smith asked two children, at least one of them is a boy. What is the probability that both children are boys? Many people and including myself, the first time I heard this I think, okay, the information give me no clues. It's one half. But if you calculate like this with Leah, so you define children as two, a joint of two children and that you count the number of boys, calculate the conditional probability, the answer is one third actually. And what is interesting with Leah, you can understand why this is the answer by asking Leah to show you all the combinations. So here I show you the gender of all the children, the number of boys and given that the number of boys is greater or equal to one. And we see here the answer is here and we understand better why it is one third. Okay. It's a bit fast but you can do it at your own pace later. Okay. What happens if you have more elaborate problem? Like here we have several children. The eldest is a boy and he's got three brothers at least. What is the probability that all children are boys? Okay. You can model this like this. Here I create seven children. And I put, so you see when you read this expression, it's quite close to the initial problem. Of course you have to understand the elements of Leah to do that but after that it's quite easy to model. The answer is one forty second. Again it's possible to ask why it is so and here by joining you see that's okay, seven children is this part and the other are this. So you can better understand why it is so. Okay. I will drop this Monte Hall problem which is well known. You can read that after the session offline. Okay. Let's go back to the initial problem. So okay. First I change the display options. So the, first we define the pure probabilities like that. So here I ask Leah to display the probability in one line because it's more convenient in this case and as a percentage. Okay. So we have like this and we see, okay, colon and mustard. Our priority is the killer, the most likely the killer. Okay. Let's now try to write down the different information we have. So if Mrs. White is the killer she'll be absent with probability ninety percent. So I define here a variable. Mrs. White is absent using the if as we've seen before. I put the condition if the killer is Mrs. White then she'll be absent with ninety five percent else twenty five, twenty percent. Sorry. Okay. This is the percentage that Mrs. White is absent. But it's not very interesting because we, we, we are more interested about who is the killer but we will see what will happen later. And then we can continue and define other rules like this. If Mrs. Peacock is innocent she knows who's the killer with probability seventy five percent. So you see here there is a missing information which is the else part but we assume that Mrs. Peacock is not insane and if she's the killer then she knows who's the killer hopefully. So I put here the else part as one hundred percent. Okay. And then we can elaborate on more complex information like this one. Okay. I will not detail but you see again it's quite, when you see the statement, the tradition in LIA it's quite straightforward. And the last one is here. So what have we done here is to define what we call the Bayesian network which put the relation between different random variables. What is interesting with this kind of network is that if you get evidence about something you can go backwards and refine the probability to be the killer. So for that, okay, I define a list of evidence here. So first of all it's empty and the conditional probability is the same as before because I have no new evidence. So imagine now that Mrs. White is absent. I can add it to the evidence and define a new conditional probability. So you see it change a bit. Evidence two added to the previous one. Mrs. Peacock is drunk. Okay. I add this information and I get new probability and so on. Professor Plum accuses Colin and Mustard. And finally we know that the killer is a woman. So for that I use here the Python start with Mrs. because it's a handy way to say given the suspects that the killer is a woman, I add it to the evidence like that and you see, okay, there is a new probability. So there are just two suspects remaining, two women and Mrs. White is likely the killer. Okay. Yeah. Maybe you can consider this as a game but sometimes probability can play a very important role in some trials. So long time ago there was the Dreyfus Affair. There was a big flow of a so-called expert that makes a mistake in this affair. And also more recently, Selik Larch case where also there is a bad reasoning about probability. Okay. So I want to mention also that Leah is able to do symbolic calculation. So by using the SIMPY module that maybe you know, so it's very easy. It's the same interface. So instead of number you put variable name between quotes like this and you have probability defined with formula. So you can redo all the same exercise and you will get formulas to be the killer, etc. So a small example here. Okay. I don't detail here. It's a binomial function here with P and here I calculate a conditional probability and it displays me a nice formula. So you can check offline if you want that it is correct. Okay. I want just to finish about my bullshit generator which was made 15 years ago. So here the goal is to produce sentences at random based on a list of words and a list of grammar rules like this. Then you see that I put here a probability on each grammar rule so that the most simple rule are used preferably to avoid to get to long sentences. So yeah. Okay. I get... So it has produced... I don't know why it's... Okay. So maybe I don't know what happens but... Okay. I restart my kernel. Normally it's supposed to speak and to write down sentences but... Okay. Anyway. You can play that also. The Python code is really small so you can try it at yourself. Oh yeah. Okay. Of course I didn't import LIA. Okay. That's it. But anyway, sorry for the small interruptions but I think we don't have time for questions or... Maybe one question. Maybe one question. Okay. Thank you for the presentation. I have indeed one question which is about performance. So do you have information about performance, your libraries compared to other libraries or yeah, what are your insights on that? Yeah. It's a good question. So okay. It's not really the concern. So here as you have seen the results are exact. So okay. As you have seen also it's quite fast. So there are several optimizations. I have no figures but okay. As you expect there are many problems which are very complex and for that LIA provides Monte Carlo several Monte Carlo algorithm that gives approximate results in a fair time. Yeah. But I have no figures. Yeah. Okay. Thank you. Thank you very much. Thank you very much.
`New` Workflow Orchestrator in town: "Apache Airflow 2.x"
Okay, it's 10 a.m. So we can start with the next talk. Next talk is going to be about Apache Airflow. We have Jarek Potjok tell us about the new features in Apache Airflow 2.0. Jarek is a PMC member of the Apache Foundation. He's working on this project and he's going to tell us all the details. It's probably going to be very interesting talk. Thank you. Thank you for the interaction. Hello everyone. So I'm going to talk about workflows. Who here know about Airflow? Quite a lot of people. Who uses Airflow on a daily basis? A lot of people know. Good. So my talk will be mostly to present you the Airflow, what it does and how the new Airflow, like the new workflow orchestrator, which is the Airflow 2.8 right now, is providing you as users and someone who wants to write workflows in Python. And that's why we are on a Python track. It provides you with a very modern way of interacting with modern data stack, processing your data for all the different kinds of users, including the new users like LLM, all the models, all the artificial intelligence. But first, few words about Airflow. I always refer to my past. I was a choir singer. I know a lot about music and orchestras and choirs. So Airflow is, if you imagine what Airflow does, because lots of people ask what Airflow does. Airflow doesn't do much because Airflow is mostly a conductor and orchestrator, someone who tells others what to do. And as you know, in the modern data processing workflows, mostly you're using a lot of different data processing engines, so to speak. You just a few of them and you need someone who actually or something which actually tells others what to do when and how to pass data between those different processors of different kinds. And Airflow does exactly this. On its own, Airflow doesn't do much. Just tells, OK, you do this, you do that, you do this and send that data here and there. That's basically what Airflow does. And the main thing that Airflow does, and this is why we are here on the Python track, Airflow allows you to define DAG, which is directed acyclic graph of tasks executing the data and dependencies between them. It allows you to define it in Python code. This is very different from many different orchestrators where you usually define them in YAML or in some declarative way of declaring the dependencies in tasks. In Airflow, everything is Python. From the beginning to an end, you do everything in Python, including extending Airflow itself, which I'm going to talk about a bit later. So, like, just a few examples of how it works. You can define DAG with a decorator, a nice Python way. You can also use it in a classic way, which I will show in a moment. Then you can define a task. And then finally, you can just use the task that you define and you can, yeah, if it works. And then you can link the task between themselves and make the dependencies between them. And once you define them, Airflow does everything to schedule the task. As you can see at the top, there is a schedule that the DAG of the task dependencies are executed. So, once you define it in Python, Airflow does everything for you and executes the tasks that you defined nicely in Python in a way. There are two ways of defining DAGs. There is a classic way where you define the operators and you use them. And we have about three or four thousand different kinds of operators to talk to all the different kinds of external data processing engines, which can be either databases or cloud services or pretty much everything that you can imagine. I will show the list later. It's pretty impressive. Or you can also define the task in a more Pythonic way where you just decorate a Python function as a task and this task gets executed. That doesn't seem much and probably you thought like myself, I was developing different kinds of workflows over all my career, many years, and tools to run the workflows. And you would say, okay, that's pretty much all the work or a lot of work that you do is define these kind of workflows and make them easy to run. And Airflow is one of the most popular orchestrators out there because it does it. I think it hit the spot, the right sweet spot, how this should be defined in the Python, but also nicely executed. I will show some executed and managed so you can be a Python developer, develop your tasks, and on the other hand, you give immediately someone who operates Airflow a very nice UI and the way how to manage all those workflows you define. So I will show a few things that many people who already use Airflow and some people here know about Airflow and use it and they might have even not know that things like that are possible. So those are things which appeared in the few latest versions of Airflow. So like for example, you have a task group. The task group allows you to group the task together and execute them together and even like rerun them together or make dependency to the whole group until the whole group finishes. So that's a nice new feature or relatively new features that we have and that we are using in Airflow. Airflow also has this very nice feature of being able to dynamically create many instances of the same task so you can get like kind of map reduce kind of workflow where you expand the task that to execute like you can execute hundreds or thousands of them and then get the result of that. So the typical map reduce case and it is very useful in number of cases but as of recently you can also expand the task groups very similar way so you can run groups of tasks in parallel, multiple of them. Very nice because you can very easily parallelize complex Airflow because by the way, Airflow is not only like you define the task in one place but you can distribute the whole execution of those tasks over a fleet of computers. Like with Kubernetes, with salary, with different mechanisms. We will not talk about that. There is no time to go into details but this allows you to parallelize your task workflow pretty massively and very complex ones. Then you have like dynamic task mapping. So that's how it looks like so you get the task, you expand an array of values and then you have multiple tasks running. We have a very nice UI that you can like browse through the list of those tasks. I just wanted to show it for a reference. Very recent task, very recent feature we have set up in teardown. So when you have a task that requires complex infrastructure to be set up and turned down at the end, we have the whole mechanism to manage those edge cases that comes there when the task executes when and it fails and all the, there are multiple tasks in between set up and teardown. So this is also nicely handled right now in Airflow. Very recent addition in the last version of Airflow we have integration with FS spec. I don't know who uses FS, who knows FS spec, who uses FS spec. There are a few people but you should if you haven't used it because it allows you to access the object storage from different storage providers. It allows you to access it in the same way as you would access a local file system using path leap, using slash, using path open. But also it allows you to integrate with multiple different already existing tools which are using FS spec like pandas, Polar, Spark, DagDB, Iceberg, PyRO, those are just a few that already support FS spec and this means that you can very easily plug in the file that you define from the object storage. Put it, put like here data frame from pandas to read that and it will read it and it very nicely integrates with the Airflow way how to manage credentials to access that data. So this is really recent features that we have added. Now this is also connected with data hour scheduling that we have which means that like Airflow by default in the original version was only task based but right now it can also define data sets which are linking the tasks. So one task can produce data set and another can consume the data set. You can define it and you can use it and those dependencies are automatically taken into account and whenever first task produces the data set the second one is run and we have a lot more features to be added to support that including the FS spec integration and a few others. A little shout out to Tatiana who will be speaking next from Astronomer. One of the things that we recently added as our LLM operators donated by Astronomer who is one of the stakeholders in Airflow. So we have the whole set of LLM based operators that you can use to build your LLM workflows because like just training is one thing but you have to prepare the data. You have to process it and then you have to take the results and maybe do some inference and all the other stuff that is not necessarily just teaching and learning. And this has been implemented. It's not like a theoretical set of operators. It's been implemented in a real implementation of a bot that Astronomer developed. A few words. So that was about the DAG outtouring. So like this is how you prepare the task and this is the most important for people like you who are like Python developers and want to integrate and implement those tasks in Airflow. But one important thing is that Airflow provides you out of the box with the modern UI. Like there is a nice graph UI showing you all the dependencies, status of the tasks, how they are working. You can rerun and clear the task from there. You can see the status of that. This is like great way how you can do your like have your operations people, those who look at those workflows being executed. What's going on with your DAGs? And this is the nice like this is a nice gun chart showing how the DAG is progressing over time. And you can see the grid view on the left side where you see the history of the task being executed. So the same task would like how it progressed over days or hours or whatever your frequency is. A very nice integration with logs so you can see what happens when the DAGs fail, when there is a problem. You can diagnose them. A gun view, a little bit more detail showing you the like how the tasks are progressing while the DAG is being executed. Very nice integrated and very recent addition. You can see the whole overview of the whole cluster of your, as I mentioned, Airflow has the capability of like massive distribution of the workload among multiple nodes. So you can see that from single place in Airflow as well. But I think the most powerful part of Airflow is that it's not only like you, not only can define the task in Python and execute them. That's rather simple thing. But Airflow is a platform. Airflow is a platform that has this capability of, and this is the way how we think about platform in the future when we will be developing the next versions of Airflow, that rather than implementing something on our own, we extend the capabilities of Airflow but allowing others to, other dedicated tools and other solutions to use whatever Airflow produces. An example of that is OpenLinage. OpenLinage is a standard how you can track the provenance of your data across all your flows. So you can know that, for example, this part of the data, this column was a private column that was obfuscated or aggregated so that initially it was related to the privacy. Then it was not and then it was joined with another data and you can track all this information and OpenLinage allows that and Airflow produces the lineage data and it's fully integrated. Another thing that we have integrated is OpenTelemetry, which means that Airflow produces the telemetry data that you can use to monitor whatever happens with Airflow, whatever happens with execution of those tasks, using the favorite tools of yours like Datadog or Google Cloud monitoring systems. All these tools are supporting OpenTelemetry and Airflow produces the data in OpenTelemetry compatible way. Still early day but we already have it. We have integration like, for example, integration with DBT, which Tatiana is going to talk about at the next stop. So I will not stop here but DBT is one of the most used ways of how Airflow is processing the data or what tasks are executed and we have this nice integration where DBT models can be mapped into the Airflow models and then this comes externally. This is not something in Airflow but Airflow has the extensibility that allows you to do that and others like Astronomer, they did it. We have a fully-flashdressed API that allows you to build extensions and those extensions can be of different kinds because you can access all the data inside about Airflow. Those extensions can be like UIs. We had a discussion yesterday at the dinner. You can build your own UI using those APIs if the UI of Airflow is not enough because we cannot satisfy everyone. You can do that all with the fully-flashdressed API we have. A few words also why we are at FOSDEME, like Airflow is fully open source. Airflow has a 10-year history. This year we have 10th anniversary. Airflow Summit 2024 which is planned in September in Bay Area. The plan is 4,000 out in this. It is very popular. Airflow is pretty popular. The last Airflow Summit we had was 500 people in Toronto. You can see that it has been steadily developed over years. It initially donated from Airbnb. It became an ASF, Apache Software Foundation project incubation in 2016. Then in 2019 it became a top-level project. We released Airflow 2.0 which is this new version of Airflow. We released it in 2020. Now steadily you can see that we are releasing new versions with new features that are steadily coming. What is important part of Airflow is open source, permissive license. The Apache Software Foundation is behind it. Strong governance. We have really strong stakeholders like Astronomer, Google, Amazon, Microsoft also. All of them provide Airflow as a service. You can use it in their clouds. We have a very well-defined security process, release process and maintenance certainty. You can be quite sure that Airflow is going to be there with the same license. The license is not going to change as we have heard. It is happening with a number of open source projects. It is under the Apache Software Foundation umbrella. It is going to be maintained and you can rely on it being released in the future. You can rely on it pretty much. There are some community numbers. It is not a vanity metric. We have the biggest number of contributors in Apache Software Foundation more than 2700. 61 committers, 32 PMC members. We have 10 years of history. Those are the different kinds of tools that you can get when you integrate with Airflow. That should be a little bit before. You can see this is the community and integration. All the things that I mentioned before about the extensibility of Airflow allow you to build all the different kinds of extensions. This is like our community page, tools integration, integrating with Airflow. The big number of those are there. A lot of people are developing for Airflow, extending Airflow, adding new capabilities to it. One thing that I wanted to mention, because this is mostly what I am working on, we have very solid foundations for Airflow right now. We defined this public API, public interface that you can rely on when you are working with Airflow. We can very easily rely on the rest APIs and the number of things that Airflow exposes. This is the kind of most impressive, one of the most impressive things in Airflow. I mentioned that before, that we have this many, many different integrations built in. These are all the integrations that we have. I am not sure if you can see the names, some of them probably. There are a lot of different, more than 90 different providers that we have that allows you to immediately connect to those external services and run Airflow with them. My big focus is on the security and this is something that comes in the next two years for every software package that is nearby, near you. Because we have this CRA Act in Europe and others. If you haven't seen the kind of public policy and compliance talks here, you should realize that it's coming. In two, three years we will all have to follow the security very rigidly and we have very good functional security team that is handling the security issues according to the Apache Software Foundation. We are part of the bounty. If you find a problem in Apache Airflow, just report it and you can get money for that. I highly recommend that because we are fixing them fairly quickly. We have some features like S-BOM and reproducible builds built in Airflow. As of recently we were working on that for quite some time to make sure that whatever we deliver is not only useful and nice for the developers, but also it's secure to deploy and used in your production workflows, which I think is very important because it means that you can rely on the software. So summary, I just wanted you to remember from that talk that Airflow is a modern and solid data orchestrator with really strong foundations. Something that is going to be developed for next like 10 or 20 or 30 years, you don't know. The future is pretty bright. We have new slick ways of interacting with modern data stack. Even if we were created 10 years ago, right now we have all these new modern ways of how you can write your workflows and interact with all the external systems. That's pretty cool. It's true open source. And we have huge and supportive community of both the people who develop Airflow and those who integrate stuff with Airflow. And that's a sign that this is a really great project to work on. So I would say if you want to contribute or use, both options are very good for Airflow. And we continuously evolve. That would be very short overview. I didn't have time to go into many details. So just wanted to touch base on what Airflow does. If there are any questions, I'm happy to take them. Very nice talk. Thank you. My question is, in my group, we haven't been able to agree on whether the Python-based DAGs are declarative or not. What do you think? And do you think distinction matters at all or not? So the question was about the declarative or not. I personally think declarative way of defining DAGs or defining the workflows is very much impaired by the fact that whenever you want to do any kind of complex workflow, those declarative ways become unmanageable. In the past, I've developed a number of those workflows, as I mentioned. Airflow is just last one of them. Usually what happens when you have these declarative workflows, when they become complex enough, you start to write Python code to generate these declarative workflows. Because they are too complex. And then the benefit of having those declarative workflows is gone. I would say it's probably better if you start the other way. You use Python as a way how to define your workflows. And counter-intuitively, you take your declaration and generate the Python code from that. This is very easy, actually. And our customers are doing that. Our users are doing that, actually. A lot of users have their own version of declarative way of defining the workflows. And from that, taking that the airflow is so flexible, and you can do anything with the Python code, and you can define the workflows arbitrarily complex, just mapping it to a Python code that does it, is usually much simpler than trying to get these declarative workflows to do something complex. So declarative workflows are great for start, but when they become complex, the Python way is much better, in my opinion. Thank you. We have time for one more question. Is there any interest in airflow to support longer running workflows with events and things like that? With events. Blocking operations. You wait for an event, you wait for a time. Yes, absolutely. This is one of the things that I removed for clarity from the presentation, because previous version of that presentation didn't have that. So the airflow has so-called deferrable operators, and this means that when you will have a long running workflows and waiting for something, then you can defer execution of that operator to something which we call trigger, which is Async IEL, Python-based event loop, basically. And this event will wait, or this task will wait on this Async IEL without taking almost any resources. So if you have any workflow that, you know, with Async IEL, you can check and trigger when it finishes, this is the best way, and like, it's absolutely supported by airflow to have like hours or days running workflows for this kind of waiting for this without taking too much resources. That's a relatively new feature, like maybe two years or so. Thank you. Okay, thank you very much, Eric. That was a very interesting talk, and it's certainly a very, very good project to look into. We'll have another five-minute break now, and then continue with the next talk. Thank you.
Data workflows: translating dbt to Apache Airflow
Okay, it's 10.30, time for the next talk. So we're going to have another talk about Apache Airflow, but by Tatiana Alschwe. She works for Astronomer and she's going to tell us about DBT, which is a tool that basically you can write SQL and then have that executed in a templated way. How to integrate that into Apache Airflow. Thank you. Thank you. Hi, good morning everyone. So I'm really glad Yarek spoke before, so I don't need to get into the details of what Airflow is. Many of you know what DBT is. Could you raise your hands? Amazing. So initially I didn't have any clue of what DBT was and I was working for the BBC where we had very good software engineering principles and lots of things. And one day I went to support a SQL analyst who was analyzing the results of the A-B test and there was a bug. So the results of the A-B test were not very consistent. We were trying to figure out which machine learning models were performing better than the others. And then I said, okay, let's just see how you're doing things. And then I checked in his laptop and he had a Word document with a bunch of SQL statements. And then the procedure he was using was he would use this Word document to write the SQL statements, copy and paste that, like saying, Snowflake and other data warehouses, export to spreadsheets, then try to join some information. So the process altogether looked extremely error-prone. So the principles we had like testing, versioning, some repo, anything, nothing was in place. We eventually figured out what was the issue, but I really thought, well, we should be able to apply software engineering tools to any process. And the tools should be as easy as any person with any skill. And sometime after I came across Djibiti. So the idea of Djibiti is really to empower people who may not have experienced software development to use good software development practices while they write transformations, SQL transformations. So there is Djibiti Core, which is a quite stable project. It has around 250 contributors, 6000 commits. It's quite popular on GitHub, over 6000 stars. And the focus is on transformations. So you can define your SQL files into text files. It encourages users to deploy those to Git so you can have versions. It allows you to define tests. So let's say you would like to check some columns so you don't have no values. So it really makes easy all this practice. And it's an amazing tool which has really helped improving and avoiding the process I've seen in the past. And then many people may ask, okay, but what is the relation of Djibiti and Airflow? Aren't analysts happy enough running those scripts locally? And why using both of them? Yarek Raja explains what Airflow is, but it's a very mature orchestrator tool, which can allow you to run things in parallel. And it has lots of flexibility where you actually run things. So it does have a few things. In one hand, we have Airflow where you can write pipelines in Python. It is flexible. As Yarek shown, we have hundreds of operators from multiple providers to integrate with uncountable data warehouses, tools, LLM tools and so on. But it's quite complex. So the interface can be a bit overwhelming with lots of colors, lots of boxes. It can be hard troubleshooting and getting to understand. If you want to run Airflow locally, there is Airflow standalone, but you will have to be running a web server, a scheduler, and eventually a worker node. So there is complexity to it. On the other hand, Djibiti, people can write workflows just using SQL, transformation workflows. It is quite specialized SQL data warehouses, but it has a very simple UI. So Airflow is quite good. Djibiti is quite good in specifying tests in a simple way. It is very dependent on management. You can use Jinja templates to reference other tables created with other modules in Djibiti via SQL files. It is quite easy to define schemas. But it only focuses on transformations on the T side of ETL. Airflow, on the other hand, gives you flexibility to do anything you can do with Python, which is a lot. But it is quite complex to run. You could achieve anything you do with Djibiti with Airflow, but you may have to write more code. So I don't think we need to compare. I think what many companies decide to do is let's just try to use both tools. What this presentation is about is how we can use these tools together in an efficient way. So this is how a Djibiti pipeline looks like. So on one hand, you have several files. Inside the models, each of those files represents a table, and you have a transformation. And then Djibiti allows you to render the pipeline. So let's say you have some raw data and customers, orders, payments. You then transform those, and then you aggregate those to be able to generate reports and send it to Tableau or something. This is what a Djibiti project looks like. It's similar to what Airflow does since Djibiti interacts with databases. It has a way of defining how those connections look like, which secrets, credentials you have to use. Those are done via Djibiti profiles, via YAML files. And then the question is, okay, so there is similarity. Djibiti and Airflow can generate DAGs, allow you to create workflows, but how to connect them. So we thought, okay, what are the options if we had a translation tool to convert from one to the other? If you check into Djibiti DAGs, what they say is, if you're using Djibiti Cloud, which is a managed version of Djibiti, you can just deploy things, and there is an official Airflow provider to deploy there. But currently, they changed the pricing model, and if you have a team, you'll probably be spending proportional to the amount of developers you have in your team. So it can get quite expensive. Another strategy you can use that Djibiti suggests is you can just use the Airflow Bash operator, which allows you to run any Bash commands. So the same way you run Djibiti commands, on the command line, you could just trigger those to be run. Or you could also use Kubernetes operator, delegating from the Airflow worker into Kubernetes pod to run those commands. So this is an example of how the DAG would look like if you were using Djibiti Cloud. So in this case, you have just some operator to declare the beginning and the end. This is an old pattern many people use. You can check the code in this link below. But with the recent setup and cheer down features of Airflow, you don't really need to have those dummy operators. But anyway, in this case, it just starts the job and triggers the job to run in Djibiti Cloud. The challenge with this is, can someone tell by looking at this pipeline, what are the actual models and transformations going on? Let's say you had a project with a thousand models. How could you spot where the problem was? Worse than that, let's say you have a pipeline that takes 12 hours to run. And one of those tasks that takes five hours past, but a few others didn't. How would you just re-trigger the jobs after the failed job? So this approach is quite limited and doesn't give much visibility from Airflow on what is going on in Djibiti. Another approach people use is just to trigger Djibiti Commons via batch operators. So in this case, you again define, let's say, Djibiti seeds where you load from CSV to the database. Then you can trigger transformations and then you can run Djibiti tests. In this case, again, you are grouping the Djibiti nodes by type and running one Airflow operator, one Airflow task for each of those steps. You don't have much visibility and control of what is actually going on, but it can do the job. Then another approach many people have done in the industry was I will write my own parser of Djibiti projects and render that somehow into Airflow. In a way that we can parse what are the nodes of the original graph and then render them in Airflow having some granularity on the process. So there are codes for all these examples I'm sharing. Each of these approaches has their own pros and cons. From one perspective, the two first approaches are quite trivial to parse the DAG. So they are cheap to do the processing every time Airflow tries to parse DAGs and trigger tasks. On the other hand, it can be harder to troubleshoot, to retry and just to do the work. In the last case, you can trigger independent tasks, which is quite powerful. So we've seen many people in the community implementing their own solutions to do this conversion of expanding the Djibiti DAG into an Airflow DAG. During Airflow Summit last year, Djibiti was one of the most discussed topics and there were several approaches. Some people use Djibiti manifest, which is an artifact that represents the topology of the Djibiti DAG. Some people would use dynamic tasks, which Iaric also showed in his presentation, where you can paralyze a sort of map-produce approach with the Airflow. Some people just generate a static DAG and then they don't use actually Djibiti to run the transformations. They convert into native Airflow operators, which can be asynchronous operations, so you wouldn't need to have the Airflow worker node blocked while executing the transformations. And then many people decide to delegate the jobs, so the Airflow worker node isn't necessarily doing the Djibiti commons or the transformations and just delegate to Kubernetes. So those are some approaches. And then at Astronomers, since we have several Airflow users and customers trying to do this integration, during a hackathon, some team members developed Astronomers Cosmos, which was a tool to try to help in this conversion. So the idea was really to have a sort of Rosetta Stone, which could help and simplify everyone's lives. It is under Apache license, so it's open source. And this is how Cosmos translates the DAG. So you can see the original Djibiti project, and that's how it looks in Airflow. So you have really a mapping one-to-one of how the DAG looks like, and the names are quite close to the original names as well. This is how the DAG is, so that's all someone with a Djibiti project would have to write to be able to have their project fully translated into Airflow. So it's quite simple. We have some configuration. These first lines relate to Profile Config. They are optional. It's a feature of Cosmos which allows you to only define your credentials to access the database using Airflow connections, and then we convert those into Djibiti Profile. So you don't have to define things in both places. But if you prefer, you can always give a Profile.channel, and that's it. So the code of your DAG would be pretty much this hlivesofcode below. It uses task groups which I also spoke about. So we allow several ways of running tests. One of them is to actually group with the model. So usually you would define test-based models, and you can have task groups where you would have both the execution of the task and then the test. Because we assume, depending on your configuration, if the tasks for a model don't pass, then it wouldn't make sense for you to continue processing the next transformations. Then there is a demo. Let's see. Okay, we're okay in time. So this is the Airflow UI. I'm running this Airflow instance locally. We have a few DAGs. Some of them passed, some failed. And then this is an example of how Djibiti Converted DAG using Cosmos looks like. With the task groups, the tasks, you can see the code for this here. It's super simple. And then, like I say, we would like to trigger this workflow so we could run all these things. We can click here on the trigger button. You could be using the API as well. You could be using the comment line. And then you can see the tasks being scheduled. Since it's in my local computer with a single worker, it's much particularly quickly. But there you are. You can see the tasks running, executing, and then we can see the ones that are green here. See? The ones that are gray. They are waiting to be scheduled, and then the green ones succeeded. This is queued. These are running. And then there you are. And then one of the things Cosmos allows is for you to actually check, okay, for this task, what was the actual SQL statement as a queued? So you can see the rendered template, and you can understand, oh, this was the transformation. And it can help troubleshooting. So that's the demo. Some of the key features of Cosmos. So it easily allows you to bring GB2 projects into Airflow. You can assume people who are used to write the workflows in SQL can remain writing the workflows in SQL. You can either render as a task group or as a DAG, depending on your granularity. It gives you flexibility if you want to run tasks in Kubernetes or Docker or in the Airflow Docker worker node. And recently there is a PR even to delegate to some, I think, Azure container service. So you could write your own way of executing the GB2 task with Cosmos. You can override how we do the conversion, depending on the GB2 node types into Airflow tasks. So there are a few different ways we support doing this parsing of GB2 into Airflow as well. You can use data sets. So each, I had a convention to that Airflow.to introduce data set to our scheduling. So in the past, you could only schedule pipelines in Airflow using Chrome job expressions or same daily or so on. And the cool thing with data-aware scheduling is let's say you have a pipeline. Let's say you have a machine learning pipeline where you're processing, let's say, some video metadata in one part, user activity in the other. And then you would like to aggregate those two pipelines to be able to train your model and fine tune and do whatever. So now, since Airflow introduced data-aware scheduling, you can define outputs of your DAGs that those data sets are ready. Let's say the program metadata was processed. And then you can trigger another DAG when that data becomes available. And with Cosmos, you also have this. So you can do conversions of GB2 DAGs and make sure that after transformations is executed, other pipelines that depend on that transformation output will be triggered without having to depend on a time schedule. And then since Airflow and GB2 core are open source, you can run this by as many developers as you want without having to pay proportionally to a developer, the amount of developers. And we have a growing active open source community. So here are just some more details on how you could configure the types of operators we're using within Cosmos. So Python operator, V2 end operator, Docker, Kubernetes operator. Initially, we were thinking about since both those are written in Python, our first strategy was why don't we use GB2 code to do this parsing and conversion when we get to GB2 project. But then we realized there are many conflicts between versions of Airflow and GB2. So we ended up not really using the GB2 as a library, but we used other approaches to do this parsing. So one of the approaches is to use the manifest file, which is one of the outputs from GB2. Assuming you have a CICG, you could just output that and you have the DAG topology in a JSON file. However, this method doesn't give all the filtering and excluding flexibility to GB2 offers when you want to just select a subset of nodes to be run. Then we implemented GB2.ls, which would be a way of displaying selected nodes using GB2 itself. But the performance isn't particularly good for this. So we implemented a version with caching. We have our own custom parser and we usually by default try to depend on the user configuration automatically define a way of parsing the GB2 project. So we support select and exclude. There are several different approaches rendering test nodes as well. We allow users to do this conversion of connection from Airflow into GB2 profiles or to give their own profiles the ammo. We try to give as much information for users to troubleshoot GB2 pipelines within Airflow itself. This is the adoption. So in the last month we had around 242,000 downloads and growing number of stars in GitHub. Some of the next steps are exposing GB2 documentation from the project within Airflow as well, improve the performance, and then improve the open lineage integration and a few other things. We also noticed many people have GB2 projects in different repos than the Airflow one. So we're looking to ways of optimizing the synchronization of the things. And there's the one user asked for GB2 Cloud integration, so that's something that may come. This is a PR from, results of a PR from the community where we actually just render GB2 documentation within Airflow using Cosmos. We don't have as many commuters or contributions as Airflow, but I think we're quite in a good spot. So as an example in November we had 20 authors contributing and only three of those were within Astronomer. We have a growing number of contributors and we're promoting community members into commuters. We know there's a lot of work to be done and we really appreciate the community support. The discussions, we currently use Airflow GB2 Slack channel in their flow Slack workspace and we have lots of daily interactions. And the community is each time developing and supports each other, which is super exciting to see. There are a few references there. The slides are on the POSDEN website, so you can just click on those if you would like to see more information, more detailed material, examples of how to run things, and that's it. Thank you very much and I think we have four minutes for questions. Thank you very much. That was a very, very interesting talk and it looks like a very, very good project. Any questions? No questions. I have one question. Are you only doing the integration from DBT into Airflow or also from Airflow into DBT back again? At the moment we're doing from DBT into Airflow, but would you see, because the tricky thing is the features DBT offers are a subset of what Airflow offers. So the conversion in the other way may not be feasible at all, depending on which operators and tasks you would be defining in Airflow. And also, Astronomer has a managed Airflow, right? So our interest is bringing people into Airflow and not necessarily sending them away over Airflow. And maybe a follow-up question to that. Is it possible to continuously migrate from DBT to Airflow so that people can continue working in DBT and then you automatically get the data into Airflow? Yes. So with Cosmos, the current version of Cosmos, it expects the DBT files to be available to Airflow somehow, but this can be done in multiple ways. So if you're deploying Airflow using a Docker container, you can make sure as part of your RCI CD, you would be fetching that during the image build. We also saw, I think, British Airways that are using this, they were actually thinking, let's say from GCS, they uploaded the DBT project into GCS, and then the first step of their DAG was actually to download those files and then use that. So some people may want to have some NFS or volume mounted with the DBT project and make sure that synchronized with the latest. So there are several ways, and depending on the parse emojis with Cosmos, those will be updated in real time. Okay, excellent. More questions? We have time for maybe one or two more questions? No? Well then, thank you very much again. We have a short break now.
A slow migration from Django templates to Vue+GraphQL
Okay, so now we have both students. So now we have both speakers here. We can start the talk, the next talk. The talk is going to be about a slow migration from Django templates to Vue and GraphQL. So Jonathan, Jonathan, Jonathan and Dominic Georg, both Germans, they're going to talk about a system, Alexis, which is a school information system, which was apparently written in using Python Django templates and they now ported it to Vue and GraphQL. So give them a warm welcome and thank you very much. Thank you. Can we get the microphone for the other speakers? Thank you very much. These were speakers, the last one is the prize speaker. Hello, first them and Python Deafroom. We are the Alexis project. That's the all Libra school information system and we want to tell you how we transitioned from a Django app, with a templated web front end to an interactive web front end as the needs arose for one of those in our project and how we did it incrementally. I'm Michael Bauer and I'm a developer at Alexis and I work mostly on the new frontend and the new features we are enabled with that. So with that let's introduce the rest of the team. More of the team. Yeah, my name is Nick. I'm more or less one of the founders of the project. I started tinkering on the school management system. When I was still at school, I don't think I can remember when that was. Today, yeah. I don't know what my role is on the project right now, but someone might know. So. I have a microphone on my own, so I don't need to microphone. That's decent. So I'm Jonathan. I'm the lead developer with the Alexis project and I'm coordinating the dev process and everything connected to this. Okay, so let's get started with the talk. What's about Alexis? What is Alexis? This is a free and open source school information system and it has a free software license, a European public license. So it's thought of as an alternative for schools that they have a free option to manage themselves and organize themselves. It's a modular system, so any school can just take what they need and don't have to use the whole system. And it's also done in such a way that it complements existing solutions. So we're only focused on the parts that aren't there yet in a free software way. It's developed by software developers, but also students and teachers. So we're working together with pilot schools and already have it in use there. The main Alexis features, of course they're divided into these components, but this sort of the main components is the base data management. It's the basis of the schools, like we have classes and pupils and teachers and so on. Then we have a timetable system. It's like a calendar system just for schools. So you can create timetables and you can serve them to the students. So each student has its own personalized timetable. Also the teachers have them. And there's a digital class register to take all the notes and information for classes seating plans. So you can design and show seating plans for the classrooms. And it also integrates with other services. We have a matrix integration, a O of integration, LDAP and CSV, V import, export. And also we just have a calendar system inside Alexis that's producing standard Eichel calendar feed. So there's lots of choice in which ant devices are used to hook up to Alexis. And it's a quite universal system. There's also provisions for student ID cards and inventory management in schools. With that I would like to give over to Nick. He is presenting you the telecom technology stack. Yeah, thank you. Okay, yeah. So thanks for making this nice graphic to help me know how this works. Jonathan, yeah, well, our legacy code base was a traditional Django project and with all the modules as Django applications. When we started basically everyone was doing server side rendering with all the nice templating features of the Django framework. To introduce you to the rest of the tech stack on top of Django, we use PostgreSQL quite heavily. There's a salary task broker and Redis for caching and for synchronizing several nodes when running Alexis in a multi-node setup. Yeah, and for the front end parts, we, as I already said, we used the Django templating engine and some not very well integrated front end utilities like the materialized CSS framework which at the time somewhat allowed for making yeah, modern interfaces following the material design standards, but it started to bit rot quite quickly here and Jonathan will give you some idea about that later. Okay, so that was the legacy tech stack and where is my name somewhere else here? Do I have to say anything more? Yeah, you can see a page in the legacy tech stack. So you have to. Yes, yes, nice. Little overview of how it looked in the past and yeah, I have to say then the problems started. We occurred some very ugly bugs like I think users described to us there if there was like a select menu, depressed an item in the select menu and but actually was selected the item above or below this item. So that was not so good because many users were using iPads. And in addition to strange bugs like this, there was also a problem with maintenance, with materialized as you can see by this issues here. So yeah, there was a big discussion whether materialized will be developed any further. And in addition to these problems, there was also a request for new features. As we spoke about time table planning or sitting plans, we needed some way to do this highly dynamic features in a better way because the control of time table planning is a very complicated thing. Also these customizable calendar views and auto saving views where you don't need to press the save button. It all wasn't possible anymore with our old front end. So we had an idea Nick will present to you. Okay, so yeah, probably many of you know that it's now the new thing to separate front end and back end entirely and make a nice shiny mobile app or whatever. And yeah, Jonathan more seriously already gave a few hints about why we would want to do that. I think there's one other challenge that we faced. Did you mention offline capabilities and caching? No, because you know, Alexis is used in schools and things might be different in other parts of the world but in Germany only two things are certain in the school system. Namely that your mobile network will not work at school and that's the wifi won't work at school. These two things are certain and therefore teachers always complain that they could not use the server side rendered views when they had no connection to the server. So I think this was more or less one of the biggest challenges we tried to solve. So separating the front end actually makes sense here. Okay, so what we wanted to do, we wanted to replace materialize because materialize was stuck somewhere in 2015 and wasn't really developed, it was abandoned. We had a few patches on top of that, I think somewhere even upstream but it didn't get better and it lacked the dynamics that we needed for a really new shiny intuitive interface. Yeah, so what are reactive? All right, yeah. Yeah, reactive front end libraries and yeah, to make the interface, yeah, to not have it reload on every single interaction and yeah, and also a very important idea. Alexis provides a very good foundation for handling organizational data at schools but yeah, we want to tailor to the needs of different schools, of different types of schools. The ideas they have, one of our most important claims that we share with schools when we expand the benefits of free software is that we can make the software work like the school works and we can transform the software instead of transforming the school. So on top of the foundations for organizational data management, the idea was now that if we could replace the front end for some parts, like make a different class register for an elementary school because they have very different needs, we do not have to replace the data structures, the models and the APIs but we can make a front end that is more tailored to the needs. Yeah. Okay. Yeah. This is not my part anymore. No. So we then decided on how we want to do our new technonesis tech so we, as we said, we just took the backend set, okay, that's our backend and then we decided we want to do an interactive front end with UJS and the front end library beautify and some other UJS libraries and we want those both parts at communicate and yeah, we are a graph API. And so this was our plan and there were some challenges with this plan. Oh, just a graph API. So, yeah, let's see again. Thanks for helping me keep up with my tradition. I always give one very good talk before BiaNite and one very bad talk after BiaNite. Okay. So, yeah. So as we already said, the platform is supposed to be very modular. It consists of, don't know, do we have some figure how many jungle apps we had at that point when we started the migration? Like around 15 I think. Like 50 apps that could be loaded dynamically into the jungle project. We actually had quite a bit of magic in there to discover the modules of the jungle apps dynamically. So the administrators who deploy servers for schools could simply install the Python packages needed for the system they want to put together and then everything falls into place in kind of some black magic way. And now this did not turn out so well for separating the front end because normally when we separate the front end, we want to have one JavaScript or whatever application that is delivered to the clients, nicely bundled with whatever JavaScript bundler is the current type. And then it is one JavaScript application. We could not do this because we do not know which parts of the system are used and in which versions. This might be very flexible for every school. So we need to bundle the JavaScript front end application on the machine where Lexus is deployed. Yeah? 10 minutes left. Oh, yeah, thank you. Okay, and you need these 10 minutes? Probably. Probably, okay. Yeah. So the right way would be you have one front end application, one backend application. They are more or less separated in development. They could be developed independently, but we cannot do this because, yeah, what's? Okay. I have to switch the display so you can see this. Okay, this is where we actually generate parts of the bundling configuration for VIT because when we build the bundle, we know which applications are there. We have the JavaScript front end code bundled with the Python packages in the same repository and at deployment time, we need to extract the JavaScript front end code and let it all fall in place like we did with the Python applications, which was sort of a major challenge. So, yeah. Yeah, the microphone is developing, that's good. And then we faced another challenge. We said, okay, we weren't able to migrate all these apps at once. So we had to find a way to integrate the old front end with the new front end. And what you can see here on the Bima is like how the new front end does look like. So there is no real optical difference with the old front end, but it's the new front end and we have had to find a way to put those old pages somewhere in this new front end. And if I just say the word iframe, I probably get some scary faces here. So, yeah, we made it and just put an iframe somewhere in there and then we built some glue, which takes the URL, which is actually called and then called the different URL with a prefix where the old site lives and integrates with in the front end. And that looks like this. So what you see within this container is an old page. And what you see around this container is the new front end. So if you can see which URL is dated here, it has the prefix Django. So it's within the iframe and if I click the button, the iframe will navigate to this Django URL. I will do this and you can see that magically the actual URL from our new front end is also updated. So it's a kind of, yeah, ugly magic. And this also goes one way further. So this is an old view within the new front end and now I click one of these links and it's navigating to a new view in our new front end. So this needed a large bunch of glue to put all this together, but now it's working with some exceptions. Nick will come too. Some exceptions, yeah. So like this iframe with a server-side URL page in the new view.js front end, they are always communicating using some sort of JavaScript message passing. I did not yet understand. Okay, so what are we doing here? This is the dynamically generated bundler config or something. Yes, it is. I don't think we have the time to go into detail about this. And oh, whoa, there's a video. Michael is fine. Yeah, I had to Michael. Yeah. Here you can see the new front end in action and why we did this transition because we wanted to have more interactivity and here you see how you can design a timetable now with the new view front end. Someone's inserting lessons into the timetable and it's highly dynamic and all just works. So we just want to tell you about new problems and I think this last part will also be done by Nick. So, oh, yes, this problem. Okay, we already talked about iframes and how they communicate or sometimes we all know communication fails and that you have Alexis and Alexis and Alexis. And I think this visualizes quite well what sort of trouble this slow migration caused for us but we did not see this too often in the recent time, right? Mm, not too often. I don't think so. Prove me wrong, okay, thank you. We called it mini Alexis. Now we call it Alexis Matroszka situation. If you know what this means. So, yeah, it did just a good. It pops up every month again. Every other month here. All right, so now we have ugly front end bugs for the integration and all of this will be sorted out once we get all applications and all views migrated to the new front end. The JavaScript ecosystem shares some of the same problems we had with materialized situation because you know there's beautify three and it's pretty neat. We needed to migrate to view three. View two has been deprecated for two years or something. Pardon? This year, this year, this is not too far in the past. Okay, but it's deprecated. And beautify three is cool and we would want to migrate to it, but it's still missing the calendar component, the calendar date picker component, right? And we are basically the only thing Alexis ever does is handle dates. So this is somewhat of short, some sort of showstopper here. We hope that this will be sorted out. I think the release date for the date picker is moved every quarter or year to the next quarter of the year of some, but we will see how this works out. Yeah, of course there's an easy solution to the problem and an obvious solution here because we could just do this, right? No tomatoes for me? To get some new problems. And so we are always shifting from one set of problems to the next set of problems. Okay, thanks for bearing with us. I think I'm slowly getting awake. You can find us in the hallway track if you want to get more information and less chaos maybe. All right, do you have any last words, Jonathan? I think we have like three minutes for questions if I'm right. So maybe if someone wants to ask a question, otherwise we also will be available via email. So yeah. Any question? Thank you. I have a question. Why did you think about GraphQL instead of something like Django's framework and exposing APIs and using that instead of adding a new layer in between the front end and the end? Yeah, well, I think we chose GraphQL. Because I think the obvious alternative would be US or something like this. So, but we chose GraphQL because we were able to select what we deliver to the front end. We have like very complex models. And we say that, okay, we just take this set of information for this page. And from the other page, we need a much larger set. But of course, this GraphQL integration is causing us problems with an un-maintained or slightly maintained Django library and things like that. So as we said, another set of problems. Yeah. I think for the presentation, it's not right. Help. Yeah, back to you. I can just be loud. Yeah, just be loud. Okay, I'll just be loud. So thanks for the presentation. I know your pain. I've had to do that job a lot. So my question is, why didn't you, what I've been having success now is the back end for the front end, right? Because all these fancy new reactive libraries now have these meta frameworks, which is an awful word. But they kind of work. And so like, have you considered doing that? So the way I like to do it is you have the new back end for front end. And when they don't know what to do, okay, PHP help, then they just get the page back. So why did, I don't know if you looked at it like, why did you try to keep a single page application? You want to answer this? I can transfer from there. Yes, I would, what was the question about this? Have you taken a look at these back end for front end? Do you like them, do you not like them? Is that? What exactly do you mean? So like next JS, for example, that's the reactive. Yeah, okay. It has one like that. Yes, it's a kind of, we never have been using this. So it's like two years after this migration started, we just thought, oh, we could also use, have used this. So, but now the work is done. We have to go on with this. Our developer capacities are very limited. So yes, it's a kind of knowledge we didn't have. Okay, so thank you very much for the very nice talk. Interesting system. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you.
Django migrations, friend or foe? Optimize them for testing
Hi everyone. How many Django users in here? Raise your hands. Keep your hands up if you are dealing with Django projects with a lot of migrations, with time and continuous integration minutes. Okay, let's talk it for you. Perfect. You are in the right room. Now, I am Denny. I am on your right side of the photo. I work again in JavaScript, Python, Vue.js, Django, everything. It's pain-me-stuff. So let's start with Django migrations. Our way to propagate changes from your models to a database schema and keeping track of them. Let's quickly recap migration commands. So you can use make migrations, migrate, show migration, and SQL migrate. The first one, make migrations, create new migrations based on your model chains. You can use different parameters in there. For example, an empty migration you can customize. You can give a migration a specific name, and you can restrict the creation of a migration to a specific application. The model, for example, if you want to recreate Twitter, we know the reason for that, is this one. You can create a class for a model, and then creating the migration with the command will create a new file in your project in the migration folder with this content. So initial equal true if it's the first migration in your project. A list of dependencies if you are using something like, for example, authentication, or if you are on the second migration in the project, the first dependency is the first migration, and a list of operations performed during the migration. Then you can migrate your migration, of course, using this command specifying an application or not, or a migration name. So if you want to move to a specific point in the history of your migration, you can specify this. So as a new project, you can migrate everything using managepi migrates, and everything is at the last version of your database schema. Then if you want to roll back every migration in a project, you can migrate to the zero migration, and everything is rolled back. You can move to the second migration in your project with this, and without specifying a migration number, you can migrate everything to the latest version. Now how this works under the hood, you have in your database a Django migration table with a content like this, so the application name, the name of the migration, and the date time when the migration has been applied to your database, so everything is on your database. There is a better way to show this, so using show migration, you can have a view of your list of migration in your database, in your schema, with a tick if the migration has been already applied in your database. And then with SQL Migrate, you can print your SQL statement for a specific migration. So with our example, we can display the SQL code for this. So let's take a look at this. A transaction will be opened, every command will be applied on your database, and then the transaction will be committed if no errors. Now if you need to make further changes in your model, you can apply those changes and then create another migration. The migration will depend on the first one, and then the code will be another transaction, the SQL command, and commit. And again, and again, you can apply migration on your database in production using this. What if you need to do further changes, then for example, an every tweet likes and a lot of other stuff, then you can make change in your models, create a single migration, because of course I like to be well organized and structured, so every single change for me means a single migration. Then you end up having a lot of migration like this one. But even worse, if you need to create, for example, a shop app for a customer, then you need to create a model, and then during the lifetime of your application, you need to do a lot of changes to your model structures. Okay, we won't list this, but we had to do a lot of changes, for example, adding tables, switching data from a table to another, to a main table to a detailed table, and a lot of other stuff, changing data during your workflow. So changes can be a lot of pain, a lot of stuff, and when migrations become a lot, then your performance during tests could decrease a lot, because during the deploy is perfect, you can move forward and backward with simplicity, but in tests it's not that simple, because you need to wait for every migration to apply before running tests. And if you are paying for your testing time on GitHub workflows or other platforms, then that could be painful. As a disclaimer, the timing for this talk may change from laptop to laptop, so keep this in mind, but on my old laptop, this is brand new, so it's faster, hopefully, on my old laptop, it was the timing. So running tests on 20 apps like Shop, I just copy pasted them 20 times in the example repository. Test took just a single second, less than a second to run, and that was perfect, so there's no need to do this talk. Well, not exactly, because creating the test database took 20 seconds. So one second of tests for this project, and 20 seconds for database creation. And that was not optimal, because we were on the verge between the team license and the enterprise license for the timing of workflow runs, so between 3,000 minutes monthly, and that wasn't optimal, we wanted to remain in the team license, because it was cheap, and then we wanted to optimize that time. The first possible workaround is to use KIPDB, running tests, and this parameter preserve the test database between runs, and that's perfect, because the first run applies the migrations, and then the database will be kept on your cache somewhere, on your Oculus, for example. If the database, of course, does not exist, it will be first created and migrated, and during other changes in other prequests, for example, migration will also be applied, so everything is okay, hopefully. So this approach saves 20 seconds for us after the first test run. The problem was configuring your CI CD, because a solution could be using cache or artifacts in GitHub workflow, but this takes time to create and store artifacts in GitHub, or, for example, using an external test database from inside the GitHub workflow, but that wasn't optimal, and a friend of mine, or mistaken, suggested me this package, Django migration CI, that allows you to simply configure an external test database, so you can consider this and save 20 seconds if you have an external database. Another possible workaround, one line workaround, is to use in your settings migrate equal false, so if you are using this, migration won't run during the test, and this is similar to set none as a value in migration modules, but for every apps in your project, so it's better this way, so single line change, and this has a lot of pros and cons, pros, of course, single line change, and it doesn't run migration during tests. The problem is it's like make migrations plus migrate before every test run, so this will add in our example repository five seconds of time, so that was the opposite of what I wanted to obtain. So diving into the Django documentation, I discovered this great, great comment, squash migration, and this squash an existing set of migrations into a single one, you can specify your migration name, and optionally start migration name, it will squash every migration into a single one. This was pretty good, I tried this one on the shop application, and I decided to squash every migration into a single one. It was good, not perfect for us, but it was good. The problem is that we needed to add manual porting, because for example we used a lot of functions, manual function, from a migration to another, from a version to another, and that weren't migrated or automatically squashed, so we had to copy paste the function code into the squash migration and make some adjustments. And if we inspect the squash migration file, we can see there is on the top of the class definition a list of things, a list of tuples in the replaces variable. So the first item is shop, the application name, and the second one is the migration name, for every one of the 26 migration. And the recommended process is first squash, keep the old files, commit and release to production, to staging the demo until production, then wait until all systems are upgraded with the new release, then you can remove the old migration files, commit and do a second release. Then last but not least, you need to transition your squash migration to a normal migration, delete all migration files, all old migration files that has been replaced, and update all migration that depends on the deleted ones with the new squash migration, and after everything you can remove the replaces attribute in the squash migration, and everything is fine. Then if you want to clean up your database, you can prove references, so in your database there won't be references to old migrations. Let's test performances after squashing, after spending a week on my work project doing that, and oh no changes, so I lost a week doing that without results, and don't tell my chief. So what's the point? Well the point of squash migration is to move back from having several hundred migrations, five to just a few, for example if you create a branch, a separate branch where you are working you alone, you can squash migrations and propose just a single migration file in your request. I know, I know you wanted to speed up tests, so let's do it. Are you ready? It's not that easy, but first you need to recreate migrations. So let's annotate migrations for a single specific application with show migrations, and then copy paste all the names of your migration files, and then you need to manually create a replaces, you remember this one from a moment ago, you need to recreate the replaces list with application name and migration file name, and store it somewhere in your computer, then move your migrations in a temporary directory, so out of the way, and make sure that show migrations doesn't show stuff. Now it's time to recreate migrations using your application name and a name, a specific name, for example init squash, so you remember that this is the squash migration, and that will create the first migration at your last model version. Then open your migration file, copy paste the replaces array list, you created a moment ago inside your class, then you can restore your old migration files in the original directories, make sure for missing or overwritten files, and then remove the temporary directory. Now with show migrations you need to check that everything is there, so all in this case 26 migrations are there, and the first one, the squash migration is there but has not been applied, then apply your squash migrations and check again with squash show migrations that everything has been squashed and you have just a single migration, and then you can go back to your post squash task, so commit and release to production, upgrade those systems, of course staging demo production, everything else, update on migrations that depends on the deleted migrations, remove the replace attribute, and if you want to bring references to the little migration, and everything is perfect, right? Well, not exactly, if you have migrations providing initial data, you need to create a new migration for that, because recreating migration from scratch, it doesn't create that insertion in your modules, or even better, you can use fixtures, and in the doc you can see how to use fixtures in both database migration and also in testing, and that's perfect, and then you need to be aware of circular dependencies, because if your project is big and grows during the time, you could have circular dependencies from a project to another and backward, and this problem requires you to remove all foreign key, causing the circular dependency, create the first migration, restore the foreign keys, and create a second migration, and this way you will hopefully solve this. Now, let's try to test performances after all of this, after another week spent on the project trying to tell your chief that, oh, I'm working on something useful, I promise you, and yeah, of course, after recreating everything from scratch, our database creation task took five seconds instead of 20, that was perfect. Yeah, it was perfect, but does this apply to everyone? It depends, because if you have really big, big projects and you are paying by the minute your CI CD workflows, and you are on the verge of having to pay $3-4 per user per month, to 20 something dollars per user per month, then maybe you want to stay on the little cheaper branch of this, so that could be a solution, but if you just want to make order in your migration file, then just use squash migration without everything else, or if you want to speed up tests on your localhost, you just need to use KipDB, and everything is fine, without having to spend, in my case, two weeks working on this, just to save maybe a couple of seconds on your project, so it depends on your use case, and we are done, so if you want to see the example repository, it's there with three different branches, if you want to compare them on your local machine, and I uploaded the slides on the FOSDEN website, so they are there if you want to take a look at them. Thank you very much. Okay, we have time for quite a few questions, I see one up there. Given your salary, and these two weeks of work you've done, how many years of enterprise lessons did you avoid? That's a nice question, hopefully my chief didn't ask me that, but I think we could have paid maybe a year, I don't know, one year of this, but yeah, it was fun to play with this, and for me at least spending two weeks trying new stuff, or trying to discover hidden stuff from the jungle. More questions? Good question. Yeah, thanks for the great talk, I was wondering if you looked into using like seed data betas for CI, so that... Sorry. Yeah, you don't hear it? No, I didn't hear you, sorry. If you looked into seed data betas for CI, so that you run your migrations locally, and then dump the database, and then use that database during CI to start off with a pre-migrated database. No, I didn't think about that, it's a good idea, so you just upload your database dump, and then on your... Yeah, so you just set up your CI script to use that database when it initializes. That could be a good idea, I need to try that, thank you. So you restore the database and just applies your last migrations without having to apply everything. Yeah, exactly. Yeah, that's a good idea, thank you. Thank you. I was also wondering if you're using Postgres for example, you can disable fsync that will just keep database in memory, so that probably be a solution for big time. So locally we kept the database in memory, the problem was on our CI CD, so we created a service in the workflow files, and that was creating a database from scratch. So it was just a configuration you can just add on your Postgres site on the CI... We had to consider the time for storing and restoring the database on that configuration from the cache. So it was a little bit of time for that, but yeah, that was an option I tried to... More questions? So very cool talk. I like your method. I basically came up with the same method about five years ago for this approach. Do you think there's an opportunity to create a tool to automate some of this process? Well, that's a good question. Maybe implementing that in the squash migration in some way, I don't know. We could, we can try to do it just to save other two weeks of salary from other people. Okay, I think we're done with questions, so we're going to have another five minute break and then continue with the next talk. Thank you.
Powerflexible cryptography with Python and Flightbox!
All right, so now we have the next talk in the Python Dev Room. We're going to welcome Pascal Chambon, who's going to talk about PyFlexible PyGraph 3 with Python and FlightBox. And it's going to be interesting to see what FlightBox actually is. It says here that it's about encryption-based access control. So very warm welcome to Pascal. And this is going to be a longer talk, so he's going to have some things to show there, so it's going to be nice. Thank you. Thank you. Hello, everyone. I'm happy. I'm excited to be here to talk with you about these little things. So we are going to discuss powerful and flexible cryptography with Python, of course, on something we call FlightBox. You probably don't know because it's all you and not famous yet. So I'm Pascal Chambon. I'm from France. And I am here not as just a freelancer, but as part as a dev of the Witness Angel project. I will talk a bit later about this project, which has led me to cryptography. Whereas I was just another web developer before, and it was all new for me, and I hope you will like it too. So what's the battle we are dealing with here is that some data on many kinds of data need to be strongly protected, really strongly. The problem is that control, access control to this data can have lots of holes in it, lots and lots. So of course, there is a zero-day exploit. You're unlucky. They found a bug in the Linux kernel. You can't do anything. But lots of time is just that you forget to update your packages or there's a failure in the update system. Only in a few months, your nice server is full of little holes, vulnerabilities, even in the big and the major packages like Django and stuff like that. There are also, of course, your bugs in your code, a problem with a previous escalation and protected endpoints because there were deadlines and so something went wrong. But it's not all. You can, if you do backups of your data, because you do backups, of course, of your data, these backups can have a big source of trouble if they fall in the wrong hands. So they must also be protected. These are also pirates, your credentials to learn because they are on a post-it on the screen. It happens in lots of societies. And of course, human malevolence and human errors everywhere, phishing, stuff like that. So there are lots of attack vectors to your precious medical, personal data. So we need something better than that. We don't want each update of our server to be a cause of potential drama. For the anecdote, I have a cheap web host for my own little blobs. Like two or three times in my life, I arrived to my website and the root of the web server was exposed, the whole file system, so my data, and the data of thousands of other Django users because of a little mistake in an NGNX configuration. Now we have containers, so it's a bit better, but still, at the moment, I was a bit upset even though I have nothing important on my personal blobs. So that's why I spoil the result. We need cryptography. So let's go a little tour of the basics of cryptography. Encryption, of course. So we use what we call a cipher, cryptographic algorithm, to transform a plain text content. It's not always text. It's often not text. It can be video, image, documents, whatever you want. Encryption turns this into what we call a cipher text for the vocabulary, and the cipher text is not text at all. Don't try to read it. If it's readable, there's a problem somewhere. It's just zero and one in random order. Not random, but incompressible. Then there's hashing. Of course, most of you know, it's just taking a fingerprint, a little representation of a potentially very big content. And then there's signing, which is a way to authenticate to check the integrity of a content on most of the time to timestamp it to know where, when it was created or updated. So that's the basics of the vocabulary. Now let's talk about the first cipher type. It's the symmetric cipher. So what is it? It's a box, but a box with a keyhole. So you have a big key to go with it. So you use the key to close it. You use the same key to open it. That's why we call it symmetric cipher. This key is supposed to be 100% random in symmetric cryptography. And the problem you have, if I give you this box, you can't open it. I have to give you the key. So you have this awkward exchange of passwords over SMS or email. You know, here is the password of the zip I gave you. That's because of symmetric cryptography. Most of the time we use two channels. We should, that's the minimum we can do, use two different channels because if I send you the chest and the key, that's very interesting. And we have another metaphor, which is a cryptex, a little chest with the password written on it with little rings. And when you put the right rings, you can open it and close it. So that's the same thing. It's symmetric too, but it's not a key. It's a password, so cryptex is closer to digital world. No time for details. The slides are online. You just have to remember these two things. The winner currently this year is AES. It is everywhere. So if you need symmetric cryptography and you don't have time to think a lot on no precise needs, you take AES. Why? Because it's very famous, very tested, very secure, and its performance is especially great. Symmetric ciphers are very good at performance, but they are hardware accelerated when you, when they, you can. And AES has extensions to be accelerated on desktop platforms, servers, and also little chips like ESP32, stuff like that. They have extensions to do encryption very quickly. So no problem to encrypt your whole hardware with AES or another symmetric cipher. It just works. And it's quick and it's very hard to see the difference with not encrypting, at least in my tests, or little chips. Then there's the other one. If you have symmetric, of course, you have asymmetric. What is asymmetric cipher? It is another test so far. Same thing, but this marvelous invention that I'm showing you, showing you the most expert-noddy things, it's a padlock. It's marvelous. Why? Because I can give this open padlock to anybody. And they will take this, I can distribute thousands of them. They can put something, I don't know, here is some data, for example. Most people will never, will not use any more what it is in a few years. But we put it in the box, we close the box, we lock with the padlock, and I have this key, I keep it, it is a private key. It's not a secret key. Secret key is secret, but it's a secret you can share. Private key is private, it's my privacy, it's intimate. I must not give it to anybody. So I keep it. And we have asymmetric encryption because anybody can lock, anybody can trap data in a chest. Only I, the rightful owner of this key, can open the chest. So the use case is a bit different and very interesting, of course, in lots of cases, because you don't have to transmit your private key. You just transmit what we call public keys in the digital world that are actually, we should have named them padlocks. But it's a bit too late. We can do a petition, I don't know. No time for details. Just remember when we talk about asymmetric cipher, we talk about RSA most of the time, even though it's not the best from a purely mathematical point of view, it is here to stay. And the problem of asymmetric ciphers is that they exploit mathematical operations that are heavy, so the performance is bad. Do not encrypt your hardware with RSA. Or unless you have very much time in your hands, it's not made for that. So far, so good? OK. Then we have a little talk about digital signature. So it's the good old stamp. I hope you all have a stamp because it's so classy. And you have a fingerprint because it's a heavy operation, like asymmetric, it's kind of like asymmetric ciphering. So you apply this fingerprint on a content, and you have a magnificent seal, and you have a verifier, which is your eye in reality, or a public key in the digital world. And once you have stamped something, you have a proof of integrity, of authenticity, and also of anteriority if you put a stamp. But it's very hard to have a posteriority proof, almost impossible actually. That's bad because we would have loved to have that for the Whitney Central project. It's just impossible. But anteriority is already very good. You can show that this document existed before this day. So an example are PIS, DSS, they are not very well known. Most of the time, we don't care so much. We use a trusted signer to do that. So that's signature. And here's my preferred primitive, what we call a primitive, a little operation. It's a shared secret of Shamir. So I don't know if Shamir was the only one involved, but he managed to stick his name on this achievement because I love this algorithm, lots of people love it. It's something we don't do much in real life. We have some data once more, and we want to share it into what we call shards. But we distribute these shards to N people, and only N of them, a smaller number, are enough to reconstitute the secret. So it's not like when you cut a cake, it's more than that. So here is an example of shared secret for a barrier. I don't know for leading cause to pastures and stuff like that. Each people has one key. Each co-owner of the barrier has one key. And each time you open a padlock, you can remove a little part of the lock. And when enough of them have opened the lock, you can pull the bar and you can open the barrier. So I think most of the time here M equals 1. So one person is enough to open the barrier, but you can do fancy stuff like at least three people must come until we can open, unless we can't open the barrier. That's the secret of Shamir, and here is another example, and I love it. You see you have some weird lock on the door barrier once more. Any person that has one of these locks can open his or her lock and then slide the bars in all the senses, and it opens the whole lock. So it's unusual, but it exists, and it avoids the problem we all have in our buildings. You know everybody has the same key for the common parts like the trash room and stuff like that. And when someone does dirty stuff or gives a key to somebody else, we have trouble, we are forced to all change our key. So the secret of Shamir helps with that in the digital world. You can give parts of your secret to some people. If some of them disappear, it's not a problem. You get back your data. And what happens in real life? We use hybrid encryption scheme, as we call it. So we use the performance of symmetric cipher. It gives you a big key, a random big key. We put it in the little box, and then the person who has access to this data keeps the private key. And then we stamp whatever we want, the data, the cipher data, cipher text, plain text. There are lots of different cases. Okay, so that's already much for primitives, but we can already do much with these four primitives, these four concepts. What did we learn when we studied cryptography? Once more, we came from a web developer background where cryptography was not our problem. It was dealt for us by frameworks, by web servers and stuff. The first thing is that cryptography is dangerous. It can be harmful. Main lessons. So that's maybe the most important part. If you want to do cryptography, do not try to implement them yourself. You will get hurt. Other people too. Just trust the big experts, the big libraries, pycrypto dom, the main 10 wide, of course. Libsodium, OpenSSL and stuff. They are not perfect, but they do much better than you will do yourself. Second point to know, the order of operations is very important. So even when the primitives are strong, if you mix them up, you can have useless results or bad results. In this case, all I found was reading and reading and reading blog posts on the article on Stack Overflow posts to understand if I had to sign before or after I encrypted. Very important. Do not use the same key for different purposes. For example, my RSA key, if I use it to encrypt and to sign, I have just given my key to my enemies. So that's a very bad situation. Same thing for what we call initialization vector, non-seize, that are values that are supposed to be used once and only once. Most of the time, even if it's not mandatory, do it. It's so cheap that why not? Why not do it? Of course, when we talk about randomness in cryptography, it must be really random. So don't use a random source, if I can see that. Use a cryptographically proven source. On our desktop, it's not a problem. Really, our operating systems do a good job finding randomness with the hard drive, on the CPU lag, on the audio, the microphone, there are randomness everywhere, what we call entropy everywhere. But if you're on a little chip, on an embedded device, it becomes a real pain. And there are devices for that, devices that just create entropy, randomness. And sometimes that's the only thing you can do, because an embedded world is hard. And also, sometimes you have to let it go, like for electric curve cryptography, I was searching for the best curve, the word curve, to use for my cryptography, and there were endless debates. So in the end, just pick one which is about good, and don't try to be perfect, because there's always someone who will discuss and say, what was provided by the NSA, so we don't know what they did with it, things like that. OK. Now, some good news, because we don't want to get hurt, we did something for a reason. Cryptography is strong, and when I mean strong, it's really strong. It's not our usual strong. For example, if you want to break that chest, symmetric cipher, there's not enough energy in this room to break it. OK. Not enough energy on Earth, not enough energy in the solar system, so far so good. But there's not enough energy in the galaxy nor the entire known universe. So, good for seeing this chest, even if you are put very, very interesting things in it, like cookies, it's not worth it. We don't have enough universes to break most of the chest. So that's what we call strong in a cryptographic sense. And that's good for us, because we have data which is sensitive. If we want to break the little padlock, it's easier in some way. You just have to be a mathematical genius. You have to break discrete logarithm or factorization of integers or elliptic curve cycles, something like that. If you manage that, not only do you break the entire payment system on the Internet, so you will have trouble, but you also get a field of the middle, like the Nobel Prize for Mathematicians. So try it, but we are kind of safe, and we will know it. When someone will have broken this padlock, everybody will soon know it, and we will have to change everything very quickly. So it's not easy, still not easy. OK, so far so good? No question? It's time to innovate, because everything I've just said is common knowledge in the cryptographic world. It was only new for me and my fellows. So what is the witness-essential project that I've been dealing with? It's a black box for humans. When a plane crashes, we want to know what happens. When a human gets robbed, raped or worse, we want to know too what happens to put the right people in the jail or not the wrong people in the jail, things like that. So we want to get rid of judicial errors, we want to get rid of all these cases where people cannot get justice because they have just no proof, because in most of the time when people aggress you, they try to not do that in a public place like today, like here. But if we put spy cams everywhere, we have another problem, we have no more privacy. Our Libertbertese can very quickly go downhill. So our secondary goal, which is as important as the first one in Witness Angel, is that we want to preserve privacy and get proof. So it's a tough challenge. We need innovation. So we invented Witness Angel recorders. What's the concept is that you record stuff, okay, but nobody can read what has been recorded and encrypted. So it's a write only device. You know the read only logs, read only files on your system. That's the contrary. Anybody can write to this file, but nobody can read. Almost, of course, nobody. So it's a bit like, you know, the good old magician bags. That's something, I love it. You have some data, you put in it, okay, so far so good. But you cannot have it. That's the fun part. So my data, what do I do? So we have both security and privacy, okay? We have what we call a revelation. So I will have some of my key guardians, my trusted third parties, my mentors, here to grant me authorization because I have been robbed, by example, on the way home. So I need proof of who to bring to the judge that someone robbed me. So I, of course, must grant authorization because I am the victim and I am the owner of the data in here. The key guardians, at least three of them, among five must grant authorization. We need redundancy because some of them are in vacations. Some of them have lost their key, their password. It occurred a lot of times to me and my key guardians. And we have a special case. If I get murdered, I hope not, but we never know. We have a special assembly of wise people and six of them, among the six of them, four of them must say, okay, he has really been murdered. So we have this special right to access his private data. So when we do that, we recover the data. Only in this case, only that data, not the other CDs that are here, only the data of one hour before I was murdered, something like that. And here I present to you, Meet Shidi, the little mascot, little symbol of our project. It's so cute because it's harmless. It can help you. It cannot betray you. It cannot leak your new files, things like that. So it's a big technical quest, of course, because we want a high-security level, lots of different ciphers, because if someone breaks RSA, we want to be safe anyway. We want multiple key guardians, what we call. So some of them must be mandatory. They must give their authorization like myself. And some of them are optional. A certain number of them must say okay, but not all of them. So we are thinking about creating eBAC, which is encryption-based access control. We know role-based access control or relationship-based access control. There are lots of conferences on that. But here we want permission to be based on something very strong. So it's not very flexible in a sense. In a sense that once it's in place, it's hard to change it. But at least your data is very secure. So we have begun the reflection. Let's begin by chaining symmetric on the symmetric cipher. So what we call chaining is just I have a chest with a cd-knit or USB key. I put it in another chest. And I close. And I lock with my big key or my padlock. And it's a logical hand, because you need both keys to access your data. And it's like the Matryoshka doll, the nested dolls. You can put as many chests as you want in the computer world at least. And it's an easy way to chain our key guardians. So that was the easy step. And we were a big stuck. How to do a logical or? I have already spoiled the result. We go back to that dear Shamir. And we use the shared secret algorithm. We showed a threshold, like m, and n, a count n of key guardians. And we split our key into as many parts, and we give them a round. And then we have different cases. If m equals 1, if any key guardian is enough to open the chest, then we have the logical or we wanted. If m equals n, it means that all key guardians must give their shard so that we can open the chest. We have another kind of or. So we don't even need the nested chest. Shamir can do that too. And if m is strictly between 1 and n, it's the shared secret. So something unusual, but very, very important for us, because it's securing the data and still having a workaround in case of disappearance of one of the key guardians. Let's wrap it up. Let's put it all together. So our data, where is it? It's here. It will go through several symmetric ciphers, because data can be very big, can be petabytes of data. We can put it in here. Then we take the key on the part of key, the shard. On each of them, we will recursively make them go through symmetric ciphers, asymmetric ciphers, shared secrets, etc., etc. And at every step, when we want, we can timestamp stuff, authenticate stuff. That's a separate operation. That's what it gives. It can be a little scary at first. We have a cipher tree. So our key text here goes through AES on chat.20, which are two symmetric algorithms, and becomes a cipher text. That's the easy part of the pipeline. It can be very quickly done with little chips. They can pass the data, unencrypted, and put it on SD card on disk. And here is the funny part of the algorithm. It is, for example, the first AES key that was created. I have protected it myself, me, with my RSA identity. I didn't trust myself enough, so I asked for mom to, for her identity to protect. Also, we have two nested chests on my mom to protect the first symmetric key. So that's done. We have one layer of security, which is already quite strong. But that's not enough. Chat.20, which is a byte stream and crypto, has given another key. And this time, we want to rely on three-street third parties, key guardian. So we encrypt that key through AES in mode EAX. And it gives a shared, we split the resulting blue key between John and Jane. Then we use AES in mode CBC, and it gives a key that we encrypt with Jess. And then the initial key, we make it protected by Jill. So it's a bit complicated, but in the end, we have six key guardians, me, my mom, and the four other people, that we are all protecting in different ways the same secret. Now, there's a little thing a bit weird. Why? Me and my mom, we just tacked our chest, one in the other, one in the other. Why didn't we just do that for John, Jane, Jess, and Jill? Why did we complicate ourselves with other symmetric ciphers? Here and here. That's a bit weird. There's a reason for that. It's a bit just a trick, it's just for ergonomics when we decrypt. Because me and my mom, we have a problem. We have nested our chest. So when we want to decrypt, my mom, if she's in vacation, I must wait for her to come back and open her chest, and then I can access my chest and open my little padlock, which is it. And so it creates a dependency link. That's why we use symmetric ciphers here and there to create new random keys, symmetric keys, the blue one, the red one, and the initial green one. And that way, each key guardian can have his or her shard to protect. It's just a way to remove dependencies between them. They are all leaves in that big tree. They are all on the external path, not stacked, and so it makes an encryption easier. But on a purely security point of view, it changes nothing because E-A-E-S, A-E-S, sorry, is very strong anyway, doesn't change much. You can add it, you can remove it. OK. All that travel, all that road, let us do something that we wrap in a little package and we call Fightbox, and we even have a nice little logo. That's about it. Fightbox is how to protect one piece of data with multiple key guardians with different access rights. So, of course, we need something more concrete than just these wild ideas, so we have a little workflow. A key guardian is someone who will generate a key pair, actually a set of key pairs, a digital identity, and like usual, like in PCP, we'll publish the public part, the padlock to a registry, and keep the private key to himself or herself. Then we have a crypto conf, crypto configuration. We have chosen the extended JSON format because it's so funny. It's a way to store what anything you want about in a JSON. It's used by PyMongo. And this will describe, this tree will be described in JSON format. Then we have recorder devices, which will encrypt the content using all these identities that are available on USB keys or Web registries or stuff like that. That's the first part of the workflow, and it gives us what we call with much originality, cryptainers. There are containers that contain both the encrypted data, the ciphertext, and the metadata. So it's self-describing. The tree of key guardians is stored in there with the random UIDs, with signatures, integrity tags, and stuff like that. This cryptainer is like a little box, very, very safe. You can store it anywhere you want. You can store it by mail. It's very secure. You can even over encrypt it, put it in another, of course, cryptainer if some day the security ciphers have evolved, which will be, of course, the case. And when you want to decrypt, then the key guardians, they each receive their little box with their padlock on it. They will open it if they want, of course, and return the content via a secure channel, something a bit like PGP, but we're using, of course, our already existing system. This way we can retrieve all the little bits of keys and get access to the data. So far so good? Okay. Now let's go down to real business, because so far it can be just vapaware and things like that, but we can already use this in practice. So how do we do that? We have a reference implementation in Python, WA CryptoLib, which we have worked on it for years now, maybe three years. We have changed our workflow. We discovered the Shamir after one year, so we had changes to do. And it has been recently audited, so I'm very, very happy, because some young crypto analyst, Raccoon, has looked at the code and said, okay, you are not screwing it up. It works. You could do better with this and this number of iterations of PKBF. I don't know what. So he gave us a list of little preconizations, but we have not broken ourselves like you can easily do when you don't, you're not sure of what you're doing. And we were a bit scared at the beginning, but it told us it's okay. It works. The CryptoLib has grown up to be quite big in some aspects. So we have utilities. It's over by Cryptodome, the very well-known cryptography package in Python. You can, of course, generate unsalized keys in the PM format that you already use on Apache and Genix and stuff like that. We have some utilities to manage identities of key guardians, to group them, to query them. We have the support for USB devices because it can still be considered a safe way to transfer stuff when you really don't want anything to be online. We have a little GZNRPC client with a custom exception translator that I'm very proud of. It allows us to communicate with our server and see the errors very, the errors very easily. And we have lots of sensor tooling. So it's, it's more, it's more complicated than I thought. When you have to push or pull data from camera, make microphone, GPS, it's always different. No, no, no sensor works the same. So we had to make a pipeline to extract data aggregated for the Polytar format and then push it to the encryption pipeline. And, of course, we have the cryptanal encryption, which is the root of the system, and ways to store and decrypt these cryptanals. So here is the hard way using the Python API of WA CryptoLib. What do we do? We have some very sensitive data here, ABCD. So first we are going to load our CryptoConf, which is in that file, on this. And we have utilities that deal with encoding stuff like that. Then we create a cryptanal storage. It's a high-level structure to, to put lots of cryptanal together in a directory. And we give it what we call a crypt keystore pool, which is a set of key guardians, actually. And then we tell the system, go encrypt my data into a file called rake with this configuration. And if it succeeds, and it should succeed unless some key guardians are missing, then you have your cryptanal. So, of course, you must import your key guardians first. We have lots of other levels of difficulty to manage more precisely what you encrypt and maybe not store it on disk. Maybe you want to stream it directly to a server. On encryption, basically it was one shot. So you had your mp4 file, your video file on your encrypted it. But when we did some tests, of course, on Raspberry Pi, it was not happy with gigabytes of file for some reason. So to preserve the memory of it, we have been first to introduce the proper way of doing it, which is streamed encryption. So you have packets of data which arrive. You encrypt them. You dump them onto the disk. And like that, the memory usage stays very slow. But API is only for pitoniers and pitoniers that want to dig this deep. So we have a command line interface since last year. And with a click, for those who know, a marvelous little library. So here is how to do it with a command line interface, which you can download for different, you have binaries for different operating systems. First, you begin by importing your foreign key stores that are the digital identities of your key guardians. Here, I just showed, imagine, that I have plugged a USB key with the identity on it. And then we list the foreign key stores on UC. We have two different ones. We have AAA, which is, of course, a test key, and John Doe, which could be a real key guardian. They have seven on three public keys respectively. And we have no private keys, and it's good. Unless we are testing stuff, we don't want any private keys in our local repositories, because in this case, they are not private anymore. Now that we have key guardians imported, we generate a crypto conf. So a crypto conf, you can, of course, write JSON by hand. It works very well. But if you have a simple use case, you can just use our generate simple command. You add key guardians one by one, you specify share secret, and it makes a simple tree. And here, we summarize this tree. What did we do? We added one EAS layer of symmetric encryption, one chat at 20 layer. And inside the first layer, only the local device is holding the key, so it's very insecure. It means the keys are directly on the PC. On the second layer, we have used two what we call automticators, which are remote identities, really secure remote identities. So this layer is really the one protecting our data, because as long as you use local keys, you are not protected at all. Yes, a little bit, but not much. And then, it's time to encrypt. So I encrypted my readme because why not? With this crypto conf, I had just created. And then when I list my cryptator, ta-da, readme.rst.crypt. Maybe we will find a better extension later. It has a size a bit bigger, of course, than just the plain text. It is offloaded. What does it mean offloaded? Offloaded just means that the metadata and the ciphertext are split into files. And it's a follow-readme, we don't care, really. But when it's a gigabyte of video, you're happy to not put it into a json. Because when you open a one gigabyte json, even your PC will not be happy. That's why we offload most of the time. The cryptator is created very quickly, and then the data is streamed directly to disk by the encryption pipeline. And then we have lots of other commands, of course. We can purge cryptators, for example, in France. We cannot keep more than one month the video protection data from the mailbox hole, for example. So we purge them. We can validate cryptators. We can decrypt data. And here, will it work if I call decrypt? If all my keys are local, just for testing, it will work. But in real life, no, I have no permission. I don't have the private keys, so it will put some log and explain to me, a long log to explain to me, I tried to decrypt with disk gigarian, he was not there, this one is there. I need two, I have zero. Your data is unavailable. And that's what we want. We're happy when we don't manage to decrypt. On the third kind of use of Flightbox for now, it is a standalone program. So we have it, we have binaries for desktop environment to play with it. So it's with Kivi, a nice little framework, very cute and cross-platform. So with this tool, you can record and you can also put it on Raspberry Pi. It's compatible to have a little, for example, a camera station dash cam, things like that. And we have little interfaces to import gigarian from the web or from USB to manage the cryptators, to launch the recording on specify which IP camera to use, and also a little workflow to aggregate permissions until you can decrypt. So that you can play with it in the recorder software. That's key guardians. We have not talked a lot about key guardians. There can be anybody in my building, we have five key guardians, Mr. and Mrs. anybody. So that's why we have thought about them and we created mobile applications with Kivi again. Well, I tell you it's great. So this is Python working on Android and it also works on iOS. So it's unusual. We had a very hard time doing it, but it was fun and it works for now. So no policy change occurs on Android or iOS. So you have your little program which allows you to create your digital identity, publish it, check it, check that your password, your password phrase, as we say, is still working. And most importantly, you can manage the permission, the request that you get and say, I authorize you or not. So I have been installing parcels in my mailbox and two times the neighbors have given me permission to see that it was a deliverer that stole my precious parcels and so the filters for Roomba, you know, they stole, but at least I knew what happened. Thanks to this authenticator. So here is the dash cam, for example, Raspberry Pi Zero with a little screen. It works. It's just not very simple. It's a bit clunky. And for example, I can film anything I want with this one because I lost the key years ago. So I can dance naked in front of it. I have no risk at all. Nobody will ever be able to decrypt that unless some aliens invade Earth. But that's another trouble. So we have some prototypes to put in a handbag, some prototypes to put in, to replace dash cams of cars, and also to replace video surveillance, video surveillance in, yes, everywhere actually. All the cameras we have in the streets that are deeply insecure to me can be replaced by witness angels on this way when there is someone burning your car. You ask the neighbors, but they're key guardians, can I see who burned my car? And they say yes or no. If they say no, maybe you have a problem with your neighbors. But you see the principle, privacy first. And you can't do it secretly. You have to ask publicly, hey, girls, I have a problem. So we have a flight box in Python, and we could make it work in micro-Python, but micro-Python is still huge for tiny chips. So we are currently re-implementing the best part of it, the encryptor in C. That way we can put it about everywhere actually. And we have a huge challenge because we want to do a portable shivy, a little familiar, a little pet. You have in your pocket, you have on your J-wall, especially if you're a girl, you go anywhere with it, and if you have some trouble, because you can have some trouble even in France, then you can go to the judge and show your J-wall, and there will be an angel in it telling the judge what happened with proof, with lots of proof, timestamp and stuff like that. It's a big challenge, so it's a little moment where I ask some of you if you have some contacts or some free time in hardware development in open source, cheap creation, we are starting this project and we know it won't be easy, but it's necessary. We have to last one day on a tiny battery, some Chinese product do it, or they pretend they do it. We know it's possible, but it's a huge challenge because we have to optimize everything from the sensor to the SD card. So we use Flightbox for justice, but there are lots of other cases. You can use it for protecting your credentials. That's a hashicorp, does it with the vault. They have a vault when you have to output several passwords from several persons of the enterprise to access the server credentials. Same thing for your couple or family photos. You can decide that me and my wife, we both have to input our passwords from very sensitive documents that I could give if I had a gun on the head, but she would not be there. She would not have the gun on the head too, possibly. You see, double security. And also we can replace file transfer systems with Flightbox, a little Flightbox, no need for thousands of Kigadian. And there are lots of businesses probably to do about it. We just don't know them. So if you have ID, go take it, it's open source, and make money with it. That's the part. Thank you for listening to me. Thank you. Thank you. Thank you. Thank you. Thank you very much for the really interesting talk. So now we all know what Flightbox is. Are there any questions? Do you have any hardness proofs? Currently, you said you had a recent audit, but proper proof of your library. You mean certification? Not exactly, but cryptographic hardness proof, like some publication that people can analyze. Yes, the code is all online. But we have not had a big name certificate, just a hacker from internet. But yes, we try to certify as soon as we can. But no, we don't have a big enterprise saying, okay, it's safe. Okay. Yet. There was one more question. I'm interested if this is used only for files that are already saved, backups, or you can use it in the, for example, a relationship database for the data inside. While it's created and read it. So the question is, can it be used in a database that's it? Yes, but in real, in live database. The Flightbox, the core concept is that that data can be written very easily, but it's hard to read. It's even very long. So in most cases, it will not work for a live database that you use. But I am currently brain storming myself because I would love to have a special database system, like SQLite, that you encrypt with Flightbox. And when you have enough authorization, it goes live. It comes in memory, like in memory SQLite. And you can access this until it shut down. But it means you will be granted access temporarily to a live instance of your database when your neighbors, your boss, have said, okay, you have one hour to do stuff with the database and then it's gone. But it's a very specific use case. Yes, we cannot use it on your live HDD or live database. Thank you for your presentation. It was awesome. Nice examples from real life like logs, chess, pads. It was very awesome. My question is half psychological, half technical. Because we live in a very dynamic life, I mean the relationships. And I would like to ask how it looks like the process of management of the accesses. How to add new angel, how to remove angel. So yes, about the management of these permissions, actually. That's the hardest part. That's the hard part with flight box because it is meant to be once that is encrypted. My neighbor was unhappy because he wanted to reset his passphrase. I had to explain to him that there's no reset because nobody can recover the data when he has forgotten the password. So we can easily add over encrypt and add key guardians. If we want to remove or switch, we have to decrypt everything or at least this layer, you know, up to the layer on re-encrypt. And there are homomorphic encryptions. There are things that look very promising to modify, but I really don't know them well. I know there are some researches and things like that. But for now, maybe for a long time, the only option will be to decrypt and re-encrypt. More questions? Hello, thank you for your talk. Looking at your hardware devices like safety camera and protected with angel, my question is if you do something to protect the channel from the camera before it gets encrypted because a nasty actor that have physical access to the camera can niche on the bus. Yes, so there are lots of aspects we are thinking about concerning the hacking of the device. So there are lots of ways to hack into the device, but it's much easier to buy a spy cam on the internet and wear it. So the spy cams are already there. You can already use them on an event you could fake that it's a witness angel. That's a problem we are thinking about. That's why we want full integration. That's why we want a witness angel chip that contains the camera sensor. And even that signs with a trusted platform model that signs the frames as they go by. Because now that we have the era of fake news, anybody can indeed create a video of you saying or doing whatever he wants. And we would like witness angel to be a countermeasure to that by having devices that have their own signatures and that show that they are well taken by a witness angel. Of course, there are always a simple solution to film a screen showing a video of fake data. So it's a problem that for now nobody has resolved, but we can go as far as we can. And the farthest we will go is having our own device with all integrated. And if you try to open it, it explodes. That will be the goal.
How can we trust 3rd party code? Using Python to understand the trust relationships within the python ecosystem
Hi everyone, good afternoon. You're hearing the Python Dev Room. So we're going to have now Nigel Brown speaking about how we can trust third party tools when we have Python. Many times we don't realize all the dependencies that happen when we install a package from PyP and Nigel will be talking about how we can avoid getting some dodgy packages and things. Thank you very much Nigel. Thank you. Thank you. Right. Okay. My name is Nigel Brown. I've been programming since 1981 as a kid. I got a job about 12 years later. I've done mobile devices, security, data, lots of different languages. I currently work at a company called Stack Lock where I'm doing some data science and some engineering. If you're interested in the supply chain and frankly who isn't these days, you'll love Stack Lock. You should check them out. This talk covers some of the ideas that we've been grappling with there for the last nine months or so. Okay. Here are some supply chains attacks and recent examples. I don't know much about these attacks. I'm not a security researcher. Every time I read about one, I feel vaguely uncomfortable. These are things that could apply to me on the whole. And this is why we're looking at these things and the flames prove that they're scaring things. Okay. So, recent lawmaking legislation. We've got the executive order 1428 in the States and cybersecurity resilience proposals. The EO pushes S-bombs. What's an S-bomb? The software bill of materials. It's probably a bit too much detail to go into right now. Let's look it up. There's tracks over in the building about this. The S-bombs are more of a first step than a solution. They're a step in the right direction. Creating them sounds simple. But the practicalities get in the way actually and doing something with them is more of an art than a science still. They are progressing. The key point is the responsibility for the security of your code is shifting towards spenders. That means it's shifting towards you on the whole. There's some more scary flames there because that's quite scary. Okay. Supply chain attacks aren't you. It all boils down to who and what you trust. The key point really is that security, insecurity most often comes from behavior rather than the technology. Why are supply chain attacks becoming more fashionable? Maybe it's because they're easier than they used to be. Perhaps maybe everything else got harder. Okay. Perhaps they were always there and we just didn't notice. I don't know the answer there, but there is a lot more focus on them these days. So a word on trust. Basically, we want to trust some third party code. That circle represents us. We're victims, skateboards, developers. The supply chain is how this code actually gets to us. We generally get code delivered as some form of package. And that package, the source and the package have to live somewhere. Sometimes they live in the same place as in go, which is a very good example. Sometimes they're different places. Some other package repository. These can be private, but we're talking mostly about open source. Important point, we have to download it. These are all potential failures for the software supply chain. Of course, we have multiple versions. They're changing all the time. They're moving target. And there's normally tags in a source repository that point to the different versions. And these are delivered as a bag of files to us on our laptops or our servers. At this point, we can scan them. We can do vulnerability scanning and we can do static code analysis. We should do that. Definitely should do that. And the code has owners. And the point here is that you can't really trust code. It just is what it is. It's the owners you're trusting. And the question we're faced with a lot of the time is do we trust the right people? And it's not just the code owners. There are multiple other people, contributors. And we trust those people because of their reputation. Reputation comes from several sources. It comes from various media. Personal knowledge, you might know some of the developers. They're all quite often we trust in a community of one sort or another. And companies have reputation too. Sometimes good, sometimes bad. How do you trust a company? If you've got close source, that's the only trust you've got actually. The web of trust here is building up. Now, turtle is all the way down. It's an expression of infinite regress. I heard it once. I thought it would be a good metaphor for this stuff. It turns out, I'm looking for an image, turns out Cole Kennedy thought the same thing. So I nicked his image because it displays this quite well. The average middle size, medium size project has about 1500 transitive dependencies. So you depend on something and it depends on other things. And you can investigate one package at a time. You can look at its origins. You can look at the people. You can perhaps do a code audit. But doing thousands of them is hard work. It will just take too long. Now, we probably want automation to help with this. And that's one of the things that we're working on. Trying to give this thing some oil to keep it going. So this web of trust, the supply chain can be attacked at any point here. And it can break at any point. It doesn't have to be attacked necessarily. And also the main point, there's thousands of ways you can draw this diagram. It doesn't have to be like this. But there is complexity there. And it's messy. So what do we do about this mess? Okay. So what we currently do, we really like to see these. And that's because we can count them. And we can fix them. We can show improvements. They've been guilty of a little bit of a misdirection actually. In reality, only about 2% of these are exploitable. So if you're not careful, you end up doing a lot of work that you don't actually have to. This comes from Red Hat Report. I've seen other estimates of this 2% value. And they are similar sizes. Okay. Another thing you can do, static code analysis. Currently, it's mostly signature based. Finds things that we've found before. We think there may be more legs in perhaps grabbing the source, grabbing features from the source code and running it through a neural net. And this may or may not be more effective. There's still, there's lots of research out there. But there's still lots to do. We think we're going to be doing some of this work ourselves at Stack Lock. So, but that's more for the future. I mean, but the criticisms aside, we should definitely do CVE monitoring and static code analysis. Right. So don't take anything I say here. This is an excuse for not doing these things. Okay. So another idea is to look at metadata rather than the code itself. Okay. So that would be descriptions of the package links to the source repositories, activity around it, et cetera, et cetera. This is like a bit like a classic security traffic analysis or perhaps in bank fraud detection. We're looking for behavior around the package rather than the actual code itself. Okay. So this is a graph. A graph. We got basically malicious packages look different from non-malicious packages on the whole. The ones on the left, these little blue dots, malicious packages. The ones on the right are non-malicious packages. They're surrounded by nice bunch of purple users and orange source repositories. You can't see that probably in any detail from where you're sitting. The point is they look different most of the time. Sometimes you get good packages over here that are sort of isolated and you get malicious packages over there that are well connected. So I don't know. It's a malice parent there. It is some of the malicious packages look fine. Most malicious packages don't make any effort to hide the fact that they're malicious. If you look at their metadata, it's quite obviously something. There's no description. There's no effort put in at all. Unfortunately, a lot of legitimate packages look like that as well. It makes it a little bit harder. We started off. We put a neural net on this and we tried to put a classifier and we classified into malicious and non-malicious packages. It worked beautifully. It's like so what? You don't really need a neural net to tell the difference between those two things. You just need to look, you know, has it got any data associated with it? So not necessarily very fruitful. So we don't need a neural net. Instead, we did a simple score. We did a simple score. It looks at some malicious packages. It's mostly Python. We just start with some Rust and NPM as well. We looked at the activity and the provenance. I'll come on to that a bit later. We normalize it with a whole set of packages that we ingested. You can see here that most of the malicious packages, these are just malicious packages, they scored really low. So hey, it looks like we can spot malicious files using the metadata. Not so fast. Unfortunately, the base rate let us down, as I mentioned, we do get low scores for malicious packages, but we've got 10 times as many, at least 10 times as many good packages at school zero as well, which isn't great. So if we get a low score, it means we've got a one in 10 chance maybe of finding a malicious package. We don't know for sure one way or the other. So you've got to go on to your code analysis then. And also, I should point out, this isn't a representative sample. We don't have a labeled data set of all the malicious packages in the PIKI repost, because we haven't found them all yet. So we've got samples, we sample as best we can, but we don't know. So does that handicap matter? Probably not, because most of the actual packages we want to spot, most of the packages we want to use are probably on the further side of the scale. They do have good description, they do have good information, and they are linked up. There are some exceptions. All right. Okay. We act like this currently. Vulnerabilities are all that there is, and they're all deadly. Okay. This creates a lot of work for everyone, as I mentioned earlier. We're only really worried about things that can hurt us. Right? And the real, the reality is that it's more like this. Most vulnerabilities come and don't hurt us. We should use things like Open Vex, right, to describe the vulnerabilities that are actually exploitable in place, and that's a sort of an emerging standard. And then we only have to deal with the shaded bit between the two circles there. Obviously, you want to fix all vulnerabilities, but, you know, there's a prioritization system that we can employ here. Another thing to note is that malicious code doesn't always use CVEs, and there are other things that can hurt us that aren't CVEs. So, we've got malicious code leverages bad habits, like leaking keys and manual processes. We got abandoned code, and it gets taken over, and it's not updated. But bugs and bad habits and abandon where can also hurt us accidentally without being malicious, right? Malice isn't everything. So, we want to avoid all these bad things. Most of the things we actually want to know about are actually hidden from us, right? Okay. So, the malicious code is hidden by stealth. Buggy code is hidden by incompetence or apathy. And since we started patching CVEs, bad actors have moved increasingly to zero-day exploits. All right. And let's remember, most code isn't malicious. When we look at the metadata, buggy, poorly maintained, abandoned, malicious code all look similar. And you have to ask yourself the question, well, we can't tell them apart. You have to ask yourself the question, do you really want to use any of them? So, given that this is a hard problem, why not do something simpler, right? Which is to invert the question. Look for the good, not the bad, right? Looking after your health. It's like looking after your health instead of focusing on disease. So, the good bits are everything outside the circle, right? We want all the rest of the code. And for the Rust developers who insist that code could be finished, it's this bit as well, the abandoned bit. Right. So, what does this look like? We want things that probably don't hurt us. So, this is the inverse of what we just had. So, it's good coding and hygiene habits, active development, regular releases, developers we trust. CBA and stuff like that. Coding now is clear. And the key point is looking for good things is easier because it isn't hidden. Okay. Right. So, I mentioned provenance. Right. The first challenge is provenance. Okay. If you're going to do anything with any of this code, if you're going to scan it, do whatever you like, we need there. Provenance means origin, right? We need to find out where the code came from. Starjacking is when a code lies about its origin. And it tends to be a better package than it is. So, you'll find that lots of different packages share the same source repository in the package systems. It's very common. How do we find provenance? Okay. So, remember the executive order earlier mentioned S bombs. Okay. S bombs are basically a shopping list of all your of your piece of code, whatever it is, operating system, game, package. It's like a shopping list. It's a document of provenance is what it is. What you put in an S bomb isn't quite standard yet, but it's becoming more standard. There's lots of work going on with standardization. Open SSF, there's a track over in the other building that covers this. It's where we probably want to go we want to be able to record these things strongly. Now, if you've got an S bomb, you want to put it somewhere, you want to put it somewhere safe. You don't want people tampering with your S bombs. So, a thing that's becoming more common is Sigstore. And now this is artifacts signing. It's storing artifacts in a transparency log. It's a distributed ledger. It gives us cryptographically strong provenance. It circumvents most of the problems with delivery that we've got. And there's a sort of convergence on this. It's being used more and more in the community. I think it's where we're going to end up and it does solve a lot of problems. But the fact is at the moment, most code isn't signed for now. And I think it would be a few years before it is. And nothing. Historical provenance. That's a stack lock thing. Okay, so we basically, we take a bunch of tags from the source repo and we take versions and we see if we can match the dates. And if the dates match up, then we say it's got some provenance. It's a statistical process. Quite hard to fake. There's a whole video on that on the, on our website if you're interested. Videos and blogs and things like that. I won't go into that any further here. Right, so just because you've got some code, you've got rock solid provenance for it, you know where it came from. There's no, actually, it's no really shortcut way of saying if it's any good. The old fashioned ways are the only ways you test it. You measure it, SCA, again, code review. It requires the provenance of course because you don't want to be reviewing some other bit of code that doesn't apply to your package. And you become intimate with it. And with all those turtles and packages, like, intimacy takes a lot of work. Right, we've got a community of people, right. So to make this viable at any scale, you want to share the work with the community. Okay, and also we want to automate this, right, because this is, you don't want to have to be on the email talking to people all the time. All right. Okay, I mentioned reputation a couple of times. So the reputation of the people and the, the companies that we're talking about, what do we know about someone? We know, perhaps we know them, we know a company that big, we know the size of a company mostly, we don't know much about them internally. We guess and we hope and, you know, do we even care? And the executive order says that we do, apparently. So that's where our reputation currently comes from, I think. Where should it come from? It should come from prior art, participation, recommendations. Generally, we want some proof. And generally, we want to automate this. Okay, so, the key points. Once again, look for good things. They're easier to spot. You don't trust code, you trust people. Trust is complex. It can break in many places. Reputation is important. Communities can share work. And automation makes this possible at scale. Shameless plug. That's the kind of stuff we're working at at Stack Lock. We're open to ideas. Try our tools. They're on the website. Joining a conversation with Discord. The source, if you like, if it's open, if it's not yet. And that's the end of the presentation. So, any questions, please? Page presentation, Nigel. Thank you very much. We have time for one question now. There. I'm coming to you. One second. Thank you for the talk. Maybe it was some humility. But what does your product Stack Lock do exactly to apply all you said? So, is it where you attend your packages and say where are the vulnerabilities or what does it work in practice? To this URL, www.cracksypackage.dev, you'll get to a web portal and you can type in the name of a package and it'll give you a score. What we're doing is we're increasing the number of facets the score is based off of. We've got provenance measures in there and we're going to be doing a reputation engine for it as well. So, there's a website and you can go straight there. To bring this to the developer, there's a VS Code plugin and you type along, you import a file and it'll give a squiggly line underneath it and it'll say, yeah, this has got a low score. Obviously, some of the low scores are absolutely fine but it just gives you an indication that you've got to do some more investigation. There's ways around most of this stuff but it's kind of like it just gives you flags. But yeah, go to the website. It's fairly intuitive. You don't need instructions for it. Cool. Thanks for the question. Thank you very much, Nigel. Please feel free to reach out to our speaker after. Thank you.
Making Python safer than ever
Thank you. Yeah, so I'm glad that actually like so many people nowadays like really care about security. I think it wasn't the case when I first started doing Python. I was one of those people who just put things together with duct tapes. So that's good. And I'm going to dive deeper into you know what the PSF has been doing and also what you should do and also maybe some information that like you may find interesting to know like what we are doing. So this is the most important slide of the whole slide deck because it has a link to the whole slide deck. So with this you don't have to take pictures after. So maybe you want to keep this you know the picture of this slide just in case. And also you can also you know tag me, message me, I love that. So I'm Czech. I love open source. People ask me oh how can you get involved in like Python. Well I just step by step like most of you I was like you know it was the data scientists are doing Python and then you know go to meet up and then people ask do you want to organize meet up. Yes I organize meet up and then do you want to come to the conference. Yes I come to the conference and then you want to organize the conference. Yes I won't organize the conference. So after saying yes multiple times and then I'm here. So now I am a Python software foundation but members are one of the directors and fellow. So that's my volunteer work so how do I do like what I do in my day job. I am a community manager at OpenSSF. So a lot of things I talk about here actually like related to these two hats that I'm wearing from time to time which is very special. So yeah you know OpenSSF we are part of Linux foundation. We have a stand at the building K level 2. So come to us we have stickers there as well. So that's OpenSSF stickers so you can come and get some stickers for me. I would tell you more about that there. I won't talk about it here now. So let me ask you a question. So when you let let's imagine you have to move tomorrow. You have to choose a place to live. So what are the criteria right? What are what are you looking for? Like if you go to like do a viewing what questions you would ask that stay agent or people show you around. So for me one of the things I'll ask like is this neighborhood safe right? Safety is very important no matter who you are you know you don't want to live in somewhere that maybe you every time you go back home in the dark you feel scared or maybe you worry that someone may break into your house and stole things you know. So safety is very important also same as software right? You don't want someone to break into your house but you also don't want someone to hack into your computer or your system or things like that. So that's very important and I'm glad that a lot of us care about that these days. Security in Python is even more important. A lot of times I make the analogy because like I love Python and the community so much it feels like my family they're like I don't want these people to be vulnerable and got attacked so I'm protecting you know Python and the community is very important. So why Python is you know why do people care about security in Python? I think one of the things special about Python is that we have a lot of different people using Python including people who may not be traditionally like trained as a you know a software engineer like for example some researcher we have a huge scientific computing community that they maybe a researcher themselves they use Python to help them with their research. Data scientists I was a data scientist you know I was the one who doesn't care about security so I know. Bank you know this is you know again like you know I have worked in a bank before at some point so I know that bank use Python and you know this is a very important you know organization and you know Python is there. Government as well government also use Python I have like friends who have worked in the government before there are lots of governments actually using Python. Teachers teaching young people teaching the next generation of software developers they also use Python maybe this this is the first language a lot of young people these days like the first language is instead of like us you know who have a little bit like gray hair is like oh we just see when we're learning but now it's like maybe they're learning Python. And anybody you and me like we're here in the Python death room so we use Python. Yeah so Python is vulnerable because again like these people may not be trained as a you know software developer so they just kind of go to code in the first day you know and like you know then the profile is very different we can't just like for example we can't just focus on securing the Python in bank that's not good enough where how about the government how about the people who are writing software as a hobby and like you know all this we have to try to cover everybody and it's very hard when you have a diverse user profile and again a lot of people nowadays especially young people this is the first programming language and you know they may learn Python before they know what is cyber security right so that's very important because they may not know it so yeah how can we help these people to be more secure when they're coding so also is like to make things more complex now you know in the last talk you have heard about all these so and also enforce them you know we have a death room we have a talk yesterday about cra you know or this policy actually are coming to affect us it's not that like oh we are just a bunch of people who love coding and have fun no like we also care about especially when you're making a living with coding policies were affectual because it's kind of policies trying to protect the customers if you are creating a product that you are distributing them so that would actually you have to have a look at least as what's going on so you know in us we have the you know the espon bill or you know that that's the real name but we just call it espon bill a lot of times and cra you know we have a lot of section about cra is coming you know because it's going to be effective in a few years so a i ag i hope you have heard about those terms a i ag prd so these things are happening now so that's why we care so we have to protect the python users not just those people that i talk about that are using python but also you and me who may be developing in python we don't want to get caught in those policies that you know we don't want to do anything that actually we have to be liable so we have to protect everybody so before i jump into what we are doing so i'm going to talk a little bit about what are the most common open source issues so hopefully you you are now you are aware you're trying to like avoid those issues but also you can it's kind of like a homework that when you you know maybe after the talk you can kind of link back of what we are doing is related to which one which problem that we are solving so this ten risk actually is uh you know i i didn't develop them myself i copied from somewhere so the reference there you can have a look at the book post so top ten risk so now you can evaluate right have you made those mistakes hopefully you or you know how you can avoid those mistakes so first of all known uh vulnerability so uh you know have you been using a software that actually got a security issues evolved but you haven't upgraded have you you know sometimes sometimes there's like compromise of legitimate packages so let's say if you are a maintainer of a very important package that everybody used if your account got hacked and someone puts some malicious code in your code base then everybody's using that package so um yeah you you feel very bad if you are doing that so that's one of the risks you want to avoid name confusion attacks how many times like you forgot to type the s at the end of pandas so you install panda in stuff pandas then you know you can be vulnerable to that attack if you're typing the name wrong um so uh unmaintained software you know i've done that before i you know i was a data scientist trying a lot of things so i found this library that this model is quite cool install it and try but it doesn't work because it hasn't been updated for four years so um yeah unmaintained software so so risk outdated software is similar you know outdated we should always keep your software up to date because there may be some security patches um and tracking dependence this is again like i know that a lot of us are you know using a very good package manager but for people who are maybe new they don't know you know maybe or they are again like learning python that's the first program language they don't know that like managing the dependencies is very important especially when you're writing some code that need to be in production so a license risk this is not actually like cyber security risk but again like how many times before you pip install something you have checked what license they have because some license they may not be copy you know they they you may have to be open source if you are using them even though they're open source so license could be again like is another topic that i can give another like talking like you know about but you know check the license all the time immature software sometimes you know again if you are just trying different things someone may write a soft you know a package that you could pip install but maybe it's immature because it's not production ready it's just someone's like experiment a prototype so uh and unapproved changes uh so have have you like give everybody push right to the main branch so avoid those because you know maybe you give a junior death like push the main branch and they may erase everything so yeah that's not good um and uh you know oversized dependency we always have this problem you know python we have so many package available and um so yeah that that may be a problem that your dependency is too big so um now so we have so many problems that we want to avoid we have to get extra power to solve this problem so uh i know that like a lot of times we are just relying on volunteers to help out and really appreciate all the volunteers who help out that that's why we can survive for so long but uh security is so important so hiring someone to full time and take care of it is actually a very good thing that we could do so that that's what we could do right now so i would like to introduce you to these two amazing gentlemen that is helping like full time now working in python software foundation to um to do work on security for python so uh seph seph is a security developer in restaurant so this row is actually funded by alpha omega project so thank you for that you can see his beautiful face there yes and the next gentleman i want to introduce you is mike so mike is uh you know papi safety and security engineer so this row is funded by aws so again like he's working probably working a lot with uh you know together with packages so maybe they are trying to you know maybe you are using the packages that like mike has put some effort in so thank you very much um so this is what uh what has been done like well it's to be honest this is a few months ago i i put this together a few months ago so maybe not most updated one but like you can look at seph's block for the most updated work anyway so um so now like you know python releases so when you use python you have to actually get your c python right so c python usually just go to you know the official python website and download c python but how do you know this version of the c python is actually the real c python instead of maybe someone hack our website and put a malicious c python there uh the best way to ensure that it's legitimate is to sign it right so sixth door again like someone mentioned it in the last talk already uh is a new mechanism that you can you know use a certificate to sign the release so it's kind of very easy to verify to sign and uh it's keeler so it's more secure and um so everything's locked so you can check so there's a transparency law so you can check like who signed it and when is this go sign and you fully trust this c python is the right c python that you're supposed to be using oops i skipped some size okay so uh so all these like new uh you know from from certain time that is like you know c python release has been signed so um it's not just like the newest release of course they are signed but also like all these versions upward they are all signed so you can always verify them um sorry it's a bit hard to see but uh you know all these you can see who has signed it you who are the release manager and you know you have you can check all those logs and you know check the the you know the e-short and like check check off stuff so okay so yeah you can actually write some this this is you know i hope you're familiar with that it's kind of like a yaml file that you could probably use your you know c i c d to check it to check whether your python that you're using in your c i c d pipeline is the right python so um also what if i found an issue right what if i have discovered a security problem actually in python we have a response team that's really great um so it's not just uh staff or mic themselves but also like it's it's a team help out so uh so if you found something about c python hopefully it's not happened that often uh you know it applies to both supporter or the n of live version or pib then you can actually file those reports so uh the team will work with you to handle it so how how do they work so first of all um they would work with you if you are the reporting so you are the reporter so they will work with you privately so privately is the key because you don't want to shout out to the whole world oh we have an issue with python and it's not secure no you don't want this so so we'll work with your privately right so after that like when a you know after that of course i the developers you know call developers work and then we'll try to find a patch and then okay so now it's patch now there's new release right so okay we have a solution now has been solved now it will be publicly announced now we can tell the world say like okay we had a problem but now has been solved so everybody please use the newer version of python which this issue has been resolved right so this is usually how it works so first step you know instead of posting it on you know your social media you report it to the response team don't post it on social media write a blog post report it and then after when you know it can be go public then you can write blog posts saying how awesome you found a bug or something like that right so um this is a really good news so python software foundation has also become a cna so what is cna so cna is actually a organization that can give out the cve numbers so we what are cve numbers so cv numbers actually are a unique identified identify each security issue that has been reported so for example you know cve that that example there you can see that like first of all there's four digits of the year and then there's like five digits of unique identifiers so you would see that um when people you know talking about oh is this new patch solving this issue so instead of this issue which we don't know what is that issue and naming thing is hard we just use a number saying like this cve 2022 4h 5 64 issue has been resolved right so we can clearly say that this you know this new version is not affected by that so that's a good thing um so yeah this help to discuss and you know communicate and make everybody easy to identify where the problem is and where has been resolved so um by becoming a cna which means that we the finance software foundation is taking is taking this security issues very seriously now we can assign cv id's to any issues that you have reported about c python and pip so if we see that that is actually a issues that you found will give it a cve so we would actually be very quick to respond and make sure that everything got resolved um so that's the that's the c is the you know cna is hard we are so glad that now we are thought a 4t that we can give our cve's um the next thing is that actually knowing the cve's we how can we keep a lock keep a database of what cve has been discovered and what has been been affected and that kind of thing so uh of course we will have a database we'll store all those data um you know we have a pi ta advisory database um so it will store what it will store right it will store the c python from the cve's so the so all these security issues affecting c python will be stored there um so you so there are also some packages there sometimes they have problems but we won't give cve's for the packages they kind of work by themselves so um but people that you can kind of check it check those packages on pi bi so if it's known then it will be there um also all these uh the the pi pa advisory database this information is now also synced with uh osv from our bloody database so osv from our bloody database is not just limited to python it's also including other vulnerabilities in other software so if your application is not just python it's also have other components you can actually use osv to check everything so um yeah so that's good you can use both um so it's there's more visibility so people can easily find the issues in python ecosystem in the osv and also vice versa we can also see other issues as well so it's good so um but that's not it so that's the thing that we have been doing at the organizational level so that's something that maybe as an individual developer you don't have to put too much work in it but um let's let's go to this section which i have listed what you can do as a developer as a user or as a company so uh first of all we want you to help us to secure our community um if you are maintainers of python project first rule of thumb enable two f a everywhere you know this is always like very important you know you don't want someone to hack into any of your you know github pi bi or email account to you know gain access of these things that you're publishing and someone may be using right um so you should learn about how to develop a better like saver software you know open ssf we have some free guys that you can follow talk to me afterwards at the booth or something um use tools like you know scorecard and sex store um you know the these are the tools that again open ssf has uh you know associated with and then we have actually these are you know i won't talk in detail talk to me about how to use those tools after also subscribe to the pi bi blog for security features those are the official announcements but saft also got a very very good blog i highly recommend you follow him he's lovely you have a lot of you know very useful information that's given out on his blog um but as a user maybe you're using python project you can be a maintainer and use at the same time right so uh as a user you should keep your dependency locked and up today you know don't be someone like me you know five seven years ago that just just duct tape everything no keep your dependency locked and make sure they use the good um package manager um subscribe for advisories uh you see the email coming from that email after you have subscribed to that mailing list you would be first notified if there's any information that you should know um so pay up that it may be a good tool to check your dependencies but again like if you're your application is not just python so obviously would have everything that you need to know so if you are working well so if well i assume that most of us are working in the company um so if you're using python in the company or any open source project i think uh there's more you can do you can convince your company to become a member of open ssf so contribute to all this open source security work or support python software foundation we want to hire more people to help so uh you know the most easy way is like if your company has a big budget then you can maybe support us to hire another person that would be great um so also educate the employees so if you for example if you're like kind of a lead or a manager maybe you can encourage your teammate to study to learn more about security so you know on a limit foundation we have a lot of free courses it's a very good you know resource that you could maybe use also follow best practice like just make sure your team don't duct tape things together make sure that the product is production ready follow good practice make sure that it's safe and then it's it's better in the long run so maybe you'll be easier for you to get ready to be complianced to when cra coming to effect or something like that right so uh last thank you so much for alpha omega and aws to support python software foundation so we can hire two amazing gentlemen that we have today um and it would be great if we can have a third person so please if you if your company have like want to support please point them our way um so yeah thank you so much this is the end and this is the link to the slide and i'm happy to talk to you afterwards i have coffee with me or anything or message me um i'll be here um today so
Match all things Python: Parsing structured content with Python's new match statement
Good afternoon. We have now Mach-Andre Leimburg. He's the CEO and founder of E-Gennex. He's not only that, but he's a Python C-Python core developer. He's also one of the organizers of the EuroPython. He's a EuroPython Society Fellow, and he's been making many contributions to Python. So, yes, we have this pop star here. Now, he's going to talk about Match of Things Python, parsing structured content on Python's new match statement. Thank you very much, Mar. Thank you. And thank you all for coming. The reason why I'm doing a talk about the match statement is that I'm getting a feeling that it doesn't receive enough traction. So, I wanted to know from you how many of you know the match statement? How many of you have actually used the match statement? A lot less. Yeah, that's what I thought. So, maybe a short introduction. Tatiana already mentioned a couple of things. I did a lot of stuff in Python. I've been working with Python since 1994, so a very long time. I did lots of things in the core development, Unicode, Db, API, the platform module. I'm based in Germany. If you have a need for, I don't know, a senior software architect, then please contact me. But that's not the point of this talk. The point of this talk is to show you this. So, this is the match statement that you have in Python. And it's actually a very, very useful thing, especially if you want to parse structured data. Now, the match statement itself is actually quite complex if you look at all the details. And I'm going through all the details in this talk. There are so many details that I have to rush a bit, unfortunately. And I'm not going to be able to show you live demos or anything because I simply don't have the time for that. So, let's just head right in. So, what's the motivation behind the match statement? People wanted to have something like a switch statement, as you probably know from maybe C, your other languages, for a very, very long time. I just, I wrote a pep a very long time ago, which basically suggested adding something like that to Python. It was rejected at the time, so it took another 20-something years to actually something like this to make it into Python. What we now have with the match statement is a lot more powerful than the switch statement. So, you can do not only matching on literals, for example, but you can also do matching on types. You can do matching all kinds of things, including conditions that you apply to these things. You can combine all of these things. You can also do parsing and matching at the same time, which is quite useful, so you don't have to have two passes. First, to figure out whether something is actually valid, and then in the second pass to then figure out how to actually use the data that you have there. It all started in Python 3.10. That's more than two years ago. But, like I said, it hasn't received that much traction yet. So, what you see here, or maybe you cannot see it, it's a graph from py-code.org, which is a very nice site. If you don't know that one, you should go there and have a look. It basically scans all the PyPI code and then does analysis on that. The maintainer did an analysis in July last year and looked at various features of the language, whether they were being used in the packages on PyPI or not. As you can see, in July, there were only 2,600-something packages on PyPI using the match statement. That's two years after the release, and it's only 0.55% of all the packages, so it's next to nothing. So, I guess one of the reasons for that is that the documentation for this match statement is not all that great. I'm talking about the official Python documentation. There are many blog posts about it, and many other resources that you can tap into and overviews. But the Python documentation for the match statement is not ideal. What you have is these three PEPs, and this is basically the best that you have in the official documentation for Python. If you want to get into these things, then I would suggest to go with the PEP 636, which is a very nice introduction, a tutorial kind of introduction to the match statement, and then you can go to the other PEPs to have more detail. So, how does it all work? We're going to have a look at this example, and I'm going to go through the various different parts of it. So, the first part is the match object itself. This is what you want to match, this is what you want to analyze. The next thing is what you have behind the case statements in there. Those are called match patterns, and there are quite a few of those. I'm going to go through a list of many other patterns that exist. Then, of course, you have the match code. This gets executed in case, one of those case statements, the case patterns actually do match. And then you have something called capturing variables. I'm not going to explain what that is now, because I have a few slides on those. This is a way basically to store the data that's being matched in a variable. Plus, you have something that's a bit strange, which is just the underscore. These are non-capturing wild cards. So, it's basically like an ELTS in an if-else statement. So, if the matching goes down, and you have a, as the last case, you have one of these wild card things, then this will always match. So, this is a way to do the ELTS in the match statement. Matching itself is always tried from top to bottom, and the first match wins. So, the order in which you list these match statements, the case statements, is actually very important. There's no fall through, like in C. How many of you know C? Well, quite a lot. That's good. So, you don't have that, because in C you can often make a mistake. If you forget a break, for example, in one of these, the code that comes behind the case, then it just falls through, and then you execute code that you probably don't want to execute it. So, let's have a look at these pattern types that we have. Like I said, there are quite a few. I'm going to go through them rather quickly. So, the first one is the literal. So, you can just write a little bit of string, a little bit of number, an integer, a float. It can also handle a couple of special singletons, like true, false, or none. Not many more. If you have something else that you want to actually match, and you don't want to write this down as a literal, you can use a variable kind of notation for that. So, if you have some other value, you put that into a variable that's accessible to the match statement. And what's very important is that you have a dot in that reference. The reason for that is a bit strange, because the match statement also works on types. And in order to differentiate between type names and variable names, the match statement and the parser, they need to have some kind of hint for this, so that they know what they're dealing with. And the dot is that hint. Now, the next two types are sequences and mappings. They look very natural to a Python programmer for sequences. You just use like the square brackets or the round brackets, and then you match a sequence. What's not necessarily intuitive about this is that this actually matches sequences, not just lists or tuples. So, if you write something like, for example, in the tuple notation, and then you pass in a list as an object that gets matched, the tuple case will still match in your match statement. So, that's a bit like a gutcher. You have to watch out for that. And it's similar for mappings. For mappings, you write them like the, like a dict kind of notation. It actually matches all kinds of mappings, not just dictionaries. There are ways to, you know, just match dictionaries. I'm going to show them. You can also match, like I said, different types. The very, you know, very simple ones are like all the built-in types that you have there. You can have support for user-defined classes. You have to pay some attention in user-defined classes about the order of the arguments that you have in there. I'm going to talk about that in a bit. What's very important are these parentheses. If you don't have parentheses behind this, then the match statement is going to basically treat this, the name that you have there as a variable, and very often as a capturing variable. So that's going to, that's another gutcher you need to be careful with. Of course, you can nest all these things. You can combine all these things that I just mentioned in various ways. There's an OR combination with a pipe character. And to make things even more complex, you can add guards to these match patterns that you have. So you can say, OK, for example, down here, if you can see that it's a sequence AB, and then this should only match if the value A in that sequence is above 10. So you can write very complex things in those match statements. And then finally, you have these white-card patterns. I mentioned those already. There are two types of these white-card patterns. One is the anonymous one, a non-binding one, which is the underscore. And the second one is one where you basically put something at the bottom of your match statement, and you just assign a variable to that. I often use unknown for this because it just makes sense. If you read that, it's very, you can easily comprehend that. If you read the code, you can easily understand that this is actually something that matches anything a bit unlike the underscore. I'm not too much of a fan of this underscore thing. Right, so now let's have a look at the capturing variables. Like I mentioned in the beginning, the nice thing about the match statement is that you can actually combine the matching and the parsing. So whenever something matches, Python will put the matched value into a variable that you define, which is very much like, for example, the ass notation that you have with context managers. There are two forms for this. One is an explicit form. So I put an example here. So what happens is it matches a list. And then if the list type matches, it will put the value into the variable sublist. And then you can use that variable in your other matching code that you have or in the actual code that you want executed for that particular case. Very easy to understand. It's a bit more verbose, but it always works, which is nice. And then there's an implicit form. This can cause some problems because it introduces some of these gotchas. The way that this works is that instead of putting literals in these, for example, sequence notations or mapping notations, you put variables in there. And what happens there is that implicitly, for example, in the first example up there, the first entry in that sequence will go into A, and the second entry will go into B. And then you can immediately use A and B, for example, in guards that you have on the code that comes afterwards. And these things are actually bound variables in your code. This works very well if you have well-defined variable names. If you don't, you can get into lots of trouble. So using short names is probably not a good idea. They should be very explicit. This does also work with some of the built-in types, not all of them. So there is a, I think this is actually a full list of all of the ones that support this. It does work with classes that you define, but you need to have a look at this pep for the details. There are some special attributes that you have to define in order for the parser to know in which kind of order these variables should be assigned. Unfortunately, it doesn't work with ABCs, but there are workarounds for that. So if you work with ABCs, for example, if you want to test whether something is a float or an int, and you want to put that kind of logic into an ABC, then there are ways to still make that happen. There are some things that don't work with the match statement. Some are a bit unfortunate, because, for example, if you use a scripting shell language, like bash, for example, a very, very common use case for matching is regular expressions. So basically, you have a case, and then you put a regular expression there to match a particular regular expression, kind of like how the string should look like. This is not supported directly. There are ways to work around this. I'm going to show you a reference later on, where you can basically find how to do this. Something else that doesn't work well is a set member matching. There are ways, again, to work around this. You can use a guard to kind of do this set matching. So the guard works by having the wild card, so it always matches. And then it uses the guard to do the actual check whether something is in a value set, or you can use the OR pattern. But the OR pattern is sequential, so it's not really efficient. Optimizations haven't been done yet, which is a very common theme that you always have in Python. First, something gets implemented to have something to work with. And then, in the next couple of releases, people then worry about performance and add better performance. So that has happened a lot in Python in the history. It's probably going to happen for this as well. So I talked a bit about the guard trust. I just want to reiterate some of them. This I already mentioned. If you use the tuple notation or the list notation, and you think that, OK, this is just going to match a tuple or just a list, you can easily get this wrong. So if you want to do this explicitly, then you actually have to use the type notation for this. So you have to write list or tuple, and then the sequence that you want to match. The same issue you have with the mapping types. So you have to pay attention to that as well. Another gotcha is the wildcard pattern. So you can only use the wildcard pattern at the very end of the list if you put something up at the top of the list. For example, if you start with case and then wrong values, because wrong values is a capturing variable, it's regarded as a wildcard case. And so it will match anything. And the parser will actually complain about this. So this is not valid Python. However, if you put a guard with it, then you can use it. Which is probably in order to make certain workarounds possible. I don't really know what the reason is why this works. It's a bit strange. And then the parentheses. If you look at this code, if I wouldn't have put an error there, you probably wouldn't have seen this. What I did there is I put a dict there, meaning that I want properties to have a dict, like a dictionary value. And they want to match that. But I forgot the parentheses. So what's going to happen is the parser is going to regard this as a binding, sorry, capturing variable. So it's going to put the value into a dict. And then it's not only going to not parse correctly, because it will just put any kind of value that you have there into this dict capturing variable. But it will also bind dict to this value that you have in there, possibly breaking code that comes afterwards, because you can no longer access the built-in dict. So this is something to watch out for. And finally, this is the talk that I wanted to mention. Raymond Hettinger. Who knows, Raymond Hettinger? Not that many people. That's strange. You should definitely look him up. I mean, he has done so many good talks. It's just incredible. I mean, if you want to learn something deep about how Python works, he has all the talks in his stack. So definitely have a look at that. He did a great talk at PyCon Italia 2022, also on the pattern matching. And he shows a lot of tricks on how to work some of the deficiencies that you currently have in the match statement. So I was actually faster than I thought. So I'm done. So yeah, this is always my last slide. Never stop to learn. Always learn new things. Never always try out new stuff that comes out in Python. And I hope this talk will kind of make you have a look at the match statement and maybe use it more, because it's actually quite useful. Thank you. Thank you, Mark. Thank you, Mark. So now it's time for questions. So I can say a few people with the hands raised. I will start here, and we will go up. So we have four people, at least. One of your first examples, you first had to check whether this is a list, like with the list in the parentheses. And then two cases later, you are trying to catch against the sequence. That means that this will only match if it's a sequence, but it's not a list, I guess. Like on your first slide, literally. The first one, like this one? Yes, this one. So on the third case, it will match if the thing is a sequence with three elements, but that sequence is not a list, because otherwise it would have gotten into the first case. Is that correct? Given this one, yeah? Yes. Since you have a case list, oh, yeah. Yeah, so you're right. What happens here is that this will always match for lists. So if you put in a real, like a true Python list, then you will always go in here. If you have defined your own kind of sequence, that's not a Python list. Only then it will get in the top. Then it will drop down here, and we'll parse here. And as Heckelman and Laska mentioned for me, what happens if you put a generator in there? Can you match against generators? Because then you will kind of mutate the element while casing the case. Would that work? This is a good question. I think if you put a generator in there, it will actually match the generator type and nothing much else. It won't actually call the generator to give back any values. But it's a good question. I'm not really sure. It probably works like that. Hi. Thanks for the great talk. I had a question regarding the caveat you gave at the end regarding the dict. Is there a proper way to do it, like putting parenthesis, or is it not possible to match a type inside of a hash map like that? Let me just find the slide. This one, right? Yeah, that one. So what was the question? So here you put the dict, and you said that, of course, if it will overwrite, let's say, the Python dict, would it be possible in that case to put parenthesis to match the type here? Yes, of course. And that was the code is actually written in a way that this would have been intended, right? So the intention was that properties, well, it's matching a mapping, right? So if you put in a mapping that has, as one of the keys has properties, and as a value has a dictionary, then this will match, right? Without the parenthesis, it won't match any mapping that has a key that has properties, but not actually look at the value, and simply just put the literal value into the variable dict. That's what happens. OK, I think I see you up there, right? Yes, hello. I was wondering, with this capturing variable, it can sometimes lead into ambiguity. So I was working how well this would work with the existing typing system, where you would, for example, have an object that, like, dict that represents the type. So that is something that I did not really cover in here, but perhaps you noticed the syntax that's being used here is actually somewhat different from the type annotations that you have in Python, right? So those are two distinct kind of, basically, systems working here. These types that you have here are actual Python type objects that you work with, whereas the type annotations are being used by, for example, MyPy or other tools, other static code analysis tools to figure out whether something is correct or not. So this actually happens at runtime. I don't know if that answers your question, so. Well, sort of, I guess. So you can't really put the typing types in here, let's say, because there is generics in there, of course, that would be highly convenient for matching. Right, right. I think that, I mean, in typing, you do have some actual Python type objects. Those you can use in here, right? But you cannot use the type annotation kind of syntax, for example, for matching an integer or something, yeah? No, it doesn't make sense, of course. That doesn't work. Thank you. Do we have any more questions? We have time for one last one. Yes, we do. Oh, my God, we have two. I'm going to the right side, because we haven't had many questions from there. I'm coming. Let's go. Thank you. So, yeah, maybe this is wishful thinking, but how difficult would be to implement or to provide, like, a match that will match not in order, but it will give me the best match? Would be that possible? Because, for example, I'm working in code generators for wrapping CAP from wrapping C into Python, and sometimes you can't do that. And from C++ goes over, function overload. So I can think, OK, I can have function overload to Python and translate that to a single function with match for different signatures. However, I will have to, I don't know, I need to know which is the best match for each case in order to order the match statement. Will it be possible to have that kind of logic embedded in Python, or that's too wishful thinking? You can try to do this by ordering the cases from, you know, the longest match to the shortest match. But apart from that, I think it's, this is actually a hard problem that you're describing there. Because if you want to, if you want to figure out what's the best match that you have, then you actually have to go through all the different cases that you have in here, and that's going to have different semantics than what you have now in the match statement. Usually the problem is like to know which is the most concrete type. Usually the problem that I have the most is like to know which is the most concrete type to the base type, so to that it matches the most concrete one instead of the base one, because it's like it can match us both. But in C or C++, it will always match the most concrete one. And if, and it's not there, it will get to the base. So, and for example, for now, it's like right now in Python I have no idea how I will solve that when I'm wrapping APIs. You can do that by ordering, like I said, you can order the case statement that you have here from the most, let's say, abstract one to the most concrete one, and sorry, the other way around, from the most concrete one to the most abstract one, and then like in the example I just gave where you have a list, yeah, when, if you pass in the Python list object, then it will match the first one. If you pass in, in this other example that I had here, if you pass in, let's say, a user defined sequence, then it will drop down and then match that one. So that's more abstract, right? Thank you very much, Mark. Another round of applause, Mark. Thank you. Thank you.
Annotated, a type hint you can use at runtime
We're here with Dennis Luxald to present annotated type hinge you can use at runtime. Dennis is a Python developer working for the label on post-gres infrastructure and automation. And he's been a long time free software hacker. Thank you very much for being here with us, Dennis. Thank you. Thank you. So annotated type hinge you can use at runtime. First of you want to learn me, I work at Delibos with a small French company doing post-gres service. And they are do database infrastructure automation and also trying to contribute to process ecosystem. Most notably recently the psychophagy database adapter in Python. And last but not least, Python project at SciPy or Macro. So why talking about annotated? Perhaps we have seen this kind of code in the wild recently or less recently. It's taken from Python documentation. Python is a well-known and famous nowadays data modeling and validation library in Python. And you can see from their documentation, especially recently since the version of library, that's this annotated pattern kind of spread everywhere in the library. Another example is here a code sample from Fast API, which is another famous library useful for doing web API in Python. And there is also this annotated pattern here. When I stumbled upon this last year when doing a migration to Python in V2, I was kind of disappointed because this syntax looks well in Python. It's verbose, it's not really usual in Python. So I wanted to talk about it and why. First, let's see how it works because it's not very intuitive to me. Then I wondered how can I use and define my annotation in order to use the annotated syntax using my annotation to when I would define not just the one that would be provided by the library and for which use cases. So the line of the talk is threefold. First I'll introduce typing.annotated, which is defined in a PEP, 593. And then we'll walk through a few of these cases involving data-centric models and doing validation, serialization, and user interface, all using annotated. And first and third I'll discuss what the adoption in the community and ecosystem of this annotated construct. So let's start with PEP 593. So when it is defined in the typing module, it's in the standard library, but in my opinion it's not really a typing similar to others. It's more like an annotation. Maybe more in the spirit of the initial idea of function annotation defined in a pre-old PEP now in which you can attach an annotation to an identifier using the colon symbol. And then it was used for typing, but really here we have the ability to annotate identifiers. Identifiers are for example class attributes, function parameters, or anything in the name space like a module or something. It's here since Python 3.9 or in typing instruction if you have a Python installation. And the PEP is named flexible function and variable annotations. What does it tell? It tells that you can annotate a variable named veer with the type int annotated, which takes at least two arguments. The first one is a valid type, could be a built-in type or custom type you define. And then it takes at least one metadata or annotation. I would use the metadata or annotation in an interchangeable manner in the following. And you need to pass only one annotation, but you can pass many of them. The key idea of this is that this metadata can be used for static analysis and also at runtime, which is pretty new in the typing ecosystem. And it opens for pretty interesting use case in my opinion, although syntax is a bit well. I think it's also interesting because it's designed to be composable, meaning I'm quoting the PEP here, but basically it means that when a tool analyses the code base, the data static analysis or at runtime, and it encounters an annotation it does not know, it should ignore it. And if it encounters an annotation it owns, typically because it defines it, it should handle it. So it means that if you use annotated to combine many annotations from different source, you can expect them to play well together. How to consume annotations? Because once you have defined annotated values you will need to consume them. In the typing module, still in the standard library, you have a couple of utility functions I will introduce here. The first one is getTypins, which can be used for any kind of object like a class, even an instance of a class or functions, any three terms, a mapping from attribute names to their typings. So if we have this point at a class with two attributes, the second one being annotated, we can see that the int value, which is dictionary, returns x and y, and x is simply the integer class, and y is the annotated construct. You can also use annotations and gender attributes, but getTypins is more powerful in general. From there, you need to inspect individual annotations, and you have two functions, you have getOrigin and getAx, still from the typing module. If you use getOrigin on the typings, it allows to discriminate between all the typing constructs that are in the typing module, and especially in the defying in your case, which one are annotated types. So you can compare these results with the ease operator, for instance. Then you can extract the arguments of the annotated value, which means here in our example, in our y example, getting the type and annotations as a list here. So that's how you consume annotation. And in general, thanks to the composability principle I exposed before, once you have extracted all the annotations, you would ignore the ones that you are not interested in. Here we only endow the label annotation which we own, and we ignore the others. So we typically check this with an instant check. To wrap this up, I'll introduce this simple helper. It's not full proof, but it works for our examples. Basically, it uses getTypings, getOrigin, and getArgs in order to work through the annotation of an object and find the one matching the specified type here. So we to illustrate it, if we use getAnnotation on the point class introduced before and trying to match the label annotation, we get these results. So the y attribute, the y name, the annotation, and the type bound to the attribute. So I'll reuse this function here later on. So that's all for the presentation of annotated from the PEP and how to consume them. I will now introduce some use cases in order to illustrate why you would want to use annotated or why not, maybe. I'll use this simple model which is a calendar event model. It uses pyidentic as a base model. So again, pyidentic is a famous library for doing data modeling and validation or stylization. It's similar in scope to data class. And here we have an event model with a few fields, a summary description, and two dates defining the duration of the event. And the following will do three things. The first one will do validation of the daytime fields. We'll do this using pyidentic built-ins annotation. So we'll see how to use third-party annotations. Then we'll do high-calendar stylization. High-calendar is a simple format for XMG calendar data, text format. But we'll see how to implement our own annotation in order to perform stylization. And the third one will be console rendering. We'll build some kind of user interface in the console in order to print and display calendar events. So again, this will illustrate how to define our own annotation. So starting with daytime validation. We define, we use here the built-in annotation from pyidentic. And namely, it's the aftervalidator, which is some kind of annotation factory, because it takes a function here tzware, which as its name suggests, would validate that the daytime value is not naive. Meaning it has a time zone defined. So here, you can simply define tzware type by combining the daytime field with the pyidentic shipped annotation construct and your own validation logic. Here, I'm defining an event with a naive time zone. And we can see that pyidentic produced a validation error, which under the hood is triggered by our value error with the expecting a tzware daytime message. So it works. As a side step, I would like to mention that before using annotated in pyidentic, in all the versions of pyidentic, the pattern for doing validation was through using class methods and decorators, namely the field validator decorator, in which you have to define custom method in your model classes, and you have to bind the attributes you want to validate to the method using the field validator decorator. So that's another way to define validation in pyidentic, not using the annotation. And why, in my opinion, the annotated pattern was adopted in pyidentic is for the following reason, I would say. Because the validation method, the class method is loosely bound to the attributes. You can see that there is no direct relationship between the start art and end art field definition and the validation, whereas in the previous example, with the annotation, you have inline combination of type and annotations. Then if you have different classes using the same validation because validating a non-native daytime is quite usual, you would have to repeat the method in all your model classes. And similar for all use cases, less like serialization. So that's why I think the annotated patterns has taken adoption in this kind of library. Here we simply introduce an alias, tzdatetime, which is the annotated daytime with or validator in order to make or code less variables. So that's one idea of annotated, despite being verbose, you can use aliases. And you can define your base model class just using the aliases. Next, next you see is ICANN under serialization. Here, obviously, I'm using, I'm adding another annotation. These annotations are defined in the ICANN module, which I will introduce later. And there are serializer instances. An ICANN under serialization just takes a label name, which would be used to serialize the data. So here you can see that we have combined different annotations in the same annotated construct, because here our tzdatetime is already in an annotated value and it's again wrapped in another annotated construct with another annotation. All this is flattened at runtime. In the ICANN module I mentioned earlier, we have this serializer class. Here it's a data class. We have the label field. And we have a simple serialization method, which does some transformation between the value, which would be the field value, and specify it. So if it's a date, we need to remove the time zone and change to ETC and use this kind of format. Then we use this function, which takes the object events, walks through the annotation using the get annotation function introduced later and it calls the serial's value. And we join the result using this. So an example here, we still define our event model. We get the date and we can print the serialization in the 9-calendar format. I'm going to work up to define and consume an annotation. The first thing you do is to define annotation, typically as classes, the data class is quite handy for this. You can define options and you would typically implement an underline method in order to process the value. Then you add and take your data type using annotated, obviously. And then you consume your object using the get annotation patterns I would annotate a bit earlier. The third use case, here we are stacking another layer of annotations from the UI module, the UI module I will introduce later. We are adding UI annotation in order to define all the fields of our event model will be serialized in the console. We have a text annotation, a margithon field for the description and we have data serializer for description. Here we will be using the rich library, which is a pretty nice library for doing console rendering and building terminal user interfaces. The widgets, which are the annotations we have used in the previous example are defined here using classes, following the pattern I introduced before. And they basically delegate to rich to do the rendering of a field and then do the running on the type of the field. Here it's another way to process the annotation instead of introducing a custom function for processing the object. We introduce a mixing class, which follows the rich protocol. It's defined in documentation. You need to define a Dunder rich console method. And there, again, we used our get annotation function looking for the widget class, widget annotations. And if we found some, we call the render method of our widgets, get the value and we need the text value. We use this as a mixing, so we extend our class events. If we take another example, we added a description with some margithon formats. The dates field here are the same. Simply, they will be colored depending on whether the events are started or not. So if you reach, print, or events, we get this. We can see that we have margithon interpreted things in the description and the start date is green. So that's all for use cases. I hope I've demonstrated why you would want to use annotated or why not, maybe. Then I will discuss about the adoption of this pattern in the ecosystem and community. First, we have adopters, like I mentioned before, by the antique, fast API, or typo. So you have this kind of code in the wild. And you might want to get used to this because I think it's here to stay in this library. I hope I've demystified the pattern so that you can understand what it does it mean to have this kind of code in the documentation and if you copy-paste it, for example. There's an interesting project in annotated type, which provides reusable constraint type to be used with annotated so that you don't have to define your quite classic annotation. It's also adopted in a SQL alchemy, although it's a bit more involved because you have to use the mapped type annotated file. And obviously, in a project with less coupling with the typing system, there is less enthusiasm. It's obvious. This brings me to some skepticism I've seen in the community. First of all, I've used in annotated is quite verbose. The symbol is already in uncultured and if you stack different kind of annotation in the same type, it's getting verbose so you have to take care about this. It's us, readability. Then it's kind of awkward because annotations are not necessarily typing, although most consumers do use typing information. But if you don't want to use typing, you cannot really use the annotation as provided by annotated at the moment. Also, consuming annotation is a bit tedious as we have seen. You have to write some bullet-plate code. And there is more coming. Here, there is a pep I would like to mention, 724.27, which would introduce a doc construct in the typing module, which would be used through document fields. So again, it's not typed, but it's in the typing module. So this brings us to the typing topic, which as you may know, in the Python community is quite divisive because there are some fans of the typing and some are not. And this, I think Python is growing with its features because they bring user value. The example I've shown in Fast API are a lot more expressive, in my opinion, than the one that you would typically use in other metaprogramming patterns, like decorators and so on. And if you want to deep dive more in this topic, I encourage you to read this LWNN article, which was published recently, and it provides quite a nice overview of the typing issue or topic in the ecosystem and community. Thank you for your attention. I'm done. And if you have some questions or some thoughts to share, but annotated, I'm happy to discuss. Thank you very much, Tennis. Do we have any questions? Let's see. Now is your chance to raise your hand. I can see one question there. Don't be shy. Raise high, please. Hello. Thanks for your talk. It was very cool. Could you go into a little bit more detail of how I could use annotated on my own class? Because like I saw it in Pydantic and I was like, oh, cool. But I didn't quite get how I could use that in my own... Sorry, can you repeat? No, okay. Could you tell me how I could use that annotated trick on my own class instead of it being like a Pydantic base model thing? Yeah, that's what I illustrated. So here you have this cellulizer, which is a class I just defined for the example. And then you annotate your attribute with your cellulizer instances. Here it takes an option, which is the label of the cellulization value. And then you need to write this kind of function in order to consume the annotation of the instance of your event class. Yeah, but the event was inheriting from base model, which was a Pydantic thing. So if I'm in my own project... You don't need to do this Pydantic for this. You can take a built-in class or a data class, or... It's not related to Pydantic. The Pydantic thing was just for the first example, validating the time. Apart from that, you don't need Pydantic at all. Okay, thank you very much. Thank you. Do we have any more questions? One more. One more. The session is still being recorded. So please be silent there. How can we reach you? I'm not sure if it's okay to ask that. Is it possible to use that in Django, REST, and WorkSphere Realizer? Have you tried or have you seen anyone try that using this one? Is it Django, REST, and WorkSphere Realizer? In Django? Yes, Django, REST, and WorkSphere Realizer. I don't know. I don't know Django, REST, and WorkSphere Realizer. In fact, you can use it as long as you define your own way of consuming the annotation. So if you want this kind of helper function, you can use it, definitely. Okay. It's not bound to a particular framework, in fact. It's in the standard pattern. Thank you very much. Another round of applause for Dennis.
Profiling Python with eBPF: A New Frontier in Performance Analysis
Well, everyone, we're about to start the next session. So we have here Kemal Okoyom, who's going to be speaking about profiling Python with EBF, profiling certainly one of the most challenging parts I had with Python in the last 20 years when I've been coaching with the programming language. So I really can't wait to hear what you have to share with us. Thank you very much and welcome. Hello everyone. So let's start with some questions. Who does anyone here like knows anything about EBF? Okay, quite a lot of people. Have you ever used profiling? Wow, nice. Do you know anything about Python? Great, everyone. That's nice. Okay, first, who I am. I'm not Prometheus, but I'm a Prometheus maintainer. Do you know or use Prometheus? Okay, that's also great. So I'm a maintainer of Prometheus. I'm also a maintainer of Thanos and recently I'm a maintainer of Parkout Project. These are all open source projects and they are all focused on observability. And I think I know something about observability, so today I will tell you about that. So let's start always with why. Why do we need profiling? It's either for some performance optimization. This graph can be anything generic. This could be CPU. This could be memory. When you see some spikes and you try to understand what's going on actually on those spikes. Or this could be something about an incident. This is graph is specifically for a umkil that your process and you don't know what that happened at that certain point and you would like to know about like who, what function or which component of your process actually allocates the memory in that particular moment. So some of you already know that there exists some profiling solution in Python. This is not an exhaustive list, but most of the libraries or projects that you see in here you actually need to instrument your code. So either you need to import a library or you need to specifically write some code and then start profiling your application, Python application. This is not always the ideal case because you would like sometimes you don't have access to code itself and sometimes you would like to do this from outside. So how do you do that? This is where eBPF actually comes into place and helps us. So eBPF is a, it's originally for networking applications. It's called Berkeley package filtering, but now eBPF totally something else. It's basically an event-based system where you can hook into some events that Linux kernel issues and then you can just run code as a reaction to those events. There is a runtime, there's a virtual machine inside and there's a verifier before you load your program that verifier needs to check your program that it doesn't do anything harmful like infinite loops whatnot in the kernel space. Then compiler kicks in and compile your provided code and you actually run those code against any events that you issue. That's the one fancy part of the solution. So then it comes another subsystem of Linux which is super cool is the perf event. You can have perf subsystem. From the perf subsystem you can have, you can hook into various parts of your stack and you can run code against these events. In this particle talk we are going to talk about the CPU events, but you don't need to use only the CPU events necessarily. You can do this for the IO events, you can do this for the memory allocation. Practically anything that you see in here, you can run, hook into that event and run a code piece to that. So what makes perf events special? It's actually performance monitoring units. These are very efficiently implemented units in the Linux kernel so that they keep track of the cycles and then like you can actually take measurements and you can react to those measurements. That's why EVPF plus perf events is actually faster than the other solutions that we have in basically ecosystem. Because with the Linux you can already have some syscalls, you can just hook into that introps and do all these things that I'm going to tell you in a minute in the user space. But the most of the things we do is using the PMUs and because the PMUs are efficient and EVPF codes also efficient and run in the kernel space, this gives us a bit of a headroom for the performance. So we are not the only ones that actually implemented that. This is quite a journey. I don't remember when is the first time that it actually introduced the PyPerf code inside the Linux kernel. You can check out, these are actual links and I'm going to share the slides and you can see basically the git comments against that. There's a set of tools called BCC tools in the EVPF space and there's also another implementation of PyPerf in there. But what is the downside of all these tooling? These are first, they are dated. They don't cover all the recent changes in the Python runtime itself and they are just one of tools. So I'm going to show you some cool things that you can actually do profiling in the production itself in a continuous manner, which we call continuous profiling. We're going to come to that in a minute. So to make the tools work, you need to just wrap your Python interpreter around with these tools and then collect your profiles. So that gets us to the Parker project. It's an open source project. It's a continuous profiling project using EVPF and perf events. We can run your profiling workloads directly in production and there is no runtime, nearly none of the runtime overhead in this approach. There's a tiny bit, but it's really negligible. So how does the Parker EVPF agent actually work? Other things that I mentioned previously, the hook into perf events, we have some unwinded programs that actually unwinds the stack, which I will tell you about. And we then keep track of what happens in the CPU for that stack. And then we aggregate those information and put in an EVPF map, which are the special data structures that you talk between Kernel and the user space. And then we read that data, we convert that some open profiling formats and push that in a server site where we can just aggregate and visualize that and let the users actually make sense of their programs. So this is how that whole thing actually works. There are a lot of details, but this talk is not actually about the internals of Parker, but we are doing a lot of cool stuff to make the stack collection and symbolization very efficient. And then the end result is a UI like this. In a continuous timeline, you can see that what's going on on your CPU for each process. Then we collect a lot of metadata and enrich those information for you so that you can query, compare, and see how you can improve your program. And the agent is kind of super cool because you can just install any host machine and any process that you have on that machine, we can just collect data and send to the server and you can see that in the UI. This doesn't necessarily, it's not scoped to the Python itself, but it does a lot of cool stuff with the Python as well. So there's not a Python stack, but we will see some examples, but this is some, I think this is a Go one, but you can see that the stacks are easily getting really deep. So what is the stack unwinding? This is the next critical thing that we need to talk about because the whole, like the what makes profiling challenging, especially from the Python side, is actually to be able to unwind the stack. So when a program gets executed, probably you all heard these in your start of your education. There are specific structures when the process actually allocates in the memory, which is one is stack and one is heap and the stack actually tracks the execution of the program and whenever you call a function, you open a frame and you change the states of your registers and then you keep adding everything to the stack and when one of the functions that returns from the leaf, you just go back and return the data to your user. So I might be oversimplified that, but it's a diagram just to show you how it looked like, but the end result when you unwind the stack and aggregate all these function addresses, you get something like that. It's just the machine addresses and now you need to find a way to translate those machine addresses to the human readable format. So all these parts for the native code, so anything that actually runs on your CPU. So that brings us actually the next step. So this is a state where we didn't implement the Python unwind for parka and you can actually, this is an interactive one. So you can see anything that gets from like your kernel, there's a start thread, this is coming from libc and all these green things that you see, these are coming from the Python interpreter itself because Python interpreter is written in C and it's compiled and then directly gets executed on the CPU, but probably this is not useful for you, right? You are Python developers. You actually want to see what's happening in the Python process itself, not the underlying infrastructure. That being said, we also know that most of the Python applications also rely on the C bits and the native code bits and it could be calling some C function here and there. Then when that happens, these are actually, gets like super important, right? For example, PyTorch, it's very popular nowadays in the machine learning workload, but it's actually funneling everything into a native code and when you want to see what's going on the native code, parka actually can do that as well. And we do that in a very efficient way. You don't need to have, there's a whole concept of frame pointers and that actually helps us to unwind the stack. We just gave another talk in the observability room like why frame pointers are cool, but you don't have to have a need to have the frame pointers itself because there's also another facility with the devolve debug information, you can unwind the stack. So parka actually utilizes that. This is important because most of the packages that you can find on any of the Linux distribution, you wouldn't find frame pointers. But with the devolve information, you can actually unwind the stack and you can see all these goals. So, but we want more, right? We want to see the Python code, so how we do actually do that. So this is where it comes to, where we unwind the stack virtually, with virtual stack, we mean that anything that gets executed in the Python interpreter, we need to find those stacks and put those things in our flame graphs so we can see that like where is the problematic areas in our Python code itself. So everything starts with opening the Python runtime and reading the code. This is the huge structure, like if you know the Python internals, like it's long. There are a lot of comments, but it's not the easiest code to read and it's not the easiest code to reason about. Let's focus on like what is the important bits, right? We care about the interpreter state and from that bit, we want to capture what's going on in each thread, right? It comes from the interpreter state and then we try to find the PyState itself. The PyThread structure, it's like a link list, so whenever you have multiple threads running in an interpreter, you need to traverse the whole link list and for each thread, you actually need to do that. But also you need to find out which thread actually captures the guild and globally interpreter luck so that that's the one actually executing the code. So from finding all those information from thread state, you check the thread state. Oh, yeah, it's another like pages long C code that we need to reason about. It's not the easiest thing, but this is how reverse engineering kind of works. And again, I extracted the important bits. So from that thread state, we need to find what is the current frame is actually executing so that we can online from there, right? That's actually the same thing that we are doing with the native stack, but rather than checking some registries and reading roll memory addresses, we are actually checking the Python in turn itself. So from the interpreter frame, it's actually easy, like whatever we need, it's here. So all the information we need is here. We have the pointers to the previous stack and we can actually do the same thing. So I'm going to speed things up. Yeah, we have the map. We know the source code, but where do we actually start? When we have an object file from the Python interpreter, we first need to find the where does all these tracks are actually live. So we check the entry point of a Python interpreter. We see that it's linked against a live Python. We go and check the names for one of these tracks symbols and we actually see that there are some offsets that are located there. But this is just from the binary. We don't know what these addresses mean when a process started. So this happens because this is just one of the reasons, but when you get a binary and run the process out of the binary, there is some address randomization and all those addresses need to be translated to that. So how do we do that? We just run a Python interpreter and check what's going on in the process. This is basically memory mappings and it shows you where actually Linux maps the certain objects and we check out the live Python. Grab the base address and from the addresses that we find from the symbol or draft, the morph information, we actually find where the structs are actually located in the memory. We are looking for those. And from that, now we need to read that data. So here comes the GDB. GDB is like an amazing debugging tool and we jump into the process and start to poke around, we define a macro and to calculate the offsets of a struct which reads from the devolve information, you say that, okay, give me this struct and this field and it gives you the offset of that. Since we already have the start address of the memory, we just calculate the next address and read the data from that. But as you can see, this is very manual labor. We cannot do this for each and every Python version or implementation out there. So we do this with another project ahead of time. We use Rust-Pyne-Gen for that which was super convenient because PySpy was using that. We just grabbed some offsets and generated all these things for a couple of versions. But we are also working on a devolve-based reader which is more scalable. You just grab any binary, read the debug information and calculate all the offsets. From those offsets, we generate this struct which we think over the kernel space. It's like a map, where to find the fields and everything. And the nice part that the whole things that we are working on, it's going to be deprecated soon because Python... This is life of a software engineer when you do reverse engineering. So something super cool happened in the Python main branch. Now they have this debug offset data structure value. They generate all those sets and put that in just the beginning of the py runtime. We just can grab the address and just read the first chunk of the thing. How we got those that we don't need to do this ahead of time things right now. So this is already a merge and it's going to be released with the Python 3.0 in 13.0. It's also huge, lots of stuff that you need to find out. Okay, actual unwinding stack. So these are where the EVPS comes from. We did all the magic, we got the offsets, we put that thing into interpreter info structure and put that in an EVPF map so that the EVPF program can read in the kernel. And this C code is in the kernel itself. You check something and get the interpreter info. This is the user space code where we actually calculate all the addresses and send and put to the EVPF map itself. And from that we also grab this offset data that we calculated. We just do check some versions and find the runtime version of that particular Python interpreter, read the offsets and from that offsets we calculate that. But again, like 3.13, this is, this will be futile. So then we read, try to read the, oops, rate data from the thread state, find those structures and read the pointers and try to find out where to go from there. Five minutes left, I need to be super quick. Okay, so we find the, we try to find the initial pointer for the virtual frames. This is how we do it. And then from that we start walking the stack. The key points here is just from the previous code that you can see, we actually put just line 13, put something into a state frame pointer and we then read that frame pointer from another EVPF program. And from that pointer we basically find the offset of the frame object where the code points to, read that row address with some VPF helper code. And from that we read the symbol because like the addresses that we saw that they don't mean anything to us. We are humans and we need some human readable data. And from that to be efficient because we keep seeing the same stack traces, we just like hash it and put that symbol somewhere so that if we see that we don't need to send that same symbol to the user space because it's costly. And then we also encode the line number because like symbols just represents a class, then a function and then there's a line number in that function that is different. This API also recently stabilizing the Python. So for the old Python versions this line number could be wrong but the read after 3.10 it's actually, it should be accurate. And then we encode that as well and send it to the user space. This is the reading the symbol parts. Like the code is like super complicated. I just highlighted the GDP outputs because it's like easier to read. So you read this nested structures and find the actual type name, then the file name, then the name of the object code and the first line in that function and put an encode that is a symbol so that it means something for the humans. Voila, now we finally have Python unwinded stack. But as you can see there are lots of things going, happening and most of the things are interpreters and it doesn't make any sense. But we have this cool UI, you can actually, these are like color coded, you can, from the color code you can actually highlight like what's going on in the interpreter, what happens to the libc, libpython, you can see everything. Again, we want to focus on the Python bits, right? You can just filter out the Python code and see that it's recursively calling and calculating some Fibonacci numbers. Apparently, yeah, it's inefficient so you need to optimize this. Yeah, you can just tell by like that list of the stack, okay, yeah. You don't need to know the details of how to read the flame graph. But good thing for you, we also have a blog post for that. You can check it out. So I guess we are nearly out of time. So we support a couple of interpreters like 2.7, we still support that. So if you happen to have that, we support everything until 3.11. We are working on the 3.12 because 3.12 changes where the actual trade state also stored, which is the trade local storage. We are working on the facilities to read the state from the trade local storage. It shouldn't take more than a couple of weeks to be that support actually landed. And 3.13 will be there so we don't need to do this again, basically, for the next version of the Python. So everything you can see, please try, install and give us a feedback. There's this QR code. It's a GitHub discussion. You can just like engage with us and report bugs. And we can try to help you to profile and optimize your Python workloads. All the things that you see here, there's also a blog post if you want to catch up. You can check the company's blog post. And also we find the DevoreFind on winding bits like super cool because it's a niche thing that we do. And if you especially like the, if you have an application that follows everything to the native code, this will be super useful. Thank you for listening. Thank you very much, Camille. I'm afraid we're quite tight on the schedule, but please feel free to reach out to him with any questions. And thank you very much. Thank you.
Python 3.12's new monitoring and debugging API
It's time. So thank you very much to Johannes Bergberg. He will be speaking about Python 3.12 new monitoring and debugging API. For those who were in the previous talk, there was a brief about the profiling features. Johannes is a JVM developer working on profilers at SAP. He also writes blogs about profiling and debugging topics. Thank you very much. Thank you for introducing me. Before we start, I want to introduce you to the concept of debugging, because I'm sure nobody of you have ever debunked. So the first bug that's what's like found was in the 50s when they found a moth that was in between their relays and it makes zip and the whole system like crash because like in the olden days it relays. At SCADAISCAR once wrote, if debugging is the process of removing Zafferbacks and pro pro-peering must be the process of putting them in. As I'm sure all of you are doing lots of programming, I'm sure you're also doing lots of debugging. So that's what we're here. So consider this example program. It's called a counterprong. It just counts the lines in this example in itself and it returns zero. And we're like, why? So that's the problem because the file actually has 26 lines. So let's look at the code. I'm going to see clans of the code. Make this shortly so you don't see what it's about. But the idea is we use the debugger through this because the debugger is a great to understand our system. And the cool thing is now with the new APIs that we get in Python 3.12, writing the debugger is far easier and far faster as I show you in the following. I'm Johannes Pechberger as you already heard. I work at SAP machine, which is the third biggest contributor to the OpenJK, which is like the major Java runtime. And I started talking to people about Python because I also like Python. So it's a bit easier to VM than JVM. The question is now, why do we need a monitoring and debugging API? Because when I'm from the Java world and in Java, we haven't built in debugging API. So we have the ability to set breakpoints, to ask for values, to walk the stack and have everything. But in Python, does the Python interpreter know about the concept of breakpoints? So because I'm here, not only, but with a few here, who of you thinks that the Python interpreter knows about the concept of breakpoints? Please raise your hand. And two of your things. It doesn't know about the concept of breakpoints. Okay. It's a trick question, of course, no, because otherwise I wouldn't be asking this question. So it doesn't know anything about breakpoints, which is not a bad thing. So any ideas how we could implement it? So the first idea that came to my mind was like, we have this code. This is actually the code that was part of the code. So the idea was we just place in front of every line a debug statement, like a debug method that we define somewhere. The idea is simply put in the debug method, we check are we currently at a breakpoint in this file online, and if yes, we open some kind of debank shell. If you've ever used PDB before, that's essentially what, so it's the PDB shoulder, we could be opening. But the question is how did we get this file online? And the easy answer is we have this get frame method. It has an underscore, and the important thing is it has an underscore because it's kind of in C-Python implementation detail, which is great. Because it's pretty slow in PyPy. But we have to live with it because that's the only way we can walk the stack. We've seen before that we can do some EBS stuff, which is nice, but usually most profiles, not most debuggers, don't do it. So the idea is here, we have O stack, and the bottom is like the main, and then the count count line is code line, and then our debug method, and essentially what we do, we can ask get frame, the zero frame, that's the top frame, because we currently in the debugging method, and so we ask it for the frame one. And also get it from the other frames, and essentially what we can get is get information on like which are the local variables, which is the file name, which is the line number, and such, and that's quite nice because this allows us to easily implement the debugging shell because we can just open the shell that contains these locals. And so that's how we implement our first debugging method. And it's nice, and it works, and we can even write some basic debuggers with it. The problem is we want to automate this because we don't want to put this DBG statement in front of everything. So how do we do this? And first I'm going to tell you about the pre-3.12 way so you know the pain points of debugger developers here. The pre-3.12 way was the way of this dot set trace, which was an arcane way to do it, but the idea is essentially we pass it a handler and this handler is called multiple times. Essentially this handler gets parsed the frame type, the frame, and an event type, which could be like call line return or exceptional opcode, and it would be called regular time. So when we then register it, we register a specific handler and this handler then is called at every call. So every time the method called call lines is called, it's called, and every time then also a method is called line, this use here is called. And that's nice, but we want no more. We want to know also we want to get a handler on every line. So what we do, we can return from this handler and inner handler that's called for every line and this has the same signature. So the idea is we essentially implement our debugger here by not using like our writing manual here with the DBG but just setting an inner handler. This is called at every line. And that's quite nice because it works, but you might expect to show later it's quite slow. But it's okay. We can even go to the opcode level, to the bytecode level in here. But the problem is, and the question here is do we need a line event for every function? Because we know when we set a breakpoint somewhere, we only need to like, set a breakpoint, set, we only need to like check the lines there. But for example, consider that we have here this con-con-lines and our user decides that he adds a breakpoint when we're in isCodeLine. So it's a breakpoint also and isCodeLine and in isCodeLine decides, hey, I want to add a breakpoint into code con-lines. The problem is when we haven't passed like inner handler to the method here for con-con-lines, we can't enable line tracing for con-con-lines. So we have to enable it for every, every line of our whole application, which is kind of a mess. So this is slow. So there are multiple ideas how to improve this. And one of the best ideas, and if you're a Python developer, you should do this. If you're a Python core developer, is to add a new API. And this API is called Python, and this API is defined in the PEP669 and it's really, really cool. And the cool thing is also this PEP is written in a style that you can easily digest. I come from the Java world. This is not always the case with the Java PEP. So I'm with the JEP, so I'm quite happy that Python does things a little bit better. So the JEP is called Low Impact Monitoring for C Python and hopefully other runtimes will support it in the future. And it's here since October. So the idea here is that we have more fine-train support, that we learn from the lessons that like having to enable the line handlers for every line is probably not the best thing. So typically when we develop, when we use this PEP, we define some shortcuts here in the top. So we don't run to write, is this not monitoring all the time because that's where the monitoring functions live. So we call it mon and events, it's also bit long, so we call it mon not events. Then we have to assign it the tool ideas, tool ideas. So the idea is that you can have multiple tools that are registered here. And for each tool we register some callbacks. So what we do here in our example, we register a callback for our tool. Our tool is a debugger, they're like six possible tool ideas, a tool idea is one of them is a debugger, another one is a profile. And we register here callback for the line because we want to still have like line callbacks sometimes. And we also register a callback for the start function. So for the start event, when a method is called, then we have these handlers and the start handler is just passed like the code object. That's what you get when you form a method for function called f underscore code. And the offset word is located in a byte code and for line handler, we get the line word in. And the cool thing is here we can return from this, as you see in the bottom, the line handler also can return either disable or any. And the cool thing is when we disable, it's disabled all the time for this specific line and it's called for coverage. So we can also make coverage testing easier. So yes, we enable them the start events and that's fine. And then we run our program, we get the start event for every function that we call and then at every time we ask, hey, do we have a break point as an function? If yes, we enable the line here. But only specifically for this function. And then for every line we check it. The cool thing is that these line events are emitted per fret. So the ideas that were set, sister, etc., it was emitted like in the main fret per interpreter. But here it's emitted for every line in the fret that the function is currently executing. And this is really cool. And Lukas Lange wrote in a PR discussion, the biggest opportunity of PEP 6.9 isn't even the speed. It's the fact that the debugger built on top of it will automatically support all threats and supports threats properly. And with the incoming changes with PEP 703, with making the global interpreter lock option in Cpython, it will get far more important. Because with then we will probably see multi-threading Python applications and then the old approach is just not usable anymore. So the idea is that we can enable events globally and locally. And the sum of both is like they're enabled events per function. And the cool thing here is the power is in the fine-train configuration. So you can set events in a function f for the function f while this function is running. So consider this example here. We have a simple line handler here and you register a callback for each line. And then in f you decide at some point, hey, I want to set the local events. I want enable line events. And later you disable them. And it works. So here we emit hello and then we emit like, hey, we're at line 18. We're just like the line that prints n, then we print n. And that's really cool. That wasn't possible before. That's really great because this enables us to only enable line events for the functions that we will need them. And so the question is, of course, what's far as there are several methods in this PEP and this API. And what's really fast is to register callbacks. We can easily switch out the callbacks and get the tools so we can say, oh, please get the tool ID. What's kind of fast is that setting local events because where it does it modifies the bytecode to the VMS executing. And what's pretty slow is to use the tool ID to start the debugger here and to set the global events because this potentially recompars or modifies all bytecode of every function. So do it all the way. So then it's fine. So back to the debugger. We had here our start handler and our line handler. And they look essentially the same as before. The only difference is that we enable the line breakpoints when we're needed. So they're different than events kinds because we've seen that line events are pretty powerful for implementing basic debuggers. One of them we've seen already the start events. There's the regime return, yield events for everything that you do in your Python application. And there are also then in-signary events. These are events that aren't like an A that you can't enable or disable because they come from, they are controlled by other events. So for example, you have to call event that is triggered whenever you call a function and then we have C relighted, a C raise whenever exception is flown in C code or whenever fact a C function returns. And there are of course then also other events that are not enabled locally but only globally. Essentially the idea is we cannot locate them properly. And the cool thing here is maybe you've seen that we have a new event called stop iteration because we think it was in this Python version we're not using in iterators. When we were returning from iterator previously we wrote an exception but that's pretty slow. So we don't throw an exception anymore but to debug this to still notice it we have a new stop iteration event. Of course what you'll be waiting for is performance because the performance is like the thing that besides the threading support is pretty neat. So what I did I looked around and I found some people also doing some performances but they were using Fibonacci functions. Now I'm like that's a bit small, that's not representative. So I started looking into Python benchmarking suites and there's this pi performance benchmarking suite and I just hacked it. I just wrote random code in it because you can do some kind of monkey patching in Python and it's great. In Java we have like private functions and everything in Python you don't have to care and that's why you like using Python you can do things that you're not supposed to do to get some change results. So if you want I'm using Python all the time when fixing bugs in the OpenJK to write test suite because it's faster in Python than to do in Java so you get the OpenJK, some bugs were fixed because I wrote some weird Python script here. But essentially what I was then going to test is I wanted to check the minimal implementation of a debugger. So minimal implementation with such trace, so debugger that doesn't have any breakpoints here is just call handler and then an inner handler calls it every line. Then the minimal implementation for monitoring API wouldn't enable any line events because when we don't set a breakpoint we don't need to set any line events. So that's how we implement this but I thought like it's a bit sneaky because we're comparing something that triggers an event that is relying with something that only triggers an event every function calls are also made a third comparison with like setting all the line events and it turns out that's still faster which is quite nice. So I use this Py for matchpoint suite that's quite representable and what I found is that it's really, really fast. So the CessetRace when you run it, when you run all the five benchmarks with CessetRace on you have a 3.5 times larger runtime. That's pretty slow. When you're using monitoring you only have a runtime increase of like a factor of 1.2 which is like 20% slower which is pretty, pretty awesome because this means you can debug all your tight loops, you can debug your whole applications without worrying about like debating slowing down and when you enable all line events it's still 30% faster than with CessetRace and people here probably like RAS. That's essentially all the benchmarks that are in Py performance and what you see here are the orange bars. These are the bars for the monitoring solutions and here it's just at one so if the bar is not visible it means you don't have any overhead in this benchmark but essentially you see that the blue bars for CessetRace can get high, can get up to like 10, 12 so it's really good. At least in my opinion and then when we switch over and use CessetRace monitoring with all line events enabled it gets worse but still we see that it's still significantly faster. Another question is of course is this whole thing used because it's implemented and I'm working in OmTek so I noted when you implement a cool feature the chances are that nobody will use it for like a year but here in OmTek people started, but here in CPython people started using it especially the vendors like PyCharm. So it's not yet used in PVP and they're showing the second pie but IDEs like PyCharm with their version 2093.3 use it and they've seen significant performance improvements. And there's currently a pending pull request on GitHub so if you want to help PVP implement it go to this pull request and try another discussion here so I would really recommend it's an open source project to CPython and you can make PVP better so what not to like is simple. Here's a quote from the pull request from Chen Gao who wrote this pull request and he wrote after this change we will have the chance to build a much faster debugger for break points we don't need to trigger trace functions all the time and change for a line number so I'll show it to you. The bad news is it's almost impossible to do a completely backward compatible transition because the mechanism is quite different. So there's an ongoing discussion how to do this. You could take part there so scan this QR code, be part of the community, give something back and not just use CPython. So because I have like tiny tiny town left I want to just show you shortly how single stepping works because single stepping is just break points because essentially the idea is here when we have always take here with the Scona and step out of this for example we just check for the next line where only the frame before changed like the current code lines changed. Stepping is also pretty simple we just check that only like the line number changed it's also nice and then check in to is we check the next time where we just put the frame on top. So that's all from me I'm part of Northern Twitter you can find my team at sub machine A O so if you want to use a JVM use the sub machine it's the best JVM. I'm contractually obliged to say this. We work at SAP we're one of the many cool open source projects at SAP you can follow me on my blog where I write on DBegging, EBPS stuff and everything else. So thank you for being here. Thank you very much Johans we have time for probably two three questions maybe. Does anyone we have one there? Thank you very much for this talk and for this pep because it actually solves a lot of problems I had when I started back in the days developing a tool for performance analysis for Python. However at some point choose to use the C interface of set rays and profiling and whatever do your does your proposal as I said is implemented already also support the C interface? So I have to correct I'm not I have nothing to do with the nice people who implemented this sorry but so please ask them they're probably in some discord somewhere I'm just telling you about the good news because programmers usually don't want to go to conferences and speak in front of people so that's why I'll be giving talks on this. So sadly I don't know. Thank you do we have any more questions to Johans? Can raise your hand. No questions apparently. I just want to choose this opportunity to thank Mark Andre and David also known as Flypeg for organizing this dev room. You guys did an amazing work. Thank you very very much. And thanks Johans again.
Deploy Your Next Python App with WebAssembly (Wasm): Smaller, Safer, Faster
Okay, it's 4 p.m. now, so we can start with the next talk. So Dan Phillips is going to talk about deploying Python on Wasm. And what's interesting about this is there's been a lot of talking about running Python in the browser, and I think Dan is going to talk about running Wasm on a server, which is something completely new, at least to me. Thank you very much. Give him a big welcome. Thanks. Can everyone can hear me, yeah? Perfect. Yeah, so thanks very much for the intro. Feedback, no? Yeah, a little bit. Okay. Yeah, thanks again. My name is Dan Phillips, and today I'm going to be talking about deploying Python on Web Assembly, and the sort of tagline here is smaller, safer, faster, and universal. The one of these that should have an asterisk is faster, and I'll get into the details on that shortly. Briefly about me, I am an engineer at a company called Loophole Labs. I'm here with a few of my colleagues down front. We do very specific cloud primitives where we work in specific areas, so I mostly focus on Web Assembly stuff. We have a project called the Scale Function Runtime, the Scale Plugin Runtime, which is a Web Assembly-powered function runtime. You can check it out at scale.sh. I'll talk briefly about it at some other points. On the internet, I'm mostly D-Unscore-Filla. Of course, on GitHub, since they don't allow underscores, just D-Filla without an underscore or a space. And I'm from Chicago, so in Chicago, I run the Web Assembly Chicago group. If you're ever in town, we'd love to have you stop by, or if you're interested, we also do all of our meetings virtually as well. Okay, so we'll just jump right into it then. What is Web Assembly? We're going to sort of start at the more abstract level and then get down into the weeds a bit, not too far into the weeds, but into them enough so that we can get a good grasp on what the constraints are and what the benefits are. So firstly, this is directly from the spec itself. Web Assembly-abbreviated WASM is a safe, portable, low-level code format, which is designed for efficient execution and compact representation. There's a debate about the pronunciation of WASM, WASM, doesn't really matter. Technically, it's probably WASM because it's from the precursor, which was ASM.js, if anyone has used that previously, which was a set of JavaScript primitives that allowed you to run more performant JavaScript code, C and C++ code in the browser. The WASM project sort of came from that. It's a safe sandbox execution environment, a deny-by-default security model, and it makes no assumption about languages or host. The best analogy is that it's a virtualized CPU, right? So we can think of it as just another compilation target. To continue, just like when we have architectures like x86, etc., we compile things to those architectures and we also compile things to WASM. It's a virtualized ISA, right? So when we think about ISAs, we have the things I just mentioned, but this is a virtualized instruction set. And it's virtualized because it needs a runtime, right? And it uses a stack machine model for the execution. So we're not going to go too much into the specifics, but this is kind of the high-level overview. So what does that really mean, right? In the broad sense, it's really just another architecture. The key differences are it's virtualized. It needs a runtime to translate to machine code, right? So in browsers, every major browser has a WASM runtime in it. The four major ones do, at least, and they're all relatively up-to-date with the spec. There are also server-side runtimes. There are runtimes that you can run in kernel-free environments, etc. We'll kind of talk about a few of those. And it's universal, by which I mean it's universal in as much as you have a runtime on the machine that you are going to run the WASM on, right? So people have played with WebAssembly here. We just mentioned that there's been some really interesting work with PyScript, PyDyde to run Python in the browser. Initially it started as a client-side tech, right? We go back to the spec here. This is right from the spec that says a safe sandbox execution environment, which makes no assumption about languages or host. And if we dig into that a bit more, we can understand some other benefits here. It was designed to be extremely compact and to start up extremely fast and shut down quickly, right? So cold start times can be in the nano-to-microsecond range. And again, a universal compilation target. This is a joke in the WASM community that WASM is neither Web nor is it assembly, necessarily, right? In the spec, there's no mention of the Web itself. There's no mention of assembly. It's just, I think it's just kind of a nice combination of words. So server-side WebAssembly is an interesting point that's come up recently. And one of the things that I like to think of it as is, it was sort of cloud infrastructure as penicillin moment, right? Where there was this technology that was clearly designed to be extremely performance-safe and be able to be overnight shipped to billions of different machines. And that's because this is what the browser commands, right? So this was something that if we took this same idea and we kind of squinted, it kind of starts to resemble other technologies that we see in other areas. Some people have made this argument that we went from things like bare metal to VMs, to containers, and then possibly next, WebAssembly, right? So we're going to call our safer faster and much more universal. The faster asterisk, asterisk here I did remember. And when we talk about faster, again, it needs a runtime. So you might think that it's not as fast as native code, although sometimes it is. And if it's not, it's pretty close. The faster argument here really comes from the fact that you can start out very, very quickly. This is a somewhat interesting tweet from Solomon Hikes. Oh, no. Of course, I'm not connected to the internet. I'm sorry. It was difficult for me to find this room. It was very, very quick. If this works. Yeah. That's my plan. Okay. That's fine. Well, at least the slides work for now. Okay. Well, basically, the founder of Docker. Oh, no. Okay. No. There we go. There we go. Open source technology comes through. There we are. Excellent. Okay. So this guy, Solomon Hikes, he's a great developer. He's a great developer. He's a great developer. He's a great developer. There we are. Excellent. Okay. So this guy, Solomon Hikes said this a few years ago, which probably got a lot of VCs very excited in 2019 that said if Wazemann Wazey had existed in 2008, they would never need to create Docker. That's how important it is. Right? And this has to do with sort of the fundamentals of the technology, which allow the things that Docker aims to do and aim to do. Initially, it allows you to do those same things at a smaller, safer, faster level. We'll kind of talk about that as we get into this. All right. Okay. Maybe. Good. Blah, blah, blah. Good. Wazey. Who's heard of Wazey? Anyone here? Great. Oh, a lot of people. Good. So the Wazem system interface, this was initially started as a POSIX interface in 2019. It's gone through a couple iterations. The big thing with Wazey is that it's capability-based security. They borrowed some things from Plan 9, the operating system that came after Unix, trying to fix some of the mistakes with Unix, to sort of think about resources as things that are granted permissions to use and act upon. It's an evolving standard. We've had Preview 1, Preview 2 actually just got released last week, and Preview 3 will be in the future. Preview 2 brings some big things like networking support, which is obviously a big thing if you'd like to have real applications that run. Preview 3, which should come out next year, is going to have ASync support, which is also a very, very big one. And pertinent to this talk specifically about Python. And then we will get to 1.0, probably in the next year or two. Is it required to run WebAssembly on the server? Do you need Wazey? You do not. You just need a WebAssembly-compliant runtime that can at least be compliant with 1.0. So we've done some things, not always using Wazey itself, just using a standard WebAssembly runtime on the server. Okay, so briefly, this is a project that we work on at my company. This is the scale framework. I'm just going to bring this up to sort of show what a good fit WebAssembly can be in very specific areas before I jump into the Python stuff. So scale is a plug-in framework. You can also think of it as a serverless function runtime. Serverless functions, serverless itself has kind of become popular in recent years. A lot of serverless architectures use containers. The problem with containers, especially with things like Python and maybe like Ruby and things that have a slightly larger runtime is that they can be a little bit slow to start up. So there's a lot of trickery that has to go into bringing them up to speed, keeping them hot, moving them around. With WebAssembly, you can do all this very, very quickly at orders of magnitude of speed faster and size smaller. It also allows you to do some very interesting things, which means you can do polyglot programming, different languages, in the same runtime environment. So with scale, you can do things like run, rust, type script, go, all in the same WebAssembly environments. And we've sort of figured out what we think is a good UX for doing that without having to do a lot of low-level programming to deal with types and passing them across different environments. Yeah, so if you'd like to check that out, feel free. Basically, it's very simple. Scale.sh, you do scale new, scale build, scale run, and you can do something as simple as this. I know this is go, but it's relatively simple, pretty straightforward. Basically, it gives you a function. You do stuff in that function. You can pass it to a function written in a different language, and you can get information back. You can put this in front or behind HTTP requests. You can use middleware from other languages in other languages. So we did some interesting stuff with go, where we took go's regex library and swapped it with rusts. And using this plug-in framework, we found that the regex speed improved four times faster than go's native library using rust this way. And you didn't even have to think about the rust that was happening. So that was a pretty fun example. So to continue, blah, blah, blah. Good. So building Python. A lot of people here probably build Python from scratch on native platforms, right? So interesting thing about Python is you need Python to build Python, which is kind of fun. Some assumptions that Python has is that it's going to be on a Unix or Unix-like operating system. There's going to be a file system. This is very important to Python in particular. Dynamic linking. This is something that certain builds think of. WebAssembly has no concept of this. We'll talk about that. Also, that there are going to be syscalls and a libc and or. So some of the pain points with building WebAssembly, building a Python distribution with Wasm is that there's a limited number of supported syscalls, right? In Wasi specifically, there's no pthreads, right? So green threads, that's kind of out the window. That's a tough one. There's no socket APIs. This is also a very big one, right? This makes it very, very difficult if you've ever used things like PyDyn and PyScript. They can do some interesting things where they sort of overlay on WebSockets to sort of emulate behavior. This is done using a tool called MScripten, which is one of the earliest WebAssembly compilation tools, which allows you to take C and C++ libraries, compile them to Wasm, and then have bindings in JavaScript to sort of mimic some of the system behavior. There's non-comprehensive signal support. This is also a very, very big one. So if you want more details on this, this could be a whole talk by itself, and indeed it was. You should check out Christian Heinz's talk from Wasm Day in 2022. He is one of the four maintainers of the Wasm Python project, and he, I believe, also works on PyDyn. It goes into all these pain points in excruciating detail. It's an excellent talk. So, but this talk is about actually deploying something, doing this, right? Right? So if you've tried to use WebAssembly, you might have realized it's a little bit hard to use because there's certain things that are very low level that you have to do, a.k.a. like, you know, doing things with data, getting data in and out of a running Wasm module, communicating between the guest and the host, depending on what your implementation is. So there's a project that I put together. It's called Boxer, and I decided to try to take something that is well known to most people in this world, along with some other ones, which is a container declaration, a Docker file, and plug it into this tool and spit out a Wasm binary, right? Plus the runtime. You can check it out, boxer.dev. It's experimental right now, but I'm going to be demoing it here in a second. So what is in one of these? Originally it was called Wasm Boxer, but now I'm just calling it Boxer. The, what is inside a box? You have a base layer similar to a container image if anyone here is dug into how containers work. The base layer sort of sets up the imports and exports for the Wasm module. This mimics the sort of interfaces that you might find with syscalls and libc in the traditional operating system. Then you also have a virtualized file system, virtualized syscode subs. So you have actual POSIX based file system calls that actually work in a virtualized environment, and then you also have things that aren't supported, stubbed out. You have a compiled runtime, which in this case is Python. And then you have the user source code, which also gets passed in. Okay. And this is very important. WebAssembly modules only understand about the outside world, imports and exports. That's it. You kind of think of it like a very, very simplified inter-process communication from the Unix world. Okay. So, I kind of went through this before. I used a tool called Wiser. Wiser is a really, really great WebAssembly tool that allows you to combine WebAssembly modules, link them together, do snapshotting, things like that. But basically, this taking an example like this, the big caveat here is that the A. This is from a C binary. That binary must be a WebAssembly binary. That's the key difference. Right? Cool. So, for the sake of time, I'm going to kind of move a little quicker. Python really, really needs a file system, right, when you want to build Python. Okay. So, one of the very important things here is that we need a POSIX-based FS. And you could do this with Wazzy, where you can go down into the underlying host and use the host's file system. But those could obviously be different across different distributions. So, one solution to this was building a virtualized file system, which we did here. This is a small project of mine, also very experimental. This is the Wazzyn VFS. And what this gives you basically is very familiar sys call slash live-c calls that you might see across different distros. And the point here is that you could do this all in a virtualized space. It could be in the same Wazzyn module, or it could be in a different one. Right? So, yeah. Cool. The demo. Yes, let's do it. So, Wazzyn Boxer. So, here we have like one of the simplest, this is right from the container registry, right? From the official Python registry. So, the very, very simplest thing we could do is from Python 3, set up the working directory, copy from the host's OS to the guest, which in this case is the WebAssembly build, and then use the command directive to run the actual script. So, what does this look like? Basically, looks like this. This is a very familiar command. Box build, passing the Dockerfile. Ignore my Rust warnings, which I haven't fixed yet. Cool. So, yeah. You'll see here, we have the build started. It found the base image, which in this case is just the sort of interfaces. It's building and bundling the runtime, the standard library, the source code, and the file system. This can take a while. It was cached, so that didn't take as long. And it bundled it, and it's complete. Cool. So, we have that, which means we have our box set up and ready to go. It's just like the sort of purpose of this, that's just like building a container image, you have it ready to execute. Then we just do a very simple box run, and that's Python code in WebAssembly. And what that Python code is, is just a very simple square root operation, right, and printing it out. So, that's that. So, from these perspectives, you may not even really notice a difference, but we'll kind of talk about what the specific differences are in the few minutes I have left here. So, there are caveats, and we're going to talk through why these caveats are going to sort of improve, and why things are a little difficult right now, but there are some ways around these, but they are difficult. Threads, like I said, threads out of the box, they don't work. There are things you can do, and there are some things that some people have experiments with. Wasm is single threaded right now by default, and it won't not be for at least a couple years. But there are ways that you can do some really interesting things with stack switching. You can do pause and resume on the stack, and you can make it almost as if it is in an async sort of programming environment. Networking, like I said, there are ways to do networking. You just have to do it all yourself. So, you have to bring it all yourself, right? So, you don't have an operating system, but people have done things like taking the kernel networking stack, bringing parts of it in, allowing those to be exports as host functions, and then calling out to the underlying OS that way. Native dependencies, this is a big problem. This is a big problem that they faced with PyScript. People who have done scientific computing in this room would probably understand that there are some really interesting languages being used under the hood in Python libraries, things like Fortran, I believe, right? And so, there is no official Fortran to WebAssembly compiler yet. So, what people are doing is, but the thing is, is that if there could be, you could then, if you think about native dependencies on different platforms, how much of a pain that is, there is a possible way forward in the future, which could be a different talk maybe, that you have one dependency and it's in WebAssembly, and that's the only one you ever need because you can have a WebAssembly runtime on every machine, right? That's kind of an interesting thing that some people are exploring. So, this is a cool thing I had chatGPT make for me, but it's a nice little illustration. You got a container, that's an app, and then you have a box, that's an app. That kind of highlights the differences in metrics, and what are those metrics? Well, containers, so a Python distribution, a Python container might be anywhere from 800 to 900 megabytes or more. The start-up speed of this container could be 800 milliseconds to two seconds. A security model with container run times are shared kernels, and with boxes or with a WebAssembly distribution of this kind, you can get closer to something like 16 megabytes for the size, right? Even less if you don't use the entire standard library, right? I've seen people do it as low as five, I think, when they cut certain things out. Start-up speeds could be 100 microseconds to about one millisecond for this exact build. The security model is a virtualized sandboxed machine code execution, right? So, you're at a different level of abstraction when you're talking about virtualization. In fact, Docker really isn't a virtualization technology as much as it is sort of a sandboxing framework, in certain ways, right? Again, that's probably a different talk, but also kind of interesting. So, the future, what does it hold? Full support for libc, Cisco interfaces. There are some really interesting work going on with this, where people have taken parts of the Linux kernel and made them available in Wasm, doing things like emulating signals, doing things like emulating threads with, like I said, different techniques. Then what that gives you is kind of a pluggable networking stack that you can use in different environments, and maybe even a slight paradigm shift where you might not even need a kernel in some cases, if it's all in WebAssembly and the WebAssembly runtime acts as the kernel in certain ways. So, this is an interesting technique that people in the embedded space have kind of done some cool work with. This is something where if you can compile the, if you can have a compiled Wasm runtime on bare metal, everything you need is in Wasm. This is great for people in the embedded space because they don't need to reflash their devices every time they update their code, which is a very difficult thing to do. You can just swap out a tiny WebAssembly module. Wally is also a very interesting one. This is the WebAssembly Linux interface. This is a project that came out of Carnegie Mellon just last year, and they're doing a lot of interesting work on sort of the emulation of Linux syscalls that make them available so that you can run things sort of out of the box. And like I said, with the embedded stuff, bare metal runtimes with a tiny little unicernel has allowed people to run Wasm in really, really small spaces, doing some really, really interesting things too. And what this does give you, then if we think all the way back to the beginning with browsers, browsers have a Wasm runtime, servers have runtimes, your phone has the runtime, you could have the runtime in an embedded device, in a controller, you could have it in all these different places. So you kind of have this sort of new true isomorphism where you can actually run the same code everywhere, and then what that does, it kind of pushes the problem to a new interesting space in distributed computing where it's how do we orchestrate these things, and how do we move them around, and how do we make them available. Cool, thanks very much. I don't know if I have any time for questions, but... Yeah, we have time for... Just a second. We have time for a few more questions. I see some over here. Thank you for the presentation. Not from a web application perspective, but from more like low level and server side side. Some parts of the presentation reminded me of this recent thing called unicernels. Could you comment on the similarities and conceptual inspirations between two things? Yeah, certainly. I'm not really a unicernel expert, but unicernels kind of saw the similar thing of basically taking the space that you have and only running what you need. So from a theoretical perspective, it's kind of like WebAssembly can take the same approach, and in fact they're being used together in certain cases where a unicernel only is provided for what the WASM runtime needs. So yeah, it's cool. We can talk more about it later if you'd like. Anybody else? I know you gotta go. More questions? Feel free to talk to me later if you'd like. Yeah, this side. Thanks again. Thank you. Thank you.
How I've Built a Web Frontend for a Federated Communication Tool with Brython
Okay, so let's start. Thank you very much for staying for the last session of the day in the Python Dev Room. So we are now going to have Jérôme Poisson, who is a free software developer and he's working on a web front end for an XMPP client using Brithon. So give him a warm welcome. Hello everybody, my name is Jérôme Poisson, I'm also known as GoFee on Internet, so I'm the lead developer of the Liberia project and in this talk I will explain how I'm using Brithon to do the web interface and why I'm doing that. So if you were on Liberia, in Liberia it's universal communication in the ecosystem. It's goal is to be all in one place, to do everything like chat, blog, audio video calls, etc. There's Gateway to other protocol, I'm working on an XMPP to activity of Gateway, I made a talk about that yesterday. It supports N2 encryption but not in the web front end at the moment, that's one of the reasons when I want to use Python in the web. And so it's multi front end, so you have desktop, command line interface, text interface and even an experimental Android interface made with Kivi. So why using Python in the browser? First, there is no context switching. In my case, several front ends are made in Python, the backend is made in Python and it's enjoyable not to have to switch to another language, another way of thinking, another way to look for documentation, etc. When I'm working on the web development, I'm always staying in Python and feeling at home. Python is famous for being highly readable language, so it's easier to maintain, so the goal is to make something quick to do a new feature and easy to maintain. There is a code reusability. Thanks to Britten, I can use code from other front end or for backend, actually. The thing why I want to use code for backend is for N2 encryption, I will explain that later. And also it's a stable ecosystem. JavaScript is infamous for I think every X month, a new shiny framework that everybody want to move to and you have to learn everything from scratch again. So yeah, JavaScript stack is quite complicated due to that. I think it starts to be a bit better for a few years with React and VGS, but still, yeah, a bit of JavaScript. So just to give you some ideas, I see the screenshot of the chat feature of the, of the library of the front end. So there are a lot of dynamic stuff here which are managed by Britten, so there is reaction when you can add a smiley, you can like, when you have a new message, of course, it will appear on the below. If you put a message, the input will grow and you can send files, you can drag and drop files and everything and all of that is done with Britten. A few words of why Britten has been chosen and not alternatives. So a few years ago, I was using Pijama, which was a part of Google Web 2 kit to pick to Python. So it was doing combination out of time, so you had JavaScript, which was kind of easy. It was, and we were doing the developments in similar way as desktop development. That was reducing at the time because I was doing a lot of desktop development, but at the end it's proved to be a bad idea because we were far away from the HTML, HTML stack, so to HTML to CSS and it was becoming complicated to maintain and adapt things. And anyway, it was supporting only Python 2 and there was no real plan to move to Python 3 and the project is dead due to a sad argument inside the community. Transcript quickly is another transpiler from Python to JavaScript. It's lightweight, it's working, it seems that it's working well, it's performance, but by design it's made to use JavaScript module and not Python, which is a showstopper in our case. Pyo did is part of C Python to WebAssembly with M script 10. It's fully compatible because it's a real C Python which has been part, but the thing is it's quite kind of heavy, but it supports numerous packages. Also it's not really well adapted to work with the DOM and WebStack, but it's really good project if you want to work with a scientific stack on the web. Pyscript is New Killing Town. The decision to use Britain has been made before Pyscript was even start, five and none at least. It's a kind of framework around thing like Pyo did to make it more easy to use and more integrated with WebStack. You can use other Pyo did, but you will have full Python compatibility, but it will be heavy, or you can use micro Python, which is lightweight, but it's Python, but not fully compatible with Python, which is also a showstopper in our case. But anyway, it's a really interesting project and it's nice to keep an eye on this one. Other project was not supporting Python tree, so it was dead and PyPy.js is not maintained anymore anyway, so it was discarded. So here's come Britain. Britain is Python implementation in JavaScript. It transpiles Python to JavaScript, but in the browser. That means that you can do Python in the browser. You can have a Python console in the browser, but the transpiration at some time, but it's cache is in cache for the first time, it's going in the browser WebStorage. That means that the next time you use it, it will be quick. There is a compatibility layer. You are using JavaScript object, but with a compatibility layer, so the object acts the same way as the Python object works. It's aim is to be real Python, and it's really good compatibility. If something is not working, it happens to make a problem. We can just open a ticket. That's another point about Britain, the community, and the lead developer is really nice, welcoming and reactive. So each time I had a problem, it was fixed quickly and in a nice way. Most of the standard is available either by direct transpiration of the CPy version or by re-implementation. Under the comment, you can see what is available, what is not. And it's really well integrated with JavaScript. You can use JavaScript module as well as Python module. And from JavaScript, you can call a code from Britain. So, yeah, it was really a good fit in our case. And the proof that it's real Python, you can check on Britain.info, there is a console, you can do anti-gravity and then you can fly. So, all the front end is working. So basically, blue on the top, you have a backend which is doing all the work, all the X&PP stuff, the profile management, the caching, et cetera. And then you have the various front ends. And the web front end is a bit special because it's split in two. There is an HTTP server which is done with twist ends. And you have the browser part which is done with Britain. So there is a static part and dynamic part which is done with Britain. And it's communicating by Web sockets between them and then HTTP server communicates with the backend with IPC which is usually DBS. And one of the points Britain must be using in the web front end is because anti-gravity is done fully in back end. So for this front end which works on the same device, it's fine. But because on the web front end, it's usually distant user, you need to do the work for anti-gravity again in the browser. So my long goal is to use the code from the backend to be able to do the encryption and decryption directly in the browser and to make real and to an encryption with this front end. So the goal of the web front end is to have progressive announcements by default if you don't have JavaScript to have a static patch for most of features, not every blog. For highly dynamic features like the chat, of course you need to be fully dynamic. So in this case you need JavaScript enabled. It's made to be easy to develop and to maintain and to reuse the code as much as possible from other front end. An important part of this front end is the templating system. I wanted to have templates which work at the same time in the backend, at the same time in the browser dynamically. So I've chosen to use a Jinja 2 which is probably the most famous template engine in Python. And on JavaScript side I'm using Ninjux which is a kind of port of the Jinja 2 to JavaScript which is made by Mozilla. And it's mostly compatible but some filters and directly everything. So I do the implementation of the one I was needing in Britain. And it's working. And there is easy tuning and the fact that I use the same template in the backend as a side effect that I can use also the template in the Cli. So if for instance you want to generate a static website from the same template and the interface you can do it easily. So each feature is organized in what we call page. A leader page is basically a directory which corresponds to a new URL in the web front end. So there is no router like you have in Flask. This might be easy. So one directory, one feature in URL. And in directory you can have a page underscore method.py file which includes a metadata such as the name of the page, is the page public or private, which template to use and you can also use some Python code if you want to get data from the URL or stuff like that or under the post request. And you can have underscore browser directory with Britain part which is what we see now. And then when you send the website, the hierarchy automatically generated so Britain knows where to access the files. So this is a minimal example of the page.page metadata.py file. So the name is used to be able to access the page from somewhere else. So in this case this access profile means that you need to be logged to access this page and the template to use. And here comes the Britain code. So that's what is running in the browser. So you see it's real Python. On the top I'm using the standard JSON to do the parsing of JSON stuff. In the past, the JSON from standard was a bit slow so I was using directly the JavaScript version. So you can do it but no, it's been fixed and it's fully performed so I'm using directly the standard version. The bridge is just a way to communicate with the backend. The browser is an important module. Browser is what is used to manage the DOM. So document if you want to access an element. So Iopart is a Britain version of AsyncIO. AsyncIO is not exactly the same as in Python for reasons I explained in the doc. And Iorg is just a module to show the dialogs. So you see I'm binding the DOM event click to a method. Normally you expect a blocking method but I want to use AsyncIO so I use Iopart. And then you have AsyncIO method. So you see you have the event exactly like in JavaScript. You can call the method exactly like in JavaScript. Then you have the code and I'm parsing the data of the item. And I'm showing a dialog. The feature actually is you have a list for the list feature and if you want to delete a list, you show a marker around the list and you confirm if you want to delete or not. And so you see with Await it's really readable and easy to do it so I can check after if it's confirmed or not and if it's confirmed I send the deletion request or otherwise I just remove the stuff. So this code is doing this. So we just click on the delete showing the dialog and here I'm canceling and it's removing the flag. So yeah, it's working like it will do in JavaScript. About debugging. So when something is going wrong, what is really nice with Britain is you have real Python, Transback, with the line of the source code and everything so it's really easy to debug. Sometimes unfortunately you have JavaScript exception. Usually in this case it's better to report to Breton because it means it's a bug but it's happening less and less often. You can use breakpoint and PDB so your code will be blocking the browser and you will have another box where you can use PDB instruction and there is also an inspector module which in this case your code is running normally but you have a Python console and you can inspect the local variant that you have where you run the inspector. Regarding performances, according to the doc I have a benchmark myself but it's comparable to CPyton because JavaScript is probably one of the most optimized engines in the world in Chrome and Firefox. So it's comparable, sometimes slower, sometimes quicker. Of course it's slower than pure JavaScript because there is a compatibility time and the compatibility layer but the compatibility is done only once after it's cached and for compatibility layer it's working okay. By default you have a world standard lib which is kind of heavy between 4 and 5 megabytes. Normally you don't want to use that. You can either not use at all or you have a small tool which will check your source code and which module you are actually using and make something smaller and totally usable. So the loading time anyway is cached, it's in cached so after it has been used at least once it's quick to learn. So from my experience and I have not yet worked on optimization it's working absolutely fine at least from my use case. So now the roadmap. In the future I want to integrate more Britons, notably in the loading parts because I want to do something more like modern social network experience with reaction and everything. So use kind of the backend, as I say it's really important for end to end encryption. I want to be able to get the stuff which are done the way we format the XMPP standards etc. and do it in the browser so we have real end to end encryption. And I would like to experiment what we can do with the Python in the browser. I'm thinking there are a lot of fields where we can do that. In education we can imagine a chat where we could have Python console, it could be used for instance to learn Python itself or to do mathematics in a school or this kind of things. For science of course it will be super useful. Maybe we can try to use the work done by the PIOD to also try to run some scientific stack. And automation and visualize that it could be possible to do filter in Python when you have a chat message or block message and to see if you want to do something on it or not. So, for my experience, is a robust solution for integrating Python into the web development. The community is nice which is really important part and it's allowed to use the same code in the backend and in the front end which is really time saving and avoid to have people specialize in one language or one other language. So, that's it for this talk. You can check Brithu on this website. You can check my project on Libervia. I have a blog accessible by XMPP, activity pub or Atom. You have Atom field of course. Where I'm talking about project money and stuff I'm doing with Brithu and etc. And you can look for help in the Libervia chat room on XMPP or PINGMI. This is my other activity pub and I hope I give you the test to try Brithu and see what you can do with it. You have a console in the Brithu official website so you can use it right now and just try to play with it and see how it works and what you can do with it. So, thank you very much. If you have any questions. Thank you. Any questions? Yes. Yes. How do you set up the web server? You must serve it somehow, the Brithun code. Do you have to do some kind of special thing to make the Brithu bridge work? Web soket is support directly. I'm using JavaScript code directly and I'm sending, I'm using JSON to work with the bridge also in the back end. So, it's native in JavaScript and with Brithun it's easy to access. And Web soket is straightforward. No problem with it. So, I send the Web soket to the HTTP server which in turn it's also doing the security stuff because of course from the browser you can trust it. So, I check if you have a right to use this method, etc. And then sending from the IPC. IPC can change but usually it's the bus that we use between front end and back end. Thank you. And great that you support the XMPP world. Thank you. Any questions? No one? Okay, then I guess it's a wrap. So, thanks again for the talk.
Opening Railways and Open Transport devroom
So, hello everyone. Hello everyone, thank you for being here and for making this room so full even in the early morning. Don't be confused, I'm speaking in this microphone just for the stream so we have to talk a little bit louder here so that people online can hear us. I hope that's all right for you also to the speakers. Okay, so we are very thrilled from the organizers that you are all here and that we collected so many interesting talks. We will shortly give an introduction into what we thought about this schedule, why the talks are there and were selected, but in general we had so many good contributions and submissions to this deaf room so it was really hard and we hope that you will enjoy the program. We, that is people from different railway companies I would say in Europe, like in the last year when we first presented this room or organized this room, yeah, we learned a lot and we also learned each other and we are actually, we can say we just founded the Open Rail Association just a few days ago officially so this is one of the forms that we have where we bring together people from the community, from the free and open software community but also from railway operators, from the transportation community to work together and yeah, this is one of these great opportunities so it's great that we can do this for the second time in a row. And yeah, thank you. We, that is here, Louis Hamelon from SNCF, we will also give a short introduction now and then it's me, Max Mehl from Deutsche Bahn, we have Comedius Schumacher from Deutsche Bahn, Peter Keller from SBB, Mahalia Steffan from SBB as well and Simon Clavier from SNCF so you see we are quite international here and we are the organizers. So Louis, do you want to give us a short intro into the day, what can we expect today? Yes, thank you Max. So we try to tell a story this year and not to put some talks, the one after the other and we start with, first with something about the data and about the traffic forecast and to modelize the data of the demand and then what happened with the data, we try to simulate the passenger behavior with a mat sim tool which is a quite fancy tool and then once we have this transport modeling passenger model, we can use it for making simulation and building timetables so we have these three talks about a fancy tool which is called OSRD. Yes, it's a and this tool, so we will have three talks about this tool, one about the map, one about the running time algorithm and the other one about the signaling system and then how it is used and nice and what is, what, how the community works with all the, the, the, for the railway and for the transportation in general. Cool, Tristan. That's it for me, so maybe we can start now.
Open standards, open data, open-source tools: their governance and future
So, hello everyone and thank you for your patience. So I am Tuto, I'm a project manager and expert contributor with an IT for PT. I'm based in Paris and why I'm here today is because I do believe that the tools, the data that we generate should be used to build communities and inclusion. You have below a couple of languages that you can ask me questions into try not Japanese because then no one will understand in the room. Thank you. So, IT for PT who we are, we are a non-profit association and what we do originally we actually come from onboard units where we created a standard for open architecture and data accessibility and interoperability which means that basically the bus, the trains, the trams, the rolling stock would all be standardized talking together from the actual wiring of the vehicle. So making sure everyone for example use the same internet connection into tapping into the same feedback loop to the back office. So we are a membership based association. We have over 160 members in 28 countries that will have railway operators, public transport agencies and other associations. So as I said what we do really is to build this architecture for interoperability first and foremost. We also gather together a community of open source developers, aficionados and passionate people so that's why we're here and we also have at the end a label for compliance making sure that when people use standards they're not alone and they can actually check that all their different units are compliant with the standard so then from the buyer perspective you know that it fits the norms that exist and I'm happy to have Breder with me. Yeah officially I'm a public product owner for a small team in Norway with 10 people representing and tour company owned by the Ministry of Transport. We are a non-profit. We work building open source tools. We are using public funded money in our development and we want to give as much as possible to back to the society both with open data and open sources collaborating with stakeholders in Europe, Norway and internationally. What we say we do is building an open infrastructure platform. The road authority builds roads, someone builds the harbors, airports, electricity, water supplies. We build an infrastructure platform for mobility data. Open source all the way for my part of the tour and advocate for that for also the rest. This is. So on this one we wanted to show you a little bit what exists today when we talk about data that relates to transport and public transport and also railway knowing that there are different types of standards and specifications that exist. So if you take the European context you will have this gigantic European norm that is called Transmodal that is really to be viewed as a data dictionary and a grammar. So it's not a standard. It's really a reference for you to cross check concepts, how they integrate one with another, how they are articulated, how actually they are defined and because it's a European norm it is also translated in most European languages so it also makes it easier to implement. Obviously a data dictionary, a data model is nothing if a data exchange format is not created. So there were two and way more open standards that were created based on Transmodal. One is netx that is timetable information that is everything that is known in advance to describe transport network, schedule, the fares and so on. You have Siri for real-time information or anything that is not known in advance that you will have real-time updates, vehicle monitoring, situation exchange for example if you need to close railway or public transport services and one that is upcoming that we will start working on defining very soon is OPRA which is more to run statistics, performance, so public transport agencies and authorities can compare one operator from the other. You also have in the screen shown GTFS schedule and GTFS real-time which are probably the most used today across the world to describe real actually timetables and real-time information based on the on these schedules so if you use any trip planning app a good chance is that it's actually based on GTFS data because we're here in the room I would also like to thank a couple of colleagues including Stefan because right now what we're doing is actually bridging what was first created for urban public transport with the rail domain through European project. So as I said I wanted to place a little bit everything that is open standard and open source within railway and open transport and basically is just to show you that everything is linked so as a customer you usually just see the trip planning part which is on the top right where you actually want to go from A to B so you'll get your train schedule your timetable and so on and also real-time information the train is cancelled services disrupted one is late and so on and so on or just the tram is arriving at the station all of that is thanks to the data that is issued from the back office that a lot of times especially for real-time data is actually based on vehicle data where the IT4PT specification exists so as you see we really try to map out all the different standards and specifications that exist to build all of that but that's really the backbone it was more for me to give you an indication that all the data you work on has been standardized and standards are open for you to participate in. Let's go back I will focus in my next minutes talk about the upper right corner IT4PT standards from the vehicle to the back office system that produce real-time data in Norway is more and more based on the IT4PT standards but I will not cover that in my presentation because we are in the upright corner from the inside and we will have demand for finalized data that we can use so the IT4PT stuff is done by the operators. This is a overview of our all components that my team is responsible for you can split it into two which is the input side and output side all of this is open source open at github someone in the room Hannes is part of the team so we are more here to ask other technical questions it's possible and one French company has downloaded this on github and one atender and serves a region in Paris. We seek collaboration with everyone don't reinvent the wheel again in this audience how many produce data or want to produce data for the mobility sector 10? Stevan, raise your hand you want the right way to produce data and make it open yeah and the rest of you you want to use data yeah yesterday in the middle of this week I was in another meeting where the EU country talked and showed what they have done with open data regarding netx and Siri the last year nine country showed up and all of them has a lot of work to do so let's go just briefly into what we have done in Attenture we have focused on high quality data we need that to produce good information to the travelers so the operator and authorities in Norway are responsible for three things we have a national stop-lapse registry so they manually update that they produce planned time travel data in Norway we say note the conversional data so they produce native netx from the start in this context netx is similar to gtfs but have a lot of more data support more into operational data can be transferred to the data and we use Siri protocols for real-time updates that is very similar to gtfs real-time we develop OpenTrip planner open source component we do that in collaboration with that project it's a successful collaboration supports both netx and gtfs supports high quality data in Attenture we reached in January one billion requests in a month in a country with five and a half million people so you want to travel a lot the API from OpenTrip planner is openly available you don't need to register but if you do you can have more access to it and we support mostly the main travel apps in Norway to use that API so we are getting the same information everywhere the API should be relevant for different users so it's flexible so the client can decide what is important for my users to show so the Ministry of Transport wants an tour to be a neutral one the national the biggest railway operator they want to show their offering first so they show their offerings and not the competing ones in the same way and other other regions based apps show only their local area and not enough and all of them are using the same API getting the same correct information everywhere we have one place to correct if it's wrong on the left side at the source we also share data in national access point which is a requirement in Europe there we share the API I talked about we share net text and stereo raw data files and we share gtfs and gtfs as well all of this is open available and what we say to the Norwegian data producers you have three responsibilities stop places the netx data and the return data and we from an tour can take care of the data is correct in all the apps and all the international apps also so we say to them deliver data to us and we can secure that it is correct as Google which is important for them we also see that the data producers want to use the data they have delivered to us that we have merged together with other data and we have quality validation tools so they want to use it themselves to do that they want they need to have more data than we need for public information they need operational data so we have added that into our data pipeline that is supported in netx and it's not supported in gtfs so that's the benefit of using netx in regards to gtfs and they starting now to use the data opens up giving them the possibility to get out of lock-in situations which is something that is usual in the public transport sector they have a big important software provider had it for many years and it's hard to shift and get out of it by going into the netx and doing that correct it's possible to break that circle to handle the extra data we do that with one validation in the previous slide but in open data we remove all the data that is for sensitive for operational part and we give them access to that in a different data set and it's this is worked works pretty nice today open source tools open tri- I can take open tri-planer I'm a man leading that open tri-planer is an open source tool started back in the US 13 14 years ago thanks Hemsner that's it was a successful tri-planning from the beginning increased usage worldwide added a lot of functionality and after 10 years when we started to use it we saw that it was built for big cities when we entered a graph with all the data from Norway the latency wasn't that usable from Oslo to Bergen 10 seconds we shut the stop the search give one answer we decided to get away the community to build a new version so and to took that lead on that development the two first years we did the development alone we had meetings with the collaboration so we secured that we was on the same path today we are around 10 to 15 companies actively develop on it we do that together in the same master branch we have regular product owner meetings to discuss the direction on it and we share resources the open tri-planer is a multimodal tri-planer supports all kinds of modes we are still not finalized with it but it works and we can collaborate even more and it supports the standards we talked about today netx and syria and gtfs and gtfs and it's supported in the same instance so you can use those standards together and then we're getting to the part that might interest you the most is we wanted to present to you today all the open source tools that exist and others that need to be built and hopefully some of you will raise their hand and help us build them in the sense that it is a good good for the ecosystem so what exists is mostly thanks to the amazing work done at N2 because everything is open source so they have as red said an open registry for a stop place registry so all the stop points in europe pushed by the european delegation we have a national access point so it's kind of like open data platform for every single of the 27 plus three european countries where you can find a lot of data sets and not only just public transport some of them for example in france you can have the registry of all the places where you have cal pooling you can also have description of bicycle lane and so on if some of them need for you to create a login a user and a password it's mostly to try and keep up in the kpi of how many people actually use the data and you have a lot of other open source data libraries you used to have transit feeds now it's called a mobility database you have the mobility hub for gbfs and so on in n-tool you also have a data creation tool that is called n plan so to really create your schedule and your data in netx and we have four netx two validation tools that are fully open source and two developed by n-tool and green light developed through the european project data for pt which is basically to check if your netx feed is correct against the xsd scheme and then you have a lot of other smaller open source tools such as the one created by dg transit alkadis abi and so on but what we wanted to show is those are the tools that people created a lot within their companies within european projects within their initiatives because it replied to specific needs however now that we get more and more data that is open we need to create more tools so ideas we had discussing with a lot of people but happy to hear your thoughts are graphical representations of netx and serif feeds for example conversion tools from netx to merits that is more on the railway or bridging the different open source validation tools that exist or analytical tools so that's it for our presentation and most of it we want to hear from you if you have questions on either some tools we could develop or how to actually extend gtfs or gbfs or netx or seri how do we work with the railway industry and osdm which is also one that we did not have time to present today so the floor is yours actually i know the question about netx because it was kind of small on the slide that it's the Nordic profile that's the interviewers what are the difference between profiles or what does that mean compatibility with other countries so your question is on the compatibilities of the different profiles i was asked to repeat the question for the live stream so perfect no netx is a huge standard made by more or less theoretical parts so almost every use case you can think of in public transport is supported in the standard it's allowed in within the standard to model a specific use case in different ways and it's still valid so to make data interoperable and usable for serpartis we need to have profiles the regulation came at the same time that netx was uh valid yes in the european this is a valid standard and the regulation came but it wasn't the profiles existing so the uk started with a small profile first then french built a bigger one and then in 2015 and tour came with our aims that was extremely high we wanted to support what i showed in in the slides both the information part but also support the operational part for the rail operators which is complex and also for bus operators and and more so the french profile was based on the uk we based our profile on the french added stuff and a couple of years ago uh EU profile came which is almost part of the french small differences yes first supports all of the most important parts of the european profile the norleague profile but not the operational part is not supported and not needed in the EU profile but there are this will speak it's just a small the the differences between the norleague profiles and the EU profile uh and what we see now in europe is that too many countries build their own profiles so we started also this one with a norwegian profile but together with a collaboration with the norweg countries sweden danmark and finland we asked our neighboring countries is our profile supporting your use cases if yes thumbs up we collaborate if no come to us ask us and we see what's the difference and they come back they had some uh small things that wasn't supported so we added that to the profile and they had some ideas this is a better way of solving this use case and we changed it and that we was in live production development with different stakeholders producing the data and we changed the profile and it went back to my left side on the previous presentation you have to change your export oh you're gonna have to give them money no you're not getting any money you have to change it we stopped the validation you are not allowed to produce data into our production if you don't change this and then they changed it so the difference there is uh today we see that it's very hard to use net text data from a different country into the same system you don't do something so that's what i spent the last week to highlight yeah we need to solve that as well yeah was that answering your question yes you
Rust-transit: libraries to manage transit data in rust
So, we can start with the next presentation. Right pronunciation. That's right. We will talk about the rough libraries, public transport, so the stage is yours. Yeah, hello. Thank you. And I want to thank you, the people before, because Transmodel helped a lot to make like a nice model of how transit should be called, what's the stop point, stop area, a trip. So, if you ever work with public transit, read the Transmodel model. It helps a lot just to make things clear. And also, OpenTrip planner works quite well. We've been experimenting recently using it in whole France, so it's a bit bigger than Norway, and it seems to be working, so they do some nice things when we have to talk afterwards. So, I'm talking about the very other extreme end of public transit data. So, some very small tools and libraries to manipulate this data. And it's in Rust, because, well, if you're handling a few gigabytes of data and want real-time data, Rust might be an option, and that's what we've done. And, yeah. So, we are a very open and informal organization. So, in GitHub, it's Rust Transit, and we want to make some lot of small breaks just to get started using public transit data, and then you can do whatever you want to do. So, it's not very formal. We have no status, nothing, and it's just focused on implementing things. So, gradually adding more implementation and getting more things working. And this presentation is kind of a call saying, we're looking for some projects to add to it and maintain us, and maybe some people will say, okay, I have this very specific need, come and see us and let's talk. Right now, it's mostly, but that only maintained by voluntary in a cooperative where I work, which is called Coup de l'Hon liberté, and there's a colleague over there who can also answer your questions. So, the first one is GTFS structures. So, the most important part, the biggest one, so I will be spending a lot of time on it, and the other breaks are a bit smaller. So, GTFS, I was told before, is the fact of the standard used to publish static transit data. So, what time will the bus run next week? And at what stops it will stop and so on. So, we started initially for our own project as defining the types in Rust, so the structs and so on. And for those who do work with Rust, it's basically just baking third serialization and serialization annotations. And with the time going on, we added some sugar, like reading directly from an URL, which was a common need because, let's say, no way publish it on this website, and we just downloaded and have the data immediately. We started making some integrity checks because it's just plain CSV files, so identifiers might not reference data that doesn't exist, so we added those checks. And we tried to make it easier to access the data from one object to another. I want to mention one alternative, which is Transit Model, which is made by a French company who is now called HOD, which was used to be called Kizio Digital and which was used to be called CanalTP, depending how old you are in the transit world. It's an AGPL, so it's a library, as AGPL, so it might be a problem. It does much more things like file conversion, is able to convert GTFS to netx and other way around. It has very nice query functions, like tell me all the lines that go through this point. But it's a bit more complex to use. It's mostly done for the own tools, so it's not always very documented and you have to read to know how it works to get it working. And in a perfect world, it would be based on GTF structures. So we start discussing with them, but it break too many little things on their end and they didn't want to bother, just say, it's work for us, don't bother with it. So some user examples are TrontoValidator, which is done for the French National Access Point, what they were talking before. So the transport.data.gov.fr has a validator that checks that every GTFS file is valid. It doesn't have buses that go over the speed of light, sorry, and things like that. So it's based on GTFS structures. It has also some own tool with the GTFS to geogesan. Some people just didn't care about the timetable, so they take a file of timetables and just extract the topology of the network. And another project is Ketrin Maps, which is kind of a big student project in American California, I think, from a university, which are trying to make a whole system and they're contributing a lot. So it's a very small vanity metric. We have like 15 contributors, which is a lot and small for a project that's not really publicized. And we have regular people just happen to use it and make some contribution. So it's living on a small pace, but it works. So we had, what was also to say is quite performant. We tried to find the biggest GTFS that is there on the wild and apparently is the German one. So there are 600,000 stop points in Germany, at least in the GTFS files. There are one million trips. A trip is a bus doing his route, and if the bus runs 10 times a day, it's 10 trips. 32,000 stop points. So it's quite a big file and just to get everything read from the GTFS file to everything in memory, it takes about 16 seconds on this laptop, whatever it means. It takes about 5 gigabit of RAM. It means you can handle the whole data of Germany on your laptop or on a reasonable, affordable server. It's also quite robust. As I said, every file in transport data GovFR, which is a national access point, is passed using GTFS structures. It has data that comes from a lot of different editors and vendors, which are present all around the world. So we kind of work through all the quirks of all the weird things that people do. Like in GTFS, you're allowed not to put the trading commas. So if you have 10 columns in the CSV file and have data just in the first two columns, you can just put one comma and leave everything empty at the end. It's all the kind of things you went through. And I'm using this as a side note, as I have an audience that might be interested. Oh, sorry. The GTFS file was created as a dump of a database. Just dump all the tables, put the CSV file, bundle them in the zip file, which is a horrible thing to do. It's nice to exchange with your colleague as a one-off time, but not to make a standard. So in the future, if you ever work with this kind of things, don't make zips of CSV files, please. Let's use, for example, a SQLite database. You can have a schema, so you'll be sure that the data will be respect, and into there will be an integer, a Boolean will be a Boolean, and so on. You have foreign keys, so you won't have wrong references. You have typed columns. You have indexes, so you open the file and you can have fast queries immediately for free. You have a crying languages integrated. You just download, open it. You can already make some statistics. And you have kind of fun things. You can put everything on an S3 server and make HTTP range requests. You don't even have to download the whole gigabyte of file, just 10 megs, and you have all the data you need. So, yeah, this was a thing. Think about how you materialize your data at the end, because people will use it, and it will bring a lot of pain if you don't think about the serialization of your data. Okay, that's done about GTS structure, which was clearly the biggest part, and now some smaller projects. One crate, so crate is a package in the Rust world, is Serialite. So, as you told before, Serialite is a standard, it's a norm in Europe for real-time data, and Serialite is kind of a more simpler version to use. I think it's open to heated debates if it should exist or not. So, Serialite was mostly used as a SOAP interface, so over XML, and Serialite is the same data, but serialized as JSON and served over REST. And we used it initially to convert from GTS-RT, so the real-time GTS data to CRI, for the French National Access Point, to be able to expose the data in the European standard. And I also worked this small toy project where I read all the French, sorry, not French Parisian, big Paris area, Ilde-France data, to have some dashboards about real-time of this top or this metro line on real-time. So, it works also quite well with some big data. I mean, Ilde-France is twice as big as Norway, so it works. I mean, the standards are well done, and we get things working. Another one, which started really as a toy project, is Wesson for Routing. When people see OpenStreetMaps, they say, oh, nice, a road network. Let's implement some Dijkstra algorithm on it, because I want to play around with it. And if you go into the OpenStreetMap format, you see that it was meant for mapping and not for routing. And the most simple example is a way, it can be a road that goes around 100 kilometers, and it doesn't stop at every intersection. So, if you want to do some routing with that, that's a very bad graph. So, the idea of this small tool is to cut the graph into a topology, as we learned as a student, to make some routing algorithm. So, initially, I made it just for this toy project, was all the roads from, it's like a tree spanning here from Tokyo to every corner of Japan, and making this kind of tree-like structure, so it was nothing useful. So, it's meant to toy around, like the project I told you. If you want to just try some algorithm because you're a student and you want to implement on real-world data, it's very nice. It used also for WessonD. I think there will be some presentation afterwards, which is the open-source railway designer. Sorry. I'm not worried just I'm new. Yeah, I'm bad with acronyms. And we want to do the same with railway, and with railway, there was no tool to do it. But be aware, don't use it if you want real-life routing on roads for pedestrian for cars. There are much more better tools and there are tons of constraints that's not able to handle, like a turn left, a slow as a turn right, and things like that. So, use OSM for routing for turning around for very specific needs or for learning, but don't use it for real-life routing algorithm, and use those very nice open-source tools that exist. And that's pretty much. So, thank you. If you want to say that we're working, everybody wants to work with Rust and transit data. So, we're quite open. I hope we're friendly. So, don't hesitate to contact us and let's slowly grow the space of Toolbox for your needs. I saw a Chinese screenshot of the departure port. Are you planning on also integrated outside of Europe? Well, this was just a creative comment from Wikimedia picture. In series, we are not specific to any region. I mean, GTFS and Siri might be used anywhere. While GTFS is quite used all over the world, and they take more European, I think Siri has gained some traction around the world because it's more usable than the GTFS RT part, which is very focused, very big infrastructures and not always used. So, maybe it's possible. More like an information, you might be happy that Siri Lite has been fully approved. So, it's not kind of like a French version of Siri on the side, because Interforce Mobility asked for it. Okay. Thank you. That's a nice approach about all this transport. It's like making words vocabulary and being agreed on what this word means. It makes it easier to cross boundaries or formats and things like that. But it's always a bit tricky, but nice to hear. How far away is Rust in the transit industry? Like, there's another project, the transit model. Is that also Rust-based? Yeah, Transponder is Rust-based, yes. And it's used for... Huff makes a routing algorithm, which is called Devisia, which is heavily used in France. So, which is written mostly in Rust nowadays. It started as C++, well, it started as Delphi, but that's a long time ago. So, it's actually used, yes. Okay, thank you very much. Thank you very much. Thank you.
Counting on openness: Privacy-safe passenger counting
How many of you are still awake? Let's have a show of hands. Okay, 90%. Very good. Very good. How many of you work in mobility in your day job? All right, that's about 25%. How many of you would like to work in mobility as your day job? These are the superheroes of the next generation. How many of you have worked with passenger counting before? All right, five. Excellent. All right. So this story, this is like a few of the things that I want to highlight in the development of the Finnish national automatic passenger counting system. And there are like bits for everyone, and I'm going to be fairly speedy with these things. And if you have questions, please ask them at the end. And let's get started. So I've been working in public transit for a bit over 10 years now, and software development for a bit over 15. And I started my own consulting company five years ago when I wanted to help more organizations as well. But I just wanted to give you a bit of background that I come from the public transport side and not so much from the railway side. So just the basics. What is automatic passenger counting? It just answers the simple question, how many people are there in the vehicle? There are two different kinds of messages that these vendors send. For example, they send how many people went in or how many people went out of this particular door. And then there's the option of telling how many people there are in the vehicle right now. And some vendors send both of these data, and then you have to decide, speak louder? Yeah. All right, thanks. So some vendors send both of these data, and then you have to decide which one you trust, the DIF or the total. So why do we collect this? For the passenger, the benefit is quite obvious. You want to travel with the less crowded vehicles, mostly. You get this information from the passenger information systems such as signage. But also, more and more, I expect that there will be automatic decisions made for you without you knowing about it. So trip planners already will suggest trips that are less crowded. And in the future, I think general technology should already be there, but that general technology hasn't reached public transit yet. So we can't yet recognize prams and bikes and such when they come in. But when we can, then we can tell you if your pram fits in the bus that you're aiming for. Now for the authorities, the public transport planners, for example, one of the most important things is to be able to understand where the masses are moving. So you want to allocate the vehicle capacity where it's needed. So for example, if there's a route and the last three stops of one direction is often very empty, then it might make sense to cut the route short and just increase frequency. Also, some of the trip planners have these status information on how many grams of CO2 have you released when you're traveling. So that depends on how many other passengers there are in the bus. Also, pandemic precautions are important driver for the funding of these projects to finally get these things funded and running. And also when the passengers choose to even out the load, when they choose to go to vehicles that are less crowded, it means that transit becomes smoother because there's less congestion in particular spots. Now the situation in Finland before COVID and before mobile tickets were very popular. Aha, I'm hearing myself. The situation was such that... Somehow caught me off. Right. So there was not much incentive for developing these systems because we got most of the information from the ticketing systems. At least we got the information on when people got on. But in 2020, six municipalities and the government put money together and pulled it into Valti, Valti's service development company for public transit purposes owned by the major municipalities of Finland. And they pulled money together and in 2021, they chose Futurize as the contractor. Futurize is an excellent service company in Finland and I was privileged to take part in that team as a technical architect and a lead developer. And in the next two years, more companies and organizations joined in this project and many phases. I'm just giving you a bit of the background on it. I think in hindsight, our main task was to reduce vendor lock-in and to reduce the costs of APC because currently the high quality sensors take cost... Typical stereo camera costs 1,000 euros per door and 203 doors per bus means that it's quite expensive. And also, I mean, sensibly, many of the vendors want to offer end-to-end service of providing data and analysis. But then it's hard to maybe get rid of that vendor if you want to move on to the next system. So we interviewed stakeholders, held workshops, sketched out some architecture ideas and we came up with like a three-prong approach. And the first one is that we create an API spec between the onboard accounting devices and the backend. And we try to make it easy to understand for companies that don't work with public transport in general. And as a starting point, we took from HSL to Helsinki Regional Transport Authority, the capital area, PTAs, Public Transport Authorities often have the most resources, so they were a bit ahead. They had this data format that was modified from an earlier data format that they have. And we wanted to be compatible with HSL, so we don't fragment the finished market. But also, it has a lot of craft for our needs. So I've split the chasing message into two columns here. The first one is more about the APC ideas and the right one is about the general public transport method data, such as routes and operating dates and directions and such. And all of the data on the right side is available in the backend anyway from some other source. So by reducing some of the fields and also trying to get rid of some ambiguities, we just added schema version and accounting system ID to do matching in the backend and message ID to make a message checking unique when there are duplicates. And then we dropped everything that wasn't about the APC. So these chasing messages are sent over MQTT, which is very commonly used in public transport, both on board and between backend and vehicles. And I think this format allows for any company that understands how to count people or objects to participate in this market. So it lowers the barrier to entry. And we're hoping that there will be more companies that offer accounting devices. Okay, the second approach, second attack was to prototype new counting technologies. And we asked two companies to develop new things and one company to provide a reference device of something that already exists in the market. And DILCOMP created object detection from security cameras and AmpliCa used a millimeter wavelength radar for the presumably object detection. And there are a couple of pictures on the upper right, maybe a bit small. This is a picture of the prototype millimeter wavelength device 3D printed stuff hidden behind the ceiling panel. Now, unfortunately, unfortunately we learned that 20,000 euros per new technology was not enough. For us to create breakthrough technology, we managed to create the right format of data, but values were not yet usable. But maybe some of you can figure this out. I hope you can. Okay, the third approach that we created an open source backend for this whole system. So there's also no great vendor lock into our team either. And here's like the simplified architecture of it. And let's forget the left side for now, but in the middle, data comes from the on-port counting systems. They go to the MQTT broker. And then we push it into Apache Pulsar. Apache Pulsar is a distributed append-only log system competitor of Apache Kafka and has been in use at HSL for six years now. And we also wanted to have synergy there so that Valti and HSL would have similar technology backends. The messages from the MQTT broker are deduplicated and they're brought into the journey matchup. Journey matchup also takes its input from GTFS real-time, the vehicle positions which tell where the vehicles go and when they leave the stops. So it's a very simple logic in principle in journey matchup that you just accumulate values in and out until you leave the stop and then you trigger APC message with all of the public transport method data that you need in the analytics. So routes and stops and directions and so on. So journey matchup pushes it through MQTT back into the provider of the GTFS real-time API. So that services the authorities. That's the raw data or raw, but I mean the accurate data as such. But it doesn't serve the public because this is private data. This is mobility data of people moving about. Now you might think, okay, how many people moved in the door doesn't really match with any individual. But that is not so. And on the left side we describe how we need information from the vehicle registry as well to pick up data about the vehicle models that we have, the seat configuration and standing places and how we create an organization profile out of it. But for this part we need help from experts. So we asked university researchers from Finnish Center for Artificial Intelligence and Helsinki University. There's a professor on the home class group that is focusing on differential privacy. And especially Jonas Jalko and also Raja, sorry. And oh dear. Okay, I'll get back to that. Worked on this. Jonas was especially working hard on this with me and they created a method for the anonymization. Now the reasoning why we need this is that if you consider someone who lives maybe not in the city center a bit further away and they travel in a bit peculiar manner. Let's say that they have shift work, they travel at noon. The stop that they use, no one else ever gets on that particular route in that particular direction at that particular time except them. No one else gets off of the bus at that time. So if you learn that pattern, if that accurate information was public you could stalk them and figure out, okay, now their house is empty and so on. So to combat that, often I've understood that how people approach it is just been the values. So for example, if it's five to 20 people in the bus then it's many seats available. In GTFS real time, standard occupancy status, the occupancy status field has these ordered categories from empty to full. And the thing is that that's not really anonymization because when you switch from one category to the other you're still leaking information. So the method that they created based on differential privacy, we believe it is the first differential privacy method for automatic passenger counting in public transport. And I'm really glad that these researchers made this effort for all of us and it's all open source. I think it deserves a round of applause. It's also very simple. So the above case would be the one where you have no anonymization except the pinning. So once you switch from four people to five people you go from empty to many seats available. Now how their method works is that they take that vehicle model, the seats in the standing places, and they turn them into these, they take as input this upper CSV file, CSV, I'll just give you a light. And they fast these boundaries so that they match the differential privacy condition. So I'm not an expert on differential privacy. I'll hedge anyway how it works roughly so that we're actually using epsilon delta differential privacy. But in the epsilon differential privacy you have the small value epsilon that you can decide and that affects how private versus how usable and accurate your output data is. And the epsilon affects what's the probability that you can figure out an individual from that data set or whether that result was formed by a data set with a particular individual in it or not. That probability difference is very small and affected by epsilon and the delta parameter relaxes that condition a bit. So the black areas here have probability of zero, exactly zero. So that's the delta in action. Otherwise you would have these violet purple bars quite far along. So for example, how you interpret this is also CSV files is visualized. For example, if you have seven people, you have a small chance of publishing empty and a large chance of publishing many seats available and no chance of publishing any of the other categories. So we want to have such a system that if the accurate value would be many seats available, we don't accidentally publish full. The counting of these profiles is quite intensive, takes many hours. There may be various optimization possibilities in the algorithm, but it needs to be only done once per vehicle model. And then you have this small CSV file, a table of probabilities that you just sample from every time you need to publish the result at a stop. So it's very, very fast in use. And you can just plug it in if you already have another system like the one above. All right, so this has been a trip of these highlights. Check out our API spec, especially if you're interested in creating these kind of counting devices. Please try your hand in it. They are, the bosses are dirty and they are dusty and they're shaky, but otherwise you can use whatever methods you have available. Also, if you haven't yet got your own APC system, check out our backend code or maybe our architecture and this idea of not, of having only minimal data from these APC vendors. It's attractive for you. And if you already have an APC system, please do use the anonymization method created by the researchers. If you have further questions after this, you can contact me by that email. There are a few of the links. They're also in the talk page. And I'm not yet sure what else will be behind transit privacy.org, but right now it's just a link to the tool that the researchers created. That's enough of the monologue. Let's start the dialogue. So for public transport, I guess it's very important to easily detect when a road is being inefficient by maybe just moving air. For example, if in some end of the road, the bus or a tram or whatever is mostly empty, how does this anonymization algorithm make it harder to detect when some public transportation is being under you? So the question was whether or how this anonymization will affect the public transport planning use case of figuring out whether reallocation is to be done with the vehicle capacity. In our architecture, the public transport planners get the accurate data into their analytics. So the anonymization happens after and it's only for the open data part. Can you speak about your experience with the microwave based sensors? Oh, the millimeter wavelength radar. I have no clue. So we gave these companies a lot of leeway and they produced their pilots and we don't have insight into how exactly that works. Any insights about the results of their pilot or not? The insights were that thus far the results were not good enough to be shown. You haven't actually mentioned the great deal about how counts are actually achieved. Sorry, I can't hear you. You haven't explained much about how counts are achieved. The time has been involved enormously over decades. In the past you simply used to weigh the carriage. Now obviously it could be distorted by adults, children, people with a lucky age, Americans, whatever. Then you don't even know where they are on the train to the carriage. So you don't know that they're individual though. Then they looked at things like as you enter an exit using light sensors. Then they looked at things like are you connected to the internet and counting the number of people who did that. I suggest that they actually work on facial recognition, not against a database but simply against the number of unique faces on a carriage at a time. Then you can start as they move around and work out what the behavior is. What's your thought on that? Alright, thanks. There was a brief history of the different kinds of technologies used for detecting passengers and objects. Then a question of whether facial recognition would work. I'm not sure if it sounds like it would effectively work. It could. It's very tricky how to communicate it to the public in a way that is understood correctly. Like for example that we don't send anything else that plus one and minus one from these vehicles onwards. Yes, it's back. About that, facial recognition, you don't have to go really directly to that. There is so many more stuff that you can do and with your counting mechanisms, even all the source models can really do much better without having to get any facial information. If possible, just actually close that out. You don't really have to get through that. The layers, like I'm working on model share counting and we're doing that for cycling and we're doing that also for passenger counting and stuff. You don't need to actually get to mine all these information characteristics of the people who do the tracking algorithm yourself. It's not really necessary to get there. That will also reduce the communication. Thank you. A comment was about how the object detection and object tracking algorithms on open source are already quite fine without facial recognition. Yeah. It's like calculation of CO2 in the carriage. Sorry, I can't hear you. Calculation of CO2 in the carriage because when people are breathing, they can reflect air transfer and calculate how people are in this situation. Other studies have been done for COVID, for example, and you can reuse this COVID study. My next question is regarding the users of security cameras for counting people. Do you have any experiences in terms of producers of the systems? In the moment, you use the camera for a different purpose, the warranty is gone because we have this problem. This kind of thing is like, let me say, okay, we will never use it for other purposes than just checking security. This is like a big ego constraint that you have when you're procurement. Yeah. A good comment on security camera warranties. I don't, I remember hearing discussions about that, but I don't have any proper answers about what do the security service providers think about using their camera fees for something else. Okay. Thank you. Thank you.
MATSim at SBB: Using and contributing to the open-source transport simulation for advanced passenger demand modeling.
Thank you, Peter. Yeah. So today I want to talk a little bit about Matzim. Matzim is a transport simulation software that is being used at SBB, the Swiss Federal Airways, but it is actually an open source tool that has been around for quite some time. So obviously there is also a little bit talk about Matzim itself and the software and what it does and how you could use it if you are ever interested in that. That's on the agenda, so I'll briefly explain what Matzim does and why we find that useful at SBB. And we also contribute actively into the Matzim code and I'll give some examples of that. And yeah, since you might wonder why on earth are you even bothering, I'll give you some examples of our work with the software. So what is Matzim and why is it useful? If you have that elevator speech moment where you have to explain your work to your CEO and then they ask you what you're doing, then I tend to say I'm playing SimCity, but with complex econometric data behind it, so you have all these weird formulas somewhere in there. Then the elevator area is over and I have more or less explained what I'm doing and then the CEO knows that we have some guy playing SimCity all day. Well, there's a bit more behind it, but in brief that's what you're doing. So we are simulating transport and we simulate people's behavior using transport during the day. Matzim stands for Multi-Agent Transport Simulation and it has been around for roughly 20 years. It started as a purely academic project between ETH Zurich and TU Berlin. On a side note, that also explains why a Berlin guy is now living in Switzerland. So you can kind of imagine my background, but it has evolved over the years and there are many models around the world and quite a few of them are actually fully built on open data and are publicly available, not ours for some reasons, but for example there's quite a scenario about Berlin that you can download and you can see the data where it comes from and you can start playing with the model. Whether this is useful for anyone, I can't say, but I guess I think it's useful for some, even mostly PhD students to be fair. There are commercial users around the globe as well. Among us, the SBB, there's Volkswagen who have quite a strong development, also into the Matzim core, but they're not as open to talk about that as we are probably. Then there are models in Melbourne, there's one at the Berlin Transit Agency, so it has some standing right there. There's a book, there's code, there's a license and for the last couple of weeks there's also an association that kind of brings the whole thing together. Now, how does it work? So imagine you have a lot of data. You have census data, for example, you register data, you know where people live in a city or you just make that up, you replace people somewhere. You have econometric data that is value of times. You know what a person's intention is. If they travel by train, then the value of time is maybe six euros per hour and if they go by car, then it's maybe 10 euros per hour or the other way around. You have a road network that can come, for example, from OpenStreetMap, that is a very typical use case. You have a timetable for public transport, typically GTFS. You have count data. Many of the topics discussed in the previous talks are actually input data for us and that is a lot of input data. What we do then is we add some generic algorithms that basically randomly decide to tell people during the day, change your route, change your transport mode when you go from one activity to another or change your departure time choice and then we let that run the same day. It's a bit like Groundhog Day, 200 times, 500 times and mix people up and let them try out new things. This is what we call the Matsum loop that is also somewhere on my T-shirt. What comes out of it is actually output data, even more of it. You have individual daily plans. You know what your synthetic population is doing during the day, where they go shopping, what transport modes they use. You have mode choice for each strip, whether people tend to take car to get from A to B or public transport, depending on what's the offer. You have time-sharp traffic loads, so a lot of data to analyze and do your policy planning. You have distances, you have all kind of aggregate data that you can then use and play with. Obviously, this calibration process for a model that it really depicts the real world in an initial stage is kind of what's the long story behind model building. What can you use the whole thing for? Of course, transport policy evaluation, what happens if there's a new road, what happens if there's a new railway line, what happens if there's a new price. You can do a person specific. You know who's affected by transport policy because you have this agent-based paradigm behind it. You can also calculate, for example, accessibility. A lot of things are actually happening where MADSIM is being used when it comes to on-demand transport modes. You can really do your fleet scheduling, your fleet planning. You can say, okay, what happens if we have a lot of automated vehicles that replace passenger cars and what's the advantage of that? All these kind of future scenarios you can use MADSIM for, well, basically playing SimCity. The MADSIM project has been around for almost 20 years and historically it has been administered by the universities, so ETH Zurich and Tew Berlin. Professor's grow older and at one point they retire and the person that comes next is maybe not as interested in such transportation simulation anymore. Since last year, the whole MADSIM project is built on an association level, so that there's also some funding from other users to maintain build servers and all that stuff. The association also organizes things like the user meeting that is held annually or keeps track of all kind of developments, publish a newsletter, all these kind of stuff. Now at the Swiss Federal Railways, how did we start with that? It's a very brief timeline on one slide, but I think it's kind of interesting. In 2016, our CEO saw a presentation about MADSIM and decided we need this at SBB, please buy MADSIM. Well, as it happens with open source software models and open source software at all, buying the whole thing wasn't as easy and the whole procurement process didn't quite work out, so the task was delegated to the department that deals with classical transport models, so it's actually part of somewhere in the passenger division, this is also where I'm working. It came up with some challenges, for example you needed someone who knows programming Java and those people didn't exist in that department, but things that can be overcome and you need like proper computers actually because if you want to run proper big models then having a nice tiny laptop isn't sufficient. That was also something to overcome, also thanks to the IT at Peter's department in the end. At least we didn't kill it. At least they didn't. Yeah, but the whole thing, building a model for Switzerland in MADSIM took three years and ran from 2017 to 2020 and along the way we noticed that there are several additions into the code base that need to be made to make this a useful project for us and then at one point you see okay we need to decide, do we commit this back into the MADSIM core or do we keep that in our secret chamber and luckily people chose widely and this is all open source and this was actually a management-backed decision so I'm very happy with that. So in 2018 the first release of the model that I will present in a moment was released and since 2020 we have an annual release cycle of a new, of a transport model that is multi-model and MADSIM based in Switzerland and that can be used for all kind of policy studies. Our contributions to MADSIM I just want to showcase two. First of all one that is called oddly the Swiss rail rector so it's an rector based public transit router that works really fast because if you want to route millions of people within a reasonable time frame then this is what you need and compared to what we had before it was like many, many times faster and the whole simulation was actually sped up by factor three and what is also important to say, the Swiss rail rector is, well it's a Java package but you don't have to, it's tied to MADSIM, it uses MADSIM data structures but you don't actually have to use it for MADSIM problems so in a way it's something that can be used instead of OpenTrip planner but OpenTrip planner has other advantages but if you really need to route a lot of routes at the same time then it's something to look at. And then we, yeah now we have a bit of a fancy routing algorithm so it also knows like range queries, queries and intermodal access and egress and the last point is very, very important because if you want to model stations you really need to have an idea how many people arrive by food, how many arrive by bike and so on and that is not an easy question to answer and apart from the routing problem which is already quite complex it's also a part of the, you don't have that much empirical data about it so, but it's really one of the most useful features by now from our model so we're happy to have that and yeah as I said you can use that kind of independent of the rest of MADSIM so it's well worth checking out. Then there's another contribution where I was a bit deeply involved, it's basically, there is a traffic flow simulation in MADSIM that is typically queue based, I didn't go into detail here because I don't have enough time for that but we replaced this by something that is just like roughly two times faster for the whole simulation process and it's called Hermes because well it can fly now but has less plugability so depending on the use case of your simulation you can use one or the other and they're kind of interoperable and yeah this brings down simulation run times for Switzerland scenario so both the router and Hermes to something like 24 hours and since these are typically AWS instances it actually saves a lot of money to have models that run reasonably fast and for calibration process you maybe need 50 of those runs so it's it's kind of sums up yeah now what we use MADSIM for first of all there's a model called Zimba Mobi this is where I'm the product owner so I know a lot about it and it literally depicts the everyday mobility of eight and a half million people in Switzerland so basically the whole population it includes all major public transport modes so walking cycling taking the bus taking the car and it has a representation of the transit schedule obviously and also the whole road network and since it uses MADSIM the whole people's behavior or agents behavior is microscopic so including first and last mile decisions and hope that video works now yeah it does so right now imagine it's 8 a.m in Switzerland and you see people in those blue dots being at home and these light blue dogs people starting their work time and now we zoom in into a region somewhere in Eurik and see what people do there so to get from one place to another they need to travel obviously and they could travel by car that then there are in those little gray boxes or they could take the train or public transport then there are in those little red boxes and they get from one place to another and obviously you can run your analysis on that some public transport vehicles are maybe more crowded some are less crowded and yeah you can sum that up over the day and see what's going on and now we zoom in again and what you can do is see who's alighting at certain stations what kind of passenger groups are they are there do they have a regional subscription do they have a half fare card or do they have ordinary tickets for example and also on the highway you can see during the day how many people are currently on their way to work how many people are on their way home and who's just a truck and who's doing other things and this is all in the model and you can like analyze a lot of things obviously we are a bit more tied to public transport related analysis so station access and egress is different at certain points so in Jettikon for example there's more people arriving by bus to the station than in Aarau or if you have this city in Baden here then you can see people that reach the stations from nearby typically walk and people that come from further field further field they take the bus and these kind of analysis is really useful for example for station design and station planning and yeah so typically use cases for the model are then also the development of rail lines and designing of stop locations and time and the effects of timetables so it's nice that you created a nice timetable but who's going to use it and how many people will be on the trains this is an answer we can give and we can analyze what's happening around the stations so that was just the video that I showed and we can also see what's the effect of certain land use policies for example so we don't only have a model that depicts today but also one for 2030, 2040, 2050 so we kind of know how according to today's assumptions Switzerland is going to evolve and then we can do policy planning with this yeah and these are kind of future scenarios yeah and just one example for example over the next 20 years there will be roughly 20 to 30 new railway stations opening up mostly along existing lines and and very often these stations are being built because there's something happening around them so there's a new development coming new housing or a new commercial area being built or something like that and we can just like in Simcity we can add those little houses into the model and add people there give them daily plans and for example this is now in the city of Sankt Gallen where not a new station is being planned but the the moving of one station to another place so basically that station goes from left to the right and then lots of houses are being built there and with the tool we can say okay what's the beforehand we had at both those stations there are roughly 4 000 passengers a day and now it's roughly 6 000 so that would be the effect of the policy of things happening there and and these numbers help you to dimension those stops properly yeah another application that doesn't come from my department so please don't ask me doesn't question about it but i think it's interesting enough to be presented here is that we also want to go into deep down knowing what's happening along the railway corridors so Matsim has a mobility simulation i talked about that earlier and my colleagues decided okay we can replace this with something that we call Ray Sim and that actually has tracks and signals in there and blocks and we can start playing around with that and and do roughly also capacity planning and on a much easier level than it is usually being done so that you still don't need need to have an idea of every signal that's on the tracks but and of every switch but of the whole idea roughly you need to know whether a track is single track or double track for example and so the outcome of this is now also a little video you have two trains um one coming from down here it has currently a speed of six six meters per second but it wants to go it's currently accelerating to 11 meters per second and there you have also a train that is six meters per second and wants to accelerate to 14 meters per second and this train wants to go this way and this train wants to go that way and there's a station and they both interfere at one point so obviously we don't want them to crash so as you can see the train coming and the red lines are basically the blocks that are being blocked in front of the train so depending how how fast the train is the train is faster than the braking distance would be longer so more blocks are being blocked so um and then you can see the train that comes from the right has the right of way and and and the left train got a red signal and is breaking down and now that right train is passing switch goes to green and the other train can enter the station well yeah um and obviously you can connect that with the rest of matzin so you know how many passengers are on the train and then you can do your policy planning again if there's um if there's a heavily delayed freight train that but that would generate you that and that amount of money maybe you want to accelerate it but then you see oh no it interferes with all of our daily commuters they would be very angry so you can do your policy planning around this it's still in an early stage the that microscopic railway simulation but ultimately this is where we want to go to it's also released as as a matzin contribution called rail sim so it's part of the matzin code everyone can use it and i think it's a way to go in that direction as well but please don't ask me too many questions about that but i can connect to people who know about it so to wrap up matzin has helped sbb massively to to understand customer behavior and committing to open source in our point of view has really paid off here and it's also the way to go for us um yeah and these models are very very complex but um that is what they are no matter whether you use commercial or open source software um oh yeah that one has to come too um and uh yeah if you want to know more about matzin um there's an annual meeting um this year it will be in um as part of the heart conference it's a transport conference um on the 17th of june in in outer university so yeah hey thank you um yeah and um i'm happy for questions even though i only have five minutes thanks a lot for the presentation uh quite a bit of the the transport systems have an historical background for example probably some of them are based on the industrial need back at like 20 years ago or some of them are related to new developments in the city and some of these things are also presented in fact at open street map in these kind of resources are you able to extract amenities maybe or an historical sense or is the distributioner and the social demographic something that you get as a matrix how do you get this kind of distribution and sorry second yep do you import EFS and these other stuff um yeah both good questions so the first one um so we do have the census data from the federal office of statistics in switzerland and they also have an idea how it looks like in the future so we are in a very lucky situation there's a lot of data available and also publicly um in fact there's also a transport model available for switzerland publicly but unfortunately with a closed source license where you need a software that costs you roughly uh 10 000 euros a year so that doesn't help a lot um so we rather build our own models and the other question is um yeah typically we uh one would use gtfs as an as the main data source for public transport data since we are the railway operator and we have time tables in all kind of formats we use a different one but if you were to build a model for matzim typically you would use opus creed map and and and gtfs um yeah yep you um what for the model of the swiss transport uh how do you do that there's three three million agents that are being simulated eight and a half yeah eight and a half so what's what's the typical like how how much can it scale like how many agents can you have in one simulation um yeah that um so the the models do scale um but um there's there's an upper limit to what is useful because um so switzerland is still a useful also in terms of a regional scope because there you have many long distance commuters um but if you are using matzim to for example for for really long distance um choices um then you would simply remove everyday commuters so i also in a previous life i created a model for sweden that also worked but it's also roughly the same number of people um and but there are simulations for for cities in china which have 20 million people um and what you can do is you can cut down the number of agents that you simulate and and increase network capacities and so on so there there are ways to deal with that yeah so you mentioned that you can feed open street uh uh map data into matzim but does matzim also provide tools to add new assets or population models how there are tools that allow the addition of of new people if you don't want to hard code it some of them are commercial i mean there are some spin-off companies around matzim who provide this as a service but you can also do it on your own it's just basic java or python code that you can use for that so um yeah and if you do transport modeling um for public transport you would probably rather edit the gtfs than editing the matzim schedule so i think there are a lot of interesting yeah i'm here so sorry yeah well i think we one short question one question is possible yeah one more yeah i don't know how do you determine the accuracy of your model ah oh yeah that that is another talk of of an hour um so getting getting getting models right and calibrating them properly um is um is mostly countator that you need and in switzerland we have something that's called the microsensors that where where where we ask people every five years about their mobility behavior and that is very very accurate and has a lot of data where that is useful to calibrate models but um it's always a fair question if someone presents you a transport model to ask how is it calibrated okay so you
Bending geographic maps for enhanced railway space-time diagrams
Hello everyone. So my name is Alexis and I developed data visualization web applications at Westware. We do a lot of open source things and I'm totally not a trained person initially. I'm still not a trained person actually. But since early 2021 we started working for SNCF Rezo who is a firm in charge of handling the French train infrastructure. And we started to contribute on OSRD which is I guess it's been advocated here today already. Not yet. Not yet. Okay. So OSRD is an open source railway designer. It's an open source application to simulate trains on real or edited infrastructures. That's kind of amazing. The interface is web based. The project is kind of huge. The team is a good part of the team must be in the room I guess. And yeah, you can check it out. And at some point we, Westware, we've been tasked to actually enhance the space time diagrams. What are space time diagrams? First of all, everybody do not agree on what it should be named. So it might be circulation diagrams or graphical timetables or train graphs, which is actually nice train graph, but I'll stay on space time diagrams. It's probably been invented by a French engineer, Charles Hibry in the early 1840s. This engineer was in charge of scheduling the trains between Paris and Rouen. And he used this very smart chart I described right after. Some people think it's actually a Russian military guy. No, you remember? No. Okay. There's another lead and it's not clear. It's been invented by these people. Let's stay on this track. So horizontally you see the time. It's hours of the day. And vertically you see the list of the stations from Paris to Rouen. Basically it works. A train, can I zoom in? Okay, nice. The train is aligned and, no, sorry. Okay, each train is aligned on this diagram. And you can read a lot of information in this type of diagram just with those lines on this scale. So basically the more vertical is aligned, the faster the train goes. When the line is horizontal, it means that during different time, the train is at the same position, so it doesn't move. When two lines are crossing, it means that there are two trains at the same position on the line at the same time, which means that it's possible. So if I read this map, for instance, I can know that there are probably two different tracks, one for each direction, and probably no more. I know this because trains that don't go in the same direction can cross kind of anywhere, here or here or here. But when they are in the same direction, one train has to stop in a station like here or like somewhere around here, et cetera. This is kind of crazy, all the information that are just displayed in a so simple diagram. The thing is, I'm not a train person, but I know this diagram for long because it's actually on the cover of one of the DataVis reference book, Visual Display of Quantitative Information by Tuftor. And it's still used today. There are reasons why it's screen shot from OpenTrack and not from OSRD. I'll come back on it later. But OpenTrack is another open source software to handle trains, and they still use this kind of diagrams. And it becomes actually even better once we introduce blocks. So when they started using trains on tracks, it was kind of easy because basically there were not enough trains to consider collisions. But at some point, a train goes fast enough and is heavy enough that when an operator sees a danger stopped on the track, if he starts braking, he won't be stopped before the collision. So people had to find solutions for this. And so easy, since I'm going to sum up how it works, but I'm not a train person. So basically, the track is split into blocks, and only one train can be at one block at any given time, and there's a signal at the entrance of the block. If there's a train inside the block, the signal is red, so you cannot enter it. If the signal is orange, it means that you must stop because there is a train in the next block, basically. The thing is, when a block is occupied by a train, it means that during a time and on a distance, there cannot be other trains in this block. So basically, the occupancy of a block by a train is a rectangle, and when two rectangle collides, that's bad. And in OSRD, here is what it looks like in OSRD. So the red rectangles are the blocks occupied by a train, and here in OSRD, I started a simulation, and I dragged this train so that there was a collision. And yeah, it's really easy, graphically, to see that there will be two trains in the same block at some point. And I think as a data-vis person that it's kind of amazing. But how can we make this even more informative? So people from OSRD asked us, yeah, basically vertically, we just have the list of three stations or the list of point of interest, but we would like to bring more information in this. And we thought, let's start digging. Who does this kind of things? So we started looking into other transportations where people have to see how they travel along kind of a line. So here is what it looks like when you are inside the RORD. So this is a train that goes from northern Paris suburb to southern Paris suburb, going through Paris. And when you are inside this train, you have this synthesized diagram. So it's nice because it brings only the information you need, the list of stations, but also some interesting things like where can you switch to other transportation systems, et cetera, et cetera. And this is nice, but the thing is what Loïc wanted us to do was to have the exact infrastructure and to see exactly what are the tracks on the lines at any given point. And this would have needed us to actually know the whole infrastructure and to do heavy computations. And at this point, we planned to do this as a front end only feature. So we kept digging, sorry, yeah. So we can run there anything we want, but we need to know the exact topology and do heavy computations. And we kept digging to find something else. OK, sorry. So on top of this is the actual map of the bus in Paris. OK. When you take the bus 58 in Paris, you have this map. And the thing is, as you can see on the top map, this line, it appears absolutely straight here, you see. And this is kind of amazing because we cannot bend things in cartography, but that's what they did probably by hand. And they obtained this nice map where there are very identifiable areas. You can see all the streets. You can see a lot of information. But still, you know that you are going basically from left to right or from right to left. And it works. So yeah, we have to show everything a map would show. So we cannot just terrific exactly what we would like to display as we did with the schema because we have to take everything. But the good point is that we show everything a map would show, which means that we have all the context around. For a train, it would be the cities, the buildings, the places that are near the train, but not exactly on it, et cetera. It's actually called a strip map. And it exists for quite a long time. We've seen some very old examples like this one. And it's actually already been used within space-time diagrams already. So this one comes from the Russian military. It's trains between St. Petersburg and Moscow. And on top of it, not vertically aligned on the left, but you can see the whole itinerary with a lot of information surrounding. Like you can see the sea next to St. Petersburg. You can see other identifiable points, et cetera. And it brings a lot of context. So let's bend geographic maps. The strategy we used was to generate a grid made of triangles along the path. And then we generate another grid, which is totally flat. And when we want to translate a coordinate from the normal geographic system to the bent system, we just find in which triangle am I, and then I will translate it from one triangle to another. And this is something that is easy to do. So let's take a path. This is from Nantes to Angers in France. Then we generate a grid around it. So basically, I just simplify a bit the path, and I take regular steps. And I draw a line crossing it at a perpendicularly. And then I draw triangles, that kind of. But I have two problems here. First of all, there are points that are in multiple triangles. And this is bad. And another issue is that I have large triangles touching really small triangles, which means that in my final map, I know that if I have this kind of distortions, it wouldn't be very smooth. So we smoothen the grid with some, we just run some steps. I move each point to the barycenter of its neighbors, basically, something like this. Then we index all the triangles to get a quadtree. So it's really, really fast to know if I have a point, what are the nearest triangles. So in this nearest triangles, which is the one that contains my point, et cetera, et cetera, I do the regular grid on the right. So each triangle exists in both grids. And at this point, yay, we have a projection. So that's what I said. If I have a point P, I find the quad that contains P. I look for all the triangles that collide with this quad. I find the one that contains my point. Then there's a triangle with the same ID in the straight grid. So I just find this triangle, and I use barycentric coordinate system to translate from one triangle to the other. Also, I had to actually develop something. So we use the ReactMapGL and MapLibre. MapLibre and ReactMapGL because it's already used inside OSRD. So for this prototype, basically, we render a hidden map that contains the whole grid, but we don't show it into the screen. We just load every feature we can. We use layers from OpenStreetMap for the context and OSRD to have the actual infrastructure and signals, et cetera. Then we wait for the idle events that says, OK, I have loaded everything. I'm ready. So I take all the features. I transpose them. So I project them, actually. It's a projection. I also have to clip them if they go through or if they come from outside the grid and enter it, et cetera. Then I can render a new map with the projected features, which looks like this with the grid and like this without the grid. And we can look to the two maps side by side. Yeah, that's it. We have what we wanted, a map that shows the full itinerary from Nantes to Angers. We can still identify things. What I really like with StreetMap is that locally, if I'm going from Nantes at some point here, I know that I have Laloir on my right and the Scarke-Fou on my left. And those local information remain true in the BENT map. At some point, the Scarke-Fou on my left, the Laloir on my right, et cetera. You preserve local context at the price of having BENT lines around. In OSRD, this is how it looks like. This is a screenshot. I hope to show you something that works better in a minute. So yeah, it brings a lot of context. But when you zoom in precisely inside to the train in OSRD, you can see the exhaustive infrastructure, all the tracks. We didn't have signals yet, but it will come soon. And yeah, so it works for almost any path. It's known there's no loops, right? And it does bring context. With the current instantiation, we lose the Mabel data. It means that we have to load everything at once and render the map at once. But we cannot, like, if I zoom in, I will see more things with a better definition. It might come sometimes later. And it's a bit slow at the moment because we have to load everything and translate it at once. Yeah, demo, that's going to be really quick. So there's just a storybook. So it's on the OSRD UI project. If you want, you can just ask to the OSRD people. This is the project that's been moved out of the OSRD, which means that you can actually use it without OSRD data. It's just a react component that embeds some dependencies. And yeah, this is from Nantes to Marseille. Think with path. You have, yeah, on your right, you will first have this big ocean. And then later, there's Toulouse, there's the Pyrenees, and then the Mediterranean Sea. So it works as we wanted, lots of context. And also in OSRD, Roulmante-Tembourg, drumrolls. OK. This is the path I showed earlier. The trains, when I over train on the graph, I can see it on my strip map. And when I zoom in, I will have the actual infrastructure. They should be, yeah, OK. I see that the train changes swaps tracks here. That's nice. OK. That's going to be it for the demo. Thank you very much. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. I can probably take one or two questions. I'll need two minutes. So one or two questions. Sorry. Does this projection look good with photos of satellite imagery, or would it look really strange? Yes, it might look a bit strange. But actually, when the grid is quite smooth, like the one I showed earlier, where the triangles are just slightly bent, it might work. The thing is, I only work with vector data right now. But I could actually project pixels. But if I project pixels, you will have larger pixels. Yeah, a real sharp turn. It would skew things. Yes. Loïc has tried with some path that starts somewhere and goes just next to this somewhere later, and this is bad. For now. For now. Do you know how these maps were made before by bus maps in Paris, at some point? By hand. I'm quite sure it's by hand. But I don't have any proof. But I know that when I saw the amazing schematics maps of the infrastructure inside SNCF, and I asked, wow, what's the algorithm? What algorithm? So I bet it's by hand.
MARECO algorithm: how to drive a train using the least amount of energy
Now, second talk about OSDRD, it's about the running time calculation and the best way to calculate the running time is to save energy and Alex will present it. Thank you, Loïc. Hi, everyone. Thanks for coming to this talk. Today, how to drive a train using the least amount of energy with Mariko algorithm. This talk could have been how to drive a bus using the least amount of energy or how to drive any public transport that has wheels to use the least amount of energy and actually even worse with bikes. So I'm Alex Roland working also at SNSEF Réseau on the same project OSDRD for those who were at the previous talk. Here is our GitHub repo if you want to check it out. And I'm going to mostly spend some time on one type of graph, not the space time that you've seen just before, but this one is called the space speed. So it's a very easy graph that just represents the speed along the path of the train from its departure to its destination throughout the stops that it might take. On this graph, you have some speed limits that are on the line. Most of the time, the speed limits are quite lower on train stations. And then the train will leave the departure station with a speed of zero and then it will accelerate until reaching the different speed limits. Then it will break to reach the stop at speed zero again and then accelerate and then break again. So this is the fastest drive that the train can make. It accelerates as much as possible, drives as fast as possible and breaks at the last moment to reach the stop and to reach each speed limit restriction. In that case, let's pretend that the departure is at eight and then the stop is at nine twenty and the destination is at ten. This is still distance in this graph. I'm just showing you time because this graph does not show the time and we're going to need this. The problem is with public transport, if the train leaves five minutes late, it won't be able to catch up because this is the fastest drive. So it will at most arrive with five minutes late at the stop and at the destination. That's also a problem if the driver does not accelerate as much as the fastest drive. It will also be late even though it left on time. It's also a problem if the driver does not drive at the maximum speed, which spoiler happens. So going back to the fastest drive, this is actually a very bad way to plan trains, buses or any public transport because everything can fail very easily from the example I've shown. So we kind of want this public transport planification to have some margin to it. So if I'm adding, let's say, a ten percent time margin, I want to stop eight minutes later, so at 9.28 and then 12 minutes later here. So then I have some margin to like damp the few, I don't know, leaving late or not driving as fast type of problems. But we are here to save energy too. So we want to add that straight time, but we also want to save energy. Well the good thing is, physics does both of it. If you drive slower, you will save some energy. That's great. Great news. Okay, this is due to the different forces that apply on the train when it's running. You can see here, well let's not care about the weight and ground resistance. What's important here is the air drive and solid friction that scales with V and V squared. So at high speeds, it will be much greater. So driving slower, you experience it if you bike, if you drive slower, you use less energy, same with cars and every transportation system. So let's lower the speeds with a very basic way, a linear margin. So we are lowering the speeds at the same percentage all the way through the train's path. And then we arrive at 9.28 and we arrive at 10.12. Did we save that much energy? Not quite sure. This is also another way to lower the speed and then be on time with the margin that has been planned. We are lowering the high speeds only this time. But what I'm going to show you is actually the best strategy to lower the speeds because there are infinite ways to lower the speed and arrive on time. I could also just stop in the middle and then come back on track. So I'm going to show you what's been published by engineers from SNCF a few dozen years ago. I think the original paper is from 1979, so I was not born, which shows the best strategy to run trains in terms of energy consumption. So how does it work? There are four types of actions. Here I'm showing the same graph, but like a very simplified one. The train can be either accelerating, maintaining speed, posting. Posting means the driver cuts down the traction and then the train rolls with, drives with thanks to its inertia. And then the train can be braking. That's four driving actions that we are going to study. The idea is to study each type of action and see how much energy we can save per unit of time that we add. If we look at the accelerations, we can try to accelerate a bit not as strong as the maximum. So let's say V0, we like accelerate a bit slower and then we accelerate at maximum again. And I'm saving the formulas because it would be too long for this talk, but basically this can lead to a nice but small amount of energy saved per unit of added time. If we look at maintaining speed, as we saw the speed has a huge impact on the air friction. And so driving at V1 a bit lower saves actually lots of energy per unit of time added, so that's interesting. There are two reasons for coasting. The first one, the small triangle you see here, corresponds to a slope. So the driver will cut down the traction before the slope, slow down a little bit and then accelerate again thanks to gravity in the slope. And then on this distance, no traction was used. So it's interesting. And before braking, if we know we are going to need to brake and slow down, we might as well cut the traction before and then thanks to the inertia, the train will drive and lose some speed. So this is, we have two parameters here, the same V1, so the maximum speed and then VF here, the velocity at which we want to stop the coasting and start braking. And this is also very interesting in terms of energy saving per added time. And for braking, well, no energy used at braking, so no possible energy savings at braking. So what this analysis shows is that the two most interesting parts are actually saving energy on maintaining speed and coasting, which would, if we combine them, look like something like this. We want them to be equal so that the margin is as evenly distributed as possible. We don't want all the margin to be on one spot for reason I'll explain a bit later. And then basically how the algorithm works in the end. We start with the fastest drive that we compute and then we do a binary search algorithm with iteration. So we start with V1 and VF that leads to, let's say, this result. Then we get an output of how much time this represents actually. And we compare this time to the time we want, the 9, 28 and the 10, 12 from before. And then we iterate and we compute different ones until we converge to the solution that leads to the time we want. So if we come back to the first example, this leads to something like this, where the higher speeds are lowered and then you can see the coasting phases before each braking phase. And we arrive on time with the margin that we added. Now let's see how it looks like on some examples using OSRD simulation. So here a train between Paris and Lyon, so it's a high speed train, a TGV. Here you can see it on the map and then you have a linear margin. I don't know if everyone can see the green lines, yes. Okay, so linear margin on top and then maricot algorithm at the bottom. And then here the orange curve is the slopes along the path of the train. So here on this example you can see some like the triangle parts correspond to the train cutting down traction and then using the slope to accelerate again. And then you can see here that it's cutting the traction about a bit before the final braking here. So in this case we get 12% of energy saving between those two strategies for the same running time. Another example between gap and brilliance, so this is in the Alps, in the mountains. So the slopes here are quite strong and there are many like uphill and downhill slopes. So it's interesting because then we can use the triangle technique many times, cutting down the traction and then keeping up with the speed, the fence to the slope. As you can see here, there are many triangle shape. And then there are more stops here, so also more braking phases that we can use for coasting. In that case, it's 13% of energy saving also because the overall decliivity goes uphill. So it's not as great for this algorithm to work well. Another example in Bretagne, so this is the west of France between Rennes and Camper. With many stops this time, so I simulated a regional train that stops in many cities. So there are many stops, hence many coasting phases before braking. And the overall decliivity is quite flat. So in that case, we get 20% energy saving, which starts being a lot. And last example would be near Paris, so between Paris and Mont-Flajoli. Also many stops, but this time the overall decliivity is mostly descending. So it's very good situation for the algorithm to be efficient. Well in that case, we get 32% of energy saving. So let's plan all the trains with this algorithm. What can go wrong? Well, there are some impacts of the Marico algorithm on train planning and operation. I'm going to start with the few downsides. Most of the margin end up towards the braking phases because that's where the main coasting phases are. So we need to use it a bit carefully, especially in the long distances. I showed you the Paris to Lyon trip without no stops in between. So most of the margin was at the end, which means if the train leaves late, it's going to catch up near Lyon, but it's not going to be able to really catch up on the way. So it's going to be a bit late all the time, which is not great. So it needs to be used carefully. You can also deteriorate the headway, so how many trains can run in a certain amount of time because it can lower the speeds a bit too much in some areas. It also considers that drivers will follow the fastest drive at low speeds, so accelerating as much as possible, which is not the case if we study at driver behaviors. So we in the end plan trains a bit wrong if we consider that every driver will accelerate using 100% of the traction force. The good news is, sorry, energy savings. Well, this can be a lot of money in the end for the company. Each percent can be a lot, so imagine 20 or 30%. It's a bit more similar to driver behaviors, especially experienced drivers that know where they are driving. They will anticipate the slopes and they will cut down traction to save some energy. So it's more similar to the actual driving than the linear margin. Strong accelerations are better for the headway, especially in dense lines. So you want trains to leave the stations as early as possible and then drive at high enough speed because trains that drive slowly have really bad, sorry, for the headway. And coasting before braking, also in dense lines, brings drivers to approach to reach the stations at lower speed because they have been using the coasting before. So they can anticipate and adapt to braking better if there is a train in front of them with and they get like a caution signal, asking them to slow down. And that's it. Thank you. APPLAUSE We have three minutes for questions, I think here and here now. Yeah, you mentioned that grading doesn't cost any energy. We generate a risk braking. It does save energy if you rate slower. Does it take a different program? This algorithm doesn't take it into account because it's too old. So I don't think trains that could gain energy at braking were a thing at that time. I personally would like to adapt this algorithm to take this into account in the future if that becomes one of our needs for OSRD simulations. But yeah, you're right. What about the length of the train? I mean, for instance, a long freight train, you can get some very different things because there are supposed to be a sensing and a sending. Yeah, so I forgot to, I think I need to say the questions again for the microphone. So yeah, the question was what about long trains, especially freight trains, that can be very long? Well, yes, the cleavities, the slopes can compensate on very long trains. Well, this effect, this algorithm also works, no matter the length of the train. It's just because of the binary search, we don't know the exact output. So we simulate it, we see the total time, and then we adapt the V1 and Vf velocities for the right time that we want. So it's actually taking into account those, as long as the algorithm you use is taking this into account in the simulation. One last question. Sorry. I see your graphs because they're on time, but when you save energy, I will assume from the graph that the graph is now shown, it covers less distance because you have time over speed. But where is the time actually saved? Is it normally that the train would just get at its end station earlier, and now just take that extra five minutes away and spread it out by saving energy? No, actually the graph was showing only show speed and distance. So the time, if you show it on, I don't know if there is, yeah. No, we don't have time for this, okay? Sorry. But they actually arrive on the time. They drive a bit slower, so if you represent it as a space-time diagram, you will see that there will be a bit more horizontal because they drive slower. Thank you, Alex. Thank you.
Railway signaling: detecting conflicts in a complex world
Hi, I'm Eunice and I work for SNCF Rizzo and the OSRD project. So standard disclaimer, the opinion in this presentation I'm maum and not those of the OSRD project SNCF Rizzo or the Open Relay Association. So what's OSRD? It's a railway design toolbox built around microscopic simulation. So it allows to perform operational studies and also to find last minute pass through the infrastructure without creating any new conflicts. It's a licensed under our GPL V3 and funded by SNCF Rizzo, the European Union and the French state. So a short signaling primer. The main goal is that trains do not crash into each other or derail. The problem is trains are very hard to stop. They take a very long time to slow down and they need to know that they should slow down very much in advance. To do that we use signals and in order to actually use the signals we need to know where the trains are. So for that we use track circuits and the axle contours. But basically we divide the infrastructure into zones and in each zone we can know if a train is there or isn't there. We call the space between signal blocks and the block detection zones. Another thing is that train must not go over a switch that isn't set for them. So for that they need to have actually an itinerary through the infrastructure. We call that a route. And the route need to be established which means that the switches must be locked in place before a train can pass the signal at the start of the route. So for example this is using your ball signalings which is the French main signal system. So here we have a train and behind it this route is set so you have one red light, one yellow light that signals that this is a red light and then it's okay it's a green light. But up above the route isn't set this switch is dangerous so there is not only one red light there is two red lights. This means that under no circumstance a train must pass the signal. Here a train can pass it but very slowly. So we have a number of challenges. Every European country has its own signalling system and actually multiple. And there is standardization which is called ERTMS which is actually three levels of signalling system and even more complicated than that but it's not widely deployed yet and it probably never will be because nobody is going to upgrade an online for no reason. But we need to cover every single one of those cases so far as it's just another standard. We also need to not simulate the infrastructure every time we make a small change in a train for example departure time or in a forest EDCM for example it use an ASTAR through the graph of time, space and speed in order to find the pass that doesn't conflict with another train and every iteration of that ASTAR we cannot simulate the whole infrastructure it would just one scale. So we need to be able to model the capacity needs of a train while only simulating it. And also most of the application should not have 15 implementation of everything because of 50 signalling system it should be very much abstracted. So our approach is that signalling system have a very restricted view of the infrastructure the only thing that only see in front of it and it see linear pass until the next signal. So they see the state of the zones they protect and they also see the state of the next signals. We give them other metadata such as the speed of the approaching train or the kind of train this is useful in some special case. And we also separate the concept of signalling system such as BAL or RTMS and signalling the rival which is the actual code that implement the behavior of a signal. For example, yeah, so they depend on the output and input signalling system. So for example, here we have the BAL signal that followed either by a T-VM signal or by a BAL signal also. And we have two drivers, so two modules that under every BAL to BAL signals and every BAL to T-VM signals. And we inject BAL parameters because this is actually a BAL signal since the actual lights are using the BAL signalling system. So from that we can actually feed along the path of the train the state of the preceding signal here, get a state and feed it forward and we have signalling. But there is a number of problems with this. It's very cool but as you can see the actual signal reacts after the passage of the train which is quite normal because that's how it is in the real world. Our problem is actually trains need to see green before them. Their actual needs are before them and not after them. They don't really care. And we linearize the path but what is the path of the train that follow our train? We don't know because as we said earlier we are simulating each train alone. So we need to model the capacity requirement of a train but only knowing that this train pass. So why do train conflict? Either they are following too close to each other in which case they need the zone in front of them to be free or they have incompatible routes which means that they need the zone before them to have some specific switch configuration in order to proceed. There are other reasons why train conflicts such as power delivery needs and many other reasons but we don't handle those and they have nothing to do with signalling. So what's the spacing requirements? We have a zone, a begin time and end time. It's quite simple. For a route we have a set deadline which is the begin time. We have the actual switch configuration. So in order to set a route you need to know in which direction you are going to traverse the zone and what is the switch configuration you traverse it. So how do we get this? Every time a train encounters a signal we start by assuming the zone in front of the signal is occupied and we probe the infrastructure linearly until that signal becomes green again. And then we know that all the zones where the signal wasn't green basically are part of that signal requirement and we can adjust the begin time of the zone to match the time at which the train saw the signal. And every time a train leaves the zone it doesn't require it anymore. In terms of routing requirements most of the parameters only depend on the path of the train. So if we go earlier the route, the traverse zone, the detectors which basically indicate the direction in the zone and the switch configuration only depend on the path of the train. We know that. We are simulating the train. But in order to find the setting deadline which means we need to know which signal protects the entry of the route. And not only that signal but the signal before it because as we saw trains can see a signal being packed by signal before the actual protecting signal. So basically we probe in the other way so we set all the zones in the route as incompatible which means that the route isn't set and then we iterate through the signals until we find a signal that's green. Good so now for a train we have its conflict and spacing requirements and the good thing with those is they are indexable by zone so we can simulate every train once and then keep a database of each requirement and we simply need to check for every zone if all the requirements match. So the spacing requirements are never compatible if they overlap and the routing requirements are compatible if they go in the same direction and have the same switch configuration. And if we had a new train we only need to check its requirement and same thing in the ASTAR of STDCM if we had a train we only need to check that the new zones that are traversed by ZC ASTAR iteration are actually conflicted. So in the future we want to implement TVM support. We are actually in the process of doing that and it should be done by the end of the month. We want to implement support for overlaps. The main problem is friends doesn't use overlaps which basically are zones that must be free after the rest of the signal in case a train doesn't stop there. Friends doesn't use that but for example Germany do and we do not have any German on the team. And same thing for other countries signaling systems we want to implement those and the contribution are very welcome. There are also moving block support with basically the ERTMSV3 and in order to implement those we probably need another model specifically for moving block systems. Thank you for listening. Do you have any question? Five minutes question. Here. Thanks. Another one? You mentioned the different signaling systems and the different operational rules. Can you move this quite flexibly or as you said the implementation of TVM as a manual coding? Manual coding of the signaling system on the driver. The signaling system is quite simple. It's not actually a JSON file but it could be. So it's just declaring what properties the signal may have and there is actually code to check when we actually construct the blocks that they are correct. So basically sanitization of user input and in order to make the driver you just decide what are the possible transitions for your system and you implement those it's basically one function class. So it's actual code but it's quite simple. So when you do the root planning you know of course what rules you train such as post-created times but I can imagine that the train will occupy is on for a short time and the other will have to wait a little bit for that to become available. So there is an optimization problem there like what do we do? So this is a system how does it tie in with the actual timetable planning? So the timetable planning in operational studies are done manually. So because the people doing operational studies actually do this manually or want to do this manually for now and in the case of SCD-CM we cannot change the path of any train because they are already sold. So we cannot make a new train but it must not interact with any other train. So essentially this gives you a yes or no? Yeah it gives you a yes it's possible and this is the fastest path. Again for now. For now yeah. What's the difficulty? Was it a challenge for TVN's? Because there is this big limit on blocks with the speeds before? Well it's yeah but this was a challenge to integrate in the design. But for now it's not a challenge it's just a developer bandwidth. Does the conductor of the train see some kind of nice map that they can see how fast they can still be going? Or do they just see the green light red light, orange light, double red light and they just kind of react to these very basic signals? So on BAL signalling they only see the green light, white light, etc. On TVM it's actually a cab signalling system. So the driver sees in the train what speed he should go. But yeah on BAL I don't think. There is no connection so the driver just look outside the window and see what he has to do. What challenges do you have now? Or how does this simulation help in case of delays and dynamically reallocating paths or timetables where the delays are in a cabin at a scale? So we do not currently support any dynamic simulation but we plan to. We hope so. And well in order for dynamic simulation you pretty much have the same constraint and simulating. You need to actually simulate what state it's in at any point but you also need to know the. The resource needs in front of a train. So the situations are for now resolved manually like from the control center? Oh yeah OSRD actually is not used in real operations in real time for them. So I think at SNSF most of it is done by experience of the regulator. This one last question maybe short. Yes please. What is the safety requirement of the company? No. Thanks. No no no. That's it. Thank you.
How we at Deutsche Bahn develop IoT use cases quickly and cost-effectively
Okay. Yeah. So great. We managed to set up everything. We have a demo, so we needed to do this. So without further ado, I'm very happy that Olga is here to tell us a little bit about the ISU case and what it's done. A little bit of less time, so please condense it as much as you can, but take your time. Yeah. Thank you very much. I will give my best that we just in time arrived to finish off this lot. But let us start. My name is Horace Koch and I'm working for DB Suster, the IT company of Deutsche Bahn. And I'm the product owner of IoT. We works with applied IoT. And if I make not IoT, then I'm a member of Aka Open Source of Bitcoin, the chairman take a look. Digital association, Bitcoin and pushing forward the open source ID. And some words about my employer. The visitor is a 100 person subsidiary of Deutsche Bahn AG and is the digitalization partner for all Deutsche Bahn companies. And we have currently 7000 employees and managed over 500 projects and services in the cloud. And if you are looking for a new challenge, please take a look at dbd.com. Okay, let's start. What is Internet of Things? Here's the definition from the Wikipedia, but I would like to describe it with my own words. The aim of things, the aim of Internet of Things is to measure some conditions in the real world and to link and evaluate it. And this information and ultimately to derive the measures from it. And we have unfortunately only a little time, but I will try to give you a very deep insight in practical usage of IoT and for example, so we realize a practical project in this talk. We would like to measure the air quality inside of this room. And yeah, I think also it's a quite funny topic. And yeah, we will touch the theme. Where can I get the sensors? How can I transmit the data? And finally, how will be this process stored and visualized and so on. So let's start with the sensors. Where do the sensors come from? Yeah, after understanding the customer's problem and determining suitable matrix, the question arise which sensors can be used to reliable determine the matrix. And normally we try to buy this from the market, use standard sensors, but from time to time there are no sensors available for this topic. And then we give contracts to our DB company, DB-Con T, or DB System Technique, or maybe external partners to develop the sensoric after our specifications. And from time to time we make some in-house development and for this we use some sensor platforms. For example, other foods, fever, V-MOS, or Tinkerforge. For our project to measure the air quality inside of this room, we use the Tinkerforge and take a look inside the portfolio from Tinkerforge and found two interesting sensors. One is the air quality bricklet. This sensor measures the air temperature pressure, humidity, and an air quality index. The air quality index measures some gases and other values and gives us a calculated value for this. And the second sensor is a particulate metal bricklet. This measures the particles inside of the air, for example fine dust. Both sensors are connected to the master brick and the master brick makes the communication between my laptop and both sensors and we can now take a look. I connect it to my laptop and hopefully you can see this. We fire up the Tinkerforge bricklet fever, making a connect and we see all the bricklets and bricks. They are connected together and we can see some values from it. Without code, some lines or something, you can make first analysis. Is this possible to measure with this sensoric the right values? And is this worth to go further on? Okay, let's go back to the presentation. The next step is connectivity. How was the data sent to our back end systems? And for this, there are a lot of transmission protocols in the IoT environment available. Here are the four important protocols. But it's really difficult to take the right one because some need further infrastructure. For example, gateways or other costs, monthly pay or something. And you also must look for data, for example, bandwifes, coverage, energy consumption and so on. For our example, we only use the Wi-Fi connection of my laptop. So it's really easy. Normally, we use NeoBand IoT because we can't use the standard one. And NeoBand IoT based on LTE is more or less available. Okay, we use the MQTT protocol. MQTT is more or less a producer-consumer model. The producer writes some data into a topic. On the message broker, a topic is directory structure also. And the consumers can subscribe exactly this directory, this topic. And if the producer sends some data, the consumer can read it. Or got it pushed from the message broker immediately. Okay, we use AWS Core IoT for this. And there is a product IoT Core. And IoT Core is a perfect MQTT broker for us because it's full-managed and co-ops, auto-skating. And so on, and you have to work with this. Okay, then let's take a closer look into the code of this. I'm not a programmer, but it's so easy, anybody can work with this. Tinkerforge has a lot of examples, and it's more or less intelligent copy and paste. You take an example, you write a unique ID for every sensor, so you can connect more air quality index sensors together or something. Anybody has his own ID, imports some libraries, and here is the important part. We take our certificates for the MQTT protocol communication, and we create two callback functions, one for the air quality sensor and one for the particular meta sensor. And here you can see it's easy, one call to libraries, and you get all information from the sensors. Here we have a little bit formatting, print out, or write it to MQTT. The same for the particular meta sensors. It's really easy, there are examples everywhere available for this. And we can fire up this. We start this Python program, and here we can see the values. The values will be formatted into a JSON format and then sent to our MQTT program. Okay, let's come back to Fing Sport. Fing Sport is a relatively new software, it started in 2016, but in shortly the time it's more or less the market leader in open source IOT platforms. One question around, who has any heard from Fing Sport? One, two, three. Okay, perfectly. It's an open source software with open Apache license, and it is all in one solution. All aspects of IOT is available, API agree, so it's really easy to configure your system and so on, reporting, scheduling, visualization, and so on. And the best thing is the root chain. The root chain is a little bit like Node-RED, and there you can configure whatever you want to control a backend system. For example, if the error is too bad, then open the window. Okay, next step, it's always a good idea if you use an open source software to take a look at an open hub. And here are... Oh, five minutes left. Okay, it's a good software, it's perfect, insecure. It's a microservice architecture, and it's really easy to install it. And I will show you a little demo. Okay, here's our Fing Sport system. I fired up, and first we must create an integration. Integration is the part to subscribe an MQTT topic from the pro-car. Maybe you remember. I prepared it before, so I don't make it now. And the next way to create a converter, a converter is for preparation of the data. Sometimes the data are in grad Fahrenheit, and you would like to do it in grad Celsius or something, so you can prepare the data for the storage and so on. Okay, dashboard. So we try to create a new dashboard first. It's insert, and we add a new widget, temperature, temperature, and a quality sensor. We select the device from which we would like to visualize the data and which data case we want, a quality index, and for the first step we would like the temperature. And go on, and so we create our first widget on our dashboard, and this you should be repeat sometimes, and then it looks so. Okay, I let it running, and we take a look in three minutes on it. Because the system needs some minutes to measure the correct values, the sensor is a little bit of self-calibrating. Okay, we come back to our presentation, and I would like to speak about some use cases. From time to time we make some IoT hackathon, for example with our customers to better understand their requirements or to find possible solution very quickly and to make some tests with this. And from time to time we make this also for HR to get new employees or to work with studies, for example, some digital resummon school or similar events. Okay, here's an example of our environmental sensor. For example, this is measuring temperature, humidity, pressure, and the best is it measures the particles. The count of the particles in the air and the mass of the particles and vibrations. And why we make this, some employees, some colleagues of digital signal boxes told us maybe there's a connection between pollution, fine dust and so on, and air hours occurred in our signal boxes. And I have here a screenshot, it's really actual. Did anybody has ID, what is wrong with this? Okay, it's difficult to see. New year, exactly. 20 minutes after New Year's Eve we have a massive growing up, defined us in our signal boxes. Nobody knows how can it be. At the moment 300 sensors are rolled out, but 50, 15, 20 signal boxes shows this. And now we have to make some evaluation, how can it be, and maybe it's a good idea to power off the air condition or the ventilation system or something, or take a look at the windows open, what is wrong with this. And here is another example, another use case, PESC. Okay, PESC control, we can measure the red visits and so we can reduce the amount of very toxic baits. And another example, Samurai, things what is a perfect open source software if you like to realize IoT project really cost efficiency. Okay, thank you very much. APPLAUSE We don't have time for questions, but I do want to see the diagram. Yeah, yeah, yeah. What does it mean? Please? What does it mean? What does it mean? Yeah. No, no, no, no, no, no. This is below 50, it's really good. But I think I'm not sure how long does it take until the sensor is recalibrating. Yeah, yeah, yeah. I think the air is too poor for this value, so maybe we should wait half an hour or so. Okay. Okay. Thank you very much. Thank you. And this thing was bought. Yeah, is it a bit like this? Yeah, the next thing you do is get up. Basically, it's a competition to open up and all the other stuff. Now, open up is more or less...
Transportr: the Past, the Present and the Future
So we are coming towards the end of the program. We have two short community talks as final talks of this morning. And I'm very happy that we have Nicola here to talk about Transcorter. Many of you probably know that as one of the free applications to take public transport information. So yeah, it's your state. Thank you. Perfect. Can you hear me with a microphone? Or is it just for the recording? Perfect. So yeah, then welcome to my short little talk about Transcorter. It's passed. The current state and glimpse into the potential future. Let me maybe start the talk with asking you who of you is using public transport regularly? Great. I could have guessed that, I guess. But then some of you may know this kind of problem. You go somewhere, you travel somewhere, there's a different public transport system. To find your way through it, they want you to download their own app. Usually proprietary from Google Play. And then at some point your home screen gets cluttered with all of these apps. Some alternatives that you may know is using Debena Vigartra that works quite well in Germany. They include a lot of regions with decent data quality as well. But then first of all, since the new update, I think there's no map inside anymore, which I find pretty, well, it's a bit sad. And then there's also some people found out that Debena Vigartra is sending or connecting to a lot of tracking services, even if you declined that. So maybe that's not what you want to use. Well, Google Maps, another option, I guess we don't have to talk about why you maybe don't want to use it either. So as you guessed, a transporter, it tries to be another alternative to these kind of things. It was created 2013 by Trosten Grote. As you may notice from that picture, that's not me. I'm Miko Leguccio. I'm filming under Jallochim on GitHub. And I started contributing to Transporter in 2017. So when you open Transporter, it might look like this. You have a list of networks. You can choose where you are. And then you can basically look for journeys as you would expect. So in this short talk, I will, well, first of all, tell you a bit about how Transporter works, the internals. How do we get the data, basically? Then, as I said before, the past, present, and future of the project. So first of all, these official apps, how they work, well, they have their data source, usually in some proprietary format. And they have the apps that talk to some APIs that provide the data. In the case of Google Maps, it's a bit differently. They don't use the data directly, but they use a format called GTFS. That's a standardized public transport format. Initiate it by Google, but it's an open specification. So you can create your own GTFS files and also consume GTFS files, GTFS files as you want. And that's what Google uses internally for their public transport routing. Now, where does Transporter come into play? Maybe you've heard of Uffy before. That's another app that also works on Android developed by Andreas Schildbach. Even before Uffy itself was open-sourced, Andreas Schildbach also already open-sourced a library called Public Transport Enabler. And that is basically the wrapper that contains the logic to connect to and understand the data from the official APIs. And Transporter is using that same library. So huge thanks to Andreas at that point for him to open-source this and making a Transporter possible. Then there's also a second part in Public Transport Enabler where you can consume GTFS files by a proxy. In that case, you don't use the GTFS files directly. You don't perform routing on your phone. But you use some third-party provider. What Public Transport Enabler was using is Navicia, a French company. They provided this service for free, basically consuming the GTFS files and then exposing them as an API to interested apps. And that's actually how I got in contact with Transporter because when I was spending some time in Nicaragua, I was working there with some other volunteers to gather Public Transport data, schedule information, put that together into some GTFS file and then in the end making routing possible for a limited region, but at least with apps such as Transporter and Auffie. So now that you know a little bit more about the internals, I would like to go on with the project itself, how it evolved. This is the graph of the code frequency on GitHub. So as you can see, there has been quite a lot of activity in the beginning. Initial commit 2013, release 1.0 2015. And then in 2018 with this huge spike, there was a major rewrite of the app. Most of this was done by Torsten Grotter. And as you can see afterwards, activity declined a bit. Well, both Torsten and me were busy with other stuff. So this is actually a try to get some more or attract maybe some new contributors to Transporter with this talk. Maybe you've noticed that at some point we even got removed from the official after-ride repository because they found out that some library or the map library that we were using was not fully open source. And it became necessary to switch to an open source fork of that library that actually didn't include the non-free dependencies. And another thing that happened last year is that Navicia started to change strategy. It's like the new version of their software is not open source anymore. And they also stopped serving a lot of regions. So at that point, for example, Nicaragua was not available anymore, which is a bit of a shame. So in 2023 we also got some new interest from the community. There was people asking about the future of the Transporter app. Well, that's got some new energy and we finally finished the migration to the new map library. And since one month ago, we made it actually back to asteroid. We're back there. But as I said before, we lack some regions that were supported before because Navicia stopped providing them and because some APIs also broke over time. Well, as I said before, there are some new contributors. There's some effort to move to new design theme, which is great. There are quite some open issues. Some of them are bugs and many of them are feature requests. And a lot of them are actually marked as so-called beginner jobs. That means that they are supposed to be quite easy to tackle. So if anyone watching this or being here feels like looking at some Kotlin or Java Android code, feel free to pick one of those and try to work on them. And apart from Transport itself, it's also nice to see that the whole ecosystem of similar public transport apps is growing. So there's Uffy, as mentioned before. That's open source now for some years as well. Then there's itinerary, an app that tries to do or does even more things than what Transporter is trying to do, like saving your tickets. There's another Linux app, GTK-based, in that case, which is pretty new, also looks really nice. And on iOS, there's also an app that I'm not sure if it's fully open source, but at least there is some variety to choose from. And looking at this ecosystem that is growing, I think it would be nice to try to combine efforts in some way. And maybe what would be nice as well is find an alternative to what Navicia was providing before, some kind of shared service that is maintained by the community that can use GTFS files that are available for a lot of places in the world and provide an API that can be used by all of these apps and even more. So that's for me. I have three steps for you. If you haven't already downloaded Transporter either from Afterhood or from Google Play, if you find anything that doesn't work as you want, tell us. Look at the code, contribute, and have fun using public transport. Is there one quick question? Hello there. I tried to navigate to the results yesterday. Yes, Bajum is one of the regions that was supposed to be broke, the API broke, and I think we would have to look into what kind of API they're using and then maybe feel free to look into that and contribute. Sorry, we don't have any time left for more questions. Please talk to him, please contribute, and we are moving to the next presentation. Thank you.
Software needs of a volunteer operated heritage railway
So, we're coming to the end of our last talk for today. I'm very happy that we are closing with a real training operation. And in this case, we're talking about how to do that with open source, how the source can help there. And yeah, Niels, it's your stage. Yeah, thank you. So my name is Niels. During the week, I try to make medical devices speak together. And on the weekend, I'm playing with trains. So I'm working at Dampfbahn Frankische Schweiz on the weekends, which you can't see because it's too bright on the beamer. For location, this is Feuchheim, which some people might know in the medical industry, because that's where my employer is. And the next bigger city is Nuremberg, which is somewhere around here. We have a short line, 16 kilometers. It was closed down in 74. We are running it since 80-something. There's something like 30,000 passengers per year. So we are slightly sized for a heritage line or museum ray-ray. We have 400 members in the club, of which 40 are actually active. We are running steam and diesel every Sunday from May to October, sometimes occasionally holiday train or special train, or whatever. But May to October is the main time. We are completely volunteer run. So we have a professional safety manager, but everything else is completely volunteers. We are a real railway running under FONE regulations, so slightly easier rules than the Deutsche Bahn has, but still real railway rules. And because we are kind of the only railway in the region, a lot of local initiatives are in contact with us. There are some initiatives we want to reopen public transport on the line, which for us would be good, because then we can get a lot of Trassengebühr. And it would also help the area quite a lot. So why do I do the talk here? So first, I want to put heritage lines generally on the radar for you guys, A, to come and visit us, and B, because we have a lot of need for people doing IT stuff. And the interesting thing is, as we are running some of the heritage lines, have their own line, where we can do more or less whatever we want and what the E-Bahn allows in Germany. We are the perfect experimentation ground, if you want to try out some stuff. If you look at Europe, we have about 100 heritage lines in Germany. UK is the absolute mecca. They have, I don't know, far above 100. They are about 50 in Austria, 20 in France, 10 in Belgium. So all over Europe, you will find some lines. They are organized in larger communities. In Germany, it's the Faudi MT. In Austria, the EMT. UK has the Heritage Railway Association, the HRA. So there's a bigger group, and there's a European organization called Fedeg Rail. What's our problem? We like trains, but we are horribly bad at computers. So this is kind of the typical members. So that's me after a training shift as a fireman. This is our engineer who is in real life a state attorney, and this was my trainer for firemen. He is in real life a medical doctor. So I'm the only one who has to do something with IT in this picture, and I'm probably the one of three in the whole club who has to do something with IT. So big problem if we want to run anything in IT. What do we do? Of course, we do the stuff that a normal railway does. So we sell tickets. We run trains. We operate and repair the infrastructure. So we have switches and signals and things. We operate and repair rolling stock. So we have coaches and wagons and locomotives and everything. Not the usual stuff the Deutsche Bahn has to take care of. There's a little overlap. We have a V60, which I think is still in operations at Deutsche Bahn as well. But most of the stuff is 80 to 100 years old. And we have workshops and sheds and all the infrastructure around. You need to have a railway running. But also we have a nonprofit part of things. So we have archives. We do this to preserve history. So we have a lot of documentation on our trains. We have photography, everything on the historic side of things. We are a club, a fein in the city of Germany. So we have to do membership management. We have to do all the tech stuff you need to do for fein. And we need to get somehow money for everything. So we need to organize donation campaigns and try to get fundings for things. And it's not just the, so we cannot just run on tickets alone. If you need to do a full inspection of a steam engine, we are talking about half a million of euros. And that is about 10 years of running. And we need to do every 10 years this inspection. So, and we have four steam engines. So you see the problem. OK, we still run railways as in the 1950s. So our line, in our case, closed down 74. It's more or less in the same state. We are now, but the rest, the signalling, everything is still in like in 74. Our active members, unfortunately, are getting older. We get newer active members again. But unfortunately, the everyday workload on people has also increased. So you cannot spend all your life at the railway anymore. Some people need to earn money. And this is kind of increasing, decreasing the time that people can spend on the railway. They are higher safety requirements. So even if we run railways like in the 50s, we still have to fulfill all the safety requirements from the 2020s. So this is kind of challenging. Our customers want more. So you cannot come and say, yes, the ticket office is open from 8 till 9 on Saturdays. They want to buy a ticket on the internet. And of course, growing regulations and administrative effort, which you have everywhere. So the problems we have, tickets, we still sell these Edmondson tickets. As you can see there, there's carport things. And one of our members also has a printing machine for that. So this problem is solved. But we also want to sell tickets via the internet. And there's not really a good solution. There's Farkapendrucker.de in Germany, which works. And works reliably. But this thing is stuck in the 90s. So if you look at the layout, it doesn't have responsive design. And it's really hard to use. The back end is quite OK, but the front end for the customers is horrible. Unfortunately, it's the only thing we have. The other thing is you could use some kind of an event ticketing software. So a lot of people here would probably know Pre-Tex. Pre-Tex is absolutely great, but not made for railways. So it starts with seating arrangements. So usually we want to have some algorithm that not everybody sits at the windows, but that if you book multiple places, you have to set the window. And then you fill up until you only then are allowed to have the next window, because else you have all the windows taken. And then the rest will stay empty because people say, I want a window. I don't go there. There are hundreds of bachelors and master thesis on how to do a ticket selling system. But none of them has made it into an open source software, unfortunately. So this is something Farkapendrucker works. Using it helped needed. Running trains. So for timetables, for us it's pretty simple. We have one line, one train. So we take the same timetable since 20 years. So this works. But if we get more complicated, we might want to run two trains. And then we need to have a serious timetable. We use now J-train graph and FPL edit, which is made for model trains. Works really well. The FPL edit author also added GTFS export now. So we might show up on Google Maps hopefully soon. And OpenStreetMap and Traveling and all the other apps which could use GTFS. So there's probably also some larger software interesting. We have some stuff like the signaling. So this is our signal box, the complete signal box. The other safety feature you need to know, there's a key. This is something where we can improve on. And it might be some stuff like just putting a GPS tracker on the train, which then has the other problem. There's no mobile phone reception on our line because we're in Germany and in the middle of nowhere. So there are lots of things where, for example, the IoT department can have a field day. We have passenger information systems, whatever. So there's a lot of things where you could create new software, which would help us a lot. Managing rolling stock. So right now, every train management, so train cars have regular inspection dates and a lot of paperwork attached to. This is now managed in an XCloud and an Excel sheet. And that's already the advanced technology solution. Usually it's paper folders. So a lot of things there. We have reports, regulations, whatever. So this is kind of a nightmare right now. But we also got good feedback from our regulating body because we handed them readable PDFs and they said they were better than what they get from Deutsche Bahn. So what's our problem? So basically, the left side is the museum railway, half where we need to know our problems. We still don't really understand the problems well, so we need to get better at that. We need to find the solution and we need to be able to apply the solution. That will be the big thing. And the other thing is also the software side of things, so it needs to fit to our problem. We need to be able to find this software. So if you search for ticket systems, you will find G-RA and all that stuff. So completely unsearchable now. And it needs to be really ease of use because we are not good at computers. So this whole thing started at the Gulasch Programmier Nacht, a couple of, actually last year. We did a workshop at the KS Communication Camp and a small group formed trying to get the IT nerds and the railway nerds together. And that's also what I want to present here. For you guys, why should you bother if you don't like playing with trains? Playing with trains is fun. But you could also use museum railway as a learning ground. So if you work in software and want to do something for transport but don't know how railway works, we are a place where you can learn that. We are an experimentation area. So you can do a lot of stuff on museum railways. So the railway regulations are quite wide open for experimentation. So I'm really surprised what's possible sometimes. Coming from medical device where you can't do anything. And you can use this as a test bed where you have, we have the simple case. So we have one line one train or one line two trains if it gets more complicated, but we don't have the full network like DBS. How you can join? So to have the super easy entry point, we just created a Discord chat where you could just join. We are currently four or five people, so smaller. Hope to grow more. We have a wiki at kaosban.net where we, which was kind of the original idea dumping ground, which now gets a bit more formalized into a knowledge base where we try to collect the problem cases, the possible solutions, where there aren't solutions to get an overview of things. And we're starting now to network with different heritage line associations. So at the Faudi MT meeting in three weeks in Aachen, I will present basically this talk again and do the same publicity for heritage lines then. So then I'm done. So join us on the Discord if you like. We're open for crazy ideas. And if you want to play with trains, there's a museum railway near you which takes you with open arms normally. There's one question. Yeah. More information. I started this with a Netflix presentation. And in Norway, we had six rotation train nights that is using the M plan that we don't mention to produce netx data. So they are integrated in national trip mapping. Yeah. Yeah, I made a lot of notes during your talk. So. We also converted the beganesis. So just repeating for the video, in Norway, there are, how many you said, six museum railways to all you're already using the netx tool from the first talk. So if you haven't watched the first talk. Do you regulations on station visibility that the station has to be on a straight section of track applied to the historic railways? We have kind of at the limit of what I know. So I think we have some kind of heritage protection that still stuff can stay. But for example, we have one halt in a curve which we cannot use anymore because the border of the platform is not really there. So it's just a meadow which ends up in the track. And we need to put a clear border between the platform and the track there. So there are some regulations and some safety rules. But I think not 100% what the big railways have. In the Czech Republic unfortunately a lot of cities have lost their railways service because of this regulation. Yeah. So for video question was on the, if the regulations for stations that it must be on a straight part of track and that everything must be visible applied for museum railways. And then Czech Republic, a lot of towns have lost their railway access to that for that. So one last question. Thank you for the presentation. I really enjoyed it. I have also a lot of ideas. I'd be a very consider to get into the debate network. Because for example in Italy you have the foundation of like Vain Tardien. And you can actually buy within the ticketing system tickets in the Vain Tardien system. So the question is if we have considered joining DBNETs or DB infrastructure as well. Not really. So for ticketing for example we didn't consider that because it worked so far. It was a lot of manual work but somehow it works. And everything which is external gives external costs because DB doesn't do stuff for free. But we get work time for free. So if we can do it by hand or have to pay for it then we do it by hand. But they do have like their eye or their mind. Yeah. So there might be something I haven't really looked at that. But for tracks for example if we go out we are running on DB tracks and have to join their tools and work with their tools to get into the timetables. OK great. Thank you Niels. Thank you.
Closing Railways and Open Transport devroom
Thank you all and let's share a Belgian beer now. Bye.
Enhancing the video call experience with Forward Error Correction
So good morning. I am Flor Arlé and I am here with my colleague Jean Monnier to present you a way to announce the video call experience by using the flexible forward error correction. We work in the Bell Dawn Communication Company. This company has developed the LIN phone product that is a soft phone open source to make some video and audio calls. And it works on several platforms. So today I will explain to you how we implement the forward error correction for our video calls. The first way we choose to use the flexible FEC scheme, I will explain to you how it is described in the RFC 8627, how it has been developed in our products and I will show you some results. So at first let's talk about the forward error correction. If you see this real schematic way to represent a video call, you have at first two people who share a video call. On the center side you have a device that captures the video. The signal is encoded here by a video encoder that transforms the signal into frames. And those frames are split into packets that will be sent to the receiver through video stream on the network. On the receiver side the packets are collected and decoded and the frames are recovered back and the signal is displayed and the receiver can see the video. In the case of the video call we are in a real time context. So we work on the RTP retight transport protocol that describes how you can send video or audio with the internet. This protocol describes the format of the RTP packets and it is more regalable than the UDP because for example you don't have latency problem that is adapted for real time communication. Unfortunately in the real world you have problems, you can experiment problems with your network so the traffic may be high, you can have a low bandwidth and sometimes you lose packets during the transmission. So the receiver doesn't collect all the packets, the signal is not complete and you see that your video can freeze and it is really annoying for everyone. To overcome that you can use a strategy to recover your lost packets. For the forward error correction you will recover the lost packets by using the next packet and also a run-down-see information that is sent at the same time than the video stream. With this way you can recover your full video and have a nice video call. We choose to use the flexible scheme for our project. So when you detect loss in your receiver, the receiver side, you have several strategies that you can apply. You can ask to send the packet again but you will have to wait to get the packet. You can primitively decide to send twice the video stream but it is really costly. Or you can try to recover the packets with what you have at the time. The forward error corrections allow you to recover the lost packets by using the run-down-see information and the other packets. There are several algorithms, for example the low density parity check codes and there is the flexible forward error correction. That is the method that we choose because it is really simple. It is based on the combination of the packets with an exclusive XOR operation and it is free. There is no patterns. It is a recent standard. It has been for example developed in the web RTC. So we can be interoperable with it. The standard is described in the RFC 8627. It describes, this document describes fully how an FTP stream can be protected with flexible FEC. It gives the format for the repair packet that will be sent to carry the run-down-see information and it proposes the codes to generate those packets and to decode to reconstruct the lost packets. This RFC is applicable to all media, not only video but also video, text and application. So now we will explain you how it works as described in this RFC document. So at first when you have a video stream you send the packets within an FTP session with a source RTP stream. Your packets are here represented by the squares. They have a unique sequence number that increase with time. And when you want to make a flexible FEC you add another run-down-see RTP stream and you don't change the source stream. So it is backward compatible. The principle is simply to take a set of source packets, combine them with your XO with a priority FEC encoder and generate a repair packet here. So for example this one is called R5, R4. Why using XO operation? It is because of a nice property of this operation that makes you able to recover one of the packets if you have all the other. So you can encode a packet and decode a missing packet. At the receiver side when you detect a loss, for example here the packet S4 has been lost, you can get it back by applying your exclusive OR over S6, S5 and the repair packet R4. And then this new packet can be sent to the stream here. To operate your flexible FEC you can choose several parameters. You have to decide what is the length of your repair window. It is a duration interval that allows you to buffer your source packets to be sure that you have enough source packets to make the recovery. And you have to decide which packet you will combine with which one within a protection pattern. So now we present you several protection patterns. If you represent your source packets like a block here from S1 to SdL with L columns and D lines you can make a first way to protect that is a row protection, a one-dimensional non-interleave protection where the XOR is applied on the rows. So here you generate D repair packets that protects a set of source packets of length L. Another way is to combine them by columns. So here you have L repair packets that protects the source packets with depth D. So now I will show you how you can recover the source packets with this combination. So here you have an example with row protection and here with column protection. Because you have random losses in your transmission you can apply the XOR to recover the lost packets here with the row application of the XOR and here with the columns. But in some cases it will be more difficult because if you have burst in your transmission it means that you will lose a consecutive source packet. You won't be able to recover because you can't have both packets here and here you will recover the columns that have only one loss but not the columns that have more than one loss. To overcome this problem you can make a two-dimensional protection. Here you have simply the combination of row protection and column protection and it generates L plus D repair packets. In this case you have in the RFC an iterative algorithm to recover the lost packets. So here I show you two examples with long burst here and here with random loss. The algorithm starts like this. You repair at first all the rows that can be repaired. Then you apply the XOR on the columns and you repeat so the rows, the columns until you can't repair any more any packets. Here you can see that the burst has been fully resolved. All the packets have been recovered but sometimes you don't have chains and you can't recover some patterns that are connected like a cycle here. So in this case you can do much more with the flexible efficiency. But this two-dimensional protection is really efficient for bursts. But sadly it has a cost because you have to send a lot of repair packets. You can measure the impact on the width that you will need with this term the overhead. It is a ratio between the number of bytes of the repair packets that you sent over the number of bytes of the protected source packets. Usually the repair packets are bigger than the source packets but if you suppose that all the source packets are approximately the same size, the overhead will be 1 over L for the row protection, 1 over D for the column protection and 1 over L plus 1 over D for the two-dimensional protection. For example here are the values of the overhead with increasing values of L and D on increasing protection level. You see that the overhead increases very fast. The RFC describes also what are the formats of the packets. So first you have your source packets with the LTP convention with an LTP header and a LTP payload. And you will generate your repair packets that are also LTP packets with a header and a payload. But within this payload you will carry two kinds of information. The first one is written in the FERC header. It's information about how to identify which source packets are protected. And in the repair payload you will have the result of the XOR operation between the payloads of the source packets. When you apply the XOR between the payload here you have to be sure that your source packets have the same length. So sometimes you will need to add zeros at the end of your payload in order to have the same length for all packets. A single repair packet will carry all the information needed to recover the source packets. It says the size of the source packet protected and which is the configuration of the protection. For example, here when you have R1 and you see that in the FEC header you read L is positive and D is zero. You know that you have a raw protection and the sequence number of the source packets that are protected comes from S1 to SN plus L minus 1 with the consecutive values. If L is positive but D is equal to 1 you also have a raw protection but you are inside a two-dimensional pattern. So you know that you will collect several repair packets that protect rows and then you will have a set of repair packets that protect the columns. And when L is positive and D is more than 1 you have the column FEC protection and the repair packets that are protected are interleaved. So from SN to SN plus D minus 1 times L and it can be the column of the two-dimensional FEC block protection but it also can be a single column protection in one dimension. This method has been implemented in our project NINFON. We decided to use four sets of L and D parameters. It comes from one-dimensional very low protection to high protection with three and three. Ideally we want to have two-dimensional parity protection but it has a cost because you have to send a lot of data. So we decided to adapt our protection to the loss rate that is measured in the transmission and also to the network capabilities. The repair window is 200 milliseconds. It is long enough to collect all the repair packets for any values of L and D and it doesn't cause any delay in the video. The RFC has been implemented in C and C++ in our LINFON SDK. All the elements of the FEC stream are written in the library ORTP and in our streaming engine for video and audio we added a way to manage the video quality with the flexible FEC. For many months about our strategy for the video quality, our rule is to make the best possible use to use the bandwidth but sometimes you don't know the bandwidth at the beginning of the call. It can change during the call and you have all events to manage. We want an optimal video setting so the best definition betrayed and firm rates but most important we don't want freeze in the video. So we decided to prioritize the packet protection before having high encoding setting. To have an adaptability to the network events we make periodic control of several values so we measure regularly the available bandwidth, the loss rate and the bandwidth that is dedicated to FEC. For example in this graph you can see that we propose to have low FEC protection when you have low bandwidth and to enable high level of FEC only when the loss rate is very high but if you have a lot of bandwidth you can have full FEC protection it is not a problem. And finally when you have congestions it means that you have too much packets and the transmission stops. You disable immediately the FEC because it is not your tool and it will make the things worse. So now we will show you some video with flexible FEC activated. So here you have a video, so we simulate a video call with a moving pattern. In the first window here there is 6% of packet losses and we do no protection so you can see that the video is really bad. It is a very very bad case, 6% is really a lot of losses. In this window we have enabled the FEC with a one dimensional row protection with L is equal to 5. You see that the video moves a little more but they are still freezing. In the last window it is a two dimensional FEC protection with a high level, 3 L is equal to 3 and D is equal to 3. And you can see that here the video is perfectly fluent so we have recovered all the lost information here. We have measured the recovery rate here with several values of FEC protection and you see that it increases very fast. So the flexible FEC is really interesting to recover the lost packets and the effects are really obvious. Another example here, this time we have simulated a transmission with loss and burst so we lose consecutive packets so it is a very bad situation. This time you can see that the performance of the FEC reconstruction decreases a little but they are still interesting. In the two dimensional parity protection you can see now some phrases but it is still much more fluent than the initial video. So we can make some conclusions about flexible FEC. It is a simple and resilient way to improve the resiliency to the packet loss in video transmission. It is based on the fact that you send redundant information on a dedicated stream. It is adaptable to the level and the event of your network and it works with a short delay because you don't have to wait that the sender sends you back the missing information. And the exclusive operation is really efficient and rapid. But you have to keep in mind that you will need a significant bandwidth so in some cases it is not indicated. The RFC 8627 gives a complete description of the flexible FEC scheme and it is clever because it is also backward compatible with the RTP protocol. And we show that it gives a real improvement in the video quality. So we decided to release it this year in the video calls of the Linfuan project and we want to in future work add it to the video conference and the audio stream. So thank you for your attention and we will be happy to answer any questions. Thank you. The question is about the size of the source packet. And in fact you are right, it is an issue that we have to deal with. The source packets doesn't have the same size. And for the encoding you have to pad the payload to make the XOR operation. And the thing is that when you combine them to build your repair packet you will have very high, very big repair packets and your overheads will be increasing a lot because of few big source packets. So that is a problem that you have to deal with. So you can change the size of the source packet if possible to make the more equal sizes. But you have to adapt the overheads to decide to have, you have to measure your overheads to check that the repair packets are not too big compared to the source packets and to decide to reduce the FEC protection in order to keep an overhead reasonable. But yes you have to take care of the real size of your source packets. I don't know if it answers your question. Thank you. Yes? Then you always have this fixed delay of 200 milliseconds, right? On the repair window? Yes. So we have a fixed value here. The question is do we have a fixed duration for the repair window of 200 milliseconds? Or it can be changed? The fixed delay. The question is the video output is put on the screen 200 milliseconds after the respective video break has arrived, right? Yes, the 200 milliseconds is a delay that you had before displaying your video. Yes? I'm sorry. Okay. Yes, that's the, in fact yes. Yes? So when you assemble in the stream and rows and columns, I know the second one is reversed. Is that right? No, it's not reversed. In fact, sorry, it was maybe not clear in the representation. You have, okay. This one? Yes. The draw comes back to here. So you read those one, then those one, and then those one. The second question, do you have any examples of an STP line describing how this is expected? Those ones. You have an example of an STP packet that contains a line that describes how this is established. The question is, is it mandatory to signal? So when the stream is a setup on the signaling layer, you have no flyer, I'm guessing, you still use STP and this would exist as a line in the STP to describe how it's established. Yes. The question is to know if during the call exchange, we signal the use of this protocol into the STP right. I'm not sure. There is a. Yes, okay. Signaling. That's the answer. Okay. Yes. So what you described seems very similar to RAID 5 with disk drives. So when you join drives in RAID, you have an eight blocks and then you have one drive block which contains a pad bit for each of the blocks. But there's also RAID 6 which has not one but two pad blocks. Could that be applicable to your skin here? So you have a line of five packets and then you have not one but two redundancy packets which could help you recover the line between two packets of lost. Okay. So the question is about what happens if we lose repair packets for example or if we. Could that skin be improved with having two pallet packets but one? Not one packet but two. Yes, it may be it's always a trade-off between what you what bond with you have and what you decide to send to improve the protection. There is a other way to other protection pattern described in the RFC. For example, you can decide to protect small very specific source packets by using for example a flexible mask. So you can have maybe here in this example decide to protect some packets twice and some other once or not at all. Yes, it can be an improvement to prioritize the most important packets in your stream. And there is other schemes. One pair is one pair two block one. Yes, there is other parity codes. Honestly, I have to try to tell you which one can be better. I don't know. Probably one of the problems if you apply too much protection is that you're also going to a lot of overhead. So at one point if you're in a lossy network, you send more data to try to recover from more loss, you end up in this spinning spiral that doesn't make things better. So finding the balances is where the black magic is usually. Well, thanks Flo. Well, no, it's okay. Oh, there's one more question. Please go ahead, we have some time. Maybe regarding exactly what you said, how do you know that you don't make it worse? Yes, in fact, we had the problem. So at some point we sent more information in the written on stream than simply sending the video stream twice. In that case, we control the overhead periodically. And when it goes above, for example, 1.9, you reduce the FEC protection. It's not always indicated. So it's a decision that you have to make. We have established empirical rules to manage that. Yes? I want to ask you about the masking of your gold right now. Yes? The slide is right now there. You have said that you can protect the specific package. Yes, like you protect a group of packets. Yes? That's for example in video conversations for you. For example, the push and the step or that X2, 6, 4, and you protect, for example, the key frames do be interpolated instead of, isn't it? Yes, so the question is if you can protect, for example, the key frames of the video conference. Yes, it's a way to choose which packet you want to protect. If you don't want to protect everything, but mainly the key frame, it's a good approach. Or you can make the one-dimensional, two-dimensional only protection, only when you have the packets of the key frame. Okay. So the receiver size. Is it right like all your key frames are on one column and you just protect them? Yes, but you, so the key frames are not necessarily in the same rows or the same columns, but you can change the values of D and L whenever you want. On the receiver side, the receiver just read what it has in the FEC header. You see the value of D, the value of L, and it adapts the configuration to recover the lost packets. Okay, so you can modify that value dynamically during the... Yes, you can dynamically modify the protection configuration. And it's very powerful. Yes. How do you measure the network's bandwidth, for example? Because without provoking the network with high load, right? Yes, how do we measure the available bandwidth? We have estimator in our program that tries to measure the... If I remember, the time delay between the reception of packets and try to establish the bit rate. And we see if there is congestion, if there is congestion occurs or not. But it's based on estimation. We have to deal with that. Yes, the idea of the algorithm that we use is to measure the regularity of the packet at the receiver side. And when it changes, we can deduce that the bandwidth is more... is close to be saturated. This is more or less the way that we use. So do you use RTCP for this configuration? Yes, and we also use RTCP feedback as well in order to measure... packet losses from the receiver side. But it's a bit different than just bandwidth. For the bandwidth, it's really the regularity of the receiver side, which is a measure. Thanks, Bois. Thank you. Thank you.
Shig: distribute and clone live streams among Fediverse instances
How is it possible? About interactive live streaming in a very worse. How is this possible or is it possible? To me, I'm Enrico and I'm interested in interactive live streams. Sorry. So, now it's better. I'll take it so. Sorry. Here are my contactless and I worked for different companies and even most likely in a conference system topics. And now we're talking about lessons. And in the 30 versus is quite interesting situation. When you're in a 30 versus for example when you're in Macedon, you read in post. The interesting point here is the post came to you. Means you have an app in Macedon or inclined and you don't care who posted the post on which instance the post itself is cloned from instance to instance through the further worst. Means you get a copy or a clone of this post. This is a quite interesting concept. So the instance in the background communicating to each other. How is he doing this? Of course with activity part, we had to talk right before this. So I will not go deep in it but the main idea of activity part is like you have an inbox and an outbox. And everyone in the 30 versus in terms of activity part is an actor. The users are actors, the servers are actors. And on the end you can send to every actor in the 30 versus a message or a post. And that's the way how it works. So activity we describe the things like in activities, it's like activity part, like subscription, follow and so far. And the other topic is content. It's all described in JSON. And how I said, the instances in the background communicating to each other and the content is flowing through the 30 versus. Activity part and live streams. They are in the 30 versus already implementation of activity part like OWNcast or PeerTube are the main famous. But the thing is we want a little bit more. I mean you have in OWNcast and PeerTube live streams but not interactable. It is not possible. It means without leaving your PeerTube instance or leaving your OWNcast instance you cannot interact with another stream or another instance. It's not possible. Yeah. That leads to a problem. It's called scaling in the 30 versus. That means on the end more or less the... More or less every instance provider in the 30 versus responsible for himself, you have to scale by your own. You have the possibility of course with hardware where you make an HLS CDN on top of this or this object storage. Those are the common ways how you can increase the amount of users that can watch you. But on the end you stay alone more or less. PeerTube try to solve this problem with PeerTube mail loader. It's quite awesome. Sometimes you see it. You're watching a video and then you see that other people are watching you as well. This means PeerTube Peer exchanging the chunks of HLS files. We are bit torrent and over-verb. It means you make a real PeerTube Peer connection to the other viewers. I put it on the top because this is the most common way in the 30 versus to share live streams. There are other ways as well, but most likely the basement PeerTube here in the browser. There's another way, it's web torrent in the background. Of course they can clone... Even PeerTube can clone videos from one server to another server. This is possible. And the new concept is remote runners. This is quite awesome. You can scale PeerTube with a remote runner. It means you can run other services that do the transcoding for you. Quite often it's re-expensive. This is the possibilities you have to scale your application or your instance. Oncast has a quite interesting feature. Oncast has a general concept. Oncast is you have a server and you only stream for yourself like this. But they have a dashboard. On the dashboard you can see every live stream in this time. But this dashboard is nothing else than an HTML page. They are linked to the live server. It means it's like a list of links. It's not really scaled because when you're watching there a stream, you're watching it from the server as well. This is the current state of it. But what we have now, we have ActivityPub. It is possible to share the information there as a live stream. This already worked as PeerTube as well. There's a live stream but you cannot share the stream itself. And what we want is we want to share a live stream. So in the live stream you want to have it interactive. Means an interactive live stream is a little bit more as if you have a stream with a track like a video and audio. No. We want to have it, you have a stream with a track and the tracks inside of the streams can change. You added new tracks, you added removed tracks, you enabled tracks, you disabled tracks and the tracks coming from different sources, different instances. When we can reach this, then we have interactive live streams in the Furryverse. It's not only that you share a stream, a static stream, it's a little bit more. This is what you want to achieve. It's like a conference in the Furryverse. And we already talked today about it. There's a protocol, it's called WIP and WAP. Of course we need a real-time protocol. It's clear we need WIP and WAP. It's a real-time protocol, it's a moment. And on the other side, there's another interesting approach, WIP and WAP. In short words, what is WIP and WAP? You make an HTTP request to a server and receive a WAP-ATC resource. That's it. No complicated signaling, only an HTTP request. It's a little bit like an activity path. You make a request and you get a resource back. This is written there. For the first one, you make a request to offer a resource. Hey, I have a resource here you can have. And for the second one, you make a request, you subscribe to the resource. This is only the different. This is the main idea. When you have this, here's a little bit more in detail, you can ignore this one, the eyes, only this two are important. You offer something with an HTTP, of course, and you get something back. And then you have all what you need for the resource. Finish. And then you have such kind of architecture. You can do something like this. A, you are sent off a resource like a client. You offer this to an endpoint. And the endpoint offers to the next endpoint. This is for WIP and for WAP, turn around as well. It's like you can make an, you can establish like a pipe. Yeah, sounds, it's really great. And then you can do this, you can clone streams. Because when you clone streams, only you send a request to an endpoint. Give me this and send this to another endpoint and clone this to another site. That's it. However, there's a problem. WIP and WAP is static. You cannot update the resource. When you one time have offer and the resource as a miss a request, you get an STP and you cannot update the STP anymore. It is static. Means you will receive a track, all the tracks that insert in the stream and nothing more, no way. Means you have a static resource. It's cool for a live streaming, but we want interactive live streaming where the resource are changed. This is quite important. So we want a little bit more dynamically inside WIP and WAP. This is not enough for us. And our trick sources is two things, two important things. A little bit smaller things, but the two main ideas behind us is like this. When you subscribe an egress endpoint and receiving a resource, you have to subscribe as well a channel. It is so opposite. You get a channel as well. Because you need a channel to get the information that the egress resource, the receiving a resource is updated. This is the first thing, what you need. Without is not possible. Normally you do it in a conference system. Perhaps you do it with a signaling server, your resources update, you get a new STP. But we only want rest. We have no WebSocket server. You need established an extra resource like a channel to receive this information. The second point is you have to annotate the tracks. You have to know what this track is. For example, this is the main track or it's a guest track. And here, Schick is using the STDP attribute media title. It's not used normally, some people are using it, but it's there for title of the track, for example. Here it's used for some meta information. For example, it's the track that you received as muted first, but the track is the main track or another track. And the rest is activity problem. You rely on the things. Yeah, Schick itself is an instance written in Go, based on PyM. It came with the JavaScript SDK. You get in front end, it's a web component, not an iframe. You get in web component. And this SDK is implemented in PeerTube plugin. Because Schick itself can nothing, only makes this exchanging. And it looks like this. You have a PeerTube instance on the left side, and you have a PeerTube instance on the right side. You are here starting your stream, and you want invite people on the form another instance. This PeerTube instance has a possibility to a Schick instance, and this is a complete other Schick instance. They are not related to each other. And this user is on his, and with this Schick instance, and with this protocol and background, he can exchange and communicate with each other, like a conference, but this is a stream. And then on this side, he is the owner of this, he is in streaming this one. It's then transcoded in RTMP, because from RTMP then in HLS. At the moment, I have not the direct HLS transcoding. But theoretically, you can, from verbiage, directly in HLS transcoding, but it's not implemented yet. Yeah, and let us look how it's looked. I think I have, yeah. For this one. Yeah. So, I have here the two PeerTube instances. I make it like this, and so like this. It depends on the time I already created a live stream, but you can do it directly now, because we have more time. Sorry. When you're looking, I'm not sure how familiar you with PeerTube. Here, inside of PeerTube, I have the chick plug-in. This is this one, and you can configure the chick plug-in, and you have here, this one is relating to the chick server. It's called stream.chick, means he knows this one. Yeah. Here's an ASESC, okay. Theoretically, you can use this. This is, okay. And the other one, let me see. Yeah, this is the other one. Yes, as well. That plug-in. But he is related to forstem.chick, is another chick instance. It's a complete different. They are in different servers. Yeah, they are complete separate from each other. See, this PeerTube instance, follow this PeerTube. You see? Means this one get all live videos from the other one, cloned. And, of course, this one has his own chick is following this instance. The communication between chick and the PeerTube is over activity pub. So when this chick, when the PeerTube instance get a new live, the chick get it as well as copy over activity pub. That's the idea behind it. The implementation is stored, steal from owncast is exactly the same, because owncast has a cool implementation for it. Yeah. That's a good idea, owncast and PeerTube together. I only want to mention. So, and what we can do now is, we can create a live stream right now. It's like this. I hope I have time. Yeah, I have time. Make it permanent, makes no difference. Yeah. One interesting point when you create a live stream, it should be short as possible. PeerTube can nine second delay something like this. Nine, fifteen seconds, something like this is the shortest what PeerTube arrives. I mean, when we're talking about interactive, it's definitely not take 30 seconds or 60 seconds. It's too much. Okay. So what we can do as well is, let us invite the other guy from the other instance. What you have to know is the activity pub ID from this guy. Yeah, this one. Now we create a live stream. I hope so. No, we don't create a live stream. I have to update the live stream. Sorry. My mistake. So now we have a live stream. Here's online. And in the back, I have to take this one because I'm not figure out how I can find this live stream than on the other side. Maybe someone will explain. Now activity pub has synced to both. So we have the live stream as well on the other instance. So when I have this one, I'm logged in as user one to three. I can assess now here. I'm now in. Now I'm in the web component. It's a web component rendered in peer tube from the plug-in. It's not an iframe. And I can do this here as well. So now two guys in two different stream, but they are not connected at the moment. First, they have to join. He's joining. And he's joining as guest. Takes a while. So let me see. So now we can do it. And of course we want the other guy is seeing something. So now the internet is a little bit slow. Sorry about this. Now they're both on different check instance, different SFOs. And the SFOs communicating with them and established with only rest endpoint. And the information like mute and unmute what you need. And exchanged like, sorry. Like the channel that for the web egress component is established. And even when I, let me come back. And even I can do this one. Sorry. No, I can't. Sorry, the connection is bad. So you see the other side. Now I have the track mixed. So I can even mix the live stream. And then all is working fine. Theoretical wise, and my internet goes not down. I can online goes as well. I can go live with this. Let me see that he can see this live as well. One moment. I think it's here. Yeah, it was here. Somewhere here. This one should be. Yeah, now we are live as well. Okay, sorry the internet is not so good. Yeah, that's it. And so we have established a clone stream between two instances in the first bus. That's it. Yeah. Yeah, question. I'm curious. I've worked a little bit with Activity Pub, but not Super Induct. I'm curious if there's like a, is there a live stream post type in Activity Pub, such that like other implementations like a master.on server or something could play this live stream, or does it look like just a link to a live stream? How does that go this way? The question is, is there an Activity Pub attribute or something like inside, right? I'm not sure. You have the content type of video inside, and you have as well the annotations that it's a live video or not. This came from PeerTube itself alone. So, inside of the JSON is only the host server inside. It means when you share this JSON to another PeerTube instance, you get a description like who is the owner, which actor is the owner of this live stream, and where is the home server, the home instance for this live stream. This is all what we have inside. And then, Schick annotates this with extra attributes like who is the guest, and this has the host server at Schick instance. Because you can only follow with Schick another instance when your own instance has as well a Schick instance. When you not have a Schick instance, the button to join, you have to go then to the other instance. This is the main. This is the mechanisms behind it. I think, what's the question? Yeah. Okay. Yeah. This only works when both instances implement in Schick instance. And this is supposed to work as well for own cast, because it makes no difference. Only the front is needed for own cast. And this is the main idea behind it, that you have a way to scale your streams in the background with extensions. Yeah, based on activity. Perhaps an interesting point. It's like a little bit controversial. You can use such kind of technology for, I will not say advertisement, but for recommendations. When you have a live streams, often you have the problem you want inform other people that you have as well live streams. Other people didn't know about you. And here you have something like a pool where you can add streams and then you can chat doing the live streams. Because in a back, a live streams and an active live streams, nothing else as that you have different kind of sources from different kind of furry growth instances. And such kind of things are then possible. Okay. Okay. Yeah. You mentioned that you're using data channels to change information about back of this. What exactly is set up the data channel? Renegeration, the STP. I have the egress endpoint. I mean, the receiving end point needs a data channel from the offer of the resource. The question was what came through the channel, the STP. The STP and the mute event as well. Yeah. This is coming soon. Yeah. What's the reason for the delay so much lately? Here in this one, I think also what's the reason for the delay in the latest thing? First, the network here, I guess. Second one, no, most likely the network. I have this one here. One moment. When you have this one, I hope I'll be online still. I'm not sure. This delay, what you have here, this is more bigger. This came from the transcoding form. VapRTC to RGMP. That is at the moment not optimized. This is the reason for this delay where you have such kind of, yeah. But the rest, I think it's the network. I guess. So it's not VapRTC to VapRTC. It's converted somewhere? It's like this. You have a VapRTC to VapRTC converted. Which one you mean between the server or between the? On the right-hand side, the video is quite delayed. Yeah. Where did the left? Yeah. This one. Yeah, there's a big delay at the moment. Yeah. Yeah. Now, the thing is, in this case, you have three VapRTC connections now. One is from the client. Maybe I can show you this here in the slides. Sorry. You have three connections. One to your chic instance. One from the chic instance to this one and one to this one. It's like a pipe. And I guess this was this quite fast because they are in the same location. But I guess this one makes a trouble at the moment. I guess. Yeah. Some other question? Yeah. I missed part of the presentation, sorry about that. As far as I understood, you are using Weep and Web as a way to get those two to communicate with each other. So, as I was saying before, in the last year, the view of what specification basically forces you to create an offer for that as well. So it makes changing Weep and Web impossible within the specification. Are you using the old mode where you were expecting an offer to do something? How are you dealing with this synchronization where you have to wait for an offer and stuff like this? Yeah. I try to repeat the question. Weep and Web, I think, have two options. First, you send an offer and get an answer back. And second, the second option is you say, hey, I want an offer from you. Then you get an offer and you send the answer back. What is the difference between this one? For the first, you need only one request. It's like, give me one post request. You send an offer and get an answer back inside the post request. For the second option, you send first a post request, get an offer, and send again a post, a patch. I think it's a patch afterwards. Yeah, something like this. I implemented the second one because I implemented it in June and I think now is a new version out where they are supposed only one request. Yeah. For Web, for Weep in one, for Weep, I only need one request. Yeah, that's right. But because we are not here, I not use Weep and Web how it's supposed to be because I need to dynamically, so I established Web at the C Channel as well. So that is additional. Okay. Yeah. Yeah, if no questions anymore, then thank you for watching. Thank you. Quite interesting. Yeah, because you're talking about this problem already. I wrote a long post because I liked the old mode. I liked the way that we are doing things. Federation is possible thanks to the mode. Just leave a couple of minutes to sit down. Yeah. Yeah. Yeah.
Getting AV1/SVC to work in the Janus WebRTC Server
Well, welcome everybody. Lorenzo here needs no introduction. He brought the crazy contraption to give his presentation with. It's almost a dangerous demo in and of itself. Yeah, yeah, easy. And he'll be telling us all about AV1 as we see. Let's go for it. Yeah, you can hear me, right? Yes, sir. So thanks so for the introduction. Yeah, so I'll be talking about specifically AV1 as we see. I'll go, it's in some technical details. So it may be boring here and there, but I really think it's important in order to get a better understanding of how it all works. And this is just a quick introduction over me. So I'm one of the co-founders of a small company based in the south of Italy called Miteco. I'm the main author of Janus, which is an open source for Bouticy server. And there are some links if you want to get in touch with me or learn more. And basically what we'll be talking about today is AV1. And if you're not familiar with what AV1 is, it's a new, relatively new video codec that was designed within the context of the Alliance for Open Media. That has a lot of companies behind it. There's Apple, Cisco, Google, really a ton of them. And what they really wanted to do was to create an open and royalty free video codec. And of course emphasis on open and royalty free because we don't want another H264 or H265, which was specifically designed for real time applications pretty much like Opus was also designed as a codec for the internet. So that was quite important innovation with support for higher resolution, so for KM Beyond. And most importantly, it was also conceived to have support for SVC baked in the codec specification itself. And that's quite important because some other codec support SVC as well, but many times they come as, let's say, later additions. So basically codecs are extended to have SVC supported. In this case, AV1 was conceived with native support for SVC. So all AV1 implementations are supposed to at least be able to decode an SVC stream, for instance, which is important when you start working in hardware decoders and stuff like this. And of course this got me and should get you all very interested because these are all very interesting features to have for different reasons in WebRTC. And SVC is important for a few different reasons. So we all know what CIML Cast is. You use a single M line to basically carry multiple quality streams, like you have a high, medium and low quality stream, both sent at the same time, so that different qualities can be distributed to different participants as needed. But with CIML Cast, each stream is encoded as a separate stream, which means that each stream is also decoded independently of others. But this does mean that you have to encode the same stream more than once. And the fact that they are decoded independently can also cause some challenges sometimes. With SVC instead, you still use the same media source, the same M line and so on, but the different qualities, so high, medium, low, whatever it is, are all layers of the same thing. So you have a single video stream that has like an onion, different layers, that basically make each layer provides more detail if you want to look at it that way. And so the key difference between CIML Cast and SVC is that with CIML Cast, since you have different streams, you also have different SSRCs. Each quality is a separate RTP stream. With SVC, all layers are the same SSRCs. So as far as the recipient is concerned, it's just a single stream, which means that it does require less bandwidth because you can pack some things up and it's more a layer kind of approach. It is sometimes more CPU intensive in terms of encoding because that's a bit more tricky, but it does have some advantages over CIML Cast as a consequence of that. And an interesting aspect is that CIML Cast, as we know it in WebRTC today, actually did already make use of SVC somehow, because when we say, for instance, BPA to CIML Cast, and then we mention temporal layers, temporal layers are not a feature of CIML Cast. Temporal layers are a feature of SVC. So we are basically using a feature of VPA that allows us to use a partial SVC functionality where we can have different frame rates within over the same RTP stream that we are handling. And this is just summarizing it from a visual perspective. So you have CIML Cast sending three different streams and then we can choose which, an SFU in the middle can choose which stream to send to other participants. With SVC, we have one big thing that has many layers. One participant may want to receive them all, another participant may only want to receive the medium layer, and then another participant may want to receive the lowest layer as possible. This is just to give you an idea from a visual perspective instead. And so I was very interested in implementing it in Janus, and here are a few links if you want to learn more about Janus itself. And so I started to figure out what we needed to do in terms of what do I need to do in order to get that working. And so first of all, of course, we need a way to negotiate AV1 and the SDP, and that's of course a given. It may be helpful also to be able to detect keyframes in the stream, and that may be helpful for different reasons. For instance, when you are doing Siemulcast as a server, it helps when you know whether a packet is a keyframe or not, especially if you want to switch on a keyframe or stuff like this. It's also important to be able to somehow interpret how the AV frames are spread across RTP packets, and for us it's especially important for our recordings, because when we record stuff in Janus, we just record all the RTP packets that we received, so that we can go through them later on. And so basically getting a recording in a playable format just means reorder all these RTP packets I received, get the AV1 frames out of those RTP packets, and then put them into an mp4 file to make an example. And this means that we need to know how AV1 fits within RTP, and we show how that works later. For SVC specifically, there is another important thing that is called the dependency descriptor that I'll talk about in a minute. And so that means that we also need to somehow support that in the server as well, which first of all means negotiating it, or extensions must be negotiated in order to be used. We need to know how to parse an extension of that sort, and then we need to figure out how to use the information that we receive in that extension. And as we'll see, 0.5 is the one that got me the most in trouble, and then I'll explain later why. But starting from negotiation is very easy, so you just negotiate the codec name and the relatively clock rate there, so that's easy. Detecting keyframes and support m basically being able to extract frames from packets is a bit more complicated, but that's because we need to start delving a bit deeper, and so figure out how AV1 is packetized over RTP. And that's actually something that's true for all codecs. So for all codecs, you need packetization rules, and that's especially true for video, because for video, typically you have larger frames, and RTP packets cannot be that large. They are usually limited by the MTU size and so on. And so you need to have some rules that tell you if you have a frame that is this large, this is how you split it across multiple RTP packets for this codec, this codec, and this other codec. And usually there are some similarities, but usually each codec has its own rules, mostly because of the nature of the bit stream, let's see. And this is an activity that typically the ITF carries on in the AVT core working group, because basically all packetization rules as RTP and WebRTC are all standards. Unfortunately for AV1, it did not happen in the ITF, so they came up with their own specification, which is provided here. So in this specification, they provide information both on the AV1 aggregation header, that is those packetization rules that I mentioned. So how do I split an AV frame over multiple RTP packets, and how do I get that same frame back when I have access to the RTP packets on the other side? And it also talks in great detail about this dependency, the scripture, which is a beast of its own, as you can see. And this is basically how it looks like from a visual perspective. So with RTP, you typically have an RTP header with all the usual stuff that you all know. You can have some RTP extensions in there, and this is where the new RTP extension would appear. And then you have the RTP payload. And the RTP payload is where this aggregation header plays a role, because as we mentioned, we cannot just dump an AV frame in there because it may not fit. And so we need to have some sort of information that tells us how an AV frame is actually split, or if there are more than one AV frame in the same packet, we need to know that as well. And the AV aggregation header, the AV1 aggregation header is fairly simple, because it's just a single byte with a few bits that you can set. Like, I will not go too much into the detail, not to bore you, but just information about whether these OBO, and the OBO is basically the equivalent of an AL for AV1. So if you know what an AL is for RAS264, an OBO is the same thing for AV1, more or less. So it's basically a unit of a frame. And then basically these attributes tells you whether or not an RTP packet that you just receive is a continuation from a previous frame, so that you know that whatever you're receiving now, you have to append to whatever buffer you had before, whether or not this frame is complete or not, whether you have to actually wait for something else before passing it to the decoder. You may have some information about how many OBOs are in place, which is actually optional, and we'll see why in a second. And then this bit tells you whether this is the packet that you receive is the beginning of an AV frame, which is, again, all of these pieces are very important when you have to reconstruct the AV frame when you receive it, so that AV1 frame when you receive it, so that you know that this is the first thing that you have to put in there, then you pass this year, this year, this year, eventually you again end up with the complete AV frame. And basically it looks a bit like this, so in this case, for instance, we are actually aggregating multiple OBOs in the same RTP packets, and in this case we are not specifying that there are that many elements, which means that for each OBO in there, after the aggregation header, we have a variable size element that tells us how long each OBO is, so in this case we're just going sequentially, aggregation header, we know there are some packets, we check the size, then we read exactly this amount of bytes, and this is the first element, second element we read the size of that, and we go on and go on and go on. And the W attribute over here allows us to save a tiny bit of space when you use it, because if you say that, for instance, there are just two OBOs in this element, then this means that you do need to provide the size of all the elements except the last, because then you can read them sequentially by checking the variable size length until you get to a certain point. When you get to the last element, you know that all the bytes that are left are actually associated to that frame, so you don't need that additional variable element in there, so you save a bit of data, maybe not that much, but in some cases it may be helpful. And to use the aggregation header, I mean I mentioned that it can be helpful in a few different cases. In my specific use case, I basically interpreted that, for instance, not a continuation and a first packet, I can more or less treat as a key frame. It's, of course, not really always like that, but it at least gives me the beginning of something, which is something that is very quick and simple to use when you're actually just routing stuff. You just read a single byte and make some decisions based on that. For instance, when you need to do some symbol-cast-related switches, for instance. For recordings, I needed to do something more complex, because as I mentioned, we need to traversal the RTP packets, reconstruct an obu frame, and an ap1 frame before we can put it into an mp4 packet, which means that I had to actually implement all that de-packetization rules accordingly. And also I had to implement the parsing of a specific obu in order to get some additional information, like the video resolution, because if I'm creating an mp4 frame, I don't need to decode the frames, but at least I do need to know how large it is so that I can put it into the mp4 header, for instance, or maybe use the RTP headers to figure out roughly the frame rate, these sort of things. And all that I've mentioned so far is really all that you need if you want to use everyone normally, just as a regular codec, so we forecast all streams are independent of each other. So if I want to go from high to low, I can just move to the SSRC with the low quality stream, and I don't need to do anything else. The low quality stream is encoded separately from that other one. I don't need to know anything about that other stream, they're completely independent. With SSRC, that's not always true, because you may have some dependencies in place. So if I want to go from, for instance, the highest quality layer, since we are talking about an onion, will very much likely depend on one or more packets from the medium layer and the low layer, which means that I may have to forward those two, otherwise the high quality layer will not work, because that alone is not enough to decode something. And these are all things that you need to figure out at runtime, because you have a stream that is coming in and you have to make a decision right away, otherwise you cause delays and stuff like this. And most importantly, most of the times you may not even be able to parse the payload, because, for instance, if insertable streams are used and the stream is end-to-end encrypted, you cannot have a look at the payload to see what is what. And this is what the dependency descriptor is for. The idea that you have an external component, so an RTP extension, that contains all the information related to the packet that you just received. And this one would not be encrypted as the payload itself, and so it's something that an intermediary like an SFU can use to do something. And this is just one example that comes from the RTP specification over there. There are really a ton of examples. In this case, this is an example of how L2 T3 dependencies work. L2 T3 means two different spatial layers that depend on each other and three temporal layers. So two video resolutions and maybe 30, 20, 10 frames per second. And this gives you an idea of how the dependencies work as a frame goes by. So this is the first frame, second, third, fourth, and so on and so forth. And so you'll see that in this specific kind of approach, the first packet you'll receive will be related to spatial layer zero, temporal layer zero. And pretty much everything depends on this packet over here. And then if I want spatial layer one and temporal layer zero, I definitely need to relay this packet to otherwise this one will not be able to be decoded. If I'm interested and basically you follow the arrows and you have an idea of the kind of dependencies that you can do so that you can choose which packets you can actually drop or not. And as you can guess, the problem is, as an SFU, how do I know these? So how do I know that this is what is happening and these are the dependencies that are in place? And this is basically what the dependency the scripture provides and I'll explain how in a second. And so continuing from the requirements that I described before, it means that if I wanted to have SAP or for this component in Janus, but this is true for each web artist is around there, again, I need a way to negotiate the extension. I need to somehow parse it so I don't, I need to know how it is encoded so that I can figure out what is in there. And then I need to find a way to use it. So for instance, to recover those dependencies there. And I thought that negotiation was supposed to be the easy part, but it's actually not that easy because of course you just need to negotiate that extension with that name as an additional X map. That's how it works for all extensions in the SDP. But it turned out that I also needed to support the so-called two byte header extensions using X map allow mixed. And this is because RTP extensions by default are supposed to be quite small. And so you usually have the so-called one byte header RTP extension where in one byte you provide some information, which means though that the length of the extension is limited as well. So since you are using one byte to convey a lot of information, the size of the extension itself cannot be more than, if I'm correct, more than 16 bytes or something like this. I don't remember now exactly. And the dependency, the script though can be much larger than that. And so you do need to support two bytes extensions with at the time I didn't. So I needed to implement that first in order to get it to work because when I started testing it, nothing worked and it turned out that this was the issue. And then we need, once we have negotiated it and we start receiving the dependency, the script, as part of our TP packets, we need to figure out a way to parse it. And this was really a nightmare for me. This is like therapy for me right now because I'm sharing all this with you. And I actually run to the about this in a couple of blog posts where you can see the nitty-gritty details. But just to give you an idea, basically it's, let's say a mess. I will not say that word because I don't want to be bit. But basically you can see that this is a specification that was written by somebody who writes codex, not a network specification because all fields are variable length and often at the bit level, which makes it really a nightmare to parse sometimes. And from what we regard the specification itself, it's indeed quite flexible because there are a few mandatory fields like if this is a start of a frame and end of a frame, the frame number, and the template ID for those dependencies that we've seen before. But also everything else is optional, which means that you can either have a dependency in the scriptural element that describes everything, so the whole context of the SVC or just something that tells you the scope of the current frame. And when we look at how a dependency in the scriptural really looks like, this is a simple parser that I created to basically debug things offline. And when we receive a keyframe, typically we have a 95 bytes extension, which if you know RTP, that's a lot. That's basically almost 10% of the payloads that you have. So it's really big, but that's because it contains a lot of information. So if you start parsing it and serializing everything that you receive, you have information about the different layers that you have, spatial temporal and so on and so forth. TDI, I don't remember exactly what it was, but this is just the output of that tool. That's a lot of stuff. So blah, blah, blah, blah, some more chains, some more stuff, the code layer targets. I have some stuff about resolutions. And finally, we're done. Basically, all the parts that we've seen before were basically the media center telling us, these are all the information that I used for this specific SVC context. So in this case, this was an L3T3, so three temporal layers and three spatial layers. And all those, that huge stuff that you've seen before is all the information related to chain dependencies, all that kind of very low level stuff. And so if you want to use it, it's there. And then at the end, it also tells you the resolution streams of the three different spatial layers. In this case, it was low because I captured really at the beginning, I think. And finally, it tells you that for this specific RTP packet, this is a spatial layer zero, temporal layer zero, and it uses template index number one, which is indeed spatial zero, temporal layer zero. And this is the information that we need because then having a look at all the stuff that we've seen before, we know that the resolution for spatial layer zero is, in this case, this multi-mover here. In practice, it would be something like 320 by something else. And this is it. And of course, likely not all dependency descriptors are so long, only for the meaningful key frame packets, it's usually like that. And then other dependency descriptors will be much smaller, like only seven bytes, because they will only tell you, for instance, the temporal index of this specific packet. In this case, it is a spatial layer zero at temporal layer zero. But I only know this because I received this before. So I received somewhere in time this huge chunk of information before, because if I only receive this and I get temporal index six, what is six? Six relative to what? So what does it mean? I don't even know how many layers there are. So you do need to have that information first if you want to make sense of all these smaller packets that you receive later after that, which means that when you start to implement stuff in a server, it does mean that you start need to keep a state, which is not really true for single cast or other things. I don't mean it's partly true, but only in a very limited way. In this case, it does mean that anytime that you receive that huge packet and you parse it, you need to keep it somewhere so that when you receive packets after that, you can reference them and use them for something. And the idea was that once I have a knowledge of those templates and I receive information and I know that this packet that I just received, this spatial layer X and temporal layer Y, then as a server, I can decide whether or not I want to relay it or drop it. And you can do it the relatively easy way or you can do it the hard way. The hard way is figuring out all of those dependencies that we've seen before. I went for the easier way, especially right now. If it is temporal layer 2, then relay everything related to spatial layer 1 and 0 as well, as long as it's the same or let's say the temporal layer is smaller or equal to the one that I'm receiving. So I may be relaying more than I should, but at least I know that everything is there. What's important is that once you use that information so that once you've parsed it, you cannot drop it. You need to relay it anyway because it's not only helpful to you, it's also helpful to the subscriber that is receiving that video stream because they also need to know what is what. So you need to forward that information as well. And very important, you also need to update the RTP headers accordingly, including the marker bit, which is what really drove me nuts the first time because I actually implemented all this for a long time and it didn't work. And eventually I figured out that the problem was that I was not updating marker bits as well. And this is the reason, basically. So if we have a sequence of RTP packets related to different spatial layers and temporal layers, this is basically what it looks like from an RTP perspective, including marker bits. If I am dropping spatial layer 2 because I don't need it, then what it means is that I'm dropping some packets over here. So of course, all the packets that I'm dropping, I need to update the sequence number so that it keeps on growing monotonically because otherwise the recipient will think that they are missing, losing some packets, but they are not missing them. They are just dropping them because they don't need them. So I need to update the sequence number so that this is one, this is two, this is three, this is four, five, six, seven, etc. So I need to make sure that they know that they are not really missing anything. But I also need to update where I'm setting the M equals one marker bit as well because this is needed for b-decoating, especially from Chrome. So in particular, you need to set M equals one on the last packet with the same timestamp. So since the timestamp now is changing on the second packet, because that's the last packet with that timestamp over there, I need to set M equals one on that second packet before I forward it or otherwise nothing works basically. Sorry, wrong direction. And basically, if you want to test all these and with Janus or with anything else, of course you need to have a browser that supports all this stuff. And the kind of bad news is that at the moment I think only Chrome supports it. I don't know if other Chrome-based browsers support it too, but definitely Chrome supports AV1 as a codec. And you can check that by using the RTP sender get capabilities thing to see. If you see AV1 in that list, you do support AV1 as a codec. But you also need to support SBC functionality and most importantly the dependency, the scripture. And the dependency, the scripture is not offered by default. So you do still need, I think, to first fill the trial like this. I don't remember right now if you can just manage the SDP to artificially put the extension in your SDP in order to make it work anyway, but that I should check, I should double check. But you may need to launch, for instance, Chrome with that thing over here so that the extension appears in the supported extensions by the browser. When you do that, then your browser is capable of encoding AV1 SBC functionality with dependency and scripture, which is quite important. And if you want to test this, I also made it very simple because if you go on the online demos for Janus and you check the eco test demo you can provide a couple of attributes to, first of all, for AV1 as a codec and then for a specific flavor of SBC, in this case, for instance, L3T3 to send three temporal layers and three spatial layers. And when you do some small buttons appear over there and they allow you to check one thing or the other, which means that you will send the big AV1 SBC stream to Janus and Janus will send you back only what you asked for. So in this case, for instance, spatial layer one and temporal layer two which is why my resolution is smaller and the bitrate is smaller as well. So by playing a bit with those things you should see resolution changing, bitrate changing, if it does, it works. And the same functionality is also supported in the video room, of course, which is the SFU to do video conferencing. So at least in theory you can have a complete video conference that is based on AV1 SBC as well, even though we haven't tested that much but it should definitely work. And I think this is it. I'm not sure if we have time for questions, but before that, I also wanted to announce that, I'm sorry, I'm bothering you all, but JanusCon is back. So JanusCon is our own Janus conference. So it's a conference devoted to Janus and WebRTC in general, which will happen at the end of April in Naples in the south of Italy. We have a few sponsors already which I'm very grateful for. And the call for paper ends in about a week. So if you have anything interesting doing with Janus and WebRTC, you can feel free to submit a talk there. Well, tickets are also available for sale as well. And of course, if your company is interested in sponsoring, that would be great too. And that is all. I don't know if we have time for questions because I didn't really check how fast I was going, maybe too fast or... Okay, so are there any questions for anyone at the C part? I see a couple. I think slow me with... Generally, would you say that the SBC is like the generation of simulcast or if we continue, whether we look at the future of people on the platform that will replace it or they will need to get the sale by sale? I mean, in general, if you look at, for instance, if you look at that... Oh, sorry, sorry. Slow me was asking, is basically SBC or evolution of simulcast or does it make sense to have them both at the same time? Which one will take... Which one will be more important in the future? Which one is the technology to invest in in the future, maybe, as well? And functionally, I mean, they serve the same purpose, if you want, because I have the same demo for simulcasts and if you look at the demo for simulcasts, it looks visually the same. So you have the same buttons to say, I want high quality, low quality and so on. The difference are really in just how the thing is implemented. And the main problem, I mean, in general, SBC is supposed to be more advanced, of course, than simulcast and more resilient as well, probably. But the main obstacle right now is that it's related to what I was saying before. So right now, if you want to use AV1 SBC, you have to do a custom flag, which means that right at the outset, it's really not something that you can ask your customer to do, for instance. So for the moment, it's not really something that is production ready. You can use the SBC flavor of VP9, which provides a similar feature, which is now available out there. But still, simulcast is overwhelmingly favored in general for production environments because it's been battle tested, it's been there since day one. Everybody supports simulcast, it's easier to work with and so on and so forth. So for the moment, it doesn't make sense to just use force SBC in your production environment right away, if not for experimental purposes and for testing how it works, for dipping your toes in the technology. But for the future, I definitely think you should pay attention to that because AV1 will be the code that everybody will adopt, hopefully because it's better quality, it's royalty free, it's open source, and it has SBC baked in. Sooner or later, hopefully Safari will have AV1 as we see, Firefox will have it, Edge and other browsers will have it as well. And you definitely want to be ready when that happens because otherwise you'll be the one stuck with the old codec and everybody else is taking advantage of the new team. I think learns that you can munch the SDP to make it work. For the extension, yeah. Because we have it working new team. Tuzlomi, there is one thing that in some environments might be relevant which is as many hardware decoders don't cope with SBC, but they do with Samocast because they look like a normal strain. So if you're in a resource constrained thing, maybe receiving SBC is no bueno, but receiving a normal Samocast will be better. But in theory, these will not be true for AV1 because AV1 was conceived with SBC in mind. So in theory, all hardware decoders, too, even smaller ones, will know how to interpret that. And since it's a single stream, they will be able to decode it. Of course, it's just theory and... Ideally they would. For VP9, for example, Chrome still does not use hardware decoders when you use SBC. And I'm not sure because AV1 hardware support is hit and miss yet still. And there was another question here, yeah? Yeah, I was wondering what the forward error correction strategy here is, like, is this patient, if there are... I'm sorry, if forward error correction is used, how do you use it with do is I mean... Yeah, if all the use forward error correction is SBC, then you are like, helping out some tactics and then it doesn't work. Yeah, that's a good question. And it's actually related to one of the doubts that I have related to FBC, mostly because I mean something like AV1, SBC and CMUCAS as well only makes sense when you have a server in the middle. It doesn't really make sense if you are sending something from point A to point B and point B is the one that is meant to receive it because in this case you are sending everything anyway. So unless you are using SBC as some sort of a... of your redundancy mechanism because you say, if I lose some packets related to two, I can still display one. That's one thing, but that's not really what it's meant for. And so the moment you have a server in the middle, it also means that you can offload the forward error correction stuff to the server as well. So which does make sense also because, for instance, when you use FlexFec, which is the thing that was described in the first presentation from Chrome, Chrome by default will not put any redundancy information, so it will not put any FEC packets until the peer tells them that they are losing some packets. And this is to optimize stuff, so you don't add redundancy unless it's needed because there's loss reported, which becomes a problem if you're doing something like a video conference because your uplink find may be perfect, and then you have subscriber X over here that is experiencing loss and you don't have any redundancy packets to send them instead. So the idea and probably the solution to that, this is something that I'm still brainstorming myself because FEC interests me, but I have some doubts there, is that probably the forward error correction stuff is something that the server itself will need to add on each subscriber leg. So from the server to you, I will have a dedicated FEC channel where I add some forward error correction stuff from the stream that I'm sending you, and for the stream that I'm sending you, the layer 2 may not be there, but I have a consistent stream because packets are in sequence, and so the forward error correction stuff that I'll be sending you will be different from the one that I'll be sending to somebody else who is receiving additional layers, and that's probably the only way to do this if you don't want to forward FEC and to end without treating it, which anyway wouldn't be useful at all, especially if the sender is not providing that information themselves. Yeah, in my experience, and this may be an implementation choice, of course, I did have to forward it because otherwise it would not be decoded properly, basically. And I don't know if this is actually really needed, like for instance, even the marker bit 1, that's not really needed from a specification perspective because as a receiver, you do see that the timestamp is changing, so you do know that it is a new frame and you can decode the previous one. But it's simply that Chrome expects that marker bit set to 1, otherwise it will not decode a frame, basically. So in my experience, you need to forward that information too. And I guess it makes sense because the recipients themselves also need to decode possibly differently the video stream depending on what they are receiving because they need to know if the resolution must be this size or this size or this size or something like this. It may all be part of the 81 bit stream, so it may be redundant information as far as they are concerned, but at least when I made these tests a few months ago, it was needed, so just relaying it makes sense. Yeah. In regard to switching this layer, like saw your previous talk somewhere was on bandwidth estimation, maybe you can comment on how they do go together or is there something specific to 81? Yeah, no, I mean the bandwidth estimation stuff is important for a few different reasons. And in this case, I'm talking about the bandwidth estimation on the subscriber side. So from server to recipients, because on the publisher side, there is transport-wide control CC and basically the browser themselves are capable of using the feedback to figure out if they need to send less or more. And so dynamically, you may see that some special layers are not appearing because the browser doesn't have enough bandwidth for that. On the subscriber perspective, it's really useful because it allows us to it helps with the decision. So for instance, right now I just mentioned just generically whether I want to relay or drop a packet, but this actually depends on why I should relay it because a user may want to receive the highest quality possible, a user may want to receive the lowest quality possible, but this may be because they only want a lower quality because the video is going to appear in a thumbnail and so they don't need the whole thing and that's an application logic decision. And now the decision may come from the user doesn't have enough bandwidth for all of that stuff, so they don't have enough bandwidth for special layer 2 and 1. Let's just send them special layer 0. And this is where bandwidth estimation helps because if I'm sending stuff to the subscriber and I'm starting to get information that congestion is happening, then internally the server can update which special layer or temporal layer I should send to this specific publisher dynamically. And so this will impact my decisions to relay or drop stuff and so it allows me to dynamically dynamically impact the quality of the subscriber depending on how much bandwidth they have. And in my experiments right now I've only done this with Siebel because I haven't hooked it up to SBC yet, but the key principles are really the same. One minute? Yeah, just related to that is there a way or Wipen Web to signal the final cast of the publisher and the subscribed site? Yeah, I mean for the final cast or SBC. Of course, yeah, in Wipen Web do you with Wipen Web is there any need to signal Siebel cast or SBC as well and does it make sense? And in general, I mean it's definitely important that you signal it on Wip because you want to make sure that the stream that you are ingesting is recognized by the server as a Siebel cast or an SBC stream so that the server can also parcel of those dependency descriptors in case it's a one SBC for instance or in case it's Siebel cast it knows that it needs to take care of, let's say, three different qualities. On the subscriber's side for Siebel cast it's really not important because you're just always, as a subscriber, you're just always going to receive one video stream and as far as you're concerned it's a consistent video stream. You don't even know that there is a switch behind the curtains that is happening from high to low to medium or whatever. You just see a single video stream so you don't need to be aware of the fact that it's Siebel cast. For everyone as a SBC it may be important to negotiate the dependency, the scripture extension as I mentioned because if it's needed for decoding purposes and you want the browser to be able to decode things properly then you may want to negotiate that extension as well on the subscriber's side. But as I was saying before it may or may not be needed so that's something that we'll have to check. And I think I'm really out of time now so. Thank you. Thank you.
Using GStreamer to build real-time applications with Golang
All right, well, welcome back everybody. Up next, the one and only Dan Jenkins is going to tell us all about G-Streamer and Golang. Take it away, please. Thank you. Hello, everyone. Can everyone hear me okay? Yeah? Good. Great. Cool. Okay. I forgot my clicker. Number one rookie thing to do. No, no, I've got my phone. So I'm good. But yeah, that's why I've got my phone. And it's going to look a little bit weird. I also forgot to buy I brought two European plugs with me. But one wasn't European. One was American. So my day did not start off well. So yes, G-Streamer and Golang. So a little bit about me. Very, oh, that's just going to get really annoying. I'm just going to click. Cool. Okay, so a little bit about me. So yes, I'm Dan Jenkins. I run a couple of companies. One called Everycast Labs, one called Nimbleape, and another one called Comcom. So Everycast Labs does broadcast stuff, bringing in remote talent into broadcast workflows. Nimbleape is a consultancy service, consultancy company based in the UK. And then Comcom is an event that we put on for open source people, our way of kind of giving back to the ecosystem that we build from. I was the very first Google developer expert in the world when it comes to WebRTC. I'm not saying I'm the best at WebRTC, but I'm the first that actually got accredited by Google's developer program. I love Lego, and I love real-time media. So yeah, Nimbleape, we're a consultancy, and if you've got hard problems that you want solved, come talk to us. And Everycast Labs, we've got that product that I was just talking about called Broadcast Bridge. And then Comcom. So Comcom is dear to my heart. Historically, it's been a residential event where we bring everyone, everyone stays in the same place. And then we've got three days of awesome real-time and open media content. And then we're back in 2024. Dates are still up in the air because of contracts, but it's not going to be residential this year. We're going to go on tour, so we're not just going to be in the UK. And that's quite exciting. So to the actual topic, GStreamer building real-time applications with Golang. So what are we actually going to talk about? We're going to talk about GStreamer, obviously. We're going to talk about Golang, obviously. But I want to introduce you to something called GoGST. GoGST has been around for a long time now, but kind of got itself into a bad state where it was not un-maintained, but there were lots of little forks and lots of little patches everywhere. And so we've kind of changed how that project's being managed now. And then also I want to introduce you to something called Pion. So let's take a look at GStreamer first. Who in the room has heard about GStreamer? Good. That's the answer I was looking for. So open source multimedia framework basically does everything that you chuck at it in some form. And I absolutely love GStreamer. So a lot of you might know GStreamer as something like this. I'm not going to ask you to tell me what that is, because I know that it's kind of taking in an RTSP source and then doing something with it and then outputting something at the end, via UDP, but all the stuff in the middle now. But GStreamer is actually super powerful and ultimately lets you do ingress, do something with it, and then egress. And it kind of boils down to something that's simple, right? GStreamer can do it all and can do a lot of things. So for us at everycast labs with our broadcast bridge product, we care about certain things. So GStreamer can do NDI, GStreamer can do WebRTC, GStreamer can do SRT, can do RTP, it can do HLS, it can do RTMP and RTSP, right? I'm not telling you anything that you don't know at this point. But for us, at least with broadcast bridge, GStreamer has a superpower and that superpower is app source and app sync. How many people in the room know about app source and app sync? Okay, good. That means like 60% of you are going to learn something now. The rest of you just sit and be happy. So yeah, this is what we use at broadcast bridge, in our broadcast bridge product. And that's because we don't write C. And so ultimately kind of adding code to plugins within GStreamer is really difficult for us. I know that's changing as time is going on. There are more and more Rust plugins, but at its core there are a load of stuff that we don't feel able to kind of contribute to if we find a problem. And so a lot of the time we don't like writing C like this, but we do like writing a lot of Go. And so we end up writing something like this. And this is Go GST. It was originally created by a guy with the GitHub handle, Tinyzimmer. I love the name. But now it's in its own GitHub organization. So it's under github.com. So it's under the new GitHub org and there's three main contributors. I think there's something like 17 total, but there's three main ones. Tinyzimmer me and R.S. Willy. So Lesfawkes is better for everyone. So this other one, Big, Little Ben. That's from the LiveKit team. And the LiveKit team had their own fork of Go GST. And they had put a load of work into fixing bugs, but they were never getting merged back into the project as it was under the Tinyzimmer GitHub. So now it's forked out. Well, it's not actually forked. We forked it into its own organization and then did the GitHub magic where we unforked it and then the Tinyzimmer one is now a fork of us. So there's a lot of GitHub kind of organization going on to make it easier for everyone. Did you know that GitHub forks don't turn up in Google SEO results and they don't turn up in GitHub search results either? And search doesn't work in the repo. So yeah, and search doesn't work in the repo. So basically forks are dumb. I mean, they're not dumb. But forks are bad. We should not be relying on forks for a long-term thing whatsoever. So yeah, this is actually really great for everyone now. So less forks is better for everyone. And like I said earlier, BroadcastBridge uses a mixture of SRT, NDI, WebRTC, among a load of other things as well. And so why, you're probably asking, why would we even need to use AppSync and AppSync when the modules, the plugins are already in Gstreamer? Like Gstreamer already knows how to take in an SRT feed. It already knows how to output an NDI feed and it knows how to do WebRTC stuff. So why are we building on top of AppSync and AppSync? And it comes down to greater control like I was kind of alluding to earlier. We use Pion to do WebRTC. And that's not because the Gstreamer implementation isn't good. It's just that if we want to be able to do anything that isn't implemented into the Gstreamer implementation, then we'd need to get someone to actually go and change that code. And that's something that we're able to do. My team aren't capable of doing, but we don't go really, really, really well. And so we can definitely kind of go and take that greater control. Like I said, this means we're handling WebRTC in something that we really know. Like, ultimately, very few people in this room know about transcoding something from one codec to another. And we just rely on FFMpeg or Gstreamer or whatever to do it for us. It's the same with WebRTC for us. We really know what we're doing with WebRTC and we want to be able to kind of tweak things that we can't necessarily tweak with the Gstreamer implementation. But Pion is hugely, hugely powerful. And this is the other key thing and it's easily upgradeable. So when we actually find a bug in Pion, we can a Gstreamer pipeline and never leaving the C level. But cost isn't just measured in terms of compute. Cost is everything from building the feature all the way through to deploying the feature and running the feature. And you've got to look at the whole picture. Pion gives us huge, huge flexibility and we can move fast and we can add new features and ultimately that means that we win business. So let's take a quick look at AppSource. How many people are actually familiar with AppSource? Right. So AppSource is just another plugin, module, whatever they're called. And ultimately, you can put it inside of your pipeline and you can push data into Gstreamer using AppSource. You set a load of capabilities on that element, that AppSource element, telling it, oh, well, this media that I'm just about to push into you is this format and this frame rate and whatever else. And you can push in data or you can make, so you have to push data in obviously, but you can also make Gstreamer ask you for the data. So instead of just going, oh, I've got data, data, data, data, data. And then Gstreamer goes, oh, hold on. I can't do anything with this. Why are you sending me so much data? You can, Gstreamer can actually ask for it. Now, that's not hugely helpful when it comes to real-time applications because real-time applications, in the case of Pion, sending us web, getting RTP data from Pion, for example, that's real-time. And so we want to get that data from Pion and we want to pass it into Gstreamer straight away. Because we're getting it in this constant flow from Pion. Whereas if you were reading a, if you were reading a file and then you were passing those chunks into Gstreamer, well, you've got control over how fast you push those chunks in. And so why not let Gstreamer go, ah, I want a bit more data. I want a bit more data. I want a bit more data. Right. App sync is absolutely no different. It's a, it's a plug-in, it's a module. And, and when you put it into the pipeline, it becomes an element. And ultimately you get push data out of AppSync. And so imagine you've got AppSource and then you've got something in the middle, whether or not that's transforming it or transcoding it. And then you've got AppSync and you're connecting all these bits together. And so you're pushing data in. Gstreamer is then doing something with it. And then it's pushing it, it's passing it over to AppSync. And then AppSync sends it out to your application as data. Not as UDP, not, not via report or anything. It's giving you the, the, the raw buffer of data. So you get pushed your data from AppSync via the, the, the, the new sample signal and event. I've got some data here you go. Notice how this is all go lang. So, yeah, let's take a very quick look. So we've got our sync. So that's an AppSource, AppSync element that I've made. And I'm setting some callbacks on it. And then we've got new sample funk. And then that gives me, that gives me my, my sync. And then I'm going to tell it as a return. I'm going to tell it what the return, what the flow state is. And so I pull the sample. And then if the sample isn't end of, isn't nil, then, then we carry on. If it is nil, then I'm returning that we are at the end of the stream. And then buffer. So we get this, our sample. So we're pulling the sample. And then, and then we're getting the buffer out of that. And then ultimately reading some, some, some information from that, from that buffer map, changing it from big Indian to little Indian, I think, or something. And then, and then doing some stuff on it, doing some maths on it. Not a lot of like useful information there. Like in terms of like, what am I actually then going to go and do with it? Well, at the moment, it's just printing out RMS. But then you can go off and do whatever you want with it. For us, that means getting a video and audio data out of G streamer and chucking it into NDI. Oh, Dan, why are you not using NDI within G streamer? Well, I tell you number one, when we did our NDI integration, G streamer didn't have NDI. It was, it was completely separate. It was, it was a different repo. And it wasn't part of the G streamer rust plugins. And then B, we do extra stuff that G streamer doesn't know how to do yet. So we, we grab tally information from, from NDI. And to be able to do that, you need access to the underlying NDI sender. And, and so there's stuff that G streamer can't do yet. Something that we actually want to add in to G streamer. So that we can stop sending stuff via the NDI SDK directly and we can just let G streamer deal with it for us. But again, goes back to that cost analysis, right? At the moment, we can get that data out of G streamer using app sync and chuck it out via NDI. We can do that. And it's relatively cheap. But then there's a load of extra work for us to be able to kind of go in and figure out the right way of doing it in G streamer so that like tally information becomes available as a signal. So yeah, for us, this means that we have to handle RTP and RTCP from Pion. Because Pion, within WebRTC, WebRTC is made up of lots of standards. But ultimately the media is RTP. And the bit that tells you what the quality is and everything else that goes with it along with it is RTCP. So it's very easy to forget about things that are very important when you don't deal with them. Like RTCP. SFU people in the room will go, ah, you could never forget about RTCP. But as a web developer, the browser deals with all of this for us. And so it's very easy for us to go, ah, RTP, I'm going to get my media. I'm going to get my media. And then everything works really, really well when you're in a really nice network environment. But then you chuck in real life scenario and the audio in the video goes terrible. Why did the audio and video go terrible? Because there's no RTCP feedback mechanism to go, ah, something's going wrong. But yeah, GStreamer makes all of this easy. And very quickly on this very specific thing, we use RTP bin within GStreamer. So that's that middle bit for us. We use app source, chuck it into RTP bin, and then we do a load of transcoding and stuff as well. And then we get app sync. RTP bin is magical. If you deal with RTP at all with GStreamer, then you need to be using RTP bin. There's a lot of text there. But ultimately, it implements everything you need to be able to handle RTP and RTCP and demuxing of payloads. And it's just a very nice all in all thing that deals with everything using all of the separate, all the separate plugins. But it forces it all together nicely for you. So for us, that's connecting the app source sync pads to RTP bin. And you'll notice I say pads. So for us, you can see up the top there RTP bin. So we're requesting a pad from RTP bin in that format. So it's a receive RTCP sync. And then we're also requesting a pad of send RTCP sync source as well. We then go and make a new app sync and a new app source. And you can see they're labeled RTCP app sync and RTCP app source. We then add those to our pipeline because otherwise nothing works. All of your elements have got to be in a pipeline. And then we link our RTCP app source pad RTCP app source, get static pad source, link RTCP sync pad. Yes. So I'm getting the app, sorry. I'm grabbing the RTCP sync pad from the RTP bin. And I'm linking it over to the RTCP app source. So that's basically just saying RTP bin is going to give me some information up to RTCP information via a pad. And I'm connecting to that pad so that I can then grab that information and send it over, send it back via Pion up to my web RTCP. So you'll get RTP in, in this case, you'll get RTP in into RTP bin, but you'll get RTCP in and out. So you'll get told RTCP and you'll also send it back out as well. And like I say, don't forget about the RTCP. As you can tell, I forgot about the RTCP and ended up doing certain demos and going, ah, look, it's really great. And then someone went and tried it on a really crappy internet connection and went, no, Dan, it doesn't work. And, and made me look rather foolish. So you end up looking something like this. So does everyone know about the dot graphs that you can generate from GStreamer? A couple of nods, not that many. So you can, within GStreamer, you can tell it, I want you to export a dot graph file on anything, on, on a state change or whatever. You, you've got control over when it generates it. And so for, for me, we, when we've got debugging enabled, we enable a dot graph generation whenever state changes. And so ultimately, this looks really small and dumb. It's a PDF. So you can go in and, and look at it in high quality detail. Um, because it's not a PNG. So you've got lots of options. You can, the dot graph can be converted into lots of different formats. But the really cool thing about dot graphs is it tells you what's connected to what. And so it's really great for debugging. And so for us, we've got our app source, um, our app source and our, our two app sources. So one is, um, one is RTP, which is this one. And then this one is RTCP. And you can see, I'm coming off the camera. I'm sorry. Um, so you can see that this one's set with, um, with capabilities to say that this is RTCP. And this one is set with capabilities to say this is RTP. And so you can see those are linked to a pad within a GST bin, a GST RTP bin. And so those pads are then connected to an RTP session. The RTP session is then, um, connected to a demuxer. The demuxer is then connected to a jitter buffer. And the jitter buffer is then able to go. Oh, well, in this, in this RTP stream that I'm receiving, that's both audio and video, where it's demuxed it and then it automatically goes, ah, here's the video and here's the audio. Right. And then it chucks it back out, chucks it back out, creates some pads for me, which I then connect over to, well, there's an app sync up there and that's my RTCP app sync. But then you could see here that it's then connecting out Opus and VPA into my pipeline. And then this is like the rest of the pipeline, which we don't care about, but like, I get told it's Opus and I get told it's VPA. And so I'm able to decode it and do stuff with it, whether or not that's outputting to NDI or whatever. At the end of the, um, at the end of it is, um, is an app source, uh, sorry, an app, an app sync for sending out via NDI. So we, we got into go purely because of Pion and Pion gives us loads of control. It's basically WebRTC in pure Golang. If you ignore the fact that WebRTC does lots of like actual media stuff, but when you look at, say, the, just the, the network portion of it of sending, sending data from here and sending it there, then it's pure Golang. So yeah, you can do any of this with any of the G streamer bindings or you can just, you know, do it with actual G streamer C. I mean, who actually want to do that? I don't know. But you can go and use whatever bindings you want. And so there's really nice bindings for Python, Rust, um, and I haven't used any of the others. Um, I've definitely used the Python one and the, and the Rust one myself. Um, and the Golang one, I went on there this morning to take the screenshot and I was like, Oh, where's the Golang one? Um, so here's the pull request to add it to the list. So if you've got a problem and G streamer doesn't quite solve that problem, that's what this talks about. This talk is about the fact that you can make G streamer do what you want it to do using app source and app sync. You can build it yourself with app source and app sync. So why G streamer? Why not FFM peg? Whatever. G streamer does everything that we need it to do. It has a fantastic community, super friendly community. And ultimately it's just super flexible and does exactly what we need it to do. Um, which is not something that we felt as a team. FFM peg would give us, for example, G streamer has a lot of scaffolding, let's say, um, and, and gives us an awful lot, um, for free. Whereas G, uh, FFM pegs a little bit more, more work, right? So my last message is G streamer for the win. Um, so yeah, don't wait for others. Don't wait for others to build your plugin for you. You can go and build with G streamer, app source and sync. And that's me. Thank you very much.
Building open source telephone agents using LLMs
Welcome back everybody. We're going to begin a little block of talks with the intersection of AI and real-time communications. Rob Pickering here is going to take it away with building open source agents using LMS. Rob, go for it. Thank you. Hi, can you hear me okay? Brilliant. Okay. So I'm Rob Pickering. I kind of landed in the real-time communications VoIP industry about 20 years ago after doing a whole load of other internet stuff and I've never quite really managed to escape. But what I want to talk about today is the idea of putting LLMs on the end of the telephone and some work I've been doing for about kind of the last nine, ten months on this whole idea. But in terms of where I come from on this, what was the kind of the most important thing for a successful project or a successful open source project? Come on. What was the most important thing for a successful project? Developers. Developers, yeah. Well, basically I think it's a fundamental belief in what you're doing that is just completely unassailable. You just believe in what you're doing and kind of make it work. I don't have that view actually about AI on telephones or machine voice interfaces generally. And up to now I've actually been reasonably agnostic about the utility of machine voice interfaces. I did a bit of work about four years ago right at the height of the COVID pandemic around connecting dialogue flow type AI, so proprietary conversational AI to phone lines for doing things on assistance lines. So basically asking a bunch of questions and then feeding the results into a Google spreadsheet. And I was reasonably convinced at that point that you could kind of make this into a SME product by putting a kind of a front end on it that allowed it to be self-service where someone could just go to a website, you know, kind of order themselves a machine voice assistant, connect it to the end of the phone line, feed the results into whatever office automation they use and, you know, just automate away all their customer contact. That was probably really quite naive. I didn't think I entirely believed in that, but it actually didn't go so well for that whole thing. I mean, the technology worked absolutely brilliantly and did a great job, but it took roughly a day basically to onboard the most basic of sort of telephone agents getting some results back. So kind of before that and ever since really, I've been a little bit agnostic about machine voice interfaces. I do do the whole, you know, Alexa Google Home thing, but only kind of in the workshop when I'm covered in dust or something, or in the car. I'm one of those kind of people that prefers kind of pointy-plicky stuff. From a developer point of view, I think they're painful to train and then they still blow up in your face afterwards and users either kind of train themselves to, you know, not like that Alexa, but they're only effectively train themselves to talk to the agent the way the agent expects them to, or they just kind of give up. Round about sort of this time last year, I started looking when things like GPT kind of went general release or general availability. I sort of thought, well, actually, could we start using LLMs to do the difficult bit, the intent recognition part of that, and then effectively kind of feed that into our back-end logic. I thought, okay, let's give it a go. There was nothing kind of really like that. OpenAI hadn't released their kind of audio interface, although all the pieces were there with Whisper and everything else. So I thought, okay, let's kind of give this a go and try connecting an LLM up to the telephone. So how do you do it? Well, there are quite a few projects that have kind of got most of the moving parts of taking a SIP conversation in and then turning it into an audio stream that you can then put into a speech-to-text, send that off to whatever LLM platform or whatever local LLM you're running, and get the results back. Asterisk, I'd already used actually in the previous project, and there was a nice bit of software actually. Dan's disappeared off now because I'm talking. Dan had a nice Asterisk-ARI interface module that effectively allowed you to work with Dialogflow, and that was fine, send the audio stream up to Google, get the results back. There's a similar module in FreeSwitch, and then there's Jambones. Jambones is really quite a nice open source platform, sort of open source UC platform, that kind of does all this stuff for you from a speech-to-text and text-to-speech point of view. So it interfaces to, I think it's about 10 different speech-to-text services, multiple text-to-speech services. It's got a nice API, WebSocket, Vent API, and it's like, okay, let's use Jambones as the piece that we're going to put the SIP calls into and figure out how to kind of write a bit of middleware that sits in the middle and lets us evaluate this. So this was my kind of plan for how to work this. So we have this really imaginatively named LLM agent down here. It talks to Jambones. It also talks to OpenAI. So call comes in from a SIP carrier. Jambones tells us via a WebSocket that that calls come in. It then hands us the speech-to-text transcription of that call. We send that off to OpenAI with a prompt. We get the result back and then we use the text-to-speech engine that also interfaces Jambones to get that out as an audio stream. And okay, doing it with just one vendor isn't a great idea. So I added Google Palm 2 onto that as well and put a nice kind of generic interface on it. We use the Jambones WebSocket API and the way that the agent actually works to get a calling or when we set an agent up, the idea was to make it quite big and multi-tenant capable so that we could have one instance of Jambones and one instance of the LLM agent effectively handling multiple inbound calls and multiple inbound call agents so running effectively multiple LLM scripts. So what happens when we set a new client up, a new agent within the LLM agent software? There's a lot of agents in here, aren't there? We pull a spare number off a pool that we hold inside Jambones. We nail up an application with a new WebSocket to the agent and we link that to a number and that then gives us a way of routing a call in into our software that we can then post off the S2T to the LLM. Inside Jambones it kind of looks like this. You have a bunch of applications and then we have a bunch of applications that post in to our agent and then we have a bunch of phone numbers, big lock of them that are routed in from a carrier. Some of the numbers are allocated to an application that means that they've been linked to an agent and then others are available. So when we make a REST call into our LLM agent software, we federally nail up an application and then that connects up to a Jambones phone number. The whole point really is that the client application that's trying to provide this agent service doesn't see any of that complexity in Jambones, although it does effectively, is post the prompt to the agent interface. It gets a phone number back saying I've linked your number to this phone number and then it gets a WebSocket feed of all the events that occur on that agent. So rather than messing around doing this as PowerPoint where let's set up an agent, so all that we do is we make a REST call to create an agent. So in this case I'm going to create a chat GPT, oh sorry, an open AI GPT 3.5 turbo agent with a prompt there. And then I get a response back. That response gives me an ID and a phone number. So let's give that phone number a call. Impossible to stage because the number is kind of pseudo random. Okay, so that's a simple call on a REST API that allows us to create an agent with a prompt. I'm just going to tear that agent down again otherwise we will leak phone numbers. Okay, so that was kind of the first part of it. We built a service that allowed us to run up an agent on the end of a phone number that was using an LLM to fulfill the conversation. Next part of it I thought was kind of worthwhile was to give it a try with putting a user interface so that we can create a playground for it. Just allowed us to experiment with it. So that's really kind of the next part of it which is to go back to the playground. And we do a similar thing here. We can select a model, we'll do a slightly more capable model, hit on create agent. We did that playing around with voices so I'll still use a UK English voice but let's do something like that. Yeah, I know the font is quite tiny. I'd like to buy five different UK flags please. That's great. Could you please specify the size and material for the UK flags you want to order? We have sizes of 50 centimetres, one metre and five metres and materials of nylon or canvas. Could you do me one each of those sizes and each of the material please just so I've got a range of samples? Sure, just to clarify you would like one UK flag of each size, 50 centimetres, one metre, five metres in both nylon and canvas material. So that would be a total of six flags. Is that correct? Yeah, that's right. Great. Could you please provide your full name and delivery address? Okay, so actually that was the kind of presentation I was giving in kind of July, August 5 last year. So that was as we just developed the thing. Let's go back and have a look at some of the fun things we had to try with it. So I thought this one was quite cool actually. Rob's Donuts wholesale ordering line. How can I assist you today? Hello, I need to order some donuts. Hi there, thank you for calling Rob's Donuts. I'd be happy to help you with your order. How many donuts would you like to order? Hello, I need to order some donuts. I sell on average 45 donuts a day. What is the best price you can offer me? Great. Selling an average of 45 donuts is fantastic. Let me calculate the best price for you. Based on our cost price, I can offer you a special price of 40 pence per donut. So for 45 donuts, the total would be 18 pounds. How does that sound to you? Thank you for the offer, but I'm afraid that price is too high for me. I need to maintain a margin of 75% on my purchase price in order to make a profit. Can you offer a lower price? I think Saul will yank me off stage if I let that whole thing run for five minutes. Let's move on. That was kind of interesting, right? Because that's actually the prompt for the cafe owner. It's not huge. We basically just set two different agents off, bridged the call together and set these folks. They did actually arrive at a price. If you find the video online, there is actually a bit of a bug in the middle in that one of the speech-to-text transcriptions gets... Basically, they get confused about whether they're quoting for the volume of orders per day, and it all goes a bit awry, but then they recover it again. It's quite an interesting example of... If I was a little cynical at the start of the process, it's certainly a good demo of what the technique can do. There is kind of a bit of a problem because doing demos like that, it is really, really quite interesting to do that from a one-page prompt. I literally woke up at four in the morning and thought, it'd be quite cool if we tried that. Wrote a couple of prompts, didn't really tweak them around very much. I think I posted a video on the internet by about seven in the morning. It's not a complicated thing to do. But there is a bit of a trap in here because prompts aren't code. They're just an initialization of some AIS state in the expectation or hope that you might get the kind of completions that you're looking for. But actually, as long as we understand that, we can kind of work with it. But the problems that we did find while we were playing with these different instantiations of agents based on simple text prompts are certainly hallucination. All sorts of interesting things happen, especially through the lens of a lossy speech to text. Bear in mind, we're using these models as text-based models and we're putting speech to text on the front. So if they get deeply confused about whether we're talking about pounds and pence or donuts per day or pence per donut because of poor transcription, you get almost a sort of multiplying effect on that hallucination because the randomness that the speech to text can inject, particularly with real humans on noisy phone lines, kind of multiplies that ability to hallucinate. There's also prompt injection. I would not encourage anyone to sack all their salespeople and empower an AI to negotiate prices. Because actually, they don't do maths. I don't know if that's a secret, but they don't do maths. They replicate previous results effectively. So that's a problem that if I'm malicious, I can inject malicious text with knowledge that the underlying LLM is implemented in order for it to give me donuts for 1p or pay me £1,000 to buy a donut or something similar. A particular problem when we're trying to do this with voice on things like telephone connections, and this is a huge issue because of our expectation we pick up the phone to a human and you get basically about a second or so to respond before the human on the end gets a bit fed up and either thinks you've gone away or whatever. So poor latency is particularly a problem, especially if we're using large remote models where we're firing a request into open AIS data centre and hoping they'll look favourably upon us. And similarly, there's very definitely a privacy issue around firing off all this data to humongous cloud providers. A lot of these problems are quite soluble though. And the model that we came up with which seems to work quite well actually is effectively using containment or gatekeeping. So rather than just giving the agent one great big prompt right at the start, what you effectively do is micro prompt each stage. So it's going back really more to that model that I talked about where we use the LLM to do the intent recognition but then use logic to decide how we as an organisation are going to actually act on that. And this actually lines up quite nicely with a lot of the current AI safety theory. If you look at some of Musta Solomon's work on this whole idea of applying guardrails and effectively putting algorithmic controls on what the AI can do, this idea of gatekeeping and containment actually works quite nicely with that. So the idea with this is that you allow the LLM full authority over the conversation but as soon as you then summarise that conversation and once they can act an action on it, anything that's got side effects or changes, only the gatekeeper logic gets to do that. So if I'm authenticating a bank account for example, what I don't do is say to the LLM, Fred's password is Tuesday. Check whether he's authenticated or not. You effectively use a multi-letter approach where the LLM doesn't have any knowledge of the secret. It's only sent off to extract and then we give an indicator back to the LLM of either that answer is correct, incorrect or perhaps close enough, drill down. And what that means is that the LLM then is acting as an agent of the logic which is implementing the algorithm rather than having this kind of, this autonomy and this ability to go and do things. I'm probably going to run out of time if I show you how we architected the API to do that. But fundamentally what you do is the API's got the ability to do that. What you do is you create your initial agent with your initial prompt and then when a call comes in, we get a call event on the WebSocket and then each completion or intent recognition that goes through the LLM, we get an event on the WebSocket that tells us that's happened and then we get the ability to post an update to the agent on a specific call ID to update the prompt that's being used. So for example in the banking application context, the user has said, you know, I want to log on. So we update the prompt to say prompt the user for their secret. And then when the LLM comes back with a result, the result may be multiple results, right? It may be, yeah, I got the user secret. No, actually the user doesn't want to know that at all. They just want to open an account and they want to talk to a salesperson or any one of those intents. So the gatekeeper then moves the conversation into another of these contexts by putting effectively an update prompt. So it's a little bit like RAG where well it is RAG effectively. So what we're effectively doing is that the LLM is doing the intent recognition, plumbing it into the algorithm and then giving an answer which is generated by an algorithm effectively and certainly controlling it through logic in an algorithm. So, you know, that's the 30,000 foot view of the project. I kind of founded this really as a little side project just because I was kind of interested in what the technology was capable of. As a result of what I've kind of figured out from doing this, there are a few kind of opportunities for development here. And the most obvious one which helps solve the latency problem, a whole load of other things, privacy problems as well, is at the moment we're only interfacing to open AI and Google. There's an abstraction there that should make it easy to interface to other things, but actually for operational reasons, we kind of decided to use open AI and Google because this whole thing is available as a playground that you can just go to today and play with so I'm kind of funding hosting of that. And the quantization of the hosting cost actually works better with open AI and Google than it does with implementing an open source model and paying for the compute time to run it to make it available. But certainly some work on implementing this based on open source models, bringing Mastral and Lama into the mix. Open source embedded, speech text, text space. The best game in Taiwan at the moment I think is probably Whisper. There's been a new release of open AI's Whisper in December and that is an open source product, unlike some of their other stuff and bringing that into the speech recognition engine, either doing it here or doing it in JamBones' custom speech-to-text engine, I think is definitely going to have some interesting results there. Handling interruptions and async conversations. Human beings aren't very good at doing conversations in turn like those two agents there were there. The agent starts talking, human interrupts them. Yes, you can help me, this is what I want. So again, being better at handling that and that's something that I think we can do through customization of that speech-to-text engine that allows us to abort an intent recognition earlier if someone starts talking. Again, latency is going to be improved by making those models closer. Function calling. So open AI's API has got the ability to do function calls. Some of the other commercial models have, none of them do it in a consistent way and I've not even thought about how we start to make function calling, something that the API can kind of abstract away in a sensible way. Someone came up a while ago with a bot-to-bot API which I think is quite cool so it's the idea of let's just stop the humans talking to each other at all. I do not buy and bot, talk, steal them, selling bot and we just don't bother with that anymore. I like the idea of audio interfaces and this kind of textual log of the conversation because it's kind of interesting and it means we can read it. But certainly freeing up humans from conducting it all through the medium of a killer as audio doesn't seem particularly sensible. And I guess actually a sustainable business model to support the try-out. The try-out interface which sits at llm.applicay.com's costing $200 or $300 a month to run at the moment in credits which I guess is a new MacBook every few months so I should probably try and find a way of kind of supporting that. And most of all a better name because tripping over the name LLM agent in this presentation has been kind of fun. So that's where the project's going, going forwards. A bunch of links there. There's my email address, Robert Pickering.org at RobertMatrix.org. We got time for questions? So quick question. I'm curious about the integration between OpenAI Whisper and Gambon. So is this something that you're going to have a little bit of that? Something you can do, use OpenAI Whisper within Gambon or do you need to do it outside? Do you know, I've not looked, it seems like an obvious thing for Jambones to do but it's not my project. It may be they've even done it or someone's done it. The brilliant thing about Jambones is it has got a custom speech-to-text API on it so you have got this kind of plug-in availability where you can take a WebSocket audio feed from Jambones and hand the transcription back in. But let's say that you want to use OpenAI Whisper because it's open source, it's faster, it's latency. So I'm curious about how does that interlace with your kind of project. It's like in the workflow, it's something that you need to do it before using Jambones, in Jambones, after. I don't know if you have anything about it. Architecturally I could choose to do that in any place. It seems pretty obvious that given that I'm handing off speech-to-text and text-to-speech to Jambones, I would do it there and if someone isn't doing it already then doing it through their custom speech-to-text API would be the obvious place to do it. It gets a little bit interesting because of that thing that I talked about with the ability to do interruptions and terminating the endpoint on a conversation. I think when setting up, and it's one of the jobs that's kind of been on my to-do list since about October, when I do get around to doing it, I think I actually may find that doing it within just the straight Jambones custom speech-to-text interface might be too restrictive because it doesn't allow me to get real-time control of what's going on in the inferencing that's been handed off to OpenAI. But I hope not because in theory if we could just throw that into Jambones then that both solves the problem and is a useful enhancement to that project. Okay. Thank you. Over there. Yeah, so in the point you mentioned that the language model is not like code, not going to produce or react like code. Yeah. It's not reliable enough. But there is this thing called the GBNF, the formal grammar in the language. Yeah. Okay, so the question was about my statement that prompts on code, particularly in things like OpenAI models. And it's my opinion on that impact of by formal notations like GBNF, I think it is in Lama. GBNF in Lama. Yeah, I mean that's an alternative approach to the problem is to make the AI more deterministic, right? So you either constrain the AI by an algorithm kind of outside the box or you make the AI more deterministic. And I would argue that actually doing both is probably what you need to do. Any more questions? Can I take the orange t-shirt at the back please? Thank you. My question was about literacy. Is that mostly caused by finding what the end of the sentence is or by the time that it takes to get a response from the LLM? Okay, that's a great question. So the question is, I've stated latency is a problem. Is that primarily caused by the endpoint detection in the speech to text or is it primarily caused by the latency to the LLM? And the answer is it's both. Good streaming speech to text allows me to reduce that latency. Really bad speech to text, which is kind of heavily blocked. So I'm having to take a big sample size before I put it into my engine to get a stream of transcription out. Obviously slows down my, I can't even send for a completion until I've got what looks like a meaningful piece of user input. But I've then got the further delay caused by non-deterministic stuff on the engine. So you will have noticed actually I chose GPT 3.5 for one of those demos because it's really quick. GPT 4, even Turbo, you know, kind of like about half of the latency is waiting for the transcription and the other half is waiting for getting the answer back. But really great question. How are we doing for time? I've still got time for another question. White shirt in the middle there. Could you screen that a bit by getting the speech to Urbanare? Because you can both with humans, I don't know how to think about it. They go, um, okay. So could you put the voice on to do that? So the caller hasn't gotten that long silence when they think they've got away. So the question is can you kind of use human factors and put some kind of noise, some feedback, some Urbanare's coming from the agent synthesized to effectively let the human know that there's still something on the end. Yeah, you absolutely can actually. And the LLNA agent code, I think it might have gone away. But one of the things that I was just put some background noise in there, so you just put some background office noise, because everyone is super used to calling call centers and hearing tappity tappity click clickity click. And you know, what did you do at the weekend in the background? So I think I might have found the web file I was using for that wasn't entirely as unencumbered as I thought it was. Question down here at the front. What type of speech model is there? Because it sounds pretty natural. That one that I was using there was Google, which is, you know, still pretty good. It's old but good. But it's a really nice streaming speech to text. The text to speech was again Google. It's one of the WaveNet Google text to speech. Sorry, I misunderstood and misrepresented your question. Is there a lot of interest in the bot interface where you set the zone up and find and selling bots which can involve people there? Which seems to be a big waste of producing power because you could just have like a json API and find and sell donuts. So the question is, is there a lot of interest in a bot-to-bot interface? And actually, when I was first asked this question, it was in another presentation, another place. And I wittily retorted, yeah, we've had EDI for kind of like 30 years and you know, that's the way that you do these kind of data exchange and negotiation. But I think the reality is it is quite interesting and people have subsequently persuaded me to change my mind on this because it is quite interesting to be able to see a human-parsable transcript of a negotiation. I think it might have legs. It's well down the feature list. But it is quite interesting from that point of view because I can see a transcript of it rather than seeing, you know, a bunch of XML or a bunch of JSON saying this is why we arrived at 46p of donut. What I actually get to see is something that purports to be something like a kind of human reasoning path. But I think it's a very good question. Any more questions? Oh, waving your arm to the back. Sorry, I missed you. Latency is, you know, yeah. One and a half, two seconds is not a great latency. Getting that down to 500, 600, 700 milliseconds would be ideal. And by being able to chunk the speech to text better so that we can stream it and also move the inference thing quicker and faster, we can hopefully get that down. I mean, through some of the techniques like the background noise, you can kind of get away with it, especially if the human being knows they're talking to a bot, which I, you know, you kind of have to be honest and let them know that's what's happening. So I think people are happy to synchronize in that way, but it could be a lot better. Okay, so question is what use cases are there in the domain of customer interaction? Not out there, but which ones have you tried? Which ones have we tried? In terms of trying this on real users in a real production commercial environment, I'm going to stick my hands up and say I don't think it's ready yet. There's some development work that I'm personally doing that should kind of come to fruition. August, September this year, they'll put it into a real trial environment in a commercial environment, but right now there's those four or five bullet points on hurdles there to improve the ability of the system to do stuff. I wouldn't put this on the end of anything commercially significant right now in February 2024, but I think with the right controls in place, there is so much money to be made out of doing this. The amount of human endeavor spending its time on the end of headsets in call centers is going to happen commercially, so really what we need to do is going to make it work in the most beneficial way that we possibly can, and by making it open source so we can see the moving parts that are inside it, and having it depend on open source models that can be audited I think is the best possible way of doing that personally. I think we're probably done with questions, I'm certainly out of time, so thank you all very much.
Skynet: introducing local AI summaries in Jitsi Meet
Okay, now it's time for the to introduce the guy that needs to introduce no introduction so we all know Saul is one of the key members of the the GC team and today is going to talk about and I hope we'll not get are not first and I'm trying to shoot you down but yeah it's going to talk about Skynet and AI summaries in GC meet so thank you. Thanks Lorenzo and thanks everybody for being here. Time for the intro so I don't need to do that myself. Many of you probably know GC already and GC is where is my cursor. There we go. It's a video conferencing platform, it's a toolkit to build your own, it's really a set of open source projects that we combine together to deliver these end-to-end video conferencing capabilities. It's also a set of APIs and SDKs that you can mix and match, host it yourself, you know pay us some money and we have a service running or just go to town with it and it's also a community of people that build more plugins for our platform help each other and we saw for instance during the pandemic lots of people spinning up instance GC instances to help other people communicate. It became a lot bigger than the way it started because it's a project that has been around for a while. GC is 20 years old. It started out as a communicator, a CIP client then when XMPP Jingle became a thing it became the kind of pivoted to that and then video was a big focus so multi-party video was an area where a lot of effort was put together and that kind of came to fruition when WebRTC came out because that made the client was pushed to the browser and we could run the same software that powered a client on the server this time and do the multi-party video on the server and that's the GC we have today where last year I presented how we did 10,000 participants so it went a lot of many transformations over the years. I think arguably the biggest transformation in all this time was WebRTC and how the desktop client was in a way left behind and then everything was moved into the browser. Some say that AI is the next sort of gold rush or the next revolution in this space that will make things change a little bit. As the old joke goes well in the gold rush era it's not the gold diggers that make the money it's those selling shovels so I'm hoping to show you some shovels today. Now in 2023 and plus and beyond what's kind of the state of things so AI became huge it has already been there right we have full played video games with AI characters but in November 2022 something changed can anybody guess? Open AI, open AI indeed. Open AI released chat GPT. Now the way I think of it in my head I think the most important part for the end user of chat GPT because this is vocabulary that is now second nature people have used these words even though you don't need to know about transformers so to me the more important part is chat because it's the first time that we could interact with an AI in that way. Before it was always hidden in some backend server or oh there is AI that enhances these pictures but there is this thing but you couldn't directly interact with it you couldn't ask it questions get back these prompts you did not have this ability to interact with it and I think that is what made these new developments more special than the fact that you can host them yourself of course it's a plus and you know in GC style this is kind of the path we would try to follow so as I mentioned GC is a collection of open source projects so it's in order to run GC meet is a platform that's built of different components and this is like the basic platform where we have the web server the signaling server which is still based in XMPP the GC video bridge in charge of routing and GCOFO to do the conference signaling. Now another component not depicted here is GIGACI which allows us to connect to the PSTN but in this very room sort of in 2017 we presented transcriptions with GIGACI. It was a project that started out with Google summer of code so we had this building block already in place and since all of these LLM technologies are text-to-text kind of operations where you need to feed it text to get some other text you really need to have transcriptions to start working with it so we had this building block in place and internally we started to prototype how do we want to leverage these tools are they of use to us so we conducted an experiment where we built the bots framework our original idea was instead of us building something directly we're going to build a framework so that these technologies can be integrated externally into a meeting our rough idea and what we built was to use puppeteer to run Chrome because it's the richest web RTC endpoint that we can have and I'm going to show why running Chrome on the backend I guess you could call it was a good idea and we integrated a low-level library so live jitsimid and you could pretend to be a participant talk to an LLM get the transcript and talk to the user and with with the script this is what we prototyped and I'd like to show you a couple of videos of what we built so already hey tutor I've been thinking about the architecture of our to-do web have you given it any thought yes I've been reading up on it I think we should go with the restful architecture for scalability and flexibility that's a good idea how about the database should we use sequel or no sequel I think no sequel it's totally how we better for our needs it's more flexible and can handle unstructured data yeah agreed and what about the front end framework we had a robot I was thinking of using react it's fast at the end of the sounds pretty good how about we get started on building the application let's see if that works sure thing let's do it all right now we're going to bring John Doe into a meeting and he is arriving late and the bot just greeted him here in the chat and now he'll get a transcript real quick of what was said in this meeting and he can just continue the conversation so that's that was the first thing we wanted to try which is can we use this technology to build something like that right we had seen many others you know start with oh I get a transcript I do some stuff but because we had access to real time transcriptions it's like oh can we try a little twist so you just arrive late you get the summary of what was said already okay and this was done with this chrome running in the browser now why do you want to run chromium sorry chromium back in or in a container because then you can do cool stuff like this no not that hello I don't know if it's playing or not that should be I'll show you when I switch to the files now but basically it's the fact that we can play audio and use webgl and we kind of do that otherwise the browser is very smart to end it now what did we learn in this exercise of attempting to use it this way well first is that JavaScript might not have been the best choice because not all of the AI libraries are in a different camp also that for our specific application we think that that instead of going the general route of you know you can ask any question is more specific tasks and like in this case meeting summarizes something very well defined very well understood and that also allows us to give some more value to our users through customers so we can only help our users when they are in a meeting in our software but if we do something like this we can help them even when they're not there if you can go and check the notes of a meeting that you were not part of and then it turns out they are useful so you don't really need to be there well that is in and of itself something that's helpful for you now in terms of running this our idea is to run this modest model that fulfills the task because running these things I'm going to talk about in a little bit can be taxing so you want something you know as simple as possible which it still meets the criteria that you can of course cost can be a problem so when it comes to yeah we run the model well yeah but that also costs money so some you need to balance balance those things out and as I said one thing we realized is that writing all of these logic in a bot felt kind of wrong because you might want to use the same logic you know to apply it in different places so we thought maybe we should move all that logic to do summaries or to do other interactions with the meeting to its own dedicated framework and then we can reuse it here or maybe in other places and that's where the idea of SkyNet came from and we started prototyping that right after so SkyNet is our core for GC meetings it is designed to support different but specific AI operations our ability to be horizontally scalable so it could run multiples of the of each of the parts that compose SkyNet and we currently implemented like three but really two AI services which is summarizations and transcriptions we get the open AI compatible API for free but at the moment we're not making extensive use of it per se but focusing on the specific tasks of summarization and transcriptions and our initial focus as with kind of everything we do in GT was focused on running it locally using local LLMs so you can run it on your own servers and you don't rely on you know external cloud services this is how we always build things authentication is also a URL you plug and you connect elsewhere so it kind of fits our our DNA and personally I was excited because we were using Python again and I haven't been using it in a few years so that was cool to go back to it I totally installed this sort of naming from someone's conference at Comcom couldn't remember his name but I it really resonates with me the idea in the current AI landscape you have tools that can do speech to text that can text to speech that can do text to text and then so you have all these transformations right so and then you can combine because if you want to do a summary you probably have voice first so you need first a speech to text then some text to text and well then you could summarize that part so we've got our summary's application that sits on top of long chain and then we run our LLM underneath it and then for transcriptions we're currently running a whisper I'm going to go into a bit of detail on how we run whisper and we sort of divided the SkyNet architecture internally to show this this divide because I think it helps build this mental model of how the data flows from where it begins does it originate in the speech then goes to text then it ends up being text again so for example the summary's modules got like two parts to it one is the dispatcher we call it and the other one is the executor so the only one that needs access to the LLM to actually run inference and get an output is the executor and the idea here is that we can have multiple dispatchers that will handle the request then they will prepare the request and they will store you know the content that needs summarizing in a cache and as a worker becomes available will pick up the work do the work publish a result and then the frontend scan get back the result to the user these allow us as to throw in however many executors we can well of course based on how much money it costs us to run because we need GPUs to do the the inference how much capacity that we need to serve and of course how many of them we run is a metric of measuring your own how much you want to spend how much you need to service at a given time and how long can it take how long can the answer take so it's possible that getting a summary two minutes after a meeting is acceptable maybe you need it in a minute it all depends and depending on the way you want to go about it you could play with how many of them you want to run but we built it so that we could run it this way and then we could scale horizontally based on the on the load that that we needed now as we started playing with this like for real one thing becomes quickly kind of obvious which is your summary is is like when is a summary a good summary well first of all it is a good input so the transcription is really critical because if you have a bad transcript there is no way for you to get the bad summary with a good transcript you can also get a bad summary but if you have a bad transcript you're definitely going to get bad summary so uh jigasi as i mentioned before already had transcription capabilities today it has the ability to connect to world cloud to vosk and now to skynet now the google cloud has changed the models that they have and the one we have we were using was not really great it didn't give very very accurate transcriptions and that then showed in the in the summary that we were getting we have not yet played with the other models that they have but then again sending our audio samples to google is not something sorry that we're looking forward so we started building the equivalent on top of whisper because it gave better results the end result is a skynet module that so jigasi will open a web socket connection towards skynet and it will send audio frames in pcm format and it will get back the transcript and in this module we're going to run inference leveraging the faster whisper project faster whisper combines voice activity detection with with an alternate implementation of whisper to give you these transcripts and you can have near real-time transcriptions in fact in jigasi you can use this to also show subtitles so they were definitely fast enough for this for this application that we're interested in and what faster whisper allows us that quote unquote the og whisper doesn't is the ability to do it in this streaming manner and in real time so it's way files back and forth is not something we we can use for this application because we want to get the transcripts in real time and this is this is the way we accomplish it one advantage we have also from doing it this way is that jigasi will receive the streams from each participant individually so we already have each participant identified by by their stream you can of course always do a transcript of a recording but if you have a recording with all the audio mixed in then you also need to sort of separate it and there can also be mistakes in that now this way we don't run into that problem because you guys you can clearly identify whose audio it is and send it as part of the metadata that comes back in that uh web socket channel so once we had all of these things kind of glued together we had to face reality which is this kind of ai ops so you can't just throw this in a server and it runs and it's all great because models are kind of big and that can be a little bit of a problem so deploying this to actual production is a new headache you need to worry about because you do want when you're as I mentioned if you're horizontally scaling you do want these new servers that you put in the pool let's say to boot fast you want them to be able to start working right off the bat so and then you also need you might need multiple container images because oh this needs this version of CUDA like the version of CUDA the faster whisper works with is different than the one we need to run you know like lama for example so you don't need to think about that currently the way we're doing it is we're running OCI virtual machines and inside each of them we run nomad and skynet in a container and then the VMs have the models loaded into them in their image so then this way the models are readily available because the container images are already pretty big and adding the model to them it would make them unbearably big and you end up with timeouts and yeah things you you don't really you don't really like I like to show you how this looks like a little bit and since I'm doing pretty good on time I think let's see where my mouse is so first come on okay so first I would like to show you the video that I couldn't show you it is this guy this was the video of the bot that joins the container thing well maybe that's why it didn't I'm a friendly ah there we go ah come on hello I'm a friendly bot this is just an example of what I can do check out other examples for more so this looks very simple and in a way it is but the way you get here is this little robot that moves its mouth is a 3d model made with blender animated with webgl and the lips animate as the audio is being played back the audio has been played back directly in the browser with we use the service play ht just as a as a test again going forward if we end up needing this I would love to tinker with Microsoft t5 or mozela's cognigy I think it's what's called because you can self-host them yourself so that using a browser as your runtime for bots is kind of nice because it does allow you to do these sorts of things where you can run something like this and it will be very hard to create a 3d kind of robot that moves their mouth in something else that's not really a browser and you can animate it so easily with the same library that you use for for the rest of the stuff so now I want to show you how this thing looks like when you run it so I committed some stuff so first let's look at for instance the real-time mess of of the transcriber let's see oh that's pretty good or is it are we running that's correct it's the other one way beyond so thank you rasvan you can tell who built it so as we connect and we wait for the interim hello mr robot let's look at what's going on are we running I didn't do anything let's reboot that I don't like you okay so okay I think we should be good now are you there yet what mic is it using let's see it is using my mic well that's unfortunate I'll try to show you later so this is just this demo is part of the skynet project itself and the way it works is it uses an audio worklet it will send the audio frames to the skynet server and then it will render them here in the browser it when it does it it does it in a close to real-time manner and the idea behind is that you don't need the whole jigasi thing in a jigsabang to test the way it works so it's like a self-contained demo I'm not sure if it's getting confused with the network or something else because when I tried yesterday it was working fine and in fact I do see you know data being received in this thing but I don't see it being rendered here we're going to try one more time to see hello are you there and if not we're gonna move on so we're gonna move on yeah well you know experimental technology what can you do so never mind that's what I was thinking because this network is fun let's see if the other one works so in here I'm now going to load skynet because we build this thing to be modular I think of it as a modular monolith if you will which is it's one fat thing but you can disable parts of it and the reason behind it is that you end up otherwise needing to have simplification you want to have consistent logging you want to have a lot of things that are common and then this way you can select what do you want to run do you want to run you want to run only the transcripts do you also want to run the summaries all of that stuff is separated and you can decide if you want to run it or if you don't want to run it so I put some text in here of a fictitious conversation that a guy named Tudor and I had chat gbt is very good at coming up with interesting conversations for you to see how they summarize we're not necessarily interested in the like I'm pretty sure you can't read the content but I want comment on on the API a little bit so I'm going to copy this in case this thing goes to shit because we don't know now we send a post with the data that we want to get summarized to our summary endpoint hopefully it did get the response yeah and we get back an ID then this ID we can query in the job ID thing and there we go so here we get so we get the result of our job with the status of success the type is a summary it took 11 seconds this is a mac with an m1 so it's running accelerated on this very gpu and then yeah Tudor and seller are discussing the design of the both backend of the web app yada yada yada so we we built this API that kind of follows a bit of a kind of a polling mechanism because we found in practice that some you know even within within our workplace like this stingy proxies and summarizing a long conversation can take a little bit of time and if we had the request living for too long some proxy in the middle would decide to cut it and also in order to make it resilient to things falling in the middle and another of these machines taking over and then like running it again we we decided to build it this way so the idea is you post the summary you're going to get back an id and then with the with the job polling api you can get the jobs result as it's done i was looking into using for example event source as another option so you could have like an ongoing stream that's also a possibility that we'll probably look into adding to to make it a bit more a bit more palatable but we found that in practice this has been working well we have processed in the realm of hundreds of thousands of little summaries within within the company and we're so far happy with how the architecture is is working so we can focus on what we're going to do sort of next some of the things we we think we should do next are supporting multiple model runners or backends or however we want to call it as i said we started out with hosting our own llms but that has different trade-offs and there are different reasons why people may or may not want to do that so being able to run for example on top of this example was on top of a seven billion llama too but someone may want to talk to open ai directly maybe they are not worried about sending their data to open ai or they have a different deal or whatever or maybe cloud works better for you or maybe you want to actually not host the model yourself but kind of offload that responsibility to another company like open router which hosts open source models in that case what sort of thing does like what does this kind of do in that case well the nice thing is that it can shield you from all of those changes so you could we're thinking that we can change how you are actually doing the summary or sorry rather what engine you're using to to run inference but you don't need to change your api you don't need to modify any of your applications suddenly if we change to a to a model that works better you're going to just get better results and that is kind of the the path that we're trying to follow we're going to work on integrating late arrival summaries in in jitsi meet so as i said we'll focus on making building blocks so we can then plug them together to do this thing right now you can get a transcript with gigasi and you can use this kind to summarize it we are going to build a way so that you can glue it directly to jitsi meet to get the late arrival summary straight in the meeting without needing to have this whole bot thing which i'd be very happy if we go back to but uh yeah it may take a little while of course more prompt tweaking i think rob did a good job talking about you know prompting and and a huge prompt he had i'm not sure that ends ever so you always start with something and then get a tweak it also depends on the context right it's different to summarize an article than to summarize the conversation and you might want to take the model into a little journey or steer it into a given direction so i did that's a better job at what it needs to do and lastly actually this very past week there was a new release of of lava which is the open source vision model and in the context of a meeting which is kind of our our main focus it would make sense for instance to summarize slides so if someone because it does help you capture the context of a meeting if someone is sharing slides they're probably important and they're probably the central part of the meeting so if we could use lava to get a glimpse of what's on the slide plus the transcript we think combining all of that would give us a pretty good view of what happened in that meeting and this way again help our users those who were not necessarily in the meeting all of this we open source the first scan and version yesterday is now available there with yeah the full get history so we'll see our mistakes and everything we learned along the way we're not you know learning machinists here or mad scientists we are learning ourselves it's a brand new world and i have two people to thank for they are they actually came all the way here uh ross vanduuder thank you for joining me on this interesting journey i would definitely not be here telling you this if it wasn't for them so uh very thankful and it was a very exciting project to to get started and then we can take it further i'm not sure how am i on time because this thing resetted but that's all i got to tell you today if there's any questions i'm here right there at back so you mentioned that you are happy with architecture of your setup right now how happy are you with the actual summer rates is there much hallucination going on would be different from a human that is not necessarily fully aware of all the anti-cellual good question so ralph was asking um we're happy with the architecture are we happy with the summaries um so so you know another we are we're trying to support two apis summaries and action items now a nice thing about about um about this problem application is that if it fits in the context window right like the lm does not need to invent anything and everything needs to know is within the data that you give this is the conversation what was said here um and it can come up with decent summaries at the small model sizes but where small smaller models sort of where they missed the mark is that capturing what's actually important in a conversation what are the key points that we should focus on when i'm summarizing this thing and we're trying to find where is the right balance how high up do we go in the model size um so that we get a better summary in also in timely manner because of course going bigger will also mean slower inference and it will also may mean higher cost so uh we we didn't spend a huge amount of time in improving that yet we have something that you know works okay but it can definitely be improved so we we are not you know we we can be happier so we we are focusing on making that that better and our next step in that direction is trying to make this more available for example by sharing it here and also by sharing it within the company so that people can run their meetings and get them summarized and this way we're going to you know tweak improve and rinse and repeat all the time question there have you guys looked into using the language bubble in the browser itself using for example there's a root implementation blah blah which has wgpu full which means you can use the browser to you know interact with the language bubble and I was just wondering if there's been any experimentation with not doing it centrally on a server so the question is um that there are implementations that allow you to run the model directly in the browser uh by a rest with web gpu and if we have thought of running it in the browser no we have not um that said it's a very interesting idea and that is one that actually fits what I say in the beginning of the the sort of the bots thing because the browser is such like such a competent beast it can do everything um you could actually test that by doing it that way because essentially what that was is a script that the browser run in a container and it had access to everything so I think that would be a like a cool place to test because basically what our first test did was we used javascript to tell the transcriber hey give me a transcript in real time and then we would in real time talk to open ai let's say and get back results so in that sense running it locally would be something that's completely attainable very interesting I think it's it would be interesting to give it a go I'm not sure if when it comes to because one of the advantages of model like uh running a centralized thing would be the fault tolerance for instance so the fact that you get that you send a transcript from the dispatcher to store it in the cache and even if the inference the whatever node is running inference crashes another one will pick it up we have a mechanism to uh so that another one picks up if some work has not been updated in a while and if you push that all the way to the client well a everybody would need to run their own inference so it feels a bit more wasteful overall if you sum it all up I think right off the bat and be uh yeah you have the the problem of if it fails with your browser crashes you're kind of left out in the cold and in the other case the server can take care of it and send it to you you know by email when it's ready I think but I think is a very interesting thing right there behind you yes possible to connect your other your partners webxs and zooms and what right there are companies right so the question is sometimes you talk with external organizations and would you be able to put the bot there so the bot was the experiment that took us where we are but right now that effort is paused now that bot was a bot for gtc meetings specifically so um the idea here is the moment anyone from any organization joins one of your meetings you would be able to transcribe it now if you join their meetings uh yeah we don't have that I know there are companies that sell proprietary products that integrate with different meeting providers that can use these so it's it's a little bit of there's companies that use it externally then there is uh each vendor that is sort of building this up in themselves I'm not sure what the right answer is I think it feels like to me there's there's a lot of one of the hardest parts of working in this space has been filtering the signal from the noise there's just so much shit going on but I think this particular features really blend well with a meeting's product so adding the making them building so whenever anyone from wherever they are end up in one of your meetings when you record it in fact this is a change that is coming on the next stable release is that making a recording will involve making a transcription unless you opt out because at the end of the day the recording contains everything that's also in transcription the transcription is just making it more palatable so you can then operate on it um and that's sort of the direction that we're taking at the moment question here yes that is a good question um so that is a problem we have not solved yet so we do have the capability of doing translations but the the uh at the moment the only entity that we use to do translations is google cloud so that's why sort of out of this picture and also there's a limitation of the source language so you can get multiple you can get actually subtitles in different in the language that you want but as long as the source language is English so what we are looking for which is a bit further in the future is well first of all identifying the languages on the fly so the source language doesn't matter and then having the ability for the source language to not be set so to be different so that we can both be in a meeting and I can be talking to you in English you can be replying in French and I will see subtitles in English you will see subtitles in French and then we'll get the summary I don't know in Mandarin man because it's going to be fun but that's you get the idea that that's that's where we're going but more work is needed in that area um and that's all you can find me in the whole thank you very much
Moving real-time AI inference to the edge with Infernos
Hi everybody, thanks for the wait, sorry, a bit of technical challenges, but now we can, this is the last talk related to AI, to AI staff, and I think it's also a nice continuation from Saou's first presentation, so we'll have Maxime coming from all the way from Vancouver, I think probably the farthest participant here today, so please go ahead, thank you. Okay, thank you, thank you Lorenzo, thanks everyone for attending my presentation. So let's get started, today I'm going to talk about a little bit some of our newest work on the eyesight of things, we've been a little bit about myself, I'm born and raised in Ukraine, I have masters in physics and radio physics from Key State University, I live in Vancouver for about 20 years now, father of three, and I've been involved in CIPA in various forms and projects and whatnot since about 2003, so basically since that first day I give a little bit of background of what me and open source, today I discovered free BSD in my university at some like second year, got very curious, started exploring and eventually submitted patches and all that, got accepted as developer, and then found CIPA Express router which is kind of later became came earlier in OpenCIP, added some modules, we use it extensively, I also created Pproxy, which people who use CIPA probably know about, and yeah, I get busy with various open source projects to this day. Speaking of machine learning, I started in about 2010, read this really nice book, it's actually free book, it's at 99 sort of age, but surprisingly got pretty good basic introduction into neural networks and all that, I already got a little bit curious about that, and then as time went on, it became more available, I trained a little toy model just trying to figure out if I can detect DTMF from 729 encode frames essentially, because 729 is just a bunch of floating point of coefficient, so you can feed it to the model and get sort of DTMF detection that way, it was just kind of weekend type of fun project. I also got installed a little AI powered ADAS called Coma 2, and it's open source project also, the user AI model to drive your car, you can really like install it yourself, you can hack it, and it's pretty nice, a good community, I participated in some events there. Also played with DeepFake lab, which kind of DeepFake framework, which allows you to work faces, also involves some training, so it was curious, and also lately I've been in San Diego, we've been doing like building a little from scratch, we've been building model which will drive a little robot across the office, also was quite fun, lately I've been also playing with Mu Zero support, first for the game of Digger, which I maintain, Linux port, it was pretty fun to sell highly interested, this is just the device that drives my car, and it runs open source AI model inside that little chip. So anyway, back to the main point of this talk, we're basically looking at that from our perspective, which is trying to build a system which is, we basically have a lot of customers who route calls with us, so we try to build a model that can be scaled out for provider level, not a model but software which can run those models. So basically the idea behind this project is that we try to figure out how to scale those models on reasonably priced hardware, so as the problem with the area right now, we didn't kind of vacuum tube error, as I call it, machine learning, because we have very expensive hardware for, you know, to run something mixed trial, you need, you know, like 64 gigabytes of GPU, and that is very easy, you need several of them or you need a very expensive one, but eventually all this stuff will get more affordable, so we're trying to kind of work towards that goal. So right now there are like two major frameworks which people use, which torsion and tether flow, and they are also pretty heavy as well, it's like hundreds of megabytes, if not gigabytes, to get stuff running, but it looks like at least from my perspective, in a few years we'll see some changes, already, see people working on alternative frameworks that are expected to be more lightweight and more flexible in terms of environment, because Python is not very easy to scale and integrate into something like CAPIs, although doable, but anyway. So we started earlier this year with the project, and the original idea was to just, for starters, we tried to implement text-to-speech, we already took our SIP stack, which is conveniently Python based, so that was pretty easy, pretty much like 20 lines of code to get SIP endpoint implemented RTP generation thread and tried to run those models. Essentially my, yeah, so basically we started with this guy, and it's like four gigabytes, cart, Nvidia 1050, and I was able to run like one channel essentially on it, of text-to-speech, and then obviously the next thing was like how to scale it, oh no, hold on, yeah, a little bit into how this works, essentially. So text-to-speech is, at least with transformers, you basically take your text, you send it to your model, and then it runs multiple iterations, so you basically have all your stuff through the model, and then it spills out something like male spectrum, which you put through Valkoder, and get the audio out. And then the first problem we run out into is that it basically required this run for the whole duration of the audio, so essentially on my small GPU it would take quite a lot of time to actually produce audio. So I had to modify it's a sph-t5 model, it's like not the latest, but one of the pretty good ones from Microsoft a few years ago, they released it, so I had it to modify it. I mean I re-wrote the model itself, not the model itself, but the pipeline, so instead of processing the whole audio in one go, I made it so it spits out those smaller chunks. The unfortunate problem came with that is it started like clicking, because Valkoder probably was not trained for this mode, so I tried to retrain the Valkoder, it did not go well, very well, so it did not produce a good result, so I had to build a little kind of post Valkoder scene, which smooths out essentially and fixes those clicks. And now it sounds pretty well, I will maybe play some examples when I get finished with the stuff. And then how I tried to scale it, so I got kind of normal size GPU, I would say, so it's like 16 gigabytes card, and I expected to get like maybe 10 times, 20 times performance, just looking at the spec, but to my surprise I only got like two times more performance, so I can only like with this model, I can only run two real-time threads of TTS on the bigger card. So I started looking into why this is happening and how to improve, because theoretically it has much more performance. Turns out that in order to get good performance out of those models you need to use batch inference, so instead of generating like each prompt, each audio in each one session you batch your prompts that you need to voice out and submit it into the model, and then it generates all those streams at the same kind of cost. Because my main problem with using GPUs is that they are pretty widely computationally, but they are pretty expensive to send, so it's like you're operating through very slow network with a very fast device, so you need to load a lot of, well not a lot, but several jobs to it. And so I considered like several ways how to batch it, so my first idea was that maybe I can just vary the size of the batch, so I run it in continuously, but then as sessions come and go I can add or remove them to this running batch. Unfortunately this does not work with the sequence to sequence model, because internally it kind of clocks itself, so you cannot really add another session of it, so they should be running all of them at the same time, so essentially you need to do something like this, so you batch a bunch of sentences and that you need to generate, pump it up and wait for them, all of them to finish, and then at the time you can collect a new request, then you batch them up and repeat again, and obviously if you have like pretty powerful GPU you can probably run a few of them, or if you have several GPUs you can improve latency by running on multiple of them. Yeah, so that part kind of works. The next thing I'm right now working on kind of in the other way, so we need to have something like Whisper to do the other way around, and that one already supports batching, so basically should be pretty scalable there as well, so right now I have on that $300 card I can do 50 sessions of text-to-speech in real time at the same time, so it's pretty good result because this is all running locally, so I don't use any anything pretty much and I can run it on a reasonably small device. And yeah, I guess the last thing that I played recently was there is a framework called Ray. It's basically when you can build a little cluster of machines with different kind of, well maybe the same GPU, maybe different hardware, and distribute your training or inference work over them, so that's just me running and not 20 probably games of digger. It's basically, all of this is a model doing inference, just looking at the screen doing inference saying where this should go basically, and kind of training itself to win at some point of time. So yeah, Ray, yeah this is like a CCH training, kind of improving a little bit, maybe, but anyway, a good interesting part of it, I figured out how to use that Ray, so it's kind of useful open source framework that you can kind of scale, use to scale up your AI project, so I'll probably use some of it to distribute. Yeah, there are some links and I guess I have probably a few minutes for questions. Yeah, yeah, yeah we can do video technically because it's all about, you know, as soon as we have like the whole mechanism set up, we can do video as well, so yeah. Right now I'm using PyTorch, but oh okay, the question is what kind of model can they run? Right now I'm running with this existing code, I'm using PyTorch, but I also played with TinyGrad, so I might use some of it as well because as I said, it's kind of very lightweight, so the whole goal of the guy who wrote it is to keep like usable framework in like 5,000 lines of Python code, so it's kind of very interesting from that perspective, but yeah, it's not really limited, it could use anything, right? I think another question, no? I have one. If you have another question, please give a round of applause.
Build your ENUM LCR Server using CGRateS
I hope you can hear me. First of all, thank you for having me this year in Fosdum. My name is Saber Katelari. I'm a core developer at IDCS.com. And today I'll be showing you how you can build your own enum as your server using CG Rates. Firstly, something about our company. It's located in Bavaria, Germany with backhouses in Romania and Albania. We have over 17 years of experience in architecture and server-side solutions in voice-over IP environments. We have platform implementations covering both wholesale and retail businesses categories. And by now we are responsible to understand real-time processing and constraints and serious life system outages. Something about CG Rates. It's a real-time enterprise building suite, more like a framework since it can do many things. It's pluggable into any existing infrastructure. It's non-intrusive into existing setups. So it means it does not force you to make decisions. It's all dependent on your system admin if you want to take into consideration what CG Rates gives you or if you just want to ignore it. We are an open-source software since born in 2010. First sources published in 2012. Full sources are available in GitHub, 100% in Go. We always mention Go because when CG Rates first started, Go was also in its first weekly releases. And this means that we were one of the first implementers of Go. And it also means that everyone that we also paved the way for other people coming after us. We have no add-ons in private repositories and we take into consideration community contributions also. About Engine. Engine is performance-oriented. It has this built-in advanced caching system with transactional list record use and time TTL expiring records. It's asynchronous, processing with micro threads. If you know about Go, you probably know more about this. Also including API load balancer. We have three branches, V010, master and 1.0. V010 is our most conservative branch. Master is where we have our most recent developments. And also 1.0, we call it like the pinnacle that CG Rates can do, but it's still in early developments. We have a test-driven development environment with over 10,000 tests as part of our testing suite. Here we can mention unit tests, integration tests, and also call tests for switches. It has a building modular architecture which is cloud-ready. It has microservices with a rich set of RPC APIs because everything in CG Rates is API-related. And it's easy to enhance by rewriting specific components. So for example, if you want to rewrite the engine in some other code, you can easily do so. Some features for about CG Rates. You can do online offline charging system. You can have multi-tenancy from day one. This is more for wide labeling platforms. You can have multiple databases supported. We have multiple databases supported to mention some MySQL, Microsoft SQL, SQL Lite, Mongo Rates, Postgres, and also our internal database, which is compatible with everything we do. This is also a pretty challenging job to do for a relatively small team that we are. You can have real-time configuration reloads. So you can reload your configurations without having to shut down the engine and open it again. You can have rating engine with derived charging and in-number rating. You can have account balances and management with bundles and Dynaprepate. With Dynaprepate, you can create accounts on the fly and have it give some restricted permissions or limited permissions to your system. You can have sessions or event charging with balance reservation and refunds. This is prepaid logic. Stereo-shaken authentication, which is more for North America. CDR logging with support for interim records and rating cues. This is when you have your CDR sitting in a black box and have it communicate with your switch and have your CDR straight at the end of a matter of milliseconds without using any databases from the CDR side. You can have high-number of interfaces for event readers and exporters to mention some MQP, SQS, SQL, CSVs, XMLs and a couple more. You can have fraud detection with automatic mitigation, LCR with quality-based bundles, quality-based stats and bundles, call statistics with pattern monitoring. So you can find your ASR and your ACD live from your CDR rates. And also in combination with your proxy, you can find your average call cost and your total call cost. You can have dynamic pricing imports with templates. This is since all suppliers have different formats and CDR scan is compatible with most of them. You can use it with diameter, with radius if you need some authentication, Wi-Fi authorization. With DNS if you need enamel CR routing, which is also the topic for today. And you can also have a basic SIP server where it can do redirecting with your CDRs. You can have it redirect traffic from your switch to your CDRs with some routing and IP addresses. Well, else we have resource allocation and controller. This is some virtual channeling for your customers. You can have your API server with Gop Json, HDB Json support, built-in high availability with dynamic partitioning support, API capturing analysis service. This is something like an internal grant for CDR rates. Clustering through remote, replication for internal cache and database. Data versioning with automatic migration. This is when you need to move between releases in the same branch. You can do so with data migration. You can have and we also do, we also are agile in developing new features. So if you have some feature or some idea that you want to bring us, you are more than welcome to do so. This is an internal schema or diagram that we have for CDRs. It basically shows how CDRs has its components and interfaces and how they communicate with each other. On your left side you can see all our interfaces. You might notice that we don't have open SIPs over there because open SIPs has its own native module which is faster and better than anything we can do since it's native to open SIPs. And if we take one example, for example DNS agent which is on your left, you can see that it communicates with sessions which is our main subsystem and through there it can communicate with every component or all components at all or one component. It's all dependent on what you want to do with CDRs. For some use cases, again online offline charging, you can have a highly configurable rating bundle with voice, data, SMS, MS, monetary or anything else. In 1.0 you can really charge anything else. You can have there concurrent sessions with concurrent sessions handling and also a centralized CDR server. And this all together is what others call online offline charging system. Another use case which you can do is a dynamic routing system where you can use the dedicated subsystem for various routing strategies. There we can mention load balancing, the difference in our load balancers is that we cannot use setups but only real calls since we get that information out of CDRs. Also you can have LRN support via attributes, bundle support routing systems, quality based stats monitoring with thresholds and also load balancer which I mentioned. Now to get to the INOM LCR server that the topic is for. Firstly we need to know about DNS, probably most of you know but DNS is something like an internet address book where you query for something and you get information back specific to that what you question for. Depending on your answer the answer is categorized in some record types. There's a couple but we only work with these three, A-Type, SRV type and NEP type records. We work only with this because that's what most people need and nobody has really asked for anything more than this. To shortly describe them A-Type records convert domain addresses into IPv4 addresses, SRV records for network servicing. You can find priority, weight, port, targets from your SIP addresses and most importantly in NEPTR records which convert INOM addresses, INOMs into IP addresses. But what is INOM? INOM is basically a standard to translate telephone numbers into your eyes. Here's an example how you can do that. Firstly you need an E164 number. You can convert your number into an E164 number by firstly removing any leading zero before it and also adding your country code after it and with a plus at the end. Then to convert this INOM 164 number into an INOM number you have to remove the leading plus, reverse all the digits, add a dot between each digit and then add a suffix. This suffix, the one you have in this example is from RFC standards but in C-Drates we don't really care what you put in your suffix. In my example even I even replaced this ARPA later with the account string that I will use. For DNS agent I also mentioned earlier it's an interface, it's like a middleware where your DNS client communicates with DNS agent and then sends that information, that request to the DNS server and then from there maybe you can see from the schema. From there you can go into sessions and any component it can take any component and then give that information back to the DNS client. In terms of capability you can have as many listeners as you want. Also to mention in DNS agent we also implemented our DNS server and DNS service and listeners and for listeners you can have as many listeners as you want and they can all be opened at the same time. You can have UDP, TCP and TLS protocols and this means it is highly configurable and concurrent. Again for query types we support ASRV and NAPTR. For configuration this is in your configuration files. You need to open a new field, name it DNS agent, also this is JSON, everything is JSON in configuration. Name a new field DNS agent, enable it, by enabling it you allow it to receive and to send API calls. Then you name listeners where again you can see that it's a list so you can have as many listeners as you want. You name your address by giving it an IP and a port. In my case I use an empty IP since if it's sent by default in CJA we put what's in defaults and in this case in default is just localhost. For port I put 2053. If left empty again this will be filled by the default which is 53. And for that address I need to attach it a network. On this case I use the UDP protocol and again if left empty again it will be on UDP by default. After that I want to also be open to TCP listeners. That's why I create the same address but this time I changed the protocol. This doesn't mean that either one or the other will work. It means that both of them will work at the same time. There's something messed up over there. They should be on the same line for the last one. The address for TLS since I cannot have TLS and TCP on the same address I can put it in a different port for this example. And after you finish with listeners you go to connect your DNS agent with sessions and you do that by using session cons. You can have either localhost, internal or some configurable other connection which is done by you. I use in this case localhost since I want to track the network, the packets going through sessions and DNS agent. You can switch it with internal if you want to have a faster connection or if you do not need this debugging, this packet tracing. Just on that same DNS agent field you put request processors. To short explain request processors do the logic of what's going to happen after a query is done to your server. In this case you can have many request processors. In this case I'm only showing one. And this is what happens with it. First we define an ID for it which has to be different from other request processors. It doesn't matter what you put inside, it just has to be different. So in this case I'm describing what I do in this process which is NAPTR list cost route. After that you define filters. Because I want to find the list cost route to find a Cp address for my query. I first need to be sure that the query type is in NAPTR and that the leading country code starts with 32. This is just an example. You can have any filter that you want. The first filter asks the query type from the request if it's a full NAPTR string. And if that's true it goes to the second filter which finds if there's a prefix starting in that query name that starts with 32. And before it does that it converts that in number into E164. And that's done with filters. If those are true it goes to the next one which are the flags. In my case I want to create an event each time this query is being made. So I put there meta event which calls an API for sessions process event. Each time this query is true. And I also put routes authorized because I want to get the max usage when the query is done. And I also put routes because I want to do list cost routing with it. Next I put log there because I want to get some logs out of the query when the query is done. So I want to get the request and the reply from the query. And after that I put request fields. The request fields are what you want to populate when the query is being done. In this case I want to populate account, destination, set up time, type of record and usage. I want to populate this because I want to put them in my event later and the event needs to use them. How I populate them? I populate account with the query name by stripping away the first E164 and what's before it. So it leaves me behind with only the 1001 account which I will show later. This way I populate account with 1001. In destination I put the query name fully converted into E164. In set up time I put now for the current time of the query, type of record voice and usage of one minute. For the reply fields I want to put what I want to reply to the DNS with. So I want to reply with order of 100, reference 10, flags U and service E2U plus CIP. In the most important part the regular expression which I find through route parameters. I didn't show here but I created a routing profile before and I put there two information in two routes and that information are the CIP addresses which are different. One of them is of highly cost and the other one is least cost, is lesser cost. And since I have that meta routes flag over there, those routes will be sorted using least cost. And since I have reply I want to find that reply the routing parameters for that first index of the route. And the first index is always depending on the sorting route and make it least cost, the first index is going to be the least cost route. And under the reply you can see the reply. I find in the structure routing profile I go to run ID meta row, meta is in this case asterisk of iteration 0 of that ID. I go to routes of iteration 0 again and then I find the value of routing parameters which is the CIP address that it finds. And then I populate it to that regular expression. After that I just also put the replacement dot at the end. For the client, for the client I'm using dig, in this case I'm couring localhost on port 2053, the type of regular this NAPTR. And you can see the N number that I put there. You can see the 1001 account at the end. For the reply I captured this using ngrep. You can see the API that gets called sessions process event. In the flags they are the exact same that I put in my request processes. The tenant gets automatically taken by default configs which is cj.org, the ID is some random number. Time is the current time of the query. And in the event you can see they are exactly what I asked for in my request processes again, if you can see. And that's just the request for the reply site. I can see the reply from that API where I find the max usage of 60 seconds. If you remember I put one minute of the request. You can see that it's also 60 billion nanoseconds. This cj also works in nanoseconds. Also I have the reply on the routes profile site. You can see that it found the routes account for 1001. You can see the sorting that it used. It's LC for list cost. And also it shows all the routes that it found sorted by it. And you can see routes with ID route 2. You can see the Cp address ending with 12 and the cost that it would take of 60 units. And the second ID which is more costly with the Cp address of 11. And here we get the reply back from DNS agent after it's done. You can see that it found a regular expression with 12 at the end which was 60 cost units if you saw from earlier. And also as another use case you can have a fail fallback. So for example you can have multiple answers over here. In my case I would just have to make another request process. And in this case I put just one instead of zero over there and it gets the second list cost that it finds from routes. By that you can just get the second answer also. And that's about it. Any questions? I'm guessing not. If you have any questions you can also ask them at our Google groups. Oh sorry. Yeah. Going back to the request and the response. I saw you had a, in the request you were getting an account ID. How are you figuring out the account of the person asking according to DNS? Well it depends on what you want to do. In my case I just put that in my request on the DNS client over there. You can see at the end it's in that 1001. So I give it myself that account ID. Okay so you're giving each customer a phone top level domain name. Whatever you want. Any other questions? Okay. Thank you.
SecSIPIdX - Library, CLI tool and RESTApi server for STIR/SHAKEN
you Okay, so now it's Federico's turn unfortunately Daniel from the Camarillo project couldn't be with us today but Federico will talk, will talk to us about steer and shake and maybe tell us why I'm getting all those robocalls from an energy hour Good evening everyone can you hear me? Yes, okay, thank you, so thank you Lorenzo for the introduction. Yeah, my name is Federico Cavidu, I am a voice lead developer at Libon, it's a voice application developed in France and I am contributor of the Camarillo project and this presentation should have seen also Daniel but he couldn't make it today and Daniel Constantinierla is the co-founder of the Camarillo project and today we are here to speak about sexy PIDX, which is a library CLI tool and REST API server implementing steer shaken So first of all, steer shaken, steer stand for secure telephony identity revisited and shake and stand for secure handling of a certain information using tokens They are a suit of protocols and procedure that have been designed to try to fight against robocalls and colliding spoofing which is a huge problem in certain countries and the name shake and it was inspired by James Bond and his predilection for Martini drinks steer and not shake and so the name shake and since steer was already existing was created as someone said torturing English language to get this acronym So basically how does steer shake and work? Basically when an originating provider receive and invite, so a call First thing I decided is to decide which level of attestation, which level, how we can, one of the three level of attestation that can give to the caller So there is a full attestation which is the service provider has authenticated the calling party and can be sure that the entity using the phone number is authorized to do this The more simple example is a landline subscription or an operator running a mobile network or a VoIP network where the devices are directly connected Basically any way else to physically verify that the user is who claims to be And we have a partial attestation and so the service provider can identify the caller origination but he cannot confirm that the caller, say the call source is authorized to use the phone number So the call number would be an example, an calling number behind an extension of an enterprise PBX And then we have the third level which is the gateway attestation level And when the service provider can authenticate the call origins but he cannot verify the source, this could be the case for example of an international gateway provider receiving a call from an international gateway So once decided the level, the attestation level, the originating provider can create a CIP identity either, we will see later how it is formed Which contains information about the calling number, the call number, the attestation level, the call origination and the certificate that it has been used for signing the identity And the CIP invite is sent to the receiving the destination provider which in terms using the information of the CIP identity can verify that the identity is itself So the global shake and steer framework includes several components, several entities and a whole layer of governance which is being public infrastructure And we need a way to trust each one's certificate, so we have some policy administration which is responsible to delegate to identify which are the certificate authority that are allowed to enable to issue certificate for this infrastructure And where it is used today, as today is deployed and forced in US, in Canada and since a few months in France where it goes under the name of man, mechanism d'authentification d'anime But it's basically using the steer shake and it's just the governance layer is slightly different and more centralized Ok so let's quickly look at how this CIP identity either is built, this is the, you could say here, I was in search of it, in a message and let's have a look Basically the CIP identity header contains a JSON web token and three parameters, the JSON web token has three sections, header is highlighted in blue and red we have the payload and then the signature is the green part And then we have three, the identity header has finally three parameters, they say what are these attributes and parameters, so sorry I forgot to say that both header and payloads are based on 64 URL and encoded JSON So this is how they look, the other part is encoded and decoded and it has five attributes, the algorithm, the encryption algorithm which must be ES256, the extension PPPT, the extension used, in this case shaken, the token type, the passport And the X5U which is the location of the certificate used to sign the token or whether the public certificate is used to validate the signature And while the payload part contains information about the attestation level, the information about the destination, the timestamp and the original ID which is a unique way for the provider to identify the actual sender And finally the signature part which is obtained by base 64 URL link, the encryption, ES256 encryption of the concatenation of the JWT header and JWT payload both in their base 64 URL encoded And finally the parameters, the three parameters of the identity are the info, so again the location of the certificate must correspond to the X5U attribute of the JWT header, the algorithm used again must be ES256 and the extension used that must be shaken So all of these said, so how can you implement a steer shaken in your infrastructure, not only from the SIP, not only the SIP part but also the other component that you need for the certificate, the validation service, the signature service, etc. At least my knowledge there are two open source projects just focused on steer shaken, one is the SignalY libsteer shaken library and the second one is SIPDiax Did you make this slide? Sorry? Did you make this slide or did Daniel write it? No I did Why? I'll ask again, I think it's okay Okay So SIPDiax is a project created by Daniel and it's the GitHub repository and it has three components, the SIPD Go library, the main component is the library exporting the common function CSEC PID which is a C library wrapper to build dynamic or static libraries and the H files and that what it uses, we'll see later in Camelio to build the SIP module And then we have SSEC CPIDiax which is the CLI tool and the HTTP server for checking or building SIP identities So why a standalone project first? So first was the idea of starting developing a extension for Camelio in goal language Also the idea was to have an HTTP API service that could be used by many SIP server nodes Also it is an easier way to integrate with older releases that doesn't have native support for Steershaken And finally it has the command line tool could be useful when you're in the bug phase There were some examples with a command line you can generate a full identity header passing the parameter of the header An example there you can check here, you can check here an identity header stored in a file, this is more useful for a command line for debugging But you can also run SSEC CPIDiax as an HTTP server and it exposes this way or as an HTTP server and use it to validate the identity header of your traffic Or at the same time or generate them, so as I said you don't need to have the full Steershaken stack in your SIP server, you could use external services And also it can be used to serve the certificate publicly, to serve the certificate so that when you are sending an invite with an identity header the certificate will be retrieved from this URL And as I said in Camelio the SSEC CPID module is implemented using the SSEC CPID wrapper and it exposes basically all the functionality that you need to implement Steershaken in your server There are various, you can check, get, add, build and assign identities and all these functions accept several variables in various forms And finally a simple example of how it can be used in a Camelio typical routing script, so loading it, a couple of parameters about the expires of the JWT or the time outages, the time out, you want to configure a rule to retrieve the certificates in case you don't have them cached locally And that's it, I should have any questions, I want to remind Camelio World Conference division that will take place this year in April 18, 19 in Berlin Is there a HVP to the server behind the scenes? No, no, there is a, okay, I go back Okay, so it has three components, the library is in a go, but it has also a C wrapper, so Camelio is using the library for the moment, you have to compile it and use it statically because it's still not packaged in any distribution But instead, yeah This isn't looking to go code directly via a wrapper, okay Yeah, sorry I'll go back to my 31 So have you dealt with your shake-a-friend with MAN? Yeah, and, yeah, so he's asking if, sorry if I am... I have a complete application here, so we found out that they have this different requirement of the original telephone number and destination telephone number where you can go past the E164 link Yeah He does sexy things, like have a... So, okay, so he's asking about, as I said, there are some differences because it's French implementation, so And honestly, I'm starting to look at the actual implementation, a large total specification of the MAN because, yeah, there are some differences And, yeah, we will have probably to adapt Yeah, sorry about your intervention, I was just, as I was saying, if I realized that our implementation of this is in the library The study was about open source libraries, not about open source information No, it was exclusively about a library that you can use outside of The secure zip is done as a library, so is that why it's... It's nice Thank you Is there also a CLI or something that you can use? Sorry Is there a CLI, a command line? Yes, I don't think... Sorry Yeah, so the same... So the CLI is called sexy PIDX And you can use it, let's say the same binary can be used as a command line to check or generate identity headers Or as an HTTP server, which can... Yeah, which has both the two APIs for generating and validating And the path for the X5U to download the certificate for Thank you There's not so much of a question, just look to see how I do it finally I'm hoping on the presentation and we hope to see you guys from OpenCypher Because I was trying to call Daniel here, so I was coming There's some sort of sharing in the way, but nobody is needed I know you needed Yes, sorry When you went back to one of your earlier slides, you were showing that there is a part of this third-stake in model Is that you go and contact some regulatory authority But all you see here is that you are just signing a JWT So how is that verification service integrating into this picture? For me, I... The dot of line on the bottom that you're showing here, how is this being taken into... This is simply the call, one sign That's an HTTP request to some regulatory authority or what? Yeah, but the regulatory authority is for the certificate Okay, so the authority gives you a certificate and after that... Yeah, so authority decides which certificate the authorities can generate You are allowed to generate a certificate and actually there is another A third level of governance which is the political governance So basically you go through that and you get your certificate I want you to have a certificate that you can sign all the calls And it's fully your responsibility to make sure you sign things with the correct attestation Yes, exactly Is the certificate any different from a normal S519 certificate? No So you can have a self-imgenerated certificate then? For testing, yes, but then not for running For any entity that requests? Yeah, for testing Yeah Another small difference just because we were speaking, I believe you were asking According to MAN operator should report to a kind of centralized database All the attempt to have a kind of global security mechanism Any other question? Does the library have a... I heard the mention right before this question about the S519 certificate being standard But I also noted that there is a small extension to the right, the end of this extension Sorry, I could not The check and framework, this end of this extension to the S519 Where you put kind of a range of numbers that the certificate is for So my question was does the library... Still not Care at all about the thing? No, not for them, no We don't care about it either It might be required according to the nation It's not required in France for the MAN map, but it may... If some other country decide to implement it, they could choose to So kind of related to the extent, you know, a longer number allowed on the France one Is there any provisions in the state system at all to allow for the tax prefixes? Sorry, two To allow for tax prefixes And you can put this pre-candidate to the dial number to signify the route that it's going to go through Is there any capabilities of sign... No, because it... We are all... We are all... But this is not the end Yeah I'm assuming that's why the number would be a longer than France one To allow for the tax prefixes Yeah Okay That's it Okay, thank you Thank you
Provide VoLTE/VoNR using OpenSIPS 3.5
Thank you for staying up until this point. My name is Liviu Kirku and together with my colleague, Rizvank Rainer, we are both open sieve developers for over 10 years at this point. We're going to be splitting the time into half. I'm going to be covering a bit about open sieves and then I'm going to be talking about IMS and why we chose to push more time into this direction and everything about the new IMS working group in the open sieves community. And then Rizvank will talk about the good stuff, right, the Volte and the 5G extensions and how we are tackling them. For this release, by the way, so there's that. About open sieves, if you're not familiar with it at this point, it's a high performance open source sieve server. It is fully RFC 3261 compliant. Various usage throughout the industry. It started around 2008, I think, so it has 16 years of runtime by this point and of experience. Multifunctional, you can use it for all sorts of sieve scenarios. It is programmable and it's written in C, so it's again, it's quite fast. Quick idea of how, I didn't start my timer, of what kind of companies are using it and this is just the tip of the iceberg that I selected on the, who is using section on the open sieves wiki. For example, it supports a lot of protocols from the basics of TCP and UDP then to the protocols which stack on top of them, such as TLS or WebSocket and other protocols such as sending head packets to Homer or MSRP for SMS in IMS networks, of course, diameter for accounting and whatnot. Also there are a bunch of extensions of sieve extensions that are supported from the very basic which are, I mean, by basic I mean ever since the project started, they were developed such as presence, the back to back is also a set of specs for which open sieves is popular. There is also an SDP parser. We don't do that much media in open sieves but you can definitely manipulate the SDP quite a bit. So there are those basic specs but they are also kept up to date, right? For example, the sieve digest authentication, as soon as the RFC came out, we got contributions and we got it up to speed with the latest hashing algorithms. Thank you to Max for that. So also the SIP push notifications, right? You could do them for just like a glue scripting here and there, you could achieve it but also they came up with an official IETF RFC for it which also is supported. And then there are other types of interfacing. From SQL databases, like the most popular ones you can think of, MySQL, Postgres, there's even a virtual database. We came up with that kind of lets you fail over between them, right? You can write to the virtual and it goes either to Postgres or MySQL as a fallback, let's say, from all the way to NoSQL databases. You get these document-oriented databases such as Mongo or Couchbase or caching databases, the memcache d and you can build your application the way you want to. And finally there's interfacing in the form of just do it via REST queries, right? Or you could be using a message queues or even diameter messages. Radius, let's try to move past it, right? It's so old by this point. Let's start using diameter. Or you could even interface with MI. That stands for management interface. It's quite a fun protocol that lets you interface, it's JSON RPC based and lets you interface with OpenSIPs. You can control its behavior, you can trigger reloads in various caches it can do or simply control the SIP signaling. There's also some part of that. How do you program it? It has a bespoke scripting language. It's not that difficult to pick up. You're also, in case you're worrying about, okay, I do not understand it, it's not readable. Yeah, exactly. Okay, it knows about it, interesting. There is also syntax highlighters so you can have quite a good time writing OpenSIPs config code which we've, I've tried, I mean, I've mostly tried to push it further down as time went but it is what it is. I think it's up to a good point at this by now. In that you get these nice variables. They are scoped. You can start from a single worker scope and then to the SIP transaction scope. They are kind of concentric, right? You get the SIP dialogue scope or you could even have like the biggest scope possible, right, with globally shared variables that are persistent throughout the lifetime of the server. There are lots of modules you could be building services with from SIP registrars or SIP residents all the way to doing more of an operator kind of workloads such as least cost routing. You could do load balancing to your Alba and Gateways. You could do maybe topology hiding on as the request exit your network and hide your IPs or whatever sensible information your SIP packets may contain, for example, the via headers, right? That's a typical use case. And even not traversal capabilities, right? Because we've also, we are familiar with the problems that can happen there with the contact headers and the private media, media IPs. So all of this leads to the creation of various class 4 services, right? And services such as SIP front ends, maybe you are building a platform that does wholesale trunking or even a simple redirect service that you could be scaling and building a business with, right? Such as portability service or least cost router or a C name dip as simple as that, right? Or even a stair shaken service, right? We also have a stair shaken implementation. Be sure to check out the stair shaken module so you could be doing either side. You could be doing the siding side or the verification side and that's it. Make a service out of it. Or you could go into the enterprise feature domain and look into building PBX experiences, right? With conferences, hunt groups, call pickup, call parking, IVRs, voicemail, all that kind of stuff. I put a small asterisk there that some of these require media. So not sure if I mentioned it, but OpenSIP is at the end of the day either a back to back or a C proxy. It's mostly a C proxy. It can be a back to back, sorry. But it does not do that. It does not do media. It does not carry media at all. So maybe you will have to use an additional service here, a free switch or an asterisk to achieve these features. Last but not least, it has built-in high availability and the clustering makes it so you can build these highly available services. And next is the IMS and a bit of info around it. It is a specification from the 3GPP consortium and it's not really an, let's start from what it is. It is an architectural framework. It is a set of recommendations, a set of microservices that you could be structuring your platform in order to not only achieve the voice services you want, either IP multimedia services, the group chats, the file transfers, the voicemails and so on and so forth. But also it is, the designs are modular. They allow quite a bit of flexibility and it does not force you to go down a specific path or makes it as loose as possible. So for example here we can see an IMS network broken down into its three major layers, which is first of all there is the concept of the user equipment, which is here on the bottom, your cell phones, your laptops that are connected to the platform. They are doing this via the transport layer. Okay, I have here on this slide. And this is not of much relevance to open-sifs per se, right? This is your radio access devices and the wireless access or the PSTN interrupt if you have devices coming from that side, right? They are hitting this MGW, the media gateway. But what's important is that they end up in your IMS layer, the control layer. And here is where it is of great interest to open-sifts specifically because two core protocols here are SIP and diameter and we happen to support both. And the components that open-sifts could fulfill the role here are mainly this. The first one is the proxy call session control function, CSCF. It is the first entry point into the platform. It can be discovered by the IMS terminals and typically receives its traffic through an IPsec tunnel. So this is one of the first things that Rosbano, one of the things that I think is the last actually, that he'll be talking about and we are still to implement this part. Coming next, there is another call session control function in the IMS topology that is interrogating CSCF. You could also put an open-sifts to fulfill this role and program it. So it acts as a relay. It serves the domain, maybe does some validations for the SIP calls coming in and maybe does some interactions with the subscriber, locator, with the SLF. Finally, and here is where the big part of the logic will be in the IMS platform is the serving CSCF. And here is where the users, the user registrations are processed and they are accepted. Of course, they are stored in the HSS. I'll get there in a second. It can also store billing information. It handles the SIP session timers and it talks to the HSS using the ammeter. If you take a look at the connections, the where is the full line is SIP and the dot it is the ammeter. And finally, we have stuff that is not really related to open-sifts and here is the HSS, the subscriber database. We talk to it via the ammeter and the store's user profiles provides info and pretty much you write, you both write into it and pull data from it. The SLF is only required when you have multiple databases. Media resource function also interesting to note provides the media functions and playing announcements. Again, you'll probably be doing this with some other software than open-sifts. The breakout gateway and that kind of covers the control layer. And finally, we have on top the service layer and here there is the SIP application server, which again we could be using open-sifts here. And it can host a variety of SIP services from redirect, roll, proxy, even a back-to-back user agent providing the multimedia services. And I wanted to kind of quickly go through my experience of working with the 3GPP documentation and my few tips on how to find your way around them because let's say we let's take the IMS-SH interface, right? And let's say we want to work here is where what it pertains to. It pertains to interactions between the AS and the HSS and also the, it's also an AS, even the OSA with the HSS. And if you make the mistake of just Googling for these documents, you might not know whether you've run into the latest version or not. Here we see we get the top result is V15, right? But it turns out that had we gotten to the 3GPP portal, which has a nice selection here and ways for you to filter the documents, just dial in the SH interface, hit search, and it gives you the exact two documents that you need. Let's zoom in through one of them. There is a nice versions tab. You can click it and there you go. The 17 is actually the latest version. And there is one more gotcha here. It's where the 3GPP works in tandem with the European Telecommunication Standard Institute. And that's where the standard actually gets published and accepted and they just put a one in front of it, right? So from here, let's say we have, that's the rule. You have a 29329 spec from 3GPP. Once it's accepted, it becomes the one 29329. So here now we know what to Google for. It's TS version 17 and it gives us, right? So we get the result. Alternatively, if you still don't find it this way, you could just go to the directory here on the ETSI website, on the ETSI website and just manually dig for it. You can go to the range, you get the latest and there you go. The PDF is there. So kind of interesting to note how these documents change over time and what to expect because let's, this specific SH document, it started in 2002 and it actually received updates even up to 20 years later. So this is from a couple years ago, the latest version, 17. And meanwhile, the app code has changed from two. These are, this is just a diameter packet and a bunch of AVPs and some command codes have shifted and the methods have stayed the same, the user data profile and so on. But a lot of AVPs have been added. So you can see, so we're going from eight AVPs all the way up until 25 or something. So the complexity of the networks has grown as more and more requirements have been put there. So they have been represented as AVPs, of course. So it's more data to be pushed. Also more errors to be replied to, right? So we have started with six errors. Now we've got maybe 12 or somewhere there. And to help to draft all of these and to understand better what requirements we are, we have to implement because they are, after all, even with the 5G, there are tens of documents on the 3GPP website. We started this working group for IMS on the OpenSYF mailing list. And the basic idea is just to get feedback to start open discussions, right? All of these are public. You can Google for them on the web. And they will help us a lot in shaping the development of the IMS support in OpenSYFs. And with that, I will leave it to Resvan to go into Volte. Thank you, Livia. This, Livia, presented how we got into the IMS world. Now I'm going to show you how we are approaching it from our perspective. That includes the feedback that we got from the working groups, as well as feedback that, as well as the way OpenSYFs works and how we put all of these together. Okay. There you go. And first, let's do some sort of history. Probably most of you already heard of Volte. It's voiceover LTE, or voiceover 4G infrastructure. It started, the specifications started in early 2000, but they got standardized in 2010. And then there were a couple of implementations. It was first released in 2012, but nowadays it grows a lot and expanded a lot. In 2020, they were documented around 226 operators. And it offers a lot of improvements over earlier networks, such as 2G or 3G. It has faster call setup, high definition of voice quality, reduced bandwidth, reduced background noise, support for video calling and video codecs and so on. This is more or less the Volte architecture, although it might be quite complex. As you can see, there are multiple components. What I really want to show you is actually only these three squares. So is the EPC Evolve packet core, the CS and the IMS. So what we are only interested from the voice perspective is the IMS, which stands for IP multimedia subsystem. So we are only interested in the multimedia features of this whole scheme, especially as Liviu said in the CSCF components. The same thing is shown here. So we are not interested, at least from the voice perspective or from the multimedia perspective, we are not interested in radio frequencies. We are just interested in the backbone, which offers us the IP connectivity so that we can carry our voice and our calls over these networks. This is the CSCF functions, which basically uses different interfaces, such as CIP for different communications for GM, where is it? Here is GM and MW, so communications within the CSCF, as well as diameter interfaces, which helps us discuss with the HSS or to the PCRF for charging. It also requires the IMS, aka, which is a way of mutual authenticating between the user equipment and the proxy. So we will be discussing about this later on. So that was in 2012, actually in 2010, the Volte specs were released. Five years later, actually, no, it's ten years later, the voice over new radio or voice over 5G networks specs were released. It brings some improvements over the previous Volte or 4G, such as better codec support, faster call setups, low latency capabilities, but it also requires and has a couple of specifications of falling back to Volte. This is quite important because it keeps the high quality of the call, even though you're not able to operate in a 5G network. However, it completely drops to GM 3G, so those that fall back is no longer available, whereas it was in 4G. This is how a 5G network looks like. Again what's important here is actually this is how a 4G architecture looks. This is the 4G architecture. I'm showing both of them because, as I said earlier, 5G requires or forces you to fall back to 4G in case, I don't know, for example, you ran out of the radio range, so coverage, yes. So both of them, you need to support both of them. However what's important is that the IMS architecture, more or less, stays the same. So both of them, wherever the user equipment reaches you through the 5G or through the 4G, the IMS subsystem is more or less the same in terms of architecture. However, you may see that these are different colored arrows. So here, if you remember, we were discussing diameter with HSS and the subsystems of the 4G network. Here there's a blue. That's because we will be using a different protocol and a different interface. So as I said earlier, the core architecture is quite different. However, the IMS architecture is the same. For us, as a Cproxy, as a CSEF helps us a lot because we don't really need to change a lot of things. We can have one simple deployment that serves both of them. However, we are using different communication interfaces, whereas in 4G we were using diameter, starting with 4G, we have to use HTTP 2.0. And again, I need to emphasize that voiceover new radio requires you to fall back to a 4G call in case 5G cannot be properly handled. So since we have the same architecture, we will be basically taking care of the same process, the PC or P or I or S call station control function, the CSEF function. However we have a different transport layer. We are still using CIP for the control plane. However, we are using HTTP 2.0 for interfacing with the IMS components. This is an example of fallback from 5G to 4G. Basically I'm not going to get into details of this slide because it may be quite complex. The idea is that you basically have to use the same infrastructure of 4G in case 5G does not work. This is why it's very important to also support 4G. I'm emphasizing this because you can't choose one or the other. Actually you can choose only 4G if you want to have a 4G network, but if you want to have only a 5G network, you can't have it without 4G because you need this fallback in case 5G is not always available. In terms of implementations, in OpenCS we try to keep things as flexible as possible. We try to avoid hard coding interactions with components. You know why? Well, exactly why I leave you said earlier because these specifications change very, very quickly. So you have to adapt as fast as possible. It's very hard to adapt if you have queries hard coded in your code. That's why it's from our perspective it's important to have everything in a configurable way. When we develop the diameter interface, we try to push as much as possible in the script so that whenever, for example, you want to issue a query to a diameter server, you can handcraft it in the script. So we provide this flexibility in the script. You just build your JSON which is equivalent to the diameter and just push it, wait for a response and handle the response accordingly. Commands can be handled both synchronous and asynchronous with regards to the message processing so it's quite efficient. We are also planning to act as a diameter server. This is useful in order to get notifications or events from AGLS for example. You need to update your profile. Another example is the user equipment gets disconnected, either turns off gracefully or not, who knows, or just runs out of network. You might need to know these events so that you can cut the communication or terminate the calls, the ongoing calls. We built it on top of the free diameter open source project. That's because we didn't want to get our hands dirty with diameter. So we just used somebody else's code which is highly used and highly stable and so on. So that's why we chose the free diameter library that is provided by the free diameter project. The only problem we had was integrating it from open-source multiprocessor architecture to diameter single process multi-threaded architecture. That's why we created a dedicated process which talks with all the other SIP workers and synchronizes in order to ensure communication with the diameter server or servers. In order to support 5G, we also had to create an HTTP 2.0 interface which is going to behave similar to the diameter interface. So again, commands can be handcrafted, requests will come to script. Again, this is important to us so that we keep open-source as flexible as possible. We already have the client side available through the REST client which uses Libcurl which is already supporting HTTP 2.0. However on the server side, we will be using a different library. It's called ng-http2 which provides different hooks that you can use in order to parse HTTP 2.0 messages. In terms of authentication, all the 4G and 5G specifications require that the user gets mutual authentication with the core network. So this means that you don't just need to authenticate the user but the user also needs to authenticate the core and create a secure communication between these two. So this is what the AKA authentication and key management agreement tries to offer through mutual authentication through one-time passwords. Basically these are some, they are called authentication vectors so it's a set of parameters that are shared between the user equipment, the CSEF and the authentication server here in the AHSS. So these parameters are negotiated either, for example, the shared secret is stored into your SIM. So when you get a SIM from your carrier, it already has a K value which is known by AHSS. So this is not real-time. It can also be real-time through some interfaces but it's not always like that. So the idea is that you negotiate some parameters of the AED through one channel and then when you start registering through open-sIPs, the SCSF goes and asks for these parameters on a different channel. This way these two will be able to create a secure channel. I will talk later about this. This is how... Go and get these vectors through different interfaces. For example, you can go through diameter. This is 4G. You can go through HTTP2. This is 5G. You can actually get a route and provide your own parameters. For example, you can even use CP which has support for akav1. You negotiate those parameters like simply copy-pasting from one project to the other and that's it. You will get aka authentication. So there are different ways of getting this. We don't want to hark open-sIPs into getting one single, only one of it. akav1 is susceptible to many of the middle attacks. That's why we also need to use IPsec. Once we negotiate the IEC and the keys, the integrity keys and safer keys, somebody else might spot them and might use them as a man in the middle. That's why we need to also use IPsec. However, this has been improved in the second version. Those integrity keys and safer keys can no longer be used starting with 2015. Basically this drops the requirement of IPsec. You can have different other channels of integrity. For example, you can use TLS. You first establish the call through TLS, exchange the keys securely. Then you can use that communication channel to communicate. In conclusion, in order to provide Volte, it's not enough to use just-sIP. Whereas it's enough to provide telephony using just-sIP and RTP. In IMS, you need to have a diameter client and server support. You have to have HTTP to client and server support for offering voiceover 5G. You also need to implement IMS, AK1, V1 plus IPsec. Or if you don't want to do IPsec, you need to implement AKV2 and WebRTC. There are a lot of requirements that you need to implement. Everything is very dynamic as you've noticed and very hard to process. In short, everything is MSM. Some of you that already have experience with that have the same opinion as me. What's next? We put everything together and start working on it. We already have the diameter stuff. We need to work on the IMS, AK implementation. It's actually almost ready. We need to implement the HTTP 2.0 server. What's very important for us, because this is something that the community gets afterwards, is putting all of them together and provide a fully working setup. We want to provide as many tutorials as possible to help people understand and prevent having them do what we've done, all this documentation and all this, as well as the IMS working work. We want to spare you of that. We can also provide some working examples through Docker containers. For example, the guys from Voice Center already did that with the first stage of our work. Yeah. Putting all of them together. If you want to get more on this topic and more about OpenCiv's, come join our OpenCiv Summit in 2024. It's going to happen in Valencia between 14th and 17th of May. Looking forward to seeing you there. If you have any questions, if you have time for questions, please shoot. All the time we want for questions. Go ahead. There's one over here. That's a very good question. The question is why do we need a specification for the internal communication? I think the reason, so they're actually two answers. One is in the Livio side. One of them is here. Where is it? Here. Indeed, HSS through CSCF is internal communication. You could normally use anything right here. It's very common for these implementations to change. That's why it's a good idea to have them. For example, the initial deployment of Open5.js was using the front hover HSS, which is a very nice project, but I'm not sure how updated it is nowadays. Nowadays, there's another solution for HSS. It's by HSS, which is more flexible. It's written in Python. If you wouldn't have this interface, you would have to redo the whole. That's one. The other one is, as you can see, HSS also discusses with CIP application. This interface, this may not be the same in the internal network. It's a different layer. It's a service layer. You can have multiple service layers. At that point, it makes sense to have some sort of specifications. Yes, but we are not, per se, sending the notifications in the OpenCIP configuration. What we give you is just a hook. We've got all sorts of the timer, the specification. It has a chapter where you're supposed to periodically force a re-register from the device, also through a push notification. Both that and the scenario where you receive the call imply that you should generate that push notification. How you do it, it's out of the scope of the OpenCIP. Probably you'll do some kind of a Python script, some bash script that grabs your app ID, app developer ID with the device ID. That's the PN, I've forgotten the name, PN, PRID or something like that. That uniquely identifies the device. There you go. You send it to... Yes, because it calls, for example, with Apple, the APNs for notification service. It's not CIP. It's probably HTTP getters. I'm asking this because I'm from B2Touch Project, which is a mobile operating system with completely different push architecture. It's flexible and that's pretty nice. That's interesting. It's a scenario you should be using push notification. There's another question, Max? I'm just curious, why did you decide to put, like, a build-in application? Why not rely on something like NGINX? That's also a very good question and we did think about that. Yeah, we considered that. We did consider that. We wanted to have a full solution by ourselves and fortunately, integrating HTTP2, we decided it's not that hard. What we are planning to do... That's true. What we were planning to do is also provide both ways or have the hooks. This provides the HTTP2 server but also all the hooks that an external server might get to push them, for example, through an MI, to use the same commands through MI. As you said, have an HTTP2 separate application that receives the notification and just triggers some MI commands to, let's say, terminate the calls or terminate the... We do consider this, if you have any. Yeah, all I could add to this is that this is a quick way of achieving kind of a version 1.0 of all of the HTTP2 problems. We've got this client-side salt, also some solution for the server side. If we do see this becoming a problem, then we could absolutely go with a design like you suggested where it is a bit more complex. Now you have to deploy both an Nginx. You have to kind of get your hooks in there and convert them to some kind of UDP datagram via MI. Definitely, that would scale better. But also, it's worth mentioning that we haven't yet have gotten a use case for it as server side to begin with. So why optimize prematurely? I guess that is the best reason. We didn't want to optimize too quickly until we get some usage because at least with the diameter server we have quite a bit of methods, right? The push profile request, the registration termination, there are quite a few where you really need to be... to be invoked by the server actually. No, but I think this is like a very good question because this is sort of the questions that we're trying to solve using that IMS working group. People getting us ideas and trying to get pros and cons of different solutions. Indeed, here we have two different solutions and it would be interesting to, let's say, debate these over that working group. So it's free, join it and your ideas and feedback is welcome. All right, no more questions. Thank you so much, guys. Thank you, Saul and Lorenzo, for hosting us here. See you all next year. Go ahead. I didn't get an opportunity. The question is that regardless where the radio part, the radio network originated the request from, whether if it was on a 4G, 5G, ultimately you get to the control layer, right? So how does that differ or what actually makes you switch from diameter to HTTP2? Why not always use diameter even in a 5G network? That's simple because... We could discuss it though. Okay, yeah, let's go. I think the answer is that the components of 5G don't support diameter, so you need to talk with... Thanks everybody for sticking to Leanne. Hope to see you next year. Thank you. Thank you.
The big adventure of little professor and its 4-bits handheld friends running TMS 1000
I'm going to talk about big adventure of little professor and it's four bit handheld friends running TMS 1000. Thank you K-Stuff. Okay. Hello everybody. I need to recover a little bit. So let me introduce my talk. I don't know if you know what little professor is from Texas Instruments. I can see some heads moving. So who has already seen this kind of device? That's okay. Who has already seen that device? Okay. Who had one when he was a kid? Okay. Who has already played with one? Okay. Thank you. So this is the starting point actually of my talk because, well, I'm from a computer museum. I'm also a computer scientist and retro guy. And in the museum, actually, we do a lot of things but we collect artifacts and we also try to explain them and to preserve them. And one year ago, we got one of those and it reminded me of my childhood and what was in this device. Okay. So if you want to play, I can circulate. I want them back at the end of my talk, of course. So you can try it. And you see, I have a bigger version here. We'll talk about it at the end of my talk. So the idea was to, well, first to look at that device. So you can see here, little video. So it's very simple. You can choose a level, one, two, three, four, from simple to difficult. Then you choose an operation. And then you start the device. So the device is not a calculator. It's a reverse calculator. It will generate problems and train you to do mathematics to learn different operations, typically for charts between five and 12. And so you can see here, it was one of the first version, a very simple subtraction. Oh, sorry. And if you are wrong, of course, it will display error, EEE. And you have three, you can have three tentatives. And then it gives you the answer. And you have 10 of those questions and then you have a global score. So this was very successful. And it appeared in the 70s, about 76, for the first version. And it was very, very successful. And there were a lot of variants after. So you can see on the side there were other devices. Actually, the little professor, the most iconic, one and most, we were both one. And also the one that is still actually available. There is still a solar version that is sold online. And so, well, it's what's really interesting. So our first project was to build this, actually. We wanted the kids to experience that and to play with it in the museum. So the first thing we decided to do is to build a big version, a large version of it. So it's still a prototype, so we do print it. And before, well, of course, that prototype is built with current technology. So we use Arduino, a LED display. And we build our own keyboard. Keyboard, so the keyboard, actually, you have to think about how to scan the key. So it's interesting. And actually, after the second step, we decided to look inside the little professor and see, oh, it was built. Actually, inside, you will find also one microcontroller that was very interesting, because for reducing the cost, that was the design. That does everything, including the ROM, the clock, and of course, the processor. And the keyboard, actually, is the same principle. It's also based on the skyline, and it's directly managed by the microcontroller. So you can see that there are a lot of them, many variants, and also many different ones. So this guy there from Datamatt, a German guy who moved to the United States, so he has them all and has analyzed everything. So it was really a very deep source for this presentation. And so, well, what did I discover inside that little professor? Actually, you can see there, there is only one microcontroller, and that's about it. You have the display, and you have the keyboard. If you look at another one, this is the Merlin. Actually, you see the display is a bit different, it's based on the touch, but you also have the same kind of processor, but it's a variant. It's not the same, because actually, as you will see, it's a microcontroller, and it has the RAM, you can see on that side, on the left side here, and the RAM here on the right side, you see the program counter, the instruction PLA, where actually it's microcode, so you have the microcode there, the accumulator, and the resource clock, and the driver here for the display. So you have everything inside, and of course, the ROM is really burned inside. If you change the device, of course, you have to build another one. Another interesting point is that as the ROM is inside, you need to be able to access that ROM in order to get it for building an emulator. So I will come that later. But an interesting point for that, well, interesting point about architecture that will make this a bit more complex, is that it's not a von Neumann architecture, it's a Harvard architecture. So the Harvard architecture actually has a separate bus between for the ROM and the RAM. It's still used for microcontrollers, of course, not for CPUs, and it means that you don't have an easy way to read the ROM and extract it to send it to the outside world, because for von Neumann architecture, of course, the ROM and the RAM are in the same address space, so you can use an instruction to read a cell of RAM to read the ROM, actually. It's not possible with the Harvard architecture. So start to think, oh, can we read the ROM? The answer will come a bit later in the talk. And before that, I will zoom on the history. We are museum, so we try to understand how it evolved, actually. It's Texas Instruments. So I don't know if you know, but Texas Instruments, the guy there at the bottom, Mr. Kielby, is the guy who invented the integrated circuit. Well, actually, we found another guy from Intel about the same time, but that's the history as a remember for that. You got a Nobel prize for physics, yes, for that. And of course, it was the start of the development of the microchips. And the first one, of course, we all remember about the 4001 of Intel as the first CPU. But here, this talk is also to say, well, in microcontroller, there is the TMS-1000 family, which is actually the first commercially used microcontroller that was really a success at large scale. So you can see in the early 70s, there were a first trial with another kind of processor instruction set that was used in the data map, very successful calculator and also the Sinclair scientific. And then, well, they learned from that. In that one, it was a very complex instruction set with 11 bits. And then they designed the TMS, and this one was really successful. Here on this side, you see only as application only the games and L games. But if you go for the calculators, you can see here about a selection of main calculators that were already based on the same. So using the same technology, actually, the main usage was calculators. And you can see the whole evolution across the 70s, but it was still heavily used in the 80s. And actually, all of this is based on 4-bit computing. So it was really amazing to say, oh, in the early 90s, there were still devices built in the 70s on that design. So it triggered that need to go into more details. I would speed. And yeah, just a quick comparison. Also, as I told you, the history remembers a lot, the Intel 4001, the TMS 1001 is also very, very interesting. You can see in unit sold, it's very different. And the price was also very, very different because it was designed for the mass market. And as everything was in there, also, the device itself, the calculator or the game was also very cheap to build. For the instructions, it's about the same. You can see there are a lot less registers. We will see how we manage that quickly. So about how to program that. So there is a huge manual. That was a very good source for this. Everything that is in my slide, actually, it's on the website in reference. You can find all the technical documentation and some examples. And what's in that CPU? You have, actually, a very simple register structure. You have only one accumulator. You have four bits. Y, why register that is used to point in the memory in the RAM? And the X pointer, that is only for two bytes long. As you can see, actually, the memory is managed like that. So it's managed like a grid. So you have four lines. So for the X pointer, you can address the rows. And for the Y, you have 16, so four bits. And then there you can address all the columns. And typically, that RAM was used as register. So you can see here, we use it to store it, to store four numbers. And you can do computation on them. If you want to compute, to make some differences, multiplication, division, you have, of course, to implement them using the simple operation you have. So you only have addition, subtraction. You have to implement yourself the other operations. For the rest, you can see program counter. You can see also there is a page address. It's a page memory. So the jumps are only 16, four bits. So it's 64 instructions. So if you need more, typically, you have one K ROM. And you have to manage the paging. So here are the instructions. So you can see only 43 instructions. As I told you, you have the instruction here for the arithmetic. You have some for the input output. You have, of course, addressing, reading and writing to the RAM, some increment decrement. But you don't have, for example, shift operation, logical operation. You don't have that on that device. And so it's not very regular. So you can see the instruction now. So difficult to memorize. This is just one for the addition. So it's well documented. So this typical documentation you can find in the documentation. Of course, if you have a carry, you also have a flag that is set and can manage to propagate and make a computation in the RAM, in the register in the RAM. Quick example, a BCD addition. Because of course, that's usually done in BCD on that device. Actually, you can do it using binary addition. But in some case, you just have to add six if it's larger than nine. So this is the algorithm you can see if you perform the addition in memory. Then if you are bigger than nine, actually, that's 2k. So if you are bigger than 15, you have already a carry on the binary operation. Then you know you are bigger than nine. And then you have an extra test to see if you are bigger than nine. And then you add six. And you can see here in this code here, you will test here if you are less than nine. And then you will have the correction and add it. And then you have a loop. Of course, you will perform the same on the whole register. If you see the example here, it's an addition. We try to add the two in the middle and sort the result in the F register. So X equal two. If you have the row nine, sorry, the column nine, with nine plus seven, you have 16. And then you have two at six. And you can see at the bottom, you have six that is stored. And of course, you have an extra carry that will be used for the next operation. Okay, now let's go inside. So for those who have thought about how to read the ROM, actually there are two solutions. The first one is that there is a test mode that is documented in the patent. And that is used for testing in factory. But it's difficult to use. There is no reported success. So the main way to do it actually is to decap the die and to read, visually read, so to capture the structure. So you can clearly see where there are transistors, where there are not. That means that there is one bit of ROM there to read. And then you can try to rearrange things because it's a bit difficult, but you have to think about, oh, because it's really mixed to rebuild the ROM. And this is the Python script that will do the job for you. And then you get the ROM at the bottom. And then you can typically emulate it. Or you can first also disassemble it. So this is a program to disassemble. And you can also emulate it. So for this, the grid tool to use, of course, is Mem. And if you start it in debug mode, you will have all the tooling. So you will have a disassembler that will show you the code and where you are. And you can see also on the left the ROM. So you have the four lines of 16 nimbals that will help you to understand what happens. And you can see here, it's the little professor that is running. You have that addition 39 plus 62. And you can see on the ROM here, you have the 39 here on that register. You have 62 on that register. And you have the sum here, 101. So it's the other way that has been computed. And it's used for checking. So if you type first one, it will accept. If you type something else, it will immediately display an error. And you can also see here in the code that we are in the code that is performing an addition, actually. So the addition, you can recognize the algorithm I showed you before, because it's the one with the test about nine. And then the correction to add six to make the BCD correction. OK, I will quickly close. So of course, ma'am, I will not go into detail. It has the support for the emulation. For the CPU, you have to import the ROM. It's not distributed. But if you have the possibility to do it yourself using a tool from a visual, you can do it. But usually, it's already available. And there is also a custom layout. You can see here, you can have, that's not common, but you can have custom display in ma'am to have that around the ring. And last but not least, back to our big professor. Well, actually, the design is quite the same. It was not meant, but it's just, as we know, today, it's kind of TMS 1,000 of that type. Of course, we don't have the ROM. We can directly program it. It's better. And it's also interesting because then you can do a lot of more things, rather than strictly emulating the original game. You can also try other games, for example, asking, trying to guess or having a different way to ask questions, not ask about the answer, but if I find a number, five plus what does 10, and then you have to find the other. And you can also have, because that device was not actually very user-friendly. So it tells you how strong it's right, and that's it. You can have more way to reward the kit that is playing by showing, for example, animation, a little Pac-Man, things like that on the lead. So that's what we are currently doing at the museum, to show that from the past and also to show another experience from today. Okay, so a quick look inside. You can see here the Arduino, the lead, and the keyboard with the matrix to read the keys. And that's it. I hope you enjoyed that quick journey in the past and about our work on preserving them. We have, of course, other plans for developing with kits also an app, but that's more like scratch coding, but it's also interesting to do it. And of course, we continue to evolve that, and if you want to see it, it will be on display in our museum soon. Okay, if you have questions, it's welcome.
Gameboy Advance hacking for retrogamers
We're right on time to start with you, Daniele, with Game Boy Advance Hacking for RetroGamers. That's what we are, RetroGamers here. Okay. Thank you. So, okay. I don't have so much time because there are a lot of things to say, but the slides are already online, so you can find all the links or whatever. So, I have to rush some stuff, but you can use Internet. I think that you know how it works, so you can use it to find everything. So, let's start. About me, I am part of the Italian League of Society. I am a Bozzilla volunteer contributor, so I am a few years old. That's a work press. I have an Italian podcast and wrote a book about auto-contribute to open source buzz and the experience in these projects in the last decade, I think. I am old. But let's just start about talking about the Game Boy Advance. So, these are the technical details. There are a lot of fun, interesting things. I think that for time reason, we can just say which are the most important things. That's the fact that can support four players by cable. And also, the output of the cable just can support a ROM. I mean, you can share a ROM to the Game Boy Advance and read it from the memory as an example. There are also the peculiarity of two different CPUs because it supports also the Game Boy Pocket and Color games. So, when you insert a different cart, you use a different hardware stuff. So, it is more interesting for other gamers because with a single device, you just have also all the games for the Game Boy Pocket and Color, not just only the Game Boy Advance. And also, there are also emulators, we will see it for other console. And one of the other things that the maximum capacity for the games was 32 megabytes. These lights are like 32 megabytes, just to have an idea of the difference of size at the time. And also, supported the first 3D, but in a screen, very time like this is not, I mean, so much beautiful for a lot of games. It depends on the technology. There were also the first games of this kind of thing. So, it's just amazing that this device that's today, you can find it in your closet on a bay for, I don't know, 30 euros. You can do it a lot of fun and doing a lot of things because for me, it was very fun during the COVID pandemic, just digging more, hacking, changing things, modding software because it's something that everyone can do because it's very cheap today. And probably you have also some of them at home. I don't bring with them all the collection, but I think that there are people that saw someone with the Game Boy there. So, I think that we understand. So, in this case, the first things that you can do as mod are changing the shell. It's not something difficult after all. There is also, you can do USB-C battery right there. They can put, instead of the battery that was at the time that you want to take the battery from the remote, the TV remote because you need the battery. Now, you can have a rechargeable battery, technology. And also, there are more cheap to get the HDMI high output, but they're very horrible. There is also, you can use it as an SNS controller, a cartridge with an SD card with all the games that you want, on-brew whatever you want. You can also change the speaker because one of the issues was the speaker that was very cheap. You can change it. And also, you can put an amplifier that improved the audio quality. And also, the first things to do in this case of the Game Boy Vadan, but also the pocket and the color was the lights. It's very horrible when you have the sun there and you don't see anything. And it's still there, the same problem. You can change it in these cases with a bit of the wiring. It's very fun because you can get the screen for the GBA SP and put it there. So you can have the back lights with very 20 or something like that. And there are also modes that can let you to change the brightness of the screen. Just moving something like that, you change the brightness. There are a lot of modes around that you can install on your Game Boy. Advanced, but also color. I mean, a lot of these things are also for the color. You just know to check the modes. There's also other things like the Bluetooth controller remote with a specific cartridge with the Bluetooth. And also, it's a cartridge with Wi-Fi so you can do it with a remote controller. There are people that get, I have a lot of time, I mean, like us. So, on view. Okay, you can have with this cartridge, there are a lot of around emulators. So you can have NES games here. You can have Game Boy Pocket, Game Boy Color. You can have the Sega Master System games there. There are also other emulators for other games, but the capacity of these not let run also for issues of the screen. This is very tiny. There is also option that you can have because there are people we don't view that did that. It's not by me. Game with Wi-Fi support with the remote, with the Raspberry. So you can have this game multiplayer there by internet. And media output. You can use it for Shift-Tan as a synthesizer. So you could connect these to the keyboard and play out the sound exit there. And there's also the Retro-Pie streaming. I mean, there, I don't know how much sense has, but you can play with the Retro-Pie and the Crash Bandicoot or PlayStation 1 there. I don't know much sense, but there is. That's the funny part. And there is also a huge database, ROM hacking, that you can find patches to a lot of ROMs around tools to mod ROM, a lot of things. This is something that I didn't remember. I was a kid. I didn't know that. In Italy, it wasn't so much famous. I don't know in other countries, but this was this video format called a GBA video. So you can have in a cart, I mean, Disney Toons, SpongeBob Toons in some episodes in a cart. And look it there, because it's very famous. They sold millions of those. And someone on the Internet, because I say that game people a lot of time, put a 10-bit movie in a cartridge. There is the video, so just there. I'm not joking. It's there. So, MGBA, in my opinion, is the best emulator for a lot of reasons. It's everywhere basically. It's integrated as a retro rock. And it's a part open source because there are some little for Game Boy that are there aren't. As an interesting website that explained the issues they had on fixing some emulation issues with specific video games. So, you know, there is an illogical version in Japan that created issues. We have no idea, but it's explained where there was the problem. So you can find a lot of these interesting posts in this website. But also, for getting fun, there is also a scripting giant inside the emulator. So you can do fun meetings. And it's also the debugging giant that is important. So, for recent times I have a video recorded, but you can do it at home, I mean. And in this case I choose this game because Metal Slug, I think that everyone knows except Pokemon. I think it's the most famous game. And so, this is the video. The game is starting with MGA. I'm just opening the memory search. I will look for the ammo points of the gun because I want to change it like infinity. Like a trainer with today, we call it trainer at the time of the word cheat. Now we are playing a bit. We now with which gun I pose the game. I have now a value I will search in the memory of the game. Now I will change a bit the points. I will search again, but this is just in this subset of memory. So, from now I will get the memory. So now these are the two, probably the values where it is the ammo points. Now I'm getting the memory and changing the value there. It's funny, I made a lot of like. And now we have ammo. So, and we can see from these that the memory is changing. Very basically we are, it's something that we, everyone knows with memory, with the mullet, it works this way. But it's very simple to do it with a lot of consoles or whatever. With Gameboy it's more simple. So, this is the Lua script that for MGA that does the same thing. Automatically. On MGA you can load this Lua script. You see it's not simple, it's more simple. We can, we have something that is executed not every single frame, because there are too many frames in the game. We just put like a maximum and we will write the console of the game and change the ammo. In this way we have the maximum ammo points for the game. So, it's an easy win. And there are already scripts for Pokemon, these kind of things already in the repository of MGA. You can develop it very easily because we can just set the memory address, write, read. So, it's Lua. I mean, it's a very simple script and joining that you can do a lot of things. Well, I don't think there is anything else to explain. So, this is a tool that I did during the pandemic. I will say, okay, I'm on Linux. I want to patch my ROM to have cheats because I want to get fun. And there was this tool very old with wine to get in running. I said, it would be very cool for me to learn a bit, C++ getting the source code, and updated for Linux and Mac OS and Windows with QT. And I did it. And at the end I got the source code, but the problem was very old. So, before I had to get running on Windows XP in a virtual machine without error. So, it was a very long task, but I did a new tool with a new interface that does the same things, but is a multiplatform that let you inject into a game a trainer. So, without doing the emulator before the starting game, you have the trainer. You have to run. So, this is on-brew development. There are various tools to develop games. This is a GB Studio. There is a module for GBA. There is also more languages that you can use to develop a ROM today. You can have Rust, NIM, Zig, obviously C++, and Lua again. But Lua is a scripting engine. It's not something that you can compile. Well, it's only one because Astyme did it. So, we have now an engine that lets you to do a Lua game in a ROM. So, basically you can write your script in Lua. You build a ROM with inside your script Lua and run it in the emulator. So, you can have a simple game with inside it. So, you can get the crash of Lua inside the screen. Something very strange, but you get that because this is a Lua. And, you know, automatically you can have the assets, audio, images, this kind of things. So, you have a tiny, horrible demo, not video game because it will be. And there is also the code of GitHub, in my case. This is a demo they did a few years ago. This is me, horrible, I'd say. But it's there that we have moving an image with any Lua inside. And this works also here because you have the cart there. So, the game works also in a real game by advanced. It's not just the emulator. So, this is the code of the game. I don't know how much you can see, but it's very simple. I will go very quickly. We have, first of all, setting some texture. We are setting the function to draw the image with the coordinates. We have also a way to clean up the screen because when I have to switch the image, I have to clean up all the screen, otherwise it will be like Liars. Some stacks over there, we are just in case when press some buttons, we are moving the images, rating the data like FPS, RAM, whatever. That is not perfect because the join is worse. So, the data that's saying with this function are not so much trustable, but it's working. So, someone did already the games with that. And so, let's move on. Also, for the game advanced, there are a lot of communities around. We are talking about prototypes because at the time, I don't know many remember, magazine, video games magazine, the writers got usually games in advice to review it. So, around there are cartridges with the demo, with alpha, beads, whatever, so people are collecting them and put online. So, we have here a prototype of Metal Slug for game advanced, the XOBEDE V-FREV of the game we saw. Robocop, never released, and this is QuakeTree, rewritten from scratch for game advanced, never released, with the source code is online. And we have different projects that collect them, but also documenting them. So, what there is inside the ROM that was unused as an example, incomplete, they document everything, also for other consoles. So, moving on. We have also the compilations, people that have a lot more time, want to do a compilation one-on-one to a game to generate a complete copy of the game from a complete new code. There was also for, it was Famous Mario 64 as an example, but there are also for Game Boy. And Pokemon people are more, has a lot more time than me, and development again, all the games in G++, one-on-one, all the games for Game Boy, also the Pimbal game for the Pokemon. And there are also other games that they compilated, so you can create your version in cheat code instead of assembly. So, I think that is better. So, there is also competition online every year, and a lot of them. In this case, there is also the GBA gem with practice. There is also this case, Community on Discord. And there is also this one, the Retro Platform Gem, also for Windows 95, these kind of things, and this included also the Game Boy Advance as an example. So, if you have more time, you can do that, so games. So, let's go fast with some examples of those. We have this one. It is an Open City building for GBA and Linux, with Nc++. We have this one, a 60-PS game run 3D in C++. This one is very cool, not because it's a localized in Italy, but because it's so crazy that the Game Boy is amazing. Your name is a Game Boy, it's cool. But the point is that it is written in Lisp, so there is a compiler for Lisp. There is Linux, Windows, GBA and PlayStation Portable, with multiplayer support on GBA, in a game that is procedurally generated. At the time, there wasn't something like, okay, we have random maps every time, everything. Now, you have Indigaboy Advance 2024. So, you can have levels generated in Procedure. It is the same one that brought the engine in Lua. This one is the same after for this game. In this case, it's written again in Lisp. This one right on the screen, a QR code so you can share the map of the game in a community online, of course, with a smartphone. There is also Open Lara, at the compilation of Lara Croft, one running on the Game Boy Advance. But, of course, Doom is everywhere. Why not again in GBA? Because Doom, there was for GBA, in this case, is the modern Pierre Doom, Boom, Port, compiled for GBA, but it wasn't enough because someone added the support to create a GBA ROM with your mods of the time. So, you can start wars mod on GBA. You have Counter Strike mod for Doom in your Game Boy Advance. Time. So, we are at the end. I was russian, so I guess there are questions. Just to say, these are true projects. They can help you to understand because I have no idea how it works. Assemblies, I studied these to understand. And the idea, this is a Game Boy guide from modern game developer. It's complete. There is a lot of things. And this one is for Game Boy Color. We are working on it, so localizing in Italian. And we know from the community because this is the Game Boy dev community, they organize all these gems. And it's used from university to explain how it works as CPU because at the end, the Game Boy is very simple. Everyone can play with that. So, it's very easy to teach how it works, assembly, basically. So, you can find the tutorial step by step, or to write a Pong game. And now I think there are asteroid, I think. So, it's something from the community that can teach you how it works at Game Boy at the end. And I think that for someone that never developed with C++, it was interesting to understand how it worked at the time to do all these kind of things. There are a lot of people that works a lot on the Game Boy. You can find on Twitter, GitHub, a lot of people that develop game emulators, but also developing a lot of games. So, if you are just interested to understand how it is easy to do retro gaming, that's so hacking because I say it's cheap, so I change the amplifier, I change the screen, I change the battery, I change a lot of things just for fun. I just understand how it works, a console. But at the end, when you do these kind of things, you learn how it works as a computer, basically. So, you can get fun just doing these kind of things and learning, getting fun that is not okay, and getting fun developing a framework, JavaScript that no one ever used, but something like, okay, I'm hacking a Game Boy. So, you can say, okay, I'm not care of the Game Boy. So, I just invite you to check the emulation world, what you can do. The various communities around that are very active. And just to get fun with something from Cheat Dude, but it was still today. So, thank you. Thank you. We have time for one question or two, Maximum. Anybody? There is a question. That's too far. Hello, this is actually not a question at all. Sorry, in the front of the camera. So, this is actually not a question at all. This is actually, I am the maintainer of the GBAsim tutorial. Oh, cool. And we actually have a bunch of people from the Game Boy and Game Boy Color development community in the back who are currently saying hi. Cool, cool. So, we are very thankful for the shout outs. And we would like to say that there is a lot of interaction between the Game Boy and Game Boy Advanced communities. So, if you would like to use more high level languages, the Game Boy Advanced is great. It's very modern. If you want to use older languages, the Game Boy is also really nice. And yeah, we have a lot of resources. So, thank you everyone. And hopefully, we will be the ones giving a talk next year. Perfect. Okay, finish it. Okay, thank you everyone.
Running DOS & Unix on an 8-bit Commodore
Thank you, mission. Hello, everyone. My name is Michal Pleban, and I would like to show you how to do cool things with a Commodore computer. So, back when dinosaurs roamed the earth, Commodore introduced the Pets. It was one of the first home computers on the market, and it started a succession of a lot of different business computers that were especially popular in Europe. But time went by, and competition were introducing more and more powerful machines. So, Commodore decided to upgrade the Pets, and in 1982 they introduced the CBM-2. And when you look at the hardware specification, you see that it's basically an upgraded Pets, a faster processor here, more memory there, but it's basically the same architecture. But there is one little detail that stands out, it makes this machine really unique. It's the second processor interface. It allows you to attach a different CPU to the system and run applications on it. So, if you want to do serious business, you can attach a Z80 and run CPM. Or maybe you want to do scientific stuff, so you can attach a Z6809 and run Pascal. It's all made possible because the architecture is very flexible. So, the way it works is both CPUs are connected by a message bus, so they can communicate with each other, and they share access to the main memory using an arbitrator. So, normally one CPU is running an application, and the other one is either waiting or doing some housekeeping tasks, maybe checking the keyboard, maybe updating the timer, but basically the CPUs run together at the same time. And the message passing bus allows you to use an inter-processal communication. So, if you have an application running on the second CPU, and you need to do some IEO, maybe load a sector from the disk, then the second CPU interrupts the first one. The first one grabs the memory access from it, loads whatever it needs to be loaded from the disk, puts it in the memory, and gives the access back. And the beauty of it is that it can work both ways. So, normally you would have an operating system running on the second CPU, and call the first one for IEO, that's the standard way we could say, but you can also do it the other way around, and use an application on the main CPU, and use the second one as an accelerator. That's also possible because the bus is very flexible. So, if you compare it, for example, to the A-Cord and Tube, which is also a popular and well-known second processor interface, that one works only one way. This is much more flexible. So, given this powerful architecture, what did Commodore actually do with it? Of course, there was a Z80 card plant, we have the schematic for it, but it was never produced, but what Commodore did produce in the end was this. It's an Intel 8088 processor card that's supposed to run CPM, and of course MS-DOS as well. So, yes, you can run MS-DOS on this computer, and it looks like this. I run a check disk, because that's about the only thing I could run on the machine. Because there's a tiny little problem with MS-DOS. Today, when we think about MS-DOS, we think about the PC, because that was the original operating system that launched with the PC, and the PC became so popular that it overwhelmed all the other personal computers except the Mac. So, we say MS-DOS, we think the PC. But in the early 80s, Microsoft had a different plan for MS-DOS. It wanted it to be a G operating system for all 16-bit machines, just like CPM was for 8-bit machines. So, we had more than a dozen different computers on the market, and each was running its own version of MS-DOS. And the theory was that once an application was written using MS-DOS APIs, it should be able to run on all of those computers. But the thing is, the MS-DOS API is very limited. So, for example, if you have a spreadsheet application, you need to be able to place a cursor on the screen and update a specific cell. Well, guess what? There's no API in MS-DOS to position a cursor on the screen. So, what you had to do, you had to go through the machine's BIOS. And of course, each computer had a different BIOS interface. And not to mention Bitmap graphics and any other advanced features. All of them had to be accessed in a machine-specific way. So, what really happened is the applications were written first and foremost for the PC, maybe for a few other architectures, but if your computer was not PC-compatible, then you had very little software to run on it. And the Commodore is about as incompatible as possible with the PC, so there's nothing that you can run on it. So, the big question is, can we do something about it? Can we somehow make this great machine PC-compatible and run real applications on it? And the answer is, of course, we can. And the way to do it is we need three things. First of all, we need something that has the same interface as the PC BIOS, so that applications can use actual PC BIOS interrupt calls to interactive the hardware. We need video memory, because there's one thing that all PC applications do is they write directly to the video memory instead of using the BIOS, because that's so much faster. And first, third, we need virtual hardware, because the PC has a lot of I-O chips that the Commodore does not. So, if you want to generate sound with PC speaker, for example, there's no BIOS interface for it, you need to interface with the I-O chip. So, we need to do something and make up for the fact that the Commodore is lacking all those chips. So, this is what is needed to create a PC-compatible BIOS. These are all the interrupts that need to be implemented to have MS-DOS boot on the machine. If you are familiar with low-level DOS programming, you will recognize what they are. They give access to the screen, to the keyboard, to the disk, some basic stuff. So, the good news is that we can reuse a lot of code from existing MS-DOS 1.25, because, for example, if you want to put a character on the screen, there's already a function in Commodore Kernel that does it, and there's already an inter-processor call in the old MS-DOS that uses it, so we just need to slap a different interface on it. The bad news is that it is a lot of functions. And you need to get all of them right before anything starts to work. So, if you have a few years of free time, this is a good way to spend it. The video memory is actually the easiest one of those three. So, of course, the PC stores the video data at a different location than the Commodore, so what we need to do is to use a timer interrupt to copy the data from one location to the other. Of course, doing ASCII and Petski conversion, but it's very simple. So, this way, anything that the application writes to the memory where a video memory would be on the PC will actually appear on the Commodore screen. And the third thing, we need to pretend that the computer has the same peripheral chips as the PC. So, of course, we could try putting all those chips inside the Commodore and basically making just another PC clone, but that's not cool. And that's another way to do it. We can use virtualization. So, how do we create a virtualization platform on an Intel 88 processor from the 70s? Well, this is a virtualization platform. And the way it works... We need to be able to detect when the computer is trying to perform an I-O operation and stop it. So, we put a virtualization environment here and every time when the computer tries to access the I-O, it is interrupted. The interrupt routine checks what kind of I-O access is being done. That's whatever magic is necessary to emulate this I-O chip, most likely using inter-processor calls to perform the actual I-O. And then it turns back and the application thinks that it has actually accessed an I-O chip. Then it's done. So, well, is it all enough to actually make this platform PC compatible and run those applications? I'll grab a couple of bugs and let's find out. So, we are booting the computer and we are starting to load the operating system, which seems to be free-dose. And once we have free-dose running, of course we are starting not on Commander, what else would be? And because we are dealing with Microsoft, of course we start with basics. So, this is the QBasic from MS-DOS. This is some very simple Hello World program. And let's just find out if it works. Yes, it does. And of course we are going to use Turbo Pascal as well. Prince of Persia. So, again, a Hello World program. Is it going to work? Yes, it is. So, just for good measure, let's try to change it. So, indeed we can position the cursor on the screen. We can do some changes. And it works again. So, that was the Intel 88 processor card from the Commodore. But as I showed before, you can attach many different processors to the bus. So, how about we do something really cool? This is Commodore 900. It's an abandoned Unix workstation prototype that was being developed by Commodore, but it was cancelled because they bought the Amiga and they tried to focus on that one. And if you look at the hardware, it's a very strange machine because it uses a Xiloc 8000 processor, which is very rare. It was used in some Olivetti machines, some industrial equipment, but basically in 1994 nobody was using it anymore, except Commodore, of course. It has a memory management unit and it runs coherent. And coherent is like Linux, but 10 years earlier. So, it's a system written from scratch to act exactly as Unix, but it was much cheaper. And it's a truly multitasking and multi-user machine. You can attach many terminals to it, log in at the same time around applications. The problem is because it was cancelled, only a few dozen prototypes were made, so unless you are very lucky and very rich, you can't have one. So, how about we do something about it and we put a Z8000 processor on the second processor interface? Here's what is needed to create a virtual Commodore 900 using this interface. Nothing very difficult here. So, of course, we need to create virtual hardware again. We need a keyboard controller, we need a disk controller, we need a serial-poll controller, and so on. But the good news is once the Iotips are emulated properly, we can use the original Commodore BIOS. We don't need to write a new one. And even better, because it's Unix-like with resource protection, then no applications are touching the Iotips directly. They are all going through the operating system. So, once the operating system works, then everything is supposed to work as well. And this is very different from MS-DOS, where every application had its own dirty ways to play with the Iotips and it's needed to emulate all of that stuff to let them run. So, you know what happens now. But before I'm going to show you how it works, this is the Z8000 processor card that I made. So, it has one megabyte of memory. It can be accessed either as 8-bit or 16-bit. It has a memory card that emulates the hard disk. And, of course, it has the virtualization environment as well. So, is it going to work? Let's find out. So, now the original BIOS is starting. It's going to perform a self-test of all the hardware. The hardware does not exist, but the self-test passes. And now we are booting the operating system. Router can't without the password. That's a very secure installation. It's a nice Unix file system. And, of course, because it's Unix, we are going to program it in C. That's a Hello World program again. We have a C compiler on board, which takes a bit of time to compile this tiny program because that's just six megahertz. But, finally, it's done. And, the program works. And that, ladies and gentlemen, was Unix running on a Commodore. Thank you very much. Three minutes for questions. Hi, and thank you for the presentation. Stand up. Have you tried, like, there's another project for the Commodore, for the BBC Micro that uses a Raspberry Pi to emulate other professors. Have you tried using, not using the hardware that you cannot obtain today, like the ZE-8000 and so on? Have you tried using better metal or Raspberry Pi to do the emulation of the other hardware or maybe the emulation of other professors and so on? Have you had a thought of that? Yeah, well, you can obtain the ZE-8000 from eBay very easily. But, I have not tried using any platform from this. I'm a Commodore guy. The main problem with emulating the Commodore 900 specifically is that nobody has ever written any software emulation for the memory management unit of it. So, you can emulate the CPU because, for example, in the main repository, you can find the code, does it? But there is no emulation existing right now for the MMU, and that makes it really hard to do it without the real chip. But try the Intel stuff. Yes, the Intel stuff should be done easily. Yes, of course. As long as you can virtualize it. Thanks for the good talk. How much performance is being lost in the emulation of the various IO chips? A lot. Yes, it takes a lot of time because, first, it needs to go through the interrupt routine, then it needs to go through the message bus, then it needs to go back through all this. So, that's quite a lot. And that's why I decided to bump the processor speed a little bit. So, for example, the original Commodore 900 has a 6 MHz CPU. I put a 10 MHz. And that makes a nice difference. Otherwise, yes, you would really notice the difference. But there's also one thing. If you have static memory in the computer, you don't need to waste cycles for memory refresh. And that also gives you a nice speed boost. So, all in all, it works quite well. Okay, time's up, unfortunately. Many thanks, Michel. Thank you. Thank you.
A Game Boy and his cellphone
So, ready to start? Yeah. Okay, so we have now Esteba with Game Boy and his cell phone. Hello, and thanks for being here for this talk about Game Boy peripheral that I think is very interesting and versatile. I'm Esteba and I've been working to emulate and restore this peripheral on and off for the last six years or so. But first, I should tell you what it is. The mobile adapter GB is a peripheral that allows you to connect your Game Boy up to your cell phone, allowing games to make and receive calls in part to also call an internet service provider and connect to the internet, allowing for all sorts of online connectivity, like sharing scores and getting updates for various things. It was one of the very first attempts by Nintendo to have any sort of online connectivity for consoles, but what makes this one very interesting in my opinion is that it supported a few rather high profile games. There were actually a few variations of this adapter made for several different phones. You have a blue, a yellow and a red one. The green one for PHS was also planned but never released. But what you will notice is that none of these actually work for any non-Japanese phones. So this service never left the island and unfortunately it was sunset very early, almost two years into its life in December 2002. But to give you a better idea of what this peripheral could do, we will talk a little bit about the games that supported it. So first of all, you got the mobile trainer with the adapter. It was used to configure the adapter and you had to use this before you could connect to the internet. It also came with a very useful usage manual but it also had some very interesting utilities, which were a mail client which supported both SMTP and POP and could communicate with the outside world so you could actually receive real emails. And a very minimal web browser which was hard-coded to one website to read news about Nintendo games and games for this peripheral. Now, the very first game that was released for this thing was Pokemon, a very popular franchise that I'm sure you're familiar with. But it was actually one of the very first time you were able to battle and trade online with your friends or at very large distances at least. Besides that, it also featured a battle tower which allowed you to fight people who have entered that tower previously. It got localized with NPCs in the west but the Japanese version worked with this adapter. You also had a trade corner which is a bit of a prototype of the global trade station which appeared back in Generation 4. And you had a news machine which I think is the most interesting part because you could download scripts which had news items but also many games, questionnaires and you had rankings to show off your friends how big your Magikarp is. Another very interesting game in my opinion is Net the Get which was one of the only titles which used the MBC6 on the Game Boy. It's a minigame collection that came with 15 built-in minigames which could download more and more would be released over time though they never reached the titular 100 minigames unfortunately. A few other games that were very interesting, Mobile Golf which is a sequel to Mario Golf which never got localized but it came bundled with the adapter later in its life to help sell the adapter. Starcom which is a sort of pet simulator, Game Boy Wars which was part of the Wars series known for Advanced Wars and Famicom Wars and Mario Kart which allowed you to upload and download ghost data. So let's tell you a little bit about how this project got started and where we are now. Somewhere in 2016, Haki posted a thread on GlitchCelapse which explained a little bit about how the mobile adapter protocol worked. From there we spun up the Python script which communicated with the BGB emulator allowing you to have a proof concept that this thing actually worked. Somewhere in 2018, a guy named Shinumi who is known for emulating various peripherals including suing machines and phishing sonars which were made for the Game Boy. Also emulated the mobile adapter and specifically Net-to-Get and created very comprehensive documentation that we are updating and keeping track of to this day. And at some point people wanted to actually bring a real Game Boy to connect to the internet and that's kind of where I stepped in and we started doing stuff. So fast forward to today, we have a group called Rion. We are a group of preservationists, developers and enthusiasts who want to preserve this system and make it usable to the common user as it used to be. For that we are making emulators, servers and translations for a few of the games so that they can be enjoyed by a wider audience. So to give you an idea of how this all fits together, I will explain a bit about how the system connects together. So this is a connection diagram. On the left side you have the user's Game Boy which communicates through a custom link protocol with the adapter which further communicates with a proprietary protocol with a mobile phone. The mobile phone is connected to the phone network but depending on who you call you can either call a friend and communicate with their phone directly. And this was used for example for the Pokemon trading and battling. Or you could call the internet service provider and use the point-to-point protocol to tunnel your connection through TCP and UDP to the official Nintendo servers. Now most of this stuff is kind of irrelevant when we are emulating this because when we are emulating it we can kind of make big black boxes depending on what you are doing. This is how it would look if you have a simple microcontroller that connects to your Game Boy and then further connects through USB to your computer. Your computer will communicate to either the game server or if you want to call a friend then we have set up a relay to punch through router firewalls and that sort of thing which allows you to connect to any other player on the world. And this can either be hardware or these blocks can either be full emulator which also emulates like the Game Boy itself and the adapter so it's a little bit more variable. So we have full documentation and emulation of the peripheral itself or at least the part that communicates with the Game Boy. And for that we have made a library called LibMobile. This library can be integrated into all sorts of projects from software emulators to hardware emulators and back. We've integrated thus far into the BGB emulator which is a Game Boy Color emulator. We've integrated into the MGB emulator. We've made a little fun interface to configure it as well. And some people have been playing around with making it work on the Raspberry Pi Pico and communicating over Wi-Fi for example or the Arduino Uno which is mostly what I've been using. There's also the GBE plus emulator which was made by the Shonumi which I mentioned before. This is more of a local only emulator but it's for some games that we don't yet. And of course full documentation of this is available in Dandox. So these are a few of the examples of things of setups that people have put together. On the far left you've got the simplest one which is just breaking out a few wires and connecting them to the Arduino and then just plugging that into your computer and doing it like that. Some people have made PCBs. The central one is able to communicate over Wi-Fi and Xenaro really active user lately has made a 3D print version of it as well. Now of course you don't need to connect it directly to a computer. You can also just use a modern phone which are basically computers these days. We've also of course started emulating the server side of things. We have the relay server which I've mentioned before which gives you a phone number and allows you to call someone else. We have a mail server which is implemented in Node.js and stores in SQL so we can manipulate the emails more easily. And we have a few complete game servers for Pokemon Crystal which supports actually everything at this point. And a very driven person called Winter who has fully emulated Mario Kart and Monopoly though Monopoly doesn't have many features unfortunately. Also GBE plus has emulated a few games in particular Net-to-Get Game Boy Wars, All Japan GT Championship and Hello Kitty's Happy House which allows you to send emails with items to your friends which is very cute I think. And of course we've also made a few translations in particular Pokemon Crystal of course was already localized but we've restored all the functionality for it. And we've also ported all of those changes to the four other languages that the game was released for. Mark Max came to us asking if we were interested in his mobile golf translation and most of it has been translated but not the mobile features because we don't have any support for it yet. And the mobile trainer which of course is a cornerstone of this whole thing. If you want to get into it or make an emulator for yourself or develop a game that supports this thing. We have of course the mobile which allows you to emulate the adapter itself. We have the re-enrollable story which you can extend with other games or if you want to emulate those though I would suggest if you make homebrew that you make your own server behind this. And unfortunately we still don't have a client library the library that runs in the Game Boy itself though we have reverse engineered the library from the Nintendo SDK. If you don't care about licensing problems. So in conclusion most of the things that you'd want to see are already there. Of course we don't have all the games yet. The problem that we're mostly struggling with right now is authentication and getting this useful for actual people who aren't very techie. So if you want to help with any of that documentation making tools websites whatever you can reach us on unfortunately discord only. We have if you want to make a matrix server and bridge that I would be very happy but unfortunately right now it would be the only person who would use that. Our github is over there and show numies block with a lot of more peripherals and funky things that he's emulated with the Game Boy. Can be reached through his github pages. That was it. Thank you. Thank you. We have time for one or two quick questions. I have a very quick question. Thanks for the talk. Do you know how the original games that you could download of the Internet back in the 2000s how those were captured. How those have been like captured like that's like 22 years ago. So one of the things that we actually sometimes need help with is if you have any of the games that supported the mobile adapter. Don't run them dump the save directly. If the battery still lives then we might be able to restore some of the games that were supported back then. Thankfully though we have the 15 built in games which serve as an example to make more so that helps a lot already. Another quick question. Yes. No. No. Okay. Well then. Thank you. You can get prepared. It was really interesting. Thank you. Thank you.
PiStorm - The evolution of an open source Amiga accelerator
Okay, we are right on time. Many thanks. So Andrew with the Pistole. Hello everyone. I was stupid enough to do this from an Amiga 1200, which is great because I don't have a screen in front of me, so I'm going to try and see what I'm doing whilst I'm doing it. But it'll make sense later. So I'm here to talk about Pistole. My name is Andrew Hutchings, I'm also a learner at Linux Jedi. During the day, I worked for a non-profit called the MariaDB Foundation. And by night, I restore Commodore Amiga's, Acorn computers, I design upgrades for them, and I'm part of the Pistole community and a whole bunch of other things. I've also written for PixelAdex to go by that, because the next issue's got a big article by me in it. And I'm also going to plug... The aperture there was made by Stu Cambridge, who from Sensible Soccer fame, he did cannon fodder and all of that lot. And you can get him to do Doodles of You just like that from his site. What's it called now? Design Droid. He doesn't know I'm plugging it, but I love his work. So anyway, about Pistole, it was a project created by a guy called Claude Schwartz. And if you've ever tried to use or upgrade Commodore Amiga today, you need a processor like a 68030 or a 68060. If you want a 68060 with a board and RAM and everything like that, you need to sell a kidney, basically. They are really rare, really expensive nowadays. So the idea was to create a very fast budget accelerator. And you can get a lot of compute resources from something called a Raspberry Pi, which you probably all know about. So what this essentially does is it emulates the 68000 processor on a Raspberry Pi running Linux originally, but the rest of the Amiga motherboard was used. And then it adds things such as RTG. Now, RTG stands for Retargetable Graphics. And essentially, that means it's like a second graphics card for your Amiga. So this is what I'm actually projecting from right now, is the RTG from my Amiga. It has the native Amiga. If I tried to run an old Amiga game on it, you wouldn't see it on the screen right now, because I haven't got the output for it. I'm going to talk about that a little bit later. It adds virtual scuzzies. So the SD card on there is basically a driver for the Pi, as the Pi Storm, to talk directly to the Raspberry Pi's SD card. So it's rather than being emulated, it's almost like a direct driver in a way. And it also adds RAM. So I've got a Raspberry Pi 4 in here. So nearly 2 gig of RAM added to what is normally a 2Mega system. So a little bit of a boost. And everything is open source. The boards are open hardware and stuff. What we used to do is a group buy where you could come along and say, I want to buy one of these, and we'd all go to JLCPCB, buy loads of boards together, and you just have to solder on the headers, which were great until the chip shortage, and then that kind of died off completely. But back then, I said you can pay more than 20 bucks for a Pi Storm. So about 18 pounds, it's probably about 20 odd US dollars, whatever. So it was really, really cheap. You just need a Raspberry Pi. So this is what the first one looked like. Now, you can see there's quite a few chips to it on top of what is normally a Pi GPIO there. So essentially the problem we have is the Pi GPIO is 40 pins, but you only get about 26 GPIO lines from that. And the Amiga, 16-bit Amiga has 16-bit data bus, and then a 24-bit address bus, and then control lines on top of that. It's a lot more than you have IO lines. So what we've got here is a CPLD chip, a programmable logic chip essentially. And we have in there basically this 6,000-8,000 state machine. And that does all sorts of multiplexing communications to the Pi. And then we have some buffers basically because the voltage-total translation is needed between the CPLD and the Raspberry Pi, and then the external IO logic. So it was nice and simple boards. We could get JLCPCB to build all these originally until the CPLD kind of ran out of stock, and then that became difficult. And the logic that we wrote for the CPLD is enough to run it for an Amiga, but it doesn't include some of the state control lines that other systems use because we were targeting an Amiga 500 at the time. So this supports a 500. It supports most of an Amiga 2000, the 1000, and the CDTB. And then... Oh, doing this on my clicker, clicker and of course, I've got my clicker connected. So it used to Raspberry Pi 3A originally. You could have used the Raspberry Pi 3B, but you'd have to raise the header a bit because otherwise your Ethernet board smash into the board. And that's not good. You can take off the ports on the 3B if you don't want them, or you can extend the header. Also, Pi 0 2W will work. If you don't know, Pi 0 2W is basically a Pi 3, but in a much more compressed format. We ran Mishashi 6800... I hope I'm pronouncing it right. 6800 CPU emulator, which... It was good. It's a pretty good 6800 emulator, and then there's some kind of glue code to make it work, but it was basically an off-the-shelf emulator. And most of that software was done by a guy called Bjorn. He's not part of the project anymore, but he's got a lot of great early work on it. Again, I'm clicking on my clicker. So, performance-wise, you can see here... This is what's called SysInfo. It's kind of a stock benchmarking software for an Amiga. And an Amiga 600, which is same as an Amiga 500, roughly. The original Pi Storm ran about 23 times faster, which is pretty good acceleration. You're getting even faster than what was called 6800, 25 MHz. So you're getting about 50 MHz, 030 processor, kind of speed out of it, which is pretty good performance for something that costs a lot less than even the CPU for an 030. How I got into Pi Storm? I was designing some new hardware for a Commodore Amiga, and the other advantage of having Mishashi on Pi Storm is the fact that you can, on the fly, change the entire configuration of the Amiga. I want a different OS ROM to boot into, different RAM configuration, different hardware configurations. All that can be changed on the fly. I started providing patches, helped build a community. This was probably in September. We had 7,000 members on Discord and 3,000 on Facebook. So it's grown to a pretty big community. Things I've done, I'm going to skip over this, but I did a lot of the early work regarding bug fixing and things like that for the original Mishashi Pi Storm. Then we released a version for the Amiga 600 and Amiga 2000. They are essentially basically the same thing, but Amiga 2000 has a coprocessor slot, so it's much easier to just debug it in the slot. At Amiga 600, you have to do this hacky thing where it sits on top of the PLCC CPU, and then there's a little kind of thing in there to tell that CPU to go to sleep, and then that basically is identical after that. So EMU68 came along. EMU68 is a bare metal emulator for Raspberry Pi, for the 6800, so it's much, much faster. You don't have to boot into Linux anymore. This is what this boot is from. It became an option for Pi Storm in 2021, and now it's pretty much de facto standard, and it uses JIT-based emulation instead of table-based. So performance-wise, it got a bit faster. 1,490 times faster, and this is just on the Amiga 500. Then the Pi Storm 32 came along. This project was scrapped. So essentially, it's the same kind of thing, but for the 32-bit Amigas like this one. But it became very hard to build, and it required a Pi CM4, which is a Pi without all the ports and everything. You just got these big connectors on the bottom, and it became difficult and expensive to build, so we gave up on that, and instead built the Pi Storm 32 Lite, which is Lite because it doesn't have all the ports on it. But basically, it's the same kind of thing. And we have a nice big FPGA on there instead of CPLD. FPGA, just much more logic, but you have to kind of flash it every time you turn it on. And that was basically the start of what became the 8-200. This is kind of the peak of Pi Storm right now. We released that about a year ago, and it's still going strong. Performance-wise, we're now talking 3,052 times faster than an Amiga 500, which is not too bad. Even the Amiga 1200, which this is, it's 1,326 times faster. And you can get faster still if you overclock it. I'm not going to overclock mine. I've got a little fan running underneath it as it is. And inside this Amiga, you can see this is what mine looks like inside. So you've got the Pi Storm in here. And then I've got a little cable running out of the HDMI port to the back, and that's what's running as projector right now. And then I 3D printed a kind of assembly with a fan in the net just to keep everything nice and cool. Demo time. So, John Clarmac said the Amiga is not powerful enough to run do. At the time, to be fair, he was right. The de facto Amiga at the time was kind of Amiga 500, my Amiga 600. If you wanted one that could run do, it would cost you thousands and thousands, much more than the PC would at the time. But today... It's later we were running do. Yes. But I can do a bit better. AmiQuake. And I haven't got sound hooked up, unfortunately, but what I can do... Time demo, demo one. It's slow, I know. So we just got to wait for all this demo to finish just to get a nice kind of benchmark out of it. And there we go. So we get 93 frames a second out of Quake through the RTG. If I run this through the AGA graphic, the built-in graphics instead, we still get about 45 frames a second. So it's a bit faster than native, which would be a few frames a second at best. Oh, there it goes, that window. So... If I use... CandlebyeStorm modified chip RAM. So chip RAM is chip set RAM. It's the RAM that the entire chip set... Can the Amiga talks to each other with? So you've got like the audio chip, the graphics chip, etc. That is capped at 2 megabytes by design, by Commodore. They were trying to move it to 8 meg for the Amiga 4000, but it never really hit there. No, it can't because we don't modify the chip set. We don't override the chip set, so we can't increase the RAM that the chip set uses. So whilst we have 2 gig of fast RAM, we don't have any chip RAM. Can you emulate a power PC? Probably you can, but it's going to be a lot of work, and we don't want to do it. So if anyone wants to put a PC emulator in there, it will probably work. Can you use PySom in other 6,000-8,000-based machines? Yes. So someone's done a port, I forget the name, they've done a port for the Atari, which basically had to pretty much rewrite the firmware to make it work because Atari actually uses all the 6,000-8,000, instead of the hacky thing Amiga did. I love Amiga, but Atari did that bit a bit better. And similar problems with the Apple. So there are projects where they're trying to get this running. It's not going... It's not all the way there yet, but they're working on it. CD32, sorry, 3,000, 4,000 versions. In theory, the one in this machine should work on CD32, but it doesn't, and we don't know why yet. We haven't had time to figure it out. It shouldn't take much modification to make it work. 4,000, 4,000 versions are going to require a lot more bus arbitration work, so it's just time to do that. And then the really cool thing we're working on right now is Amiga Native Video Injection Device, which we haven't got a name for yet, but essentially what it does is it captures the... It sits in various places in Amiga, depending on the model, captures the digital video before it gets converted to analog, pipes it through the camera port in the Pi, and then you can have both native video and the RTG video through the HDMI on the Pi. So, if you want to sponsor the Pi Storm development, Claude has a donate button on his Pi Storm 32 Like GitHub page. I'm just checking it out. Mikal, who develops the EMU68 project, has a Patreon to sponsor the development of it. And if you have any questions at all about the project, feel free to come to me. I'm the Linux Jedi everywhere, pretty much, and I'll be happy to answer them. And that is it. So we have time for questions. Any questions? Thanks a lot for your talk. So according to the SIS info output, it's not emulating a plain 68000, but a 030 or 60 or 040? So, Mishashi, you can choose which one you want to emulate. The 020 and 030 were the most stable doing that. For EMU68, it currently pretends it's an 040, but will support the instructions set for 060. Okay, so that's only about the instructions, and it does not emulate the MMU, I guess. Yeah, it's just saying, hey, I'm an 040, but it doesn't really matter. It will run 060 code, fine. Hi. I'm actually Debian's M68K maintainer, and I'm wondering if there's plans to add MMU support, so you can go to the Linux kernel. Other plans for the MMU? That is a good question. Mishashi, no, we did have it to begin with, and it was broken. So we didn't. EMU68, I believe, somewhat supports MMU, but needs some work to support it properly. It's at the moment a direct one. It's basically given a block of RAM in the pie, and just said, yeah, just use that. So we could probably emulate MMU. That's too much trouble there. Thank you for your talk. Just a quick question about the MMU68 variant. Yep. Does you need to maintain a second OS on the SD card without, or is it effectively a persistent thing once it's on? No, it's a system that boots on by itself completely. There's a whole set of tools that are out through Pi Foundation to create your own bare metal OS, essentially, so it's an OS in its own right. The downside to that is every part of hardware, we have to write new drivers from scratch to be able to talk to the hardware, which is why if you want to use Ethernet or Wi-Fi or anything like that, it becomes a much harder task for us to do that on MMU68, and that isn't there yet. So there's no USB host support. You can't use USB keyboard. I'm sorry, sir, again? There's no USB host support for the Pi. Not on MMU68, no. Right. There is a mishashi that will actually support keyboard and mouse through Pi's USB, yeah. Still time for one or two quick questions? Yeah, one in the back. Hi, quick one, I think. Did you have to do anything special to cope with the bring up time for the Pi, because it's a lot slower than the CPU? That's a really good question, the bring up time for the Pi. So the CPLD versions hold down the reset, until the Pi's booted. So basically the machine's basically, I'm resetting constantly, kind of thing. The version in this, the Pi's on 32, it will boot the native CPU first, because the FPGA hasn't been flashed. Once the FPGA is flashed, then the reset gets held down. And it's a very short time. You're talking like two or three seconds. Still time for one question? That will be the last one. So I guess the problem with the CPLD version is that AMD has announced that they're going to stop making those. So AMD, old as iLinks, do you use it? Yes, but iLinks is AMD, right? Yeah, no, we're not using... They announced like the last Pi's or something from now. We're not using iLinks ones, so I... Ah! No, so we're using... Yeah, I think we're using... Yeah, I think we're using... Yeah, I think we're using... Yeah, I think we're using... Yeah, I think we're using... No, so we're using... The CPLD is Oterra Max... Max2, yeah. Also as Intel. And then the FPGA is Trian... Maybe... Ethnics, Ethnics Trian. Ah, okay, maybe... I thought it was iLinks in the first picture, but maybe that's wrong. Yeah. Or maybe that's a prototype. The other projects I maintain, yes, they are all screwed in regards to iLinks, but... Good, that's it. Many, many thanks, Andrew. No problem, thank you very much.
A journey documenting the Sanco 8003 computer
We're about to start. Please be quiet. We can start with John by Julioff about a journey documenting the Sanko 8023 computer. Hi. Welcome everybody. My name is Giovanni Battista, but everybody calls me Giomba. I got my master degree in computer engineering in the University of Pisa, which is in this nice place. I'm working as an embedded software developer, so I do low-level stuff, microcontrollers and some Linux drivers. And as you may imagine, I'm a retrocomputer enthusiast. I wrote some games for consoles and retrocomputers, and you can find me on that place there. And here with me, there is Julioff. Hello everybody. Can you hear me? Yeah. My name is Julioff. I studied in Pisa too as an electronic engineer, and I like so much Pisa that I stay there working because I work in Pisa as a firmware engineer in a company that produces cameras. And today, I'm here to talk with Giomba about one of my hobbies, which is retrocomputing and how we investigated about an old vintage computer, which is this. The story is funny because one morning I was going to work by bicycle. That's not me. I was going to work by bicycle. When I saw on the sidewalk this computer, this old computer, which was an old one computer with CRT monitor, floppy disks and so on, and it was abandoned there. So I looked it. There was a cheddar label on it. It said nothing. I searched on the Internet what it could define nothing, and I decided to take it, save it from a dump. Another computer in the house. I started searching about it, and with the help of some friends, I found out that it was a clone. A clone of a computer produced by Senio and Koflec, which are French, Japan companies. And it's a true company. A company producer. The true producer was Sanko. So our cheddar computer was instead a Sanko. This further information gave us nothing more because on the Internet, we couldn't find anything except some photos on Wikipedia and some advertisements. No technical information, unfortunately. And so we decided to do reverse engineering on the whole computer. If you try to open a Sanko, you would find a single big mother board with Z880 CPU. Some peripheral chips, common chips for the 80s, like the one to interface with the monitor, the one to interface with the floppy disk, and so on. Some memories, RAM memory, RAM memory for the program, for the BIOS. And around them, a ton of 74 LS gates. So this motherboard is quite self-documented because it has no custom chip. All full common logic and pretty common integrated circuits from the Z80 series and Z80 peripherals. And so you can continue. And so we had never done this before. So the first thing that we thought that was sensible to do was just to start and dump all the ROMs that were on the motherboard. So there was this first one that was a common standard 2732 ROM. We dumped it and we thought, well, let's run a disassembler on it for the Z80. And it contains a lot of things that made sense as a Z80 code, as you can see. Of course, it was not all easy because this is just a huge binary blob. There is no differentiation, of course, between text and data sections. So you just have to be creative with understanding what this code does. And we found out that a lot of code made sense if it was placed at this address here, C000. But as you may know, the Z80 starts from, boots from address zero. So this was something odd that everything was starting from C000. And in order to confirm this, we used some logic analyzers just to confirm that it was actually a Z80, not some custom Z80 variant that started from another address. So we found out that it actually booted from zero and then started to C000. It was a bit odd, but in the end we found out why. And this contained the BIOS. Let's say that we have disassembled it. You can find it here. It's not complete, but it's enough for what we are interested in too. And then there was this other ROM here that, well, as the name suggests, it is a character generation ROM. Again, it was pretty easy to understand what it did, because we just run the dump into a matrix. We started putting different configurations for rows and columns. And we found out the characters that were inside the computer. And then this was this one here. It is an old 28-22 ROM. It is a narrow plastic line package. It is a bit odd. We didn't have any programmer to dump it. So we built some wires. We had an Arduino. And there were a lot of patterns, repetitive patterns there. We thought that it was something related to some glue logic for the computer, but at the point we didn't know anything about this. So our first hacking attempt. As you can see here, on the system ROM, it is this label that says V1.01. And if you turn on the computer, it says V1.01 on the monitor. So maybe there is some correlation. And yes, in the dump we have a correlation, because we found it here. So let's find in the ROM where it is that it uses this string here. And we found it. And we modified the ROM in order to make it display something. We discovered that this was just some memory-mapped area for the video display. And in order to, at this point, we had to understand what the various peripheral did of this computer. So we started patching the ROM with our own code. So we installed the Azure Insertion Force socket on the motherboard without damaging it, of course. But at a certain point we realized, well, why are we switching the ROM continuously? We can just write our own bootloader with lots of things from the serial in order to make our experiments. And our experiments were targeted at finding out how things worked in the Azure computer and to confirm or deny our assumptions about the schematics. And speaking of the schematics. Yes, speaking of the schematic, to better understand the software we had to better inspect the hardware. So what we did? We first started reading the manuals of the integrated circuits, the standard integrated circuits in there. And we tried to find the connection between them, because these data sheets were quite well documented. So we found out and we verified using a multimeter some of the basic connections. And we drove them on a piece of paper. But that was not enough, because we had to better inspect the motherboard. Not all connections were written on the manuals. So we had to have a better view of the motherboard. We were initially scared by the motherboard, because we suspected that it was multi-layered. And it is. It's a four-layer board. But fortunately, it follows a very common standard where power rails are buried in the inside ladder, in the inner layers. And signal traces are in the top and bottom and top layers. So it's quite simple. You had only to follow tracks under chips and something like this. We took a photo of the motherboard and we started drawing the traces on it to keep in memory, to keep in mind where the traces were going. And that's how we reverse engineered the board the simplest way, I think, you know. We used GIMP, free software. And then we moved on to the key card. So after the discovery of all the traces, we put them on a true schematic. This is not a true schematic. This is a true schematic. And quite 90% of the schematic is understood for us. The other 10%, which is mostly the floppy disk interface, is quite messy, but is copied on the key card. The whole board is on key card, but we understand quite the 90% of it. And let's talk about some of the pieces of the schematic. First of all, the memory map, the memory management. So John Bassett before, no, later. So the Z80 has 64K of addressing space and is quite all-mapped dynamic RAM main memory, except for some holes for the video memory, main ROM, et cetera. And the addressing of these holes is done by the coder chip that matches the first digit of the XADESHIMAL address. And it enables the correct peripheral. When none of the holes is addressed, none of these, the dynamic RAM is enabled in place. But there is more, because if you want to address the whole dynamic RAM, you can turn off this decoder through a switch signal and address always the dynamic RAM. This mechanism is known as bank switching. But John Bassett before, the Z80 starts booting from address zero, but the ROM is at address C, C000. How is this even possible? Because you would read garbage from RAM at boot. It's simple. There is a latch circuit that at reset or boot, it forces the ROM enable until a particular instruction from the Z80 is executed and it disables the latch. In this way, you would have this memory map with ROM all over the addressing space until the particular instruction is executed that restores the correct addressing space. So the code had to jump in the correct area and to execute this instruction. And this way is booted and with the correct addressing space. This latch is never used until the next reset. So one interesting part that we tried to understand at first was the video generation and all the video generation is done by this CRT controller that knows everything about the timings of the system for video generation. This is a very interesting information. It produces the synchronization pulses for vertical and horizontal and it always knows what it is displaying at that moment. So it can generate memory address to retrieve the character that is being displayed in that moment. Let's assume that this is what it is on the display. So it generates an address, for example, for the first character top left. This takes out the index of that character, which is fed into the character ROM. So this is the character with that we want to generate it to display. But it also knows which of these lines are being displayed at the moment through this path here, as you can see. So this selects one single line, which is then fed to this shift register, which is clocked at this speed here. And it produces a pattern of dots that if you are familiar with video signals, that's what it looks like, like the synchronization pulses and the data. But all of these also has some other interesting things peculiar to this computer. We have the video ROM, but we also have another memory, which is the attribute video memory that produces some bits that are fed into these combinatorial networks that we had to understand what they did. And speaking of combinatorial networks, well, you know, you can describe them using a truth table, but you may wonder now why didn't I put here input and outputs, but I put here address and data, because you know the answer. Well, combinatorial networks are just read only memories, so it can be implemented by these mysterious ROM here. So that's what it actually does. So we could generate all these effects by simple networks like this one, like two exclusive or gates that are triggered and can produce an inverted signal like this, or some other effects can be generated by shifting gear, as I say, that just clock the output shift registers at half the pixel clock, and this generates wider pulses for the data so that you can produce wider characters. And then the other network that we have is the vertical stretch one that simply replaces all the accesses to the character generation in order to address the lines twice. So instead of zero, one, two, three, you have zero, zero, one, one, two, two, and so on, and you have characters that have double in height. Since they say that the Sanko is desktop computer in the actual meaning of the word that it takes a whole desk, in order to work with this, we built our own adapter for the video signals that is based on an RP2040, as you can find it here. And so in order to make us able to connect to a small VGA monitor so that we could use on a desktop in a more compact way. Okay, and you should know what is this, I think. We have no software for this computer, so this is not a Sanko floppy. So when a friend of us published on the internet the CPM operating system for Sanko computer, we had to create our CPM disk for the Sanko computer. So we started studying how to manage floppy disks with the Sanko. We studied the BIOS routines. We learned how to write format read floppies. And we studied the floppy image of the CPM. In this way we were able to write a custom Z80 code able to transfer information from Sarah Port through to the floppy disk. And together with this custom Z80 code, a Python script on the PC side able to transfer the whole disk image to the computer through serial. This process, this whole process took about 20 minutes. I did it at midnight, the day after I had to go to work. I left it to writing and the next day we had the CPM operating system booting on the computer. In a single shot we were very happy to be able to do this with a single try. But we still had a huge problem. We didn't have the keyboard. How could we use this computer? A friend of us had this one in a working condition. He provided us with this truck that was transmitted on the wire of the keyboard. It was simply a serial. We built an adapter so we could connect a common modern PS2 keyboard to this computer and use it as a real computer. We also thought that all this knowledge was not enough to put in some repositories with some documentation with images and schematics and so on. So we started writing our own emulator that at this moment has a working Z80. That's easy because we use a third party library of course. We have interrupts that work in mode 2. We have the correct emulation of the CRT controller with all the effects that I described before. We have some serial peripherals. Among them there is the keyboard. This emulator allows you to debug your programs for the computer because it has an integrated monitor. You can set breakpoints, inspect memory and so on. Now there is currently a working progress for emulating some peripheral I.O. with GPIO. The floppy disk which is in a half working state at the moment. Since the project is starting to grow and grow, we need to add some tests. But I need to show you the killer feature of this emulator which is this one here. Oh, it makes a beep. So we need help with documentation, software and people who want to help us discover more about this computer because it is quite mysterious. So if you want you can join us. You can find everything on GitHub and help us. And now there is some time for the questions I hope. Thank you. Maybe we can take one or two very quick questions. Hi guys. Very nice work. Thank you. I was wondering if you compare this to a Sinclair Spectrum architecture because it looks really, really similar except maybe for the character generation. All the addresses there. Have you done a side-by-side comparison of some kind of thought about emulating a Spectrum 48K? Don't. I downloaded a lot of similar computers. I don't remember at the moment. A lot of computers that use the Z80 as its core. The Osmo uses the Z80 as its core. So I tried to compare them but they are all similar but not the same. There is everything that is different. Every time something is different. The strange thing about the boot ROM is documented online on a website that talks about the dynamic RAM refresh. But I have never seen it in other computers. So if someone knows this strange mechanism, feel free to. The site was about the dynamic memory refresh and it talks about the strange ROM substitution too. So the mechanism that allows the computer to boot even if not in C000. Okay offline. Yeah, I'm afraid we have reached the time. Many, many thanks. You're welcome. Thank you. Thank you.
Controlling a 6 degree Robot Arm using a 48K ZX Spectrum
So, please take a seat and be quiet. We start with the, unfortunately, I already did the last talk for this deaf room today with Rui who's talking about controlling a robot using ZX Spectrum. Okay, thank you. So, that's my name, if you need to contact me on top. I'm a software developer by profession and I do weird stuff on my hobby time. So let's start with this. So, we will talk about all these subjects because this was a project that took about 11 months to do in part time. So, we have little time so let's go through it. So, how this started in the beginning, I'm part of the LogZX Spectrum Museum. I'm an active member and the idea was we had a stand in 2022 on Lisbon Games Week, which is a show that talks about games and stuff like that. And we had this stand and we wanted to actually have more people on the stand. And if you can see on the right side of that photo, there was a bunch of arcade machines next to it. And that attracts a lot of people. And there was also this kind of claw machine that allows you to pick up stuff. And since it was free stuff, people was packing in a line just to go to that machine. So, the idea is we want that, we want to have a lot of people to go to our stand and make it more successful. So, I somehow convinced the guys that we should do something about it. And so, we set these goals that we want to attract people to the stand. We want to actually make something like a claw or something similar, then end it up as a robot arm. And we would like to also create a game that integrates with the robot arm to be more interactive. So, I managed to convince them that this was possible. A tip for you guys is bring the right t-shirt and then be crazy enough and they will trust you. So, what was the plan is to use a 48K spectrum and a robot arm. And we need to find one that doesn't break the bank because they are expensive machinery. And we want finally to integrate it with basic so that we can program directly in basic without having to know the specifics of the robot in particular. So, our plan was really simple. So, we have a spectrum, we find a robot arm and then we send some data to it and we receive some data to it and it's easy, right? Not really because that's the big problem to solve. So, what I did was found a robot arm. So, we found this lab vault 50 to 50 that is discontinued already about 2002 or 2003. And I bought it online for a fair amount but it was missing some stuff. Basically, the emergency stop button was not available which is the problem. And that little thing that is very useful is called the pendant and it's not available, they don't sell it anymore and I can't find one if you have one contact me. Okay, so that simplifies the coding of the robot by the way. So, it has two serial interfaces, one binary one, one text based and the binary one, since there is no official documentation for it because it was a closed source system. There is a research paper that actually did the reverse engineering of it. You can find it here but it seemed more complex and harder to debug. So, I went for the text based one which fortunately has a help command that helps a little bit out to actually interface with it and it's better for actually debugging it with the commands and try to understand how it works. So, some reverse engineering was needed. So, the first thing that was needed is determined that the protocol is actually, the text protocol is actually bidirectional. So, we need to send commands and get actually some data back and finding exactly what is the setup for the serial communication between the robot, the controller of the robot arm. So, that means that we can update our plan. So, we have a spectrum anyway. Now, we have this specific robot which has a specific controller and this controller has a lot of interfaces to simulate cells like industrial cells and communication with other robots. But, we will use the simple part. So, basically what this does is send controls to the motors of the robot arm and get some feedback from it like encoders feedback position where each of the parts of the arms are. So, what we need to do is come up with something that actually communicates with the robot. So, this is our goal and we know that this has to work with serial interface. So, why do we need one? Because there was something called Sinclair interface one and I don't know if you guys know about it but it says something from Sinclair and it was BitBang. This means that the software is actually hitting the bits of the hardware just to find out the protocol, the serial protocol which is very expensive in runtime. And it doesn't have a standard pinout which is a brand mark of Sinclair incompatible with everything and then you know you have to sell your own stuff. It reminds me another company, Apple. So, there is another option which is a Spectrum 138K but again it's a BitBang interface and it has a very weird connector that is hard to find nowadays. And all of those that exist, pre-exist have the same problem. They are BitBang which means it will steal us a lot of CPU. And since the Spectrum is 3.5MHz, we need to help it and not make it difficult. And we wanted to actually use the iconic Spectrum 48K to just be sure that it works with this one. So, our updated plan is to actually build some hardware to actually do that and have a serial reward in hardware that will eventually also do flow control so that we don't lose with buffer overflows and stuff like that. So, that means that we have to implement that interface that goes into the socket of the Spectrum and then actually do the communication. Besides this, since nowadays we don't like to work with cassettes to put the software in, it's better to actually find another interface and DivMMC is one good one for that because it allows you to have SD card with your software. And more importantly, it allows you to actually have an extra port behind so that you can plug the device that you are developing to because there is no emulators here, the hardware is new. We have to actually try it on the hardware. Okay. So, how do we build that? So, we know that there is the Spectrum interface bus that has all the data lines and control lines that we need. We need to actually define an address decoder so that we can actually access, read and write all the registers from the expected UART. So, it needs to define, enable, read and write so that you can actually control that ship. That ship will get data and control and configuration from the 8-bit bus. It needs some kind of crystal so that it has a stable clock signal and in this case it was chosen to be 16 MHz. And then we need a line driver, something that will actually convert the 5 volts that we have on the board or connected to the Spectrum to the minus 12 and plus 12 that we actually need for the serial line. And then we will do communication and transmit and receive and we also want to do RTS and CTS and shake for flow control by hardware. Okay. So, like the other colleague said a few sessions ago, this was what was in the drawer. So, it's this old PAL encoder chip for decoding logic or making logic expressions. It allows for 9 input lines and 8 input or output but with some tweaks you'll see later why. So, then select the UART chip that has at least 2 UARTs and by some luck it has at least 64 bytes of buffers which will save us in the long run. You'll see why. And we managed to work it with 16 MHz and then we need a pump driver to actually give us the right amount of voltage on the output like I said. And the max the 238 just because we need 4 inputs and 4 output lines because we are only checking the lines that we actually need and not implementing the entire setup of RS-232 or serial protocol. But we can do it later. So, this is the actual schematic. It's not complex but I'm not going through it. You can later see and check the GitHub. Okay. And so, this is the... We need 8 address lines or sorry, 8 registers for controlling the work. It has a lot more registers but only needs 8 addresses. And there is a special concern here because the spectrum has the ULA which is what makes a spectrum a spectrum that maps the port 254 as its control port but it only actually uses the address A0 as being 0. So, every even address is reserved for the ULA. So, we need to use an address that is even. Sorry, odd. And so, we will hop every odd port between 0 and 15 for the first comm and between 16 and 31 for the second comm. We have to be mindful that port 31 is being used by Kempion and Logistic. So, if you want this to be compatible, we still have to change this. My plan is to use something like an index register like we used to have in VGA cards and stuff like that. So, we'll use only 2 registers. So, this is the prototype board. I have it here if you want to take a look after this. And so, mapping what we said before, there is the interface bus, there is the address decoding logic with the PAL and there is the UART with 2 serial UARTs in there, the crystal, and we have the line driver. And here you will see the hack that we had to do because PAL didn't exactly allow us to program every combination. So, we had to do a workaround and there is a small cable in the back hiding the hack. But we can fix that later. So, how do we interface this with basic? For the time, Spectrum was at pretty advanced concepts. Basically, channels is a new concept. Back then, that is basically the equivalent of a device driver. So, you can plug in new drivers if you do the right thing into the ROM and into the system variables. The advantage of having this is that we have an abstraction layer that then allows us to use the devices and allows basic to actually control them as expected. So, this just keeps this kind of information, the input output routines and the special ID so that you can ask for it specifically. Besides channels, we have some other advanced concepts, concept that is streams. And streams is just a way to actually do the same we do today with streams. It's just a byte stream that can go as input or output and you just address it by ID and you actually create an instance of the channel or an instance of using that specific driver for that hardware. So, again, this allows integration with basic so we can use those commands and directly without actually having to know how it actually works. To know this, we have to actually understand how this abstraction layer is implemented and there is a very good book which is the Spectrum Disassembly, the ULI, and explains in detail, but you have to go really deep down to understand how it works, but that's the solution to actually make this work. So, by default, the ROM basic creates four channels, the keyboard, the screen, workspace or scratch, internal buffer and four printer. And we can add more channels to this, although the original ROM had the bug and didn't work except for these four, so it needs to be extended. But if you reserve the memory in the right places in the system calls, it works as expected if you replace the open and close routines. It has also defined a lot of streams by default and we can have up to 19 streams, but we should only mess with the last 15 because the others are internal and I wouldn't advise to actually change one and two because it can also mess the basic. So, after a lot of headbanging and problems, including during the event, it was able to actually have a working version with transmit and receive integrated with the ROM. So, we have two new channels named one and two for COM one and COM two. And like I said, we need to have an open close, a new stream to, sorry, routine for that stream. Currently, the coding GitHub is hard coded, so we can actually change the parameters for the ward because it was quick and dirty and we had to actually make it work for the event. And you'll see why in a bit. And we also have our code of the parameters to actually set up the serial line. So, what we did in the driver is to actually define, like I said, two new channels with ID one and two that correspond to COM one and COM two. And we need to properly reserve memory for that. So, regarding the robot controller, that big box that comes with the robot, it actually controls the motors, the feedback from the encoders and the limit switches so that the robot arm doesn't just blow up to the side. And it does the movement integration when you go from position A to position B as it needs to integrate all the several axes so that it ends in the right position. And it has a bunch of protocol commands that we need to actually interact with. So, these are some of them. So, the program Rdome reset, it needed to actually position the robot from the beginning. The torque and tree is just to release the motor so that you can position the robot by hand, which is very useful. Then run is to actually move at a certain speed to a specific position in all axes. And get port and set port is just basically to be able to interact with other systems. In particular, the programmatic system that was implemented. So, what did we need to do? The first application was to actually control the robot. So, we made a small application in basic that did some commands and then it did some move commands with the run command in a text. And it had a special menu that allows us to actually define and list a bunch of points so that we can then work the robot like a puppet and position the several positions that we want. And we just have a list and we go through that list according to some specific order. So, we can list and you can save these in particular and we can also define each of the axes and then position the robot in real time. Okay, next we need to implement the pneumatic reaper to make it more fast to be used in the event. So, it consists of a suction cup, a motor that creates the vacuum and a valve that actually releases the vacuum so that you can release the parts. And then this was the first version working in the test bench. But as usual, nothing works correctly when we plug all in, right? So, the robot didn't have enough power. So, as soon as we tried to use it, it just didn't work. I had to add a special extra relay and put some extra power from the outside so that we can actually integrate this new model. That's the cable you see there, like here. It's having to plug in more power so that I could have the pneumatic thing working. And next. So, besides that, this was a live event. So, there were kids around that were people around and they could put their hands near the robot and that will mean that someone could get hurt. So, we had to define a display stand. So, I designed this also to support the extra weight of the robot arm. Here you can see the first iteration of it. And you can see that's a kitchen cabinet that was later transformed to look nicer. On this one, you can see actually setting up the stand in the event. You can probably recognize that t-shirt from previous years on Fosdom. And as you can see on the right, sorry, on the center now, I was still hacking during the event to make it work because transmit was actually working fine. But receive was sometimes not triggering the right sequence because there was an issue with the ROM. It didn't like carriage returns for some reason to hacky around it. So, and then this was when it was already working and doing some interesting stuff just to try it out. So, then I developed the tic-tac-toe program that was inspired around Wargames movie. And tried to make it as educational as possible. So, the top is the tic-tac-toe is drawn using lines and commands from basic and then using UDGs or user developer graphics from basic. And then you can actually play the game. But it's not yet integrated with the robot movements because it was not possible in the time frame. So, it should be another two days work or something once I have the time to go for it. So, I would like to show you a simple video. Quick one because I don't have much time. So, the robot is moving. I'm with my hands on the spectrum controlling it and giving some inputs. And then trying to move and actually confirm that the suction is actually working. So, you'll see that the suction will actually grab the part. And I will be a lot happier when this started working. Okay. So, something like this is something that you don't do alone or is in isolation. So, I had a bunch of friends and people from the museum that somehow helped. And a special guest that came up to the fair, which also helped. It was really fun. So, we do this kind of stuff to enjoy ourselves and we cherish these moments and the legacy that is behind it. And you should do this to enjoy yourself. So, this is the end. If you have any questions, you can see the GitHub repository and fire away any questions you have. I'll give you a few minutes for questions. Yes. Anybody? At least I explained everything. Yeah. Everything was so clear. I can tell you some other quick, interesting things like the communication with the robot. What happened with the ROM is that when you send a carriage return to the ROM, it thinks it's an, sorry, not a carriage return, a line feed. It thinks it's to abort the input in basic. Somehow, they coded it like that. So, once I found that out, I had to actually see what the robot was sending. And it was not sending line feed and carriage return, it was just sending line feed. So, every input I sent, I would get a line feed and then it would stop my input. And then I'd run it again and then it didn't work. And then I ran it again, it worked again. So, it was intermittent every time. Until I found it was because the carriage return line feed, that was not matching. So, the driver had to actually swallow the line feed. And then it didn't work because that didn't have a carriage return either. So, I had to actually swallow the line feed and inject the carriage return. And then stuff started working. And that's what I did during the show, to actually make it work. Hope you liked it. Oh, by the way, we didn't finish the game, but it will be done eventually. Last chance for question. If not, many, many thanks, Louis. You're welcome. And thanks, Sebastian, for taking over this year, because I didn't have the time to handle it. So, and if you want to help, it's quite a lot of work to find speakers, find speakers again because you don't have enough. And then select talks because you have too many and stuff like that. So, if you want to help, then contact us on the mailing list, maybe. And thanks. See you next year, maybe. Thank you. Thank you. Yeah, you make us cry a little bit. Why? We burned, we spent the years. I was building and selling the joystick cards. Oh, okay, okay. The 8084. In the school, they allow us to print the boards. Yeah. So, I start to print and soldering and selling the joystick cards. I hope to eventually sell this, but since it's open source, anyone can actually set it up. You can see the one behind it. There is a hack with a wire. Yeah. Because the pile doesn't allow you to configure everything in the pins you want. Although I was following the input output, it was fine, but the combination that I needed didn't work. So, I had to do a hack and do it elsewhere. So, the connector, you got that from a mouser or something and chopped the ends. Yeah, yeah. You need to chop the ends. And you need to move the key. Yeah, because then it... And then, it's the thickness of the PCB, so you can just stick the PCB... Yeah. But there is a better way. You can actually put a metal part that has pins, and then you solder on this side, it will never leave. That's what I'm going to do next. Nice. I want to ask you, what did you use to program the dial chips, because I can't find a decent program. I have it there, but you can see it on the repository, on the GitHub. On the product information. Yeah. It's there. I'm using something... It was from... Not Palazzo, but it's something similar. Okay. And the software is still available. You can download it. Okay. You can use Windows 95 or something, but it runs on the latest ones. Okay. And it allows you to actually program that fine. Okay, thank you. We'll look at the other one. Yeah. Nice work. If you look there, I think it's even... This kitable is there or something. I'm not sure. Okay. There is an example with the hardware to program it. Thank you. And did it work? The other... What? I mean, did it attract more people? It did. It did. I don't know if there is another video. Yeah. In the... Let me try. If you go to FOSDOM and you open this one. Yeah. Is the screen still on or they... Shop it off. Sorry. So here you can see a video that was done by the museum. Right. And the idea was to... This is presentation. Anyway. And the idea was to... It's in Portuguese, unfortunately, because it was for that audience. But it's showing up. Thank you. Showing up how it works. So as you see, it's a huge beast. Yeah. Yeah. So that's why I couldn't bring it. This weights like 12 kilograms or something like that. Plus there is the box is hidden inside. I also had a VT420 terminal inside because it was helpful for debug. You can put it in debug mode and you can see the actual characters in the serial. That's how I found the line feed was missing the carriage return. And I used that to actually decode the rest. The last two nights I didn't sleep. It was like go, go, go until it worked. And then on the second day from a five day event, it started working and I left it alone for a while. What? Reverse demo effect. Yes, maybe. But I couldn't make the game actually work at the time because you need feedback so that when you say put your... the arm here, right? It goes to the arm... How much time does it take? You don't know. You need to wait for the feedback. And the feedback was not working because of the line feed. So my first experiment was like I send the command and then wait, then enter. For the next command I just wait for the robot to do the movement and then enter. It was like interactive. But then when it started working, everything started working. But I didn't have time to... and I needed to sleep. After two days without sleep. First two minutes. What's the reason you put it with the ZX Spectrum? Did you stand as a child? I did, I did. Because I'm part of the museum. I participate with demos and doing stuff. One of your specific... Yeah. To start picking up the... But I also do stuff. You find so many very interesting things. Yeah. Why is that? I've been growing up with Commodore 65 or 20. Yeah, I also... I would do that with Commodore. Yeah.
MAMBO - Dynamic Binary Modification Tool for RISC-V
Okay, hello everyone. We are here to present Mambo, a dynamic binary modification tool, and what's the better way to start the presentation than with a demo. So we are going to see a fairly complex application running on risk five within our system. So let's see it. So we are going to use it to learn something about the running binary. So here it is. Okay, so this is not our tool. So this is just an image viewer of Linux, and we generated this picture with one of these fancy AI tools so we can kind of promote our talk on LinkedIn. But what's really happening is that this image viewer is running under our tool that runs on risk five, and then we use it to find some information about the binary. So here we have a very simple tool that counts the number of threads that the application was used so we can see we have eight threads. So the application run under our tool on risk five, and then we can see that we have eight threads. Okay, but thanks for your first. I'm Igor. This is Alistair, and we are here from the University of Manchester. And as I said, we are going to talk about Mambo, which is a binary modification tool for risk architectures. Okay, but thanks. But okay, has anyone knows here what the dynamic binary modification is, or heard the term in the first place? Raise your hands if you did. Okay, wow. Okay, a few people. That's good. But you may haven't heard the term, but I'm pretty sure if you did any development, you used those frameworks. So the examples of the very known open source tools that do dynamic binary modification are Valgrind and KMU. So I'm pretty sure you use Valgrind and one of these tools, which is called Memcheck. And most of you probably are in the risk five room use KMU. So both Valgrind and KMU are dynamic binary modification framework, and they have a various tool built on top of that. So this is what Mambo is. Okay, but let's break down this term a bit. So what do I mean by dynamic binary modification? So dynamic is working at the runtime. So while the binary is running, the tool is working. Binary, we are working on the natively compiled code. So we don't need a source code, we just take a binary that was already compiled, and we can analyze it. And modification means that we can alter the application in a specific way. So we can add extra functionality, we can remove functionality, we can swap functionality. So there are two terms that are related to that. There is also dynamic binary instrumentation and translation. So instrumentation is basically a subset of modification. We just insert new functionality into the binary. So for example, if I want to do some sort of profiling, I can input some sort of counters into the running binary. And then translation is kind of an overlapping term. I can swap one, I say to another, so we could do it by modifying the binary, or there are more specialized tools that do the translation. So you are probably familiar with the Apple Rosetta, which translates now Intel to ARM when you got your new MacBook, but there is also the KMU also can act as a translator and usually use like that because they can translate one architecture to another. But now, so very few uses of the tools. So you can do program analysis, you can do error detection. So I'm pretty sure most of you are familiar with that use case, and there is a dynamic translation. OK, but now the question is why would you like to use Mambo if there are other tools? So the Mambo has been specifically optimized for risk 5, ARM, risk 5 64, ARM 32 and ARM 64. So in the stock, we are focusing on the ARM, on the risk 5, but we also have the version of the tool that can run on ARM. And the tool features low overhead, and to our knowledge, this is the only at the moment available DBM tools that has been optimized for risk 5. And the tool itself is fairly low in complexity, so if you would like to dive into the database, is around 20,000 line of code. So if you want to learn how it works, or if you want to modify the internals, the entry bar is not that high. And then it has a simple plugin API that allows you to write the architecture agnostic plugins. So you can write the plugin for risk 5 and later on you can deploy it on ARM if you would like. But it's worth to say it is not a toy. So we showed it before in the video that we can run fairly complex applications, so it's a full GUI tool from the ship with Inux. It could run stuff like GIMP or library office as well. So the tool itself is not a toy. OK, and if you are interested what the numbers would be roughly, so we evaluated it on the spec benchmark, so don't worry about too much about numbers. If you want, we can point you to the paper or we can talk about it later. But the idea is for like, FP benchmark, which is more like data processing. We get around 6% overhead if we just run the tool. We just run a framework without an extra tool built on top of it. And it's around 30% when we do more general purpose computing. So the baseline then, if you have no plugins enabled when you just run the tool under, if you run the binary on the RL tool, you get around 30% overhead. OK, so that was the brief introduction of what the dynamic binary modification is. And I'm going to briefly talk about how Mambo works internally. So I'm going to mention a few details, so it's useful if you would like to, I don't know, contribute to the internal of the tools that may help you. But the focus of the talk will be more the developer side, so I'm just going to talk about it as well. But I would like to just highlight a few bits and pieces so you will understand how Mambo works. OK, so this is the simplified diagram, and I'm going to talk you through the more important bits of that. So the instrumentation plugin API, so this is the part that Alistair is going to talk in much detail about, and I'm going to cover everything else. OK, so first of all, the first component is the elf loader. So if you run any binary on Linux, it has to be first loaded into memory, and then we can run it. So in case if we use our framework, so if we use Mambo, then the Mambo itself is loaded by Linux using its default loader, and then Mambo itself has to load the application, which we call a hosted application. So the Mambo has a custom build loader inside of it, which takes the application and loads it alongside the Mambo, so it can interact with it, it can modify it, and it can run it. So that's the first element. The second element is instruction and decoder. So while we execute the application, we have to modify some of the instruction. We have to know what instruction we are copying and scanning and modifying, so this is what the instruction and decoder and decoder does. So you may be familiar with the custom project, which is like a fully fledged assembler. This is a very simple module that basically takes a text specification of the instruction and what the fields it has and uses some rubbish scripts to generate the C functions to encode any code fields, and this is what Mambo uses because it's fairly simple and low overhead, and that's something that we want inside the tool that runs dynamically. Okay, and now the two most important parts of Mambo, it will be a code scanner and the dispatcher and the code cache. So let me maybe first talk about what the code cache is. So we have our Mambo, and the Mambo uses the loader to load the binary into a memory. And now we want to run this binary, but we also want to modify it. So if we just load the binary and run it, then it will run as it would be before. So that's why where the code cache comes in. So this is not the instruction cache that we have hardware. This is just allocated space in memory that we call the code cache. And now the Mambo scanner will copy the instruction from the binary that we loaded into memory into the code cache. And in the process of copying those instructions, we can introduce any functionality, we can remove some instructions, we can replace some instructions. So the scanner is responsible for copying instructions from the binary that we loaded into the code cache. And then the code cache is what will actually execute on the processor. And then we have a dispatcher, which is responsible for actually running the code. So the scanner will copy a basic block, and then it will say, I finished copying a basic block. Now I go to the dispatcher and dispatcher will start the basic block, and it will actually natively execute it on a RISC-5 processor. And then when we finish the basic block, the control will return back to Mambo to scan the other basic block, and then again we'll go back to the dispatcher and dispatcher will execute the next basic block, and it will have this back and forth. And if the code is ready to the code cache, we don't have to scan it so we can directly execute another basic block without scanning. So this is very simplified. If we did it that way, it would be very, very slow. So there is a number of optimizations there. So for Mambo to stay in the code cache as long as possible. So it does scan things ahead of time and tries to guess what would be the next thing it jumps to and then if it can do it, then it can stay within the code cache. Otherwise it has to go back to the scanner and back to the dispatcher if it doesn't know what the next basic block is. Okay, and this is what I was talking about. So when we execute the application, we have a single process with two binaries in it and two contexts. So there is a Mambo context that scans instruction, and then the dispatcher changes from the Mambo context into the application context. So it will save the state of the Mambo, jump to the code cache, execute the code cache as long as it can, and then if it cannot find the next target in the code cache, it will go back to Mambo. So it will save the application state, restore the state of Mambo, and then the scanner will kick in and then it will go back and forth. So this is like a principle of it, of how it works. Okay, so the dispatcher and the scanner are like the two main elements in Mambo that allow us to do the modification and execute the code. And the last thing is the kernel interaction. So on top of just executing the application, the framework itself has to interact with the Linux kernel, so we have to handle and pass signals and handle and pass system calls. So this is important because for signals, if there is a signal coming from the operating system, it will first hit our framework, so it will first hit Mambo. But if you don't want Mambo to handle the signals, in many cases you want to pass it to the application because the application may have a handler installed to handle this signal. And in the same way, if there is a system call, so if the hosted binary is doing a system call, for example, let's say a thread creation, Mambo needs to know that it created a thread because it has to track every thread that gets created. So the Mambo has to learn first what was the system call and only then it can pass it to the Linux kernel. So that's also, I talked briefly about the architecture of Mambo, so we had the L flow there, we had the instruction encoder and encoder, two main elements, one free management scanner, dispatcher and the code cache, and then we had a bit about the handling signals and system calls. So that's, if you are going to just use Mambo to write your plugins and the tools probably you don't have to know all of that, it may help to know how Mambo works. And if you want to contribute to the internals of it, that hopefully will give you some rough idea how the system works. But now the bit probably people are more interested in is how we can write our own plugins, our own tools within our framework. And for that I will pass the microphone to Alistair. Hi, so yes, I will talk to you about the API, this is how you take Mambo and you build your own tool on top of it. So this is where it actually gets really useful. So we've mentioned use cases but it's worth repeating. We're talking about things like code analysis so you can build a control flow graph, you can generate new functionality, you can instrument code, you can analyze it, you can re-implement library functions, you can patch library functions, you can do all sorts because you can modify this running binary. So Mambo's API exposes events, so it's event driven. So you as the user of this API define functions which you register as callbacks on these events. And when one of these events is encountered Mambo will trigger the callback and execute the function that you registered to it. So there are two categories of events, there's hosted application, runtime events. So these are events that happen to the hosted application as it's being executed in the code cache. So here we're talking things like system calls, thread creation and we have Mambo scan time events so these happen as Mambo is scanning instructions from the loaded elf into the code cache. So this is something like pre-instruction, post-instruction, you can do stuff with these callbacks. So as I was mentioning pre-instruction, post-instruction, this kind of gives you an idea, you can insert something before and after an instruction, before and after a basic block, before and after a thread. So you can see it can be very, very fine grained or it can be at a high level of abstraction and of course before and after an application runs. So taking all of this, you see a slightly chopped off diagram there but it kind of gives you an idea of the order in which these callbacks will be executed. So at the very highest level, at the very start you have the initialization function which is where you set up a plugin and then you'll have pre-thread so that's quite high level, pre-basic block, you also have pre-function and so it kind of gets narrower and narrower and then it kind of expands out after these things have executed. So this is something that's important to bear in mind. So how do you actually use Mambo's API? I'm going to talk to you about the following things. So the functions that you'll need to register your callbacks, the functions that perform code analysis, the functions that perform instrumentation, so how you actually emit code into the code cache and then there are various helper functions which you can use. So the first thing you need to do is initialize your plugin and this is done in the plugin constructor function and there are two main things that you do here. You create a Mambo context which is a global data structure which holds the current state of Mambo and also the application that's being executed by Mambo and pretty much all of Mambo's helper functions will use this context to get for instance the current instruction that you're looking at. And this is also where you'll register callbacks. So for instance here we have Mambo register pre-instruction callback. So before an instruction is actually scanned into the code cache something that you register here will execute. And to register callbacks it follows this signature so you have Mambo register then you have an event time so that's pre or post something happening then you have the event so this can be Mambo pre-instruction callback. So it's quite easy to remember that way. So you've registered your callback so let's say we're building a plugin that counts the number of branches that are executed. So you've registered a pre-instruction callback. So now Mambo's scanning things and your pre-instruction callback has executed. So one of the first things you're going to want to do is use a code analysis function. You're going to want to know which instruction am I looking at. So you have things like Mambo get branch type, Mambo get condition which would for instance give you the condition of the branch that you're looking at if it's a conditional branch. So these give you information that you can use and choose to act on. So the function signature of these analysis functions follows Mambo action so that would be get set is and then the information. So Mambo get function type, Mambo get branch type even relating back to our example would get you the type of the branch that you're looking at. So bringing all of this together into a simplified plugin we have the constructor where we initialize context and we register a pre-instruction callback and when that's executed we get the branch type and then based on what type of branch it is we do something. It's also worth pointing out that the branch types that we're looking at here are generic so that's how it is portable between architectures. So you've found out you're looking at a branch. Now you're going to actually want to emit instrumentation. So this is instructions that you can put into the code cache to do something. So for instance we have emit64 counter increments so this is how you can tell Mambo to emit the instructions that you need to increment a counter. You can emit pushes, you can emit pops, you can set registers so you can do all sorts of things and there are two main types. You have emit instructions so that would be for example emit increment so that's more portable because we implement the backend tell Mambo which instructions to emit into the code cache for that. And then we have the more architecture dependent ones which are emit risk five instructions so this is when you know exactly what you are trying to achieve with the plugin. Let's say you need to emit an arithmetic instruction. You can do that until Mambo emit this arithmetic instruction. The only drawback to this is that it's riskier doing that. You have to make sure that you save and restore registers and that kind of thing which we do for you in the safer generic ones. And then finally you have additional helper functions so for instance Mambo will expose a hash table which is really useful for when you're instrumenting code and you have lots of data to associate with different addresses. So we have hash tables, we have Mambo allocator so these will help you to write your plugin. And then finally it can be very difficult to get your head around this. It took me a while to fully understand it and that is the difference between scan time and run time. So when we talk about scan time we talk about something that happens once when Mambo is scanning something and run time is when that scanned code is executing in the code cache and the reason this difference matters is if you are for instance counting the number of branches that are executed at scan time you need to emit instructions into the code cache to increment a counter so that when that code is executing you get the actual number of instructions, times that instruction is executed. Okay so it's time for an example. The code I'm about to show you can find on the Mambo repository in the plugins directory and it's time for a live demo. So I will be running Vim under Mambo on risk 5 to show you the source code of the branch counter plugin which is something that you can run and is in the Mambo repository and whilst running Vim I will also have enabled the branch counter plugin so you can see it in action. Sounds very convoluted I know. Okay so here we run Mambo and I don't know how well you can actually see that but... Command shift plus. Oh command shift. Hooray. Do we need more or? Bigger. Oh bigger. Even bigger. I'm trying to call it that wrong. Okay yeah. Okay so we start with the constructor function which is where we set up Mambo's context and we're registering four callbacks so we have a pre-instruction callback, we have a pre-thread callback, a post-thread callback and an exit callback and the order that these will actually be executed in will go pre-thread, pre-instruction, post-thread and then exit. So I'll start with the pre-thread. So in the... Let's hear some more. Oh yeah yeah yeah. In the pre-thread handler we're initializing the counters for that thread so we have a direct branch counter, indirect branch counter and return branch counter. The reason why we have this per thread is because each thread has its own code cache and therefore its own numbers of branches that we'll be executing which is why for each thread that we create we initialize its own set of counters. And then we have a pre-instruction callback. So for each instruction that's executed we're checking if this is a branch, we're getting the branch type and then for each of the types of branches, the return branch, the direct branch and the indirect branch we select the correct counter for that thread and we then emit a counter increment into the code cache so that the correct counter will be incremented. Okay so at this point Vim is running away, running away and when we close it the post-thread handler will first be executed and this will say okay so this thread is terminating let's take this thread count for each type of branch and add it to the global total and it does that atomically and then finally we have, oh yeah the exit handler which just says okay this application has now terminated let's print out the global totals which are composed of the individual threads. Since Vim is a single threaded application we'll get one thread and one total which you can see there. Okay and now I'll quickly talk to you about some lessons that we learned from porting Mantlot to risk 5 because it was originally written for ARM so there are differences that we had to take into consideration. So the first thing was the range of branches. So for conditional branches and direct jumps they have a range of branches and they have a range of branches. So for conditional branches and direct jumps they have quite a limited range which is less of an issue on ARM because they have a much longer range. Why this matters is because in a compiled binary obviously the offsets will be fine because that's how it was compiled. When you take that code and you put it into a code cache it's done as it's needed and so the ordering of that code may be different and therefore the offsets may be different and exceed the offsets of the original binary. And so we may have to replace these instructions with instructions that have a longer range. So with a conditional branch we may have to insert an additional jump instruction that is triggered when the branch condition is true to extend the range of that branch. And same for a direct jump it may need to be replaced with instructions that first load the address into a register and then take a register jump. We also have load reserve and store conditional. You can only have a limited number of instructions between these two instructions and you can't also have a limited number of instructions between loads and stores in between otherwise the lock will fail. This matters in dynamic binary modification because we can insert additional instructions so we have to place limits on what you can do with atomic instructions in plugins and with other optimizations implemented we have to be mindful of this limitation. And finally we have the thread pointer register X4. There isn't a dedicated register in the general register file on ARM that does this. And so when we create a new thread Mambo will save and restore the context by saving and restoring all registers. We need to make sure that the thread pointer actually points to the newly allocated thread local storage otherwise there will be a world of pain which we found out. Okay so in terms of road map where we take it from here we of course want to foster our open source community. We really welcome collaborations and contributions not only plugins but also any contributions to the main internals of Mambo. As part of this we are currently in the process of improving documentation and also developing more tools to kind of give people a flavor of what's possible. So for instance we're currently porting Mambo's Memchecker from ARM to RISC 5. We also are trying our very best to keep up with all of the new RISC 5 and also ARM extensions that keep appearing. We also have various research projects ongoing that make use of Mambo. And probably goes without saying since this is a talk at FOSTEM but Mambo is open source on GitHub with an Apache 2.0 license so definitely check it out. And we'd like to thank our sponsors. So yeah any questions? Yeah. Oh yeah yeah. So you're asking how do we handle pointers when we scan code from the binary into the code cache. Those pointers are still pointing into the binary. So we actually in the scanner we have instructions like that specifically. So for instance if we take a branch instruction the first time that branch instruction is executed it will point to Mambo's dispatcher which will perform a lookup. We then have optimizations which will replace that branch instruction with a direct branch to the next basic block. And the same for loads and stores. We update these to point to the new location. So basic block is a single pointer. Oh sorry. Yeah I'll repeat the question. So what is a basic block? A basic block is a single entry single exit point. So you essentially ends when there's a branch to somewhere else. At the back. Yeah so in a general case. Oh I keep doing this. So how often is the load reserve store conditional an issue. We find it's not that much of an issue. Most applications won't have an issue with it. It becomes more of an issue when you have plugins that do something in between. So for instance if you're counting a specific type of instruction that may occur between these two instructions and you emit stuff into the code cache you may end up exceeding this 16 instruction limit. You mentioned translation early in your presentation. Does Mambo support running ARM on the RISC-5 machine and vice versa? So does Mambo support translation? Not currently. You need to be on that architecture. What happens if I try to run a just-in-time compiler under a Mambo? What happens with a just-in-time compiler? I'm not sure. So the Mambo is designed to support self-modifying code. So basically what it does, you have some code in the code cache and just in time compiler recompile it so basically the cache will be flushed and then it will re-scan it again. So it carries some performance penalty but it will react to the things like that and it will re-scan the code and put the new version into the code cache. So it does support self-modifying code. It should be. Hopefully. This isn't tested on RISC-5 because most browsers don't seem to be ported. Any other questions? So what do we interested in about RISC-5 applications from plugins? We're interested in building tools that kind of perform things like memory checking, data race detectors, that kind of thing. So tools that are very useful to people developing software on RISC-5 to kind of help them do that. So just out of it, so we haven't mentioned it on the slides but we also have some research. That was for R but done on the architectural simulation, so kind of code design of accelerators and CPUs on the SOC system. So there's some stuff going on but yeah. So at the moment I think for RISC-5 the biggest push was to get the base system to work and now we are exploring on RISC-5 what we can actually do with the system. Any other questions? Does it update sections that refer to pieces of code like jump tables, different things between basic blocks? So the question is about does MAMBO support the jump tables? How does it do? So we do not rewrite any of the sections of the original binary so basically MAMBO works in a way on demand. So we have a jump that uses a jump table. MAMBO will try to remember the most recent jumps but then if you miss it you have to go back to the scanner, scan the code again and then go to the dispatcher. So we are going to use the addresses that are already there and then we are going to keep the translation of some addresses in the code cache but none of them. But we are not going to rewrite the actual jump tables in the data section of the binary. Any more questions? Okay so the question is about the data-raised detector and whether we could implement some sort of stepping back within MAMBO. So the data detection is in the early stages but you will not have such a verbose functionality as RR or GDB replay or whatever but what you can do in the very easy way when you scan the basic blocks. So you would have to probably have some sort of we don't have functionality to detect the data-raised. But let's say in the general case if you want to inspect what's happening you can introduce a trap instruction into the code cache and then you can run under GDB and then you will trap the instruction and you can inspect what's in the basic block after the translation and you could try to look what was in there before the translation. So you can do some sort of things in the manual but there is no automated way to replay and go back in time. Thank you.
Lessons from porting software to RISC-V @ RISE
Thank you everyone for coming. So I'm going to do a quick talk on Porty Software at RISV. So thank you very much for attending. So just to quickly introduce myself, I'm a software engineer and team leader at RIVOS. We are a hardware company doing RISV CPUs. I work on the management time team, where on the management time system arrays and profiling team. So our scope is we work a lot on OpenJDK, Python, Go, system libraries like OpenBLAS, Sleaf, like Math, stuff and everything profiling. I also have a hat at the language and time working group at RISE. So what is RISE? It's a collaborative effort to accelerate the development of open-source software for the RISV architecture. So we are basically a consortium of companies who are interested in porting software to RISV. We're investing a lot in that. I'm going to get back to it, but we're doing a lot and we would love to have you involved as well. The focus of this working group is on OpenJDK, Go, Python, .NET, Android, Runtime and V8. Most of them already support RISV to varying degree like we're also going to see after. The focus is really on the compilers of these different runtimes, on the runtime themselves, like the libraries, the base class libraries and everything, but also on the ecosystem. Make sure that the most used Java libraries, most used .NET libraries, most used Python libraries and everything are well supported on RISV. Also, my last hat is also part of Adapter Working Group, where we are distributing Java. We're making sure that there is a Java distribution available for 11, 17, 21. It's in progress. We're getting close, but you should soon have a distribution, like a rendered distribution of Java on RISV. Let me just increase the size here. Who is the intending audience of this talk? The talk is really for people who have some experience in RISV and who want to get more involved, but also for people who have very little experience with RISV or no experience with RISV. But it sounds exciting, right? And it is really exciting. There's a lot of work to do and it's a lot of fun. So I will not talk assembly. Don't get scared. And if you don't know a concept or words, please ask. I would love to have a bit of interaction. Also, the target is application system. So anything like smart phone, laptop, desktop, servers, and HPC. We're not going to talk about embedded. We're not going to talk about microcontrollers. That's not the topic of this talk. So first of all, I want to give a huge shout out to Unsexyful. They made a lot of things a lot easier to port to RISV. So there have been years of investment in porting a lot of software to RISV. And that just makes a path easier for RISV. There's a lot of libraries out there, for example, which support X86, RISV, PPC, and F390X. So adding support to RISV to that is very straightforward. Because well, adding RISV is just like one more flag, one more configuration somewhere, and it's pretty easy. There's already all the if-desfs. There's already the compiler support. There's also already the cross-compilation support, CI setup, and everything. So RISV is just like one more thing. For libraries and projects which only support X86, though, the work is a bit more involved. Obviously, we have to support, we have to add support to the build systems to support, for example, cross-compilation. For the sources, we have to teach it that well, not everything is X86. There is some assumptions about X86. And so you want to make sure that you root out all of these issues. In terms of resources, in the RISV ecosystem, the kind of the reference is the RISV GitHub organization. It's spread out, but it's really the most complete. For anything related to Scala instructions, you have like the RISV ISA manual. For the vector instructions, you have the vector spec. For the vector crypto instructions, you have the vector crypto. For the vector spec, and for the vector intrinsics, you have the vector intrinsics spec. So there is a bunch of documents. They are a bit spread out, but they really are the reference. The GitHub organization is really where the work happens. There is work happening on mailing lists. There is work happening in meetings. But in the end, all the spec, all the documents, all the result documents of all the discussions is on GitHub. So GitHub really is the reference. It's quite unique compared to, for example, X86 or ARM or other architectures. It's just very open source. It's very easy to access. So it's easy to look for. It's not easy to find, but it's easy to look for things. Also, watch out for peer releases. Things will change, may change in between the peer release and the release. It will just be weird because, for example, for the vector spec, there was the 0.7 release, which some hardware vendor have adopted for some of the boards. But the encoding of some instructions or even instructions have completely changed, meaning that if you are coding against vector 1.0, it's just not going to work on any board which implements vector 0.7. And it's just not going to work. It's not, sometimes it's going to fail. No, not going to work. There is also this very nice resource I found recently, the RISV Intrinsic Viewer. So it just allows you to look through the vector Intrinsics implemented in GCC and Clang. There are 14,000 Intrinsics. So it's very nice to have a nice thing to look through. A lot of them are very complicated. Like for example, there is like 5,000 about vector load and stones. It's just because of the combinatorial explosion of what kind of load you want, what are the sizes you want to load, what are the element sizes you want to load. There's a lot of repetition. There's just a lot of them, so it's easy to look through them. So the first question you should ask yourself when you are wanting to receive is, what am I targeting? As you may be familiar, RISV has a concept of extensions. And so you kind of want to understand, OK, I'm writing software, but who is going to be able to run it? Which boards are going to be able to run it? The most basic one is Rv6 for GCC, for application servers or application processors. That's really the most basic ones. Even the first board, like high-five-enished support set. Can of the second step is bit-manip. So anything like ZBA, ZBB, ZBS, which allows you to do some bit manipulation, like rotate. I don't have all of them in mind. But like bit manipulation stuff, like where you want to look at one bit of a word or thing like this, this is going to be in bit-manip. A lot of boards implement that. I think from the high-five-enmatch, which was like the second board released in the market, it supports it. So it's very, very, very common. Then you have Vector. Again, please use Vector 1.0. Please don't target 0.7. Today there is one board that supports it. It has one core. It's not very fast, but at least it supports it. But nearly nothing supports it. And Vector Crypto, don't want to support it yet. It's probably going to come like one, two years down the line. So that's really, we expect that to be the future. But obviously it's been ratified two months ago or something. So nothing implemented yet. To kind of simplify things, there is the concept of profiles, which is a RISV international concept with RVA 20, RVA 22, RVA 23. There is RVA 24, there is discussion. There's going to be RVA 25 and et cetera. You can find the spec at this link as well. But the idea is to kind of define a set of extensions that allow software to target a specific profile and how we're to say we are RVA 22 compatible, we are RVA 23 compatible, we are RVA 24 compatible. And just like makes it easier than having to support a hundred different extensions and you don't really know what to target. So what's certain, what you can target today with very good certainty is RV64GC plus Bitman Ip and that hardware probe for vector and vector crypto. I'm going to talk about how we'll provide after. What's putting my crystal ball, like using my crystal ball, I can say that the expectations for the future is RV23 plus vector crypto. Please don't quote me on that. It's just a crystal ball. That's the expectations from looking at. But let's see how the future evolves, but that's the expectation. Knowing that in RV23, vector would be mandatory. So that would basically be RV64GC, Bitman, vector, vector crypto would be the future. I don't know when it's going to happen either, but that's the expectations. So in all of that, hardware probe is your friend. So what is our probe? It's a Linux kernel syscall that allows you to check for hardware capabilities. For example, it's going to allow you to say, well, do my machine support ZBA? Great. It's going to tell you. Do my machine support V? Great. It's going to tell you. Do my machine support ZVKNHA, which is like SHA256 extension. It's going to tell you. It's very rapidly evolving. For example, in the last Linux release, which coming out like 6.6 or 6.8, I don't remember, they added like 15 or 20 extensions which are probed as part of the syscall. So it's always better to be in the last version on Linux to use that. And once you know what you want to target, you want to look at your compilers and run times and libraries. What is supported on RISV? So there is always support for RISV in many compilers and run times. For example, GCC, LVM, OpenJDK, Go, Python, .NET, V8, and RISV. And anymore, do support RISV. There's various degrees of quality and support of this RISV. For example, I think kind of red of the curve, this GCC and VN OpenJDK support very well RISV. Bit back of the wave, I'm not going to cite them. But it's, for example, they will only support RV64GC, which is functional, but it's not very fast, but at least it's functional, right? You can actually test it. It's also very rapidly evolving. For example, like with the team, we contribute regularly to the OpenJDK and there is like, for every release, there is like dozens of commits improving. Like for example, hey, now we support this extension. Hey, now we actually are editing with these instructions. So it's going very fast. So it's important to really be on the latest and greatest. Like for example, I said, I talked about the kernel before. Same thing. Vector was like six months ago or something. So before that, you didn't have a Linux kernel release with support for Vector. So it's very important to be on the latest ones. Also it's more and more libraries. Support RISV. That's where most of the outcome work I expect because it's great for example that GCC and LVM support RISV, but if any of your 20 data pens don't support RISV, like it's not going to work, right? You need all of the libraries to be supporting RISV. So RVI is maintaining this very nice page, which it's not very full, very complete, but it's a good reference, which kind of highlights some of the projects we have which are supporting RISV. Sorry. If you project to RISV and don't sit on this page, please report it to our VDI. I think they will be very happy to add it. Also in that, huge shout out to all the contributors. I don't know you, but thank you very much for all the work you're doing. Without you, all of that would not be possible. And also, many of them are doing it on their free time. So thank you very much. So what are some of the difficulties and gotchas that we ran into? So the main one is like hereby assumptions. A lot of code has been written for a long time for RISV. For example, who assumes that the page is 4K, right? So luckily, only five pages are 4,000 bits, but some architecture is in the past after it, hey, what if a page is 16K? Who knows what's going to happen? Why don't things break? So something specific to RISV, the vector length specific code. Some architectures, like for example XA6 and Arm Neon, for example, they say a vector is going to be 128 bits or 256 bits. And the software engineers who write code is going to know that's the vector length and I'm going to write my code for that. RISV decided to be smart and decided that there should be no, everything should be vector length and elastic. Well, it makes it a bit harder to write software because a lot of software assumes that vector are 256, 512, 128 bits and they just don't know what it means to be vector length and elastic. Also, can you know connans? Whenever you try to represent a nan, in XA6, it's basically any value of nan is valid and if you do a multiply of 1 by a nan, any value of nan is going to return you this value of nan. So for example, it's going to keep the sign. That's kind of the main issue. But in RISV, if you do 1.0 times a nan, or minus nan, it's going to return you connecal nan. That's part of the spec. That's how hardware should behave. That means that if you multiply 1 by a negative number, it's going to return you a positive number. Right? It's not a number. So what should be the behavior there? Well, RISV says it should be positive. So if you try to do each sign of this result on C++, it's going to return you something different on XA6 and RISV. So that's gotches. Well, we will have to look out for that. Something else, memory model. It's stronger on XA6, weak on RISV. I think I'm 64 degree up there as well of teaching people that weak memory model are different and they're worth looking out for. Something else that we are going to have to look at. Also the vector spec simplicity. It's very simple to program. Hard to implement in hardware. You have some instructions that can do something very complex. And so to implement it efficiently in hardware, it can be very difficult or even impossible. Meaning that sometimes you have your stream instructions. Something takes like five cycles, 15 cycles, maybe 30 cycles. And then suddenly one instructions cannot be executed by hardware, has to be emulated and takes a thousand cycles. And you don't see it as like your programming user space. You just don't see it. It just works. But then you are like, oh, I vectorize my code and it's 50 times slower. What happened? Like I don't understand. It should be eight times faster or 16 times faster. It's slower. What happens? Well, that's going to be the problem. So solutions here. Test it. Figure out what's happening. Use profiling tools. Figure out what can go wrong. And eventually refer to vendor specific information like optimization manual that may tell you this instruction is slow on the hardware because X20. So great. Now you are developing, compiling and testing. Like, you know, how do you test things basically? So here, QMU is with friends, also membo. So functionally, QMU I think is the most complex. Purely because when people want to make a new extension, they implemented on QMU to test what it would mean. So basically they prototype it on QMU. And then whenever it's modified, it's like, well, it's there, right? We prototype everything. So it's just there. So it's vector crypto, for example. QMU was the first one because it was prototyped. So it was easy. Yes, please. You just talked about how you need to measure the example of X instructions. Yes. How well does that translate to QMU? Going to talk about it after. Yes. So the question was, how do we measure performance on QMU? That's next slide. Oh, second, next slide. So user space simulation is also easy enough on the Ubuntu or Debian based. You do add getting stored QMU static. And then you can just run a Docker image with RISC-5 stuff. Like you literally run RISC-5. It's Ubuntu and RISC-5. It just works. So it's very easy to test out. It's great for most testing. There's some leak abstractions. For example, on QMU, Parat is 7.0. Proxy.info would read the host one. So you're on a RISC-5 environment, RISC-5 binary. You read proxy.info and tell you support AVX2. Why? Like, not in the right thing. What's happening? It's also not particularly, yes, please. No. I was wrong. OK. Post 7.0, that specific thing is emulated. So you don't have this problem anymore. But also QMU is not particularly fast. It can be like five to 10 times slower, like on single-core performance. So if you want to test, have a machine with 64 cores, things go faster. Also linking for executable or libraries, like large ones can take a long time because linking is usually single-core. So it can be slow. Also debugging can get pretty complicated because you attach GDB-grade. But do you attach it to the X86 QMU process? Or do you attach it to the RISC-5 that QMU is trying to emulate? And yes. QMU can act as a database server. Yes. But that's a bit broken in some ways sometimes. And it's, yeah, sorry. So the question was, GDB can act as a GDB server? QMU, yes, sorry. QMU can act as a GDB server. It works most of the time, sometimes it doesn't. That's where it can get a bit complicated and sometimes it's just like... So that brings me to the next point of, well, the other way of doing it is just cross-compilation and testing on deathboards. So obviously you have faster build-up times because cross-compilation, like your compiler are native, so it's just faster. Per-MEs, well, first of all, does your project support cross-compilation? That's not a given. Also, today's boards have a mutation. They don't support everything. For example, I mentioned before, vector, vector, crypto. Vector only one board supports it. Vector, crypto, no one supports it. So you cannot test any of these algorithms. You are stuck to QMU for that. Also, hardware bugs. Not every board just behaves perfectly. I'm not going to name names, but you have, for example, some boards have atomic bugs. So you do an atomic operation. It's going to say success. It's going to say failed if it did succeed. So mutex can be a bit complicated to implement like that. So you have bugs. So on that, retries, again, is your friend. If it fails once, try a second time. If it succeeds, and it succeeds always in another board that you know doesn't have the bug, probably just do the hardware bug. So for CI, QMU is your friend. Again, on GitHub Actions, for example, which is quite nice, a lot of free time, it's a one-liner. You do like use Docker setup QMU action. And suddenly you have QMU setup on your machine. It takes like three or five seconds to set up. So it's also very fast. You don't even need Docker. You can just use QMU and D prefix and the C straight to kind of help you have like a RISV file system like slash ETC, slash user, slash home, slash everything. And QMU is going to take care of loading the file from there rather than the host X86 stuff. And then it just works. Like you have a RISV machine on GitHub Actions for free. Slow, but it's functional. Also, you can tweak the available CPU options. For example, you can say, I want ZBA, ZBB, ZBS, but I don't want vector. For example, I'm working on my fancy library, which I added vector. I want to make sure it's still going to run on the board that don't support vector and it's not going to crash. Well, I can do that. Or I want to run with vector lengths of 128 bits. Great. You do. You just specify V-Linux 128. It just works. 256 bits, same thing. You can go up to 16,000 if you want, but it's poor of two. Basically, it makes it very easy to test a lot of different configuration for free on CI. So that's where I come to the question about performance. So performance measurement is not different. It is not psycho-accurate. It does not even try to be psycho-accurate. It's just not what's made for. Usually, psycho-accuracy measure, there is some open source one, like Gem5, but the vector-specific ones are extremely secret, obviously, because it contains a lot of information about the macro-architecture. So don't ask me. I will not say any information about our psycho-accuracy. What you can do, eventually, is interaction count. The problem is it's very inaccurate. If you go from 10 instructions to five, but if instruction takes 10 times the latency, yes, you're going to use less instructions but it's going to take longer. So that's not perfect. Better than nothing, but not perfect. Second bad thing is bolts. The problem is imaging, optimizing for high-end servers, like Crashing and Fall, sorry, on the Raspberry Pi. The preference profile doesn't really match. Like the bolts we have today, all in order CPUs, there's few cores, scalability is not very great. For example, if you go to 64 cores on the board, which was meant to be four cores, memory accesses are going to be slow as soon as you load it a bit, so it can be limited. Also, only one board again supports vectors, so that's the KNMV K230. So you cannot even really test vector. You cannot really test vector monthly-thirty either. So it's getting better, slowly, but it's getting better. Yes, please. I think the vector version is 0.7. On the KNMV K230, it's the first one that supports vector 1.0. It's, for example, the Leachie Pi with a 3.910 that only supports vectors 0.7. And also the optimization manual, so that's something that we're working on at RISE. It should be coming in the next few days, but stay tuned. And it's kind of a guide of how to write performance codes with the input of companies who are actually making chips, and so who knows, okay, this kind of thing is going to work well on our chip, and so we came together and we said, okay, these are common guidelines for writing efficient RISE 5 code. So I think, South, it's fun, never too late. Please get involved. There is so much more work to do. There is more work that you can imagine. We have enough work for the whole industry for the next five to 10 years at least, so please get involved. Check out Wikidotrizeproject.dev if you don't really know where to start. We are trying to outline some of the work that we're planning to do and we're welcome contributions. Also, if you have an idea, make a proposal. We pay money for that. As in, if you think that there is something, like some project that should be ported to RISE 5, we're ready to sponsor it, so it's paid open source work, which is not that common. And finally, thank you to all contributors again. Without everyone, it would just not be possible. Applause Yes, please. So what about running software for the two years and testing the... Softon, can you define softon, please? So basically, I know very low implementation of the... Any other, you know, any of the compiling the FPG and running it on FPG? Yes, OK. So it tested with very fast and very key Mew, but you know, not that fast. So the question is how about running, like so literally softon on FPG basically have a very low guarantee on implementation compared to FPG and run on that. We are doing that internally for cycle accuracy. Even that is way slower than QMU. Like I think in order of speed today, it's bolts are still a bit faster than QMU. QMU, as long as you have a lot of calls, like 30 to 64 calls is ideal. FPGA, emulators, this kind of stuff, and software emulation of RTL. If you reverse that in terms of accuracy of the results you are going to get in terms of performance, software emulation, emulator, FPGA are going to be the most accurate because they are actually trying to be cycle accurate. QMU is not trying to be cycle accurate. The bolts are by definition cycle accurate. But the big advantage of emulators and simulators and all of that is you can literally have a trace of this instruction took that many cycles and this instruction took that many... like a stream of every instructions executed. And so you can have a very, very precise and accurate representation of this is how my application ran. Obviously you run like something that would take 10 seconds on XC6, you generate it for like 300 gigs of data. But you have very precise and accurate information. Yes, please. So my question is more on the hardware vendors now, right? So if somebody starts to take out today, right, what should he... because of the RTL development and then the tape out itself takes a lot of time, right? So this is... you're looking for one or two years. So what should somebody start with if you said the greatest today is this? But if I have a product that would go in this time, I can have some iteration in between, right? But the problem is again that something changes in the vector library and then my entire timing constraint gets discovered. Yes. Should I only be starting what you suggest now? Or can I have some kind of a plan that I can update six months, eight months, if something new comes? What is your suggestion for this? So the question is what should be the target? If today I want to make a new hardware, what should be the target profile, for example? And do I understand correctly that you're also alluding to the cycle timing of interactions and things like this? Because it will disturb the logic, right? Yes. If the number of cycles for the same thing will be different if I have a better vector crypt or something coming. Yeah. If we look at what's being done, for example, by the companies who are talking about what's happening, right? There is Sci-5, there is T-Head, there is Ventana. Well, let's take these three as examples. T-Head, as the latest announced, was the C908, which does support vector 1.0, but they announced it only like a few weeks or a month after a vector crypto was announced. So obviously when they announced it, it's because they knew it was going to tape out. And so vector crypto is complex spec, so they didn't have time to implement it. Ventana has also announced that they would support in their second gen chip the vector 1.0. I think that was announced at RVA Summit last November. But I don't think they are targeting vector crypto because same thing, the timing was like when it was announced, vector crypto just came out or was coming out in next week. So same thing. Sci-5, same thing, they announced vector 1.0 chip, but same thing non-vector crypto. The expectation is that vector crypto chips are going to be announced like a year from now, maybe a year and a half, maybe two years. We all hope it would be two months from now, but things shift. If you start the chip today, I sure hope that you are targeting RVA 23 plus vector crypto. But that's obviously for starting a chip today, and that's easy for me to say, but how to... Knowing that the chip takes... If you look at Intel's timing, a chip takes 5 to 7 years to take out. So you start today, you deliver it in 2030. Does that answer your question or...? Yeah, it's a soft talk. So what I get that today, if I start, I can't just move vector crypto. If you write after today, I think you can expect to have boards having it in 2025, 2026. But that's where Harropov is good for you, because you can check if Harropov is supported by the hardware, then use this path, else use the other path. And that's how it's done in OpenJK. That's how it's done in OpenSSL, that's how it's done everywhere, because it's funny how we say hardware has a 5-year lead time. Software also has, it's not just like, I have my commit and then everyone has it, right? You need to have a release and then a release is not enough, because it needs to be shipped in a distribution, and then hopefully it's a LTS, and so you easily have 2 to 3 years lead time in software often. So it's important that if you are doing crypto stuff like OpenSSL, do vector crypto today, because the time that it gets into the hands of like Reda Hat 10 customers, or Ubuntu 24.0 for anything like this, well you want to have community of things like a year ago, 2 years ago. So then these hardware vendors are talking to RISE to get this input, because you are kind of connecting all the open source. Sci-Fi is part of RISE, Ventana is part of RISE, Revis is part of RISE, Alibaba is part of RISE. It's really, like, we want to push the software forward, because we also understand that software needs to be ready for, when we get the bold out, we need people to actually have software to run on it. So here I talk really about a lot of user space stuff, but RISE also has interest in kernels, in debugger, in firmware, in OpenSBI, in emulators, in debuggers, in like everything. Like we want all the software to work, and even software and a bit of firmware to work, right? So we need everything to work physically. Thank you. Yes. I'd like to say that I think you're asking as well about knowing what instructions to use to target performance improvement in software, right? So how do I today use the best equipment that we know we're going to be able to do? So the simple answer is, Maxi, the compiler again is, the question is, what's the problem? Are there any already know what patterns are for the compiler? Or already know about the appropriate pattern sort of under hardware? We will get instructed, and I'll just say you need to do the micro-commodization. You want to go down to the example, or that's by the optimization going on. Yes. So general advice for what to do. It's not going to be perfect for all architect. Similar to AR64, training the way for what the RISC-5 software. You have an AR64 implementation, not too difficult, but RISC-5. If you have a RISC-5 implementation, which is semi-optimal, you know you're not going to want these useful rocks, that contribution into an open-source project, even if it isn't specific to your micro-architect, it's very valuable. So just to repeat what Kieran said for the people online. If you're trying to do some optimizations, the compiler is provided by the vendors, usually is going to have vendor-specific information. Usually things that have not landed upstream yet, or that have landed upstream in the latest, like Trunk, for example. So it's not released yet, and so they just want to make sure that they have, like customers can have access to the best compiler for their, for whoever buys Ventana or ReVos or Sci-Fi stuff. And then if you're trying to do really handwritten assembly optimizations, then that's where you start referring to the optimization manuals, where there is the rise one, but you also expect ReVos to have some, Ventana to have some, Sci-Fi to have some, but they should be very specific to the hardware, and the rise optimization or RISC-5 optimization manual will be very generic, and for example, how to best use vector instructions, like what's to avoid, what's not to avoid, how to basically best use it across most hardware. So I think like OpenBLAST in that is very interesting, because they are planarized on X86, right? They have 20 or something different kernels for X86, based on the generation of Intel and based on the generation of AMD, because they know that the timing of instruction is different based on whatever, so you can compile for Nehalem or you can compile for Sci-Fi, or you can compile for Xenversion3 or Xenversion4, and then the kernels are going to be different on this kind of stuff, and I think OpenBLAST is very specific in that, because they really try to extract, yes, it's like a linear algebra library, but it's, I think it's really pushing the thing very, very, very, very far, but that's kind of the kind of things that is going to be very vendor specific, and which you will have to refer to the vendor spec or vendor optimization manual for that. Thank you for the question. Yes, please. Two questions. So one, when you start boarding open source software to RISC-V, you probably have to talk to the developers or the maintainers as well. Yes. How often do you get the reaction, we have no idea what RISC-V is? So the question is, if you're trying to port project RISC-V, what is the, like, how often does it happen that the maintainer says RISC-V, like, you don't know what it is? Honestly, pretty rarely. I think Hacker News has done a lot of great job of, RISC-V is exciting, RISC-V is new, can't involve in RISC-V, people don't really know, oh yeah, it's a new architecture, right, it sounds exciting, right, but they don't really know what it is, but at least they heard about it. About three years ago, I was in a talk about Flist, Yes. like a variant of OpenBLAST, and I suggested that we should look at RISC-V, we had no idea what it was, and these are pretty low level people, right? Yes. That was three years ago, but I think it's too late. Yes, so your comment also, for like, looking at the beliefs of, like, three years ago you said, people didn't know what RISC-V was, but hopefully today they know. Hopefully today or when it works in RISC-V? Today it works, yeah, so, I think, actually the, also like, fun tidbit for OpenBLAST, like, Serge in the room just got the RISC-V branch of OpenBLAST merged into the trunk branch of OpenBLAST, so there was some generic support for RISC-V in the developed branch, but there was some, like, vendor-specific optimization in the RISC-V branch, and we just merged it, right, so that's also why, even for things as important as the West, but I've got things that are evolving fast, that's why, that's what I mean by, if you don't know how to get involved, you know that there's a lot of stuff to get involved in, and look around any library you generally use, probably one support RISC-V, so please add support RISC-V to that, it's going to be nice. And then another question is, there's different aspects, there's getting it to build, to compile, there's, you talked about performance a bit as well, what about running the test suite, because that's like, there may be 100,000 tests, there may be five failing on RISC-V, fixing those may require a lot of work to get it to actually fully back. Yes, so the, your comment was, yeah, we can compile, we can test, but what about the five last tests which are failing on RISC-V specifically? Yeah, so either it's assumptions which are taken which don't hold on RISC-V, that's usually the hard one to figure out, or, yeah, well the code is generally broken, but again, like, because it's running on X86, it was just working, that's why I think having support for M64, F390, XPPC helps, because the projector really knows that it's not just X86, and so they developed it in a way that works everywhere. QMU in that is very interesting, because, well, if you're in QMU on X86, the memory model, they don't try to emulate anything, so it's just the host memory model. So you test on X86 with QMU, so you have your Raspberry, based on your memory model, it's just going to work. You go to a bold, suddenly it's going to fail, and you don't know why, but it works on RISC-V, or it works on QMU, well, but the thing is it's TSO on QMU, weak memory ordering on RISC-V, and that's the harm bug to track. Yes, please. I just wanted to add the R-M-I to the exact same issue. Yes, so the comment was R-M-I is exactly the same problem, and yes, it's exactly the same issue. Yes, please. While porting some software to RISC-V, I had a problem that didn't appear even on ARM64, so there are even bits that... Yes. ...that ARM64 doesn't help. It was that GP register that isn't on... Possibly. It's a long story, but it was related to GP register that worked differently, that is not on ARM64, and is on RISC-V, so there are even stories like this. Just to repeat your comment again for online, it's not because everything works on ARM64 that everything is going to magically work on RISC-V. Yes, there are some RISC-V specificities, but yeah. Yes, please. I'm just curious, you may be outside of the scope of the effort, but it seems like there's an opportunity to provide possibilities for the hardware vendors in terms of the code that's already forwarded, the prevalence of specific instructions in the app prices so that the hardware vendors know what to focus on in terms of the performance. So the remark was there is opportunity in looking at the dynamic traces of instructions for hardware vendors to figure out what do we need to invest in, and yes. Yeah. To shun them, yes, there's a lot of opportunity there. The question usually for that is more what workload should we focus on? And that's a very tough question because obviously it depends which market you're targeting. Also given the certain market, there is five different software, five different frameworks, which one are you looking at? How do you use the frameworks? Which library? But yes. There's a... Yeah. Just further on the general subject of not obvious problems and QMU are analyzing them. So the single most useful, not obvious thing with QMU is a mode for alternative and most effective operations where it will crash the diagnostic components in the register. So the remark was QMU has a mode where you can trash... You can trash or cache? Trash. Trash. Okay. You can trash the vector register length if I understand... So for masks and for tails of the mask or the elements... Okay. ...and elements that have been built into that vector... Yeah, you can basically put garbage into the tail elements and other kind of things like that to help you test so that you will actually see that you are using not what you expected to use in the vector. And that just makes your life easier for testing because it's going to crash instead of just failing silently. Yes, please. So the question is for precision and compatibility do we want QMU in ARM64? So the fun thing is for adoption that I mentioned at the beginning like to build Java, we are actually using an ARM64 VM just because it's cheaper on the cloud. And yeah, I mean it's compilation but you run the GCC compiler basically and it just works. Like there's no question. So you are stress testing GCC but it works. For general testing, yes, but for example, GitHub Actions you only have XC6 machines. So I think it would be better to test on ARM64 because yeah, you have week memory monitoring, you have all of that, but it's harder to access. Yes, please. Yes, so the comment is with ARM64 SV, which is the Scalemore vector extension, like vector length agnostic extension, it will help with five to basically let the world know that there is not just vector and yes, absolutely. Like I mentioned in the slide right after, but there is a project called XCMD which is a project called XCMD. And it's a project called XCMD. And it's a project called XCMD. There is a project called XCMD which is kind of abstraction over vector stuff. And yes, it assumes that the vector is 128, 512, or 256 and you don't have a choice. So feeling resiving that is a bit painful. Yes, I think there was. Yeah, sorry. Okay. Any other questions? Okay, well, thank you everyone. Thank you.
Unleashing RISC-V in Managed Runtimes: Navigating Extensions, Memory Models, and Performance Challenges in OpenJDK
Hello, my name is, does this work? Or? It's green, it's good. It's green, it's good. Yeah, my name is Robin N. I work at Reavals with RISC 5 and I'm mostly working on the OpenJDK. So I'll talk about some of the experience with the OpenJDK. And unfortunately for me, I can't lie too much because I see some experienced OpenJDK people in the crowd here. So we'll see if they correct me. So yeah, this is basically what I'm going to talk about, the OpenJDK, the JIT, which is kind of important for a new architecture. We're going to mention the trampoline lines as Mambo. We have some cross-modifying code. We talk about all the extensions we have, a bit about sign extensions. And I was going to talk something about canonical NANDs, but I think Ludovic made a good job of it, so I might just skim that through. So I'm not sure how much anyone knows about the OpenJDK, but it's a huge C++ code base with inline assembly and there is a lot of C++ code which is architecture specific, since we have different ABI's on different architectures. So the C++ code needs to know what's ABI for these architectures. We have a template interpreter, which means we basically have assembly snippets implemented for each thing we want to interpret, which jump to each other, so it's not a C and a switch statement. We have two compilers, C1 and C2. One is very fast and one is a bit slower. The first one is usually compiled with profiling, so we keep profiling the interpreter, we keep profiling when we compile with C1, then we compile with C2 and we drop the profiling because the profiling eats from your performance. And we also generate the template interpreter is actually generated during startup, because we customize it because you might use a GC which requires some specific load barriers and stuff, so we generate the code for the interpreter and we generate a bunch of other code. Like we have a lot of assembly which is glue between like the runtime, the compiler, the interpreter. So, the risk five port, it's fully functional, all great. Well, we are missing some optimizations and when we say fully functional, we mean with limited tested. We are done, have done, as Ludovic have talked about, testing is a pain as we have small boards, we have QEMO and OpenUDK have a lot of tests. We have tests that can run for like a week, just one test. If you take that and try to run it in QEMO, it will take forever. So, we have JDK 21 and 17, we are working on the 11 to get the port done for 11. I wouldn't recommend JDK 11, I would recommend at least 17, because it's much faster, it's better and you also get a GC. Yeah, the other platforms since like x86 have had like 25 years of optimization and our report is like, I don't know, four years, so we are missing like at least 20 years of optimization in the codebase. So just in time compilation, why? Yeah, of course the obvious reason is because we want to be right once run anywhere, but we also have some other things in the OpenJDK going on. We have a dynamic class hierarchy, as we can do class loading or we always do class loading, otherwise we wouldn't get any classes, which means that the hierarchy is changing. So it's not such a good idea to try to pre-compile because at any given time your class hierarchy might be different. So even if you did pre-compile, since mostly everything is virtual, it's virtual by default, you would just do just virtual calls all over the place. So that would be slow. So but with YIT and profiling, we can avoid virtual calls and we can speculate a bit about the class hierarchy. When do we compile? Yeah, we compile hot methods. And as I said, first we compile with C1, we keep profiling, then we can compile with C2. So what we do is kind of a speculative compilation, which means that if we see you have never executed this branch in your method, we may choose to remove that branch and put in a trap instead. So if you actually want to run that piece of code in that branch, instead we trap, the optimized will go to the interpreter. And we can do the speculation based on the profiling. So if you have a hash table and you put cars in it, you call hash code, we can and I can guess that this call to hash code will be on the car. So we don't need to do the Vtable lookup, we can instead guess that you're putting cars here so we call hash code for car. And until we get proven otherwise. Yeah, so we also need to do some cross modifying code. So when we're kept compiling something, compiling is a bit expensive. So if we can just change the code instead and update whatever was what was missing, so we don't have to deoptimize and recompile, we will do that instead. So I'm jumping directly to talk about a jittered call site. So when the jitter lays out a call site, we have two instructions, jump and link, jump and link register. And when we lay out the call site, since we have a dynamic class hierarchy, I forgot to say that on the first page, but classes are loaded on first use, which means the compiler is not allowed to load classes, it has to be used by the program. So when we lay out a call site, we might not know where we're going to call because we don't want to do a resolve. Because resolving the call site might mean we need to load classes. So when we lay out certain kinds of call sites, we need a full range of that call site, which means we have two options, we can either load the address or we can materialize the address. Materializing requires a bunch of instructions. I think the example here is just materializing six bytes or something, maybe someone that is fluent in assembly can tell me. Yeah, the reason why normally you would maybe do a table look up here, but we wanted to actually lay out a direct call as we can without any loading of data and stuff like that. So that's why the call site looks like this. And for the full picture, it actually looks even like this. So we actually lay out a smaller call site in the code, which calls a trampoline, which will load the address, which is just under the jump and link in the trampoline, and then we will end up at a destination. But as I said, a dynamic call site can be unresolved, which means when we get the code, we actually just point the trampoline to a resolve stub. So the first thread that actually executes this, we'll need to resolve this call, whatever it's going. So if this was the, if a is the car.hashcode, when we lay out the code, we don't know this, we need to resolve this and figure out that this is the receiver of the call. So then we have cross modifying code. What is cross modifying code? It's that one core is writing the instruction or changing the instruction stream, and we have another core executing the instruction stream. It's a bit complicated, of course, but OpenJDK does it a lot too. Avoid recompilation is basically the thing we want to avoid, because especially during startup when your class graph is changing all the time because you keep loading classes, if we compile something that looks hot, we don't want to remove it directly and recompile and remove it and recompile. Instead we can do the speculative compilation and layout code and fix it a bit later. So you can talk about two types of cross modifying code synchronous, which is basically you're waiting for the other CPU to fix the instruction stream ahead of you. And here's an example with the modifying processor do a store to the instruction stream, then release the guard. The executing processor waits on the guard. When it gets released, it picks up the new instructions. It's not that easy, but pick up the new instruction is not just a simple thing, but I'll get to and then you have the asynchronous cross modification where we just store something directly in the instruction stream. Executing processor might see the new or the old instruction. We don't know. We need to handle both. So back to our example here. So one Fred calls to resolve. After you have resolved the know who's the receiver of this call, it will patch the eight byte address stored in the trampoline. So anyone else that does this call will reach a. But we still allow friends to see the old destination, which means that both of the old trampoline and the new trampoline is valid. Since if you see the old one, you will hit the resolve stub. You will see that this quality is already patched by someone else and you will just go back and re execute it and then you will pick up the new destination, which is a. Yeah, so point to so when the executing Fred actually sees the new instruction stream in especially in the said Jai said Jai the extension for cross modifying code. We talk about point of unification. So that means that modifying processor and executing processor agree on the global state. So I'll use the terminal leave from that extension. I'll mention it more later. So we have patched the trampoline. Well, good. No. So someone loads a B, which is also of a type. So we have a new receiver here and we actually need the V table look up. So we need to patch trampoline once more and add. A V table look up before we can land on a because it could have been a B. So the trampoline is not patched just one time. It can be patched. Well, I think at most two times, but yeah. And in this case, all three. Ways of calling is that is alive at the same time because you can still see the so one Fred lagging behind can still see this resolve. Someone else might see this jump and someone might see the V table. We allow all three to be OK at the same time. We do this, but we have a small piece of code in a which verify when you did the jump to a you had the right receiver as your intended target. But that becomes really complicated. The main point of the slide is to show that we need to be able to patch the whole site multiple times. So what we're doing here is actually not cross modifying code on risk five as we have a the as we do an LD on the eight bite address and we do actually a store of eight bite address. It happens to be in the. Just below the instruction stream, but it's actually not read as an instruction as we do an LD on it. So in this case, we're actually not doing cross modifying code since we load the address with an LD. So but. There's still some problems with this. First of all, as the the address is just below the addresses, your pipeline might try to decode the constant as instructions. You also have the problem with reading from the same cash line that you're executing. Some process might not like that. So you have the same cash line in I and D. And we also have the overhead of the jump from a to trampoline. So what we are suggesting suggested in on risk five. Yeah, and I can also mention as we need this place, atomic or patchable, that's why we can't use the ally since it's seven instructions and we can only patch one instruction atomically. So for this case, we're suggesting that we actually do the load directly at the call site in a and we only have the address as a piece of metadata instead of a full trampoline, which means we get rid of one jump. We put the address on a separate cash line. So it should be faster on any risk processor. This is just the general philosophy of open JDK, meaning that in hot pass, we don't have any synchronization. We allow execution of stale instructions because like you know, if you have your ISB instruction on a arch, it's really expensive. We cannot have that in hot path since we try to compete with C++. So in slope of we try to reach point of unification. If you're on AR64, that means that there's probably an ISB instruction in your slope off. Yeah, and there's a list of other examples of cross modifying code. JIT itself is cross modifying. It's compiled by one thread. Pointer is installed by one thread and another thread is picking that pointer up and jumping to the JIT code. So that in itself is cross modifying code. The third in this solution is when you do a field access. The class for the field access is not yet loaded, so we don't know the offset for the field. So we basically say, oh, you need to fill in the offset here. So the first thread that hits this path needs to load the class. If it's not loaded, figure out the offset, patch the code. And then you have different barriers for the method because they can get invalidated. We might need to update the method. So we have guards and barriers to protect the method. We can have addresses of objects directly into the code stream. So when the GCMOS an object, we need to change the immediate for that object that was moved. We can have GC barriers as immediate values. So when the GC changes color, we might need to update the load barrier to reflect the color change. Yeah, point of unification. So if you're running your AR64, that usually means you're doing an ISB. We don't have that. What we have is something about fence.i, which is not so good. What we're doing today is something really crazy. For every write we do in a page that is from the JIT, meaning we think we're doing cross-modifying code even though my first example was not, we're doing RISC-5 flash iCache, which means the kernel will do an IPI on all CPUs and emit fence.i. So every write we do, this is really expensive as from the last page, if we put in like GC barriers, which need to shift color for every load of object in the instruction stream, meaning that we might change 10 places in one method to reflect the change in GC color. So there will be 10 writes just in this method and that will cause 10 IPIs. That means that every write we reach point of unification. So it's working really well with cross-modifying code RISC-5 with OpenJK at the moment, since we actually don't have any races basically, since we do the IPI on every write. So I see in like a really small board it costs a half a percent of performance. On a large real CPU server class, maybe 2-3 percent of performance decreased due to all the IPIs all the time. Yeah, point of unification, the modifier needs to make the stores visible and executioner needs to make sure the instruction stream is invalidated and so he picks up the new instruction. But we still think we can do a bit better with what we have, since fens.i is an unprivileged instruction, we can actually emit it ourselves in the slow path. So we don't need to do the IPI, but we need help with context switches. So you're on your heart to use RISC-5 terminology. You emit your fens.i and think you have invalidated your instruction stream, but the kernel moves you to another heart. So if the kernel moves you, the kernel would need to emit the fens.i so you know that on that whole heart also the instruction stream is invalidated. And what it's going to save us, we hope, is the ZJID extension for IDI. ID synchronization. So instead of fens.i we would get an import i, but more importantly we will get a limit on the instruction fetching. So ARCH allows out of order fetching, which is problematic for us. So if you have a call, when you do an A, Y, P, C, jump and link, even though if you not bit out by first not being out the jump and link, and then you not about the A, Y, P, C, the iFetch could fetch the jump and link before the A, Y, P, C. So it reads the A, Y, P, C before you not that then it reads a not from the A, Y, P, C, then you're toast. So ZJID will specify how the iFetching will work, what we can overwrite without tearing instructions apart and stuff like that. So we're hoping we get that in place well this year. How long have we been going? Okay, that's fine. Yeah, that brings me to extensions. We have a bunch of extensions. When I looked, maybe this is totally wrong, but I found 60 ratified, which adds instruction for RV64. That's 450 base instructions, and I found 45 unratified adding another 400 base instructions. As an example, I took this fall I was looking at the CRC32 a bit. So OpenJDK have an implementation of it in Java, works fine, but you probably want to have an intrinsic for it to make it faster. So then you can make your table look up intrinsic with the base ISA, which is the standard CRC32 intrinsic. But you can also use Kerala's multiplication to do even faster intrinsic. Then you have your scalar Kerala's multiplication in the CBC extension, but you also have Kerala's multiplication in vector. So there's a possibility to have four implementations of the same CRC32 algorithm, one in Java, one for base ISA, one for CBC, one for vector, which is too much. Also at least I'm getting really annoyed with the architecture description through your compiler. And this is just the first of four lines. So if you have a server class CPU, I'm not sure how long that can get. So as Ludwig was talking about profiles, we're hoping that we get nice profiles. Right now RV823 is perhaps the one that looks best. And for the JIT, you need to add an option for every one of these. But we have HVPROB, so we can get it automatically. But there is like, you get an extension, you add an option, then you get HVPROB. So make sure you have like, so basically you need a 6.9 kernel or something to make everything work nice. 6.8 maybe is the next one. So I recommend using 6.8, which is released in, I don't know, because otherwise you need to add all the options on the command line. This brings me to the next problematic things. We have some major extensions like do your CPU allow misaligned access? Do you have vector, what are your memory model? We allow to turn off. Yeah, so the JIT, since we do this cross modifying code and stuff, we're really sensitive to code layout. So if we change anything with code layout, we would like to test it. Since you have so many options that changes the code layout from the JIT, we have so many combinations that we would like to test, but we only have basic boards in QEMO. That makes it really hard to guarantee that your combination will work fine, because I guess everyone is testing a combination which will be something for the CPU they are intending. So I think there's a lot of combinations which are not tested much at all. We also have the compressed. Yeah, we have an option for it. You can turn it on and off. We have an assembler that just changes the instruction for you if you want. Since we're sensitive to code size, some parts are fixed size, so just to make it easy for us, we turn off compressed in certain parts, because we want it to be at a certain alignment or certain address. We see 5-10% code size reduction. One thing we can do is, since you know the compressed just have 4 bits for the registers, we don't consider that, so we just use registers. For example, we have the heat base. If you have compressed points for your object, we have a base for it, which means every time you load an object, we need to materialize the full address. That one is in X27, which means we never can use compressed for that. So if we were to put heat base in another register, like X14, then we could use compressed more. Next, which Ludovic touched on, about memory models. We have your weak and your strong model. In OpenJDK, we're often dealing with free models. We have the hotspot memory model, which is from the 90s, I think. So it predates C++ and C11. Then you have your Java memory model. Then you have your C++ memory model. Since we have two hardware memory models, we get a lot of mapping around that. So we basically have six combinations here. And that also, extension, increase the complexity. Because then you have like SACAS, which introduce the CAS, which means we need the CAS for the memory model also. So yeah, again, if we're going to test all combinations, it will be really costly. Yeah, sign extension. Maybe it's just me, but I'm not a friend with it. So sign extension is when you have a word and you need to enlarge it. Oh, yeah, I only have a few minutes. So that's good. So you want to enlarge it to a word. You need to replicate the sign bit. So we present the sign as of the word when we treat it as a double word. And we do this because some of the instructions use the full register, branch and or, for example. So this is all fine when you let the compiler do the work. But as we have so much assembly and we do, yeah, type less passing, we have templates with inline assembly. So you get a type T and then you're supposed to put in your inline assembly. And we have type aliasing, meaning we have one type and we access it through a pointer to a different type. So when you write all this, you need to think about both like the short representation of your word, but you also need to think about the word as a eight bite. So I get confused and suddenly my branches go somewhere else because I forgot sign extension. So I'm not a fan of it. Yeah. And sign extension. I don't have much to say more than what Ludovic said. I had a, this is one example. If you're writing Java code, if you use this method, you will be surprised because if you have a negative NAND and you ask this guy, you don't know what the bit will be. It depends on the instruction and stuff. And the C++ version is even more complicated because compiler may choose to evaluate that at compile time, which means you get whatever the compiler think design flag should be. If you execute it in runtime, then it depends on the instructions. So if you see anyone using such one functions and they don't consider not the number then there might be a bug. So sorry, one too many. So yeah, I personally like RWA 23, but of course want said JID. So we can formalize the cross modifying code. And also like some of the more atomic extensions, I think SACAS is just optional in RWA 23. I would like it mandatory. And also would like one more instruction to materialize a 64 bit immediate. So that will help. So we don't, because the load we're doing in the trampoline, even though we remove the trampoline, we're doing a load, which means we can have cache missers, which means that the call can be really expensive. And all additional loads we need to do for the JIT itself or for the JIT code, its memory bandwidth. So when you're competing with other platforms, which can materialize a large enough immediate and have it atomically patchable, it's hard to compete when we can't do that in those cases. So I guess the road to one instruction to materialize a 64 bit will be long. Thank you. Yes. Two questions. First of all, is there a limited interface to send more dense ice through the UTI? You can use it with the, for the IPI, you can use the G-Lib C, cache flash, eye cache. So there is a G-Lib C function you can call, which do this is call for you and fixes it. Yeah, so that's, I can't remember if you changed that or we're using G-Lib C wrapper. So there's a G-Lib C wrapper over this is called. So you can just say, I want to flash eye cache. I can't hear. Yeah. I haven't given it much thought. So I'm not a big fan of compressed. So I don't mind what we're doing now. It's just that there might be, so from what I've seen, it's the smaller board which gains performance from compressed. The big out of order CPUs we're waiting for, we don't think there will be much difference. So we haven't spent time on it. I forgot to repeat the question. Yeah, sure. So I'm not sure if you're going to be able to measure the code size decreased, but were you actually able to measure any sort of performance? Using the Vision 5.2, I've seen some performance improvement, but that's an in-order, simpler CPU. So yes, on Vision 5.2, I see some performance improvements when using compressed. Yes. And you're using the Vision 5.2? I have one at home. We have many boards, but I have that one I have sitting next to my desk, so I often use it. So yeah. Well done. No corrections from the GD code.
A framework for RISC-V SBI verification and ISA extension validation
I'll get one more minute. Sure. You got some naked nathalus. You got sick just kind of out of the blue and then like, didn't have any. I thought it was. It was so sick like, nothing. It's sick sick or sick as it is. Yeah. I've never known. Yeah, I've got the beer induced one. I'm the last speaker and it's our hero that fills in for, you know, the. Yeah, I want to thank one or two missing. So yeah, take it away from. Yeah, thanks, Bjorn for letting me fill in. I heard I had wanted like a 15 minute session just to kind of advertise this framework because I'd like to encourage people to contribute to it. I ended up with a 30 minute or however long minutes the session is because of the cancellation. I'll have an hour. Don't worry. No, no, no, no, there's lunch and I can do it four times maybe. Anyway, so quickly just about who's standing in front of you talking. I work for Bentana. I work on Linux kernel, also KVM, open SPI and KEMU. And I'm trying to build, you know, the software system that we need for risk five. So I'm also participating in these RVI working groups and rise that we'd heard about earlier today as well. Prior to that, I worked on air 64 before risk five air 64 red hat also virtualization. So the Linux and the KVM bits KEMU as well. I've carried over into the risk five world as part of the vert stuff that I did previously. I got involved with Katie Munitess, which existed before my time because it's quite old. But I started, I wanted to use it for air 64 specifically. And so I did some ports. I'll support it to power PC and then kind of left that for others to maintain. I don't think it's getting a lot of action, but it's there. And I'm bringing it to risk five. And that's what this talk is about is the fact that we now have this tool available to us. So the outline is just Katie Munitess. First, I'll give a quick overview of the framework generally. And then it regarding risk five, the use cases I see that we could apply it to right away and also as the framework evolves. And then the the and you part is my kind of appeal for contribution. So, so as I said, Katie Munitess is actually quite old. It's as old as KVM. Avi created it shortly after his first couple of KVM commits in order to start testing. So to make sure it actually works. And then over that time, though, we've we've been expanding its targets. So now we can actually test not just with QMU as the user space is originally, but with KVM tool or you could probably put in Rust VMM or whatever you want in there. Cross VMM. I mean, with some efforts, probably it doesn't just drop in at the moment. But you can already test other hypervisors. People do that. And we can even test it on hardware now because we've added at least x86 and air 64 at this point, the ability to boot over some sort of a if you capable boot loader. So then what is this test actually that I keep talking about these KVM tests? And so they're actually like a little tiny guest kernel because that's what Avi needed for testing KVM, right? He needed to have a guest, a guest OS that would have to boot and maybe exercise some stuff that the hypervisor needed to provide for it. So that's what they are. These little guest kernels and originally, you know, kind of booted in maybe hacky ways or whatever. But over the time, we've actually tried to build the framework in a way that is easy to port and easy to maintain. And so we even have DT support in there, some limited ACPI support for this booting. Like I mentioned, we can boot with CFI protocol, which helps us to be able to do the booting over hardware directly rather than through hypervisor. And then for air 64 ARM and RISC-5, I've also taken my notes from the Linux kernels boot requirements. So, you know, particular registers need to be set in a particular way when you first jump into the kernel code. And so we follow that protocol and then it makes, you know, everything just kind of work for bootloaders that already know how to do that. Any bootloader that can boot Linux in this direct way can boot these unit tests. And so, yeah, you're in privilege mode because it's like a little kernel in kernel mode. So you can do all the things that you would do, manipulate the page tables, set up your own exception handlers, generate exceptions and make sure they do what you expected them to do, things like that. You know, you're privileged, so go nuts. So, despite the fact that we're actually writing kernel code, we don't have to make it complicated. We don't have to make it something that's hard to do or at least feel hard to do at first look. So the framework tries to allow the unit tests to be written in a C-app type of way. So you kind of look and feel that way. You've got your main function, which is actually the entry point for the test. And then we have a bunch of libc, api, ported over, not a bunch, but enough for most tests. And we are, you know, of course welcome to add as necessary, whatever kind of looks like it's needed. So all your expected ones, assert is there, which is, you know, of course one of the most important ones for a test framework. Also, with the scripting wrapped around these tests, when you execute them, at least over QMU, then when you do, when you get an assert or any sort of an unhandled exception, you actually get a back trace for all the ports in a way that support stack walking. So we have that, and then this is just a little snippet of code to show you that, you know, don't be afraid. It's just the, and very simple. See, it's just main, even environment variables can be provided to the unit tests. For that, we do a little trick where we take a text file of environment variables. So, you know, your usual key equals val, just a whole list of those. And we put them into an NDRD, so they're in RAM disk. And we can just read them out of there, and we can find it through the DDE, all that stuff, just like we're supposed to. And then we can load those environment variables into memory, and you can use them like a normal C program. So that can also be nice for passing in your expected values and whatnot for unit tests. You can also pass in expected values for the command line, of course, which is a little bit easier to do. But it's, you know, if you have too many of them, it gets kind of ugly. So, of course, you can also, for at least people who want to test on hardware, they're free to manipulate their device tree in any way they want. So they could create a special node for test cases, sure, why not. And then the unit tests would just, you know, parse that node and get all their input. However, however you want to do it. So how do you run the test? So originally, it was, you know, from the command line just for running KVM Guests. So that still, of course, works. You can just pass the test, you know, as a kernel. That's the kernel parameter to continue. Depending on which KVM user space you're using, you'll do it in some similar way. There's also some bash wrapped around all of that stuff. It allows you to run all the tests for you automatically so it can be built into CI very easily. And we do have it built into many different CI already. So we run just a single group. And then the reason is bash. I mean, some people wonder why, because it gets kind of awkward to add more advanced functionality to the test harness having to write it in bash. It was historically in bash, is probably the main reason. But then we actually had a discussion a couple of times, like should we use Python or whatever, go whatever the latest thing is these days. It's a little bit easier for the harness. And we had some pushback from people who have been using this framework quite a lot. And they like to have a very lightweight framework that they can put on an embedded, you know, busy box type thing. There's nothing there except for bash. And they didn't want to bring in libraries and everything else for something else. So bash is not that painful. We don't have that much functionality. So I don't really have a problem with it. Another thing we can do, we can build standalone tests with it. So nothing changes except make space standalone. And it'll actually wrap a lot of that bash around the binary after it converts the binary with base 64 to be embedded all into one nice text file, each text file depending on how big your test is. And you can actually just email that or send it to people. So if you build a quick and dirty test, and I'll get to talking about quick and dirty tests a little later in the talk, if you do that, like, you know, a few lines is to like prove your point that this is broken. Then maybe you just want to package it up with this make standalone thing and mail it to somebody. They can run it and see for themselves. I don't think that's used a lot. That was one of the things I invented that I thought would be useful, but not too many people have been mailing these tests or whatever. So now we know what the framework is, and this is a risk five talk, so we finally get to risk five. So we already have a use case for it. The tech PRS working group has more or less committed to using it for the SBI verification framework. So the SBI for those of you that don't know, I guess most people in this room do, is this interface between either supervisor mode and in mode, machine mode, or also between a virtual supervisor mode and hypervisor. And so we, you know, we either respect community or trying to keep that interfaces from going nuts in all sorts of different directions. We have a standard for it, the SBI spec. And so we write when we want new functionality that we need, the supervisor needs to ask for some service or some information from in mode or we want to emulate that in mode for the guest. Then we need to provide this interface, right, this SBI. And so as we add these functions to the spec, we explain how in the spec it's supposed to work, the parameters, etc., like usual. Then it would be nice to be able to have a verification framework for that so you also say, okay, you've written a nice, you know, addition to our spec, a new extension, SBI extension, please show us, you know, how it's supposed to work. And you could do that, and we do do that with Linux proof of concept codes. We always submit patches for Linux and also for open SBI or Rust SBI that show that, you know, it works, right? We prove our extensions. But with the verification framework, we can actually avoid having to any, focus on any specific projects or people having to involve an entire Linux kernel for the test. They can just do this quick and, this quick small thing here. And so that's the idea is to try to build all those function tests in there and have regression tests for that as well for everybody's SBI implementations. So we can test already, right now you can start writing tests for open SBI. It's quite easy to run over QMU, you don't need hardware for that. You can actually, with QMU, you can swap out open SBI and drop in Rust SBI. That also works over QMU. Probably other SBI implementations can be run from QMU. Of course KVM is a SBI implementation because it emulates, so you can already test that as well. That's one use case already, which could be started now. So, we have a CPU validation as people actually have CPUs to validate. And when we get the EFI support merged. So I haven't done that yet. I'll come to that too, like with current status. But as we, when we get that done, then you'll be able to just put these tests directly, boot them from U-boot or CBEFI and you'll be able to do some validation tests. So ARM does that, I'm quite aware, because they've been involved with KVM unit tests for a long time now. They're doing their memory model, litmus testing, they use KVM unit tests using the EFI support to go straight on hardware and run that. So microbenchmarks are another great use case for KVM unit tests because while you can always find a way to create like some sort of a privilege level test where you write a kernel module in Linux and then you like put it, I used to do that a lot, just like in the init of the module, I would have my whole test case and then I'd, you know, I just mod probe it and now it runs my test, right, that privilege. But, which is kind of awkward to begin with, it's not a real test framework. But it also requires Linux to be booted up and working and everything. And it's not very good for a microbenchmark because you've got Linux doing whatever Linux wants to do. And so you're not really isolating your instruction sequence. But with KVM unit tests, you know, the world is yours. The unit test is running there and nothing else. So it's actually quite good for that. When you get your timing numbers from that, they're pretty reasonable to trust. Question. Yeah. So in this diagram, what does the test say? Ah, yeah. So the test is either this guest kernel or actually the host kernel. It's one of those two. So if it's bare metal, if you just launch it from the boot loader, you'll be the host. That support isn't in the RISC-5 port yet. But you can already do the guest kernel version. Okay. So, yeah, the tests are easier to write as we already talked about. And the quick and dirty ones are even easier. I do, I do the, so, so I do this a lot. I actually, because I'm familiar with the test suite, I use it for a tool while I'm working on something else. Like something for Linux or whatever. I use it just for my own testing purposes. And then it's kind of ugly and it doesn't really look like people would be maybe interested anyway. It's too, like, one-off. And so I just kind of toss it. Or maybe I keep it for myself to look at later, but it's not shared, which isn't really a very good open source approach. So I've actually been thinking about that, that for these types of tests that don't really necessarily fit what we consider the main test suite, maybe we should have a separate branch though for them. So we still collect the code. And I kind of did that already. I recently wanted to test TCG. So I kind of forgot to mention that for CPU validation, we already can, of course, test our, you know, emulators and our other models to see if they're correct. So TCG is, you know, QME's emulation framework. So I wanted to make sure that the MMU model that it had was able to handle the access to dirty bits correctly, because there's actually a couple different ways to do it in spec. And QME had picked one by default. And then a couple extensions came along that actually allow you to decide which one you're going to use. And a new bit was added, which is actually going to require another SPI call. So we'll go back to the SPI verification for that. Anyway, kind of, you know, balloons as we know. And I wanted to make sure it was actually working the way it's supposed to right now. So I wrote a test case in KVMunit tests. And then I wasn't sure, okay, this is maybe not the one that we're going to merge because it's just for this one-off test. But I've already decided maybe at least goes to a branch that we should keep track of these things. And then, you know, and the other reason why posting them, even if they don't get merged in the end, or at least not to the main branch, but to the side branch, is because when people do post-tests, sometimes they reinvent something they need inside the test case to get the job done. And that looks like something, oh, we should probably pull that into the common code, right? We can let the framework evolve better the more people who contribute. And there's no one and done. Usually I write something, some quick and dirty test, and then like three weeks later, I'm like, oh, yeah, I actually need that again because something similar is broken or whatever. Yeah. I think I talked about everything on this slide. Those are some links. And, yeah. So one thing I was going to do, because I have way more time than I need, but I was just going to show that test that I just got done talking about. So it's a little bit more complicated than that little snip that I shoved in the slide. So you can see that it's still not that complicated, right? Oh, yeah, sorry, everyone can try to brighten the screen somehow, maybe. Yeah. I don't know if I can turn off the light. Just smash it with a hammer. Yeah, it's probably, you know what, maybe I can go to a black background and just cut the file. It might be better. Is this better than before? Yeah, because black background is better. Don't touch that. That sounds like fire hazard there. Anyway, so I'll just, you know, just kind of slowly scroll through it, I don't know. Just to show you that really you can build these tests with like 100 lines of code, and they achieve a pretty reasonably good goal, like making sure that an NMU behaves correctly in like three different modes. So, yeah, so I don't know if there's any particular lines here I want to point out, so I wanted you to get a feel for what a test would look like if you guys decided to go sit down and write one. You don't have to like, you know, you don't have to learn a whole big framework with some bizarre looking APIs. The APIs that we have are minimal to begin with, so you're going to write your own functions. But when you do need them, you know, they're pretty self-explanatory and C, so you just, you know, you can grab for anything you need to know. And yeah, that's the bottom of the file already. It's only like three page downs. So, um... So, does the actual return value get used? I mean, I noticed you're carefully returning a report summary. Yeah. But does anything actually look at the return value of this, May? Yeah, so CIs will do that. So, like, this will dump a summary to the screen. So, if you're just running it yourself, which I guess I might as well go ahead and... So... Yeah, you know, I'm feeling brave. But, um, so, yeah, you can just run it. And then it'll dump... Yeah. It'll dump stuff like this out. And then CIs will, they know how to parse that, right? So, they'll be looking... And we have the... We have those, you know, reports and report, pass type API to try to make sure you get a nice, uh, consistent format so that it's parsable. You know, we don't use a TAP. Maybe we should. We've done that in a different test suite that I'm involved in as well, KVM self-test that's in the kernel. We're starting to... We're not there yet, but we're starting to migrate the TAP for that one. Uh, yeah. This one we have our kind of our own thing going. We've had it a long time now. Um, anyway, so that's like one and then there's like this... Yeah, there was another test. You probably saw it said skip. And it's skipping because I didn't give it an environment variable. Uh, let's see. Yeah, that's the file. So, this is that text file I mentioned before. You can create just, uh, you know, plain old text with all your environment variables. And then when you want to pass it to the thing... Um, Oops. It passes... Like this. And we'll just run that one group of tests this time. We're seeing about live demo is I have to type in front of people. Um, and, uh, so now we, now we're not skipping anymore. Now we're passing because I gave it, I gave it the inventor ID, which is zero for KMU. Um, and, uh, it matched. Working demos, working passing, passing SPI test. Uh, yeah. You showed the failing test also. Oh, yeah. Yeah. I want to see that it's true. Yeah. Yeah. Good challenge. Uh, yeah. Forgive what I called this one. There we go. So, yeah, this is that other one was the, uh, was the, uh, in a new testing. Um, oh, yeah. And so now it's here. It is failing. It's skipping. Um, that's failing, but skipping. And that's because, uh, this, uh, CPU, the default CPU is missing the, um, the extensions needed. So we can, we can fix that, of course. Um, something like this, um, we can actually add, um, we can add the extensions. So, uh, still spot. It was not there because I don't know why. Oh, no, because that requires an extra, uh, extra step of adding an SPI implementation that allows you to turn on, uh, the, uh, AD bits, um, uh, the hardware AD bits, um, where you don't have that yet. That's actually, we need to add an SPI extension. I think we're going to call it FWFT, allowing us to tell SPI to flip, uh, bits and registers the machine, uh, environment, config, enable bits. Because if you want to turn on this particular feature, uh, you need them, you need, uh, to be at the machine, uh, uh, mode level to be able to do that. So I can't do it from, uh, the s mode level. And so I actually hacked OpenSBI to let me do it and to test this out. And I, I'm not going to look for that in a live demo file. But, um, yeah, I have that. It does work. Yeah. Um, yeah, I think, uh, So what's in the run test.sh? So you wrote the C file, right? Uh, again? So then you had this run test.sh? Yeah. So, did you write it as well or is it, so, or is that the test? Okay, run test is just the, the, the test suite that kind of pulls everything together. So, um, if we look at, uh, this one, for example, this on the screen, uh, the log here shows at the very top. Which is at the bottom of the screen. Uh, this, this time out, et cetera, et cetera, et cetera. So that's actually the command line, the run test, figured out how to compose, uh, based on some configuration files and stuff. And then this is the output of that. There's, uh, this, um, uh, configuration file that you can, uh, provide, uh, for your groups of tests or for individual tests. Uh, allowing you to, um, um, to tell run tests what to do to pull it all together. I mean, of course, you can also just manually do the command line. And I do do the manual QMU command line, uh, when I want to also, like, do something with GDB or, you know, make sure I get the, the address is dumped out and I can find them as obj dump or something. So, um, yeah, I don't always do everything through run tests. Actually, very rarely. That's more for the CIs after you've got the thing working. Which one? No, that's already there. That's, that's a static. Yeah, it's committed to the repo. Yeah. Yeah. Um, yeah, nothing for scripts is automatically generated except for when you do the make standalone. And then you get, uh, might as well show that because we're in demo mode now. Um, so then you get this guy, which is generated. So this batch script was automatically generated. All this junk is the, uh, base 64 of the actual test code that was written in C. Um, yeah. And then, you know, this, some of this stuff is just kind of extracted directly from, uh, other scripts that are used by run tests and they're just chucked in there. And now you can, now it's all one unit. Yeah, you could put anything in there. I mean, don't trust someone to send you a reproducer. Yeah, it could be like, sure. Yeah. This is for developers passing things among trusted friends. Yeah. Yeah. Yeah. Make them sign it or yeah, just, yeah. Sure. Yeah. Yeah. I mean, yeah, anything could, absolutely anything could be in there. Right. Like enter your password. Please. Thank you. Uh, those tests are very similar to what is a case of test does and, uh, those tests integrated to case of test and if no, do we have such plans? Yeah. So the question is more or less, how does this relate to his, uh, case of test? Yeah. So there's, there's definitely overlap in what is tested. Um, the frameworks are quite different and how they work. Uh, there's, there's more overlap between this particular one and KVM self test, which are in case of test. That's one of the many sub directors in there. Uh, KVM self tests, KVM case self tests, uh, has started to be probably, uh, be probably the, the main place we add new tests for KVM. So actually, you may have noticed I did an entire presentation on KVM unit tests and I think I said KVM like only when I said the name of the framework, but I never actually talked about testing KVM. Uh, we do that still. Uh, we have CI that's specifically testing KVM using this framework, but, um, now we usually use KVM case self tests for the new ones and even some of these reporting to that framework. Um, and I'm seeing that this one's more going towards the testing of hardware or other hypervisors are still using it and stuff like that. But, um, yeah, KVM wise, uh, and, and, and actually I talked to Paulo about that yesterday, uh, on my third beer or whatever. But, um, I was like, you know, KVM self, the case self test is, is the way for the future for KVM testing and I'm not going to really talk about it too much tomorrow. When I talk about KVM unit tests and say, ah, but KVM unit tests are still easier to write and he's right. Like you can write a test case, uh, quicker, faster here. So if you're doing KVM testing and you want to do those quick and dirty tests I was talking about, uh, you might jump to this one first because the other framework, uh, uh, well, it's growing like support quite fast, but you have a little more boilerplate code and everything you have to do because you're actually, when you write that you're writing both the user space code and the guest code simultaneously for a test. And here you only do the guest code. So they initially, uh, here we can just simply read a test and if it's worse then we can move it to a case self test with bigger overhead. Yeah, yeah. And for your question on the other case self test stuff, like risk, there's a risk five directory there too, right? Where we test some instructions. That stuff is good. We need that too, but it's user space only, right? Yeah. So this, this is down in the kernel level. Okay. Thank you. S mode. Any other questions? No, let me, let me appropriately go to the last slide. There. Thank you. All right. That's it. See you next year.
The best `case` scenario
you you yes sorry so let's talk about case which is a keyword that hopefully most of you have used, if you haven't, it's okay, we're gonna go through it. And we're gonna figure out how we can use it, how it works, how we can use it better, and what the latest versions of Ruby have given us to play with this operator more. So yeah, that's more as what I'm talking about. So just in case we're gonna go through what case is, what the different syntaxes are, how you usually use it, and then we're gonna look at how it's implemented, which is terrifying, and we're gonna have a small dive into how the Ruby VM works and the instructions and stuff like that. After that we're gonna go through several use cases, some of them are pretty basic, some of them I think are pretty cool on a Ruby standpoint. And finally we're gonna take a look at pattern matching, which has been coming to Ruby since 2.7 and is mainly operated right now using the case keyword. So let's start. What's a case? So does anyone not know what a case is, or has anyone not used it? Cool. So that will go fast. So basically a case is more or less a big if-else, that's usually how people think about it. So you have your case, you have your different branches, and then you match each branch against your case. And depending on the branch that matches, you go to a different path. So in this case we can assume that, I don't know, status is something you get back from an API, you match it against different cases, and then if you have a success you proceed, otherwise you want to fail depending on what you have. If you want to, you could be even more compact by moving the stuff up a line and using then, and if you want it to be even more compact, you could even add more things to a branch. So if you wanted different conditions to go to the same branch, you can separate them with a comma. So that's basic case. One interesting use case that I don't think I've ever seen before, I don't know if it's useful, but it's still cool to look at, is you can write a case without anything at the top, just an empty case, and then it behaves exactly like an if-else if, so you have to use usual predicates, the same way you would an if-else. So I'm honestly not sure that has much interest, but it's cool. So how does case work? And in general, I kind of also wanted to take the opportunity to talk about a bit about how anything works in Ruby and how you can, when you're debugging something and trying to figure out how something works, how you can go deeper into your code or someone else's code. So if you're, for example, if you have a method that you've written or someone else has written and you don't know where it is, so let's say you're in a big code base and you have 20 methods called, I don't know, count or show, and you don't know which one is being resolved. In Ruby, everything's an object, as you might have heard before. So, so are methods. And you can, on any instance of anything, call dot method, capture your method, and then you have access to two methods that are pretty cool. One is called source location, which will tell you in which file it is. So interesting when you don't know which method is being resolved. And another one is just dot source, which will print out the source in your terminal. Just plain up. So that's interesting also. If you're looking for something more lower level, so a Ruby method, like array.last or integer.next, and you don't know how it works and you don't know where to go, you're kind of stuck. You're going to have to go read the fabulous manual of Ruby to figure out where it is. But in our case, we're kind of one level deeper because we're not looking at a Ruby method, we're looking at a Ruby keyword. So you, if you go to the documentation, you're going to find how it behaves, but you're not really going to be able to see the source code per se. So in this case, one way that I've used to figure out how the internals of case work is to go look at the Ruby VM instructions. So big-ish caveat for the next couple of slides. That's the very limit of what I'm trying to understand this year. I'm kind of in that phase in my Ruby journey when I want to understand how things work. So if I say something outrageous, stop me. So from my understanding, the Ruby code that you write goes through a journey before it is compiled and interpreted. So your Ruby code first gets turned into tokens. So for example, you can imagine that your entire program gets turned into a big array of syntactically relevant stuff. So that could be depth, for example, or an open parenthesis or a space or part of a string. So everything gets turned into a token. And then those token get organized into something called an AST, which is an abstract syntax tree, which is really hard to say. And basically what an AST is, is that big array, but formatted into something that is more understandable. So if anyone has ever played with RuboCop before, that's probably where you've seen something like that, because you have to play with syntax tree, which you want to write your own cops. So the tree is composed of a lot of nodes, and each node has a name. So you have a class node or a method node or a begin node. And then inside the node, you have all relevant information for that specific class or method or begin block or anything. And then all of those, all that tree gets turned into virtual machine instructions. So that's the part where I think this, what I'm going to talk about probably only work on C Ruby. I'm not sure this applies to other implementations of Ruby, like Truffle or JRuby. It probably works a bit differently. So if we look at the case that we're looking at before, and in the Ruby console, you have a class called RubyVM, which gives you access to any tool you might want to, to turn your code into either the tokens or the tree or the instructions. You can end up with all of this, which we're going to try and go through in some kind. So first of all, in case you've never used it, the Ruby virtual machine, the one from C Ruby is a stack based VM. So interact, everything in the VM is a stack. So you end up, you have a lot of instruction here that just interact with the stack. Like the put object over there just puts an object on the stack, the top end finds an object and then moves it to the top of the stack. And you have a lot of things like that. So in our case, if we look in detail, we can see a few things. So first of all, here we're mainly preparing the stack and here we have something, here we can find the status that we have over there. So this is basically calling status to fetch the value that we want to match against. And under this, you have a Ruby optimization, a Ruby VM optimization called case dispatch. What this does is in some cases, if you're using a simple case with simple objects inside of it, like strings or integers or symbols or stuff like that, what it will do is it will create a hash where the keys are basically this and this and the values are the number of the line in your VM structure that you need to jump to. So what that means, at least the way I understand it, is if you have a lot of if, else if, else if, else if, it will be usually faster to build a case because you're losing some time here to build your hash. But then whatever case you want to go to, it's just a hash access. Whereas if you're doing a bunch of if and else if you have to go through each of them to see does this work or does this work or does this work, etc. If we go a bit below, we can see what that would look like technically if we would need it to go through each of the branches to see which one works. So here we have our success symbol, which was our first branch. And what this does is it going to compare it to the status using the triple equal method. And that's the cool part of case. That's technically what's doing the heavy lifting behind. And if that equal works, then it's going to jump to instruction 28 below. If it doesn't work, then it's going to keep going. So second branch is error. So we're going to take error, put it on the stack, compare it to status. And if it doesn't, if it works, we go to 33. If none of those work, if you remember the case, then that means we're in our error case or like our else, which is over there. So if none of those work, so we keep going down our instruction and we end up here called the fail harder and then leave, which is instruction 28. And then under that, then you have the lines that you would have jumped to if anything worked before. So the 28 here, which will call the proceed and the 33, which will call the fail. So that's more or less the instruction patterns of a case. So that turned our question before, answers our question before, right, of how a case works. And the simplest answer that I can give it, it works thanks to triple equal. That's what it's going to use to match everything against everything. So if we wanted to push case to the limit, the question that we want to answer now is what does implement triple equal? And in Ruby, that's a bunch of classes. And the interesting thing and the main reason I wanted to do that presentation is that depending on what you're calling triple equal on, it will behave differently. So the simplest example that we've all used is all the base classes. So string strings, integers, float, arrays, hashes, anything you want. And in this case, it checks for equality. So that's the thing we've seen before. You might have seen that code. You get a param that has a response and then you don't know what the fuck the other person in the API has done, whether it's a string or a 200 or a success or a string or a true or true as a string or anything. So you do your case and you match it against whatever and try to figure out. So in this case, it's always going to check for equality. So here with the come out that we've seen before, it's one or the other or the other. And then you have arrays, you have hashes. Otherwise, yes, you can give up. Another thing that implements triple equal with another behavior are classes and modules. On classes and on modules, triple equal checks for, I don't really know how to say it in one word, checks for type, for ancestry. It's a bit like the is a method of Ruby. So when you have an object and you call is my dog an animal, it's not only going to check the class, it's going to check a bit above to see if animal is included in it if you're going composition way or if it inherits from animal, if you're going the inheritance way. And that's more or less what we can do here, for example, with errors. So I say you have your code and you've defined a bunch of different types of errors. And you've tagged some of them maybe as ignorable. So if it returns any errors that's in that type, then I want to ignore them. If it returns those two different errors, I want to return a not found. If someone forgot about safe navigation, I want to tell them. And then a lot of errors, for example, in Rails, and I'm assuming in Ruby, not entirely sure, don't put me on that, inherit from standard error. And so those maybe you want to raise, but if you have something else, then that's probably a lower level, maybe a PG error if you're dealing with a database, and then you want to do something else. So that's it for classes and modules. Another class, another type of classes that implement triple equal or ranges that I'm assuming most of most of us have already used that check for inclusion. So for example, if you have an integer at the top, then you can check that it's included in this range or this range. And it works with the endless ranges of Ruby. So you can be very, I mean, this might as well use an if else if and just check that it's greater or lower than, but it's good to have options. You never know. And one thing that I found, if you're working in networking, that could be cool, IP address works the exact same way. So you can define IP addresses with their masks and everything, and then have them act as ranges, and then check that your IP address belongs to one or the other. This one we've all probably used is also is reg X. So this one checks for just a match. It's the exact same equivalent as if you wanted to match your against string. So that's a kind of real use case that I have from the company that I'm working for where we manage a lot of messages between clients and providers. And so we want to check in those messages that they're not trying to bypass us, for example, by sending an address and trying to meet somewhere, or they're not sending sensitive information or sometimes people can keep their dick in their pants. So we have to be careful about that also. Stuff like this, right? So this one checks for match. Probably one of the most interesting example, but yet the one that have the most trouble coming up with a good example for are prox and lambdas. On prox and on lambdas, triple equal calls the lambda and gives it the object that you're matching with. So for example, here we can define, let's say we want to define simple prox or lambdas that just delegate to another method. So for example, unknown host will take an element and then check if the host is included in the list of something. Oh shit, yeah, I've done it again. In case this is just, it's the new way of writing the old thing here with the pipe pipe and you enter a variable, this does the exact same thing. It just takes the first one. So underscore one would be the first variable that you enter here, underscore two, the second one, underscore three, et cetera, et cetera, et cetera. So let's say that we've defined a simple list of hosts. So when we get, in this case, probably a request, we could delegate to one of those to see if it whitelisted or if something went wrong. And then we can, if it goes there, yes, we can take a request, let's say a web book for example, and write our case on it and say, okay, when it's whitelisted, then I want to do something. If the host is unknown, I want to do something else. If the action is unknown, it's going to do something else. And what this is going to do behind the curtain is it's going to call whitelisted and give it web book as a first parameter. So it's a more, again, more compact way and allows you to put that code somewhere else instead of having to copy paste it into three ifs. And the last one, we're in Ruby, thankfully. So for every other class, we got duck typing. We can just implement the triple equal method and have it work for more or less anything that we want. So bear with me because that's going to take a little bit of time. So in this case, same, still sticking with my response example that we've been following the entire presentation. So here I can define in my response class or module or whatever different classes that implement the triple equal and that do anything that I want. And then I can, if I do this and I call them, this is going to do what we've seen before in the VM instructions, right? It's going to take the response called triple equal with this and then see if the answer is true or not. So with this, you can basically create as many matches as you want, especially on custom class that can be pretty interesting. If you have one example that came to mind also is payments, for example, if you're managing payments, then you can in your payment class define different subclasses that could be success or canceled or processing that just calls your payment API and checks if it works. And so all that code is it's in own place and then you instantiate your object here and you can use case to easily delegate where you're going. Another example that we've kind of used is a wrapper for services. So basically you define new classes for your service and your service answer a class that's either a success or an error and then you can use this to do some kind of early, early days pattern matching. So speaking of pattern matching, how does it work? So just in, again, just in case, we're going to go quickly through what it is and what it works, how it works, sorry. So the whole idea of pattern matching is that you define as the naming price, you define a pattern, then you try and match it against something and see what sticks. So here my pattern is going to be a hash with a status key, a body key inside of which I'll have a user with a name and an age and whatever is in here, if I can match it, I want to store it in the variable and then once you have your pattern, you can try and match it against any collection of stuff. So in this case, it's going to work because we had the same status and the form that we're trying to match against was the same and what it's going to do is it's going to assign the name variable to whatever was there and the age variable to whatever was there. If you want to match it against something that looks very different, so this hash for example is not going to work because either status and body are here, this value is not going to match against that one, right? So if you try and do this, then it doesn't work, so you're going to get an error. In Ruby at least, this is going to, sorry, this is going to raise an error that just tells you I wasn't able to match it and in Ruby that was implemented using case. So the way it works is if you have a response or literally anything, you want to create your different patterns that you're going to want to match it against and one thing to note is that it's no longer, you know, to make the difference, you no longer use case when, using case in because in is going to be the keyword that's going to be mainly used for pattern matching even out of cases. So in this case, if the response that I get has a status success, I'm going to take whatever is in the body and put it there and otherwise if it's an error, I'm going to fail and put it over there. So it's, again, it kind of does the same. You could do the whole counterpoint to this presentation is I could do it with an if, else if. You always can, but I do think this is a bit more verbose and makes it more clear what you're trying to do because you can see the entire pattern. Whereas if you wanted to do an if, you would have to open response and do if the status is success, then I want to look at the body. For this example looks the same, but if you're dealing with big jasons from APIs where everything is nested like four times and you have response body value and then you take the first element and then the address and then whatever this starts to become more interesting. Another thing that we get with pattern matching that we can do with case when is we get access to guard closes. So whatever that allows us to do is I want response to match with this only if I'm not in maintenance. So this gives us a bit more control over whether or not we want the pattern to match because sometimes you might want to put patterns that are very similar, but you want to condition them to something different. Another example and another thing that we can do with pattern matching, so let's look at a more complex pattern. We have access to a lot of new tools. So for example, here, what this thing here means is that I want to match this pattern where the ID is whatever I put on top. If I didn't put it, then it would act as the one we store before and store it into the variable ID, but by doing this I can tell it no, no, no, use the value that's already there and match one that has 69 as an ID. I don't want anything else. And we also have access to splat operators, kind of. So simple splat for arrays, double splat for hashes, the same as with method arguments. So what this allows me to do is I want to take user and if the user is in an array with some elements at the beginning, some elements at the end, and then somewhere in the middle, an element with ID 69, I want to store the value of admin. So this is kind of equivalent to take my entire array and do a detect where ID 69 and then print admin. So this kind of does the same thing, but in a more flexible way because I can then kind of keep putting more patterns underneath to filter out more stuff or try to find more elements. So how does it work? I kind of, at this point in the talk, I kind of wanted to go through the same journey with pattern matching as I did with a simple case. So try to open it up and look at the VM instructions and see how it works and try and figure out what's underneath. The problem is that pattern matching is kind of new. So in the Ruby VM, that is a lot of instructions to go through. So I ain't going to go through everything. But there are a few things that we can see here. So for example here, we have the same response. So that's the beginning of our case. So this calls the thing that's going to go in the case that we're going to try and pattern match against the same. We're looking at pattern matching. So of course, the thing called check match, we kind of kind of assume that it's going to match or pattern against something. So the way, at least the way I understand it is that all of this is going to build or pattern and then it's going to match it to continue. And if we look at the way it builds the pattern, we can find one method that is interesting, which is this one, which is deconstruct keys. And after looking at it a bit more and going to read out the documentation, this is what Ruby used to do, at least for now, to do pattern matching. So you have two methods. One is called deconstruct keys that is used on patterns that are hashes. And another one is called deconstruct, which is used on pattern that are arrays. That make sense? And so this does all of the deconstruction and then if the pattern that you're sending doesn't respond to the deconstruct keys or the deconstruct method, then it's just going to give up and tell you to implement it yourself so that it works. And after that, it's more of the same thing, right? So that's the second pattern that we have. It's still trying to deconstruct them. And then eventually, if it doesn't find anything, it's going to return a no match error. So the interesting thing then is how do we implement it ourselves? So if you have your class and you want it to be, you want to use pattern matching on it, then one thing that you can do is use, is implement the deconstruct keys method. So in this case, we have a location and we want to have a latitude and a longitude in the deconstruct keys. And then that allows us every time we have a location to use pattern matching on it, because it's going to deconstruct this, deconstruct this, and then see what matches. And so in this case, and interesting thing also is inside of our pattern, we have access to everything that we've been talking about earlier. So in your pattern, you can put classes, you can put reg X, you can put ranges in this case. And the only thing I think we haven't seen before is this little thing magic that just takes like, it wants to match it against this and then store it into the variable that we can then use for anything else. And I think that's it. I've tried to go through everything. I sped through that one, sorry. We have so much time. I didn't. You used a variable that was not declared before. Yeah, probably. Where? In the right address before? One? latitude. Did you declare length to be equal to new before? No, you don't have to declare it before. Basically what this does is it takes this and then store it. It takes whatever matches here. So that would be technically this and then store it into the latitude variables. You don't have to declare it before. And what's the scope of that variable? It's going to be a scope to whatever the case is in. Right? So if your case is defined in a method, then you have access to it in the entire method. This is in current? Yeah. I think this might be, this might have been implemented in Ruby three. And the first occurrence of pattern matching, the one with the case in, was experimental in 2.7 and then actually arrived in Ruby three. And they've been trying to push it a bit more in subsequent versions. So now, for example, you don't necessarily need to have case. If you want it, if you want to use pattern matching, then you can just write your variable in something and use it as a predicate to see if it matches or not. In your example where you're looking for an admin user in an array of users and you have those operations at the back, does that work if your admin user is the first or the last? Yeah, yeah, yeah, yeah. Then the, then the, Like it might not. Yeah, fair. Yeah, yeah, definitely fair. What this will do is it will put nil in here and nil in the other variable. Right? It's like there's nothing after, there's nothing before or there's nothing before. Yeah, that's the thing that I was a bit, if you about this, basically the, shit, I have to go through all the animations. Sorry, bear with me. It's going to scroll again on. Okay, sure, whatever, shit. The argument that it takes is in case you only want to deconstruct some keys. Right? So if that is, if you have a big object and you only want to deconstruct latitude, for example, you could work it this way. That's what it's supposed to do. In the example, I didn't go through the trouble of implementing all of it, because if I want to write code big, I can write too much code. And yeah, that's why. So it was deconstruct for arrays and deconstruct keys, though. Yeah. And you can define deconstruct as well. If you've got a class that implements an interval or something. Probably, yeah, yeah, I think so. Just to take how stable you think the syntax is. Do you think it's going to stay the same? Huh. Would you, would you, would you, would you want to do, for example, like the, the, the, the, the, the, the, the, the, the, the, the, I think it's going to stay the same. No, sorry. Yeah. I know I was thinking in my head, I think it's going to stay the same because it's the exact same syntax that Alexia uses, for example. Like they've probably been inspired from other languages and used it. And so I'm expecting it to stay the same. But then again, I don't know. I think right now I'm still, I'm trying, I tried to push for it in very simple use cases. So usually in a, if we get, if we have to make an API call, that's probably the best, like, foot in the door to get it working in your code basis. Like, because that's the thing that seems the most obvious, right? I get an answer and then I can, not only fetch the status, but assign everything in the answer and then give it to another method. I think that's a bit, not a frontend dev. So don't quote me on this at all. But it looks a bit like the object deconstruction thing from JavaScript. Or you can get an object and then assign all the variables into it. In this use case, I think it's a good first step to implement it in a code base. I wouldn't go all out and start putting deconstruct keys in every class. That would be really, I really hope that, I really hope they put it in Ruby at some point. I don't think that's in the plans right now. I think the idea, the main idea behind was, like, when they put it in Ruby at all, pattern matching in 2.7, it was kind of touch and go. People were discussing a lot about do we want this in our code base because pattern matching in the collective brain is usually more functional than object oriented. But now that it's there and it's past the experimental and it's now stable, I think they're eventually going to do it. It'd be a shame not to, right? Do you think some of this stuff is going to end up in the Ruby style guy? And be something like Ruby cop goes and says, no, no, you don't want to do that. You don't want to do that. You want to use this instead. Probably not in the near future. Because I think people are still very much like trying to figure out what the good style is. Even when I was preparing this, I couldn't find a lot of examples. So I kind of came up with what I think would look the best. But I don't think for now, at least, there are a lot of established guidelines. We good? Cool. Nice.
A front-end journey back to Rails
I haven't said anything yet. All right. Good afternoon, everybody. This is a front-end journey back to Rails, or how I learned to stop worrying and love the hot wire. This is a story of how I strayed really, really far from what you'd call vanilla Rails as far as building new eyes goes. I got slightly fed up and then came back, discovered that it had changed a lot, and was really impressed with what I saw, and now I love it. So, a bit of a backstory. I discovered it would be on Rails in 2010, just as V3 had come out. I happily avoided the V2 merb curve-offal, if anybody remembers. I started working professionally in Rails in 2012 and in the last, what is this now, 14 years? No, 12 years, sorry. But it must. I've worked in very different contexts, agency as a freelance, and now in a product company, which is called NodalView, where we help real estate professionals become even better real estate professionals with a lot of digital tools. In my spare time, I play table top RPGs a lot. I mentioned it one because it's a hobby that also lends itself to potentially infinite complexity, and also because it will be relevant later for the demo I'm going to show you, because it is about managing the infinite complexity of playing table top RPGs. So, Rails. Across the years, Rails, at least the front-end part of Rails, has changed a lot. We've gone through a lot of iterations of trying stuff, following the front-end guys, importing technologies, trying them, and then changing completely 18 months later. We could say it's been somewhat chaotic. So we've had a few attempts. Like, please raise your hand if you felt five of these across your career. All right, yes. All of them keep your hand up. All right. Yeah, so this is pretty much the trajectory that I followed, maybe not quite all of them, or maybe yes. I'm not sure anymore. It's a blur, really. And after all of this, so circa 2015, 2016, I kind of settled with React as a choice for front-end in a Rails app. And it started relatively soft, like small bits of React, small React component, loading a small tree in a page that still Rails. But that did not last long. Quickly gone to entire page. And then, well, we're loading the data and then rendering a Rails view and then rendering React again. So, might as well fetch the data directly from React. That would be easier. And we tried with REST and you know what happened. When what happens when you try to do an API with REST, lots and lots and lots of controllers. So there might be something simpler. And that's when GraphQL revealed itself. Brings also a lot of complexity. Anyone using GraphQL professionally right now? Yeah. Well, we're still using Rails routing. We're still using Rails a little bit. Might as well. So we introduced ReactRouter, client-side routing in one shape or form. And we're straying further and further away from React until we're kind of managing two applications in a single code base. And it's got some upsides. And really good. It gives us excellent UI fidelity, as we like to say. It's really good for delighting users and clients. It's really what we want as a result. But it comes at a huge cost. It essentially means you need to know two jobs if you need to shape a feature. And around that time, I tried looking at ways to make that easier. And this was also the era where React meta frameworks came. Started coming out. One of them was and still is. Next.js, which is kind of similar to Rails. It's very opinionated. It's got a lot of defaults. It does a lot of things for you. It does a lot of great stuff. But it doesn't really solve the problem. We're still managing two stacks at once. We still need to have people that are very proficient in either or Rails or React or whatever framework we're using. Or both, ideally, if you want to be efficient about it. And I still felt a bit torn apart in the one job that I like to do well. I'm doing two jobs slightly less well. So we're now around the beginning of last year. And it's time for a choice. I don't feel as a developer I can keep on like this. Having one foot in the Rails world and Ruby world and the other foot in the JS world, it's very uncomfortable. So I made a decision. I was going to try one or the other, but not both at the same time. And since I'm a very optimistic person, I tried the least likely candidate first. So I tried to build an application, an entire full stack application using only Next.js. No backend, no API, no a lot of stuff. And I quickly realized that it was going to be harder than I thought. I was missing a lot of things daily, constantly. I hadn't realized how much Rails and Ruby gave me until I started generally missing it. And there wasn't a single day where I wasn't telling myself, yeah, I'm going to need Rails to do this at some point. And that's through. It wasn't a happy experience, to be honest. As you can see on my screen. I was missing a lot of features. Next.js doesn't have an ORM or we have ActiveRecord. It's not that it has a bad one. It doesn't have one altogether. Doesn't do background jobs, doesn't do file uploads by default. Deploy anywhere. It's kind of a joke. You can deploy your Next.js application everywhere, but you should really deploy it on Versailles if you want to make it good. And also it's not in Ruby. This is half a joke and also not. When you've been writing Ruby for more than 10 years, you don't realize how much you're going to miss the methods in Enumerable until you don't have them. And that's the same thing for a lot of cool APIs. When I was doing that, the React ecosystem was changing a lot very quickly. Rails had historically been the framework for building single page apps until it wasn't. And the industry took a nosedive into the complete opposite direction with something that they called server components. It's really hard to, it's a talk in itself. I'm not going to go into details. So now the code is written on the server, then it executes again on the client, but not all of it, just the one you mark use server. But the one you mark use client is only on the client, but not really. So it's complex. And it started getting, first of all, very exhausting, but also kind of weird. I don't know if you can read the content of this React component. And I don't know. For everybody at the back end at home, this is an SQL query right in the middle of the React component. I'm not sure how it makes me feel, but it's not good. So second attempt. Yeah, kind of obvious and high-signed. So I was still doing Rails at work every day, so I was constantly reminded of how much better it was. So this time I tried, okay, I'm going to do it with only Rails. Time to see what's new in the frontend part of Rails. And the goal was, I wanted to use Rails, like all of Rails, without sacrificing either the quality of the UI, the quality of the experience for the user, and the quality of the experience for me. So that means that I still want all that. So everything that I had gotten used to from the few years of React, obviously components, but all the cool features, and all the, more importantly, all the UI features that a user would expect. I can't legitimately tell a user, well, you know, it used to be good, but now I'm doing only Rails, so welcome back to 2010. That wouldn't quite work. So can I, could I actually do all of that? Let's find out. Components. Components used to not be a part of Rails, really, you could get a lot done with Partials, but now they are. There are two libraries that do that pretty well. The first one is your component that I will not be talking about. The second is Flex with a pH that I will be using in this talk and in my applications. So this Ruby object that is a component that will render bit of HTML at some point the same way a ERB partial would, except it looks more like a component. If you squint a lot, you might see a React component there. It's got a prop and it's got a render method. And that's the real basics of it. The major difference is that it's a plain old Ruby object. So this is already big for me, at least. It means two things. It means I cannot forget five years of best practices of designing and components. That also means that I can pretty much import one-to-one components that I've designed in React. That means that again, it's a plain Ruby object, so I can do any kind of Ruby in there. This is also the good place for what used to be a decorator object or a presenter object in older applications. In the same way that you can render a component partial, sorry, in your controller, you can render that component that I just showed you. Same way that we'll do exactly what you think it does. Except that this would be an entire view. So you don't even have to do... You don't even have to do ERB for your views. You don't have to do ERB for your layouts. You don't have to do ERB at all, which is pretty cool. Because I don't know about you, but I don't really like ERB. It's really got a really clunky syntax, not a fan. So we can go a little bit further with components. You can add slots, which would allow you to define a skeleton for something that takes children like you would in React and call them later with a block. This would be... This represents a pop-over menu. It's got a trigger and it's got a content that will be revealed once you click on it. There will be a demo in just a minute. But what about interactivity? This brings us to the first part of what Rails now provides as default tools. It's called stimulus. And it's a very, very thin wrapper around a DOM object with a few lifecycle methods and event antlers. It's really, really bare bones, but it's enough. And in that case, it's enough to give us a possibility to bind click on that trigger and show the menu with a class and transition. And this is what the stimulus code looks like. Same it's got a connect method. You might remember a component that mounts from React, if you've used React, that's the exact same thing. And a few methods that do what a pop-over does, show clean, show high clean, that's not really complicated. I think this is the only place in the application that I'm going to show you that actually uses developer level JavaScript. The rest of it is either turbo or stimulus. Little bonus. I don't know if you can see it, but first line of the connect method, there's a use click outside. If you're missing hooks from React, don't worry. They're still there. There are hooks libraries for stimulus as well. So you can mix and match regular behavior in your stimulus components. All right, time for a little demo of what I just said. How do I get out of this? I won't see it here. So this is the application that I'm using to manage my table top RPG tables. And this is the pop-over that I just showed you with just a few lines of stimulus. And this is a normal pop-over is the point. It's just the same as you would expect. Coming back to the talk. There we go. The second part that I really, really wanted to keep because I just got it from React is suspense. You might remember suspense as the thing that loads your bits of JavaScript at the time that they're needed. But it had recently evolved, allowing you to put some kind of network request in a component and it would suspend, that's the name, until the network had finished and then it would show the component with the data in it and would display a fallback in the meantime. Like some sort of loading screen, spinners, Keletun or something. And turns out that this is really, really easy to do in Rails now. This is what it looks like. So this is the moment where we introduced the second really important part of modern front-end in Rails, Turbo and more specifically Turbo Frames. Remember iFrame from the 90s? This is similar but kind of better. So what we're doing here is calling a Turbo Frame, giving it a name and giving it a source which is another route in your application. Sorry. When the page containing the code on the top is going to be loaded, it's going to load the second page in the background. See if there's a frame with the same exact name on the other page and then show that. That's the part of the page that's got the frame with the same name. And you can still put your fallback in there in a block within your frame. It will appear until it's loaded. I'm going to show you what that does. So this is the frame on the other page. Major difference with React. This other page can absolutely be a real page that's got its own use. That reduces a lot of duplication. I'm going to show you. It's going to be easier to understand. With a demo I need to get out of there again. There we go. So this is the frame that I just showed you. No, this is not. I'm lying. This is the frame that I just showed you which is loading the list of the parties that belong to. But the list of the parties that belong to is also here. This is the same exact same code, same controller, no duplication. If I refresh the page, you see it. Just flash the loading screen and then load it. So that's how you do suspense in modern Rails. List one, inline forms. This is one of the big reasons why you would use frontend framework like React to do stuff. Same general ID. You define a turbo frame. This one doesn't have a source because it's not fetching a page right just yet. You have a link that explicitly points to that frame. So when you click on that link, instead of doing a full page reload and going to a page like a link would normally do, it's going to see if there's a content with the same frame name on the other page you're going to and then populate the content of that frame on your current page with that. And then you have an inline form. From the form page itself, it's mostly, very mostly normal Rails form. The UI.form that you see here just delegates to a form for that some standard Rails with the slight change of giving it a turbo action and a turbo frame. Caveat, I'm using the still unreleased new version of Turbo, Turbo 8 that just got dated to RC yesterday. So this is very fresh. It might be released this week. I hope. I really hope. It's really cool. So if it's a new record, I want to refresh the entire page. I'm going to create a full navigation with advance. If it's not a new record, if I'm editing, again, still with an inline form, I will just replace the content. Since I'm doing full navigation for the create action, I'm going to target the turbo frame at the top, which means the entire page. Otherwise, I will just refresh the frame I'm constrained into. And I want to stress that the controller is like the most vanilla controller. There's no magic here. This is just a normal Rails controller the same as 10 years ago, except maybe giving it a status if it fits. That's now needed. Demo. I think this is what I'm going to do here. So this is the link that I showed you before. Creates a form that I can barely see. Thanks, dark mode. I want to say test. There we go. Test, test, yes. Save. Boom. It's there. It's a thing. And now I can edit it. Test one. Super creative. And it's there. And it was very in line and very cool. That's inline forms. Loading states. If you were paying attention, you might have seen that when I was submitting these forms, the button changed. You got grayed out. You got disabled. And the text changed. The first part of that, I think disabled to buttons that are not doing a get action and all of the submit inputs, is done automatically by default by Turbo. And if you want to get a little fancy with it, you can add the Turbo submit with, that's a mouthful, arguments to that button and it will change the text while it is loading. That's also pretty cool. I think, do you want to see it again? Yes. All right. I'm running a little bit short on time. Where am I now? There we go. So I'm just going to do an edit one more time. So therefore test two. Look at the save button, please. There we go. All right. No, you stay. This is not for you. All right. So that was demo. Okay. This is the pool one. So all these really fast inline updates that I just showed you. Well, that's all what I'm good. That's easy to do in any technology really. What would be really cool is if when you update that post, everybody else that's on that page, like your party members in my demo application, would see the result immediately. And when you say that, that seems like it's a hard thing to implement. And if you've done it in another language or framework, you might think or have learned that it's hard to implement. No, it's not. Main part is you need to add that single keyword broadcast refreshes to the model that should refresh on the page that should be refreshed. You just add turbo stream from and the object. And the special case at the top, which I did in the wrong order I realized now, is a special case for when you're creating a record because you can't subscribe to record that doesn't exist yet, that doesn't really work. So the trick is to subscribe to a parent record and touch it when you create an as many association. And that's it. That's the whole thing. It's done. It's all refreshing. I'm just going to show you this one I'm really excited about. Again, I did that on the wrong space. There we go. So I had prepared a thing so you can see it. So test number two, I'm on the deleted. By the way, fancy confirm confirm dialogues also provided by turbo boom. It's gone. It's gone on the other side too. Thank you. I didn't make it, but I'm raising awareness. Let's see. Again, wrong space. This is hard. All right demo. Let's go back to playing. There we go. So yeah, if I'm honest, I think that covers about 80 to 90% of the common use cases of UI fidelity that you would ask in an application. And those tools that I've briefly gone over are versatile enough to do a lot more and they're getting a lot better as time goes. So yeah, for me at least it is mission accomplished. But wait, there's more. No, I like there's less. All of these things that I just showed you, they mean a lot of a lot fewer things to do to get to the same result. Meaning no more internal API is no more GraphQL no more 5000 rest controllers and J builders and you know, no more playing around with local state, at least very, very, very little, if not at all. No more duplicated logic across to half stacks. No more interminable asset compilations that take all of your GitHub action guillible time. By the way, plugging bun as a replacement for yarn or PNPM, bun is very cool. No more building and also hiring expertise in two different frameworks to do a thing. That's pretty cool. Yeah, lot, lot, lot less things to do. So of course this is a game of trade-offs. This is not a perfect solution. It's not as native like, as super cool as doing a full front-end framework. But on the other hand, it's sticking with the framework default a lot more. You get stuff done way quickly and you can realistically get stuff done alone if you're a solo developer, if you're just starting out, or if you want more bang for your hiring buck. But all of that means nothing. That means a lot. And that's the end of the talk. We have a little bit of time for questions. Yeah. Yeah, the live refresh result, the live reload. Yeah. What's that behind the scenes? Is it WebSockets? Yeah, it's WebSockets. It's all action cables, WebSockets. The question was, sorry, what library is that for the refresh? And it is in fact WebSockets. It's all action cable under the hood using the action cable from Rails. Yeah. I have also a comment, because I've also been looking into the data of the new turbo. And the refreshes does work well, but it's very important to also use a data turbo permanent on forms, because if you don't and you refresh all our users, then there's a full context. Yeah. Okay. So, full context, there are caveats, like you mentioned, especially when you have JS powered elements on the page, like a Tricks editor, like I just showed you. If you want to preserve their state, you need to do something about it. Yeah, that's right. Anyone else? Yeah. Do you consider a great solution, stand-on, angular or react application with RESTful APIs? Or, as you said, it's too hard. It's really too... So the question... Yeah. ...personally love the syntax of the component, the Rails component. Okay. So the question is, would I still consider full front-end applications like React or Angular? Yes, but we have some at work. We have a big one at work, for example, and I'm certainly not going to throw that one away. But if I were starting a new application with... I did, and if it was a solo project or a very small team or an early-stage startup or just for fun, I'd definitely use this. And I wouldn't invest in a full front-end framework unless there was, A, a lot of money behind it and the real explicit use case for it now. Yeah. How hard would you say it is to migrate from the hellhole of having two different stacks to something like this? Can you do it partially or do you have at some point to... You could do it partially. The question was, sorry. How hard it is to migrate from an existing front-end application to this? The answer is you can do it step-by-step, replacing maybe some routes first, maybe your admin panel if you want to be in a safe space. And you can probably do it route-by-route, doing something like routing at the end-to-next level or something like this. And it doesn't have to be one or the other. I think... Yeah, there's one more. What's the best source of documentation for, for example, the broadcast... Oh yeah, Jesus. Yeah. Thanks for the question. Because some things use... Sometimes it's about rails, but sometimes it's about turbo. Yeah. Sometimes it's confusing where to go and look at that explicit call. I didn't know about broadcast repression. Yeah, it's new. It's still beta. So it's not... It could be in the turbo documentation. It's in turbo rails. It's provided by the Turbo Rails gem. So the question was, what's the best source of documentation that I've compiled a little bit, which contains also links that I've compiled a lot more, and expect all of that to change in the very near future, sorry, like I mentioned, a big solid upgrade is coming probably next week to the main driver of this. Last one. Do you have any experience with the Strada and the... No. I voluntarily did not... Do I have experience with Strada? Strada is the third wheel in turbo stimulus and the whole package of hot wire, and Strada is used to bridge native applications, native application skeletons with web views that are using the other two. I have no experience with it. That's why I didn't mention doing this talk. And I think that's it. Thank you very much for listening.
Besides Web: a Worker story.
Okay, awesome. The mic is on, hopefully. All right, good afternoon everyone. So I'm going to talk to you about a worker story, which is something we did at work recently. And for once, it was like not using Rails. That's awesome. Not using web at all. That's what motivated me to tell you this story. So before we start, I would like to know who here is the Rails developer? Who would like? Yeah, awesome. Who would say that they are Ruby, but not Rails developer? Okay, awesome. That's great. Love it. I didn't expect that. Awesome. All right, first of all, who am I? Because if you don't know who am I, you might not rely on whatever I'm going to say. So I've been Ruby and mostly Rails developer for 10 years. I've been working with Kevin for almost that period. More recently, I have become a lead dev, then a manager, then a CTO. So I'm doing a lot of new responsibilities now, which also gives me a new perspective on a lot of programming topic, actually getting new perspective when you start making a decision about people and processes and stuff like that. And finally, I've been a teacher for more than six years. I've given a lecture at EPL and Le Wagon. Hopefully, we'll do that again. I feel like a deep-footed lover for teaching and sharing knowledge. And this is also why I'm here today. So I was saying the point of this talk is talking about Ruby, but not about Rails, not about web. And this was the first time for me. I was like a new experience. And it's strange to see how much changes when you start doing that, how much you realize Rails was giving to you once you don't have that anymore. I have some notes. By the way, all my slides are going to be minimalistic. I'm not going to show you a single line of code. I'm also going to forget a lot of stuff, which is why everything I intend to tell you is written in notes available directly in this. So hopefully, you will get everything I intend to say because I'm going to forget part of that. So the main message of this talk is like it's doable. It sounds strange that this is my message, but as most Rails developers sometimes, when we think about Ruby program, we're not even sure we can do it. We're not even sure how would we approach that. So the main message is like, yes, it's doable. There's a lot of tools. There's a lot of process. There's a lot of help along the way. And you can possibly, you can very likely, sorry, get most of your tools and knowledge used in a normal, not web Ruby application. The second news is you can also get all of your Rails knowledge useful in Ruby application if you get things right. So the story I'm going to tell is about like a worker. What is a worker in our case, in my case? It's like microservice. The specificity, why do we call it a worker? Because it's not a web microservice. It's a microservice which is consuming messages from a queue and very likely is going to process files, so it's going to get files from a bucket, process them locally, put them on another bucket. We have, we are using the word worker because we have like lots of them. That's simple definition. We have lots of them. So I'm going to talk about like one of them, but it could be any of them. So we think start with a loop. The whole story starts with a loop because when I started this, I really like I opened my editor and I saw something which I hadn't seen since school. It's an empty directory. It's very strange. Like really first as a Rails developer, I'm really used to Rails new and then you get like everything. You get folders, tree, substructure. You get the config directory, you get the app directory. You, like there's drawers everywhere about what you expect to put things. In this case, I just like create a new folder and it was empty. I'm a firm believer in emergent design. So I started immediately like new file, worker.rb, make a loop, while true, read, perform, delete message. I'm done. It was nice. Like it was, I knew it was not the end, but it was capturing whatever I was, I knew about my process. It was a single level of abstraction. So I knew it was a good start, but it wasn't. It wasn't a good start because I was already forgetting like my main tool when doing Rails apps, which was going to be my main tool when doing any app. It's tests. Anybody who knows me know that I'm a firm believer in tests. And it's a policy. It's not a religion, but it's a policy. This is how I write code. I do believe in it, but you mileage may vary. But for me, it was the beginning. And it's funny because I knew I was going to write loop, the loop of my program, but I was also starting another loop, the loop of my process. And this is what tests are for me. Test first does not mean you do test, then you do code, then you're done. Test first means your first step in the journey is test. Then code, then test, then code, then test, then code, then test, then code. That's what it means to me to do test. But I did it wrong. I started with code. So I tried again. I deleted my file. I created a spec directory. I created a spec file explaining what I knew about it. And I was happier because test is the file that depicts my best understanding of what I currently believe is the success. And I need that because I'm going to write code right after the word. And once you're deep in the code, you're super focused. You forget about landscape. You don't know what comes next. You might have a story. You might have specification requirements. You name it. But I do believe that a story or specification is like coordinates of where you're supposed to land. The whole puzzle, the whole activity of development, of programming, is like playing golf in the fog by night. You know where you are at the beginning. You sort of know when you want to land. But after your first shot, you're going to be lost. It doesn't matter even anymore what you're supposed to land because you've given your first shot and you don't even know where you are anymore. I'm using tests as torches in the night. So I read my specs. I write some tests. This is my belief. I'm going to follow that path. And then I shoot my first shot. Hopefully I'm going to reach my first torch in the night. When I have reached that one, I'm going to go to my second torch again and again. But my loop is that my test is only my best understanding of my success. So my test is going to evolve. I'm going to move my torches and I'm going to move my ball. And this is how they make sense together. Back to the story. I wrote my test, was happy with my understanding, run it, and it failed. It was a catastrophe. And why did it fail? Well, because it couldn't find our spec. Because I didn't bundle it. Because it couldn't find bundler. Like, that is how empty the whole story was. Like, I didn't even have bundler. Okay, so bundling is always easy. Bringing my dependency, starting my gem file. I need to run my spec. Run it again. Well, it still fails. But for a better reason. And that's the whole point of TG, right? You have to fail, but for a better reason than the previous failure. So now it's failing because it doesn't know about what is a queue, what is the method we see in the queue, what is a message, what is a processor, what does perform even means. Well, that makes me happy. Because now I can actually write more tests about what do I believe is a queue at this stage. Why do I believe is a processor, what do I believe should do the receive method. And this was really the starting of both my loops. I got my main loop back, but I got my working loop as well. I got a lot of tests. I knew that trying to make them go green would just generate more tests. Trying to make them go green. I got my actual work loop. Right. So, test code, test code, test code. I was in the middle of it. And every single of the code file was starting with like probably five to ten require or require relative. And I wasn't happy with that. First of all, because it is boilerplate, it's noise. I don't like noise. Also, because I want my code files to be about the responsibility they're supposed to hold. And knowing what files contains the dependency that this file depends upon, it's not the responsibility of each file to know where do I store the other responsibilities. That was wrong. And this is not something we have with Rails. I realized that we actually get something super nice from Rails is put any file in any sub directory of app folder, and you get it. It's like magic. Once you have to start all your require by hand, it felt wrong. So, I Googled. I got a few options. And the best one, which is actually the one which is currently adopted by Rails, was using SideFerq. Hopefully, I'm pronouncing it right. It's written in my speaker mode. And that stuff helped me, like, auto load the constants I was looking for by looking them up in my lib directory. Default config. I'm happy as far as I know this is what I need. But reading the rest of SideFerq, I also realized that this enables you to use short names. So, if you are in the same namespace, you can just mention a constant by a short name. Well, obviously, I want that. I'm doing that in Rails, so I want that again. It's also handling multithread code loading. I have no idea if I'm going to need that, but I certainly don't want to handle that myself. It sounds like something I really don't want to handle myself. And it also handle code reloading, which is not something I'm going to use because of TDD. But again, this is my approach. I know that most people don't do that. And code reloading is a very important part of code loading. So, SideFerq was like my first take, my first really great companion that I found along the way. The second one was dry container. Now, small disclaimer, I knew from the stuff that I was going to use dry gems because I wanted to. And as Kevin said, it's also a little bit about finding joy. So, I wanted to heavily rely on dry gems, but I wanted to wait until the use case was there. I wanted, because I did not only want to skip the requires, I wanted to not know the classes. I wanted to not call new in the middle of my code. My code is about business logic. Most of the code is about business logic. I wanted to separate, sorry, the logic about creating objects and the logic about like, I need something. And most of the time, when you're in a controller, in a Rails controller, you don't even care like where does the request object comes from. You're just like, okay, I want a request object. Just make it happen. If you're in a view, you don't care about what the view context comes from. You just have it. You just want it. And it's really comfortable to write code with just focusing on like using the stuff you need, not focusing on how you get them. So, this is what dry container brings. I've been using dry system, which is like dry container for handling all of that, and dry injector. And dry injector basically works hand in hand with dry container and allows you to call your services, call your dependencies by the small name, by the first name. You give a name to an object and then you can basically say, okay, I want this object. I don't want this class. I don't want to instantiate that class. I want specifically that object. And I'm going to use it. And I don't even care what its class is for. I want that object by name. Interestingly, this had almost no effect on the test. Even though it's a very different approach, I still had most of my tests instantiate object by themselves. Why? Because unit tests actually give a lot of fake dependencies. That's the point of unit, right? You want to test a single unit. So I was still building my subject into tests manually. And for the larger, the broader tests, I actually wanted to use the container set up correctly because I wanted to test that things were correctly wired together. So even though dry containers is like, oh, some you can stub and fake and change whatever you want. I didn't stub it because I was either using it and testing it or not using it at all in my test. And... Sorry. Yep. Yeah, I'm still in time. Dry container also brings something else, which is quite interesting. It's a settings, a settings object. And I realized very soon that the settings object was the object that I was injecting everywhere. Almost every part of my system needed to access settings. So I was injecting it everywhere. It was awesome. And dry settings provide some really interesting value. First of all, it allows any of the settings to be overridden by environment variable, which is quite important. If you know about 12 factors, it is one of the aspects you want for your config to be overridden by the environment that your program runs in. So that was the first part. And the second part is that you can coerce, you can define the type of your settings. Because if you work with environment variables, everything is a string. But when you work in your system, not everything is a string. We do have a lot of strings, but we have dates, we have integers. We have a lot of system. And usually what we do is we just parse them. Dry types allows you to create all types, name them for starters. Also naming things is probably the most important stuff we do in our work, I believe. You can name your type and get them correctly and get your settings in the proper types, which brings me to my next slide about dry types. So dry type creates a contract. It says, okay, this value, this settings, it has to be a phone number. And I'm going to explain exactly what is a phone number. And I'm also going to coerce like a string into a phone number, which means at the end of the day, I either have an error or I do have a phone number, which is exactly the object I want. And it makes a big difference. I don't know if any of you have ever created a class like phone number, like age, like bucket name. If you read correctly the literature about object-oriented design, we are supposed to do that. We are sort of supposed to do that, like subclass string when we want to make a first name. To be honest with you, I've never done that in my life. I've always used string and it's not a first name. It's a string. I know it's a first name. I know I'm not going to use all the methods of string, but the variable is name, first name, that's enough. Using types allows us to actually have proper types, more meaningful types, without creating full-blown classes for everything. Well, settings is one thing. But this contract, it can really be used for something else. It can be used for app input. When you are working in a web application, app input is a request. This is where most of our payload comes from. In our case, the app input was messaged from a queue, but the concept was very similar. As soon as we got one message, we treated it in a very similar fashion as we would have treated a request. When working with app input as a web, there's a very known pattern for handling that input, for validating that input, for correcting that input to everything that you wanted. These are form objects. We basically reused the same. I realized that I'm doing my slide in the wrong order, but you don't care because you don't have the order. But that's okay. We used kind of form object in the form of a dry contract. It comes from dry validation, that is the gem we have been using. Dry validation is really about two pillars. The first one is about typing. Eventually, it leverages dry types. It ensures that you get the keys of your payload that you expect, that you get the values that you expect, that basically your data is of the type you expect, that's the schema, that's the structure. Once you have the proper types, you still have business logic to handle. This is the second pillar of dry validation. A typical example would be if you have to handle a deadline. Imagine that somewhere in your payload there's a deadline. The first pillar would ensure that the deadline is actually a date because you get a string. Hopefully it's an ISO 8601 string, but it could be anything else. You want to coerce that in a string, you want to ensure that you have a string. If it's not coerceable into a string, you want the first error. But now that you have a string, you also need to validate that this actual date is in the future. This is what the second pillar is. You can create rules, business rules. That means that once your payload goes through the dry validation mechanism, you actually get a very valid, very reliable payload from a typing perspective, but also from a business perspective. Once we have that payload, what do we want to do with that? We actually want to process it. For that, we are using a pattern which is named Interactor. At least we used to use a gem which is named Interactor. You can think of an Interactor a little bit like an operation in Trailblazer. I don't know if anybody has used Trailblazer previously. No? Okay. All right. I'm going to go back. The idea of an Interactor is that this is the entry point to your business layer. Because the entry point to your application to most of the web application are the controllers. This is how... I'm not talking about the rules. Let's consider that the entry point is the controller. But that's not true because sometimes your entry point is your test. Sometimes your entry point is a rate task. Sometimes your entry point is an active job. Sometimes your entry point is a channel. So you actually get a lot of entry points into your app. But at the business level, you don't really care if you want to delete a user because of a GraphQL request, of a REST request, of an active job. You want to delete a user. It's the same business unit. And this is how we encapsulate things we are using in Interactor. One Interactor is responsible for one business unit. And well, very fortunately, DRI has a solution for us. It's a name DRI Transaction. So their name for it is a transaction. It allows you to create a series of steps. It relies on DRI Modad because each step can give you a result. And if the result is a success, then the next step is going to happen. If the result is a failure, then the next step is not going to be done. You're going to keep your failure. This is known as the railway oriented programming. Nothing related to rails. It's just because you either stay on your success track, like train track, or at each step you have a junction to your failure track. Well, the thing is we didn't use DRI Transaction. So I wanted to let you know because I would really recommend that you use it. I wanted to use it, but also we have a team of several developers who are used to our Interactors. And it sounded like a better idea to use what everybody knew than trying to reinvent the wheel. We had something, it's working well, everybody knows it well. So this is like my manager voice talking. If it's end broken, broken, don't fix it. But if you're doing it from the start, give a chance to DRI Transaction and DRI Modad. At this point in the talk, I hoped to try my own definition of DRI Modad, of Monad, what is a Monad, which is probably going to take the next two hours. So let's keep it. So the end of this slide is about why do we want to do all that validation early? And this was also something a bit new. First of all, like failing early is a good idea. But it was not enough because doing the business validation at each step would have made more sense. It's just easier to keep the business steps together. It makes more sense if you want to check some permission, then delete a record, then send an email. It makes sense that you do everything related to sending the email at the sending email step. It doesn't really make sense to already check stuff from the start. But the thing is, in Rails, we are very much used to a highly rollback-able environment because most of what we do, well, sending email doesn't count, but most of what we do is manipulate the database. And this is a huge comfort being able to say, my record.transaction do blah, blah, blah, blah, blah, blah, blah, blah, blah, blah. If anything goes wrong, just roll back and done, nothing has happened. When you're doing a microservice, at least what we are doing, nothing is rollback-able. Everything you do, if you send an API request to something, if you delete a file, download a file, create a file, there's no rollback to that. And this is why it was so important to check as much as we could right from the start. All right, next step. Next step, next challenge. The next challenge was an interesting one, as every challenge, because it was about design and design opinion. And there's no truth, there's no strong truth in design opinion. So what was the challenge exactly? The challenge was that we realized we were not using dry containers properly. It felt like we were supposed to use it in a new way. Why was that? The reason was that we are very used to object-oriented design, object-oriented programming, which means we are putting together state and behavior in small objects, and they are responsible for doing that stuff. And the dry system, the dry container, was pushing us to use stateless objects, because that's what you could enjoy if you want to inject something everywhere. It better be stateless. But the code we wanted to write, because we have a lot of experience with that, was stateful. We don't want a command wrapper. We want a command execution specifically about this option. We want to ask a specific invocation. We don't want the full program. So it was very important to be able to write the code that we wanted to write, but it was also important to use the tools properly. And initially what we did is we had that big interactor, or big entry point, get injected with a ton of stuff from the container. It was getting all the services that it would eventually use, and that interactor was instantiating all the small objects, the small life cycle objects that it was going to use, and it was instantiating those objects, giving them their state, so maybe the current date, the current user, the current payload, and all the dependencies that the objects needed. So maybe there's a command service, maybe there's an API client, so the interactor was instantiating all of that, which means the interactor knew about almost everything. There's a name for that. GodObject. And it's a bad name. So we knew we were doing something wrong. We had a small discussion, and we realized that actually the literature again had a solution made for that. There's a pattern made for that. The pattern is factory. So what we eventually did is that we created new services, factories, very shallow services. Each factory was injected with the services that it needed, and the interactor was simply injected with the factories, and the interactor was just asking the factory, well, give me a command invocation specifically about this file, about this API, about this payload. And it's not a fun because it was so difficult to realize at first that we needed that, but at the same time it was so obvious what was the solution. It also raised an interesting comparison with a former colleague of mine who told me he was like a functional programmer. He, I'm not going to say despised, but he despised object-oriented programming. Well, I said it. And he told me, you know, an object is just a set of partially applied function. He was very like this day in for like, oh, it's just a set of partially applied function. We have like, we have object at home. Well, it's not the same. But to be honest, like, introducing those factories gave me that feeling because we had like those functions. We were partially applying all the dependencies. That's like first partial application, and then we were partially applying the state. It also opened our mind about what is stateless, what is stateful. Usually state is like all your instance variables. It's not really true. You don't see things like this anymore. Like your dependencies might still make you stateless. And your state is really what makes an object throwable. So if it's a reusable object, it's stateless. If it's like a one-use object, it's stateful. That's sort of our new definition of that. And factories helps us creating one-use object because factories are all stateless object. Well, I felt bad creating the slide without mentioning a single dry gem. So I also want to bring one here, is the dry initializer gem. And to be honest, this is my favorite, and it's so small. The thing is this is so small that it's crazy that this is my favorite gem. It creates contractors. It just creates an initialized method. But why does it matter? Because if you are very strict about it, all your initializer very probably look the same. It's like you pass them arguments, and then you store them into instance variable. Nothing more because doing business in initializer is a bad idea. So you always get the same initializer times and again, and it makes no sense, and it creates noise. And if you're used to more style guides, it has to be at the top of the file, and it also takes a very important part, focus, because top of the file is very important. So dry initializer, just do that. It says you can create one line for each dependency or state that you want. You can give it a type. You don't have to, but if you have a drive type, you might want to. You can give it a default value, and you automatically get an initializer that accepts them, and you automatically create an ATTR reader for each of the dependencies. You don't want the reader. You don't have to, but by default, you get that. And that's it, and you just transform something very long and noisy into a series of lines. We used to have ATTR reader anyways. That's most of our classes have ATTR, one line for ATTR reader anyways. So it changed nothing in terms of noise. It changed everything in terms of clarity, intention, and anyone reading a file now gets something directly by reading those lines. And yeah, I'm still on time. Well, we were done with the code application. Of course, we had additional challenges, but eventually using those tools and approaches, we reached up the end of the application, and we were done, right? Well, no, we still had to package it. We still had to deploy it, because as long as we were actually solving problems, we had nothing. This is again a time when we realized how rich is the Rails ecosystem, because for deployment, you either get services like Heroku or similar services, or using Capistrano, which does everything for you. You write one CAP file and everything is magic. When we had to deploy, we were like, yeah, we have files with code, but we still have no application. So we get some help from partners about that. We use Docker Compose locally for creating containers. We use Kubernetes remotely for deploying them. We use Helm for actually doing the deployment. And this led us to realize that we still had problems, because we had no observability. We had very difficult access to the log files, so there was still a lot of stuff we didn't have. So what we did is we introduced Yabbaida from Evil Martian. I don't know if anybody is from Evil Martian here, but if you are and if you watch us, like, thank you, you're awesome Evil Martian. So we used Yabbaida, which is an observability framework. It allows you to mention what you want to observe, create metrics, without having to care like where you intend to put those metrics, what we intend to do with those metrics. And then another part of Yabbaida, you can mention, like, actually what you want to do. You can separate the two. So your business logic is not riddled with, like, technical details about monitoring. So this observability allows us to expose some metrics, which in turn enabled us to create autoscaling to measure health. So these are typically stuff that you get for free in Rails if you're using your relic or data. But we had to do it by hand. And we finally reached our latest challenge, because we are not experts in Helm or Kubernetes. We are actually very noob in that. So we had partners helping us. But those partners are also responsible for, like, running and ensuring that our app is working properly. So the way the agreement we had with them is they handled their own repo with everything they do about us. And we have our own repo with our code base. And the problem we realized, and we still haven't solved, is that part of the application is actually in the infrastructure. And this is something we are not used to do in Rails. But typically the queue we use have a dead-letter queue. If you try to read something and it fails, so you release, you retry to receive, it fails. After sometimes that message, you put it into a dead-letter because you don't want to lose waste more time trying to handle that. Another aspect is buckets have life cycle. If a file is forgotten there after 24 hours, you want to delete that file. You don't want to pay fees for that file for the rest of your life. And this is application logic. Even though it fits in infrastructure, like it is application logic. And this bothers me because application logic, anyone who clones a repo should be able to see everything, to know everything. They don't have to be master at everything. They don't have to change everything. But cloning a single repo should explain everything there is to know about this app. So at the moment we still have those two repo. One is like focusing on the infrastructure. One is focusing on the code base. Hopefully we will solve that very soon. But with that done, we actually had the app deployed, monitors scaled, we learned quite a lot. We actually made a blueprint out of that, so we are creating several other workers right out of that. And we feel much more confident actually using Ruby for something else than web application. So thank you, everyone, for your time. Thank you. Any questions? We have two minutes of questions, hopefully. You've talked a lot about... I mean, first you never talked about Rails, but you actually miss it a lot. It's pretty funny that it was not about Rails, but actually... Anyway, you talked a lot about types. Is that something more to bring to the rest of the ecosystem? Yeah, that's a very good question. So the question is, I talked a lot about types. Do I want to bring that into Rails? Actually, the interactor is something we do in Rails already, which means we are using dry validation already, which means we are using dry types already. To be fully honest, we don't use it enough. We sort of use it when we realize that we should have used it before. So it's like not good enough, but it is something we are using, and types have been very helpful in the past already. And there's a lot of other tools that we have discovered here, because we had to, and I very much hope that we are going to use them. But also, my first slide means that I'm no CTO, I'm no manager, which means I don't get to make those calls anymore. And it's very important to me that the one who writes the app are responsible for writing it, maintaining it, running it, so I can influence, I can give my opinion, but I don't make those calls anymore. Yes? You said that you use dry monot. What has been, can you tell me more about your experience, because I used it quite extensively in the past before they introduced these two notations. And it was very sticky to the code as in, it made Ruby not look like Ruby, like something else. So, if there is something changed there, how's your experience? All right, so the question is, do I use dry monot? What do I think of the do notation, and how Ruby-esque does it feel? Is that right? Yes. Okay. So I am not using dry monot, except for like toy projects. So we are not using dry monot in this, so our own take is using our own interactors. So whatever I'm going to say is out of my experience on toy projects. I've learned initially about monads in Haskell. This is still very painful to me 10 years later. So my take on monads is like, most of the time, it's like not the right tool. And it's something that people, the learning curve for understanding what is a monad is so high that once you've earned the right to understand what it is, you want to put it everywhere. A little bit like meta-programming. So this is my take on monads. I wouldn't force them into anyone who is not very comfortable using them. I do believe that it is a very elegant solution, but I also do believe that sometimes a bunch of if-else makes the team happier than using the best tool for the occasion. And I don't have any opinion about the do notation and how Ruby-esque it feels. All right, thank you.
The world of Passkeys
you you has anybody use your past keys as your 2FA method on GitLab? Oh cool, oh awesome, awesome. I should talk after the talk. Anyway, so here's more of the talks. I talk about past keys in the RubyConf Thailand, RubyConf Taiwan, and today I'm going to try to talk, there's going to be a mix of what I've talked before, but the things that I spent more time talking in the other conferences, I'm going to talk less here, and I'm going to talk about some more stuff that I have not talked in the other talks. So what is past keys? Past keys are replacement for passwords. It's part of a wide authentication standard. It was designed to replace or to reduce the over reliance of passwords. Internally, it's a public and private key pair used for challenge-based authentication. It uses public key cryptographic, which has been around since the 70s. Sometimes it's protected by your device biometrics, sometimes it's discoverable, and sometimes it's bound to your device. That's an interesting scenario, it's an interesting use case, but I'm not going to talk about that today here. These are centers that I kept in all my slides and all the presentations, and the reason why I kept these centers when I was learning about what past keys is about a year ago, these centers helped me wrap my head around what past keys were. When I read these, I was like, oh yeah, these make sense. And it says, a password is something that can be remembered and typed, and a past key is a secret stored in one's device in a locked biometrics. That sentence comes from the website, the source is there, past keys are there, there's a lot of interesting stuff in there. These centers have some caveats, not exactly I can remember all my passwords, but it's a good definition. This is one that I like. Past keys is a public and private key pair protected by a device biometrics and used for a challenge-based authentication. So let's break that down into sentences. Public and private key pair, like I said, is used public key cryptographic, and the idea that you have a public and a private key, and if you keep your private key private, and keep your public key public, you give to the world, and if you need it to encrypt some data and send some data over the world, over the web, you encrypt and use your private key, and then whoever wants to read that can decrypt that data with your public key. So that's kind of what this is about. And it's protected by a device biometrics. According to the standards, according to the standards, to use your past keys, you need to first do your biometrics verification. I don't think that's going to be the case all the time, but it's one of the important aspects of past keys. And used for a challenge-based authentication. Again, this goes back to the public and private key encryption. When you go to a website and you're going to sign in for a website, you just don't present your credentials. You do a private and public key encryption. The website is going to send you some data, like, hey, HILU, oh, you're HILU. So in order for you to prove that you're HILU, I want you to encrypt these data with HILU's private key. And then, okay, all right, I'm going to get my private key. I'm going to encrypt these data. I'm going to send it back to you. And then the site is your application. When you're signing in, it's going to look at that encrypted data, that digital signature. Oh, there is this gentleman here pretending to be HILU, and he gave me a digital signature supposedly encrypted with HILU's private key. And then I'm going to use HILU's public key to the crypt. And if I get my data back, then voila, it's your HILU or someone who owns, who have access to HILU's private key. And that's what a challenge-based authentication looks like. So like I said, this is a past key, this is part of a web authentication standard. There's a W2C standard. This is a screenshot for the first public working draft that was created in 2016 by folks from Knock Knock Labs, Microsoft, PayPal, and Google. Basically, several years ago, all the big tech companies, they got together in trying to create a standard for a better, more secure web authentication. And from that, they created an alliance called FIDU Alliance, which is just a group of companies with a goal of promoting a more secure web authentication. And this also is like that I like, that I have in all my presentations. This talk here today is that small little thing there. And just to give you guys a little bit of a perspective that when you're using your web developer hat and you're adopting your support to past key to web application, you're all the way at the top of the iceberg. But there's a lot of things that it goes behind the scenes that for past keys to be a reality and be a more secure, more intuitive replacements for passwords. And anyway, there's a lot of stuff in there. I'm going to talk just a little bit. 2FA or not 2FA? This is one question that has been kind of banging my head for the past couple of months. So this is a screenshot from the FIDU Alliance website from the FAQ section. And they have a question there. Are past keys considered multi-fact authentication? So let me read that for you guys and we'll talk a little bit more about it. So past keys are kept on users device, someone the user has. And if the relying party requests user verification, RP is relying party, is your web application, can only be exercised by the user with a biometric or pinning something the user is or something the user knows. Thus, authentication with past keys embodies the core principle of multi-factor security. That's kind of the beginning of the answer. There's a middle here that I'm going to skip that and then let's go to the end. At the end it says this, note that some regulatory regiments still have to evolve to recognize past keys as one of the officially listed forms of multi-factor. This is an area of active engagement for FIDU Alliance. Just for your information, I've been learning and studying what past keys is for about one year and this note, it's since last year, exactly like that. So I don't know exactly what active engagement means for FIDU, not what regulatory regiments they are talking about, but for the past year that sentence is still there. So these are reminding for me, these are reminding for me, just to remember what two FAM means, right? Something the user has, something the user knows and something the user is. If you have two of those, then you have two facts authentication, if you have more, you have multiple facts authentication. So past keys, past keys is kept on your user device, on your phone, on your USB stick. Sometimes, maybe most of the time, it's going to be replicated to your cloud account, right? It goes to Apple, Google, Microsoft account. This is something that I have, the user has. And past keys can only be used after biomedical pain verification, something the user is. Again, there are some cases where I think you're going to be using your past keys without those things. And that's where one thing that I was going to mention in the previous slide is that when I talk about these FAQ, I tend to say the FIDU Alliance make a soft claim that past keys are two FA. Because these words, me here, that was not, it was probably written for a lawyer, I'm sure it wasn't an engineer. So, but anyway, so that's what I mentioned that sometimes in a couple of forums, in a couple of discussions, I mentioned that FIDU makes a soft claim that past keys is two FA. But we're seeing here that in some situations, some scenarios is going to be two FA. And I think there are some situations that is not going to be two FA. I was going to remember the sentence, right? Some regulatory regimens still have to evolve. And this is an area of active engagement. But if you are using past keys, you have your phone, a USB stick, and you need to use those to authenticate with past keys. And I am my face, my finger, and I need to use those things if you're doing the biomedical verification. And I know my USB stick pin, if you're using a hardware key, and I know that I think now some USB sticks, they now validate your fingerprint. So there are a couple good, strong arguments that you're using at least two of those three things, right? But they still, some regulatory regimens has to evolve somehow. Password managers. So just for information, the demo that I want to show at the end of the presentation, I hope I have time and I need to kind of speed up a little bit, maybe. The demo that I want to show was created before all the password managers had support to past keys. This is a new technology. And way before password managers were able to do past keys, to support past keys, you had to have native support on your Macs, Windows laptops, iPhones, androids, the three browsers, major browsers, right? Safari, Chrome, and Edge. Firefox took forever to work finally on the Mac. Finally, last week or two weeks ago, they finally released a version, I think it was 122, that now works on Mac. Firefox worked on my iPhone before it worked on my Mac. So maybe there's some regulatory regimens that explain some role somewhere. But password managers. Should they become past keys managers? Can password managers have access to our device biometrics? Your cell phone, Touch ID, Face ID, your laptop, Touch ID. Should they have access? Password managers necessary in these world of past keys. By the way, these demo that I want to show you, a couple weeks ago, I was trying on doing some, updating some dependencies. And I was doing some tests and doing some manual tests before deployment. And it stopped working on Safari when I was logged in on my vault, on my password manager vault. So I had to log out from my password manager to be able to use the native support that Safari and Mac has for past keys. But yesterday when I was rehearsing this talk, it was working again, and then it was broken, it was a little bit buggy, but it looks good. It's nice when I do have the option to be logged in on my password manager on Safari. And I can decide whether I use my password manager or use my native support on Safari and Mac. When I can't do that reliably and not buggy, that's going to look good. But I think I brought about this conversation on password managers because this is one thing that I've been wanting to dig a little further, because I think this plays back to the 2FA not 2FA. Because when you create your password, your past keys to your password manager, you're not doing your biometric. But this is an area also that I still need to study and learn a little bit more. Maybe some past keys, some password managers, they are worth sniffing their way out of not using your biometrics, but it's still 2FA. But anyway, this is still something that is still a little bit in the gray area that I don't know how this is going to play out. Alright, now let's get down to some more how it works and under the hood. So this is the area of the talk that I talk in more extensive details in the other talks. So I'm going to go a little faster here today. There is something here that you feel like you want to listen to me talking more about it. You'll find some the recordings of the other talks online. So okay, I'm going to talk about registration authentication. Remember, this is replacing passwords. So these are about sign up, signing in and re-authentication. Re-authentication is just authenticating you one more time. So just going to the signing in again. I'm not going to talk too much about it here today. Alright, so registration. So you're a user, you're a new browser, you go to a website and you want to sign up, right? So RP is a relying party, it's your web application, Ruby, Radio's app or whatever stack. And then the site is going to ask you a public key. Now I need your public key. So the user is going to defer to your device. This can be your cell phone, this can be your laptop, this can be your USB stick where you're going to create that pass keys. I'm using the phone for simplicity. And it's going to do your face ID, create your pass keys. It's going to sync your private key to your cloud account. And it's going to send back your public key. And then the relying party is going to do some verification, create your account, alright, you're signed up, right? So now let's take a look inside. So in order to look one level down from that diagram, I'm going to use this application. This is a web of Rails demo app created by those folks from set-up code. And I'll talk about the initiation phase, what happens in the browser and verification phase. So remember, these days like that we saw in the last one. So that's the registration, all the steps. So this is the initiation phase, this is what happens in the browser, and this is the verification phase. So now let's look down one level down on these blocks and see how they look like. So for this particular application, when you're trying to sign up, that's the JSON that goes back to the server. And then here the server, the server is going to do four things. I'm going to generate a web offer and user ID. I'm going to load the web offer and settings. It's going to create a challenge and then it's going to return a JSON back to the server. So, and here's what the JSON looks like. I'm not going to talk, get into much details. I'm talking more details about these things in the other talks. So today I'm going to go a little faster here. So this is the JSON. The bar there shows that at the top is just static data application settings. And at the bottom it's based on user session. Whatever username you want to create and the ID that was created in random and the challenge that was created in random for your sign up, for your registration process, for that particular user registration process. So here is a documentation straight out of the gem, web offer and Ruby gem. And this is a configuration that you need to do in your application. Basically here you put your origin, your name, and then there's a bunch of other fields here. Just to keep a little bit of what's happening in the world, let's look at GitLab, right? So GitLab is so good, GitLab is open source, right? So just go look at this source code and study and read and see how they're doing and maybe help a little bit. So they basically use the exact same basic version from the gem, but they do change one default. So the encoding, the default value is base 64 and they use base 64. The base 64 you are when they use base 64. This is as of my last conference talk, last December. I'm not sure if we change it, but that's the URL you can guys go to and see if we change it since last December. Anyway, here is just one little detail. So the user ID is one of the things that gets created for your session. And then look at that. It's just a user ID, web offer and user ID that got created is just a key random bytes. The same thing happens with the challenge. So these are two things that got created by the server in the beginning of the registration phase. It's also another secure random. So both your web offering user ID and the challenge just random bytes. So, okay, then we saw that block. So now let's look into what happened in the browser. So when the JSON happens and comes back to the browser, your code in the application is going to call the browser API. And then when you call that your browser and your OS is going to default to your device. And then your device is going to create your pass keys. Remember, it's going to sync a private key to the cloud to keep your device and then sends back your public key. Oops, sorry, wrong arrow. And then here's your credential and then here's your public key. Okay, so that's everything that's going to happen in the browser. And this is a JSON. That's a JSON that gets back to your server. There is some duplication there, I'm sure of it. I'm not sure if it's a bug or a feature. And this whole block here is duplicated here. I didn't have a time to kind of look into what that is. But anyway, that's what's coming for this particular application. But I think this is generated by the browser API. I don't know where that comes from, but anyway, one day I'll try to figure out. So anyway, that's kind of what happened here in the browser. So now let's look at what happened at the verification phase. So now that that whole JSON that we saw is going to be sent to the server, and the server is going to do these four things. And remember, here, note one thing here. These are two separate HTTP requests, completely independent for each other, with a shared context. You need to run one HTTP request to get user ID and challenge, and a second one that you do a bunch of stuff and then you send back the result. So the first step that is going to do here is a series of verifications. And then if your pass key, if your data is verified, then you're going to get your pass keys created. You use it to get created and you get a JSON back, a simple response back to the browser. So I'm running a little bit out of time, but you can just check one of my talks online. I talk in more details about these things, but here we're looking at the gem, what the gem actually does, what your application. This is your first step in your application to create your user and your pass keys. And then these are all the stock trades that happen inside the gem to verify your data. There's a bunch of stuff here. This is the one that does the most, a bunch of verifiers. And here's the one that verifies your challenge based on the expected challenge. These two here, two interesting features of your pass keys, user presence and user verification. And these are the one interesting thing that I put in here. The valid challenge is just an open SSL secure compare based on what came to the server in the second request and unexpected challenge. I didn't look into details what actually these expected challenges, but part of these verifications is a bunch of secure compare on open SSL to make sure that the data is, everything's good. There's no many in the middle, nothing is getting the middle of those two HTTP requests. And then this is another one. And then if your data is verified, your pass keys is valid, then you're going to create your record. This is what the, then it's going to get back to your application. And now you finalize the user creation and create the credential. When you use this application, that's what your pass keys in your database looks like. And then we're done. Right. So we finish the whole process here and look into all these JSONs, HTTP requests in response to what happens during the registration. So authentication. Now we have an account. I'm going to sign in, right? This is the challenge based on indication, right? So the application is answering, hey, here's some dummy data, signed the data with your private key. I'm going to defer to my device. I'm going to do my first ID. I'm going to access my private key and creep that data and send the data back to the server. And the server is going to do a verification. The creep using my public key. And if the creeps are good, voila, this is Hilo, your Hilo authenticated. Right. So authentication works. Shall we look inside? We shall not. We're not going to have time here, but you should need in the other conference that the goal of the talk was pop into pass keys only looking under the hood. I didn't have time to do this. I stick only to the registration. But anyway, that's an application on GitHub and you can go check it out. And pop the hood on the pass keys. All right. Live demo. We are running out of time. You guys want to see a live demo localhost 3000 or the actual product name. I'm going to do localhost. I'm not going to risk the Wi-Fi here. Anyway, so this is an application that I created. It's a Rails app that I created for hackathon. And just basically what I want to show here is pass keys. What I just showed you guys sign up, sign up, sign in and reauthentication. So this is, you see it's running on my localhost. The logs are here. It is running. This is how I was rehearsing my talk. So I create a Ruby dev for him. So I'm going to log out here. I'm going to log in again. And so unfortunately I'm going to do, I'm going to create an account, right? So Ruby dev room at 4th then 2024. And this is a pass keys label. This is an application that I created for hackathon. So there's nothing here. I'm just going to show pass keys, sign up, sign in and reauthentication. And I only collect an email for sign up. And don't even use password. There's no password in this application. Only pass keys. I'm going to put here Safari, which I'm creating. And then if I do that, it's going to be presented by your, you can do your biometrics. These are the options. I'm interested in one. I'm not going to have time to show today, but if I put the right finger here, you get authenticated. And they are always unique for website. They're always strong. They're efficient resistant and rich in resistance. There's a lot of marketing things that goes down behind the scenes that makes pass keys a lot more secure than passwords. So this there will be authentication, right? I already, I'm already authenticated. And I'm going to make some sensitive transaction with authenticating again. And basically what he does is the same thing. If I put the wrong finger, it doesn't authenticate. If I put the right finger that I have configured on my laptop, boom, I'm there authenticated. So now if I sign out, I'm going to sign in again. This is kind of the auto discoverable on pass keys, where when the browser detects a device that supports pass keys, it shows you, oh, I have five minutes. I'm not going to talk too much. Anyway, so let's move on. That last one. So that's the one that I create today. All the other ones are attached account that I have and rehearsing the talk. And then if I put here, then you're authenticated. Everything that you show here is, is this app that I created for this hackathon. Summer last year. All right. Hello Ruby. Pass keys in the Ruby community. I want to give a shout out to the trailblazers in the past, the past keys trailblazers in the Ruby community. Gonzalo Braulio from CELACode, Pete Lavica, Thomas Cannon. CELACode is a web agency in Uruguay. And they're the creators of the web offering Ruby Jam. According to the Jam spec, the authors are Gonzalo Rodriguez and Braulio Martinez. If you're doing pass keys on a Ruby, on a Rails Ruby app, this jam is going to be on your Jam file or Jam file lock. And the first version was created in 2018. The last one is in December last month, two months ago. Peter Lavica. Peter Lavica, I'm sorry for the pronunciation. He's a Ruby on Rails developer. And in 2021, he wrote this article, multi-factual authentication with Rails and Web Off-In and Device. It was originally published at Honey Badger blog. And he also wrote Rails app that goes along with his blog. This article is really nice. I really enjoyed reading it. I strongly suggest you do so. And at last but not least, Thomas Cannon. He's the creator of Ruby, pass keys GitHub organization. He also created the warden, Web Off-In, Ruby and Device pass keys Jam. And he also have a Rails template app, Device pass keys template. You can run it on your laptop if you want to. Thomas is the one person that makes a huge difference for me because I was reading about pass keys. All the pass keys were popping up here and there. And then I was like, okay, I need to read what pass keys are. I don't know what it is. And then I went out with my life and I'm busy. And then one day, Thomas sent a message, posted a message on his social media saying, hey, I just released this Jam, Device pass keys. Hey, it's a public, first public better version of one art or something like that. Go check it out, send some feedback. And I'm like, okay, I've been here about these pass keys. No, that is a Jam that I can put in my Rails app. And is it Device pass keys? I'm like, okay, all right, damn, that was all the motivation that I needed to learn about these. And this was literally about a year ago, like January, February last year. And so anyway, so that's Thomas. If you weren't for him, for his message, or if I was, I don't know if I would be here today. Anyway, just one single message on his social media. That's it for today. And I have two slides about questions. These are folks, if I have any question. If you only have pass keys on your site for your application, how do you now log into the application from the device? So the question is, if you only have pass keys in your application, how do you log in your application from our device? That's that other device options. So you can log in from an external device. I can actually show that real quick here. So if I am signing up here, there is this other option. And so when you pick that other option, you can, there is an option here, you can use your iPhone. Because I created this account on my Apple, and my pass keys got replicated to my Apple account, I can actually do this from my phone. So when you do that from our device, that's what happens. So let me see if I can do this here. So in your browser, I'm going to show this, and then I'm going to do the pass keys validation, the first idea on my phone now, instead of doing the touch ID. So if I do that, so, okay, I need to turn on Bluetooth. Bluetooth, okay, let me see now, the camera. So, not photo, pass keys, come on. All right, the operation could not proceed, please try again. No, no, it's not local host. I think it's the camera, and the Bluetooth, sometimes the Bluetooth gets messed up. But you can do that, I tested these, and oh my gosh, it should work. It works. Oh, why, anyways, maybe it's more Bluetooth here. Why do you use Bluetooth? Maybe I need, yeah, so it's not going to work here, but you can do that. With that QR code, you can scan your QR code here, and then your device is going to communicate with Bluetooth with the laptop, and then it does the face ID here, and then you finish your flow on the authentication. I would like to ask if you can use this kind of pass keys on all sort of proteins such as maybe SSH connection. So the question is if you can use pass keys in an SSH connection. In an SSH, you already use authentication with PEM files and certificates that's strong. This was created for web authentication. You can store your pass keys in your hardware device, but I don't think you're going to be using pass keys in CLI, in any command line, or SSH or anything. There's already strong mechanisms for authentication that are authenticating in that, I think. Let's consider this computer, not biometric way. Can I add a USB dongle because it can have fingerprint sensor? Can I consider it as a way to connect with a pass key? Yeah, so the question is if in his laptop that doesn't have biometrics, if he can put some USB dongle to do the biometrics. I do, I remember I read some information, some news show up in my radar saying that I think Google or maybe UB key, I don't know, they're creating some hardware keys now that validates your fingerprint. So it's like a typical UB key, but instead of using PEM, which is the traditional use of UB keys, you actually validate your fingerprint. But those things cost 30 dollars a piece and I lose them all the time. I think I can add on this fingerprinting biometrics if you don't mind. So there is a parameter in the user verification from RelayBuddy, which could be required, I think preferred and discouraged. So potentially the verification biometric is not needed at all, depending on the relay party, this first. And second is up to password manager or the system to verify the user. It could be a password, pin code, it could be Windows Hello, it could be whatever. And there are devices you can plug in and use Windows Hello for example. Windows Hello is the Microsoft solution that's equivalent to the Apple keychain, where that's where your application is going to be stored, your pass is going to be stored. In the user verification, yeah, you have these three options. But I think that's one thing that I wanted to try to do for this conference and talk to you today, is about the password manager in these two FA. I'm trying to kind of get to the bottom of this and understand and see how this is going to play out. Because when you require the user verification, you're going to create your pass keys and your password manager has to do something. In the test that I did with password manager in this demo application, it doesn't ask anything. I'm already logged in on my vault, but I'm logged in on my vault on my laptop, but there isn't a second layer and then that's all the words, words, myth there. But anyway, there's some words, mything around 2FA and not 2FA. I'm not concerned about words, mything. I was just curious to see which factors are going to be used when you are doing password managers. And if there is some relation to that soft claim from Fido Alliance about 2FA, regulatory regimens and whatnot. But I don't think that, and I saw people complaining that they don't want to use biometrics in pass keys. In one of the other talks, I mentioned that maybe six months from now, we're going to have to differentiate what pass keys are from biometric pass keys. Because biometric pass keys, I have a strong feeling that that's a 2FA. It does use two factors. But if you don't use the biometrics, you only use the pass keys, public-private key encryption, then maybe you're not using two factors. But anyway, that's kind of something that I still need you to say a little bit more. Sorry, we don't have more time. We can, okay, if we're going to have questions later on. Thank you. Thank you.
Backtracie and the quest for prettier Ruby backtraces
Okay, let's get started. So hello and welcome to Backtracy and the quest for prettier Ruby backtraces. So who am I to be here today? My name is Yvon Zhe and I'm currently a senior software engineer at Datadog. I've been in love with Ruby since I started using it professionally around 10 years ago. And I am a really big fan of going into an exploring language run times like C-Ruby, J-Ruby, Truffle Ruby, Java VM and others. And I've been attending FOSDEM every year since 2017, but this is my first time speaking, so I'm excited. So I also- Yes, pray for the cable. I'm also excited to play about concurrency, application performance and making tools that help us kind of look at our apps in different ways and new ways and try to uncover new insights about performance by looking at them in a different way. So that's how I ended up working on this thing, the Datadog profiler for Ruby. So if you're curious, come talk to me about Ruby Performance. I like to talk a lot about that. So, but for today, what we're going to talk about is what's the backtrace? How can we get one? Then how does the Ruby stack work in reality? Then we'll talk a bit about the backtracy gem, this is not good. I will be talking about accessing the internal VM APIs to do some of the weird things that the backtracy gem does. Then we'll play with backtracing in action and then we will talk about maybe a new feature in Ruby 3.4 which is having class names in backtraces. So what's a backtrace? How to get one? If you're a Ruby developer, you probably know what the backtrace is, but quick reminder, it's mostly like a trail of what methods were called and are still waiting to return at some given point in the app. And it's also called like a stack trace in some languages because it represents what's on the thread stack. So backtrace, stacktrace is usually kind of the same thing. And okay, if we have this A that we call A, that calls B and then raises an exception and then you get a backtrace. So we probably see this way too often and maybe you have some nightmares when you see this, but hopefully it will help you figure out your issues in your app. So there's multiple ways of getting a backtrace in Ruby. One of them is rescuing an exception. And an interesting thing is that actually the backtrace gets set on the exception when the exception is raised, not when it's created. Because you can create an exception but not raise it immediately. And so the backtrace only gets set when you raise it. And you can get a backtrace by just getting a thread and asking for it. Or you can use the color API, which is part of kernel. So it's part of every Ruby object, so it just can type color and you will get the stack trace of the method that called U. So you might have noticed that were backtrace and backtrace locations. The methods that end with locations return an array of these location objects that includes absolute path, base level, level, etc. So basically it gives you a nice domain object to represent the stack trace. Whereas the other method just kind of represents, just to return you the strings that Ruby prints. So that's the difference. There's also some Ruby VMC APIs to getting a backtrace. A few ones for different kind of use cases. And actually these two at the top will come back to this in a bit. So talking about the stack itself, how does the Ruby stack work under the cover specifically for the C Ruby runtime? So the idea is that a Ruby thread usually has two stacks. One is the Ruby stack that we usually see on our application and the other is the native stack. So the stack that the VM, which is a program built in C, has. And we can actually look at both of them in a really weird way, which is let's crash Ruby. And this thing is a weird thing. So I'm telling Ruby to send a segmentation fault to itself, which will crash Ruby. And then when we crash Ruby, what we get is this thing, which is the output of the Ruby crash handler, which includes a lot of nice things. So if you ever get the crash in Ruby, please do include this when reporting bugs. It's really useful. And the first thing it shows is it shows the Ruby stack. So here on the bottom, we see, okay, we have this each that represents our each on our code. Then we have the block, then collect, then the block, then the call to kill. So probably not a big surprise. One thing that is interesting, and you can see there at the top, is that Ruby actually, at least this Ruby version that I'm using, uses C methods to implement each collect and kill. And so you see that internally Ruby is actually keeping track of that and knows that there are C methods for that. This is not very good. And this is actually the native stack, which is also printed in that whole big thing. So please ignore the lot of text. The thing that you're caring about is this column in the middle, which is the names of the C functions that the Ruby VM is actually using. And you can actually, if you squint hard and ignore a few of them, you can see our app here. So we can see each showing up. We can see each showing up. And then we can see the block, the call to yield. Then we can see the collect showing up, RBE array collect. Then we can see yield. And then we can see kill. So we can see all of our methods. And you can additionally see these two methods, which are the Ruby code itself that we're writing. And those methods are RBVM exec and VM exec core is the Ruby VM actually executing the byte code, the Ruby byte code for our application, which is kind of the glue code that is between the other functions that you see there. And then at the top you see the code for the VM to handle a crash. So this is the two stacks. So let's focus on the Ruby stack and kind of ignore the native stack mostly. So how is it represented inside the VM? So inside the VM there are a bunch of structures in memory that represent the stack, so how do they look? And so hang on, I will show three slides of C code and then we can come back to actual Ruby code. Please don't kind of like stab your eyes out or something. So yes, when it shows up, there's a in VM core, it's like a VM header where a lot of the internal Ruby interesting things are. And there's like this RB thread struct that includes a bunch of things. And this is what Ruby holds for a thread. And inside that we have this RB execution context thing, which keeps a pointer to this other structure with this RB execution context which has a few more things on the thread that were separated for reasons. And inside here we actually see the size of the stack and the information about the stack. And then we have this array of RB control frame T elements. And this is a pointer into an array that then has these entries, the RB control frame struct. So basically these entries are what represent a stack frame in the VM. So if you see five lines in your stack frame, there will be five of this. And you see that there are some things in here, like if you're wanting to see IC which is the instruction sequence. So this is like the Ruby byte codes for that method or block or whatever that's getting executed. You see like self, the object on each which was called. You actually see like JIT return which was added to support YJIT and the other JIT so that they use that. And there's a few more things that we'll ignore. But yeah, this is how the Ruby VM represents the information that's on the stack internally. So whenever a method gets called, a frame is pushed to represent this new method that got called. So there's this VM push frame method that, like the interesting part is here on the right, which is like, we're setting it up. We say, we have the self object. Like there's some things that we want to care to track. So that adds one more onto the stack. And you would not be surprised if I told you that this stack gets popped and there is a VM pop frame method that actually function in C. That actually takes care of this. So fine, this is kind of what you might be expecting. So let's talk a bit about the backtracing gem. Yes, maybe this is good. I'm doing timing. So the backtracing gem is this really weird gem that I create. And let me tell you why I created it. I created it because of something like this. So if I show you this, main, print stacks, new initialized times, block initialized backtrace. If you squint at it a bit, maybe you can speculate on what's going on. But it's kind of hard for you to get a lay of the land and understand what's this weird example thing doing without looking at the source code. You need to be looking at the source code and then it makes perfect sense. It's here, it's here, it's here. But if you're not looking at source code, it doesn't make a lot of sense. So this is something I was thinking of. Like can we actually improve stack traces and give you more information so that you can read the stack trace and get more information without actually going to one or more files. Because this could be across ten files and you would have to follow along. So actually this is the code, you don't have to read it very much. The interesting thing is that we have this method print stacks that gets called here that creates an instance of print stacks that then initialize and then inside there's the times and then we print the backtrace. But I've shown you the code. So the idea is that Ruby what you saw was printed with the Ruby backtrace. And with the backtrace.jm you can instead get this. You get the class names. You get like a dot on print, hello, Fosdem and here you can see like the namespace. So here you see that like we're calling you on the class and then this is like an instance method. And then we're calling integer times and then we're having a block inside initialize and then backtrace location. This is kind of the thing I wanted to experiment with. Maybe it won't look exactly like this, maybe it will look different. But try to get more context so that you can look at it and you will go like, I think I see what's going on even without opening up your editor and maybe navigating to the ten different files. So this is what I mean about prettier backtraces. I wanted to experiment with adding more things, things such as like class and module names, things such as like show a dot or a hashtag if the method is or like an instance or a class to be able to quickly distinguish that. Maybe distinguish like singleton methods, so methods that you define on a specific object versus just like a regular method from that class so that you can see this is a weird thing that showed up on this object. Maybe that's relevant. You could distinguish refinement, which is this weird thing, which is like methods that show up based on like some context thing. You could maybe show method arguments, maybe that's useful sometimes for you to distinguish between a few of your methods. Maybe even show C function names or like file names and line numbers. Because one thing you might have realized is that I shown you that array and the collect and whatever methods are implemented with C code. But you never see the C file and the C line where they are implemented in your backtrace. So if you want to actually follow that into the VM and understand what's going on or maybe you just are working on a Ruby native gem, you actually don't see that information, Ruby hides it and doesn't even keep it. Another thing is like maybe even have some visibility into the native sex and what might be going there because you might be debugging like this postgres or my SQL driver which is going into C code. So how far did I get? Well, I got this working, this working, this working, this working. This is not, I haven't tried it yet. This is a really awful hack, so let's say maybe. And this is not working yet. So I'm still kind of experimenting with how far we can get. So a question is like how does backtrace work? So the TLDR is basically I've shown you how things get stored inside Ruby. So we basically just like go in there and get what we need out of Ruby without Ruby really having any APIs to do this thing, which is fun. So but these are internal VM APIs. So they are like in private headers and they are not available to gem. So how does this work? How can we access this information? And this is like the cool thing about like that this prototype allowed me to play with. So let's talk a bit about accessing Ruby VM internal APIs. So what's the backdoor? There's actually two different backdoors for accessing this VM internal C headers in C Ruby. One is the hidden Mjith header. So you might have heard about the Mjith experimental JIT compiler. So from Ruby 2.6 and 2.3.2 it was a part of Ruby. And it actually generated some C code and then compiled it. And that C code actually needed a header with some of the internal things. And so what the Ruby developer did was very silently, they went into this folder which is like a weird name and they created this RBM JIT header which is nobody supposed to use. And put that information there. So we can actually search this information from there and then use it. So yes, it's great just for the private use of the Mjith compiler. And if you import this, it's like weird working with it and a bunch of things doesn't work very well. Because it was not supposed to be used by anyone other than the Mjith compiler. But it includes like a copy of all the things we're looking at so we can make it work. Backdoor number two, which is like one of my weirdest backdoors, which is the device Ruby course of gem. So the idea is since the Ruby VM doesn't have any of the headers it needs. Thank you. This gem actually just kind of copypites all of the Ruby headers. So it has a folder and it has like some folders for every Ruby release and then kind of someone just copypites every header in there for every release and then release the new version of the gem. It's very crude, it works for all Ruby. So 3.3, now that's Mjith is gone in 3.2. And it also works like as far back as like Ruby to one or two zero. But yeah, you could do something like that. So the backdoor is like once we know what's the shape of these VM internal structures we can access them in backtracing. And if you remember this slide where I said I'll come back to this one, RB profile frames and RB profile thread frames, now is the time. So what I did in backtracing is that I started by copypasting RB profile frames into the backtracing code, just going into the Ruby VM like copypaste. And obviously when you copypaste from like an open source project, make sure you understand what's the license and if you can do that, you can do that with Ruby. And so I did this. It's fine, but make sure to like have the copyright headers and all that information. And then I added a bunch of features to experiment with it and get all of the things I was talking about. And actually, it was really interesting, this approach was really, I found it a really great way of prototyping something without having to depend on a custom build of the Ruby VM. Because I actually started by modifying the Ruby VM, but then I have a Ruby VM that works only for me and that features only for me. Instead, if I do this, I can tell you gem install backtracing and you can get it as well. So it's like an interesting approach to like playing with something that you would otherwise not play, but be careful. So obviously there's a lot of small details to get right. I am glossing over a ton of things needed to kind of get this weird thing. So for instance, you might want to access some VM internal structure, but you might not know exactly how to access it. So sometimes you need to kind of go read the Ruby API very carefully and see, this object that Ruby hands me actually internally has a pointer to the other thing, which has a pointer to the other thing, which eventually is what I want. So sometimes you need to do a bit of squinting at Ruby and understanding like how are you going to get access to this information. Like in some cases, like the copy pasted code also called other private VM internal APIs that are not exposed by the VM. So when I copy pasted, I compile it and then I try to run it. It doesn't work because those APIs aren't there. They aren't visible to gems. So again, like a lot of details here. Sometimes you just copy paste more and you keep copy pasting until it works. Sometimes you need to re-implement some things yourself because it's easier than you look at it as like okay, I don't need all of the things. But you need to play it a bit with it until you understand how you get it to work. But it has some really cool side effects. So for one, I was able to get this to work as far back as Ruby 2.3 with a lot of conditional compilation things in C. And even as I've done some experiments, even as far back as Ruby 2.1, so I think you could do this. And it was kind of cool because this includes back porting of RbProform frames features. So I copy pasted from Ruby 3 version and actually they have added a few features and some bug fixes and whatever. And so by copy pasting this and then using it on Ruby 2.3, I was actually having features that were not present in Ruby 2.3 from the modern version of the code, which was really cool. I also did not do it alone thanks to KJ from Zendesk that did a lot of work on Backtracing. And so let's quickly take a look at, interesting. Is it one full color? I don't know. So let's take a look at how we can use Backtracing. So you can go on the website, you can install the gem. As I said, it's the magic of doing this thing in this weird way, is that it works for you, for everyone, just install. It has this API which is Backtracing Locations, which gives you an array of locations, which is Backtracing's version of Ruby's location. So you get a lot of nice methods with the different things that Backtracing got, but Backtracing has a lot more things, and I will show you in a bit. Then you also get color locations, like Ruby, you get just for the colors of this current thread. And some use cases you can do with this. So you can obviously probe what information is there and you can implement your own printer. So there's a lot of information about the different names of the methods. And for this very simple example, actually they have all the same names, but sometimes Ruby has these notions of different names. So you can access all of them, you can access the objects that this was called on, you can access the class, a bunch of things. So you can use this, and then you can implement your own printer. That imprints a very nice stack trace. You can obviously use this to just get the pretty stack trace. So by default, Backtracing prints exactly as Ruby does, but if you call fancy to S, you get the one with the class names and a few other fancy things. And you can also call this weird Backtracing gem from C code, it has a bunch of APIs. And in particular, it has a special low overhead API for profilers and tools like that. So if you're interested in building something like that, you can use Backtracing to get the stacks and not have to care about. And actually one gem that's using Backtracing is this Ruby mem profiler that was created by KJ and I helped a bit as well. And so it's like an open source gem by Zendesk, which uses the Backtracing API to build a flame graph of memory so you can investigate memory usage and memory leaks and reduce the memory footprint of your application or even fix memory leaks. So we actually, me and KJ, we gave a talk at RubyKaigi about this thing called Hunting Production Memory Leaks with HIP sampling. So if you're curious, check that talk out. So some other use cases that we've been playing with on Backtracing. So you can actually access native function debug info. There's actually a lot to be said about how you get debug info from native libraries on Linux and different OSes and debug symbols and warf and whatever. I will not go into much into that because that's a nightmare. But I have a working prototype which actually you can see for each. You can see, okay, each belongs to Array. But you can also see it's implemented in this libruby.so object. You can see I'm using Ruby 3.1. And you can see that the C function name is rbarrayeach. And then in the future, we could even get more of the bug information, assuming it's still available and see the file name, the line number, etc. And allow you to smoothly go from Ruby code to C code as if, yeah. And theoretically, this native information doesn't have to just be C. So if you have a Ruby gem that is built in Rust and the Rust binding would have the correct debug information, you could go directly from your Stacktrace to, it's this Rust line. So it's really nice to, that's why I'm looking into having this information. Another idea that I have that I still haven't experimented, I haven't tried really hard to do it, which is, could we build a Backtrace Stacktrace for exception? So that when you have an exception in your app, you get the nicer objects which Backtrace provides you and you can get the full information. I haven't tried it yet, want to do it. So, just kind of a recap. What did I learn from all of this experimentation and playing? One thing is that the Ruby VM itself is very interesting and I would say surprisingly approachable. So my prior C experience was university projects and really, really tiny personal stuff. So I would not classify as a C developer ever. And I like everyone that goes to uni, I just kind of listed C in my CV because I did it at this one or two courses. But really, I was not a C developer and I still could follow along a lot of stuff. And especially if you go there and you add a printf and you start playing, changing the code a bit, it's like, you see things happening. It's really interesting. And also the power of having a working prototype to show off a crazy idea. And this had really two side effects that I was kind of hoping for, but didn't quite expect it would happen. One is that we actually at Datadog ended up using a similar approach for the Datadog Ruby Rufaller. And with Backtrace, I kind of proved to the team, I was like, yep, it works, I've got it working. This is one thing we could do if we wanted to. And the other thing is that the Ruby Core team also kind of liked the show class names in Stack Traces thing. And this kind of started an interesting discussion. And this leads us to the final item, which is class names in Backtraces coming soon in Ruby 3.4. Question mark. So actually in the Ruby issue tracker, this is now being discussed. This number 19117 include the method owner in Backtraces, not just the method name. This was opened by Jean-Bossier, had this proposal after we were discussing this at RubyKaigi. And then Mame implemented like it has a working prototype for this that it has a PR for Ruby. And actually if you just build it, it works. Like as kind of what we were saying, now you get this information of like the full class and you see that this is an instance method and you see like the dot on the class method. So we had this extra information for developers to just out of the box. Obviously this is still being discussed. So if you like this idea, you want to see this in Ruby 3.4 and use it in your app. Just try it out, go and leave feedback on this issue. And that was kind of it that I have had to tell you. So yeah, email if you want to talk to me, a knox on whatever they call the social network, my blog. I have a few other talks and here like yes, go get feedback because Ruby developers are actually calling for feedback in that ticket. And I actually, thanks to my employer, they developed for allowing me to work on these things. And I actually, if you're interested in coming work on the data about Ruby Jam, ping me because we are hiring right now for the Ruby Jam. And it's really different kind of Ruby that we do. Yeah, questions? Hello. Yeah. I think you mostly answered it but the class shown in the trace, Yeah. 3, 4, and 3, 4 and backtracing, it is the owner of the method. Like we can statically, because in J-Bruity the only way we get the piled backtrace is by cramming a bunch of data into the class name or the file name or whatever it's on the JVM's trace. Yeah. I can't make that dynamic. Once I set that stone to the method, it's going to stay that way. But if it's the method owner, then at the point where I can pile it, I can just throw that extra information in there and pull it out. I think that's right. Yeah, I believe the disimplementation is exactly the method owner. I think in backtrace, yeah, I experimented with having both, but it's much harder. And I think part of the discussion going on in the ticket is also, what about dynamically defined stuff and whatever? So I think the implementation is like, oh, when it gets, and I think, yeah. In some cases, it might not show, because it's kind of hard to get this information even in CRuby and expose it in a very efficient way. But in a lot of cases, it's like a regular method on a regular class and it gets it. So yeah. More of a product question instead of a technical one. Yes. You say it came from the, you wanted to have access to what was being called. Yeah. Is that something you personally, or is that something that was shared across the team and then something related to that? Essentially, I'm not working with anything compared to that. Yeah. Will I get something out of it myself? I have a small company. I think so. And then I really, so my other background other than Ruby is Java. And in a Java backtrace, you usually get the class and the method. And I've always found it easier to, in a lot of cases, easier to think about. Like, oh, this is the class and this isn't the method on my class. Then just like the method names. Obviously, in Ruby, if you have a very well-structured code base, you know that app flash foo flash dot rb is going to be foo bar. Like, you know. But sometimes code is not actually that simple. There are like, so near parts of the application. So that's the part where I feel like this kind of thing comes in handy. And I kind of missed it from Java. And I had worked with Java tools and I was thinking like, I want this thing from Java. Can I have it? The, actually, other thing I can add is that because for methods in the Ruby VM, right now, Ruby never shows you, like, where array is in the VM. It kind of, it kind of blames you. I can show it very quickly. If I go back to the way, way, way, way beginning, you can kind of see this here. So this thing, have you noticed that Ruby is lying there and there and there? Is Kiehl defined in line three? Is Collect defined in line three? Is it defined in line three? No. So when you have a C func, like a C API or native API being called from Ruby, Ruby lies and just basically decides, it's the caller. So that's the thing. And actually, at some point, I had to debug this really weird case where Ruby was calling inspect and I really didn't understand it. And I had a bunch of like, new inspect, new inspect, new inspect going into the VM. And I really didn't understand it. And I actually got out backtracing to just get that stack trace. And I understood that it was like this weird case when you have a no methods error on like some Ruby version, Ruby will actually call inspect on your objects. And in some cases, it will, after calling inspect, it will throw the inspect away. Which was like, I was like, why is, whatever. But sometimes, like it gives you a lot more viewing, a lot more context if you know exactly where the methods are getting called and the classes. So here you would see process skill, et cetera. So it's much clearer in my opinion. Yeah. Did you try to apply the same approach on heap dumps? Apparently, you're just inspecting the internal C structures. So at least theoretically, it should be possible to inspect the heap dumps. Yes. So like, I'm not, it's been a while since I've looked at the JSON output of heap dump. So I'm not sure if it has this information, but it could. And actually, even if it doesn't have it, I don't think actually you don't need to go as far as backtracing and accessing the internal stuff. Because you can do like objects.pac.each to implement your own heap dump. And when you do have xpac.each, you have access to the objects where things are defined. I was talking about the dump files. The JSON file, yeah. I mean, not the JSONs that you can get from. I mean like a crash, like a heap dump of a crash of the VM. Yeah, yeah, it could. It's the same thing. Like the structures are there. So you could do this. Like you could even do like a GDB script or whatever debugger script that accesses the same things and reads it. And actually just one thing, if you ever heard of the RbSpy profiler, which is like built by Julie Evans originally, like RbSpy is kind of doing the same from, but from the outside the process. It's like, it's a rough process that it's like reading Ruby memory, reading those things and then showing information. So I actually at some point tried to prototype this in RbSpy and then I just got bored and did something else. Yes. If we want to start looking into the C code of the VM, is there a documentation or somewhere we can start to not reading all of the code? Yes. There is. There is actually a really nice repository that I think is like there is like a, I think it was built by I'm going to say Koichi, like one of the core Ruby developers that have like a nice introduction to the VM. I don't know exactly the name of the repo, but like email me and I have that in my bookmarks and I will send it to you because it exists. Actually, it might, like let me quickly do something. Maybe there's a... A challenge. Yeah, it's that thing. Exactly. Ruby Act Challenge. And I think in the backtracy repo, there's actually some links at the bottom and it might be there because I included in the repository a bunch of links of interesting things I found to read this information. And so if you go to the GitHub repo, the bottom, it might be there, but yes, it's Ruby Act Challenge. So Google it, you probably find it. Thank you. Thanks, everyone. Thank you.
Deploy Your Next Ruby App with WebAssembly (Wasm): Smaller, Safer, Faster
Hello. Yeah, you can hear me? Oh, perfect. Okay, great. Very disconcerting. Okay, great. Last talk of the day, so we still have some people here. That's nice. Good. All right, so my name is Dan Phillips and today I'm going to talk about deploying Ruby on WebAssembly, smaller, safer, faster, and more universal. The faster parts should have an asterisk here, and I'll talk about that too, but we'll get into the details during the talk. A bit about me before we get started. I'm an engineer at an illustrious company called Loophole Labs, and I've got three of my co-workers right here, and we do a couple different things, but we like to call it primitives for software infrastructure, so we kind of focus on really specific pieces of cloud infrastructure specifically. I mostly focus on WebAssembly, server-side WebAssembly, and we also have a product that does live migrations of VMs, which my colleague Felicitas here demoed a couple months ago in Chicago. It's really awesome talk. You should check it out. That's the end of my pseudo-pitch for that. I work primarily on the scale function runtime. It's a scale plug-in framework that is based in WebAssembly. It is sort of a language agnostic way to build plug-ins, and I'll talk a bit about that sort of in conjunction with what I'm going to talk about with Ruby. And on the Internet, I'm d-underscore-filla, so on GitHub it's d-filla without the underscore, because they don't allow underscores in usernames, but on Twitter or X, whatever you want to call it, that's me, and I mostly tweet about stuff kind of like this. Also, I'm from Chicago. That's where I live, and I run the Wasm Chicago group, so if you're ever in town, you'd like to come by and hang out. We would love to have you. Also, all of our stuff is online too, so it's virtual if you'd like to check it out. Okay, so what is WebAssembly? Who here has done any work with WebAssembly on the server? Anybody? We've got two. We've got my colleagues and one person. Okay, great. Excellent. Okay, so we're going to just talk about the background of what Wasm is. Wasm, for short, you can say Wasm, Wasm. It doesn't really matter technically if you want to be pedantic. It might be Wasm because the precursor was Asm.js, which was a toolset for compiling C, C++ libraries to JavaScript in very performant ways, so Asm, Wasm, maybe. WebAssembly is a safe, portable, low-level code format designed for efficient execution and compact representation, which doesn't mean much yet. It's a safe sandbox execution environment. It is a deny-by-default sort of security pattern, and it makes no assumption about languages or hosts. So we're going to get more specific, a little bit less abstract. The best way to think about it is that it's a virtualized CPU. It's a new type of architecture. So it's a compilation target, just like when we compile things to x86 or R64, whatever. It is a virtualized instruction set architecture. That it utilizes a bytecode binary format, so what you get when you compile to Wasm is a Wasm binary, which is a binary, but it's not machine code yet. It's sort of an interstitial representation. It uses a stack machine model for this, but it's not actually a stack machine. It's just sort of virtualized in this way. You could say that it is if you use the interpreter, but you can also compile AOT. So what does that all mean? In a broad sense, it's really just another architecture. That's the way to think about it. The difference is that it is virtualized, so you need a runtime to translate it to machine code. Again, many of the runtimes, specifically those that came out first, that are in the browsers, all the major browsers, utilize both interpreted execution and ahead of time compilation. It's universal, meaning that anywhere where there's a spec-compliant runtime, you can run Wasm. That's the sort of biggest sell, right? So is it just a client-side technology? A couple months ago we were speaking at DevOps, and we would just talk about it. People were like, yeah, I don't really do front-end stuff that much. I was like, no, no, no, no. That's actually not what we're talking about. We're talking about using Wasm on the server. And why would you do that? This is from the spec itself, WebAssembly.org. It's a safe sandbox execution environment that makes no assumption about languages or hosts. Extremely important. The cold start times for Wasm binaries for a Wasm module are in the Nanos microsecond range. Even with many megabytes, you could still see microseconds start-up times, just like we have with Ruby, which I'll talk about. It's a universal compilation target. So there's this saying in the Wasm community that Wasm is neither web and it is neither assembly. That's a very important point. So I like to think of WebAssembly as kind of cloud infrastructure's penicillin moment. WebAssembly came out of a long tradition of trying to write more performant code on the web. If you just take V8, for example, or the JSVM, they've gone through many, many iterations to make that extremely performant. Actually, it's quite amazing what they've done. But even still, there wasn't quite a good way to have extremely performant code run in the browser. So WebAssembly came around after these sort of earlier iterations with ASM and some other things. And I like to think of it as sort of a penicillin moment. It was like people wanted to have a safe way to run code from any language, very performantly. And it must be safe because it's going to be like overnight on billions of users' devices. So what they decided to do is go in the same direction with a bytecode format and a VM. And then if you kind of think about it, and if you kind of squint, if you think about what the WebAssembly VM is, you could have traditional VMs, and then you could think about containers, and then possibly in the next iteration, we have WASM. So smaller, safer, faster, much more universal. And faster has an asterisk here, thankfully. I meant to put it on the first slide, but I forgot. Faster means, in this context, especially startup times. And then when we think about performance with WASM, it's typically described as near-native speed, if not native speed. There's a famous quote here from Solomon Hikes that said that if WASM had existed in 2008, they wouldn't have needed to make Docker. This got a lot of VCs, very excited in 2019. And there was kind of a hype cycle that got kicked off, probably when he tweeted this. And we're going to kind of talk through whether or not this has come to fruition and what the sort of next steps are in this space. Cool. So WASI. WASI is the WebAssembly system interface. Has anyone heard of WASI? Okay, good. A few people. Started in 2019, it was initially kind of like a POSIX interface for WebAssembly. WebAssembly is just a VM. And when we think of it, it's the dumbest VM because it's really just like a VM that's an architecture. So if you think about just machine code, it has no concept of syscalls or libcs or anything until you give it to it. It's the same thing with WASM. So WASM itself just executes very simple instructions. So if we want to run this stuff on the server, we need to think about a way that we can interface with this sort of underlying host. It utilizes a security model called capability-based security. If anyone here has heard of Plan 9, the operating system, it borrowed some of the approaches that were sort of explored there and implemented them. And it's an evolving standard, right? There's Preview 1, 2, 3. Preview 2 literally just got released last week. That also means that it's not implemented yet in all of the runtimes, but it's getting there. Preview 1 was kind of this base layer that kind of wrapped standard POSIX calls. Preview 2 introduces networking for the first time, so things like BSD sockets. And then Preview 3 is going to be very important for dealing with async. And that's going to be a very big advance. That will probably happen by next year or so. And then, after all those things are there, we'll probably get to 1.0. So it's been kind of a long journey with Wazzy. Is it required to run Wazum on the server? No, it is not. You don't need to use Wazzy. Oh, is that my time? Okay, okay. I was like, no. Yeah, so is it required? It's not required. You can run Pure WebAssembly on the server just like you do in a browser. So, I'm going to go through some techniques there that we can do with system requirements and the host. And I'll kind of talk through those as we get into the actual Ruby stuff. So, briefly, I'd like to just mention the project that I focus on at work. This is the scale, function runtime. It's a plug-in framework. This is a link to the actual page. You can also think of it as a serverless function runtime. It gives you a polyglot programming environment in the same runtime environment. So, this means that in the same runtime, you could run Rust code, Timescript code, and Go code, all talking to each other at native speeds, or maybe even faster than native in certain cases. We've done some interesting things where we took the, so if anyone does Golang in here at all, we took Go's Regex library, which has some performance issues, and we swapped it out using scale with Rust's Regex library. And we got a four-time, four-times improvement over native. So, running it in WebAssembly was four times faster than running it in native when we did this. And from the Go side, you had no idea that you were even using Rust. That's kind of the cool part of it. Right? This is written in Golang. Right now, we support Rust. Go, Timescript, the future holds Python and maybe some other things. Hopefully Ruby soon too. I don't know. I've got to convince my boss if that's going to happen. So, yeah. So, building Ruby. Many people in here probably built Ruby from source. And we're going to talk about building it to WebAssembly. Ruby has some assumptions. Right? One of the first assumptions is that you're going to run it on a Unix-like environment. Right? This is a big one. Right? And if you're compiling this to something that has no concept of Unix, this could be quite a challenge. When we, when people were first looking at this, they saw that one of the most developed tool chains for Wasm was with C and C++. And people said, oh, C Ruby. I can take C Ruby, compile it to Wasm. Voila. Turns out that was not quite the case. There was a lot of work to do to make that work. And we'll kind of talk through some of those steps. The other one is a file system. Wasm doesn't have a file system. You have to give it a file system. You have to either give it access to a file system. Wasi allows you to do this now pretty easily. But maybe you don't want to give it access to the underlying host file system in the browser. Obviously, there's no access to the file system there for very good reason. Dynamic linking. Again, Wasm has no concept of this like we have in the C world. So this is something that is also a challenge. And then obviously system calls and libc. There is no concept of that in a pure Wasm module. So that's something we also have to give it. Some of the specific pain points were exceptions. Exceptions in Ruby and the Ruby VM depend on set jump, long jump. This is something that had to be worked around. And this was quite challenging at first. Fibers. Wasm has no concept of context switching. There's no concept of kernel space or user space or anything like that. So this is something that had to be figured out. GC, you must be able to inspect the VM. This is also not easy to do because there are no locals that you can inspect in Wasm. So one of the ways that this was achieved was with a project called Asyncify. Asyncify is a project that allowed you to run WebAssembly code in the browser. And also do Async JavaScript at the same time. So when you run WebAssembly in a browser, you interface with it through JavaScript. But WebAssembly is single threaded right now. And the problem there is that we needed to be able to sort of pause the stack, copy things, resume the stack, rewind, unwind, and so on. So what the people in the Ruby world found out is that we could take that and use it to solve all of these problems, which was quite nice. If you want more detail on that, you should check out Yuda Saito's talk from Ruby Kagi last year. He is the main contributor to Ruby Wasm. And he is also a really, really great guy to talk to about any of this stuff. He's super, super good at this. So check out that talk for a lot more technical detail there. This talks about deploying, actually. And I will get there. Otherwise, I would go more into detail on those points. So I'd like to talk to you about how you can deploy a Ruby project today with Wasm. So I made this little project called Boxer. It's a side project of mine. I've been working on it in my spare time. And what this lets you do is it lets you take a container declaration and you can use it to run the WebAssembly code. It lets you take a container declaration like a Docker file and then it spits out a Wasm binary. Pretty simple. You can check it out. It's Boxer.dev. And I'll talk through this in the demo too. But there's quite a bit that has to go into this. But the point is, right now, if you want to do a Ruby app on WebAssembly, you have to bring in the runtime itself. You have to compile it. You have to set up the linking. You have to pack the actual files that you want to use into the WebAssembly itself. And a lot of this is quite complex. So this kind of does it all for you in a single step. So what's in one of these boxes? I like to call them Wasm boxes. I don't know. There's kind of a nice metaphor compared with a container. This is a box. It's smaller. You've got the base layer. What the base layer does is the base layer gives you basically imports and exports. So the Wasm module only knows about its imports and exports. That's all it gets from the outside world. And so if you want to set up an import into a Wasm module, you have to provide that on the host side or on the side of another module. This also gives you a virtual file system, virtualized syscode subs. There's a project called Wasix from single store labs. Not to be confused with Wasix from Wasm, which is another Wasm runtime. It's naming things as hard, I guess. So what this does is it gives you stubbed out syscalls and then the VFS is actually usable syscalls, POSIX-based FS syscalls. And I'll talk about that too. The compiled runtime gives you a compiled Ruby runtime. And then obviously it packs up the user source code. And the only thing, this is very important, is that the only thing that it knows about is what is inside it, is the Ruby VM and the source code and the imports and exports. So how is this done? Right? Interfaces, libc, a sandbox file system, Wiser. Wiser is a really cool project. If you're interested in WebAssembly, this allows you to pack up WebAssembly into a single binary. It allows you to deconstruct it. It allows you to analyze it. It's a really, really great project. I highly recommend checking it out if you're interested. This is a very simple project. It's just like a container does. This gives you the imports and exports. Here in the VFS we make a directory. We copy that using Wiser. This is a C example, but this a.out would normally be a binary. The big difference here is this must be a WebAssembly binary. That's the big difference. Set the working directory. That does that in the container in the box, rather, itself. And then executes it. And what this does is this also bundles up the runtime itself, passes in the correct arguments, and then executes the WASM module inside it. So, WASM VFS. This is another project of mine, and we worked on it for a while at Loophole. This project basically is pretty simple. It gives you the most standard POSIX SysCalls for file systems, all implemented virtually in WebAssembly. So, this is kind of cool because this means that you can just use most things that depend on a Unix-based file system, which is a lot of things. Specifically, tomorrow I'm giving a very similar talk to this in the Python room. Compiling Python, it really, really depends on the file system itself. So, you can check that out if you're interested. And go through this. Yep. Cool. And now I'm going to do the demo, which is why I'm sitting. Cool. All right. So, we've all worked with Docker, right? Mostly with Docker. So, we've all worked with Docker, right? Mostly probably everyone, every single person here. Okay, cool. Yeah, so this is a simple Docker file. You got from Ruby 3.0, setting the working directory, copying the source code to the to the current directory, which is user-source-app, and then running the actual code, right? And the source code is this. It's just a simple little script that does some square rooting and prints it out. So, by the way, please ignore my Rust warnings that I haven't fixed yet. This is written in Rust. But here's a little information about Boxer. And what I can do then is the... Basically, I wanted to keep a similar API to Docker, because we're all pretty familiar with it. We have BoxBuild passing the Docker file itself. Okay, you can see at the bottom. It found the base image, right? The base layer. It is building. Take several seconds. What that did is that bundled the run time, the standard library, the source code, and the FS all in the right place so that it can be executed into a single binary. So now it's one binary. Okay? And then if we do Box Run, there we have it. Ruby on WebAssembly. That's that. Okay? So pretty simple. Whoops. There's some big caveats. So one of the big ones is threads are not supported yet. Right? There, pthreads are not that are pretty typical used in like green threads, things with like Ruby and Python. This is not supported yet in WebAssembly. There's some ways to do it with asyncify. It's pretty challenging, but basically like with stack switching, you can do some fancy stuff. Networking, this is also a big one. Out of the box, this is not supported. So if you want to do something with a network, this is very challenging. So if you want to do this, and there are ways to virtualize layers of the kernels networking stack and then combine them with the sort of WASM module that you have and then pass them across to the host. And there are some techniques for this and this is actually kind of the way forward for that. Native dependencies, this is also a big one. But, native dependencies, there's a whole other benefit with using WebAssembly for native dependencies, which means you just have to compile it once. Right? So you can compile a native dependency once for all platforms and then anywhere that you have a WASM runtime, then you can use that native dependency. So, yeah, there's a lot of work going on here and this is sort of like the next steps in what need to be solved. But things like WASI Preview 2 and 3 solve things like networking and then threads will also be part of it too within the next year. So this is a little thing that I had ChatGP teammate. I got like a box and then a container. It's pretty fun. Now for some actual metrics. A container of the same exact code that I showed you using the same exact Docker file. A container could be anywhere from 80 to 900 megabytes depending on which container that you use and if you want to bundle the entire standard library. Right? Startup speed could be 800 milliseconds to 2 seconds. The cold start problem for serverless functions is a big problem that a lot of cloud providers have. They try to do a lot of trickery and then things can start up fast. This is a big problem because most of those use container... Security models are shared kernel. A container is not really an isolated environment. It kind of is, but it relies on a kernel and all the images share the same kernel. So that's something. In a box, the sizes of this one is 16 megabytes. Startup speeds are 100 microseconds to about 1 millisecond. And the security model is a virtualized sandbox that's built into the virtualized sandbox. So it's a very simple, and the security model is a virtualized sandbox machine code execution. So the future. Full support for libc and syscall interfaces. Right? This is something that we have worked on a little bit and have done some cool stuff at scale with... Or at loophole on scale, we have this thing called extensions, which allows you to generate host functions that can then act as the underlying layer that you wouldn't have in the WebAssembly VM. So we did a talk in Seattle last year where we talked about how we stubbed out the GVisor container runtime and took the syscalls and provided them with WebAssembly using that technique. So you had WebAssembly-based syscalls on the other side of what we would consider a container isolated environment. The other next step is to modularize kernel stacks, which is kind of terrifying. But this is kind of cool because if you could have a plug-able networking stack, you could do things like run things that require Unix in places that don't actually have a Unix operating system. You could use the WasmVFS. And then when needed, you could write shims and put modules wherever you want them to be. So this kind of creates a cool thing. We got a paradigm shift, a kernel-free, composable, universal, Wasm-based operating environment. There are some people trying this. Wazzy is kind of getting there. Wally, which is the names here, I know. Wally is the WebAssembly Linux interface. This was just released by group at Carnegie Mellon at the end of last year. Incredible stuff. They basically take that approach. They virtualize everything. Signals, threads, file systems, and then they create host functions which match those things for the top, like, 150 Unix-based syscalls and make those available. So you can run stuff like this out of the box. It was a little bit rough. I was going to try to incorporate that in this talk, but I didn't have time. Bare metal runtimes with a unicernel. This is a big one, too. You take a unicernel, give you just the syscalls that the Wasm runtime needs, and then run them. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. You can run them in a virtual machine. seems true. Yeah. So is it worth it to port 0B to pure wasm or to the POSIX interfaces? Yes, that's a great question. So right now, so there's a couple ways to answer this, but the capability based security is a big part of Wazzy specifically. And this works in a way that basically you exchange handles and you exchange permission. So compared with the UNIX way of having user group space or whatever namespaces, things like that, this is actually based on each capability. Wasm itself has no concept of that. Wasm itself just knows to execute the code in the module and it can't know about anything else outside of it. So Wazzy kind of tax that on. So this technique kind of takes that approach where your second approach, where basically it stubs out all of the sort of UNIX and POSIX capabilities and puts it all in Wasm, right? So Ruby thinks it's in UNIX, but it's not, right? In this example, there's a benefit to that too though, is when I mentioned about modularizing the kernel libraries. I'm dead serious about that because what that could do is allow for a greater sense of isolation and security and resource sort of alignment. Because when you run certain applications, like we know with containers, kind of introduce this, you don't need everything necessarily, right? So if you could only have what you need and each of those components could be completely isolated, memory safe, and they could be reused across multiple different things just like an operating system does, but faster and safer. There's a pretty exciting future there. And I think that you'll see certain people, while he kind of takes that approach, you'll see certain people pushing that more. I'm definitely a proponent of that, but there's some disagreements about whether it's the best thing, but yeah, absolutely. Cool, I've kept you guys here a lot. Anyone else? Feel free to talk to me after too. Cool, thanks again.
SemVer in the Rust ecosystem: breakage, tooling, and edge cases
So, thank you all for coming to the Rust Dev Room 2024. We have a really, really good lineup of talks and we're going to start off with one of the strongest ones I think is, Bredrak Gurevsky is going to talk about Semvern Rust and how to make sure that your stuff is not breaking other people's stuff. Thanks everyone. Yeah, let's talk about the breakage, tooling and edge cases of semantic versioning in Rust. My name is Bredrak Gurevsky, some of you might recognize me as the maintainer of cargo semper checks. This is a linter for semantic versioning that this talk is about. But before we get to talking about the linter, let's get familiarized with what semantic versioning actually is and does. Ultimately semantic versioning is about communication. It's a way for library maintainers to let their users know what kind of changes to expect in new releases of libraries. If the changes are major and potentially requiring action on the part of the user of the library, we say that's a major change, we bump the major version number and that lets users know to expect that they might need to do a little bit of work to adopt it. Otherwise, if the library remains compatible with the previous version, that's not a major change and users expect to be able to just update to that new version automatically. So in this way, semper is communication not just between individual maintainers and the users of the libraries, but also between maintainers and the tooling that those users use. Let me give you a concrete example. This is a bit of automation that I have set up in many of my projects. Once a week, a job just runs cargo update on my project, opens a pull request and automatically merges it if tests pass. Now this only works so long as every one of these packages correctly adheres to semantic version, right? Cargo update will bump everything to the latest non-major version that it can find and hopefully tests pass and everything gets merged. But if there is an accidental breaking change and one of these pull requests, then everything sucks and we're back to square one, right? In this talk, I'm going to try to convince you of two major things. The first one is that semantic versioning in practice is so dang hard that no mere mortals can uphold it. None of us in this room are good enough to do this on a consistent day-in-day-out basis. I'm going to show you that the rules of semantic versioning are much more complex than what it seems like at first. I'm going to show you that even the rules that seem simple have a ton of non-obvious edge cases that we have to deal with. And I will show you empirical evidence based on real-world data that this is not just a skill issue. It's not something that can be solved with more experience or with harder work or just caring more about your project senior users. And then I'd like to show you that computers are actually really good at semantic versioning. We can use linter's like cargo semper checks to address almost all of the problems we're going to run into as part of this talk. And if you're so inclined, I'll even show you how cargo semper checks works under the hood so that you can contribute to it if you see fit for the benefit of all of us here in this room and in the broader Rust community. Let's dig right in. Let's talk about why semantic versioning is so hard in Rust. It's used within cargo's own publishing process itself and it's used by large teams like Amazon's AWS and Google open source projects to make sure that their own semper, that their own releases adhere to semper correctly. The way it's designed to be used is by running cargo semper checks right before you publish a new version of your crate. When you do this, cargo semper checks will detect the kind of version bump that you're making, whether it's major or minor, and then it will scan for API changes that might be inappropriate for that bump and let you know what it finds. The way to get cargo semper checks is by running a regular cargo install and if you're in a CI environment, we have a pre-built GitHub action that will do everything for you. And since some of us prefer to use release managers instead of just running cargo publish by ourselves, it's also integrated in some of the release managers. I particularly like release please, which will automatically run cargo semper checks as part of the release process. So if you're on the market for a good release manager, you should check this one out. It's awesome. I want to show you a couple of particular examples of how cargo semper checks finds these issues that might not be particularly obvious to the naked eye. The first example is a public function being deleted. Here we have a crate that had this public function called add and this pull request is just coming in and deleting that public function. This is pretty obviously a breaking change, right? And if we run cargo semper checks, it will tell us as much as well. It will say this function is missing, it cannot be imported by its prior path, and it will point out that the problematic function is the add function in this crate at that specific line in that file. This is great, but you might say, okay, we would have caught this by eye, right? This is pretty obvious. I don't need a tool here. Well, as it turns out, deletions of public items are not always a major breaking change, right? One way in which that can be the case is if we have that public function inside a private module. Yes, the function is public, but it's just not reachable. There's no way to import it, and so since nothing outside its own crate can use it, deleting that function is not a breaking change, right? There's no possible way that anyone could be affected by that. Another more interesting example is when we have a public module, but that module is marked doc hidden, or if the function itself is marked doc hidden. If you're not familiar with the doc hidden attribute, it's a way to mark a piece of your public surface area as not being part of your public API. It's explicitly saying these are internal implementation details that are made visible for a reason other than being public API. These most often happen when crates have macros where they need to expose some functionality that is only intended for use by those macros. Remember that macros get expanded in the downstream crates, and so they have to be able to access everything that is public in your own crate. We don't want to maintain the internal implementation details of macros as public API, so we usually mark that functionality as doc hidden. But it's not actually enough to say, oh, this module is doc hidden, therefore that function is not public API. Here's the opposite example. We have a public module that's doc hidden and a public function inside it, and that public function is now public API. Why? Because it's re-exported and the re-export is not doc hidden. So users of this crate could have imported this function without ever relying on any doc hidden functionality. So even though the module where the function is defined is doc hidden, this function is still public API. These roles are pretty complicated, right? It's not at all unreasonable that someone might mess this up, and in fact we found hundreds of issues like this when we scanned the top 1000 rust crates on crates I.O. Cargo-Sanver checks will catch all of these cases correctly, so it's just a lot easier to use the tooling instead of having to rack our brains when facing a PR that we have to review. So clearly deletions of public items are not always a major breaking change. Let's dig into a second example. Here we have a public struct foo that has some fields, and in this pull request we're adding a new field. And the author of this pull request was quite careful. They noticed that the foo struct has a constructor called new, and they made sure to not change the function signature of that constructor. Instead, they initialized the new fields to default value, and this seems entirely reasonable, right? This is a pull request that many of us would probably merge when facing it. The falsehood here is that adding fields to a struct can only be breaking, it can only be a breaking change via changes to methods. This is not true, right? The problem here is very, very non-obvious, especially to someone who came to rust from a different programming language first, like me. The issue is that this struct is not marked non-exhaustive, and all of its prior fields were public. If both of these things are true, this struct could be constructed with a struct literal. So users don't actually have to use this new method, they can just construct it directly by specifying all of the fields individually. So if a user in a downstream crate wrote something like, let value equals foo curly brace and then listed out all of the fields, they are now broken if we add this new field. They never specified a value for third, and so therefore this code no longer compiles. This is something that could very, very easily sneak up on us in a pull request whether we opened it or reviewing it, especially early in the morning when we're undercaffeinated. Right? And again, it's easier to just let cargo semper checks do the heavy lifting here. Running cargo semper checks again points out the problem. It says that a struct that could be made with a literal has a new public field, so existing struct literals must be updated, that's a breaking change, and it even points out that foo third is the field that is problematic in this case. So adding fields to a struct can sometimes be a breaking change even if we took great care to make sure that all of the methods and everything else around the struct has not changed. That's another falsehood we can cross off of our list. Let's jump into a third example, and this one's probably my favorite. Here we have a private struct foo, so not public, but private, and we're just changing some internal implementation details of that struct. Right? It used to take this static string reference, and we now want to support non-static strings. The struct is cloned, so we want to preserve cheap cloning, and we're going to use a ref-counted string to make that happen. Right? This is fine. I would probably accept this, you know, especially in the morning undercaffeinated. The falsehood here is that, you know, if I didn't touch it, I didn't break it. Right? I never touched public API. All of this code that I just showed you was private, and so I couldn't have broken the public API here. Right? Unfortunately, this is not true, and if we run Cargo-Semford-Checks, it will point out the problem. It says a public type has stopped implementing one or more auto traits. Type bar is no longer send and is no longer sync. And you might be thinking, second, what type bar? We were touching struct foo. There is no type bar here. Right? I was code reviewing, I read all of the changes, and there's no problem. So why is Cargo-Semford-Checks complaining about something that I didn't touch? Right? This must be a false positive. Here, our tools are doing us a disservice. The problem here is not the user's change. The problem is that the change affects something that is not shown in the pull request. Right? So I didn't touch it, I didn't break it, happens to be false, because type bar isn't here. Right? You have to click this button if it happens to be in the same file in order to be able to see the problem. The problem is that struct bar is public, and its implemented traits are there for public. And bar internally contains a foo. Now, an auto trait in Rust is a trait that the compiler automatically implements for us whenever possible. The rule is that a type implements an auto trait if all of its constituents also implement that same trait. Constituents being all of the fields, all of the variants, all of the data that that type might contain. Right? So send and sync are these auto traits in Rust that are used to express whether types are safe to be shared across threads. And the problem that we run into here is that the static string that we used to have inside foo was both send and sync at the same time. This reference counted string is neither send nor sync. Now, since the fields value over here is no longer send nor sync, that means that struct foo is no longer send nor sync, that means that struct bar is no longer send nor sync. And that breaks our public API. Because users that might have been using struct bar in some sort of context that relies on parallelism where that bar is shared across threads or passed between threads, their code is now broken. So they will see an error that looks like this. Rust C will say, RC of string cannot be shared between threads safely. Use parallelism requires that that value is sync and RC string is not because within bar, within foo, that field does not implement send and sync. This is something that has been, it's really a question of time until it bites any given project. This is just kind of impossible to catch because the problem is just not on the pull request page. And so cargo Stanford checks is just much better at finding these things than we humans are because it's looking at the data that the compiler emits. It's not just looking at the limited pull request review screen that we see on GitHub. So if I didn't touch it, I didn't break it is another falsehood that we get to cross off of our list. Hopefully by this point I've convinced you that cargo Stanford checks has some value and that it's likely to catch some stuff that we would otherwise miss and that we would find out when someone opens an issue on our project. Now that you've seen some of the issues that it can flag, let's talk about how this works and why you should trust what it can find. And in order to do this I want to come back to this example of deleting a public function. And I want to show you specifically how this works under the hood. We said that deleting a public function is a breaking change, a major breaking change if all of the following is true. In the original version the function was public. Another crate could have imported that function and used it. That import did not rely on any doc hidden items either on containing modules or on the item itself. And if we tried to use that same import in the newly released version that will no longer work for any of these reasons. Either the function is no longer public or it's no longer importable or it's no longer public API. In any case that's a major breaking change. And if we're looking for all of these breaking changes of this kind it's as easy as saying find all functions such that all of these things are true. Now if you're thinking what I'm thinking this might sound an awful lot like a database query, right? Select all functions where you know. And to help us see this I want to show you a diagram. We're looking at a version pair, right? We have the old version on the left and the new version on the right. And we're going to be looking for public functions that are importable and public API. And we're going to try to match them to versions in the new crates that have the same public function and the same importable path as the function that we were just looking at. And we found a breaking change if the count of such matching functions and importable paths is zero. If we don't find any of them, right? So the function could be imported and used in the past and can no longer be imported and used in the current version. This is a breaking change. And lo and behold, that's exactly how this works under the hood. Here's a database query that does the same thing. Now the point of this talk is not the query language, so I'm not going to dig too deep into the syntax or the semantics here, but I just want to show you that this is the exact same thing that we were just talking about. So we're looking at the baseline at the original version of the crate. We're going to be finding all functions that were public that could be imported by another crate and that were public API. And of course we're outputting stuff for later in case we find a breaking change. And we're going to say the same import does not exist. We will count how many matching functions at the same import path we find in the new version and we will say that count is zero. This is pretty nice, right? We get to write a piece of business logic that is completely ignorant of anything else about how we get this information. We just wrote down what the rule is in English and then we wrote down an equivalent database query that implements that rule and we just called it a day. This is pretty nice. I really like this personally. And if you're interested in architecture diagram, this is roughly what it looks like. Carbis and Verchex sits on top. On the bottom are our data sources. We get information from a tool called Rust doc which comes built in part of the Rust tool chain. We can ask Rust doc to generate a machine readable JSON representation of the crates API and we will read that JSON with Carbis and Verchex. Now Rust doc JSON's format is not stable. It changes relatively frequently more or less on every, if not every, than every other Rust release. This obviously can cause some issues and it has been the source of much frustration and consternation with other folks that have been building Rust doc based tooling. Carbis and Verchex has actually managed to solve this problem. Carbis and Verchex is not the first attempt at a Sembrer linter but it's the first one that managed to be isolated from changes in the underlying Rust doc format. Instead of requiring that you use a specific nightly, we're actually able to support multiple stable Rust versions concurrently. It doesn't matter which Rust doc JSON format version we get, they should work fine so long as they're reasonably recent. And the way this works is we rely on a query engine called Trustfall to sit in between. Carbis and Verchex runs queries in this Trustfall language syntax that I showed you a couple of slides ago. And Trustfall figures out which Rust doc JSON format version it's looking at and uses a little shim, little adapter, a little Rust piece of code that is able to translate that JSON format into something that adheres to the Trustfall schema that Carbis and Verchex is used to. That schema is written in a fairly high level. It talks about, you know, Rust functions and modules and importable path and, you know, whether things are public or private and does not say this value is in a field named such and such and it's an object containing the following fields and so on. So it's very unlikely to be broken by format changes in Rust doc because Rust today, tomorrow and next week is still going to have functions, modules, structs, fields and so on, right? All of that stuff doesn't really change very much. So in practice that means that we get to encapsulate all of the format specific logic in these adapters and nothing outside of this big ellipse at the bottom knows anything about how the data is represented and what format it came in. The query engine on top figures out how to most efficiently run these queries that we're running and that just leaves Carbis and Verchex writing business logic in this query language. Carbis and Verchex only cares about this is the semver logic that we're interested in implementing and everything else happens behind the scenes at a lower level in this diagram. I just want to give you a little bit of a taste of what Trustful is in case this seems interesting to you. It's a project that I also started. It allows us to represent data as a graph and query any kind of data sources. So this is not something that's specific to Rust doc at all. It's heavily battle tested. It's been in production for more than seven years. It's written in Rust. It's open source and it allows adapters to be written in a variety of programming languages like Rust, Python, JavaScript, WebAssembly and so on. And when I say it can turn everything into a database, I really mean it. If you have any kind of data source, be it an API, a database, an arbitrary file format, a machine learning model, you can query it with Trustful and you can do so in place and without having to do an ETL step in advance to ingest the data and then represent it in some other format. If you're interested in digging more into Trustful, I've given a couple of talks on that specifically. I gave a talk called How to query almost everything that's a deep dive into Trustful in particular and how it works. And I also gave a performance oriented talk talking about how cargo semper checks became more than 2,000 times faster by using some new optimization opportunities that Trustful exposed at P99 last year. And if you're interested in playing with Trustful yourself, we have a couple of playgrounds that you can check out on your laptop right now. We have a playground that uses Rustdoc JSON that uses the same exact code that powers cargo semper checks that lets you query popular Rust crates APIs and you can find all sorts of interesting things about them. And just to show that you can query any kind of other data set as well, you can query the hack and use rest APIs as well with Trustful queries from your browser. And just for kicks, because Rust is awesome like that, in these playgrounds we compile the entire Rust, we compile the entire Trustful engine to WebAssembly so all of the queries run client side in your browser. So really go crazy with these queries, I don't care, it's your bandwidth and your CPU, right? So if you get rate limited by hacker news, it's your problem, not mine, please go ham. Fundamentally, Trustful is what makes cargo semper checks possible. We need, there are hundreds of ways to break semantic versioning rules in Rust. And if we had to rewrite every one of our lints, whenever the format under the hood changed, this would be completely infeasible. By being able to decouple the format specific logic from the query logic, the business logic of linting semper, we can focus on linting and ergonomics and cargo semper checks and deal with everything else under the hood. We can take an n times m problem of n lints and m formats and turn it into an n plus m problem, which is much, much more maintainable, especially as a free open source project. So cargo semper checks on the back of Trustful has been growing fairly rapidly. We currently have 58 lints and almost every new release comes with a few more. This is twice as many as a year ago and still growing quite fast. We have 32 contributors and in fact, many of the new lints that we keep adding are first time contributions, which is awesome because it means that this query language is not something that is super niche and difficult to learn and is actually friendly to new folks. And most importantly, our users love us. Everybody prefers to find out about accidentally breaking changes before they get pushed to production and get released and then somebody opens an issue saying, sorry, you broke my project. So hopefully by this point, I've convinced you that semantic versioning is valuable, but it's impossible without automated help. And that cargo semper checks is a solution to this problem that has lots of happy users. So if you take nothing else away from this talk, please consider using cargo semper checks if you maintain Rust code because all of us will be better off. And if you'd like to help, you can contribute code and lints to cargo semper checks. Even though we have 58 lints right now, there are still dozens and hundreds more breaking changes that we still need to lint for. We could really use some sponsorships free and open source projects live and die by GitHub sponsors. So if you or your company use cargo semper checks, please consider sponsoring our development. And finally, for the sake of everyone in the Rust community, please try to not push out breaking changes. Nobody will blame you for it, but it's a lot more fun if you find them before you release the crate as opposed to after you release the crate. So please check out cargo semper checks. Please find me in the hallway if you'd like to chat more. And thank you so much for your time. So I think, do we have time for questions? Yeah, how long? Five minutes. So let's open it up. I'm going to give you the mic so that people on the industry can also hear. Awesome, thank you. So one of the things that I know about semantic versioning is that version zero is, it's an interesting one. And you didn't talk about it at all, but it also notes on the page that there was a lot of version zero. So I was wondering about your opinion on it. Yeah, that's a good question. The question was, is version zero special in Rust? The semper specification and the Rust community have diverged on what version zero means, essentially. So the semantic versioning is about communication, right? So it's about the norms that are accepted in the community rather than a fixed and rigid set of rules. And in the Rust community, we've decided that leftmost, you know, any zeros on the left-hand side of the version kind of don't count. So version zero dot five to version zero dot six counts as a major change, right? And zero dot zero dot one to zero dot zero dot two is also a major change, right? This in practice is what keeps all of us sane because otherwise if zero dot x to any zero dot y could ship any breaking change, then all projects would always, you know, stay on zero dot x and then cargo update would still not work and not be able to bump us. So this is from a point of pragmatism for the sake of the community as opposed to some rigid system of rules. Thanks. Like you said, some changes can be intentionally breaking semper. Is there a way to annotate them so the tool knows or do you have to bypass the tool in these cases? Great question. So the question is, since some changes might be intentionally breaking semper, is there a good way to annotate them so that users can notice them in a way that is not going to break their CI? Unfortunately, there's not a lot of great tooling here. Obviously, we have things like change logs. Authors will usually post on, you know, their social media pages and things like that. They will try to get the word out. It's very rare for a maintainer to deem something so critical that it justifies an intentional semper violation and yet just kind of tell nobody about it. But we don't have great tooling that will say, hey, by the way, this is intentionally breaking because of reasons x, y, and z that you should read up on. It would be lovely if we could sort of mark the item so that when it gets used then it causes a breaking change. We got a custom error message printed out by Rust C that says this is why this is happening and this is how you go about fixing it. Unfortunately, we're not there yet. And the answer is more code needs to be written and more financial support needs to go into all of these projects for that to happen. Is there any work going on integrating these tooling into packaging like Debian or which suffer from these problems once in a while? Yeah, great question. The question was whether there's any work ongoing to integrate something like Cargo-Sanverchex into the broader packaging tools that we already use on a daily basis. The answer is yes. So I've been in close contact with the cargo team. They actually reached out and asked if it would be feasible to work toward integrating Cargo-Sanverchex into cargo itself so that instead of running Cargo-Sanverchex, then Cargo publish, you just run Cargo publish and Cargo tells you whether, you know, what it found. This is obviously a little bit tricky. It's not super straightforward for a couple of reasons. One is that when things get merged into cargo, they're stable and they're stable forever. So we want to make sure that the APIs that we expose are really good and that the right API is not just for now and for next year, but for the next 10, 20, 50 years. The second thing is that we want to make sure that users can always override what Cargo-Sanverchex has found, right? Because there are these cases where an intentional Semver violation is justified. We want to have a workflow that's kind of like Cargo publish dash dash allow dirty, where Cargo publish will normally not allow you to do that, but there is a way to override and say, I know what I'm doing, I've thought about it, and this is still the right thing to do. So we really want to make sure that we dial in the exact user experience that is the right thing for everyone in the ecosystem and that we can support in the long run before we go about integrating it. But long story short, yes, the work on this is ongoing and again, it's a function of how quickly we can get this work done in order to make it happen. Okay, so last one. Okay, so let me give you the mic. So I'm interested in trustful. Do you know if there are other applications such as I'm thinking about validating breaking changes in open API definitions, for example? Yes, great question. I would love to chat. The question was if trustful has other applications besides Cargo-Sanverchex, the answer is yes. This is something I'm very interested in chatting about. So if anyone else has this question, please find me in the hallway and I can show you some more demos. A few other linters are looking into trustful for designing custom lints. I'm also working on a Python semantic versioning linter. Python is a very interesting beast because it's much more dynamic, so Semver is pretty tricky there. And my former employer actually uses trustful to enforce code standards that are not just correctness, but they're just about best practices that the company has decided are supposed to happen. In fact, one of the talks that I put on the slide, this how to query almost everything has a specific example of linting Python applications that get deployed and looking for mismatches in the Python version declared in the project manifest in a pyproject.toml file versus the Kubernetes configuration in the Docker file that goes with it, which also says, you know, from Python 3 colon 8 or whatever. It turns out that we can query for what does the Docker think the Python version is, what does the manifest think the Python version is and find cases where they don't match. And spoiler alert, I mean, we found hundreds of these issues when we rolled out those tools. These things just happen. Great. Thank you very much. Thank you so much for having me.
Writing your own Rust linter
We're going to have your attention. We'd like to begin with the next talk. We have Guillaume. He's going to explain to us how to write your own Rust Linter, as you can see on the lovely slides. And if you talk, Luca, have we got the audio as unmuted and everything? Perfect. Wonderful. Okay. Take it away. Hi, everyone. I will try to speak loud so everyone can hear. So like he mentioned today, I will explain to you how to write your own Rust Linter. So first little presentation. I'm Guillaume Goumez. If you come every year, I give a talk, so now you should more or less remember me, I think. I'm a member of a few teams of Rust projects and I'm an engineer at Huawei. So first, let's explain what a Linter is in case some people don't know yet what it is. A Linter is a tool that generally is an addition to a compiler of a language. And here in Rust, I suppose everyone heard about Clippy. At least I hope so. The goal is to detect some very basic logic errors to suggest improvements for any method you might use, anything you could use better. The goal is to make your code better in short. So now how is a Rust Linter actually working? We are directly entering into the subject. So let's say it's an extension of the Rust compiler. The Rust compiler has an API, very unstable. So very frequently we have to update the Linter to be able to still work with the Rust compiler. And that's exactly how Clippy works. So when Clippy is running, it's actually running a lot of parts of the compiler to get parts like AST for people who don't know what AST is, it's a token representing your code. So if you have the struct keyword, it's a keyword and it's a struct. So that allows you to have information higher than. But it's not only that because if you only had the AST information, you could only make suggestions like, yeah, you use this generics but not in a good way so you can do it like that, et cetera. So the goal is to go beyond that and to get access to more information like a borrower checker and everything. So if you have a trait you're using but you could use another trait which does the same thing but shorter, we can now suggest it because we have this information from the compiler. But because of that, we have to update the Linter often or never update which version of the compiler we are using. So why does it need to be a rest compiler extension? It's quite simple to explain. So unless you want to reimplement all the parsing, the borrower checking and pretty much everything, well, better use what is already existing and ask them likely to make their API public so you can use them. And that's exactly how things went with Clippy and that's exactly how I went as well. So I mentioned a few limitations already. So it can only work on crates compiled with the same Rusty version. You don't see it with Clippy because it's tied with your compiler. When you un-style a Clippy, it's tied with your current compiler version. So it just works but it's something to keep in mind you will see later. Like I mentioned, the Rusty API is not stable so very often you have to update your Linter code to be able to keep up. It's tied to a specific Rusty version and I'm not talking about a stable release but literally to commit version which is a bit annoying. And also because of everything, it's annoying to wrap in a cargo command because you need to use a very specific Rusty version. Again, we'll come back to that later. So I will voluntarily don't mention all lint passes. I will only speak of the two main ones, the early and the late passes. The early passes give you access to AST. So you are able to see the syntax and the work a bit on it but you don't have type information or everything. You can only just know that this is a struct and its name is and it has generics but you don't know what traits it implements or anything. You just have very basic information and you have the late pass which in this case goes a lot further. You have access to Borrowchecker information. You have access to everything. What does this type is implementing? Does it implement this trait? What is its layout? Everything. So in this case, we will talk about how to write a linter but with Rusty tools. The goal of this trait is to wrap the Rusty API into something easier to set up because there is a lot of things to set up. And to add it, it's just that. Like you would add any other trait. For now, the version 0.3 later on it will be updated. And now we start to enter into the fun. So actually to make it work, you need to add this little line in your Kaggle file to tell okay it's a trait but not any trait. It's a Rusty trait. So you need to do some very funny things. And we'll come back to this one but it's some things that you thought were something that we had for years like having to write an extant trait to import a trait. As back you actually need to import a trait from the compiler with extant trait. Otherwise it doesn't work. It's not provided by default. The other thing is we need to create a Rust toolchain file. It's literally its name. If you never use it, if you have a Rust toolchain file in your folder, cargo will only use the version provided inside this file. So in this case, the version of the compiler we're using. This is all in the documentation of Rusty tools. Basically you just need to copy and paste the file into your local file. So in here we say that as components we want the Rusty div which means the traits from the compiler. We want Rust FMT because we are not savages. We want to actually format our code. And the LLVM tool, the preview is to be able to actually compile. Otherwise you don't have a backend which is also problematic. And now let's get into the code. To declare a lint it's mostly macro. As you can see on top we use internal Rusty traits. So lint and session. Lint provides some types linked to handling lints. And session allows us to give information to the Rust compiler about things we want it to run. So in here we create with the declare tool lint macro a lint called warn generics. In MAG. In capital letters. I can't do that. It's warn by default. And we add a message when it will, in case you want information about it, it says warns if any item has generics. It's an early lint pass. So it means we only have access to the AST information. I voluntarily picked this one because to be honest the code is much, much shorter and simpler and for 15 minute talks it will be better. The other thing we need to do is to implement some very basic traits provided by the compiler which we don't need to care about. So they provide a macro for that. So declare lint pass which is in our case allowing you to declare a structure called warn generics. And we link it to the warn generics lint. And after that at the end we have the very empty implementation of the early lint pass trait for our type. This visitor trait, if some don't know the visitor, how to say, pattern, visitor pattern, let's say. The visitor pattern allows you to have access to, literally, you implement whatever you need to have and then, for example, visit function. Whenever the visitor will encounter a function, it will call this method and it will be ours. If we don't care about the rest, they are already implemented. We don't need to care about them. So very convenient. In our case, we only want items that could have generics, so very likely functions and n-us and everything like that. So it will be pretty easy normally. So now we implement the lint. So as I was saying, check item. We don't have anything else to do. It provides a context, the context of the compiler at this stage, an early context, and we have the actual item. And then it's pretty simple. We have methods provided by the compiler and everything. So we check if our, I hope everyone knows the syntax of first, but we check that we have generics. We check that with some generics. We check that the generics are not empty because otherwise there is no point. If we have generics and everything, then we will say, okay, we found generics. We don't want generics because, because, and let's then emit our lint. So first, the lint name. Second, the span. The span is how the rest compiler map your size to your actual source code. It's basically to your size beginning and an end. And you don't have to care about what it's pointing to. You just say, okay, the type I want to lint about is starting here, ending here. You underline you do whatever you do and I don't care. And we have our message saying, no generics here because we don't want generics. And the last thing is in case you wanted to add more information, like for example, we could say, the help and we could add a help message and we can do that a lot more. In case some of you don't know what it is, the syntax with the straight bar is a closure. So a closure taking a diagnostic type argument. Now, the interesting part is now how can we run this lint? And as you can see, not much code because RustyTools is doing pretty much everything. So first, we get the cargo args because it's a cargo command. We will run the cargo tools. We don't want the two first arguments because cargo and tools are not something that we are interested into. We pass the rest of the arguments, if any, into the RustyTools cargo integration command, which we internally call cargo, build everything with its own version because it's not necessarily the case. And once everything is built, it will generate the command line that you actually need to pass the Rusty compiler to be able to run our linter, which we do with WistLint. So this time, args is what cargo provided us so we can now generate and run our lint. So we just give it access because it's already done by RustyTools. And inside this WistLint, we need to actually say to the compiler, OK, I created a lint. It's called not an void call. I did badly. It's a one generics. And that's it. We have everything. We can now live. And the compiler will do everything when living as a WistLint function. So now it's always nicer to be able to run a cargo tool. So you just run a cargo install dash dash pass if it's local, otherwise not. And I named it in this case tools inner. You will understand why later. So we just run it. And it doesn't work because we are not using the same version of the compiler. Congrats. So in this case, what's important to note is that you actually very much need to use the same version of metadata as the files generated by the compiler to be able to use them with the lint. Rusty doesn't understand itself if it's not exactly the same. Like if it's just a commit difference, no, I don't know him. Don't care. No problem. So now we can actually go around this limitation by providing the version like this. So if we do, I thought I had an error output. So if we do, we actually have the tools running. But to be fair, we can't really ask our user to do that themselves. It's pretty bad user experience. So we will go around that and do this very long file as you can see, which for this case will be called the cargo tools. And this one will literally run this command that we saw here itself. And that's it. It does just that. We just wrap our linter and it's just running. So now we install it. We run it. And again, I don't have output. It's very shaming. And believe me, it works. So yeah. I voluntarily, like I said, didn't show a late linter pass to have access to the compiler and everything. But I wrote a blog post explaining that much more in depth. Inside it, you have an example with an unwrap if I remember, saying, yeah, don't use unwraps use something else. And you see how we actually get the real type information because when you call unwrap, you need to check that unwrap is actually called on the result on an option. But for that, you need to actually get the type check information because if it's, for example, self with a capital letter double colon unwrap and then you pass your type, you actually need to infer the type. And for that, you need type check information. You will see a lot of things that are seen very easy but are quite not so easy. For example, if you want to have, I don't know, which type this implementation is being implemented on, funnily enough, it's quite difficult. You can have the trade very easily but the type it's being implemented on, not so much. And thank you for your attention. More information on my blog and you have my email and social media and everything. And thank you for your attention. So we have about two minutes for questions if anyone has them. Yes, come right to the back. Hello, thanks for this presentation. No, don't share at all. Okay. Hello, again, thanks for this presentation. A few years ago, I wrote a refinement type system for Rust as a linter. I had the courage to maintain it for about one or two versions of Rust. A few months ago, I tried to pick it up again and everything was broken, bit rotten, two tiers, everything had changed. Do you know if there are any plans to make things a bit less messy? Because right now it's really, really, really painful to maintain a linter. No, it's just pain, enjoy. It's a shame. No, in fact, it's actually better now because we have less function to worry about. For example, a lot of APIs that were existing before, only for Rust. Because Rust.doc is a compiler extension. Being less and less used because we said, okay, we now stop accepting completely broken code. And soon enough, we'll be very likely using the same API as Lins. So normally it should be still breaking as much, but not as much. I don't know. How is this related to Clippy? I don't hear you at all. Ah. Basically, it's working the same way, but it exists because in Clippy, not all Lins can be implemented if you have specific needs for your project because you need to have higher security levels or you don't want certain code pieces or everything. You can't expect them to be implemented in Clippy. So you implemented them yourself and that's very much why RustyTools exists. So you can actually do it without having to set up everything yourself. Perfect. Thank you so much.
The plan for gccrs
So, today we have a, despite the slide saying Arthur, this is with Pierre Manu, going to speak about GCCRS. Give him a welcome. And, really well. Hello everyone, so I'm not Arthur, so I try my best so please be with me and I'll do my best. Yeah, okay. So, I'm a compiler engineer at Ambicosm and I'm not the co-lead of GCCRS. I believe there's Philip in the room but I can't see it. So, yeah, Philip is one of the co-lead of GCCRS. What will we talk about? So, I'll introduce GCCRS because some of you may not know the project. I'll talk about what we've achieved this year and what we've done, basically. And what we will do in the future in the upcoming year. So, there's a lot of things that's gonna change and I need to introduce those. So, let's begin. So, what is GCCRS? GCCRS is an alternative compiler for Rust. So, you may already know Rust C and we aim to provide a new front end for the Rust language within the GCC project. So, there's already a lot of front end in GCC. There's Hada, Go, and many of them Fortran. So, there's just one new front end that could leverage the GCC back end as well as the GCC plugin system on the GIMP representation. We are targeting for version 1.49 of the Rust language and the work is funded by Mbacosm as well as Open Source Security. So, we'll talk about the points. Why should we create a new compiler? So, there's a whole lot of architecture that aren't supported by the LLVM back end. So, there's already work for a new Rusty compiler for to leverage, for example, GCC JIT to get some GCC targets. And yeah, so basically, we are to leverage those new architecture and provide more target for Rust C. You may check the GCC room later tomorrow actually. So, there will be more about this. There's another big point. It's the Rust for Linux project. Basically, the Linux channel wants to integrate Rust program language in Scott Bayes. And this means we need some people want some support for the Rust language from the GCC project. Having multiple compiler apps, there was multiple domains where helping on this service draws some attention on some dark spot in the Rust language. So, we could show the Rusty people what could be improved, what is good. Yeah, so this brings some discussion about some subjects. And after I've been working on Rusty, on the macro group, for example. So, a lot of things that's brought by new compiler. It's bring new point of view for things. One last thing is working with very old plus plus compilers. There are some architecture systems that have some very old compilers. We just compile them in C++ or C. And you may want to bring the Rust ecosystem to the systems. So, yeah. What we've been doing in 2023. 2023, we had three multiple Google summer of code projects. One was by Mohamed, who was working on the error framework for GCC. We basically want to introduce friendier error codes, like we can find in the Rusty compiler. If you've used the Rusty compiler, you may have seen friendly error codes or user error. And we want to bring this to the GCC ecosystem. And the second new Google summer of code was from Raikey, which I believe, yeah, he's there. You could see tomorrow in the GCC dev room, which implemented multiple things to support Unicob. We've been working on borrow checking, closure, iterators, and a lot of things. We, I also worked on Proc Macros. So, Proc Macros are baking Rusts. Their views are used almost everywhere. So, yeah, I've been working on this in the past year. We are able to expand some Macros right now. And it's not completely polished, but it's almost finished. We had to develop a new binary interface. This is a new system in GCC to leverage Proc Macros. And you may as well see my talk from code one, Google code one 2023, if you want to get deeper into the subject. Okay. I've been talking about the borrow checking. So, we, Jacob Dupac has been working on the borrow checker. Basically, Rust C has a pass in the compiler which emits an IR on the borrower checker, work on this IR to attest and check that some facts are valid, that the code is valid and that the borrow checking rules are all assorted. So, Jacob Dupac has been working on a new IR in GCC. So, we could have a borrow checker. It leveraged Polonius. So, if you've been working on Rust C, you probably already know Polonius. So, this is representation of Rust C on the left and GCC on the right. As you may see, Rust C, MIR is the borrower checking steps and the MIR is then lowered to LVMIR. In GCC, we've been doing things a bit differently. Basically, we had to separate two IRs and there is one kind of dead end IR specialized in borrower checking because we couldn't create an IR that could send below words to the GCC back end. So, at one time in the compiler, there will be two parallel IRs that could be created and on one end there will be GCC tree and on the other end there will be BIR and those will be checked. But the BIR won't be reused for the creation of your final record. I've been talking about the InCode support. So, yeah, I told the InCode support was by Raikey. Told you tomorrow there will be more. Rust code, so we want to be able to pass the Rust C test so this means we should be able to emit the proper error codes and this means we need to fix our own error codes to make those the same as Rusty. We are opening a few more entries this year for the GSOC for students to help us. Feel free to apply if you want. So, what we will be doing in 2024. We aim to implement format arg macro as well as the continue the work on the Polynesboro checker as well as the traits. Why do we need a smart arg macro? Basically, this macro is required in order to compile the standard library and we would like to be able to compile even a simple yellow world. If you have ever used yellow world, you may not know it, but under the hood, there is the format arg macro to format all your arguments and without this, we cannot even compile a simple yellow world. So, yes, this will come soon before GCC 14, hopefully. Currently, our borrow checking pass only rejects some invalid code for some facts and we still miss a lot of facts and we still miss a lot of things. So, yeah, we hope to implement more fact validation from the Polynesian engine. Okay. So, we cannot change our strategy for GCRS. We've met people at the ERRORUST in Brussels a few months ago and those were people from the ERRORUST team, I believe as well as the type language team as well as the trait team. And those people told us that the work required to make the trait solver work was like easy to do, but to get it right, you need a lot more work. So, basically, if you want 90% of the work done, it's easy, but if you want 99% of the work on the trait solver, it will be a whole lot of work because there is many rules that are very specific to some code and, yes, it will be very hard. So, in order to do this, we chose to not implement those ourselves but leverage existing RUST code. So, we'll be using different RUST libraries in the RUST compiler within GCRS. So, yeah, what I was saying, we'll be using RUST code within GCRS. So, that means there will be two steps in the GCC bootstrapping process. The first one will compile GCC RUST without Borough Checker, without a proper trait solver. And that's only a later step that GCRS will compile itself with the Borough Checker as well as all those fancy stuff. Yeah. The first version of GCRS, the one without Borough Checker, without all those things should never land in the end of users. That's only for bootstrapping purpose and, yeah, nobody should use it. Here is a schematic about it. So, yes, as I told you, bootstrapping process will be in two stages. First, we'll compile GCC RUST stage one without the Borough Checker. Then we'll compile Polyneus and then we will compile GCC RUST with Polyneus within embedded inside it. The format argument parser follow the same principle. We will compile it as a separate library and then we will link it. So, in order to do this, we need to make a version of GCRS which can compile the format argument parser RUST code. And, yeah, that'll be it. Let's look at the plan. So, we need a type checker, micro expansion, name resolution, as well as format argument. So, we will integrate those in the compiler in a two-step bootstrapping process in order to then be able to compile the standard library and then be able to call your favorite print line macro. Yeah, compile it after. On a long term, what should we do? So, we want to catch up with RUST or line exterior requirements. We want to be able to compile RUST code that should be used for Linux model. So, RUST for Linux targets a much more recent version of RUST. I believe it's 1.70, I'm not sure, don't quote me on that. But, yeah, we still have some additional work. It won't be that hard because once we have the standard libraries that compiles, there's not many things that are left because most of the work in RUST is done within the standard library, not in a language itself. Yeah. And then, we need some analysis as well as semantic testing. We do not enforce at the current time some runtime guarantees. So, for example, array bound checkings, that kind of things. RUST panic when you try to access an array out of bounds, those kind of things. So, those are not generated yet by compiler, so we still need to add that. And we need to ensure the compiler assembly produces the exact same behavior as RUST. We want to leverage the RUST test suite in order to be sure that GCRS is compliant with the RUST compiler. We need to work on a lot of improvements, more CI because currently all CI is like four little steps and that's all. We want to make sure the GCRS work with every architecture supported by GCC. For example, we have some build failure with some Spark backend, so yeah, let's make sure Spark work again. Spark 64. And one thing we want to add in the upcoming year is more upstreaming. Last year, we were a bit late and work was coming and coming and we didn't upstream as soon as we wanted to do. So yeah, we want to upstream more frequently. This will avoid this kind of situation where we want to push 900 commits in one mail and we GCC a new repository and we're everything crashed because well, it is not supposed to under 900 commit in one time. We want more contributors, more students and yeah, more fun too. Thank you for two open source security and because them and a few members from the RUST community which are helping us getting detailed from RUST. There's a lot of people with a lot more experience in the RUST compiler that's helped us to improve the GCRS compiler and as well as many contributors. So Tomah, Mark and even Riki here. Thank you. Here are different links to our blog, Github if you want to contribute the ISC channel as well as the main list. Yeah, so I'm a bit early, I'm sorry, but yeah. Not my slide, sorry. Yeah, you're the second replacement speaker. What? You're the second replacement after Arthur and then. I don't know. So you're, yeah, as a replacement speaker. I think you did a very good job. So can we. Yeah. Yeah. Thank you. Thank you. Thank you. Great, we do have some questions up here at the back coming around. Yes, so they'll have a microphone for the the stream, but if you could repeat the question anyway. Thank you very much. Thank you. I have two questions actually. The first one is related to the borough checker. So right now the borough checker is really deeply tied into MIR. How are you going to guarantee that you have compatibility between the MIR based borough checker and the beer based borough checker? Basically we. Repeat the question. So the question was, will we integrate the borough checker? Is that right in the GCR? How do you make sure that it's compatible? I'm sorry, I didn't. How do you make sure that the borough checker is compatible? We'll be using the same. So how will we make sure the borough checker is compatible? Basically we will reuse the same borough checker as RACI. We'll be using Polonius. So Polonius could be compiled as a library. We'll just be making an FFI interface and use that interface in order to directly use Polonius within GCC. Okay. And my second question is, do you think you will be able to emit wasm also? I'm sorry, it's very hard for me. One of the nice things with the current REST-C is that you can emit wasm, WebAssembly. Can you do that? Do you think you will be able to do that with GCC-RS? Yeah. Okay. That was a nice answer. Hello. From, as a GCC developer, how can we help from the GCC side? Well, you could drop by our GitHub repo. I mean, there's a whole lot of controversy within the GCC. Sorry, I will repeat the question. How as a GCC developer could we help on the GCC-RS project? Is that right? Yeah. So as GCC, there's a whole lot of controversy on the GCC project somewhat because we're using GitHub and GCC guys don't really like it. So there's, I mean, you could, I believe you could use your user workflow for pushing patches upstream, but I'm not sure. I think the best way to help us is to come to and in our GitHub repository, clone it and basically to like everyone, submit issues, solve issues and yeah. So over here we have a gentleman who would like to. Hey, I just want to clarify the WebAssembly so GCC does not currently have a backend for WebAssembly. So if you want to emit WebAssembly, you first have to write a backend unfortunately. There is however precedent in GCC for other high level assembly backends. So it should actually not be too difficult to do, but not available right now. Okay. Two questions. First one is you mentioned at the beginning compatibility with GCC 4.8. What are the consequences of this choice from a technical point of view? Yes, here. Yeah. I mean, you have the modern GCC code base and you want to your code to be compatible with this old code base. Basically for those who are not accustomed to GCC, GCC 4.8 is a very old version of GCC which doesn't even support C++11 at least not entirely. So we have few steps in our CI to make sure our code is compatible with GCC 4.8 because there are some constructs in C++11 that are not supported by GCC 4.8. So we need to make sure we don't introduce those constructs into compiler. So GCC 4.8 could bootstrap our GCCRs compiler. Thanks. And second question. Rust performance rely on LTO and GCC and LLVM have different LTO strategies. Does that impact you in any way? I haven't much to say about that because we're not at the stage in the compiler development where this matter, we want things to work first and then apply fix on tricks to improve performance. For now we won't focus on a working compiler before focusing on things that work fast. Thanks. Yeah, I just had a... Oh, loud, loud, very loud. Oh. I had a... If I didn't misunderstand, you said that one of your goals was to be able to compile GCCRs with itself, but without the borrower checker or string formatting. I would just like to know what would be the benefit of doing that instead of just compiling with Rust C until you have a working borrower checker. I'm not sure I understood your question. Could you please speak louder? What would be the benefit of compiling GCCRs with itself without a borrower checker or compiling it with Rust C? Okay, so what are the benefits of compiling GCCRs without borrower checker and then with a borrower checker? So basically, that's the slide here. Borrower checker as well as trait solver and many systems like this are very hard to implement. We would need a lot of time and we don't have much resources. So we want to focus on making the compiler work, even if it means compiling, reusing components from Rust C. So this means we first produce a first compiler without borrower checker that knows that the Polonius, for example, works well because Polonius has been compiled with Rust C. So Rust C leveraged the missing borrower checker step for GCCRs and this version of GCCRs will then be linked with Polonius so it can leverage it itself. So basically, this is a temporary version that the user should never see and that the user will probably never see. This is a version that will stay on the build machine of someone who wants to build GCC. And yeah, most of you won't ever see it. And yeah, that's it. You didn't understand quite well, am I right? Yeah, I understood that you were going to use the GCCRs version that didn't have the borrower checker and string formatting to compile GCCRs itself. That's what I'm saying, I don't know. Maybe I misunderstood. How do you say it? Quite... I think what he meant is he wanted to know why you want a bootstrap step that is free of Rust C. What the need is in this? Because we need to be able to compile Polonius. You need to be a separate compiler. I don't remember. I'm sorry, but those are steps that are not yet implemented and events look much into it. So I don't want to say some mistakes or anything. So... So you don't support all the architectures that GCC supports. So if you want a bootstrap on an architecture that's not Rust C, you want to put it back on. Oh yeah, okay, thank you. I had another question. You talked about the Rust C type of errors and also panic with out of bound access. Is there a possibility that we will see this for other languages in GCC from the work you have done? Yes, because... Don't quote me on that, but as I remember it, the students that made the change of framework for the errors made change to some common directory in the GCC project. So Auto Frontend may be able to use this new code. So maybe. But I mean, ZodChanger won't... Won't come by themselves. Confibrators on all of the languages. We need to integrate those in ZodChanger Frontend. Good, okay. So I understand your point in reusing the Borow Checker and the Format Arc stuff because it's already done and it's known to work, so why not reuse it? On the other hand, on your slide, why you are doing this GCC-RS project, you quoted the point that you want to provide an alternative second implementation next to Rust C because it oftentimes helps to have different implementations of the same stuff to better understand what the stuff is all about, to better understand the design. Maybe there is something strange in the design, you just don't notice if you only have one implementation. So this would be a point for also having a second Borow Checker, for also having a checking format ArcSwan. So what is your philosophy? Where do you draw the line between we want to implement a second independent system and we want to reuse proven code? Yeah, the question was where do we draw the line between components that we need to code ourselves or reuse from the Rust C project? I would say that. I mean, we don't draw the line because those are merely temporary solutions. We want to project to get to a better understanding of a state which compiles the Rust code, but in the long run, that won't be the case. We will probably reimplement those components in C++ within GCC. So yes, for now, we simply choose components that are too hard or need too much time. Yeah, in the long run, we may replace them with our own implementation. Hi, so my question is a bit, let's say different, in the sense that what would be wrong with, for example, emitting GCC tree directly from Rust C? This way you have, I think, maximum reuse already compiled Rust, you know, because you don't use LVM, but instead you use Rust GCC tree. And you could, for example, use a feature flag to toggle between these two things. So would there be a merit to exploring this? I'm not sure. I'm not sure, I'm understanding your question. Are you talking about the GCC JIT backend in Rust C? Yes. Yes? Well, one thing, this means we get a new front end which bring diversity on one end. And I believe we could backport the new front end as well as some multiple things to an earlier version of GCC for really old systems, which we cannot do on Rust C. Yeah. Well, I think my question was a bit different in the sense that Rust is a bootstrapping compiler. And the only C++ parts it really needs to function, I might be wrong here, but is LLVM in the end? So instead of LLVM, you could substitute in just a different backend. I don't want to say any mistakes, so I think you should come to Zulip and ask directly to Arthur because it will be way better than me to give you a proper answer. I'm sorry. No problem, thank you. Sorry. So to answer that question, you gave the reason yourself. You said you want to support GCC for eight, but the thing is GCC for eight doesn't have the JIT part yet. Yeah. So like if what he was talking about was the Rust C code can GCC thing that Rust already supports. So you can actually already use GCC JIT to generate code using the Rust C front end. But again, that does not work when you want to support very old versions of GCC. And that is actually what I wanted to ask. I wanted to ask, do you actually plan to upstream GCC RS support in GCC for eight so that people who actually want to use it in an old GCC version don't have to like patch it themselves? Probably, I mean for now we're focusing on only upstreaming things that we could maintain and support, but it could be possible in the future. So probably, I don't know yet. So to answer your question, follow up on your thing. So I'm the maintainer of libGCC JIT. I apologize for the name, because it also does ahead of time compile compilation and worst project name ever. And that itself is a part of GCC and therefore its build time dependency is the same as your build time dependency. So as in the subset of C++11 that GCC 4.8 supports. In terms of the other question, in terms of back porting the GCC RS work into GCC 4.8 itself, I believe GCC 4.8 is still written in C. I'm not sure it's about then that we migrated from C to C++98 and that sounds like, that sounds difficult. That sounds like, yeah. But there is a bootstrapping path. We had another question over here and then thank you for being a good sport. Is there a question over here? Okay, all right. No more questions. Wonderful, can we thank our speaker again? Thank you.
Hardware pointer checks in a Rust application near you?
Alright, we have a real Fostin hero standing in for Lewis, Pierre Emmanuel again. We have two more heroes at the back who have also obviously fixed the audio. Thank you very much as well. Take it away please. Hello again. I'm still not Lewis, and I'm still not the original speaker. The talk will be even worse than the first one. Let's talk about the hardware pointer on the Cherry architecture. Before we get started, we'll cover what we'll be talking about. We'll be talking about mirror safety, capabilities, the Cherry design, digital security by design, as well as CyberLife Connect project. We'll then talk about the motivation, the Cherry and the rest, as well as the implementation and the different challenges and problems we found during this walk. So, mirror safety. Accessing memory pointer, what could go wrong? You're probably your read and answer if you're doing some rest, but is rest even safe about this? So, the problem with rest is once you tag a code with unsafe or something or you're in unsafe context, the hardware will not back you up. You will simply let you access to the hardware if you're lucky, you have a kernel which will give you a page fault, but that's all. So, the hardware will not protect you against user-free, out-of-bound data-scored, everything. But you may already know rest and it helps us. I mean, safe rest is cool. There might still be something that could go wrong. So, what are capabilities? Capabilities are some kind of metadata that we embed at the assembly level with pointers. This means every pointer will have a big field of metadata, whether it could be written, read or even just used and how could it be used? And the second part of the pointer will be the address itself. So, we can encode in this metadata bound permission in validation states, all those kind of things. And the helpers catch code that behaves badly even when a compiler thinks it is valid. So, let's talk about Cherry. So, Cherry is a project from Cambridge University. Cherry isn't an architecture itself. Cherry is more of a specification. It's a set of specifications for an hardware extension. It allows the creation of a capability-based system and the specification covers all capabilities required in order to make code cycle. So, I was talking about this metadata. So, here you can see in this slide the encoding of metadata on the Cherry specification. We've got the permission, the type as well as the bound of the address in order to check any out of bound or array indexing or things like this. And you've got the 64-bit address behind it. Okay. One note, pure cap on hybrid mode. Cherry provides two modes. Pure cap basically is every pointer as metadata. Every pointer is 128 bits. And the hybrid mode is here in order to ensure compatibility with order, not just not order, but capability-less systems. Okay. Okay. So, here you've got an example for an instruction with capabilities. So, it takes an address and it raises an exception if permissions are not correct or something is wrong. For example, let's say on the previous slide. Okay. So, we've got bound set here. This means we can use a pointer for an array on set bound and if we are trying to access this array out of bound, the machine will trap and give us an exception. Digital security by design. What is it? So, in the Kingdom Government Initiative, that want to expand the use of Cherry out of academia to the industry. Zephend, multiple work to demonstrate the application of Cherry and make it work in the real world in the industry. Initially, it revolved only around Morello. You may not know Morello. So, Morello is an extension for our system, ARM. Recently, they focused more on architecture such as Rix-Fi, for example. CyberHiveConnect. CyberHiveConnect is a security-critical application within the rest. It's one to implement end-to-end encryption of a mesh network. So, yeah, here you've got an example. This application is a security-critical application. And it is with a mesh network and end-to-end encryption. So, this means obviously there should not be vulnerabilities. Okay. So, why Cherry and Rust? So, Rust already provided the different restrictions. Some restrictions cannot be provided by Rust. For example, there are runtime enforcements that are provided by Rust, but that slows down the flow. You may have seen out-of-bound checks on your arrays when you index an array. You may have seen that kind of thing. And this kind of code is slow, but if you replace this kind of code with Cherry-based extension in switching, it can now be faster on an extension to access an array out-of-bound. We'll simply trap. You don't have to end-to-end it yourself. You just have to end-to-end the trap. So, when you need to connect an application with Rust code, for example, with the FFI, for instance, a function interface, you may be safer because the Cherry extension will be here to back you up and provide you the correct pointers. And you who are sure that the pointer you'll be using in Rust won't come from nowhere or aren't a pointer or whatever. So, yeah, unsafe can become in some way safer. Here's an example. We've got an array. We converted it to pointer. We make a string, and we try to read the same line and pass a number. And then, at the end, we try to add the index to the pointer. And as you may have seen, we are using the safe code. And the Rust compiler won't catch any of this because we told him to do so. So, here, Cherry might help us. And Cherry will provide an exception on this when we want to go out of bound in the array. Lewis provided two new targets for the Rust compiler, more or less known pure cap and more or less known through the PUDESD pure cap. As you may have seen, both are pure cap. This means those are not compatible with AI breadmode. This means those implementation are not compatible with standard pointer. As we may say it, all pointers should have capabilities enabled. So, here, we have a new type of pointer. It's coming in Rust 5. And all those files are available in the repository right here. There was different implementation challenges. We should provide a new pointer type with capabilities. There is something that's made the created debate a few months, slash years ago, is the size type. You size in Rust, what should it represent? Should it cover the entire addressable space? Should it be able to contain a whole pointer? That kind of thing. So, we chose to represent only the address part of the pointer within the size. Layout and address space differs for pointer on capabilities. More on that later. And we generate, we have to generate some cherry specific interesting for LLven and AI. Again, as I said, your size is not UN pointer. Okay. So, I should have been a demo but I haven't one. So, well, enjoy the screenshots. Okay, so here we get the segmentation fault when we make an out of bound access in our array, even if we don't hit a numlap page, for example. So, that's cool. I'll give the slide. Yeah. Sorry. Okay, future walk. Future walk. So, what will the WIS concentrate on in your studio? It will add more cherry targets. Hopefully, yeah, more possibly some hybrid model for the targets. We want the rest test suite to pass. For now, we have only 50% of the tests in the USC that pass, and refactor the code, document the code, and rebase on a newer version of Rust because right now it's on Rust 1.67. So, yeah. And Lewis would like to begin upstreaming his walk. Well, thank you. And sorry again for this whole talk. If you've got questions, I may be able to answer those questions, but to be fair, probably not. Thank you. What other targets are you looking for, then? Are there other targets besides Morello, which actually implement Cherry today? I'm sorry, you didn't hear. Are there actually targets which implement Cherry today besides the arm Morello thing you showed? I don't know. I mean, RISC talked about some RISC-5 extension, but a journey behind them might be able to answer. So thank you. I'm one of Pierre Emmanuel's colleagues. There are a number of RISC-5 implementations out there. Code of SIP demonstrated by the RISC-5 summit and Microsoft, and I believe low risk also have ones as well. So RISC-5 is actually running ahead of ARM, if anything. But is there RISC-5 implantations so far virtualized, or are there any boards which support Cherry? I'm sorry. Regarding RISC-5 implementation, are there so far any boards which support Cherry, like RISC-5 Cherry, or is it mostly virtualized QM environment? I suspect these have only ever been made by the development teams as demonstrators on FPGAs, but Code of SIP certainly intend to be able to ship stuff to their customers, and I think before too long there'll be hardware available. You have a slide about GDP. Do you have GDP support for someone who prints one of these pointers, like the semantics of the, you know, the extra-secretful bits and bits? If we take a look at, in fact, if we take a look at Lewis Walk, the capabilities are stored in address space 200, if it makes sense. So there is some kind of support, but I believe it's more axe than real thing. I'm not sure, as I said, really, I don't know much. So just to follow up, so I believe there is a reasonable support for GDP and Cherry on Cherry BSD, and it displays all the things you need within the GDP. Any more questions? If not, then let's thank our speaker again.
Proving Performance
So now we have Nikolai Vasquez who's come all the way from Atlanta to tell us about how we can improve performance in our Rust programs and give him your attention and it's going to be a really good talk. Take it away. Thank you very much. So, yep. Hi, I'm Nikolai Vasquez. Some of you might be familiar with my work in the Rust community such as the static assertions crate or recently Devon which is a benchmarking library that I'll be discussing in this talk. And so this title I realize is a bit of a misnomer. You can't really prove performance. Like there's just various factors that make this impossible such as for example there's various system conditions that could affect performance depending on machines and you could be working over different data sets. And so rather than considering this as proving performance, this is more like getting a vibe for performance. And so by show of hands how many people are familiar with measuring performance of their code? All right, so the vast majority. Great. All right, so you're all experts and you don't need me. So I know you probably know this but when we discuss what performance means in software, we're usually talking about how fast it is but to me in broader terms performance is more about how software uses resources to achieve a desired result. So along with thinking about the time that's being spent in our software, we should also be considering the amount of memory that it's using. I think that's a very important aspect of performance. And so making good software can be a balancing act of trade-offs between time and space and so it can be a bit difficult. As developers, the way that we write code can have a pretty direct impact on the performance of our software. So for example, we could be using really inefficient algorithms with a time or space complexity of O of n squared O of 2 to the n or whatever that yellow line might be. We might also be repeating computations instead of saving previous results. We could be choosing to use slower operating system APIs. So for example, waiting on sockets in Linux with the select system call versus ePoll. But also, performance can be bad for reasons that are out of your direct control as a developer. So at a micro level, for example, the CPU might not have cached the data that you're requesting and instead it will have to reach for main memory. The CPU might also expect the wrong branch to be taken and it won't speculatively execute the correct branch as well as the CPU might be waiting on previous computations before executing subsequent code. And then at the macro level, looking out, other cases might be that the network has really poor latency or bandwidth. Other processes could be using excessive amounts of RAM and that can cause DOS to swap memory to disk or your storage might be on a slow device like spinning hard drives instead of SSDs. So when it comes to performance, why do people choose Rust? I believe that the central reason to pick Rust for performance is it's in its community. I find that the community's culture of performance has led to many zero cost abstractions ranging from async away in the compiler to very fast standard library APIs. And we also see this culture in third party libraries. So people will try to make their code work really well and constrain environments in the embedded space or people will focus their attention on how well they're using time and space. So how fast their code is and how much memory it's using. And as well as the community has developed many tools to measure performance. So this really does speak to the culture. And now that we have a sense for what performance is, how do we go about measuring it? So for the sake of simplicity, I'll only be covering things that can be implemented with the Rust standard library. I'm not going to be covering. For example, hardware performance counters because each operating system has different APIs for that and usually accessing them requires root privileges and that can be difficult. So let's consider a simple operation such as allocating a vector of 180 integers. We could try timing it by using the standard libraries instant type and this is generally considered an all right approach. But the results may be surprising. It might just say zero nanoseconds. And so why is this happening? Well, it turns out that the compiler is smart enough to realize that the value wasn't actually used and so it optimizes out the allocation. And so when you're benchmarking, you really should pretend or at least trick the compiler into believing that the value is being used and so the standard library provides a black box function and you can use that to prevent the compiler from optimizing code that you want to run. And I find that a lot of people don't reach for this when they should. And so now that we're using this, we're actually getting higher timings that are more realistic and this is evidence that we're actually now measuring the allocation time. But why 500 nanoseconds? How consistent or accurate is this timing? Well, it turns out that if we run our code repeatedly, the times can fluctuate greatly. So the numbers might vary because of noisy system conditions or some of the things that I mentioned earlier. And you might wonder, well, OK, then how can we get a better sense of our code speed? And you could dive into existing solutions. What I generally recommend for practicality's sake is you should use an existing library that implements this correctly. And so recently I created this library, Devon, that is for exactly this. I wanted to make a tool that makes it very easy to do correct measurements and be able to compare various pieces of Rust code. And so to me, I would say Devon is so easy that it's a comfy bench marking library because a Devon sofa is like a comfy bench. And so that's why I named it that. You can read a bit more about it on the announcement blog post that I have on my website. But I'll also dive into what Devon can do for us today. And so I wanted to make Devon really easy to use. And I wanted the way to register benchmarks to be very simple. So I came up with this very simple yet powerful attribute macro that behind the scenes will generate the code that's needed to benchmark a function. And this might look familiar because this is also how you register unit tests in Rust. And like unit tests, registering benchmarks with Devon can also be done anywhere, not just in the same crate as your benchmarking runner. And you can also, well, I also take advantage of this feature within Devon by measuring internals of Devon with Devon, which is kind of meta. And so given the previous benchmark that we wrote, it's pretty straightforward to adapt it to Devon. We just stick it in a function and then mark it as bench. And then Devon will be able to run that. And after executing our benchmark, Devon presents us with pretty succinct information about how it ran. On this, we can see that the best speed was measured at about 70 nanoseconds. And this realistically represents how fast the function would perform under ideal conditions. And we also see that the worst case was measured at about 200 nanoseconds. And so there's various things that could play into that. It might not be necessarily the code itself, but the situation around the code. And then we also have median and mean, which represent the average time that the function took to run. And we can also see that these values are pretty close to the fastest sample. So we can be fairly confident that this function will generally perform at this speed, at least on this machine. And so to give some insight into how Devon is running this code, we see that it's reporting the number of samples and total iterations. And this represents how many timings, samples represents how many timings Devon has measured. And then iterations is the number of repetitions across all the timings or all the samples. And if we divide the iteration count by the sample count, we end up getting what I call the sample size, which is how many iterations per sample. And so we see that each sample took about 64 iterations. This is chosen dynamically at runtime based on how much time is spent in earlier samples. And this number can be higher for faster functions, or it can be as low as just one iteration per sample for really slow functions. But if we want to measure not only the time to allocate a vector, or sorry, if we only want to measure the time to allocate a vector and not the time to deallocate it, then the way this would work in Devon is you simply return the created value from the benchmark function. And this will defer freeing the vector until after the sample is finished being timed. And since Devon will automatically black box the returned value, we can actually remove the black box from our function. And this just makes it a lot easier to read. And so since we're measuring vector allocation but not deallocation, now our benchmark results are about half the time that we measured before. And so far we've only been benchmarking allocating vectors that contain 100 integers, but we can also benchmark across other cases. So we can use the args option to measure across one, five, 10, 1,000, you name it, any value that can be provided as an input. And this, I find it's generally very good practice to measure across various cases to get a better sense of how your code's running. And we can see that generally as expected, as the size increases, the benchmark also slows down. But interestingly enough, for cases that are at 10 or smaller, there's not really a difference in performance. And so really the differences, I would say, are more like systemic noise because it doesn't really make sense that creating five values in a vector takes longer than creating 10, at least not consistently so. And we also notice that this function really starts to slow down a lot at bigger sizes. And so that aligns with whatever hypothesis we might have had about this benchmark before. But we can also compare the performance across multiple types by making the function generic. And then we can provide a types option to pass all the cases. So now this benchmark is not only running the standard libraries vector type, but it's also comparing that against SmallVec using all the same cases as before. And for those who aren't familiar, SmallVec is a type that's designed to be faster for smaller sizes. And it does this by storing values on the stack instead of doing a heap allocation. But once there's not enough space on the stack, it'll fall back to using the standard libraries vector, or rather it'll use the heap like the standard libraries vector. And so to make what's happening a bit clearer, Devon's not actually doing anything special to the function. This is just normal generic code that's pretty common to write. Instead Devon is using the function as is to generate the benchmarking code for each type that's passed into the attribute. And so once we run this, we have this nice tree output and table where we see that Devon has grouped the types as separate trees under the benchmark function's name. And we can also see from these measurements that, at least for this specific operation collecting from a range, SmallVec is faster than the standard libraries vector when the number of items fits on the stack. However, once a size grows beyond fitting on the stack, once SmallVec needs to do heap allocations, interestingly enough the standard libraries vector is faster. And I imagine this is because the standard libraries vector can do nice optimizations like specialization, which if any of you can make that stable, please. I've been waiting forever. But also when we think about software performance, like I mentioned earlier, we shouldn't only be considering speed and we should also be considering the amount of space that's being used. And normally if you're profiling a long running program, keeping track of allocations with a tool like DHAT, the cost there is relatively low because it gets amortized generally over the life of the program. And the nice thing about tools like DHAT is that it'll collect back traces to tell you exactly where your allocations are happening. So it does give you a lot of information. However, in microbenchmarks, when the time spent tracking allocations, like that can have a noticeable impact. So taking back traces can take microseconds, whereas the code we want to measure may just be a few nanoseconds. And so we would be totally blowing out the timings. And in a sense, by observing the behavior of our program, we've now also affected our measurements. So is it possible to gather insights without affecting measurements? Is it possible to reduce the amount of time spent here? Well, I actually managed to do that. So Devon has a custom allocator that will only track the number of bytes allocated and the number of allocations during benchmarks. This applies to allocations, the allocation, reallocation of grow or shrink. And the way that you use this is you override the global allocator with Devon's allocrofiler. But you can also pass a custom allocator if in reality you are going to be using a faster allocator such as meme alloc. And so it's fairly flexible. So once we've registered this allocator and we rerun the same benchmarks as before, we can see which cases are allocating and how many times. And notice that we are not seeing the allocation listed here because, like I mentioned before, we're returning the created value from the benchmark. And so that's being dropped after the sample is run. And I also want to note that the timings here are the same as before we did any allocation profiling. I managed to optimize this to a point that its footprint is pretty indistinguishable front noise by using thread local storage and then optimizing that further, at least in the case of Mac OS. So if we look a little closer, we can see that, yeah, for smaller sizes, indeed, small back is not going to be performing any heap allocations and is strictly doing its operations with the stack. We can also tell Devon the number of bytes we're processing or number of items we're processing. And this allows us to get a pretty different perspective. The way we do this gets a little more complicated. We change our function to take a venture argument and then we call the counter method on that and we pass it an instance of bytes count. In this case, we're saying that we're counting n 32-bit integers and then we pass a closure to benchmark our function from iterator implementation. So, we then see that Devon will output the number of bytes being processed in terms of, in this case, megabytes or gigabytes per second. And for a lot of people, this might be an easier data point to get an intuition for the speed rather than just the strict timing numbers. For some people, saying growing numbers for better performance is just easier to think about. So to recap what I just covered, Devon has various features that really set it apart from existing solutions. I find that its API is just a lot simpler. It's easier to remember. I also really like how the compact output makes it pretty easy to consider various cases. And as well as because you can parameterize your benchmarks across various cases, you can really just get a sense for the difference in performance depending on the scenario. So I also really like that by going with an attribute macro, I realize that, oh, well, if you make the function generic, you can just pass the types in because you're just parsing whatever you want as the options. And so you can have benchmarks over various collections of the standard libraries, so linked list, VEC, hash set, et cetera. And you can see how different operations really differ between those collections. So such operations that are pretty common like clear might be a lot slower on a linked list whereas on a VEC, it's pretty instant. And another feature that helps me a lot is thinking of the numbers in terms of throughput. I find that it tends to just be easier to understand than durations. As well as something that I find no existing tool out there does is you can track the number of heap allocations at the same time that you're measuring the time being spent running your benchmark. As well as one feature I didn't mention here because I thought it might be a little complex to cover is you can do some interesting things like run benchmarks over multiple threads and this allows you to measure the effects of contention on atomics and locks. So if you're developing a low-level synchronization library, you might find this to be pretty useful. I also want to cover what motivated me to pursue this. I found that a lot of existing tools in space were pretty good but their APIs had some, in my opinion, unnecessary complexity and so I wanted an API that didn't go too far beyond the complexity of the code that you're benchmarking itself. And I really appreciated that by trying this new API, open up new possibilities such as what I mentioned before with benchmarking generic functions, it was relatively straightforward to implement that which was a bit of a surprise. So some food for thought if you're developing your own libraries. And I also found that the default way that some other tools run is pretty slow and I get why they're trying to do a lot of statistical analysis to remove outliers. But there are some cases where you do actually want to know when the code was especially slow because if you're benchmarking over various inputs, it's possible that one case just happened to create a really large string. And so you want to be able to get a sense for everything that happened, not just the best case scenarios in my opinion. And if you do want to run your benchmarks for longer, have larger sample sizes, more samples, there's also options to do that. So you're not restricted. I also want to mention some other Rust performance measuring tools that I think are very much worth considering. So criterion is obviously the current go to Rust benchmarking library. A feature that I really particularly like about it is its graph output because I'm a very visual person. I also do graphic design. Another newer micro benchmarking library is Tango. And what I find unique about it is that it has this technique called paired benchmarking where the execution gets interleaved between benchmarks. And what this does is it spreads whatever systemic negative conditions evenly across your benchmarks. And so this is certainly a feature I eventually want to have in Devon. Currently my APIs tries to prevent requiring ownership of the closure you're passing in. I might have to change that to make this work. I don't know. I think if we had co-routines, I could make it work. But I don't know. Maybe if someone knows how to abuse asynchol weight to get co-routines, please talk to me. Another very useful tool is flame graphs. This is more of a general technique that's well known across the industry. There's plenty of blog posts about this. But for those who aren't familiar, it's a visualization tool that really helps you find where to focus your time. And I think it's very important to find where the bottleneck in your code is before you actually start picking at specific places to optimize and do microbenchmarks on. So try to reach for profiling with flame graphs before you do microbenchmarking, if you can. As well as there's the DHAT crate. And like I mentioned before, every single time an allocation operation happens, it takes a back trace. And so it's able to give you pretty deep insights about where you're allocating memory and how you're allocating memory. It's also able to do some other stuff such as tracking max heap usage at a time. I'm going to try to add that to Devon, but unfortunately it adds a bit of overhead. So maybe it's possible to subtract that overhead from the timings. We'll see. And so some thoughts I want to leave you with is if you're going to be doing reaching for microbenchmarking tools like criterion, Devon, Tango, really figure out if it's worth microoptimizing that kind of code, just try to find the meteor performance issues in your program. So like I mentioned, flame graphs are particularly good for that. And also rather than having standalone benchmarks, you should be comparing it between different cases so you can measure across different inputs and implementations. So like I showed before, with Devon, you can benchmark origin error functions. And this also, for example, in the case of small vec versus vec, really gives you a better sense of is it really worth it to optimize your code using unsafe? And so try to find the specific scenarios where you actually are getting those wins because no one likes nasal demons. And also when making your own projects, since I imagine many people here are contributors to open source and have their own stuff that they're proud of, I really strongly advise that you don't let perfect be the enemy of good. Devon has a lot of features that criterion doesn't have, but also vice versa. Devon doesn't have graphs or machine readable output yet. I do eventually want to get there, but I didn't let that stop me from publishing something that I felt was good that people might want to use. And so try to focus on the features that matter to you most or at least are the most academically interesting. Not everything needs to be a full-fledged product. Definitely try to pursue your interests when making your own projects. Always remember that you can fill in the gaps later if you want. So that's all I had for this. I plan to have questions, but also in the meantime, you can read more about me. And currently I just have one blog post on there about Devon. I plan to publish another thing on kind of like std-conditional T in C++, but in Rust, which is as cursed as it sounds if you're familiar with std-conditional T. You can also follow me on mastodon or Twitter if I refuse to call it X. You can check out the code for Devon. Please give it a star. Play around with it. And yeah, if you want to reach out to me, I'm pretty accessible through mastodon. So there I'm hacky-derm at Nicolai. So yeah, any questions? We do have ten minutes for questions, so I'll plant you. Just raise your hand. I'm going to come to you. So Nicolai, thanks for your talk first. Very nice. And I have two questions. The first question is, have you thought about integrating into CI CD, so continuous integration things? That like, to me it seemed like this is a very handy tool with that, which I can use if I have a problem at hand, which I want to analyze. I quickly can do some benchmark and then dig deeper. But I think if I have found an issue in a very specific place, I might also want to have a test case out of it that I can monitor or be alarmed in my CI CD if there is an issue again. So that was the first question. And the second question would be, is it possible to run all those benchmarks in parallel, or do you have to sequentialize them in order to get the measurements right? Both great questions. So right now, what's blocking getting a lot of value out of running your benchmarks in CI is that Devon doesn't yet have programmatic output. My plans have JSON output and maybe some other format, if that makes sense. But yeah, as well as, so if you have programmatic output, then Devon can then consume previous iterations of that if you're benchmarking across different branches. As well as the author of Tango was exploring some ideas of using a shared library to compile it against four different branches and to make that pretty straightforward with get-of-actions. So yes, I'm definitely very interested in that. Sorry, repeat the second question. The second question was regarding the execution. If you are able to execute more than one benchmark in parallel, and whether there's some impact on the measurement itself. Yeah, so while technically you can, I find that putting the current process under more load could just negatively affect your timings. And so it didn't seem reasonable to do that, although I haven't actually measured if that would actually have as big of a negative impact as I would expect. Thank you. One question I had is, is there a way to compare the execution time with and without the warm cache? That is, the impact of cache on some data structures can be huge. And sometimes in benchmarking, in micro benchmarking especially, you have the problem that you're reusing the same lines. So the second benchmark is going to be faster always. But maybe your use case is actually the one in which the cache is called, for instance. Yeah, great question as well. So I considered having a helper effect function to evict the CPU caches, although I haven't thought of a good way of documenting when this is best to use. But in lieu of that, you can apply as a method on the Bencher type. You can pass a closure to generate inputs for every single iteration. And so if you wanted to, you could create a new buffer for every single time that function is being run at your benchmarking. So since that would be in a different place in memory, the cache effects wouldn't make the benchmark seem so much faster than it might be in a real world case. So we have a question from the matrix. So people are... Oh, Neo has a question. People are following online. So it's a really good topic. It was a very good talk. The question is, thanks. Devan looks very interesting and the API looks much cleaner, simpler than Criterions. Now Criterion can compare across different runs or implementations and then summarize whether performance improved or got worse. Within some confidence interval. Does Devan have something similar or plan to? Yeah, so it currently does not. I found that I kind of shoehorned myself a bit with this output format in that it's not super easy to add a lot more information to that. And so it's kind of like has become a bit of a UI problem in a way, which I find interesting given that's a command line. But yeah, I would very much like to just directly tell the user that this is faster or slower than previous runs. There's also the issue that, for example, if I have my laptop plugged in, now my benchmark runs faster. If I don't have it plugged in, then it's slower. It gets throttled. So it's not always obvious that there was a change in the implementation that caused the function to get slower. And I believe that Criterion's documentation has like a big warning section about this issue. But yeah, I do think that is valuable to have and I would like to have it. And also, if you all are actually very interested in this project, feel free to submit ideas to the GitHub page for it or better pull requests, implement the features that you'd like to see. I'm only one person and only have so many hours in a day. Yeah, I have two questions. The first one was you mentioned that some of the flaws or design differences with Criterion was that it focused a lot on very, I don't know, horrible. It's a lot in statistics and instead of just giving you the fastest, the average and all of that. I was wondering if there is a mechanism in Devon to output, like for example, percentiles or something like that. And my second question was when your benchmarking memory, if the function you're benchmarking the allocates instead of returning all the memory that it allocated, would the output show the total memory allocated or just the memory remaining when the function returned? Yeah, so any allocations that happen before the benchmark runs or not, they will or they will not be visible to Devon in a sense. It will have recorded that but it won't be associated with the current benchmark. It will just get cleared before the benchmark runs. So in that case, you would see that the number of the allocations would be greater than the number of allocations in the benchmark. And to answer your first question, when you say percentiles, are you talking about confidence intervals? Well, no, I mean, it would be also an option but the first thing that came to mind to me was percentiles. Like when you order the outputs, like the timings in the sending order, which for example, I don't know how to describe it right now, but yes, which was the 95th. If you did 100 runs, which was the 95th fastest or slowest, for example. Yeah. So I would like to have more interesting statistics output. Right now, I was just focused on having what I considered was the most important for your average benchmarks. Like I'd also like to output what the variance is across all the samples. So again, I kind of painted myself into a bit of a corner in that people usually only have so many columns in their terminal. And so this table output will be interesting to see how I add to it. I think what I'll end up doing is have options for emitting the columns that you're interested in having and just have certain columns by default. So when I do end up getting around to having more interesting statistics, that'll probably lead the way to make a user configurable of whether you end up with a giant table or not. Okay. Thank you very much for your talk and your answers. Thank you.
Friend or Foe Inside? Exploring In-Process Isolation to Maintain Memory Safety for Unsafe Rust
All right, let's settle down. We have Merve Goulmez. She's going to talk about friend or foe, exploring in process isolation to maintain memory safety for unsafe rust. Thank you very much. Take it away. Hello, everyone. I am happy to be here. Let's get started. I hope that one is working now. As you see, previous presentation I talked about is uptake of rust and EtoB project, for example, rust for Linux or Mozilla or currently is happening is rust in Windows OS. For example, Mozilla is now 11% is rust and the other different languages, for example, C and C++. Of course, that one is all virtual developments, requires mixed language application. Also today we saw previous talk. They talk about unsafe rust. Rust has actually highest two languages. One of them is safe and the other one is unsafe rust. And unsafe rust doesn't enforce memory safety guarantees and why we need it. Sometimes we want to do some low level control or implementation details or sometimes we need it for optimization. In Cherry Talk, they did really demo here and unsafe rust can violate completely rust application memory safety. They can do different route pointers or they can allow us to call unsafe functions via foreign function interface. Academic work shows that more than 72% of rust created is dependent on at least one unsafe FFI. Now we have two things, safe rust and unsafe rust. Unsafe rust says trust me, I know what I am doing. Should we do trust or should we do something and put on shield? And the gap is here actually. I always mention this mixed language application undermines memory safety guarantee of safe rust. And as a result, isolation is really needed. And I am a PhD student. I am a researcher. A lot of academic work to address this issue, we have a lot of academic work for example airing, trust, sun crust or so on. But what is the difference between these different academic publications? They either say that okay, we should use process-based isolation or we should use in process isolation. When you have process-based isolation, firstly you have integrity. It means that each processor, I mean it is on virtual memory space and also the other nicest things you have resilience. It means that each processor, it is on failure boundary. And if one process is crash, the other one is not affected. And a good example for that one, multiprocess software architecture. And the other side, we have in process isolation. It means that you have one other space and inside this one other space, how we can isolate one part of it. For example, you want to protect just a key or you want to protect one part of applications. Of course, if you have in process isolation, it can significantly reduce the runtime cost because context-severing compared to the traditional process isolation is lower. And I put here early approaches. This small box and inside the small box means that in process isolation and the other we have just sandbox provide the process-based isolation. And I would like to highlight something. Just one of them is offering crash resistance, but this is process-based isolation. We have STRAT here, but STRAT doesn't support for us, just it supports C applications. And I did some measurement and according to this measurement, if you have process-based isolation, actually it is 10 times higher than compared in process isolations. But the gap is here actually can be provided best of the board wars. It means that can we have integrity and failure boundaries similar to process-based isolations and we want to also have overheads similar to in process isolation. And my goal is firstly maintain the rest application safety and also we want to increase the software resilience of rest applications and also how we can provide ease of use in the development. In my early work, we provide some approach for C applications. This is secure rewind and discard and this is an approach for recovering vulnerable application after an attack is detected. And how we achieve that stuff? First we compartment the application in the distinct domains and we want to make sure that a memory defect with a domain must only affect that domain memory. This approach is relying on hardware assisted software fault-based isolation. This is a protection key for user space. And also how I detect it? I use different pre-existing detection mechanism, for example, stack canneries and domain violations. And as a result of this work, we publish some C library, SDRet library. If you want to check out, you can scan the QR code. Now I would like to explain the high-level idea. We have function F and we have unsafe. And if you write just some box on top of that, we want to have some memory safety guarantees and we want to have some isolation. And let's get started. We have parent domain and we want to run this function F in another domain. Another domain means stacks that can heap isolations. It means that I want to run this function F in another domain. And to run my function in the nested domain D, firstly I need to push the argument. I need to enter domain D and pull argument from the parent domain. And invoke F. I am executing applications, execute the function F. And the question is that is there something bad or not? But we have a guarantee that now if nested domain has some memory vulnerability, it will not affect the parent domain memory. It means that parent domain is still secure. And I am offering saying that you don't need to create your application, actually. You can just continue the execution. How I am offering, probably all of you know that this RAS says there's some API for panic. And we run the function F in the nested domain and after that we are checking the result. If results say that yes, you have something bad things happens and for example I can detect stack memory overflow or domain-domain violations, it will return error. If everything is okay, we don't need to do anything, it will just return okay. But the idea is that actually we are using this API, panic, but we are actually adding a new feature. Panic has also memory safety guarantees. It means that when panic happens, you can still continue your execution. And after that I explain this rewind and discard. And if nothing happens, if we didn't detect any memory safety violations, we will just push mutable arguments and return value and we will return parent domain. For this whole process in high-level idea, this STRAAT API, CAPI is already offering this pink box, but the point for this blue box is how we can track the rest of the types of the arguments and how we can push another domain and how we can pull it again. And probably all of you know that we have a lot of serialization creator. And what is serialization? It means serialization, we should encode the data in a format, like I just put it in a jar, and after that we should deserialize. When we jump to the nested domain, we should deserialize. And as a case study, I work on two RAS crates, BINCOT and ABONOMATIONS. BINCOT is a transformed data in a common binary representation that allows passing data between different platforms. And only the mention that SANDCRAS is process-based isolation mechanisms and that uses BINCOT. But actually, we realize that BINCOT, for our cases, is redundant because SANDCRAS and also STDF-FI, we are ingested in single platform. And I explore ABONOMATIONS, it's based on Rust Object MemoLayout presentations, but it is just for specific single platform. And it doesn't store any metadata or any type systems. It can deserialize the place without need for another cooperation. And we realize that ABONOMATIONS is efficient and suitable for our purpose. And we did some benchmarking. SNAPE is a fast comparison C-library from Google, and it designed for high-speed compression and the compression of data. And also presented as an FFI example in the Rust books. And what we compare, compress and uncompress random-generated data of different sizes. And we measure execution time of each operation for different serialization creators, like BINCOT and ABONOMATION. I show as a demo here, when I did all stuff, I just used a sandbox macro. And sandbox macro is ensure that this compressed function is completely run in the different domain, and it will not affect the parent domain. And uncompress is also here. Here we tested with different number of bytes, and this is the execution time. What is our lesson learned stuff? Of course, if you have number of bytes, if it is smaller, in process isolation approach clearly outperforms compared to process-based isolations. Because if you have in process isolation, you will not so much overhead. But the interesting start with later, we realize that even for modest-sized arguments, the context, which is not important anymore, is dominated by data serialization method. What you use. And our lesson learned tree, the data serialization method can significantly impact performance, and it is critical to optimize it for different cases. If you are working on this serialization creator or developing, we can talk about it, how we can improve or how we can fit our use cases. In summary, we introduced secure rewind and discard with isolated domains for REST-FFI. We have two goals. Firstly, we want to protect integrity of REST application from memory safety violation in unsafe program. The main point is that actually I would like to highlight, we are increasing the REST application availability, because we have a still option for if unsafe portion of our applications is the some memory safety, we can return back. We have option there. And I provided REST-FFI creators, it is open source, if you would like to try. And what is our takeaway? Improved isolation approach clearly outperforms compared to process-based isolations. But other important things is that data serialization method can significantly impact the performance. Thank you if you have a question. Can you quickly explain how these domains actually work? How do you enter a domain and how do you define what part of memory is part of the domain and what is outside the domain? Of course, it is actually handled by my C-Labri before that. I wrote it. But for RASPEX, if you just use sandbox macro, it will automatically handle it. But if you go into details, for each domain I will create a new stack and new heap memory area. Early, when there is some talk about this allocator, you can specify for allocators for specific domain. Entering a new stack, what does it mean? Just change the stack pointer and continue execution there. So you do a stack switch and share it to an entry point. But the important point is to do this rewind and discard. You should first save your execution context in a secure way. This is the point how we can recover. That is kind of like set-jump and long-jump style. Yes, set-jump, long-jump, but in a secure way. Now we have a guarantee that... Then you use some hardware mechanism to make sure that certain domains, certain memory is only accessible. Yes, exactly. That is true. That is completely true. This is the install feature. We are using that one. It is lightweight. That is why. Because you don't need system calls? Yes, exactly. You don't need to use a RAM trip? Yes, exactly. You have got everything now. First, thanks for the great talk. When deciding which piece of memory you put in the new domain, the global state is shared between different domains or you copy all the global states? Current version is just supporting HIP and memory. HIP and memory, but for the global shared, you should copy and pass it. It is not going to be your application. You should change it. But as a future work, I would like to support this. How we can actually sync between different domains to global shared states? That could be very costly. Sharing and copying the global state could be very costly. Yes, exactly. For example, here also, even though I have improved the isolations, changing arguments between domains create a lot of overheads. Yes, this is the bottleneck now. How we can improve actually this part? How we can pass the function argument one domain to another domain? This is the actual cost, actually. Second question. You copy back all the mutable arguments. Do you use that even if they are not changed or do you do that all the time? I am just pushing this. If they are mutable, I am pushing the argument. But you don't check if they have been changed by the function. If they are mutable, then you copy them back. Yes, exactly. So it is a static check and not a runtime check. Thanks. Thanks for your nice question. Awesome. Sorry, unfortunately that was all we had time for. Can we give another? Thank you to Mervin. Thank you. Thank you.
The Four Horsemen of Bad Rust Code
Let me do a quick survey. Who has a JavaScript background? Okay, maybe like 10%. Who has a C background? C++? Holy hell. It's like 80% for the people on stream. Who has a Python background? What are you, Paulie Glotz? What's going on? 70% or so. Any other languages? Just scream out. I heard something like, it was something like, oh, but I can't really remember. Does anyone own this book? I found this book on my attic and it was kind of peculiar because it had some arcane cantations in it and it looked like magic, but it certainly had something to do with Rust. And I was really excited. I was really enticed by this book. This is why I want to talk about that book. It was pretty old. There was one section in there which I really liked and it was called the Four Horsemen of Bad Rust Code. This is what this talk is about. Before we get into what the Four Horsemen are, I would like to introduce myself. I'm Matthias. I live in Düsseldorf in Germany. I've been doing Rust since around 2015. I do Rust for a living as a consultant. I did a Rust YouTube channel a long, long time ago called Hello Rust. Only 10 episodes, but well, what can you do? And lately I started a podcast called Rust in Production. If you like what I say in this talk, maybe you also want to subscribe to the podcast later on. That's it for the advertisement, going back to the Four Horsemen. I thought about this title a lot. Why would you talk about Bad Rust Code? I think from my experience as a Rust consultant, I see patterns evolving over time. I see people doing the same things in Rust that they do in other languages. They repeat the same mistakes and I saw that no one really talked about those problems. That is an issue when you come from a different language and you try to learn the rustic way, the idiomatic way to write Rust code. This is what this talk is about. Let me present to you the antagonists. While I do that, try to picture yourself. Imagine who you are and what you think your role would be in this talk. The first horseman is this. Actually, let me show all of them. And the first one is ignorance. What is ignorance? Magical little term. We will get to that in the next slide. And we have excessive abstraction, premature optimization, and omission. Of course, you could add your own personal Rust horseman. And these are just very subjective, but these are the things that I see in the real world. Now that we introduced the antagonists, let's go through their anti-patterns and what they are famous for one by one, starting with ignorance or ignorance. The horseman that is behind this pattern is someone that uses stringy type APIs. You have seen it before. Someone uses a string where they could have used an enum or they don't really embrace pattern matching. And that makes APIs brittle. You are in a situation where if you refactor something, you might run risk of forgetting that you changed something or maybe you make a typo and then your string is incorrect. And so it doesn't represent what you want to represent. They also freely mutate variables. They go and say, yeah, this is state and I can change it. Rust has the mud keyword for this, but they do that liberally across the entire code base, which makes reasoning on a local scope very, very hard. They also use bad or no error handling. We will get to that in a second. They use unwraps a lot and they don't really think about the error conditions of your application. They also have a lack of architecture in their applications. And they use a general prototype style language of writing Rust code. And where do they come from? Usually those are people that were administrators before or they write shell scripts or they come from other languages like scripting languages. And this is what they know. Nothing wrong with that, but they haven't fully embraced what Rust is capable to offer. How do you discover that you belong to this group in the code? Well, if you do things like this, you have highly imperative code. You go through the code and then you tell the program, hey, do this, do that, do this, do that, instead of using, for example, a declarative way of describing what the stage should be. They also use magic return values like minus one or an empty string to represent a certain special value instead of using errors. Everything is a string. Unwrap is used freely. You clone all the things and you use the mod keyword. Why is cloning a bad thing? I don't think it is. But the problem with clone is that you maybe don't buy into the Rust model of ownership and borrowing. And that means that you bring what you learned from the past from other languages to Rust and at some point you run into issues with your architecture which you cannot easily resolve anymore. And this is why clone is kind of a stop sign. It's not a warning sign, but it should make you think for a moment. It's an indicator of structural problems in your code, if you like. Okay. With that out of the way, let's make it a little more practical. How could we maybe put this into practice and improve our code step by step? Imagine you wanted to calculate prices for different cities for a bunch of hotels that you have in these cities. For example, imagine this was a map. This is an actual map, by the way. Africa does not look like this. And also, Jerusalem is not the center of the world. I mean, we can debate about that, but certainly geographically there are some issues with this map. Imagine your input looked something like this. It's a CSV file. You get a hotel name, a city, a date, a room type, and a price. And you go through this file line by line and you try to parse it into something that looks like that. For Brussels, you have a minimum hotel price of 40 bucks, a mean price of 80, and a maximum price of 150. Fun fact, I arrived yesterday not having a hotel room because I thought I booked a hotel, but it was last year. So I was in the upper range here. Thanks, Walshbeng, by the way, for sharing your room with me. Otherwise, they would have been a nightmare. If you wanted to parse the input file and create a result like this, all you have to do is write this code. That's the entire code. Nothing really big going on here. There are some peculiarities, but this is usually what someone would write who would say Rust is not their first language. Maybe they just try to port whatever they had in another language to Rust. This is code that I see them doing. What you do is you read the CSV file, then you create a hash map of cities, then you iterate over each hotel, you try to parse the data by splitting each line, you extract fields from it, you parse the price, and then you update the city. Updating the city happens somewhere in the lower end. At the end of it, you print the mean, the max, and the minimum. That's it. That's the entire code. You know, it's working. Technically, you could run this code and it will produce the result that you expect. Prices for different cities, we're done, right? Unless we think about the bigger picture and the demons and the monsters that are out there out in the ocean, and they can haunt us and bite us. There's dangerous beasts out there, killer animals. I think what you want to do is improve that code a little bit. How can we make this code a little more idiomatic? This is the same code. Now, let's look at some parts that I personally wouldn't want to have. Consider this block. There's some things going on, but overall, it's a very manual way, a very imperative way of going through the list of hotels. We literally have a couple if conditions here. If price is smaller than city data zero and so on, we update the price, yada, yada, yada. There are patterns that make that a little nicer to read in Rust. This is the same code. It's just something very similar, but we kind of manage to shrink it down a little bit. In comparison to what we had before, we get city data and then we use some sort of tuple extraction to get the mean at a minimum and the max. That makes things a little easier. We can suddenly talk about mean instead of city data zero, for example. That's not the major problem with this code. There's unwraps too in here. Well, for a first prototype, that might work fine, but later on, maybe you don't want to have that. What if you cannot open the hotel's CSV file? What if you cannot parse a price? In this case, the entire program just stops. A question of design, but I would say if there's a single line that is invalid, you probably don't want to stop the execution right away. Another problem is that we index into the memory right away. Who tells us that a line has that many entries, five entries? It might have three. It might have zero. Who knows? But if we index into something that doesn't exist, the program will panic and that is kind of a bad thing. The underscores mean that the variables are not used, so we can remove them. We have a little bit of a cleaner structure and a simple way to check that a line is valid would be to just have this manual check in there. I know it's not very sophisticated, but it helps us along the way. Now we check if the hotel data length is five and if it is not, we just skip the entry. Let's look at parsing for a second. How do we want to handle parsing? I said that maybe we don't want to stop the execution when we run into an issue and we can do that in Rust by matching on the parse result. A very simple way to do that would be to say match price dot parse and if we have an okay value, we take it and if we have an error, we don't really care about the error. We just print an error on standard error and then we continue with the rest of the parsing. Looking at the input, one thing we can do as well is apply a similar pattern and introduce a result type. Now we use a box for representing a result type. This is because you don't need anything, any external library to have a result type that has an error type which can be literally anything. So it can be a string, anything that implements error, the error trade. In this case, it's a very simple way to improve your Rust code. It's a good first step. What we do instead now is we say read to string and then we map the error in case we have an error to something that a user could understand and act on. Then yeah, the code is already a little cleaner. We handled a few error cases already and this is something that might pass a first iteration of a review cycle. Now of course there are certain other issues with this code. For example, CSV handling. CSV is tricky. Proper handling of delimiters is very hard. For example, you might have an entry which has semicolons like on the left side here or you have something that has quotes around a semicolon and you probably want to handle that. So a simple string split does not suffice. Same with encodings. On what platform are we operating on? Do we know the encoding right away? Does the CSV file contain headlines or no headlines? And there's many, many caveats like that. If you're interested, there's a talk called stop using CSV. I don't say you should stop using CSV, but I say you should start watching this talk because it's really good. Right. How can we introduce types? I talked about types a lot and Rust is great with types. We should use more of them. Here's a simple way. I already talked about the result type and in the first line we just create an alias for our result and we say it's anything that has a T where T is generic and the error type is of type box dün stet error. And then we can use the result in our code to make it a little easier to read. As well, we introduce a hotel struct and we have a couple fields, just strings and floating points at this point. But this helps us make the code a little more idiomatic already. We will combine those things on the next slides. But first let's look at the CSV parsing. There's a CSV create. I advise you to use it. It's pretty solid. And what you can do is you create a builder and a builder pattern allows you to modify a struct and add members or modify members dynamically. And in this case we decide that our CSV file has no headers and the delimiter is a semi colon. And the way you can use it is like this. You now say for hotel in hotels deserialize. No more strings splitting. And now we match on the hotel because this returns a result. And now we need to make sure that the hotel that we parse is in fact correct. And after the step we don't have to deal with edge cases anymore because we know that the struct is valid. That means it has the required amount of fields and prices are also floats. Which is great makes the code much more readable already. And it was very simple to do so. Now I want to quickly talk about this part. There's a cities hash map. It has a string which is the city name. And then it has three floats which are the mean, the min and the max price. I don't think this is particularly idiomatic. The way it was used before was something like this. And we kind of managed to work our way around it. But a better way I would say would be to introduce a type for this as well. Because if we're talking about prices and pricing seems to be something that is very central to what we do in this application maybe we should have a notion of a price. It's very simple to do that. You just introduce a price type. Now you might be confused why we suddenly don't have a mean anymore. But instead we have a sum in account. And the reason being that when we parse the files we update the sum and later on at the end we can calculate the mean. Which has some mathematical properties which are favorable because now we don't really have, we don't run into rounding issues anymore. This is an aggregation that we can do whenever we want to get kind of a mean on the fly. And at the same time we have a default. Now the default is not really idiomatic too I would say. But the great part about it is that we can later reuse it and make our code a little more readable. In this case we set the min price to the maximum float. But then whenever we introduce a new price it will overwrite the maximum because I guess by definition it's smaller than the maximum or smaller or equal. And same for the max and some in account are kind of set to zero to begin with. And just before we bring it all together here's one more thing that we should do which is have a notion of a display for price. In this case we implement the display trade and we say yeah if ever you want to print a price this is the structure that you should use. The min, the mean and the max. And then this way we can make our code way more readable. Now you can see that instead of using a tuple or floats here we use a price. And when we update the prices we can talk about this object. We can tell the object hey update your min for example. Here we say price.min.min holds a price and we automatically get the min price as well. We update those price fields and yeah we can even introduce a price.add method. I don't show it here but technically why not. We can add a new hold up price. Prices could be added over time. Now that depends on I guess your taste, your flavor of rust. This is the entire code. It's a little longer but you saw all the parts. And now you have something that I would say isn't a workable state. It's not great but we did one thing. We considered rust. We thought the ignorance. We started to embrace the rust type system. We started to lean into ownership and borrowing which are fundamental concepts in rust. We lean into design patterns and we learn how to improve our architecture. And I would also say if you want to improve this part try to learn a different programming paradigm. Rust is not the only language. Try rock or try a functional language like Haskell. It might make you a better rust programmer too. This is how you fight ignorance. Now if you see that none of these horsemen fit to you by the way just think of your colleagues how you would want to introduce them to rust because this is the code you have to review and also probably maintain in the future. So it's time well invested. If you want to learn more about idiomatic rust specifically there is a website. I just put it there. It's an open source repository. It has some resources. This is a rendered version of it. You can sort by difficulty so that's your experience and then you can sort by interactivity if you want to have a workshop or not. For example there are free resources on there and paid resources too. Right let's go on and look at the next horsemen. Excessive abstraction. Everyone in this audience knows someone like that. They try to over engineer solutions because rust leans into that. It allows you to do that. It's a nice language to write abstractions. Everyone likes to do that. But then you add layers of indirection that maybe people don't necessarily understand if they come from a different background. They use trade successively and generics and lifetimes and all of these concepts are great in isolation. The combination of which makes the programs hard to read and understand for newcomers. Now if you find yourself in this camp try to fight this as well. Common symptoms of this are things like this where you have a file builder which takes a t as ref of str and a lifetime of a and this makes sure that you can pass any type and that it has no allocations that are not visible because of the lifetimes. So this might be fast and it might also to some extent be idiomatic but it is something that your colleagues also have to understand. Another thing is I might use this again. Let's make it generic or trades everywhere. And how do you get to that mindset? It's very simple. After you wrote your CSV parser it's natural that you want other parsers too. Of course you want to chase on. Of course you want to read and write into a database. You start thinking that you'll need all of those formats at some point and this is the part that is important at some point. And then you end up with something like this. It's a trade definition for a hotel reader and it has a single method called read and it takes a self that's why it's a method but it also takes a read which implements the read. That means you can pass anything that implements the read trade and it returns a box of iterator of item equals result hotel with a lifetime of A. No allocations except for the box but the iterator itself is a very idiomatic way to say a result of hotel so parsing errors are considered and it's very applicable for all of the reader types that you could possibly want. Let's say you wanted to use that trade and implement it for our hotel reader. Now suddenly we blow up the code to something that is harder to understand or if it is easy for you to understand please reconsider if your abstractions are too much. Maybe you ain't going to need it. Right. So we have a hotel reader and it owns a reader builder and inside of our new method we initialize the CSV hotel reader and we implement hotel reader down here. The single method called read and we say self.reader builder this is the code that we saw before we just put it here this is our CSV parser the initialization of it and then we return a reader.into the serialized hotel map and this is where we map the errors. Right. Does it look great? I don't know depends on someone's nodding. We need to talk but it's certainly nice to use I guess. Now we can say for hotel in hotels.read file. Should hotels know about files? Maybe not. But it's great if you go one step further and you implement iterator on it and now you can say for hotel in hotels. Alright we're getting somewhere from a user's perspective that is really great. But remember we're talking about application code. There's probably code that you earn money with. It's not a library function that is used by thousands of people. It's your simple CSV parser and now we just blew it up into something that is harder to understand. Do you really need this? Well I don't think so. I don't know what this person on the bull does but it certainly looks confusing to me and this is what people think when they see the top signature. I know kind of you wanted to optimize it a bit but at what cost? Right whenever you sit here and you think oh I should implement JSON support and you don't do it for fun. Start thinking if you really need those subscriptions because they can haunt you. Most of the time they don't have no need of it. I don't know what sort of animal this is. Is it a lion cat or something but it's kind of strapped to a cannon and it doesn't look too happy to me. I don't want this. Probably you're not going to need it. As a side note another thing probably you shouldn't do too often are macros. There are traits out there that excessively use macros. What do I mean by macros? Macro rules but also macro derives and these are great but they come at a cost and the cost could be compile times. Just yesterday I talked to Daniel Kerkman who I don't know is he here? He's not here. But thanks for the tip. He has a situation at work where compile times just blow up because of macros and for you it might be easy to write but for other people it might be hard to use. Maybe you want to prefer traits over macros if you can. That was the second horseman fighting excessive abstraction. How can it be done? If you find yourself in this situation keep it simple. Avoid unnecessary complexity. Just think that the person that will maintain the code is not a mass murderer but your best friend. Do you treat friends like this? Watch newcomers use your code. That can be humbling. Ensure that abstractions add value. Yes you can add a layer of abstraction. Does it add value? That's up to you. Decide and don't add introductions that you might need in the future. Add them when you need them. Right. Two off the list we have two more to go. Next one is premature optimization. This is for a lot of people in here because you are C and C++ programmers. I'm looking at you right now because 90% of you raised your hand. I see a lot of people from C and C++ come to Rust with this mindset with these patterns. What are the patterns? They optimize before it's necessary. This is important different from adding too many layers of abstraction. Optimization in this case means profiling is not done but instead you kind of try to outsmart the compiler and you think about performance optimizations way too early before you even need it. Did I even tell you how big that CSV file was in the beginning? How many entries does it have? You don't know. Maybe you should not optimize for it right away. They use complex data structures where simple ones would suffice. For example we saw the hash map with the three tuple elements. These are things that are kind of unravel and then it ends up being a mess not very idiomatic and arguably not even faster. And they also have a tendency to neglect benchmarks. Some red flags. Quotes you might have heard. Without a lifetime this needs to be cloned. Ignore that. If you know that you have a performance problem then you can think about lifetimes. It's fine to clone. Let me help the compiler here. The box is so much overhead. I use B3Map because it's faster than hash map. No need to measure I've got years of experience. They love the term zero cost abstraction or zero copy. Actually it should be zero cost in here. And they hate allocations. Whenever they look at an allocation they feel terrified and they bend over backwards to make that program faster. So whether this is the developer or the compiler and vice versa is up to you. I've been in both situations. They turn a completely simple hotel struct with a couple string fields which are owned yes they live on the heap. Do something that lives on the stack and has a lifetime. And every time you use a hotel you have to carry on the weight of the lifetime. Well does it matter for this one particular case? Probably not. But then you look at other places of the code base and you see that they kind of reverted your changes. They made what you introduced your hard won knowledge about the abstractions and they took them away. Now we start to index into our data structure again. We use string split again. We go backwards. We've been there before. It is super fragile. Again we are going backwards. Now let me play a little game here. Since there are so many C and C++ programs in here I expect you to answer this. What is the bottleneck? This is a very famous medieval game who wanted to be a millionaire. What is the bottleneck? Is it CSV parsing? The DC realization of our entries. Is it string object creation after we DC realized it? We put it into a hotel struct. Is that the bottleneck? Is it floating point operations when you parse the price? Or is it hash map access? Who's for A? Some shy hands? Don't be shy. Who's for B? Okay. Nice. Who's for C? No one. And who's for D? The hash map. Nice. The correct answer is you forgot to run with release. How do you find the actual performance improvements? There's just one correct answer and it is measure. Profile. Use the tools. Cargo flame graph. Cool thing. You will see that in a second. Use benchmarks. There's criteria on Nick still in the room? Nicolet? No. His benchmarking tool. Divan. Pretty great. Use it. Okay. I will give you one example. Let's look at a flame graph of our initial program. The one that a junior developer could write in two hours. What is the bottleneck? There is no bottleneck. This is the setup of our flame graph itself. This is the profiler setup. The code itself is negligible. Negligible, I guess. And why is that? Again, because I didn't tell you how big the fire was, do you think I can come up with thousands of alliterations for hotels? No. So I added 100 entries. There is no bottleneck here. Okay. You might say, but okay. What if the fire grows? Let's add a million entries. Okay. Oh, this is still 120 records. So let's add more. This is a million. You probably ain't going to read it. Let's increase it to 10 million. And indeed, deserialization of the struct takes most of our time. Okay. If we look a little closer, it says, serde deserialize deserialize struct. Okay. We have some memory movement going on. Let's take a baseline. That is our baseline. This is what it takes. 34 seconds. Okay. Now, let's say we kind of want to prove our C and C++ developer wrong. Does this other abstraction that we added for the hotel struct really add that much overhead? No. It's the same. It's like 34 seconds still. Oh, actually, this is the part where we remove the unnecessary fields. But we can go further. We can say, yeah. Here we have a little safer version. We don't index, but we say nth.1. And we have 32 seconds. Now, our bottleneck is append string. String appending. Okay. I think there's something that we can fix. Well, okay. Maybe this is not really that readable. But what we do is we split now by a string. And instead of doing an allocation where we append to our string over and over again, we use this pattern matching here. And this reduces the runtime by 30% already because we save on allocations. Now, if we try to profile this code again, where's the bottleneck now? Read until. Okay. What is that about? We have a lot of memory movement going on. And now we reach a point where the disk becomes the bottleneck. We can use an M-map for this. Now, remember, we are talking about performance and maybe you should not do those optimizations, but prove a C and C++ program were wrong and they are in tuition. And then you see that the bottleneck might be solved elsewhere. Now we are at 30 seconds by changing like four or five lines from the entire program, not the entire thing. We can keep using our abstractions. That's the main point. Here we use an M-map. That's a memory map in the kernel. We save on allocations. 30 seconds. Okay. What if we wanted to do more? It's hard to read, but now we reach the point where in fact the hash map is the bottleneck. And one more step to improve the code would be to split it up into multiple chunks. You can use rayon. You can now finally use a better hash map like a hash map. And we are down to 3.5 seconds. And we did that not by guessing, but by profiling. Now if we want to run a profile, it looks different again. Very different. These are the individual chunks that we managed to split up. We went from 40 seconds to three or four seconds in a couple slides and with few changes. And the point is don't guess, measure. This is the worst part that C developers bring into Rust. They think everything is a performance overhead. And if this challenge, by the way, looked very similar to the one billion row challenge, this is why it was inspired by it. And it is very similar. Read it up. It's kind of fun. We did something similar for hotel data. But the more important point here is how can we fight premature optimization? Measure, don't guess. Focus on algorithms and data structures, not micro-optimizations. More often than not, if you change from a vector to a hash map, this will be way, way more efficient than if you remove your little struct. And if you add lifetimes everywhere. You can get carried away pretty quickly and Rust encourages you to do so, but it also has the tooling to fight it. Be more pragmatic. Focus on readability and maintainability first and foremost. Use profiling tools to make informed decisions. You covered all of that. Your code is idiomatic. It is fast. You didn't overdo it. What is missing? Well, the entire rest. Do you have tests? Do you have documentation? Is your API too large? Does your code lack modularity and encapsulation? These are things that I see from people that are like the lone wolf coders. They know all about Rust, but what they are not really good at is the rest. Explaining the differences to their code maintainers. And writing documentation. Not about the what, but not about the how, but the what. What does your program do? Some things they say. It compiles. My work is done here. The code is documentation. Let's just make it all pop. I'll refactor that later, which never happens. Let's look at that code again. This is our first version junior programmer. Three hours. Okay. How do we test that? It's kind of impossible because this is one big binary, one main. How would we test that? Well, I guess the question is what do we want to test? Well, first off, I would say let's add a test for parsing the entire thing can be a very simple, true test. But if we refactor it such that we have a function that parses cities, now we can start to introduce a path here and do the parsing. And this is where the parsing logic is, by the way. We split it up into a main and the parsed cities. Great. This is our first test. Very crude, but we get to a point where suddenly we can test our changes. We create a temporary directory. We have a path and then we write into a file and that's it. The parsing is done. Great. If we wanted to make it a little better, instead of passing in a path, we pass in something that impels read. Now we don't need to create files like here. Instead, we can have our input as a binary blob. And these are simple things. Add some documentation, add some tests. It's not that hard. And in order to fight a mission, what you need to do is write more documentation, write unit tests, use tools like Clippy and cargo UDAPs, set up CI CD so that you can handle your changes, create releases, use release please, Marco, greetings go out to you, and keep a change lock of what you changed. Right. We're getting towards the end. We have seen the anti patterns. You know them now. I hope that you will be able to, you know, see them in your code. If you want to learn more, there are some other talks that were given here at FOSSTEM and other places. You might want to check them out. Maybe I can put the slides somewhere. And that is all I have to say. Thank you. Thank you.
Introducing Ratatui: A Rust library to cook up terminal user interfaces
So now we have Orhan who's going to tell us about Ratatouille, Terminal UI. Hello. Okay, it's working. Welcome everyone and thank you for being here for my talk and coming to this presentation. Before starting out, we are already starting out a bit late, but I would like to ask you some questions. First of all, how many of you know what a terminal user interface is? Can I see hands please? That's great. And how many of you built terminal user interfaces before? Anything just in the terminal interface? Cool. The whole audience. That's great. I am Orhan Parmax's and today in a few minutes we're going to explore. The fascinating world of terminal user interfaces and see how we can build terminal user interfaces in Rust with a library called Ratatouille. I'm an open source developer and you can find me on GitHub with the handle Orhan, which is also my name. And I built the following projects and you might know me from Gitcliff. It's a change of generator tool and I mostly work with command line tools using Rust and I pretty much live in the terminal, I would say. And that's why in my free time I package some Rust tools for some Linux distributions, mainly Arch Linux and Alpine Linux. So let's learn a bit more about terminals, right, before jumping into the user interfaces. What is a terminal and how does it work? I want to show you some terminal pictures. Well, this is not looking like the typical terminal that we use these days. This is IBM 2741 and this is one of the early user terminals that we had back in the day. And this was used for something called telegraphy. It's like a long distance transmission of messages. And this is also called a teleprinter or a teletypewriter. Shortly we call this TTY. Let's keep them in mind because it will come important later in the slides. Next we have VT100. Now we have video. We have a video display unit and we display some information on a screen rather than printing text on to a paper. That's cool. And VT100 was one of the widely used terminals back then. And maybe there are people in the audience who use this. I would highly respect that actually. Next we have a text terminal or just a terminal. Sometimes we call it a text console. And it's a cellular computer interface for text entry and display. And in the screenshot you can see now of him my preferred text editor and I'm editing some package file. And yeah, this is how we like we can just imagine terminals these days because there's some text input. Then we process something we see the output there. And just to summarize things a bit more, we can say that we can just, you know, have this diagram here. It's a POSIX schema of C standards like streams. And you can see here there is a text terminal. We get the input from a keyboard. We process it. Then we display something on a display or screen with a STDR or STDR. Well, this looks pretty simple, right? But under the hood things are a bit more complicated. There's a nice blog post about how terminals work. And if you want to like learn more about these, you can go check it. They definitely check it out and learn more about it. But just to give you a couple of ideas here, I want to point out TTY, which I mentioned in the first slides and also PTY. TTY here is used for like a serial interface to a computer. Whereas PTY is an emulated TTY, which enables us to emulate multiple terminal interfaces to a computer. Well, you might ask, what does that mean? I'll give you a very simple example. Let's say you want to have like multiple terminal emulators open, right? You want to have them side by side. You want to have like multiple sessions. In that case, you will have multiple PTYs basically. So let's keep that in mind and move on. If you want to see the current TTY that you are on for some specific terminal, you can run TTY command. And you can see here I am on the fifth TTY here. And also you can see the same from the PS output as well. So these things we can access from Linux and get some information about. Just to wrap things up, terminal is a physical device with a keyboard and a screen connected to a computer. TTY is, it was used for a device, like for type messages, but now we use it for a term which describes an interface to a computer. PTY is an emulated TTY. That's cool. But we only talked about text for now. Like we want to have some text on our terminals, right? Well, in reality, we want to have more things like colors and like styling and then like cursor controls and everything. How do we do that? Like we don't want to just have text and we cannot really leave it this. So let's say we want to have this exact text in our terminal. What do we do? We do this. This is very gibberish looking at first glance, but there's actually some magic going on here. And VT100, the second terminal image that I showed, was one of the terminals which was able to do this. We call this NC escape sequences. Here are two examples. First we have this example where we set the foreground color of some text and then we can also set the background color. We can also do more stuff such as control and cursor, setting the graphics mode, like the screen mode and stuff. And something that I want to also point out is that NC escape codes works like a session. So what does that mean? You set something, some terminal attribute that is set for the remaining session of the terminal. So if you set your foreground color to white, the text, the remaining output will be white. If you do something, then it will just stay. So terminal simply has a state. To get more information about the terminal state on Linux, you can use this STTY command. And here I just get the state, set something, and then I can just revert to the original state of the terminal. And then if you mess up your terminal output, you can just use the reset command, reset the defaults. Okay, now we know what a terminal is, how to control it in a very basic way. Now we are ready to talk about terminal user interfaces. In the realm of terminal user interfaces, we use NC escape codes a lot to control the terminal. And also we want to output some styled text to it and then have some mouse controls, read input handle events, and we basically build our loop around this and form a UI. Let's look at some examples. We talk about too much about terminal user interfaces, but how does that look, right? Well, we have H-top here. Pretty much sure everyone, most of you know this tool. It's an interactive process viewer. You can see the running processes. And it's a good example of a tree, actually, because we have some gauge and some list and then some style text going on here. What you can also see from this screenshot is that we need to get a bit creative when it comes to building terminal user interfaces, because we don't have the typical building blocks for having some UI. We need to use symbols for blocks. We need to have these pipe characters to form some kind of a UI. So that's one of the good examples of a tree, I would say. There are also a lot of use cases of trees. H-top is a system administration tool, but we can also have text editors, file managers, miscellaneous stuff, multimedia, even games, which I will talk about today, and even more stuff. So they are good for productivity and efficiency. One might argue that what's the difference between this and this, right? This is a file explorer, which runs on a GUI, and this is another file explorer, which is on terminal. Well, you can pretty much do the same thing in both of them. Like, what's the difference apart from the light team? Well, at the end of the day, you can choose whatever you want, but they both have some advantages, and it's good to consider the advantages of Tweet when it comes to working with terminals, and if you want to have some efficiency in your workflow. I want to go over them briefly. First of all, Tweet's are very resource efficient. They consume fever system resources, and they are very suitable for resource constraint environments, and you can navigate faster in Tweet's because you're on a terminal, and you have some shortcuts and command inputs. And if you like, let's say you want to connect to a server, right, for BISSH, and you don't have X11 or Wayland in the server, you just have ATTY, basically. In that case, Tweet's are good because they have text-only displays. You don't have to have a display server running, so you can simply SSH into that server and run some Tweet, and that's why they can be accessed with an over-end network connection as well. When it comes to Glees, they are also pretty advantages in some cases. If you want to have a very user-friendly interface, Glees are good because they have an intuitive interface for new or casual users. They have user interaction enhancing features such as drag and drop, and there's something called what you see is what you get, which means that in Glees, you have more immediate and visual representation of your changes, which is also good for new users or someone who is not really into computers, I would say. I asked the RataTweet community and also in my socials, what's your top picks when it comes to Tweet applications, and these are the answers that I got. We have a couple of text-adders here. We have some development tools, also some cool stuff such as Atwin. We have the maintainer here. Shout out to her. We have some process management stuff as well. People like Tweet, and I would like to ask you a question here as well. What's your favorite Tweet? Do anyone want to say? B-top. B-top. Good one. What else? F-C-F on the one hand. F-C-F. Cool one. Last one. Alex. Alex. Yeah. Three years ago. Yeah. So, yeah, Tweets are very popular when it comes to development utilities, I would say. So, yeah. Next, I want to pay tribute here and mention some of the legendary legacy software that helped us come this far when it comes to building Tweets. Starting off with the MS-DOS editor. I see some smiles. The OGEdit.com. In 90s, these type of stuff were very popular. And I especially like the aesthetics of this because you have some drop shadows there. It looks very bad, but also really good. So, yeah. You have colors and mouse support and everything, so truly a masterpiece. We have Borland Turbo C++. This was very powerful back in the day. And this was very language specific, but it's really, it's a really nice example of a Tweet that were used in the old days. I mean, heck, we even have syntax silencing in there. So, shout out to them. I'll skip the slide. Yeah. We have Midnight Commander and Orthodox File Manager. There are also a lot of other File Manager Tweets as well, but I picked Midnight Commander here because it's widely known. It has a wiki page. That's actually the reason why I picked it. This is one of my favorite IRC clients. And I honestly added this slide because I like the visuals and the aesthetics. So, yeah. Well, they all look very, very old, right? And they're all stuff. Every desktop and laptop runs on a graphical OS these days. So, should we still care about Tweets? Well, another example that I will give here is that it's about reduced resource consumption. We don't have bloat in Tweets, basically. Turbo C++ was 9 megabytes. Helix is 16 megabytes. And VS Code is 350 megabytes. So, you make them out. And yeah, VS Code will eat your computer for lunch due to electron rip. Okay, let's talk about how to build those magnificent apps. And let's talk about the other Tweet libraries before moving on to Rust. We have N Curses here. N Curses is one of the most popular Tweet libraries and most known ones. And it's for C. And you can just build Tweets using this. I want to point out one thing here. This refresh call is actually a performance trick. So, if you want to have some performance benefits, you don't call this refresh because your updates won't be rendered. So, until you call refresh, your UI changes will pan for a moment. So, we have these small tricks when it comes to building Tweets. And these stuff actually improves performance and offers great flexibility. You have CDK. The reason why this exists is because N Curses, doing stuff in N Curses is pretty difficult. And if you want to like complex UI, then it's really not possible and very difficult to have it in N Curses. So, people created Curses development kits. And this provides some widgets, such as dialogues and calendars and whatnot. And Curses versus CDK will come later as an important point. So, let's keep that in mind. If you run the N Curses code that I showed you, you will get this text on your terminal. And this is very boring. So, we can take things upon us. This is dialogue, a very small command line utility on Linux. And you can show dialogues like this. I added this because it can be counted as a Tweet as well. You have drop shadows, you have this thing. And yeah, let's press enter to this and take things upon us. We have Textualize in our hands. It's a Python framework for building Tweets. We have the tool Dolphi here. It's a Tweet for monitoring MySQL in real time. It's pretty cool. And Textualize can also run in browser. We have Bubble Tea, a Go framework based on the Elm architecture. And now, Go is mentioned. Let's talk about Rust, right? The moment that everyone was waiting for. In Rust, we have Twee RS, created by Florian in 2016. And it was maintained till 2022. And this library is one of the most used libraries in the Rust ecosystem. And then it was unmaintained after some point. And in 2023, we created a community around it. We forked the project and we rebranded it as Rata Twee. And now, Rata Twee is the most used Twee framework in Rust. And you can have these complex stuff in Rata Twee as well. And I will briefly mention this. And I will give you a demo of how to build apps in Rata Twee. Just to give you a bit more about the history of the project. First, there was a discussion about the maintenance of Twee RS. Then in the discussion, it was not really leading anywhere. So me and a couple of other interested people, we created this course server. We talked about the possibility of forking the project. And we let the maintainer know about this. And we forked the project under the name of Twee RS Revival. We had some meetings at the time. And then after some point, the maintainer was not really able to respond to us. He was probably too busy. And then we started to create some merchants and continue development. Someone came up with the coolest name ever, Rata Twee. Someone made the logo, which I have stickers for here, if you want to have it after the talk, for sure. And then we created some releases. And after some point, Florian archived the two RS repository and just redirected people to us. And that's when Rata Twee became the official successor. And today, actually yesterday, we just created a new release. Pretty cool stuff. Definitely check that out. And I also wrote a couple of blog posts about the history of Rata Twee. So you can read them on my blog. When it comes to building Twee, we have a lot of options. We have textualized, we have bubble tea. But why Rust is very important for us? I want to briefly mention a couple of points. First, memory safety. I'm pretty sure everyone is like familiar with this. But Rust's ownership system ensures memory safety. And it also eliminates security issues related to memory issues. And we have a very performant language in our hands. We have zero cost abstractions. And all the controls, they allow you for us to build highly performant tweets. And yeah, cost platform support is great. We have great portability features when it comes to doing stuff in Rust. Unless the cargo is just great, and we have a growing ecosystem of Twee libraries. So if you want to have widgets that are not supported, not in the Rata Twee organization, people create very cool third-party widgets that you can just use in your Twee apps. Here's a demo of Rata Twee. I'll give you a water break. We'll give you a wash this. This was made for the Thousand Comet. We just made this for celebrating. And here you can see pixels are moved around to create this fade-away effect. Rata Twee is very lightweight, customizable, flexible, and it has a very cool name. So it is pretty much designed for developers and enthusiasts who want a lightweight alternative to graphical user interfaces. And if you want to have an app deployed in constrained environments, like a server with limited resources, Rata Twee is for you. And if you want to have full control over the terminal and have more customized and tailored experience, definitely consider choosing Rata Twee. And if you just appreciate the retro aesthetic of the terminal and the cool name, go for it. If you remember the endcurses versus CDK example that I gave, here you can see terminal is handled by endcurses and CDK is rendering the UI. In the case of Rata Twee, terminal can be handled with a couple of backends. You can choose between those options. And Rata Twee is actually responsible for rendering some widgets, such as these ones, like block, tabs, and list. And when it comes to these backends, we have three options. We do not really dictate which one you choose. They are all, at the end of the day, they just handle the terminal. But cross-term is one of the popular options. It's a purest terminal manipulation library, and it supports all of the platforms, basically. Here's a cool diagram about which back-end you can choose. Like I said, it does not really influence your... If you go with like termists instead of cross-term, it does not really have an influence on your project structure, because the core functionalities of the terminal handling stays the same. Here comes the exciting part. I will show you how to build a Twee with using Rata Twee very quickly. We have a lot of tutorials on our website, so if you go to RataTwee.rs, you can find a JSON Editor application tutorial. We have a counter-app and a bunch of other cool documentation about Twees and specifically Rata Twee. So definitely go check that out if you're interested. First, we start with creating a new project. First, you need to check your Rust version. Then in the project structure, you can see... It's a very simple project, but I will just give a brief introduction. CargoTomel is where you have your dependencies, and under SRC main, that's where you'll have your code. Next, we can use the CargoAdd command to add Rata Twee and your preferred back-end to our project. In our case, I run the CargoAdd, RataTwee and CrossTerm, and you can see in CargoTomel, we have the dependencies added. The versions might vary based on the time that you're watching this presentation. Next, we can go ahead and add some imports to our main file. From CrossTerm, I imported some terminal handling types and methods and types and traits. From RataTwee, I imported some widgets and also Prelude. Prelude is a module which re-exports the most used types and traits and really simplifies the imports in our case. Before going into rendering anything, we need to actually set up the terminal and restore it. In this code, you can see I entered something called an alternate screen and enabled something called raw mode. The alternate screen is something like a new buffer in your terminal. If you run your Twee app, you want to switch to a new screen and have a clear page where you can render stuff. In the raw mode, we also call this cooked mode. In this mode, you basically cook. You just switch to it just to have more like the full control over the terminal. In this mode, the I.O. is turned off and you just have to handle your stuff yourself. Before exiting, when you exit, you have to restore the terminal because you don't want to mess up your output. In this GIF, you can see that first I will run some Twee with alternate screen. The text is printed there. When I quit, this cursor is back to where it's at. If I run the same demo without alternate screen, you will see that the whole Twee is printed to a terminal as is without switching to a new buffer. The cursor is shifted down. Alternate screen helps us to have a clear slate when you want to render some widgets. The most important part when it comes to building Twees is the render loop. First, you need to draw the UI. In this case, you can use the terminal's draw method, which takes a frame. It's a closure and it renders the entire screen. Here, I just have this paragraph widget and I render some text. Next, I need to handle some events. In this case, I am pulling some events from cross-term. If Q is pressed, I just break from the loop. The reason why we have 16 milliseconds here is that 16 milliseconds is roughly 60 FPS. We have to wait a bit just to make sure that UI remains responsive regardless of whether we have new event spending. This is the full code. This might look a lot for just the Hello World application. We are aiming to simplify this further. If you run it, you will simply see a Hello World Twee. That's how you build Twees. You might ask what happens in case of errors. In case of errors, you might guess that the restore stuff that we added, the leave and alternate screen won't be called. You will pretty much mess up your terminal output. In that case, you can use some panic hooks. We have a couple of tutorials on how to do them. Here, I have the code for setting up a panic hook using better panic crate. When you panic, like when you just unwrap something, for example, this will be called and you will restore your terminal. We have a couple of concepts. I will briefly talk about them just to further improve our understanding of how the Twee works. We have an area. The coordinate system runs from left to right and top to bottom. We have the origin on the top left. XY coordinates are represented by U16s. In this area, you just basically say that you want to render something in that area. In this example, we render a text and we manually calculate the area to render within. We have a layout and layouts used for, if you want, for example, split your area into two like this and have rendered different things on these different areas, you can do that. You can also have nested layouts and such as well. You might have seen the constraints in the last slide. Constraints helps us to have better control in the layouts. For example, in this example, I want to create an area of 10 characters. Then I want the next area to be 70% of the remaining. The last one will be the remaining area, but just not bigger than five characters. We have a good flex demo we recently added to our repo for demonstrating how to use those constraints. In the world of UI development, there are two concepts when it comes to rendering. The first one is retain mode rendering where you have your widgets and states and you just update states to render something. We have immediate mode rendering which we don't have states and we just redraw everything on every frame. Not to the user's immediate mode rendering approach. You can see here that in every loop, in every draw call in this closure, everything is rendered. This sounds like a bad thing, but it's actually a good thing because it gives us some flexibility. For example, your UI logic becomes a direct reflection of your application state. Also, if you want to hide a widget, then you just don't render it based on some condition. It has those advantages. Lastly, there are several patterns that you can use with your three applications. I briefly mentioned the Elm architecture. We have another actually, who library for building tweets using Elm architecture. It's basically something like you define your models, handle updates and render the view. We have a component architecture. There are two good Rust projects. You can check them out if you want to learn more about how they structure their project. Also, we have a template repository which we have a component architecture template as well. I will briefly mention that now. You have a flux design pattern. We can also use this flux architecture in our apps. We have another cool Rust project which uses this architecture. The templates that we have, you can just use Cargo generate to clone, choose between these templates and bootstrap a 3rd to 3 application very quickly. You can install Cargo generate and run this command to have a prompt where it asks you some questions like your project name, etc. You can have one of these templates, like simple async or component. It's a good way to start out with having another three projects, I would say. Lastly, showcase. Let's have a look at what people have built with 3rd to 3. I will show off some widgets first. We have paragraph, we have a block, we have calendar, we have chart, we have a table. Those are the maintainers, by the way. We have a bar chart. If you want to have more stuff, we have some third-party widgets. We have a retotree image widget where you can just use to render some images on the terminal. There's also a couple of other ways for rendering images. In this case, someone actually shared the snippet on a Discord. They used the colored pixels to show some image. This is also possible, very bad code, by the way. Anyways, here's an album cover of Kendrick Lamar. We have a pseudo-terminal widget, 3-term. If you're building some text editor and have some integrated terminal, this is for you. We have other stuff as well. Go check them out for sure. These are the stuff that people built. We have Pokedex here. It's a Pokedex Tui. You can just browse, pokemons, show off to your coworkers. Yes, it's pretty cool. This is something that I discovered yesterday, actually. Someone built a full-fledged game in retotui, and you can just have this, play this in your terminal. It's about space pirates playing basketball across the galaxy. You just choose some planet when you start the game. You build your character, like the skin, whatever. You select your spaceship, and this is the main menu. You can just take a look at stuff. I haven't played it, sorry. It's pretty cool. We have Outtwin, pretty cool project. It replaces your existing shell history with Scala Database, and a bunch of other cool features, shout out to them. Lastly, this is one of my favorite full-stack projects. This is a website which was built with retotui, with U framework, and it provides two aesthetic in the browser. The backend uses Axum and REST API, and it's just a Mongo database. You can go to this blog to read more about how this blog was built, actually, so definitely check it out. We have other cool stuff at Altsum Retotui repository. You can go check those cool stuff as well. I'm running out of time, so I have to run a bit. The feature, what we are going to do with Retotui, we are improving upon feedback, so definitely let us know if you have some feedback, if you tried Retotui, and if you think something sucks or anything. Just let us know. If you don't think anything sucks, then consider sponsoring us. Meeting this call will allow us to work Retotui on more, maybe part-time or full-time one day. Thanks goes to all these wonderful people who contribute to Retotui, and we have a lot of contributors, and those are some stats. We are just happy to have you as well, if you want to join our Discord as questions or if you are interested in contributing, go for it. Thanks to our team, which are building Retotui right now, and thanks to Florian for creating 2RS in the first place, which make all of this possible. And lastly, if you like my open source efforts, and my projects or anything, blog posts, consider sponsoring me. Let's hit that goal. Let's go. Thank you for your time and attention, and I hope you enjoyed. you
WASM 101: porting a Sega Game Gear emulator to the browser
So we have Anis Astier is going to tell us about the Wazem 101 which is very nice to put in. Thank you very much. Thank you. Thank you. A quick presentation. My name is Anis. This is not my first talk. This is my first time here in the Rust Dev Room. You can find my social media here. Follow me if you want. I've been learning Rust for five years on and off. I wanted a bigger project to learn a bit more about Rust. I said why not write an emulator. I started this project. This is a Game Gear emulator. The Game Gear is this small device. I don't know if you've ever heard of the Game Gear. So yeah, it's a Sega handheld from the 1990s. So this is the name of my emulator. Gears you can see. It's written in Rust. It depends only on the standard library. It has a native UI. This is how it looks like. It works. After I developed this native UI, I thought maybe I should port it to the web. To do that, I would need to use WebAssembly. So quick show of hands. Who here has never heard of WebAssembly? It's interesting. Who here has heard of WebAssembly but never used it? Who here has heard of WebAssembly but never used it and developed things with it? Oh, many people. Okay, quite interesting. So WebAssembly is a kind of a new platform. You can think of it as a new platform, a new to port code. It defines the text by code format. It's a take on the Java Compile 1, so whatever your system. It works in the browser where it's as secure as JavaScript, it's sandbox. It also has many other use cases. You can work on servers. You can use it in fast. It has many use cases. So I want to port my emulator. So there's this first level which is how do I build my code? How do I compile it? So let's go through this journey. How do you compile WebAssembly? I assume you know about Rust, but if you don't, usually you install Rust with this tool called RustUp. You need to add a new target with RustUp. Then you also need this tool called WasmbineGen which will bridge your WebAssembly code with the JavaScript world and generate some things. Use RustUp. You use WasmbineGen to build your code with the new target and then you use WasmbineGen to generate a directory with JavaScript. You serve that with an HTTP browser and that's how it works. You don't have to use WasmbineGen directly. You can use tools that integrate WasmbineGen and call it. There are many such tools that have selected a few. Wasm Server Runner. It comes from the baby community. You have Cargo Run Rasm. You have Trunk which is even higher level and Wasm Pack which is from the Rust Wasm project. I won't go into the details. You can find the comments on how to run them here. I did a quick comparison of those tools from let's say the lowest level tools to the highest levels. WasmineGen, everyone uses it. It's like the reference tool. Then you have a bit higher level tools and then more open-united tools like Wasm Pack and Trunk. Wasm Pack will generally be used to generate libraries that you can use from the JavaScript world whether with NPM for example. Trunk will integrate even more things like compress your HTML assets and things like that. You know how to build. How do you run the code? You usually write a binary. You have a main function and the entry point of your main is how it works. Or you can build a library and usually you annotate your entry point with WasmineGen macro start and you say, okay, this function is my entry point. You start executing from here. We know how to compile. Let's continue porting our application and go to the second level of porting the emulator. This emulator I've written called Gears for the desktop UI. I only selected dependencies that work with WebAssembly. So the whole wasmineGen wasmineGen was capable. They work with the web platform. Have pixels, WinIt, simple, Giller-S which is for gamepads. We'll go deeper into that. They all support WebAssembly. How hard can it be? It should be very simple. Well, it depends. For pixels and WinIt, pixels is a library to make a front buffer, basically a front buffer library that's GPU accelerated. So you can write pixels to coordinate and then it will run that with WGPU. Pixels use WGPU. It's another great to do the rendering. In order to work on the web, you need to enable the WebCR feature of WGPU. In the future, it will also use WebGPU, but that's another subject. The initialization of pixels is also different because it uses WinIt and WinIt needs to be initialized differently if you want to render your UI in canvas in the browser. Last but not least, the initialization of WGPU is Async. So in my emulator, I never used Rust Async. I needed to add that. So I used WebAssembly. Gen features to bridge the Async world from Rust to JavaScript promises. To part the audio part, I'm going to use the WGPU. I'm going to use the CIPL create, which also works on the web. This is a reference create to play audio. It needs to create feature as well. There were also some challenges because maybe nature started directly and if you use a browser, you can't start playing audio directly. That's actually a good thing because it means you can't play audio on anyone's browser without interaction. So you need to have interaction. The user wanted to do this action. Another issue I had with the standard library is I used NPSC channels and they don't work on the web platform. So I wrote a quick channel myself because it was in the core. There are other channels that work on the web platform. But I prefer to implement something with no other dependencies. For time, usually for synchronization in an emulator, you need to know about the current time. Just like for the channels, in the standard libraries, the time API are not available on the web platform. So there are crates that do the bridge. I used the instant crates. You can also use web time, which also works. This is the code, the use code if it's for the was 32 target using instant, if not, use the standard import. For Gil arrest, which was very nice, there was no action needed in order to support working the browser. Everything worked out of the box except the gamepad API, I would say, on browser is not as much much mature as on native. So there is some rebinding to do. There are good reasons for that. For example, browser don't want you to be able to fingerprint someone with the gamepad API, but then it means the bindings are not mature enough. Not the bindings, but the key bindings, which is something else. And then during porting, I also had bugs that were entirely my fault. I used a bit to turn it into a huge bit, but I didn't use it too much. I used to make it too much new size, mostly because I like to index slices. That's what you need to do in the slices. Wasp 32, as it says in the name, is a 32 bit platform. So I had overflows when I had multiplication, additions, it grew bigger than 32 bits. All these were codes because in my project, in my cargo project, I had a lot of defaults in Rust. And yeah, it worked well. I just replaced new size with 64 when it did. And that's it. So let's take a quick break and let's go through a demo of what it looks like. So just for first then, brought to you again, which is this one, I will lend it to you for a few minutes. It's for them exclusive. I recommend you play this demo on, not necessarily on mobile, it will work, but you won't be able to control it. So maybe more on desktop browser or anything that has a keyboard or gamepad controller. So I'll let you a bit more time to load it. It might not work for you if you don't have WebGL enabled on your browser, but otherwise it should. If you have Firefox or Chrome, here's how it looks like. So I've loaded the page, it's play, and basically the emulator starts. If you have audio, it will play audio. And yeah, this is what you should see. Okay, yeah, it works. I can play it. Who here successfully runs the demo? Just a quick show of hands who managed to run it. Okay, thanks. Okay, let's continue. So we have this porting. Okay, mostly worked. I showed you. It worked. There were a few tricks, I picked along the way. There's not mandatory, but let's see what we have here. First thing, if you're used to debugging like me with println, you print code on terminal, it probably won't work as is on the browser, so you want to use the Web console. There's this console log crate which does the binding of the console. If you use the log crate, it's really well integrated with the log levels and things like that. I also recommend that you use the console error panic hook crate. This one helps show when your program crashes, for example, I showed you the overflow checks it can panic. It will show you basically the panic in the console. That's how you register a console panic hook. Another trick I picked along the way is the cargo config. For this demo, I showed you, there's a bit of a problem with some interactions. Some API I use, which I use directly from Rust, and I use the Web Cyscrate, which allows accessing those APIs for this demo. In order to be able to access those APIs, which they are considered unstable, and you need to add an environment variable when you build, which is a bit annoying to add every time. You can add this Rust flags directly in your cargo config.tamo. This way you can build with cargo builds. It will work. Another trick if you use to having VS code or integrated development environments, you probably are using Rust analyzer. If you have code that works on multiple platforms like me for the native, there's WebAssembly, you probably want to tell Rust analyzer to build as a tool. You can do two different architecture. This way you have completion on the WebAssembly part. This is done as well in the cargo config.tamo by specifying multiple build targets. When you build it, you will have multiple build targets. There are some drawbacks for that. It won't work with the workspace member. It must be at the root of your workspace. It also means that when you use cargo run, since you have multiple targets, cargo run will say, oh, no, you have to pick one target in order to run, which makes sense. It can be a bit annoying. So let's go with what did I think of this experience of putting this emulation. What's my feedback? I would say in general it's very easy to port standalone code to WebAssembly if you're using Rust. I did not change anything in my app's architecture. The total port took a few hours over a few days. As I told you, I did custom code for initialization, which is I think, and for DOM interaction, which is the demo you've seen. To go a bit further, what I won't talk about in this talk is how to build a web UI, for example. You probably want to use U or Laptos because I don't recommend accessing DOM APIs directly. This is very ugly, not really ergonomic. I did it so you don't have to try. Those library developers do a great job to do that. I didn't try building a complete UI. As you saw, nothing is configurable, etc. I'm thinking of building a UI with slints or a GUI, but I'm not really satisfied with the current status of font rendering. I know it's something that's being worked on. Just like as well, minification in web size is not web-specific. There are many Rust tutorials you can find on minification, and I didn't do any performance measurement. I can tell you that it works. It also works on native. But I don't have any special feedback for that. That's it for my presentation. Thank you. We have a question. Yes, I have a question. When you build websites today, they have to be responsive. You use media queries in CSS style sheets to adopt to different kinds of resolution so that on the mobile tablet or desktop, it still looks nice. Can you also do this in web assembly that you would say if I run the game in portrait or landscape mode, or if I do it on a bigger screen, that it takes care of the resolution? Will it also scale the graphics accordingly? There are multiple aspects to that. If you're building a web UI, you probably do that with CSS. If you use leptos or you, you will be able to generate HTML whether on the server or on the client. Then it's basically the same thing as web development. You have CSS, you scan this HTML directly. For this demo, this is an emulator. It's a bit specific, especially because it's a full-screen application. So basically it takes the whole width of your screen, and that's it. That's how it works on mobile and tablets and desktops. But it's not that you can combine those and that you can also do something in JavaScript or CSS. You can do that. You can find tutorials on the Rust-Waston book. You can look at the Rust-Waston guide and on the Rust-Waston project, which is this URL. You can find information on how to bridge the two worlds. If you decide to use a crate, as I recommend, like you or leptos, they also have a lot of documentation on how to do that. I understand. Maybe a general question. Why did you choose Rust? Did you also consider programming in C++? Or are there any advantages of using Rust compared to C++? That's a great question. It was actually covered in other talks, but usually I like using Rust because it's a very nice language. It has nice ergonomics. It's fast and native. It has more safety guarantees than C++. A great ecosystem. Thank you. You're welcome. Any other questions? I'm curious what your main loop looks like. Do you spend all the time polling for events? Do you get called back from the browser? Does the browser hang if you never sleep? That's a good question. I did not modify my main loop, but mostly because I used Winit. I used a Winit event loop. This is specific to the Winit crate. Nothing was modified in the main loop. It spins. I don't remember how many times, but basically the length of a frame every time, and then it gets refreshed. Yeah, that's it. And that's all the time we have. Thank you.
A Deep Dive into Tower
Welcome the next speaker, Adrian. He's going to talk about tower. Hello everyone. Thanks to the organizers for having me. I've been working in Rust for about three years now, mostly contributing to QuickWidth. QuickWidth is an open source distributed search engine for log and traces. We'll be presenting tomorrow in the monitoring and observability room. And if you want to follow along, you'll find a slide on this website. There will be quite a bit of code during the presentation, so that may be easier. So as you know today, we're talking about tower. It's a crate for building modular and networking clients and servers. If you're using widely used traits in the Rust ecosystem such as AXM, WAP, Tonic, you're using tower under the hood. Maybe you don't know it. And everything's based on the service trait that I'm going to describe at length during this talk. But before I do that, I would like us to take a step back and think about why we need tower. So I want you to think that you are a web developer. You're working with any web framework in an imaginary dynamic language. And you ask to write a simple handler to get a user from a database. So you probably would come up with something like this, a function that you would call getUser that would take your request in, add some logging, get the user from the database, save the response, log again, return the response you've done, and you're pretty happy. Pretty quickly, you would realize that maybe there's another way to do it. Maybe you could decouple the fetching the user part that's strictly about the database and all that stuff from the logging. Like the pattern wise, I'm going to log something before the request, and I'm going to log something after the response. It doesn't need to know about everything that's user related. So maybe you would write two functions. The first one would be with logging that would take two parameters, the request still, but also a handler that would be a generic function that also that returns a, accepts a request. And then you would write your getUser simply. It'll be a bit more simple, very focused on what it has to do. And then you would compose your with logging function with your getUser function, and you would achieve what you had achieved before. But now you have that other function with logging that you can use for like all your end-to-end lists. And isn't that the sense of programming? We want to write generic and reasonable functions that are really easy to compose. For the purpose of this talk, we're specifically interested in decorators. So the decorator pattern is basically when a function or a class wraps another function. It applies new behaviors before or after calling the inner function. It doesn't know about that inner function. It's totally opaque, and it doesn't matter, we don't care. It doesn't modify the behavior of the inner function. And you've heard, you've used decorators sometimes with different names, very often millwares in the context of client and servers, proxies. This is a term that's coming up a lot. And then what you can do with those is you have your handler at the bottom. You can also call it a leaf, and you stack millwares with different behaviors on top of it. And you apply different behaviors depending on what you need. And then you can build a nice library of millwares that you can share, that you can use from different libraries, and so on and so forth. In that example, we were using a dynamic language. It was really, really easy to do. Duc typing gave us great flexibility. We don't really care about the types of inputs and outputs. When we code, we think about it, do they match implicitly? If they do, we think it's going to work. We ship it to production. We cross our fingers. But implicitly, everything should match and then everything should work fine. So this is what we're trying to do with TauR in the Rust ecosystem. But this time, we still want to compose functions, but we want to do it in a very type safe manner. And we want to be still very flexible. And this is when the TauR service trade comes into play. It is the common interface that allows us implementing components in a protocol, agnostic, and composable way. So let me describe the trade. It has one generic parameter, request. It's called request by convention, but it can be any type and anything. An associated type, response, same thing. Cold response by convention, but it could really be anything. Error type also could be anything. And finally, a future type. The future type is constrained by the future trade. So it has to be a future. And you'll notice that the output of the future is fixed for you. It has to be a result of the response and the error that you define. So the only choice is choosing a future type. The output is always chosen when you say, I want this kind of response, this kind of error. And then in the trade, you have two methods, the pull ready function and the call. Call takes itself mutably, accepts the request, return the future. The future you define. So your mental model should be much easier than this trade. It's just simply be, this is just a generic async function. This is the same thing. The trade gives us that flexibility that we wanted and gives us the type safety that we wanted. But really, this is the mental model of what is a service trade. With all of twist, you have that additional pull ready function that needs to be called once before calling call. And it provides a way to provide back pressure to the caller. So the pull ready implementation for a service without any external dependencies for the hello service that we're going to define is very simple. Pull ready return is always ready and returns okay. If you're implementing something else, if it's not ready, you return pending. And later on, the task will be pulled again. Hopefully, it'll be ready. If you're writing a service that has external dependencies, for instance, you depend on a database, you need to acquire that database connection. So all the services that have external dependency need to load up upfront to receive the need. And so they can say when they're ready or not to the caller. So in this example, if we haven't acquired yet connection, we will pull our pull of connection to acquire a connection. We will save it mutably in our internal state. And then later in call, we actually consume the connection using take. So that's why the tariff trade is mutable because it's not necessarily state less. It can be stateful. So you can manage your resources. And you see with this example why you need to call pull ready always at least once before first before call. And for the services that I'm you know, where's so they basically wrap an inner service, you usually implementation is really straightforward. You delegate to the inner service. So you call pull pull ready on the inner service and you're done. So if you're not doing something too fancy in your service, usually is going to boil down to those three use cases. So it sounds it sounds simple on paper. Why does it get complex? I think people are a bit afraid of like towers tower and towers services in general, could they use a lot of generics. So it's a bit of it's not always easy to write. It's even hard to read and it's even harder to write. It's using all the the rest construct heavily. So you have to be comfortable with lifetime lifetime, sensing market rates, what is a thing is going to expose you to that. So if you've been using rust as a better C++ trying to avoid those concepts, writing towers services are going to be a bit challenging. But but once you get more comfortable, it'd be very rewarding because like because you're going to be exposed to those concepts. At some point, muscle memory is going to help you and you're going to feel more and more comfortable and you're going to get a bit of rust programmer. And in some cases, if you start writing your own futures for your services, then you need to know about future pulling and pinning and it gets very complex. So the only way to get better understanding services and writing them is just to start simple with like a hello services and build upon until working like more complex services. So this is exactly what we're going to do during the rest of the talk. So let's implement hello service together. Remember the mental model is just this hello function. This is what we want to do, but we want to do it the other way. So the input parameter is a hello request. The hello response will be returned and we try to implement that simple function. So we start defining a health struct and we start implementing service for it. The input parameter is a request. The response type is a response. The error is going to be a little bit specific here. It's going to be infallible because we're never going to fail with just printing stuff. And now we have to choose our future type and we have various options. And the first thing we can do is go with a box future. We can define our own type alias or we can reuse the type alias from the futures scrape, for instance. And why would we do that? When you get started with writing your own service, I would recommend starting with that because it's pretty easy, it's very readable. The cons are you pay a small fee for allocation in dynamic dispatch. It's fine if you're not, if it's a client or server that doesn't have an insane amount of QPS, it's totally fine. Sometimes people are afraid of allocation in dynamic dispatch. If you're working on a client or server that doesn't need working in IO, we're talking milliseconds. That allocation, dynamic dispatch is going to cost you microseconds, maybe even nanoseconds. So you should not worry too much and if you start worrying, you should measure before going for the box stuff. I would want to say, it's writing your own future is way more fun. So sometimes it's a little for my own future, just my own personal fun. So box futures are good chose for applications, less for libraries. When you're writing libraries, you want to decide the users of your libraries whether or not they want to incur some overhead. So it's better if you don't use box futures in your libraries so that people can opt out of the overhead. So in this example, we upsold box futures. We need to notice that this box futures as a lifetime that we'll choose and it has to be sent as well. That will become important later. So we choose the static lifetime. The service, the tower service rate is not generic of a lifetime. So we have to go for static. Then we write priority. It's always ready. So it's pretty straightforward. And then we write our future. So we declare an async move block in which we build a message and build a response. And then we box that future on the hip. We pin it with box pin and we should be done. This is how you would unitize that service. You instantiate it. You call ready. We're using the extension trait that makes it a bit easier to work with. So you call ready and then you call call and it works. So it's obviously a bit more complicated than just writing that hello function that I showed at the beginning. But it's all doable. When it comes to choosing a future, sometimes you have the choice to choose one from a third party crate. Futures has some ready to go futures. You can use towers as well. So sometimes those futures are going to fit your use case. You don't have to go for the box. You don't have to go for implementing your own. You can just reuse one. It's convenient. So in this example, we can use the ready trait from the futures crate. So this time, you do see my pointer crate. We change the return type here. It's ready and this time we return ready. So we got rid of the allocation and the dynamic dispatch. And it's actually more readable. So we built a simple hello service. We're going to build a middle one on top of it. That's our logging service. We want our logging service to work with our hello service. But ideally, we like it to work with any kind of service that implements the trait. So we're going to make it generic. So logging has an input parameter called s that will be the inner service. And we start implementing service for logging. What we want to do is calling call on the inner service. And we can only do that if it's a service itself. So we need that now. And we say the inner service is also a service generic over r, our two generic types here. Then we implement already. We delegate to the inner service pretty straightforward. And then we implement call, starting with using a box feature. So we built the inner future calling call on inner. And then we create our own future using an instinct move block. We do the logging. We evaluate the future here. So here on this line, we build a future that's not evaluated. It's actually evaluated here. Then we do the logging. And we respond to response. We box in the outer future. And now we go back to our terminal. We run cargo test. And it should work. Except it doesn't. We've omitted a little technicality. And that technicality is coming from what I said before. The box future must be sent. So the future that you return must be sent. So the inner future must be sent as well. So we need to tell the compiler that constrate. So not only we have constrate the service, s, we told the compiler it is a service, we also tell the compiler the future that an inner service is going to return is also sent in static. And all this time it works. So this is how you would instantiate your logging service. It's wrapping the hello service. And then you call it the exact same way. It's obviously not going to show up during the unit test. But now you have logging on top of your service. And that logging service is usable for some other services. Tower HTTP has a bunch of services that are ready for metrics, tracing. It works a lot like that. So we're going to do this again this time rolling on our own future. Because it's fun. And because when you're going to recode from AXM to NIC, you're always going to encounter those handwritten features. So it's good to get used to them. Unfortunately, writing a future from scratch is actually non-trivial because of the whole pinning thing. And this is not a talk about writing futures and me explaining to you pinning because I don't totally understand it myself. But I can be very practical and tell you how to do it. And I'll show it to you right now. So now the logging is going to happen in the future. So before we were wrapping a service with a service, and the idea is the same. We're going to wrap the inner future with our logging future so we can add behavior to the future. So you can add behavior to your service, but you can also wrap the future and do stuff after you're done pulling or before you were about to pull the future. And this is what can be tricky with our terrorist services is sometimes, for instance, you take rate limiting. There's a bit of logic in Pall Ready. There's a bit of logic in Call and there's a bit of logic in the future. So it's hard to understand where's the logic that's really to the business, what you're trying to achieve versus what just really to trying to write a service. So that's also part of the complexity of our Dillon and Vittorio services. But back to our logging future, it's going to wrap another future. So it's going to be generic over F. We need to use the create pin project to pin the inner future. So that's why you have the pin attribute there. And now we, that's how our service, the logging service now is going to look. We replaced the future type with our logging future and the use which relies on the inner future. So s call and call in future. And then in call, now when we are called, we do the first logging statement. Then we build the future and when the future will be pulled already, then this is when we'll actually add the last logging statement. So this is how you implement future for logging future. Same idea. We need to add the constraint and tell the compiler the inner future is also a future. It's output is the output of the inner future. We need to project. I don't actually know where that the term comes from. But you need to project self. It's usually a convention is the convention is to call this. And this gives you the same object. But when you use the inner futures, they're actually pullable. Because pull is not defined on the future type. It's defined on a pin future. So that's why you need to project which pins the future, which allows you to pull it. So those are the technicalities that you have to deal with when you're writing your handwritten function. But once you've done that, it's exactly like writing the normal call, I want to say. So we pull our future. And if it's ready, we add the logging statement. If it's not ready, it means it's pending. It's going to be pulled again later. And also we know that when a future is pulled and it's ready, it will be no longer pulled. So we know that this statement will appear only once. So let's build on top of that. Let's build a timeout service now. So same thing. We want to add a timeout to any service. It's also generic over s. But the timeout service is interesting because the logging service is pretty simplistic. It doesn't mutate or touch anything. The request not in the response. It's like it did not go through it. The timeout service is a bit interesting because you have to signal the timeout somehow in your written type. And the way you're going to signal the error is potentially using an enum. And in that enum, you will have two variants. The first one would be the timeout. If timeout appears, that would be the error. But if the inner service returns its error itself, you have to wrap it in inner. And then the caller will know if the error comes from the timeout with the inner error. That looks like a good idea to do it this way. And you can totally do it this way. The problem is if you adopt this pattern for timeouts, authentication, rate limiting, the nesting of all those errors in the inner is going to become pretty complicated. And really hard to compose. And it is easy to deal with. So what libraries usually favor is boxing. So if you use the tower services that modify the error type, they will return the tower box error. And it's much easier to compose. The downside is at some point, you have to donk at the error to know exactly what happened. So we're going to use this error type this time to implement our service on future. So it's very similar than before. The difference this time is the error needs to be boxable. So with other compiler, I can only be a timeout service for services where the inner error is boxable. The polarity is a little bit annoying because it's still delegating to the inner polarity function. But you must not forget to convert the result that is an s error to a box error. So that's why the result returned by the inner service must be, its error must be mapped to the box error. So that's why we need this. Now we implement calls. So we take a look at what we do as for our future. So our timeout future is very similar to our logging future. It just wraps two inner futures. One will be used for the inner future and the other ones will be for the sleep future. So the sleep future, you can reuse the future from Tokyo, for instance, if you're using the Tokyo runtime. So call becomes pretty simple. All you have to do is build your inner future and your sleep future and create your timeout future, which we're going to implement now. So the core of the logic this time is leaves in the code of the future itself. So some libraries sometimes split the service implementation in one module and they put the future in another module so you open the service and you're like, what does it do? Because everything happens in the future. So you need to understand that sometimes a lot of the work is actually done in the future. And this depends for timeout. So implementing future for timeout future, we, as before, we project self into this. We pull the first future, the inner future, and if it's ready, great. It's now the timeout, we return pull ready with the result. We don't forget to map the error, the potential error into its box. If the inner future is not ready, maybe it's taking too long. So we need to verify whether a timeout occurred or not. So we pulled timeout future this time. So if it's pending, nothing to do, we'll be pulled again, we'll go this work again. If it's ready, it means this is an actual timeout. So we need to return the timeout error and we return elapsed error, but boxed. Hence the hint too. So now we're going to see how you can stack services together. I could obviously stack my timeout on top of my LO and then add the logging on top of it, but I can also do it with services that are already ready in tower. So I import them directly. So I import the concurrency limit, the timeout, and I compose my service by wrapping everything on top of each other. So instantiate service, I add a concurrency limit on top of it, and then my timeout, and then the logging. So that involves a bit of body play, but it's not too hard to do. What's really tricky and the compiler is not going to help you with that is the order with which you wrap your services actually matters. You really want your logging to be on top. If your logging is like logging how long it took, you don't want logging to be in the middle. You really want it to be on top to capture the whole life cycle of the request. And it's probably the same thing. You want timeout to be applied above everything that's doing race limiting. It would be better to, if you had like a notification service, it would be better to put it pretty high up because if you have services that are below it, maybe they can do whatever they want. So this is what's tricky when stacking layers and the compiler is not going to help you there. So you got to be careful and you can hopefully rely on a good reviewer for like a double check. I wanted to leave a lot of time for questions. This talk is called a deep dive into TARRA. I realized that in 30 minutes it's actually not that easy to cover everything that we could cover when we talk about TARRA. So let's just call it a shallow deep dive into TARRA. If you want a real deep dive, there are really good resources out there. Two years ago, not that much, but there's really great content, blog posts on YouTube that I'm going to talk about a little bit that really can help you get really, really comfortable with TARRA. So there's something that I don't talk about today called a layer. A layer, TARRA people are pretty obsessed with composability. So a layer is a way to compose a stack of services. A service builder is a way to conveniently build small stacks of services. A service builder helps you get rid of the boilerplate to write this, for instance. So if you want to keep studying TARRA, I would start with layer and service builder. Then I would recommend reading, inventing the service rate. That's a blog post by David Pedersen, who's a TARRA contributor. It's a really good introductory blog post. Maybe you can start with that. It takes about 15 minutes. I find that the AXM documentation page about MinoWares is really, really good. So I think that's a really good resource. And then in the spirit of building and reading and writing late services that are more and more complex, I think the next step for us is looking at the rate limit service. The concurrency limit is also a bit more complex, but really interesting. And then you can look at the channel from Tonic. It gets pretty complex as well. And then if you really want to take it to the next level, think the pool in TARRA. So here we're only showing that a service that wraps just one service. But now you could wrap multiple service. And that's what the pool adds. And the pool is a service. It wraps multiple services. And it's using pool ready to track the load of each inner service to handle back pressure. So it's a really interesting use case of using pool ready in a very fancy way. So pool is also very interesting. Good videos on YouTube. Also David Pedersen has a TARRA stream about tower. He goes through the same thing. There's a hello and logging and time out. And you see him dealing like the old delivery things. Because obviously on presentation, I get everything right on the first try. But it's not exactly how that happens when you either yourself. And if you want to get more comfortable with a single weight in rest and futures in general, John has a great talk on YouTube as well. And you'll find those resources on the slides that are available at this thing. Back in the first few slides, you mentioned the pool ready function. And you were initializing a database connection. I didn't get it. If the database connection is actually a future or not, because the pool ready function doesn't return a future. Is you talking the database one? Yeah, that's where you initialize the connection. You know, you depend on a connection. Yeah, it has to be a future. I mean, no, it doesn't return a future. But because that one, yeah, this one. Yeah, I'll pass quickly because it's a bit complex. So already doesn't return a future. But it behaves like one because you pass it the context, the context as a waker so you can rebuild future really easily with access to the context. And that's I'm happy you asked you asking this question because if you look at come back here. Talk about the concurrency element. The concurrency element uses a semaphore. So a naive way of implementing pool ready with a semaphore is would be to call try a choir. And if you get the permit right away at school, pool ready is ready and just save the permit and then you get rid of it in call once you're done. But a better way to do is to start pulling the future. You don't call pool ready. You call ready. And if it's not ready yet, it's okay. You keep you still maintain that future that you started that started acquiring a permit. You save it in your internal state. And then the next time you keep pulling that future instead of each time doing try, acquire, try, acquire. So it's really unique to trick that you can see in the concurrency element service. Okay, cool. Thank you. When is tower gonna go version one because now it's on zero and hyper just ditched it because it's not one yet. So I'm not a tower contributor. So I don't know exactly. I think they're waiting for the the Khmer people to give us the full story of what is sync is gonna is gonna become. You'll see that with the last release of rust now you can return an implement implement future future, but they still like the weather it said or not to to deal with though. So eventually, I think they want they're waiting for see what we can do in rust. And then then what once the old async work is stabilized, I think they will come up, they will come up with a new version of the trade. Like with those new those new constructs that are being released and will be released in the next version of rust. We can greatly simplify the service trade. So they will wait for this feature to land, revise the trade and then release 1.0. And the second question is there a way to make this tower layers, let's say optional so you can enable disable them from outside the code from the compiler somehow. Yeah, I mean you you have you could use there would be different ways you could use like feature flags, you could use environment viable like they services have a stake. So and you the they mutable so you can do you can do what you can do you can do a lot you could have you should I to make bullying somewhere in there and you can enable disable it. Thank you. Yeah. I'm on the time of future slide. What time of future this one. Yes, there's a poll function and if none of them is ready, it will just return and how would it know when to call it again. Like you pulled once and none of them is ready yet. It's the runtime is going to take care of it. With it's well it depends on the runtime you're using it's it knows it's not ready and then I mean then we're talking about how you implement the runtime basically. It maintains like a queue of features and it regularly pulls them using some a waker which is it's using that context in the waker when it's ready it's going to call the waker that's going to wake the the runtime so and so forth. But you as an implementer of the future you're not in charge of like telling when you're going to you should be pulled again. You're just in charge of saying I'm ready or I'm not. That whole machinery is left to whoever implements our AC runtime. Okay thank you. Thanks for your talk. I had a question because you mentioned that there was some overlap between implementing your own future and the tower service trade itself proposing like a poll function and a poll rate function. Is there any reason apart from fun to implement a future trip yourself like is there any benefit from dealing with the internals of creating a future versus implementing your whole logic inside of the service trade itself and just using the basic tower or basic futures future types. You mean a box of doc futures? Yeah exactly. Your question is like why would I bother writing my own future when I can use box? Yeah when I have this poll function inside of a service trade itself where I can bake my own logic and not have to deal with you know projecting my pins and so on and so forth. I'm not sure I totally understand your question so my answer might not satisfy you. I think you're right you want a future if you want you're writing a library and you don't want you don't want your users to incur the overhead of like the box future. You write it if you have like some constraint that might be related to performance that's going to push you to write a hand rolled future that doesn't allocate it doesn't have dynamic dispatch. Yeah that would be my answer. I think it also depends on your team if like if you're the only guy in the team that's going to understand like oh I pin the I project the future and are you spinning if just you maybe maybe you don't do it. It depends on those constraints. Okay thanks. Awesome so if there's no more questions let's thank our speaker. Thank you.
Embedding Servo in Rust projects
Sorry you picked a missed call. ... Yes. Rocky. Rocky. We'll close the doors. Thank you very much. So anyone that knows some history of Rust is going to be very excited about this talk, and Rocky is going to tell us about what's going on with Rust with Servo. Hello. Hi. Thank you. I am Rocky and I'm software engineer. I work on Servo at Tegalia. Before even I start with the talk, I'm very curious about the audience. How many of you are writing Rust professionally or full-time? Many of you. And how many of you are writing personal projects, interest? How many of you are coming from front end or back end world? I still see some. This is like perfect audience for this talk. Normally when I start a talk, I tell about the project I'm going to talk about. But today I want to start by answering some questions because people have questions. What is happening? Is it bad? Is it lie? What is happening with Servo project? I'm simply going to take you a bit back and walk you through. So I was journey in this slide around 2012. It started at modular research around the same time when Rust project also started. They were quite working together actually. And people who have been active in Rust community or have known about Servo project, they knew Servo on 2016, 17, 18, what was happening. But in my previous slide, these questions came in when we were in 2020. A lot was happening in 2020 but this also happened when Muzla's layoff impacted Servo team. And this kind of affected the whole team and the future was not that bright as we thought. Around the same time, Servo project joined Linux foundation. There were few people from Servo team who were trying to maintain the project in their personal time. But that is not enough. Servo project is huge and that's not enough. It needs funding. It needs more people, expertise and many other things. Around 2022, we started restarting Servo. I just mentioned it needs lots of funding, lots of expertise, lots of people. Who is going to start? In 2023, a team was formed at Igalia and we restarted the Servo project. So what we did in 2023, like list is not huge but I want to keep it small because this is not really focus of the talk today. We restarted the project first half of the year. We were trying to maintain the project, take it out of the maintenance mode actually. Tell people what is happening with project and we have restarted the project. That's what I'm trying to mean thereby outreach. Make it easy for new people to contribute because an open source project is literally nothing without the contributors. We started work on layout engine. We started shipping CSS to features. We also had to make a choice between layout engine. We in Servo have two layout engines. Still have it but we haven't deprecated the last one. We ended up choosing to work on layout 2020. The old one is called layout 2030. I'm not going to talk too much about it. You can go to the wiki and you can find why we took the decision of choosing layout 2020. At the end of the project I have a QR code that will give you access to this slide. Don't worry about searching for it now. In 2023 we also worked on internal WPT so that we can track the web platform tests and build lots of Servo demos. When you are talking about a project, it is very important to show things. You can't just sit at computer and code and talk about we are doing this and that but there is no way to test it. I'm saying we built CSS feature but hey, how can I test it? There were a few Servo demos before available as well. We also moved it to the new Servo demo website. Then we did quite a lot of embedding work. We built a mini browser that is going to be focused on this talk. This was 2023. I also want to cover what we are going to do in 2024 because it is already here. We are in February. We want to continue the project maintenance and the outreach because some of you I'm sure were not sure about where Servo project is going. I'm sure there are people outside this room who are still not sure. We want to make sure that everyone in communities aware what is going on. We want to continue shipping CSS support. Right now, while I'm standing, we still have a few PRs open that is related to tables. We are really trying to ship this table support in Servo. We want to continue working on embedding API and the initial Android support. We already have initial build that runs on CI already. We landed this PR, I think, two weeks ago or maybe one week ago. Let's not go with somewhere between two or one week ago. If you see the list is pretty much similar to what we did last year. This is the focus of the talk today. I'm talking about embedding because embedding has been asked from the community for very long. I was just looking on internet, read it, hack and use, and just twitter, github. What are people asking? I ended up collecting some things. If you see that screenshot, this Servo embedding part was asked 11 years ago. This is exactly what I'm trying to tell you. I'm saying that we can't just tell, hey, we are adding support to Servo or X or Y feature without showing you a POC how this feature is working or how you as a user can test it. Last year around the summer, when we did lots of maintenance and took it out of maintenance mode, we decided we want to work on embedding. Then we ended up building a mini browser. When we were talking about mini browser, any open source project for the first step is to open an issue. That's what I did. I opened the issue. We wanted to decide how we want to move. Which library we want to use. I opened the issue initially. We had some discussion. The whole idea behind opening the issue was to get comments from the community. If they have some suggestions on using libraries, we already had this window that we were using in the code base. We can have a quick POC. You don't want to spend years building something without knowing how people or how companies or users who are going to use your product are going to feel about it. We ended up building a mini browser. I want to show you actually. If I go and do a macro, I hope you can see the screen. This is the mini browser. Just keep a mental model of how this toolbar is looking. I'm going to show you some code in a bit. This is about how we can make your life easier. This is the demo website I was talking about. You can go and do stuff there. Check out how some performance is happening. We have some demo that talks about the technical test. You can test the WebGL support as well. Certain things. Play around, go back forward. Just to give you an idea of how this looks in the code. Depending on what kind of ideas you prefer when you are reading code, I prefer to go from top to bottom and then I take myself up from there. You want to go to Sarvoshel. Once you arrive in Sarvoshel, you want to look for mini browser. I can see some code here. This initialization part is okay. This is not the focus. If you see the if condition here that we say, what is happening here is we just don't provide an option for you to have the mini browser. We also provide an option for you to disable it. It is enabled by default. In case I have to show you how it looks, you just need to pause this. You can already see how it looks. You don't have a toolbar. This is the window we also used to test how our Web platform test is looking. While you are here, we also want to look for even loop. That is an important part. Not actually run forever. Yes, this part, even loop. As like names are just, it runs forever. In this particular case for mini browser, we want to see this part. We are using VINIT. We run the event loop of VINIT that really helps initialize the window. As I was saying, I want you to remember how the toolbar looked like. I want to go to mini browser and just show you something that we have going on in the update function. This particular code that I want to walk you through. This is something that we did. Like I was saying, we opened an initial issue where we wanted to decide what library we want to use. We were already using the VINIT to create the whole window. We ended up using Igui. They have really good support for Igui VINIT and Igui Glow. This was very helpful for input and output stuff we needed to do with the mini browser. As you can see, there are just two parts going on. From left to right, we had back and forward button. On the left and right side, you can imagine the toolbar as two parts. Left to right, there was back and forward button. Then the other part was right to left. That was on the right side, you had the go button. On the left side, you had the location field. All together, it was whole toolbar. One other thing that I want to show you is how we are initializing servo. This is like, I can go to the server new. This part is like, inside the even loop dot run forever, we are passing all the data to the new function that initializes the servo. A lot is going on here. I am not going to walk you through all the code because it will take forever. The next 10 minutes is not enough. Initializing servo man, creating a thread for WebGL. One of the most important things that is going on here is the configuration of constellation. If I have to show you here, it should be here already, actually. Here is the constellation. It creates the constellation here. If I have to go and see how to start constellation, this is the constellation part. This is the Gram-Cyndra of servo. I love this comment. Someone left this comment like 11 years ago and this is still valid. From here, you can really get lost in the code and not lost in a bad way because this is the place where everything is connected. Here, you go to the pipeline, navigation, layout thread, script thread, and you can really go to layout and then the script. From here, you can go everywhere. Then I was just showing you this code right here. We are on lib.rs. This is our engine. We call it lib servo and this is the whole engine we are talking about. I want to keep it short here now because I want you to see something else as well. This was about the mini browser that we built. Next was around the same time when we were done with mini browser. We were talking with Tauri about how we can collaborate to integrate servo in the RIBE project they have. Thanks to the funding we got from NLNet and the collaboration from Tauri team, we did a collaborative work where we embedded servo in RIBE. RIBE is a library that aims to provide a fully open-source web view to users. This is the screen shot or the demo that RIBE team built that is like a hello world from servo and RIBE. If you have questions about Tauri and RIBE, Daniel is sitting here. You should catch up with him. He has lots of answers for you on that side. Thanks to my colleague, D-Lyn, for putting it all together so that I can show you today. Earlier when I started this talk, I asked how many of you are coming from front-end or back-end world. I have spent quite a lot of time in my career doing front-end and back-end work. When this embedding stuff happened, when we started this embedding, we worked with Tauri team to create this whole task, what is not needed. We shipped off-screen rendering, we shipped pre-compiled MozAngle, we still got to figure out how we are going to do the package and distribution of the shared object we have been creating for two biggest dependencies of servo, that is MoJS and MozAngle. We shipped the MoJS shared object already, but yet to figure out how we are going to do the distribution. MoJS is still working on progress. We have some work going on on the static-lib side as well, so we are going to do that as well. Before I started this talk, for this one, I wanted to see how this, as a user, how I am going to use Rai, how it is going to impact me. I started this, this tells me that I am close to the finish. This is the demo I built on top of Rai. Behind the scenes, servo is running, rendering things for you and Rai integration. This is the result of the integration work we did. Just to show you quickly what I had to do in order to run this project, I just had to write this HTML, CSS and JavaScript code. That is all. As a user, I don't need to care what is happening on the server side or what is happening behind the scene in Rai itself. As a user, I just need to write HTML, CSS and JavaScript. Things are ready. Maybe you can go ahead and try to write an input and browse the UI and maybe you will have something like that. It was pretty cool to see that as well. I was personally very happy. One of the reasons why I showed you this issue, and even we have a meta issue for mini-browser, there are some unchecked boxes in case you want to contribute. We will be very happy to help you review PR or help you getting started. This was about integrating servo rendering engine itself. We have another story with Dioxys that is doing pretty unique work. By just taking one crate, you might know about this. It is stylo-crate for CSS styles and selector matching. This is something unique because we have been talking about integrating the whole servo rendering engine into a project. This opens another opportunity where you maybe just want to use the script crate or stylo-crate in your project. You can simply do that. It is possible and Dioxys is proving it. After this whole talk, one question that some of you might have, how you can do it in your project. In short, you are literally one step away from doing it. You just need to reach out to us on Zulip Chat. If you have time, you can check out how mini-browser is working or how the integration with write took place. You can try it out with your applications, with your projects. If it works, great. If it is not working and you figured that you need us, so our team, to implement a particular feature, you can reach out to us on Zulip or you can open a discussion on servo. We will be really happy to help you get started and answer questions. We really have lots of people coming in and asking questions like, I want to integrate, we have some talks about Velo, we have some talks about certain other things going on. You can also follow up there and lots of things happening in servo. In short, that is what you are just one step away from reaching out to us. Thanks for listening to me. You can scan this QR code to get access to this slide. Thank you. Unfortunately, there is no time for questions here. I am here. Please catch up with me. Yes, I am happy to answer your questions. Thank you.
Thunderbird: How to Exchange Rot For Rust
So, if I could have your attention. When we got this talk, I didn't know Rust and Thunderbird had a connection, so this is pretty exciting and pretty cool. So we have Sean and Brendan are going to talk about how to exchange ROT for Rust. Thank you very much. Hi. I'm Sean Burke. I am a senior software engineer at MZLA, which is the company that maintains Thunderbird. And this is my colleague, Brendan Abolivier, who is a software engineer at MZLA as well. So we're here to talk about how to exchange ROT for Rust. So our colleague, Ike Dordi, couldn't join us. But I feel I need to shout him out because we would not be giving this presentation without him. And I also have to applaud his pun in the title because the project that forms the basis for this talk is Microsoft Exchange Support in Thunderbird. So we're working on adding support for the Exchange Web Services Protocol. This is the first Rust component written specifically for Thunderbird. We, our code is based on Firefox and so there's Rust there. But nothing specific for Thunderbird. And it's also the first mail protocol to be added to Thunderbird in Thunderbird's lifetime, which is a slightly strange statement. But I will explain that a little bit here. When we started this project, nobody actually knew how to add a new protocol to Thunderbird. And that gets into the ROT part of the title a little bit. So first off, a little bit of history of Thunderbird. Thunderbird grew out of Netscape Communicator originally, as did Firefox. So a lot of the code in Thunderbird predates Thunderbird itself. And the 0.1 release was July 2003. So this is a fairly old code base already. In starting around 2012, Mozilla started to hand over Thunderbird to the community because it felt that Thunderbird wasn't self-sustaining under the Mozilla umbrella. That situation persisted until around 2017 when Thunderbird rejoined the Mozilla Foundation. And so what does that actually mean for Thunderbird? We had a pretty big gap in paid maintainership, which results in, you know, a community can only do so much. Thunderbird is a very large project. There's a lot of work to do. Just keeping up with building, making sure that it's following Firefox's changes since we're based on Firefox. And that gap meant there was a lot of time where you can't expect a community to have a holistic view of the architecture of a huge project like Thunderbird. You can only ask so much time from them. And so changes were made without a view to how this would affect the architecture, how the architecture played into things. There was also a loss of institutional knowledge because the people who'd been employed to work on Thunderbird were no longer, and there was nobody there to take over for them. In a lot of places in Thunderbird, there hasn't really been any kind of architectural maintenance in over 20 years. And that also means that, you know, large portions of the code base are written in C++. C++ has changed quite a bit over the years, and Thunderbird has not kept up. So this is a pretty significant challenge, but also presents us with a pretty significant opportunity. That opportunity is Rust. So we'll talk a little bit about why we decided to use Rust. This is a room full of people interested in Rust. I'm sure most of you are pretty aware of the major benefits. We're a large application maintained by a small team, and we take input from anybody who sends somebody an email, and so memory safety is pretty critical. We do not want security bugs letting anybody have access to somebody's computer. Performance is also pretty big. We use a lot of JavaScript in our code, but for low-level stuff, JavaScript is going to have some performance issues. And then, you know, the modularity of Rust, having that built-in gives us access to a pretty large ecosystem. There are a lot of people doing mail-related stuff in Rust, and we can benefit from that. The other, the next reason is that, I mean, we are based on Firefox code, and Firefox already has Rust in it. So the build system is set up to integrate with cargo. We share CI infrastructure, and so that already has provision for Rust. And then, also, Firefox has something called XPcom, which is kind of a framework for communicating between the different languages that Firefox uses, and there's Rust support in that already. And then, Rust also kind of introducing a new language gives us permission to rethink some of the aging ideas in Thunderbird. It allows us to kind of ignore some of the more delicate code paths that have been around and changed ad hoc special case throughout the code where changing things is a little bit scary. You don't know what you're going to break. And also, I mentioned the loss of institutional knowledge. We need to rebuild that, which means a lot of documentation, and personally, I love the documentation tooling that Rust provides us. And I think that helps a lot in moving forward. But as with any project like this, it's not just, okay, we're going to use Rust. Cool, we're done. We're good to go. We had some problems getting started. Part of that is just we have a large existing code base, which means we have existing patterns. A lot of idiosyncratic async stuff going on that doesn't integrate nicely with idiomatic Rust. Lots of features and capabilities already in the Firefox and Thunderbird code base, which don't have any sort of Rust binding, or sometimes some kind of painful Rust bindings. I mentioned XP-COM as a benefit, but it also became a little bit of a drawback, particularly in terms of developer experience. Over the years, Firefox has excised a lot of the XP-COM interfaces just because it can be a little bulky, a little bit painful to use them sometimes, even in C++ and JavaScript. That work never happened in Thunderbird. We have a lot more uses and huge uses of XP-COM than Firefox. And so what works well for them in terms of developer experience doesn't work for us. It's really painful for us to use XP-COM at this point. I also mentioned the build system as a positive, but in a big way that became a drawback for us because in order to deal with the fact that Firefox has a C++ entry point, no single point of entry for Rust, there's a hack put in place to build a single workspace and shove that into the Firefox code. That hack, we're built as a subtree of Firefox rather than having Firefox as a subtree of our code, which is a little bit unusual. Cargo doesn't like it when you try to have a workspace inside of a workspace. We're not in the same repository as Firefox, and so we can't change their cargo lock, we can't change their dependencies. We kind of solved this by basically stealing all of their dependency tree and merging it with our own, building from within our code and using a script to keep that up to date and hope things don't break so far, so good. With that, I'm going to pass it off to Brendan because... Now we can use Rust in Thunderbird, we can build Rust in Thunderbird, we can run some Rust code in Thunderbird thanks to that work to integrate it into the build system. What do we do with it now? It is good to answer that question, it's good to think back to where we're coming from, what we're trying to achieve with that, and our end goal with this work is to be able to support Microsoft Exchange in Thunderbirds. We want to support more specifically something called EWS, which stands for Exchange Web Services, that's Microsoft's proprietary protocol for interacting with Exchange. That protocol is based on XML or HTTP, so it's up to be more precise. That means that we're missing a few key code infrastructure in order to make this a possibility. First, we want to be able to send HTTP traffic and preferably we want to send it through something called NECO, NECO is the networking component of Thunderbirds and we already have a well-functioning networking stack, it would be a bit sad to completely bypass it. We want to be able to communicate, to interact with NECO and to do it in a way that is familiar and easy to use for Rust developers. Once we have the capability to send those requests, we also want to be able to fill them with the contents that we need in this case XML. We need to figure out how to serialize and dis-realize XML in a way that scales to a lot of data structures to give an example of scale. EWS specifies about 100 different operations and about 1700 different data structures. We're not at the bottom of the stack which is sending HTTP requests. Because we want to interact with a specific component within Thunderbirds, we want to use XP-com, which I mentioned, the acronym stands for the cross-platform component object and its job is basically to allow inter-component interaction by defining some platform neutral interfaces and that way we can cross the language boundary, which is good for us because we want to write Rust code to interact with NECO, which is in C++. Let's use that except sending, except using XP-com with Rust directly doesn't look very Rust-like. It's mostly designed around C++ APIs and so it doesn't have a lot of the features that we can find in Rust and it means that there's a lot of boilerplates. This is the code to just send a single GET request and print the results in the standard outputs. We need to define a bunch of callbacks, we need to define a bunch of different objects and because we're crossing a language boundary, we at the very bottom, we need to wrap all that into the actual call into an unsafe block. None of that is very ideal and we obviously don't want anyone who wants to use NECO in Rust to have to do that every single time they want to interact with the network. Let's split this issue into two sub-issues that we're going to solve. The first one is we want to do native, to support native async await, Rust async await syntax. The way we do this is we added a new interlcrate to Thunderbird, which is actually the first Rust code to be added to the Thunderbird code base. The role of that create is to translate asynchronous operations in xp.com into Rust's native async. The way it does that is it defines custom stream listeners, which is that big struct that we saw earlier with a bunch of callbacks. What that stream listener is going to do is it's going to buffer any incoming data, call wake on a wakeer when the request finishes and then we can wrap that around another struct which is in charge of triggering the asynchronous operation in xp.com. Then it implements the future traits to be able to query the state of the buffer every once in a while and to return the result when it finishes. In the future, we're probably going to also implement the stream future traits in order to be able to interactively process incoming data. We don't need it immediately, so we just went with future for now. Now that we have this native async await support, we want to build on top of that to have some way to write some idiomatic Rust code to send HTTP traffic. We do that with yet another internal crate which provides more idiomatic and requests like HTTP clients. It's not a one-on-one, one-to-one replicate of requests, but it request-wise uses the main inspiration for this work. Under the hood, that crate is in charge of creating and configuring all the necessary xp.com objects, wrap that into our future and also provides more rust idiomatic error handling as well because standard xp.com does its error handling with just error status codes which isn't the best we can do with Rust basically. So that's all nice. What does it look like? So let's do a demo. We're going to do a live demo because we don't like to leave safely. So here is some code that lives in on my local checkout of Thunderbirds. It's got a bunch of code infrastructure to plug it into xp.com for the next step of the demo, but the important bit is what we can see here which is that with all clients that are from my HTTP here, we can create, we can actually create a PUS request, set it a custom header, set it some custom body, send it and natively await on it, and then we can process the response or the error depending. We're going to run this code into a local Thunderbird which apparently crashed while I was preparing the demo. Let me just do... So this is the Thunderbird DevTools. You might already look familiar because it's also the same DevTools that Firefox uses. We use it to work on the front end of Thunderbirds and access some internals of Thunderbird when we need to. So we're going to instantiate that xp.com plumbing that I was mentioning. It's basically just a dummy interface that just has a thing to do the thing which in our case is sending an HTTP request. We can see that we successfully sent a request through NECO. It's not because it appeared in the network tab which means that it went through the Thunderbird networking stack. If we inspect the request, we can see that it did include our custom header, it did correctly attach to the right content type, and it also correctly sets the right body to the request. And to confirm that once more, the server... That's just a simple stupid server that I quickly wrote in Python that... Sorry for using Python. Which just takes that custom body and that custom header and just prints something. Right. So that works. Now what do we want to do from here? We have requests that we can send and we can process the response to that request. But what do we actually put in that request? As I mentioned, we want to put some XML into that to be able to communicate with exchange servers. So we started with a kind of exploration, kind of a lay of the land of what the status is with regards to desilizing and serializing XML in Rust. And we quickly identified that most crates that we could find had some existing issues. Either they don't provide a good way for handling attributes and namespaces in XML or N slash all, they're very boilerplatey. It's fine for desilization because we don't necessarily need to process every single attribute from the response we... Or N slash all namespace, namesoces. For serialization, it's not really something we can do because obviously if you omit a required attribute or something like that, the XML server is not going to be able to understand the request. And also we not only want but need to have a low amount of boilerplate in all code because N EWS defines a lot of data structures, a lot of operations. So, yeah, dozens of operations, more than 1,000 data structures. So we don't have any small amount of boilerplate that we have. It's just going to make the codes 10 times more difficult to maintain. So we decided to create a new crate. This time it's not tied to any Thunderbird internal, so it just lives on GitHub. And so we use this... And so in this crate, we basically leverage the procedural macros in Rust to dynamically generate implementations for a trade that we also define at compile time. Almost everyone in this room will just be like, yeah, this is just a derived macro. I'm fairly new to Rust, and so when I saw that, I thought this is pretty cool, so I want to mention it. We don't want to reinvent the wheel. So we built it on top of QuickXML, which provides some pretty nice tools for writing and formatting XML. And we try to design it with a fairly low boilerplate... That low boilerplate approach that we need. So what does this one look like? This is a kind of dummy data structure that I defined, and as you can see, I was thoroughly uninspired for the naming. But this showcases some of the features that we can use in this crate. So we can set namespaces, either default or custom ones. We can set namesets prefixes. We can instruct a field to be an attribute. We can flatten some structures, and then all we need to do is actually populate our data structure, serialize it, and in our case, we just want to print it to see what it looks like. And if I run this, it generates valid XML that matches the data structure we defined here. So that's a lot of useful code infrastructure that we have now for our Microsoft extension implementation. Where do we go from there? Obviously, the next step is we want to implement the damn thing. So implement protocol support for AWS in Rust, and hook that into the Thunderbird UI to expose it to our users. We also want, if there's enough interest, to generalize the XML struct crate, the one in this slide, because at the moment, it's fairly designed around the use case of EWS in terms of configuration and defaults and things like that. So it might be something that, if there's enough interest, it might be something that we will look into in the future. And another point, another point that's another next step is we might also start working with people from the XPCOM team in the Firefox developers to try to improve the situation around bindings for XPCOM in Rust and make them just more, well, nicer to use for Rust developers. So that's where we are. That's where we're going. And thank you for listening. Thank you. So we, I think we have quite some time for questions if you have them. Yeah. Well, as I make my way over there, one question I had. If the protocol support is in Rust, do you think it's possible that it could be more shareable with other email clients? Yeah, this is one of the things that we're trying to keep in mind. One good example is we're currently, you might have heard that a few years ago, we welcomed the K9 email clients on Android into the Thunderbird family. And if we're building a new protocol support for the desktop application, we would like in the future to potentially include that support into K9 slash Thunderbird for Android. So this is definitely something that's one of the, one kind of extra reason that we decided to go with Rust is because of the ease of implementing, of reusing Rust codes across multiple platforms. And yeah. And we are going to make the EWS create public as well. Yeah, I'm going to repeat because I have a mic. We're going to make the EWS create public. And yeah. And also, you see to build it in a way that is fairly agnostic to the actual desktop application.
Fighting cancer with Rust
you you you you you different search parameters like a bit of genomic data and this is our architecture so I already mentioned lens this is what the researcher sees in their browser this is the front end and then it has its own back end which we actually call spot and in some projects the old spot which is made in Java is still running but we are about to replace it with a new spot which is made in Rust then there's these are beam proxies they're also made with Rust focus is made with Rust Blaze is a store and it is made with closure and then we have those operations which are shell scripts mostly so what is happening here a researcher says I need to find samples with type plasma that come from donors with diagnosis C61 for example and where the age at diagnosis is between 1450 for example and then that request goes to spot where it's packed into a certain beam task beam is a task broker which actually solves problems of strict network environments we face in hospitals in Germany because of the data protection concepts so on the sites which are hospitals by banks there are beam proxies which ask beam do you have a task for me and when they do beam sends the task and focus this component here gets the task then focus unpacks the task decides for which endpoint it is a task blaze is only one of the possible stores and also we can query other applications so it's not only for sorry for database types we also have another application which is exporter called exporter and one more which is called reporter so those can also query blaze in their own ways blaze is actually a fire server fire is a standard of exchange of information in e-health and healthcare in general and medicine and focus then runs the query against blaze or against some other store it gets sorry I keep clicking it gets the results return results to a beam proxy which returns it to beam which returns it to lens backing which is spot and in the end the browser gets the result and this component here Laplace this is used for obfuscation obfuscation of data is done on sites so unobfuscated data never leaves the sites we decided it was the best to put it there and we have multiple projects that actually run our bridge heads these set of applications on sites we call them bridge heads you can look later in our bridge head repository which installs all those components so we have a lot of projects those are some of the projects that actually run bridge heads this is map of Germany which with bridge heads in Germany but besides German Biobank node we also have the European version of it which has biobanks in other European countries that's bbmri eric then German cancer consortium I already mentioned and cancer core Europe which intends to facilitate a translation of clinical research into new drugs and then because children usually have different types of cancers and cancers differently affect children we have a separate project which is intended to facilitate the invention of drugs for pediatric cancers and also applying applying existing drugs which are for adults but also for those genetic markers for which no drugs exist it is intended to facilitate personalized medicine this is another project we have this is for cancer images so MRI CT pet cat it is intended to actually enable AI analysis of images and then I mentioned beam beam is a distributed task broker which enables communication with biobanks which are behind the proxies and have very exotic configurations it handles the encryption beam proxies on each side encrypt all the traffic and decrypted and it also handles certificates and it only allows outbound connections which means it is only possible that beam proxies connect to beam and then we have focus which is a query dispatcher in which the obfuscation happens so first I need to mention CQL that's what we use it is clinical quality language I know that there's another CQL which means something else so parts of CQL come from front end and currently we are working with certain query replacements to prevent CQL injections but soon we should have a translation of abstract syntax tree from lens from front end into CQL completely done in focus I'm working on it and also abstract syntax tree gets translated or rather simplified for you came the project for a medical imaging I mentioned before as I said it uses the sampler Laplace library yeah these QR codes you can scan them and you can get to the GitHub repository I hope it is large enough and also if you want if you want to get to the beam repository this is the QR code and the problem with aggregated data is still that with a search narrow enough it could be deduced in which store in which database or in which Biobank samples or data about a certain patient are stored so we need to offer a similar level of privacy to the patients who are supposed to consent they are more likely to consent to having their samples and their data available if they know that their level of privacy is the same if they are in a Biobank and if they are not in a Biobank because we obfuscate the data enough we add a small number and we round it up I'm gonna mention why K anonymity means that for each set of parameters there would be at least K patients for whom they the search would return results but that's still not enough because we can we have some rare diagnosis we can narrow the age range enough so that we could have searches return only one patient and that's why we had to do this we use a Laplace distribution with certain parameters we take a random value from the distribution we add it to every count in all those counts in all those stratifiers we get for example for each diagnosis for each sample type and this shows how depending on the values we can lower the privacy but we can make the data more usable so here we would get more higher values with B which is 0.1 and here we get more lower values but values that are closer to the true state of the database are actually more usable privacy budget is something that everybody has to decide for themselves but sensitivity depends on what is being obfuscated it is the number of those resources per patient so if it's diagnosis then it's the number of diagnosis per patient if it's a samples then it's the average number of samples per patient so we are working with 10 and 3 and 4 patients of course it is one patient per patient this is the library and it is a rust crate and we also made it Java library for our friends in Erlangen who use it in their Java projects it is highly configurable but I have included parameters that might be needed in medical informatics so of course epsilon and delta I mentioned before but also what to do with values under 10 we round them to 10 some might want to round them down to 0 or they can be obfuscated in the usual way also for zeros we have chosen not to obfuscate them that is because after the search there comes another process the researchers select the biobanks they want to negotiate with and then use the tool which is called negotiator which was made by our friends in Czechia and in the negotiator they describe the research they intend to use and in the biobank the head of the biobank or whoever is tasked with it but in any case real humans decide who is going to get those samples because samples are very valuable and once they used up you don't have them anymore and it could be last sample for a combination of diagnosis and certain sample type certain genetic markers especially so we didn't want for those biobanks that really have zero values to okay that's it we didn't want them to border people in biobanks so all our code is open source you can scan this and you are going to then get to our organization on github you can look at our other also software and if you want to join us live in beautiful Heidelberg help cancer research then scan this this is a job posting just please don't be don't the fact that German languages mention prevent you from applying because my German is still not good enough and it is not a requirement really you will be asked to learn German but the company pays for it so thank you you
The journey of hacking in a new serde dataformat
Now we have the last talk of the day with Paul. He's going to teach us how to hack a new sturdy data format. And I'm really looking forward to it. Thank you. Thank you. Yeah. Hello, everyone. Hello, everyone, for the last talk of this day and of the Rustaf room, the journey of hacking in a new sturdy data format. If you read through the abstract, there are a lot of more things in there, but the main talking points are near for us. You're 30 and I want to present the journey of building that. And before we go into, I want to also emphasize what this talk is not what I'm not going to talk about. This is not going to be an introduction to 30. It's also not going to be an introduction to Neo4S. You don't really need to know what these are to understand what I'm talking about, hopefully, but you're also not going to know what these are after the fact. And similar to the talk about tower, I'm also not going to do an actual deep dive, also a shallow deep dive. And I'm also not going to do a discussion about how to pronounce 30. So with that about me, hi, I'm Paul. You can find me on GitHub with Knut Waco or on Hackyderm. And I work for a company called Neo4J, which is a graph database written in not Rust. We do have a Rust driver though, which is Neo4RS, hence the name. That one is written in poor Rust. We're not wrapping any existing C, ABI or API from some other driver, but it's developed under this thing called Neo4J Labs, which is an incubation community focused process. So there's not really that much product engineering behind it as mostly me doing this in my 20% time with contributions from occasional other people from the company or from the community. And a Neo4J driver for a database, we communicate via Bolt. Bolt is a binary protocol that is built on top of a pack stream, which is something that we use for general data types like strings, ints, lists and so on. This basically a binary JSON-ish thing, I think is an extension of a message pack. And you can have domain specific structs defined in there. And Bolt has some 15 structs on top of that and then another 15 for communication purposes, but we don't really need to talk about those. All of those various data types that the Bolt protocol can have as natively represents are in this big enum. Like you have a null type, you have an integer, you have a float, a string list and a node for graph database and node is an important thing. So it's also on there and there's a bunch of other stuff. Now in Neo4S in the 0.6 release, what is now the previous version, if you wanted to get some data from a node you would write something like this and in 0.7 you would write something like this. And if you think that looks exactly the same, that is on purpose. It's not a fact of low oxygen in the room. But you can also do something like this where you have your struct, you put a 30D0-size thing on it and then you can convert that into your node into that struct. And so I want to talk about how to, or some things that we did to implement this, to provide this functionality. 30, if you're absolutely not familiar with it, I mean it's been mentioned a couple of times today already, a framework for serializing and deserializing data. If you want to get into what it is, you will not find it here but you can go to 30.rs and there's a video linked from John Janseth who goes into great detail about one side of 30 that I'm not going to talk about. And in particular there are these kind of three big concepts there. There's a data type which is your structs where you put your derived serialize and deserialize on it. And then there's the data format which is the other side which has the trade serializer and deserializer. Notice the R at the end of the names. That's the main difference. And they communicate via this 30Data model but not really wire but this data model is mostly represented in the API only. This is one of the examples of something where I would have to put a big asterisk on it and then talk five minutes about why it's not actually like this but shallow dive. So for example JSON is a 30Data format. It's like how your data is represented, formatted in your bytes, in your string, on your desk, wherever it is. And 30Data is the trade that implements this data format, implements the serializer and deserializer trades. Now we want to bring them together. We already have this bold type enum and so we want to implement the data type for this particular thing. And we're also going to focus on the deserializer side only. Doing this while parsing the data into bold type and serializer implementations are on the roadmap. Let's say down there. And we also want to maintain the API compatibility. So as you saw before, the API should look the same. It's not actually the same but it's still going to be breaking but we don't want to have to introduce a lot of things we need to change. All right. So let's talk about this node, this thing that graph databases use. This is the definition from the bold documentation. We can put this in Rust. Very much looks the same. We have ID, labels and properties. And just to show what those actually are. This is Cypher, the query language for users. This is going to be the last slide on Cypher that you see in this talk. I'm just breaking this down so I can show you this particular thing. It's the label of the node. These JSON-ish looking thing are the properties of the node and we have in our end, in our return column, we have the actual node as this node struct. And we have our session thing with our deserialize on and we want to do something like this. I would say give me the value, it's called n as a session, which we could also write like get me a node and then convert that node into a session. So let's try to do that. First attempt that we could try is make our lives easy. Make our bold node, make that deserialize and then use some other data format to do the job for us. So we have here this tool function that we had earlier and we have a t bound that needs to implement deserialize and then we just say, okay, let's lose JSON and then convert our value into JSON and then convert from JSON to the actual user type. And say, are we done? If this was the solution, then this would be a 20 minute talk, not a 40 minute talk. So no. You get a bunch of these error messages that like field event in the year on there and that is because this, what we really want to read from node are the properties, mainly the properties, but we are deserializing the internal structure of a node like the fields, IDs, labels and properties. So user would have to write something like this where they wrap the actual thing in something like a node thing with the properties. We don't want them, but maybe we can fix it easily by just using the properties to pass them into JSON. And that kind of works in the sense that this example could compile and could run, but there's a one there's a fact that we're using JSON, which does not have the same representability of things that Bolt does, like it doesn't know about all those special data structures. And there's also no way to get to the IDs and labels and sometimes we really do want to use them and not just use the properties. So let's try again. We're not going to do this with JSON anymore. We're going to now start writing our own deserializer finally. And we want to, I think it might be able to look like this where we have our session struct with the two fields and then we have some other fields in there which are IDs and labels. So before we can talk about what the deserializer would look like, we need to understand how SIRD brings those as a data format side and the data type side, how it brings those together. If you have this thing where you have this derive deserialize attribute, you can use something like cargo expand to have a look what the result of that macro expansion looks like. And I'm going to show you a very simplified version of that which is more similar to what you would probably write if you would implement this on your own, if not be a macro that would implement this. So you start by implementing the deserialized trade for session. There's only one method that you need to implement called deserialize which gets a deserializer and it returns itself or an error where the error is defined by the deserializer. Here's a deserializer bound and for this particular session struct we call the deserialized struct method. So the deserializer has a bunch of methods and the idea is that we call a method that describes in the most precise way what we actually want to have. That's as we want to have a struct. It's called session. It got those four fields. There's something else and we hope that the deserializer can provide us the data in order to build this thing. And that's something else is a so-called visitor. Whereas a visitor is also a trade from Suri and we implement this as well. We don't need anything on the visitor except like some struct to implement it. So we have our struct, we implement visitor. This one actually defines what the value is that we return. This is the session struct and there's only one method, one function we actually need to implement which is this expecting thing that helps with reporting errors. If you only have a visitor like that it's not useful because the only thing it can do is report an error. So you also want to implement one of the other methods that you get. And you know like Rust Analyzer or so. Or your IDE can help you in figuring out what the methods are or documentation. And we expect when we say to the deserializer, hey, I want to have a struct, we expect that we get a map in return. Structs are like basically like named maps if you will. The keys are your field names, the values are the actual fields. And so we implement the visit map method we get, we also say we return our own value, we return the error from whatever the deserializer gives us. And it gives us this map access thing which is fancy iterator over key value pairs. And if we actually implement this, this looks very mechanical. We have our fields, we don't have any values for them. So they are all options of the actual field types with none of these defaults. We use the map access to say what is the next key in the map. We would match over that key if it's an event, we'll take the next value and put it into the event value. And here we are saying in the TurboFish there we want to have a string. And this call looks quite similar to the node.to call. The bound for string is deserialize and the map access implementation that has the actual value will know what the deserializer is and then you've got another deserializer and deserializer that come together and they do the same thing over again. So it's all like this kind of back and forth from top down. We do that with the rest of the fields and then we can build the value at the end and then we can throw an error for any fields that are missing. Then we plot that in and that is the deserialize site. What is generating and what we need to provide. So let's start by adding a struct for our bold node and we can also implement the into deserializer trait which we can use to tell us what kind of deserializer we want to deserialize a bold node. That is like a very fancy into this, nothing really special to it. So it's up to implementing deserializer for that bold node deserializer. We define an error. The error needs to implement a certain trait. I'm not going to talk more about this. It's like an enum of some typical error cases. And then there's this fancy thing. So if you implement deserializer you have a lot, like a lot of methods that you need to implement. Usually you don't really want to do that. And then there's this concept of self-describing data formats which is that within your data format you know what the next type is, what the next value is, what the next key is. You don't need any kind of information from the outside. And so in JSON, like if you're parsing a key you know okay that's the key, it's called that and then you have a value. So you know at every time where you are and you don't need the information that the next thing that comes is a string or an integer or something. You figure it out by the, just by looking at the data. Bold is such a self-describing data format and for those SIRTY recommends to just implement deserialize any and then use this macro to say every other thing just goes to any and we're all doing that. So we only need to look at ourselves and we don't need anything else. So let's implement this deserialize any here where we say we want to call this visit map method because that's what expected. And there's a map deserializer which is also from SIRTY where you can give an iterator over key value pairs and that will do the correct thing. So we don't actually need to implement anything fancy here. And we can bring those together in this two methods by calling deserialize and using the into deserializer trade to bring those together. So now after we did all of that, we are basically at where we started. We can deserialize our properties or deserialize our properties. But we also want to have the IDs and labels so let's have those before and instead of just using our properties, we are having, we're training this with getting the ID and getting the labels and then we're just passing that on. And that kind of works for this particular example only. We do get our IDs and labels but we only get them if we call the field ID and label string. It's harder than here as this ID string and label string. So you could never call him something else and we could use like special fields like underscore underscore ID or something. But if you want to have an ID, you would have to use one of those field names and you could never use this name in your actual properties because then you would have multiple entries in this map thing and SIRTY will say no, there's multiple values for the key ID and that is an error. And using like underscore like magic field names or something like that would maybe be possible but it feels not really that great to me. So let's try something else and we want to do something like this where we have, we call those extractors internally but these are new type structs. You have an ID struct and the public field that is the ID type and if a label struct with the public field is the labels type and instead of using the field name to say these are IDs or these are labels we're using the field type to say for whatever field name you have we want to, this one gets the ID from the node and everything else in the struct is still being deserialized from the properties. So in order to do that we can no longer use deserializeAny but we already know that we're not actually calling deserializeAny, we're calling deserializeStruct which has been using the forwarder from struct to any but if we also, if we remove struct from this big macro with the forwarding and then implement these there struct on our own we can basically say this is a special version if we know that you want to have a struct. Once we do that because we get these two additional information that deserializeAny doesn't get which is the name of the struct which we actually don't care about but we also get all the fields of the struct. And what we're doing here is we take our properties, we have another enum type in there if I want to show you all the code that is related to the enum there's a lot of it but it's very mechanical code. It's struct data has an enum of two cases property or node and there's a deserializer that's also an enum of two cases and each of those cases have their own deserializer implementation. And so we have our property fields and then we also iterate through all the fields that we get from this struct and we check if any of them are not in our node properties and then we say okay those we are going to deserialize by giving you some additional data from the node and not from the property. And then through this like big enum chain eventually we have okay here we chain them together and through this big enum chain we actually get to to in so-called to another internal deserializer and the new type fields ID and labels they when they get deserialized they don't call deserialized struct they call deserialized new type struct because they're a new type thing and here you get also the name of this struct you don't get any fields and new types there aren't any fields or like there is one field that's just called zero technique. And so we have this additional deserializer where we can implement this one and then here we can match on the name of the struct and if it's called ID we can say oh yeah okay I get the ID from my node and if it's labels we can deserialize labels of the node. And I see deserializer here similar to the map deserializer where we can use 30 to say here a bunch of values and put them in like a list at the end. These are things that SOTI provides in order to avoid like having to allocate a VAC or a map a hash map and then using that one so you can use this you know with like less overhead without allocations and potentially maybe also use it in in in no std environments. Right so if we're doing that and put those together then that works and the example will give you the data. There's still some downsides and I'm not gonna we're not gonna go into a fifth attempt now because those downsides are still not fixed and we're figuring out how to work around them and the biggest downside is that the deserial default attribute doesn't work. So if you have a struct and you annotate one of the fields with this SOTI default what SOTI will do when it generates this deserialize code instead of saying at the end hey I have this value and I do unwrap or else an error with the missing field it's gonna call the default method and then get the value from there and that's the only thing that changes. Like there's nothing else in this whole method call where we get the information that the user actually has this annotated with this default attribute. That means if you go through this next key next value train and we say here is a value and Zuri expects that we give it an actual value like if we say we have something for that value for that key we should provide something. And on the other side what we are doing is saying all the fields in the struct we put them in this iterator where we are saying we have something for the struct. So we eventually see the year field and we think well it must be one of those additional types because the year is not in the properties because it's missing in our node but then we are saying we are not getting one of those new types but we get these U64 and then we don't know what to do about it. Because it's not in the properties we have to do something we cannot say this type isn't available it's too late for that at that point in time and so we have to fail. So using the default attribute doesn't work but there is a workaround that you can use the option wrapper around it so you can manually choose and report default at the end and that works so we have a workaround and so we can figure out how to solve this eventually. So now that we are done with the nodes we can do the same thing for all the other 20 variants. We are not going to do that here of course but you can imagine. But then we are still not done at the end there are still some smaller things that I wanted to mention that are just like things that we ran into and that gave us some learning experience. In particular Bolt has a bytes type so there is a native thing for here is a vec of U8 and this is like some glob of data that the user has defined. Not many data types and many data formats have a bytes type so 30 while the data mod model has something for bytes. We can start by doing this like we get for example a string and we get a bytes and there is a visit string and visit bytes so hey cool we can just pass on our bytes array. But like I said many data formats have them so 30 doesn't actually generate visitors that expect that you call this visit bytes method because then you could never use it with for example JSON. What's really sad is if you have a vec of U8 in your data type you should call this as a sequence of individual bytes instead of like one blob of bytes. So you would have this visit seek and then there is a seek user and that doesn't really matter what is at the end there but we need to call this. But then we can use the same thing where there is a deserialized bytes and there are some third party crates there is 30 bytes there is 30 width that can be used to say to tell 30 hey this is a vec of U8 but actually I can provide bytes for that and please expect call to visit bytes and on the other side like I said earlier deserializer should call the most special like the method with the most information and so it should call not deserialize any at the end but deserialize bytes because it's saying well I actually can accept a call to visit bytes and so in here we can check if we are actually if we have bytes and then we can just pass them on and then we're done. So not really but we're running out of time here so I want to do a quick run through a couple of other things and if you ever ran into one of those issues I don't know you can find me afterwards and we can talk about this. So one thing is I said earlier we want to maintain the existing API and the existing API was based on conversions via the try from and from traits. So there was a bunch of traits implemented for try from bold type for a bunch of other types so we also needed to make sure that if you're using these bunch of other types at the other end our deserializer will do the right thing and part of those bunch of other types are various state related things from the chrono and time crates because bold has native structs for dates and times and date times and without time zones and all the fun with dealing with those. So this taught us about there's a thing called a human readable where you can say whether or not you're a deserializer or your serializer works with a human readable data format and I think the time crates, the time crates uses that in order to say if it's a human readable one it's being serialized and deserialized from string and it parsed from the RFC something 3.9 format and the non-human readable format will pass in a bunch of integers and so we have to set that flag in order to provide the correct data and you can use annotations for one of those where you can say this is a time stamp but the resolution is in milliseconds or in seconds or in microseconds and then we have to figure out okay what is the actual value that we have to give you so that the calculation at the end turns out to be correct. So that was fun. The other thing is if you have conversion via from and try from trades because of the planket implementation you could also convert the bold type into a bold type and so we also need to be able to say we can visualize into from a bold type into another bold type and so we have a custom deserialized implementation for bold type that calls the appropriate method in a way that we know that we get the correct data which is not really that straightforward at the end result is a deserializer that isn't really usable for any other data format and so if you use that one and then deserialize a bold type into a trace on you get some trace on but that doesn't really make sense which is unfortunate and then there's there are things like what if you have more data in your node properties than you actually have in your struct and then usually a data format is expected to be implemented or to be doing the deserialization while parsing and if you're a imagine you're parsing through some JSON and you have this key and you give it to the visitor and the visitor says yeah I don't I don't I don't need this key it can't just continue and say give me the next key you have to go back and say well I don't care about this key but you need to still give me a value so that your parsing state is gonna be correct so that when I call you next you're gonna give me the next key and this is done by this ignored anything where so you will say give me a next value but I'm gonna ignore it so you can give me whatever you want just do the right thing on your end so that your internal state is correct for us we didn't really need to do anything because we already had everything fast and it's enum but since we're moving to doing this while parsing as well we need to take care of that properly then there's the zero copy deserialization we had earlier like oh I want zero copies being a red flag so consider the red flag if you will and all this code that I showed you actually doesn't compile because every trade from 30 has a lifetime around typically called take DE for deserialize and also our deserializer for the boat node doesn't actually have doesn't own the boat node it has a reference to the boat node and we put lifetimes on everything and then implement a bunch of methods that say well if you can make the Rust compiler happy with all the lifetimes that you're using then we don't need to copy data and so you could do things like extracting the labels not as a bag of string but bag of stir slices where you would still get an allocation for the back but the strings would point into the actual data from the boat node thing and we do that for how long we can do that but it's it's it was too much to show it in all the slides and yeah with that I am done and I think we still have time for questions if there are any otherwise yeah thanks for listening questions questions no questions no questions I do have one maybe just a short one you were mentioning like deserializing the bytes before when you say bytes do you mean like a vector of u8 or like the bytes crates or both actually that primarily a back of u8 because that's what we have in a standard library but I think we also have a test that makes sure that it works with the bytes type from the bytes grid so I think you can use that one as well okay then I think that's it for the day thank you the speaker thank you to the whole audience for staying with us next year
SPDX 3.0 - a migration journey
Good morning everybody. So I've got too many slides to present, so I'm going to go kind of fast. I think all of you know what SPDX is. Does anybody not know what SPDX is? I'm going to skip up, Axe Troublemaker. Now I'm going to skip through kind of the what is SPDX slide and jump into what are we doing about 3.0. This is really a talk about my kind of more practical journey. I'm one of the maintainers of the tools and I recently went through a process to upgrade the tools for 3.0 and I thought if I shared my experience with you, those of you who are writing tools yourself might gain some of my experience from that and help you out with your tooling. So as far as the agenda, I thought I might start with a little bit of context. Why did we even do 3.0? Because as you'll see there are some breaking changes or some changes you'll have to adapt to in your code. So it's good to know why we're all doing this. I'm going to talk a little bit about the approach to creating the spec because that will give you context to some of the techniques I've used for upgrading the tools and an overview of the changes, the important part. Then I'll talk about my practical experience on the Java libraries itself. So why SPDX 3.0? We've gotten a lot of feedback from the community. By the way, you guys hear me okay? We have no idea. Yeah, we don't know if it's... There we go. Let's see if I can... It is there. I think that'll work. Alright, a little bit easier. So we've got a lot of feedback on the SPDX 2 spec that is just too complicated. There are too many pages in the spec. So we took some steps in 3.0 to simplify it. At the same time, we added a lot more use cases, which actually can make it a little bit more complicated. So we've taken a few approaches. I think the biggest one is the introduction of profiles in SPDX to allow you to focus in on the things that you care about on SPDX and that does create some changes into the specs and impacts the tooling. Another big change that impacts the tooling is we made a lot more flexible. We have some people using SPDX and extremely large deployments, very, very large S-bombs, and they want to be able to basically distribute this S-bomb across many different files, across the network. And so you'll see some structural changes that allow you to do that more easily and of course reliably and in a way that you can authenticate the relationships. There's a lot of interest in SPDX and non-licensing and non-security scenarios now. So product safety is coming up in 3.1 and some of that actually started to kind of come into 3.0 as well. But there is a lot of changes to support that as well. And a big change, there was actually a time when there was yet a third standard. I know many of you are frustrated with two standards. There was actually three for a while and we merged two of them together in SPDX so that also had somewhat of an impact. So I'm not going to go through all this, but just to point out, we've been around for a long time. So don't ask me why we created two standards. We started back in 2010 and we've gone through a lot of evolutions. Most of them adding use cases back in the early 2000s, 2010, 2013. We added security use cases. More recently, we did the merger. And you can see on the top some of the external influences, the NTIA being one of the most significant in terms of accelerators. What's that? And the CRA. Absolutely. I am in Europe. Yes. And we've also done some work on ISO standardization that's on the timeline as well. And of course, this has an impact on how we evolved the SPAC itself. We started off in a very simple PDF. So we'd give tool developers like myself, here's a PDF, go implement it. That kind of created some errors. Some of us read the words a little bit differently. English isn't the most precise way of describing some of the technical features. We moved that over into a markdown file that was a little bit easier and we generated things. And then we went to an ISO SPAC. Have any of you guys ever gone through an ISO specification process? It's interesting. There are a lot of requirements. They're very picky about their format. So we went through that. And then where we ended up is going to more of a model based description of the language and generating actually multiple different schema files. For 3.0, we actually spent quite a bit of time deciding how we want to do the SPAC infrastructure for 3.0. We decided that a lot of us wanted to write directly to schemas. There's a lot of people though that wanted to make it just human readable and human writable more importantly. So we actually took kind of a middle ground. We described everything in a markdown files, but in a very specific format that every time you commit to the repository, it checks to make sure you adhere to that format. And then we take that and we generate an intermediate schema file. And that schema file then generates everything else. So I have a little bit of a diagram to kind of show you what the process is. And this is important if you want to contribute to the SPAC. This kind of gives you a guide on how to do that. We started with a conceptual model. This is kind of temporary. We don't use that anymore, but it's just a picture to get us all on the same page. And then we write the specs and markdown. And this is where you can contribute. You can just commit directly to the SPAC in that specific format. And thanks to Alexios and quite a few other contributors, we have tools and generators that right now is generating a website, an HTML version of the pages. And here's where us tools developers get to actually kind of get a little excited, at least I do. We generate a Shackle Owl schema file. Now, how many of you have never heard of Shackle or Owl? Okay. Oh my gosh. You guys are going to kill me. But there's good news. We translate the Shackle and Owl to something that you do understand. So, you know, just hang in there because we will get you, we're going to be generating certainly JSON schema files, you know, that I think is really popular. But you might be wondering, first you're wondering what the heck is Shackle and Owl. Look it up. It's really interesting. It's very complicated, but it's very complete. Okay. It's very complete. And then this is where we go to, we call it serialization schemas because JSON looks different than XML, looks different from, you know, there may be other schemas that we generate as well. And the way we, the reason we did all this is it ensures consistency. If we agree on what the markdown file is, everything is completely consistent all the way through to the schemas you use to validate your source code. So, it's well worth the effort. Really, Kate, it's worth the effort. So, you might, now after you ask yourself what is Shackle Owl, you might ask yourself, especially if you look at the spec, you're going to wonder why did we pick that. One thing it captures not only the syntax of the data, which all the schemas do well, you know, this is an integer, this is a string, it's got this pattern. It also captures the semantic behind it. So, it goes beyond what you can capture in a simple syntactic schema. And that is the additional information we pull out of the markdown files and we put it into the Shackle file. So, we can say things like, oh, you got a relationship between a file and a license. It can only be of this type and you have to have at least one of those. Whereas in a syntax, all you can really say is there's a relationship and it's got this cardinality and it's got this type. So, you can go beyond, you know, the specifications. And, of course, if you start with that, you can easily generate the simpler schemas, but you can't go from the simpler schemas to the more complex. So, that's why we picked Shackle. Now, the other reason we picked it is just about the reason we picked all the, well, there's a lot of, huh, it's coming, it's coming back. There's tooling for, there's libraries that support Shackle and most eco-language ecosystems like Python and Java, etc. Am I back yet? So, it's not coming back because we don't have slides that being captured on the stream and the HDMI is put through here. So, there's something going on with this machine. You need to go and talk downstairs. Stream is not available, is what they're saying. Oh, dear. Technical difficulties. So, let's, oh, but I keep, is it coming back? You want me to disconnect? Okay. Okay. X2.x, you know, if you cared about security, you cared about licensing, you cared about, whatever you cared about, you had to read the whole spec to find the little field that you're interested in, you know, in supporting. It's kind of hard to navigate that. And if you wanted to conform, you know, what is required, you know, it's like, you know, if you got a, if you're interested in security, you really want to make sure you have the integrity fields. If you're interested in licensing, you want to make sure you have the licensing fields. But, you know, what do you make required for the spec? So, we introduced profiles and we have what, six or seven profiles in total. And there's really three different aspects to a profile. The most important is the conformance requirements. And what that means for us tools developers, that's the most important. What that means is if you are a producer of a spec and you say, I conform to this profile, I'm, I'm meeting the minimum requirements. That's your promise to the consumer. So, you can say, I conform to licensing, security, and AI and data. But I don't conform to the new services profile. And that, that, and that's, you know, of course carried over in the, in the, in the data itself. It's also a namespace. And this is where the simplification comes in, is that you can kind of filter the spec on what you care about by using these namespace. Technically as well. So, there is a technical namespace that goes along with all the classes and properties. And you can filter on that. And within my code, I also use that to help me with some of the verification code that, that's there. And it's also the way we organize within SPDX. We have meetings that are organized by profiles. So, people of like-mind and like concerns get together and actually develop the spec. So, let's talk a little bit about some of the other structural changes. In SPDX2, we, everything was around a document and that was a file basically. And we had a mechanism for reliably linking documents together because you may get many types of S-bombs for many vendors. You may want to bring them together. You may want to compare them. And you may want to link them together. So, we had a mechanism to do that. In 3.0, we still have the ability, this I got to make this really clear because there's this rumor that SPDX documents are dead and 3.0 is not true. They're still there. And you can use them the same way that you've always used them. But you can also link directly from the elements. And an element is a package or a file or you know, something, you know, a unit of something you care about in SPDX. So, now you can go directly. And so, you can put these things out on a network without having to worry about the files that contain them. And so, think about like the World Wide Web, you know, where you have like files and images that are linked together in HTML. You can do that in SPDX documents in the future. So, it's a very, very flexible, powerful mechanism we're introducing. Relationships have changed. In SPDX 2, they were a property of the element. So, you have an element like a package and you say, it has a relationship to another element like a file and that would be a property. There's a problem with that when you go to this distributed environment because you have to have, you have to know about this in advance. You can't introduce a relationship after the fact because it's a property in the element itself. So, we moved the relationship outside. So, now you have a separate object which is the element that does a relationship from one element to the other. And we've put a bunch of properties in there that in a way kind of simplifies the relationships. So, rather than having hundreds of relationships, we can have dozens of relationships and a few properties within the relationships to take care of it. How am I doing on time, by the way? You are at, yeah, 17 minutes. Oh, perfect. Okay. Other, I want to make sure I go through these changes because I think this may be the most interesting part to you guys. The other, there's a few other changes. There's a better model for what we call entities. This is the person organization. In SPDX 2.x, they were just strings and you'd have to parse the string to figure out whether it's a person or an organization. We now have kind of like a whole object hierarchy that describes what these things are. It makes it a little bit easier for parsing. We renamed and removed a lot of confusing properties. Those of you who have built tooling for SPDX 2 will love this because people complained about these properties all the time. And we either renamed them to make them clear or just got rid of them. And, you know, for example, files analyzed. A lot of people don't like files analyzed. Functionality is still there, but it's just a lot clearer how to actually do those use cases. We've added some additional uses, useful classes and properties. So, for example, we elevated package URL from an external identifier to be in a property on package because a lot of people are using that directly for identifying the package metadata. And then we have some additional profile specific classes and properties, of course. And on this, I know you're not going to be able to type this in. Hopefully, you'll get a copy of these slides. There is a Google Doc that I put together. It's a living document, which means it's open for comment from any of you. You find something missing. Please comment on it. But this is kind of a guide to all the detailed changes. And I was writing that as I was doing this, and I know there's some folks that have done the same thing and contributed to this document that describes all the migration. It'll turn into a migration guide once we do the full release, but right now it's more of a living document. So, kind of stepping back at these changes, you know, what's kind of the big picture of this? You know, it'll be a lot more flexible with the profiles. There'll be a new relationship structure in addition to relationships. So, you need to annotations independent as well, so you can do more incremental changes to the S-bomb without having to go back and create a whole new big S-bomb. And then simpler profiles, simpler snippets, more use cases. And then, again, see the migration document for that. So, now I'm going to switch over to my personal experience. I was involved in writing this back, and now I'm going to tell you how fun it was to actually implement it. So, the Java libraries, first to give you context, I wanted to give you an overview of what the current SPDX 2.x library's architecture looks like. You know, it's what you'd expect. There's a model set of classes, and that match exactly the SPDX 2.x model. The only change is really is I had to rename some of the things that conflicted with the Java language. So, Java doesn't like you to call a class package, for example. And then, there's a set of utility classes that has some useful functions like being able to do a comparison of license, little things like that. And because in the very first iteration of this, I started this like 10 years ago as a pretty printer, it was very monolithic, and I got a lot of feedback that, hey, you know, I don't want to have all these RDF library things in there, if all I want to do is generate JSON. So, we introduced a storage interface that lets you create a lot of different model stores. The model stores can represent a very specific serialization of a file, or it can also represent like a database or a triple store if you're into RDF, or the most common is just an N memory store for it. So, this allows you to separate these out into separate jar files. It does add a complexity because there's a storage interface in between that has to adhere so that we can separate these out into different things, but I think it does make it a lot cleaner. So, a couple of breaking changes that I noticed right off. I think one of the ones I did not expect, this change to the namespaces actually caused a change to the storage interface because I was just using the property names, and I knew that I could always map the property to the full URI of what the properties were, or the full string with the namespace because we had a clean mapping. I can't count on that anymore. So, I had to add one extra parameter, which means, oh my goodness, now I got to change all these different libraries to use that. Of course, I put in a compatibility library that made it a little bit easier, but that was a breaking change to all those things that are implementing the storage object, at the storage model below. The model itself created some breaking changes as you'd expect after going through what the changes are. There is, what I did is I took all of the spdx2.x code and moved that over to a compatibility library so that it's all still there. It is though in a different package in Java, so there is a small change to the imports, but it should work pretty much as is. The relationship and annotation structures that definitely impacted the Java code because it moves it out of properties and makes them a little bit more independent. I came up with a trick to help manage the consumers of my libraries, keep them from having breaking changes. I'll come to that in a couple minutes. That might be an interesting tip for some of you. The external document ref structure really changed. That was probably one of the more significant changes. We talked about the agents, the snippet simplifications, and then moving these properties to relationships. Sure. That layer will direct you to the compatible layer or to the new model layer. It basically minimizes the impact to the users of my library. That's this spdx model factory is what does the switching there. This is the little trick that I came up with for relationships. We used to have these as properties and now we moved them over to separate independent relationships. You can imagine what this will do to all the users of the library. It's like, oh, this isn't just like a change of coding or a change of names. I got to restructure my code. I came up with a way to make it look like a property inside the class. I have a special class that says, okay, this is a relationship, but it looks like a property. If you're interested in that technique, let me know. I can show you the code. It really wasn't that hard. It's a very generalized class that I can use for just about any kind of property. I think it's called relationship property or something like that. That makes it a little bit easier. The other thing I was focused on is reducing the errors. You remember in the, how am I doing? I'll see you in five minutes. Thank you. I saw you getting ready. You remember in the specs, we did a lot of things to reduce the translation errors down to the actual schema files. We're taking that further with the coding as well. We're from the OWL Shackle file. I'm generating the Java code. The Java library code, so now you got from the markdown file all the way through to the actual Java code, traceable, reproducible code to make sure it's all done right. I can't tell you how many errors I have personally made or I mistyped something or I didn't read it right, and it got implemented wrong in the Java library. I think the errors that I make now will be much bigger because it'll be in the code generator. It'll happen to everything. Sorry. That's a little bit of a joke, but no. It'll get rid of all those little errors. We'll also be generating, as I mentioned before, the schema files for those of you who would rather see things in JSON schema or XML schema. I also generate the verification code from the Shackle OWL files. If you're into the RDF, it complies with the Shackle OWL. Those are some of the techniques for reducing the errors. I think this is my last slide, the new architecture. One thing I didn't mention is this copy manager in between. It's a little bit of a detail, but it's kind of an important one, is that if you're migrating, if you've got two different SPDX documents with two different versions and you're referencing to each other, that copy manager will let you copy it over to the new version. That kind of does the upgrades. It'll also copy between the different types of model stores. It'll let you convert between tag value and JSON, whatever. Anyway, I think I might have a minute or two for questions. I got three minutes for questions. Did I go so fast? I lose all of you on that? That felt like I was speed presenting. Yes. I recognize that you get a lot of work about the specs so that you can see them easier. That's great work. But also, I would say we need to have a lot of different kind of examples. You know, you write your library, then you want to test it, and then you find out whether you really understood the specs the right way. Yes. Yes. And therefore, more examples of different types. Yes. I don't think I can repeat all of that, but I think the basic comment, and I think it's a really good one, is in addition to the spec, we need to have examples, you know, so that we can work off of those examples. And we do have an examples repo in SPDX today. We plan on, we're going to organize that in the future for 3.0 by profile. So you'll be, if you're interested in security, you can go down and look at, like, the security examples, be able to use those. Excellent point. Thank you. Yes. Now is current code ready to convert a file from SPDX2 to 3.0? Yeah, that's a really good question. So the question was, do we have code today that'll let you convert from 2 to 3? The Java code is not ready to be used yet, unfortunately. It compiles, but not quite ready. Yes, Dolph. Yeah, we have a project that can do that with the 3DUSRC of SPDX3. So Dolph mentioned there is, is that in the presentation later today? Yeah. Okay. So we hear about a tool that can do that later today. It's not the Java library. So it's coming though. It's not quite ready yet. Yes. SPDX3 light coming? SPDX3 light is coming. And that will be, that's one of our profiles. It's got, it's unique in that it skinnies it down, you know, rather than adding things to it, which other ones do. Yeah. Sorry. Yes. Talk about some of the relationship of SPDX to RDF. I was wondering if you'd come up against any requirements, things that RDF doesn't support, stuff that you feel like you need to push back up into RDF. Ah, that's a good question. The question is, is there anything that we ran into in the RDF world that we, that we couldn't satisfy by using like, like the RDF, maybe Shackle Owl specification? I'd have to think about, I have a feeling we have, but I can't think of an example right now. You know, I, I, I, I, let me think about it and then give back to you later. Yeah. Thank you. Yes. What's the view about converting SPDX3 to Cyclone DX and back and forth? Oh, yes. Because they've obviously got things like AI in their model 5. Right. 5, etc. You've got one. Right. And security. So we're looking at compatibility because there's not people to be, what people to be flexible? Yes, we do. And I'm, I'm with you on that. So the question is, what about converting between Cyclone DX and SPDX? We actually had an effort going on in SPDX2 where we had people from Cyclone DX, myself included on the SPDX side collaborating and, and we were doing two things. We were, we were writing libraries to convert, you know, so kind of really testing it hands on. And, and then we were also working on the SPAC where we were like in 2.3. I actually put a number of things in per request of the Cyclone DX team to make it easier to convert. So we're doing both of those. Unfortunately, that collaboration stopped. I am looking for somebody from the Cyclone DX team to work with to do that in 3.0. So if you're on the Cyclone DX team or if any of you in the room are in Cyclone DX and are interested in, in collaborating with SPDX and make it easier for all of our users, let me know. I'd be happy to work with you. Thank you. Yeah. Okay. So I, I'm not sure I completely understand the, oh, time's up. Answer him. Yeah. Why don't you go ahead and shut me down and you can go ahead and close the screen and take that over. Yeah. Yeah. Just. So, sorry. So the, the decision about. These are committee that decide about the changes. Oh, how, okay. Like the governance of how the SPAC has made itself. Yeah. So we do have a formal governance process. We have a technical, we have kind of like a steering committee and then we have different work groups. The real work gets done in the profile work group. Most of it's in the core. There are team leads that are nominated and, you know, the steering committee, this whole process that does that. And then the way that we really try hard to make all the decisions consensus space and it's, it's based on contributions too. So if somebody says, hey, I want to do this, but they don't contribute anything. Yeah, we don't really listen. If they say, hey, I want to do this and here's a poll request. Here's the spec. Here's some tasks. You know, here's what you do to the schema to make it. Then it's like, oh yeah, come on in, you know, we'll work on it. And sometimes there's differences of opinion. We try to work together very rarely. The team leads will have to make a call and we try to do it based on the majority, but you know, it's rare when we do that. We think very carefully before we do that. Yeah. All right. Max, thank you.
Overview of SPDX tooling and how SPDX3 gets adopted
I will just slowly start as I will today try to give some kind of brief overview of SPDX tooling. There's a lot out there and I will try to capture some of that, especially the stuff that is inside the SPDX org on GitHub. So I'm max, maybe someone saw me already somewhere, that's my coordinates. Pick me if you have questions. So let's talk about SPDX tooling. I think the first part that I want to go through are the tools that are provided in the SPDX org and I think we all call them tools, but they are all libraries and tools and the first one that we just saw, a wonderful presentation of are the Java tools and I think they are still the most complete and most actively developed library for SPDX. Thanks to Gary, really a great job. And if you are unsure, I think the Java tools are a good starting point. They provide the possibility to implement or to use them low level to implement usage of SPDX in a Java JDK-based project or they provide a set of command line tools or tools that allow you to do some easy tasks like converting between formats and view stuff and generate stuff. So that's always a good starting point. I will be too fast, so there will be time for questions and I went fast through the Java tools as we just saw them. The next set of tools are the Python tools. I think they have a similar structure similar to the Java tools. They are also libraries and CLI tools and so there's a library part. If you have a Python project that you want to generate SPDX, pass SPDX, work with SPDX, they have the possibility or you can use them in your project to interact with SPDX. And similar to Java tools, also the Python tools have the CLI part that provides some common functionality like verifying and validation and converting of documents and files. Maybe a more interesting thing related to the Python tools is where are they with adoption of SPDX3 and the Python tools implement the current state of RC1 of the currently latest release candidate, but that will soon change and it was implemented by hand, not generated from the model. So it's a good starting point if you want to try out SPDX3 and try how it behaves and how it looks like. And it has functionality that if you start with an SPDX2 document, it bumps that to an SPDX3 document. So if you want to experiment with SPDX3, you could also use that to take an SPDX2 document and just convert it to SPDX3 and look at how it looks like. It has an implementation of the JSON-LD civilization so you can also see how the serialized format looks. And what's next with that? I think the big change that is up on the horizon is to migrate that manually wrote written model to an automatically generated model like we have seen in Gary's presentation. We also want to use the shackle as an input and get the model as an output to have, to don't make errors. The next, a pretty new part of the ecosystem that is provided in this GitHub org is the TypeScript tools. They are called TypeScript since they are written in TypeScript, but they actually work also with JavaScript and there's a pretty generic build system behind that so that it should work in every related ecosystem. What's different to the previous libraries here is that they are not trying to be the complete universal tool that can do everything and that can convert in every direction, but instead they are built with a use case in mind like I want to generate a software build of material of a dependency tree so they do not yet fully implement SPDX. They do not, they can't pass files. They are very small in lightweight to generate SPDX documents which is fairly helpful especially in the JavaScript ecosystem where small libraries are usually preferred and based on that there are already two plugins implemented. For example, there's a yarn plugin that allows you with two lines of code to generate a software build of material of a yarn project and that's, so one line is please install that plugin and the second line is please generate an S-Bomb and then you have an S-Bomb and the second thing is plugin for rollup. Rollup is one of these build tools in the JavaScript ecosystem that builds a single JavaScript file from a lot of files and here we use SPDX to encode which files from the source site went into the output so that you have traceability on the build process. So that's the second current plugin that is also on GitHub and just as a side note to mention it also the MPM tool recently got native support for S-Bomb generation for SPDX and CDX. That's why we didn't bother to to implement it there but maybe at some point we can get them to use the SPDX library and therefore they might get SPX free support at some point automatically. The last big library in the SPDX arc are the tools for Go. They are modular, they have a lot of the same functionality. I'm not very deep into Go development. There are probably other people that can answer the questions there but we also have tooling in Go. So that was the first part of tools from the tool overview. The next thing is there's also SPX free meta tooling. As we heard in the previous presentation there's a tool that takes the spec and generates machine readable output that can be used and Joshua wrote something that takes one of these outputs, the shackle, and already generates Python code from that. So there's already the first building block to generate code out of the model and it's with templating and it's generic so hopefully it will be soon expanded with a lot of other programming languages and that might be a very valuable tool in the future. And there are also other projects as you can imagine. That's a huge list. There's a page where tools supporting SPDX are listed. I think I copied it rather complete but I think I missed some of them. If you don't see yourself there, check on that list if you have a tool that supports SPDX and I think on the page there's also described how you can add yourself to this list. Maybe. Yeah and that's it. That's tooling and that's questions. Hi. So I'm working on a project called Z-Hypervisor and it's based on a C and it uses C, YAML, RST. I want to understand from a new view perspective that how do I make the project compatible with SPDX B.O overnight? Where do I start and what changes do I need to do? Okay so you're working on Z-Hypervisor and you want to know how to generate SPDX for this project and it's a C project. It's a C-Bus project so there is a hypervisor, there are tools and there are some documents. I want to know how to generate SPDX. How to make it compatible? I think that depends a lot on which build systems, what is happening in the build and how can you trace what's happening. Is it hardcoded somewhere? Is it C-Make or I don't know. And then you need to look for the right tools that can extract that information and maybe I think there's for example you could place if they're. Zephyr has. I need to repeat that answer. The Zephyr repository contains user C-Make and can produce SPDX so that might be a good point to look at. I also say right now it's 2.3 but we will be moving into 3.0. Right now it's 2.3 but it will be moving to 3.0. There are a bit different flavors of S-Bombs from just source files all the way to runtime and how many of the tools are supporting all the different types of S-Bombs. The question is there are many many different format or different S-Bombs types like runtime, build and so on and how many tools are supporting the different formats. I think my good position is that all these libraries are fairly generic. They are what you use in a tool that would generate a build S-Bombs or would generate a runtime S-Bombs. So it's a tooling question or usage question but I think it's there are many tools that are doing one or the other of the formats.
FOSS for FOSS: DejaCode is your new FOSS control center for SBOMs
All right, so hopefully there's a bit of a voice left after last night's excesses. And I think it's the same for everyone. So it's not my regular voice. Usually it's a bit higher. So thank you for joining me today. So I want to present a new tool called DejaCode, which is a tool that can help manage S-BOM. And we'll go through that. Agenda, quick word about me and the project at large. We'll show you some of the features of the tool and then I'll go to a demo. So about me, I like to think I'm on a mission to make it easier to reuse free and open source software. And that means removing any obstacles that are in the way with its license, security issues, and eventually in the future quality, sustainability are related. I lead this project called AboutCode, which has many subparts, including well-known scan code and also package URLs. I don't know how many of you know about this or have ever used it in any way, shape, or form, show of hand. Yeah! Great. So I'm speaking and preaching to the choir. And this is my technical advisor. It's a family rescue puppy. She's six years old. And he decided that he better spend his time on my keyboard, which is not easy. So sometimes if we're chatting and you see really like weird stuff going on, that must be. So AboutCode, it's open source tools, but also open data. And I think that makes a big difference. We're trying to focus on providing as much as possible low level primary tools to do scanning, origin detection, and provide volunteer data. I'm one of the original co-founders of SPDX. I also contribute to Scylindiax and I hope sometime we'll be able to bring the two together. I'm also co-founder of ClearDefine. And I try to contribute to many other tools. We're supported and blessed to be supported by quite a few of your companies. So that's awesome. And we have a company behind that just provides services to help sustain the work. Most of our work is about in fact sustaining the development of the tools. So the problem, just to frame it the way I like to think of it, is we now can develop software using components that we can assemble. It's going to be even worse or even better after the use of Generative AI. And it's very easy to forget where the codes come from. And we need to know where it's from, what's the license, security, and in the future other things. There's a big problem in software promotion and this is that there's been a huge amount of investment from VCs we're talking about in the range of about 1.5 billion fucking dollars. Right? All of this is to eventually milk and mine free and open source software. So you pretend to make it safer and easier to use for large corporates. I have nothing against that but I think we need as a community to come together to bring better tools and what I'm trying to do. So the about code stack, three components, three parts. One which is a bunch of low level tools that can find where the code comes from. That's the SCA tools. Management app, deja code that we're going to look at today. And three components in the knowledge base which is database of license which is built on top of SPDX for anything that doesn't go into SPDX. Which is the last I looked, the largest database of license, this side and this square ground of the galaxy. So it's not too shabby at least, open database. Database of package metadata, files and fingerprints and database of vulnerabilities. And just a word on the vulnerability side. I'm surprised because we're trying to get the data upstream from the source. And it's very clear that nobody did it in many case before. We want to ask the beyond people what's the license of your vulnerability data? Nobody ever asked. We want to ask the NGINX folks. What is the meaning of your advisory vulnerability range? It took us two months to understand. Back and forth with the NGINX metanus. Nobody ever had asked them. Last week we were in discussion with the folks from G-LiP-C. They became a CNA recently so they're allowed to assign TVs. They publish advisories. Obviously never anyone had asked them how to parse this stuff because we're the first and we're doing a back and forth so we're helping them also for the community at large, making this a bit easier to do. So I won't, I'll put the slide online and I won't go into all the details. Something of interest on the SA tools beyond scan code is a new tool for doing code matching. Code matching is you have a big index and you're able to find based on fingerprints, whole package files and eventually approximate files down to the elusive snippet. But we're doing it in a very different way that's been done by existing tools so that's worth looking at. And of course package URL. On the data, just a quick look at where we stand. There's a big problem we have right now is how we can share this data efficiently. So we, we, we have received a bit of funding from the EU for that. And we're building a system where we can federate the data to massively share it without keeping the control of it. That again, not our intent to keep the control of this. Eventually you could think of it as something like a federated open food fax for code. So there's a code. The point is you can import, spdx, cyclone dx. You can generate cyclone dx and spdx. You can aggregate that in a product. So you can combine all of these, but you can also enrich it. So you can say, hey, you know, I received this sbomb. Maybe it was generated by gripe or sift and it's missing license data, which is a common occurrence. And I press a button and voila, I get the stuff scanned. Let me show you an example. So that's deja code. We don't have a lot of time, so I'm cutting a bit for the chase. You know, I have a product here, an example product. I have a bunch of files there, package actually. And if I want these to be scanned, I just say, you know, scan all the packages. And it will eventually fill in the blanks for license. It's also doing the same for volumeties. And if we look at the inventory of packages, well, this one doesn't have too much information in terms of volumeties. Let me look at another example, which may be a bit more interesting. Oh, yes, log for G. No, that's there. Yeah, that's a good one. That's true. Well, it's another one here. So this is an example here. It shows up. It looked up volumeties and a bunch of log for back volumeties. The integration we're doing is with the open data we have with vulnerable code. So if you drill down the package, you have details about the volumeties. And eventually you can zoom in on the details. The interesting thing we do with the volumeties data, as I said, we aggregate data from any source. That includes all the GitHub, GitLab, NVIDIA, Google data, and many upstream source. And we're trying to compare and contrast them so we can find what is actually the correct data. It's a cluster fuck of mess. You have database that make up packages. Not a big deal, nobody uses them. But in terms of trust in the data, it's you have incorrect version ranges, more often than not, different database. Don't agree on which package is vulnerable and which one is fixing the stuff. So you're like on the receiving end, if you're in a security team, you have your eyes to cry and you're going to spend your life doing triage of low quality data. So we're not fully there yet, but we're trying to. So for instance, you see here you have a vulnerability identified by Perl for an instance of the package in Debian, but also at Mevan. So that gives you a bit of an idea of what we can do there. So let me go back briefly to the side. So you can import a speedy x, import cyclone DX. One thing we don't do yet is vex. Being able to provide effective statement of if the usage of certain package, which are vulnerable in my products or applications are vulnerable effectively. But that's on the roadmap. In particular, one thing that's great is that there's a format called CSEF. And it's essentially based on Perl. I've discussed a lot with the folks that promote this. So there's a lot of low impedance between what we have there in terms of data model and what exists in CSEF for instance. Open vex is also based on Perl and so is the cyclone DX format for vex. Where are we next? So yeah, so that's a product. The data model is composed of products, which is typically what you think as a product or application. You can have many of them. You can have versions. We like to track the jacquard itself there. And you can see we change license. We're not always open source. And it's much more comfortable to have every piece of code and open source and data open source, except one which is my passwords and keys. But everything else that works. You can do comparison between versions. You can see what's been added, removed, changed. So it's a rich data model. You have components which is a way to combine together things which will be typically what you think as a component is. Maybe one or more package that you reuse as a block of construction in your product construction set. You have licenses, owners, a bunch of tools for reporting, LDAP integration. Everything you would expect from a decent enterprise between quote grade app that you want to use for this kind of purpose. And that's it. Questions? Go ahead. How do you see yourself in contrast to OVAS dependency tracking? So to my understanding, you lack racks, but you're stronger on the license side. So the question is, I repeat for the audience, how do I see deja code in contrast with dependency track? So they're good buddies. We have a slightly different way. First, the UI is white as opposed to the default UI which is black on dependency track. I hate dark mode personally. And each time there's someone that comes with a contribution to put dark mode, we eventually find a way to redirect it to more positive energies. But so that's one thing. That's a big, big difference. Technically, so the dependency track is essentially doing similar things. I guess the big difference is whether you're based on product and packages versus your focus is more vulnerabilities. Here you could think of it as a way to manage your package and component inventories across multiple products. You can do custom reporting, this kind of thing. But yes, there's probably a lot of similarities. Another tool which we may have quite a few similarities would be software 360. Now, at the lower level, which one does what exactly it's difficult to say? I know the folks of dependency track are working. They reached out to me to integrate the data we have on vulnerable code. So eventually, maybe they all level up and that's great. We're not in a competition mode. We collaborate and we share as much as we can and let the best two win. And there's probably people that prefer dark mode and people prefer light mode. Other questions? Thank you. Thank you. I'm sorry, but if you have multiple licenses, you have an ant here. Yes. It technically means more in some cases. No, in the case of this, at the top level package, all the licenses apply because you have different shanks. One part. So you have also the case that is more? Of course, of course. Yeah, yeah. Thank you.
Panel discussion: Software Naming
Okay, so I think we're live now. So thank you very much everyone for coming. Today we're in for a bit of a treat as far as I'm concerned because we've got all the experts in the various, three of the various formats, naming formats. There's two hard problems in computer science. One is cash invalidation and the other is naming. And naming and how we actually refer to things is what we're going to be talking about today. And so with that, I'd like to ask each of the panelists to introduce themselves and then talk about the use case, other format that they're representing for naming and the use cases. And if we could do a round robin and then want to open it up for questions between panelists as well as the audience. And we will try to make sure we repeat the questions for the stream. And with that, I'll start off with Eva. Thanks Kate. Hello folks, hello stream. Eva Black currently leading open source security at CISA in the US. Only for about six months. Please don't ask me any questions about government. I've been an open source developer way, way longer starting in the late 90s. Did some hashing and encryption stuff way back then. Databases, cloud computing, infrastructure hardware security. And about three years ago had this weird idea as S bombs were becoming popular or in the sort of zeitgeistive discussion. How do we really do supply chain traceability, artifact resolution across heterogeneous languages, build tools, supply chains and open source without burdening developers. Without going around and telling an unbounded set of open source projects, volunteers, hey you need to adopt a bunch of new tools that will cost you either time or money to build these S bombs. Couldn't we find a finite set, a small set of projects to make changes in maybe GCC to start with and then on up the chain from there to Docker build, RPM, etc. To connect to the dots as it were. So that the end state, you download a package from PyPy, you download a container from Dockerhub. Whatever, that it has a signature and a graph and gosh we already have this technology, it's called Git. Underneath Git is a directed acyclic graph of hash identifier maps. Couldn't we repurpose that? Yeah, it turns out we can. It wasn't that hard. And you can build this hash map or what we call an artifact input manifest. At each step in a build process and then chain it to the next step in someone else's build process and if everyone did this automatically in our build tools, it wouldn't take more developer effort after we modify the build tools but at the end of it you could trace all the way back to the source files and the intermediate steps and the use case if we know that these versions of the log for J package, these file hashes have a known vulnerability and you have a Java binary, JVM. You don't know if log for J is in there or what version. You can do some composition analysis, maybe find it. You can look at the S-bomb but today they're not deep enough to tell you that. You can ask your vendor and wait for a response but if we had the full artifact dependency graph you could just do a hash lookup. Is this hash in this graph somewhere and then go oh, whoops, probably a problem here or probably safe. The goal is better fidelity information for defense teams. That's my use case. There are others. Other people have come up with four omnibor. That's it. Thank you. Thank you. So Philippe and Wadan, for those we are in the room. I've already introduced myself but the videos will be split. So I'm the maintainer of many open source tools in the SES space and I'm the co-creator and creator of something called package URL that I represent today. In contrast with what Eva said, I think there's another way to look at things which is not trying to change the world but just take it as it is and try to observe and very modestly try to extract what we have for that. Package URL, interestingly enough, was really introduced in this very room in 2018. It was a package and dependency management dev room. And it comes from work we did on scan code and the early version of vulnerable code where we had a very simple problem. We were collecting package data from many different package managers, parsing manifest, we're collecting dependencies and it was a mess to figure out when the package is named file as a Ruby package, is it the same as file in Debian? Well, it happens, they're completely different. And so what we said and say, well, Debian file version one, two, three is the name of the package and that's essentially the essence of Perl and that's all there is to it, not much more. If it's on Ruby gem, you say gem file one, two, three. And each of these ecosystems, we call package type, are the ones that are doing the hard work to ensure that the names are mostly unique. And the nice thing, I don't know how it happened, but eventually every tool in the space is using that spec at some level or something which is very close to it. And the NVD is considering something similar, for instance, database called OSV in part of the Linux foundation of NSSF and Google is using also Perl, they're also using some of the tools we built around Perl for version range. So it's great. I think it's awesome because it's a way to, even if we don't talk the same language, ensuring that when we talk about a package, we mostly talk about the same thing. Hello. Yeah, I'm Alexios Alvarez. You don't know me. Let me see if I can plug this in again. Right. Right. This slide I had prepared talks about things that you cannot see. Names and locators and identifiers and all this stuff which might or not not be. Yeah. Okay. We can do it without. Can I? You have. Go speak. I'm going to try to get yourself. You know, wonderful. Okay. So when we're talking about the naming as Kate mentioned, right, we have to think about name, locators and identifiers and the slide I was prepared to show was that there are two different ways to refer to things. First of all, why do we want to refer to things? We want to refer to things in order to communicate and say, okay, we're all talking about this thing, right? We give them a name. Right. Okay. You have an idea why? Yeah, yeah, definitely things. Okay. It will be discussion. It will be discussion. Right. So there are what we're calling intrinsic and extrinsic identifiers. Right. Oh, perfect. Thank you, Phillip. Oh, thank you, everyone. I don't know. Okay. You know me. You don't know that. Yeah. Right. So there are, I wanted to show, I took some time to create pictures. Right. So there are extrinsic and intrinsic identifiers. Extrinsic rely on a registry or catalog or directory, whatever you want to call it, which keeps the correspondence between the name and the object. Right. And intrinsic, there is the connection of the object is inside the name. Right. Referring to non-digital world examples. Right. We have ISBNs. Right. So the ISBN is a number, but there is somewhere some database, registry, whatever you want to call it. Right. That assign that corresponds a specific book to the ISBN. Right. Or a passport ID or something like that. Somebody else keeps the correspondence between the name and the entity. Right. And there are intrinsic and in real, I mean, analog world, non-digital world. Right. We have chemical formulas. Right. When you're saying, nature chloride, sodium chloride, it's salt. Right. And this is actually what is the components of the object. Right. Or in musical notation, when you listen, when you see this one, you see that it goes, da, da, da, da, and you understand it's bit of a truth. Right. And there is no registry or directory that gives you the correspondence between these two. Right. And on the digital world, as we've already heard, examples are, you know, URLs, URIs, DOIs, Perls that you just heard, and Omnibor, Ditoid, or whatever, software hash ID, these are based on the content and there is no. So I'm here because I'm involved with software hash ID, which is an attempt to produce unique IDs for every digital artifact. Right. It's been used in software heritage and the largest archive of source in the world. And it's on its way to become an ISO standard. And it's based as we heard there ever before on hashies, on the content. So you know, it uniquely identifies things based on the actual content. So let me try to summarize here. Okay. The software hash IDs is referencing to a global archive and you have this starting point. No, it's not reference archive. It's the content. The content in the archive. No. The content of the... It's the content of the object. Right. Where you find it, it's a completely different thing. Right. It's how do we call it. Right. There is an archive, software heritage archive, and you can find different things. You can find them or maybe not find them. It's not necessary that you will find it there. Our discussion here is about naming. Right. Yeah. Okay. So this is a source of truth for a naming. The omniborx Gidoids are... Here is how things have intermediately been named as they've gone through the build tools. Very close. Okay. Very close. So each of the objects in a build chain, including the final one, get a hash ID. A similar process, slightly different function, but same principle. The difference is we, instead of using the identifier of the object, use the identifier of the build tree behind it to identify it. That's the only functional difference. And then pearls are basically representing, oh, I found this and it may refer to things, other things may refer to these and link the pearls to these. It's a locator. Locator. Yeah. It's a locator and it's also an identifier. And we even summoned the authorities on URLs and URIs from W33 to ask whether it was a URL. Okay. And says, yes, it's a URL. It's also an identifier. The thing is that it's also something you observe when you have an NPM package called React version 2.3. That's what you have. So it's not based strictly on the content, but it's based on the package manifest and way of naming them there. Okay. We're starting to see the questions show up for questions. So just say your question. I'll repeat it for them. On the one for hashes, is it... Which one? The hash map. That's a false of them. Okay. So is it the case that anyone with a proper algorithm can come up with the same results? Yes. Independent later is no central authority. So the question is, for the hash IDs, can anyone come up with the same result independent and there's no central authority? Short answer is yes. Both of us use a published, widely known function to derive the hashes. This is why it's called an intrinsic identifier. Given an artifact, a file you found with any name, you can derive the identifier and then look it up somewhere else. Is the name part of the identifier? Doesn't have to be. Is that a... Is the name part of the identifier? With the question and not necessarily. I can download something off of BitTorrent with a random name. It could be named after a movie star. It doesn't matter. And I derive the identifier, look it up on some global database and realize it's actually an NPN package. Actually, it depends. Right. So do you want the name to be part? If you're talking about... Because this is about identifiers for digital artifacts. So if it's a file, you probably don't care how it's called. So you only look at the content. But if it's a directory or a tree, do you care about the names or the things in there or on the contents? If it's not a directory, if it's only a commit or it's a revision or it's a package thing, it's a release. So in some cases you care, some cases you don't care. But if we're talking about single files or snippets or something like that, yeah, you don't care about some other names that it has. Next question. Governance. Uh-huh. Can somebody tell me about the governance about any of these things such as the Perl spec on the door, how are these being governed, where are they being developed, who's issuing them, et cetera? Okay. I can answer on Perl. It's a loose coalition of volunteers. I'm going to repeat the question. So what's the governance for these different efforts and in particular I'll answer for Perl. It's the loose governance based on a coalition of willing participants, including folks originally from Microsoft, there were folks from Google, there's folks from Sonatype, from the Cyclone DX project and folks from ART and SPDX. Yeah, no, no, no, I go there. Eventually there's been a pressure to get more governance and process in place, which is a good thing, but that's something I like to also do only at the last minute. I'm a proponent of a bit less process whenever possible and I think most of the developers appreciate a bit less process when possible. So we're considering moving to a non-profit charity, potentially a wasp and that will mean also adopting at the same time a bit more formalized process. So that's the long answer. Software Hush ID started from the people who built the software heritage, but right now it's spun out of it. It's run under the community spec governance model, SWHD.org, everyone can contribute, everyone can help and it will follow the process for submitting a spec and becoming standard. And Omnibor is in about the same place. It's under the community specification license. It's a website, a calendar, weekly meetings. Anyone who shows up can participate, contribute. The actual governance right now is mostly the regulars are Microsoft or X Microsoft and Cisco or X Cisco and other folks from several other companies show up regularly, but there's not a certain governance until it moves into a non-profit. Okay, next. Go ahead. Yes, we have many identifiers. Is there a way to convert between some of them or not? Okay, so the question is we have many identifiers here. Is there a way to convert between some of them or not? What makes sense? Okay, I'll let the beat start. Okay, the short answer is yes. And in the end, if you get a checksum on say a Zilibar archive, that's essentially observing the content. So that's what you would get in a much more sophisticated way with the SWHD or Omnibor. If you know where you got it from, you can get a pearl. If you have a packet manifest, you get a pearl. The two are not in conflict, they're complementary. And the last answer I make is that in the things we do on a bot code, we have this database where we track in both the checksums. We don't track SWHD, but we don't track yet Omnibor. And we have the pearls for all the packages. So eventually there will be public data available to get this kind of mapping and correspondence. So as you can see from the table, they're two different. It's not easy to always translate, right? The same way, think of it, forget these things. If it's a URL and the text file, right, maybe it's always the same, maybe it has changed. Right, you don't know. This relies on the registry. We assume that the registry is responsible and it stays there. These are all based on the content, right? The nice thing is that I think we're using the same algorithm, so these are very easily convertible to each other. Right. These are just for files or, I mean, the usual things. I will borrow your phrase. It's complicated, but in this case, I think the short answer is no. Given a pearl, not all, but some pearls do not contain hash identifiers, only a location. If the content at that location is gone or inaccessible to you for some reason, you could not derive and could not translate that pearl into a hash ID. In either of these hash IDs, you cannot give only the hash ID, say the omnivore ID or the clever heritage ID. You can't just transform it to the other one. You have to go back to the original artifact and re-derive the other one. It's math. But there was a lovely research paper done recently on analyzing all of these. The outcome of that is to say, we probably need both at the same time, because there are certain benefits, use cases, needs that only can be answered from an extrinsic identifier and some that can only be answered from an intrinsic identifier. We've got to have a system that supports both at the same time to reference objects. Just one quick word there. We can do NPM install, react at 123. We cannot do today NPM install ABCD 3625, and it's not super expressive. We could. In the end, you may need both. Go ahead. Go ahead. So the question here is, really like the NPM, is there a way to do that? So the question here is, really like the omnibore information, and it's good for the compliance side of it, but will we have the information for the commercial components? And will basically people be providing it when they're sending things out, S-bombs, and so forth? Do you have any perspective? Right. Yeah. So things that are proprietary and commercial and they're not public, is it possible to use them effectively, efficiently? Okay. So the artifact graph that omnibore uses is content opaque. It's built out of the hash identifier of the content tree, but it doesn't tell you what the source code is. So hypothetically, a company which produces proprietary software could publish both the identifier of the end result and the full hash map of all of its components without telling you what's in it. They could. They could. Do they today? No, but they could. And some vendors are looking into doing this. Okay. He's been patient here, so. I would say the dinosaur I'm missing on the slide, there is this quote, CPE. Yeah, thank you. I was going to bring this. What I would encourage you is to find similar to the aliases in the vulnerabilities area, that you don't have just one system owning the truth, but you can digest multiple ones. I see very much a similarity here. Yeah, I do too. It was like I say, it was very much an elephant in the room and we will let Philippe talk to it because he's been dealing with it for a while. So two things there. First, we have to be grateful for CPE to exist. And I remember maybe about a year ago, I was in a call with folks that include the creator of CPE and he was apologetic saying, I'm really sorry for CPE. He says, don't be sorry, man. Nobody was doing anything before you. So it doesn't have to be sorry. So first, let's put things back in perspective. The second thing that comes from an era where the things people cared for was is Microsoft Office vulnerable is Adobe Acrobat vulnerable. We moved away from that massively and things are evolving and that's why we're here. The things about CPE versus other identifiers that you need to have knowledge that's external to the code. That makes it difficult. Originally, GNU ZeeLib was the name for ZeeLib. Nobody knows why GNU. It's changed since, but there's the difficulty. Each time you have something, you cannot observe the code. It makes things difficult. Okay. I think we're having to wrap up right now. We're getting pretty close. So if everyone just wants to give me a one minute summary of what their perspective is here and then I will let people meet up with the panelists outside the room, et cetera, if they want further discussions. Eva. Thanks, Kate. Moving on to what Philippe just pointed out and all of the conversations of the past few days and the last year around the European Commission and the CRA, the world has changed. Proprietary software is still super important, but we know that most software in the world, including lots of hardware, the stack is predominantly open source today. Open source is done by volunteers. It's heterogeneous. It's complex. The supply chains are deep. We try to do analysis. Our analysis tools fail because of the fractal nature of our dependency trees. I have yet to see any other proposals for identifiers that also give us visibility down that fractal dependency tree. I would love to see one. Just to add on top of this, there's a thing, a number of things you can observe. Other things you need to discover. There's two tools which are very useful there, being able to have an index. Index of omniboruses, the launched ID. Software it is a small index which has a couple of petabytes of code for you there to play with. The other thing is being able to do lightweight reverse engineering of the binaries that are built will go a very long way. There's existing techniques. I'm building a bunch of tools in this domain. Eventually there's, I think there will be ways to tag and document in the binaries and the more complex piece of code where the code comes from. Rust does it. There's a lot of binary tag also inside Go and other languages. In the end, it's a bit of a wide, wide West right now. There's transition as we said from proprietary only to mostly open source and this is difficult and we'll figure out something. Yeah. Yeah. I don't have much to add to this one. We need identifiers definitely in order to understand that we're talking about the same things. Right. We have seen there are different ways to approach it and there are different needs so you actually need more than one type. Don't think that you will ever get a single thing. And we will figure it out in the end. It's a problem for tomorrow. Yes. Well, yeah. Let's start. This afternoon. Yeah. Yeah. What are you doing after lunch? Yes. I'm going to take the panelists. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah.
SBOM: What's next?
you you you Okay, I start again. No, I'm not only 15 minutes. So the S-bomb thing piece is an obligation Right and more than that, you know, it's not actually that you've got one artifact in your product You've got actually multiples and multiple artifacts, right? Some have to our containers. They are on an OCI registry some are Java files They are landing in in an Maven repo and so on and so You got all these things and then additionally you got salsa. Obviously, it's about getting security from the producer to the consumer and we were actually also Thinking about this problem and started to say hey in order to ship a product We have all these different artifacts. We and S-bombs are an additional artifact, right? So how about can we create can we use the S-bomb initiative to create an S-bomb around all these artifacts? Can we use it to describe actually more transport metadata about our product and can we more or less make it an give it an operational purpose because then the obligation is not a site initiative anymore But it's part of your intrinsic product and you can't do without and that's how you get hundred percent quality into them, right? And obviously you got to do this technology agnostic So with that we started a little bit even earlier before S-bomb became a big thing something that we now open sourced as well as open component model and I'll talk about this Again in these slides this way So we actually wrap the open component model around all the artifacts. How does it look like? Essentially, it's a piece of Yammel file forget about the syntax. It doesn't actually matter It's a semantics that's important and it's a graph of pointing to to other component models and to the artifacts to binary artifacts as well And also including The S-bombs as well So what can we do with that and I totally relate to the panel discussion because naming is important and some of the things are intrinsic Contents and some of the things are extrinsic Extrinsic content so what we can do is we can build a compliance data lake and put a compliance dashboard Now we have an identifier and connect all these scans and everything like that is bn code That's what we are actually doing. We are also having some stuff which we Which we open source around this in the same area the next thing what we can do obviously these component models are hierarchically built so we can connect internal teams and not to Reinvent the wheel anymore not to do the compliance checking over and over ten times again We just say you've got it done sign it attest it We can use it and we can also do it in with third parties So if we can get that kind of Attestation from a third-party vendor Using in standard like S-bom then yes, we we don't need to do it again We trust that a station to some degree right and now one of the inbuilt tooling for why we built ocm was We were able to do Calculate the transit of all of the artifacts that we have linked and we can actually transport the stuff from one repository or multiple repository on the public cloud Take take that as an example into the private cloud no more missed artifact right because that's the source of truth right and And with the metadata we can actually give the consumer something in hand the deja code and so on is a good example They can actually trust us, but then can verify this is in the context of a commercial product obviously right and But the other thing which is much more important here is you know all about the CI CD How do you do CI CD in order to a private cloud where your deaf people do not have access to right? So you know what we did is we basically use the same tooling because if you can do push you can also do pull and Now all of a sudden you can actually decouple CI from CD from the deployment part right and Now the question is because Got it because now you we can deliver the continuous D stands for delivery, but how do we do continuous deployment? So think about this you got to take your deaf ops thinking package it and transport it into a private or regulated environment right and be able to reproduce it and So therefore obviously you can ship all your infrastructure as code stuff in there The best thing is you do it in a declarative way And you got to do two things in order to make continuous deployment happen first of all the localization Because now you're not loading from public dot cloud or Docker hub or whatnot It's the private cloud and Docker private hub right so all of a sudden you got to say my my my container comes from that OCI repository so that's about localization right and Second is you got to in inject the configuration my secret you know all that stuff has to happen These are two things you have to do to do in order to do continuous deployment and the way we did this is Or no, I got to talk about the operational source of it now all of a sudden this S-bomb piece What we you know we by the way call it software bill of delivery The Ocm thing becomes our operational source of truth because that thing knows where every piece of component sticks in and so we obviously use git for local For for configuration and use tools like flux to do the localization with Modern github's mechanisms and then instead of doing Infrastructure as code we try to do infrastructure or configuration as data. I've heard some talks to this this Yesterday about that as well and with that we actually get to continuous deployment in arbitrary clouds and Yeah, I would say Again not my slides Colleagues slides and what we wanted to convey with this is S bombs can be actually your ring to control them all right you got just Yeah, you just got to make them Operationally valuable right now you're on an obligatory path You just have to do it nobody likes to do it, but you've got to do it And if you make this an operational source of truth, that's what we think it is and for us Ocm yeah, we're part of the puzzle. I think we're the syntax does not matter We can actually inject our stuff into the S bond specification cyclone DX or spdx, but you know we it's a parallel thing I'm gonna do it very quickly. There's some unique Characteristics which we have I'm almost out of time so I'm gonna wrap it up where technology as Agnostic we describe a software bill of delivery. We're not interested what's inside because that S bombs take care of we're only Interested about the product description and last not least we do not replace the stuff. We just wrap things have a look Ocm dot software and yeah, that's it Thank you One question yeah, okay made it in 15 minutes and I'm
Protobom: The Universal I/O Layer of SBOM
All right. Okay, I'm next. So, welcome. My name is Adolfo Garcia. I am a software engineer with a company called StackLoc, and we do supply chain security as well. So, I'm here to talk to you about like a project that we've been developing for the past six months, seven months or so. It's called Protobomb. So, just to give you a quick rundown of why this project came to be, let me start with the promise of Esbom. So, if you remember the promise of Esbom, the Esbom is supposed to be the entire analogy of the label of ingredients in our software. So, at some point, we had like a really good label, like ingredients label for our software, but at some point someone decided that, well, when we had one, we needed a new one that was almost quite the same, but not exactly compatible. So, this led to, you know, that friendly cooperation that Gary was talking about between the formats, and at some point, so, but jokes aside, this situation has kept Esbom applications in stock in a state that we don't want them to be in. And the fact is that most people working on Esbom applications tend to get stuck because they design their data models around the actual data models of the formats themselves. So, at some point, we decided to start working on Protobomb, which is trying to solve this problem. So, Protobomb came to be because of some public funding from the department of whole security through their science and technology directorates, and is now a project hosted under the OpenSF security tool in working group. And so, let me, sorry, I'm super rushing through the slides. So, let me give you a quick overview of what Protobomb does and what it's supposed to be, the problem that it's supposed to be solving. So, the idea of Protobomb is that while you have software Esbom formats that are not compatible by themselves, the data in them doesn't really have to be that incompatible. The fact that one Esbom format defines the hash labels using a hash and the other doesn't, or uppercase or lowercase, doesn't really change the meaning, as we just heard, of the data that's contained in them. So, what Protobomb tries to propose is that it gives you an abstraction of a graph that captures all of the Esbom data in those documents, and it rises one level above of the native libraries. So, the ideal of Protobomb is that you can let the Esbom formats flourish and be as expressive as they want, and we can still capture all of that information and make sure that we can do several interesting things with them. So, the first idea is anybody who has done some kind of Esbom work probably has created their own abstraction to read the Esbom data, and Protobomb keeps several definitions of the Esbom data in protocol buffers, so libraries can use them from many languages, and you can still use those libraries in your project. So, if right now we maintain an official Go library, another Python library is also coming, and we're open to having more of them, so the first thing that Protobomb makes is that collection of data structures that capture lost all of the Esbom data. We also capture how data gets degraded if you output the Esbom data to the native format, so if the idea is that you should never lose Esbom data, but in the cases that there's like some minor nuance that you have to, we document that and we let you know in the ideas that you can control how that that all of this happens. The second part of Protobomb is the library that we maintain. So, we maintain serializers and non-serializers to read and write the data into each of the formats, so the idea is that if you're building an Esbom application, you can read all of the data and work with the neutral representation of the data, and then we try to solve some of the most common problems there. The first one was ingestion and writing, and right now some new contributors on the project are solving how to pair our data structures with Gorms so that they can be ingested. So, the idea is that if you are trying to develop an Esbom application, you don't need to care about reading, writing, or storing that information in a database. Protobomb also offers an API to work with the Esbom data, so if you ingest data from Esbom, you can do some graph operations on it, so you can compare, mix, have differences between the information, and keep in mind that since Protobomb is operating one level above the formats, you don't care where the data came from. So, you can, for example, defend the SPDXS1 with a second on the SS1 or a cross versions, so for example, you can read an SPDX2 and an SPDX3 document and make sure that they compare, and this is where we have that library of SPDX3 test that we had before, we were mentioning before. And then one thing that we often get asked about is we are not trying to introduce a new, like a new Esbom format, so our idea is that we want to make things easier for developers while I think the actual experts do all of the expressivity that the Esbom formats need, so if you, while some of those data structures could be rendered, there are no guarantees that they will keep being useful if you try to exchange data in them. So, the way I want to analogy I often like to do is if you think about a picture, you want to work with a picture, so you don't care if that picture came to you in a JPEG or a PNG file, you just want to work with pictures and then when you have to save it for some reason, you can choose the appropriate format that you need to, that you prefer. So, how is Protobom, how does it handle the data? So, it's a graph, and why is it a graph? Well, if you think about it, both formats are really graphs, so SPDX, of course, was born as a graph, so you have packages and files connected, which are the nodes of the graph, and then you have directed relationships between them, and they are typed, and in Cyclone DX you have the components tree, and a tree is sort of a simple graph, but Cyclone DX also has a dependency tree that can be used to create new relationships between them, so that also forms a graph. So, in a graph, we start with a node, and then connect that node in an edge to another node, and so on, and in Protobom, those nodes are the packages and the files are represented in DSLM data, and we also add types, and the types are, I would say, inspired, but truly stolen from SPDX, and then we capture that data structure in something that we call a node list. So, the node list captures a fragment of our graph, which can be the whole document, and then using that node list to create a document, we simply wrap it in a container that's called the document, and then at some metadata, you have your S1 representation, and then using that node list is how we can create fragments and operations of S1 data. Finally, Protobom is an open source project hosted on the OpenSSF. We're on community meetings every two weeks, and if people want to join the community meetings, create libraries for it, the meetings are open to all. So, I think, hopefully, that was quick enough, and we for questions? Oh, awesome. Yeah, I was trying to get to that. So, yeah, well, these are the links. So, yeah. So, how do you handle the cases that you have a single S1 itself, or multiple S1s that connected by references? That is exactly, yeah. So, how do you handle, yeah, yeah. So, how do you handle when you have reference S1s? Instead of a single file? Well, the external references, they are lifted into the project itself. So, if you have a S1 that references others, we will just hold the references in the data. We don't follow them. That's something that we could potentially view. But the more interesting case is what happens when, for example, in SPDX, you can have many top-level components, and then Cyclone DX has one root-level component, and that's one of the degradations that you can control. So, you can make, if you ingest a project, like, if you try to write Cyclone DX, S1 that has multiple top-level nodes, you can control whether your project throws an error or just returns you the loss data or simply gives you a warning, for example. These are like tiny hacks that we're building into the project so that you can control how it behaves. You're treating like a monohepo for multiple root levels or a project file that's going... So, it's like in terms of your development, you're treating the multiple root files like a monohepo for multiple projects. Right, exactly. Yeah, so is it like a monorepo where you have all of your projects, like one root and then all of the projects there, or just like individual repositories breaking down the tree? Thomas? I can barely hear you. If you have different... So, the question, let me see if I understood. So, if you have six different tools that write S-bombs, how does it help? Well, the idea is, well, based on this... So, this is just a library to ingest and to write, right, so at the moment. So, based on this, you can start building other different abstractions of that same data. So, we have some demos of other tools that, for example, can recompose that data. So, if you have S-bombs in different structures that have, for example, a top-level component of one thing, and then in a live-down somewhere information about the same component, you can take those two and mix them in a new node that has all of the complete information. And this was thought because sometimes you will have a tool that will look at software, give you like secure information, you will have another tool that will give you licensing information. So, you can take those two and combine them into a single one. Yeah. Do you have any recommended GraphDB to store these data in? No. In fact, we're basing it on Gorm. So, right now it's more SQL table-based, but that would be enough. So, if there's like a recommended Graph database to store this, I don't think I know. So, let's thank everyone for...
Know Your Ingredients: Security Starts With the SBOM
Okay, great. Good. All right, so welcome everybody. Thanks everyone for joining the hottest room at POSTA. My name is Steven Chin. I'm VP of Developer Relations at JFrog, and I'm going to talk a lot about different projects which help secure the open-source supply chain about why we need security, a bunch of different security incidents, both historical ones, but also new ones which you probably haven't heard about, a lot of new research and things going on. And hopefully we can all help to improve the open-source supply chain together. So I think a great analogy. Can you guys in the back hear me? Okay, good. So I think a great analogy for the software supply chain and how we think about it is to compare it to our food supply chain. And we know that the way that you get great cooking, great ingredients is starting by fresh ingredients. Having things which you know are safe, which come through the food supply chain, which aren't, don't have any people who are interfering with in the middle, who are not following good hygiene practices. And when you have an issue with your supply chain, you end up with spoiled ingredients and you know, kitchen disasters. So anyone here seeing the Gordon Ramsay kitchen disasters show? Okay, a lot of good fun. And these are not the free-range chicken you're looking for. We're hoping we can get better quality and better security out of our software supply chain so we can build enterprise applications which are hopefully very difficult for attackers to exploit. And this is how the USDA looks at kind of you know, the software supply chain, creating a healthy supply chain. But it's somewhat analogous to software. So you have a lot of production. You have you know, farms and things which are producing software. You have distribution and processing. So it goes through a bunch of different tall gates and different people in the process. Eventually it ends up in a restaurant and a retail location and then you have you know, home users or restaurants or other folks who are cooking the food. So if at any point in this process, if you have issues with your quality, if you have you know, infections, if you have bacteria entering into it, then that results in potential issues at the consumer side. So I think when we're looking at the software supply chain, we need to look at it through a different lens. And I think a good lens to look at it through is Salsa, which is one of the open SSF standards. And it really focuses on getting attestations of the different parts of the builds that your software has gone through. Kind of figuring out at each of these different gates, you know, is the source control secure? Have you done the right things with code reviews? Have you done through the right processes with builds? And when you have all of this information about the build, then you can figure out are you actually secure? And a key ingredient to how you know this is the case, and this is why we're all here in the software build materials room, is because you need to have that final index of your ingredients, where it can show you from end to end, and Salsa and both SPDX, Cycline DX, and S-bomb standards go really well together, because this way you have the attestation of what's happened in your build into your artifacts, and then you can put that together into a single document, which kind of shows you all of the things which verify the components, and then the potential issues with them, which you might have. And if you're not following these good practices in how you build software, how you get provenance of your software and how you attest to it, then you end up with issues like, for example, the log for shell incident. Now I think this by now is infamous. It sparked a whole second round of government security concerns over open source software, and really the challenge for big organizations, which we're trying to address the log for shell incident in production, was are my production systems affected? And it depended upon the version of log for J, which you were using. It depended upon whether you're using just log for J core, or whether you're using the full set of libraries. And the answer for most organizations was, well, I don't know if I'm affected in production, so I'm just going to patch everything. And that's very expensive, it's very difficult to do, and when you have libraries like this, which are used so much across the entire ecosystem, it's quite challenging as well. And I think what really started a lot of the government concern around software supply chain was an earlier incident. This sparked the Biden administration's litigation around this, which was the SolarWinds incident. A very different sort of incident because this one was a true software supply chain attack in the sense that they specifically attacked the build system. So they were using TeamCity, they got in right before the certification happened, the signing of the artifacts happened. So to the downstream people, which SolarWinds was providing, it looked like it was signed by the company and certified, and it wasn't a malicious, but in fact they had done a very good job of infecting it before that was properly signed. And so we'd like to prevent these sort of attacks from happening because it causes a lot of damage. It can cause potentially malicious entities to get access to information. It can cause privacy issues with consumers, and it costs a lot of money and cost according to IBM's data breach report in 2023, USD over 4.45 million and a 15% increase over three years. So this is a huge issue and it continues to get bigger for us as a software industry. Okay, so let's talk about some additional incidents. So anyone, which one of these is your package? So when we're talking about like delivering libraries and dependencies, one of the things that majority of software uses is it relies on open source components, it relies on leveraging that because we don't want to write the same code and it's actually more secure if we're leveraging open source libraries that have been peer reviewed, that have been patched, that are staying up to the latest standards. But what if you can compromise the systems in the middle, which are supplying this information? So the dependency confusion attack basically relies upon the fact that a lot of companies, organizations, and open source projects use some sort of package management or middleman. They'll set up repositories which will pull from upstream or pull from local corporate repos. If you can get the information about what the internal names of the corporate repos, this is an example of Yelp, then what you can do is you can upload those to NPM or other public repositories and especially you're spoofing these libraries. So as a developer, as a CI CD system, you're going through a potentially vulnerable package manager rather than getting awesome corporate lib 1.2, which is the latest version of your company's library. It goes and it sees, aha, there's a new version in a public repository. I'm going to serve that up instead. And as you know, bad things happen when kittens get access to nukes. So we don't want this to happen in our supply chain. Fortunately, all of the commercial package managers, including my company's Artifactory, are now patched for this. So by default, they will not go out to a public repository if it exists in a local repository. So this blocks that attack upstream. But Alex Bresson, who did this exploit, was very creative. He took an attack which was theoretical at the time. Nobody had actually exploited it. He attacked Google, Facebook, Apple, a whole bunch of companies and simultaneously claimed about a dozen bug bounties and ended up getting $130,000 USD for his effort. I'm sure you'll see that. Maybe instead of helping secure the supply chain, there's a more lucrative path. But I think it's also like researchers like him also, they're helping to expose the potential issues in the supply chain in a way where they're not introducing threats, right? So this is white hat hacking. And we need people like this to find the exploits. And also, this helps guide us for what we need to do for new standards of SPDX, for implementing things like VEX to make it easier for us to figure out what the vulnerability scope is. So I think that these sort of attackers actually are helping us a lot with the ecosystem. Now, another food example here. So if you have a recipe that calls for different types of rice, like for example, if you're doing a risotto, you wouldn't want to use like a mixed grain rice. Like you need a specific type of rice. And this is something else which attackers make a lot of use in the supply chain. So another common type of attack is called typosquadding. Another variant of this is leaving off namespaces. So as an example, our research team found an attacker which released to NPM a whole bunch of libraries from Azure. And they just left off the Azure prefix. So if you're a lazy developer and just typed in the package you wanted, if you left off the namespace, you would instead get a vulnerable library instead of the actual library you wanted. So a very clever attack. And the way they did this inside of NPM is they actually had a random account generator which would also generate a unique account for each of the different libraries they uploaded. So it also wasn't easy to systematically find, oh, well, this is a bad entity. I'm going to block them. So they managed to spread out the attack. They did it on 280 different packages on Azure, Azure tests, Azure tools, CattleLang. And then they could install any software they wanted on the person's computer. But basically it was set up for potentially exploiting data from personal machines. Later on, so our security research team found this. We reported it to NPM. They took all the packages down. And then we publicly disclosed it. Later on, a security research firm claimed that they were just testing out NPM. So this was like a company testing the waters. So it wasn't actually a malicious payload in any of the packages yet. But it had a lot of potential for doing that. And the security research firm wasn't exactly upfront about what they were testing either. So, okay. And then, of course, if you're building, if you're serving food, you want the ingredients to be very fresh, right? You can't make gourmet food if you start with a pile of rotten food and things which aren't fresh. And I think when we're looking at the software supply chain, actually a good analogy for this is the somewhat infamous picture of a stack of more things and more things and more things with very small, fragile components nested inside of the supply chain, which any of those, if you pulled out the banana, suddenly your whole supply chain falls apart. And I think a great classical example of this is the left pad incident. So basically, there was a package published on NPM under the Keek package for doing left pad. Not a lot of code, so it's not something that's hard to write. But as developers, we are very, very lazy. If there's, if you can possibly save a line of code by including a dependency, of course you would do that. And then this Keek package was later claimed by a company, which wanted to own that domain. NPM sided with the company. Cameron got upset about this and then pulled down, oh, actually the publisher of Keek got upset about this and pulled all his entities down. Later on, Cameron published an identical version of left pad to solve this problem. But this is the source code which caused this huge incident. And this is something that is not worth including a library dependency, a potential vulnerability for such a very trivial piece of code. So this is, again, a huge threat. Now, one of the ways you can find out what all your dependencies are and figure this out in a visual way is using Guac. So this is a new OpenSSF project. It just got added to the OpenSSF suite. What it does is it gives you a visualization of all of your dependencies, lets you see exactly what you're using, how you're importing, and has some nice visualization on top of it. And I think using things like this helps you to figure out what your risk is and what the potential scope is of your application and how vulnerable you are as a project. So everyone knows Coca-Cola and it's very secret, right? So the secret recipe is locked in a vault, very secure, nobody actually knows what is exactly in Coca-Cola, that's their trade secret. I think we pretty much all know what's in it now. But there's this aura of mystery about the recipe and the history behind it. And so how do we as software developers or as projects, open source projects, keep our secrets? And the reality is we do a very bad job of it. So this is all of the exposed secrets in different central repositories, which we found by scanning NPM, PyPy, RubyGems, crates.io, Docker Hub. Obviously Docker Hub being the biggest repository and having large containers, which came in a lot of other software. There was just a humongous number of secrets exposed, 5.78 million. But even the software repositories like NPM had 1.16 million, PyPy had 0.43 million. So there's a lot of accidental exposure of secrets in open source repositories. This is yet another attack vector which attackers get into open source projects and allows them to attack the CI CD infrastructure, cloud accounts, which the projects are using. And even there's often accidental leaks of corporate secrets inside of open source repositories. Because as a developer you're working in the daytime on your corporate projects. And then evenings and weekends you're working on open source projects. And there's a certain amount of crossover in that as well. So the top ways you can help to prevent this from happening in your own project. So first is not using automation to check for secrets exposures. So using something like Truffle Hog, some sort of commercial scanner like X-Ray, allows you to scan your packages before you check it in to make sure you don't have exposed secrets. This is how we found that we basically ran our tooling on top of central repositories to see exposed secrets. Second one is generating tokens with broad permissions that never expire. So you always want to have the tokens scoped as small as possible in terms of what they can do. And then setting expirations in a reasonably short time frame so you're rotating keys at the right times. Third one is no access moderation for the secret. So putting it inside of some of service like HashiCorp Vault or Docker Secrets or something will help to protect your secrets and tokens. Fourth is fixing a leak by unpublishing the token. So this is a really, really common mistake. But you can't simply check in a new revision which deletes the token. Because then, you know, Git has long history, it's going to remember it. Now, if you followed point two and you have very short-lived tokens or very small scope, that limits the damage because by the time somebody finds it, it's likely not useful anymore. But again, a big mistake, you actually have to go and rotate the token to fully mitigate the issue. And of course, you know, exposing unnecessary assets publicly. So we saw a lot of cases where in test libraries and other like code which was not the main library code, there were secrets exposed that were visible to infrastructure. And in some cases, it looked like that the test code or the other like side cards beside the main code base were not even meant to be published. They were kind of, you know, more internal code. Okay, so to safely use open source, we also need standards. I think if we've ever, you know, gone to a restaurant, this is really common. This is in New York City, they have like letter grading on restaurants. They have like, you know, reviewing of the source. And I think a great way of doing this for open source software is the new OpenSSF Score Cards project. So basically what this does is this gives you nice tooling for Git and a command line. It'll analyze your project. It will give you a score. It's kind of like up to you to interpret the score for the different things that it analyzes. But it tells you about code vulnerabilities, maintenance, continuous testing, build risk assessment, source risk assessment, so a wide set of different things on your project. And helps you figure out like how much risk is in your project, but also more importantly how much risk is in upstream projects. Because if you have dependencies on projects which are vulnerable, then your project itself is vulnerable. Okay, and I think, you know, given we're in 2024 and clearly the machines have been taking over. So it wouldn't be complete if we didn't talk about what's happening with security of machines, machine models, and some of the code which we're leveraging to make better use of AI infrastructure. And unfortunately it's not looking that good for us so far. So ML models, so the machine learning models which we all use and publish to public repositories like Huggingface, they are highly vulnerable and this is, we're already seeing a bunch of attacks against these public repositories with malicious actors injecting payloads into it. And it's not very hard to do so the H5 format, the Huggingface format actually gives you the ability to put inside of it information that is basically executable code that sits alongside your model. So the developers have figured this out and basically from the moment you install the model, they can run some code on your system. So as a developer there's always the possibility, there's already the possibility of simply using models inside of Huggingface and other public repositories could expose your development environment to risks. And basically this is an example of the base 64 payload and you can run whatever you want to inside of the model. Another attack for injecting malicious packages is exploiting the generative AI. So if you're using technologies like chatGPT and other generative AI technologies, what they'll often do is they'll suggest packages that you should use as part of your code. And AI algorithms are prone to hallucinations. Hucinations are actually quite predictable and a lot of the standard code queries which people ask for will include perfectly valid dependencies, but they'll also include fake dependencies which don't exist, packages which don't exist in NPM, PyPy, etc. So hackers have already figured out that by uploading the packages and putting malicious packages in the place of the libraries which the generative AI is producing you can effectively cause people using chatGPT to execute malicious code. So another potential exploit and now even the AI is introducing vulnerabilities into your code. So here are some examples of perfectly reasonable queries, for example requesting, generating an endpoint that returns file contents, right? So this code is vulnerable. If you now do a couple dot dot back dot dot slashes you're going to end up in other directories, you're going to get access to files you shouldn't. And now if we again ask chatGPT, like, okay, we'll give us a secure endpoint that returns a file for user input and prevents directory reversal. It gives us a more complicated example, but this is still exposed to URL exploits. So as developers we can't really trust the current generation of algorithms for code suggestions to give us secure code. And the attackers know this and this now makes a very easy class of security vulnerabilities which are likely to get injected into open source projects and other work simply by the fact that it's being recommended. And something we're going to be publishing soon. So this is kind of, you guys are getting the before official publication on this. So basically what we did is we went into hugging face, Kaggle and some of the public repositories, ran our detection on malicious packages to figure out like what the current exposure of developers is in the ecosystem. And we found over 60 models which contain malicious behavior. We analyzed the payloads. Some of them were not truly malicious, but some of them were malicious. And basically it allowed the attackers to run code on local environments. I believe we're scheduled in another week or so on the JFrog research blog to publish the results of this, but we're, of course, doing the right disclosures to the, to hugging face and Kaggle so they can take down the models before people actually extract the data. And we exploited it. And I think building awareness of these sort of attacks helps the entire open source security ecosystem because we're the ones both in, you know, in this room building software build material standards but also in the general open source security space you have to figure out solutions so these sort of attacks don't become the next solar winds. Okay, so you can find a little bit more about the stuff I've been talking about for research with the JFrog research team at our research blog. This isn't our like commercial blog, just the research guys publish here. So it's all the fun stuff. And hopefully together we can create a more secure software supply chain. So thank you very much for having me at the software build materials room today. Okay, if you guys don't mind, I want to do a quick selfie with the audience. So what's a good, what's a good security sign? Log for J, log for J. Okay, let's give a thumbs up for log for J. Cool. Alright, thanks everybody for joining. And I think we have five minutes for questions if folks want to ask questions or if you need a breather because this room is very hot. I feel free to leave the room as well. Any work on combining S-bombs with stored secrets and verification, things like that? I think that's a good question. So I don't know if there's any work going on now about getting secrets as part of software like S-bombs, but maybe that's a good addition for the standards. Yeah, thank you. Yeah, so the question is what kind of vulnerability is X-ray handles. So I would say we're clearly in the application security department, APSEC. So we find malicious dependencies. We find like secrets detection, like I mentioned. We do stuff. We actually can build SBDX Cyclone DX files with both regular vulnerability info and also the new VEC standard. We don't currently do anything with runtime security, although that's coming. Our package manager, Artifactory is open source. X-ray is proprietary. Yeah. Okay, so Kay asked if I've looked at any of the stuff that's happening in AI for SBDX. AI and data. AI and data. And so I know about the working group that's collaborating on this stuff, but I haven't looked at any of the new stuff. Yeah. But I'm very interested to see what you're doing. Okay, we'll do. Okay. Thanks everybody.
Make your software products trustable
Hello everyone. Thanks for coming. My name is Dejan. Unfortunately Marco couldn't be here today. He got a call but yeah. What I want to talk about today is so we saw a lot of sessions today about producing gas bombs and producing the data a little very little. I think only Philip sessions was about managing actually the produce data right. So the challenge we try to tackle with the justification project is how to get all these data that are currently being produced by the more and more organizations like S bombs but also X files and more and more advisory data that we get and get them into some kind of manageable system because without that information is just a bunch of mostly JSON files spread all over the place right. So what we want to what we try to do is to provide a system that will get all this data put it in into a system that can be searchable and queryable and actually get us get us actionable information. So making software development more proactive in managing security but also making it much more easier to respond to the security issues. And yeah as I said these got us to start working on a justification project which basically set these goals for itself. So being able to ingest and store all kind of S bombs and VEX documents are open source but also proprietary company products right. Discover for those ingested S bombs and VEX is learn about all the new vulnerabilities and advisories related to the packages inside of the S bombs and being able to explore and search those information but also create an API that can be integratable in other systems and provide us to share this information with the rest of the developer toolchain like IDEs and CI CD tools. So ideally we would want to mark all the vulnerable dependencies directly in the developer's IDE and also for example fail the builds that tries to build a software that contains some of the dependencies that are known to be vulnerable. When we started to do this sometime last year this time last year we also figure out that there is another open source initiative that revolves around the similar ideas and it's called GUAC. It was mentioned in the previous session as well and I will cover it a little bit more here. So GUAC stands for Graph for Understanding Artifact Composition and the idea is to being able to ingest all different kinds of artifact documents like S bombs and VEX files and advisory data from all kinds of sources and basically create a graph ontology of that. So at first we started just experimenting with the GraphQL database but today ontology is based on the GraphQL API and can be implemented by the multiple persistent backends. That's on the left side right on the right side of the graph we also want to be able to query all these data. So GUAC should be able to provide us with all the answers about what are the dependencies in my S bomb, how these dependencies correlate with each other, so what's dependent on the what, so it's easy to find all the graph tree of dependency in your project but also being able to attach to this particular dependency all the vulnerability and the advisories and VEX data that we can find in additional systems. This is the basic architecture. Let me just see how much time I have here. But I basically explained it in the previous graph. So we can collect documents from different sources, we can certify them against different sources like OSV or DApps Dev, get it all through the GraphQL API ontology into a database. Two currently supported databases today is POSGRES, relational database that we use basically and works just fine and there's an Orango DB back end which is a pure GraphQL back end right and then on the other side provide the GraphQL API to be able to query that and provide a bunch of CLIs that it can be able to extract the data from the system. So in the classification project we try to provide a little bit more functionality on top of that. First of all we want to be able to actually not just ingest all the data about different relations in the database but we also want to provide a central place to store all your documents for the organization. So it provides an S3 compatible storage for storing and ingesting all the company's data into a single place so it can be an S3 bucket in the AWS but also for local deployments it can be some kind of a Minio instance for that. It has what we call walkers for different kind of CSEF repositories so that we can automatically ingest Asgum and Vex files and then provide what we can see on top and on the bottom. So what we call a single pane of glass like a nice UI to be able to search all this data that we have but also the Exort API as I said for integrating the system to the rest of the developer tool chain. So there's a nice VS Code plugin that can work basically with justification today and automatically from the project get all the dependencies and flag vulnerabilities if it's found in the system. So I thought to do a little demo so let's see how it's going to work. So Neil it will be easier. So here we can see the UI with some pre-loaded data and we can see that we have basically what we call six products here which are actually six S-bombs that are already already ingested in the system and a large number of CVs that have been collected from multiple sources and we can see that we identified around 2000 packages for these S-bombs and most importantly from the Vex files ingested here we identified 29 advisories for these. So if we go to a certain product we can see a couple of information obtained from the S-bombs so we can see the basic metadata that we have. Usually we can see all the packages and how they relate to each other. I think this S-bombs is pretty flat in structure so there's no much dependency going on there but the most important thing is that we can see different kinds of advisories that are against and also immediately see which actual packages are being affected by these advisories. We can go back and forth through this system so we can go to the actual package see that it's actually affected by this vulnerability. We can also go from the package and find the S-bombs that it belongs to, the S-bombs or the product but also what we can provide is that nice search capability as we said like maybe at some point you don't remember exact vulnerability we're looking for so you can basically just do a full text search or maybe yeah and find that there's a packages related to that but also find the exact vulnerabilities that we talked about a little bit earlier. So this is just like a basic demo right? I have a little bit more time just to explain so what were the challenges for us and I think we heard in a lot of sessions all about these challenges so it's mostly still early adopters everywhere, tools are immature including the project I'm working on so we definitely don't consider it mature but also there's a lot of inconsistency in the data wherever you look right so we heard today about all the multiple computing formats in S-bombs space and all the work that people are doing to bring that more closer and together over time which I think is awesome. We also heard a nice discussion about all the different kind of identifiers and you can see so if you work only with one source of data then it's easier but then if you try to correlate this S-bomb with this Vex file and this S-bomb is using PURELs and these are the CPEs it's becoming impossible to correlate data and build the graph basically properly. Also what we found is that even all these things are standards there's a lot of unwritten rules in all the organizations about how they are presenting their data so the documents will pass but what you have as an information from the document really depends so I think yeah it's good that you're all here and there's a lot of things to do right because it's early early days. For the project itself we'll try to additionally simplify architecture and the deployment model we're all about microservices and Kubernetes for now which is okay but I think we could reach much more people with simplifying how much resources and where they can deploy a project like this and go into supporting more standards. So you saw here just basic searches and basic correlation I think once we have much more data in the system we can get much more vision from all this data in and provide that as that's the value of the project in my opinion right and continue working on the future integrations because in my mind if you do continue doing this right I think at some point in a couple years all these infrastructures should be invisible to developers right so it should be part of your developer toolchain automatically working in VS code in all the Git for pipelines and everything right so we are just beginning that's it so justification side doesn't have too much data saying about immature projects but there's a dev box sandbox that you can try there's a code there and we always on the metric channel so if you're interested please reach out and yeah. I'm going to ask the question are you using the SPX libraries for helping with the ingestion? No no we're using yeah sorry yeah the question is are we using existing SPX libraries yes we are yeah so there's one in Golan using in in guac but there is also in Rust one using the classification itself because they are good yeah. So why is the reason that you decided to start a project from the ground instead of help at least four or five open source projects big ones that already do exactly what they do but not yet on the level but mostly 90 percent that we are doing today. Why you not helping that one instead of creating one? So yeah why we are starting a new project instead of instead of helping others so first of all we joined the guac project which is also another new project but yeah I can't answer that I mean a lot of people were involved in that kind of decision but we are trying to be as much I mean it's all open source we are contributing to other projects so it's not a closed source product basically yeah. So one of your early slides said this can be used to sort of share S-bomb data can you talk a little about that feature how you this can be used to sort of send S-bomb data around to other projects? So it's not about yeah sorry about it so about sharing the S-bomb data it's not about sharing the data but providing the API so the external systems can query things so basically the VSCode plugin would get all the URLs from the current project and being able to query this and get actionable item back so there's no any distributed sharing of the data just integration API. Okay please thank you. Thank you.
Can SBOMs become first-class citizens in Open Source ecosystems?
Thank you for allowing me to come here, the organizers. I am almost new to the S1 community out here. My name is Sal Nielsen. I am part of something called the SEPA security working group. We work on supply chain stuff and security on the oldest open source software repository system out there. It started in 1995 with lots and lots of... Let's see if we can switch the slides here. There we are. And I am here with software and all that implies with developers publishing there and downstream in Debian and Redats and all the systems out there being used all over the world. And like 14,000 developers still, more than 14,000 packages. So it's a real system. It's out there, it's working and people are earning a lot of money on this stuff so they want to keep it going. And now we are having a new reality coming with legislation. So I am trying to today bring the open source supply chain perspective. This is recently finished slides so please excuse me if I am either finished early or late I will try to do my best to make you happy. So I talked to a bunch of the people who are involved in the middle parts of this chain of events. They often say why should I care about this stuff? We already do a key track of dependencies on that one. We have the new formats and this is not my problem. If you pay me maybe we can talk. This is actually, this is paraphrasing but the essence of the discussions are like some of the blog posts are like I am not your supplier. It's actually like that out there. As I can confirm that notion. Then reality arrives and end users or use all the software are obliged by the threat of fines to keep track of all the dependencies and what is happening with them so that we can't get all those horrible security situations going on. That means they need authority and up to date information from the utmost upstream sources. And to do that you actually have to have the supply chain bits and pieces and steps play along in this game so that we can get all the good stuff. Like figure out where stuff comes from, check it up against very built-in databases and all that good stuff. We like that. So I am researching, looking around what the documentation tried to learn, this whole S-bound thing, reading documents from the US government and all kinds of interesting organizations like many of you probably have done. Very interesting stuff. Then I find this thing. This is from my nice tea. They tried to describe where I spawned the show in, show up and there's something wrong there. There's no supply chain mentioned at all in there. It says third party software enters here and there's no open source or processes or communities or anything. It seems almost like there's a lot of, this is a pattern, it seems a lot of documentation and even in some of the standards it's just assumed there's something going on here. And well, it isn't. There's stuff going on. And I would like you to just get a little picture of what's going on there to draw a simplified supply chain. We have an author at the top who does stuff, publishes something. There's a language ecosystem they publish on. They also collaborate with others on a collaboration platform. So the language ecosystem would be the PIPIs or the NPMs or the C-Pants collaboration ecosystem would be the GitOps and GitLabs and all that stuff. And they are sources for downstream package. Oops, sorry. There we have it. So that one, the red one, that's where I come from. That's the C-Pan and then PIMS. And we care about the infrastructure and how that happens and making sure that only the right people get to upload software and that it's published and available and all that good stuff. But the downstream of us, we have a packaging ecosystems. These folks here, that's the Debian and the Red Hats and all kinds of places that compile stuff for their own environment and make sure it's available in a consistent and available manner. But they also feed into themselves. Like downstream of Debian you find Ubuntu. And sometimes the packages here are patched because of upstream availability or you have to back port security fixes. And there's a package there that sometimes I have to talk with a curator about which of the software pipelines you should publish one package. Because some of them are LTS pipelines. You don't want to do stuff there that you can do in another one. And then of course you have to make it all available so that the developers, some business can do their work and all that magic so that it can put to something production environment and make people happy. All these boxes here, I try to make it so that there are boxes that represent a role that cares about something that is supposed to be in an S-bomb file. I'll try to be quick. So these bits here, that's actually this one except for that the third-party software arrow here, the tiny little grey one there is doing some seriously heavy lifting. That needs to stop, seriously. And there's another one, second-party software. We are not third-party software doing this. We're second-party. We're partners. When people say we can get third-party software from open source, no, we get second-party. When you accept a license, you're actually getting a partner, someone you are supposed to cooperate with. Most people don't but they're still there and expected and you need to know about that and people who make decision management, they have to know about this. That means anyone who writes documentation and teaches this kind of stuff needs to stop calling open source as a third-party source of software. That's just insane. Second-party software means people like these actual people working on infrastructure out there get ignored, basically. And that's not a good way to get the inclusion and the support from that software supply chain people and the ecosystem that you actually depend on. So, okay, who are these people? They're, in fact, your open source colleagues. In fact, they are your unpaid open source colleagues. Just so you know that. So stop treating them as a stranger, start treating them as a colleague, talk with them, interact with them, teach them and learn from them as usual colleagues do in a healthy environment. Of course, if you don't have any healthy environment at work, maybe your work should go do something else or quit or something. So to make S-Bomb become first-class citizens in open source ecosystem, make open source ecosystem first-class citizens in the S-Bomb community. Please do that. Don't just put them behind a miniscule with one pixel wide arrow that says third-party software enters here. That's just so bad and wrong. It's horrible. So there we have that. And don't, yeah, they are your partners. Please, it's a good thing to have them on your team, even if they're living somewhere and you don't pay them. They're competent people and they actually do want to help you. Like if you've treated them badly, they'll just say, this is your problem and see if you can fix it. No, you can't. And in a way, if you want something happened with somebody you don't have a monetary relationship with, you have to treat them as friends and with respect and help them if they have a problem and communicate and stuff. And this is the good way to operate if you want to have a supply chain in on the S-Bomb game. So I hope this is a message that you can find useful and adopt in your work in years to come. Thank you. If you have any questions, maybe we'll have a room for one or two. One question? I've been involved in some of the groups that produce the thing that we should show them. I think there may be a miscommunication between them because that third-party perspective, it wasn't planned to offend anyone. Anybody can be a third-party when you're developing, right? So I think it's just not you. But in fact, some of the work that some of the work that they're doing has been, you know, approaching those language communities and helping them build their, for example, involved with the BIPA Foundation and their efforts to create their own S-Bomb. So I think if it's a miscommunication, then we just need to sit down and talk a little bit more. There might be a miscommunication, of course. You'll have to repeat that. Yes, there's a long comment here. There might be miscommunications out there. And of course, my perspective comes from one community. All the communities who might be more resourceful, like the Python community, are easier and don't feel that it's, of course, not meant as an insult. And of course it is. But I think my point still stands that by treating open-source communities as partners, you get all the benefits, even if it's a small community like mine or a big one. So I think, thank you for your comment. I'll still mean what I mean. You haven't changed my mind. Thank you very much. Okay, that was it. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you.
SPDX in the Yocto Project
Alright, hi, my name is Joshua Watt and I'm going to talk to you today about our migration to SPDX-3 from SPDX-2 in the Yachter project. I have a lot to get through so I'm going to go really fast. Alright, so a little bit about me. I've been working at Garmin since 2009, making primarily embedded Linux systems that run on boats. So that's exciting. We've been using Open Embedded in the Yachter project since about 2016 to do that. I remember the Open Embedded Technical Steering Committee and there's all the different ways you can contact me if you're interested in doing that. So I primarily work embedded and do some of the SPDX stuff on the side. If you're not familiar with Open Embedded in the Yachter project, Open Embedded is a community-driven project primarily focused on generating software for embedded systems. Open Embedded itself provides the core layers and build system that you use to build embedded systems. And then there's also the Yachter project which is a Linux foundation project. It provides a reference distribution built on top of Open Embedded. It also runs a whole bunch of QA tests to make sure that the project is high quality, which is a really schedule, provides funding and writes really good documentation. So if you're ever confused about the difference between the two, that kind of summarizes it a little bit. So I'm going to give you a very, very brief overview of how Yachter works. I do not have time to go into this in depth. I have a bunch of video, other talks I've given about this. So if you're curious, you can see those. It'll go into much more detail. I don't just don't have time right now. Basically the way that it works is when users want to build some thing for an embedded system or not exclusively embedded, but you can do other stuff too. They basically have source code that they want to use. They have some metadata that says how to build that source code. They've got some policy information. They shovel this all into this magical tool we call BitBake that does a whole bunch of stuff for them, compiles things, configures things like that. It spits out this thing we call a target image. You put that on your widget and profit, right? In a very high level, this is what the build flow sort of looks like. The important thing to note here is that we're building all of the native tools, which we then use to build all of whatever goes on your actual target. So we have a very comprehensive supply chain actually just built into the way that embedded builds in the first place. The way that we generate SPDX when we're doing this is as we go through this process, we're actually generating these SPDX documents at each step along the way. And then at the end, we have this final SPDX deliverable, which basically is just all of these documents combined together, and that describes whatever target image you built. This is what our SPDX 2 model looks like today. It's pretty complex and pretty comprehensive. We have a lot of really interesting things in here that I don't really have time to get into. But you can see that we're generating a lot of relationships and things like that that are really useful. So a couple of problems we ran into with SPDX 2. One of them is that in SPDX, we have this concept of the recipe element. And it's a little bit strange because it's not actually describing a thing that exists, per se. It's really describing a process that happened, which is we built software. SPDX 2 doesn't have a really great way of describing that because it only has a concept of really packages, not a process. SPDX 3 has made this much better by adding a concept of a build element, which describes something, a process that happened at a given point in time. And so we've transitioned to using that. And this is kind of what the build element looks like in SPDX 3. So you can describe the inputs and the outputs of the build element, so the things it took in and the things it's spitting out. And then you can also have sort of more abstract dependencies between build elements themselves, if you can say like this step depends on this step. The other way to show dependencies is the outputs of one step might be the inputs of another step, right? And so builds are really useful for describing how data has flowed through your system and sort of tracking how they've been changed and things like that. Yeah, just like that, right? So that build element could have inputs that were the outputs of another build element. You can also do nested builds, and this is really useful for us because we have a top level command called bitbake that the user actually invokes. But then that's actually gonna go and do a whole bunch of hundreds of individual build steps. And so we can actually track that by using this ancestor of relationship in SPDX 3 to say these builds were invoked as a sub build or whatever you wanna call it of the parent bitbake command that the actual user typed in. There's a couple of other information that's associated with a build. So there's host information so you can say exactly where a build was done. Like it was done on this VM or done on this machine. If you have a complete SPDX document that describes that, you can actually link that in there with this. You can also say who invoked the build using two relationships. So there's invoked by, which is the user or agent that actually did the build. And then there's delegated to, which is the user or agent that wanted the build done. So this would be the difference between like if you click the build button in GitHub or whatever, the user that wanted the build done is the person who clicked the button, the agent that actually did it was GitHub. So you can track both of those things, which is really useful. Another problem that we had, particularly with the way that we generate our documents, was that the SPDX IDs and SPDX 2 are really only valid within the context of a document, which is fine. But that means that you can only reference an SPDX ID and SPDX 2 by referencing the document that contains it and then the ID within that document. And then also when you reference a document, you have to include is checksum. And I completely understand why it was done this way. It makes a lot of sense. But it's really hard when you're doing things like us, like we were doing, where we're generating all of these documents as we go along in the build. Because when you have to reference a document and include is checksum, you can't ever go back and change a document you've done before. Because if you do that, you have completely invalidated everything, all of your links, all of your SPDX IDs. If you ever change any document you've done before, they're no longer valid. In the wider ecosystem, I understand why that's done, but that was really hard for us and made it hard. And so because of that, it's also very difficult to merge documents together. It's not impossible. It's just really difficult because the SPDX IDs are scoped to the document that they're contained in. And so if you're merging documents together, there might be duplicates and conclusions and then you have to go find all of them and fix them. So we just kind of gave up and stuck all of our documents that we produced in a tar ball and called it good cuz we just couldn't figure out a better way to do it. I did that. SPDX fixes this by using a linked data where objects can basically have a globally unique ID. It can be referenced from anywhere. It's actually mandatory if the element can be referenced cuz there's no other way to reference it. And this makes it a lot easier cuz you don't have to worry about the conflicts of IDs. You can just say, I'm gonna reference this ID and it exists somewhere. And that makes it way easier for us to not have to actually try to figure out stuff out. And that also makes merging a lot easier because the IDs aren't namespace so they're unique. And so with SPDX3, we can actually generate a single JSON LD document at the end that is the entire document instead of having to do it with the tar ball. The third problem we had with SPDX2 that will hopefully be a lot better here is the validation was really hard because for some reason we're putting all our documents in a tar ball. People don't know how to pull in a tar ball full of SPDX documents to validate it. And also the data was just huge. So a lot of the especially early validation tools for SPDX2 just couldn't handle the sheer amount of data we were producing. For like a root file system image that we would generate, there was 100 megabytes of data for an embedded system. We were generating 150 to 200 megabytes of SPDX. So there was more SPDX data than actual data. Yeah, turns out compilers are really good at compressing things. Yeah, so we had to hand validate most of our SPDX output, at least parts of it, just to see if it was all. SPDX3 actually has a formal shackle model which helps a ton with this because there are tools that will just take in a shackle model and take in your document and say this is valid or not. I don't think it covers like 100% of the things, but it's miles better than what we had for SPDX2. And as a bonus for this, we're working on a process to automatically generate language code bitings from that shackle model to make it easier for people to write tools and stuff. Another problem we had was the CV and vulnerability tracking. SPDX2 didn't really have a mechanism for saying how vulnerabilities have been addressed. So we were tracking the things we built, like their CPE, but then we were also patching things for the users and there wasn't a way to express that in SPDX2. We did it in SPDX2, but it was very specific of the way we did it. The only way you would have known is if you had known, oh, this came from the OctoProject or the OctoEA generated us, but I can look in these special annotations and figure out that this was patched. It wasn't standardized at all. SPDX3 has VEX compliant encoding for vulnerabilities. So you can actually say, yeah, we know the CVE applies, but we also already fixed it for you. So yeah, it's great. It's really complex for the SPDX3. VEX reporting is very powerful. It's kind of what it looks like very quickly. All right, so where we are today? So this is what our SPDX3 model looks like today. I'm sorry, I have a lot. This is what it looks like today. I did provide the link there if you want to go look at it. It's in a different format. It's actually very similar to the SPDX2. If you dig down, we're doing a lot of the exact same relationships just better and more precisely, which is really nice. Yeah, there's that link there you can look at. I will publish my slides. There are, it's not a free lunch. There are a couple problems that we've seen on the horizon with SPDX3. The big one is we have very strict requirements on we can't be pulling in any external dependencies to process our SPDX documents. I don't think we're unique in that situation. So how are people going to successfully, how do you successfully interchange SPDX data in a very minimal dependency way? And that's kind of my main concern because what I'd really like to do is not only generate our own SPDX documents, but also make our build process link in documents from upstream sources. And the only way I can do that with no dependencies is if there's a standardized way of interchanging SPDX documents. So I really, really want that. And that's kind of where that at context comes from. I don't like context in my documents. Just no, no, no context. I just want everything there. But we're talking about that. So my closing thoughts is SPDX3 is, it has a much higher ceiling. Right now we're just doing the same things that we were doing with SPDX2, but it, there is a lot more that we will be able to do with SPDX3 just looking at the things that we could, will be able to do that were just basically would have been impossible with SPDX2 and linking documents together. So much easier. So much happier with that. No one likes tar balls full of text files. Other talks, this is other talks I've been getting about SBOM generation, open embedded. It goes into way more detail. Check it out if you're interested. Any questions? No. Yeah, no, no. Cool. I'm here.
How to make SPDX industry standard for AI/ML
I think it's good. Okay, thank you. So, yes, it's going to be very quick because it's a 15-minute talk, so it's kind of like lightning speed. So if you want to get anything from the slides, this is your chance, and it won't be a long time, so just like take a picture, and then we'll move on from there. Right, three, two, one, go. Check. Yeah, I love open source. I am working in open as a self, so please come to our stand and talk to us. So, yeah, I don't have to copy this, but you come to our stand. That's what you have to do. So it's just downstairs at level two. So, who have read, let's say you open a pack of Crips, who have read the ingredient list? Anybody? Yes? Some of them don't care, some of them don't care what you're eating. Okay, so this is what you need to know. So you need to know what you're eating because you may be allergic to some of the ingredients, or maybe like, you know, it's not very healthy, I should not eat that much sugar and that kind of thing. So that's why it's almost like a universal standard that we have to list what is actually the ingredients of the things that you are buying, like a pack of Crips, or a cake, or a soda, whatever you're drinking. So, that consumer group now is like, you know, the scope has been expanding. It's not just for food or some hardware that you're buying, but also software that you're using. We just like, you know, talk about, you know, in Europe we have the CRA and PLD. You know, PLD is now going to maybe encourage software, so we have to be careful about those. So, I know this is the S-bomb death room, so maybe a lot of you are already expert in it, but just for those maybe people watching online that is new to the concept. So, a bit of materials is, you know, we have to list all the components in your software, including those open source, that, you know, all the dependencies that you are actually using. So, yeah, so you have to list not just what you're using, but also the detail about what you're using, like for example, the license that, you know, is part of the component. And you should know anyway, because, you know, check the license, maybe you have to be open source if you're using some open source software. So, you know, for people who don't check, then ooh, alarm. Also, the versions, because, you know, versions can be very different, especially like different major versions, there's very different from each other. So, if you don't actually know what your dependencies, which version you're using, that's also a big no-no. So, these all these things kind of like, you know, your pack of crisps, that like, you know, you have to list your ingredients and you have to know what you're eating, what you're consuming. So, go back to the pack of crisps. So, do you know that actually you can't just say like, you know, list whatever you like, you know, because there's actually some standard that you have, you know, when you look at all the packaging, they have a certain format that they will have to follow and they list their ingredients. So, for example, I use a U.K. of Monaz and example, because I live there, you know, I do care what I'm eating. So, if your food or drink has two or more ingredients in it, then you must list all of the ingredients. So, basically it means that you can't skip an ingredient, even though it's like tiny, teeny amount. Also, it needs to be in order of weight. So, usually if you see like, oh, how much sugar, for example, I'm very concerned about that, like how much sugar you have in the food that you're eating. So, you see, like, where they are. So, if they're the first thing in their list, probably like most of the things you're eating are sugar, right? So, yeah, because it's list in order of weight from the most, you know, percent of weight first, so that's mostly what you're eating. Also, if there's any like allergen, because in the U.K. there's like a list, I think it's like 12, but I may be wrong, allergens that need to be clearly, you know, shown. So, those will be, you need to highlight them with maybe different forms, different color, or different background color to make sure that, you know, people are allergic to those, very aware of they should not touch that thing. So, also for S-bomb, we want to have some format, right? Because, for example, I'm allergic to, let's say, I'm not really, but if I'm allergic to nuts, right? Then the standard format of how to list the ingredients make me very easy to have a look and see, oh, I should not touch this chocolate because there's some nuts in it and I would be allergic and I get really sick or ill or maybe die if I eat it. So, we need some formats to show, you know, what the software is actually made of, right? So, that's why S-B-D-X is coming in to kind of have a standard that you could, so I have to speed up again, like, running, like, so it's very good that it's kind of like a standard-sized machine and human readable, so it's very good to make it very clear that everybody can just easily consume that information. So, in X-B-T-S-2.3, it's pretty good for software because, like, it's, you know, it will show all the, like, common, you know, CEV advisory, they've all been linked and referenced to it, it's quite clear. Also, it meets the, yes, executive order of, you know, the S-Bomb need to be, you know, meeting those standards and it also needs to, you know, have the ISO standard, so it's standard, it's very easy, it's very consumable. So, and also, what with Cosine, you know, Sixth Door, I love Sixth Door, I work over on SSF-Y, I come to our stand to get a sticker. So, S-B-T-S-2.3 is good, so if you are using any, like, software, like, you know, maybe, apply today, but we can make it better, right? So, traditional software build material, like, for example, for concrete or for boats or for cars, the hardware is already covered. Now, you know, we have software, now we're doing it, you know, S-B-T-S-2.3 is pretty good for that, but if you think about what we are having in the future, right, we have AI, machine learning, these are the, you know, the words that everybody is talking about. They have a huge component that we actually traditionally, like, software we don't have, for example, a lot of data. We have the AI model and all this stuff, so we have to cover them as well to hit the target, I would say. So, there's the new model, you know, S-B-T-S-3.3, so now is the release candidate, so you can already have a look. So, it has new model that will cover, you know, security data and AI, which is great. It will support database better, so for those like, you know, machine learning, you know, using a lot of data, then it's also covered up. So, domain-specific information as well. So, yeah, so I think for those, if you are like, if your software is actually, you know, going to like AI and stuff, so maybe you should look at model 3.3. So, that's great, but I have been to a lot of AI and machine learning conferences, nobody is talking about S-Bomb and S-B-T-S, which is, ah! So, for them, actually, they really need to care about it because for AI and machine learning applications, there's a lot of like risk that make them vulnerable. For example, data is a big problem. We have like, you know, data bleach and other stuff that happen all the time. Every time we heard about it, it's a big yike, so we don't, we won't avoid those. So, also, the system is very complex, right? This was a black box. We always talk about machine learning models. We don't know what's actually going on, so it's like, very scary stuff. So, also AI Bloom, now all the, you know, VC money is going to AI, and everybody's talking about AI and stuff. So, they may want to rush to get things done, so we need to please the, you know, the funding, you know, the funders, right? So, yeah, so they may be less careful about what they're doing, just need to get it done, you know? Also, there are new vulnerabilities that is not like, you know, there's new technology, there are new vulnerabilities. Now, there are some people hacking around like, problem injection and having fun, but it can be very serious if it's not just fun, if people are using it for malicious thing, it could be very bad, like, consequences coming up. So, we need to, we need to make the community adopt it very quickly. So, I always think about that, like, for any tools, if you want a very good adoption, you need, you know, good tool, and outreach. So, those two go hand in hand, you can't just skip one of them. Very bad tool, going to tell people, people still don't like it. Very good tool, but nobody know about it, nobody use it. So, you need both. First of all, you know, now, you know, for example, we want a very profile for S-bombs, because, you know, also it needs to be easy to start, we don't want something very complex, because, you know, AI machine learning people, they want to, you know, make, you know, work on their model, but not care about these, like, compliances and things like that. Also, we want to make it a universal standard, instead of shopping around, they have to choose which one to use. We want to make it universal, so it's like, no brainer, just go for that one. Also, satisfy the policies, that's very important, because at the end of the day, you have to satisfy the government policies, so you can make it an adoptable production-ready product. So, outreach, we have to, like, show more examples, show people that you can use it that way, so it's like, very easy for them to learn. You have to have use cases, show that some companies, they have some, yeah, exactly. Some companies, you know, if they make a successful case of using that, make sure everybody knows about it and follow. Education, tell the machine and AI community that, like, you should care about security and compliances and all this stuff. You can't just, like, you know, make it work and voila, and you know, everybody's happy to get all the VCs money, right? So, SPDX 3.0 is pretty thorough, it's pretty good, so this is good, keep on continuing doing the good thing. Also, communication with policy makers, so I'm sure that, like, there's, that is ongoing because, you know, at least foundation, we have some people, you know, keep talking to policy makers and that's a good thing. So, how can we make it into, like, a very well-adapted model? So, outreach is a key, so we need to create universal standards, so that's why we need to make more outreach, tell people that, follow this, this is a very good standard, and go to the, where the community is, right? Like, go to those, where people, like, go, this AIML people are, they go to their conference, you know, those machine learning conferences and stuff, like, go there and tell them what they should care about. And also understand their need as well, it's a bi-directional communication, we want to see how we can make it easier for them, or what can cover what they're concerned, maybe, you know, their consumer, the consumer of their product care about these kind of things, so we can cover that for them. So, so at the end of the day, after understanding their needs, it will go back to a good tool, so it's kind of like a very good cycle that we can keep doing that. So, cost action, that's the last thing, adopt, SPDX 2.3, if you're, like, you know, now ready, so just adopt it, but you can also help try the release candidate, you know, try, you know, help contributing to the new 3.0 model, also engaging in outreach activity, go and tell people, use it, right? And also keep communication, like, that's the part I love most. Communicating with policymaker and the user, we have to get everybody involved and have a good communication going on. So, that's the end, I hope I didn't overrun too much, so, yeah, let's make SPDX in the system, ML, and get the slides, and talk to me at the stand, thank you.
Application of the SPDX Safety Profile in the Safety Scope of the Zephyr Project
Yeah, okay. So hello everybody for lunchtime talk. We'll tell you a little bit about the SPDX safety profile that will come about 3.1, so I'll give you a short introduction to it and then Stan will show you how we actually apply this in the Zephyr project. So generally what we are speaking about is we always have systems or systems of systems and plugging this together and knowing what you have and knowing if you have all the evidences. That's the issue that we are currently dealing with. For those who are not familiar with functional safety, it's just a do no harm thing. Your system should not kill you or shouldn't kill anybody, should not harm anybody. For that, you need to know what you have, how things work, that you have corrective actions for unexpected behavior, that you can catch things, detect things, and that you have the evidences that you really do so. Yeah, most people always then give me the backwards saying, hey, yeah, safety is a system property, we cannot do this on element level. Yeah, we can't, but we can make sure that our element brings everything on the table that you can trust, that it behaves like you expected to behave. So usually what are the tasks and how do we know what to do? Yeah, there's a safety standard around that tell you what to do, they will tell you what the deliveries or deliverables that you need to show. They all come down to the same stuff, you need unique IDs of what you have, you need to be traceable, what you wanted to have, what you did implement, how you trusted it, all that to show that you're complete, to have all the evidences. And let's say in software speak, it just means you need to define your dependencies within your project. Yeah, functional safety projects come down to a lot of things, mainly documents. So I don't want to read this to you, but it's a lot usually and it's quite a pain. But these documents are in relation to each other. So the Meme model, which is originally a process model, more or less when you look at it from an informational knowledge point of view, it's a dependency model. And we need to keep these dependencies up through the whole life cycle of something and through all the releases, through all variants, through all bug fixes, vulnerability fixes. Yeah, variants are a big topic. And yeah, what we can use here really are these S-bombs that we have in the software world now. They are machine readable, they are exchangeable, and we can leverage that for our dependencies. And yeah, lucky us, we can express these dependencies in the V-model always relationships in SPDX. So how is the real world at the moment? So we have more or less three types of documents in our safety documentation. We have plans, processes, guidelines, how to do things. We have the actual specification, what you want to have, how is your structure, and you have all these verification and analysis evidences that you did things the way you have been intended to do so. Yeah, these documents are living all in their own little realms. They live all in their little formats, their little tools, and they don't talk a lot, which is other. So traceability usually breaks between the tools. It's quite a pain to keep this up. We have a very solid solution, mainly in the automotive world. The nearly most loved engineering tool, we have exalists. They usually also come with a very distinctive name that's very unique. And yeah, that's how things are. So why not put it in a little bit orderly fashion? Use S-bombs, S-bond types, SPDX relationships to really structure a project in a way where you really can use your relationships. And that's what we are actually applying in the art and the SAFRA project. SAFRA is a really brilliant little art house. It comes with its own build system, with tests and the framework for all that you need. We're currently adding all the systematic capability stuff like requirements, plans, the safety analysis, and we are using StrictDoc, the tool of stand to do so, and yeah, want to express this also with the SPDX relationships. So for example, for requirements, this can look like this. We have on the top level, we have the plans that are in a relationship to each other that specify how things should be done on the specification documents level, that specify that in the end what you need to implement, and you have requirements, acceptance criteria, meaning yeah, that's specifying your reviews, your tests, all the analysis evidence that you need. You can roll this out really through all the lifecycle of SAFRA, so we're saying okay, we have the concept and planning phase, we have the actual implementation phase, we have the tests and the build, everything, we can really put this all in SPDX relationships, so that once we have an issue, it's not like blind looking for the course that we had as we had before, but we can really follow our relationships through to do analysis, to do really maybe even automated analysis, and we have everything in place to really identify why do we have the issue, what do we need to do and what do we need to update in order to fix this issue. So, thanks, Nicole, thanks everyone for having us here today. Everyone is super fast presenting, I'll try to catch up. So before we go to talk about the tool, I actually want to mention a few issues about the requirements engineering in general, this has nothing to do yet with SPDX, so many of you probably know that the commercial requirements tool can be expensive, and they can be sometimes ridiculously expensive, and so one example question, how can you build a working group that needs to actually collaborate across organizations and across tools and across whatever, crossing all sorts of boundaries, what if you need to exchange all of these requirements, and very often also the worlds of requirements and software are not connected, so there is an original, let's say, Excel document that is started somehow with a bunch of requirements, but then developers take over, and they somehow get excited during the development, and then no one really knows what were the original requirements that the system was implemented against. And so the waterfall model actually is not designed to play well with the open source software development, and let's say the outcome of this is that very few open source projects are actually developed according to formal requirements. But everything is slowly changing, so by now I counted over 12 tools of various degrees of maturity, and StrikDoc actually, my tool is one of them, and the key question that I am trying to answer for myself and maybe for, let's say, the subset of the industry, can we actually make requirements useful for open source software so that developers are not annoyed, let's say, when working with requirements, and that actually requirements become first class citizens in open source software. And so I created this tool in 2019, it's originating from the spacecraft avionics, we had to exactly specify the onboard computer behavior, and this is how we got started with just, hey, what if we just do the text-based, git-based requirements management, generate something, and it turned out that actually there was a tool already called DoorStop, and I started contributing to it, and at some point I just realized that I'm moving a bit faster than, let's say, getting some patches in DoorStop modifying the code, and so I ended up doing something of my own. We spent most of the time actually polishing the HTML documentation generator in these years and somehow working out the traceability graph, and the previous year was literally the year of UI programming, so it was a lot of just making the UI support what is written in the text files. The project goals is to make a tool that just allows you for free and in a nice way to work with requirements. All groups are considered, we pretty much achieved the goal of being able to work with just PIP install in five minutes, you can start creating and publishing your requirements, and one core thing is that it should be very easy to get data in and out so that no one should be locked into this tool way of doing things. Then this is the text format that we came up with. It's a combination of a bunch of formats, but the main use case is that we need a format that supports both text, writing documents, and metadata, and so this hybrid is a practical compromise somehow, how to achieve both. I'm happy to be challenged on the specifics of this format, but this is what it is for now. By the way, the requirements are actually this statement thing, so you always have this shale of statements, and that's the core of what a requirements tool should do normally. For Zephyrus, Pedex and StickDoc, a big thanks to Roberto Bagnaro of Baxank, he made a connection, so we introduced basically that this tool exists to the Zephyr community, and that's how we started working together. There was a small competition, and the group after some time selected this tool, and right now we are structuring the requirements and writing the requirements in StickDoc. I will show just in a second all this. The interfaces of StickDoc with Zephyr are actually these text files with the requirements with the source files of Zephyr, the design documentation of Zephyr, and also as of recently, StickDoc can also produce an SPDX file with the requirements that connects to whatever parent Zephyr SPDX to become somehow. So now I'm really short on time, so I try to just jump over the screens. 15 minutes. 15 minutes, okay. So this is actually the UI that we will be, we have been working on so hard. The idea is that, let's say, in the IDE, we have the text files, the storage and git, so you can see this pretty much blocks of requirements. They have statements, they have meta information, and then our effort was to lift it up to the UI so that all this becomes editable. And so we got a bunch of features. I just created one requirement, which is called a FOSDEM requirement to show you how this works in Git. So in Git, this looks pretty much like this. So you'll see the IDE is already recognizing that we're doing the Git-based change, and this is how it can be committed. Currently, we don't support auto-committing, but the UI cannot commit the changes manually, but we can push it using a second terminal, just push these changes and create a commit that will publish this text. But what we try to do, sort of really hard, that all this is possible to be created in the UI, and you rarely, so the more and more rarely you ever need to use console, so this becomes more and more automated in the UI. We got some other cool things. So for example, there is now a diff feature, so how can you actually do diffs on the requirements documents? So when you not just want to do a Git diff, but you actually want to explore what are the... Sorry, let me just... It's a bit funny speaking this way. Sorry. We are comparing the documentation graph, and for each of the requirements nodes, we specify actually what... What... So... So... So... So... So... So... So... So... So... So... So... So... So... So... So... So... So... So... So... Lend, somehow, this is a small demonstration. Let's say we have dummy requirements specification and has a child requirements specification, and that child specification also links to the example. For example, for the snippets, for the requirements, it's possible to jump to the files and trace how they actually link to the requirements in the source files. So it becomes a sort of a browser for SPDX information with respect to requirements. I'm not intending this to be an SPDX browser, I just needed some way to visualize the requirements as we are working through them with Zephyr. And then I come back to the slides. Let's see if we can do this. Actually, maybe this way. Yes, so a couple of slides, some back up. I think we mostly turn it to the conclusions back to Nikon. Oh, yeah. Yeah, so the conclusion, let's say, notes to that is that if we use this approach, we really get finally a first way to do the impact analysis in an orderly way. Usually an impact analysis is, hey, who knows about this? And you get the people in the room and just depending on who knows what, you might do an impact analysis. And there might be this tool here, that tool there, and you're a bunch of Excel lists that explain to you what could be the right relationships between them. So this will give us an idea of how things really are in a project, in a certain release, in a certain configuration, on a certain point of time. When we use SPDX and S-bombs for this, we can pack it, we can sign it. So it's really, the integrity is kept completely. Yeah. We can formally demonstrate completeness. So speaking, automated safety cases, automated assessments, everything that we really want to automate to not always have somebody manually checking is everything is there. So really to have checks, if you have gaps somewhere, when you configure something, something might come up. You have your relationships that break. You have maybe your hashes that break because you change something in there and not the same as you would have expected them. So it's, I'd say it's a really very transparent way to see things and connect things in a safety project. Yeah, with using StrictDoc, we now found something that's pretty painless to use for that. Also with respect to, yeah, nobody wants to write requirements, nobody wants to use doors, obviously. If you have just a tab open with your requirements document and you can export it and you can even edit it if you're not a coding person through the web browser. So yeah, from the several projects perspective, that was God sent. And from the SPDX perspective, I think we also have so many use cases now. Okay. Perfect. Yeah, thank you. Questions? Yeah. Yeah. Really like that. Most systems are systems, systems, systems. So you have a very high level of requirement. Yes. Find out to multiple people and they've got multiple possibilities. Potentially some commonality as well. How do you, you know, consume this working on a small software project? How does this scale to something that might be a, mission critical system? Yes. The question is how does this approach and the tools that we're using scale to larger projects. This is what we have to explore together. So we are starting, the tool itself started very small, but then it was suddenly got the pressure from Zephyr and it has to already address several requirements coming from many directions. And so as of now, it's somehow on the starting, I would say. So I mean, one of the main benefits of having like dedicated requirements, you know, tools, is that if that's one requirement, you actually like notified, okay, and this all these other requirements that are somehow related to that requirement might need to be updated. Is there something that StrictOp can do? There is no automated impact analysis. It's on the roadmap, but for now you can see the traceability graph. And if you're interested about a specific requirement, you can just, sorry, I didn't repeat the question. The question was if there is an automated way to highlight which requirements are affected by a given requirement that gets modified. So it's on the roadmap, the fancy way of doing it. But for now you can already use kits, you can use the change log, and you can use also the traceability graph, which I didn't show, but it shows this deep traceability into, let's say, how all the parent requirements flow down into the child requirements. Yeah. Jos, there's a question. Speak up, please. There is a time. I'm gonna make this my time. Thank you. Go ahead. Sorry, I hope. Can you use the mic? No, I can't. Then we don't lose information. Here you go. Yeah, I was just curious about the framework that you've set up. So it's really awesome. So I'm curious with the requirements and specification, is it like a predefined structure, or is it just like a more abstract data type that we can impose whatever structure we want? It's both. So first of all, you can do whatever you want with the document structure and the metadata structure requirements, all these best practices. You can follow whatever you want. One thing that is happening is that some of the public formats, we are collecting in strict doc format, and then you can use them directly for your traceability. This is one thing. The second thing is if projects like Zephyr or whatever other projects would use this, let's say seriously, there will naturally be some best practice emerging because we are preparing tools, we are preparing the approaches, and yeah, maybe you. Yeah. There's nothing to add, I think. Exactly. So it's. If you want to, you can add a layer in between, and that's awesome. Yes, yes, it should be. That's cool. Thank you. Here about how you practice in tool fields, the requirements, then you take your template to the site. That was one question. Is this stable? If you manage to pull a file, will you still know that it's a tool field, and can you also link to the test to see in a tool field? Yes. The reason. That's why it was. Sorry. Yeah. Yes. The question is how to automate, how to automate, let's say, if a requirement has all the links, and if all source files links together. So for this, exactly the search query engine was created. You can query the graph, and you can implement or script your own things, or there are already default set of checks that are implemented. For example, you can ask questions like, are all parent requirements, sorry, are all root requirements connected to some child? Or are all child requirements are connected to, at least one parent, and things like that? So it's totally, let's say, a flexible territory. You can define checks that you want, and even script it yourself by using the API. Does it answer? Do you have plans to build artificial intelligence to it? Because an expert, when I'm looking at requirements, I can see bloats, or some things that at the time, I see it's there. It seems like a very machine field for child. Do you have plans? There are different kinds of requirements flows. Yeah, again, sorry, I apologize for that. So the question is, if the tool can use AI to automate the detection of flows, and the answer is clearly yes, but there are different kinds of flows which are easy or not so easy to automate. So for example, the syntactic stuff is easy even to just lint. You don't even need the AI. Where this could really help is to improve the readability of the requirements, or let's say teach a tool to follow some kind of guideline. For example, the InCase requirements altering guides provides recommendations. They are even numbered. You can literally take them to the tool. It's on the roadmap, but not yet implemented. So the answer is totally possible, but we're just not there yet. Let's thank again, Nicole and start. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Really great to get some AI. Thank you.
SBOMs that you can trust - the good, the bad, and the ugly
you you you you you many years on porous soil. is useless and in fact it's dangerous because you may be making decisions that you technically cannot trust. You don't know the license that you are making sure that it doesn't get into your production environment. It is there. So we need to make sure that our S-bombs are uniquely identifiable and forcible, complete and available. And now we need to actually figure out how can we build that? How can we build a system that could give us some... We'll answer some of those questions that we were talking before. So we came up with this diagram which has five main properties. Availability, uniqueness, integrity, provenance and generation. It goes all the way from the life cycle of an S-bomb. So for example, availability, can you make sure that the S-bomb is there? Can you make sure that the uniqueness? Can you make sure that it's the one that you're looking for? Integrity hasn't changed. So on and so forth. And what we're going to be doing now is we're going to go through each one of those boxes and we're going to be finding some open source tools that you can use today and how can they play together? And the first one is attestations. So attestations is just a JSON file. I'm already simplifying this but somebody will kill me but it's a JSON file that can have arbitrary information about a software artifact. So if you are building a container image, you might want to store there how you're building information about the build system. You might want to know who approved the change. You won't want to any arbitrary information including obviously the S-bomb. And the second part that is very important about attestations is that it is authenticated. And what that means is that it is signed. And by doing that, if we put attestations in our provenance verification layer, we will basically manage to have S-bombs that are integrity and provenance verified. So the way that we could do that is by taking an attestation, wrap the S-bombs inside, plus some additional information about the context of how the S-bomb was built and sign it. And the nice, the good news are that there are actually tools today that can help you to build this flow like using SIGSTOR or in TOTA framework which is the attestation framework. And most recently we have S-bombs which is this effort to get one step further on how to know how that S-bomb was crafted and being able to make sure that, yeah, you can trust it. So we have now the provenance verification we're going to keep going. So at this point we have this JSON file super, like nice and signed and the S-bomb inside, this is good. But now we're going to keep going to the distribution side of things. And for that, there is this old concept that is called Contender Resolute Storage which again, it's a very offensive way of saying that it is a database where you can retrieve its data by the address of its content. But something as simple as that will give you integrity and immutability because you can commit overrides, you can detect that something has changed. And again, we could technically take one of those Contender Resolute Storage and put it in our distribution layer that way we can guarantee uniqueness and integrity. And for that, a tool that you might be familiar with are OCI registries which are all over the place and OCI registries is a good implementation for Contender Resolute Storage. So again, if we put together the whole picture and we use six-storey, total and cell-cell, the provenance verification and OCI registry, we will get some of the materials that you can trust in identity, integrity, and origin. And this is something that you can build yourself. This is something that you can put together, you can use cosine and sine, your S-bomb, and a rapid interstation, you can do that, all that. But now because we're biased, obviously, we're going to talk about a project that we started that it's an opinionated way of doing that. And it's Chainloop. So Chainloop is, we call that a metadata vault for super-supply chain. So basically it will take any piece of evidence, it will wrap it into the stations, it will sign it, and then it will route it to different locations. You can route it to different OCI registries, S3 buckets, dependency track, WAC that have been mentioned before, so on and so forth. And Chainloop gives you two additional things. So the first one, it gives you an evolution of what counter-dressable storage would be, which we call federated counter-dressable storage. And what it does is create this layer that makes sure that you can have data distributed across multiple back ends. So you might have five different OCI registries. You might have S3 buckets, or maybe from the same, or different organizations. You might be using this to send some of that data to your customers. And the other thing that it enables is advanced routing for replication. If you have some requirements for geolocation, for example, we have a user that needs to make sure that some data that gets generated goes to a back end in Europe because of data policies, so over attention rules. That's one. And this will give you availability. But the second part is, as mentioned before, it's collection. How do you make sure that you can reassure that your organization is collecting this piece of evidence, collecting metadata? And the way that we did that Chainloop is you can write a, we call it a contract, but it's a declarative statement of the pieces of evidence that your developers on the left side will need to provide. So if we put everything together, as you can see, the main pieces in the middle, they're the same. Chainloop doesn't use all these open source components, but extends on the availability side of things and on the enforcement. And now it's time for the demo. I have a microphone. Yeah. I'm over here. Wireless. Okay. Okay. So now let's see how we can do it in practice. So what I'm going to show you, I'm going to show you three demos. The first one is about collecting a cycle in the X and artifact. We are going to wrap it into that station. We are going to store it in different cloud storages. In this case, it's Azure. And we are going to send it automatically to dependency track and to Quack. The second demo is about, hey, I want to scan my, let's say, a jar file for CVEs. But I want to find the S-bomb and I want to find the latest VEX file. S-bomb may not be generated by the same team, may be generated by some other teams, let's say, by someone who is going to put it in production. And the VEX file may be generated by, I don't know, the security team in my organization or someone else who is pushing the application to production. And the last demo is about how can I share my S-bomb or any metadata or attestations with others. So let's jump to it. First of all, what I would like to show you is, one sec, can you see my screen? Yes, this is the Chainloop open source project under Apache 2.0 license. We are going to talk about Chainloop architecture for now. We have CLI and we have a service. So let's assume that right now I'm an operator or platform team or city office or the OSPO team. And I have Chainloop branding on my site. And I have a few storages connected and I have dependency track. I'm going to show you how does it look like. So I have a Chainloop, the latest version of Chainloop branding here. I can see all different backends connected. We are going to talk about the OSCI and the Azure Blob. I have different integrations. We keep adding new integrations to Chainloop every week. But the main one we are going to show you today is dependency track. I have one instance of dependency track running at home. And we have GWAC as well, but unfortunately I will just show you the documentation that we have because I don't have it running. And yes. Now you understand that I have the Chainloop branding and I have also that contracts in place. So this is my contract. Me as a platform team or as a security team, I expect some metadata to be pushed to Chainloop and I can modify that contract and I can see how many different teams are following those contracts. And I have everything ready. I have a token. I will pass the token to developer. And now I'm developer. So the developer, we are building a sample Java application. It is the demo spring clinic and we are doing pretty well. But we have been asked to integrate with Chainloop and we have been asked to start sending new metadata. So it was quite easy just because we have a reasonable workflow. Of course at some point we are going to have the GitHub action. But the reasonable workflow is just fine. We just need the, we call the Chainloop robot account. But we will have more organization tokens at some point. And I have to specify how to find those artifacts and S-bombs and Chainloop will do everything for me. If you don't use the workflow, just to show you how the CLI does it. Yes, here. So like our flow, the user experience is following Git flow as well. So you initialize the station and then you keep adding new artifacts, new evidences to the station. We validate all of them. So we are validating the contract. And then in the end you can even add different annotations. We didn't talk about annotations here. But we add some kind of annotations and we are pushing if everything is okay to Chainloop. So now imagine that I had my workflow run again with Chainloop on my project. We pushed all the different artifacts. Yeah, we got notification from Chainloop. Great job. You are making compliance and SecOps teams and Ospo teams happy. You get some summary of the contract. And yeah, that's it. Like we didn't have to do much on the developer side. Let's get back to the operator side. On the operator side, let me get to my script. What I want to show you, I want to show you that now we can see all these different runs for the very specific workflow. I will show you the attestation, the envelope and what kind of information we are storing because we are not storing all the metadata. We are storing everything around. And today in this very specific basic case we are storing just different environment variables provided by GitHub. So let's try to run those commands. Oh, sorry. Yeah, we are here. Okay, so as you can see we are collecting all information about those workflow runs. So you have some kind of visibility across all these different CI tools that you have across organization. You have the client envelope. You have in total statement with all information inside. It's not about like we automatically detect the head of the grid repo. We collect some other information about that commit. Those are environment variables I was talking about. You have annotations and information about the second DX and the artifact. And this is the same, but in more user friendly way. It's also verified because I have the publicly available in this machine and I have the proper environment variable defined and set. So now what I would like to show you is what if I'm someone else in my work or in my project and the only thing what I know, I know the shot or I have the the the the draft, the output that was provided to me. One nice thing about chain loop is that I can discover things about. That binary. So chain loop allows you to to to ask questions about. Okay, so I have a jar file or I have a git commit or I have a content image. I will get the we call it software barcode. I will take that hash and I will ask chain loop to provide me with more information. And now I can see that there are different attestations generated by different teams by different organizations by different people related to the very specific to the very specific. Container image or jar file in that very specific case so I can see that we have vex registry workflow which is ready to which is adding different vex files for that very specific. Jar file or other container images or jar files in my organizations. I have the main attestation that's the demo spring. But cleaning PR workflow. I can probably get jar fire from here and as bomb and I have some other the stations. This one is also related to the vex file. So what I can do. I can right now get some of the of those attestations and I can even download. Very specific artifacts just by providing just by providing show. And that gets me to the next demo. So what can we do about that right like we are storing all these different data we are throwing metadata and s bombs and vex files and everything was related to what we are building a chain loop based on different code. And I can build a very simple tool and the tool is about scanning for cities. I want to do something different. I want to get a very specific as bomb for my organization that is following very specific conventions. And I want to find it automatically. I want to retrieve the latest version of the vex file somehow as well just by knowing the the the hash of that of the jar file and that's what we are doing here. If you take a look at the of the code of the tool I wrote yesterday sorry it's bash. It's not go. We are using the sea lie. To discover different information about the very specific hash of the jar file. We are we are getting attestations. We are getting the latest one for a very specific workflows in this case for the most print spring pet cleaning and for vex registry. And in the end we are running trivy or you can run any other CV scanning tool with that very specific as bomb and Jason and vex file. And the result is that if you run that trivy with that drop file you will get CV as you can see. Sorry. Yeah, yeah, I see. I'm getting stressed. There is one there is one there is one there is one CV out here and we are removing it just by applying our vex file. Just because we understand that for instance this application is run behind the firewall so we are not concerned. Let me get to the last demo the last demo is about sharing. Everything like you can you can you can mark workflows as public at this point you can either have something starting organization or you can make it public so our attestations are also shared. If you go to releases you can you can see that our attestations reports are available. You can either have the UI but the best part is that you can actually get the the the Jason and you can automate different processes in your organization this way. Yes, so just just very quickly. Okay, just just to finish. Basically the final thoughts is that the the bar in terms of compliance and security has been raised and as one trust is what we think is going to be the next challenge. I know that there are many challenges on the way on the all the way in the life cycle but you know as one trust is something that we need to start thinking about. But the good news is that we can start thinking about that we can start building our own you know we can get that head start with open source security tools today and that's that's why we are very excited to be here. And that's it that you can find us in discord please join and yeah and you like what we do give us a star and thank you for your time if you have any question we might have I'm not sure. Hi, I'm Olle Johansson. You set the bar very high in the beginning saying as one trust and then you quickly said something about the public key you already had. Yeah, you mentioned six doors. Yeah, for individuals most of the right. What kind of solution do you see for trusting as well and validating as well as between customer and vendor here. You didn't really touch. No, okay, so the question was that we will be talking about trust on the signings of things but then we should have them with the public key and private key and what do we think is going to be there. I mean that's a very good question. I mean the current implementation with Chenle piece is you know it's a good implementation of Chenle. Obviously we have you know killer signings on the on the on the horizon to implement and and work like the entity and all this good stuff right this is basically the nature of the open source project but in terms of customers. I mean this is tricky because we have we have we have users because we have users that actually more than the press out things and even they're like talking about just give me access to KMS and my old of my vault instance and things like that. So I mean it's if you have any thoughts the answer is we don't know yet I would say so if we have any thoughts on that I would love to talk to you and hopefully we can shape it in the right way. Thanks again.
12 months of SBOMs - an experience report
Right, I'm live. I'm green. Right, welcome to the post-lunchtime slot. Okay, some of you know me because I was here last year and basically I want to say what we've done for 12 months with S-bombs and it came out as an idea is about change. So I'm going to take explain about change. Some of it is about my tooling that's changed but a lot of it is about observations. A bit about me. I'm from Manchester. That's where that B is. I get asked about where's that picture. So that is a security hub for innovators for startups in Manchester trying to grow the ecosystem in the north. There's more than just London about tech, please. Normally most weekends I'm running around muddy fields at this time of the year so I've had a weekend off running muddy fields. My background is mission critical systems. For 40 years I was delivering mission critical systems. Think about big complex systems. So what Nicole was saying was my bread and butter. Those are my, those are what I used to worry about. Now I'm going to start up and I'm known as Mr S-bomb in Manchester. Didn't know what S-bombs were 12 months ago. They do now. It's all about a tool called CVEbin tool. So this has been presented a number of times and it's a binary scanner. It came from Intel and they wanted to understand what binaries were included in their deliverables and were they vulnerable. Common question. It's open source but one of the things we've done is become a Google Summer of Code project and each year we've added more features and I've been pushing the S-bomb world in there. So we added S-bombs, then we've added CVS, like Cizekirv. We've added EPSS this year. We do a very trivial triage. Let's say we might improve that with VEX, the world of VEX and it's got a thousand stars this week. It's very good. Open SSF best practice is interesting. I'm not going to work for Intel. It's a challenge sometimes in terms of do you have multiple maintainers and it's a challenge in open source when it's run by a commercial organization. So generally you see the reeds, calendar dates tend to trigger on GSOC. We found a little problem this week which is why we didn't release it this week but anyway it's very close to having a new version with all the EPSS stuff formally released. And then there's a tool that I write and I haven't got a thousand stars yet which is a Python generator and I take the installed Python and work out all the dependencies and work out all the bills. So think about Python. There's lots of direct dependency. What are those transitive dependencies? I'm agnostic to which version of S-bomb it is. I've always wanted to be Cyclone and SPDX. Initially I have written my own parsers and generators. I do want to migrate to the stable versions. It's all about time. But what I'm pleased about is there's a benchmark. We'll see then and it's the first one to get a benchmark score of 10 out of 10 which is but they'll tell you explain why you get 10 out of 10. It's quite hard because the ecosystem needs to play together. And this is what I do. Generally I just enrich it when I get the time and I've got a bit busy the last six months. So that's why I've stopped a bit. But anyway. So generally the sort of things we've been doing is adding more stuff into the package information using SPDX. So trying to get as many of the package attributes in to the S-bomb because you want enrichment. The more data you have the more usefulness it can be for the more use cases. That's hard because that data is not readily available. So what have we done? And this came out as a conversation, you know, a monthly open source meeting that we have and said it'll be nice to work out how much change do we have. And this is the S-bomb going to tell us what those changes are. So we put a get up action that runs about two o'clock in the morning. We clean the virtual environment, clean virtual environment, the bun two. And that's quite an important thing because that's going to come back later. We then install all the dependencies. And then we generate the S-bomb in the different forms. So whichever version is the latest flavor will generate. Do it. And 3.12 will become, I think, maybe this week, tomorrow. But generally we just ruin it for the supported versions of Python. So a little bit of a digression about Python dependencies. And it's probably if you went in Node or Java and things like that, you'll have, everything's got little quirks. The thing about Python is it tells you what the direct dependencies are. It can tell you a bit about the environment. So if you're working in Windows, you may have different dependencies than if you're working in another environment. But it says nothing about the transitive dependencies. How much is hidden? So let's look at the example. So this is quite, this is a subset of our requirements file. So at the top, you've got AIR at HTTP. It's got a constraint in the version saying the minimum version we require is 3.74. But it's also got optional requirements as well. So straight away you've got two potential two ways of installing AIR HTTP with or without that additional component. You look at beautiful soup, any version will do. And then you look at these down here, the import lib only installs it if Python version is less than 3.10 and it's got a constraint and similarly the import lib resources again only if it's 3.9. Because the Python library changes a bit like the early system, the language ecosystem is part of your partnership. And you can see the number of dependencies gradually change over time as you add more features. But what you get, that's what you really have, that's the hidden, that's the iceberg. It looks quite like an iceberg actually. That's one of my tools. And the green are all the transitive dependencies. Look how deep that is. That was fascinating. Pictures has a thousand words, I think we all agree. That's not really, if you really zoomed in, I've had to put the license values as well. That's an interesting thing if you could do some analysis. But actually that's quite visually, that's quite, that's quite a Iotner. And we've only got 60 packages there. So what have I observed by looking at all the data we've collected? And I want to look at the context, the context, a bit about quality, a bit about velocity which was the original thing, what's about change and then other things that are analyzed that I've discovered. And generally this is all out of GitHub, so I wrote a little utility to download the file history. So I could then quickly analyze it locally. And I ended up writing a little tool called S-bomb trend which then created it into a JSON file so I could then play around with it to generate pretty pictures which you're going to see. So first thing, there's nothing in any of the S-bombs that tells you it's Python. Or which version of Python, or which environment of Python. Now maybe Python, maybe FBH theme might, but that's actually quite important because you're going to see in a minute the difference what that means. Because if you just get an S-bomb and you don't understand its context, how do you know where, what, whether this is a real representation of the environment you're using, pick up what we were saying on the previous one. Cyclone DX has usually defined properties which you could use. SPDX doesn't yet. You could do comments but it's a bit harder, yes. Yes, I'm sure you could, yeah. So I use Cyclone DX properties just to say language, language Python, language version, something. But I think that's quite an interesting thing. It's good as SPDX thea's doing that because I think we need that is quite an important thing. And this is what you get. If you plot all the different versions of S-bombs across the year, the higher versions of the older versions of Python, it stops at p7 in the middle of the year because we stopped supporting it, but you see a trend. So that's the requirements trend and you see it sort of follows it and then there's a few other bumps. We didn't change it, the outside world changed. And sometimes you see it drops and that's because a package ceases using a dependency. It wasn't obvious until I did the digging up, but that's what that was telling me. It's quite interesting. So the lower versions have least a bit of a letter of dependency. You can probably sort of see that with the requirements file, but the requirements file is lost in the S-bombs. It's not there. So there are differences. Transitive dependencies vary independently of your direct dependencies. I think you could probably see that, but actually it's quite interesting to see the evidence. And the later versions of Python have the least dependencies. So that's a good way of saying don't just update your packages, update your language versions as well. So let's look at quality of S-bombs and that could probably have a whole conversation about this and a cold conference about it. So I've just chosen four tools because they demonstrate four different things. So the SBDX1 which is does it conform to the NTIA minimum standard. Look at the scorecard which comes from eBay, not the open SNF, look at something called QS which is from Interlink and look at one from me because it had something else that I discovered on Friday which was really interesting. So first of all NTIA. We are no different from day one to day today. We're still the same because we still fail to get all the suppliers. I would like to see how many people can get that on a real project. You can get that from small projects but not for real life projects. I think we all recognise that. Then the eBay one, one of the things they were doing is they were looking at package IDs, goes back to 10 o'clock call about the pearls and stuff like that. I didn't have pearls at the start of the year. I don't know. So my score went up. Enrichment, messages enrichment. Good and licenses have probably got better as well. SPM QS. This is done by Interlink. I don't know where they came from the idea but they have a whole load of different things they're looking for like licenses. Do you have other licenses still supported or the deprecated licenses? Do you have checksums for your packages, etc? That was a target. How can I get a better score as a target I started? So we get to 9.6. If you go on their website, most of the excluding S-POMP Python are in the sevens and eights. A lot of the containers are sevens and eights. So I'm quite pleased I can get to that level. The reason it's not 10 is because of the supplier failings, same as the NTIA changes. And then I have a tool called an audit. The reason I put that generated this was could you use the S-POMP to drive policy? So if you wanted to say I've generated an S-POMP and I've got a license like a GPL and I don't want GPL in the things, can I have a allow list or deny list of licenses, for example? That was the use case I came up with. But I also do it and I use the latest version of the products was the other thing I wanted to try and check. So I was getting reasonable number and the number of checks increased because I had more packages. Well this is the interesting thing I found. Scan came from last weekend. I scanned it on Friday. I was expecting to get 100% all the files were latest versions. Four of them got updated last Tuesday, which is why the green ones so happy. But there were a couple that hadn't changed. That got me thinking. Why don't packages change? Pinning. The world of Python is probably not to pin. They're indirect dependencies. I've no control of those. And I haven't quite got to the bottom of finding out where the pinning is happening because they're not even on the direct, the first level of the direct level where they are. So that was the reason that was there because I did a scan, an S-POMP scan, and what I got a vulnerability on my RSA. And the reason was I'm using, not using the latest version of RSA. So that was a weird, that was the sort of, could you detect that? So that's something that I only just discovered this week, which I thought was really interesting to share. I mean just if I happen to have that tool. So NTIA is a good benchmark. It's hard. Accurate supplier information. I think we all know the challenges of that. But date of enrichment is good. Can you enrich your desk things? Look for that threshold. Look at that utopia moment where you get 10 out of 10 for your S-POMPs. Because the more information you have, the more useful that's going to be for all the different use cases people are going to use your S-POMPs for. And it is possible. So this was the original use case. What's changing? What we're changing? What's not changing? Who's changing what? So the first thing is, and these are all driven by Matplotlib. So they're in the trend tool. So if you want to play with these as examples you can do. The top is the number of packages. The red line is the number of changes on a week by week basis. Every week one package, at least one package changed. At least. Which is good, the ecosystem's live. Is. Yes. Yes. So but that, you know, it's not, you know, what are the triggers for those changes? Yeah, some of the, you know, you can see some of the spikes relate to when we, when we did an update of the requirements. So that's, you know, you can see that. But generally things are changing all the time. And I was trying to show how to change, what's the, what's the rate of change and things like that. So this, I came up with this, like, train, train, flat diagram. That is showing a steady, steady going like that means it's changing every week. Except for the holiday. What? Except for July and all that. Oh yes. Yeah. Well, I think we can understand why. Yes. Actually, that's, that's probably quite a thing. Look at time. Time's actually a driver as well. Does lots of things happen in Christmas? Does lots of things happen in holiday periods? Yes. Interesting. That's enough. More people work on Christmas. Yeah. Well, I think we've seen problems where people have released something on Christmas day. And it's, but yeah. Anyway, that's a really good observation. I haven't thought that. Well, it's good. So you can see these things. And these are just the ones that have changed more than five times in a year. Because, you know, that's what 20 odd packages, more than 20 odd packages. And then if I look at, well, okay, these are the ones that frequently changes. Quite a few of them are direct dependencies. Why are they changing? Most of them are feature features, not vulnerabilities. Yeah. But actually, you know, can you find them? And there's one, a lot that rich. Why did they change? And they actually removed and unmaintained package, which then got me on another little track, which you're going to see in a minute. So yeah. Security fixes aren't the drivers for many of these changes for features. And then if I looked at the direct dependencies, again, okay, they're going up. Some of them are changing a little bit slower. That's, the case is no longer used. So you've got, again, you're getting quite a rich picture of change, which then says, if I pinned, first of January or second of January, I've missed all these changes. A lot of changes. Which may be, the features may be performance improvements, et cetera. You know, you might want them for good reasons. And this is what's the ones that have only changed once, haven't changed essentially. And the red ones are the ones that haven't changed in, I just took two years as an arbitrary value. And you think, well, okay, there's 10 of them that have not changed in two years. Does that not start linking a belt? It says, is it maybe not unmaintained? Is it now an unmaintained package? Don't know what industries have in terms of looking at the health of an open source project. Are they looking for the, you know, is two years long enough? And it says, maybe we need to look at alternatives. Right at the top there is Tom Lee, which is now a standard library within Python 3.11. Till I did this, I hadn't, I've missed that. So I then raised a pull request to say, if it's 3.11, we want to use the standard, standard library, not the open source version. So again, on the probability that the language ecosystem libraries are going to be probably better maintained or have a greater need for being maintained than necessarily community. Right? So change happens. But we could be very careful of pinning because direct dependencies change frequently as well. So there's a pinning debate. Right. Let's look at data analysis. Let's look at the first thing is languages, licenses rather. I've tried to look at the SBDX license IDs. When I get the metadata, try and map it. And if it doesn't quite match, do I have some sort of a few rules to try and alias them? So is it Apache space 2? Well, it's Apache 2 type of thing. And Apache 2 is a really good example. People don't know how to write Apache 2 SBDX ID license IDs. Yes? Are you pulling this from, from PyPy? Some of these come from PyPy. Yes. Yes. PyPy is a disaster in terms of specific license. Right. You've peached into the converted here. As a community, we should be looking at this and fixing it because many of the packages that have got license failures have been updated in the last 12 months. Probably because they've got features, but metadata doesn't really matter, does it? Metadata matters now in the world of S-bombs. Let's look after S-bombs metadata as much as the code and the tests. Told you I told you. Yeah. Yeah. Right. So I, so I summarise all the licenses and the things like that. And you see, again, you can probably quite quickly get a summary. Have you got a license problem? Okay. CVU and TIL is a GPL. Everything underneath that is okay, but you may be able to see quite quickly to see have you got a license compliance problem. The other thing is you can look at all the suppliers. Do you have a supplier that you really need to be loving and looking after? Because you're very dependent on your packages. This case, we've got 60 different providers, so it's not quite the obvious. But this could be a way of understanding who are your dependent suppliers that you need to be maybe getting closer to. Maybe supporting, maybe helping. And I'm thinking about the world of the enterprises as well, who might be needed, look, needed to do this. So again, but four of these packages have no suppliers. Three of them were updated. Why didn't they update the metadata? And then just a summary, I've got TIL that just differences two S-bombs, arbitrary format, don't matter that you can compare Cyclo and DX and SV8X. Just to see generally what's changed in the 12 months, well, there's 39 of the packages. I've had at least one version change. And we've lost two packages and we gained 11. So that's what 15% growth packages, number of packages that are dependent on. And then I did a scan. And that's the last one is I was expecting the last S-bomb to be as clean of vulnerabilities. The reason I've got one vulnerability is because of the RSA problem that we've heard too earlier. Potentially. So takeaways, I'm doing all right here. Right. Generate your S-bomb for each version of the supported environment you're doing. So if you're doing Python, 38 to 312, generate five S-bombs and also do it for Cyclo and DX and SVDX because the generations may have more, they may be different data, they may be enrichment between the two. Please as a community can we improve the metadata? We have all responsible to do that. Once you've got an S-bomb, that's the start of the fun. Start analyzing it, start using it, start reading it. It doesn't matter whether, you know, I'm sure many of you are quite familiar with reading JSON. Help the people around familiar with JSON. Look at the data, there's some documentation tools there you may find useful. This is the thing that we do when we install. We install with Python with this upgrade strategy, which is trying to make sure we're using the latest version of everything. But obviously that doesn't stop pinning. So it's interesting. I need to think a little bit more about that with Python teams. Keep your package up to date. I have a problem in my things because I just do pip install and they'll say, oh, I've got a beautiful soup. Yeah, that'll do. It's not the latest version, I'm sure. So just that's just be aware and use the latest version of Python. I have another tool called S-pom for files, which looks at the files so you can look at the change of files as well. That's a bigger thing. So it's just a thing. Could you start to see the amount of change in maybe one of your source trees and you repost, you know, or the test files changing, for example. And then obviously add vulnerability scanning as part of our generation. So this is what you all probably want. These are the list of other tools. The presentation is we'll be on the CVU in tool. There's a pool request in there. It just needs to be approved. Those are all the tools. I haven't written all of them. But if you want to follow me, that's me on LinkedIn. That's me on GitHub. And that's me in Manchester. Okay. Thank you very much. So on your list of increasing hidden dependencies, is it both a package or a package version? Okay, this is about the picture. Yeah. Yeah. Okay. So the picture that I showed, this is that showed the hierarchy of all the packages. They are all packages. They are Python packages. So if you have two different versions of the same package, they appear as one or? No, they would be pairs two if you had that. I've never seen that. Oh, okay. Yeah. Well, I would say that you live on an isolated island in Kingdom of South Africa. We are in the Union. And we have a presentation coming up. And I've got the divorce when you say that updates are driven by features. What do you see? Will that change in the future? Okay. The question there is, I mentioned the thing I said a lot of the updates appear to be driven by feature changes rather than security features. The question is, do you think that will change with things like the CLA? Probably. It depends. It depends. Yeah. There's no other one more. Okay. This is about the improving the metadata upstream. Probably the two things I would say is licenses to support the license compliance teams. And secondly, the supplier, because does that identify, do you know where you've got your software from? What can a large organization know that could sue the way it is? What can we do to help do that upstream? Use SKDX tags in your Python modules. Yeah. Yeah. You can do a public request. Yeah. I don't know. I mean, yeah. I think, recognize that there is a community out there. If you've got the effort, do it. Use it. Because, you know, we know the open source community is stretched because of volunteers. If the enterprise is taking value of it, can we use fully use your contribution? Because you're going to help many people. All right. Let's take time for me again.
Phantom dependencies in Python (and what to do about them)
Good afternoon everyone. having a moment of pleasure in I will begin. You're in green. I'm on green, yes. Okay. Hello everybody. My name is George Goussius. I'm head of research at Tendall Labs, and part-time associate professor at the Delfi University of Technology in a nearby country here. Let me say a few things about myself. Since it's customary here to introduce the speaker. So I have been on this particular field of S-bombs, and dependency analysis, and so on. As a researcher, since 2015 more or less. So I have seen all the failures coming up in real time, like left pad, then solar weeds, and then so on. In 2020, we had organized the dependency management room, which was a pre-care thorough, perhaps it was slightly conflicting with this particular room, where we introduced the Fasten project, which was one of the first projects that basically did the reachability-based analysis of S-bombs and software dependencies. This project, one thing led to another, and this project basically became a startup, perhaps now it's a scale-up, we can call it, which is called Ender Labs. It is based in the Bay Area. We are basically providing solutions in the space of software composition analysis. Plus plus. We will see what the plus plus is today. Well, by describing my history so far, you might have understood that I'm of a certain age. Parts of this being of a certain age, is that I have a teenage daughter, that actually is into pop music, and lately, I came up, she came up actually with this song, which I really, really like because it describes the problem that I'm going to talk about in almost in excruciating detail, I would say. Let's read a bit the lyrics. So like ships in the night, you keep passing me by, just wasting time, wasting time trying to prove who's right. If it all goes crashing into the sea, it's just you and me trying to find the light. What are the ships? Any guesses? Let me help you a bit. There are two ships. One is the Package Manager, where the developer declares their intended differences, and the other ship is the compiler and the order and time of the language, depending on whether you're using Python or C or so on, which has its own view of dependencies. Right? Which ship is right? Who says both? Let's see. One vote for both, but who? It depends. Okay. All right. Who says that the compiler is right, is always right? Yes. Is it ground truth though? They're both wrong all the time. The compiler is ground truth. It can be ground truth, but what about when you load code dynamically? It's runtime. Okay. Yeah, that's the issue. So the developer over describes the dependencies for testing coverage and contract validation. Horrible things happen because you build too much code, bring it all in later on and things go even further. Fantastic. Yes, exactly. Yes. This is where I was trying to get to things. Maybe you have to repeat all that. Okay. I mean, it will come through the presentation. So traditional dependency management and software composition analysis works. I mean, we're in the S-bomb room, we guess everybody knows how it works. So we start a new project, we create a Package Manager manifest, requirements TXT, POM XML, what not, Gradle, whatever tool we're using. Well, the build system and the Package Manager, when we're trying to build a download stuff from the Internet, side parenthesis here, how can we trust our data on just random stuff we download from the Internet? That's a different question, close parenthesis. The Package Manager copies all the files in a directory and then the compiler starts using those dependencies in order to compile or run the project. This is what we know. But if we think a bit from a higher level point of view, we get more or less to this. So the developer declares their intent in a manifest file. So that's the requirements TXT file. A Package Manager does the dependency resolution, we get the dependencies onto our system, we have a compiler. Now, the developer themselves also write source code. The source code might or might be using some of the dependencies, might be declaring dependencies that they are not using, might be actually depending on dependencies that are transitive that other dependencies bring in. All right. All those nice things that make software composition analysis in a way that most of the tools are doing this at this point. Well, I wouldn't say wrong, but perhaps incomplete. Okay. The output is always a program, and the program is the source of truth of everything. So what I want to advocate here is that when we're doing software composition analysis, there is a lot of stuff that we don't really identify. Can you guess where the stuff comes from? Always. So for example, I write a Python program and I have some kind of packets that I have installed throughout get. What else? A copy and code into my repository. That's a dependency, right? But I don't maintain it somehow. What else? All of those fancy shift left dev tools. Shift left dev tools. Okay. So how? Right. They're pulling things in there. They're updating and flagging, and sometimes automatically changing. Okay. Yes. There are some tools that are indeed pulling in. Bazel, for example, but Bazel also depends on the versions that you provide to it. How else can I have, let's say, dependencies on code on my program? Yes. One-time library from your compiler. Fantastic. Yes. Lib C. This is a dependency. Okay. This gets installed, for example, from the operating system, but sometimes when you, for example, do some kind of JNI calls through Java, you depend on Lib C. Okay. Various ways. So things also tend to become out of sync. So developers import new dependencies. The dependencies are in the environment. In some cases, dependencies can be declared in a testing scope, but we're still using them into production for misconfiguration reasons or for any other reason. In some cases, dependencies are removed from the code, but not for the package manifest. So we have an extra dependency that we somehow need to maintain, but we're not using it, so it's basically redundant. Then we have Python. This is the average Python repository, especially if you're dealing with machine learning and AI stuff. You have requirements to use the file or a poetry file, and then we have a list of instructions that looks like this. That tells you please install TensorFlow at this version. Please install NAMPi at this version, but with this patch for this particular GPU, because otherwise the thing is going to be dog slow. Okay. So how can we actually maintain this? How can we discover first those dependencies? How can we maintain this? Let's take a look actually at this particular project. So this is from OpenAI. It's called baseline. It's pretty old, as you can see from the time stamps over there, but still it has exactly this problem. So it tells you to create a virtual environment, and then it tells you that you need to install TensorFlow between 1.4 or 1.15 by hand. It's not part of the requirements CXT file for some reason. Using our tooling, I have run an initial scan of this project without considering what I call phantom dependencies. All right. What you would see here is more or less the same thing that most SCA tools would give you. It's all the files that are in requirements CXT plus their transitive dependencies. Okay. So we have some direct dependencies, and then we have some transitive dependence over here. That's it. Is that it though? Well, we will see at the end of this presentation, but first I would need to run a full scan of the project. So by also enabling phantom dependencies, I will have leaving running in the background. Yes. I mean, the idea here is not to see this. But what I want to show is basically that the thing actually works. So it's not a vaporware. All right. So what are phantom dependencies? Basically, phantom dependencies is the thing that we have discussed. It is dependencies that are provided by the system, and they're assumed to be working basically, to be available somehow in the runtime of the project, can come from various locations, some of which we have already described. If you think this is just a problem with Python, it's not just a problem with Python. It's problem with NPM as well, with Java, if you have plugins with, even with native environments. All right. We discussed this. As I said, what I consider the sips in the night, according to the original talk, is basically the two things, the Package Manager and the compiler in the runtime view. The Package Manager usually sees way more dependencies than the actual runtime or the compiler uses, because there are a lot of transitive dependencies for which we don't have any reachability path to them. So when we start from the client, there is no path, calling path from the client into the actual transitive dependency. So usually what we have found also in the company is that, from all the source code that is being imported into a repository, which is around 80 percent of the code that on an average repository is imported, around 15 to 30 percent is actually being used. So there's a lot of codes that we import. It perhaps forms an attack surface that's never been used. It would be very nice to clean this up. Yes. One way to do that is with reachability but that's not my talk. That's another talk. Okay. So how can we identify a fun dependencies and do this type of cleanup? We need to do program analysis. Any idea what program analysis is? Yes? Some people might have seen that. It is. No, it's not parsing. It's not parsing. It's like one component of program analysis. The first step in the whole program analysis chain. Sorry? I disagree with your first bullet. The source of truth is the source code? Yes. It is not? Could it be part of us, C programming project or the biggest thing in JavaScript to pull in dynamic code or anything written in list for code and data of the same? So I love the optimism of the source code understand with dependency graph. It's actually the binary. Could be true. Yes. For languages that have binaries, for languages that have source code, that are source code executable, the source code is the truth. Right? No? I agree with nature of source code and data, especially when we move into data. When we're talking about why are we doing the program analysis, we try to understand what we're bringing into the net. So let's stop now and then we have to get back. Yes. I agree and disagree. We need to start with program analysis, and program analysis starts necessarily on the source code. So perhaps this formulation here is not precise, but it is something we can work on hopefully. So why do we need proper program analysis? I mean, somebody could say, well, yes. If I track all the imports in my Python code, it will be easy to identify all the libraries. So it's easy. I'm just going to write a Perl script or Python script those days to take the imports and try to find what libraries have those imports. This is perhaps hopefully comprehensive, way of doing importing in Python. You can import a module, a function from a module, all functions in a module, and alias a module to a different name, a packets, static import, a relative import in the source code, using the import leap, in which case you can also rename the import leap. So you can alias basically the import leap and that you need to de-alias it in order to be able to track this, and you can even basically do an evil, an import code. Good luck doing that with a Python script. So we need- No, the code may be inside it. Exactly. That's my next point. Sorry. Exactly. All those things can be in a conditional statement. You can have a try import for otherwise import bar, or you can do a new condition on some variable, and then import the custom library. Those are the reasons for why we need a proper program analysis to do that. Okay. So the steps that we have taken to solve this problem, first of all, we need to start with the source code. Okay. We start, let's say, with the client code, and for each file in the client code, we follow all the imports. How can we follow all the imports? We first need to have analyzed the program, the client code, and the virtual environment, and the site packages that come with the operating system. Basically, all locations from which Python can tell you where it can find code. So if you open the Python interpreter and you configure it with a particular virtual environment, for example, you can ask it, please give me all locations where I can find code for this particular execution. Okay. So we start with that. After we have analyzed the map everything, we can start then from the client code and do case by case import analysis. So I import this particular library, I go into the file that is into the module basically, that creates this particular library, look at it, see its imports, and go, let's say, transitively until the whole thing has been exhausted. This is being done for a bunch of when they're doing the bills for resolving dependencies. Did you leverage that code? No. Okay. Repeat the question for the audience, sorry. The question was that this analysis has been done before, did we reuse the analysis? The answer was no. Thanks. Okay. Now how to do program analysis in Python? As I said, you need to have resolved past everything to begin with, resolve the types for everything. If you have type information, it's way more precise because you can track basically specific function calls onto types. Then you can also take into account one of the existing static type checkers, like mypy or pywrite, we're using pywrite for that matter, to basically parse the code and do all the resolution. Okay. So it's not extremely hard to do, but you need to be aware that it needs to be done. I will show you the results, hopefully of the scan. Yes, the scan has finished just in time. As you can see before, we had 11 differences. Now by doing this fandom dependency discovery, we have found 51. According to our findings tooling here, what we can see is that in one of those differences, we found an actual vulnerability. It's of course in setup tools, so it's not necessarily something that can be actively exploited, but it can be exploited while the packets has been installing. So what we have found is that there is this vulnerability and this is a call chain, let's say, but goes from the client source code into the vulnerable code. If I didn't do this analysis, I wouldn't be able to track this or do anything with this. I mean, this would be information that I wouldn't know. Of course, this is a trivial example that we're using for demos. We have found actual vulnerabilities when running this analysis from clients that I cannot disclose. But this at least to me gives an indication of how this fandom dependency problem can be tracked and solved hopefully with program analysis. After that, everybody wins. Developers can know what is vulnerable to their code. They can accurately map and create accurate S-bombs on what their application is consuming. CISOs can be aware that their vulnerabilities there that otherwise they might not have been aware of. So everybody wins. That's it. Thank you. Thank you. Yes. Yes. I'm going to go first. So yeah, great insights on Python and I'm a Python developer. So I know at least about the majority of these things. But the tricky thing is, I mean here you are talking to the people who are interested in S-bombs. How do we spread the information? How do we get all the open source communities to know about all these issues? How do we make them publishing their S-bombs? This is, your stuff is important but making all the communities aware of this is more important. Yes. This is part of why we're giving these talks so that we make the community. Yes. Excuse me. So the question is how can we make the communities aware of the problems that I have been describing and my answer to that is this is part of this effort. We're trying to make the communities aware that those problems exist. The tooling that I have described, it might sound, let's say, extremely complicated and whatnot. But if you're using, I can actually show you it running. So it is, I have it here. So this is our basically analysis tool. This is closed source at the moment but it can easily be re-implemented, I think. If you ask me, I can re-implemented it in like a couple of days but perhaps somebody else could re-implemented it a bit faster. As you can see, it just goes over, transitively, over all the code that is available to this particular project. Yes. No, there was a question. You've answered the question because I said, is this open source? No, it's not open source. Sorry. What you've done actually is a great benefit to the whole of the open source community. So if you maybe could describe the architecture you've drawn and the algorithm, then we're sure the community will then jump on and do the different projects. So is that something you could share with the community? We need to think about that, yes. I mean, the question again was whether this is open source. The answer is we need to think about this as a company. Yeah, I don't know is the answer. Yes. Are you scanning for Python binary dependencies as well because TensorFlow, for example, includes FFMPEG? So this requires, I mean, the question is whether we are actually considering the binary dependencies into Python packages in the wheels, for example. The answer is not but not yet. So we are trying to get into cross-language analysis. This assumes modeling basically of the interface between Python and the native library. We're getting there. Yes. We're doing static, the question is again, we're doing static analysis and static analysis has false positives. How can we prevent those false positives? In this particular case, this analysis that we do doesn't have false positives because it is basically only considering imports. False positives and static analysis come from the fact that, for example, you have a virtual dispatch call site. You might be linking to multiple implementations of particular interface, for example. In which case, you might be basically over-linking. We're not doing this at all here. Sorry? Where do we have false negatives too? False negatives, we cannot have false negatives here because we are considering the source code as ground truth, which means that we don't basically everything that is in the source code will be parsed and reported. Except for eval. Sorry? Except for eval. Except for eval. Eval is eval, yes. Right. Final question? Interesting, some method that tracks the imports. Well, what if there is an imported module that one function is using but is not reachable from the main code? It's kind of... If you're analyzing an imported module, it's not being called. I don't see what the problem is. As you mentioned, there are false positives. But if there is a file which is not imported and it's by a function, it's not being called. Yes, we will still analyze it because it has an import. It's not being called, right? We have some helper problems. It's never being called. It will be checked as a risk and it's a risk. If it's not being called... So, sorry. Excuse me, Alex. So the question is what will happen if I understand correctly? If we have a file that has an import in a function that's never been called. Yes, we will not analyze this because first we consider the call graph of the whole thing. Thank you again, Jordan. Thank you.
Open Source based Software Composition Analysis at scale
The first good thing, welcome. My name is Marcel Kotzman from, you see my company, Robert Bosch, but what is more important is I think my community and I have also some chokers in the audience for the questions later, so I'm happy that you're also here. And well, that went too fast. Our community was formed a few years ago already, so what we're doing here, I think we were ten people in the beginning. Also we had exactly the same discussion we had this morning some years ago, so it's good that we have now a bigger audience. And this is also my, if you want to go out and have a takeaway then please have this takeaway, please join us in our tooling community, sharing creates value. I really like that name from the beginning because that's really what we are here for and I think this is an all over non differentiating thing, right? So we all profit from having here a better as bomb tooling processes, etc. The title of my talk is about doing this at scale, so here's where my organization comes in because we're not doing this for fun or yeah, it's more hygiene topic, so every developer says yay. But nevertheless, what is important for my company is be good citizen in the open source community. So that means on the one side if you use something we want to give back, so therefore we also doing open source on our own, so that's what we called from the beginning, eat your own dog food, so if our developers need to suffer doing all this open source paperwork, we should also have a clue about that. And so in the beginning we also that you know beginning doesn't mean not for Bosch, Bosch began much, much earlier, but when we started with this automation, our journey began here with typical Java Maven project. So before I can tell you all the details, I just said okay, I make this fact sheet and this will also accompany us through the presentation and that you just get the idea on the left side, no right side, here we have also a fact sheet that I took from the NASA and that made it really complicated now for the presentation preparation because that's so interesting once you dive into this you get distracted very, very quickly, so here I also put the link. So now we got the idea, for me Java Maven this is a community, so we have projects approach us okay we need some kind of support to doing this S-bomb stuff in the beginning and mainly triggered by the license compliance and so in the beginning we didn't have that fact sheet right, so we just started on and then we're done. We were done, we said okay now we created the S-bomb automatically, the FOSS compliance bundles etc. So I said okay mission completed, no. So then the next wave came up right, so oh yeah here web apps, JavaScript, NPM, blah, blah, blah, so here again fact sheet, so some things looked similar and also here the build mode is very important because you could ask why should you automate all this stuff if you release once a year for something, no, but we had this high requirement from the beginning to do this CI CD, but here for JavaScript our paradigms didn't work right because as most of you potentially know if you use it or my previous speakers use one dependency you get thousands of transitives right, so that didn't work, also this that you have this one to one, yeah one source is one binary didn't work, so we had hard time but we somehow well 80%, 90% so here I would rather use my chokers later but I think we handle it now somehow. Also when we started to do this we said okay because as I said the developers were not super keen on doing this, we tried to push this by centralizing that, so looking now from the end that was perhaps not that good idea rather to decentralize this because now we also centralize all the problems that we already here heard the whole morning and the whole afternoon right, so this is all on our desks. And here this was also the biggest learning when we said we had a vendor solution at that time still, said we need this, we need that, so it was hard, so this is also where the community the open source really helped us because then we could also do what we needed at that point. The other thing so here I put Mars, yeah is that we still had all the other projects right, so we couldn't just say okay now a mission completed but they also continued and doing new things, well this is innovation right, so you will never stop so we were going from one orbit to another and also this little wordplay there is intended because most of the time you will know that we use OSS River to get this. And that you know so who is the crew in this, this is I'm not the developer, so therefore my joke is later and I'm rather looking from the process perspective but we also had then the development team, we had developers, we needed to talk about this and here the next stop was then embedded C cone and thing, so okay that is a completely different planet right, so here and also I learned where Saturn is a gas planet so there is no even a surface where it can land on so you need to stay continuously in the orbit having some probes that you send out and also again differences so also you don't need to read all this stuff it's just for giving you the idea and here I made an, I went back in the history and get okay when we came in because also Thomas and Sebastian they were pretty, well busy already in the beginning where we had this starting point which was already supported in the beginning and then you see also the history so even more planets coming up and I'm still convinced that this is not the end here at the right side and now this is at scale because this is just the background right at scale for me means with all those planets there is a scaling then in the horizontal if you want because as central as I said earlier we centrally supported here our two teams that we needed then to scale in the horizontal really supporting this ideally for all the teams with all their different planets and here from the experience that we had in the team is it was very helpful then within the team but also with the customers to have those fact sheets so this always developed more and more to say okay for onboarding say what are we talking about right so what are you doing there so either as well for the for the developers that we contacted as well for us in Germany that was very helpful on the other side also the when we started talking with them so how do you do it today we started the in the documentation why because we needed to have we also having teams from automotive right they need to have audit rails all this reproducible documentation so we documented it that was good because then we also could reuse the concepts but then we iteratively improve them and came finally to say okay hey that's good this we can standardize we can reuse this and then this is the evolution then you can also once you'd standardized then you can start automating you cannot start automating directly from the beginning so you should have this information and the other thing is if you especially if you say okay I started from from sketch again then you might reinvent the wheel but the other thing is with all the tools also I think Anthony showed it earlier so you have a list of of of links several tools so which one is this the right one so also here the concept documentation was very important as all with the help of fact she's then to see okay what is then the correct the correct solution for my for my problem and our next stop you see embedded IoT Linux so here's also invitation join us here we still have a lot of fun in front of us to manage this because here again we have completely when when we had built less approaches where we could say we had this discussion earlier okay source code is the truth perfect this would be then we just take the repository analyze it but here we obviously have a build based approach that we need because the build also does a lot of things and I just learned also compiler blah blah blah all this other stuff and then coming back to the to the point and it's the last point here in the background so now the learning is we need here those fact sheets in order to not lose time in upfront to say okay we're what we are talking about then we also came to a generic architecture model and this is also what you see then in our tooling group so that ideally we use the same wording for that and the standardized representation but the other thing is once I have this so where can I find find this if if I have a good match and then I took here this example from online shopping if you take want to shop close and I that I found what was a nice thing I say okay I would love to have this also for us right so am a man woman or a kid do I need clothes or shoes and check its blah blah blah and then you get the selection of what you can buy but at that point where you need also to give now to narrow down the selection okay you need to measure okay what size do you have and now okay this looking at my belly this is then exactly where he said you here you need the the semantics okay where do I measure this right and this is what is then where we need the genetic architecture model for yeah now you see the story what we do we prepared this already the our new project eclipse up oapses so and I as I said it has several aspects so for those who are not familiar with what is up oapses I also copy pasted the definition again if you dive into this world you will probably need some some hours even or days because that's very very interesting but I also liked the link here so therefore I also put the two definitions so up the abscess is that fast fast from the center of attraction right the high point in an orbit and up oapses the other thing is then the fast fastest away from the body it is orbiting so therefore I made this little picture it's it's potentially wrong so please but then you just say there you here you have the center of attraction left side and the up oapses is this point so the really when you in the orbit that is fast away from this planet and I think that fitted very well because if we switch between the different plans you need to switch the orbit and that's the best point because then the attraction of the original is is so low and here you will not really find a lot of contents yet because this will be in the next days and weeks but the also the proposal the text here I put the link so what is inside so this is more or less the trailer of a film if you want so you can see the film later we I also give you some hints in the in at the end but just that you know what we are talking about and therefore I saw as I'm the process guy I said sorry I had also I also wanted to be a part of this projects and collaborate so therefore there are some process level documents on and that would rather cover this horizontal scaling right so that you can say okay now let's have a common way how we can map this the other thing is that we are using a lot the OSS review toolkit and here that would be then the vertical level because here we need to scale also as we need to run really really a lot of scans every every day and here we have performance issue in the way we use it past and here you see this would be also some code contribution that we would do is that you can expect and the upper part as I try to use the the icons a little bit is then really the idea what is the goal that you can just come say okay what are my needs and what is already there so to really jump start your process definition ideally just copy pasting then the the the templates and you can directly start ideally then via our tooling group mapping the tools that are available on the technical level so here we're using word so if you are a single project then anyway not in a target group of my talk today so then just go to or page or then if you have built based issue then you join us in the tooling group you will find your solution if you have an organization that has the same issues than my organization that you really need to use that at scale then you would invite you also then to join us in the eclipse apoapsis project and here there the old server you can what what we will contribute there you can really just take it and build up your own service in-house to automate then this software composition analysis we're coming heavily from the open source compliance but we also have security aspects and we heard also safety export control so everything that will be there and this is also why I said okay this is important that this is not this is just another puzzle piece if you want but this very important is the cooperation with the spdx here we also call for action invitation we have this new operations profile workgroup so we are invited also to join us there that will start in the next days I think the first first meeting on the one side and we have this dependencies to the open chain tooling group where we do the capability map so also that we do not need to reinvent all the wording so this is already there so you can already check it out and on the technical level here this is also where the maintainers are the same right in the uss3v toolkit where we have strong connections and then also you can see the technical dependencies but as I'm the process guy and I have a lot of chokers in the room but we have not that much time I will tell you later I explicitly asked my colleague so for those who are interested I have an offer later to you so on the process level side so this will be then this is just a work in progress thing where we have the generic architecture that we map against the capability map from that we already developed with the with the tooling group so that we have then the ideally the same semantics so just that you have this picture in mind as I said we're coming from open source compliance so therefore the current capability map is open source compliance related so for if we then also go further with security then I think we also need to define further capabilities then the question is if we do this in our group or in other groups but we will do it together well that's that's for sure the other thing is this and therefore I originally called this abstraction layer so it's abstraction layer not in the sense of a software that we create but rather on a process level that you say okay we have different stakeholders we have different products and here this is where the operations a work group from spdx then will will it will help yeah to say okay to rather say am I a man woman kid etc so where I'm in on the one side on the other side also having some standardization to have those fact sheets so is this the perfect solution or next to a perfect solution or at least something I can start with because here and this is where my organization is potentially special here we are in the middle of a supply chain typically in the automotive industry so we also we are not alone we cannot just say okay we will use this or we have legacy therefore we have to have flexibility for you that would be then the possibility to still keep the flexibility and say okay I have just as Anthony showed I have choice right and that's not bad of the on the other side and therefore we started this touring group is that we are only we were only a small group so we said we cannot maintain thousands of redundant things so we should focus and especially if we still have gaps on the other side so then rather say okay let's consolidate on the one side and then use the resources rather on another place then also the blueprints as I said so we're here we have everything we heard today beginning with deja-court from from Philippe now a phosology I saw here the colleagues also software c60 or obviously false light so there are so many things where you you're totally lost right so there hey I know there's a solution but I don't know does it fit for me or not so I this is something we should tackle and the other thing that's funny because we also talked about this also the question so who am I right so am I rather developer I need rather this for my use case or or I'm rather try to manage this thing or audit thing so here this is important that we document that somewhere so I would offer to the start here but I'm open also to do this somewhere else especially as this will be necessary also for us to do the testing later because with a well-defined use case it's easy then to start then the test case yep thank you the test case definitions and the better the test the better than the solutions the on the technical output so the ord server what will that be so the goals of the server is that we have a unified API that you can use so if you run really at scale from your CI CD you can just call this API and it will all do do all the rest more or less easy setup integration you can read it on your own what we do not have yet but is one of the goals is frontend because this was typically then the the issue if especially if we were compared as those as River to a kid community with the vendors the vendors typically come up with a new and find ceo I blah blah bling bling bling and we will just well we have a tool that just does the do the business right but so we we know it's it's important and you see then later in the outlook and here I would potentially also have a choker then in the in the audience what what will come so that you see how would such a setup look like so you have a development team who wants to use this software composition analysis service so they can just use the API the ord server would do the rest of here you see then the different workers Martin called that from ord ord analyze the downloader scanner so these are the typical steps that you need then ending then with a reporter but we have teams that only needed parts of that then also sometimes the one thing uses a lot of performance so therefore the balancing is very important and that's what what you can then do with the server here we have some usage blueprints that we already will prepare with the project also in the use case collection the things that we use then for testing it and then another use case as I mentioned would then also be those those dashboards those your eyes people that want to know more this will also work then with the server just here the MVP that you will be able to expect then in the next weeks to be in our repository the repository was already there so we're preparing currently the initial contribution and this is the next step and that was I said so now you would say I want to know more about this ord server how does that work so there's an invitation we have twice a month the tooling group meetings every first and third one Wednesday the first Wednesday in the morning for the rather Asia Pacific region the third Wednesday era and the european sorry the european afternoon will also to cover better the time zones with 50 us and yeah so please check out the open chain global calendar so if you if you're interested and the next one I will moderate and Martin will be there so for all detailed questions regarding the technical stuff and we can also discuss the the other parts so I would really be glad if we could have some follow-up discussions in our existing communities then as I said this is yeah depending also on the on the load the initial contribution that we prepare ideally we should have it ready then for the community days so we plan ord community days beginning of March I hope I have put the right link so please register if you're interested because then we have also a detailed session about that and about the front end so we will not provide the front end in the beginning but we already in discussion with double open project so that they are also interested in providing something but as this is an api so everyone would be welcome to have their python shawasaki whatever you want but yeah at least we will care that we have a cool server in the background thank you very much stay tuned so here are all the links especially so really the who didn't know about this especially Anthony I would really welcome you if you also could present your your benchmark that you did that was really great because we also have the python inspector in the ord analyzer how we would get there and these are things that we really do in the tooling group thank you very much we'll have to answer a few questions address that noise yeah so the question is that we have a lot of noise and transitive dependencies I would directly rephrase that so you have dependencies in the build scoped in the test scope whatever and then you get the build of materials with thousands of things and there are irrelevant repositories that's how we call them so here this is a process level thing right this is where we typically come in and check with the teams please unscope that so here we have configuration as code is in the in the ord principle so you can directly silence down that that noise by just excluding then scope things like that but also that depends which planet you're coming from right at scale um at scale this is uh yeah so I would forward you to my joker here which is thomas so again here we should talk about which planet you're coming from right and we can also reuse a lot of this uh in a in a central database this is where also the previous talks were about sharing then creations package configurations uh here you're also more than welcome to uh all collaborate thank you thank you again
Getting lulled into a false sense of security by SBOM and VEX
Good afternoon everybody. I hope you hear me well. My name is Henrik and I'm going to talk about S-Bom and VEX or better about all the problems of collecting high quality data that are required to be part of those documents. So it's nice that we have those data formats, that we can sign them and exchange them and merge them, but I think it's very important for the consumers to know about all the problems related to vulnerability database, quality, dependency management, all of those problems leading to actually wrong data being encoded in VEX and S-Bom documents. A little bit about two minutes about myself. I'm a security researcher at Ender Labs. Previously I've been working at SAP Security Research where I started around about 10 years ago to work on the detection, assessment and mitigation of known open source vulnerabilities. I think the most noteworthy outcome of that activity was Eclipse Steady, which I co-authored, which is basically combining static and dynamic programming analysis techniques in order to basically determine the reachability of vulnerable code. And kind of hand in hand with Eclipse Steady is another project, which is called Project KB, which is a data set of fixed commits, right? Fixed commits that tell you which are actually the vulnerable functions in the different open source projects. A little later on I was interested by this increase of supply chain attacks, the more and more malicious packages showing up. And again, two open source projects I was involved in, which I co-authored was the Backstabbers Knife Collection, which is a collection of around about 4,000 malicious packages that have been published on PIPI and NPM and RubyGems and so forth, which is helping researchers to kind of develop new detective measures. And have also contributed to the risk explorer, which is kind of a comprehensive taxonomy about all the different ways and techniques at the disposal of attackers in order to inject malicious payloads or code into open source packages to infect downstream users. All right. The outline of this call, this talk is as follows. I mean, in order to produce accurate S-bomb and VEX documents, you basically need to link applications to all the components and component versions, right? And we heard a little bit in earlier presentations about this. And from there on, you need to link those component versions to vulnerabilities. And then in the next step, you need to link this to the vulnerable code functions, code snippets, right, in order to have all the information required for VEX documents. And what I would like to do in this talk is to provide your kind of systematic, structured overview about all the problems that appear that exist when wanting to establish this link, and which make that those links are pretty brittle and weak in many cases. And I will exemplify this by vulnerabilities, CVEs, most of the time that you can consume from OSV. A very important note, and I put a star so that I do not forget. I mean, OSV is a great, let's say, project. They do a great job. They just aggregate vulnerability information. They provide you this nice API to read vulnerability information. And so this is just so much better than what we did or what we had a couple of years earlier with NVD and the CPEs in order to identify affected components, right? Because those names didn't really match at all to the identifiers of the packages that you find in NPM and all the package registries. So this is a great step forward, OSV, but still there's some data quality issue that I want to point out. Again, they are just aggregating other vulnerability databases like the GitHub advisory database, Python security advisories and so forth. And so my goal for today is basically make you take S-bombs and VEX documents with one or maybe a couple of more grains of salt and make you aware that there may be wrong information represented there and that you may be missing out a lot of stuff. I'd like you to choose the right application when you're going to evaluate the next S-bombs or VEX generator and basically ask the right questions. So a quick recap of the vulnerability exploitability exchange format. So they can be distributed as part of an S-bomb document or separately, and I think separately makes more sense, right? Because they are changing more frequently as the analysis of the vulnerabilities going on and as patches are provided. And really at the heart, the need of the VEX document is that they assert the status of a certain product identifier or product in regards to a given vulnerability identifier. And so this is basically a quote from the CISA minimum requirements for VEX published last year. And therefore see a status which must be either under investigation, probably at certainly at the beginning of the analysis by the product vendor, then and then it will switch to not affected, affected, and if it is affected hopefully to fixed at some point in time. Now also from the CISA requirements, if you say you're not affected, the CISA requirements basically ask you to provide additional reasoning about why you're not affected, right? And so either you provide what they call an impact statement, which is a free text, or kind of a justification. And for the justification, they have five possible values, which is component non-present, the vulnerable code is not present, the vulnerable code is not on the execution path, or it cannot be controlled by the adversary. The cycle in the X-gamer is kind of very comparable. They have nine values, again code not reachable or code protected at the parameter. And so whether it's nine or five or seven, that doesn't really matter at this point in time. What they have in common is that they talk about vulnerable code. And so you should ask yourself, how do we know which function is vulnerable or not? And so this is basically what we want to know to produce a VEX document, right? For a given application with direct and transitive dependencies, hundreds, God knows how many, is there any vulnerable code? Can it run in my application context? And when I say can it run in my application context, it's basically meaning can I execute the program with input in such a way that this vulnerable function is executed? And there are different ways to do this, static analysis, dynamic analysis, for example. And the last question is can it be exploited? Which is another question, because even maybe it is reachable, but there are some mitigations in place, or I don't know in the specific deployment, the software is not internet facing and so forth. And so in my talk, I will focus more on the first bit. Is there any vulnerable code? A little bit on the second and I will skip the third part, right? Because this is very much depending on kind of the system, the environment where the thing is deployed, whereby the first two questions can be answered to some extent by the application developer analyzing this thing in his environment. Now the ideal situation would be my wish list for the research community, if we had a kind of a fingerprint or a signature or so of vulnerable functions that would allow us to find a vulnerable function no matter where it is rebundled, somewhere in my Python environment, in my Java class path and so forth. Right? And that lets us distinguish this vulnerable function and its function body from the fixed function body. But we don't have this. And because of that, we basically take a long detour where we go from applications to component versions. We try to get this bill of materials on the level, on the granularity of components. We link the components or component versions to vulnerabilities. And then we have vulnerable code functions linked to the vulnerabilities, at least sometimes. And so this is the link that we want to establish and it can basically break at all those cases. And I'm, yeah, this is what I basically do in the next couple of minutes. The first bit is in most of the cases established through many FEST files. We had talked about them, requirements, setup.py, POM, and you name it. And the rest here, the lower parts, they are basically covered by a vulnerability databases. Public ones like NVD, OSV, and the things that OSV aggregates, right? Or private ones. And what is important here is that the vulnerable code, in fact, is not, at least not comprehensively covered by any of the public vulnerability databases, right? NVD, OSV, they all stop here. They just say vulnerability is linked to a component version. They don't give you the vulnerable code. The vulnerable code is in private vulnerability databases. So how does the happy path look like? If everything works well, the happy path, right? So we would have a manifest file with dependency declarations, pinned, maybe. And so it's easy because there's a lot of tooling to identify the direct and transitive dependencies, right? So that link is covered, let's say. Mapping the vulnerability to the component version on the easy path is basically if, if there's a project which just has a single artifact with maybe one supported release branch, not multiple release branches for which the fix could look different, or maybe some release branches not being supported any longer, end of life, and nobody looks at them any longer. And you have a security web project maintainers. They communicate very clearly in their security advisory, this and that is the version that is vulnerable or not. So that would be covered as well. And then the last bit, the easy path to identify the vulnerable code is basically if there's a trivial patch in a function, let's say in a law list or a denialist, done in one commit, and the commit message kind of mentions the vulnerability identifier. In such cases, it's easy to understand what that's the method, right? And then more on the reachability part. On the easy part, you basically have a static dispatch, meaning the function to be called is known at compile time, right? And so there's no doubt about which method will be invoked. There's no dynamic dispatch as common in object-oriented languages. There's no reflection. There's no eval. There's no inversion of control where the framework calls into your application rather than you calling into the library. And of course, the happy path is existing, right? So if you plump all those components together, you will catch a lot of cases. But you will also miss out on kind of the obscure parts on how dependencies are established or maybe whether anything went wrong here. So this happy path is reminding me a little bit of this joke here about the, I think it is called the street light phenomenon or so, which is where people basically prefer to search where there is light, right? Because this is where it's easy to find something. Right, and then now I will try to give you some insights into where there is no light and what are the problems. The first one is, and you have heard, I mean, some of you heard the term earlier on, is these phantom dependencies, and they basically affect the first link here, right? So how do you determine what are the component versions used by an application? Again, manifest file is the easy path, but then there, as Yorgo's mentioned earlier on, there are these kind of dependencies not expressed in manifest files, but kind of described in readme files and kind of manually or scripted in the runtime environments, right? I won't go into detail because they were covered in the previous talk, or in fact, two talks behind, but I would like to mention this other nice flavor, which is dynamic installation, a la try, except install, which I came across. It's typically a pattern that I found in malware, because malware, you know, they require requests, the Python package request in order to exfiltrate information, but they don't want to necessarily ship their component so that it is obvious in their requirements, but they do this dynamically. So we have kind of patterns to detect this, and I was surprised that this is also a technique used by legitimate packages. So what you see here is, again, a try import, we have seen this earlier on, and then if the import fails, it's basically making an operating system call, calling pip install sigopt. This is with pip install, right? Another pip install, and another variation, I like that, it's aptget install, right? And so, you know, there is no limit to developer creativity, right? And these are not some random packages, right? So these are well-known machine learning and used machine learning libraries. I'm not talking about the thing that has three downloads. Good. This was kind of the first link, second link is here, and I guess I need to hurry up a little bit. Name changes. You had this probably in the panel session earlier on. The thing is that projects get renamed and forked, and then there are also some strange distribution channels that change the name when you deploy your jar, for example, right? I found a nice example here. It's a eBix Java client. It's a tiny project. It has been developed on SourceForge originally, I think in the 2000s or so. Has been moved to GitHub, has been renamed, has been forked, and every time, basically, the artifacts, the Java archives are deployed with a different identifier, right? So in order to make sure that your vulnerability, your CVE identifier, or GitHub security advisory, you need to actually, in this case, mark those three group ID, artifact ID, identifiers, because otherwise it will not be highlighted as vulnerable. So this is one of the first orc copy eBix. If you build it from the GitHub sources, this is what is written in the POM. The second one, if they kind of choose a strange, at least strange to me, I didn't know it, distribution channel, they don't deploy on central. They deploy using JITPEG, and if you deploy to JITPEG, it basically doesn't take the POM values, but it takes the value from the GitHub repository. And so you see here, com GitHub, eBix, Java, blah, blah, blah. And the third is the group ID, artifact ID of the fork that deployed on central. So this is kind of all the names you would need to catch. OSD, in this case, doesn't get any of those. They only say it's in the GitHub repo. Next problem, again, same link here. There is this problem of multi-module projects, right? So you have a repository that is producing not only one jar with one identifier, but multiple ones. Crazy case here, Bouncy Castle, it's a pretty well-known crypto library. They have 84 different artifacts with a group ID, ork, Bouncy Castle on central. So if you do the search, right? They all ork Bouncy Castle and then they have various names. They do this to cover FIPS compliant crypto libraries, support for different Java version, God knows what, there's a reason for this, I suppose. The thing is, this vulnerability is affecting 28 or so of those Java archives, but OSD is only marking two. And we know this because we looked up the vulnerability, the vulnerable function, in the commit that they use to fix the problem and the research for the class, right? Another one is this one, Microsoft Common Data Model. It's an interesting case, they have one GitHub repository that they use to produce artifacts for four different ecosystems, NPM, PyPI, NuGet and Java. However, somehow the OSD is only marking three ecosystems, so they miss out completely on NPM. So if you use the artifact for NPM, your kind of scanner wouldn't get it. Now we move on from multi-module projects to rebundling, it's a well-known problem, I think I've heard it earlier on, which is where basically the archives, the packages contain artifacts from other projects, right? This could be binaries, like in many of the Python machine learning libraries, but this could also be just Java classes. In this case, it's a vulnerability in the spring framework, the vulnerable method is called Reset Default Subscription Registry. OSV is marking kind of the wrong artifact from spring, but interestingly, and this moves on to rebundling, this class is also contained in an orc Apache service mix, blah, blah, blah, blah, which is not caught by OSV. I have, there are some studies about rebundling in Java, rebundling in Python. This is an interesting case here, this is a WebP image codec, I think this was a C component which was vulnerable, included also in the Chrome browser, but it was also rebundled in 50 different Python packages, right? The people just, it's for coding and decoding this image format that I don't know, and so 50 projects contained that binary, they were vulnerable, OSV covers just six out of those. The guy who wrote this analysis, you have the link here, also did the job to find the most top rebundled binaries, and he found that here, the GCC runtime is 920 different Python packages. So it's really hard to keep track of all this, because developers just copy code, include code in whatever possible ways. This is another one, again Azure functions something that I found myself. Typically you would say you declare your dependency and then it is pulled from the repository. In this case, they basically included the third party right away in the, what is it now? NPM, I think here, no, Python as well. And so basically they bundled this verktzorg, pretty well-known Python project in the Python package from Azure functions. They also took some random, not random, they also took some single Python file from some GitHub repository that they included here. Again, hard to track. Rebundling in JavaScript, I won't go into this, it's yet another kind of catastrophic stuff. Here I did some statistics about how often do we agree with OSV on what are the affected components. And we're just talking about the components, not yet the versions, which is again the next link. And for a couple of thousand vulnerabilities, Java vulnerabilities, it turns out that we only agree in 57% of the vulnerabilities on what are the affected components, right? Meaning that in 43%, we have a different opinion on what is affected or not. It's a pretty big difference, I find. And so how can you read the chart? So this is basically where we agree. This is where we list one more package than OSV. This is where we list 26 more packages. This is bouncy castle, right? This is where OSV lists more than we do. And so forth. So it's really interesting statistics that you can get out of it. And yeah, which really, I mean, it's really hard to tell also who's right because there's so much manual work involved. All right. That is about version identifiers. Here I want to just highlight one thing maybe, two things. First, Tomcat, they had a vulnerability not so long ago. But 80, the release branch 80, is end of life. It was not checked by anybody, right? The vulnerable function, though, that I looked up, it existed as long as 2012. I think I traveled back in time. I found it in 5.5, 2012. But nobody kind of marked this as vulnerable. This one here is, again, pretty recent. Again, they didn't mark old versions, even though the advisory from Apache said they are vulnerable. And indeed, the vulnerable code is not there, but I tested. The exploit still worked on the old releases as well. So I was filing a change and the GitHub security advisory database kind of adjusted the affected version range. But all this is a super high-touch manual effort. You need to dig down into the code. When was it introduced? How has it been renamed? It's awful. I don't go to this example for lack of time, though. Last link to go, as I said, you need for this reachability magic. No matter how you do it, dynamic static doesn't matter at this point in time. You need to know the vulnerable function, the vulnerable code. Happy path was one function changed in one commit, in one release branch, and you had the CVE identifier and the commit message. Beautiful. It's not always the case. For this component, I think it's Python or so, the guys needed 18 fixed commits. They touched 14 functions, and there was no vulnerability identifier. And so I hope we caught all of those functions. But again, it's a very time-consuming task. This was here about validation of SSL certificates, and so they didn't do this across the whole code base, and so they had to change a lot. So this is where the last bit of the link basically breaks. This is kind of a summary, right? We didn't go into Wendat code, where the people just copy the stuff in your own repository. This will be for another talk. So this is just about the link, right? And now, yeah, for once I'm in time. And now we come to the reachability analysis. Again, easy path was static dispatch. And then the difficulties add up, right? You have reflection, you have eval, you have these framework dependency, where spring is basically calling into the application, then rather than the application calling an easy library. Then of course there are vulnerabilities which are all about configuration. Cannot do much in that case. Cross language calls. I think we had this early on. So if you have a Python machine learning library that calls into C, these cross language stuff is really hard from a program analysis point of view. And then of course you have this dynamic dispatch, megamorphic call sites where a call, the type of a call target is only known at runtime, and you need to do a crazy limbo dance to figure out all the types. And sometimes you just over-acproximate crazily. Right, takeaways. Those links are super brittle, and a high quality vulnerability database requires a lot of manual work. And, well, let's say, let's see it a little bit more positive. The opportunities, I think the industry and the open source community benefit greatly from having kind of a comprehensive code level on the level of vulnerable functions database to use as input for all this reachability stuff. But I don't see it nowhere, which is a pity. Actually, the starting, it's starting. There's ways of starting, okay? Especially with the software heritage as a source of starting points. All right. And then also when it comes to reachability, there's no reliable way to identify the vulnerable code, right? So this is what I was mentioning in the beginning. It would be nice to have signatures of fingerprints of the code rather than doing this stuff with component names and versions, and you just can't get it wrong. You only get it wrong. And so my advice when you talk to your SCA dealer the next time, do not only choose a happy path application for your product validation, right? Use one that has a little bit of complexity. And then also ask for kind of details and statistics, what I would say are three critical areas. This dependency identification on the top of this thing, the vulnerability database, how do they maintain this? How do they ensure it is high quality? And this reachability bit, how do they actually figure out whether the function can be invoked or not? And that's it.
Panel discussion: Best practices managing SBOMs in the supply chain
Okay, so next up we have a panel of Danone and Jeff here. So we're going to be discussing best practices around S-Bornman supply chain. So my idea initially, and if anybody has questions or whatever, feel free to jump in. So just maybe questions at the end. Would be maybe break this into three sections. So the first one would be around what you think is best practice or challenge about S-Born generation, sharing and gaining gestion, and then basically handling them, storage, all of that. So yeah, without further, I'll give the pass them back to you. And then if you want to introduce yourself and Jeff. Yeah, hi everyone. I guess most of the faces are familiar from the fringe event from last week. So yeah, hello again. And I'm Aaron Arrogation. I work for Siemens Health Nears. I had the Secure Development and Compliance at the organizational level and also a co-lead of software 360 project at Hosea-Rite Eclipse Foundation. So the topic of S-Born has kind of, we all know that has came up into the limelight in the last one year or so. So it has had its effect in our workspace also. I was only needed to concentrate on license complaints until recently, but now the whole thing has changed and the requirements has come up in a much more stricter way because of the regulation and the executive order. And so predominantly from my side, being a healthcare organization, we have a lot of challenges in adapting to or catering to this requirement in a very sudden way. Because this new regulation calls for a very stringent or very disruptive way of changing things. And it is very challenging in the healthcare sector because our processes are very closely tied to the FDA regulations and our relationship with our supply chain is quite sensitive. Because we cannot suddenly demand our suppliers to give this thing, otherwise you would be out of the process. That cannot be done. So we have to operate in, since it is healthcare, I can take the example of doing a surgery. We have to keep the patient alive and then do the work. So that is the situation right now. And we are taking it that seriously and taking the steps very clearly because right now what we are doing mostly is identifying the current challenges and gaps in the process. And evaluating our existing methods of how we are meeting the regulations. Declaring all the elements of open source usage was already there, but it existed in different formats. But now when the regulation demands a particular or a set of few formats, there has to be a lot of work done in all areas like right from the R&D processes, the contracts with the suppliers, all of these things. And then we have the legal aspect also. So the current challenges, the inability for a large organization which is heavily regulated to transition at a fast pace, is a major challenge. But we are closely monitoring the developments and trying to go along with the community in that. And the one core strategy that we have right now is make sure that we don't miss a train in this aspect. But it's a challenging thing. I'll come back to that more details after. Hi everyone, I'm Jeff Mendoza. I'm a maintainer on the GWAC project. GWAC is an incubating project under the open source security foundation. And it's a tool that ingests S-bombs and then is used for querying your S-bombs. I can get more into it later, but as far as the S-bom management, the idea is that we need to be able to ask questions about our S-bombs. And so that's the kind of the part I focus on. Some other background for me personally, I used to work at the open source programs office at Microsoft where I worked on security and license compliance where we would have scanners that didn't generate S-bombs. We had an internal format, but it would scan, put the results, all the component versions into a database and then give you security and legal alerts based on that. So I can kind of pull from that experience as well for best practices and managing S-bombs even though we didn't call them S-bombs. So one question for you, Anun. So when you think about healthcare as a heavily regulated industry, so when you think you as a participant that is using software from third parties, the suppliers in this case. So how developed is healthcare in regarding this kind of information? How transparent is it supply chain? Do your providers reliably get you information that you need to form to understand your dependencies all of that? Could you share a few words on that? So until recently, getting the entire dependency list was more of a manual centric process where all the architects were heavily loaded with this task. And with the internal process of software, the software lifecycle management, it was very much engraved in that process. And hence it was like, yeah, we had to believe the person, okay, I'm the architect, I decide that, okay, this is going to be our function and this things we need. So for a longer time until automation kicked in, we all believed that that was the truth and we had only a certain number of components. And we did all the required clearings, whether it is from a compliance aspect or from a security aspect to that. So the moment when it transitioned to package managed world where we started ingesting third party packages through the modern package managers, the situation kind of changed. So people were so alarmed with the number of dependencies that they see. And then there was a surprise for these architects or those people. Like, are we really using these many components? It could be false positives. Are you sure? Is your tool working right? So this was the discussion at some point. So as I said earlier, because it is a regulated one, for all the listed components, there are existing process where they validated, they evaluated for security and license complaints. But now that task has scaled up to an unimaginable extent. So we are trying to figure it out. And I guess, well, when you think about the problem of handling those dependencies, that's where Guac comes in. Like, it's one of those tools that will let you ingest everything and try to understand their relationships between them. So we just heard a number of talks about trusting. And so at some point, well, I would like to understand also how sharing and trust go together. I'm not sure that Guac currently feels that, but I'm not sure if you have overviews on how more visibility into those, that information can lead to better trust. I don't know about trust, but I do feel like when you're cataloging your S-bombs and have all of the dependency information in them. And if you're scanning through all of them and looking for vulnerabilities or legal information, it becomes important to see where you're using the same dependency across all of your projects. And so if you're looking at individual S-bombs, you won't see that. You'll just see, okay, I get the same vulnerability coming up in all these different products that I have. So one thing that correlating those, just part of what Guac does is, you can see where, well, what is the path to get to the vulnerability from my products? Does one product depend on another? And that's why I have this showing up in multiple ones. Is that kind of what you're getting at with kind of by looking at how the different S-bombs all point to the same project, you can get a lot more insights into both what you should be trusting, what you want to look at and what you should be concerned about. Yeah, so what I'm thinking is when you have a large body of dependency data like some organizations like the health care industry has, then you can start using tools like Guac and other databases to start understanding basically who's telling you the truth, who's lying a little bit, who's giving you missing information. And you can start making sure that the players are doing the right job. So that should give you a good overview on the state and ultimately making that information useful. So I'm, well, just to switch tracks a little bit. So I'm curious, how does the health care industry share information? Like how do you share S-bombs? What's the practice when you have an S-bomb? How do you give it? How do you receive it? How do you supply it to others? Okay. So that's a very tricky question to answer at this point of time. So how to put it? Email is fine. Yeah, I mean, I can say like we have not started supplying S-bombs in the current prescribed formats yet, but the other various ways of submitting this information to FD is already in process. So I'm not going to talk about what format or the thing, but this information is being submitted, but it is not as per the prescribed formats as PDX or Cyclone DX. We are only reaching there right now. So just to continue with what he mentioned regarding like, this brings me to the software 360 part of it, like how we catalog this applications. So the real challenge for us is, you know, we have legacy systems, which are still 15, 20 years old and still part of machines running across different hospitals. So we still have to maintain it. And now we have a new set of software coming in new form, which are part of the modern applications when compared to our old and scanners and other tools. So right now, all the teams that were working in this area in terms of compliance or security, we have the challenge of bridging this gap where we have to hold both sides equally well and then think of a solution where we can cater or where we can adhere to this new regulation. Because for us, the major challenge is when we deal with the legacy systems because it's like some part of it is already, you know, submitted to the regulators. We cannot really make heavy changes to that. So that's the tricky part. And software 360 is kind of helping in that way. But making it adaptable to the modern world is a challenge. So still there is going to be a certain level of manual work along with the automation to meet the goals. Do those applications get run under some SCA system to break them and then handle them, catalog them properly in software 360? Yeah, we do use multiple tools. We do use multiple toolings right now, like a lot of tools from the Cyclone, the X-Center. And most of it, I would say, majority is all internally developed because of various reasons. Or, you know, we might not have felt that the available one in the open source world is fitting our requirements. So a mix of both, but dominated by internal SCA tools. Any question for you? When do you run your scanning tools, your S-Bump generators at source area time, build time, or deployment time? So it depends on based on the current thing is like the focus is on license complaints and security. So earlier the SCA was specifically for the security and then separate team taking care of it. And they maintain the vulnerable database and then we link to the software 360. So right now it is changing like for the modern software is where they use the modern package managers. It is run and build time. And then it is a process downstream. Yeah, in my experience as well, scanning at build time is absolutely required. It's completely different building an S-Bomb from a source repository where you're either parsing a manifest or maybe doing a simulation of a manifest install or at deployment time when you don't have all of those artifacts that may go away after the build is done. So yeah, at Microsoft we only scanned at build time. Kind of a proposal for the group and a question, should we be cataloging or categorizing S-Bombs for what time they were built? Like is that something when we have a database and we want to query them or look up something, that would be an important metadata to attach, right? But when you query or as a user, as a consumer, I want to know that, right? As a consumer, especially for the medical side, what do you need to see for support windows and what do you want to see recorded for support windows? Yeah, I'm just repeating a question so that I can, so from a consumer side, like what do we want to see in the S-Bomb? Or as a producer of a device, what do you want to be able to share out, especially with the CRA coming in, okay, in terms of your support windows? And in either format I think you have support right now. I think properly, fully. And so the question is, do you have guidance on things that you're trying to get together for? You see that? Yeah. Because I wrote some of the tools they are using, why do we have our own tools? Because most of the tools, for example, from the excitement providers, all the information that you would need for licensing. Right. So we look for, where do we find the source code? What are potential or existing information of the licenses? Are there no displays or whatever? And again, most of the existing tools, you will have to repeat it. Yeah. So just, most of the existing tools do not provide all the necessary information that you would need for license compliance or to have a real complete S-Bomb. And this is still an issue because this requires that you use different tools, that you then manually adapt data, and then we come to the point, Jeff, what you mentioned, yeah, we create the S-Bomb during CI, but later we store it in SW-16. Most probably we have to add or modify some information because you were lacking at the beginning. Yeah, and regarding the submissions to the FDA part, you know, as far as I know, even the recent submissions, like not in the current format, but FDA says, yeah, things are okay with it. But, you know, I think one thing I heard is when a machine readable format was shared, FDA was not able to process it. So they asked for a human readable version of it. That's an insider story that was there. FDA will do the conversion from SPDX to Excel. Okay. Yeah. So, I mean, adding on to what Thomas mentioned, yes, regarding because the priority in our organization, you know, first comes for the license compliance and then shared with security. So, yeah. And the missing information is not just common across all packages, but maybe very few. But I think in the last six months, we have seen a great improvement in terms of, say, for example, NPMR, you get this kind of sorted out, like we get almost 90% information at the time, but others are a little bit challenging. Yeah. What do you get? Yeah. You know what Carol, you got this in there? I mean, like we are comparing it with much more olden times. So we feel better. It might not be the best, but yeah. All right. I have another topic, but if people have questions, we have like five minutes left. Yeah. I'm going to know how to solve the problem of verification and verification of the SBOM. Like I said, a bit better, right? So, you will offer a person SBOM, you will see it and trust the developer right. But how do you know that what the developer provides for us? Yeah. I'll repeat the question. Like, so your question is how do we verify and validate the SBOM that we receive from the developer? The short answer is trust. I mean, at this point of time. Yeah. And because it's mostly on the tools that we have implemented. And then, you know, I think so far that is what's been done, but there is no formal validation that we do on SBOM because we have not thoroughly started processing automated SBOMs in its complete sense because things are here and there partially. So it's like there is still manual intervention in between happening. Yeah. I'll add to that. If you're consuming open source and you're getting an SBOM, you don't need to, you know, you need to build your own SBOM. Like if for anything that's open source, you should be able to fully create an SBOM based on how you're using that library and the dependencies that you're pulling in at that time. If it's a library and they're saying I depend on these other libraries, that's just kind of theoretical, right? So yeah, if you're using open source and you're getting SBOMs, just build your own. There's some, I found about quite a few projects on Friday that people are talking about maybe in the software heritage for public databases of these dependency information for open source. So we should be able to get the right kind of dependency graphs for open source without trusting somebody, an SBOM that somebody else built. So we need better verification. So as coming from automotive industry, how we validate it, I take your SBOM, I do the language, I re-implement the mini project, take all of your dependencies, run it again. If I then get dependency conflicts, I know, boom, we have already a problem. Then I run it to, in our case, OSSRDUKIT, which Marcel spoke about, where the entire database was to basically download all of the source code of all of the code again to check all of the licenses. But my standard first check is to take your dependency list. I generate for that whatever ecosystem is, Maven, Java, and run it again. If that already gives me a conflict, then I know, why do we have a problem? And then I will go back to the supplier and say, like, we cannot compile this. How the hell did you compile these versions of the code? How did you compile these branches into a product? But again, validation is very difficult. So the other rule what you do is we do a risk-based approach. You cannot do this manual check for everything. So I run the SBOMs to some special rules that I wrote in ORT, where it's policy is code, and then I filter out. These are the products that in my context are high-risk products. You know, the motive it means updating a car. In case you don't know, over the air updates from cars doesn't work. If you need to fix some of the cars, you need to recall the car to the garage. That means millions of euros, dollars, and yen that need to be spent. Those get checked in depth. Anything else we do? So we have a risk-based approach to validating SBOMs because consuming SBOMs is still a sanely pain in the butt in the ass. And you should add, because train your developers. Train your developers. Train your developers in Mesh. No, because the emphasis is on the beach. You know, right here in the modern world, I think I'm talking about different tracks, different formats. Even if you're good at building the locker, you can actually do a good job and feel accessible about these three topics. Sometimes it's just hard to do that right. So maybe try to be the best, provide the best bonds. And you have trust in yourself that you don't have any social rights. And it's just a lack of your knowledge or communication that could be compromised. Yeah, to summarize, developers don't have the proper tools. So it should be better to just provide the... So what I'm hearing is that maybe the SBOM community needs to get together and piece together a verification story. Like I should need to go and reveal the software. I should just take that document through some mechanism and verify that it's actually true. Yeah. Maybe we have time for one more question. Any closing thoughts? Yeah. Yeah. I mean, just to summarize, after observing all these developments that are going on, all these discussions, I think from a healthcare perspective, we are going to take a very cautious step towards it. Because since we know this is quite a disruptive change, and also it is a mandatory change that we need to do. So our approach would be very cautious. But we want to be equally close to the community and very close to the developments that happen so that we would be able to adapt the changes in a fast-paced manner than when compared to three, four years back. Yeah. I mean, I think what I already said is like, yeah, set a high standard for generating SBOMs, do the build time, catalog your SBOMs, and then try to drive insights and relationships between where you're seeing commonalities between all your products. Thank you, everyone. Thank you. Thanks. Thanks. Thanks. Thanks.
Sharing and reusing SBOMs with the OSSelot curation database
All right then. Thank you. Thanks for the holler. All right. So I think I did. Yeah. Okay. Okay. So my name is Karen. I work for OSDL and I'm going to take a step back actually from everything we've been discussing here today about what capabilities S-bombs have and what they need, what more capabilities they have to have as well and how about tools to create them and go back to, well, I think what they were originally for, what we thought we originally should use them for with license compliance. And there we have a lot of stuff where we're still redoing the same work again and again and again because creating S-bombs doesn't work automatically, at least for most of the software that we're dealing with in embedded Linux systems. So there's still a lot of manual work required and that's where sharing and reusing work makes sense and this is where the OSDL project comes in. So I think this is fairly obvious. I don't need to go into this a lot why reusing makes sense. We don't want it to redo work that has been done before that is being done again and again and again. I mean, we'll still get these questions every day that why do I have to extract copyrights from Linux kernel source code. Someone must have done that already. Why can't we reuse that? And so why not do that and why hasn't it been done before exactly what can we do? So you know more or less compliance toolchain could look like this. We can't share work everywhere but we can share work where most manual effort is required with scanning and with curating data because as good as the scanners are that are out there, we've heard about scan code, we've heard a lot of tools ort, ort, ort, ort, ort, ort, ort, ort, ort, ort, ort, ort, ort, ort. So all of the scanners are all the tools that use the scanner materials, they're really good but there's still quite a lot of mistakes. So to actually do license compliance properly, we still need to do manual curation of the data. And this is where the Ocelot project comes in, you can find more information on the Ocelot website. The data itself is available on the open source compliance repository and the package analysis repository there, you actually find stuff that you can already use today. License copyright analysis results for various packages, mainly from embedded Linux systems. We have about 320 when I last checked. So different versions, of course, are 200 unique packages, more than 1.5 million files that have been manually curated. For each package, we have some metadata, so where the package comes from, a package URL to find where the package comes from, download location, and so on. Then there's the S-bombs in there. So the S-PDX S-bombs is what we're focusing on for license compliance in different formats as well with the license conclusions in there with the copyright notices. And I think that is probably some of the most valuable part of this with comments on why a particular decision was made. Because sometimes it's not clear, you can find information in a file. I don't know, you know how licensing information is noted in some files. It doesn't really follow any standards, especially in older software that's still being used. And then you have to make a decision, you have to do some kind of interpretation. And this is explained as well as part of the S-PDX files that are available there. Also, the S-PDX files themselves are explained because what we find, even though there is a standard, there is a specification, people still understand it differently. So someone might expect to get an S-bomb from their suppliers and they have a certain expectation on what a particular S-PDX file looks like. But they understand the different tags and they're differently than the customer. So we have a clear explanation of how we understand the S-PDX tags and of course we try to be as close to the specification as possible there. And then also for convenience, there is a disclosure document where if you find a particular package that are reusing in exactly that way unmodified and exactly that version you might for license compliance just use the finished disclosure document with all the license texts and copyrights and acknowledgments and so on, aggregate it. So of course it's not yet big enough to immediately license an entire system, but it is definitely a start. So as I said, the question why this hasn't been done before has been around for quite a few years and why hasn't it been done before? And I think two of the main reasons are liability and trust, which are more or less two sides of the same coin. So on the first hand, who was willing to supply such information, which is legally relevant if we're talking about license compliance or if legally relevant information where companies have gone to court over licenses. So who was willing to provide this and say, look, you can use this and we don't give you a guarantee but we did our best to make this documentation as sound as possible so that hopefully you won't be taken to court if you use it. And then on the other hand, you're a company and you're putting out products and you reuse legal information that you found somewhere on the internet. How can you trust this information? And these are the thoughts that we were thinking when starting this project. So how can we limit liability, first of all, for ourselves and for anyone who's contributing? And of course we asked some lawyers about that. And the idea was to license as liberally as possible. So we went with CC0, 1.0, that gives you as many rights as possible. And it works well for documents as well. In this case, gift regulation supply and liability applies only for willful intent and trust negligence, which we try to avoid. Also, I think the times have changed. So maybe ten years ago there was a lot of worry, especially from the US, that there's gonna be lawsuits in the open source area. But there haven't been any, not with providing legal information, with providing support with licensing, there haven't been any or none that have been known. And so I think people just got braver and said, okay, maybe now's the time that we can do this. And then on the other hand, we have trust. So how can you establish trust in the information? I think that's fairly straightforward is provide good quality. So do the curation conscientiously, diligently, only let people do it or let people contribute to actually know what they're doing, so train anyone who wants to contribute. I mean, it's a bit of a bigger hurdle for contribution, but it's really important as well to keep up the quality. The same goes with review, so the stuff is on GitHub, so we can use that for the review process. And yeah, we'll stand with it also, we'll stand with our name to make sure or to promise that we'll keep the quality as high as it started out with. Let's wait around. So what are the curation guidelines that we established to ensure this quality? Well, we're working with Phosology, I think that's just our preference. You can use any other tool as well. We're using ScanCode as well for scanning and integrated into Phosology. And we use the source code as upstream as possible. So for ideally directly from the project page, so to not go through any of the stages that we've seen on some slides before where stuff gets added from package managers. But we'll try to start as upstream as possible at the moment. And then I think the diffs that you get from what's added by package managers, this is something that can be included as well, but we're not there yet. So at the moment, we're still trying to go with the origin. And then curating the license, as I said, there's manual work in there. And I think that's the valuable stuff of this project. So license findings, copyright findings that the scanners have created are curated manually, of course, with all the help that Phosology can give with that. So with our curation guidelines, I don't have to check the time. I don't think I'm gonna go into too much details on that. I mean, if you have looked at the scanner findings, you know why there is some manual work required still. So with copyrights, it means mainly that stuff that was incorrectly identified as copyright is removed, stuff that is added to a copyright notice that's not really part of it, formatting signs, there's sometimes license notices, just part of code that is identified as part of the copyright notice is removed from the copyright notice. And then there might be references to external files as well, like copyright by the authors, project authors, C file authors. And then this information has to be added as well. With the licenses, again, reviews then on file level. So every file of the source code tree is inspected, if the scanner has found anything, or if it is mentioned in some kind of notice file or similar. And this is done in addition or even though if a package contains some kind of metadata on licensing. Because we've made the experience, and probably a lot of you have as well, that metadata just gets outdated or is incomplete and so can't really be trusted entirely. And I think that might also be one of the reasons I can imagine that this question might come up. So why do we keep all this information in a separate place? Why not upstream it into the upstream projects? And I think there is some reluctance in upstream projects to provide legally relevant information along with the source code. And also because then we would have, again, we would have the same problem that it just won't be updated. It's just how people are. And yeah, okay, so we check, we do curating on the file level. We confirm or correct scanner findings as you do. We add individual license texts as you have, especially with BSD licenses and so on. So this is also something that's not usually done by scanners. We only tag main licenses if there is a clear main license given in the root directory for a package to not mislead anyone that this might be the only license that's in there. And as I said before, the license comments, the license comments tags of the SPDX explains any license decisions, any curating decisions that are not obvious or that need some level of interpretation. Yes, please. What's your correction rate on average? Do you mean how many scanners are finding? How many do you find that you're like we have to step in? Well, that differs. Yeah, sorry. The question was what our correction rate is. Well, it differs heavily per package. So there's some packets that are really good order where, I don't know, I don't have a number. I would guess around 10% and there are packages in horrible shape where it's closer to 80% that needs manual work. Yes. So let's say I processed more than 3000 packages for sovereignty and I would agree but maybe 20% in general. Yeah, yeah, so that was just some agreement about the numbers with someone who clearly has more experience than me. I can't say I've done 3000. Yeah. That's because of the gray hair. Maybe you're just talking in detail about the clearing process at Siemens. Well, you might guess that there might be some connection there as well. Yeah. Okay, so what do these license comments look like? They also follow some kind of heuristic, so the usually says the information in the file is quotes, whatever the information in the file is. And then we give a reason for why we made whatever conclusion is concluded. Example, we don't have a version of a license given. We find this, this file is GPL. And then the license comment would be as no version of the GPL is given. GPL 1.0 or later is concluded. But we interpret and this is clearly is an interpretation. So this is a legal step that when we find this file is GPL, one could also say GPL the most, I think it still is the mostly used GPL, is still version two. So you could also go ahead and conclude, they probably mean version two, because it's the most heavily used. But our interpretation here is if they only say GPL, the author wanted to give us the option to choose whatever version of the GPL there is available, so version one or later is concluded. So this is something that is a step of interpretation, but that is explained in the data. Or for example, a URL is given instead of a license text. And then of course, the URL is checked, a date is given if anything was found more often than not the URL is dead as well. And then maybe additional research is required. And then of course, the information and the date is given as well when that was checked. So what are, yes? But in that last case, do you report it to the packet itself? Yes, yes. So in case we do find problematic things, we report them back. I mean, there are some licenses that have a URL in the license text that is dead. So I mean, and then people usually say this license is outdated, but it's still valid for some files that are out there and that are being used. So sometimes, yeah, there is, sometimes it's helpful and projects react, but sometimes there's not. But yeah, we try to, whenever possible, we try to report it back. That was also the question, sorry for not repeating it. The question was if we push it back into the projects, yes, we do our best to do so. And then going forward on what you need to comment. If the upstream doesn't take it, how much, what's the hit rate in terms of them ignoring it? So what, the question is what the hit rate is in terms of them ignoring it, it's not large. Mostly they do take it. So because most projects are interested in being license compliant as well or making it possible for users to be license compliant. Because that's what we're trying to do. We're trying to do what the project or the authors wanted us to do. We're trying to make it possible for users to be license compliant. So projects are usually keen to take, to take any or to help. So you asked about the rate before. I couldn't give you exact numbers, but I can give you some example of what kind of scanner findings we do have to correct. Well, I think they're fairly typical and if you've done any curation, you'll know most of them. So I'll go over them fairly quickly as well. So we have not a license or something has been found that just simply isn't a license but a bit of code or just whatever. So that's removed of course. It might be not the files license. So it might be some part, some information that's content of the file but isn't the files license. We have that in documentation quite a lot. Then license text. This is something that of course scanners get wrong and I don't think there's any way to fix it either. If you have a file, a license.text file that contains the license text, then of course the license of that file isn't the license text. Most licenses don't have a license themselves. But new licenses for example have a license. But this is something that's corrected as well then. With generic license text, I said that before. So individual texts, if that differ from the generic license text it is. Of course provided we have improvised, imprecise findings in particular those with respect to version of a license. Then dual licensing cases, especially if it's not a single. So an easy dual license where you have this or that license but you have this or that and a third license or this one license and the second or a third license. So these need some manual work as well. We have license exceptions that we handle a bit differently than Phosology does to bring it into one finding as well. But that's maybe particular to Phosology as well. We have external references that need to be checked. As I mentioned before, it might be URLs. It might also just be external references within the package though. And that also there's a lot of problems there because then you have files that are integrated from a different project and then in their file, they say look in the copyright file in the root directory. But they mean the root directory from where they're originally from. So then that information isn't true anymore and we'll need to do some research and then of course explain what research has been turned to find out where the file originally came from, what license it is referencing. Yeah, so that usually takes a bit of effort. And then we have global license assignment or partially global license assignment that we don't usually use. Again, from the same reasons I said before that meta information is usually wrong or that stuff is included from different projects. So if there's a read me file that says all files in this directory are licensed under the following license, we usually don't go with that information. Unless it says in a particular source code file, it says for license information, see read me file. So this is something that just I think comes from experience. Yes. There's a package manager field where there's a specific license field. That's filled out with the proper SPDX and the fire, do you apply that to the. Okay. The question was about package managers that have a license field or that have a tag for what license is. So at the moment that's not come up yet because we're on like fairly at the bottom were from come from Linux based embedded systems. So this would try and so far we haven't gone into much that is managed by any package managers. But the stuff that I have looked at it depends if that's the only information that's there will go with it. But there might be different information again in the source code. And if possible, we'll always go back to the source code. But we do give, so if we do have third party, or meta information, we also add that to the information in the package, license comment I think is where we add that kind of information. Yeah. I think there was another question. Yeah, it's fine. It's fine. Okay. Okay. Yes. So the project seems to be mostly organized for collaboration among humans. Yes. And not really consuming information about machines. For instance, there is no API for media. Yes, yes, there's a REST API. But for instance, package naming seems to be quite vague. So, so there are uppercase, lower case, and this is the. Well, we tried to go with the, yeah, yeah, you pointed out something. The question was about if the project is made for human consumption or machine consumption, automatic consumption. And I said that there is a REST API to call the files, which is not described in the repo, but on the Ocelot website. If you go to ocelot.org. And then the question was about the naming schemes, and we tried to be as close to the upstream naming as possible. But then again, they're not consistent usually. So, yeah, there is no, so we didn't make up our own schemes, but we tried to stay with the upstream and where there's inconsistencies. Well, we mirror that. Yeah, that's right. Do you know where is described API? Because even on the website, I cannot find it. On the ocelot.org, oh, I might be on tools actually, sorry. On the wiki, wiki.oscelot.org. Try that. Yeah. So, lately I found in some license listing that we were, someone who was using libvui.id and that was TPL listed only and that was taking licenses. And then the readme file which tells me, yeah, various source code in this license has different licenses. Various source code in this package has different licenses. So, and looking at the source code, we are presumably using the functions. It says, oh, it's not strictly TPL. So, not poison from a proprietary business point of view. And how do you express that kind of, I mean, we had the discussion before regarding vulnerabilities that you need to get back to function level. Do you foresee that necessity in your work also or do you strictly handle packages because? We handle files. So, the question was about, as I said before, with the meta information is imprecise, let's say. So, we go back to file, source code file level and what we find there, we believe. So, there might, you're right. You might, we might have to dig deeper but then it's over snippet matching and we, so we only assign something to a package if there is a clear main license but we also warn you can have this information and take it but don't take it as the only information that's there. Okay, there's more questions? Yes. Thank you. What you're doing, I think it's great. I was wondering about the upstreaming of the information that your gallery, first you said well, upstream is often not interested in it and then you made a statement like they really like to be licensed compliant. Did you, do you have statistics on that? Like what is your gut feeling about this because my personal experience coming from the videos and that you can list, many people have interest but they need help with it. I would say these two license compliance we needed to recover have to help them but you were having a great data set to help them actually. Yeah, yeah. So, well what our experience is that when, oh sorry, the repetition, the question was about upstreaming, what our experience was with upstreaming, if the, how the projects would react with that and you said your experience was that they're keen about license compliance and I, well yeah, most of them are, there's exceptions always and we have that as well with, like with concrete and particular cases when we say this file, we found this problem, can you, or can you fix it, can you clarify? Then they, most of them, very, really, very most of them are fine but, well I have to admit also we haven't tried with super many projects but if we say we have, we did a complete license analysis of your entire package for this and this version release, here's the SPDX file, then they're not as keen to, to, to provide that via their website because that is legally relevant information so that rather, I think we had one or two projects who were like, oh that's cool, we'll point to your site but we're not going to provide it through our stuff because there is interpretation in there, as I said before which we explained in the comments. There's some interpretation in there so there's some wiggling room and I don't know, maybe we could reach more with more effort. Okay. There's a few more questions, do we have another minute or time's up? Okay, so, well contact me anyway, I'll skip to the last slide so there's some, yep, sorry. No, that's good though, I prefer discussions. So contact me at info at auselaugh.org and we can, we can chat anyway. Okay. Thanks.
Welcome to the Devroom and Announcements
Welcome to this year's deaf room on software defined radio and amateur radio. It's been a bit of, you know, a harsh year for this place. Actually, we're very happy that we made it here. Last year, the SCR community didn't have a deaf room at Foster, which was really sad. And I'm really happy that we didn't have our own deaf room this year, but a deaf room together with the amateur radio community, which obviously has a lot of overlap. And so this will be like a slightly more diverse presentation than we might be used to. This is super nice. I'm Marcus Miller. I'm one of three deaf room organizers. Do you want to introduce yourself? I think if I have to, my name is Paul Merr, I'm from Switzerland, obviously. I'm a software developer and my stuff I do in amateur radio is mostly developing software. I'm also very happy we're together with the SCR guys here because that's a field of activity for the amateur, so it's very interesting. Yeah, so obviously, I'm Marcus Miller, maybe known from the radio project. I'm very happy to work with amateur radio at Jesus because, well, the application follows the tools in other ways. The third person is, I haven't seen him best year, so I hope he comes in, but we'll start without him. So a couple of things that I'd like to ask the audience is, of course, clean up after yourself. So if you leave, look whether you left some bottles or something because otherwise things will get. Harry, we are not overfilled today, which is a new thing for us. Usually the SCR deaf room was so packed that we had to arrange for people to stand not in the escape routes. I'd like to ask you that if you see someone who's blocking an escape route, talk to him. The other thing I'd like to ask is that if we can find and volunteer to occasionally check the online stream for this room and check whether there's something in chat that someone writes like, we can't understand the speaker or something. Let us know. So that would be the organizational thing. So coming to the content of things. Hi. You made it. So this is best year. This is a come over here. I'm a bit late. I was taking for a fit. Yeah, come on. Come on. Come over. So for the speakers in the room, the cameras over there, you can see yourself on the small screen there. So if you're not on, you can't see yourself on the laptop, then you're not on the screen. Content wise, we got pretty diverse collection of things and we tried during selection and schedule of the talks to make them a bit grouped so that we're not people who want to go to other death rooms can leave and stay for more than one talk. So we start off with me, obviously, and I will give a really, really brief introduction to what happened in Noredu since last FOSTEM, which honestly is going to be a bit opinionated because it's what I think is worth mentioning in this context. We go over to Sylva, who's going to talk a bit about using GPUs to improve the throughput in SDR computing. This is very interesting to me. We go over to Mark, who will then talk about a more modern approach to controlling transceivers than most of the tools that we have today. Then we go for the radar satellite group of topics. So we start with Jean-Michel talking about, sorry, lost it, synthetic aperture radar. We'll follow up on the Qo100 payload and we'll close that part of the satellite things with nanosatellites. I'm not going to go through all of these, I just realized. The next topic is basically SDR architectures and SDR application software, and then we'll go into cellular and radio science. So this is our rough rundown. I'm pretty good on time. I can now start doing my next talk, but I guess we'll take your opportunity to, if there's any questions, ask them now. Okay, so yes. If it's loud, then close the door, please. Yes, of course, yes, he did. So that's true. We do have like a schedule switch. So the second talk, the Tetra talk, I think gets switched with Ang's talk on set-dump, right? So that is another satellite. Heavy top, what? Yeah, should be fine. And I'm excited that it actually works out. Yeah. At the room is the wrong side. Oh yeah, we should probably do it. I happen to have like, you're the perfect person. I have some paper. Yeah, yeah, yeah, I have paper. I have paper. I have paper. Yeah, yeah, yeah, I have paper. And while we're at it. Then I'll just, I mean, this will probably mess up the video stream afterwards because they're going to cut it by the minute, but that's something that we'll arrange in post. Um.
Using GPU for real-time SDR Signal processing
Real-time processing. A very brief background. So I'm Sylvain, Fox 4 Golf Kilo Radio, this is my amateur radio call sign. Very briefly about myself, for those who don't know me yet. I'm the founder of a small company in France, doing SDR. The name is SDR Technologies. I'm the most important here to introduce the story about GPU, and it is the next lines. So I was working for ONERA, the French Aerospace and, well, military, I would say affiliated to the Ministry of Defense Research Lab. And that explains how this started, and I will come back to this in a slide. So very briefly, the outline of the talk will be that I will explain the motivation and then try to explain the approach I took when I tested this GPU and why it came, why I had the idea to use this for DDC. And you will see, and I tried to take a few minutes to explain the background, not of the code, because it's on GitHub and you can read it, and I'm quite sure you will improve it a lot, no doubt. But just to explain for those who are not yet, who are not familiar with GPU, and why it can be useful, and what kind of things you can do with it as long as you take the time to write code in this. So very briefly, the story started a while ago when I was working at ONERA where we have radars, and I just took some pictures that you may have seen already. One is the Nostradamus, it's an HF over the horizon radar, and the other one is Grav, very famous. So these two radars were designed and operated in not my team because I was not leading the team, but in the team I was working in. One of the key problems here is that you have a lot of channels. One antenna is one channel, means that you gather a huge amount of data. And one of the key problems is how do you process this data in real time. And at that time, in my, don't remember exactly the year, NVIDIA released the Tegrar K1, which was very small stuff, but looking promising, sorry, in particular for embedded systems. So my boss said, can you have a look at this and tell us if that can bring anything to the game. And just to make also the story very short, the answer was yes, it's useful, and that made my decision to leave the research team and found my company. So yes, the quick answer is yes, it works. Okay, so now let's go back to more serious things. This is from the leaflet, I would say, from the Tegrar K1 at that time. They were promising something like 326 gigaflops, oh, five watts, 99 euros for the deathboard. You say, who? Does this really work? And that was the idea. That was the idea was to test this if that can be used for software-defined radio. I'm assuming here that most of you have a very brief and very quick idea of what a GPU is, so I will just take a few seconds to explain. I'm just realizing that if I move to the screen, nobody will see from the remote, I guess. Yeah, okay, I'll try. Sorry. So just to explain the model, this architecture has two things inside. You have the ARM ARM processor, this GPU, this is the four cores, this one, and you have the CUDA cores, 192 cores next to it. And the good thing is that they share the same memory. Okay, if you have a PC, you have your core, whatever you want, and in one of the slots you have the GPU cards, and they have to transmit, they have to share data through the PCI bus. In this one, it's a bus, it's kind of PCI bus, but you will see that the performances are much more interesting. The second thing is that one core does one symbol operation at a time. So in this very simple example, I'm adding AC is equal to A plus B, and the code is just saying for each CPU, each CUDA core, take A, take B, make the sum, and store in C. That's pretty simple. So the key point here is that there are three things. One is push the data, the second is push the code, then run the code, and fetch the results. Keep in mind that you have to push the data, and this costs a lot, of course. So coming back to our SDR and DSP, what are the things that may need power? Well, just examples. The one I will elaborate this morning is the DDC, digital down converter, but you have many others, like I have not yet, and I will not describe this morning, I have not yet investigated so much. Feel free to take a seat, no worries. Interpolation, decimation, clock recovery, synchronization, pattern detection, and so on and so on. One of the key issues here is that some algorithms are extremely difficult to run in parallel while others, it's much simpler. And some of them just don't work in parallel easily. So in this example, let's focus on something simple, which is multiband DDC. So we'll assume that we have a white band signal coming from a white band SDR, whatever it is. I took DHF example. So here, for example, we have a receiver that is transferring to the memory. To the stuff, the device, a 50 mega samples per second bandwidth. And we want to extract from this small subbands. OK, so I took examples of DHF bands, one at 7 megs, another one at 14 megs, and so on. That's just examples. The core thing is how do we extract the subbands from the single white band signal? So for one channel, it's pretty easy. And that's the classical stuff. This is a DDC. So you basically translate the frequency and then you make some low pass filtering. And then you throw away all the samples you don't need. That's very classical. I have not invented anything here. And I guess you all know by heart what is a low pass filter, but just take a few seconds to remind how it works. On one hand, you have the input, the samples coming from the SDR. On the other hand, you have the filter you want to apply for the low pass filtering. And you make a convolution, basically some modifications and additions, and you retrieve the output. OK. Now let's look a bit more in my example. How many taps do we really have? So for this example, let's assume that we have a 50 megahertz, so 50 mega samples per second bandwidth incoming. And we want three kilohertz just to extract the audio. This is a fully digital system. So at the end, we want audio, plain old voice, someone speaking. And we assume that three kilohertz is enough. There's a lot of different approaches to estimate exactly, to estimate as accurately as possible the number of taps we need. I saw many, I tried to find an example. I saw plenty and pages from you, Marcus. Many of, I was going to copy and paste some of yours to avoid questions. No, I'm joking, of course. Well, there's many ways to estimate the number of taps. And one of the approaches is this, I don't remember, yes, the Iris approximation, sorry. And so if you do the calculation, you arrive at 50,500 taps. OK. 50,500, so what? So what? Now let's go back to this stuff. So to do the convolution with 50,500 taps here, you need to do this 50,500 times for each sample. It means that to get one value out of the FIR filter, the Lopez filter, you need to take 50,500 inputs, 50,500 coefficients, do the multiplications, do the sum. And you have one sample. And you have to do this for every incoming sample. That begins to be a huge amount of processing. Of course, if you have, you have all experienced many low cost SDR applications running on low cost PCs and they do this in real time. So how do they do it? Of course, there are tricks. One of the most, the easiest one is to divide by two instead of going straight from 50 megs to 3 kilohertz, which is nice, but probably not the best one. You do this step by step by dividing by two. So you take the first band, apply a half band filter, so you have less, you have the half of samples and you repeat this several times. That's very interesting because each time you remove a lot of samples and if you do this clearly, you can have 50% of the coefficients that are zero if you compute the fear in a good way. So that removes you a lot of computation. Of course, yes, but this is not ideal because you will hardly be able to reuse the computation you've made for the other channels. You will reduce a lot the throughput, the number of calculations you need for one of the channels, but then the next one you will want to reuse some of the calculation you've made and that's not easy. So at the end, this doesn't work so good. So, so can this stuff help? So I just put two examples here. On the top you have the Jetson XAVNX. I know that in an open source conference promoting a brand like NVDA is probably not the best idea, but just to make the story short, I have no sponsoring from NVDA. Okay, so just to be figures and facts, the first one is the XAVNX, so it's roughly 500 euros, roughly. And this one has 384 cores and the next in line is the NVDA 800, which is not the same price, 20,000 roughly, and has 6,912 cores. Okay, the interesting thing is the two FFT benchmarks are put below it. So if you look at the Jetson XAVNX to perform an FFT of, sorry, I'll say it this way, 2 power 19, which is quite a lot. So it's 310 microseconds. That's quite a lot. But of course, if you look at the most expensive one, you have 170 microseconds for 2 power 23, which is a huge FFT. A huge FFT. You can do this with an FPGA, but to get those size, it's becoming fun and extremely tricky to do it. Okay, so and for the XAVNX, you see that if you go up to the power of 2 at the power of 23, it's 7 milliseconds. That's a huge number. It's quite fast. So how can we use this? Of course, if you look back to your DSP lessons, that's pretty simple, in fact. A convolution can be, I mean, applying an FIR to a signal is just making a convolution. And the convolution, you can use the FFT. That's pretty simple. I mean, that's pretty known. You take the input signal, you do the FFT, you take your filter, you do the FFT here, and then you make a product of the two vectors. There is a bug. It's FFT minus 1. There is a bug here. Inverse FFT. And you get your output. So basically, you do FFT, FFT multiplication, inverse FFT, and you have your output. That is for one single block. Okay? That's quite good. It works well. But this is for a steady signal, not a stream. So if you want to do this for a stream, there is an improved version of this algorithm, which is called the overlap save or overlap hat. I use the overlap save, which is basically sliding a window, sliding blocks, moving the input, doing the computation, and so on and so on and repeat this. The key point here is that you use always the same filter. So you can compute the FFT of the filter once and keep it. And the input, you will see, can be reused several times. So basically, if you do this in the GPU, the performances are quite interesting. And this is what I did. And this is what I'm going to show you here. So the idea is, this is the architecture of the code I'm proposing. You receive the samples from the SDR. You do a first FFT. So you push the samples into the GPU RAM. Okay? Then your code does a first FFT or the incoming block. You assume that you've done previously at the init the FFT for the several filters you want to apply. So here in this example, I have two. You do the connexer product, modifications for both FFT, the reverse FFT, and the decimation. And you're done. There is one trick. I will come back to it in a few slides. So basically, it means that, sorry, if I go back to this slide, excuse me, you do this FFT in fact only once. You reuse it for the different channels you want. You have done the FFT for the filters once. So in practice for each new incoming block of samples, you have to do one FFT here, modifications, FFT minus one, and decimation. And that can be quite fast. All this doesn't need data to move from the GPU memory to the main CPU memory. But that's quite fast in fact. Then one trick and why I have ended with using the CUDA and proprietary API and the NVDA stuff. I've heard from guys in this room that you can now do this in OpenCL. I have not tested, to be honest. One of the trick is that if you don't pay attention to the scheduling, the different channels will be the different codes will run in series, in sequence, FFT and so on and so on. So you will have to wait for the last block of sequence of operations to be finished before you can retrieve all the samples. And you wait, you may end up waiting quite a lot. But if you use this trick just to compile option, switch, then the scheduling inside the GPU is different. And then everything run in parallel. And the difference is quite large, quite big, to be honest. The difference is much faster this way. One last thing is that if we only do what I proposed, then you miss the frequency shifting. There is a problem, the output frequency is not a good one. So you need to apply the NCO to shift in frequency the signal. And of course it's much more efficient to do this at the end because you have less samples. So it's much faster. You do the shifts at the very end and you just use the fact that you have some aliasing. So the code compensates for the aliasing and that's the frequency shift at the very end. Just look at the code. That's easier this way. So what am I proposing this morning? So you have in GitHub a lib, an example. That's a code that is quite old from me, but it works. And the key thing is that you have to allocate maximum number of channels you will use in the beginning, basically because it will allocate in the GPU the RAM for the different operations. Then the code is thread safe. That is to say that you can add, remove, shift, replace, change the number of channels you use, the size of the channels and so on in real time. This is CUDA based. I know that maybe OpenCL could do something that I have not tested. And I have only tested this with NVDA GPUs. So just to give an example of what you can get with this. So I just benchmarked this with two different architectures, the one I had, but I'm sure that I will receive tons of PR to add new figures in the tables on GitHub for sure. So practically speaking on my machine at home, it's a well, average PC with a GT RTX 2060 with one single channel. So throughput is just a bench test code where it's just pushing data to the GPU, making the computation and retrieving the samples. So with one channel, it's roughly 600 mega-samples per second. With two channels, 530. OK. Just as a baseline for comparison with the Jetson XADI-NX, depending on the FST size, that changes quite a lot. And you can reach up to 156 mega-samples per second with two channels. One channel, sorry. And 117 with two channels. The filters were 7200 taps. Excuse me, that's average. You can change this in the code. I'm checking the time because I know Mark will kick me out soon. So just to come to the, just one of the interesting things is that if you look at the figures here, you see that the GPU is roughly 80% used. The PCI is 36%. So there's room for improvements. And if you look at the CPU, one core is 100% and the others are relaxed. So it means that maybe there's room for much faster, in fact, because we are far from overloading the machine. And in fact, if you look in detail, where is the bottleneck? It appears that the bottleneck is the memory copy. The synchronization between copying the memory from host to device, wasting for the threads to start, waiting for the kernels to stop. All this synchronization takes a lot of time. And if you start to plot this in time, NVGA comes with the tool. I don't remember the name, where you can see the different threads in time, how they work. And you clearly see that there's, the bottlenecks come from the synchronization from the host and so there's room for improvement, for sure. So if you want to tune this, you will see that the, of course, the size of the FFT used has a strong impact on the performances. But that really depends on the performances of the GPU you're using. As I said, moving the data from host to GPU is extremely expensive. In the example I was doing, copy from host to device in complex float, I could use complex ints, raw data from the SDR, and there is in the code one example where you can convert the ints 16 to float directly, so it's cheaper. I mean, the amount of data you would copy from the host to the device is much smaller. And I was using LibUSB in real life. I mean, not in the example, but in real life. So it's also very expensive. LibUSB is far from optimized, from optimal, I would say, more than optimized. And of course, one of the important things is that the CPU, as it's not, well, that's the different cores of room for other things. It means that you can do other tasks like paint and eye spectrum on the screen, like send emails, like listen to music, whatever you want. I think that's all thoughts. Thank you very much. I didn't want to spend too much time. And I'll be happy to reply to questions if you have any. Thank you very much. Yes. Yes, please. You said you did the frequency shift at the very end after this, and is it possible to already do at least a significant part of the frequency shift by just offsetting the FFTs? That's what I do. I rotate the FFT. I rotate, yes. But then you have a reminder, because if you do this, you have the shift you perform is an integral fraction of the incoming. So you need a post fine tune. And that's exactly this. Yeah, you're right. That's what I'm doing. Yes? You didn't see an IIR, FIR, or CIC filter? Just FIR. Yeah, because it's just FFT and Chronicle products. That was the simplest approach. Thank you. Yeah? Was there any attempt to match this into the radio? Not yet, to be honest. I'm not good enough in the radio. I had a discussion with Jean-Michel, a side discussion, and there's a plan to do it. The point, I mean, I was not able to do it for them. I don't have enough practice in C-Blocks, so I said, OK, let's do this with the guys who know. So we will come with a proposal. Yes, that's the idea. Typically, the idea would be to have something, if we can do it, that would permit to have messages to change, to add, and remove channels, or tune the channels in radio directly. Because one of the points is that you need to allocate to define how many channels you want to use. So depending on the applications, you might need different numbers of channels. That's why I wasn't able to do it. Any other question? From the audience? Yes? Just a small question. You used a single precision floating point. Very good question, in fact. Single except this one, the frequency shift. Because in CUDA, the sine and cost functions are nightmare. They produce a lot of noise. So in the code, it's written, double precision, don't touch this. Because otherwise, the noise is going up very quickly. Anything else? OK, thank you very much. So there's more folks pressing in. So if I can ask you to give a little bit more space. You didn't need to kick me out. That's quite fine. Bonjour. Thank you.
trx-control - modern software to control amateur radio transceivers and other hamradio hardware
So I will talk today about the project that I started last summer and that's TRX control or transceiver control. It's a, in my opinion, modern approach, client server approach about controlling amateur radio transceivers and other hardware devices like GPIO devices, rotators for antennas, relays and to integrate third party software systems like, for example, the DX cluster or SOTA cluster or querying the QRZ database. I tried some amount of my work time has gone into this project but let's start at the start. It started actually at the SSP Field Day last summer. We were contesting in Switzerland in our nice club station of Hotel Bravo 9 Alpha Golf. And we had this setup where there was always one operator and one helper. To the right you see Hotel Bravo 9 Echo Yankee Hotel, she sits here. And she was mostly taking notes about the call signs that we heard and I was operating the radio. So we had during this day lots of barbecues, talks and having quite some fun. We got ranked fourth out of seven stations. Always was not too bad. But that was where we had this idea that this setup was not so optimal for contesting. And we were, normally we do search and pounce. So you're listening to the frequencies and when you hear a station you pounce in, you try to make the contact or the other operating mode is you're running. So you're calling CQ, CQ, CQ, CQ, CQ until somebody replies and then you make the contact. And what we were thinking while discussing what we are doing here is why do we run or search and pounce? Wouldn't it make sense to search and run or pounce? So do it a bit different. And that's where we had this idea. If we had a central point where we could control the receivers then we could probably have one station that listens to the spectrum. And whenever it finds someone who is running, he taps on his touch screen and it will enter this frequency and operating mode into a waiting queue. So we have a search station that is just tuning the bands and enters heard stations into a queue. And then we have to run a pounce station that sees this waiting queue and can tap on a frequency and it will tune the transceiver to this frequency and he can try to operate this station. And of course the contest logging software in an ideal world would also be connected to this central system, this transceiver control. And when we have this we can also have a big display for bystanders with whom are we in QSO, on which frequency are we operating because when we are contesting we have a camper, a big camper and we always make this a little bit of happening so we have visitors coming by and we talk them into what we are doing here so such a display would be nice. So what we want to try out next time is a setup with multiple transceivers and with stations such as listen and stations that operate. So what we need at the very core of this system is something universal that can control our devices and we have transceivers, we have GPIO pins to control relays, the rotors of our antenna, we have a 20 meter mast and maybe antenna switches, maybe extensions or whatever so in the beginning I wanted this to be as universal as possible. So we have at the core something that is called TRXD, D for Demon and the number 8 that's the manual page where it's documented in Linux and we have various clients to the bottom that could be PCs, that could be whatever, that could be a mobile phone with an application with an app on it and we have the hardware, the relays, we have GPIOs, we have the antenna rotors and of course we have the transceivers. So the basic idea was to have a clearing house for controlling all these devices in a common language and most modern transceivers can be controlled by software but Jaisu uses his own system, Kenwood uses a different system and of course ICOM has another system too so you cannot use one system to control them all, you have to write something. So what we do here, this is a more schematic diagram, we want to have this demon that can be connected over TCP IP and then we can talk to this demon and instruct him in a common language what we want to do with that and that receiver and these commands shall be uniform for each transceiver so it is one of the jobs of the TRX control system to convert those common commands into the device specific commands needed for example to operate a Jaisu transceiver and we thought it would be nice if we could also connect to this system using web sockets that makes it super easy to contact it from JavaScript or from Flutter or from any other language system so basically playing sockets and web sockets and what you see on this diagram you see what TRX control is not, it is not a software with a user interface like Hamlet Delix or something like this, it is really meant as being the controlling part of or at the center of your Shack automation and not a client to operate the radios so the clients are not part of TRX control besides a few example clients. So TRXD accepts really requests from the clients and controls the devices and the interesting part or one interesting aspect changes in the operating system are detected and are sent to the clients if they wish so they can subscribe to events and then they will receive frequency changes or new DX cluster spots or whatever and we can do this for transceivers that are by themselves capable of sending status updates but we can also do it for all the transceivers that cannot do it like the YAESU FT817 by means of polling. So this really works nice with all the transceivers I have. So what are the goals? The main goal was to create something that is modular and extensible that is more of a solid framework than a complete program but something that can be easily extended by yourself. So it should be easy and it is easy to add new classes of devices like one day I decided to add GPIOs that was relatively easy and it should be very simple to add new drivers for new transceivers. Of course I don't own all the transceivers that are around there but if somebody has a certain transceiver it should be easy for him to add a driver for this transceiver. The system is designed to be extensible from the ground up and it has a complete documentation. So unlike other systems TRX control doesn't need a recompilation of the whole system if you add a new driver. That's because you see it on the right it has this big Lua logo so at the heart of TRX control is a lot of Lua which is a easy to learn and yet very powerful programming language. So that's what it is. An open software design that's extensible and that uses accepted standards like the web sockets like TCP IP IPv6 and programming like that are proven like Lua. So the core is written in C of TRX control and I didn't want to just use C for everything. So and I already integrated a lot of languages into C code in the past and I had a variety of choices which language to use. So I looked at Tickle, Pearl, Python, Java and so on and we had experience with all these languages but we looked carefully at which language to use from this choice and we came to Lua for very good reasons. Lua virtual machine is very very small. It's like 128 kilobytes of binary code and one Lua state uses approximately 64 kilobytes of memory. So even if you're writing a system that runs many many many Lua virtual machines in parallel it will not eat up a lot of money. Memory and if it doesn't need too much memory it doesn't need too much money for the computer and it can run on a Raspberry Pi. So writing those new drivers is done in an easy to master scripting language named Lua and as of today it supports, it's not complete this list, it's favor, more transceivers, it supports more YASL transceivers. I didn't update the list and ICOM and it supports, it will support the QDX from QRP Labs and probably Kenwood transceivers because they use the same protocol as the QDX does. So we're constantly working on adding new drivers. I had some American guy who added driver for some Rix and this is now very easy because these drivers written in Lua, they mostly describe these days what are the operating modes, what is the frequency range of the transceiver, what's the name of the transceiver and the protocol being used for YASL or for ICOM is factored out. So you describe your transceiver and probably that's all you have to do. If it's a new kind of transceiver then you have to write the protocol converter too. But this is technical and you can go have a look at GitHub, there's the code, it's all open source under the MIT license and we're here to help if you get stuck writing a new driver. So what about audio? Would it be nice if I could transport the audio as well so that maybe on my mobile device I can select my transceiver and then listen to the audio and remote control it. This is actually at the moment not part of the software but I think for audio processing we can use Pipewire on Linux and for the streaming maybe Pulse Audio. So I put a question mark there, I'm not so familiar with Pulse Audio but I think this could be a way and this will be some experimentation to be done in the next months. So that's what it is. It's a system to control your devices in a uniform language and maybe let's look at a few implementation details so how it's roughly done. It uses a multi-threaded approach so if there's a lot of things to do it will use all the CPU cores and it has asynchronous execution that means it can do a lot of things in parallel, it supports for the connection-wise IPv4 and more important IPv6 and of course TLS, transport layer security. The core is written in C because writing multi-threaded code correctly is not so easy as one might think. There's a lot of synchronization issues that could arise if you do it wrong so all the thread control and synchronization logic and everything and firing up of the Lua virtual machines and firing up of the various threads that comprise the whole system that's written in C and you can connect over IP and web sockets, I already mentioned that and the protocol is JSON based, JavaScript object notation so all you have to do is to compose a JSON message with a request for example set frequency to 14285000 hertz and send it to TRX control and route it to the transceiver you want. It has been developed for Linux and on Linux but it can be used funny enough on Windows 2 using Windows services for Linux which means it runs on Linux and of course it can be dockerized containerized, it can run in a docker container so you can run it basically on any system you want. It's developed in the open on GitHub and as I mentioned it's under the MIT license. So how does it look when you connect to the system? We have TRX-D which when everything is set up it will listen to connections from a client and when a client connects it will start a thread that's a socket handler that will communicate with the client and whenever it receives a message from the client it will send it to a dispatcher, this patcher looks where he has to route this message to, he will send it to a device controller for a hardware device which will talk to a driver, the blue part that's written in Lua and this driver will talk to the device to do whatever is required. It can also talk to extensions, I have written several extensions for example to query the QRZ database or to see cluster spots in real time so the idea is I see a cluster spot I can tap on this entry on touch screen it will tune my transceiver to this you can look up the call sign so on and then eventually the driver gives back a response in a device specific format which will convert it by the driver to an internal format and the dispatcher will finally send back to the client reply also in JSON format so all you have to understand in a client is JSON and the drivers will convert these uniform requests to the format that the transceiver uses like cut for Yezu or Kenwood or CIV for ICOM and the same thing is done with the return value they are converted from the device specific format into a generalized format and send back as JSON replies for example frequency in this system is always in hertz no matter what the transceiver uses many use like 10 hertz steps but here a frequency is always in hertz it looks very similar in the web socket case when you listen for web socket connections there will be a web socket listener thread and when a client connects for example a browser with JavaScript then it will set up the almost the same the same scenario as it does for plain IP sockets it's just a little bit different the handling of web sockets they have underlying protocol and the framing protocol so they can detect when the other side goes out of business and they can automatically reconnect but otherwise it's the very same and each of these boxes they stand for a thread as I said it's multi-threaded so these things they all run in parallel it's not a this is not a sequence first it takes this then it calls a subroutine then it calls another subroutine but this is all in parallel on the machine so the hard part was to get the synchronization right yeah this is a these are few implementation details when you're want to dig deeper or see what it is there is a website of course the website has flyers we also have flyers at the amateur radio booth and of course there's github with lots of information and the complete source code of course I'm looking for people to support the project I'd be happy if Linux uses tested there's now ready made binary packages for almost any distribution that you can very easily install software to the developer could write drivers or new clients or extensions manufacturers dealers importers they could provide us with hardware they actually already did I got my icon this way and material support or whatever can always be helpful or buy us coffee and so far we've been supported by my company which shelled out a lot of working time and vice equipment and lixnet they supported us with the icon so this is how we developed we have a of course everything in git we develop on alma linux and when we push we push to github but not only we push to an internal git lab server as well to have a continuous integration and deployment pipeline so that we build the packages for all various linux distributions you find all that matters on torx-control.msys.ch there's a matrix room and there is a github project obviously and that's at the end of my presentation but that's not the end of what I wanted to say because I need to synchronize the screens during the development I got in contact with silver's area and during the last I think months I can say not weeks it's months already we just discussed a lot about various things SDR and he's obviously an SDR expert and with marcos too and so I learned a lot about software defined radio and doing this together with marcos and with sylvan so made me think about what could we do with SDR in terms of TRX control and there is a wiki on github where I put together the ideas what we could do so that in the future TRX control could also be an SDR system so this is where everything comes together so we need a few more threads if you're probably not as easy as it is listed here but one thread to gather the samples then each receiver will be a thread and then we could probably even attach whisper to a decoded audio signal to get the transcript of what's being said there in text so this is the future as you see I have plans with it and this means that's it thank you very much
Covert Ground Based Synthetic Aperture RADAR using a WiFi emitter and SDR receiver
I'd like to show you a little bit how I'm using software refund radio, of course, running radio, for developing a covered ground-based synthetic aperture radar using Wi-Fi as a radio frequency source. So just to see what it looks like by the end of a presentation, this was done with some funding, or leftover project funding, so I put the affiliation of the lab. Actually, it's a hobby project, but I had some leftover contract money to develop this thing that I wanted to show with you. So what is ground-based synthetic aperture radar? So let's start with what is the objective of what we want to look at. When you're looking at a radar, you have a remote sensing measurement technique where you want to do some radio frequency detection and ranging. So you would like to see targets. And in the case of GBSAR, it's mostly used for small, minute variation of distances. So in this example here, which I was lucky enough to visit Professor Sato's laboratory in Sendai, and that's one of his setup where he's looking at lens slides. And when you're looking at lens sighting, ground-based synthetic aperture radar, you're using the range information to detect the distance from the SAR measurements and the lateral resolution is given by the spatial diversity of moving your antennas. So as opposed to it's an active measurement technique. So as opposed to passive remote sensing like optical measurements, photogrammetry, optical satellite imagery, you're not sensitive to lighting conditions or day, night, or cloud cover. And it's an active measurement technique. So you will generate the signal that is returned. But unlike laser detection and ranging, you're not sensitive to weather conditions. So radar is all weather conditions. So that's beauty. Now you've got some commercial systems. I just took some of the European ones I'm familiar with, Italian IDF, Dutch Metasensing. I don't claim to be competing with these guys. These guys have 100K units. I'm not going to show you a 5K device that's compete. I see this as an educational project to try to get familiar with the concept of SAR and trying to do this. Well, I wouldn't say legally, but at least not get caught by using Wi-Fi signals. So what are the requirements for radar? Radar, on the one hand, you want to detect a distance. So distance is inverse of bandwidth. So you need a large bandwidth. So you need some wide bandwidth signal, and Wi-Fi is very good for this. Now there is no reason why you would get more bandwidth with higher frequency. But it's a fact that technologically it's easier to get more bandwidth with the higher frequency. And so I moved to 5.8GHz Wi-Fi because you've got 200MHz bandwidth. So that's kind of nice because your range resolution C over 2B is going to be something like submeter. So you can separate by range, two pixels separated by less than a meter. And then also because I want a mechanical setup, I showed you in the introduction we want spatial diversity. So we're going to have some moving stuff. And the higher frequency, the smaller the wavelength, the smaller the wavelength, the smaller the antenna. So it's going to be easier mechanically to move some smaller antenna, hence the increase in higher frequency. And also the rail, along which you're moving to have the spatial diversity, the azimuth resolution is given by the wavelength over length. So if you're higher in higher frequency, your range distance will be smaller, and the rail will be a bit cheaper. So these are reasons for moving to higher frequency. So the SAR measurement tells you that you're doing spatial domain, which is doing time domain. So you're moving the steps, you're moving the antenna. And I'll show you in the next slide that actually azimuth compression isn't for a transform. So it's really your adding phase each time you're moving the antennas. And if you want to match Shannon's sampling theorem, you show that you must have half distance, half wavelength, is the same as half the sampling frequency. And when the transmitter and the receiver are collocated, actually because they're both moving, it's not lambda over 2, but lambda over 4, because you're moving both the transmitter and receiver. So you need a system that allows you to move your setup by quarter-weaving steps. And because I want to have as little sliding contact, all these electrical stuff that's moving, they have poor contact. So I wanted to put everything on the moving part. So everything that is moving is the Wi-Fi dongle as a transmitter, a B210SDR as receiver. But an important story here is you need a dual channel coherent receiver, because you don't know what the Wi-Fi is streaming. It's streaming a broadband signal, but you don't know what it is. For me, it's noise. And so if I'm sending noise, I need to record the reference signal. And on the receiving antenna, I will look at time-delayed copies of this transmitted signal. So that's your basic passive radar measurement. And this is all running on the Raspberry Pi. So the Raspberry Pi at the moment is Raspberry Pi 4. It's running build routes, running new radio, and I'm streaming 0MQ to the processing PC. That's what we showed a few years ago. So actually, this is the final setup. So I took some commercial antennas here. You want it to be a bit directional so that you can have some bigger range. And this is why I'm saying it's not completely legal, because I'm sending the 10 dBm of a Wi-Fi transmitter. But of course, it's an isotropic radiated pattern. And here, I'm focusing on the 20 dBi gain antenna. Let's forget about this. No one's going to notice. And we do the same on the receiving. So you see here, you have the rail, everything that's moving and transmitting and receiving antenna. Raspberry Pi is over here. The B210 is over here. So everything that's moving heads the cables. And then I'm transmitting. Here, I'm transmitting over internet, but it could be over the Wi-Fi communication. The 2.4 GHz, the stream. Now, doing Wi-Fi measurements, actually yesterday, if you walk in the garden just outside here, you're going to see this poster. And actually, I was reading the poster. For those of you reading French, actually, there's a PhD from Brussels using Wi-Fi for what they call it, crowd safety. I call it crowd control, but he's PhD. So he's still optimistic. And of course, using Wi-Fi is MIT is very good at advertising what they're doing. So MIT has been doing full-the-wall Wi-Fi measurement for a long time. So Wi-Fi is not new, but I'm just trying to show you here how to make an educational system. So the principle is we continuously broadcast Wi-Fi. So actually, you could be streaming a very big movie, or you can take Bach's Packet Spammer. This is what I'm doing. So Packet Spammer will just keep on sending packets over time. And you have this non-cooperative source sending signal. And because it's non-cooperative, it might be that sometimes you will look at Packet Spammer. And because you cannot squeeze too many packets in a second, you'll have some gaps. So you just have to detect the gaps, throw these parts away, and collect enough data that you don't have noise. Now, we've just seen the presentation by Sylvain about GPUs. And just going to this, the correlation, when you're doing correlation memory, you're looking at the time-delayed copy of your signal. And you might think he's talking about correlation. Sylvain was talking about convolution. It was a relationship. The relationship between convolution and correlation is just you flip the time in the argument. Convolution is tau minus t. Correlation is t plus tau. And when you flip the time, you take the complex conjugate. So you see that exactly what Sylvain said. You take IFFT of Fourier Transformers Surveillance times the complex conjugate of Fourier Transformers of the reference signal. And the complex conjugate is to go from correlation to convolution. And the problem with this is that if your filter has some ripples on your reference measurement or on the surveillance measurement, if you have some ripples, then you will multiply the ripples because you're multiplying the amplitude. And what's really important in correlating is that you want the phases to subtract because the signal will be square-irons. And if they are coming from the same side, you have zero phase, or even same direction, they have the same phase. So you want to subtract the phase. And actually, instead of doing the analytical formula of multiplying the Fourier Transforms, you can actually take the ratio of Fourier Transform, which is the same by taking the inverse to take the negative phase, but you cancel the amplitude fluctuations. So that's actually what I do at the end of the day. I take the inverse Fourier Transform of a ratio of Fourier Transforms. Now, each Wi-Fi bandwidth is 20 megahertz. And 20 megahertz is on the one hand more that I can stream from my B210 to the Raspberry Pi 4. And secondly, I told you there's 200 megahertz available in Wi-Fi. And we don't want just to be using the 20 megahertz of one channel. And so what we're doing here, if you look at the allocation of frequencies, Wi-Fi is very broad. It starts at 5.4 gigahertz. Actually, you should avoid the 5.4 gigahertz, that's the C-band radar band. It was also called military G-band. So you would like to avoid this kind of frequency. And C-band is also Sentinel-1. We don't want to be jamming Sentinel-1. So we start working above the C-band radar. So we have all these channels here. And what you do is actually you do what is called frequency stacking. So actually, you reprogram your Wi-Fi dongle to jump from one channel to the other. And then you just keep on sweeping. So this was done of Spectralizer. You see here how you're broadcasting each one of these channels. And I can check that indeed this is working. And so for each channel, I reprogram the dongle. I stream the data for 0MQ. I record the data when I know I reprogram the new channel. And after I collected the number of samples I wanted, I reprogram the new. And I keep on looking like this. At the moment, everything, all the FFTs are done offline. Actually, everything I'm showing here is post-processing. I showed you a very fast movie because a full measurement is taking 15 minutes. And as I've had run the movie in real time, my time would be exhausted by the time I finish the introduction slide. So actually, a full measurement is taking 15 minutes and processing the full data is about an hour because I'm not using GPU. Here, this is all CPU post-processing. But one thing that I would love to see we've seen very fancy GPUs here. I just got two Raspberry Pi 5. And I'm told that we would be documented how to use a GPU of Raspberry Pi 5 to do some sort of processing. So that would be really beautiful to do at the moment. It's beyond what I can do. So this is actually experimental. This is what I do. Each one of the color was a spectrum collected by the B210. And so you see my frequency stacking so that allows me to spend the 200 megahertz of Wi-Fi. Be careful that there are some gaps. I think it's these guys here. So there are some gaps. So when you do the ratio, just make sure that you not a number of the values that you don't divide by 0. It's going to be unhappy about the calculation. Just a little side note. When I bought this rail, usually I tried to do some hack where I tried to find what's in the lab and I tried to assemble. And for this time, I had a bit of money left. So I bought a real rail. And I learned, I discovered, that all these industrial controls so that programmable logic controller are running on 24 volts. That is very standard. And your Raspberry Pi, of course, is a 3.3 volt GPIO. So you will need to have some voltage converter. That's your legacy, ULN2803, open collector dialectan transistor that will convert the input 3.3 volt into 24 volts. And the other thing that's kind of funny for us is in industrial control, they don't want you to do anything you want with the rail. Because if you misbehave, then your rail might go off. So actually, you're not allowed to program the position. You have to pre-program a set of values where your rail can go. And then you say, I want you to go to position 1, 2, 3, 4, and so on. This, of course, is proprietary software from the rail manufacturer. But it does run on wine. So it's not open source, but you can do it. So this is what it looks like on the moving part. You've got the Raspberry Pi with 24 volt controller over here. OK, having said that, what you collect, you collect for each antenna position all the spectra in the frequency domain. Once you've got on the reference channel and on the surveillance channel all these antenna positions and for each frequency, so that's a 2D matrix, you cross correlate each one of these. So you end up having one 2D matrix because you've correlated these two guys. You've got the antenna position on the x-axis. You've got the time domain because you inverse for transforming the y-axis. So this is before azimuth compression. Then you do your azimuth compression by doing the FFT. So this is FFT in this direction. Then you take the FFT in this direction to do azimuth compression. And then the part that I'm not completely used to here, you get sine of theta. You want to have range theta position. And my colleague, Weike Feng, Air Force Engineering University in Shanghai, gave me the algorithm for reprojecting the sine theta range to the range azimuth position. And once you get these maps, well the really beautiful thing is there is no degree of freedom. If you know how you move the antenna and you know the frequency step you use, you cannot cheat with the results. You've got an x and y position that is fully determined by your data acquisition condition. So here is one example from our lab. So this is the rail, this is the antenna. You've got here this round circular building which is over here. You've got the portal which is over here. And you've got the university housing which is over here. So there is no degree of freedom other than positioning the radar at the focal point. And this you know, I know where I'm located. And you have only degree of freedom is azimuth. You can tune the picture so that it fits. In this case, I threshold the pack scatter to make transparent where there's no output. So this is on the other side. So this were closed range, this is further away. So we're looking at the opposite side. You've got this building that is over here. You've got this container which is over here at near range. Again, no degree of freedom. And there is this reflection. And you should tell me, how can you get a reflection when it's a field over here? Well actually that was taken this summer when Google Maps had not yet updated their Google image because this building was indeed built since then. So this is one example where we actually get reflections up to 500 meters. This building here is giving us something. This range here is 500 meters. So it's working, all right, let's say well, at least you can see things with it. Then you might ask, is this reproducible? So last weekend I said, OK, like open source project, you put the GitHub, you say, trust me, it's working. And six months afterwards it's all gonna be crushed because all the libraries change and nothing's working anymore. So last weekend I said, let's take everything out and let's check if it's working. So it is working again. So here you've got the XY map which I project over Google Maps. And the nice thing is Google Maps updated their database. So now the hotel is over here. And here you've got the reflection far away. And you've got something here. So you might say, wow, I get even something even further than 500 meters. And it's reproducible. I took a second image over here and you get twice the same image. Don't be fooled. This here is not if you change the big orientation of the radar and you look a bit to the right, you think the reflection is still over here. This is your ambiguity function. The ambiguity function is you take the auto correlation. So you check, is there some self-similarity? And obviously, OFDM Wi-Fi does have some self-similarity. And this is a repeated pattern every 1.5 microseconds or something like this. I don't have the details. And so be very careful when you're using non-dedicated radar signal to check the ambiguity function because they might create their own repetition, which are not targets, but just because the signal does have some structure Wi-Fi. Looks like noise except when they repeat the OFDM error or something like this. But still you see that even if I try to go back, you see that this guy, for example, is a real target. Because if I move a radar azimuth, you do see the radar, the target at the same location. So I'm not completely lying here. And so finally, I was thinking, why is this reflection so powerful? How come there is one building at 500 meters that is sending this echo? So I went to see. I walked around and I took this picture. And actually what you see here, they've got the windows. But as a shade, as a sun shade, they put something that looks very much like a corner reflector. If you remember what a corner reflector is, it's a free right angled part. And actually, modern buildings, architects seem to love corner reflector. You look at modern architecture, you've got free right angled corners everywhere. So that's very good for radars. So this is actually why this building in particular is returning such a good signal. Finally, I told you that the range resolution is only a half meter, 75 centimeter here with 200 megahertz bandwidth. And we want to detect length slides with sub-centimeter displacement. So the classical method is you do interferometric measurement in SAR. So in INSAR, you don't only look at the magnitude of a return signal, but only also the phase. And the phase is uncertain because you've got two pi phase rotations. So you don't know how far the length slide is, but this you don't care because you got it from the range resolution. And by looking at the phase, you can actually get your distance variation, which is half the wavelength times the phase rotation over two pi. So the only challenge is because it's a radar, it's half wavelength because you've got a two-way trip. And so basically, I'm claiming here, so what I did is I took all the strong reflections. The ping here is misleading. This is not one. This is not a number. And I took the average and the standard deviation of all these guys. And you see that the mean value is in 1 millimeter. So you do get a millimeter on the mean value with 1.5 millimeter standard deviation. So I claim this to be 0 plus or minus 1.5 millimeter, which is probably not state-of-the-art, but that's just educational. So I'm still almost pleased that it works. And if you had seen some of my previous presentation, if you take a corner reflector, I try to do it. And it fails here. If you move a corner reflector by step of 5 millimeter, you do see it. So the phase analysis is working as well. So to conclude this presentation, I wanted to share with you how you can use, I think, affordable hardware for running a synthetic ground-based synthetic aperture radar, especially as an educational tool using commercial off-the-shelf Wi-Fi emitter, in this case, as a cooperative target because I'm broadcasting the signal. And I think it's a great opportunity to try to get started with this digital signal processing. Now, just to give you an idea of the budget, because I told you I had some leftover budget from a former contract that I had to spend by the end of the year. So I bought all this hardware. So the antennas are 1,000 euro transmitter receiver. You've got, actually, no, not two times. So stupid me. It's a 1 transmitter, 1 receiver, of course. No, no, no, sorry. A pair of transmitter receiver and the accessories for handling the antennas. You've got the rail, which is by far the most expensive part. But you do need the accuracy of the rail, the repetibility of the rail will give you your ability to do INSAR. If you've got a shitty setup where you've got an uncertainty of 5 millimeter on the position, well, 5 millimeter with respect to a half wavelength of 2.4 centimeter is significant, and this will blur your image. So that's where I wanted to spend a bit of money to have a good quality. These guys claim to have 100 micron reproducibility, so the sub 10s of millimeter, which I think is really good. And it's kind of easy to use. You've got your Wi-Fi. You've got the passive RF. And you've got the Raspberry Pi. These are all easy to find. And the B210, actually, I had leftovers. I think I have a dozen B210 in the lab. So I just took one of the leftover B210. And as I was doing this talk, I wanted to share with you the fact that everyone could do it. And at the end, we have a 7,000 euro project. And I'm not sure everyone wants a whole B210 7,000 euro. And you do see that the most expensive part here is the B210. So I checked, and I do have quotations from the beginning of last year, last, last year, that says that the B210 was 1,400 euros. In January 2024, it's now 2,100 euros. So I'm sorry for an eye, but I'm not going to advertise the B210 because this is really too much of a price hike. You do have a Pluto Plus with a tool channel, which I can get on AliExpress for 300 euros. And it's the same 80, 90, 360 something. It's an internet output. And when you've got all these moving parts, if you ever did some USB on moving parts, USB is the worst connector you want on the moving part. Internet, at least, you plug it in, and it stays there. So yeah, unfortunately, I wanted to demonstrate this for this presentation, and my Pluto Plus is still in the mail. So I cannot demonstrate that the noise level is the same, that the communication capability is the same, that it runs flawlessly on the Raspberry Pi. This may be for the next time. But yeah, you will save 800 euros on this budget, and it's a 5,000 euro project that I'm showing you here. So you can find all the repository processing on the GitHub. Hopefully, I documented everything. If you wish to reproduce and you miss information, feel free to reach me. I'll be happy to complement any misinformation. Be aware that if you want to use different hardware, the running bastion's packet spammer does require what is called promiscuous mode, and not all chipsets support promiscuous mode. Furthermore, be aware that the chipset of this particular board is not in mainline Linux kernel, so you will need to recompile the kernel. And if you're doing cross compilation for big rules, you need to know how to cross compile your kernel module. And finally, this was all done with your taxpayers' money, so public money, public code. Thank you for supporting our research and my colleagues from the Mechanical Workshop who did a very good job in assembling these antennas. And with this, I thank you for your attention. And I even have one and a half minutes for questions. I guess if you have to tune the gargantuan call to the Wi-Fi's packet spammer, then the gargantuan call is to Wi-Fi's container and to see more radio silence after the packet. The question is, how do I tune the silence in packet spammer? And actually, I did the exact opposite. I wanted to have the packets as close to each other as possible, so I have as little gap as possible. And as I was putting too small a value, if you ask packet spammer to send a new packet as the previous one is still being broadcast, then it will send back an error message. And the Wi-Fi don't really become very unhappy. So I was conservative and I did put excess delay, not that I wanted genuine Wi-Fi user to still have their connection. This I didn't really care about, but I didn't want my Wi-Fi don't go to crash. And so I put some additional time delay, not too much, so by time I'm not wasting too much time. The reason why this measurement is taking 15 minutes is really to collect. I'm taking something like 100,000 samples per position, per spectrum. And really the collection of the data and getting rid of the silence is the reason why it takes so much time. Just if you look at commercial GBSAR, they promote one second measurement duration. And the reason also, I didn't mention power consumption, the GBSAR used to be installed in remote locations. And of course, the longer it takes, the more power you draw. So this device I make a power budget is 25 watts. So whether you have 25 watts for 15 minutes or 25 watts for one second is going to completely change the life expectancy of your battery. So if I had to work on something now, it would really make it faster so that it can run on battery or solar panel and that the energy consumption of each measurement is much lower. So the initial question of putting gaps in packet spammer is just not to crash the Wi-Fi dongle. Have you considered using rails from 3D printers because they usually are cheaper and have still very nice decisions? If I can assemble which part in? From 3D printers like the procedure like rails too, which can look like very precise movements and space speed. So the question is what part can be made out of 3D printers? The problem here, I did not put the weight estimate, but I think the two antennas plus the hardware setup weight something like 1.2, 1.5 kilos. And that's really the challenge in having a nice mechanical setup. You do see that there is a bit of hardware there. And so when you want to move these stable and reproducibly, I went for a fancy. Also, I wanted to do it fast because my previous setup was a screw driven rail. And it would take like 10, 15 seconds to go from one position to another. So just the time to move would be added up at something like five, six minutes in the measurement. So this guy can just move in a fraction of a second from one position to another. So there's many solutions that you could go for. There's also these photographers. They want to do time lapses where they move a camera. Yeah, I didn't trust them. So I went for the more expensive. But yes, there's many solutions that you could go for to get a better solution. So thank you, so much.
Design of a follow-up QO-100 payload -
Good morning, all. My name is Frank. I'm working at the satellite communications group of ESA, the European Space Agency, in the technical centre. And we have a presentation here. It's not extremely technical. It's only, it would like to explain a few initiatives we are embarking on. And that is in the area of maybe future amateur satellite payloads which are hosted on satellites for experiments. It's maybe not so known, but let's say we work quite a lot in commercial communications, satellite communications with companies in Europe. And we finance and co-finance various projects. But it should not be forgotten, we think, that many of the innovations that came to the world of satellite communications are actually coming very often from the amateur satellite communication world. A lot of work has been done and that has now spined off in commercial applications. And we gave here a few examples of things that where the satellite communication was always first. They have been flying the CMOS chips, for example, the first time in the world on their satellites. They're the ones who made maybe the first inter-satellite links. And there have been also companies, for example, SSTL, that started with building a few CubeSubs in Surrey University and slowly became basically a larger satellite company. So that is all heritage from the amateur satellite world. Also what is quite interesting is that the amateur satellite world flew the first GPS receiver at very high altitudes, even up to high elliptical orbit. So these are all, I think, quite nice achievements of the amateur world. At the ISA site, we would like to maybe support initiative that at least thinks of future amateur satellite payloads on future satellites. And we explain a few things on that. And that is specifically also related to the payload which is currently on a geostationary satellite. It's called Q100. Maybe not everybody is familiar with that, but it's a very nice payload which for the first time is hosted on a geostationary satellite. So it's always above you. You can find there excellent videos explaining how to handle that satellite and to how to build up with low cost such communication over those satellites. Let me explain just to get a quick idea. This is the footprint of that payload which is on a geostationary satellite. So it travels with the same speed at the Earth, so let's say virtually it's always above you. And this is a very large commercial satellite, hundreds of millions, but on that there is a small payload with hand-handle amateur communication in S-band and the lower KU band. And the beauty, I think, of this is also that this is enabled because it is the basically reuse of existing costs. Let's see, existing 2.4 gigahertz amplifiers, the modification of low noise blocks that you use for normal television for, let's say, maybe 10 and even less zero, and you modify them and then you can use this satellite. So with relatively low cost you can communicate with the satellite. And this satellite payload, and there is a lot of actually, in particular, German amateur radio and also UK amateur enthusiasts who have been instrumental in getting, let's say, community working over this payload on the ASEALSAT 2, which is the name of the commercial satellite. This was the first time that radio amateur satellites were also able to have a, let's say, more continuous communication between each other. And you can see here the footprint. Let's take the green and the red line there, that is linked to how, what kind of elevation you need for the antenna. But this goes from Brazil to Indonesia. So you can have, say, in a single hop, basically, a user from Brazil could communicate with a user in Indonesia. So there is enormous potential there too for all kind of experiments. And this has also led to more broadcasting using standards like DVPS2, which is very active at the moment in Q100, where, let's say, new technologies went into the amateur domain and the amateur domain is now making very nice open source implementations of mini tuner and all various DVPS2 equipment. That is something we would like to support. But how do we support that? Now, a longer time ago, these processes do not go that fast, unfortunately. We wrote a letter to IARU, the International Amateur Radio Union, from where, I think, Sylvain is one of the bosses, I would say, divided in various regions of the world. We wrote there, or we say, stimulated the letter, say, an IARU basically asked, Isa, could you not help? We need to think at least maybe of a follow-up of Q100. We asked that, as you probably know, we are publicly funded, so the various countries want to have their say. Everybody did their say, say, that's okay. Here, you have some funding. So we have funding to start that process, and that funding is meant to collect requirements and also make a few prototypes, maybe. Basically, it will not be enough basically to host a satellite or a payload on a satellite. That will be not possible. We have to look for other funding mechanisms for that later, but we'll come to that later. So what we will be doing is to identify requirements of all the people in Europe and Canada. Canada is one of our member states to identify what would be good requirements to fulfill for a next geostationary payload. And back to the orbit, geostationary or other orbits, we have been heard that some people would be quite interested also to explore maybe payload in medium-Earth orbit. You can imagine that also then you have a longer contact time, and it still has a bit like a global attractiveness. We are considering that because there might be various institutional initiatives starting soon where there could be hosting opportunities for small payloads on medium-Earth orbit. So we will be looking at various further and amateur community that process will start very soon. We already requested a few inputs, and we still have to process all that, and we will then talk to the various satellite operators to see how we could accommodate then a payload and how soon to get the funding for that. So the first idea we heard already that is a few people will be very interested in, let's say, keep it simple. Actually the payload which is currently on as-heel-sub is, it is fantastic that it's there, but functionally you could say it's rather simple. It's an analog transponder, what you go up in whatever modulation you use, that comes down. And many people like that also because that means a lot of experience maybe at the modulation level, more deep down RF level. There is also a whole community that comes more from the, yeah, that has been raised with SDRs, let's say, starting more at that level. There is a whole community that is more working at the, let's say, the IP level and even higher in the amateur world. And we have to find a bit the balance from maybe simple payload designs to maybe something which is really more complicated. And you can imagine also in the amateur world there are communities that are going up and up and up in the frequency range. And in the amateur community we have 24 gigahertz, we have up to 77 gigahertz, which we all could use for satellite. But you could imagine that going higher up in frequency also means that possibly have narrower beams. And for the other hand you would also like maybe that the satellite community is served by, let's say, more in a larger scale. If we have one very nice spot beam in e-band, let's say in 76, 77 gigahertz, yeah, that will serve probably one country that is not so, I think, inclusive, I would say. So there's a few balances there to be made. But it would be quite nice in some way to come to a combination that already some people have suggested where we have an analog transponder. And actually what you would like maybe is to have in geo-arbit the basically the ultimate Linux brick with everything around it. And then everything can do, everybody can do what he wants. The only also here the disadvantage is again that if we put something like this on board, then we get a certain degree of centralizing things. You need a cis-atmine for this satellite basically. And that is not always to the likening also of the amateur community. One likes sometimes a certain degree of chaos, let's say, and anarchism and so on. That it should not be too regulated. So also that is a bit of a balance to be made. But we have various IDs to put it also a bit more in the 5G area where maybe CUDU, certain splits in the whole 5G architectures could be partly put on board. Because many, many people, even in the amateur world, start to look also at various communications based on 5G, non-NTN, non-trivial network. So very straight-offs to be made. We listed here a few of those and we will now start a larger consultation on all those topics. Back again to let's pick a few, let's say the attractiveness of future user terminals. Like the example here, previously of the ground-based SAAG, 6,000 euros, in this case the opportunity of some taxpayers' money. What is acceptable later on for radio amateur? If we go to the 77 GHz, maybe we can use automotive radar hacks and so on. That needs all to come together. So we would like to request later on in a more structured way input from the amateur community, but also taking into account all these factors. Because it's no sense proposing something where not a lot of amateurs can benefit from. So that we are currently starting and we will show you a little bit what we will be doing in the next month. I will not go into detail of the planning, but what we now are going to do, we are talking already a very small group to get a bit of a sense what we should do and also in particular what ESA should not do. Because some things are far better left to the amateur community. We prepare a bit like a consultation, we talk to the amateur community and we also talk to a number of, let's say, people who would likely build such a payload. It is however fun that would be, it is not so likely that the amateur community would build such a payload themselves. A geostationary operator with a 300 million satellite would like to know what he hosts as the few kilos extra. And that is not so likely that he will accept that that is built by amateurs. How good they are also with all respect, he will not accept it and his insurance company will not accept it. And what we would organize in May is a day in Aztec also with support from our technical people to discuss a few options. Start prototyping, we have the funding for that to prototype a bit what some people call like a flat shot. So it's the model of the satellite but you basically put it on a table. And we would like actually to have a few IDs ready in September. September there is always a very large satellite conference where all the satellite operators are and we are making there the appointments. And we will also pitch this to satellite operators as a bit like, let's say as a good thing. Many people complain in the satellite world about the lack of people that understand RF. That is a real lack. A lot of people, there's a lot of programmatic but there's not so many people that can really understand RF. And also I think satellite operators could take a bit of responsibility to and also the industry maybe to stimulate that the young people start to understand satellite communications. That is at least also one of our objectives to get more people enthusiastic about satellite communications. So we hope in May we will advertise that lately. May or June we will see a little bit on availability to organize a day to go through a number of payload designs. And we're also trying to get some travel reimbursements and so on in place so that people could come to us. And hopefully in September we discuss a few things with satellite operators and even better. We hope that maybe the outcome of such a discussion could be discussed at a next, force them hopefully next year. I think that's it from our side. You'll hear more from us. And as said, it's not such a technical talk. It's a process we're starting. All your technical inputs from the AMSATS, the amateur satellite organizations in the various countries which we are approaching already but also individually would be highly appreciated. That's it. Thank you very much. Thank you Frank. I have one or two questions. Please. Is there any for phase airing or beef steering or is it way too expensive for such channels? Yeah, that would be indeed very nice. Of course, if we look at the, let's take first phase array on board. Yeah, of course that will then be highly dependent on frequency range. Let us assume it would be an event, you know, 77 gigahertz amateur phase array. Yeah, that would be a fairly expensive thing. But I think we are also there to see where maybe certain developments could spin off further in industrial developments. So this would be a good, I only see of course again, but also a phase array. It needs a type of management then of that beam. That comes with that. And of course, it's still a challenge in development, of course, on the ground. Maybe if we take now the scenario of a medium Earth orbit where you would need pointing, then I can see there that also from the amateur community, the various YouTube videos that appear with the educational Pluto beam steering and things like that. There I see a lot of opportunities to do beam forming, to educate people on the essentials of beam farming with maybe an existing beacon that comes from the MEO. I think that would be excellent to do that. Yeah. Oh, please. Supporting Canada. Sorry, X support what? Support Canada and obviously those elements in there. Yeah. Okay. The question was about the geographical thing. Canada is in. It's also part of our, let's say, ISA member status. It's called ISA funds, ISA, ISA of Canada funds, ISA. So we are interested to include the Canadian footprint. We have already received a bit of input on that, but we have to see because you can imagine that the orbital location of such a due station is not always. Yeah, we are not the one picking the orbital location. Therefore, in that respect, a medium Earth orbit hosting would of course be preferential, but again, a trade off. We are not. Yeah, we can't decide ourselves. Sorry. Exactly. Yeah. We are having in ISA, there are a number of geostationary projects ongoing. We are, let's say, trying to see whether we could with some leverage, maybe to host something in the future there. Please. So one would be if you're talking about payload providers that are not amateur radio. Are you talking about the universities or only commercial partners? And then is there like a project in that is not like an ML class, ISA mission also like CubeSats that you would consider for such a like CubeSats, not from the amateur radio site, but let's say I think more commercially from a university where you could also see the situation. Yeah, on the payload provider, the question is whether the payload provider could maybe also be from university. Yeah, now I must say we have seen universities providing payloads to various mission, not always to commercial missions. And that is, I think, where the the satellite operate will always have, let's say, the last word, the last say, because that links them to the insurance and things like that. So unclear at the moment, I would say. Then secondly, whether a payload could also be hosted on more educational missions and so on. I think that that is an option. The only thing is that I think there's quite a lot of amateur payloads hosted in various Oscars and so there's nothing new on low Earth orbit satellites and CubeSats and so on. That the essence would be to either to do something new in medium Earth orbit, but also advancing a bit the payload technology and what you can do with it. And there there is a bit that would be the idea. Perfect. Perfect. The proba missions will be also interesting for this kind of applications. The proba platform itself, indeed. Yeah, I do not know whether the current proba, one of the, let's say, satellite that is used in these several various scientific missions also, whether some of the orbits are always, let's say, appropriate. That is, of course, to be seen from a platform point of view. I need to see no problem. Why not? Why not use that? Yeah. If there are normal questions. Thanks again. Thank you.
Maia SDR: an open-source FPGA-based project for AD936x+Zynq radios
So next speaker is Daniel Estes-Vest, did I get that correct? Yeah. Okay. And he will talk to us about Maya Estee or what, please. Thank you. Hello. I'm sorry for my voice. Hopefully I will manage to make it to the end of the talk. I think many of you have already seen this video or a similar demo of Maya's DR, but in any case I'm going to play it and talk over it to introduce what Maya's DR is. So this is an open source project with the goal of trying to get more people doing FPGA development for software defined radio. And at the moment what the project has is an alternative firmware for the Adam Pluto similar Zinc-based devices where what you can do is to use the full 61 MHz of bandwidth which is available in the radio frequency chip. And on the FPGA you can perform an FFT transform to computer waterfall. That waterfall is then showed on a web-based interface. So this is a demo with a cell phone. You can use that on your PC, on your mobile device, whatever you connect with USB. And then it just works. So the... There is no perfect audio. Okay, thank you. The FPGA design is coded in Amaranth which is a Python-based HDL and all the software is done in Rust. I introduced Maya's DR in 3D's Huffin in the software defined radio academy back in the summer. What is new since the summer, there is now support for all the devices, the Pluto Plus and the AntsDRs from Microphase and the nice thing about these AntsDRs is that rather than having the smaller Zinc-70-20, 70-10 they have the larger Zinc-70-20. So that means that both the regular IIO stuff from analog devices and Maya's DR can fit together in one of these AntsDRs and you can use them both at the same time with HDR Plus Plus or GQRX or radio as well as with the web interface of Maya's DR, looking at the same signal. There is also a new feature which is Spectrum-pictured mode. It's an alternative to the previous mode which is average power. This is quite useful to detect burst short signals such as ADS-B or Bluetooth and the way this is implemented is there is actually a trick where you can basically reuse the same hardware on the FPGA either as a power integrator to compute average power or as a power comparator which is the thing you need to do to pick detection. And of course many backfixes, dependency updates and keeping up with the predict running. This talk is going to be quite different to the talk I gave at SDR Academy. It's going to be much more technical. The other talk was high level project based. In this I will show the key technical aspects of the project and my goal with this talk is if anyone sees something here and later on he tells me hey this looks quite interesting I would like to learn more about it or I know more about this stuff. Here is some additional information you might find useful or this component could be useful in my project. I would think this talk has been a huge success. The topics I will cover is the FFT, the DMA and the waterfall which is rendered in WebGL2 in the web browser and this is pretty much what you need for the demo I have shown because you do have a waterfall display you need to do FFTs, you need to send the data out of the FGA and then you need to use the GPU to render the image. The FFT is implemented in Amaranth as I said and you can even think of Amaranth in this case as a Python very large generator and this idea is nothing new. There are other open source FPGA implementations which use Python as a HDL code generator because oftentimes in an FFT implementation you want to have flexibility you want to tweak the pipeline stages or the bit widths or the structure of the FFT so having some way to generate the code or in this case expressing it directly in Python is quite useful. The focus is on good performance and low resource usage because the FPGA on the Pluto is not so large but we still want to get 60 MHz of bandwidth with good results. The architecture which is used is the single delay feedback decimation in frequency architecture that is the most common architecture of an FFT done in hardware as in an FPGA and I will show you some references and some diagrams if you ever want to learn how to implement an FFT on hardware or to do it yourself, I think these are quite good references they are the ones I used to learn how to do this. There is the option to do in RADX2, RADX4 and something which is called RADX2 to the power 2 and I implemented these three options because they are the most commonly used and I wanted to do a trade-off of hardware resource usage and in the end RADX2 to the power 2 is the one which gives best results so that's the one which is the default but you could experiment and even individually select different RADXs on different pipeline stages because this is all Python classes so you can mix and match and construct your custom FFT if you need to. There is something on the FFT which is the Tweedle factor multiplication if you remember the FFT formula there are these exponential functions so that is often called Tweedle factors and you need to do a complex multiplication because everything is complex numbers here so for that you can use 3DSPs the DSP is the multiplier on the Siling's FPGAs and there is a trick to write the product of two complex numbers which usually has four multiplications as just three multiplications and some additions so you can use 3DSPs or a single DSP to save on FPGAs resources and the trick is the single DSP is running three times as fast as the rest of the logic so if you are doing 60 MHz then this is 6 times 3 I think is 180 something or whatever so that's how you manage to do your three multiplications per sample with just one multiplier you can configure the bit widths you use throughout the FFT pipeline and the truncation schedule because when you have an FFT usually you are summing up a lot of samples so if you keep growing your sample size in order not to truncate any of the least significant bits things get out of hand pretty quickly so you can select where you want to truncate those bits there's an optional window multiplication because if you are using these for spectrum display applications you not only do just the FFT you do a windowed FFT you use something like the black man window or whatever your humming window to smooth out the FFT have a nice roll off etc so there is this option to do it inside the FFT core since all of these is python the FFT is implemented by several python classes each of the python class has its own model which is a bit exact calculation of the same math that the FPGA is going to do and you can chain those classes to have the full bit exact model in python of the FFT and you can run that in simulations or whatever or compare it against non-pice FFT in continuous integration to make sure that your FFT is working this is also compared with simulations of the amount of code to make sure that the bit exact model is actually bit exact with the FPGA implementation this is all customizable so you have parameters in the constructor of the classes but for my SDR it's a 4k point FFT black man harris window 12-bit input because that's what we get from the ARF chip and a 10-bit output that's most convenient a thing if you don't want a huge output and we run with 62.5 MHz clock just to have enough clock speed to process the highest sample rate which is 61.44 the resource usage is quite small only 6 dsp is one of those is for the window multiplication and 9.5 B ramps which is quite okay for the kind of storage that needs to go in show this in a minute so this is how the radix 2sqt FFT architecture looks like just let me briefly talk you through this so this is a 16 point FFT a radix 4 architecture would split this into 4 FFTs so you can see one here one there one there one there this is the divide and conquer approach on which FFT is based but the problem with radix 4 is that you are performing an FFT of size 4 so you need to add four samples together and you also have this plus j minus j minus one so that's not so convenient for the FPGA so what we do is this FFT of size 4 we divide it into two again using FFTs of size 2 so you can see this is an FFT of size 2 this is not a FFT of size 2 and here there are again two FFTs of size 2 now they are intertwined and the advantage by doing this is that the butterfly which is what this is called now only needs to add two samples together each time so that is less logic usage and compared to radix 2 which is the other alternative you get twice as less stages and what happens with the stages is each time you have a stage you need to multiply for by these complex exponentials by the two factors so if you get half the number of stages that means less multiplication so less DSPs and this is how it looks like once you draw the broad diagram because if you come back to the drawing the first FFT operates the first sample with the sample in the middle and so on and so forth so of course samples arrive in order what you need to do is to store your samples in a shift register so that when the sample in the middle comes in the first sample is already at the back here so that's basically how it looks like the DMA is the part that is used to copy the data from the FPGA to the DDR memory on the zinc and there are two use cases one is for the waterfall we want to copy one FFT at a time or rather one waterfall line at a time and the other one is the IQ recorder so another theater of my SDR is you can record up to 400 megabytes of IQ data on the DDR of the Pluto this is a continuous recording no samples are dropped and then when the recording is finished you can copy it over to your PC or mobile device using the USB 2 of course it's going to be to take more time than to record because USB doesn't have enough bandwidth but still you can do it so there are two flavors of this DMA one is called Bram write and that copies the contents of a Bram this is used for the waterfall so the waterfall line is in a Bram and it copies it to a buffer in a ring in the DDR memory the other one is called stream write this reads data from an axis stream like interface and it writes all the data into a buffer this is what it's used for the IQ recorder the tricks here to get very low resource usage is to implement an axis 3 manager interface so if you work with silence IP everything is axis 4 but the fun thing about it is the silicon on the system on chip is axis 3 so when you connect silence IP with the system on chip there's a silence protocol converter in the middle which is using some logic to do the conversion and the fun thing is axis 4 and axis 3 are quite similar one of the main differences is axis 4 allows for longer barsts so you have this protocol converter that will split the barst for you but even so if you are not using longer barsts then this protocol converter is basically doing nothing so rather than going with axis 4 we directly do axis 3 and comply with the shorter barsts and we can interface directly into the silicon the implementation is the most stupid thing ever the almost no control is required by software is just like pressing a button go record for the IQ recorder and for the waterfall it just runs on its own and it sends an interrupt to the CPU every time a new waterfall line has become bit to the ring buffer addresses and everything is hard coded at synthesis time so you can change it but you need to rebuild the FAA PGA image the upside we get is the brusseless usage is extremely low as you can see it's rather busy it's the interconnection of the zinc taken from Xilinx manual so let me walk you through even though it's fairly impossible to breathe here are the two CPU cores they are A9 ARM processors each of them has their own L1 cache the L2 cache is here and is shared by both CPUs in the middle is something called the SKU which is used to ensure coherence between all these three caches so if one processor writes to the cache the SKU will evict the line from the cache of the other CPU core so that they have a consistent view of the state of the system and then on the top right on the far right is the DDR controller so DDR is here basically we go from the CPU L1 cache coherence here L2 cache and finally DDR there are two ways of accessing the DDR from the FPGA one of the ways is through this port actually these are the two CPUs sorry this is the FPGA so you can go through what is called the ACP port which is the coherent port it will play the same role as another CPU so you get coherence for free but there are some downsides or you can go through any of these four ports which are the high performance ports they write pretty much directly on DDR but then what happens is you have changed something in DDR but the caches don't know about it so the software must manage your caches manually summing up the ACP port is good because it is fully coherent so hard work does the job for you the bad thing is there is only one such port and if we are writing continuously on DDR we are evicting cache lines which are actively used by the software because the eviction policy is random so we are basically disturbing the software usage of the cache it is not so good for performance on the other hand HP ports we have four of them but it doesn't disturb the L2 cache so it is completely independent of the software that is running but we need to manage coherence manually there are two ways of managing coherence manually one is not to do it which means mark the memory as uncacheable and that is simple but it is very very slow especially reads because for writes there is something which is called write combining that you can enable and it is still uncacheable but writes go in chunks so that is more efficient but reads even if you use neon instructions to do a mem copy out of these memory it is quite slow the other alternative is to mark these memory as regular DDR cacheable memory you need to manage cache invalidation manually so the good thing is you have fast accesses very fast like your regular DDR the bad thing is complex it gets really complex and really ugly because of one very specific thing which is here the Linux kernel has some nice DMA framework where you can manage caches for DMAs and all the things you need to do but we cannot really use this in this situation because we want to map a whole 400 megabyte buffer this is where our IQ recording sits in the DDR but this is a 32 bit platform so in 32 bit platforms you only have 4 gigabytes of virtual address memory and usually on a Linux system this is split as 3 gigabytes for the user space 1 gigabyte for the kernel and the kernel is already using many mappings in this 1 gigabyte space for lots of stuff so if you ask the kernel can you map for me this 400 megabyte DMA buffer it will say I am out of virtual memory I cannot do it within my 1 gigabyte address space so we cannot really map these DMA buffers into kernel space we need to map them into user space that's fine because we want to manage the data from user space but by doing so we cannot use the kernel framework to do cache invalidation for us we need to do it manually and the downside is the functions we need to call are basically private functions in the kernel they are not exported to outer 2 modules or any kernel modules so basically we need to patch the kernel to export those functions and be able to call them from our outer 3 kernel module and this is quite specific about this particular L2 cache and the L1 cache on the ARM CPUs this is basically what I was saying there's a Linux kernel module which manages the IQ recording and waterfall DMA buffers and takes care of doing cache invalidation when the memory is mapped by the user space so the user space always sees the fresh copy that the FPGA has written now let's talk a little bit about the WebGL waterfall this is coded in Rust using WebAssembly and WebGL so basically you are doing Rust but you are calling on a lot of JavaScript functionality which is present on the web browser to do all the WebGL rendering for you it uses a custom render entry and the waterfall data is stored as a 2D texture on the GPU memory this uses the same kind of doubly mapped buffer that we are using in radio for our buffers on the blocks because we want to have this nice scrolling waterfall which looks like a continuous thing it's not continuous it's a trick just 2 copies of the image are pasted onto the same rectangle and it can jump and you don't notice the difference because it mimics it's a periodic thing the fragment shader uses some GL function which is called texture and that performs a texture lookup so the 2D texture is actually power versus frequency and time and then you have your column map and you have those sliders to set the maximum power, minimum power, the column map you want etc so all of that is updated in real time for all the waterfall when you change the settings because it's the GPU for each render frame doing the full lookup on a 1D texture map usually when you are updating the waterfall you are doing so one line at a time because if you are rendering 30 frames per second you are only updating either one or no waterfall lines every 30th of a second so rather than copying all the 2D texture to the GPU each time there is something which is called text sub-image which lets you update a rectangle inside your 2D texture so that's the trick that it's used and then there are your usual frequency scales so you have text to mark the frequency axis there is something called lines draw mode to draw those and something which gets a lot of newcomers to WebGL I was a newcomer when doing this is now we are going to render some text but the problem is the GPU doesn't really have any text rendering functionalities and WebGL2 is rather low level so you can render text to show text what you need to do is to pre-render the text on some textures and then use the GPU to show those textures luckily we are running on the web browser so the web browser knows how to render text to something which is called 2D canvas so we use that we have the web browser write text outside of the screen and use that texture on the GPU that's the trick and in the interest of time I think I'm basically going to show a demo I had here so there is this web page which is a demo of the MySDR waterfall if you go to MySDR.org you can just open it and it should work in any recently modern web browser and you can see here this is recorded data this is actually a JPEG image just for the show but you have all the functionality of the waterfall you can scroll you can see signals and it keeps going and I think I have a couple more minutes so I can walk you through the code very very quickly this is the code of the waterfall demo you show just to show you that it's quite simple you can incorporate this on your own projects so basically what you have is we call this get window and document window and document our objects that you need to use in JavaScript to interact with the web browser we create a new waterfall associated with canvas which is on the HTML page we set up some parameters of the waterfall and the frequency your sample rate etc etc so we are calling a few methods of the waterfall to set those and then we have a generator in this case it's basically putting in the waterfall line by line taken from a JPEG image because this is a demo but usually it will take them from a web socket or any other source of data that you have and it's calling this potline function periodically so basically I'm out of time and the HTML is like this so it's really simple thank you I think we have time for one question where the next speaker comes is it difficult to optimize the HTML using Ambrant because it's lower level HTML like very long? no actually the question was if it's difficult to optimize the FFT implementation using Amant versus a lower level language such as Verloc and actually the fun thing is Ambrant is intended as a low level language even though it's Python and the author of Ambrant says it's even lower level than Verloc because it relies on describing flip-flops basically so for me it's the same mindset I would use with Verloc I had I want this logic I want these flip-flops and I can write it in Ambrant or Verloc so it's the same kind of low level but you have the full flexibility of Python it is either four so the question was in the 2DL Factor multiplication I had three multiplies but it's usually four so if you do it the normal way it's four multiplications two additions if you do a trick to group together some of the summands you can do it as three multiplications and I don't remember maybe four additions or something so yeah that's the trick if you look for it it dates back to Gauss and some Russian mathematician I think thank you
quickstream: a new SDR framework
Thank you. Let's see. Thank you. Come on. Oh, you got a shiny surface here. I'm old fashioned. I like to use a mouse. I'd like to thank Marcus for having me here. I just wanted to see what that looked like. Okay. And I'm going to run a bash grip right now. Okay. So this should be fairly automated. Oh, well. Okay. Hi, I'm Lance. And I like to build stuff and quick streams, software I've built. It's a new flow graph framework. Okay. Okay. What is quick stream? The biggest part of it is it's an application programming library, application programming interface library. Okay, not like the APIs you hear about on the web, I get, but they seem to change the word API to have a different meaning than what I'm used to. So that's the core runtime library. It's the biggest part of it. Okay. And with that library, we built, oh, this isn't going to work so good on that, is it? I'll come over here. We built blocks, similar to new radio blocks. Okay. There are simple blocks that do computation. You connect the simple blocks together and then you're going to get flow graph. And also hierarchical blocks, I call them super blocks. New radio calls them here blocks. I like the word super. Okay. There are two programs that come with it. A command line program and a graphical user program. Both these programs do the same thing. I'm going to write the command line program options first and then I integrate them into the GUI because I need tests to make sure my code works. And so the command line interface is really nice for doing the tests that checks that my code is correct. Okay. And it's all written in C with very few required dependencies. Okay. Oh, I'll just jump right into it here. Okay. So I click on that guy there and I didn't click right so I got to go back. Click this. Okay. Okay. How does that look? Okay. It's a bash shell and this is the demo for the command line program. Okay. I type in QUI. How many people here have done command line programming? Good. Okay. So this is, it has a complete, well not as complete as I could think of if you have more things you want in bash completion, I'll code them in. So I can hit tab a few times and I see there are four programs installed from the quickstream installation. Two of them I'm going to go over and demo and then the others came about from necessity and I want to have time to go into those. So I'm going to hit space and hit tab a few more times and we can see what the options are for this command line program. There are, let me hit another dash there. Okay. And hit tab again. I like that long options better. I always use the long options in my test programs too because the shorter options do the same thing but they're limited to one character so if I change an interface then it breaks all my tests and so on. But the long option names are probably not going to change their name given they'll have the ability to stay unique because they're a long character set. Okay. So I want to run this program then with some kind of option. Okay. So it has a semi decent help in it and it's a long help dash dash help option and I'm going to pipe it to the pager. Bless. And so it looks something like a man page. Let's see how that came out pretty good. All right. So we can scroll through the man page and there you have it. You're used to looking at this kind of stuff but so it's there. It's documented this much at this point. That'll turn into a man page of course and it will be auto generated when you build the software and install it from this text which is actually in a separate program. It launches a separate program because there's too much too many characters in this program. I thought it would get bigger. I wanted to keep it small so it launches the help separately from another program which you don't have to worry about. Okay. As you're using it. Okay. Let me quit this. So this is bash bash has this wonderful thing called history. We all know about and I've preloaded a history in here because I would spend at least half an hour trying to type in one command line to get a flow graph running. So instead of sitting here for dittling on my keyboard. Yeah, maybe wouldn't take a half an hour to type in that command. With the bash completion you can find the blocks in a list of blocks that come up when you hit the tab. So it has tab file completion on the block files themselves. Okay. And I was going to show that but I skipped it because I want to make sure I got enough time here. So I've got to run this thing, right? Okay. So there is running. Okay. So what did I just do? So this was the display command that was in the command line. You can run the display command as many times as you want. You could load a block, display it, then load another block, display it again. Then you see two blocks then on the second display that you show. And then you keep on adding to it. This is really too much to look at in a presentation like this. But it shows the flow graph. The blue lines are what I call control parameters. And the red lines, there's only two red lines. They are stream data, similar to GNU radios stream data. Okay. Let me just cue that thing. Everybody is familiar with image magic? Okay. And this is from the graph viz program that uses this dot file format. That's really easy to print out a graph. And it displays it for you magically. And my formatting kind of stinks there. I've got to get better at dot programming. You get a better display than that. It gives you all the things you can configure and connect in the block in that mess there. Okay. So we don't need to be concerned with that right now. Okay. So this is this one program I'm running from that command line. And we can see it's using the GNU radio QT GUI sync, which has this nice thing. I like the constellation plot. Let's see if I can pick this up. Yeah, I can saturate it. Yeah. Quickstream writing house has some GUIs there. One GUI per block where a GUI would be a slider here. Like I can change the frequency and the gain of this thing. I can maybe desaturate that. Yeah, there you go. I just smudged it. I can also kill that and then run it again with the GUI. Okay. So what that's doing is I have a block that is a Unix pipe block. And then Unix has an option to say what program to run to pipe to. And so it pipes to a radio, a Python script in this case, which I pre-constructed obviously. Okay. Let me kill this and move on to another slide. Okay, so ctrl-C again, ctrl-D, and back to the slides. All right, so that was okay. So why on earth would I do such a thing? I wanted to write blocks with one source file, one C file. Okay. And well, of course, you may have to make a make file or just at least having the top of the file, a compiler line, how to make the damn thing when you get there. That would be good for simple blocks. Of course, you can use many files to make these linked to other libraries. It doesn't care. Or I can make the blocks in C plus plus if I like. Kind of partial to C for some reason. Or for that matter, I could probably write blocks in Fortran. Fortran and C get along very well. Okay, like the linear algebra libraries, nobody wants to rewrite those. Those things are amazing. They'll agglise the matrix and you try to get all ones and it's like one point, you know, whoever wrote that code is, you know, oh, you wrote it? So, oh, so another feature is that workflow, when you're building whatever it is you're building through your end user application, Quickstream GUI and Quickstream command line both output a block. The block is itself your end user application. That block can be reloaded back into Quickstream GUI or Quickstream command line. They both do the same thing. And so we have a cyclic workflow then. So we output stuff, use stuff, that same stuff we can reload back in, even if you edit it. Because the runtime library knows about what that stuff is and when it loads it up it's just loading up a dynamic shared object and it looks for the symbols that are in it. As long as you didn't like change the names of all your functions and stuff, it's still going to be able to find all those functions and be able to call them when it reloads it. OK, so workflow being cyclic, typical use case GNU radio is you output a Python script from GNU radio companion or you output C++ code that you compile. Once those files are output, they don't go back in. So that workflow is not cycled. In the typical use case of GNU radio companion. But if you were to output the hierarchical block, then you could have a cyclic workflow. And I think that could probably bleed back some benefit into the whole thing if you get that cyclic workflow in there. OK, so next feature we can assign threads to blocks on the fly. Well the flow graph is running while all the code is running, what have you. So and we can give it arbitrary thread numbers of thread pools and threads in those thread pools and say here I assign you to this block, I assign you to this block and so on. And I'll illustrate that in the next GUI demo. It also has a pass through buffers where your input and your output are the same size then you don't have to do a memory copy for them with the input to the output. And so you can get an optimization where you're eliminating all the memory copy in the block by using the pass through buffer. And that can go through the next block and the next block if they have the same amount of input as output. So it's not a decimator or what's the other thing? Yeah, it can't get bigger or smaller as it comes in and out. OK, and the flow graphs can have loops. OK, both the pass through buffers and flow graphs with loops, both of those things I think will be in GNU Radio 4.0, my guess. All right, assignment of threads on the fly, I don't know about that. Workflow being cyclic, I want to push that one too. So you'd have to take the hierarchy of blocks and be careful how you construct them such that it's easy for the user to then make their app from that. Yeah, I don't know. So there's a little work to be done on that too. But that was one of my biggest incentives is to get that cyclic workflow. All right, so let me move on to the next demo here. OK, so I just click on this thing here and now it launches the next demo. So this is QuickStream GUI. So you see I named it QuickStream. I should have named it QuickStream GUI, I don't know. I can have a vote here later. What should I call this program? OK, so on the side here, it looks something like GNU Radio Companion. And we have core blocks that are installed from in the package. And this is two actually, but these are test blocks. You wouldn't want to use test blocks in your end user application because what these test blocks do is nasty things like testing to see if a block can fail by, you know, you know, committing suicide and what have you. So you want to test all the different failure modes that are in your software in these test blocks. But it so happens that I'm going to use a test block to demonstrate today. So I can load, so we see I just did a right click on the block in this little directory tree here with files in it. These files are blocks. I can load it into the graph. Here, let me load it into the graph. There it is. Oh, yeah, so I named this one Fostum 2024. Now, that's not as interesting to look at as when I loaded it differently. OK, so let me make another tab. So that right now that's a flow graph that's running in that tab. All right, so let me load it again though, but I'm going to load it as a flattened super block. So boom, there it is. OK, this one's kind of a mess. I don't think you want to, I don't know. These blocks are translucent so you can see the connections easily in them. Because when you start making connections that loop around the blocks, like this one here has a connection going from its input to its output. OK, now what does this block do? OK, so it's got a sequence generator. That's generating just random bits. And I encode them as hex characters because I like to program on units where you always use ASCII text for everything, right? And then I put them through sequence checkers. And that's just testing out that the stream works. OK, and so I've tested this thing out by running this thing for days. I come back, yep, still running, yep. And it's checking every single bit that goes through these blocks to make sure that they're working. Then I can also check out the speed and what have you. See how it's doing the performance. Now, the button bar in this thing is pretty unusual, I guess, but it's got a translucent button bar here. It's got a run buttons here, I'll run it. So now you are in the matrix. OK, that's a little matrix converter block. It was just ASCII text before, but then I say, hey, this is going to be cool for this demo. There's also a halt button. OK, there's a difference between halt and run. A very, very good difference. So halt, right there, I just hit it, will stop all the threads from working. And the blocks don't know anything about it. As far as they're concerned, when you unhalt, it never halted. The blocks just don't know. All right, so run though will run the stream, the stream's in the photograph. The other inner block communication methods are already running. They never stop, like the control parameters. A control parameter, for example, so the stream lines are the red lines. A control parameter is like this purple line here. See, I'm jiggling this block up here. It's got a button on it that will hit a run, and I've got blocks that will run the flow graph for you in the flow graph. That's a little tricky because then it has to be very asynchronous, and it's got a sort of, hey, I request to run later sometime because I don't want to crap on myself because all these threads, others threads are running and I can't modify the data structures to get it to do that yet. But I'm going to have you do this later. So it has to be an asynchronous form of a program. It's kind of like if you've done JavaScript quite a bit, I was influenced quite a bit by JavaScript in the way they run things asynchronously all the time. Any blocking call is asynchronous usually unless you take, unless you do the synchronous one, which is probably not a great design. So let's see, there it is running. Now, I wanted to show Htop here, and Htop is nice. It's not so nice right now. Let me do a F4, and I'll type in bin, what do I type in so far? Bin slash QUI. Okay, now I have it searching for the quickstream GUI process, and we can see there are two threads working hard. Okay, this is unthrottled, so that's what you get. That's what I want in my case because I'm testing this thing, and I want to put it through as hard as I can because this is a test. Now, oops, why did I do that? Okay, what I have in the GUI here is I just clicked on the layout window. That's the top level window there where all the blocks are, and I've got a lot of options in there I can do. So it's a matter of right click mostly to find stuff in this quickstream GUI program. So right click. So I want to manage the thread pools. And so right now you see it says two, and there are two threads in this thread pool that are running this flow graph. And what I want to do is mess around with the number of threads that are running this thing. An easy way to do that is to create a new thread pool by clicking that button there, and then I want to tell it, I want to eight threads because I've got, supposedly, eight cores on this computer. And, you know, I'm not trying to do a performance test. And now, let's see, did I get the eight clicked in yet? No, I got to hit enter or something. There you go. No. Oh, so what's going on here? Why didn't it go? This is probably good that I screwed that up. I still have that thread pool there, and I haven't assigned blocks to it yet. Duh. Okay, so these kinds of duhs happen all the time. So let me just remove that thread pool, and now they have to migrate to the other thread pool. Oh, thank you. Okay, so now I have eight CPUs being used freely well, except not very well. The punchline here is that this has what's it called hyperthreading. So I can only expect 50% on each one. Okay, so let me go get to the next part of this demo, which is real quick. It's in the examples here, and I'm going to run, oops, not that one there. I need a new tab. Okay. Load as graph. This is the same example that I ran before from the command line. It's exactly the same, and I can run it here. And ta-da. Okay, and this has a GUI too. I could go to this control parameter here and set its value as it's running here and tell it it's true, and then it'll show up with the same widgets that we're showing before. And yeah, that's it. Let's see. So, I thought I'd leave two minutes for questions. Okay. So, you have a thread pool. That means your blocks are, you're choosing to run blocks on pools, and they're not like in radio 3 where they all have the same thread basically running on its own, deciding when to do the stuff on its own. Yes. If I want to set CPU affinity, then I need to make thread pools with one thread. And then I can control that CPU affinity. Yeah. So, the thread pool picks the next block to start the next slow project. They just end up in the same queues and shared in that thread queue, and then all the blocks are just going through that queue and saying, here's your job, I call them jobs, Micah. I mean, your terminology in GNU radio is different than mine in Quistream, sure. Any more questions? Oh, okay. And you can assign thread pools for different blocks, or it's always for the whole... Whatever you like. Whatever. Yeah. The GUI had another menu that would go through, and you'd have a list of all the simple blocks, which is the ones that actually compute stuff. And you can select the thread pool and the little thing beside it. Or you can do it on the command line with a command line interface, yeah. You'd have to name the thread pools, though, so that you could refer to it in the command line, or the GUI, too. Yeah. I did it. Any more questions? So, Lance, thanks again. Thank you.
An open source digital radio protocol for amateur radio
Hi everyone. Maybe before we get started, how many of you do know about ham radio? This is kind of the topic of the room I know, but still. Okay. And how many of you are ham radio licensed? Nice. Okay, good. I still have included a small introduction. Please put up your hands again. License operators, please. All right. So I still have a brief presentation and introduction to the topic for you. Your experience with ham radio might not be my experience, so I think this introduction is interesting. And then we will have a brief overview of what ham radio and open source means. Not everybody understands open source the same way, especially, I think, in ham radio community. You will see that open source in ham radio did face and does face a few obstacles. We will pinpoint a few of those. We will see the workarounds. And then finally, we will talk about M17, which is the project that I want to talk about today. So first, who am I? So I'm a research engineer at the University of Friage in Belgium. I do mainly embedded systems and RF. I'm a licensed ham radio operator for two years now, called Sine of kind of number four, my Oscar Delta. I joined M17 project one year ago, right after Fuzzdem. Wow. And yeah, I mostly do hardware design, some of which you can see on the table in front of you. We will go back to that later and work on firmware. Okay. Amateur radio, I think you, then almost everybody knows about this logo for ham radio. This is a technical hobby. This is the goal is to experiment, to play around, to have your hands dirty. It allows you to legally transmit uncertain frequencies, which are allocated to amateur radio operations, and which you cannot do if you don't have your license, of course. The hobby is extremely vast. So you have operators which will do what we call DX, which is reaching the furthest away on the globe using lowest power or specific modes, frequencies, whatever, which is called working demands. You have people which are dedicated in antennas, transceivers, reception, transmission, whatever. It's very, very, very vast. And I think most of you also know that the mainstream products come from just a few brands, ICOM, Yezu, Kenwood, and then that's pretty much it. And you have the Chinese brands, and usually your typical ham radio, Joe operator, or whatever, doesn't know about those Chinese brands. So open source in amateur radio, well, this is a bit controversial, a bit difficult to describe, but the ham spirits, which live in every one of us, have always been about sharing design, ideas, discoveries, the problems we encountered, and how we did solve them. You could call that open source knowledge, maybe, which is not to be confused with the fact that most digital voice protocols that we use have published specifications, which means that if you dig deep enough in whatever search engine you use, you will probably find specifications for those protocols. That does not mean they are free and open source, which is very important. And this is kind of the goal of this presentation. So, yeah. Some protocols are free available. Some of you know about a few of those. So AX25, which is a material adaptation of the X25 protocol. Which mainly works on VHF and UHF, so above 30 megahertz, which is not designed for voice, of course. It's digital, but mainly data bits, let's say. And DSTAR, which is what most of us could consider open source protocol for amateur radio. It is the first protocol really created for amateur radio usage. So it is designed from the ground up with amateur radio in mind. It is open specifications. So from the start, they decided to publish the specifications, mostly in Japanese. So this can be an obstacle. Maybe if you speak Japanese, it's easier for you. I don't know. YSF, Yezoo's proprietary mode. Specifications can be found online, but that's pretty much it. You have to have a Yezoo radio to do YSF. FT4, FT8, just an example of a few modes that are used. Very slow speed, very long range, very low power on HF. So very low frequencies. And this kind of illustrates a point I'm going to go to. The new app, of course, DMR, Tetra, P25, all those commercial protocols which have been adopted by amateur radio, which are not designed for amateur radio. The main thing is, and especially when you talk about FT4, FT5 and such protocols, there are many of them, those have only one closed source implementation. It's not easy to play around with it. You can just, okay, I'm downloading this, trying to modify this. Is it better? Is it worse? The way you play around is, okay, which power do I need to reach this country in this weather or whatever, which is not what is suitable to each and every one of us. So we will briefly take these stars as an example. So released by Japanese amateur radio club, JARL in Japan in 2001. It uses AMBE codec vocoder from DVSI. So very briefly, vocoder, you know that voice is very complex signal. You need a lot of bits to transmit the voice, but amateur radio protocols are slow speed, slow bit rate. So you do need to encode the voice into something which is manageable by those digital voice protocols. And the way it is done in the case of DSTAR is using the AMBE codec from DVSI. Specifications are publicly available, but there is no license tied to those specifications, which means you do not have to publish whenever you deviate from those specifications. And so it kind of de facto became ICOM's proprietary mode. It is called DSTAR, but it is not made to be interoperable with other DSTAR implementations, which by the way, there are not really many other DSTARs implementations. So yeah, main obstacles. First, manufacturers exploit the fact that specifications are not really licensed, and so they can find the trick around to lock down their environments. Main obstacle, second part is technical capabilities. Back in 2001, encoding voice in a microcontroller was not really feasible. That's why you needed an ASIC, so a dedicated chip on the board made by DVSI with their AMBE codec to be able to encode voice into bits manageable, but by the whatever digital voice protocol you wanted to use. So I'm not spitting at DVSI and AMBE all the way. There is a whole lot of technical reasons why this is that way, but I think it's good to understand where we come from and to see where we can go from there. So also another thing to notice DSTAR, YSF, DMR, P25, NXDN, whatever, almost all the digital voice protocols being amateur radio or even commercial protocols use AMBE, AMBE to plus basically AMBE and variants of it from DVSI. Basically one vocoder to rule them all. So how does one think are with closed source vocoder? You don't. At least not really. But hey, the vocoder is an integral part of the protocol, so how do you do? What do you do? Is it possible to have what we could consider fully fast protocol if the vocoder is not open source and if so how do you do it? Well, 2001 was, sorry to break it to you, but quite a few years ago. So solution came in 2010. Name is Kodak2. Release in 2010 by David Roe. I want to underline that this was not an easy task. It was the topic of a full PhD thesis. It self-relying on old works and algorithms and so on. So nobody woke up in the morning and said, hey, let's do an open source vocoder. It's going to be easy. That's not how it goes. It's fully open source, no patents, no industrial secrets. That's the point. And since 2001, computing power on the microcontrollers increased by quite a lot. I mean, 8-bit peaks and 32-bits are microcontrollers are not the same thing. So this last brick, which was kind of the missing brick to have open source, fully open source protocols, allowed to the emergence of two main protocols. The first one is FreeDV, which is designed by the same David Roe. He's not alone, but he is one of the contributors of FreeDV, licensed under GPLV 2.1, using Kodak2 at lower bit rates because it's in HF, so low frequencies, narrow bandwidth. You can't transmit a lot of bits. So you just slow it down. You degrade the voice a bit more, but then you're able to do long range digital voice communications. And it's also used as the reference Kodak2 implementation. So again, just like I said about FT4, FT5 a few slides ago, you do something and then you provide your own implementation, except that this one is open source. And then KM17, GPLV 2 uses Kodak2 at the highest bit rate available. It fits in a standard FM bandwidth for VHF and up, so you can't really use that in HF. It's a bit too wide. You are going to annoy a few people published in 2019. So, M17 protocol has all the features you could expect from a digital protocol in amateur radio, which is that you have the packet mode, so you can use it to control a remote site, for example, using just sending commands. You have a stream mode, which is the mode which is used when you use digital voice. It supports AES encryption, which depending on where you live, you might or might not use it. I know I can't. It does have also the specifications for traffic over IP, which I think is a good thing. If you look back, main digital voice protocols do not have that. So, the community will kind of go, each one has its own way of doing it and different implementations, and then somebody comes, hey, I have an idea. Let's try to inter-connect this, and it's just one more break to a very tall and fragile wall. So, here we provide specifications for this, which I think will ease up, does ease up implementation and inter-operation. If you probably know about DMR, to use DMR, you need DMR ID, which is centralized. We don't. In this protocol, you only need a call sign, which if you can use the protocol, you most probably already have a call sign, so problem solved. And the specifications are open source and license under GPLv2, which means that if some big manufacturer says, hey, I like this new protocol, let's try to benefit from it, yeah, great. But if you modify it to make it incompatible with our specifications, we will force you to publish your specifications completely, and we will find a way to make sure that whatever we do next is going to be compatible with you. If you don't want to be compatible with us, we will be compatible with you. Okay, so this was the M17 protocol, but we should go beyond that. There is the whole M17 project thing. More than a protocol, getting rid of the proprietary vocoder allows a load of things. First thing is you can have it running on your computer. You don't have to pay for any license fee. You don't have to have any USB dongle that you have to plug into your computer so that my software has to go through the proprietary chip that you used, blah, blah, blah, blah. You can have software on your computer for Dstar, for example, but you need this USB key on your computer to have the license to use the codec. You can have it on your phone. Same thing, DroidStar, maybe some of you know about it, maybe not yet. This is a very small app that allows you to use digital voice protocols and connect through reflectors, so servers. M17 allows it to run straight out of the box. AMBE codec, there are online implementations, illegal implementations that allows you to use AMBE, but you have to find it, download it, and then it becomes very shady and it's a cat and mouse game between DVSI and amateur radio operators. You can have it for your radio, a lot about that in just 10 minutes, apparently. You can have it on reflectors. Dstar reflectors do need the same USB key that you would need on your computer to translate the voice between AMBE and whatever else you would use. Maybe a small note that if you have a Dstar reflector, which is from Dstar to Dstar, you do not need this key because you do not need to decode the voice and re-encode it. You can just pass the encoded bits around, but if you want to switch from something to something else, then you're stuck. So yeah, it's a whole ecosystem which was able to grow from the ground up because of the open sourceness of codec 2, including in this ecosystem module 17, so which is this board which is open source hardware, open source software, open source protocol, open source, almost everything you can wish for. Let me go in the frame of the camera maybe. So you have the board here which is the newest revision 0.99 because you never do the 1.0 in one go. And then the enclosure that goes around with it because having this on your computer is screaming for I want short circuit as soon as possible, so let's try to avoid it and put it in an enclosure. Difficult thing is yeah, when you have open source hardware making money out of it is difficult, but this exercise is intentionally left to the reader. Yeah, fully open source, affordable. 50 euros about that. Try to find digital voice, modem, TNC, whatever for that price. I think you will come back to us. Open HD, another baby which is on its way not as advanced as module 17 yet, but it's aimed at being a fully open source portable radio. Basically if you can modulate it, we can send it. For now it's only working on the 70 centimeters band, 430 megahertz, and the 2.4 gigahertz. So for those of you who can see Qo100 in the sky, it does the upstream to Qo100. 25 milliwatts. Hey, it's a prototype. Step by step please. With its 3D printed enclosure also open source and very quickly it relies on a deaf board developed by ST Microelectronics and the backside we did ourselves with the power supply, the FPGA, and the transceiver. FPGA, so you can see maybe a few asterisks on the screen. The FPGA tool chain is sadly not open source. You know maybe if you play with that that having open source FPGA is difficult. It is one of our goals, but usually they will provide IPs in their software and then say yeah, if you want to exploit it commercially, please talk to us before that. So always a bit difficult to deal with that. We have plans for the future though. We are starting the work to port it over OpenRTX, which you will know about in six minutes and a half. We want an open source FPGA, so quick note that maybe the FPGA is not open source, but it does not prevent you from building it yourself. Downloading the software which is free as in beer and you can rebuild the bitstream and upload it to the radio. So you can still tinker with it however you want, but it's not strictly speaking open source. We want five watt output. We want USB-C charging. Oh my god, I come here, please. So yeah, we have plans. We are not only pushing our protocol in it, but we just want to make products better and open source for the community. I think there is a big hole in the ham radio community with this and we are here to fill it. A very quick shout out to some very interesting projects close to M17, the open source primary code OpenRTX. WPSD, which is the hot spot software that you can use which supports M17 for, I don't know, a few years, a few months back, they started support, contrary to PyStar which supports M17 for, I don't know, 10 days. MMDVM, the hot spot hardware. So we rely on those to have hot spots which do M17 and there are much, much, much, much more things that revolve around this. Okay, so thank you for your attention. I hope you liked it. I hope it gave you some ideas and desire to join us, help us. We need devs, please. I know everybody does need devs. Check out our infobooth, ham radio infobooth in building AWU, but I think most of you already came and say hello, but if you did not, we are still here today. Okay, thank you very much, guys. Thank you. We have some time for questions. Yeah. For FSK with root-raised cosine filtering. So the the main chip that you would use is CC1200 from Texas Instruments. Yep. Using the packet radio port. Just like you would do for AX25, for example, your old TNC. So this module basically takes the sound from the microphone, processes it, encodes it using the Kodak 2 vocoder, does the protocol framing, baseband creation and processing and filtering. You have baseband output here, feed it to your radio, and then the output is for FSK modulation with 4M modules, for M17 protocol. But if you want more, come to our infobooth. Yeah. What FPGA type radio is in the chipboard? For now it's the latest, I forgot the one. It's not an i40 which has an open source toolchain available. We had some technical issues, we had some technical issues with the FPGA, so the one we use, yeah, is LATIS, CERTUS, something, I guess, but because for the transceiver we use, we needed LVDS pairs for the data transmission. 64 MHz for the LVDS pair speed. Yep. You have been addressing some of the shortcomings of all the other modulation schemes and protocols, so letting aside that they are not all open source, but they have other shortcomings like on UHF, there's reflection and there's fading and many other things that you are experiencing outside the lab. Actually, are you also addressing these things with M17? I mean, is there a better quality of voice? Is there a better cope with fading and reflections and things like that? Okay, yeah, there are shortcomings indeed with many digital voice protocols. We are between a rock and a hard place. We use for-office-scale modulation scheme which basically does not allow you to overcome the multi-path problems, reflections and so on, so we are aware of this. We have to go step by step maybe. Specifications are open, you are free to fork them and I don't know, implement it in OFDM to avoid some problems that you might have with specific issues which is linked to the physical layer. And for the voice quality, some will say that Kodek 2 sounds better than AMBE, some will say that it's worse. It depends, really, but I think it's at least on par with the other protocols. Yeah? So that's a nice thing. For the Internet use cage of UHF, have you curves or measurements that say what isn't that as an R? I can still do whatever analog voice versus M17 and how much further can you get with like bursts? Yeah, so this basically boils down to, yeah, do we have any graphs, lines, explaining basically the difference between analog and digital? Well, this is a wider topic than just M17, but I agree that maybe having those comparisons might be a good idea to push force into those modes. But no, we don't have those curves at the moment, I believe. Many people I think could have, but yeah. Absolutely, question if I can. Yeah. Have you thought about interfacing that or, and I know I'm on your description, so I'm asking you to because one of the things I like as to experiment is that you can put other data on your channel, but on the current hardware you cannot interface with it. Okay, so can we send arbitrary data using module 17? Not yet. Everything is there for you to be able to do it. So the final net is openRTX. There is a USB-C port with data lines connected to the chip, so we use it for final updates. We use it for STDIO output for debugging basically, and yeah, you should be able, in the future it is planned, but Sylvano in a few minutes will talk about that, having communication channel between the computer and the board, and then from there you can send basically whatever you want. So yeah, it is feasible. Yeah. Have any large manufacturers shown interest in M17? Yes. We talked about, we talked with Kenwood, which would be interested in implementing M17, so I pinpoint back to the fact that specifications are licensed, GPLv2, so they cannot lock it down for their own use. We also have connect systems which showed an interest in our radios and Bowfing. With no more questions, and I see no hands rising. Big thank you again. Thank you very much.
OpenRTX: an open source firmware for ham radio devices
Hi, welcome everyone and a small technical problem, I almost lost my voice when standing at the ham radio booth. So you know my voice is not very good today. Anyway, I'm going to talk about the OpenRTX project which is an open source firmware for ham radio devices. We will see it's not only an open source firmware, it's also a little more. First of all, who am I? Also known as Redman on the various communication channels, Matrix and Discord and so on. I come from Milan, Italy. I'm a ham radio operator since 2017 as India Uniform 2 kilo Whisky Oscar. I do not do so much ham radio activity but like on air, I do it on hardware and on code. I'm a firmware developer by profession and for I work in motorsport industry. I'm co-founder and developer of OpenRTX and also member of the M17 team since 2021. So OpenRTX, it's as I say, an open source firmware for ham radio devices which is by design modular which means that you can, it's designed to allow you to easily add or remove components. It's designed to be easily portable to new devices. You want to have it on whatever radio or whatever device you prefer. You can almost take it and implement a couple of files and have it up and running on your new device. And finally, it's easily extendable to new protocols which means that we try to standardize the interface between the firmware core and the protocols so that you can implement your own protocol in the firmware without too much pain. Finally, it's currently supporting only FM and M17 even though it has been developed primarily on DMR devices. I will tell you why it does not support DMR now. And finally, not written here, licensed under GPLV3. So a bit of timeline. The project started in March 2020 and it started because of COVID. It started because me and Nicola, the uniform 2 kilo indian November were at home basically with nothing to do. We decided to try porting the OpenGD77 project to the Taitira MD380 which at the time was known also and it's still known also because of the work done by Trevis Goodspeed on MD380 tools. He did a lot of reverse engineering, modding and work on the original firmware. Then in September 2020 after a bit of work already we already brought up the radio driver, the display and a bunch of stuff. We realized that the OpenGD77 firmware was too much connected with the structure of the radio. So we decided to, okay, let's take everything, restart from scratch, design things from top to bottom instead of adapting the firmware to the hardware. So yeah, we abandoned the original idea of the porting and we can say it's the official beginning of OpenRTX because we created a separated repo with the OpenRTX name before it was known as, it started as OpenDMR. Only 2021, so like four months later we had the first release, V0.1 with working FM, transmission reception MD380. Also one month later we joined the M17 team and we started doing some experimentation about transmitting M17 with the MD380. A bit of time after we brought up the support for GD77, MDM18, M1, MD380 plus a lot of work in the middle and finally in May 2022 we released the V0.3.3 version which has full support for M17 voice transmission. But now basically since that version you can take radio and flash the firmware and you can use it for M17. Then November 2022 we had a huge contribution from a home radio operator from Australia which implemented the voice prompts for vision impaired operators for extended accessibility and in the end October 2023 we wrote a BSP for the Lilligo TTWR Plus which is a small device you can buy easily and play with with ESP32 and radio module and various technical improvements in the middle, restructurations, every 10,000, M17, MD later on. A lot of stuff and yeah more to come. So the supported devices are those. We saw them in also in the timeline. We have the two Taitira, Retivis radios, MD380 single band, the other one is dual band, single band is UHF or VHF and the other is VHF and UHF. They both do FM and M17. GD77 and DM1801 they do only FM, they are DMR radios, they cannot do M17 for technical reasons. I will give you a bit more details about M17 implementation later and I will tell you why those radios cannot support M17. We have the M17, MD17 and then finally the Lilligo radio which now does only FM in the future is going to be, is going to also support M17. The hardware is good, it's just a matter of writing drivers which is not an easy task always. So the internal structure is of the firmware is this one. There is at the bottom there is an interface with the operating system which can be either real time or not. It's preferred to have a real time operating system on the radios because you have to do protocols and protocols have timings and so you know you cannot just have like normal scheduling. Then we have our hardware interface API which is a set of other files which define the functions used for keyboard and display and so on. In the middle which is the code which is common between all the devices. I called it the core part. It has a small graphics library written by us. We thought about using LVGL but it was a little bit too heavy because one of the objectives is to keep the code size as small as possible. So LVGL was a little too big at least for some devices. There is GPS for the radios who have the GPS. There are voice prompts. All the system to have save, restore of user settings and default settings and settings management. There is the code plug system which allows you to have a list of channels and contacts and stuff like that and saved in the radio. Also the audio management subsystem which is a mechanism we set up to make easy the management of audio connections. You can, we have those concept of audio path and audio stream so that you can easily open a path from source to destination and the audio management system is responsible for actually talking to the hardware and making the various connections and managing the possible conflicts between those paths. Because if you open a path towards a device which is already busy, either your path has an higher priority and you go over or you get rejected but you need code to manage this. Then there is the UI which is the part for keyboard UI basically. And then the operating modes now is FMM 17 but yeah, more to come and we have some ideas of what we are going to implement in the future. So yeah, regarding the interface with the operating system, all the code uses the standard POSIX API which means that any POSIX compliant operating system is good. For example, we already run the firmware also you can compile and run it natively on Linux which is very good for debugging when for example you are developing UI or other parts instead of compile, flash, test, debug, modify and go on. You can run it on Linux so yeah, any POSIX compliant operating system is good. All the remaining pieces use the standard C library so you need operating system which supports them and yeah, if possible use real-time operating system on embedded devices. The other interface is as I said, we have API function for display, keyboard, audio connections, management of the radio section and for the non-volatile memory plus like general purpose platform API which is used for device initialization and to manage all the other things which do not fall in the other categories like the LEDs, calibration data, hardware information data, so on and so forth. Yeah, and also the code is made such that you can share an implementation between devices as long as there are similarities between the hardware of the various targets for example in the MD series radios. The display is always the same, is always connected there. We wrote the display driver once and we compile it for every target. As regards the user interface, currently we have standard let's say GUI which is the GUI you can find on the all the handled devices plus a dedicated GUI for the module 17 just because the module 17 does not need all the elements of the standard radios and also it requires some dedicated entries in the menus for example for calibration of levels and so on. So given that module 17 is not completely radio we decided to make a dedicated GUI only for it but this also means that if you want you can also write your own GUI from scratch or you can also run everything in address mode without the GUI. Yeah, we have also future plans of making the GUI scriptable or expandable with modules such that you can for example implement your own module like satellite tracking module. You can just, we expose, we give a standard interface to interconnect with the rest of the GUI and then you can just write your code using that interface plus the graphics functions and that's it. You can add whatever you want. For the operating modes we are using C++ and yeah, I didn't say that all the code is written in C plus some pieces in C++. We use the C++ for the operating modes, simple C++ luckily just because it was a bit more handy to use C++ to do that part. We define a generic operating mode class which has a bunch of functions to enable, disable, to have periodic update if you have to check, squelch, make computations, whatever and the function return in the squelch status which then gets queried by the GUI and so on. So, yeah, to define your own operating mode you just have to subclass the operating mode class and then implement the interface functions and integrate it into the list of operating modes and you are done. And yes, there is still some work to do on the operating modes because for example now there is not a clear way to for example set some configuration data of a specific operating mode. For example, M17, yeah, now this code is a little hacked in order to allow the user to set easily the source call sign, destination call sign and stuff like that. So, as regards M17 in particular, as I said before, we started the work on MD380, then we extended to the MDUV380. By the way, we started on the MD380 because luckily we had the schematic of the radio. So, we were able to see if there were actually possibility from the hardware to implement this. Everything is managing inside the microcontroller and also, and this is why the other radios cannot do M17, is that the hardware must have the following connections. Everything is done inside the microcontroller. The microcontroller has to sample the microphone for audio encoding. It has to sample the demodulated audio from the radio stage because we do all the 4FSK modulation in software in code. So, we have to sample the raw data with 24 kilohertz and the connection must have the enough bandwidth in order to not distort the signal. And like, so to say, outbound we have to be able to send raw data, raw demodulated speech to the speaker in form of analog signal and also send the basement off to the radio frequency stage for modulation. And yeah, we need those connections. The GD77 locks the, for sure, the ref stage to the MCU. MCU to speaker, I'd probably, it's okay and the MCU to ref stage also. So, it cannot send and receive basement. The microcontroller cannot send and receive basement to the ref stage and you cannot go on air this way. And current problems we have, yeah, you have to mod the hardware of the radio if you want to have M17 on it. Quite a limitation because, yeah, you know, you have to have a bit of practice with soldering of SMD. There are guides on the website, but yeah, you still have to do that or have someone which is a couple of doing that. Given that we are doing everything in the MCU, the MCU has to be powerful enough, which means now M38 has 168 MHz CPU, Cortex M4, which is powerful enough. The point is that for M17 you have to send a frame every 20 milliseconds. So, you have to do everything in 20 milliseconds, audio encoding, baseband and so on. If your MCU is not quick enough, you cannot do that. And finally, yeah, quite a problem. Codec 2 is using floating point math, which means that if your MCU does not have hardware acceleration for it or it does not have enough clock frequency, you are busted. Because you are not able to stay within the timings of the protocol. As regards the codeplug, when we started, initially we thought about parsing and keeping the original codeplug format of the radio manufacturer. But we decided not to do so because otherwise we would have to manage a shitload of codeplug formats with various versions. A mess, basically. So, we decided to, okay, let's write and specify our own format. So, we tried to make something which is open free, of course, which supports common ham radio needs. Direct communications, repeaters, hotspots, whatever. And also portable across devices. These both for end users, you have your codeplug file and you can move it around openRTX devices because the structure is the same. So, they speak the same language. And also for developers, you can take the reference implementation and use it in your own project if you want. It's currently working progress. There is a request for comment open on GitHub. Please take a look at it, comment, help us. We need a bit of feedback from ham radio operators on how this has to be structured. Technical details. It is a binary format, which means that you can either write it row mode to the non volatile memory or inside files. And it's also compact because, yeah, it's binary. Up to 65K channels. So, a lot of space. And this currently supporting FM, DMR for like historical reasons because we worked on DMR radios. And yeah, I'm 17, but more to come. It's something you can extend. It's already designed to have more than those operating modes. And also it may become a separate entity from the firmware. Just because we want to make it something which can be used by hopefully also radio manufacturers. I hope at some point in the history to have it become like reference standard for exchange of codeplug data because now each codeplug is specific to the radio. There are softwares which allow to convert code plugs, but yeah, it's not as easy as taking one file and flashing it into different devices. So summing up what's next on the other side codeplug. We have it's the next in the row. Schedule for we are working on this. In these days. This is a bit more internal to the firmware and event exchange system so that you can register to events or send events which can make the development a bit more easy. Also, APRS support because why not. And more to come probably we have to integrate because it has been done before GPS tracking. I want to have something for meteorological zones radio zones the modulator for those. And then we will see. No DMR in the end because we still have the problem that. Audio codec is patented. And yeah, how do you include the patented binary blob in your work without. Risking of being sued for patented infringement. I don't know. It's open problem. We have to find a way to solve it. And yeah, this is the main reason why there is no DMR on it. The point is the maries open states public specification. It's a. Etsy standard but yeah, the problem is the codec for the short. And hardware side which wins hardware support. Open HD. Bow fang DM 1701 which is another radio which can do. I'm 17. This quite a dream. Yeah, is a series of radios. Difficult because of the microcontroller they have. All the microcontroller for automotive from 2008. Big Indian no debugger. No the compiler. It's a mess. But you know why not. And also yeah more to come. So that's it. Happy hacking. Okay. Yeah. Yeah. Yeah. The ones not doing 17 you said. Yeah. What okay. About the ratio. Yeah. Yeah. So question is what about the devices which do not have the they are which now do not have the hardware connection. What if they had the hardware connections. Well, they would do M 17 is once you have the hardware connections. Then it becomes just a matter of writing the drivers and you have M 17 support. So it's on on this. Thanks again. Yeah.
Expanding IQEngine into a Hub for Previewing RF Signal Processing Software
Awesome. Thank you. So my name is Mark and I'm here to show off the IQ Engine open source project. I'll talk about where it's headed in the future as well. Also here we have Roman who's involved in IQ Engine as well as SIGMF. And this talk is aimed primarily at two groups. One is folks who are newish to SDR and RF signal processing, students, hobbyists, anyone who wants to learn more about all this software that you're seeing. And then second is folks who run or maintain an open source project that involves RF signal processing in some way. And hopefully even if you're not in those groups you'll still find some interest here. So IQ Engine currently it's a web app that is all about RF recordings. It lets you preview recordings, manage them, analyze them, some light processing, and then most importantly is sharing and all in your browser. So entirely web based. And I'll show a quick little demo of what the current tool looks like. So IQ Engine is, it's available at IQEngine.org. The project runs a public instance of the tool. But in this case I've got one running locally because I wasn't sure about Wi-Fi. So the main screen here is essentially a list of these RF recordings. They're all stored in the SIGMF format if you're familiar with SIGMF. We have some good ones from Jean-Michel and Aang23. A lot of folks who are here today. You can also open a recording that's local to your machine and then all the processing is done client side. So like I can open a local directory full of recordings. Here, recordings and it'll list out them all, generate the thumbnails. So actually it's the same directory that I had served from the server. You can also open just one local file pair. So sort of, anyway, so back to the list here. If you click on one of them, you're brought to a spectrogram style interface where it loads the samples that you're looking at at any given time. So that way you can have enormous files. And then the mini map on the right represents the entire recording. So you can jump to any part of it and the little gray area is the part you're looking at. We have time, frequency and IQ like you'd expect. That's FM. And then some other features here are, so there's time and frequency cursors if you want to measure stuff, adjustable dynamic range for the color, windowing, FFT size, you can add an FIR filter taps and all of that is run client side. The FFTs are done client side, the FIR filter is. But the one part that's not client side is our plug-in system. So if you select a portion of the recording that you want to send to the plug-in server, you can select it there and then, let me zoom in here, choose a plug-in. So this was an FM recording. So I'm going to run an FM receiver that's implemented in Guinea Radio. And it sends the samples to the server that runs Guinea Radio. And then in this case, it's actually returning a WAV file with the audio of that signal. But there's other types of outputs like you could run a block or a plug-in that gives you IQ as the output. So if I do a low pass filter, it's just going to output IQ. Let me give it a proper cutoff frequency there. And then currently we're just displaying the IQ in a pop-up. But in the future, we're trying to figure out the best way to replace kind of the signal that's already on the screen with this new one so that you can chain plug-ins together. So that's sort of the gist of the tool. Now back to the slides here. So everything's, IQ engine's built on top of SIGMF in many ways. If you're not familiar, SIGMF is an open standard for saving your RF recordings to a file. It's as simple as it gets. You have a binary IQ file which is sort of the native way to store a recording and then a JSON file. And the SIGMF specifications mainly tell you how to write that JSON file. So there's stuff like how you specify sample rate, center frequency, data type. And then I'll show you annotations in a second here. And by using SIGMF, you have software interoperability. And then you can also avoid data bit rot where like in five years you forget what sample rate stuff was taken at. If you want to learn more about SIGMF, there's a link at the top of IQengine.org and it also links out to the SIGMF GitHub page. So SIGMF, the standard is managed by GNU Radio. It's kind of a sub-project sort of. Now as far as the IQ engine code itself, it's web-based, front-end uses React, Tailwind, some big dependencies that we get a lot of use out of our code mirror for all of the code editing. PyOdide lets us run Python in the browser. I didn't demo that but there's some videos online about how that gets used. And then Plotly for those time frequency and IQ plots. WebAssembly for FFTs. And then for our documentation, we use the MDX system which lets us write it in markdown and then have it rendered as part of this page here. So this was written in markdown and then it lets us render it as React components. Kind of nice. Now, so that was kind of the introduction but I wanted to start off where I left at GNU Radio conference last year. So what have we done since then? Well now it's possible to run a local instance of IQengine like if you want to run it within an organization or whatever to share things privately. You can run an instance and you can put the recordings on the same server. So easy enough. Or something that's mounted to the file system as long as Python open can see it and then it can serve the recording. And the other option is to use Cloud Storage which is what we do for IQengine.org. And as far as how to do that, so the general idea is you pick a directory on your server and then you can run IQengine with the Docker images. So if you go to the install with Docker page, you really, all you have to do is change the directory that's mounted into the container. So pretty much this part here of the command. And then the rest of this command will pull the latest IQengine Docker image and it will run it. And you should be able to see your recordings. They'll look like this because they'll be local to the back end. Versus IQengine.org which has a few different data sources that pop up here. So and that's, yeah, fairly new. If you end up using this and notice some quality of life issues, definitely reach out on Discord or GitHub. So next up, I'm going to dive into the plug-in system that you saw me run with the FM receiver. So the idea is any RF signal processing that you want to run on a back end server but triggered from the browser. So what we have within our project is this rest-based API and it allows for someone to write the plug-in server in any language they want. We have an example in Python and then Loic wrote one for Rust. The Python one can run Gini radio flow graphs. It just pretty much runs the Python flow graph and then uses ZMQ to get samples in and out of it. But in the future, there'll be more languages and by using this rest API, it doesn't matter. It could be, really you can deploy it and implement it however you want to as long as it supports this interface. I'm going to show a little demo later running SatDump which is kind of an example of a whole separate project, not a Gini radio flow graph or anything, but a piece of open-source software that you can trigger from IQ Engine. And then Aang will be presenting more about SatDump in like an hour or so. So as far as how the plug-ins look, the Python based ones, we tried to make it as easy as possible to create a new one. This isn't the actual rest API, this is just how you would make a new Python plug-in and then you would use an existing server code that we already have. So you can see you have to specify your custom parameters and then there's a run function where you're given the samples and you have to return several different data types. As far as Gini radio, you specify the flow graph, but the only catch is you have to substitute your file source and GUIs with the ZMQ source and ZMQ sync. That's how we get samples in and out. Not the most performant thing, but it gets the job done. So you can see these first couple blocks are the ZMQ ones and then the rest represents the flow graph. So we have a Python flow graph that implements an FM receiver in this case and that was the plug-in that I ran earlier. So the kind of the motivation here is if you are an author of an out of tree module for Gini radio, you probably already shared the code somewhere like GitHub and created some examples, some example flow graphs, but the next step would be making it more accessible and easy for folks to find and play with and I think this could be an option there by exposing it as a plug-in. Now, let me go back to the plug-ins. So I'll go ahead and run the SAT dump one. So I've got a recording of NOAA APT right here contributed by Aang. So I can click that, I can browse around the signal. I'll notice so it's actually offset, but I believe this is the APT signal. You could jump to different parts of the file there and then as far as running it through SAT dump, I want to run the entire file because it needs a decent amount of samples. So I'm going to select the whole file and then under plug-ins, we've got the fresh new SAT dump plug-in already preloaded with the pipeline for APT, but you can put whatever pipeline you want. So right now it ran SAT dump under the hood. So here's one of the images that comes out. I think IQ Engine still has some work to do as far as if you have a bunch of different outputs, how do you present them all to the user? There's a lot of web design that can go on there. So either it pops open something or it saves a file and it supports all the different MIME types. If you're familiar with web, it sort of just uses MIME type and then we added some custom MIME types for IQ, like the different data types for SIGMF. As far as other plug-ins, I think, yeah, pretty much, we have a detector as well. So let me go to a recording that I give my students when we study signal detection and classification. This is kind of like a toy example meant for testing a detector where you have a few different signals here. IQ Engine's not about implementing RFML, it's about sharing it and making it more accessible. So we made a very simple detector just to have an example. It's written in Python, you're welcome to check it out in the source. It's called Simple Detector. We also have Marco's detector, he was someone else who was working on it. Simple Detector was pretty quick for that number of samples and it did a decent job. There's one extra little detected emission there. Now the results are in the form of SIGMF annotations which are bounding boxes in time and frequency and that's how the results are shared from the plug-in. So if you wanted to download the raw metadata file, the SIGMF file, you can go to the bottom here and here are the annotations that the plug-in created. So we sort of copied the SIGMF format for the return data type. And if you wanted to perform classification you would simply fill out the label and they would show up. Within IQ Engine you can also edit the annotations and edit the labels. So if you wanted to manually tweak stuff like you were making an RFML data set, sort of like a golden meta file, you could do that here. What I find most useful is simply to have a quick glance at how well something worked. If you had tons of files to run through you wouldn't want to do all this clicking, you would just make a script and you could certainly run the plug-ins from a Python script. It would just need to call the REST API. Back to the slides. Alright, so I want to take just a really quick tangent to mention, remind people about what Gini Radio provides and then how it relates to this plan that the project has. So Gini Radio, it's a way to implement your RF, DSP and C++ or Python. It gives us a standard framework for doing that implementation and it's easy to get annoyed at the boilerplate and how to install everything. But in the end if you use that framework it means that other people who are familiar with Gini Radio can then install your Aditree module. They sort of know the standard usage of your blocks, where to look for the example flow graphs, how to connect your application with their SDR sitting on their desk. And that's an enormous value, that's in my opinion one of the main values of Gini Radio. And then the GUIs are nice as well, it's not always easy to program GUIs. So if you're curious about learning about different Aditree modules, C-Gran.org is where we point people. And I mention this because C-Gran represents a centralized location for Gini Radio applications and libraries, what we call out of tree modules. But kind of zooming out one more layer going beyond just Gini Radio is what I'm going to talk about here in a sec. So let's say that you're a developer of open source software that involves RF processing in some way, like you wrote SAT dump and you're doing satellite signal processing. You build something, you want to share it, you want to keep it easy to demonstrate and show off to people, easy to use. Those are sort of the main steps you might take. Now on the other side of things, you have users out there, whoever they are, individual students, organizations, who first they need to discover that this software exists. That's like the very first step. And then how do you install it, how do you run it properly, how can I evaluate how well it's working and use it with my SDR or my recordings. So kind of a duality here. On the developer side, you might post your code to GitHub, you might share it as part of a Faw STEM talk. That's kind of like the current method that we use. On the user side of things, you might Google the topic you're interested in, like specific satellite, Wi-Fi, whatever. You'll probably come across what's out there. But it's not the best way to do it, right? Just by Googling. So installation can be an enormous barrier. When I teach CS students, it depends who you are, but some students and some folks are better at getting this software installed than others. Obviously having a lot of Linux experience helps folks who are new to Linux but want to dive into signal processing, they can struggle here and there. So it can definitely be a barrier. Now how do you actually run it? If it's a new radio flow graph, you probably know how, but not everything's easy to use. There's RF libraries out there that are not clear how exactly do you use it, but you know it's powerful. And then lastly, evaluating the software. Maybe you're going to use it as a dependency or use it as part of a project. So this idea to sort of evolve IQ engines, so instead of just being a way to share and evaluate RF recordings, it can also be used for just RF open source software in general. Sort of like a central hub, community driven for devs to share stuff for users to find and discover software. And then by exposing the software as a plugin, they can try it out on recordings that are already on the site or their own. And then one side benefit is university isn't anyone else who wants to show off their expertise and creates open source software. They can use this central hub as a way to do that. Now this is all in the browser primarily for accessibility sake. It's not the most performant way to do something like this, but it's extremely convenient. Really, it removes a lot of barriers. So users would be able to play around with a certain function using a variety of recordings. And it's more than just using recordings. They can try in the future, maybe there's a way to lower the SNR, like add noise and see if it still works or what not. Add a frequency shift, see if the RF function still works. And then on the author side of things, all you really would need to do is add this REST based interface or at least make it easy to call with CLI and then retrieve the results. So like Sat dump, I'm not using a REST interface. I'm just running the CLI in a way that's easy. Anyway, now one design decision that was made was to allow multiple plugin servers to connect to a single IQ Engine instance like at IQEngine.org. That way, like a university could run their own plugin server, have total control over it, but they could share their expertise, everything they want to show off. And this is really just a concept. So right now I showed you how IQ Engine lets you preview RF recordings and RF data sets. Well, I think in the future with these building blocks that I showed through the plugin system and this REST interface that we're designing, you could have a tool that would be used for previewing what I'm calling Functions App Software, really anything that involves RF signal processing. Now there are limitations, so a lot of RF apps can't simply be run on a recording. So SRS ran excellent LTE and 5G radio stack. Because of LTE and 5G's strict latency requirements, you can't easily just play it back. It's not straightforward, simply running it on a recording. You sort of want to simulate that closed loop system. So not all RF functions and apps are going to be shareable this way, but I think a vast majority of them are definitely GNU radio apps and those kind of processing applications. The other thing that you wouldn't show off is like an SDR interface, like a GUI, that wouldn't make any sense. Now if you're interested in contributing, it's a community led project, so we can always use more web devs. It turns out that the kind of folks in these RF circles tend to know C++, Python, but less so on the website. And I know I've had to learn a lot of web development to get this project moving more. So even if you're not a web developer, there's plenty of other ways to contribute. We're always looking for more interesting RF recordings to share. If you have an entire data set, we can add like a whole category here on the left. So we have Daniel Estevez's awesome satellite recordings as an example, where we can link off to your website. And so if you want to get involved in any way, there's a Discord link at the top of IQengine.org. We have a little community that's slowly building. And with that, I will take questions. Yep? So yeah, the question was related to geolocation data, like running it as a plugin, I assume. Yeah, yeah, while I explain that, so there actually is already a maps-based interface for, anyway, when we designed the API mentioned, we made sure to allow multiple channels of RF. So those channels could be time synchronized recordings from different sensors. That way at least you could run it from a, the backend perspective. And then, yeah, I guess we would need to have a maps interface to the spectrogram page to make that fully happen. So yeah, I would need to make that fully happen. But good, good, great suggestion. Yep? Well, so Guinea Radio has some Azure credit that they got, and that's what we've been using for a lot of these recordings. So, and we can use that for other folks' recordings if they want to share it publicly. Yeah, you can reach out and we can transfer it over. Yeah, I think it would fall. No, no, no, like I could like upload it for you. So the Guinea Radio has a blob storage account, so I could, I could give you a SAS token for you to upload it yourself or I could upload it for you. Yep, I think there was one more. Yes, there is something that's a work in progress, but I guess I'll share it. So there's an upload page. Oh yeah, yeah, so, so IQengine.org slash upload should allow you to upload a recording. The Wi-Fi's not great, but yeah, that would be the first place to go. I think we're out of time. Any last question? Yep? So how well does it actually handle everybody's thoughts? So, I mean, it was designed to deal with terabyte files from the start, which is why we have that minimap, and when you open the spectrogram page, it's only loading what you're looking at at any given time. So it's sending the IQ samples to your client, to the browser. The browser's doing the FFTs. So it's sending maybe a few million samples to get a spectrogram like this, but if it's a mini terabyte recording, you'll just have a smaller, like, gray window here, because it'll represent a smaller part of the whole recording. Yeah, I mean, you have to store the recording, but it's not all, we have no part of the code that sends the entire recording to either the client or the backend, because we know it's not going to fly for huge stuff. All right. Yep? Yeah. Yeah. Actually, SIGMF has a lot of, there's even an extension for more details about the hardware involved. Definitely check out SIGMF, the specs. So if you want a five minute introduction to SIGMF, that's what we have here on IQ Engine, but I would, yeah, go ahead and go to the specs and dive in, and you'll know a lot of the parameters that you mentioned. All right, thank you very much. Wow. Thank you very much. Thank you very much. Thank you.
DAPNET: Bringing pagers back to the 21st Century
Thank you very much. Hello. Good afternoon. Hope you're all well. Not cooking up too much in this room. My name is Manuel, so I'm a radio amateur. I renewed Nerd if you prefer. I like experimenting with new or older equipment, see what we can do with it, or use existing software or hardware and deploy it as widespread as possible. If possible, within the amateur radio community and keeping things open source whenever I can as well. So today I'm about to talk about cutting-edge technology straight from the 1990s, pagers. So if you've seen those things, well, it might bring you back some memories because they were heavily used in the 80s and the 90s. It was used by mostly doctors, drug dealers, or businessmen, sometimes the three at the same time. So basically they were everywhere in the 90s and started to disappear later on when GSMs made their operation. But this was something really common. In those times, you can still see it in TV shows, medical TV shows, getting paged, the doctor's getting paged because there's a code blue, whatever that means in room 204. Now, I'd like to explore this thing because behind this hallmark from before, they're extremely simple communication systems. And I think it's worth exploring them a bit more and see what you can do with it today in the open source community and the amateur radio community. So today we'll be looking at what is paging in itself, what does that mean, how does it work, generally speaking. We'll go a bit into the technical part of it. So how it works, the modulation types, how you can make a pagering, and then we'll bring that into the amateur radio context. We'll talk about the DapNet project, which has been around for a few years now, what you can do with it, how you can get started, and then I'll be open for questions if you have them. So coming back into the techniques, let's talk about paging in simpler terms. Paging is basically sending a message, making a small device ring one way or another, very often to small, low-powering, compact receivers. Most of them use a standard which is called PoxSag, which was developed in the 1980s, but much older standards exist and are almost not used anymore. So PoxSag is one of them that remains. The other one being developed by Motorola and proprietary, but we don't talk about that here. The topology is always the same. You've got one big transmitter, high power, and then you've got your receivers around that receive the messages whenever there's one. So the frequencies, you have them starting in 8-chef, I should have put that. So you've got pagers on 27 MHz and then all the way up. Here in Belgium, the national services use 160 MHz. In other countries, you will see them in 460 MHz and sometimes even higher in the US, they're all up to 900 MHz, if I'm not mistaken. So you see them on a lot of different frequencies, and you also see that compared to a classic two-way radio, the antenna is built into the device, which is itself a challenge because it means that your signal needs to be higher in intensity to be received by those antennas because they perform a bit less than the standard WIP antenna. Use cases in the commercial world, you'll find them, for instance, for one single hospital to be able to call doctors or industrial scale systems or sometimes a bit bigger. National scale being one of them. Here in Belgium, we have one single frequency for a distributed system of transmitters operated by Astrid, which is used by firefighters, ambulance services and others, so it's still being used today. You will see them also in foot trucks or in order to take away food courts such as if you've been to the wolf two days ago, you'll receive a little pager that would have rang, sorry, whenever your food was ready. So this is also a pager in itself. How does that work? As I said, it's using one single frequency, a specific carrier that we modulate in FSK, so simple frequency shift keying. You send a zero by shifting one way, the other way is a zero, so just by shifting one from one to another, you send one zeroes and then format it into very simple packets. So very often, please mute your radio, just saw it. If you want to send a packet usually, you send a preamble that wakes the receiver up because those receivers usually sleep for long periods of time and wake up from time to time to see if there's not a preamble for them there. And once it wakes up, it will start decoding the signal and then it sends an address and the linked message. And if the address doesn't match the pager's address, it will just shut down, go back in sleep mode, so that makes for very power efficient receivers. So this thing can last up to one month on one single AA battery. So yeah, that's pretty much the idea. Again, if you want your pager to receive a message, you put the address into it, basically. So if you want, for instance, to program this message, which is aimed at the pager with the address 101, you put the address 101 in the pager. If it receives it, it displays a message and rings. Otherwise, it will just stay asleep because this is, for instance, a message for 102, not aimed at the pager itself, it stops ringing. Now, you can also make group alerts that way. So it's quite simple. You just put the same address, recall the RIC. You put the same RIC across all pagers and if they receive it, they will just ring altogether at the same time, displaying the message. So that means that for individual or group calls, you basically address one individual ID to a single pager and then you put one common group ID across all pagers. So you can select if you want to address one person or a specific group and you can organize your system this way. So it makes for a very simple type of receiver and then you can see yourself when you're building the network how you're addressing each pager or each group of pagers. Poxa-Agonamator Radio is not new. It's been done since the 1980s. I think it appeared at the same time as paging itself, so we started filling with that a long time ago. We use the TNCs connected to old VHF systems and yeah, the thing is you had to modify the pagers themselves very often by changing quarts and retuning the receiver loops to make sure that it felt between the amateur radio frequency allocations. But very often those were individual stations used for bulletin board systems at the time, for weather alerts or that kind of messages. So it kind of disappeared when packet radio really folded after the 90s. So right now the only thing we know on packet radio is mostly APRS or the more widespread you use is IPRS. So BBS, you don't see them anymore and the technology got lost in the ages. But now we have easier ways to interconnect stations together using HandNet for instance. So you now have IP links that can be made on amateur radio frequencies quite easily with modified Wi-Fi equipment or others. And there's a team from the Akan University of German Radio Amateurs that developed a network of internet-connected Poxa-Agonamators using free and open-source software and that is the DapNet project. So DapNet stands for Decentralized Amateur Paging Network. The idea is to have various core servers that are geographically separated, interconnected via HandNet, that exchange the messages through multiple nodes. So if you have one, fails, the others will take over. Now of course if you're outside of that HandNet link, you can always get a bridge through internet and this is what I'm doing here because I don't have an HandNet link here. We still haven't brought the HandNet links from Germany up here until unto Brussels. But you can go either way. The frequency is almost universal. Depending on your regulations we try to stay on the same frequency everywhere which is 439.9875 megahertz. That's a mouthful but that's the one we try to use everywhere. The only exception right now I see is the Netherlands because they don't have the access to this frequency so they're using a frequency on 432 if I'm not mistaken. But I mean from with this pager I can use it basically in Belgium, in Germany, in Switzerland. There are some transmitters in France as well so it's growing little by little. Now for addressing transmitters they have to be synchronized one way or another otherwise you have several transmitters that start keying up at the same time and then we'd interfere with on another. So they split up in time slots so if you have two overlapping transmitters you'll put one that transmits on one time slot and the other one that will transmit on another time slot. Just make sure they don't transmit at the same time. So what happens is you send a message on the DapNet infrastructure and as I said this only records basic numbers so there's no call sign you can encode in there. So there's a database on the DapNet infrastructure that links your call sign to an identifier. Very often we put the DMR ID because this is a way to identify hands with numbers and then it matches to this specific rig, this specific address, sends it to the transmitters that are linked to the area we selected so you can key up all transmitters or regionalize your calls. So you can say that if you know that the person that you're trying to reach is in Belgium you put Oscar November dash all. If you want to reach an area in a specific province well you can narrow it down and try not to use the network as extensively and just try to reduce the load if you know where your person is. Same for Germany, the Netherlands, Luxembourg, France there's this same kind of geographical way of cutting the transmitters. You can also make group calls so there are we call them rubrics and you have some for weather alerts, DX clusters etc etc. I'll come back to this in a moment. So what can we do with that? Well pretty much whatever you want. You can send messages manually to a specific pager via the handpager.de website. There's an Android app, I think there's even an iOS app but I don't know what's the status on that. Via the DMR infrastructure from Brandmeister, from APRS, from Tetra, so basically sending a text message from your radio will make it land on the DapNet infrastructure and then it will relay it to to your person you want to call. Then there's an API you can use to send weather alerts. There are automated messages for urgent alerts which will make all the pagers ring for example. DX clusters as I said or status on space solar flux conditions etc. That is something that also is also sent every four hours on the platform. You could also build something for a repeater telemetry or any IoT advice that you want but again keep in mind this is a network aimed at amateurs, for amateurs non-commercial and please keep in mind that is maintained by volunteers that do it on the free time with servers they have access to so don't start bombarding the network with telemetry that sends every second to the status of your fridge because that would be kind of a problem. So say reasonable but this is the kind of thing you can do as long as it's non-commercial. Now how can you get started? As long as you radio amateur with a call sign you can register right now there's a website to submit the tickets and we'll create your account. So once you do that you have access to the platform you can send messages if you want to receive them you'll have to buy, modify or build your own pager for the 439 megahertz frequency. That's one thing but then you need a transmitter somewhere. If you're lucky enough to have one within your living area you're good to go, enjoy. Otherwise well you can go your own way and install a hotspot at home or you can make it a nice project for your local radio club and build a wide range transmitter for everyone to enjoy. So there are two ways you can go. Speaking of specifics, acquiring a usable pager is relatively easy today so as I said before that you had to buy second hand pagers, replace a quart, retune the receiver chain but today we have more frequency agile receivers that have PLLs instead of the quarts that can be directly retuned or directly bought to work on those frequencies. So one of them is the AlphaPok 602R which I thought I had, yeah I have it on loan but here it is. So this one costs about, I think it was 90 euros when we checked on the AlphaPok website directly from Germany. You can buy it on Aliexpress but your mileage may vary. So that's a way to get quickly into it. You could go higher range and buy those commercial ones which are a bit more expensive but work as well or you can go the DIY and free open source routes and build your own using open source software like a project I've been working on which is the ESP32 pager which Bastia also improved a bit on the UI side because that's suck at UI but basically using a ESP32 Lora deathboards you can make it a Poxhack pager and have a receiver for quite cheap. I think those deathboards are about 15 euros on Aliexpress as well right now. So it's built on on radio lib so also freely modifiable so have a look if you're interested. As for transmitters you have two options right now for hotspots you can use if you already have an MMDVM hotspot well you're all set you just need to register it on the DapNet and activate the transmitter so that's one way if you want to build a wide range transmitter things can be extremely simple because you just need a small single board computer such as a Raspberry Pi, an FM transceiver you fit it directly into the audio unfiltered path of your transmitter and then well you're good to go so basically it requires four components the transmitter the Pi a transistor and one capacitor so you can get on the air quite quickly. All our transmitters are being worked on again Bastia is working on a ESP32 transmitter to make a small hotspot even cheaper if possible so again quite easily reachable. So where does that leave us? For me it's quite an elegant solution to receive text messages on our own independent networks having fun in the way learning how to use basic systems implement them and deploy networks that everyone can enjoy and it has its uses for telemetry or others you can do weather reports emergency messages text your friends via pager send silly jokes and the challenge of having it to fit within 80 characters so there are ways to make snappy jokes and intelligence ones at that but I think that thanks to the DapNet network the arrival of audio cars that can act as TNCs instead of using an external module it really made the thing much more accessible so if I'm able to SSH into my hotspots I can make you a quick demo of how that works so give me a quick second who's got a pager here? One? Nice? Nice? Nice? Very nice that's already one depending on what you registered in it I don't know if I'll be able to make a drink but so basically here you have my personal pager this is one from a friend I just borrowed and this one which just died on me which is not a problem in itself I'll just make this presentation shorter oh no it's alive there you go so those all have their own individual addresses so this one is 2069009 206500 sorry this one I don't remember this one is address 100 so I can make this one ring specifically I just key the transceiver up say I want to make the trans the pager number 100 ring please work don't make me look silly there you go so right now you have only this one ringing so I just made an individual call to this one now let's imagine I want to send a group alert for I don't know some weather storm weather coming up or a rare DX spot happening right now on on 18 meters 18 megahertz sorry so then I can make everything ring at the same time 1040 and then everything rings and it's just a nightmare and I need to confirm that otherwise it will ring again so there you go quite simply using basic addressing basic open source software this is just the hotspots just an MMDVM here running in the background and I can directly key the transmitter up if you have access to the DapNet system right now I think at least two or three of yous have access make you can make an individual call to myself there you go he just sent me a message on my on my pager and what did he just say how does a SQL expert get it how does a SQL expert get a date okay nice very nice nice so there you have it if you have any questions oh yeah there's another open source project that is just coming up with it where is alexander hello didn't see you yet if I'm not mistaken you worked on a pox hack decoder which is getting finished up as we speak for sdr plus plus so yeah I think it's important to report that as well sorry I didn't get the time to fit it into the the slides but again if you have any questions I'm I'm just Jesus Christ thank you for your attention and yeah I'm only yours all right I hear a question do we have a microphone or I'll repeat the question so I live quite close to an old school pager site yep they transmit very high power on vhf they do what causes interference a lot of other stuff whoo okay so you know in practice how much power do the transmitters in this network need to be useful and what happens when the pager misses a message is there a transmit or do you go to get one shot um well very often in professional network I'll start from the I'll repeat the question first so you have problems there's an interfering pager transmitter next to you because it's using high power so how much power we're using and second question sorry short term memory is what happens when the pager misses a message if you what happens when you miss a miss a message so the first one being yeah for commercial systems very often they who use 200 300 watts for transmitters because it needs to reach inside of parkings and the antennas are lousy at best so you need high power to get through for amateur radio systems it's less of a god now everyone is trolling me now in um yeah for amateur radio systems very often we don't go to that imperative of being able to reach everyone to through parking lots so very often the transmitters are 25 watts to 50 watts I mean higher up would cause problems such as what you're talking about but yeah usually we keep it low and we just add more transmitters here in Belgium is a problem because every time you add a transmitter you need to pay for an extra license so I mean we're still limited a bit legally speaking but it's not a problem in Germany or other countries where they don't pay repeater licenses or they're much cheaper speaking about missed messages there are two mitigation measures well actually just one is repeating message if it's lost it's lost if you don't get it that's it because there is no way to send an ack so either you receive it or you don't and that links to the first problem that's why the commercial systems use high power so there is no store and forward system in paging so yeah that's a small small limit other questions yes you don't need a call sign to receive signals specifically on radio amateur bands so you could perfectly use an sdr or I don't know buy a pager and put some public public messages but to be able to receive individual to you or to be able to transmit or you know at least access the platform you would need an amateur radio call sign but I mean radio amateur is much more than paging and I think it's worth looking into it if you don't have a license yet I'm not going to start into my big talk about about that because I've done it for about 25 times today but yeah there's a lot to discover and that hobby and might be worth looking into it if you have the time to access that hobby other questions yes yes it does it does you can change the ringtone you make it make it go beep the blue whatever you can even compose your own ringtones on some of them yes p32 pager actually has a provision for you there are different tones and you just compose a music you want so if you wanted to make play Tetris go ahead there was one question here and then you no question here okay what's the frequency range of the receiver the receiver so the receiver itself could be tuned pretty much anywhere on the UHF band so 43440 but the problem is it's using an antenna loop so a loop antenna so it has a very high q so you need to tune it yeah it's 70 centimeters yep yep there is one there is one I would need you have internet you're connected to the custom network here yeah if you're looking to hand pager dot de you should be able to at least get the address book so yeah my time is up thank you
srsRAN Project Update
Yeah, thanks guys. So I'm going to talk about SSRUN project, our deployable open run, open source run solution. So it's going to be, talk about 4G and 5G, because I know that many of you still use 4G, although we usually focus these days on 5G. I'll start with talking about our repositories and the naming, as this causes a bit of confusion. And then I kind of split the talk, kind of 30%, 4G, and 70%, 80%, maybe 5G. And primarily talking about our newest baby, so SSRUN project, which is an open run native CUDU implementation written from scratch, and then obviously also demo that here. So if you go to GitHub, SSRUN, these days you see two repositories. One is called SSRUN 4G, and one is SSRUN project. And I'm going to explain why we have those two projects here in a second. But let me ask a question. So who's interested in who's doing 4G in the audience? You're most interested in 4G. Great. And who's interested in 5G? OK, that's actually more. Nice. So a little bit of background in history. So when we started with that, actually 10 years ago with libLTE, which was back then a pure C implementation of the LTE physical layer, that obviously was all 4G. And then the first application, the real application we had was the UE. So a 4G UE, back then still in a separate repository that we figured out would be better to actually join this when by 2017 we created the E-note B. So SSRTE basically became the file layer, the UE, and the E-note B, and all in the same repository. And later on there was also an EPC added to that. And then to explain a little bit that gap that we had between 2018 and 2021, so we added a lot of new functions like carrier aggregation, ebMS, MIMO, and there's actually, Foster talks about this and they go into detail what we did there. But primarily what we did in this time was to harden the E-note B and the UE application to make them deployable, to run them in real networks, and that's what we have today. So they're deployed in the field, running in hundreds of base stations on a daily basis. And then in 2021 we got very excited back then of course about 5G and we implemented both for NSA and for SA based on the old, based on the existing SSRTE 4G architecture and software architecture. We implemented for NSA a UE and an E-note B or G-note B as well as for 5GSA. But then, I mean that was all based on the 4G code base and we were kind of in trouble because we still call it SSRTE. So we kind of figured out, okay, what can we do there? If we call it SSRTE NR, that's going to be a problem at some stage with 6G, so we figured out okay, let's call it SSRTE RAN. That made total sense back then, right? So we had back then in the entire, in the same repo, we had 4G, UE, and E-note B as well as 5GSA and 5G NSA, UE and E-note B. But what happened then was actually ORAN came into play and everybody was getting excited about ORAN as a buzzword. What does it mean? It's kind of initiation or initiative by operators to open up the interfaces between the radio components. And the idea here is to avoid the defector VendorLogin that we have today. So when you build a network, you're buying all the components from a single vendor. And the idea here is really to use off-the-shelf hardware for the CU and the DU, which is basically your G-note B, your base station, and put Linux software on it and run it and use open interface between those components as well as between the G-note B hardware and the radio, so the IU side on the left side, the left-hand side. And this is in fact something that 3GVP itself brought into the game because they are the ones who define the CU and the DUs, but others have provided a radio interface, so a frontal interface between the DU and the IU, which the ORAN lines took and basically defined the open frontal interface or an open frontal interface to talk to an IU. And that's basically what it all is. And what we did back then was to, and having known and having implemented all of that 5G, all those 5G applications already in the old codebase and knowing the limitations of that codebase, we kind of set aside and said, okay, would it be cool to actually start a scratch and get rid of all the legacy and rewrite everything? And that's what we did. So we sat down and rewrote everything and the Azure's Run project, so as we call it today, is a completely new software architecture, so we had people really laying it out from the being on with all the interfaces in there that the ORAN Alliance specifically specifies. And with all the thinking that went into that for openness, for interoperability, for performance, all towards really deployable open-force radio platform, RAN platform. And that's what we did and that's what today is Azure's Run project. And that was also back then when we figured out, okay, or realized, okay, that's now a new project and that's now a new platform and we need to give it a proper name and distinguish from the old codebase, which is still totally valid and totally functioning. So we kind of renamed the old Azure's Run, which was not that old by then, only a year or two, to become Azure's Run 4G. And that is what it is today and that's what you find in the GitHub repository. Still getting new releases and updates, but the new stuff is Azure's Run project. So this new architecture, this new 5GSA codebase. And then as we have seen, there's quite a bunch of people who are still interested in 4G that it's often a little bit misunderstood that this is kind of an old project and it's not maintained, but that's not the case. As I said, it's deployed and it's a maintained 4G codebase for the ENOB and the UE. And it still contains a UE implementation like this proof of concept UE that we did back then with limited 5G support, admittedly, with all the legacy and the limitation that we had back then. But it's still used by quite a few people in the research community after all and it's good enough to attach to a GNOT B and you can work with that. And the 5G GNOT B code that repository is not recommended to be worked on. So for everyone who is using SA or is interested in SA, please use the new RAPO. We're not fixing any bugs there in the old one. And the last release was actually just the end of the year where we, because of those users who want to use SSUE with 5GSA mode, that have been fixed for to support more bandwidths like minor things, no real DSP changes or anything bigger, but this is something that you can do. And then use the UE in the old repository, so this testing UE, to connect to the GNOT B and the new RAPO. And it's working within the limitations perfectly fine. And now let's come to the SSRUN project. So this is in a nutshell the architecture and in everything we have here in green and blue, especially within our scope. If you're a little bit familiar with the Nome Clutcher, it's the DU and the CU. So the CU is the control or central unit, sorry, doing most of the control and plane stuff here in the upper left corner. It's further split into a UP component. And then you have the DU, which is the physical layer and layer 2. Those two components again have a split in there. So many splits that give you options, possibilities to cut them and implement one thing in hardware, one thing or everything in software, however you want and whatever your application requires. And then you have the so-called frontal interface, which you can see down there, where we support also frontal 7.2, which is this new open frontal protocol, which you can use to talk to commercial radio units, and also frontal split 8, which is IQ baseband, so user piece. Like this is the default, so to speak. And then four or five points about this. So this is a complete solution. So it's layer 1, layer 2, layer 3. You get everything there. It's not just like a subset of the RAM solution. We don't implement anything like RIC or SMOs. We don't implement a core. But we are exposing all the centered interfaces so we can talk to third-party components that implement that. It's very portable, so it's running on ARM. It's running on x86, on Intel, and on AMD. All performance, yeah, already coming to the third point, all performance critical things are written in SIMD instructions for ARM, neon, and for x86, AVX2 and AVX512. And it's also very scalable. So you can actually run this like a full 5GSA with the physical layer on a Raspberry Pi and attach a B200 to it and attach a phone to it and it will work. But the same thing also runs on 128 core Ampera server or Epic server. Obviously doing then MIMO and all the thousand whistles and higher bandwidth and throughput. And very flexible. As I said, every interface that you have there you can cut and then talk to a third-party component or mix and match and maybe put some stuff on the physical layer implementation, some running in the ARM cores and embedded system. Everything is interoperable, so we have integrations with VATI units, with core networks, with RIX, all the components that you need to build a full RAM, which are out of scope of us. We do integrations and talk to others and try to work with them and it's all open. So please feel free to look at the code and it's all very transparent, which we believe is very important also for like TECO projects. And I don't want to dwell too much on the mainline features here, but I think the main takeaway is that everything is there that a normal user or operator even would actually need from that. So all the bandwidth and all the modulation schemes and all of that. The performance is like we are looking at carrier grade like numbers. So many UEs, 24-7 operation, highest bandwidth. I mean, this is what the spec defines. So 1.5 Gb in the downlink, 200 Mb in the uplink. This is a four-layer downlink, 100 MHz and one-layer uplink. And it's all accelerated. So we are not putting any, we support FAC hardware acceleration with Intel ACC100 cards or other DbDK bound BBDF devices, but we don't need that. So it all runs efficiently on ARM and Intel. And then there are some features coming up in the next release. So we're doing bi-yearly releases that's like a pattern that we have been following with like for many years already. But we're doing constant pushes to main, but not releasing that code. And the next one includes for instance, mobility, so between cells. Then I know there's interest in NTN, so there will be initial support for release 17 NTN. So to talk to geostationary satellites, multi-cell support, and the split between the components that I showed. It's also an important point. And then I know that all this telco stuff is very overwhelming, especially if you start with that. I completely understand this. And that's something that we put a heavy emphasis on. So to really increase the user experience and ask people to engage with us, to simplify really the barrier and to lower the barrier, the entry barrier for telco in general, and also ORAN, which has again its own complexities. So we've put a lot of results in documentation. So there's application notes, there's developer guidelines, there's code styles for contributions. Lots of testing is going on. That's also something. So there's a MATLAB based repo where we do all the physical layer conformance testing against MATLAB. Yes, you need a 5G tool for that to test. But it's still very useful for people and for researchers who tend to have access to the university or where they work on engagement. So everything is hosted on GitHub. We ask you to engage here. So discussions forum, we used to have a mailing list, but for the new repo, we're not using that anymore. So better GitHub. And then of course the code itself. And then there's an overview of docs.essence-round.com. You find everything there. And if there's something not there, then reach out. And it's also something we are collecting ideas to create new application notes and things like that. And then the demo. How am I going? Not too bad. So I would have loved to actually do like an ORAN demo to really show you, bring the components and everything. Unfortunately, the reality is that it's very complex. So you need big servers. Usually you need extra hardware like switches. You need timing. Very important. So you actually need PDP. So precision time protocol, grandmasters, the GPS clocks, CERN. And the IU is the one that you see there is like a big brick weighs five kilos and nothing you can just put in a trolley and bring over. Definitely not the servers. What we did is to kind of like, you know, miniaturize it a little bit. But this is how small we got it. This whole setup fits in a big suitcase like in those Peli cases. But it's still like a desktop PC. And this radio unit there on the left hand side, it still weighs five kilos. Plus, you know, power supply, then the PDP grandmaster, you need GPS, which is difficult to get here. And all of that, you know, it's still it's still complicated. Nothing for a weekend and to put in the backpack. But obviously we do a demo there. And what I will show here is the exact same software. Remember, we support both ORAN split 7.2 to talk to those guys, to those radio units, but also to a user P. So that's why I have here my B200 mini. And I have a Pluto running Maya. And I have a Cots phone here. So this is a Motorola phone. And I'll use something that we also created to make it easier for people to get started. So all the components that I'm showing here are running off a Docker Compost Script that it's also in the repo. So in the Docker, you know, top level folder. There is a Compost for the G-Node B. That includes Open 5GS as a 5G core. There's an Influx DB and a Krofana dashboard. This is something we used to basically to show, visualize the performance of the radio. And I'm trying to get us all running here. And then let me just connect this guy here. Yeah, one thing missing here. So what I will run here is just a Docker Compost here. Really in the main G-Node B or as I was running Docker folder specifying like a configuration that I just adopted here a little bit to find the frequency that's actually empty. Because if we look at Maya here and we tune you a little bit, it was actually quite crowded here. So all the radio folks are occupying the spectrum. So this looks OK. So I picked the AFKAN here that is empty. So now it's running. So this was just one command, just a Docker Compost up. It's starting the core and starting the Influx, starting Krofana, starting G-Node B. And that's running there. So if you go back to Maya, we see the G-Node B broadcasting. So this is the SSB and the SIB. So all the information is broadcast by the G-Node B without a UE attached to it identify itself. And what do you see here in the white band admission? Those are CSIRS or reference signals for the UE to adjust it and to do tele-measurements and to report the quality back. And what I do now is to run straight. Do you see this? So this is the Motorola. I will make this a little bit smaller. Because what we see now is, so if we, this is a Motorola phone that I have here. So if I take this out of Airplay mode, what happens is it will scan for a cell, it will do a patch. So send an initial signal and then all the attached procedure comes and what we, you know, what's going on there, the communication with the cell and with the core. And then it goes very quickly. But that's something that we can do here. And it's, you see there, and now it's actually attached. So all the transition that we see here now on the outer band, this is the UE doing uplink signaling, uplink control signaling. And now the level has adjusted a little bit. So the dialing transmission, I guess, is the, is the blue, still like the blue level there and the higher power ones. This is actually the UE. And just to see that, so do you see the network name here? That's FOSM24. And what we can do now is to have a look at the profiler. So that's running in the background as well. So usually we do, it's obviously a console application. We have console traces, but we created this nice Krofana dashboard there. So it's all the data is pushed into influx to be, and it's just displaying the stats there. What we can do is to, to now, yeah, use an application, signal guru. So you can actually also look at, look at this here. I will, because actually the Wi-Fi, the Wi-Fi is very bad here. I would have backhauled over Wi-Fi, but it's not, it's not great. So what I do is I just run an IPerf. So, Mac. So, can we still see that? So now you see that this one user, obviously this is only a very narrow 20 mega cell CISO. We're not, not getting 1.4 gigabit, but we get like 32 megabits here. Maybe we can do a little bit more. I still do that longer. Maybe 50, I'm not sure if the channel is, is good enough to do that. But yeah, it's going up. And then the phone also showing that, like this is an application we use to, to get, you know, information from the, from the baseband. And what I do now is I disconnect that here. I do this and maximize. Close free mode. So now I can actually walk around with the phone. So it's still doing, it's still doing low. So have a look at the, at the MCS level. So this is very good here. So if I walk around here, it should adopt the link and the, the rate should kind of go down a little bit. So the, like this dynamic MCS adaptation, depending on the measurements that the phone does. I mean, we can go out of the room as well. And when I come back, like the, the MCS is, is going up again. And doing full rate again. And I think with that I'm. I think we can still take a question. Sure. Yeah. Sorry. Yes. Yeah. I mean, we have native support for, for UHD and in, in this repo, but yes, you can actually use the SOPE UHD wrapper to, to take the, the blade to run the plate over the UHD or the UHD SOPE wrapper. I always take, like mix it up. But yeah, you can do that. Also with the lime or with any other radio that supports SOPE. Nothing, I mean, it needs to support obviously full duplex. So it's, it's after all still like LT or NR bandwidth wise, 10 mega is enough. But full duplex, it needs to be full duplex. Even, I mean, TDD theory is not full duplex, but I mean, the way we, we handle that and UHD handles, it's, it's, it needs to be full duplex. Yes. But no other, other specs there. Yeah. Yes. Yeah. Yes. And in fact, we are looking, looking at this. So those are used here that we, that we used, that we, that I showed here. This one. And so this is a, where is it now? This is a, a so-called ORAN 7.2a IU. So it's, it's basically doing, it's basically doing the pre-coding in the, in the, in the, in the du. So it's not doing the pre-coding. So if you had a massive MIMO1, what you wanted to do is to send all the layers to it and then compare it with the, with the, with the pre-coding coefficients and let the IU do that. And that's something that you can, that you can do with 7.2b. So if you have an IU that supports that and does speak an ORAN, ORAN open frontal, you can, you can do that. Okay. Thank you very much. Thank you.
SIDLOC: Spacecraft Identification and Localization system
Hello everyone, I'm Manolis from Liber Space Foundation and today I will present you the seedlock which is a new telecommunication standard that we started to propose in order to identify and localize satellites that are orbiting the Earth. So before I start, are you aware of the TLEs, two line elements? How many of you? Okay, a lot of people. So TLEs is two lines, two text lines that describe the orbit of an orbiting object. Okay, so what is the problem and the motivation behind the seedlock? Those orbital data are coming from limited sources, they are public but not Libre. So these sources are mostly military, so we have one source that is US military and France military and they announce them publicly but of course we have already restrictions that the military has. A very distinct limitation is that they provide public data only for space objects that are larger than 10 centimeters and someone may say okay, 10 centimeters are enough for satellite but I know some very annoying people behind there that they build satellites at smaller than 10 centimeters. Actually this is a cubic platform that we flown a year ago into space and with a firefly mission and it was a very very good use case because the second stage of the firefly rocket failed to deliver the satellites on the correct orbit, so it was 100 kilometers less if I recall correctly. So the orbit was not the one that we have prepared for our mission and it was changing very rapidly because due to the drug. Okay and the traditional ways of getting the orbital data rely on radars. This is the space surveillance network of US military that has few radars across the world, they are quite huge okay and in order to get a new orbital formation for your satellite the satellite must pass through the aperture of such a radar okay and again if it is small it is not guaranteed that you will get your orbital data for your satellite. So the idea of seedlock is to put a tiny beacon on a space object for example a rocket body, a satellite or anything that you want to fly and by transmitting specific signals you utilize existing ground station networks like the satnags or whatever you want, do the processing, extract the frequency offset and due to the Doppler and then by making cure fitting techniques you can estimate very accurately the orbit and this is something that we have already done with on satnags using the STRF, I don't know if some one of you are available with STRF okay and it has proven that it can produce a quite accurate orbit. So the goal of seedlock is to have a very low power transmitter on board the satellite or a space object to have low cost let's say reasonable cost for a space mission and also to be zero maintenance so you attach this beacon on your satellite and you don't have to do anything else of course all of the satellite operators want to in order to integrate such a solution wants to be as easy to integrate as possible so that's why we miniserize a lot the beacon and it's this form is for example is three times larger than required I will talk about later why and as I said before it can take advantage and also existing ground station network the other characteristics of the seedlock is we envision to use a UHS band and 401 to 402 especially this is quite dangerous band let's say because the meteorological satellites use it so there was an independent research from MISA if we can use it and by taking into consideration the power spectrum that we produce on such frequency it seems that we can coexist with meteorological satellites the band that the seedlock uses it's at around one megahertz and it it transmits a bpsk signal that is generated by a die sequence spectrum system we use the gold cores I don't know if some of you are familiar with them they have excellent cross-correlation properties and especially and more precisely we repeat 10 times its gold sequence it has to do with the decoding procedure that's why the the repetitions and the effective rate is only 50 bits per second per second it's a frame of seedlock has a duration of about 12 to 14 seconds so quite low in data rate but quite resilient in the lowest in our scenarios the TX power is very low it's only 25 dbm so here describe a bit the process of the TX of the spreading you have a cheap rate which is much higher than the data rate as I said before this is one megahertz and you just so you perform the the XOR of the gold sequence with the initial data to produce a absurd noise sequence that is quite higher in bandwidth so this is the the transmission flow graph that we use in order to test and view the simulations actually it's quite it's quite simple most of the blocks are the the the ignore radio core blocks and we have created two additional ones which is the spreading and and the spreading and one that actually compiles the frame of the seedlock very very simple on the other side however as you may know everything comes with a cost okay so the the RX procedure is quite expensive and potentially intensive so we have to deal with terms of doing this in real time or or not have enough sensitivity to identify something and then start doing something more let's say robust so there are three major three major techniques to to to decode the the dss single it's the auto correlation one which from the from which we can get directly the frequency offset without any issue but it is the it doesn't perform well in low westerner environments okay the other one is the cross correlation with coherent integration which is i think the the most sensitive one but right now as we stand as we stand at this point we cannot do decoding real decoding in real time in nice seven so this is a very big deal and there is also the cross correlation with non-coherent integration which is something in the middle so the idea is to use to use auto correlation only to say hey probably a seedlock transmission is is currently active and then switch to cross correlation with non-coherent integration to extract the frame type and whatever it is needed to to be extracted in order to get the full the full frame and then offline if the snr is not enough switch to cross correlation with coherent integration in order to get the the whole frame so seedlock supports three different types of beacons the the minimal which is a super minimal amount of amount of information in order to identify which satellite you are currently receiving the full which has also location the location of the satellite if the satellite supports uh uh location estimation through a gps let's say and they they integrate it which is all it's similar to the full but uh the the satellite can pass can piggyback some uh minor information on the data on the on the seedlock frame itself so if everything goes wrong on the satellite operators could still have a clue if for example the satellite was never pinked the the seedlock beacon so it could be a clue that the it was a catastrophic failure of the satellite let's say these are the full beacon fields most of them most of them are related to the position the most interesting of them are two it's the sink word which is all all ones and the the satellite ID which is a unique identifier for its space both it's orbiting both let's say the the all ones is a bit weird for a preamble but imagine that you using a spreading you spread these all ones so you have a repeated pseudo noise sequence and if you if you keep it the same you can integrate it on on a larger time window so you could potentially identify easier the existence of siddler so the reference design of the siddler beacon uses an stm 32 l4 series mcu which handles the main control of of the order of the hardware and the peripherals we have the for the for the single generation we use the 80 86 rf 215 ic in iq mode so this particular transceivers can be set can be configured to operate in iq and the the whole dsp and the iq generation is done on the ecp5 fpj that we have on board so all of these components are inside this cube regarding its size it could be far smaller but we got an opportunity to fly with a ryan a ryan 6 the new rocket from isa and a ryan and actually they mandate to have a deep sub nine connector so regard it is huge compared to the to the to the whole stack and manfos do you recall we had the m5 screws yes they they they required m5 screws which was yeah on this we use mv2.5 2.5 yes yes okay so the the the beacons could should be larger of course this is a flight model this will this will actually fly with ryan and it had passed the vibration testing this is the vibration pod that you that you shake it and in order to see if everything holds and of course the divac procedure so we we did also a deployment of the antenna inside the divac in order to to see if the the beacon can deploy its antenna in the cold plateau at minus 20 around minus 20 and actually we are quite happy with the sitlock because it's uh okay Libre space foundation produces everything in open source and open hardware way but our previous comms board had this limitation from xylem so we used in xylem's fpga so we had to rely on vivado in order to build the the beacon but this time we selected the properly the hardware in order to to use to in order to produce everything from open open tools so yeah we are quite proud about it so the the the projects involved is of course key card for the pcb is new radio for the ground station decoding and simulations free card for the mechanical uh zeffy arthos plus ecp 5 for the uh on board controller and uh liptix plus yosis plus trellis and xpnr for the ecp 5 bit 3 so that's all for me uh you can find more information on our github repositories everything is there from pcb to to software and also you can join our matrix transform more information from that's all for me thank you a lot so any questions like the box or is there any like special you know paper or for quite for example students who wants to uh implement it in their own designs so what about the about widths and all that so the question is uh if uh someone can uh can make it out of the box and use it for its own so yes you can do that it's everything there uh but of course you have to respect the the frequency um the the frequency that we'll use okay so for this particular uh beacon it's uh at the frequency that uh i i told you before it's uh for uh 401 uh to 402 thing is that this uh i'm at all for our frequency basically so you chose it from your project i mean from your system as well not not really because we would like to to stay at uh ucf okay because uh it is a compromise between uh the Doppler and Doppler shift and um and path loss okay so ideally we would like to have to to to to to use higher frequencies because the Doppler there is larger so your quantization there and the Doppler estimation is less okay but then you have the path loss is larger so it's a chicken a problem let's say yeah yeah this space sorry hmm yes the question is uh who is doing the uh the spacecraft id coordination that's a very good question um actually there is um uh there is an idea to to to use the ui d's and the posture of it take the 76 bits of it and place it at the supply to identify it um but for now we focus mainly on on the modulation and post processing uh staff and i think that if we see that hey this is a valuable way to get uh accurate orbits and and rapidly then some we can we can think about an authority or registration form that you get an id so you know that in order to be sure that it's unique yes yeah so uh i see you're using the stm32 uh but there's no uh like red hard equivalent i know that you are using only cross components and that it's maybe not the core but there are other like microcontrollers out there that have like a radiation hardened equivalent that might be interesting for the future if you want to equip like geocentrelites for example um so uh what is there like a version two coming or something like that when you were thinking about that yeah so the question is uh if the stm32 microcontrollers is the only choice or we can move to other other choices that are also radiation hardened um so we there was no big reason to to to use the stm32 um only for convenience let's say because we are familiar with them uh or we can easily find uh artos that are already support them uh but as i said this is um this is the demonstrators the demonstrators demonstrators sorry uh we want to let's say focus on the modulation around the rf part to see to show that this is feasible and it can scale also uh and not on the hardware itself so i'll focus mainly on the protocol specification rather than the the actual hardware so the the protocol will be is open source so uh if someone wants to implement a beacon uh with red hard components he is free to do that or no more questions thank you again thank you
HopsFS FUSE Mount
Good morning everyone. Thank you all for joining the first session of today. You're at the software defined storage dev room. We'll have storage talks throughout the day until I think seven tonight. Fabio is here to start with his first talk. Enjoy. Thank you very much. Thank you. Yeah, welcome everybody. I'm Fabio. My colleague Salman was supposed to be as well. Unfortunately, he couldn't make it, which is really a pity because he's the guy who actually built most of this stuff. So yeah, but now you've got me. So yeah. Well, I'm going to talk about Observice a few months, what it is and what we built it and how we use it essentially. Before doing that, I just want to give you a little bit of a context on what is Observice in general and what it is used. I cannot have a feeling that not a lot of people know about it. So Observice is effectively a patchy HFS compliant distributed file system. So the code base was forked out of the patchy HFS and we made a couple of different changes over the course of the years. Some significant, some less significant, but ultimately there is mostly an architectural change and also some security differences between what is a patchy HFS and Observice. Ultimately, the APIs are the same. So the clients are compatible. So if you look at the actual implementation architecture of Observice, you can actually recognize some of the processes, some of the entities that you can also find in a patchy HFS. The difference is that the behave is slightly different. So in HFS and Observice you have the name notes. Those are responsible for interacting with the clients knowing the state of the file system, responding to the clients, giving access to the files and so on. The difference between what is a patchy HFS and Observice is that Observice, we took the metadata information, so the structural file system directories where the blocks are located, permissions and so on, and we moved it outside the memory of the of the name note process into the into what we call RONDB, which is essentially a distributed in-memory database that was forked out of MySQL and DB cluster. Specifically, MySQL and DB cluster, MySQL and DB cluster, if you're not familiar, is not a cluster of MySQL, but it's actually a different storage layer as well. So it's basically a distributed in-memory layer that basically stores the processed data, and you can do like really nice things. You can do like scaled up and down online. You can do upgrades and stuff like this, and it's in memory, so it's extremely fast. So moving this metadata information from the actual names on to the database allows us to basically have effectively stateless name-node dumps of S. So we can have as many as we want. We can take them down. We can stuff new ones, allows us to basically be way more flexible in terms of operations, in terms of management of the entire file system. On the bottom part, we have to say data management layer. Here we have process like the data nodes, right? It's the same concept in HFS. Data nodes are responsible for receiving streams of data from clients and storing them in local file system, and then give it back essentially, right? To deal with that kind of processes. Here we actually made some notifications, some architectural changes, in a sense that depending on where you deploy OpsFS, depending on whether you're deploying on-prem or in the cloud, you can actually decide to store the actual blocks on normal disks, like on the machines, but you can actually store them also on the cloud object stores. So you can store them on S3, you can store them on GCS, you can store them on Azure, Block Storage, and so on. So the data nodes themselves also become stateless essentially, right? And in this case, they act, they have two roles. They have a role of basically, yeah, not breaking the protocol essentially of the HFS protocol client, and also interacting as a cache, right? So every time you write something and then this is flushed down to the storage layer, so to the object stores, but it's also cached into the data node process themselves, so that if the clients keep writing and reading the same files, then obviously you don't have to go to the object stores, improving performance quite a lot. And while the last changes we made around security, and we're going to talk about a little bit later, but essentially that's OpsFL in a nutshell. Where it is used? Well, OpsFL is used as part of the broader platform which is called OpsWorx. OpsWorx is essentially a data platform for machine learning. It allows you to basically provide a bunch of different things which probably you're not interested in, to basically manage what we call features in machine learning. So essentially, you take a bunch of data for a bunch of different data sources, you extract signals to train a model on, and then you're basically storing those signals because the feature engineering process, the signal extraction process is quite compute intensive, so you want to store them and reuse them across multiple models and so on. And we have a dual system architecture, we have offline data for more like training new models and doing like batch inference, and then we have some online data for doing real-time operations. The OpsFL part is actually this, it's powering the offline feature store part, which is basically saying store all the data, historical data across many, many years across more models that is being used for training additional models and so on. Now, the problem that we're facing and the reason we started working on the on the on the on the fuels process then for OpsFS is the following. So the entire platform, the OpsFL feature store platform stands basically in between two different worlds, should we say, right? So on the other side we have like more traditional data engineering world where we have like application like Spark, Flink, maybe Beam and so on, which they have been built from day one essentially, supporting the OpsFS, like the HFS API, supporting the OpsFS API as well, right? So you can plug it in and they work essentially, right? On the other side, on the consumer side of the platform, well you have the entire data science world, right? And that's mostly basically built on top of Python libraries and you know, like they have been built relying on the fact that it is a log of our system and they don't necessarily interact with these different systems, whether it's Apache CFS, but also they don't maybe interact even with like object store or stuff like this, right? They only maybe read data from a local fast system, right? So we can actually have, if you take the libraries on the left, so if you take the libraries that are, let's say, data science library and generally the science processes, we can actually have, you know, two separate scenarios, right? We have a scenario where the libraries actually support what's called like Liby HFS. Liby HFS is essentially a wrapper for the, it's a JNI wrapper for the clients for the fast system and that allows us basically to actually interact with the fast system from a bunch of different libraries like PyArrow, Pandas and a bunch more actually do support reading from the, reading from HFS or OpsFS through the Liby HFS library. The problem is that even this scenario still requires access to a JVM, right? And in a world of their science where, you know, people do, like used to do like pip install a library and that then looks like a bunch of, like zip file and extracting on your local laptop, you know, configuring the JVM, bringing down the JVM, bringing down the OpsFS clients or HFS clients, it's quite, it's quite cumbersome, right? So that, that's kind of, it works but it's not the ideal case scenario, right? The other kind of libraries or tools that are essentially in the other campus, say, are tools that do not support Liby HFS at all. So one of the functionalities that you have in OpsFORX is basically, well, I want to clone a Git repository in OpsFORX so that we can actually run it and, you know, you can't just run Git on top of HFS or OpsFS, right? So there isn't really like a nice solution there. And so we started working on implementing a FuseFile system for OpsFS and we actually built it on top of some of the existing work that is open source and it's out there. All the work that we built on top is also open source. So the libraries and the entire application is written in Go. So that obviously removes the need for the JVM and it's built on basically three different projects. The first one is the FuseLibratesOff, so implemented interface for the FuseFile system. The second one is the OpsFS Go client, which is actually built on the HFS project from Collymark, which essentially implements the entire protocol for communicating with the HFS or OpsFS but using Go, right? So you can actually interact with the file system without the need of the JVM and then basically bring everything together in the OpsFS Go mount project, which is actually a forked from a project that Microsoft stopped working on, the HFS mount project. By the time we forked and started working on it, it was pretty much a POC. I think it was only allowing like read operation or something and then some on the team kind of expanded really nicely to be able to support much more. Now, the implementation of this solution actually has a bunch of different challenges and there are two essentially group of challenges, right? One is the API difference, what is the API that OpsFS and HFS actually provide and what the API that, you know, the POSIX API that are required by tools like Git and so on. And then the second one is the different, complete different security model, right? So we're going to look at how we bridge the gap between the two in the implementation essentially, right? So the first part is HFS is append only, right? So when you create a file, you cannot go in the middle of the file and add additional content, right? You can only append things to the file, right? And so this doesn't work if you open Veeam and you, you know, editing stuff around, then you can't actually just, you know, like write directly to HFS OpsFS. As I said, it doesn't support random writes, it does not support multiple writers. So when you open a file for write, you actually get the list on the file and you have the only one that can actually write there, right? And nobody else can actually write, which is not the behavior that you expect on, yeah, like from the POSIX API. The other aspect is the block size is quite huge. So that's like, you know, the blocks on OpsFS and HFS are configurable. You can make it as small as possible, but like the default behavior is to be quite big. It's like 6428 megabytes. Because the axis button and the write buttons make it that, it's much more performant to do that. And then we're going to talk about the security model in a second, right? So how does, how does the system work, right? So we have two scenarios, essentially. We have read scenario, read only mode, or like read and write scenario. For the read only mode, it's actually quite trivial. What basically happens actually, I can show you this diagram. I don't know if you can actually see from the back. But essentially what happens is that the process actually talks to the OpsFS fuse mount and asks, I want to read this file. Because the API from HFS and OpsFS allow you to basically, you know, do six and read and stuff like this. They are compatible. We can just forward the request to the remote storage essentially, right? So you open a file, you open a file, you read a set amount of bytes from a set amount of position. That is mapped to operations directly in OpsFS. So that's, yeah, pretty, pretty, pretty trivial to implement. The writing scenario is a little bit more complicated for the reasons I was mentioning earlier. So at the i level, what happens is that the remote file gets copied on the local file system. And we write it in a staging area. And so the processes like the Linux processes are actually going to write to these, to these like staging replica of the file. And then when either the file is flushed or the file is closed, then we actually upload the file back into the remote storage. So it looks like a little bit like this. You open the file and we open the file. You actually get back, you don't get, we don't download the file immediately, even if you ask for writing and open the file in writing. And the most, like, the reason is people are lazy essentially, right? In the sense of like when they open a file, like you say, you want to check out, you want to look at the file, you know, you might have opened Vim and open a file. And then at that point, you know, you don't want to write anything, but you have opened it, read and write. Because file and obfuscation are generally pretty large, we don't want to, you know, keep downloading random files for absolute, you know, reason, right? So what basically happens is that we delayed download the file until the first write comes in. When the first write comes in, the write is intercepted. And we basically, the first thing we do is basically download the file into the, into the, into the staging area. Well, from there, we can actually do the operations, right? Now, all the operations that happen, like all the read operation, all the writes operation, do not go to the, to the, to the remote storage anymore, but they actually act on the local version of the file. And, you know, this allows us to basically, you know, write random stuff at a random point in time, so random place in the, in the file, allows us to basically have multiple writers writing files, multiple readers and so on. So they all write as it was, like a local, a local file essentially, which, which it is, right? What happens is at some point, someone was going to call a sync or, you know, call a sync and then close the file. And what's going to happen is that when the sync happens, then the file is then back uploaded into the, into the remote storage, right? And that basically allows us to, yeah, add the file in, in OpsFS. When the last client, the last write client actually closes file, then we can actually remove the file from the staging directory, right? So if you have like five different clients working on it, and then, you know, one removes it, then we can let the last one closes it, then we can remove the, the, the file from the, from the, from the staging, from the staging directory. Now, the last part, the last difference in terms of like, compatibility between what the, what the, what the, you know, OpsFS and HFS API are, and the local file system are. Can I take a question a little bit later? Okay. That's fine. Okay, thanks. Yeah, so in, in that regard is, yeah, so the way this security works on, on OpsFS is slightly different than if you're familiar with HFS. You might be familiar with Kerberos. In Opsfs, we don't use Kerberos, we use certificates. So every user in the platform gets a certificate or actually gets more than the one, but it doesn't matter. Essentially, in the certificate itself, you know, we have the information about who the user is, and every time the user needs to do an operation with the, with the, with the file system and any other service in Opsfs, present a certificate, and the certificate is obviously verified based on the chain of trust that is established within Opsfs deployment. So this basically what happens at the high level, right? So every time the Opsfs fuse mount needs to talk to the, to the Opsfs remote storage, it has to present a certificate essentially. And this is something that we, we control in the way that we actually use the Opsfs fuse mount in, in Opsfs works. When we set up the mount point, we actually, you know, make sure that the mount point is set up with the certificates for that specific user so that, you know, if I, if I need access to a specific directory from a Python library, Opsfs set up the mount point and, and provide provision the certificate for, for, for my user so that someone else cannot, cannot necessarily use the same mount point and the same, and the same, and the same certificates. So this is controlled at the application level, not the, not necessarily the storage layer. All the operations that happen on this side, while the authenticated based on the certificate, the problem happens from, let's say, the Opsfs fuse mount to the, to the user side, to the, to the, let's say, Linux processes essentially, right? So what happens here is the following, so there is a, there is a mapping going on between the users in Opsfs and the user on the machine, right? The problem we, like the problem we had, the problem that, like this setup has is that you end up having a lot of users on Opsfs, right? My, our deployments might have, like, you know, 5,000 users all on the same deployment that, you know, results in way more groups than, than than 5,000. And so we cannot spend time and create them all the users, all the, all the, all the groups in, on the, on the machines. So the way we actually, the way we actually work in this situation is that when the Opsfs application needs access to, to, to, to, to, to mount point, a provision that mount point, it also makes sure it doesn't mount the entire file system, it mounts a specific sub-directory, and when it mounts a specific sub-directory before mounting it, it provisions the users that are within, that, that, their own files within that specific, the specific sub-directory. And the way it knows that is because in Opsfs, directories have a specific meaning, they are organized in a specific way, and so Opsfs application knows which users are, have access to specific sub-directory. So before actually, you know, mounting the, the file system, we actually provision all the user, provision all the groups, and then what basically Opsfs mount does is that, well, you know, a file here is owned by user Fabio, so there is a, there is a user on the machine, user Fabio with a specific UID, so the user ID of that file is going to be the, the same essentially. And again, the, the provision of the user is controlled by the, by the Opsfs application when they, when they, when it, when it's, it's necessary essentially. Now, there are a couple of, I would say, unsupported capabilities and things that we plan to address in the futures, more like limitations essentially. One thing I have supported at the moment are links, both out and soft links. Opsfs has supported for soft links in the background, but we never really used them in Opsfs, so we didn't have support here in Opsfs mount either. The challenge we have is around the users of these caches. Yeah, so essentially, you know, if you, if you're working with a, with a local file system, the kernel, you know, there's a pretty aggressive, you know, caching of the data and so on. The problem we have is that Opsfs is a multi-tent platform, so we have multiple users working on the potential of the same files, and so caching the files becomes a little bit problematic because if you have two users working on the same file, then like the different mount points are not going to be able to talk to each other, say, A, there is something that's changed here, and you need to reflect that in your cache. So at the moment, the caches are kind of disabled, but we're working on, on, on, on some solution to be able to get notifications and, and, and figure out that files have changed, right? And the reason, the reason they don't know is because there's different mount points, because each mount point is not certificate for that specific user, right? So the users are not sharing the same mount point, so changes, one user making changes is going to talk to a different mount point, and the user making, making changes essentially. The other thing that happens, if you, if you, if we go back to the right operation, when we upload the file, we upload the entire file. So there is no concept of, you know, uploading a specific block. The, the, the HFS API and the OPSF API does not allow you to basically say, you know, I know that, like a specific block has changed, I'm going to just replace that specific block, right? So the, this, this operation, if you're working with very, very hard files, then, then might become a problem. For the use cases that we use OPSFS mount within the OPSF platform, this is not a problem. Users are working with, like, small Python applications, or like generally speaking, smaller files, or when they are dealing with bigger files, they are dealing more, in more like a, in a, in a read process, not necessarily in a, in a writing, yeah, not going to end up writing a parquet file in beam, like several gigabytes of parquet files. So that's kind of it. That's kind of where we, where we stand. If you have any questions, that's, that's, you know, I can take them now. And thank you very much. Do you and practice have problems with applications ignoring the return value of close? The return value of close. No, so the question is, if we have problems with applications ignoring the return value of close, not at the moment, no. We had, with the way we use OPSFS mount is basically, we, we, we, we, we mount inside of containers, for instance, when you're doing, if you're running Jupyter notebooks, or if you're running great applications on, and we shut down the entire process, usually shut down the entire container, essentially. Right? So we, when, when, you know, the files gets closed and everything gets closed, um, shut down the entire mount point, essentially. That's because, not necessarily because it's necessary, but that's because usually, that's the use, like user experience and that people have in general. But you can't guarantee that the upload actually worked. You can't guarantee, come again, sorry. The last step. What happens if the upload file fails? If the upload file fails, try again, and then eventually, yeah, it's a simple, we give up, yeah. Yeah. So it seems that the retro machines that try to write open for writing and write through the same file independently, so obviously one of them wins and takes a lease and does download the file modification and upload. Yeah. What happens with other machines, or what does the process that tries to open the file for writing at the same time and write to each and so on? In this case, the last right wins. Like that's the problem that we have here, right? It's basically saying if we have, like multiple machines mounted in the same directory, they can't, like, and we're writing on the same file. I, the mount point doesn't, doesn't know that, like, you are changing the same, like my mount point doesn't know that you're changing the same file. So we upload the file, and then you upload the file as well, and your, your, your right space is essentially wins, essentially. And then, can we get? No, nobody's waiting for another one. But I'm like, you end up in like, in, you end up in a weird situation where I might not be able to see immediately your changes, and so I might be able to, I might re-resave my file and re-re-reprove my file, so then your changes get, get, get covered written again. So it becomes a little bit, a bit, a little bit like this. Yeah. Yeah. Yeah. Sorry. Uh, good question. Uh, in general, yes. Um, you might, like, you might not have, like, the security part of the, we didn't implement the CalBros part of the, the security part. So if, if you have a secure cluster, then you're probably not going to be able to, to use it. Or, yeah, you, you implement the CalBros part. Yeah. Yeah. It was a question in the back. No, there, there, there is, there are processes, yeah, so, yeah, thank you. Uh, the, the question is, uh, whether or not there are processes sharing the file. Uh, there are processes sharing the file, but you have, the problem is that you have multiple users writing on, on the same file, on different processes that do not know about each other. Yeah. It's also, you know, it's like the, the, the, the different mount points are, uh, that the, the question was about the directory metadata. It's also, it's like it's independent with the, it's independent with each other. They don't share mount points, no. Yeah. So if you create a file, um, then, yeah, but if you, if you do a less operation, then we go back to the, to the file system and that is, is, is reflected. But so, but they're not necessarily relying on the mount point, relying the obfuscates mount, going back to the, to the, to the, to the remote storage to get the new directory structure and so on. Yeah. Um, with this, uh, stage file writing, yeah. I mentioned that improves read performance for the client too, right? When it's actually a downloaded local file and it, do people use that? Like do they do a little write and then just get quick downloads? It, it, it does. Um, it does. That's, that's one of the, all that, like that's one of the other, that's one of the other reasons we kind of, the question, sorry, yeah, sorry. Uh, the, the question is around the, um, read performance when you're reading a file that you have, like, uh, like stored locally in the stage directory. Um, it does. Um, we don't have a specific number, um, but I have like some, like user experience with that. Um, when you have, um, when you have files that are on, on the remote storage, uh, even if they are, like, especially if you have smaller files, um, like maybe like a Jupyter notebook, this JSON file, like, you know, maybe a megabyte maximum or something, um, then it's the, the, the override of going and fetching it every time. It's, it's quite significant. If you have it locally, it's, it's, it's much more, uh, you can see it much more reactive. Yeah, there was a. I thought you said, uh, you, you delete the station file as soon as you close it. Yes. No, but if you, if you, if you, if you, if you have it open, uh, when you, when you keep writing it, then you, when you read it, you're not going to the, to the remote storage, you're reading from the, from the local, um, from the local, uh, yeah, from the local, from the local staging directory. Right. So if you, the only, the only, the only, the only time is, is if you're reading, if you're doing some read only operations, essentially, then at that point we don't download it, uh, mostly because, uh, in general, they'll be working with like pretty large files. And so the downloading process might, might not be necessary, right? Because if you open, let's say a parquet file, what happens is that you go at the end of the file, you just read the footer to figure out the schema, figure out where you, you need to go. And if maybe it's like a four gigabytes parquet file, you don't want to download four gigabytes to read that maybe a couple megabytes of metadata, essentially. Yeah. One more. Thank you very much. Thank you.
External Rook Ceph Cluster
Thank you so much guys. So is my audio, can everyone hear my voice? Perfect. So I would be talking about external Rooksef cluster. So people might not be aware like what's Rook, what's Sef and what thing I'm talking about. So I will tell you to the intro first about what topic is this and then we'll deep dive into this topic. So I am Parth Arora and I work for IBM Storage as a software developer and closely with Rook operator and I'm one of the lead approver of that repo. So anybody like might be storing data applications, creating applications through Kubernetes, but everyone knows like Kubernetes is designed in terms like which doesn't talks much about data but anyhow any application we can talk about there is storage for that and we need to store data in it. And for now if someone wants to store the data that is coming from Kubernetes application, we just talk about cloud providers and talking to the cloud and they might be doing anything we don't know at the back end. So why not to bring the storage into your data center and traditionally like there are some limitations also with cloud providers like number of nodes, how we can scale up the different AZs. So why not to have a native solution with your own Kubernetes cluster. So and also like to manage that cluster how natively your Kubernetes application manages and also manage that storage in the same way. So here comes the most trusted platform CIF that can like make the storage available to you in the form of Kubernetes, how it will be but first I will talk about CIF, why CIF. So it's like it has a lot of enterprise already enterprises using from like trusted platform from past 10 years and it provides all the storage at the same time the block file and object you name it and it provides and its open source and it has so many features its resilient, it's configurable, it's more disaster proven and it's consistent and provide like like give your data a safe point of view. And so CIF was designed even back when Kubernetes was introduced. So to bring the storage to the Kubernetes world we have designed Rook and this is how it was born to bring the CIF to the Kubernetes. It's the management layer so it helps in installing and managing the state. It's not just like it's there are Python scripts that just install it but it actually manage the state of it. How it does it's a Kubernetes operator which works how a Kubernetes actual desired and actual state works and we can manage to that and how we can define what we need to give the state so that's the CRDs. We can give any definition to that and that thing can be configured and like we can give I need monitoring, I need the block storage so these types of things we can provide through CRDs. And so this is how the architecture of Rook CIF looks like, Rook is the management layer which installs CIFs, the CIFC driver which helps in mounting the port, like the storage to your application port, how it's the native Kubernetes we have CIFC or CIF and CIF being the data layer. And Rook works in two mode, so first is what I talked about bringing the storage into your application cluster only, the same cluster that would be the Rook Converse mode. It's recommended if like we just have one to the single cluster but when we need external cluster that I would be talking about. So first of all what an external cluster is. So external cluster means you already have a CIF cluster installed in which like there are CIF domains, these are the CIF terminologies, the mon, OSTs, RGWMDS, these are some domains running that helps your storage to like store the data. So it's already been there and now comes the part like I'm in Kubernetes world, I need to connect this CIF to the Kubernetes, so how will you do that? So there would be a separate Kubernetes cluster running in which there would be a Rook operator and this FCS earlier and we need to provide data of external CIF cluster to Kubernetes, so how that magic does, so I will be taking to that part. So first of all I will tell why you need external cluster, in what cases like I told you like you can have this same cluster in your native application, so why you need external cluster. So you might be having a big enterprise where you have like different domains of Kubernetes cluster running, like there is an admin cluster that is there and there is a finance cluster department Kubernetes cluster there and you need isolation between them, they shouldn't talk with each other and underline storage you want the same, so you can make use of external cluster in that way. As you see like there is a standalone external CIF cluster that has been connected to different Kubernetes cluster and have isolation of what kind of data is stored and only access that data only, that is internal algorithm that maps the secret and puts your data safe. So this is how you can, like this is one use case when you can use external CIF cluster. So you see like for example you are native, you are already using CIF but you want to come into the Kubernetes world, so that would be one of the use case and the third would be like if you need complete isolation of data, like someone wants to keep the data totally separate, so then you can use that. So but external cluster provides all the three types of data, block file object, there is no restriction on that, so you can make use of that. So this is why external CIF cluster now comes apart how we can install it and how it internally works. Like I showed you like in the diagram we have to grab the information from the standalone cluster and give it to the Kubernetes cluster, so this is how it's been done. So there is a provider cluster that standalone where demons are running, so cluster admin will come, it will scrape the data from it and once scraping the data it will provide it to the consumer cluster and this is the Kubernetes cluster that is running. So that data it will create some secrets config map and after we have the details what all it has been first, like it will give the IP address and other details that we needed and once it's done it will give it to the Roku operator and Roku operator will perform in the external mode this time and will create the Roku resources and the storage class and then we can create PVCs and mount that PVCs to your application ports. So this is the working and how this scraping is done, so there is a Python script we have to give some CLI flags to it and this should be like user defined what kind of flags they are, like it's RBD data pool name or RGB endpoint, so if we give like the specific name of the pool then we run some self commands with this Python script and it will fetch the data. Like the monitoring endpoints and other things and give it to the Kubernetes cluster and once we have the JSON data then we run the import script for it to import it to the Kubernetes cluster and deploy the manifest, these are the CRDs, the definitions, how we want to configure our external cluster. So I will be showing this into them in the demo how this works. And once it's done then we will verify the connection, like we have the cluster running and so how we can check it, we can just get the self cluster and we can see it's in the connected and the health okay state. And then goes like we can go and create the storage class, pool and create PVC, on depending like what kind of storage you need, you need Cephaphase or RBD or RGW. And now comes the time for the demo. Everything breaks so I have recorded the demo. So is it visible? So this is the standalone self cluster where we can see the health is okay and now there are some pools already been there, RBD pools, so these are some self native commands that we can run by this Cephaphase, we list these Cephaphase file system pools. Now there is a Python script, I'm giving using the CLI flags and you can see I run the Python script with some flags and I got the exported data. So this was like external cluster, I export this data, then I will take this to the Kubernetes cluster that's the second cluster, we'll copy this and so in this terminal like there was the Minicube running and I have exported it and now I will run the import script that will use this exported values and create the config and I have to see grids for the Kubernetes. And after that I have to install the manifest, the CRDs, the definitions, I am using the example folder of Rook in which there are already defined some configurations so I am using that only for now. So there would be CRDs.ML, there would be common.ML which have the Rbex, the specific permissions to it, what permissions I need to give and the operator.ML, the Rook operator file and the cluster external, this is the Ceph extender cluster that I will create in Rook, in the Kubernetes side and there is also common external which creates the separate namespace if we want to keep the Ceph cluster in different namespace. So in this I am using a separate namespace for the Ceph cluster and for the Kubernetes cluster I am using the separate namespace in which the Rook operator resides. And now I am checking, I have created all the manifest, now I am waiting like to get the Ceph cluster up and running. So the Ceph cluster is created and so this is the Ceph cluster, internals, YAML if you want to see and now I will describe it and you can see like it is ready. Now the main thing is I have to wait for the Rook operator to get started. So this is how all the parts that will be get started, there is a Rook operator, there are some Ceph CSI plugins, demons that will use to mount the data and yeah that's all. And other demons are already running in external Ceph cluster, the Mons, OSDs in which our data will resides. So the connection is good what I have shown in previous slides, now I have created the storage class, the storage class will be also be created, we can create some others if we needed. Now in this demo I will tell you like how we can create a RBDPVC and store data and how it's disaster proven or how it's how Ceph internally replicates and make our data safe. So now the demo starts for that. So once we have the storage class the Ceph RBD or the Rook Ceph block these two belongs to the block pool and now I am creating this MySQL example in which it will create a MySQL pod and a PVC and a service to it. So what I will do is making use of that Ceph RBD storage class, I will create this PVC and mount this PVC to the MySQL pod and there is a service if you want to see that MySQL service outside the mini-cube you can make use of that service. So the PVC is created, you have the pod, I will wait for the pod, PVC is bounced it and pod is still getting created, the word press MySQL and the service is created. Once the pod is created, it's up and running. Now I will go inside the pod and I will go to the mount path that I have written and I will create a file in it. So this is actual application for example MySQL that might be your actual application there and what I am doing right now here. So this is I am telling like if you want to use a node pod service like this is something that is needed if you want to see are like I am on mini-cube if you want to see the cluster outside whatever you can use this node pod service but the main thing is the data so focusing on that. So I am inside the pod using this command, escape command. So I am inside it now I will go to a certain path that's the mount pod where the PVC is been mounted that is where so while creating the pod I have given this path so I am using that only and now in this part where PVC will write the data so I am creating a demo dot TXT where I will say I will be persisted because of CIF replication and now I will go so this is inside the mount pod and I will go and delete this pod. So this is file has been created I mean so it the pod now I will delete the pod for example there is a disaster this pod has been deleted I will delete this and as it is a deployment it will create again in some other node as the pod is deleted and now I can see the pod is recreated by Kubernetes it is 9 second ago and I will go inside it again to the exact path the CD where live mySQL and I can see that that file should re-exist there where live mySQL and if I cat it so there is still demo dot TXT even if the pod was gone but the file was still exist you can see that and that's all with the demo and quickly back to the slides so these are some new features we have added the RIDOS namespace the IPv6 support and the RGW multisite and we are also going to add some new more features the replica one and support for the topology awareness and also to improve the documentation and these are some community links and thanks for that this is my LinkedIn logo if you want someone wants to connect and yeah thanks for that any questions anyone have what credentials do you need to abstract after get all the information from the subcluster do you need to be admin user or can you also be a regular user so if you don't want to expose adding credentials to a user so can you do actually do this as a user with enough self to make an export of all these information and bring it into your own okay so the question is what all permissions we need when we actually the export the data the client admin is there like which actually gets the data and give it to the Kubernetes cluster so what all permissions we needed so we keep the permissions minimum we can give the admin key ring but we just give the minimum permissions to it to just read and write because we don't allow ROOP to create anything on this subcluster we just allow to read the data and manage it so that's the minimum permissions that was required to CFCSI so we expose that only anyone else okay I guess thanks thank you all
Boosting CephFS Security in Kubernetes: Rook's Intelligent Network Fencing for Uninterrupted Data Flow and Workload Harmony!
Good morning everyone. Thank you for joining this talk. My name is Niels de Vos. I am presenting instead of Ria. Ria was supposed to travel here but got cancelled. So she gave me her slides. I am one of the organizers of the deaf room together with Jan instead of having an empty shot. I am presenting her talk. Hopefully I know enough about the topic to present it well enough. If you have questions don't hesitate to interrupt. I don't mind answering questions immediately. So Ria and I are actually colleagues. We both work on mostly CFCSI but anything that touches it as well. So that includes RUG. It includes other deaf components where we need it. And we make sure that Kubernetes or OpenShift works well with deaf. It is performant enough and in this case also your data integrity is guaranteed as best as we can. Today we focus on CFFS. CFFS is a scale out storage system based on top of CEPF. And I hope most of you are familiar with CEPF. It is quite a nice network file system. We use it for lots of different deployments and our customers use it for whatever applications they have. For example the Kubernetes image registry if you want to put it locally on your Kubernetes cluster is a very common use case for CFFS but it is not limited to container images. You can use basically any workload on top of it. Software engineers we always face a lot of challenges. For challenges we hopefully find nice approaches to solve these challenges. And one of these challenges is that in a Kubernetes cluster a node can go down. Either a virtual machine where you host your Kubernetes worker node or a Bimetal machine that runs all the Kubernetes services. It can go down. That just happens. That just affects. You can't prevent it. What you can do is you can accommodate your environment or your infrastructure as best as possible to work around these kind of hiccups. When a node goes down it isn't guaranteed that your whole node goes down. There are like partial failures. One of the easiest failures to imagine is if you have a worker node that runs Qplit as part of Kubernetes the Qplit daemon is responsible for mounting volumes, mounting file systems, starting containers and whatnot. If this Qplit daemon has a bug of some kind and it fails to respond to any requests from your Qplit management plane. This node is perceived down in the sense of Kubernetes infrastructure. However it doesn't mean that the node actually is down. It's only the Qplit service that is down. If this is the case then all your containers might still be running. If you have like a database running in your container this database might be happily running, reading and writing and accepting connections and whatnot. It might still function. It might write your data to a file system, CFFS in this case. But if Kubernetes thinks the node is down, Kubernetes will schedule this workload, your database on a different worker node. When it does so with CFFS it is possible to mount your file system on multiple nodes at the same time. It's a network file system so it's prepared to do that. For your database that runs on a second node this is very nice because Kubernetes thinks the service is up and running, everything is fine. But now you certainly have two databases running on the same file system with the same directories, with the same files that it uses. One on the old node that is perceived as down or broken because Qplit isn't running there and the other one where Qplit is running happily your container has been started new and it's running there as well. Now this can cause a lot of issues if your application in case of a database you can almost assume that the database uses a file looking correctly and really doesn't want to corrupt your data if another database instance is running on the same file system. In this case you are relatively safe however the database starting on the second node that just got started that will not get your file system logs. So the database will not start to write but Kubernetes might announce that this database is the master database and actually should be able to write but it doesn't because the old one is still running. So in that case you have like a partial outage that's the good case. The bad case is that your application doesn't use logs and the bad case is that the old instance is still writing data. The second newly started instance is writing data as well and they start to overwrite each other's data and that would be horrible because the application doesn't immediately notice and your data is now destroyed. It's corrupted because it's not in sync anymore. You don't know the state of your data and this is a really bad situation. This is not an inherent problem to Cephaphase. It's just network file systems in general and applications that don't use logs that can corrupt data in this way. It's not even restricted to file systems. If you use block storage like an Iskazi disk attached to multiple servers at the same time and you run a virtual machine on top of it, you actually have the same kind of issues. So that's the challenge. The resolution or one of the resolutions that we have with Cephaphase as a network file system, we can fence off broken nodes, network fencing. On storage layers, you have storage fencing. So if you have your SAN, you can actually fence servers on the HPA level and your storage fabric. With network file systems, you can do network fencing. You can see it as setting up a firewall between the broken server and the actual Ceph cluster. For Kubernetes, we wrote a network fence. It's part of an operator. Network fencing is a CRD. So it's an object, a Kubernetes object, that describes what you want to fence. The IP addresses, in this case, the ciders, so the network that you want to fence. You can have a list of networks in case you have multiple nodes or multiple networks on the node and whatnot. These can be fenced by creating this custom resource. Once you create the custom resource, our operator will see that it exists. It will see the states that you requested and will start do the fencing. That's the eventual thing that we want to do. This is one of the CRs, just an example. CSI Adons is a project that we use for enhancing the CSI specification. CSI is the container storage interface specification. That's the specification that is used by storage drivers and let Kubernetes mount a particular volume or create new volumes and so on. With this storage interface, it's rather limited, unfortunately, and it's not trivial to extend it. We have an Adon project, basically, that adds more storage operations that are not necessarily basic storage operations, but more advanced or more management operations, like fencing nodes from the network. What you do is you create the network fence CR. It needs to have a name. That's just the standard thing. You specify which driver you want to fence or which driver is used for the volumes. In this case, it's CephFS. We use RookCef in many deployments because it's just extremely easy to deploy your Ceph cluster with Rook. You can list which IP addresses you want. This is just a single IP address and this is the whole network that you want to fence. Possibly, you can even use it with multi-cluster environments. If you have a single Ceph cluster and you have one data center and another data center and you want to fence off one whole data center, you can also use it this way. You could fence your whole Kubernetes cluster in data center one and then make sure that data center number two is the only one accessing your storage. Yes, so that's it. You pass it to CephFS because if you want to talk to Ceph, you need your credentials. That's it, mostly. Now, the previous one is the manual example. It works nicely, but you still have to do it as an administrator. As an administrator, you need to know that a pod or that a system started to fail. Then you have to figure out all these kinds of details, which is rather annoying. It's already broken. Sorry? It's already broken before you can actually fence it. If you can just reschedule it in a couple of seconds. Yes, exactly. The remark is that it's already broken if you have to fence it. Yes, so often, you might have a little bit of time, depending on your application and so on, but yes, you are in a stressful situation because something broke in your environment and you don't want to hastily put some yaml together and make sure that you have the right IP addresses. If you do wrong IP addresses, you might break your cluster even more. This is something you can do manually, but it's tricky. Definitely, if you're in a stressful situation with an outage or partial outage, then you really don't want to bother with this. So the enhancement for the network fencing that we have is we include support in RUG. RUG already supports a form of network fencing for CepRBD, which uses the same kind of yaml. For CepFS, it's new and it requires a little bit of a twist, a little bit different commands, for example, with RBD. RBD is the image storage, so the block device is similar to Iscasi. And RBD only mainly talks to the OSDs. There's no metadata server, so the OSDs are what you want to communicate with, so you have only a single service type that you need to book access to. With CepFS, you have also the MDS metadata server, and this is the server that keeps, for example, file system logs. So in the first example, where you have two databases running and the database is smart enough to use a file system log, even if you would network fence the broken node, the broken node had the file system log. The second node doesn't immediately get the file system log if the first node completely gets down. There will be a timeout and it might take, depending on the kind of issue that you had, it might take a while. So for CepFS, we actually evict CepFS clients, which is a little bit different than block listing a client from the OSDs. That's for the background information, but for RUK, it is relatively simple to detect if a node is out. RUK doesn't have its own logic to verify if a node is working correctly. There are dedicated projects for that. I think a project like MediCADES is specialized in detecting the health of a node, and if something goes wrong, MediCADES can set a taint on a node. You can also set as an administrator a taint manually, and in this case, you put the out of service taint, and when RUK notices that a node is tainted with out of service, it will start the network fencing for the drivers of the volumes that are used on that particular node. So if your node is not using a CepFS volume, it's not going to network fence this particular node, because that node might still be used, or ideally, that will have the second database instance that should work correctly. This is then what you would see. So if you edit the taint, or if you manually created this YAML, this is what is shown. If you check the state of the network fence objects, and this is the host name, and whatnot, this is the IP resisted defense, in this case, it's just a single one, and it was fenced 20 seconds ago. In this case, and blocklisted from the OSDs, and the client was evicted from the MDSs. So another client can resume operation with the MDS, it can obtain file logs, and it can talk to the OSDs as usual. If you recover this node, you can edit the network fence CR and mark it as unfenced. Yes, sure. The network fence is always a blacklist. The question is if the network fence is always a blacklist, and that is right. So usually everything is allowed, so the whole cluster is basically allowed to communicate with the CEP cluster, the whole Kubernetes cluster in this case. And we call it blocklist now, but yes, it's a blocklist. You can have an allow list, but that's less practical. If you want to secure an environment, then that's probably something you want to do, but for network fencing, we only do the blocklisting. Yes. Basically two operations at the same time is evicting the client and also doing the network fence. So yes, so the question is if this happens with CEPFS, if network fencing is done, there are two operations and that's true. So there are two operations. The first operation is to evict the client. If you evict on the CEP level, it actually already does the blocklisting, but we only evict the client if the client is actually connected. If you want to network fence a client that does not have a CEPFS fast and connected at that time, we still blocklisted additionally just to make sure that nothing happens in the future. Yes, yes. So we try to be as safe as possible and make sure that it's consistent. What the output here is, if it's fenced, then it really should not be not fenced. So even if you did not have CEPFS mounted, it will still fence your node or your whole subnet or whatnot. Yes, oh yes. So the fencing is done on the CEPF cluster and that means that any client on the particular node won't be able to access the cluster anymore. So that's the whole goal. It doesn't matter what kind of client it is. It can be a kernel mount file system. It can be the fuse mount or it can be LibcFFS, which is also used in fuse. So none of these clients will be able to use the file system anymore and that's just the security measure to prevent any data inconsistencies. This is in fact what we do. So we evict the client. We figure out the client on the worker node or from the worker node that is connected has a particular client ID. We figure out this client ID for this particular IP address. It might be more than one and the network fencing will evict all of these clients and then eventually it will also block list the clients and you can check with this OSD block list command if everything was fenced correctly or not. And yes, so this is the summary how RUG protects you or can protect you more from data inconsistencies if you have multiple or potentially multiple workloads running on the same CEPFS file system. It's a very important feature not only for CEPFS but also for other storage environments because silently corrupting your data is not what you really want. This can further be automated with additional operators like medecates and that do the health checking of nodes and if they figure out that the node is not working they can taint the node. RUG then figures out the taint and RUG then starts the network fence CR. The network fence CR is then executed by the CSI add-ons operator talking to the CEPF cluster and the CEPF cluster will not be able to communicate with the worker out anymore. There's a question. So it sounds like it has to be the case that CEPF is in the final position to control whether the next pod is going to be able to use the same CEPFS file system otherwise there's no way to guarantee that the proper ordering occurs. Is that correct? So the question is that is CEPF is the final source or the decision maker in fencing? It is but you also want to fence, you ideally want to fence on different levels but in order to have your data consistent not necessarily only your workloads but if you want to have your data consistent then that needs to be done on the storage level as low as possible. So CEPF is the one that decides who is allowed to use your data and if there is a broken worker node a broken worker node should not be allowed to use it the best way to prevent the broken worker node from using or modifying the data is by blocking it on the storage level on CEPF. Right but my question was neither should be the next pod was able to schedule because even step one that you missed it could cause a new pod to come. So that would be allowed to happen because what? So the first step is that I'll rephrase it. The first step is that step zero is the broken out breaks. Step one is Kubernetes detects some breakage and starts the pod on another node. Right and that's where you are going to. Right in that case you're already late because the next pod might start to disrupt your data and write in the same things and that is correct. Yes, yes so there's still in this case there's still a window where things might not go right and that is in this case a Kubernetes issue and hopefully by automating more things like enabling like medicaid to detect a broken pod or broken worker node immediately tainted and then make sure that it fails over. Kubernetes also has a timeout in detecting if for example, Kuplet doesn't respond or so it doesn't immediately reschedule so you still have time to actually do some operations but you should not reschedule. Yes, it depends on the failure behavior that you see but you still need to be very fast on automating it so if you have automation around it then that's the best but you should try to have a large enough timeout for the rescheduling after a failure. Exactly so the comment is that you should check your timeouts and the ordering should be sufficient so if you have long enough timeouts then you can be very safe and if you have very short timeouts then the challenge on problems is just bigger. It's not always good to have your application up like five nines and sometimes a little bit lower but more safety is encouraged. This was it for my talk. I'm not sure if we still have we have two minutes for questions if there's anything and otherwise I thank you all for listening.
Data Security and Storage Hardening in Rook and Ceph
All right, welcome everybody. We'll continue with Ruf, Seth and Data Security. So I'll hand it over to our speakers. Thank you. Thank you. Well, it's very nice to be at FOSDEM. It's first for me. I'm glad to have made it here. We work for IBM, but as you can hear from my voice, I had a little bit of a tough time traveling. So instead of wearing my corporate circular 223.7 compliant blue shirt, you get my academic look instead, which for those of you that are local gives you a great joke because it took me as a Harvard graduate 10 minutes to find the stairs to this floor. So anyway, jet lag. Let's call it jet lag. So my marketing manager taught me an important lesson years ago, which is it's my practice not to introduce yourself. So the short version is had the privilege of spending my entire career in open source or nearly so. I spent the last 10 years working on Seth. Before that, I was the product person for Ubuntu Server. And before that, I was the very much feared systems management at SUSE. A bunch of other things that I worked on was the maintainer of man for about 10 years. A picture there with the clouds is because they wrote a book on AWS, stuff like that. I think that's enough for me. Hi, everyone. I'm Sage McTaggart and I use they them pronouns. I'm also at IBM and working on cybersecurity. I come from a decision to leave academia and work at Red Hat and then we move to IBM. And in terms of that side of my background, I've done everything from theoretical computer science, focusing on things like formal language theory, oblivious computation to just plain storage system security. I strongly support open source. I love implementing the principles in my work. And I am really excited to talk about sort of the work that we've been doing on a security side within IBM because it's not all writing little toy languages for provable security. It's solving practical problems. So you'll hear a lot about that for me later. So I think this being the SDS Dev Room, we're going to spare you the introduction to SAF and the introduction to Rook. But they're awesome. Let's just sum it up as that. So let's start with security. So security practices are hard in a specific point of the infrastructure. Cherry picking practices without a model of the threat in the attacker is not a viable strategy. The joke usually goes that to make a computer secure, you cover it in concrete, drop it at the bottom of the ocean and all that. But that's not a very useful computer. If you want to protect from all possible threats, you get a machine that is not very helpful in solving any of your problems. So what you do is that you define security in the context of your application and your users. Who are your adversaries? Who is most likely to attack you and you prioritize that? Yes, there are some theories of security that say patch everything, protect from everything. But the reality is if you don't pick your priorities, then you have no priorities. If you are trying to do everything, you're doing nothing. So some of these actors want to cryptologue you and hold you for ransom. Others are just happy to delete your data and cause a disruption. These are very different threats. Script kiddies, who knows what they're up to? Just some excitement. Organized crime wants money back. So you profile your adversary and you define what your threat model looks like and then you start hardening for that threat model. So let's dive right in. The public security zone, let's start from the network. The public security zone in SF network is an entirely untrusted area of the cloud. It could be the internet as a whole or just networks external to your cluster. You have no authority over. Data transmissions crossing this zone should make use of encryption. Note that the public zone, as I just defined it, does not include the storage cluster front end, the SF public underscore network, which defines the storage front end and properly belongs in the storage access zone. So don't confuse the two. The SF client zone refers to networks accessing SF clients like the object gateway, the SF file system, or block storage. SF clients are not always excluded from the public security zone. For instance, it is possible to expose the object gateways as three-year swift interfaces in the public security zone. Next, the storage access zone is instead an internal network providing SF clients with access to the storage network itself. Finally, the cluster zone refers to the most internal network providing storage nodes with connectivity for application, hard beat, backfill, recovery, and all that. This zone includes the SF clusters backend network called the clustered network in SF. Operators often run clear text traffic in the cluster zone relying on the physical separation or VLAN segregation of the network from other traffic. This would not be a valid choice going back to the previous point if your threat model includes privileged adversarial insiders. But it's perfectly fine if you're dealing with script kiddies. So again, threat profile. These four zones are separately mapped and combined depending on the use case and threat model on your actual physical and VLAN network. Components spanning the boundary of two security zones with different trust or authentication requirements must be carefully configured. These are natural weak points. Maybe they're not natural weak points, but they are natural targets in network architecture and should be always configured to meet the requirements of the highest trust level of the two zones or the two or three zones. Connected. In many cases, the security controls should be a primary concern due to the likelihood of probing for misconfiguration. Operators should consider exceeding zone requirements at integration points, which for a storage product is usually easier to accomplish than it would be for something more generic like an operating system. For example, the cluster security zone can be isolated from other zones easily because there is no reason for it to connect to other zones. Conversely, an object gateway in the client security zone will need access to the cluster security zones monitors 6789 OSDs on 6800 all the way to however many OSDs you have and will likely expose its S3 API to the public security zone ports 80 and 443. I think that's right. There we go. Thank you. Hi, everybody. So many people might be curious. We moved to IBM about a year ago. What's it like to move companies as an open source product? And the security of a product, we worry a lot about that. We're going to discuss that, of course. But how do we support it in a realistic way within industry? We can obviously find, we can make a fork of an open source product that's no longer supported, but do you really want to be in charge of maintaining that, updating all of its dependencies? Not necessarily. So I'll be discussing some aspects of IBM product security, how that's been going in practice and in theory, and I'll discuss some of our accomplishments and what we're planning going forward. So in terms of product security, we follow a secure development life cycle with the goal of reducing risk and improving security for Seth and Rook. We suggest improvements. We pen test. We manifest all of our dependencies. We review vulnerabilities. We track weaknesses. We review them at any exploits. We're approving all of our Arata releases, making sure all the details are correct so that anybody, customer or not, can look and say, okay, what does this release fix? And as a result, we've been trying to modify this to work within a different model from Red Hat to IBM. And one thing that we've done is we've manifested and documented all of our dependencies. That's a challenge. I'm not sure if I can officially call it an S-bomb. That would be a legal question. But getting there was a challenge. There are thousands and thousands of dependencies of dependencies, things like that. We've also automated things like security scanning. Customers want clean scans, and this is true also in open source. We get emails on the security list saying, hey, MEND found 200 vulnerabilities, 80% of them are false positives, half those CVEs have been rejected, but what are you going to do to fix it? So we're working to reduce even our lowest risk vulnerabilities as long as it's actually a vulnerability. We're onboarded fully with IBM P-Cert, and we're fixing all of our prior vulnerabilities no matter how low risk. That might seem silly, but it's a common compliance request. And in theory, somebody could find an exploit for even a very low CVE, and then it suddenly is no longer such a low risk CVE. Suddenly it's actually an implemented one. And we're also trying to make our lives easier for the programming team and my work easier as well by automating all most of our dependencies. We'll see how realistic it is to do all, but most is definitely true. This can oftentimes break builds though, because sometimes the code might have been refactored, details might change. It's not just as simple as update this number. So we're building that into our release cycle, and we're starting by prioritizing libraries and dependencies with many CVEs per release, which thankfully they're finding the CVEs, but we want to start with the known ones first. So we'll be doing that. We're continuing to do our work of reviewing existing vulnerabilities, regular security updates, and just improving our code security preemptively. We've gotten a lot of upstream engagement since moving to IBM, people emailing the stuff security list, multiple people with hundreds of vulnerabilities, lots of individual people emailing it with one or two that we investigate and we're signing CVEs. We're doing just a lot more there, and we've been implementing more changes based on customer and upstream requests, such as an update to the OS that the containers are running on. So we're eventually, in addition to all this, going to be following IBM standards to fix our vulnerabilities and ensure compliance. Currently we're fixing all of the backlog of very low risk bugs and vulnerabilities that we had, and we aim to be fully within IBM SLA and requirements by end of 2024. That should just give you an idea of how secure upstream stuff is going to be, because again, these are getting ported between Red Hat, upstream, and IBM. So how are we going to do this, given that we have a very small security team and a slightly larger but still very small build team? The answer is automation. A lot of this work has historically been very manual, very in-depth, and that's just not realistic when we're talking about the number of vulnerabilities, the number of people involved, and the product. By tracking dependencies, we're finding hundreds of things that we might not have even known about a while though, because we're like, oh, well, it doesn't really matter. It's just used in the build process. Actually it does matter. So we're trying to write our own internal software for a lot of this work. We're trying to figure out how to open source it so you aren't just like, hmm, this is some random IBM specific website that you have to log on to their VPN to get our container, and this code makes no sense. No, we want it to actually be available, and that's a challenge, but it's one that we are investigating, and our commitment remains open source. So where we're able to open source any of this internal software, it will be open sourced. If it is incredibly specific, it may not be very useful if it's public, but anything useful we want to share with industry. And we're also sharing that within IBM. For example, one software that we wrote is to find all of our dependencies for our licensing needs. So it's about sharing this ethos of open source within our company of, hey, take the software, you might licensing team, you're spending days filling out spreadsheets. This is something that by sharing and improving on, your work is easier. So we're sharing this open source ethos there. We've also, in terms of automation, we've automated most of our compliance scans via Jenkins and our build pipeline, reducing the work of incident response from a multi-person team to something that can be done by one person as part of their job. And we're documenting all the core parts of the role, trying to define it, and trying to define what a security process looks like. Similar to how we did a secure software lifecycle at Red Hat, we're trying to define now, what does this job look like, what does the documentation look like? And again, if anything is able to be open sourced or is of use to industry, we definitely want to share it. In terms of other new work, again, fixing all CVE fixes, they'll be ported. New collaboration produces many new challenges. But it's overall going really well. Some things that we've done, we've worked to improve the call home functionality and security for IBM support by applying open source principles. If a vulnerability is there, customer smells it, people will see it. Visible bugs are good. When we know a bug, we can fix it. So we're working on using stuff also as a back end for AI. We're happy to talk more, reach out to us, collaborate with us. What do you want to see with our collaboration with IBM? Feel free to discuss with us. I also want to briefly discuss some of our new cryptography work. So we've been working with IBM research here a bit. We've been talking with them about confidential and quantum computing. Now, those are two separate things. In terms of confidential computing, many of you may be aware of what that is. It uses a trusted computing module, so you don't necessarily have to trust the server that CEP is running on. You just have to trust the TCM within it. We're talking with IBM research about how would this best fit in with SAP? If it's viable, we'll be implementing it. With quantum computing, this is a priority for IBM, and they want to make a lot more software quantum safe, for example. So one thing that we're doing there is we're documenting all places where we use cryptography, determining if it's quantum safer because who the quantum safe requirements are coming out this year as are the libraries. We're seeing, are we using public key encryption? Are we using symmetric encryption, asymmetric encryption? Where exactly are we using it? And where can we plug and play in these open source quantum safe cryptography libraries as they become available? So those are some of our goals and things that we're working on collaborating with them with. These are not yet in CEP releases, but this is sort of a road map to the future of collaboration with IBM. And moving on to talk a little bit more about how CEP currently does encryption and key management. Currently, what we do is we encrypt our data at rest. We have a choice, but most people choose to encrypt their data at rest using the Linux unified key setup, aka Lux. All the data, metadata of a CEP storage cluster can be secured using a variety of DM encrypt configurations. Almost all of our customers choose to do that. We implement security best practices by locating our monitor, our Mons on a separate host from the OSDs that ensures anti-affinity of the keys in the data that they encrypt. This means that a driver host is physically separated from its decryption keys, which increases security and is just generally a best practice. Our object store gateway has some additional capabilities, including encryption at ingest in time, the use of per user keys, key rotation using tools like Vault. We support AWS, SS, E, KMS, and others. We have department of defense ciphers certified under FIPS 140-2 as supplied by reline appropriate versions. And for encryption and transit, we have network communication that can be secured by turning on the CEP protocol encryption and messenger version 2.1. We do allow clear text as a backup, but it isn't our only option. You can definitely certify your encryption and transit or use encryption and transit as well. And we also have the protocols where the CEP protocol is physically or logically isolated, but if you again want more security, messenger version 2. The back end protocols are not encrypted by default, but you absolutely can. And it gives it just a teeny little bit of latency. It doesn't really impact the performance that much, especially when CPU overhead is accounted for. Looking at more specific protocols for encryption and transit, S3 is usually secured between the RGW and the S3 client with TLS, port 443. You can also support plain HTTP on port 80, depending on the nature of your data. And if you want it to be secured or not, we have a special case with our TLS termination at HA proxy. And the link between HA proxy and RGW is in clear text, so it has to be located within the appropriate security zone. But as we saw earlier, the security zones are a great feature of Saffron Rook. So that makes it easier. And of course, with your network, you want to follow your best practices such as firewalling individual nodes. You only want to expose a clear list of ports. But assuming that you're doing that, that really covers a lot of our encryption and transit. And let's talk a little bit now about Rook specific and not just Saff. So Rook can use custom resource definitions to chain code many of these settings. We can configure our trust certificates for our RGW SWEB server. Rook supports at rest data encryption, as discussed earlier. And again, we have in-flight Saff protocol encryption as of 1.9. Kubernetes user permission system also applies to the persistent volumes here. Permissions, quotas, all that comes from Kubernetes. Nothing Rook needs to do here. And Rook also supports a key management software in the CSI driver, allowing individual volumes to be encrypted with their own key. This really limits the scope per key. Again, security best practices. So we can follow those like key rotation, revocation, limiting scope from each key. This really limits the scope of our unencrypted traffic, and it also limits the scope of each key. And going on to the control plane, as popular as by Ansible, SSH is used by Seth Edmund, Seth Ansible, and other deployment day one tools to provide a secure command line path for installation and upgrade operations as part of post management. We do this so that the dashboard is usually, unless you configure it specifically, is not exposed to the world. People can't even see our Grafana dashboard. It's great. And also, although we don't want our dashboard to be exposed to the world, it needs to be reachable by the operator's workstation to be of use. So that's our control plane here. And we're doing that with SSH and how we install Saff by default. The manager supports the whole infrastructure. It needs to be reachable on the storage access zone. So we use SSH for this. And you can see some of our details, such as our port ranges and details like that. Of course, the operator can modify this to have it suit your local threat model. But by default, we just try to make it as secure by default so people don't have to think too much about it. And with identity and access, Seth's use of shared secret keys protects clusters from man in the middle attacks by default. A good practice here is to grant key ring read and write permissions only for the current user and route. Limit your client admin user to be restricted to route only. You don't want all users to be root as per security best practices. And similarly with RGW, we want to treat the administrator's key and secret with appropriate respect. Use your number of administrative users. And to do so, we support AWS S3, the equivalent model for OpenStack, and the equivalent model for OpenStack Swift. Again, that helps your keys and secrets remain secure by using these external software for your key management. Your RGW user data is stored within Saff pools. These are secured previously discussed with data at risk. We also, in terms of keys, we can couple with OIDC providers such as Keycloak, and those can be backed with your organization's identity provider for an even more granular role or attribute access depending on the needs for your system and your threat model. So for identity and access, we also support LDAP and Active Directory users. We recommend secure LDAP and we support OpenStack Keystone to authenticate object gateway users in OpenStack Clouds. And for auditing, which is another important part of security, we of course support auditing. Operator actions against a cluster are logged. You need to periodically remove your logs and you can aggregate them to your log management system where's appropriate. Here's an example of what that might look like. You can see where they're stored. We can see this example here. And you know, an action might start on one node, it might propagate to others, and we're still logging everything. So I'm going to turn it over to Federico for the end. And here is data retention. Thank you. Yeah. Alrighty. Let's see if we can get it through to the end. So, once they raise the lead, it generally cannot be recovered for practical use. But like with everything, there are exceptions. RBD is a new facility called the Trashbin that makes use of spare capacity to preserve deleted images dynamically until the space is needed. There are a certain number of days that's elapsed. You can turn this on or off, but obviously this affects user data retention. Similarly, in RGW, RGW is an implementation of the S3 protocol in most use cases. S3 is versioned. So your data is versioned there unless you configure pools to be not versioned. So if you're storing user data in RGW, you need to watch that your configurations are, that your pool configurations are versioned according to the data retention that you want for that data. Otherwise, obviously the administrator can purge the versioning of RGW buckets, but that's one more thing that you have to account for to ensure compliance. Then there is explicit secure deletion. That's the other thing that we usually get asked about. The data is still on the clusters on the disk. And like in most storage systems, you cannot just go and say, I'm trying to overwrite everything to sanitize that. It won't even work on a standard SSD these days, at least not reliably, which is what you care about. In the distributed network system, obviously it's not going to work. So the solution there is doing the right thing from the beginning. Implement at rest encryption. When you want to sanitize your media, you forget the encryption key and you're done. Very easy, very simple. Also, this is other advantages. You may want to sanitize your media typically because you have to return them under an RMA policy to have a warranty replacement. If you try to replace a drive that's being shredded, shot with a shotgun, or done all sort of untoward things to physically destroy it, most likely your warranty provider will not send you a new drive for it. So again, using encryption is the right strategy here. Also, the majority of drives today does this on hardware for you, so you don't even need to manage. Set up an encryption key and set up a process to wipe it when the time comes. Then there is one more scenario, which is when you want to prevent sanitizing of data, aka ransomware. So the most interesting bit in CEPHER is that RGW is a second factor authentication thing, and you can deploy this as a protection against ransomware attacks so that you essentially ask for a second factor for actions that a ransomware attack would use to re-encrypt your data. Hardening is relating to binaries. Hardening options are highly vendor dependent. These are Red Hat and IBM choices. Other binaries from other vendors may be different. Now, we ship with a Silinox on by default, and I don't think that surprises anybody, given that it's our local religion at Red Hat, and we carried it over to IBM. FIPS 140, two ciphers like Sage was saying earlier are supported out of REL, so whenever REL is certified for FIPS 140 or two or now three with REL 9, those versions can be used with stuff. They're older versions of REL than the current ones. There is lag, CISA needs to do the certification, but the option is there. We're not going to go into these hardening options, but you should be aware that they exist, and you should be knowing what your vendor uses. And some bookmarks for you to close. There are some interesting resources on Kubernetes from Acura Security, and from a book from O'Reilly called Hacking Kubernetes. It goes into very much into detail about storage in Chapter 6. That's Michael Hausenblass' previous book. The Data Security and Hardening Guide comes from our product, and it's basically a written version of this talk, so if you want more, that's where you find it. Encrypted Secret Data at Rest is essentially another version of what's in Rani's article. And the last one is to decrypt those binary flags that I was describing one slide ago. If you don't know what the GCC flags look like, or what the kernel hardening options look like, there is a link there with the linker flags for GCC. There is also one, we ran out of space. There is a convenient page on Ubuntu Server project team page with all the kernel hardening options, so you can find what they are. And then you look for your vendor, what the setting for that specific option is. And that's it.
Crash-consistent group snapshots in CephFS for k8s CSI and you!
Hi everyone. My name is Leonid. This is my colleague, Patrick. And today we're going to talk about Snapshot consistency with you. So before we dive into Snapshot consistency, let's discuss consistency on its own. Now here's some data storage, has a bunch of data written on it. Is this data consistent? We don't know. And the reason that we don't know is because consistency is not an intrinsic property of data. We have to consider a system that comprises of an application and its logic and a storage system and the data that is written. And only then, including the logic, we can reason about the data and we can define whether it's consistent or not. So the application is running fine, data is written, everything is consistent. What happens if the application dies? We don't know whether the system is in a consistent state or not. So actually it is possible to write an application and to write a storage provider in a way that by doing some smart decisions during the runtime, then the application can reach a consistent state after restarting, after a crash. This is called crash consistency. Now how is crash consistency related to snapshots? And the truth is that the snapshot, or rather the snapshot that we're talking about, which is a crash consistent background snapshot, is equivalent to a crash. The application cannot tell between restarting after a crash or restarting after you recovered from a snapshot. So let's look at the system. We have an application in storage and it was a poor selection of a storage. We cannot reach consistency in the system. Even if the application is a high quality, well designed app. Same thing other way around. If you are using an industry leader storage provider but the application just doesn't care or is poorly written, you're not getting consistency. If you have a well written application and an industry leader storage provider, it is still a question whether the consistency or rather crash consistency is reachable. And it is only reachable if we consider a contract that an application and storage adhere to and then when they both do things right, together they can reach crash consistency. And the scope of this talk would like to refer to this kind of application and storage as enterprise. There are many ways to unpack this term so bear with us for this scope of this talk. An enterprise app and an enterprise storage from our perspective are those that adhere to a contract. Now what is this contract? Or rather in our case what is interesting is what is it that we need to do itself as a storage provider that we automatically combine with an enterprise app that is already written with this contract in mind and together we provide a crash consistent system. And I remind you we want a crash consistent system because this is what enables consistent snapshots. To understand that we need to understand right ordering. Rights A and B here, they are ordered if and only if. Right B begins after the app has received and processed an acknowledgement from the data storage that right A has been successfully completed. Now it is important to note that the acknowledgement has to come from the storage and not from the OS because usually your applications are interacting with the operating system and it is the operating system that gives you the first acknowledgement after a right. These applications are aware of that. They know that they need to do to use things like flush or direct IO to know that the acknowledgement is originating at the storage level to perform ordered rights A and B. Now that we understand what ordered rights are, let's inspect what the storage needs to do. So we have two ordered rights. Right B hasn't begun before A has been acknowledged. And in order to understand what storage should do or shouldn't do, let's look at different types of background snapshots that the storage might have taken. So it could be that we've taken a snapshot before A and it's a consistent snapshot. It's a snapshot that has no knowledge about neither A or B. It could be that the snapshot already captured A and we know this is possible because there is a window of time when A has already been completed and B hasn't yet started because application was waiting for the acknowledgement. So this is a consistent snapshot. And finally there could be a case where the snapshot contains both and B. This is also a consistent snapshot. What the storage or enterprise storage provider must absolutely promise to the app is that snapshot 4 is not possible. There cannot be a case that a snapshot contains operation B but somehow lost operation A. That's basically the contract to preserve the order of rights. So we're going to ask Patrick to discuss how this relates to CEP. So within the context of CEPFS, we're going to first start looking at how snapshots work. So on the left we have MDS0, managing two trees of interest in the file system, SV1 and SV2 and two clients, client 1 and client 2. So how do we take a snapshot in CEPFS? Well there will be an operation sent to MDS called MakeSnap and that will snapshot a particular tree within the file system. In CEPFS you're allowed to snapshot a particular directory and everything under it, not just the entire file system. When a snapshot is taken it sends a notification to all the clients that the snapshot has been taken for a particular I know that the client is interacting with. And once that's all done the snapshot's complete. If you want to take another snapshot of another volume you have to do another operation. There's no compound snapshot operation. So we send a second snapshot out of the other volume and again notify the clients for any I know that they may be interacting with. When clients interact with RATOS, the underlying distributed object storage of CEPFS, they create snapshots implicitly when they write to the objects that hold the files data. And they do that by including a snapshot vector of the snapshot IDs that have been taken on the files. And those are what is transmitted to the clients in the snap updates. And here lies the rub. If this, with CEPFS snapshots we have eventual consistency. Because what, when a snapshot is taken on the file data depends on when the client gets the update from the MDS. So they're eventually consistent, not synchronous. To really highlight this we'll look at a case study. So here we have two clients in an MDS. Operation B on client two is dependent on the completion of operation A on client one. Let's say this is like a database application, a distributed database. The MDS is starting a snapshot and it sends the notifications to the clients and expecting the apps. Client one initiates operation A after it's been notified of the snapshot. And so operation A is not part of the snapshot. Meanwhile client two has not gotten the notification from the MDS yet or is not processed yet. But it has already started operation B. It was just a simple write to a file. Well operation B is in the snapshot because it processes the notification afterwards. This is a problem and creates inconsistency. Op B is in the snapshot but op A is not. Looking at this another way you may have a utility that's trying to create a snapshot on the file system and it tells the MDS to make the snapshot it does. But then induces operation A on the client. Expecting operation A to not be part of the snapshot because as far as it knows it's already been taken. But that's not the case. Operation A is in the snapshot because the client has not been notified yet of the snapshot. So this is also inconsistent. So the solution we've implemented is fairly common within enterprise storage systems in the industry that are trying to address this issue of crash consistent snapshots which has become a larger thing right now with Kubernetes CSI requirements is to introduce an IO pause. And IO pause ensures this ordering by preventing any operations from going on within the tree of interest while the snapshot is percolating among the entire file system and all of its clients. So the way this looks in practice is op A is started and IO pause is established. Point one is trying to induce client two to execute operation B. But operation B cannot execute because the IO pause is enforced. Looking at it a little differently you know we could have op A and op B happen before the IO pause. They're both part of the snapshot. This is consistent. And then we may also have a situation where op A is sent to the MDS before the IO pause just before the IO pause is established. Op A waits through the entire course of the IO pause. When the IO pause is lifted operation A is allowed to complete and then the notification is sent back to the client that the operation is done and op B is started. This is also consistent. We'll also look at a super operation, a compound monolith called a mix snap, a variant of mix snap which will also establish this IO pause for you. But we'll also look at the underlying mono operations you can do to establish this IO pause. And that will be the mechanism you can use to actually to establish these crash consistencies. So I'll move back onto the approach. Thanks Patrick. So we now realize that all we need to do is an IO pause and let's see how we do it. Now we were considering a couple of approaches and one of the approaches apparently is a monolith solution. We would define some new comment that would mean consistent snapshot and you would configure it somehow and start it off and it could be either sync across file systems and even across a file system which is a CFFS and an RBD volume. If you have multiple different types of volumes configured for your Kubernetes applications with this approach you will still be able to create a consistent snapshot across all those things. So in order to expose this to the user we introduce a concept of QS set and QS routes. So a QS set is basically just a collection of mount points that you'd like to QS your IO to. In the world of Kubernetes there would be a set of volumes that you would like to QS your IOS to. It's reasonable to give users this entity of QS set because you don't want them to chase around all the different sub volumes whether they are QS or not. We're interested if a group of volumes are together QS and that's what we are waiting. So a QS set implements this state transition. Now internally your mount points they map to some path inside CFFS file system and this is where the magic happens. This is where we actually QS in IO and we refer to that as a QS route. We have also thought about the condition where a QS route may be a part of multiple QS sets at the same time because we don't want to interact too much with the logic of automated snapshots like Kubernetes that might somehow involve consistent snapshot with some volume that is part of two different unrelated processes. The way we resolve it is really simple. As long as the route is part of at least one active QS set IO to this route is QS. So let's talk about the API. This is the comment that we're suggesting. QS we give it a file system name. We name our set ID so that we can refer to it later and then you are including as many mount points as you wish into the set. You can also ask this comment to be synchronous by minus minus a weight and so it won't return until the QS has been achieved. Once that is done you can go on creating snapshots. These are regular snapshots. This is the snapshots that you do in CFFS. Nothing changed about those. So we've created three snapshots for three mount points that we've added to the QS set and then we again refer to the QS comment but this time we're asking it to release the pause. If we successfully QS hopefully we haven't done anything else if there was a failure and then the release also succeeded then we know that those three snapshots are consistent because the pause has been confirmed active for the whole duration of this process. And here's your monolith. Hopefully almost for free we're having also a monolith approach so like a one liner for system administrators who don't need to interface with the internals. We're suggesting a minus minus consistent switch to the snapshot. We're changing the semantics of this comment a little bit by being able to provide all the mount points in the same time. And then this is going to do everything under the hood. It's the same thing. It's going to do it for you. Now we have a tool and we can shoot our leg with it of course because we can DOS our application. And we thought about this and we've built in DOS protection inside the QS database. We've done this by implementing two watchdog timers. The first watchdog timer is a timeout. So when we consider the set it's going to spend some time QSing. Why? Because there are ongoing operations, right? And before we can acknowledge QSing we have to let applications finish whatever they have been doing right now. So under the hood the QSing is managed automatically for you over each and every mount point. And so all the mount points have this timeout to reach the QSing. And then if at least one of the mount points fails to reach QSing within the timeout then the whole set is timed out and whichever QS that were achieved are released immediately. Now the next thing, the second timer is the QS expiration timer. And for that we need a QS set that actually succeeded to QS. Now we know that in order to succeed to QS we have seen all the mount points successfully QS within the configured timeout. But then if we forgot about the set or something crashed, something bad happened and we never released it or never cancelled it and the expiration timeout elapsed then the set is going to enter the expired site and again everything is going to be released automatically for you. Why do we have two timers and not just one? And the reason is because you're going to have different considerations when you will try to come up with the values for those timers. The QSing phase really depends on your system. It depends on how many mount points, what kind of applications you're running, what kind of operations they're doing with the storage because that means how long you should wait for the system to QS and you are allocating some reasonable amount of time for that. However the QS state is already on you. When the system did reach the QS state and you have the notification about this then you can say, okay I know that I need to do just a single snapshot so I don't need more than let's say 10 seconds. Whatever, right? Two different considerations that you need to take into account when figuring out these two timer values. This is the API, I've simplified it a little bit in the previous slides, this is the API where we're not going to go into all the details but it's basically a Swiss knife. You should have all the options that you want. And with that let's ask Patrick to discuss the design. So let's just take a quick look at the high level design of the entire system. So here we have an administrative client which in the wild is probably going to be the Kubernetes CSI driver. That's going to be interacting with the CEPH manager which is specifically the volumes plug-in within the CEPH manager. That volumes plug-in will be actually executing the commands on one of the MDSs in the file system. We'll call it the QS leader or rank zero in reality. And then that will be also coordinating with any other ranks in the file system. We'll call them MDSB and C. And then finally the file system clients which are talking to the MDSs. To talk to the volumes plug-in the API will be the regular CEPH command line interface that we all know and love. The API will be exposed at that level. And the volumes plug-in will be talking to the MDSs using the LibcFFS API. The MDSs will replicate the QS database amongst themselves so they all have a view of the same QS database. And then the QS protocol will be used to actually QS the IO and stop the clients from doing IO on a given subtree in the file system. So we're going to actually talk about that part next. So how do we actually QS IO on a subtree? Before we can get into that we'll take a small step back and look at some context and background regarding what CEPHFFS client capabilities are. So in CEPHFFS it's somewhat different in a number of distributed file systems in that the MDSs and the clients are maintaining a cooperative cache. Clients have an elevated status within the file system context in that they also can cache metadata, not just data of the file system. And not just cache it, they can also have rights to mutate that metadata locally without involving the MDSs immediately. So to give a specific example we have here MDS0 has a given file that it's authoritative for 0x19tb.dat and client.1 on the right has a capability for that file. And the access rights that it has on that file, delegated to it by the MDS is to read, write, cache reads and buffer writes to that file. It has shared extended attributes meaning it has a local cache of the entire extended attribute map for the file and it knows that the extended attribute will not change for that file without being told by the MDS. Similarly also for the link count of the file. This allows the client to respond to certain stat calls locally without actually talking to the MDS. Capabilities themselves are modeled loosely after leases, an academic paper I put in the slides and leases are mostly different for having a time-based duration whereas for capabilities within CEPFS they have an undefined time duration. So now to look at exactly how we're going to QSIL. So now we have this issue of clients having these capabilities and maybe trying to continue doing writes to the file or modifying metadata so we have to recall those capabilities. So here we have two MDSs, zero and one and a QS database replicated between them. On the right we have client one with a number of caps for a given tree of interest that we're trying to QS, rooted at SV. When we want to QS, the QS database launches a QS subvolume operation, it's an internal operation on the MDS, it will start that on MDS zero. That in turn launches some suboperations, QS subvolume inode and it will do that on every inode in the given sub tree that that particular MDS is authoritative for. So the inodes are colored according to the MDS authority. So just the first two inodes at the top of the tree the QS subvolume inode calls will be performed on. We'll look at what that does in the next slide. QSDB will also launch it the same operation on MDS one. It'll launch QS subvolume inode operations on the inodes that it's authoritative for. And then once all this is complete, it's done. So what does QS subvolume inode do? We have this, the operation being executed on as an example, OX19DB.dat. We have a client on the right with the capability to read, write, buffer and cache data for the file. It has exclusive writes on the X-Satters so it can even make local changes to the X-Satters without telling the MDS immediately about them and returning to the client. And it has a shared link count. Now when I start the QS subvolume inode operation, it actually behaves similarly to many client requests that are already executed within the MDS. We're using the internal facilities that already exist to do this. The operation requires a number of locks, internal locks on the inode, not the internal not the POSIX facing locks that normal file system users are familiar with. These are internal locks on the inode and they control which metadata the operation has permission to change on the inode. So we're acquiring the auth lock, the link lock, file lock, etc. for reading or exclusively. And by doing so, the MDS will reconcile this with what writes have already been given to clients, that is what capabilities have been issued. And if necessary, it will revoke capabilities before those locks can be acquired. So when this operation tries to acquire those locks, it sends a revoke to the client. The client updates its local capabilities according to what the MDS is allowing it now to have, possibly fleshing data if it changed the file size for example or added in an extended attribute. You may flush that along with an update message to the MDS saying yes, I've updated the capability, I don't have these access rights anymore. And now you see that it has no file permissions. Its X-hatter is now shared instead of exclusive and the link count continues to be shared. So after this has occurred, the operation is considered done and any future ops on the client associated with this inode will block because these locks are still held. Why? Because this is a long running operation. Unlike most ops in MDS, which it will acquire these locks, perform some metadata mutation and then drop the locks, this is necessarily a long lived operation because it needs to continue to prevent clients from getting capabilities on the file or executing metadata operations, which will also try to get the locks from executing. So the get adder would block or any other client operation that would acquire those locks. So now to close out the talk, we'll take a quick look at the QIES set state diagram and focus only on the happy path. You know, it's a typical state diagram, lots of error paths, right? So we have a new set, we're adding a number of routes to the set and once it's in that state it's going to enter the QIESing. So at that point we're going to be launching all our QIES sub volume inode operations and acquiring all these locks, capabilities will be revoked, new operations will be blocked. When all of those operations have their locks and they're complete but not dead, then we can enter the QIES state. So all of these, that'll trickle back up the stack when we're querying the database we'll be able to see that the set is QIESed. At that point we're going to take our snapshots on all the routes that we need, more than one probably and when the snapshots are complete we can then release the set. So then we'll go into the releasing state, all of those QIESa volume inode operations will be killed and the locks automatically released allowing clients to be reissued caps or any blocked metadata operations to be kicked and resumed. Once those operations are all dead then the set will enter the release state and the QIESet is considered terminal and done. So that is the basics of the QIESets and again there's a number of error states shown on the slide, a canceled QIESet or an expired one, etc. So with that that's the end of our talk, we're going to leave time for questions. Again I'm Patrick Donnelly, this is Usob, I said your last name right, right? Don't often say his last name. These are the pull requests we have open still for our work, they've not yet been merged into the main branch so this is not yet live and even the development version of CEP. And we have some documentation that you can also review, some preliminary documentation, some details may change but for the most part it's reaching a very concrete state. That's it, thank you. Any questions? Yes? Will CEP mistakes snapshot and store it in like dot snap or underscore snap type within the folder? You mentioned that all I.O. in that part onwards will be pleased so will like leads on previously taken snapshots also be frozen or that. For the most part, alright so the question is if I've quiesced I.O. on a subtree can I continue to access snapshots of that, past snapshots of the subtree and the answer is probably not because of the way the locks work on the I.O.s it may also incidentally protect how the access through the snapshot version of the I.O. Then there is like we didn't introduce the shallow volume people like maybe we can mount the snapshots for backup system to just read the contents. Not at this time so we're looking also into a variant of quiescing where it allows most read only access to the files. Right now it's very much a stop the world for the most part I.O. pause so you won't even be able to execute most reads on the file system. Or like some stat calls may still be able to respond well written stat calls on the clients because they still retain certain read only capabilities. In the future it will work for the ASX vector like we can access read only snapshots. That is the hope in the future we'd be able to do that yeah to support that. Any other questions? Neil. So now you have the set command to quiesce volumes is it also possible to run FS3s on the client side if you have set FS kernel mount for example that would quiesce the volume for all other clients. Do I answer that one? So the question was whether we're going to be able to use kernel if it will work for kernel clients. Exactly if you call FS3s on the client side instead of running set command. As of now we haven't planned to support the FS3s command but it will look into it. I think it's pretty reasonable to consider it even for the first operation. Now one of the good stuff about what we're doing right now is that it's intrinsically backward compatible because we're building on the set capabilities kernel clients will be able to reach the quiesce. Now how you trigger the quiesce it's another question and we'll consider this definitely. Other questions? Okay thank you very much. Was it pleasant?
CERN's Open Source Storage Systems
Thank you. So hello everyone. My name is Hugo and here with my colleague Richard and Abbey from the 15 which will be present afterwards. And basically we thought that it would be a nice contribution basically to foster and to actually present a little bit what we do basically as a. In the storage topic we are the experts on the storage. There are other colleagues downstairs in the booth so feel free to pass by and get some stickers basically and have some discussions. So yeah the core of this talk is to present a little bit what non-theft storage technologies we use on disk, on tape and cloud storage. And to explain a little bit what we do with them basically. How many of you know CERN? CERN basically. Okay good. This is an easy start. So yeah CERN is the biggest laboratory basically for particle physics in the world. It's basically based in the frontier between France and Switzerland. And our goal basically is to really understand how basically the universe was constructed. And we have a lot of intelligent people, a lot of physicists that trying to basically understand the data that we produce in this accelerator. CERN is also the birthplace of the web as you know in 1899. CERN 10 Berners-Lee basically created the worldwide web there. And he gave it to the world basically. And actually this changed a little bit how basically in society we communicate and we do things. And we actually bring in this legacy. We really contribute a lot to open source. We try to do basically as open source as much as possible. Where we do not find something that we can find it open source. We try to build it ourselves and give it back to the community. And here you will find today a little bit of examples of these technologies that we use. But before we jump there. CERN is now a multi-b because of its large Hadron collider. So this is a tunnel of 27 kilometers of circumference. It's around 100 meters below on the ground. We have a lot of superconductive magnets. You can see basically in this picture there. And the idea is basically we have different particles, atomic particles that basically we try to collision amongst each other. And we have high resolution cameras. It's what we call the detectors. So here is just one example of one collision. So these big cameras have particles basically coming from opposite directions. And we try to basically align them. So we basically create a collision almost at the speed of light. We generate a lot of these pictures per second. Just to give you a rough number is around one petabyte a second. The throughput that is generated out of the detector, which we cannot handle. It's really impossible. So what happens is that we have four of these big cameras around basically the tunnel. And we have filters basically because not all the collisions are interested for physicists. Some of the collision they are already known. So we try to really know, detect the exotic cases that we would like to basically invest. Traditionally we have what we call basically the processing farm. And it has been always done with FPGAs and hardware. But now we are moving a little bit. Different experiments they use now, for example, GPUs, hyperalyses, basically processes in software to do this filtering. And we arrive roughly around one terabyte a second. This is the data that we have to handle to process and to distribute. So how we distribute this data? CERN as you can see here in the center is what we call the TR0, like an onion. We have TR0, TR1 and TR2. So CERN has its own data center. It's all the data generated and it really brings to this inner layer of the data center, which is stored on high, say, throughput, this buffer, and then copied to tape. And my colleague Richard will explain to you more details what these systems are about. After the data is stored at CERN, we know that we have a copy of the data we distributed. And this is what we call the worldwide LSE computing grid. So this is an international collaboration of many people, many different encounters, many different institutions around the world. And the idea is to distribute this data across all these 160 data centers in the world. So people, basically physicists, for example, are in Oxford. They can analyze the data sets they are interested in. And at the same time, in the case of a big catastrophe, basically, because CERN is built with public money from these institutions that believe in science, the idea is that we can also try to reconstruct, basically, the original data with the copies that we have around. Today, we start around one activate in disk and one activate in tape systems, and we'll dig a little bit in the technical details later. With a lot of things regarding computing at CERN, today we are only focusing on the storage and open source for science. But if you go to home.CERN, this is the domain that you can go and find more information, or just pass by the booth and we will tell you a little bit more about other dimensions, basically, of what we are trying to do. So this talk today is going to focus on three and four of these systems. So the CERN box, EOS, and tape part, me and Richard, we are going to cover it. My colleague, Abby, will talk a little bit about what is the theft infrastructure at CERN. So this is a high-level view of how it looks like. We have the tape system, we have EOS, which is our software defined storage disk system with commodity hard disk, because our goal is to really try to use the cheapest hardware that we can find to provide, basically, reliability for the data. So there is no goal-plated hard disk that we use. We really try to apply a cost factor to really be the cheapest. And then we have CERN box, which is a cloud storage platform that sits on top of EOS and theft, and this allows people, basically, to access the data in a draw-box-like fashion. We use Anklot, the storage solution for that, because not all the people at CERN are geeks, and they are basically SSH-ing into a computing cluster and going through Fuse and CD and getting data out, but they just use computers, they want to get some data, do some basically computations in Mella, for example, and then they run it into the computing farm. So it's very important to bring the data that is generated to the end-user devices. This is a little bit more, let's say, in-detail picture of what we do. So on the left part, as I mentioned, we have EOS that provides a Fuse later, and then we have CERN box that sits on top, and this provides access to basically, you know, 100 EOS devices, also to different computing clusters that are running basically just Linux, Boxes, Almanine right now. And we also support through Samba protocol on top of Fuse for Windows users, because we have also a big user Windows population at CERN, so it's important that we allow these people to also access the data, and we use the open source and implementation for that. JupyterLab, we have a system called SWAN, which is a branded JupyterLab environment, which also sits on top of our storage systems, because it's a very convenient way for people to just basically run some Python notebooks, for example, to run some interactive computations, and then from this platform, you can basically scale out the job to big computing farms using a GT-conder system, for example, for those of you who know the system. So this is a little bit the big picture of the things, and now I pass the ball to Richard, who will dig a little bit more in the different areas for the different systems that we use. There we are. Lovely. Thank you, Hugo. All right, so by now you understand that at CERN, we have a great challenge in dealing with the data coming in. We have a lot of data coming in at once. We store a lot of data over time, because most of the physics data is, in fact, kept in perpetuity, and we also cannot lose data, because this would upset the physicists, and of course, lost data equals wasted time at the LHC. So for this, to solve these kinds of problems, we have developed EOS as our disk storage system of choice. What is EOS? It is a storage platform, open source. It is made such that we can store data in an economically viable way with the public funding we receive. At the scales we operate at. It is elastic, adaptable, and scalable, so that we can even grow on it as we go along, and as we increase even in the future the LHC's capacities. EOS is also very flexible in how it works. So for one, we have from the start considered the use case of having not just hundreds, but even thousands of parallel users active at once. There are physicists on site accessing it, sometimes with multiple clients. There are physicists abroad, as mentioned, and there are high performance batch computing workloads running this kind of stuff. So this is from the grounds up in there. We also support multiple protocols and a variety of authentication methods. This way we not only can use EOS in the varied scientific system that we have running at CERN, but also at the other institutions working and collaborating with us. What does using EOS look like to the user? Well, in the end you get a file system. You can interact with EOS through the command line, through the shell. So you get an EOS shell, which you can drop into using the EOS command. You can also just run EOS commands directly, so you get your standard commands for interacting with file systems, LS, MKDR and such. And then there's also a script mode for more efficient workloads you want to run. You also get your POSIX-like file interface, so you can mount EOS in a number of ways to your system and just interact with it directly, as you would with your local file system as well. For Linux this happens through the EOS XT executable, so you get a few smarts with that. Or alternatively you use the SSHFS tool to get it running locally. This is also an option for Windows and for macOS, and for Windows there's also support for SAMBA. You can also thirdly interact with EOS remotely through remote protocols, HTTP for instance, and then there's also the root protocol. I won't go into the details in the interest of time, but this is just a protocol tied to the root framework that is used at CERN, not to be confused with the root user, completely separate, but just know that it exists and that it's particular to the high energy physics community. Just a little bit of EOS project history. It has been developed at CERN since just about 2010. The code base I think is 99% C++, so we really try to squeeze everything we can out of the system. Each major EOS release gets its own special name after a gemstone, so the most up-to-date one right now is diopside version 5. So if you actually go into our repositories and check them out, then you can look for these and you know what they mean. This one is the important slide. So if you remember only one slide from the EOS part, it's this one. Here we explain what the architecture looks like, roughly speaking. So EOS has two major components, a highly available low-latency namespace and then the disk storage instances themselves. So the namespace is what we call these MGMs, the metadata servers. They are storing key value pairs in memory using something called QuarkXDB, which is based on RoxxDB. And the idea here is that we keep them highly available in part by running multiple instances of these QuarkXDB instances in RAF mode. So for one real full-sized EOS instance on the MGM, you will have three QuarkXDB instances running. One will be the master. The two others will ensure that there's a quorum. So for each operation you do on EOS, there should be at least two of these agreeing that this was actually performed. And then if something falls over, hopefully the others can take over. That's the idea. The other component is the FST. These are just storage servers with really as many disks as you can fit into them. I think a rule of thumb is that you can get one petabyte out of one FST. So essentially these are the workhorses. They store data. They transfer data to other sites. So for the FST is connected to the MGM, your operation usually when you do something, it goes first to the MGM and then the data transfer actually happens to the FST itself. The replication of the data is handled by EOS. So EOS ensures that you get the correct number of copies of your data, that it is spread across independent disks and independent instances as desired, so you don't lose anything. There's also the option to use erasure coding. Numbers. What does EOS at certain that look like in numbers? Well, it's really almost a petabyte, almost, sorry, an exabyte worth of data stored on these EOS instances. We store roughly speaking 8 billion files in total spread across these 1,300 storage nodes. We have running, which in turn are running about 60k disks. We expect this amount of data to grow almost exponentially even in the coming years. The funding will probably not. So we really try to make this as efficient as possible and as performant as possible to get really a minus worth. Here is a more view of what the actual workflow and data access looks like at CERN. So during LHC operations, you will have a number of data streams coming in all at once as fast as possible, storing the largest amount of data possible. This is then after the filtering that is mentioned, but even then you get streams totaling up to 150-200 gigabytes per second just coming in to these various EOS instances. They are by the way split up, so the way we scale them is horizontally using multiple instances. You will have for instance one EOS instance tied to one larger experiment and then you will have some which are shared among the medium and smaller sized ones. While this is going on, while the data taking is going on, you will also have data going elsewhere. You will have about 40-50 gigabytes per second going to tape, which I will get back to in a moment. So this is for permanent archival for long term storage and also for getting some extra capacity of the, like, just for extra storage. And then at the same time the data sharing happening to this WLCG, the worldwide collaboration. On the right hand side you see then the batch workloads. So these are what physicists queue on the EOS system using HDConder for instance, which is a high throughput computing batch management software. This is where the actual, the physics data is actually analyzed and we gain insights about whether or not it confirms or weakens theories. If you're interested now in deploying EOS, good news is you have options. You don't need to start with the petabyte. You can, you know, you can scale it and you can move around this, this, yeah, this gradient of how you want to do it exactly. On the right hand side you have the more mature production grade experience where, you know, the client interacts with your EOS instance through some form of load balancer. It points it at the correct, the lead MGM, which is running these core DBA instances and the namespace as mentioned. These in turn connect to the FSTs for the storage. On the left hand side you have more of a development environment or a test environment. It's very convenient for that. So you can deploy it for instance in your Kubernetes cluster or containerized virtual machine, this kind of thing. You can, you don't need to use a hard, you know, just disk system underneath. You can also use shared file systems such as Ceph underneath. And then, yes, in the middle you have the other options such as a hybrid system. So if you just happen to have a spare Ceph cluster lying around, you can tie it into EOS as well. You can find more details about that in our various documentation links. There's also an extended edition of this talk available online. But before you check that out, let's speak about tape. So before everyone knew about CERN, now I want to do another one. Raise your hand if you've used magnetic tape storage. No. That's incredible. Fantastic. I'll do the intro anyways to magnetic tape storage. What is it? Effectively, magnetic tape storage is you store data on magnetizable media, but instead of doing it on spinning disks like in your HDDs, you do it on this flexible tape. The tape is coiled up in this cartridge, which is just a plastic shell around mostly with little extra bits and pieces. And, yeah, to get actually any data in and out of it, you have to put it into this tape drive. So the cartridge is very simple. The tape drive by itself doesn't speak to many things except to a tape server. So this happens over SCSI for us, so we have other options also available. The end effect is that, well, you get, how do you say this? No, actually, for just simple setups, you can use just, you know, a tape drive and some cartridges on your desk, but we ran out of desks a long time ago for that. So we put these into what are known as tape libraries. These are, they come in many shapes, but ours are shaped sort of like storage container-like with the side portions being lined with these slots for the cartridges and for the drives. And in the middle, you have robots picking up the cartridge physically and putting it into its corresponding drive when needed. What does this result in? It results in that you have very different sort of access patterns for magnetic tape versus disk. Tapes are most happy when you read and write in, you know, one sequence from the start to the end, one smooth motion, ideally. If you read in another way, you will have to, you know, stop, rewind perhaps, and this will slow things down. So whereas, disks are good at random access, tape excels at linear. In the linear case, tape can be good and even beat disk maybe on the writing part. When it's, the conditions are not right, it can be slower. So the way you manage your tape has to respect that sort of thing. Tape is generally used for long-term storage. Lots of people use it for archival purposes, though we do that as well, and we also use it actively as active storage. Here for reference is just what a tape library in our case looks like. Ours is all nice and dressed up. So you have this long container-sized thing. Now the explanation of tape is perhaps a bit negative. Like why would you use tape if it's not necessarily as fast as a disk or if it's this custom format? Well, at the end of the day, magnetic tape is quite cheap per terabyte as soon as you get over the cost of, you know, building up your tape library and filling it with contents. So to get the most out of our money, we do use this for storing lots of data. Magnetic tape also has low emissions actually at the end of the day because a cartridge, which is on storage somewhere, does not consume power. Only when it is actually read or written to will it consume power. So it's a nice little environmental boom there. And it's also good for cybersecurity. A cartridge lying in storage is safe. It can't be overwritten. It can't be encrypted by ransomware. So, and there's an actual, there's an actual time cost to moving it to a drive to be written to. You also have usually way fewer drives than tapes. So there's a bottleneck for any sort of attacker wanting to do something nasty. Finally, tapes are long-lasting. So as mentioned, we keep data in perpetuity effectively. And so the long shelf life of tape when compared to disk is a positive. Finally, what is CTA? CTA stands for the CERN Tape Archive. It refers both to the physical installation as well as the open source software we produce to run it. CTA is open source and it isn't designed to be used in conjunction with disk storage. So the predecessor to CTA was a hybrid system which did both things. It became fantastically complex. And so for the second edition, we decided to not do that. So we leave the disk stuff to the disk system, in this case EOS. Though you can also, thanks to the efforts from the community, use it with other disk storage systems such as Dcash. And then CTA concerns itself only with the tape side of things. So we only store the metadata associated with tape and these things. So what does CTA do? It keeps track of the files you have on tape. It keeps track of where, on which tapes they are. And it does dequeuing. So at CERN, at any given time, there will be lots of physicists wanting the data. There will be contention for the resources in the system. So CTA handles dequeuing system. What makes CTA special in contrast to other tape systems? It is very archive throughput oriented. So to accommodate this LHC data producing workflow. So really the focus for CTA is getting data as fast as possible from disk onto tape. So we give that sort of an advantage in contrast to retrieve operations. We use it actively. So not just as an archive where we ideally never retrieve the data because it's just a backup. We really do get the data back from tape on the regular for the analysis of the data. And yes, it has a grown user community. So the CTA software exists. So their SEO is not maybe not quite as good. We only need to beat the Chicago Transit Authority and the Cherenkov Telescope Array. And then we'll be there. Again numbers. We have about 750 petabytes of data on tape at CERN. 50 of those are backup and miscellaneous IT data. The rest is physics. These are spread across 60,000 roughly speaking tape cartridges. They are accessed by the 180 to 200 tape drives with the corresponding servers. These are spread across five libraries on site. At CTA, at CERN CTA, we use special EOS SSD instances as disk buffer. So when the data comes in, we really want to have it go as fast as possible. And we want the data to spend as little time as possible in there. So we really want the disks to be quick to move the data quickly to tape and also the other way around. This is the important part for the CTA, for the CTA slides. This is what the architecture looks like. So on the left hand side, you have your data coming in, usually from the experiment's big EOS instances, from where it is transferred to our little disk buffer instance. This, for the user, happens basically by just copying the files to a special location called something like archive, and then the system takes care of the rest. The disk buffer through one of the special EOS components connects to the CTA frontend, which is the management instance, and queues the archival and retrieve requests. The frontend connects to the CTA catalog and the object store, the catalog here being our database, effectively. We use Oracle, sorry, but also Postgres is supported. So in there, we keep configuration for the system and the various metadata. And then in the object store, we keep our queuing information. This is CEP. So there, once something is queued, it goes in there. The tape servers also connect to these two components. There we have what is known as CTA tape D, the tape demon running. And the tape servers will check the catalog for their configuration. They will then check the object store to see if there are any tasks matching their configuration to do, and then they will execute them. So either retrieve data or store data through their corresponding drives, which are in the library. How do you interact with CTA as a user? Well, as a user, if you're an administrator, rather, you will use the CTA admin command line tool. This comes with the installation. So there you can do all your low-level operations. As well, we now try to make a push to publish our higher-level operator tools. So these are the things you use to manage your tape lifecycle, the general monitoring and automation of the system. So these you can then use as well. And they basically use the CTA admin command in JSON mode under the hood. Users just interact with CTA through EOS. So for them, it's just a special place on the disk system where it takes a bit longer to get your files back. If you are curious about using CTA, we have a dev setup that we run, virtualized. You don't need a full tape library. You can just use the dev setup. This runs thanks to another lovely open-source project called MHVTL. This is Mark Harvey's virtual tape library. It does require a special kernel module to run, but it's worth it in contrast to having the physical hardware we're needing that. So through that, you can deploy test instances or very small-scale instances, just for maybe doing development stuff on either a virtual machine or a Kubernetes cluster. The instructions for doing that can be found on our documentation. And with that, I hand back to Hugo to speak about CERNBOX. Okay, so we are moving from tape now to cloud storage. Okay, so what is CERNBOX actually? So CERNBOX is the cloud storage platform basically that we use at CERN to expose the data stored on EOS and Thf file systems up to the users in a convenient way. It's actually a global platform. The background that you see here on my slides actually are basically the locations of everyone basically working at CERN. CERN is a very distributed place, so you have people contributing from every corner of the globe, almost every corner. There is no people in the north pole. And the goal basically was in 2014, so this is the 10th anniversary of this project. It's also the 70th anniversary of CERN, and the goal basically was at that time to actually, we saw that many people were using commercial providers and actually we wanted to have the possibility to control the data where the user were putting it and actually have our own jurisdiction on it and actually be sovereign about the data. We generate this data. We should be able to control it. And this is how everything started. And we use basically Ancloud, as you probably know. This is an open source company in Germany. And we have been collaborating on using their solutions since 2013. So in a nutshell, today we have around 37,000 users using the system. On a multi-basis, we have around 10,000 to 12,000 users. We store around more than 3 billion files actually. This number has to be updated. And this system alone for user data, it contains around 20 petabytes of data. CERN provides four main things for users. The first is synchronization and sharing. It's the typical draw box use case. I just throw some data synchronizing with my devices, share it with people, with public links, internally in the organization. And actually it brings a use case, which is to give people access to the data offline. If I have the data synchronized, I just close my network connection, I go in the plane, I can work on my data. When I arrive to a place where I have internet back, I basically, I can just get it in automatically without having to manually upload the data. Another thing that we bring is actually web applications. So, Office kind of collaboration suite. We use only Office, Collaborator, Microsoft. So, we have a plethora of different applications. And also, scientific applications. So, some of the scientific applications for physicists, they're very complex to install on the laptop. So, we basically give them a web environment, usually through JupyterLab kind of platforms. And this facilitates basically the daily job of people, especially students that they're at the university, they want to do something and they don't want to spend time setting up complicated C++ tool changes to basically just build the software they have to use. Another aspect is that this service provides access to the underlying systems through an online file system. On Linux, we use the Fuse Layer library and Windows through Samba. And then we integrated with the specific physics protocols and software and computing farms. So, basically from CERNBOX, imagine the typical use case of a physicist. I have a little data set. I play with it in my laptop. Basically, from my laptop, I do some analysis. This analysis, it's basically living in the same storage where the computing farm is. So, with one command, I can say, OK, now scale out this job. This job will run the computing farm for hours, days, weeks. When the job is done, the results will automatically synchronize to my device. So, I can actually write a nice paper. So, this is the whole workflow that we're trying to optimize with this platform because people before were having to learn different tools, having to pull the data when they were available, etc. So, we really simplify the use cases for the scientist. Now, what makes this platform so special across these last 10 years are basically three things. I want to explain to you here, and maybe if I have some time, I will do a live demo. I'll do it right. First thing is that the data is not owned by a system user. You know, you use a platform like Ancload or NestCloud. All the data is owned by the system users they are running with, usually Apache or Nginx. And we didn't want this because we have a problem. This means that all the data is compromised. So, all the user data basically is owned by the user. So, we have a huge basically held up directory. Every user has a dedicated UID and group ID. And when we start this data, we really start it as the user ID. What this gives to the users is the possibility that they can access the data from a web interface, but also through the FuseLayer. It's the same UID that is being used from web and the file system. And this kind of magic is actually a kind of challenge for institutions right now because people are using web users. You have nice OAuth, basically UPNs, UIDs. But these don't reflect basically the unique side. These are difficult to match them basically. So, this is one of the nice things about the system. The other one is that traditionally, and this is something that Ancload has actually moved away from after many years. They use a new model basically that will explain their new product. But since years, you are using basically this platform. It was written in PHP and all the metadata was stored in a big SQL database. And the problem was that if someone was accessing the file system on the back of this platform, people using the web part will not see the data. They will have to run some synchronization jobs that will take hours to basically spawn the new data in the web interface because its identity was a synchronization between the SQL database and the storage. And you have two brains that think several different things, so you have conflicts as well. So, what we did is remove the database completely and all the operations were joined directly to the file system. So, for example, in Sanbos, we don't store anything on the database. Everything is stored on the storage system. We use standard attributes as much as we can to basically facilitate the operations. This gives us atomicity as well when we have to move files left and right, create versions, etc. The standard attributes that live with the data file. And this facilitates a lot basically operations. And now, Unclo with the new product, which is Unclo's infinite scale. They are actually also doing this. So, this is a way to step forward for scalability. And the third feature of this system is actually that from the web interface, you just set an ACL. Like, I want to give access to my colleague, Elvin, you know, read, write access to the folder called red. And what happens in the back is that also on the storage file system, there is an ACL set. In EOS, we use basically EOS ACLs. On Thephaphase, we use basically the Unix ACLs basically to give access based on UID and GroupID. And what this gives is that people from the web, they can access the data in Dropbox. And people from the, like, you know, Gix, usually Linux computing clusters, they can just basically CD into the places that they have access. And these ACLs are respected. And this basically combines the web world and the storage system world in a way that people, independent, whatever they are, which device they use, they always have access to the same data. And this is a little bit how it looks like the infrastructure behind. On the left side, we use the Unclo's infinite scale web clients. This is a single page application running on top of NG Nexa. So this is the web UI. Nice people just go there and use it. We also use the Unclo application for desktop and mobile people that want to access synchronized data from their laptops or from mobile devices like iOS and Android. They just use these applications. And then we run what we call the Riva server. So the story behind this project is that in 2017, until 2017, we were running PHP Unclo. And at that point, it didn't scale out for our needs. We were storing a lot of files. And PHP, I'm sorry, there were a lot of PHP fanatics, but it was not scaling as we wanted. So I re-broad the server, basically, of Unclo in Go. And this language was really focused on system performance and concurrency. And this is what we needed at that time. And then three years later, in 2020, Unclo's use actually took this component and actually is part of their new product. So we are pretty proud that they did this, because now everyone can profit basically from this integration. And how this server integrates with EOS and CFFS. For EOS, we use GRPC. I'll show you now, GRPC from Google, the open source project. Basically, client-server communication to EOS. EOS supports an GRPC server that we can communicate with protocol buffers. And also with the Xroot protocol, which is a protocol really for wide area networks, which for high latency. So it's a protocol that is optimized for that. And for CFFS, there is a nice binding for Go. So we use the live CFFS library basically from the server. We create virtual mounts inside the process, and then we access the information on behalf of the user that comes to talk to basically to the server. And yeah, here you have some documentation. Basically, we store everything on GitHub, open source. There are also some publications around this software, so you can really look like, basically, dig, dive into it more. You can really refer to that. And yeah, we are downstairs in the booth basically. So feel free to talk to us, because some of the systems are actually better to show you live, with common line, how it looks like, the rather gentle slides. And we are organizing a specific tech-week storage at CERN. It's basically available for everyone. So as you are passing by Geneva, or you are interested into these topics, please come. It's free. And yeah, it's a place basically we would discuss with people having different challenges about the storage. Yeah, and that's pretty much it. CERN basically is open all the year round, so feel free to come and visit us. If you are a student, basically, still studying some master's degrees. There are nice opportunities to come to CERN. It's how I came actually doing some internship. I liked what I did. I stayed there for a couple of years, then I left, and I came back. So it's a nice opportunity. It's public money founded by many countries basically that really believe in science. So feel free to profit from that, and that's it. You have time for questions? Yeah. You pick. Do you actually have external EOS installations? So the question is, do we have external EOS installations? Yes, we have plenty of them. I don't know all the places where they are, but I can tell you, for example, in Asia there are some institutions. Actually, yes, maybe we have a map. Yeah. Some of them. So these are the ones that have EOS and CTA, the tape archive, but there are other places that have only EOS deployed. And if you want more information, just pass by the booth later. I can give you more details. How much redundancy is in the CTA? It depends on how much you want to configure. The default is for the major experiments, you get one copy on tape, and then usually through the WSG there will be a collaborating institution that mirrors the data locally there on site for them. For small or medium sized experiments who don't have that kind of setup, we do do do will copy. So then you will get your files and at least we will guarantee you that you will get two copies and you could in theory set it to something else as well, but usually two is fine. We try to make sure and you can also configure this that the files actually end up in separate tape libraries. So that way you know if for some reason you know your building burns down or just that particular library gets damaged, hopefully you'll have it somewhere else and be safe there. Thanks. And on the topic of disk sizes, what do you use and how do you feel about the upcoming 20, 30, 40 terabyte individual disks that are you going to use? That is, do you want to take this one? Yeah, I can take it. So the question is basically what do we think about the new basically high density disks that are coming to the market and how we plan to use them. So depending on the use case for the physics use case, we really don't care because we have so many data that we have to ingest that even if they are high density it really doesn't make a huge impact. For the sandbox part of the project where we only have 20 petabytes with the new high density disks, this means that we could actually have just one rack with all the data. And this is not ideal for redundancy because then basically we are, you know, our single point of failure is the rack. And what we have in mind basically is to try to basically use the same data that we have in the past. And basically is to try to basically erase encoding basically the files. A US is a system that you can use replica based model or erasure encoding. And to find basically a good trade off, there is really not like a perfect solution but the idea is to find the correct erasure encoding across different racks and maybe even shared the storage server with other projects or other use cases so we can benefit basically to really fill the disks. And another use case that we have currently that we are investigating is the single-magnetic disk also what basically will have the impact to use them because it looks like the industry moving to the direction for single-magnetic recording. And yeah, it poses a challenge because this disk usually happens only and we have basically use cases where you have to handle random writes and random reads around and this can basically pose some challenges. You were saying that you have you're storing data indefinitely and the tapes have a duration life of 30 years. How soon is older than 30 years? So are you having this operation of having to copy all tapes as is how big is that challenge for you? Yes, so the question is if the lifespan of a magnetic tape is 30 years about and soon is older than that, how do you keep this data around? And for that we have continuous workflows in place. So this is part of these operator tools that I mentioned. There's one called REPAC. So effectively we have automation in place for periodically. Once we get to a new generation of tape media to take the old ones and then rewrite that data onto new media. So that way we always sort of stay on the wave of new technology or at least you know trading slightly behind it. We can't of course upgrade everything all at once because that's just way too much. There's an actual amount of time spent in reading from the end of one tape to the other. But yes, we continuously upgrade our media generation. So I think the oldest right now is LTO7 on site and we're moving slowly now to, for new data we're using LTO9, the newest enterprise. And so yes, continuously we will rewrite data onto new media. This also happens if media gets damaged for some reason. You guys also use decentralized cloud solutions. Is that considered a valid option in your stack for archival or is that like a proof of concept? For archival specifically. So the question was do we use decentralized cloud storage? Was that correct? File point. Come again? File point. File point. Oh sorry, file point. Ah sorry. I don't have the numbers for any cost analysis on that but I don't think so. It's going to be, I don't know how you can guarantee with, I'm ignorant about fine-cones perspective so I don't know if you can guarantee a specific timeframe for the retrieval of the data or for the access. There's also, at the end of the day, so once it is retrieved onto disk, we really, well, we need to put it on disk for working with it and the latency needs to be low. Like at some point we had another data center called Wigner off-site, far away. And just even for that the latency for the physics workflow was way too long. Like it's just, it wasn't workable. We have received many complaints so yes, it is nicer to have it on-site. Yeah, though come stop by and speak about file coin. Backplace has a lot of disk statistics that they publish every year. So you have it as well? Yeah, I can answer which one specifically for reliability or... What disks? They have half a kilo of it. Yeah, so the question is basically that back-place probably is basically some information, insights about the reliability of disk. Yes, we do. There are some papers around. I cannot refer now to my head, but we can pass by, we can find them. These are failing every day at CERN. Every single day we have disks that are broken. We have so many that is natural. And what we have built is basically just to make sure that when this fails that basically the software can take care to make sure that there is another replica valid on top. But I cannot give you the exact numbers, but we can figure it out. Short question, 50 seconds. Okay. Maybe I'll do that. Test the guys at the booth. You can ask as many questions as you want, I'm sure. We are all the weekend around so feel free. Yeah, so let's give them a round of applause. Thank you. Thank you.
CephFS at CERN in view of Disaster Recovery
Okay, so I am just continuing on the second talk from CERN. So for people who are already in the room, some of these things are already familiar. So I am here to talk about CFFS at CERN, primarily in context of this asset recovery requirements we have at the organization. So we already introduced CERN in the previous presentation, so I am not going to go much into it but this is just there for the dramatic effect because this looks nice. This is the accelerator we have. We have various detector points, so this is one of the experiment sites, Alice which does lead-let collisions and this was one of the collisions that happened last year. So moving on, so this is a broader perspective of the whole ring how it looks like and what was not previously mentioned in the talk is like we of course had an existing data center that has been serving us since the 70s but there is a new data center that is under construction and this is primarily for the backup and disaster recovery needs of the organization and this talk is mainly focused on the BCDR aspect of the talk. So if you look at the existing data center, that is how it looked like in the 70s. It no longer looks like this but it has been around since mid-70s. This is the new data center that is coming up since it is all built new and stuff like this of course it is more energy efficient and it is expected to go into operations in a couple of months hopefully. And my main purpose of this talk is to talk about CERN and CERN and how we primarily look into SAP shots and what we found while doing some experiments on them and mainly this is asking for advice from the community on how, whether you have run CFFS with snapshots and whether you are facing some of the problems that we are facing. So largely CERN we do not have a single cluster that is serving all the needs. We have multiple smaller clusters that are dedicated to a particular purpose and this kind of helps us in cases of when you have use cases that do not always align with each other and you do not actually take down CFFS clusters because somebody else is doing something that is not normally a workload for your cluster. So when it comes to RBDs we have hard drive clusters that go around 25 petabytes and we also have a full flash Erasure Coder cluster that is like half a petabyte and this is mainly for the HPC use cases. For CFFS it is also a jungle of different types of CFFS clusters we have. We have a full production hard drive cluster again, a full flash cluster which is also used for analysis workloads. We have also hyperconverged clusters that we co-locate compute with. What Richard talked about in the previous talk about CERN Tape Archive, we have a small safe cluster that actually exclusively serves needs for some of the scheduler components of the tape archiving and other than this we also have largely RGW cluster that is serving the S3 components at CERN and we are newly building multi-site clusters that would also be a second backup region for the RGW. So our journey for CFFS it began in 2013 and it was like one physical cluster back then. So the primary need was of course a shared file system. We use OpenStack heavily for compute and it was mainly serving OpenStack Manila and some HPC use cases. We have 8 MDS servers for active and for standby. We do not do active standby and no standby replay. Metadata pools are still on SSDs. We have no snapshots and it is a single file system. Since this predated many of the more exotic pinning options you have with CFFS we have a script that actually tries to pin sub-directries to a random metadata server. Now that we have a femoral pinning we should try re-evaluating that instead. And we have multiple safe clusters, so two general purpose safe clusters that serve general production workloads. We have a few safe clusters that only serve specialized use cases. So we have one for monitoring, one for pure HPC workloads. And last year we also moved like one of the safe clusters that was actually on full flash from regular normal powered to diesel powered battery generators and this was done with a zero downtime using virtual CFFS and virtual racks in crush. Yeah, there are some details in additional slides in case if we have extra time I will go for it. So when you talk about business continuity and disaster recovery it is largely your requirement to keep your data safe during faults and human errors or ransomware or any of all of these things that are important nowadays. And you have various strategies to achieve these things. So you can go for an active sort of thing. You can go for a warm standby and you always have backups and coldestows. We are not that focused on the active part at least in the context of file systems in here because you would more likely benefit from snapshots and backup here. So the talk is mostly focused on the warm standby and coldestows use cases. And I guess suppose everybody knows what snapshots are, anyone in the room who doesn't know what a snapshot is? So you just have a frozen point in time state of the system and it makes you easy to roll back and do soft deletes and various things to build operations. And usually they are quite cheap to create and they are much less overhead compared to full backups of systems. When it comes to CFFS snapshots are actually enabled in new clusters by default from Quincy release onwards and it is this configurable thing called allow new snaps and it's a Boolean variable that you configure on safe cluster. Just configuring snapshots does not give clients the ability to do snapshots. They need a special access key which has a particular flag or an odd permission that allows snapshots to be done. And snapshots are copied on right in CFFS. Patrick already covered this two talks earlier if somebody attended so how the thing works under the hood. So there is a hidden snap directory and creation is just an act of folder creation as far as the end user is concerned. But snapshots are not synchronous. It's a lazy flush operation so when you issue a snap create until CFFS comes back with the message that the snap is created you can have IU that is not being tracked in the snapshot. As an administrator you can also create snapshots at a sub volume level which is almost something like a manila share or a sub volume share that you export to the end user which kind of makes it easier for some other use cases. Our own focus is primarily on user snapshots because users know best how they work loads are and what is a safe point where they want to do snapshots. And one of the things that we do of course want is like metadata intensive workloads like CI build jobs and many of these things that do get blown for example. Do not suffer a major performance penalty when snapshots are used. And many of our CFFS clusters have a very heterogeneous use case. So we need to understand the impact of snapshots because we have both interactive and non-interactive use cases on our CFFS clusters. So you have Kubernetes and OpenShift which use CFFS as a backing store and you also have interactive users who use software called Svan which is basically using Jupyter and notebooks under the hood which possesses use a lot for analysis. And these workloads should not suffer if somebody is doing a snap workloads and this should check off most of our BCDR service offering checklist if we were to provide this as a service. But we are just evaluating snapshots right now and then if we want to provide this as a service to users we do need to understand what kind of operations can potentially impact the functionality of the entire CFFS clusters and it would be nice to know if like you know it is restricted to people only using snapshots to pay the penalty or it is cluster wide. Another important question we need to solve is whether our tiny 3% team can actually run the service offering successfully without too much operational effort and also find out if there are mitigations for the problems and operational impact. To a large extent many of our users are not aware that they are using a shell file system and to a large extent we also want to keep it that way. So that is the motivation for the many experiments that we perform in CFFS snapshots in this talk. So when it looks to the evaluation goals first we should understand what is the baseline behavior of the system and the normal circumstances. So if a client is within their limits and not surfing the system thus the system with snapshots actually behave much worse and in case if we are using snapshots is there a performance degradation and what kind of workloads can trigger these and the other switch goal is of course to understand how the system reacts in this rest. So if we have bad clients with quite metadata intensive workloads which we have for some of the HVC use cases how does the system cope and can the systems people not using snapshots actually not suffer the market impact and how bad is the stability and performance impact we see. Last year at Ceflacon we presented a larger version of the stock and one of the main items that we had as a checklist was to evaluate this in a much larger context and also find the spinning directories and these sort of things actually make some of the problems we saw last year go away. So to do the testing itself we primarily designed two client workloads. So one is your standard IO 500 benchmarks. It's a standard set of benchmarks that is used in HVC context to measure IO performance. Under the hood you mainly use two tools one is called IOR and the other is called MD test. Most of the benchmarks if you run the simple configuration it kind of self cleans up the test at the end of the run and you have various injection points where you can build in functionalities like injecting snapshots. So during one of these post phases you can actually add a script that actually basically creates snapshots so you can have workloads that basically do a write, snapshot and read to understand if a single run actually sees an impact due to snapshots. And generally our experience with other file systems at CERN I mean we have usually down times due to aggressive metadata workloads and we need to see if these sort of workloads actually trigger something bad. All the tests come with an easy and a hard variant so when you look at IOR for example the easy variant is just writing every process writing a giant file which is very easy for bandwidth kind of use cases. The IOR hard test on the other hand is actually writing a single file from many clients which is more IOP intensive than you know a bandwidth intensive test and the units are measured in bandwidth which is actually quite bad. MD test is actually the file or directory creation workloads. The default test doesn't actually write any contents into the file systems itself so it's just a create or create operation whereas if you do the hard test it actually writes some data into small files in non-aligned block sizes so to actually make the system metadata of the system. In addition to this we just had a small workload that always kept on you know tiring and untiring Linux kernel and this was just to keep a base level of activity and since this kind of workload can easily spot when things go really south. On an average run an untied operation takes about 3 minutes for the Linux 6.2 kernel and I mean RM minus RF for example would take about 4 minutes on this cluster that we configured and it's not bandwidth intensive of course it's just a few megabytes per second operation but it's more you know metadata intensive. So the first we started evaluating this on a virtual cluster so the test cluster itself had 3 monitors, 4 MDSs, no standby replay configured. Everything was on virtual OSI servers on RBDs and actually when we run IO 500 benchmarks on these you kind of expected what the theoretical performance was and it always I mean kind of gave the exact performance so what you see as waves in the graphs are essentially like you know read write tests and then you know you have the deletes which are just only IO so as far as the client set up goes we had one client node for each so we created two hierarchies, one hierarchy where snapshots were enabled and another one called just client which does not have snapshots enabled. We also this time I mean print all the snapshots workloads to go on a single MDS so all the work directories are statically pinned to MDS.1 so that I mean MDS.0 can take the regular traffic without any impact. So initially we did the test over the Christmas break and initially adding snapshots did not have a market performance degradation however we did not have any monitoring on the cluster and what is important in SF land is always monitoring your PG states so what happened is I mean placement groups were in the snap term wait state forever they never caught up to trimming the snapshots because the cluster was too small a cluster to actually pull this off eventually the cluster reached fullness as the stamp terms never caught up and you know the cluster kind of went into a bad state but it got worse after that so when you actually removed the workload and you know started removing files things got worse when we started unmounting clients. We saw that the MDS started going into a crash loop and this was there are a couple of trackers that actually refer to this and primarily because I guess you should never bring your system to a state of fullness so unmounting clients actually made the problems worse. We eventually looked at the tracker and then found out like you know this is mainly because of the session tracking metadata object so we manually wiped the metadata session table and that brought the cluster back up but I mean doing this in a production scenarios of course a very huge you know operational nightmare. So lessons learned so we thought okay this is a make sense to continue on a virtual cluster anymore so we actually moved the OSDs to physical hardware so we had 3x48 disk 7200 rpm hard drives that we use as safe fs nodes the OSD nodes and we also increased two more clients so we had two clients on the snap path and two on the regular path and of course the IO 500 benchmarks usually always give the expected performance. In order to establish a baseline these are the baseline stats we got out of the cluster so OSD bench on the random OSD would give you around 240 megabytes 55 IOPS and the higher than expected you know bandwidth for a single OSD is mainly because you have NVME journaling radios bench deliver close to a gigabyte per second bandwidth 250 IOPS. When you do an IO 500 benchmarks with 16 worker processes what we observed is we could extract the 1 gigabyte per second that the cluster can deliver if you and I mean multiply the 300 megabytes per second that this can I mean a single node could deliver into 3 and we hit like 1.5 k start IOPS and you know write IOPS in the range of 900 so whatever a tiny 3 node cluster was configured to do it kind of delivered in terms of the baseline statistics. What we observed is when you do IO 500 benchmarks on a path with snaps and without snaps at least the bandwidth performance is more or less always in line. We do see a small degradation on read the work loads when snapshots were done but this sort of thing can be expected because when you are doing a snapshot you have a tiny performance penalty associated with it this is a cost you can pay and maybe I mean we were running this test with two client processes so probably if you increase the number of client processes maybe you can deliver a much better bandwidth but let's say the bandwidth IO benchmarks kind of reveal same performance whether you use snapshots or not. For IO hard drive which actually writes a single file with many many clients we saw a very very distribution without snapshots. There are two reasons for the spread also we have lot more data when it comes to I mean work loads with snapshots because these run over a period of three or four weeks compared to smaller two week window that we had for without snaps. And another factor note is like when it comes to IO hard benchmark this is a benchmark where it's a single file that is actually being used for very large number of writes so MDS is very unlikely to come as a you know bottleneck to this because like you're still having like one metadata objective to deal with rather than thousands so I mean you are expecting a kind of same performance other than the variation part of it like let's say the numbers the mean and sigma I mean the sigma part of you know IO hard drive everything else seems to be more or less in line. When it comes to metadata read write also I mean with snapshots with two clients we see that you know you don't have that much of a difference both are more or less similar. We have slightly better performance when you do metadata start workloads probably because like things get cached and you know doing a snap operation probably brings things into cache and you know this and clients do the caching so this probably gives you a slightly better performance with the start. MD test easy if you do the deletion workloads this is where of course we saw that snap workloads of of course started suffering slightly however the delete workload is extremely susceptible to cluster being you know pushed under load. So let's say the tar and under workloads also look similar whether you're on a snap mount or a non snap mount I mean the slight variation is more because we have more data on one of the things than the others rather than a spread. So when ran in isolation you know snapshots did not see that much of a performance impact and you knew if that was the only thing that we used to evaluate whether this is good you know this should be good to go but our own requirements are primarily triggered by you know more heavier metadata workloads and we need to understand how the system reacts when these things happen. So what we started doing is like what HPC workloads sometimes do is like you have a longer hierarchical directory structure so in MD test you could specify the depth of the directory tree structure that can be evaluated. So when you started increasing the directory tree depth we saw that with with the non snap client of course there is a degradation of removal operations but it doesn't it's localized to the client but when you start doing this on the snap client we had seen that this actually affects cluster wide the performance. The overall system parameters take a dive when you know when you start deleting clients with snaps. When especially when you started snapshotting after creating deep directory tree we saw that the system actually started seeing latencies that we're never seen so a normal like latency that you see in a CFFS operation is in the order of 50 milliseconds and you would see that we hit like six seven minutes of latency for a CFFS stat latency. What we also observed is like when we started stressing the system the pinned MDS had this increasing latency that we saw in the previous graph and eventually it stopped failing responding to heartbeats. What we did see is standby MDS takes over after a few minutes when it detects that this condition has happened but what happens after seems even more interesting in terms of like traffic no longer being distributed to both the MDSs anymore we see that all the traffic gets rerouted to the pinned MDS which is the one usually handling the snap snap client workloads. What we also saw is like something that was reported in the upstream tracker a year ago which is that you know when you have unlinked operations on MDSs you would have a very high CFFS MDS latency and this tracker I mean is linked in the slides and you know it basically is still open right now and what the tracker also mentions for example is like what you see on the MDS side and what we do see is like the MDS always spins on one particular function which tries to track the parents and sisters of inodes and you see like 100% of CPU being used in just one particular function and when you run so what we were more interested in is when this sort of workload runs on a single MDS that goes on spinning whether the other MDS can actually serve normal client workloads and this is the part that we saw that you know the normal client workloads cannot be served anymore because everything gets rerouted to the same MDS for some reason and you know workloads never seem to hit a completion. If you manually restart a workload some of them I mean catch on however you know an existing workload mostly kind of gets into a very stuck state. We saw a worst case like you know tar and remove a workload which usually takes an order of like 3 or 4 minutes going up to 4 plus hours and you know worst case IOR MD test benchmarks which deliver like I mean even for a single client close to 600 megabytes per second going down to 25 or you know 25 IOPS and this sort of levels and actually so after stressing the system we didn't run the stress workloads for all the time we just like ran it twice a day for a couple of days. What we saw is like there's a market degradation in you know times reported by you know operations so on the blue graphs basically indicate how the you know variants of the data looked like when you know when you untar the workloads it was like taking the order of like 300 seconds and that is the long tail you see I mean a smaller tail that goes up to like 20 minutes in the worst case whereas you already have a market shift in the mean itself going much higher almost to the double range and you have a very long tail latency that goes up to hours which we took off from this graph because I mean it doesn't make any more sense. What we observed is like the sort of a systemic degradation which kind of makes it difficult when you know when we want to productize this sort of a thing so non-trivial effect on non snapshot I mean directory traces are general concern for us and it is not that easy for us to determine from monitoring if we if we hit any of these potential triggers for the MDSS to go on this CI node spin loop we also need further investigation into why single MDSS seem to take all IO traffic after the switch and what we wrote as a line item for last year's Ceflacon which is like pinning away snapshots whether that should help because the problem is seemingly localized to MDS that doesn't seem to work from our experiments this time. So in conclusion we are generally happy with our Ceflacon cluster but we are still not ready to like you know enable snapshots on our general purpose Ceflacon clusters yet primarily because we don't want fraction of users to do this sort of activities that take down the entire Ceflacon cluster and in this particular context if there are people running Ceflacon clusters in production you would like to hear if you have snap workloads and you know how you're monitoring these things or if you see some of the issues that you have seen with that we saw and any feedback or like you know future direction on how to improve snapshots for everybody is very much appreciated. Maybe there are some monitoring insights on you know the deep directory trees from MDSS that we don't know of or maybe it is easy to implement we don't know. Of course one important step is of course educating users on how to use shared file systems in a good way. Ceflacon's best practices doc in the upstream documentation is actually a pretty good starting point here. Another takeaway which we should file bug reports for in the future is like of course documenting various snap term parameters of Ceflacon clusters. Defaults seem sane but if you do need to modify them it's kind of unclear what are sane values to actually configure and that kind of brings me to the end of the talk we would like to hear from you and what Hugo with the previous talk already mentioned we have a tech week storage at CEN coming in the mid of March so if you're in the genie area feel free to pass by and yeah that concludes my talk.
Deploying a hyper-converged infrastructure with Ceph across the Cloud-Edge Continuum
All right. Perfect. So, let's hope that we don't get this. Perfect. So. All right. We're going away from CERN, but we stay in Taitlan, so let's welcome Victor and give him some applause. Hello, everyone. Thank you so much for being here. In today's session, we are going to see how to deploy a hyperconverting infrastructure with F across the cloud edge continuum. But first, let me introduce myself. My name is Victor Palma, cloud engineer at Open Nebula. I come from Madrid, Spain, and I've been working for Open Nebula for more than two years. Developing an innovative new feature for the cloud edge world. So, first, we are going to start with some theory and then my idea is to show you a demo of all the things that we are going to see here. Well, first, we need to do the question of what is the cloud edge continuum? The cloud edge continuum is just an environment of nodes on the edge, distributed in multiple locations and everything interconnected using the same management layer. So, some advantages of the cloud edge continuum are the deployment of low latency applications. Since we are deploying our application near and closely to the final user, we can also reduce the energy consumption that our application does since we are deploying the application in a small data center, not in a big one. We can also reduce the vendor dependency since we are not deploying our application in a standard data center like cloud or AWS. We are deploying our application in multiple locations and managed by any provider that we want at any time. Then we can improve the user experience using the cloud edge continuum for our applications. Since it's related to the first point, since we are deploying our application near to the user, the experience and the latency is going to be low. We can also expand the service ability. It's very easy to duplicate and replicate our applications in different locations, different servers. We can deploy our application, for example, one cloud edge in Paris, another one in Madrid. If we want to replicate that, it's very easy to replicate the same infrastructure in Lisbon or London and so on. Finally, we can reduce data transfers and reduce security rigs since we are running all of this in a local location near to the user. The data transfer we need to transfer from our application to the user is more closely. How can we manage all of this? How can we manage the applications in the cloud edge continuum? First, we need to have a set of clusters of bare metal servers. Ideally, running KVNC is the mostly popular hypervisor. Then we need a uniform management layer for handling and managing all these locations, the private cloud associated with that. Then we need to interconnect all these clusters. Finally, we need to provide a multi-tenancy environment in order to create several users and groups and isolate each environment in the cloud edge. We are going to see now an example of how a cloud edge is inside. This example is a scenario for a 5G radio. We have this cloud edge node connected to the 5G radio using an edge land and two servers. The idea of the cloud edge is only to have a small number of servers. In this case, we have two, running the KVNC hypervisor. We can run in this hypervisor any application that we want. For example, a Kubernetes cluster for handling all the 5G code and workloads. We can also use the virtual GPUs of the server if it is available for data processing. All the servers inside this node are connected to the Internet through this public network using a single VLAN. The idea of the cloud edge node is to be autonomous as possible. If the single management ledger that we are going to use is not accessible, this site can work alone and without any problem. It can still provide service to the user. Now we are going to bring these ideas to life and see how all these concepts match together in order to create our cluster. How can we deploy the cloud edge in the node? We are going to use three main technologies. The first one is OpenNevola. It is a platform for orchestrating virtual workloads. Then we are going to use Terraform to automate the resources deployment. Then we are going to use Ansible for the configuration site for installing the packages that we are going to need. We don't need to use Ansible and Terraform directly. We are going to use these technologies through the OneProvision portal. That is our OpenNevola tool that allows us to automate all the configuration of the nodes and deploy the cloud edge nodes in an instant with a few steps. OneProvision supports several providers in order to create our own cloud node on the edge like GCloud, AWS, or GAYJax. Some of them are currently in development so not all are available. We are going to use AWS for the example of the today's issue. I would like to see more closely what is OpenNevola in order to understand the environment that we are going to see in the today's demo. OpenNevola, as I already said, is a platform that allows you to orchestrate and manage all your virtual machines, application containers, or Kubernetes clusters, all of them in the same way, in a very easy daily operation. You can deploy all these applications or all these workloads using your own private cloud or expand your cloud to the public cloud or on the edge. OpenNevola has integrations with several third-party tools like Terraform, Kubernetes, Ansible, and Docker. It also has its own built-in tools like a Sunstone UI, the graphical user interface, the web portal that we are going to use to interact with OpenNevola. We can create workloads in different hypervisors like BingWare, KBM, LXD, or file crackers. You can handle in the same way a micro-BM or a virtual machine. It doesn't matter for OpenNevola. Then, we have the possibility in OpenNevola to expand our cloud to the multi-cloud or to the hybrid cloud. In the case we have, for example, our own data center and we need more resources, we can deploy any infrastructure with automatic provision using, for example, providers like GCloud, AWS, or Equinix. Then, we have a uniform management for that resources with a homogeneous layer for users on workload management. Then, deploy the infrastructure, any application that we want. For example, as I already said, we can create Kubernetes clusters or Docker or virtual machines and so on. Well, we are going to use for the deployment of our cluster, thev as a storage solution. But why thev? We are going to see thev basically because it's an easy way to distribute the storage of our cloud, of the file, of our virtual machines, and share all the storage between the cloud etching nodes in a very, very easy way. So, thev at OpenNevola has implemented the storage basis on thev that with simple configuration, you can add multiple clusters or pools to OpenNevola. This storage implements replication and consistency and some important features for thev data storage in OpenNevola are snapshots, clone operations, encryption, et cetera. Regarding thev at the edge, as I already said, we need first a small number of nodes in our cloud etch node. Thev storage is going to be dedicated to store VMs in the scheme image. It's not necessary to have high storage requirements because we can use a multiple option for the storage. It has lowest storage requirements. And this is ideal for run in a HCI configuration because we can create nodes everywhere with a reduced cost. So, thev storage in OpenNevola consists of three different types of servers. We have the full nodes that run the Thev OSD and the monitor daemons as well, the KVN hyperbysos as well. We have the OSD nodes to run the Thev OSD daemons and the hypervisor only nodes that, as the name said, only runs the hypervisor. Here we have a sample of a comparison side by side from the point of view of the Thev storage from the AWS and from OpenNevola. For AWS we need to configure first a VPC, defining the metal bar servers, all the reaching tables, all the configuration that we need for our infrastructure. And from the OpenNevola point of view we just only need to create our host and automatically associate it with the Thev storage and we can start to create workloads on that servers. So, we are going to do a little demo on how we can do this using OpenNevola. So, let me... I don't know. Displays, sorry. Okay, that's better. So, this is the OneProvision portal that we are going to use to create the cloud HNO. Here we can see a general overview of the clusters, hosts and data storage that we have already in our infrastructure. So, we are going to take a look to each section. For example, here we can see the providers that we have already configured. I have configured here the AWS provider for Frankfurt. It's the example that we are going to see here, but we can create any provider that we want. For example, AWS or Reqinys or on-prem. Here you can select the location for the provider and when you finish this process, a provider is just a set of credentials that it's going to... OneProvision, the thing that OneProvision do is use the API endpoint of each provider in order to create the host on the servers, on the provider. So, it's just a set of credentials in order to connect to that endpoint. Using Terraform. Then we have the provision that are the provision providers. Here I have already a provision created. I'm going to show you this provision as an example, but we are going to see how the process is. I've already one created because deploying a note on the edge can take around 30 minutes. So, in order to avoid that, we can see here, for example, this class already created and running. For creating a new edge note, it's very easy. We just need to click on the add button. We can see here a description or the different options that we have, the support of virtualization technologies, etc. And we can set here if we want to create a edge cluster or if we want to create a HCI cluster using the app, that's the option that we are going to use. Next, we can select the provider that we want to use. In this case, it's going to be AWS Frankfurt. General attributes like the name, the description. And here we can tune in our note, setting the number of instances. The number of instances for only hypervisors or for only the OECD, etc., the DINIAS servers. The image that we want to use in the host. All of this configuration is also accessible through Jamel files. And we can set, for example, here if we want to use virtual machines or micro-VNs using LXC, etc. I'm going to back scenes. This takes some time, but the result is this. A cluster created in the edge in a very, very easy way. So what is the result of this? We are going to see here this is the main OpenEvula dashboard. The OpenEvula system is the user interface, the web user interface. So here we can see all the virtual machines that we already have. We have some VMs already running here. If we go to the exception, we can see here three nodes deployed on the edge inside this cluster. That is the cluster that we have created using OneProvision. We can create and deploy VMs in a very easy way, downloading an appliance for the OpenEvula marketplace. So for the Docker marketplace, if we want to deploy containers in our cloud, we are going to deploy a VM as an example. So for example, we can, this one, one of the Pine Linux. We can set here, for example, for example, we can even set the host that we want to use inside the cluster. For example, we are going to use this one, but we can change any configuration of that VM. We are going to instantiate the VM. And here we can see that the VM is starting to be created. So we are going to take a few seconds since it's an edge location, but it's booting and it's running. It's totally running. We can see here the host where the VM is running, the start time. A lot of configuration regarding the VM, like capacity or other storage and so on. And we can even connect to that VM. Oops, sorry. Maybe it's not so totally ready. And this one? Okay, I think that is the, I don't know if the Wi-Fi or maybe something is blocking the connection to the VM, but believe me, it's working. It's working on my machine. But that's all regarding the demo. As you can see, it's very easy to create filter machines in Open Nebula on the, using a Cloud Edge node. So returning to the slide. Okay. Well, that's the demo environment that we show you. As final conclusion, the necessity of this project and this integration with the Open Nebula are support for them in spaces, support for incremental backup in theft, adopt theft image, live migration. We want also to improve HCI configurations and integrate the one provision tool with the one deploy project that we already have. It's another public project that you can visit on GitHub that automates all the configuration and install of Open Nebula. And you are more than welcome to contribute to the repository on GitHub. And I also encourage you to contribute to our community. Join to the forum and share your experience using Open Nebula and help other users in order to create our Cloud open source community. So this project is funded by the European Union. So it's very interesting, the project. It's Cognite, so you can visit here the URL. And it's Cognite tries to provide a cognitive serverless experience to the European Union. So we can, the idea of this project is to using Open Nebula and one provision create a lot of Cloud Edge nodes in Europe in order to deploy application and gains independence. So that's all. Thank you very much for your attention. Questions? How do you deal with network outage, especially in the Edge? Sorry, can you repeat the question? How do you deal with network outage, especially in the Edge where the connection might not be stable? How can we handle the network connection when it's not stable? The idea of the Edge nodes are that in this kind of a scenario, the node is totally independent. So it's still working even if you don't have a connection to the Internet, at least to the region. I don't know if that answered your question. Yeah? Just to understand the configuration of the nodes in this form. So from the diagram I see that you can deploy, as I said, the storage subsystem on the same nodes we are running as the virtual machine. Is that correct? But you need dedicated nodes for the storage side and the VMs. He's asking about if we can deploy in the same node the storage and the virtual machines and the other workloads. Yeah, you can deploy in the same node. From OpenEvola point of view, it's only one node. But behind it's handled in the splitting between the nodes and the hypervisual nodes. But for the user that uses OpenNegulites, the host where he can deploy VMs and use their storage. So yeah. Any more? Thank you.
Chorus - Effortless Ceph S3 Petabyte Migration
Hi everyone, hope I am audible to the larger crowd here, thank you. So I am Sreesha Gurduru, I am a self engineer working for Klyso and I am here to present an open source project called KORUS which is responsible for an effortless S3 petabyte migration slash replication. So let us talk about why data migration, like when do we come across a data migration scenario. So lot of the companies and organizations these days have private cloud clusters and hardware which has certain specifications, queue and it can come to end of life anytime like the vendor might stop supporting the existing hardware, there might be a new hardware coming up. So in that case either there are two options in front of us which is to augment the existing cluster with the new hardware. If the specifications and skew are similar to what we have now in the cluster or to build up, build a new brand new cluster all together and when we build a brand new cluster there is a high reason for data to be migrated between old production cluster and the new cluster so that you know the data continues to stay and the operations can happen smooth. This can be one of the main reasons for data migration. Let us talk a few woes or difficulties with data migration. When we are talking about migration of data we are not talking about few bytes or gigabytes, we are talking about petabytes of storage. We have lot of data being stored in our storage back ends these days and those have to be effortlessly migrated to the new clusters. So the challenges include syncing petabytes of data and the continuous monitoring that we have to do behind the scenes like we just pick up some tool like RClone in this case and RClone is a robust synchronization tool, copy tool. So when we just pick that up we just have to monitor or even if we run that in the background we keep monitoring the status of the replication and also the time consumed with that huge amount of data to be copied across the clusters. Obviously the tools that are used for the migration and continuous changes in the data. The existing cluster we just do not yet decommission it. We have an active operation happening on the cluster be it reads, writes, updates. So that also due to the continuous changes in the data we might see it a bit difficult to copy or migrate the data. Let me share one of our experiences with a customer where in the similar scenario they had their cluster end of life and then we built a brand news cluster for them and the data to be migrated was around 3 petabytes. So between the old and new clusters we picked up RClone as the data migration tool. Migration let alone the data we had to migrate the metadata obviously and there was some issue with RClone where we could not copy the ACLs and the bucket policies for that particular bucket and then we had to tweak around and then we eventually got it to working and then that was a bit of difficult task for us. Indeed it is a Herculean task. So this experiment, this encounters with our experiences led to a tool called KORUS which is an open source tool which is a data application software which is capable of synchronizing S3 data as of today S3 data between multiple cloud storage back ends. Let me present you some of the problem statements for our tool. How to migrate S3 vendor data with reduced downtime? So I would not say there would not be any downtime but with reduced downtime and the cluster being operational at the same time and also the data being copied to the new cluster. And how to back up S3 data to another S3 in a different region or a different vendor? Here we might not have the same back end, we might be using storages from different providers like Amazon, Google, Minayo and we might have our own private clusters. So it is vendor agnostic. Like the initial goals of KORUS was to have a vendor agnostic solution. Like it should be able to support multiple back ends and with a pluggable architecture that means the components in KORUS are loosely coupled. Like if I see that one of the layer can be better, it can be replaced with another tool which is more performant and more efficient, I should be able to replace it. And then benchmarking of course before we add in any component we benchmark that tool efficiently so that it will be compatible with all of our, the entire project. The focus on correctness. So the data which is present in the source and the back ends, the following back ends we ensure that the data is correct and in sync with all the storage back ends. And then migrating big bucket under the load without downtime or with reduced downtime. So there are two things here, there can be multiple buckets with small amounts of data and those buckets are easy to be copied because it just takes couple of minutes. But there is a scenario where a bucket, one bucket has huge amount of data and lot of clients might be writing to one bucket and that bucket has to be migrated. So that is a bit of concern. So overview of chorus is there is one main storage and remaining can be configured as the follower back ends. And the users start by inputting the storage credentials in the configuration. And once the configuration is started and configured the chorus S3 API can be used instead of the storage backend API. Like if you are using AWS, if you are using Google, Minio, every backend has its own API. Instead of using that you can use one backend, one chorus API to communicate with multiple back ends because they are all S3 based. And chorus proxies request to the main storage and then eventually the storage is copied to the followers in the backend. All the existing data is replicated. For example when we introduce chorus into our ecosystem we might already have clusters with certain amount of data which has to be copied to different back ends that we configure later. So the existing data can be configured in the background using this tool. The data replication can be configured, paused, resumed, there are different life cycles for that particular request, the status. You can just stop, start, resume at any time you want and the management can be done using a web UI or a CLI. So the features of chorus include routing and replication per bucket. You can configure where to route or the request and where to replicate a bucket. And then again the same using you can pause and resume anytime. And then synchronize object or metadata. So just not the storage, you can also copy the ACLs, bucket policies, tags and everything using the same tool. And then as I spoke earlier migrate existing data in the background, track replication lag. So as of today we might have one set of configuration and the data must be copying to the source. To the back end, to the follower back ends, we can always track the replication and we can improve with the configuration options. We can start to rate limit, we can increase the number of workers. So we can do that. And chorus exposes Prometheus metrics. So we have the entire logging thing and the metrics are sent to Prometheus and the logs are in JSON format. Easy to read. Proxy and Grafana form the monitoring stack. You can visualize the data of how the bucket is being replicated and at what stage it is using the visualization stack. Let me briefly talk about the architecture of this entire chorus. Chorus is structured around two main web services. One is the proxy and the other is the worker. So initially the request comes to the proxy. We are talking about a flow where the routing policy is there. So initially the request comes to the proxy and based on the routing policy, if the bucket has to go to the main storage, which is configured to be Minayo here. So the request comes to the main storage and the request goes back to the proxy and then eventually to the user. That is one of the flows. The second flow is where the replication scenario is established. Again the request comes to the proxy and then there is an event or task created in Redis based on the replication policy. Like it knows what is the main storage and which storage should the replication be done to. In this case it is Ceph for example. And then the chorus worker reads the requests or tasks from the cache and then that routes the request to the, reads from the main storage and replicates to the back end. The chorus worker is accessible using WebBY and CLI. So this is an overview of the entire flow based on different scenarios. So chorus also has an initial migration feature like as I told the replication can happen in the background. So initially when the replication is happening in the background, all the buckets are listed from the main storage and then the objects within particular bucket are listed and then the number of tasks based on the objects are created and it is ensured that every object is copied to the follower back end using a particular task. So the worker processes tasks in the background copying the data to the follower back ends. Here these are the main components of chorus proxy worker admin UI and Redis. So proxy and worker are written in Golang and admin is written in view or the entire deployment is done in a containerized fashion on Kubernetes pods. Let us talk something about the resource requirements for different components in chorus. For Redis the scaling is done using Redis cluster and the persistence is ensured using Redis AOF and Redis database and fault tolerance in case of Redis data loss we can always restart the bucket replication because the state is maintained and memory consumption if there are around 1 million objects that are to be migrated then it can approximately take 1 million tasks in the queue then approximately 700 MB. This is all based on our benchmarking it can change with different scenarios and Redis is assumed to take less CPU and it can be between 100 and 1000 requests per second. So coming to the proxy it is stateless, it consumes less memory and less CPU but high network because proxy is the kind of brain it takes in the requests and it routes the request accordingly based on replication or the actual routing hence it also needs high network. Coming to the worker it is again stateless it takes in high memory and high network but less CPU because worker is the one that routes the reads request from the cache and routes request to the back ends based on the replication policy. So worker instance network and memory consumption can be rate limited in the configuration like in the day when there is huge amount of request coming to our clusters we can just stop the migration activity for a minute like or we can rate limited to do it at a lesser rate and then eventually you can increase it when the bandwidth is high. So yeah so what are our next steps for chorus we want to perform more load tests in case of larger buckets more data and efficient time consumption and then resource optimization at various component level like Redis how can we make it better and workers we want to make the logic more functional and then the API cost. So the routing policy alternatives since we have multiple storage packets what we want to do is to route based on object size for example if there is one GB of file you can configure it to be written to a particular storage packet and then if there is small number of files you can configure it for one back end based on the quota and lot of other parameters and load balance read request for replicated data. Now that we have multiple storage back ends in our hand we can always make an efficient use of each back end we can load balance the requests like for example if main storage is busy in taking writes since the storage is data is being copied to the follower back ends we can always route the read request to any of the back end which is idle so that logic and so every storage back end is providing a bucket notification and event log so we can subscribe to those events instead of querying the proxy every time and overloading it instead we can use proxy to really write data and migrate data so we can use that proxy instead or to keep polling for the bucket information and then there is we are planning on having a Swift API compatibility as of today we have a robust S3 API compatibility but we are planning to have open stack Swift integration and then life cycle policy the API parameters for different back ends is different so we just want to have good life cycle policy it is being tested and when a bucket is created with a particular life cycle policy in one storage the similar should be replicated to the other back ends as well without loss of any configuration for policy. Use cases the further use cases that we see for chorus are active transparent proxy post migration to speak briefly about active proxy migration so that means if the source and destination are completely copied and if we want to stop using the source anymore so once the data is moved we should be able to switch the proxy to the to another back end to make it a main storage instead of configuring it in the configuration file. Robust backup service so if we have two to DCs two sites then we want to synchronize data between both sites in both directions so the simple setup is to synchronize data between prod and the backup site so we want to make this tool efficient enough to be robust backup service like we can ensure that during disaster recovery even if the primary goes down we can simply do all the operations from the other back ends that are available by switching the storage back ends based on the based on how they are configured. So any questions regarding chorus and its implementation? One question regarding versioning so if you do replication from a source to a destination and the source has versioning enabled and there's a couple of versions how does this integrate into the chorus? So the question is basically about object versioning so if an object version is configured in the source how do we replicate it right? So for example in object versioning those are also defined as objects in a hidden bucket right so that bucket will eventually also be copied with the metadata so the object which you create initially before the first object it will have metadata and you configure versioning on it and there is a hidden bucket where all the versions go and we can restore it to that version anytime so the entire data is copied to the other back end as well with the metadata so that's how we can ensure. Yes I'm picking also about backup use case. Did chorus manage object log? Can you repeat that for me? Did chorus manage object log? Object log. Object log. Yes. Lock. Lock. Yes. I'm sorry I couldn't get that. Maybe I'm not so much acquainted with that scenario but can you just elaborate about what do you mean by object locking? It's a warm technology just the object is in right one time and also it cannot be modified. Okay. Read only. Read only. Read only. It's like a malware protection. The ransomware thing. That was one of our features that we want to implement so the question is more about when an object is locked or when there is an attack on the data. So yeah the ransomware thing is in one of our discussions where so actively the back end will be exposed more instead of the main storage so the warm or whatever that is introduced it will be in the back end and then we I mean that's just in discussions we are not yet there to implement but please feel free to post your question in GitHub. You can raise an issue. You can start a discussion in GitHub and then we can definitely take that as a feature with more details. These are the resources. We have open source the back end and then you can definitely reach out to us on GitHub and then this is a GitHub link for the chorus project. I'm sure chorus is more than I cannot speak about chorus so much in this 30 minutes but definitely it's a more efficient tool and it has a lot of capabilities to be acting as an orchestration layer for multiple back ends. Like we need not just use we can use one API to talk to multiple different types of back ends with different vendors. So yeah we are looking forward for more improvements more features and you can always talk to us on GitHub and then we can definitely improve this project together. Thank you so much for this opportunity. Can I get a correct answer? While you're migrating you want to implement a load balancing feature. Yeah. So you need to be in state of all the objects that you already migrated and that you still have to migrate to make an informed decision where you should go. So do you already have like a database or how do you know? Yeah we are going to do yeah to make the load balancing so that you the request is sent to the correct. Yeah so you can like a faster cluster you can go to the new cluster. Exactly. It's time to be more mature to add a presentation for the code but it didn't get in 30 minutes. Yeah I got it. It was like down how to restore it. Yeah sure sure. But it's still like it's really new we have some people there casting we're talking to start. Thank you. Very nice. At the moment we just need to connect both. Yeah that would be great. And we can set up the next speaker.
SMB for Linux with SMB3 POSIX extensions
Yeah, thank you. Yeah, just to introduce myself, my name is Falka Lendeker. As you can all see, I work for Samba since the mid-90s, last century actually, so for quite a while. And I think I don't have to introduce what Samba and SMB really are. They are file-serving protocols. And what I would like to do eventually is kill NFS. And I know this doesn't go down well in some communities, but this is what I'm working on in my spare time, when I have spare time. In the last few months, unfortunately, it was a bit limited, but still, some of you already have seen this talk at Samba XP or other conferences. There's a little bit of new stuff, but I think it's still interesting to see that you can actually serve SMB clients or Linux clients with SMB. So what is it all about? You want to share file systems, directory and files across a network. So you have one server where you have a directory, where you have a file system. And you want this to be shared across a network to possibly many, many clients. If you go Linux to Linux, you typically use NFS. And one of the reasons is it's so simple. What you do is you just add a line to your ETC exports. Maybe you have to kill or restart a demon or whatever. Then you issue just amount command on your client and you're done. That's about it. However, it comes with some downsides. First, there is essentially no real metadata or data caching in NFS. This means that it can regularly happen that you create a file somewhere and it doesn't really show up until a bit later, some on other clients. If you just write to directories, if you just write to files, other clients don't really see the M time or size updates really precisely and so on. So this is kind of problematic. Why does the mail format actually exist? Because locking doesn't work over NFS. And yes, NFSv4 has locking, NFSv3 has external protocols to do locking, but you can't really rely on those. And it's really, really complex to set up locking properly and to get failover done and so on. This initial very simple setup, and I love this acronym of NFS, it's just no file security. Because essentially what you do is you trust your clients to assign the UIDs and GIDs and essentially the group permissions and whatever you assign them correctly on the client. And there's nobody in between who actually checks. I know there are these days there are protocol extensions to do NFS over TLS, so at least transport is in a standard way protected. You can of course go and enable Kerberos for NFS, but this is also pretty complicated and we have done it in customer scenarios. The client at least is buggy like hell. And you get incompatibilities all over the place. You lose keys, you would, you lose anything. So it's really, really difficult to set up. As I said, clients have a very bad day when you Kerberize them. SMB however, it really comes from the Windows world. And if you look at the, there was one talk by the original SMB implementer, Barry Feigenbaum. Is it online available, Günther, do you know? So at one of the conferences that we regularly go to, there was actually a talk by the original inventor or developer of the SMB protocol. And essentially what they did is they took the MS-DOS interrupt in 21 and put the arguments on disk and let the server take care of it. And this means that they have to be compatible with a lot of applications on DOS. And DOS means that applications like Word 5.5 or whatever believes it's alone on the machine. So this means you have to get locking right. If Word opens a file and it believes it's the only one editing that file, you better make sure that nobody else also edits that file simultaneously. So they had to get locking right from day one. The other one is cache coherency. We have protocol for this and this between Windows and Linux, this actually works. So if you open a file over SMB, typically what you get is a permission to cache stuff, to cache your updates, to cache reads and so on. This leads to much, much better performance. And if somebody else also wants to open the file, you get notified that, oh, no, you're not alone on the file anymore. Please drop all your caches. Please write back your caches. And you tell the server, hey, I'm done writing back. Now please let the other one in. And then they have to agree that they all have to write back their changes and so on, read new data from the server. And the other advantage is, one of the advantages is that SMB servers, they are everywhere. Every home router in Germany, the Fritz box has an SMB server in there. All NAS appliances have SMB, so it is everywhere. And you can access it from almost any place. Whether all the features that we are talking about here are correctly implemented everywhere, that's a different story. For example, Fritz boxes don't talk to my mobile phone properly, but that's a different story. But essentially, it's everywhere. The SMB protocol is very flexible. There were very, very early extensions of the SMB1 protocol. So like every protocol, you have a lot of requests going back and forth, and there is unused protocol space. You have whatever, a create request, a read request, and so on. They are numbered, and there's number space that you can take and so on. And this is what we did early in the, whatever, 2000s or so for the SMB1 protocol. There are UNIX extensions that match all the UNIX semantics in the SMB1 protocol. This was never transferred properly yet to the newer and now only SMB3 protocol. And what we are working on actually is we want to extend the SMB protocol with all the behavior that a POSIX client expects. How is that done? The first packet that is sent between client and server is called Negotiate Protocol. And it exactly does what it says. It negotiates different flavors of the protocol. For example, it tells, hey, I'm SMB1, I'm SMB2, I'm SMB3, and I have this and this subfeature and so on. I can do these capabilities and those capabilities I can't and so on. And essentially what Microsoft has done with the SMB3 protocol, they did the smart thing and made this request extensible. Essentially what you can do is you have this Negotiate Protocol request and you can add what I would call extended attributes to this request over the wire. I mean it's not an exact file system, but you can just extend the request in a standard way with a new Negotiate context. So you have a ton of Negotiate context that say, okay, I can do encryption this way, I can do whatever. And we just have an additional extended Negotiate context that says I can do POSIX in this version. So the client tells the server, I can do POSIX, server tells client I can't. The default behavior is for unknown extensions is that the server just ignores it and doesn't send a reply. If the server sends a reply, I know I'm talking to a Samba who is able to do all this stuff that I'm talking about here. File name handling. This is really painful in our case because Unix file systems are case sensitive. Windows file systems in particular NTFS is not case sensitive. What does it mean? Under Unix you can have two files, Make file and Make file, one with capital M, one with lowercase M, and under Windows you can't, under NTFS you can't. When now a Windows client comes in and says I want to create Make file, what you have to prove at creation time is that no other uppercase, lowercase combination of Make file exists in the file system to fulfill this promise that this is case insensitive. What do you do by default? You scan the whole directory. And this leads to an O to the order of N square performance behavior. If you just drop a million files into a directory, file number 900,000 takes a lot longer than file number 1 because I have to scan the whole directory to prove that no other uppercase, lowercase combination exists. And what we can do is we can add a new create context, not only the negotiate context, but also the open file and create file request has these extended attributes. I can say that I want to open a file POSIX style by adding one of these create contexts. And we have defined a create context so that clients on a per request basis can say I want POSIX behavior, I want case sensitive behavior, I don't want file name restrictions, I want double quotes in a file name which Windows wouldn't allow. I want them. I know what I'm doing. I'm POSIX. So what we also need is POSIX metadata. If you look at the properties of Windows client on a Windows file, sorry. So we are here, Windows Server, I say properties. There's a lot of stuff. In particular, there's timestamps created. We have four timestamps in Windows that are roughly similar to what we have in Linux. We have attributes and so on. There's a lot of stuff that Windows has as metadata. However, the semantics are a bit different. In particular, they don't have a good notion of UID and GID. And they don't really have a good match right now for POSIX permissions. So some of the ones that we have in struct stat, like file size and so on, they are the same in Windows but in particular UID and GID, they are not. They are not the same at least. So we did. We extended the protocol. And if you, for example, do a stat on a file, if you ask for get file information, you can say I want this info level and there's a 16-bit field for info levels. And we just added one. We talked to Microsoft. Hey, get us this additional... Don't use this additional number that we use for POSIX information level. They agreed and so we have an additional field that we can use to fill in all this information that a client might want to use. However, second-last line. None of this is really the topic of this talk. It's about file types. If you look at the Unix file system, you have seven types of files. You have a normal file. You have a directory. What else do we have? We have block and character devices. We have named pipes. We have swim links. Oh, shock and horror. And we have sockets. Unix domain sockets. Samba can handle regular and directory files extremely well. Oh, there's a typo here as you find out. So we can handle directories and file. I mean, that's what we are made for. We have file servers so we better handle directories and files well. What do we do about the other ones? If you go and share ETC in Samba, sorry, share slash dev in Samba, something you probably shouldn't do, but if you do, Samba will find a lot of stuff that it can't really properly present to Windows, to any client. It will find character, block devices. It will find all sorts of stuff in slash dev. Or it will, if you just share a home directory, you will find sockets for GPG agent, SSH agent and so on. You will find all sorts of stuff that doesn't really fit into the file and directory schema. In particular, for example, you find files. And in previous Samba version, this used to work, that from a client you came in, it could open a file for writing, hoping that the server side on the Unix machine still exists, the server side process on the Unix machine still exists, and you could write into that and the server would get the data that you write into this. This can't be very popular because, I mean, many versions ago, we broke it and nobody noticed. Alexander is confirming. You're using it, Alexander? We have a lot of tests, but Alexander's comment was that we don't cover this, which means we didn't notice. Why didn't we know, or why did we break it? If you open a file for under Unix, all you can do is issue read and write syscalls. We don't do that in Samba anymore because whenever we get a read and write request from Windows, there's an offset attached to that read and write request, like in NFS. And we do the natural thing. We p-read and p-write, like what you do normally in the process where you have an offset. This is all from times when you couldn't really expect p-read to exist, but these times are long gone. We have some very special support for sockets. What's a socket? That's essentially a... It's a 5.0 on steroids. And what we do with sockets is we implement the Microsoft notion of RPCs. What is it? A Microsoft Windows client over SMB can open a file and transfer data over this file, special file, on the share IPC dollar, slash pipe, slash semr. What you do is you're win-redge. You open a file on the IPC dollar share, win-redge, Windows registry. You talk to the server side registry over RPC calls. And we implemented these days since 4.16 that our Windows registry server actually listens on a Unix domain socket and the SMB server connects to that Unix domain socket and just passes on back and forth requests. And so this is what I mean. We have limited support for sockets, but this is not what somebody would expect if on the client we would run a SSH8, for example, that clients connect to because this all needs to be done on the client side then. Block and character devices, I mean, we find them server side, but they don't make sense at all over the network. You don't want to whatever read and write to DevSDA over the network. You just don't want this. You could, but why? Enter NTFS repass points. There's a Wikipedia article actually on NTFS repass points. Repass points provide a way to extend the NTFS file system. A repass point contains a repass tag and data that is, and data that are interpreted by a file system filter driver identified by the tag. What does this mean? One use case is HSM systems, hierarchical storage management, where you have a huge file on NTFS that some software just pushes to tape and leaves a stab inside the NTFS file system that is just visible to the client as normal. And now when the client opens the file, the open code sees, okay, this stab is a repass point and the extended data that the repass point carries points at the place somewhere on tape. It's on this tape at that offset. And what you can do then in Windows is install a driver that when a client opens this file, the Windows kernel goes to the tape library and says, get me that file back. So this is software that you can install in the Windows kernel to extend NTFS semantics. And this is what, by the way, the NFS server uses. And we will see in an example of this. So applications can use this for arbitrary blobs. It's a special marker for a file, for a normal file that says, oh, I am a repass point and you can store stuff in there and essentially it's an extended attribute. When opening a file, NTFS filters can interpret the contents. This is what Microsoft also actually uses for sim links. Windows has symbolic links. They are stored as repass points. If you double click on that repass point, and I can demonstrate this here, I know demos never work. I have a file for and I will show you how I created this. I double click on the file for. Oh, okay. Wait, oh, I have a, as I said, I should never do. Ah, file for .text. Here it says text document, which is just a description of this is a .text file. I double click on this and what it says is the file cannot be accessed by the system because this is a repass point that happens to be named test.text or something. And they believe, oh, we have to open notepad, but it can't access that file. So the error message that you get if you double click on that file is, status I owe repass tag not handled. You have to tell the server that, oh, I want to open this special file in a special way. You have to set a flag. So a repass point, as I said above, has a so-called repass tag, which is a 32-bit integer. And if you look at the Microsoft documentation, Microsoft uses these repass tags and documents the use to a certain extent. And there's a lot of those. If you go here to that website, there's a ton of repass tags, reserve 0, reserve 1. What you see here is, I hope you can read that. No, you should be able to read that. That's HSM. That's HSM 2. And so on and so forth. Filter manager, repass tag, swim link. So this is what Microsoft defines in their spec, that they are using these sets of swim links. These sets of repass tags, and you get the integer there. Swim link is 0xA and then a C at the end. And we're using this. We are about to use that. So now we have two kinds of users of these repass tags. Do you remember WSL1, the version one of the Windows subsystem? They try to run Linux applications on Windows, and they face the same problem. Windows applications expect sockets and fee force and swim links to work. And in version one, they used NTFS actually for your home directory, for your local files. And what they did is, they have this repass tag address family unix. They use that. And what you will see here is, it must be somewhere. But if you dig a bit deeper, what they tell you is, the contents of these repass tags are not meaningful over the while. They were intended just for the WSL subsystem, Windows subsystem for Linux, server side. So they define as part of the data that is stored in this repass tag, hey, we have a block device, a character, a FIFO, and so on, with the obvious counterparts on Linux. So what they did in WSL1 is, when somebody didn't make FIFO, they created a file with a repass tag. And they, in the content of the repass tag, said, hey, this is a FIFO. None of them are actually documented. And because that costs so much trouble, the version two of the WSL, which I actually, is anybody using WSL? Some are. It's actually usable. I would say it's actually usable. You can't really tell the difference from a real Linux. At least I can't. I mean, if you look at Pock, of course, you will find differences. But for the normal day-to-day use, it actually works pretty well. Because what they are using, they are using a real X4 these days. Then there is a Windows NFS server. Pardon? Why? The question is why. I don't know. The Windows NFS server, this is what I'm going to present here, hopefully in demo. They also have the same problem. A client does a make link or a sim link or whatever. It doesn't make FIFO, and they have to store the data somewhere. And they define yet another set of repass points. And if you look here, they actually have a definition of what goes into the data, into the data field. Repass tag, repass length, and so on, in general. So they define sim link, character device, block device, and so on. And they actually specify what goes into the data field. For a sim link, the target goes in there and so on. And for the character device, you have two UIN32s for major and minor and so on. So they define what goes in there. I mean, you would have thought that these guys, talk to these guys, to share an implementation, but no. Why? The interesting thing is, if you look at, and I created a FIFO server side, and if you look at the properties of this FIFO, and you have to trust me, the one in the first row, can confirm that you have an L here. It says archive and L, L for sim link, if you look up the documentation. No, it's not a sim link, it's a FIFO. So their GUI is not really prepared for this. They believe, okay, all the GUI believes, all files that are repass points are sim links. Alexander? Is this client side? Client side? I can demonstrate that I see it. Because this is a local file, right? That's a local file that I created over NFS. Okay, so the NFS server created on a local file system something with this associated repass. Yes. So this directory here is local disk share. This is a local NTFS file. And what I did is I exported this via the NFS server. I mounted this from the client. And why don't I show it directly? I mounted it from the client, which is here. That's my client with a mount. If you look up at the top, NFS. And you can see in the left column here, I have sim links, I have block devices and so on. And I created them with normal UNIX commands over NFS. And this is what ended up on the NFS, on the NFS file system server side. Repass points. And so this is not too popular with Windows applications. So the Windows Explorer believes all files with repass points must be sim links because that's the most popular use of repass points in the Windows world. OK. So they don't look at the repass tag. They just see that this is repass point and it must be just one type of question. Yes. Alexander's comment was, and I will show you that in the Wireshark trace in a minute. There is a special flag in the metadata of the file that says I am a repass point. And you can of course get into the details of that repass point if you wanted to. But if you're an explorer, you don't care. You say it's a sim link. OK. Now this is a discussion. Do we use these guys or do we use these guys to represent or to present to the client when Samba finds a sim link? Samba site. Or for sim links, we even have three options. And so WSL version one has reserved repass tags. And if you look at one of these lists that I've shown you, you have repass tags for the individual subtypes, but they are not documented. They are not used anymore at all. You don't have any interoperability with anything else. We could of course use them. So in the case when Samba on disk finds a sim link or a block device, how do we present that? We have to make a choice. And WSL defines repass tags with undocumented comment. NFS only uses one repass tag. Pro NFS would be we have documentation available. And so on. And what we can do is we can write protocol level tests against the benchmark, which is the Windows NFS server. So we have ways to create these things on Windows and write just tests, which is very good. Also, if you now say, OK, I want to create a FIFO from my Windows client. That has mounted the home directory of a user on a Windows server. If I do that, the Windows client will create a repass tag that an NFS client talking to that same file system on the Windows server will also see as a sim link. Or as a FIFO, whatever. The same thing. And so this is why I would say, OK, I would like to use NFS repass tags. I have to talk to the CIFS kernel developers. I think with Linux 6.8, they went to different route. Andreas, do you know? No. So I think they went a different route, but we need to talk. Coming to sim links. Symbolic links in the BSD, UNIX, depending on how you look at it, are the best ideas since sliced bread, or the worst nightmare that everybody falls over security-wise. Even the Rust infrastructure. I mean, Rust being a language very security sensitive, they had their sim link race security bug. But we have to deal with it. We have to live with them. They are there. And so what do we now do when we see a sim link on the summer server side? Yeah, we can do that with the two ways that I presented. But as I said, Windows even has its own notion of sim links. So if you create a sim link, depending on where you come from, you get one out of three versions, three ways to represent them on an NTFS. And if you look at it, this Windows way of sim links, they actually work pretty well over SMB in the pure Windows world. For example, what you can do is you can have a sim link on a directory on an NTFS that is shared over SMB. And the sim link target can be backslash backslash IP address backslash share name backslash directory. And if you want to cut that file, or if you want to CD into that file from a Windows client, it will redirect to that server. So you can have cross server sim links with the Windows NTFS notion of sim links, with the pure Windows notion of sim links. Even better, if you try to open a sim link the Windows way, you double click on that file and under POSIX, you typically follow that sim link directory. If you mount that over NFS, the NFS client will have to take care of those and follow client side. But Windows does it a bit differently. When you double click on that or when you open that sim link file, they tell you, hey, you hit a sim link. And they will even in the error response, they will tell you, and by the way, the sim link points there. That saves at least one round trip, or several round tips, that if I hit a sim link, then I know where to go directly on the client side in the response. And Windows typically is completely path based, so if I open a Windows file, slash A slash B slash C slash D, and somewhere in the middle there's a sim link. They don't follow that server side, but they tell me, hey, go there, and by the way, I have passed slash A slash B already, and C was the sim link. So if I have a long path with many components, they tell me, okay, the third component is a sim link. Okay, how do we create these special files? Protocol-wise, there's a special flag to the open call, and we just set the contents. And yeah, what we can do is with Samba, what we don't want to do and what we will never do, if a Windows client comes in and creates a sim link the Windows way, we will not create a sim link server side. What we will do is because Windows sim links are also represented by normal files, 10 minutes left, they are represented by normal files with some special contents, with some special whatever extended data. We will do the same. So if you do a make link from Windows, then we will create such a file telling the client, hey, this is a sim link file, and the Windows client will just work as it will. And we will just open GIFs to the NDI interface and we will create Web notEP Things like R!! Or maybe Mint Speedusu advert. Okay, you know how much space here, I ran over the interlocking Mobile T share line for this. Okay, a shell. So what we will do for that Fashionbt is, I will use the three space walk way, and just kind of Diellow, And what you can see is ln minus s foo bar. This means I do a sim link from bar to foo, I believe. I always get that wrong. Yeah. So what I did is, and this is a file that actually lives on ntfs shared via nfs. And what we should see here now is that this is file bar. I created that. Now what I'm doing is I have my little user space tool, test start, that, and you know my password now, that I use always for Windows boxes. OK, what does this do? It creates a connection to that Windows server over SMB. And I just get all the metadata over SMB. And let me just TCP dump that. Oh, this is the wrong. TCP dump cannot override its own files. That's very strange. I know. OK, let me wire shark this. And what I see is a sim link. It's a sim link. What I could have done actually is to extend this command output with the sim link points there. Haven't done that yet. Maybe on the train back home. Let's look at that wire shark trace. Oh, TLsR. In the background, I have my connection for RDP running. SMB2. So here you go. It's a bit verbose. But what I want to point you at is I try to open the file bar, which is the sim link. And it says, repass tag not handled. Then I open the file again. And don't be confused by the create request. Create request is all catch all open file thing. And there I tell you, I tell the server, hey. I want to open this file. And I don't want you to interpret it. I don't want the HSM engine to go running. I just want to open the file. I want to open the HSM stub or the sim link as such. I want to see the metadata. I want to see the metadata. And it gives, OK, here you are. And a bit further down, what I can say is, OK, what I can say is, I can get the repass point data out of this file. And what I can see here is, I have, oh, this is a repass tag NFS. And by the way, this is a sim link with a target foo. So this is data that the NFS server gave me. This blob here, which is somewhere here, that was created by the NFS server. And so we can just utilize it. And we will utilize this. Before I take questions, I have one more slide that I want to talk about. Long running compute jobs. Very quick overview. If you have your HPC job farm, the one thing that gets in your way is all this file security. You want NFS, no file security, on long running compute jobs, because if a machine dies, it just, yeah, you don't want really, this is a trusted environment, and you just want your jobs to continue existing. SMB actually has secure provisions for this. What you can do with SMB is, you can create a machine account, you can give the machines a password, essentially something like key tab for Kerberos. And you can, this is standard Windows protocol. You can extend the connection to a share with yet another T con context saying, OK, dear server, I know what I'm doing. You trust me by my machine account. For this connection, please use this UID and GID to this share. This is a standard SMB protocol extension, and this is what needs doing before we actually can claim success and say, OK, we can also cover this long running compute jobs properly like you can with NFS or any other file sharing protocol. Not implemented yet, another server side, no client side, but it's there. Yeah. Mark. The machine account to the machine accounts authorized to protect any of the IDs. Correct. The comment was that SMB has a provision that you can trust a machine notified by the machine account database, whatever, you know what I mean? You have server side, this machine is trusted for doing the no file security thing. Send a protocol extension. OK, this is not really, thanks for your attention. Any questions? No questions. This is not good. Fun? Just an observation, the WSWSL version 1, are you really wanting to implement that more obscure data remaining on some obscure machines? I would suggest forget about it completely just because it's that. The comment was questioning whether we want to go the WSL1 way with these repass. Talk to Steve. Talk to Steve, French author, main author of the ZIF client. I mean, it basically is him. Steve French and Paul Alcantara, those are the ones who I believe for Linux 6.8 have implemented the WSL1, here, this one here. If you look at LWN, they can now create block and character devices and I think they went the way with WSL1. But I mean, talk to Steve. The comment was WSL1 is the only one under Windows Server 2019. There you go. Any other questions? Good job. You also, what? The question was how the current ZIFS client deals with these repass points. That's actually what is covered here. They start to properly implement that. So they already have some links. They have support for some links, the Windows way, because I mean, they are there. But they start to, they start working on all the other ones that we were talking about. Mark. It's work in progress. So I mean, parts already exist. Can you repeat the question? I was pointed out. Mark was asking about the, the, the new data, the data that is used in the system, and the data that is used in the system, and the data that is used in the system. Can you repeat the question? I was pointed out. Mark was asking about the status, what's currently implemented. It's a slow progress. And parts already exist. Other parts don't yet exist. So I don't know when we can actually claim that we do full SMB3 Unix extensions. I can't promise anything. One more, there's time up. I think we are pretty strict here. Just come to me later.
Exploring Samba on various File Systems: Bridging ideas and enthusiasts together
Hello everyone, I am Shweta, I work under work in IBM, I come from Bangalore, India. I started work, I started my profession working with GlusterFS, Gluster file system and I gradually moved on to Glustergeore application and I maintained that for Red Hat. On behalf of IBM, I started work, I started exploring new projects. One of them is Samba on Kubernetes. So here is the introductory talk on the same. So the title is exploring Samba on various file system. So bridging ideas and enthusiasts together. So yeah, let's try to understand what is Samba. So first of all, you might have heard the last talk or if at all now, then I will just start from the introduction. Samba stands for server message block. It's a protocol used for sharing resources over a network. So in other words, I can simply say that it is what NFS to the UNIX, it is same to the Windows word. So both are different protocols. And NFS stands for network file system. So this is basically the same. We can share the resources over a network. That is, user on a client can access the file, as if he is accessing the file from a local storage. So these are the two terms that we had to be aware of. So next, so we just came to know about two different protocols used by Windows and UNIX systems. Now, here comes a question. So what if we want to use two different operating systems in the same network? What if we want to share a same resource with a Windows server and a Linux client or a Linux server and Windows client? We know they use different protocols for communication. That is, it's just like they speak two different languages. Now how can they communicate with each other? So this was the problem statement that Dr. Andrew Trigel found interesting to work on. So he just wanted to connect his UNIX server with his wife's Windows machine. So he just started exploring what is SMB. He wrote a packet sniffer. With this packet sniffer, he could analyze what is going on in this network, at what frequency the packets are passed, what he could also intercept the logs and understand what is the pattern in which these data packets are traveling in the network. So with all the insights he got from this network sniffer, he just wrote a software that could help make his UNIX server look like a Windows server so that Windows client could access files from a UNIX server. So basically this turned out to be a project where it was an inter-polarability project where a Windows system as well as a UNIX system could communicate with each other and share the resources together. So here, majorly he could solve his own problem of sharing his house printer with his wife who was owning Windows PC. So this is how the project Samba was born. He just wanted to use the words SMB, the server message block, and create a word. That was Samba. Now let's come to the implementations of it. So firstly, how Samba is set up with file system. So here you can see that we are talking about a clustered file system. So these are basically the server and each server is installed with file system. It can be cluster, CFFS or any file system. And as you know that they are clustered, that means they are internally connected with the private network. They might be following their own principle for that. And once we install this file system, we will be installing CTDB on top of it and we will be configuring this CTDB. CTDB is nothing but a clustered TBD, clustered trivial database. It is basically written for Samba. It stores some of the temporary data that is used by Samba. This could be information about leases, logs, or some IP addresses and such stuff. And it also will make sure that all this data required by Samba is highly available. So that is basically the work of CTDB. And upon CTDB, we will have to configure Samba. So in Samba, you will have to configure a share, username, user password, etc. Share will be nothing but a mount which you will be using and where the resources will be. And user and password are necessary because this Samba can also act as domain controller. That is it can control the security-wise decisions. Like which user can I allow to access this resource? So that is what it is doing. And next, okay. So we just learnt what is Samba and how it is installed and how it works with the FILE system. So basically it is a business solution. So we know that in an enterprise level solution, we do not see that only Unix servers or only Windows server communicating with each other. So there can be a lot of clients. There can be a Windows client, there can be a Unix client. Everyone trying to communicate or everyone trying to use the resource from a particular server. And they can be again Windows or Linux. So in this case, SITE is an Test Integration Environment that is basically set up to test this Complicment Setup of Samba. This will include a lot of servers, a lot of clients and a lot of projects like maybe GlucidFS, CFFS, CTDB or Samba itself. And it can also include multiple configurations of Samba, multiple configurations of FILE system. So it was very convenient to automate this. If we might spend a lot of manhours to manually create this setup. So this was very convenient way for us to create an automated environment. So any system that has the softwares like Vagrant, Libvert and Ansible can help you run this project. That is SITE environment. And we might need a minimal resources like 2 to 4 GB for storage VMs and a minimum 1 GB for other VMs. We can also have other VMs which act as clients or which contain the scripts that are required to run this test. So also we can use this, as I said, we can use this for creating custom Samba environment. We can experiment stuffs on it. Basically it can create your Samba playground. You can play it with it, you can use your favorite file system, you can write your favorite test or you can manually go and run some commands and learn more about Samba. And apart from this SITE environment is used for periodically triggering the test we use nightly runs and also whenever there is a new change in the project, this SITE environment runs the SITE test cases. So it is like, it progressively checks that all the components that are involved are in sync. So we are not encountering any issues before it is given to the user. Before getting to the user we make sure that everything are working in sync and there are no issues among them. So also we are using the most recent code from the main. So that means that it is always, we make sure that the main is always compatible with all the projects which are involved are compatible with each other. So now why SITE? So I said, I said to you that we will be using the file system and these file system like maybe say a safe file system. So this, the resource, the file shared by it is not only used by a Linux system, it can be used by Windows system as well. So that is exactly the use case where SITE came up. So with SITE we can very easily configure this file system, not just whatever, we can just configure for safe, we just configure for Glastor Dive, whatever file system we can configure the number of servers, number of clients and we can make our environment ready for testing. With this we can also find out compatibility issues within Samba and file system and also we can find about bugs and unexpected behavior of the file system and issues in Samba. And now what are SITE test cases? I had briefly told, I had just mentioned about this. So this is basically a GitHub repository. This is run by SITE environment. It houses number of tests. So currently we have consistency test, container test, miscellaneous test and SMB test. So the consistency tests are just mounting the share, writing the data onto the share and unmounting it. So we will just make sure that the exact data is written and read. And the container tests are, it is just exploring the possibility of running these tests on a containerized environment. So we just need to write more of these tests. So we have still scope for improvement in this. And miscellaneous test, these are the IO based test. We have basic IO like read, write. We have a database IO where we import a database and simulate a database related input output, maybe like querying from the DB, storing the data in the DB, etc. And then we have a stress IO, which is we are creating a lot of stress with the large GB of data and we are just testing stress load on the system. And lastly and most importantly, we use SMB test. These are part of SMB torture. These analyze the packet level rules and regulations. So these also have this SMB1 test, SMB2 test, which is nothing but CIFS test. And also they test this RPC level compatibilities or incompatibilities and whatsoever. Now how does SIETE works? So as I have already told, we use LibWord and Zibliu might have already guessed we will be dealing with the virtual missions. So a host mission will create number of virtual missions and these can be servers, clients. So basically this is a screenshot from virtual mission manager. So these were the missions created by a basic setup where client 0, setup 0 and storage 0, 3 VMs were set up. Setup 0 basically contains the Ansible script. Storage 0 is the machine which stores the file system. It is the server where file system is imported. Client 0 are the, this is the client from which we access the files. So as it is an automation, so we do not have a lot of things. I will just explain how it is, what, yeah I will just explain you what happens during this automation. It is very simple, we just run a make. So with the default file system like XFS or whatever we set as default will be setup. The vagrant VMs are setup and then inventory is created. So we can see that client 0, client 1 and setup 0 was setup. And then as a, yeah so we install the Ansible, we install the packages that are required by the file system on the server. And also yeah we have this configurations related to SSH and also all the back and specific tasks. We set up the permissions. I just move forward quickly. Yeah we also have to deal with the SNX permissions. Yeah finally once all the packages are set up like CDB setup, Sambai setup, we will run the test from the SID test cases. So yeah these test results will be present in test.io out and you can examine them. So I will just move on to the next slide. Yeah this is a very, it is a pretty simple execution. We just run make and let me move on to the next slide. So we can list the virtual machines that were created with the worst list. So these are the machines. You can even, we can stop this and we can even SSH to them and manually use this as a playground for our executions. As of now I am just showing you the automation stuff. So yeah so we just run make clean that will destroy the VMs once we are done with the things once all the automations of tests are done. So it just destroys the VMs and we get back the clear end environment again. So we see that no VMs are left. So this is basically how we can pass the extra variables like which OS you want to use, which file system you want to use. In this case we are using a GlucidFS file system and there are the file system specific tasks that are taking place and we download the recent code from GlucidFS and we use basically the same things that are taken in the first demo will be taking place specifically for the GlucidFS file system. Obviously it was for XFS and now there will be different, sorry. So this will be just different, difference will be just in the creation of setup of server. But then everything will be similar. We also check if the cluster is in a healthy state and we also as I said you can just recall our installation of file system, CTDB and Samba. So that is basically done here. Yes so now about custom backends and tests. So now do you want to add your favorite backend or file system to SIT environment? As of now we support XFS, GlucidFS, CIFFS but we also have scope to add in more and more backends. So how will you do it? So the first thing we will have to define a background. So you will include, it is basically the Ansible script. You will have to make sure that how many missions are needed and what test you want to run on them. So basically declare them. That is called as defining the background. Then next you will have to know what are the installations you need, what are the packages you would need to install your file system. So that will be the next part you will have to decide about. And then finally you will have to create the backend role which will actually include the steps required to set up the file system and make it work on your machine. And then we will have to configure CTDB, then you will have to configure Samba. So a detailed description about it can be found in docs backend. Then do you want to add the custom test to this project? You can add any of the, there is a docs test. You can refer to that and it explains in detail how this can be done. As of now we have a project SIT test cases that is also a repository on GitHub. You can just add your test and get it merged and that automatically runs this test on SIT environment. Yeah, this is it. So this is the reference. So yeah.
Advances in Garage, the low-tech storage platform for geo-distributed clusters
You can't hear me at the back. Do not hesitate to ask for me to speak louder. So I'm Alex and I'm a co-founder of the DeFloor Association, which is a non-profit self-hosting collective. We're a member of the Chateau Network in France. And so what that means is we're doing self-hosting and we're trying to promote self-hosting as an alternative to putting everything in large data centers and relying on cloud providers. The thing is actually doing this is relatively hard and mostly it's hard because we want systems that are reliable, which are available most of the time. And if you have computers at home, you have a lot of issues. In particular, the computers we're using at DeFloor are these kind of computers, very cheap old desktop computers. They're not meant to be servers and we expect that they could crash at any time. These are some other examples that we had and those are still used actually. So these are also old desktop computers and we have some system which is based on only these kinds of machines. So they can die. We also have issues possibly with the internet where the electricity connection because we're at home so we don't have redundancy. It can go at any time. And to alleviate these issues, what we do is that we do distributed systems and we have a multi-site geo-replicated cluster. And so in our case, the DeFloor cluster is in three places. There's some nodes in Brussels here, some nodes in Lille and some nodes in Paris. And basically the aim is to build a system that makes use of some cheap hardware which is disseminated in all of these locations and they can basically relay one another when there's an issue somewhere and the whole thing stays up even if there are issues in individual locations. And so this is one of the reasons why I call this a low-tech platform because we're using what we have at hand, cheap machines and regular internet connections. One of the main components in this platform is object storage. And so I will not enter too much into why object storage except that it's very adapted to flexible deployments which are kind of inspired by what is done in the cloud. And indeed, Amazon S3 was created as a cloud product and in 2006 was introduced. And it became since then a de facto standard and many applications are compatible with this object storage. And so it makes sense to base our infrastructure on this kind of software because we can just like plug and play all kinds of various things which are already able to use this kind of storage layer as a backend. There were many actually alternative implementations of S3. MENU is one of the most common ones. I think CEPH is also an implementation. What we discovered is actually that these implementations are not very well suited to geo-distributed deployments. So deployments where nodes are in remote locations because in such case you will have higher latency between the nodes and it can cause issues and the system is basically a bit slower. And sometimes it's even really unusable. So Garage was made specifically for this use case. We make use of distributed systems theory, CRDT in particular, which I will talk about later. And this is basically the aim is to provide a drop-in replacement for Amazon S3 or S3-compatible storage systems which is available, possible to run directly on this kind of geo-distributed cluster and the data will be replicated at several locations and it's kind of transparent and it's supposed to be reasonably fast, not completely slow down all the replications which are running on it. One of the main ways we were able to achieve this outcome was to use CRDT and weak consistency. So this is a bit theoretical explanation of what is going on in Garage and I will have another slide talking about this later. But basically we're trying to avoid so-called consensus algorithms like RAF or PAXOS because these algorithms have issues and are actually very sensitive to latency. But just to list the issues in a clear way, the first of them is software complexity. I think RAF is actually a complex piece of software and it can be implemented badly and if you do it wrong it can lead to various unacceptable outcomes. And of course the issue of performance which I've talked about already. Those algorithms like RAF are using a leader so the leader is becoming a bottleneck for most requests in the system. So if you cannot really scale if you have a naive strategy with just one leader in the system it's also sensitive to higher latency because if the leader happens to be a node in a very far away location, well everything has to transit from there and then come back. And so if the leader happens to be the wrong node everything is going to be much slower in the system. And also if the system is disrupted and the leader goes down the system will have to take some time to reconverge and it's actually something that can take a long time especially if the latency between nodes is high and those are not able to communicate very efficiently. And so for this reason we made Garage a completely different design which is based entirely on CRDT internally which kind of solves most of these issues. Object storage is very likely very similar to basic key value store except that the values are objects like big blobs of data. And so here we have an example where we have the key which is there's no notion of a file system hierarchy so we will just have the entire path in the key with the slash it doesn't have any specific meaning. And the value is like some metadata here it's inspired from the HTTP headers because it's very strongly based on the HTTP semantics and then you have the binary data for each of your files. It happens that this semantics key value actually maps very well to CRDT and this is why we were able to make this work. So just to convince you in one slide that this is actually a worthwhile trade off this is one of the best results we have for Garage and it's a performance comparison for Garage versus Migno. So it's a simulated deployment where we have nodes which are simulated on a single machine and we add some artificial latency between the nodes. And so here we have nodes with 50 milliseconds so pretty long delay between them and so basically we can see that they take some duration which is a multiple of the round trip time latency but for Migno it's a very high multiple so some very common requests like remove object or put object will take like more than one second and for Garage we were able to bring this down to somewhere between 300, 500 milliseconds. So quite an improvement. So the main focus of this talk is to basically discuss recent developments in Garage because so we were here at Fosdame two years ago and I think maybe lots of people in the room are already aware of Garage. So yeah two years ago we were at the beginning of a grant by NGI Pointer and which was the first grant and it allowed us to bring this version 0.6.0 which was the first like public beta version that we launched. So it was like a point where we considered that we had some basic feature which was pretty good actually and we could ask people to come and actually many people were interested and this is the point where we started also to have some external contributions to the project. So we did Fosdame about at the time. In April we did version 0.7 and so version 0.7 was so focused mostly on observability and integration with the ecosystem. So we added support for metrics and traces using OpenTelemetry which is a standard for exporting observability data. We also added some flexibility because while we had originally built the system like supposed to have three copies of everything so we would expect to have nodes in three different data centers actually people were also willing to use the system with less copies so we added one or two copies and we also had some weaker consistency which was useful to like make the system faster or help recover data in some scenarios. We also added integration with Kubernetes for discovery of nodes so that the cluster is able to set up automatically the links between each nodes and we also added an administration API which is useful to like set up the cluster. It's basically a very simple REST API where you can create buckets which are stored spaces, create access keys, give rights to access keys, etc. I will just show a little bit about the monitoring part. So this is a graphnet dashboard that we made for Garage and as you can see it's actually pretty complete. We can monitor so here is the request going on through the S3 API endpoints and here's the request going through the web endpoints because Garage supports serving buckets directly as websites which is something we make heavy use of at Le Fleur. Here we have the error rate and more interestingly here we have some internal metrics so like this is the data which is being read and written to the disk on the nodes. This is some internal metrics for the communications between nodes RPC and these are some queues so how much data is remaining to be processed and so yeah just quick note here the GCQ is common points where people are like why is this queue not going to zero? It's normal that it's not going to zero because items are staying in the queue for 24 hours before they're processed just for information. So basically this queue should almost be to zero and this one too and if it's not then probably your system is under too much load. And we also have tracing so if you want to go further into like how Garage is handling a request you can use this feature. So here we're exporting traces to Yeager and this is a trace of a pretty standard list objects API call and so we can see that the list objects is first reading some data to get some access information on the access key and the buckets. So this is some very fast call because all this information is copied on the node and it can just read it locally and then it's going to do some actual requesting on remote nodes for the list of objects that should return and we see here that it's sending a request to two nodes and the request is taking a bit of time before it completes and then so yeah I think this is a pretty slow cluster and it's taking 100 milliseconds but on faster hardware it can be of course much faster. So this was 0.7 and then we did 0.8 so that was at the end of the NGI pointer grant and for 0.8 we had a pretty high focus on making the performance better. So first thing we did was like change the metadata engine because we were using sled and it had a lot of issues I'll talk about that and we did some various performance improvements across the board making basically some pretty good improvements in this kind of area and in terms of features we added CODAS so this is not a feature from Amazon but it's a feature which you can add on Garage is like limit the size of a bucket to a maximum size of objects or maximum number of objects and it's pretty useful in a multi-tonnets setup where we'd like to lend some storage space to someone but have them restrain to some fixed capacity and of course some regular developments on quality of life improvements etc. So yeah just to talk a little bit about the metadata engine so we were using sled which is a metadata key value store embedded key value store which is written in Rust so we thought yeah it's written in Rust it's pretty good Garage is also written in Rust so let's just integrate them and at the point when we started Garage sled was like one of the most popular key value stores for Rust but actually it's not very well maintained anymore and it had many issues so it was making very large files on disk because it was like just writing and writing and writing and probably it was some internal way to optimize performance but it was not very satisfactory for us to have like data files that were 10 times too big. The performance was also pretty unpredictable on spinning hard drives it was actually very bad and also from a developer perspective it has some API limitations and this has prevented us from implementing some specific features in Garage and hopefully when we get rid of sled we can actually do that. So as an alternative we added LMDB so LMDB is a key value storage which is used I think in OpenLDAP and some other software and it's a pretty established piece of software at this point so we consider it pretty stable it has good performance and it maintains a reasonable size of files on disk so this is the default now and we also have SQLite as a second choice originally we had not optimized SQLite that much so it was not recommended we had not made another bunch of tests but probably now it's okay to use as well and just to show some comparison we did some benchmarks and basically LMDB is much faster pretty much twice as fast as sled not really twice but actually significantly faster and for all these common API endpoints and SQLite was not optimized at that time I cannot I do not have the data updated for now. Another optimization we made is block streaming so the idea here is that Garage will store your data so when it receives an object it will split the object into pieces of by default 1 megabyte and then store these pieces on data servers all around the cluster and then when you want to read the data well your API request is going to go through some Garage node which is going to receive the request is going to look at the objects the metadata and determine okay we have to get this part this part this part from these different nodes in the cluster so it's going to do an internal RPC request to the storage node which has the actual 1 megabyte data block and so this is how it was working before basically this first node that was receiving the API request it would like just read the 1 megabyte into RAM and not send anything to the client before so basically here the client is just waiting for the data to arrive and the data is being transferred here between these two nodes between inside the cluster and so basically the client is just waiting for some stuff to happen inside the cluster where it could just have received some data earlier and so the optimization we made was actually pretty simple but it's pretty big change in the code it was to start sending the data as soon as it arrives to this intermediate node and so here we just have a small buffer of data which is received and waiting to be sent back to the client and so by doing this pretty small change we actually managed to reduce the time to first byte measurement so this measurement is when you do a request to Garage to receive to get an object you will specify the path of the object send your HTTP request all the headers etc and then you will wait for the server to reply the server will give you some headers saying okay the object is coming and then he will start streaming some data and so this measures the time between you the point where you start sending a request and the moment where the first actual bytes of the data file are coming back and here we are in a actually again it's a simulated deployment but we have pretty slow networking so 5 megabits per second so it's actually very slow and so before the optimization garage was here so we would have to wait pretty much two seconds before some data was coming because the like a one megabyte file was being transferred out this very slow connection before it could be returned. Minio has some average performance here and with the optimization garage is very fast and we're able to return the first bytes of the data and so this is important because for instance for websites you want to display the content as fast as possible and even if it's a big file then maybe the first bytes are very relevant so for an image you can have a preview in the first bytes for an HTML file we can have pretty much everything and so minimizing this time is very critical to user experience. So I think we pretty much managed to do this and we also did some other various improvements on the code pass and garage so on the bottom we have 0.7 then we have 0.8 beta 1 beta 2 here we removed some F-sync and it's completely optional to have F-sync and we're almost matching so here is like raw throughput when you're reading and writing big objects continuously to garage the throughput is still a bit worse than Minio but it's actually getting pretty close so there's still room for improvement in this domain and it's yeah we haven't done much more work on this but it's definitely something that could still be optimized I believe. So then it was the end of the NGI pointer grants so we did a bunch of conferences in France this was not me this was other people from Duffleur and then we started another grant by NGI 0 through NLNet and this led to the release of 0.9 and so 0.9 was actually a pretty big release so yeah we had a support for multiple AGDs per node and this is actually a pretty big feature because now you can have one garage node which is directly talking to the hard drive and you don't have to do some pooling at the file system level or some RAID system basically you will just format each of your drives independently as a file system and each of them has a directory, a mount point and garage will just use all of these mount points and like share the data between the drives. This is probably the model which allows for the best performance on the server with multiple drives. We also added some features for S3 compatibility so we added support for basic lifecycle and lifecycle is a feature where it allows you to clean basically some stuff which is going on in the bucket and so for instance in S3 you can start uploading an object using a multi-part upload so multi-part upload means you're initiating the upload at one point and then you're going to do individual requests to add pieces of the file and then once you're finished you do a complete request and then the files get uploaded that gets stored completely in the system and so it could happen that these multi-parts upload they get aborted in the middle you never get to finish the the the requests and in this case there's some data that's lying around in the cluster and so if you configure a lifecycle using this is a very standard S3 API if you support if you configure a lifecycle in your brackets you can basically get rid of all this tail data after say a delay of one day or something like that. And another thing we added for S3 compatibility is retries of multi-parts upload and this was actually because in S3 if you fail a part you can because maybe your network was broken you can try again this part and you can still complete your multi-part upload and in the first versions of garage we did not have that and you would have to restart the upload from the beginning now you can resume only a single part. LMDB is now by default we're deprecating SLED and we have this new layout computation algorithm which I will talk a little bit about. So as I said garage is meant to work on geo-distributed clusters so you have nodes which are in different geographical locations we call them zones in garage so here we have three different zones and the data is going to be replicated and each file has to be on different zones for optimal redundancy. So here is an illustration if we have five zones for example the blue file will be in Belgium France and Switzerland so in three different places and the red file will also be in three different places not necessarily the same here it's UK France and Germany. And the idea is that we do this using this kind of pre-computed layout which is a table which will say okay the cluster the data in the cluster is divided in 256 parts and each of these parts is assigned to a fixed set of three servers and for each part we have to decide so three servers which are in different places in the cluster and we have to also balance the quantity of data that is going to go on each server. So basically for 0.9 we added an algorithm which is able to do this in an optimal fashion so basically this table is computed once when you set up the cluster or when you add some new nodes and then it's propagated to everybody and everybody then knows this table and knows where to look for the data. We actually published a paper if you're interested in the details of the algorithms that we use. Okay so that was 0.9 and then we went on and worked on 0.10 and 0.10 is actually a beta version and I think we will not have a stable 0.10 because it's not worth it to like update to 0.10 and then update again to 1.0 when it's going to be out so I think we will just leave the 0.10 at beta and do the 0.1.0 in May but so I'll just talk a little bit about the 0.10 beta. It's mostly focused on fixing some consistency issues that would happen like when you were adding some servers in the system or removing some servers and so I will enter into a bit of distributed system theory to try to explain why exactly it's an issue and what is the solution that we made. So since I've said that garage is not based on consensus it means that we have to work with inconsistent primitives so this means we have to work with conflict-free, replicated data types, CRDTs and so these are not transactional, they are pretty much very very weakly consistent, very freeform to use and there's this last-writer wind register which is pretty much the fundamental building block of garage and so CRDTs alone are not enough to insert consistency so what we add is some read after write guarantee which is implemented using quorums and I will try to explain, I hope you will understand how it works, I think it's not so complicated but it's a bit theoretical so yeah, hold on. So read after write means if a client one is doing an operation right and the system returns to the client okay your write is saved in the system and then another client is sending a read for this data after the write is returned okay then the client two will read a value which is at least the value x that was written or a newer value this is what this means and so in practice it means that the system is basically evolving between these states so for instance we have the state here where the system is not storing anything and then we can store some value a or we can store some value b and if this is like a basic set if you have stored a on one node and b on another node then when the two nodes like merge together they will have stored a and b okay but let's do an example here for the writes so these are the three storage nodes and we're supposing that a node, a client is sending a write operation for value a so the value a is going to be sent to the network to these three nodes and at some point like maybe the purple node is going to receive the value a so it's going to move from not knowing anything to knowing the value a then the green node is also going to move from not knowing anything to knowing a when it receives the messages and so those two nodes are going to return to the client who did the operation okay I've stored the value a so at this point the client says so I've received two responses this is two over three so it's what we call a quorum and at that point the client says okay the data is stored in the system even if the third node has not received it yet and so this is the point where we can start a read request and so the read will basically is the client will ask all of the three nodes to return the value that they have stored and maybe the first node that will return its value is the red node and the red node has stored nothing so the read will first receive a value of nothing but then it will wait for another response and the other response will necessarily come from one of these two nodes and so it will necessarily read the value that was written and so it will just merge these two so this is why we use CRDTs to do this merge operation and consistency is guaranteed and maybe at some later point through some synchronization mechanism the red node will catch up and also receive the value so we have this in algorithmic form but okay and so the issue we have with this is that we're relying very strongly on these quorum properties so if we have three copies of data a quorum is at least two nodes of the three but what happens when you remove some nodes and add some other nodes in the intersystem so we will have some some data which was stored maybe on the nodes in red here and in the new system the data is being moved and it should be stored on the green nodes and so now if you do some quorum some right quorum on the red nodes and some some read quorum on the green nodes there is not necessarily an intersection of one node that has seen the read and the right and basically the consistency is broken so the question is how do we coordinate in this situation and how do we ensure that even when the cluster is rebalancing data we insert consistency and so the solution is a bit complex but basically we need to keep track of what data is being transferred between the nodes we use multiple right quorum so we're going to use quorums to write on the old set of nodes and the new set of nodes and switching reads to the new nodes only once the copy is finished so this is something we implemented for the in the context of the ngi grants we did some testing using a tool which is called jepsen which is very good for validating these kind of things and so as you can see in garage 0.9 we had consistency issues in most of our runs and in point 10 we have all runs are green except one which failed but at least there was no run where the data was plain wrong and it's actually this is very good result for us okay so this was point 10 now we're at fosdem and we're going looking forward to making a version one in april or may basically we're going to focus on security and stability there's a security audit that is going to be done by radically open security miscellaneous features should be improved this would be added and improvements may be in the user experience refactoring stuff and that's it for 1.0 hopefully we'll have that out in april this year and beyond so we have this survey which is going on in the community right now and so this is a list of the most requested features by the users of garage and actually there's a lot of work to do so the first thing is a web interface for cluster management so I guess for like visualizing the state of the cluster and setting up a new bucket as new access then it's s3 versioning which is so it's a feature of amazon s3 where you can have a you can save the historical data in the bucket and it's pretty good for like a backup system where you don't want to override data accidentally and this is a pretty crucial feature that we would need to have ACLs are here monitoring and various other things and so this is the point where I'm calling for help actually because there's a lot of work and I cannot do it myself so if anyone wants to step in and help us with this please do so we can probably find some some more funding actually we do have some funding in progress for someone who would like to do a phd on this system in in relationship with the garage so if anyone wants to do a phd in France working on some stuff come to us we have this application going on and we also can probably ask some money to nlnet which have funded us once and nji also once so we can probably get some more money if there's some specific task that that is planned and we have somebody who is willing to do it okay and so I will just spend the last few minutes of this talk to explain a little bit about how you can operate garage for people who have not run it or who are willing to scale their clusters to bigger systems so this is the basically what I would call the main screen of garage so when you interact with the cluster just start always by doing garage status and it will tell you if everything is fine so this is a five node cluster and everything seems to be fine but maybe you will have like failed nodes so this means that the connection could not be established and something is wrong and you should fix it garage is made like a some cake of different pieces like this on top we have the s3 api we also have some custom api which I'm not talking about in this talk and this is three api is actually implementing using some internal key value store for metadata and some block manager for the actual data of the big objects and then we have some systems here which maintain consistency in the system and so maybe to be a bit more specific about what's going on we have these three metadata data tables here so the first one is like the list of objects in the system the second is the list of versions of objects and so it's a bit different because an object can have a version which is currently in the cluster and a version which is currently being uploaded so for the same objects multiple versions can exist and then this version will also reference a bunch of data blocks so this is the table which has the reference to actual data blocks and so all of these tables are sharded across the nodes and in particular for the block reference table if a node has the has the shard for some references it means it's also responsible for storing the blocks associated with these references so basically from this metadata table we have a local counter for how many references for each block and then we have this rescind queue and scheduler which is responsible for ensuring that the locally stored data blocks are actually matching the number of blocks which have a reference in the in the store so yeah we have this block rescind for data blocks and this merkle merkle tree based system for the metadata and so if you do this garage stats command so there's not status it stats never command you will get some information about the internals of what's going on so these are the metadata tables and you can see here objects version and block reference so these are the number of items in the table and there are also the number of items in the merkle tree which is always a bit bigger and then you have here the number of rc entries for the block table so the number of blocks which actually have a reference in the system so here we have 42,000 data blocks but we have actually 334,000 block references so this means that blocks are almost referenced by 10 different objects each on average and then we have some information on the actual nodes so the partitions here means basically is how many of the lines in the tables are affected to each of these nodes so if you have more more partitions you're going to use more storage space basically on that node it's proportional and this is a metric which is given by the node actually it's it's measuring on disk how much space is available it's not the use space it's the available space for the data partition and the metadata which is not necessarily on the same drive and so from all this information garage is able to basically tell you how much data you can still store on the cluster so here for 600 gigabytes and if you go even further you can get this list of workers so workers are basically background tasks which are running in garage all the time and so you have these tasks which are block readings so these are copying data blocks between nodes when they're missing and these are synchronization tasks for each of the metadata tables and you can change a bit the parameters of these tasks for so for instance for the the block re-synchronization you have re-sync tranquility and re-sync worker count and tranquility is a metric which can be increased to make the system go slower and use less i o if if it's serring you're saturating your i o you can increase the tranquility and if you want it to go faster you can just put it to zero and then there's also the worker count so you can set it up to eight and then you have eight parallel threads which are sending and receiving data blocks in the network there are some potential limitations if you're running extremely extremely big clusters probably you cannot run with more than about 100 100 nodes i mean you can but then the the data will not be very well balanced between the nodes and this is because we're using only 256 partitions we could probably compile a bigger version in garage but it's currently not the case and on the metadata side if you have one big bucket which is containing all your objects well you will have a bottleneck also because the first table the object table is going to store the list of objects on only three of all of your cluster nodes so if you have lots of data split your data over different buckets and also on the side on the side of the data blocks so the data is split into so if you have a hundred megabytes file in your block size is one megabytes your your file is going to be split into a hundred different files so we will have a lot of small files on disk you can increase the block size to reduce the number of files and if you have more files the processing of the queue can also be kind of slow and this is of course also dependent on your networking conditions and so just some advice for actual deployments for the metadata if you're going to do a very large cluster we recommend doing some mirroring on two fast NVMe drives possibly ZFS is a good choice garage itself does not do check summing on the metadata so it's good to have a file system that does it for you lmdb is the recommended storage engine and for data block it's a bit different and we have other recommendations we recommend using an XFS file system because we actually do some check summing for each blocks because we always compute hashes of the blocks in garage so you do not need to have a file system which is doing this this check summing again it would be wasteful so just format your partitions as XFS which is one of the fastest file systems and store your data directly on this if you have a good network and some nodes with a lot of RAM you can increase the block size to 10 megabytes at minimum and you can tune these two parameters according to your needs and of course you can do some more like global tuning split your data over several buckets use less than a hundred nodes if possible or come to us and we can work out a solution and you can use also gateway nodes which are good way to like have have nodes which are so have the request go faster because if you if you have a local gateway on the same server as the client it can basically route the request directly to the data server and you can possibly avoid run for time we have not made any deployment bigger than 10 terabytes on the side of the floor but actually some people have as we learned from the survey and so if some people are in the room it would be great to share your experience and with this I think I've talked enough garage is available as a open source software on the website of the floor at switch and in Rust and we have a matrix channel and email you can contact us and I'm taking some questions um so the question was if you store websites on garage can you integrate with dns and basically we copied the semantics of amazon where you can have a bucket whose name is the the domain of a website and so garage will route requests to the data according to the host header of the htp request and basically you just have to to configure your dns server so this is something you have to do as of at sort of garage but you configure your dns server to write the request to your garage server and then garage will just select the good bucket with the good content based on the name of the bucket and you should add some reverse proxy probably in the middle if you want to tell us because garage does not do tls yeah it's because when one of those website servers goes down then you need to reroute to some yeah so at the floor we have a solution but it's external to garage so it's more tooling yeah so in all the examples you mentioned you have effectively one node for one zone what if is that by design or can you have multiple nodes per zone or how does that I think it's uh it's uh so the question was in the examples we have uh one node on each zone and can we have more than one node and so I think it's yeah it's just the examples were not very good but yeah of course we can have multiple nodes in a single zone I think maybe in this in this graph no this is not the good one but there is a there is an example where we have several nodes in the same zone it's not a problem yeah and if you have let's say everything else calls and you only have the one zone that's remaining will the node still try to balance the data between themselves or is that effectively a you're in trouble so the question is how is data get being balanced between the nodes if you have like one zone where it's have only one node and maybe the node is smaller and so garage is trying to preserve this property of having three copies in different places you can you can ask it to have only in two places but by default it's three places and this means that if you have only three zones and one is a smaller server then you have smaller capacity of the cluster yes yes so the question yeah so the question is why did we integrate multiple disk support instead of having multiple nodes in the same zone and I think one of the most important reasons is that this way you can reduce the total number of garage processes and entries in this in this table basically because this table has only so many rows and if you have start having many different nodes it's not going to be well balanced so reducing the number of nodes helps us be better balanced basically yes I saw many of your design matching the one of open stack swift and I was wondering if you investigated using it okay so the question is there's many design points which are matching open stack swift and have we investigated using it I personally have not used open stack swift and I have not looked so much into it yes so the question is despite putting this much effort in multi multi node deployments is it still worth running the system on a single node I think it's it's so many people are doing it and I think one of the reason people are doing it is because garage is pretty simple to set up and to use so I think it's definitely possible I think there are also other solutions which are good for single node setups so yeah try it out and figure what's works best for you and okay so I think we're done for this talk thank you
MicroCeph: Get Ceph Up and Running in Minutes
Hello, welcome. My name is Peter Sabini from Canonical. I'm a software engineer there. I work on various CEP stuff and I'm very excited to present Microsoft with the tagline Get CEP up and running in minutes, unlike my slideshow. So problem statement. Microsoft packages CEP. This is a big complex system with distributed configuration, distributed components, complex bootstrapping, procedure and complex operations. It also has non-trivial hardware requirements. It's not just like you can download a package, install it on my notebook and be ready to go. It also has impact uptake and adoption among users. So if you're, for example, a famous physics research organization with thousands of nodes in your storage cluster, you probably have trained staff on hand 24-7. So you're good, you don't need Microsoft. If not, if you don't have a team of trained experts on hand, maybe Microsoft is something for you. So what is Microsoft? Microsoft is a single package staff cluster. Everything is in one file. We're designed it to be a simple setup so you can get a running staff cluster for command lines. And it runs on your notebook. You just need one node with obviously one hard disk. So simple possible staff cluster you can do is install Microsoft, putstrap the cluster, add some simulated OSDs, disk drives. So this is loop files in this case. No extra block devices required. And then wait a few minutes and your staff cluster should be ready to go. How did you do this? Microsoft is a snap package, as you might have guessed. Snap packages have the benefit that you're completely isolated from the host system. All the user land is in separate namespace. You just need a kernel, network devices, block devices, hardware, etc. to get up and running. This gives you a good isolation from the host system and gives a consistent environment across different operating systems. Some other goodies, it's isolation from the host system also means its access is isolated. The snap package just cannot do anything it wants on the host system, which is good for safety, security and robustness reasons. And you have standardized risk levels. So if you want to install release candidates, etc., there's a standardized way to do this. A little bit of overview of the Microsoft architecture. You have a service management demon that manages the standard CEP components and has a distributed database, a DITS proposal for storing configuration and no topology. Also included in the snap package is a CLI that talks to the service management demon via an API. All this is just a standard Ubuntu devian packages, no special binaries here involved. I mentioned the service API, so everything in Microsoft happens via this service API. Things you can do with the API, like listing block devices, adding or removing nodes, adding and removing disks. Everything works via the API and the included CLI is just another client for this API. So this is obviously great for integrating it in other systems. Some more internals. Microsoft is built on the micro cluster library, which provides this distributed configuration database, which is using RAVT for consensus. It also provides cluster membership and API framework. I already talked a little bit about scaling down, so single node systems work. One important component here is that we automatically manage the crush rules from CEP. So this means that as you start up with a single node, you get a failure domain of OSD, so in effect your single node clusters work out of the box, but if you add more nodes, your resiliency and your failure domain gets scaled up automatically. It's also possible to provide custom crush rules. This is important, for instance, if you go for larger failure domains, for instance, if you have a failure domain of rooms or racks, you can implement this. Microsoft itself doesn't know about your rooms or racks, but it won't step on your toes if you provide a custom crush rule set. Microsoft is famously scalable to thousands of nodes. Microsoft's scalability upwards is primarily bound by the RAVT algorithm used in the VQLite database. For performance, I would like to note that we're not sitting in the data path anywhere for CEP operations. You get the standard CEP performance behavior, also with Microsoft. Some integrations. Microsoft is the back storage back end for a number of projects in Canonical, for instance Sunbeam, MicroCades, MicroCloud and LXT. There's also, if you're running Juju models, there's a charm available currently in beta to integrate Microsoft into your Juju Clouds. Last but not least, there's a nice little GitHub action that we provide to integrate Microsoft into your GitHub CI workflow. So if you need, for instance, a S3 endpoint for your testing pipeline, this is an action that would help you with that. Okay, on for demos. I prerecorded these demos for time reasons and also because I'm very bad at talking and typing at the same time. So let's see how this goes. So this is the single node setup we talked about before. I'm going to install the single node. Microsoft Cluster gives it some simulated disks and enable a Rata's gateway, which would give you an S3 endpoint. Yeah, installation. We have the standard stable risk level here set. So this is what you get by default. You see my capital DSL connection here. Yeah, so we bootstrap the cluster. This is done pretty quickly. We can see now that we have a few services running already, but no disks. Then we add some simulated disks. These are just loop files. This is useful for lab environment or for testing. Don't use it for production. For production, you would use separate block devices, obviously. But if you want to get going on your laptop, that's the way to do it. We enable Rata's gateway. You can see it is active here now. We create a Rata's gateway user. This is just a standard safe way to do it. It's a little ugly command line here. And yeah, and we're done. We can use our favorite S3 client to access our new Rata's gateway endpoint. So just to prove that it works, we are creating a bucket here and put some image up in this bucket. Yeah, so that's for this demo. So this is the simplest possible case. Let's do something a little bit more complicated. Say we have got a few extra nodes now. We want to in an expander cluster and provide it with real block devices. This is the way we will do it. I'm now using the candidate risk level because I want to use some features from Microsoft that didn't make it to the stable risk level yet. So to cluster Microsoft, you need to get the token from the bootstrap node. So the first node that we provided, like this. Name the node you want to add and get the token for it and provide that token to the node that you want to add here. So, and yeah, small typo. These have happened as well. And yeah, and now all our nodes are clustered. Let's check Microsoft status. We can see all our new services here, but all the new nodes don't have any disks yet. So let's add some disks. So what I'm going to do here is add a user feature that comes from the release candidate that is automatically pro for empty block devices. So anything that's not mounted is clean. We take as a block device here with this switch. Let the thing settle a little. You can see there's lots of virtual disks from Kime. And we have a lot of disks in our cluster. So the safe cluster is still setting a little bit, but we suddenly have a lot more space available. So one thing we can do is provide a second radius gateway endpoint. Now we can see that the data we put in before is still here. So that's reassuring. And what we'll try to do now is we put in another OSD on the first node, but this time we want to make it encrypted. So full disk encryption is something we provide here. It relies on the dmcrypt kernel module. Not all kernels have that, so that's a little bit of a gotcha. You need to make sure it does. And also, this goes by so fast, also this is something that the snap is not allowed by default to do. You need to connect the dmcrypt module explicitly to make this happen. But once you do, it will give you an encrypted OSD device. That's the one up there now. Just to prove that this is a dmcrypt device, there's a setup here. Well, we all have the loop file for OSDs from the first node that we originally installed. Let's remove that. We have plenty of block devices now, so that our cluster has real disks. So as a last step for our production cluster, something that snaps to by default is auto refresh. This is something you don't typically want for your self cluster. You want to control updates for your self cluster. And that is a step you do, is hold all the snaps and prevent auto refresh so that you can refresh or update your software to your own leisure. So, yeah, so that's what's for the demos. Short outlook, what comes next. We want to make the clustering experience a little smoother still. No passing around of tokens. So one thing we could do with this or we planned to do is on the local network use MDNS to determine new nodes. Another thing that we want to do in the near future is provide built-in HAA and load balancing for other gateway endpoints and also RBD mirroring support. So that was it for the demos and for the... Thank you. Any questions? I don't know if we have time for questions. One question maybe. Otherwise, I'll be outside. Just talk to me and I'll be happy to answer your questions. Oh, sorry. Here you go. What architecture do you support with CPL architecture? Can you repeat the question please? What CPL architecture do you support? So snaps are pretty flexible. We develop on AMD64, but ARM is tested. I don't know if the top of my heart had... But ARM, AMD64, power, PT and risk, I believe...
Container Storage Interface Addons: Extending CSI specification to provide advanced storage operations
Let's welcome Rakhiv and enjoy it. How do you? Thanks guys. So, hi all. Today I'll be presenting a talk on container storage interface add-ons, in short CSI add-ons. We'll be talking on how it extends CSI specification to provide some advanced storage operations. I'm Rakhiv, I work at IBM, I'm the maintainer of CESI and CSI add-ons and core contributor at Rook. You can ping me up if you have any doubts later on. So, in short, first we'll go through what are containers and container orchestration, increase storage drivers and come to CSI specification and what is CSI. Then the deployment, next we'll continue with introduction to CSI and some of its Fung operation which we currently support. First, let's start with container. So, for containers we usually whenever we are building code or a software we just build it into an executable and we use that. But executable needs, it has dependencies which I used to be present in the OS to run. So, to get rid of that we built containers, which has all the dependencies packed in and you can run it almost anywhere. So, that is a container and we later on came up with how to manage it and scale it up, deploy and stop. So, there we have container orchestration which usually just automates the deployment, manages it, scales it, upgrades the versions, etc. And we have a few platforms that are very popular to do it. Kubernetes, Docker swarm, Apache and Nomad. This is for container and container orchestration. So, containers were you and container orchestration were started with being stateless in mind. That is they can go down at any point and come up back again. So, there was not a very need for persistent storage there but then they realized this kind of model actually fits perfectly for other applications. And then came in the need for persistent storage in such a platform. So, like we have this separate container orchestration platforms like Kubernetes, MeSOS and there's also storage systems like providers like CES, Longhorn and Gluster. But so the main problem was like how do they integrate with the Kubernetes platforms, Kubernetes MeSOS. At the beginning there were increased storage drivers that is like code for each storage systems within the container orchestration platforms. So, like Gluster would have its code in MeSOS and Kubernetes and Nomad separately. But likewise with Ceph but that was not scalable because it had a few advantages like we have the code if there was a bug found like if there was a Ceph bug found in Kubernetes. They had to wait till next Kubernetes release to fix it. So, it was very hard for them to scale and like develop independently. So, there it led to the development of CSI like people from different container orchestration platforms came together and they wanted to bring in a single interface, container storage interface, which would allow like the storage platforms to build just one plugin so that they could just write one implementation and it runs on all the platforms. So, yeah this CSI specification mainly that's where it started it just specifies APIs to create, delete, mount and mount volumes plus snapshots and storage vendors just needed to develop one single CSI driver to and it will run on all other platforms. A usual CSI deployment has two things like a provisional deployment and a node plugin demensate. Provisional mainly handles the creation, deletion, snapshots and resize of volumes. So, like all these requests of create, delete, resize and snapshot comes from like a Kubernetes entity or a site car or a like a MeSOS or Nomad. The CSI driver in this case just handles this request is just a API call for it and then node plugin demensate mainly handles just mounting and unmounting one volume. So, you have one usually a provisional deployment consists of two containers like two pods right a pod is just a bunch of containers running together. So, you have like two but one is active at a time like with the leader election. So, if one goes down other one will become active but node plugin demensate is something that runs per node for each node you will need to mount and unmount volumes. Let's go ahead. So, this is how it usually looks. It's basically in Kubernetes. So, you have nodes there and Kubernetes service with API control manager and snapshot controller. So, this is a proper CSI driver provisional pod. So, you have four containers running one of it the CSI plugin is the actual CSI driver that is the code written by the storage vendor and all of the other site cars basically are written for Kubernetes to like interact with this plugin. So, similarly these four functions of these four containers will be handled differently under container orchestration platform and then similarly we have node plugin demensate so that it can mount volumes onto the node separately. Let's move ahead. So, here comes CSI add-ons. So, like sometimes just the basic operations of creation deletion snapshot mounting unmounting is not sufficient because like CSI driver becomes like a pathway of interaction between the user operation going on in a container platform and the storage system. So, you can kind of extend it to handle more scenarios that led to the formation of CSI add-ons. So, one of the why we were starting this project the main problem was like reclaiming space. The question was to run FS stream on a file system mounted on a RBD device like the crux is that when you delete a file in a file system the file system doesn't immediately release that space back to the block. It is done later on. So, like there will be a storage consumption discrepancy between the file system and the block device. So, if you run FS stream it basically releases the storage and admin can see exactly how much is consumed. So, that was one of the starting points of how this project came on to be. So, we again we have to keep it in the same way as the CSI specification. We have three things. First is the specification provides the API's. So, currently we have four services. One is identity that is definitely required. It basically allows the CSI driver to advertise and register itself with the CSI add-ons controller. And then there is three different abilities which we support currently and which I will discuss about later on. And the main one CSI add-ons controller this will watch and respond to custom resources. Custom resources are something in Kubernetes where which allows users to have configs and tell the controller what to do. So, it is Kubernetes specific but it is a way of giving instruction to the CSI add-ons on what needs to be done. The controller then connects to the sidecar and then the sidecar in turn forwards the request to the CSI driver finally. So, sidecar also like kind of advertises itself that it is available and allows the controller to connect to the CSI driver. So, the new deployment with CSI add-ons kind of looks like this. So, similarly how we had multiple sidecars from Kubernetes we have one more sidecar here and here and we have the controller. So, this is the new request flow how it works. So, coming to the first operation reclaim space is just a configuration giving us the volume details and tell a just so. There is two kinds of thing in reclaim space. So, one is online operation and offline operation. Basically being CSI add-ons we do not tell the CSI driver what it needs to do. We just forward the request and it the storage vendor will implement what it needs to be when they receive the request. So, currently for online operation at CFCSI we do FS-trim on the file system more volumes specifically. Other volume more like block mode we do not do anything we just return success. And for offline if it is a arbiter volume we do RBD specify it basically punches out zeros like not save continuous zero zeroes in the block device to save memory. So, it is up to the storage vendor to decide what operation needs to be done in the back end. So, since the operation is reclaim space we need to do something regarding the storage consumption. So, next one is network fans this came a bit later on but it is also pretty prominent to one. So, this allows the user or admin or some kind of automation to tell the storage system that this range of IP or IP ranges. CIDR needs to be block listed and the clients in this range of IP should not be able to communicate with the storage cluster. So, this plays a critical role in two scenarios one is the first one metro disaster recovery. So, it is like you have two clusters which has applications of workloads running accessing a single storage cluster outside. But let us say the cluster A goes down. So, then we will just fence it off so that it does not read or write from the cluster and there is no data corruption in the storage cluster. And only B will still be able to access it. So, we basically if a cluster A has a goes down we just block it and let B still have access to the cluster or other way around depending on the scenario. And then one more is node loss you can have just one of the nodes going down that time we can just block list one of the nodes. Next we have volume replication. So, here again CSI add-ons enables additional operations between the user and the storage system. So, volume replication is one of those where we are just forward instructions between the user and the storage system. Here we are basically telling the CSI driver to enable replication for a certain image. And while image I mean like volume between one two clusters. So, here we are setting up replication between cluster primary and secondary and letting the storage system know which state the image needs to be in. So, primary will be primary state and secondary state primary the image will be active and in use. If the primary cluster goes down we can activate the secondary. So, there will be CSI drivers and CSI add-ons on both the clusters and user will need to like push this state at two primary when the cluster goes down. So, then it will take over all the operations. So, most currently the operation supported like promote, demote and resync. So, this has become a pretty critical part in disaster recovery. And the main consumer of this is Sraman right now which works with OCM open cluster management to handle disaster recovery scenarios with multiple clusters. And this will soon be also supporting volume snapshots. So, that you can have version snapshots sync between two clusters and then like you it's automatically backup like better than backup. So, SEPH only sync the diffs between two images. So, you can have snapshot one taken today and snapshot two taken today. Only the diffs between those two will actually go into the second cluster. And if you lose the first cluster can just restore the second snapshot on the secondary cluster and you start using it. That's one more scenario that's coming soon. So, future roadmap will soon have maybe rotation of encryption keys. This is a bit complicated to implement because we need to make sure the encrypted keys are not lost in the process. So, that is one of the scenarios and next is volume group application. The volume application which I talked about earlier just was meant for one of the volumes. Here we will have it specified as a group. So, we can replicate a whole bunch of volumes together instead of just one because most of the time applications may be using multiple volumes or like applications may be using multiple volumes. So, like when we when a bunch of applications is running together they may have dependencies on the each other. So, it implies that data and the volumes also may be dependent on each other. The third one is repairing corrupted file system. Like the CSI driver is the one who mounts the volumes and take care of it. So, it will have kind of good knowledge on how a file system can be repaired. And so, we may introduce our resource to actually kind of implement this. This is the end of the future roadmap. These are few references. So, do we have any questions? Okay. Then thanks for listening. Thanks so much.
Welcome to Testing and Continuous Delivery devroom
All right, good morning everyone and thank you for coming so early. I'm just going to take less than five minutes of your time to say the welcome and then we continue with the awesome speakers that we've got. As you can see, there's a lot of us here today. If you aren't aware of the history, in the past we did have the testing and automation dev room which was separate from the CI CD dev room and this year the two dev rooms have been merged and going in the future we will continue like that. So we do have the two teams from the dev rooms organizing together this year. So we start with Anders. Say hi. Yeah. Then we have Jan. Olivia, Fabrizio, Carlos. We've got Sirio who is at home at the moment. He cannot join us because he has newly born small kids and my name is Alex. Nice to meet you. So you all know the rules I think. Don't be a jerk. Enjoy the presentation. If you see more people coming in late, go towards the middle so that they can squeeze in on the sides and not jump over you. And if you want to talk to the speaker, it's up to them whether or not they will be taking questions during the presentation or after, but we do not have a lot of time for switch over. So if you want to talk to the speaker, please take it outside so the next speaker can set up and we can continue and then you can just come back in. And this is it, I guess. Thank you again for coming and let's start.
Streamlining Developer Experience: The Power of CI/CD Standardization and Interoperability
I have, I've just been informed that I have a two hour talk. So we're going to use that, we're going to use that time wisely. Hopefully we also have like a minute. So I can't start talking with the talk until for another minute. So with that, who's, this is your first time at FOSDEM. I'm also raising my hands for this first time, tried for years and finally got here. So cool, glad you're all here. So now it's like 25 seconds, we have to kind of just whatever. Yeah, everybody awake? Who, what was the latest you were out? Like who was, who was out, who went to bed at like 10? Okay, good, nobody. And that doesn't mean 10 this morning, when after this and just been up. Who was, who went to bed after midnight? One, two, three, three, 15. That was three, 15. Four, are you, you're still awake. You're still good. Okay. All right, so we are, we're going to start now. Hi, I'm Jeremy. We're going to talk about streamlining developer experience, power of CICD standardization and interoperability. Really going to kind of touch on when we think about developer experience, how, what's kind of the role of CICD in that and how it fits within all of the different kind of tools and systems that we use. So I'm going to talk about that. A quick note, I did use on a fair amount of these slides. Because I had evidently time on my hands. I used chat, GBT and Dahli for the images. So that is a very interesting, don't, don't go into it thinking you're going to get exactly what you want. As you'll see on some of the slides, it's a little weird, but why not? So we're going to jump into that, figured I'd try something new. Okay, so you said, my name is Jeremy Meese. I'm the co-founder of a kind of a stealth DevEx startup right now. Hopefully we'll have some news in the next couple months. But yes, so Jeremy Meese, I've done, I've been in tech for a couple decades. Previously, most recently, I was at CircleCI for about three and a half years, running the DevRel and community team, doing a lot of talking around CICD and stuff. So that's me. Now I did have some early feedback on the title of this talk. So Gray had a lot to say around, this is probably heretical, what I'm going to talk about. I don't know about that. Heresy, but I felt that was kind of harsh. He hadn't even heard the talk and already he's given some feedback. But we're going to kind of talk about this evolving landscape of software development, especially in the modern world. If you've ever seen the CNCF landscape, could not even fit on one slide. I mean, it's fit on one slide, but there's no way you're going to read it. That's how big that's evolved. I really should have had a slide that showed some progression over time. But when you think about, this was a couple of days ago. I'm sure it's grown in that time. But continuous integration has a good, when you kind of zoom in, it has a good section of that. And it stands really as kind of this, CICD does, stands as this kind of transformative pillar that kind of has reshaped how we look at software and how we look at deployments and how we look at delivering quality software to, hopefully quality software, to the users, to the companies and such. And kind of driving that very experience. I also put out, kind of feel, when we think about developer experience, what is the kind of shortened version of that? And we're going to use DevX. The internet has spoken, so we're going to use DevX, not DX. So you all say DevX for short, instead of saying all of developer experience over the next three hours, I think we have. Okay. So developer experience, kind of defining it, it really kind of encompasses the journey of the developers as they're learning and deploying technology, whether that's software, whether that's even hardware kind of fits into that. And when you have a successful developer experience, it really is focusing on eliminating the obstacles that hinder developer or a practitioner from being successful in their endeavors, whatever they're trying to do. Now CICD's transformative influence that we've seen on the developer experience is really pretty profound. Because we've had kind of this dynamic shift in how developers over the years have collaborated, how they create, you know, how they deliver software. And by automating, you know, the pipelines and like the integration and testing, deployment processes, all those things, it really is to empower developers to really gather the feedback necessary with those feedback loops, having faster ones, so that they can improve the code quality and the ability to continue to iterate swiftly. That is not a Taylor Swift drop, it's just iterate swiftly. But by streamlining workflows, that helps to reduce a lot of the friction that we see, provides a lot of intuitive tools. And so you have like this good DevX empowers developers to focus on creating that high quality code we talked about, fostering the innovation and really eliminating and contributing to, you know, faster, more, ultimately contributing to faster, more reliable software delivery. So we're going to kind of hone in on the two of the critical pieces of what that looks like in CICD with standardization and interoperability. So from the CICD standardization side, that really brings the consistency necessary to your pipeline. So that you can reduce the friction, you can enhance the collaboration between your different coworkers or different teams. So we're going to also look in this at a few open source tools. We're going to look at Argo and Flux. I'm not going to bring up any demos or anything, but we're going to talk about some of the features that they have that really work well with this kind of standardization idea of standardizing processes and how you deliver good software that way. Then we're also going to talk about the interoperability side, which is kind of ensures a seamless integration across multiple different tool sets, everything from observability to, you know, different, potentially different frameworks. You have all the different tools that kind of integrate with that. So with that, we'll look at, you know, some of the features that Spinnaker has and also tools like Backstage, how they kind of work with the developer experience on the interoperability side, bridging kind of tool chain gaps and such. At the end kind of whole thing, we're going to kind of really dive in, not really dive, but just kind of summarize how that both of those things play a pivotal role in optimizing developer experience and improving, you know, overall productivity, which is really kind of the idea. All right, so the standardization side, that really means we're trying to minimize the variability, reducing all the errors, fostering environment where developers can, you know, again, collaborate. That's efficient collaboration. So when you're standardizing that, you're kind of defining clear, repeatable, no, not yet, clear, repeatable code integration, testing, deployment processes, all of those kind of things when you standardize that ensures that you're having like your pipelines are streamlined, the developer process becomes a lot more, a lot smoother for everyone that's interacting with what you're trying to do, whether you're building something internally or for external users or both. So when we think about that kind of the steps for what kind of better practices look like for that, we start with kind of assessment and analysis with that. So here you're really kind of looking at your current CI CD pipelines. You want to understand kind of existing workflows, the tools, all the processes that you're using to identify the pain points, the bottlenecks, and then, you know, areas where standardization really is needed. And then the next kind of thing with that is you're going to kind of look at all the specific requirements that are in place and the constraints of your projects and the development of that first step there. Then when you're defining this, you're really going to kind of define the goals and objectives that you're trying to achieve with your pipeline standardization. And those goals are really going to try and align with the overall dev strategy that you have and some of the organization business objectives. You don't want to stray away from that. And that also kind of helps you start to kind of identify those KPIs that are going to really measure what success looks like for you in your development process. Usually that looks like you're probably going to try and reduce deployment times or decrease, you know, error rates. We always want to try and obviously decrease error rates. Then you want to look at what the tools and practices are going to be for your CI CD standardization. So, you know, things are going to align with your organization's needs and goals. So that's things like Jenkins, GitLab, CI CD. There's other cloud native solutions. AWS code pipeline. There's I think Team City, I think is on the cloud side. There's a bunch of different options there. But you want to make sure you have those tools and practices that help you achieve those goals. There's some standardized templates for pipelines defining those essential stages of build, test, deploy, what's that going to look like for you. And then kind of what a standard configuration would be for all of your pipelines. And then you're also going to enforce a lot of those coding standards for CI CD, those configurations ensuring that there is consistency and readability for everything that you're doing. So somebody can come in and understand exactly what you're trying to do and you don't have to spend a lot of time kind of. I mean, there's going to be onboarding, but you want to make it as standardized and relatively simple as possible. And then on the documentation training, which is kind of touched on quickly, you want to make sure that documentation is comprehensive, that you're outlining all of the standardized processes that you have in place. Make sure everybody is aware of how you work, including how you work with your workflows. How do you, you know, what's your standard configuration? What are some of those better practices that you're using inside your organization? Make sure that's documented. And then you're also providing a lot of those training sessions for your dev teams and your support teams that work with the dev teams, ensuring that they're understanding and can be really effective in as they use your CI CD tooling and all those templates that get created. Then you kind of move into the version control side. You want to make sure you're storing those pipeline configs in some kind of VCS, you know, Git, GitLab, GitHub, whatever. That practice there is really going to ensure that your configurations are versioned so you know you can go back to something, you know what the changes were, you can trace where potential errors are, and you can, like I said, revert, you can easily get back to something if you need to. And then implement your branching and pull request strategies. It should mirror what you're already doing in your standard that you've already hopefully documented that we just talked about, but making sure that all of the, you think about the standard templates and such is that they're all kind of following in that same path of branching, pull request and such. And then automated testing. Since this is testing room, we want to make sure we talk about testing. You want to make sure you're integrating your automated testing and validation into the pipeline and all those templates to ensure that, you know, those standardized configs produce your expected results. Don't just create a standardized template and not test out to make sure that works. Otherwise you're going to create problems downstream. Another great opportunity to put code reviews in place. Build out your standardized templates and then start code reviews. Make sure that you're not missing something. Bring more eyes to it. Validate that, catch those errors before they become an issue downstream. Okay. And then continuous monitoring, continuous integration side of this or continuous improvement. Make sure you're monitoring and having alerting in your CIECD pipeline so you're detecting the issues bottlenecks in real time before they become an issue. Establish kind of this culture of kind of continuous improvement. So that means you're regularly reviewing, updating those, you know, those pipelines based on the feedback and evolving kind of framework that your projects and pipelines go through. Make sure you're not, those templates aren't being left behind. Also governance and compliance is very much an important part of the CIECD standardization. So make sure your policies are enforcing pipelines, the standardized pipelines and compliance with industry regs, regulations or, you know, some internal or external standards that are in place. Make sure that you're accounting for those. Really audit and assess how you are adhering to those to make sure that you continue to improve there as well. Scaling and adaptation, ensuring that, you know, those standardized pipeline templates are something that can scale and adapt to the different project types that you have. Every team or, you know, an organization probably has different types of projects that you're all working on. So make sure those templates are easily applied to different, you know, different things that you're doing, different sizes, different technologies that might be in place inside your organization or that you're developing for external. Maintaining the flexibility kind of helps there to accommodate the unique requirements that each project is going to have, but also making sure you're still adhering to your standardized core practices. And then there's that feedback loop. Very much a part of DevOps is feedback loop. Even more, that's part of why continuous integration and continuous deployment is there, is it helps you give you that feedback loop. So have an environment where, you know, developers can really collaborate and provide that feedback and contribute to continuing to improve those standard practices and then continuously kind of communicate the benefits of those to outside your organization. Make sure everybody knows what you're working on and knows that the achievements that you've had really helps kind of drive more collaboration, drives more obviously awareness of what your organization is doing, but also brings a lot of praise to the teams internally. So by kind of putting these steps, these kind of best practices on the standardization side, organizations really can kind of implement more efficient, consistent workflows so that the developer experience on the continuous integration, I'm sorry, on the standardization side is really you start to see those, the results of that. So we're going to right now kind of look at kind of Argo and Flux, just some of the features that they have that help implement some of these better CI CD practices for standardization. So Argo is reusable workflows so orgs, they can really define reusable workflow templates that set up the standard sequences for CI CD like build, test, deploy so that devs can reuse those things across projects, not just within your, the project you're working now, but you can use reuse those templates elsewhere. Argo also follows GitOps principles. So your configs, workflows, they're managed as code in Git repos, ensuring everything's versioned, like I said, traceable, easy processes to kind of collab amongst dev teams is really kind of a core piece of that GitOps. And then the way that they manage artifacts, Argo supports managing and storing those artifacts like Docker images as part of the CI CD process so that you can make sure that the right artifacts are used in the right situation and deployed across the environments and they can be used as inputs in subsequent steps as part of your template. So those are some things that Argo has in place specifically. And then from Flux, we have the declarative config model that they can operate on where systems are, you know, their desired state of how they're going to exist as a system is defined in code. This is what orgs can kind of define and enforce already those standardized practices in a VCS system, ensuring that you can kind of track things consistently. On the continuous synchronization side, they allow you to kind of continuously synchronize the desired state in your Git repos with the actual state of, for instance, like your Kubernetes clusters. So that is that changes, everything can continuously be replicated so that you have a standard config and deployments and that are consistent across your environments. And then there's the policy side. So that's kind of where we have, does it say that flagger? Yeah. So that's kind of the feature flag. So Flux has feature flag capabilities through Flagger, which is a part of that, so that you can deploy and allow orgs to define the different rules for how things get deployed, either different sets of things or to different users. So you can really do a lot of that A-B testing if you think about progressive delivery. It's that kind of thing. Yeah. So who here uses Flux? Okay. What about Argo? Okay. So about, I think there was some overlap. Good. So when we want to achieve these kind of standardized workflows, kind of the summary here is like achieving that. You want to make sure your templates with Argo and Flux, they allow for the standardized templates and definitions so that everything is, all your orgs have an established baseline to work with for consistency. There's also integrations with VCS, CI-CD tooling so that you have your configs are maintained and accessible to all, which is really important, bringing visibility to what you're doing. And then on the documentation and training side, it's really essential to make sure that you've got the docs and training standardized and that you have documents, that you have the docs and trainings for the things that you've standardized. So make sure you've done both so that orgs can really be responsible for making sure that dev teams and even support teams understand how these standard processes are. Continuous improvement really kind of fosters the culture that's really necessary to achieve a good developer experience so that you have, everything's regularly reviewed, the workflows are updated, you're getting feedback, improvements are continuously happening. Making sure that, again, developer experience is high on that list. Alright, interoperability in CI-CD is, you know, system, it refers to the ability of different tools, technologies, components within kind of the CI-CD ecosystem are able to work effectively together. So that means, you know, the various parts like, you know, the pipeline, source code, repositories, build systems, testing frameworks, deployment platforms, monitoring tools, all those things are able to interact with each other in a way that ensures that you're able to see what's happening. So if the data is effectively kind of exchanged and that there's, you know, not really any compatibility issues or disruptions to kind of your workflows. So kind of the way that that looks like, we think about, so there's a collaboration side that gives flexibility and choice that you, when you are looking, trying to implement interoperability in your environment really enables dev teams to use the best tool for the job so that you don't have to work with the vendor lock-ins, give them that flexibility to use what works best. And then there is the various tool preferences that are there that Oregon company has. And so you want to make sure with the enhanced collaboration that all those different various tools are not a blocker to success. Excuse me, that also ensures, you know, smooth interaction. You know, also the collaboration comes is. And then, it's a really important, yeah, really important side of that to make sure that you're, you know, able to integrate the interoperability kind of really enforces that better use utilization of your resources. So your orgs can make efficient use of your existing infra, infrastructure and tools. So you should not always have to build something new. If you have systems that are interacting together, the system, the tools, you're not waste, you're not being wasteful. Yeah, reason the components and scripts, saving cost. The next side of that, scalability and growth. So as organizations are scaling, they're adopting new tech, which happens consistently. Interoperability really ensures that your CI CD systems can adapt and expand as necessary to kind of support incorporating the new tools and processes and ideas and workflows, all of that into your, the way you all work as a team. And then, yeah, cross platform deployments. So interoperability advantage there is that, you know, when you, in this existing kind of multi cloud hybrid kind of environments that are out there now, it really promotes kind of this unified approach that you don't, that doesn't have to be a blocker to having all these different systems. You have it all together, ensuring that the data gets transferred well. Also kind of promotes a unified deployment and infrastructure management. And then troubleshooting debugging. I knew there was another one there. So when issues kind of are arising, this interop enables, you know, this seamless data sharing between all the different tools and process. I've seen a growth, like, it's amazing, but it's been a very astronomical growth of the average number of like SaaS tools that are in place in a organization into the hundreds on average by companies. And so like being able to look at all the tools and be able to troubleshoot those and have everything working together is a huge kind of game changer for kind of looking at better issue ID, troubleshooting, resolution and such. All right, so in essence, you know, this, when CICD systems are interacting together, this interoperability acts as really that bridge. It's one of those chat-chitty created images that kind of works. But you know, connecting all the different parts of your dev and delivery processes together, fostering, you know, we talked about collaboration, ensuring teams can work cohesively, efficiently. All that stuff is tied into the importance of, you know, interop. So looking at like how Spinnaker and Backstage do this, the, on Spinnaker's side, the, there we go, integration with cloud providers. So Spinnaker allows you to pretty much integrate with anything you want so that you have this consistent interface itself for deploying, managing across platforms, ensuring seamless targeting of the different environments that are in place by devs, allowing them to like choose what works best. Don't tie them into one specific tool that is that whole, that little analogy of, you know, hammering a square peg into a round hole. And then integration with VCS systems. Spinnaker, you know, works across, you know, really can work with whatever so that you can kind of trigger your deployment pipelines directly from your repositories and automate that release process, reducing manual intervention. And then extensible integrations. So you know, having an extensible architecture supports a lot of different integrations, which allows teams to connect with, you know, again, various set of tools like the monitoring, incident management scripts, those things, and really ensures that Spinnaker really seamlessly fits into the org. And yeah, tool sets fit into your org's existing tool sets, requirements, workflows and such. And then artifact management. We talked about, you know, Argo has that. Spinnaker also kind of lets you kind of interact with those, integrate with different repos, artifact repos. So you got Docker Hub, there's Artifactory, I'm trying to think those are the two that come to mind. Assist really in managing those artifacts, ensuring that, you know, the right things are consistently used in your deployments. And then there is the pipeline abstraction. The, helps you kind of abstract the deployments, making the process more flexible and adaptable to what you're trying to do. Developers really can start to reuse those templates that you've created, making an adaptation easier as the projects evolve themselves, and those requirements. And so that bridge between, you know, the abstraction and flexibility ensures Spinnaker can kind of cater to different various deployment scenarios. So that's the Spinnaker side. We think about backstage. Backstage has, you know, it integrates with a lot of CI CD tools and other things. We're talking about CI CD here. And so it integrates with a lot of them, like Jenkins, sort of CI, GitHub actions, Flux, Argo. All of those things brings that visibility. And so having that interoperability with pretty much anything allows developers to visualize and manage what's going on in their pipelines directly from backstage and not having to go to multiple systems. You can do it all in one. So there's kind of that unified single pane of glass view of the entire kind of dev workflow. Service Categlog integration with backstage kind of acts as that service catalog, helping teams manage, discover the services and apps and such that they can use. And so that interoperability with all the different systems ensures that the information in your CI CD is integrated into that service catalog itself so that they know that easier for teams to understand service status, history. The history is really important to be able to go back and see what's happened over time and see some trends. Yeah, it has a really good plug-in ecosystem. So, you know, that extensible architecture across all the different custom plugins that you can create that maybe your community has done. All that stuff can help bring better visibility to things you do. And then customization theming that comes in place, allowing repos to kind of customize the UI and theme that's in place. That may seem like a small thing, but when you're trying to get your organization to buy off to use something like backstage or things like it, having that ability to customize the look and feel satisfies a lot of those branding requirements that companies have, marketing departments kind of do. So it's important to have that kind of flexibility ensures that your org is going to be able to be flexible and use what's there. All right, so Spinnaker and backstage, they both kind of prioritize flexibility, adaptability, allowing organizations really to integrate with the kind of the diverse tool sets that are out there and accommodate the various needs that developers have. Bridging those gaps between the different tech and systems, it acts as kind of that central hub that connects those parts, enhances the flexibility of your CI CD pipelines and developer workflows and is ultimately going to kind of promote more efficiencies and collaborative development environment. All right, so the organizations often use kind of this mix of tool sets. And there's some of these challenges that come in with when you're trying to implement this. Is that that mix? And each of them have their own ecosystem, APIs, such data format schema differences, tools are using a lot of different data form. They're not all unified themselves. That's kind of their niche is having something different than everybody else. So that presents a challenge. Authentication and authorization, like those themselves also present a lot of challenges of how do you not only manage the access to all these different tools, but how do they, you know, you have APIs, different APIs going back and forth. How do you kind of work with that? Versioning and compatibility, that also is something that like tools change. New versions come out. They can break the, you know, breaking changes that either could have been avoided or not. It doesn't matter. You're trying to use them and now you have something that doesn't work. So having, you know, that is a real challenge. And then lack of documentation. We've all seen it. API that's on version two and their docs are on version 1.1 and they haven't updated or they haven't changed one thing and it breaks that. That often is a challenge to try and work with all these different systems and in some cases building your own integrations between those systems can really kind of get hit with lack of documentation. But there are ways to overcome those with us. So using unified config formats and how you define your deployment pipelines that are documented and forced for all the tools, libraries that are associated can then automatically kind of convert between the formats, ensuring there's data consistency, compatibility. There's API gateways that really translate the data between the systems for consistency, simplifying, you know, off and authorization access across all the different tool sets. Helps you maintain version compatibility and it's important to kind of use a version compatibility matrix, matrices, so that you can see, you know, track it all down and see what works with each other to help you make better decisions. Make sure you've documented. Oh, time out. Okay. So that's good there on that piece. The last little bit just we think about developer experience really important to remove all the barriers. So that, thank you.
Ghosting the hardware
Hello everyone and welcome to this session about ghosting the hardware. Maybe the title is a bit obscure to you. I will explain what it is a bit later on. So my name is Rémi Durafort. I'm a principal tech lead at Linao and I've been working on different open source projects for many years now. I'm currently working on Lava which is a test automation system that I will present. So Lava stands for linear automated validation architecture. So it's a test execution system which means it allows for testing your software on real hardware, on real devices like Raspberry Pi, Dragon board, so physical devices. It allows to deploy your hardware, boot your software and test it on real devices. It's used by multiple projects like NLCI for example that use mainly multiple Lava instances. We use that a lot also in Linao for the LKFT project, Linux kernel functional testing project that we are driving. We also use it for doing bootloader testing. So for example you can test your UBoot version directly on your board and Lava will allow to interact with UBoot and test it. We also do firmware testing with it. So it currently supports 364 different device types which is a lot of different device types. So if you want to test your software without Lava, so you will have a kernel, DTB, RAM disk, root FS modules that you want to test. You will have Raspberry Pi, so this is a pretty old Raspberry Pi 3 anyway, doesn't really matter. You need a way to access the serial to interact with the board, so FTDI cable usually over USB. You need a way to power on and off the board, so you need some device that will allow to send a TCP request to a specific port with some commands and it will power on the board and another request will power off the board so it can be made automatic. And usually we use TFTP and NFS for sharing the kernel, DTB and root FS system with the board so you don't have to actually flash the board because after some time you will actually destroy the SD card if you do that a bit too often. So then when you have all of this, if you want to test the board you have to power on the board so you send the right command to your power manager. You then connect to the serial, you interrupt UBoot, you send some commands like DSCP, so the board get an IP address, you load the kernel over TFTP, you load the RAM disk over HTTP, you set the console argument for the kernel, you then send the right boot argument that are both specific. You watch the kernel booting, looking for some crashes maybe or warnings, then you have the prompt, you log in, you run your test, you collect the results and you shut down the board. That's dangerous, not really funny and you will have to do that for every release that you will have for your software. So that's where lava come into place. So instead of having to do that manually, we keep the board, we keep the power control, serial relay and TFTP and NFS server and replace yourself by a program which is a lava worker. So instead of typing commands manually one by one, you will explain in a YAML document to the lava worker what you expect him to do. So you will explain that you have a kernel, a DTB and a root FS that you want to deploy using TFTP and you want your root FS to be available over NFS. And lava will then know how to automatically interact with your board to send all the right commands that I explained in the previous slide, automatically in a reproducible fashion for you and it can do that days and nights, including weekends for you. And this document that you write is what we call a job definition or job configuration. So obviously you can have multiple DUTs, device under test, for example in this case, per worker and you can have multiple workers attached to your lava instance and they will all connect to the lava server, the classical server worker model. For example in Linao in Cambridge, we have a lab with hundreds of boards and I know Collabora also has some kind of board farm like that. It has been made for a large board farm if you want to. Regarding the roles, so the server, it's a web UI API, it's what is visible to the user and it usually does not have access to the boards. For example in Linao, all our lava servers are in the cloud somewhere while the boards are connected to the workers physically in a closed lab. So the workers, they have direct control of the DUTs, the boards, device under test and they are not accessible to the users. The user will not have access to the board directly, they will not have access to the worker directly, only to the server. So the server will be responsible for storing the logs, the jobs, the results, do the scheduling, send notifications, have an ICOI, things like that. And on the other side, the worker are more responsible for the hardware. So they have to deploy resources, they have to pour on and off the boards, they have to interact with the serials, look for crashes in the kernel under the board health, something like that. So this is the list of supported devices. Obviously you cannot see it, it's way too small because there are way too many devices. But just to explain that we support from really tiny devices, IoT devices, up to Raspberry Pi form factor and up to even large servers that you can test with Lava if you want to. And as we support multiple different kind of device types, we have to support different deploy methods, different boot methods. So for example you can deploy with TFTC, NBD, fastboots for all the Android boards, Vexpress, etc. For booting you can use DFU, Uboot, PyOCD, fastboot, etc. As a result of different ones. And for the tests you can have a POSIX shell interaction if it's available on the system that you have. You can have interactive tests. For example when you want to interact with a bootloader, it's not a POSIX shell, so you have to send commands and expect results. And we can also do multi-node tests which is a test in which you have multi, you have more than one device that will be booting at the same time and that will be able to interact. For example you can test your server on a physical hardware that will stream to multiple different clients. It's something that you can do in Lava. So today I will speak a bit more about also why we want to test Lava itself because why do we want to test the CI system? The obvious reason is that it's just a piece of software so it's buggy. So you have to test it to know what is working and what is not. And even more important is that when you're building a CI system you have to ensure that the CI system is rock solid for two main reasons. If you have bugs in your CI and if you have false positives which means that you're reporting something like a bug on the software while it's not buggy, then your developer will just say okay I'm done with it, it's not working. I will not look at your CI system anymore. That's the first reason. The second is false negative which is you're not reporting an error that happens in your CI. So you're running a test, it's failing, but the CI system says everything is okay which means that you will say to the developer I tested it, it's working while in fact it's buggy. So you will have a, you will release the software that has been tested but it's still buggy. So you have to prove that your CI is reliable over why it's just fully useless. So how are we going to test lava itself? So we do have a classical hierarchy of tests. We obviously have static analysis, we do have unit tests that are running on every GitLab CI merge request. We also do integration tests and that's why I will print today which is called meta lava and we also do federated testing and test on a staging instances. So we do have some instances that we upgrade every day where we run actual workload and we check that it's still working the same way as before. But the main problem when you want to test lava is that it's a combinatorial issue. As I said before we support 364 different device types, roughly 16 deploy methods, roughly 26 boot methods and five test methods. So if you do the combination that's insane, the number of combinations that you have to test. Yes, I know that a lot of these combinations are just not going to work because not all devices support DFU or fast boot and things like that but still it's really good. So maybe you want to give me both and money, I will be out for it but obviously I don't think that's the case. So maybe we should consider faking the DUTs. So faking the hardware. So that's the goal of the MetaLava projects. So the goal is to be able to set the full system. I want to test from the user point of view back to the user. So the user should be able to send jobs. It has to be scheduled, run on a fake DUT, send back results and the user will pull the result from the user interaction. And I don't want to have any boards because I want that to be somewhere running in a CI CD system. And it has to be cheap obviously and fast. So you have two ways you can do that. If you want to fake devices, you can go for doing board emulation. You can use MVP or QMU devices for example to emulate devices. The main problem is that it's CPU intensive so it will be slow and expensive. The other way is to ghost the hardware. So if you go back at the lab architecture, I don't want to touch the user. That will be my testing system. I don't want to touch anything in the server and the worker because I want to keep my system intact. So the only thing I can change is what is on the left part, the board and the power control server and the TFTNFS servers. So what I have to do, I have to build a system that I have to build a fake DUT that will feel like a DUT, look like a DUT, smell like a DUT, sounds like a DUT and tastes like a DUT because lava should not see the difference between a real DUT and a fake DUT. But that's not enough because I also have to check that what lava will send, the interaction that lava will have with the fake DUT is still valid because if I have a fake DUT that accept anything then lava will do any stupid things and it will still work while it's just wrong. So it has to also check, the fake DUT also have to check that what lava send is legit. So it has to check that lava is still acting correctly. So we have to look at the interaction between lava and the DUT. So as I said there is free interaction, the power control. So by the way lava is designed, it's just a command that lava will run so it can be any shell command that has to return one or zero, one if it's failing or zero passing. But from the DUT point of view, from the fake DUT point of view, the DUT should be able to check that the command has been called at the right time, so before booting, that lava is still doing what is supposed to do. Yeah, the serial relay, again it's just a shell command that lava will run and that it will just interact with the input and output, STD and STD out. So I need to build something that will feel like a DUT when you interact with it with the serial. And the TFTP NFS servers, I will just use a normal and TFTP NFS server and I will just have to check from the fake DUT point of view that lava has deployed right binaries for me. So the question is where do I want to mark things? Let's take an example. I don't want to do this presentation but I want to be in my bed and I have something that will be in place of myself. So you don't see the difference. So I can build a robot that will be in my place and that will speak like me and explain the same things, interact with you the same way as I will do. That's one way to do it. I can also force you to have glasses that will inject in your vision an image of myself. That's two different ways to fake me but from your point of view, it will be the same. You won't be able to notice the difference. For mocking, it's all the same. I have different ways I can mock. I can create a hardware that will interact with lava the same way a real hardware will do but without actually booting a candle. I can do that if possible to just have to fake the serial and it will work. But as I said before, I don't want to have any hardware. I just want software. So what I will do is I will have only software that will fake all the interaction with lava. So it will fake the serial relay for example. So we're going for a full software implementation. It's a project called DEMISIS. So when you run it, it has the same output as a normal board. You can interact with it and you feel like you interact with a real board. I will show you in a right after that. So you can send it commands and it will react like a normal board will do. And when you do TFTP and NFS commands, it will actually load the TFTP and NFS command for you and check that the binary are present. I will just go for a really short demo. So I just have a run script, just a wrapper not to type everything because it's painful to type. For example, I want to, so my DEMISIS system, my program that will fake a duty, so it's a Python script and I give it some commands that are inside YAML file that I will explain right after. So if I start it, you will see for the one that are used to have a UBoot booting and acting to UBoot machine, it's what UBoot usually type enter and it's actually wait for you to do type enter and then you have a shell in which you can type some commands, for example, DSCP. It will get a DSCP. This is all fake. I don't have any board attached to it. You see that. It's just a program that is faking a UBoot interaction, a board interaction. And then I can just ask it to boot. I'm not doing it because I'm not booting anything. It's just faking it. For LavaPoint of view, it's actually booting something. You see that the screen is a bit too small. You see that it looks like a board is booting, but it's just printing text. But that's enough because that's filling all the requirements from the LavaPoint of view. And you see it's just a program running. I can just, for example, do a login interaction if I want to. I want to check that Lava is able to log in automatically to send or write login parameters and password. I can create a program that will do that. Again, just doing the basic thing, booting. You see it's a small delay when printing. It's on purpose to fake what a real board does because a real board does not send all the characters in one row because the cellular takes some time to process, to transfer. So we fake that also. Now I have to send. You see that if I not send in the right parameters, it would do a login incorrect. If I'm sending the right one, it will log in as normal. Again, this is not doing anything. It's just pretending to run up a system. And then this is what usually what Lava is expecting when it's run tests. It's expecting some signals to have. And I can fake that also. If you look at what's inside, it's a bit too small. So inside the argument for my program, it's just a set of commands that I'm asking. I'm asking my program to print the lines. That's the line that you've seen. Then it's printing the different lines and accept to be interrupted like what UBOOT does. Then it has a shell. This is a prompt with UBOOT prompt. And it will look forever waiting for exactly this command, USB start, and et cetera, et cetera, et cetera. And for the fake DUT to work and to go to the next stage, Lava has to exactly send the right commands. And if it's not sending it, it will fail. So thanks to this list of commands that I'm expecting, I'm able to check that Lava will send exactly the same command from one version to another, because that's what a real board will expect. And at the same time, from the Lava point of view, it will have the output that is expecting. And for example, here, Lava will, it's waiting for getting a TFTP instruction to load the VM Linux over TFTP. So I'm waiting for this exact command. And when I get it, I will actually download it. I have a small script that will download over TFTP, download the file, check that it's present. That's what I said. What it sends should be meaningful. So all the tools should be available, what it should be. That's for the shop demo. So that's what the MailTile Lava project is doing. So we have a server, we have the workers that are working together. And instead of having a real board, I just have the domicile system. So it's actually running 28 different device types, including both that I've never seen, because I just need the logs and the commands that it's expecting. I don't need the real board itself. And it also allows you to test bootloader failures, for example. So that's something that's difficult to reproduce in real life. You have to damage your board if you want to have some specific errors. The system, meta-lava and domicile, they can reproduce the same error all the time, because it's just a specific output that Lava will have to see. If you want to contribute to this, to have your boards tested by Lava, so a fake board tested by Lava, please come to see me. I will be happy to add that to the system, and that will ensure that your board will still remain valid in the next time I'm working. It's a fun thing to do, to do system mocking. And just have to look at the interaction between the different systems. That's all. Do you have some questions before we go to the next meeting, next presentation?
Pushing test lab to its limits: performance tracking techniques
Hello everyone, my name is Paweł Wietzorek. I work with Colabora and I've been involved in maintenance of server side components of Colabora's automated testing laboratory. Today, I would like to share with you a few lessons learned from that experience, particularly related to tracking laboratories performance and pushing beyond the limits of the software that it runs. We'll start with some background information. Next I will move to interactive approaches for tracking its performance, I mean the lab performance. After that, I'll describe a few solutions for automating that and finally I will also share some thoughts on data generation. So let's start with the reason why, I mean what brought us here today. Thanks to Remi's talk, we now know and have an idea of what Lava is, what it provides for testing automation and how it supports all these efforts. Some of you might also recall a talk given by my colleague Laura at last year's FOSDEM. Laura described in her talk how the lab at Colabora is set up, what its day-to-day maintenance tasks look like. What main challenges are while running this kind of laboratory and also shared some best practices. The key piece of information for us today is that Colabora's lab is a fairly large Lava instance that is continuously growing and together with high number of devices also comes high numbers of test job submissions to process which unsurprisingly can result in higher load on server side of things. And that in fact was our case. There was no need to panic though, at least not right away. High load means that the resources that were allocated for lab purposes are in use and that's what they are meant to do after all. Interestingly, especially high load was observed on the nodes running database processes. And all of that is mostly fine until the system becomes unresponsive. This might lead to potentially unreliable lab or even unusable for higher level test systems like MESA-CI or Kernel-CI on the screenshot which other Colabora's are involved in development, maintenance and of course usage as well. My first thought was to simply estimate what resources are required for day to day operations and simply throw them at this workload. This could work short term but it wouldn't really solve the problem. To do it the right way, a deeper understanding of the root cause for all these issues was needed. And by the way, this photo is from Polish IT Olympics where hardware component throwing contest is held. And while this is hard drive throwing contest which might not be the type of resource we needed, that was the initial idea. Thanks to RemiStock we also have rough idea of what main components for Lava are but let's recap them real quick. At the very high level Lava on the server side has two main components, a scheduler and a connection to the database. If we take a closer look, those are respectively a jungle application and by default a Postgres database. These are widely known used and mostly loved software components so we can make use of several already available performance tracking tools for them. So let's go through a few interactive or semi interactive ones. As tribal as it might sound, it is equally as important to start with simply enabling verbose logging on affected instances. This way we get first insights from redoing user stories based either on direct reports from users or maybe motomo statistics collected by recent Lava releases or maybe some logs from load balancer which shows us which API endpoints are mostly used by users or which views are most commonly requested. In case of Django we get a few other perks. It's as easy as literally flipping a switch. Django for database also allows to log every statement executed on the database in debug mode and it can be also easily extended with some additional profiling information. But even though there are all these perks, all this information is a post-action information. To collect it in a truly interactive manner, fortunately Django already has us covered and provides just the right tool for this purpose which is Django Debug toolbar. It isn't much harder to enable than just verbose logging. It just requires adding an additional Python package to your deployment, set internal IPs from which Debug toolbar would be available, confirm enabling it and you're good to go. Debug toolbar not only provides great and immediate feedback but also includes traces, some additional profiling information and it gives you all of that in an easy to use graphical user-friendly way. As you can see on the right-hand side of the screenshot you even get all the requests sent, the instance and all the SQL statements run. But even though these tools are easy to enable, it comes with some drawbacks as well. These tools should not be used on any user-facing instance which brings us to setting up a personal local lava instance just for debugging and performance tracking purposes. Such a local instance would often come in a clean slate state. So with empty database with no devices and most local instances would not be able to connect to physical devices, at least not in the numbers as the production instances run. And even though we could fake multiple devices like Remy mentioned in his talk, that wouldn't solve the problem of having a database pre-populated with some additional data. We could potentially prepare a database fixture for that purpose. But it might not be particularly easy to mock the entire database like you see on the model graph for lava server. It's non-trivial task especially when it comes to keeping large numbers of processed jobs as archives. But the question is do we really have to mock the database? It is all done locally in our private debugging and performance tracking instance. Maybe we don't have to create a new database but reuse a backup from staging or production instance that we also run. And as the old saying goes about two groups of people and backups, I believe we all belong to the group that already makes them. There is also an important second part of this saying to make sure that restoring your backup works properly as well. And with reusing your PGDump output as the input for your performance tests, you can tick off this task from your administrator tasks list. Also if you base your Postgres Docker images from the official one, there is a really simple data initialization method which requires just mounting the volume with PGDump output and everything else is taken care of by the INE-DB itself. It also supports the compression on the fly for the most popular archive formats as you can see on the snippet directly from the INE-DB code for Postgres. Since we already have this database in our local instance, it would be useful to incorporate even more statistics from the database itself. For this, we could simply use PGAdmin or even PostgresQL command line tool just to check the actual runtimes and other statistics with explain-analyze queries. This would highlight for us database operation bottlenecks. And this way, having the database level tool, we would also be able to run various experiments on the database like changes in indexes or maybe adding query planner hints. It almost doesn't cost us anything just running another container in our local setup or if PGAdmin is too much, you could also opt to use the online available graphical tool which would highlight the bottlenecks for you with this heat map showing you where the issue might lie. Using this database level utility completes our tool set for off interactive solutions and while it is really important to be able to perform all those actions, it's paramount to do that again sometime soon and again and again and again and that moves us to automation solutions. By now, we know what to look for or what to watch out for in our lava instances and from user stories or bug reports or the motomo statistics or load balancer logs I mentioned earlier, we know and have specific code components to track or maybe even test cases ready to check for that. But the question is how to run those test cases to get the statistically valid feedback. We would have to take into consideration cache warm-ups, test case calibration, preferably also a way to compare between benchmark runs and it would be also great if it fit well into the test suites that are currently used by the upstream project which by the way is based on PyTest. Fortunately, it turns out that there is a PyTest feature that provides all of that and even more. In the case of lava bottlenecks found in the collaboration instance, the next step was just to wrap the test cases prepared with this fixture and wrapping the key pieces of code allowed to have benchmarks ready to run. Next step, once the test suite was prepared, was to plug it into the pipeline. Both upstream lava project and downstream lava tree makes heavy use of GitLab CI and it shouldn't be surprising. Many projects already do the same. For example, DRM CI merged in kernel 6.6 release. So currently, job definitions for those GitLab CI pipelines above the downstream one and below the upstream one don't share any reusable code. This might be subject to change in the future. For now, downstream changes are made with ease of importing them later in mind. Moving to external definitions could make the GitLab CI pipelines a bit more complex, but that's something that we'll see if it brings any value in future. Of course, GitLab CI jobs need a run environment and to get a baseline of what should be expected from benchmarks run, the easy way out is having a dedicated runner that would provide most stable results that are not affected by, for example, other test suites run in parallel on the same GitLab runner. A good choice would be to select a machine that has similar resources to a node, which your lava instance is run on. And for proof of concept purposes, I used a small desktop computer which gave just that. GitLab runners are also really easy to plug into a GitLab server. And while we are already optimizing the pipeline, we should also take into consideration caching the CI data resources for benchmarks runs. For that, we could easily use already available upstream lava caching solution, which is based on specific CI images to run tests on. But that would also mean that production data from database we used earlier is no longer a valid option for us. And we need to revisit the lava server model, which brings us to data generation, which we no longer could omit. That brought us to creating a dummy database generator, which was focused just on a key few tables and relations according to Postgres planner statistics. It was implemented with very limited scope to only support the worst bottlenecks that were found in collaborators instance. And for that, we used standard Python tools, which were FactoryBoy and Faker. As a bonus addition, you might also want to ask a few questions. Should lava actually archive all the test jobs that are run, or maybe archiving those jobs can be delegated to a higher level test systems? Fortunately, retention mechanism is already available. In upstream lava, it just required enabling it in Helm charts, which is used to deploy lava instances for collaboration. To summarize all of that, I've got three final thoughts that I would like to share with you. Constructing in testing laboratories is not a one-time job. It's a process that might differ from instance to instance, depending on your specific workload. But it's something that I hope could be easier for you if you come across the same set of issues. It also requires frequent revisiting and adjusting according to the results you see. But even small changes can bring huge boosts in performance. But that probably is a topic for another talk. And that's all I have prepared for you today. Thanks for your attention. Do we have time for questions? If there is some question, I will be happy to answer it.
Performance testing and why even the imperfect one is important
Hello everyone. You have two more minutes. My name is Andre. I work for... Okay. I've worked for Red Hat for several years now as quality engineering. And today's talk is really about performance testing, but it's not about the testing itself. It's more than why we should do it. You have six minutes in class. Okay. You're starting early, just so you know. Yeah. Okay. So it's more why we should do it and why there are benefits in it, even if you do it wrongly or imperfectly. So that's the main point of the talk for today. So first thing first, why should we do it? What are the benefits for us doing the performance testing, even if we don't have isolated environment and all this kind of stuff? Well, for me, the main benefit is that even if you don't have the environment you would want, you can still find the bottlenecks in your application or whatever you are testing. And you still can optimize it even if you don't have everything ready because the truth is that the performance testing is quite expensive. And for the good one, I don't think that there are the companies that will give you the resources that you need to do it perfectly. So that's for me the main option or the main reason why to do it. And for me, the second most important is that you will gain knowledge that you need or you will obtain the knowledge about the product itself because you will suddenly see things that you cannot see, even if you normally deploy things and everything. You will see the little thingies that are happening here and there. And the information you gain are quite nice to get. So that's probably the most, the points that you should look for if you are thinking about the performance testing. This is actually what it will gain from it. So this is only in my opinion, like you will see a lot of papers about the performance testing and all the things that you have to take care of. Like on the GitHub, I know about two or three papers that have like 40 pages about the performance testing and all the criteria that you have to fill. In my opinion, there are like two variants of the performance testing. And first is measurement and second is testing. For me, the measurement is really the thing that you are looking for numbers. And you need those numbers for, I would say, legal reasons or anything what you have to declare for your customer. You will say that, for example, for us, I work on the division. Like if we would want to say that this connector is actually able to do 30k per second, we would need some kind of a proof that we can do it. And getting this proof is like, it's very complicated and you have to do it in very specific ways. And even if you have everything, it's not quite, it's not always acceptable. But the second part or second variation is just testing. And for me, the testing, the testing is really just finding the bottlenecks in your product. And I think that's even more, like the testing is even more important because there you will find all the bottlenecks and you can optimize your application really. And you can see the flaws in your call because you cannot see these things when you run it regularly and you don't have everything around the application tuned up so you don't throttle your application to the maximum. These things usually happen when you go over the top or near the maximum, near the max. So, yeah, this is in my opinion two ways how to do the performance testing or two variants of the performance testing. What is not really optimal about the testing, which I was talking about, just finding the bottlenecks, not the number, you need massive monitoring and I will say more about it in the talk, but that's like main disadvantage of it that most of the time you will go around tracing, monitoring, metrics and you will find some stuff that will really give you hard times figuring it out because you are going for performance and you are speaking in milliseconds, yeah? But most of the things that are used for monitoring are not really prepared to handle you for one minute second. They think that it's okay to scrape, for example, metrics by 10 seconds and like this will give you massive headaches during the time. I think I have already somehow gathered those, but the goal of the performance testing is, as I said, find the bottlenecks, but they are much more to it because, for example, the load types, you know, even if someone of you had crossed the performance testing, the main point all of the guys are talking about is what kind of load we are going to do, how we are going to do it to make it reproducible, you know? Because that's like a problem because for some application you put constant loads to them. Let's say you are going 10k per second, some requests for the API for one hour. And like that could be fine, but we all saw that some websites or, for example, the systems that you are buying tickets for concerts, these kind of things need peak loads. You know, you are going low for 5k per second and suddenly you spin it up for 100k per second or something like that. So you really need something that will generate the load for you and do it reproducible. You need to have the same load so you can repeat the testing for a couple of times and be consistent in that because otherwise you will find all the other things except the flux in your cold. So the main problems during the performance that you will find. I have said that you don't need the isolated environment to do it. And that's true. We don't have isolated environment in our team when we do performance testing. But if you don't have entirely isolated environment, you need to know your environment pretty well. You have to know your latencies, you have to know all the hardware specs and these kind of things. You really need all of the information because if you see some very specific things happen during the test which are not common, you then can somehow put the puzzle together with all the information you get from the environment and you can somehow at least, I would say, decrease the number of stupid mistakes that you will have to gather around. The next thing that is very important is to have the monitoring which I have said already. You need all the metrics you can get because before that you don't even know what kind of the information would be valuable for you when you start doing it, but you will need all of it. If you can get rid and you will surely use it, if you don't gather those metrics and then you will figure out you need it, you cannot get them from the past. So really the thing is gather everything you can and it will be fine. And the last thing for this is you need to tune up all the systems that you are dependent on. For us, we are working on databases and we cannot really throttle up our product if the database isn't optimized to the hardware it's running because if the database is not throttling, we are not throttling. So you basically need to have everything on the high spec there to not bottleneck your application. So that's one of the main points because on some things it's quite problematic to tune it up. I quite like this quote because it's all about the metrics and if you have them, it's fine and it's nice. If you don't have them, it's massive problems. So I think that's really the quote that you should be looking for. So again, monitoring. I have already talked about the problem with the scraping. So let's say we have used Prometheus mostly and the maximum what you can get from Prometheus is one second scraping. So that's fine for information causes but not for the performance metrics because the things happen during the milliseconds. Maybe 10 milliseconds would be enough but one second is really you are losing a lot of information and later on I have the example of what you could see when the scrapers are not fast enough. And there's like massive problem because not every scraper is or I would say there are no scrapers that can do it really fast. So you probably have to implement it and we are working on that actually. So that's the main problem. And the second problem is that you will end up with having a lot of the systems in the field because you need hardware metrics. You need JMX metrics and I don't know what other. It really depends on your application. But for us, we needed hardware metrics, JMX metrics and some metrics from our test suite. And these three things actually all the different outputs. You know, we have used net data for the hardware metrics. It's like really nice tool open source fast everything nice. But you cannot import JMX metrics to net data. And you net data has also one problem that you cannot import like anything would happen in the past because it's strictly hard coded for now. So that's the problem, you know, and then you will say yes. Okay. So I cannot have JMX metrics to the net data. So I'll add Prometheus, you know, that's fine. Okay. So you have now net data and Prometheus. Sorry. Well, and then you will continue because you still have all at least in our expertise. We still need that someplace to store the metrics from our test suite and you can just import it everywhere. So then you happen that you will deploy the Postgres because you can use Postgres as backup for the Prometheus for the data storage. So now we have net data, Postgres and also Prometheus. And last but not least, you will add the Grafana because you need to visualize it, you know, and getting all those things in shape that you have like massive monitoring. You know, everything can go wrong. So if you can use the least amount of the tools that you can use, it's better because once you have too many, it's nightmare to somehow get that in shape for the whole time. Yeah. So with the performance testing, you are not really looking for the numbers. Numbers are not important in this case because you don't want to see the throughput is like this or like that. You need to see the trends in the graphs because there you can see if you are constantly slowing down or if you are going like optimizing your way. So really you have to look for the patterns in graphs and the trends in the graphs. I have the example for our testing that I will show you the patterns which we have found. But before that, just our system under test was the BZM. I don't know if you guys know the BZM, but it's effectively changed that I capture streaming, which means that we sit on the top of the database and we scan the transaction logs and sense all the events that happen there to the Kafka. We are effectively running in Kafka Connect runtime, which makes the performance testing even more juicy, I would say, because the runtime is not ours. So it's a little tricky. So that's our system under test. And this is the first example that I have put on. The image on the graph on the top is basically our process duration. And there are two things that you can see on the graph. We are effectively most of the time we are oscillating between some values around 200 or 170 to 220. And that's entirely fine. That's actually what we want to see if you are looking for some responses. You need to oscillate around some value like a sinus graph or something like that. But what is not actually nice is this on the star. Where we are constantly getting slower and slower and slower. And we have some peaks there, which have these peaks are don't have a reason to be there because the data are the same all the time. So this is most likely the flaw in the code. There is something happening, what shouldn't be happening. And it can be the database flashing out to the score. It can be basically anything, but you know that there might be a problem. You have all other metrics. You can have metrics from the databases that will show you that flashing was happening or anything. But this is what you have to look for. These, it will certainly be different for your application than ours. You will have to define what you are looking for on the star. But that's the main thing. And the more funny example is this. No, here. This is Jmx metrics from our, from Libyseum. And it shows you the size of the queue. Internal queue of the Bizm. And that basically means that from the database you are reading to internal queue. So once the queue is zero, you are not reading. You know, but we are still processing, right? So there should, there is some mistake. And this is actually the problem with the scrapers. Because if the scraper takes each second, the database is pretty fast and it can empty up the, it can empty the queue during that time. And if the scraper hits the right time, it will give you zero. You know, so from this until the end, the graph is all wrong. It's not true. And it's all because of the speed of the scraper. Because it basically hit the wrong time. So that is the thing that you have to be worried about because this will happen surely. And these are some other graphs. These are, I would say, more wild. I would say it's from the start of our playing with the performance. But the top one is also, it's pretty cool. It's not that constant as it was for the previous one. But it's still in some borders, you know, we are somehow oscillating, but there is not really clear way. But the queue size is okay now. You can see because you have some data there, but not zero. So there's an issue with the scrapers, as I said. And this is actually the thing that you will have to look for the patterns. Really important in graphs. You can see all the different ones and there are a lot of papers on the Internet that you can find. What to look for on your specific application. So, yeah, but not look to the numbers. Numbers don't say you anything. You can usually get the higher numbers if you change up the hardware that you are running on. But if you can optimize it on the some hardware that you have, you will surely get the big numbers on almost anywhere. So some tips and tricks for me. During the way that we have started playing with the performance, we have developed a lot of tools. First is the database manipulation tool, which is effectively giving you a Json API. And just with the Json API, you can create DML for almost any databases. We have now probably MySQL, Postgres, and Oracle there. So it's just you don't need to have a lot of different JDBC connectors in your code. We'll just deploy this and it will take care of it. We have also implemented some kind of the load generator that can generate you. Load for, I would say, constant load, P-codes and all this kind of stuff. We have also some automation and the other, like MySQL auto-tune, that's it. We are pretty proud of that because it can basically tune your MySQL to the whole VM that you are running it or physical machine. And you would say that it's easy, but it's not. You know it's hard when you look on the seven or eight page of the Google. You know, in this phase, you know, we are probably not on the good shape and this is one of the things. So please take a look if you are working on MySQL. We have some counting of the parameters for the database there. And it will save you a lot of time if you want to tune up your database. We have spent the time for you. And secondly, we have implemented the metadata to Prometheus Creper. So you can get rid of one, one struggle point in your monitoring environment. And we are also starting working on the fast Creper for the, for your monitoring, for our monitoring stack. But it's not done yet because it's quite more complicated than before. So, but yeah, please take a look. I will have the links on our GitHub and everything. It's all open source. So you can just also add some code there if you want. Yeah, so, so I have started quite early on than I should. So I have some time now then. But okay, we can just summarize everything now. And I hope for some discussion before you guys who done performance testing. So for me, don't be scared around the performance testing. He's not like some, some monster. People are mostly just like creating the monster from it. But if you don't need that for some legal options or anything, it's fine. You can play with it. It's funny. You will gain a lot of knowledge about your, about your product on that. And especially if you are QE, I mean, a lot of QE folks don't have the necessary knowledge about the product itself. And this helps a lot to get through everything because in the end, you will, you will just go through the code and look for, for the mistakes or something like that. So that really helps a lot. So, gather all the metrics you can. And well, we are also writing our blog and all the repositories will be on the other side before the two, before the two links. So I would be happy to hear from you, you know, like repository or organization before the two links. And yeah, that's probably it for, for my talk. As I said, I have started a lot earlier. So thank you very much for listening to me. Please do have some questions. Yeah. So my question would be, so what kind of experience do you have in your complex system? And then you see something happens there. And say, okay, here, here's the latest EP or something like that. Which experience do you have with, let's say, find the cause of the problem? Cause, so when something happens randomly, you will see it with wrap and say, okay, something happens there. And that's maybe annoying, especially when it happens, happens randomly. So what kind of strategies you are using then when you know, okay, there is something to find cause of the problem in the complex chain. Yeah, so this is, okay. So, so the question was actually, if there are some, some changes in the environment, some like latency things or everything, how we can deal with that and how we can find the causes of the problems. Yeah, so surely this is the main problem of the whole performance testing outside of the like isolating and, you know, well, you need the metrics from everything because then you somehow at least it will help you to, to get all the things in the right timeline and you can see the need and picks what could happen. But if it's like something that is really bad, you can find it usually because it will mostly, it will just disappear in all the logs because it can be something like if you have the smaller machines, I have to write once on the, some microchips, you know, it was funny thing that you fill up your TCP queue. You cannot find that anywhere in the world. So at that case, you will just repeat the testing, even if it's take long and you will see if same thing happens or not. I don't have any other like recommendation for that because this is like really main problem if you are doing it outside of the ideal environment. This surely will face this, but mostly it's not happening that often, I would say, because you can have observation for and tracing for a lot of things and most of the times you can like colorate those things together. So you exactly know what is wrong, especially for the network. You can get a lot of network traffic, like, how is the word? You can see all the traffic and what is going on, especially on one line. So then you can usually put those graphs together and you know at the time. If that is okay, answer for you. Yeah, yeah, yeah. Thanks for the talk. I was just wondering that how to use the traces, analyzing the traces, because I've seen that you mentioned metrics and traces. Sorry, can you speak more louder? Oh, yeah. Can you hear me now? Yes. Yeah, I was wondering how do you use the traces for performance testing, because when you collect the traces, how do you deal with the sampling of the traces? And if you miss something because the sampling is bad or you are not sampling everything, maybe you have to infer something from the metrics and the traces, I was wondering how do you deal with the traces? And if you use distributed tracing in a large project like collecting all this kind of stuff. I'm not sure. I understand the question. The question actually is, I've seen that you're using collecting the metrics and then you are analyzing the metrics. Yeah. And what about the traces? Yeah, so, well, the business does not have really that amount of traces that we could get from it. We have mostly like JMX metrics, you know, from the Java environment. So that's for us what we analyze. And I'm not sure how can I answer more for the questions. So I'm sorry. We can discuss it later. I'm coming to you. Okay, so my question is about the long running tests. Sometimes the performance validation is visible only after long run, for example, one week, couple of weeks and sometimes even more. So how do you address this in your process or how do you recommend to address this problem? Yes. So for this, actually colleagues of mine as part of our open source organization, they are also developing the long running cluster environment, something like that is like, because, you know, having a long running thing is complicated on itself because you have to manage it a lot, especially on OpenShift or Kubernetes or these kind of things is like little problematic before the upgrades and this kind of stuff. So we have not dealt with that yet, but we are planning that once we are okay, that we manage that we have everything prepared for like databases and everything, we want to get the up running on the long running clusters and like regularly doing the performance tests over, I would say a month or something like that or a week, it usually is enough, even especially when you put all the numbers to the low ranges for the retention for the memory and all this kind of stuff. It doesn't take too long to fill everything when you will start to see the retention and flashes and everything. So yeah, that's our plan, but we haven't done it yet. So yeah, but if you are interested in that, you should definitely look in the hub that we have in the repository because it could be useful. Do you have any tips for running performance tests in the cloud? Because for me that's quite the opposite of dedicated test runners. But when the software runs finally in the cloud you should probably also performance test it there. It's a problem. A big one. We have tried it and it is so inconsistent. The results are so all over the place. If you run, you have two same clusters, Kubernetes, OpenShifting doesn't matter actually, and you run the tests on the same clusters at the same time, clusters are in different zones on AWS, you will get entirely different results because all the load balancers and these kind of stuff, it really, you know, if you have only internal communication on the cluster, we found anything from the outside. It could be actually doable, I think, but if you have any communication during the test that is going outside the cluster and it has to go through the load balancers and these kind of stuff, I think that's not doable in any way because it's like you don't know what latency we will have for this kind of the request and travels. So I think that that would be like really problematic, but if you could mock up the external communication with just some internal endpoint, it should be quite okay-ish, I would say, but you will not get like really good results from that, I think, even if you try more and more. But I think there are some Kubernetes, some special Kubernetes builds that should be used for these kind of measurements, but I have never actually tried it, so I cannot recommend it when I try it, when I try it, but yeah, this is definitely a problem. Okay, so I have more questions? No, we have a few more minutes for questions. Come on. Otherwise, I'm going to ask you to, you know, to move your seats. Wait, wait. You said you would want to have a very small statement developed, so in the milliseconds ago. Yeah. So doesn't that create problems on their own, something like noisy neighbors and so on? Yes, it does. It does. Right, but... How big of a problem that is? Well, that's the thing. We are really thinking about writing some scraper that is fast enough for this, and yes, you will probably generate some problems during the way, especially if you would like to send the metrics directly to the Prometheus every millisecond. You will probably fill up the network line or the TCP stack or whatever, because it's really fast. So you will... It will strongly depend on the machine that you are running and if you have space there, if you have a lot of RAM, you could actually batch all the metrics and send it like one package after the test is done. But yes, that's actually what we are now fighting with and we are trying to figure out how we are going to aim for this. But mostly we are thinking that we will do somehow a configurable scraper that will either batch the request or send them or something like that. I cannot say you because we haven't tried it, actually, what problems it makes, especially with batching, because I have counted it up and metrics aren't small, actually. So it will take a lot of place in the memory. So we will have to try it and somehow figure it out. But before the fast scraper, it will give you really headaches because you will try to find something and fight something and you will spend 20 times debugging it and then you will find that the scraper hit it the wrong time every time. So we have to deal with this in some way. But it will be hard and problematic. I think we have time for one last question. No one? Tough crowd. Thank you very much.
squash the flakes! - how to minimize the impact of flaky tests
I Come on people Yeah, let's share up for Daniel because his first time speaker and everything's failing And it's off to a good start. Yeah, come on big applause. Thank you. You're doing awesome And You know the only certain thing about technology is gonna fail exactly when it doesn't need to Yeah, like I think I said already flakiness is not only happening in test obviously right So While we're waiting for this thing to happen I could ask a question about who actually Has has an idea what a flake would be in testing Okay, I should just repeat what you're going to say Yeah, yeah, go ahead you probably I don't know So you have an idea, but you don't want to tell me Exactly exactly so To me or I think to most people that agree about this topic Flaky test is a test that fails and passes and successive runs without changing any code Neither testing code nor the underlying production code Okay So, yeah, this talk will be about flaky test. Yeah Yeah, of course, of course flaky behavior is not determined by just being the test being flaky but also the software but I would Divide those two kids into different categories and how they are going to be handled this differently. So Yeah, but Let's wait. So Yeah, I'm going to start with introduction. My name is Daniel Hiller. I'm working at Red Hat I'm working on the upstream cupid project and there I'm Maintaining the cupid CIS system So this talk will be about flaky tests and How we should or how we are actually going to handle them I'm in our community of supporters for the cupid contributors. So I don't Say I have the silver bullet for handling that I would be happy to have any input from you folks And how we can improve and I would actually also Want to have some kind of extended Q&A session if the time is still there Somehow so that you might talk about what you have experienced and how you are going to handle it Just as a quick Thing how I think this should be going I'm going to start with Waterfeg is but yeah, you described it perfectly already. So it's fine and then What the impact of flakes is And then how we are doing or how we are how we can find flakes somehow and then at the last I'm half rate the flake process works and what tools we do have that support this and Yeah, in the end, I just want to describe what we're aiming to do in the future to improve this Just don't have internet and made up for some reason. Oh, no My email, okay Yeah, yeah, I think it's going really terribly wrong Sorry for all that by the way a packed room. I didn't expect that to be honest. So thank you all for coming Really great. I'm gonna help you out. Don't worry. So Tell me kind of a little bit more until we wait for the slides Can I give us a hint as to what you wanted to show us and just tell us the story about it? Yeah, without the slides just going to open it a bit because so I can Supposed to talk about You know pretend I'm stupid and I have no idea what's flaky and just you know tell it to me So I told you already about the agenda And yeah, the question of flakes was already answered so I Have two other questions. So that one is somehow like a little bit suggestive, I guess so who thinks handling flakes is important Like put your hand up a few you don't Who yeah, of course things handling of flakes is important Okay, I thought about that. So You save my day do you have a USB port I Hope so once again, it's on you need to put in a presentation mode There is on the right should be presentation. Yeah, that should be okay Yeah, okay, so the questions we already had Yeah, and another question who has to deal with flakes on a regular basis Wow, okay. Yeah, I expected something like that So yeah, like you correctly already said Flakes are caused either by production code, which is a bug of course or also by flaky test code This is also a bug, but it's handled differently like I already said So we are using prowl for our CI system which comes from the Kubernetes ecosystem. I'm not sure Whether you're familiar with it, but it's pretty flexible and it can just Start jobs from From GitHub events, which is exactly what we want and what we need this picture actually Shows on the top for example there is the commit ID I made I don't even see it like that there We're pointing this is the commit ID and these are the job runs that are defined So like the jobs on the CI system and this of course is a failed job and these are successful jobs so obviously you can see like this is the PR history for one of our PRs inside the Kubrick CI and What you can see here is that of course? There is jobs all run on the same commit ID, but some failed and some succeeded and That's exactly how we see where we have our flakiness Oh Wait a second. That's the wrong direction. Okay, so There is a really interesting survey that which is a major survey about flakiness in test Which is just called a survey of flaky tests Not really impressive about great stuff inside there something like that there you can read that 79% percent of the flakes were for the lungs and More than 50% of flakes could not be reproduced in production In isolation, so which of course leads us to the conclusion that Ignoring fake heat as flaky test is okay, right? It's doesn't of course So When we're talking about CI we want to have a reliable signal of stability in our CI So because of course we want to know whether we can ship our product or not and so any failed test run signals us as the CI maintainers that the product is unstable and that we can ship it So if we are having flakes in our Production system, of course, they give us a wrong signal like that the product is unstable and that we can ship Which we later then have to verify the test code what exactly got wrong and then we notice it's a flaky test So this is wasted of course a lot of time so Not only does it waste the time of the Developers themselves who have to look at the test results somehow and determine whether this is the flaky test or not But it's also like when you have a CI that is somehow Determining whether a PR can get merged via the test And then you have a test result Of course the merge will not go through So this cost friction by the developers who have then somehow Maybe reissue another test So if they see it's flaky if they there is nothing to fix then they would just retest Which somehow? Yeah, sometimes you would just think okay, there was flakiness. I'm just going to retry not even looking at the test results somehow Which which I would call the retest trap and we have actually had retest like I Mean the highest number I've been seen like 25 times testing and retest on the same commit Do they have to oh I have to stay here. Okay Okay And also a very bad thing is like I'm not sure I guess any CI system has something like an acceleration system where for example, it's like testing like Multiple git commits at once so that it can merge them all together And of course if there is a flaky test this acceleration effect will just be reversed. It will not be effective Yeah, like I said another wasted wasted thing so also flaky test Trust issues at the developers themselves because they of course lose the trust in automated testing Which is really sad because that's All that we want to do we want to trust the test But if we can then then of course we are just ignoring test results, which is not a good idea So how so we want to minimize the impact in our CI so that people don't Experience that much friction Time flies so What we do there is we quarantine those tests we put them out of the set of stable tests and Put them in another set so that they are not run during pull request runs But we only want to do that as we want to do that as early as possible when we Detect the flakiness, but only as long as necessary because tests on themselves of course have value So otherwise it wouldn't be there What do we need for that? We need some like mechanism where we can put stable test from the set of stable test to the set of quarantine test Of course, we also need a report over the flakiness So we can triage Which flaky test we need to act upon first if you have a lot of flaky test that matters so because the higher The flakiness of the test is of course the highest impact And yeah, lots of data because of course you need to somehow analyze whether a test is even flaky or not So as I already said I already described this this is like a The latest commit on a merge PR where we have some flaky test or some failing test runs Which later on got green on the same commit so no changes on the code So This is not of course not saying us that is it is actually flaky But it might might be flaky and like you said it could either be in production code or in the test code itself But that doesn't matter in the end the Problem that we have is the fiction NCI and the wasted resources there So our flake process is pretty well pretty pretty rough I'd say are pretty pretty easy So we have regular meetings where we look at the at the results and at the flakes And then we decide what we want to do with those flakes. So first of all, of course You have to know whether a test is flaky or not You're looking at the test results and deciding whom you showed contact so that he fixes that because we don't fix the test Ourselves we let the developers do that because yeah, they created their mess. They should clean it up A problem is of course when people are gone from the project then someone else has to care, but yeah So we have the flaky test to the dev developers and at the time when it's been Corrected we bring those tests back in so the truth that we have is like we have the main thing that Decides whether a test is being run between For the pull request. It's just a just a note on the test itself. That is like There is in the test name. There is this quarantine Word which is the keyword which makes the test get ignored for the pull request runs We still do to do run those tests to have this stability signal But not in the pre-submit which are required for the pull request merges But in the periodic runs That run I think three times a day So that we still have a signal when we can take a test back in in order to Have the value added again So another thing is of course you need to report so we this is a not not really Nice looking but efficient thing like a heat map so where you see where the action is going on You see like the more reddish the colors are getting the worst the problem is This is in oh, no, I can't go there. So like on the top you can just see on which day how many failures were occurring and There is another like axis which is the per lane Failure so that we can pretty much see which lane is flaky and on which was the biggest impact for everything This is the first time I'm using this sorry. I'm just always switching the directions Okay, this is the detailed report about how flaky a test is or how flaky those tests are This is ordered by the number of failures Occuring for test this is a bit overwhelming I think but on the left column you just see the test name and on the on the upper column you see the number the test lanes that are The Versions of the latest three Like we have a lot of test lanes that have different Sigs which are maintaining them and this of course obviously creates a matrix of like like at least 12 really well, yeah, really Important lanes which absolutely have to be be stable Yeah, and this helps us like finding where which test we should look at and quarantine or which we shouldn't We have also long-term metrics where we can decide how we were doing in the past so like at least everyone of course wants to know whether they are improving or Getting worse in handling flakes that we have long-term metrics where we can look at how we were doing So how many merges per day for example or how many merge PRs with zero retest Which is the thing that we are measuring currently against the most because Obviously that number should be like 28 of 28, but we seldom reach that like flakies We also have a small report over the The tests that happen quarantine so that we can find them quickly because it's like Grapping over the code base is also of course doable But it is easier to just have some report that we can look at straight away during our meetings And then finally we have all the test grade which also Collects all the periodic results so that we can deduce Where to Whether the tests have been stable or not. So this is the tool I guess that guys from the from the Kubernetes ecosystem know that because Kubernetes also uses test grid for collecting all the test results so that you can quickly drill down Yeah, and we have also established Another lane that checks the test for stability which does a thing that that makes like Test dependencies for example visible you I guess you know what a test dependency is some tests that hasn't cleaned up and Leap the mess for other tests and then influencing them and then they might be failing Or the other way around they might not Was already sufficient for For the following tests and if you are just randomizing the test order you catch those cases which is like you have to have Isolated test cases, right? And also it tries to run it five times because Like I said before in the in this metadata in this meta report like Bit more than 80% of all the tests have been fed off the flaky test have been failing after about five times This is not that you catch all of them, but the majority Yeah, that's just the CI search tool so in a nutshell we Just do in regular intervals meetings that we look over the data somehow So like I described before What we want to do is of course we want to collect even more data like We want to run the majority of tests in the same way as we are doing in the flake lane like Running them five times after another and also running Randomizing always the order so that we have a better picture over how flaky our code base is And yeah, of course like we want to avoid this retest problem where you Blindly just retest your things so we are looking for ways to just Directly find that case Yeah, so it's pretty I've been Running through pretty quickly any questions Yeah So you've been talking about responsibility of devs to fix the flakiness So this kind of assumes that the flakiness is introduced either by new tests or by changes on tests or changes on the code base But what about flakiness that is introduced by the by your infrastructure actually like network latency or things like that Do we have we have those problems or is it something that you I didn't get the less could you repeat the last sentence? Sorry sorry so you You imply that the flakiness can either be introduced in new tests or changes in tests or changes in the code base but have you ever been confronted with flakiness introduced by your infrastructure your Like network latency or something like that and how do you detect them and how do you of course of course that that is also a problem But when you have like flakiness in your test infrastructure or even failures in the test infrastructure That's an entire different problem and what we have observed there is that a lot of tests are to fail then and that's why we look at first of all when we have like Rough estimate of like like have more than 20 tests failing at one run that might likely be because the test infrastructure is failing and actually We decided to just quickly verify that there is something going on in the infrastructure And then just disregard that run and yeah in earlier days. We had that problem pretty much often But in recent days it hasn't been happening anymore or Much less that's let's put it like that Of course of course we look so like what we are what we are having to test our E2e test like QBIT is a complex system It's an addition on Kubernetes so that you can run virtual machines and of course for testing that you for testing E2e You need a full Kubernetes cluster which with on which you will deploy The QBIT and that's what we're doing in DCI. So we are actually spinning up Some I would say a frozen cluster like the virtualized nodes that have been frozen and that are spun up on demand Like this takes around one and a half minutes to spin up such a cluster and Then you run all those tests somehow and we have like we have like always three versions of the thank you very much We are running out of time. Yeah, you can continue us. Thank you
From "Free-Lunch" to Dog-Fooding: A Culinary Journey in Crafting IaC for Kairos Testing and Building
Hello. Hello. All right. Hello. Welcome. Thank you for joining this talk. This is about infrastructure, code, and mostly about... and mostly about choices. So that's the main point I want to make. There will be some analogies with food. So, yeah, I hope you don't leave the room to go find food. So, stay with me. Mm-hmm. You really need to stay close to that. Yeah, yeah, okay. Thanks. All right, so this is me. The one piece of information you may want to keep is the Codeberg username I have, because the code I'm talking about, so some samples and things you can copy from are in the Codeberg repo I'm going to show later. So if there's something to keep from this slide, it's this one. And also that I'm working for Kyros, which I'm going to talk about a bit more later. That's the project I'm working on, the App & Source project I'm working on. So, yeah, I said this is about choices, mostly. So I'm starting with food, because it's a general thing in life. I mean, when you have to judge something with just one criterion, sometimes choices seem obvious, right? So let's take... This is a popular choice, a popular Greek, this, Musaka, and that's the well-known burger. So if we were to choose what we eat based on just one criterion, that one was the amount of time you need to prepare, the choice is obvious, right? But obviously we don't always eat burgers for some reason. And the reason is that we have other requirements. But sometimes when the main criterion is there in front of us, our mind gets stuck to certain choices. And we had to choose otherwise, and that's what I want to show you, our story in Kairos and how we chose to do infrastructure otherwise. So this was our problem. So what Kairos is in a sentence? No, maybe two sentences. It's an immutable OS, a special purpose OS, mainly targeting Kubernetes. So it makes Kubernetes very easy to deploy, maintain data operations and such. But it's this diagnostic. What that means is it brings immutability on top of your favorite distribution. So you start with your distribution and then you apply Kairos, let's say, on top, and it makes it immutable, safe, secure, encryption and all. But that also means for our CI that we have to build many different images, lots of artifacts. So you see some numbers there, like the number of pipelines. And one thing to keep in mind, and you all know that, is that when you're working, you don't generally want to think whether you push something or if you open a pull request or something. If you just do it, you don't want to have to think that you're gonna pay for something. So initially we started, it's on GitHub, by the way, the project, so we started with GitHub Actions, the free ones. But because of these requirements, like huge disk space in some cases, like two VMs in one job, KVM support which free runners didn't have, we were looking for a different solution. And initially we said, okay, let's put money into it, right? Money solves everything, and let's just start paying for runners. In our case, that didn't work because, like I said, we didn't want to have to think whether we push, so we were controlling ourselves, okay, let's not open a draft pull request, right? Because that also runs pipelines, let's wait a bit. I'll do that tomorrow and things like that. So we very quickly reached the point where the cost, the amount of money we were paying was double the money for the same kind of hardware, but on bare metal. So what did we do? I mean, okay, we rejected that one. Dog fooding, so as I said, Kairos makes it easy to deploy Kubernetes, which is maybe the hardest and the more complex part, if you want to start with your own hardware, like how do I maintain it and such? And that's what we do as a project. So we said, okay, we're gonna do that. And then some more tools, like, okay, Kairos, we solved the Kubernetes problem, how we provision stuff and how we maintain it. Then we chose FlexiD for GitOps. I can't go into details on what these are. I hope many of you know what these tools are, but look them up because I only have 10 minutes or so. Shops for Secrets, so that allows you to actually commit secrets in your repository but being encrypted. So it's safe to commit them because you need keys to decrypt them. But if you give the keys to FlexiD at runtime, they will be decrypted and deployed. That means you can have full GitOps, nothing, no manual intervention for secrets or whatever. And then there is a project. You probably know that the Actions Runner Controller that allows you to run GitHub runners in a Kubernetes cluster. So this is our toolset. The next slide is completely relevant. It's just that I generated this one with AI and it reminded me of something. And after a while, I remember what it reminds me. That's my real dog there doing the same thing. So we don't need the AI, just remember that real life. Does the same. So yeah, back to infrastructure. Yeah, so some steps we did. So you can start with a cluster on your laptop, right? Very easy to get one, K3D kind, micro-kates, whatever you prefer. Then you read the docs for FlexiD, obviously, and Shops. You create the keys you need. I'm not going into details, but this is what we did. More documentation on how you deploy Actions Runner Controller. And when everything is working, that's the interesting part. Then you can go and get some real hardware. We went for value for money things. We tried a couple of them. So we got, I don't know, 10 different machines or something and then we deployed. And that's the interesting part. The three commands there. How much time do I have? I think I can show you a demo. At this point, where am I? I'm there. There. No, no, no. Where are you? Where did you go? There you are. So what I'm doing here, not here, but here, I will restart it, don't worry. So I start on a project that has no runners. On the left one is the repo. I'm gonna show you the link in a while. And on the right, I'm just copying and pasting three commands. There's a timer down there. It's three and a half minutes spoiler. So what I'm doing is with three commands, I'm gonna start from a cluster, from scratch. So I use K3D to create a cluster. And I play the secrets. So there are just two secrets one that Fluxy needs to pull the repository and another one to decrypt the secrets that are encrypted with SOPS, like up keys for GitHub and such. This is the secrets command. This is the one that creates the namespace in the final and bootstraps the already existing repository. So after three minutes, you'll see that we got runners deployed and connected to GitHub. And that's from a scratch cluster. And we had to do that two or three times because we chose some hardware that didn't work because of some network issues in Hexen, I think. We had, then we moved, but by moving, we were afraid that, okay, we're gonna do that again, but it turned out to be very simple. So what I'm trying to say here is that the choice we made paid off because every time we need to recreate the whole thing, just three commands, right? We create a cluster and we just spin it up. So the initial choice, I mean, we weren't sure if we had to spend the time to create all these because yeah, I described that in 10 minutes, but it took a sprint or two, like a week or two to implement it, but it turns out it pays off because now our infrastructure is cut. It's not patch anymore. We don't care. We can go somewhere else if it's cheaper. So it does pay off. So I'm not sure where it is now. Yeah, I'm cheating a bit here because I don't want to wait for the reconciliation. So I just keep the controller to force it to check. This tool is K9s, by the way, if you don't know that check it out, it makes it extremely easy to navigate through Kubernetes resources and all. So yeah, at this point, yeah, this is the action runner's controller and this will bring up the runners. So when we want to make changes on this thing, we just commit. So if you don't know what GitOps is, you probably know, but you just make a change, you commit, yeah. You commit that to Git and you can actually review that and this thing will apply the diff to make your cluster look like what you described. So yeah, after a while, they come up. Maybe quickly I can skip a bit. Yeah, there we are. And eventually they show up. So just go back. Yeah, that was it. This is the repo. So everything I showed is there. You can't really use it because some of the secrets are encrypted with my own keys, but you can copy everything else. You just have to replace with your secrets. There are instructions. I try to write as many docs as possible, but feel free to open issues or ask me anything. I have my, yeah, that was a screenshot in case I didn't have the video. So the outcome, yes, it works. It broke sometimes initially, but it balanced out. And like I said, it really paid off because it makes making changes extremely easy. So yeah, what I wanted you to think that it is possible and it's a good idea. And I'm not saying it's going to work for everybody. And I'm not certainly saying that GitHub, but others, the paid ones or whatever is not good one, but different teams have different needs. So if you're thinking about it, check it out. Check Kairos and ask us any questions. This is my email. This is the team's email. We'll get that one. And this is our matrix channel. So whatever questions you have, we'll be happy to talk to you. And we're around. We also have an ice hoodie like this, but it's large. So the first come, first served, who is large and we have stickers and we'll be happy to talk to you outside. Thanks. Do we have time for questions? Yeah, we have. OK. Need a mic? I don't know. Microphone? Where? Or you can shout. I'll try to hear. Hold this there. Hi. Have you faced problems with different CPU architectures? Because sometimes it may be hard to get some types of hosts, like ARM x86. Are you talking about Kairos? Yes. Yes, we're not trying to test. We're not trying to specifically. So the architecture of Kairos had problems with the... Can you repeat the question? So you run the test, but it's mostly based on containers locally, but have you found problems testing in different CPU architectures? Yeah, got you now. No, the tests are mainly running with QMU. That's why we need big machines because it's a full OS. So we do test ARM as well. It's just a bit hard to test boards because Raspberry Pi's are just not as easy to automate. You need some KVM or something. So yeah, mainly QMU would do ARM, but not the actual boards. So sometimes things break. But yeah, that's a Kairos question. Thanks. Anybody else? We can take one more, I guess. Yeah, there you go. I saw that you're using the summer bind renders. Those are the old ones, right? Yeah, no, sorry. I realized when I was running, yeah, we have to update. You saw that. You're going to switch to the Google Managed GitHub supported runners. Ah, they changed the images you mean? Yeah, I think the GitHub adopted the actions runner controller. Ah, I didn't even notice that. Yeah, sorry. Thanks. We'll do that. Last question. No, OK. Thank you. Thank you. Thank you. Thank you. Thank you.
Practical CI/CD Observability with OpenTelemetry
Hello everyone. Thank you for having us here. It's our first time here, so please be kind. It's like for both of us, it's the first time here, so we're a little bit nervous. And we're here to talk about practical CI, CID observability with open telemetry. This is the abstract we have submitted and of course we don't expect you to go through a whole thing because it's just enormous. But like if we could abstract the abstract in a way, this talk is about enhancing your pipeline's ability and performance by bringing observability to every stage of software delivery. So we're going to answer two questions like how we can identify flakiness and bottlenecks during our CI-CB process and envision a future of effortless visibility. And again, we're going to talk about what's the role of open telemetry on that and also how, with the role it plays in shaping CI-CB's future and explore all the challenges and opportunities ahead. So if I go to the next slide, a little bit about us, I'm Dimitris, a software engineer at Grafana Labs at the Platform Productivity Squad. And I'm Giordano, Gio for France. I'm a software engineer for the X-Port Squad at Grafana Labs. So this is the agenda. As I said, we're going to start by defining what CI really is and then talk about current issues we have with CI-CB systems. And then we're going to do a small intro to open telemetry and how we use it really, why it is important to own our data, and then practical use cases where we are and what's next. So all that being said, we can proceed to the next slide with a question to you all. So what is CI? We're looking for a definition of CI here. Anyone there to guess? It's fine if not. Okay, I'll proceed. So, sorry, say again? Yeah, we're not looking for continuous integration. We're looking for the definition of that. So CI, I guess, thank you very much, is continuous integration. But the definition of that is, you know, as some experts have defined it in a couple of books, the thing is that continuous integration definition can mean different things to different people. It's all a matter of a perspective where you're looking that from what CI means to everyone. But one thing is for sure, continuous is the only thing that's going to be there all the time because we're talking about a never-ending feedback loop, you know, which keeps improving stuff and gives us like visibility over our CI CD processes. So we can move to the next question, which is what is CI like for real this time? And I would like again, if someone dares to guess. It's a black box. Yes, right. Yeah, go on. Yes, running test could be. So again, CI is a list of things like it's a mechanism to, for example, reduce repetitive manual processes or generate deployable software at any time and any place. And of course, like, you know, scores, flaky tests and flaky builds and prevent people from, you know, getting paid at 3am in the morning because we don't want to like spend human hours during the night. They're really important for us. So the next slide is about, I mean, if you think that this is complicated, I think not. I think, you know, this happens to this has happened to at least everyone of us, at least like once, I think. So, you know, it starts from testing, building, deploying and then waiting for changes or wait for 3am maintenance windows. You can see here to errors and downtime and panicking. And also when we have resolved all those issues, we all go to LinkedIn after that and say, you know, I'm a troubleshooting expert or like DevOps expert, automation expert. So that's what we do because that's who we are. Yes, true. Exactly. So the next slide is another question. Yeah. So another question, the next slide, what is CI like for real, real this time? And what we're looking for is a single word. If anyone, there's to guess that single word. Pipelines, automation. Alertings. Yes, that's what we're looking for. We're looking for alerting. So CI and alerting serve a common person, a purpose or at least they try to serve a common purpose. So they work closely together as essential components of continuous automated monitoring. So you can see that both of them are practically identifying issues or, you know, we have like continuous system monitoring or like, you know, all those things. And then CI, we're looking at alerting as the left shift of CI basically, which means that if we have affecting alerting within CI ensures that, you know, threshold bridges and like potential problems are going to vanish. And we see I needs to focus on like robust build for new releases. So together, CI and alerting serve a common goal like prompt problem identification, fortifying system, reliability and sustainability. As you can see from the picture, they need to be like holding hands forever because that's what they do. And we'll go to the next slide. So a few things about continuous integration. We already talked about some of them, but like continuous integration is the guard in early stages, like we can detect changes, maintain bills, health and constantly monitor system signals. And like CI is used to catch issues before they breathe really. So if we go to the next slide, we're having alerting next to that, but we're not actually comparing those two. We just want to show you how closely, like how tightly coupled they are together. So alerting is like, you know, is our alerting system, like for later stages, they identify as problems as well, maintenance, allows and monitor system, just like CI does. Just but we should see that alerting is a mechanism to be used just in case CI has something has slipped through CI and we didn't catch it. So when we have alerting in place and CI in place as well, we need to know that there are not two components running in parallel, right? They are like CI lanes that lays the groundwork and then alerting response to threats. So they are like, you know, unstoppable working together to serve the same purpose. So an important thing to remember about alerting is that every time we need to create actionable alerts. So if something slips through CI, we need to know, we need to get the alert, have a runbook, have some documentation, automatically resolve some alerts if we don't think that are important enough to wake up someone in the middle of the night and all that. So where we are now with CI-CD systems and like, what is this whole talk amounted to? So observability so far, as you can see here, is about like, you know, all the all-time classic concepts we know, like from printing here, we have all done this, like in our early stages, I guess, from printing that, we're still doing that? Yeah, okay, we're still doing that. And then from paging the platform team or from having three different platforms, like we can have GitHub or GitHub or Atlassian or Bitbucket, whatever you use and then find a broken test, go from there to your favorite CI vendor and then go from there to Grafana or Data Dog or your favorite visualization tool to try and correlate those errors together. So focusing, as you can see down there, if we focus, if like the sole focus of observability is at the run part of things, this neglects valuable insights from earlier phases like code review or building or testing and like incomplete observability across the CI pipeline leads to limited visibility during, you know, earlier stages. We don't know what happened during the build phase, for example, or the test phase or we have difficulty in root cause analysis or increased mean time to recovery. Gio is going to talk to you more about that and how this is related to Dora metrics and also missed optimization opportunities. Like we know that our CI pipelines take a lot to run, but we don't actually know what to improve if we want to make them sort of make them faster. So next question, typical, this is fine meme. You know, we know we deploy something, everything catches fire, we are happy and what we do basically is that we try to mitigate the fire. But when the observability part of things is so late in the deployment and development and life cycle, I think it's too late. So there was no reason to let it last this long and get this bad. So how we can be more proactive? If we shift our focus a little bit to the left, we can address issues before they escalate and be proactive. We can enhance the efficiency by catching problems early in the process. We can have, we can ensure robustness by focusing on like the integrity of our builds and tests and also be mindful about the cost reduction because this is also a really important topic and minimize expenses associated with post deployment troubleshooting at downtime. So the next slide, if we assume that we have focused our shift left, the other, so you know, things turn the other way around. So instead of having the fire everywhere and then us in the middle like being agnostic of what's happening is the other way around. So we have a lot of time to mitigate the fire. We can actually be proactive and as we prioritize observability earlier in the development process, we are identifying and addressing issues actually before they become fire. We tried many tools. We tried to find the best way to set up like such a system so we can proactively like react to all those problems. So the tool we found easier to use and address like all those issues in CI CD pipeline is open telemetry because it helps us create like standard patterns and some underconventions. Jir is going to talk to you more about that in a little bit. So in the next slide, we're going to show how we use open telemetry to get to exactly this point where it's even if something appears, it's still too early and we can, you know, act and fix the issue before we wake up people in the middle of the night. So stage is yours. Thank you. Can you hear me? Okay. Thank you. So first question. What is open telemetry? Does anyone know what to work with? One? So no one. Okay. A few people. But as a short definition of open telemetry is it's an observability framework which is designed to manage and create telemetry data such as metrics, logs, traces, events, whatever. There is of course a more comprehensive definition of open telemetry which is way longer, way more complex, which you can find on the open telemetry website. For this case, though, our, what I want to focus on here is two bits of definition which is semantic convention and owning your own data. Now, semantic conventions, we can think about them as a standard. Like it's a standard way of naming things, of a standard way of defining attributes for your logs, for your metrics, for your traces. And I mean, we know, we all know this. If we think about semantic conventions as standards, we can divide them by two different areas or we can categorize them by two different, in two different ways. By signal type such as metrics, logs, traces, events, whatever. And by AIA. Now, by AIA means we have telemetry, there are semantic conventions for databases, we have semantic conventions for cloud providers, we have semantic conventions for a lot of different things, really, for log files. Something that is not there yet, though, is semantic conventions for continuous integration and continuous delivery. Now, what is important in my opinion? This is important because, I mean, we use some CI tool, I guess, everyone here uses a different CI tool. But regardless of what we use, we can see that at the end of the day, the data that is behind its CI tool is the same. Regardless of whether someone calls it stage, someone calls it job, someone calls it status or outcome or whatever, the underlying data, such as the job name, the outcome of a CI system is the same. Now, I'm not extremely familiar with every CI system of there, but at some point I was trying to figure out why in our CI we had a test that sometimes was taking three minutes to a test, like a pipeline, that sometimes was taking only three minutes to complete, while some others up to nine or ten, which, I mean, without any code changes. So, like, if you talk about flakiness, yeah, that's part of, like, test failing, failing, other parties, why sometimes they take too long. Easy peasy, I think. I wrote some totally reliable Go code. No? No, okay. It was very good code. So, what was code was doing was getting stuff out of the ground database and pushing it to log a template in here for later analysis. It worked great. Worked perfectly. Now, what happened is that we were able, at the end of the day, even if the code wasn't very good, we were able to at least look at something outside of our CI system. Why this? Because our CI system didn't provide us with the UI, with the tools to query for the data we were looking for. So, we were trying to analyze why something was happening. And our UI wasn't able to do so. So, okay. I share the news with my team. And, I mean, I guess every one of you has been there at some point in your life. They got too excited. Ivana wanted us to have the log data on Elasticsearch. Piotr was, which is not a colleague, wanted to get tracing data from Git action, this is a lot of drone. And, yeah, no, that code was not good enough. I mean, back to the drawing board. What happens now? We need to figure out a way of getting data out of GitHub CI, drone, GitHub actions, whatever. I don't know. What else? Bit bucket. I don't know if they have a CI system. I'm not sure about. And, we need to push it to every database out there, from Graphite, Tempo, Elasticsearch, I don't know, whatever. Yeager. I mean, it sounds like a very silly question. I bet there is no one of us that really uses 10 different databases to match their telemetry data. Also, because then, you know, this is what was going to happen. We had to write code to get data out of every CI system to push it to every other database system. And, I say no, like, I wasn't going to do that. But, this, I like some very important point that's owning your data. When I think about, when I started thinking about owning my data, what I thought about was mostly owning the hardware in which the data was going to be stored. So, like, owning the drive or having it stored on one of my machines. I think that that's not exactly the point we need to make here. I think owning your data means you being able to decide where the data goes, where and how to store the data. We can very, very well be using a cloud database provider to store our data. The important bit is that we own, we know, we decide where the data is going to be stored and we decide and we have the ability to use the data however we want. So, the reason why open telemetry is important and fits very well with the picture is that by defining standards and by defining a specification for which data can be transferred and stored, not stored but transferred, we are able to only take care of the first part of the equation here. We take data out of the systems and then open telemetry is going to take care about inverting and sending to the database we need. What we did was we built an open telemetry collector who, does any of you know what our collector distribution is? A few. Okay, so an open telemetry collector distribution is basically a set of pre-built components. It's a binary that you can run, of course, and you can configure to do things. It consists very reductively of receivers, processors and exporters. Receivers are the components that allow you to get data in, can be like watching at log files, can be, I don't know, even listening on Bluetooth 1.0 and check for things over the air, can be extracting metrics from some running services. There are processors that transfer this data in the format that you need. They add attributes, they modify attributes or remove them. And exporters that send this data out to your database of choice. The thing here is that for those exporters, we didn't write anything. Those are already open source exporters. There are more for elastic search for whatever, like really Jager or you name it. So the only bit we had to do was writing a drone receiver, which was getting data out of drone to pipe into open telemetry and then push it to log in, tempo and prometheus. There are some practical examples. There is a Jenkins plugin that gets traces data out of Jenkins and sends it via the OTLP format. Irokinz brought these other get-up functions that run commands and exports the execution of these commands as trace data. And of course, our own experiment, which is very complete but very well free to take a look at. Now, what is unlocked? What is unlocked for us? As I said, first of all, there were these performance issues with our first test. Second thing, at some point we had this test here that was a bit flaky. Failing sometimes, sometimes not. Of course, worked on my machine, worked on my advanced machine probably. But yeah, we couldn't figure out what to do with it because we would disable it but then when we were going to enable it, if you cannot really reproduce it locally. Now, by getting this data out of our CI and pushing the build logs into our observability system into our log instance, we were able to trace back from the build that you see on the right to the logs for that build, trace back to the first time that failed test in our CI. And from there, if you look down here, this is an attribute that we thought was valuable. We had a build number, which is our unique ID for drone, which then pointed out to the first pull request that introduced that test or that flakiness. With that, we were able to identify what was causing the actual issue, which was a test which was totally unrelated, running a different suite. But turns out that was causing the flakiness. Something else that we were able to do was, so first of all, like one, I don't know, silly thing that we did but we liked, was to create a custom UI within Grafana to mimic sort of like the UI that you have when you look at the output of your system. I mean, there is some value maybe near but the important bit here is that we own the data. We were able to do something which was funny. We spent maybe one day on it. And yeah, it will look good. The second thing, however, is more important. Now, in Grafana, we have a very complex release system. Very complex. We maintain a set of different release branches that need, in theory, should need to be released at an even time. Of course, like everyone, like for everyone, something things breaks. Sometimes things break and you don't know why because you are not looking at it. Sometimes a commit you make in main breaks something else somewhere else because you back ported it but you didn't really test it. What we were able to do with this was keep getting metrics and stats out of our system, out of our builds so that we could be the timeline of our deployment branches. This means that at any given time, we had a single pane of glass to look at what was the status of our release processes so that our release team could just go here and check whether something was broken they needed to act upon before trying to do our release. We also had visibility over the stats over the number of running pipelines or failed pipelines. We can dig into builds. We can do a lot of different things which we didn't feel like they were possible in our CI system UI. What is Unlocks? Really anything. The point here is that we are trying to define standards. We are trying to get into this space. It's a very early stage concept but what it may unlock given that you own your data, it's really up to you. We can talk about Dora metrics so having ways of reducing mean time to restoring services, we can talk about generating red like metrics or requests for your CI. How long did it take? Did something that happened start at the rate of our failing test? The duration started going up because of something we did on our environment. That's something that for us it unlocked quite well. Some other example, caught coverage over time. There is no reason why you cannot export test results as JUnit maybe and then graph them on Grafana and keep in track of your coverage. You can do flakiness detection like we did before. You start seeing that the test started failing at some point. You can detect that. You can create an alert on flakiness. You can trace back to where the test started flaking. At that point we think it's for us and we think it was way easier to identify what was the actual root cause of the flakiness. Then we have security, whatever. Really, the data is yours. You decide what to do with it. Again, all in all, what is unlocked for us at this point, I think there are three different CI systems. We are using three different systems for different reasons. All in all, what is unlocked for us was bringing all the data into there to work with Grafana and to have our production metrics together with the pre-production metrics. Now, what's next? We have formed an open telemetry working group about CI security observability. There are more stuff to come. Join the discussion. If you have your own issue that you want to fix or your own use case that you want to bring up to the group, please join the Cloud Native Computing Foundation Slack channel. This is the proposal for the standard. That's it. If you have any questions. Any questions? Yeah, up there. Anything you need? Yeah, we'll try. What kind of? Sampling. So far, we're not sampling anything. We are collecting a trace for every build that goes through the CI system. For PRs, it's a bit different because we don't want to create bad data, like useless data. It costs money. Data costs money. What we do is we generate data only for pipelines that happen on those branches we care about. So if you make a PR and the PR is okay, it gets merged into main. After it gets merged, we run another pipeline, the same one before the PR, and that one we collect data from. That way, basically, we have the flakiness on our list branch and not on the PRs because in PRs, I mean, it's not flaky. I mean, okay, we can reflect it as in PRs, but maybe we are doing something and it breaks the build, but maybe it's not a vital point. Yeah, if we did that for every branch, basically, we would face cardinality explosion and it's going to be so expensive. So you have to define which branches you're interested in. For example, in Grafana, we have the main branch, which is like the main branch of our repo, and then some version branches for all the different versions that Grafana have, for example. And this is what we're interested in. But again, you're on the data. You can decide to do it all your way. Any other? Yeah. How many flaky tests or else have you found by exporting data from this? What's the case? Yeah, number of unnecessary... Oh, yeah, sorry. Apart from flaky tests, what other metrics can we get, like useful metrics we can get out of that? So I think... No, no, no. What other problems have you encountered? Like you found out that you didn't find what you're just looking at? For example, stack runners. Runners were stuck in unused repositories. We didn't have a way to know that there was a runner running all the time. We're getting timeouts and all that. This one problem. Then another problem is the number of restarts in builds, which is basically related to flaky tests. But there was no way for us to know how many, like for Geo, for example, went and restarted his build because it was problematic, because there was a bug, or because there was an actual issue with runner. It doesn't have to be necessarily code related. So we needed to know how big was the number of the restarts and then try to find the root cause of what caused this, basically. There was also something I want to talk about, maybe, is that we are also able to improve a bit of the performance of our pipeline. And by performance, I mean just allocating more resources. By doing that, we were also maybe able to reduce the cost of the bit because the runner where pipelines were running for shorter, there was less queue. So it's also like improving performance also comes from having the data about how long they take. And also, last thing is that we also had issues where we used extremely powerful runners to build docs, for example. And docs builds took, I don't know, a minute where if the docs build took like five minutes, it was not going to be the end of the world, because there are docs, they're just small changes, really important changes, don't get me wrong, but small. So we could move away from really powerful runners to something smaller just to help with some cost reduction and stuff. Any other questions? Do we have time? Do we have one up there? Do we have time? No? Come join us at the Grafana booth, please. Thank you.
Chaos Engineering in Action: Enhancing Resilience in Strimzi
This is nice. So hello guys. Today we prepare a presentation about the house engineering in action. I am Marj Orsak and this is Henrik Srenching and we both work as a software engineers in Red Hat. Today we also prepared a quiz. You can see the QR code. You can scan it with your phone and if you are quick enough and you get correct answers you can win a prize. So over to you Henrik. Yeah so the content of the presentation is as following. We will begin with a brief explanation of house engineering. Then we will describe how the target systems may actually look like in the production. Then we will turn our focus on disabling house. Afterwards there will be two brief demonstrations and then a quick conclusion of how to actually work with the house. So when we are thinking about system resilience or application resilience we have to think about all the components which our application depends upon. That mean other components and other services. There is also big dependency on the network and infrastructure. All of these things are mostly visible in the distributed systems. There are many known fallacies about distributed systems mostly concerned about network and bandwidth. When we will then take a look on a system from the viewpoint of many instances and services which have to communicate with each other in order for the system to work great. We will come to the problem of complexity and the fact that there is possibly no single person that can understand the system completely and every state which can the system get into. So what can happen and what will probably inevitably happen in the system of such magnitude is the thing that one instance or more will crash. This is the story about house monkey which I guess some of you may be familiar with but all we have to know so far is that it is some of first house tools which just randomly kill some instance in the production and it will force engineers to take proactive action to make system more resilient. We can take this step further and bring down not just few instances but availability zone or cluster or bring any kind of network traffic and get the system into the state we are not so comfortable in for the production environment. So we will get to the definition of the house experimenting. It's experimenting on a system in order to build confidence in the system's capability to withstand turbulent condition in production. This may sound weird for us because why would anyone want to bring the house into the production isn't it something funny which should we actually avoid and the real reason for doing so is actually it's the time difference. It's much more easier to solve the problems at 4 a.m. or 4 p.m. rather than 4 a.m. when you are under the high pressure from the customers to solve the problems. There are many principles which we have to abide or we should abide in the house engineering. The first one and most important one is minimal blast radius for each experiment you conduct. We should imagine some red button for each experiment which should be able to stop it in case anything goes wrong. Other principles are mostly focused on the same thing like we would test the thing in the real life. We want to focus on how it actually works in production. We want to make sure it works correctly and we want to introduce the problems that may happen in the real life. The last principle is that it's a continuous run which is basically about running these tests or experiments each time for as possible and as effortlessly as possible. Now over to the target system. This all started with the monolith architecture where we get one box, one backend, one database and one UI. In terms of the complexity it was quite low. You simply get some users connection and the server complexity or overwhelming was not so high. Then after some time you add some customers more and more like let's say not four or five thousand and the load was pretty much high and the server would immediately crash for instance. So such architecture is really hard to scale horizontally and one way how to tackle this problem is scale vertically but you can scale vertically all the time. The second point was that the fault-or-ency of such architecture is really bad. You just target one node and the server just immediately crash and the users will be really sad because they don't get any response. So then Dockercams with the microservice architecture where all these previous improved we got portability isolation. We somehow get better horizontal scaling but in case when you have like thousands of instances it would be quite hard to manage all of these containers. On the other hand also the complexity here increased because of the network trickle and more. And so Kubernetes came to solve scalability in terms of the horizontal and in the Kubernetes you easily if you want to have one replica of the system you just type in the semi-ammo file, apply it and the Kubernetes will do it. Then if you see your server crashes or somehow is overloaded with the request you just simply set it to free and the Kubernetes will do it. The same with the fault-or-ency where you just I don't know inject some disruptions or something else in the pods. The one will still be up if you only target two of them. But still complexity increased again. And so we are in the operator stage where no one can entirely grasp the system in terms of the behavior. And I want to present one of the such operators is the StreamZ. StreamZ is basically Apache Kafka on its core with encapsulation in the Kubernetes system. On top of that you got some operators which simplifies some upright dynamic configuration. It is tracing more security involved also Grafana dashboards. And it is part of the cloud native computing foundation. But it's quite tough too much unknowns right? So let's break this down. So Apache Kafka it has a lot of buzzwords as you can see public subscribe model it is messaging system and so on but still this doesn't help right? So let's move to the basic of the Kafka. We got some producers not these ones but we get some clients right? So these clients sending messages to the broker. They are happy because the connection is up. We could also set system scale. We could create another Kafka broker set up some listeners and another one. We got a second set of the clients which are called consumers and they are simply receiving this data. So we got this really example of the system where you have producers and consumers but we need also some representation of the data which is Kafka topics. Also each Kafka broker has his own configuration and you can basically set up versions set up in center replicas but this is not important for this talk. So we got a lot of buzzwords as you can see but unfortunately or maybe fortunately we don't have time for it. So we could stick with this model now. So we got the producers we got the consumers we got some brokers which are the servers and what if we encapsulate system in the Kubernetes? Now top of that we add some operators managing the Kafka ecosystem and on top of that we have the operators and this is basically Stream Z. Really complex right? So here we can see an example of the deployment of the Stream Z where you got a lot of connections. These components are not like really important now. The main idea here is that even with this low deployments you get a lot of things where you can inject the hairs. So now I want to say that if we go to the production one of such production is the scale job and before I dig into it I want to thank these guys because without them we would be unable to run such hairs in such a massive survival scale. So as I said a scale job is the production environment for Stream Z and other projects and there are a lot of technologies involved such as I don't know Tecno pipelines, Teno's Promi to use, Grafana, Loki logs and more. And here you can see a basic example of how we basically induce the hairs here. Here we have some Kafka clients, Streams produces consumers with some databases which are communicating with Kafka Connect. We have some middle maker which is transfer data from Kafka A to Kafka B but still this is not intention of this slide. There are a lot of connections. So I think over to you, Henrik. Thanks. So the point of these slides were actually to show or somehow explain that when we come to the system and take a first look it may look quite MSC and quite complicated. We may not understand the whole underlying technological stack or every single components and we are in the position when we want to talk about how the system actually behaves when we introduce house whereas we are not even sure how it should behave normally. That's even increased by the fact that the system doesn't behave how it is in paper but in actuality there are countless of instances and connections, operators, clients, network traffic. We need to have some sort of observability and some intuition in the system. Like in other presentation that were before us there were already some mentions about Prometheus and Grafana. They are quite famous for their purpose. So we will be using them as well. As mentioned we need to have some intuition about the system and how it behaves. Without that it is just a mess. So we actually want to introduce some chaos into the system so we start a search for the problematic parts of the system or where we actually what we actually want to focus on. It is a simple process when we take a basically simple look on the system. Take a look what is critical component, where are some possible bottlenecks, are some part of the network really critical here, are there some real world events that can cause my system to be vulnerable for some time like some rolling updates or some notes we start in the cloud and things like that. What would be really helpful is to collaborate with all the people involved in the system. Like we definitely need some input from the devs. We need to know about at least some basic information about architectural component. What we may come up with is some simple document describing all important parts or things that may occur there or protocols that are included and we will naturally come to the important configuration parameters and maybe even some proposal for simple chaos that could be included. So the output of this in reality is some first look on part of the system which may be actually targeted for simple chaos. Now that we know at least have some first insight what could be first, what could be our first guess to start with the chaos, we may focus on concrete chaos and we may start with some simple experiments. Now how to actually formulate some kind of hypothesis or some sort of experiment when we will take a look on specific thing. We will take a look on just part of the system or few components. Now we will decide to make sure that our core part of the system is actually capable to withstand some instances being lost or have some failures. Because this is still a production environment and although it was even in the main principles of the chaos, we don't want to start with chaos in the production environment. I guess everyone here knows why because first intern will try to introduce some chaos, he will bring down all instances, service will not be available for for holiday and good luck explaining that to your boss. Now we will probably start in a smaller scale, in a stage environment with much smaller traffic, with much smaller stakes like let's say there will be some some client maybe just random random fraction. We will have some few instances and few controllers. We will start by making sure that system is in a steady state. We have our instances up and running. When we are sure about it we can introduce the chaos. When we introduce the chaos, instances goes down and afterwards the system stabilizes by again bringing down the instance. During all the time we are observing all important metrics and parameters about the system. For example it could be messages per second. Now that all that is set and done we can actually implement our chaos. What can be really helpful for this are chaos tools. We will not be describing all of them but simply mention like there is chaos mesh, Kraken, or it moves or some other choices. They will help with definition, evaluation, execution and all the other stuff. We will end up with very simple email files to be executed. Now we can actually execute our chaos and see everything went as expected. There was small decrease in the traffic but overall the system got to the desired state after a while. Okay this was first experiment in the stage. Everything went great. We've got the good feeling of resilience being confirmed in our system but what we are supposed to do now is to repeat the experiment, scale it a bit up, go into the production, really try, really make sure that it is this production environment where we will get the confidence. What may happen is that it will not at all go according to the plan. It will fail miserably and this is also the reason why we should scale these experiments a bit slower and this is also the reason why we eventually want to make them in production because we want to really make sure that this environment which is so important for us is actually able to handle that problem. So as I said no reason to despair just keep on and try in the stage and make it and start slowly definitely. So to the demos. Okay so guys today we prepared two demos for you. The first one is the broker's failure. Here we will target the Kafka Pots. We have seven replicas of the Kafka and we will be targeting the three of them. The observability or the metric would gather would be like throughput, CPU memory and the traffic in the Kafka Pots. Then we will also define some steady state which is basically that all brokers and client replicas would be like okay and the communication throughput will be stable even we inject the chaos and if we define the hypothesis it would be like we will be eliminating three of the Kafka Pots and this would not eventually do some cascading failure and we will be okay like user will be not affected with this. So yeah and also we will have some checks on the producers and Kafka Pots. So let's move on the demo and hopefully it will somehow work. So okay so here we got some setup. We have Kafka cluster. We have some notes. We have producers. Here is defined Pots hails. We have ModFix which we targeting the value of three. Three means that we will be targeting three Pots which will be unable to run and duration for this would be three minutes. So let's try with our script to inject the hails. Yeah so now we are injecting the hails and we see that all Pots that three of them would be not running. We would move to the graph on a dashbox where we have some metrics. Here's some really simple not production ready messages per second as you can see. Now you can see here at the decrease immediate decrease of connections. There will be also decreased the average of the messages but Kafka would recover even when the Pots are down. So here's the decrease but after a time we would see that it eventually recover somehow. Yeah and as we can see also we got some brokers on line four. It's correct now. There are also some under replicated partitions. Yeah Kafka is okay now and after this experiment will be done. I think it could be done. Yeah so now we are do the checks. We are checking the stream Z Pots at Kafka which are just internal custom resource of stream Z and now the Kafka Pots are ready. We're completed and also in the Gavana dashboards we will see that brokers will go online. The under replicated partitions we all go to the zero. I think so in the yeah and here it is. Okay so this was the first demo and we got also the second one. Yeah and this is basically a worker node node crash and to quickly somehow describe it the topology is that we have the producer we have the Kafka AI Kafka B with some consumer and in the middle there is Kafka mirror maker which just basically transfers data from Kafka A to Kafka B. The steady state again is that all services are fully available and ready to accept traffic. We made the hypothesis that eliminating one of the Kubernetes worker spools will not bring any down services and also the producers and consumers will be not affected. They will be simply sending some messages without any harm. So let's move to the demo two. I will show you the important things. So we got source Kafka cluster, target Kafka cluster, mirror maker, we have some work nodes, we inject the chaos. We would also create the continuous clients, person, consumer and that's for the correctness that all messages are sent and also receive without any harm. There will be no connection refuse or something like that. So we now reset or crash the worker node. We will see that the worker node will move from the ready state to not ready. Here it is, it's not ready, but clients are successfully and happy with the sending and receiving messages. The script is just checking that worker node is still not ready and we are waiting for recovery. It would take some time. I think it should be a worker. And so now worker nodes just move to the ready state. We can see that all containers which were affected on the specific node would be creating again, producing consumers still sending and receiving messages. We do some checks on again, again, the stateful sets. Yes, this is okay. We target cluster, cluster, recovered also. We're also doing checks for mirror maker. And the script just runs successfully and we are happy. Okay. So I think that was two of the demos. And last words over to Henry. Yeah. So as you could see in the demonstration, the benefits of the chaos or execution of the chaos was a bit different from testing we are used to. There was quite a big hype about the house engineering and possible benefits it can bring to our organization. Yes, it can definitely reveal bugs in the production. You can drastically improve the actual performance there or the situation in the cluster regarding the resilience of the system. But what is the main benefit of doing such a thing is getting confidence in the system, finding the misconfiguration. Those of you who have tried running application in Kubernetes know how important it is to have all volumes, all liveness and readiness checks set correctly and overall infrastructure set in place. The actual greatest benefit is actually is in fact getting experience and new knowledge about the system and really understand how it is supposed to work. This is not a holy grail as I said and it can be a bit disappointing for some. But if we think about the house engineering as natural a step above the other testing and not their replacement, we can see a great benefit in it. So how we can actually embrace it in our organization. The very well-known concept is game days when we will put together a lot of roles and a lot of people from our organization, introduce some kind of chaos and let them handle it in some reasonable manner where they can all communicate, all contribute and fix the problem in reasonable time. So that's why I do a friendly way how to start with it. Know your tools. I know it can be overwhelming. You could see it even in the demo that we have to introduce quite a lot of tools in order to run even simple experiments. But once you know the basics and have some confidence in it, you can really start to make some kind of chaos. We can really recommend some great books about house engineering, Kafka if you want, but still there are a lot of tools and it's what is the most important due to that is to definitely start small don't be afraid to set up some stage environment where you can actually practice and confirm your hypothesis before you will actually go into the production and start doing mayhem. Thank you for your attention. Really appreciate it. Questions? No time? One question. Question? Yeah? Yes, there are. It actually depends. It mostly from practical terms. It mostly start to make sense only when we are talking about not some kind of monolithic application, but we are actually deployed on a cloud. It's some kind of microservices architecture. I would say that it does not depends as much on the size of the system as the fact how much you depend on a customer experience in a sense. That when will it really be decremental for your system to get into the chaotic condition. But yeah, thank you as well.
Progressive Delivery Made Easy with Argo Rollouts
Thank you for being here. I'm going to talk about progressive delivery and hopefully by the end of this talk you're going to know how to easily do canary deployments on Kubernetes. Who is using Kubernetes today? Raise your hand, please. Everybody. I'm not asking if everybody knows what Kubernetes is because you're in the wrong place. I'm a principal scientist on the Adobe Experience Manager Cloud Service. This is a content management system. I'm a long time open source contributor to Maven, Jenkins, Puppet, a few other things. I'm also part of the Google Developer Experts Program. But probably most of you know me because of what I did with Jenkins on Kubernetes. Some people will love it. Some people will hate me. We'll talk about that later. Actually, I just, before this talk, 15 minutes before, I realized, oh, this is 10 years ago. Time flies when people didn't know what Kubernetes was. So, what is progressive delivery? This also came, this was August 2018. This is when the term was coined at the LaunchDarly blog. And also was picked up by Red Monk. And I said, this is a great name for these things that everybody knows about. But the name kind of sums up very well what we're trying to do. So, I said, I'm going to steal this. That's the gist of it. So, it includes deployment strategies that avoid this. I'm going to push this to all my nodes, all my containers, all my files, whatever it is that you're running. I'm going to push this new version to all of them. And if it breaks, it breaks for everybody, we want to avoid that. So, you, with progressive delivery, you have new versions that do not replace the system versions. And you have both old and new versions running in parallel for an amount of time. But the interesting part is that this is happening in production. And you can evaluate both old version, new version during a period of time that you figure out what was the best time for you. And before saying that this is a successful thing that I need to roll to everybody in all my customers. So, continuous delivery is hard. I used to say, like, progressive delivery makes it continuous delivery easier to adopt. Because it reduces a lot of the risk associated with continuous delivery. Yeah, it's great that you commit something to main and it gets pushed to everybody. But what if that's breaking in production? Then you have these methods behind progressive delivery that will prevent you from breaking things. And give you these guardrails that will protect your users. The key points, avoiding downtime, limit the blast radius. You deploy something, it only affects a subset of your users, not all of them. And also shorten the time from your idea to production. So, from the time you create a commit until you push it to production, you can use these techniques. So, you can shorten really as much as possible that time. And it's not affecting your life customers, it could affect your maybe internal customers, employees, something like that. So, you can confidently push things to production. The name is great, but all the techniques already existed for a long time. We have rolling updates on Kubernetes. This is the standard way when you change something on your deployments. You just get a new pod with a new version. When that pod comes up, the old pods start going away. And you can configure that easily on Kubernetes. You can configure how many pods you want to come up, if you want them to come up a little by a little. If you want all of them to come up at once, and they will start rolling. So, Kubernetes has been around. Blue-green deployments, same thing. It's been around forever. Well, defined for some definition of ever. And you have green, what you consider the old version, which is green, the new version, which is blue, or the other one, we're on, I don't know. And you have both running at the same time. You evaluate or you start sending traffic to the new version. And if something happens, you just have to flip the switch to put back to the old version. So, this is a variation. The difference is, in a couple of days, you don't need to have all the machines running at the same time, all the containers. With blue-green, you need to have room for both versions running at the same time. Canary deployment is one of the most interesting ones, where you send a percentage of the traffic or a percentage of your users to the new version, and a percentage, a small percentage to the new version, and you keep growing. I mean, you could just do a small percentage, or you could keep growing that canary percentage. A lot of companies do this. First, this gets deployed to internal employees only, then some countries, some like New Zealand, or a percentage of users, depending on some characteristics of them. And they keep growing this canary pool over time until you reach 100%. Feature Flags, another interesting one, where it allows you to push things to production behind Feature Flags, so you can test them in production. And also, disable them after you deploy them. You push something, you realize, either it breaks for a lot of users, or it breaks for a percentage of users, you can switch that feature off using some tools, or using something as simple as environment variables. But yeah, there's tools that allow you to manage Feature Flags, so you don't have to deal with environment variables, things like that. Monitoring is the new testing, so you know the goal is to know when the users are experiencing issues in production, and the other characteristic, I think, is they react to the issues automatically. So if you deploy something that is bad, how you can automatically roll it back before some human has to go and figure out what happened. So did you know that 90% of the outages could be solved? There's a study that said 90% of the outages could be solved with progressive delivery. Did you know that? No? Because I just made that up. And one thing they need is, yeah, some requirement is having a good amount of metrics, or you need to know what's happening in your production system before you can react, knowing what users are seeing the new version, what users are breaking with the new version, what's happening here. So you need to have this visibility. And I always love to plug Devos Barat, which disappeared from the Twitter server, to make your resumant to prepogator or in-law server in automatic way, that's what DevOps is. Raise your hand if you have broken a lot of servers by doing this automatically. So yeah, what I love to say is if you're breaking something automatically, is that you haven't automated enough. I think that's the... When you get there, it's like, okay, maybe I should step back a little bit. Until you get there, you keep automating things. Now, more to the practical side, how can I do this in Kubernetes? Introduction, who's familiar with Ingress? Ingress in Kubernetes, okay, yes. Then, 10 years ago, this was not like this. So on Kubernetes, you have the load balancer, and you can have services, and from the load balancer, Kubernetes will send you to one service or another. So your load balancer would send traffic to one service or another. But this was kind of the old way. The new way is you have Ingress controllers on Kubernetes, where the Ingress controller is running on Kubernetes. I typically domain names, but you could also do headers. For each of these traffic, it's easy. You can do headers, you can do all sorts of things. And that Ingress is sending the traffic to whatever service you're running. So you can have one service A, one service B, and with their pods. And the Ingress is the one that's, okay, you configure this domain to go to the service, you configure this domain to go to this other server. And there's a lot of Ingress controllers out there. If you run on a cloud provider, you're going to have the AWS, the GC, whatever. And then you can have your own NGINX, ambassador, STO, traffic. That's a lot of it. And ARGO rollouts, anybody using ARGO? Wow, okay. What are you doing here? I mean, we already know this. So provides advanced deployment capabilities. All the things that I mentioned, blue, green, canary, category analysis, experimentation, there are variations over the same thing. ARGO rollouts provides that to you and makes it very easy to do it. And the good thing is you don't need to use ARGO CD, for it to use ARGO rollouts. You don't actually need to use anything else to use ARGO rollouts. You can run ARGO rollouts just with Kubernetes, nothing else. You don't need external dependencies. And, yeah, it allows you to do this very easily. I'm a bit on the architecture of ARGO rollouts. So we have the controller that is watching a new object called a rollout. So ARGO rollouts has this object that can replace or complement your existing deployments. And I think I'll go down there in a bit. So this rollout manages the replica sets. So these replica sets, typically, you would have your deployments with the replica sets, and now they become part of the rollout. And it has the concept of analysis run that will check metrics or any other external source that this analysis will decide is this rollout successful or not. And based on that, it's going to cancel the rollout or keep it going. So you get the traffic coming from the ingress into your services, and you can tell ARGO, OK, send me, send traffic to this new canary replica set or send it to the old one. The percentage base one, for that, you need a service match. So if you need to do something fancy like, oh, I want to send 1% of the traffic or I want some traffic that matches this header, then you need something like a service match or the integration with the ARGO rollouts integration with the ingress controller. But if you use bare Kubernetes without integration between rollouts, you can still do it. Basically, it will use the number of pods in the replica sets. So if you have 10 pods, you can tell ARGO, OK, one new pod is going to go to the new version, and now you have a 10%, 90% sort of split, more or less. You cannot do things fancy that require support from the ingress controller or a service match, but you can still do things. The rollout object, you have two ways of defining the rollout. One is you replace the deployment with the rollout and add extra fields, or you create a rollout that points to a deployment. I don't like a lot the way of replacing the deployment because then people that are not aware of the rollouts objects, they may go and see, oh, there's no deployments. What's going on here? So it requires you changing things. And for us, it requires also you have to change rambles, you have to change commands that people need to secure documentation and all that. So I don't know why the decision was made that way, but it's not something that I'm too happy about it. Of course, you require all the Jamel tools to write these things. And let's go to the demo now. So I have here... So I'm running the Argo Rollouts demo. This is hitting the backend and it's returning one color or another, depending on what is running on the backend. So right now I have the blue one. Let me see how can I do this easier. So what I'm going to do is to change, update my deployment to use a new image that is going to be green. And ring. I lost the terminal. Okay, so it updated the image and let me fit here in big To show what this is doing. Okay, I think I pushed twice and now I see two Rollouts happening at the same time. Otherwise it's not working. Here it is. Okay, so I have the green. The one that shows green is the one that was running and it's stable. So I think I have five pots running and I push a new change, which is the canary. And this should be using the green. Okay, there it is. So like 20% of the traffic is getting green. Right? And how I define this rollout, this one is at the bottom is just the standard deployment configuration. So what image do I want? What ports do I want to expose and so on. But at the top I have the strategy configuration from Marco rollouts. So I can say point to this analysis template. This is what defines what is successful, what is not successful. And I'll show you that in a bit. And it's, I have several steps. So set weight 20 and then do a pause, set weight 40, pause for 10 seconds, set weight 60. So this is percentage. Pause for 10 seconds, set weight 80. So this is my definition of a rollout. 20% wait for me to manually do something. I only do that for demos in real life. That's a bit harder to do as you could still do it. But this is my definition of what the rollout is. So right now it's waiting because I set a pause and it's waiting for me to give you a key. I look at it, it says it looks okay. So I can do the promote. And this is going to continue through the rest of the steps. So hopefully we'll see this in like 60 seconds. It should continue the progression until everybody receives a green. A green color when they call the API. So this shows you just by creating a rollout object with this small section defining what your rollout is, you can do this. There's nothing else you need. Well, you need to install Argo. And what else can you do? Oh, yes, you can also have a preview version. So you can have another ingress pointing to your preview version. So you can even if I said I want zero traffic to go to the new version. All the existing traffic I wanted to go to the old version, but I want to see the new version in a new place. I can do that too. So that's very useful for preview environments sort of thing. So if I go back, okay. So while this continues running, this is running on Google Kubernetes and sending autopilot clusters, but you can run it in any Kubernetes. And the autopilot is pretty cool because you only pay for what you use. So if you scale things to zero, then you don't pay anything. What does it says here? Okay. So now green is the stable one. It says here, stable here. What if I want to do... I was talking about how does this protect me, right? What if I want to do a rollout that is broken? So... Let's see. This works. Right. Okay. So now I push an image that is bad. So I'm changing the deployment. Of course, you would do this with the GitOps. You would never push the production, but YOLO. And so I'm pushing the red image, but this red image is returning in 500 errors. And now Argo realized, oh, this is giving errors based on my analysis template that I'll show you. And this is in the graded status. And it went down and the scale it down, and my canary was set as failed. And you see that only a few percentage of traffic got the red dots, and then it was automatically rolled back. So I think this is the power of doing progressive delivery. Of course, this is very easy if your application is exploding. It's very easy to see. It's like, what if people ask me, oh, can we do this if a button doesn't work? Can we do this? Well, it depends what button. If it's the button that adds, imagine you're in Amazon, you break the button that adds things to the cart, and you get a metric that says nobody's adding things to the cart, maybe you're like, oh, something is really bad. Right. So let me show you the analysis template. Is this one? Yeah. In my case, my analysis template is a very complicated call that fails if this fails, if this doesn't return a 200. But again, you can integrate this with whatever you want, metrics. Argo rollouts also gives you a nice dashboard. If you are not into the command line, you can come here. And here. So where I can see the status of my rollout, what is strategy. As I said, Argo rollout supports multiple strategies on some of the more complex drivers. I can see my steps that I showed you before in the Jamel, 20, 40, 50, 80. And I can say what was the last image that I pushed, and I could click here and do the clickity clock instead of doing Jamel. Okay. So, yeah, what I mentioned before was if you're using service mesh, like Istio, then it integrates with a bunch of service mesh ingress providers. So you could go and say, I want 1% of traffic because Istio supports doing those things instead of saying more, because when you are using only pods, you don't have anyone here. Pod is going to receive the traffic or not. So it's more of an approximation. But with Istio and other advanced things, you can do more complex. We hook it up with Prometheus, also the support for multiple things to get metrics from. And, yeah, hopefully you'll learn how to do a progressive delivery canary deployment very easily. Just you need to do some Jamel here and there. Let me see. On here, this one. So you can have the other labels to the existing version, to the stable version as labels to the new version. So you can do other things with services on Kubernetes. You can pass what analysis you want to run and you set what steps to run. And everything else is just the template. And if in the deployment template. If you don't want to put the deployment template in the rollout object, you just point here, there's another option that says points to existing deployment. The only problem with that is that rollouts is not when you're migrating, rollouts is not going to scale down the deployment. A colleague of mine, she submitted a PR to Argus, which is going to be in the next version. So it will automatically, if you have like thousands of deployments, when you spin out a rollout with a deployment, I pointed to a deployment, when the rollout is successful, it's going to scale down the deployment. So that's how it will actually exist. Okay, so, yeah, and what's that thing? I lost my... Did I close it? Yeah. Okay, so, just a quick summary, you saw everything? And I hope that this helped you and you can try it and do it at home if you like it. And I have time for asking me two questions. Two questions. No questions. One question. I was wondering if you've been testing using the gateway API and some fingers in the waiting? So the question is if I tested using the gateway API instead of fingers, no, I have not been using the API yet, but I'm guessing that if there's no support already, there will be. Because... We did not. Yeah. Hello. So my question is that for, in case of buggy rollout, the particular traffic which is forward to the buggy instances, is it possible to automatically replicate it and send it to the stable versions after the fail? To ensure that even the traffic which hits the buggy rollout instances is served later by stable versions? So if it's... It's possible to run it back automatically, but also... Yeah, the individual traffic, individual request. So you don't want any user to see the spot? Yes. The other thing you could do, if you use a service mess, probably is send a clone the traffic and send a clone to the new version, but the actual traffic is going to the old version. And you could see if the new version is breaking or not. But also that's tricky because you need to make sure that it's not changing your state. If you are getting gets, it's fine if you are changing status. That's my point. Don't do the duplication in advance because it will go to the parallel execution, but do it only when the first execution failed because it's go to the canary instance. Yeah, I think you can do that. Send traffic to the new version, but it's a copy of the traffic that is not seen by any user. And then at some point you could say, okay, this is good. I'm promoting this. I think it's doable. Yeah, thank you. Okay. Thank you. Thank you.
Own your CI with Nix
Okay, all right. Hello, everyone. Meet Bob. Bob's a software engineer and Bob just had the idea of a century for a new startup. That's Gray Cat. Gray Cat is a service that given the picture of a cat will reach in the same picture, but grayscale. Bob's really excited about that and just got some funding to start working on that. So Bob gets started. Chooses to write that in Rust because that's cool and trendy and use GitHub because that's the standard. And so just writes the initial Rust boilerplate, the initial boilerplate for having that built by GitHub Actions, then Git commit, Git push. The first TI runs green. That's wonderful. Champagne. Now Bob decides to do something useful with that code and so he pulls in image2, which is the Rust library, for doing image manipulations, uses that in the code, builds it. The build is just fine locally. Wonderful. Now Git commit, Git push. The TI runs. It's not green. It's complaining about some missing data files somewhere. Okay, so Bob, no big deal about the software engineers. He knows how to use Google. So he searches, finds out that actually this image2 library is mostly a wrapper around C++ library, which is OpenImage.io. And it turns out that Bob had that installed on his laptop, but the CI runners don't, which is why it's failing on the CI. But no big deal. Bob just tweaks a bit the CI config to install OpenImage.io before running the build. And now the CI is green. Wonderful. So fast forward one year later, Bob Crop has grown quite a bit and also did the tech stack of Graycat. It's getting a bit complex, but no big deal. I mean, it's just a matter of having the right CI config file to make sure that everything's gets installed. So the config file has grown a bit out of hand with 5,000 lines of code, but it's not real problem. People just treat it as happened only whenever they need a new piece of software. They just add that to the config file, add a few lines to install it. Generally forget to remove it if they don't need it anymore. But I mean, no big deal. It works, right? And I mean, it's not like anyone would want to maintain that because just the feedback loop of having to do a change, push, push, wait for the CI, get the results. Yeah, now. Okay, better keep it like that. It would need to move fast anyway. Because if some troubles, if we know them, like for instance, when GitHub decides to update the base image of the runners because given the complexity of the setup, obviously something's breaking here and there. No big deal. It takes a couple of days sometimes to fix, sometimes a bit more blocking, blocking things a bit. That's annoying, but I mean, have to move fast. No, one day big deal happens. Microsoft, it's now 2025. Microsoft is in a slightly difficult financial situation and decides that this GitHub action thing is actually, yeah, it's wasting money on that. So just decides to shut it down. Well, no, not such a big deal. I mean, it's not like Bobcops married to GitHub actions. They just have to migrate this little config file and use another CI provider. That's what they do. And three months after, they actually managed to migrate that to a new CI config by a new provider. But now by the time competition has caught up, Graycat is definitely beyond. Bobcops is leaking, bleeding money everywhere. And this is the end of the Graycat dream. So very sad story for Bob. But could you have avoided that? So there's a bunch of things that went wrong. You might have noticed. But most of these are just like natural consequences of we want to do things the quick, quickly as possible. And so we just, like, we can't take care of everything. But there's one practical choice that Bob did at the very beginning, which was what caused the ultimate failure. And that was just being stuck on one single service provider and being at its mercy. And Bob could have avoided that, hopefully. The first, like, the elephant in the room, if I may say, is just the blue whale docker, which would have given Bob a way to just have a, like, agnostic way of defining this CI environment that doesn't depend on GitHub. And Bob could just have written a docker file instead of the CI config file. So now some things that wouldn't have solved is that the feedback loop for a big docker file is not that much better than that of the official CI. Docker's layering is great for caching when you don't need to touch the latest lines of the docker file if you touch up things at the top. You're pretty much screwed up. The other thing it would also not have solved is that unless you're very, very careful, it's easy to just have lines here and there that might break at any time because upstream decides to change something. But, I mean, this is okay. The big thing, the big problem that Bob would still have is that the libopementimage.io issue we had at the beginning. I mean, Bob has his laptop, he's working on that. On his laptop, he has the code for the project. He has tool chains needed to build the project. Then the CI of the container has the same code for the project, checking for the same, from the same commit, also has tool chains to build it, but not exactly the same ones. They are provided by a different means, so obviously there's going to be some differences that might break things down on the line. If you're lucky, it breaks your build. If you're unlucky, your build still succeeds, but then the underlying something behaves slightly differently and you have absolutely no clue why. So what would have been nice would have been to just have Bob's laptop and whatever is running the code in the CI, use exactly the same tool chains. And there's an obvious solution to that, just ask Bob to do all his development on the Docker container. That is great. The thing that's not great here is that Bob's laptop doesn't only have tool chains encoded, also has his text editor, his config, his whole development environment fine-tuned for years to just make Bob as efficient and productive as can be. And if Bob has to develop in the container, he mostly can't access that easily. And now we get a very sad and angry Bob and a very efficient Bob. So now there's one bit of the infrastructure that I barely mentioned in passing, but not made much attention to. That bit is cargo, the rest package manager. And the reason why I haven't really talked about it much is not that it's not important. I mean, it's probably the most crucial part of the infrastructure because it's the thing that pulls in the bulk of the dependencies of Gray Cat. But the reason I didn't talk about it that much is that it just walked. I mean, I was talking about broken things because it's always funnier to talk about broken things. And cargo was not broken, not nothing. And the reason cargo just walked, I think it's like there's two reasons. The first one is that cargo has been very transparent of the role. In the CI, Renskago to provide the dependencies for the build, and that's fast enough for the CI that just walks fine. Bob on his machine runs cargo for that, and that's works that doesn't prevent him from using all the rest of his tooling. And so Bob is happy using that. And beyond that, cargo is also declarative. Like there's one file, two files, that exactly define the set of rest dependencies that your code has. And when Bob runs cargo on his laptop, cargo just reads that file and provides the exact environment needed. When the CI clones the project, Renskago, it reads that same file and provides exactly the same environment. And that's why it walks. Now, there's one thing that cargo doesn't do properly. And that thing is everything except rest packages. And yeah, that's a problem because like the, yeah, that's why we have this open image, your problem. And so it means that actually the declarative aspect of cargo is limited because it's declarative to some point. You really have to read the terms and conditions for that. And so at that point, like it would be great if only we could have something a bit like cargo, but more generic, you know, that would be so awesome if only such a tool could exist. Okay, so meet my friend, Nick. You can think of it, Nick, if you don't know about it in that context as something exactly like cargo or NPM or whatever you want, except that it's fully generic. So you can use it to package and provide your rest crates if you want, but you can also use it to provide the C libraries that your rest crates depend on. And the C compiler used to compile these C libraries and the mini server you're using to run the test or the PostgreSQL database you're using for your deployment server. And so really now like declarative is not just a vain word. It is fully declarative up to the lowest level you might want to think of. And so what could happen for Bob if he were using Nick's is that he has this laptop with everything set up and then he can just use Nick's to provide these tool chains. And because Nick's is transparent, it will it won't prevent Bob from still using his editor with his all his tools just on top have the required tool chains to build the code. And that makes a very, very happy Bob. And then the CI system can just use the same Nick's with the same Nick's config files to get the tool chain. And then the CI builds exactly the same thing as Bob on his laptop and the world is wonderful. So now assuming that, now I mean I'm not assuming Bob is convinced that Nick's is the great thing and probably you are too, right? And what would that look like in practice then for Bob to use Nick's? So Bob would essentially drop a shell.nix file at the root of his environment saying, hey, I want a shell. So calling the make shell function to get a deployment shell saying, I want in my shell this set of packages, cargo, rest, open image, whatever you want. And the little bit of magic here is this PKGS thing from which everything comes, which you import which points to Nick's package collection, big repository with recipes for all the packages that exist in Nick's, which you can. So I've hidden that, but you can import that in this Nick's package.nix file and pointing to a very specific commit of the Nick's package's repository which will pin down every single version of all your transitive dependencies. And now if Bob wants to use that, he can just run the Nick's shell command, be dropped into a new shell in which, for instance, cargo will be available at a path that is managed by Nick's. And once Bob exits the shell, no cargo anymore, that's what we wanted. But then, I mean, Docker also does that. The bit that Nick has in terms of transparency, the extra coolness, is that the shell doesn't tell anything about Vine, Bob's editor. But still, I mean, Bob can still access his Vine, install Globody on his laptop, even inside the shell, which is what you want for development because it's much, much, much nicer. And when Bob wants some extra guarantees that really his shell is complete, he's not just accidentally leaking things from his computer. You have an extra pure mod that you can use for building things with more guarantees. And so that's Bob's machine, but this talk about CI. So let's look at the CI side of things. On the CI, so Bob would still be using GitHub CI because that's the standard at that time. But then beyond the initial mandatory boilerplate to just fetch the repository and all that, there's only two things that Bob would need to have in his CI config file. The first one is install Nick's because it's not yet part of the default GitHub image. And the second one is render build within a Nick shell, a pure one because you really want to be strict at that point. And then if Bob needs to migrate, all he needs to do is on his new CI system install Nick's again and copy that exact Nick's shell command. And now Bob is fine. And we can all send great cat pictures over the internet. So this was like scratching the edge of the tip of the iceberg. We could go a bit much further, although I only had 15 minutes, so I won't cover all that. But the first thing we could do to go a bit further is to improve the pinning situation. I mentioned like I hand wave this, oh, you have this file that pins things down to a very specific version. There's ways to make that much nicer and using a proper log file like all modern package managers would do so that you have full control over when you want to upgrade. But it's also trivial to upgrade. More interestingly, you can also use Nick's a bit further and not only use it to provide you some development or some CI environment, but you can build your thing fully with Nick's, which gives you, I mean, first time extra guarantees that, oh, this is really the right thing I've built in the right environment, in the right set of dependencies. But more interestingly, now you can integrate that and use Nick's a bit further, for instance, to build OCI images on top of that with only a few extra lines of Nick's code, or use that to build AMIs for whichever cloud provider you want to use. At that point, you probably want to care about caching. Nick's is pretty great at caching things for that today. If you have a know-how build, it's really going to be a know-off and take you a few seconds rather than the whatever time it takes to build the project. And you can also get that cache to be distributed, meaning that your developers, if that's built on the CI, then your developers can just reuse the pre-built results, which makes their life quite nicer. And the last thing that could be done to go further is to use Nick's OS, which is a Nick's-based Linux distribution, which follows that same philosophy of being purely declarative, which means that you just have one config file that describes the whole system, and you just rebuild the system based on that, which is useful for deployment, because that's infrastructure at the core, but really down to the deepest level, not just scratched on top of something that existed before that and was never meant to be that way. And you can also use that for testing things further. There's in particular a really nice testing framework that allows you to always declaratively declare a whole network of virtual machines that you can spawn, rinse, comment on, and then just read the results. And that's really useful, whereas as soon as you start to want to test some weird multi-tenant applications. So all that I've been talking about, Bob, but maybe a few words about me, but you know who's talking to you. So I'm Theo Fan. I'm the leader of the Nick's Tech Group at TWIG, which is the open source program office of Modisk Create, and it's pretty big on Nick's, as you might have guessed. And I'm also a maintainer of Nick's and a member of the Foundation Board, the Nick's Foundation Board. And you can reach me in all these places, and more concretely, you can also meet me right here in the AW building where we have the Nick's West End. And that's all for me.
Testing Go command line programs with `go-internal/testscript`
Good afternoon everybody. Who is a Go developer? Very well. Very nice to meet you. My name is Giuseppe Magia. I work with VMware. I am not the creator of this thing that I'm presenting today. My company is not involved in this. I'm just a user and since it makes my life easier, I decide to share with you what I do with it. So the thing that I want to do theoretically is the things that you see on the screen. But practically what I want to do is to make you curious about Test Script. So you will try and eventually see how good it is and what you can do with that. The important thing is that you will learn a few basic things. I mean, I could talk about Test Script for three hours and probably wouldn't exhaust the argument. But since we have only 20 minutes, this is what we are going to do. We are going to show the basics of Test Script so you will know it. We started with why. Why do we need to have this kind of tool? It's because we have a problem. When we do a command line program, we need to test it and to test it, we need to build it. Then we need to do something with this build to shake it up and check that it's doing what it's supposed to do. You can do a lot of things instead of testing directly the command line program in the shell. For example, you could test the single functions that are inside the program and you should do that. But this is not the same as testing the program. To test the program, you need to make sure that the function that works well in your tests also is linked to the command line command or option that you hope is linked correctly, but not always. Also, the input that you put in the function that works beautifully and since it has space in between, it doesn't work on the command line. You need really to test the real thing. The problem that you have is that to achieve this goal, you need first to compile the program and second to find a way of testing that program in such a way that it works well with your go code and it is checked in the right way. By checking the right way, I mean that you are sure that what you hope to achieve is what exactly happens. Doing this kind of thing in a shell script is not always easy. Let's talk about this test script and what is it? It's a Go library. It's also a standalone tool and the best thing is that this is developed directly by the Go team. They use it to test the Go tool itself and all the tools that come with the Go program. It has been recently, a few years ago, released in the Go internal package, so you can use it separately from the Go code. Especially, you can use it for mostly anything. If you are developing your command line code, a command line program in Go, it's much better, but you could also use it to test mostly any command line program, even if it's not written in Go. Of course, if it's written in Go, it helps. Let's see a first example. To test something with test script, you need two components. The first one is a script. It's a script that says something about what you want to test and what you want to get. In the script that you see on the upper part of the screen, there is an echo, a hello world and a keyword, exec, before. Exec is an internal command of test script that will run something. Then there is a standout with confirmation. The confirmation is that you should receive something that says hello world and a new line. Then there is an exclamation point that says standard error and a dot. This line means I don't want a standard error. I don't want anything in a standard error. More about this later. Then you need a component in your Go code. The component will just call the standard, the test script dot run, which contains at least one piece of information, meaning in which directory you find the scripts. I say scripts plural because in that directory, you can have one or a thousand scripts that do different things to your program. Let's modify the first script a little bit. Instead of expecting hello world with standout, we expect h. Then you have this strange thing that is a regular expression. If you know regular expression, what we are saying here is I just want two words, one that starts with h and one that starts with w. Like before, I want the standard error to be nil. This standard error with a dot suggests that what we are expecting here is not a dumb piece of text but a regular expression. You can use a dumb piece of text if it suits you, but you can have much more powerful type of information. For example, you can use several statements to describe better what you expect from the output of the program. In this case, instead of putting everything in one line, I put that in two lines. This is often useful if you want to make your test more readable to express exactly what you are expecting. More important, the test script environment includes one thing that is called text par. Text par is a very simple way of encoding files. To encode files, you just put the name of the file between double dashes and then you put the content of the file there. The file will be magically created in the environment where the test is executed. The thing that happens there is that the script will use a different temporary directory to each script. Every script can run in parallel and it will be more or less isolated from the rest. This data.txt will exist in the temporary directory created for this script only. You have some built-in commands that you can use directly to do your tests. Exactly what we have seen already. Then standard out and standard error, it will check what happens after you have run your command. A standard input will create the input for the next command. Then there is the command exists, the checks that the file exists and stop and skip will interrupt the test. If you put the exclamation point before the command, it will negate the command. Meaning that you expect that command to fail. Other commands are compare and compare with environment and then you have n that will set variables and this can be useful. Then you have something that are also available in shell scripts like cat, cd, cp, check, change mode and make directory move and remove that works like in a shell. Then you have conditions. Conditions is like a command but is within square brackets and you are telling the program that you expect something to happen. For example, exec file name, you are saying I want to make sure that this program is in the path. Unix says this will only be true if you are running in a Unix system. And after that condition, you put a command that will run if the condition is true. And you can check other things like if you have a network, if you are running a specific version of Go and so on. There is a specific environment, some specific environment variables that you have work is where you are running the test in practical. The home doesn't exist but you can set it if you want. And then there is a temp directory that it created for each script but you can change it if you need something different. So if you run the test with verbosity, you will get a lot of information that tells you what is the environment where you are running and what is everything that is executed. If you don't put the verbose, the test is silent, it will just succeed silently. You will see some output only in case the test fails. Let's see some more examples with command, second condition. So the first line says if it's not Unix, skip it. The second line says if it's Linux, say exec, good choice. Exec, sec means if the exec doesn't exist, then say the command echo, the command sec was found and so on. Remember I mentioned something about compiling the executable and one thing that the test script can do for you is having a transparent executable. How does it work? Let's say I have this word count command that I have created in go and I want to test it. So I run exec word count and this command may fail or may succeed whether word count exists or not. So if word count is in the path, it will succeed. If it's not, it will fail. But we want to make sure that it always succeeds so we need to take to tell test script this word count not only I want to exist but I want to be the one that I have created for which I have the code and to be fresh, not stale. How do we do that? In the test we use the test main in case you don't remember the test main is something that you put in your test code and runs before any test function that you may have in that directory, in that package. So the test main contains a call to test script dot run main which has a map of functions that you can associate with a name. In this case we have a name that is word count and we have command run main that returns one integer. In the main code you will have the main that doesn't run directly the code but will call run main and it will exit with the integer that the run main returns. So what happens here is that your word count that is in the script is in reality a call to this function and the funny thing is that there is no separate executable. If you remember go is a compiled language so whenever you run a test nothing is running like in Python, it's not interpreted. There is a compilation and the compilation happens in a hidden place and you will have a piece of code that has been compiled that is a binary and that binary will be available for your test and the good thing is that there is no additional binary is the same binary that is used for the test. I'm going to show you an example later. So let's see something more. I said before that we have built in commands but we may want to have something more profound like custom commands. So for example I want a command that will sleep for let's say three seconds and this command is not available because it's not one of the built in commands but I can build it. So I also can have a command that says I want to check all the files in a directory and I want all these files to exist. So I want to check files that has first argument the name of the directory and the rest of the arguments are the file names. How do I do this? When I run a script run I can have a call an indication of a map of functions that will produce these custom commands. So the custom commands are a map of functions and each function has a test script object a negation in case I put a bang there and a list of string arguments. If the command succeed I do nothing if the command doesn't succeed I call test script dot fatal and I fail. So for example this is how implement the custom commands in my word count. I call check files and I call sleep for and if we look at the implementation of each one you see that the sleep for is a function that accepts a test script a negation and a list of arguments and it takes that argument to say to determine how many seconds I want to sleep for and then it calls time sleep. If the first argument was not a number it will fail and the command will not succeed. A similar thing I do for check files so the first argument will be the directory and then I will check that the files exist for each one of the of the arguments. A similar thing I can do for custom conditions so in addition to the conditions that we have I can implement conditions that suit my environment better for example I may want a condition that says the version of this particular program must be at least 0.2 and how do I do that I cannot do this with the the building syntax of test script so I can implement a custom condition. To do a custom condition is is similar to what we do for the custom commands. I have a function that function receives one string and will parse that string to determine what we do with that condition. In addition to that test script allows us to pass arguments that are variable depending on the environment. So for example I want to pass the current version to the test or I want to pass the home directory to the version and I can do that with a set environment function. So back to the custom conditions we receive a string and return a boolean and an error. What do we do inside that function we parse the string and the string could be a simple condition or it could be a condition with some arguments that are that we need to parse inside and see if the the condition is true. In this case we have a version is at least and you see I have created a condition a function that will check that with the elements that are parsed just in the first line of the of the function. So I assume that the function the the arguments are separated by a column and I use them. For example this version is at last we'll check that we have at least two arguments the first argument will be the version and the second argument will be the compared version. In the same way I can have this condition exist within seconds that checks that a file exists at least after at maximum a number of seconds that I wait. This is useful for example when I test in a database system that is supposed to create something but it doesn't create it instantly. So I say I want to to see this log file at maximum 10 seconds after the database starts and if not I get an error. So I'm going to show you a quick demo of something that happens when we run test script. So if I run go help test flags sorry I get a bunch of options and these are used always by the test. Now if I run word count minus h this is the real executable that I built with go and I you see I have these options that are just the options of the of the program but if I run something a little bit different so let's see a test that I have inside here where I running word count minus h like I did on the command line and you see I have here the options that are made by the executable but in addition to that I have also the option that belong to the test. This to show you that what we are running here is an executable but is the executable that go builds for the test itself and the side effect is that it contains the the command line options that belong to the test itself. Now back to the to the presentation what we have learned today is that using test script you can simplify the testing of any command line program and programs that manipulate tests are extremely suitable for this kind of testing because test script was created for the go tools which manipulate tests. You don't need to have a separate executable because the test script environment will create one for you and you can build the commands and conditions if the built-in ones are not suitable. If you want to see the slides and if you want to see a full example of how to use test script to test a common command that I was created with go you can go to github.com data charmer word count and there is the code for this word count and all the example that I have shown here and a lot more that are testing the word count in most the conditions that you may have. So you can see how to test this kind of program in reality. Well in reality I can have a lot more than that but it will be too too long to show. So this is the beginning of a project that I have to to illustrate all the characteristics of test script using code and the first step is to show a simple command line program and all the tests that are needed. Here you will see also more that you will see in the in that on github all the resources that you can use to learn more and if you want to learn more right now you can ask me outside and I may show you some more examples. Do we have time for questions? Three minutes. Any questions? Yes. When you still want to do unit tests like is it for CLI to know it for like not click input or would you always create custom commands for that? You can use so the question was whether I can use test script for unit tests. So you can use test script for mostly anything. I use it for unit test and I use it for integration test. For integration test I just put some logic before the test to create the environment and so it will run a little bit slower but mostly it will I mean doing unit tests is the easiest thing in the world. So if you if you look at the at the go code most of the unit tests for the go tool itself are run with test script but also the integration test can be run in that. More questions? Okay thanks a lot.
How mutation testing got practical
Thank youender million that is going Technical question. We can edit the video later. Let's start very lighthearted then. Maybe a shell of hands. Who here has heard of mutation testing? Amazing. I can go very quickly to some flights then. Who has never heard of mutation testing? For who is the completely new concept? I will cover it for you guys. That's nice. Of course I'm here, but I'd like to promote Striker a little bit. Who of you is actually using Striker already? Nobody. One person. Well, that's my colleague. He's also working on it. Cool. You guys are definitely going to learn some stuff today then. I can just start right? Sure. Welcome everybody to the talk, How Mutation Testing Got Practical? I'm really focusing on the Got Practical in this talk. I will be explaining mutation testing a bit, but I'm really looking deep into the internals and why is the idea really old? It's a practical use case, relatively new. So I'll get into that. But first, a little introduction. My name is Jan Dele Kester. I'm a self-engineering consultant at InfoSupport. It's a consultancy organization in the Netherlands. I'm also a trainer there, and I'm a research supervisor. And that last one is very relevant today. You'll hear why soon. If you want to contact me afterwards, you can use the link then, I guess, but you can also find me on GitHub. And as I said, I'm here on behalf of Striker. Striker is a mutation testing framework for JavaScript and TypeScripts, C-Sharp, Scala, and hopefully at some point, Kotlin. There's like a partial implementation there already. We're working hard on it. You can find us at StrikerMutator.io, and of course, I have all these nice socks for you guys. So if you're really good with asking questions and reacting to my questions, you might get some. And otherwise, you'll see us after. We'll have some more. So in this talk, in this next 25 minutes, because I am going to try to leave room for questions as well, I first want to talk about why do we actually need to understand our tests? Why is just writing a test not good enough? I also want to go into what mutation testing is for the people that don't know yet. And finally, and that's the major part, hopefully if I don't run out of time, I'm going to go deep in how we got to practical applicability, how we got into this state now where we can actually run mutation testing in our real projects. And that means talking about some state of art, state of art performance improvements. But first, we have to talk about the false sense of security. Because this is a promotional image I copied shamelessly from the Sonacube website. And they show this nice dashboard where they say, well, everything is good. There's no issues, no bugs, it's all fine. And there's even 76% test coverage. Who would be happy with that? Okay, who wants to say why? Why is 76% good according to you? Lots of green. Lots of green. Larger small socks. I don't wear small socks. You get some anyway. Sorry about that. They're hard to throw in this room. I would say I would not be happy with that. Because, I mean, our tests, apparently when we're running our tests, we reach 76% of our code. I don't think that's enough. Because there is more than 20% of our code that is not even getting executed for doing the test. That's a problem. But even 100% code coverage is actually, it doesn't say much. Because coverage only means that code is being executed. We are only testing, in a worst-case scenario, that the program does not crash. What you actually want to know is if our tests do something, and I can very easily get 100% code coverage on the program without writing any assertions, that I'm just checking that the test execution does not crash. So we need a way of testing our tests. And no, we're not going to write unit tests to test our test logic, because that would be stupid. I mean, it would never end. We need to be smart about that. And that's where mutation testing comes in. So what we're going to do is to introduce changes in production code automatically. That's a tool that's doing that. We're going to test again to see whether the tests start failing. Because when the tests start failing, at that point, you know that your tests are actually able to catch that bug that we purposefully introduced. So it's also from a white box testing, because we really have to know stuff about the internals of the code to change it, and see whether the tests are good enough to catch that. And this is really not a new idea, because there's this nice paper from 79 already, where they talk about a new way of a new type of software test. And you can actually find this on Google, and you can read it if you want. But even back then, 40 years ago, 45 years ago, they were already talking about it. But only recently, and I talk recently in a very broad sense here, because I wasn't even in high school, I think, when I'm talking about reasons. Then it got more traction, because the problem is, in the 70s, or the late 70s, it was just a good idea. We did not have the resources to actually apply it in practice. And what you see here, the dark-colored bars are publications, research publications about practical applicability, and they really spike early this millennium. And there are reasons for that. Mostly also, I think, because our computers got fast enough. And why that is important is because of how mutation testing works, what the process behind it is. Because we start with our source code, and we are feeling very happy about it, of course, because we made all this nice code, we even wrote tests for it, so we're very confident. And what the tool is then doing is going to introduce mutants in your code. And mutants are just changes. And for every change that is made, the tests are executed again. And we can have two results. Either the tests start failing, which in this case is good, because we then detected that mutant, we found the bug, so we say that mutant is killed, or the mutant survived. And that means that your test is not complete. And when you do that for everything, in the end, we get a nice report out of it, like a covers report, except a bit more detail. And how that process actually works is that there are operators for that. And an operator is basically a transformation. Given a certain thing in your code, what kind of changes can we do that might fail your tests? And some examples are here. There are way more, and also one operator on the left could result into, or one source code, original source code on the left, could result into multiple mutants. But you could do, like, just switch the operators, or throw away a whole block of code. And when we do that for every of these mutants, we measure something, right? So I already talked about killed and survived. But that's only in an ideal scenario, because in practice, there might be code that does not even reach by test. So you can say, well, we have no coverage, or we have a timeout. And timeout basically means that there are mutants caused by an infinite loop. And we should consider that, okay, well, infinite loop tests actually failed then, so that's maybe kind of killed. But you can also get runtime errors, or compile errors, because we're just introducing weird code changes without looking at what the code is actually supposed to be doing. And finally, mutants can also be ignored, because that's what a developer said, I don't want a test for this. So I don't want to see the report anymore. So it's just like suppressing warnings for your code code dejects. Who here does that very often? I do, actually, but... And then just like with the code coverage score, we want to have a nice metric. We want to know how good we are doing. And for that, we can compute the score. And we call that a striker-deorientation score. And basically what we say here is we want to express in a nice number on a scale of 0 to 100%. How many mutants did you actually manage to kill? So how many unexpected changes in your code are your tests actually catching? And that's this nice formula, but basically what you do here is everything above the line is what you consider that you killed. And everything below the line, so we divide that by everything that was actually a valid change. So we exclude the crashes, for example. And that gives you, for your whole program, for your whole code base, an indication of how good your tests actually are. But what if you don't have that many tests yet? Well, they can also compute a variant of the mutation score. We just look at the code that's actually being tested. So you see here that we do not include the mutants without coverage anymore. So one might think, just like with code coverage, we should maybe strive for a high number. We should maybe have 100% mutation coverage, 100% mutation score. That would be nice. But there we actually ran into a problem because we cannot actually kill all the mutants. It's very easy, well, relatively easy at least, to reach all your codes, to make sure all your code is getting executed. So basically to get 100% code coverage, that's relatively simple. But because you're still calling functions from the outside, you might not be able to test every single operation happening inside of these functions. And actually, some mutants are, if you split up your whole code base completely, you can never kill them. And what category of these is equivalent mutants? Which is also a problem. So given this code, we have a nice for loop, and we say, well, we want to iterate 10 times. You can also write like this, and it will still work. So this mutant, we cannot kill because even though we changed our code, semantically it's doing the same thing. So that's where you might want to ignore, basically. And mutation testing is also very challenging. That's actually where that practical application problem comes in, because you can imagine that mutation testing, basically changing your code and then running all the tests again, doesn't take a lot of time. And if you have a very large code base, that might actually not finish in a reasonable time. You also need a lot of configuration. The mutation testing tool needs to know stuff. It needs to know how it can run tests. It needs to know how it can verify that those tests did complete successfully or not. It also needs to know stuff about your programming language, for example, in order to make sure that it rewrites the code in a correct way. So that also needs to be a lot of tuning support to make it work. And for a long time, mutation testing was simply not feasible or not easy to do. But we're bracing the gap. Not specifically at Striker. A lot of stuff has already been done. Luckily for us. But when we're looking at performance, this is basically the worst-case scenario. So the time it takes to run a single mutant, to analyze a single mutant, is basically the time it takes to run on your test cases. So we can approximate it by saying, well, let's just count the number of tests that you have. And then the time it takes to mutate your whole program is just the sum of that. So it basically means that you multiply the number of test cases that you have in your code times the number of mutants that you need to check, because mutants need to be checked in isolation, because they can influence each other. So you can imagine already with a very small program, this number, the time approximation, can get really, really, really large. So we need to be smarter about that. We want to make sure that the total time is a lot less, not just a bit less, a lot less than just a multiplication. And basically there are three approaches to get there. It's either to do it faster. And doing it faster, for example, I think we're going to paralyze it. We're going to use love course. Like take a big machine, nowadays it's relatively simple to get a machine with 128 cores, so we can do 128 things at the same time. You can try to do fewer. So you can maybe try to make smart choices and say, well, certain stuff we maybe don't need to analyze, because we kind of know that it's probably fine. Or you can try to do it smarter. And the study I referenced here, they really did an analysis in there, it's like a litigator review. And most of the studies are actually focusing on fewer or smarter. And common techniques there, some of us here, I won't have time to go into detail on all of these. But you can think of like random mutation, like we're just randomly checking stuff sometimes. We're just randomly picking some mutation that we're going to do. But that's not deterministic. So that might not give you the best knowledge about what the quality of the test actually is. Parallel execution already mentioned. You can also do stuff like data flow and control flow analysis and try to reduce the set that way. Or maybe look at AI to try to pick smarter sets of stuff that you're actually going to check. But if you want to use that mutation score as a benchmark, as a comparison, for example, with the pull request, did you actually improve it or not? Or to give you a good indication of how good your tests currently actually are, you actually need to execute everything. So just the approach of we're just going to run less, that doesn't always work. And one big way how this process can be sped up is by looking at how we actually change the code. So a very new eave approach would, for example, be just changing the source code, running the compiler again, running the test again, and then making another change in the source code, and then running the compiler again, running the test again. And if you have a compiled language with a fairly slow compiler, that's quite problematic because then it gets really, really slow. A bit better might be bytecode mutation. So, for example, the JVM languages have bytecodes, they have an intermediate step, maybe you can mutate that. And they only have to compile the source code once, they just change the bytecode and then run it. And while that is a lot faster, it's also a lot more complicated, but it has one big downside, and that is that every change that is possible in the bytecode is not necessarily something you can do in your source code, especially for Java, for example. If you write Kotlin or Scala, there's a lot of, a very simple thing in Scala, for example, can result in a lot of bytecodes, and if you're trying to mutate that with the assumption that the Java compiler compiled it, it might come up with mutants that you did not kill, but you cannot actually kill them because they don't exist in your source code. So, who thinks, who has an idea, how we can, how we can do this smarter? Any ideas? Yes. Are you a step? Sorry, what did you say? You compile all the different mutants, the same as people, and you select which mutants you get from the other side. Exactly. Large or small socks? Large or small socks? You have socks with a logo on it. Large. This is the, the larger the other ones, right? Sorry about that. Thank you, Niko. For testing good device. I didn't test my own system, no. Silver written software to do this. But, yeah, basically, so the answer that was given was basically this, and we call that mutant schematic, and this just makes sure that you compile all the mutants into your code once, and then use an environment variable to just switch them on or not. So, if you do this, your compiled code is just full of if statements, that is, check a certain number. And that is complicated, but it is manageable, it's not that hard. The main problem is in keeping track of it, but if you assign every mutant a unique number, it should be fine. And this really helps with compiled languages, especially with stuff that's a bit slower, like Scala. And this is actually relatively new. In the world of mutation testing, this is relatively new. It is from 1993, though, so it's the same age as me. I wouldn't say that I am relatively new. But, as something else, Martin, that you can do is coverage analysis. That's also something that has been part of Striker for a long time already. So we actually do an initial run where we just check which tests are reaching what code. So we also know if you change one part of the code, which tests actually need to run and which don't. So you can also get that number of test cases down a lot, depending on where you mutate the code. Some codes you don't really know, so if you have something static, for example, it's not extreme defined somewhere, that, you know, you might not be able to figure out how that is used. You might still need to have to run the whole test suite, but you try not to. Something else you can do is incremental analysis. So just try to div some previously stored state and try to guess which mutants you actually need to check. This is very hard to do, fully foolproof, fully complete, but you can get there like 99%. And that means that if you make a small change, a small pull request, that checking whether your changes are tested properly is relatively simple, is relatively fast. So if, uh, Nico here in front, he gave it all yesterday in the JavaScript dev room, and he actually showed this feature, and I think that there was a difference between a couple of seconds and like 30 seconds and three seconds, something like that, on a small project. Another cool thing is mutation levels, and that's where you actually give the user a choice. Do you care about testing it fully, or do you care about performance? And the choice that the user wants to make can depend on the type of project or the domain. Do you have code where it's really important that every single thing is tested, or isn't it that important actually? Do you actually want to spend the time? Or maybe you want to do a quick and dirty but pretty good analysis for every pull request, which you do in the nightly build where you test it fully. There's different approaches here, and this is actually something that is researched by one of my colleagues at InfoSport, he did his master's thesis on it. So it's really cool. But what could be downside of this approach? Any ideas? Remember, you can get some socks. Yes? So the answer was the feedback loop is longer, basically. It might take time to find if your tests are not that great. That's the one. I have another slide, but yeah, another guest. Sorry? The problem is very useful because you run nightly and... Yeah. What kind of size socks do you have? Large. Sorry? Large. Large. It thinks too far, just come get them later. I will put them aside for you. What's your size? Small. Small. Ah, damn it. I'm not good at throwing. But yeah, so the mutation score that you compute, if you choose to run other mutants, the mutation score might not be comparable. So you really need to take care of that. And the tool that my colleague actually created, analyzes a code base so that you can do this actually for a specific project. It analyzes a code base and it analyzes the test and it's trying to find a nice balance between accuracy and the number of test executions that you need to do. So it tries to see if there are some mutants that we can exclude that will gain a lot of performance. So it speeds it up massively, but doesn't reduce a lot in accuracy. So that's really nice. And you can actually find his thesis online. So if you go to the FOSDAM page for this talk, you can have a link there as well. And I'm very hard to press actually, so there is not even documentation for that and it's not even merged. It's project Xavier and that is actually implementing that idea because it was very theoretical, actually implementing that idea in Striker for JavaScript. So if you're really interested in how that all works and what decisions they made, I honestly don't know yet myself. Go look at the request. And it's also a very cool example. Mike is dead. Oh, it's back again. We're a project group from a university, in this case the University of Twente, contributed to Striker in that way and they actually built this. Yeah, documentation as I said, still to follow. And this is also a very new thing that also a student is currently working on is to do more analysis on static analysis on the code to figure out if we can run in one run of test cases, if we can analyze multiple mutants. But in order to do that, we actually need to make sure that these mutants do not influence each other. So this only works if you know for sure that they don't cancel these order out or if you can still, given that the test fields can say with confidence which mutant it was. So this is really, really complicated. Again, I'm not entirely sure on this progress yet, but question. So this is sort of static data for more complicated. Yeah, so the question is whether modularizing your application would help. And it would help because if your modules are smaller, then the test runs are also smaller. They would take less time. But it only works if a normal boot rest for you, normal change that you want to do, if that's only contained to one module or two modules. But if you split your code up instead of changing all modules, then it doesn't help. So it might help. But that's also with, like, in general, I want to make my CI pipeline quicker, just make more repos or build less modules. So yeah, that would definitely work. Come grab a pair of socks later. And now it really is time to also start testing your tests. So if you're not using mutation testing in your products already, it's really good now. We can actually use it. There's been a lot of progress in 45 years. We have better hardware. We have process improvements. And actually, there's a lot of research going on still to make it actually faster every time. We also have production ready tooling. There are many great libraries out there. Some of them are more mature than others. Some of them are faster than others. Not everybody integrates the same process improvements, for example. But in general, for most popular programming languages, there is a tool available and you can run it in your pipeline. And most of these tools just integrate with the build tool that you expect. They use information that the test runner already gives you. So that's great. Here's an overview of some suggestions. But if you just Google your favorite programming languages plus mutation testing, I'm sure you'll probably the first result will be the right one. So in summary, when we're talking about mutation testing, we're really talking about testing your tests, making sure that your tests actually expect what you do. And something, if there's anything, only one thing that you take away from this talk, don't rely on code coverage, please, because it doesn't say anything. And a lot of research has been gone into performance improvements. There's lots of research still being done. There's still always students coming to us interested in contributing to an open source project with research. So there's plenty of open research questions. And it's also applicable now. So if you're maintaining an open source project, at least consider mutation testing, because especially in open source, where there's many contributors, there's a really good metric on to get an idea about the quality of the tests that somebody wrote for our poll request. If you want to know more of the code implementation details, as I said, my colleague gave a talk yesterday in the JavaScript Dev Room, it's probably online at some point, so you can go check that out. And that was my talk, so thank you for listening. APPLAUSE Exactly 25 minutes as well, so that went great. Any questions? Yes? I could determine which expressions do mutate. Sorry? I can determine which expressions do mutate. OK, the question is how do we determine which expressions do mutate? Basically, there's a lookup. So it does abstract syntax free analysis, and it just checks a certain node, and there's a lookup table to say, OK, if we have this kind of operation, these are the options that we can... These are the mutations that we do. So there's basically a big mapping file with all the options, and that is probably not complete for every standard library out there, but for a lot of the logic, comparisons and stuff like that, you can do it pretty complete. Yes? I couldn't hear the last part. Can you repeat that, please? What is the way to find the way to test your... OK, so the question is what is the baseline when you start with mutation testing? Right, so if you have a new code base, if you have green fields and you implement mutation testing from the start, it's actually relatively easy to get high 90% mutation score. When you have an assisting project, it's usually very hard. So actually Striker for JavaScript has a mutation score of around 80%, which is actually pretty good. It's really hard to get very high scores. So it's not very... Not like covered. If you're anywhere close to 80%, you're actually doing pretty well, I think. Yes? Yeah? Yeah, so the question is when the purpose of mutation testing is to make sure your test are good, if you're doing selective mutation, how do you know you're not missing something? Actually, you don't. You might miss stuff. That's the point is because it can take a really long time that you have at least the option like, OK, I'm OK with 80% accuracy if it's half the time, because that's for me a good balance for some use cases. But yeah, you have to accept that you're missing stuff because you're just not running all the mutants. So the combination of mutation testing and test-driven developments. Well, you have to write your test first then, but you can only mutate once the implementation is there. So once you test that green, you can check whether you actually did a good job before writing your implementation, which is kind of strange, actually. Yeah, but that's very nice, actually. If you have to change your testing of changing requirements and you re-implement part of your code, then mutation testing will check whether your test are still complete. So that's actually very good, very nice. Yeah, actually, if you really want to go into that, property-based testing, you're only going to test all possible inputs, even though for sure that is correct, but that's not feasible yet. Property-based testing is really hard, too. Do you have time for more questions? Four minutes. All the time in the world, up front. The question is, from your experience currently, if, for example, you have normal unit tests and they run, let's say, it's called one minute, how many minutes will they be around using the framework? So the question is, if I know how long my unit tests run for, how do I know how long mutation testing will take? And there's only one answer. It really depends, because it really depends why your tests take a minute, but it's going to be a lot longer. It's like, it's not going to be four minutes or five minutes. It's way more than that. The only way to find it out is to just actually run it, because the problem is it really depends on how many mutations can be generated for your specific code, because that is what makes it slow. And because of all these optimizations, you cannot really predict how long it will take. I didn't hear that one. So the question is, how do we report it? And for Striker, and you can go to the talk with my colleague, he went into that in a bit more detail. We have a standardized adjustment for that. That's good. Oh, it's a bit. So there's a nice dashboard, but go watch this talk and you will know more. Up front, yes. So you already run one mutation at a time? Yeah, you have to. If you want to do more, you have to prove first that these will not influence each other, so that you need to know that if one test fails, because of which mutant it is. With coverage, you can have reached that? No, you really have to do data flow, control flow analysis, and stuff like that. So that's very, very difficult. But yeah, there's somebody working on it right now. So maybe in half a year's time, we'll have some more to talk about. Yeah, Sci-GridNet already has an implementation for it, but it's not scientifically proven, so we do not know 100% where that's correct, but at least 95% there. So if that was the last question, if there's any more question if you want to talk some stuff, I'll be outside in the hall. And if you ask a question, feel free to come grab your socks here up front. There's plenty. Thank you. APPLAUSE
Running systemd integration tests with mkosi
I'm Dan, I work on systemd and I also maintain the maker side tool which is a sister tool of systemd and I work in my day job at the Linux user space team at MATA. So specifically why do we want to do this? Systemd is a pretty low level user space project so running its integration test is not as trivial as it is for a regular project. So specifically we want to make sure that we don't accidentally break the host machine which when you're running something like systemd becomes rather easy. We also want to minimize the requirements that are needed to run the systemd integration test so that regardless of which machine you actually run them on or regardless of which machine you're hacking on you can still run the tests. This is especially important for new contributors because at the moment the barrier for writing a new integration test is pretty high and we want to make that lower. We don't want any host system details to leak into the integration tests so currently that actually happens quite a bit and it means that you often get a failure for example on a CI machine that you can't reproduce locally. And when that happens it's usually a huge pain to figure out what's going wrong and how to fix it. So we want to try and make these tests more reproducible regardless of the machine that they're running on so that we avoid issues like this. We want to be able to paralyze them as much as possible and again the isolation from the host helps here because it allows you to run more instances of tests without having to fear that they are fighting over the same resources that might be leaking in from the host. We want to make them easy to run of course like I said for new contributors and we also want to make them easy to write. So before I go further with the integration test I'll give a little overview of MakeOSI. MakeOSI is basically system deals tool to hack on systemd. So because systemd is such a low level user space project you can't just build it from source and then run it especially not if you're working on the in-it system itself because you're very likely already running a systemd on your laptop and you can't simply replace it with another one. And even if you could if you write a book and it crushes systemd then your laptop is certainly unusable. So we need another solution and specifically we need to run it in a virtual machine so that if something goes wrong and it crashes you can simply kill the virtual machine and it's like nothing ever happened. And this is where we use MakeOSI. So we use it to build a Linux image that contains systemd compiled from source and installed into the image along with all the other packages from your Linux distribution that you would need for development. If you can then boot in QMU and do whatever testing you need shut down the machine and then you can submit your patch. So it does a few things but the primary thing it does is it simply runs your distribution package manager to install whatever packages are needed and then runs various tools to configure the image. Most of them coming from systemd but also a few from Linux itself. It builds an environment where it's necessary, it generates a unified kernel image if you wanted to and then it packages it all up and then boots it in QMU. And so we can generate a few different archives but the most important ones are probably the disk images and just a plane directory. So what does this look like if you want to build an arch Linux distribution image and install systemd and Linux and then enable autologon that's how you do it. And this will build that and then boot into it with QMU. So you eventually end up in a root shell in a virtual machine with systemd installed. You don't need root privileges for any of this which is another thing we want to do with the integration test. Currently you need root privileges so if more files are written they're owned by your root user in your home directory which means that you run into weird issues when you try to delete files and stuff like that. So we want to try and do it all without even root privileges. You can figure out how to go aside it's like a systemd project so we do the usual unit file stuff. You can conditionally include stuff with a match section. They only apply something to the Fedora distribution for example. So we already use this for hacking and we don't use this for the integration test. So we use macOSI for manual testing which is not exactly great but the automated testing still runs outside of macOSI. So this is because the integration test existed before macOSI was there and the way this was implemented was they still wanted of course that you could run in a virtual machine. But instead of assembling the virtual machine from distribution packages the implementation decided to use the files from the host. So similarly to the first generation tools like Dracood which is where the approach came from they pick various files from the host when building the integration test image and then that becomes the image and there in the image you run the test. The problem is that this is completely independent from macOSI so we have two very different environments one for hacking and then another for running the integration test which isn't great. Even if you manage to do a set of two manual testing inside macOSI you then have to somehow translate that to the existing integration test which is very hard sometimes. We have a custom test runner using make so it's all implemented with make and bash and shell scripts. We don't really use any off the shelf tooling here so it can get very nasty. The tests themselves so this is one part that does work well. The tests themselves that run inside the image are implemented as systems. So what do you get this? Start the image and then we pull in the system unit and the system unit executes the test. If the unit succeeds then it has succeeded and then the test failed. Of course all the test specific dependencies have to be added to the image so this ends up being like I think it's like a two or three thousand line bash file now which is responsible for making sure all the dependencies get picked up from the host file system and then put into the image. So it's very complex and I don't think anyone fully understands it. Any customization that you want to do to these test images also requires writing a lot of bash which again is very hard and for new contributors especially to figure out how to do. As you can see to run a test roughly this is what you currently do. So as I said the files gets picked up from the host for the current images but of course we do need to lay the system to build from source. So you build system from source of the host as well and then what the three thousand line bash file does is it basically takes files from those takes files from the build directory combines them and you end up with this franken image. That contains God knows who what. Half system the build from source half from the host and that's where the image runs in and as you can imagine figuring out what's going on in this environment can be rather complicated. What do we want to do instead. So we want to reuse as much as our existing tooling as possible so one make OSI which are already used for the environment and then the other part is system these build system which is a mess on which already has targets test targets which will execute the tests. This is primarily intended or the I guess the primary goal for this was actually unit tests for C or C++ projects where the test macro and in mess on simply execute the unit test. But there's nothing really specific about it that says it can only be used for unit tests since all it does is really just run a command and check whether it returns zero or non zero exit status. So it's perfectly possible to just have running integration test as well. So I wanted to make use of that so that we can simply add a mess on sweet that's specifically for the integration tests and then running them is exactly the same as running the unit test. So you make things more similar and it will generally we hope lower the barrier for running the integration tests for new newcomers to system. We want to make sure that all the tests reuse the same image. So currently the image gets rebuilt quite often for individual tests which makes the whole thing a lot slower. We want to get to a point where we can ideally reuse the same image even the same one that we use for hacking for the integration tests as well. So we can make use of caching and we avoid having to rebuild the image. And the customization instead of writing whole pile of bash you can just reuse all the settings that may go as I provides to customize the image. And we hope that running an integration test would look roughly like this. So a proof of concept PR is already available on the system the GitHub repo where we more or less have it like this so that an integration test can be executed simply by running mess on test. Specifying the individual test if you want to run one or specify the entire suite if you want to run all of them. Mess on supports running tests in parallel so we want to make use of that as well to be able to run multiple integration tests in parallel. Of course since these tests are quite heavy because they spawn a virtual machine we can do as much parallelization as we would with unit tests but we can probably still run more than one. So how do we run an integration test in a virtual machine with system? There are a few interesting things about running a test in a virtual machine that can make it interesting to get the results out. So for example if mess on runs a unit test then the process simply exits with its exit status either zero or nonzero where nonzero means that the integration test has failed. But if you're running an integration test in a virtual machine when that integration test unit fails in the virtual machine that doesn't mean that your virtual machine is suddenly going to exit with exactly the same exit status. And you're not able to use that without some effort to determine if the test failed or not. You need to somehow get the exit status of the test out of the virtual machine and to the host so that it can be interpreted by mess on. So the way we do this in system D is by using what's called in the AFV socket family. This is a socket family that like TCP or the UDP sockets or the unit sockets but this is specifically intended for inter virtual machine communication. So you can assign a virtual machine and AFV socket device and it has a connection ID which identifies the virtual machine. And then you can bind two ports on that in the virtual machine and you can connect to it from the host. So we use this by for passing data from the guest to those. So system D as this as the notify protocol which you can is basically it can send messages about its status over a socket. And we extended this with support for AFV so that we can send information about the virtual machine to the host if someone is listening. We can we the most basic use case of this is to tell the host system when the machine is finished booting. So we send ready equals one then but it turns out that we can also just simply send access status equals whatever the exit status is. And that's how you can get an access status out of the VM. So this is then this is the access status of system D. So how do we make this access status of system D the access status of our integration test. Well we have two different unit settings for this and success action equals exit or and failure action equals exit. And what these two settings tell is that when this unit exits system D should also exit and specifically with the exit status of that service. So this gives us a way to pipe the exit status from the integration test to system D which then exits with the same. It sends it over VSOC to make or say which is listening it reads the exit status and make or sign in exits with that exit status. So you get this whole flow of data through to the host and to just be able to exit with the same exit status in make or sign. Of course just getting the exit status isn't really sufficient. If you had to do that could ask just by looking at this exit status you'd have a pretty bad experience. So you also need the logs ideally. So because we run on a serial console the serial console is already displayed so you get those automatically. But we also wanted a way to be able to get the system D journal from the virtual machine off the virtual machine and to the host. Normally you would just mount the disk image after the virtual machine has finished executing and get the journal out that way. But remember that we wanted to be able to support running these integration tests without needing root privileges. And if you don't have root privileges then you can't mount any file system in Linux. So we can mount the disk image anymore after the integration after the virtual machine has shut down. So we need to get the logs out while the virtual machine is running. How do we do this? Well again with AFVSOC. In the next version of system D most likely we're going to add another forwarding mode to system D-journally so that it can forward its logs over an AFVSOC socket. So again you can have something listening on the host on AFVSOC, configure journal D to send its logs over this AFVSOC. And then simply store them on the host instead of in the virtual machine itself. Or do both because having the logs in the virtual machine available as well can be useful for debugging. So to listen on the host we have this little program which is system D-journal remote. You can configure to listen on any address. This can also be on Unix socket sort of stuff. And it will simply store the logs to the directory that you specify. So once it's done you simply run journal code, you specify the directory that the logs are stored in and you will get the logs of the virtual machine. You can access them, you can read them, you can debug what's going on. Or you can just simply store whatever CI system that you're running the tests in. Then of course we need to be able to debug any failing tests. So the test might be started. It started via the serial console. But when Maston is running a test it doesn't give you interactive access to the serial console. So we need to have a way to be able to get into the VM without needing the serial console. So the regular solution for this is SSH of course. So we want to provide SSH access to the VM. But we don't want to tie this to the network of the VM. Because let's say we might be testing very specific networking access network tests. This might involve multiple VMs and they might need a very particular networking setup. And it doesn't mean that this network setup might not allow for access to the VM via SSH. So we want to use a different protocol. And again we can just use AFV so for this. So this just emerged. It will be in the next release of system. But when system D started with an AFVSock device it can now detect this during early boot via a new generator. And it will bind port 22 on the AFVSock family to a socket unit. Which will start SSHD when connected to. So this allows you to use SSHD with VSOCK. So you can connect to the connection ID of the virtual machine on the host using SSH. And you will get an SSH session in the VM without needing to configure the network. To provision your public key we use system decredentials which can be provided using SMBIOS. To the VM to provision your SSH public key into the VM in the correct location. In .ssh slash authorized keys. So that you don't need to do anything like you don't need to enter a password or anything. So just SSH it will do the usual key cryptography or key authentication. And you just get your root shell in the VM and you can debug whatever you want. To make this nice to use on the host we can drop in an SSH config file that configures a proxy command for SSH. So we take ownership then of the Unix and the VSOCK host name prefixes. So you can do SSH VSOCK slash the connection ID of the virtual machine to get an SSH session into that virtual machine. So this is what we're going to try and use to be able to debug any tests that are going wrong. That was all I had to say. I'll put a link to the project and go take a look. We want to use this for the integration test but make our size of course useful for a lot of other things as well. If you need for building Linux images please take a look. I'm always happy to add new features or you can join the Matrix channel which is linked in the written and ask new questions. And I'll be happy to answer them. Thank you for listening.
Making it easy to get to SLSA level 2
Hello, hey everyone. Welcome to Making It Easy to Get to Salsa Level 2. Thanks for sticking around. It's last day of the conference, send me last talk. So, yeah, today we're going to be talking about salsa and compliance and hopefully how you can meet those compliance requirements as a play. My name is Theophilus and I'm going to be talking about Choc and open source framework. We developed that crash override. So I come from a security background and every time I hear the word compliance I get bored to death. It's kind of like a book sticking exercise. But hopefully we can discuss today about this and see how you can do this in your own organization easily while also get value for your org. Before jumping into the topic, let me kind of quickly set the scene and talk a little bit about software supply chain attacks. So in a software supply chain attack, the attackers compromise the build system or the package registry and get a foothold there. And over the past years we've been seeing an increase in these types of attacks. So there was a report say from Sonatype that said since 2019, year after year, we've been seeing a sevenfold increase in this type of attacks. The report came out in 2022 that said supply chain attacks basically surpassed malware based attacks by 40%. And last year around two out of three of US businesses were impacted by a supply chain attack. So you can take these numbers with a grain of salt, but the fact of the matter is there is a surge in these types of attacks. And this popularity on the attack realm drives policy changes. So in May 2021, there was an executive order by the White House that said software vendors must be provided and purchased with software of materials and provenance information. And quick show of hands. How many are familiar with the term S-bombs or provenance? Cool. How many of you have been deploying these in your pipelines or your organizations? Okay, great. There's also an S-bomb room. So today we're going to jump into these topics real quick. So we're going to discuss some concepts and then talk about the challenges people face when trying to deploy these things to production. Then we're going to talk about CHOC and how CHOC can help you solve these problems but achieve many, many more things and hopefully have a discussion in the end. So for those of you who are not familiar with software bill of materials or S-bombs, you can think of it like a list of ingredients for software. So you go to the supermarket, you see a package, then you read the labels and you get a list of all the ingredients that are in there. So an S-bomb is pretty much the same thing but for your software applications. So you get either an XML or a JSON and from that you can get a list of the packages, their versions, etc., etc. When we're talking about provenance, what we're talking about really is how did the artifact get here? Like who created it, who packaged it, how was it modified along the way until it actually reaches the user basically. So that is all good but if we think about a list of ingredients, what are the guarantees that what we get is actually what we're promised? So for instance you could have an NPM package and you can generate an S-bomb for your NPM package saying that these are the ingredients that are there but then in a package you could get a foothold somewhere in your build pipeline and inject something that was not originally there. So another key component here besides generating the S-bomb and the provenance formation is really having some attestation around the integrity of the generated artifacts. So anyone should be able to cryptographically verify that at least what we're promised has not been tampered with and that the contents of the S-bomb were coming from an original author, etc., etc. And what's really important here is we need to have some clear assumptions around the threat model aka what can an attacker compromise and what are the security guarantees we're getting depending on that. So do we require the attacker say to compromise our build pipeline or do we require the attacker to get a foothold on developers boxes? What's our threat model? And that's really important because if we think about DevOps pipelines in practice you have many components like developers are pushing codes, that code ends up in some provider like GitHub or GitLab, you have open source packages, you have container images, you have infrastructure code that modifies this code and pushes it out and then somehow it ends up in the service or the cloud. And as we're building out all these graphs of components attackers could get a foothold at various places. So this is where Salsa comes in. Salsa is an open SSF project and essentially gives us some framework to talk about the security posture of our applications. And we have different levels for the supply chain security of our artifacts. At level one essentially all we're doing is we are providing information about how the package was built and have a report but we don't really have guarantees around the report whether it has been tampered with or not. At level two we get signed provenance. Essentially at this point we say okay once the thing has been generated there has not been tampering on that artifact but you don't get guarantees around the build platform etc. And as you move up the layers you get stronger and stronger security guarantees. So today we're going to be talking about chocking how easy it is to get to Salsa level two if you deploy chocking to your built pipelines. So how does one start to do this? This is good, we all want to improve the security posture of our applications, we want to deploy these things in our organization, how does one start to do it? One could think that okay that surely a solved problem there must be tools for this already and you're right to some extent but the tooling ecosystem is really in its infancy and it's largely fragmented at this point. So it's not necessarily obvious to a newcomer which tool or framework they should pick and even if you say select a space like S-Bombs the outputs of different tools are inconsistent with each other. One tool get a certain report, another tool get some different report and there might be assumptions around how these things should be getting set up, how you should be deploying all these things so it's not a straightforward way and what's really really hard is thinking about how can you do this at scale. If you have a large organization with multiple repositories, different providers, how do you make it easy for your teams to just set this and let it run and have it be easy to consume the data and also generate data that are of interest to you. So yeah it's not an easy problem and hopefully Chalk will help here. The main idea behind Chalk is really we have some metadata that we care about and then we want to embed that metadata that we call Chalkmarks into the artifact. So the artifact could be a binary or it could be a docker image and you want to embed this metadata into the artifact during the build time or post build time. So you could have an L file in a box and then you can inject metadata into that L file and say okay that was indeed here, you can have information that you care about like the security settings on that box like if a partner is enabled or what are the users or the network connections, you embed that metadata on it and now that artifact is tagged and once you have that tagged artifact you basically let it go and it gets deployed somewhere in production. So think of Chalk pretty much like air tags for your code so you embed the air tags and then you're tracking it across the ecosystem of your infrastructure and once the artifact actually gets executed what's interesting is you can get back reports with metadata that you configured there. So essentially you can scan what has been out there in production, you can grab for all this metadata that has been embedded in the artifacts or you could configure the artifacts some cases to phone home and give you the report themselves and you could do this once or you could do it periodically for instance configuring Chalk to send you heartbeat reports. So let's see this in action. I have set up here a very very basic git repository and this repository all it does is it deploys a lambda function. So we have the main code of the lambda function here and as you can see there's nothing really special to it, we just sleep and return it to 100k and we're building this lambda function using a docker file and there's nothing specific to Chalk in this docker file pulling from a well known image and we're actually building the lambda using a github action. So during the github action we check out the code, we set up the build environment and then here we're setting up Chalk. So we're telling our build ecosystem that Chalk should drop this build of the image and embed metadata on it and what sort of metadata we choose to embed is completely up to us. It comes like Chalk comes with defaults. So these are the only lines of code we ever need to do for our build pipeline so that Chalk can embed sbombs and actually use, you know, provide cryptographic guarantees around the integrity of the generated reports and we're also creating attestation manifests using SIGS-Tor for those of you who are aware of that framework. So cool, let's go ahead and trigger this. I'm going to go here in the action, kind of re-trigger the action once more and what we're doing here is we're building a docker image and we're telling Chalk to encapsulate the whole build and inject metadata in here. And that metadata, we can choose how we want to emit it. So we can choose to emit a report in there or the CLI or in some destination that we care like S3 or some server. So I have here a dummy server that's running and it's waiting for reports. There's nothing here currently. And I'm going to go back into one of the previous actions and show you a report that was emitted by Chalk on the CLI. So during the build, after we've actually pushed the image, you can see down here we have a Chalk report and this is basically a JSON file that has metadata that we care about. So here we know that some image could build, that was a daytime, that was a docker file path, the exact contents of the docker file, the commit ID, the author of the committer, but you also get a cryptographic signature about the integrity of this report essentially. You get interesting things like the environment variables, arguments. You can configure this to be however you like it. And this is generated on the CLI, but we can send the exact report, the exact same report or variance of that report in other destinations. So going back here to the action we just triggered, hopefully once this completes, we will be seeing a report populated to our server. So not only will we see a report here on the CLI, but we'll also get the metadata in the endpoint we configured. What could possibly go wrong? This is just a live demo here. So you can make this as fine-grained as you like. So Chalk supports plugins. So if you want to run, say, your static analysis tools like SemGrid or CodeQL, you can embed this metadata into the report as part of your regular other metadata that you're tracking. So it looks like this got finished. So we did get a report here. And if I go here, essentially we see that we got a build operation, so that got sent over to our server. And this is essentially just a pre-defined rendering of the JSON, right? You can send it wherever you see fit and render it however you would like. But we get some interesting information. We get some signal that we collected S-Bomb and Signing data. And indeed, if I scroll down here, I do see that I have the full S-Bomb. And I can fetch information about the attestation of the artifact. But I also get a bunch of interesting metadata that might not have been obvious just by seeing the build. So I see here CloudProvider is Azure. And we have information about the actual Azure instance metadata in which the build happened. So essentially what happens here is GitHub runs their machines on Azure in this particular instance. And so that build triggered in one of the Azure instances. So that's nice. We can also see the build command. And you can see here how the normal build command is now wrapped under Choc. So Choc is in charge of the build and embeds the metadata into your image. So that's nice. What we did do here is we pushed this demo lambda essentially. You can see this was modified just now. So I'm going to go ahead and execute the image. And hopefully, if things work as expected, the lambda will execute. And I'm going to get a second report here. And that second report is an exec. And if I zoom into the exec, you now see that the command that got executed is actually running within the lambda environment. So Choc is wrapping the entry point of the execution for that Docker image and tells you like, hey, this Choc mark that you inserted, the metadata that you've all captured is still here, but now I'm executing in lambda. And indeed, if I go here and see the Cloud metadata, you can see the region, the role they are in, the account ID, et cetera, et cetera. So with this, we can basically say the metadata that we injected in our build pipeline here is still present throughout wherever we deploy the image. And we can keep track of where the thing actually executes. So if I take into this Choc mark, I'm sorry, let me zoom out here. I can see that there's two reports essentially associated with this. One was a build and the other one was an exec for the exact same hash. So the exact same hash that I build in that machine has been executed in the other machine. So what did we do here? First of all, with four lines of YAML in our GitHub action, we generate and distribute the desktops. And we also have provenance information because we can track where the build happened and where the actual image got executed. And we also get artifact integrity. So in our case, we're using cosine. You could use different frameworks to do this. But essentially, we're meeting the basic requirements here. So we're checking those boxes. And that's with minimal effort, in my opinion. Like all you need to do is you need to configure whatever destinations you want for these reports to be sent. So you say, OK, that's cool. What more can you do? So let's think about typical scenarios that happen during kind of live production environments. You might be on call for a given service. And you get a page in the middle of the night. And there is some issue. There's a bug. There's a vulnerability. Something is off. And you want to figure out, OK, what's the component that's responsible for this? You could have, say, a pretty complex application with multiple teams pushing code. And for large organizations, usually the pattern for resolving these issues is you cut the tickets to the team. You wait for somebody else to be seen and be like, hey, that's the responsibility of that person. Potentially, you grab into code. You say, OK, what was the last commit? Or you have metrics. And you track from your metrics what chains. And you try to correlate it to somebody else. If you're using Chalk for your build pipelines, it's much, much easier to correlate what exact version of an image is running where and what the components are. And potentially, like, who are the code owners, et cetera. Because if we go back here, you see that we have things like the cometer and the commit ID. So we have the commit ID. You can start building these profiles about ownership incrementally as you go. So instead of having a process which could potentially take a couple of hours to determine the root cause of an outage or an issue, you now can have this in a few clicks, hopefully. Another common use case is application inventory and change management. So say, for instance, you're part of a large organization. You want to deprecate the framework. You want to deprecate, say, AngularJS. So AngularJS is running production. And you want to figure out, OK, what is the impact? How many teams are using it? Is the code even live? And what was the last time things got executed? You can figure out, you can get reports around these things. More importantly, you can see how applications change over time. So many of the people we've been talking to have processes where, for instance, they do a sort of change management meeting. Like, once a week, they say, OK, what has changed? What has been deployed? Do we need to go through a security review? What's the exact list of changes? And that process is manual to a large extent. Using Choc, you can automate this, because you can generate an exact report of the diff and you can get some integrity guarantees around that report. But more importantly, besides these things, you can do much, much more rightly. It's not necessary that you can only Choc containers. You can run essentially tools of your choosing, or you can submit custom plugins for metadata surfacing. And currently, the open source implementation that we have on GitHub only supports the entry pod wrapping for containers, but we're working to expand Choc functionality with more and more features. You can still Choc L files and PYC files and jars, et cetera. So yeah, the framework is out there. It's written in NIM. NIM is a very, very cool statically compiled type safe language. So any fans of NIM here, feel free to contribute. And we're welcoming feature requests. And I think that's my talk. I'm happy to take questions or discuss this. Thank you. Thank you. Yep. You talked about large organization. I'm very open to second. Yes. So the question here is I brought up large organizations, but given a concrete example of what are some use cases that this would apply in, right? So just to make this clear, this does not only apply to a larger, a small organization. It applies to everyone. It's just that if you are having a single application with a single repository, pretty much you know exactly what version is deployed where. The complexities of these situations start to be amplified the bigger and bigger you get, right? So if you have, say, a web application, and that web application has multiple components that are live at any given time, or say you have a distributed service and you have microservices running, you have multiple teams committing different versions of their component at any given time. And potentially some of these teams change. So you could end up with a repository having outdated code, right? There's a mission now, something has failed, and you go into the code, say, what was the last commit? It was six months ago. The committer of that application has left the team, potentially has left the company. Who do you contact? How do you know that's actually that part that has been outdated? But if you keep track of your builds and your executions, you have the ability now to tap into all the history of, like, all the provenance of a certain artifact and surface metrics that you care about. So if you cared about, say, show me all the components that haven't been updated in the last month or that haven't been executed the last month, it's way, way easier to do this. I'm not sure if I answered the question. Yeah. Yeah. So you showed how to do it in a GitHub action, but could you generalize and do this manually in a one-prime or in a different pipeline environment as well? Yes, yes. Chalk, you can actually, if you go now, if you visit the GitHub repo or the website, you can just download it. And it's a binary that runs. You can run locally and embed metadata into any artifact that you care about on your machine. So you can download on your laptop and scan all the L files in your system or the jar files or whatever, or even scan a whole directory. You can specify whatever you want. And then you can configure metadata that you care about, and these will be embedded there. And you can then extract it. So you don't necessarily need to have Chalk report back to you or run it in a GitHub action. You can just use it to embed information and then surface it. So you can both insert and extract if that makes sense. Yep. So that's a great question. I think one of the big benefits of Chalk is that you can embed information even in generated images or artifacts, right? So if you're using, say you have some third party software like a library that you're consuming, perhaps you don't know where it came from, but you know that you saw it in a certain machine at a certain hash. And then you can use Chalk to encapsulate that information for your artifact. And basically, if you run a query across all your applications that say are importing a given library, you can see all the versions of that library that are running. So you can start building these application inventories very easily, even if it's a third party software. Is it the total of the bottom third party container? It's still the same premise, right? Because if you have a container, you have several layers. So you can start saying, okay, these are the layers I have seen here. And potentially you don't have the full information, but you can at least ensure that you can attest that, okay, these are, this is the hash that I have seen. We are starting to add support to actually wrap entry points of different layers if you'd like to. So you should be able to interpose yourself in another layer should you like to, but that's not currently up yet. It's not up on the open source limitation. Yeah? How does Chalk play together with the useful bits? Are they to include them in the library? That's a great question. No, you don't need to include any compiler. All you need is the binary. And then if you have a reducible build for, in your pipeline, you should still be able to achieve the same guarantees. For instance, if you have, say, an L file, we'll embed metadata into a section and that will survive stripping and all that. So once you have a build, then assuming you know that you're running with Chalk, right, and you don't modify the thing later on inappropriately, you would at least know that you're running with Chalk, right? So that if you're getting a report, that report has not been tampered with. Yep? Let's imagine I have a jar which I have Chalk, right? Then I modify it and zip change it. So at which point Q and V8 and then I Chalk it again. At which point how do you pull the code? Right. So the question is, suppose you have a jar, you Chalk it, and then you modify it and then you Chalk it again. How does the tool help you here? So Chalk does not allow you to have just a single Chalk mark within a binary. You can wrap Chalk marks within Chalk marks within Chalk marks essentially. So if you're making modifications and you'd want it to, you can maintain past information about past Chalk marks. Or if you're building a jar, say, out of other jars and those have Chalk marks, you can use this information and embed them into your final jar, if that makes sense. So you can wrap and encapsulate all the metadata from all the components. So I need to focus more on this. Well it wouldn't be more complex than just saying Chalk insert. Like Chalk would take care of all the build dependencies and make sure it injects it automatically. At least that's where we're heading at. It might not be full feature for all the flavors of what can be choked currently, but that's where we want to go for sure. Cool. Thank you.
Are Project Tests Enough for Automated Dependency Updates? A Case Study of 262 Java Projects on Github
All right, everyone. So I guess this is the last session for today. And what I'm going to present now is about our project, Test Enough for Automated Appendice Updates. And before I delve into my presentation, does anyone actually have an answer to this? Who wants to attempt to answer the question here? It depends. Yeah, that's a great answer. And I think that's also in a way in the right direction. So a little bit about myself. So my name is Joseph Hyder. I'm a member of technical staff at Endolabs. It's a startup on more now a scale up based in Palo Alto in California. And before that, I mean, I'm still actually a PhD candidate at the Duff University of Technology in the Netherlands. So quite close by to Brussels here. And for the last, let's say like six, seven, eight years of my life, I've been quite involved in working on this, writing, security, but also developing techniques that are focused on trying to like apply program analysis to, for example, package repositories or trying to better understand what's going on within dependencies and dependency trees. And just like a little bit talk about what I mean with automated appendice updates. I guess most of you already know what it is. So essentially whenever there is a new release from Maven, Ruby Jam, Socargo or MPM, you would have a tool. I just did a couple of them, which is the Panda Bot or renovate or that few. So when there's a new release in your repository, usually a prerequisite is created. And then the, let's say like it creates a branch out of your repository, tries to build it. If that goes fine, so it goes usually to the next stage. If you have it configured to basically run the tests. And then if everything is fine, in this case, it's showed on X mark, but imagine if everything is fine, you will merge it. In some cases, if you know it's not a problem, you would merge it in any case. And I think for many of us, we have seen like, usually show something like this. It would update version 2.2 to 2.4. So that's like the essential thing that I'm focusing around. Like what I mean with automated dependence updates. And an interesting thing around automated dependence updates is that there's usually this promise that if you just run your tests, you are essentially able to catch any type of regression errors, any problems that might exist in your code. And me as a researcher that maybe sort of a bit questioning pattern, as I felt like, hmm, the test that we are usually having are projects, they're more focused on the your project test suite. And maybe not so much on the third party dependencies or third party libraries that you use in your code. So that may be sort of race three questions. The first question that I asked was, do we even write tests against dependencies in the first place? And then the second question is, do project tests with even cover usages of dependencies in the source code? And the last one is like, are even test sufficient alone just using tests to detect any bad updates that you might find in using these tools or doing automated dependency updates? And to study this, I looked into open source projects at the first, oh yeah, another question is, of course, should we even write test for dependencies? Because if we like to reuse components from open source package repositories, why should we even write tests for that? Because it kind of gives us the ergonomics that we can just use anything like it in our code and that's it. And this sort of like started like as an empirical study, that's sort of what this talk is primarily centered around. So the first thing that I looked at in 20 study was to see what is the statement coverage of function calls to dependencies. And this is similar to considering for example, like covered like life support, for example, like J. Coco as a tool. And then the other thing that we also focus in the study is how effective are tests using detecting updates with regression errors. So what we're doing here is that we are basically trying to find, I mean, either find or actually put regression errors in existing libraries, and then directly validate whether the project test suite can directly detect that or not. And that's also something called mutation testing analysis. And I think there was one talk about this earlier. And then the last thing around the studies that currently the sort of state of thought is to focus on just using test suites, but could we use another way to find any problems or early detect issues that might exist in updating our dependency. So yeah, the first question is like, how can we do some type of statement coverage or get an idea about what exactly are we using in third party libraries? So we did this in two ways. The first thing was to essentially, so this was of course in Java, we extracted all call sites that we will find in projects. And if those call sites points to third party libraries in bytecode, we consider that as a usage. And that's for trustive dependencies because now you're not, let's say like longer on your source code where you have the call set direct dependencies, you would also need to go to the trusty ones and here to sort of approximate that is not an exact measurement. We essentially build static whole graphs to kind of get an idea of what would be used in the chemistry of our project. And then last we did some instrumentation. So we essentially run the tests of a project and execute what functions were invoked in the dependencies. And this will give, let's say like some idea of what exactly is being used or not used at all. So essentially first we statically derive like, what are all the usages? And then by running the tests we know which of those functions were covered or not. So kind of similar to code coverage. And we did this for around 521 Gita projects. And what we found very interesting was that when we look at the direct dependencies of a project, so this is all the direct dependencies that were found, about 60% like when running the tests are, let's say like covered by it. But then when we go to trustive dependencies, we found that the median was only 20%. So which means that a lot of the transitive functions that may be used may not even be reachable by test. So they sort of like ring some alarm bells, right? Because that means essentially like if you have a dependency update and you don't have any test that is covering that area, that will basically give you a green tick and you might merge it. And I don't think many would do that. But that's, let's say like the implementation area that also kind of raises some questions around how effective using tests for automated updates. And yeah, the other question does this matter at all. And I think a very interesting one here is the log for shell case because I don't think many of us would have tests that is particularly targeting log libraries. But here is an instance where something we don't normally would test and would have tests in any case. If you would do an update, then yeah, there might be some breaking changes then yeah, there will be a problem here. Then going to the second part of the study, which was on test effectiveness. And I was measuring that we're doing mutation testing. So the underlying framework we used here was a pie test, but we modified pie test to do things a little bit differently. And yeah, to sort of like give a quick sort of idea of what mutation testing is, is that you essentially have a function, for example, return x plus y. And then you apply some type of mutation operator where you swap, let's say like the class. And then you would expect that your test suite will be able to cover this because here the behavior is completely changed, right? It's no longer an addition operator. So normally with mutation testing, you would give it your whole project source code. It will start trying to modify in the source code and then see whether the test suite is able to capture that or not. So what we did differently is that we essentially mutated functions in the dependency code and not the project code at all. And we only mutated those that were reachable by test. So I was saying earlier that we were running a test to know which functions were executed. So we used those functions to essentially apply those mutation operators. And then from there we can see if the test is able to capture that or not. And yeah, before I go into this also another alternative way that we investigated is called a change impact analysis. So here we sort of leverage static analysis and specifically using call graphs. So how it essentially works is that we have a version 1.02 and 1.03. We compute a diff and for the diff we will find out which functions changed. And for example here we know that in bar and bus function we can see that there is an arithmetic change like instead of y minus minus it's y plus plus. And then in the other, like in the bus function we see that there is a new method called. And then what do we do later? We practically build a call graph of the application and its dependencies. And then using reachability analysis, so what we do here is that we know that the bar and bus was changed. And here we have let's say like a reachable path from bar up to let's say like stats on the score JSON I mean. And also we have like bus here where we have a new function called to QX STR. And by using this we can directly figure out if there is a coaching and dependency whether you are reachable or not in the first place. And why this is like a very nice complement to dynamic test is that we are essentially leveraging by looking at the source code what are we actually using. And then as a complement to where tests might not be covering we can sort of find directly if there is any change that might affect like your project. That of course comes the more tricky part which is semantic changes. So I mean one thing it's nice that you can detect that the method change but sometimes you might just do a simple refactoring that you know just refactors are a huge method into a method with like a couple of smaller methods is that. So the truth is that it's extremely difficult to know what exactly is a semantic change because there's a lot of factors around it. So the only thing that we did was that we kind of took what was like behavioral changes. So we looked at only like data flow or control flow changes. So for example if you add a new method call we consider that as like a special change or if you did some major change on your if statements that may introduce a new logic of how the control flow works then we consider that as an interesting change to follow. And what it is like I implemented a tool called Uptatera which means update in Swedish. And so I applied this on. So it essentially shows like which function had a change. So for example Rx, Java, not facing subscriber on error and we can see that it's reachable from the project and then it shows exactly how it was reachable. Yeah through like the code. And then in the second section I would have like what is basically the major changes in that function. So this could sort of give you some context of what essentially changed. Other than just telling that either the test parsed or failed. And when using this mutation PyPlanet that was explaining we essentially generated 1 million artificial updates by introducing those regressions and we did this on 262 GitHub projects. And what we found was that when doing the sort of changes on project tests we found that on average projects are able to detect 37% of those which means that a lot of like changes may not may get unnoticed like in general. But if you use static analysis now that you sort of have the whole context we able to detect 72% of all those changes. But what we find more interesting is that we can see that interestingly here like from the context of the studies that there's basically no guarantees that tests can prevent bad updates and using either of those techniques is not good enough to ensure that updates are safe. Then of course the other thing is that static analysis is not perfect. There are also problems with it as well. So the problem is over approximation. So we have over approximation at two locations. One is the call graphs themselves because when it comes to dynamic dispatch if there are maybe 200 implementations that might stand from an interface call we have to link to all of them and that might generate false positives. And then the other case is also with the semantic changes that we are detecting because we also don't know exactly what type of semantic changes it is. But to sort of see how this worked in practice we also analyzed and applied this on 22 dependable PRs. And from the results what we found in general was that by using static analysis we were able to detect three unused dependencies. So here let's say like the test would just pass it whatever but in fact we found that the dependencies were not used at all. And we were able to prevent three breaking updates and one which actually was confirmed by our developer where the test were not able to detect. And then of course we found that there are let's say like false positives and as I mentioned there were many cases with refractorings and then of course this over approximated call paths. So if you use like a tool like Google here or static analysis it can help to prevent updates but then you also get a lot of noise as well as a result. So sort of coming to the end of more of the studies what are let's say like the recommendations that I have after looking into like on Github projects how tests are being made etc. So one thing I found missing when it comes to updating with test widths is that we don't have any form of confidence score. And what I mean with confidence score is that for example if we can stop measuring test coverage we can see for example if there is a change function in a third party library do we even have test that reaches that or not and that could directly give an indication whether like my test width is able to capture that or not. And another very interesting thing could be for example if you find that one of your libraries are very tightly integrated with your project it can also sort of give an indication whether you have let's say like enough test to cover that usage or not at all. And then by having sort of this score you can maybe get an indication where does let's say like how well am I just able to capture things in third party libraries or not. This is something that I would like to see in tooling in general. And then when it comes to the gaps in test coverage so this is related to the results I was saying like the statement coverage and effectiveness. So I believe more of having a hybrid solution so we're using tests or dynamic analysis is able to capture. I think we should use that because that is more precise. But then in areas of the code where we don't have any coverage so for example consider back to the look for J library where usually I wouldn't expect just to be much test coverage. Here it could be nice to complement the static analysis. So you sort of get a little bit better for both words here. And then another advantage that I might see having static analysis rather running tests is that we can maybe much more earlier to take potential like problems in compatibilities by having that rather than trying to run it through the build system consuming extra resources or tests etc. So those are less likely to main things that I find important to address. And then for users like myself of using this automated dependence updating tools. So although like reusing is free in the sense that we can easily just use a library but we often forget the operational and maintenance costs and those are not free. So trying to basically automate away everything by using tooling etc. is not always the solution. I think it's important to also consider that once we start adopting a library we also need to think about how we can maintain it but also understanding what potential risk might come from it. Could be for example that maintainer have a very different sort of handling when it comes to security vulnerabilities. It could also be with the release protocol like there could be disagreements on what is breaking change or not for clients. So I think having that aspect is one important thing. And the other thing is like of course not blindly trusting automated dependency updates and I guess no one really does this. And then that's another thing which could be debatable is to have essentially critical I mean having writing tests for critical dependencies and this could be a library that's very critical to your project. I think here maybe having tests could help let's say like capture early issues that might arise in dependencies and not come as an unwanted breaking change later on once you merge the automated PR. So if you want to let's say like know more about this work I have a paper so I also uploaded slides on the Fosnum website so you can click the link and the paper is open access and yeah this is concludes my talk more or less so happy to take any questions. So do you know if any of these bots like the pen about renovate are working on such a score so let's say the merge request to get like a warning. Hey your tests are not covering only 10 percent of the dependencies. Do you know if there is any work. So what I'm aware of is that there is a compatibility score that looks at for example for a particular dependency version updates if out of less than like 200 PRs if 100 of those were successful for other projects then it will give us a score that there's a 50 percent chance that you will succeed here. The only thing I find problematic is that every project has their own specific use case or context of how they use it so it could be misleading but I haven't heard anything that looks specifically into your test suite to see how I mean how it's able to do that. Thank you. You mentioned the number of 60 percent for the amount of tests for direct dependencies and I believe it was a lot less for transitive dependencies. Do you have any numbers on the amount of transitive dependencies in search change the chains actually. So I can can imagine that the 60 percent is cumulative in these. Do you mean for the statement coverage thing or the statement. Yeah the first one. So the first one the 60 percent was on like direct dependencies and then this 20 percent was on the transitive ones. Do you have any numbers of the amount of transitive dependencies so you can relate it to that 60 percent. Okay so I did this on 500 time projects but I might have the more specific numbers in the paper. Okay. You have been looking at detecting errors. Have you looked in the other side because you can use it in a hybrid mode that your tool maybe can tell me you can make this update for sure because all the code is changed. You don't care about it. For example if you look at low level right libraries like Apache Commons you only use a part of it but you want to keep up to date and some updates are more or less completely safe because you don't touch any code that has changed because only new features have been added or so and that would also help if I just know yes that's safe. Yeah that's a great question. So this is a little bit idea we had with introducing call graphs because the call graphs you can start learning what exactly is used. So even if you use like a major library and you just use maybe two utility classes and even if you go to like a major version of it you might not be affected by it and this is something that should be covered by the call graph so we will see for example that the utility classes there are no changes there but then in the rest of the package there's a lot of changes that you're unaffected by. Thank you. Did you check how the call graphs work with dynamic dependency injection? Yeah so we essentially if I understood the question right I mean so we did generate the dynamic call graphs like running the test and this is something that we essentially used to guide or rotation like testing framework to only do changes in those functions and not for example functions that the test didn't touch because otherwise we wouldn't know whether I mean the test we were able to detect changes or not.
Perl at PayProp
Thank you. This is a QR code for the slides and also all of the talks I reference in this talk. And yeah, thank you Theo for organizing the poll in Raku Devroom. I'm going to talk about, you can all hear me okay? Yeah, perfect. I'm going to talk about Pearl at PayProp, which is a company I work for, an established company, been around for almost 25 years now. And briefly about me, I don't normally do this, but I see a few faces I don't recognize and I'm sure people don't recognize me as well, so I thought I would do this. I'm a senior dev and head of processing at PayProp. I've been there for 10 years. I've been a developer for just over 20 years. I've worked with various languages, but mostly Pearl. But I've only worked for three companies in the time I've been in that 20 years, so I've kind of seen tech stacks grow and shrink and change. I'm a C-Pone contributor, so Lijo, I'm a C-Pone, Meta-C-Pone. And I'm a London and Swiss Pearl and Raku workshop organizer, so come and talk to me if you're interested in any of those. We're searching for a venue for this year's London Pearl workshop, so if you have any ideas, come and talk to me. And I'm a regular speaker at other Pearl workshops and conferences, and often I'm helping out with the video. And I occasionally blog on Pearl. I prefer to do long form articles rather than technical kind of, this is how you use this module kind of posts. And I run the Pearl events Instagram account, but that's about the limit of my social media interaction. And I'm a Fosdum Newsbie, so my first time here. We usually have a work event that runs at this time of year, so it always clashes with Fosdum, so I've never managed to make it, so this is the first time it hasn't clashed with Fosdum. So about paper op. That's what kind of what we look like, the public facing part of the site at least. We're a payment and reconciliation platform for the residential letting industry. And we kind of, our core business value is we turn things like this, and this is one of the nicer ones to deal with. This is a Swift format into things like this. So we put interfaces and automation on tank consuming payment reconciliation flows. And this literally saves our customers hours, days, weeks of time, so we're really, really useful to them. The key night of you might see CGI bin script.cgi in that URL. So yeah, we've been around for over 20 years, so we have some old code, bit of an understatement in places. But the pearl we are using is relatively modern, 532. And we build our own pearl, and we don't use the vendor supplied pearl or the system pearl. We don't do anything special with it. We could in theory compile it with different flags, but we don't do that. So we get the defaults, which means we don't get things like iThreads, because if you use vendor supplied pearl, you get things you probably don't need. Yeah, the key is that it's not the system pearl. So we're not kind of tied to any particular version of an OS or package or whatever. And we can apply updates and patches as necessary. We should be on 538 by now. We tend to trail a major version. I've been spread a bit thin, so we haven't managed to get to latest, but that's on the roadmap for this year. Yeah, and it gives us dependency control, which is critical. If you've been paying attention the last couple of weeks, there's been a couple of critical CVEs against a couple of spreadsheet passing modules, so we could get those updates out quite quickly. Loose coupling, so yeah, like I said, not tied to the OS or anything like that. And the key is it's everywhere. So we have the same version of pearl, the same version of modules from dev through CI staging demo all the way to production. So otherwise you get interesting debugging problems. And while the issues and challenges around that, well, probably the ones you've all heard, you know, still use pearl or even what is pearl, and the bus factor, which is, you know, becoming a problem with some of the pearl dependencies. So yeah, it's a 20-year-old, 22-year-old app, so we are in the process of migrating from CGI.pm to Modulicious. A 20-year-old app has some legacy, a bit of an understatement really. This is an ongoing task, and we're about two-thirds complete in terms of number of requests to the site. We have a lot more pages than we really use after 20 years. Kind of inevitably happens that people write features and functionality that end up not being used, and we've got hundreds of pages, and really only 20% of them are actively used. So a lot of them will never actually end up getting converted. And one of the ways we did this in one of our other apps is using this plugin from Modulicious. And we decided not to do this with PayProp because we're using Apache on the front end anyway, so we can kind of proxy to a Modulicious server or just run exec CGI if it's CGI script. So we're not doing a kind of serving the CGI scripts from Modulicious using a plugin. There's no real value there, to be honest. So that's kind of what the setup is. I actually gave a talk about this almost a decade ago, so there's a link there to that talk, which has some suggestions for how you can do this if you're using CGI. You want to use Modulicious, what the options are. But it was 10 years ago, so it's a little bit out of date now, because Modulicious moves fast, and it is one of the challenges in using it because they say that you can always count on backwards compatibility, but they will deprecate and remove features within a three-month window, which is not really backwards compatibility. So you just have to be aware that if you haven't done an update in a while, things might break. And we're adding an ORM. And I know this can be a contentious issue, which I kind of find surprising. I'm just title writing this kind of stuff. And this is a simplified, about as simple as the query you can do. So you select some columns from the table, prepare the query, make sure you have the error handling, execute it, grab a hash ref. I just want to write that more descriptive. All the stuff we can get for free is there. And we can still drop down to vanilla SQL if we want. And we do do that. We have some pretty hairy reporting queries, and we're not writing them in ORM speak, because they're big enough already. If you start using the DSL of your ORM, they become an obfuscation. And the reason we're doing that is it allows us to kind of isolate some of the legacy issues in the schema. Again, 20-year-old app, organically growing schema, you can have some issues like this. And we can kind of nicely abstract them away in the ORM that we're using. Put this down as stuff hack and use says, you know, just fix your schema and things will break and you might see it. And it's like, no, we're not going to risk the business by breaking stuff. We don't move fast and break things. You know, we want to keep our customers happy. And then another suggestion is, well, why don't you write your own? But why would you do that? You know, we could abstract all our logic into an ORM, but it'd be half done one full of the bugs that all of the available ones have kind of already ironed out anyway. And yeah, we're using DRX class. Very feature-rich, but not dogmatic about its use. It's like, say, you can use it in ways you want to use it. Some of the issues and challenges around that, well, there's a learning curve, a big learning curve, especially if you haven't used an ORM before. But the manual is very good. Lots of stuff on the web you can find about how to do, you know, quite bespoke things with it. Currently, I say unmaintained, I would say stable rather than maintained. There are talks happening to kind of address this because it's a backlog of patches that could be applied and that kind of thing. And I did talk about this, I want to say, six years ago, using a legacy schema with the big class and how you can address some of those issues that you might have in your schema. Business objects, the model. So the older code is kind of a procedural mashup of business logic, database access, new logic, and so on. So it's all kind of smushed into the same layer. The newer code we're factoring into business objects. And the key is that the business objects are our model. Our ORM is not our model. People often kind of conflate the two. And the reason we're doing it is to get all of this stuff. If you're doing object-oriented coding properly, you get all of this really nice stuff. It's not just calling a method on an instance of a class. You get really powerful, useful things. And we're using Moose. And we were previously using mouse, but we're kind of moving to Moose reasons that I won't go into here. Karina is one to eventually look at. That's been added to the core in 538, the early version. Ovid's going to talk about that a bit later, so I won't go into that too much. But just a quick example, this is kind of the thing we're doing. We're dealing with payments, so we have this incoming payment class, and it has an attribute that references a bank statement, so we're having type constraints. So we can properly constrain that it has to be an object of this type with an ID, and we can throw a useful exception if we try and put something in there that shouldn't be in there. And then we can use the tell-on-ask principle. We can say fail that payment, and then the logic is in one place. And we're throwing exceptions if things aren't in the right state, and then we're delegating to the bank statement object to then fail its payment. So it's all nicely isolated, easy to test. So yeah, Moose, again, what are the issues and the challenges? Well again, the learning curve, if you've not used much object-oriented programming, this is a big paradigm shift. But I think it's worth it, because I think Moose is one of the best object systems available across any language. And then you add the mop, meta-object programming, and you can use introspection and everything. Pearl's very powerful about introspection. And there's been multi-day courses at Pearl Conference that's talking about Moose, so it's impossible for me to even scratch the surface in a small section of a 20-minute talk. People often talk about the slow startup if you're using some of these frameworks and systems, but if it's in a persistent process, a modulator server, that's not an issue. You load it once, it's loaded. If it's on the command line, well yeah, it used to be slow, but now it's things have caught up, and you're probably running those command line scripts once in a blue moon anyway. CGI scripts, we do use some of this, but we lazy load. So these are pages that are taking a couple of seconds to run their commands anyway, so the compile time of loading some of those subjects is a tiny percentage of that anyway. Yeah, mutable state, that's my technical debt. It's one of the things you learn, you know, mutable state is bad, so all our new code and your objects are immutable objects. Refactoring and regression testing, and I'm talking about beyond unit and integration testing because that's kind of the easy stuff. We're adding this for all new code, and mind we do refactoring, we're making sure there's test coverage there and addressing any gaps. But what about those critical business scripts that have existed forever and have no test coverage and basically run the business? I mean, how do you adjust bootstrapping problem of refactoring so you can work easy with them but there's no tests, but you don't want to refactor them because there's no tests, it's kind of a catch-22 situation. Well, this is Pearl, so we've got some useful features we can use to work around that. One of the frameworks we've come up with is we are creating override libraries that we pass into scripts that allows us to override various functions at various times in the lifecycle of that script that runs. So here we are overriding the call to file slippers read text method by saying run this script with this override library path and then we have these various blocks that will override calls so we can kind of monkey patch things. So we can add as much test coverage as we need and then start changing the script. So that's kind of an example of how we do it, a bit down in the weeds, but I would encourage you to watch this talk by Nick. He talked about this at the Pearl and Racket conference last year. It goes into all the details of how you can do this, which blocks you can use to run when, how it works and some of the issues around doing that because you're actually adding technical debt when you do this, but we need that test coverage there. So the aim is get the test coverage in place, the fact of the scripts, the fact of the test coverage, we're in a better place. This has been critical for some of the scripts we have because I mean they literally run the business and they literally have no test coverage while they have test coverage now. Like I said, we don't move fast and break things. Contributing to C pan. So yeah, we actively encourage contributions to C pan. These are all the distributions that we've either written or taken over maintenance of in the last decade, which is the time I've been a pay prop. Stuff like some modulus plugins. So there's this plugin for NMIT, modulus that allows you to profile your routes using NMIT prof. It's really useful. I wrote some of this OAuth 2 server stuff. If you've ever used OAuth 2 and tried to implement server side stuff, it's a fun game. That hopefully makes it a bit easier. Third party payment libraries. We interact with third party payment providers so we've written some stuff. Go Caldlis do direct debit in the UK. TrueLayer is a new opencomer. They're using the open banking spec so I think they're going to get quite big in the coming years. And other stuff, so we maintain CGI.pm because we still have CGI scripts. We can maintain un-maintained libraries, Google Maps stuff and all that kind of stuff. The issues and challenges around that, well, the pool of contributors to C pan is shrinking. Libraries for newer services and APIs don't exist. Often you'll find third party libraries for languages except Pearl, which is a shame. Modern APIs are restful and easy to create a third party library for. We're happy to throw somebody at it for a week or two, which is what we did with the TrueLayer one. They threw me at it for a week and there's one on C pan. Navigating around IP issues, well, that encourages to decouple our code. So that's actually quite a good thing. And finally, hiring devs new to Pearl. I say Pearl has been on the plateau of productivity for quite a while. Those that left it a long time ago don't know the current ecosystem. But more than a generation removed from even Pearl 5. Pearl 1 was released in 1987 and actually probably Larry was prototyping it a long time before that. 510, which can be considered modern Pearl, there are people starting a university now that were born after 510 came out. But it's still in a lot of places and I know that because we've interviewed people. Some of these users can't talk about it. Banks, the fangs, I won't emphasize which letter in the fangs, but we know there's people using Pearl in these places. So I think the rumors of Pearl's demise are greatly exaggerated, but it's kind of a known unknown at this point. And it's still be using Greenfield projects, so the system that Fosdham used to review, annotate, cut, process, transcode and publish all of their videos runs on modern Pearl. So over a thousand videos this weekend are going through a modern Pearl system. And its popularity is kind of normalized over the last two decades, I think. So it's had to find Pearl developers. But newcomers don't have preconceptions. That's my experience of interviewing anyway. I think those under 30 either haven't heard of the language or haven't used it. And those who don't want to use it self-select out of the process anyway. Because we are explicit that we use Pearl in our job specs. We just don't require it unless we're hiring a senior Pearl developer. And I find modern Pearl is an interesting and enjoyable language to work with. Working with legacy code is not specifically a Pearl thing. And we make sure to do all of this stuff, because you should be doing all of this stuff. And we're finding in a distributed work environment you need to do all of this stuff. I've not really talked about this much in the past, but I have written blog posts. So check out the blog posts if you're interested. And the key is that you can still be very experienced, but still a newcomer. And that's absolutely fine. And I think it's actually beneficial to the ecosystem and the community. So if you are, please speak up. You want to hear from you. And that's it. I don't think I have time for questions. So thank you very much. Thank you.
Open Food Facts: Learning and using Perl in 2024 to transform the food system !
I'd like to welcome Pierre, I've got your last name, Pierre. Pierre Slamish. All right. That's, oh yeah. I think it's one of the more recent World Projects started, isn't it? We created a plan for the pack in 2012. So it's just over 10 year old project. That's value of a teenager. Right. Let's welcome Pierre. And thank you, Lee, by the way. We use, we depend on your work. So I'm going to talk about open food fact. And it's not going to be a very technical talk, but more like experiences of people getting into Pearl in 2024 to contribute to food transparency and to transform the food system. So I'm, yeah, I'm Pierre. I'm one of the co-founders of open food fact. So I'm not the technical guy. I'm the product manager, but I double in a, in a product opener, which is our Pearl back end. So on the menu, I'm going to briefly introduce open food fact to both of you who don't know it yet. Then I'll have a part on starting Pearl in 2024. So some portraits of our contributors, how you can have impact on the food system with Pearl, and finally some Q and A. So about the open food fact, it's the answer to a very simple problem. How do you pick a product in the supermarket? You have many products, a lot of information. It's hard if you want to pick one for your children. It's very hard. And then when you get to the nutrition table, you have this long ingredient list, but sometimes you can't read. The nutrition table personally, I have never managed to make sense out of it. And so you have to make decisions every day to get food. So open food fact is all about empowering users to have an impact on their own health, but also on the environment and the food system at large. So we kind of have this slogan, don't panic when you're in the supermarket, organize. So trying to get together and have an impact on the food system. So we've been nicknamed by the media, the Wikipedia for food products. We have over 3 million products in 160 countries and languages. Our data sources are crowdsourcing. So using your mobile phone, you can actually add photos, add data, manually and using machine learning help. And the food industry, which is beginning to realize that closing their data doesn't make any sense. So we want transparency to become the norm. So I'm going to show you how pale code in production is having impact every day for millions. This is, so the first thing is a nutrition score, which you may have seen in Belgium, in France and in other countries. We started computing nutrition score in 2015. It was a scientific formula at the time. So we decided, okay, let's compute it on all the products we have and show it to people in the app. And we helped democratize nutrition score before it passed into law. So this is a screenshot of something one of our contributors had done at the time. He pasted some nutrition score on all the, using image editing software on all the products. Fast forward a couple of years, you go from digital to actually seeing a whole supermarket ale full of nutrition score, which shows that you go from digital to real life impact. So I mean, not only people who run the code, who use the software, but everyone can benefit, but everyone can benefit, even people who don't care about it. So from pale code to real life impact. And it goes even beyond just displaying the score. We started to realize that producers are actually changing the nutritional composition of their food products. So it's a systemic impact. Code can have a systemic impact on the food system. It's absolutely bananas. What you can do also with a path of tour is compare products at very large scale. So for instance, we are able to monitor the composition of Fanta. And as you see, it's not the same in every country. So basically we can show what's the industry is trying to hide us. We also have help producers improve their product. So one of our part of our software stack is the producer platform. And we do some computations based on the nutrition table to actually provide reformulation opportunities. If you reduce sugar by 20 milligrams, you can actually go from nutritional score B to nutritional score A. So computing helps also change the products. And yeah, brands are starting to... Oh, sorry, I went a little bit too far. Yeah, brands are starting to... All those brands have actually started to share data and use the import system, the mapping and import systems that are in OpenFood fact, that are kind of hairy XML parsing and all of that. And so yeah, they are sharing data in many countries at large scale. And to code this stack, we have Stefan, the founder of the OpenFood project, but we also managed to get more coders on board. People who just picked up Perl just to be able to contribute to the food system transparency. We started learning Perl in 2022, 23, 24 just to be able to have an impact. And Lee, I can confirm you that newcomers don't have any preconceptions. So for instance, Yukti picked up Perl in 2022 and she's improving the backend code quality. So she's very serious about food transparency. She doesn't look at the front, she looks at the back where the nutrition tables are. She wrote a lot of tests, bug fixes, and she's into Perl correctness. And she's obviously like a soul trying to convert all people she meets into OpenFood users. Stefan, who coded much of the code, learned Perl in 1998 when he was at Yahoo. He likes to do origami in his free time. And some of the code base are things that he coded perhaps a little bit too quickly 10 years ago when he launched the OpenFood project. And he recently paddled in a 10 degree water. Monalika, she picked up Perl in 2023 to improve the UI, the test, and the code. So it was part of the program funded by the Perl Foundation to include more people in computing. So she worked on product image protection to ensure that data quality stays constant and misuse and user management, email misuse and user management. Alex, who's a Python person, but took the Perl Camel two years ago to contribute to OpenFood fact, who's part of the permanent team, and who's using some of the tools you code in this room, so Proxmox, Sanuid, and many, many more. Benoit, who picked up Perl in 2023 to improve the data quality system, and he's learning nutrition science almost as fast as he is picking up Perl. And John, who didn't do much Perl before and started learning Perl in 2022. And he's spending one day a week leveling up in Perl to be able to contribute to OpenFood fact. So I'm going to go a bit faster, but as you see, the dynamic of people picking up Perl is actually very much alive. Young people, girls, etc. are actually learning Perl to be able to contribute to OpenFood fact. So John, he introduced Perl's critique to the pipeline, and we thank him for that somehow. So a bit more technical. So our backend is ProductOpner, so it's the backend for the web version. It's a monolithic system for the web, so there's no like front-end backend thing. But it's also providing the API of OpenFood fact. So it provides the database. It provides the ReadWrite API, the website, the producer's platform I talked about, analyzing and enriching of product data, so a lot of regs in every direction, and the computation of scores from the nutrition table, from the data. We are then able to calculate nutrition score, about nutrition, know about ultra-processing, and eco-score, which is even more complex to compute about the environment. So a lot of ingredient parsing, very hairy stuff, and what the architecture looks like. So we use ModPel and Apache to basically query the products which are stored in a storable file system. We are then able to fulfill the user queries, and for aggregated query, we store everything in a MongoDB database for more complex queries. So the data structure is very hairy. OpenFood is a very complex matter. As the year evolved, the data structure became more complex and complex. You have probably one-tenth of the data structure. And we store all the... So this is the old interface, and we store everything. We store all the revision of the food products as well to ensure that... to see the evolution of food products over time. I told you that producers were evolving products to make them better. So we are able to basically go back in time by storing revisions of the products at a given time. So when people scan, they will require product.store, but it will require the last revision. We are also exploring for aggregated queries the possibility of migrating to Postgre. So yeah, that's how we do a MongoDB query. And so the tags are the normalized version of the data, and then we are able to return products that match the specific query. It's very powerful. You can do very powerful stuff like require orange juice that are organic and are sold in Belgium and possibly contain no additives, etc. So the website is in Pell. The business logic is in Pell. So ingredient parsing and data analysis. We have those taxonomies to structure data and data navigation. The score computations as well, and importing the data from producers and even a GS1 connection. GS1 is the standard as ways to share product data. And we also have a knowledge panel system, which is basically driving completely remotely the app. So rich content, images and all of that. We've already done... One thing we realized is that we have to make contribution as easy as possible. So we dockerized the project. We started adding documentation. We are also working on a better API. It's not like very restful or API. And we refactor on the go as we add features, because the food system is currently evolving. We also want to have a more service-based approach as opposed to the monolith. So we have introduced open food fact query for aggregated query. FoxLumniangine for the additional properties. And our machine learning stack is called Robotoff. And we are currently revamping search with such issues and introducing key clock for identification. We are also trying to better document the API with open API and adding more tests and integration tests. Because stuff breaks and stuff breaks often. Things we'd like to do on the technical side, the API v3, lower the barrier to contribution. So probably using a modern web framework, we don't use any. So I saw that there was a Corina talk. We are also considering Corina instead of hash maps. Anonymous hash maps. So it could be, or data structure could be more documented. And globally, we factor the code in smaller chunks, like something for NutriScore, something for EcoScore. But one thing, we are not giving up at all. The core of open food fact is and will remain in Perl. And then, yeah, also more like design stuff. Because our interface is still monolithic and people need to be comfortable with Perl to actually do front-end stuff. So what's next for 2024? We go perhaps a bit faster. We are going to improve the mobile app, do some more machine learning. And also do something on open product fact. So NutriScore is going to evolve this year. So a lot of computation is going to, we basically have to change the algorithm. It's still a very controversial thing at the European level. Italy is trying to block the NutriScore. And so once we compute, we will make it available to everyone. We have also the question of prices. We are launching into price collection due to inflation. So we want people to be able to compare prices and be able to make sense out of the ongoing situation and also scientists. And the last and probably more interesting, Perl-wise, is the fact that we are going to merge all of our projects together. We currently have open food fact for food, but we have also open beauty fact for cosmetics and open pet food facts for pet food. We actually launched those as jokes as April Fools a couple of years ago. But now people are asking to be able to scan anything. So we have four installation of products opener on four different servers and we need to be able to bridge them all together. So in terms of architecture, you can imagine that it's going to require a lot of retooling. So open product fact is all about providing circular solution to augment the lifespan of products. So ensuring that they have a second, a third life that you are able to repair them, to give them away. So augmenting the life of objects with open product facts. So data platform for the circular economy and computing the carbon impact of products and also open beauty fact and open pet food fact. So actually work is just starting if you'd like to get involved. That's just about the right time. We haven't started actually retooling the product opener for that. So in terms of helping, how can you contribute? I'm very well aware that you are already maintaining probably a lot of projects. So the casual way is basically to scan and add products in your country. Translation, spreading the word, designing and of course for those of you hacking and hacking the code and fixing the code. The best way is just to try to install the docker on your machine. It should be straightforward. Also if you'd like to mentor, we will be part of the JSOC program this year. Hopefully we will be. We will also probably try to submit a project through the Pell Foundation. So if you'd like to mentor Pell projects on open food fact or actually to become a mentor yourself, it's not just students anymore. As a professional, you can actually be part of this program. Be sure to get in touch. So how can you get in touch? Those emails, you can install the app using this QR code. And if you scan this QR code, you can actually have a link to leave your coordinates and we will get back in touch if you want to become a volunteer. Either a technical task or non-technical task. And that's it. So perhaps if you have any questions or no. Thank you.
Synergy: a chat bot framework
Thank you. Welcome to Ricardo from Boxville. Thank you very much. Hey, okay. I did a timing run of this, but I had like zero sleep in 48 hours. So either it's going to run shorter or longer, maybe right on time we'll see, but we'll get the tape in. Second. There we go. All right. So, imagine it. It's the future. The year 2018. And at Fastmail, all of our critical systems run through our chat bot. Right? Like you want to deploy, you go to the chat bot. You want to set up a task for somebody else to do. You want to remind her, you go to the chat bot. And the chat bot is written in IR, for an IRC service because we're cutting edge company. And it's in charge of everything. Right? And then I got this email from Slack and it said, hey, just so you know, in like three weeks we're turning off the IRC gateway. And I talked to the shareholders who said they didn't want to close the company. So it turned out that we had to take this thing. This is Synergy, our bot, and go through a raging quick process to upgrade her. To talk to Slack. So this talk is about that project. But it's also about the fact that when we did that, we totally rewrote not every line of code, but all the lines of the interesting code to make it not horrible to deal with. Because it was, this is the three of us who did this. Matt Horsfall, who's at the top middle there, is a frequent person at Pearl Things. He happened to be in town. He mostly works remote. And we said, great, let's drop everything else we're doing. I've been sitting in a room for five days and rewired our chat bot. And it was great. It was written originally for Pearl 516, which at the time was cutting edge. And it was written using Poe, which was not cutting edge. Who here has ever used Poe? Yes. Yeah? Okay. Sorry. This is me looking excited back when I was younger. Like, yeah, Poe. No. This is Poe code. It's, look, you don't need to know everything about this code, but there it is. The thing you should notice is that it's pretty weird. Like, dollar underscore square bracket arg zero. What the hell even is that? Like dollar underscore heap. I use Pearl so I don't think about a heap, right? Like that's what I'm doing. It's, it's a mess. So what you need to know about Poe to understand this talk is like nothing. Don't worry about it. But in, even in 2005, right, when I first, very first started writing the first line of this code, it felt weird to use. And it's not really Poe's fault. The problem is that for a long time, any kind of concurrency felt weird. At least for me and at least in Pearl, right? Anything you're going to do, you're like, why is my code now coming from outer space? And Poe is just more weirdness that I didn't want to deal with. So my strategy for building the software is really simple, right? Do as much as possible without Poe. Don't write the Poe code. That's where everything gets messed up. So only use it when you absolutely need to, like all this asynchronous talking to the network server. And you can simplify that. You can make that statement generic by saying concurrency is weird. So weirdness is hard to cope with in your program. So minimize the weirdness by writing less concurrency in your code, right? Minimize how much your code has to do concurrency. So you imagine the program looks like this. You've got like that magic IRC thing. That's where all the Poe lives. That's where it's weird. And then like a thing that gets messages and dispatches them to something that does something. And we tell ourselves that it works like this, right? The concurrency lives over here and then there's the good code that we wrote. And the magic IRC thing does its magic and it calls the dispatcher and the dispatcher calls the handler and sends them the IRC thing and you're good, right? That's it. And the problem is that's not how it works, okay? Like some abstractions let you believe lies and they're good and some let you believe lies and they hurt you. So you imagine it, like subroutines form a stack, right? Subroutine calls a subroutine calls a subroutine and it returns and it returns and returns. And you can violate that but like don't tell me about it. The handler down here has to return. So either the dispatcher is getting the return value from the handler and passing it back to IRC to send a reply. Or the handler is doing some weird thing to send a reply before it returns, right? So what's actually happening? Let's say it's the first one. We're going to like engage in a little Socratic method. A little go through the logical process. Message comes in to IRC, a network message. And it turns it into something that can go to the dispatcher. Dispatcher sends it to the handler. Handler sends it to the IRC thing. The circle of life, right? Great. No. What actually happens is it comes in from the network and it goes to the dispatcher and it goes to the handler. The handler is like, I got this but it's going to take a minute. I need to look up 70 million rows in the database. And meanwhile, everybody at IRC is still sending these other messages. And you're not talking to the network anymore. You're like asynchronous thing is sitting there like, I would be so busy being asynchronous if you would just yield to me. And you don't because you're voting, putting concurrency everywhere you can. And pretty soon the whole thing falls apart. And like you lose all your messages and everybody's like, why aren't my deploys working? Because of IRC. So the other option has to be true, right? The handler is doing a thing. So a message comes into IRC. It goes to the dispatcher. It goes to the handler. The handler has to do something. So, because this thing's happening. So it sends it back to IRC but now it's blue. Right now it's a different message. It's not the, you've got a message. I want to send a message and everything's good. This is no longer just IRC. It's all your async. Your stuff comes in, it goes over there, it keeps going, you're good. Now you need something to handle both kinds of messages. One for, you know, you've got a message, you're going to do something. One you're going to send a reply. These boxes should be labeled differently. You're fine. Every kind of message that comes in, you've written your own simple, pretty much blocking, but okay handler. You don't even need to dispatch anymore. You just tie it into the message. Here's where I go and you call me. Great. Your code got simpler. The problem is making a ticket involves like talking to the database which in non-blocking terms means starting to talk to the database, doing the talk to the database, finishing talking to the database, dealing with an error. So you have to write all of these little pieces of code wonderfully. They're not concurrent, right? They just block if they need to or they just get called once. They're not doing anything weird. And then your program looks like this. Ah! There's, they call this, this is roughly like the dumb ass version of the actor model. So like Zoolander code. But it's not, it can be good. I just came from the Erlang room. The Erlang room is cool. Actors are cool, but you don't write pearl code that way which means your pearl code feels weird and we actually want to write pearl code that feels like pearl code. So here's what we do. We make a message and the message is going to contain its own reply handler, right? You're like, I'm going to send you a message and don't worry, you don't need to go write all these million things. When I send you the message, I put like a self-addressed stamped envelope. If this happens, send this envelope and that's your little reply handler. And now your code still looks like pearl code. You're good. And you do this all over the place. Like when you're setting up the listener, you're like, yeah, okay, I'm going to bind to the socket and if there's an error, here's what you do. And if you do connect, here's what you do. And by the way, after you've connected, once you start receiving packets, here's what you do. Right? And you're nesting all these envelopes and it's great. Like you've all got a pile of top level envelopes and you're like, yeah. So I'm going to listen and then maybe bind and then maybe connect and then maybe accept. And over here, like I'm going to do an L stat, the block in the file. So now it's easy. I'm going to create a file over here and then do some stuff with it. I'm going to k-pole and like we've got all these nested things and everything piles up and everything is like an envelope, an envelope, an envelope and it doesn't look like pro-code anymore. There's a name for this pattern by the way. They call it callback hell because this is what it feels like. Thank you. Like all of your code is just all callbacks. There's no named subroutines anymore. So you wanted to write this code. Okay, this is all you wanted. You just wanted to make a ticket. You got a message from the network. Put the whole thing up here. I was going to go through it line by line but whatever. You get a message and you parse it and it's like here's the ticket you should make. That's the plan. And then you say if they're allowed to make the ticket, good. But if they're not, you reply no and you return, right? You're done. Crash early. And then you make the ticket and then you reply, I made the ticket. That's the code you want to work. This is the perfect platonic expression of a chat bot. I got a message and I did a thing. And the problem is these three things block. And this is where your whole program just starts falling apart because you've got like 75 kinds of event handlers that all look like this and they all block. So it's okay. You can fix this problem by using sequencing, by leveraging promises or futures. And all you have to do is make your code look like this. This is just another kind of callback hell, right? You're just saying here's all this stuff to line up all up and like you will end up being like I'm living in the future. It's amazing. I can like write all my non-blocking code, but you're so sad inside because it's all these anonymous subroutines that you can't debug and like they're real bad. So remember when I said this earlier, right? Concurrency is weird. So minimize it by minimizing the concurrent code. That was bullshit. Don't do that. You need to lean into it, right? You need to, the problem is this. When you minimize the concurrent code, you write crappy programs because you write programs where you're like all the weird shit's over here and everything else is coping with it. Right? All of your code is just, I'm here to cope. Don't do that. What you want to do is get the language to hide that complexity for you. The language is like, don't worry. You write the code you want to write and I'm going to make it work. And then you make the code concurrent at the slightest provocation. You're like, oh, this might be able to block concurrent. That's what you do. And you can do that now because of asynch await. And that's what I'm going to talk about for a little while. I promise I'm going to talk about the chatbot. So you take this ugly ass code, right? Where you're like, do this and then call this other code, but then call this other code. And if it fails, you don't need to read this. I've read it once. That's enough for all of us. You write this. It's just like that beautiful perfect platonic code except I stuck some green stuff on here. Right? This sub is now asynchronous. It can yield. Then this line of code, I will yield here if I need to. That's all you're saying. I identify this code might be blocking. I don't know. Let's something else figure it out. And how does this actually work? Well, when you do this, something, it's called future asynch await, something takes this and it like pulls apart the whole subroutine into different units. And it's like, I'll put these together the right way. Don't sweat it. I'm going to make it work. And kind of what it puts it into together into is this. Kind of. The reality is what it's really doing is really gross and scary and it involves like mangling optries and putting them together. But that's what all pearl code is anyway. All this time that you write pearl, it's just building some crazy ass optry and like maybe there's one person in this room who thinks about optries every day. Hi, Paul. Most of us. Most of us don't have to do that and you still don't have to. So the conclusion of this long digression about asynch await is you should embrace this weirdness, right? Make your code concurrent easily all the time. Embrace the stuff so hard that like all the weirdness becomes part of you and you don't think about it. But the weirdness is there making you powerful and making your code better and just use future asynch await. Okay. It's not. I'll talk more about it later if somebody asks. I like talking about it's very good. So let's talk about synergy. If there's an unopened bottle of water in this room, I would definitely drink it. Okay. So you can find synergy here. The link will show up again later. You can ask me for it. You can install it. It's super cool. If you install it and it doesn't work, I'm sorry. And that's all you're getting out of me. I might answer a question. We don't support this. This is software written in the open and not a public project. We want everybody to use an adopt. If you come and find bugs, we might fix them and we might say, that's a cool bug you found. Here's how it works. There's basically three abstractions in synergy that you need to know about. Channels where messages come and go. And when I say messages, because in, you know, concurrent object oriented networking code, messages can mean a lot of things. Messages means chat messages, right? Like, hello, how are you? Those messages. And a reactor, which decides, should I react to this message? So that is the synergy software diagram. There you go. That's it. You understand synergy now. And I'm almost not joking. Like, it's really about the simple, which is why it's nice to use. But let's look at the code. And answer the question. Most of the time, when you work with synergy, you connect synergies, channels to your chat system, and then everything is about the reactors. What does the bot actually do? So that's what we should look at first. This is a reactor. It's a reactor that I use a lot when I don't understand why synergy did something. I ask for the uptime and synergy says, I've been up for four seconds and I say, aha, well, you just crashed. Here's how it works. It's a package. It's a class. Everything in synergy is written with moose. And this one does a role called synergy role reactor command post. So the role reactor means it's a reactor. And command post is so that later at the bottom, we can say, here's a command. You can write reactors in lots of different ways. I've been spending lots of my free time converting all of the old style reactors, which was called easy listening, into new style, which is command post. You do whatever you want. I don't care, but use command post. It just lets you write a bot really easily. And then the meat is a single way. The command takes a sub and that's what runs. So when someone says, hey, synergy, what's your uptime? This sub routine runs and it says, well, figure out how long I've been up the duration since the process started and reply. So we got a message in this event and we call reply on it. So you are the best. I am guilty. I have here actually stuck into the message, the ability to reply directly to it. There is some small amount of callback hell. That's maybe the last instance of it you'll see. So this is a reactor. You don't really need to know almost anything about asynchronous code other than make sure you write async and await in the right place and everything will work. So you could at this point install synergy, connect to something and be happy. But we're going to keep talking. The one last thing I should talk about on this slide is dollar event. Dollar event is the object that represents the message. I'm really sorry that I've called it both message and event taking, you know, two useful names that mean the same thing and using them to mean the same thing when I could have made them mean different things. I guess that's better. Here's what the event looks like. And then has text. That's whatever the user typed and has a channel it came from, right? So we said channels are how you connect to your chat network. That's the channel. It has a from address. If you are an IRC, that's like the channel again. Sorry. It has the user it came from if it came from a known user. And was it said in a public channel or in DMs? Was it said at me? Right. So like did someone say synergy? What time is it? Or did someone just say what time is it? Because you don't want the box to respond to everything and send a reply and send a reply. But this time it was an error. That's it. So this is basically the stuff a normal reactor does. So now you know, right? Channels, reactors, and you've seen a specific reactor. Great. Now you know how to handle events. You get an event object and you call reply on it and you do whatever you want in that sub. Where they come from, they come from channels. I'm going to talk about how channels work and how you can make one. But the short answer is don't. There's a Slack channel. You might remember from the top of this talk that we needed a Slack channel and that's why we wrote this whole stupid thing. Synergy's not stupid. Synergy's great. There's a Discord channel because I don't do my personal chatting on Slack. There's an IRC channel, although it doesn't work. I'm probably going to try and bug Paul to get some help on it. It works for a while and it falls over. There's a Twilio channel so you can SMS with your bot. There's a console channel we'll talk about. And there's a test channel because of course you can write automated tests for the thing. Channels are kind of a pain to write. This is the place where you can minimize all the complexity for those things that you thought that you could make not concurrent. You have to make those concurrent but it's easy. But at some point connecting to a remote web service over web sockets and handling different kind of frames and dispatching and all that and reconnecting. That's complicated. So there's an irreducible complexity here. The good news is you won't need to write one but I'm going to show you very roughly what it would look like. You'd have something like this. So this is a stupid subroutine that's like every five seconds sends an event. What does it do? It makes an event object saying the user said boop and it tells the hub to handle that event. The hub is that box and the diagram with Synergy's face on it. It drops it in there and everything good happens goes to all the reactors. But to see how the channel really works we're going to look at the console channel. The console channel is for working at the terminal. I'm sorry if you can't read this stuff. I did what I did. So here I'm going to run Pizzazz. Pizzazz is my local testing instance of Synergy. It just fires up Synergy with a bunch of reactors sitting in the console so I can test with it. I run it to get my little, you know, I've started up and I say uptime. That's the reactor we've all seen how it works. And Synergy replies and says I've been online for one second. So that's it, right? This is how I use Synergy when I'm developing. I stick the reactors into the console and I test there. Because if you've ever tried connecting a chat bot to Slack, you'd think that for a company that makes a chat product they'd want to make it easy. But they do not. It is a real pain in the butt and about every 18 months they change the way you connect a bot. Discord's much easier and it's documented in the repository how to do it. Slack I haven't bothered. But if you look at the top of the screenshot, you see console online, console channel online, console channel online. Why are there multiple console channels? That's a great question. I'm glad you asked. Here's another reactor. This is the announced reactor. Back when we are on IRC, we didn't have our chat for work on our phones, right? That before Slack at all. We just didn't have it on our phones. But you might be at lunch and lunch is running long and you want to say, I'm really getting back. And there was a Twilio channel, right? So you text the bot and you say, announce I'm still eating. And then Synergy would receive this message on the Twilio channel. It would go to this reactor and this reactor says, okay, I got in the vent and it's not from the channel I want to send to. The two channel name. We'll come back to that. And if it is, I say, like, what are you doing? You're telling me to announce something but you're already in the announcement room. And otherwise, she'll look up the channel, the two channel, and send a message there. Send a message there saying this. So when I would text the bot saying I'm still at lunch, the bot will post a message in IRC saying, Rick says he's still at lunch. And this all works because you can have multiple channels in your Synergy. This is one of the really keen things about writing asynchronous code. You can have lots and lots and lots of things in your process and they all work. You can have lots of consoles that talk to each other. So here, in my testing environment, I've spun up several console channels. Now only one of them is getting my input because I can only type to one terminal at a time unless I want to do something really weird. And I've set up the announce plugin. And I can say announce, yeah, I was going to do a live demo of this, but I didn't because I've got enough going on. And what you see is on the input-output terminal Synergy says, great, I announced it. Thank you. And on the announcement thing, you see the message come in there. So this testing environment is simulating multiple channels. I still have a purple channel which you won't see in this deck representing Twilio. So you can say like this should page somebody's phone with an emergency and you'll get the page showing up here like, yeah, I would have sent a text message, you're good. So it's all nice and simple. The one thing you might be wondering is what's up with two channel name? So in the world where that's like IRC, two channel name here might be private IRC server, just a string. And it says that's how you're going to go find the channel off the hub. Which channel am I sending to this one? But where did it come from? Like how is this set up? How is it configured? Well remember all the channels and all the reactors and everything else, they're moose objects. So there's an attribute on the object called two channel name and it's a string. Now if that's all we did we'd be a little screwed up because at some point someone would try announcing and we'd realize we had a typo in there and it would crash at run time. So also when the reactor starts up, when the wind synergy is really booting up and connecting, she'll say, do I actually have a channel called that? And if not, crash early, crash early everybody. But that's it. Everything, all the reactors work this way, they're all configured with attributes on the objects which is what you want. That's just one more turtle, right? But where did it come from? This is the bottom turtle. It comes from a config file. So you've got a config file where you list all the plugins that you want, all the reactors, all the channels, and all their properties. And somewhere in here at the top you'll see the announced channel, the announced reactor, and it says here's the address that I send to, and here's the channel on which I will send to that address. And then you'll see all these other reactors that are configured just the same way, the clocks reactor. Which time zones do I care about? Melbourne and New York. There's a DC reactor that you can use to run DC calculator programs. I didn't write that. Okay. So now we've written channel, we know how channels work, and we know that all the stuff comes from configuration. That's great. Now we're going to talk about linear. Linear is not part of Synergy. If any of you don't know about it, linear is a bug tracker. It's like a work tracking system we use for running our scrums and stuff. It's really, really good. I like it a lot, and I'll tell you all about it when you want. But what you do need to know is that linear, like a lot of web services, does webhooks. So you can say something happened to one of my issues, and a post gets sent to wherever you want, saying a thing happened to one of your issues, and you can respond. This is great for like, I track a calendar, right? And if somebody moves an event on the calendar, I get a post telling me this thing's been rescheduled. Consider whether your whole day has just been upended. Webhooks are great. And linear uses them, and we want to react to them. One of the things we use them for is escalation. Escalation inside a fast mail basically means a customer made a ticket, the support team who are great. They don't really know what's supposed to happen next. They escalate by taking the ticket and saying escalate it. They put a flag on it, and it goes to the developers. And when we do that, we want to do something like this, right? Make a message that just says, this issue got escalated by so and so, and here's the link. And we send it to the escalation address, right? Which is pound escalation and fast mail slack. And this is straightforward. I think if you've followed things so far, you follow this, except you might be wondering, where do you put this code? Right? Like, it's got to go someplace, so that's a good question. But you're not going to put it in a command, like uptime, because there's no command to say like, hey, check it, you got a webhook. That's not what a webhook's about. And it's not in a channel. Remember when I had to, like, tediously explain that channels are about chat messages, not just generic messages? So it's not in a channel. Where is this post going to go? The answer, the answer is it goes in a reactor. It doesn't need to be in a reactor. It's where we happen to have put it, but it's not because it's a reactor. It's because it's got this role called HTTP endpoint. And you say, in addition to reacting to chat messages, this thing is a web handler. And you say, I wanted to take the path linear notification. So when you connect this thing up, slash linear dash notification will now be a path that you handle. And how do you handle it? Well, you've got some async sub that is a plaque handler, because if you're writing web stuff in Perl, you probably want to do it with plaque. And that's kind of, I mean, look, there's a whole bunch of code here that's figuring out, like, getting the thing, authenticating it, figure out who's who. But this is basically it. You say this hunk of plug-in, and anybody can write a plug-in, requests a path for web service and mounts a plaque application on there. You know, and then at the end it says, like, yeah, and then return 200. So now this is HTTP endpoint. How does that work? Synergy runs a web server. And you say, I want web service to be provided on this port, and all of the channels, all the reactors, all the every other thing that has a HTTP endpoint, mounts onto those paths, conflicts are detected at start time and it crashes. And then when a request comes in, Synergy dispatches to the right place, and because they're all asynchronous, they can all interact. And that's a really important point, right? All of this whole diagram, this, all of the, every reactor, every channel, every HTTP endpoint, they're all in one process. It's just like one program that's running with everything loaded in it. And to share data, they share memory. There's no IPC, and this is a big win. Like, I don't want to say that IPC is bad, and IPC is the enemy, and I certainly don't want to say everybody should share memory. To share information, like these are big, broad claims. But we do have to talk about IPC sometimes. IPC solves problems, right? What does IPC mean by the way? It's inter-process communication that lets you have two processes talk to each other. But that's not the solution to a specific problem, right? That's not valuable per se. It's valuable because you have a problem that you could solve by having two processes talk to each other. And the question is, when does that problem arise, and when is IPC the right solution? Well, a good one is, right, if you have different parts of your system that scale differently, need different kinds of resources, need different access to things, maybe different processes are useful. You can scale up more workers, thank you. You can scale up more workers, you can scale down workers, that might be really useful. Maybe you have security constraints. This process needs access to certain constrained resources, needs to have these namespaces, needs to talk to the kernel, whatever. And this part of the program, the part of the system doesn't. That's a good reason to have two processes. And maybe you have to do work where multiple things need to be happening at once, and you have multiple processes to eliminate blocking, causing your code to be sequential when it's not going to be sequential. This is, you know, where we often would have multiple programs running or things forking. And it's fine. But remember that any time that we add a solution to a new problem to our program, we're almost always adding more code. And when we're adding more code, we're deforming the program from that ideal platonic version that we're like, well, if I could just write it, it would look like these eight lines. And then we go add all the code that solves all the problems we don't want to think about. And what we always want to be doing as programmers is picking the changes that deform our platonic program the least as possible. And program is always compromised between these things. Once upon a time, it was pretty clear, especially in languages like Perl, but kind of in a lot of programming, that if you had to eliminate blocking, the easiest, most effective thing to do was to go to have multiple processes, right? Fork is a great example. I need to be able to handle a lot of requests. I'm going to fork. Yeah, that makes sense. Forking's easy. It solves a lot of problems. And then later you have to introduce IPC because that's how life goes. But, you know, that's what you're going to do. I don't think it's this clear cut anymore. I think that at this point, we all need to be reevaluating when we want to eliminate blocking and have more communication between multiple kind of concurrent operations. Whether forking IPC is the answer to jump to in Perl anymore, I don't think it always is. I think it often is not the right answer anymore. And that's because of Asynchowate. Asynchowate's really, really powerful and it really moves the lever on where you should be picking which solutions. It's not just a Perl thing, by the way. Hopefully everybody here writes in other languages. Also, it's important to put your eggs in multiple baskets. You'll find this abstraction in a bunch of places. It's very good. Okay, one more thing. I've got a little time left. So, take a breath. Got quite a bit of time left, which is good. So, we've got channels and we've got reactors and we understand those. And we've got these HDP endpoints. And there's some other stuff we've gotten here. Maybe we'll even talk about more. But at some point, I thought, you know, it would be really cool to stick inside of Synergy a telnet server. So, we built a thing. It's not really telnet. Telnet's actually a protocol and has all kinds of weird stuff in it, like control characters. And don't learn. It's a netcat server. So, there's a netcat server built, they call these TCP streams. There's a netcat server built into Synergy, which is called the diagnostic uplink. So, here I am back at my terminal. I run my local development server with a diagnostic uplink available on local host 4321. Because I like those numbers. And when you tell it in, you get greeted with this. Welcome to Synergy. All right, you have connected to the diagnostic uplink. Would you like help? Of course I would. I don't know how to do anything. So, I say slash help. It's like, here you go. You've got some diagnostic commands. You've got notifier commands. Stuff for inspecting or running Synergy. Because when Synergy is acting, when your critical chat bot is sitting there and acting weird and you don't know why it's doing that, and you don't know what's happening, you know, you can reboot it. And that's fine. Thank you. You can, you hope that's going to be okay. You can, like, look at the logs. And I make a lot of logs. That might help, but most people don't write logs, and that's not going to help. But another great answer is, like, yeah, just connect to the thing and ask it questions. So, you can say, like, tell me about your configuration. I'm running a web service here. Here's this file. You can say, I don't show here, show me all the endpoints that your web service listens to, so I can see all those. You can say, show me all of the notifiers currently connected to the event loop. And it's going to show you, like, all these things are going on here. They get names as they're generated, so you can see things like, yeah, there's 47 open web requests all talking to GitLab. Well, that's probably a problem. Really useful. You can also get this guy. This is so good. Eval. So, you can say, I'm going to connect to the diagnostic uplink and instruct my Perl program to evaluate a string of Perl code in the context of the running bot. So, here I am saying, hey, bot, tell me your name. I'm Synergy. Great. What's your reference address and memory? Here you go. These are stupid examples. You never need to know the ref address of the bot. But you can do things like connect to the bot and instruct it to change its configuration as it runs. You can connect to the bot and add and remove reactors. You can do anything that you can do with Eval as long as you're happy typing it into one line because I have not implemented multi-line input. It wouldn't be that hard. I'm super lazy. Okay. That's everything I plan to talk about. We have a couple of minutes left. I'm happy to take questions. Yes. This might sound confrontational, but it is not. So, actually, it does. I'm going to make Elixir developer. The code you showed looks very much like how you would actually write an Erlang language. Yeah. So, why actually use her for use cases like this? No, it's a great question. The question is that it? Can I just say, well, I would say maybe the async stuff, like the tasks could be per, but actually the... The framework, yeah. No, it's a great question. The question is why do this in Perl when Erlang has a much better language for it or Elixir? I'm not to make you put any tone into that at all. That's the question. I think it's a good question. The answer is a boring answer. Well, the original version was written in Perl and all the little handlers were written in Perl. What was easy to do? Switch this to Perl. I also really like Erlang and I really like Elixir and I think that they're really well suited for this. In fact, in a lot of ways that we didn't talk about, like any one of those reactors crashing has to be handled by the hub saying like, oh, an exception happened. Don't worry, I'm going to catch it and recover. And like if a channel crashed, you have to figure out reinserting the new instance of the channel into the hub and what about its pending messages? That stuff's all solved, right, on B-machine languages. But we wrote it in Perl because we write Perl. And I think that if I had said, guess what everybody, we're rewriting the bot in one week and we're doing it with OTP. We would not have written that bot and nobody would have bought me a beer that night. Yes, in the back. It's hard that async await is much better than callback hell and also said that some of the loop is kind of callback hell put with upgrades. So can you expand a bit on how async await is better than callback hell? Yeah. And you mentioned that there may be some duplicates, a definite difference in debugging, but anything other than that? Yes, sure. So the question is how is it the case that using async await is practically better than callback hell? Larry Wall says that you can never eliminate the complexity in your program. You can only move it around, right? You can move around the lump under the carpet, but the dust is all still there. And my view is often that what you want to do is take the things that are complicated and obnoxious and pack them into an infinitely dense ball that lives at the center of your program. And everything else is beautiful and living on the outside. I got one minute, so this is maybe my final concluding remark. You want to put all the complexity deep, deep down in the middle and have everything else be simpler and built on that. Callback hell makes the programmer writing the application think about the complexity. And async await makes Paul think about the complexity. It makes one person cope with that. And I think that is why it's practically superior. Just curious how many in this room have endured future async await? Yeah, who else has used async await? Six, seven people? Yeah. It's very good. It's very good. It's got problems, but mostly they don't come up. And I use it every day because mostly they don't come up. Okay, if you want to run synergy, that's the URL. It's really good. Don't expect to get technical support. I'm going to change stuff whenever I feel like it. That's it. Thank you very much. Thank you.
The CPAN Security Working Group
It's all right. I am early or on time? I'm on time. I'm punctual. That's brilliant. So, hello. My name is Salve Nielsen. I'm one of the few fellows that hack around with the Netherlands in Oslo, Norway. And last year, I bumped into, with some other people into thinking about security on the seapen. So, stuff happened and I'm going to tell you about that now. And a little bit, it's kind of an introduction to FOSTA. Similar talks have been given at other conferences already. And a little bit of an update. So, I hope you can bear with me. So, we were established at the Perl Tulsen Summit in Lyon last year. And the purpose here is to basically feel an void about caring about seapen security. There are already people who care about security in the Perl community. Mostly they live on the P-File porters list. But when it comes to the seapen ecosystem, a couple of us raised our hands and said, okay, we'll try and do something about that. These are the people that showed up at the Perl Tulsen Summit. And a bunch of these are also on the seapen security working group. So, what's in scope for this working group? We are, there's a lot of people who are interested in the security of the Perl. So, we try to do security outreach. That means information work. It's maybe not obvious that's needed because of course everybody knows how to Google and figure out something. But we try to think a little bit about how to do things that are connected to the security of the Perl. So, that includes making sure that important security issues that are probably registered as a CVE. That if there are anything that show up in the CVE index that they are responding to in a good way. And we're not solving the problems. We're helping the people who are involved in the project, for example, that doesn't have a responsive author will make a little bit of an effort to try to find a replacement or solve it that way. This is basically what happened with spreadsheet parseXL and parseXLX. And we are super happy somebody stepped up and actually resolved those issues. And we do some coordination with other scientists through the search.org VINCE interface. And so, we are trying to build up a network so we can make sure to report things properly and share the information we have and help those people who need help. And there is some triaging and coordination going on there. And the goal here is to make sure that important vulnerability issues are not ignored. So, that's one of the major topics we're working on. We care also about having a good vulnerability index. There are, I think, one or two options right now. This one, called C-Pan-Odeta, I think, has something going on there which is useful. But it needs to be up to date and we want to help with that and maybe see if we can integrate it with other indexes out there. Furthermore, let's see what's going on here. That was not the point. Okay, the screen is saying hello. Okay, sorry for technical problems here. It looks like my computer doesn't like the USB C connection here for a moment. Sorry about that. Okay, let's throw it out and put it in again. That's always how it works. There, sweet. Yeah, it would manage to fix itself or the old computer are just saying. So, yes, vulnerability index. We also care about what's called supply chain provenance, which is basically where the stuff come from and how did it become the way it is. And in general, supply chain security. Things that we are working on there. Look here. It's already disappearing. This is a bit annoying. I'll try to continue. We want to make an effort to make sure that all the C-Pan clients use HTLess by default, for example. So we connect quickly to the servers that we want to download from. We want to make use of something called the update framework, which is used by other packaging ecosystems for securing the whole process of publishing and sharing the modules out there. We want to introduce repository signatures and author signatures at some point. We, moving on, we have, come on. It looks like I'm having more trouble than is necessary here. This is quite annoying. No. No. No. All right. So we are looking also at, oh, this is the wrong page. Interesting. We're also looking at tracking all the changes that happen on the software. Look here. Using S-bombs, software below materials. That's a huge topic and demands from that downstream when people in running software on critical infrastructure, for example, have now, they're obliged by law to keep track of dependencies and what's going on. And this whole field also includes solving the problem of how to refer to the depends across package ecosystems. So with that, there's something called package URL, which is currently in use by a lot of systems out there that and S-bomb standards to refer between two packages in different ecosystems. If all goes well, we'll actually have C-pans as part of the package URL standard, sometimes this weekend, I'm hoping. I talked with the author yesterday at the party, at the conference here in Brussels. And we want to improve the indices in general when it comes to interoperability with other indexes, package indexes. Let's see. Since we don't have slides here, this is really annoying. So I'm sorry this doesn't work as expected. Does anybody have a USB to HDMI connector? USB C. No, no, that's, I need to, I need to, female HDMI. Ah, okay. Let's see if this helps. Crossing fingers. Because if it doesn't get better now, then it's not my computer. All right. There's something called transparency logs. There is some tooling called six store and six some that we want to take inspiration from to create transparency around what changes happen on C-pans. So if something is updated without anyone knowing, we want to detect stuff like that. We also would love to have a way to do a patching of C-pan distributions when an upstream of there is completely unresponsive and we have no way of resolving a crisis quickly. So to publish a patch in a structured way so that, say, for example, a client can detect, oh, there's a patch that is not applied here, but we do want to download or something like that. We'll see how that works. It's a current dream we're having. We do care about compliance and privacy. So having an idea of what kind of legislation is relevant for us. That's super important and documenting that stuff is part of that. So we have a reading list already published. We also want to have good tooling for software composition analysis and like checking finding ways to detect if something, some of your dependencies have certainly gotten a vulnerability or something. So we say, for example, during a test run configured, oops, there was something happening. One of the dependencies you need to update is lots of good ways to do that. There's already some tooling in place actually, but these are what we want to do. There's also the act of having a project management. So we're taking that part of that and that means creating a good charter, having a pre-release disclosure agreement. That tells us under what terms we can share information or not. And do general information around how things are put together as an organization and which place we play in the larger ecosystem. Funding is also an important part of this because I have to be frank here for a moment. And that is working on security issues on behalf of others on a volunteer basis isn't always fun. Sometimes it can be increased like horribly boring or frustrating or just solving problems that I don't have. I imagine this is the same for everyone. So we're looking also for finding ways to actually fund some of the work that we want to do. And there's a whole lot of other stuff we want to do. And the most important thing for us is that while Perl isn't the super big thing it was 20 years ago, it's still used everywhere in critical infrastructure and in important businesses that with money is earned right now. So people call it legacy systems these days, but we have to remember legacy means also earning money. So we cannot just ignore and say I'll rewrite stuff later or we'll just update. No, no, we need to update stuff now and we need to figure out exactly what's running and to make that happen. We need to enable a whole lot of things using the stuff I already mentioned. And there's also some cultural things worth mentioning. And in the Cpan and Perl community, we don't always think actively about security. So we're hoping to be a little bit of a catalyst for over time to change the culture also. And that means learning new stuff, not only being a DevOps, but thinking also how to become a DevOps, or Sec DevOps or whatever it's called. The security becomes part of how we operate. And in my opinion also we're pretty good at having our own ecosystem where things have worked for a long time and we know we can trust it and it's been very predictable. But we're not that good at interoperating across the ecosystem boundaries. Like say for example if you package something in Debian, it's like how do from Debian's perspective is what do we have to do to make whatever these guys are doing work in our environment. When we could have used say good standards for communicating dependencies in a machine readable and common way that works across all kinds of ecosystems. That's a super interesting problem that people are working on right now to solve personally and I hope we can be part of that. So why do we do this? There are new security demands coming from the EU and from an old executive order in the US. These are specifically aimed at institutions and companies that write software for critical infrastructure and that could be anything from power internet access, street light management, water treatment plans, administrative systems, all kinds of places throughout society. If something breaks it affects the normal operation of society in a negative manner. That means these two directives applies. For the cyber security sector which is still upcoming, it's more about internet enabled devices which basically means anything from toys to phones and all the systems that connect to and update those. So that means everything. So we will be affected. These laws are coming this year and will be rolled in in the next few years. I think it's 18 months or something. So this is upcoming stuff. That means we have the legislative guns pointing at us basically. We would also love to find ways to show that those of us who publish things on c-pan have our ducks in a row. We have the things in order. People can't trust the code we publish and we do that's what's necessary to make that happen. So there's some awareness raising. So we're discussing blog posts and all kinds of other ways to get more people involved in this. Who are we? Brenno, Graham, Inge, Jose, Andreas, Leon, Olaf, sitting there, Pete, René, Sam, Salvis, Mi, Stig, sitting there. Tim, Merein isn't here today. And a whole lot of others. We, these are a couple of the people that were at the Peralta Julesin Summit. I'm there. It's a photo of me where I don't look horrible. That's good. So that's Stig and Inge and Leon and Merein and Brenno. And the reason I say all the names here is to make a point actually. When somebody talks with you about supply chain security, there are people like this and the group picture that are actually working on the supply chain, the bits and pieces that make that up. On a volunteer basis. And meaning humans. It's not like a black box where suddenly stuff appears. We have to actually think about this as almost like our open source colleagues. We work together with these people. So what I want to hear is to ask you to join us. Do you care about open source security? Do you have some extra tweets, some time to spend over? Do you have a manager that is aware of that there's a security commons out there that is shared and that needs to be updated and kept alive and kept healthy? And you all would like to fix security yourself. Please contact us. We need help. We are a bunch of volunteers right now, but we do not have all the time needed. And at the moment we don't have the funding either. So there's that. So to find us, we are on our seal and there's a link there. You can find all the necessary on security.metasep.org. You could also use the security.zip.org and the mailing list where we coordinate stuff is the zip and security. It's closed off, but with a little bit of dancing and singing, you can get into there. So I don't know. We probably don't have time for questions and comments. Two minutes. Two questions. Yes. Three, very short remarks. First, I'd love to see a module creating a sports, lively. Yes. Working on that. Okay. If you want to help, talk with me. Second, I'd like to have 502 support for a stick support for any of the big frameworks we have in the fall. More delicious or dancer tour. We won't do anything on that, but if you want to publish something, go ahead. I've been looking into that a little bit. Okay. Who in this room has a Vince account? I have one. I like it very much, but please make yourself. Vince is a vulnerability sharing system that search.org runs. We have a couple of us on the have it already. So if you are scared about security enough to have an account there, you're welcome to join us. That's a very good criteria. But of course, please actually help. We have a lot of people that are having bystanders looking at. There's something called the bystander effect where lots of people look at an accident and waiting for someone else to make the first move. That is, we cannot have that. We need people to actually want to make sure it happens. Having a Vince account, maybe not enough, you have to publish yourself and say to them, hey, we use the problem. Yes. There's a whole lot of stuff. More questions? One question. Well, you get the difference from everyone, but for me, it's that we need more people who are actively working at the moment. But you have a whole lot of stuff, which is all of them are good things. I should try to paint a picture of today. And if something tickles your brain, then you're quite welcome to join us and make something happen there. If you know something we don't, then please tell us. We're in the process of learning. I'm getting an idea that this is the end, so I will say thank you. And I hope this was useful for you. And please get in touch if you care about security on CPAP.
openQA - How do you test a testing software?
I'm going to go to the next one. It's always interesting to see people here and this group. A lot of people I've seen before and a lot of people I've seen giving talks before. Somehow, it never happened that Tina gave a talk here in Boston. She's been here before but actually she's first of all, thank you for a great welcome. Hello, hello, can everyone hear me? The microphone might not be on. What, is it on? No. Okay, alright. So I'm trying my best but remind me if I'm getting too low. So, yeah, I'm going to do a talk about OpenQA. Who of you has heard of OpenQA? Okay, a couple of. And are you using OpenQA? Okay. So I'm doing Perl since 1998 and I'm working now as an engineer through the software solutions. And I'm in the tools team where we develop OpenQA and it's written in Perl. And just to give you a little short demo. So with OpenQA you can test an installation of an operating system and you can start applications. You can pretty much test everything you want. So here you see the installation process of Open to the tumbleweed. It's not real time. That's a bit fast forward. But yeah, here it gets more boring. Okay. So it's also, it's not only used by OpenSusul but also for Fedora and Debian and actually more. I think AlmaLinux is using it. And in this talk I'm going to demonstrate you a little bit the web UI and show some relevant test API functions, the project structure and how we deploy and how we actually develop and test it. Okay. I think I'm going to sit down for... Okay. Is this readable? So here you can see all our tests and they are grouped into so-called bills. And here we have tumbleweed and on AR64, PowerPC. And then we can click on a build and see all tests of this build. There are three main states of a test. It can be passed. It can be failed or it can be soft failed and soft fail this like, okay, we know about this bug. And it's not critical for the release but we mark it as we need to look at it later. And let's look at some actual tests. So this is a tumbleweed DVD installation. And you see all these boxes, most of them are screenshots but there can also be informational things. And here, we can move through those screenshots. And here we see a screenshot of the installation where you have to choose a time zone. And we call these screenshots needles actually. So a needle is something that we want to match for. And here at the top it says 99% matching. So that means the screenshot that we got is matching our expectations to 99%. And why is that? So we have this bar here and at the left side you can see the actual needle, what we expect. And on the right side it's the actual screenshot. And you see that the font has changed a little bit. And we don't care much about that. It's still okay. And that's why we set a threshold of something like 90 something percent and it's still matching and it's okay. And here's another needle. And here you can see, so the upper area of the picture is what we want to match. And you also can see this gray area where the penguin is supposed to run around. And it's gray because we can actually go into the so-called needle editor. So we can actually live edit such a needle. And here you can see the screen area is that what we want to match. There are also some red areas and I'm sorry it's not colorblind friendly yet. But you can use some such red areas to exclude certain areas because we don't know where the penguin will be at the time of the screenshot. So we just exclude that. And then you can also review the JSON. So a needle actually consists of a picture and a JSON file that says which areas should be matched. So here's another needle and it's showing the desktop runner. And this is actually showing another purpose of a needle. We don't want to only make sure that we get what we expect. But we also need this to proceed in a test. So if I'm in the test and I tell OpenQA to send the shortcut for the desktop runner. And then immediately type something or tell it to type something then it wouldn't work because it takes a millisecond until that pop-up is actually there. And the easy way would be to just sleep one second right? Or maybe to be sure two seconds or maybe rather five. It can actually sometimes take longer because the worker on the test is running tests in parallel. So there is this function called a third screen and you can give it a timeout. So for example 60. And then it will take a picture every second until it gets this picture what we expect. And then it knows okay now I can type the command. So because otherwise if we always sleep five seconds then the test would take a long time. We can also look at the log files of the test. And settings. And here we have all these job groups so you can see what kind of stuff we are testing. We are actually testing OpenQA itself. And actually, yeah I don't have any screenshots here. And here you can see a screenshot of OpenQA inside of OpenQA. So we use our own software to test ourselves. Okay so that was the demo. Here are some code examples. So here you can see the assert script run call for example. Which is just sending some text to run and it's asserting that the exit code is zero. It also has a timeout. And this is the job group configuration. We use YAML for that. The YAML can be huge and so we are actually using the YAML merge key to avoid duplication. Okay so far about the demonstration. And yeah these are the test API functions we have. The most relevant you would probably like to know is you can send a key. You can also send a key until a needle matches. So something like enter until you get into the BIOS. You can have screenshot related functions. You have the mouse functions. So mouse drag is a function. And click and double click. What we don't yet have is you cannot see the cursor, the pointer moving. So if you want to use it for demonstration which is actually a very good use case for OpenQA. Just demo your software with writing a test and having a demo at the same time. But you don't see the mouse pointer moving yet. I have proof of concept pull request for that but hasn't gotten in yet. So and you can even write test modules in Python now. But that's boring for you because you're in the pull and drag room. And this is how it would look like. So we have all these functions like send keys and set var help. I'm in a Python script, trapped in a pull script. And okay now to the project structure. It's split in two parts. O S O 2 inst which is a name. I don't like much because it's hard to type and pronounce. That's the actual code that's running the test. It started with this project and OpenQA is all the stuff around it. For viewing the test, configuring, worker schedule, managing assets like ISO files and Q-CAL files, API and WebSocket. And it all started in 2009 by Bernhard Wiedemann working at SUSE. And our code is using Modulicious by now. Using as the HTTP user agent and for the web server and the classes. We're using DBIX class and it's really helpful. We're using now subroutine signatures. In our tests we use test warnings to avoid that we actually get any unexpected warnings. We're using test mock module, mock object and mock time. And for the tidiness we are using pearl critic and pearl tidy and develop cover of course. But we also have a lot of JavaScript, Python and shell code. O S O 2 like I said, that's the heart of the software actually. It's called ISO2Video, the main script. It takes an ISO and makes a video. When you develop a test you can actually run it directly if you have an ISO file and some vars. Then you can actually start VNC viewer to watch what's happening and also change something. If your test is actually bad and you have to want to try out stuff. And our deployment is fully automated. We just merge pull requests and then, okay, with every new commit, the open SUSE build service will fetch the new commit. And then we also do a separate update on the web UI regularly and on the workahouse. And necessary service restarts are happening and database changes will also be done automatically thanks to DBIX class deployment feature. The open SUSE build service, it's used for all open SUSE packages. It can build RBM and also other packages. So here we have all our related packages to open QA. And how about testing? So here in open QA we have 98% code coverage and for OS outer hints we have 95. So how did we achieve such a high test coverage? We cheated. Well, at least we do cheat a little bit if you look at that. So there's this feature about Devil Cover which lets you add this comment, uncover the statement. And we have a couple of them and most of them are actually just, thank you, in the test directory. Yeah, but compared to 37,000 lines, I think that's okay. And here's the coverage trend we get from CodeCov. And yeah, our general tests are under T and then we have API tests, UI tests. We are using Selenium currently but we consider use changing to Playwright. And yeah, the tests are actually included in the coverage. So, and yeah, we also use open QA to test open QA. I showed you that. Some of our tests are forking and yeah, ideally everything should be turned into a unit test where I can, where I don't need to fork, but still Devil Cover is able to do that if at the end of the process you add this line and then the coverage of the fork will also be collected. And CodeCov will complain if pull regret adds uncovered code. It will also complain if the percentage goes below a certain threshold and some directories are actually already marked as fully covered. So, if there's any line that goes uncovered there, it will also complain. And since we are using the Merify bot, then nothing can get merged if it's failing some of those tests. And so you need two approvals and no failing test and then it gets merged automatically. And that's working quite well. But checking CodeCov might not be enough. You know, having 100% CodeCov, which doesn't guarantee you anything, well a little bit, but so pull request authors are encouraged to add new tests for every pull request. And writing tests is seen as a part of every ticket we work on. And refactoring is also encouraged and for every regression we encourage people to think what we could do to prevent similar things in the future. And yeah, I showed you that already. And okay, I don't know how many times we have a question, but that's it for me. Thank you. Any questions? One minute. Okay, no questions? Then, alright. Thank you.
Corinna—Perl's new object-oriented system
Ah, good. So if you're on YouTube, you probably just missed the first five minutes of this. I said nothing. Don't worry about it. So I decided rather than do what I had done previously, I'm just going to give an overview of all the major features of Karina for the minimum viable product that we're putting together. So it can be, you can have a fairly complete idea in your mind what's going to happen, because I actually haven't done that talk before, and you probably don't want to go and read a multi-section RFC and all the work we did to put that together. So since PURL 5, object-oriented syntax here was just less in IZA. There's a little bit more than that, but this is primarily the bulk of it. The model was mostly stolen from Python, and I also do Python programming. I can see the similarities. Larry regrets stealing it from Python. I can understand why, even though I like Python, I'm wrong. But blessing is that all they do is say, we have methods, and where are those methods? I'm taking the short version of this, because we're not going to spend a lot of time talking about the original version of object-oriented programming at PURL. Because it didn't give you much. Basically if you wanted everything that you want out of a class-based OO system, then you've got to write your own constructors. You've got destroy method in PURL, but destruction is non-deterministic, so that's kind of a frustration. It doesn't work as well as you'd like. If you want to maintain state, if you want encapsulation, all the sorts of things that you expect to have out of an out-of-the-box OO system you don't have with bless and IZA. And everyone had to redo it themselves every single time, and if you're a programmer, you know you don't want to do that. You want to abstract that away. So people have abstracted that away a lot. It's going to depend upon your definition of what a class is or support for a class is. Well over 80 modules. This is not an exhaustive list. I just decided to order them alphabetically by link. Have fun picking out the one that you happen to like. If you're familiar with the Lisp Curse, or if you're not familiar with it, go out here, your favorite search engine for the Lisp Curse. It will be the top hit, and it will explain how that mess came about and what we're trying to fix. So let me make that a bit larger because I can't read that. Okay, so not everything that you see here is implemented, and not all of it's going to be implemented, but you do want to see object pad that Paul Evans put together. That's a test bed for many of the ideas of Karina. So we can make sure that it actually does what we want it to do. And there are companies who are using this in production. It is so valuable to them. So some of the things you might see will change. It's work in progress, but I think I've tried to strip out anything really problematic. I'll call out the things which are what you're saying is work in progress, but this is pretty close to what we can expect. A simple class. It's very simple. It's not exciting. You create a new person. Name is Ovid. You print the name Ovid. Here you give them a title. You print the name. It automatically pre-pens it with the title. So there's Dr. McCoy. Very simple. This is not complex. On the left-hand side, that's how you're going to do that using Bless in Old Style Pearl. Here's how you do this in Karina. Note that almost all of this is very declarative in nature. You might quibble on one point. We'll come back to that later. But it's very short, very concise. You probably didn't notice this. That will mean your code's not going to work correctly because you misspelled the name. It's not even going to give you a warning. It's just going to silently fail. Sort of bugs we love to have, silent failures in code. In Karina, because that's a lexical variable field title, that's going to be a compile time error if you misspell it. That's Moose, by the way. Moose didn't gain us a lot. Not true. It does have Izzah. Izzah string for those various things. You could do non-empty string for one of them might be better. We argue about that all day long. But basically, Moose is not more terse. And it also has a lot of startup overhead. It's not slow per se anymore, but it's not the fastest thing in the world. But it does make writing an OO code better. In Karina, same thing, much more terse with the exception of the Izzah. So let's just walk through this so you can understand what's going on. To declare a class, you just say class, person. It used to be to declare a class, you couldn't. You would say package, person. And then you would bless a reference into that package. And it wasn't really a class or package. It was kind of this thing. Now they can be separate. They have a future where we can truly disambiguate these things. I might add, you can also do it this way with the post-fix syntax. I prefer this syntax. I will have it on the slides. I argued strongly, as the lead designer, I thought I could get away with this, that we're going to require the post-fix syntax. I lost that fight. And so everyone basically almost everyone disagreed with me. So I went ahead and said, OK, we'll go ahead and make this optional. But a lot of my examples, well, the post-fix syntax, absolutely not required. So don't stress about it, because I know people gave me grief at first a lot. Field, dollar, name, colon, param. That is an instance attribute, or instance field, instance slot, depending upon the language you're coming from. That's just a piece of data tied to the instance after you construct it. Because it has colon, param, it is required in the constructor. You cannot not pass that, or else it will blow up. Same thing with field, dollar, title, except it has the equals on death. That means it is optional in the constructor. You do not need to pass it in. Or you can use equals misses or something. You can give it a fake default title if you want to. Anything after the equals, you can just evaluate and run it, and that will be assigned as a default value. And then we have our name method down here, where we just access those variables directly. This gives us a chance for a lot of performance benefits. It also tremendously encapsulates this data, something which has been traditionally very, very hard to do with older Perl, because you could always reach inside the object and do stuff. Many languages make it easy to reach inside the object and do stuff. When we eventually get around to implementing a meta object protocol, you will be able to reach inside the object and do stuff. But we're going to make it harder. And the intent is you will be allowed to do it, but when you're doing things you shouldn't do, you got to put some more effort in there. It's going to be easier to show up on code reviews or just with grep. Karina, out of the box provides constructors, destructor, state, composition, encapsulation, private methods, and so on. The private stuff might actually not make it in the MVP. We won't cover that. But basically, most of what you want out of a class-based OO system is there in a very short declarative syntax. Just like that, very easy. But there's more than one way to do it. So I mentioned this is mostly declarative. You see the method down there and you're going, I don't have any way I can change the name and title. Everything by default is pretty much immutable externally with Karina. So I'm not mutating that. So why am I even calculating it every time? I could just make that a field. Reader equals if defined title, title name, else name. And that's computed once and only once at object construction time. And fields are generally evaluated in the order that they are declared, which makes it much easier to reason. In Moose, I think it's evaluated alphabetically. No, hash order. Hash order. Oh, sweet. Thank you, Steven, for just making me feel even worse about it. But I've long wanted to submit a patch to see if I could fix that, but they've said no more patches. Which is fine, I totally get why. So because they're constructed in the order that they're found, you can now have the potential for deterministic destruction because you can track that order and unwind them in last in, first out order. I don't know that that will be in the MVP either. Okay, there's only four keywords. By the way, class, field, method, and role. We actually had a lot more originally and then Damian Conway came along and did a deep dive into the spec. And he pointed out a way we could reorganize everything just by having four keywords, class, field, method, and role. And then attributes to modify their behavior. Tremendously simplified the code, made the logic much easier to follow, made the structure much easier to follow. And now I apologize, this is a much bigger slide, probably harder for some of you in the back to read. Class character is a person, that means we've inherited from person. Karina is single inheritance only. You'll notice there's a number of OO languages out there which allow no inheritance. Some of them allow only single inheritance, they almost invariably give you a way to work around that, such as interfaces or mix-ins or something else. Or you can do that with delegation, which delegation is much more powerful than people think, but there's not a talk about that. So I've now declared this class. And you'll notice I have an underscore defense for my reader. I don't have readers or writers for anything. Reader means that you can call target arrow underscore defense and read that value. There's something called trusted methods where you want methods to be callable by other classes, but you don't want people outside to be able to call them. We have done a lot of bike shedding on how to get there, and it's not gonna happen anytime soon. So for now, I punted and thought this is a reasonable compromise. We use a familiar pearl syntax for saying underscore defense. That is, think of it as a trusted method or a private method. And as a result, you can call that and people outside know not to. Notice the only methods we have public are isDead, adjust hit points, and attack, because you want your interfaces to be as small as possible. Because later on, if you have to change your interfaces, you're stuck if you've exposed everything publicly. So, Karina by default forces you to add the colon reader and colon writer keywords to fields because you have to choose, you have to opt in to making your contract public. Rather than with moose and moo and others, the default is everything's public. And if you want a private, too bad. And we have this constrain function. I'll talk more about subroutines being imported. But basically constrain is a function. Again, this is something I don't think we're gonna get to in the MVP. The intent is methods and subroutines are not the same thing. And you should not be able to call a subroutine as a method. You should not be able to call a method as a subroutine. And you can disambiguate them even if they have the same name. But just something to think about for later work. So, we did our subclassing, there's a little Dorothy there. And we create a new dothvader object, a captain Kirk object. And while not Kirk is dead, Vader beats him with his lightsaber until Kirk is dead. It's just very simple, it's easy. It works, yes, Vader will kill Kirk. I'm sorry, I do for Star Trek to Star Trek to Star Wars. But in this case, yeah, Vader, yeah, he wins. Very simple, very easy, and there's nothing when you get down to it, there's nothing really complicated about the code. It's simpler, it's easier to write, it's well encapsulated. But I want to talk about constructors a little bit so you understand some of the design work that we put in here. A lot of it we argued, I think it took like three years of arguing to finally get to something we could agree on. So, we have key value pairs, named arguments to the constructor, name, title, and offense. And it is absolutely required that you do that. You can create an alternate constructor if you want, called new unnamed and have a delegate off, but we do this for readability. And there's also some other benefits. So right now, here's a constructor in Java, character of Vader equals new character. And then if you didn't know what those were, it might not be clear what you're constructing. And in fact, you've got alternate, you've got optional data for your constructors. So you have to create multiple constructors. I won't go into details, but you might have to create multiple, multiple constructors. If we have in this particular example, or use a hash map and extract it manually. It's a pain. Karina, you don't have to do that. You have a declarative specification at the top of your code. Here's how our instance data works. So, writing the manual constructor in Java for a car, that's actually very readable. It's very easy to read. Calling it is not. I don't, I just looked at the code. I wrote this code and I don't remember it. I don't know what those numbers necessarily mean. So, that's why we try to avoid that. And in Perl, we have named arguments. Yes, you have to do a little bit more typing. This is for maintenance. You absolutely want to make it easier to maintain your code. And it's gonna kill you a few times. And you're not gonna be happy about this, but you'll get used to it because it's gonna become natural, I hope. So here, that's not character class. That's a person class. And we've passed in offense. Offense is not defined as one of your param fields. So that's gonna die. And I've heard people argue, well, I should be able to pass in extra data. Maybe my son class will use it, or there's some other way I can handle it. Yes, there is other way you can handle it, like every other authoritarian language does. Provide something which is actually going to properly capture that. But the real reason is, remember, title is optional. So if I misspelled title, it would think it's simply optional data. Now, because it's mandatory, you can't pass in anything which is not known to the constructor, then that is going to be a fatal error. And it's gonna be a very hard to detect bug that you don't have to worry about anymore. If you want to pass in extra optional data, make a parameter called extra. Extra column param equals hash ref. And then just allow them to muddle around with that. It's much cleaner. Moose allows you to pass in a hash ref instead of a list. We do not do that. We want one way of being able to call the object because it's just simple. This also preserves the ordering of those in case that becomes necessary in the future. Also, a hash ref will, any duplicate name in the hash ref will collapse over a previous one, which is kind of annoying. There are ways you can work around that if you actually want this behavior for setting defaults. But we decided this was the safest way to go, just to make one and only one way of calling the constructor. Thank you. So, I didn't talk fast enough, apparently. Here, field name, dollar name, dollar name, in both of those, those are lexically scoped. There is no conflict anymore. And with bless, if you had a arrow, open print name in your hash ref, but your parent did too, you're going to blow up. Here, it's completely encapsulated until you expose it. Now when you expose it, I have column parameter each, and I now have two param methods, and that's going to blow up. You can't override params. We might restrict that later. You can override methods. Sorry, methods automatically generated by param or, sorry, field and other things. I got ahead of myself. Never mind. So I can do this param car name. That means now you pass that to the instructor's car name, and there's no longer a conflict with parent class. Your parent and children classes should always be able to trust their internal implementation, always. So when they hit an external implementation, they're making a contract, and then they've got to negotiate and find out what works. Here's another example. Those are also going to blow up. That's the case where we're actually generating methods, but we cannot override those directly. You can create your own little stub method if you want to override it. Again, you can rename those in order to allow that to be safe. Class data, field, num characters, colon common means this is class data. You can also slap colon comma on a method and call that a class method. Adjust is called after the object is constructed, or actually it's called when it's hit, sorry, Paul. Is it called when it's hit or after the object's constructed? It's called when it's hit, right? Adjust was run as part of the constructor, yeah. Okay, destruct will run when the object is destroyed. So here I can track how many character instances I've created. It's very simple, works naturally in the language. And then I have another class, my world class. I can figure out the difficulty of my world. I've got my class method available. I can figure out how many characters and I can tell them how difficult the world is. Again, it's stuff which is now built into the language and you don't have to worry about that anymore. Is there anyone here who does not know what roles are? Okay, just in case roles are kind of like mixins you'd find in Ruby or interfaces with default implementations you'd find with other languages. And these allow you to take a small subset of behavior which doesn't necessarily belong to a class, a specific class, and move it often to its own role. And then you can compose it into the class. And then you will get that behavior. However, those methods are flattened into the class directly. There's no tricks with inheritance, there's no funky dispatch or anything like that, it's actually in the class. So method as hash ref, because this is what we call a forward declaration, because it doesn't have a body for the method. Anything with a forward declaration is required to be implemented by whatever is calling it. It can be implemented by the class directly or if the class consumes other roles as other roles might implement it. And then to JSON, here's another example where we want to get to the point where we can disambiguate. This is probably a terrible example because you don't wanna confuse those. But the reality is you should be able to call those separately and have them work correctly, even though you probably shouldn't name them the same. But it gets you some safety in the code and avoids the odd case where you called subroutine as a method, and believe me, I've hit that before. And self is injected directly into the method. You don't have to declare it in your signature. If you have a common method, so self, you also get a dollar class variable, which is the class name of the invocant. If you have a dollar common attribute, that means it's a shared method, which means self will not be available, but dollar class will. And again, those will fail at compile time if you get them spelled wrong. Which means if you declare something as a class method with a colon common and you're trying to access dollar self in there, that should be a compile time failure. You don't wanna use this code, but here, field dollar cash, once again, my implementation should be able to trust its internals. So nothing else actually gets to see the dollar cash that I have declared in my role. You don't wanna use this because this would work if you can guarantee your objects are immutable, but you can't. So you actually probably don't wanna cash those. But this is one way you can have of accessing data inside the role, which you don't share with others. And then using a role, it's pretty simple. So there's my serializable role, this one just does JSON. My character is a person, does serializable. All I have to do is define a hash ref method. And hopefully, when it's called up there, it will properly serialize into JSON, depending upon. I did a lot of hand waving there. But that's basically how it works. If you're familiar with roles, it's what you expect out of roles. So here's the various attributes we have. Class attributes. We have is a and does. Is a, again, is single inheritance. You can put one class in there. Okay, great, I've got plenty of time. Does, however, can have a comma separated list of roles that are allowed in there. If you're familiar with roles, there's ways you can exclude or alias methods. We don't actually provide that syntax here because we argued too much about how to make that work, and we just punted on that. I apologize. Well, attributes, it simply does. Roll serializable does some other role, whatever. Maybe it does a YAML role, an JSON role, and a TAML role, and can serialize all those different things if it's given the right data structure. Quite possibly cannot, but that's how roles work. Roles can consume other roles. And we do want to make sure we preserve the commutative and associative behavior so you can mix and match roles any way you want to in any order. In any combination, and it should work correctly unlike with inheritance and mixins where if you shuffle the order, you have no guarantee your code's gonna work anymore. Field attributes, this one's a little bit more. Reader, or you can rename your reader. Writer, automatically propends the name with set underscore, because we're disambiguating between the reading and the writing. And there's reasons for that dealing with return types and not being able to overload things properly. And also wanting to discourage people from writing mutable objects, but making it easy for them to do if they wish to. But it's available there. Param, whether or not it's available in the constructor. Week, to create a weak reference. Column common means it's a class data. Method attributes, do we override a parent method? If you want a method to be abstract in your parent class, just again, just declare it as method, method name, do not use a signature. And do not provide a method body, it's automatically an abstract class. And it must be overridden in a child class or with luck it will be a compile time error. Common, so you can have a class method which does not inject the dollar self variable. Around before and after are the standard method modifiers that you have. To be honest, I wish we had gone with something like, sorry folks, Python decorators because it's so much easier to use. But that would require attributes to be modified and how they actually get handled. Because right now the data inside of the arguments to an attribute is just a simple string, can't be parsed effectively or can't be run effectively. There's some discussion, I think Paul has been handling some of that, about how to maybe change that in the future. Some of the things we have already written in just the very beginnings of Karina. We have Stella, an actor model for Pearl. An actor model basically means if you have a box of toys, they know how to play with each other, you don't have to play with them yourself. That's the simple explanation. What's that? Okay, thank you. I'm very curious to see that. We also have a yellow cooperative message passing concurrency event loops, actors, promises. That one looks like a lot of fun. That's also done by Steven. You don't like that? Okay, these are some of the early prototypes we've been building with this. I used Karina a lot. This is a rogue-like tutorial that Chris Prather has been putting together. You've seen Rogue before, most of you. And I elated some of those, but basically parts one through six. He hasn't done more than that. What amazed me is I thought we would have to have much more of Karina built for it to actually be useful. I was wrong. Even a very tiny subset, properly designed subset of a class-based system works very well and is very powerful. I was really surprised by that. It also might force you to use composition and delegation more often, which trust me, that's your friend. I won't go into it right now. And I'm sorry, that was very fast. It was an overview. It was probably one of my least exciting talks, but I wanted to be able to have something that I can refer people to this and say, look, here's a short overview. If you want to have a video instead of reading the RFC or something like that. The actual RFC is at github.com, Perlapallo, Karina, BlavMessor. I'll put this up a slideshare. There's the seven stages which are referred to in that MVP of what we're trying to implement, unknown timeline as to when it's going to be done. It's already much more powerful than I thought. Really surprised by that. There's lots more to be done. If you want to see this, the single best thing I think you can do is download it, compile it, start playing around with it, send bug reports to Paul, give feedback, write tests for it, write documentation for it. We need that because conceptually it's very small, but under the hood, there's a lot of stuff which has to happen to make that done. And anything you could do to help Paul take some of that work off of him means we will get it out there faster. Does anyone have any questions? No, yes, sorry. Please speak up by the way, I'm a bit hard of hearing. Yeah, you mentioned the overrides as a way of following my pessimism. What happens if you have a base method and a derived class method with the same name without the overrides attribute? Right now I think that should be a, if the method is defined in the, sorry, what happens if in a subclass you're overriding a class which already has that method defined but doesn't, but has a body, so you're overriding something which already exists. That's something I, one thing a parent class generally should not know who or what is subclassing it. It shouldn't have to know that if that is at all possible, because that winds up coupling it too tight with the subclass. And as a result, if we try to put any sort of annotation on the parent class saying this is sub, subclassable, we might want to be able to allow a final attribute on something so you can't overwrite it, but we had to get an MVP out there. So right now it's a method body's defined. If you overwrite it in a subclass, adding the override tag is good. And I would like it to have a warning if you override something and you don't have the override tag. Or if it's an abstract method and you don't overwrite it, then it's fatal. Or maybe if you override, you don't have the override attribute, then it should be fatal, but we can punt on that. Any other questions? Can the rules have a method body? I'm sorry? Can the rules have a method body? If it's a required method in the role, it cannot have a method body. There are ways you could work around that. You could create a default method, which has a separate name from the required method. And inside of your methods, it's going to, no, you'd still have to have the other method required. So it's a yada, yada, yada, operator. I found a very nice. Oh, I forgot about that. So basically you make a method and then you just, the body of method is dot, dot, dot, which is the yada, yada, yada operator, which was added, I don't know when. 5, 5. 5, 5. So it's been around forever. And all it does is it just blows up different times. It's died with no messages. But it's very useful for, yeah. Yeah, that might work. Any other questions? Or do we still have time? Two minutes. Not you. You were exporting stuff, or exporting subroutines. Lexical, exportated. I've been using it and it's been working quite well with it, Corinna. And it doesn't seem to conflict. Oh. Lexically exporting subroutines. And then it removes the symbol. Yeah. So it's bound but not callable. Yeah, in the built-in package, there's an export Lexically, right? And then you put that inside your import, you can export things flexibly. And then they're entirely different scope. Nice. OK. I very much like that. I'll show you. OK. Actually, talk to Paul, because he's the one who's going to be doing some. We'll talk 20 minutes and I'll talk about it. What's that? Wait 20 minutes and I'll be talking. OK. One last question. OK. Thank you very much. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you.
The Art of Concurrent Scripting with Raku
Alright, thank you, thank you Theo for all the organization. I want to thank my employer Instacart for sending me here and thank you everybody here for coming to this talk. Quick survey, how many people have written some code in Rocko before? Oh good. How many people use it kind of regularly? Okay. How many people write bash scripts? Okay, excellent. So my name is Brian Duggan, I'm a logistics engineer at Instacart. We do grocery delivery. I would like to say basically everything is a race condition for us. I'm also a Rocko module author and I like to write scripts in Rocko. So this is a brief outline of the talk. First some motivation. What am I talking about? I'm going to go over concurrency in Rocko, just a basic overview. And then I'll show you some tricks for migrating some stuff from bash to Rocko, how things look in bash and how the same thing would look in Rocko, and then how using some of the concurrency primitives can help your problem solving abilities. Okay. So a few words about scripting. I tried to enumerate some of the characteristics of what we call scripting. So I'm really talking about shell scripting. And I think when I'm writing a shell script it's usually something to solve. I'm going to do something pretty quickly. I don't want very many dependencies. It should be easy to understand. And it should be pretty reliable because has anybody had the experience of writing a script and it lasts for several years when you thought it was going to last for a few minutes? Okay, yes. Good. All right. We are together. Okay. So another thing that I've noticed a lot about shell scripts is they're supposed to be pretty simple. You know, you basically run some commands. Maybe you check their exit status. Maybe you have a little bit of control flow in your scripts. If you're fancy you might write a PID file and use the file system to do atomic write and rename so you get some guarantees that you don't have two of the same copy of your script running at the same time. For the most part they look like this though, right? They're basically like this sort of standard procedural flow. You have some decisions. You go forwards. You don't usually see, if you're really fancy you might use trap to capture some signals. You might try to time out some commands. Has anybody used time out by the way? I just learned about it recently. You might have some progress indicators. You don't usually see things like async await or event loops or message queues or definitely not threads, definitely not mutexes or shared memory or anything like that. With scripts we assume that we don't really need real programming. We're just doing something simple. We want to get it done. There's this idea that the world is just not that complicated. I think in reality the world actually is that complicated. This is how I vision our scripts are on the right where we have this vision of this like perfect linear world where things are well organized, running one after another. But in reality the world is kind of a mess. As a wise man once said earlier today, minimizing concurrent code leads to crappy programs. Let's talk about Raku. For a deeper dive about the implementation of concurrency in Raku, I recommend that you watch Jonathan Worthingson's talk many years ago about parallelism, asynchronic and concurrency in Raku. He gives some really good definitions of those three words. I'm just going to say the definitions without going into too much detail. But parallelism is the idea of choosing to do multiple things at once. Asynchrony is reacting to things that will happen. Concurrency is competition to access and mutate some shared resource. Raku has great support for those. Being a multi-paradigm language, it doesn't impose any particular strategy on you for dealing with concurrency. We've had some conversations earlier today about Elixir, which has the actor model. There's a go track where you have a lot of threads running at the same time. Many languages have different models of concurrency, and Raku tries not to be too dogmatic and lets you do whatever you want. You have to deal with race conditions yourself. If you want to get started writing some Raku and experimenting with concurrency, instead of say hello and say world, you can just put the word start in front of say hello. What that does is it schedules the execution of that statement on another thread. Congratulations, you've just made a race condition. The output from this program is not deterministic. You might get hello world, you might get world hello if world runs before the second thread runs, or you might just get world if the second thread doesn't have a chance to start before the program exits. You can experiment and find out for yourself why other languages impose these models and it's to manage things like this. The simplest thing is you say, okay, I want to avoid this race condition. The easiest thing to do is you can just add the word await. We heard earlier about async await. There's no async, there's just await. Wait until it finishes, wait until the promise finishes before going on to the next statement. There are a number of high level, so the documentation of concurrency in Raku breaks it down as follows. You have these high level APIs, you have low level APIs, and then there are also some other built in event sources that are not mentioned on the concurrency page. The high level APIs basically have promises, which are what we just saw, where you're scheduling some execution and it's going to finish at some point in the future. You have channels, which are basically one to one message queues between different threads of execution. You have supplies, which are one to many message queues. Then you also have this nice thing called proc async, which is a great way to deal with external processes. We're going to see a little bit more of that today since this talk is mostly about scripting, where you're managing external processes. There are also low level APIs. If you want to deal with threads, you can. If you want to deal with locks, then congratulations, you have access to the kernel's implementation of mutex, which may be hardware implementations even. We also have atomic types and atomic operators. Again, these are sometimes implemented even at the hardware level. You can even use the scheduler if you want to change the concurrency paradigm that you're using by writing another scheduler that implements the strategy for scheduling, for queuing different threads. I'm going to go through some of the different built in event sources and do some very practical examples. Some things you might see in your scripts, file system changes. All these things are built into the core. TCP or UDC sockets, of course time changing provides a great event stream. IOPype lets you watch different Unix pipes and respond based on incoming data. I'll also talk a little bit about parallel execution with race and hyper and phasers. Let's do a quick trip from bash to racu. Easiest way to go. Take your bash script in front of every line, put the word shell, and congratulations, you have now ported your bash script to racu. Shell is built in and even better than that, there is an entire language for quoting. You don't have to deal with all of the horribleness of trying to escape your quotes when you're having subcommands and they all have quotes for all of their different arguments. Probably the most interesting one is the one at the top, the two angle brackets. This is sort of like a lot of languages have a way of taking words separated by white space and turning them into an array. That's what this does except if you have something in quotes, that becomes its own element. Starting echo, starting database dump in quotes makes an array with two elements. The first one is echo and the second one is starting database dump. Those get passed then as an array to the shell command. There's some extra fancy stuff going on there behind the scenes. Here's a little script that starts your database dump. One of the goals of racu is that easy things should be easy and hard things should be possible, as Larry has said. Of course, you don't have to say shell echo. You can use the word say to print things to the screen. Say starting database dump, run your shell command. Here we have our first little glimpse of asynchrony with this thing where I'm saying say now minus init now. What this does is pretty clever. What's happening here is init is a phaser and this runs during the initialization phase of your racu program. Anybody use go here? Go programmers? Go has something similar. Go has something called deferred execution. You might use it in a go program to say, hey, when you're exiting the subroutine, don't forget to execute this database transaction, something like that. There are a lot of phasers. Basically, you can use those to schedule code. I don't know about you, but one of the most annoying things is having a script that starts doing something and then you sit there waiting and watching and nothing is happening on the screen. I like to have at least a clock saying how much time has passed. For something like a database dump, if you do it a lot, then you'll know and then you could even turn it into a progress bar. Then you'll be able to estimate how much time something is going to take. We can do this easily with a supply. We make a supply, we call supply.interval1. This makes a clock that gives us a new value every one second. Then we make a tap on the supply by saying my dollar timer equals clock.tap. Then the code there.say is saying what the argument is to the tap. Basically it'll say 1, 2, 3, 4, 5. Then that's running on a separate thread. While PGDump is running, you're seeing time go by. That's really nice. Then there's some other nice built-ins to make it even a little bit prettier. Since we want to have a script that doesn't have a lot of external dependencies, you can use the very clever polymod, one of my favorite methods, which basically does a sequence of mod and div operations and turns your seconds into minutes and seconds. Then you can format it and then you have a nice little clock. Then you might say, hey, that's so great. I want to do this on all my shell commands. I always want to see a little clock. Well, guess what? You can use, so in Python they're called decorators. In Raku, it's the wrap command. Basically we can say shell.wrap. Then basically this will call our timer before we call call same to do the original shell command. Then it'll close it and then it'll say done. Now you've got your nice script that you just copied and pasted and added shell to the beginning of them. Now all of your commands have a little clock saying how long they're going to take. Let's talk quickly about timeouts. This is the timeout command in bash. You can give it an argument that's the number of seconds. Timeout 1 and then in this example we're doing a DNS call. We're looking up example.com. So we say timeout1hostexample.com and if it fails then the exit code is nonzero. So otherwise we say DNS seems okay. The way we do that in Raku is with a promise, actually a couple of promises, one that expires after a second and then another one that does whatever you're doing. And then you make a third promise which resolves when either one of those two finishes first. So we could say promise, we await promise.anyof and then start shell host command and then start sleep 1. So we do those two things in separate threads. Whichever one finishes first we'll know if we timed out. It doesn't quite work though because shell is going to fork something off and when you fork something even if your Raku program exits it's going to keep going. So there is a better way to do that and that is to use proc async instead of shell so that you don't have this sort of tree of processes. So we say timeout equals promise.in1. My dollar proc is a proc async new host example.com and then you await either any of proc.start which finishes when the process finishes or dollar timeout and then you can call proc.kill if timeout is true. Okay so I'm going to do a few more examples here to show you how to build up how we use some of these other primitives to help you think about problem solving concurrently. So supply.tap another way of saying that these are exactly the same. Instead of saying supply.tap you can say start, react, whenever, supply which is a lot of words. Start in a new thread, react makes another like a reactor or an event loop and then whenever it says hey let's make a tap on this supply. So let's watch a directory for changes. Whenever anything changes in there we're going to turn our markdown file into hdml. Dollar star cwd is the current path. You can call watch and then you can grep for certain files and then call md to hdml whenever one of them changes. I have a few more examples quickly. The slides will be available for you to look at more slowly if I run out of time. So let's look at ping. Ping is great. One of the things that sadly is missing from ping is that it prints all these nice statistics at the end but it doesn't print the median. What if you want the median ping time? You only have the min, max and the average. Well let's compute it by watching the output of ping and then keeping track of the output and then printing it at the end. And so here you can see what's really nice about the react whenever construct is that you can have a whole bunch of whenever inside your react. So we have a little leave phaser which is going to kill the process when we exit. We have our process which is going to do the ping and then whenever there's a timeout we're done. Whenever we get a line we parse it and then we add it to this lines array. Whenever we get a signal, so signal makes a stream that when signals are sent to the process you can also finish and then we can compute the median at the end. So can we make it even fancier and ping multiple hosts at the same time with our program? So let's see if we can make something that looks kind of like this. Multi-ping, where multi-ping gets a list of hosts and then it makes this little bar graph by watching the output of ping and then you can sort of see like which one of these hosts is responding more quickly at the same time. This is a really short program to write and basically what you do is you start a loop using a channel. And so this runs in a separate process. Whenever the channel.receive is going to block, whenever it receives something you take that something and you print it out to the screen and then you basically start your processes. There are a few sort of nice things here, nice features of Raku that make this even a little bit easier. So this little percent is a way of constructing a hash and what's kind of cool is that constructing a hash looks exactly the same way as destructuring. If you've programmed in JavaScript, you know they have this really nice argument destructuring syntax and it's equivalent to the construction syntax. So you can basically make these channels that communicate between threads and you can send structured data in between them. And you can also have type checking and things like that and so it's really nice. And that's your output. Similarly if we wanted to dump a whole bunch of MySQL or Postgres databases at the same time, if we don't care about the output it's even easier. And the way we can do that is by using a statement prefix called race which basically says take this loop and run the body of the loop concurrently. And you can give a parameter of the batch, the number to run concurrently and the degree of concurrency. And then in a few lines of code we've made PG multidump which can dump several databases at once. Okay so in conclusion we've seen some examples of tracking progress of a command, timing out a command, using asynchronous techniques to respond to file system events, using asynchronous techniques to respond to lines emitted from a command, instant parallelism, we saw some locks and for further reading there's stuff on the ecosystem and also the Rocky documentation about concurrency is excellent. So that's it, thank you. I think we may be out of time, I don't know if we have time for a question. One question. Or multiple questions at the same time. I'll take your one question. You get an example of watching a file system event and kicking off a process based on that. Yes. Sorry I don't have a word I need here but is there a sort of a program paradigm or capability for testing whether something has finished being written to before you start off your file system? If the file is really big then the file appearing might not have been finished and it's written yet and you kicked off something to convert to HTML before it's done so. Yes so it does get, I think I know where you're going with this especially because if you're using an editor then it's not that there's a single event where the file changes and it'll often be the editor will be doing a write and rename or it'll start writing and so you want to be careful about that. So there are some things to do that. You can throttle your supplies is one thing. If you're spawning a process you can say proc.ready and that will tell you when it's ready so that before you start sending things to it. And then basically it's hooking into the notification API for the file system so any events from the file system there's going to be a file changed, file renamed and then you, so the limit is only whatever the file system provides. Yeah sure. All right thank you.
Updates from the PSC
All right, last session. Paul Evans, working hard behind the scenes to make sure we have a soul through 5, 6, 7. What is going to be at 3? Don't you change it? Let's call it, I don't know, 100. Who knows? All right. Man, I'll be dead by then. All right. Hello, welcome. Hello. So this is updates from the Pearl Steering Council. So a bit of history first. We've had some yearly releases of Pearl for a very long time now. 532, that was out in 2020, middle of the summer. And then kind of, you know, every year or so, kind of like Clockwork. We've had new releases every year. This is a thing. People maybe don't realize this. Some recent changes we've had. So in 532, we added this Easey operator. That was kind of cool. 534, we added this Try Catch syntax. These are some new things we've had. 536 was a lot of new stuff. So we added loads of things here. Brief list here. First big headline thing, Stabilized Signatures. So finally, that nice little signature syntax there. That's now stable part of the language. You can just use that. You don't have to fiddle with the at underscore array anymore. It's very, very nice. We added this multi-variable for each mechanism here. Come in, come in. So if you want to iterate over multiple variables at once out of an array, for example, you can just pull multiple of them and it works. It's especially nice for iterating on hashes. So you have a hash here. You get each key and each value inside the body of the free-trilute. It's wonderful. I love it. What else have we got? We've got defer blocks. So you use feature defer. And now you've got this defer thing here. So you can put a piece of code. If you're familiar with Go, this is not like the Go ones. If you're familiar with any other language that has defer, it's exactly like that. In Go, they decided that defer blocks would always push onto a stack and then at the end of the function, it would run the block. Whereas every other language said, no, that's kind of crazy. We'll just do it lexically scoped. So you have a defer and then you get to the end of the block and it runs it. Every other language does it that way. Even C, some people are discussing adding defer to C. Because if you don't have this crazy array for the function, you can do it mostly statically in the compiler. And it's just kind of shorthand for putting it in the compiler. And every other language does it this way. Don't know why Go does it its own weird way. It's a bit weird. Anyway, so we have defer blocks. And you can put finally blocks on try catch as well. It's basically the same as a defer. But people seem to expect that if you can do try catch, you can do try catch finally. OK, we added it fine, whatever. Another thing we added in 536 is this built in namespace. So for years and years and years, if people wanted things like we can and blessed and ref adder and so on, they'd have to get them out of scale util, which is another module you'd load off the file system. It's a bit annoying. These are now built in to the language. So you don't have to use anything. It's just right there. It's already you can always use it. But if you want, you can you can import it as well. So for example, we have this nice index function that plays very nicely with the multi variable for each, for example. So this indexed, you give it a list of things. It gives you back a list that's twice as big where the first value is prefixed with zero. The second one is prefixed with one, someone and so forth. So if you're iterating a list out of an array at every element, you can see the index of that item out of the array. It's really, really nice. And it's built into the language. It's this here is really just telling the parser for this scope here. I want to have this indexed word available, but it built into the interpreters always available. And who is it? People were talking about lexical imports earlier. These built in is lexical. So basically what that means is you've just written some code here. You can just see it, but it's not putting that word indexed into your package. So if you're writing an object class that you don't get the word indexed visible as a method. It's not visible from outside. It's only visible within this scope. Really nice, really handy, excellent way of working. So these built-ins are very nice. Alongside the built-ins, we finally, finally have actual Boolean values. C originally didn't have Booleans either. And then eventually in C99 they realized whoops, we should have Booleans. It's taken us until 536 to realize we should have Booleans, but we now have them. So we've got this built-in true and false. Look at that, look at that, my T equals true. Guess what that does? There won't be a prize. It's not that subtle. But specifically we have this isBool test. So you can ask, here's a value, is it Boolean or not? So one and the string one, well they're not Booleans, but this real true, well it really is a Boolean. That's kind of handy. It's particularly handy because things like data dumper knows about it. So if we print this array here, got 2 plus 2, 2 concatenated with 2 and 2 is equal to 2. Well that gives you the fairly obviously expecting 4 and the string 22. But it also gives you this pling-pling-1. That's not very nice, but the reason for that is because everyone uses data dumper wrong. Data dumper, it's one goal is to output valid Perl code. It doesn't know that it's trying to output debugging values for humans to read. It's sole purpose is to output valid Perl code. And it doesn't know that you might not be running this on an older Perl, you might not be loading it back in to an older Perl that doesn't know what the true keyword is for example. So it's going to print pling-pling-1 because it has no other choice. This is really more a comment of please stop using data dumper for human debugging. What you want to use is something like data printer. Data printer is specifically designed for outputting pretty things to humans. And this slide doesn't show it, but that comes out in color. It colors the strings and the numbers and the keywords and the surrounding shapes all subtly different. And it looks really nice on the screen and it's lovely and data printer is so much nicer. If you're debugging stuff for humans, use data printer. So thank you to Breno for implementing. It's true, it's so good. That's not all. And the JSON-PP also knows about, yeah, there we go, JSON-PP. You encode this very same array, you get 4 string 22 and true. The JSON-XS version, Remy is still looking at it. Last time I looked it was about three, four days ago. It's not been merged yet, but he's working on it. Hopefully that'll come soon. The YAML modules is Tina around? Tina was around earlier. Yes, hello Tina, thank you. This slide is for you Tina, look at that. Four, the string 22 and true. And JSON-PP as well, they all do it. So thank you for Tina, for doing that one. So yeah, real Booleans, use them, use them, they're nice. Moving on to 538, the newest one that's currently around. Somebody wrote this class thing, I don't know if you've heard of it. Have you heard of it, Ovid? I don't know. Ovid obviously talked quite a lot about this class system earlier. So I'll just go through and brief. Here's a small example of, here's a small piece of code that you can write to implement like an object class. You can create here, we have these points, and they have some values. Yeah, they're great, you can have another point. What kind of behaves in the obvious way you'd expect from looking at the code? There's several things about this that I kind of want to point out as again, kind of covering similar stuff to Ovid earlier. There's a lot of low level stuff that this thing just does for you. So you don't have to write sub-new anymore. You don't have to write a bless, sorry, wait for that, noise outside to finish. You don't have to write a bless expression anymore. You don't have to call these accesses to get at your instance fields. They're just accessible directly as lexical variables. They're nicely there straight away and you can just use them. Specifically thinking of Java programmers, Python programmers in particular, this slide is for you. You write a class, you declare that it has some fields, x and y, here are the default values, and that's it. Nowhere did I have to unpack self.x equals x or args.x or whatever and work out did they pass a value in, take the argument otherwise take a default value. No, you don't have to do any of that. Here's a method. I've just straight away got access to the local fields and I've got the self and notice that I didn't have to put dollar self in the signature here. So I didn't have to put dollar self. No, I didn't even have to shift self in old school style. I was writing some Python class code lately and I kept forgetting to put def method open, self comma. Why would I put the self in the arguments, in the parameters to the method when I don't put the self in there when I'm calling the thing? As soon as you start getting used to using method, you forget about taking self as an argument. It goes out of your head. It's again nice and neat and lovely and it just takes things away that you don't have to think about. More things you don't have to think about anymore. So as I said signatures, we added these in 536. So here's an example of a signature subroutine and here we're taking a parameter. Are we taking this optional parameter? This one's fine. I'm shaking all over the place. This y here. If you don't pass in a value for that y, you get this default. So here we have an x is 20 and a y, well, you just take the default of 10. That's all very well, but the way these work inside, if you specifically pass in an undef, well, you've passed in a value, right? But that's probably not what the author of this code really intended. So it kind of breaks a bit if you pass in an undef. It gets a bit worse if you're just passing in variables because now, well, you'd have to check carefully is $a1 defined or not. And if it's not defined, then I'll just not pass it in. And it's messy to write some code like that. So new in 538, you can now use the defined or a sign operator to declare your signature parameter. So you don't pass, it sort of internally behaves much like this, where you look at is the value defined or not rather than just did it exist or not. So as you'd expect, you pass in one value and you just get the default. If you specifically pass in an undef, Pearl goes, ah, you've passed in an undef, that's the same as if you haven't passed it in at all, I will take the default, which means that passing in two arguments is a lot neater. I have another talk where I go into more detail about specifically what's in 538 and I actually point out if you were to have, say, five parameters to your function and four of them were optional ones, you literally couldn't do it without this operator because you can't literally not pass the middle parameters and still pass the last one. Whereas you have to pass in an undef. And so suddenly with this operator, you can have those kind of middle ones missing and still put in a value at the end. So it makes that kind of thing possible that you literally couldn't do before. So pretty much any time you're using default values in a parameter, to be honest, you probably wanted this defined or because specifically passing in undef is almost never a thing you want to distinguish from just not having a thing at all. So that's quite nice. And these two things combine together quite nicely. So for example, when you have a class and you have some default values on parameters, you can of course just use the defined or operator there. So once again, it means that things like this, where you're constructing an object by just passing in whatever values you have in variables, if those happened to be undef, it wouldn't matter. It would say, okay, I'll just apply this default of zero. So that's all very nice and handy. Other new things in 538, we have these plugable infix operators. So for event and time in part, we've had plugable keywords, which is how a lot of the weirder syntax modules like syntax keyword try and future async weight and object pad, those will work with this keyword mechanism where they can tell the parser, I want to implement a whole new keyword, give me control for a bit. So new in 538, we've added more support for doing similar kind of stuff like that with infix operators. So that means that we can have even more cpan modules to experiment with things that might become new syntax in Perl at some point. And we've got a few things to play around with those. So people want like an EQ operator. People always ask for an E in operator and I've explained in great detail why that's not as easy as it sounds, but there's a few examples there. And there's things like a zip and a mesh and there's a few other modules there. But for example, this one in particular EQU is a nice behavior that at some point we might add into Corp whole. So behaves very similar to the normal string EQ operator, but it knows that undef is different from the empty string. So this is really cool. And here it's literally this new infix operator. So you use it very much like EQ in that two strings, two strings that are either the same or different, it tells you about those, but it knows that undef is equal to undef. It knows that undef is not equal to the empty string. And in none of these cases, it will print a warning. So that's quite often the sort of thing that you want. It's slightly nicer than using EQU and defined tests all the time. And this is exactly this kind of experiment that's really useful to be able to test it on CPAN first and say, hey, do we like it? We'll use it in a few places, go along, maybe decide eventually. Yes, we'll put that into the language, maybe not. Kind of depends. So one thing you might have noticed from pretty much all of these examples so far is that every single one of them starts with this use, the 536 at the top, all these other ones, or use V538. There's a reason for that. The use version mechanism. It allows you to configure effectively the language from the very first line of your source code. So rather than you just deciding this is the version of Pearl I want to use today, your file says I want to be 536 or 538 or whatever. And it's a thing we've always had in Pearl, but people haven't necessarily used it as much as they should. And I keep trying to point out how good and how useful it is and why you should do it all the time. Because, for example, it implies a feature bundle. So you say, for example, use V536, you get all of the features that were enabled in 536. So rather than having to ask for all of these things individually, if you just write use V536, you get all of this good stuff, like say and signatures and maybe some of the other ones are good as well, but those two by far are the ones that I just tend to use all the time. Everything is just say and sub-routine signatures. So those are all very nice. But it gets better. It's very similar to things like when you compile some C code, you tell the C compiler which version of C I want to be using here. So it means that just because you've installed a new version of GCC, if you don't tell GCC that I'm compiling C99 code, well, you can still compile C89 code or whatever. Just because you've updated your compiler, you can still compile old programs. It's even better than that because it's not just applying to a file, it applies anywhere. You can just put a use version inside a block. And you can say, inside this block, I want to behave as if it was 536. But I'm not going to put use V536 for the whole file just because I still happen to have some older code here. For example, this thing using prototypes. I didn't want terminal signatures here. So rather than going to fix up my entire code base all in one go to work on 536, I'll just do a small bit here today and then maybe tomorrow I'll do a bit more. And so I'm like an incrementally update to using the new stuff. It gets even better than that. So not only does it imply a feature bundle, but ever since 512, it turns on strict. So any time you write use strict at the top of your file, you always do that, right? You can instead just put use V512 and you've already got strict. Oh, and you've got all the features. Oh, but new in 536, we added warnings. So if at the top of your file you would write use strict, use warnings, by the way, you should always do that. We don't have to. You just can put use V536. And now you've got strict and warnings and all of those features. So it's really, really nice. It gives you your choice of the latest features. It means that we can maintain that compatibility of the language. We can add new stuff in Perl. So like you noticed 536 added a lot of those keywords like try and defer and so on. If you don't write use V536, you don't get those. But that's fine. It means that if in any of your code you had something called try or defer, well, we haven't broken that. We can add new stuff in Perl without breaking your code. All you have to do is put use V536 or use V540. What that means is, yeah, we can update Perl without breaking your code. That means you can update your Perl binary without breaking any code. Hands up if you've ever installed a new Perl and something has broken. Interesting. That means we failed. That means we failed. If you install a new Perl and something works. A few years ago. Yeah, yeah. Like really early ones, sometimes they didn't go so well. But more recently, I mean, you know, so for example, I think it was about last month or so, I updated a bunch of stuff on my email box and all of my email scripting stopped working. And I looked into it and I discovered actually Proc Mail has, there's a little bug in Proc Mail now, it's a bug, something in Proc Mail had changed that meant that a piece of Perl code I wrote over 15 years ago is not being invoked properly. And so all of this thing stopped working. But the script that I wrote 15 years ago for handling all my email works perfectly fine to this day. Like I haven't bothered touching it. I'd almost forgotten that I wrote it. It just works. And it's all because of this use version mechanism. And so when people say, oh, why do I have to put use version or use feature or whatever to turn on new stuff? This is exactly why. It means we can update Perl and you can update Perl and not break your stuff. But it means you have to ask for new things. Speaking of asking for new things, I've been mentioning a lot of these things are quite experimental. So some terms here. So stable stuff means it's long-term guaranteed. What that means is if we put something in the language and we say it's stable, that means in a decade's time, in two decades, like all of the stuff we're talking about now has been stable for like the last 20 years. And it's all of the stuff that if you update your Perl, you don't have to think about because it's all the stuff that's there and stable and working. Experimental simply means a lack of that guarantee. All the experimental means is we don't guarantee that this will still work in 20 years' time. But it's no worse than random stuff I downloaded anyway. Like if you install stuff off GitHub or C-Pan or other languages, things like NPM or PIPI or whatever, if you just download it and the author says, oh, actually next week I've changed my mind, it's going to work something else, that's only the same level of guarantee here. So don't be afraid of experimental. We're not saying, oh, it's crazy, it might break and blow up your code. That's not what we're saying. What we're saying is if you use it now, we don't guarantee it'll still be around next year. But maybe it will. It's not about does it work. We know it works. We have lots of tests. Things don't get merged at all unless they actually work. So things like the object system and try catch and all of this lot, it works. We know it works. People use it in production. The question is do we like it? It means you. Do you like it? If people come back and say, yes, we like this, this is great, then wonderful, we'll take the experimental tag off. Nobody comes back and says, hey, we've used this, we like this. How do we know whether we should commit on it? There's things, literally this week that we've been staring at to do with lexical subs, that if more people had been using them over the last eight or nine years since they were made non-experimental, we might have encountered sooner and said, actually, yeah, that's a bit of a design flaw. Whoops, that's a shame. But hardy knows when to be using them. So we didn't know. So now it's a little bit late to change them. So this is a request. This is the one takeaway from this talk. If you learn nothing else to learn this, please use experimental features. Not necessarily in your production, I still want this to run in a decade code. But if you're writing some small little test thing that maybe is only going to last for today or a week or whatever, or you're just grabbing some data and mangling it and fiddling around with it on your laptop, and you're going to throw away the script after lunch anyway, please play around with these experimental features. We're not saying they don't work. What we're saying is they might not exist next year. But if you're writing some code that doesn't exist next year, who cares? So please try them out. So with that said, what are the current experiments? Well, we've got try catch. That's still a bit experimental because ideally I would like when you catch an exception, you get more information out of it than just the string of what the exception was. So we might expand a bit on that. Differ is experimental. There's a few reasons for that. To do with if you throw an exception while you're deferring, while you're unwinding another exception, you've got this kind of double exception collision thing going on. It's a bit weird. Multi-variable for each, that's just because it's new. Some of the built-in functions, they're currently experimental, but they probably don't need to be. Class is obviously very experimental because we're changing a lot of stuff around. That will change and devolve over time. There's one particular experiment that I do want to draw attention to, and that's when we got rid of, when we unexperimented subroutine signatures overall, we did leave in one thing, and that's if you use the default arguments array for some reason inside a signature sub, that does currently print an experimental warning. The reason being it's kind of annoying to implement, and if people stop doing this, then we can get rid of a whole bunch of the implementation and make all of functions faster in Perl. Please stop doing this, and then we can make your Perl faster. Sonia? Any mac-in-the-feature? Any mac-in-the-feature? We could, yeah, it could become a feature. Maybe, maybe. We'll see, it's complicated. Talk to me at lunch. Anyway, so we've only got 10 minutes left. Coming up in 540, new release that we're expecting to be out sometime this summer, most built-in functions should become stable. So things like, at the moment, things like ref type, you get experimental warnings. When you do use 540, you won't get an experimental warning anymore, because hey, fairly simple, fairly stable, seems to be fine. We also are going to get built-in bundles from used versions. You know how I said, use v536 implies all of these things? Well, use v540 will add another one. So that means when you go, use v540, you get all of these built-ins for free, which means you can write, use v540, say ref type, and you just get the thing. And obviously, we're going to put that in with the capital E as well. So you can just do pearl-capital E, say ref type. Look at that, that's lovely. Everyone likes to do ref type in their one-liners. Yeah, I don't know. These ones are all bit, like, it's hard to come up with small examples, but it's nice that they're there. It's nice that you don't have to ask especially for them. You just get them. So yeah, use v540. So I want to talk a bit about the process behind some of these things. So we have this thing, the proposed pearl changes. It's, you speak a lot of C's. It's a formal process where people can request changes in the language. So already we've seen in v536 we had the enter time for, that was run by Nick Clark. We have deferred the Booleans and the name, and, well, no, that says Booleans. That should say built-ins. That's a bug. I wrote those ones. Xenu wrote the command line flag for slurping. It's just a small little bit. Rick wrote the built-in index to one. These are all the people who wrote the documents. These aren't necessarily the people who implemented the code. These are the people who wrote the documents. So part of the whole PPC process is about saying, if you have an idea for pearl, but you don't know how to implement it, well, that doesn't matter. Write us a document to explain the kind of thing you want. And if we accept it and we like it, we'll say, yes, we will work out how to get that implemented. You don't have to implement it. In v538 we got rid of the back tick for the package separators. That was the Nicholas Mendoza who wrote that one. Over here you did the module cruising. And I can't remember who implemented that. I did some of it, but someone else did. Who? Chromatic? Yeah, Chromatic wrote that one. Yeah, you just suddenly surprised us one day. I said, oh, by the way, I've implemented this. Wow, OK, fine. Yeah, so we have the module cruising. That's quite nice. And the lexical exports. Sorry about that. We're going to have to change them. Yeah, chat to me later. We're currently testing. There's only one little thing that we're PPC that we're testing at the moment for 539. That's the load module built in. It's going to be quite nice. It's just a nicer way of doing require where you have a package name in a string. It's just rather than having to do all of the horribleness of turning it into a file name. You just go load module. It's quite nice. There's a few other ones that we're in the middle of implementing. So things like English names for punctuation variables. So rather than doing dollar splat like that, you could just ask for dollar eval error. It's quite nice. Template strings. I'm almost upset you didn't. You had a sprint if in your code earlier, Rick. I mean, come on. So if you would finish the implementation. Yeah, it's hard. Sublexing is hard. So this horrible thing, especially with objects, like if you try to implement it, if you try to invoke an object accessor inside of a quoted string, you'll know you can't do that. And so you're always having to break out of the quoted string and stuff like that. So we've stolen this thing from a few other languages like quote, quote, quote, template strings. So now you can just put expressions in your code. It's lovely. It's nice. It's horrible to implement. If anyone knows how to implement it, let me know because I've had about three attempts. Anyway, other ones that we're in the middle of implementing is optional chaining. So a Python actually a couple of weeks ago said they were considering this thing. They call them the none aware operators where you have, you want to do this method call or a hash lookup or whatever it is. But the thing on the other side might be, well, in Python's case, it's none, but in Paul's case, it's undef. And you want to just return undef instead. So we have this wonderful idea of just put a question mark on the operator name. So that there, if the hash key exists, it'll call name on it. If the hash doesn't exist or if it's undef, this whole expression is just undef. And that's often a thing you want to do as well. It's nice and neat and tidy. I like it. And the metaprogramming API. So all these crazy things that you do with no strict refs and glob refs and all this other stuff that's horrible and messy. We're going to make that much, much nicer with just you get a meta package and you get the symbol out of it, you get the value in it. It's all lovely. It's all inspired by things like package stash. And there's a bunch of other things on cpan, but we want to make this an official part of core pearl so that we can tie it into things like the object system as well. It just makes that much more powerful. A few other little upcoming ideas at some point, but probably not going to be in 540, are I'd like to have named parameters and signatures. It'd be nice to be able to have these named things here. But I want to do more stuff on class. I've not really added anything extra in class for 540. So roles would be nice. The convenience accesses might be nice. It's possible by 540 I'll get around to the easy one like reader, but even something like writer is going to be a little bit awkward. But even just having readers in 540 might be nice. I'll see if I can get around to it. And I've got three minutes left. Yep. And the last thing I want to do at some point is renumber 5.whatever into 7 because I really want to be able to type use v7 and just have it work. And with that, I'm going to say there's the end. There's a link to the slides. There's also a link down here to some slides and the video of my talk that I did what's new in 538, which goes into a lot more detail about the new things we added in 538. And then I will say we will take some short questions, but our minds now the last talking here. So afterwards I'm going to go for lunch. If people want bigger chats, we can chat over lunch or in the hallway or something. So with that small questions. Yeah. Question. In the chat support for thoughts, do you expect or is it hard to plan to implement interfaces? So the question is about interfaces. Do we plan to implement interfaces? I mean, in summary, no. I mean, Java's idea of an interface is all about defining what kinds of methods you can call on a thing up front. It's all to do with static typing. That's exactly all that it is. And Pearl doesn't have static typing in that sense. Like if you have an object, you can always at compile time write the code to invoke any method you like. I mean, maybe at runtime the method may or may not exist, but it doesn't. Whereas adding the concept of static typing to a dynamic language like Pearl basically turns the entire language upside down. So the idea of a pure interface isn't really a thing that we want to add. But we definitely want to add roles because roles are statements about an interface, but they can also have implementation with them. So it's all about gluing small bits of functionality together to make a larger class. So we definitely want roles, but pure abstract interfaces are not really a thing that fits in dynamic languages. Oh, good. Comment on the question. Java allows default implementations now. Oh, does it? And residence. So basically they're roles. Yeah. Yeah. Okay. They're much nicer than they are. So we've only got one minute left. For x, y at array, if dollar y happens to be in depth, how do we know because the value is in depth or because we hit the end of the array? It doesn't matter at that point. It's just, oh, with the default argument, the signature parameters thing. No, the multi four. Oh. So I want to know that my array is even sized if I'm pulling out an even size. Yeah. Yeah. So for the, for the, for each, for each when you have multiple arguments, yeah, if, if the size of the array doesn't exactly match, it's like a, not a whole multiple. You will, you will get just undefs for those last missing positions. We did think about other bits of behavior, but I think in the end we decided that it just doesn't match because like if you just did my x, y, z equals array. Like when you get undefs in those last few values, you don't know whether that's because there were undefs in the original array or you just ran out of values. And so you've got the undefs. So it's kind of the same thing. If we did consider implementing something where you could tell the difference, then we'd start to, you'd sort of start to ask questions about, would you put it in other features as well? So like a, a, a large part of, of kind of trying to do language design is saying, well, we're not just going to do this one isolated feature. We have to consider how does it play with all these other things? And so running out of the array is a thing that happens in a lot of places. So I think that's, that's the end of, of questions now. So we'll stop there, but if people want to chat more, I'm, I'm happy to chat over lunch, but thank you very much. Thank you.
Open Source DocOps
Welcome. Our first speaker will be Lorna Jane Mitchell. I always say Lorna Jane in one word. I think everyone knows me. Yes. But you probably already know Lorna and she's going to talk about open-source top box. Take it away. Thank you. Hi everybody. Thanks for coming. It's a busy room and you've had a busy day. I hope your brains are not too full for something more. My name is Lorna. I'm VP of developer experience at a company called Redocling. We make API tooling including documentation tooling. I've worked on docs projects in a couple of previous roles. I describe myself as an engineer with a writing problem and I'm very happy to be here with some like-minded individuals. I'm also passionate about open-source. Yeah, my background is in software development. I learned in the open-source community. I'm an open-source project maintainer, open standards contributor. And I want to bring to you today how open-source and doc ops works together. So this works better if I plug it in. There we go. This is the second talk of the day. I'm not sure I'm still got sentences. Okay. What is doc ops? It's in the talk title. You believed in it enough to be here. Documentation operations is about allowing documentation to be created, but also maintained and published collaboratively and in an efficient manner. It's really about being able to make changes and having confidence and being able to make a lot, a lot, a lot, a lot, a lot of changes with lots of contributors. And the way I think about doc ops is that it, from some of the more traditional documentation practices, doc ops is a culture shift. Some of you are enough in the software space to have seen the DevOps culture shift and we're bringing something very similar to our written word. Everything I'm going to say in this talk really builds upon the concept of docs as code. If you are not treating your docs as code, you cannot benefit from the cool tools that the coders build for themselves that we adopt into our tool chains. This especially includes source control. Git is the key to many of the workflows that I'm going to talk about today. Text based markup so that we can manage multiple change sets simultaneously and bring them together without pain. I personally enjoy rebasing, but you shouldn't have to. Bringing continuous integration and those practices and also having a good local setup. If you have to push to see if you did it right, that's not a good documentation creator experience. And having good tools all the way through the stack is what makes this a really effective workflow. It makes you very productive and lets the machines do the heavy lifting. For a long time I used to say the software developers, the coders build the tools that they want to use, but I don't think they should keep them for themselves. I think we should take them and bring them into our world of documentation. Open source, you're at Fosdame, in English I would say I am preaching already to the choir. Open source means freedom, but it also means not having to build the same tool in every team that needs to publish a docs platform or check that the links work. It means being able to run that tool wherever you want to. Tools that fit into continuous integration systems are typically open source by default. We don't expect it by licenses or sign in, we expect them just to run on our temporary compute platforms or on our local machines. Best of all, there's no vendor lock in. So we can choose this tool or that tool and because we chose that one we're not stuck with having to use another one. We're using standard formats and open source tools. Just because we didn't have to build and rebuild the tool doesn't mean we don't have to build it at all. We all need to be participants in the tools that we use, reporting bugs, fixing things, thanking our maintainers when we see them. It's all part of the story. So I'd like to share with you some of the tools that I use on my docs projects and I've tried to pick just a few categories of things that I think are vitally important. We'll start with the obvious. You need to be able to preview your docs change before you publish it. Everybody should have access to preview. Everybody who contributes to the documentation or reviews any docs should have access to a tool like this. This is a screenshot of VS Code. I'm editing an open API file on this side and this is the redockly rendering on the right hand side and I typically work like this. So I always have local tooling that updates immediately. I can see instantly, oh that didn't render like I expected. There's something wrong with this. I can clearly see that's broken. My table is missing a cell because I've got that live preview response and this is part of the story. It doesn't have to be embedded in your IDE. You can run a local server that updates or use a watch command to rebuild your static site but you should have fast preview when you are working on documentation. You also need to be able to see the build areas locally if there are any. I see too many places where that's hidden away somewhere hard. The other place you need preview is in your pull request. You open the pull request. That needs to build exactly as it's going to ship. We need to spin up a per pull request preview. Don't muddle through the branch and put it on a staging server and hope. Pull request builds for previews and that also enables the reviewers. So it gives them a nice view. I used to think that previewing docs was for people who weren't technical enough to read mark down. Now I'm a VP. It's just people who are too busy. You put the web page in front of me. I can review it. If I have to go past something in a pull request somewhere, it's a bit less likely to happen. Okay. Link checking. Who has link checking in their docs build today? Yeah. It's not very many and it's the thing that is most easily rotted in your documentation. There are two problems. One is all the links between all your own resources which are just super easy to get wrong. And the other one is other people breaking their links making you look like a fool. So I use a link checker to check both of those. It automatically does like a click on all the links. When I'm looking at it for a long time I was building the HTML and checking the links that after render, which is cool and works. Now I'm working on more of a dynamic site. I actually have a tool which checks at build time. So I'm using MLC. There are lots of others. Pick your favorite. So it can read mark down and so then it can just check. This link makes no sense. Your syntax is terrible. Please do this better. All those things. Either approach works, but I think it's very important. It's an easy thing to add. You can run that tool locally. You can run it in CI. The downside of checking all your links is really other, I mean all the problems are really other people, aren't they? All the problems are other people. Sometimes the internet goes wrong. I used to work on a documentation platform which relied on an upstream open source project. Whenever that project launched a new version, all its links were broken for 12 hours. There comes a point where you don't want to know what the explanation for that is, but it meant that all of our builds failed for 12 hours because the links were broken. No, no, their links are broken. So I have a couple of different strategies for this. One is to only check the links in the files that are changed because especially on a big documentation set, you don't want to have to deal with something that's gone wrong in a link from another section might be owned by another team. So I just do that and then I do a weekly check all the links job. If that job fails, it opens an issue. So if something's decayed, we'll catch it maybe not always faster than a customer, but fast enough. So these are some things to think about. Whether somebody else's broken link or downtime should block your build or your release because I think that's a other people's links are outside of your control and so that can be a hazard. Let's talk a bit about validation. If you're coders, you are accustomed to working with syntax checking tools. Some programming languages will error at build time before you even run them. Some of them are more interpreted so they don't go wrong until you run them. We don't historically do that with our documentation, but the tools are there, especially when you are doing docs as code. So we don't necessarily do that. We don't necessarily do that. We are doing docs as code. It's got all the advantages of working in code and it's got all the disadvantages of working in code. It cannot be obvious that something is wrong. The errors can be super subtle. You have a full stop where the comma should be or the wrong sort of bracket. This stuff is even when I work with it all the time, can be very difficult for humans. Super simple for machines. So we can build on those tools and let the machines do the work. The other thing I like about having the validation errors automated, I can run them locally. I never do. I always push it and then wonder why it's failed. The other thing that's nice about that is when you push your pull request and you are missing a comma or you have the wrong sort of bracket, perhaps this is personal to me, but it feels kinder coming from a machine than having someone else criticize my use of a bracket. So that kind of, and I don't have to wait for a person to come and review it. I immediately get that very impartial factual feedback that my bracket is in fact wrong. And I think that's what I like about using validation like this. I was going to say the bots are not judging me. What a horrible thought, are they? The validation tooling, you have a few options and it depends a bit which flavor of markup you are using. I'm working mostly with markdown these days, although let's just say it's not because it's my favorite. Let's keep the markup language war for later. I'm using markdown lint. With markdown I find it very good and very, very configurable. So like all of the linting tools and the same with open API which I work with a lot as well, probably some of you have API reference docs, the default settings for all of those linting tools, the volume is too loud, especially if you were not already using those linting tools at all. Markdown lint is really configurable and it has really excellent documentation on what all the options do. It is remarkable how few documentation tools have a genuinely good documentation. This one does. For restructured text I've mostly been using that with Sphinx and Sphinx has really great validation and I think it builds on the docu-tools so you can use that by itself. All of those also come with command line tools, IDE plugins and you can put them in your continuous integration. So github actions, Jenkins, whatever it is that you use in your setup, set that up for your pros content exactly as you do for your code. If you're using open API you should also be at least validating that. I've already given my open API talk today so I will attempt not to rant about API linting and standards but put those tools in, set your standard and make sure that you are consistently checking that. Again it goes in your tooling. Disclaimer I make, Redock Lease CLI, that's my day job. Other excellent competing open source tools also exist and I'm probably not the right person to take a recommendation from. I'm very biased. So we talked about validation, very closely related to validation is formatting. Again software development does a lot of reformatting of code and that is to give a very consistent presentation. We always use the same white space in the same way, the same indentation, the same wrapping rules. It makes it visually very consistent. So when you work with the same code base all the time it gets easier to read. We can do that for our mark, mark up, mark down, restructure text, ask, skidock, whatever. We can do that for our tools too. By allowing things to adjust our new lines, our white space, the indentation, the wrapping, things like do you need a white, do you need a blank line before your bullet list or after your heading. Lots of tools don't care when they're rendering but by getting that the same you can make it easier to read the raw text and easier to look at it and spot problems because the layout is so consistent. I've only recently started doing this. I write a lot of docs that are in the same repository as the code and we just turned on the engineers prettier tool for our mark down. It's actually really nice and I was initially, like of course you can, I don't mind. Now I'm turning it on everywhere. So yeah, I really recommend it. I also really enjoy prose linting. Now I don't see enough of this. I'm using a tool called Veil and I'll be honest, I don't know very many other tools in this space. Lots of people nodding. Good. I'm also happy to be contradicted like tweet me what I should have said. With this it comes with, you can give it a dictionary. So it's going to do all of your spell check for you. It can also do quite a lot of grammar checking. This is brilliant for me. I work with almost entirely non-native speakers. So having a little bit of help for me and them to get the words out correctly is brilliant. I am a native speaker, doesn't always help. So Veil helps me a lot. Also you might be able to tell from my accent, I'm British. My company is standardised on American English and at this point my spelling can only be described as mid-Atlantic. So having Veil just to catch those common, we have like a Britishism's rule enabled and it's because I'm here. Typing all these British spelled words into our American docs. It catches repeated words. You can teach it product names. In my previous employment I worked with a company that published a bunch of open source database products. You have to get people's trademark product names correctly. Up a case, lower case, trademark. This has to be legally, this has to be correct. So unless you want your lawyers to have to think about this a lot, you just teach it to Veil. Veil explains it back to you really regularly. The other thing we did there was we put a bunch of collars common misspellings in. So we worked on Kafka. When I set up a search for Kaka, loads of hits. We also banned the English word flick because we had a product called flink. And indeed we just don't need this word in English because it probably is a misspelled product name. So those are the sorts of things that Veil can help with. I know we have a Veil talk next. Yes? A little cheer. So I'm not going to say more about that. Veil's amazing. Stay and listen to the talk. Okay. Let's talk a little bit then about how all these amazing different tools that solve different problems and they have your back. They support you in lots of different ways. But let's talk about how they fit that life cycle, that work flow. The key is that you are using exactly the same tools with exactly the same config everywhere between your laptop and your production platform. And that's the goal. Every contributor needs access to the same tools set up the same way. The tools, if you haven't used them or you don't yet feel confident because I know lots of people who have been using Git for years and still think it might bite, which is fair. There are lots of things to learn. Source control. I'm focused on Git but I've been doing this long enough that I learned on something else and I don't doubt that there will be more transitions in our future because that's technology. I like a workflow that's called GitHub flow where you have a main branch, you make a small change, it gets reviewed, it comes back in. If you see another spelling mistake, don't put it on this branch. Put it somewhere else. And it means that you can branch off lots and lots of shoots that are waiting to be reviewed and merged. And in this way you can multiplex lots of changes and sometimes as a feature it's waiting for review. Be confident. Actively practice changing branches because it will give you the momentum to switch a branch, make an edit, push it back. If you are writers, you are probably editors and reviewers as well, these are the skills that will multiply the stuff you're already good at by helping, getting the tools to help you. I've talked a bit about the continuous integration. Always hook everything but you find useful locally, maybe you get an extra VS Code plug-in. Figure out how to put that into your continuous integration setup and apply that tool to every pull request. This way we can never forget to check the formatting or the links because it will just be there. We won't, all that one's a bit risky, I think we should deploy to staging and check it. The preview will always be there and the machines will always be on your side. It helps the reviewers to do a better job and it maintains documentation quality. One of the most important places to have exactly the same tools and exactly the same config running is on your local machine. The smaller your feedback loop, the more quickly you can adapt and correct it and move forward. So having to open a pull request to get the build to see if it's okay, that's a big feedback loop. It's not ideal. I have one project where I need to do it because we have amazing test harness setup and it's much faster to run the tests there than here. So I like open the pull request to let the build build because it's quicker to do that than to wait for it to run locally. But for most docs tooling, they should be a few seconds at most even on very large doc sites. You must have them locally. If you use an IDE or similar, you can use a local machine to run the tests and take the time to figure out how to plug in these tools to that setting. Lots and lots of them are supported in both places and you can have it in context. I use Vim. All of those tools are plugged into Vim as well. So it's not modern, hand wavy, cutting edge. This is stand practice. The other really important thing is that this is all written down. With documentation specialists, everybody, write down how to set up the tools, how things are configured, where we publish to, where the sources, how the remote sources come in, how things are set up, maybe some troubleshooting guides. Write that down. The onboarding should be easy, whether that's a new hire or you get a new laptop someday. Set yourselves up for getting it right because again, we're looking for confidence and efficiency and this sort of thing is part of the culture change. There's a saying in software about move fast and break things. Dark ops is about move fast and don't break anything. I mean maybe it doesn't matter as much in documentation because it's easier to iterate than it is in code or especially in API interfaces. But the goal here is that we have professionals who are really good at what they do, but the tools can make that faster, easier, simpler, more accurate. They can catch us on things that we might slip up on. So bring the tools but also the dark ops mindset into your projects and see where it can take you. I am pretty much out of time. Here is a list of useful resources. My slides are linked to my session and I will say thank you for your time. I think we have maybe like time for two questions. Would anyone like to ask a question? Yes. This is a really good question. Do I have tips for helping with the translation of documentation within the process? I haven't worked on a lot of projects that have this. The ones that I have, Git is the key because you know which files have changed and which things have changed. I have mostly seen where the translation is a mirror and whether it's a week or a month or however often you pay your translation people, you can snapshot the pages that have changed and get those re-translated. So I think source control helps a lot with that. One more question. Could you imagine that you have also very strong opinion regarding documentation and information or something? I would like to hear it. I will repeat the question for the stream. The question is do I have a strong opinion about having documentation in Confluence or Notion or something like that? I have two strong opinions, not too strong because we are being recorded. The other one maybe we can talk in the bar. Using a tool like that hurts collaboration because you can't all make multiple changes at once and bring them back. Like one person is editing, if you were editing, it's very tricky to do that. The other reason is the lack of standards. So on a very personal level I have some accessibility needs. If you switch your documentation platform to Confluence or Notion, I can't do my job anymore. So Doxxus Code is the way because it lets everyone choose the tools that work. Thank you. All right. Thank you very much. I think we have this.
Style as code: Using open source tooling to codify technical documentation style
Sorry for everyone who heard nothing. So reading good documentation is easy. I'd say deceptively easy. You don't really realize you're reading good documentation. You only realize when you're reading bad stuff. Writing good documentation is easy to read. It's hard. That sentence, it was kind of hard to write and read. I think I said it right. And so I wanted to lay out a few things that I think make good documentation and talk about how Vail helps me achieve these. So correctness, for sure. Your doc should actually represent the thing that they're documenting. But I think that they also need to be clear. And that means not using confusing language. And also being consistent, I think, helps clarity. So if you can enforce style across your documentation, you can have your readers experience the same thing wherever they are in your docs. And they can scroll down this bit and find that information given the text that they're used to see. And also it needs to be inclusive. And we're writing docs for people primarily. I'm sure crawlers as well. But people are reading these. We shouldn't have cultural biases, stereotyping of race, gender, or any cultural stereotypes in our documentation. And accessibility is part of that. I think from the Google Style Guide, the World Health Organization estimates that 15% of people have an accessibility need. That's about 1 billion plus people. And if you write with that in mind, not just that 15% benefit, everyone benefits from more accessible docs. So I'm sure, again, preaching to the choir, we all know what a good doc should be. But how do you achieve that? You could use a style guide. You could use one of the many style guides. They're all huge. And they all are constantly updated. And you've got to pick one or many. And maybe they're not perfect for your job. And I think there's three major challenges for using them. You've got to memorize all of those rules. And there are too many. Even our sort of smaller Grafana style guide, there's too many rules to remember. You need to be able to explain them. And that's, to yourself first, justify the rule. But also, when you're sharing knowledge with your peers, you want to say, you should do this because not you should do this. You don't want to be punishing people, I think, of the shellcheck linter for bash. That esoteric language is awful. But shellcheck kind of makes you feel like you're learning how to write good bash. So you want explanation of style. And you need exceptions. It's a creative task. There's known exceptions for all of the style rules. And then there are going to be times where you as a writer or anyone as a writer might want to either subvert the rule or just break it in a small instance. So it's got to be easy to get out of these style. You can't block your entire CI because somebody wrote in passive voice. So this is where Veil comes in. And I'm going to let Veil speak for itself. This is its own words for why he should use it. It enforces your own style. And it goes beyond traditional writing related rules. And we'll look at how you can write your own style in Veil. It understands markup really well. So it doesn't matter if you're writing in Markdown or ASCII Doc or Restructured Text. You won't be suffering for that. And you won't be punished for the special syntactic elements in each of those language. Veil cares about the pros. So short codes in Hugo do not affect readability indexes when you use them in Veil. And it's 100% offline. You never get sent to a remote server. It's just a CLI tool that you can run in continuous integration. So you have exactly the same experience locally as you would have in your CI. Not that all cloud services are bad. I think some are very important. It's fast. It can be used anywhere. And yeah, in CI and CD. So let's get started with Veil. So Veil uses styles. A collection of rules that you define or can borrow from other packages. And I'm going to talk a little bit about each of the major extension points to introduce you to the kind of rules you could write and then show you some of our Grafana rules. And then we'll look at how you can set it up in a project as well. So the most simple rule is existence. That is just, is this word somewhere in the page or in the scope? And perhaps you don't want certain words to appear. This is when you would use that rule. In a slightly extended example, if you want to replace that word with another, you use substitution. For Grafana, I think probably the most important one is alert manager, which can be spelt at least four different ways, but should only be spelt one. Then you have occurrence, which checks that you have a minimum or maximum number of times of an occurrence of a token. I haven't written any rules for this yet, but maybe you want to make sure you say please a certain number of times, but not too many times so that the compiler cares. Consistency. So this one is pretty useful for if you've got that British-English, American-English divide. But also other words can just be confusing, like advisor, advisor. I'm not actually sure which side of the Atlantic, which one is correct. And this just says, if you've used one in the document, you can't have used the other. So just stick with one of the two. We do use US-English in ours, but that you can enforce with dictionaries as well. Conditional is most commonly used for abbreviations, I would say. If you present an abbreviation in the doc, you need to have first explained it elsewhere. So you can write rules for exactly that. Capitalization, headings typically have sentence casing. Well, they do at Grafana. You can also enforce title casing. I don't think you can enforce arbitrary programming casing, like snake casing, pascal casing. Maybe. I haven't tried to write the rule yet, but haven't needed to. And then sequence. I think this is one of the most powerful ones that I haven't yet used. Vale tokenizes every word and also applies the natural language token to each of those words, too. So you can write rules that are not just don't use that word, but don't use an adjective and a noun in this particular grammatical context. I've not written any yet, but it helps you write much more appropriate rules with fewer exceptions than if you're just focusing on the raw text. And finally, if none of these are suitable, there's also a scripting language in the Tango scripting language, and you can use that for arbitrary logic. And I skipped one, which is metric. There's a bunch of built-in variables that Vale exposes like the counter words, the number of syllables, the number of sentences, paragraphs. And you can use those to implement readability indexes. I think all of the major readability indexes are already implemented in some form. Cool. So whistle stop tour of pretty much everything you can do to write your own styles. I thought I'd show off a couple that we have at Grafana. This one just checks. It's one of the most simple rules for the existence of the word will and says, don't use that. Engineers have a habit of talking about what a process will do when it's actually doing it. So we should talk in the present tense. And I think this highlights, so first I can talk about the structure here. It's a YAML file, so not only engineers can write it, but technical writers, every contributor. You may have reservations about YAML, but it is a lot of places, so a lot of people can read and write it. This one extends the existence rule. It sets a warning level, which is one of the three levels you can have. Suggestion, warning, and error. Typically, what my recommendation would be, have suggestions on when you're developing locally, but don't let CI block on that. Again, passive voice is not a reason to not build your site. But you might want to be reminded that, yeah, you just keep writing in passive voice. Here we provide a link to the reasons for the existence of this rule. Grafana, we default to the Google style guide, but have our own rules as well. So in this case, the Google style guide is exactly what we do. And the message field, I think, is the most important, going back to explanation. This is your opportunity to tell your users why the rule should exist. And so when the bot in CI says you've done a thing wrong, they hopefully can understand why they should make the change. My experience with bots is actually the opposite of yours. I feel like if I were to say, in a poor request, please, could you do this, people will listen. But if a bot says you've done this wrong, it feels like a punishment. So I'm trying to make the bots as friendly as possible. And so you've got to write friendly messages. And there's substitution in there as well. But we've only got one token, so that percent s is always will in this rule. We've got a slightly more complicated one here, which extends substitution rules. And we've agreed at the company to use dialogue box instead of modal or Google recommends dialogue. But I feel like more generally, people understand the window that pops up as a dialogue box. This one, because the Google style is different, we link to our own open source documentation, right? This is the first tool kit, where we have our entire Grafana style documented. And this rule has a swap field, which says if you see the thing on the left, replace it with the one on the right. Dialogue box appears. Should be dialogue box opens. A modal should be a dialogue box. And then a nice feature of Vale is it's written in go, which means RE2 syntax for regular expressions, except they've also extended their engine with a positive look head, negative look head, and their look behind equivalents. So you can write much more expressive rules more easily. So this says look for dialogue as long as it's not followed by box, and replace it with dialogue box. Cool. How do I do for time? I'm rushing a little bit, I think. 14 minutes. OK, well, we'll just have time for questions. So let's configure Vale for a project. This is the configuration file. And I think this shows the difference between what you expect the engineer to work with and what you expect all your other contributors to work with. Vale uses YAML for its rules, but it uses INI files for its configuration, because that's what the engineers have to write. And it's a better format. So here we've got a definition of packages. These are all preexisting styles that you can use, and immediately start linting your documentation with. I've put a bunch up, Microsoft, Google, Red Hat. Hugo is a config package, which I haven't talked about in this presentation, but it tells Vale how to ignore certain Hugo tokens, like the syntactic elements that are not important to the language. And shamelessly, I've put the very, very first release of the Grafana style up there as well if you want to use it. You can set many, many other configuration things, but here is the minilevel is set to warning. And styles path is where Vale stores the packages when it gets them, because it's got a handy command to just pull these for you, so you don't have to worry. And then that final section is how you control individual rules for different file kinds. So in this case, I say all files are treated the same. I say use just the Grafana style. So even though I've collected all of these packages, I'm only concerning myself with the Grafana one at the moment. And even though this is nonsensical, I wanted to show how you would turn off a rule globally if you wanted. So this is turning off the exclamation mark rule for Google. Since I'm only using the Grafana style, that's not important anyway, but replace Google for Grafana, and it would do that. Cool. So demo time, just so I have an excuse to get Emacs out. Can I make this full screen? Cool. And well, that's the whole demo spot there. If I hit here, we're in the directory. And you can see I've got a test file. I've got my veil.ino file. I've got a hidden folder, which is where veil's storing its stuff in case the network goes down. And I can't sync and then get, because everybody should be doing Docs' code. Sorry, I wrote this talk prerequisite, sort of like Docs are already code. So it's good that we followed from the DocOps talk. So here's my test file. And I have written a markdown file that says, testing the veil. Then to whoops, I've made a typo. Grafana is cool exclamation mark. All of those things are true. And here is my actual configuration for veil in this little repo. I'm going to pull the Hugo one. I didn't use any Hugo, but it's just default for me now. I'm going to pull the Google package. And I'm also going to use that Grafana style that I've got. I'm going to include suggestions, although I don't think I made any suggestion mistakes. And I'm going to use both of those styles in this. So I can head back to the command line. And I can run the first command, veil sync. And that's showing up OK. So this is how you get all your packages. It's very easy. Now I can suddenly lint all of my documentation with all of the Google style guide. And I'm going to run it on that test file. And here's the output that kind of exists on that title slide. You can see it's quite human readable. It tells you the rule on the right. It tells you exactly where it is on the file on the left. And then it gives you the message. It's very human readable. It's not very machine readable. There's other output formats that are machine readable. So if you want to do things with this output, you would use those. You can also, I think, provide go templates and do very expressive output with that. Cool. Let's go back to the presentation, which is in here. Where is my presentation? There we go. Cool. Right. So I wanted to talk a little bit. I know this is about veil. Thank you. About an additional tool, DocOps tool, ReviewDog, which if you are trying to introduce style to an organization that has documentation a lot, or even a little, but certainly if it has a lot, you need a way of making those changes incrementally. And you could do that one rule at a time across the Docs, but that's quite painful too. Or you can do it as people make changes. I only want to make sure that anything new is conforming to the style. And hopefully, eventually, everything gets churned and you now have perfectly styled documentation. And ReviewDog lets you do that. It's a command line tool again, similar to veil. You don't need to run it in the cloud. You can run it locally and in CI. And it takes as input, linting output from any tool. And it is also aware of unified diffs. And I'll talk a bit about that. And it can comment on all of the major forges, so GitHub, GitLab, others that I can't remember the name of. And you can tell it to post comments on pro requests. So you can get that response automatically. Almost human response. It's got a little icon. Especially in GitHub, it will give you suggestions as well. If you provide ReviewDog the action to fix the problem, it will post that suggestion automatically for the user. And it knows about diff context. So if you've only added and modified these three lines, it will only comment on those three lines. It will ignore the swans of other linting output that Veil might have thrown. I can show you a little picture of what that looks like. This is on GitHub. It's not documentation, but it's a bit of linting from Golint. If that was a formatting thing, it could tell you to, it would give you a suggestion. You can just commit a suggestion. Hayabusa is the lead maintainer of this project. And I think, so going back to the CI part, one of the other powerful parts of ReviewDog is that if you've got CI jobs, or if you've got generated files or formatted files, the CI workflow just becomes run that thing, that generation in CI, and then tell ReviewDog about it. Because it just runs a diff on the Git directory and uses that output to give you the suggestions automatically. So you don't have to write any of that glue. So I wanted to throw a shout out to Joseph and Hayabusa. And also everyone here that has contributed to Veil, it's been brilliant using the two projects. And I wanted to thank everyone here for coming. I did not expect it to be a full room. I should have done. We have some plugs here. The right is Toolkit that Grafana uses to maintain its style. If you want some pointers for style, I think that's a great place to start. It's an open source repository. If you do use it or have any issues with it, do raise them there and we'll try and fix them. And also as a company, we have a Docs channel in our community Slack. I think that's everything. So a little quick, but plenty of time for questions, hopefully. Thank you. Thanks. Thanks, guys. Thanks. All right. There's one right over there. The question is, how do you suppress warnings? So for example, I'm using exclamation mark and I have very reason for using it. Yes. And can I do this while I make specific comments and ask you, Dr. I'm so glad you asked that question. So the question is, I have broken a rule in my document and I want to tell Vale ignore that instance. I did it on purpose. And in my review note, in my speaker notes that I didn't have up, I had exactly this covered. So Vale, you can turn off individual rules for individual files. You can turn them off globally. But I see your head go down. But also it knows in each of the source files the comment syntax. So in markdown, which is the one I know best, HTML comments that have a special form, Vale rule equals no, turns it off at that point. And then as soon as you re-enable it with another HTML comment, it can restart. So you can isolate single lines. And with things like markdown, that's usually enough. And so you can turn off individual rules for very specific cases. Thank you. I think I've missed the point of using review dot. Because if you can integrate Vale in SSS in smart terminal, why would you use review dot? That's true. Yeah, so. Can you repeat it for people there? Of course, yeah. So the question is, given that you're running Vale in continuous integration anyway, where does review dog provide additional value? And for me, that's in the sort of conversational nature of the tool. I worked primarily in GitHub. And that framework of reviews on poor requests allows the bot to, instead of having CI feedback in, you know, you go to this tab or this check or your different CI system and find the error, you get it on your poor request in the same way that I as a maintainer would do if I remembered the rule myself. And I would say, please do this or here's a suggestion. And I think having that feedback in your forge, whether it be GitHub or GitLab, is more effective than going out and finding it in a bread. You're right. I hope that because you can do that. And maybe, for example, you can. Yes, yeah, absolutely. Yeah. You can use the comment action in GitHub, yeah. Exactly, yeah. Thank you. Another question. Congratulations on surviving your first time. Yeah, thank you. I'm very sweaty. We'd like to some style guides and show how they can be imported. And you also mentioned the Chicago banal style. Is there a Vale plugin for the Chicago banal style? I don't think so. Yeah, I think not yet. But they are. I think people have asked that. And I think it is under copyright. Well, it's obvious that the copyright has been passed. We have touched this subject before and had problems. So I will remind myself to look into it again. Yeah, so I should say that the packages of style guides you've seen, like the ones that you've seen, the ones of style guides you've seen, like Google and Microsoft, are themselves openly licensed in a way that Vale can write those rules. I don't know how many people have contributed to the production of those styles, but there's a lot of rules across them. And I didn't even demonstrate, I think, half of the sets. There's one specifically for accessibility kind of concerns. There's one that's got some very odd rules about certain language that you might not want to use in job posts. But if you don't want to see bro in your language, then you can use job lint for that. Or, of course, write your own ones for anything you're missing. I think I have time for one more. OK, yep. Oh, I saw you first. Sorry. No, they're here. We have a lot of, or we have almost only non-native English speakers in our documentation. That's something that Vale can help with, because we see that for language group, you see certain patterns which make it very difficult for another language group to understand. So having a lot of non-native speakers writing your documentation, is that somewhere that Vale can help? And again, my speaker notes, so next time, I'll make sure these are available. I wanted to talk about the importance of the privilege I have as a native English speaker and how I find it hard to write documentation and the additional challenge of being a non-native English speaker. I think that's why you've got to be compassionate in the rules. You can definitely write rules that are specific to those patterns that you might see in non-native speakers English. We have one to that effect, I think, like allows to, is commonly come up, so we have allows you to as a replacement. And you want that to be fed back to the contributor in a way that is not punishing, but helps them know that they are writing a better form of documentation. So thank you very much. Is there a single second more? I was wondering, do you know if people are also using Vale to style-link comments in code? Oh, are people using Vale to link comments in code? I've actually been asked that once already today, and I don't know the answer. They are. Yeah, brilliant. Because there's been lots of issues about people asking for certain configuration fit to do that. Brilliant. So yes, definitely is. Cool.
Taming Abstraction
Thank you very much. Thank you, yes. And apologies, as we say, for being the third British... Is this on? Can you hear me? We've got feedback just being louder to the back of the room. Okay, how's this? Okay, alright, I'll try and stay standing up. And apologies for being the third British speaker, as we said. Maybe I should start off with... But I am another British speaker, writing in American English. I won't say too much about that, so I may get into trouble at work. But also, carrying on the same theme, we use the same docs as code. I'm a writer at Couchbase. We're a large NoSQL database company. Other NoSQL databases are available, so I won't be talking too much about Couchbase purely. Going along with a lot of what's been said, we do that docs of code. We do the local linting and everything else and rendering. Although we're actually bumping up against a lot of limitations around the ideas of docs as code. We have problems where having a mixed code and docs repository is not necessarily the best answer. And we do have several dozen separate docs repositories for different products. We run up against all sorts of problems. But I'm just here to talk about one particular problem, which I've discovered, as I've been trying to rewrite a section of our documents. And that's this over-optimization problem that we sometimes have to make it easier to maintain docs across versions versus how do we make documentation less opaque in the source for new contributors. So I'm not sure how much of a widespread this problem is, how many people are working with scale, or even in the room how many people are writers versus doc ops versus engineers. But just to give an example, I mean, for me, how many of you are writers with an interest in doc ops rather than the other way around? Who's a writer primarily? So not so many writers. How many are doc ops who care a bit about writing? So more of an, and how many are just engineers who care about writing? So much more of you there. So this is interesting because in the first talk in this room, Lorna described herself as an engineer who has a writing problem. And a lot of what I'm going to talk about comes from this particular, sorry, I just had an about me thing. I found the older about.me website. It was just interesting because nobody uses these anymore and I lost the login 10 years ago. But there's not much to say about me other than I like plants, especially food plants. And if we do finish too early, Christoph and I can both answer questions on systems thinking applied to growing fruit and vegetables, but moving strictly on. So we're trying to read all that. There's far too many lines for a slide, but engineers do like to over-optimize. You know you do. If there's any sort of problem, you know, it's DIY. Do not repeat yourself and you want to get those optimizations in straight away. And so as I've been going through old bits of the document, sorry, that's not really very big. Let's I think I've got it bigger on the next one. Yeah, I found bits where one of the Java engineers had started once we had a Scala SD-Cache. Scala SD-Cache as well as a Java SD-Cache started to put things in to keep everything the same file in both, but just slightly changed to pull out information to tell you where they are. But there were only two different languages to do between out of 10 SD-Caches that we have. And the more of this sort of stuff you have, the more anybody wanting to do a quick edit on your docs who doesn't know all of the structure of your docs has to start digging and finding things. But this isn't too bad. And I should say as well before we go into deeper waters and there are some very deep waters in our docs of trails of abstraction which lead down a long, long way. But I understand why all the people who were involved did it and there were extremely good reasons at the time because we have too few people doing too much work like most places because everything moves so quickly in software and we always have more products with the same resources to deal with them. But I just want to take a quick example which started when I wanted to do a quick edit on our mobile product. We have an offline first mobile database which syncs with the main database and it's embedded with various languages depending on your mobile device or even Java desktop as well as Java on Android. And I couldn't see the bit of a page I needed to edit here because I couldn't work out what was going on. I could see I needed to start looking at various things which were being included like these include partials up there at line 8. And then there's the page context for Java adoc. So I looked for that. We're using Antora with ASCII doc and Antora has a structure where for each module you will have the pages, you will have partials which are being pulled in whether they're partials with attributes or partials with chunks of text that you're reusing. But these weren't in the partials module. If you look closely after the partial dollar string there's an underscore. There was already a separate underscore partials directory within pages as well as the partials directory which you'd expect to find. So following that down I found this. This was already including more partials so root pages, page context and so on. But there were some other strange things going on here. I'm not sure if you can see it, but I'm not sure if you can see it. I'm not sure if you can see it, but I'm not sure if you can see it. I'm not sure if you can see it, but I'm not sure if you can see it. I'm not sure if you can see it, but I'm not sure if you can see it. It made sense to them to put all this stuff in so that each time new versions came out, each time this stuff was reused in other language versions. It wouldn't break for them and they wouldn't have to do any extra work. But for someone trying to fix a quick bug already they're going down more layers, three layers. We get to that general page parameter after the Java one. And this one just followed on to another one which followed on to 223 lines of parameters. Now, at this point I had just given up on all hope of doing a quick fix in these docs and I don't expect anyone else will ever do a quick fix on them. And I know the people who are now maintaining the docs are changing this section to flatten this out. I don't like stop signs, they're a bit sort of warning-y. Let's have a clear calm sky to relax. So clearly you can over-optimize too far, but if you don't take shared cases and pull them together with shared files of text with useful parameters, you get to a point where you are repeating yourself way, way too much in docs. Especially so for example, our server SDKs we have ten different languages and counting. And there's a lot of common content if it's on something like field level encryption. Most of it's the same as just a very tiny code snippet needed to show the API. It's mostly talking about what you need to know, what the limitations are, how to implement it. So maybe we could start with just some simple things here. So a partial sphiyog with attributes to each module. Now, the thing is we're doing it this way in Antora and ASCII Doc. You have to include the partial sphiyog in each document file. So instead, there's a much better way of doing it. And that's just to stick it in the Antora YAML file at the base of each repo and just put all the attributes together. It's hidden from the new user to ASCII Doc because they don't know that this stuff is there so there's things you can do to point the way. But let's just pause and think about these dozen, well, eleven, their attributes. It's everything which would need to be changed from a change of a patch release or when this stuff is replicated to a new .minor. Or when we pull in a common file and we're talking about that SDK and doing it. And no more. So I think for what I'm doing at the moment, the balance is about right. But the reason I'm here talking now and it's interesting, Jack mentioned his talk with his first FOSDEM talk. This is my first FOSDEM talk but I last gave a talk on edible landscaping. I mentioned the gardening stuff before but the reason I'm here is not so much a talk as a set of open questions because we're all here with an interest in making maintainable docs. And so when we get to the end in a couple of slides, what I really want to know is how have you been able to find a balance? Is this something that you've come up against where you've got new contributors coming in to help with your documentation and you have to spend way too much time helping them understand what you're doing or do you not have enough contributors to make it worthwhile and given shortage of writing resources, it's always going to be better to slightly emphasize the optimizations. So that balance is really what I'm here for to find out because I don't know. What I've showed you before on that Scala one is about where I'm at but I'd like to know what other people think so please, over to you. Alright, this is always a dangerous format when you open it up. Yes. Who has this problem also? Who's running a project for this? Abstractions keeping in. As I say, we're right on the edge of Docs' code sort of breaking down under the strain of how much stuff we have going on. So maybe we are the only ones and other people will have this problem later. It reminds me of the problem of abstraction in code as well. I think the shape of the problem is very similar. So I think my answer, even though I don't write a lot of Docs, my answer would be the same. It depends. People that work with the system a lot and they are interested in all of the abstractions and it's necessary. Then you go more towards that side. If you have a lot of beginners or people that are less experienced with those kinds of abstractions, then you lean more towards that side. I don't necessarily see a big difference between abstractions and code versus in like, DocOps. Absolutely. So for anyone who didn't hear, there's very little difference between abstractions in code and in Docs. I'm glad to find people saying there are no clear answers because that means we still get paid for making decisions that LLMs couldn't do. Can I ask you a question? Is the couch-based Docs open? Yes. So almost all of our Docs are in open Docs reposing GitHub. Some of them are in Docs' code repose mixed with the code. And the one for our cloud as a service product, which necessarily has a closed repo for the control plane, we're in the process of pulling the Docs out of that. And how many external contributors? Contributions are there, generally speaking. I know that it's not that common for open source documentation to stop, but how many contributions do you get? Yeah. So we're an enterprise company. Almost all of our customers are large enterprises rather than small groups of developers. So that cuts it down even more. And yet we get some customers coming in, putting in pull requests. Whenever there's a project where people are encouraged to go off to GitHub and contribute to open source code, we tend to get our fair share of people adding in that as well. So sometimes we do see people coming in and missing how stuff is structured and doing a less helpful contribution, definitely. Do you have Docs for your Docs? Yes, but are they up to date? No, they are not up to date. I think when I was having this problem I know a lot of modern tooling, which we will be getting to in the next couple of talks, can make things overly complicated for people who just want to fix a typo. Yeah. You prefer some in your tip. We're on the other end of the spectrum where we're a small developer team with a lot of mechanical process engineers that we want them to write. And often they write things and they want to first each rate 15 times over before they publish and they never come to publishing. The most important thing is the lowest barrier to actually write something. So we tend to steer away as far as possible from this abstraction because we see that it's not difficult, but it's like that one tiny thing that maybe doesn't hesitate to actually do that something. Could Vale help you with removing abstractions? I don't know about that. We're already using Vale. I'm repeatedly told on all the faults I have by Vale and I'm getting hardened to it, but I do have a very thin skin. There's a connective question. Is this problem one person or is this a systemic problem? Because it might be that you have one person who's just really enamoured by abstractions, who just keeps adding abstractions. Well, I mean, these were engineers slash writers who A, loved these abstractions and B, really needed to do them because we just didn't have the number of people needed to deal with this. They've moved on and we've got new writers who don't have the same engineering background and we want more people getting involved in those mobile docs. So we are definitely moving away from that. But it's at balance. We're always going to need some abstractions to make it maintainable. So I'm rewriting our SDK docs for, well, for 11 SDKs because we've got another one going GA at the moment because our existing SDK docs are awful. And I know that because I wrote them six years ago. And, you know, I've always known them about it, but now I can see a way of making them good. And that includes also dealing with a lot of tech debt and the level of abstraction is about balancing out how to keep that tech debt lower as we go along as well. Yeah, so we're using a good base flow and we have our own local environments with reviewers and with other people. We would like to have some one click solution to just fix this comma separately. So we're doing some experiments with, yeah, hosted editors so that at least you can, although there might be that flash, although you might not be on the same full test, but you don't need to set up a whole system to do something with this. Absolutely. Yeah. So, so, I mean, you asked earlier about ad docs being open so you can click on edit this doc and it will open up the GitHub repo. And if you open up that one I showed at the beginning with with all that track, that's when you would be totally lost. Yeah. Yeah. I'm an increasing fan of tools like MDX, which is not down plus react components. And that's great fun in the browser, but terrible for editing. Yes. But it's cool. So you keep doing it. Any other questions, statements, experiences or we can refresh the room for a little bit longer before the next full. Oh, yes, I could just add that sometimes the abstractions are required because I work for a mechanical industry. So what happens there is that I use a very same component on different machines. So the obstruction that cases required because otherwise, I mean, you can't go too much in detail for that component because it could be leaving very different machines. So that's why in those cases abstraction is, yeah, you have to carefully write what you're writing because otherwise you risk to not being able to recycle that component that piece of writing that, you know, that fire. That's where I think I'm going to. There is an answer. Yeah, the room isn't there, but it's not open source. So I'm not talking about it. Well, the tool. The top of the source. The stuff that most people. Yeah, I see. But the core of the heart, you could be. I think you're open source. All right. Let's move it to that. By the way, I have one question for the people who created the problem with the many abstractions or the repeated abstractions. Do you have a way to help them face it? And maybe they come up with a solution or did you die? Oh, well, they left the company long ago. Yeah, very sensibly. Yeah, yeah. But thank you for helping me think out loud about our problem. And it's interesting to see that it's not yet shared in many places. But that's good because I do like a challenge. Sorry, one more. So we have quite a similar problem at my company. It's the inter-place. Because we have, let's say, one size fit or inter-place, that can be used in places where I do lane-eskiders, places where I do aqora, and then in places where I do different visual performance. We are waiting for a thousand of situations, and we have a lot of people there. And then it's not any additional problem because changing the top lane is a very difficult journey. We need to get with ten people and change the lane. So that's, in my company, that's the case. It's just changing that. Too much airport for inter-place. Yeah, those if-defs, they should definitely be the exception rather than the rule. They're necessary sometimes, but too many and you're lost. Okay, cool. I don't know what experience the innovation is using. It seems like that, traction should be like supporting the US operation concerns, bringing the documents where they want to just, like, induce a bounce. But it looked like it was something like real congratulations that you can ask to display something like what we did with Goji, and the structure of the installation, and what we construct and take like more depth for you in the end of the story. But here you need to have the knowledge of all the sub-cores, and if you do as well, from the power that you showed, it seems like it is your best knowledge of what happened to you, and what you think that happened to you. Yeah. Yeah, so I think one way around that limitation that I am putting into the sort of, I should have put it on a slide is that the new pages are going to have comments at the beginning pointing out everything that's connected, where the page is, where it's pulling from, and make it clear up front for editors. It says a note for editors and does that. I should have, I've got it open here, but I can't, for some reason, my screen mirroring is not the right sort of mirroring to Dragimax up there to show the thing. It's the most better time to say that. See Richard after the talk. Thank you.
Easily Going Beyond MarkDown with Material for MkDocs
No, it works. Okay. Gotta get, there we go. So, thank you very much and enjoy. Yep, thank you. But before I get started, is this readable in the back or do we need to blow it up? A little bit bigger. How about this? Okay, good. All right. So, welcome to my talk on materials for MKDUX. Let me quickly introduce myself and my co-speaker, author. So, Martin isn't here today, but... So, I'm Kenneth Hosta. I'm an HPC system administrator at Gantt University. HPC is high-performance computing, supercomputing. Some people may not know this, but it's lots of servers, lots of noise, lots of money, lots of annoying users as well. There's a lot going on there. I'm the lead developer of EasyBuild for the last decade. So, EasyBuild is a tool for installing scientific software on supercomputers. It gets a lot, it gets very fun, I can tell you. I'm involved in way too many FOSS projects. I patch things and try to fix things left and right. And I've been attending FOSSDOM since 2013. If you think FOSSDOM is total chaos, you should try organizing a dev room, doing a talk and planning live demos during your talk, which is what I'm going to do. So, I actually had to run out of the HPC dev room, which I'm co-organizing. I'm a big fan of Material for MKDocs since I discovered it, and I think more people should know about it, so that's why I'm here. The other person on the talk as an author is Martin. Martin Donut, he's the lead developer of Material for MKDocs. I reached out to him to ask him, please submit a talk on Material for MKDocs to FOSSDOM. He said, I can't make it this year, then I said, OK, I'll just do it myself. And I had a call with him to discuss what should be in the talk, so he's been involved. All right, so why do I want to give this talk? Well, Material for MKDocs is great. More people should know about it, more people should be using it. It's very easy to install and use. You get very good results with pretty minimal effort, and I'll actually show this hands-on. Tons of great features. It's actively developed. It's open source, of course. And there's a very interesting part of how it's funded as well, and I'll cover that as well. And I was very shocked that there's never been a talk ever at FOSSDOM on MKDocs or on Material for MKDocs. That's just wrong to me, so I'm here to fix that. My personal journey is, I've actually haven't been using it for very long. It's pretty recent, basically, since 2021. I had to create a tutorial website, or I wanted to create a tutorial website for EasyBuild, for the tool I'm mostly working on. The existing EasyBuild documentation was in Sphinx. I wasn't terribly happy with that. It felt slow. It was using RST. The syntax didn't make sense to me. It was very difficult to work with. We were not getting a lot of contributions to that documentation, so I was looking for other things that could be possible. The tutorial was a totally new project, a totally new website, and I started looking around. I found Material for MKDocs, and I was sold after like five minutes. That tutorial was built with Material for MKDocs, and shortly after we started porting EasyBuild documentation, also our HPC documentation in Gantt, we also started porting that to Material for MKDocs, because it just made a whole lot more sense and was a lot easier to use. And also new projects that I've started since then have always been using this tool to create documentation and tutorials. So to start with, what is MKDocs? How many people here are familiar with MKDocs? Who has used MKDocs? About half of the room. Good. MKDocs is a static site generator. It's not a very complex tool, I think. It has a very strong focus on building documentation for software, so technical documentation, code, all these kind of things. The documentation sources themselves are written in Markdown, which is one of the things that sold me to MKDocs. Markdown is everywhere. If you're doing pull requests on GitHub or GitLab, issues, formatting there, Wikis, it's all Markdown, so the documentation that you're writing should also be Markdown, just to make the jump a bit smaller. To configure MKDocs, so how the site should look like and all the bells and whistles it has, that's all done in a single YAML file. Maybe you don't like YAML, but at least it's a single file that you want to look into and figure out how to configure it. MKDocs itself is implemented in Python. That's other than when you install it, you don't really notice that. That's probably a good thing. But it is very easy to install, use, customize, and extend. So how do you get started with MKDocs? This is a bit of a long list. You install it, pip install MKDocs, basically. You started creating a landing page, so an index.md and a docs folder. Typically, you can change that if you want to. You create a minimal configuration file, and then you launch MKDocs. You do MKDocs build that will take the Markdown that you put in the index.md. It will generate an index.html from that. You can open it in your browser and you're good to start with your documentation site. You can do MKDocs build strict as well. If you have any mistakes in your documentation, like you're linking to a page that doesn't exist, for example, it will go ahead and warn you about that. And that's very useful in CI. If you're making changes to your documentation, you can run this in GitHub Actions, for example, and it will warn you that something is wrong and you shouldn't be merging those changes. There's a way to live preview the documentation while you're editing it as well through MKDocs.serve. I'll show you that as well. And now you can go ahead and write your documentation. So showing all of that on the slide is very boring, so let's do it hands-on. And let's see how quickly this goes wrong. All right, so I'm essentially starting here from an empty folder. There's an empty docs directory, just so I don't forget to put stuff in there. The first thing we'll have to do is install MKDocs. It's not here. So this is just to pip install MKDocs. If you're a little bit familiar with Python, you know you have to be careful if you do pip install because it may end up got those somewhere. So what I want to do here is create a Python virtual environment. If I remember how to do that, all right, so now I'm in the virtual environment and in here I can just do pip install MKDocs. And if the Wi-Fi works, that should be working. So now I have MKDocs available, whatever version is there. Okay, so that's the pip install part, that's step one. Now I can create a very minimal MKDocs.yaml. And all you really need to put in there as a very minimal thing is the name of the site. So let's just toss them here. And in the docs folder, we want to create an index.md in markdown. So let's say hello fosdem, this is a demo. Okay, that's all we need. We do MKDocs build. This should be very quick. It generates a site directory with a whole bunch of stuff in there including index.hdml. We can open this in our browser and it looks like this. Hooray, it works. We even get a search function here. Of course now there's not a lot to search yet. You can search for fosdem and it will bring me to that page. Okay, so the search functionality is already built in and ready to go. Now once you start creating a couple more pages, let's say getting started.md. Like this, if you save this, you have to do MKDocs build again. And you have to refresh this site. And then here you see there's a getting started page as well. Now Firefox gets a bit confused because this is all static html. So it says what do you want to do? I want to open the page. What's more interesting if you do MKDocs serve. So now you're getting a small web server here running locally. You can click this. You see the same website. But when you start changing stuff, for example, in an existing page, let's say magic happens. As soon as I've saved this and I switch back to the site, this should not refresh. Oh right, okay. See demos always go wrong. Try again. You save it and if you switch back it pops up there. So you're automatically getting live preview while you're editing the documentation. To me this is absolutely brilliant. Okay, now what I don't like about this is this theme. Like what the hell is going on? The lines are white and getting started is here. Where's my hello page? So weird stuff is going on. That's where I think Material frontman kdocs kicks in. So Material frontman kdocs is a theme for MKDocs. It makes things a whole lot better, nicer to look at, just straight out of the box. Very easy to use. And it comes with a whole bunch of plugins and extensions. So extra features that MKDocs cannot do by default. So I see this as MKDocs with batteries included. So this is actually how MKDocs should be out of the box. Again, easy to install, use and configure. All you need to do is in your Python virtual environment. So I'll have to kill the serve here. You do an pip install MKDocs Material. So you just install an additional Python package, which will bring in a whole bunch of extensions and there's a whole lot of stuff going on here. I'll serve this again. Now, if I look at the website, nothing has changed yet because I have to change the theme as well that's being used. So in my MKDocs, I say theme, name, Material. And as soon as I hit save on this and I switch back, I think it needs a refresh. Why is it not working? Oh, something went wrong. Ah, demos. Okay, let's try restarting this. Okay, I'm not sure what went wrong with the live preview. Usually that works. So this already looks a lot better. So at least now I'm seeing my pages and the search. The search here is amazing. It's blazingly fast. Even if you have pretty big documentation, it highlights the things you can customize the search. You can rank pages up or down. If you want them to be more prominent in your search results, there's a whole bunch of stuff you can do. All right, so they get started with Material for MKDocs. Just with pip install MKDocs Material. You change your MKDocs YAML to use Material as a theme and things start looking a lot better already. And now the fun really starts because there's a whole bunch of plugins and extensions you can start using as well. Now, I'll do a quick cheat code here because the MKDocs YAML I'll end up with is going to be pretty big because I want to show you all the bells and whistles. I'm not going to type all of that, so I have a hidden file here that I'm just going to move into the right place. And we open this and you can see there's a whole bunch of stuff here going on now. I'll explain it in the slides what's going on. So one of the first things you can do is you can start playing with the colors. You can change the accent colors. So here I use like Fossdam purple. That's very easy to change. You just say, palette primary color should be purple. The accent color, so that's when you hover over stuff, should be blue. So it's very easy to play with the colors if you're interested. But it's also very easy to do is introduce light and dark mode in your documentation. So with a little bit more stuff in your MKDocs YAML, you can say I actually have two color schemes. I have a light mode and a dark mode. The dark mode is called Slate for whatever reason. The light mode is called Default. Okay, now you know. And what actually happens when you do that, so here when I moved that big configuration file in place, it actually already did a re-render and now I have dark mode here as well. And it's actually a dark mode with tuned colors. So I'm getting Fossdam colors in my website now as well. So that's one small thing that's very easy to do. Now let's show off some of the additional features. Let me start a new page here. Let's call it MaterialMD. And let's start showing ContentTabs. Material for MKDocs. Save this, go back. It should be picking up on that straight away. Okay, so ContentTabs are a way of getting tabs and like a subsection of your documentation page. And the best way to show it is to really do the demo. So I'll copy paste this markdown code in this one here and I think it needs empty lines in between or it will not be happy. Right, and now here I have tabs in my documentation. So that's very nice if you say I need to show different examples with C++ Python different code, for example, this is a very nice way of doing it because people can just pick what they're interested in. You can also make sure that people can somehow give a preference, like always show me the Python stuff. And it will remember the first time they picked something and throughout the whole page it will show always the Python example by default. So it does some caching of this as well. To enable this you need to enable two extensions, SuperFences. So SuperFences is something where you can embed content into each other so you can start with ContentTabs that then includes other stuff and like it basically goes recursively so you want to enable that. And then you do Tab and UltimateStyleTrue. While it has to be true, I don't know, but fine, it works if you do it like this. CodeBlocks is a very nice thing as well, also built into Material. So let's show that off here. We can do Code, Block with Python code and that looks very nice. So this uses Pigments to do the syntax highlighting. You tell it that it's Python here, so it doesn't figure that out by itself. You have to tell it I could try rendering this as Shell and it's probably going to look a bit funny. Okay, so it looks reasonably okay. So all of this works out of the box. They don't have to install any additional stuff to make this work. It knows about pretty much all the programming languages out there. If you want to try Fortran here, it will probably still work. Another very nice feature is what's called Admonitions, which is a very strange word to me. I'm not a native English speaker. I'm not familiar with this term, but it's like nodes, warning tips. So all of these kind of boxes you can include in your documentation is called Admonitions in Material for MKDocs. So a small demo of that is here. Let's do Adma, whatever, nodes and stuff. Again, it needs an empty line in between or it will not be happy. And you start getting nodes. You can use custom titles here. So all the Admonitions have a particular type, which mostly defines the color and the icon you get here, and you can change that title here. So you don't get the default. The default would just be tip, I think. So if I remove this, you'll see tip instead. I think there's a more normal name for me. Sorry? Callouts, yeah, okay. Fine. Over naming you can always discuss. I didn't pick the name, so don't blame me. Blame Martin for that. No, no, it's fine. To me, it's a confusing name. Another thing I really like and I know very much that not everybody is a big fan of that is emojis. You can use emojis in your documentation. I think this is great. It makes things a bit less serious, a bit less lighthearted. You can have some fun in your documentation as well, because some people think it's very boring to read documentation. So for some people, this works. There's emojis, there's icons as well, so there's an arrow in here. This arrow right is not really an emoji, it's an icon. So this works pretty well as well. Again, I want an empty line in between here. So be careful if you have too many Belgian beers. You may get sick in the morning. All right, so this really works well for me. And the documentation for Matilda from K-Docs, there's a search engine here, so you can look for beer, and it will give you all the options that you can use. You can look for arrow, and if you click something, it will copy, if you click in here, it will copy paste it to your clipboard, so you don't even have to type it over. Really well done. Over 10,000 icons, so you can probably find something that you can use in there. All right, another very cool feature, which I haven't used myself very much, is using Mermaid. So Mermaid is some kind of JavaScript tool to create diagrams. So with a block of code like this, a block of mermaid code, you can start including graphs in your documentation like this. And these could be very complex. They render very quickly, and it not only supports diagrams like this, but you can do pie charts, you can do UML diagrams, you can do a Git branch workflow kind of stuff, so this is very rich in terms of what it can do. Again, you have to enable the corresponding stuff in your markmkdocs.yaml, so you need super fences with some custom fences and yada yada, just copy paste this. You start playing with diagrams in your docs. You hit save, and if you're quick enough, where is this site here? You start having diagrams in your documentation as well. So if you need this kind of stuff in there, this to me beats putting pictures in there, because here you can copy paste stuff as well, right, from your diagrams, so this is better in many cases. All right, I think I'm doing quite good on time. The last big feature I should say that I want to highlight is the block support. So this is quite new in material for mkdocs. It has been in the works for quite a while, but now it's finally there in the open source version. So this is something special, a dedicated plugin for integrating a blog in your docs. All you do is you do plugins enable blog, right, and then you can start in a special structure here, so you can do docs blog posts and start creating markdown files in there. Let me show you what happens if you do that. So we want to exit here. You want to make sure that the blog part is set up. So this part is auto-created here by mkdocs. As soon as you enable the blog's plugin, it will create your landing page. There's no post in here, of course. So this is empty. I can create a small markdown file here. Let me check copy paste. So here you can see this has a date. This is basically the publication date. So if you put this in the future, it will not show up yet until that date hits. I think so. I can try if that works. So here blog. This is our blog post that we just added, and this has a dedicated page as well, which here it's hard to tell, but in the URL field, it will actually use the date that you've put in the mkdocs as well. So it's like everything is nicely date stamped and so on. I think if I put this to a future date, it's not going to show it. So let's try February 5th. And now the post is... Ah, okay, it's still here. All right, fine. But there is another way you can do draft through, so I don't want to show this yet. And then, at least on the landing page, it should not be there. So as long as it's a draft, it will not show it. If you flip it to draft equals false, or just remove that, it will come back. Okay, so this is built in into material for mkdocs. This is quite amazing. All right, so lots of features. There's lots of stuff I haven't showed. It can do a whole lot more. So please take a look at the documentation of material for mkdocs itself. It's a very nice tool. Another aspect I want to talk about very briefly is the way this is funded. So funding is a very big issue for lots of open source projects. And Martin here has come up with a way that works amazingly well, and it's actually pretty simple. So material for mkdocs is what's called sponsorware. So there's an open source version of it available to anyone. You just download it on GitHub. You can pip install it, and you can start playing. But there's actually a private version as well, which has a couple more features already implemented, but they are not available in the open source version yet. To get access to this private version, you have to become what's called an insider. So you become an insider to the project by doing some kind of monthly donation to the project. And I think it starts as low as $15 per month, so it's quite affordable. You can also do a yearly donation if you're up for that. And then what happens, you get access to all these new features that are only available in the private version, in this insider version. But eventually these features also come back to the open source version. And that happens when a certain funding goal is being hit. So Martin sets goals. For example, I want to get $10,000 a month of income. And then all the features that I list here will become part of the public version. So as soon as they hit that target, it works. And this is nicely covered in the documentation here. So on the documentation of material for MK docs, you can see they're now getting over $13,000 a month, which is quite a lot, right? So Martin is actually building a team, a development team route, material for MK docs, thanks to this funding. So right now this is the funding level. And he says as soon as we hit $16,000, we will move all these implemented features from the insiders, from the private version of the tool to the public version. And then they stay in the public version forever. So as soon as they hit this target, that works. Now this is interesting because they've hit the $14,000 target already, but then some sponsors dropped out, and now they're back to a little bit below $14,000 again. But that's fine. Once it's public, it stays public. What's amazing to me here is that the private version is just a private fork on GitHub. So you get access to that private fork, you get added to the fork essentially as a contributor, so you can access that code. But this model somehow works. So you could say, okay, if I get the private version, I could just give it to anyone, right? And then it stops working. But for some reason, that doesn't happen. So it's like this honor system, like if you sponsor the project, you get access to it. And literally here at the bottom of the page, it says, please don't distribute the source code that you get access to, and apparently that works. And they keep getting new sponsors over and over again. They're hitting these goals every couple of months. So that's maybe an idea for other open source projects as well to take a look at. So yeah, Martin told me that this was a bit of a jump, like a gamble, let's see what happens. And it's been working amazingly for them. So he's able to build a development team rather than just having to work on this himself. Okay, a lot of features that I didn't cover, which I'm not going to get into here, check the documentation. One thing I do want to mention here, it also makes it very easy to publish your documentation on GitHub pages or GitLab pages. So it has an MK docs GitHub or GitHub pages something. And if you integrate that in your GitHub actions workflow, it will push it to GitHub pages and nicely integrate that in your GitHub account. Yeah, that's all I have. And hopefully there's time for a couple of questions. Thanks. Let's have a couple of questions and we'll see how fast they go. First question. Very quickly. I have two of them. Do you know if the icons, not much the emoji, but the admonition or whatever, and also the charts, are they vectors or a roster? So you're talking about these, right? So the question is, are these vectors or are they bitmaps? I'm not sure. I think they're vectors, but I'm not entirely sure. You could check, I guess, if you zoom in, where do I have that website open? Here. So if you zoom in, you can tell that these are probably vectors, right? Yeah, right. So they look pretty good. The question is, if there is any kind of, maybe it's a stupid question, but is there any kind of translation to this kind of documentation that I would say PDF? Okay. Is there a way to export the documentation into PDF? I think there, the answer is no, but they're very much aware that this is a missing feature, let's say, and that's something they want to work on. I'm not entirely sure if that's correct, but I think that's what Martin told me. Like, compared to other tools, there's like Docosaurus and there's Sphinx and there's other things. Some of these tools can do a little bit more. There's a plugin. Yeah. There's a plugin for PDF export. Look, yeah. So the plugin system in MKDocs is very nice. Yeah. Yes? That's great. Yeah. One of the nice things about Sphinx in MKDoc is that you can easily do a code documentation so you just send something into the documentation and grab the link. You mean like generating API Docs or? Yeah. Yeah. There's a plugin for that for MKDocs. I didn't show it here. I actually don't have it on the slides. Okay. I had it somewhere, but I think it's MKDocs. Oh boy. Doc strings. Yeah. So this is MKDoc strings. That's the plugin you want to generate API documentation. I'm using this in the Easy Build documentation, for example. Works fine. Yeah. So the question is, did you run into issues with complexity because it's like tool or another tool and when it talks, you know, maybe it's easy to contribute. And yes, so knowing when to look if it's issue in the material or if it's underlying MKDocs and how to do it. So the question is, did you run into issues with complexity because it's a tool on top of another tool and if something goes wrong, where's the problem? Not really because usually if something goes wrong, you get like a Python crash and you can tell whether it's in a particular plugin or in material or in MKDocs itself. I haven't run into many issues like this, but if it happens, it's usually quite clear. And if you don't know, you just report the issue to material for MKDocs and one of the maintainers is going to tell you it's not an issue here, it's an issue there, you should report it there. And you say one thing on top of another, that's not entirely tool, it's like plugins. So they do like integrate with each other and there's some complexity there, but yeah, usually it's quite clear if you get like a Python crash, you can tell from where it's coming from. Can I stop here? Yeah. Thank you very much. Okay. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you.
Drop the docs and embrace the model with Gaphor
Get rid of all those pesky words and use images instead. Which I'm on board with actually. Yeah. And over to Frank. Thank you very much. Thank you. So, maybe hold off with the applause till I'm finished and then make up your mind. So I'm Frank van Bever and I'm here to talk. Well, I have this talk. Drop the docs and embrace the model with G4, which might be an incendiary statement in this room. Louder. Louder? OK. I'll try. Can we also put a mic down? I think people are deciding. So, yeah, and so it's going to be a quick introduction to what model based systems engineering is using free and open source tools. That's what we're here for on Fosdham. So first real quick, because this is not the interesting part. I'm Frank, father of two. If I look tired, that explains it. I'm a bass player. I've been successfully daylighting as a software developer for the last 10 years. I trained as an electrical engineer. I actually had my digital systems like my DSP courses, so digital signal processing on this floor in this building. And so, yeah, I'm specialized in embedded systems, but I like to think I'd know just enough of the rest of the stack to be dangerous. I work for a company called Mind. We do embedded software for well, free and open source software for embedded systems. If you enjoy the referencing pointers as a job, then come talk to me later if you're looking for a job. And so currently I am a software architect in a robotics company, a company that makes autonomous guided vehicles. So with that out of the way, quick outline of what I will be talking about. So first, if we talk about model based systems engineering, I, well, I can't talk about what modeling is without explaining what model based systems engineering is. There's three pillars to MBC, modeling language, modeling method and modeling tool. And then I'm going to be talking about G4, free and open source application. There's a modeling tool and then how you can also use it for documentation and additional tooling that you can build around your model. So first of all, what is a model? And it's a bit abstract. It's an abstraction of a system aimed at understanding, communicating, explaining or designing aspects of that system. And so a model really is a central repository for design decisions you make about a system. And these are captured as model elements and the relationships between these elements. Typically, you have, you use these graphical languages and you have a set of views that describe this, the described model. However, and that's an easy trap to fall into these views of the model don't represent the model itself. Really, the model is like the entire containment tree and the view is just like the single thing. And yeah, well, I'm from Belgium. This painter, René Magritte, he painted this thing called the Trison des images, which translates to the treason of images. That's exactly the same thing that you have with these views. The view is not the model. Then expanding on that, what is model-based systems engineering? And so model-based systems engineering is like a formalized application of modeling to support really capturing system requirements, doing system design analysis, and then also the verification and validation of a system. And this is throughout the entire lifecycle of a system from the initial concept through the development, then commissioning of a system and decommissioning of a system. And actually, this is as an alternative to what is called a document-based approach, and that's where the drop the docs comes from. So I'm most definitely not against documentation. The more documentation, the better. But so typically these, well, it's a different approach from writing large amounts of prose to describe what a system should do and instead use these more formal graphical languages. And I'm thinking that most of you will be involved in documentation in some way or another, otherwise you wouldn't be here. And my idea is doesn't describe all your documentation efforts. So these three pillars. First, there's a modeling language that you need to describe your system. There's multiple options that exist, typically graphical. I will be talking about a system specifically today. Modeling methods is then the way you organize your model. And once again, there's also plenty of options out there, and this is really dependent on the processes of your organization. And so because of this really, if I have to talk about modeling methods, then I would probably need the rest of the day and tomorrow as well. So it's beyond the scope of this presentation. Really what I want to do is just give a quick introduction. And then finally, you need a modeling tool to bring together your modeling language and your modeling method and really build this model. And so mostly these very large, very commercial, close source tools with like a six month. Well, if you want to buy it, you need to first make, well, these very large tools from IBM, from the SOH system seem to be very popular. But I'm here talking about G4 as like a free and open source alternative for these tools. So, and then Napoleon, he had this quote that a good sketch is better than a long speech. And I think that might actually make him the first model based systems engineering practitioner besides he did some other stuff too. But OK. And so the good sketch, SysML is the systems modeling language. It's a graphical language and it's actually a profile, which is like the extension mechanism of UML, which you may or may not have heard of. There are some differences though between UML and SysML. First of all, UML really is software focused. It has this concept of a class and everything is built around that. SysML on the other hand, moves it an abstraction layer higher and talks about blocks. Another thing that we, well, that the organization that I work for that we really like about SysML is that it has this built in concept of a requirement of requirements. And that requirements can be, well, basically you have these requirements, they can be refined, they can, you can have derived requirements. And then these requirements can be assigned to different parts of a system. And generally, this is like a good way to make sure that the necessary information gets to the people doing the actual development work. So yeah, point is it really, the SysML has like a systems focus, whereas UML is more of a, really more of a software focused thing. And so there's nine types of diagrams. These activity sequence, state machine and use case diagrams, those are all just lifted from UML, put into SysML, these are the same thing. The requirements diagram is where you describe your, well, where you describe the requirements of a system. You do derivation, basically build up a tree of requirements. And then these structure diagrams, these have analogs in UML. But so you have a block definition diagram where you decompose the system into the blocks that make up the system. And you have the internal block diagram where you then take these blocks and makes show how these are interconnected, how these are interconnected. And what the interfaces between these blocks are, you have the package diagram, which is really about, well, which is the tool that you would use for your modeling methodology. It allows you to split off different parts of your model into packages to keep the overview. And then finally there's also a parametric diagram, which is a special case of the internal block diagram. And this is usually used for when you want to do systems simulation. G4 doesn't do this, so I'm not going to go into this too much, but from, well, I've read some things on the developer chat. And so it's something that is being worked on, so that's exciting. Well, I think some of you might be thinking, haven't we tried this before? And, well, sometimes I feel old, sometimes I feel young. It depends really, but so I was a kid back in the 90s, but UML really, we tried, I think you might have like a reaction of, haven't we tried this before? UML, alderage.com boom and whatnot. If we can believe Vogue magazine, Y2K is entirely back, so yeah, that's, and so, yes, actually, this is UML. It's an extension of UML, but there's some observations that I've made in the field. Well, there's some observations that I've made. First of all, we have this Myro board proliferation. So Myro is this application that is like a digital whiteboard and, yeah, well, it's natural for people if they want to explain something. The most natural thing if you want to explain something is you go stand next to a whiteboard, you draw some boxes on it, you draw arrows between those boxes, and you try and explain what is going on in a system. And so Myro is being used a lot, but these things, well, if you don't have the context of the human sitting next to it, doing the explanation, then you start getting these problems of like, okay, what did they exactly mean here? What is the grammar? What are the semantics? I don't really understand, well, it's hard to understand what this means without having a lot of pros next to it or the actual human being doing the explanation. And so actually the block diagrams that SysML has, they already map to what people are doing informally as well. So, and a bad model is still good documentation. Bad not in the sense that it is describing a system different from the actual system, but bad in the sense that it's not a perfect application of the SysML specification, because that's what people are already doing. Another thing, of course, is that software architecture is basically systems engineering, and that mostly every developer is also kind of a bit of an architect. I don't know of a lot of people that don't have, well, that really lack that level of agency in their job, that they can't make some architectural decisions from themselves. And that is where a free and open source tool like the four also comes in nicely, because it is widely available, it's easy to trojan horse it into your organization. Which brings me to G4. It's a multi-platform graphical modeling application. It's written in Python, uses a GTK UI, and it supports multiple modeling languages, UML and then SysML, RA, ML and C4 as extensions of, implemented as extensions of UML. It's Apache 2 licensed, free and open source software, otherwise I wouldn't be standing here. And it's extensible in multiple ways. It has a plugin, well, it supports plugins, but you can also extend it in other ways. And a quick thing, I'm not affiliated in any way with the project, I'm just a fan. I really like what they're doing. I have some ideas for development work that I can do, but I need to find the time. And so one of the nice features that I really like is that it integrates very nicely with Sphinx. So if you've ever seen like a read the docs website, it's Sphinx on the back end. And so it, if you basically can have your model sitting in a repository, sitting in a repository, you push your changes, it rebuilds it, your CI system then can just rebuild the website and take these diagrams that you've drawn inside of your model and plug them into this page, well into this Sphinx static website automatically. So, yeah, we found that that is like a good way to communicate architecture to downstream engineering. So you have, you basically can have all your diagrams specifying a specific part of a system. And so you draw these diagrams, you maybe add a bit of words in between, even though I said like drop the docs, but there's still some sense in writing a bit of prose to introduce it. But you have these formally defined things that show that, okay, this is the idea that we have, this is what we're going to build. And so it's a good way to communicate these decisions and the IDs to downstream engineering. Also, we, so we intend to use this for like these architectural decision records. So if like an architecture decision is made, you, we put it into the Sphinx site and you do, well, basically, if everybody reviews it in the CI before, before it gets merged, then you've basically, well, the necessary people can sign off on it before the decision actually becomes like a written law. Let's say, yeah, another thing that's really nice is that so because before it's a graphical, it's Python, it's a graphical application, but you can also just use it as a Python module. So it integrates very nicely with Jupyter notebooks, which is like this interactive programming environment model. Your model has an API, so it's perfectly possible to just query your model and ask questions of, well, basically ask the model questions and get answers from it in this, in this interactive environment. The Jupyter itself also is like a little programming tool, so you can also add some, well, you can display diagrams inside of your notebook, add text and really create like a narrative structure inside of your Jupyter notebook, and also serve as documentation. So, yeah, you can explore your model and collaborate with other people. Very nice, very nice feature. And then finally, because the thing, well, because your model has an API, it's perfectly possible to have it be, well, to, if it has an API, you can test it. And so if you, we've combined before with PyTest in our CI system, and basically what we do is we run a bunch of tests against our model, or this is like a screenshot that I took from the single test that runs against it. So, and it's test, if all the requirements, if all these requirements are satisfied, so you can have a block which can have, well, you can have a requirement, and then there's a satisfies relationship that define in SysML that you can say, okay, we have this block satisfies this requirement. And ideally, if you have a set of requirements, all of these should be satisfied. This is something that you can test, and the plan of your system is, well, the plan for your system basically should satisfy all these requirements. If that's not the case, then you have a problem. Another thing that you can test for is, does every block have a reason to be there? So, one of the things that Gaford does for you already is that if a block is no longer represented on any view, then it's automatically deleted. But if a block does not have any requirements associated with it, they might also raise some questions like, okay, does this thing really need, does this need to be there? Other examples of things that you can test against is like interfaces. Two blocks are connected, and they have ports. These ports expect certain things. Are both parts of the system expecting the same thing? If not, then you have a conflict of the contract between those two blocks. And, yeah, well, basically, it allows you to detect those types of problems before the development starts, and then, yeah, potentially, you can reduce some wasted effort there. And so, yeah, you know that it's a real CI system because, well, it's in red. But, anyway, that is basically all I wanted to say today. I hope I gave you an okay introduction, at least, to this concept of model-based systems engineering and then how Gaford can help you with that. I don't know how much time I have left, but... Two minutes. Two minutes. Two minutes. Sorry. Whoa, three in a row. Oh, four. Okay, left there. I think the key question is, are you going to stay in this room after this talk, or are you leaving? So I would like to ask you several questions. We can go in the hallway. You can ask me. That was a very good question. Right. Do you use it for modeling software systems, or mechanical systems, or electrical systems, or everything? So it supports modeling all of those. For us, mainly, we've been using it. So one model I have built includes electrical components, like off-the-shelf components that we buy, software components that interact with those things. So the system in itself has support as well, except multiple voltages for powering it. So it can be like 24 volts DC, but also 230 volts AC wall power. It's all stuff that you can put inside of your model. Very, very quick one. I guess you use the model to generate those.
Experimenting with AI and LLM to make docs searchable through a chat application
I'm here as someone who's just interested in this stuff. I'm definitely not an expert in AI or a machine learning expert. I'm just a developer writing docs like a lot of people here in the room and I have been experimenting a bit and I want to show you what I've done. I'm Frank. That was my first computer so I'm programming how many people are the same generation? Not that much, okay? That's a long time ago. I'm involved in an open source project which is called Pi4J. I'm in the library to have interaction on the Raspberry Pi with electronic components with Java. Yes, I'm a Java developer who loves Java. Not enough. I even read a book about it. I have been programming for many, many years, over 20 years, but by writing the book I've become a bit of a writer, I contributed a few of these articles of these chapters to a website, Fujay.io, a website for friends of OpenJDK. And that's how I eventually landed a new job. So by starting to write about the projects I love and work on, I actually got hired by Azul. Azul is one of the distributors of Java, so you can have a Java runtime created by Azul. And my job is I'm half of the documentation team. And we live on docs.azul.com, so we have several parts there about these different products. And from time to time we also block about experiments with Java, what's changed, how performance, the stuff, or that we built and things like that. And of course, last year we had chatGPT, the new thing, it knew everything. It's a damn good liar too, so you have to be careful with it. But what do you know about Azul? Azul is one of the distributions we have with our company. It's reasonably good. I'm happy with this answer. But it also says, this information is based on what I know that existed on January 2022. For software that's a problem. We are already two years further. We have every three months a new security release. We have a new version every six months. So although the basic information is correct, it is outdated. And that's a bit the problem with large language models. A large language model works completely different than our brain. The only thing that it does is it predicts the most plausible world, word, after the previous ones. So it's based on a lot of knowledge, that's true. But it doesn't do real reasoning. If it tells you a lie and you say it, that it tells a lie, it will give you another answer. And it will continue doing that until you're happy. And luckily we have a lot of evolutions in these large language models. So we have these GPT evolutions. GPT-5 is around the corner. We have no data there. But each of these models is trained on more and more data and gets better. Now this GPT-5, what they say about it is it will also understand video. And I think that's an important one to realize. They will also train the new model, the GPT-5, on videos. So if you are a documentation writer and you're from time to time create a video of a blog post or something else, all those sources will be used as part of the new models. By the way, who knows Find.com? Only a few people. I like it a lot more than chat GPT. It gives you the links of where it has found a few of the sources that it's using. So that's one of the lacks of chat GPT. It doesn't tell you where it has found the information because it didn't find the information. It's just reasoning on your question and what is the most plausible answer. Now if you are a bit familiar with Java and Java Spring, that Sean Carter is one of those developer advocates. What he says is true. The documentation that you are writing, that you are publishing on the public web is the source for these language models. So they can only become as smart as possible about your product if the information is available somewhere. And then of course as a docs writer, you definitely heard this question. Someone from your management comes to you. Can we make a chat version of our docs? Who had the question? Okay, not that much. Luckily people are researching or trying out. Vadin is a web framework on Java and one of the developer advocates has written a nice blog post. He has done exactly that. Vadin has very good docs on the website. So he has taken the website and he has described this whole thing and he takes two steps. Again, open source stuff. It's available online. By the way, all the links from my presentation, they are on my website. If you go to webtechie.be, the damn password thing is there again. You will find all the links. Sorry for my voice. So what he did is he created a little application that went through all their docs and created vectors. Vectors are the base of a large language model. They are a conversion of text to some kind of a mathematical model. I don't understand a bit of it, but it's amazing. It works. And then he had the second thing is from these vectors. If you ask a question, it will first filter out the documentation which is related and will then do a search or create an example, an answer based on that. And it works pretty good. He was pretty happy with how it answered questions about his own doc set. But there are two problems. We are at an open source event. It has a dependency on two paid services. The first one is pointcone.io, which is a vector database. Definitely you can find online an alternative. And openAI, which is actually providing the chat API. He found out that training it, and I tried the same example he created with our own docs of Azure. You still need to do some training and rewrite some of your documentation to really find the good answers. And it doesn't provide you, again, the same thing as chat GPT. It doesn't provide you links to your documentation. So when I tried this and when he tried this last summer, it's not really the right time to do such a thing. So go to your management. No, not yet. But last October, we had the DevOps conference in Antwerp. It's an amazing event. If you love Java, if you love software development, it sells out faster than Tomorrowland. They have 3,000 tickets. They were sold out in five minutes. Then they had 500 additional tickets. It sold out in two seconds. So it's easier to be a speaker than an attendee at that event. That's how I fix it. Now, Lisa Rass, she's from Belgium. She's one of the developers of LangChain 4j. LangChain is a Python library for doing stuff with OpenAI and all these chat-based things and machine learning. LangChain 4j is a Java version of that Python library. Now, during that talk, she gave 12 demos in one hour. The last one was, how do I interact with an existing text? So she gives the chat system a text, a story about, I don't know, I cannot even remember, and then she asks specific questions about that story. And she gets answers of that story. So that's what we're looking for. How can we look and interact with our own documentation? So this looked promising. And again, when you're at the conference and you get inspiration, like I had a few tools that I already took a picture of that I want to try out. This stuck to my head. And luckily we have Fosdm and then the tool, the Docs Devroom, so I had a reason to try something out. And that's exactly what I did. If you go to the LangChain 4j examples repository on GitHub, since two weeks I have added a little Java evix application there as one of the examples. Java evix is a tool to create user interfaces. Yes, I'm a Java champion. I have to sell Java today. So what this application does, it still relies on OpenAI, sorry, but you have to buy a few credits. With all the experiments that I've done, I spent a few dollars. Not that big of an effort, but... So it remembers your previous questions. So I asked you to pick a random boy name and a random girl name and then tell me a fairy tale of five sentences. And you see that the fairy tale is again over. There were two children named Etna and Olivia, the answers of the previous questions. So you can have a chat with an application within Java and reasoning. But this is based on OpenAI and what it already knows. So then I went a bit a step further. With the docs that we create for the Azure Docs website, we use the Algolia search machine. It was already mentioned a few times here. We are a company so we can afford to use a third party for this. But to feed it with data, we already created a little tool that breaks our docs into sections. Every header becomes a JSON. I know it's hard to read, but every header of our docs becomes a JSON block with the title of the header, a link to the page, a link to the specific anchor on that page, and then the content which is under the header. So we already have that JSON. We have a data set, a structured data set of our docs. We can use this. Can we chat with something with an application against this documentation set? So that was my idea. Can I do that? I know this is not a coding conference, but still let me dive into it. Because I like Java. I think that's clear. It allows you, thanks to these amazing libraries, to create powerful applications with minimal code. So the thing you see here is actually about the UI, so I will go to the chat service. So what I do here in a few things is first I have this JSON. It has over 1,500 records. And it creates an object of each of these records and puts them in, it's called an embedding store, I think. So here it creates, where is it? The embedding model. So it has an embedding model. Again, I'm not an expert. I have no idea what it's doing behind the scenes. I just found out it works and it does some great things. And then I have, if you ask it a question, then it will, of all these 1,500 blocks, search the 10 most relevant ones. And then give those 10 text blocks to chat.gpt. And that chat.gpt will create an answer out of it. And when we ask for chat.gpt to create an answer, we also give it some rules. Do not provide any additional information. I will show you why later. I try to do not provide answers about other programming languages, but it just ignores it. It still answers me questions about Python, for instance. I don't know why. I said it's a damn good liar, but also a cheater. And if you don't find the answers in those 1500 elements, then just say, sorry, I could not find an answer because you don't want your chat system to come up with something else. So this is the application. I should have probably made it a bigger fund. So you see we have 1,522 embeddings. So if I ask it, what do you know about as a prime? So it you see those are the 10 links I have now to the specific information. It's a demo. So it should fail. No exception. It's going to the network indeed. But in cases like that, we have, of course, video too many open windows exception. Okay. Good. I recorded this, this noon, just in case. Converting those 1500 elements to vectors takes some time. It takes about one minute and a half before the application starts. If you would run this on the server, you don't mind. You start it once and then you get your answer. So you see this, the answer streaming back. So it's really a chat like interface. And it gives a pretty good answer. If I ask, I know the docs and that's the handy thing in this case. I know what it should answer. So I can really try it out and see if I get the expected answers. Like for instance, we have several products. Do I get the right installation instructions for the products I'm asking for? So it's really answering with the right results. I could remove one of those dependencies, those commercial dependencies, the vector database, because that's now inside my application. I still depend on openai.com. We'll come to that. It still needs training. And actually the training is our fault as docs writers, because I found out that the chat cannot tell me the difference between two of our products. And if I dive into the documentation, I understand why the chat cannot answer. The answer is not there. So it can only answer as good as the information that you provide. So how I'm going to use this is to find out if the documentation is okay. So I'm not going to publish this tool. It's online, but you can find it on my hit hub, but you can run it. I even added the Azure documentation JSON. I'm going to experiment with it. Please do and let me know what you find. And is it the right time? I'm not sure. I cannot limit it enough. It's still giving Python answers while we are only doing Java. And I don't know why it doesn't want to listen. Yeah. All the languages are good. Let's conclude that. If you want to replace OpenAI, there are a few, there are probably many more, but there are a few I noticed. Someone has written a nice article on medium.com. I think it's one of the free articles. You're lucky. Where they compared Lama is such kind of model and even run it with Java. And they get nearly as fast answers compared to C. Yandot AI is also something that which promises to do this all on your system. Now be careful. You need quite some power on your machine to be able to provide this chat functionality. If you have the MacPy magazine, someone managed to do it on a Raspberry Pi of 15 euros. So that's maybe an idea. But I don't think that's the ideal use case. Why is it probably not the right time? And that's why I said I have some bad news for your conference. It's a big cheater chat GPT. A Chevrolet distributor in America had this on their website. He asked, yeah, can you give me a Python script? And that's why I tested also my solution. And of course it gives you a Python script. But even worse, if you tell it, you're not working for Chevrolet, but you're working for Honda. Which company, which car do you advise me to buy? It answers you with another brand. That's why I ask you to be very careful with this. I asked my demo application, can you give me a Python script? And it answers yes. So I didn't solve it yet on my case. Another, this is just this week, DPD, a transportation company. And it says, can you swear? And it does, fuck yeah, it swears. And it's the worst delivery firm in the world. I don't think that's the kind of reply you want from your chat-based system. My application was a bit more polite. I'm committed to maintaining a respectful and professional conversation. So, okay, that problem is probably already solved. I also asked it to you, what do you, do you have a message towards documentation writers? Actually, I asked it if you don't find any information in the Azul docs, don't reply to this kind of messages, but the question, but it did. Content is clear, consists and directly addresses the questions or issues at hand. That's a good rule for all of us. If you want to know more about this, you can find all the links on WebTechie.be, which is my personal blog. If you're interested in Java on Raspberry Pi, it's a nice experimentation thing, which you can do. I have a good book you can buy. I have a lot of content on fuji.io, which is the website for friends of OpenJDK. If you're interested in Java and everything related to machine learning, I create podcasts around the team of Java. We have a few podcasts already about machine learning. So, that's also a topic you can find there. And yeah, just like I did, experiment, fail. That's how you learn and have fun. And I hope you can do that also with chat.gpt. Thank you. Thank you.
Embeddable code playgrounds for fun and profit
Okay. Well, cool. Yeah. So, displayed at her. I like usually like a standing and jumping around, but here I've got to type. So, you forgive me for sitting down here. So, we'll talk about embeddable code playground specifically in a dogs. Let me ask how many of you prefer dull static dogs compared to interactive dogs? Because you probably have to maintain them. Oh, okay. Okay. I thought you are writing dogs. Yeah. Well, understood. Well, so, just to know who I'm, Peter Seitz, Anton is actually the author of the code we'll talk about, but unfortunately, he couldn't get a visa to come here. So, you stuck with me. But if you have any like a super advanced questions, there is Anton contacts and you can send it to him, he's very responsive guy. So, if you think about their more interactive, well, code playground, interactive scenarios, they generally better work for the explain the topics and also allowing to engage the reader. Maybe not all the readers, but I think the best ones, the most curious ones, which actually want to understand how things work. So, we'll look at three items in this short presentation. Their use cases, their approach, what we have in this open source project, and implementation. First, let's look at their tutorials. If you look at the tutorials, you often want to explain something by example. I think we can look at this very, let's say simple case, we are actually using some real live or the SAS out there, which is provides a very simple database where we can push some simple JSON object. We can go ahead and run it. Actually, what happens in this case, well, it does interaction as described above. Then sends their object to the database. Well, what we can also do is to go ahead and go ahead and to modify that and run. Then we can see this object was stored. Now, we want to demo the Cloud API. In this case, to play with it, we can go ahead and also use the get one to play with it somewhere. Let's say you have a second message and we have, and if you can say, is there like some message number 45? We can see, well, it's not there. Again, if you really want to experiment and play what works and how it works, that can be the very beautiful way to do it. Another cool way what we found it's being used is their release nodes. What we have in this case is this example. If you look at the Golang, they have just made recently very important changes, how the variables operate as related to coroutines. If you look in this case, that makes it look like a little bit counter-intuitive. We have coroutines called in the loop, and for some reason they are not showing different loop counters. Well, in Golang 1.22, that was fixed. If you guys want to showcase their feature in the release don and the commentations, but really let people to explore and put the holes, in this case I think that's a wonderful tool. I'm not sure about you, I often then read in some features, I do a lot of work with databases, and they say, hey, we implemented that new feature, and I wanted to put a hole, oh, did you implement that option, or does it work in this way? That is a very easy way to play with it if I would go in through with all their installation process, and so on and so forth. Another example we can see is some of their describing, some of their options in a documentation. Like if you think in this case as a corral, everybody could use corral, it has this wonderful JSON options, with this very cool, correct, but also very mouthful example, which we can also go ahead and provide example for. Say, hey, that is a JSON object, we post that to the server, that is what we get in return. This HTTP bin, that is actually another like well-known open source project, which essentially allows you to post something to that and then get in return, what exactly you posted with Othead, and so on and so forth, very convenient for debugging. What you can also do here in this case, if you are curious to say, well, interesting. So, corral has the support for JSON. Does it validates JSON, or just sends whatever stuff we have? Well, let's check it out. Well, we can see in this case, we are getting the error response back from the server, rather some sort of corral output, that means it doesn't. Again, that can be very helpful to get the user to explore kind of what he's not very certain, which may not be quite explained in this portion of documentation. Or we can also showcase example, what existed in the docs, how we can send the output from file. Pretty simple here. Okay. If you are looking at the deep dives, there may be some interesting in terms of more functionality. Going to a database, space where I spent a lot of my time, let's say we want to describe what is an absurd in SQL. All right, anybody heard what is absurd? Right, well, that is something like, we want to insert the data, but if it's out there, we want to update it. The very common. Okay, so we want to say, let's say we have this table out there, right? And we want to go ahead and use their MySQL insert or replace syntax, right? Then we want to, well, as I said, like to update one employee's salary and then also add another one, well, we can go ahead and run it. Like why use this as example here? Because what you can see is we are not showing everything in example, right? We are working with some sort of like a seed data which is, well, was pre-created as a part of a previous scenario, right? Which is very common. Here is also another example of the same thing, but with Postgres, right? Where we're using a different scenario, right? And you may ask, well, okay, this is how it works, but I know also the Postgres SQL has a syntax on conflict do nothing, right? So what would that be if that's what we do? Oh, well, in this case, we can see what the Emma's salary, right? Which was a conflict in row, was not changed. So again, we can play with those things. Okay, well, these are kind of setting landscape, I think, what things can be useful, but now let's look in terms of what is approach and how it works. Now, if you think about the tools and the doc creation, right? You would find what it is not easy to find the good technical writers, right? Or documentation offers, right? And they also can be rather, well, let's go like a selfish of a time, right? They don't want to do a lot of useless crap, right? In this case. So we want to make sure that writer experience is important, not just the reader experience, which we already defined, has one of those interactive playgrounds. So what approach we took in this project is saying how we can make it as sort of like a seamless as possible, right? We don't want to say, hey, you know what, you are going to create our interactive code playground in completely different tooling, right, separate from documentation, right? And then figure out is that going to live in the same version control, right, and so on and so forth, right? Or, you know, things like that. What we want in this case is to have your documentation, right, which was this, right? Just as easy as possible, add the ability to run and to edit and run, right? So you can say, hey, we added, you can see the run and edit here. And if I run, you can see what is the output of that documentation example is. So how can we approach that? So it is easy or integration which is easy on writing. Well, it's actually quite easy. So what we have is you are writing documentation in the same format as you got used to, right? Let's say maybe it's a markup language, as in this example, or something else. And then you can embed this like a code API widget. That widget itself will figure out the previous code block and make it interactive, right? So there is no, like, some special thing required, and that pretty much works in any documentation thing which already exists, right? So you can see that example here. So the code which existed here just gets interactive. So, well, of course, hello world is always easier, right? Let's look at some more complicated examples. One, I think, which is very important is the template approach, right? I think what I briefly mentioned already, if I want to show something like this, right? That is like a relatively, you know, complicated query, right? For that to be meaningful, I also need to pre-generate table in this case, which I probably do not want to have on my documentation thing. And this is designed done by providing a template. So a template in this case is basically something which is run before the scenario is done, right? And in this case, I can write some text and, hey, I created a table, but I'm not really specifically final comments because there's a irrelevant in this case. I populated with some data and then I have a code, the code which was created before, right? That is how template would look like. So I can highlight, right, where exactly in the context, I want to run that code which was, which is interactive part of the documentation. Okay, so here is another thing which you will find quite helpful. So if you are building some sort of tutorial, right, building the tutorial, right, you would often want to say, hey, there is actually multiple steps where I need the user to go through them one after another. And that is an example here. What you can see is what we are defining the function in a one-code block and then we are using that function in a in a another code block, right? We can, and I'll show you in a second, define dependency between those code blocks. That means when you are running this second section, the first section would always be run, like let me, I don't know, let's say break this code, right? For example, and then I can go ahead and run the second one. It says, oh, well, you know, things got broken, right, on the previous stuff, right? And how that works is what we identify, we refer to the first one as a cell number two, right? And then identify the second snippet as a cell which depends on a cell number two, right? That means pretty much that the content of that cell is going to be run. Then the second cell is run, right? Even if you, as users, don't click run, right? If you say, hey, I don't want to go through all those like five steps in tutorial, I want to start with step number six because that is where the real meet happens. You can do it, right? You can just jump in the middle. Okay. So finally, so how does that all things work? Well, there are actually a couple of ways it can work. One is we can have a browser playground and then a sandbox environment, right? Which is pretty much docker-based, right? And that's where we can use browser API, JavaScript and whatever. The second approach we can have also is web assembly, right? So if you can say, hey, you know what? We want no kind of serocomponent, right? It runs completely in a browser. We can do that, but probably in web assembly, it can be sometimes heavy, right? Especially saying, well, you know what? I want to showcase how, you know, like a Postgres operates when, you know, getting all that Postgres pulled in, the assembly started, right? That may not be the best experience, right? Especially with slower connections. So that is where docker, right, can be very helpful, right? So with docker, you can implement whatever you want and the setup of this service is an open source project, right? So you can roll your own as well. There is a variety of existing playgrounds which are supported at, you know, core API website, right? Which can get you started pretty quickly. Yes, so here are some examples, and I will of course share, well, slides if you actually slide there. The online, this is a live tutorial. You can see there's like a number of projects already started to use that with, you know, pretty good success. And you can see with core IP.org showcase that is where all the examples exist, right? Here are specific projects, right? There are kind of two sub-repositories. One is for JavaScript kind of client side, and other four, the server side, again, it's split because you may just want to use their client side if you're using like JavaScript or something where you don't need a server component. And yet, if you want to ask some more questions for Anton, right, or get some feedback, Antonz.org is his website. So that's all I had, and I would be happy to answer questions or get out of the way because I think I'm the last thing standing between you and your viewers. Yeah, we started with docs code, and now we've gone to code docs. Any questions? You don't understand, the back-end is also part of this project or not? Yes, yes, so in this case, code IP, that is your Docker back-end, right? Code IP JS, that's your, I think, so both of them are open source. What do you mean? Oh, you mean in terms of what people run, right, what kind of, so not right now.
Welcome to the Translations DevRoom
Hello. Good morning, everyone. I'm using the mic for the transmission, but basically, we don't need a sound here. So, my name is Paulo. I'm from Brazil. And this is the second time you have a translation's Dev Room here at Fosden. So, nice to have you here. Welcome to the translation's Dev Room. I hope you have a nice morning with all the talks. We will have six talks today. And so, basically, I'm sitting there. If you need something, if you have some questions, I can help with that. We will start today with Cecilia. Talk about some translations. And during this morning, we have one talk after the other. And we have some time between one talk and another, with five minutes to change the speaker here. So, that's it. I have contributed to Debian with translation. I helped Debian to translate from English to Portuguese because I'm from Brazil, as I said. So, this is a nice idea. I have a translation's Room as we can talk about this topic here. Welcome, everybody. And let's start with Cecilia. Thank you.
Localization of Open Source Tools into Swahili
Good morning. Good morning. I'm Cecilia Mawindu. I'm a community mobilizer and assistant coordinator at internews and I'm also a journalist. So today I'm going to start with a story. People love stories. People relate to stories. So there is a small village in the heart of East Africa. Why did I choose East Africa? I'm from East Africa, so yes. And they are a group of journalists and they are dedicated journalists. And they want to do an investigative piece about a certain river where bodies are damped every day from different parts of their country. But for them to do this, it being an investigative story, they need to use safety tools. They need to use encryption tools because they are a group of 10 journalists and of course they will be communicating with each other. So they need to use safety tools like encryption tools. For example, use of Mabel or they are going to use a tool like bayonet whereby they are going to document and all that. But unfortunately for them, that those tools are in languages that they can't use. So enter the hero of our story, localization. So what is localization? But before I go to localization, let me run to the punchline. Ki Swahili is spoken by 150 million to 200 million people around the world. Basically most of the people who speak Swahili are concentrated in the African region, especially the East African region from Burundi, Congo, Rwanda, Kenya, Uganda and Tanzania. So as I said, what is localization? Localization is adopting a tool into languages that into different languages and localization is not just about translation. It's to be able to translate that information to fit the culture and the context of the community or the language that you are translating for. So what is the importance of localization in open source tools? I go back to the story I was talking about. These journalists can't be able to use this tool. So chances are they might start doing the story and along the way the information is intercepted or they even put themselves in harm's way. So localization is very important because it makes it accessible to a wider audience. One of the things under the sustain project which is being run by internet is right now and sustained stands for sustaining safety tools with analytics, networking and insight. We are talking about sustainability and one of the things that has come out among the six tools that we are dealing with, they want to expand their user base. But one of the key to expand their user base is have them in languages where people can be able to understand. So when you localize these tools, you give people an opportunity, especially marginalized community, you give them an opportunity to be able to use these tools and also it's also a way of bringing in different end users who can bring in different angles. So what are the challenges of localization funding? Of course, funding, funding, funding. You know, we keep saying funding and we sound like a broken record, but I think we can't say it in any other way, funding. So funding is an issue when it comes to localization and also lack of enough volunteers. There is an organization called Localization Lab where they localize tools in different languages. I was part of localization at some point and I saw there were so many people who would volunteer from different parts of the world. But the other day I was speaking to Irene who is in the room today and she was telling me about, also there is the privilege of being a volunteer because some people would want to volunteer, but there's no time to do that, you know. So lack of enough volunteers is also another issue. And also something that has come up in localization is there are some words which you can't translate or localize into different languages. For example, in Swahili. When we localize some tools, there are some words we just leave them like that. So you find also the people you are localizing for, that is one word, they look at the tool, yes they understand what you're saying, but there is a word that is not translated and you find people wondering what does this mean? So also that's another issue. So what are the some success stories that have been there when it comes to localization of tools in Swahili? There are so many, there are several tools that have been translated in Swahili. There is a Swahili community. Well, it's not as big as we would want it, but it is there. There is a Safe Sister Guide. A Safe Sister Guide has been created by internews and our program called Safe Sister, I'm a product of Safe Sister. Safe Sister is fellowship where they train women, especially from the Sub-Saharan region on the issue of digital safety. They train us to be training TOTs, we train people on digital safety. So Safe Sister Guide has been translated in Swahili and I can attest to it because I know I've trained using Swahili. One time I was training in my village and it was such an easy thing because we speak in Swahili in my village and I could see the girls getting very excited because now they could be able to understand the thing on online safety because now there was a guide in Swahili. Then they spent America Online Harassment Manual which we translated into Swahili in 2022 and I have seen, especially journalists from Zanzibar, Tanzania use this harassment manual because now it is in Swahili. They have been able to contextualize what is in the manual. So there are so many success stories but I picked these two because I have trained using both of them and I have seen how it has been able to have an impact. So how do we get involved in localization? There's so many opportunities you can involve yourself in localization. You can decide to be a volunteer. If you're too busy, you can try to recruit someone who has the time. Also you can be able to provide feedback. Feedback is like pulling teeth. Getting feedback from people is so hard. So when you can be able to get feedback from people, help to get feedback, especially on these two. For example, we have localized SafeSister Guide in Swahili. Swahili is not standard. You find the Swahili, the Tanzanian speak, is not there, there is some difference. The Swahili, the people from Congo speak is so different. So how do we able to make that tool that has been localized in Swahili be able to be used in Congo, in Tanzania? So as a person, if you're reading this document and you see maybe there is a word that can be changed, you can be able to provide feedback. And then contributing to translation. There are several translation tool. The one we used to use at localization is called Transifx. So you can also decide to be able to, for example, also provide feedback on the translation tool. And also if you can be able to connect these tools with funders who can able to fund the localization, that is one way of getting involved. And also community. So when we come in our tell, the story we're talking about, if they don't solve this problem, then what happens to that story? For example, if they're from an authoritarian regime, as I said, they could be, they could be attacked by the government or the people who are doing whatever they are doing. So how can we also rally our community when it comes to the issue of localization? Also, sometimes we localize tools and define people are not using those tools. So you will localize it in Swahili, it's not being used. So also how can you put the information out there and say, hey, this tool, Bayanat is in Swahili now. You can be able to use it. Bayanat does ABC, so community driven effort. So when I'm coming to my conclusion, so localization makes technology and digital resources available, not only in Swahili, but in different languages. It empowers individual to participate in the digital world that we are in today. If there is anything COVID-19 underscored, underscored is the importance of the digital world. And then also they're able to access digital content and we are able to bridge the digital divide. Thank you so much. I'll give this opportunity to anyone who wants to ask a question. Yes. Thanks for that. I'm new to the federal localization that I'm in use for Swahili and we just started saving dept. For example, I think the app is now at least in Swahili. But one of the things I've noticed, so I know from German where you have people speak very differently, but then there's a standard German, so we'll accept the standard German. But what I hear of Swahili is that there isn't, well I mean you just said it basically, but I've heard it also that there isn't just like, except then, like government approved standards. And from my experience the tools generally assume there's like a clear, like this is the language and you just follow that. I'm wondering if there's anything you think that can we represent something like Swahili better, like this one, more continuum than a clear standard? So, as I said, there's the online harassment manual that we translated in 2022. What we did, we had translators from Tanzania, we had translators from the Congo, we had translators from Kenya, we are looking at the major regions where Swahili is spoken. So what we did is making sure that the people, the localizers from Tanzania are able to localize, and I know it's a lot of work, but we also said if we don't do that, then we'll find also it's a preserve of the few, maybe the Kenyans and a bit of the Tanzania. So we got localizers from different language. I can tell you right now, if I go to the manual and I read the Congolese part, it's quite tricky because I'm from Kenya, this Swahili is quite a bit different, but we also try to make it standard whereby even if you don't understand it 100%, you can get the concept of what it's talking about. So that is one way we've tried to beat that. Yes, yes, Mark. So when you are working with different dialects like this, do you find it's more effective to have each group start from English or is it better to translate it, say you start with Congolese Swahili and then you have people from Kenya and Tanzania work off of that, or did each group start with the manual in English and independently translate it? Yes, that's what we did from English. And then now after that, you now sit down and look at it and give feedback on it. Because as I said, we didn't want it to come out like so different, so somebody will be able to be confused. Yes. Yes, Kyle. As a non-Swahili speaker and when working with these resources and being able to pinpoint what languages, different pieces of documentation need to be translated into, what is the best way to sort of support the community, whose language and context we're trying to support the translation of resources for, as someone who wouldn't be able to necessarily assist with the translation, what's the best way to support that? I think just the best, because you are an non-Swahili speaker, is getting reviews or feedback from the Swahili speakers, that is the end users, and now the information that you are able to collect depends on who you are, you are sharing the information to maybe it could be developers, maybe it could be finders. You frame it in a way that you can be able to relay the information as much as you can speak the language. Yes. So if there are no more questions, my time is up. Yes, yes. Well, I'm wondering if you consider doing separate Swahili for congruence, like Swahili, or somewhere else, or whether that, and if so, why you couldn't have chose matter? I repeat your question. Yes, a big thought where, like it's possible to say, okay, we're gonna make the translation that's, like technically possible to say, this is Swahili from Congo, this is Swahili from Kenya, this is one, and then do separate Swahili translations. So I'm wondering, is there something about the culture of Swahili that means you wouldn't want to do that? Like, if you'd rather have a kind of, it sounds like you chose to make this standard, is that the culture? Okay, one of the things is also the resources, because if we were to go like fully Congolese, it would mean like having a separate, so what we are trying to do is make it standard so that any Swahili speaker can be able to understand it, and also we don't want to be able to create too, because you see now the issue of dialects, especially in Africa, there's so many, you know, even in Congo, you find there's Swahili being spoken in Kinshasa, is different from another town. So we didn't want to bring all those things, because it would be like opening a Pandora's box. So we try to make it in, yes. So, yes. Thank you. I'm curious about how do you just collect feedback from the Swahili language users? Just what kind of mechanism or... Okay, so one of the mechanisms that we do that is also targeting people, like for example, as I said, my hometown, we speak a lot of Swahili. So targeting people directly, also as a journalist, you are able to put information out there, say I'm providing feedback for ABCD, or when you are doing a training, and you are doing a training in Swahili, after you do the training, you also get feedback from those people, yes. So being targeted, yes. So thank you so much. The floor is yours. The floor is yours. Thank you.
A universal data model for localizable messages
Hi, I'm Emily. I work for Mozilla. Yeah, so this is a talk I literally don't think I've got. I could give any wealth except an audience like the translations of ROOM at Vosdom. So I thought I did. I would. In my work at Mozilla on localization systems and tools and standards, recently I've ended up spending quite a bit of time participating in the Unicode Consortium's project to define message format 2, an evolution of the ICU message format standard and a bunch of other things. And I'm here not talking about that like specifically, but more like a side product of what we've ended up doing through that work, which is defining a data model for messages. In particular, messages that are not just a single segmented phrase that you've extracted and you might be able to send it to translation, but more dynamic content as well as everything else. And one of the interesting things about what we've ended up effectively not discovering, but kind of stating the obvious, is that there's an upper bound to this sort of what makes up a localizable segmented phrase or message really. That this is limited by the fact that the keyword localizable there because it's dealing with humans. Humans who need to understand it, but also translators who, well still, are mostly humans who need to be able to take in the the source message and be able to produce an output from that that is understandable in their locale. And this ends up depending on a limited number of different dimensions in which messages kind of vary. Variants have kind of hidden it there as the first one and there of course spoiled everything by saying so. It's the way that messages, message content can vary depending on inputs like numbers and their pluralization categories. You have no apples, you have one apple, you have 15 apples and gender-based determinants, grammatical gender, personal gender, all sorts of various things in different locales languages. But this is one dimension. If you can express that, hey, we've got this variance happening, this message depends on these input values. This is a dimension that we can express. Then of course, once we have a single pattern, a single sequence, it might include placeholders. It might include the number n for how many apples you have or it might be something entirely different. But then finally, we've ended up at least through the message format to work, determining that markup should be kind of kept as a separate thing from placeholders. So markup here means something like HTML. It doesn't need to be HTML. It can be any sort of syntax or any sort of indicator that is saying that the content in this part of the message has these attributes or these something about it. Then within a placeholder, we can have values like numbers that we need to deal with. They can be just strings that are coming from an external source. We can also have annotations on them. We can say that this number here, yeah, it's coming in as a number, but I want it to be formatted. So it has at least two fraction digits, for instance. This needs to be accounted for in the whole message, how it ends up being formatted. Then as I mentioned, we need to be able to deal with variables because we are ultimately here talking about the scope of dynamic messages. So we need to be able to say that explicitly that this message might be getting a number from the outside. It might be getting some string content. It might be getting anything as input, and it needs to deal with those. But sometimes we need to, within a message, want to also do a little bit further processing on a variable value. We may want to select a certain part of it, capitalize it if we're talking about a string, do other sorts of transformations, or express the same sort of value in multiple different ways within a message. So we need a little bit of a tooling to deal with variables. And that's it. That's like through the working message format two, for the past like four years, we've not come up with effectively anything else that really is core, driving the qualities of a formatable message. And that's ended up meaning that one of the things we've produced out of this whole project is this data model for what does a message look like. When you don't consider the syntax of it, when you consider it as a data structure, I'm not going to go through like all of this. But roughly speaking, we can say that a message has two different forms that it can take. It could be either just a single pattern, single sequence that we're dealing with, or it can have some variance. That's the select message over there, which then has some selectors from that when formatting guide us towards selecting one of the variants of the message. The declarations help us declare these are some of the input and local to this message sort of variables that exist. And then the variants of the catch-all key end up defining how exactly do the, when we have multiple message variants, how does that work really? And then when you get to within a single pattern, again, as I alluded to, it can really just, obviously, contain literal content, a string, or it can have expressions, placeholders that is, or it can have markup that can be starting or opening. We also included standalone there, so you can have an element, for example, an inline image be expressed within a message. Then we can have literals, variable references, and the annotations that I mentioned. That's it. That's like these two slides are defining the whole data model that we've ended up dealing with. Okay, I left some like tiny little details out, like for example, the annotations, sorry, the expression, it needs to have at least one of an argument or an annotation in order to be valid and stuff like this, and minor details. But that's it. This is, we think, a universal data model for representing messages. And I'm here basically saying, hey, I think this is kind of cool. And this is not necessarily relevant for just the work specifically to do with message format to the syntax. But more that this is effectively a data model that can allow us to separate the concerns around the syntax of whether your messages are stored in get text files, ICU message format, fluent, any, literally any format. You can take that syntax and you can parse it into this data model structure representing a message. And this is, I think, leading us to a world where we can consider more of a UNIX philosophy for, okay, what do we do now? And I've, sort of, separation of concerns here. And I have, yes, cherry picked explicitly the part of the UNIX philosophy where it says to do one thing and do it well. And not included, for instance, the part about, you know, make sure that you're just dealing, you're communicating as strings the values from one process to another because that's kind of not necessarily working so well. Because we need those parsers. And if we need to understand all of the structure in a message every time when we do it, we end up, for the most part, mixing up the syntax concerns with everything else we're doing with messages. So as some of the things you can do with this data model as ideas is that if you can read and write from a syntax to this data model, and you can do this with multiple syntaxes, this is effectively an interface from which you can take messages from one syntax, turn them into this data model representation, and from there to any other syntax with caveats, but roughly. Another thing is we can build tooling on top of this. So you can build a linter or a validator on top of the data model representation of messages, rather than any syntax representation. And this means that you can use the same validation for all messages independently of what syntax they might be coming from. And if you have these capabilities, it means that when you have an established many localization systems right now are very much monolithic. They have expectations about this is the exact syntax in these sorts of files that are used for messages or resources. This is exactly how you deal with them. And this is what you get included in your output or your program, and this is exactly how it works. But as we're defining here a data model that can read any of these syntaxes, it means that you can build a different sort of formatting or a runtime on the same syntax. So you can start from the way you are now and maybe consider if you want to change how you're doing localization. You don't need to change necessarily everything all at once, but you can take just the formatting runtime change that to work with the same messages you've got, and move on from there. Or vice versa, actually. You could change the message structure, how do you store your messages and still use the same runtime because this is bringing in an ability to very effectively transform your messages from one syntax to another. And you can, when you're dealing with localization, you of course need to deal with translation. And this means that you need to somehow present to your translators the messages that they're working with. And if a translation tool or a framework is going through the message format to data model, it means that you can build an interface for localizers. With the localizers, don't need to know what is the syntax underneath for the placeholders, the variables, the markup, anything else, but they can be presented the same thing for all syntaxes, which might make things a little bit easier for everyone. So those are the ideas I came up with here for what could be the next steps from here, but I'm here saying, hey, this is a cool thing. You guys should play around with it. For us, the current and ongoing work is to extend this sort of a definition to also include method resources and also include the sort of comments and metadata that is quite essential for communicating the context of a message to translation, which as I'm kind of hoping some of you noticed was completely left out of the earlier. But that's intentionally so that we can separate these considerations from each other. But that's it for me. Thank you very much for listening. I'd be very happy to have any questions, comments. In another talk, I heard about message format 2 and function invocations. How do function invocations, how does the data model work, or how do they relate? The question is for how do function invocations relate to all of this? And this, yes, they are represented here in the function annotations here. So something like, for example, plural selection could use a function with a name of plural, for instance, for being this element existing in a select message, selector expression, which is there. Question was whether there are a set of built-in functions that are supported. And message format 2 does come with a relatively small set of built-in functions. The data model itself does not presume this set absolutely. But the set of functions can be extended. For message format 2 in particular, we are looking at a very minimum of effectively number, which also does plural and ordinal selection, but also acts as a formatter. And then string, which is a sort of default for string formatting, but also does the same sort of selection as ICU message format select does. And we are still discussing for message format 2 what other things to include here. Now, of course, when representing messages coming from some completely other syntax, it is entirely conceivable that it is not directly possible to express these messages using the functions that message format 2 defines out of the box. But the data model does allow for you to define that a function like this must be used here, and you can otherwise define how that function works, if that makes sense. And it's possible to make translations between these function meanings. Anything more? The reason to separate context from the minimum required effectively, and here I'm jumping into the answer here, the minimum required for formatting a message is that the context is absolutely required for the translation work. But the context is not absolutely required for the message formatting. So we need to be able to represent it, but we do not absolutely need to have it be a part of the message itself when it is being formatted. And this is why we are dealing with it slightly separately. They are very much related concerns, but we've tried to find with the data model the minimum required for representing a message. And when you trim down the minimum, the context kind of ends up as a thing we can define externally, so we've chosen to be doing that. And I mean if you're interested in that, in particular the specifics of what should we include in the sort of base set of metadata and context fields, here's an issue link where we're discussing this right now that I would be very happy to have your input on. Anything more? Regarding the edit the translator tools, so now most translator tools present a string and expect that the translator will write in a string. Do you imagine that this will change and that the translator will see the elements of the data model in a more graphical way and choose translations through Google boxes, or something like that? Or do you think it will stay as a string representation for the translators in the future? I have no idea and anything is possible and that's kind of cool. So predicting the future of what the translator experience is going to be here is shall we say a hard question. One thing I do think is that this sort of a data model makes it easier to build tools that can present to a translator more the options and opportunities that they might have in modifying a message and content like placeholders and markup which might just show up a syntax when presented a string and be a challenge to even realize that I could change how this bit of it is styled. But if we can present interfaces that can read the data model and understand from this that hey hang on this could be tweaked this way, interfaces that are richer could be built. However of course we do need to keep in mind that such a vast majority of cases are just it's best represented as a pure string. So a majority of work is not going to change but the corner case is where it gets interesting and challenging for those there might be opportunities to present messages in a more translator friendly way. And one part of this I kind of skimmed over it was mentioned in the Ujjwelts presentation yesterday on message format too is that here the selection for variants is not an inline component as it is for example in ICU message format or fluent but the variants all of the variants need to be complete messages presented at the sort of top level of the whole message which is entirely intended to guide towards structures that are easier for translators to deal with rather than needing to to figure out you have and then a selector of apples. Instead of that you have a selector which has you have one apple you have three apples and this sort of an interface. But yeah anymore? If not I would like to thank you very much for your time and yeah that's it for me.
Happy translating! It is possible to overcome the language barrier in Open Source!
Language barrier and open source. Here's my biology. I'm an open source advocate, educated, in-through asset, e-commerce and e-commerce practitioner. In recent years, last year, I switched my direction to the e-commerce. I'm also a patch open meetings contributor. I joined the project in 2012, so I got about many years experience. I'm also an interpreter and translator for some largest China's open source communities. I'm a member of Microsoft for startup founders in China. Also, I'm a founder of omoforce.com and www.omoforce.com. Last year, I was very honored to be a patchy community coach and a Force Asia conference speaker. So today, first, I'm going to introduce a little bit about my translation experience. First thing, I'll share you some experience and lessons we learned from the New York's open source Congress report, 2023, and open source initiatives, deep dive AS series, which those two projects we finished last year. Then I'll talk a little bit about the team build strategy. Finally, I want to show you a patch open meetings website translation. So here comes to my translation experience. Back two months ago, 2002, when I worked in college, I taught course like in ABUK courses. For example, information system analysis and design and also data structure. We used English textbooks for students, but to help students to better understand the concept, we use Chinese translation course materials. And then late in year 2011, I was Oracle offshore outsourcing project manager in China for an Alcon logic development, Alcon logic clinical trial study management platform. I was a project manager. We translate like a source, emails, testing, package, delivery, everything. So for one year, then I joined a patch open meetings team, I translated a website into Chinese and promote the website, promote the project in China. In recent years, most of my translation experience related to Kai Yuan Shou. That's probably somebody you never heard of, but in China is a famous open source community. We have each year, we have open source conference in China called Coastcom. And we got a lot of communication and cooperation with the Patch and the Neelix Foundation. So in year 2018, I was an interpreter for the Patch and the Sofa Foundation speakers for Coastcom, China. And in 2021, I gave my first talk in how to integrate AFDAP and AD with the Patch Open Meetings. In 2022, I was a translator for China Open Source Report. And last year, I mentioned before, right, in FOSA Asia and the Community Code over China for the role-based access control mechanism by unifying AD, Neelix and Patch Open Meetings. So in English session and the Chinese session. For the 2023 China Open Source Report, I was a translator. And last year, around November and December, I was involved in Open Source Congress Report reviewer and also a translator and a reviewer for OSI Deep Dive AS sessions, which is a video captioning project. So here I just got some images, photos. This is my first visit to Europe. This is my second time. And this is Coastcom 2018 in China. Those speakers from Patch. The interesting thing is, after five years, last year in Beijing, I fortunately met these two gentlemen again five years later. We made friends, right? But you can see the change after five years, all hair turned gray. Okay, this is the talk in the virtual, my first talk in the Open Source community, how to integrate FDAP and Active Directory with Patch Open Meetings. At that time, I was the personal former member of Kaiyuan City. This is last year, the FOSA Asia. We got both here right at the second floor and that's in Singapore. We have a group of Chinese developers from different cities in China like Shanghai, Beijing, Chengdu, Changsha. And also is a Japanese guy from Japan. We have a second group to the meeting. This is last year, a committee of code in China, in Beijing. See many Chinese developers, commuters. And I also gave a talk in the integration check. For Open Source Congress report, 2023, the title of the report is Standing Together on Shared Challenges. Report on the Open Source Congress. Probably somebody here, maybe you read the English or other versions reported, right? But in China, we are absolutely very honored to be part of the group. We do the translation work. The project created on November 17th, project complete on January 12th, 2024. The whole report length is 34 pages. We got six translators, two reviewers. We divide the task into six subtexts, including infographics. We use AI assistant such as DPL, CloudIn, ChagBt, Co-Panel, Google Documents to help us to facilitate translation. Usually, we use like a Google, Google translate to do translation. But when we compare the result, we find that Google translate is not accurate as DPL. So we just use the AI tool to help us. We labeled the whole translation process in different stages, like in-process, like translated, like reviewed. So each team member, okay, each team member, you can look at the labels, okay, got it. If the former process finishes, then I can start my process. The initial translation finished on December 2nd. And then we start our first round of QA and the second round of QA and final review. So left side, the English version front page, and the right side is Chinese version front page. Here is six subtexts, and here is my nickname. So I reviewed the two parts. So for the infographics, we know in the translation, for text translation, it's comparably easy, right? But for the image translation, we need to use more time. For example, we speed the infographics, there's 12 images, we speed it and get one of them, and we use the Google translator to get the Chinese version. But you can see the title and the content is unreadable, right? It's too small. So we just use, clean everything out and use Photoshop to add the content. So finally, we combined the 12 images together to get the infographics. Can you see? Sorry, it's not very clearly. Left side is the content of English version, and the right side is the content table in Chinese version. You can see the page number here. The English version is 28 pages. In Chinese version, it's 20 pages, which means after the translation, the five sides shrink. Just I only got two thirds of the source file. So here comes the lessons I learned. Original paper has some quotes, which some experts, some leaders, they say something, right? But the translation has some discrepancies, so we need a final review. After we send the paper to the latest foundation, they say, okay, we got some discrepancies. You guys need to pay attention. You need to have a final review. So when we check these discrepancies, we need to keep constantly looking at the content discrepancies. We need to keep consistency and one format, otherwise it would cause confusion. Because we are sending tasks based on the source file in Google documents, but the discrepancies exist in the target PDF files we send to the latest foundation, right? Okay, then we identify how many book codes and in which part of the paper so the corresponding translators will be able to check the codes. Because of the five sides shrink adjustment, right? And in PDF format, we got two column format. So talk to the file page number, it's different from source file page number. So if we didn't keep consistency on source file, then we got confused. This happened in the communication process. Okay, we found this postcode. Who did this? Nobody knows because when we check the PDF format, I didn't do this, I didn't do this, right? So the final solution is the leader, the leader of the project, he took all the responsibility. He checked all the protocols and the translated by himself. After we send the paper to the latest foundation, they used us some badges for just four. I'm very happy to get this for the recognition. For Year 2023 OSI DeepDive AI sessions, this is a video caption project. Project created on November 7th and the project will be completed on February 14th, which means we're still in the process. Total, there are 17 sessions. We had 10 translators in two groups. Only have two reviews. So we follow the steps. First, we use the raw video material to get the scripts. And then we check each word and the sentences by translating ears. We also use like TPL, charge PTE, quite easy, GNY. This is a Chinese automatic translation platform and YouTube to help us. We also put the status in translation translated in review under reviewed to identify the progress of each process. The initial translation finished on December 2nd. Then we immediately started the first round QA, round QA and final review. Then we have the second group to help us to publish in different social media like in WeChat as a web pop in China and also in Facebook and Twitter and YouTube. Sorry, it's too small. See the 17 sessions and we are forcited the publishing process. We subgrouped the 17 sessions into different groups. We got three groups. First one is open risk and challenge. And the second one is governance. And the last one is the fireside track. So for each group, we also got some subgroups. For the year about which are the sessions, I either reviewed or be a translator. I am the reviewer. So you can see it's almost half of the sessions. So this is project management system in Chinese. That's my role. So you can see the yellow bar is my role. And here the arrows pointed to the different process. First one that means reviewed. And this one means in translation. So we can identify the process because the team members, we will work together and work on different cities. Okay, here comes the lessons I learned. We need to check subtitles versus warriors to make sure every word matched. It is very time consuming. For the video session, usually like if it is 30 minute session, we usually take about 10 or more than 10 times time for a 30 minute session. We need like 6 or 7 hours for the check. Because subtitle has 120 character length restriction for bad audience sense of reading. So we need to split every sentence, longer than 120 characters into small parts and also split the subtitle time frame. The split caused some subtitles didn't show up because the time frame didn't match. So we need to listen every word very carefully and adjust the corresponding time frame into secondary seconds, which is 100 seconds. So it's a time consuming. Especially when some adjectives and longer sentences come up, we need to adjust the front and back word time frames next to get perfect effects. Here are some links of some published sessions. See Facebook, Twitter and let's we check. Team build the strategy. Right now we have more than 50 translators. Most of them come from universities, some students in Europe. So team leaders recruit some new translators from various sources, companies, universities. Then new translators will introduce to team members. Every member assigns the workload by identifying tasks by one theory. So you can just okay when the project comes up, you just pick whatever task you want to get involved with. Then we have basic benchmark scoring system, student trial period to record every member's performance. For the benefits, we each code tokens for technique on the long technical contributors. For those team members who have good performance, we grant them the privilege to use community resources such as DPL and the crowdfunding platform. Community also provided the chances to team members as a voluntary for each year's conference. And the students can also add the voluntary experience to resume. So here is the code K token system. But that's in Chinese. Upside is for technical translators. Low side is for non-technical translators. We got like ABC three different kinds of tokens. You use the tokens, you can to cooperate with open source community for different kinds of like Hexen, like different kinds of events. You can join the events. Here is the basic benchmark system. We have a different contribution type like principal, helper, consultant, informer. For each different role, we have basic score weight. So the benchmark score value is 10. So after you just use the score value 10 times by score weight, you get the basic score for different roles. Then here is the task management system for the team members you join the project. We use the system to calculate the contribution score. Finally, we sum up all the scores of your contributions. You can see on the top one is my name. I got 51. So they grant me some privilege. I can use the crowdfunding, I can use DPL. Some AI translation tools help me. Okay, here's my Apache Open Meetings Translation. Apache Open Meetings, may I ask, have anybody heard about the project or used the project before? Never have I heard? Okay, this online video conference system. Originally, it is from European. It is Apache project. It is only fully web browser based open source video conference system. No need to download apps, no need client-side installation. You can create a middle server for remote collaboration. The server can be installed either locally or via container. But I recommend use locally because if you're not very familiar with container, in the installation or in the configuration process, probably you've got to trouble, right? Sometimes, you know, if you do not do the commit, you probably will lose the original configuration data. Usually, we need to install the server behind the term server for NAT configuration for the whole function. Otherwise, you cannot use system. The system supports multi-language. The latest version now is 7.2.0. It supports 39 different languages. So, for the translation website, we cannot use just like Google Translators. You can use Google Translators for static website. But that's interactive, right? It's an interactive website. We've got actions. We've got scripts. So, you can only use the original, use the building framework to change the language. So, you need to change the labels and text strings. All language strings should be localized and stored in the language section. You have full future language editor with every installation of open meetings. So, you need to check out the language editor to look up the label IDs in the GUI. Just you need to run the open meetings client with the debug model. You cannot use the deploying model. The way every text string has label ID in places additionally in the text field. Later on, I'm going to show you the difference between the deploying model and the debug model. Sorry, this one. So, the upside image is the original language, this configuration file in Chinese. So, okay, when I first started, this is the one issue I found. I sent the issue to the JIRA for Apache. So, I said, okay, this configuration file didn't work. Some labels didn't get translated. They said, okay, you can do translation and you can send the file to the JIRA and we're going to update the source. See, can you see the difference between the first image and the second image? There is one actual space, right? You can see here, braces, the number braces, but upside means braces, space, braces. So, they used the file compare, found this problem and send emails to the group, to the email list. There is an issue in Chinese translation, actual space characters. So, that's not the lesson I learned, which means for the translation file, you cannot change anything except the translated content. You cannot even add one more space. Okay, I almost finished my presentation and there is some time left and I'm going to show very quickly about the debug model and the debug model. So, here is my contact information and my emails. This is my website. Okay, excuse me. I used that mean log in system. This is already Chinese version because we set the added profile here. I already set the language in Chinese, set time zone in Asian Singapore, but here you cannot see every label. There is no label ID, right? No label ID. So, how can I bring out the label ID? Because if you want to change, if you have no label ID, okay, let me bring back to English so you can read it. So, current password request. Okay, we'll change. Every time you change the language configuration, it will take effect immediately after you log in and log out again. So now you can see the interface already changed to English, right? But for each label, like if I wanted to do translation, so if you don't know the label ID, you need the Geltron Language Editor, you need find the corresponding language file, which is... We got a language configuration file here. So you can see total is 39 different language files, right? For example, for Chinese, that's 11. And also you can change it to French or to Dutch. So for example, if I want to change like a project website, right? If you want to rebranding or to do something, if you want to change this one, if you have no label ID, you got to remember the text ID. So go to the language editor, go to the... Then you need the remember. I remember because I test many times, I remember the label ID. It's 282, so you can change it here, right? Okay, then let's see the deploying model. Oh, so... There is a configuration file, WIPER.xml here, so you need to go to your WIPER.xml, then you search the deploying model. You can see that's a deploying model. You need to change the deploying to the development model. To deploy model, now the added command... Oh, sorry, small case. Okay. Now you can see we change it, right? Change and then... Come out, then restart the server. Okay. We restart the server, then... I need to... I need to make sure the server is start up. Okay, so we start. Then I used admin to login again. You only need to be slow. Let's sort install the virtual machine, so all the memory is already exhausted, almost exhausted, so you only need to be slow, sorry. Do the thing. Sorry. Some bad things happened. Okay, now you can see this is in the DIPWIPER model. See, from here you can see each text label, you got a label ID, right? So if you want to change the label, you just go to your language editor, find the label ID number, change it, then you can get a different version. Okay, thanks. I finished my presentation. If you got any questions, I would be more than happy to answer. Thank you. So translating the open source report, what was the format of the files that you used as a translator? Was it Google Docs, was it PO, was it actually for what you used in which format? For the source file, we upload the file to the Google Documents, and each team members, if you want to identify which part you want to translate, you just label, select text and say, okay, this part I'm going to do it. So the translator works directly in the Google Docs? Yeah, yeah, yeah. Translate it, we got no problem to access the Google Documents because almost everybody we use the VPN, you know. Good question. Thanks. You have some way to detect space issues or using the maybe double width character instead of the format character, the spaces? Yeah, for the spaces, I'm not 100% sure how the project leader he found the spaces, but I know some guys, they use the file compression. Like in VSS Studio, there is a file compression function. You can use the file compression to compare the source file and the translation file to find the extra space. Yeah, I don't appreciate it. Do you use translation memory to speed up your translation process? Excuse me? Can you repeat the question again? Do you use translation memory to speed up your translation process? When you have a good question, I guess the translation memory is the crowding function. Crowding, you use such kind of translation memory, right? Yeah, we use translation memory. That's the building function. And we also use machine translation, which is AI tools such as child GPT or Microsoft translator, which do the integration with the platform. Thank you. Thank you very much. Thanks. Thank you. Thank you.
Lessons learnt as a translation contributor the past 4 years
Hi, I'm Tom DeMour. Come in. Great to be at Fasnem Sunday morning, 11 o'clock and we are talking about translations. I'm a long time open source contributor and in the past four years I've been contributing to Mattermost. I started contributing with translations and then I did write some blog posts, did some quality assurance as well, and did some programming in the meantime. In that time, I became the unofficial self-declared translation community manager. I want to talk about the lessons I have learned. For the people who are not familiar with the Mattermost project, Mattermost is a chat-centered platform with a lot of possibilities for integrations. It's very versatile. It's not only for DevOps. From the beginning from the project, the translations were a community effort. We are now shipping Mattermost in 21 languages. Each language contains about 9,500 strings that are 6,500 words. That's about 150 pages of text. In total, we have more than 1.6 million words translated. In this talk, I want to focus on the following topics. I want to give you an overview of our translation community and how we try to nurture them. Further, I want to tell how developers have an impact on translations. I want to share some good practices on how we try to improve the quality of translations during time. We have 680 active translators in the past three years. I must say there is a very high turnaround of translators. When we take a deeper look at the numbers, we see that only 15% of the translators is making 75% of the translations. From that small group, most of them are long-term contributors. But we do have also translators that come in, do an amazing amount of work in a very short time for a specific language, and then they disappear. These numbers are not based on survey. This is what my experience is telling. When we look at a large group of the 85% translators that doesn't do that much of translation work, when we are looking at what is motivating them to do that, we often see it's fixing an annoying error. They see an error in the translations, it bothers them, they come in, they fix it and they leave. That's perfectly fine, that's great because it was an annoying bug for someone and it got fixed. Others just want to give it a try. They say, okay, I've heard about it, I want to give it a try. But translation work is work that's not very visible. You're doing it alone, you're not getting real feedback and people give up on that. Others start translating because they love the product and they want to contribute to it. When we look at the 15% translators that do a lot of work, of translation work, why are they doing that? We see that very often their peers are using Mattermost, their company, their friends, their community, and they want to give them the best user experience. Also, we see that they like to keep the translations up to date. They love it when they see 100% translated. They would actually go in, log in at night to get that last string finished so that before we ship the next release, they still get a 100% feeling. Another nice thing is they become experts in the product because by translating, you get to know the product very well. You see new features that are not yet in the product, but you are already translating the strings, or you see features that you are not using, but that you are translating. Oh, I wasn't aware that we can do that also with the product. And they often become ambassadors of the product and like to contribute to the product. Look at me, I'm standing here. Now that we know what is motivating translators, we also know what scares translators away. One of the things is there is some frustration when they don't feel hurt or when they give feedback to developers. It's like mostly it's about style or sometimes the direction the program of the project is following. And yeah, I've said it a few times and it still hasn't changed and people sometimes leave for that. Things that are broken. We had some issues in the past when where translations didn't get into the product. And people, yeah, we are putting time in this and we don't see it in the product. That's also something that people don't like. They want to see the result of their work. We do reach out to new translators or to translators who are really on fire and are doing a great job. And we have noticed that sometimes reaching out to them is also giving them pressure to keep contributing. And they are saying, yeah, but I don't want that. I'm just doing this in my spare time. I love doing it, but I don't want that you have got expectations that I will not be able to fulfill. So we have to be very careful with that. We do our very best to be a very warm and open, welcoming community. This means that we had to remove a long-time translator for a shit language after we made some posts in a channel that violated our code of conduct. That action was took place immediately. It was no big deal. It's hard losing a long-time contributor, but community comes first. And I'm still grateful that how they handled that situation. And that's something I... Yeah, when you are doing translations, it's working with a lot of different nationalities, with a lot of different cultural backgrounds, and that's great. I still learn every day from it, and it opens your view on the world and stuff like that. And we have to make it happen. Like, we don't talk about politics. There are a lot of ongoing issues in the world. We don't talk about politics. I don't take a stand in them. I look at the people. And yeah, I want every conflict to be ended as soon as possible, of course, but I will not take a position in it, because I don't... Yeah, I can't. It's too complicated. OK, communication is key for us to keep our translators together and to move forward with them. We have a dedicated channel on our community server, and I will tell you about what's happening over there in the channel. Monthly, we welcome our new translators. We name them and say, hi, welcome. It's reaching out. It's saying, OK, we see that you have started translating rather most, and you don't see your translations yet in the product, but we have seen your effort and we are saying hi to you. What else do we do? The community channel is also a place where translators ask questions to the developers, but also developers ask questions to the translators from, hey, we are having an issue with this. Is it right that in your language this will do that, or... and stuff like that? Or we can simply ask, like, I don't know, oh yeah, it's readable, like, oh yeah, there is a new term in the product. Does it mean what I think it is, or is it something different? We are very open about ongoing issues as well. We had some much conflicts and we are open about it. People like it because they know that we care about these issues and that we want to solve them and that we are working on it, and that it makes them feeling at ease, like, OK, we are important, we do matter and they are working on it, and when they see your effort of getting things fixed, they have a little bit more patience and a little bit more love. Yeah, we also set up some webhooks, like we get notified when the translations are locked or being unlocked, because people are doing the translations very often in their spare time, and nothing is as frustrating as planning, like, OK, it's Friday evening, I have some time to work on the translations, and then, oh yeah, it's still locked. But if people know, OK, it's still locked, I don't have to check in to see if I can do my translations, it's just, yeah, being user-friendly. We have a weekly update as well, it's running for about three years now. It contains roughly the following information, it depends from week to week, but we have some topics that always come back, the current workload. It's very nice to know that, OK, the current workload for the next release is low, it only are about 40 strings or so, then you know, OK, I'm still at ease, but there are times that we have a very high workload for the next release, and that you have about 200, 250 strings, and if you are one of those guys that says, I want a 100% fully translated product for each release, then it's important to know, OK, I have to make some more time for this week, of this month. We also put in the deadline for the next release. I know myself, if I don't have a deadline, I will keep, I will do it, I will do it, but when I see the deadline facing, oh yes, here I am. We also ask feedback to the community, when we are changing things, stuff like that. Yeah, it's working well. Yeah, what else do we do? Swag, we often make a joke that, matter most, is a swag driven company. That's not always true, but as you can see, to prove my point, I've got socks with me, people who want socks or stickers after the talk, please, come over, we do ship some swag to our translators as well. What else do we do? In each release, with the release notes of each release, we add the names of our translators. They are contributors like everyone else. Each release has most valued professional elected translators, do get elected as well and get that reward as well. We reach out to new translators, I already said that, but we actively are reaching out by mail as well, sending them an email, but we only get a 2% response rate, so it's not really high, and we are not really sure if it's working, but it's still saying, we are seeing that you are doing something that we really like, and we want to say thank you. And we had our first virtual meetup. It seemed fun to me to bring all the translators together and to have a nice chat. We were actually with three people the first time. Okay, translating is something that you do alone, but we want to give it a try and we will make it grow. It's also hard to find the good moment, because we are spreading it all over the world and the time zones are not really working with us. But we're going to keep it, we're going to see how it continues. Maybe I would be happy if there was about 7 to 10 people, right Luri, on the Rhyler base, but we'll see how it goes. Okay, time for the next part. Our developers can help translators, because developers have a positive or a negative impact. Yep, use variables for everything that doesn't need to be translated. That's a lesson we learned a little bit of the hard way. We have about 200 references in the product to our documentation site. And then we had a great idea, we're going to reorganize the documentation site. What meant that we had to change 200 strings in the product and we had to ask each translator, can you copy paste that URL to that one? And that's of course quite some work. And then we decided we're not going to do that way. We must find another way. And then we replaced all our links to the documentation site by a variable. I scripted that. Yeah, that was something. I can script, but I couldn't do React and stuff like that, so it was flying blind. But I got good support from the developers from Mattermost and we made it. Another example is the minimum requirement for the browser you're supporting. That changes also very often. If you put that in a variable, Firefox is Firefox and the number will change, then you have the benefit that when a translation is lagging for whatever reason, your product information is still up to date. Okay, a classical one. Don't split sentences across multiple strings. Mostly done for formatting. But grammar is different in every language and you will end up with a Yoda-style translation. So don't. A question we are asking ourselves as well. Everything needs to be translated. Mattermost is a quite large project. If you have to start from scratch for adding a new language to the product, then you can wonder, should I prioritize translations? Like feedback from an API. Does it need to be translated? Who sees the feedback from an API? A developer. We sometimes think that developers know a little more English than most of our end users. Strings that are not visible for end users like the system console. We think system administrators, most of them will speak a little bit of English. If you have to make priorities, maybe that is not a priority. We have been struggling with plugins as well. What do we do with plugins? It is very nice if they are translated. Because within the product everything is in the same language. That is great. On the other hand, it is also quite some work for the translators to do all the plugins as well. You have to set up a translation server for that as well. So that is not as easy as well. Some plugins come with a lot of enthusiasm and they start translating. These translations were in a language that we are not shipping. What do you do with these translations? These people did a lot of work for the translation of that plugin, but we are not shipping it. Think about it. It is a great policy. Another thing, provide context. If you want to make your translators really hard, remove all context. This is an example where we have to translate the word add. Add can be used in very different meanings and translated in very different words. It turned out that it was add to a clock. It was a reference to time. In an ideal scenario, we would like to add screenshots to the translations, but we have not found a sustainable way of doing this. We do have a dedicated playground cloud server for translators. If they are an admin on their local instance, they can be an admin in our cloud server and they can go to the system console and they can play with all the settings and see, oh, that's what that string is really doing or is meant for. That is a way we are trying to help them as well. Prepare for the unexpected. One bloke came in and said, okay, I'm going to translate to Australian English. Okay, cool. Do. Then we found out when we shipped it that it wasn't showing up in our mobile app, because our mobile app was programmed to detect the language, but not the regional variant, so we had to do some code changes to change that. Another great one was we have support in beta at this moment for Persian. That's right to left language. That's something you have to check with your layout as well, because that's something, whoa, also not really what we were expecting, but it works. A classical one, some languages are more verbose or less verbose, like the previous speaker you mentioned, his translation narrowed down with one third of the document. Other languages like French, Portuguese are more verbose, and you can get in trouble with your placeholders or your buttons when we are putting quite some text over there. Now I'm going to be a little bit unsure. I think that some languages don't use spaces, and I'm going to see what people like Chinese, Japanese don't use spaces, right? And if you want to split sentences that are very long or translations, and we're going to split it on the second of the fourth space, and you're not using spaces, yep, you run into trouble. Yep, all those nice things you learn from working with people. Last thing, some people are evil. If you haven't noticed yet, they are. Our translations have been an attack factor of, they tried to, and you have to check your translations. They tried to put some nasty stuff in there. Also check, very often you can already see the username tells a lot. If you are evil-devil 666, then you are probably not up for something good, and you want to really contribute to the product. I'm not giving a hint to the evil hackers, change their username. Nope. But of course we want to be as open as possible to our contributors. We like contributors, the more the better. But we want to keep our product safe as well. So we have set up a permission scheme. For our work in progress languages that are not shipped and not visible, everybody can contribute. You comment, create an account on the web plate, and you can contribute. For the shipped languages, everyone can make suggestions. I don't like, this translation is faulty, I would change it like that. It's only a suggestion. Our shipped languages, for translating these, you have to get an extra permission. And we are just doing a small check, who are you, welcome to the community. Why are you wanting to translate, no matter most, and here you are. But it keeps the bad guys away. Okay, we are almost there. How do we try to increase the quality of our translations? I mentioned already that we have a very high turnaround of translators. And translating is an ongoing effort that never ends. And we recommend translators to create a stylite for their language, that they use the same phrase every time for the same English phrase. Because, yeah, you can translate it in many ways, but try to be consistent through time and through the product. This is something that we are recommending. Some languages do have it, some others don't have it. Translators are language experts. At matter most, we use an informal, friendly tone in our products. And we ask our translators to follow that tone. But it's very language specific. And I learned that in some languages, people say, yeah, but if we are going to use that tone, it will sound childish or even offending. So how can we find a solution to be friendly and open and a little bit more informal, but still correct in our language? Another issue we ran into was that people were translating product names. We have channels, calls, playbooks as product names within the product. And of course, people say, oh, I'm going to translate it. But we don't want to have our product names translated. And so we made an agreement if they are written with a capital, they start with a capital, they don't have to be translated. So that, of course, if they are in the beginning of a sentence, then you still have to check. But it's something that we had to train some translators like, OK, it's not what we expect that you translate product names. OK. That was it. Do you have some great questions? How do you decide which languages are shipped? Oh, very good question. The question is just for the people on the live stream, how do you decide when a language is shipped or not? When 85% of the languages, of the strings is translated, then it becomes a beta language. And it has to be 85% for three months. Then it goes into the product as beta language. And if we see an increasing, then it goes to really in production after three months when it's about 90%, 95% translated. That means also that when a translation is lagging and is not longer being maintained, that it goes to beta again. But we keep them in the product, even if they are about 60%, because a faulty translation is sometimes a way to get new translators. How fine-grained is the permission schema, especially for languages you are already shipping, like there is a very active translator, how and when do you decide if the casual contributor becomes someone who can write translation? By reaching out to the translator. We reach out to the translator and first we watch. If you create an account, Lozen makes three suggestions for a new, for a shipped language, and then we don't see him anymore. Okay, thank you for the suggestions. We will take a look at it. But if you see that he's making suggestions on a regular base, and yeah, then we contact the person and we say, okay, we see that you are doing, you're making suggestions. We like them. Don't you want to be translating the product? And if the person says yes, we elevate them in the permission. Do you don't interact with the UTS team? Yes, yeah, we will, yeah. We're looking for a style, especially. Yeah, but then we say we have some localization channels for each language as well. We have a channel for German and we join the channel and have a conversation. And we do have some, yeah. I implement encodings a lot. There's no white space and no points and whatever. But I have a big problem with translations always with German, English, French and Dutch, for example. So in English you just have you. And in German you have this formal and informal, do and see or... And it's very funny, it doesn't matter which software I use. It's always, however, even when I translate from German to French, I say the German informal do and in French it's translated Brazil. And do you have anything to present because you said just take places always and take always the same places. But I think... The style guide. Like an open discussions. We have open discussions about it, like one of our French contributors is in the back of the room. And he decided, we decided, OK, we are not going to try a... And in German we decided to have do. And we let the... They are created by the community. I got a very good French contributor who made a very nice format. And I'm happy to share it with you. It's open source guidelines and after the talk I will share it with you. And people who want the style guide ping me and I will share it. No problem. Last year, distribution had an evil contributor, an evil translator who translated two-thirds of the job perfectly. And in the last third of the file they used swear words and bad words. The distribution had to pull down their ISO files and to re-release everything. And all this in the timeframe of two weeks or so. Do you have an idea how this can generally be prevented? It could be an issue at Mattermost as well. We try to have two translators, a maintainer and a reviewer. But actually that is quite informal and we are actually not really checking it. So we... I don't want to bring people on bad ideas but it's a risk. Yeah, that's a good idea. You can easily script it if you have a list with all the swear words. The good part is that we do a check-up on our translators for the shipped languages. It's not that everybody just comes in and we do... And especially the new translators, we do some Google Translate checks on their strings to be sure before we merge them. Yeah, the long-time contributors, I don't check them because I'm quite sure about it. It's an answer. LNMs are very good at that. It's not as comfortable as free, it's slow, but it's a long way. The other is using the distance from machine translation to try to understand how far, like the distribution case would have been called by distance between the machine translation and the existing swear words. But yeah, nothing comes from this. Yep. All your translations in the language, are they all community-provided? Is all community-provided. Sometimes for a specific language, a company pays someone or says to someone of the staff, please translate it, but we are not actually paying our translators. What's the answer using AI for translation? AI for translation. We tried it and we found the results not good enough. You notice that it's AI. I'm going to take him first. Can you welcome translators who don't have technical knowledge in the community? Because it's sometimes difficult for them to understand how to do translation. Yeah, we have a very user-friendly, web-late environment and we made a video recording where we go through the product and show how do you translate and web-late. That's something we use indeed because you have to be as user-friendly as possible. Some people come in and try to translate and make a pull request and then we have to say, no, you have to go through web-late. How do you deal with partial localizations? You mentioned that you have limits like 60% to 85% above which you are shipping a product with this localization, but that clearly means that there are some messages that are on-day in that locale. How do you present these to the user? In English. English is the default. If we don't find the translation, it shows up in English. You mentioned that you don't actually know any card using AI, but I do think you mentioned that when you can just use machine translation and you can learn the translation language, and then perhaps you can also re-edit language and use the cultural language. And therefore, you can use the power of that and with this long-term, would this long-term, would actually reach the up-stay level? In web-late, we do use D-Pill to make suggestions, but I noticed myself that for Dutch it's quite good, for German it's quite good as well, but sometimes you have to change it. Often the do-and-see problem and the two... On the left, and not in left, instead of why did you left it because of loss or so? So, it increases the performance of your translations a lot, but you have to manually go over them because you can't press the AI enough yet. How many translations per language do you have? Mostly one, sometimes two. In German, it's only allowed to add documents in passive form, and in English, you usually have reactive form with you. Do you also have an automatic form? I should ask the German translator what he does for me. Okay, thank you. Don't forget the socks.
Long Term Effort to Keep Translations Up-To-Date
Okay, can I start now? Okay, sorry for the delay. So, I will try to present about how Indonesian teams doing translation for several transition projects I have involved with. Okay, this start from one project. I mean, I have the calculate, I have create statistic from one project. And then I try to add other project that I have greatly miscalculate the effort need to do the statistic. But you will see how difficult, maybe because I'm not skilled enough to create the statistic long term one. Maybe some of you can help me. So, this is the start. Why we Indonesian team doing translation? Because in Indonesia we have around 276 million people there. And from this many people we have several major languages. Actually we have more than 1000 languages, especially here. Every, every, what do you call, every small density, we call in Indonesia is Desa, Kampong. Village, yes, village. Every village has their own languages. But do we really need to support or translate to everything? No. So, at least we need to consider that to translate in this big six, one, two, three, four, five, six languages. But I, myself, I only fluent in Indonesian and Japanese, so I cannot help the other languages. So, my, my, my, I try to, okay, let's start with one language that you say, used by the most of people here. So, so we can, we can start doing something then can be quickly used by many people. Okay. So, then in my talk I will compare about the Bravis, GNOME and Ubuntu. Why three of them? Because I thought they have sufficient data to create a long time historical data. I thought, but let me see how difficult it is to, to extract this. The other thing that, that three of them has periodic release schedule, good release schedule. For instance, LibreOffice is in February right now. I think last year they, they released in March and September. But I think starting this year, February and September. And then GNOME, GNOME usually a little later than, than LibreOffice. They also release two times each year. Also Ubuntu, Ubuntu you can see from the version numbering. Something 04, something 10. So every, every April and October Ubuntu release the, the, the, the, the, the, the, the platform to, to the translation for current one. They have very detailed, very good statistic. But if you want to see the previous version, especially the version mentioned that maybe it's already out of support or something. It's, it's not easy. So with that in mind, I plan to compare how many string changes in its release. Is it easy to get this kind of data? How many string changes in its release? It's difficult. Why, why is it important? Why is it important that, that we translator need to know how many string or how many words need, because we need to, to have size to, to, to try to guess how many, how many days, how many days, how many days. How many hours we need, we need to, to spend to do, to do this translation. I think, I think it, it, it will be helpful for, for every project that they can, can create that kind of data extract to prepare those transient themes. But how far translation for instance, you can, you can use, usually you can use percentage. Some projects use the percentage to, to filter out all this, this language translation is definitely include into release. Maybe less than a certain percent is, is like, like the mother most, it's beta version or something. So, again, for current version, this number is easily, ah, fine out. But, but for past releases, yeah, you need to, ah, I don't know. Some, some, some platform maybe still have the data, the other now. And the other thing is, I want to, how, how, who did this translation? For, for instance, in, in, in my teams, ah, in, in this project, I have, ah, let's say 10 members and this project 5 members, who really do the, the nation? Not all platform can provide that, that kind of data. But GNOME, ah, because the workflow is using, ah, the whole file, download, ah, claim to be translated, then upload. So, so it is quite, ah, obvious who doing what kind of translation. Okay, so let's see. LibreOffice. Right now, the, the main platform is, we have played. It has powerful search facility. Too many options. It's quite difficult to, to get, ah, ah, the, the correct, correct query to, to, to, to, ah, convey my, my need to, to extract. Ah, data I need. It also has, we have played API. Maybe I'm, I'm old school, never use API, so when I try, so many data, so many options, how to, how to create the correct API to, ah, this. So, actually, we have played, ah, it's good, but, but, need, need time. I'm not sure, ah, where, where to, who, who, who has, ah, access, ah, create something from their API. I, I want to learn from him or her or something. Yeah, the other thing that, ah, from the weblet, ah, they create a crown, they create, ah, ah, schedule job to push from this, transition platform into the main repository, kit repository. So, actually, ah, I can choose between, ah, source of data, data source. Do I want to take from weblet directly or from the git repo? I tried to access the git repo also. For instance, ah, ah, I, I get, I create a clone and then, ah, change to a certain release and then, ah, there is a git repo for, for Indonesian translation. So, from the directory, I try to list all of commits. From there, I can try to see how many, ah, line changes. Who do commits, but actually, when web, ah, when someone doing translation in, in web, from weblet, the data stop in, in weblet because commit from weblet to the, the git is done by some special, ah, special account, not the initial translator. So, so, there is data missing there. If we, we, ah, take from git commits, but, ah, some other details, it's good. And actually, the, the data I, I can present here came from way back machine and from my wiki because I, I, I maintain a wiki to, to maintain. I, I, and maybe 20, 30 translation projects to Indonesia. So, so many. So, I, I don't want to make a bookmark. My wiki is my bookmark for translation. So, this is the result. This is the latest status. The office. You see that, ah, UI only 99% of, ah, ah, finished. Only a few, this, this few strings not, not here. It's because we, we are not, not sure how to translate. This is missing context. Now, but, but this one is very bad. Help 70% and even for the newest release, less than 70%. UI for this, ah, ah, two, ah, ah, last four years. So, any, ah, between, between releases is, is, is six months. So, this four years. Relatively, almost 100%. That for help. So, a French version 70, 7.0. We have 3,000 strings. And right now, ah, 18,000. Untranslated. We, we have a good, good, ah, result when in version 6.2, UI and help for admission 100% translated. We, ah, that can happen because we, we did, ah, translation spring, ah, two or three days and how many people, maybe 30 people or something. Going together, boom, zero, untranslated. That unknown quality. Because, because they, they, they start transit only for, for that, for that occasion. Yeah, we, we tried it. Oh, so this time we tried to, to, ah, quantity, not quality. So, let's try. But after that, we, we don't have, ah, they are, they are only involved in, in, in that occasion. After that, ah, the long-term translator do not have enough energy to continue help. I, I, I sometimes try to, ah, maybe 10 or 20, 20 string to translate. Why? Because if UI, one string usually consists of maybe from one word to five, maybe 10, the longest one. But help string. One string consists of 30 words, 40 words, 50 words, ah, so long. So, ah, ah, you cannot compare only, only a certain string. No, no, no. You need to see how many words that really, ah, show how large effort need to, to, to do them. Okay. So, who did? Actually, only two. These three, I think, they involved in updating the, the source, not, not doing the translation, but, but they search, ah, ah, giving this data. Okay. That's LibreOffice. And then, let's see GNOME. These are the latest. You can see GNOME, GNOME, ah, divide their many modules into several categories. What we need to, ah, translate is usually, ah, we prioritize translation in this one, user interface for this, this group. But other groups usually, we, we, for instance, GIM and FRAMPS. This, this, ah, I never touch those for maybe, ah, for many years. Because not every project are easy to be translated. Especially if, for GIM is, ah, image editor. I'm not familiar about the terms used in, in this, this, ah, this community. What kind of translation usually use? So, so I, I, I, I, I don't have enough time to consult with, with, ah, what we call, ah, SME, subject matter expert on image processing, image editor. So, I didn't do that. But other, ah, the, ah, UI is good. Help, ah, still some effort to do that, but, but not quite good. Yeah. This is the statistic. Since version 3.6, GNOME, Indonesian translation is almost always 100%. And help getting better. This period I have too much free time. So, I do many, many translations. And then after that, ah, VC time. This how many commits? This to, to, to calculate who did commit? Actually only two people here. Two people can, can maintain that kind of, ah, percentage quality for how many years? I don't remember. Maybe 10 more. I started as a GNOME Indonesian Transition Connector in 2010. I offer many people, would you replace me and, and when come forward? Nothing special with translation coordinator. It's because no one else want to do that. Okay. Ubuntu is the most difficult one. Yeah. You see here? I'm translating 200,000. So much. Why? Because many of them GCC. Do you want GCC to be translated? Why? Do you want GDB to be translated? Exhib, library. So, so, so, so, because you want to use that, ah, what was the preference? Trans, ah, what, what is the? Transfects. Transfects, eh? They, they do not create a subcategory. So, so, it's quite misleading. How, how good Indonesian was translated into Ubuntu. How good, for any language is, ah, how good translated into this, any language. Okay. Ah, I think that's the statistic that, statistic analyzer, someone said it. So, so, why, why, why I care about this? Because I hope that if some other team can, can recreate the, the effort to, to do that, that, maybe they can use that for create a funding proposal or maybe, ah, can be used to plan transition phase, ah, spring, need for many people, ah, how, how long and target to can, can be, ah, can finish how many strings. It's, it's, it's rather difficult to, to use data from one language into others, yeah. But, but at least we can have, ah, approximation. Okay. I think that's my thought. Any question? Good. Ah, sorry, yes. Yes. Because we are all from Asian, right? And, ah, I'm not very familiar with the Indonesian, but I know you did a lot of translation work for upstream like Ubuntu. Okay. Ah, if I, if I come back to China, I want to do, okay, I want to do some translation for the upstream, ah, over assistance. So, ah, how can I solve it? Sorry. Sorry. How can you start? Ah, sorry, ah, what kind of language do you, Chinese? Because, ah, as, as far as I know, in Ubuntu, they, they have traditional Chinese and the other one. So, so, so you just joined those, those existing team and start translating. I think it's, it's quite, quite easy. I think, ah, I'm not sure what is the number, but, but it's already has a team, already has many translators, but you can join them and then, then continue. I think for a language that has not been started, it's rather difficult, but, but for Chinese, I think. Yeah, yeah. Yeah, Ubuntu, Ubuntu is, is, is quite established, I think, for the Chinese translation. For different projects, ah, you need to check, they, they, they have different, different way to, to do translation, different way to, different platform, different, different. So, for instance, ah, like my, my three example, GNOME, LibreOffice and Ubuntu, different platform, different, ah, algorithm, process, different process. GNOME, they, they need to take the, the whole file, translation file, doing translation locally in your computer and then upload back to the system. The other one, Ubuntu and LibreOffice, you can do two way, you can do translation online or, or you can download the file, translate locally, upload back. Okay, other question? No? Ah, yes. Because in the beginning you mentioned you have had some problems using the WebLay API for your projects. How many projects, different languages actually do you use within the WebLay process? Six, I think, right? Six different Indonesian languages, local languages? No, no, no, no, I only, only one main Indonesian. So it's only one project? Yeah, code name ID. And what, what, especially are problems with the API? Ah, for instance, ah, ah, if I want to see, ah, which one? How, how many strings left? Not translated yet. For help. This version, help. API can only, ah, need, need, need to be, need to be expressed with, ah, URL and then, ah, main function and then project name, project name is, ah, they have a list of, ah, UI master, UI version, ah, certain, and then help, master, help. So, that's the project name. And then the, the next component is, what do you call? Module or something? Module, ah, from, from this, this one, one project name, they has 200 modules or something. And then slash language, ID. Ah, so, ah, it says, I need to, I need to iterate all those things and create the, the summary to add and then something. It's, ah, quite difficult for me. So, it's more an organizational problem? Yeah, yeah. So, so, maybe because, because I don't understand enough the API, so, so my, my, my approach is not efficient enough. So, that's, that's why I, I need to consult or discuss with whoever, ah, more familiar with this, this API, that API. Okay. No more questions. Thank you.
Using Open Source AIs for Accessibility and Localization
So, hello all. Thank you for being here till the last and I'm a first-time presenter so if I get a bit jittery, I'm sorry. So, the topic that I'm taking is open source AI for localization and accessibility. Well, the main idea is to use open source AI tools to elevate the content that you are actually receiving and to enhance the localizations that you can benefit from. So, okay, so sorry for that. So, essentially coming back to what I was saying, we can use open source AI to enhance the subtitling potential and to have voice-to-voice conversion of a lot of videos and audio content in addition to the text-to-text conversions that we are. So, you might be wondering what is the actual problem? So, well, I've seen that most of the time when I'm trying to access a dog. I'll be having this language issue. For example, I was working with the technology in the augmented reality realm and all the documentation was actually in Japanese. I tried reaching out to the developers over there but unfortunately the language barriers still hit hard and another case would be with the same guys actually had a few tutorials available on YouTube but the same case. I don't speak Japanese and I'm actually unable to convey my ideas to them in the language that I know. So, this was actually an issue that we were all facing. And then there's the, so you might be thinking, why can't you use something like Google Translate? Well, the obvious case is actually data safety. If I'm working on a cutting-edge technology, I don't want that to be leaked to other people without my concerns. Or like I want to release that into the public. I don't want someone else to just take my data and then release it or without my safety or my approval. And yeah, when it comes to usability, that's another factor and financials, when I was actually working as an independent developer, financials, the financial side was a big issue for me. I didn't have the money to bankroll something like $1,000 into a translation subscription for every month. So, let's actually elevate that with a bit more of a user's story. Suppose I'm a research student and you can actually take the case of augmented reality right now. I'm trying to work in this very niche case and people actually know about it, but I can't really converse with them. I'll be having a few issues like that. And one of the main problems that I'm facing is actually the lack of resources. And there could be resources, but they are present in another language that I'm not really able to understand with or converse with. So I would actually require the resources to be converted to another language that I speak of. And I want the conversions to have a diesel level of accuracy. So that's it. And for one of the solutions, I can obviously ask for people who actually talk the language and require their services, but it is expensive and it is time consuming. So a stop gap would be to use an AI solution. A similar case would be in the case of the docs manager. So before I go there, what do we actually have right now? We have a few text-to-text conversion engines like Indonesia or if you're from India, there's about 128 languages that are actually spoken and we actually have cover for two. So if they are from such cases, you will require more coverage and you will require more assistance. So to sum up with what I was actually talking about so far, we don't have an all-in-one solution where we can actually use all these, which can actually fulfill all these requirements, be it from text-to-text, text-to-video or the other way around. So that is some of the things that I would like to talk about. And yeah, if we are actually looking for an open source library, I would like to have one that focuses on the audio and the video translation side because enhancement and accessibility for the audio and the video side is what is actually helping us to improve the language models and is helping us to reach a wider audience. So solutions which can actually help me with the automatic translations can be a good choice. So just to recap it again, I think I'll just skip this part. So what I actually require is an open source model that can be executed locally and actually gives me a decent accuracy or a decent amount of execution time and helps me enhance the quality of the content that I'm delivering. So the one to watch for right now is called seamless M40 and it's a model for meta and it's under the MIT and okay, it's under the MIT and the Creative Commons Live Senses. So it actually gives you speech-to-speech translation, speech-to-text, text-to-speech, text-to-text and automatic speech recognition. So that's a pretty good one. And as we all have been trying to highlight for the last 10 minutes, we require that because we need an all-in-one solution. And I was just, I think I highlighted about all these parts like the super informative video or the precarious conversations that you are having. Like if I'm trying to have a conversation with, can I have a name, sir? Samath. So if I'm having to have a conversation with Samath in his name, if I'm having to have a conversation with Samath and for a moment let's just think Samath doesn't speak English, he speaks French, I speak English and it's hard. So that's the conversation or that's the moment that I would require and that's the moment where I require a tool like this. But if it's not French, it's some language that's actually not sort of documented. So say I'm just going to go with Sohili. So yeah. Okay. Okay. That was a random guess. So yeah. If I'm going with a language like Sohili and I'm speaking in English or if I'm speaking in my mother tongue called Malayalam, I'm just going to sit here and I'm trying to explain the concept to him, but he doesn't understand what I'm saying. I don't understand what he's saying. And that is the moment where we require such a tool. He might have cutting edge research in the domain, but unfortunately, it's only accessible to the native speakers. So as I said, that is where the benefits come in. It's with the universalization of the resources. Anyone from a large org, a creator, a student, a developer, and basically anyone who can make use of these technologies and come out with this. So far as the technology that I mentioned, the M4G is actually under an MIT and a Creative Commons license. So it can only be used for nonprofit uses. And I believe it should remain like that. So that's the summary. But before we go, I think there's something else that I can show. So, okay. Excuse me. Okay. Yeah. Okay. So just suppose I'm this really famous creator and I hope the mic has an arm. Okay. Sorry. That just played out in a way. So suppose I'm actually a really famous creator and I'm doing something about AI. So I want this content. I only speak English and I want this content to reach you guys everywhere. So let's watch the individual video first. Okay. So let's pause it over there. And maybe how good it would it be if you can actually hear the same thing in another language. Okay. Wait a second. So how many of you guys over here know Spanish? Okay. Just tell me whether this even remotely makes sense to you. So that's it. Yeah. I'm going to take, yeah. Yeah. It sounds good, right? Yeah. I'm just going to take it from those two guys. And yeah. The same thing can be actually in French, in Dutch, in Italian, and in Hindi. I do speak a bit of Hindi, so I can validate that. So yeah. So would you guys like to hear it in another language? It's going to be the same audio, but if, okay, for the French speaking guys, I'm just going to play this as well. So you can see the text, the audio and all this. The model that I'm just mentioned called the seamless M40 is actually doing all of this in a single model. And I feel it's about time we actually have a few of these solutions coming into open source as well. That's a totally open source model. Like this is what I dream of. And maybe I'll be back next year with something that's remotely close to this. So thank you. Yeah. So any questions? So the model we're talking about, it's not open source, but it's like usable for non-commercial. Or is it open source? The model's open source, but you cannot use it for commercial use. It's having the MIT and the Creative Commons license. Yeah. Is the training data made public? No, the training data is not public. It's the classic Facebook thing. Yeah. So it's speech recognition because it's a problem, the touch problem and now it must for use. Yeah. It runs on speech recognition. It runs on speech recognition and you're converting it to a base language and from there you're converting it to this thing. So suppose if I'm speaking in Hindi and you want the conversation to be in France, it's going to convert the Hindi's conversation to English and from English it's going to convert it to French. Talk about the latency between the speaker speaking and understanding all this thing and also the way. Okay. There are models that in the same model can actually offer capabilities near 100 milliseconds, but it's a lightweight model and you won't have the accuracy of the heavy weight one. So we actually have to trade off between accuracy and the speed. So the heavier the model is or the more parameters it has, you'll be getting more accurate results, but unfortunately the speed will be coming down and I saw a couple of other hands. Yeah. The model is 2.7 million parameters one and if you're talking about the actual size of the models in gigabyte, it's about eight gigabytes. Yeah. So just to clarify a few things that have already kind of been answered. The licensing is MIT and Creative Commons? Not or. And correct? Yeah. Oh, okay. And then the other thing is when you are translating from one language to another, you said that some of the models have a latency of about 100 milliseconds. And how long did those audio samples you showed us? How long did they take to run? It took me like three seconds, but I'm running it on a call up T for GPU. So it depends on the GPU. Right. Yeah, I'm sorry. Sure. So the proposed solution that I use over here is the same as the LLM thing. So you can actually use racks or you can just split the text up into smaller, smaller bits and then combine it into this. So for example, this model actually performs better if you have something like 20 seconds of audio. So what I do is if I have a one minute of audio, I'm going to split it into three chunks of 20 seconds each and it's going to go off from there. Sorry, I don't have a solution to that. Yeah. And I think you have a comment. Yeah, I think. Okay, sure. Okay. So I'm going to go ahead and ask you a question. Okay. You can run it locally, but it's going to depend on your GPU speed. That's it. It's possible to have a view of practical information, link and so on. Oh, sure. Okay. So yeah, I just close that a bit early one moment. Okay. You can hit me up over here. So my name is Nevin Koshy Daniel and that's the same for the Gmail ID. You can just text me over here. And if you're looking for the particular model, you can get it from seamless communications on the Facebook research page. Yeah. And someone did ask me a question about latency, right? So if you guys have a moment, then we can have something called seamless expressive. I am not 100% sure how this will work, but accept. Try the demo. Can you come over here please? Tate? Sure. Let's have a try with this. Do I have to say something in English? Yeah, I think so. Or do I have to speak French? Let's try it with French, I guess. No, I can't speak French. It's not going to work. Okay. Yeah, let's speak English and let's translate that to French. French? Sure. Oh, you have to allow audio permissions. Yeah. Don't worry, I'll just re-bump this out. Yeah, yeah, yeah, it's okay. Yeah, right. Can I use Linux and run this on my server? I hope that works. So yeah, it's going to take some time to generate the translation. Oh, wow. It's pretty quick. And... I don't speak French so for someone who knows the language, it's correct. It's correct. That's very cool. Okay, is the model doing both the translation and the text-to-speech? Yes. This model can do all the four things, text-to-text, text-to-speech, speech-to-text, and text-to-another language conversion. So all the four things, yeah, and automatic speech recognition as well. So the five things. To give you guys this... I am. Okay. That's pretty much everything, and thank you. Thank you. If you're gonna... Okay. Oh. Yeah, no, I can't see it very well either.
The importance of Web Performance to Information Equity
All right. Hey, everybody. Welcome to the Web Performance Dev Room. I facilitated this year with Missyla and Wickey Media. We're going to go ahead and get straight into it. And I'm going to introduce Baz Skroten. And, yeah, I'll pass this over. In Yemen, the cost of a gigabyte of data is approximately 6,000. A gigabyte of data is approximately $16. In Chad, the cost of a gigabyte of data is around $20. In Yemen, the average income, or the median income, is about $250 a month. In Chad, the median income is about $60 a month. Hundreds of millions of people in the world live in areas where they spend a significant portion of their income on their data bundles. Often, lack stable charging facilities, and they can only imagine what it would be like to have the kind of high end, or even mid-end devices that people like you, I, or anyone else in this room have. Often, when we think about performance, we're thinking of making our websites faster. We're looking at making them faster and more fun to use in order to improve our conversion rates to sell more products. I'm Bas Faldan. I am a principal engineer at Mozilla, and I am the tech lead for Firefox Performance, and I'm going to talk about how performance is so much more than that. I'm sure that most of you are familiar with countless statistics about the limited means that the poorest half of the world's population live with. There can be no doubt that those people deserve the same access to information as you, I, or anyone in this room. Understanding the importance of that information equity means understanding the importance of web performance. When we're talking about performance, we're usually talking about one of three things. The primary thing, and the most obvious one, is speed. Speed is about how fast and how smooth the results of a user's interaction with your sites or services actually renders on their device. Another aspect of that that sort of directly leads into that is data usage. Obviously, before you've actually sent the data to the device, there is no way that they're actually going to be able to see what you're about to render. But something that's a little bit less obvious is that power usage is also an important aspect of performance. Not only are you going to help the environment by using less power, but you're going to extend the lifetime of people's devices, making their batteries last longer, but also causing them to heat up less, have less fan spin up, keeping the devices more comfortable to use, and decreasing the wear and tear and increasing the longevity. And finally, you're going to reduce the amount of heat, obviously. In the time we have together, I want to talk about how people living with more limited means are at a disadvantage for all three of these pillars of performance, and specifically also web performance. We'll go over what the global landscape looks like and the situation, particularly in the global south, that people are dealing with when we're thinking about some of these things. I'm confident that you will be left more motivated to improve your sites and services, and as a result, you'll pay extra close attention to the speakers that are here the rest of this day. For now, the first thing I want to talk about is raw speed. When I talk about raw speed, what I mean is the performance of the CPU of the device that we're talking about. This is basically how quickly, once a device has all the instructions that it needs to render something onto the screen, how quickly can it do that? What does that look like? Well, over here, I've compiled a list of the most common, most popular smartphones for Africa versus Europe. Now, getting good public data for Africa is actually kind of hard, but that's not too important right now. The most important thing is that this list for the phones in Africa is probably a lot for you like it was for me. This high sense, what now? I've never heard of these things, right? And the important thing here is the trend. Most of these devices on the list for Africa, or actually almost every single one of them, is at least 2x to 3x slower than devices we see here. And do not let their names fool you. That Itel S23, that cute naming trick, that device is nothing like the same model from Samsung. So what that means is that if my LCP takes 500 milliseconds of CPU time, delivering that same LCP in another part of the world will take at least a second. Now, we know that LCP impacts conversion rates quite significantly. A one-second improvement to LCP means a 27% of improvement to conversion rates. And that does not just mean, that is not just about how fast or how likely you are to sell your products. It's also about how likely people are to access the information that you are looking to present to them. And of course, this is not limited to the global south. The performance improvement that that Samsung S23 offers over that Itel S23 comes at a hefty price tag. With a Samsung S23 costing over 650 euros and an Itel S23 costing under 150. It doesn't take a genius to know which class of society is more likely to own one device over the other. But obviously, the raw performance of the devices is not the only thing that is different here, where most of us live, and in the global south. And there are other aspects of those differences that have a much more direct impact on people living with more limited means. And the most important one there that I'm going to talk about is data usage. Let's take a look at what the global landscape looks like when it comes to data usage. I've pulled a list off Wikipedia here for the countries with the slowest mobile data connections. One of the first things you'll notice here is that the mobile speed of none of these countries exceeds 20 megabits a second, or about 2.4 megabytes, right? And for some of these, the landlines don't even exceed 1 megabyte a second. But an important thing to note here is that this list from Wikipedia is built based on results from speedtest.net. Now, we can sort of assume that people aren't likely to run speedtest.net when they're not actually trying in a good connection situation. And we can see that because even the slowest here, 3 megabits per second, that's almost a maximum speed of 3G. 3G has a maximum speed of about 500 kilobytes per second. And let's look at that a little bit. Displayed here is a 4G coverage map for one of the major carriers in Nigeria, the most populous country in Africa with a population of approximately 225 million. What you can see here is that outside the major population centers, there's not much. And a lot of that on the really less densely populated areas isn't even 2G. So you can assume that a lot of these people, the fastest connections that they could possibly have access to are about 500 kilobytes a second. Now, let's compare that to a 4G coverage map by the FCC of the United States of America. Unless you're visiting a national park, you're probably going to have 4G. And if not, you're still going to get 3G. That's a very different situation. But now, of course, the speed of mobile data transfer is not the only component here, which is different between here and western countries and the global south. A more pronounced impact your users will experience through cost. Visualized here is the cost per gigabyte of mobile data. If we look at Chod, the cost of a gigabyte of data is over $20. If we look at the United States, that cost is less than $10. And in most European countries, the cost lies even lower. But even if we ignore the outliers, it's important to realize that the global average is roughly $4 per gigabyte. Compare that to a global median income of about $300 a month. Half of the population has to do with less than that. So let's think about that for a second and think back to the introduction of the talk. A gigabyte of mobile data in Chod costs about $20. A monthly income in Chod is about $60. That means that if your site ships one megabyte to the median user in Chod, that costs them about 0.03% of their income. If your website takes about three megabytes per visit and a user visits it once a day, that will cost that user about 1% of their income. Now, to make that even a little bit more concrete, I went to bbc.com. I looked at about five articles and I read them. During that time, bbc.com consumed about 17 megabytes. If a median user in Chod chooses to read five articles a day on bbc.com, every day that consumes about 15% of their income. Add to that the consideration that 95% of sub-Saharan Africa accesses the internet solely through mobile devices. You can see what an immense impact the data consumption of your websites can have on the disposable income of the people living there. And of course, when thinking about that data usage, it's not just that you're making it faster. You're, after all, on 3G, that three megabytes to show your site takes at least six seconds to retrieve. It's saving you and your user's money, as we already talked about, and it's going to lower the carbon impact of your sites and services. And talking about carbon impact, let's talk a little bit more about power. When we're thinking about optimizing power, we shouldn't just think about reducing the power usage by making our websites render faster. Obviously, if your website consumes less CPU, it's also going to use less power. But more impactful for power is, what are we doing when a user isn't really actively interacting with our websites? We should be avoiding animations, videos, or animated ads when a user is just reading on our sites. Of course, we should be minimizing the amount of JavaScript that's associated with simple interactions. And this comes with a myriad of benefits, even though those two watt-hours a visit to your site might consume might not sound like much. If your site has a million visitors, those two watt-hours become 2,000 kilowatt-hours, 2,000 kilowatt-hours a day for your millions of visitors. The average power consumption per capita, or the per capita power consumption in Africa, is about 150 kilowatt-hours a year. But when we delve a little deeper, there's a lot of other advantages. You're going to be decreasing the amount of heat users' devices produce, particularly on slower devices. You're going to be reducing the amount of fans that spin up, reducing the wear and tear, making them more comfortable to use. But most importantly, you're going to be reducing or increasing the lifetime of their batteries. And this is again the area where particularly the global south is disproportionately affected. There is about 1.1 billion people living in sub-Saharan Africa. By estimates of the International Energy Agency, about 40% of that 1.1 billion people live without access to electricity. I want you to stop and think about that. There are more people living in sub-Saharan Africa without access to electricity than there are living in the United States and Canada combined. For many others living there, access to electricity is limited and power outages are frequent. Of course, those people with no access to power are also going to be less likely to have mobile phones. However, for those millions of users that do own mobile phones and do not have access to power, or more limited access to power, and all those countless users that are going to be coming online over the next decade, they are often dependent on centralized charging facilities to charge their devices. Needless to say, for them, their phone running out of battery is a very different situation. They're for most of us in this room, where your phone running out of battery means you have to grab a charger, it's a nuisance, or maybe you grab your power bank, right? So now that we have a better understanding of what the world looks like in terms of the internet, where does that leave us? The internet is going to play an increasing role in everybody's lives from how you do your taxes and how you are billed by your service providers. And the potential to do good for the internet is immense there, and that potential can particularly benefit those people and organizations with limited means by reducing their costs for staffing, travel, and time spent for those people and organizations that are least able to afford that. At the same time, it reiterates the role that we have as developers to ensure that we have a responsible impact on the most vulnerable communities on the planet. Now, we're here at FOSDEM, which means that hopefully most of us are working on open source projects, and if you're anything like me, other projects, commercial or otherwise, using your code is a great source of pride. And that means that when we're designing our components, our code, we may not be thinking about those particular use cases. We may be thinking about users that are not affected by these particular disadvantages. However, we have to think about what other projects may be using the code that we're writing and what users they might be reaching out to, and those users may be in those vulnerable positions. Thinking about them means thinking about what performance. The great news is that the work we do to make our sites faster, make them use less data, and make them use less power isn't just good for those users. You're going to be making your websites for your users, especially when they're riding a train through a tunnel or riding an elevator. You're going to be keeping their devices cooler, making them more comfortable to use and making them last longer. You're going to be helping the environment. The greenhouse gas emissions that are produced by the internet and the devices that we consume it are vastly more than all of global aviation combined. And of course, this works the other way around as well. When you're making your websites faster for your users here, you're also going to be helping those people in more vulnerable positions. Since you're all here at this early hour, I'm certain that many of you were thinking about web performance already, and thank you for that, and thank you for being here. I'm confident that you're going to be even more excited to make all your websites faster, and there are a bunch of great speakers coming up the rest of this morning to help you do exactly that. Enjoy the rest of your day. Are there any questions? It's an interesting question. The answer... Oh, yes. So the question is, what do we do, or what do I do? I guess that means Mozilla, right? To make sure that Firefox works well on devices with limited CPU. I think the short answer is not enough. I think that, like probably most of you and most developers out there, almost everyone working on Firefox is on fast devices, fast phones, many, many iPhones where we don't even chip our own engine, right? And I think that is a hard thing to change. That's a hard mentality to change in the business of software development in general. We do explicitly test certain low-powered devices and their performance characteristics. But the global landscape is very diverse. I think that the reality of it is that we tend to work a lot on making Firefox faster and consume less resources on fast CPUs, and then we hope that translates to a better experience on slow CPUs. One day, perhaps, we'll have optimizations and work that targets very specifically the different types of CPUs and different compositions of CPUs, in terms of heterogeneous architectures and things like that, that are more common in the global south, but we do not currently do that. Anybody else? The next speaker has five minutes to set up.
Let's build a RUM system with open source tools
Hello, everyone. Today I'm going to be talking about building a real user monitoring system with open source tools. And before I dive in, a bit more info about me. My name is Tvetan Stojchev. I work on MPALS in Akamai. MPALS is a real user monitoring system. It's a commercial one and it serves, I think, thousands of customers, Akamai customers. And my hobby project is basic run, which will be the focal point of this presentation. And really before I dive in, I would like to share a bit more about some of my other personal activities. Every December, I make an attempt to publish at least one blog post on the web performance calendar. That's the best place for the web performance to see us in the year. And the other thing is, sometimes I do a street art. So that's my safety net plan. If chat GPT takes over the world, I still will have something creative to do. Yeah. So let's now move on to the important part of the presentation. And let's take a look how, in general, how a real user monitoring system would look like. So we will need to have something in the browser, ideally a JavaScript agent, that will read some data and it will send it to a server. We will store it somewhere in a storage and later we will analyze this data. And here we just see the most trivial piece of JavaScript. This is the bare minimum that will do the job in the browser. So this piece of JavaScript will read what is the current page URL. And it will create a one by one image, one by one pixel image. It will append it to the HTML and this actually will create, will force the browser to create a request to this endpoint. And here is a really very simple code snippet on the server side, how the code will look like when we need to intercept this data and to store it somewhere. So here is our route where the browser will hit this route. We will read the query parameters, headers, headers, and even we will put a timestamp in the structure and then we'll save it to JSON on the file system and we will return back to the browser a transparent GIF. And eventually we will, on the next stage, when we want to analyze the data, we will go through all the files and we can create a summary for the page visits. And for example here in this example we can see that category four was the most visited page with 427 page visits. So that's the theory. And in 2019 I started as a hobby basicram and that's the initial version and the components that I used to build basicram. So on the browser side I started using an open source library called boomerang.js which collects a bunch of interesting metrics from the browser and sends them to a server. On the server side I used nginx and some PHP application code. And for storage I used mysql and for analyzing the data I still used php and for reading the data and serving it to a frontend and on the frontend I used plot.ly.js for visualizations. And I ended up with something like this. It actually, it's really interesting after five years it's still running. So if you want to give it a try this is the first version of basicram. You can visit demo.basicram.com and you can play with this UI. Now about boomerang.js. Boomerang.js was started 2011 in Yahoo by Philip Tellis who happened actually to be now a colleague of mine. And currently the library is maintained by MPAL's engineering team in Akamai. And as I mentioned the library collects a bunch of interesting metrics like the interesting ones for core web vitals, lcp, cls, fid. It also can track some session data. It can also help users of the library to create a timeline of all the clicks over the page like cycle of a visitor. And also it has more modern ways to send the data to the server like more modern JavaScript APIs fetch, XHR and send beacon. And it can be found on GitHub in akamai slash boomerang. On the back end side that's again like very theoretical but what actually was happening I still was every request that I was getting to my server I was saving it in a file. And then periodically I was running a cron job which here I just marked as a that's kind of a too much overhead and you understand why later. But I was running a cron job that was reading all these collected files and I was creating one big batch and I was inserting this data in my SQL. I also ended up with a database model that's very biased. My previous background was I was building Magento online shops and if somebody ever worked with Magento we'll probably recognize some patterns about all these foreign key relationships and this main table that's in the center of everything. I had to put bunch of indexes here and again this created a bit too much overhead for I would say also on the code level like on the application level but also for me as a maintainer. So I had to take care about again every time when I wanted to introduce some new dimension I had to create a new table and to put a bit more code for inserting the data and it was just too much maintenance for me. Also I had to take care about not duplicating some of the data here and this is because of the nature of PHP. PHP is kind of a stateless so every request is independent from the other request so I couldn't keep some things in memory. If I could keep some references in memory I probably could optimize some things here. And actually question to the audience do you have an idea what this query actually would produce? What's the idea behind this query? Maybe. I can say that. Bucketing? Yeah it's a bucketing for a histogram and I also had to write a lot of kind of queries that are in the data scientists type of queries which also was for me introduced a bit of a learning curve but the system had really had coded in itself such type of queries and this here represents a histogram of the time to first byte. Like we can see that the median is around 1.8 seconds. It's a bit skewed. And with the help of plotly the JavaScript library for visualization I could create such panels for distributions for operating systems and mobile operating systems and I also could write such bar charts that were showing kind of the relationship between the first byte and start render time. And yeah reference to the plotly it's a really cool library really rich and you can create a bunch of panels with it. But I found myself like having difficulties and probably not focusing at the right place. So as I say when you build a real user monitoring system you need to change your mindset and your queries should be more like in data scientist style. And the PHP were out and the ORM that I was using I was using doctrine. It's not really meant for writing complex queries from this fashion. So I found myself writing my own query builder and using doctrine when convenient and using my query builder when convenient but this was again too much maintenance for just for a single maintainer of a project. I also wanted to introduce user management and permission system but again with my limited time and working from time to time on the project during the weekends this was just again too much it was not the right focus. I wanted just to show some meaningful data. And yeah I really love plotly but I just ended up with large blobs of JavaScript here and there and it was more like more and more plotlier. I wanted to see data not writing JavaScript. So I took a break I believe half a year and I focused on my main job but from time to time I was doing research and I was reading some other articles about time series databases and I started exploring some of the open source available open source systems for visualization. So I kind of rebuilt the complete backend. I still kept boomerang but I rewritten the server site so I completely removed nginx and PHP and I used golang. I replaced my SQL with click house and I replaced all the custom code all the PHP and plotlier with grafana. And again if you would like to play with the current version of basicram that's what I ended up with that's actually a let's say a bit of rebranded version of grafana with the specific basicram dashboards and settings. So if you would like to play with it just visit this address and write calendar calendar as a username and password. So where golang was really useful, golang it's just different paradigm it's a different idea compared to PHP. Golang you can compile a single binary that and in this single binary everything that I needed was packaged inside the binary so it's just a process that you run on the server and it has everything inside and this allow me to replace the actually to get rid of nginx because golang has a package for built in htp server and yes PHP also has a package for PHP for htp server but you need to do a lot of work arounds to make it working because just not native in this is not native in PHP. I also could leverage the existing click house package in golang for interacting with the click house database and I took advantage of asynchronous insert which saved me a lot of I could get rid of some code that I had in the PHP version of the basicram. Also in golang it was very easy to create a backup mechanism for all the data that was flowing through the system because in golang I could easily keep stuff in memory I didn't have to write each request to the system on a file and later to batch it and bundle it. I was just keeping these data points and requests in for example in memory for 10 minutes and I could just flush them on the hard drive and compress them and this was again really really easy few lines of code and just natively coming in golang and also for some cases where I needed encryption again in golang there is a let's encrypt package it's a third-party package but I could easily just spin a server and say okay I want to use let's encrypt and I was getting secure connection to this server with it it really reduced the operation the effort on the operation site. I also took advantage of a gip lookup library which is using the maxmind database and why I needed this because in a real user monitoring system you would like to see from which city a visitor visited the website or from which country visited the website this is really helpful when you want to create a report and when you want to figure out maybe in which country is your website is slow. I also took advantage of another library about user agent parsing so this library helped me to extract important information about the browser name the operating system and the user agent family and I also started using my new favorite database Clickhouse. So you remember where I say that I was doing a lot of work when I was like batching and bundling everything and inserting these big batches in MySQL. Clickhouse comes with a really cool feature called asynchronous inserts so Clickhouse allowed me every time when a request reaches my back end to immediately to create an insert to Clickhouse and Clickhouse was internally batching this and it was deciding where it needs to insert in the database so this was not this helped to like not reach some performance botonics. Another thing that I could do with Clickhouse so here you see I have seven tables in the old setup with MySQL but in Clickhouse I actually end up with two tables and I actually could I actually could have one table but I needed this table for showing the host names in the filters in Grafana and just Clickhouse or in general when you work when we work with time series the main idea is that here the the the data is normalized I try to really build to build a user monitoring system in the fashion of a webshop right which is really the wrong idea but when we use time series database the idea is that the data you can just throw your data into this database you you have one large fat table and you throw a lot of data and you don't really need to consider duplication of the data for example here we have this filter's device type and I don't have a foreign key here to another table where I keep references to all the device types I just can insert and insert the same string over and over again desktop desktop desktop and this database will be completely fine with it it will compress the data internally and I won't experience any performance bottlenecks when I filter by this field and here is my other favorite feature in Clickhouse it's called it's called low cardinality data type and this data type is really convenient for columns where the distinct values in this column some less less than 10 000 because this it's optimizing eternally and it's the the where conditions and the filters in this case are much much faster when we use low cardinality we if if we have more than 10 000 distinct values we probably need to go again to something like this and to start introducing additional dimension tables also so here in left is really uh I would say insane I even don't know how I created this I still I'm really surprised with myself and you we cannot zoom in here but this was a process where it included querying my my secure database and I had some application code and I had bunch of cron jobs and this was trying to guess and to find out all the sessions that bounced and what was the duration of the sessions it was just really complex and for example to to calculate the bounce rate with my new setup in Clickhouse I just could use such a query again I got a bit help with this query I don't completely understand it but it does it actually it works and it's much more simple and much much more it makes my it makes basic run much much easier to maintain and with with this query I could actually create easily this correlation between bounce rate and epic and metric and in our case this is time to first bite also I want to say that open source is not only about how great is the open source product that you work with but also the community is very important and that's why I also stick to Clickhouse they have really great slack community and every time when I ask a question I I can say that in the matter of a few hours I get really a good response for example here I'm asking hey I I wrote this query but I feel that it's not optimal I'm not a SQL expert and here another expert actually suggested a better way how to write this query it's it's shorter and it's much more performant and also probably this is the first and probably the last database channel YouTube channel that I will be subscribed but I'm actually subscribed to the Clickhouse YouTube channel and they have really really good videos like they have every month they have like a release party video where the the Clickhouse team is showing the new features and there are a lot of good tutorials so it's it's really welcoming for for beginners and they say you get support from the community and there is really good there are really good materials out there so now let's look at the user interface Grafana earlier I mentioned that I was about to start in my in the first version of basicram I was about to start implementing my own my own user management and login and authentication and Grafana this comes out of the box so it's much easier to add new user to give them different permissions and again this is just the code that I would never want to write again right and in this repository I bundle the basicram version of Grafana it has some customizations also another benefit of Grafana is it's very easy to model the data and what you want to see in the in the visualization panels so for example here we have we can define filters we can have a preview of our data we can also configure different things for example here I'm just showing how I can configure different colors for the different thresholds and also there is an SQL editor so when I write the SQL here this Grafana uses this SQL to fetch the data from Clickhouse and here are other panels that I took advantage of here is the world map so I could it was really literally plug-and-play I just configured few stuff and I say it from where to read the data about the countries Grafana also has a third-party plugin for plotly so I still there were scenarios where I wanted to build some more complex panels and with this panel I could actually build this one which is showing how the device the screen size is the width of the screen size is distributed yeah time series this is the kind of the most the default view in Grafana and also I could present the data in a table this is very good when you want to explore your own data also Grafana comes with different data sources and of course Grafana needs to know how to talk to Clickhouse in my in basic realm I'm using a data source developed by company called Altinity but there is also another one developed by actually official by Clickhouse right yeah and just to say that all these things that I'm showing all these dashboards that are built in in the basic version of Grafana everything there is actually under version control so it's not just that I created a dashboard in Grafana instance and exported it and save it somewhere this I have this repository where I have the configuration for each of the panels that I'm maintaining and then this makes makes it much easier when I need to change something or to add a new panel and I can go through the history and I can understand what actually change if something has to be reverted yeah for example here we are seeing how I keep this row as it's a templated SQL but this is how it's presented then when we look in Grafana and again out of all this source code configuration that I keep for the dashboards I'm building a docker image where we here we have a bit of branding work just removing some things from the default or rewriting some things from the default Grafana image here we are installing the plugins that we need for our setup and here we are importing all the configurations for the dashboards and the data sources and what I found over time when I spoke to different people who asked me about three user monitoring systems very often the conversation was just ending when when I was explaining yeah you need to run this component on this server and you need to run this component on this server and you don't need to run this component on this server and it looks like their use case the use case of the people that I spoke to was actually not requiring them to scale they had pretty small websites or web shops and I work on something a bit more monolithic it's called basicrum o in one and the idea is that probably again probably it sounds from engineering point of view a bad practice but it actually could be really practical thing the idea is to run everything on one big box and I believe for 20 euro a month this could be actually hosted somewhere and I tested it it can handle 1.5 million page views a month and the idea here is we introduced traffic which is a proxy it stays in front of this folder components and it's helping me for SSL termination and routing request because some of the request needs to go to the data collection part and other request needs to go to the grafana to the part where we analyze the data so this is really convenient it's really easy for people if you just want to give it a try and a few takeaways I just have to say that a real user monitoring system is fairly complex system and you need to learn to train yourself you want to develop one you need to you need to learn more about on the data collection site where how the data is collected from the browser how to visualize the data and it will be a bonus if you learn about how time series databases work again choosing the right database to solve the right problem is the key and it's great when when you can transfer a problem from the application on the database layer it just saves a lot of time and yeah grafana could save a lot of time and effort even I recommend it even if you still want to build your own front end maybe just start with grafana to play with the data and to display something it literally will save a lot of time and I got a signal that I run out of time but you can catch me up all right I can take one question so in this project we don't really keep any IP addresses so for example that I guess that's what we consider like user data or yeah so the backend doesn't store any personal data in this case so by default it's using the IP address only to identify the country and the the city but it's not storing the IP address after that and I know that on the data collection site from the boomerang library I'm not sure if it's on the boomerang library has also like part of the boomerang source code is private but I know that for PCI compliance reasons it has special parts that try to avoid collecting stuff around the user sometimes the user may put for example a credit card number and this could be actually collected by mistake so this library also tries to avoid collecting critical user information do you mean to consent the cons so the library comes with a special snippet that's a loader snippet so you can have your own callback so you can you can call this loader snippet only after a cookie consent so it's possible you
Better than loading fast… is loading instantly!
Alright, let's get started. So I'd like to introduce Barry Pollard. Hi everyone, thanks for coming. My name is Barry. I'm a web performance developer advocate in the Google Chrome team. I work on the Core Web Vitals initiative. I look after the web Vitals.js library, crux, lighthouse, Chrome DevTools, all sorts of things that work about. Today I'm going to be talking about better than loading fast is loading instant. I'm going to talk about a new feature that we've added to Google Chrome very recently. So show of hands. I'd like to answer this question. So who here? No one? No one? No one? No one? Oh, oh, look what happened there. Like nobody likes that, do they? Nobody likes staring at a blank white screen for ages. It's even worse whenever it seems to have loaded but nothing works and then something else comes in. So hopefully you're all here at the web performance track because you actually want websites to be fast. And Google tried pushing this initiative quite hard a couple of years ago. So we launched this Core Web Vitals initiative. Show of hands of no tricks this time. Anyone heard of the Core Web Vitals initiative here? Okay, good. Sometimes I ask this question and nobody has. So we released these three metrics and we measure them in Chromium, which means they're available in Google Chrome and Edge and all the derivatives and that sort of thing. Recently Firefox launched LCP. Yay. So we're hopefully starting to get cross-browser support for them. But there are three metrics that are supposed to measure three different facets of the user experience. And we give recommended targets and say whether you're good, whether you're poor or somewhere in between needs improvements. There is a change happening, by the way. We're changing one of the metrics. FID is dying, resting peace. We're going to be using it with IMP. I'm not going to talk about that now. I would love to talk about that. So if you want to grab me afterwards, because there's a lot going on there, but that's the whole subject of a whole other talk. What we are going to be talking about mostly is the first metric, largest contentful paint. And it's a measure of when you click on a link until the largest content appears on the new page, which is supposed to be representative of the pages mostly loaded. Maybe there's some other little bits still coming in there, but it's generally the pages pretty much there and the user can start using it and stuff like that. In this, we said 2.5 seconds is seen as good. Anything above four seconds is seen as poor and as I say, somewhere in between that, it's like, eh, could do as I'm improving. And 2.5 seconds sounds like a lot. I see a lot of people in the forums never read the comments, but I occasionally read them and they're like, oh, 2.5 seconds. Google thinks that's fast. That's terrible. And you think computers measure things in microseconds or milliseconds, stuff like that. 2.5 seconds seems really poor. But the truth is that's actually quite a difficult metric. Meeting two-thirds of websites don't meet that, because there's a lot that happens whenever you click on the website. We'll talk about that in a minute. So getting 2.5 seconds is a tough enough target. But to me, I'm looking beyond that and I'm saying, can we do better than good? How can we do better than 2.5 seconds? How can we get instant? And that is very difficult on the web, because the website, it's got an inherent slowness to it. It's a distributed system. The internet is a network of computers and you sit on the browser at one end, you make a request, has to go to a server, and then this other request has to come back. This is different than traditional software before the web came along, where you download it or throw in 18 different floppy disks or get that free CD-ROM that you got in the magazine and actually install it locally and nearly everything's available locally. The web, nothing's available locally, not even the program. Okay, the browser is, but after that, everything that you want gets there. And this, by the way, is the best case scenario. The reality is more like this. You connect to something, maybe you're on a mobile and it's a mobile network or an ISP and it's going through a million different switches and stuff, not even the server gets to some load balancer which connects to a server which then has to connect to a database which then has to get some of the stuff. The last talk talked about connecting to a database and doing that and gathering information and sending it back. So websites cannot be fast, or so you think. The SBA is one attempt to work around this, so it tries to load a lot of stuff up front, maybe prefetch some stuff and get stuff and do that. But that often results in this guy. Anyone seen this sort of thing before in the web? Oh, sorry, by the way, do we all know what an SBA stands for? Yeah? That's right. Spinner page application. So this has become far too common on the web where you take that hit up front, but to a worse degree. Like it's kind of expected with app installs. You sit there and you wait and you see the loading bar going on the web. You don't expect that. You're not doing that. Particularly if you're going to look at one thing, I want to know what time Barry's fantastic talk is. And you go there and you get this, you don't when the first time website, they don't use an SBA, by the way. But still, this happens too often. And do you know what it reminds me of, actually? It reminds me of growing up in the 80s, where you used to get this whenever you load stuff from a set desk and usually some funky music. Some of them even gave you games while they were loading the main game, because it took a while. And like the 80s was 10 years, twi... All right, I just realised how old I am. The 80s was a long time ago. And the fact that we're still dealing with this is kind of depressing. And that's before you even get to the browser. So once stuff gets to the browser, then it has to run that JavaScript. And that's where you see the spinny sort of stuff. It has to do style and layout. It has to do paint. It has to do composite. These are all different parts of the browser just creating the website once you get the code. And that takes quite a long time. And again, this is usually measured in milliseconds, but paints and composite and stuff like that can take an awful lot of time to actually do that. And particularly with, again, sorry to dunk on SPAs, but they end up re-rendering the page multiple times at quite a high cost. And you quite often end up with this. You get a beautiful picture of your webpage, but do not touch it. You can't interact with it for another couple of seconds. You know, you sit there and you click on something, nothing happens. And then suddenly the menu opens and you lose a 16 time and you're like, what the hell is going on? So to deliver instant loading experience, we need to be less reliant on both the network and the client side processing. So just solving the network doesn't really fully answer the question because of all that stuff that has to happen once you have it. So you've been asking, how can the browser help with this? And there's a couple of things to do about it. They all basically fall into these two categories. You can prefetch stuff in advance. So get it before you need it rather than waiting for the user to ask for it. You sit there and say, I'm going to try and get it in advance. Or take it a step further and actually pre-render the whole thing. So not only do you get the resources, but you actually do a lot of that browser work and actually lay it out, actually sit there and render it almost like we do this ourselves sometimes. I don't know about you, but I right click on a link and say open in another tab because I know it's going to take a while. I've been at this website before. Let that load in the background and eventually go there. Prefetch, been there a while. Service workers are shown as great things. They can go ahead and prefetch stuff, which is great for next navigation. It's also good for offline stuff. SBAs, as I said, I gave them a bit of abuse earlier, but they're quite good. Sometimes some SBAs in the boot right is trying to get you next stuff whenever it goes there. The browser has an inbuilt, HTML is an inbuilt decorative API where you can just do link rel prefetch, say I want this JavaScript as script. There's a new speculation rules API that I'm going to perform the basis of this talk. Prefetch, basic concept, this is from a site I used to work at. We're on a login screen. I'm pretty sure I know where the next page is going to be. Most people that come to this login screen are going to log in, hopefully, and they're going to want the app loaded. We go ahead here and we load the actual area. It's a client-select rendered app, one of those spinny things because how everyone was doing it at the time. We load the app itself, the CSS, the JavaScript. We load a couple of static resources, a list of town and counties, some banner images and stuff like that. We can't load anything for the user because we don't know what the user is yet, but we can at least get some of the stuff up front. Go ahead there. Safari Sense has pretty good support. Firefox were the first support in 2006, which is coming up 20 years ago. Chrome Edge and all that, they've had it since about 2010. Safari have had it behind the flag for four years and I have no idea why and why they won't enable it, but anyway. Safari users don't benefit from this. I wish they would do it, but I'm okay using this anyway because Safari users, and I'm an iOS user myself, typically are using an higher end device, not often with better networks and stuff like that. I won't not use this because they don't support it. I still think it's a good API for that sort of thing. Prefects can help improve future web page performance, but it gets it slightly better. It's when we're talking about taking that 2.5 seconds down to 2.2 seconds or something like that. It doesn't give you that instant feel. The options for that are pre-render, some of you might be aware there was a very similar API to Prefects where you could do link rel pre-render and the speculation rules that I'm holding off, I still haven't told you about. Pre-render has less support. Chrome's had it for a while. Safari and Firefox basically never implemented it. To be honest, I don't blame them. There was a lot of problems with it whenever we put it in there, used a lot of memories, we didn't specify it entirely properly. We've actually taken it away in Chrome. If you do link rel pre-render, and you're like, oh great, this is going to pre-render, it doesn't actually do that anymore. Despite its name, what it does is we call it no state prefetch. It scans through it and gets all the links in the document that you know, CSS, JavaScript, stuff like that. It doesn't do anything with them. It doesn't actually pre-render anymore. That was kind of because we hadn't couldn't solve some of the things. We were like, there's a bit of a footgun, people were doing it wrong and that sort of thing, causing more performance problems than solving it. We don't recommend this anymore. It's still supported. I think we're going to remove it at some point. It's deprecated. Don't use it. Because we have replaced it with a new thing. This page, the slides, I'll give you a link afterwards. We have a new way of doing pre-render called the speculation rules API. There's documentation written by myself. I'm going to give you the gist of this talk. It's basically, again, you put something you put in HTML. It's a JSON-based format. You sit there and say, I want to pre-render the source here. Pre-render, you can also do prefetch, by the way. That can be a good way of starting into this. But getting pre-render at the end is really, to me, the ultimate goal. You have a source list. To be honest, we made this optional in the next version of Chrome because we've given you a list. It's kind of obvious a list. Why do you need to tell us it's a list? You say source list and you give it a couple of URLs. This will, in effect, load up next.html in a hidden background tab that the user doesn't see and next.html. If they click on that, then it will swap it in with the current tab seamlessly without using it and you get that instant page experience. But how can you know where the user is going to go next? That's the real difficulty problem. Baz gave a great talk to open this to talk about the cost of internet browsing, particularly in other parts of the world, where we can't guarantee and maybe users don't want us to use up their bandwidth and stuff like that. It can be very wasteful doing this. In certain scenarios, you've got a pretty good confidence level of where people are going or the costs are lower and you're happy to do it. In other scenarios, that might be less what you actually want to do. Chrome does put some certain things if you save data on, if you're in low bandwidth connection, it won't do this sort of stuff. Still, you don't really want to do that. Even for yourselves, you don't want to say that every web page load now costs 10 loads in your backend server. That can cost you quite a bit and so on. We've introduced a new thing called document rules. Source instead of being a list is a document. Again, we're going to make this option obvious because we've got a where object. In this case, we're saying any link with the href matches the slash star, i.e. any internal link, and that includes if you've fully qualified it because it will figure out that it's actually an internal, except certain things. You can put an exception in here and say log out. Log out. We don't want to pre-render for whatever reason that will log you out. Most links are actually safe to render. There's websites crawlers, Googlebot, some of you might have heard of and so on. Putting things where clicking on a link causes a problem is a very bad practice and shouldn't really be done anyway. But there are certain times we do things we know we're not supposed to do. There is a facility to say don't pre-render certain things. There's an eagerness field where you can sit there and say I want to pre-render it since the page loads, which is over where you go to me. This one is the one that I'm talking about here. This is moderate. This means if you hover over a link or you actually start clicking it, when you click a link, there's a lot of things that happens. There's a mouse down event, then there's a mouse up event, then there's a click event, then it runs some unload pages and then eventually it loads the page. Just by doing it on that mouse down or touch down on screens, you get a little bit of a head start, 50 milliseconds, 100 milliseconds, and that can do it. If you do it on hover, you get a lot of head start because quite a few people were hovering over and then they're like, yes, you can use that time to actually go ahead and get it. There's a lot of JavaScript libraries that have done this, but this is now built into the browser. It's available right now in Chrome 121. It's not behind a flag, it's not behind an origin trial. We've been through that process and it's actually available for people to use. Let's try and do a live demo. This is not going to go badly at all, is it? Okay. We have lots of DevTools built for this. My application. Okay. So when I run this, this is the original version. I have two links up here, Apple and Orange, and I have a third link Kiwi. By the way, I've misspelled in my rules. So you can see it has pre-rendered Apple, pre-rendered Orange, and then it's failed to pre-render Kawa because I misspelled it deliberately to show you what would happen here. So you can see two of these are ready. Now, if I click on this, now, I'm going to zoom in here. This is the magic thing. And I think an LCP of zero milliseconds deserves a round of applause, but obviously nobody else. So we're not talking about making it a little bit faster. We are talking about making it instant, where you literally get a zero millisecond. There's nothing better than that. That go negative LCP time. Though thinking about it, I'm sure there's something we can work with it. I just prove I'm not just doing it. Like, listen, it's not the most complicated app. It doesn't take a while to do it. But if I take Kiwi, which wasn't rendered, 120 milliseconds, I'd say it's a very simple app. And we've got very good Wi-Fi here, which is going to screw up my next demo with something shocking. But anyway, so if you think about that, this is instant. This is a very fascinating network and stuff like that. I haven't slowed it down. But multiply that up if you're on a mobile network and slower. Again, you might not get the zero milliseconds, but in most cases you will, depending on how much they hover over it and so on. Sorry, that was the immediate load. If I go back to the hover over version, again, let me zoom in here. So it finds all these URLs. None of them are triggered yet. So it's found all these URLs as potentially ones to pre-render. And if I hover over it, you see it changes. So Apple is now ready. Orange is now ready. Kiwi, I think, having this one. So what it does is it works in a FIFO, first in, first out thing. So it's got rid of Apple and said, I want to keep two in there. Because again, there's memory usage by this. Effectively, you've got two more tabs open. So as you do, move around. If I do Apple again, it will pre-render Apple and it will knock the next oldest down to the bottom. At this point, I've already fixed everything. Everything's in the HTTP cache. So I'm not causing any network costs here. There is a little bit more CPU costs, so there would. But again, we're not thinking, and some people do read like that, by the way, but most people only hover over a link whenever they're actually going for it. And again, I click on it and you get that zero milliseconds LCP. So the demo works. And I don't know what you might be thinking. Okay, that doesn't fix the first page load. That's great once you're in an application and moving around. But quite often, it's the first one that's actually most of the problem. Excuse me, dry mouth. And that's true. But there's not much you can do for that first page load because you haven't even gone there. We can't do anything like that. But the browser can. So Chrome can look at this. And if you look at Chrome colon slash that predictors, you can actually see what the user does, what the browser does based on your past history. So I cleared my history down to get rid of all the sites I didn't want you to see here. But if I type in d e d e v e, at this point, it's got 100% confidence that I'm going to developer dot chrome dot com because I visit that site a lot. So before that, it was amber. I have a little bit of confidence there. Once it gets a think above 60% it will pre fetch the document and go there. Once it gives above 80%, it will say, okay, Barry is definitely going there. And I'll actually pre render the document. And you may have noticed whenever you click on Chrome, it will actually then be an instant page load. So that's a nice way we can use this. You don't need to do anything about it. So if you're worried that preender not everyone's ready for it, the browser is actually doing this. So you better be ready for it. And look at it. And there's various things analytic fighters can do and Google Analytics does and Google to ads do that they don't actually register until the page becomes active. And I might ask why am I so obsessed about instant? Like surely fast is good enough. And I think it introduces new options for web developers. Like, has anyone heard of view transitions here? Okay, a couple of you. So view transitions is a new API where you can do things like this. This is shamelessly stolen from the call well ex-colleague Jake Archibald moved on dammit. He created this nice little demo. And what does you click on something and it does this nice transition effect as you move around your website. That's used to be possible with bucket loads of JavaScript. We've now made it a little bit of CSS. And it's a lot easier and a lot lighter. The browser does most of the work on it. That works in single page apps. But what we're really hoping to do is launch it for I hate the term multi page apps. I like to call them web pages. But anyway, and yeah, let's try that on this one. Let's say this works really well as long as the Wi-Fi isn't amazingly good as it is here. So this is a multi page app version of that same demo. And if I click on any of the sort of links, you might have noticed a little bit of a delay there. Not much because I say the Wi-Fi is really good, but it definitely takes away from the experience. And I also knocked the network down to 3G, which can't see. But anyway, yeah, there's definitely a little bit of a jarring effect there, which it's not the worst thing in the world, but it takes out that magical view transitions effect. If you go to the same demo with literally just that Jason block button, other than that, it's the exact same code. I can't code this stuff. Jake can. So I just stole his. And you can see again, we got the list of the potential links. And if I hover over this, you see it's running. It's now ready. I click on it. It does that nice effect there. There's not a one second, two second pause before it does that effect. So things like this, you might consider not using them on a multi page app or a non SPAR, or you might even consider using SPAR architecture because it doesn't, you can't get things like this. Well, now, hopefully, you can. That's available behind the flag at the minute. We're still working on that one. But that was a real life demo on the real life Chrome. You can test these links yourself afterwards. So finally, what's all it's got to do with open source? Very good Chrome. Look at you, done it. You're multi billion dollar company and you can do this. Congratulations. Your thoughts down here were a little bit lower key. If you've got an open source project, I would like to ask you to consider adding sport for this. I think it's a very easy thing that you can do. I say there's a lot of libraries that do that lover link over effect. And they do that if you run a framework, Astro is one framework that's done that they've added support for this. So you can go in there using some weird JavaScript Astro config and it'll do it word press. And I'm really excited about this. So on the Google Chrome team built a WordPress plugin. It's got two options. Do you want to prefetch? Do you want to pre render? And which of these options do you want to use? And we can suddenly give this to millions of WordPress sites. So WordPress is powers a third of websites at the minute. Just by installing this plugin, they can go ahead and just get instant navigation. And I think it's worked really well for the WordPress type thing, which is typically static websites, blog posts or articles or brochure sites and stuff like that. So yeah, let's make 2024 the year of instant navigations. And here is a link to slides, QR codes, stuff like or I'm around. I've got a couple of minutes for questions. Yes in the middle. Yeah. So on mobile, sorry, the question was if it renders and hover, how do you handle it on mobile? So moderate was render on hover or milestone. Conservative is measuring my style only mobile because there is no hover at the minute. It only falls back to effectively the conservative option there. Now there's an argument for and against in some ways, mobiles quite often the one that would benefit the most from this sort of thing. In other ways, mobiles often more constrained and you don't want to use it. So maybe not overly using this API until we're more sure isn't a bad thing on mobile. But yet trick of that is other things we can do. People can use the list URL if a library wants to build it. As you scroll and viewport. And if it stops scrolling, maybe that's an option. Maybe that's something we'll bring to Chrome and so on. But the minute it's a little bit slow in desktop. I think there's someone up back there, sure. So question is if they're a way to protect, if it's preloading, is there a way to deny it? Yes. So there's two ways. One, we send an extra HTTP header sec purpose, which is either prefetch or prefetch and pre-render. And your server, if you send back a non successful status code, so if you send back a 500 or a full award or whatever, Chrome will go, okay, they don't want me to pre-render this, I'll leave it alone. And similarly on the page, there's API you can check, am I in the middle of pre-rendering? Or have I been activated? In which case, maybe don't fire analytics, maybe don't load this, maybe don't load that, whatever, and you can choose to do that. So we're over here. Second rule. So if you open a pre-render page in a new tab, at the moment it won't use a pre-render, it will go and render it separately. So it's linked to the page. Similarly, if you go away from that page, it discards the old pre-renders. So it is linked to the current tab at the moment. At the front. Are there any attempts to put this in a common standard? There is. It's going through. It's part of WCAG at the moment. We've asked the other browsers for their feedback. I say none of them implemented pre-render first. We think we've done a better job with this. We've got a proper spec for it. We've got lots of things we consider. If the video is playing, should it play? If this API is used and so on. So yes, it's going through this under theization process. That's not to say other browsers will definitely like it. But we're definitely at least trying to do our part to push it out there. At the front. I'm probably running out of time, but go ahead. Sorry. The service, well, the speculation rules allows the rendering, which, oh, good point. The question is, if you use something like service worker to prefetch stuff, how does that compare with speculation rules? Service worker is pretty good about getting resources. Speculation rules is more about getting the actual document itself. As I said earlier, it sends headers. In certain scenarios, you can reject and say, hey, this is one that's live up to date. Must be live information. Don't do anything with it. There's more obvious to the server side. And it also offers the pre-render option there. Saying that, service workers are still very good for getting the sub resources. And even with pre-render, if you've got service worker back on, maybe the pre-render happens faster rather than only being half pre-rendered than it goes there. I'm not sure how the time is. No, I'm afraid we're out. I'm feel free to grab me. I have chrome stickers, little dinosaurs, by the way. Anyone want some? Sorry. Anyone else? Are you all ready for it? Thank you very much.
Keyboard Interactions
So, hello everyone. My name is Patricia. I'm very excited and happy to be here. Just quick disclaimer about me. I am Chromium contributor and today I'm going to present about keyboard interactions and how I tried to improve them. So quickly about myself, I am from Vilnius, Lithuania. I moved to the Netherlands to study computer science and during my studies I really got into open source, specifically through the Google Summer Code program. In 2022 I worked on the definition of the INP metric and I continue my work diving deeper into metrics, interactions and specifically into keyboard interactions that I will explain you today. So about 2022 I worked on the Perfetto tool which is a wonderful tool for developers but I won't get into details here because Alexander in few moments will explain everything you need to know. But how I use it to this day, I trace websites with DevTools Timeline check mark and it gives me all necessary information about interactions and specifically event timings. And we know when we have event timings we can get anything about INP metric. So what is INP metric that was already mentioned? It is a very I think popular metric today but it's simply an interaction to next pain metric that assesses responsiveness by measuring key press, tap and click interactions. So for example when you press a key on a virtual or physical keyboard or tap on a touch screen or click a mouse to open any menu on a website everything is measured for developers to see how fast their website responds to users input. And this definition is actually I think wonderful, very innovative and Google since Google announced it very recently that it's going to replace first input delay this March 12th which is very exciting for that performance group. However after looking better into the metric we found out that it's not entirely perfect although it's very wonderful metric but specifically key press interactions aren't working as we would like them to work. So my goal today is to explain how we improve key press interactions, what is firstly a key press interaction and how we measure them and then we will dive deeper into a bit more complex concepts of non-standard interactions such as emojis and how we measure them for the INP. So I guess this lines brings you to kindergarten especially considering the FOSDEM context of very heavy tech topics but to understand what is a key press interaction we really need to look into a simple button because key is just a button but in a more complex context. So when you press any button in this world button goes down and when you release it buttons button goes up. It's just that simple. So within the key press we have very similar behavior that contains the two fundamental events key down and key up. In this example we have one interaction as in the input we see character A and the entire interaction starts with the first event called key down meaning that user press down a key. We immediately generate interaction ID for that saying okay we start the interaction. After key down there is key press event which is dispatched if and only if there is a character value and we see that in this case we do have character value and it means that key value was mapped to a specific key. So for example if you would press anything that wouldn't produce a character value you wouldn't see a key press. Then we have some events about DOM so before input which means that the DOM is about to be updated input which is the immediately dispatched when DOM is updated and lastly we finish the interaction with the key up to which we assign the very same interaction ID as it was generated on the key down. And although this definitely makes sense and most importantly key down and key up are the most significant events in the interaction that you perhaps have already seen. So this sequence gives us the entire definition of keyboard interaction within the INP. So it contains of three time spans as in the click and click and tap interactions but in this case we have input delay, processing time and presentation delay. So input delay is the time when is the duration when user presses down a key and event handlers are executed. Processing time is the time it takes for the code to be executed in the key down and key up event handlers and finally we have presentation delay which is the entire duration when the event handlers are stopped being executed and we finally see something on our frame. And this definition definitely does make sense. However it had some problems. After better investigation we found out that it can be a bit more than confusing. Firstly having key press events interaction ID equals zero makes developers wonder if key press is related to keyboard interaction at all. And to make things worse it turned out that key press can be as large as key up and key down together. Just a second. So with this problem we updated keyboard interaction, the definition of the keyboard interaction and we have that the new update is very similar. It all contains those three time spans of input delay, processing time and presentation delay. However for the processing time we included key press such that it would be between those three candidates of key down, key press and key up at the end. And we really hope that this will remove some confusion for developers as we assign interaction ID to the key press event. And finally we do believe that including key press is a step towards polished, improved and more accurate IMP metric especially within the keyboard interactions. Well with the simple key press is everything is quite well defined. We know where the start is and we know where we finish the interactions. However more interesting things happen with non-standard keyboard interactions because we cannot be sure that our users will always use just standard keyboard interactions. I even came across this post from Instagram on Google's Instagram that has everything to express one idea from emojis to just basic symbols. And to understand how fast the website responds to such input we really need to dive deeper into input method editors. So what is, does anyone know what is input method editor? That's great. So actually I think most of you might have used in some sort of way. It's a software component that enables users to input text that cannot be easily represented on a standard keyboard. So it typically happens due to the very large number of characters in users' written language. And it's very common in East Asia regions for example Korean language, Chinese language and Japanese language. Although I would love to speak Korean, Japanese and Chinese unfortunately I cannot. So today we will look into a bit more standardized example of simple emoji which has very similar structure. So we already can see that we need to process way more events for one emoji than for simple key press. And however everything actually depends how many interactions were made while producing that emoji. In this case we see that users started by typing in H and then selected the emoji as you can see on your left screen example. Since the complexity is way higher of such interactions IDs we only assign interaction ID to input events. But thinking in general the differences between pressing down a key within emoji context and non-emoji context we find out that we still have those very important two fundamental events key down and key up. So with our updates we assigned interaction ID on key down and key up. We still start our interactions with the key down. We generate new interaction ID on key down then assign the same interaction ID to input event and we finish that key press interaction in the emoji context with the key up event. When users just select the emoji without typing we just simply assign interaction ID to the input event. And that gives us better understanding of how non-standard interactions behave. However the algorithm really requires some creativity and some better understanding. But coming up with the solution for something that does not behave in the same way was quite a challenge and the solution might not be most perfect because it heavily relies on the order of events dispatched. And for example we see here that when we hold three keys at the very same time and release them at the very same time we have three key ups at the very end with the exact same interaction IDs. And this shouldn't happen in general. They should have different interaction IDs. But although it might not be the perfect solution but looking into input method editors is a very important way to address web responsiveness within East Asia regions where people actually do use a lot of graphemes in Chinese, Korean and Japanese languages. And who knows maybe just all our emojis are just introduction to a bit more complex interactions. Maybe one day we will see 3D interactions and emojis will be just simple ones. So for this project I'm really grateful for my mentors and I call them heroes. They really helped me through the entire process of understanding interactions, defining INP and understanding how to improve web responsiveness for developers. And thank you a lot for listening. If you have any questions let me know and if you're interested you can read the blog or just ask me anything you want. Thank you. And do you have? Yeah. Yeah basically, so do the interactions go from top to bottom? So it's actually it really depends on the way we see on the websites. And there's like websites that shows all the events dispatched and it starts from top to bottom and maybe it's not the most intuitive way for you to read from top, I mean from bottom to top, right? But is the usual order that what you would get when you try to look into the events dispatched during keyboard interactions? So yeah, I mean yeah, this was from bottom to top, absolutely. Okay, thank you.
Web Performance at Mozilla and Wikimedia
Hello, my name is Peter. I work for the Wikimedia Foundation in the quality and test team. So I have like three minutes, so I'm going to show you a couple of things. In the team, we want to make sure that we find regressions, right? And the cool thing about it is that we keep all our performance metrics in the open. So you can go to our Grafana instance, Grafana.wikimedia.org and see our metrics. Now I'm going to do a live demo. Oh, that didn't work out so good. So let's see. Okay. So, we have, I'm going to show you four dashboards. We have our real user monitoring dashboard with the data that we send from our read users back to Wikimedia. So I propose that you go to that dashboard and look out at our performance metrics. I think it's quite interesting because we don't have so many big websites that actually show their data. We also have another dashboard where we have all our synthetic tests. So you can use the drop downs to see the pages that we test and the performance data of that. So this is kind of like internal data, so maybe not so interesting for you. So I have two more dashboards that is more interesting. So we have the user network dashboard. Let's see. Here is actually what kind of network our users that use Chrome has. So we use the network information API and beacon back the data. So we can use the drop down and see what kind of network our users is using. And if we scroll down, you can also see what kind of connection type they have. So this is interesting because you can see what kind of connection different areas of the world have when they access Wikipedia. And the last thing I want to show you is the CPU benchmark. So as Beth said, it's important because different users have different devices, right? And for some of our users, we run a small JavaScript that we measure the time it takes to run and we beacon that data back. So we can see what kind of performance different devices have for different users all around the world. And we use that data to actually see and compare it to different devices. So we can use that data to tweak how we run our tests internally. So if you go to that page, you can see what kind of benchmark to use as a Wikipedia. Okay, that was all for me. Dave. Thanks, Peter. Hey, everybody. I'm Dave Hunt. I'm the, I'm going to stand over here so I'm not blocking the screen. I'm Dave Hunt. I'm the engineering manager for the performance tools team at Mozilla. And I'm going to show you a little bit about how we handle regressions for Firefox. And we have tests that test page load benchmarks. I'm going to use a real example of a recent regression. And I'm going to go pretty fast through these because I only have a few minutes. So obligatory slide with a quote from a famous person. So Galileo, Galilei said, measure what is measurable and make measurable what is not so. And I think this is something we try to do in our team. So here, this is a performance alert. We have a bunch of tests running on, we're not suddenly commit something to Mozilla Central, our repository for Firefox. When we notice a change in the baseline, we generate a regression. One of our performance sheriffs will be monitoring and triaging these alerts. In this case, this one was triaged by Andra. And this shows you the magnitude of the regression and the tests that have alerted. In this case, I filtered it down just for simplicity. This is Expedia. And we can see some of our speed index tests have regressed. The sheriff will do some investigation. This is the same test or one of those tests shown in graph view. You can see this is our baseline. There was a change. The sheriff actually has come in here to some retriggers and backfills just to narrow the regression range and identify a likely culprit. And then the sheriff will file a bug. So we file a bug in in bugzilla. Because we've identified the likely culprit, we'll also need info. They will request further information from the author of that patch so that they can be aware. Looks like there might have been regression. Maybe we need to back this out or maybe we need to fix it. And I'm just highlighting here as well links through to one of our other tools, the performance, sorry, the Firefox profiler. So we provide as much as we can to the engineer so they can confirm. Yes, it looks like it's my patch and also can have a little bit more of a deeper dive into what might have caused it. And then another tool that we have is perf compare. This allows engineers to, if they think they've got a fix or they think they have something that might affect performance, either positively or negatively, they can push that to our CI system, run the tests and see a comparison. And so here this is again that example, Xpedia contentful speed index. This is the before, in this case, the regression and a patch that should fix it. And we can see that the distribution of the results, we've run the tests multiple times. Distribution of the results is smaller. And so it indicates that perhaps this is fixed. And it was. So we also alert on improvements. This is the alert that came in a couple of days probably after the patch landed to fix it or to back this out. I think this change was a change in how aggressively we are garbage collecting. And so yeah, we get this and we can also look at the graph view. We can see the period of time that we had that regression and we can see that it is fixed and it's back to the baseline that we had before. We also capture videos. So again, another tool that is useful for the engineers to confirm. Yes, it looks like there really is a regression. In this case, this is the fix. So this is the slower and improved, the faster. I mentioned the Firefox profiler. I encourage everybody if you don't use it or haven't used it, check it out. Try it. Give us feedback. And finally, I just wanted to promote. Floring Kes is talking in Janssen at 1pm today. That's the main track on Firefox profiling. So you'll see a little bit of example of using the profiler for something other than necessarily web performance, but it's a very versatile tool. That's it.
Understanding how the web browser works, or tracing your way out of (performance) problems
Hello. Can you hear me all right? Welcome to the next installment in your regular scheduled entertainment. I think that's the great benefit of being scheduled after such great talks that everyone is in and with no break you can't leave. So you're kind of stuck with me for the next half an hour. So welcome. I hope you enjoy. So this is a very ambitious talk or at least as title suggests that it is. So let me actually start with talking what it really is about. And this talk is really about me and my experience. And hope, frustrations, aspirations, experience and some illustrations. So I'm Alex. I have been doing web performance, mobile web performance at Google in Chromium for the last eight or so years. So this talk is going to be pretty much about that. This is not going to be a practical talk. So because we don't have time. This is the first reason. The second is that there are way too many rough edges. So I wouldn't recommend this at this point to start to reproduce this. But hopefully this will be a source of inspiration. And those of you who are desperate enough and frustrated enough and have seen the problems outlined in this talk too many times, hopefully they will be brave enough to venture and try. There is a practical guide that I would recommend in the recent web performance calendar by the one and only great Annie Sullivan. I would recommend you go and check it out, preferably after this talk. But you know, I can't prohibit you from doing so. So this talk really is about problem solving and working with complex systems and trying to make sense of them. So the examples will be from Chromium. I will talk about Perfetto. But I think these examples hopefully will be an inspiration for a great variety of projects, both when building your sites and building other and working with other complex systems as well. So let's talk about performance and improvement performance. If you want to improve performance, as I imagine you are not totally adverse to, given that you are in this room, then I want to remind certain trivial things that you probably already know. The first is that performance problems are nasty. They are unpolite and they don't have the common courtesy of locating themselves in a nice isolated area of code in your project that you can master and just work on and don't bother with the rest of the stuff. So because of that, kind of knowing what to improve and where to improve and how to improve takes a substantial effort of the performance work. And the fact that more and more web is mind boggling complex with new APIs, both performance and non-performance, browsers getting more complex and bigger every day, various sites, drawing in diversity and in complexity and libraries and so on. So these all leads to all of us working on performance, spending a lot of time on a regular basis trying to understand what the hell is going on here. And what are the approaches? So the first approach that I have to mention is you can go and read the code. You are a very brave person and I wish you the very best of luck if you decide to do it, but it's not very practical. So modern projects are layers of up and layers of obstructions and then you have a listener and then you have 30 possible callbacks or entry points and then good luck. Usually, I give up at this point when I see like, hey, no, this is probably one of these 30 things. The second one is printf and it's possible variations. So it's console, it's log, it's other, just log it statements. And the second is the buggers. So GDB, LLDB, RR, Chrome DevTools, some of them are better than the others. But all of them, these approaches effectively don't scale to complex systems, especially if you talk about indeterminism when you test sometimes reproduces and the error sometimes reproduces sometimes doesn't then you are in a bit of fun. So when you have multiple processes and multiple architectures, multiple architecture components, then all of these, you know, these tools don't work particularly well. So they focus on low level details. Hey, what is this variable? And most often you want to know, hey, what this component is doing? And am I doing a good job? So enter tracing. How many of you are familiar with tracing in some form or the other? Some of them. So pretty much tracing is structured logging and visualization. I will go into this a little bit more further down the line. But as far as chromium is concerned, from the practical perspective, it means turning these annotations. So here we have a request resource from Java function that is being annotated with tracement macro. So we in C++, we emit some information when we enter this function. And we emit some information that we exit this function when tracing is enabled. And this will allow us to look at this nice timeline. Pretty much the x axis is time advancing in time. And here you can see that we have entered this function here, you have exited this function. And you can see which other functions were called inside of it, how long it took. And you can see zooming out what else the system has been doing across different threads across different processes, which is I think a good starting point and the basic infrastructure talk about. So if you wanted to use it to trade yourself, you can actually go to your IPF at a depth. And the examples in this talk are pretty much all from open Chrome example. So if you have a laptop, then you can go and follow it. Then the links to the slides should be on the FOSDEM site for this talk. But we'll back talking about how to make this useful. So you have this wonderful instrumentation. And this already is, you can use it as a fancy fprintf with search functionality. You can just record a lot of information and then look at it. But this is basically instrumenting the code you're already working on as a fancy fprintf. It's powerful and flexible, but not necessarily most convenient. And it doesn't win either compared to fprintf or debuggers out of the box. For fprintf, the basic debug loop is still faster. You had a single statement, you don't have to bother with opening anything anywhere. You just see the console output and you're done. So it gets less pleasant when you have to do it multiple times. And with debuggers, every all information is present. You can take you a bit of time to find it, but you don't have to bother with adding more annotations, recompiling and wasting time there. So like, and it's unrealistic to have all of the functions instrumented and captured in this race, because it's too much information both to record, which adds all of overhead and slow downs, but also it's a lot of information to go through and looking at it is not pleasant and not fast. So I will talk about finding opportunities for scaling this instrumentation and finding the opportunities where a few instrumentation points can give us a lot of information and substantially advance our ability to reason about what the code is doing. And enter Chrome task schedule. Chromium is implemented based on an event loop model. So we have a bunch of name threads. We have browser process with a browser main thread where which is responsible for coordinating everything. We have the render process with the main thread, which is responsible for running JavaScript, bling DOM and whatever. Or we have worker pulls, sorry, dedicated workers, which sites can create using new worker API. There is a thread pool for miscellaneous background work. And these is pretty much all there is and various place in the code. In the code base, I think we have, you know, a few thousand places. So maybe 10,000 nowadays, which basically post tasks. They get a task runner from somewhere and they post a task. No, the from here macro will talk about this in a second. But otherwise it's just a fancy lambda with some of safety thrown in. And here it is. So you post this task somewhere, you know, some thread or thread pool picks this up and it will run this task. Voila. And this is a great point for tracing instrumentation. And this is a great point to start looking at. So what it gives us, it gives us that we will have pretty much all of the places running Chromium and code. We will know about them and we will have some basic information. Here specifically look at posted from information. This is the result of from here macro expansion, which using some of the C macro tricks, give automatically without any further support, gives us file name on the function. So at least for every function, you have a basic idea of where this information, where this task has been posted from. And you can go to that part of the code base and start understanding what the hell is going on and why this task might be running. So then we can zoom out and we also have instrumentation for post task. And the post task and run tasks conveniently are linked through a flow event. And what this actually means that you, instead of looking at a single task, which might or might not be useful, you can also explore which tasks this came from and which tasks the task it came from came from here because of, I can't really zoom out and I can't really make it interactive. So I can't show the entire to all threads involved. So this is a view of a single thread. But hey, you can see I selected a single task. I can see an incoming flow that is coming from a thread pool. And as you can see that that thread pool task is coming from another task from the main thread. And so you can see that actually all of these smaller tasks running after a larger task, they have been posted from it. And these are, we know, these are related. So this is pretty much a very good starting point, but it doesn't give us everything. So there are a few other chalk points that might be useful to instrument and that have been useful to instrument that can, that improve our ability to reason about what is going on. Task scheduling is inherently inter-intro process. So it doesn't tell us about inter-pros communication, but fortunately in Chromium we have Mojo, which is an IPC subsystem, which we can also instrument and get pretty much the same information. But for cross-pros communication, we can know who is sending messages and we can connect the place which posted the message and places that received the message and to be able to trace this back through the flow. Capturing console logs and DV logs and debug logs and the both logs is also not a great source of information. If someone bothered to log it somewhere in the system, that's probably already useful for us. And being able to correlate this additional source with data with actual tasks that Chromium is using have been proven useful in many investigations. Capturing, instrumenting all of the blink binings and pretty much capturing all of the JavaScript functions, JavaScript calls that end up being implemented in blink is another great way to reason about what is going on and what the website is doing on and a couple of other similar infrastructure pieces. So the key takeaway here is that, hey, if you have complex systems, then probably you would do some good to instrument some of the widely used things and if you are familiar with this codebase, you will be able to make some informed judgment of what is going on and you will be able to spot outliers, something taking too long, log being held in case of performance regression or a functional regression or a flaky test, etc. And that's already a great step forward. So you have, you can look at it, you can like, if your test is flaky, you can run a thousand times, it will fail five times, you can open five phrases, look and see if you're lucky enough, you will be able to spot noticeable difference. But this is still not good enough for me. And the problem is that, despite having visibility into everything we're doing, this is very, very, very expertise intensive. So in order to be able to make good use of it, you have to kind of know everything. You have to know a lot about Chromium architecture. So as some of my colleagues say, you have to have a PhD in tracing and Chromium architecture to truly make this useful. And I have an inspiration of, hey, let's get it to the point that anyone, so any web developer can open and trace and instead of being discouraged and being intimidated by all of this mumbo jumbo, they can learn something about how Chromium actually works and get more knowledge about this. So an inspiration that I have is this slide and this diagram from a life of a delegation, talk from Chromium University, that is kind of similar to what we have already seen. It's a kind of a virtual timeline with a kind of boxers being connected by arrows. But if you look at it, then even if you are not deeply familiar with the browser architecture, then you probably kind of make some sense and you can make some educated guesses of what is going on. For example, if you see network stack doing start URL request as a one off stage, it's something that you can develop or get a reasonably good intuition for. And that's kind of the status quo that we currently have, which is pretty much exactly the same information, but slightly less useful, slightly less easier to read and slightly more intimidating. So for example, you can see tasks, you can see that hey, some of them are related to URL load the client, so you're getting information from the network. Someone, a navigation client, which kind of you know the navigation stack, you kind of guess what it is, but the level of intuitiveness is starkly different. So there are existing examples where we already do this in Chromium and we like take the care to reconstruct the high level events and the high level timeline for specific things. For example, this is an example of event latency. So specifically breaking down the timeline of steps and sequence of steps involved in presenting a frame. We're doing great on time. So the downside is that it's plumbing is very expensive and scaling this up is very difficult. When you have a big project, you have information, you need information from different you know corners of this project and plumbing is very expensive, both in terms of serialization costs, in terms of layering concerns, in terms of the amount of plumbing code that you need to maintain. And this you know difficult to scale and you know we haven't implemented this for too many exciting things. So let me talk about Perfetto a little bit. So Perfetto is the new generation tracing framework born from the ashes of Chromium tracing by a few great folks who have been working on Chromium tracing, got fed up with it, learned all of the mistakes that happened there and all of the things that we should shouldn't have done in the first place. And Voila Perfetto, which is nowadays widely used for Chromium and Android tracing. So it has fancy new UI, it has more efficient format, but the thing that brings a special place in my heart for it is the new SQL data model and query engine. So essentially everything that you can see in the UI is backed by a data model and UI is just running queries in the data model, against this data model and presenting it. And presenting it and you can very easily do it yourself. And you know we trace processor actually is compiled as a was module and running in your browser in a background thread. Voila web, we have gone, we've came very far. And this allows us to separate recording the trace and emitting the low level instrumentation and actually analyzing it and building high level data models. So this is probably the best example of Perfetto powers. I could hit it in a single slide. You can replicate this yourself if you go to Perfetto, if you go to open the Chrome trace, if you type colon into the search box, you will enter the SQL query mode and then you can copy and paste the query that I inserted there. Once again, you should have access to the slides and then it will pretty much give you the list of top 100 longest tasks that we ended up running there, which is already useful for analysis and can allow you to build more and more complex data models through different tables within SQL, which is kind of cool. So what are the next steps here? So I am right now trying to build in a navigation instrumentation, fancy navigation instrumentation as a proof of concept. The current prototype is kind of there, so you can see that we have a timeline. This is all pretty much based on the same low level information, but presents it in a more fancy version. And this then can be further integrated with the documentation. So this is just a not standalone box with a couple of words scribble on it, but we can also link to parts of Chromium documentation that outline what this stage is actually about, what are the concepts that you need to think about and make it generally more useful. One of the major complexities, why we haven't done this before, is that the complexity in a number of corner cases. When you talk about navigation, when you talk about typing the URL into the OmniBox, there are like a mind boggling complex number of cases from redirects to navigation, turning it to downloads to server returning to O4 and canceling the navigation that you kind of need to think about. And building this instrumentation without being able to test it is kind of a losing game. And the Sequel support actually allows us to feasibly write this testing coverage for these corner cases. I think I'm 15 out of 50 at this point, so some work to do. So yeah, I think that's all of the main content that I have. I have a bonus demo, which is kind of about DevTools, but I can also take questions. Eyeballing is, so the question is what's the best way of comparing these traces? I think eyeballing is probably a good place to start. So there are some early experiments of opening the traces times, say to some it and being able to link the timelines, but this is greatly depends on what kind of problem you're looking at. For example, if you are comparing the traces from tests, then the workload is more repeatable and you can actually go further in comparing it. For example, writing some Sequel queries and instructing some high-level metrics can get you very, very far and spotting if any high-level metrics changed or any derived things changed. If it's user's interactive, then probably eyeballing and going from there and seeing how much variance there is. Yes. Yes, some Sequel statement there. Yes, great question. The question is we have Sequel, but where is the database? The answer is it's all done locally. So this is Sequelite compiled into a WOS module with some helpers on top. So when you're opening a trace, it's running in your background thread in Sequelite instance. Everything is local. More questions? If not, I can actually go and show you my favorite House Party trick, which is an illustration of why it's actually quite important, I think, to think about data presentation. Sorry? More questions? No. Let me try to do this. Can you see what's going on? So let me open a trace in the performance in ground deftos that I have recorded earlier this morning. And this is something that you should be already familiar with, but the thing that some of you might not have realized, that there is nothing inherently magical or special about these deftal traces, apart from very good UI and a lot of UX thoughts that went into that. But fundamentally, they are just JSON-chrome traces, just with a bunch of categories. And you can actually open the very same information in Perfetto and actually look at it. And you already can see that the usefulness of this information is a bit different. We have to zoom and find our relevant parts, and we have been exposed with a low-level information, but no high-level insights. But then we have the network tracing. And the best way to illustrate that, further, is look at one of the network requests. Let me... Not this one. I want to find a network request from the deftools with the URL. And you can see that it... Some high-level stats, and you can see where it fits with other stuff. Psyllium shots also help. But then I can search by this URL, and I can also find the request ID and find all of the events, the low-level events that Chrome is tracing has actually meted. So all of the information about this network request is there. So if you can be bothered, you can actually go and correlate and go to all of these specific events and correlate them and reconstruct the same level, high-level takeaways. But it's going to be a little bit slower, a little bit less useful, and you won't actually be using it that much yourself, probably. So, yeah.
Fast JavaScript with Data-Oriented Design
Hello everyone. My name is Marcus. I would like to share some lessons that I learned while working on the Firefox Profiler. So yeah, I work at Mozilla. I'm on the Firefox Performance Team and I work on Firefox itself and on the Profiler and I also have my own Profiler called Sample which you can use on macOS and Linux to profile your own applications. And I'll give you a brief overview of the Profiler. So this is what it looks like when you have a profile loaded. You have a timeline at the top. You have a call tree and you have a sidebar here in the call tree. It's very, very small down there. I'll zoom in a little bit. In the call tree you can see which function called which function. You can see how much time each function spent running. So let's say this function here dispatch event. Firefox Profiler is a sampling Profiler. So it interrupts the thread at a given interval like usually one millisecond every one millisecond. It checks what's on the stack and then it accumulates that into this call tree. So one thing that I want to call out here is the category breakdown in the sidebar. So here we have a bunch of categories. User is just regular code. Ion here means this is JavaScript code that was jitted into the IonMonkey subengine of our JavaScript engine. And yeah, there are a bunch of other categories. And you can select in the timeline and as you draw your selection, oops, as you draw your selection, the category breakdown updates in the sidebar. So we can also zoom in, see something a little smaller. So here we have more code in Ion, more code in the user category. It also has a flame graph. So zoom back out. And flame graph, you're probably familiar with flame graphs. They're a different representation of the same information. Like you have a call tree, you have nested functions, the width of each box is the time that is spent in that function. And we also have a tooltip here in the call tree, which again gives us a category breakdown. I'm emphasizing the category breakdown so much because we're going to implement our own profiler in a minute, which, and we're going to focus on calculating this category breakdown. So here we see it's a bit sluggish as you move the mouse around, because it actually needs to iterate over all the samples in the timeline. It checks for every sample is the stack inside of the function that you're hovering. If so, check the category that the CPU is spending its time on for that sample, accumulate that into a map of categories to counts, and yeah, do that for all the samples. And we can see here at the root node, we have about 500,000 samples in this profile. So what I didn't tell you is this is actually the version from last July. And I fixed this performance problem here. So this is what's live now on profile.farfax.com. Hovering these boxes is no instant. And it's still doing the work. It's still going through all 500,000 samples every time you move your mouse. So I want to talk a bit about how we can crunch through all that data in in very short time. Wrong direction. So yeah, even with lots of samples, we can now have a fast UI. And I made an example project just for this talk called mini profiler. It is on GitHub. It's also live on Netlify. You can try it out in your browser if you want to. And this is what it looks like. It has a very reduced feature set, but it also has this timeline. You can select parts of the timeline and it calculates this category breakdown. So yeah, let's say here, we spent 30% in Ion Monkey Jitter JavaScript code. At the same time, it also calculates the heaviest stack. The heaviest stack is the stack that we spend the most samples in. All right. So yeah, mini profiler features. There's only two features. You select the range, and it gives you a category breakdown and a heaviest stack. So how does it calculate that? We have an input JSON, which describes the profile contents. The JSON is a list of samples. Every sample has a time, a weight, and a stack. Every stack is an array of frames. Every frame has a name and a category. I'll show you that here in an example. So as I said, a list of samples, every sample has a time property, a stack property, a weight property. The stack is an array. Each stack frame has a name and a category. To calculate the category breakdown, we take in the profile. We take in a range of the indexes of the samples that are in the selection. Then we iterate over this range. We get each sample. Whoops. We get its stack and its weight. We get the top frame from the stack. We get the category from the frame. And then we check. Does our map have this category already? If so, get the current value. Otherwise, default to zero. We add the weight of the current sample. We put the sum back into the map. And then this map is what gets used by this spelt component. For the heaviest stack, it's somewhat similar. We again iterate over all the samples in the selected range. For each sample, we again get the stack and the weight. And now we need to check if this stack has been used by multiple samples. And how do we find two samples with the same stack? Well, the stack is an array, and you can't really check them for equality easily. So what I'm doing here is I'm stringifying the stack into a JSON string. I'm using that as the map key. And then here is a similar issue to what we had with the category breakdown. We check. Do we have an entry in the map for this stack? If so, take its current value. Otherwise, default to zero. Add the weight. Put that back into the map. And if this stack is the heaviest that we've seen so far, we remember it, and then at the end, we return it. So these are the two main algorithms in this mini-profiler. Category breakdown, heaviest stack. Both of them have to iterate over all the samples. So how fast is it? So if I select here, it's reasonably fast. If I make the selection bigger, it starts getting a little janky. I'm computing some throughputs down here. So 100 nanoseconds per sample is how long the algorithm for the category breakdown takes. And 30,000 something nanoseconds per sample for computing the heaviest stack. Because, yeah, we saw the heaviest stack algorithm, it was really inefficient. It used JSON stringify. It looked up this gigantic string in a map. It needs to hash the entire big string and so on. So this is obviously not the way to go. But this is just a place to start so that we understand what's going on. So this is the throughput here. The nanoseconds per sample might not tell you much. But what you can think about is, how does it limit the size that you can handle while still being responsive? So let's say you have 100,000 samples. In this example here, we just had 1,600 something samples. What if you have 100,000? Then you get 10 milliseconds for computing the category breakdown and 3.6 seconds for computing the heaviest stack. 3.6 seconds per update, that's not acceptable. So we need to do something. And also the JSON file, because it has all those repeated stacks, it's just massive. So let's use a different format, different JSON format. Here I made a V2 format. It still has samples, but instead of having the stacks right in the sample, it just has an index. And this index now goes into a stack list. Each element in the stack list has a frame index, which goes into the frame list. Each frame has a name and a category index, which goes into the category list. So I hope that's not too overwhelming. We just have a bunch of indexes now. Instead of nested objects, we just have some side-by-side lists and we index into them. And also here the stacks are a little special because of this parent stack index here. So if, for example, a sample refers to stack number two, then this is the frame at the top of the stack. Then we go to the parent stack, find this frame, that's the next frame on the stack, find this stack, put this frame on the stack, and then the parent here is null. So that means we're at the end of the stack. Hope I haven't lost anyone yet. So let's go back to the compute-heavy stack algorithm. So we were iterating over the samples. We were stringifying the stack arrays and we were checking the JSON string. Now we don't need to do that anymore. Now we have an index. If two samples have the same stack index, that means they have the same stack. So we just use the stack index now here and we don't need the JSONification. We don't look up big strings. And this is like a massive performance improvement. So 300 times faster. The category breakdown is also affected by the new format changes. So now instead of getting the stack and the frame directly from inside the sample, we instead get a stack index. We look up the stack index in the stack array, which gives us a frame index. We look that up again, get the category index, look that up again, get a category name. This is a string. Put that in the map or add up the weight. This string here, this is kind of unnecessary. We know if two samples have the same category index, we can use that as the key. So I made an optimization here to remove this name lookup. And now we're just accumulating category index weights in this map here. There needs to be some process, post-processing afterwards to make sure we get these names here in the category breakdown again. But that's outside of our algorithm. All right. So here I had selected the small profile for the format version one. Let's switch to the same profile in format version two and do the selection again. And now we can see, we can select to the full width and it's still very responsive. So here's our throughputs. So how fast is it now? 47.1 nanoseconds per sample for the category breakdown is what I measured, 51 for the heaviest stack. Okay. So that's much better. Let's see how far we can go. We want to see if there's more we can do here. So we use a profiler. I am going to start the profiler. Oh, what I didn't show you is how to use the profiler. Well, let me do that really quick. So if you use Firefox and you go to profiler.firefox.com, you can click this big button here, which gives you a toolbar button. And then if you click that toolbar button, it starts recording. So let's record our current implementation. Do a little bit of this, capture a profile and see where the time is spent. Well, where is the time spent? One second. Let's try that again. Let me refresh this page. Ah, I can tell you for this time spent. It is so fast that it barely shows up in the profiler because we are still using the small profile size. So let's do that again. Capture profile. The local host here, there's barely any CPU usage. You would see more yellow in here. So let's switch to a bigger profile. We still have just the 1600 samples. Let's switch to the medium profile. So here, yeah, it still works okay. It gets a little bit janky towards the edge here. So again, we're going to start the profiler, select, play around a little bit so that we get lots of samples. Capture the profile. And there we go. This is what I was expecting. So now we have lots of yellow in here. I'm going to show just this thread. I'm going to switch to JavaScript only. I'm going to switch to the flame graph. And now what we can see here is we are spending time in compute category breakdown with string key map and compute heaviest stack with map. And what we see here is that we are spending some time in map.prototype.set, both over here and over there. That makes sense. We're assigning things to a map. So can we not use a map? Wrong direction here. So we're seeing the time in map prototype set. We have the map here. For the category breakdown computation, we're getting the category index out and putting it back in. But we know these are integers. They're integers into the category list. The category list doesn't have lots of elements. We can just use an array here instead. I'm going to use a float 64 array here. Because the weights are floats, using a typed array means I know that the maximum number of elements is already preallocated. It's initialized to zero. I don't need to check if there's something in it already. I know that it starts with zero. I can just add the weight. And that's it. We can do the same modification to getting the heavier stack, the seriously compute heavy stack algorithm. It was also using a map. We can use a float 64 array because we know how many stacks there are. Here the key is the index into the stacks array. We use that key as our index into the map array. And then it should work as before. And what we see down here, it is three times faster to skip the map to use a typed array instead. Let's try that out. Here I'm going to switch from the basic implementation to the integer keys for category breakdown. No, sorry, to the typed arrays instead of maps implementation. And now I'm going to select, and it's very smooth through the entire profile. And we have 500,000 samples now here. And we are still responsive. And let's see if we get an even bigger profile. This one here has two million samples. How responsive are we? It's okay. It gets a little janky towards the end here. It's mostly okay. So where are we now? Let's just take some, take some recap. We've addressed the obvious load ons. We've done what the profile told us. We fixed the hotspots. We changed the format so that comparing stacks is cheap, we changed two maps into typed arrays. Got us a 3x perf boost. In the heaviest stack case, the map or the amount of memory we're using might be a bit bigger now because we're allocating an array where we have an element for every single stack index, even if no sample references that stack index. So maybe some extra memory, but we have a performance boost. And so we have the throughput here. Yeah. So for the medium profile, our throughput is like 16 nanoseconds. Or let's see, sometimes it goes up and down a little bit. Yeah, let's say 16 nanoseconds for the category break down, 40 nanoseconds for the heaviest stack. I was seeing some other numbers when I was trying this at home. So it's pretty impressive. Modern computers are pretty fast, but maybe we can do even better. So let's try better. Let's go back to the category breakdown algorithm. We are taking these two values out of every sample. The sample is an object. It has three properties. We're ignoring the time property. We're getting these two properties out. So what does that mean at a byte level? So how are arrays of objects stored in memory? Well, it depends a little bit on which JS engine you're using, how you're allocating the object, if you happen to be on a fast path or not. But in SpiderMonkey, this is what you might expect. So we have a samples array, which is backed just by a list of pointers. Every pointer takes up 8 bytes on a 64-bit system, and it points to a JS object. So let's say here, the first entry in our samples array points to this JS object here. The JS object starts with a header. SpiderMonkey takes up 24 bytes on a 64-bit machine. Then if we're lucky, we have the fields inline just after the header. We might not be lucky, but let's say we're lucky. We might also have a bit of padding here at the end, because the inline slots might be only sized to four or eight, and we're using three properties here, so there might be a bit of extra memory used up by that. So this is just one representation that we could have. It varies a lot by engine. For example, Chrome has pointer compression, so these things here might be four bytes each, but then the time field might be an extra pointer, because in Chrome, sometimes the floating point values are a separate heap allocation. The padding could vary, the object header size could vary. These fields here could be behind another pointer if they're stored out of line, and so on. But anyway, what it comes down to is we wanted to get these two fields here, 16 bytes in total, but what we ended up with is all of these other not-so-useful bytes clogging up our cache. So when the CPU wants to get those bytes, it gets them in 64-bit chunks. Cache line is 64 bytes. So if you're getting this value here, you're getting the other bytes that are in the vicinity, even if you don't need them. Well, here we do need the JS object header, because the JIT needs to check that the object is of the right shape, and so on. But we really just want those values here. So can we do anything about that? We want to improve our cache line utilization, and we want to reduce the indirection. Maybe we can. Let's do something radical. Let's turn everything on the side. So we have this array of objects. What we could do instead is to have an object of arrays, or struct of arrays, where we have a column, or where we have just one key for the time column with a big array that has just the time values, one for the stack index, just the stack index values, the weight, just the weight values, and a length stored on the side. These arrays must all have the same length. So now everything's backwards. If we want to access the weight in the past, we had samples i.weight. Now it looks a bit weird, because we have the sample table.weight column, and then we get the ith element of that. But let's do it. Let's see where it goes. And so what we end up with here is a new profile format again. Now we have a sample table, a stack table, a frame table. The calories are still a list, because it's just some strings. And same thing as before, the stack index goes into the stack table, the frame index goes into the frame table. We just need to access the properties differently. So what does it do for the computation of the heavier stack? Here we were getting the stack index and the weight property from an object. Now we just get them from separate columns. And already we're seeing a 2x performance improvement. For the category breakdown, similar story. Instead of getting the properties from objects, we get the column first, access the ith element, and get that. This here is even faster, like 3.5x faster. Let's see that in practice. So we're switching to format v3 now, struct of arrays. Let's get the medium, medium sized profile. And now it just flies. It's just responsive all the way. 4.5 nanoseconds per sample, that's really not a lot of time. This is super fast now. Let's get an even bigger profile. Still super responsive. So when we think about the memory model, or the memory, how it is represented in memory again. We're accessing these columns now. We're accessing them in order. And what happens is that our cache lines are now fully utilized. We don't have object headers clogging up our cache anymore. We just have the numbers that we wanted. But yeah, it's just super efficient now. We get all the stack indexes, we got all the weights. The time column is now pretty much irrelevant. It was clogging up our cache before, but now we're not accessing the time column at all. So it just doesn't bother us anymore. Okay, so let's recap quickly. We have a struct of arrays. Some people call it parallel arrays, commonly used in game engines, databases, and so on. It has a few drawbacks. It looks a bit backwards if you read it. Sometimes when you want to pass around an object, you need to manually materialize it because you don't just want to pass around an index. But it also means that the type system, at least in TypeScript, is now less of a help. We can introduce mistakes that it wouldn't catch. So for example, if we build up our arrays and we end up not putting our values in every one of the columns, we end up with mismatched lengths, and that is hard to catch at the type level. Also, when we pass around indexes, sometimes, yeah, you get a number, you don't really know, is this an index into the stack table, into the frame table? I don't know. The type system, at least in TypeScript, I don't think is well set up to catch these kinds of mistakes. But it's much more cache efficient. It's easier on the garbage collector. You need to traverse fewer objects. Some engines skip the contents of arrays and numbers, so it should speed up that too. Less memory overhead from object headers and padding. And we can just treat columns separately. Like sometimes we want to make a change to one column. Let's say we want to shift the entire profile by some time delta. We can change just the time column. The other columns stay untouched. We don't need to recreate any objects. And it also gives us a little more control over sizes and how compact our integers or our numbers are stored. We can pick with our typed array. We could pick an int32 array. We could pick an int16 array. If we know what the domain of our values are, we can store things more compactly and we get back in control of the representation. Okay. I want to make it even faster. So if we look back at our category breakdown, we're getting the stack index, we're getting the frame index, but it's all just to look up the category from the frame table. We're not really interested in the stack of the frame. We just want the category for a sample table, for a sample. So what if we just got the categories for each sample and use that instead of here, stack, frame, category, just go category, boom. Well, it would be great if we had this column. Where does it come from? Well, we can compute it here with the get sample categories method. We iterate over all the samples. We do the stack frame category conversion here. We cache that in the sample categories column. We pass that to our existing function, but we only want to do this once, not on every call. So we need to cache it somewhere. We can use memorization for that. So here's a memorized call. We get the profile. We only run this once. So if we call this multiple times with the same profile, let's say our profile is immutable, we have it cached from last time. And we can make the caching even more precise. If we memorize a function which takes just the columns that we need, then we get it. We wrap this into the existing get sample categories function, which takes the profile, but then it takes out the categories. Sorry, it takes out the columns we want, passes those separately to the memorized function, and that makes the caching even more or even tighter. If you touch a column that is not involved, you don't invalidate your cache. And did it work? Yes, it did. Oops, wrong direction again. Memorized sample categories. We're now down to three nanoseconds. So I'm basically done with the talk. Let's just look at the graph here at the end. This V1 graph is off the charts like this. It's way higher than this. But we made it faster with every change here. And this last step of caching the sample, the categories for each sample, it looks like it's not much, like 25% on these nanoseconds. But what it actually means is we can handle more data. We can handle a higher count of samples in, let's say, a 16 millisecond interval. And like 25% more data, that's massive. Okay, I want to say really quick, what is data-oriented design? It's a mindset and it's a collection of techniques. The main technique here is structure of arrays. The mindset is more about how you think about it. The shape of the data determines the algorithm and its performance. You need to know which things are small, which things are big. We might have seven elements in this array and 100,000 in that array. If you keep that in mind, you're better set up to write fast code. And if you also think about cache line utilization, you're even better set up. The rest is not that important. Thanks, everyone. You can find me in the Firefox Profiler channel. You can check the Firefox Profiler online. Happy profiling!
From Google AdSense to FOSS: Lightning-fast privacy-friendly banners
Good morning. I'm Tim. I work as a performance specialist at Akamai, but that's not what my talk is about. And everybody here in this room has two things in common. I assume. First, we love web performance. And two, how many of you think already about food? Because I'm starving. And actually, if you don't know what to eat in the next days, when you're here in Belgium, there is this waffle burger at a Belgian restaurant called Quick. And if we are performance focused, Quick is also a nice way to get there. Now, next to my day job at Akamai, I also run the largest scale modeling websites in the world. With 50,000 visitors a day and 6 million pay juice a month. It's a bit too big for the talk of Tvetan earlier today. And it's not only the largest scale modeling website in the world. It's also the fastest one. And this, thank you. Thank you. Thank you. And this, despite the fact that I run banner advertising, because normally banners means slow, slow, slow and annoying for your end users. And this talk is all about how I switch to an open source ad server solution in order to give my users a better privacy. And then also because I love performance, make sure that the performance is lightning fast. Who remembers this day? One. Yes. GDPR. Correct. This was the this was when GDPR almost six years ago was introduced. And if we travel back in times to six years ago, my website back then I used Google AdSense and a few other ad serving solutions. And what is great about these solutions, you can just add some JavaScript on your website, and you start earning money. That's it. Now, the problem is that when you then you look at your waterfall that you see all these extra requests to third parties, third parties calling third parties, fonts are downloaded, CSS, JavaScript, cookies are set, tracking cookies, a lot of stuff happens. And this is a tool by my ex colleague Simon Herne, request map that shows you the blue bar, the blue circle at the bottom is the actual website. And then you have all these spiders crawling off additional requests going to additional things. And from a privacy perspective, this is not ideal. And this is all you need to do to create a nice banner of in this case, a hamburger. Now, when I started, this was how my website looked, I was basically chillax. This was just how the web worked. This was the only way there was no different way. This was just how the web worked. Now, in April 2018, one month before GTPR, I was like a little bit in panic. I was hoping that the ad providers, not only Google AdSense, but all the others would come up with a privacy friendly version for Europe, and would therefore also make the websites faster. And in April, nothing was moving. So I looked for a plan B. And luckily, I was able to find a plan B, which was open source. So revive open source ad server. And why did I pick it? It was PHP based. My website was PHP. So it's good. It was already five years old. So it was like not brand new. So it was already proven. And it had fairly stable releases at a regular time. Today, this open source project is maintained by the Aqua platform by Eric and his team. They also run a, of course, paid hosted version of the solution. But I use the download free version. So what can you do very quickly? Everything you expect from an ad server. You can manage your campaigns. People can sign up for to start adding ads on your website. Basically, everything which is needed to send to serve ads on your website isn't there. And this is the result. Remember before that spider going everywhere, everything hosted on the same domain. So from my privacy perspective, I was back in chill X mode. Now, let's talk about performance. Just by implementing the open source solution on my own systems also gave me some performance gains by design. And the first is here revive itself does not require all these requests. So that's the first thing. But as you can see, what is missing here are things like DNS lookups, TCP connections as a cell handshakes in order to talk to different systems. So that basically means that everything which is needed to serve that hamburger banner as soon as possible, it's not delayed, which is good. The other benefit is we already talked about IMP before and JavaScript performance. The library broadly compressed only 1.7 kilobytes. And typically the more bytes you ship and JavaScript bytes you ship the less good for things like IMP first input delay total blocking time. So it's a fairly small library. Other things I work for a CDN so I can run my website on the CDN. So also use the image optimization services to make sure that I return modern formats like AVIF or web P, et cetera. And then finally, last but not least, the fact that everything is under my control also means that I fully control priorities things like fetch priority high, fetch priority low, preload the order in the page. I fully control the order of things and I decide do I want the banner to be served first or do I want the actual content to be served to be served first. This of course assumes that your web server or your CDN listens to the priorities. Now, this was the basics. Just setting up revive great for performance, great for privacy. Now, good is not good enough. And in order to get these very, very good results, you still need to do a little bit more. So let me explain you that. So we'll first look at LCP or largest contentful paint. Just as an example, what is here the LCP element on this page should be fairly obvious. It's the largest image on the screen, which is that nice helicopter, which I'm currently building. Now, that's easy. Second one. What is the largest contentful paint element here? Sorry, it's early and I'm hungry. It's actually as expected, the top one, because that's the biggest image. Now, this is not what my users perceive as the LCP element, because they come for that small picture of the car. Now, what is the problem? This image is late discovered. It's first needs JavaScript to run, then it needs to do a request to a PHP server to know which ad to serve. And then only the image will download. So it's late discovered, and it means it will come in of potentially a few seconds later. So what is the best solution? Just send more bytes. So my website is driven by a lot of contributors. So when somebody uploads a smaller image, I basically nudge some other people like, Hey, do you have a bigger image of this one? So my LCP gets better. Not only my LCP, people also like to watch nicer pictures as well. Now, that's a plan A. That's the best. Now, I'm not sure, not always sure if it actually happens. So sometimes I do have pages where the images are too small compared to the banner. And what is my plan B? I call that a fast fallback banner. And it's exactly what it's doing. It's fast. And it's a fallback. So in order to make it fast, you need to make sure that it's early discovered. So it's just it becomes a standard image tag. So basically my PHP code, I check, Hey, when I generate this page, I already know the size of the image I will embed. Rather than using the JavaScript based version, which is slow, I fall back to a default image variant. The downside is that from an ad perspective, I can no longer track revenue. I can no longer know exactly which banner should be targeted. Yes or no. So I lose some functionality. But typically on a website, you don't always sell all your potential banner locations. So you anyway have some, for example, I sponsor certain scale modeling events, or I have some coffee mugs of my website with internal banners. So I can basically perfectly display these non revenue generating banners, but keep performance. And here is then what you see is request number four is the LCP element, which is requested quite soon rather than somewhere at the end. That was for LCP, making sure that it's green under all conditions. Next is CLS, CLS cumulative layout shift. And this is something everybody knows, typically on newspaper websites, you're looking at a page, you're, you're reading the content and then suddenly bam, everything goes down because the banners start loading. Now the solution for this is quite simple. Just add a class, add a placeholder that the browser while rendering and painting everything on the screen makes room for them already reserves the room for that nothing special. Now, unfortunately, this was not good enough. Why not? Because in add systems, and in all add systems, you can do you have basically the choice between user experience and making more experience and making more money. And the top one is the fixed zone. You basically say, hey, in this location, when it's a fixed zone, I only want to show banners which are this size 300 times 250 pixels. Now, you can also have flexible zones. You hear I can define, you know what I have my design allows 300 pixels wide, but I can show bigger banners, smaller banners, a variety of things. From a money perspective, this is better. Why the bigger the pool of ads you can potentially serve to your users, typically the more money you make. The top one is better for end user experience because you know, hey, my placeholder is always this, which one did I implement? Of course, the top one. Now, a new problem arrived. Page is rendered. You see the nice placeholders. And then suddenly this happens. Watch carefully. Everything moves to the top. Which same browser would do this? Safari, Chrome, Firefox. All browsers are same. However, ad blockers are not always same. Ad blockers assume and assume that when you have advertising, they try to remove everything. So what happens is they detect the ads on my website. Although they're privacy friendly, although they're fast, they get removed. And you have this shift. So how do you solve that? Not blocking the ad blockers. If my users want ad blockers, that's fine. That's okay. If they are free to use that. This solution is to add an additional container around your ad. So this is the EINS. That's the ad. Make sure that the container has the placeholders. And then when the ad blocker arrives and deletes the ad, the container is left. So no layout shifts. And this is really my mantra. CLS should really be reduced to zero. Every single pixel which moves is in my view a bad thing, an annoying thing. And CLS should really be reduced to zero. So we covered privacy. We covered performance. Now let's look at the revenue perspective. Because in the end, the revenue part is I need the money to fund the hundreds of dollars which are paid every month on the server cost. And when I started, it was easier. I used ad sanctions. task steamed student stock of a sort of So, banners does not mean bad. You can implement it in a positive way. If you have full control with open source, you are perfectly able to do that. And it's also possible to make that lightning fast. Now, I didn't get any money for this. So, I'm really dreaming already about this burger later on today. There is just one small problem. It's Robin. Robin, do your hand up. Robin is the next speaker. Robin is my colleague. And we also call him Mr. Quick, so the Quick restaurant, but he normally works on the HTTP protocol and he hates it when I call him Mr. Quick with the K. He stands between us. So, Robin, please talk fast so we can all go have a great lunch. Are there any questions? Yes. Thank you. So, I've heard that your scale mates is very popular in various continents. So, for the answer, do you need to get practical somewhere in the continent or they're all like... Yeah, okay. Yeah. So, the question is, so that scale mates, my website is visited by people across the globe, here in Belgium, in Australia, Japan, Brazil, everywhere the globe. And the question was, if I need to have a replicated setup, so I use a CDN that gives you a replication for static content images, etc. So, that's a given. But I actually also replicate my servers across the globe, not all. I have, for example, servers in Australia, in Japan, to make sure that when a user does a database call or does a search, that they get an instant response. Thank you for the question. We have a few more minutes, I think, for questions. Two minutes. Otherwise, two minutes. Any additional questions? Yes, they're true. Yes, but the... Yeah, great question. So, the question is, in a revive, which ad providers can I introduce? In theory, in revive, you could also make a non-privacy friendly version, because you can also say, hey, in case I don't have any direct inventory, let's say, for example, with a scale modeling company, you can also decide to fall back to, for example, adsense or anything else. And it's just the only thing you need to do is add that their JavaScript and your advertiser code, so in theory, you could integrate any SSP. But then you're back into the same game, then you're... You have a performance impact and privacy impact. So, revive allows you, potentially, to do everything. Does that answer your question? Thank you. There was one question in the back as well. Yeah. So, the question is, which frameworks did I use or modules did I use to build a website or just for the advertising? Everything builds myself from scratch. Yes. Yes, I... Yeah. Yeah. Everything builds from scratch. The only thing I used was jQuery and I still use jQuery on some admin sites of the thing. But yeah, saying jQuery in 2024 is not cool, but I'm okay with that. Now, everything, yeah, PHP built from scratch. Thanks for the question. Any additional questions? Robin can maybe already come up as the next speaker. There was one question in the back. Yes. You can already switch my laptop, Robin. Yes. Just to look at it here, you are negotiating with these advertisers directly or...? Yes, correct. Yeah. So, the question is... Sorry? They call some people or you can go to them? Yeah. So, the question is, how do I get in touch with these advertisers? Because before, any banner would just show up. So, Revive also has an API. So, on my website, I basically have what you can sign up. You can create an account and register as a business account. And then I have a simplified interface where you can just upload the banners and you can ask, hey, is it for all scales or for specific scales? Are you targeting all scale models or just the aircraft ones or the shipbuilders? So, I have a simplified interface and they just sign up themselves. Thank you for that question. Yes. So, the whole question, have you had to deal with bad ads and bad actors? Yes and no. So, the question was, do I have to deal with bad ads and bad actors? Because I also have a shop database. And I basically already have a database of domains which are from scale modeling companies. So, when somebody signs up with a Revell, which is a brand, the e-domain, I basically know that it's linked to a, I basically can give them some confidence. If I'm unsure, they can already start creating their campaigns, but I can't, I still need to enable them before they're actually published. And people can't add JavaScript on the website. And so, in Revive, you can add JavaScript banners, but I blocked that because JavaScript is bad for performance. Does that answer your question? Thank you. Thank you very much and have a great lunch. Thank you.
Insights from the RUM Archive
Oh, last talk of this session is about the RUM archive, which is a data set of anonymized real user monitoring measurements. Now I know what some of you are thinking, Robin, if it's a data set, why does it have a palm tree in the logo? That doesn't make any sense. But think about it for a second. What happens if you go to a palm tree and you shake it? Something interesting might fall out, like a coconut. And the same thing happens with the RUM archive. If you shake it a little bit, something interesting might fall out. But for both, you need to be a little bit careful. Because if you're not, the coconut might fall straight on your face, leaving you scarred for life. So we need to be a little bit cautious in how we query the RUM archive, and we'll get to that later. The first thing I want to explain is what is actually in there. How do we get the coconuts in there? So currently, all the data is from the Akamai Ampulse products, which basically means we have a lot of Akamai customers that have Ampulse, and they let us put a piece of JavaScript on each of their pages. So every time a page is loaded, we send what is called a beacon, which contains all the performance measurements and a lot of other metadata for later analysis. Now, usually, our customers only see their own data, obviously. And here, we want to make this more publicly available. So we have to do a couple of things first. First of all, we filter the data. We only do the top 100 customers in terms of traffic. We anonymize the data. This includes stripping all of the URLs. So you won't know which measurement belongs to which site, which is a sad but necessary operation that we have to do. And then we further aggregate the data so that many similar measurements are actually combined into a single histogram for later analysis. This gives us two data sets. One, of course, for the page loads, and then one for third-party resources. These will be things like Google Analytics that are loaded from external URLs by many different customers. And so we can also offer some insights on that. We have most of the performance metrics you would expect, including some others, like RageClicks. This is if people got very frustrated. They start clicking the same area of the screen trying to make it work. For the third-party resources, we can also show if they were loaded from the cache or not. Very interesting. But crucially, one of the things we try to make the difference in is that we collect data from all the different browsers on all the different platforms. And you can, of course, also query on those from the data set as well. Now, you might be thinking, Robin, sounds fine, but don't we already have this from other public data sets? And partially, yes, this is true. We are blessed with very good web and web performance data sets, but we still feel that there are some gaps in there, gaps that we hope that the RUM archive might help fill, especially when it comes to things like cross-browser and real-user monitoring data. So let's say you're interested now and you say, how do I actually get access to this data? The main way is through Google BigQuery, where most of the data is stored. BigQuery is a very powerful, very flexible platform. It's sadly not the cheapest. It does cost you a bit of money. And even if you're willing to pay, it can take a while until you get useful data out of this, which is something a colleague of the Mozillaeans here today noticed a while ago. The reasoning was sound. They were trying to look for user agent Firefox on device mobile, expecting to get Firefox mobile data, obviously. It doesn't actually work, because in the RUM archive, Firefox is really just Firefox desktop. If you want mobile, you need Firefox mobile for Android and Firefox iOS for iOS. This is because we at the RUM archive put stock in consistency above all things. Now, especially for newer users, going to BigQuery directly is sometimes a bit of a big hurdle. So we also have a cheaper way, which we call the RUM insights. This is basically the team saying, OK, this is what we think most people will want to know about this data. We do the queries, and then we have some ready-made visualizations on the website for those as well. They also do the access. Sadly, though, even the RUM insights don't really help much for the Firefox mobile use case. As you can see, Firefox in our data set is definitely present on the desktop side. On the mobile side, none of the variants actually hit the 1% cutoff that we put for generating these diagrams. This is one of the many insights we can get from this data set, of course. Because having a nice coconut is of course all nice and dandy. You can't really do much with that, right? What you really want is you want to get to the juicy inside of the coconut, in this case the coconut milk. Now, I can hear some of you thinking, you think Robin, there is no such thing as coconut milk. Okay? Coconuts cannot be milked. They do not have nibbles. And you would be correct for the latter part, of course. But there are still ways to get milk out of this. You know, you could hit it with a machete. Or if you're a bit more sophisticated, you could hammer a screwdriver into these black spots there. You could still get something out of there. The point is, there are many different ways of getting the milky insights out of the data nut. But they don't all give you the same results. And a good example of this, I found when I first started querying the room archive, I just wanted to know, you know, roughly mobile versus desktop. What are we dealing with here? And when I plotted that out, I actually saw this weird periodic pattern. You have these bumps and valleys in there, which seem to suggest that people switch the type of device they use every three months, which of course makes no sense. Okay? And anyone who's ever done this kind of analysis already knows what this is. You know, this is of course just a bit of temporal interference. Because what I did not want to do was have a separate data point for each and every day that would be way too expensive in Big Merry, right? So what I want to do is just have one day per month. And naive as I was, I chose the first day of every month. Now this is not always the same day of the week, of course. This can very easily be a Saturday or a Sunday or a holiday, where you would expect more people to use mobiles than desktops, of course. The solution is of course also very simple. So the first day, we just use the first Tuesday of the month. Not the Monday, because that's often also still a holiday or a vacation day. But Tuesday should give us more consistent results. It's not fully foolproof though, as I found out. The first of July last year was a Saturday. So the first Tuesday of July was the fourth of July, the big US holiday. And that definitely does show up in these metrics. But this is not just something specific to the RUM archive, and every temporal data set has this. But I think it bears repeating because people keep making the same mistakes there, including me. Now diving a little bit deeper, looking at the different OSes that we see. On the desktop side, it's probably somewhat as you might expect. But on the mobile side, we have a very outsized representation of iOS devices. At nearly 63%. And I say outsized, because if you look at the actual sales numbers, globally, iOS fluctuates between about 15 and 20%. Even if you look at some of the richer countries, like let's say Australia, you expect a more 50-50 split. There are several reasons why iOS is overrepresented in the RUM archive. One of the main ones is that Akamai, as a company, is mostly present in the richer Western countries. Right? And our customers are mostly from industries like e-commerce, luxury goods, travel, that also address more richer end users as well, that are more likely to be on, say, iOS devices. So there is definitely an ingrained bias in the current RUM archive data set that you need to be aware of. But that doesn't mean the data isn't useful, in my opinion. We can still do much interesting stuff with it. For example, I think this serves nicely to highlight one of the big problems I feel we have in web performance right now, is our maybe somewhat overreliance on the Core Web Vitals and the Google Crux data set. You might not notice, but on iOS, you actually have no browser that can give you Core Web Vitals metrics, not even Chrome. This is because on iOS, every browser is actually Safari in disguise. Apple forces you to use the underlying WebKit engine, which does not support the Core Web Vitals. And so the more iOS traffic you have, the bigger your blind spot for those users is going to be if you only use the Core Web Vitals and the Crux data set. And I might say, Robin, that's only a problem for the customers represented in your RUM archive. And I would argue that the RUM archive currently does not maybe represent the global web, but I do think it's somewhat representative for, you know, for example, the e-commerce industry, which is definitely one that we consistently target when we talk about web performance. So I do think this can lead to interesting insights on that part. There is a silver lining to all of this. As you probably know, the EU is trying to force Apple to allow other browsers properly on iOS. Apple is dealing with this in one of the most disgusting ways ever. In my opinion. So I'm not quite sure how much this is actually going to change in practice, but still it is a step in the right direction. Okay. And even if this doesn't happen, we can still do some cross-browser comparisons by looking at other metrics that are readily available on all browsers. And we actually started doing this in the RUM archive, because we have those metrics, of course. I had hoped to present them to you today. But we want to be sure that we are 100% correct in our interpretation of this before we release any type of summary on that. So not yet, but soon we are working on this. I don't want to leave you hanging for today, though. I do still want to give you something to take home. And this is because there is a shining ray of light in the darkness. Because a couple of months ago, Firefox actually announced that they will now start implementing largest contentful paint. First Corel Vital available in non-chromium browsers. And this actually went live in stable Firefox about two weeks ago. And we already have some of the data in the RUM archive, which I looked at. And if you compare this, you will see that Firefox is actually faster than Chrome, sometimes a little bit, and later percentiles significantly faster than Chrome, for LCP. Now, what I think this means is that Firefox has won the browser speed wars. We should all immediately switch to Firefox and dump Chrome. No, it's much too early for that. We don't know if this actually means that Firefox is faster, or if they just use a slightly different algorithm, or they identify different elements, or it's just a different type of site that Firefox users visit. We don't know, right? So don't read too much into these results. I just wanted to have something to start the discussion, to start getting people and ties to actually look into what the core reasons for these results are. But so, useful things for the future. We're talking about Corel Vital, so you might ask Robin, what about INP coming up? INP is actually already well supported in the Ampulse products. So you can see here, this is an INP screenshot from the previous speaker's website, Tim's ScaleMates. You can see Tim has a ton of work to do. He claims he has the fastest website in the world, but we can all see the proof that it is not true. Shame on you, Tim, shame on you. So INP is in Ampulse, it's just not piped through the Remarkive yet. We expect this to happen in the coming months, and so we can also start analyzing data for that. So up until now, I've mostly been talking about the milk in the coconut, right? But we all know there is something else in the coconut as well. The flesh, the meat of the coconut. We rarely eat this directly. We usually process it into other foods, such as, for example, these delicious coconut cookies. These are actually kind of a Flemish specialty, I think. We call these Rotskis. I think there are amazing, amazing cookies. Now one thing you might see is that there are several individual cookies in this box, right? But they all look kind of the same. They're all quite similar. And sadly, that is also something that we see for the third-party resources that we have in the... He's not human. Kind of human. So third-party resources that we have in the Remarkive. Because if you start looking into this, a lot of them are from Google, as you might think. Most of them are ads, or tracking, or analytics, right? Most of these are things that the typical end user probably would like not to see loaded on the pages they visit. So it's a little bit ironic that we have to go all the way down to number 98 to find the first sign of something that is created to try and mediate some of this, which is the very first cookie consent manager, the GDPR backwash, let's say. I say try to deal with all of this. I'm a bit skeptical that it actually works. But I mean, the fact that you have 100 entries before the first cookie consent manager, I think is a nice one-slide summary of some of the things that are wrong with the web today. Now, this was a bit of a downer, so I also wanted to end on a better note. So I went through the whole list, and almost near at the end at number 498, I find something that we all were hoping to see, which was, of course, the jQuery mouse wheel plugin. With 13,000 downloads every single day, half of that is from Tim's site, as we just heard. And then the other one's there. So jQuery is still going strong. Let's hear it for jQuery. As I said before, we also have some other stats on these third-party resources. For example, how often they're loaded from cache. And at the median, this is actually quite low. It's only about 2%. I definitely think that the browser cache partitioning plays into this. It gets better to higher percentiles. But so most of these third parties are not actually loaded from cache. This might not be a huge problem, though, because most of them are also quite small. Most of these are tracking pixels that are just a few hundred bytes in size, though there are definitely outliers. So one of the bigger ones that I found was a Google Ads JavaScript that was 131 kilobytes compressed. That's massive. And that was loaded over 260,000 times in a single day. So a very big impact just on that one external resource. Now we had a lot of different resources, a lot of different cookies. Another thing that we have a lot of is browser versions. Because browsers a few years ago, they started updating themselves fairly regularly. So for example, Chrome releases a new version almost every month. And the question there was, how long does it take for most users to switch to the new version? That's actually quite good. Because here is that within two weeks, within the month, over 75% of Chrome users are on the latest version. And most of the remaining ones are on the previous version. There is a very short long tail of versions present in the dataset. This is very similar for Firefox, which also very aggressively updates. But here we do see one interesting data point, which is the blue one here, which starts in August. And even in December still had about 13% usage. And it turns out this is something they call the extended support release, which I think is a long-term support version there. Probably mostly used by companies, I would imagine. So you do have a bit of a longer tail there. But other than that, Firefox is also a very cutting edge, I would say. This is of course contrasted with Safari. It's not an entirely fair comparison, because with Safari we don't have the minor version numbers like with the others. But here we still see the global trends, right? The latest version of Safari is 17. Even after two months, didn't even reach 50% of the Apple product population. And if you look even version 15, which was released over a year ago, is still at about 14% of all page lists. So clearly in Safari you do have a lot of older versions, up to a year and even older present in your dataset. You can't really rely on newer features being readily available there. A very fun one there was Facebook. They have a ton of versions, often multiple per week. And their clients apparently also update to the new versions very, very quickly. Meaning that I often only had one data point per version, which messed with my graphing library. It tries to draw a line, finds only one point, and then decides to just draw nothing at all. Now interestingly, this is exactly what would happen if you would leave me alone with these cookies. You would know that there was supposed to be something there, but there's no physical evidence of it whatsoever left. So a couple of other things. So this is again from Tim's website. So Tim has his own very extensive ROM setup, as you by now all know. But even for people with their own ROM, I think it's useful to have the ROM archive next to that so you can compare both of these. For example, this is for the navigation types dimension. The biggest part is normal navigations. You click a link, you go to the sites. You can also have back forward navigations. People press the back button, which should be much faster because it should still be somewhere in the browser loaded. And they have things like reload. So people actually hard reloading the page. Now for the back forward navigations, you want to see as much of that as possible. Because people, that is the faster that you can get. You can see here that Tim has clearly very well optimized for this use case because he has a lot more people doing back forward navigations than the averages that we see in the ROM archive. So good work, Tim. The same goes for reloads. Reloads you actually want as few of those as possible. Because when people reload, it usually means something has been going terribly wrong and they're doing the have you tried to turn it off and on again, Mattage, to try and fix it. So Tim is only at about 1% there. Well that is much lower than what you see in the aggregated data there as well. So it can be useful even if you have ROM to compare to see where might we improve or where are we actually doing better than others. Or let's say you want to move into a new region or a new country that you don't have ROM for yet. You can try and get some ideas about what the situation is there before you actually do. And so to thank Tim for everything that he does for the web performance community, I actually brought him a little gift. It's a palm tree scale model, Tim. I don't really know what to do with these. Maybe you can have one of your tanks drive them over or something. I don't know. But so thank you, Tim. Thank you for that. Another thing I really wanted to look at was single page apps. I have to admit something. I am still on Twitter. I still call it Twitter as well. And if you're on Twitter, sometimes it seems like everything is react. Nothing else exists on the web anymore. All of it is react. All of it is single page apps, which I really hope is not the case. But when I looked at this, I was somewhat surprised because more than 40% of all page loads in the ROM archive are actually single page apps, which is much more than I would have thought. Now, for web performance people, this is actually good news. This means we have a lot of job security down the line. So that's good. It's a little bit weird. And another very interesting point here is the difference between hard and soft. So hard means you load the initial load of the single page app, the spinner that we saw before. That's basically the hard. The argument being you download more, it takes a longer time to load the very first time, but after that everything is much faster. That's usually the selling point for an SPA. But if you would take this at face value, you would say for every hard load, there was only one soft load after that, where you would expect a lot more soft loads and hard loads. And if that is actually the case, then that whole argument for SPAs doesn't actually hold through at all. Now, that's not what I'm saying. We need more research. I need to look deeper into the data. There could be other explanations for this. But interesting to think about. I definitely did not expect these results. And I would love to compare these with other datasets there as well. I'm running out of time. I had a little bit about HTTP3 there as well, including some things where I got very angry. But let's skip that because I really want to get to this final page. Because we all know that coconuts are amazing. They are exceptionally delicious. You can make a lot of different products for them. But you can make them even better if you combine the coconuts with something else. For example, delicious Belgian chocolate. I think you can get into a very much a 1 plus 1 equals 3 situation with this. In case you haven't tried this Belgian coconut chocolate, it is to die for. Definitely try it out. What I'm trying to say is that currently the RUM archive only has Akamai impulse data. We are very much open to other RUM vendors or even large sites with a big RUM presence contributing data to the dataset as well to hopefully help us remove some of the biases that we've seen to get a better picture of the actual global web in the RUM archive there as well. Some of you might think, sounds interesting Robin, but this is going to be a lot of work. Isn't it? No, no, it's actually super easy, barely an inconvenience. Because if you look at the SQL query that we use to put impulse into the RUM archive, that is only 1.6K lines of SQL. Only 1.6K lines. Very simple, two hour stops and the data is in. Well I guess the message is clear. The RUM archive is now open for business. What I talked about today is really just the highlights, the top what we can do. We are literally just started to shift the coconut milk there. So if you want to help out with that, please come. If you have any questions, if you want us to run some queries for you, if you want to help with the analysis, please let us know. If by now you are just really, really hungry, I would say please come and try out some of the excellent chocolates and cookies, because there's no way I'm taking them home with me today. Okay, so please, thank you.
Linux on a Confidential VM in a cloud: where's the challenge?
Hello, everyone. Welcome to the virtualization dev room. My name is Vitaly. I normally work for Red Hat and you can see me being active in KVM community as well as taking care of Linux on all types of third party hypervisors and public clouds. And today I wanted to talk about bringing basically general purpose Linux distributions to the newly introduced VM type on public clouds, which is a confidential virtual machines. So if you haven't been living in a cave with no internet over the last couple years, which I wouldn't blame you because the world is a crazy place to be in now, but you may have noticed that some hyperscalers were announcing or releasing their confidential VM instance types or features. I'm not here to advertise any of them, but just for the reference, Google probably was the first with their plain AMD serve option in 2020. And now they even have a seven SNP in public review as of like last week or the week before. Microsoft Asia, they were refers to commercialize seven SNP offering AMD seven SNP and they GA in 2022. You probably see they now have Intel TDX option available in public review and Amazon offers seven SNP feature in GA. So it sounds confidential, so it must be good, right? Because we all like when our data is confidential. But what does it actually give us? Like at least like these technologies, what are they about? Like this are like both like AMD serve and all its variants and Intel TDX, they are CPU technologies. So first thing they give you is memory encryption. So your VMs memory cannot be read by your hypervisor or other guests. Second, which is important and which wasn't in like first implementation like plain serve is that your CPU status encrypted because normally hypervisor can see for example your registers where your VM is executed and if it can stop you every cycle can certainly read your data. And the last which is also important is that memory integrity guarantees are provided to you because even when your memory is encrypted, the hypervisor which is like malicious or compromised can do like an esthetic try to for example swap to memory pages. They will remain encrypted but your guest will access the wrong one, right? And it can probably mount an attack using this technique. So this all sounds great, but when we talk about confidentiality normally we say like confidentiality must be achieved in runtime at rest and it transits, right? Like very generic and all these things which I just described, they give you confidentiality at runtime, right? So what about the rest, right? Concentrality of the data in transit is not really specific to CVM because we were doing this for years, right? We know that internet is not safe place, right? So we need to encrypt our data when we send it through public channels and not only public channels, but what about storage, right? How do we ensure that the storage of the VM is also confidential because even if you have something which is confidential in memory, you will eventually need to write it to disk and do other things like you will need to read your operating system from the disk. So you need some guarantees there. The last thing I wanted to mention is that these confidential VM technologies, they don't give you any additional guarantees when you're already within the VM. So if you have an application which is attacked there, right? Nothing's gonna save you, right? The hypervisor cannot see your data, but everything which is within the VM can normally see the data. That's how it works, right? We want to put general purpose operating systems there. So yes, let's discuss a little bit about this protecting data trust because it seems that hardware technologies don't give us this, right? So first is that you want to protect at the guest level. If some cloud tells you, oh, but we are encrypting our disks, right? Like you don't need to worry. Yes, but then you have the key, right? If you can encrypt and decrypt it for me like in a transparent way, so then it's not confidential from this perspective. So you need to do it from the guest. And the thing is you need to somehow protect the operating system itself and not only data you care about because first you have some data which is really sensitive. Like think SSH host keys, right? If somebody can read it from your VM, he can impersonate himself and pretend that he's you, you know? You don't want this. Second, you have, you will say, oh, I'm running like a general purpose operating system there. It's open source. Why would I need to protect it? You don't probably need to protect it from arbitrary like reading from the host, but you still need to protect it from writing because a malicious host can try to mount an attack by modifying something in the operating system. Think about swapping SSHD binary with something, you know? How would you notice, right? You won't. And good thing is that we have some technologies in Linux already for years which are mature like locks or things like the invariative or integrity protection which you can use because even when you store your like encryption key something or like integrity hash in memory, it is protected from the host because remember your memory is encrypted, the host cannot read it. The thing is the guest needs to somehow get this key, right, when it starts and where would it get it from? So, yes, let's take a look at like how Linux normally boots and what we, how we can implement say like full disk encryption or something, right? You start booting from firmware, normally everything is UFI now and all these confidential instances, they are UFI. So, there is some firmware which comes from CloudVendor, but that's like another story. Why would you trust this firmware? You probably shouldn't, but anyway. So, then you will always have some unencrypted part, right? Because the firmware cannot jump in the encrypted part without knowing the key, right? You want to do decryption yourself, you don't want to afloat this job to someone else. So, you will always have something like bootloader, kernel, initramafas stored there in clear. Yes, you may say that we can actually do encryption at bootloader level, which is true, but then we are complicating the bootloader like a lot and the only one which does it probably is grub and nobody likes it. No, I mean, no, but it becomes, it's all like a operating system with all the complexity and everything and you don't really want that for your bootloader. You want it to be really small if present, maybe even you don't want to have a bootloader at all for confidential case. So, and then you will jump into this, you know, encrypted part, you will somehow get the key and then we'll decrypt it. So, that's how it's going to work. So, yeah, how can you provide the key to the VM? You cannot do it manually. For example, like grub, you can type it on your console. You cannot do it on a cloud because you don't trust the console. The console is an emulated device there, right? If you type your password there, the cloud will know the password, right? So, you're not going to do that and you will need to provide it like in an automated fashion, but you can only do that. you you you you So, they were suggesting if you want to have a virtual TPM device, you run a separate domain like another virtual machine which will have this like TPM device. It's really hard to implement and this like 1.5, I think, TDA specification they've added partitioning, which is somewhat similar to trust levels and I think that that's what clouds are going to use. Although, you don't know, thumb clouds may actually implement an emulated device on the host. Just for example, like you do with QEMU and SWTPM, right? You can run it as a process on the host. And not all of these solutions will give you a confidentiality. For example, the one which runs on the host obviously won't. Then there are two types of TPMs normally, stateful and stateless. Stateful is a TPM which has its state, right? And every time you run it, for example, think about it this way. It has a private key and it never changes, right? So, it's generated once when your VM is created and then every time it's loaded, you can use it for like encrypting, decrypting, something. Stateless TPM is just firmware which will generate a new key every time it boots. So, how can we use this? Let's first talk about stateful TPM. Like all these hyperscalers, they give you some sort of a stateful TPM. The question is where is the state stored, right? Because you can turn off your VM, turn it on back. So, the state needs to be saved somewhere. And it's not part of your like encrypted truth volume or anything. It's somewhere else, right? So far, again, like not an advertisement but publicly only Azure proves that this state is kept securely, that there is some attestation going on under the hood when this TPM loads, which protects it from the underlying hosts. You can't say much about other implementations, like because no such claims were made. So, you know, you don't know whether you can use it to isolate from your host or not, right? What's good about stateful TPM is that you can implement root volume like pre-encryption, right? There is a device which has like private key so it can decrypt something. So, you can take your root volume and encrypt it and upload it in an encrypted state there. And that's something which, for example, like Azure confidential decryption is doing. In theory, we don't need to pre-encrypt. We can probably do something like self-encryption. And there are such ideas floating in the air that we will start with this general-purpose Linux distro, right? Do some integrity checking. And on the first boot, you will encrypt your root volume and seal the key to the TPM. But I haven't seen such implementation yet. It's probably possible, but it's kind of hard because you need to prove that the environment where you were doing the initial encryption is saying that it was really a confidential VM doing an initial encryption. Otherwise, someone can try doing it at some other place and attack your VM. So, stateless TPM. Currently, I only know about Azure TDX which publicly offers this option. But what's good about stateless TPM is that it's just a program. You know, it's just part of the firmware. So, you can take this initial launch measurements and attest it. It never changes, right? You don't need to attest the state of the VTPM. It's going to get generated every time, right? Which is good. Think is that, again, like as I said, currently, you will have to trust your cloud provider with the provided VTPM. And yeah, there is no anything like bring your own firmware in public clouds. You can still use it for volume disc encryption if you want to use TPM, but you will probably have to do some attestation and then inject some intermediary key. And also, there is nothing like this in standard Linux tools, anything. Like you can, like just encrypting root volume to TPM is something which is like generally supported by SystemD or Clevis or other solutions. But something which would do like attestation to remote server and then bring the key is just non-existing. Second, yes, what do you do with the VTPM if the cloud provider is not telling you that its state is isolated from the host? Or doesn't tell you how it's implemented, actually. And the thing is you cannot use it, right? You probably cannot even use it for things like PCR measurements because if it's an emulated device, it can certainly get messed with, you know, and then you will see different measurements. So, the only thing you can do in this case is try ignoring this thing completely and rely on architectural attestation, something, registers which both Sev and TDX give you. The thing is, again, that our standard Linux tools for like volume encrypting, something, they don't know anything about this currently, right? So, you will have to, you know, come up with a solution for attestation and delivering like root volume key password or something there. And it's not done yet. So, just a few words about this unencrypted part, right, which I told you that will always be there, right? Even if you do like full disk encryption, which you call full, it's not going to be full because you want to load like kernel and something. So, how can you prove that these things are good? So, normally, we have two technologies which have been used. One is called secure boot, the other called like measure boot. Secure boot without a space, measured boot with a space, nobody knows why. Anyway, so secure boot proves that all loaded EFI binaries are signed by a trusted party and measured boot basically measures every important fact about the boot, like binary certificates, which signed binaries, there has to be something in special registers of TPM devices. And we need to check basically everything which is being loaded. And as I told you, like normally, again, for general purpose Linux distro, you will end up with like a kernel, initramafas, kernel command line being available in clear, not encrypted because, yes. And to protect these things, there was a concept called unified kernel image introduced, which is a very simple thing. It just you take all these artifacts like kernel, initramafas, command line, sign them together and make it like a UFI binary like which is extracting itself and launches the kernel after that. So the implications are, of course, of this like it's more secure, but it's less convenient to use. The initramafas becomes static and generated when we build UKI. And normally for a general purpose Linux distro, we want our vendors, yes, to build UKI. You want just like install an RPM, you get a UKI. You don't want to build it yourself. Otherwise, you will have to get your keys provisioned in the firmware. And not all clouds allow that, right? They may have like a vendor certificate there in UFI by default. It may not give you an option to put your own there. So you will get like a static initramafas which may or may not be a problem. Of course, you have less demands for initramafas which is on public clouds. And like you don't need to do network boot something there normally. But it's still limited. There is a system extension feature in system D which can be used to with limitations to do initramafas extension. Emanuele is going to give a talk like in an hour after me about extending UKI is going to cover this topic, how this can be done. So the other limitation is kernel command line becomes static, right? So this becomes one size fits all, right? When we build as a vendor like Fedora, we build Fedora UKI, we need to hard code kernel command line. You cannot pass like root equals UID anymore. So you need to rely on something like auto discover or something. And again, we just got an extension mechanism which is called like signed extensions. You place basically a UFI binary stub in ESP and get your kernel command line extended. This is already like publicly released in system D but these tools are still adopting this. I haven't seen like a fully working solution yet. But we're actively working on it in Fedora. Last but not least is how do you boot your UKI, right? So it is UFI binary. So it must pass secure boot checks. So it must be signed. And you can boot it either directly from firmware or you can, for example, boot it from shim if you want to have shim for some reason. For example, if the cloud provider does not allow you to have your vendor certificate in the secure boot DB. But you will still have to manage your UFI variables because there is nothing like boot menu there if you are booting directly from firmware, right? In Fedora, we have a package called UKI direct now which can manage it for you like automatically. We do things like AB booting. For example, when you install a new UKI, it's going to be tried once. If it boots, it becomes the default. If it doesn't boot, you will rework back after the reboot to the old UKI. Because otherwise, if it doesn't boot, you are like screwed completely. You won't be able, even able to access your encrypted root volume. Yes, so if we speak about stateless TPM where we don't really need to trust the provider, the cloud provider doing attestation of VTPM state under the hood, then we will need an attestation server and client. And again, there are some offerings say in the proprietary world like Intel was advertised as project ember. But there is nothing which you can use today in the open source world. There are attempts to implement this in confidential containers project. There is this thing called KBS which is both like a protocol and an implementation of this key broker server. But again, like we will need something in the standard tools to do attestation. We are yet to figure out how to tell this thing which server to attest to. Yes, so we talked a little bit about encryption as I said that for root volume, you need to at least ensure that it wasn't tampered with. And for that you can probably use integrity checking. But then problems are very similar there because now instead of the password, you will have to somehow convey the right hash ID to use for the checked part. Right? Yeah, so I'm a little bit out of time here. But yes, you will still need to use all the technologies which I described for encryption. You will have to ensure the integrity of this non-encrypted, non-verified part because UKI is still going to be on ESP which is like VFAT, you cannot attach anything there. Right? Okay, so just a few words. Even if you have your VM which started and checked, yes, everything, you need to verify that you are basically connecting to the VM you expect because think about host starting your VM somewhere and then starting another one which is completely encrypted and was like host and you know, oil miners there are changed. How would you know that you are connecting your VM? So you probably need runtime attestation and clouds are offering you something but there is also no open source, something standard for that. Okay, I'll skip to the last and the most important slide. Thank you very much for listening. You probably don't really have time for questions but I can take as many as I can before dying in the hallway. Yeah, so thank you.
How Much Do You Know about Snapshot
Okay, hello everyone. My name is Titi Ma. I'm from Redhead. And today my topic is about snapshot, especially for the implementation in open shift virtualization, open stack and LibWord. Actually, I'm a QE from QMU, which is very close to LibWord. And the main production of our LibWord is open shift virtualization and open stack. So I made some investigation here. And here is today's data. So first, what is a snapshot? A snapshot is a point in time representation or copy of data on the state of system, software, a disk, a virtual machine or any cells. But today I'm mainly focused on disk and virtual machine. And actually, snapshot plays a vital role in virtualization as it is used for data backup and recovery. We know that data is always imported for any users. And compared to the traditional data backup, snapshot can do a quicker backup and restore. And about the snapshot, we can also do different snapshots in different points in time. It means that we can restore to any historical value of our system in the state. So here are some general user cases about the snapshot. In our daily work, we mainly hit systems failures or data corruption. If they have a snapshot, we can use it to do the backup and the disaster recovery. And also it could be used for testing or developing environment. It means that in our data testing or in our development, we may destroy our system during our work. If we have a snapshot, we can make use of it for this scenario too. And also snapshots can be used for systems upgrade or software updates. If it fails, then we can roll back to the lower value of our system. And it can also be used for training on education scenarios. And that students may make mistakes during their learning. If there is a snapshot, we can also make use of it to roll back to the initial state of the system. And it also can be used to customer scenarios, customer issues, replication. It means that we can save customer environment as a snapshot. And we can use this snapshot to do the debug. It will accelerate the problems solved here. And a snapshot can also be used for security, incident recovery. In today's network world, malware is everywhere. So if our system is attacked by it, then we can make use of a snapshot for this scenario also. OK, from now on, I will talk about snapshots in that three platforms. First step is about snapshot in OpenShift virtualization. Actually, OpenShift virtualization is an add-on for OpenShift container platform. And about the snapshot in OpenShift virtualization, OpenShift provides robust capability for it. As it extends the base OpenShift snapshot feature to include the guest OS operation coordination and multi-disk management. And actually, from user space, there are two methods to create the snapshot. Next is through the web console. And next is created through the OS command line with YAML file. And in the YAML file, we need to define a virtual machine snapshot as a customer resource definition. And the snapshot, you open a shift virtualization, we can create when the guest is powered on or it is powered off, they are both supported. And when the guest is powered on, we are usually recommended to install guest agent, this software in the guest. The guest engine here is used to free file system of the guest. Then it gives time to flash memory data to the disk before the disk snapshot is created. So it here to guarantee the data consistency here. Okay, actually, the VAM snapshot in OpenShift virtualization, it makes use of a volume snapshot for VAM snapshot. The VAM snapshot here, yeah, VAM snapshot is a YAML file. It actually creates corresponding volume snapshots for all the supported volumes, either VAM. And actually, volume snapshots, the source of it usually from PBC, persistent volume claim. We know that the real data of the PBC is stored in a PBC persistent volume. And it could be classified into different storage classes based on different storage bank heads. It is the same as the same for volume snapshots. The real data of volume snapshot is stored in volume snapshot content, this object. And it also could be divided into different volume snapshot classes. Yeah. Okay, let's look at a general data flow for the snapshot in OpenShift virtualization. You already there is a user request line to create a volume snapshot. And the request will send it to the snapshot controller. This controller is deployed in the control plan of OpenShift. And here it is watching the volume snapshot, this object. Once it detects there is the object, then it will create the corresponding volume snapshot content here. And there is another component named CS9 snapshot, which is a sci-dicart container in CSI driver part. And it is watching the volume snapshot content, this object. Once it detects it, it will trigger the snapshot to create operations. And actually based on different storage bank heads, the issue in command loss is different. Like for RBD, it uses RBD snapshot related commands. And for NFS, it uses tar command to issue the snapshot here. And about the host pass for local file, it uses tar command also. And for the block, it uses DD related commands for the snapshot operations. Yeah. Okay. About the snapshot in OpenStack, like in OpenShift virtualization, there is also WAM snapshot and WALM snapshot. Actually WAM snapshot here is different from OpenShift virtualization. Here it actually creates several images, WALM, several image snapshots. And actually the image of the snapshot is also saved as an image file in OpenStack. It means that you like to restore from this snapshot, you need to relunch a new instance from the snapshot file. And also for data consistency, guest agent is also recommended to be installed before the snapshot is created. And for the WALM snapshot, OpenStack is similar as the OpenShift virtualization accepts the commands here is use OpenShift OpenStack related commands like OpenStack, WALM snapshot, create or for the restore it use sender related commands. Okay. Like in low end to the data flow here. For WALM snapshot in OpenStack, usually yes, it's the same. There is a user request from user space. The request will be sent to the sender component. And first it will send to the sender API. It will do some basic checks here. And then it will send it to the sender scheduler. And for the sender scheduler, it will schedule the request to different storage bankers. It's just like the OpenShift virtualization. And for the different storage bankers here is also the issue in commands is different. Like for RBD, it's the same. It use RBD related snapshot related commands here. And for OFS, it's different. It use the QMIL image, this QMIL tools to do the snapshot here. And for the LVM, it use LV related commands here. And about the VM snapshot in OpenStack, it's different. It's also different from the OpenShift virtualization. It does not make use of the WALM snapshot in OpenStack. Here is such code flow. It's mainly implemented in NOVA. And also it can divide into live snapshot or code snapshot. About live snapshot, the data flow first, it use the QMIL image to create a delta disk at first. And then it use the LibWord API block rebase to rebase this delta disk to the root FI file. And then it use the QMIL image to convert this delta to the snapshot. After the snapshot file is created, it will delete this delta disk. And about the code snapshot, it just use the QMIL image to convert directly to do the data transition. Actually, when I first saw this workflow, I'm confused. Why not use LibWord snapshot directly? Actually, the workflow here is just some LibWord APIs or QMIL related commands. Why not use LibWord snapshot? Actually, the reason here is that the LibWord snapshot from the current real release note that LibWord snapshot is not recommended to be used in that. OK, let's look at why it is not recommended to be used. What is the current status of the VAM snapshot in LibWord now? The LibWord snapshot now is using internal snapshot. So what is an internal snapshot? Internal snapshot means that the snapshot file itself is saved in the same base image file itself. We can image that the snapshot file and also the base file is merged in the same file. It will be hard to maintain. Actually, this feature is stopped developing in QMIL real level. It is planning to be disabled in the future. Another thing I'd like to highlight is that the VAM snapshot in LibWord is truly different from the VAM snapshot in OpelShift virtualization and OpelStack. In OpelShift virtualization, OpelStack is used as a guest agent to guarantee data persistence. But in LibWord snapshot, it will include the complete systems info. It will include the complete memory data and memory data and also the disk to see info into the snapshot file. So it can guarantee the data persistence here. And also for the LibWord snapshot, we can also do disk-only snapshot here. And regarding this advantage of the internal snapshot, LibWord appears as working on external snapshot now. And for the current status, we can create external snapshot now. But for restore and delete, it is still under developing and there is an issue tracking it now. And it is planning to be released in LibWord 10. And so eventually, when this feature is supported from data persistence, the perspective that this could be a perfect option for snapshot. But actually, there is still some limitations for the VAM snapshot in LibWord. As it should not source the storage bank assets that will. And the image format of the snapshot file in LibWord must be QQ2. While this QQ2 is not for some bank as a link from RBD bank as from the official documentation that we learned, QQ2 is not RAM-committed over RBD as there are some performance issues there. So, let's give a brief summary here that in high-level, we can divide the snapshot into two parts, code and live. For code, it means that the VAM is powered off. We can guarantee the data consistency. But actually, more customers may prefer live snapshots as there are still some other applications. If VAM is running in the VAM, they want to keep it running. They want to keep it running while doing a snapshot. So, about the live snapshot, we can also divide it into disk-only or volume-only snapshot. There is no memory data. It means that there is VMAH data in consistency here. And also, another choice is VAM snapshot, the whole VAM snapshot. There are two choices, like in OpenShift virtualization or OpenStack, they make use of guest agents, this component. It is used to file. But the question here is that it is just quite the file system as much as possible. It also depends on the workloads. It means that if there is a very heavy workload, in the VAM, there is still data loss, potential data loss here. And another choice is the live-watch snapshot. It will include a completely memory info in the snapshot file. But as I also told, there are some limitations for the different storage of backhands here. So, always based on your requirements or your environment to choose the one that suits your best. Okay, that's all of my presentation. Thanks for listening. Thank you.
UKI addons and extensions: safely extending UKIs kernel command line and initrd
Okay. Hello, everyone. My name is Emmanuel Giuseppe Sposto. I'm a software engineer at Red Hat. And today I'm talking about the UKI at Donson Extension, how to safely extend UKI, scan and comma line in E3D. So why this talk? First of all, because this is extremely new stuff, like it's very new, hopefully also exciting. Because there's not a lot of documentation, of course, because this stuff was just merged. And hopefully this talk will also help you understand a little bit more about what they are, how to use these addons and so on. And because they may be very useful because UKI, as also Vitaly explained in this talk one hour ago, is pretty static on the point of comma line in E3D. And with these addons, we can extend it, these two things without sacrificing the security. And also, yeah, this attempt to advertise a little bit to UKI, so what they are to the more public to be more recognized. So let's look first at Vitaly's slides. These are from last year, I think. So I will just briefly go through this. So Confidential VM provides data protection from the host he runs on. So we are protecting the VM from the hypervisor because it could be malicious and it's privileged, so we can access the VM and we don't want that. The host is still able to disrupt the execution of the VM. And there are specific hardware, SV, SMP and TDX responsible for encrypting memory and CPU. And storage encryption is necessary for security and must be done by the guest OS. This was already explained by Vitaly. And usually the situation that we have is that we usually encrypt, we have the encrypted part and while the kernel is signed by the vendor, in NITRA MFS and the common line are locally produced, are not signed and also it's difficult to measure them, of course. Whereas with the UKI unified kernel images, basically a single binary produced and signed by the vendor, in this case Red Hat. And it basically contains the important parts, the RP sections together with the signature, there is the kernel, the NITRA MFS and there is also the common line as a separate section that is then feed to the kernel. Before going to the next details, I wanted also to explain like the use case, like yeah, the use case in this case for this talk, that we have the UFI, the firmware that is in terms called shim the boot loader, which in terms called system distap which is very key piece for the add-ons and on both the kernel and common line the NITRA MFS which in turn unpacks the UKI and gets the kernel and runs the OS. The issue that also Vitaly mentioned is that the kernel line is immutable and is something that we don't like because there are limitations and you cannot have a static common line for every use case that you have, there is a crash kernel options, debugging options and we cannot ship different UKI for every basically use case. So what we are aiming for the UKI kernel common line is it cannot be static as I said because there are different use case, it has to be secure so whoever modifies the common line has to be authenticated otherwise the whole point of confidential computing is lost and by default nobody because the common line is inserted inside the UKI and then is signed so you cannot modify it anymore and has to be extensible of course because we don't want to ship a new UKI every single time. There are already ways for the one that are no UKIs to extend to add kernel common line to a new UKI but it's a little bit when we talk about confidential virtual machine it's a little bit tricky because as again I'll show you the option and you need to trust a lot of parts. So as I said there is the common line section it's embedded in UKI, it's generated with UKI, it's secure, it's shipped with UKI altogether but it's static, you cannot be modified. Then there is FI shell which is looked by system distable if the common line section inside the UKI is missing many distro for example they ship always something in the common line section inside the UKI so it's ignored. It's useful usually for type 1 entries but again it's unsafe because an attacker can easily inject its own parameter through the FI shell that's why it was disabled for CVMs so you cannot extend the kernel common line with the FI shell. There is SM BIOS system management BIOS, embedded metal this is good, it's trusted because it's coming from firmware and BIOS but it doesn't apply on CVMs again the hypervisor can easily inject kernel common line. So yeah as I said it's not good so it was also this was disabled and then there is QM firmware configuration by the name you can already figure that this is only from QM it's again coming from hypervisor so also disabled. Then what do we do? Our proposal initially upstream was an allow list basically an allow list is another P section where you use regex globbing and whatever just something like this to parse the common line that you want to get and the easy case will be if there is something that we don't accept in the regex we just discard the whole common line but the common line would come from FI shell SM BIOS all these sources but we try to filter and system desktop does the parsing. The advantage is of course that we can reject what we don't want but the problem is just moved to another place because then you can do attacks on the regex and globbing because they need to be very careful formulated so what's also this was disabled so was rejected actually and eventually we have the solution the system D solution nuclei addons. Nuclei addons is basically another separate binary which is contains a very few P section one of these is the common line and it's signed by yeah can be signed but should be signed for the CVMs and we take advantage of shim validate function offered by shim to validate the P signature so basically this means that system desktop will ask shim to validate if the binary has been signed by some key that we trust in the secure boot database. There is a very useful tool UQFI in system D upstream it's you can create UQIs very easily very better than drag up and object copy and you can also create addons and yeah basically the common line is very easy you can also provide the keys when you want to sign your own addon so it's this is the solution. So how UQI works the workflow is UQFI first you create the addons so you ask UQFI to create an addon with the common line that you want then the addons it needs to be put in the specific location in the ESP I will show you later where exactly is this system this tab looks for this location and finds automatically the addons asks shim calling shim verify on the addon to verify the if the addon is trusted so it's signed by somebody that we trust and then if a leadation is successful we read the addon the system read the addon and appends the common line inside the addon to the UQI common line section to extend it and then it's provided to VM linux to start links with the new common line there are two kinds of addons there is global and local addons so global addons can be applied to all installed UQIs and this is the location and UQFI UQI specific addons so if you want to apply all these to one specific UQI you have installed has to be provided in the UQI name has to be in an extra d folder in the same location where your UQI is and then has to be put in there just naming convention because last time I checked the system this tab was checking for also the extension name and this kind of stuff so you need to get them right UQIs are always located in the this part AFI linux UQIs always ends with the AFI and addons is dot addon dot AFI and specific addons here as I said you need to be located in the extra d folder okay so next next step is what is but white self so suppose that we as a vendor we shipped a new key I common line addon and we signed it and everybody's using it and then we figured the common line as an issue then what do we do because we signed it as a vendor so what it's trusted first solution just change the certificate so but this is basically impractical yeah good luck with that yeah we messes up all the measurements you invalidate all the addons so second solution try to create a blacklist on the cloud provider this is impractical third solution at the station check if the hash is matching your addon that you don't like anymore and the last solution these are these s but rules so what is s but is basically another p section inside the UQI the yeah the addons for example and contains component generation and also other information but the key part is the component generation table because there is the same table there should be the same table inside your shim that and then the we are at component level so for example every Linux PS action has should be should have the its own component generation version for the Linux one for the addon and so on and if the component generation match with the what shim has we accept it but if the generation for this component of the addon that is incoming is lower then we have a mismatch and even if the addon is signed by red dot or whatever it will be rejected and this part is done by shim when they verifies they are done in checks the s but components and generation just an example to clarify this in this case we have the shim has s but one myadon version two and then the addon contains the same version for s but and myadon so it's good it will be accepted of course has to be signed by somebody we trust in this case the my the addon as the s but version is correct but my addon component is lower which means that we don't accept it even if it's signed by whoever we trust in the secure boot database it won't be accepted one open problem it's combining addons so if you have two separate addons that contain common line that is safe but together can create a security issue because they enable something that we don't like how do we do it how do we solve this issue to be honest as of now I couldn't come up with a concrete example for this and yeah one solution will be to use that station to see if they are both there talking about the system dc6 in iterative addons so system dc6 already exist they are already famous so used and what is new is that you can also use them for uki so for what if you don't know is a system system extends an image extend the base system with an overlay containing additional files so you can extend base system and you can use this system this tab provides also the possibility to use this to extend the initer d inside the uki um more or less is same concept as the common line addons so you just use different tools because they are different things they are no p binary with p files sections so there are system extension images and micozi is used instead the uki fi and but the part for example where to put it is the same the workflow is more or less again the same create a system c6 extension you put it inside the extra d folder it must be a raw file and then this is the only difference system this tab will take the initer d the addon and will put it inside the initer d extra c6 folder where the c6 extension will then load it and apply it to the initer d yeah who uses this can use these addons the use case are various there are three groups of users that can use this the vendors for example read that they want to ship we want to debug kernel and uki and we ship our addon and there are there could be the vstod the virt host admins that can use host side tools like virt firmware or whatever to modify these these kind of variables more or less the same use case and the guest admins can add you can use guest side tools like mock to insert the key insecure boot even though this is a little bit tricky for in the cloud because on asia it's basically impossible to add a key in mock because when it reboots you cannot connect via when you connect with the shell you skip the mock reboot section when they ask you to confirm your key available tools there is a system d has a lot of tools uki fie is the main one in different version is supported gradually first how to build and then how to inspect them and then there is also i sent a pr to extend boot ctl to find addons and display already as a preview what will be the kernel command the full command line so if there is a system d maintainers right then and there is mico c to create a system d sex the image and then we have a uki director for fedora there is kernel boot config you can add update and remove uki's and then we and also added kernel addon which does the same thing for uki addons and the future work what are we planning to do next maybe an rpm so the vendor ships an rpm with the collection of addons generic addons that we want to ship signed by the vendor but of course we don't want to pollute the esp with the addons that the user doesn't need so there was a agreement also upstream to find these two locations user lib linux extra d for global and the other one for uki specific addons where the rpm should install these addons and then when the user needs them can simply use kernel addon or just copy the addon that for example we as a developer ask to for debugging the uki to copy it in the esp reboot and they will be there yeah on the cloud cloud if they want to allow the user to upload their own uki addons they need to be a way to inject to inject the owner certificate otherwise yeah you cannot do it this also there is a little bit an issue with the measurement because the when you add the user certificate has to be measured in pscr7 especially and the solution we found is to simply add the dummy addon before performing attestation so the certificate is part of the in the key ring so it will be attest is measured on prem more or less the same things who for us is libvirt we want to offer the same possibility to upload the certificate for secure boot and yeah and there is already a way to add the dummy addon so that's that's it from my e-talk if you have any question here outside thank you yes please uh so second comment is on all of the add-ons Right? Because you can trust the UiViceQ boot mechanism. Whereas in a confidential computer environment you cannot today use. I'm not aware of any stack right now that gives you a trustworthy UiViceQ boot environment. That means you need another mechanism to do that measurement for a confidential computer environment. The most natural path for that is to use the launch digest. Because you have the launch measurements, you need to know ahead of time. When you boot the VM in a way, in a way, in a way at boot time, all of the data that you need to launch at the end, which means you need to have the UiViceQ ready to be available including all the add-ons. At which point we go in full circle, I think we are much better off just building a separate UiViceQ for that one set of configuration you're doing. So you can attest that I'm actually running a set of configuration. You don't want your debug add-on in your production fleet. That is, you want to pre-aggressively. So I think the mechanism that is the most natural one here is to go and build a separate one-off UiViceQ even if they're made of add-ons if you want to. Okay. Okay, thank you. Okay. Thank you. We cannot do a vocation only with a firmware. The firmware cannot support a vocation mechanism outside of the DDX. And DDX has both space and around that. If you have a lot of space, if you ditch the microsoft solution, don't use the microsoft solution. Thank you. Bye. We know how it ends. Guys, you are more than welcome to present next year if you want. You are more than welcome to present next year. You are more than welcome.
From Virtualization Platform to Hybrid Cloud Solution: A Hands-On Account
So, good afternoon everyone and thank you for joining me today. My name is Bello and I'm a software engineer at Red Hat. And over the past year I have the opportunity to be part of the forklift team and take it for a spin. So, today I'm about to share with you our recent journey and without further ado let's jump in. So, in today rapidly involving both of IT we can observe an increase in moving away from traditional virtualization environments towards more hybrid cloud solutions. And with Red Hat RNOT just observing this trend we're actively participating in it. So, recently we had the opportunity to go on a journey and migrating from a virtual established environment to a newer solution. And today I'm going to share with you some of the inside challenges and benefits of such a transition. So, let's start by discussing these two very different solutions. So, picture you on a journey through the IT computing landscape. Our first step is Ovid. It's like an older reliable train that's been running for years. So, Ovid is an open source product based on KVM technologies and it's offering cost efficient way for enterprise to managing their virtual workloads. It's an alternative to vSphere. But our journey does not end there. We then continue to the world of OKD. So, picture it as a high speed train whisking us to the future of cloud computing. So, OKD is also an open source project based on Kubernetes and it's providing us with cloud computing capabilities alongside enhanced Kubernetes features such as added security, automation and user friendly interface. And it supports both containers alongside virtual machines. So, when considering such a transition it's important to take into account how it can be done. So, there are several path we could take each with its own set of advantages and challenges. But today I would like to focus on main three. So, first we can't reprovisioning all the virtual workloads and start from scratch. Even though this solution may be sound pretty straight forward it's both costly and time intensive. And for complex workloads it's not always possible without risking some data integrity and operational disruptions. Next, we can migrate all our virtual workloads into containers. So, with the use of conveyor project we can really reduce the cost here. But it's still not an easy task. And again we have the same issues before not all workloads can be containerized. So, while this may be a good solution for certain types of applications, it's not suitable for everyone. And finally, which seems to be the best one is keeping our migration workloads, our virtual workloads as they are. And with the use of forklift tool migrate them to the new environment. So, by that we don't have to worry about any data lozage. And with the use of this tool we can have simple and smooth transition. So, what is it forklift? So, forklift is a tool designed in a system that is designed to be a virtual environment. Design in assisting migrating from traditional virtualization environments to Kubernetes-based environments. And it's taking care of the entire migration process for us. It's working alongside with another project named CubeVirt. And CubeVirt providing us the virtualization capabilities on top of Kubernetes-based environments. And once forklift migrating the virtual workloads, they will be placed on top of CubeVirt. So, forklift as a versatile tool supports a variety of source providers, source environments as you can see here in this list. So, now I would like to take a deeper look at forklift high-level functionalities. So, forklift supports two types of environments, KVM-based and VMware-based. And for both of them, it's taking care for the entire migration process. That means creating the disk, copying the data, and for VMware-based product, converting the virtualization stack to match CubeVirt requirements. And, of course, finally creating the VM itself with its original setup to run on top of CubeVirt. So, the use of this tool will make easier and smoother transition to the new environment. So, now that we finish discussing these different solutions and approaches, let's dive in into the specifics of our own migration from OVirt to OKD, where forklift used as a crucial tool in facilitating this migration. So, I would like to start with a little bit background on why we decided to go ahead and proceed with this transition in the first place. So, our OVirt environment being in use for more than a decade, supporting hundreds of virtual machines with diverse usage, some for production while others for developing and testing. While the fact that OVirt reaching its end-of-life zone wasn't the main reason we decided to go on this transition, it certainly matches in this direction. And, moreover, we wanted to take this opportunity and reallocate some of our resources and remove underutilized workloads while causing as minimum interference to the users as possible. So, taking all this into account, the shift to OKD seems to be the most reasonable fitting choice. So, as any successful story, planning is always essential, and our migration wasn't exception. So, we started our journey by having in-depth analysis of our current environment and just understand what the migration requirements and what we need from this transition exactly. We then continued to having some resource evaluation. That means we had to make sure that our target environment will have enough resources to accommodate the incoming workloads in terms of compute, storage and network. And finally, we had to create a clear timeline to make sure that each step of the way is well known and everyone involved from users and IT teams are in the loop of this transition. So, now we would like to zoom in even more into the preparation step and focus on the resource allocation. So, we had to start by finalizing our VM list for migration. And when we thought about what going to be the criteria for VM to be eligible for this transition, we decided to proceed with actively used VMs only and had to have close conversation with their owners to understand their specific needs. After that, we had to calculate the storage and IP addresses of all the VMs in this list to make sure that our target environment will have enough resources. This step was more than just technical preparation. It was essential to ensure that once the migration is started, we won't have additional downtime to lack of resources. And last, we had to come up with a way to reflect our original ownership and access mode from the overt environment to OKD. So, with a well-planned and tool like forklifts at our disposal, you might think this migration is going to be a walk in the park, right? Well, not quite. As we started our journey, we discovered that the path ahead of us is going to be quite challenging. So, now I would like to share with you some of the obstacles we encountered and how we tackled each of them to keep our migration on track. So, the first challenge was regarding the VM selection. So, as I mentioned earlier, we wanted to continue with only actively used VMs. That required from us to analyze the VM usage patterns and understand which VMs were actively used during specific time period, tasks that proven to be quite challenging. Then we had to gather the information about these VMs, such as disk size, network and ownership. And that task appeared to be quite demanding as well, both in matters of complexity and in time intensive. And the first, our two environment had different provisioning models. Our overt was more admin-driven and our OKD was more user-driven. And we had to come up with a way to bridge this gap somehow. So, in order to overcome these challenges, we went ahead and developed Python script specific for the migration process. And they can be broken into two categories. The first one, based on OvitasDK, was mainly used for finalizing the VM list for migration and two data gathering, such as the disk size, IP allocation and ownership. The second sort of scripts were based on Kubernetes API and they were used for creating the namespace on the target environment and for assigning the appropriate role for the users. We also uploaded the script to our GitHub region, so they can be used as a blueprint if anyone wants to take a look, you're more than welcome. So, now I would like to focus into a specific issue we had and just walk you through the different stages that we took to solve it. So, as I mentioned earlier, our two environments had different provisioning models. So, our Ovitas environment were more centralized models, where admin had full control of the environment and managed all the resources and created new VMs. Our OKD environment, on the other hand, is more user-driven, where the user have freedom to manage and create their own resources within their namespace. The namespace resources are set by predefined quotas. So, to bridge this gap, we decided to go ahead and create new namespaces on the target environment and place in each one of these namespaces all the shared VMs by the same users. And by giving them an admin access, we made sure that each user will remain with the original permissions. So, let's clarify it with an example. So, let's say after we finished finalizing our VM list, we ended up having four VMs for migration. So, as you can see, on the new environment, we created three new namespaces, and in each one of them placed all the shared VMs by the same user. So, Bob and Alice, who shared two VMs, now will have shared namespaces with both having admin access to it. And Bob ended up having three projects assigned to him, which really reflects the diverse usage on the original setup. So, now I would like to guide you through the script we used for this mapping process. So, the first part is based on Ovid SDK, and we did the following. So, we started by creating a list that mapped between all the VMs and the users from the system. Then, based on information from another script, we removed all the admin and system users from that list. Then, we created a dictionary that mapped between sets of VMs and all their corresponding users. And based on this dictionary and Kubernetes API, we created a YAML file. So, here we can see a set of actions for one set of VMs. So, we started by creating the new namespace on the target environment. Then, we created an admin role that gave full permissions on all the resources under this namespace. And finally, we created a role binding that bind between a specific user and the role, the admin role. And by that, we made sure that each user will retain its original access to its resources. So, now that we finished with the planning and preparation phase, let's dive into the migration execution. So, our first step was to deploy Forklift. Forklift can be installed from the operator hub, and it's managed by an operator lifecycle manager. In our case, we decided to install it on the same cluster as the target one, but it also can be deployed on a remote different cluster. Next, we had to create a new namespace that will hold all the migration resources, including providers, different mappings, and the plans themselves. It's important to know that the user used to create the namespace should have sufficient permissions on the migration resources. Next, we had to create the target and source provider. So, each provider represents the environment we're migrating from and to. Once we deploy Forklift, a new tab named migration will appear in the console, and from there, we can manage all of our migration resources, including the addition of new providers. So, we started by creating the source provider, and here we chose Redhead Virtualization, which is the downstream name for Ovid. We then had to fill in all the information about this environment, so Forklift will be able to connect it. Here, it's important to use users that have sufficient permissions on the VMs where about to migrate, or else the migration will fail. In our case, since we were dealing with scale migration, we went ahead and used Administrator account. Next, we created the target provider. So, here we chose OpenChief Virtualization, which is the downstream name for OKD. Here, we only need to fill in the name, and all other information is automatically filled in. Next, we had to create our network and storage mapping. Once the migration starts, Forklift needs to know how to redirect the incoming workloads in terms of villains and storage class. This mapping will tell him how to handle the incoming workloads. So, here we can see our network mapping, and we can see the new villains we created for our migration needs. Here, we can see the storage mapping and the storage class used for accommodating our incoming workloads. Finally, with the use of script, we had to create our migration plans. So, each plan holds inside of it all the VMs that are about to be migrated to do the same namespace. This means used by the same users. Once we were ready, we triggered, again, with the use of script, all the migration, and the migration started. As you can see, it also can be triggered from the console, but since we were handling with scale migration, we automate this process. Now, I would like to have a quick overview of the steps we had and add some additional information. So, we started by deploying Forklift and setting up all the costume resources for migration. Then, with the use of script, we automated all the plans and the migration. In our case, we decided to go with cold migrations. That means that during the transition, the VM is going to be shut down, because it best suited our needs. We're migration, on the other hand, keeping the VM operational during the migration, but it's leading to longer migration time, because we need constantly backing up the data to keep the VM operational. So, during this transition, we also monitored and troubleshoot the entire process just to make sure that we're on track. And once the migration was over, we chose some randomly VMs and tested to see their up and running, and then waited for some user feedback. So, although eventually we had a successful migration, we did encounter some issues during it. So, the first two issues related to the fact that we had a lot of simultaneously migration running at once. That caused both storage and network strain, and eventually led to longer migration times than we originally anticipated. Another issue we encountered caused some of the migration to fail, and after we had some investigation, we realized it was related to some bug in our codebase. After that, we released a fix, and with the use of that fix, we were able to migrate all the VMs, and it was included in the next version of Forklift. And finally, since the downtime was involved, we had to keep a clear communication and make sure everyone in the loop of what's happening. So, once we started receiving user feedback from the field, it was clear that we still have some issues to solve in order to make this transition fully successful. So, the first one was related to boot order issues. So, VMs with multiple disks were not booting from the right one. So, we addressed this issue manually, and later we discovered it was caused by another bug in our codebase that was fixed in the next version of Forklift. The second issue was related to the new VLAN we used. That caused our FQDN names to change, and the workload inside the VMs were no longer accessible. So, we had to update our DNS records, and the user had to adjust their FQDN names inside the workload to use the new ones. And after that, all the workloads were accessible again. So, as we're reaching the end of today's journey, I think it's a good point to reflect and draw some conclusions. So, overall, we had a successful migration. We were able to migrate more than 100 VMs and copy 12 terabytes of data. We mainly were able to achieve this result through thorough in-depth preparation and planning, and we realized how much it's crucial for a successful migration. Another thing is that we understand that each migration process can be different and held between different environments, but we do see some common ground and best practice that can be used to similar journeys. And finally, and probably the most important, is that even though Forklift is a really powerful tool and it gives us great capabilities, it cannot facilitate migration on its own, and additional steps are required, such as the use of scripts and thorough preparation. So, as we're wrapping up today's session, I would like to extend my biggest gratitude for each and every one of you. I hope that the session today will be valuable for people that want to go on the same journey. I wasn't able to cover into details all today topics, but we post a blog post about this. So, whoever wants to get another information, you're more than welcome to take a look. And that's it. Question and some insights. Thank you. Yeah. How did you handle notifying the VM owners during the process you automated the fanning with the old automated notifications? Yeah, we had... Can you repeat the question? No, you should get the question. Sorry. For the streaming, so maybe people are watching from the audience. So, did we automate the process of notifying the VM owners? So, in our case, we had a VM list that all the owners on this environment were included in, and we said as spreadsheet that all the VMs that were eligible for migration were included. And then we asked the owners to let us know if they want to migrate their VMs, because there was people that decided to continue to different environments or didn't need the VM at all. And based on this information, we also built our migration, our final migration list. So, yeah. Yes. Kai, could you please give us some example of what issues you had during the migration steps? Yeah, so I will give an example about some boat issue we had after the migration. So, we had a lot of VMs with multiple disks, and when you're trying to boot from the disk that doesn't have the operation system on it, the boot will fail. So, you see just like a black screen, and the OS is not found. So, we understood that it's probably not booting from the right disk, because we saw there is another one. And once we manually changed that, we really saw that it's solving the issue. So, we adjust this manually for all the VMs in the migration list. And after that, as I said, we released the fix in our next version. Yeah. Hi. Hi. Is this tool also performing some kind of a preferred check over the plane? I don't know. It's checking that you have enough space on the target storage class, or it's checking that the VM you selected is not exposing particular devices. But can make the middle of the end? We do have set of, so the question was if we did some verification to make sure that we have enough space on our target environment or in devices like in compute. So, we do have set of validations on our plans, but this one are not included. We're checking more things like names that match Kubernetes and more security things, not something like that. Yes. You transferred, you mentioned 12 terabytes of data. I was in the presentation yesterday about talking about PCCOP, and then we're talking about validating that all the data was named correction over a large database migration. Did you do some things? Because I was saying quite a hard problem. A lot of days you might get a crack or off. So, the question is if we're doing some validation on the data, if it's copying correctly. So, it's depend on the source environment, but we do use some external tools for that, and these external tools supposed to make sure that all the data copied correctly. So, it's really depending on the source environment you're using because there's different flows between the different environments. But the tools that we're using, for VMware for example, we're using VIRT V2V, so we're taking care for this check. For Ovid and OpenStack, we're using ImageIO, so it's taking care of under this tool. Okay, so if anyone want to ask any specific question, feel free to approach outside. Thank you.
Making VirtIO sing - implementing virtio-sound in rust-vmm project
Hi everyone, my name is Dorin de Basse and I work at Red Hat. I currently work on enabling the audio stack and other features in the automotive team. And with me here is Matthias. Hello everyone, I'm Matthias. I also work at Red Hat. I am working at the automotive and the beautification team. And I'm going to talk about the audio sound and implementation we did last year in this year too. So yeah. Okay, so in this presentation, we'll be talking about making VETAIO sync. And we'll focus on the implementation of the VETAIO sound in the RASVMM project. So just a brief outline. I'll be talking about the automotive use case. I'll go through the VETAIO sound device on the driver. And Matthias will take care of the VHOS design implementation, the audio back end architecture and the upstream status. Okay, so let's get right into it. One might ask why VETAIO sound? Our main use case is the automotive industry. And in automotive, Android guests are being used for deploying infotainment systems. So in order to support these Android guests, the virtual machine monitor, as in our case, Quemo, requires a set of virtual hardware like VETAIO sound, VETAIO net and VETAIO GPU. And having a VETAIO sound device emulation would allow for Android to be deployed in different virtual machine monitors that currently support the VETAIO device emulation. Examples of these VMMs are Quemo, CrossVM and the likes of them. The Android reference platform, which I linked in the slide there, it defines a set of VETAIO interfaces that are expected from any VMM monitor that runs Android. So based on our expectation for Quemo KVM as a hardware diagnostic hypervisor, we decided to close the gap, which involves enabling the VETAIO sound device emulation as an external process. So now Quemo or any other VMM that currently implements the VHOSESA protocol can actually interact with the user space application. So before showing you how we build this device, let's present to you what the device is. So the VETAIO sound device is a parametriolized sound device and is based off on the VETAIO specification standard. It's consisting of the VETAIO driver, the PCI bus transport and the VETAIO sound device. And this is an architectural view of what the sound stack looks like. And I will show you how the different VETAIO components come together. So first we have the user application in the guest that's interacting with the driver using a set of SISC calls and common user space libraries, such as, take for example the ALSA library in the case of a normal application in the guest or tiny ALSA library as in the case of an Android application. And then the VETAIO sound driver on the other side takes the information that it received from the guest user space and shares it over a transport method. And in our case is the PCI bus. Now this PCI bus is a way to expose the VETAIO sound device to the driver that's in the guest. And the VETAIO sound device, just like any user space application that's running in your host, it sends the audio streams to the host sound drivers and the necessary sound libraries and the E-mone would route it to the host, to the sound driver that's running in the host canal space. So I mentioned something about the VHUCHESA protocol in the previous slide. So what is it? The VHUCHESA protocol is a set of messages that has been designed to offload the VETAIO data part processing from QEMU to a user space application on the host. And this user space process application is what's responsible for configuring the VETAIO rings and doing the actual processing. The VHUCHESA protocol actually uses communication over the Unix domain circuit. And it allows the control planes to initialize the shared memory regions and also exchange the file descriptors. The protocol defines two sides for communication. We have the front end and the back end. For the front end, we have it sending the message request while the back end is sending the message replies. The protocol itself also implements the control plane for establishing VETQ sharing between the guest and the user space process. And this user space process utilizes the VHUCHESA library. So I attached an example here of what the VHUCHESA protocol message would look like. We have the front end that's sending the VETQ memory layout and configuration to the back end. And you can see the message outputs in hex formats. An example of one of these messages is the VHUCHESA get feature message. It's expecting an acknowledgement reply. But sometimes not all messages from the driver expect a reply from the back end. We attached here a subdom tool, which is a tracing tool that can help you while you're debugging in case you want to play around with the traffic messages. So this subdom tool would actually dump the socket traffic between the front end and the back end. And it's being used if you pass the parts of the socket and also specify formats. Maybe you want the format in hex and the subdoms could also provide your format in a pickup format if you want. So the VETL memory region, which is this guest memory here, is initially being allocated by the guest. And in Quemo, this is being done by the memperealock option. And the VETL memory region, when it's been allocated by the guest, it's smacked by both the front end and the back end using the M-MAPS CIS calls. So this memory region would be accessed by the file descriptors on M-MAP. OK, so what happens during the device initialization? We have the feature bit negation that goes on there. During this initialization, the device and the driver both have feature bits that need to be negotiated. And at this point here, the driver would read the feature bits that the VETL sound device is exposing to the driver. And then the driver would tell the device, OK, hey, man, I only support this subset of features or I do not accept this set of features. So take a example, when we have the VETL ring event IDX feature, when it's been negotiated, it would allow the device to control how the notification from the driver should be handled. And we have other features like the indirect descriptor feature. And this one thing to note about the VETL sound driver is that it doesn't have any specific features that are currently defined. So it uses a generic feature bit set of the VETL device. And there are a couple of other driver requirements for this feature bit negation, which you can find it in the VETL specification link. So in a nutshell, a VETQ is a queue of guest allocated buffers. And this VETL sound driver is consistent on four VETQs. We have the control queue, the event queue, the TX queue and the RX queue. And each of these VETQs are consistent of three parts. So first we have the descriptor table. And the descriptor table is occupied the descriptor area. We have the available ring, which is occupying the driver area. And we have the used ring that's occupying the device area. So to further explain how the VETQs are being mapped in the driver and the device, take for example, we have the user application that's running in the guest. It would notify the driver of the audio streams that needs to be processed through the corresponding libraries and interfaces. And when the driver wants to send a buffer to the device, it fills the descriptor table with the M-Mapped buffer and writes that descriptor index into the available ring. Now after writing it, it has to notify the device of those available buffers. So it would notify the device saying, hey, I have some buffers that need to be processed. Now, depending on the buffer size, it could create a descriptor chain, which it would always because of the sound buffers are usually a lot of them. So for the device side, when it's done consuming these buffers, it would write the descriptor index into the used ring and send a used buffer notification to the driver itself. Now in the past, this was not how the driver used to work. That's when the user application sends messages to the driver, because it was unable to actually determine when the buffer has been updated from the user application that's running in the guest. And some of our upstream contributions was to ensure that this acknowledgement callback was being used to notify the updated buffers and also prevent the reading of steel buffers. Thanks to Matthias for some of those contributions. And let's see how the requests have been processed for each of the vertio sound red queue. So for the control queue, it's been used for sending the control messages from the driver to the device. And this control red queues have been translated into a VHOS user request and it's been forwarded to the backend for processing. So on the device side is going to respond to these messages indicating the status of the operation. For the event queue, it's been used for sending notifications to the driver, but in our current implementation, we did not use it because it's not necessary. Then we have the TX queue, which is used for sending the PCM frames for our P streams. And this TX queue is being used for playback. So it would carry the PCM frames that have been initiated by the driver and also replied to the previous received frames from the device. For the RX queue, it's being used to receive the PCM frames for input stream. And this is being used during the capture. So the RX queue would carry the PCM frames that have been initiated by the device and also replied to the previously transmitted frames. So I'll let Matthias take over. So now I'm going to talk about the VHOS user implementation. The VHOS user implementation is split into the front end and the backend. So the backend and the front end communicate by using the VHOS user protocol as Doreen explained before. So for the front end, we based on the word from Alex Benet from Linario that simplified the boilerplate code in Kimu, which is common for all the VHOS user devices. So if you want to see this work, I leave the patch set there. Then for the backend, we decided to implement it under the RASP-MM project in the VHOS device repository. And the benefits of doing that are the following. So for example, we show the device implementation between multiple virtual machine monitors like Kimu or cross-PM. And we use RASP as our main language. So we leverage the features that this language have. Also the process that emulates the device runs separately from the Kimu. So that's reducing the attack surface of Kimu. And also the current implementation has less context which that, for example, the Kimu built in device. And I leave you the link to the script that you can use if you want to try it, you compare. And also you have the link to the RASP-MM project. You can look for the implementation. So now let's see how the backend is designed. So basically the current implementation is made of a device and the audio backends. The audio backends implement the driver for different libraries like PyWear or ALSA. And the whole backend is implemented by a single thread. And current implementation has called the number of strings. So we have only one for input and one for output. So when a new request comes from the guest, depending on the queue in which the request arrives, we're going to have different handler. And depending on the queue, the semantic of how we handle that request change. So I'm going to talk about that a bit. So for example, for the control queue, when the driver's in a request, what we're going to do is just to process that request immediately. So for example, we're going to pass the request and depending on the control message, we're going to call a different method. What we use here is a genetic interface so anyone can write a driver for the audio backends because they share the interface. And then after processing the request, we notify immediately the guest that the request has been processed. So in this case, the methods in the interface are not blocking. In the case of the transmission queue, when a request arrives from the guest and the transmission queue, as Doreen said before, it is when we're doing playback. So we're going to reproduce some sound in the host. What happens is how we process that request is by just picking up the request, I mean, storing a pointer to the request and putting it in a 5.0 queue, which is per stream. And then at some point, the worker's going to wake up and pop the queue request and process that. Here we have to make sure that we're going to consume all the payload that the request has or at least to fill the buffer that the audio engine proposes because otherwise what happens is that the worker's thread is going to wake up more often and we're not going to use the buffer, I mean, the whole buffer that the engine has for the playback. So we have to be sure that at least we consume the whole period. So in this case for the transmission, we notify the guest only after consumption. We have to do that, have to wait because otherwise we can make the user application run out of data. So the spec said that we have to do that, I mean, to notify just after consumption. So in the case of the reception queue, I mean, the transmission queue, reception queue were exactly the same. The only difference is that in the case of the transmission queue, we have, and the payload has data to reproduce in the host. And in the case of the reception queue, we have data in the host that we want to send to the guest for capturing. So what we do is the only difference is that when we pop requests, we're going to use that space to fill with data from the host and then send it back. So if you want to try it, as I said before, we have to launch two processes. One is going to be for the emulation, for the device, and this is the command line in which you use it up there. For example, the backing that you want to use in this case is pipe wire. And in the other command line is for chemo. And the only parameter that you have to take into account is the unique socket that you're going to use to communicate with the demo. So I would like to mention some of the afterword that these were required. And for example, we fixed the BitDio sound driver because it was not respecting the BitDio specification. So that is what Doreen mentioned before. And so we fixed that. And also we have been working in the spec to make it more clear. So we have we sub-streamed some patches to the BitDio spec. And other work we did was to add the descriptor util module to the build queue crate, which allows, I mean, which is what's before in BitDio FS, before, and we move it to the build queue crate so anyone can use it. And the point to do that is because you cannot, you cannot hack all the way that request is distributed over at the scriptor. So the guest can use any distribution of the, use descriptors he wants and because the spec doesn't say how to do it. And we have to be independent of that. And that is the reason of that. So also there were the patches to add the generic because user device, which used the boilerplate code code that you have to put in chemo for because user devices. And also there were some, I mean, there were many development in the pipe wire arrays crate, thanks to the Linda. So for example, we added the fill out module. Also the sparring buffer. There were many also backfixing that we did doing this work. So yeah, we are getting at the end of the presentation. So if you want to get in touch, feel free to participate in the because device project. Also we have a Slack channel called a big dios on if you have any questions. And we also submitted a proposal for how Google somebody of course, so we are, if you're really interested in participating, we are trying to add a new. Audio backing for she is streamer. So feel free to submit your candidate to that. And if you have any questions, feel free to contact us directly. We have the email here. So yeah, that's all I think. So I think now we're going to questions. The question is what happened if I want to use it. It's going when you launch the first program is going to launch the device emulation and then it's going to launch Kimo. And then, for example, if you are in the guest, you want to use it, you're going to use for example, speaker test or apply or something like this to do. And then you are going to listen something in the host. So, yes, but what is now nothing is happening. What is happening when you use the back end? No. So she's asking what happens when we use the now backing. It's clean. No audio. He doesn't use any library. Yes, nothing because the pipe wire would use the pipe, I correspond in libraries and also would use the also libraries, but no, nothing. Okay. Sorry, I missed the question. Can you disclose some car brands that is using your feet? Can you can we mention some brand that is using this implementation? No. Can I ask why you chose to implement this in Rust? Okay. He's asking why we choose to implement this in Rust. So as you all know, Rust, like going to Rust design safety and features of Rust, we choose to implement it in Rust and also the memory usage. So, yeah. I can compliment a bit because also there was the was already the Rust BMM project that existed before. So for a lot of things, we was quite easy to implement the device because we could use many, many things. For example, to work through the beer queues, notify the guests, it was already all in that project already. So for us was just to implement the parsing of the request. But for example, the beer queue handling was already there and also it was easier to implement. Yeah. That's it. Maybe it's a bit out of scope, but have you made any benchmarks compared to like fully virtualized audio devices? What's the like overhand of using this compared to one of the audio devices already existing in KMU? Okay. So he's asking what is the benefit of using this audio device in comparison to the other audio devices in KMU? So regarding the PipeWire backend, PipeWire provides reduced latency, low latency and also low CPU usage and memory usage. And using it in the audio backend, we did some latency benchmarks. You can look up the PipeWire Wikipedia and how to do this latency benchmarks. You could also use the CPU check for CPU cycles and context switches and also latency. So that's, yeah. I think we compare it with the KMU built in device, for example. And it looked like the less context switch for the user application in the guess. Yeah. One of my colleagues who is a computer sound developer device, but completely different. I don't know. I think I'm going to go into details. So he said that the way how good that sound specification is written doesn't allow proper implementation of the device reset functionality. So I just want to ask if you've had any troubles with the device resets or just curious how you've handled that. So the question is that the built-in aspect, rather than built-in sound, doesn't exactly well describe the reset method. That's it. I said that the question is that the built-in sound aspect doesn't explain very well the reset method. That's it. There are some conflicts in the sound. We didn't have that issue yet, at least. And now I tried to remember if we had any feature to call it reset or something like this, but we don't. So maybe we can talk offline if you want. Any more questions? Thank you. Thank you. Thank you.
OpenStack Cluster Installer (aka: OCI): the Debian way to manage your OpenStack deployments
All right. Thanks for coming. So here is what I'm for this talk. Basically, I'm going to describe what my OpenStack closer and Star does. First a bit about myself. So that's me playing with my Atari computer. So I'm a depend developer. It's been like nearly 15 years. I maintain not only the whole of OpenStack since it existed, but also things like self-open this switch, a bit MQ, many, many more stuff. That's probably a bit too much. So I've been working on hosting since the beginning of my career and of a maniac doing cloud computing for the last six years. If you don't know what OpenStack is, maybe you've been living in a cave. But like we used to have this schematic, which is kind of fun because it has all wires all around. It doesn't mean much because when you really know about it, it's a little bit more simple than that. But it handles a lot of projects and it's simply not reasonable to do everything by hand. You have to use some kind of automation at some point. That's what OCI does. So you boot up your computers over PXE. So OpenStack cluster installer provides that. And from bare metal to fully provisioned OpenStack cluster, every single artifact is taken from the Debian archive. So the only thing you need is a Debian mirror and that's about it. Even the pipette manifest that OCI is using are packaged. It's a solution which is kind of mature because it's been five years we're using it. And so all the pictures you will see are actual real pictures from our data centers, by the way. So it supports many types of hardware. I recently added ARM supports because we're putting that in production as well. We use many brands so we do recognize lots of Dell's, Gigabyte, HPEs, Lenovo, Supermicro. And what it does is full hand-free automation. It means you plug in the server, you press the power and it can do everything for you, including IPMI, hardware profiling and red setup. It discovers the location so that it can put your servers in the correct availability zones, this type of things. So at the end of the setup, everything is free SSL encrypted so that even though you are supposed to set that up on a private network, it's still best to have SSL everywhere. And so in OCI, there's many roles, like there's controller, compute network, high-scasy volumes, Ceph, monitor, Ceph OSDs, SwiftProxy, SwiftDoor and DNS. And maybe a bit more, I forgot. So every single computer, you can decide what type of node it's going to be and you can define that in the hardware profile so that the process of enrolling that node will be automated. So we've been using that software in production for five years and a half. So it's not just for fun that I'm doing it. We have real customers and we are making millions out of it. So we have decently large Swift clusters, probably eight clusters in total with like 6,000 hard drives running. And it also powers our public cloud. So it's really a production-ready system that I have uploaded to Debian. So as I said earlier, the overall workflow is that you're going to PXC boot your servers. So it also handles secure boot, meaning that it uses shim, then grub, and then your live system will download the SquashFS system over the network using that's how a live build works, right? And then once the server has booted up, it's going to report all the hardware capabilities of your server. How many hard drives, their size, this type of CPU, all this type of information. So with that hardware discovery, which is kind of simple, it's a simple shell script. It's not like everything in that open stack cluster installer is made in a way so that it's easily hackable. Everybody's going to understand it. It's bash scripts most of the time, some pipettes manifest. There's some PHP, but you mostly do not need to touch it a lot. So once the machine is enrolled, we know it's rolled, we know it's IP address because it has a network manager to assign IP addresses. Then OCI will produce the pipettes node classifier. So that's a huge MF manifest that tells all the parameters to the node. And so that when the machine boots up for the first time in its operating system, then the system knows what to set up on that machine. So it's kind of dynamic because the ENC ML file is generated from the DVE, which you can interact with the CLI. So you can modify the DVE to, I don't know, add a GPU, and then on the next pipette run, it's going to install your GPU or anything like that. We also provide many types of networking options. We used to do L2 connections so that you have a lot of ARP, and then my network guys started to complain about it. So we implemented BGP to the host so that you have an L3, only connect to the tree between the hosts. So I'm not going to describe what BGP to the host is in a lot of details, but basically all the links that you see there are BGP connectivity between the host on the bottom and the switches on top, which provides redundancy because every device that you see is connected to two other devices, and then you can use multiple routes. So the way it's done is that it uses a link local IPv6 connectivity between the two devices, meaning that you will have absolutely no ARP between your whole rack. It's going to be only L2 connectivity over IPv6 on the link local. So it's probably a little bit small, but what you can see in this is... So you have... Here's the types of machines that you have when you do a DMI decode. That's the product name, and here is the switch host names. So data center, row, rack, location, and computer aggregates. So when the servers boot up, they see the switch names, and then I can deduct where they are physically in the data center. So thanks to that, we can classify them in availability zones. That as well is probably a little bit small, but this is the way we classify hardware. So here you give a name, whatever you want, and the role that you want for the machine. A product name can be multiple ones, like if you have many types of compute nodes that works. The amount of RAM, and that's the description of the red layout that you will want. It also supports software RAID if you want. That's what the system automatically setups as a command line, as if you were typing OCLI, CLI, and then some parameters, so it does that for you. And then this makes it enrolling compute nodes into some compute aggregates and availability zones. So once you define that, you can define it for all the roles you have in your cluster, and that's how it does the magic of, okay, server has booted, I'll put it as this role in the cluster in that availability zones, install the operating system, and do everything without touching the keyboard. There's other features that are fancy, so it's maybe going to be a little bit a feature catalog, because it does a lot of things that we actually needed in production. So you can set up a Swift cluster with OCI, without compute, and then you can define a compute cluster with another OCI instance, and then you can connect the two, and if you do that, then you have Glens and Cinderbackup that are going to use that other cluster. The point to do it is that most of the time we set up our Swift cluster in a cross data center way, with one availability in each data center, and that's not really what you want to do with a compute cloud, right? You want everything to be in the same data center, with VMs close to each other, and just define availability zones per rack, for example. So that's the advantage of doing that. So we also support GPUs. For Maniac, we have a huge demand from our customers to use GPUs. You can define as many GPUs as you want per compute, so if you have four, six, eight, it's fine. So we have a picture of some A100 NVIDIA GPUs. So just to activate GPU support, you enter the GPU name, the PCI vendor IDs, I believe these are for the T4 GPUs that you see on the screen. Then you define a NOVA flavor that is going to use the name that you defined on top. So you can have multiple types of GPUs in one server that's also supported. The only thing is that once you've activated GPU, you need to reboot the compute so that it knows it has GPU and therefore it's a blacklist NOVA kernel module. Otherwise, you won't be able to use it with virtualization. There's also support for CPU models. So most of the time, in one cluster, you will define one CPU model for the whole of your clusters. Let's say you have epic CPUs from AMD, you will do that. But then if you have a mix of CPU types, like let's say AMD and Intel, then you can also define it per compute. There's also the possibility to do a hyper-converged model, meaning that basically you put your safe storage in the compute. So I designed it, we tried it, and we were not very happy about the performance of it. So if you don't have a lot of money and it's not really for customer-facing stuff, you can do it, it works, but I do not recommend it if you do large-scale. If you do that, you can also provision things like nutrient dynamic routing agents. So yeah, we also support BGP announced IPs for your VMs, so that's why there's this. So at Infomaniac, we also provide public cloud, therefore we also support telemetry. So telemetry is the name in OpenStack for rating all the resources. That's not the actual billing, that's counting resources. Let's say you've used that type of flavor for two hours, then it's that price. And then billing with actual PDF and such, it's up to you to rate it. So with telemetry, people also can do auto-scaling, that's how it works with heat. So you can basically rate any type of resources that's more OpenStack-free than OCI, though we provide all the things you need to implement it yourself. So telemetry is a huge resource-demanding thing. So I just made a small calculation so that you understand that. If you have 6,000 meters and 20 metrics every five minutes, that's 400 metrics per second. That's a lot of metrics to process, and that's only VMs. In the production system, you won't only build VMs, but many other things like, I don't know, Glance images, Swift objects, public IPs, load balances, some polling will be done. And so all of that takes a lot of resources, and therefore we wrote dedicated roles for it. So there's the messaging nodes where Clouty processor will be hosted. There's the new key API, so new key is the thing that does the time series for the resources. This messaging node, we provisioned it with a lot of cores, so we have three of these nodes with 128 cores, and it handles about 5,000 VMs in one cluster just to give you a rough idea. So it has dedicated RabbitMQ for the notifications, dedicated Galera cluster so that it doesn't interfere with your control plane, and dedicated Ceph as well. So if you decide to do with telemetry, you can set up the special Ceph cluster and these three messaging nodes, but you don't have to. You can also don't set it up, and then it's going to use the three control nodes. Everything is a little bit like that in OCI, so let's say you add some compute nodes, then it's going to provision Nova on the control planes, and if you provision some messaging nodes, then it's going to remove all the new key API and Clouty from the control plane. So that's the rough idea about it. So if you want to test the results, you can try on a FormalActiveDecloud. We're cheaper than everybody that you see over here. And we give you a 300 USD trial for two months. So in the near future, we expect to implement more services like Magnum. If we do that, we are going to do it with the cluster API plugin from the Vex host. Otherwise, we may implement CubaseService, not an open stack with our old Boo solution. So we are still working on it. I can't really tell you. Malilla, we are not going to implement it until the Viettayo driver is done. So I'm not going to go into details, but the generic and the Ceph FS driver, we are not happy about it. We don't think it's pollution ready. And Chauvin is trying maybe for later. So I was scared to have too many slides. So I went a bit fast. I have some time remaining. Before I do that, I may show you a little bit how it works. So there's no actual demo. There's this that shows you how to interact with OCI. So I started working on the web interface, and then quickly I realized it was crap. And then I just did, I work every day with that CLI. So you see there it created a cluster. You can set up many options on the cluster like time server and I don't know, 40 options probably for every machines as well. You can do machine sets. So it turns in loop. What you see here is a virtualize environment that I have a machine with half a terabyte of RAM where I spawn 38 VMs just for OpenStack and nine more where I have a virtual switch environment. So that you can reproduce at home. The virtual switch can do the BGP, so it's fun because from the host you can trace routes to your VMs that are using the OpenStack workload. Okay, that other one. So there you will see a few machines by hand. So machine add three controllers on zones and so on. So I'm open for questions. There's a few minutes remaining. I haven't been in the game for quite a while, so I don't know what to do with the second stage. One of the problems that OpenStack was kind of upgrading. So how do you end up being upgrade between the two? Yeah, so how do I do upgrades? With a simple shell script. So I have written, so when you see OCCLI, it has a bash completion everywhere. And I wrote HAPC, which is HAProxy command that does controlling the backends which one you disable or enable. So it does that to disable the APIs when it's upgrading one node. So I don't know how to explain, but despite what everything said around, I wrote that script and it wasn't that hard. And it's kind of easy when you read it to understand how it works. So you upgrade the... First you upgrade the puppet machine. It's going to upgrade all the puppet manifests. So it disables all puppet everywhere, upgrades that machine, and then it basically does upgrade. Not really in fact, because I've calculated all the OpenStack dependencies, so that's the only thing it's going to upgrade. It's not going to upgrade your OpenVsuit when you do that, for example. And then, yeah, it just works. So I've tested with Tempest the upgrades from Victoria to Bobcat. I'm not completely finished, but it's going to be soon done. So if you don't know Tempest, there's functional testing of OpenStack. Any other question? All right.
Exercising QEMU generated ACPI/SMBIOS tables using Biosbits from within a guest VM.
Thank you. Thank you. Good afternoon, everyone. Thanks for coming to my talk on using bias bits to test key moves, ACPI, and SM bias implementation. My talk is going to be structured around these four points. First, we're going to discuss what's bias bits and why we're using bias bits to test key moves. And then I'll be talking about some of the implementation choices of my test framework. And then I'll describe the test framework itself. And then I'll give a brief overview, depending on how much time I have on the changes that I made in bias bits to get everything working together. So what's bias bits? It's actually a software written by Josh Triplett. He wrote this software after he left Google. And the software had actually a real-life usefulness in the sense that the bias developers and Intel, they used it to test their bias implementations on real physical hardware boxes. And what this software comprises of is that you can exercise ACPI and SM bias objects in the bias directly from a grub environment. And even though it's a grub environment, it also has Python built into it. So you don't have to write tests using Bashish, which is grub's native scripting language. You can write all your tests using Python. And all of this is executed from ring zero. So there is no need to actually go from ring three to ring zero to execute your tests, et cetera. All of the components, that is grub, Python, ACPI, which is what bias bits uses to execute ACPI components. All of these comes together in the form for bootable ISO, which is then used to boot actual physical box or virtual machine, in our case. So this is what it looks like in a most simplest form. You just run Kimu KVM here. Using the bits ISO, and it spawns a virtual machine. It executes a bunch of tests, and then generates the logs, and it pushes the log out of the virtual machine. I'll describe that a little bit later. And then it shuts down the VM. So why use bias bits for testing? Well, first of all, like I said, all the tests you can write are based using Python in a pre-operating system environment. And so that means that we don't have to go through the OS to execute bias components, but we can directly execute ACPI from the grub environment itself. And it has already ACPI CA built in so that we can directly execute ACPI methods. And the current test framework that we have in Kimu is basically what it does is it spawns a VM. It extracts the bias, the ACPI tables from the virtual machine's memory, and then compares those tables with some golden master blobs that is already checked into Kimu repository. And then it compares the golden master blobs with the actual table which is what Kimu is using. And then if there is a difference between the two, it throws an error. So the main idea is that any time we're making changes into Kimu that affects ACPI or some bias tables, we can go through, inspect the changes, and we can make sure that the changes are not breaking anything. But what we don't have is an ability to actually execute the tables from a running VM. And using bias bits gives us the ability to execute the tables. So that's the main advantage of using bias bits. So let's discuss some of the implementation choice of the test work. So bias bits is a software in itself. So it has its own repository. And then we have the Kimu repository. And these two repositories, in the Kimu repository, we have all the changes that basically decide the ACPI implementation. And bias bits repository has all the bias bits specific stuff, like all the build scripts, all its internal logic, and the two things are kind of separate. And adding to the complication is the fact that bias bits has, so George gave up developing on bias bits around circa 2017. And any effort that I made to reach out to him failed, so he didn't respond to my queries. So we couldn't actually directly use the bias bits upstream. So what we had to do is we had to fork the upstream bias bit software and put it in GitLab under the Kimu project, and then make changes to it. And those changes involved a lot of build fixes. So bias bits turns out to be something that is not buildable under the Neo compiler and tool chain because nobody has been maintaining it. So we had to make a lot of changes to make bias bits just build. And then a lot of fixes to get all the parts of the test framework working together, which I'll describe a little bit later. And then we have the Kimu repository that has potentially the changes that affect the tables. And so the people who are actually making changes to the ACPI implementation in Kimu, they care about the Kimu repository. They don't know or understand the bias bits repository. So now we have to decide how these two repositories are going to work together. So one of the questions is, so do we make bias bits repository as a module of the Kimu repository? And there has been a lot of discussions upstream on that. And it turns out that people really hate some modules because of a multitude of reasons. And you can actually look into this thread upstream. And it has a lot of interesting discussion as to why we don't want to have another submodule. So how do we keep the two repositories in sync with each other is an interesting question. And then from developer's point of view, whoever is making changes to, say, ACPI implementation in Kimu, do we make them go back and forth between the two repositories? Say, for example, they make a change in Kimu that affects the tables, and they want to write a test for it. So do they go to the bias bits repository, make the change, build bias bits into an ISO, come back to the Kimu repository, point the test to the new ISO, run the test. Oh, something doesn't work and fail. OK, let's go back to the bias bits repository, make changes, come back to the Kimu repository, and go back and forth. That's kind of complicated. And developers don't like to do that because they don't really care about bias bits. They just want to test. They want to add a test to exercise their changes. Right? So another also going to question is what kind of test framework do we use to write the bias bits tests? Do we use Q-test framework? Or do we use something else like the integration Avogadro test framework? Now, the existing test that I just described before that compares the blobs, it's called Biostable Test, and it's actually a Q-test framework. And people are familiar with that framework, right? Because any time people make changes to SAP implementation, that's the test that fails because it compares tables blobs and it right away fails saying that you have these new changes in the tables. You better have a look at it. So people actually understand how Biostables Test work. But do we use the Q-test framework then? The problem with that is that Q-test framework is really not written for something like spawning a VM, the managing all the issues of VM management, collecting the logs, dealing with errors, and then shutting down the VM, et cetera. So I started writing a Q-test for bias bits, and then I realized that it's not really suitable. So I started then looking into writing a new Python-based test framework for just doing the VM management and then using bias bits with it. And then finally, when I proposed that upstream, then somebody pointed me to the Avogadro framework, and I looked at it, and it was right away, Avogadro framework already had all the libraries that deal with VM management. And all I had to do was just focus on the bias bits part and develop that part. So Avogadro Test framework kind of really nicely fit into what we really wanted to do and what was available already without doing any new development. So finally, we went with the Avogadro Test framework. But then the question is, how do we make people familiar with how to run Avogadro tests? Because not all people are familiar with this test framework. Not all people run integration tests. So then we decided that, OK, how about we write a documentation for bias bits test? And that's what we did. So Kimu repository has documentation to how to run a few simple commands to execute the test framework. So I just described all this stuff. So let's describe what the test framework is all about. Now, before I'll just keep a couple of slides, and I'll show you the diagram here. So like I said, there are two repositories. There is one Kimu repository, and there is one bias bits repository. So in bias bits repository, we want to maintain everything that's related to bias bits and nothing related to Kimu or a testing ACPI. So the way we did it is that in the fork, which is residing right here, we have all these branches in there. Now, the Kimu bits branch is the one where we have made all the changes specific to using bias bits for Kimu. And so there we have a GitLy CI job, which is basically a BAS script that builds bias bits. And as a part of this CI job, so every time you commit any change to bias bits repository, this CI job gets triggered, and it will generate a bunch of build artifacts, which are nothing but like pre-built binaries for things like rub, Python, ACPI, CAC, et cetera. And then all these build artifacts are pushed in a well-defined location. And there is a URL for it, and you can just go and download those artifacts. And so in the Kimu repository, what we do is we, in the Kimu repository, we maintain the actual tests that exercises ACPI and some bias tables. So the actual tests are here in this location that are run from within the bias bits environment. And then there is a main driver to put all these things together. And this is the one, this is the main Avogadro test, ACPI bits.py. So when you are running the bias bits ACPI, S&B bias test, you need to run this guy. And what this guy does is that it pulls in these changes, these test scripts, where you have potentially added new changes for your stuff that has gone into Kimu for ACPI. And then it pulls in these build artifacts. And together it generates an ISO here. And then with this ISO, it spawns a Kimu VM and it runs the tests. Once the test is running, it collects the logs. The logs are pushed out into outside the virtual machine into a well-defined location. This test script then analyzes the logs. And then it says whether it failed or passed, depending on how many tests it ran, whether it looks for certain patterns and says, OK, this test failed or what have you. So basically, this mechanism does two things. First of all, you don't need to go back and forth between the two repositories. Everything that is bias-bit-specific resides here. And if you're not concerned with bias-bits or if you don't care about how it is built or what changes are in there, you don't need to touch this repository. All you need to do is just remain here. So every time you make changes to ACPI implementation, you add corresponding test code in here. And then you run this guy. This guy will pull in your changes, use the existing artifacts, and you run your test. Now, after it runs your test, this has some verbose mode where it puts out more information in case there is a failure. So you can analyze the failure, make changes to these test scripts, and again, rerun this guy. So the advantage is that you are actually not, you're already within the chemo repository in your workspace. You're not going back and forth between the two. And then, because a pre-built artifacts are being used, generation of this ISO is a lot easier because these things need not be built. They're already built for you by the CI job. All you need to do is put these test scripts together with this guy and generate the ISO. So this is what I just described all these points here. And then, so let's look at the advantages which I briefly described. So, so no need to use some modules. There are pre-built artifacts that makes it a lot easier. And then if you need to make changes to the bias table, as to bias bits, you make the changes, build new artifacts, and you point the main test to the new artifacts. And the other advantage is that when you release chemo in turbos, that turbos does not have any bias bit specific binaries. They're completely mentored outside of the chemo repository. So they're completely separate, and you don't need to release chemo with any bias bits artifacts. The disadvantage is that because we're using pre-built binaries, therefore we are very architecture specific. So right now we only support 64-bit X86, and it does not support any other platforms. And supporting other platforms is kind of non-trivial, because you need to make sure bias bits can actually build for those platforms, right? And that is, and bias bits was never tested on platforms other than X86. So it's a non-trivial work anyway, right? And then there is tool dependencies to build the ISO, and the environment where you're running the test should have those tools available. So let's look at the overview of the changes that are in the bias bits fork. So like I said, bias bits was ever maintained after 2017, so I had to make numerous changes to make bias bits build with the latest toolchain and compiler, and changes were across all these guys. And I had to also upgrade a CPI-CA, because a CPI-CA is the main driver that knows about various tables. And if you don't upgrade a CPI-CA, you cannot write tests that uses the newer tables. So I had to upgrade a CPI-CA. I had to find a mechanism to push the logs out so that the test framework can analyze the logs. I had to make sure that the console logs are available. And one other thing is that the Python that runs from within the bias bits VM is still Python 27 and not 3, because upgrading Python is a non-trivial work. And since it is a very closed environment, very controlled environment, I didn't see the value of upgrading Python in that environment. So it is still running Python 2, whereas everything else in Kimu is Python 3. These are some of the useful resources, and you can have a look at those resources. This includes things like the Josh's presentation slides and his talk on bias bits itself, which is a lot more details than what I described about bias bits in this talk, and then the details about the test framework itself, the fork that we maintain here, et cetera. So the last but not the least is, before I talk about demo, is that I would really like to thank these guys. Igor is originally proposed the idea of using bias bits for exercising Kimu with the CPI tables, and so I'm grateful for that. And then all these other guys, they gave various useful feedback throughout the process while I was submitting patches upstream, and I'm grateful to all the reviewers of my batch sets and the entire upstream Kimu community for help. Lastly, if you really want to see a demo, there is no time for this in this presentation, but you can click on this link, and there is a video that describes a lot more details on actually how to run the test and all the scripts within the repository. So thank you so much, and now I can take questions if you have. Yes. I have a question. Yes. What do you mean by Python? I mean, what is that Python? It's just a copy based on the built in Python? No, no, it's Python. The interpreter is built from source. Wooden Biospits, it's actually, the Python is built from source. So Python 2.7 is the one that Biospits uses, and it builds everything because it has to build extensions so that it can integrate with Grub. So from Grub, you can actually run, you can say Pi, and then you can run a Python script. So all that happened because it was built from source with integration with Grub. The only problem is that it's a Python 2.7, and I didn't see the value upgrading it to 3, but you can actually run the whole Python script, and that's how all the tests work, because they're all running from Grub, but they're full-fledged Python 2.7 scripts. So it's a full-fledged one, not only certain API that you can use? No, no, it's a full Python. Any other questions? Thanks. Thank you.
One SDN to connect them all
Okay, so good afternoon. My name is Miguel Duarte. I'm a software engineer working for Red Hat and the OpenShift Virtualization Networking team. Well, in this talk we're going to be discussing an SDN solution for both types of workflows so you can have pods and virtual machines in the same network and the use cases that this SDN provides and a little bit how it works. There are going to be some demos as well. So let's jump to the agenda. All right, so first thing we're going to do is explain the motivation, like what drives us to have like to do this and the actual problem we're trying to solve. From there, there's going to be a little short introduction that depends how deep it is going to be. Well, it depends on a few things. And then I'm going to walk you through the use cases for this SDN solution, show the demos, finalize with the roadmap for the future and with the lessons we've learned during this development. So first thing, how many of you have used or like worked for stuff that has anything to do with Kubernetes? Like yeah, pretty much everyone. How many of you use Kubeverts or know what it is? Well, more than I thought. Okay, cool. So the introduction is not going to be that deep. But yeah, let's start. First thing, going to be discussing the Kubernetes networking model. Like as most of you will know, pretty much it's very simple. And one of its few premises is that any pod that is deployed on the Kubernetes cluster can contact and can reach any other pods in the Kubernetes cluster. Like basically you have cluster-wide communication between whatever type of workloads that are deployed in your cluster. Without NAT, by the way. So another thing you get as a byproduct of that is like VSC and all that, it pretty much it configures a way for you to reach the outside world. So you get free batteries to reach the outside world, to reach the internet. The thing it does not allow you to do is to connect to a pre-existing network. If it's, for instance, I say you want to connect to a database that's deployed on an existing network, well, you're out of luck. Kubernetes does not solve this. More things, if you, for instance, you want to deploy a VNF, for instance, and you require more than an interface, Kubernetes will also not do that for you. There are solutions out of three, but we're not going to go there right now. So the motivation for our talk pretty much comes that you don't have like an entryway for you to access like stuff on physical networks and to get access to additional networks. The default cluster network that you have that comes bundled and Kubernetes gives you for free or whatever Kubernetes distributions you have give you, well, it's not suited for all types of use cases. For instance, for virtualization, like the ones that you, of you that use Qvert, should know like you'll get IP addresses and well, IPAM management in for virtualization is extremely tricky and it will not mix well. Pretty much because you have like different IPs when you migrate from the source to the destination pod and that thing will not play along correctly. And finally, in virtualization, you typically like use secondary networks to do all sorts of east-west communication and you pretty much rely on the default cluster network just to like get batteries of the Kubernetes services, stuff like cluster DNS and things like that. So on your secondary networks, you need to figure out other ways. So you could do like bridge CNI and other types of plugins, but that leaves like your operation teams will need to know how to debug yet another totally different solution. Where admins will need to know and configure yet another like bunch of solutions and depending on the use cases, you'll realize that this plugin will work, but this other will not. So like the matrix of things that your operations team has to know and your administrators need to learn how to configure, it skyrockets like it's and it becomes literally too expensive to actually handle. So now the objectives we have the first is we want these cluster admins to like go do something else like we want to push all the complexity of these different sorts of use cases and this mix and match technology to be pushed from their heads and from this lots of tools that they need to learn and know. We want to push all this complexity to the network. And finally we want to have like a single like a single plugin be able to handle like a multitude of use cases. So pretty much what we want to have is like to have the whatever the CNI that comes bundled with the with our Kubernetes distribution, we want that to be able to to to work properly both for the cluster default network and for the secondaries. Okay, so very short introduction now. So Qvert. Qvert is a Kubernetes plugin that allows you to run virtual machines inside pods. So you basically get two different types of workloads. You have pods and VMs and like you manage them from the same from the same solution. As a like an implementation detail like the virtual machine actually runs inside of the pod. And each pod has like a live version instance running in it and the QML process and all that and like just to finalize you have like the networking requirements that virtual machine has is a lot more than a pod like a pod is something like entirely stateful. It's like a cattle you just kill it and a new one will spawn and do it a new thing while a VM is stateful and you need to treat it very careful carefully. Now kind of a little disclaimer our SDN solution that we developed uses Oven and Oven stands for open virtual network and it is essentially like an SDN control an SDN control plane to open V switch. So you have like a bunch of you have open V switch installed in each of the nodes of the cluster. You have Oven on top that is kind of rendering open flow and installing it in each of the open V switches in the nodes from like higher level entities like you have things like a logical switch that grants you connect L2 connectivity between the workloads on these two nodes and this thing gets rendered into open flow and installed to the nodes. Then we have Oven Kubernetes on top of it. It's a CNI plugin that what it does is translate from Kubernetes entities into Oven logical entities. So for instance when we have like a secondary network what we end up having is a logical switch when you have like pod attachments what we have is logical switch ports that are connected to the logical switches and for instance in network policy is nothing more than like a port group that associates a list of ports of logical switch ports to a bunch of ACLs. Alright, so supported use cases as I said in the initially in the motivation section for virtualization use cases mostly you do not rely on the default cluster network to do east-west communication. You use secondary networks. So that's the first use case we are focused on is east-west communication. So as you can see here like pretty much what we end up having like these things here are pods or virtual machines it doesn't matter what we actually end up doing is attaching a new internet network interface to it configure it and what we get is like the logical view of having like them connected via a cluster wide switch. So that's literally what we get a cluster wide switch a connection to this cluster wide switch and everybody that is disconnected that is connected to it can communicate across that network. There's a short demo right here and we'll see that oh god no internet I knew that that's why I have this terminal here. I'm really sorry for the font size but if I put this a little bit bigger it'll basically mess up like the window configuration. I hope you can still see it. So first thing we're going to see is what the network configuration is. I'm not sure if you're used to using MULTUS the ones of you that use Kieferd I guess they are so this is the first thing that we need to look at the network attachment definition. This explains this pretty much holds the CNI configuration from which the CNI plugin will just configure networking for your pod. So the interesting thing here is pretty much like the name of the network so the idea here is that the networks are not namespaced but the network attachments are. So this means that if your network admin wants to grant a namespace connection to a network he needs to provision or she needs to provision one of these like in the proper namespace and this will connect your namespace to this network. Finally the oops sorry it does not go back. Another interesting thing that was there was the topology which is layer 2 so pretty much what we have is an overlay network this is totally disconnected from the physical network that allows you to have like east-west communication. And we do not have IPAM because IPAM for work loads is very tricky and we'll see more on that later on. So we're connecting two different namespaces as I said. Now we're going to provision these into the cluster like fun fact this is like all lazy so I just put like a bunch of stuff in the cluster but nothing happened yet. It's just like the definitions are provisioned and nothing else and now we're going to show the workload definitions. Here they are so we have one virtual machine remember we do not have IPAM in the network so we need to configure the IP statically you have there in the bottom we configure the IP statically using cloud init in the VM and that's its IP 192, 168, 200 dot 10. And then we have like our pod we have two pods here specified we have the blue server pod and you have like a yellow server pod and the blue pod has the dot 20 IP that we configure using the network selection elements this is like multis lingo and what this is doing is exposing like an HTTP server on port 9000 and what we want to do is so we have two servers the blue and the yellow and we have the VM is the client and we're going to be curling to each of the servers by their hostname so that's what we're going to see in well let's first provision this the windows in the bottom are so we can kind of see when they're ready so the old service ready the tenant so the VM is booting up let's speed up this part so we're now going to log in via console to the virtual machine and we're going to curl both the servers and they're going to echo back their hostnames. So God it's going to take forever if only I had internet and I could play a video oh God. Does anybody know how you can ask an emma tell it wow amazing I don't know. It stopped again. Okay it's playing cool so yeah login via console I hope it did not stop what's happening so yeah I should have known better but again we're going to log via console the UI of this thing is absolutely preposterous like I don't know if it's playing or not. Okay so yeah you get curl to the dot 20 IP address it replies with the blue server thing we do the same thing to the dot 30 IP address it replies with its hostname which is the yellow server this concludes the first demo which shows us like east-west communication between well different workloads in different namespaces. Now going to the second use case which now we want to have like remember the motivation slides where I mentioned stuff about accessing things on a pre-existing physical network that's exactly what we're going to be seeing now so as before we see like we have a logical view of a cluster wide switch the difference is that this switch is actually connected to a physical network and you can access stuff that's there in our example it's going to be like a database that has well the data the VM needs. The first thing we need to kind of elaborate a little bit is that you need to configure the physical network so first of all it's not something that a typical user will get access to it needs to happen by a cluster admin and for that we're going to be using two tools first NM state and Kubernetes NM state. NM state is basically like a declarative tool that configures networking you just give it like the desired state I want my network to be like this and it's going to go while punching buttons attempting to make the current state be what you desire to be if it fails it rolls everything back so no changes to your network so it cannot destroy it and if it succeeds it'll tell you you succeeded so basically what we want to do is use Kubernetes NM state which is kind of a cluster wide thing send YAML specification to the cluster and all my network specification will be applied in all the nodes in our cluster so it would look like this so in the in the left we have an example of a policy and in the right we have like a diagram of the topology we're trying to do here so if you look here we this is going to be applied to all the Kubernetes worker nodes because of this node selector that we have here and what we're going to do is create an OVS bridge in each of these worker nodes attach this ENS for interface to the OVS bridge and then using this these oven bridge mappings in the bottom we're going we're saying that we want traffic from the network called default network to be sent to the OVS bridge called BRX and we want the traffic from the tenant blue physical network to be sent to the OVS one bridge it's literally what you diagram you see there in the right now we are granting access to from workloads to the physical network you should not mean should fret carefully when you do this and for that we need to have like micro segmentation this pretty much is like what you have in for the primary network for the cluster default network network policies this is the exact same thing but applied to secondary networks so in our example what we want to have is like a virtual machine that wants data from the database but we do not allow it to actually consume the data directly from the database so we expose that information from a pod so the pod actually can connect to the database and expose this information from a RESTful API over port 9000 so this is kind of what we want to do ensure that the VM cannot reach the database directly over the port the posgrasql port but and ensure you can using well this tiny pod as its proxy data proxy so again another demo this is going to be a disaster I have tempted to tell you to just check this at home but how many times we have more than five minutes right again this does not work sorry it's the other cast so now in this demo what we do have we do have two namespaces data consumer and data adapter we just provisioned them let's first thing some information like this I'm running a kind cluster here and I botched this again so I'm running a kind cluster here and so my Kubernetes nodes are running basic as containers in in my laptop and so the physical network that we see in the diagram is basically like my laptop it's going to be connected by a Linux bridge in the in the laptop and for that I need to kind of since I'm using a VLAN I'm going to need to pre provision the VLAN and that's the interface you see here in the bottom like this podman 1.123 it's a VLAN on top of the Linux bridge management interface I'm going to show this again and so again VLAN 1.123 that's subnet and we have a database running in containerized database running here and we are going you see we have access to the database now let's check our manifests really sorry so I think you should check the demo at home and we have five minutes yeah I don't think this is going to work please do check the demo at home but pretty much what you'll see is what I showed in this diagram so you'll have access on one port you'll have access direct access from the VM to the database like you can PSQL to the data directly and to get the data from using HTTP from the pod and then I provision some policies and then you stop having access to the database and that's pretty much it now roadmap what what are we going for next first thing we need to have is like IPAM for the workloads so we need to find a way to tie the IP allocation to the virtual machine and not to the pod where the VM runs like remember the our big issue where the virtualization is migration and that means that when you migrate the VM to a new node the pod gets a new interface the VM is still with the old interface and basically networking is not properly configured we wanted to address that first and that will unlock the next thing in our in our queue which is selective selective policies so our kind of policies for the secondary networks right now you can only use IP blocks you cannot use like semantic things like I want to allow all workloads from this namespace we're having these labels to access this sort of workload you cannot do that you need to specify IP ranges directly once we have these two things we're going for services next like we want to have like exposed via services like egress from VMs and to have like load balancer services so that you can access them directly over the secondary networks finally self-service networks this is instead of having the cluster admin provision these for you since network overlay a simple overlay that does not touch the physical network you could directly use like a self-service functionality to just create the network yourself and connect and provide east-west connectivity to your workloads okay well lessons learned this was a let's say a joint venture like or a collaboration between both Red Hat and NVIDIA and the fun thing is we had the exact use cases in mind like both of them focusing on Qvert but with different scope so we are a lot more into the generic kind of world we want we give you a platform and you do whatever you want with your platform and NVIDIA notes they have like their use case in mind which is I guess gaming in a data center and their tooling is a lot more let's say pointed in that direction but was a really nice collaboration and we're hoping to see more in the future on a less positive note we get the user experience of this is not amazing like it's better than originally intended because like for instance the thing I've showed you about the the nm state policy that was something that we came up with because we felt like this feature is entirely unusable like people are going to be breaking their default cluster network every time they use this or risk doing that so we've provided that but we still have some sort of nightmarish stories every now and then because of the the way network attachment definitions look like and how easy it is for you to get things wrong and how silently and how these silent errors kind of creep up it's absolutely insane like sometimes things work but not in the way you expect because it just doesn't recognize one of the parameters because you typed it wrong but everything else works it's insanely hard and yeah I'm really sorry about the demo but yeah thank you for listening and if you have any questions in one minute one question it's the same thing yeah so the question yeah so the question is basically there's another cni so we're doing this in oven Kubernetes and there's another cni cni called cube oven so it's kind of it really screams that it does the same thing and yeah it really does the same thing the use cases are mostly the same the thing is that they do a lot more than we do like quite honestly like their feature set is a lot more complete than ours and we're trying to get there like if your question is like why didn't you just use that well we do not have any let's say current stake in that cni we do not have maintainers there we do not like we would have to try to gain entry there and in some cases we do not like their API so we're trying to do things in another way it might seem like we're reinventing the wheel sometimes but yeah kind of we kind of are like we both do the same thing and their feature set is more complete but again we're trying to do more and reach their feature set thank you for the question and I think it's that's it well
Deploy Kubernetes... From Kubernetes: an overview of Cluster API
You're famous. Yeah. That's it. Close the door behind you. Okay. Okay, let's go. I hope you hear me correctly. People in the side and in the room and people online. So, yeah, let's begin. Okay. Okay, let's go. I hope you hear me correctly. People in the side and in the room and people online. So, yeah, let's begin. Thanks for them for having me today and to talk about cluster API and Kubernetes. Thanks to you to come here. That's quite impressive to see the room being fully packed. Yeah, I hope you will learn things. I hope you will discover things. That's the most important. And you will get some stuff to continue to investigate at home. So, the goal of this talk is to give you a brief introduction to cluster API. To give you a brief introduction to cluster API. So, yeah, let's begin. Thanks for having me today and to talk about cluster API and Kubernetes. So, yeah, let's begin. I hope you have a great day. Thanks for having me. Thank you. Thank you. Thank you very much. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thanks for joining. Let me quickly introduce myself. I work as a flat car engineer inside Microsoft. Flat car is an operating system designed to run container workloads. If you want to learn more about flat car you can go at 5.15 see my teammate Thilo talk about flat car. It's a deep dive introduction. And it will give you the key elements about this operating system. But that's not the purpose of this today. Outside of work like SRE France, which is a DevOps association where we organize meetups and conferences in France and in Paris and in France. So if you want to talk in a meetup or if you're interested to organize something, let me know. Context, the context is Kubernetes. Kubernetes is quite the answer to everything today. So if you want to deploy something small, something big, there is likely a big chance that you're going to use Kubernetes. So to me, it becomes a great standard, I think we can say this term. So yeah, that's the cool thing with Kubernetes. You can deploy a small thing and big things and it works. And it works in the same way if it's a big thing or if it's small thing. Something to know about Kubernetes is that you have two aspects of this technology. You can consume Kubernetes, means you deploy your application on it and that's cool. And you have also to deploy and maintain the Kubernetes cluster. So you can do both if you want to. You can do only one aspect of the other. But today, let's focus on the deploy and maintain Kubernetes cluster and not how to use Kubernetes cluster. Two or three weeks ago, I was on Twitter checking some news, what's going on in the tech industry. And I've seen a tweet of a person I've met in different conferences. A tweet about, hey guys, what if I write a book to describe all the ways to deploy Kubernetes. So it was an idea like that and he got some traction in the end about this idea and he started to draft a list of all the ways to deploy Kubernetes. So the knee is at the first day of the currently. So if you want to talk with him about his book or if you want to invest in his book, it's a great opportunity to meet him. He has a talk in the Go Dev Room this afternoon. But we're not talking about Go, we talk about Kubernetes and the 50 shades of deploying Kubernetes. So you can use binaries, you can use managed services, you can use platform, you can use a bunch of ways to deploy Kubernetes. But today, let's have a look on the line 27 or 26, something like that, it's the cluster API. Cluster API, if I quote documentation, it's Kubernetes, a project focused on providing, you can read. The most important is the last line, the cluster API project, use Kubernetes types APIs and patterns to automate cluster life cycle. In other words, use Kubernetes to deploy Kubernetes. So that's the cool thing with Kubernetes is you can extend this technology using CRDs or custom resource definition. So you can extend the technology and you can leverage, you can benefit from the reconciliation loop, for example, the Kubernetes for what you want to do. It's already available for the basic usage of Kubernetes, but you have over projects like a cross plane that will leverage this way of managing the life cycle on the provider side. So cluster API is this kind of stuff. So if we take a look on really abstract way of how does it work, you have two clusters. On the left, it's the management cluster, this is the pilot of everything. This is where things happen. And you have the workload cluster, this is where you run to run your workload. So your website, your SaaS, whatever, it will run in the workload cluster. This is what you currently do if you do some Kubernetes cluster. But before that, we have the management cluster. So you're going to tell the management cluster, hey, I need a cluster in these providers, please deploy everything I need to have to run this Kubernetes cluster. Because to deploy a Kubernetes cluster, you need networks, you need security groups, you need a bunch of things to create on the provider. Well, the management cluster will take care of that, and it will deploy things for you, and you don't have to do anything. So that's the way to see things. And in this example, my management cluster is running with Kubernetes in Docker, kind. So this is pretty convenient because I can run my management cluster on my local laptop, on tiny resource thing, because I just need to deploy one single control plane. I don't need to have high availability and stuff like that. I just want to use the Kubernetes APIs. And the workload cluster in this case is running on OpenStack just for illustrating. So as long as you have a network connectivity between those two clusters, and you have credentials, of course, it will work. So and you can even decide to migrate the management cluster from one cluster to another, but that's something else. That's the key elements to understand and to know if you want to understand the concept of cluster API. So this is my kind cluster, so I just have one single control plane running. That's it. Nothing fancy, nothing to do. Just kind create cluster, and I have my management cluster. Nothing to install on it on top of this. Just a regular Kubernetes cluster. Really simple. Now, how can I create things on my cloud provider using cluster API? Well, for people that already knows Terraform, that already knows cross plane, all these kind of projects, you know that there is no secret. You need to know the APIs of the cloud provider to implement them, to consume them, and to create what we call a contract. So this is the border between the cluster API logos and the cloud providers. So you need to teach cluster API. Okay, so in cluster API, we say that a network, it's this thing. So a network on GCP will be this thing, on OpenStack, it will be this thing, and so forth, and so forth. So yeah, the idea is to teach cluster API how to manipulate and how to deploy resources on the cloud providers. And for this, we use what we call a cluster API providers. So on the Kubernetes SIGS sub project on GitHub repository, you can see all the various providers supported. So there is OpenStack, GCP, Public Cloud, on-premise. So it's a tinker bell on the upper right. So yeah, you have a bunch of providers and if you have some knowledge in Go programming, if you have some knowledge in API consumption and stuff like that, feel free to start to contribute on this provider because this is a cool way to start contributing to Kubernetes and Kubernetes ecosystem. So yeah, that's the idea under the hood, what's going on. And now I have my management cluster. I need to create my workload cluster configuration. So I just use the cluster CTL, cluster Cuddle, whatever you call it, command to generate this YAML configuration file. And I just provide a few key elements, the flavor, the Kubernetes version, how many control plane I want, how many workers I want. One interesting thing is the flavor. So cluster API relies on templates. So these templates are provided by the maintenance of the providers. So for example, the flat card template will deploy a workload cluster based on flat card images. You have some flavor, for example, on the open stack with load balancing, if you need some load balancing services and stuff like that. So flavor is a way to customize your deployment of your workload cluster. You will still have a workload cluster in the end, you get a Kubernetes cluster, that's fine, but you can decide to customize it. So this flavor, this variant, are tested using end-to-end testing. So each time there is a new release of the providers, you can be sure that it passed the CI, so you can safely update your configuration. Of course, for clarity, I didn't mention that you need to provide a few more environment variables to, for example, to provide the credentials. Of course, cluster API is going to create some things on GCP, on AWS, on the open stack, whatever. It needs to get access to this infrastructure, so it needs to get the credentials. So this is an example of things you can pass, but you can also define which instance size you want to use for your control plane, which instance size you want to use for your walkers. So this is the kind of elements that you need to configure previously calling this command. But yeah, just for demo purpose, I wanted to show you this command line, which is the bare minimum to generate the cluster configuration. And now we have the KAPI quick start.tml file. We can apply it like any over Kubernetes manifest file. So KAPI Ctl, KAPI Ctl, apply KAPI quick start.tml. And it will create, as usual, some resources on my management cluster. So we can see that there is the cluster definition, there's machine deployment. So this is something common to cluster API. Then we have the open stack specific part. And this from one provider to the other, of course, the output will be different. But that's the idea, you just apply this. So that's pretty convenient because you have a file. So you can use this file in a Git repository. You can use this file in a CI. You can use this file with whatever you want. So you have an infrastructure as code in term of cluster API. Now, if I check on the provider side, I have some resources that are going to be created by themselves. Not by themselves, by cluster API. But you can see that I have some instances. So I asked for one control plane and three worker nodes. So we can see that I have four instances between being created. I have a network, I have security groups, I have stuff, SSH keys. So this is for open stack, but once again, it's the same thing for every provider. But this is the cool thing about cluster API is that it does not just deploy a cluster. It deploys everything to create a cluster. It's instance, the security groups, the firewall rules, ingress, egress, whatever you need. So it works in this way. When everything is up and running, you will just get your configuration that you can inject into Qubectl and then you have a new cluster ready to be used. So that's it about open stack. Now, we can ask yourself what's under the hood on the operating system side. I'm a factor engineer. I work in the operating system field. So I'm curious to know what's power my nodes. So with Qubectl, we can inspect the nodes and see that for example, this one is running Flakar because I asked for Flakar variant, but for example, with Flakar, we do not ship QADM. We do not ship Qubelet service. We do not ship MISC files. So how my nodes can start acting like a Kubernetes node. How things can work. And on top of that, Flakar is immutable. There is no package manager. So there is no way cluster API is going to SSH into that node. It say, okay, APT install QADM. No, no, no. So what's the magic behind? It's another project called the Image Builder. So it's on the Kubernetes 6 GitHub repository. That is the Image Builder project. So the idea is to take an OS, for example, Ubuntu, to build it with Pacer. So nothing new under the sun. And to inject the QADM, the container runtime, the MISC files, whatever you need to power Kubernetes nodes. So it's a three step thing. You take your OS, you inject the Kubernetes components, and then you export this new image, this golden image, like we sometimes call, into your providers of your choice. Open stack, GCP, AWS. So you understand that something quite complicated because in order to use cluster API, you need to use this kind of image. So it means I can wait for someone from the community to build it. The build of the image is not an issue. Everyone can build image. It's more about the storage. Because storing an image, it's something, but when you have to store image for each Kubernetes version, so it's three main versions at each time. So three Kubernetes version, then I have to keep the image for each cloud provider I wanna use, and I have to keep an image for each different version of Ubuntu. It can be complicated to store everything and to have the time and the energy to build these images. But this is what we currently do with this provider. So that's, I will not say this is the way to do, but this is commonly done currently in the open source world. So we can think about something, an alternative. And the truth spirit to me of the open source world is to have alternative. So there is no one way or one other way to do things, there is alternative. Then you choose which one is the best for you. So an alternative would be, okay, I take a Linux based OS, for example, Ubuntu, Flat Car, whatever. It's already available on GCP. It's already available on AWS. It's already available on Digital Assign Azure. Because these cloud providers provide these images for you. So just the vanilla image, a fresh image is already available. So what if now we download the Kubernetes components during the boot of the image? And in the end, we have the same result. We have a Linux based OS with the Mixed file, with the QBGM, everything I need to power my nodes. So this is something we implemented on the open stack side. So you just need to use an over flavor. It's SystemD CZex, Flat Car dash CZex, sorry. And it leveraged this new feature of SystemD called SystemD CZex. Basically SystemD CZex, it's an image, raw, squash fs image that you're going to mount as an overlay on the Linux based system. And it will bring you new binary files, new configuration files into your system. So if you want to have a look to SystemD CZex, I really encourage you to check this new features from SystemD and this is what basically we're going to do with this flavor. It's during the boot, we're going to download QBGM SystemD CZex image and everything will be open running to power my node. One, just for example, if I SSH on the node, I can just run SystemD CZex list and it will give me the output of Kubernetes image being available. So what's, what it's cool with this approach is that you remove the strong bounding between the Kubernetes version and the image version. So if you want to update Kubernetes but you don't want to update your base.OS, you can. If you want to update your base.OS but you don't want to update Kubernetes, you can. Before that, you were supposed to build new images and stuff like that. And the cool thing is that SystemD CZex is, it works in the same way on AWS, on Azure, on premise, on whatever. So you have just one configuration for all the cloud provider. So that's something to keep in mind. And we discussed with cluster API folks to see what could be the new approach of this. We also attend some office hours of the cluster API or ecosystem to make some demo of this. But it's already available on OpenStack and we hope it will be available in the next, in the over providers. A few resources, if you want to continue, check this at home. You can have of course the cluster API website, the cluster API OpenStack. This is for the example I've shown, Flattar and SystemD CZex, which is the main outlook in the end of this talk, is SystemD CZex. But yeah, so to conclude, I would say this talk is to give you the key elements, what's going on in cluster API, how does it work, but also to give you an overview of what we currently working on on this aspect of cluster API. So yeah, thanks. And of course I forgot to add the QR code, but you can find them on the FOSDEM website. And yeah, thanks for your attention, if you have any questions. So if you have any questions, or maybe on the chat, if there is some. I didn't see anything, but we can start with you. Yeah, do we have a mic? No, you will pretty much. Okay, we repeat the question. Okay, I have a philosophical question about cluster API. What's the life cycle of Kubernetes cluster to you? It's long running, you upgraded, or for it just temporarily, you just replace with a new one. So the question is about the life cycle of the cluster. Do we need to replace each node when there is a new version? That's correct? Yeah, but what is the intended purpose? It's like long life cluster, so it's like just for short term. Well, the goal is to have, yeah, okay. So the question is, do we use cluster API to have long term usage or short term usage? I'd say that cluster API is in the philosophy of leveraging Kubernetes, which means using the reconciliation loop of Kubernetes. You can just things that by themselves and if there is, I don't know, a network, so instance that going down, it can be restarted by the management cluster that will take care of, there is a state and you still want to be in this state. Because with Terraform, for example, there is a state, of course, but it's not live. It's a static state. So if there is some things missing, you need to rerun, plan apply, to be sure that your things get back. So that's one of the difference. Yes, Mr. Why cluster CTL to generate a template instead of using Helm templating? That's a great question. That's a great question. As you can see, can you repeat? Yeah, so why use cluster CTL CLI tool instead of using Helm or customizer or stuff like that? So the idea of cluster CTL, it provides some sugar on top of the common generation. So you can manage your clusters. So that's one of the features. But you can also have variable injection. When you generate a template, it will check if there is some missing variables required by the providers. So I think you can perform the same thing with other tools, but cluster CTL is just handy and you have it in this way to just be sure that you don't miss an environment variable to configure your deployment. Yes? In terms of the overlay or the flat card, how hard or easy is it to build custom overlays? Say if you've got OEM integration, what's the tooling to support that? So the question is about the overlay and how to build these images. If I understand correctly, the system is CZX. So you don't have to build, because we provide them on... You've got to say you wrote custom security... Yeah, you can... We've all decided to fork the repository I mentioned, which is called FlatCard CZX bakery, where we provide these images. So you can fork and do your stuff and why not send a peer, if it's something relevant for the community. And it's just a matter of SquashFS. If you have SquashFS utility tools on your system, you can just build your images. Basically, everything will be in a directories, then you convert these directories to a SquashFS image. Yes? Does the machine deployment controller do any sort of like... If forcing reconciliation, so if you were to delete the instance in OpenStack, would it be created? Yeah, not immediately, but in a few seconds, few minutes after it, we say, okay, I have to get four machines, I have only three. One missing is a walker node, so I need to go to habit. So the question was about... The question was about if there is some instance that is, that disappear from the OpenStack or the provider's dashboard, does the management cluster restart the instance? Yes? As a Kubernetes admin, I really love to manage my Kubernetes classes with Kubernetes resources, but I always wanted to bootstrapping problems like how do I manage my management cluster? I mean, I had so many projects, but I'd love to use cluster API, and it never makes sense, because in the end, I end up using CubeSpray for the management cluster, and I can just add it using CubeSpray. Yeah, so I think the bootstrap issue, they got the same question. So the question is about how do you create the management cluster? So I think this logo is well representative of the issue. It's the torter, the torter that you stack, and in the end, there is no answer, because your management cluster, you can define to handle it with another management cluster. So you can change cluster API if you want to, why not? That's something you can consider. And on the new workload cluster, you can just say, okay, now it's a management cluster. So I'm going to deploy cluster API on the workload cluster. That, of course, is just theoretical. There is no practical way to do that, and it's not the point. But your management cluster, that's what I say, you can use really something simple. I think I see that there is this new Kubernetes tool that you can use. It's like deployment without Cubelet or something like that, so maybe, yes, because you just need the APIs in the end of the Kubernetes. And so why not try to come with something like that, to just deploy a set of API, and that's it. But you can, yeah, we can do things like that. You can decide to use kind, for example. That's the best way to deploy things for the management cluster. Time's up. Okay, thanks everyone. Have a great day. Thank you. Thank you very much. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Ooh. Is it all right? How's that? Cold share. It's all right. You gonna have it? Yeah. Oh, yeah. Do you also have to use to the... You'll see it, so. I think we're gonna have to do the... So... We can enable hotspot for the... I think we'll stop here. Yeah, let's go here. Do you have presentations? Yeah.
Operating Kubernetes Across Hypervisors with Cluster API & GitOps
Hi everyone. Hi everyone. Welcome to this talk on Cluster API across hypervisors and with GitOps. So we've got a lot of the hype words in there. My name's Richard. I'm a maintainer of various Cluster API providers, more notably the AWS provider, the GCP provider and the RKE2 provider. Hey, I'm Alex. I work together with Richard at SUSE. I'm a KPI operator maintainer and also maintaining RKE2 provider. So today we are going to talk about Cluster API. This is only for the stream. But just to speak louder. So today we are going to speak about Cluster API, GitOps and couple virtualization providers. So we'll briefly talk about what is KPI. There was a previous talk about this. But just in case you haven't been there, we will repeat something again. We will tell you and show how Proxmox will integrate with KPI and how GitOps can be added to there. And then we'll replicate the same process with KubeVirt to show that KPI can work with different infrastructure providers. Cool. So all the demos, well the two demos in this session are available via this repo. Feel free to take a picture of it. It's got the full script for it. So you can actually run this yourselves when you get home. I'll leave that out for a second. So who was in the last talk about the intro to Cluster API? Okay, cool. So you get the idea that you have a management cluster. Oh, yep, sorry. You have an idea that you have a management cluster and to that management cluster you install Cluster API. Now Cluster API is made up of the core controllers and a bunch of providers. And you can mix and match those providers to meet your needs. So if you have your provision in AWS, you just install the AWS provider. Once you have that, you then declare the shape of your cluster. It's fully declarative using Kate's custom resources. And you apply that to the management cluster. Then Cluster API does its magic and then it will provision the infrastructure and then bootstrap Kubernetes on top of that infrastructure. So we're going to demonstrate how it works on Proxmox. So just a couple of words about this in case you don't know what it is. It's a virtualization platform. It's open source and includes anything you need for your virtualization purposes. One thing to note is there are two providers for Proxmox if you go out there. So one requires you to have essentially a template pre-created within your environment. The other one will essentially just take a bare OS and it will install everything on top. We are using the one that requires a template. Yeah, so we made a diagram of how our cluster will look like in terms of cluster API. So everything you see there are Kubernetes resources. And all these resources, they represent the cluster, the Proxmox cluster we are going to use. So we'll have the main entities, of course our cluster. It will reference the infrastructure and also reference the control planes and the way how they should look like on the Proxmox environment. And then another resource is machine deployment, which is used for worker machines and it also should reference a template of how it's going to look like on Proxmox and also some configuration for bootstrapping Kubernetes over there. Cool, so over to the demo. So we were going to do the demo live, but actually the network is not being nice to us. So luckily we did record it. So let me just set this up. That's what I'm going to do. Can I do full screen? Is that a visit? Yeah, that's what I tried. Obviously didn't try hard enough. So hopefully you can see this all right. So this is just shown initially the repo that that link showed before. In that repo there are two branches, one for the Kubevert side and one for the Proxmox side. So we're just going to use obviously the Proxmox branch here. And in that you can see all of the artifacts that we would have used in a live demo and if you're going to use this yourself. So moving on then to the pre-rex. So as I mentioned, you are required if you're going to do this yourself to have a template in your Proxmox environment. So the way that you do this, if you want to do it in an automated way, so you can use the Kubernetes image builder project and that has specific make targets that will provision and build that base image for you. And actually what we should see in a minute is I should change to that window and you can see it here. So the Virtual Machine 100 has been built using the Kubernetes image builder project. So that's got everything on there required to bootstrap Kubernetes. So it's got versions of KubeADM, et cetera, already baked into that VM and it's been marked as a template within Proxmox. Yeah. Cool. So the basic flow is we're going to create the management cluster. Sorry, we're going to create the management cluster. We're going to then install GitOps agent on there and then we're going to create a cluster. So I'm just going to fast forward here because this is great. We're using kind for our management cluster. So if I just fast forward, just preloading a bunch of images onto there. The idea being it would have made the demo a lot quicker. So I'm going to start canines in another environment, another window so you can see actually what is getting installed. So this is a plain vanilla Kubernetes cluster at this moment in time. One thing to note, if you're going through the instructions at a later time, we've made a slight config change to the cluster cut all utility configuration so that we can install an IPAN provider. So probably in the last session you went through the different provider types. The main ones are the control plane provider, infrastructure provider and bootstrap provider, but the newer provider types are the IPAN provider which is especially useful for virtualized and bare metal type scenarios and also the add-on provider type. So the way that you create a management cluster is with cluster cut all. One thing to note here is we're specifying version numbers. That was purely just to pin the versions so that we could load the images but you don't have to do that in your environment. And this will go away and install all of the providers and core Cappy into this, turning it into a Cappy management cluster. So if we fast forward a bit, you can see them installed in now and you can see the IPAN provider at the top there and the Proxmox provider. So the next step, so we've got a management cluster. So we want to use GitOps in this scenario. So you can use whatever GitOps agent you want. So we're going to be using fleet, but you could equally apply these steps with slack modifications if you wanted to use flux, Argo CD, whatever your choice is. But we're using fleet so we just need to quickly install fleet, a couple of Helm commands and we'll have that there. So we can fast forward a bit. So now we have the GitOps agent in our cluster. We can start using GitOps to provision clusters. And this is where I guess the mixture of cluster API and GitOps comes really interesting because you then can create clusters via a pull request, which opens up to all sorts of different types of scenarios. And it also means you can perform all of the operations against that cluster via pull requests. So you have the full history of the changes. You can roll back and all of those types of things. If you're using GitOps, you're used to with your applications, but you can now apply it to your actual clusters in the cluster lifecycle. So in the repo, you'll see two folders. Funny enough, the one with the cluster definitions in is in the clusters folder. It's just got the one cluster definition in there. So we're going to bring it up now to have a look at what it is. So it's just pure YAML. It describes the shape of your cluster. There's different resources to represent different parts of the cluster, whether that's the control plane or the worker machines. And it matches the diagram that we showed before in the presentation. Basically, this YAML is what you saw in the diagram, but not visualized. So two things to note here. Just highlighting the fact that we are using the Proxmox, so you will have resource kinds dependent on your infrastructure provider here and likewise for the other type of providers. So there's a couple of things we want to note there. So just highlighting some labels here. If you just remember these labels say CNI Calico, we'll come back to that in a bit. And then we just see some various other aspects. One thing to note, we're also using CubeVip here. So in this type of environment, you need some sort of load balancer so that you can get to the API server. So we're just using CubeVip as an easy way to do that, and it uses gratuitous ARP. So if the control plane machine that is currently hosted on crashes, it will move across and it will start advertising the address from another control plane machine. So it's quite a nice setup. So we can fast forward there. So here you can just see the shape of the VMs that we want, the specifications, so this could be whatever you want. One thing to note is you'll see the template idea at the bottom, which says 100, so that will have to match the template that was created via the ImageBuilder process. If they don't, then things don't work. So we require a small amount of configuration for Fleet, and this will be the same for other GitOps agents. So in this file, we call it the Git repo, and this just tells Fleet about, hey, go to this source repo, download everything in there and apply it to the cluster. So you'll just see that the repo URL, the branch that we require, so we're on the Proximax branch, and then potentially any paths or secrets that are required to access that cluster. Cool, now we've done that. We've applied that to the cluster, so it's going to bring all those cluster definitions into our management cluster, and then hopefully we start to get virtual machines being started and that cluster will be formed. Maybe. Cool, so you can see now that automatically, the cluster API has created machines here for you. So you'll see that there's one machine for the control plane and one machine for the machine deployment or the worker nodes, and you can see that the one has started to move to provisioning. What that basically means is it's going to provision the infrastructure and then start to bootstrap Kubernetes. So what does this mean from a Proximax point of view? People with really good eyesight will probably see that there's a new VM starting up, so you can see it in the events at the bottom there, a VM 104, and you'll see it on the side in the viewer. So this is being orchestrated by the cluster API provider. So it's talking directly to the Proximax API and saying, hey, create me a VM, I'm going to use it for this control plane machine. Now this part does take a while, so we're going to have to skip quite some way through. We'll just get it to the point where you bring up the console, so you can see it's using Ubuntu, and if we fast forward a little bit, eventually you'll start to see, essentially, cloud init will kick in, and depending on how you configure the bootstrap providers, it will use either cloud init or ignition currently. This is using cloud init, so you'll start to see cloud init run in, and that will essentially be running the commands to bootstrap Kubernetes on top of this VM using QVADM in this instance. Oh, we missed it. You'll see it, it will come up. So it does come up, and you can see that. So essentially what it's doing. So at that point, we have one control plane machine ready, essentially. Once one control plane machine is ready, you can then start to provision the worker machines, and it always waits until one control plane machine is ready, and then it will just start provisioning all of the worker machines in parallel. So we can fast forward that, and you'll see another VMs come up, and I think you get the point, so it just repeats the same things, but this will be for a worker machine. So, well, I just skipped ahead in the top part of the terminal window. I have just got the cube config for that newly created cluster. So the cube configs for the newly created child clusters or the tenant clusters are available in the management cluster, so you can get that out and then run, and obviously do what you want with it. In this instance, I'm just showing that stuff is running in there. So you can see that Calico is running in there, so we didn't put Calico necessarily in the cluster definition, but if we go back to those labels on our cluster definition that said C&I is Calico, that is using a feature of a cluster API called cluster resource sets, and essentially this enables you to install any type of resources into a newly provisioned cluster automatically. So it's really ideal for things like C&I or cloud providers to be able to do the things that you want as soon as that cluster is being provisioned. And again, this is all done in a declarative way, so you don't have to do any special commands, you just put all of your definitions into Git, and then the cluster API will do the orchestration. So this is what is in the second folder, in the repo, in the CRS folder. You'll see that there's a cluster resource set, and you'll see that there's a label selector, so if your cluster matches that label selector, it will then apply all the resources listed below, and those resources are essentially just config maps or secrets, and they contain embedded Kubernetes YAML, so it will just squirt those into your cluster. So where are we now? So we've got one control plane and one working machine, so I said that you could go into Git, and you could scale the cluster and do all your operations, so what we're going to show here is actually, if you go to the cluster definition in Git, we're going to scroll down until we get to the machine deployments, where it will say replica one, and we're going to, hopefully, you see the machine deployment there, change app to two, commit those changes to the Git repo, and you can probably guess what will happen now. Any VM is spun up, Kubernetes is bootstrapped on there, and that node joins the existing cluster, and you'll see eventually that it does come up. So that is the props box demo. So now we're going to show the same process with Kubert, and the idea is that you can use cluster API to provision your clusters and multiple providers in the same operational way, so the process for different infrastructure providers is relatively very similar, with the difference in operating your infrastructure, I mean, defining how your machine looks like, but the whole idea is the same, no matter where you're on your clusters. So the one major difference with the Kubert provider is it requires Kubert to be installed in your cluster already. So before you install the provider for Kubert, the Kappi provider, you must have Kubert already installed. So what you're seeing here is we're installing Metal LB to take the place of providing the load balancers within this environment. Then we install Kubert, and so Kubert works on the basis of you describe your virtual machine as a custom resource, and then it will make that happen behind the scenes via QMU. So this is what we're doing first, and this is before you get to any of the Kappi stuff. This is just setting up the Kubernetes cluster. So you can see the Kubert is starting up. So we now done the quaternate installing the provider. We're going to install the GitOps agent here now. So it's basically the same process, just slightly modified with different providers and just with different prerequisites, the prerequisite being Kubert. So forward again. So in this second branch, you'll see a different cluster definition that uses Kubert, but essentially the way it's applied to the cluster is exactly the same via GitOps. So what you take away from this is the same operational procedure irrespective of your target infrastructure or the flavor of Kubernetes that you want. You just create some YAML that describes the cluster that you want. So you'll see this is the interesting part. It will spin up a pod per VM, and that pod will then do the interaction with to actually provision the virtual machine on the host. So you'll see one of these spin up for each of the virtual machines that are required for the cluster. You can look at the boot logs via VNC as well. So if you use Vert Cuttle in this scenario, just get the name of the node and use Vert Cuttle, and you'll see it's using exactly the same QBADM commands that we saw previously with the Proxmox. And then we do the same operation. We scale it, and you get the third machine. So again, probably the key takeaway from this is that you can use the cloud environment to actually use it. So key takeaways are, CAPI can be used in many, many different infrastructure environments, not just like the cloud environments where a lot of people would naturally think of it. So virtualized environments, bare metal type environments, and some really interesting type environments where you want a control plane as a pod type scenario. It supports different Kubernetes flavors, so you might want just pure upstream with QBADM. You might want something a bit more lightweight, so you can use K3S. So it allows you to mix and match all of these things. And lastly, this is fully declarative, fully GitOps friendly. Perform all of your cluster operations via Git. So yeah, thank you for coming. Thank you for your question. Thanks. Thank you. Yeah, so the question was, can you realistically provision a cluster associated infrastructure like low balances, et cetera, with the cluster API currently in a hyperscaler to like AWS as an example? The answer is yes, definitely for AWS. I'd say the caveat is it will provision the infrastructure in an opinionated way. So it will only provision the infrastructure that's required for the cluster and nothing more. And it will provision it in a way that it thinks best. So you can slightly tweak it if you want. If you don't like, say you want to use A or Bs instead of something else, or you want to add security groups, it does allow you to do that as well. But there are, I guess, boundaries. So if you want full flexibility, then it might need to do something else. But you can also use things like Terraform and cluster API together. It doesn't have to be an either or. So you might provision the VPC and the network with Terraform and then get cluster API to do the Kubernetes and like the day two operations type of stuff on Kubernetes.
#snapsafety: de-duplicating state across Virtual Machine clones
Hello. Thanks for coming to this talk. My name is Babis Hallos. I am a software engineer with Amazon Web Services. I'm currently working with a team that maintains the Firecracker Virtual Machine Monitor. Today I will be speaking to you about Virtual Machine Snapshots. Essentially I'm going to be speaking more about some challenges we face when we clone virtual machines and then we start multiple virtual machines from that same clone. A problem that we call Snapshot Safety. I'm going to be speaking a bit about the mechanisms we have today for tackling those issues. What do we believe we need to do as a community in order to grow awareness about the issue and build systems that are safe in the presence of Snapshots. Quick sneak peek on the agenda. We're going to define what is a virtual machine Snapshot for us and what is problematic with virtual machine Snapshots and which scenarios we have problems with them. Then go through a bit about the mechanisms we have today for addressing those issues and how we are thinking about building solutions that are system wide and address the problem. Finally I'm going to be speaking a bit about what we're planning to do next on the area. Earlier this morning there was a very nice talk about virtual machine Snapshots. It went much more in detail what I'm going to go into but let's think for the moment about the virtual machine as a collection of some state and that state might be memory, the guest memory, architectural state of the VM. Then you might have some devices for doing networking and storage, etc. Then some host resources like whatever state the KVM in Linux is holding for us for the VM, maybe a top device for the networking and files that back our storage. For this talk, Snapshot is simply the serialization of this state at a given point in time in a file that we store somewhere in some storage medium. Then we use that Snapshot file in order to start one or more VMs, not that exact identical copies of the initial virtual machine. The morning talk spoke about various scenarios why you might want to do that. For example, you want to give a backup of your machine so you can go back in time in a previous state, etc. Or another scenario that you might want to do that is if you are building some sort of service that uses VMs to isolate workloads and you want to spawn these VMs very, very fastly in a state that they are ready to handle user requests, you might want to spawn a VM like that, bring it in a state, initialize everything, every service, a component that you want in order to get it ready to handle requests and take a Snapshot at that point. Whenever you have a new request in the future, instead of booting a machine from scratch, booting all the operating system, the user space, blah, blah, blah, blah, you just resume from a Snapshot and then you are much faster in a state where you can handle that request. What's wrong with that? Now, let's look again at the previous picture of our VM and let's imagine for a second that somewhere in VM memory or it doesn't have to be memory, it can be any other component of the VM. There is some piece of state, an object, some sort of state that for the purpose of the application that it's making use of it, it needs to be unique and or secret. It needs to have this property in order for the application to operate correctly or securely, etc. Now, you see where I'm going with this, once we take that Snapshot, that property of this state is lost. Here we're speaking about what sort of mechanisms, what sort of applications are having this problem and how we can address this exact problem. We are aware even today of many classes of applications that rely on this assumption of some part of the state being unique, secret, etc. For example, we can think of cryptographically secured pseudo-random number generators. Those are random number generators that have the property that it is very, very hard, if not impossible, to guess what the next byte they're going to give you is. Many applications, the security of many applications rely on this property. They have other properties as well that given knowledge of the current state of the PRNG, you cannot guess the previous bytes, etc. But for those sort of applications, imagine that one, those sort of random number generators, imagine that once you take the Snapshot, the VM Snapshot and you start more VMs from that, the state of the PRNG is being duplicated. So unless we do something else, unless we add more entropy, for example, in this PRNG, in all of the VMs that start from the same Snapshot, the next byte that is going to be given out from that PRNG is going to be exactly same in all of the VMs. Other examples of use cases that have this problem is network configuration. Imagine you have a VM that has some network configuration, IP addresses, MAC addresses, etc. Suddenly you Snapshot that VM and you create new VMs from that Snapshot that live in the same network as your seed VM. Suddenly they appear in the network VMs with the exact same network configuration and depending on your use case that might be a problem. So you might want to be able to do something about it once this happened. You might want to detect that this is happening and do something about it. Another class of applications that are affected by this is anything that really uses a UUID, a GUID. Many applications rely on the uniqueness of this variable, this number, in order to perform correctly. Imagine for example once you take this Snapshot of an application that has a UUID and you start more VMs out of it and the application that is running in this VM is using that number as an index in a database to modify stuff, read stuff, suddenly you have a race condition on the database. More than one entities are going to be using that same thing for accessing data. Any sort of use case where you rely on this thing being unique is a problem here. And really we do not know exactly all of the applications that use cases that have this problem. So it really depends on the application itself. We really need to go see whether our applications keep state that has the semantics, the semantics of uniqueness and secrecy. And if you know that you are running some workload that has this problem and you run in such environment, let's speak about and think of what sort of mechanisms you could use in order to make this use case safe. Okay, now that we know a bit more about the problem that we are speaking about, we are facing, let's see what kind of mechanisms do we have today to address it. Essentially the most fundamental mechanism we have today for doing that is called virtual machine generation ID. It operates as a notification mechanism for the VM after it is getting resumed from a snapshot about that particular fact. But it tells the VM, okay, now you are in a new world. You are not in the world that you thought you were without having rebooted. And in the technical aspect of it, it's an ACPI virtual device. It is emulated by the monitor. And the way it provides the notification inside the guest is via a generation ID, which is a 16 bytes cryptographically random number that changes every time we resume from a snapshot. So when you resume from the snapshot, the monitor makes sure that it changes the new value, it stores a new value in the generation ID, and before resuming the VCPUs of the VM, it injects an ACPI notification in the system. And once it resumes from the snapshot, resumes the VCPUs, then the guest kernel is going to handle that ACPI notification. What happens in Linux is that today the kernel is using the new generation ID as extra entropy for its entropy pool. So it's receding its entropy pool, essentially, so that it avoids the problem we were speaking about before about PRNGs. It works, apparently. It works fine. There is still a bit of a concern regarding the fact of its asynchronousity in the sense that there is a small race window between the moment we resume the VCPUs and the ACPI notification is being handled by whatever thread in the kernel handles it. Okay. Yay. Sorry about that. But at least we have something. Nice. So moving forward, recently we built in the Linux kernel, contributed a small essentially change that every time the generation ID changes, we emit a new event to the user space, because before that VM generation ID implementation did not do anything. It was using, since it was using the generation ID as entropy for the kernel PRNG, people were nervous about exporting it to the user space. So I said, okay, that's it. And in reality, the user space does not really need that 16 bytes themselves. It just needs a small notification. So there you have it. It got matched recently in 6.8 and it is still an asynchronous notification mechanism. So everything that in the user space that runs event loops, for example, can monitor for it and get notified about the fact that they're now in a new VM started from a snapshot. It is still racy, this thing has to be said. So if we think that we have use cases that need to get more asynchronous mechanism, more synchronous mechanism, we should continue doing work to build those. Okay, so going back to the PRNGs, mainly because they are used by security sensitive applications, let's see how these mechanisms can help us. In runtime systems that maintain their own PRNGs like JVM, we can now use the VM GenADU event to be notified about snapshots. So upon resume, the runtime would get that event, eventually would be notified and it would receive the PRNG as soon as possible. Now in other PRNGs that are implemented from libraries, within libraries, this is a bit more weird situation at the moment because an asynchronous mechanism like a U-event is not a perfect fit for the programming model. We will need to do something else about them. One idea here would be to use prediction resistance with what cryptographers call prediction resistance with hardware instructions. The idea here is simple. With every byte that the PRNG returns to you, you mix in some random bits that you got from a hardware instruction that is not affected obviously by virtual machine snapshots, so the problem just goes away. If you are able to do that, it doesn't matter if you have resumed from a snapshot. The state of the PRNG is always going to, including these snapshot irrelevant random bytes and everything is going to be fine. Other potential solutions, for example, in cases where you do not have these instructions or for whatever reason you don't want to use them, it would be to build some sort of synchronous APIs on top of the asynchronous VM-genade event, for example. But we really think that we should do something, don't go out on me again. We really think we should do something about the use case of these libraries. Okay, so let's think, now that we know what mechanism we have available, let's see if we can really solve the problem. And let's follow this example. It's a very simple example of a VM that has started from a snapshot. The hypervisor and the guest kernel support VM-genade. The kernel is going to use the generation ID to receive its random number generator. And we have a user space application that does some network communication and it wants to use TLS. And it reads some random bits from the from the view random, which is safe because of VM-genade in order to do some sort of communication. And everything works fine. The application creates the session key to start communicating without the world and everything looks fine. And at that point, we take a snapshot. Now the moment we resume the VM, the second VM from that snapshot, the session key is duplicated in essentially both VMs. So even though we have these mechanisms built in the system that give safe interfaces over the view random, for example, the final system is not necessarily safe. The same would go, for example, for GUID applications that have GUIDs, et cetera, et cetera, and they would need to adapt themselves. And it is true that the application could use the VM-genade event, but that event is present in the resumed snapshot, in the resumed VM. In the initial VM, there is not today a mechanism to do something about that. And again, there is some sort of race window between the event resuming the VM and the application being reacting to that event, which makes us think that probably there are things that should not ever be serialized at all. It would be much easier if that session key was never serialized. And that makes us think that VM-genade is a post-mortem mechanism. It is a notification in the new VMs, not the initial VM. And by the moment it arrives to us, sensitive information operations might be already in flight, and even if we handle that notification, there is nothing we can do about the things that are in flight. And that makes us think as well next that what we should probably do is control the timing of snapshot events. The moment snapshot events in the lifetime of the VM can arrive, let's say, at arbitrary points in time, instead we should control them. We should do something before we take even the snapshot and make sure that we only take a snapshot when the machine is in a safe state to be snapshotted. And once we resume, make sure that every application that needs to has adapted to the new situation before marking the system as ready to be operational again. Thinking about these things, some time ago we were speaking with system defaults and we thought about modeling this problem using force states, describing our systems being in one of force states. Planning is the normal state of your VM. Now once you want to take a snapshot, you start quiescing. People earlier today spoke about this as freezing, for example. And during that period you do things preparing yourself to be snapshotted so you cannot find yourself in a previous situation. And once you are quiesced, once everybody is ready to be snapshot, then you can take the snapshot and then the same. And on the resume path, on the resume from the snapshot path, you essentially do the opposite work, right? You start from a quiesced state, then you start inquiescing, getting ready for the new world, recreating your GUIDs and what not. And once everything is done, then you can be running again, up and running again. SystemD has this nice concept of inhibitors, which can essentially applications use in order to say, OK, don't do that. Don't do that transition until you are ready to, I tell you I'm ready to do so. For example, there are inhibitors for system CTL suspend. At the moment we were thinking that maybe we could use some para virtual agent to orchestrate everything. In reality, maybe system CTL suspend is all what we need and we can drive this from the hypervisor by sending an ACPI event. And going back to the previous example, how that would look like is we are in a running state in LVM, we have our previous application, and suddenly the control plane informs the PV agent that it needs to start quiescing. Here I say system CTL quiesce, but again, unless we find the reason why suspend should be different than some new sort of operation, we could even use suspend and get away with having to have a para virtual agent in there. In any case, once that happens, the application would say, OK, do not get quiesced again because I need to do some cleanup before you can snapshot me. And once the application does that, it says, OK, now I'm good to go. And at that point the control plane knows that, OK, we can take that snapshot. Now on the opposite path, the control plane would probably resume the VM from a snapshot and then start the unquiescing operation. The application might want to say, OK, wait until I know that I'm safe again because I want to create new random numbers. And I do that. And at that time, we are safe module of that tiny race condition in order to start getting random numbers again and recreate our safe and be in the state we want to be in order to be up and running. That's it. So yeah, we started working in adding support in Firecracker for VM GenAD. Up until now, we were telling people who were using the snapshot in feature in such a way that they should make sure that manually they would need to receive their kernels, PRNGs, and they use the space PRNGs after the fact. The other thing we want to pay attention to is working with PRNG owners in order to find proper ways to make their libraries not safe. Here we're speaking about the PRNGs that are implemented as libraries such as OpenSSL, AWS, and C, et cetera, et cetera. And start building this system we spoke about in system D, start modeling this in system D. And earlier, we had some ground work already done some time ago. And we hope that system D is going to be just the first one that we get this into and hopefully other management systems will follow. And that's it. Without that, I'd be happy to take questions. I just wanted to ask, you mentioned the network issues where machine comes up with the same back address. Didn't appear to address that. Is there a plan to take care of that situation as well? Or is that a problem? Yeah. The question was that we mentioned that during the presentation that there are problems with networking when you take snapshots and resume and whether we plan to address those at the future. Yeah. I think that this is part as well of the system D work that we're going to do. This problem essentially appears mainly when systems are in networks. If your VMs are not networked, there is no problem if two VMs that are not in the network are communicating somehow, they have the same random numbers. So yes, for example, something that we would like to do is to, I guess, shut down networking before taking a snapshot so you're sure that there are not in-flight connections and stuff like that. So I think this is going to be part of that work. Thank you. If we have to come up with a MAC address, we generally try to hash things first outside kind of even if we can also hash that into that element, it's already here. We have been discussing this, like, this is going to happen in my conference. We have to identify the generation ideas to the most obvious thing in the world, to add that to the hash. So that basically, yeah, once the generative changes and everything, get this into the GHD, it's not going to go to wherever else. Thank you very much. Thank you very much.
Pipewire audio backend in QEMU
Hi everyone, my name is Dorian Dabasi and I work at Red Hat. I currently work on enabling the audio stack and other features in the automotive team. And today I'll be talking about the Piperia Audio Backend in QEMO. So just a brief overview about what Piperia is. Piperia is a multimedia service for handling booths audio and video. But in this presentation I'll be focusing on the implementation that was done in QEMO and I will also focus on its use cases in the embedded platforms. So for a start, what's the QEMO's audio backing? It's a software component that's responsible for managing the audio streams and also providing audio functionality to the emulated platform and in our case QEMO. And it's also responsible for handling the audio imputes and the outputs on your virtual machine that's running on QEMO. The Piperia Audio Backend, it provides an interface that would allow the sharing of these audio streams from the guest operating system to your host using the Piperia native APIs and libraries. So how does it work? Here this is an illustration of what the stack looked like. First we have the application that's running in the guest is sending the audio data to Piperia through MMFD and DM above and we have the Piperia Daemon that's communicating with the session manager and it's also responsible for handling the media routing and in charge of talking and talking to the AOSA driver on the guest channel. And in QEMO you can see that we have the emulated sound card driver which could either be AC97 GOS or Intel HDA. It's providing the audio software emulation and then like any other host application that's running in your host user space, QEMO with the native Piperia Backend would be playing these audio streams to the host Piperia Daemon directly. So now it's playing those audio streams to the Piperia Daemon in the host user space but it's not going through any of the post audio compatibility layer or in the case of maybe AOSA, the AOSA plugin. So the Piperia Daemon process would now handle the media routing to the corresponding libraries like AOSA library and routed to the sound driver that's in the host channel. So after many iterations of the patches, the Piperia Audio Backend was merged in QEMO in May 2023 and it was added to the QEMO version 8.1 release and currently it supports Piperia version 0.36 although there's been a latest release in Piperia which is a Piperia 1.0 release and that's like a huge milestone for Piperia. So after the QEMO 8.1 release there was some improvements that were made to the backend thanks to McHundry and Volca for their support while optimizing this backend. QEMO, it has a number of audio backends and you can find the latest audio backend which is Piperia among the list of available audio drivers. So depending on your architecture, the Audio Dev Help option should show you the list of audio drivers and you can see Piperia there. The Piperia Audio Backend, it uses similar structures like the other audio backends although the difference is that it's being implemented with the Piperia native C libraries. So these are the Piperia Audio Backend properties that can be configured on your command line. So first we have the Audio Dev command and we specify the Piperia backend that we want to use and also specify the ID of the Piperia backend and then you can specify the name of the Piperia backend, the stream name and the latency for the output stream and you could also specify the same for your input stream depending on the latency that you want. So this is a description of what the QAPI schema looks like for the Piperia Audio Backend. You can see the name there which is the Piperia Key Target Object and it's used to specify the target object to link to although it's not necessary if you do not specify an object name to set. And for the stream name, this parameter is a stream media name that is being used when you're creating a new stream and if you don't set a stream name, it should use the ID of the Piperia backend which is a PW sound. For the latency, in order to set your desired latency, you could set anyone that you want although the default latency is 46 milliseconds for Piperia Audio Backend. So there are other parameters like the mixing engine, the frequency, the channels and the formats. These parameters are common to the other audio backends like Paul Sodu, Jack and Ausa. But one thing to note is that in QEMU, it currently supports just one channel or two channels. That's money or stereo setup. So you can configure either one or two channels and in Piperia, when you're using a single channel, the content of your buffer is basically S1 samples, S2, S3 and each of the samples would be the buffer samples. Now when you have two channels, the format would be expecting the buffer to be like one sample on the left and one sample on the right and like continuously like that. So each of the samples that's the one on the left would be going to the left speakers and then the samples on your right would be going to the right speakers. So in the case of two channels, the sum of the samples would be the sum of the left samples and the right samples which is the stride. And then the buffer size there, it's specified in microseconds just in case you want to configure a buffer size. And the default format that can be used is S16, although the Piperia Audio Backend has a range of formats that it supports. And for frequency, you could set a default frequency of 44.1 milliseconds. So to use the Piperia Audio Backend, you need an audio device and this audio device is an emulated sound card. It's a legacy PCI device that's been plugged directly into the PCI ExpressRid bus. So this is an example of how the audio device is being configured in the command line. So first we have the device option which we're specifying an Intel HD device and we'll specify a codec option like HD Duplex for streams from your host speakers and your host microphone. And maybe if you wanted to only allow access to just your speakers, you could use the HD output option or if you just want access only to your microphone, you could use the HD microphone option. So here you can see that when specifying the sound card device to use, I used the ID that was specified in the Piperia Backend. That's you telling the sound card device to use the Piperia Backend. So this is how the properties of the Intel HD Audio Device are being declared inside the code. You can look up how the properties of the other devices like GOS and AC97 are being declared inside the code. So Quemo allows you to configure multiple audio backends and this is very useful in embedded platform development. So let's say for example that I'm emulating an infotainment system using Quemo and I want to configure a stream only for notifications on the mono channel and then I want to configure another stream only for music on two channels. This multiple audio backend configuration, it will allow you to specify different parameters for each of the created stream. So this is a visual representation of what the backend would look like with two Piperia Audio Backends. So you can see that each node in the guest is representing a created stream and you can see that the nodes which are the colored boxes and you can also see the host speaker nodes. So for playback, the output ports of the Quemo node which is on the right, the output ports for the Quemo node which is on the right, I've been routed to the speaker nodes on the host and then the input ports that's coming from the host microphone, I've been routed to the input ports on the guest. So this is also very useful when maybe you want to isolate the audio that's coming from different processes that are running in your guest. So now we'll take a technical deep dive into how the Piperia Audio Backend works. So what happens in playback? For playback, we first activate the stream and using this Piperia Streamset Active API call, it will set the stream mode into streaming and then next we call the buffer get free function. This function is used to know in advance the available number of bytes for writing data to the buffer and this also improves the playback latency by a factor of two. And later I will show you some latency measurements. So next what we want to do is to lock the tread loop because I'm using the tread loop mechanism and this mechanism ensures that we are doing the Piperia API calls only from one single tread at a time. So you don't want to be accessing this Piperia resources from multiple threads because it could cause a risk condition. So next what we want to do is to get the number of bytes that are available for writing data to the buffer. How we get this value is that we subtract the number of bytes that are actually inside the ring buffer from the effective Piperia Backend buffer size. And to get these bytes that are inside the ring buffer, we use the sparring buffer get rights index API call. So now what we do next is to use the sparring buffer write data to actually do a mem copy of buffer data from the source audio device to a temporary buffer with the index being the offset and then we update the write pointer. So here at this point there is the possibility of buffer on the run sometime occurring and although this happens in very rare cases, this is like a situation where the audio buffer levels has dropped below a certain threshold and it sometimes cause audio distortion or stuttering and like we cannot really guarantee that okay this guest would be producing the audio samples fast enough. So in Piperia we had a robust solution to fix this issue which was to handle this buffer on the runs by plain silence. You can look up the code on how we handle the buffer on the runs. So next what we do is to get a buffer that can be filled for the playback stream and then we copy this audio data from the temporary buffer to the Piperia buffer using the sparring buffer read data API call. Although I'm just giving you a summary of what the sparring buffer read data API call does but it does much more than that. And then next we queue the buffer for playback and this continue to happen in a loop until all the buffers have been played. So for the capture side what happens? It's more or less the opposite of what's happened in the Piperia backend and in this case like it's kind of similar but not because it's the opposite. So in this case what we do first is that we like activate the stream and then we use the buffer get refunction to know in advance the available number of bytes that we can write and then we use the tread loop lock again to ensure that we're just doing the API calls from one single tread at a time. But the difference here is that this time around instead of using the sparring buffer write data API call we're using the sparring buffer read data call and this time we're doing a mem copy of buffer data from the temporary buffer to the source audio device. So with the index being the offset we would update the read pointer afterwards. Then what we want to do next is to get a buffer that can be consumed for the capture streams and then we copy the audio data from the Piperia buffer to the source audio device using the sparring buffer write data API call. Next we queue the buffer for capture and then this continues in a loop until all the buffers are being consumed for capture. So as regards to the volume controls in order to be able to adjust your volumes through the virtual machine to be effective on the host we use the Piperia volume control API calls and this volume control code would allow for purpose synchronization of the volume changes that are made on the guests to be effective on the host. So when this volume changes have been applied on the node output monitor parts of the guests it will synchronize with the host. So for Piperia I use the Piperia stream set control API call and this is used to set the effective volume. Although one thing to note in Quemo was that it had volume levels of 0 to 255 and in the back end because the Piperia API has volume levels from a floating range of 0 to 1 where 0 is the silence and 1 is representing without attenuation I had to do a linear conversion of these levels so did a linear conversion of 255 levels to Piperia floats in range of 0 to 1. So regarding the features of the Piperia back end these features are not like the features of Piperia in general it's not only limited to its design to handle multimedia processing on the Linux but it also transcends to applications that have been built with the Piperia C API of which a use case now is the Piperia audio back end in Quemo. So on into the Piperia low latency features the Piperia back end has been developed to significantly reduce the latency in many ways and one of the ways is by setting the Piperia keynote latency property and we set it in the back end to be 75% of the time period for faster updates and the other way in which we reduced latency was to use the buffer get free function which improved the latency by a factor of 2. And I think to note about the latency is that the Piperia back end latency is mostly determined by a combination of the buffer size and the sample rate of the back end and this is usually called the quantum. So another feature of the audio back end is that it's providing a reduced footprint and also reduced dependencies in comparison to the other audio back ends that we have in Quemo and for the native Piperia back end we get to benefit from Piperia features such as the less CPU usage and the memory as well. So here I made some benchmarkings of the round triple latencies for the different audio back ends and all these latencies were being measured with a jack Iodili and a loopback cable. So listed here are the round triple latencies as reported by jack Iodili and the sample rate of the device that I used is set to 44.1 kilohertz. So as you can see there, yeah, I have to agree that jack is like topping the charts in low latency as expected but that's not my focus. You can see that the low latency that Piperia offers is quite low and then next we have the pulse audio and SDL competing with each other. So about debugging, while I was working on this audio back end, the GDB was very useful because in case you want to examine the state like registers and memory and you want to maybe set break points and watch points you could use it and you could also leverage the Quemos internal tracing infrastructure. So I added a couple of Piperia audio back end trace events that you can use. So these trace events you can configure it on the command line and example of these trace events is the Piperia writes trace events. When you set it, it will show you the length of bytes that's to be written to the buffer. It will also show you the available number of bytes that can be written. And one thing to note here with Quemos is that this, when if you use this, if you enable this Piperia write trace event, it produces a lot of outputs given that we are copying bytes every millisecond. So you should expect to have a very big log file in case you want to enable those trace events. And then there is the other tool that is very handy, the Piperia debug login. You can use it to set different debug levels from 0 to 5 and these levels would help you to see and have control of the behavior of the Piperia back end or if you were maybe using like debugging your own Piperia application. So here I added some helpful links. The first one which is my blog about the Piperia back end and its usage in Quemo. The next one was Get Hoffman's blog about the sound configuration changes that were made in Quemo. And then we also have like the Quemo invocation that's in case maybe you want to see and know how to use these audio back ends or maybe some other audio back ends, you could look up the Quemo invocation and how to use it. Then I also added the Piperia's wiki page on performance measurements. So it includes scripts that you could use to measure the latency in the different audio back ends like Piperia, Paul, Sodja and Jack. And you could also use it to measure the context switches and the CPU cycles, etc. So at this point here I would like to give a shout out to Intema and the Piperia maintainer who assisted me while I was working on this back end. Thank you. Do you have any questions? Any questions? Questions? What was? Oh yeah. I'm not curious what applications you tested with the incubators. So you're asking what applications in Quemo that I tested with this and how they behave. Okay, you could test a couple of applications that's like trying to play audio which maybe you're watching YouTube on your guest. But I mostly use the loopback cable and the Jack Iodili tool to measure latency. At least that's very effective because you could use it to measure like the CPU cycles as well. The latency and you could also measure other like features that you're interested in. Any other questions? Thank you very much. Thank you. Thank you.
AI-Driven Observability and Operations in Cloud-Edge Systems
Thank you. Hello everyone. Thank you so much for being here. In today's session we are going to do an introduction to a driving observability on operations in cloud edge system. First let me introduce myself. My name is Victor Palma. I'm a cloud engineer at Supernevola Systems. I come from Madrid, Spain and I've been working for Supernevola for more than two years. So let's move on to the presentation. First I would like to start with some initial context in order to introduce later what we are going to see here. So first what is observability? Observability is the ability to understand and analyze the internal behavior of a system by collecting and analyzing relevant data. That is the dictionary definition. But in other words it's just the ability to transform data into information in something that could be useful for us. So we can have a lot of system logs or of data or number but if we don't provide a meaning to them it's useless. So observability has multiple, sorry I just got blank. It has some advantages like anomaly detection that allows us to identify anomalies or bugs in our system. This also provides the ability to do a performance analysis. So we can identify areas for improvement in our system and finally it's very useful for decision making. So we can see the impact of the change that we made in our system in a very easy way. As the saying goes information is power. So observability is very, very important in nowadays. But now I would like to talk about AI because AI in nowadays is everywhere. You know, a marketing guy's fault. But it's really useful for observability. It's really useful for the quick answer is yes, socially. AI provides the capacity to create more enhancing data analysis, create automated anomaly detection, create dynamic scaling for our cloud. For example, if we have more workloads that the usual we can automatically create new notes or deploy new VMs in our cloud in order to provide more services to our customers. And finally we can create predictive analysis, analytics in order to predict how our system is going to behave in the future. After finishing this part, I also would like to talk about data sorrentia and open source because I think the most important concept about AI, currently it's the data sorrentia or the information. Many organizations truce sensitive data to third parties providers. Currently these providers are based outside of Europe and we need to wait in order to bring back the data to our servers to Europe and be more transparent in this way. So as a solution, the open source is a very good solution for that problem. And provide more transparency for the cloud and helps reducing the vendor locking in our infrastructure. So we are not tired to a specific vendor and we can migrate within vendors every time we need it. So what's next? How can we address all these challenges? The answer is the one AYOPS framework, the open source solution for eye driving observability. One AYOPS framework combines open Nebula as virtualization and cloud management tool, Prometheus and Grafana as metrics and visualization solution and some AI and ML algorithms to predict and analyze all the behavior of infrastructure. All the three technologies together creates the one AI AYOPS framework that we are going to see here today. So let's go step by step and first we are going to see what is open Nebula. Open Nebula is the open source cloud platform solution in order to create your own cloud. It provides the ability to deploy virtual machines in your own private data center, in your public cloud or even in the edge. But you not only can deploy virtual machines but also application containers, micro VMs or even Kubernetes clusters. As I've said before, one of the features of open Nebula since it's open source and it's oriented to provide a truly, truly flexibility to the cloud is that avoids the vendor locking so you can migrate your workloads with between different providers in a very, very easy way. Open Nebula has a lot of integration with party tools like Terraform, like Kubernetes, Ansible or Docker. It also has a built-in tools like Sandstone that is the web user interface and you can handle all of your infrastructure from there or from the Celi-I and deploy a built-in machines based on the where, on KBM, LXD or micro VMs in Fightcracker. Finally one of the most important features of open Nebula is the possibility to expand your cloud to the multi-cloud or to the hybrid cloud. You can create on demand resources on the edge in Amazon, Google Cloud, Equinids, just clicking a button or automatically if you configure that. So you can migrate workloads to your on-premise data center to on-edge data center of the public cloud in a very strife way. So you can deploy any infrastructure with a uniform management for all this infrastructure and you can run any application in your cloud. For open Nebula doesn't occur if the host is located in Equinix or the edge or in your private data center. The only thing that open Nebula occurs is what is, what VM is running the workload and how can you access to that VM? So very handy. The next section of the one AI ops is the integration of open Nebula with Prometheus. That integration is based on the Prometheus Sportex like the Prometheus Node Sportex that is installed in every open Nebula server. It's also installed in the hypervisor nodes and it combines with the open Nebula Libreth exporter. It's a Q-Stume exporter created by open Nebula in order to extract and collect information about the KVM machines. And we also combine this information with the, the, our metrics of open Nebula that it's created itself. And this metric is, is gathered by the 1D Demon, it's the main demon of open Nebula. And then it's exported to the Prometheus server through the open Nebula server exporter. So the, so the next thing is the AI that we, sorry, that we add to the formula. So we create a bunch of machine learning algorithms and some decision algorithms and implemented in the, as a, as an exporter to Prometheus. So gathering all the metrics that open Nebula produce and the exporter produce, we implement this algorithm in order to predict and to, to, to get how, how improve the performance of, of your cloud. So in summary, the feature and capabilities of when one AA ops are the CPU usage prediction of the VMs of your cloud. One A ops come predicts the individual VM CPU prediction per hour, the general CPU prediction for usage for, for your host. The accuracy of that prediction, a very important value in, in terms of a feasibility of your, of the, of the system. And then when AI ops come also suggests where you can place a VM. So based on that prediction, one A ops maybe tell you to migrate a VM from one server to another in order to improve the performance there. Three main policies configuring when, when AI ops. The first one is the load balancing, load balancing. It's just the name said balancing the load of, in all your notes is very useful when you have on premise or private data center and you want to use all your hosts. The next policy is the resource contention. So very useful for public cloud environments when you want to use a, a few number of, of hosts. And the last one is reduce migration. That policy very useful when you want to, to avoid migrating VMs between hosts. The, this, this scenario is very useful for a edge environment where the migration of people machines between edge nodes sometimes it's kind of a bit done. So here you can see the architecture of the one AI ops. It's based on the already existed open nebulizer architecture. So everything at the bottom, it's already what's open nebulize. And the layer at the, at the top of the picture is the new one AI ops architecture layer. So here you can see the modules that we have implemented in order to provide this, this prediction and then the, all the virtualize infrastructure orchestrated. So let's do a demo in order to show you how this works. First we are going to, to go to the, to the open nebulizer system portal in order to show you, wait, sorry. Thank you. I'm so sorry. Get me a minute. Okay. Well, this is the main dashboard of open nebula, user graphic interface. Here you can see the, the principal information about your cloud, like how many machines we have or the images or the built on network or the usage of the host. We have currently, here, sorry. In this, this is a demo environment. So we only have two, two hosts with some workload and that these are the, the VMs that we have running in these environments. So this is solid to see that we already have some workload in this environment and this workload is fully random. So maybe it's consumed a certain CPU depending on, on the time. So when we install the one AI ops framework and we have a documentation for that, if we go to Grafana and import the one dashboard, we can see this. This is the results that we want AI ops generates. So we can see here at the left, the average CPU predicted per host. Here we can see the average CPU, the usage per VMs and here the, the, the real usage. So as you can see, it's closely one value to, to the other and the accuracy of the prediction, in that case, a 92%. Here we can see the suggestions that one AI provides to the user. The first policy is the core optimization policy. So it's going to, to, to reduce, it's going to try to reduce the number of hosts to the minimum. Here you can see that all the VMs, we have five VMs in this demo. The four VMs are in one host. Since this host is fully and not a more VMs entering inside this host, the, we have here another VM, but it tries to concentrate the VMs in the, in a few hosts. Here you can see the migrations that one AI suggests for, for that, for, for achieve this distribution. So it's to get to us to move the VM with ID three to the host with ID one. It's very, very, very easy to, to follow the instructions. And then we have the other policies, the load balancing optimization that as you can see, we have the, the VMs distributed in the two hosts that we have in our environment. And then the final policy that is the immigration optimization. And in this case, no migrations are suggested because no, no optimization are found. But what? In case one AI produce something, it, it should be up here. And what? Returning to the, to the slides. Well this is the demo that we have just seen. And closing thoughts on the next step of, of this project. So the next steps and challenge that we, we are facing in, in one AI ops. First is implement the virtualization operations in order to apply the suggestion automatically. Currently the operations are only suggested but not performed in the, in the cloud system. Then we would like to improve the AI ops distribution as part of the OpenEvola software. And you need to install it separately. And finally we will, we will like to expand the functionality to provide anomaly detections, allocation based on memory prediction and network traffic. Because we only provide currently CPU usage prediction. And based on the result of the, of the tool, creates alerts and warnings. This project is totally open source. So you can go to the repository on GitHub and collaborate and suggest new features and, and, and changes. And finally I would like to, to encourage you to join to the OpenEvola community. So you can visit the forum and participate in, in discussion with other, with other OpenEvola users and, and learn and help together in the cloud community that we have created here. As closing a slide, I would like also to, to, to say that this project is, is founded by the European Union as part of the Horizon Europe Research and Innovation Program. So this project is called CONNIS. I, I recommend you to take a look in, in this URL. I can espel you if you want, COG and IT. COGNET. It's, it's very, very interesting project. Well, that's all. Thank you very much for your attention. So, questions? Yeah. Yeah, we use a linear models and a, and a, and a, and a, and a, and a, and a, and a, and and a, and a, and a vasilian models tool. By, by a sund. Sorry. By a sund model save any models. And would you be able to share the slides in the forms of website as też in the child. You can also find it in the repositories as, as yeah here in this repository all the data models and, algorithm applied are, are explained. Okay, thank you. You're welcome. Any more questions? Yeah. I think I'm basically, it was the same question as the last one. If you could quickly go back to, because you did actually have the model, and then in like a couple of slides before. Here. So here, does it explain where the Bayesian are we in here? It's not in there, because that was the bit I was wondering how the model worked. Okay, thank you. That's a question near our side. Okay. Thank you. Perfect. Any more questions? You're welcome. Ah, it's here. Are you optimizing for CPU utilization or not recorded? So also, is it that possible to also say, okay, optimize for availability or network throughput? Okay, he asked about if we optimize for CPU or for other attributes like network or memory. Currently, in the current state of the project, we are only make suggestions and predict the CPU usage. But the idea is to implement a prediction based on the network, on the memory, and other keystone attributes that you want to add to the tool. The idea is that you can change the prediction, the configuration. But really, it's just a prototype. So, for storage? For storage prediction. Yeah, we are also considering that. Sometimes, optimality is changing based on the cloud service provider. Do you also consider this or is this only based on regular hardware? Or what is your optimization? Yeah, we are considering that too. I mean, the way to optimize based on the location of the VM. It's not the same half a VM in your on-premise cloud than on the public cloud or on the edge. So, it's based on different policies that we are currently defining. Yeah, we are considering that. Any more questions? Okay, thank you so much.
Bare-Metal Networking For Everyone
Okay, hello everyone. My name is Mateusz. I work at ThreadHat as a principal software engineer in the Kubernetes bare metal networking team. So yeah, as the title of the talk says, we'll be talking about bare metal networking and I wanted this talk to be somehow a gentle intro into what you need to think about when you want to start doing Kubernetes on bare metal, but the thing that Kubernetes doesn't tell you you should care about. So we'll see in a moment what that means, but I work at ThreadHat. I already said this. I'm based in Switzerland. When I'm not doing computers, I'm doing farming. I actually make it much much better, but it doesn't pay the bills, so I need to do the stuff that I'm going to tell you about here. Well, it is what it is. I don't do AI as opposed to, you know, all the hype and all this kind of stuff, so yeah, I'm not really on the hype wave. Bare metal was never really hyped, so well, what can I say? Some intro why we may even think about doing containers for bare metal. Like, you know, no one ever told us to do so, so what the heck is the deal? So HPC and AI. This slide predates the AI hype, so sorry for this. I could remove it, but long story short, there are some workloads we really want to benefit from running for bare metal. You may have some fancy GPU from, let's not name the company, or some network adapter, which is, you know, something that you really want to have access to the hardware directly, or the other side of the scale. Something that you run and is critical to any part of the infrastructure that you already have. Like, for example, network equipment. You don't want to run router of your own data center as an instance in AWS, right? That would be somehow, yeah, we shouldn't do this this way. Or something which is almost forgotten, and you know, then people call me and put this use case. Benchmarking. How do you benchmark hardware, CPUs, and this kind of stuff if not by running workload directly on this hardware? Again, you don't want to create 50 VMs on some CPU, only to get the benchmark of this CPU performance. That would be chicken egg. Let's not do this. So now fast forward. We agree that we want to do Kubernetes, and we agree that we want to do this on bare metal. So we go to Kubernetes.something, I don't know what that is today. We go to the, you know, FAQ, installing a cluster, and we start reading. What do I need to do to install a cluster? Is there any tooling that would help me installing this cluster? And the very first page you see is this installing Kubernetes with deployment tools, and they tell you QubeADM and to some other tools. And we are like, oh, so lucky. There are tools that are going to do this stuff for us. Okay, let's check the first one. You go to QubeADM and we start reading. Using QubeADM, you can create a minimum viable Kubernetes clusters. And, okay, so is MVP really the production cluster that I'm going to run? Well, probably no. Let's keep that tool. The second one, we look into K-Opps. Okay, let's go to the website of K-Opps. Let's do the same. Installing Kubernetes, getting started, and we start reading. Deploying to AWS, to GCP, digital option, yada, yada, yada. None of them is deploying to bare metal. Thank you very much. End of the story. Let's check the last one. Maybe that's our chance. So we go to the Qube spray. It's a set of ansibles. So another story, you know, but, okay, someone gives us some method to deploy Kubernetes on bare metal. So we go, run Qube, Qube spray playbooks. With the bare metal infrastructure deployed, Qube spray came now in, so Kubernetes and set up the cluster. And you start reading those playbooks and you feel like, oh, this is so opinionated. So either I want to do my data, either I want to build my data center like they want me to build, or thank you very much, there is no tool. So let's agree that none of these three methods is for us. We need to do this stuff ourselves. So let's build the stuff, you know, brick by brick from the, from the beginning. So what, what we need to care about a cluster, and not only during the installation, but in general to have this cluster bootstrapped and then working. First of all is, of course, this is bare metal. At the end, you want to deploy this cluster because there will be some workload, right? You want to access this workload. As well, you want to access the API, right? Basic operations. You don't deploy the cluster for the sake of deploying it and running, consuming the energy. Then, of course, DNS infrastructure. You are deploying this in your data center. And then what? Are you going to give to your customers? And now, you know, type this IP address slash something, something to look at this fancy web, website or application that we deployed. No, you want to have it some very nice domain and, you know, but for that, again, DNS infrastructure, you need that. It doesn't come for free. The next step is we agreed that we are doing bare metal because we have some reason to do this and it's not like we just don't like a simple VM from AWS, which means there will be some non-standard network configuration. Doesn't really matter if fancy or not. It will be something more than just, you know, plug the server, turn it on because in most of the cases, people doing bare metal, they don't have DHCP in all the networks or they need some storage network and it all requires some fine tuning which doesn't come from default when you boot your Linux distro and some other dirty tricks that I'm going to tell you later because it's Kubernetes specific and I want to build my way up to this. So cluster load balancer because I told you that you need to have API and ingress to your workload and all this kind of stuff. The slide is overly complicated for two reasons. The first reason is because it is complicated. The other reason is because no one ever cared to make it less complicated. I know it sounds bad but it is what it is. So the only thing I want to tell you is that, you know, we are in the story of building a cluster installing it from scratch, which means we are starting bootstrapped from summer. Like, you know, you may be running those cube ADM create cluster, yada yada, from this laptop, right? So this laptop will be your initial bootstrapping infrastructure. On the other hand, at the other side of this room, I have those three servers that are going to be masters. So this somehow has to ride all together. I need to have some IP address that will be this API finally when I spawn all those nodes in the cluster. So I need to have some virtual IP which will be pointing toward this API, right? This is what I'm calling API VIP and it sounds complex but at the end it boils down to one sentence. When you start doing cube CTL commands at the end, you need to target some IP address. If you are deploying Bermetal infrastructure, you don't want to ever target specific node because if this node goes down, all your tooling goes down. So you want to have some virtual IP and you may have some load balancer from well-known companies as an appliance or you may want to just do it yourself with keep alive this. So I will show this in a second. And in this slide, what is then the part? So at some point, we have deployed those control plane nodes, those worker nodes and we have the API address which should be now pointing only to the control plane nodes not to your bootstrap so this laptop, it goes away from this story. But then you have some other IP address because you are deploying Quarkode. You are not only an admin now. You really have something that runs and your applications, you don't want to expose your control plane to anyone, right? Or do you? Well, you rather not. So you need another IP and exactly the same story. Where do you take all those IP from and who manages them? Yeah, you manage them. So what you are doing for this and of course I'm telling you about some very opinionated way of designing how to install Kubernetes cluster and it's opinionated because we decided, so let's do keep alive D in the combination with HAProxy. And I told you the story why we need the VIP so you should be already convinced that if we need that, then we keep alive D because it's very simple and it's proven in action. Why do we put HAProxy in this story also? And now it will be fast forward to some specific use cases and requirements that we got. Only thing to remember is that it won't be always the same stock for API and ingress because your admin control, as an administrator of the cluster, I have usually different requirements than the user, so different tools, different purposes. Because it's very easy to simply deploy keep alive D and tell it, you know, let's pick this 1.2.3.4 IP and put it somewhere in the pool of this servers, right? But then Kubernetes is about being highly available. So what happens if your one node goes down? Well, the IP address should float to some other node that works, right? But what does it mean from the network perspective that IP address floats? What's going to happen with the connections that you have to this IP address? We start having this kind, we start asking ourselves this kind of questions because we have now three servers in the control plane, QBAPI server runs in three of them, we kill one QBAPI server, unlucky us, it was the one that was holding the IP of the, you know, of how we access the cluster. What happens now? No access to the cluster. So either we wait for keep alive D to move this IP address, our tables to propagate and all this kind of stuff or, and this is what we decided, we put HAProxy in between the QBAPI server and keep alive D so that HAProxy, and this is something that, you know, people from Kubernetes want to kill me, HAProxy is much more stable than Kubernetes API. That's it. That's it. If you look at this, that Xeq, QBAPI server fails much, much more than HAProxy, so this is our way to keep this and as simple as it sounds, the problem that I want to solve is that when QBAPI server dies, I don't want the IP address to float because propagating ARP tables and expiring the caches takes too long and I just simply don't want to wait for that, so I put HAProxy there and, and yeah, the only thing to remember if you really take this path is that you need to fine tune the health checks because then the worst you can do is that if keep alive D starts to notice outage faster than HAProxy because HAProxy also balances the traffic, right? So then the order of actions is that you want QBAPI server to die, which shouldn't be happening, but it happens. HAProxy notices that and end of the story. That's, that's it, keep alive. This should never, should never notice this and of course we may go deeper and what happens if HAProxy dies? Well, this is now a game of statistics. Has it ever happened for us that QBAPI server and HAProxy died at the same time? Well, it never happened apart if you go to the server and just plug it out from the rack. So this is some corner case that we don't want to cover, but, but it doesn't really, really happen in the wild. Of course, there are some limitations because, you know, you can have IP address on the single node. This is disadvantage versus some, some appliance. The biggest problem here is that you need to have all this stuff in one single L2 segment. So in one broadcast domain, this is because keep alive D doesn't work across subnets. We have some ways to fix that by grouping nodes into different L2s and then having different keep alive Ds in those L2s. But still, this is, this is a pain point and this is something that you should really well design on the, on the paper if you, if you start doing this. But, you know, enough of load balancers because we could be talking ages about this. DNS, because we said that we want to, to do this DNS mambo jumbo and, you know, we don't want to use IP addresses only. So of course you are administrator, you manage the infra. You could say, but, you know, we have this DNS infrastructure there. It's maybe AWS, maybe Cloudflare, maybe, maybe something else. So we can just create records there. But, but then, you know, either you trust the user or you don't. And we don't. So another opinionated thing in our way of installing Kubernetes is that we spawn very minimal setup of core DNS, which will be providing the DNS resolution of what you want to all the nodes of the cluster and all the pods running in this cluster. So that when you start installation claiming that you will have API running on API.example.com, I don't worry if you already created this record on the external DNS. I will just spawn static pod running core DNS and I will create those records myself. So whatever I'm running in this cluster will have this. This again protects me because now what happens if we decouple this? You have your external, you know, DNS like most of the people. And how do you want your cluster to behave when this DNS infrastructure goes down? You have your data center, everything is okay. In some other data center, you have DNS and this DNS is out. Do you want now your cluster to be, you know, dying because pods want to talk to each other and they cannot resolve DNS? It should be all self-contained, right? You don't want to have those external dependencies. So yeah, this is something that we are doing. And the part I will skip is that network manager requires some tuning because for people knowing how containers are spawned, when you start a container, a copy of ETC resolve conf is taken at the moment of starting the container and is plugged into the container. Meaning that if you change configuration of your host regarding DNS, it will not be propagated to the container unless you restart the container. So yeah, for this reason we are also hacking this file around so that it would be really updating on the fly but I don't want to go into this. Something a bit more interesting because we are going now into Kubernetes APIs and how to extend this stuff is network configuration of the host. This is static configuration file for network manager and probably you've seen this and probably you've made some mistakes to this file not once. The problem I want to state here is that this is a static file. You go, you modify it, nothing happens. You may notice mistake in this file five years after because for five years you haven't rebooted your server and we don't want to have this scenario in Kubernetes world. When you define some configuration it should either apply immediately or fail immediately. So this kind of stuff that you need to do manual modifications of the file, it breaks this contract we have and another part is it simply doesn't scale. If you have 300 servers in your bare metal cluster, you are not doing those changes manually. Simply not. You have CRDs and this is what should be happening. This is some very, very simple example. I do some modification. I mistake slash for backslash. They detect that and that's easy but I'm configuring default gateway as an IP address from outside of my subnet and this is utterly wrong but nothing in network manager will prevent me from this configuration. I simply don't want to. We have this CRD defined that creates host configuration from the API and it may sound like chic and egg but it's all the matter of how we order the stuff. We define Kubernetes CRD that will be defining how you configure network manager on the host. You can do it per node, all this kind of stuff. I will just show you how that works very quickly. That's the one. I have this node which has this IP address on the last interface 10, 24402 and what I want to do now, I want this to be different. I want to change that. I want to change it from the Kubernetes in a declarative way so that whenever someone will be modifying this, the change will get reverted. I just created a YAML which will configure IP address on some interface. As simple as that and I will apply that with the hope that it works as expected. At the top we can see that this CRD is now progressing with the configuration progress. In fact, that was as simple as it is so we can see that this IP was removed. For a moment I was thinking who's going to ask but you already had IP in this subnet configured. What's going to happen? Well, that configuration wouldn't fly because you should not have two IPs from the same subnet on the same interface. This is a short demo of that. At the same time, it's Kubernetes API. It should protect us from doing stupid things. I will try to configure a stupid DNS server which has no way of existing because it's on the link local IPv6 subnet. If I try to apply that, something should protect me from doing this because that would actually break the configuration. Let's see our configuration right now. We have 111.1 as the DNS server and let's apply this manifest. Now that configures the wrong DNS. The change has been applied. It's wrong. At this moment your cluster starts to misbehave, your probes go down and so on. Let's give it around 10-15 seconds and this configuration should get reverted because there is a controller which in fact checks if your modifications to the network infrastructure on the host. After applying, do they make something not working as it should? In this scenario, we see that degraded failed to configure. It failed because this DNS server doesn't exist in reality. That was just a short demo of how we handle all that. It's a bunch of self-contained elements that once you start using them all together, you give you a very nice Kubernetes installer that does it all for you. Sometimes in an opinionated way, sometimes less. Now I told you that there will be some dirty tricks. In KubeNet, there is a concept of Node.IP and we are now moving to the Linux world. When you want your application in Linux world to run and interact with the network, it has to bind somewhere. This somewhere is IP address and the port. Let's forget the port. We are about IP address. If you have multiple network interfaces, where should Kubernetes listen? Everywhere on one IP address, on two IP addresses. If you have 10 interfaces, what do we do? I say that Kubernetes upstream doesn't solve it in a very smart way because it was designed to run on clouds with only one network interface. As we started expanding, it's not something that we still want. We developed some additional logic to check that and I will skip the details. In general, one more problem to think about. When you configure KubeNet manually, you need to think what the IP addresses should be there. This configuration is complicated because actually you can say bind everywhere or bind to one IP address or you can say bind to IPv4, like as a string IPv4 and what happens there? It's all you know. You get even stranger syntax IPv6 as a string, comma and then IPv4 address. All this kind of stuff you need to understand how it behaves and pick your choice. It's complex. You may get really confused once you start. We have some set of rules. I will skip them. You can go back to this. In general, some corner cases, I just showed you an example in which you shouldn't have multiple IP addresses in one subnet. What if you do? There are some people who do this for a reason and how do you want KubeNet to behave then? Also, one example that I have and this is just mind blowing. It killed me for like two weeks. Is your IPv6 address really an IPv6 address? Okay, this slide I skip. I got to this RF, sewage describes IPv4 compatible IPv6 addresses and I was like, what the heck is that? Let's go to all the libraries in all the known programming languages. Every of them has a function. Is an IP address IPv6 address? You go to implementation. How implementation looks like? If string contains, colon return true. Thank you very much, game over. It's as simple as that. Really, for the last 30 years of my life, I thought this is as simple as it is, but it's not. Let's take this. So, comma, comma, four times f, then comma, sorry, colon, and then we put IP address with dots. It is a correct address. There is RFC for this address. It may look stupid, but it's a well defined address and, you know, it breaks. Try opening a net cut socket to listen on this address. It will not work because half of the tools now think this is IPv6 address, half of the tools think this is IPv4 address. I did a stress on that and what I realized is that based on this address, it was trying to open a socket on a simple IPv4 address. At this moment, how should we treat that? This is the real case scenario. I got it from a customer who was trying to install Kubernetes and they wanted to use this subnet. I was like, what is that? Then we dig deeper and we realized that this is a monster. It should have never existed, but apparently it exists. If you find a set of parameters that you pass to net cut and it crashes, then something went wrong. So, in the end, yeah, choose wisely what you want to do and once you design your infrastructure really, you know, double check it with someone out there with upstream community. Is it really how you should be doing stuff? Because in a lot of cases, you realize that something misbehaves and, you know, and that was, yeah, one more thing. You think everything is okay, then you start to get and you tell you, oh, sorry, but, you know, in fact, with this cloud provider, you cannot use this syntax and then you realize, oh, I wanted to do all that, but I cannot because you tell me that I cannot. So, you know, and you realize it only at the end of the story once you spend two weeks on designing. So, that's it.
Instant Ramen: Quick and easy multi-cluster Kubernetes development on your laptop
Okay. All right. All right. Okay. We are ready to go. All right. Thanks, everybody. Thanks for sticking around till the end today. And a special shout out to those of you on the live stream as well. My name is Adam Litke and this is Near Software. And today we're going to get our money's worth out of this laptop. Something is not right here. I keep flipping on and off. Let's see. I'll do my best here. So we've come a long way since Linus introduced Linux to the world back in 1991. What started off on his personal computer is deployed pretty much everywhere these days in increasingly complex scenarios. Take Kubernetes, for example. Everyone's favorite clustered container orchestrator, which runs open source up and down the entire stack from the cluster nodes to the Kubelet and to the container runtime itself. And developers haven't stopped building or debugging screens. Kubevert is a Kubernetes add-on that allows you to seamlessly run your virtual machines in Kubernetes. And since the VMs are running in pods, like any other workload on Kubernetes, they integrate really well with whatever else is deployed there, be it your applications, storage, networking, monitoring, et cetera. And as people continue to deploy Kubernetes and Kubevert to host their mission-critical workloads, naturally they wonder what will happen when disaster strikes. Disaster recovery software exists to protect your data and return to normal operations as quickly as possible. And this is typically achieved using redundancy. So data can be replicated from primary environments to secondary environments, and applications, including virtual machines, are able to be started on the secondary environment at a moment's notice, should that be required. In this particular scenario, we have a primary data center DR1 in the west, a secondary data center DR2 in the east, and a hub cluster located somewhere in between. Now we prefer to run our applications on the primary environment because it's closer to where our customers are. But thanks to continuous data replication, we can rest easy knowing. We can start the application up on DR2 when required. So ramen DR is software that enables disaster recovery for multi-cluster Kubernetes environments. It does this by working with the storage to enable data replication according to a DR policy set by the administrator. And it talks with open cluster management to manage application placement, failover, and relocation flows. Today we're going to simulate this disaster for you. We're going to start by disabling the primary environment. We can then failover our virtual machine to the secondary environment. And I just want to note here that failover is different from live migration because live migration would require both environments to be up. In this specific scenario, obviously, we don't have access to DR1. So failover is going to take a couple of minutes, but we can be confident that the app can start back up on the secondary and environment with minimal data loss. So I've been kind of introducing a bunch of different components here that's quite the menu of open source ingredients. KubeVert is a operator managed Kubernetes deployment, which packages libvert and QMU into a container image, allowing you to run your virtual machines inside of a pod. It also comes with other utilities to help you manage your virtual machine storage and networking requirements. Rook is software that manages and integrates self-storage into the Kubernetes platform. Open cluster management stitches together multiple Kubernetes clusters and provides for application management, placement, scheduling. And then RoninDR is adding on those DR flows to open cluster management. So when we're considering a realistic multi-cluster DR environment, it's a beautiful thing, kind of like this bowl of ramen here to tempt you at dinner time. However, it's also complicated and expensive to operate, especially when we consider like the single developer use case. So the question we're trying to answer here is how can we enable development in this open source software stack without huge cloud budgets? And our answer is to scale down that environment so that it can run inside the kind of laptop that most of us are carrying around with us each day. And NIR has prepared a live demo right on this laptop that you're looking at that's going to show all this stuff working together, and we're going to simulate that disaster for you. So take it away. Yep. And I'm going to mute it so we don't annoy the live stream people. Okay. Put that in your pocket. Yep. Okay. So this is our stack, right? Three clusters. We have two identical clusters. Everything is ramen. And we are going to put it inside this laptop to see that we can do it because they are small and cheap. So what we want to do today is to stuff three data centers with ramen and kovir and stuff and a lot of other components and large part of Europe and stuff everything inside this laptop. Now note about this environment. The clusters are all in the same laptop, but they are remote clusters on different regions. And the cluster is standalone with its own storage. So how can we prepare this laptop for the demo? So I have a pack of instant ramen DR, which is very easy to use. You want one command. DRM start with the environment of your flavor. This in case is a kovir environment. And then you let it cook for 10 minutes until everything is ready. Sorry. So we are not going to wait 10 minutes now because it's a little thing. I prepared the environment before the talk and we'll just use it. So whatever we need, we need a Git repo because we're going to use GitOps. We will give Ocm some Git repo to pull the VM resources and deploy the application. So we use Adam's repo, Ocm kovir samples. And I talked it to customize the VM with SSH public key. So let's jump into the demo and increase the font size a bit. So I'm using a little tool to save my typing error for you and make everything more colorful. So first look what we have in this laptop. We have three clusters. DR1 is the primary cluster where we run the VM. DR2 is the secondary cluster. Something bad happens to DR1 and something bad will happen. Don't tell anyone. And Hub is orchestrating everything and controlling the other clusters. Now each of these are libvirt VMs inside the laptop. So let's look inside one of the clusters. We can use kubectl, just a normal kubectl clusters. And we see that we have a lot of stuff that DR1 installed for us. The most important parts for the demo are the kovir parts that you will run the VM, the CDI that will provision the VM disk from container image. Of course, it will be stored. So we have a complete RookSafe system inside that using the local disk of the cluster. And this will provide storage for the VM disk and volume replication between the clusters. And to protect the VM, we have the Raman DR cluster operator, which orchestrates the DR flows. And finally, we need the open cluster management components that lets Raman control the clusters. Because Raman extends open cluster management and depend on it. So let's look inside the Git repo. I'm running this inside the clone of the Git repo from other. The important parts in this repo for the demo are this VM, VM standalone pbc.js. This is VM optimized for the environment. The subscription, which is OCM resources for the VM. And DR are the Raman DR resources. So let's look first at the VM. We have a quick look to see what we have there. We will not look inside the YAMLs. You can check the Git repo later. We have a VM configuration. This VM is using a pbc because we are using a pbc-based VM. So we have this pbc here. And we need to provision a pbc somehow. So we have the source YAML, which tells CDI how to provision the pbc disk. So we can apply this customization to cluster DR1. And this will start the VM on cluster DR1, but we are not going to do it. Because nobody will protect this VM. It's just like a port that you start and it goes down and nobody protects it. So what we want to do is create OCM application. OCM application. We will use subscription-based application. These resources tells OCM how to protect the application, how to create it, like which cluster set to use, where Git repo is, the namespace that the VM is running, where to place the VM, and subscription ties everything together. So to start the VM, we apply this customization to the hub. Everything is done on the hub. And then OCM and Raman later will do the right thing. So at this point, OCM is starting the VM on cluster DR1 and we can watch it. Using kubctl to get the VM, VMI port and pbc. And we can see that the pbc is already bound. And Virt launcher is running and we have an IP address, so ROVM is running. But let's inspect the VM a little bit more to see where is our disk. So we can use ROVM Cess kubctl plugin to look at Cess layer. And we can run the RBDU command, in this case on cluster DR1. And we see that we have RBD image created for our VM. If we look inside the pbc, we will find this volume there. So if something bad happened to cluster DR1, we lose the running VM and the disk. And this image will be gone and we lost all the data. So how can we prevent it? We want to protect this VM is Raman. So to do this, we must tell OCM first that Raman is going to take over this VM and ACM should not change it. We do this by notating the placement with a special annotation and at this point Raman can take over. So how do we protect this with Raman? We need to apply the Raman resources. Basically it's one resource, a DRPC. The DRPC tells Raman how to find the application, how to protect it, which pbc should be protected and what is the policy. We are not going to look inside now, we can check the gtrepolator. So to protect the VM, we apply disk customization. Again on the hub, then Raman will do the right thing on the right cluster. So once we did it, our VM is protected Raman and you can watch it again. This time I'm watching also VRG and VR resources. VRG is the volume replication group. We have one such resource per each protected application and volume replication is the resource that entails the volume replication for each pbc. So we have one of them replication for every pbc. Now both of them are ready and primary, primary windows, this is the primary cluster, replicating data to the secondary cluster. So what does it mean that we replicate data to the other cluster? If you look again on the RBD images, if you remember we have seen that we have an RBD image on the primary cluster. If you run the same command on the secondary cluster, and this time I'm running the same command on context DR2. And we will find that we have an image on the primary cluster and we have the same image on the secondary cluster. So what's going on under the cover is that when Raman enables volume replication, a secondary replica of the image is created on the secondary cluster, and the RBD mirror demon is starting to replicate rights from this image on the primary cluster to this image on the secondary cluster. So if something bad happens to cluster DR1, we can use the secondary image to start the VM at the time of the last replication. So the last thing to show about the VM is that we have a logger inside updating the log file every 10 seconds. We can access the log file using the VIRT CTL SSH. We just run this command to see the start of the log and we see the line where the service was started, and then we see one line every 10 seconds. This will help us verify later when we recover from a disaster that we got the right data from the disk. So now we are ready for everything. Let's try to create a disaster. So one thing that is easy to do on the laptop is to suspend the cluster DR1. If you remember this is a Libre VM, so we can just suspend it. Now everything running there stopped. So let's try to access the VM again with VIRT CTL SSH. Let's try to tail the log and let's see if it works. Well, it does not seem to work because of course we suspended the VM so nothing there is accessible. If we had an important service on this VM, our users would be not happy now. So how can we fix this? Adam, do we have any idea? I was hoping you would tell us. Yes, so because our VM is replicated, we can just failover to the other cluster quickly. How would we failover? If you remember that we installed the DRPC, we can patch the DRPC. We set the action to failover and we set the failover cluster. And once we did it, Raman starts the failover. And you can start watching the failover on the other cluster. I'm running this again on the DR2 cluster because DR1 is not accessible. And we see that we have a PPC. It's impending. We have a volume replication group. We have a volume replication, but the volume replication book is not primary yet. It will take a while until the VM is stopped. So while you wait for it, let's understand what's going on under the cover. So the RBD image on the secondary cluster was replica, pulling data from the master for the main cluster. Raman has to stop this replication and promote it to a primary image that will replicate to another cluster. Once this is done, the VRG will be marked as primary and it should happen any second. And at this point Raman will change the application placement. It just became primary. So now Raman changed the placement of the application. And Ocm will see the change and will redeploy the VM on the second cluster using the subscription. And this should happen any second now. When Ocm deploys the application, it will reuse the PPC that Raman has restored and connected to the right RBD image. And it just happened. We see that the VRT Launcher is running. The VM is up. We have an IP address. So if we add this important service of the VM, this service should be absent. And it will be used as a VRT. But how do we know that we got the right data? Maybe we got just a new application with empty disk. Let's check the disk. Again, we can use the logger. We just dumped the entire log using SSH. This time I'm connecting to the cluster, the R2. And we see all the logs from the VM that run on cluster DR1 until we created the disaster. And we see the new logs when the VM started to run on cluster DR2. Note that we have a gap here between the last line logged when the VM was running on DR1 and the first log, which is about three minutes in this case. This gap depends on how fast we started the failover and the tech that there was an issue with the cluster. So we had a short downtime. The VM is running. We got all the data. Looks like a successful failover. So what's next? In a real cluster, you would try to fix the cluster, recover it. Maybe you need to reinstall it. At this point, you can relocate the VM or maybe later during some maintenance middle, you will relocate the VM back to the primary cluster. In this demo, we are done. And it's time for questions. The first three questions we'll get in Sotramen. Go ahead. The question is what about the IP address of the virtual machine? We're paying attention and noted that change. So what would you suggest? Sotramen does not do anything about the IP address. I think in a real application, you will have some load balancer to make sure that you can switch from one cluster to the other cluster. Probably using the DNS system because you have a nice name for the DNS. But basically, you will do what VIRT CTL is doing when you connect to the VM. You use the VM name, the name space, and you have to find the actual address. Yes. Very nice demo. How much you run at home? I don't have any cloud copies. I can re-image it to 4G. Is 60G enough? 16 will be too low. Yes. So the question is what do we need to run it at home? So first, you can try it at home, but you need big enough laptop. I think for COVID, you need like 32G with Sotramen. Yes. Because we have two clusters with 8GB, maybe you can trim them down a little bit, but 16 will be too low. And... Maybe you need a laptop. Yes. Maybe you need a laptop. If a door is wide, it's not so. If a door would be easier because this is what we use. But it should work on anything. We continue the question. Can I use two laptops, one for disaster recovery and one for the one that you made and the other for disaster recovery? Let's say old laptop. Repeat the question. I didn't answer the question exactly. Can you repeat it? Your presentation is from the same laptop. Can I use the solution for two laptops? So the question is, can we use different machines for this? You can, but it will be harder because you need to set up the network between them. In the same laptop, it's much easier because MiniCube handed most of the stuff for you. If you use different machines, you need to make sure that the clusters are accessible to... So it will be harder. I've got one over here. Yes. Is it required to use SIF or you can use an OSR by system? Repeat the question. Do we have to use SIF? Currently, we work with SIF, so it's optimized for SIF and it works. And we have a complete tool that you set and configure it. If you want to use something else, we have support for any storage, basically, but it doesn't work on Kubernetes yet. It's very commonly on the shift. It needs more work. Yes. The primary site is down. Is there any mechanism for the extra machine for starting by mistake, by itself? The question was, once the cluster is down, do we have any protection that the virtual machine will not start again on the same cluster? So we don't have any protection at the ramen level because SIF is protecting us. If the same VM starts again, let's say we resume the cluster, the application is still running then it will continue to run and try to write to the disk, which is fine because the disk is not replicated at this point. Because the application is done on the destination cluster. It's pulling data from the primary. Usually it will just fail or go into some error state and in real application, when ramen detects that it should not run, it will shut it down. So it's safe. There is one more question. Yes. Just because it's the end of the day. Just one more question. What happens when the hub that was controlling the two data centers goes down? The question was what happens when the hub goes down? Very good question. In a real setup, in OpenShift you have a hub recovery setup, so actually you need two hubs. One passive hub and one active hub. And there is a lot of setup that backup the hub and restore it. But for testing it doesn't matter. And also hopefully you're not running customer visible or end user visible workloads on the hub. So if it goes down you can repair it and people won't be quite as urgent of a disaster. So hopefully the other sites don't fail at the same time. Alright, thanks everybody for coming. What a good question.